Finding Dead Websites
As some of the work planned for Marginalia Search this year has been progressing a bit faster than anticipated, there was time to implement an unplanned change. This post details the implementation of a system for detecting when servers are online, to avoid serving dead links and improve data quality, and for detecting when websites have significant changes including ownership transfers and parking.

Table Of Contents
Feature Rationale
Data Representation
Live Data
Event Data
Change Detection Details
Availability Detection
Ownership Changes
DNS
Implementation Hurdles
Scheduling
Certificate Validation
Conclusions

Feature Rationale
Availability detection is useful not just for filtering out dead links in the search results, but for informing the crawler that it should stop trying to reach a dead domain, as well as a host of other things.
This trade-off allows the retention of full historical snapshots with smaller on-disk storage, while making schema changes somewhat more tolerable, as the events tables are likely to grow very large over time.

Building a model for accurately detecting ownership changes will be more work, and may require additional factors to be really useful, but finding this early success in identifying parked domains is very encouraging!

Ultimately the certificate validation ambition was backpedaled even further: the system now only properly checks the host name against the SAN and the expiry against the wall clock, with an indicative “could we verify the chain” flag that is not regarded as an error when it is set to false.
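The reduced certificate check described above can be illustrated with a minimal sketch. This is not the Marginalia implementation; it assumes the certificate has already been parsed into a dict shaped like the one returned by Python's `ssl.SSLSocket.getpeercert()`, and the names `CertCheck`, `san_matches` and `check_cert` are hypothetical. The chain verification result is passed in from elsewhere and is carried along purely as information.

```python
import ssl
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class CertCheck:
    host_matches_san: bool
    not_expired: bool
    chain_verified: bool  # indicative only, never counted as an error

    @property
    def ok(self) -> bool:
        # Only the SAN match and the expiry decide the outcome.
        return self.host_matches_san and self.not_expired


def san_matches(host: str, pattern: str) -> bool:
    """Compare a host name with one DNS SAN entry (hypothetical helper)."""
    host = host.lower().rstrip(".")
    pattern = pattern.lower().rstrip(".")
    if pattern.startswith("*."):
        # A wildcard stands in for exactly one leftmost label.
        head, sep, tail = host.partition(".")
        return bool(head) and sep == "." and tail == pattern[2:]
    return host == pattern


def check_cert(cert: dict, host: str, chain_verified: bool,
               now: Optional[float] = None) -> CertCheck:
    """Check host name against SAN and expiry against the wall clock."""
    now = time.time() if now is None else now
    sans = [value for kind, value in cert.get("subjectAltName", ())
            if kind == "DNS"]
    # Converts e.g. "Jan  1 00:00:00 2030 GMT" to seconds since the epoch.
    not_after = ssl.cert_time_to_seconds(cert["notAfter"])
    return CertCheck(
        host_matches_san=any(san_matches(host, p) for p in sans),
        not_expired=now <= not_after,
        chain_verified=chain_verified,
    )
```

With this shape, a certificate whose chain could not be verified still passes as long as the host name and expiry checks succeed, which mirrors the "indicative flag" behaviour described in the text.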