Show HN: Exploring HN by mapping and analyzing 40M posts and comments for fun
The above is a map of all Hacker News posts since its founding, laid out semantically, i.e. where there should be some relationship between positions and distances.
A quick primer on embeddings: they are a powerful and cool way to represent something (in this case, text) as a point in a high-dimensional space, which in practical terms just means an array of floats, one per dimension, each giving the point's coordinate along that dimension.

A lot of content, even on Hacker News, suffers from the well-known problem of link rot: around 200K links resulted in a 404, DNS lookup failure, or connection timeout, which is a sizable "hole" in the dataset that would be nice to mend. So, as a final push for a more "complete" dataset, I used the Wayback API to fetch the last few thousand articles, some dating back years, which was very annoying because IA has very, very low rate limits (around 5 per minute).
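To make the rate-limit problem concrete, here is a minimal sketch of a throttled lookup loop. It is not the code used for this project: it assumes the Wayback Machine availability endpoint (`https://archive.org/wayback/available`) is the API in question, takes the "around 5 per minute" figure at face value, and the names `wayback_lookup` and `fetch_throttled` are made up for illustration.

```python
import json
import time
import urllib.parse
import urllib.request

# Assumption: the Wayback "availability" endpoint; the post only says "Wayback API".
WAYBACK_AVAILABLE = "https://archive.org/wayback/available"


def wayback_lookup(url, timeout=30):
    """Return the closest archived snapshot URL for `url`, or None if there is none."""
    query = urllib.parse.urlencode({"url": url})
    with urllib.request.urlopen(f"{WAYBACK_AVAILABLE}?{query}", timeout=timeout) as resp:
        data = json.load(resp)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


def fetch_throttled(urls, lookup=wayback_lookup, per_minute=5,
                    clock=time.monotonic, sleep=time.sleep):
    """Call `lookup` on each URL, starting at most `per_minute` requests per minute."""
    interval = 60.0 / per_minute  # seconds between request starts
    results = {}
    next_allowed = clock()
    for url in urls:
        wait = next_allowed - clock()
        if wait > 0:
            sleep(wait)
        started = clock()
        try:
            results[url] = lookup(url)
        except (OSError, ValueError):
            # Network errors, timeouts, or malformed JSON: record a miss and move on.
            results[url] = None
        next_allowed = started + interval
    return results
```

At 5 requests per minute, a few thousand dead links take many hours to look up, which is why this step is only practical for the tail end of the dataset. The `clock` and `sleep` parameters exist only so the pacing can be tested without actually waiting.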