Show HN: SemHash – Fast Semantic Text Deduplication for Cleaner Datasets
Fast Semantic Text Deduplication. Repository: MinishLab/semhash on GitHub.
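Additionally, it includes functions to inspect deduplication results, making it easier to understand and refine your data cleaning process.

Scalable: SemHash can deduplicate large datasets with millions of records thanks to the approximate nearest neighbor (ANN) backends in Vicinity.

| Dataset | Original Train Size | Deduplicated Train Size | % Removed | Deduplication Time (s) |
| --- | --- | --- | --- | --- |
| bbc | 1225 | 1144 | 6.61 | 0.57 |
| senteval_cr | 3012 | 2990 | 0.73 | 0.14 |
| tweet_sentiment_extraction | 27481 | 26695 | 2.86 | 1.77 |
| emotion | 16000 | 15695 | 1.91 | 0.77 |
| amazon_counterfactual | 5000 | 4992 | 0.16 | 0.33 |
| ag_news | 120000 | 106921 | 10.90 | 5.20 |
| enron_spam | 31716 | 20540 | 35.24 | 2.03 |
| subj | 8000 | 7990 | 0.12 | 0.63 |
| sst5 | 8544 | 8526 | 0.21 | 0.58 |
| 20_newgroups | 11314 | 10684 | 5.57 | 0.73 |
| hatespeech_offensive | 22783 | 22090 | 3.04 | 0.92 |
| ade | 17637 | 15718 | 10.88 | 0.73 |
| imdb | 25000 | 24830 | 0.68 | 1.76 |
| massive_scenario | 11514 | 9366 | 18.66 | 0.47 |
| student | 117519 | 63856 | 45.66 | 8.80 |
| squad_v2 | 130319 | 109698 | 15.82 | 8.81 |
| wikitext | 1801350 | 884645 | 50.89 | 83.53 |

| Dataset | Train Size | Test Size | Deduplicated Test Size | % Removed | Deduplication Time (s) |
| --- | --- | --- | --- | --- | --- |
| bbc | 1225 | 1000 | 870 | 13.00 | 0.71 |
| senteval_cr | 3012 | 753 | 750 | 0.40 | 0.13 |
| tweet_sentiment_extraction | 27481 | 3534 | 3412 | 3.45 | 1.53 |
| emotion | 16000 | 2000 | 1926 | 3.70 | 0.65 |
| amazon_counterfactual | 5000 | 5000 | 4990 | 0.20 | 0.51 |
| ag_news | 120000 | 7600 | 6198 | 18.45 | 3.74 |
| enron_spam | 31716 | 2000 | 1060 | 47.00 | 1.94 |
| subj | 8000 | 2000 | 1999 | 0.05 | 0.62 |
| sst5 | 8544 | 2210 | 2205 | 0.23 | 0.59 |
| 20_newgroups | 11314 | 7532 | 7098 | 5.76 | 2.25 |
| hatespeech_offensive | 22783 | 2000 | 1925 | 3.75 | 0.77 |
| ade | 17637 | 5879 | 4952 | 15.77 | 0.81 |
| imdb | 25000 | 25000 | 24795 | 0.82 | 2.81 |
| massive_scenario | 11514 | 2974 | 2190 | 26.36 | 0.46 |
| student | 117519 | 5000 | 2393 | 52.14 | 3.78 |
| squad_v2 | 130319 | 11873 | 11863 | 0.08 | 7.13 |
| wikitext | 1801350 | 4358 | 2139 | 50.92 | 40.32 |

As the tables show, SemHash is fast and scales to datasets with millions of records: the largest dataset, wikitext, with 1,801,350 training records, is deduplicated in 83.53 seconds, and most of the smaller datasets take under a second.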
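To make the idea behind the benchmark columns concrete, here is a minimal, self-contained sketch of threshold-based near-duplicate removal. It is not SemHash's implementation or API: the function names (`embed`, `cosine`, `deduplicate`, `percent_removed`) and the 0.9 threshold are hypothetical, the character-trigram counts are only a toy stand-in for a real embedding model, and the search is brute force rather than an ANN backend such as those in Vicinity.

```python
import math
from collections import Counter


def embed(text: str, n: int = 3) -> Counter:
    """Toy stand-in for an embedding model: counts of character n-grams."""
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def deduplicate(records: list[str], threshold: float = 0.9):
    """Greedy dedup: keep a record unless it is too similar to a kept one.

    Returns the kept records and a mapping from each removed record to the
    kept record it matched, so the result can be inspected afterwards.
    Brute force, O(n^2); an ANN index would replace the inner search.
    """
    kept: list[str] = []
    kept_vectors: list[Counter] = []
    duplicates: dict[str, str] = {}
    for record in records:
        vector = embed(record)
        match = next(
            (k for k, kv in zip(kept, kept_vectors) if cosine(vector, kv) >= threshold),
            None,
        )
        if match is None:
            kept.append(record)
            kept_vectors.append(vector)
        else:
            duplicates[record] = match
    return kept, duplicates


def percent_removed(original_size: int, deduplicated_size: int) -> float:
    """The '% Removed' column: share of records dropped, in percent."""
    return round(100 * (original_size - deduplicated_size) / original_size, 2)
```

Calling `deduplicate` on a small list of strings returns the surviving records together with a removed-to-kept mapping, which is the kind of information needed to inspect and tune a threshold. `percent_removed(1225, 1144)` reproduces the 6.61 reported for the bbc training set.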