I want to break some laws too
I made an automated pipeline to clean data. The idea started with a paper called Minipile, which led me down a rabbit hole. If you're careful about the data you use for training, you can break the scaling laws. Who knew being a data snob could be so rewarding?
Model Performance vs Labels Seen (axis is in log scale)

You can even see something of a trend line in this second image if you squint hard enough. The good thing is that the authors of this paper proposed a method for selecting the hardest or the easiest examples without any human supervision.

I was using conda because I wanted to replicate this as closely as possible, but I think a better idea for next time would be to just create an empty environment and install everything directly.
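To make the "hardest or easiest examples without supervision" idea a bit more concrete, here is a rough sketch. The post doesn't spell out the method, so this assumes one common unsupervised approach: cluster the example embeddings with k-means and treat the distance to the nearest centroid as a difficulty score (close to a centroid means prototypical or "easy", far away means "hard"). The helper names `kmeans`, `difficulty_scores` and `select`, plus the toy 2-D "embeddings", are made up for illustration and are not from the paper or from my pipeline.

```python
import math
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on lists of floats; returns the centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids


def difficulty_scores(points, centroids):
    """Distance to the nearest centroid (assumed proxy for difficulty)."""
    return [min(math.dist(p, c) for c in centroids) for p in points]


def select(points, k, keep_fraction, hardest=True, seed=0):
    """Return indices of the kept examples, ranked by the difficulty score."""
    centroids = kmeans(points, k, seed=seed)
    scores = difficulty_scores(points, centroids)
    order = sorted(range(len(points)), key=lambda i: scores[i], reverse=hardest)
    n_keep = max(1, int(len(points) * keep_fraction))
    return order[:n_keep]


# Toy "embeddings": two tight blobs plus one point sitting between them.
toy = [[0, 0], [0.1, 0], [0, 0.1], [10, 10], [10.1, 10], [10, 10.1], [5, 5]]

print("hardest:", select(toy, k=2, keep_fraction=0.15, hardest=True))
print("easiest:", select(toy, k=2, keep_fraction=0.5, hardest=False))
```

In a real pipeline the points would be embeddings from a pretrained model rather than hand-written pairs, and you'd probably want a proper k-means implementation instead of this toy one. Whether you keep the hard or the easy end is a choice the sketch leaves open through the `hardest` flag.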