A ‘Shocking’ Amount of the Web Is Already AI-Translated Trash, Scientists Determine
Researchers warn that most of the text we view online has been poorly translated into one or more languages—usually by a machine.
“We actually got interested in this topic because several colleagues who work in MT and are native speakers of low resource languages noted that much of the internet in their native language appeared to be MT generated,” Mehak Dhaliwal, a former applied science intern at AWS and current PhD student at the University of California, Santa Barbara, told Motherboard. “The vast majority came from articles that we characterized as low quality, requiring little or no expertise or advance effort to create, on topics like being taken more seriously at work, being careful about your choices, six tips for new boat owners, deciding to be happy, etc.”
The researchers argued that the selection bias toward short sentences from low-quality articles was due to “low quality content (likely produced to generate ad revenue) being translated via MT en masse into many lower resource languages (again likely for the purpose of generating ad revenue). Our findings raise numerous concerns for multilingual model builders: Fluency (especially across sentences) and accuracy are lower for MT data, which could produce less fluent models with more hallucinations, and the selection bias indicates the data may be of lower quality, even before considering MT errors.”