AGI is not multimodal
"In projecting language back as the model for thought, we lose sight of the tacit embodied understanding that undergirds our intelligence." –Terry Winograd The recent successes of generative AI models have convinced some that AGI is imminent. While these models appear to capture the essence of human intelligence, they defy
Proponents of this theory cite the prowess of SOTA LLMs on various benchmarks, the convergence of large models toward similar internal representations, and their favorite rendition of the idea that "language mirrors the structure of reality," a notion espoused at various points by Plato, Wittgenstein, Foucault, and Eco. The intuition has some footing in humans: someone who appears well-read and graceful in navigating social interactions is likely to also score high in traits like sustained attention and theory of mind, which lie closer to measures of raw cognitive ability.

In the early days of multimodal modeling, integration across modalities was achieved by nudging the embeddings of, e.g., (language, vision, action) tuples to converge to similar latent vectors of meaning, a vast oversimplification of the kinds of relationships that may exist between modalities.
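To make that last point concrete, here is a minimal sketch (not code from any particular system) of this style of alignment: each modality is projected into a shared latent space and paired embeddings are pulled together with a contrastive objective. The encoder stand-ins (`ModalityProjector`), the feature dimensions, and the InfoNCE-style loss are illustrative assumptions, chosen only to show the basic mechanism.

```python
# Hypothetical sketch of early-style multimodal alignment: paired
# (language, vision, action) features are projected into one latent space
# and nudged toward the same "meaning" vector with a contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityProjector(nn.Module):
    """Projects one modality's features into the shared latent space."""
    def __init__(self, in_dim: int, latent_dim: int = 256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, latent_dim), nn.ReLU(),
            nn.Linear(latent_dim, latent_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Unit-normalize so similarity is a simple dot product.
        return F.normalize(self.proj(x), dim=-1)

def contrastive_loss(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss: matched rows (a_i, b_i) are positives, all other pairs negatives."""
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Toy batch of pre-extracted features for aligned (language, vision, action) tuples.
batch, lang_dim, vis_dim, act_dim = 32, 768, 512, 64
lang_feat = torch.randn(batch, lang_dim)
vis_feat = torch.randn(batch, vis_dim)
act_feat = torch.randn(batch, act_dim)

lang_proj, vis_proj, act_proj = ModalityProjector(lang_dim), ModalityProjector(vis_dim), ModalityProjector(act_dim)
z_l, z_v, z_a = lang_proj(lang_feat), vis_proj(vis_feat), act_proj(act_feat)

# Pull every pair of modalities toward a common latent vector of meaning.
loss = (contrastive_loss(z_l, z_v) + contrastive_loss(z_l, z_a) + contrastive_loss(z_v, z_a)) / 3
loss.backward()
print(f"alignment loss: {loss.item():.3f}")
```

The sketch illustrates the limitation noted above: an objective like this only encourages the modalities to co-locate in a single vector space, and says nothing about the richer, structured relationships that may hold between them.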