All-in-one embedding model for interleaved text, images, and screenshots


TL;DR — We are excited to announce voyage-multimodal-3, a new state-of-the-art for multimodal embeddings and a big step forward towards seamless RAG and semantic search for documents rich with both…

Unlike existing multimodal embedding models, voyage-multimodal-3 can vectorize interleaved text and images and capture key visual features from screenshots of PDFs, slides, tables, figures, and more, thereby eliminating the need for complex document parsing. With voyage-multimodal-3, there is no longer a need for screen parsing models, layout analysis, or any other complex text extraction pipelines. You can easily vectorize a knowledge base containing both pure-text documents and unstructured data (such as PDFs, slides, and webpages): screenshots are all you need.

Task: Table/figure retrieval
Description: Table/figure retrieval measures the strength of a model’s ability to match an image containing a table or figure (charts, graphs, etc.) with descriptions, captions, or other textual queries which reference the figure.
Datasets: charxiv, mmtab-test, ChartQA, Chartve, FintabnetQA, PlotQA

Task: Document screenshot retrieval
Description: In this category, models are used to match queries with scans or screenshots of documents containing both text and charts.
Datasets: Energy, Healthcare Industry, Artificial Intelligence, Government Report, InfoVQA, DocVQA, ArxivQA, TabFQuad, TAT-DQA, Shift Project

Task: Text-to-photo retrieval
Description: This is the typical text-to-image matching used by CLIP and other CLIP-like models, where queries are associated with the most semantically relevant photos.
Datasets: meme-cap, mm-imdb, winoground, docci

Task: Standard text retrieval
Description: Standard text retrieval retrieves relevant documents by matching query strings with document strings.
Datasets: LeCaRDv2, LegalQuAD, legal_summarization, AILA_casedocs, AILA_statutes, rag-benchmark-finance-apple-10K-2022, financebench, TAT-QA, finance-alpaca-csv, fiqa-personal-finance-dataset, finance-financialmodelingprep-stock-news-sentiments-rss-feed, ConvFinQA, finqa, hc3_finance, dialogsum, QAConv, HQA-data, LeetCodeCpp-new, LeetCodeJava-new, LeetCodePython-new, humaneval, mbpp, ds1000-referenceonly, ds1000, apps_5doc, Huffpostsports, Huffpostscience, Doordash, Healthforcalifornia, Cohere, 5GEdge, OneSignal, Langchain, PyTorch1024

Note that the standard text retrieval task encompasses all datasets used to evaluate voyage-3 and voyage-3-lite except long context and multilingual datasets.
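To make the "one vector space for text documents and screenshots" idea concrete, here is a minimal sketch of the retrieval step that would follow embedding. It does not call voyage-multimodal-3 or any real API: the three-dimensional vectors, the file names, and the helper functions `cosine` and `rank` are all made up for illustration, standing in for the embeddings a multimodal model would return for a text file and two screenshots.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank(query_vec, index):
    """Return document ids ordered from most to least similar to the query."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in index.items()]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]

# Hypothetical, hand-written embeddings. In practice each vector would come
# from an embedding model, whether the source is plain text or a screenshot.
index = {
    "notes.txt": [0.9, 0.1, 0.0],
    "slide_07.png": [0.1, 0.9, 0.2],
    "report_p3.png": [0.2, 0.3, 0.9],
}

# Hypothetical embedding of a text query.
query_vec = [0.0, 1.0, 0.1]

ranking = rank(query_vec, index)
print(ranking)
```

Because text files and screenshots share one index, the ranking step needs no special handling per file type; only the embedding call would differ from this toy setup.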
