How do we evaluate vector-based code retrieval?
Read time: 10 minutes

Nearly all modern coding assistants and agents leverage some form of code retrieval — the task of retrieving relevant code snippets, docstrings, or documentation from a codebase.
In this post, we will discuss the most common subtasks in code retrieval, survey the existing public datasets, and explore strategies for creating new evaluation benchmarks. Existing datasets have known quality problems: as noted by Gong et al., an estimated 51% of labels in CoSQA are incorrect, meaning the supposedly ground-truth query and code-snippet pairs are mismatched for 51% of the dataset; this percentage is likely higher once subtle inconsistencies are counted. To further support the community, portions of these datasets will be publicly released as new benchmarks that address common challenges such as noisy labels and the lack of reasoning-intensive tasks.
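To make the evaluation setup concrete, here is a minimal sketch of scoring a vector retriever with recall@k and mean reciprocal rank (MRR) over query–snippet pairs. The embeddings below are random stand-ins for a real code-embedding model, and the matched-index gold labels are a toy assumption; a real benchmark would supply both.

```python
import numpy as np

def recall_at_k(sims: np.ndarray, gold: list[int], k: int) -> float:
    # sims: (n_queries, n_docs) similarity matrix;
    # gold[i] is the index of the ground-truth snippet for query i.
    topk = np.argsort(-sims, axis=1)[:, :k]
    return float(np.mean([gold[i] in topk[i] for i in range(len(gold))]))

def mrr(sims: np.ndarray, gold: list[int]) -> float:
    # Mean reciprocal rank of the ground-truth snippet per query.
    ranks = np.argsort(-sims, axis=1)
    rr = [1.0 / (int(np.where(ranks[i] == gold[i])[0][0]) + 1)
          for i in range(len(gold))]
    return float(np.mean(rr))

# Stand-in embeddings: each "code snippet" is a noisy copy of its
# query's vector, so gold[i] == i by construction.
rng = np.random.default_rng(0)
queries = rng.normal(size=(5, 64))
docs = queries + 0.1 * rng.normal(size=(5, 64))
queries /= np.linalg.norm(queries, axis=1, keepdims=True)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
sims = queries @ docs.T          # cosine similarity on unit vectors
gold = list(range(5))
print(recall_at_k(sims, gold, 1), mrr(sims, gold))
```

Note that metrics like these silently assume the labels are correct: with a dataset like CoSQA, a retriever can be penalized for ranking a genuinely relevant snippet above a mislabeled "ground-truth" one, which is exactly why label quality matters for benchmark construction.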