Beyond benchmarks: How DeepSeek-R1 and o1 perform on real-world tasks


o1 is slightly better at reasoning, but DeepSeek-R1 provides much more detail about its reasoning process, which is very useful to the user.

We considered a scenario in which the user invested $140 in the Magnificent Seven (Alphabet, Amazon, Apple, Meta, Microsoft, Nvidia, Tesla) on the first day of every month from January to December 2024. The model was also confounded by a row in the Nvidia chart marking the company’s 10:1 stock split on June 10, 2024, and ended up miscalculating the final value of the portfolio.

Another experiment required the model to compare the stats of four leading NBA centers and determine which one showed the greatest improvement in field goal percentage (FG%) from the 2022/2023 season to the 2023/2024 season.
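The two tasks described above come down to simple arithmetic, and a rough sketch makes the stock-split pitfall easier to see. The Python below is only an illustration under stated assumptions: the function names `final_value` and `fg_improvement` are hypothetical, all prices and shooting percentages are made-up placeholders rather than real data, and the $20-per-stock amount assumes the $140 is split equally across the seven companies, which the text does not state. Only the 10:1 ratio and the June 10, 2024 date come from the text.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Split:
    effective: date  # first day the stock trades at the post-split price
    ratio: float     # 10.0 for a 10:1 split


def final_value(buys, splits, final_price):
    """Value a position built from periodic purchases.

    buys: list of (purchase_date, dollars_invested, unadjusted_price_that_day)
    splits: list of Split events for the same ticker
    final_price: price per share on the valuation date (after all splits)
    """
    shares = 0.0
    for day, dollars, price in buys:
        bought = dollars / price
        # Shares bought before a split are multiplied by its ratio.
        # Skipping this step is the kind of error the split row can cause.
        for s in splits:
            if day < s.effective:
                bought *= s.ratio
        shares += bought
    return shares * final_price


def fg_improvement(stats):
    """Return (player, gain) for the largest FG% gain between two seasons.

    stats: {player: (fg_pct_previous, fg_pct_current)} in percentage points
    """
    best = max(stats, key=lambda p: stats[p][1] - stats[p][0])
    return best, stats[best][1] - stats[best][0]


# Hypothetical usage: placeholder numbers, not real quotes or player stats.
nvda_split = [Split(effective=date(2024, 6, 10), ratio=10.0)]
nvda_buys = [
    (date(2024, 5, 1), 20.0, 1000.0),  # before the split
    (date(2024, 7, 1), 20.0, 100.0),   # after the split
]
position = final_value(nvda_buys, nvda_split, final_price=150.0)

centers = {"Center A": (50.0, 55.0), "Center B": (60.0, 61.0)}
leader, gain = fg_improvement(centers)
```

Working from unadjusted purchase prices and applying the split ratio explicitly keeps the adjustment visible. A chart that already reports split-adjusted prices would need the opposite treatment, with no ratio applied at all.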

Read this on VentureBeat

