DeepSeek R1 Crushes Advent of Code 2024: Our Latest Code Benchmark

Co-authors: Wei Wen Isaac Tan

Introduction

Large Language Models (LLMs) for code generation have taken the software development world by storm, offering automated solutions to complex coding challenges. In this post, we present a head-to-head comparison of popular LLMs using the Advent of Code 2024 dataset. Our goal? To see how well these models can handle real-world puzzle prompts, generate correct Python code, and ultimately shed light on which LLM truly excels at reasoning and problem-solving.

Why Advent of Code 2024?

Advent of Code puzzles are both fun and challenging, covering a wide variety of computational and logical tasks. This makes them an ideal benchmark for assessing an LLM’s ability to:

  • Interpret detailed instructions and complex storylines.
  • Apply algorithmic thinking to produce correct outputs.
  • Generalize solutions across varying puzzle structures.

By using the 2024 iteration, we ensure the dataset is fresh, diverse, and relevant to the latest LLM capabilities.

Dataset & Prompt

Dataset Overview

Our team curated the Advent of Code 2024 dataset from multiple sources. Each puzzle has at least 5 solutions associated with it, adding up to 245 data rows in total.

  • Public Availability: The dataset is publicly accessible on Hugging Face.
  • Cleanup & Validation: We verified correctness by cleaning up the dataset and validating each solution against official or community-endorsed answers. You can read more about our process here.

You can also access the dataset on Hugging Face using the following Python code:

from huggingface_hub import hf_hub_download
import pandas as pd

REPO_ID = "Supa-AI/advent_of_code"
FILENAME = "aoc.csv"

# Download the CSV from the Hugging Face Hub and load it into a dataframe
dataset = pd.read_csv(
    hf_hub_download(repo_id=REPO_ID, filename=FILENAME, repo_type="dataset")
)
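
A quick inspection of the loaded dataframe confirms the row count and shows which fields are available (we print the columns rather than assume specific names, so treat this as a sanity check rather than a schema reference):

# Sanity-check the loaded dataframe: 245 rows, one per puzzle solution
print(dataset.shape)
print(dataset.columns.tolist())
print(dataset.head())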


Prompt Design

To keep the comparison fair, we used a simple yet consistent prompt format. Essentially, we instruct the model to:

“Write a python program that reads a file named input.txt to solve the following coding challenge, focus on {part}, no explanation, pure code:\n{prompt}”
  • Focus on {part}: Each Advent of Code puzzle typically has two parts; we direct the LLM to solve either Part 1 or Part 2.
  • No Explanation: We request code-only output to keep model responses uniform and reduce extraneous text.
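
To make this concrete, here is a minimal sketch of how such a prompt can be assembled for each puzzle (the build_prompt helper and variable names are illustrative, not the exact code from our evaluation scripts):

# Illustrative helper -- not the exact code from our evaluation scripts
def build_prompt(part: str, puzzle_text: str) -> str:
    return (
        "Write a python program that reads a file named input.txt "
        "to solve the following coding challenge, "
        f"focus on {part}, no explanation, pure code:\n{puzzle_text}"
    )

# Example: request a solution to Part 1 of a puzzle
prompt = build_prompt("part 1", "<full puzzle description from the dataset>")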

Experimental Setup

We tested several well-known LLMs, including:

• DeepSeek V3

• Llama 3.3 70B

• GPT 4o-mini

• Qwen2.5-Coder

• DeepSeek R1 (the latest release)

P.S. We are still looking for sponsors for OpenAI’s o1.

We accessed these models through various API providers (Hyperbolic, OpenAI, DeepSeek). The source code of our evaluation scripts can be found in our AI-toolkit on GitHub.
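
For reference, a single generation call in such a setup looks roughly like the sketch below, using an OpenAI-compatible chat endpoint (which these providers offer). The base URL, model identifier, and environment variable are placeholders; the actual logic lives in our AI-toolkit repository:

# Minimal sketch of one generation call (provider details are placeholders)
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example-provider.com/v1",  # provider's OpenAI-compatible endpoint
    api_key=os.environ["PROVIDER_API_KEY"],          # hypothetical environment variable
)

response = client.chat.completions.create(
    model="deepseek-reasoner",  # illustrative model identifier
    messages=[{"role": "user", "content": prompt}],  # prompt built as in the Prompt Design sketch
)
generated_code = response.choices[0].message.content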

Evaluation Design

The pass@k Metric

To evaluate functional correctness, we used the pass@k metric. This approach checks if at least one of the generated solutions (out of k attempts) produces the correct result. Here’s why pass@k is valuable:

  • Industry-Standard: It’s widely used in code generation leaderboards like BigCode, EvalPlus, and LiveCodeBench.
  • Fairness: Different models might produce multiple valid solutions. pass@k looks at the best attempt rather than just the first.

For our project, we set k = 1, a setting consistent with typical code generation benchmarks.
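
For intuition, the standard unbiased pass@k estimator (from the HumanEval/Codex evaluation) reduces to the plain fraction of solved puzzles when k = 1. Here is a minimal sketch with illustrative numbers:

# Unbiased pass@k estimator; with k = 1 it reduces to the
# fraction of puzzles whose single attempt produces the correct answer.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = samples generated per puzzle, c = correct samples, k = attempt budget."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative: 25 puzzles, one attempt each, 20 correct -> pass@1 = 0.8
results = [1] * 20 + [0] * 5
print(sum(pass_at_k(1, c, 1) for c in results) / len(results))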

Results

Model                         Part 1 (%)    Part 2 (%)
Qwen2.5-Coder-32B-Instruct    44            8.33
DeepSeek-V3 (fp8)             52            25
Llama-3.3-70B-Instruct        32            20.83
GPT 4o-mini                   48            25
o1-mini                       69            54
DeepSeek-R1                   80            62.5

Read the full evaluation results here

Discussion / Key Findings

  1. DeepSeek R1 Dominates
  • With a Part 1 accuracy of 80% and a Part 2 accuracy of 62.5%, DeepSeek R1 outperforms all other models by a significant margin.
  • This superior performance suggests stronger reasoning capabilities and a more nuanced understanding of the puzzle context and storytelling style in Advent of Code.
  2. Part 1 vs. Part 2 Difficulty
  • All models, including DeepSeek R1, show a performance drop on Part 2.
  • Part 2 typically involves building upon the logic from Part 1 but with added complexity, revealing that more advanced context handling is still an open challenge for many LLMs.

Limitations & Future Work

  • Limited Model Coverage: While we compared a handful of popular models, there are many others (e.g., Codestral, OpenAI-o1, CodeLlama) that remain untested.
  • Single-Task Focus: Advent of Code offers a rich set of puzzles, but real-world development involves larger codebases, debugging cycles, and integration testing.

Next Steps:

  • Broader Comparisons: We plan to include additional code-generation LLMs to broaden our benchmark.
  • Detailed Error Analysis: We aim to investigate where models fail in Part 2 to glean insights into improving context understanding.
  • Multi-Language Expansion: We are currently processing Advent of Code 2024 solutions in other programming languages like JavaScript, TypeScript, Rust, and more. Stay tuned by following our HuggingFace page for updates.

Conclusion

Our experiment underscores the value of thorough, puzzle-based benchmarking for LLMs in code generation. The latest DeepSeek R1 sets a new high watermark, demonstrating exceptional reasoning and puzzle-solving proficiency—particularly in an environment where contextual understanding is paramount.

With more models and deeper analyses on the horizon, we’re excited to continue refining our benchmarks and sharing insights that push the boundaries of code generation. Stay tuned for upcoming comparisons, and feel free to explore the dataset, code, and full results on HuggingFace and GitHub.

At SUPA, we’re experts in data labeling and provide human experts in different domain areas for evaluation, data curation, and RLHF (Reinforcement Learning from Human Feedback). Reach out to us at supa.so if you need human experts to improve your code generation model.

