SUPA AI Blog – Preparing Code Eval Datasets: Data Cleaning and Automated Code Execution for Advent of Code with Docker and Python

By: Noah Rijkaard

This blog outlines a system for processing Advent of Code submissions written in various languages. The system utilizes Docker containers to execute the code and a Python script to manage the process.

Problem Statement

This is part of SUPA’s effort to curate Advent of Code datasets for the evaluation and fine-tuning of LLMs. You can read how DeepSeek R1 performs on these datasets here or access the full datasets on Hugging Face.

Advent of Code is an annual coding challenge where participants solve daily programming puzzles. Participants can write their solutions in any programming language, so submissions arrive in many different languages. To automate the evaluation of these submissions, we need a system that can execute code in various languages and capture the output.

Solution Overview

Our solution involves creating a Python script that processes the Advent of Code submissions. The script uses Docker containers to execute the code in the appropriate language. The Docker container is configured with the necessary tools and libraries for the specific language.

Here’s a breakdown of the key components:

Language Mapping: A dictionary that maps file extensions to Docker images containing the appropriate language environment.

language_map = {
    '.py': 'python:3.9',
    '.go': 'golang:1.16',
    '.c#': 'mcr.microsoft.com/dotnet/sdk:5.0',
    '.cs': 'mcr.microsoft.com/dotnet/sdk:5.0',
    '.js': 'node:14',
    '.java': 'openjdk:11',
    '.rb': 'ruby:2.7',
    '.cpp': 'gcc:latest',
    '.rs': 'rust:latest'
}
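As a minimal sketch of how this lookup might work, the extension can be split off the file name and matched case-insensitively. The helper name resolve_image is hypothetical, and a trimmed copy of the map is inlined only so the snippet runs on its own.

```python
import os

# Trimmed copy of the mapping above so the snippet is self-contained.
language_map = {
    '.py': 'python:3.9',
    '.go': 'golang:1.16',
    '.cs': 'mcr.microsoft.com/dotnet/sdk:5.0',
}

def resolve_image(file_name, mapping=language_map):
    """Return the Docker image for a file, or None if the extension is unmapped."""
    _, ext = os.path.splitext(file_name)
    # Lower-casing is an assumption so that "Day01.CS" and "day01.cs" behave alike.
    return mapping.get(ext.lower())
```

Returning None for unmapped extensions lets the caller skip files such as READMEs instead of failing on them.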

File Processing: The script iterates through each submission file, determines the language based on the file extension, and executes the code using the corresponding Docker image.

def process_file(question_path, file, repo_id, input_text, expected_output_1, expected_output_2, year, question_number):
    # ... (Implementation details)

Input Handling: If the submission file references an input file, the script copies the input file into the Docker container before executing the code.

# Look for input file reference and copy if needed
match = re.search(r'["\']([^"\']+\.(?:txt|in|dat|input))["\']', content, re.IGNORECASE)
if match:
    input_file_name = match.group(1)
    # ... (Copy input file to container)
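The copy step is elided above. One way it could look, as a sketch under assumptions: the puzzle input already exists on disk, and it is placed next to the submission under the name the code expects. The helper names find_input_reference and stage_input, and the choice to keep only the base name of the reference, are illustrative and not taken from the original script.

```python
import os
import re
import shutil

# Same pattern as above: a quoted file name ending in .txt, .in, .dat, or .input.
INPUT_REF = re.compile(r'["\']([^"\']+\.(?:txt|in|dat|input))["\']', re.IGNORECASE)

def find_input_reference(content):
    """Return the first quoted input-file name found in the source, or None."""
    match = INPUT_REF.search(content)
    return match.group(1) if match else None

def stage_input(content, input_path, work_dir):
    """Copy the puzzle input into work_dir under the name the submission expects."""
    name = find_input_reference(content)
    if name is None:
        return None
    # Assumption: keep only the base name so "../input.txt" stays inside work_dir.
    target = os.path.join(work_dir, os.path.basename(name))
    shutil.copyfile(input_path, target)
    return target
```

Submissions that read from a nested path would need the directory structure recreated as well, which this sketch does not attempt.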

Output Capture: The script captures the stdout and stderr of the executed code, which are used to determine whether the solution is correct and to identify any errors.

stdout, stderr = execute_in_docker(question_path, file, docker_image, run_command)

Result Storage: The results are stored in a CSV file or uploaded to a DynamoDB table.

# Save results to CSV
with open(csv_output_path, 'w', newline='') as csv_output:
    # ... (Write results to CSV)

# Save results to DynamoDB
for item in results:
    dynamodb.put_item(
        # ... (Upload results to DynamoDB)
    )
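For the CSV branch, a sketch using the standard csv module might look like the following. The column names come from the CSV Format section of this post; the function name write_results_csv and the assumption that each result is a plain dict are ours.

```python
import csv

# Column order taken from the CSV Format section.
FIELDNAMES = [
    'RepositoryID', 'ChallengeID', 'FileContent', 'Status',
    'question_1_correct', 'question_2_correct', 'ErrorMessage',
    'Year', 'QuestionNumber', 'Language',
]

def write_results_csv(results, csv_output_path):
    """Write a list of result dicts to CSV with a header row."""
    with open(csv_output_path, 'w', newline='', encoding='utf-8') as csv_output:
        writer = csv.DictWriter(csv_output, fieldnames=FIELDNAMES)
        writer.writeheader()
        for row in results:
            # Missing keys become empty cells; keys outside FIELDNAMES are dropped.
            writer.writerow({name: row.get(name, '') for name in FIELDNAMES})
```

Since FileContent holds whole source files, opening the file with newline='' matters: it lets the csv module quote embedded line breaks correctly.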

Docker Execution

The execute_in_docker function handles the creation and execution of Docker containers. It takes the repository path, file name, Docker image, and run command as arguments.

def execute_in_docker(repo_path, file_name, docker_image, run_command, timeout=300):
    # ... (Implementation details)

The function first creates a Docker container using the docker create command. The container is configured to mount the repository path at /app as a volume, use /app as the working directory, run the specified Docker image, and stay within the memory and CPU limits shown below.

docker create \
    --memory=512m \
    --cpus=1 \
    -v {repo_abs_path}:/app \
    -w /app \
    {docker_image} \
    {run_command}

Once the container is created, the function starts the container using the docker start command and captures the output. After the execution is complete, the function removes the container using the docker rm command.

This process ensures that each submission is executed in a clean and isolated environment, preventing any conflicts between different submissions or dependencies.
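The create, start, and remove steps described above can be sketched with subprocess. This is not the original implementation: the names build_create_command and execute_in_docker_sketch are hypothetical, run_command is assumed to be a single shell-style string, and attaching with docker start -a is our guess at how the output is captured.

```python
import os
import shlex
import subprocess

def build_create_command(repo_path, docker_image, run_command):
    """Assemble the `docker create` argument list shown above."""
    repo_abs_path = os.path.abspath(repo_path)
    return [
        'docker', 'create',
        '--memory=512m',
        '--cpus=1',
        '-v', f'{repo_abs_path}:/app',
        '-w', '/app',
        docker_image,
        *shlex.split(run_command),  # assumes run_command is one string
    ]

def execute_in_docker_sketch(repo_path, docker_image, run_command, timeout=300):
    """Create, start, and remove one container; return (stdout, stderr)."""
    create = subprocess.run(
        build_create_command(repo_path, docker_image, run_command),
        capture_output=True, text=True, check=True,
    )
    container_id = create.stdout.strip()
    try:
        # -a attaches to the container so the program's output can be captured.
        started = subprocess.run(
            ['docker', 'start', '-a', container_id],
            capture_output=True, text=True, timeout=timeout,
        )
        return started.stdout, started.stderr
    except subprocess.TimeoutExpired:
        return '', f'Timed out after {timeout} seconds'
    finally:
        # -f also stops the container if it is still running after a timeout.
        subprocess.run(['docker', 'rm', '-f', container_id], capture_output=True)
```

Putting the removal in a finally block means a container is cleaned up even when a submission hangs or crashes, which keeps long batch runs from accumulating stopped containers.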

CSV Format

The results are stored in a CSV file with the following columns:

Column Name	Description
RepositoryID	The ID of the repository containing the submission.
ChallengeID	The name of the challenge file.
FileContent	The content of the challenge file.
Status	The status of the execution (Success or Failure).
question_1_correct	Whether the solution for question 1 is correct (Correct or Incorrect).
question_2_correct	Whether the solution for question 2 is correct (Correct or Incorrect).
ErrorMessage	The error message, if any.
Year	The year of the Advent of Code challenge.
QuestionNumber	The question number.
Language	The programming language used in the submission.

This format provides a structured way to store and analyze the results of the Advent of Code submissions.

Note: Some users may split a single day’s challenge into two separate files, one for part 1 and another for part 2. In such cases, the Status column will be marked as “Success” if either part 1 or part 2 is solved correctly.
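To make the rule in the note concrete, here is a sketch of how the Status and correctness columns could be derived from captured output. The function name grade_output and the matching rule (an answer counts if it appears as a whitespace-separated token in stdout) are assumptions; the original script may compare output differently.

```python
def grade_output(stdout, stderr, expected_output_1, expected_output_2):
    """Derive the result columns from captured output and the expected answers."""
    # Assumption: an answer is "found" if it appears as a standalone token in stdout.
    tokens = set(stdout.split())
    part_1 = str(expected_output_1) in tokens
    part_2 = str(expected_output_2) in tokens
    return {
        # Success if either part is correct, matching the note above.
        'Status': 'Success' if (part_1 or part_2) else 'Failure',
        'question_1_correct': 'Correct' if part_1 else 'Incorrect',
        'question_2_correct': 'Correct' if part_2 else 'Incorrect',
        'ErrorMessage': stderr.strip(),
    }
```

Token matching avoids counting an expected answer of 1 as present just because the program printed "Part 1:", though it would still miss answers that are glued to punctuation.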

Code Implementation

Here’s a breakdown of the key functions and their roles:

process_file: This function handles the execution of a single submission file. It determines the language, creates a Docker container, copies the input file (if necessary), executes the code, and captures the output.
process_repositories: This function processes all submissions in a given year and question. It iterates through the repository folders, identifies the submission files, and calls the process_file function for each file.
main: The main entry point of the script. It parses the command line arguments and calls the process_repositories function.
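The discovery step inside process_repositories could be sketched as below. The folder layout (one sub-folder per repository under the question folder) and the helper name find_submission_files are assumptions made for illustration; the post does not spell out the actual layout.

```python
import os

def find_submission_files(question_path, extensions):
    """Yield (repo_id, file_path) for every file with a supported extension."""
    for repo_id in sorted(os.listdir(question_path)):
        repo_dir = os.path.join(question_path, repo_id)
        if not os.path.isdir(repo_dir):
            continue  # ignore stray files next to the repository folders
        for root, _dirs, files in os.walk(repo_dir):
            for name in sorted(files):
                if os.path.splitext(name)[1].lower() in extensions:
                    yield repo_id, os.path.join(root, name)
```

Passing the keys of language_map as extensions would restrict the walk to files the pipeline knows how to run.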

Execution

To run the script, provide the path to the folder containing the Advent of Code submissions and the year to be processed. The script will then process all submissions for the specified year and question.
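A command-line interface along these lines could be built with argparse. The argument names submissions_dir and year are hypothetical; the post only says that a folder path and a year are supplied.

```python
import argparse

def parse_args(argv=None):
    """Parse the submissions folder and the year from the command line."""
    parser = argparse.ArgumentParser(
        description='Run Advent of Code submissions in Docker and record results.'
    )
    # Both argument names are illustrative, not taken from the original script.
    parser.add_argument('submissions_dir', help='folder containing the submissions')
    parser.add_argument('year', type=int, help='Advent of Code year to process')
    return parser.parse_args(argv)
```

With this sketch, an invocation would pass the folder first and the year second, and a non-numeric year would be rejected by argparse before any container is started.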

Improvements

Parallel Execution: The script can be parallelized to improve performance by using multiple threads to process submissions concurrently.
Error Handling: The script can be enhanced with better error handling to provide more informative error messages.
Scalability: The script can be scaled to handle large datasets by using distributed processing techniques.
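The parallel execution idea can be sketched with a thread pool from the standard library. Threads are a reasonable fit here because each job mostly waits on an external Docker process. The helper name run_all and the default of four workers are assumptions, and worker stands in for whatever function processes one submission.

```python
from concurrent.futures import ThreadPoolExecutor

def run_all(jobs, worker, max_workers=4):
    """Apply worker to each job concurrently and return results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order even when jobs finish out of order.
        return list(pool.map(worker, jobs))
```

The worker count should stay in line with the per-container limits: with one CPU and 512 MB per container, four workers need roughly four cores and 2 GB of memory available to Docker.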

Conclusion

This solution provides a repeatable way to process Advent of Code submissions in various languages. By using Docker containers, we ensure a consistent execution environment for each language. The Python script provides a flexible framework for customizing the processing pipeline and integrating with different storage mechanisms.