-
Evaluating LLMs’ Reasoning Ability: A Connect 4 Showdown
Abstract Given the recent interest in analyzing LLMs’ reasoning abilities, we tested several LLMs by having them play Connect 4. We used two small models, DeepSeek-R1-Distill-Qwen-32B and DeepSeek-R1-Distill-Llama-70B, and two large models, GPT-o3-Mini and DeepSeek-R1. The game was conducted in two rounds: Round 1: Deepseek-r1-distill-qwen-32b &…
-
AI Safety and Governance: Why It Matters More Than Ever
By: Devariah Christihapsari Why AI Safety Is More Than Just a Technical Issue Artificial Intelligence (AI) is now a core part of how things get done. It’s speeding up processes, making healthcare more precise, and personalizing our experiences from education to entertainment. However, as AI systems become more embedded in daily life, their impact extends…
-
Evaluating LLMs for Bahasa Indonesia: SEA-LIONv3 vs SahabatAI-v1
By: Devariah Christihapsari Abstract In Round 2 of our LLM evaluation, we compared Model A (SEA-LIONv3) and Model B (SahabatAI-v1) to assess their performance on Bahasa Indonesia tasks. Across 50 challenges covering language, domain knowledge, geography, and combined tasks, Model B took the lead with notable gains in linguistic and domain-specific accuracy. Yet, both models…
-
SUPA’s Advent of Code (Expanded, Curated, Verified) Dataset for Code Generation Evaluation and Training
Discover how SUPA’s Advent of Code (Expanded, Curated, Verified) Dataset accelerates code generation research. Explore Python, JavaScript, and Ruby solutions, verified test cases, and integrations for robust AI training—all available on Hugging Face.
-
Strengthening AI Governance in Southeast Asia
By: Devariah Christihapsari Executive Summary Southeast Asia is rapidly adopting Generative AI (GenAI) and Large Language Models (LLMs), unlocking new business opportunities while introducing new risks. Many organizations struggle with AI governance, trying to control every variable. However, a more effective approach is to build flexible systems that can anticipate, respond to, and recover from risks.…
-
Preparing Code Eval Datasets: Data Cleaning and Automated Code Execution for Advent of Code with Docker and Python
By: Noah Rijkaard This blog outlines a system for processing Advent of Code submissions written in various languages. The system utilizes Docker containers to execute the code and a Python script to manage the process. Problem Statement This is part of SUPA’s effort to curate Advent of Code datasets for evaluation and fine-tuning of LLM…
-
DeepSeek R1 Crushes Advent of Code 2024: Our Latest Code Benchmark
With a Part 1 accuracy of 80% and a Part 2 accuracy of 62.57%, DeepSeek R1 outperforms all other models by a significant margin. This superior performance suggests stronger reasoning capabilities and a more nuanced understanding of the puzzle context and storytelling style in Advent of Code.
-
SUPA’s Bilingual Dataset for Evaluating Reasoning Skills in STEM Subjects
-
Evaluating LLMs for Bahasa Indonesia: GPT-4o-mini vs SEA-LIONv3
We tested two large language models (LLMs), GPT-4o-mini and SEA-LIONv3, on their handling of Indonesian-specific questions. Through human expert evaluation of 50 questions across four categories, we found that SEA-LIONv3 excelled at understanding local details and nuances, while both models performed comparably on general language tasks. This difference highlights the potential of localized LLMs like…
-
PRESS RELEASE – TDCX and SUPA tie-up to help companies address a key barrier in generative AI adoption
Collaboration provides companies with a one-stop solution for their data labeling needs. Human-in-the-loop model ensures greater accuracy in data labeling outputs. Singapore, June 24, 2024: TDCX, an award-winning digital customer experience (CX) solutions provider for technology and blue-chip companies, today announced a strategic tie-up with SUPA, a generative artificial intelligence (AI)-powered data labeling company,…