Inside the Black Box: How AI Hiring Models Reason—and Where Bias Hides in Southeast Asia

By: Devariah Christihapsari
Where Bias Hides in AI Hiring: “Would an AI model make the same hiring decision if the only thing that changed was the candidate’s ethnicity?” That was the core question we asked, and tested, across dozens of hiring simulations in Southeast Asia. We built side-by-side comparisons: two nearly identical candidates with the same qualifications,…

April 30, 2025
Evaluating LLMs’ Reasoning Ability: A Connect 4 Showdown

Abstract: Given the recent interest in analyzing LLMs’ reasoning abilities, we had several LLMs play Connect 4 to evaluate how well they reason. We used two small models, DeepSeek-R1-Distill-Qwen-32B and DeepSeek-R1-Distill-Llama-70B, and two large models, GPT-o3-Mini and DeepSeek-R1. The game was conducted in two rounds: Round 1: DeepSeek-R1-Distill-Qwen-32B and DeepSeek-R1-Distill-Llama-70B; Round 2:…

March 26, 2025
AI Safety and Governance: Why It Matters More Than Ever

By: Devariah Christihapsari
Why AI Safety Is More Than Just a Technical Issue: Artificial Intelligence (AI) is now a core part of how things get done. It’s speeding up processes, making healthcare more precise, and personalizing our experiences from education to entertainment. However, as AI systems become more embedded in daily life, their impact extends…

March 14, 2025
Benchmarking Bahasa Indonesia LLMs: SEA-LIONv3 vs SahabatAI-v1

By: Devariah Christihapsari
Abstract: In Round 2 of our LLM evaluation, we compared Model A (SEA-LIONv3) and Model B (SahabatAI-v1) to assess their performance on Bahasa Indonesia tasks. Across 50 challenges covering language, domain knowledge, geography, and combined tasks, Model B took the lead with notable gains in linguistic and domain-specific accuracy. Yet, both models…

February 21, 2025
SUPA’s Advent of Code (Expanded, Curated, Verified) Dataset for Code Generation Evaluation and Training

Discover how SUPA’s Advent of Code (Expanded, Curated, Verified) Dataset accelerates code generation research. Explore Python, JavaScript, and Ruby solutions, verified test cases, and integrations for robust AI training—all available on Hugging Face.

February 18, 2025
Strengthening AI Governance in Southeast Asia

By: Devariah Christihapsari
Executive Summary: Southeast Asia is rapidly adopting Generative AI (GenAI) and Large Language Models (LLMs), unlocking new business opportunities while introducing new risks. Many organizations struggle with AI governance, trying to control every variable. However, a more effective approach is to build flexible systems that can anticipate, respond to, and recover from risks.…

February 12, 2025
Preparing Code Eval Datasets: Data Cleaning and Automated Code Execution for Advent of Code with Docker and Python

By: Noah Rijkaard
This blog outlines a system for processing Advent of Code submissions written in various languages. The system utilizes Docker containers to execute the code and a Python script to manage the process. Problem Statement: This is part of SUPA’s effort to curate Advent of Code datasets for evaluation and fine-tuning of LLM…

January 24, 2025
DeepSeek R1 Crushes Advent of Code 2024: Our Latest Code Benchmark

With a Part 1 accuracy of 80% and a Part 2 accuracy of 62.57%, DeepSeek R1 outperforms all other models by a significant margin. This superior performance suggests stronger reasoning capabilities and a more nuanced understanding of the puzzle context and storytelling style in Advent of Code.

January 24, 2025
SUPA’s Bilingual Dataset for Evaluating Reasoning Skills in STEM Subjects

January 17, 2025
Local vs Global: Testing GPT-4o-mini and SEA-LIONv3 on Bahasa Indonesia

We tested two large language models (LLMs), GPT-4o-mini and SEA-LIONv3, on their handling of Indonesian-specific questions. Through human expert evaluation of 50 questions across four categories, we found that SEA-LIONv3 excelled at understanding local details and nuances, while both models performed comparably on general language tasks. This difference highlights the potential of localized LLMs like…

January 17, 2025