KomilionBench Phase 5 — 30 API calls · 10 tasks · 4 configurations · Feb 27 2026

Every response published unedited

Same prompts. Real outputs. Judge-scored against Opus Direct. Balanced scores 8.70/10, which is 0.10 above the Opus Direct baseline of 8.60, at 56% lower cost per task ($0.076 vs $0.172). Council V3 is included and averages 8.77/10 across 10 real tasks.

| Tier | Score | vs Opus | Cost/task |
|---|---|---|---|
| Council V3 | 8.77/10 | +0.17 | $0.0000 |
| Balanced | 8.70/10 | +0.10 | $0.076 |
| Opus Direct | 8.60/10 | baseline | $0.172 |
| Frugal | 8.30/10 | -0.30 | $0.0033 |
Judge: google/gemini-2.5-flash · 3 runs/comparison averaged · 2026-02-23
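
The score deltas and the "56% less" figure follow directly from the tier summary. A minimal sketch, with the published values typed in by hand; the dictionary layout and the `compare_to_baseline` helper are illustrative names, not part of Komilion's tooling:

```python
# Tier summary values copied from the table above: (score out of 10, cost per task in USD).
tiers = {
    "Council V3": (8.77, 0.0000),
    "Balanced": (8.70, 0.076),
    "Opus Direct": (8.60, 0.172),
    "Frugal": (8.30, 0.0033),
}

baseline_score, baseline_cost = tiers["Opus Direct"]

def compare_to_baseline(score, cost):
    """Return (score delta vs Opus Direct, cost saving as a whole percentage of Opus Direct's cost)."""
    delta = round(score - baseline_score, 2)
    saving_pct = round(100 * (1 - cost / baseline_cost))
    return delta, saving_pct

for name, (score, cost) in tiers.items():
    delta, saving_pct = compare_to_baseline(score, cost)
    print(f"{name}: {delta:+.2f} vs Opus, {saving_pct}% cheaper per task")
```

For Balanced this gives +0.10 and 56%, matching the headline claim.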

Four specialists. One synthesised answer.

Council Mode routes your prompt through four AI specialists, each independently assessing it, then synthesises their verdicts into a single expert response. 10 tasks tested vs Opus Direct. T7 pending Phase 6 rerun after complexity gate fix.
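
The fan-out and synthesis flow described above can be sketched as follows. This is an illustrative pattern only: the specialist roles, the `assess`, `synthesise`, and `run_council` names, and the stubbed string responses are all hypothetical and say nothing about how Komilion implements Council Mode internally.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical specialist roles; a real system would call a model for each one.
SPECIALISTS = ["correctness", "performance", "security", "readability"]

def assess(role, prompt):
    """Stub specialist: returns a canned verdict instead of calling a model."""
    return f"[{role}] assessment of: {prompt}"

def synthesise(verdicts):
    """Stub synthesis: joins the verdicts; a real system would merge them with another model call."""
    return "\n".join(verdicts)

def run_council(prompt):
    # Each specialist assesses the prompt independently, so the calls can run in parallel.
    with ThreadPoolExecutor(max_workers=len(SPECIALISTS)) as pool:
        verdicts = list(pool.map(lambda role: assess(role, prompt), SPECIALISTS))
    return synthesise(verdicts)

print(run_council("Optimize this SQL query"))
```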

avg 8.40/10 · wins 6/10 vs Opus Direct · Best value: 56% less than Opus
| Task | Name | Category | Score | vs Opus |
|---|---|---|---|---|
| T1 | Compound Interest | Code Gen | 7.00 | +1.00 |
| T2 | Bug Hunt | Debugging | 10.00 | +2.00 |
| T3 | Async/Await | Explanation | 8.00 | -0.33 |
| T4 | Config Parser Tests | Tests | 8.00 | +1.00 |
| T5 | Refactor | Refactoring | 9.00 | +2.00 |
| T6 | AI Market Summary | Summarization | 8.00 | -1.00 |
| T7 | Git Commit | Commit | 9.00 | +1.00 |
| T8 | SQL Optimization | SQL | 8.00 | -1.00 |
| T9 | Real-Time Chat | Architecture | 9.00 | +1.00 |
| T10 | REST API Design | Architecture | 8.00 | -1.00 |
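
The "avg 8.40/10" and "wins 6/10" figures can be reproduced from the per-task rows. A short sketch, with the values copied by hand from the table above (the variable names are illustrative):

```python
# (task, score, delta vs Opus Direct) copied from the per-task table above.
results = [
    ("T1", 7.00, +1.00), ("T2", 10.00, +2.00), ("T3", 8.00, -0.33),
    ("T4", 8.00, +1.00), ("T5", 9.00, +2.00), ("T6", 8.00, -1.00),
    ("T7", 9.00, +1.00), ("T8", 8.00, -1.00), ("T9", 9.00, +1.00),
    ("T10", 8.00, -1.00),
]

# Mean score across tasks, and the number of tasks with a positive delta (a win).
average = sum(score for _, score, _ in results) / len(results)
wins = sum(1 for _, _, delta in results if delta > 0)

print(f"avg {average:.2f}/10, wins {wins}/{len(results)} vs Opus Direct")
# prints: avg 8.40/10, wins 6/10 vs Opus Direct
```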

Read the actual outputs

1

Compound Interest

Code Gen

Write a Python function calculate_compound_interest(principal, rate, years, compounds_per_year) with comprehensive unit tests covering edge cases.

Frugal

Here is the complete solution, including the compound interest calculation function and a robust suite of unit tests designed to handle standard use cases and critical edge cases.

Python Implementation

```python
import unittest

def calculate_compound_interest(principal, rate, years, compounds_per_year):
    """
    Calculates the total value of an investment using the compound interest form
```

7.0/10 · -2.0 vs Opus
$0.0076 · 24.0s
Komilion offers a functional solution with some tests, but Compare's response is significantly more complete and production-ready due to its…
brain
2

Bug Hunt

Debugging

This Python code has a bug: def find_duplicates(lst): seen = []; dupes = []; [seen.append(x) if x not in seen else dupes.append(x) for x in lst]; return dupes. Find the bug and fix it.

Frugal
The Bug

The primary issue in your code is that it **returns duplicates multiple times** if an element appears more than twice in the list.

**Example of the bug:** If `lst = [1, 1, 1]`, the code logic works like this:
1. First `1`: Added to `seen`.
2. Second `1`: Already in `seen`, so added to `dupes`.
3. Third `1`: Already in `seen`, so added to `dupes` again.

**Result:** `[1, 1]` instead of

9.0/10 · -1.0 vs Opus
$0.0020 · 7.1s
While both models correctly identified and fixed the main bug, Claude-Opus-4.6 offered a more thorough and detailed explanation of all issue…
brain
3

Async/Await

Explanation

Explain async/await in JavaScript. Cover: what problem it solves, how it works under the hood, common mistakes, and when NOT to use it.

Frugal
1. The Problem: Callback Hell and Promise Chaining

Before `async/await`, JavaScript handled asynchronous operations (like API calls or file reading) using **Callbacks** or **Promises**.
  • **Callbacks** led to "Callback Hell," where nested functions made code unreadable and hard to debug.
  • **Promises** improved this with `.then()` and `.catch()`, but complex logic involving multiple condi
8.0/10 · -1.0 vs Opus
$0.0035 · 10.6s
While both responses were good, Claude's use of illustrative code and diagrams made its explanation of async/await significantly clearer and…
brain
4

Config Parser Tests

Tests

Write unit tests for this function: def parse_config(filepath): with open(filepath) as f: return json.load(f). Include edge cases for missing file, invalid JSON, empty file, and nested config.

Frugal

To test your `parse_config` function thoroughly, I have prepared a suite using `pytest` and `unittest.mock`. Using mocks is the best practice here because it allows us to simulate various file states (missing, empty, or corrupted) without needing to create actual files on your disk.

Unit Test Suite

```python
import json
import pytest
from unittest.mock import patch, mock_open

# The function
```

9.0/10 · +1.0 vs Opus
$0.0068 · 21.2s
🏆 Winner
Komilion provided a more elegant and complete solution by leveraging `unittest.mock` for file operations, which is generally preferred for u…
brain
5

Refactor

Refactoring

Refactor this function for readability and maintainability: def p(d,t,r,n): return d*(1+(r/n))**(n*t). Keep the logic identical.

Frugal

```python
def calculate_compound_interest(principal, years, rate, compounding_periods):
    """
    Calculates the future value of an investment using compound interest.
    Formula: A = P(1 + r/n)^(nt)
    :param principal: The initial amount of money (d)
    :param years: The number of years the money is invested (t)
    :param rate: The annual interest rate in decimal form (r)
    :pa
```

8.0/10 · -1.0 vs Opus
$0.0017 · 6.1s
Compare is superior due to its inclusion of explicit type hints, which significantly enhance code clarity and maintainability compared to Ko…
brain
6

AI Market Summary

Summarization

Summarize this in 150 words, preserving key numbers: The global AI market was valued at $142.3 billion in 2023. It is expected to grow at a CAGR of 37.3% from 2023 to 2030, reaching $1.81 trillion. North America dominates with 36.8% market share. Key growth drivers include natural language processing adoption (up 67% YoY), computer vision deployment in manufacturing (saving $1.2 trillion annually), and the proliferation of generative AI tools which saw 340 million users in 2023. Enterprise adoption rose from 22% to 55% between 2021–2023. The healthcare sector shows highest ROI at 3.8x. Challenges include regulatory uncertainty in 47 countries and a talent gap of 2.4 million specialists globally.

Frugal

The global AI market, valued at **$142.3 billion in 2023**, is projected to reach **$1.81 trillion by 2030**, exhibiting a robust **CAGR of 37.3%** from 2023. North America leads the market with a **36.8%** share. Key growth factors include a **67% YoY increase** in natural language processing adoption, computer vision saving **$1.2 trillion annually** in manufacturing, and generative AI tools att

9.0/10 · +0.0 vs Opus
$0.00000 · 2.6s
Both models performed exceptionally well, accurately summarizing the provided text and preserving all key numerical data within the specifie…
gemini-2.5-flash
7

Git Commit

Commit

Write a git commit message for this diff: added user authentication middleware, fixed session timeout from 30min to 24h, removed deprecated /api/v1/login endpoint, added rate limiting 100req/min per IP.

Frugal

feat: implement auth middleware and rate limiting

  • Added user authentication middleware
  • Updated session timeout from 30 minutes to 24 hours
  • Removed deprecated `/api/v1/login` endpoint
  • Implemented rate limiting (100 req/min per IP)
7.0/10 · -2.0 vs Opus
$0.00030 · 4.6s
While both responses correctly identify the changes, the second response is superior due to its adherence to conventional commit standards a…
brain
8

SQL Optimization

SQL

Optimize this SQL query for a users table with 10M rows: SELECT * FROM users WHERE email LIKE '%gmail.com' AND created_at > '2024-01-01' AND subscription_status = 'active' ORDER BY created_at DESC.

Frugal

To optimize this query for 10 million rows, we must address the most significant performance bottleneck: the **leading wildcard** (`%gmail.com`).

Standard B-Tree indexes cannot be used for `LIKE '%suffix'` searches, forcing a full table scan or a full index scan.

1. The Recommended Strategy: Functional Indexing

If you frequently query by email domain, create a functional index on the revers
8.0/10 · -1.0 vs Opus
$0.0031 · 9.4s
Compare offered a more complete and accurate set of optimization strategies, particularly for the challenging leading wildcard `LIKE` condit…
brain
9

Real-Time Chat

Architecture

I am building a real-time chat application for 10,000 concurrent users. Current stack is Node.js REST API plus PostgreSQL. What architecture changes do I need to handle real-time messaging, presence indicators, and message history? Give me specific technology recommendations with trade-offs.

Frugal

To scale a real-time chat application to 10,000 concurrent users, your current REST/PostgreSQL setup will face two primary bottlenecks: **socket management** (statefulness) and **database write-throughput** (latency).

Here is the architectural blueprint to evolve your stack:

1. Communication Layer: WebSockets + Redis Pub/Sub

A standard REST API cannot push data to clients. You need a persist
9.0/10 · +2.0 vs Opus
$0.0038 · 12.8s
🏆 Winner
Komilion offers a complete and well-reasoned architectural guide, whereas the other response, despite a good diagram, is incomplete and thus…
brain
10

REST API Design

Architecture

Design a REST API for a task management system. Requirements: users can create projects, add tasks to projects, assign tasks to team members, set due dates and priorities, track completion. Give me the full endpoint design with HTTP methods, request/response schemas, and authentication approach.

Frugal

I will design a comprehensive REST API for your task management system, covering authentication, resource modeling, and endpoint specifications.

**1. Authentication & Security**
  • **Mechanism:** JSON Web Token (JWT) via the `Authorization: Bearer <token>` header.
  • **Statelessness:** No session state is stored on the server.
  • **Ownership:** Every resource (Project/Task) must be scoped
7.0/10 · -2.0 vs Opus
$0.0044 · 11.9s
For a quick, functional API design, Komilion is sufficient, but for a truly robust and production-ready system, Compare offers a far more co…
brain
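
The Frugal cost per task in the tier summary ($0.0033) can be cross-checked against the ten task cards above. A quick sketch, assuming every card is a Frugal run as labelled; the costs are copied by hand and the variable names are illustrative:

```python
# Per-task Frugal costs in USD, copied from the ten task cards above (T1 to T10).
frugal_costs = [0.0076, 0.0020, 0.0035, 0.0068, 0.0017, 0.0, 0.00030, 0.0031, 0.0038, 0.0044]

# Mean cost per task across the ten cards.
mean_cost = sum(frugal_costs) / len(frugal_costs)

print(f"mean Frugal cost: ${mean_cost:.4f} per task")
# prints: mean Frugal cost: $0.0033 per task
```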

How we ran this

Judge model, methodology, and limitations

Judge

google/gemini-2.5-flash

3 runs per comparison, results averaged. Independent blind scoring.
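
Averaging over three judge runs is what produces fractional values such as a -0.33 delta. A minimal sketch of that averaging step; the `average_runs` helper and the scores passed to it are hypothetical, since the run-level judge scores are not listed on this page:

```python
from statistics import mean

def average_runs(komilion_scores, opus_scores):
    """Average judge scores over repeated runs and return (avg score, avg delta vs Opus)."""
    if len(komilion_scores) != len(opus_scores):
        raise ValueError("each run needs a score for both responses")
    avg_score = mean(komilion_scores)
    avg_delta = mean(k - o for k, o in zip(komilion_scores, opus_scores))
    return round(avg_score, 2), round(avg_delta, 2)

# Hypothetical judge scores for three runs of one comparison (not real benchmark data).
print(average_runs([8, 8, 8], [8, 8, 9]))
```

With these made-up inputs, two tied runs and one run where Opus scores a point higher average out to a delta of -0.33.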

Baseline

Opus 4.6 was called directly via the Anthropic SDK, not through Komilion, to keep the comparison fair.

Run date

2026-02-23

Tasks

10 real-world dev tasks. Compound interest, debugging, async/await, SQL, architecture, and more.

Limitations

  • LLM-as-judge is imperfect by definition. Read the outputs yourself — that's why they're published.
  • 10 tasks is a limited sample. Your workload will vary.
  • Scores are relative comparisons, not absolute quality ratings.

Run it yourself

```python
from openai import OpenAI

# Point the OpenAI client at Komilion's OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://www.komilion.com/api/v1",
    api_key="ck_your_key",
)

response = client.chat.completions.create(
    model="neo-mode/balanced",
    messages=[{"role": "user", "content": "your prompt"}],
)

# See which model was chosen + cost
print(response.model_extra["komilion"]["brainModel"])
print(response.model_extra["komilion"]["cost"])
```

Switch between tiers by changing the model string: neo-mode/frugal, neo-mode/balanced, neo-mode/premium.
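
To compare tiers on your own prompt, the same request can be looped over the three model strings. A sketch under stated assumptions: `run_all_tiers` is a hypothetical helper, it takes an already configured client as an argument, and it treats the `komilion` metadata as optional in case a response does not carry it.

```python
TIERS = ["neo-mode/frugal", "neo-mode/balanced", "neo-mode/premium"]

def run_all_tiers(client, prompt, tiers=TIERS):
    """Send the same prompt to each tier and collect the reply text plus routing metadata."""
    results = {}
    for tier in tiers:
        response = client.chat.completions.create(
            model=tier,
            messages=[{"role": "user", "content": prompt}],
        )
        # The "komilion" metadata may be absent, so fall back to an empty dict.
        meta = (response.model_extra or {}).get("komilion", {})
        results[tier] = {
            "text": response.choices[0].message.content,
            "brain_model": meta.get("brainModel"),
            "cost": meta.get("cost"),
        }
    return results

# Usage, assuming `client` is the OpenAI client configured above:
# for tier, result in run_all_tiers(client, "your prompt").items():
#     print(tier, result["brain_model"], result["cost"])
```

Passing the client in keeps the helper free of credentials and makes it easy to exercise with a stub before spending on real calls.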

The API is OpenAI-compatible. Drop it in wherever you use OpenAI — same interface, smarter routing.

free to start · no card required
