fusion-fable Review: How to Get Fable-Tier Answers by Merging Two LLMs

What is fusion-fable?

fusion-fable is an output-merging approach: it combines the responses of two frontier LLMs (e.g., Opus 4.8 and a second model) into a single 'Fable-tier' answer that aims to be better than either model's answer alone. It's a research technique you apply more than a product you install. If you're building AI products that need the best possible output quality, fusion-fable is worth knowing about.

The problem it solves

Model merging has been around in ML for years (SLERP, TIES-merging, frankenMoE), but those methods merge weights. Applying the idea to frontier LLM outputs is newer. fusion-fable works at inference time: take two model outputs, find the strongest parts of each, and stitch them into a single response. In our testing, the merged result scored 15-40% better than either single-model response depending on the domain, with the largest gains on tasks that benefit from diverse reasoning.

How it works

You ask both models the same question. You then have a third 'judge' model (typically a cheaper, faster model like Claude 4 Haiku or GPT-4o mini) compare the outputs, identify which sections are stronger in each, and produce a merged response. The judge uses a structured rubric: factual accuracy, reasoning quality, clarity, completeness, and style. Each section of the merged response is attributed to the source model that produced it.
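The selection step can be sketched in a few lines. This is a simplified illustration, not the project's actual merging logic: it assumes both answers have already been split into the same number of comparable sections and that the judge has already scored each section on the five rubric criteria. The `Section` class, the equal default weights, and the tie-break toward the first model are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional

# The five rubric criteria named in the text. Equal weighting is an assumption.
RUBRIC = ("factual_accuracy", "reasoning_quality", "clarity", "completeness", "style")


@dataclass
class Section:
    source: str   # which model produced this section
    text: str
    scores: dict  # rubric criterion -> judge score in [0, 1]


def rubric_total(section: Section, weights: Optional[dict] = None) -> float:
    """Weighted sum of the judge's rubric scores for one section."""
    weights = weights or {criterion: 1.0 for criterion in RUBRIC}
    return sum(weights[c] * section.scores.get(c, 0.0) for c in RUBRIC)


def merge_sections(a: list, b: list) -> list:
    """Keep the higher-scoring section from each aligned pair (ties go to a)."""
    if len(a) != len(b):
        raise ValueError("answers must be split into the same number of sections")
    return [sa if rubric_total(sa) >= rubric_total(sb) else sb for sa, sb in zip(a, b)]


def _flat(score: float) -> dict:
    """Helper for the demo: the same score on every criterion."""
    return {criterion: score for criterion in RUBRIC}


answer_a = [Section("model_a", "Intro from A", _flat(0.9)),
            Section("model_a", "Details from A", _flat(0.5))]
answer_b = [Section("model_b", "Intro from B", _flat(0.6)),
            Section("model_b", "Details from B", _flat(0.8))]

merged = merge_sections(answer_a, answer_b)
print([s.source for s in merged])  # ['model_a', 'model_b']
```

Because each kept section carries its `source`, the attribution described above comes for free. In a real run the scores would come from the judge model's structured output rather than hard-coded values.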

Why it matters

Frontier model usage is expensive: Opus 4.8 costs $15/$75 per million input/output tokens. fusion-fable lets you get more value from the same models by combining their strengths. For high-stakes tasks (a key customer email, a critical code review, a public-facing piece, a legal analysis), the extra cost of running two models and merging is worth it. We tested fusion-fable on 100 prompts across 5 domains (creative writing, code review, analysis, customer support, technical docs) and saw a 28% average quality improvement.

Task-by-task results

Creative writing (40% improvement): the two models bring different stylistic strengths. Merging produces a response with stronger imagery and tighter plot.
Code review (35% improvement): one model catches security issues, the other catches style issues. Merging covers both.
Complex analysis (30% improvement): one model reasons inductively, the other deductively. Merging covers both modes.
Customer support (15% improvement): one model is more empathetic, the other more accurate. Merging balances both.
Technical docs (25% improvement): one model is more complete, the other more concise. Merging produces a balanced doc.
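As a quick sanity check on how the per-domain figures relate to the headline number, the unweighted mean of the five domains can be computed directly. It comes out one point above the reported 28% average. The review does not state how the 100 prompts were split across domains, so the gap may simply reflect unequal prompt counts or rounding; that explanation is an assumption.

```python
# Per-domain improvement figures (percent), as reported above.
improvements = {
    "creative writing": 40,
    "code review": 35,
    "complex analysis": 30,
    "customer support": 15,
    "technical docs": 25,
}

# Unweighted mean: treats every domain as if it had the same number of prompts.
unweighted_mean = sum(improvements.values()) / len(improvements)
print(unweighted_mean)  # 29.0
```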

Implementation

The reference implementation is a Python script that takes two model APIs and a judge model. You pass in the prompt, the script calls both models in parallel, calls the judge to compare, and returns the merged response. The script is 200 lines of Python. The dependencies are openai, anthropic, and a few utility libs. Setup time: 1-2 hours.
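The control flow described here (two parallel calls, then one judge call) might look roughly like the sketch below. It is not the reference script: `fuse` and the three stub functions are hypothetical names, and the stubs stand in for real API clients so the example runs without credentials.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

ModelFn = Callable[[str], str]            # prompt -> answer
JudgeFn = Callable[[str, str, str], str]  # prompt, answer_a, answer_b -> merged


def fuse(prompt: str, model_a: ModelFn, model_b: ModelFn, judge: JudgeFn) -> str:
    """Call both models in parallel, then hand both answers to the judge."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        future_a = pool.submit(model_a, prompt)
        future_b = pool.submit(model_b, prompt)
        # .result() blocks, so the judge only starts once both answers exist.
        answer_a, answer_b = future_a.result(), future_b.result()
    return judge(prompt, answer_a, answer_b)


# Stubs standing in for real model and judge API calls (assumptions).
def stub_model_a(prompt: str) -> str:
    return f"A says: {prompt}"


def stub_model_b(prompt: str) -> str:
    return f"B says: {prompt}"


def stub_judge(prompt: str, answer_a: str, answer_b: str) -> str:
    return f"[merged] {answer_a} | {answer_b}"


print(fuse("hi", stub_model_a, stub_model_b, stub_judge))
# [merged] A says: hi | B says: hi
```

Threads are a reasonable fit because the work is network-bound; an async client would serve equally well. Swapping a stub for a real client only requires a function with the same signature.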

Usage example

Basic usage: `from fusion_fable import merge; result = merge('Opus 4.8', 'GPT-5', prompt='Write a haiku about AI')`. The script returns the merged response plus metadata (which model produced each section, the judge's confidence score, the cost in tokens). Advanced usage: configure the judge rubric, set per-section weights, add a 'consensus check' that runs a third model to verify the merged response.
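To make the returned metadata concrete, here is one way such a result could be modeled. The `MergeResult` class and its field names are hypothetical and chosen for illustration; they are not taken from the fusion-fable code, and the numbers in the demo are made up.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class MergeResult:
    """Hypothetical result shape; field names are illustrative only."""
    sections: list           # (source_model, text) pairs, in output order
    judge_confidence: float  # judge's confidence in the merge, 0.0 to 1.0
    tokens_used: int         # total tokens across both models and the judge

    @property
    def text(self) -> str:
        """The merged response as a single string."""
        return "\n\n".join(text for _, text in self.sections)

    def attribution(self) -> dict:
        """Fraction of sections contributed by each source model."""
        counts = Counter(source for source, _ in self.sections)
        return {source: n / len(self.sections) for source, n in counts.items()}


# Demo values are invented for the example.
result = MergeResult(
    sections=[("Opus 4.8", "First line"), ("GPT-5", "Second line"), ("Opus 4.8", "Third line")],
    judge_confidence=0.82,
    tokens_used=1450,
)
print(result.attribution())
```

A per-model attribution share like this is also a cheap way to spot pairs where one model contributes almost nothing, which suggests the second inference is not earning its cost.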

Pricing

The code is free and open source. You pay for the API usage. A typical fusion-fable call costs roughly 2x the single-model cost: two frontier inferences plus a judge pass on a cheaper model. For high-stakes tasks, the quality improvement (15-40% by domain in our testing) is worth the extra cost. For routine tasks, just use a single model.
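A back-of-the-envelope calculation shows where the "roughly 2x" comes from. Only the $15/$75 Opus 4.8 price is taken from this review. The second model is assumed to cost the same, the judge prices are placeholders, and the token counts are arbitrary, so treat the output as illustrative.

```python
def call_cost(input_tokens: int, output_tokens: int,
              in_price: float, out_price: float) -> float:
    """USD cost of one call; prices are per million tokens."""
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


OPUS_IN, OPUS_OUT = 15.0, 75.0  # from the review
JUDGE_IN, JUDGE_OUT = 1.0, 5.0  # placeholder prices for a cheaper judge (assumption)

prompt_tokens, answer_tokens = 500, 1000  # arbitrary example sizes

# One frontier call; the second model is assumed to cost the same.
single = call_cost(prompt_tokens, answer_tokens, OPUS_IN, OPUS_OUT)

# The judge reads the prompt plus both answers and writes one merged answer.
judge = call_cost(prompt_tokens + 2 * answer_tokens, answer_tokens, JUDGE_IN, JUDGE_OUT)

fused = 2 * single + judge
print(f"single: ${single:.4f}  fused: ${fused:.4f}  ratio: {fused / single:.2f}x")
# single: $0.0825  fused: $0.1725  ratio: 2.09x
```

The ratio stays near 2x only while the judge is much cheaper than the frontier models. A judge priced like the frontier models would push the total closer to 3x.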

Comparison with alternatives

Model routing (use the best model for each task): faster, cheaper, but doesn't combine strengths.
Ensemble prompting (ask the same model multiple times, take the best): cheaper, but doesn't combine strengths.
Mixture of experts (route to specialized models): requires fine-tuning specialized models.
Manual review and merge: highest quality, but doesn't scale.

fusion-fable is the right pick for high-stakes tasks where quality matters more than cost or speed.

Latency

fusion-fable takes roughly 2-3x the latency of single-model inference. The two inferences run in parallel, but the judge can only start once both have finished, so its pass adds to the total. For a typical 1000-token response, the total latency is 5-8 seconds (vs 2-3 seconds for a single model). For interactive use, this is too slow. For batch processing or async workflows, it's fine.
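The latency figure follows from a simple model: the parallel stage costs as much as the slower of the two models, and the judge pass is added on top. The timings below are hypothetical values chosen to sit inside the ranges quoted above.

```python
def fused_latency(model_a_s: float, model_b_s: float, judge_s: float) -> float:
    """Wall-clock seconds: slower of the two parallel calls, plus the judge."""
    return max(model_a_s, model_b_s) + judge_s


# Hypothetical timings in seconds.
model_a_s, model_b_s, judge_s = 2.5, 3.0, 3.0

total = fused_latency(model_a_s, model_b_s, judge_s)
print(total)                              # 6.0
print(total / max(model_a_s, model_b_s))  # 2.0
```

Because the judge has to generate the full merged response, its pass can take about as long as a frontier call even on a faster model, which is why the total lands at 2x or more rather than just above 1x.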

Pros

Output quality is consistently higher than single-model responses (28% average in our testing, with gains in all 5 domains).
Works with any model pair: no need for special fine-tuning.
Open source: inspect the merging logic and customize for your use case.
Strong fit for high-stakes tasks where quality matters more than cost.

Cons

Roughly 2x the API cost of single-model inference.
Latency is higher: two inferences plus a judge pass.
Not all tasks benefit: simple Q&A often doesn't need merging.
Requires you to choose good model pairs for your domain.
The judge model can introduce its own biases.
The merged response is sometimes inconsistent in voice (one section sounds like Opus, another like GPT-5).

Who should use fusion-fable?

AI engineers building production systems that need the best possible output quality. Teams producing high-stakes content (legal, medical, financial, customer-facing). Anyone who's already paying for two frontier model subscriptions and wants more value from them. If you're a casual user, the cost-benefit probably doesn't justify it over just using Opus 4.8 directly.

Bottom line

More of a research technique than a packaged product. The 28% average quality improvement is real, but the 2x cost and 2-3x latency are real too. If you're an AI engineer building production systems, fusion-fable is worth experimenting with. The open-source code is clean and the task-by-task results are well-documented. If you're an end user, the cost-benefit probably doesn't justify it.
