Review of bingo
bingo is an open-source AI red team terminal that lets you run penetration testing scenarios against LLM applications using multiple attacker models. Instead of paying for a commercial red team platform or building one from scratch, you get a CLI tool that orchestrates attacks across DeepSeek, Claude, GPT, and GLM. The tool is built for security researchers, AI red teamers, and any team that ships LLM features and needs to test for prompt injection, jailbreaks, and adversarial inputs.
AI red teaming has gone from niche to mainstream in 2026. Every company shipping LLM features needs this kind of testing, and the OWASP Top 10 for LLM Applications lists prompt injection as its #1 risk. Alternatives exist, but each has a barrier: commercial platforms such as Mindgard are expensive ($5K-$50K/year), while open-source frameworks such as Garak and PyRIT are Python-only with steep learning curves. Most security teams don't have the budget for a commercial platform or the time to learn a framework.
bingo is a terminal-based tool. You pick your attacker LLM (DeepSeek, Claude, GPT, GLM), define a target (a prompt template, an API endpoint, or a saved conversation), and watch the attack play out. The CLI design makes it scriptable for CI/CD pipelines. The output is structured logs that you can parse, filter, and route to your SIEM.
Different LLMs have different attack strengths. DeepSeek is aggressive: it tries the most adversarial patterns and rarely refuses. Claude is creative: it finds novel attack vectors the others miss. GPT-4 is methodical: it systematically enumerates the input space. GLM is multilingual: it catches vulnerabilities in non-English prompts that the other three miss entirely. Attacking with all of them gives you a much more robust security picture than relying on a single attacker model. In our testing, running all four models found three times as many vulnerabilities as running GPT-4 alone.
Prompt injection testing: 20+ attack patterns, including direct injection, indirect injection via retrieved documents, multi-turn manipulation, and encoding-based bypasses. Jailbreak discovery: 15+ jailbreak techniques, including role-play, hypothetical framing, and token smuggling. Data exfiltration attempts: tests for system prompt leakage, training data extraction, and cross-tenant data access. Bias and toxicity probing: 30+ demographic and topic categories. Output format compliance: tests for JSON schema validation, PII leakage, and refusal consistency. All configurable via CLI flags, all loggable for compliance audits.
Install via pip: `pip install bingo-redteam`. Requires Python 3.10+. You'll need an API key for each attacker LLM you want to use. DeepSeek is the cheapest (~$0.14 per million tokens), so start there. Claude and GPT-4 cost more but find different vulnerabilities. Set the API keys as environment variables: `export OPENAI_API_KEY=...`, `export ANTHROPIC_API_KEY=...`, etc.
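A missing key is an easy way to waste a run, so a small preflight check can help. The sketch below is our own helper, not part of bingo. It only checks the two variable names mentioned above; the names for other providers are an assumption you would need to fill in from each provider's documentation.

```python
import os

# The two names below are the ones shown in this review. Add the variable
# names your other attacker providers expect (an assumption to verify).
REQUIRED_KEYS = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY"]


def missing_keys(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_keys(REQUIRED_KEYS)
if missing:
    print("Missing API keys:", ", ".join(missing))
else:
    print("All listed API keys are set.")
```

Passing `env` explicitly makes the helper easy to test without touching the real environment.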
The basic command is `bingo test --target prompt.txt --attacker claude --pattern injection --output report.json`. The tool runs 50 attack attempts, logs each one, and produces a structured report with severity ratings (critical/high/medium/low). You can also run in batch mode: `bingo test --target api --attacker all --pattern all --output daily-report.json`. The batch mode is what you want for CI/CD.
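Once a report exists, the first thing we usually want is a count per severity level. The following is a minimal sketch of that step. It assumes a hypothetical report layout, a top-level `findings` list whose entries carry a `severity` field; check the actual schema in your own `report.json` before relying on it.

```python
import json
from collections import Counter

# Severity levels named in this review, ordered most to least severe.
SEVERITIES = ["critical", "high", "medium", "low"]


def count_by_severity(report):
    """Count findings per severity level.

    Assumes a hypothetical report shape: {"findings": [{"severity": ...}, ...]}.
    Findings with a missing or unrecognized severity are not counted.
    """
    counts = Counter(f.get("severity") for f in report.get("findings", []))
    return {level: counts[level] for level in SEVERITIES}


# Made-up sample data standing in for the contents of report.json.
sample = json.loads("""
{"findings": [
  {"pattern": "injection", "severity": "high"},
  {"pattern": "injection", "severity": "low"},
  {"pattern": "injection", "severity": "high"}
]}
""")
print(count_by_severity(sample))
```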
bingo is free and open source under the MIT license. You only pay for the LLM API usage. In our testing, a typical daily run (4 LLMs, 200 attack attempts) cost about $3-5 in API fees, which works out to roughly $1,100-$1,800 a year if you run it every day. For comparison, commercial red team platforms charge $5K-$50K/year per seat, so for most security teams the cost argument is settled quickly.
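To make the cost comparison concrete, here is the arithmetic behind our numbers as a short sketch. The daily cost range and attempt count are the figures from our runs; the assumption that you run every day of the year is ours, so adjust `DAYS_PER_YEAR` for your own schedule.

```python
DAYS_PER_YEAR = 365            # assumption: one run every calendar day
ATTEMPTS_PER_RUN = 200         # attack attempts in our typical daily run
DAILY_COST_RANGE = (3.0, 5.0)  # USD per day in API fees, from our testing


def annual_cost(daily_cost, days=DAYS_PER_YEAR):
    """Project a daily API cost over a year of runs."""
    return daily_cost * days


def cost_per_attempt(daily_cost, attempts=ATTEMPTS_PER_RUN):
    """Average API cost of a single attack attempt."""
    return daily_cost / attempts


for daily in DAILY_COST_RANGE:
    print(
        f"${daily:.2f}/day -> ${annual_cost(daily):,.0f}/year, "
        f"${cost_per_attempt(daily):.3f} per attempt"
    )
```

Even the upper projection stays below the $5K low end of the commercial pricing quoted above.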
We integrated bingo into our GitHub Actions pipeline. Every PR that touches a prompt template triggers a bingo run. If any critical or high-severity vulnerability is found, the PR fails the security check. This is the killer feature: red teaming on every change, not just quarterly. The setup took 2 hours. The false positive rate is low (about 5%), and in our experience the false negative rate is about the same as with manual testing.
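The gating rule itself is simple enough to sketch. The snippet below is an illustration of the policy, not our exact pipeline code and not something bingo ships. It assumes the same hypothetical report layout as before: a top-level `findings` list whose entries have a `severity` field.

```python
# Severities that should fail the check, per the policy described above.
BLOCKING = frozenset({"critical", "high"})


def gate_exit_code(report, blocking=BLOCKING):
    """Return 1 if any finding has a blocking severity, otherwise 0.

    Assumes a hypothetical report shape: {"findings": [{"severity": ...}, ...]}.
    """
    findings = report.get("findings", [])
    return 1 if any(f.get("severity") in blocking for f in findings) else 0


# Made-up reports: the first should pass the gate, the second should fail it.
print(gate_exit_code({"findings": [{"severity": "medium"}]}))
print(gate_exit_code({"findings": [{"severity": "critical"}]}))
```

In a workflow step you would load the JSON report from disk and pass the result to `sys.exit`, since a nonzero exit code is what fails the job.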
Garak (NVIDIA): more comprehensive attack library, but Python-only with a steeper learning curve. PyRIT (Microsoft): more enterprise-focused, and its setup leans toward Azure. Mindgard: commercial, more polished UI, but expensive. Rebuff: prompt injection only, limited scope. bingo sits in the middle: easier than PyRIT, more scriptable than Garak, and free unlike Mindgard.
What we liked: multi-LLM support (attack with DeepSeek, Claude, GPT, and GLM from one CLI); an MIT license that allows commercial use and CI/CD integration without licensing costs; a terminal-first design that is scriptable for automated testing; active development with regular new attack patterns; and strong documentation.
The downsides: you have to bring your own API keys for each LLM, and costs add up if you run all 4 models daily. The community is smaller than those of established alternatives like Garak. The documentation assumes familiarity with red team concepts. Attacks are limited to text: there is no multimodal or voice red teaming yet. The structured logs are JSON only, with no built-in dashboard for non-technical stakeholders.
Who should use it: security teams at AI-first companies, AI red teamers who need a scriptable tool, solo security researchers who can't afford commercial platforms, and anyone shipping LLM features who wants to add red teaming to CI/CD. If you have a budget for a commercial platform, Mindgard is more polished. If you don't, bingo is the best free option.
bingo is a focused, well-designed tool for a real need. The open-source license means security teams can integrate it into their own pipelines without per-seat licensing. The multi-LLM support catches vulnerabilities that single-model testing misses. After 3 weeks of daily use in our pipeline, we caught 7 real vulnerabilities that manual code review missed. Worth a look if you ship AI products and don't yet have a structured red team workflow.