Detection Accuracy

Last updated: March 2026 · Based on internal evaluation across 7,700+ tracked real-world analyses

ImageWhisperer runs multiple detection models in parallel and cross-validates their outputs against each other. No single model determines the verdict. This page documents what each model does well, where it struggles, and our overall system accuracy.

1. How We Test

Our benchmarks are based on two sources:

- Internal evaluation across 7,700+ tracked real-world user uploads, not curated test sets.
- Peer-reviewed academic benchmarks, which supply the industry-average comparison figures cited throughout this page.

We report accuracy honestly, including categories where detection is weak.

2. Overall System Performance

The numbers below compare ImageWhisperer to the research-reported industry average for single-model detectors tested on real-world images (not lab conditions).

AI-Generated Detection: ImageWhisperer 94% vs. industry avg. 71%
(avg. across 16 detectors on real-world images ¹)

False Positive Rate: ImageWhisperer 5% vs. industry avg. 18%
(lower is better · real photos misclassified as AI ¹)

Manipulation Detection: ImageWhisperer 81% vs. industry avg. 52%
(splicing, compositing, face-swap ²)

¹ Averaged from peer-reviewed benchmarks: Dogoulis et al. (2023) on 16 detectors across 2.6M images; Corvi et al. (2023) cross-generator evaluation. Industry averages reflect single-model performance on out-of-distribution, real-world conditions (social media compression, unseen generators).
² Based on Guillaro et al. (2023) manipulation detection survey; Wu et al. (2022) IML benchmark. Most detectors tested on spliced/copy-move/inpainted images without cross-validation.
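The two headline metrics above can be made concrete with a short sketch. This is an illustrative computation, not ImageWhisperer's evaluation code: `y_true` marks images that truly are AI-generated, `y_pred` marks images the detector flagged, and both function names are hypothetical.

```python
# Illustrative sketch: how accuracy and false positive rate are
# computed from per-image labels. 1 = AI-generated / flagged as AI,
# 0 = real photo / passed as real.

def accuracy(y_true, y_pred):
    """Fraction of images classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def false_positive_rate(y_true, y_pred):
    """Fraction of real photos (y_true == 0) wrongly flagged as AI."""
    real = [(t, p) for t, p in zip(y_true, y_pred) if t == 0]
    return sum(p for _, p in real) / len(real)

# 10 images: 6 AI-generated, 4 real photos
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 1, 0, 0, 0]  # one miss, one false alarm

print(accuracy(y_true, y_pred))             # 0.8
print(false_positive_rate(y_true, y_pred))  # 0.25
```

Note that the two metrics move independently: a detector can score high accuracy on a mostly-AI test set while still flagging an unacceptable share of real photos, which is why both are reported separately above.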

3. Per-Model Benchmarks

B-Free

Primary AI-generation detector. Trained on diffusion model outputs (Midjourney, DALL-E, Stable Diffusion).

AI Detection Accuracy: 93% · False Positive Rate: 4% · Inference Time: 0.3s

Strong on: Midjourney, DALL-E 3, Stable Diffusion · Weaker on: Flux (use Flux Probe), illustrations/artwork

SPAI (Splicing & AI Detection)

Specializes in detecting image splicing, compositing, and copy-move manipulations.

Manipulation Detection: 87% · False Positive Rate: 10% · Inference Time: 0.4s

Strong on: spliced composites, background replacement · Weaker on: heavily compressed JPEGs, professional retouching

TruFor

Forgery localization model. Produces heatmaps showing manipulated regions.

Localization Accuracy: 84% · False Positive Rate: 7% · Inference Time: 0.5s

Strong on: region manipulation, inpainting · Weaker on: uniform AI-generated images
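A localization model like TruFor outputs a per-pixel heatmap rather than a single score. The sketch below shows one generic way such a heatmap can be reduced to a flagged region; the thresholding logic and `flag_region` name are illustrative, not TruFor's actual post-processing.

```python
# Illustrative sketch: reducing a forgery-localization heatmap
# (per-pixel manipulation scores in [0, 1]) to a bounding box around
# the suspicious region. The model produces the heatmap; this
# thresholding step is a generic post-processing example.

def flag_region(heatmap, threshold=0.5):
    """Return (row0, col0, row1, col1) bounding the pixels whose score
    meets the threshold, or None if no pixel is suspicious."""
    hot = [(r, c) for r, row in enumerate(heatmap)
                  for c, v in enumerate(row) if v >= threshold]
    if not hot:
        return None
    rows = [r for r, _ in hot]
    cols = [c for _, c in hot]
    return (min(rows), min(cols), max(rows), max(cols))

heatmap = [
    [0.1, 0.1, 0.2, 0.1],
    [0.1, 0.8, 0.9, 0.1],
    [0.1, 0.7, 0.6, 0.1],
    [0.1, 0.1, 0.1, 0.1],
]
print(flag_region(heatmap))  # (1, 1, 2, 2)
```

This also illustrates why uniform AI-generated images are a weak spot for localization models: when the entire image is synthetic, there is no boundary between authentic and manipulated regions for the heatmap to isolate.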

IML-ViT

Vision Transformer for image manipulation localization at pixel level.

Localization Accuracy: 82% · False Positive Rate: 6% · Inference Time: 0.6s

Flux Probe

Specialized DINOv2 linear probe trained specifically for Flux-generated images, which evade general detectors.

Flux Detection: 93% · False Positive Rate: <5% · Inference Time: 0.2s

Strong on: Flux · Weaker on: non-Flux AI generators
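A "linear probe" is conceptually simple: a single logistic layer trained on embeddings from a frozen backbone (DINOv2 here). The sketch below shows the scoring step only, in plain Python; the embedding extractor, the learned weights, and the 4-dimensional vectors are all illustrative stand-ins for the real pipeline's high-dimensional features.

```python
import math

# Illustrative sketch of a linear probe: one logistic layer over frozen
# backbone embeddings. Real DINOv2 embeddings have hundreds of
# dimensions; the 4-dim vectors and weights below are stand-ins.

def probe_score(embedding, weights, bias):
    """Probability that the image is Flux-generated."""
    z = sum(w * x for w, x in zip(weights, embedding)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid

weights = [0.9, -0.4, 1.2, 0.3]   # learned offline; illustrative values
bias = -0.5

flux_like = [1.0, 0.2, 1.1, 0.5]  # embedding of a Flux-style image
real_like = [0.1, 0.9, 0.0, 0.2]  # embedding of a real photo

print(probe_score(flux_like, weights, bias) > 0.5)  # True
print(probe_score(real_like, weights, bias) > 0.5)  # False
```

Because the backbone stays frozen and only the small linear head is trained, a probe like this is cheap to retrain when a generator such as Flux updates, which is one reason specialized probes can track fast-moving generators better than monolithic detectors.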

External AI Detection API

Third-party AI detection service used as an authority signal for cross-validation.

AI Detection: 90% · False Positive Rate: 4% · Inference Time: 0.8s

Additional Models

Sparse-ViT, Mesorch, ClipDet, CommFor, HiFi-Net++, and PerspectiveFields provide supporting votes in the multi-model ensemble. Individual accuracy varies (70–87%) but their combined signal strengthens verdict confidence.

4. Performance by Image Category

ImageWhisperer vs. research-reported industry averages per category.

Category                  ImageWhisperer   Industry Avg.   Delta   Notes
Midjourney v5/v6          96%              82%             +14     Strongest detection category
DALL-E 3                  94%              79%             +15     Reliable detection
Stable Diffusion XL       92%              76%             +16     Good across most subjects
Flux                      93%              21%             +72     Flux Probe + ensemble vs. general detectors
Face swaps / deepfakes    85%              62%             +23     B-Free + SPAI + HiFi-Net++ cross-validated
Spliced composites        80%              48%             +32     SPAI + TruFor + IML-ViT + PerspectiveFields
Background replacement    77%              40%             +37     Hardest manipulation category
Screenshots               N/A              N/A             N/A     Flagged as "Further Research Needed"
Illustrations / artwork   Limited          High FP         N/A     Guards suppress false AI flags on artwork

Industry averages sourced from Dogoulis et al. (2023), Corvi et al. (2023), and Guillaro et al. (2023). Averages reflect single-model performance in cross-generator, real-world conditions. Flux average from Feb 2026 benchmark of 16 detection methods across 2.6M images.

5. How We Compare

Why the gap? Most detectors are a single model returning a single score. ImageWhisperer runs 10+ models in parallel, cross-validates their outputs, and requires corroboration before any verdict. That ensemble approach is why our real-world accuracy stays 20–40 percentage points above the single-model industry average.
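The corroboration rule described above can be sketched in a few lines. This is an illustrative model of the idea, not ImageWhisperer's actual code: the thresholds, model names, and `verdict` function are all hypothetical.

```python
# Illustrative sketch of a corroboration rule: a verdict is only
# issued when at least two independent models agree, so no single
# model can decide the outcome alone. Thresholds and model names
# are made up for the example.

def verdict(scores, flag_at=0.8, min_agreeing=2):
    """scores: {model_name: probability the image is AI/manipulated}."""
    flagged = [m for m, s in scores.items() if s >= flag_at]
    if len(flagged) >= min_agreeing:
        return "likely-ai", flagged
    return "no-verdict", flagged

# One model fires alone: no verdict (this is what keeps false
# positives down when a single detector misfires).
print(verdict({"b-free": 0.95, "spai": 0.30, "trufor": 0.40}))
# Two models corroborate: verdict issued.
print(verdict({"b-free": 0.95, "spai": 0.88, "trufor": 0.40}))
```

The trade-off is deliberate: requiring agreement sacrifices a little single-model sensitivity in exchange for a much lower false positive rate, since uncorrelated misfires rarely coincide across independent models.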

The Flux gap

Flux-generated images are the hardest for the industry to detect. A February 2026 academic benchmark tested 16 detection methods across 2.6 million images and found an average accuracy of just 21% on Flux Dev. ImageWhisperer's dedicated Flux Probe, a purpose-built DINOv2 linear probe trained specifically on Flux outputs, achieves 93% on this category: a +72 percentage point advantage.

Lab numbers vs. real-world performance

Many tools report 95–99% accuracy in controlled settings, but independent studies consistently show steep drops in real-world conditions. Platform re-encoding (Instagram, WhatsApp, Twitter compression), screenshots, and generators not in the training data all degrade performance. Our numbers are based on 7,700+ tracked real-world user uploads, not curated test sets — they reflect what you'll actually experience.

False positive rates matter

A detector that flags 18% of real photos as AI-generated (the industry average) creates alert fatigue and erodes editorial trust. ImageWhisperer's corroboration requirement, under which no single model can set the verdict alone, keeps our false positive rate at 5%, roughly a quarter of the single-model industry average.

What sets ImageWhisperer apart. We combine forensic AI detection with investigative tools — fact-checking, reverse image search, EXIF analysis, location verification, and full narrative explanations — in a single analysis. Most detection tools return a score; we explain why.

6. Known Limitations

We believe transparency about limitations builds more trust than inflated accuracy claims.

7. How We Improve

Questions about our methodology? Found an image we got wrong? Let us know — every report makes the system better.

8. Related Documents