A Comprehensive and Open Benchmark for Detecting AI-Generated Images

Overview

Open-source computer vision currently faces a critical shortage of datasets and evaluation frameworks designed to benchmark systems that distinguish between real and AI-generated images. Previous studies have predominantly targeted content-specific subsets of this problem, such as human faces in images and videos (e.g., DeepfakeBench).

Although these efforts have proven valuable for testing new model architectures under limited conditions, they do not adequately address the broad spectrum of image types encountered in everyday scenarios. This report covers our efforts to fill this gap by developing Deepfake Detection Arena (DFD-Arena), a comprehensive and adaptable benchmark suitable for the diverse and complex nature of in-the-wild images.

Definitions

In this report, we use the term “synthetic images” to describe images produced by generative AI models, e.g., generative adversarial networks (GANs), vision transformers (ViTs), diffusion models, and vision-language models (VLMs).

Additionally, we define a synthetic image generated to match the semantic-level characteristics of a “real image” (a non-AI-generated image) as the real image’s “synthetic mirror.”

Real Image

baby_real.jpg

A real image from FFHQ, a dataset of human faces created by NVIDIA Research as a benchmark for GANs.

Using the BLIP-2 OPT-6.7B VLM (fine-tuned on COCO) together with the Meta-Llama-3.1-8B-Instruct-bnb-4bit (Unsloth) LLM, we generate the caption:

“A baby lies on a blue blanket in a sunny setting, surrounded by a blue background, in a portrait view.”

Synthetic Mirror

baby_fake_flux.jpg

Synthetic image output from our Synthetic Image Generation Pipeline, generated by FLUX.1-dev using the generation arguments:

"guidance_scale": 2, "num_inference_steps": 100, "height": 512, "width": 512

In the subsequent sections, the term “model” refers to any algorithm that processes an image to determine its classification, specifically to identify if it is real or synthetic. This includes not only pretrained machine learning architectures, but also heuristic and statistical modeling frameworks.

Furthermore, we define a “detector” as an algorithm that either employs a single model or orchestrates multiple models to perform the binary inference. The distinction between “model” and “detector” helps clarify the roles within multi-agent systems and mixture-of-expert frameworks, particularly those utilizing deep learning classifiers.
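
To make the distinction concrete, the following is a minimal sketch of the two abstractions; the interfaces and names are illustrative only, not DFD-Arena’s actual API:

```python
from typing import Protocol
from PIL import Image


class Model(Protocol):
    """Any algorithm that scores an image, where a higher score
    indicates the image is more likely to be synthetic."""

    def score(self, image: Image.Image) -> float: ...


class Detector:
    """A detector orchestrating one or more models to perform
    the binary real-vs-synthetic inference."""

    def __init__(self, models: list[Model], threshold: float = 0.5):
        self.models = models
        self.threshold = threshold

    def predict(self, image: Image.Image) -> bool:
        """Return True if the image is classified as synthetic."""
        # Simple mean-pooling of model scores; a mixture-of-experts
        # detector could instead weight or route between its models.
        mean_score = sum(m.score(image) for m in self.models) / len(self.models)
        return mean_score >= self.threshold
```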

Data

The BitMind team has curated and generated a large collection of real and synthetic image datasets on HuggingFace with significant topic diversity. These are sourced from a foundation of datasets published by leading machine vision research teams, including NVIDIA, Microsoft, and Google, that are well regarded in the image segmentation, recognition, and classification spaces.

We have further extended these collections with synthetic mirror datasets generated in-house. These novel datasets are characterized by the semantic balance they share with their real counterparts.

BitMind’s real and deepfake image datasets continue to grow, with new raw and preprocessed general and expert training datasets added regularly to increase image diversity.
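
As a sketch of how such a pair of datasets can be pulled with the HuggingFace datasets library; note that the repository ids below are illustrative placeholders, not the exact BitMind dataset names:

```python
from datasets import load_dataset

# Repository ids are illustrative placeholders; browse the BitMind
# organization on HuggingFace for the actual dataset names.
real = load_dataset("bitmind/real-images-example", split="train")
mirror = load_dataset("bitmind/synthetic-mirror-example", split="train")

# Each real image is semantically paired with its synthetic mirror,
# giving a balanced corpus for training and evaluating detectors.
print(len(real), len(mirror))
```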

Open-Source Image Datasets

ImageNet

A large-scale dataset containing over 14 million annotated images of real-world scenes and objects.

MS-COCO

328,000 images spanning a wide variety of real-world scenes.

CelebA-HQ

30,000 high-resolution images of celebrity faces with varying poses and backgrounds. CelebA-HQ was introduced by NVIDIA at ICLR 2018.

FFHQ

70,000 high-quality 1024×1024 PNG images of human faces, with varying age, ethnicity, and image backgrounds. FFHQ was created by NVIDIA Research for benchmarking Generative Adversarial Networks (GANs).

Flickr-30k

31,783 images from Flickr, depicting a variety of real-world scenes.