TL;DR: We created a holdout dataset to benchmark Subnet 34's overall performance on "in-the-wild" inputs. The subnet achieved 88% overall accuracy, which improved to 91% in initial subnet output aggregation experiments (see Application of Benchmarking Results for Subnet Improvement).
This report presents a benchmark analysis of the BitMind Subnet, a decentralized framework for identifying AI-generated media through binary predictions (classifying images as "real" or "fake"). The primary objective is to evaluate the subnet's performance on a holdout dataset representative of "in-the-wild" image inputs on the web. By testing the Subnet API's predictions on this unseen dataset, we can assess the model's generalization capabilities and its reliability as a tool for identifying AI-generated media in live environments. This augments our previous benchmarks for both on-chain performance and widely used computer vision research datasets (DFD Arena).
Below are the preliminary results of the subnet's predictions on a dataset composed of real and AI-generated (synthetic) images. We sourced the holdout images from search engine, social media, and image hosting services, as well as from synthetic "mirrors" of the real sourced images:
| Image Label | Source | Sample Size | Accuracy |
|---|---|---|---|
| Real | Google Images | 5000 | 86.04% |
| Fake | Flickr "Artificial-Intelligence" Tagged Images | 1000 | 83.33% |
| Fake | Reddit "r/aiArt" | 1000 | 83.40% |
| Fake | Mobius Mirrors of Google Images | 1000 | 96.60% |
| Fake | SDXL Mirrors of Google Images | 1000 | 94.68% |
| Fake | Flux Mirrors of Google Images | 1000 | 92.07% |
| Fake | RealVisXL v4.0 Mirrors of Google Images | 1000 | 87.46% |
| Combined | All of the above datasets | 11000 | 87.98% |
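As a quick consistency check, the combined figure is the sample-size-weighted average of the per-source accuracies. The short sketch below reproduces the 87.98% number; the dictionary keys are illustrative shorthand, not official dataset names:

```python
# Consistency check: the combined accuracy equals the
# sample-size-weighted average of the per-source accuracies.
sources = {
    "google_real":       (5000, 0.8604),
    "flickr_ai_tagged":  (1000, 0.8333),
    "reddit_r_aiart":    (1000, 0.8340),
    "mobius_mirrors":    (1000, 0.9660),
    "sdxl_mirrors":      (1000, 0.9468),
    "flux_mirrors":      (1000, 0.9207),
    "realvisxl_mirrors": (1000, 0.8746),
}

total = sum(n for n, _ in sources.values())
combined = sum(n * acc for n, acc in sources.values()) / total
print(f"{combined:.2%}")  # -> 87.98%
```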
The benchmarking results demonstrate the BitMind Subnet's robust generalization across a variety of holdout datasets, with accuracy above 83% on every synthetic source and above 86% on genuine images. This breadth underscores the subnet's adaptability to diverse and sophisticated forgery methods.
Moreover, the subnet performs particularly well on images generated by the models used in our validator challenges. It excels at identifying synthetic images created through Mobius and SDXL mirroring, with accuracies exceeding 94%. These results affirm the subnet's capability to tailor its detection mechanisms to the specific generative models defined for validators within our system.
The real-image portion of the benchmark is a private holdout set sourced by using LLM-generated search terms to scrape Google Images, with filters in place to avoid contamination from AI-generated content (e.g., a date filter restricting results to images predating 2018). In the published dataset, the api_prediction column contains the output label produced by the BitMind Subnet API for each image.
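For reference, per-source and combined accuracy can be recomputed from such a dataset in a few lines of pandas. Only the api_prediction column is named in the report; the file name and the label and source columns below are assumptions for illustration:

```python
import pandas as pd

# Hypothetical file name; the actual holdout set is private.
df = pd.read_parquet("bitmind_holdout_benchmark.parquet")

# 'api_prediction' is the subnet API's output label (1 = fake, 0 = real,
# assumed encoding); 'label' and 'source' columns are assumed for illustration.
df["correct"] = df["api_prediction"] == df["label"]
print(df.groupby("source")["correct"].mean())  # per-source accuracy
print(df["correct"].mean())                    # combined accuracy
```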
The benchmarking process involved creating a comprehensive holdout dataset comprising both real and fake (AI-generated) images, intended to simulate the broad range of "in-the-wild" conditions our subnet API may encounter, e.g., through BitMind user applications. The process breaks down into several stages:
Data Collection
Real Image Data:
We developed a custom search engine query generator to source diverse real image content from Google Images. This generator uses a state-of-the-art large language model (LLM), unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit on HuggingFace, to produce unique search terms across 24 categories. These categories span tangible subjects (e.g., “country,” “animal”) and more abstract concepts (e.g., “culture,” “time”), ensuring a broad representation of real-world imagery.
“[INST] Generate a random example of the provided topic, or a phrase related to the topic. Do not reply with anything else.[/INST]”
To avoid contamination from AI-generated images, we applied a strict date filter, selecting only candidate images created before 2018, predating consumer AI image generation tools such as Midjourney (released July 12, 2022).
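As an illustration of how such a date filter can be applied at query time, the sketch below builds a Google Images URL using the tbs custom-date-range parameter; the report only states that a pre-2018 filter was used, so the exact mechanism shown here is an assumption:

```python
from urllib.parse import quote_plus

def google_images_url(query: str, max_date: str = "12/31/2017") -> str:
    # tbm=isch selects image search; tbs=cdr:1,cd_max:... sets a
    # custom date-range upper bound (assumed mechanism for the filter).
    return (
        "https://www.google.com/search?"
        f"q={quote_plus(query)}&tbm=isch&tbs=cdr:1,cd_max:{max_date}"
    )

print(google_images_url("snow leopard"))
```

The query-generation script itself is sketched below: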
```python
import json

import torch
from transformers import pipeline

...

def generate_queries():
    subjects = [
        'animal',
        'country',
        'music',
        ...
    ]

    # Build the generation pipeline once, rather than once per subject.
    pipe = pipeline(
        "text-generation",
        model="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
        return_full_text=False,
        max_new_tokens=5,
        do_sample=True,   # sampling must be enabled for temperature to apply
        temperature=2.0,  # high temperature for more diverse outputs
    )

    queries = []
    for subject in subjects:
        messages = [
            {
                "role": "system",
                "content": "[INST] Generate a random example of the provided topic, or a phrase related to the topic. Do not reply with anything else.[/INST]"
            },
            {
                "role": "user",
                "content": subject
            }
        ]
        queries.append(pipe(messages)[0]['generated_text'])
    return queries


if __name__ == "__main__":
    queries = generate_queries()
    # Output as JSON for the downstream scraping pipeline
    print(json.dumps(queries))
```
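The generated queries then drive an end-to-end Node.js pipeline that scrapes and downloads images, forwards each one to the BitMind Subnet API, and records the predictions. Helper functions such as generateQueries, downloadImages, processImage, uploadToHuggingFace, and cleanDownloads, along with the downloadsPath, hfApiToken, and num_images_per_query values, are defined elsewhere in the benchmark codebase: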
```javascript
const fs = require('fs/promises');
const path = require('path');

async function mainTask() {
    // 1. Generate and sanitize search queries
    let searchQueries = await generateQueries();
    searchQueries = searchQueries.map(str => str.replace(/[\\/:*?"<>|]/g, " "));

    // 2. Web-scrape image URLs using the search terms, and download the images
    const results = await downloadImages(searchQueries, num_images_per_query);

    const apiPredictions = new Map();
    for (const [searchTerm, imageObjects] of Object.entries(results)) {
        for (const [index, imageObject] of imageObjects.entries()) {
            // 3. Load the image and encode it as a base64 data URI
            const imageBuffer = await fs.readFile(imageObject.file);
            const imageBase64 = `data:image/jpeg;base64,${imageBuffer.toString('base64')}`;

            // 4. Forward to the BitMind Subnet API and collect the result
            const apiResult = await processImage(imageBase64);
            const absolutePath = path.resolve(imageObject.file);
            const predictionValue = apiResult.isAI ? 1 : 0;
            apiPredictions.set(absolutePath, predictionValue);
        }
    }

    // 5. Construct formatted rows and append them to the benchmark dataset
    await uploadToHuggingFace(downloadsPath, apiPredictions, hfApiToken, results, true);

    // 6. Clean the downloads directory to prepare for the next run
    await cleanDownloads();
    console.log('\nProcessing complete.');
}
```
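Note that each image is sent to the API as a base64-encoded JPEG data URI, and predictions are keyed by absolute file path so they can later be joined against the uploaded dataset rows, producing the api_prediction column described above.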
Fake Image Data:
AI-generated images were sourced through multiple avenues: