

Dren Fazlija1, Monty-Maximilian Zühlke1, Johanna Schrader1,4,
Arkadij Orlov2, Clara Stein1, Iyiola E. Olatunji3, Daniel Kudenko1
1L3S Research Center
2E.ON Grid Solutions
3University of Luxembourg
4CAIMed – Lower Saxony Center for Artificial Intelligence and Causal Methods in Medicine

Abstract

Unrestricted adversarial attacks aim to fool computer vision models without being constrained by ℓₚ-norm bounds to remain imperceptible to humans, for example, by changing an object's color. This allows attackers to circumvent traditional, norm-bounded defense strategies such as adversarial training or certified defense strategies. However, due to their unrestricted nature, there are also no guarantees of norm-based imperceptibility, necessitating human evaluations to verify just how authentic these adversarial examples look. While some related work assesses this vital quality of adversarial attacks, none provide statistically significant insights. This issue necessitates a unified framework that supports and streamlines such an assessment for evaluating and comparing unrestricted attacks. To close this gap, we introduce SCOOTER – an open-source, statistically powered framework for evaluating unrestricted adversarial examples. Our contributions are: (i) best-practice guidelines for crowd-study power, compensation, and Likert equivalence bounds to measure imperceptibility; (ii) the first large-scale human vs. model comparison across 346 human participants showing that three color-space attacks and three diffusion-based attacks fail to produce imperceptible images. Furthermore, we found that GPT-4o can serve as a preliminary test for imperceptibility, but it only consistently detects adversarial examples for four out of six tested attacks; (iii) open-source software tools, including a browser-based task template to collect annotations and analysis scripts in Python and R; (iv) an ImageNet-derived benchmark dataset containing 3K real images, 7K adversarial examples, and over 34K human ratings. Our findings demonstrate that automated vision systems do not align with human perception, reinforcing the need for a ground-truth SCOOTER benchmark.
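The Likert equivalence bounds mentioned in the abstract point to an equivalence-testing workflow: rather than failing to reject a difference, one actively tests whether adversarial and real images receive statistically equivalent realness ratings. Below is a minimal sketch of a two one-sided tests (TOST) procedure in Python; the ±0.5-point bound, function name, and data are illustrative assumptions, not SCOOTER's actual settings.

```python
import numpy as np
from scipy import stats

def tost_equivalence(real, adv, bound=0.5, alpha=0.05):
    """Two one-sided tests (TOST): are mean Likert ratings of adversarial
    images equivalent to those of real images within +/- `bound` points?
    The bound of 0.5 rating points is an illustrative choice."""
    real, adv = np.asarray(real, float), np.asarray(adv, float)
    # H0a: mean(adv) - mean(real) <= -bound  (adv rated clearly less real)
    _, p_lower = stats.ttest_ind(adv + bound, real, alternative="greater")
    # H0b: mean(adv) - mean(real) >= +bound  (adv rated clearly more real)
    _, p_upper = stats.ttest_ind(adv - bound, real, alternative="less")
    p = max(p_lower, p_upper)  # both one-sided tests must reject
    return p, p < alpha

# Illustrative data: identical rating distributions -> equivalent
real_ratings = np.tile([3, 4, 5], 100)
p, equivalent = tost_equivalence(real_ratings, real_ratings.copy())
print(equivalent)
```

Claiming equivalence only when both one-sided tests reject is what keeps TOST conservative: a large p-value on either side means the data cannot rule out a practically meaningful rating gap.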

Motivation of this Project

[Figure: Example adversarial images. Color-based attacks: SemanticAdv, cAdv, NCF (shown alongside the original). Diffusion-based attacks: DiffAttack, AdvPP, ACA.]
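For contrast with the unrestricted attacks above, conventional adversarial examples must stay inside an ℓₚ budget, which is exactly what norm-bounded defenses exploit. A minimal sketch of checking an ℓ∞ budget with NumPy (the function name and ε = 8/255 are illustrative assumptions):

```python
import numpy as np

def within_linf_budget(x, x_adv, eps=8 / 255):
    """Return True if the perturbed image x_adv stays within an l_inf
    budget of eps around the clean image x (pixel values in [0, 1])."""
    return float(np.max(np.abs(x_adv - x))) <= eps

x = np.zeros((3, 4, 4))            # toy clean image
print(within_linf_budget(x, x + 4 / 255))  # True: tiny perturbation
print(within_linf_budget(x, x + 0.3))      # False: a global color shift
```

An unrestricted attack such as a semantic color change typically fails this check by a wide margin while still looking natural to humans, which is why norm-based imperceptibility guarantees do not apply.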

Meet SCOOTER: Systemizing Confusion Over Observations To Evaluate Realness

Key experimental findings

- Across 346 human participants, none of the six tested attacks (three color-based, three diffusion-based) produced truly imperceptible adversarial images.
- GPT-4o can serve as a preliminary imperceptibility screen, but it only consistently detects adversarial examples for four of the six tested attacks.
- Automated vision systems do not align with human perception of realness.

What’s inside the framework

- Best-practice guidelines for crowd-study power, participant compensation, and Likert equivalence bounds.
- A browser-based task template for collecting annotations, plus analysis scripts in Python and R.
- An ImageNet-derived benchmark dataset with 3K real images, 7K adversarial examples, and over 34K human ratings.

Take-home message

Unrestricted adversarial attacks are not automatically imperceptible, and automated detectors are no substitute for human judgment: statistically powered human studies, which SCOOTER standardizes, remain the ground truth for evaluating them.

Citation

@misc{fazlija2025scooterhumanevaluationframework,
      title={SCOOTER: A Human Evaluation Framework for Unrestricted Adversarial Examples}, 
      author={Dren Fazlija and Monty-Maximilian Zühlke and Johanna Schrader and Arkadij Orlov and Clara Stein and Iyiola E. Olatunji and Daniel Kudenko},
      year={2025},
      eprint={2507.07776},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.07776}, 
}