In-the-Wild Camouflage Attack on Vehicle Detectors through Controllable Image Editing

¹Carnegie Mellon University, ²DEVCOM Army Research Laboratory, ³Florida State University

How can we generate realistic adversarial camouflages for real-world, in-the-wild vehicles without relying on simulation environments?

[Figure: teaser]

Given a real image, our method stylizes the target vehicle based on either its immediate surroundings (image-level) or a visual concept present in the overall scene (scene-level).


Contributions


We formulate camouflage attacks against detectors as a conditional image-editing problem.

We propose two camouflage strategies inspired by nature. The image-level strategy blends the vehicle with its surroundings, and the scene-level strategy adapts the vehicle to a common concept present in the scene.

We propose a novel pipeline based on ControlNet fine-tuning. Our method jointly enforces structural fidelity to maintain vehicle geometry, style consistency to produce stealthy camouflage, and an adversarial objective to reduce detectability by object detectors.

We evaluate our approach on the COCO (ground-view) and LINZ (aerial-view) datasets. Results demonstrate:

  • Stronger adversarial effectiveness, better preservation of vehicle structure, and improved stealthiness compared to state-of-the-art methods;
  • Transferability to black-box detectors, diverse environmental conditions, and the physical world;
  • Robustness under preprocessing defense strategies such as denoising and smoothing.

Pipeline

[Figure: pipeline overview, panels (a)-(d)]

As shown in (a) and (b), the pipeline consists of a No-Box Attack stage and a White-Box Attack stage. In (a), the ControlNet is fine-tuned to stylize vehicles using a reference region while preserving geometry and background through structure, style, and background supervision ($L_{\text{struct}}$, $L_s$, $L_b$). (b) further optimizes the model against a detector $M_{\text{det}}$ by incorporating an additional adversarial loss $L_{\text{adv}}$ and a color-consistency loss $L_c$. (c) summarizes the conditions provided to ControlNet under the image-level and scene-level settings, and (d) illustrates the style loss $L_s$, which aligns vehicle latent features with the reference area.
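The two-stage objective described above can be sketched as a weighted sum of loss terms. This is only an illustration under stated assumptions: the names `style_stat_loss`, `total_loss`, and `STAGE_WEIGHTS` are hypothetical, all weights are placeholders (the page gives no values), and matching the mean and variance of latent features is just one common way to express a style loss, not necessarily the form used in the paper.

```python
from statistics import fmean, pvariance


def style_stat_loss(vehicle_feats, reference_feats):
    """Hypothetical style term: squared gap between the mean and the
    variance of two flat lists of latent feature values."""
    mean_gap = fmean(vehicle_feats) - fmean(reference_feats)
    var_gap = pvariance(vehicle_feats) - pvariance(reference_feats)
    return mean_gap ** 2 + var_gap ** 2


# Placeholder weights (assumption): the source does not state any values.
STAGE_WEIGHTS = {
    # Stage (a): structure, style, and background supervision only.
    "no_box": {"struct": 1.0, "style": 1.0, "background": 1.0},
    # Stage (b): adds the adversarial and color-consistency terms.
    "white_box": {"struct": 1.0, "style": 1.0, "background": 1.0,
                  "adv": 1.0, "color": 1.0},
}


def total_loss(stage, terms):
    """Weighted sum of the scalar loss terms active in the given stage."""
    weights = STAGE_WEIGHTS[stage]
    missing = set(weights) - set(terms)
    if missing:
        raise KeyError(f"missing loss terms for {stage}: {sorted(missing)}")
    return sum(w * terms[name] for name, w in weights.items())
```

In this sketch the white-box stage reuses the three no-box terms and adds two more, mirroring the description of (b) as a further optimization of the stage (a) model.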

Featured Experiments

Qualitative Results


COCO Image-Level


COCO Scene-Level


LINZ Image-Level


LINZ Scene-Level


Cross-Location Transferability

For each scene, we composite the vehicle into five additional backgrounds while keeping the vehicle appearance unchanged. The new scenes introduce diverse environmental conditions, including different weather (e.g., cloudy, rainy, and foggy) and seasonal variations (e.g., summer and winter). Faster R-CNN achieves an AP50 of 99.5% on the corresponding clean images, whereas the camouflaged versions reduce it to 38.2%, a drop of 61.3 percentage points.
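As a rough illustration of this evaluation, the sketch below pastes a fixed vehicle into a new background with a per-pixel mask and then computes the AP50 drop from the two reported numbers. The mask-based paste is an assumption about how the compositing could be done, and the helper names `composite` and `ap_drop` are hypothetical; images are small nested lists of grayscale values to keep the example self-contained.

```python
def composite(vehicle, background, mask):
    """Paste vehicle pixels onto a background wherever mask is 1.

    All three arguments are equally sized 2D lists; mask values lie in
    [0, 1], so the vehicle appearance is copied unchanged where mask == 1.
    """
    return [
        [m * v + (1 - m) * b for v, b, m in zip(v_row, b_row, m_row)]
        for v_row, b_row, m_row in zip(vehicle, background, mask)
    ]


def ap_drop(clean_ap, attacked_ap):
    """Return (absolute drop in percentage points, relative drop in %)."""
    absolute = clean_ap - attacked_ap
    relative = 100.0 * absolute / clean_ap
    return round(absolute, 1), round(relative, 1)


# AP50 values reported above for Faster R-CNN.
print(ap_drop(99.5, 38.2))  # (61.3, 61.6)
```

The absolute figure matches the 61.3 reported above; the relative figure (about 61.6% of the clean AP50) is derived here and is not stated in the source.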


Original scenes


Synthetic scenes


Physical-World Transferability

We employ an Epson EpiqVision™ Mini EF12 projector to project real images and use an iPhone 16 Pro Max to capture the resulting scenes. The phone is positioned close to both the projector’s optical axis and the intended camera center to minimize parallax and perspective distortion, ensuring the captured photos faithfully reflect the viewpoint of a real detector.

For the LINZ dataset, real-world images are projected onto a whiteboard, and a 3D-printed sedan model is placed at the corresponding vehicle location. For the COCO dataset, we reconstruct the scenes using Depth Anything 3 and 3D-print them. The images are then projected directly onto the 3D-printed scenes.


LINZ setup


COCO setup

For both datasets, each example is shown in five columns: clean digital images, photos of the projected clean scenes, style reference areas, composite images obtained by replacing the original vehicles with their camouflaged versions, and photos of the projected camouflaged scenes. Across both LINZ and COCO examples, detectors assign high confidence to the clean cases but substantially lower confidence after camouflage is applied, indicating that the adversarial appearance learned in the digital domain transfers effectively to physical-world settings.


LINZ Examples


COCO Examples
