Adapting Vehicle Detectors for Aerial Imagery to Unseen Domains with Weak Supervision

Xiao Fang¹, Minhyek Jeon¹, Zheyang Qin¹, Stanislav Panev¹, Celso de Melo²,
Shuowen Hu², Shayok Chakraborty¹,³, Fernando De la Torre¹
¹Carnegie Mellon University,  ²Army Research Laboratory,  ³Florida State University

ICCV 2025


How can we use synthetic data to improve cross-domain object detection performance in aerial imagery?

[Teaser figure]

We harness the power of Stable Diffusion to generate synthetic images and automatically annotate them using cross-attention maps.

Our method outperforms source-only supervised training, weakly supervised adaptation methods, unsupervised domain adaptation methods, and open-set object detectors by 4-23, 6-10, 7-40, and more than 50 AP50 points, respectively.

We introduce two large-scale datasets, LINZ (Selwyn, New Zealand) and UGRC (Utah, USA), to advance vehicle detection in aerial imagery.



Method Overview

[Pipeline overview figure]

We propose a pipeline that generates high-quality aerial images along with their labels to address the domain shift challenge in aerial image analysis. The first stage finetunes Stable Diffusion to synthesize images for both the source and target domains. The second stage then automatically labels the synthetic target images by leveraging cross-attention maps. The pipeline's effectiveness is validated by a significant boost in detector performance: the resulting detectors surpass both baseline detectors trained on source images and open-set detectors run directly on the target domain. Please see our paper for more implementation details.
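To make the second stage concrete, below is a minimal, hypothetical sketch of how per-token cross-attention maps can be captured from a Stable Diffusion pipeline in diffusers. This is not the released implementation: the processor class and prompt are illustrative, edge cases (group norm, residual connections) are omitted, and the actual pipeline uses finetuned weights with learned tokens such as [V₁].

```python
import torch
from diffusers import StableDiffusionPipeline

class StoreCrossAttn:
    """Records cross-attention probabilities from each UNet attention layer.

    Re-implements diffusers' basic AttnProcessor forward pass; whenever
    encoder_hidden_states (the text embedding) is present, the
    text-token-to-pixel attention map is stored for later pseudo-labeling.
    """
    def __init__(self, store):
        self.store = store

    def __call__(self, attn, hidden_states, encoder_hidden_states=None,
                 attention_mask=None, **kwargs):
        is_cross = encoder_hidden_states is not None
        ctx = encoder_hidden_states if is_cross else hidden_states
        q = attn.head_to_batch_dim(attn.to_q(hidden_states))
        k = attn.head_to_batch_dim(attn.to_k(ctx))
        v = attn.head_to_batch_dim(attn.to_v(ctx))
        probs = attn.get_attention_scores(q, k, attention_mask)
        if is_cross:
            # shape: (batch * heads, spatial positions, text tokens)
            self.store.append(probs.detach().cpu())
        out = attn.batch_to_head_dim(torch.bmm(probs, v))
        out = attn.to_out[0](out)   # output projection
        return attn.to_out[1](out)  # dropout

maps = []
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
pipe.unet.set_attn_processor(StoreCrossAttn(maps))
image = pipe("an aerial view of cars in a parking lot").images[0]
# Each stored map can be reshaped to (heads, H, W, tokens); the column for a
# chosen token (e.g. "car"), averaged over heads, gives a spatial heatmap.
```

In practice, maps from multiple layers and denoising steps would typically be aggregated before thresholding; see the paper for the exact procedure.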



Experimental Results

| Method | Backbone | LINZ→UGRC AP50 (%) | DOTA→UGRC AP50 (%) |
|---|---|---|---|
| **Supervised object detection (source-only)** | | | |
| Faster R-CNN | Faster R-CNN | 53.1 | 52.5 |
| YOLOv5 | YOLOv5 | 63.2 | 57.6 |
| YOLOv8 | YOLOv8 | 62.9 | 71.4 |
| ViTDet | ViTDet | 55.7 | 50.4 |
| **Open-set object detection** | | | |
| GLIP-T | Swin-T | 8.7 | 8.7 |
| OmDet-Turbo | Swin-T | 14.4 | 14.4 |
| OWL-V2 | CLIP ViT | 17.9 | 17.9 |
| **Unsupervised domain adaptation object detection** | | | |
| SIGMA | Faster R-CNN | 36.2 | 34.8 |
| TIA | Faster R-CNN | 39.0 | 42.4 |
| Adapt. Teacher | Faster R-CNN | 29.0 | 37.0 |
| CycleGAN-Turbo | YOLOv5 | 29.1 | 60.8 |
| SSDA-YOLO | YOLOv5 | 52.3 | 49.6 |
| **Cross-domain weakly supervised object detection** | | | |
| OCUD | Faster R-CNN | 63.1 | 65.3 |
| H2FA R-CNN | Faster R-CNN | 61.8 | 68.3 |
| Ours | Faster R-CNN | 69.3 | 75.5 |
| Ours | YOLOv5 | 68.8 | 68.5 |
| Ours | YOLOv8 | 75.4 | 67.1 |
| Ours | ViTDet | 72.0 | 67.1 |

Cross-domain object detection results (AP50, %) for LINZ→UGRC and DOTA→UGRC.
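For reference, AP50 counts a detection as a true positive when its IoU with an unmatched ground-truth box is at least 0.5, and averages precision over recall as the confidence threshold sweeps. Below is a compact single-class sketch of the metric; it is our illustration, not the paper's evaluation code, and the function names and data layout are assumptions.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def ap50(dets, gts):
    """AP at IoU 0.5 for one class.

    dets: list of (image_id, score, box); gts: dict image_id -> list of boxes.
    """
    dets = sorted(dets, key=lambda d: -d[1])  # descending confidence
    matched = {k: [False] * len(v) for k, v in gts.items()}
    tps, fps = [], []
    for img_id, _, box in dets:
        cands = gts.get(img_id, [])
        ious = [iou(box, g) for g in cands]
        j = int(np.argmax(ious)) if ious else -1
        if j >= 0 and ious[j] >= 0.5 and not matched[img_id][j]:
            matched[img_id][j] = True   # claim this ground-truth box
            tps.append(1); fps.append(0)
        else:
            tps.append(0); fps.append(1)
    n_gt = sum(len(v) for v in gts.values())
    tp, fp = np.cumsum(tps), np.cumsum(fps)
    recall = tp / max(n_gt, 1)
    precision = tp / np.maximum(tp + fp, 1)
    # monotone (interpolated) precision envelope, then area under PR curve
    prec_i = np.maximum.accumulate(precision[::-1])[::-1]
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recall, prec_i):
        ap += (r - prev_r) * p
        prev_r = r
    return float(ap)
```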

| Method | Vision Backbone | LINZ→UGRC Precision (%) | LINZ→UGRC Recall (%) |
|---|---|---|---|
| **Vision large language models** | | | |
| Gemini 1.5 Flash | – | 2.9 | 44.5 |
| Gemini 2.0 Flash-Lite | – | 6.6 | 26.3 |
| InternVL3-8B | InternViT | 4.7 | 22.0 |
| Qwen2.5-VL-7B | ViT | 0.4 | 4.8 |
| DeepSeek-VL2-Tiny | SigLIP-SO400M | 9.2 | 26.8 |
| LLaVA-NeXT | CLIP ViT | 5.5 | 4.7 |
| **Ours** | | | |
| Ours | Faster R-CNN | 63.8 | 68.2 |
| Ours | YOLOv5 | 67.2 | 67.3 |
| Ours | YOLOv8 | 70.0 | 76.3 |
| Ours | ViTDet | 72.0 | 67.1 |

Comparison between our method and vision large language models (VLLMs) on the UGRC dataset. We report precision and recall.

[Figure: cross-attention map visualization]

Cross-attention maps for different tokens. We analyze the effectiveness of utilizing multi-channel cross-attention maps by assessing the label quality on synthetic UGRC images. (a) Synthetic UGRC images. (b) Labels generated using only the cross-attention map of the word “car”. (c) Labels generated using multi-channel heatmaps. (d) Multi-channel cross-attention maps. (e) Cross-attention maps of the word “car”. (f) Cross-attention maps of token [V₁], designed to capture the concept of cars. (g) Cross-attention maps of token [V₃] for background, inverted for comparison. Heatmaps highlight attention intensity. In (b) and (c), dotted lines denote predicted pseudo-boxes and white dots mark vehicle centers.
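The label-generation step in (b) and (c) amounts to fusing heatmaps and extracting connected components. The sketch below shows one plausible way to do this; the fusion rule (mean of vehicle-token channels plus an inverted background channel), the threshold, and the minimum-area filter are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np
from scipy import ndimage

def pseudo_boxes(token_maps, bg_map=None, thresh=0.5, min_area=4):
    """Fuse cross-attention heatmaps into pseudo bounding boxes.

    token_maps: list of (H, W) heatmaps for vehicle-related tokens,
                e.g. "car" and a learned token such as [V1].
    bg_map:     optional (H, W) background-token map; its inverse is
                used as an extra vehicle-evidence channel.
    """
    channels = list(token_maps)
    if bg_map is not None:
        channels.append(bg_map.max() - bg_map)  # invert background evidence
    fused = np.mean(channels, axis=0)
    fused = (fused - fused.min()) / (np.ptp(fused) + 1e-8)  # scale to [0, 1]
    labels, _ = ndimage.label(fused > thresh)  # connected components
    boxes = []
    for ys, xs in ndimage.find_objects(labels):  # bounding slices per blob
        if (ys.stop - ys.start) * (xs.stop - xs.start) < min_area:
            continue  # discard tiny blobs, likely attention noise
        boxes.append((xs.start, ys.start, xs.stop - 1, ys.stop - 1))
    return boxes
```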



LINZ and UGRC Datasets

[Dataset sample figure]

We introduce two large-scale, high-quality benchmarks, LINZ and UGRC, designed to evaluate domain-adaptive object detection (DAOD) methods under realistic distribution shifts. The datasets support geographic domain-shift evaluation and provide dense, expert-level annotations. Together they comprise nearly 5 million annotated images and over 80,000 labeled object instances, establishing a robust foundation for testing DAOD algorithms.

| Property | LINZ (Selwyn, New Zealand) | UGRC (Utah, USA) |
|---|---|---|
| Training images | 1,451,144 (19,564 with vehicles) | 2,142,849 (15,631 with vehicles) |
| Validation images | 188,744 (2,108 with vehicles) | 271,252 (3,912 with vehicles) |
| Test images | 438,189 (2,629 with vehicles) | 270,557 (1,510 with vehicles) |
| Object instances (train / val / test) | 29,495 / 2,574 / 3,640 | 26,001 / 9,878 / 12,900 |
| Resolution | 1920 × 2880 px @ 12.5 cm/pixel | up to 16,000 × 16,000 px @ 12.5 cm/pixel |
| Source | Land Information New Zealand (LINZ) | Utah Geospatial Resource Center (UGRC) |
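For experimentation, a tile-level reader might look like the sketch below. The directory layout and the YOLO-style `class x_center y_center width height` label files are assumptions for illustration only; consult the dataset release for the actual format.

```python
from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset

class AerialTiles(Dataset):
    """Minimal reader for LINZ/UGRC-style image tiles.

    Hypothetical layout: root/<split>/images/*.jpg with one
    root/<split>/labels/<stem>.txt per image that contains vehicles,
    holding normalized YOLO-format rows.
    """
    def __init__(self, root, split="train", keep_empty=False):
        self.images = sorted(Path(root, split, "images").glob("*.jpg"))
        if not keep_empty:  # most tiles contain no vehicles
            self.images = [p for p in self.images
                           if self._label_path(p).exists()]

    def _label_path(self, img_path):
        return img_path.parent.parent / "labels" / (img_path.stem + ".txt")

    def __len__(self):
        return len(self.images)

    def __getitem__(self, i):
        img = Image.open(self.images[i]).convert("RGB")
        boxes = []
        lp = self._label_path(self.images[i])
        if lp.exists():
            for line in lp.read_text().splitlines():
                c, x, y, w, h = map(float, line.split())
                boxes.append((int(c), x, y, w, h))  # class + normalized box
        return img, boxes
```

Filtering empty tiles by default matters here: as the table above shows, only a small fraction of tiles (e.g. 19,564 of 1,451,144 LINZ training images) contain vehicles.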



BibTeX

@misc{fang2025adaptingvehicledetectorsaerial,
      title={Adapting Vehicle Detectors for Aerial Imagery to Unseen Domains with Weak Supervision}, 
      author={Xiao Fang and Minhyek Jeon and Zheyang Qin and Stanislav Panev and Celso de Melo and Shuowen Hu and Shayok Chakraborty and Fernando De la Torre},
      year={2025},
      eprint={2507.20976},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.20976}, 
}