Improving Noise Efficiency in Privacy-preserving Dataset Distillation

Runkai Zheng1, Vishnu Asutosh Dasu2, Yinong Oliver Wang1
Haohan Wang3, Fernando De la Torre1
1Carnegie Mellon University
2Pennsylvania State University
3University of Illinois Urbana-Champaign

TL;DR:

1. Problem: Existing Matching-based Differentially Private Dataset Distillation (DP-DD) pipelines waste privacy budget by resampling noisy signals every optimization step and rely on randomly initialized networks, yielding low‑SNR training signals.

2. Method:

  • Decoupled Optimization and Sampling (DOS): gathers DP‑protected signals once up front, then reuses them for arbitrarily many optimization steps;
  • Subspace Error Reduction (SER): projects those signals into an informative PCA subspace learned from public auxiliary data, raising the signal-to-noise ratio at no additional privacy cost.

3. Result: On CIFAR‑10, DOS + SER boosts accuracy by 10.0 pp with 50 images per class and by 8.3 pp with only one‑fifth (20%) of the previous distilled set size, setting a new state of the art for privacy‑preserving dataset distillation.

Motivation

Introduction figure

Figure 1. (a) Overview of private dataset distillation. (b) CIFAR-10 classification accuracy using distilled images with varying privacy budgets and images per class (IPC).

Abstract:

Modern machine learning models heavily rely on large datasets that often include sensitive and private information, raising serious privacy concerns. Differentially private (DP) data generation offers a solution by creating synthetic datasets that limit the leakage of private information within a predefined privacy budget; however, it requires a substantial amount of data to achieve performance comparable to models trained on the original data. To mitigate the significant expense incurred with synthetic data generation, Dataset Distillation (DD) stands out for its remarkable training and storage efficiency. This efficiency is particularly advantageous when integrated with DP mechanisms, curating compact yet informative synthetic datasets without compromising privacy. However, current state-of-the-art private DD methods suffer from a synchronized sampling-optimization process and the dependency on noisy training signals from randomly initialized networks. This results in the inefficient utilization of private information due to the addition of excessive noise. To address these issues, we introduce a novel framework that decouples sampling from optimization for better convergence and improves signal quality by mitigating the impact of DP noise through matching in an informative subspace, all without incurring additional privacy costs. On CIFAR-10, our method achieves a 10.0% improvement with 50 images per class and 8.3% increase with just one-fifth the distilled set size of previous state-of-the-art methods, demonstrating significant potential to advance privacy-preserving dataset distillation.

Overview of the Pipeline

Framework figure

Figure 2. Overview of our proposed framework, which integrates Decoupled Optimization and Sampling (DOS) with Subspace Error Reduction (SER).

Pipeline at a glance

  • Baseline (left, blue): matching‑based DD. Every iteration simultaneously
    1. samples a fresh mini‑batch from the private data,
    2. extracts two noisy feature sets through a randomly‑initialised network,
    3. clips + averages them, and
    4. optimises the synthetic images by minimising their feature gap.
    Because sampling and optimisation are locked together, each step burns privacy budget and the noise accumulates.
  • Ours, Dosser (right, green): two clearly separated stages (a code sketch follows this list).
    • Sampling stage (I₁ steps): before touching private data, we run Subspace Discovery once (PCA on auxiliary images) to pick the top‑p informative directions. Each sampling step then
      • instantiates a random network,
      • projects its features into that fixed subspace,
      • clips, adds DP noise, and stores a tuple (augmentation seed, network parameters, projection matrix, DP-protected training signal) in a cache S.
    • Optimisation stage (I₂ ≫ I₁ steps): we repeatedly draw cached tuples from S, augment the synthetic images with the stored seed, compute their projected features through the stored network, and minimise the gap to the cached DP-protected training signal.
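A minimal PyTorch sketch of the two stages, assuming a hypothetical `make_net` factory for random feature extractors and a precomputed projection matrix `proj`; the noise calibration shown is only illustrative, with the actual σ set by the paper's (ε, δ) accounting:

```python
import torch

def clip_rows(feats, c):
    # Per-sample L2 clipping: scale each row to norm at most c.
    norms = feats.norm(dim=1, keepdim=True).clamp(min=1e-12)
    return feats * (c / norms).clamp(max=1.0)

@torch.no_grad()
def sampling_stage(private_loader, make_net, proj, num_queries, clip_c, sigma):
    """Stage 1 (I1 steps): the only place the private data is touched."""
    cache, batches = [], iter(private_loader)
    for _ in range(num_queries):
        x, _ = next(batches)                 # fresh private mini-batch
        seed = torch.seed()                  # RNG seed stored for replay later
        net = make_net()                     # freshly initialised extractor
        feats = clip_rows(net(x).flatten(1) @ proj, clip_c)
        signal = feats.mean(0)
        # Gaussian mechanism on the clipped mean; sensitivity scales with
        # clip_c / batch size, sigma comes from the privacy accounting.
        signal += torch.randn_like(signal) * sigma * clip_c / len(x)
        cache.append((seed, net.state_dict(), proj, signal))
    return cache

def optimisation_stage(cache, syn_init, make_net, num_steps, lr=0.1):
    """Stage 2 (I2 >> I1 steps): reuse cached signals, no new privacy cost."""
    syn = syn_init.clone().requires_grad_(True)
    opt = torch.optim.SGD([syn], lr=lr)
    for step in range(num_steps):
        seed, params, proj, signal = cache[step % len(cache)]
        torch.manual_seed(seed)              # replay the stored augmentation RNG
        net = make_net()                     # (augmentation itself omitted here)
        net.load_state_dict(params)
        net.requires_grad_(False)            # only the images are optimised
        syn_feat = (net(syn).flatten(1) @ proj).mean(0)
        loss = ((syn_feat - signal) ** 2).sum()  # feature-gap matching loss
        opt.zero_grad(); loss.backward(); opt.step()
    return syn.detach()
```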

The result is a distillation loop that reduces cumulative noise, reuses high‑SNR directions, and allows many more optimisation updates than private queries, yielding smaller yet more accurate DP‑protected synthetic datasets.
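Because Subspace Discovery runs on public auxiliary images only, it consumes none of the privacy budget. Here is a sketch of how the top‑p projection could be obtained with PCA; the helper name and the use of `torch.pca_lowrank` are our assumptions, not necessarily the paper's exact procedure:

```python
import torch

@torch.no_grad()
def discover_subspace(aux_loader, net, p):
    """PCA on auxiliary (public) features; no private data is touched."""
    feats = torch.cat([net(x).flatten(1) for x, _ in aux_loader], dim=0)
    # pca_lowrank centres the data internally; the columns of v are the
    # top-p principal directions of the auxiliary feature distribution.
    _, _, v = torch.pca_lowrank(feats, q=p)
    return v  # shape (feature_dim, p); project with feats @ v
```

Matching in this p‑dimensional subspace means the Gaussian noise perturbs only p coordinates instead of the full feature dimension, which is the intuition behind SER's signal-to-noise gain.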

Evaluation Against Baselines

| Method | MNIST IPC=10 | MNIST IPC=50 | FashionMNIST IPC=10 | FashionMNIST IPC=50 | CIFAR-10 IPC=10 | CIFAR-10 IPC=50 |
|---|---|---|---|---|---|---|
| DM w/o DP | 97.8 | 99.2 | 84.6 | 88.7 | 52.1 | 60.6 |
| DP-Sinkhorn | 31.7±3.2 | 33.9±1.7 | 9.8±0.0 | 22.0±0.1 | - | - |
| DP-MERF | 75.0±0.3 | 84.4±2.3 | 65.5±3.2 | 71.3±1.7 | - | - |
| DP-KIP-ScatterNet | 25.8±2.1 | 13.8±2.6 | 17.7±1.5 | 16.2±1.2 | 16.8±1.1 | 9.5±0.5 |
| PSG | 78.6±0.7 | - | 68.5±0.5 | - | 33.6±0.3 | - |
| NDPDC | 93.1±0.4 | 94.1±0.4 | 77.7±0.6 | 78.8±0.4 | 39.4±0.8 | 42.3±0.8 |
| Dosser (ours) | 95.3±0.0 | 96.4±0.0 | 81.6±0.1 | 81.8±0.2 | 44.2±0.2 | 49.1±0.5 |
| Dosser (ours) w/ PEA | 96.4±0.0 | 96.7±0.1 | 80.1±0.5 | 83.1±0.5 | 50.6±0.1 | 52.3±0.6 |

Table 1. Test accuracy (%) achieved by various methods on MNIST, FashionMNIST, and CIFAR-10. Dosser achieves significant improvements over baseline methods, particularly on more complex datasets like CIFAR-10.
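For context, the numbers above follow the standard dataset-distillation protocol: train a fresh classifier from scratch on the distilled images, then report its accuracy on the real test set. A minimal sketch, where the tiny ConvNet and hyperparameters are placeholders rather than the paper's exact evaluator:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def evaluate_distilled(syn_x, syn_y, test_loader, epochs=300, lr=0.01):
    """Train a fresh model on the distilled set; report real test accuracy."""
    model = nn.Sequential(                  # stand-in evaluator network
        nn.Conv2d(syn_x.shape[1], 64, 3, padding=1), nn.ReLU(), nn.AvgPool2d(2),
        nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        nn.Flatten(), nn.Linear(128, int(syn_y.max()) + 1),
    )
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):                 # full-batch: the distilled set is tiny
        opt.zero_grad()
        F.cross_entropy(model(syn_x), syn_y).backward()
        opt.step()
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in test_loader:
            correct += (model(x).argmax(1) == y).sum().item()
            total += y.numel()
    return correct / total
```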

Ablation Studies

| DOS | SER | IPC=10, ε=1 | IPC=50, ε=1 | IPC=10, ε=10 | IPC=50, ε=10 |
|---|---|---|---|---|---|
| ✗ | ✗ | 41.7±0.0 | 45.7±0.1 | 54.1±0.0 | 57.7±0.0 |
| ✓ | ✗ | 47.7±0.1 | 51.0±0.2 | 56.7±0.5 | 61.1±0.1 |
| ✗ | ✓ | 46.5±0.2 | 47.8±0.3 | 54.7±0.5 | 57.6±0.0 |
| ✓ | ✓ | 50.6±0.1 | 52.3±0.3 | 58.0±0.2 | 61.0±0.0 |

Table 2. Ablation of the DOS and SER components (CIFAR-10 test accuracy, %, at each (IPC, ε) setting). Both modules provide significant improvements, and their combination yields the best performance.

BibTeX:

@inproceedings{zheng2025improving,
  title     = {Improving Noise Efficiency in Privacy-preserving Dataset Distillation},
  author    = {Zheng, Runkai and Dasu, Vishnu Asutosh and Wang, Yinong Oliver and Wang, Haohan and De la Torre, Fernando},
  booktitle = {ICCV},
  year      = {2025}
}