
Towards Lossless Dataset Distillation via Difficulty-Aligned Trajectory Matching

About

The ultimate goal of Dataset Distillation is to synthesize a small synthetic dataset such that a model trained on this synthetic set will perform as well as a model trained on the full, real dataset. Until now, no method of Dataset Distillation has reached this completely lossless goal, in part because previous methods only remain effective when the total number of synthetic samples is extremely small. Since only so much information can be contained in such a small number of samples, it seems that to achieve truly lossless dataset distillation, we must develop a distillation method that remains effective as the size of the synthetic dataset grows. In this work, we present such an algorithm and elucidate why existing methods fail to generate larger, high-quality synthetic sets. Current state-of-the-art methods rely on trajectory matching, that is, optimizing the synthetic data to induce long-term training dynamics similar to those of the real data. We empirically find that the training stage of the trajectories we choose to match (i.e., early or late) greatly affects the effectiveness of the distilled dataset. Specifically, early trajectories (where the teacher network learns easy patterns) work well for a low-cardinality synthetic set since there are fewer examples across which to distribute the necessary information. Conversely, late trajectories (where the teacher network learns hard patterns) provide better signals for larger synthetic sets since there are now enough samples to represent the necessary complex patterns. Based on our findings, we propose to align the difficulty of the generated patterns with the size of the synthetic dataset. In doing so, we successfully scale trajectory matching-based methods to larger synthetic datasets, achieving lossless dataset distillation for the very first time. Code and distilled datasets are available at https://gzyaftermath.github.io/DATM.
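The idea of matching early or late teacher trajectories depending on synthetic set size can be illustrated with a toy sketch. This is not the authors' implementation: the `WINDOWS` table, its epoch ranges, the helper names, and the random "teacher trajectory" are all hypothetical, and the loss is the normalized parameter-distance form commonly used in trajectory matching, assumed here for illustration.

```python
import numpy as np

def matching_loss(student_end, teacher_start, teacher_target):
    """Squared distance from the student's final parameters to the teacher's
    target checkpoint, normalized by how far the teacher itself moved."""
    num = np.sum((student_end - teacher_target) ** 2)
    den = np.sum((teacher_start - teacher_target) ** 2)
    return num / den

def sample_start_epoch(rng, lower, upper):
    """Pick the teacher epoch to start matching from, uniformly in [lower, upper]."""
    return int(rng.integers(lower, upper + 1))

# Hypothetical sampling windows keyed by images per class (IPC):
# small synthetic sets match early epochs, larger sets match later epochs.
WINDOWS = {1: (0, 4), 50: (10, 30), 1000: (40, 60)}

rng = np.random.default_rng(0)

# Toy teacher trajectory: 64 checkpoints of a 10-dimensional parameter vector.
trajectory = np.cumsum(rng.normal(size=(64, 10)), axis=0)

ipc = 50
lower, upper = WINDOWS[ipc]
t = sample_start_epoch(rng, lower, upper)

teacher_start = trajectory[t]
teacher_target = trajectory[t + 2]  # teacher checkpoint two epochs later

# Stand-in for a student trained on synthetic data: it covers half of the
# distance from the start checkpoint to the target checkpoint.
student_end = teacher_start + 0.5 * (teacher_target - teacher_start)

loss = matching_loss(student_end, teacher_start, teacher_target)
print(f"start epoch {t}, matching loss {loss:.2f}")
```

In a real pipeline the student parameters would come from gradient steps on the synthetic images, and the loss would be backpropagated to those images. Here the student is constructed by hand so the loss value is easy to check.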

Ziyao Guo, Kai Wang, George Cazenavette, Hui Li, Kaipeng Zhang, Yang You • 2023

Related benchmarks

Task | Dataset | Result | Rank
Image Classification | CIFAR-100 (test) | Accuracy 55 | 3518
Image Classification | CIFAR10 (test) | Accuracy 76.1 | 585
Image Classification | Tiny ImageNet (test) | Accuracy 39.7 | 265
Image Classification | CIFAR-10 (test) | Accuracy 83.5 | 59
Image Classification | CIFAR-10 Long Tailed Imbalance Ratio 50 (test) | Top-1 Accuracy 50.3 | 57
Long-Tailed Image Classification | CIFAR10-LT imbalance factor 100 (test) | Top-1 Accuracy 44.3 | 46
Medical Image Classification | Covid (test) | Accuracy 87.38 | 43
Image Classification | PathMNIST v2 (test) | Accuracy 89.15 | 35
Image Classification | CIFAR-10 Imbalance Factor 10 Long-Tailed (test) | Accuracy 66.7 | 30
Image Classification | CIFAR-10 Imbalance Factor 200 Long-Tailed (test) | Accuracy 40.1 | 28
Showing 10 of 18 rows
