We are grateful for the numerous submissions received for the 2nd Workshop on Navigating and Addressing Data Problems for Foundation Models (DATA-FM @ ICLR 2025). We thank all the authors who submitted to our workshop. We are happy to announce this year's accepted papers. Congratulations!
Oral
Synthesizing Privacy-Preserving Text Data via Finetuning *without* Finetuning Billion-Scale LLMs
Bowen Tan, Zheng Xu, Eric P. Xing, Zhiting Hu, Shanshan Wu
OpenReview
Synthetic data offers a promising path to train models while preserving data privacy. Differentially private (DP) finetuning of large language models (LLMs) as data generators is effective but impractical when computational resources are limited. Meanwhile, prompt-based methods such as private evolution depend heavily on manual prompts and use private information ineffectively in their filtering-based process. To overcome these limitations, we propose CTCL (Data Synthesis with Controllability and Clustering), a novel framework for generating privacy-preserving synthetic data without extensive prompt engineering or billion-scale LLM finetuning. CTCL pretrains a lightweight 140M-parameter conditional generator and a clustering-based topic model on large-scale public data. To adapt to the private domain, the generator is DP-finetuned on private data to capture fine-grained textual information, while the topic model extracts a DP histogram representing distributional information. The DP generator then samples according to the DP histogram to synthesize a desired number of examples. Evaluation across five diverse domains demonstrates the effectiveness of our framework, particularly in the strong privacy regime. Further analysis validates the design of each framework component and highlights the scalability of our approach.
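As an aside for readers, the histogram step described above can be pictured with a short sketch: add calibrated Gaussian noise to per-topic counts, renormalize, and sample generation conditions from the result. This is a minimal illustration assuming topic assignments are already available; the function names and noise scale are placeholders, not the authors' code.

```python
import numpy as np

def dp_topic_histogram(topic_ids, num_topics, noise_std, seed=0):
    """Release a noisy topic histogram via the Gaussian mechanism.
    Each private example contributes one count, so the L2 sensitivity is 1 and
    noise_std should be calibrated to the desired (epsilon, delta) budget."""
    rng = np.random.default_rng(seed)
    counts = np.bincount(topic_ids, minlength=num_topics).astype(float)
    noisy = np.clip(counts + rng.normal(scale=noise_std, size=num_topics), 0.0, None)
    total = noisy.sum()
    return noisy / total if total > 0 else np.full(num_topics, 1.0 / num_topics)

def sample_generation_topics(dp_hist, num_samples, seed=1):
    """Draw topic conditions for the DP-finetuned generator in proportion
    to the released histogram."""
    return np.random.default_rng(seed).choice(len(dp_hist), size=num_samples, p=dp_hist)

# Toy usage: 1,000 private documents over 8 topics, 5 generation conditions.
topic_ids = np.random.default_rng(2).integers(0, 8, size=1000)
hist = dp_topic_histogram(topic_ids, num_topics=8, noise_std=5.0)
print(sample_generation_topics(hist, num_samples=5))
```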
Data Mixing Can Induce Phase Transitions in Knowledge Acquisition
Xinran Gu, Kaifeng Lyu, Jiazheng Li, Jingzhao Zhang
OpenReview
Large Language Models (LLMs) are typically trained on data mixtures: most data come from web scrapes, while a small portion is curated from high-quality sources with dense domain-specific knowledge. In this paper, we show that when training LLMs on such data mixtures, knowledge acquisition from knowledge-dense datasets does not always follow a smooth scaling law but can exhibit phase transitions with respect to the mixing ratio and model size. First, through controlled experiments on a synthetic biography dataset mixed with web-scraped data, we demonstrate that: (1) as we increase the model size to a critical value, the model suddenly transitions from memorizing very few to most of the biographies; (2) below a critical mixing ratio, the model memorizes almost nothing even with extensive training, but beyond this threshold, it rapidly memorizes more biographies. We then adopt an information-theoretic perspective to understand and characterize the existence and value of the thresholds. Based on these insights, we identify two mitigation strategies that improve the efficiency of knowledge acquisition from knowledge-dense datasets, and validate their effectiveness on both synthetic and real-world Wikipedia datasets.
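For intuition, the mixing-ratio knob studied above amounts to drawing each training example from the knowledge-dense pool with some probability r and from web data otherwise. A toy sketch, with placeholder corpora and ratio rather than the paper's setup:

```python
import random

def mix_stream(knowledge_docs, web_docs, mix_ratio, num_steps, seed=0):
    """Yield a training stream where each example is drawn from the
    knowledge-dense set with probability `mix_ratio` and from web data
    otherwise. Sweeping `mix_ratio` (and model size) is how one would
    probe the thresholds described above."""
    rng = random.Random(seed)
    for _ in range(num_steps):
        pool = knowledge_docs if rng.random() < mix_ratio else web_docs
        yield rng.choice(pool)

# Toy usage: a 1% mixing ratio over placeholder corpora.
stream = mix_stream(["bio_1", "bio_2"], ["web_1", "web_2", "web_3"],
                    mix_ratio=0.01, num_steps=5)
print(list(stream))
```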
Demystifying Long CoT Reasoning in LLMs
Edward Yeo, Yuxuan Tong, Xinyao Niu, Graham Neubig, Xiang Yue
OpenReview
Scaling inference compute has become a key driver of advanced reasoning in large language models (LLMs). A proven approach for scaling inference compute is to generate long chains-of-thought (CoTs), enabling models to engage in structured reasoning strategies such as backtracking and error correction. Reinforcement learning (RL) has emerged as a crucial method for developing these capabilities, yet the conditions under which long CoTs emerge remain unclear, and RL training requires careful design choices. In this study, we systematically investigate the underlying mechanics of long CoT reasoning—examining the factors that enable models to generate extended reasoning trajectories. Through extensive supervised fine-tuning (SFT) and RL experiments, we identify three key findings: 1) while SFT is not strictly necessary, it significantly simplifies training and improves efficiency; 2) reasoning capabilities tend to emerge with increased training compute but are not guaranteed, making reward shaping essential for stabilizing CoT length growth; and 3) scaling verifiable reward signals is critical for RL, and we find that leveraging noisy, web-extracted solutions with filtering mechanisms shows promising potential, particularly in out-of-distribution (OOD) reasoning tasks such as STEM problem-solving. These insights provide practical guidance for optimizing training strategies to enhance long CoT reasoning in LLMs.
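As a concrete illustration of length-aware reward shaping in general (a generic form, not the specific design studied in this paper), one can combine a correctness reward with a penalty once the chain-of-thought exceeds a length budget:

```python
def shaped_reward(is_correct, cot_length, target_length, overlong_penalty=0.5):
    """A generic length-aware shaped reward: correctness drives the signal,
    and generations that exceed a length budget are penalized so that CoT
    length grows in a controlled way rather than unboundedly."""
    reward = 1.0 if is_correct else 0.0
    if cot_length > target_length:
        # Linear penalty for exceeding the budget, clipped at the full reward.
        overflow = (cot_length - target_length) / target_length
        reward -= min(overlong_penalty * overflow, 1.0)
    return reward

print(shaped_reward(True, cot_length=1200, target_length=1000))   # 0.9
print(shaped_reward(False, cot_length=400, target_length=1000))   # 0.0
```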
Towards Internet-Scale Training For Agents
Brandon Trabucco, Gunnar A Sigurdsson, Robinson Piramuthu, Ruslan Salakhutdinov
OpenReview
The predominant approach for training web navigation agents gathers human demonstrations for a set of popular websites and hand-written tasks, but it is becoming clear that human data is an inefficient resource. We develop a pipeline to facilitate internet-scale training for agents without laborious human annotations. In the first stage, an LLM generates tasks for 150k diverse websites. In the next stage, LLM agents complete tasks and produce trajectories. In the final stage, an LLM reviews the trajectories and judges their success. Language models are competitive with human annotators, detecting and filtering out harmful content with an accuracy of 97%, generating feasible tasks with an 89% rate, and judging successful trajectories with an 82.6% accuracy. Scaling the pipeline, agents based on Llama 3.1 70B solve 16.7% of tasks for 150k sites. Training on the data generated by our pipeline is competitive with training on human demonstrations. In data-limited settings derived from Mind2Web and WebLINX, we improve Step Accuracy by up to +89.5% and +122.9% respectively for agents trained on mixtures of data from our pipeline and human data. When training agents with all available human data from these benchmarks, agents fail to generalize to diverse real sites, and adding our data improves their generalization by +149.0% for WebLINX and +156.3% for Mind2Web. Code will be available at: data-for-agents.github.io.
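The three-stage pipeline can be sketched as plain control flow; the LLM-backed components below are hypothetical stand-ins, not the authors' implementation:

```python
from typing import Callable, Dict, List

def build_agent_data(sites: List[str],
                     propose_task: Callable[[str], str],
                     run_agent: Callable[[str, str], List[Dict]],
                     judge_success: Callable[[str, List[Dict]], bool]):
    """Three-stage pipeline sketch: propose a task per site, roll out an agent
    to get a trajectory, then keep only trajectories the judge marks successful.
    The three callables stand in for LLM-backed components."""
    dataset = []
    for site in sites:
        task = propose_task(site)
        trajectory = run_agent(site, task)
        if judge_success(task, trajectory):
            dataset.append({"site": site, "task": task, "trajectory": trajectory})
    return dataset

# Toy stand-ins so the sketch runs end to end.
demo = build_agent_data(
    ["example.com"],
    propose_task=lambda site: f"Find the contact page on {site}",
    run_agent=lambda site, task: [{"action": "click", "target": "Contact"}],
    judge_success=lambda task, traj: len(traj) > 0,
)
print(demo)
```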
Poster
Nepotistically Trained Generative Image Models Collapse
Maty Bohacek, Hany Farid
OpenReview
Trained on massive amounts of human-generated content, AI-generated image synthesis is capable of reproducing semantically coherent images that match the visual appearance of its training data. We show that when retrained on even small amounts of their own creation, these generative-AI models produce highly distorted images. We also show that this distortion extends beyond the text prompts used in retraining, and that once affected, the models struggle to fully heal even after retraining on only real images.
Understanding Private Learning From Feature Perspective
Meng Ding, Mingxi Lei, Shaopeng Fu, Di Wang, Jinhui Xu
OpenReview
Differentially private Stochastic Gradient Descent (DP-SGD) has become integral to privacy-preserving machine learning, ensuring robust privacy guarantees in sensitive domains. Despite notable empirical advances leveraging features from non-private, pre-trained models to enhance DP-SGD training, a theoretical understanding of feature dynamics in private learning remains underexplored. This paper presents the first theoretical framework to analyze private training through the feature perspective. Inspired by the multi-patch structure in image data, we model a novel data distribution by clearly defining label-dependent features and label-independent noise—a critical aspect overlooked by existing analyses in the DP community. Employing a two-layer CNN with polynomial ReLU activation, we quantify the learning dynamics of noisy gradient descent through signal-to-noise ratio (SNR). Our findings reveal that (1) Effective private signal learning requires a higher signal-to-noise ratio compared to non-private training, and (2) When data noise memorization occurs in non-private learning, it will also occur in private learning, leading to poor generalization despite small training loss. Our findings highlight the challenges of private learning and prove the benefit of feature enhancement to improve SNR. Experiments on synthetic and real-world datasets also validate our theoretical findings.
Language Model Preference Evaluation with Multiple Weak Evaluators
Zhengyu Hu, Jieyu Zhang, Zhihan Xiong, Alexander Ratner, Hui Xiong, Ranjay Krishna
OpenReview
Despite the remarkable success of Large Language Models (LLMs), evaluating the quality of their outputs in terms of preference remains a critical challenge. Existing works usually leverage an LLM as the judge for comparing LLMs' outputs pairwise, yet such a model-based evaluator is a weak evaluator due to conflicting preferences (e.g., output A is better than B, B than C, but C than A), causing contradictory evaluation results. To address this, we introduce GED (Preference Graph Ensemble and Denoise), a novel approach that leverages multiple model-based evaluators to construct preference graphs, and then ensembles and denoises these graphs for better, non-contradictory evaluation results. In particular, our method consists of two primary stages: aggregating evaluations into a unified graph and applying a denoising process to eliminate cyclic inconsistencies, ensuring a directed acyclic graph (DAG) structure. We provide theoretical guarantees for our framework, demonstrating its efficacy in recovering the ground truth preference structure. Extensive experiments on ten benchmarks demonstrate GED's superiority in three applications: model ranking, response selection, and model alignment tasks. Notably, GED combines small LLM evaluators (e.g., Llama3-8B, Mistral-7B, Qwen2-7B) to outperform stronger ones (e.g., Qwen2-72B), showcasing its effectiveness in enhancing evaluation reliability and improving model performance.
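To make the two stages concrete, here is a small sketch that aggregates pairwise votes from several evaluators into a weighted graph and then keeps edges greedily (heaviest first) only when they do not close a cycle. It is a simple stand-in for the denoising step, not the paper's exact algorithm:

```python
from collections import Counter

import networkx as nx

def ensemble_and_denoise(pairwise_votes):
    """Aggregate pairwise judgments into one weighted preference graph, then
    add edges from heaviest to lightest only when they keep the graph acyclic,
    yielding a DAG."""
    weights = Counter()
    for winner, loser in pairwise_votes:          # one vote = "winner beats loser"
        weights[(winner, loser)] += 1
    dag = nx.DiGraph()
    for (u, v), w in sorted(weights.items(), key=lambda kv: -kv[1]):
        dag.add_nodes_from([u, v])
        # Adding u -> v would close a cycle iff v can already reach u.
        if not nx.has_path(dag, v, u):
            dag.add_edge(u, v, weight=w)
    return dag

votes = [("A", "B"), ("A", "B"), ("B", "C"), ("C", "A")]   # contains a cycle
dag = ensemble_and_denoise(votes)
print(list(nx.topological_sort(dag)))    # ['A', 'B', 'C']
```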
Explaining Length Bias in LLM-Based Preference Evaluations
Zhengyu Hu, Linxin Song, Jieyu Zhang, Zheyuan Xiao, Zhengyu Chen, Hui Xiong
OpenReview
The use of large language models (LLMs) as judges, particularly in preference comparisons, has become widespread, but it reveals a notable bias towards longer responses, undermining the reliability of such evaluations. To better understand this bias, we propose to decompose the preference evaluation metric, specifically the win rate, into two key components: desirability and information mass. The former is length-independent and related to trustworthiness aspects such as correctness, toxicity, and consistency, while the latter is length-dependent and represents the amount of information in the response. We empirically demonstrate this decomposition through controlled experiments and find that response length impacts evaluations by influencing information mass. To derive a reliable evaluation metric that assesses content quality without being confounded by response length, we propose AdapAlpaca, a simple yet effective adjustment to win rate measurement. Specifically, AdapAlpaca ensures a fair comparison of response quality by aligning the lengths of reference and test model responses within equivalent length intervals.
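A simplified illustration of length-controlled comparison follows; the bucket size and data are placeholders, and this is not the exact AdapAlpaca procedure:

```python
from collections import defaultdict

def length_bucketed_win_rate(pairs, bucket_size=100):
    """Compute win rates only within matched length intervals, so that a model
    cannot win simply by answering at a different length than the reference.
    `pairs` holds (test_len, ref_len, test_won) tuples."""
    buckets = defaultdict(lambda: [0, 0])          # bucket -> [wins, total]
    for test_len, ref_len, test_won in pairs:
        if test_len // bucket_size != ref_len // bucket_size:
            continue                               # only compare within the same interval
        b = buckets[test_len // bucket_size]
        b[0] += int(test_won)
        b[1] += 1
    return {k: wins / total for k, (wins, total) in buckets.items()}

pairs = [(120, 150, True), (180, 410, True), (90, 60, False)]
print(length_bucketed_win_rate(pairs))     # {1: 1.0, 0: 0.0}
```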
Rule-Based Rating and Selection of LLM Training Data
Xiaomin Li, Mingye Gao, Zhiwei Zhang, Chang Yue, Hong Hu
OpenReview
The quality of training data is crucial for the performance of large language models (LLMs). Recent studies utilize LLMs to rate and select data based on scores from a small set of human-designed metrics (rules). However, existing rule-based methods often rely heavily on human heuristics, lack robust metrics for rule evaluation, and exhibit limited adaptability to new tasks. In this paper, we propose a novel rule-based framework that leverages the orthogonality of the score vectors corresponding to rules as a unique metric for rule evaluation. Our method employs an automated pipeline that first uses LLMs to generate a rule set covering a wide range of data quality aspects. It then rates a batch of data according to these rules and applies the determinantal point process (DPP) from random matrix theory to select the most independent (orthogonal) rules. These rules are then applied to rate all data, and the samples with the highest average scores are selected for downstream tasks such as LLM fine-tuning. We validate our method through two experimental setups: 1) comparing against ground truth ratings and 2) benchmarking LLMs trained with the selected data. Our extensive experiments span various settings, including fine-tuning LLMs across the IMDB, Medical, Math, and Code domains. The results show that our DPP rule-based rating method consistently outperforms various other baselines, in terms of both rating accuracy and benchmark performance.
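The selection step can be approximated with a greedy determinant-maximizing pass over the Gram matrix of rule score vectors; the sketch below is a simplified stand-in for the paper's DPP machinery:

```python
import numpy as np

def select_orthogonal_rules(score_matrix, k):
    """Greedily pick k rules whose score vectors are maximally independent,
    by maximizing the determinant of the Gram (kernel) submatrix at each step.
    `score_matrix` is rules x samples."""
    # Normalize rows so the kernel reflects angles (orthogonality), not scale.
    scores = score_matrix / (np.linalg.norm(score_matrix, axis=1, keepdims=True) + 1e-12)
    kernel = scores @ scores.T
    selected = []
    for _ in range(k):
        best, best_det = None, -np.inf
        for i in range(kernel.shape[0]):
            if i in selected:
                continue
            idx = selected + [i]
            det = np.linalg.det(kernel[np.ix_(idx, idx)])
            if det > best_det:
                best, best_det = i, det
        selected.append(best)
    return selected

rng = np.random.default_rng(0)
ratings = rng.normal(size=(6, 50))         # 6 candidate rules scored on 50 samples
print(select_orthogonal_rules(ratings, k=3))
```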
Editable Concept Bottleneck Models
Lijie Hu, Chenyang Ren, Zhengyu Hu, Hongbin Lin, Cheng-Long Wang, Zhen Tan, Weimin Lyu, Jingfeng Zhang, Hui Xiong, Di Wang
OpenReview
Concept Bottleneck Models (CBMs) have garnered much attention for their ability to elucidate the prediction process through a human-understandable concept layer. However, most previous studies focused on cases where the data, including concepts, are clean. In many scenarios, we need to remove or insert training data or new concepts in trained CBMs for various reasons, such as privacy concerns, data mislabelling, spurious concepts, and concept annotation errors. Thus, the challenge of deriving efficient editable CBMs without retraining from scratch persists, particularly in large-scale applications. To address these challenges, we propose Editable Concept Bottleneck Models (ECBMs). Specifically, ECBMs support three different levels of data removal: concept-label-level, concept-level, and data-level. ECBMs enjoy mathematically rigorous closed-form approximations derived from influence functions that obviate the need for retraining. Experimental results demonstrate the efficiency and effectiveness of our ECBMs, affirming their adaptability within the realm of CBMs.
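For context, closed-form editing of this kind typically builds on the classical influence-function estimate for removing a training point without retraining, shown below in its standard form (not the paper's concept-level derivation):

```latex
% Classical influence-function estimate for removing one training point z:
% \hat{\theta} minimizes the empirical risk over n points; H is the Hessian there.
\hat{\theta}_{-z} \;\approx\; \hat{\theta} \;+\; \frac{1}{n}\, H_{\hat{\theta}}^{-1}\, \nabla_{\theta}\, \ell(z, \hat{\theta}),
\qquad
H_{\hat{\theta}} \;=\; \frac{1}{n}\sum_{i=1}^{n} \nabla_{\theta}^{2}\, \ell(z_i, \hat{\theta}).
```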
Template Matters: Understanding the Role of Instruction Templates in Multimodal Language Model Evaluation and Training
Shijian Wang, Linxin Song, Jieyu Zhang, Ryotaro Shimizu, Ao Luo, Li Yao, Cunjian Chen, Julian McAuley, Hanqian Wu
OpenReview
Current evaluation and training approaches for multimodal language models (MLMs) overlook the influence of instruction format, presenting an elephant-in-the-room problem. Previous research deals with this problem by manually crafting instructions, failing to yield significant insights due to limitations in diversity and scalability. In this work, we propose a programmatic instruction template generator capable of producing over 3.9B unique template combinations by filling randomly sampled positional synonyms into weighted sampled meta templates, enabling us to comprehensively examine MLM performance across diverse instruction templates. Our experiments across eight common MLMs on five benchmark datasets reveal that MLMs have high template sensitivity, with performance gaps of up to 29% between different templates. We further augment the instruction tuning dataset of LLaVA-1.5 with our template generator and perform instruction tuning on LLaVA-1.5-7B and LLaVA-1.5-13B. Models tuned on our augmented dataset achieve the best overall performance compared with MLMs of the same scale tuned on datasets up to 75 times larger than our augmented dataset, highlighting the importance of instruction templates in MLM training.
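The generator idea can be pictured as filling sampled synonyms into meta templates; the templates and synonym pools below are illustrative placeholders, and the weighting scheme is omitted:

```python
import random

META_TEMPLATES = [                      # hypothetical meta templates
    "{prefix} {question} {suffix}",
    "{question}\n{suffix} {prefix}",
]
SYNONYMS = {                            # positional synonym pools (placeholders)
    "prefix": ["Please answer:", "Answer the following:", "Question:"],
    "suffix": ["Respond concisely.", "Give a short answer.", "Be brief."],
}

def render_instruction(question, rng=None):
    """Fill a randomly chosen meta template with randomly sampled synonyms,
    giving a different instruction wording for the same underlying question."""
    rng = rng or random.Random()
    template = rng.choice(META_TEMPLATES)
    return template.format(
        question=question,
        prefix=rng.choice(SYNONYMS["prefix"]),
        suffix=rng.choice(SYNONYMS["suffix"]),
    )

print(render_instruction("What is shown in the image?", random.Random(0)))
```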
Be like a Goldfish, Don't Memorize! Mitigating Memorization in Generative LLMs
Abhimanyu Hans, Yuxin Wen, Neel Jain, John Kirchenbauer, Hamid Kazemi, Prajwal Singhania, Siddharth Singh, Gowthami Somepalli, Jonas Geiping, Abhinav Bhatele, Tom Goldstein
OpenReview
Large language models can memorize and repeat their training data, causing privacy and copyright risks. To mitigate memorization, we introduce a subtle modification to the next-token training objective that we call the goldfish loss. During training, a randomly sampled subset of tokens is excluded from the loss computation. These dropped tokens are not memorized by the model, which prevents verbatim reproduction of a complete chain of tokens from the training set. We run extensive experiments training billion-scale LLaMA-2 models, both pre-trained and trained from scratch, and demonstrate significant reductions in extractable memorization with little to no impact on downstream benchmarks.
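A minimal PyTorch sketch of the idea, masking a random subset of target tokens out of the next-token loss (a simplified illustration of token dropping, not the released implementation):

```python
import torch
import torch.nn.functional as F

def goldfish_style_loss(logits, targets, drop_prob=0.25):
    """Next-token loss where a random subset of target tokens is dropped from
    the loss, so no complete chain of training tokens is fully supervised.
    logits: (batch, seq, vocab); targets: (batch, seq)."""
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1), reduction="none"
    ).reshape(targets.shape)
    keep = (torch.rand(targets.shape, device=targets.device) > drop_prob).float()
    return (per_token * keep).sum() / keep.sum().clamp(min=1.0)

logits = torch.randn(2, 8, 100)
targets = torch.randint(0, 100, (2, 8))
print(goldfish_style_loss(logits, targets).item())
```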
Context-Guided Responsible Data Augmentation with Diffusion Models
Khawar Islam, Naveed Akhtar
OpenReview
Generative diffusion models offer a natural choice for data augmentation when training complex vision models. However, ensuring the reliability of their generative content as augmentation samples remains an open challenge. Despite a number of techniques utilizing generative images to strengthen model training, it remains unclear how to utilize the combination of natural and generative images as a rich supervisory signal for effective model induction. In this regard, we propose a text-to-image (T2I) data augmentation method, named DiffCoRe-Mix, that computes a set of generative counterparts for a training sample with an explicitly constrained diffusion model that leverages sample-based context and negative prompting for reliable augmentation sample generation. To preserve key semantic axes, we also filter out undesired generative samples in our augmentation process. To that end, we propose a hard-cosine filtration in the embedding space of CLIP. Our approach systematically mixes the natural and generative images at pixel and patch levels. We extensively evaluate our technique on ImageNet-1K, Tiny ImageNet-200, CIFAR-100, Flowers102, CUB-Birds, Stanford Cars, and Caltech datasets, demonstrating a notable increase in performance across the board, achieving up to $\sim 3\%$ absolute gain in top-1 accuracy over state-of-the-art methods, while showing comparable computational overhead.
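The filtration step can be sketched as a two-sided cosine-similarity test in an embedding space; the thresholds and embeddings below are illustrative, not the paper's settings:

```python
import numpy as np

def hard_cosine_filter(real_emb, gen_embs, min_sim=0.25, max_sim=0.95):
    """Keep generated samples whose CLIP-style embedding is neither too far
    from the real image (off-semantics) nor nearly identical to it (no
    augmentation value). Returns kept indices and the similarities."""
    real = real_emb / np.linalg.norm(real_emb)
    gens = gen_embs / np.linalg.norm(gen_embs, axis=1, keepdims=True)
    sims = gens @ real
    keep = (sims >= min_sim) & (sims <= max_sim)
    return np.where(keep)[0], sims

rng = np.random.default_rng(0)
real = rng.normal(size=512)
gens = np.stack([
    real + 0.5 * rng.normal(size=512),   # plausible variant -> kept
    real * 1.001,                        # near-duplicate -> filtered
    rng.normal(size=512),                # unrelated content -> filtered
    -real,                               # opposite direction -> filtered
])
kept, sims = hard_cosine_filter(real, gens)
print(kept, np.round(sims, 3))
```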
The Emperor's New Clothes in Benchmarking? A Rigorous Examination of Mitigation Strategies for LLM Benchmark Data Contamination
Yifan Sun, Han Wang, Dongbai Li, Gang Wang, Huan Zhang
OpenReview
Benchmark Data Contamination (BDC)—the inclusion of benchmark testing samples in the training set—has raised increasing concerns in Large Language Model (LLM) evaluation, leading to falsely inflated performance estimates and undermining evaluation reliability. To address this, researchers have proposed various mitigation strategies to update existing benchmarks, including modifying original questions or generating new ones based on them. However, a rigorous examination of the effectiveness of these mitigation strategies remains lacking. In this paper, we design a systematic and controlled pipeline along with two novel metrics—fidelity and contamination resistance—to provide a fine-grained and comprehensive assessment of existing BDC mitigation strategies. Previous assessment methods, such as accuracy drop and accuracy matching, focus solely on aggregate accuracy, often leading to incomplete or misleading conclusions. Our metrics address this limitation by emphasizing question-level evaluation result matching. Extensive experiments with 10 LLMs, 5 benchmarks, 20 BDC mitigation strategies, and 2 contamination scenarios reveal that no existing strategy significantly improves resistance over the vanilla case (i.e., no benchmark update) across all benchmarks, and none effectively balances fidelity and contamination resistance. These findings underscore the urgent need for designing more effective BDC mitigation strategies.
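One plausible reading of question-level result matching is an agreement rate over per-question outcomes; the toy sketch below uses that assumption and is not the paper's exact definition of fidelity or resistance:

```python
def question_level_agreement(results_a, results_b):
    """Fraction of questions on which two evaluation runs give the same
    per-question outcome (e.g., correct / incorrect)."""
    assert len(results_a) == len(results_b)
    matches = sum(a == b for a, b in zip(results_a, results_b))
    return matches / len(results_a)

clean_model_on_original = [1, 0, 1, 1, 0]
clean_model_on_updated  = [1, 0, 1, 0, 0]     # fidelity-style comparison
print(question_level_agreement(clean_model_on_original, clean_model_on_updated))  # 0.8
```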
Beyond ordinary Lipschitz constraints: Differentially Private optimization with TNC
Difei Xu, Meng Ding, Zihang Xiang, Jinhui Xu, Di Wang
OpenReview
We study Stochastic Convex Optimization in the Differential Privacy model (DP-SCO). Unlike previous studies, here we assume the population risk function satisfies the Tsybakov Noise Condition (TNC) with some parameter $\theta>1$, where the Lipschitz constant of the loss could be extremely large or even unbounded, but the $\ell_2$-norm of the gradient of the loss has a bounded $k$-th moment with $k\geq 2$. For the Lipschitz case with $\theta\geq 2$, we first propose an $(\epsilon, \delta)$-DP algorithm whose utility bound is $\tilde{O}\left(\left(\tilde{r}_{2k}(\frac{1}{\sqrt{n}}+(\frac{\sqrt{d}}{n\epsilon}))^\frac{k-1}{k}\right)^\frac{\theta}{\theta-1}\right)$ in high probability, where $n$ is the sample size, $d$ is the model dimension, and $\tilde{r}_{2k}$ is a term that only depends on the $2k$-th moment of the gradient. Notably, such an upper bound is independent of the Lipschitz constant. We then extend to the case where $\theta\geq \bar{\theta}> 1$ for some known constant $\bar{\theta}$. Moreover, when the privacy budget $\epsilon$ is small enough, we show an upper bound of $\tilde{O}\left(\left(\tilde{r}_{k}(\frac{1}{\sqrt{n}}+(\frac{\sqrt{d}}{n\epsilon}))^\frac{k-1}{k}\right)^\frac{\theta}{\theta-1}\right)$ even if the loss function is not Lipschitz. For the lower bound, we show that for any $\theta\geq 2$, the private minimax rate for $\rho$-zero-Concentrated Differential Privacy is lower bounded by $\Omega\left(\left(\tilde{r}_{k}(\frac{1}{\sqrt{n}}+(\frac{\sqrt{d}}{n\sqrt{\rho}}))^\frac{k-1}{k}\right)^\frac{\theta}{\theta-1}\right)$.
Training and Evaluating Language Models with Template-based Data Generation
Yifan Zhang
OpenReview
The rapid advancement of large language models (LLMs) such as GPT-3, PaLM, and Llama has significantly transformed natural language processing, showcasing remarkable capabilities in understanding and generating language. However, these models often struggle with tasks requiring complex reasoning, particularly in mathematical problem-solving, due in part to the scarcity of large-scale, high-quality, domain-specific datasets necessary for training sophisticated reasoning abilities. To address this limitation, we introduce Template-based Data Generation (TDG), a novel approach that leverages LLMs (GPT-4) to automatically generate parameterized meta-templates, which are then used to synthesize a vast array of high-quality problems and solutions. Leveraging TDG, we create TemplateMath Part I: TemplateGSM, a dataset comprising over 7 million synthetically generated grade school math problems—each accompanied by code-based and natural language solutions—with the potential to generate an effectively unlimited number of additional problems. This dataset alleviates the scarcity of large-scale mathematical datasets and serves as a valuable resource for pre-training, fine-tuning, and evaluating LLMs in mathematical reasoning. Our method not only enables the generation of virtually infinite data but also elevates data augmentation to a new level by using GPT-4 for meta-template generation, ensuring diverse and high-quality problem structures.
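A single parameterized template makes the pipeline concrete; the template text below is an illustrative placeholder rather than one of the released TemplateGSM templates:

```python
import random

def gsm_template(name, n_items, price):
    """One parameterized meta-template: returns a grade-school word problem,
    a natural-language solution, and a code-based solution."""
    total = n_items * price
    problem = (f"{name} buys {n_items} notebooks. Each notebook costs "
               f"${price}. How much does {name} spend in total?")
    nl_solution = f"{name} spends {n_items} * {price} = {total} dollars."
    code_solution = f"def solve():\n    return {n_items} * {price}  # = {total}"
    return {"problem": problem, "solution": nl_solution, "code": code_solution}

rng = random.Random(0)
sample = gsm_template(rng.choice(["Ava", "Ben", "Mia"]),
                      rng.randint(2, 9), rng.randint(3, 15))
print(sample["problem"])
print(sample["solution"])
```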
Model Collapse in the Self-Consuming Chain of Diffusion Finetuning: A Novel Perspective from Quantitative Trait Modeling
Youngseok Yoon, Dainong Hu, Iain Weissburg, Yao Qin, Haewon Jeong
OpenReview
Model collapse, the severe degradation of generative models when iteratively trained on their own outputs, has gained significant attention in recent years. This paper examines the Chain of Diffusion, where a pretrained text-to-image diffusion model is finetuned on its own generated images. We demonstrate that severe image quality degradation is universal across settings and identify the classifier-free guidance (CFG) scale as the key factor driving this model collapse. Drawing on an analogy between the Chain of Diffusion and biological evolution, we then introduce a novel theoretical analysis based on quantitative trait modeling. Our theoretical analysis aligns with empirical observations of the generated images in the Chain of Diffusion. Finally, we propose Reusable Diffusion Finetuning (ReDiFine), a simple yet effective strategy inspired by genetic mutations. It operates robustly across various scenarios without requiring any hyperparameter tuning, making it a plug-and-play solution for reusable image generation.
PiKE: Adaptive Data Mixing for Multi-Task Learning Under Low Gradient Conflicts
Zeman Li, Yuan Deng, Peilin Zhong, Meisam Razaviyayn, Vahab Mirrokni
OpenReview
Modern machine learning models are trained on diverse datasets and tasks to improve generalization. A key challenge in multitask learning is determining the optimal data mixing and sampling strategy across different data sources. Prior research in this multi-task learning setting has primarily focused on mitigating gradient conflicts between tasks. However, we observe that many real-world multitask learning scenarios—such as multilingual training and multi-domain learning in large foundation models—exhibit predominantly positive task interactions with minimal or no gradient conflict. Building on this insight, we introduce PiKE (Positive gradient interaction-based K-task weights Estimator), an adaptive data mixing algorithm that dynamically adjusts task contributions throughout training. PiKE optimizes task sampling to minimize overall loss, effectively leveraging positive gradient interactions with almost no additional computational overhead. We establish theoretical convergence guarantees for PiKE and demonstrate its superiority over static and non-adaptive mixing strategies. Additionally, we extend PiKE to promote fair learning across tasks, ensuring balanced progress and preventing task underrepresentation. Empirical evaluations on large-scale language model pretraining show that PiKE consistently outperforms existing heuristic and static mixing strategies, leading to faster convergence and improved downstream task performance.
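As a toy illustration only (not the PiKE estimator itself), sampling weights can be derived from pairwise gradient inner products so that tasks whose gradients also help the other tasks are sampled more:

```python
import numpy as np

def interaction_based_weights(task_grads, temperature=1.0):
    """Turn per-task gradient interactions into sampling weights: tasks whose
    gradients have large positive inner products with the other tasks'
    gradients receive larger weights. A highly simplified stand-in."""
    G = np.stack(task_grads)                    # (num_tasks, num_params)
    interactions = G @ G.T                      # pairwise gradient inner products
    # Benefit of task i = how much its gradient aligns with every task's gradient.
    benefit = interactions.sum(axis=1)
    logits = benefit / (np.abs(benefit).max() + 1e-12) / temperature
    weights = np.exp(logits)
    return weights / weights.sum()

rng = np.random.default_rng(0)
grads = [rng.normal(size=1000) for _ in range(3)]
print(interaction_based_weights(grads))
```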
MFC-Bench: Benchmarking Multimodal Fact-Checking with Large Vision-Language Models
Shengkang Wang, Hongzhan Lin, Ziyang Luo, Zhen Ye, Guang Chen, Jing Ma
OpenReview
Large vision-language models (LVLMs) have significantly improved multimodal reasoning tasks, such as visual question answering and image captioning. These models embed multimodal facts within their parameters, rather than relying on external knowledge bases to store factual information explicitly. However, the content discerned by LVLMs may deviate from factuality due to inherent bias or incorrect inference. In this work, we introduce MFC-Bench, a rigorous and comprehensive benchmark designed to evaluate the factual accuracy of LVLMs across three stages of verdict prediction for multimodal fact-checking (MFC): Manipulation, Out-of-Context, and Veracity Classification. Through our evaluation on MFC-Bench, we benchmark a dozen diverse and representative LVLMs, uncovering that current models still fall short in MFC and demonstrate insensitivity to various forms of manipulated content. We hope that MFC-Bench will draw attention to trustworthy AI assisted by LVLMs in the future.
DUET: Optimizing Training Data Mixtures via Feedback from Unseen Evaluation Tasks
Zhiliang Chen, Gregory Kang Ruey Lau, Chuan-Sheng Foo, Bryan Kian Hsiang Low
OpenReview
The performance of a machine learning (ML) model depends heavily on the relevance of its training data to the domain of the downstream evaluation task. However, in practice, the data involved in an unseen evaluation task is often not known to us (e.g., conversations between an LLM and a user are end-to-end encrypted). So, it is not obvious what data would be relevant for training/fine-tuning the ML model to maximize its task performance. Instead, one can only deploy the ML model in the unseen evaluation task to gather multiple rounds of coarse feedback on how well the model has performed. This paper presents a novel global-to-local algorithm called DUET that can exploit the feedback loop by interleaving a data selection method with Bayesian optimization. As a result, DUET can efficiently refine the training data mixture from a pool of data domains to maximize the model's performance on the unseen evaluation task and its convergence to the optimal data mixture can be theoretically guaranteed by analyzing its cumulative regret. Empirical evaluation on image and LLM evaluation tasks shows that DUET finds better training data mixtures than conventional baselines.
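The feedback loop can be sketched with random mixture proposals standing in for the Bayesian-optimization and data-selection machinery of DUET; the evaluation function below is a toy placeholder:

```python
import numpy as np

def refine_mixture(evaluate, num_domains, rounds=20, seed=0):
    """Feedback-loop sketch: propose a training-data mixture over domains,
    observe a single coarse score from the unseen task, and keep the best
    mixture seen so far."""
    rng = np.random.default_rng(seed)
    best_mix, best_score = None, -np.inf
    for _ in range(rounds):
        mix = rng.dirichlet(np.ones(num_domains))    # candidate mixture weights
        score = evaluate(mix)                        # coarse feedback only
        if score > best_score:
            best_mix, best_score = mix, score
    return best_mix, best_score

# Toy feedback: the unseen task secretly prefers domain 2.
target = np.array([0.1, 0.2, 0.7])
mix, score = refine_mixture(lambda m: -np.abs(m - target).sum(), num_domains=3)
print(np.round(mix, 2), round(score, 3))
```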
Unlocking Post-hoc Dataset Inference with Synthetic Data
Bihe Zhao, Pratyush Maini, Franziska Boenisch, Adam Dziedzic
OpenReview
The remarkable capabilities of large language models stem from massive internet-scraped training datasets, often obtained without respecting data owners' intellectual property rights. Dataset Inference (DI) enables data owners to verify unauthorized data use by identifying whether a suspect dataset was used for training. However, current DI methods require private held-out data with a distribution that closely matches the compromised dataset. Such held-out data are rarely available in practice, severely limiting the applicability of DI. In this work, we address this challenge by synthetically generating the required validation set through two key contributions: (1) creating high-quality, diverse synthetic data via a data generator trained on a carefully designed suffix-based completion task, and (2) bridging likelihood gaps between real and synthetic data, which is realized through post-hoc calibration. Extensive experiments on diverse text datasets show that using our generated data as a held-out set enables DI to detect the original training sets with high confidence, while maintaining a low false positive rate. This result empowers copyright owners to make legitimate claims on data usage and demonstrates our method’s reliability for real-world litigations.
Privacy Auditing for Large Language Models with Natural Identifiers
Lorenzo Rossi, Bartłomiej Marek, Franziska Boenisch, Adam Dziedzic
OpenReview
Privacy auditing for large language models (LLMs) faces significant challenges. Membership inference attacks, once considered a practical privacy auditing tool, are unreliable for pretrained LLMs due to the lack of non-member data from the same distribution as the member data. Exacerbating the situation further, dataset inference cannot be performed without such a non-member set. Finally, we lack formal post hoc auditing of training privacy guarantees. Previous differential privacy auditing methods are impractical since they rely on inserting specially crafted canary data during training, making audits of already pre-trained LLMs impossible without expensive retraining. This work introduces natural identifiers (NIDs) as a novel solution to these challenges. NIDs are structured random strings, such as SSH keys, cryptographic hashes, and shortened URLs, which naturally occur in common LLM training datasets. Their format enables the generation of unlimited additional random strings from the same distribution, which can act as non-members or alternative canaries for an audit. Leveraging this property, we show how NIDs support robust evaluation of membership inference attacks, enable dataset inference for any suspect set containing NIDs, and facilitate post hoc privacy auditing without retraining.
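Because NIDs are format-constrained random strings, fresh non-members can be generated on demand; the formats below are simplified illustrations of that idea, not the paper's exact construction:

```python
import secrets

def synthetic_nids(kind, n):
    """Generate fresh random strings in the same format as naturally occurring
    identifiers, for use as never-trained-on non-members or canaries."""
    if kind == "sha256":
        return [secrets.token_hex(32) for _ in range(n)]            # 64 hex chars
    if kind == "short_url":
        # "sho.rt" is a made-up shortener domain for illustration.
        return ["https://sho.rt/" + secrets.token_urlsafe(6)[:8] for _ in range(n)]
    raise ValueError(f"unknown NID kind: {kind}")

print(synthetic_nids("sha256", 2))
print(synthetic_nids("short_url", 2))
```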
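To make the idea concrete, the sketch below samples fresh SHA-256-style hex strings as controls and compares the audited model's average loss on them against its loss on NIDs observed in the suspect corpus. Here `model_loss` is a hypothetical scoring function (for example, mean per-token negative log-likelihood), and the hex format is just one example of an NID class, so treat this as an illustration rather than the authors' pipeline.

```python
import secrets
from statistics import mean

def fresh_sha256_like(n: int) -> list[str]:
    """Sample n random 64-character hex strings, i.e., strings drawn from the
    same distribution as SHA-256 hashes that appear in web-scraped corpora."""
    return [secrets.token_hex(32) for _ in range(n)]

def audit_gap(observed_nids: list[str], model_loss, n_controls: int = 1000) -> float:
    """Difference between the model's average loss on freshly sampled controls
    and on NIDs found in the suspect corpus; a large positive gap suggests the
    observed NIDs were seen during training. `model_loss` is assumed to map a
    string to a scalar loss under the audited LLM."""
    controls = fresh_sha256_like(n_controls)
    return mean(model_loss(s) for s in controls) - mean(model_loss(s) for s in observed_nids)
```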
OpenRAG: Optimizing RAG End-to-End via In-Context Retrieval Learning
Jiawei Zhou, Lei Chen
OpenReview
Retrieval-augmented generation (RAG) serves as a bridge connecting large language models (LLMs) to downstream data sources. Despite their widespread adoption, existing RAG frameworks typically pair off-the-shelf retrievers with LLMs without joint training. In this paper, we analyze and empirically show that the relevance learned for traditional information retrieval scenarios may not consistently apply to RAG scenarios. To bridge this gap, we introduce OpenRAG, a RAG framework that is optimized end-to-end by tuning the retriever to capture in-context, open-ended relevance, enabling adaptation to diverse and evolving needs. Extensive experiments across a wide range of tasks demonstrate that OpenRAG, by tuning a retriever end-to-end, yields a consistent 4.0% improvement over the original retriever and outperforms existing state-of-the-art retrievers by 2.1%. Additionally, our results show that for certain tasks, a 0.2B retriever tuned end-to-end can achieve improvements surpassing those of RAG-oriented or instruction-tuned 8B LLMs, underscoring the cost-effectiveness of our approach for improving RAG systems.
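As background for what end-to-end retriever tuning can look like in practice, here is a generic joint objective in the style of perplexity distillation: retriever scores over candidate documents are pushed toward the distribution implied by how much each document helps a frozen LLM produce the gold answer. This is a common recipe for end-to-end RAG training, not necessarily OpenRAG's exact loss; the tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def end_to_end_rag_loss(query_emb: torch.Tensor,
                        doc_embs: torch.Tensor,
                        answer_logprobs: torch.Tensor) -> torch.Tensor:
    """query_emb: (d,) retriever query embedding (trainable).
    doc_embs: (k, d) retriever embeddings of k candidate documents.
    answer_logprobs: (k,) log p_LLM(answer | query, doc_i) from a frozen LLM.
    The retriever's softmax over candidates is trained to match the
    LLM-derived relevance distribution via a KL divergence."""
    retriever_logprobs = F.log_softmax(doc_embs @ query_emb, dim=0)
    llm_relevance = F.softmax(answer_logprobs, dim=0)
    return F.kl_div(retriever_logprobs, llm_relevance, reduction="sum")
```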
RichSpace: Enriching Text-to-Video Prompt Space via Text Embedding Interpolation
Yuefan Cao, Chengyue Gong, Xiaoyu Li, Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song
OpenReview
Text-to-video generation models have made impressive progress, but they still struggle with generating videos with complex features. This limitation often arises from the inability of the text encoder to produce accurate embeddings, which hinders the video generation model. In this work, we propose a novel approach to overcome this challenge by selecting the optimal text embedding through interpolation in the embedding space. We demonstrate that this method enables the video generation model to produce the desired videos. Additionally, we introduce a simple algorithm using perpendicular foot embeddings and cosine similarity to identify the optimal interpolation embedding. Our findings highlight the importance of accurate text embeddings and offer a pathway for improving text-to-video generation performance.
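One way to read the "perpendicular foot" construction, offered here purely as an illustration and not as the authors' algorithm: project a reference embedding onto the line spanned by two prompt embeddings, then pick the interpolation with the highest cosine similarity to that projection.

```python
import numpy as np

def perpendicular_foot(e1: np.ndarray, e2: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Orthogonal projection of `target` onto the line through e1 and e2."""
    d = e2 - e1
    alpha = float(np.dot(target - e1, d) / np.dot(d, d))
    return e1 + alpha * d

def best_interpolation(e1: np.ndarray, e2: np.ndarray, target: np.ndarray, steps: int = 11) -> np.ndarray:
    """Among linear interpolations of e1 and e2, return the one with the
    highest cosine similarity to the perpendicular-foot embedding."""
    foot = perpendicular_foot(e1, e2, target)
    candidates = [(1.0 - t) * e1 + t * e2 for t in np.linspace(0.0, 1.0, steps)]
    cos = lambda a, b: np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return max(candidates, key=lambda c: cos(c, foot))
```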
Lightweight Dataset Pruning without Full Training via Example Difficulty and Prediction Uncertainty
Yeseul Cho, Baekrok Shin, Changmin Kang, Chulhee Yun
OpenReview
Recent advances in deep learning rely heavily on massive datasets, leading to substantial storage and training costs. Dataset pruning aims to alleviate this demand by discarding redundant examples. However, many existing methods require training a model on the full dataset for a large number of epochs before the dataset can be pruned, which ironically makes the pruning process more expensive than simply training the model on the entire dataset. To overcome this limitation, we introduce a Difficulty and Uncertainty-Aware Lightweight (DUAL) score, which aims to identify important samples from the early training stage by considering both example difficulty and prediction uncertainty. To address the catastrophic accuracy drop at extreme pruning ratios, we further propose ratio-adaptive sampling using the Beta distribution. Experiments on various datasets and learning scenarios such as image classification with label noise, image corruption, and model architecture generalization demonstrate the superiority of our method over previous state-of-the-art (SOTA) approaches. Specifically, on ImageNet-1k, our method reduces the time cost for pruning to 66% of that of previous methods while achieving SOTA performance, specifically 60% test accuracy at a 90% pruning ratio. On CIFAR datasets, the time cost is reduced to just 15% while maintaining SOTA performance.
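One plausible instantiation of a score that combines the two signals the abstract names, example difficulty and prediction uncertainty over early checkpoints, is sketched below; the actual DUAL formula may differ, so this is only meant to make the idea concrete.

```python
import numpy as np

def dual_style_score(p_true: np.ndarray) -> np.ndarray:
    """p_true: array of shape (n_checkpoints, n_examples) holding the
    probability assigned to each example's true label at early checkpoints.
    Difficulty = average (1 - p_true); uncertainty = std of p_true over
    checkpoints. Higher score = more informative example to keep."""
    difficulty = 1.0 - p_true.mean(axis=0)
    uncertainty = p_true.std(axis=0)
    return difficulty * uncertainty

def select_indices(scores: np.ndarray, prune_ratio: float) -> np.ndarray:
    """Keep the top (1 - prune_ratio) fraction of examples by score."""
    k = int(round(len(scores) * (1.0 - prune_ratio)))
    return np.argsort(scores)[::-1][:k]
```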
Diversity Measurement and Subset Selection for Instruction Tuning Datasets
Peiqi Wang, Yikang Shen, Zhen Guo, Matthew Stallone, Yoon Kim, Polina Golland, Rameswar Panda
OpenReview
We aim to select data subsets for the fine-tuning of large language models to more effectively follow instructions. Prior work has emphasized the importance of diversity in dataset curation but relied on heuristics such as the number of tasks. In this paper, we use determinantal point processes to capture the diversity and quality of instruction tuning datasets for subset selection. We propose to measure dataset diversity with the log determinant distance, i.e., the distance between the dataset of interest and a maximally diverse reference dataset. Our experiments demonstrate that the proposed diversity measure in the normalized weight gradient space is correlated with instruction-following performance. Consequently, it can be used to inform when data selection is the most helpful and to analyze dataset curation strategies. We demonstrate the utility of our data selection approach on various instruction tuning datasets.
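To make the determinant machinery concrete, the generic sketch below computes the log-determinant of a Gram matrix over (assumed row-normalized) feature vectors and greedily selects a subset with maximal marginal log-determinant gain; the paper's specific normalized-weight-gradient features and distance definition are not reproduced here.

```python
import numpy as np

def log_det_diversity(X: np.ndarray, eps: float = 1e-6) -> float:
    """X: (n, d) row-normalized feature matrix. Diversity of the set is the
    log-determinant of its regularized Gram matrix."""
    K = X @ X.T + eps * np.eye(len(X))
    return float(np.linalg.slogdet(K)[1])

def greedy_dpp_select(X: np.ndarray, k: int) -> list[int]:
    """Greedy MAP selection for a DPP: repeatedly add the item with the
    largest marginal gain in log-determinant (unoptimized, for clarity)."""
    selected: list[int] = []
    for _ in range(k):
        gains = []
        for i in range(len(X)):
            if i in selected:
                gains.append(-np.inf)
            else:
                gains.append(log_det_diversity(X[selected + [i]]))
        selected.append(int(np.argmax(gains)))
    return selected
```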
Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs?
Simon Park, Abhishek Panigrahi, Yun Cheng, Dingli Yu, Anirudh Goyal, Sanjeev Arora
OpenReview
Vision Language Models (VLMs) are impressive at visual question answering and image captioning. But they underperform on multi-step visual reasoning—even compared to LLMs on the same tasks presented in text form—giving rise to perceptions of modality imbalance or brittleness.
Towards a systematic study of such issues, we introduce a synthetic framework for assessing the ability of VLMs to perform algorithmic visual reasoning, comprising three tasks: Table Readout, Grid Navigation, and Visual Analogy. Each has two levels of difficulty, SIMPLE and HARD, and even the SIMPLE versions are difficult for frontier VLMs. We propose strategies for training on the SIMPLE version of tasks that improve performance on the corresponding HARD task, i.e., simple-to-hard (S2H) generalization. This controlled setup, where each task also has an equivalent text-only version, allows a quantification of the modality imbalance and how it is impacted by training strategy. We show that 1) explicit image-to-text conversion is important in promoting S2H generalization on images, by transferring reasoning from text; 2) conversion can be internalized at test time. We also report results of mechanistic study of this phenomenon. We identify measures of gradient alignment that can identify training strategies that promote better S2H generalization. Ablations highlight the importance of chain-of-thought.
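A simple proxy for the gradient-alignment measures mentioned above, stated here as one reasonable reading rather than the paper's exact metric, is the cosine similarity between the parameter gradients of the SIMPLE-task and HARD-task losses.

```python
import torch

def gradient_alignment(model: torch.nn.Module,
                       loss_simple: torch.Tensor,
                       loss_hard: torch.Tensor) -> float:
    """Cosine similarity between the gradients of the SIMPLE-task loss and the
    HARD-task loss; higher alignment suggests that descending the SIMPLE
    objective also descends the HARD objective."""
    params = [p for p in model.parameters() if p.requires_grad]
    g_s = torch.cat([g.reshape(-1) for g in torch.autograd.grad(loss_simple, params, retain_graph=True)])
    g_h = torch.cat([g.reshape(-1) for g in torch.autograd.grad(loss_hard, params)])
    return torch.nn.functional.cosine_similarity(g_s, g_h, dim=0).item()
```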
Proper Dataset Valuation by Pointwise Mutual Information
SHURAN ZHENG, Xuan Qi, Rui Ray Chen, Yongchan Kwon, James Zou
OpenReview
Data plays a central role in the development of modern artificial intelligence, with high-quality data emerging as a key driver of model performance. This has prompted the development of various data curation methods in recent years. However, measuring the effectiveness of these data curation techniques remains a major challenge. Traditional evaluation methods, which assess a trained model's performance on specific benchmarks, risk promoting practices that merely make the data more similar to the test data. This issue exemplifies Goodhart’s law: when a measure becomes a target, it ceases to be a good measure. To address this, we propose an information-theoretic framework for evaluating data curation methods, where dataset quality is measured by its informativeness about the true model parameters using the Blackwell ordering. We compare informativeness by the Shannon mutual information of the evaluated data and the test data, and we propose a novel method for estimating the mutual information of datasets by training Bayesian models on embedded data and computing the mutual information from the model’s parameter posteriors. Experiments on real-world data demonstrate that our mutual information-based evaluation assigns appropriately lower scores to data curation strategies that reduce dataset informativeness, while traditional test score-based evaluation methods may favor data curation strategies that overfit to the test set but compromise the training data's informativeness.
Privacy Attacks on Image AutoRegressive Models
Antoni Kowalczuk, Jan Dubiński, Franziska Boenisch, Adam Dziedzic
OpenReview
Image AutoRegressive generation has emerged as a powerful new paradigm, with image autoregressive models (IARs) surpassing state-of-the-art diffusion models (DMs) in both image quality (FID: 1.48 vs. 1.58) and generation speed. However, the privacy risks associated with IARs remain unexplored, raising concerns regarding their responsible deployment. To address this gap, we conduct a comprehensive privacy analysis of IARs, comparing their privacy risks to those of DMs as reference points. Concretely, we develop a novel membership inference attack (MIA) that achieves a remarkably high success rate in detecting training images (with a TPR@FPR=1% of 86.38% vs. 4.91% for DMs with comparable attacks). We leverage our novel MIA to provide dataset inference (DI) for IARs, and show that it requires as few as 6 samples to detect dataset membership (compared to 200 for DI in DMs), confirming a higher information leakage in IARs. Finally, we are able to extract hundreds of training data points from an IAR (e.g., 698 from VAR-d30). Our results demonstrate a fundamental privacy-utility trade-off: while IARs excel in image generation quality and speed, they are significantly more vulnerable to privacy attacks than DMs. This trend suggests that utilizing techniques from DMs within IARs, such as modeling the per-token probability distribution using a diffusion procedure, holds potential to help mitigate IARs' vulnerability to privacy attacks.
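The headline metric, true-positive rate at a fixed low false-positive rate, can be computed from per-sample attack scores as follows; this is a generic evaluation utility, not the attack itself.

```python
import numpy as np

def tpr_at_fpr(member_scores, nonmember_scores, target_fpr: float = 0.01) -> float:
    """Attack scores where higher means 'more likely a member'. The decision
    threshold is set so that the false-positive rate on non-members equals
    target_fpr, and the resulting true-positive rate on members is returned."""
    nonmember_scores = np.asarray(nonmember_scores)
    threshold = np.quantile(nonmember_scores, 1.0 - target_fpr)
    return float(np.mean(np.asarray(member_scores) > threshold))
```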
Chameleon: A Flexible Data-mixing Framework for Language Model Pretraining and Finetuning
Wanyun Xie, Francesco Tonin, Volkan Cevher
OpenReview
Training data mixtures greatly impact the generalization performance of large language models. Existing domain reweighting methods often rely on costly weight computations and require retraining when new data is introduced. To address this, we introduce a flexible and efficient data mixing framework, Chameleon, that employs leverage scores to quantify domain importance within a learned embedding space. We first construct a domain affinity matrix over domain embeddings. The induced leverage scores determine a mixture that upweights domains sharing common representations in embedding space. This formulation allows direct transfer to new data by computing the new domain embeddings. In experiments, we demonstrate improvements in three key scenarios: (i) our computed weights improve performance on pretraining domains with a fraction of the compute of existing methods; (ii) Chameleon can adapt to data changes without proxy retraining, boosting few-shot reasoning accuracies when transferred to new data; (iii) our method enables efficient domain reweighting in finetuning, consistently improving test perplexity on all finetuning domains over a uniform mixture.
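A minimal sketch of how leverage scores over a domain affinity matrix can be turned into mixture weights; the ridge parameter and the final normalization are assumptions made for illustration, not the paper's exact procedure.

```python
import numpy as np

def domain_mixture_from_embeddings(E: np.ndarray, ridge: float = 1e-3) -> np.ndarray:
    """E: (n_domains, d) domain embeddings. Build the affinity matrix
    K = E E^T, take ridge leverage scores diag(K (K + ridge I)^{-1}),
    and normalize them into sampling weights over domains."""
    K = E @ E.T
    scores = np.diag(K @ np.linalg.inv(K + ridge * np.eye(len(K))))
    return scores / scores.sum()
```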
BenchAgents: Automated Benchmark Creation with Agent Interaction
Natasha Butt, Varun Chandrasekaran, Neel Joshi, Besmira Nushi, Vidhisha Balachandran
OpenReview
Evaluations are limited by benchmark availability. As models evolve, there is a need to create benchmarks that can measure progress on new generative capabilities. However, creating new benchmarks through human annotations is slow and expensive, restricting comprehensive evaluations for any capability. We introduce $\texttt{BenchAgents}$, a framework that methodically leverages large language models (LLMs) to automate benchmark creation for complex capabilities while inherently ensuring data and metric quality. $\texttt{BenchAgents}$ decomposes the benchmark creation process into planning, generation, data verification, and evaluation, each of which is executed by an LLM agent. These agents interact with each other and utilize human-in-the-loop feedback from benchmark developers to explicitly improve and flexibly control data diversity and quality. We use $\texttt{BenchAgents}$ to create benchmarks to evaluate capabilities related to planning and constraint satisfaction during text generation. We then use these benchmarks to study seven state-of-the-art models and extract new insights on common failure modes and model differences.
Position: What's the next frontier for Data-centric AI? Data Savvy Agents!
Nabeel Seedat, Jiashuo Liu, Mihaela van der Schaar
OpenReview
The recent surge in AI agents that autonomously communicate, collaborate with humans and use diverse tools has unlocked promising opportunities in various real-world settings. However, a vital aspect remains underexplored: how agents handle data. Agents cannot achieve scalable autonomy without the ability to dynamically acquire, process, and continually evolve their data ecosystems to navigate complex and changing environments. In this position paper, we argue that data-savvy capabilities should be a top priority in the design of agentic systems to ensure reliable real-world deployment. Specifically, we propose four key capabilities to realize this vision: (1) Proactive data acquisition: enabling agents to autonomously gather task-critical knowledge or solicit human input to address data gaps; (2) Sophisticated data processing: requiring context-aware and flexible handling of diverse data challenges and inputs; (3) Interactive test data synthesis: shifting from static benchmarks to dynamically generated interactive test data for agent evaluation; and (4) Continual adaptation: empowering agents to iteratively refine their data and background knowledge to adapt to shifting environments. While current agent research predominantly emphasizes reasoning, we hope this work inspires a broader reflection on the role of data-savvy agents as the next frontier in data-centric AI.
Towards Human-Guided, Data-Centric LLM Co-Pilots
Evgeny Saveliev, Jiashuo Liu, Nabeel Seedat, Anders Boyd, Mihaela van der Schaar
OpenReview
Machine learning (ML) has the potential to revolutionize various domains and industries, but its adoption is often hindered by the disconnect between the needs of domain experts and translating these needs into robust and valid ML tools. Despite recent advances in LLM-based co-pilots to democratize ML for non-technical domain experts, these systems remain predominantly focused on model-centric aspects while overlooking critical data-centric challenges. This limitation is problematic in complex real-world settings where raw data often contains complex issues, such as missing values, label noise, and domain-specific nuances requiring tailored handling. To address this we introduce CliMB-DC, a human-guided, data-centric framework for LLM co-pilots that combines advanced data-centric tools with LLM-driven reasoning to enable robust, context-aware data processing. At its core, CliMB-DC introduces a novel, multi-agent reasoning system that combines a strategic coordinator for dynamic planning and adaptation with a specialized worker agent for precise execution. Domain expertise is then systematically incorporated to guide the reasoning process using a human-in-the-loop approach. To guide development, we formalize a taxonomy of key data-centric challenges that co-pilots must address. Thereafter, to address the dimensions of the taxonomy, we integrate state-of-the-art data-centric tools into an extensible, open-source architecture, facilitating the addition of new tools from the research community. Empirically, using real-world healthcare datasets we demonstrate CliMB-DC's ability to transform uncurated datasets into ML-ready formats, significantly outperforming existing co-pilot baselines for handling data-centric challenges. CliMB-DC promises to empower domain experts from diverse domains — healthcare, finance, social sciences and more — to actively participate in driving real-world impact using ML.
Revisiting Semi-supervised Adversarial Robustness via Noise-aware Online Robust Distillation
Tsung-Han Wu, Hung-Ting Su, Shang-Tse Chen, Winston H. Hsu
OpenReview
Training adversarially robust models under a low-labeling regime is crucial for real-world deployment. Robust self-training (RST), with standard training for pseudo labels followed by adversarial robust training, has emerged as a key paradigm in this setting. Recent advancements in RST primarily focus on leveraging strong pre-trained models to improve robustness and performance. However, we find that these methods often overlook the critical role of pseudo labels in the training pipeline, leading to worse results on extremely low labeling regimes (< 5\%). In this work, we introduce SNORD, a simple yet effective approach that significantly improves robustness by enhancing pseudo-label quality in the first stage and effectively managing label noise in the second stage leveraging advanced standard semi-supervised learning techniques. Experiments on CIFAR-10, CIFAR-100, and TinyImageNet-200 demonstrate that SNORD outperforms prior methods by up to 22\% in robust accuracy under low-labeling conditions. Furthermore, compared to fully supervised adversarial training, SNORD achieves 90\% relative robust accuracy under $\ell_{\infty} = 8/255$ AutoAttack, requiring only 0.1\%, 2\%, and 10\% labeled data for the three commonly used benchmarks, respectively. Additional analyses validate the contribution of each component and show that SNORD can be seamlessly integrated with existing adversarial pretraining strategies to further enhance robustness.
Differentially Private Synthetic Data via APIs 3: Using Simulators Instead of Foundation Model
Zinan Lin, Tadas Baltrusaitis, Sergey Yekhanin
OpenReview
Differentially private (DP) synthetic data, which closely resembles the original private data while maintaining strong privacy guarantees, has become a key tool for unlocking the value of private data without compromising privacy. Recently, Private Evolution (PE) has emerged as a promising method for generating DP synthetic data. Unlike other training-based approaches, PE only requires access to inference APIs from foundation models, enabling it to harness the power of state-of-the-art models. However, a suitable foundation model for a specific private data domain is not always available. In this paper, we discover that the PE framework is sufficiently general to allow inference APIs beyond foundation models. Specifically, we show that simulators—such as computer graphics-based image synthesis tools—can also serve as effective APIs within the PE framework. This insight greatly expands the applicability of PE, enabling the use of a wide variety of domain-specific simulators for DP data synthesis. We explore the potential of this approach, named Sim-PE, in the context of image synthesis. Across three diverse simulators, Sim-PE performs well, improving the downstream classification accuracy of PE by up to 3x and reducing the FID score by up to 80%. We also show that simulators and foundation models can be easily leveraged together within the PE framework to achieve further improvements.
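For orientation, the sketch below shows one iteration of a private-evolution-style loop in which the generation API is an arbitrary simulator: each private embedding votes for its nearest synthetic sample, Gaussian noise makes the histogram differentially private, and the next population is resampled and varied accordingly. The `render` and `vary` callables and the noise scale are placeholders, not the Sim-PE implementation.

```python
import numpy as np

def pe_step(private_feats, synth_feats, synth_params, render, vary, sigma=1.0, rng=None):
    """One private-evolution-style iteration.
    private_feats: (n, d) embeddings of private data (never released).
    synth_feats:   (m, d) embeddings of the current synthetic samples.
    synth_params:  list of m parameter sets fed to the simulator `render`.
    render(params) -> sample and vary(params) -> perturbed params are placeholders."""
    rng = rng if rng is not None else np.random.default_rng()
    # 1. Nearest-neighbor voting: each private point picks its closest synthetic sample.
    dists = np.linalg.norm(private_feats[:, None, :] - synth_feats[None, :, :], axis=-1)
    votes = np.bincount(dists.argmin(axis=1), minlength=len(synth_feats)).astype(float)
    # 2. DP histogram: add Gaussian noise, clip negatives, renormalize.
    hist = np.clip(votes + rng.normal(0.0, sigma, size=votes.shape), 0.0, None)
    probs = hist / hist.sum() if hist.sum() > 0 else np.full(len(hist), 1.0 / len(hist))
    # 3. Resample parameters proportionally to the noisy histogram and vary them.
    idx = rng.choice(len(synth_params), size=len(synth_params), p=probs)
    new_params = [vary(synth_params[i]) for i in idx]
    return new_params, [render(p) for p in new_params]
```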
KGGen: Text To Knowledge Graph
Belinda Mo, Kyssen Yu, Joshua Kazdan, Proud Mpala, Lisa Yu, Chris Cundy, Charilaos Kanatsoulis, Sanmi Koyejo
OpenReview
Recent interest in building foundation models for KGs has highlighted a fundamental challenge: knowledge-graph data is relatively scarce. The best-known KGs are primarily human-labeled, created by pattern-matching, or extracted using early NLP techniques. While human-generated KGs are in short supply, automatically extracted KGs are of questionable quality. We present a solution to this data scarcity problem in the form of a text-to-KG generator (KGGen), a package that uses language models to create high-quality graphs from plaintext. Unlike other KG extractors, KGGen clusters related entities to reduce sparsity in extracted KGs. KGGen is available as a Python package (pip install NAME REDACTED), making it accessible to anyone with an OpenAI API key. Along with KGGen, we release the first benchmark that tests an extractor's ability to produce a useful KG from plain text. We benchmark our new tool against existing extractors and demonstrate far superior performance.
Toward Efficient Influence Function: Dropout as a Compression Tool
Yuchen Zhang, Mohammad Mohammadi Amiri
OpenReview
Assessing the impact of training data points on machine learning models is crucial for understanding model behavior and enhancing the transparency of modern models. The influence function provides a theoretical framework for quantifying the effect of individual training data points on a model's performance on specific test data points. However, the computational cost of influence functions presents significant challenges, particularly for large-scale models. In this work, we introduce a novel approach that leverages dropout as a gradient compression mechanism to compute influence functions more efficiently. Our method significantly reduces computational and memory overhead, not only during the influence function computation but also in the compression process itself. Through theoretical analysis and empirical validation, we demonstrate that using dropout as a compression tool in influence function computation preserves critical components of the data influence and enables its application to modern large-scale models.
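To illustrate how a fixed dropout mask can act as a gradient compressor inside an influence computation, here is a first-order sketch in which the inverse-Hessian factor is replaced by the identity; that simplification and the rescaling constant are assumptions for clarity, not the paper's method.

```python
import numpy as np

def make_mask(dim: int, keep_prob: float, seed: int = 0) -> np.ndarray:
    """A fixed Bernoulli mask shared across all gradients; only the kept
    coordinates are ever stored or compared."""
    rng = np.random.default_rng(seed)
    return rng.random(dim) < keep_prob

def compressed_influence(train_grad: np.ndarray, test_grad: np.ndarray,
                         mask: np.ndarray, keep_prob: float) -> float:
    """First-order influence proxy: inner product of the masked gradients,
    rescaled by 1/keep_prob so it stays unbiased in expectation."""
    return float(np.dot(train_grad[mask], test_grad[mask])) / keep_prob
```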
Data Efficient Pre-training for Language Models: An Empirical Study of Compute Efficiency and Linguistic Competence
Andreas Paraskeva, Max Johannes van Duijn, Maarten de Rijke, Suzan Verberne, Jan N. van Rijn
OpenReview
Training large language models such as Llama is compute- and data-intensive, limiting optimisation, hindering low-resource training, and increasing environmental impact. This paper examines pre-training effectiveness on small, curated datasets based on (i) linguistic competence and (ii) compute efficiency. We compare the use of two small curated datasets for pre-training decoder-only Llama models of different sizes. The first dataset, TinyStories, is a collection of ChatGPT-generated children's stories. The second dataset, BabyLM, is a small, open-domain dataset used for training language models in the BabyLM challenge. We perform experiments with increasing amounts of data (yielding a learning curve) and size-variants of a Llama-based architecture. We found that Llama models trained on BabyLM outperform Llama models trained on TinyStories in formal linguistic competence. However, both datasets yield comparable results on functional linguistic tasks. Our analysis generally indicates more robust training on BabyLM, with lower observed variance across training instances. These findings suggest promising directions for data-efficient pre-training of language models. Narrative data benefits early-stage training and its inclusion in curriculum learning settings is worth investigating. BabyLM shows potential in resource-constrained settings by helping select promising candidate models, given that small data samples appear representative of the model's ultimate performance. Future work will expand datasets and benchmarks to validate these insights.
Utilizing Language Models For Synthetic Knowledge Graph Generation
Shuran Fu, Peihua Mai, Zhang Jingqi, Yan Pang
OpenReview
Knowledge Graphs play a pivotal role in various machine-learning tasks. However, constructing these datasets is challenging due to their semantic and structural complexity, often resulting in limited data size. Synthetic graph generation has been applied to augment graph datasets and has proven beneficial in domains such as social network analysis and recommendation systems. Despite this, generating graphs with extensive textual attributes remains underexplored. Large language models (LLMs) possess the capability to generate text and reason about complex data structures, including graphs. In this paper, we leverage the generative and reasoning abilities of LLMs to propose a novel framework for synthetic knowledge graph generation. Our framework integrates two transformers and a text data augmentation module, where prompting and fine-tuning approaches are used to generate sentences and the Mahalanobis distance is applied to detect outliers. The framework is straightforward to apply and highly flexible, and it can effectively generate graph datasets whose triple distribution closely matches that of the real data. We combine the generated data with real data by either concatenation or mixing, and through extensive experiments on downstream tasks we demonstrate the effectiveness and versatility of our approach.
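The outlier-filtering step can be pictured with a short sketch: compute the Mahalanobis distance of each generated embedding to the real-data distribution and drop the ones beyond a threshold. The threshold value here is an arbitrary choice for illustration.

```python
import numpy as np

def mahalanobis_filter(real_emb: np.ndarray, gen_emb: np.ndarray, max_dist: float = 3.0):
    """Keep generated embeddings whose Mahalanobis distance to the real-data
    distribution (mean mu, covariance S) is below max_dist."""
    mu = real_emb.mean(axis=0)
    S_inv = np.linalg.pinv(np.cov(real_emb, rowvar=False))
    diffs = gen_emb - mu
    dists = np.sqrt(np.einsum("ij,jk,ik->i", diffs, S_inv, diffs))
    return gen_emb[dists < max_dist], dists
```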
STAMP Your Content: Proving Dataset Membership via Watermarked Rephrasings
Saksham Rastogi, Pratyush Maini, Danish Pruthi
OpenReview
Given how large parts of the publicly available text are crawled to pretrain large language models (LLMs), creators increasingly worry about the inclusion of their proprietary data for model training without attribution or licensing. Their concerns are also shared by benchmark curators whose test-sets might be compromised. In this paper, we present STAMP, a framework for detecting dataset membership—i.e., determining the inclusion of a dataset in the pretraining corpora of LLMs. Given an original piece of content, our proposal involves generating multiple watermarked rephrases such that a distinct watermark is embedded in each rephrasing. One version is released publicly while others are kept private. Subsequently, creators can compare model likelihoods between public and private versions using paired statistical tests to prove membership. We show that our framework can successfully detect contamination across four benchmarks which appear only once in the training data and constitute less than 0.001% of the total tokens, outperforming several contamination detection and dataset inference baselines. We verify that our approach preserves both the semantic meaning and the utility of benchmarks in comparing different models. We apply STAMP to two real-world scenarios to confirm the inclusion of paper abstracts and blog articles in the pretraining corpora.
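The membership test reduces to a paired comparison of per-document scores between the released (public) rephrasing and its held-back (private) counterparts; a generic version with a one-sided paired t-test is sketched below, where `loglik` stands in for whatever likelihood score the suspect model exposes.

```python
from scipy import stats

def stamp_style_test(public_docs, private_docs, loglik):
    """public_docs[i] and private_docs[i] are watermarked rephrasings of the
    same source document; `loglik(text)` is the suspect model's log-likelihood
    score (a placeholder). If the public versions were trained on, their
    scores should be systematically higher, giving a small one-sided p-value."""
    pub = [loglik(d) for d in public_docs]
    priv = [loglik(d) for d in private_docs]
    t_stat, p_value = stats.ttest_rel(pub, priv, alternative="greater")
    return t_stat, p_value
```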
How much of my dataset did you use? Quantitative Data Usage Inference in Machine Learning
Yao Tong, Jiayuan Ye, Sajjad Zarifzadeh, Reza Shokri
OpenReview
How much of a given dataset was used to train a machine learning model? This is a critical question for data owners assessing the risk of unauthorized data usage and protecting their rights (United States Code, 1976). However, previous work mistakenly treats this as a binary problem—inferring whether all-or-none or any-or-none of the data was used—which is fragile when faced with real, non-binary data usage risks. To address this, we propose a fine-grained analysis called Dataset Usage Cardinality Inference (DUCI), which estimates the exact proportion of data used. Our algorithm, leveraging debiased membership guesses, matches the performance of the optimal MLE approach (with a maximum error <0.1) but with significantly lower (e.g., $300 \times$ less) computational cost.
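The debiasing idea can be illustrated with a standard method-of-moments correction: if a per-record membership guesser has known true-positive and false-positive rates (for instance, estimated on reference models), the fraction of the dataset actually used can be recovered from the average guess. This is a generic estimator in the spirit of the abstract, not necessarily the paper's exact algorithm.

```python
import numpy as np

def debiased_usage_proportion(guesses, tpr: float, fpr: float) -> float:
    """guesses: 0/1 membership predictions for each record of the dataset.
    Since E[mean(guesses)] = p * TPR + (1 - p) * FPR, solving for p gives a
    debiased estimate of the used proportion, clipped to [0, 1]."""
    g_bar = float(np.mean(guesses))
    p_hat = (g_bar - fpr) / max(tpr - fpr, 1e-12)
    return float(np.clip(p_hat, 0.0, 1.0))
```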
Blind Baselines Beat Membership Inference Attacks for Foundation Models
Debeshee Das, Jie Zhang, Florian Tramèr
OpenReview
Membership inference (MI) attacks try to determine if a data sample was used to train a machine learning model. For foundation models trained on unknown Web data, MI attacks are often used to detect copyrighted training materials, measure test set contamination, or audit machine unlearning. Unfortunately, we find that evaluations of MI attacks for foundation models are flawed, because they sample members and non-members from different distributions. For 8 published MI evaluation datasets, we show that blind attacks—that distinguish the member and non-member distributions without looking at any trained model—outperform state-of-the-art MI attacks. Existing evaluations thus tell us nothing about membership leakage of a foundation model’s training data.
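The flaw is easy to check in practice: a "blind" classifier that never queries the target model should not be able to separate members from non-members if the two sets really come from the same distribution. A minimal bag-of-words check (the dataset variables are placeholders):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def blind_attack_auc(member_texts, nonmember_texts) -> float:
    """Cross-validated AUC of a model-free classifier trained on surface
    text features alone; values far above 0.5 mean the member/non-member
    split is distinguishable without any trained model."""
    texts = list(member_texts) + list(nonmember_texts)
    labels = [1] * len(member_texts) + [0] * len(nonmember_texts)
    X = TfidfVectorizer(max_features=20000).fit_transform(texts)
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, labels, cv=5, scoring="roc_auc").mean()
```

An AUC well above 0.5 from such a classifier indicates that apparent MI attack success on that evaluation set may simply reflect distribution shift rather than membership leakage.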
Synthesizing Physical Backdoor Datasets: An Automated Framework Leveraging Deep Generative Models
Sze Jue Yang, Chinh Duc La, Quang H Nguyen, Eugene Bagdasarian, Kok-Seng Wong, Anh Tuan Tran, Chee Seng Chan, Khoa D Doan
OpenReview
Backdoor attacks, representing an emerging threat to the integrity of deep neural networks, have garnered significant attention due to their ability to compromise deep learning systems clandestinely. While numerous backdoor attacks occur within the digital realm, their practical implementation in real-world prediction systems remains limited and vulnerable to disturbances in the physical world. Consequently, this limitation has given rise to the development of physical backdoor attacks, where trigger objects manifest as physical entities within the real world. However, creating the requisite dataset to train or evaluate a physical backdoor model is a daunting task, limiting the backdoor researchers and practitioners from studying such physical attack scenarios. This paper unleashes a framework that empowers backdoor researchers to effortlessly create a malicious, physical backdoor dataset based on advances in generative modeling. Particularly, this framework involves 3 automatic modules: suggesting the suitable physical triggers, generating the poisoned candidate samples (either by synthesizing new samples or editing existing clean samples), and finally refining for the most plausible ones. As such, it effectively mitigates the perceived complexity associated with creating a physical backdoor dataset, transforming it from a daunting task into an attainable objective. Extensive experiment results show that datasets created by our framework enable researchers to achieve an impressive attack success rate on real physical world data and exhibit similar properties compared to previous physical backdoor attack studies. This paper offers researchers a valuable toolkit for studies of physical backdoors, all within the confines of their laboratories.
MMA: Benchmarking Multi-Modal Large Language Model in Ambiguity Contexts
Ru Wang, Selena Song, Liang Ding, Mingming Gong, Yusuke Iwasawa, Yutaka Matsuo, Jiaxian Guo
OpenReview
[Show Abstract]
Multi-Modal Large Language Models (MLLMs) have recently demonstrated strong capabilities in both instruction comprehension and responding, positioning them as promising tools for human-computer interaction. However, the inherent ambiguity of language poses a challenge, potentially leading models astray in task implementation due to differing interpretations of the same text within varying contexts. In multi-modal settings, visual information serves as a natural aid in disambiguating such scenarios. In this paper, we introduce the first benchmark specifically designed to evaluate the performance of \textbf{M}LL\textbf{M}s in \textbf{A}mbiguous contexts (MMA). This benchmark employs a multiple-choice visual question-answering format and includes 261 textual contexts and questions with ambiguous meaning. Each question is linked to a pair of images that suggest divergent scenarios, thus leading to different answers to the same question. These questions are stratified into three categories of ambiguity: lexical, syntactic, and semantic, to facilitate a detailed examination of MLLM performance across varying levels of ambiguity. By evaluating 24 proprietary and open-sourced MLLMs, we find that: (1) MLLMs often overlook scenario-specific information provided by images to clarify the ambiguity of texts. When presented with two different contextual images and asked the same question, MLLMs achieved an accuracy rate of only 53.22% in answering both correctly, compared to human performance at 88.97%. (2) Among the three types of ambiguity, models perform best under lexical ambiguity and worst under syntactic ambiguity. (3) Open-sourced models generally perform significantly worse than proprietary MLLMs, with an average performance gap of 12.59%; Claude 3.5 Sonnet emerges as the top model, achieving 74.32% accuracy. These findings underscore the current limitations of MLLMs in integrating visual information to clarify textual ambiguities and highlight critical areas for future improvement. The code and benchmark data are available at https://github.com/AnonymousSubmitter-gpu/MMA_Anony
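To make the paired-accuracy measurement described above concrete, here is a minimal sketch of computing the share of image pairs where both answers are correct; the field names are illustrative stand-ins, not the benchmark's actual schema.

```python
# Illustrative computation of the "both images answered correctly" accuracy
# used in ambiguity benchmarks like MMA. Field names are hypothetical.

def paired_accuracy(items):
    """items: list of dicts with predictions and gold answers for the two
    images that share the same ambiguous question."""
    both_correct = sum(
        1 for it in items
        if it["pred_img1"] == it["gold_img1"] and it["pred_img2"] == it["gold_img2"]
    )
    return both_correct / len(items)

example = [
    {"pred_img1": "A", "gold_img1": "A", "pred_img2": "B", "gold_img2": "B"},
    {"pred_img1": "A", "gold_img1": "A", "pred_img2": "A", "gold_img2": "C"},
]
print(paired_accuracy(example))  # 0.5: only the first pair is fully resolved
```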
Improving Influence-based Instruction Tuning Data Selection for Balanced Learning of Diverse Capabilities
Qirun Dai, Dylan Zhang, Jiaqi W. Ma, Hao Peng
OpenReview
[Show Abstract]
Selecting appropriate training data is crucial for effective instruction fine-tuning of large language models (LLMs), which aims to (1) elicit strong capabilities, and (2) achieve balanced performance across a diverse range of tasks. Influence-based methods show promise in achieving (1) by estimating the contribution of each training example to the model's predictions, but often struggle with (2). Our systematic investigation reveals that this underperformance can be attributed to an inherent bias where certain tasks intrinsically have greater influence than others. As a result, data selection is often biased towards these tasks, not only hurting the model's performance on others but also, counterintuitively, harming performance on these high-influence tasks themselves.
As a remedy, we propose BIDS, a Balanced and Influential Data Selection algorithm. BIDS first normalizes influence scores of the training data, and then iteratively balances data selection by choosing the training example with the highest influence on the most underrepresented task. Experiments with both Llama-3 and Mistral-v0.3 on seven benchmarks spanning five diverse capabilities show that BIDS consistently outperforms both state-of-the-art influence-based algorithms and other non-influence-based selection frameworks. Surprisingly, training on a 15% subset selected by BIDS can even outperform full-dataset training with a much more balanced performance. Our analysis further highlights the importance of both instance-level normalization and iterative optimization of selected data for balanced learning of diverse capabilities.
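As a rough illustration of the selection loop described above, the sketch below assumes a precomputed (n_train, n_tasks) influence matrix and greedily picks examples for whichever capability is currently most underrepresented; the normalization and update rule are simplified stand-ins for BIDS's actual design.

```python
import numpy as np

def balanced_selection(influence, budget):
    """Toy influence-balanced selection in the spirit of BIDS.

    influence: (n_train, n_tasks) matrix of estimated influence scores,
    assumed precomputed by some attribution method. Scores are first
    normalized per training instance, then examples are picked greedily
    for whichever task has accumulated the least influence so far.
    """
    norm = influence / (np.linalg.norm(influence, axis=1, keepdims=True) + 1e-8)
    n_train, n_tasks = norm.shape
    selected, accumulated = [], np.zeros(n_tasks)
    remaining = set(range(n_train))
    for _ in range(budget):
        target = int(np.argmin(accumulated))           # most underrepresented task
        idx = max(remaining, key=lambda i: norm[i, target])
        selected.append(idx)
        remaining.remove(idx)
        accumulated += norm[idx]
    return selected

rng = np.random.default_rng(0)
print(balanced_selection(rng.normal(size=(100, 5)), budget=10))
```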
Investigating Memorization in Video Diffusion Models
Chen Chen, Enhuai Liu, Daochang Liu, Mubarak Shah, Chang Xu
OpenReview
[Show Abstract]
Diffusion models, widely used for image and video generation, face a significant limitation: the risk of memorizing and reproducing training data during inference, potentially generating unauthorized copyrighted content. While prior research has focused on image diffusion models (IDMs), video diffusion models (VDMs) remain underexplored. To address this gap, we first formally define the two types of memorization in VDMs (content memorization and motion memorization) in a practical way that focuses on privacy preservation and applies to all generation types. We then introduce new metrics specifically designed to separately assess content and motion memorization in VDMs. Additionally, we curate a dataset of text prompts that are most prone to triggering memorization when used as conditioning in VDMs. By leveraging these prompts, we generate diverse videos from various open-source VDMs, successfully extracting numerous training videos from each tested model. Through the application of our proposed metrics, we systematically analyze memorization across various pretrained VDMs, including text-conditional and unconditional models, on a variety of datasets. Our comprehensive study reveals that memorization is widespread across all tested VDMs, indicating that VDMs can also memorize image training data in addition to video datasets. Finally, we propose efficient and effective detection strategies for both content and motion memorization, offering a foundational approach for improving privacy in VDMs. Code will be made available.
Revisiting Multi-Modal LLM Evaluation
Jian Lu, Shikhar Srivastava, Junyu Chen, Robik Singh Shrestha, Manoj Acharya, Kushal Kafle, Christopher Kanan
OpenReview
[Show Abstract]
With the advent of multi-modal large language models (MLLMs), datasets used for visual question answering (VQA) and referring expression comprehension have seen a resurgence. However, the most popular datasets used to evaluate MLLMs are some of the earliest ones created (VQAv2, GQA, TextVQA et al.) and they have many known problems, including extreme bias, spurious correlations, and an inability to permit fine-grained analysis. In this paper, we pioneer evaluating recent MLLMs (LLaVA-OneVision, MiniGemini, CogVLM, GPT-4V et al.) on datasets designed to address weaknesses in earlier ones. We assess three VQA datasets: 1) TDIUC, which permits fine-grained analysis on 12 question types; 2) TallyQA, which has simple and complex counting questions; and 3) DVQA, which requires optical character recognition for chart understanding. We also study VQDv1, a dataset that crucially requires identifying all image regions that satisfy a given query. Our experiments reveal the weaknesses of many MLLMs that have not previously been reported.
Tracing the Misuse of Personalized Textual Embeddings for Text-to-Image Models
Weitao Feng, Jiyan He, Jie Zhang, Tianyi Wei, Wenbo Zhou, Qing Guo, Weiming Zhang, Tianwei Zhang, Nenghai Yu
OpenReview
[Show Abstract]
Text-to-Image (T2I) models have achieved great success in generating high-quality images with diverse prompts. The emerging personalized textual embedding technology further empowers T2I models to create realistic images based on users' personalized concepts. This leads to a new AI business, with many commercial platforms for sharing or selling valuable personalized embeddings. However, this powerful technology comes with potential risks. Malicious users might exploit personalized textual embeddings to generate illegal content. To address this concern, these public platforms need reliable methods to trace and hold bad actors accountable. In this paper, we introduce concept watermarking, a novel approach that embeds robust watermarks into images generated from personalized embeddings. Specifically, an encoder embeds watermarks in the embedding space, while a decoder extracts these watermarks from generated images. We also develop a novel end-to-end training strategy that breaks down the diffusion model's sampling process to ensure effective watermarking. Extensive experiments demonstrate that our concept watermarking is effective for guarding personalized textual embeddings while guaranteeing their utility in terms of both visual fidelity and textual editability. More importantly, because the watermark exists at the concept level, it is robust against different processing distortions, diffusion sampling configurations, and adaptive attacks. Ablation studies are also conducted to validate the design rationale of each key component.
Why Does Private Fine-Tuning Resist Differential Privacy Noise? A Representation Learning Perspective
Yue Zhao, Xia Yutong, Chendi Wang
OpenReview
[Show Abstract]
In this paper, we investigate the impact of differential privacy (DP) on the fine-tuning of publicly pre-trained models, focusing on Vision Transformers (ViTs). We introduce an approach for analyzing the DP fine-tuning process by leveraging a representation learning law to measure the separability of features across intermediate layers of the model. Through a series of experiments with ViTs pre-trained on ImageNet and fine-tuned on a subset of CIFAR-10, we explore the effects of DP noise on the learned representations. Our results show that, without proper hyperparameter tuning, DP noise can significantly degrade feature quality, particularly in high-privacy regimes. However, when hyperparameters are optimized, the impact of DP noise on the learned representations is limited, leading to high model accuracy even in high-privacy settings. These findings provide insight into how pre-training on public datasets can help mitigate the privacy-utility trade-off in private deep learning applications.
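A generic way to quantify layer-wise feature separability, in the spirit of the analysis described above, is a between-class versus within-class variance ratio; the sketch below uses that ratio as an illustrative stand-in for the paper's representation learning law.

```python
import numpy as np

def separability(features, labels):
    """Generic class-separability score (between-class vs. within-class
    variance ratio), a stand-in for the layer-wise measure described in the
    abstract rather than the paper's exact quantity."""
    classes = np.unique(labels)
    overall_mean = features.mean(axis=0)
    between, within = 0.0, 0.0
    for c in classes:
        fc = features[labels == c]
        mu_c = fc.mean(axis=0)
        between += len(fc) * np.sum((mu_c - overall_mean) ** 2)
        within += np.sum((fc - mu_c) ** 2)
    return between / (within + 1e-12)

# Apply to features extracted from each intermediate layer of a (DP-)fine-tuned
# ViT to see how noise affects separability layer by layer.
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(0, 1, (50, 16)), rng.normal(3, 1, (50, 16))])
labels = np.array([0] * 50 + [1] * 50)
print(round(separability(feats, labels), 3))
```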
TsKAN: A Transparent Architecture for Improving the Interpretability of Multivariate Time Series Forecasting
Zechuan Chen, TianMing Sha, Ziyi Tang, Keze Wang
OpenReview
[Show Abstract]
In recent years, numerous deep learning models have been proposed for Multi-variate Time Series (MTS) forecasting, with Transformer-based models showing significant potential due to their ability to capture long-term dependencies. However, existing models based on MLPs or Transformers often suffer from a lack of interpretability due to their large parameter sizes, which can be problematic in many real-world applications. To address this issue, we propose TimeKAN, a model based on Kolmogorov-Arnold Networks. The KAN model offers two key advantages: (1) it achieves accuracy comparable to MLPs with significantly fewer parameters, and (2) its parameters can be symbolized, which makes it possible to interpret the meaning of the parameters. Additionally, instead of the usual attention mechanisms, we designed a Multi-Scale Patching (MSP) module for MTS that allows for more flexible and simple multi-patching and effectively extracts both temporal and cross-dimensional features. By leveraging this strategy along with KAN, TimeKAN constructs a hierarchical structure capable of utilizing information across different scales, leading to highly accurate predictions. Extensive experiments on six real-world datasets demonstrate that TimeKAN outperforms state-of-the-art (SOTA) methods in terms of predictive performance. Furthermore, we interpret TimeKAN by visualizing its learning process for extracting symbolized features, opening the black box and revealing meaningful patterns within the time series.
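The sketch below illustrates the general idea of multi-scale patching of a multivariate series; the patch sizes and layout are assumptions for illustration, not the paper's exact MSP design.

```python
import numpy as np

def multi_scale_patches(series, patch_sizes=(4, 8, 16)):
    """Illustrative multi-scale patching of a multivariate series.

    series: array of shape (T, C). Returns a dict mapping each patch size to
    an array of shape (num_patches, patch_size, C).
    """
    out = {}
    T, C = series.shape
    for p in patch_sizes:
        n = T // p                      # drop the ragged tail for simplicity
        out[p] = series[: n * p].reshape(n, p, C)
    return out

x = np.random.default_rng(0).normal(size=(96, 3))   # 96 steps, 3 variables
patches = multi_scale_patches(x)
print({p: v.shape for p, v in patches.items()})
# {4: (24, 4, 3), 8: (12, 8, 3), 16: (6, 16, 3)}
```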
Towards Comprehensive Preference Data Collection for Reward Modeling
Yulan Hu, Qingyang Li, Sheng Ouyang, Ge Chen, Jinman Zhao, Yong Liu
OpenReview
[Show Abstract]
Reinforcement Learning from Human Feedback (RLHF) facilitates the alignment of large language models (LLMs) with human preferences. A critical component of RLHF is the reward model, which is trained on preference data and outputs a scalar reward for given text. However, the collection of high-quality preference data still lacks thorough investigation. Recent studies indicate that preference data is collected either by AI or humans, where chosen and rejected instances are identified between pairwise responses. We question whether this process effectively filters out noise and ensures sufficient diversity in the collected data. To address these concerns, for the first time, we propose a comprehensive framework for preference data collection, decomposing the process into four incremental steps: Prompt Collection, Response Generation, Response Filtering, and Human Labeling. This framework ensures the collection of high-quality preferences while reducing reliance on human labor. We conducted comprehensive experiments using the data collected at different stages, demonstrating the effectiveness of the proposed data collection method.
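A minimal skeleton of the four-stage pipeline described above is sketched below; the concrete filtering rules (deduplication, a reward-gap threshold) and the stand-in judge are illustrative placeholders rather than the paper's criteria.

```python
# Skeleton of the four-stage preference-collection pipeline: Prompt Collection,
# Response Generation, Response Filtering, Human Labeling. All concrete rules
# here are illustrative placeholders.

def collect_prompts(raw_sources):
    return list(dict.fromkeys(p.strip() for p in raw_sources if p.strip()))

def generate_responses(prompt, models, n_per_model=2):
    # `models` is assumed to be a list of callables: model(prompt) -> str.
    return [m(prompt) for m in models for _ in range(n_per_model)]

def filter_responses(responses, reward_fn, min_gap=0.5):
    scored = sorted(((reward_fn(r), r) for r in set(responses)), reverse=True)
    if len(scored) < 2 or scored[0][0] - scored[-1][0] < min_gap:
        return None                        # not informative enough to label
    return scored[0][1], scored[-1][1]     # candidate chosen / rejected pair

def human_label(pair):
    # Final human pass; here we trivially accept the candidate ordering.
    return {"chosen": pair[0], "rejected": pair[1]}

# Tiny demo with stand-in components:
prompts = collect_prompts(["Explain DP-SGD.", "Explain DP-SGD.", " "])
models = [lambda p: p + " (short answer)", lambda p: p + " (long, detailed answer)"]
reward = lambda r: float(len(r))           # placeholder judge
pair = filter_responses(generate_responses(prompts[0], models), reward)
print(human_label(pair) if pair else "skipped")
```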
Domain-Specific Benchmarking of Vision-Language Models: A Task Augmentation Framework Using Metadata
Tim Rädsch, Leon Mayer, Simon Pavicic, Ali Emre Kavur, Marcel Knopp, Barış Öztürk, Klaus Maier-Hein, Paul F Jaeger, Fabian Isensee, Annika Reinke, Lena Maier-hein
OpenReview
[Show Abstract]
Reliable evaluation of AI models is critical for scientific progress and practical application. While existing VLM benchmarks provide general insights into model capabilities, their heterogeneous designs and limited focus on a few imaging domains pose significant challenges for both cross-domain performance comparison and targeted domain-specific evaluation. To address this, we propose three key contributions: (1) a framework for the resource-efficient creation of domain-specific VLM benchmarks enabled by task augmentation for creating multiple diverse tasks from a single existing task, (2) the release of new VLM benchmarks for seven domains, created according to the same homogeneous protocol and including 162,946 thoroughly human-validated answers, and (3) an extensive benchmarking of 22 state-of-the-art VLMs on a total of 37,171 tasks, revealing performance variances across domains and tasks, thereby supporting the need for tailored VLM benchmarks. Adoption of our methodology will pave the way for the resource-efficient domain-specific selection of models and guide future research efforts toward addressing core open questions.
LoBAM: LoRA-Based Backdoor Attack on Model Merging
Ming Yin, Jingyang Zhang, Jingwei Sun, Minghong Fang, Hai Helen Li, Yiran Chen
OpenReview
[Show Abstract]
Model merging is an emerging technique that integrates multiple models fine-tuned on different tasks to create a versatile model that excels in multiple domains.
This scheme, in the meantime, may open up backdoor attack opportunities where one single malicious model can jeopardize the integrity of the merged model.
Existing works try to demonstrate the risk of such attacks by assuming substantial computational resources, focusing on cases where the attacker can fully fine-tune the pre-trained model.
Such an assumption, however, may not be feasible given the increasing size of machine learning models.
In practice where resources are limited and the attacker can only employ techniques like Low-Rank Adaptation (LoRA) to produce the malicious model, it remains unclear whether the attack can still work and pose threats.
In this work, we first identify that the attack efficacy is significantly diminished when using LoRA for fine-tuning.
Then, we propose LoBAM, a method that yields a high attack success rate with minimal training resources.
The key idea of LoBAM is to amplify the malicious weights in an intelligent way that effectively enhances the attack efficacy.
We demonstrate that our design leads to improved attack success rates through extensive empirical experiments across various model merging scenarios.
Moreover, we show that our method has strong stealthiness and is difficult to detect.
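The sketch below illustrates the threat model in weight space: task-arithmetic merging in which the attacker scales up its own low-rank update. The scalar amplification is an illustrative stand-in, not LoBAM's actual construction.

```python
import numpy as np

def merge_with_amplified_lora(base, benign_deltas, malicious_delta, alpha=3.0):
    """Toy weight-space illustration of the threat model discussed above:
    additive merging of task updates where the attacker amplifies its LoRA
    contribution by a scalar (an illustrative stand-in for LoBAM)."""
    merged = base.copy()
    for d in benign_deltas:
        merged += d
    merged += alpha * malicious_delta       # attacker amplifies its contribution
    return merged

rng = np.random.default_rng(0)
base = rng.normal(size=(4, 4))
benign = [rng.normal(scale=0.01, size=(4, 4)) for _ in range(3)]
malicious = rng.normal(scale=0.01, size=(4, 4))
print(np.linalg.norm(merge_with_amplified_lora(base, benign, malicious) - base))
```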
Enhancing Multilingual LLM Pretraining with Model-Based Data Selection
Bettina Messmer, Vinko Sabolčec, Martin Jaggi
OpenReview
[Show Abstract]
Dataset curation has become a basis for strong large language model (LLM) performance.
While various rule-based filtering heuristics exist for English and multilingual datasets, model-based filtering techniques have primarily focused on English.
To address the disparity stemming from limited research on non-English languages, we propose a model-based filtering framework for multilingual datasets that aims to identify a diverse set of structured and knowledge-rich samples.
Our approach emphasizes transparency, simplicity, and efficiency, leveraging Transformer- and FastText-based classifiers to ensure the broad accessibility of our technique and data.
We conduct comprehensive ablation studies on the FineWeb-2 web crawl dataset across diverse language families, scripts, and resource availability to demonstrate the effectiveness of our method.
Using a 1B-parameter Llama model trained on 70B and 119B tokens, our approach can match the baseline MMLU score with as little as 15\% of the training tokens, while also improving across other benchmarks.
These findings provide strong evidence for the generalizability of our approach to other languages. As a result, we extend our framework to 20 languages for which we will release the refined pretraining datasets.
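As a rough sketch of model-based quality filtering, the snippet below trains a lightweight classifier to separate reference text from junk and keeps the top-scoring fraction of documents; scikit-learn is used here purely for illustration in place of the Transformer- and FastText-based classifiers the paper describes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
import numpy as np

def train_quality_filter(positive_docs, negative_docs):
    # Fit a simple classifier to distinguish "high-quality" reference text
    # (positives) from low-quality web text (negatives).
    texts = positive_docs + negative_docs
    labels = [1] * len(positive_docs) + [0] * len(negative_docs)
    vec = TfidfVectorizer(max_features=50_000)
    clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(texts), labels)
    return vec, clf

def keep_top_fraction(docs, vec, clf, fraction=0.15):
    scores = clf.predict_proba(vec.transform(docs))[:, 1]
    k = max(1, int(len(docs) * fraction))
    keep = np.argsort(-scores)[:k]
    return [docs[i] for i in keep]

pos = ["Photosynthesis converts light energy into chemical energy stored in glucose."] * 5
neg = ["click here to win a free prize!!!"] * 5
vec, clf = train_quality_filter(pos, neg)
print(keep_top_fraction(pos + neg, vec, clf, fraction=0.2))
```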
Context-Parametric Inversion: Why Instruction Finetuning Can Worsen Context Reliance
Sachin Goyal, Christina Baek, J Zico Kolter, Aditi Raghunathan
OpenReview
[Show Abstract]
A standard practice when using large language models is for users to supplement their instruction with an input context containing new information for the model to process. However, models struggle to reliably follow the input context, especially when it conflicts with their parametric knowledge from pretraining.
In principle, one would expect models to adapt to the user context better after instruction finetuning, particularly when handling knowledge conflicts.
However, we observe a surprising failure mode: during instruction tuning, the context reliance under knowledge conflicts initially increases as expected, but then $\textit{gradually decreases as instruction finetuning progresses}$. This happens even as performance on standard benchmarks continues to improve well after the drop. We call this phenomenon $\textbf{context-parametric inversion}$ and observe it across multiple general-purpose instruction tuning datasets such as TULU, Alpaca and Ultrachat, and across different model families like Llama, Mistral, and Pythia. We perform various controlled studies and theoretical analysis to show that context-parametric inversion occurs due to examples in the instruction finetuning data where the input context provides information that aligns with the model's parametric knowledge.
Our analysis suggests some natural mitigation strategies with limited but insightful gains, and serves as a useful starting point in addressing this deficiency in instruction finetuning.
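One simple way to track the phenomenon described above is to measure, across instruction-tuning checkpoints, how often the model's answer follows a context that contradicts its parametric knowledge; the sketch below assumes an `answer_fn` callable and illustrative field names.

```python
def context_reliance(examples, answer_fn):
    """Fraction of knowledge-conflict examples where the model's answer follows
    the provided context rather than its parametric knowledge. `answer_fn` is
    assumed to map a prompt string to an answer string; fields are illustrative.
    Tracking this quantity across checkpoints is how one would look for the
    inversion described above."""
    follows_context = 0
    for ex in examples:
        prompt = f"Context: {ex['context']}\nQuestion: {ex['question']}\nAnswer:"
        answer = answer_fn(prompt).strip().lower()
        if ex["context_answer"].lower() in answer:
            follows_context += 1
    return follows_context / len(examples)

# Example with a trivial stand-in "model" that always answers from memory:
examples = [{
    "context": "The Eiffel Tower was moved to Rome in 2024.",
    "question": "Where is the Eiffel Tower?",
    "context_answer": "Rome",
    "parametric_answer": "Paris",
}]
print(context_reliance(examples, lambda p: "Paris"))  # 0.0
```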
SubLIME*: Data Efficient Foundation Model Evaluation across Modalities, Languages and Benchmarks
Mahammad Parwez Alam, Gayathri Saranathan, Cong Xu, Javier Aula-Blasco, Martin Foltin, Tarun Kumar, Soon Yee Wong, Suparna Bhattacharya
OpenReview
[Show Abstract]
The exponential growth of foundation models has created an unsustainable evaluation paradigm, where comprehensive assessment incurs prohibitive computational costs and environmental impact. We introduce SubLIME* ("Less Is More for Evaluation"), an extensible framework that reduces evaluation costs by 10-100X through adaptive sampling while preserving model ranking fidelity (Spearman >0.9). Our core innovation lies in identifying minimal representative subsets through three key extensions: (1) SubLIME-I for text-to-image models combines difficulty and quality sampling methods validated on the image generation tasks, reducing inference time from 2792 hours to 28 hours for evaluating 27 models; (2) SubLIME-C eliminates cross-benchmark coding redundancies via LLM-guided similarity analysis (80% precision vs 66% baseline), improving correlation by 14% at fixed sample sizes; (3) SubLIME-M enables multilingual assessment through cross-lingual subset alignment, maintaining >0.8 rank correlation across 4 languages with 80% less data. SubLIME* experiments across modalities, languages and benchmarks show that using strategic sampling based on difficulty gradients, semantic diversity, and quality metrics maintains evaluation integrity while significantly reducing costs by orders of magnitude.
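The core fidelity check behind such subset-based evaluation can be sketched as follows: compare model rankings on the full benchmark and on a sampled subset via Spearman correlation. Random sampling is shown only as a baseline here; the framework's strategic (difficulty, diversity, quality) sampling is the paper's contribution, and the score matrix below is synthetic.

```python
import numpy as np
from scipy.stats import spearmanr

def subset_rank_fidelity(scores, subset_size, seed=0):
    """scores: hypothetical (n_models, n_items) matrix of per-item correctness.
    Returns the Spearman correlation between full-benchmark and subset rankings."""
    rng = np.random.default_rng(seed)
    subset = rng.choice(scores.shape[1], size=subset_size, replace=False)
    full_ranking = scores.mean(axis=1)
    subset_ranking = scores[:, subset].mean(axis=1)
    rho, _ = spearmanr(full_ranking, subset_ranking)
    return rho

rng = np.random.default_rng(1)
skill = rng.uniform(0.2, 0.9, size=(27, 1))                    # 27 models
scores = (rng.uniform(size=(27, 2000)) < skill).astype(float)  # 2000 items
print(round(subset_rank_fidelity(scores, subset_size=200), 3))
```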
Adversarial Attacks on Data Attribution
Xinhe Wang, Pingbang Hu, Junwei Deng, Jiaqi W. Ma
OpenReview
[Show Abstract]
Data attribution aims to quantify the contribution of individual training data points to the outputs of an AI model, which has been used to measure the value of training data and compensate data providers. Given the impact on financial decisions and compensation mechanisms, a critical question arises concerning the adversarial robustness of data attribution methods. However, there has been little to no systematic research addressing this issue. In this work, we aim to bridge this gap by detailing a threat model with clear assumptions about the adversary's goal and capabilities and proposing principled adversarial attack methods on data attribution. We present two methods, Shadow Attack and Outlier Attack, which generate manipulated datasets to inflate the compensation adversarially. The Shadow Attack leverages knowledge about the data distribution in the AI applications, and derives adversarial perturbations through "shadow training", a technique commonly used in membership inference attacks. In contrast, the Outlier Attack does not assume any knowledge about the data distribution and relies solely on black-box queries to the target model's predictions. It exploits an inductive bias present in many data attribution methods - outlier data points are more likely to be influential - and employs adversarial examples to generate manipulated datasets. Empirically, in image classification and text generation tasks, the Shadow Attack can inflate the data-attribution-based compensation by at least 200%, while the Outlier Attack achieves compensation inflation ranging from 185% to as much as 643%.
D$^3$: A Large Dataset for Training Code Language Models to Act Diff-by-Diff
Ulyana Piterbarg, Kanishk Gandhi, Lerrel Pinto, Noah Goodman, Rob Fergus
OpenReview
[Show Abstract]
We introduce D$^3$, a dataset for training LMs to iteratively synthesize general-purpose Python source code by generating file diffs. D$^3$ frames code synthesis as a goal-conditioned sequential decision-making problem, where goals, states, and actions are represented by token sequences corresponding to the description of a functionality to add, the current contents of a file, and a file diff, respectively. To construct the dataset, we filter, augment, and annotate code from a pretraining corpus of permissively licensed source code (The Stack) using Llama 3.1 70B Instruct and the LintSeq algorithm for sampling synthetic file diffs. D$^3$ contains 8 billion tokens of instruction + file-state + file-diff-sequence examples generated from 850,000 human-written programs. In a preliminary set of experiments, we show that finetuning LMs like Llama 3.2 1B on examples from D$^3$ improves model performance on code synthesis, debugging, and repository-level editing tasks.
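The (goal, file state, diff) data format described above can be illustrated with the standard library's unified diffs; the serialization below is an assumption for illustration, not necessarily the dataset's exact format.

```python
import difflib

state = "def greet():\n    pass\n"
goal = "Make greet() return the string 'hello'."
new_state = "def greet():\n    return 'hello'\n"

# Represent the edit action as a unified diff between file states.
diff = "".join(difflib.unified_diff(
    state.splitlines(keepends=True),
    new_state.splitlines(keepends=True),
    fromfile="a/greet.py",
    tofile="b/greet.py",
))
example = {"goal": goal, "state": state, "action": diff}
print(example["action"])
```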
Reward-Augmented Data Enhances Direct Preference Alignment of LLMs
Shenao Zhang, Zhihan Liu, Boyi Liu, Yufeng Zhang, Yingxiang Yang, Yongfei Liu, Liyu Chen, Tao Sun, Zhaoran Wang
OpenReview
[Show Abstract]
Preference alignment in Large Language Models (LLMs) has significantly improved their ability to adhere to human instructions and intentions. However, existing direct alignment algorithms primarily focus on relative preferences and often overlook the qualitative aspects of responses, despite having access to preference data that includes reward scores from judge models during AI feedback. Striving to maximize the implicit reward gap between the chosen and the slightly inferior rejected responses can cause overfitting and unnecessary unlearning of the high-quality rejected responses. The unawareness of the reward scores also drives the LLM to indiscriminately favor the low-quality chosen responses and fail to generalize to responses with the highest rewards, which are sparse in data. To overcome these shortcomings, our study introduces reward-conditioned LLM policies that discern and learn from the entire spectrum of response quality within the dataset, helping extrapolate to more optimal regions. We propose an effective yet simple data relabeling method that conditions the preference pairs on quality scores to construct a reward-augmented dataset. This dataset is easily integrated with existing direct alignment algorithms and is applicable to any preference dataset. The experimental results across instruction-following benchmarks including AlpacaEval 2.0, MT-Bench, and Arena-Hard-Auto demonstrate that our approach consistently boosts the performance of DPO by a considerable margin across diverse models such as Zephyr, Mistral, Qwen2, Llama3.1, Gemma2, and SPPO. Additionally, on six academic benchmarks including GSM8K, GPQA, MUSR, TruthfulQA, BBH, and ARC, our method improves their average accuracy. When applying our method to on-policy data, the resulting DPO model outperforms various baselines and achieves state-of-the-art results on AlpacaEval 2.0. Through comprehensive ablation studies, we demonstrate that our method not only maximizes the utility of preference data but also mitigates the issue of unlearning, demonstrating its broad effectiveness beyond mere dataset expansion.
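A generic sketch of reward-conditioned relabeling is shown below: each preference pair is duplicated with an explicit target-reward tag, and the response whose judge score matches the tag is treated as preferred. This illustrates the general idea only and is not the paper's exact construction.

```python
def reward_augment(pairs):
    """Generic reward-conditioned relabeling sketch. Each preference pair is
    duplicated with a target-reward tag in the prompt so that a policy can
    learn from the full quality spectrum; the tag format and duplication rule
    are purely illustrative."""
    out = []
    for ex in pairs:
        hi, lo = ex["chosen_score"], ex["rejected_score"]
        out.append({"prompt": f"[target reward: {hi:.1f}] {ex['prompt']}",
                    "chosen": ex["chosen"], "rejected": ex["rejected"]})
        out.append({"prompt": f"[target reward: {lo:.1f}] {ex['prompt']}",
                    "chosen": ex["rejected"], "rejected": ex["chosen"]})
    return out

pairs = [{"prompt": "Summarize the article.",
          "chosen": "A faithful two-sentence summary.",
          "rejected": "An off-topic reply.",
          "chosen_score": 8.5, "rejected_score": 3.0}]
for row in reward_augment(pairs):
    print(row["prompt"], "->", row["chosen"][:20])
```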
Approximations to worst-case data dropping: unmasking failure modes
Jenny Y. Huang, David R. Burt, Yunyi Shen, Tin D. Nguyen, Tamara Broderick
OpenReview
[Show Abstract]
A data analyst would worry about generalization if dropping a very small fraction of data points from a study could change its substantive conclusions. Finding the worst-case data subset to drop poses a combinatorial optimization problem. To overcome this intractability, recent works propose using additive approximations, which treat the contribution of a collection of data points as the sum of their individual contributions, and greedy approximations, which iteratively select the point with the highest impact to drop and re-run the data analysis without that point [Broderick et al., 2020, Kuschnig et al., 2021]. We identify that, even in a setting as simple as OLS linear regression, many of these approximations can break down in realistic data arrangements. Several of our examples reflect masking, where one data point may hide or conceal the effect of another data point. We provide recommendations for users and suggest directions for future development.
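The two approximations under study can be sketched for OLS as follows: score each point by the effect of dropping it alone and remove the top k (additive), or repeatedly refit and drop the currently most influential point (greedy). The brute-force search over all subsets is combinatorial, which is what motivates these heuristics.

```python
import numpy as np

def ols_slope(X, y):
    return np.linalg.lstsq(np.c_[np.ones(len(X)), X], y, rcond=None)[0][1]

def additive_drop(X, y, k):
    # Rank points by the effect of dropping each one alone; drop the top k.
    base = ols_slope(X, y)
    effects = [ols_slope(np.delete(X, i, 0), np.delete(y, i, 0)) - base
               for i in range(len(X))]
    return list(np.argsort(np.abs(effects))[-k:])

def greedy_drop(X, y, k):
    # Iteratively drop the currently most influential point and refit.
    kept = list(range(len(X)))
    dropped = []
    for _ in range(k):
        base = ols_slope(X[kept], y[kept])
        best = max(kept, key=lambda i: abs(
            ols_slope(X[[j for j in kept if j != i]],
                      y[[j for j in kept if j != i]]) - base))
        kept.remove(best)
        dropped.append(best)
    return dropped

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 1)); y = 2 * X[:, 0] + rng.normal(size=40)
print(additive_drop(X, y, 3), greedy_drop(X, y, 3))
```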
Common Functional Decompositions Can Mis-attribute Differences in Outcomes Between Populations
Manuel Quintero, William T. Stephenson, Advik Shreekumar, Tamara Broderick
OpenReview
[Show Abstract]
In science and social science, we often wish to explain why an outcome is different in two populations. For instance, if a jobs program benefits members of one city more than another, is that due to differences in program participants (particular covariates) or the local labor markets (outcomes given covariates)? The Kitagawa-Oaxaca-Blinder (KOB) decomposition is a standard tool in econometrics that explains the difference in the mean outcome across two populations. However, the KOB decomposition assumes a linear relationship between covariates and outcomes, while the true relationship may be meaningfully nonlinear. Modern machine learning boasts a variety of nonlinear functional decompositions for the relationship between outcomes and covariates in one population. It seems natural to extend the KOB decomposition using these functional decompositions. We observe that a successful extension should not attribute the differences to covariates — or, respectively, outcomes given covariates — if those are the same in the two populations. Unfortunately, we demonstrate that, even in simple examples, two common decompositions — the functional ANOVA and Accumulated Local Effects — can attribute differences to outcomes given covariates, even when they are identical in two populations. We provide and partially prove a conjecture that this misattribution arises in any additive decomposition that depends on the distribution of covariates.
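For reference, the classical two-fold KOB decomposition splits the gap in mean outcomes into a part explained by different covariate means and a part explained by different fitted linear coefficients; a minimal version is sketched below on synthetic data.

```python
import numpy as np

def kob_decomposition(XA, yA, XB, yB):
    """Two-fold Kitagawa-Oaxaca-Blinder decomposition with OLS fits."""
    def fit(X, y):  # OLS with intercept
        Z = np.c_[np.ones(len(X)), X]
        return np.linalg.lstsq(Z, y, rcond=None)[0]
    bA, bB = fit(XA, yA), fit(XB, yB)
    mA = np.r_[1.0, XA.mean(axis=0)]
    mB = np.r_[1.0, XB.mean(axis=0)]
    explained = (mA - mB) @ bB           # due to different covariates
    unexplained = mA @ (bA - bB)         # due to different coefficients
    return explained, unexplained        # sums to yA.mean() - yB.mean()

rng = np.random.default_rng(0)
XA = rng.normal(1.0, 1, (500, 2)); yA = XA @ [1.0, 0.5] + rng.normal(size=500)
XB = rng.normal(0.0, 1, (500, 2)); yB = XB @ [2.0, 0.5] + rng.normal(size=500)
e, u = kob_decomposition(XA, yA, XB, yB)
print(round(e + u, 3), round(yA.mean() - yB.mean(), 3))  # equal by construction
```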
Building Bridges, Not Walls: Advancing Interpretability by Unifying Feature, Data, and Model Component Attribution
Shichang Zhang, Tessa Han, Usha Bhalla, Himabindu Lakkaraju
OpenReview
[Show Abstract]
The increasing complexity of AI systems has made understanding their behavior a critical challenge, especially for foundation models. Numerous methods have been developed to attribute model behavior to three key aspects: input features, training data, and internal model components. However, these attribution methods are studied and applied rather independently, resulting in a fragmented landscape of approaches and terminology. This position paper argues that feature, data, and component attribution methods share fundamental similarities, and bridging them can benefit interpretability research. We conduct a detailed analysis of successful methods of these three attribution aspects and present a unified view to demonstrate that these seemingly distinct methods employ similar approaches, such as perturbations, gradients, and linear approximations, differing primarily in their perspectives rather than core techniques. Our unified perspective enhances understanding of existing attribution methods, identifies shared concepts and challenges, makes this field more accessible to newcomers, and highlights new directions not only for attribution and interpretability but also for broader AI research, including model editing, steering, and regulation, ultimately facilitating research on foundation models.
A Versatile Influence Function for Data Attribution with Non-Decomposable Loss
Junwei Deng, Weijing Tang, Jiaqi W. Ma
OpenReview
[Show Abstract]
Influence function, a technique rooted in robust statistics, has been adapted in modern machine learning for a novel application: data attribution—quantifying how individual training data points affect a model's predictions. However, the common derivation of influence functions in the data attribution literature is limited to loss functions that decompose into a sum of individual data point losses, with the most prominent examples known as M-estimators. This restricts the application of influence functions to more complex learning objectives, which we refer to as non-decomposable losses, such as contrastive or ranking losses, where a unit loss term depends on multiple data points and cannot be decomposed further. In this work, we bridge this gap by revisiting the general formulation of influence function from robust statistics, which extends beyond M-estimators. Based on this formulation, we propose a novel method, the Versatile Influence Function (VIF), that can be straightforwardly applied to machine learning models trained with any non-decomposable loss. In comparison to the classical approach in statistics, the proposed VIF is designed to fully leverage the power of auto-differentiation, thereby eliminating the need for case-specific derivations of each loss function. We demonstrate the effectiveness of VIF across three examples: Cox regression for survival analysis, node embedding for network analysis, and listwise learning-to-rank for information retrieval. In all cases, the influence estimated by VIF closely resembles the results obtained by brute-force leave-one-out retraining, while being up to 1000 times faster to compute. We believe VIF represents a significant advancement in data attribution, enabling efficient influence-function-based attribution across a wide range of machine learning paradigms, with broad potential for practical use cases.
PhantomWiki: On-Demand Datasets for Reasoning and Retrieval Evaluation
Albert Gong, Kamilė Stankevičiūtė, Chao Wan, Anmol Kabra, Raphael Thesmar, Johann Lee, Julius Klenke, Carla P Gomes, Kilian Q Weinberger
OpenReview
[Show Abstract]
High-quality benchmarks are essential for evaluating reasoning and retrieval capabilities of large language models (LLMs). However, curating datasets for this purpose is not a permanent solution as they are prone to data leakage and inflated performance results. To address these challenges, we propose PhantomWiki: a pipeline to generate unique, factually consistent document corpora with diverse question-answer pairs. Unlike prior work, PhantomWiki is neither a fixed dataset, nor is it based on any existing data. Instead, a new PhantomWiki instance is generated on demand for each evaluation. We vary the question difficulty and corpus size to disentangle reasoning and retrieval capabilities, respectively, and find that PhantomWiki datasets are surprisingly challenging for frontier LLMs. Thus, we contribute a scalable and data leakage-resistant framework for disentangled evaluation of reasoning, retrieval, and tool-use abilities.
Unstable Unlearning: The Hidden Risk of Concept Resurgence in Diffusion Models
Vinith Menon Suriyakumar, Rohan Alur, Ayush Sekhari, Manish Raghavan, Ashia C. Wilson
OpenReview
[Show Abstract]
Text-to-image diffusion models rely on massive, web-scale datasets. Training them from scratch is computationally expensive, and as a result, developers often prefer to make incremental updates to existing models. These updates often compose fine-tuning steps (to learn new concepts or improve model performance) with "unlearning" steps (to "forget" existing concepts, such as copyrighted works or explicit content). In this work, we demonstrate a critical and previously unknown vulnerability that arises in this paradigm: even under benign, non-adversarial conditions, fine-tuning a text-to-image diffusion model on seemingly unrelated images can cause it to "relearn" concepts that were previously "unlearned." We comprehensively investigate the causes and scope of this phenomenon, which we term \emph{concept resurgence}, by performing a series of experiments across several SOTA concept unlearning methods with subsequent fine-tuning of Stable Diffusion v1.4 and Stable Diffusion v2.1. Our findings underscore the fragility of composing incremental model updates, and raise serious new concerns about current approaches to ensuring the safety and alignment of text-to-image diffusion models.
Preserving Product Fidelity in Large Scale Image Recontextualization with Diffusion Models
Ishaan Malhi, Praneet Dutta, Ellie Talius, Sally Ma, Brendan Driscoll, Krista Holden, Garima Pruthi, Arunachalam Narayanaswamy
OpenReview
[Show Abstract]
We present a framework for high-fidelity product image recontextualization using text-to-image diffusion models and a novel data augmentation pipeline. This pipeline leverages image-to-video diffusion, in/outpainting, and counterfactual generation to create synthetic training data, addressing limitations of real-world data collection for this task. Our method improves the quality and diversity of generated images by disentangling product representations and enhancing the model's understanding of product characteristics. Evaluation on the ABO dataset and a private product dataset, using automated metrics and human assessment, demonstrates the effectiveness of our framework in generating realistic and compelling product visualizations, with implications for diverse applications such as e-commerce and virtual product showcasing.
On the Power of Context-Enhanced Learning in LLMs
Xingyu Zhu, Abhishek Panigrahi, Sanjeev Arora
OpenReview
[Show Abstract]
We formalize a new concept for LLMs, context-enhanced learning. It involves standard gradient-based learning on text except that the context is enhanced with additional data on which no auto-regressive gradients are computed. This setting is a gradient-based analog of usual in-context learning (ICL) and appears in some recent works.
Using a multi-step reasoning task, we prove in a simplified setting that context-enhanced learning can be exponentially more sample-efficient than standard learning when the model is capable of ICL. At a mechanistic level, we find that the benefit of context-enhancement arises from a more accurate gradient learning signal. We also experimentally demonstrate that it appears hard to detect or recover learning materials that were used in the context during training. This may have implications for data security as well as copyright.
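A plausible minimal sketch of the training setup described above (assumed, not taken from the paper's code): the enhanced context is included in the model input but masked out of the next-token loss, so no auto-regressive gradients are computed on it. A real language model and label shifting are omitted; random logits stand in for LM outputs.

import torch
import torch.nn.functional as F

vocab, seq_len = 50, 12
logits = torch.randn(seq_len, vocab, requires_grad=True)   # stand-in for LM outputs
tokens = torch.randint(0, vocab, (seq_len,))

# Suppose the first 8 positions are the enhanced context and the last 4 are the target text.
labels = tokens.clone()
labels[:8] = -100            # ignored by the loss: no auto-regressive gradient on context tokens

loss = F.cross_entropy(logits, labels, ignore_index=-100)
loss.backward()
print(logits.grad[:8].abs().sum(), logits.grad[8:].abs().sum())   # zero vs. non-zero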
Information-theoretic Quantification of Inherent Discrimination Bias in Training Data for Supervised Learning
Sokrat Aldarmini, Mohamed S Nafea
OpenReview
[Show Abstract]
Algorithmic fairness research has primarily focused on adapting learning models to mitigate discrimination based on protected attributes, yet understanding inherent discrimination biases in training data remains largely unexplored. Given that data mining/engineering and model development are often conducted separately, quantifying these biases for potential downstream models is crucial for informed data engineering. We address this challenge by developing an information-theoretic framework to quantify the marginal impacts of dataset features on the discrimination bias of any downstream classifier. Our approach theoretically argues for measures aligning with specific desired properties and fairness notions. Specifically, we postulate a set of desired properties for candidate discrimination measures and derive measures that (partially) satisfy them. Distinct sets of these properties align with different fairness criteria like demographic parity or equalized odds, which we show can be in disagreement and not simultaneously satisfied by a single measure. We employ the Shapley value from cooperative game theory to determine individual features' marginal contributions to overall discrimination. We show the equivalence among some candidate measures under Shapley value aggregation and rigorously prove its effectiveness in eliminating redundancy. We conduct a comprehensive empirical ablation study on real-world and synthetic datasets to validate our measures' efficacy in capturing features' discriminatory impacts. For synthetic data generation, we use a parametric linear structural causal model and systematically examine distinct parameter settings corresponding to diverse data correlation structures, generating numerous datasets under these conditions to rigorously validate our theoretical framework. Overall, our analysis yields empirically validated guidelines for selecting discrimination measures based on data conditions and fairness criteria, establishing a robust framework for quantifying inherent discrimination bias in the data.
Abg-SciQA: A dataset for Understanding and Resolving Ambiguity in Scientific Questions
Tiejin Chen, Kuan-Ru Liou, Mithun Shivakoti, Aaryan Gaur, Pragya Kumari, Meiqi Guo, Hua Wei
OpenReview
[Show Abstract]
Asking ambiguous questions is a natural aspect of human communication, making it essential for Large Language Models (LLMs) to effectively recognize and address ambiguities. However, there is a lack of comprehensive analysis of how well LLMs detect and resolve ambiguities. Moreover, although several ambiguity datasets exist, the absence of explicit explanations of ambiguity and annotations of ambiguity types limits comprehensive evaluation. To address this issue, we introduce Abg-SciQA, a dataset designed to evaluate and help LLMs detect ambiguities and generate appropriate clarification questions, using challenging questions from the social and natural sciences. Abg-SciQA encompasses four tasks: Ambiguity Detection, Ambiguity Type Classification, Clarification Question Generation, and Clarification-based Question Answering, where each task has corresponding annotations. We evaluate the dataset using both closed-source and open-source LLMs and fine-tune open-source LLMs on it. Our experiments show that even the most advanced LLMs still encounter difficulties in resolving ambiguity in natural questions, and fine-tuning on Abg-SciQA can significantly enhance their capabilities to understand and address ambiguities. Notably, in the Ambiguity Type Classification task, the F1 score of Llama2-13b improves significantly from 16.6\% to 79.1\%. On the other hand, Abg-SciQA remains a challenging benchmark for LLMs, revealing ample room for model improvement. Our dataset can be found here.
The surprising amount of arbitrariness in Shapley-value data valuation
Hannah Diehl, Ashia C. Wilson
OpenReview
[Show Abstract]
The growing economic importance of data has generated interest in principled methods for data valuation. Particular attention has been given to the Shapley value, a result from cooperative game theory that defines the unique distribution of a game's rewards to contributors subject to specified fairness axioms. By casting a machine learning task as a cooperative game, Shapley-based data valuation purports to equitably attribute model performance to individuals. However, the practical operationalization of this process depends on a wide array of practitioner decisions. Many of these decisions lie outside of the scope of the underlying machine learning task, introducing a potential for arbitrary decision making. The sensitivity of valuation outcomes to these intermediate decisions threatens the desired fairness properties. In light of these surfaced concerns, we evaluate the face-value equitability of Shapley for data valuation.
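As background for the valuation pipeline being critiqued, here is a brute-force Data Shapley sketch on a toy dataset; the choice of utility (here 1-nearest-neighbor validation accuracy), validation split, and empty-set value are exactly the kinds of practitioner decisions the paper flags as sources of arbitrariness. All names and data are illustrative.

import itertools
from math import factorial
import numpy as np

rng = np.random.default_rng(2)
n = 6                                             # tiny, so exact enumeration is feasible
Xtr = rng.normal(size=(n, 2)); ytr = (Xtr[:, 0] > 0).astype(int)
Xv = rng.normal(size=(30, 2)); yv = (Xv[:, 0] > 0).astype(int)

def utility(subset):
    # Validation accuracy of a 1-nearest-neighbor rule trained on the subset; 0.5 for the empty set.
    if not subset:
        return 0.5
    idx = list(subset)
    dists = ((Xv[:, None, :] - Xtr[idx][None, :, :]) ** 2).sum(-1)
    preds = ytr[idx][dists.argmin(1)]
    return float((preds == yv).mean())

def shapley(i):
    others = [j for j in range(n) if j != i]
    val = 0.0
    for k in range(len(others) + 1):
        for S in itertools.combinations(others, k):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            val += weight * (utility(set(S) | {i}) - utility(set(S)))
    return val

print([round(shapley(i), 3) for i in range(n)])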
Autoregressive Optimal Design for Language Models
Rohan Deb, Kiran Koshy Thekumparampil, Kousha Kalantari, Gaurush Hiranandani, Shoham Sabach, Branislav Kveton
OpenReview
[Show Abstract]
Supervised fine-tuning (SFT) is a standard approach to adapting large language models (LLMs) to new domains. In this work, we improve the statistical efficiency of SFT by selecting an informative subset of training examples. Specifically, for a fixed budget of training examples, which determines the computational cost of fine-tuning, we determine the most informative ones. The key idea in our method is to select examples that maximize the Hessian of the log-likelihood of the LLM. We approximate it efficiently by linearizing the LLM at the last layer using multinomial logistic regression models. Our approach is computationally efficient, analyzable, and performs well empirically. We demonstrate this on several problems, and back our claims with both quantitative results and an LLM evaluation.
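The paper's criterion, selecting examples that maximize the Hessian of the log-likelihood after a last-layer linearization, is closely related to classical D-optimal design. The sketch below shows a generic greedy D-optimal selection over fixed feature vectors using the matrix determinant lemma; it is a stand-in under assumptions (random features, a log-det objective), not the paper's algorithm.

import numpy as np

rng = np.random.default_rng(3)
N, d, budget = 500, 8, 40
Phi = rng.normal(size=(N, d))          # stand-in for last-layer features of candidate examples

def greedy_d_optimal(Phi, budget, ridge=1e-3):
    d = Phi.shape[1]
    A = ridge * np.eye(d)              # running information matrix
    chosen = []
    for _ in range(budget):
        # Pick the example whose feature vector most increases log det(A + x x^T):
        # by the matrix determinant lemma, the gain is log(1 + x^T A^{-1} x).
        A_inv = np.linalg.inv(A)
        gains = np.log1p(np.einsum('nd,de,ne->n', Phi, A_inv, Phi))
        gains[chosen] = -np.inf        # no repeats
        i = int(np.argmax(gains))
        chosen.append(i)
        A = A + np.outer(Phi[i], Phi[i])
    return chosen

subset = greedy_d_optimal(Phi, budget)
print(subset[:10])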
The Delta Learning Hypothesis: Preference Tuning on Weak Data Can Yield Strong Gains
Scott Geng, Hamish Ivison, Chun-Liang Li, Maarten Sap, Jerry Li, Ranjay Krishna, Pang Wei Koh
OpenReview
[Show Abstract]
Preference tuning has greatly improved large language models (LLMs), yet obtaining preference data remains challenging, often requiring expensive human annotation or strong LLM judges to assess response quality. We explore the feasibility of synthetically generating preference pairs without optimizing the preferred response quality to train LLMs that surpass the preferred responses. We formulate the delta learning hypothesis, which posits that models can improve beyond the quality of their training data by learning solely from the relative quality difference—rather than the absolute quality—of paired responses. To validate this hypothesis, we conduct controlled experiments across diverse domains: a toy stylistic task (bold section generation), a math reasoning task (GSM8K), and real-world instruction-following. We show that preference tuning via Direct Preference Optimization (DPO) can enable models to extrapolate improvements from suboptimal data, whereas directly imitating weak data through supervised fine-tuning (SFT) can degrade performance. Armed with these insights, we build a simple weak-to-strong setup that achieves consistent gains over Llama-3.1-8B-Instruct, as well as a SOTA-competitive preference dataset—all without any strong judge.
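For reference, the standard Direct Preference Optimization objective used in these experiments depends only on the difference in (policy minus reference) log-probabilities between the paired responses, which is what makes learning from a quality delta rather than absolute quality possible. The sketch below is a generic DPO loss on toy numbers; the weak-data pairing strategy itself is the paper's contribution and is not reproduced here.

import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Standard DPO objective on summed sequence log-probs; the signal is the *delta*
    # between chosen and rejected responses, not their absolute quality.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return float(np.mean(np.logaddexp(0.0, -beta * margin)))   # -log sigmoid(beta * margin)

# Toy numbers: the chosen response is only slightly better than the rejected one.
print(dpo_loss(np.array([-12.3]), np.array([-12.9]), np.array([-12.5]), np.array([-12.8])))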
NICE: Non-Differentiable Evaluation Metric-Based Data Selection for Instruction Tuning
Jingtan Wang, Xiaoqiang Lin, Rui Qiao, Pang Wei Koh, Chuan-Sheng Foo, Bryan Kian Hsiang Low
OpenReview
[Show Abstract]
Curating data for instruction tuning is crucial for enhancing the performance of large language models (LLMs). This work aims to select training data for instruction tuning to improve the LLM performance on specific tasks. Existing methods often rely on next-token prediction (NTP) loss as a proxy for target task performance due to the non-differentiable nature of performance evaluation metrics. They select training data points that are most helpful in reducing validation loss. However, there is a discrepancy between minimizing NTP loss and maximizing performance (e.g., code pass rate in code generation). To remedy this, we introduce a novel Non-differentiable evaluation metric-based InfluenCe Estimation (NICE), which leverages the policy gradient to select the training data that improves the performance. Moreover, NICE can perform data selection in the absence of labels (ground-truth responses) when the evaluation metrics do not require labels (e.g., a reward model can output reward scores without supervision from labels). Experimental results show that our approach outperforms existing data selection baselines that use NTP loss in diverse and realistic scenarios. Notably, subsets selected by NICE often produce models that outperform those trained on the full dataset.
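The phrase "leverages the policy gradient" suggests a score-function (REINFORCE-style) estimator, which can propagate a non-differentiable metric such as a pass rate back to model parameters. The sketch below shows that estimator for a toy categorical distribution; it is a generic illustration under assumptions, not NICE itself.

import numpy as np

rng = np.random.default_rng(4)
K = 4
theta = np.zeros(K)                       # logits of a toy categorical "policy"

def metric(sample):
    return 1.0 if sample == 2 else 0.0    # non-differentiable evaluation metric (e.g., pass/fail)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Score-function estimate of d E[metric] / d theta: E[ metric(a) * grad log p(a) ].
p = softmax(theta)
grads, n_samples = np.zeros(K), 5000
for _ in range(n_samples):
    a = rng.choice(K, p=p)
    grad_logp = -p.copy(); grad_logp[a] += 1.0     # d log p(a) / d theta for a softmax policy
    grads += metric(a) * grad_logp
grads /= n_samples
print(grads)    # points toward increasing the probability of the rewarded outcome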
ADSO: Adaptive Data Mixture & Scale Optimization. A Multi-Scale Multi-Fidelity Bayesian Optimization Approach.
Andrew Wei Tung Siah, Haozhe Chen, C. Daniel Guetta, Tianyi Peng, Hongseok Namkoong, Tzu-Ching Yen
OpenReview
[Show Abstract]
LLM pre-training requires careful curation of data sources, a process that currently relies heavily on intuition or costly trial-and-error. Since existing ad hoc approaches are unlikely to transfer across domains or data types, we present a unifying framework for data mixture optimization where (mixtures, model scale, training steps) are chosen to balance cost and potential information gain. Going beyond the canonical deterministic extrapolation in scaling laws, we present a sequential decision-making framework where uncertainty in outcomes is explicitly modeled and sharpened as more measurements are gathered. In particular, we formulate a multi-scale, multi-fidelity Bayesian Optimization (BO) problem where information from smaller-scale experiments can systematically inform larger-scale training decisions. We design an adaptive algorithm that takes into account different measurement fidelities provided by model scale and training steps and empirically demonstrate it on a predictor built on 472 pre-training runs with varying data compositions. Compared to standard BO baselines, instantiating our approach with even simple kernels and acquisition functions can allow principled decisions across training models from 20M to 1B parameters and achieve \textbf{2.7x} and \textbf{6x} speedups compared to multi-fidelity BO and random search baselines in finding the best data mixture for downstream performance under fixed compute budgets. In sum, our adaptive framework underscores potential efficiency gains achievable by developing principled and transferable data mixture optimization methods. Our code is publicly available at \url{https://github.com/anonWAEWA/ADSO}.
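To give a flavor of the sequential decision-making framing, here is a single-fidelity sketch: a small Gaussian-process surrogate over one mixing ratio, queried with a confidence-bound acquisition. ADSO's multi-scale, multi-fidelity machinery and kernels are not reproduced; the quadratic toy objective and all numbers below are invented.

import numpy as np

rng = np.random.default_rng(5)

def rbf(a, b, ls=0.15):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

def true_loss(r):                       # hidden objective: validation loss vs. mixing ratio
    return (r - 0.3) ** 2 + 0.02 * rng.normal(size=np.shape(r))

grid = np.linspace(0, 1, 101)           # candidate mixing ratios for the high-quality source
X, y = np.array([0.0, 1.0]), true_loss(np.array([0.0, 1.0]))   # two seed runs

for step in range(8):
    K = rbf(X, X) + 1e-4 * np.eye(len(X))
    Ks = rbf(grid, X)
    mu = Ks @ np.linalg.solve(K, y)
    var = 1.0 - np.einsum('ij,jk,ik->i', Ks, np.linalg.inv(K), Ks)
    lcb = mu - 2.0 * np.sqrt(np.maximum(var, 0.0))   # confidence bound (we minimize loss)
    r_next = grid[np.argmin(lcb)]
    X = np.append(X, r_next)
    y = np.append(y, true_loss(np.array([r_next]))[0])

print(X[2:], X[np.argmin(y)])           # queried ratios and the best-observed mixture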
Contrastive Private Data Synthesis via Weighted Multi-PLM Fusion
Tianyuan Zou, Yang Liu, Peng Li, Yufei Xiong, Jianqing Zhang, Jingjing Liu, Ye Ouyang, Xiaozhou Ye, Yaqin Zhang
OpenReview
[Show Abstract]
Substantial quantity and high quality are the golden rules of making a good training dataset, with sample privacy protection equally important. Generating synthetic samples that resemble high-quality private data while ensuring Differential Privacy (DP), a formal privacy guarantee, promises scalability and practicality. However, existing methods relying on pre-trained models for data synthesis often struggle in data-deficient scenarios, suffering from limited sample size, inevitable generation noise and existing pre-trained model bias. To address these challenges, we propose a novel contrAstive private data Synthesis via Weighted multiple Pre-trained language models (PLM) framework, named WASP. WASP utilizes limited private samples for more accurate private data distribution estimation via a Top-$Q$ voting mechanism, and leverages low-quality synthetic samples for contrastive generation via collaboration among dynamically weighted multiple pre-trained models. Extensive experiments on 6 well-developed datasets with 6 open-source and 3 closed-source PLMs demonstrate the superiority of WASP in improving model performance over diverse downstream tasks. Code is available at https://anonymous.4open.science/r/WASP.
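One common way to realize a private voting step, offered here only as a hedged illustration and not as WASP's actual mechanism: each private sample casts one vote for a bin, Laplace noise calibrated to the unit sensitivity is added to the counts, and only the Top-Q noisy bins are kept; the cut and renormalization are post-processing and so preserve the DP guarantee. All quantities below are invented.

import numpy as np

rng = np.random.default_rng(6)
num_bins, Q, epsilon = 20, 5, 2.0
votes = rng.integers(0, num_bins, size=300)        # each private sample votes for its nearest bin

counts = np.bincount(votes, minlength=num_bins).astype(float)
noisy = counts + rng.laplace(scale=1.0 / epsilon, size=num_bins)   # sensitivity 1 per sample
top_q = np.argsort(noisy)[::-1][:Q]                # keep only the Q highest-voted bins
weights = np.clip(noisy[top_q], 0, None)
weights /= weights.sum()
print(top_q, np.round(weights, 3))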
Query-dependent Prompt Optimization via Multi-Loop Offline Reinforcement Learning
Yilun Kong, Hangyu Mao, Qi Zhao, Bin Zhang, Jingqing Ruan, Li Shen, Yongzhe Chang, Xueqian Wang, Rui Zhao, Dacheng Tao
OpenReview
[Show Abstract]
Prompt engineering has demonstrated remarkable success in enhancing the performance of large language models (LLMs) across diverse tasks. However, most existing prompt optimization methods focus only on task-level performance, overlooking the importance of query-preferred prompts, which leads to suboptimal performance. Additionally, these methods rely heavily on frequent interactions with LLMs to obtain feedback for guiding the optimization process, incurring substantial redundant interaction costs. In this paper, we introduce Query-dependent Prompt Optimization ($\textbf{QPO}$), which leverages multi-loop offline reinforcement learning to iteratively fine-tune a small pretrained language model to generate optimal prompts tailored to the input queries, thus significantly improving the prompting effect on the large target LLM. We derive insights from offline prompting demonstration data, which already exists in large quantities as a by-product of benchmarking diverse prompts on open-sourced tasks, thereby circumventing the expenses of online interactions. Furthermore, we continuously augment the offline dataset with the generated prompts in each loop, as the prompts from the fine-tuned model are expected to outperform the source prompts in the original dataset. These iterative loops bootstrap the model towards generating optimal prompts. Experiments on various LLM scales and diverse NLP and math tasks demonstrate the efficacy and cost-efficiency of our method in both zero-shot and few-shot scenarios.
Improving Multimodal Large Language Models in Low-Resource Language Contexts
Yufei Gao, Feijiaying, Guohang Yan, Yunshi Lan
OpenReview
[Show Abstract]
In recent years, open-source Multimodal Large Language Models (MLLM) have developed rapidly, but their strengths remain primarily in mainstream languages such as English and Chinese. Due to the relative scarcity of data for non-mainstream languages, these models perform poorly in low-resource languages, struggling not only to understand and generate them fluently but also to grasp the knowledge familiar to their speakers. Recognizing the importance of low-resource language data, this paper collects multimodal data containing small-language knowledge from relevant websites. Moreover, we propose a two-stage training approach to improving multimodal large language models in low-resource language contexts. In the first stage, multimodal capabilities are transferred to low-resource languages, while the second stage further supplements the model with the knowledge in the collected dataset. Experimental results demonstrate that this data collection strategy and training method effectively extend MLLM's multimodal capabilities to low-resource languages and enable multimodal large models to perform better in such contexts.
Enhancing Interpretability in Generative AI Through Search-Based Data Influence Analysis
Theodoros Aivalis, Iraklis A. Klampanos, Antonis Troumpoukis, Joemon M. Jose
OpenReview
[Show Abstract]
Generative AI models offer powerful capabilities but often lack transparency, making it difficult to interpret their output. This is critical in cases involving artistic or copyrighted content. This work introduces a search-inspired approach to improve the interpretability of these models by analysing the influence of training data on their outputs. Our method provides observational interpretability by focusing on a model’s output rather than on its internal state. We consider both raw data and latent-space embeddings when searching for the influence of data items in generated content. We evaluate our method by retraining models locally and by demonstrating the method’s ability to uncover influential subsets in the training data. This work lays the groundwork for future extensions, including user-based evaluations with domain experts, which are expected to improve observational interpretability further.
$f$-SCRUB: Unbounded Machine Unlearning Via $f$-divergences
Amirhossein Bagheri, Radmehr Karimian, Gholamali Aminian
OpenReview
[Show Abstract]
Deep Machine Unlearning addresses the problem of removing the effect of a subset of data points from a trained model. Machine Unlearning has various implications for the performance of algorithms. A well-known algorithm, SCRUB~\citep{kurmanji2023unboundedmachineunlearning}, has served as a baseline and achieved key objectives such as removing biases, resolving confusion caused by mislabeled data in trained models, and allowing users to exercise their "right to be forgotten" to protect user privacy. Building on this algorithm, we introduce $f$-SCRUB, an extension of SCRUB that employs different $f$-divergences instead of KL divergence. We analyze the role of these divergences and their impact on the resolution of unlearning problems in various scenarios.
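For context, the $f$-divergence family instantiates $D_f(P\|Q)=\sum_x q(x)\,f(p(x)/q(x))$ with a convex $f$ satisfying $f(1)=0$; KL, total variation, and chi-squared are special cases. The sketch below computes these for discrete distributions; how each choice enters the SCRUB-style distillation objective is the paper's contribution and is not shown, and all numbers are illustrative.

import numpy as np

def f_divergence(p, q, f):
    p, q = np.asarray(p, float), np.asarray(q, float)
    t = p / q
    return float(np.sum(q * f(t)))

f_kl  = lambda t: t * np.log(t)          # KL(P || Q)
f_tv  = lambda t: 0.5 * np.abs(t - 1.0)  # total variation
f_chi = lambda t: (t - 1.0) ** 2         # chi-squared

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])
for name, f in [("KL", f_kl), ("TV", f_tv), ("chi2", f_chi)]:
    print(name, round(f_divergence(p, q, f), 4))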
The Price is Right? Making Data Valuations Incentive-Compatible
Dongyang Fan, Tyler J. Rotello, Sai Praneeth Karimireddy
OpenReview
[Show Abstract]
Data valuation has increasingly been recognized as a critical mechanism for determining fair compensation for data contributors in ML tasks. Despite its significance, the game-theoretic aspects of the existing valuation metrics have not been studied. In this paper, we study a data marketplace where sellers incur private heterogeneous costs for sharing their data. Perhaps surprisingly, we discover that existing methods (Data Shapley and Leave-One-Out) are not incentive-compatible. Specifically, they encourage sellers to misreport their true costs, leading to market inefficiencies. To address this, we propose a new pricing rule that we theoretically prove simultaneously satisfies: i) incentive-compatibility, ii) market-efficiency, iii) individual rationality, and iv) budget balancing, while also v) being the lowest possible price under these constraints. Our results underscore the importance of game theoretic considerations while designing data valuation metrics.
A Missing Testbed for LLM Pre-Training Membership Inference Attacks
Mingjian Jiang, Ken Ziyu Liu, Sanmi Koyejo
OpenReview
[Show Abstract]
We introduce a simple and rigorous testbed for membership inference attacks (MIA) against pre-training sequences for large language models (LLMs).
Our testbed addresses the following gaps in existing evaluations, which lack:
(1) \textit{uniform} sampling of member/non-member documents of varying lengths from pre-training shards;
(2) large-scale \textit{deduplication} at varying strengths, both within and across the sampled members/non-members; and
(3) rigorous \textit{statistical tests} to detect member/non-member distribution shifts that cause faulty evaluations and are otherwise imperceptible to the heuristic techniques used in prior work.
We provide both global- and domain-level datasets (e.g., Reddit, Stack Exchange, Wikipedia), derived from fully-open pre-trained LLM/dataset pairs including Pythia/Pile, Olmo/Dolma, and our custom pre-trained GPT-2-Large on FineWeb-Edu.
We additionally open source a modular and extensible codebase that facilitates the creation of custom, statistically validated, and deduplicated evaluation data using future open models and datasets.
In sum, our work is a concrete step towards addressing the evaluation issues discussed by prior work.
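As a hedged illustration of point (3), one of the simplest such checks is a two-sample Kolmogorov-Smirnov test on an easily computed statistic such as document length; the testbed's actual battery of tests and statistics is specified in the paper, and the lognormal lengths below are invented.

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
member_lengths = rng.lognormal(mean=6.0, sigma=1.0, size=2000)       # e.g., member document lengths
nonmember_lengths = rng.lognormal(mean=6.2, sigma=1.0, size=2000)    # slightly shifted non-members

stat, pval = stats.ks_2samp(member_lengths, nonmember_lengths)
print(stat, pval)   # a tiny p-value flags a member/non-member shift that would bias MIA evaluation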
Aioli: A Unified Optimization Framework for Language Model Data Mixing
Mayee F Chen, Michael Y. Hu, Nicholas Lourie, Kyunghyun Cho, Christopher Re
OpenReview
[Show Abstract]
Language model performance depends on identifying the optimal mixture of data groups to train on (e.g., law, code, math). Prior work has proposed a diverse set of methods to efficiently learn mixture proportions, ranging from fitting regression models over training runs to dynamically updating proportions throughout training. Surprisingly, we find that no existing method consistently outperforms a simple stratified sampling baseline in terms of average test perplexity. To understand this inconsistency, we unify existing methods into a standard framework, showing they are equivalent to solving a common optimization problem: minimize average loss subject to a method-specific mixing law—an implicit assumption on the relationship between loss and mixture proportions. This framework suggests that measuring the fidelity of a method's mixing law can offer insights into its performance. Empirically, we find that existing methods set their mixing law parameters inaccurately, resulting in the inconsistent mixing performance we observe. Using this insight, we derive a new online method named Aioli, which directly estimates the mixing law parameters throughout training and uses them to dynamically adjust proportions. Empirically, Aioli outperforms stratified sampling on 6 out of 6 datasets by an average of 0.27 test perplexity points, whereas existing methods fail to consistently beat stratified sampling, doing up to 6.9 points worse. Moreover, in a practical setting where proportions are learned on shorter runs due to computational constraints, Aioli can dynamically adjust these proportions over the full training run, consistently improving performance over existing methods by up to 12.012 test perplexity points.
Robust In-Context Learning via Multi-Armed Bandit-Based Partition Selection
Varul Srivastava, Sankarshan Damle, Manisha Padala
OpenReview
[Show Abstract]
In-context learning (ICL) enables Large Language Models (LLMs) to adapt to new tasks without parameter updates, relying solely on exemplar selection. However, in real-world scenarios, data partitions may contain corrupted labels, degrading ICL performance. We address this challenge by formulating partition selection as a multi-armed bandit (MAB) problem, where each evaluation sample serves as a pull, allowing the model to identify the most reliable partitions iteratively. Using an Upper Confidence Bound (UCB) strategy, we progressively refine exemplar selection to mitigate the impact of noisy data. Empirical results demonstrate that UCB-based partition selection recovers performance comparable to settings without label noise, highlighting its effectiveness in improving ICL robustness.
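The MAB formulation maps directly onto the textbook UCB1 rule. The sketch below runs UCB1 over four hypothetical partitions whose success probabilities are invented; each pull scores one evaluation sample answered with exemplars drawn from that partition, matching the abstract's framing.

import numpy as np

rng = np.random.default_rng(8)
accs = np.array([0.80, 0.55, 0.75, 0.40])     # hidden per-partition exemplar quality (some corrupted)
K, T = len(accs), 400
counts, rewards = np.zeros(K), np.zeros(K)

for t in range(1, T + 1):
    if t <= K:
        arm = t - 1                                        # pull every partition once
    else:
        ucb = rewards / counts + np.sqrt(2 * np.log(t) / counts)
        arm = int(np.argmax(ucb))
    reward = float(rng.random() < accs[arm])               # 1 if the exemplars led to a correct answer
    counts[arm] += 1
    rewards[arm] += reward

print(counts, int(np.argmax(counts)))   # pulls concentrate on the cleanest partition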
Defending LVLMs Against Vision Attacks through Partial-Perception Supervision
Qi Zhou, Tianlin Li, Qing Guo, Dongxia Wang, Yun Lin, Yang Liu, Jin Song Dong
OpenReview
[Show Abstract]
Recent studies have raised significant concerns regarding the vulnerability of Large Vision Language Models (LVLMs) to maliciously injected or perturbed input images, which can mislead their responses. Existing defense methods show that such vision attacks are sensitive to image modifications, especially cropping, using majority voting across responses of modified images as corrected responses. However, these modifications often result in partial images and distort the semantics, which reduces response quality on clean images after voting. Instead of directly using responses from partial images for voting, we investigate using them to supervise (guide) the LVLM's responses to the original images at inference time. We propose a black-box, training-free method called \textbf{DPS (Defense through Partial-Perception Supervision)}. In this approach, the model is prompted using the responses generated by a model that perceives only a partial image. With DPS, the model can adjust its response based on partial image understanding when under attack, while confidently maintaining its original response for clean inputs.
RepFair-QGAN: Alleviating Representation Bias in Quantum Generative Adversarial Networks Using Gradient Clipping
Kamil Sabbagh, Hadi Salloum, Yaroslav Kholodov
OpenReview
[Show Abstract]
This study introduces a novel application of Quantum Generative Adversarial Networks (QGANs) by incorporating a new fairness principle, \textit{representational fairness}, which improves equitable representation of various demographic groups in quantum-generated data. We propose a \textit{group-wise} gradient norm clipping technique that constrains the magnitude of discriminator updates for each demographic group, thereby promoting fair data generation. Furthermore, our approach mitigates the issue of mode collapse, which is inherent in both QGANs and classical GANs. Empirical evaluations confirm that this method enhances \textit{representational fairness} while maintaining high-quality sample generation.
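Group-wise gradient norm clipping itself is straightforward to sketch: each demographic group's discriminator gradient is rescaled to a common norm cap so that no group dominates the update. How the groups are batched and how the clipped gradients are combined in the QGAN training loop follow the paper; the example below is a generic stand-in with invented numbers.

import numpy as np

def clip_group_gradients(grads_by_group, max_norm=1.0):
    # Rescale each group's gradient to at most max_norm so no single group drives the update.
    clipped = {}
    for group, g in grads_by_group.items():
        norm = np.linalg.norm(g)
        scale = min(1.0, max_norm / (norm + 1e-12))
        clipped[group] = g * scale
    return clipped

grads = {"group_a": np.array([3.0, 4.0]), "group_b": np.array([0.1, 0.2])}
print(clip_group_gradients(grads))   # group_a is scaled down to norm 1; group_b is untouched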