- Efficient Adaptive Optimization via Subset-Norm and Subspace-Momentum: Fast, Memory-Reduced Training with Convergence Guarantees We introduce two complementary techniques for efficient adaptive optimization that reduce memory requirements while accelerating training of large-scale neural networks. The first technique, Subset-Norm adaptive step size, generalizes AdaGrad-Norm and AdaGrad(-Coordinate) by reducing the second moment term's memory footprint from O(d) to O(√d) through step-size sharing, where d is the model size. For non-convex smooth objectives under coordinate-wise sub-gaussian gradient noise, we prove a noise-adapted high-probability convergence guarantee showing improved dimensional dependence over existing methods. Our second technique, Subspace-Momentum, reduces the momentum state's memory footprint by operating in a low-dimensional subspace while applying standard SGD in the orthogonal complement. We establish high-probability convergence rates under similar relaxed assumptions. Empirical evaluation on LLaMA models from 60M to 1B parameters demonstrates the effectiveness of our methods, where combining subset-norm with subspace-momentum achieves Adam's validation perplexity in approximately half the training tokens (6.8B vs 13.1B) while using only 20% of Adam's optimizer-state memory footprint and requiring minimal additional hyperparameter tuning. 2 authors · Nov 11, 2024
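To make the step-size-sharing idea concrete, here is a minimal sketch of a Subset-Norm-style update for a flattened parameter vector, assuming contiguous subsets of a fixed size; the names and defaults are illustrative, not the authors' implementation.

```python
import numpy as np

def subset_norm_step(params, grads, accum, subset_size, lr=1e-2, eps=1e-8):
    """One sketch update: a single second-moment accumulator is shared by each
    subset of coordinates, so the optimizer state shrinks from d to d/subset_size."""
    d = params.shape[0]
    for start in range(0, d, subset_size):
        end = min(start + subset_size, d)
        g = grads[start:end]
        k = start // subset_size
        accum[k] += float(np.dot(g, g))                           # shared squared-gradient norm
        params[start:end] -= lr / (np.sqrt(accum[k]) + eps) * g   # shared adaptive step size
    return params, accum
```

With subset_size on the order of √d, the accumulator has roughly √d entries, matching the O(√d) state mentioned above.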
2 Diversity Measurement and Subset Selection for Instruction Tuning Datasets We aim to select data subsets for the fine-tuning of large language models to more effectively follow instructions. Prior work has emphasized the importance of diversity in dataset curation but relied on heuristics such as the number of tasks. In this paper, we use determinantal point processes to capture the diversity and quality of instruction tuning datasets for subset selection. We propose to measure dataset diversity with the log determinant distance, defined as the distance between the dataset of interest and a maximally diverse reference dataset. Our experiments demonstrate that the proposed diversity measure in the normalized weight gradient space is correlated with downstream instruction-following performance. Consequently, it can be used to inform when data selection is the most helpful and to analyze dataset curation strategies. We demonstrate the utility of our approach on various instruction tuning datasets. 7 authors · Feb 3, 2024 2
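As a rough illustration of a log-determinant-style diversity score (the exact kernel and reference set are not specified here, so treat this as an assumption-laden sketch):

```python
import numpy as np

def log_determinant_distance(features, eps=1e-6):
    """Distance of a subset to a maximally diverse reference whose Gram matrix
    is the identity (log-determinant 0). `features` is an (n, p) array of
    per-example features, e.g. normalized weight gradients."""
    X = features / (np.linalg.norm(features, axis=1, keepdims=True) + eps)
    K = X @ X.T + eps * np.eye(X.shape[0])   # DPP-style similarity kernel
    _, logdet = np.linalg.slogdet(K)
    return -logdet                           # 0 for orthogonal features, grows with redundancy
```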
- Subset-Based Instance Optimality in Private Estimation We propose a new definition of instance optimality for differentially private estimation algorithms. Our definition requires an optimal algorithm to compete, simultaneously for every dataset D, with the best private benchmark algorithm that (a) knows D in advance and (b) is evaluated by its worst-case performance on large subsets of D. That is, the benchmark algorithm need not perform well when potentially extreme points are added to D; it only has to handle the removal of a small number of real data points that already exist. This makes our benchmark significantly stronger than those proposed in prior work. We nevertheless show, for real-valued datasets, how to construct private algorithms that achieve our notion of instance optimality when estimating a broad class of dataset properties, including means, quantiles, and ℓ_p-norm minimizers. For means in particular, we provide a detailed analysis and show that our algorithm simultaneously matches or exceeds the asymptotic performance of existing algorithms under a range of distributional assumptions. 4 authors · Mar 1, 2023
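One way to write down the subset-based benchmark described above (the notation, the error functional err, and the removed fraction ρ are illustrative, not taken from the paper):

```latex
\[
  \mathrm{err}(A, D) \;\lesssim\;
  \min_{A' \in \mathcal{A}_{\mathrm{private}}}\;
  \max_{\substack{D' \subseteq D \\ |D'| \ge (1-\rho)\,|D|}}
  \mathrm{err}(A', D')
  \qquad \text{simultaneously for every dataset } D.
\]
```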
- Kunnafonidilaw ka Cadeau: an ASR dataset of present-day Bambara We present Kunkado, a 160-hour Bambara ASR dataset compiled from Malian radio archives to capture present-day spontaneous speech across a wide range of topics. It includes code-switching, disfluencies, background noise, and overlapping speakers that practical ASR systems encounter in real-world use. We fine-tune Parakeet-based models on a 33.47-hour human-reviewed subset and apply pragmatic transcript normalization to reduce variability in number formatting, tags, and code-switching annotations. Evaluated on two real-world test sets, fine-tuning with Kunkado reduces WER from 44.47% to 37.12% on one and from 36.07% to 32.33% on the other. In human evaluation, the resulting model also outperforms a comparable system with the same architecture trained on 98 hours of cleaner, less realistic speech. We release the data and models to support robust ASR for predominantly oral languages. 4 authors · Dec 22, 2025
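A minimal sketch of what such pragmatic transcript normalization could look like; the specific tag and code-switching formats below are assumptions, not the conventions actually used in Kunkado:

```python
import re

def normalize_transcript(text):
    """Illustrative cleanup: lowercase, strip event tags, unwrap code-switch
    markup, and unify number formatting before WER scoring."""
    text = text.lower().strip()
    text = re.sub(r"\[(noise|music|overlap)\]", " ", text)  # hypothetical event tags
    text = re.sub(r"<fr>(.*?)</fr>", r"\1", text)           # hypothetical code-switch markup
    text = re.sub(r"(\d)[.,](\d)", r"\1\2", text)           # drop separators inside numbers
    return re.sub(r"\s+", " ", text).strip()
```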
- On a conjecture of Gross, Mansour and Tucker for $Δ$-matroids Gross, Mansour, and Tucker introduced the partial-duality polynomial of a ribbon graph [Distributions, European J. Combin. 86, 1--20, 2020], the generating function enumerating partial duals by the Euler genus. Chmutov and Vignes-Tourneret wondered whether this polynomial and its conjectured properties would carry over to general delta-matroids, which are combinatorial abstractions of ribbon graphs. Yan and Jin contributed to this inquiry by identifying a subset of delta-matroids, specifically the even normal binary ones, whose twist polynomials are characterized by a single term. Building upon this foundation, the current paper expands the scope of the investigation to encompass even non-binary delta-matroids, revealing that none of them have width-changing twists. 1 author · Apr 21, 2024
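For reference, the partial-duality polynomial in question can be written as follows (standard notation, which may differ slightly from the cited paper):

```latex
\[
  {}^{\partial}\varepsilon_{G}(z) \;=\; \sum_{A \subseteq E(G)} z^{\,\varepsilon\!\left(G^{\partial A}\right)},
\]
% where G^{\partial A} is the partial dual of the ribbon graph G with respect to
% the edge subset A and \varepsilon denotes Euler genus; the twist polynomial of
% a delta-matroid is the analogous generating function over twists.
```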
8 Emergent properties with repeated examples We study the performance of transformers as a function of the number of repetitions of training examples with algorithmically generated datasets. On three problems of mathematics (greatest common divisor, modular multiplication, and matrix eigenvalues), we show that for a fixed number of training steps, models trained on smaller sets of repeated examples outperform models trained on larger sets of single-use examples. We also demonstrate that two-set training - the repeated use of a small random subset of examples alongside normal sampling of the rest of the training set - yields faster learning and better performance. This highlights that the benefits of repetition can outweigh those of data diversity. These datasets and problems provide a controlled setting to shed light on the still poorly understood interplay between generalization and memorization in deep learning. 2 authors · Oct 9, 2024 3
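A small sketch of the two-set sampling scheme described above; the subset fraction and mixing probability are placeholders, not the paper's settings:

```python
import random

def two_set_batches(dataset, repeat_frac=0.1, repeat_prob=0.5,
                    batch_size=64, steps=1000, seed=0):
    """Yield batches that mix a small fixed subset (drawn repeatedly) with
    examples sampled normally from the rest of the training set."""
    rng = random.Random(seed)
    data = list(dataset)
    rng.shuffle(data)
    k = max(1, int(repeat_frac * len(data)))
    repeated, rest = data[:k], data[k:] or data[:k]
    for _ in range(steps):
        yield [rng.choice(repeated) if rng.random() < repeat_prob else rng.choice(rest)
               for _ in range(batch_size)]
```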
- BHRAM-IL: A Benchmark for Hallucination Recognition and Assessment in Multiple Indian Languages Large language models (LLMs) are increasingly deployed in multilingual applications but often generate plausible yet incorrect or misleading outputs, known as hallucinations. While hallucination detection has been studied extensively in English, under-resourced Indian languages remain largely unexplored. We present BHRAM-IL, a benchmark for hallucination recognition and assessment in multiple Indian languages, covering Hindi, Gujarati, Marathi, and Odia, along with English. The benchmark comprises 36,047 curated questions across nine categories spanning factual, numerical, reasoning, and linguistic tasks. We evaluate 14 state-of-the-art multilingual LLMs on a benchmark subset of 10,265 questions, analyzing cross-lingual and factual hallucinations across languages, models, scales, categories, and domains using category-specific metrics normalized to the (0, 1) range. Aggregation over all categories and models yields a primary score of 0.23 and a language-corrected fuzzy score of 0.385, demonstrating the usefulness of BHRAM-IL for hallucination-focused evaluation. The dataset and the code for generation and evaluation are available on GitHub (https://github.com/sambhashana/BHRAM-IL/) and HuggingFace (https://huggingface.co/datasets/sambhashana/BHRAM-IL/) to support future research in multilingual hallucination detection and mitigation. 4 authors · Dec 1, 2025
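A guess at how such normalized category scores might be aggregated into a single figure (the benchmark's actual aggregation and its fuzzy language correction are likely more involved):

```python
def primary_score(scores):
    """`scores` maps model -> {category: metric normalized to (0, 1)};
    this sketch simply averages over categories, then over models."""
    per_model = [sum(cats.values()) / len(cats) for cats in scores.values() if cats]
    return sum(per_model) / len(per_model)
```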
- On Loewner energy and curve composition The composition $\gamma \circ \eta$ of Jordan curves $\gamma$ and $\eta$ in universal Teichmüller space is defined through the composition $h_\gamma \circ h_\eta$ of their conformal weldings. We show that whenever $\gamma$ and $\eta$ are curves of finite Loewner energy $I^L$, the energy of the composition satisfies $I^L(\gamma \circ \eta) \lesssim_K I^L(\gamma) + I^L(\eta)$, with an explicit constant in terms of the quasiconformal constant $K$ of $\gamma$ and $\eta$. We also study the asymptotic growth rate of the Loewner energy under $n$ self-compositions $\gamma^n := \gamma \circ \cdots \circ \gamma$, showing $\limsup_{n \rightarrow \infty} \frac{1}{n} \log I^L(\gamma^n) \lesssim_K 1$, again with an explicit constant. Our approach is to define a new conformally-covariant rooted welding functional $W_h(y)$, and to show $W_h(y) \asymp_K I^L(\gamma)$ when $h$ is a welding of $\gamma$ and $y$ is any root (a point in the domain of $h$). In the course of our arguments we also give several new expressions for the Loewner energy, including generalized formulas in terms of the Riemann maps $f$ and $g$ for $\gamma$ which hold irrespective of the placement of $\gamma$ on the Riemann sphere, the normalization of $f$ and $g$, and which disks $D, D^c \subset \mathbb{C}$ serve as domains. An additional corollary is that $I^L(\gamma)$ is bounded above by a constant depending only on the Weil--Petersson distance from $\gamma$ to the circle. 2 authors · May 6, 2025
- PC-DARTS: Partial Channel Connections for Memory-Efficient Architecture Search Differentiable architecture search (DARTS) provided a fast solution for finding effective network architectures, but suffered from large memory and computing overheads in jointly training a super-network and searching for an optimal architecture. In this paper, we present a novel approach, namely Partially-Connected DARTS, which samples a small part of the super-network to reduce the redundancy in exploring the network space, thereby performing a more efficient search without compromising performance. In particular, we perform operation search in a subset of channels while bypassing the held-out part through a shortcut. This strategy may suffer from an undesired inconsistency in selecting the edges of the super-network caused by sampling different channels. We alleviate it using edge normalization, which adds a new set of edge-level parameters to reduce uncertainty in the search. Thanks to the reduced memory cost, PC-DARTS can be trained with a larger batch size and, consequently, enjoys both faster speed and higher training stability. Experimental results demonstrate the effectiveness of the proposed method. Specifically, we achieve an error rate of 2.57% on CIFAR10 with merely 0.1 GPU-days for architecture search, and a state-of-the-art top-1 error rate of 24.2% on ImageNet (under the mobile setting) using 3.8 GPU-days for search. Our code has been made available at: https://github.com/yuhuixu1993/PC-DARTS. 7 authors · Jul 12, 2019
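A rough PyTorch-style sketch of the partial channel connection on a single edge; the names are illustrative, channel shuffle and the edge-normalization parameters are omitted, and the authors' actual code is at the repository linked above:

```python
import torch
import torch.nn as nn

class PartialChannelMixedOp(nn.Module):
    """Only 1/k of the input channels pass through the candidate operations;
    the remaining channels bypass via a shortcut and are concatenated back."""

    def __init__(self, ops, channels, k=4):
        super().__init__()
        self.active = channels // k
        self.ops = nn.ModuleList(ops)  # each op maps `active` channels to `active` channels

    def forward(self, x, alphas):
        active, bypass = x[:, :self.active], x[:, self.active:]
        weights = torch.softmax(alphas, dim=-1)          # architecture weights for this edge
        mixed = sum(w * op(active) for w, op in zip(weights, self.ops))
        return torch.cat([mixed, bypass], dim=1)
```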