Search Machine Learning Papers

Tips: Use specific ML terms like "transformer", "reinforcement learning", or "computer vision". To search for an author, put the name in quotes, e.g. "Hinton".

Search Results for "transformer attention"

Found 20 papers

cs.CL cs.AI cs.LG
2024-11-20
Hymba: A Hybrid-head Architecture for Small Language Models
Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, Yingyan Lin, Jan Kautz, Pavlo Molchanov

We propose Hymba, a family of small language models featuring a hybrid-head parallel architecture that integrates transformer attention mechanisms with state space models (SSMs) for enhanced efficienc...
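Since the abstract is truncated above, here is a minimal sketch of the general hybrid-head idea it names: a causal softmax-attention head and a simple diagonal state-space recurrence run in parallel on the same input, with their outputs fused by averaging. All names, shapes, and the fusion rule are assumptions for illustration, not Hymba's actual architecture.

```python
# Minimal sketch of a hybrid head: causal softmax attention run in parallel
# with a simple diagonal state-space recurrence, outputs averaged.
# Illustrative only -- not Hymba's actual architecture.
import numpy as np

def causal_attention(x, Wq, Wk, Wv):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)
    scores[mask] = -np.inf                      # causal mask
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def ssm_head(x, a, B, C):
    # Diagonal linear recurrence: h_t = a * h_{t-1} + B x_t, y_t = C h_t
    T, _ = x.shape
    h = np.zeros(B.shape[0])
    out = np.zeros((T, C.shape[0]))
    for t in range(T):
        h = a * h + B @ x[t]
        out[t] = C @ h
    return out

rng = np.random.default_rng(0)
T, d = 8, 16
x = rng.standard_normal((T, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
a = 0.9 * np.ones(d)                            # per-channel decay
B = rng.standard_normal((d, d)) * 0.1
C = rng.standard_normal((d, d)) * 0.1
y = 0.5 * (causal_attention(x, Wq, Wk, Wv) + ssm_head(x, a, B, C))  # fused hybrid output
print(y.shape)  # (8, 16)
```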

cs.LG cs.AI cs.CL
2025-02-06
KVTuner: Sensitivity-Aware Layer-Wise Mixed-Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference
Xing Li, Zeyu Xing, Yiming Li, Linping Qu, Hui-Ling Zhen, Wulong Liu, Yiwu Yao, Sinno Jialin Pan, Mingxuan Yuan

KV cache quantization can improve large language model (LLM) inference throughput and latency in long contexts and large batch-size scenarios while preserving LLM effectiveness. However, current me...
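For context, the snippet below shows plain per-tensor uniform quantization of cached keys/values with a different bit width per layer. It illustrates generic mixed-precision KV-cache quantization only; the per-layer bit widths are made up and this is not KVTuner's sensitivity-aware scheme.

```python
# Generic uniform quantization of a KV cache with per-layer bit widths.
# Illustrative only -- not KVTuner's sensitivity-aware algorithm.
import numpy as np

def quantize(t, bits):
    # Symmetric per-tensor uniform quantization to `bits` bits.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(t).max() / qmax + 1e-12
    q = np.clip(np.round(t / scale), -qmax, qmax).astype(np.int8 if bits <= 8 else np.int16)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
layer_bits = [8, 4, 4, 8]                    # hypothetical per-layer precisions
kv_cache = [rng.standard_normal((2, 128, 64)).astype(np.float32) for _ in layer_bits]

quantized = [quantize(kv, b) for kv, b in zip(kv_cache, layer_bits)]
errors = [np.abs(dequantize(q, s) - kv).mean() for (q, s), kv in zip(quantized, kv_cache)]
print([f"{e:.4f}" for e in errors])          # lower-bit layers show larger reconstruction error
```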

cs.LG cs.AI cs.CL
2025-04-18
Integrating Locality-Aware Attention with Transformers for General Geometry PDEs
Minsu Koh, Beom-Chul Park, Heejo Kong, Seong-Whan Lee

Neural operators have emerged as promising frameworks for learning mappings governed by partial differential equations (PDEs), serving as data-driven alternatives to traditional numerical methods. Whi...

cs.CL cs.AI cs.LG
2024-06-04
Block Transformer: Global-to-Local Language Modeling for Fast Inference
Namgyu Ho, Sangmin Bae, Taehyeon Kim, Hyunjik Jo, Yireun Kim, Tal Schuster, Adam Fisch, James Thorne, Se-Young Yun

We introduce the Block Transformer which adopts hierarchical global-to-local modeling to autoregressive transformers to mitigate the inference bottlenecks associated with self-attention. Self-attentio...
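As a rough illustration of global-to-local modeling, the toy below attends over pooled block embeddings first (global) and then within each block (local). It is a deliberate simplification under assumed shapes, not the Block Transformer's actual design.

```python
# Toy global-to-local decomposition: attention over pooled block embeddings
# (global), then attention within each block (local).
# Simplified illustration only -- not the Block Transformer's exact design.
import numpy as np

def softmax_attn(q, k, v):
    s = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
T, d, block = 16, 8, 4
x = rng.standard_normal((T, d))

blocks = x.reshape(T // block, block, d)
block_emb = blocks.mean(axis=1)                              # pool each block into one global unit
global_ctx = softmax_attn(block_emb, block_emb, block_emb)   # coarse attention across blocks

out = np.zeros_like(blocks)
for i, blk in enumerate(blocks):
    blk = blk + global_ctx[i]                  # inject the block's global context
    out[i] = softmax_attn(blk, blk, blk)       # local attention within the block
print(out.reshape(T, d).shape)                 # (16, 8)
```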

cs.CL cs.AI cs.LG
2024-10-10
RecurFormer: Not All Transformer Heads Need Self-Attention
Ruiqing Yan, Linghan Zheng, Xingbo Du, Han Zou, Yufeng Guo, Jianfei Yang

Transformer-based large language models (LLMs) excel in modeling complex language patterns but face significant computational costs during inference, especially with long inputs due to the attention m...

cs.CL cs.AI cs.LG
2024-11-11
More Expressive Attention with Negative Weights
Ang Lv, Ruobing Xie, Shuaipeng Li, Jiayi Liao, Xingwu Sun, Zhanhui Kang, Di Wang, Rui Yan

We propose a novel attention mechanism, named Cog Attention, that enables attention weights to be negative for enhanced expressiveness, which stems from two key factors: (1) Cog Attention enhances par...
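The abstract is cut off, so the snippet below only shows one simple way to obtain signed, normalized attention weights (dividing raw scores by the sum of their absolute values). This construction is an assumption for illustration and not necessarily how Cog Attention works.

```python
# One simple way to get signed attention weights: normalize raw scores by the
# sum of their absolute values instead of applying softmax. Illustrative only;
# not necessarily the construction used by Cog Attention.
import numpy as np

def signed_attention(q, k, v, eps=1e-9):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = scores / (np.abs(scores).sum(axis=-1, keepdims=True) + eps)  # sum(|w|) = 1, w may be < 0
    return w @ v, w

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((5, 8)) for _ in range(3))
out, w = signed_attention(q, k, v)
print(w.min() < 0, np.allclose(np.abs(w).sum(axis=-1), 1.0))  # negative weights, L1-normalized rows
```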

cs.LG cs.AI cs.CL
2024-10-07
TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention
Lijie Yang, Zhihao Zhang, Zhuofu Chen, Zikun Li, Zhihao Jia

Large language models (LLMs) have driven significant advancements across diverse NLP tasks, with long-context models gaining prominence for handling extended inputs. However, the expanding key-value (...
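For background, the sketch below performs generic top-k sparse attention for a single decode step against a KV cache: score all cached keys, keep the k highest-scoring positions, and attend only to those. TidalDecode's position-persistent selection across layers is not reproduced here.

```python
# Generic top-k sparse attention for one decode step over a KV cache.
# Illustrative only -- not TidalDecode's position-persistent selection.
import numpy as np

def topk_sparse_attention(q, K, V, k=8):
    scores = K @ q / np.sqrt(q.shape[-1])        # (T,) scores against all cached keys
    keep = np.argsort(scores)[-k:]               # indices of the k largest scores
    s = scores[keep]
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ V[keep], keep

rng = np.random.default_rng(0)
T, d = 1024, 64
K, V = rng.standard_normal((T, d)), rng.standard_normal((T, d))
q = rng.standard_normal(d)
out, kept = topk_sparse_attention(q, K, V, k=16)
print(out.shape, kept.shape)                     # (64,) (16,)
```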

cs.CL cs.AI cs.LG
2024-02-13
Measuring and Controlling Instruction (In)Stability in Language Model Dialogs
Kenneth Li, Tianle Liu, Naomi Bashkansky, David Bau, Fernanda Viégas, Hanspeter Pfister, Martin Wattenberg

System-prompting is a standard tool for customizing language-model chatbots, enabling them to follow a specific instruction. An implicit assumption in the use of system prompts is that they will be st...

cs.LG cs.AI cs.CL
2023-10-01
JoMA: Demystifying Multilayer Transformers via JOint Dynamics of MLP and Attention
Yuandong Tian, Yiping Wang, Zhenyu Zhang, Beidi Chen, Simon Du

We propose Joint MLP/Attention (JoMA) dynamics, a novel mathematical framework to understand the training procedure of multilayer Transformer architectures. This is achieved by integrating out the sel...

cs.CL cs.AI cs.IR cs.LG
2019-07-01
Do Transformer Attention Heads Provide Transparency in Abstractive Summarization?
Joris Baan, Maartje ter Hoeve, Marlies van der Wees, Anne Schuth, Maarten de Rijke

Learning algorithms become more powerful, often at the cost of increased complexity. In response, the demand for algorithms to be transparent is growing. In NLP tasks, attention distributions learned ...

cs.CL cs.AI cs.LG
2024-04-03
Cross-Architecture Transfer Learning for Linear-Cost Inference Transformers
Sehyun Choi

Recently, multiple architectures have been proposed to improve the efficiency of Transformer Language Models by changing the design of the self-attention block to have a linear-cost inference ...
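As background on linear-cost self-attention in general (not this paper's cross-architecture transfer method), the sketch below implements causal linear attention with an elu(x)+1 feature map: two running sums replace the softmax, giving O(T) decoding cost.

```python
# Causal linear attention with a simple feature map (elu(x) + 1): running sums
# of phi(k) v^T and phi(k) replace the softmax, giving O(T) decoding cost.
# A generic linear-cost block, not this paper's specific transfer recipe.
import numpy as np

def elu_plus_one(x):
    return np.where(x > 0, x + 1.0, np.exp(x))

def causal_linear_attention(Q, K, V):
    phi_q, phi_k = elu_plus_one(Q), elu_plus_one(K)
    d_k, d_v = K.shape[-1], V.shape[-1]
    S = np.zeros((d_k, d_v))          # running sum of phi(k_t) v_t^T
    z = np.zeros(d_k)                 # running sum of phi(k_t)
    out = np.zeros_like(V)
    for t in range(Q.shape[0]):
        S += np.outer(phi_k[t], V[t])
        z += phi_k[t]
        out[t] = (phi_q[t] @ S) / (phi_q[t] @ z + 1e-9)
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((10, 16)) for _ in range(3))
print(causal_linear_attention(Q, K, V).shape)   # (10, 16)
```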

cs.LG cs.AI cs.CL
2024-06-22
What Matters in Transformers? Not All Attention is Needed
Shwai He, Guoheng Sun, Zheyu Shen, Ang Li

While scaling Transformer-based large language models (LLMs) has demonstrated promising performance across various tasks, it also introduces redundant architectures, posing efficiency challenges for r...

cs.CL cs.AI cs.LG I.2.7
2025-05-05
RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale
Daniel Goldstein, Eric Alcaide, Janna Lu, Eugene Cheah

We present Rapid Attention Distillation to Linear Attention Decoders at Scale (RADLADS), a protocol for rapidly converting softmax attention transformers into linear attention decoder models, along wi...

cs.LG cs.AI cs.CC cs.CL
2024-12-23
Theoretical Constraints on the Expressive Power of $\mathsf{RoPE}$-based Tensor Attention Transformers
Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song, Mingda Wan

Tensor Attention extends traditional attention mechanisms by capturing high-order correlations across multiple modalities, addressing the limitations of classical matrix-based attention. Meanwhile, Ro...
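Rotary position embedding (RoPE) itself is standard; the snippet below applies it by rotating consecutive coordinate pairs of q and k with position-dependent angles before computing attention scores. Only RoPE is shown, not the paper's tensor-attention construction.

```python
# Standard rotary position embedding (RoPE): rotate consecutive coordinate
# pairs of q/k by position-dependent angles before computing attention scores.
# Only RoPE is shown here, not the paper's tensor-attention construction.
import numpy as np

def rope(x, base=10000.0):
    T, d = x.shape                                   # d must be even
    inv_freq = base ** (-np.arange(0, d, 2) / d)     # (d/2,) rotation frequencies
    angles = np.arange(T)[:, None] * inv_freq[None, :]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin               # 2-D rotation of each coordinate pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal((6, 8)), rng.standard_normal((6, 8))
scores = rope(q) @ rope(k).T / np.sqrt(q.shape[-1])  # relative-position-aware scores
print(scores.shape)                                  # (6, 6)
```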

cs.LG cs.AI cs.CL
2025-06-30
Interpretable AI for Time-Series: Multi-Model Heatmap Fusion with Global Attention and NLP-Generated Explanations
Jiztom Kavalakkatt Francis, Matthew J Darr

In this paper, we present a novel framework for enhancing model interpretability by integrating heatmaps produced separately by ResNet and a restructured 2D Transformer with globally weighted input sa...

cs.LG cs.AI cs.CL cs.DS
2024-10-14
Learning Linear Attention in Polynomial Time
Morris Yau, Ekin Akyürek, Jiayuan Mao, Joshua B. Tenenbaum, Stefanie Jegelka, Jacob Andreas

Previous research has explored the computational expressivity of Transformer models in simulating Boolean circuits or Turing machines. However, the learnability of these simulators from observational ...

cs.LG cs.AI cs.CL
2025-06-20
From Concepts to Components: Concept-Agnostic Attention Module Discovery in Transformers
Jingtong Su, Julia Kempe, Karen Ullrich

Transformers have achieved state-of-the-art performance across language and vision tasks. This success drives the imperative to interpret their internal mechanisms with the dual goals of enhancing per...

cs.CL cs.AI cs.LG
2023-05-24
Segmented Recurrent Transformer: An Efficient Sequence-to-Sequence Model
Yinghan Long, Sayeed Shafayet Chowdhury, Kaushik Roy

Transformers have shown dominant performance across a range of domains including language and vision. However, their computational cost grows quadratically with the sequence length, making their usage...

cs.LG cs.AI cs.CL
2024-04-03
How Sparse Attention Approximates Exact Attention? Your Attention is Naturally $n^C$-Sparse
Yichuan Deng, Zhao Song, Jing Xiong, Chiwun Yang

Sparse Attention is a technique that approximates standard attention computation with sub-quadratic complexity. This is achieved by selectively ignoring smaller entries in the attention matrix during ...

cs.LG cond-mat.mes-hall cond-mat.mtrl-sci cs.AI cs.CL
2025-01-04
Graph-Aware Isomorphic Attention for Adaptive Dynamics in Transformers
Markus J. Buehler

We present an approach to modifying Transformer architectures by integrating graph-aware relational reasoning into the attention mechanism, merging concepts from graph neural networks and language mod...
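As a generic illustration of graph-aware attention (not the paper's isomorphic-attention mechanism), the sketch below masks attention scores with an adjacency matrix so each token attends only to its graph neighbors and itself.

```python
# Generic graph-masked attention: scores between tokens that are not connected
# in a given adjacency matrix are set to -inf before the softmax.
# Broad idea only -- not the paper's graph-aware isomorphic attention.
import numpy as np

def graph_masked_attention(x, adj, Wq, Wk, Wv):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(adj | np.eye(len(adj), dtype=bool), scores, -np.inf)  # keep edges + self
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
n, d = 6, 8
x = rng.standard_normal((n, d))
adj = rng.random((n, n)) < 0.3
adj = adj | adj.T                                   # symmetric toy graph
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
print(graph_masked_attention(x, adj, Wq, Wk, Wv).shape)  # (6, 8)
```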