Search Machine Learning Papers
Search Results for "transformer attention"
Found 20 papers
Hymba: A Hybrid-head Architecture for Small Language Models
We propose Hymba, a family of small language models featuring a hybrid-head parallel architecture that integrates transformer attention mechanisms with state space models (SSMs) for enhanced efficienc...
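The snippet only names the idea of running attention and SSM heads in parallel. Purely as an illustration of that general pattern (not Hymba's actual architecture), the sketch below pairs a causal softmax-attention branch with a toy diagonal linear recurrence standing in for an SSM and averages the two outputs; the module names, recurrence form, and fusion rule are all assumptions.

```python
# Illustrative only: a toy "hybrid head" that runs softmax attention and a
# simple diagonal linear recurrence (a stand-in for an SSM) in parallel.
# This is NOT Hymba's architecture; layer names and the fusion rule are assumptions.
import math
import torch
import torch.nn as nn

class ToyHybridHead(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.in_proj = nn.Linear(d_model, d_model)      # input projection for the SSM-like branch
        self.decay = nn.Parameter(torch.rand(d_model))  # per-channel decay, squashed to (0, 1)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # Branch 1: standard causal softmax attention.
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        mask = torch.triu(torch.ones(x.size(1), x.size(1), dtype=torch.bool), diagonal=1)
        attn_out = torch.softmax(scores.masked_fill(mask, float("-inf")), dim=-1) @ v

        # Branch 2: a toy SSM-like recurrence h_t = a * h_{t-1} + u_t (elementwise).
        u = self.in_proj(x)
        a = torch.sigmoid(self.decay)
        h = torch.zeros_like(u[:, 0])
        ssm_out = []
        for t in range(u.size(1)):
            h = a * h + u[:, t]
            ssm_out.append(h)
        ssm_out = torch.stack(ssm_out, dim=1)

        # Fuse the two parallel branches (here: a simple average).
        return self.out(0.5 * (attn_out + ssm_out))

x = torch.randn(2, 16, 64)
print(ToyHybridHead(64)(x).shape)  # torch.Size([2, 16, 64])
```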
KVTuner: Sensitivity-Aware Layer-Wise Mixed-Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference
KV cache quantization can improve Large Language Models (LLMs) inference throughput and latency in long contexts and large batch-size scenarios while preserving LLMs effectiveness. However, current me...
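As a rough illustration of quantizing a KV cache in general (not KVTuner's sensitivity-aware, layer-wise mixed-precision method), the sketch below stores keys with symmetric per-tensor quantization and dequantizes them before use; the per-layer bit assignment is a hypothetical placeholder.

```python
# Illustrative only: symmetric per-tensor quantization of cached K/V values,
# NOT KVTuner's sensitivity-aware mixed-precision scheme.
import torch

def quantize(t: torch.Tensor, n_bits: int = 8):
    qmax = 2 ** (n_bits - 1) - 1
    scale = t.abs().max().clamp(min=1e-8) / qmax
    # Low-bit values are stored in an int8 container for simplicity.
    q = torch.clamp(torch.round(t / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

# Hypothetical per-layer bit assignment (e.g. more bits for more sensitive layers).
bits_per_layer = {0: 8, 1: 4, 2: 8}

k = torch.randn(1, 128, 64)           # cached keys for one layer
k_q, k_scale = quantize(k, bits_per_layer[0])
k_hat = dequantize(k_q, k_scale)
print((k - k_hat).abs().max())        # small reconstruction error
```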
Integrating Locality-Aware Attention with Transformers for General Geometry PDEs
Neural operators have emerged as promising frameworks for learning mappings governed by partial differential equations (PDEs), serving as data-driven alternatives to traditional numerical methods. Whi...
Block Transformer: Global-to-Local Language Modeling for Fast Inference
We introduce the Block Transformer, which applies hierarchical global-to-local modeling to autoregressive transformers to mitigate the inference bottlenecks associated with self-attention. Self-attentio...

RecurFormer: Not All Transformer Heads Need Self-Attention
Transformer-based large language models (LLMs) excel in modeling complex language patterns but face significant computational costs during inference, especially with long inputs due to the attention m...
More Expressive Attention with Negative Weights
We propose a novel attention mechanism, named Cog Attention, that enables attention weights to be negative for enhanced expressiveness, which stems from two key factors: (1) Cog Attention enhances par...
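The snippet does not show Cog Attention's actual formulation. Purely as a generic illustration of attention whose weights can be negative, the sketch below replaces softmax with a signed L1-style normalization (scores divided by the sum of their absolute values); this normalization is an assumption, not the paper's mechanism.

```python
# Illustrative only: attention with signed weights via an L1-style normalization.
# This is NOT Cog Attention's formulation, just a generic negative-weight variant.
import math
import torch

def signed_attention(q, k, v, eps: float = 1e-8):
    # q, k, v: (batch, seq_len, d)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    # Normalize by the sum of absolute scores so each weight keeps its sign.
    weights = scores / (scores.abs().sum(dim=-1, keepdim=True) + eps)
    return weights @ v, weights

q, k, v = (torch.randn(1, 8, 32) for _ in range(3))
out, w = signed_attention(q, k, v)
print(out.shape, (w < 0).any().item())  # torch.Size([1, 8, 32]) True
```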
TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention
Large language models (LLMs) have driven significant advancements across diverse NLP tasks, with long-context models gaining prominence for handling extended inputs. However, the expanding key-value (...
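As a generic sketch of sparse decoding attention (not TidalDecode's position-persistent selection algorithm), the example below scores all cached keys for the current query token but attends only to the top-k of them; the value of k and the selection rule are assumptions.

```python
# Illustrative only: top-k sparse attention for a single decode step.
# NOT TidalDecode's position-persistent selection, just a generic sketch.
import math
import torch

def topk_decode_attention(q, k_cache, v_cache, top_k: int = 32):
    # q: (batch, d); k_cache, v_cache: (batch, cache_len, d)
    scores = (k_cache @ q.unsqueeze(-1)).squeeze(-1) / math.sqrt(q.size(-1))   # (batch, cache_len)
    top_scores, top_idx = scores.topk(min(top_k, scores.size(-1)), dim=-1)
    weights = torch.softmax(top_scores, dim=-1)                                # (batch, top_k)
    top_v = torch.gather(v_cache, 1, top_idx.unsqueeze(-1).expand(-1, -1, v_cache.size(-1)))
    return (weights.unsqueeze(-1) * top_v).sum(dim=1)                          # (batch, d)

q = torch.randn(2, 64)
k_cache, v_cache = torch.randn(2, 1024, 64), torch.randn(2, 1024, 64)
print(topk_decode_attention(q, k_cache, v_cache).shape)  # torch.Size([2, 64])
```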
Measuring and Controlling Instruction (In)Stability in Language Model Dialogs
System-prompting is a standard tool for customizing language-model chatbots, enabling them to follow a specific instruction. An implicit assumption in the use of system prompts is that they will be st...
JoMA: Demystifying Multilayer Transformers via JOint Dynamics of MLP and Attention
We propose Joint MLP/Attention (JoMA) dynamics, a novel mathematical framework to understand the training procedure of multilayer Transformer architectures. This is achieved by integrating out the sel...
Do Transformer Attention Heads Provide Transparency in Abstractive Summarization?
Learning algorithms become more powerful, often at the cost of increased complexity. In response, the demand for algorithms to be transparent is growing. In NLP tasks, attention distributions learned ...
Cross-Architecture Transfer Learning for Linear-Cost Inference Transformers
Recently, multiple architectures have been proposed to improve the efficiency of Transformer language models by changing the design of the self-attention block to have a linear-cost inference ...
What Matters in Transformers? Not All Attention is Needed
While scaling Transformer-based large language models (LLMs) has demonstrated promising performance across various tasks, it also introduces redundant architectures, posing efficiency challenges for r...
RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale
We present Rapid Attention Distillation to Linear Attention Decoders at Scale (RADLADS), a protocol for rapidly converting softmax attention transformers into linear attention decoder models, along wi...
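For reference, the general idea behind linear attention (independent of RADLADS' distillation protocol) is to apply a feature map to queries and keys so the causal computation can be rearranged into running sums instead of a softmax over all positions. The sketch below uses the common (ELU + 1) feature map as an assumed choice.

```python
# Illustrative only: causal linear attention with an (elu + 1) feature map,
# computed with running sums so cost is linear in sequence length.
# This shows the generic linear-attention idea, not RADLADS' converted models.
import torch
import torch.nn.functional as F

def causal_linear_attention(q, k, v, eps: float = 1e-6):
    # q, k, v: (batch, seq_len, d)
    q, k = F.elu(q) + 1, F.elu(k) + 1          # positive feature maps
    b, t, d = q.shape
    state = torch.zeros(b, d, d)               # running sum of k_t v_t^T
    norm = torch.zeros(b, d)                   # running sum of k_t
    out = []
    for i in range(t):
        state = state + k[:, i].unsqueeze(-1) * v[:, i].unsqueeze(1)   # outer product k v^T
        norm = norm + k[:, i]
        num = (q[:, i].unsqueeze(1) @ state).squeeze(1)                # q_t^T applied to the state
        den = (q[:, i] * norm).sum(-1, keepdim=True) + eps
        out.append(num / den)
    return torch.stack(out, dim=1)

q, k, v = (torch.randn(1, 16, 32) for _ in range(3))
print(causal_linear_attention(q, k, v).shape)  # torch.Size([1, 16, 32])
```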
Theoretical Constraints on the Expressive Power of $\mathsf{RoPE}$-based Tensor Attention Transformers
Tensor Attention extends traditional attention mechanisms by capturing high-order correlations across multiple modalities, addressing the limitations of classical matrix-based attention. Meanwhile, Ro...
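The tensor-attention construction itself is beyond a short sketch, but the RoPE component named in the title is standard: pairs of query/key channels are rotated by a position-dependent angle before the attention product. A minimal version (NeoX-style half-split convention, base 10000) is below; it illustrates RoPE only, not the paper's tensor attention.

```python
# Standard rotary position embedding (RoPE) applied to a (batch, seq_len, d) tensor.
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    b, t, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)   # (d/2,)
    angles = torch.arange(t, dtype=torch.float32).unsqueeze(1) * freqs     # (t, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1, x2) channel pair by its position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(2, 16, 64)
print(apply_rope(q).shape)  # torch.Size([2, 16, 64])
```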
Interpretable AI for Time-Series: Multi-Model Heatmap Fusion with Global Attention and NLP-Generated Explanations
In this paper, we present a novel framework for enhancing model interpretability by integrating heatmaps produced separately by ResNet and a restructured 2D Transformer with globally weighted input sa...
Learning Linear Attention in Polynomial Time
Previous research has explored the computational expressivity of Transformer models in simulating Boolean circuits or Turing machines. However, the learnability of these simulators from observational ...
From Concepts to Components: Concept-Agnostic Attention Module Discovery in Transformers
Transformers have achieved state-of-the-art performance across language and vision tasks. This success drives the imperative to interpret their internal mechanisms with the dual goals of enhancing per...
Segmented Recurrent Transformer: An Efficient Sequence-to-Sequence Model
Transformers have shown dominant performance across a range of domains including language and vision. However, their computational cost grows quadratically with the sequence length, making their usage...
How Sparse Attention Approximates Exact Attention? Your Attention is Naturally $n^C$-Sparse
Sparse Attention is a technique that approximates standard attention computation with sub-quadratic complexity. This is achieved by selectively ignoring smaller entries in the attention matrix during ...
Graph-Aware Isomorphic Attention for Adaptive Dynamics in Transformers
We present an approach to modifying Transformer architectures by integrating graph-aware relational reasoning into the attention mechanism, merging concepts from graph neural networks and language mod...