Search Machine Learning Papers
Search Results for "transformer attention"
Found 20 papers
Hymba: A Hybrid-head Architecture for Small Language Models
We propose Hymba, a family of small language models featuring a hybrid-head parallel architecture that integrates transformer attention mechanisms with state space models (SSMs) for enhanced efficienc...
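The snippet only names the idea of running attention and SSM heads in parallel. Purely as an illustration of that general pattern (not Hymba's actual architecture), the sketch below pairs a causal softmax-attention branch with a toy diagonal linear recurrence standing in for an SSM and averages the two outputs; the module names, recurrence form, and fusion rule are all assumptions.

```python
# Illustrative only: a toy "hybrid head" that runs softmax attention and a
# simple diagonal linear recurrence (a stand-in for an SSM) in parallel.
# This is NOT Hymba's architecture; layer names and the fusion rule are assumptions.
import math
import torch
import torch.nn as nn

class ToyHybridHead(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.in_proj = nn.Linear(d_model, d_model)      # input projection for the SSM-like branch
        self.decay = nn.Parameter(torch.rand(d_model))  # per-channel decay, squashed to (0, 1)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # Branch 1: standard causal softmax attention.
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        mask = torch.triu(torch.ones(x.size(1), x.size(1), dtype=torch.bool), diagonal=1)
        attn_out = torch.softmax(scores.masked_fill(mask, float("-inf")), dim=-1) @ v

        # Branch 2: a toy SSM-like recurrence h_t = a * h_{t-1} + u_t (elementwise).
        u = self.in_proj(x)
        a = torch.sigmoid(self.decay)
        h = torch.zeros_like(u[:, 0])
        ssm_out = []
        for t in range(u.size(1)):
            h = a * h + u[:, t]
            ssm_out.append(h)
        ssm_out = torch.stack(ssm_out, dim=1)

        # Fuse the two parallel branches (here: a simple average).
        return self.out(0.5 * (attn_out + ssm_out))

x = torch.randn(2, 16, 64)
print(ToyHybridHead(64)(x).shape)  # torch.Size([2, 16, 64])
```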
KVTuner: Sensitivity-Aware Layer-Wise Mixed-Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference
KV cache quantization can improve Large Language Models (LLMs) inference throughput and latency in long contexts and large batch-size scenarios while preserving LLMs effectiveness. However, current me...
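As a rough illustration of quantizing a KV cache in general (not KVTuner's sensitivity-aware, layer-wise mixed-precision method), the sketch below stores keys with symmetric per-tensor quantization and dequantizes them before use; the per-layer bit assignment is a hypothetical placeholder.

```python
# Illustrative only: symmetric per-tensor quantization of cached K/V values,
# NOT KVTuner's sensitivity-aware mixed-precision scheme.
import torch

def quantize(t: torch.Tensor, n_bits: int = 8):
    qmax = 2 ** (n_bits - 1) - 1
    scale = t.abs().max().clamp(min=1e-8) / qmax
    # Low-bit values are stored in an int8 container for simplicity.
    q = torch.clamp(torch.round(t / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

# Hypothetical per-layer bit assignment (e.g. more bits for more sensitive layers).
bits_per_layer = {0: 8, 1: 4, 2: 8}

k = torch.randn(1, 128, 64)           # cached keys for one layer
k_q, k_scale = quantize(k, bits_per_layer[0])
k_hat = dequantize(k_q, k_scale)
print((k - k_hat).abs().max())        # small reconstruction error
```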
Integrating Locality-Aware Attention with Transformers for General Geometry PDEs
Neural operators have emerged as promising frameworks for learning mappings governed by partial differential equations (PDEs), serving as data-driven alternatives to traditional numerical methods. Whi...
Block Transformer: Global-to-Local Language Modeling for Fast Inference
We introduce the Block Transformer, which applies hierarchical global-to-local modeling to autoregressive transformers to mitigate the inference bottlenecks associated with self-attention. Self-attentio...

RecurFormer: Not All Transformer Heads Need Self-Attention
Transformer-based large language models (LLMs) excel in modeling complex language patterns but face significant computational costs during inference, especially with long inputs due to the attention m...
More Expressive Attention with Negative Weights
We propose a novel attention mechanism, named Cog Attention, that enables attention weights to be negative for enhanced expressiveness, which stems from two key factors: (1) Cog Attention enhances par...
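The snippet does not show Cog Attention's actual formulation. Purely as a generic illustration of attention whose weights can be negative, the sketch below replaces softmax with a signed L1-style normalization (scores divided by the sum of their absolute values); this normalization is an assumption, not the paper's mechanism.

```python
# Illustrative only: attention with signed weights via an L1-style normalization.
# This is NOT Cog Attention's formulation, just a generic negative-weight variant.
import math
import torch

def signed_attention(q, k, v, eps: float = 1e-8):
    # q, k, v: (batch, seq_len, d)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    # Normalize by the sum of absolute scores so each weight keeps its sign.
    weights = scores / (scores.abs().sum(dim=-1, keepdim=True) + eps)
    return weights @ v, weights

q, k, v = (torch.randn(1, 8, 32) for _ in range(3))
out, w = signed_attention(q, k, v)
print(out.shape, (w < 0).any().item())  # torch.Size([1, 8, 32]) True
```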
TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention
Large language models (LLMs) have driven significant advancements across diverse NLP tasks, with long-context models gaining prominence for handling extended inputs. However, the expanding key-value (...
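As a generic sketch of sparse decoding attention (not TidalDecode's position-persistent selection algorithm), the example below scores all cached keys for the current query token but attends only to the top-k of them; the value of k and the selection rule are assumptions.

```python
# Illustrative only: top-k sparse attention for a single decode step.
# NOT TidalDecode's position-persistent selection, just a generic sketch.
import math
import torch

def topk_decode_attention(q, k_cache, v_cache, top_k: int = 32):
    # q: (batch, d); k_cache, v_cache: (batch, cache_len, d)
    scores = (k_cache @ q.unsqueeze(-1)).squeeze(-1) / math.sqrt(q.size(-1))   # (batch, cache_len)
    top_scores, top_idx = scores.topk(min(top_k, scores.size(-1)), dim=-1)
    weights = torch.softmax(top_scores, dim=-1)                                # (batch, top_k)
    top_v = torch.gather(v_cache, 1, top_idx.unsqueeze(-1).expand(-1, -1, v_cache.size(-1)))
    return (weights.unsqueeze(-1) * top_v).sum(dim=1)                          # (batch, d)

q = torch.randn(2, 64)
k_cache, v_cache = torch.randn(2, 1024, 64), torch.randn(2, 1024, 64)
print(topk_decode_attention(q, k_cache, v_cache).shape)  # torch.Size([2, 64])
```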
Measuring and Controlling Instruction (In)Stability in Language Model Dialogs
System-prompting is a standard tool for customizing language-model chatbots, enabling them to follow a specific instruction. An implicit assumption in the use of system prompts is that they will be st...
JoMA: Demystifying Multilayer Transformers via JOint Dynamics of MLP and Attention
We propose Joint MLP/Attention (JoMA) dynamics, a novel mathematical framework to understand the training procedure of multilayer Transformer architectures. This is achieved by integrating out the sel...
Do Transformer Attention Heads Provide Transparency in Abstractive Summarization?
Learning algorithms become more powerful, often at the cost of increased complexity. In response, the demand for algorithms to be transparent is growing. In NLP tasks, attention distributions learned ...
Cross-Architecture Transfer Learning for Linear-Cost Inference Transformers
Recently, multiple architectures have been proposed to improve the efficiency of Transformer language models by changing the design of the self-attention block to have a linear-cost inference ...
What Matters in Transformers? Not All Attention is Needed
While scaling Transformer-based large language models (LLMs) has demonstrated promising performance across various tasks, it also introduces redundant architectures, posing efficiency challenges for r...
RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale
We present Rapid Attention Distillation to Linear Attention Decoders at Scale (RADLADS), a protocol for rapidly converting softmax attention transformers into linear attention decoder models, along wi...
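For reference, the general idea behind linear attention (independent of RADLADS' distillation protocol) is to apply a feature map to queries and keys so the causal computation can be rearranged into running sums instead of a softmax over all positions. The sketch below uses the common (ELU + 1) feature map as an assumed choice.

```python
# Illustrative only: causal linear attention with an (elu + 1) feature map,
# computed with running sums so cost is linear in sequence length.
# This shows the generic linear-attention idea, not RADLADS' converted models.
import torch
import torch.nn.functional as F

def causal_linear_attention(q, k, v, eps: float = 1e-6):
    # q, k, v: (batch, seq_len, d)
    q, k = F.elu(q) + 1, F.elu(k) + 1          # positive feature maps
    b, t, d = q.shape
    state = torch.zeros(b, d, d)               # running sum of k_t v_t^T
    norm = torch.zeros(b, d)                   # running sum of k_t
    out = []
    for i in range(t):
        state = state + k[:, i].unsqueeze(-1) * v[:, i].unsqueeze(1)   # outer product k v^T
        norm = norm + k[:, i]
        num = (q[:, i].unsqueeze(1) @ state).squeeze(1)                # q_t^T applied to the state
        den = (q[:, i] * norm).sum(-1, keepdim=True) + eps
        out.append(num / den)
    return torch.stack(out, dim=1)

q, k, v = (torch.randn(1, 16, 32) for _ in range(3))
print(causal_linear_attention(q, k, v).shape)  # torch.Size([1, 16, 32])
```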
Theoretical Constraints on the Expressive Power of $\mathsf{RoPE}$-based Tensor Attention Transformers
Tensor Attention extends traditional attention mechanisms by capturing high-order correlations across multiple modalities, addressing the limitations of classical matrix-based attention. Meanwhile, Ro...
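The tensor-attention construction itself is beyond a short sketch, but the RoPE component named in the title is standard: pairs of query/key channels are rotated by a position-dependent angle before the attention product. A minimal version (NeoX-style half-split convention, base 10000) is below; it illustrates RoPE only, not the paper's tensor attention.

```python
# Standard rotary position embedding (RoPE) applied to a (batch, seq_len, d) tensor.
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    b, t, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)   # (d/2,)
    angles = torch.arange(t, dtype=torch.float32).unsqueeze(1) * freqs     # (t, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1, x2) channel pair by its position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(2, 16, 64)
print(apply_rope(q).shape)  # torch.Size([2, 16, 64])
```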
Interpretable AI for Time-Series: Multi-Model Heatmap Fusion with Global Attention and NLP-Generated Explanations
In this paper, we present a novel framework for enhancing model interpretability by integrating heatmaps produced separately by ResNet and a restructured 2D Transformer with globally weighted input sa...
Learning Linear Attention in Polynomial Time
Previous research has explored the computational expressivity of Transformer models in simulating Boolean circuits or Turing machines. However, the learnability of these simulators from observational ...
From Concepts to Components: Concept-Agnostic Attention Module Discovery in Transformers
Transformers have achieved state-of-the-art performance across language and vision tasks. This success drives the imperative to interpret their internal mechanisms with the dual goals of enhancing per...
Segmented Recurrent Transformer: An Efficient Sequence-to-Sequence Model
Transformers have shown dominant performance across a range of domains including language and vision. However, their computational cost grows quadratically with the sequence length, making their usage...
How Sparse Attention Approximates Exact Attention? Your Attention is Naturally $n^C$-Sparse
Sparse Attention is a technique that approximates standard attention computation with sub-quadratic complexity. This is achieved by selectively ignoring smaller entries in the attention matrix during ...
Graph-Aware Isomorphic Attention for Adaptive Dynamics in Transformers
We present an approach to modifying Transformer architectures by integrating graph-aware relational reasoning into the attention mechanism, merging concepts from graph neural networks and language mod...