MOM: LINEAR SEQUENCE MODELING WITH MIXTURE-OF-MEMORIES
[Overview]
MoM: Mixture-of-Memories Architecture Summary
1. Introduction
MoM is a model built on a mixture of memories, providing mechanisms for long-range dependencies and the retention of diverse information. Its core components are a router, linear recurrent memory modules, and a shared memory, which together enable efficient information processing.
2. Router Mechanism
- Purpose: assign each input token to its top-k memories according to importance scores.
- Operation:
- A linear layer produces importance scores for each input token.
- Softmax is applied, then the top-k scores are selected and normalized.
- Formulation: Equations (3)-(4) in Section 3.2.1.
3. Linear Recurrent Memory Module
- Update of each memory:
- Key-value projection: the input is projected into each memory's keys and values.
- Memory update: each activated memory state is updated recurrently (Equation (6)).
- Memory mixing: the updated memory states are combined by a weighted sum using the importance scores.
- Output: the query is applied to the mixed memory (Equation (8)).
4. Shared Memory
- Purpose: store information from the entire sequence to capture long-term dependencies.
- Properties:
- Accessible to all tokens.
- Contributes to improved performance and stability.
5. Hardware-Efficient Implementation
- Key idea:
- Group tokens by their memory routing results → build a varlen input sequence.
- Efficient computation with Triton kernels.
- Implementation steps:
- Reorder tokens according to the routing results.
- Concatenate them into a varlen input sequence.
- Process with the Triton kernel.
- Split results back to their respective memories.
- Restore the final outputs to the original sequence order.
6. Comparison of Memory Update Rules
The memory update rules of various linear sequence models (Linear Attn, RetNet, GLA, DeltaNet, etc.) are compared in Table 1.
7. Summary
Through its router, memory modules, and shared memory, MoM enables the retention of diverse information and the learning of long-range dependencies. Its hardware-efficient implementation builds on existing techniques such as Triton kernels, achieving both strong performance and scalability.
목차
- MOM: LINEAR SEQUENCE MODELING WITH MIXTURE-OF-MEMORIES
- ABSTRACT
- 1 INTRODUCTION
- 2 Preliminary
- 3 METHOD
- 4 EXPERIMENTS
- A RELATED WORK
- MIXTURE-OF-EXPERTS
- B COMPARISON BETWEEN MOM AND MOE
- C EXPERIMENTS DETAILS
- D COMMONSENSE REASONING TASKS
- E Training Loss Comparison
- F THE HYBRID OF MOM AND TRANSFORMER
- G FAIRNESS
- H DETAILED SCALING RESULTS
- I FULL RESULTS
MOM: LINEAR SEQUENCE MODELING WITH MIXTURE-OF-MEMORIES
[Summary]
- This paper, "MoM: Linear Sequence Modeling with Mixture-of-Memories," proposes a mixture-of-memories approach to linear sequence modeling.
- The authors quote David Eagleman, that the enemy of memory is not time but other memories, to stress the importance of competition among memories.
- The work is a collaboration among Shanghai AI Laboratory, Tsinghua University, Fudan University, and other institutions, covering both the theoretical grounding and the experimental validation of the model.
- MoM introduces a new framework that integrates multiple memory mechanisms to overcome the limitations of existing sequence models.
- The model targets better handling of long-range dependencies and improved memory selectivity, with potential contributions to natural language processing and time-series analysis.
Jusen Du1,2*, Weigao Sun1B† , Disen Lan1,3*, Jiaxi Hu4 , Yu Cheng5B 1Shanghai AI Laboratory 2Tsinghua University 3Fudan University 4The Hong Kong University of Science and Technology (Guangzhou) 5The Chinese University of Hong Kong
“The enemy of memory isn’t time; it’s other memories.” – David Eagleman
ABSTRACT
Linear sequence modeling methods, such as linear attention, state space modeling, and linear RNNs, offer significant efficiency improvements by reducing the complexity of training and inference. However, these methods typically compress the entire input sequence into a single fixed-size memory state, which leads to suboptimal performance on recall-intensive tasks. To address this limitation, we introduce a novel architecture called Mixture-of-Memories (MoM). MoM utilizes multiple independent memory states, with a router network directing input tokens to specific memory states. This approach greatly enhances the overall memory capacity while minimizing memory interference. MoM serves as a general framework that can be seamlessly combined with diverse memory update mechanisms across linear models. As a result, MoM performs exceptionally well on recall-intensive tasks, surpassing existing linear sequence modeling techniques. Despite incorporating multiple memory states, the computation of each memory state remains linear in complexity, allowing MoM to retain the linear-complexity advantage during training while maintaining constant complexity during inference. Our experimental results show that MoM outperforms current linear sequence models on downstream language tasks, particularly recall-intensive tasks, and even achieves performance comparable to Transformer models. The code is released at https://github.com/OpenSparseLLMs/MoM and is also released as a part of https://github.com/OpenSparseLLMs/Linear-MoE.
1 INTRODUCTION
[Summary]
- Attention mechanisms have driven progress across modalities such as language, vision, and audio; the Transformer handles long-range dependencies well, but its quadratic time complexity in sequence length limits it on long sequences.
- Linear sequence modeling methods (e.g., linear attention, state space modeling) offer O(n) training and O(1) inference complexity, but their fixed-size memory state causes limited memory capacity and memory interference.
- Inspired by how theta and gamma oscillations in the human hippocampus operate in parallel to reduce memory interference, the proposed Mixture-of-Memories (MoM) architecture processes independent memory states in parallel, expanding memory capacity, eliminating memory interference, and achieving performance on language tasks comparable to Transformers.
Attention mechanisms have made significant contributions to the field of artificial intelligence, advancing various modalities such as language, vision, audio, video, graphs, and even time series (Achiam et al., 2023; Team, 2023). The Transformer (Vaswani, 2017), known for its ability to capture long-range dependencies, has become a foundational architecture in this space. However, traditional Transformers encounter computational challenges due to their quadratic time complexity, O(n^2), with respect to sequence length n, making it difficult to scale to long sequences. To overcome this limitation, several linear sequence modeling methods have been proposed, including linear attention (Katharopoulos et al., 2020; Qin et al., 2023a; Li et al., 2025), state space modeling (Gu & Dao, 2024; Dao & Gu, 2024), and linear RNNs (Peng et al., 2024; Qin et al., 2024d), which offer O(n) training complexity and O(1) inference complexity. These approaches often reduce the input sequence to a fixed-size hidden space, collapsing the information into a single “memory state”. While these methods enhance efficiency, they face two main challenges: limited memory capacity and memory interference. When new information overwrites the single fixed-size memory state, previously stored representations may degrade, which negatively impacts its long-term memory performance on recall-intensive tasks.
We argue that the strong performance of Transformer models on recall-intensive tasks arises from their ability to avoid memory interference by maintaining independent key-value caches for each
* Interns at Shanghai AI Laboratory; B Corresponding Authors: Weigao Sun (sunweigao@outlook.com) and Yu Cheng (chengyu@cse.cuhk.edu.hk); † Project Lead.
token, thus offering virtually unlimited memory capacity. In contrast, linear sequence modeling relies on extreme compression, consolidating all the input information into a single fixed-size memory state (Katharopoulos et al., 2020; Dao & Gu, 2024). This approach results in limited memory capacity and inherently leads to memory interference issues.
Interestingly, the human brain has developed mechanisms that enable large memory capacity while reducing memory interference. Neuroscience studies show that in the hippocampus, theta oscillations and gamma oscillations work together to support a neural coding mechanism for multi-item memory (Buzsáki, 2002; Lisman & Jensen, 2013). Specifically, each theta cycle is subdivided into multiple gamma subcycles, and within each gamma subcycle, a distinct group of neurons is activated following the “E%-max” mechanism (de Almeida et al., 2009). This sequential activation temporally separates different memory items, thus preventing interference.
Inspired by these biological insights, we propose a new architecture called Mixture-of-Memories (MoM), which aims to strike a balance between the explicit token representations in Transformers and the extreme compression found in earlier linear sequence modeling methods. MoM employs multiple independent memory states, with a router network that directs input tokens to specific memory states. The input sequence is divided into a predefined number of subsequences (phase-specific neural assemblies), which are processed in parallel and fed into the corresponding memory projections (dentate microcircuits) to generate key-value pairs. As the linear sequence modeling layer processes each subsequence using an RNN-like update mechanism, it produces multiple memory states that capture different aspects of the input sequence. The final output is computed as a weighted sum of these memories, which we refer to as the mixture-of-memories. This approach expands memory capacity and eliminates memory interference, enabling MoM to significantly outperform existing linear sequence models that rely on a single fixed-size memory state.
Our contributions can be summarized as follows:
- We present MoM, an architecture that incorporates multiple independent memory states, significantly enhancing memory capacity and eliminating memory interference, while retaining the efficiency benefits of linear-time training and constant-memory inference.
- Distinct from existing gating mechanisms, MoM is a new paradigm that reduces memory interference by separating the memory states. The overall design is broadly compatible with diverse linear sequence modeling methods, making it a straightforward and effective approach to boost task performance.
- Through empirical evaluation, we show that MoM outperforms strong linear sequence modeling baselines across a variety of language tasks, particularly on recall-intensive tasks. MoM even achieves performance on par with Transformer models, a feat that current linear sequence modeling methods struggle to match.
2 Preliminary
[Summary]
- In this paper, row vectors are written as bold lower-case letters and matrices as bold upper-case letters.
- For example, Q and K denote matrices.
- The same letter with a subscript denotes a row of the matrix, i.e., the t-th row of Q.
For notations in this work, we use bold lower-case letters for row vectors (e.g., $\boldsymbol{q}_t$), bold upper-case letters for matrices (e.g., $\boldsymbol{Q}$, $\boldsymbol{K}$), and the identical letter with a subscript denotes a row of the matrix, e.g., $\boldsymbol{q}_t$ is the t-th row of $\boldsymbol{Q}$.
LINEAR ATTENTION
[Summary]
- To reduce the Transformer's time complexity, Linear Transformers replace the softmax attention mechanism with a dot product of feature maps, as in Equation (1).
- From a memory perspective, this can be written in a recurrent form (Equation (2)), acting as a linear recurrent layer that compresses the sequence information into a single memory state.
- Recent work has focused on optimizing the memory structure (e.g., gated updates) and expanding memory capacity (e.g., Peng et al., 2024).
To reduce the time complexity of Transformer attention, various optimization techniques have been proposed. Linear Transformers (Katharopoulos et al., 2020) replace the softmax attention mechanism with a dot product of feature maps $\phi(\cdot)$:

\boldsymbol{o}_t = \frac{\sum_{i=1}^{t} \phi(\boldsymbol{q}_t)\,\phi(\boldsymbol{k}_i)^{\top} \boldsymbol{v}_i}{\sum_{i=1}^{t} \phi(\boldsymbol{q}_t)\,\phi(\boldsymbol{k}_i)^{\top}}, \tag{1}

where $\boldsymbol{q}_t$, $\boldsymbol{k}_t$, $\boldsymbol{v}_t$ are the query, key, and value at step $t$. Since the denominator may lead to numerical instability (Qin et al., 2024b) and the feature map can be taken as the identity function, we omit both for simplicity. From the memory perspective, the formulation can also be written in a recurrent format:

\boldsymbol{M}_t = \boldsymbol{M}_{t-1} + \boldsymbol{k}_t^{\top} \boldsymbol{v}_t, \qquad \boldsymbol{o}_t = \boldsymbol{q}_t \boldsymbol{M}_t. \tag{2}

This indicates that linear attention can function as a linear recurrent layer with a matrix-valued hidden state $\boldsymbol{M}$, which we refer to as the memory state, and the output is generated by querying the memory state $\boldsymbol{M}$. This represents the ultimate compression of sequence information, condensing the entire sequence into a single memory state.
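To make the recurrent view in Equation (2) concrete, the following is a minimal, non-optimized PyTorch sketch of the single-memory recurrence, assuming an identity feature map and no normalization; the function and variable names are illustrative and not taken from any released implementation.

```python
import torch

def linear_attention_recurrent(q, k, v):
    """Single-memory linear attention in the recurrent form of Eq. (2):
    M_t = M_{t-1} + k_t^T v_t,  o_t = q_t M_t.
    q, k, v: (seq_len, d) tensors of row vectors."""
    seq_len, d = q.shape
    M = torch.zeros(d, d)                 # the single fixed-size memory state
    outputs = []
    for t in range(seq_len):
        M = M + torch.outer(k[t], v[t])   # write: rank-1 update of the memory
        outputs.append(q[t] @ M)          # read: query the memory state
    return torch.stack(outputs)

# The entire sequence is compressed into one d x d state:
out = linear_attention_recurrent(torch.randn(6, 8), torch.randn(6, 8), torch.randn(6, 8))
```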
Building on the foundational concepts of linear attention and memory perspective, some recent advancements have focused on optimizing memory structure, including gated updates (Yang et al., 2023; Qin et al., 2024e;d) and memory capacity expansion (Peng et al., 2024; Qin et al., 2024d).
3 METHOD
[Summary]
- Linear sequence models compress the entire sequence into a fixed-size memory state; despite measures such as gating mechanisms to reduce information loss, some degradation during compression is inevitable.
- Expanding memory capacity alleviates this, but prior approaches only enlarge a single RNN state and struggle to capture the multifaceted information of a whole sequence.
- Because sequence information is multifaceted, a single expanded memory can hardly reflect several aspects at once, and new information may interfere with existing memory content.
- Rather than filtering inputs with gating mechanisms or overwriting the existing memory, an alternative strategy is needed that preserves diverse information without interference.
3.1 MOTIVATION
[Summary]
Linear sequence models compress the entire sequence into a fixed-size memory state, and gating mechanisms or more precise control over memory modifications can only partially limit the resulting information loss. Expanding memory capacity mitigates the issue, but prior approaches merely enlarge a single RNN state and therefore fail to capture all the multifaceted information in a sequence. Since a single expanded memory cannot reflect diverse aspects of the data simultaneously, and new information may interfere with existing memory, an alternative strategy is needed that preserves diverse information without memory interference.
Linear sequence models compress the entire sequence data into a fixed-size memory state. Despite numerous efforts to minimize information loss, such as introducing gating mechanisms and employing more precise control over memory modifications (Orvieto et al., 2023; De et al., 2024; Beck et al., 2024; Yang et al., 2023; Zhang et al., 2024), some degradation in this compression process is inevitable. Expanding the memory capacity has been shown to mitigate this issue to some extent, with studies indicating that increasing memory capacity can enhance model performance (Qin et al., 2024d; Peng et al., 2024).
However, previous approaches that simply increased the size of the RNN state, essentially expanding a single memory state, struggled to capture the full spectrum of information within an entire sequence. We propose that this difficulty arises because sequence information is often multifaceted, and a single, expanded memory may not be capable of simultaneously capturing multiple aspects of the data. Inputs that introduce new or orthogonal information may interfere with existing memory content when using a shared memory. Rather than discarding these inputs through gating mechanisms or overwriting the existing memory state, it may be more effective to consider alternative strategies that allow for the preservation of diverse information without interference.
3.2 MOM: MIXTURE-OF-MEMORIES
[Summary]
- MoM (Mixture-of-Memories) proposes a new approach to encoding multi-item memory that combines ideas from theta-gamma oscillations and from Mixture-of-Experts (MoE).
- It uses multiple memory states that are selectively updated by different inputs, expanding memory capacity and storing diverse information separately.
- MoM's memory states resemble MoE experts, but instead of separate networks they are individual RNN states embedded in a linear recurrent mechanism, handling different types of information in parallel.
- Each input token selectively activates and updates K memory states, while non-activated memory states remain unchanged to prevent interference.
- In addition, a continuously activated shared memory is introduced, and the design covers both the basic memory update mechanism and more complex gated updates.
To address the challenge outlined above, we propose a novel approach for encoding multi-item memory, inspired by theta-gamma oscillations (Lisman & Jensen, 2013) and by concepts from Mixture-of-Experts (MoE) (Shazeer et al., 2017), where different experts handle specific tokens. In this approach, we leverage multiple memory states, each of which is selectively updated by different inputs. This increases the memory capacity and enables the model to retain diverse pieces of information by storing various types of inputs in separate memory states.
In our framework, the memory states function similarly to the experts in MoE. However, instead of relying on completely separate networks, these modules are individual RNN states embedded within a linear recurrent mechanism. This design allows for the isolation of memory updates while concurrently managing distinct types of information. It is important to note that MoM fundamentally differs from traditional MoE, as we will discuss in Appendix B. Figure 1 provides an overview of the MoM architecture. Below, we introduce the structure of the MoM layer and explain how this multi-

Figure 1: MoM Architecture. Each input token selectively activates and updates K memory states, leaving non-activated memory states unchanged to avoid interference from current input. Additionally, we introduce a continuously activated shared memory. This figure presents the basic memory update mechanism; other mechanisms involving gating or more complex updates follow a similar approach.
memory architecture is implemented in the context of linear sequence modeling.
3.2.1 ROUTER
[Summary]
- The router assigns inputs to different memory states using a top-k scheme based on importance scores.
- For each token, a linear layer produces scores, softmax is applied, and the top-k scores are selected and normalized.
- In Equations (3) and (4), the top-k importance scores and their normalized values are used to assign the input token to memories.
- A learnable weight matrix parameterizes the computation of the importance scores.
We use a router to assign inputs to different memory states. Utilizing the top-k concept, each token is routed to the top-k memories based on its importance scores. Specifically, we use a simple linear layer to generate these scores for each input token. After applying a softmax function, we select the top-k scores and normalize them.
\mathbf{scores}_t = \text{top-}k\left(\text{softmax}(\boldsymbol{x}_t \boldsymbol{W}_g)\right) \in \mathbb{R}^{k}, \tag{3}
g_t = \frac{\mathbf{scores}_t}{\sum \mathbf{scores}_t} \in \mathbb{R}^k, \tag{4}
where $k$ is the number of activated memories, $\boldsymbol{W}_g$ is a learnable weight matrix, and $g_t$ contains the normalized importance scores of the input $\boldsymbol{x}_t$.
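As a concrete illustration of Equations (3) and (4), the sketch below implements top-k routing with renormalized scores in PyTorch. The module name, shapes, and the bias-free gate are assumptions made for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn

class MemoryRouter(nn.Module):
    """Top-k routing over memories, following Eqs. (3)-(4)."""

    def __init__(self, d_model: int, num_memories: int, top_k: int):
        super().__init__()
        self.gate = nn.Linear(d_model, num_memories, bias=False)  # simple linear scorer
        self.top_k = top_k

    def forward(self, x):                                 # x: (batch, seq_len, d_model)
        probs = torch.softmax(self.gate(x), dim=-1)       # softmax over all memories
        scores, indices = probs.topk(self.top_k, dim=-1)  # Eq. (3): keep the top-k scores
        g = scores / scores.sum(dim=-1, keepdim=True)     # Eq. (4): renormalize to sum to 1
        return g, indices                                 # g: (batch, seq_len, top_k)

router = MemoryRouter(d_model=1024, num_memories=4, top_k=2)
g, idx = router(torch.randn(2, 16, 1024))
```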
3.2.2 Linear Recurrent Memory Module
[Summary]
- After the router network, the input is passed to the top-k linear recurrent memory modules; only these memories are activated while the rest remain inactive.
- Each activated memory projects the input into its own keys and values and uses them to update its memory state; the update rule can take various forms, including forget gates.
- The updated memory states are mixed via a weighted sum based on the importance scores, and the query vector is applied to the mixed memory to produce the output.
- The output then passes through an activation function, normalization, and a linear transformation, and a shared memory mechanism strengthens long-term dependencies.
- Memory update rules of various methods (e.g., Linear Attn, RetNet) are summarized in Table 1; the model distributes tokens to different memory modules depending on the input.
After the router network, the input $\boldsymbol{x}_t$ is directed to the top-k linear recurrent memory modules, meaning that the top-k memories are activated while the others remain inactive.
Each Memory. For each activated memory, indexed by m, we perform the following operation:
- Key and Value Projections: We project the input $\boldsymbol{x}_t$ to $\boldsymbol{k}_t^m$ and $\boldsymbol{v}_t^m$ using $\boldsymbol{W}_k^m$ and $\boldsymbol{W}_v^m$:
\boldsymbol{k}_{t}^{m} = \boldsymbol{x}_{t} \boldsymbol{W}_{k}^{m}, \boldsymbol{v}_{t}^{m} = \boldsymbol{x}_{t} \boldsymbol{W}_{v}^{m} \in \mathbb{R}^{d}, \tag{5}
where $\boldsymbol{W}_k^m$, $\boldsymbol{W}_v^m$ are learnable key and value projection weights of the m-th memory module.
- Memory Update: We update the activated memory state using $\boldsymbol{k}_t^m$, $\boldsymbol{v}_t^m$:

\boldsymbol{M}_t^m = \boldsymbol{M}_{t-1}^m + (\boldsymbol{k}_t^m)^{\top} \boldsymbol{v}_t^m \in \mathbb{R}^{d \times d}, \tag{6}
The equation above represents the simplest form of memory update for clarity. Our approach is flexible and does not rely on a specific memory update mechanism. To enhance performance, we can incorporate mechanisms such as forget gates (Sun et al., 2023).
More generally, our method can be adapted to incorporate various memory update methods proposed in previous work. Detailed descriptions of these methods are provided in Table 1.
Memory Mixing. After updating the activated memory states, we perform a weighted sum of these memory states using the importance scores obtained from Equation (4).
\tilde{M}_t = \sum_{m} g_t^{(m)} M_t^m \in \mathbb{R}^{d \times d}, \tag{7}
where $M_t^m$ is one activated memory and $g_t^{(m)}$ is the importance score of memory $m$.
We then obtain the output of the MoM by applying the query vector $\boldsymbol{q}_t = \boldsymbol{x}_t \boldsymbol{W}_q$ to the mixed memory $\tilde{M}_t$:

\boldsymbol{o}_t = \boldsymbol{q}_t \tilde{M}_t \in \mathbb{R}^{d}. \tag{8}
Finally, the output of the MoM layer is computed by applying an activation function, normalization, and a linear transformation.
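Combining Equations (5)-(8), the sketch below runs one naive, token-by-token MoM step with the simplest additive update. It is a schematic illustration only: the way the shared memory is combined with the routed memories (simple addition here), as well as all names and shapes, are assumptions, and the efficient chunked execution is deferred to Section 3.3.

```python
import torch

def mom_step(x_t, M, M_shared, W_q, W_k, W_v, W_ks, W_vs, g_t, idx_t):
    """One token step of MoM with the simplest additive update (Eq. 6).
    x_t: (d,) token; M: (num_mem, d, d) memory states (updated in place);
    M_shared: (d, d) shared memory; W_q, W_ks, W_vs: (d, d);
    W_k, W_v: (num_mem, d, d) per-memory projections;
    g_t: (top_k,) normalized scores; idx_t: (top_k,) activated memory indices."""
    q_t = x_t @ W_q
    mixed = torch.zeros_like(M_shared)
    for g, m in zip(g_t, idx_t.tolist()):
        k_m, v_m = x_t @ W_k[m], x_t @ W_v[m]     # Eq. (5): memory-specific key/value
        M[m] = M[m] + torch.outer(k_m, v_m)       # Eq. (6): only activated memories change
        mixed = mixed + g * M[m]                  # Eq. (7): weighted mixture of memories
    # Shared memory is always activated (assumed combined by simple addition here).
    M_shared += torch.outer(x_t @ W_ks, x_t @ W_vs)
    o_t = q_t @ (mixed + M_shared)                # Eq. (8): query the mixed memory
    return o_t
```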
Throughout the recurrent process, only a subset of memory states is activated and updated at each time step, while memory states that are not routed remain inactive and unchanged. When the input passes through the key-value
Table 1: Memory Update Rules. We demonstrate that several linear sequence models can be viewed as recurrent models in terms of memory updates, with decay terms that are data-dependent scalars, data-dependent vectors, or data-independent constants, depending on the method.
| Method | Memory Update Rule |
|---|---|
| Linear Attn | |
| RetNet | |
| GLA | |
| DeltaNet | |
| G-DeltaNet | |
| TTT | |
| Titans | |
| Mamba2 | |
| HGRN2 | |
| RWKV6 | |
| RWKV7 |
projection layer, it generates multiple sets of keys and values that are fed into different memory modules. This design enables the model to maintain multiple memory states, each preserving distinct pieces of information. By aggregating the activated memories into a comprehensive mixed
memory by weighted summation, the query can effectively retrieve information from this mixed memory and generate the attention output, which is then passed to the subsequent layers.
Shared Memory. To enhance our model’s ability to capture long-term dependencies, we introduce a shared memory mechanism. This shared memory has access to the entire sequence information, allowing it to effectively store and retrieve long-term information. By integrating shared memory into our model, we ensure that it can leverage the complete historical context, resulting in significant improvements in performance and robustness.

Figure 2: Hardware-efficient Implementation of MoM. Tokens sharing the same color are routed to the same memory. ① Tokens are first split into groups according to memory routing results, ② then concatenated into a varlen input sequence, ③ processed by the Triton kernel, ④ the outputs are returned, ⑤ split back into their respective memories, and ⑥ finally restored to the original sequence order. For clarity, the illustration shows the top-1 routing case, and the qkv projection is omitted.
3.3 HARDWARE-EFFICIENT IMPLEMENTATION
[Summary]
MoM's hardware-efficient implementation performs memory mixing before the query multiplication, which allows the Triton-based operators used by linear sequence models to be reused. Sequence tokens are first reordered to match the memory layout according to the routing results, then concatenated for varlen computation, and the results are aggregated by weighted summation; in this way MoM's computation reduces to varlen operations and runs efficiently. The tokens assigned to each memory are ordered by their routing, index sets and cumulative boundaries are computed to build the flattened sequence, keys and values use memory-specific projection matrices, a memory-specific kernel is applied to each segment, and the outputs are mapped back to the original sequence and combined by weighted summation to produce the final output.
In the implementation of MoM, mixing memories before query multiplication is equivalent to multiplying each memory by the query and then mixing the results, allowing us to reuse efficient Triton-based operators from prior linear sequence models. We first reorder the sequence tokens according to the routing results so that they follow the memory layout. The reordered tokens are then concatenated with varlen for operator computation, after which the results are aggregated via weighted summation. In this way, MoM’s computation can be effectively reduced to varlen operations, enabling efficient execution. We elaborate on this process below.
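Before detailing the reordering, note that the equivalence claimed above follows from linearity: $\boldsymbol{q}_t \big(\sum_m g_t^{(m)} M_t^m\big) = \sum_m g_t^{(m)} \big(\boldsymbol{q}_t M_t^m\big)$. A quick numerical check (purely illustrative):

```python
import torch

d, num_mem = 8, 4
q = torch.randn(d)
M = torch.randn(num_mem, d, d)                  # per-memory states
g = torch.softmax(torch.randn(num_mem), dim=0)  # mixing weights

mix_then_query = q @ torch.einsum('m,mij->ij', g, M)                           # q (sum_m g_m M_m)
query_then_mix = torch.einsum('m,mj->j', g, torch.einsum('i,mij->mj', q, M))   # sum_m g_m (q M_m)
assert torch.allclose(mix_then_query, query_then_mix, atol=1e-5)
```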
Given input tokens $\boldsymbol{x}_{b,t}$ for batch index $b$ and time step $t$, each token is routed to one or more memories $m \in \{1, \ldots, M\}$ with routing weights $g_{b,t}^{(m)} \geq 0$ satisfying $\sum_{m} g_{b,t}^{(m)} = 1$ over the activated memories.

For each $(b, m)$, define the ordered index set

I_{b,m} = \{\, t_{b,m}(1) < t_{b,m}(2) < \cdots < t_{b,m}(n_{b,m}) \,\},

where $t_{b,m}(j)$ is the original sequence index of the $j$-th token assigned to memory $m$, and $n_{b,m} = |I_{b,m}|$. We index buckets lexicographically by $p = (b-1)M + m$ and define cumulative boundaries

c_0 = 0, \qquad c_p = c_{p-1} + n_{b,m}.

The flattened sequence is obtained by concatenating the tokens of all buckets in this order, with varlen representation given by the offsets $(c_0, c_1, \ldots, c_{BM})$, where $B$ is the batch size.

For each bucket $p = (b-1)M + m$, queries share a projection matrix $\boldsymbol{W}_q$, while keys and values use memory-specific projections $\boldsymbol{W}_k^{m}$, $\boldsymbol{W}_v^{m}$. A memory-specific kernel, parameterized by the $m$-th memory module, is applied independently to each segment.

Mapping outputs back to the original sequence, the $j$-th token in $I_{b,m}$ has per-memory output $\boldsymbol{o}_{b,\,t_{b,m}(j)}^{m}$. Finally, token-level representations are reconstructed by weighted summation:

\boldsymbol{o}_{b,t} = \sum_{m} g_{b,t}^{(m)} \, \boldsymbol{o}_{b,t}^{m}.
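To make the bookkeeping concrete, the sketch below shows only the host-side grouping for the top-1 routing case of Figure 2: tokens are grouped by routed memory, cumulative varlen offsets are built, each segment is handed to a stand-in for the memory-specific kernel, and the outputs are scattered back to the original order. The Triton kernel itself and the qkv projections are omitted; all names are illustrative assumptions.

```python
import torch

def varlen_group_and_restore(x, memory_ids, num_memories, run_kernel):
    """Host-side bookkeeping for top-1 routing (steps 1-6 in Fig. 2).
    x: (seq_len, d) tokens of one sequence; memory_ids: (seq_len,) routed memory per token;
    run_kernel(tokens, m): stand-in for the memory-specific varlen kernel."""
    # (1)-(2): group tokens by memory and build cumulative segment boundaries
    order = torch.argsort(memory_ids, stable=True)     # token indices sorted by memory id
    flat = x[order]                                    # flattened varlen sequence
    counts = torch.bincount(memory_ids, minlength=num_memories)
    offsets = torch.cat([torch.zeros(1, dtype=torch.long), counts.cumsum(0)])

    # (3)-(4): process each segment with its memory-specific kernel
    out_flat = torch.empty_like(flat)
    for m in range(num_memories):
        s, e = offsets[m].item(), offsets[m + 1].item()
        if e > s:
            out_flat[s:e] = run_kernel(flat[s:e], m)

    # (5)-(6): restore outputs to the original token order
    out = torch.empty_like(x)
    out[order] = out_flat
    return out

y = varlen_group_and_restore(torch.randn(10, 4), torch.randint(0, 3, (10,)),
                             num_memories=3, run_kernel=lambda t, m: t)
```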
4 EXPERIMENTS
[Summary]
- The experiments evaluate MoM with Gated DeltaNet as the memory update mechanism and compare it against existing models such as RetNet, GLA, and Transformer++.
- All baseline models are trained from scratch on the same number of tokens, using a 100B-token sample of the SlimPajama dataset tokenized with the Mistral tokenizer.
- MoM's main goal is to expand the memory capacity of linear sequence models through sparse activation; applying sparse activation only to the key and value projections delivers the performance gains.
- Model sizes are reported as "380M" (24 layers, hidden size 1024) and "1.3B" (24 layers, hidden size 2048); a detailed discussion of fairness is given in Appendix G.
4.1 EXPERIMENTAL SETUPS
[Summary]
- MoM uses Gated DeltaNet as the memory update mechanism, with four memory states of which two are activated at each time step.
- Baselines include RetNet, GLA, and Transformer++, all trained from scratch on the same number of tokens to ensure fairness.
- Training uses 100B tokens of SlimPajama; models with 380M and 1.3B parameters are trained, with the 380M models trained on 15B tokens and a batch size of 0.5M tokens.
- MoM applies sparse activation only to the key and value projections to expand memory capacity, which justifies the small increase in parameters given the performance gains.
Models. In our experiments, we employ the Gated DeltaNet (Yang et al., 2024) as the memory update mechanism in MoM. The model is configured with four memory states, two of which are activated at each time step, along with a shared memory.
Baselines. We evaluate MoM against several linear recurrent models and Transformers, including RetNet (Sun et al., 2023), GLA (Yang et al., 2023), Gated DeltaNet (Yang et al., 2024), and Transformer++ (Touvron et al., 2023), which incorporates Rotary Position Embeddings (Su et al., 2024) and GLU (Shazeer, 2020) into the Transformer architecture. To ensure a fair comparison, we train all baseline models from scratch using the exact same number of tokens.
Training. We follow the training procedure described by Yang et al. (2023), utilizing the SlimPajama dataset (Soboleva et al., 2023) sampled with 100B tokens and tokenized using the Mistral tokenizer (Jiang et al., 2023). We train models from scratch with parameter sizes of 380M and 1.3B, respectively. For the 380M models, we train on 15B tokens with a batch size of 0.5M tokens. More detailed training configuration is provided in Appendix C. We utilized publicly available pretrained weights from Zhang et al. (2024) with exactly the same configuration1.
Parameter Explanation. We report model sizes using the common shorthand, where “380M” denotes a configuration with 24 layers and hidden size 1024, and “1.3B” denotes 24 layers with hidden size 2048. The main goal of MoM is to expand the memory capacity of linear sequence models through sparse activation. To this end, we apply sparse activation only to the key and value projections, which results in a small increase in activated parameters that is well justified by the performance gains. A detailed discussion on fairness is provided in Appendix G.
4.2 MAIN RESULTS
[Summary]
- Because of their limited memory capacity, linear sequence models show a large gap to Transformer models on recall-intensive tasks, making these tasks a suitable benchmark for evaluating how well a model handles contextual information.
- Six recall-intensive tasks (FDA, SWDE, SQuAD, NQ, TriviaQA, Drop) are used to test context-based retrieval and comprehension.
- The proposed model, benefiting from expanded memory capacity and the memory mixing mechanism, improves over other linear models and substantially narrows the gap to Transformer models.
- This shows that better capture and use of long-range dependencies translates into stronger performance on tasks where contextual information matters.
4.2.1 RECALL-INTENSIVE TASKS
[Summary]
- Linear sequence models lag behind Transformer models on recall-intensive tasks due to limited memory capacity, which makes these tasks an important criterion for evaluating contextual processing ability.
- The models are evaluated thoroughly on six recall-intensive tasks: FDA, SWDE, SQuAD, NQ, TriviaQA, and Drop.
- The proposed approach, with expanded memory capacity and memory mixing, outperforms other linear models and effectively narrows the gap to Transformer models.
- The gains confirm improved capture and utilization of long-range dependencies on tasks that depend on contextual information.
Linear sequence models, due to their limited memory capacity, often exhibit a significant performance gap compared to Transformer models, especially in recall-intensive tasks where extensive context is crucial. These tasks highlight notable performance differences among various linear models, making them a more accurate benchmark for evaluating a linear model’s capabilities in handling contextual information.
To thoroughly assess our model’s proficiency in such scenarios, we test six recall-intensive tasks following Arora et al. (2024): FDA (Arora et al., 2023), SWDE (Arora et al., 2023; Lockard et al., 2019), SQuAD (Rajpurkar et al., 2018), NQ (Kwiatkowski et al., 2019), TriviaQA (Joshi et al., 2017) and Drop (Dua et al., 2019). These tasks are designed to challenge a model’s ability to perform context-based retrieval and comprehension.
As shown in Table 2, our proposed approach, benefiting from increased memory capacity and memory mixing mechanism, achieves significant improvements over other linear sequence models. Specifically, our model effectively narrows the performance gap with Transformer models. This improvement underscores the advantage of our method in capturing and utilizing long-range dependencies, thereby enhancing performance on tasks that require extensive contextual understanding.
4.2.2 LONG CONTEXT TASKS
[Summary]
- At the 1.3B scale, MoM approaches Transformer performance and outperforms other linear models on recall-intensive tasks.
- On the LongBench benchmark, MoM achieves the best overall average of 15.64 across tasks including summarization and code completion.
- In the comparison of memory update mechanisms, using separate memory segments in MoM yields higher performance than expanding a single memory.
- On tasks such as ARC-e and HellaSwag, MoM improves accuracy by one to two points over the expanded GLA model.
1Models marked with † use open-source pretrained weights with identical training configurations.
Table 2: Results on Recall-Intensive Tasks. All inputs are truncated to a maximum length of 2K tokens. MoM significantly outperforms all other linear models across both model sizes. In the 1.3B model, MoM even achieves performance very close to that of Transformer models.
Assessing performance on long-context tasks is crucial for linear models, as it reflects their ability to handle long-range dependencies effectively. We evaluated our model’s comprehension of long contexts using the Long-Bench benchmark (Bai et al., 2024; Contributors, 2023). In Table 3, we present the average results across various categories, including summarization, few-shot learning, synthetic tasks, and code completion, along with the overall mean across all tasks. The complete detailed results are provided in Appendix I.
Table 3: LongBench Benchmark Results. Note: Sum = Summarization, FS = Few-shot, Syn = Synthetic.
| Model | Sum | FS | Syn | Code | Avg. |
|---|---|---|---|---|---|
| RetNet † | 6.30 | 15.76 | 2.64 | 40.52 | 13.61 |
| HGRN2 † | 6.51 | 15.50 | 2.61 | 40.11 | 13.02 |
| | 7.75 | 20.29 | 1.92 | 42.83 | 14.61 |
| Gated DeltaNet | 7.14 | 18.00 | 2.10 | 41.52 | 13.98 |
| MoM | 6.89 | 21.26 | 2.63 | 47.79 | 15.64 |
Table 4: Comparison of Mixture of Memories and Single Memory Expanded. We constructed MoM models using different memory update mechanisms. Separate memory segments yielded better performance compared to simply increasing the memory capacity of a single memory.
| Model | Params | ARC-e acc↑ | ARC-c accn↑ | Hella. accn↑ | Lamb. acc↑ | PIQA acc↑ | Wino. acc↑ | Avg. |
|---|---|---|---|---|---|---|---|---|
| GLA expanded | 425M | 42.34 | 22.95 | 34.56 | 20.45 | 63.00 | 50.12 | 38.90 |
| GLA MoM | 395M | 42.85 | 24.15 | 36.60 | 23.23 | 63.22 | 49.88 | 39.99 |
| Gated DeltaNet expanded | 550M | 43.60 | 24.66 | 37.80 | 26.90 | 64.47 | 50.51 | 41.32 |
| Gated DeltaNet MoM | 444M | 44.65 | 24.74 | 36.54 | 27.93 | 66.16 | 51.78 | 41.97 |

| Model | Params | FDA | SWDE | SQUAD | NQ | TriviaQA | Drop | Avg. |
|---|---|---|---|---|---|---|---|---|
| GLA expanded | 425M | 15.08 | 20.15 | 28.28 | 13.30 | 41.65 | 18.74 | 22.87 |
| GLA MoM | 395M | 9.90 | 21.65 | 29.36 | 14.16 | 45.20 | 20.89 | 23.53 |
| Gated DeltaNet expanded | 550M | 18.26 | 24.27 | 30.03 | 17.74 | 48.34 | 19.26 | 26.32 |
| Gated DeltaNet MoM | 444M | 22.98 | 29.90 | 29.69 | 16.60 | 48.82 | 20.99 | 28.16 |

4.2.3 MIXED MEMORY VS. SINGLE MEMORY
To validate the effectiveness of our mixed memory mechanism, we compare our MoM model with mixed memories to a baseline model that uses an expanded single memory with the same activated memory capacity. We adopt the same memory update method as existing linear models and extend it within our MoM framework. For comparison, we employed the commonly used method of expanding the single memory, enlarging the dimension of the memory state to match the total size of all activated memories in the MoM model. We evaluate their performance on common-sense reasoning tasks and recall-intensive tasks in Table 4.


Figure 3: Inference Efficiency of MoM. We demonstrate the inference time and GPU memory consumption required to generate 1K tokens at specific sequence lengths.

Figure 4: Length Extrapolation. We extrapolated models trained on 2K sequences to a length of 32K for perplexity (ppl) evaluation.
The experimental results demonstrate that using multiple mixed memories leads to a greater improvement than simply expanding the capacity of a single memory, while using fewer parameters. This confirms that mixed memory can effectively reduce interference from different inputs. Assigning inputs specifically to different memories, combined with the use of a forget gate, proves to be a more effective approach for reducing interference than relying solely on a forget gate.
4.2.4 EFFICIENCY
We compare the inference speed and memory usage of MoM and Transformer++ with flash attention in Fig 3. Our analysis demonstrates that MoM exhibits linear complexity, showcasing significant advantages over the Transformer model when handling long sequences. Specifically, MoM’s efficient memory update mechanisms allow it to process longer inputs with reduced computational overhead, positioning it as a more scalable solution for large-scale natural language processing tasks.
4.2.5 LENGTH EXTRAPOLATION
We pretrained the models on the SlimPajama dataset with a 2K context length and conducted extrapolation experiments on various lengths using the FineWeb (Penedo et al., 2024) dataset. We extended the length to 32K to calculate perplexity (ppl). As shown in Fig. 4, the Transformer model experienced a significant increase in ppl due to its poor extrapolation capability. Among the linear models, MoM achieved the best results.
4.2.6 Memory Analysis
Memory Load Balance Analysis. To evaluate whether each memory segment in MoM is effectively balanced during inference on downstream tasks, we analyzed the number of tokens routed to each memory in every layer using around 300k tokens from the ARC-easy benchmark. We visualized the results with auxiliary loss (following the formulation introduced in the Switch Transformer (Fedus et al., 2022)) in Fig. 5 as heatmaps, and we also visualized the results without auxiliary loss in Fig. 9. Due to the adoption of auxiliary loss, the memory segments in each layer are almost uniformly routed and activated.
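For reference, a minimal sketch of the Switch-Transformer-style load-balancing auxiliary loss (Fedus et al., 2022) is given below, written for top-1 dispatch; the function name and shapes are assumptions, and in training the loss would be scaled by the auxiliary-loss weight listed in Table 6.

```python
import torch

def load_balance_aux_loss(router_probs, selected, num_memories):
    """Switch-Transformer-style balancing loss: N * sum_i f_i * P_i,
    where f_i is the fraction of tokens dispatched to memory i and
    P_i is the mean router probability assigned to memory i."""
    f = torch.bincount(selected, minlength=num_memories).float() / selected.numel()
    P = router_probs.mean(dim=0)
    return num_memories * torch.sum(f * P)

probs = torch.softmax(torch.randn(32, 4), dim=-1)   # router outputs for 32 tokens
loss = load_balance_aux_loss(probs, probs.argmax(dim=-1), num_memories=4)
```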
Functions of Different Memories. We analyzed the memory routing within the model and identified the categories of tokens processed by each of the four memories. By examining an intermediate layer, we observed that, unlike traditional MoE in FFN layers where experts often lack clear specialization and focus on syntactic punctuation (Jiang et al., 2023), MoM displays specialization among its memories. This finding suggests potential for exploring an MoE architecture where each memory serves a specific role. These distinctions among different memories are presented in Table 5.
4.2.7 MOM SCALING UP & ABLATION STUDY
We examine the effect of scaling both the number of memory states and the number of top-k activations in MoM. To ensure comparability, Fig. 6 reports results with a fixed activation ratio of 0.5. Increasing the number of memories from 1 to 8 consistently improves performance across both recall-intensive and commonsense benchmarks. These results indicate that enlarging the memory
Table 5: Functions of Different Memories. By analyzing the types of tokens routed to the model’s intermediate layer, we observed a degree of specialization among the different memories.
Figure 5: Memory Load Balance Analysis. Token Routing Distribution Across Layers and Memories with Aux Loss.
pool effectively mitigates interference and enhances capacity. More comprehensive results covering other activation ratios and activation settings are provided in Appendix H.
We further study the influence of auxiliary loss and shared memory in MoM, using a 380M-parameter model trained on 15B tokens. As shown in Table 6, auxiliary loss improves stability and performance when applied with a suitable weight. In addition, shared memory, which provides global information, consistently benefits performance. These results highlight the complementary roles of auxiliary loss and shared memory in stabilizing and enhancing MoM.
5 CONCLUSION
In this paper, we propose MoM, a novel architecture that enhances memory capacity and eliminates memory interference. By leveraging multiple independent memory states, MoM significantly improves performance on recall-intensive tasks while maintaining the efficiency advantages of linear models. Instead of simply discarding tokens as done in gating mechanisms, our memory separation paradigm provides a more effective way to preserve sequence information. Our experimental results demonstrate that MoM outperforms existing linear sequence modeling methods, particularly on tasks requiring strong recall, and achieves performance comparable to Transformer models. This makes MoM a promising approach for applications that need both strong efficiency and strong recall performance, paving the way for efficient sequence modeling.
6 ETHICS STATEMENT
This work does not involve human subjects, sensitive data, or high-risk applications. All experiments are conducted on publicly available datasets. We encourage responsible and ethical use of the proposed methods in line with community standards.
REFERENCES
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

Figure 6: Scaling performance with increasing number of memories with a fixed activation ratio of 0.5.
Table 6: Ablation on auxiliary loss and shared memory, showing average accuracy across recall-intensive and commonsense tasks.
| | Recall ↑ | Common ↑ |
|---|---|---|
| Aux Loss Scale | | |
| 1e-2 | 27.59 | 42.10 |
| 5e-3 | 26.55 | 41.71 |
| 1e-3 | 28.16 | 41.97 |
| 0 | 27.23 | 41.58 |
| Shared Memory | ||
| w/ shared memory | 28.16 | 41.97 |
| w/o shared memory | 26.06 | 40.38 |
Simran Arora, Brandon Yang, Sabri Eyuboglu, Avanika Narayan, Andrew Hojel, Immanuel Trummer, and Christopher Ré. Language models enable simple systems for generating structured views of heterogeneous data lakes, 2023.
Simran Arora, Sabri Eyuboglu, Michael Zhang, Aman Timalsina, Silas Alberti, Dylan Zinsley, James Zou, Atri Rudra, and Christopher Ré. Simple linear attention language models balance the recall-throughput tradeoff. arXiv preprint arXiv:2402.18668, 2024.
Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. Longbench: A bilingual, multitask benchmark for long context understanding, 2024. URL https://arxiv.org/abs/ 2308.14508.
Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. xLSTM: Extended long short-term memory. arXiv preprint arXiv:2405.04517, 2024.
Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. Titans: Learning to memorize at test time. arXiv preprint arXiv:2501.00663, 2024.
Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language. In Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020.
György Buzsáki. Theta oscillations in the hippocampus. Neuron, 33(3):325–340, 2002.
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. ArXiv, abs/1803.05457, 2018.
OpenCompass Contributors. Opencompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass, 2023.
Damai Dai, Chengqi Deng, Chenggang Zhao, RX Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y Wu, et al. Deepseekmoe: Towards ultimate expert specialization in mixtureof-experts language models. arXiv preprint arXiv:2401.06066, 2024.
Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality. arXiv preprint arXiv:2405.21060, 2024.
Soham De, Samuel L Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, et al. Griffin: Mixing gated linear recurrences with local attention for efficient language models. arXiv preprint arXiv:2402.19427, 2024.
Licurgo de Almeida, Marco Idiart, and John E Lisman. A second function of gamma frequency oscillations: an e%-max winner-take-all mechanism selects which cells fire. Journal of Neuroscience, 29(23):7497–7503, 2009.
Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs. arXiv preprint arXiv:1903.00161, 2019.
William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022.
Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation, 07 2024. URL https://zenodo.org/records/ 12608602.
Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces, 2024. URL https://arxiv.org/abs/2312.00752.
Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces, 2022. URL https://arxiv.org/abs/2111.00396.
Ankit Gupta, Albert Gu, and Jonathan Berant. Diagonal state spaces are as effective as structured state spaces. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems, volume 35, pp. 22982–22994. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/9156b0f6dfa9bbd18c79cc459ef5d61c-Paper-Conference.pdf.
Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551, 2017.
Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pp. 5156–5165. PMLR, 2020.
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019.
Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668, 2020.
Aonian Li, Bangwei Gong, Bo Yang, Boji Shan, Chang Liu, Cheng Zhu, Chunhao Zhang, Congchao Guo, Da Chen, Dong Li, et al. Minimax-01: Scaling foundation models with lightning attention. arXiv preprint arXiv:2501.08313, 2025.
John E Lisman and Ole Jensen. The theta-gamma neural code. Neuron, 77(6):1002–1016, 2013.
Colin Lockard, Prashant Shiralkar, and Xin Luna Dong. OpenCeres: When open information extraction meets the semi-structured web. In Jill Burstein, Christy Doran, and Thamar Solorio (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 3047–3056, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1309. URL https://aclanthology.org/N19-1309.
Ilya Loshchilov, Frank Hutter, et al. Fixing weight decay regularization in adam. arXiv preprint arXiv:1711.05101, 5, 2017.
Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models, 2016.
Antonio Orvieto, Samuel L Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu, and Soham De. Resurrecting recurrent neural networks for long sequences. In International Conference on Machine Learning, pp. 26670–26698. PMLR, 2023.
Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. arXiv preprint arXiv:1606.06031, 2016.
Guilherme Penedo, Hynek Kydlíček, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf. The fineweb datasets: Decanting the web for the finest text data at scale. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL https://openreview.net/forum?id=n6SCkn2QaG.
Bo Peng, Daniel Goldstein, Quentin Anthony, Alon Albalak, Eric Alcaide, Stella Biderman, Eugene Cheah, Xingjian Du, Teddy Ferdinan, Haowen Hou, et al. Eagle and finch: Rwkv with matrixvalued states and dynamic recurrence. arXiv preprint arXiv:2404.05892, 2024.
Zhen Qin, Dong Li, Weigao Sun, Weixuan Sun, Xuyang Shen, Xiaodong Han, Yunshen Wei, Baohong Lv, Xiao Luo, Yu Qiao, et al. Transnormerllm: A faster and better large language model with improved transnormer. arXiv preprint arXiv:2307.14995, 2023a.
Zhen Qin, Dong Li, Weigao Sun, Weixuan Sun, Xuyang Shen, Xiaodong Han, Yunshen Wei, Baohong Lv, Fei Yuan, Xiao Luo, et al. Scaling transnormer to 175 billion parameters. arXiv preprint arXiv:2307.14995, 2023b.
Zhen Qin, Xuyang Shen, Dong Li, Weigao Sun, Stan Birchfield, Richard Hartley, and Yiran Zhong. Unlocking the secrets of linear complexity sequence model from a unified perspective. arXiv preprint arXiv:2405.17383, 2024a.
Zhen Qin, Weigao Sun, Dong Li, Xuyang Shen, Weixuan Sun, and Yiran Zhong. Lightning attention-2: A free lunch for handling unlimited sequence lengths in large language models. arXiv preprint arXiv:2401.04658, 2024b.
Zhen Qin, Weigao Sun, Dong Li, Xuyang Shen, Weixuan Sun, and Yiran Zhong. Various lengths, constant speed: Efficient language modeling with lightning attention. arXiv preprint arXiv:2405.17381, 2024c.
Zhen Qin, Songlin Yang, Weixuan Sun, Xuyang Shen, Dong Li, Weigao Sun, and Yiran Zhong. Hgrn2: Gated linear rnns with state expansion. arXiv preprint arXiv:2404.07904, 2024d.
Zhen Qin, Songlin Yang, and Yiran Zhong. Hierarchically gated recurrent neural network for sequence modeling. Advances in Neural Information Processing Systems, 36, 2024e.
Xiaoye Qu, Daize Dong, Xuyang Hu, Tong Zhu, Weigao Sun, and Yu Cheng. Llama-moe v2: Exploring sparsity of llama from perspective of mixture-of-experts with post-training. arXiv preprint arXiv:2411.15708, 2024.
Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–16. IEEE, 2020.
Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don’t know: Unanswerable questions for squad, 2018.
Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. arXiv preprint arXiv:1907.10641, 2019.
Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
Xuyang Shen, Dong Li, Ruitao Leng, Zhen Qin, Weigao Sun, and Yiran Zhong. Scaling laws for linear complexity language models. arXiv preprint arXiv:2406.16690, 2024.
Daria Soboleva, Faisal Al-Khateeb, Robert Myers, Jacob R Steeves, Joel Hestness, and Nolan Dey. SlimPajama: A 627B token cleaned and deduplicated version of RedPajama, 2023. URL https: //huggingface.co/datasets/cerebras/SlimPajama-627B.
Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
Weigao Sun, Zhen Qin, Dong Li, Xuyang Shen, Yu Qiao, and Yiran Zhong. Linear attention sequence parallelism. arXiv preprint arXiv:2404.02882, 2024a.
Weigao Sun, Zhen Qin, Weixuan Sun, Shidi Li, Dong Li, Xuyang Shen, Yu Qiao, and Yiran Zhong. Co2: Efficient distributed training with full communication-computation overlap. arXiv preprint arXiv:2401.16265, 2024b.
Weigao Sun, Disen Lan, Yiran Zhong, Xiaoye Qu, and Yu Cheng. Lasp-2: Rethinking sequence parallelism for linear attention and its hybrid. arXiv preprint arXiv:2502.07563, 2025.
Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, et al. Learning to (learn at test time): Rnns with expressive hidden states. arXiv preprint arXiv:2407.04620, 2024c.
Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive network: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621, 2023.
Xiaqiang Tang, Weigao Sun, Siyuan Hu, Yiyang Sun, and Yafeng Guo. Ms-net: A multi-path sparse model for motion prediction in multi-scenes. IEEE Robotics and Automation Letters, 2023.
InternLM Team. Internlm: A multilingual language model with progressively enhanced capabilities, 2023.
Qwen Team. Qwen1.5-moe: Matching 7b model performance with 1/3 activated parameters”, February 2024. URL https://qwenlm.github.io/blog/qwen-moe/.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
A Vaswani. Attention is all you need. Advances in Neural Information Processing Systems, 2017.
Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. Gated linear attention transformers with hardware-efficient training. arXiv preprint arXiv:2312.06635, 2023.
Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving mamba2 with delta rule. arXiv preprint arXiv:2412.06464, 2024.
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.
Yu Zhang, Songlin Yang, Ruijie Zhu, Yue Zhang, Leyang Cui, Yiqiao Wang, Bolun Wang, Freda Shi, Bailin Wang, Wei Bi, et al. Gated slot attention for efficient linear-time sequence modeling. arXiv preprint arXiv:2409.07146, 2024.
Beitong Zhou, Jun Liu, Weigao Sun, Ruijuan Chen, Claire J Tomlin, and Ye Yuan. pbsgd: Powered stochastic gradient descent methods for accelerated non-convex optimization. In IJCAI, pp. 3258– 3266, 2020.
Tong Zhu, Xiaoye Qu, Daize Dong, Jiacheng Ruan, Jingqi Tong, Conghui He, and Yu Cheng. Llama-moe: Building mixture-of-experts from llama with continual pre-training. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 15913– 15923, 2024.
A RELATED WORK
LINEAR RECURRENT MODELS
Linear recurrent models, comprising linear attention, linear RNNs, and state-space models (SSMs), have garnered significant research interest (Qin et al., 2023b). The advancement of SSMs began with the pioneering work on S4 (Gu et al., 2022), which was later optimized through a diagonalized version (Gupta et al., 2022). Despite their strong performance on the LRA benchmark, these models have faced challenges in language modeling mainly because they rely solely on data-independent processes. As research progressed, constant forgetting gates were introduced, helping to alleviate some interference issues by uniformly managing memory decay (Sun et al., 2023; Gu & Dao, 2024). The next breakthrough involved data-dependent forget gates. These allowed models to dynamically adjust memory updates based on the input data, significantly enhancing performance across various tasks (Qin et al., 2024c;d; Yang et al., 2023; Zhang et al., 2024; Yang et al., 2024; Qin et al., 2024a). Sequence parallelism techniques are well adapted to linear recurrent models (Sun et al., 2024a; 2025) for efficient long context training. There have also been recent advancements in scaling laws and test-time regression optimization (Shen et al., 2024; Sun et al., 2024c; Behrouz et al., 2024).
Building on these advancements, our MoM model incorporates data-dependent mechanisms that selectively update memory. By efficiently managing interference through tailored memory updates and leveraging increased memory capacity, MoM represents a further evolution, improving model expressiveness and performance.
MIXTURE-OF-EXPERTS
Mixture-of-Experts (MoE) is a technique designed to enhance the capacity of deep neural networks while maintaining computational efficiency (Fedus et al., 2022; Rajbhandari et al., 2020; Lepikhin et al., 2020; Tang et al., 2023; Zhu et al., 2024; Qu et al., 2024). MoE achieves this by activating a subset of parameters, known as “experts”, for each input, which reduces the computational costs. Shazeer first integrated MoE into LSTM layers (Shazeer et al., 2017). The Switch Transformer (Fedus et al., 2022) refined this approach by simplifying the gating mechanism to select only one expert per input. Gshard (Lepikhin et al., 2020) further advanced this by using a top-2 expert routing strategy to improve performance. Recent MoE models, such as Deepseek-MoE (Dai et al., 2024), introduce shared experts to capture and consolidate common knowledge across different contexts, while designing fine-grained experts to increase combinatorial flexibility.
B COMPARISON BETWEEN MOM AND MOE
While our approach to implementing the Mixture-of-Memories (MoM) draws inspiration from the Mixture-of-Experts (MoE) framework, there are significant differences that distinguish our method from traditional MoE implementations.
• Purpose: The MoE was introduced to scale up the number of parameters without significantly increasing computational resources. It addresses the limitations of dense models in scaling both parameters and computational demands through sparse activation. However, MoM is designed to expand the memory capacity of linear attention models while preserving their linear time complexity. By sparsely activating memories and using weighted summation to create a mixed memory, MoM effectively addresses the challenge of forgetting
Table 7: Results on Common-Sense Reasoning Tasks. The performance of linear models and Transformer models is comparable; however, MoM consistently achieves the best average performance across all model sizes.
| Scale | Model | Wiki. ppl↓ | Lamb. ppl↓ | ARC-e acc↑ | ARC-c acc n ↑ | Hella. acc n ↑ | Lamb. acc↑ | PIQA acc↑ | Wino. acc↑ | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| 380M Params | Transformer++ | 26.88 | 76.46 | 44.91 | 25.94 | 34.95 | 26.90 | 64.31 | 51.07 | 41.35 |
| 15B Tokens | RetNet | 31.07 | 87.11 | 44.49 | 23.04 | 33.86 | 23.93 | 63.49 | 52.33 | 40.19 |
| L=24, | HGRN2 | 27.90 | 77.40 | 45.24 | 23.63 | 35.61 | 24.74 | 65.45 | 54.06 | 41.46 |
| | GLA | 28.78 | 79.95 | 44.53 | 22.27 | 34.84 | 24.94 | 63.93 | 51.38 | 40.32 |
| | GSA | 28.17 | 82.50 | 45.50 | 24.23 | 35.00 | 24.02 | 64.85 | 50.43 | 40.67 |
| | Gated DeltaNet | 26.47 | 58.59 | 46.04 | 23.55 | 35.18 | 27.01 | 66.05 | 50.83 | 41.44 |
| | MoM | 25.86 | 55.41 | 44.65 | 24.74 | 36.54 | 27.93 | 66.16 | 51.78 | 41.97 |
| 1.3B Params | Transformer++† | 17.61 | 19.29 | 55.01 | 28.07 | 49.21 | 40.95 | 70.08 | 56.27 | 49.93 |
| 100B Tokens | RetNet † | 18.18 | 21.97 | 57.49 | 26.88 | 48.09 | 37.75 | 69.37 | 53.28 | 48.81 |
| L=24, | HGRN2 † | 17.32 | 15.65 | 58.33 | 28.07 | 51.93 | 42.31 | 71.33 | 52.01 | 50.66 |
| | | 17.61 | 19.66 | 55.18 | 27.56 | 48.89 | 40.03 | 69.86 | 53.91 | 49.24 |
| | | 16.69 | 16.02 | 58.33 | 28.33 | 50.98 | 42.03 | 72.25 | 53.43 | 50.89 |
| | Gated DeltaNet | 17.14 | 18.80 | 56.82 | 27.39 | 49.77 | 39.94 | 71.76 | 51.78 | 49.58 |
| | MoM | 16.64 | 14.83 | 55.35 | 27.99 | 50.95 | 43.43 | 71.27 | 56.83 | 50.97 |
historical information in linear attention. Moreover, by separating the memory into distinct states, MoM reduces interference between different pieces of information.
• Structure: In conventional MoE, each expert is a separate neural network within the feed-forward network (FFN) layer such as Qwen-MoE (Team, 2024) and Linear-MoE Sun et al. (2024a). In contrast, in MoM, each memory is an RNN state with unique key-value projection weights to generate different key-value pairs. MoE operates during the channel mixing phase, where each token is processed independently by selected experts. On the other hand, MoM functions during the token mixing phase, where each memory processes different segments of the sequence, preserving inter-token relationships.
C EXPERIMENTS DETAILS
For the 380M models, we train on 15B tokens with a batch size of 0.5M tokens and 0.25M warmup tokens. We set the hidden ratio of our model to 3 to keep the activated parameter count approximately the same. For the 1.3B models, we train on 100B tokens with a batch size of 2M tokens and 1B warmup tokens. We employ the AdamW optimizer (Loshchilov et al., 2017; Sun et al., 2024b) with a learning rate of 3e-4 and a cosine learning rate schedule (Zhou et al., 2020). The weight decay is set to 0.01 and the gradient clipping threshold to 1.0. Our experiments were conducted on 32 NVIDIA A800 GPUs. Training the 380M-parameter model required approximately 10 hours, while the 1.3B-parameter model took around 6 days.
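For readers reproducing this setup, the optimizer and schedule above can be assembled roughly as follows; this is a minimal PyTorch sketch in which `model`, `total_steps`, and `warmup_steps` are placeholders, and the exact warmup/decay shape is an assumption rather than the released training code.

```python
import math
import torch

def build_optimizer_and_scheduler(model, total_steps, warmup_steps,
                                  lr=3e-4, weight_decay=0.01):
    """AdamW with linear warmup and cosine decay, matching the hyperparameters reported above."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)

    def lr_lambda(step):
        if step < warmup_steps:                                   # linear warmup
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))         # cosine decay

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler

# Toy usage; gradient clipping at 1.0 would be applied each step via
# torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0).
opt, sched = build_optimizer_and_scheduler(torch.nn.Linear(8, 8),
                                           total_steps=1000, warmup_steps=100)
```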
D COMMONSENSE REASONING TASKS
As shown in Table 7, we report the language modeling perplexity and zero-shot performance on commonsense reasoning tasks, following Zhang et al. (2024). The benchmarks include WikiText (Merity et al., 2016), LAMBADA (Paperno et al., 2016), ARC-easy and ARC-challenge (Clark et al., 2018), HellaSwag (Zellers et al., 2019), PiQA (Bisk et al., 2020), and WinoGrande (Sakaguchi et al., 2019). The evaluation results are obtained with lm-evaluation-harness (Gao et al., 2024).
Experimental results show that MoM outperforms the other linear models and surpasses the Transformer model as well.
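As a hedged illustration of how such zero-shot evaluations are commonly run, the snippet below uses the Python entry point of lm-evaluation-harness; the checkpoint path is a placeholder, and the exact task names and function signature may differ across harness versions.

```python
import lm_eval

# Zero-shot evaluation over the commonsense suite reported in Table 7.
results = lm_eval.simple_evaluate(
    model="hf",                                       # HuggingFace-style causal LM backend
    model_args="pretrained=path/to/mom-checkpoint",   # placeholder checkpoint path
    tasks=["wikitext", "lambada_openai", "arc_easy", "arc_challenge",
           "hellaswag", "piqa", "winogrande"],
    num_fewshot=0,                                    # zero-shot
    batch_size=8,
)
print(results["results"])
```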
E Training Loss Comparison
To further assess the learning efficiency of MoM, we compared the training loss curves of MoM with those of other baseline models. As depicted in Figure 7, MoM consistently maintains the lowest loss
throughout the entire training phase. Even as training nears convergence, MoM continues to exhibit a clear advantage over other methods.

Figure 7: Training Loss. Loss curves for training 380M models on 15B tokens with a fixed random seed of 42.
F THE HYBRID OF MOM AND TRANSFORMER
We delve deeper into the hybridization of MoM and Transformer layers by integrating 1 Transformer layer after every 7 MoM layers, resulting in only 3 Transformer layers across a total of 24 layers. The performance on commonsense reasoning and recall-intensive tasks is presented in Table 8, and a layer-stacking sketch follows the table. MoM-Hybrid demonstrates significantly improved results compared to Transformer models, despite using only 3 layers of global attention.
Table 8: Hybrid Model Performance. The hybrid model integrates 1 Transformer layer after every 7 MoM layers, resulting in only 3 Transformer layers across a total of 24 layers.
| Model | FDA | SWDE | SQUAD | NQ | TriviaQA | Drop |
|---|---|---|---|---|---|---|
| Transformer++ | 46.14 | 25.87 | 33.22 | 18.94 | 45.97 | 20.03 |
| MoM | 22.98 | 29.90 | 29.69 | 16.60 | 48.82 | 20.99 |
| MoM Hybrid | 58.13 | 44.05 | 35.71 | 20.18 | 48.10 | 20.60 |

| Model | ARC-e acc↑ | ARC-c acc_n↑ | Hella. acc_n↑ | Lamb. acc↑ | PIQA acc↑ | Wino. acc↑ | Avg. |
|---|---|---|---|---|---|---|---|
| Transformer++ | 44.91 | 25.94 | 34.95 | 26.90 | 64.31 | 51.07 | 41.35 |
| MoM | 44.65 | 24.74 | 36.54 | 27.93 | 66.16 | 51.78 | 41.97 |
| MoM Hybrid | 46.55 | 24.49 | 36.45 | 28.86 | 65.51 | 52.41 | 42.38 |
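As a concrete illustration of the layer-interleaving pattern described above, here is a minimal PyTorch sketch; `HybridStack`, the placeholder layer constructors, and the `attn_every` parameter are illustrative assumptions rather than the released implementation.

```python
import torch

class HybridStack(torch.nn.Module):
    """Interleave 1 global-attention layer after every 7 MoM layers (3 attention layers in 24)."""

    def __init__(self, mom_layer_cls, attn_layer_cls, num_layers=24, attn_every=8):
        super().__init__()
        layers = []
        for i in range(num_layers):
            # Layers 8, 16, 24 (1-indexed) become attention layers -> 3 out of 24.
            if (i + 1) % attn_every == 0:
                layers.append(attn_layer_cls())
            else:
                layers.append(mom_layer_cls())
        self.layers = torch.nn.ModuleList(layers)

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

# Usage with identity placeholders standing in for the real MoM and attention blocks:
stack = HybridStack(mom_layer_cls=torch.nn.Identity, attn_layer_cls=torch.nn.Identity)
out = stack(torch.randn(2, 10, 16))   # (batch, sequence, hidden)
```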
G FAIRNESS
To enhance memory capacity, MoM introduces additional, sparsely activated key and value projections. Although these projections constitute only a small portion of the overall model parameters, they inevitably increase the parameter count. Because linear models differ in structure, aligning both the parameter count and the memory capacity exactly is challenging. To ensure fairness, we therefore compare from two perspectives: equal activated parameter count and equal memory capacity.
G.1 EQUAL ACTIVATED PARAMETER COUNT
To ensure a fair comparison in terms of parameters, we reduced the MLP hidden ratio to 2 and retrained the MoM model using the same training configuration as in Section 4.1, so that both MoM and Gated DeltaNet have 400M activated parameters. Although the smaller hidden ratio might impact the model's commonsense knowledge, MoM consistently outperformed Gated DeltaNet on both commonsense reasoning tasks and recall-intensive benchmarks, further validating the effectiveness of the MoM approach. The results are presented in Table 9, followed by a back-of-the-envelope parameter sketch.
Table 9: Comparison with the Same Activated Parameters. MoM and Gated DeltaNet with 400M activated parameters are tested.

| Model | Params | ARC-e acc↑ | ARC-c acc_n↑ | Hella. acc_n↑ | Lamb. acc↑ | PIQA acc↑ | Wino. acc↑ | Avg. |
|---|---|---|---|---|---|---|---|---|
| Gated DeltaNet | 400M | 46.04 | 23.55 | 35.18 | 27.01 | 66.05 | 50.83 | 41.44 |
| MoM | 400M | 47.10 | 23.72 | 35.43 | 26.88 | 64.64 | 51.22 | 41.50 |

| Model | Params | FDA | SWDE | SQUAD | NQ | TriviaQA | Drop | Avg. |
|---|---|---|---|---|---|---|---|---|
| Gated DeltaNet | 400M | 20.53 | 23.24 | 28.55 | 14.98 | 44.91 | 16.48 | 24.78 |
| MoM | 400M | 24.16 | 25.59 | 29.46 | 15.36 | 46.15 | 18.35 | 26.51 |
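The sketch below is a back-of-the-envelope parameter count under assumed dimensions and a deliberately simplified layer structure; it is not the exact configuration of the released models, only an illustration of why lowering the hidden ratio can roughly offset the extra per-memory key/value projections.

```python
def approx_activated_params_per_layer(d_model, hidden_ratio, n_active_mem=1):
    """Very rough activated-parameter count for one layer (illustrative only):
    ignores embeddings, norms, biases, and any shared-memory or gating terms.

    - token mixing: a query and an output projection, plus key/value projections
      for each *activated* memory (n_active_mem=1 recovers a single-memory layer);
    - channel mixing: a gated MLP counted as three d_model x (ratio * d_model) matrices.
    """
    token_mixing = (2 + 2 * n_active_mem) * d_model * d_model
    channel_mixing = 3 * d_model * int(hidden_ratio * d_model)
    return token_mixing + channel_mixing

# Illustrative check with an assumed d_model of 1024: reducing the hidden ratio from
# 3 to 2 roughly offsets the extra key/value projections of one more activated memory.
baseline = approx_activated_params_per_layer(1024, hidden_ratio=3, n_active_mem=1)  # ~13 * d^2
mom_like = approx_activated_params_per_layer(1024, hidden_ratio=2, n_active_mem=2)  # ~12 * d^2
print(baseline, mom_like)
```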
G.2 EQUAL MEMORY CAPACITY
To ensure a fair comparison of memory capacity, we also compare MoM with a single extended memory model. Notably, the single extended memory model has more parameters than the activated parameters of MoM due to the extension of the value dimension. MoM expands memory more elegantly and significantly outperforms the single extended memory baseline on both recall and commonsense tasks; this comparison is presented in Table 4, and a capacity check is sketched below.
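A minimal capacity check, under assumed dimensions, of why the two settings store the same number of state elements:

```python
# Illustrative dimensions (d_k = d_v = 128, m = 4 memories): a single memory whose value
# dimension is expanded m-fold stores the same number of state elements as m separate
# (d_k x d_v) memories, but every token still writes into the one shared state, whereas
# MoM routes different tokens to distinct states.
d_k, d_v, m = 128, 128, 4
single_extended = d_k * (m * d_v)   # one state of shape (d_k, m * d_v)
mom_memories = m * (d_k * d_v)      # m states of shape (d_k, d_v)
assert single_extended == mom_memories
```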
H DETAILED SCALING RESULTS
To further examine the scalability of MoM from the perspective of memory capacity, we evaluate the effect of enlarging the memory pool beyond the main settings. Specifically, we compare two activation ratios, in which the number of active memories accounts for either 0.5 or 0.25 of the total memory states; in all cases, an additional shared memory is included. Starting from a single memory as the baseline, we expand the number of memories to 2, 4, 8, and 16, and report the averaged results on both recall-intensive and commonsense benchmarks. The results, shown in Figure 8, indicate that under a fixed activation ratio, increasing the number of memories consistently improves performance on both categories of tasks.

Figure 8: Detailed scaling results. Performance shows a general improvement on both recall-intensive and commonsense tasks as the number of memories increases.
I FULL RESULTS
| RetNet | 9.23 | 7.23 | 6.3 | 15.76 | 2.64 | 40.52 | 15.44 | 13.5 | 13.61 |
|---|---|---|---|---|---|---|---|---|---|
| HGRN2 | 7.38 | 6.02 | 6.51 | 15.5 | 2.61 | 40.11 | 14.28 | 13.12 | 13.02 |
| GSA | 8.21 | 6.63 | 7.75 | 20.29 | 1.92 | 42.83 | 15.06 | 15.2 | 14.61 |
| Gated DeltaNet | 8.52 | 6.61 | 7.14 | 18 | 2.1 | 41.52 | 14.19 | 14.63 | 13.98 |
| MoM | 8.14 | 7.11 | 6.89 | 21.26 | 2.63 | 47.79 | 17.33 | 15.71 | 15.64 |
Table 10: Complete Results of LongBench. SQA: Single-doc QA, MQA: Multi-doc QA, Sum: Summarization, FS: Few-shot learning, Syn: Synthetic

Figure 9: Memory Load Balance Analysis. Token routing distribution across layers and memories, without auxiliary loss.