SlotMemory: Object-Centric KV Memory for Streaming Long-Video Generation

Weijia Dou1, Hui Li1, Jiahao Cui1, Lei Zhou2, Jingdong Wang3, Siyu Zhu1
1Fudan University   2Meta super intelligence   3Baidu
arXiv preprint (2026)

Corresponding author: Siyu Zhu (siyuzhu@fudan.edu.cn)

Remember objects, not frames.

60s Quality: 81.61  |  30s Dynamic Score: 74.29  |  Dynamic Consistency Gain: 22.8%  |  Interactive Narratives: 60s

SlotMemory shifts long-video memory from temporal indexing ("when") to semantic slot routing ("what"), enabling persistent entity identity and prompt-aware retrieval in streaming diffusion generation.

Method Overview

SlotMemory formulates streaming long-video generation as a bounded-memory read-write-update loop. For each chunk, the model retrieves semantically relevant slot memories, generates the current chunk, writes new slot-conditioned KV items, and updates the memory bank under a fixed budget.

  1. Read: retrieve relevant slot memories from the long-term bank.
  2. Generate: denoise the current chunk with local and retrieved KV context.
  3. Write: convert current chunk features into slot-conditioned KV memories.
  4. Update: retain high-value memories and evict obsolete ones under budget.

Figure: SlotMemory read-write-update pipeline.
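
The four steps above can be sketched as a toy loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `SlotBank`, `MemoryItem`, `toy_denoise`, the cosine-similarity routing, and the retrieval-count eviction rule are all hypothetical stand-ins, and plain Python lists replace real transformer KV tensors.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryItem:
    slot_id: int    # entity slot that wrote this item
    key: list       # toy stand-in for a cached attention key
    value: list     # toy stand-in for a cached attention value
    chunk_idx: int  # chunk at which the item was written
    hits: int = 0   # times retrieved (assumed "value" signal for eviction)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SlotBank:
    def __init__(self, budget, top_k):
        self.budget, self.top_k, self.items = budget, top_k, []

    def read(self, query):
        """Step 1: return up to top_k items whose key matches the query."""
        scored = [(cosine(query, m.key), m) for m in self.items]
        scored = [p for p in scored if p[0] > 0.0]
        scored.sort(key=lambda p: p[0], reverse=True)
        hits = [m for _, m in scored[: self.top_k]]
        for m in hits:
            m.hits += 1
        return hits

    def write(self, slot_features, chunk_idx):
        """Step 3: store one slot-conditioned item per active slot."""
        for slot_id, feat in slot_features.items():
            self.items.append(MemoryItem(slot_id, list(feat), list(feat), chunk_idx))

    def update(self):
        """Step 4: keep the most-retrieved items, newest first on ties."""
        self.items.sort(key=lambda m: (m.hits, m.chunk_idx), reverse=True)
        del self.items[self.budget:]


def toy_denoise(chunk_idx, query, retrieved):
    """Step 2 placeholder: a real model would denoise the chunk here using
    local and retrieved KV context. We just emit one feature for the slot
    the query points at."""
    slot_id = query.index(max(query))
    return {slot_id: query}


bank = SlotBank(budget=3, top_k=2)
# Two entities alternate across six chunks: slot 0 = [1, 0], slot 1 = [0, 1].
queries = [[1.0, 0.0], [0.0, 1.0]] * 3
for idx, query in enumerate(queries):
    retrieved = bank.read(query)                  # 1. Read
    feats = toy_denoise(idx, query, retrieved)    # 2. Generate
    bank.write(feats, idx)                        # 3. Write
    bank.update()                                 # 4. Update
    assert len(bank.items) <= bank.budget         # memory stays bounded

for m in bank.items:
    print(m.slot_id, m.chunk_idx, m.hits)
```

In this toy run the items written at chunks 0 and 1 survive to the end because they keep being retrieved, while more recent but unused items are evicted. That is the intended contrast with a purely recency-based (temporal) cache.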

Enhanced Consistency in Extended Video Generation

Core takeaway: SlotMemory's improvements are most pronounced in long-horizon video synthesis (30–60s), where memory fidelity is critical for preserving entity identity and following prompt instructions. These results favor object-centric KV memory over conventional temporally indexed memory.

60-second Interactive Multi-Prompt Evaluation

Method        | Quality | CLIP 0-10s | CLIP 10-20s | CLIP 20-30s | CLIP 30-40s | CLIP 40-50s | CLIP 50-60s
Infinity-RoPE | 79.98   | 23.87      | 22.62       | 22.16       | 21.82       | 22.20       | 22.20
LongLive      | 79.09   | 26.49      | 25.60       | 24.89       | 24.48       | 24.81       | 24.54
MemFlow       | 78.57   | 26.29      | 24.09       | 23.15       | 23.93       | 23.68       | 23.39
SlotMemory    | 81.61   | 26.81      | 26.11       | 25.18       | 25.33       | 24.91       | 25.25
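
One way to read the 60-second table is to look at how far each method's CLIP score drifts between the first and last 10-second window. The short script below does that arithmetic on the reported numbers. The "drift" summary (last window minus first) is our own illustrative choice, not a metric defined by the authors.

```python
# CLIP scores per 10-second window, copied from the 60-second table.
clip = {
    "Infinity-RoPE": [23.87, 22.62, 22.16, 21.82, 22.20, 22.20],
    "LongLive":      [26.49, 25.60, 24.89, 24.48, 24.81, 24.54],
    "MemFlow":       [26.29, 24.09, 23.15, 23.93, 23.68, 23.39],
    "SlotMemory":    [26.81, 26.11, 25.18, 25.33, 24.91, 25.25],
}

# Drift = score in the 50-60s window minus score in the 0-10s window.
drift = {name: round(s[-1] - s[0], 2) for name, s in clip.items()}

# Method with the highest CLIP score in each window.
n_windows = len(next(iter(clip.values())))
leaders = [max(clip, key=lambda name: clip[name][i]) for i in range(n_windows)]

for name, d in drift.items():
    print(f"{name:14s} drift {d:+.2f}")
print("window leaders:", leaders)
```

On these numbers SlotMemory has the smallest drop (about 1.56 points) and the highest score in all six windows.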

30-second Single-Prompt Long Video Evaluation

Method        | Total | Quality | Semantic | Dynamic | Imaging | Temporal Style
Self-Forcing  | 82.21 | 83.38   | 77.54    | 51.85   | 68.17   | 23.31
Infinity-RoPE | 83.16 | 84.17   | 79.13    | 57.04   | 67.94   | 23.79
MemFlow       | 82.95 | 83.86   | 79.31    | 60.46   | 67.68   | 23.89
LongLive      | 82.77 | 83.31   | 80.64    | 40.19   | 68.96   | 23.97
SlotMemory    | 84.28 | 85.23   | 80.49    | 74.29   | 72.26   | 24.34

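SlotMemory has the highest CLIP score in every 10-second window of the 60-second evaluation and leads every 30-second metric except Semantic (80.49 vs. 80.64 for LongLive). The gains are especially large in Dynamic (74.29 vs. 60.46 for the next-best MemFlow) and Imaging Quality (72.26 vs. 68.96 for LongLive).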
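
As a worked check on the 30-second table, the script below computes SlotMemory's relative gain over the strongest baseline in each column. Comparing against the best baseline per column is an assumption on our part. The page does not say how its headline 22.8% figure was derived, although the Dynamic column gives a similar value (about 22.9%) under this assumption.

```python
columns = ["Total", "Quality", "Semantic", "Dynamic", "Imaging", "Temporal Style"]

# Scores copied from the 30-second single-prompt table.
baselines = {
    "Self-Forcing":  [82.21, 83.38, 77.54, 51.85, 68.17, 23.31],
    "Infinity-RoPE": [83.16, 84.17, 79.13, 57.04, 67.94, 23.79],
    "MemFlow":       [82.95, 83.86, 79.31, 60.46, 67.68, 23.89],
    "LongLive":      [82.77, 83.31, 80.64, 40.19, 68.96, 23.97],
}
slotmemory = [84.28, 85.23, 80.49, 74.29, 72.26, 24.34]

best_baseline = {}  # column -> name of the strongest baseline
gain_pct = {}       # column -> relative gain of SlotMemory in percent
for i, col in enumerate(columns):
    name = max(baselines, key=lambda n: baselines[n][i])
    ref = baselines[name][i]
    best_baseline[col] = name
    gain_pct[col] = 100.0 * (slotmemory[i] - ref) / ref

for col in columns:
    print(f"{col:15s} vs {best_baseline[col]:14s} {gain_pct[col]:+.2f}%")
```

A negative value (as in the Semantic column) means a baseline scores higher than SlotMemory on that metric.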

Qualitative Demonstrations

In the "Ours Only: 2x2 per Group" gallery, Group #1 shows the 60-second interactive multi-prompt scenario and Group #2 shows the 30-second single-prompt evaluation. The second row allows direct visual comparison between SlotMemory outputs and baseline methods.

Ours Only: 2x2 per Group

Grouped Comparison: 3 Videos per Group

Citation

@article{dou2026slotmemory,
  title={SlotMemory: Object-Centric KV Memory for Streaming Long-Video Generation},
  author={Weijia Dou and Hui Li and Jiahao Cui and Lei Zhou and Jingdong Wang and Siyu Zhu},
  journal={arXiv preprint},
  year={2026},
  url={https://tj12323.github.io/SlotMemory/}
}