Papers
arxiv:2603.27538

LongCat-Next: Lexicalizing Modalities as Discrete Tokens

Published on Mar 29 · Submitted by LI, XIAOTONG on Apr 1
#3 Paper of the day
Abstract

The Discrete Native Autoregressive framework enables unified multimodal processing by representing diverse modalities in a shared discrete space through a novel any-resolution visual transformer tokenizer.

AI-generated summary

The prevailing Next-Token Prediction (NTP) paradigm has driven the success of large language models through discrete autoregressive modeling. However, contemporary multimodal systems remain language-centric, often treating non-linguistic modalities as external attachments, leading to fragmented architectures and suboptimal integration. To transcend this limitation, we introduce Discrete Native Autoregressive (DiNA), a unified framework that represents multimodal information within a shared discrete space, enabling consistent and principled autoregressive modeling across modalities. A key innovation is the Discrete Native Any-resolution Visual Transformer (dNaViT), which performs tokenization and de-tokenization at arbitrary resolutions, transforming continuous visual signals into hierarchical discrete tokens. Building on this foundation, we develop LongCat-Next, a native multimodal model that processes text, vision, and audio under a single autoregressive objective with minimal modality-specific design. As an industrial-strength foundation model, it excels at seeing, painting, and talking within a single framework, achieving strong performance across a wide range of multimodal benchmarks. In particular, LongCat-Next addresses the long-standing performance ceiling of discrete vision modeling on understanding tasks and provides a unified approach to reconciling the conflict between understanding and generation. As an attempt toward native multimodality, we open-source LongCat-Next and its tokenizers, hoping to foster further research and development in the community. GitHub: https://github.com/meituan-longcat/LongCat-Next
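
For readers unfamiliar with what "a single autoregressive objective over a shared discrete space" means in practice, the sketch below shows the core idea: text, visual, and audio token IDs drawn from one vocabulary are interleaved into a single stream and trained with ordinary next-token cross-entropy. All names, vocabulary sizes, and the tiny backbone are illustrative assumptions, not LongCat-Next's actual implementation.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: one next-token-prediction objective over a shared
# discrete vocabulary that interleaves text, visual, and audio tokens.
# Vocabulary layout and model sizes are assumptions, not the paper's.
TEXT_VOCAB, VISION_VOCAB, AUDIO_VOCAB = 32_000, 16_384, 4_096
SHARED_VOCAB = TEXT_VOCAB + VISION_VOCAB + AUDIO_VOCAB  # one token space

class TinyDiscreteAR(nn.Module):
    def __init__(self, d_model=512, n_layers=4, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(SHARED_VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, SHARED_VOCAB)

    def forward(self, tokens):  # tokens: (B, T) int64 IDs from any modality
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.backbone(self.embed(tokens), mask=causal)
        return self.head(h)     # (B, T, SHARED_VOCAB) next-token logits

# Training step: the cross-entropy loss is identical regardless of which
# modality each position came from -- the "single autoregressive objective".
model = TinyDiscreteAR()
tokens = torch.randint(0, SHARED_VOCAB, (2, 128))  # fake interleaved stream
logits = model(tokens[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, SHARED_VOCAB), tokens[:, 1:].reshape(-1)
)
loss.backward()
```
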

Community

Paper author Paper submitter

An open-source native multimodal model built with a pure discrete autoregressive architecture, supporting unified modeling for visual understanding, generation, audio, and language tasks.

The most interesting bit to me is how they serialize vision into language-like tokens without treating modality as an afterthought. dNaViT's arbitrary-resolution tokenization and a hierarchical, semantically grounded codebook seem like a clean way to unify understanding and generation across text, image, and audio. I want to see how this scales to high-res video or long-tail textures: does the token hierarchy preserve fine detail when motion and noise come into play? The arXivLens breakdown helped me parse the method, especially the SAE+RVQ stack that yields the discrete stream (a general sketch of the RVQ idea follows below). One concrete question: how do you ensure the semantic tokens stay aligned across modalities when their native statistics drift during fine-tuning or domain shift?
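
Since the comment above mentions an SAE+RVQ stack, here is a minimal sketch of residual vector quantization in general, which is one common way to turn continuous features into a coarse-to-fine hierarchy of discrete codes. Codebook sizes, depth, and the nearest-neighbour assignment are illustrative assumptions; this is not the dNaViT tokenizer itself.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization sketch (illustrative, not LongCat-Next's).

    x         : (N, D) continuous feature vectors, e.g. visual embeddings.
    codebooks : list of (K, D) arrays, one per quantization stage.
    Returns per-stage code indices and the summed reconstruction.
    """
    residual = x.copy()
    codes, recon = [], np.zeros_like(x)
    for cb in codebooks:
        # Nearest codeword for the current residual (coarse -> fine).
        d2 = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)  # (N, K)
        idx = d2.argmin(axis=1)
        quantized = cb[idx]
        codes.append(idx)
        recon += quantized
        residual -= quantized  # the next stage refines what is left over
    return np.stack(codes, axis=1), recon  # (N, n_stages) discrete tokens

# Toy usage: 3 stages of 256 codes over 64-dim features.
rng = np.random.default_rng(0)
features = rng.normal(size=(8, 64))
books = [rng.normal(size=(256, 64)) for _ in range(3)]
tokens, approx = rvq_encode(features, books)
print(tokens.shape)  # (8, 3): a small token hierarchy per vector
```

Each vector thus maps to a short sequence of codes where earlier stages carry coarse structure and later stages add detail, which is the usual motivation for calling such token streams hierarchical.
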



Get this paper in your agent:

hf papers read 2603.27538
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 1

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2603.27538 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2603.27538 in a Space README.md to link it from this page.

Collections including this paper 3