Papers
arxiv:2603.27538

LongCat-Next: Lexicalizing Modalities as Discrete Tokens

Published on Mar 29 · Submitted by LI, XIAOTONG on Apr 1
#3 Paper of the day
Abstract

The Discrete Native Autoregressive framework enables unified multimodal processing by representing diverse modalities in a shared discrete space through a novel any-resolution visual transformer tokenizer.

AI-generated summary

The prevailing Next-Token Prediction (NTP) paradigm has driven the success of large language models through discrete autoregressive modeling. However, contemporary multimodal systems remain language-centric, often treating non-linguistic modalities as external attachments, leading to fragmented architectures and suboptimal integration. To transcend this limitation, we introduce Discrete Native Autoregressive (DiNA), a unified framework that represents multimodal information within a shared discrete space, enabling consistent and principled autoregressive modeling across modalities. A key innovation is the Discrete Native Any-resolution Visual Transformer (dNaViT), which performs tokenization and de-tokenization at arbitrary resolutions, transforming continuous visual signals into hierarchical discrete tokens. Building on this foundation, we develop LongCat-Next, a native multimodal model that processes text, vision, and audio under a single autoregressive objective with minimal modality-specific design. As an industrial-strength foundation model, it excels at seeing, painting, and talking within a single framework, achieving strong performance across a wide range of multimodal benchmarks. In particular, LongCat-Next addresses the long-standing performance ceiling of discrete vision modeling on understanding tasks and provides a unified approach to reconciling the conflict between understanding and generation. As an attempt toward native multimodality, we open-source LongCat-Next and its tokenizers, hoping to foster further research and development in the community. GitHub: https://github.com/meituan-longcat/LongCat-Next
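
For readers unfamiliar with what "a single autoregressive objective over a shared discrete space" means in practice, the sketch below shows the core idea: text, visual, and audio token IDs drawn from one vocabulary are interleaved into a single stream and trained with ordinary next-token cross-entropy. All names, vocabulary sizes, and the tiny backbone are illustrative assumptions, not LongCat-Next's actual implementation.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: one next-token-prediction objective over a shared
# discrete vocabulary that interleaves text, visual, and audio tokens.
# Vocabulary layout and model sizes are assumptions, not the paper's.
TEXT_VOCAB, VISION_VOCAB, AUDIO_VOCAB = 32_000, 16_384, 4_096
SHARED_VOCAB = TEXT_VOCAB + VISION_VOCAB + AUDIO_VOCAB  # one token space

class TinyDiscreteAR(nn.Module):
    def __init__(self, d_model=512, n_layers=4, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(SHARED_VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, SHARED_VOCAB)

    def forward(self, tokens):  # tokens: (B, T) int64 IDs from any modality
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.backbone(self.embed(tokens), mask=causal)
        return self.head(h)     # (B, T, SHARED_VOCAB) next-token logits

# Training step: the cross-entropy loss is identical regardless of which
# modality each position came from -- the "single autoregressive objective".
model = TinyDiscreteAR()
tokens = torch.randint(0, SHARED_VOCAB, (2, 128))  # fake interleaved stream
logits = model(tokens[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, SHARED_VOCAB), tokens[:, 1:].reshape(-1)
)
loss.backward()
```
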

Community

Paper author Paper submitter

An open-source native multimodal model built with a pure discrete autoregressive architecture, supporting unified modeling for visual understanding, generation, audio, and language tasks.

The most interesting bit to me is how they serialize vision into language-like tokens without treating modality as an afterthought. dNaViT's arbitrary-resolution tokenization and a hierarchical, semantically grounded codebook seem like a clean way to unify understanding and generation across text, image, and audio. I want to see how this scales to high-res video or long-tail textures: does the token hierarchy preserve fine detail when motion and noise come into play? The arXivLens breakdown helped me parse the method, especially the SAE+RVQ stack that yields the discrete stream (a general sketch of the RVQ idea follows below). One concrete question: how do you ensure the semantic tokens stay aligned across modalities when their native statistics drift during fine-tuning or domain shift?
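
Since the comment above mentions an SAE+RVQ stack, here is a minimal sketch of residual vector quantization in general, which is one common way to turn continuous features into a coarse-to-fine hierarchy of discrete codes. Codebook sizes, depth, and the nearest-neighbour assignment are illustrative assumptions; this is not the dNaViT tokenizer itself.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization sketch (illustrative, not LongCat-Next's).

    x         : (N, D) continuous feature vectors, e.g. visual embeddings.
    codebooks : list of (K, D) arrays, one per quantization stage.
    Returns per-stage code indices and the summed reconstruction.
    """
    residual = x.copy()
    codes, recon = [], np.zeros_like(x)
    for cb in codebooks:
        # Nearest codeword for the current residual (coarse -> fine).
        d2 = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)  # (N, K)
        idx = d2.argmin(axis=1)
        quantized = cb[idx]
        codes.append(idx)
        recon += quantized
        residual -= quantized  # the next stage refines what is left over
    return np.stack(codes, axis=1), recon  # (N, n_stages) discrete tokens

# Toy usage: 3 stages of 256 codes over 64-dim features.
rng = np.random.default_rng(0)
features = rng.normal(size=(8, 64))
books = [rng.normal(size=(256, 64)) for _ in range(3)]
tokens, approx = rvq_encode(features, books)
print(tokens.shape)  # (8, 3): a small token hierarchy per vector
```

Each vector thus maps to a short sequence of codes where earlier stages carry coarse structure and later stages add detail, which is the usual motivation for calling such token streams hierarchical.
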



Get this paper in your agent:

hf papers read 2603.27538
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 1

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2603.27538 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2603.27538 in a Space README.md to link it from this page.

Collections including this paper 3