plz drop the LLM Adapter. It bottlenecks Qwen 0.6B's true potential (NLU & Chinese)

#52
by Schelluwu123 - opened

I need to point out a critical flaw in the current training strategy.

The current approach maps Qwen embeddings into T5's embedding space. This amounts to forcing Qwen to imitate T5, and the result is a "defective product" (a crippled Qwen). It performs worse than the original T5 and completely wastes Qwen's native capabilities.

CRUCIAL POINT:
For a DiT architecture, the LLM-Adapter(t5<->qwen) does NOT need to be trained.
Since we are using Qwen 0.6B, its hidden_states[-2] dimensions are already consistent with T5

This unnecessary mapping acts as a bottleneck that dilutes semantic richness. It is the primary reason why Natural Language Understanding (NLU) and Chinese language capabilities are degraded.

Please stop training the adapter and switch to Direct Alignment to fix the NLU and Chinese capability issues.
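To make the hidden_states[-2] point concrete, here is a minimal sketch (my own illustration, not the project's code, assuming the standard Hugging Face transformers API and the Qwen/Qwen3-0.6B checkpoint). The penultimate hidden state is what Direct Alignment would feed straight into the DiT, and its last dimension is already 1024, the same width as the T5 features discussed in this thread:

```python
# Minimal sketch (illustration only, not the repo's code): pull the
# penultimate hidden state out of Qwen3 0.6B. Assumes the Hugging Face
# "Qwen/Qwen3-0.6B" checkpoint and the standard transformers API.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "Qwen/Qwen3-0.6B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, torch_dtype=torch.float16)

prompt = "1girl, silver hair, looking at viewer, cherry blossoms"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states[-2] is the second-to-last layer; its last dimension is the
# model's hidden size (1024 for Qwen3 0.6B), matching the 1024-dim T5
# features the Cosmos DiT conditions on.
embeddings = outputs.hidden_states[-2]
print(embeddings.shape)  # torch.Size([1, seq_len, 1024])
```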

I don't know why a text-to-image (T2I) model that uses Qwen3 0.6B as its text encoder (TE) is still using T5's tokenizer, but I definitely think this is wrong: it makes the high-level embedding vocabulary behind Qwen inaccessible. Training it this way will cause Qwen to lose its inherent advantages as a text encoder. Right now it is essentially a form of distillation from T5 into the Qwen model, so the final text-encoding capability won't exceed that of T5.


The reason it doesn't understand Chinese is definitely just that Cosmos 2 2B was not captioned in any language other than English to begin with, and Anima hasn't been captioned in any language other than English either.

That's not the case. I've previously tested models including neta-lumina (Gemma2-2B) and Newbie (Lumina architecture, Gemma3-4B combined with a Jina CLIP text encoder). The vast majority of captions for these models are strictly in English, yet these multilingual LLM text encoders inherently have this generalization ability: they can understand multilingual prompts. Given that, it's extremely odd that a Chinese-developed model like Qwen3 lacks this Chinese generalization capability in text-to-image (T2I) scenarios.

Yeah, this seems like an odd design choice. It definitely still seems better at prompt understanding, but it sounds like you're saying it maps to another format that has less semantic information, correct? That explains the weird weaknesses and strengths a lot. It's probably too late to change for the full model, but it would be great if they release an Anima 2 eventually and learn from the mistakes of this model. Still looking forward to the full model at the end of the day though :-)

CircleStone Labs org

T5 wasn't trained on Chinese. The base Cosmos2 model with T5 has no ability to understand Chinese. This, combined with the fact that all captions are in English, is why the model only understands English.

The LLM adapter is converting Qwen3 embedding space to T5 embedding space so that the model doesn't have to relearn everything from scratch.

It performs worse than the original T5

Every comparison I've done with Cosmos2 base has it performing at least as well as T5.

For a DiT architecture, the LLM-Adapter(t5<->qwen) does NOT need to be trained.

It does need to be there and be trained; otherwise the model has to relearn everything.

Since we are using Qwen 0.6B, its hidden_states[-2] dimensions are already consistent with T5

Yes, they are both 1024-dim. You can feed Qwen3 0.6B embeddings directly into the Cosmos DiT. This is actually the first thing I tried. It doesn't work (or at least, as mentioned, the model has to relearn everything, which is very slow), hence the LLM adapter. The LLM adapter also serves as something of a "mini trainable text encoder", which is particularly helping with learning artist styles.
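Just to illustrate the idea (a hypothetical sketch only; the real adapter's internals aren't described in this thread): a small trainable module that takes Qwen3 0.6B's 1024-dim hidden states and re-expresses them in the T5-shaped space the Cosmos DiT already understands could look something like this:

```python
# Hypothetical sketch only -- NOT the actual Anima adapter. It just shows the
# concept of a small trainable module mapping Qwen3 0.6B hidden states
# (1024-dim) toward the T5 embedding space the Cosmos DiT was pretrained on,
# acting as a "mini trainable text encoder".
import torch
import torch.nn as nn

class LLMAdapter(nn.Module):
    def __init__(self, dim: int = 1024, num_layers: int = 2, num_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True,
        )
        # A couple of self-attention blocks plus a projection: the trainable
        # part that "translates" Qwen3 embeddings toward the T5 dialect.
        self.blocks = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.proj = nn.Linear(dim, dim)

    def forward(self, qwen_hidden: torch.Tensor) -> torch.Tensor:
        # qwen_hidden: (batch, seq_len, 1024), e.g. hidden_states[-2]
        return self.proj(self.blocks(qwen_hidden))

adapter = LLMAdapter()
fake_qwen = torch.randn(1, 77, 1024)   # dummy Qwen3 hidden states
t5_like = adapter(fake_qwen)           # same shape, but in "T5 space"
print(t5_like.shape)                   # torch.Size([1, 77, 1024])
```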

I've previously tested models including neta-lumina (gemma2-2B) and Newbie (Lumina architecture, Gemma3-4B combined with Jina CLIP Text Encoder, TE)

The LLMs for these models are much more multilingual than T5. Furthermore, the base model for both of them is Lumina, which is explicitly trained to understand Chinese. And I believe both models have at least some captions in Chinese.

The LLM adapter is converting Qwen3 embedding space to T5 embedding space so that the model doesn't have to relearn everything from scratch.

I see, that's what I assumed the reason was. The only way to fix it would be a better base model, or pretty much nuking the model's existing knowledge. Hopefully a better base comes around.

Over at the Banodoco Discord, we played extensively with trying to see if we could make Zimage talk to anything besides Qwen3 4B... even 8B didn't work, and VL models didn't work... attempting to make an image model understand the output of an encoder in anything but the structure it was trained on seems almost impossible... It's like speaking one language fluently your entire life and then being dumped into a world where they not only don't speak your tongue, but use sounds you can't understand or perhaps even process. Even getting basic rudimentary communication would be a win, but being dropped into that new environment and remaining even close to as functional as before seems against all odds.

Using a tiny Qwen3 LLM to 'bridge the gap' makes a lot of sense here. The entire (frustrating) thread #56 focused on what seemed to many to be a very unrelated topic: how a retrained Qwen3 4B, used as a text encoder, affects the images generated by Zimage (and other image models using it as well).

Using the 'translator analogy' here:

T5 turns 'English' into a set of tokens. It's not as 'braindead' as CLIP-L, and for a time it was the natural-language encoder of choice. It's since been far surpassed by using full LLMs as TEs. But on the spectrum of TEs, it's not as good as an LLM, and it's not as bad as CLIP-L...

But in an interesting spin, using Qwen3 (even a tiny model like 0.6B) as a second translator, a trainable one... by the same analogy:
The image model only speaks 'T5'... T5 is a not-bad-but-not-great 'encoder': it has some natural-language skills, but it's a pale shadow of a true LLM.
So, "what if we put a true LLM in front?" We talk to the LLM, which turns our language(s) into its token language, and then it can learn how best to communicate with the T5 'translator' (which is feeding the image model).
So it's a game of translation telephone:
User -> any human language -> Qwen3 0.6B into Qwen tokens -> T5 into T5 tokens -> Anima model (which only speaks 'T5')
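As toy code, that chain looks roughly like this (stand-in stubs only; none of these are real Anima or ComfyUI components, just placeholders to show the flow):

```python
# Toy illustration of the "translation telephone" above (placeholder stubs,
# not the actual Anima pipeline).
import torch

def qwen3_encode(prompt: str) -> torch.Tensor:
    """Stand-in for Qwen3 0.6B: prompt (any human language) -> per-token hidden states."""
    return torch.randn(1, len(prompt.split()), 1024)

def llm_adapter(qwen_hidden: torch.Tensor) -> torch.Tensor:
    """Stand-in for the trainable adapter that re-expresses the prompt in the 'T5 dialect'."""
    return qwen_hidden  # identity here; the real adapter is learned

def anima_dit(conditioning: torch.Tensor) -> torch.Tensor:
    """Stand-in for the image model, which only 'speaks T5'."""
    return torch.randn(1, 3, 1024, 1024)

prompt = "a girl in a red kimono under falling snow"
image = anima_dit(llm_adapter(qwen3_encode(prompt)))
print(image.shape)
```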

So the critics are correct that the limits of T5 will 'mostly' hamper you... it has a dialect/accent it can't lose, and that filters everything.
Qwen3 0.6B is a better translator, sitting in the middle between the human and T5 (which did fine understanding humans in other models), and it is attempting to 'be better' at telling T5 'what the human really meant was...'

And to the critics complaining that a tiny 0.6B is the wrong choice: the true bottleneck is T5... I suspect that if you put a larger, more capable LLM like an 8B model in front, it MIGHT do better, but at the cost of a lot more 'work'... and the better model is still forced to ONLY translate into T5's space, so there are limits to how well it can get the point across.

Getting rid of T5 would mean starting from scratch. Not a good goal or choice.
Using a Qwen3 model bigger than 0.6B is probably a waste... it's like hiring a college-level tutor to teach your 7-year-old basic math. You can get a cheaper, faster, and just as effective tutor at the high-school or grade-school level. Your 7-year-old (aka T5) is smart, but they are Not That Smart. And you aren't replacing them.

I'm impressed with what Anima does. It's (my take on it) a massive upgrade to an ecosystem still filled almost entirely by SDXL, Pony, Illustrious, and so on... That crowd dabbled in Flux but stuck to SDXL-level models because they WORK for them. This model is an attempt to reproduce the SDXL experience using newer tech...
You can question the why... but the how is working well.

Got it. Thank you for your dedication to this project. ❤️

Schelluwu123 changed discussion status to closed
