Yuriy Perezhohin PRO

yuriyvnv

https://scholar.google.com/citations?user=I5uzFtwAAAAJ&hl=en

AI & ML interests

Automatic Speech Recognition, Embeddings, Code Generation, Synthetic Data Generation and Filtering

Recent Activity

updated a model about 21 hours ago

yuriyvnv/experiments_parakeet

published a model 1 day ago

yuriyvnv/experiments_parakeet

updated a dataset 2 days ago

yuriyvnv/synthetic_asr_et_sl

Posts 2

Post

310

🎯 WAVe-1B-Multimodal-NL: Word-Level Speech Quality Assessment for Dutch

Following the Portuguese model, we're releasing the Dutch variant of WAVe, a 1B-parameter multimodal embedding model that assesses synthetic speech quality at the word level. Filtering with it improves the quality of synthetically augmented datasets used to train ASR models.

Trained on Common Voice 16.1 Dutch with five corruption strategies, this model catches mispronunciations, timing errors, and prosody issues in synthetic data that sentence-level embeddings miss entirely.

Resources

- Dutch model: yuriyvnv/WAVe-1B-Multimodal-NL
- Portuguese model: yuriyvnv/WAVe-1B-Multimodal-PT
- Code: https://github.com/yuriyvnv/WAVe

This model builds on Common Voice Dutch data. Thanks to @mozilla and the Common Voice community for making multilingual speech data accessible.

It would be great to hear from the Dutch NLP community (@BramVanroy, @GroNLP), especially if you're working on Dutch ASR or TTS pipelines where quality filtering could help. Also tagging @hf-audio, as this sits at the intersection of speech processing and data curation.

Post

2187

🎯 WAVe: 1B Multimodal Embedding Model for Word-Level Speech Quality

WAVe produces multimodal embeddings of speech and transcript that verify quality at the word level, not just the sentence level. It catches mispronunciations, timing errors, and prosody issues that sentence-level filters miss.
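To illustrate why a sentence-level check can miss a single bad word, here is a minimal sketch. It does not use WAVe itself: the per-word scores, the 0.7 threshold, and both helper functions are hypothetical, chosen only to show the averaging effect.

```python
from statistics import mean

def sentence_level_ok(word_scores, threshold=0.7):
    """Accept the utterance if the average word score clears the threshold."""
    return mean(word_scores) >= threshold

def word_level_ok(word_scores, threshold=0.7):
    """Accept the utterance only if every single word clears the threshold."""
    return min(word_scores) >= threshold

# Hypothetical per-word quality scores; the third word is badly mispronounced.
scores = [0.95, 0.9, 0.2, 0.95]

print(sentence_level_ok(scores))  # the average hides the one bad word
print(word_level_ok(scores))      # the per-word check exposes it
```

The three good words pull the average above the threshold, so the sentence-level check accepts the utterance, while the word-level check rejects it.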

📊 Impact on Portuguese ASR:
• 34% reduction in training steps
• 50% better cross-domain generalization
• 30% less synthetic data needed
• Word-aligned attention finds errors other methods miss

🏗️ Architecture:
• Text: XLM-RoBERTa (278M params)
• Audio: Wav2Vec2-BERT 2.0 (581M params)
• Word Alignment: Multi-head attention + GLU (14M params)
• Total: ~1B parameters

from transformers import AutoModel, AutoProcessor

# trust_remote_code=True allows transformers to run the custom code shipped in the model repo
processor = AutoProcessor.from_pretrained(
    "yuriyvnv/WAVe-1B-Multimodal-PT",
    trust_remote_code=True,
)
model = AutoModel.from_pretrained(
    "yuriyvnv/WAVe-1B-Multimodal-PT",
    trust_remote_code=True,
)

# Assess speech-transcript alignment.
# audio_array must be loaded beforehand: a mono waveform sampled at 16 kHz.
inputs = processor(text="Olá, como está?", audio=audio_array, sampling_rate=16000, return_tensors="pt")
quality = model(**inputs).quality_score.item()

Perfect for filtering synthetic speech datasets before ASR training.
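A filtering pass over a synthetic dataset could look like the sketch below. It is an illustration under stated assumptions, not code from the WAVe repo: `filter_by_quality`, the 0.5 threshold, and the stand-in score table are all hypothetical. In practice `score_fn` would wrap the `processor` and `model` calls shown above.

```python
def filter_by_quality(samples, score_fn, threshold=0.5):
    """Split samples into (kept, dropped) using a quality score per sample.

    score_fn is any callable mapping a sample to a float score;
    the threshold is an assumed value and should be tuned per dataset.
    """
    kept, dropped = [], []
    for sample in samples:
        if score_fn(sample) >= threshold:
            kept.append(sample)
        else:
            dropped.append(sample)
    return kept, dropped

# Stand-in scores so the sketch runs without downloading the model (hypothetical values).
fake_scores = {"a.wav": 0.91, "b.wav": 0.32, "c.wav": 0.77}

kept, dropped = filter_by_quality(fake_scores, fake_scores.get, threshold=0.5)
print("kept:", kept)
print("dropped:", dropped)
```

Returning the dropped samples alongside the kept ones makes it easy to inspect what the filter removed before committing to a threshold.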

Model: yuriyvnv/WAVe-1B-Multimodal-PT
Code to create WAVe: https://github.com/yuriyvnv/WAVe
#multimodal #speech #embeddings #asr
#syntheticdata #qualityassessment
