open/ acc

posted an update about 17 hours ago

Post

185

sKT-Ai-Labs

Join fast we will soon published tokens and all join and get started because we will soon off join request button if you want you can join fast guys

sdiazlor

posted an update 5 days ago

Post

Pruna OSS is turning 1! To mark this milestone, we're launching the First Prune initiative.

What's First Prune:
If you contribute to open issues at our GitHub repo, you earn Pruna Inference API credits.

How you can participate:
• Pick an open issue labelled "first-prune" and assign it to you
• Submit your PR and mark it ready for review by April 30
• Find out more in the PR template when you open a PR

Each merged PR scores 30 credits.

Let’s build something great together! Find your issue: https://github.com/PrunaAI/pruna/issues

posted an update 5 days ago

Post

404

Training mRNA Language Models Across 25 Species for $165

We built an end-to-end protein AI pipeline covering structure prediction, sequence design, and codon optimization. After comparing multiple transformer architectures for codon-level language modeling, CodonRoBERTa-large-v2 emerged as the clear winner with a perplexity of 4.10 and a Spearman CAI correlation of 0.40, significantly outperforming ModernBERT. We then scaled to 25 species, trained 4 production models in 55 GPU-hours, and built a species-conditioned system that no other open-source project offers. Complete results, architectural decisions, and runnable code below.

https://huggingface.co/blog/OpenMed/training-mrna-models-25-species

posted an update 6 days ago

Post

2508

🚀 Bharat AI Revolution ka Hissa Banein! 🇮🇳

Kya aap Bharat ko AI ki duniya mein ek nayi pehchan dilana chahte hain ?

SKT AI Labs sirf ek naam nahi, ek mission hai—desh ko digital shakti dene ka aur "Viksit Bharat" ke sapne ko sach karne ka.

Humse Kyun Judein?

1. Desh ka Apna AI: Hum aise models bana rahe hain jo khas taur par Bharat ki zarooraton aur bhashaon ke liye hain.

2. Open Collaboration: Hamare Hugging Face repository par hamare kaam ko dekhein, test karein aur apna yogdan dein.

3. Technological Growth: Agar aap student hain, developer hain ya tech enthusiast hain, toh hamare saath naya seekhne aur grow karne ka yeh behtareen mauka hai.

Join here

sKT-Ai-Labs
🔗

sKT-Ai-Labs

Aaiye, saath milkar Bharat AI Revolution ko aage badhate hain! 💻🔥

#SKTAILabs #DigitalIndia #AIRevolution #ViksitBharat #TechInnovation #JoinTheMission

posted an update 7 days ago

Post

6785

SOME NEW HINDI + ENGLISH DATASETS

🔗
- sKT-Ai-Labs/HIN
- sKT-Ai-Labs/SKT-MIX
- sKT-Ai-Labs/ST-H

Download and Use And Train Models

You Can Alsoo Use ST-x-LIGHTING Module For Faster Training

pip install ST-x-LIGHT-V11

2 replies

posted an update 11 days ago

Post

2134

We annotated 119K medical images with two frontier VLMs (Qwen 3.5, Kimi K2.5), cross-validated at 93% agreement, and produced 110K training records, all for under $500. Fine-tuning 3 small models (2-3B params) improved all benchmarks: best model reaches +15.0% average exact match.

Everything is open-sourced: datasets, adapters, and code.

https://huggingface.co/blog/OpenMed/synthvision

2 replies

posted an update 13 days ago

Post

5554

We are thrilled to announce the launch of SKT-OMNI-CORPUS-146T-V1, a massive-scale, high-quality dataset designed to power the next generation of Foundation Models (LLMs) from scratch.
Developed at SKT AI LABS, this corpus is not just a collection of data; it’s a mission to decentralize high-grade AI training for regional languages and global knowledge.

💎 Key Highlights:

•• Massive Scale: Targeting a multi-terabyte architecture for 146T-level tokenization.

•• Pure Quality: Curated from 500+ Elite Sources

•• Structured for MoE: Perfectly sharded into 3.5GB standardized units (SKT-𝕻 series) for seamless distributed training.

🤝 Open for Collaboration!

We are looking for AI researchers, CUDA engineers, and data scientists to join us in this journey of building Project Surya and the ST-X Series models. Whether it's optimization, custom tokenization, or architecture design—let’s build the future together.

Explore the Dataset on Hugging Face:

🔗 Shrijanagain/SKT-OMNI-CORPUS-146T-V1

DSR -- 🔗 Shrijanagain/SKT-DSRx10000

#AI #MachineLearning #OpenSource #IndicAI #SKTAILABS #LLM #BigData #HuggingFace #InnovationIndia

posted an update 17 days ago

Post

5444

Surya-1.1T: Scaling Beyond Human-Level Reasoning via 146 Trillion Token Pre-training
Author: SKT AI LABS
Affiliation: SKT AI Labs / Project Surya
Model Architecture: Optimized Dense Transformer
Parameters: 1.1 Trillion
Training Tokens: 146 Trillion

Wanna collaborate us Friends let's Start Journey we have Collected 146 trillon tokens and done pre training but we need to made more powerfull

Whitepaper - https://github.com/SHRIJANAGAIN/PROFF

56 replies

Nymbo

posted an update 20 days ago

Post

6410

We should really have a release date range slider on the /models page. Tired of "trending/most downloaded" being the best way to sort and still seeing models from 2023 on the first page just because they're embedded in enterprise pipelines and get downloaded repeatedly. "Recently Created/Recently Updated" don't solve the discovery problem considering the amount of noise to sift through.

Slight caveat: Trending actually does have some recency bias, but it's not strong/precise enough.

3 replies

sdiazlor

posted an update 25 days ago

Post

2602

More OSS than ever with the latest pruna 0.3.2 release. It extends existing algorithm families, such as compilers, kernels, and pruners, and adds new ones, including decoders, distillers, enhancers, and recoverers. But it's not only a collection of algorithms; instead, you can easily combine them to get the biggest efficiency win.

Read the full blog here: https://huggingface.co/blog/PrunaAI/pruna-0-3-2-open-source-optimization-algorithms

posted an update about 1 month ago

Post

2683

Introducing the github-codereview dataset: A compilation of 200k+ human-written code reviews from top OSS projects (React, Tensorflow, VSCode...).

I finetuned a Qwen2.5-Coder-32B-Instruct model with this dataset and saw significant improvements in generating better code fixes and review comments (4x improved BLEU-4, ROUGE-L, SBERT scores compared to base model).

#codereview #code #datasets

ronantakizawa/github-codereview

posted an update about 1 month ago

Post

4764

DNA, mRNA, proteins, AI. I spent the last year going deep into computational biology as an ML engineer. This is Part I of what I found. 🧬

In 2024, AlphaFold won the Nobel Prize in Chemistry.

By 2026, the open-source community had built alternatives that outperform it.

That's the story I find most interesting about protein AI right now. Not just the science (which is incredible), but the speed at which open-source caught up. Multiple teams, independently, reproduced and then exceeded AlphaFold 3's accuracy with permissive licenses. The field went from prediction to generation: we're not just modeling known proteins anymore, we're designing new ones.

I spent months mapping this landscape for ML engineers. What the architectures actually are (spoiler: transformers and diffusion models), which tools to use for what, and which ones you can actually ship commercially.

New post on the Hugging Face blog: https://huggingface.co/blog/MaziyarPanahi/protein-ai-landscape

Hope you all enjoy! 🤗

2 replies

posted an update about 1 month ago

Post

2432

Introducing the WebUI dataset: a compilation of screenshot to code pairs of modern websites detailing the styling, framework used, and box bounds for all viewports (Desktop, mobile, tablet).

This dataset showed signs of improved performance in web design LLM benchmarks for a finetuned QWEN 2.5 VL-7B!

#web #ui #datasets

ronantakizawa/webui

3 replies

Tonic

posted an update about 1 month ago

Post

3469

🤔 Who would win ?

- a fully subsidized ai lab
OR
- 3 random students named

kurakurai ?

demo : Tonic/fr-on-device

if you like it give the demo a little star and send a shoutout to : @MaxLSB @jddqd and @GAD-cell for absolutely obliterating the pareto frontier of the french language understanding .

4 replies

Paper • 2602.17004 • Published Feb 19 • 18

authored a paper about 1 month ago

Arcee Trinity Large Technical Report

posted an update about 1 month ago

Post

2117

Introducing the github-top-code dataset: A curated dataset of 1.3M+ source code files from GitHub's top ranked developers.

I collected the best source code files from Github's highest trending developers of all time, and compiled a dataset to train LLMs to write well-structured, production-grade code.

#dataset #codedataset #pretraining

ronantakizawa/github-top-code

Tonic

posted an update about 2 months ago

Post

3330

🙋🏻‍♂️hello my lovelies ,

it is with great pleasure i present to you my working one-click deploy 16GB ram completely free huggingface spaces deployment.

repo : Tonic/hugging-claw (use git clone to inspect)
literally the one-click link : Tonic/hugging-claw

you can also run it locally and see for yourself :

docker run -it -p 7860:7860 --platform=linux/amd64 \
-e HF_TOKEN="YOUR_VALUE_HERE" \
-e OPENCLAW_GATEWAY_TRUSTED_PROXIES="YOUR_VALUE_HERE" \
-e OPENCLAW_GATEWAY_PASSWORD="YOUR_VALUE_HERE" \
-e OPENCLAW_CONTROL_UI_ALLOWED_ORIGINS="YOUR_VALUE_HERE" \
registry.hf.space/tonic-hugging-claw:latest

just a few quite minor details i'll take care of but i wanted to share here first

2 replies

posted an update about 2 months ago

Post

285

Introducing the LeetCode Assembly Dataset: a dataset of 400+ LeetCode problem solutions in assembly across x86-64, ARM64, MIPS64, and RISC-V using GCC & Clang at -O0/-O1/-O2/-O3 optimizations.

This dataset is perfect for teaching LLMs complex compiler behavior!

#dataset #leetcode #assembly

ronantakizawa/leetcode-assembly

posted an update about 2 months ago

Post

2333

Announcing: OpenMed Multilingual PII Detection Models

Today I am releasing 105 open-source models for Personally Identifiable Information (PII) detection in French, German, and Italian.

All Apache 2.0 licensed. Free for commercial use. No restrictions.

Performance:

- French: 97.97% F1 (top model)
- German: 97.61% F1 (top model)
- Italian: 97.28% F1 (top model)

All top-10 models per language exceed 96% F1

Coverage:

55+ PII entity types per language
Native ID formats: NSS (French), Sozialversicherungsnummer (German), Codice Fiscale (Italian)
Language-specific address, phone, and name patterns

Training Data:

French: 49,580 samples
German: 42,250 samples
Italian: 40,944 samples

Why Multilingual?

European healthcare operates in European languages. Clinical notes, patient records, and medical documents are generated in French, German, Italian, and other languages.

Effective de-identification requires:

- Native language understanding — not translation
- Local ID format recognition — each country has unique patterns
- Cultural context awareness — names, addresses, and formats vary
- These models deliver production-ready accuracy without requiring data to leave your infrastructure or language.

HIPAA & GDPR Compliance
Built for US and European privacy regulations:

- On-premise deployment: Process data locally with zero external dependencies
- Data sovereignty: No API calls, no cloud services, no cross-border transfers
- Air-gapped capable: Deploy in fully isolated environments if required
- Regulatory-grade accuracy: Supporting Expert Determination standards
- HIPAA and GDPR compliance across languages, without compliance gaps.

Use Cases
- Hospital EHR systems: Automated patient record de-identification
- Clinical research: Multilingual dataset preparation for studies
- Insurance companies: Claims processing across

https://huggingface.co/collections/OpenMed/multilingual-pii-and-de-identification

1 reply