DIBT-Russian (Data is Better Together - Russian Language Team)

ZennyKenny

posted an update 3 days ago

Post

226

Has anyone tried Strawberry Browser? https://strawberrybrowser.com/?ref_id=8D41NQCY7

😇 Shamelessly sharing my referral link here to move up in the waitlist line. Help me out, give it a click.

2 replies


ZennyKenny

posted an update 11 days ago

Post

2145

Did Hugging Face just ban hammer a bunch of bot accounts or am I just so uninteresting that 30% of my subs dropped me overnight?

😬 Wait, don't answer that.

2 replies


ZennyKenny

authored a paper 12 days ago

BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution

Paper • 2510.08697 • Published 17 days ago • 32

ZennyKenny

posted an update 13 days ago

Post

207

🔥 BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution from bigcode is now available on Hugging Face!

👉 Check out the paper and please drop an upvote if you like the work BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution (2510.08697)

ZennyKenny

posted an update 23 days ago

Post

1234

🥊 Big Code Arena is live! bigcode/arena

💡 bigcode is an open scientific collaboration working on responsible training of large language models for coding applications.

👉 The Arena ranks LLMs in a competitive format on their ability to support natural language vibe coding requests, using feedback from human reviewers.

🧠 It was a pleasure to contribute to this project led by @terryyz and appear as an additional contributor in the Big Code Arena paper.

ZennyKenny

posted an update 28 days ago

Post

8886

🖤 Probably one of my favorite projects that I've worked on so far, introducing Новояз (Novoyaz).

🛠 One of the first acts of the Bolshevik government after the Russian Revolution was the reform and standardization of the Russian language, which at the time had a non-standard and challenging orthography.

📚 Following the reform, the government launched a nationwide campaign called Ликбез (Likbez), which sought to improve literacy in the country (by the way, it worked, bringing the national literacy rate from <20% in the 1920s to >80% by the 1930s).

‼ While this is a remarkable result that should absolutely be celebrated, it's one that has left behind literally hundreds of thousands if not millions of artifacts using pre-reform Russian orthography.

😓 Researchers and historians are working tirelessly to translate these artifacts to modern Russian so that they may be archived and studied, but many have told me that they are doing this BY HAND (!).

💡 I thought, well, this is a perfect use case for OCR and a fine-tuned LLM to step in and aid this important work!

🌏 Introducing НОВОЯЗ (NOVOYAZ)! Powered by ChatDOC/OCRFlux-3B and ZennyKenny/oss-20b-prereform-to-modern-ru-merged, researchers can now convert images of their pre-reform documents to modern Russian orthography using the power of open-source AI!
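For a feel of what "pre-reform to modern orthography" means at the letter level, here is a minimal rule-based sketch. To be clear, this is not how Novoyaz works (the Space uses OCR plus a fine-tuned LLM); it is a simplified stand-in that only handles the abolished letters and the word-final hard sign, and it ignores the reform's changes to endings and prefixes. The function name `modernize` is made up for illustration.

```python
import re

# Letters dropped from the alphabet by the reform, mapped to their modern
# replacements. Escapes are used to avoid look-alike Latin characters.
LETTER_MAP = str.maketrans({
    "\u0463": "е", "\u0462": "Е",  # yat
    "\u0456": "и", "\u0406": "И",  # decimal i
    "\u0473": "ф", "\u0472": "Ф",  # fita
    "\u0475": "и", "\u0474": "И",  # izhitsa (already rare before the reform)
})

# A hard sign at the very end of a word was dropped; a hard sign inside a
# word (as in "объявление") is kept.
FINAL_HARD_SIGN = re.compile(r"[ъЪ]\b")


def modernize(text):
    """Apply letter-level rules only; a simplification, not a full converter."""
    text = FINAL_HARD_SIGN.sub("", text)
    return text.translate(LETTER_MAP)
```

Rules like these break down quickly on real documents (noisy OCR, grammatical endings, ambiguous cases), which is exactly where a fine-tuned model earns its keep.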

Check it out and drop a like to support more real-world use cases for open source AI outside of traditional tech-centric domains!

ZennyKenny/Novoyaz

ZennyKenny

posted an update 30 days ago

Post

555

🔒 Like a lot of other AI builders, I have some anxiety about the surveillance-capitalist paradigm emerging in the AI space.

👉 Of course, this kind of thing isn't completely new and has been going on for decades, but the difference is the stronger immersion of AI tools into our daily lives (compared to something like a search engine or social network).

❕ That's why I was really excited to come across Lumo: https://lumo.proton.me/u/1/

❕ Lumo is created by ProtonPrivacy and offers privacy-first features that make sure that what you do with your AI assistant is your business.

❕ I already trust Proton with my other business apps and I've never been disappointed, plus the Lumo architecture is really fantastic, dynamically routing each query to the most appropriate model for the request.

🔥 Really awesome stuff Proton, thank you as always.

ZennyKenny

posted an update about 1 month ago

Post

2376

The reactions to mostlyai/synthetic-sdk-demo have been incredible! 🔥

Some users wrote that they were having performance issues on larger datasets, so I've capped the Space's input to 5000 rows and 10 columns, but you can always use the open source SDK that powers the space any time you want on datasets of arbitrary size and shape!
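A cap like this could be sketched as follows. This is not the Space's actual code: it assumes "capped" means truncating the upload rather than rejecting it, and `cap_table` is a hypothetical helper name. Only the two limits (5000 rows, 10 columns) come from the post.

```python
import csv
import io

MAX_ROWS = 5000    # row limit mentioned in the post
MAX_COLUMNS = 10   # column limit mentioned in the post


def cap_table(csv_text, max_rows=MAX_ROWS, max_columns=MAX_COLUMNS):
    """Return (header, rows), truncated to the given limits.

    Hypothetical helper: keeps the first `max_columns` columns and the
    first `max_rows` data rows, and drops everything else.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])[:max_columns]
    rows = []
    for row in reader:
        if len(rows) >= max_rows:
            break
        rows.append(row[:max_columns])
    return header, rows
```

Anything larger than the cap is a job for the SDK itself, run locally.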

Check it out: https://github.com/mostly-ai/mostlyai 👈

ZennyKenny

posted an update about 1 month ago

Post

2636

The open source Synthetic Data SDK from MOSTLY AI (mostlyai) offers the ability to generate realistic, privacy-safe synthetic data with just a few lines of Python.

Try it out yourself in a No Code UI in the SDK Demo Space: mostlyai/synthetic-sdk-demo

ZennyKenny

posted an update 2 months ago

Post

2595

It's just a matter of time before all the data leakage and data scraping associated with building, training, and using AI results in some kind of major scandal.

That's why I think this paper by @spintronic is so important: Privacy-Preserving Tabular Synthetic Data Generation Using TabularARGN (2508.06647)

Glad to know that there are already researchers looking to mitigate and address this risk before the s**t hits the fan.

2 replies


dvilasuero

posted an update 5 months ago

Post

2993

Super excited to launch Hugging Face Sheets: Spreadsheets meet AI and unstructured data.

A few months ago, we started imagining new ways to build and transform datasets with the latest open-source models.

Today, I'm thrilled to introduce our first step in this direction.

In a nutshell:

📁 Effortlessly run prompts and models over your data.
🌐 Agentic search for accuracy and real-time information.
🖼️ Familiar, minimalistic interface for interacting with data.
🎯 Human feedback 2.0: Your input directly improves generated data.
💯 Access hundreds of open models and leading inference providers.

Go to this space to try it out!

aisheets/sheets

Leave your questions below, we're just getting started!

3 replies


ZennyKenny

posted an update 6 months ago

Post

952

Community! 💡💡💡

It's the last day to submit your datasets for the Reasoning Datasets Competition: https://www.bespokelabs.ai/blog/reasoning-datasets-competition

Here are my submissions:
- ZennyKenny/synthetic_vc_financial_decisions_reasoning_dataset
- ZennyKenny/cosa-benchmark-dataset
- ZennyKenny/tactical-military-reasoning-v.1.0
- ZennyKenny/tron-dataset-v.1.0

Have a look and drop a ❤️ or comment! Check out the entire collection of submissions here: https://huggingface.co/datasets?other=reasoning-datasets-competition

ZennyKenny

posted an update 6 months ago

Post

3152

After hearing the news that Marc Andreessen thinks that the only job that is safe from AI replacement is venture capital: https://gizmodo.com/marc-andreessen-says-one-job-is-mostly-safe-from-ai-venture-capitalist-2000596506 🧠🧠🧠

The Reasoned Capital synthetic dataset suddenly feels much more topical: ZennyKenny/synthetic_vc_financial_decisions_reasoning_dataset 🔥🔥🔥

Really looking forward to potentially expanding this architecture and seeing how algorithmic clever investing truly is! 💰💰💰

ZennyKenny

posted an update 6 months ago

Post

3382

When I heard the Reasoning Datasets Competition deadline was extended to 9 May, I knew I had time to get in one more entry. 🔥🔥🔥

With the rise of Vibe Coding, and the potential risks that are introduced by humans letting LLMs build their apps for them, lots of people are (rightfully) concerned about the safety of the code that is hitting prod.

In response to that, I'm happy to present my final submission to the Reasoning Datasets Competition, an attempt to start benchmarking the ability of LLMs to identify unsafe and / or exploitable code by way of the CoSa (Code Safety) benchmark: ZennyKenny/cosa-benchmark-dataset

Currently a curated set of 200 examples, calibrated on OpenAI's standard issue models (GPT-4.1, o4 mini, and GPT-3.5 Turbo) as "baseline performance" (70% decile). Check it out and drop a ❤️ if you think it could be useful or hit the Community section with suggestions / critiques.

3 replies


ZennyKenny

posted an update 6 months ago

Post

1383

The same way the advent of Adobe Illustrator has led to innovation in the way that creative professionals work, I earnestly believe that AI will do the same (contrary to the popular opinion that it represents some regression in the world of creatives).

@natalika and I were speaking about this topic and like most illustrators she has some understandable concerns about the spread of AI in her field. She also told me how much time she spends generating concept art that will never see the light of day in >98% of cases. 💡

To me, that sounded like a perfect opportunity to leverage image diffusion in a way that helps artists spend more time creating cool stuff rather than just malevolently mining their work and using it without credit. Using the Black Forest Labs base model FLUX, Replicate, and about $5 of H100 compute, I post-trained a LoRA adapter on a set of her images associated with one project she's working on and spun up an app with Hugging Face Spaces (and Zero GPU for the win).

I give you, Natalie Diffusion: ZennyKenny/natalie-diffusion

Now, generating concept art in her particular style takes seconds instead of hours and when it's time to put the work into production, a human designer is still invaluable. And building it in the open hopefully inspires other use cases amongst other designers. 🖖

2 replies


ZennyKenny

posted an update 6 months ago

Post

2745

I've created a new dataset using the Algorithm of Thoughts architecture proposed by Sel et al. (2023) in a reasoning context. (paper: https://arxiv.org/pdf/2308.10379)

The dataset simulates the discovery phase of a fictitious VC firm called Reasoned Capital and, once expanded, can be used to create models which are able to make complex, subjective financial decisions based on different criteria.

The generation process encourages recursive problem-solving in increasingly complex prompts to encourage models to assess and reevaluate the conclusions and generated opinions of upstream models. Pretty neat stuff, and I'm not aware of this architecture being used in a reasoning context anywhere else.

Check it out: ZennyKenny/synthetic_vc_financial_decisions_reasoning_dataset

ZennyKenny

posted an update 6 months ago

Post

575

Phew, maybe a little dark, but I've submitted my second dataset to the Reasoning Datasets Competition: ZennyKenny/tactical-military-reasoning-v.1.0

I'd be interested to hear the community's thoughts on the applications of AI in the military. Especially in the wargaming space.

This is something that feels inevitable (and realistically, probably already in progress). Doesn't it make sense for us to have an understanding of the mechanics of such processes? Surely they will never be open source.

9 replies


ZennyKenny

posted an update 6 months ago

Post

1449

Submitted my first dataset for the Reasoning Datasets Competition! https://huggingface.co/datasets/ZennyKenny/TRON-dataset-v.1.0

This dataset is designed to post-train Metareasoning agents, i.e. agents whose job is to quickly (and importantly, cheaply) reason through whether it makes sense to launch a full reasoning job or just use a simple completions job.
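Roughly, the routing idea looks like the sketch below. Everything in it is an assumption for illustration: `keyword_judge` is a toy stand-in for a post-trained Metareasoning model, and the two job callables are placeholders for whatever reasoning and completion backends you use. None of these names come from the dataset or its notebook.

```python
def keyword_judge(prompt):
    """Toy stand-in for a Metareasoning agent (hypothetical).

    A real agent would be a small post-trained model; this one just
    looks for a few markers that hint at multi-step work.
    """
    hard_markers = ("prove", "step by step", "plan", "why")
    text = prompt.lower()
    return "reason" if any(marker in text for marker in hard_markers) else "complete"


def route(prompt, judge, reasoning_job, completion_job):
    """Ask the cheap judge first, then dispatch to the chosen job."""
    decision = judge(prompt)
    job = reasoning_job if decision == "reason" else completion_job
    return decision, job(prompt)
```

The point is that the judge has to be much cheaper than the reasoning job it gates, otherwise the routing step costs more than it saves.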

There's still plenty of time to join the competition! https://www.bespokelabs.ai/blog/reasoning-datasets-competition

The generation notebook (linked in the dataset) is open source and pretty well generalized, if I do say so myself, so you can use it to make your own Metareasoning datasets.

Shoutout to @onekq for his inspiring comment on this topic.

ZennyKenny

posted an update 7 months ago

Post

2792

Just signed up for the Reasoning Datasets Competition from Hugging Face, Together AI, and Bespoke Labs!

Looking forward to seeing what the community comes up with to help train better reasoning models.

Join the fray: https://www.bespokelabs.ai/blog/reasoning-datasets-competition

4 replies


ZennyKenny

posted an update 7 months ago

Post

2140

A few new Russian-language synthetic datasets. The labelling is good, but some of the syntax and grammar is not great.

Great for Russian-language classification models, probably not great for fine-tuning Russian-language text generation.

- Virtual Assistant Query / Responses: ZennyKenny/ru_virtual_assistant_chatgpt_distill
- LLM Query / Responses: ZennyKenny/russian_llm_response_chatgpt_distill

Crazy how much language drift is still an issue, especially given that Russian constitutes nearly 5% of the content on the internet.

Team members 4

DIBT-Russian's activity