Yacine Jernite (@YJernite)'s Twitter Profile
Yacine Jernite

@YJernite

ML & Society lead @huggingface, NLPer at heart, focusing on data and ML systems governance these days
he/him
#BlackLivesMatter

ID: 1218217239213723648

Joined: 17-01-2020 17:03:30

3.0K Tweets

3.6K Followers

1.4K Following

EleutherAI(@AiEleuther) 's Twitter Profile Photo

An essential blocker to training LLMs on public domain books is not knowing which books are in the public domain. We're working on it, but it's slow and costly... if you're interested in providing support reach out!

Hugo Laurençon(@HugoLaurencon) 's Twitter Profile Photo

The Cauldron is a massive collection of 50 high-quality datasets, all converted to the user/assistant format, and ready to use to fine-tune any Vision Language Model.

It covers a wide range of tasks detailed below⬇️

huggingface.co/datasets/Huggi…
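As a minimal sketch of the "user/assistant format" the tweet describes, here is one way a raw question/answer record could be wrapped into chat turns. This is illustrative only: the field names ("question", "answer") and the example record are assumptions, not taken from The Cauldron's actual schema.

```python
# Hypothetical sketch: wrap a raw QA record as user/assistant chat turns.
# Field names "question" and "answer" are illustrative assumptions.

def to_chat_format(record):
    """Convert a question/answer record into a list of chat messages."""
    return [
        {"role": "user", "content": record["question"]},
        {"role": "assistant", "content": record["answer"]},
    ]

# Example usage with a made-up record:
example = {"question": "What is shown in the image?", "answer": "A red bus."}
messages = to_chat_format(example)
```

Datasets stored in this shape can be fed directly to chat-style fine-tuning pipelines without per-task preprocessing.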

Stella Biderman(@BlancheMinerva) 's Twitter Profile Photo

Training data transparency is an unambiguous win for society, but all the incentives are against companies doing it right now. We need to fix this as soon as possible.

Yacine Jernite(@YJernite) 's Twitter Profile Photo

Grateful for the folks within companies that are training LLMs who keep pushing for training data transparency despite pushback (you know who you are)🤗

Nick Vincent(@nickmvincent) 's Twitter Profile Photo

So one can use the model to train other AI systems. And updating your name seems like a small price to pay (and it is a form of attribution).

But this becomes pretty funny pretty quickly if other open models adopt the same rule.

Jesse Dodge(@JesseDodge) 's Twitter Profile Photo

Today Meta released Llama 3! Congrats to the team.

In their blog post they wrote that 'the curation of a large, high-quality training dataset is paramount', while providing almost no information about how it was made, how it was filtered, or its contents.

Nik Marda(@nrmarda) 's Twitter Profile Photo

A really fantastic job opportunity at OMB just opened up today — this person would be central to the White House's oversight of federal agencies' use of AI: usajobs.gov/job/786286500

Alexander Doria(@Dorialexander) 's Twitter Profile Photo

We also aim to define good practices regarding the use of freely licensed content: we provide provenance metadata and credits to the original authors. In accordance with the philosophy of Creative Commons, this data should preferably be used for training open, reproducible models.

Yixin Wan(@yixin_wan_) 's Twitter Profile Photo

How to identify bias in language agency? E.g., in texts describing White men as "leading" & Black women as "helping"?🧐
🔎String matching? ❌NO!
🔎Sentiment classifier? ❌No!
✅Our agency classifier CAN! It reveals gender, racial, and intersectional bias🤯
🔗:
arxiv.org/abs/2404.10508

Nathan Lambert(@natolambert) 's Twitter Profile Photo

Here's OLMo 1.7-7B

We figured out how to fix the MMLU score for the first OLMo 7B model when training the bigger one, so we got you OLMo 1.7-7B.

Better data (Dolma 1.7) + staged training = 24 point increase.

Oh it has 2x the context length too (4096)

huggingface.co/allenai/OLMo-1…

MMitchell(@mmitchell_ai) 's Twitter Profile Photo

Spent a good chunk of today reading through Rest of World's AI elections tracker, which documents how AI is influencing elections. Very good use of my time. Recommend.

Kyle Lo(@kylelostat) 's Twitter Profile Photo

notable stuff:
🦉ton of perf boost from mixing instruct data at end (e.g., flan)
🐋anneal learning rate (Fig 9b in arxiv.org/abs/2403.08763)
🐞changing data mix boosts MMLU at some cost to other evals

🍇huggingface.co/allenai/dolma
🧀huggingface.co/allenai/OLMo-1…
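One of the techniques listed above, annealing the learning rate near the end of training, can be sketched as a simple schedule: hold a peak rate, then decay linearly to zero over the final steps. The schedule shape and constants below are illustrative assumptions, not the exact OLMo or Dolma recipe.

```python
# Hypothetical sketch of a linear learning-rate anneal over the final
# steps of training. Constants and shape are assumptions for illustration.

def annealed_lr(step, total_steps, peak_lr, anneal_start):
    """Hold peak_lr until anneal_start, then decay linearly to 0 at total_steps."""
    if step < anneal_start:
        return peak_lr
    frac = (step - anneal_start) / (total_steps - anneal_start)
    return peak_lr * (1.0 - frac)

# Example: peak LR of 3e-4, annealed over the last 20% of 100 steps.
lr_mid_anneal = annealed_lr(90, 100, 3e-4, 80)  # halfway through the anneal
```

In practice the same idea is often paired with a data-mix change (e.g., mixing in instruct data) during the annealed phase, as the thread notes.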

Deb Raji(@rajiinio) 's Twitter Profile Photo

I'm glad someone said this -- I've always found it strange that there are malicious actors actively out there, intentionally producing and disseminating harmful/deceptive content only for the policy response to be 'victims should have known better' 🫤

merve(@mervenoyann) 's Twitter Profile Photo

I see you all send your documents to closed-source APIs, this is not ok 👎
As a person who has seen many open-source document models, I am amazed by what IDEFICS2 has done with document understanding 🤯🤩
Please use it! It has an Apache 2.0 license ❤️

Alex Hanna (اليكس حنٌا)(@alexhanna) 's Twitter Profile Photo

With IDF's 'Lavender' and its use in the violence in Gaza, the point is not the technology or its errors, but plausible deniability and ideologies of scale. New for the newsletter.

buttondown.email/maiht3k/archiv…

EleutherAI(@AiEleuther) 's Twitter Profile Photo

We are excited to see torchtune, a newly announced PyTorch-native finetuning library, integrate with our LM Evaluation Harness library for standardized, reproducible evaluations!

Read more here:
Blog: pytorch.org/blog/torchtune…
Thread:
