Yacine Jernite (@YJernite)'s Twitter Profile
Yacine Jernite

@YJernite

ML & Society lead @huggingface, NLPer at heart, focusing on data and ML systems governance these days
he/him
#BlackLivesMatter

ID: 1218217239213723648

Joined: 17-01-2020 17:03:30

3.0K Tweets

3.6K Followers

1.4K Following

EleutherAI(@AiEleuther) 's Twitter Profile Photo

An essential blocker to training LLMs on public domain books is not knowing which books are in the public domain. We're working on it, but it's slow and costly... if you're interested in providing support reach out!

Hugo Laurençon(@HugoLaurencon) 's Twitter Profile Photo

The Cauldron is a massive collection of 50 high-quality datasets, all converted to the user/assistant format, and ready to use to fine-tune any Vision Language Model.

It covers a wide range of tasks detailed below⬇️

huggingface.co/datasets/Huggi…
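As a minimal sketch of the "user/assistant format" the tweet describes, here is one way a raw question/answer record could be wrapped into chat turns. This is illustrative only: the field names ("question", "answer") and the example record are assumptions, not taken from The Cauldron's actual schema.

```python
# Hypothetical sketch: wrap a raw QA record as user/assistant chat turns.
# Field names "question" and "answer" are illustrative assumptions.

def to_chat_format(record):
    """Convert a question/answer record into a list of chat messages."""
    return [
        {"role": "user", "content": record["question"]},
        {"role": "assistant", "content": record["answer"]},
    ]

# Example usage with a made-up record:
example = {"question": "What is shown in the image?", "answer": "A red bus."}
messages = to_chat_format(example)
```

Datasets stored in this shape can be fed directly to chat-style fine-tuning pipelines without per-task preprocessing.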

Stella Biderman(@BlancheMinerva) 's Twitter Profile Photo

Training data transparency is an unambiguous win for society, but all the incentives are against companies doing it right now. We need to fix this as soon as possible.

Yacine Jernite(@YJernite) 's Twitter Profile Photo

Grateful for the folks within companies that are training LLMs who keep pushing for training data transparency despite pushback (you know who you are)🤗

Nick Vincent(@nickmvincent) 's Twitter Profile Photo

So one can use the model to train other AI systems. And updating your name seems like a small price to pay (and it is a form of attribution).

But this becomes pretty funny pretty quickly if other open models adopt the same rule.

Jesse Dodge(@JesseDodge) 's Twitter Profile Photo

Today Meta released Llama 3! Congrats to the team.

In their blog post they wrote that 'the curation of a large, high-quality training dataset is paramount', while providing almost no information about how it was made, how it was filtered, or its contents.

Nik Marda(@nrmarda) 's Twitter Profile Photo

A really fantastic job opportunity at OMB just opened up today — this person would be central to the White House's oversight of federal agencies' use of AI: usajobs.gov/job/786286500

Alexander Doria(@Dorialexander) 's Twitter Profile Photo

We also aim to define good practices regarding the use of freely licensed content: we provide provenance metadata and credits to the original authors. In accordance with the philosophy of Creative Commons, this data should preferably be used for training open, reproducible models.

Yixin Wan(@yixin_wan_) 's Twitter Profile Photo

How to identify bias in language agency? E.g., in texts describing White men as "leading" & Black women as "helping"?🧐
🔎String matching? ❌NO!
🔎Sentiment classifier? ❌No!
✅Our agency classifier CAN! It reveals gender, racial, and intersectional bias🤯
🔗:
arxiv.org/abs/2404.10508

Nathan Lambert(@natolambert) 's Twitter Profile Photo

Here's OLMo 1.7-7B

We figured out how to fix the MMLU score for the first OLMo 7B model when training the bigger one, so we got you OLMo 1.7-7B.

Better data (Dolma 1.7) + staged training = 24 point increase.

Oh it has 2x the context length too (4096)

huggingface.co/allenai/OLMo-1…

MMitchell(@mmitchell_ai) 's Twitter Profile Photo

Spent a good chunk of today reading through Rest of World's AI elections tracker, which documents how AI is influencing elections. Very good use of my time. Recommend.

Kyle Lo(@kylelostat) 's Twitter Profile Photo

notable stuff:
🦉ton of perf boost from mixing instruct data at end (e.g., flan)
🐋anneal learning rate (Fig 9b in arxiv.org/abs/2403.08763)
🐞changing data mix boosts MMLU at some cost to other evals

🍇huggingface.co/allenai/dolma
🧀huggingface.co/allenai/OLMo-1…
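One of the techniques listed above, annealing the learning rate near the end of training, can be sketched as a simple schedule: hold a peak rate, then decay linearly to zero over the final steps. The schedule shape and constants below are illustrative assumptions, not the exact OLMo or Dolma recipe.

```python
# Hypothetical sketch of a linear learning-rate anneal over the final
# steps of training. Constants and shape are assumptions for illustration.

def annealed_lr(step, total_steps, peak_lr, anneal_start):
    """Hold peak_lr until anneal_start, then decay linearly to 0 at total_steps."""
    if step < anneal_start:
        return peak_lr
    frac = (step - anneal_start) / (total_steps - anneal_start)
    return peak_lr * (1.0 - frac)

# Example: peak LR of 3e-4, annealed over the last 20% of 100 steps.
lr_mid_anneal = annealed_lr(90, 100, 3e-4, 80)  # halfway through the anneal
```

In practice the same idea is often paired with a data-mix change (e.g., mixing in instruct data) during the annealed phase, as the thread notes.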

Deb Raji(@rajiinio) 's Twitter Profile Photo

I'm glad someone said this -- I've always found it strange that there are malicious actors actively out there, intentionally producing and disseminating harmful/deceptive content only for the policy response to be 'victims should have known better' 🫤

merve(@mervenoyann) 's Twitter Profile Photo

I see you all send your documents to closed-source APIs, this is not ok 👎
As a person who has seen many open-source document models, I am amazed by what IDEFICS2 has done with document understanding 🤯🤩
Please use it! It has an Apache 2.0 license ❤️

Alex Hanna (اليكس حنٌا)(@alexhanna) 's Twitter Profile Photo

With IDF's 'Lavender' and its use in the violence in Gaza, the point is not the technology or its errors, but plausible deniability and ideologies of scale. New for the newsletter.

buttondown.email/maiht3k/archiv…

EleutherAI(@AiEleuther) 's Twitter Profile Photo

We are excited to see torchtune, a newly announced PyTorch-native finetuning library, integrate with our LM Evaluation Harness library for standardized, reproducible evaluations!

Read more here:
Blog: pytorch.org/blog/torchtune…
Thread:
