lmsys.org (@lmsysorg)'s Twitter Profile
lmsys.org

@lmsysorg

Large Model Systems Organization. We created Vicuna and Chatbot Arena! Compare 30+ LLMs (GPT-4/Claude/Llamas) side-by-side at https://t.co/IDFeIDIOtm

ID:1641378826537295874

Link: http://lmsys.org
Joined: 30-03-2023 09:56:38

435 Tweets

40.8K Followers

173 Following

lmsys.org (@lmsysorg)

We hope you all have fun with it! Finally, its public version 'gpt-4o' is now up in the Arena. Come chat & vote at chat.lmsys.org!

Learn more in GPT-4o release blog post
openai.com/index/hello-gp…

lmsys.org (@lmsysorg)

It also demonstrates strong performance via community vibe-check. Check out cool gpt2-chatbot demos here.
x.com/minchoi/status…

lmsys.org (@lmsysorg)

Significantly higher win-rate against all other models.
e.g., ~80% win-rate vs GPT-4 (June) in non-tie battles.
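
For readers unfamiliar with the metric, a win rate "in non-tie battles" drops tied votes before dividing. The snippet below is a minimal sketch of that arithmetic, not the Arena's actual pipeline; the outcome labels `"a"`, `"b"`, and `"tie"` and the function name are hypothetical, and the vote counts are made up for illustration.

```python
from collections import Counter


def non_tie_win_rate(outcomes, model="a"):
    """Share of decided (non-tie) battles won by `model`.

    `outcomes` is an iterable of hypothetical labels: "a", "b", or "tie".
    Returns None when every battle was a tie.
    """
    counts = Counter(outcomes)
    decided = counts["a"] + counts["b"]  # ties are excluded from the denominator
    if decided == 0:
        return None
    return counts[model] / decided


# Illustrative data only: 8 wins, 2 losses, 5 ties for model "a".
battles = ["a"] * 8 + ["b"] * 2 + ["tie"] * 5
print(non_tie_win_rate(battles))
```

With these made-up counts, the five ties are ignored and the rate is computed over the ten decided battles.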

lmsys.org (@lmsysorg)

Breaking news — gpt2-chatbots result is now out!

gpt2-chatbots have just surged to the top, surpassing all the models by a significant gap (~50 Elo). It has become the strongest model ever in the Arena!

With improvement across all boards, especially reasoning & coding…
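
To put a ~50 Elo gap in perspective, the standard Elo expected-score formula (base 10, scale 400) can convert a rating difference into an expected head-to-head win probability. This is a rough sketch under the assumption that the leaderboard ratings follow that conventional scale; the function name is hypothetical.

```python
def expected_win_probability(rating_gap, scale=400.0):
    """Standard Elo expected score for the higher-rated player.

    `rating_gap` is (rating of A) - (rating of B); the base-10 / 400-point
    scale is the conventional Elo choice and is assumed here.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


# A 50-point gap corresponds to roughly a 57% expected win probability.
print(round(expected_win_probability(50), 3))
```

Under this assumption, a 50-point lead means winning a bit more than half of direct matchups, while a gap of 0 gives exactly 0.5.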

William Fedus (@LiamFedus)

GPT-4o is our new state-of-the-art frontier model. We’ve been testing a version on the LMSys arena as im-also-a-good-gpt2-chatbot 🙂. Here’s how it’s been doing.

lmsys.org (@lmsysorg)

4. Qualitatively, we also find Llama 3’s outputs are friendlier and more conversational than other models. These traits appear more often in battles that Llama 3 wins.

Llama 3 also really loves exclamations!

lmsys.org (@lmsysorg)

3. Deduplication or outliers do not significantly affect the win rate.

We also sanity-check votes and prompts to avoid certain users being over-represented. Results show no change in Llama 3's win rate before/after.

lmsys.org (@lmsysorg)

(Cont'd) We show Llama 3-70b-Instruct's win rate conditioned on hierarchical criteria subsets. Some criteria separate the model's strengths and weaknesses.

lmsys.org (@lmsysorg)

2. As prompts get more challenging*, the gap between Llama 3 and top-tier models becomes larger.

* We define challenging using several criteria like complexity, problem-solving, domain knowledge, and more.

lmsys.org (@lmsysorg)

Exciting new blog -- What’s up with Llama-3?

Since Llama 3’s release, it has quickly jumped to the top of the leaderboard. We dive into our data and answer the questions below:

- What are users asking? When do users prefer Llama 3?
- How challenging are the prompts?
- Are certain users…
