lmsys.org (@lmsysorg) Twitter Tweets • TwiCopy

lmsys.org

16 hours ago

It also demonstrates strong performance via community vibe-check. Check out cool gpt2-chatbot demos here.
x.com/minchoi/status…

lmsys.org

16 hours ago

Significantly higher win-rate against all other models.
e.g., ~80% win-rate vs GPT-4 (June) in non-tie battles.
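
A minimal sketch of how a non-tie win rate like the one above could be computed, assuming each battle is stored as a winner label or "tie". The function name, labels, and counts are hypothetical, made up for illustration; this is not Arena data or the Arena's code.

```python
from collections import Counter


def non_tie_win_rate(outcomes, model):
    """Fraction of decisive (non-tie) battles won by `model`."""
    counts = Counter(outcomes)
    # Ties are dropped from the denominator, as in "non-tie battles".
    decisive = sum(n for label, n in counts.items() if label != "tie")
    if decisive == 0:
        raise ValueError("no non-tie battles to score")
    return counts[model] / decisive


# Hypothetical outcomes: 8 wins, 2 losses, 5 ties.
battles = ["model_a"] * 8 + ["model_b"] * 2 + ["tie"] * 5
rate = non_tie_win_rate(battles, "model_a")
print(f"non-tie win rate: {rate:.0%}")
```

Dropping ties makes the figure easier to read as a head-to-head rate, but it hides how often voters saw no difference, so the tie count is worth reporting alongside it.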

lmsys.org

16 hours ago

In the more challenging Coding Arena, we see an even bigger gap (~100 Elo)!
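
To put Elo gaps of this size in perspective, here is a small sketch that converts a rating gap into an expected head-to-head win probability and back. It assumes the conventional Elo parameterization (base 10, scale 400) and ignores ties; whether the Arena leaderboard uses exactly these constants is an assumption here, and the function names are hypothetical.

```python
import math


def expected_win_probability(elo_gap: float) -> float:
    """Expected score of the higher-rated model under the standard Elo curve."""
    return 1.0 / (1.0 + 10.0 ** (-elo_gap / 400.0))


def elo_gap_from_win_rate(win_rate: float) -> float:
    """Inverse of the curve above: rating gap implied by a non-tie win rate."""
    if not 0.0 < win_rate < 1.0:
        raise ValueError("win_rate must be strictly between 0 and 1")
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))


p50 = expected_win_probability(50)    # about 0.571
p100 = expected_win_probability(100)  # about 0.640
gap80 = elo_gap_from_win_rate(0.80)   # about 241 Elo points
print(f"{p50:.3f} {p100:.3f} {gap80:.0f}")
```

Under these assumptions, a gap of roughly 100 Elo corresponds to winning about 64% of decisive battles, and a gap of roughly 50 Elo to about 57%.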

lmsys.org

16 hours ago

Confidence-interval chart: a huge gap over the previous top-5 models.

Breaking news: the gpt2-chatbots results are now out!

The gpt2-chatbots have just surged to the top, surpassing all other models by a significant gap (~50 Elo). They are now the strongest models ever in the Arena!

With improvement across all boards, especially reasoning & coding…

William Fedus

@LiamFedus

19 hours ago

GPT-4o is our new state-of-the-art frontier model. We’ve been testing a version on the LMSys arena as im-also-a-good-gpt2-chatbot 🙂. Here’s how it’s been doing.

lmsys.org

5 days ago

4. Qualitatively, we also find Llama 3’s outputs are friendlier and more conversational than other models. These traits appear more often in battles that Llama 3 wins.

Llama 3 also really loves exclamations!

lmsys.org

5 days ago

3. Deduplication or outliers do not significantly affect the win rate.

We also sanity-check votes and prompts so that no user is over-represented. The results show no change in Llama 3's win rate before and after.
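
One way such a before/after check could look, as a rough sketch only: drop repeated prompts, cap the number of votes kept per user, and compare the win rate on the raw and filtered sets. The record fields, the cap of 2, the prompt normalization, and the toy data are all assumptions made for illustration; the tweet does not describe the actual procedure.

```python
from collections import defaultdict


def win_rate(battles, model):
    """Non-tie win rate of `model` over a list of battle records."""
    decisive = [b for b in battles if b["winner"] != "tie"]
    if not decisive:
        return float("nan")
    return sum(b["winner"] == model for b in decisive) / len(decisive)


def dedup_and_cap(battles, max_per_user=2):
    """Keep the first battle per normalized prompt, at most `max_per_user` per user."""
    seen_prompts = set()
    kept_per_user = defaultdict(int)
    kept = []
    for b in battles:
        key = b["prompt"].strip().lower()  # crude normalization (assumption)
        if key in seen_prompts or kept_per_user[b["user"]] >= max_per_user:
            continue
        seen_prompts.add(key)
        kept_per_user[b["user"]] += 1
        kept.append(b)
    return kept


# Hypothetical toy votes, not Arena data.
votes = [
    {"user": "u1", "prompt": "Hello", "winner": "llama"},
    {"user": "u1", "prompt": "hello ", "winner": "llama"},       # duplicate prompt
    {"user": "u1", "prompt": "Sort a list", "winner": "other"},
    {"user": "u1", "prompt": "Explain TCP", "winner": "llama"},  # over the cap
    {"user": "u2", "prompt": "Write a poem", "winner": "llama"},
    {"user": "u3", "prompt": "Prove it", "winner": "tie"},
]
filtered = dedup_and_cap(votes)
before = win_rate(votes, "llama")
after = win_rate(filtered, "llama")
print(f"before={before:.2f} after={after:.2f} kept={len(filtered)}/{len(votes)}")
```

On a toy set this small the two rates differ noticeably; the claim in the tweet is that on the real vote data the difference is negligible.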

lmsys.org

5 days ago

(Cont'd) We show Llama 3-70b-Instruct's win rate conditioned on hierarchical criteria subsets. Some criteria separate the model's strengths and weaknesses.

lmsys.org

5 days ago

2. As prompts get more challenging*, the gap between Llama 3 and top-tier models becomes larger.

* We define challenging using several criteria like complexity, problem-solving, domain knowledge, and more.

lmsys.org