Miles Turpin (@milesaturpin)'s Twitter Profile
Miles Turpin

@milesaturpin

Language model alignment @nyuniversity

ID: 865609028579213312

Link: http://milesturp.in/about · Joined: 19-05-2017 16:44:09

365 Tweets

988 Followers

1.3K Following

david rein (@idavidrein)

Is GPQA garbage?

A couple weeks ago, @typedfemale pointed out some mistakes in a GPQA question, so I figured this would be a good opportunity to discuss how we interpret benchmark scores, and what our goals should be when creating benchmarks.

Jacob Pfau (@jacob_pfau)

Do models need to reason in words to benefit from chain-of-thought tokens?

In our experiments, the answer is no! Models can perform on par with CoT using repeated '...' filler tokens.
This raises alignment concerns: Using filler, LMs can do hidden reasoning not visible in CoT 🧵

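A minimal sketch of the contrast described in the tweet, not the paper's code: it assumes a hypothetical causal LM trained to exploit filler tokens (the paper trains models for this; off-the-shelf models are not expected to benefit), and uses "gpt2" purely as a placeholder model name.

```python
# Sketch only: a visible chain-of-thought prompt vs. a filler-token prompt.
# Assumes a model trained to use filler tokens; "gpt2" is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; not the paper's model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

question = "Q: A train travels 60 miles in 1.5 hours. What is its average speed?"

# Visible reasoning: intermediate tokens are human-readable.
cot_prompt = question + "\nA: Let's think step by step."

# Filler variant: intermediate tokens are meaningless '.' characters, yet
# per the paper they can still buy the model extra serial computation.
filler_prompt = question + "\nA: " + "." * 60 + " The answer is"

for prompt in (cot_prompt, filler_prompt):
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=30, do_sample=False)
    # Print only the newly generated tokens for each prompt style.
    print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:]))
```

The alignment worry in the tweet is exactly that the filler variant leaves no human-readable trace of whatever computation the model performs between question and answer.
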
Usman Anwar (@usmananwar391)

We released this new agenda on LLM safety yesterday. It is VERY comprehensive, covering 18 different challenges.

My co-authors have posted tweets for each of these challenges. I am going to collect them all here!

P.S. this is also now on arxiv: arxiv.org/abs/2404.09932

Davis Brown (@davisbrownr)

🚨 Come work with PNNL on AI Safety and Security! These are unique roles working on safety for a DOE national laboratory's national security mission. Applications close April 10th (this Wednesday); some details and roles below in 🧵

Cas (Stephen Casper) (@StephenLCasper)

🚨 New paper: Defending Against Unforeseen Failure Modes with Latent Adversarial Training

We argue that LAT can be a key tool for safer AI because it can help address the gap between failure modes that developers identify 🎯 and ones they miss 🤔.

arxiv.org/abs/2403.05030

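A minimal sketch of the LAT idea on a toy classifier, not the authors' code: take one gradient step on a perturbation of a hidden representation to make the loss worse (the attack lives in latent space rather than input space), then train the model under that worst-case perturbation. The two-layer network, layer names, and single-step FGSM-style attack are illustrative assumptions.

```python
# Sketch only: latent adversarial training (LAT) on a toy network.
import torch
import torch.nn as nn

class TwoLayerNet(nn.Module):
    # Toy stand-in for a large model; LAT attacks the latent h, not the input x.
    def __init__(self, d_in=16, d_hidden=32, d_out=2):
        super().__init__()
        self.encoder = nn.Linear(d_in, d_hidden)
        self.head = nn.Linear(d_hidden, d_out)

    def forward(self, x, latent_delta=None):
        h = torch.relu(self.encoder(x))
        if latent_delta is not None:
            h = h + latent_delta  # adversarial perturbation in latent space
        return self.head(h)

def lat_step(model, x, y, loss_fn, eps=0.1):
    # 1) Find a latent perturbation that hurts the model (one FGSM-style step).
    h = torch.relu(model.encoder(x)).detach()
    delta = torch.zeros_like(h, requires_grad=True)
    loss_fn(model.head(h + delta), y).backward()
    delta_adv = eps * delta.grad.sign()
    # 2) Compute the training loss under that worst-case latent perturbation.
    return loss_fn(model(x, latent_delta=delta_adv), y)

model = TwoLayerNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(8, 16), torch.randint(0, 2, (8,))
loss = lat_step(model, x, y, nn.CrossEntropyLoss())
opt.zero_grad()  # also clears stale gradients from the attack step
loss.backward()
opt.step()
```

Because the perturbation is applied to activations rather than inputs, the attack can surface failure modes that no crafted input in the developers' test set would trigger, which is the gap the paper targets.
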
Sam Bowman (@sleepinyourhat)

🚨📄 Following up on 'LMs Don't Always Say What They Think', Miles Turpin et al. now have an intervention that dramatically reduces the problem! 📄🚨

It's not a perfect solution, but it's a simple method with few assumptions and it generalizes *much* better than I'd expected.
