Miles Turpin (@milesaturpin)'s Twitter Profile
Miles Turpin

@milesaturpin

Language model alignment @nyuniversity

ID: 865609028579213312

Link: http://milesturp.in/about · Joined: 19-05-2017 16:44:09

365 Tweets

988 Followers

1.3K Following

david rein (@idavidrein)

Is GPQA garbage?

A couple weeks ago, @typedfemale pointed out some mistakes in a GPQA question, so I figured this would be a good opportunity to discuss how we interpret benchmark scores, and what our goals should be when creating benchmarks.

Jacob Pfau (@jacob_pfau)

Do models need to reason in words to benefit from chain-of-thought tokens?

In our experiments, the answer is no! Models can perform on par with CoT using repeated '...' filler tokens.
This raises alignment concerns: Using filler, LMs can do hidden reasoning not visible in CoT 🧵

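A minimal sketch of the contrast described in the tweet, not the paper's code: it assumes a hypothetical causal LM trained to exploit filler tokens (the paper trains models for this; off-the-shelf models are not expected to benefit), and uses "gpt2" purely as a placeholder model name.

```python
# Sketch only: a visible chain-of-thought prompt vs. a filler-token prompt.
# Assumes a model trained to use filler tokens; "gpt2" is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; not the paper's model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

question = "Q: A train travels 60 miles in 1.5 hours. What is its average speed?"

# Visible reasoning: intermediate tokens are human-readable.
cot_prompt = question + "\nA: Let's think step by step."

# Filler variant: intermediate tokens are meaningless '.' characters, yet
# per the paper they can still buy the model extra serial computation.
filler_prompt = question + "\nA: " + "." * 60 + " The answer is"

for prompt in (cot_prompt, filler_prompt):
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=30, do_sample=False)
    # Print only the newly generated tokens for each prompt style.
    print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:]))
```

The alignment worry in the tweet is exactly that the filler variant leaves no human-readable trace of whatever computation the model performs between question and answer.
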
Usman Anwar (@usmananwar391)

We released this new agenda on LLM safety yesterday. It is VERY comprehensive, covering 18 different challenges.

My co-authors have posted tweets for each of these challenges. I am going to collect them all here!

P.S. this is also now on arxiv: arxiv.org/abs/2404.09932

Davis Brown (@davisbrownr)

🚨 Come work with PNNL on AI Safety and Security! These are unique roles working on safety for a DOE national laboratory's national security mission. Applications close April 10th (this Wednesday); some details and roles below in 🧵

Cas (Stephen Casper) (@StephenLCasper)

🚨 New paper: Defending Against Unforeseen Failure Modes with Latent Adversarial Training

We argue that LAT can be a key tool for safer AI because it can help address the gap between failure modes that developers identify 🎯 and ones they miss 🤔.

arxiv.org/abs/2403.05030

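A minimal sketch of the LAT idea on a toy classifier, not the authors' code: take one gradient step on a perturbation of a hidden representation to make the loss worse (the attack lives in latent space rather than input space), then train the model under that worst-case perturbation. The two-layer network, layer names, and single-step FGSM-style attack are illustrative assumptions.

```python
# Sketch only: latent adversarial training (LAT) on a toy network.
import torch
import torch.nn as nn

class TwoLayerNet(nn.Module):
    # Toy stand-in for a large model; LAT attacks the latent h, not the input x.
    def __init__(self, d_in=16, d_hidden=32, d_out=2):
        super().__init__()
        self.encoder = nn.Linear(d_in, d_hidden)
        self.head = nn.Linear(d_hidden, d_out)

    def forward(self, x, latent_delta=None):
        h = torch.relu(self.encoder(x))
        if latent_delta is not None:
            h = h + latent_delta  # adversarial perturbation in latent space
        return self.head(h)

def lat_step(model, x, y, loss_fn, eps=0.1):
    # 1) Find a latent perturbation that hurts the model (one FGSM-style step).
    h = torch.relu(model.encoder(x)).detach()
    delta = torch.zeros_like(h, requires_grad=True)
    loss_fn(model.head(h + delta), y).backward()
    delta_adv = eps * delta.grad.sign()
    # 2) Compute the training loss under that worst-case latent perturbation.
    return loss_fn(model(x, latent_delta=delta_adv), y)

model = TwoLayerNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(8, 16), torch.randint(0, 2, (8,))
loss = lat_step(model, x, y, nn.CrossEntropyLoss())
opt.zero_grad()  # also clears stale gradients from the attack step
loss.backward()
opt.step()
```

Because the perturbation is applied to activations rather than inputs, the attack can surface failure modes that no crafted input in the developers' test set would trigger, which is the gap the paper targets.
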
Sam Bowman (@sleepinyourhat)

🚨📄 Following up on 'LMs Don't Always Say What They Think', Miles Turpin et al. now have an intervention that dramatically reduces the problem! 📄🚨

It's not a perfect solution, but it's a simple method with few assumptions and it generalizes *much* better than I'd expected.
