Discussion about this post

Nathan Lambert

This is a nice collection of plots. What is tough when comparing models across generations (e.g., GPT-3.5 to 4 to 4.5) is that post-training has gotten more effective and also potentially more focused on the evals of interest. It's very hard to know without access to the underlying base models.

Also, CoT isn't the only way to use inference-time compute. I bet most inference-time compute methods show improvements that start out linear, likely CoT included. Then others, like lightly parallel search, will stack on top of it.

Pretraining, and scaling all the paradigms at once, is just waiting on cluster buildouts. Still, there's more progress in post-training right now because iteration is easier.

Evan Armstrong

this was well done—4.5 is criminally underrated

