The machine learning industry has traditionally adopted an academic approach, focusing primarily on performance across a range of tasks. LLMs like GPT-4 are a testament to this, having been scaled up to demonstrate impressive & diverse task capability. This scaling has also led to emergent abilities, debates about the true nature of which rage on.

However, general capability doesn’t necessarily translate to completing tasks as an individual user would prefer. This is a failure mode that anyone building agents will inevitably encounter. The focus, therefore, needs to shift from how language models perform tasks in a general sense to how they perform tasks on a user-specific basis.

Take summarization. It’s a popular machine learning task at which models have become quite proficient…at least from a benchmark perspective. However, when models summarize for users with a pulse, they fall short. The reason is simple: the models don’t know this individual. The key takeaways for a specific user differ dramatically from the takeaways any possible internet user would probably note.

So a shift in focus toward user-specific task performance would provide a much more dynamic & realistic approach. Catering to individual needs & paving the way for more personalized & effective ML applications.