Thoughts on NeurIPS 2025

I had the pleasure of being sent to NeurIPS by my employer earlier this month. Or, at least I was supposed to. An unexpected flu meant I attended virtually instead. Here are some general thoughts on the event.

It's been a while since the conference. But, better late than never!

Preamble

Conferences are a fun way to see the world and learn. The first big one I went to was GECCO 2019 in Prague1. Later that year I shipped off to NeurIPS 20192 in Vancouver. Unfortunately, I haven't been to one in person since 2019!

I was jazzed to go to San Diego, but I caught the flu over Thanksgiving and ended up watching from the comfort of my home instead.

Below are my thoughts on general themes and reflections. I won't lie and say I read every paper and synthesized all the tracks. These are pulled from expos, invited talks, and the best papers.

I certainly haven't kept up with all the other ML conferences this year. So, this may be nothing new.

Anti-anthropomorphization-arian-ism

I get it. The computer is talking like a person. So, we want to treat it like a person.

Some of the connecting threads between talks felt like "these problems are arising because we're treating the computer like a person." There are some specific points on benchmarks that I'll get to in the next section; this one will be more general.

The pieces I think are relevant here were:

  • Tufekci's invited talk Are We Having the Wrong Nightmares About AI?
  • Jiang et al.'s Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)4.

Here I'm looking for risks that arise from anthropomorphization that aren't specifically related to benchmarks.

Tufekci's talk started with this premise. To paraphrase, she claimed that AI replacing human labor is completely the wrong framing. Seeing a language model as an (eventual) smarter human limits the scope of risk assessment.

She also made some interesting comparisons to past technological developments. For example, just like a language model is not a smarter human, an automobile is not a faster horse.

She went on to discuss five areas in which generative language models are already affecting humanity and which will "break" as the technology advances. Namely: proof of effort, authenticity, accuracy, sincerity, and humanity. They are all more or less what they sound like. Further, they all stem from us seeing a little too much of ourselves in the computer.

Proof of sincerity, for example. This goes beyond the issues of "is this e-mail sincere?" Sycophancy remains a huge problem in language models. Vulnerable people are falling into deep, dark holes talking to these models3. To quote Tufekci, the problem here is an "insincere model acting with all the human signals of sincerity."

Jiang et al.'s paper is a bit of a divergence from Tufekci's talk. At face value, the issues they raise are not directly due to over-anthropomorphizing language models.

The connection I want to make requires the hindsight of their findings (which are great). As a reminder, this paper found that, when prompted with creative tasks, basically all models output the same thing. Ask three models for a poem and you'll get three strikingly similar poems.

A bug? Or a misaligned expectation?

I say that this one's on us. How could we expect things to be different? All models are basically trained on the same data. So, in hindsight, of course they all output the same thing.

In my opinion, it isn't immediately obvious because we see different models as fundamentally different beings, rather than as small tweaks to learning algorithms and architectures trained on the same heap of data.

Outside the bounds of the conference now, but I like the idea that we should rethink how we talk about language models. They aren't copilots, they aren't a PhD reasoner in a box. They're tools. They're good for some things and bad for others. Focus on solving the right problem and then reach for the right tools to solve that problem.

Souring on benchmarks

Number go up is dead, long live number go up.

Before this conference, I felt a growing vibe that people were fed up with benchmark maxing.

Ilya Sutskever points it out in his Dwarkesh interview. To paraphrase, models are killing benchmarks, but are not having economic impact on real tasks.

The pieces I think are relevant here were:

  • Mitchell's invited talk On the Science of “Alien Intelligences”: Evaluating Cognitive Capabilities in Babies, Animals, and AI.
  • Zhang's talk Advancing Rigor: Trust and Reliability in LLM Evaluation as part of Capital One's expo workshop on Exploring Trust and Reliability in LLM Evaluation.
  • The tutorial talk The Science of Benchmarking: What’s Measured, What’s Missed, and What’s Next5.

Mitchell's talk specifically implored researchers to "be aware of [their] own cognitive bias towards anthropomorphism". So, more points for anti-anthropomorphization-arian-ism. She went on to discuss five more principles for evaluation, including:

  • Be skeptical of others' results. What shortcuts or memorization could have occurred to achieve a result?
  • Analyze failures, embrace negative results.
  • Design novel variations of evaluations to test robustness and generalization.
  • Replicate and build on others' results.

This sounds strikingly like "just do good science." But I think it rings true. Benchmarks have gotten to the point where they are just completely untrustworthy. No one replicates, no one gives standard errors, I haven't seen a p-value in years.
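
To show how cheap this is to fix: if the eval harness spits out per-item correctness for two models on the same benchmark, a paired significance test is about a dozen lines. Here's a minimal sketch using a sign-flip permutation test; the per-item results below are simulated and don't correspond to any real model.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up per-item correctness for two models graded on the same benchmark.
# In practice these vectors would come straight out of the eval harness.
n_items = 500
model_a = rng.random(n_items) < 0.62
model_b = rng.random(n_items) < 0.58

# Paired sign-flip permutation test on the per-item score differences.
# Under the null hypothesis of no real difference, flipping the sign of
# each paired difference is equally likely, so we compare the observed
# mean difference against that permutation distribution.
diff = model_a.astype(float) - model_b.astype(float)
observed = diff.mean()

n_perm = 10_000
signs = rng.choice([-1.0, 1.0], size=(n_perm, n_items))
perm_means = (signs * diff).mean(axis=1)
p_value = (np.abs(perm_means) >= abs(observed)).mean()

print(f"accuracy A: {model_a.mean():.1%}, accuracy B: {model_b.mean():.1%}")
print(f"mean paired difference: {observed:+.2%}, two-sided p = {p_value:.3f}")
```

Swap in real per-item outputs and you get a p-value essentially for free alongside the headline accuracy.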

Next comes Zhang's talk. Long story short, he showed some very interesting results on benchmark contamination. It's clear that many models have the answer sheets to the tests they're being given. I'm glossing over a ton of detail on his methods here.

Finally comes the tutorial, The Science of Benchmarking: What’s Measured, What’s Missed, and What’s Next. It was three talks. My takeaways from each were as follows.

Talk 1: Epistemology, Design & Practice

Main idea here was bringing back real science to benchmarking. We're talking confidence intervals. We're talking t-tests. We're talking random seeding. The good stuff.

Let's look at the Claude Opus 4.5 announcement. They have the customary table of various benchmarks and different SOTA models' performance.

They have bolded their result on \(\tau\)2-bench. This row shows (on the retail use case):

  • Opus 4.5: 88.9%
  • Gemini 3 Pro: 85.3%

Of course, 88.9 is a larger number than 85.3. But that's all we can really say here. There's so much context lost on how these benchmarks work exactly. We have no idea what the spread on this evaluation might be. Give us a bootstrapped confidence interval or something!
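
To sketch what that could look like: if the per-task pass/fail results were published, a percentile bootstrap over tasks takes a few lines. Everything below is simulated; the task count and outcomes are assumptions chosen only to land near the headline numbers, not anyone's actual results.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated per-task pass/fail outcomes for a tau2-bench-style eval.
# The real per-task results aren't public; the task count and pass rates
# are assumptions picked to roughly match the reported 88.9% and 85.3%.
n_tasks = 115
pass_a = rng.random(n_tasks) < 0.889
pass_b = rng.random(n_tasks) < 0.853

def bootstrap_ci(outcomes, n_boot=10_000, alpha=0.05):
    """Percentile-bootstrap confidence interval for a pass rate over tasks."""
    n = len(outcomes)
    idx = rng.integers(0, n, size=(n_boot, n))  # resample tasks with replacement
    rates = outcomes[idx].mean(axis=1)          # pass rate of each resample
    lo, hi = np.quantile(rates, [alpha / 2, 1 - alpha / 2])
    return outcomes.mean(), lo, hi

for name, outcomes in [("model A", pass_a), ("model B", pass_b)]:
    mean, lo, hi = bootstrap_ci(outcomes)
    print(f"{name}: {mean:.1%} (95% CI: {lo:.1%} to {hi:.1%})")

# CI on the difference in pass rates, resampling each model's tasks
# independently since we don't know whether the runs are paired.
idx_a = rng.integers(0, n_tasks, size=(10_000, n_tasks))
idx_b = rng.integers(0, n_tasks, size=(10_000, n_tasks))
diff = pass_a[idx_a].mean(axis=1) - pass_b[idx_b].mean(axis=1)
print(f"difference: {pass_a.mean() - pass_b.mean():+.1%} "
      f"(95% CI: {np.quantile(diff, 0.025):+.1%} to {np.quantile(diff, 0.975):+.1%})")
```

With only a hundred or so tasks, the interval on the difference ends up wide, which is exactly the context the single bolded number hides.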

Talk 2: Limitations

To quote, or probably paraphrase: "static benchmarks do not account for evolving model usage." In the extreme case, we can't expect performance on a multiple choice quiz to translate to a language model doing our taxes for us.

Two other points I liked from this talk were:

  • Benchmarks have poor-quality labels: if we remove "noisy" labels, we can completely reorder a ranking of models (a toy sketch of this follows after the list).
  • Benchmarks fail to meet expert requirements: the speaker referenced a result stating doctors found that a medical benchmark had scant relevance for clinical work.
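
The noisy-label point sounds implausible until you see it, so here's a toy simulation (entirely made-up numbers, not from the talk) where dropping 15% bad gold labels flips which of two models ranks first.

```python
import numpy as np

rng = np.random.default_rng(7)

# A toy benchmark where 15% of the gold labels are wrong. Model A is
# genuinely more accurate against the true labels; Model B happens to
# reproduce the annotation errors (say, it was trained on the same
# noisy source the labels came from).
n_items = 2_000
noise_rate = 0.15

true_labels = rng.integers(0, 2, n_items)
is_noisy = rng.random(n_items) < noise_rate
gold_labels = np.where(is_noisy, 1 - true_labels, true_labels)

# Model A: ~80% accurate against the true labels, errors unrelated to the noise.
model_a = np.where(rng.random(n_items) < 0.80, true_labels, 1 - true_labels)

# Model B: ~72% accurate against the true labels on clean items, but it
# parrots the gold label (errors included) on the noisy items.
model_b = np.where(rng.random(n_items) < 0.72, true_labels, 1 - true_labels)
model_b = np.where(is_noisy, gold_labels, model_b)

def score(preds, labels, mask=None):
    if mask is not None:
        preds, labels = preds[mask], labels[mask]
    return (preds == labels).mean()

print("scored against the noisy gold labels:")
print(f"  A: {score(model_a, gold_labels):.1%}   B: {score(model_b, gold_labels):.1%}")

print("scored with the noisy items removed:")
print(f"  A: {score(model_a, gold_labels, ~is_noisy):.1%}   B: {score(model_b, gold_labels, ~is_noisy):.1%}")
```

Against the noisy gold labels B comes out on top; drop the bad items and A wins. That's the whole trick.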

Talk 3: Emerging Paradigms

The first point concerned benchmarks' high barrier to entry. They're often nontrivial to judge: tasks that require some kind of agent are highly implementation-dependent (e.g., which framework was used?), and of course this information is never public.

Benchmarks have started to move to "economically viable" tasks, for example GDPval and Scale AI's Remote Labor Index.

I also have to shout out Vending-Bench. I used to work with a couple of the people at Andon Labs. I'm glad to see them getting more attention.

Panel conversation

All the folks on the panel had a lot of great things to say. But I wanted to focus on two of Ofir Press's points. For reference, he's partially responsible for SWE-bench.

First, he gave away a free benchmark idea: refactoring codebases. He commented that every way to measure this is very susceptible to reward hacking.

Second, he commented that practitioners should be wary of releasing a new benchmark that current SOTA models can achieve 70% on. If such a benchmark catches on, it would undoubtedly be saturated quickly.

A final thought from my brain: certainly this all comes down to money. Benchmark number goes up, stock (or valuation) number goes up. It's that simple, folks.

Pushing on learning paradigms

I'm a bit out of my depth here, but I wanted to touch on exciting ideas on learning.

Generally, it seems like the holy grails right now are:

  • Continual learning
  • Problem finding

If both could be made scalable (and domain-agnostic), you would have a model that doesn't forget its past experiences and is quite creative.

The pieces I think are relevant here were:

  • Sutton's invited talk The Oak Architecture: A Vision of SuperIntelligence from Experience.
  • Choi's invited talk The Art of (Artificial) Reasoning.
  • Cho's invited talk From Benchmarks to Problems - A Perspective on Problem Finding in AI.
  • Yue et al.'s Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
  • Behrouz et al.'s Nested Learning: The Illusion of Deep Learning Architectures.

Sutton's, Cho's, and Behrouz et al.'s pieces all inspired great hope for the future of learning paradigms. I'm not sure whether they're competing or orthogonal. Behrouz et al. were actually showing results with the HOPE architecture, and the same goes for Cho's presentation on "learning to X". Sutton's, by contrast, was more of a vision of what an ideal approach will look like in the future.

Meanwhile, I thought Choi and Yue et al. had conflicting results. Yue et al. showed that RLVR doesn't teach models anything new. Choi seemed to hint that it does actually work if you do it correctly. However, she didn't seem to directly tackle the idea of RLVR leading to extrapolation, which is what Yue et al. were specifically discrediting.

Again, I'm treading water at best here. So, this might be way off.

Anything but language

A smaller point, but I love the continued rise of anything-but-language foundation models. For example, Shopify had an expo session on Foundational Generative Recommendations for E-Commerce. They showed lots of interesting results, such as confirming that generative models for recommendations do indeed follow scaling laws6.

Similar things are being done at companies like Nubank. See here for their blog series on this.

Transaction data is ripe for this kind of thing. Every financial institution is sitting on a mountain of data that looks like this. Maybe someday we'll have an open source transactional foundation model for downstream tasks like fraud detection or product recommendations.

Of course, there's a bundle of other applications like this in biological data, robotics, physical simulation, and other domains. Glad to see things are going strong in those areas.

Virtual conference blues

A final thought on virtual conferences. I will say that the NeurIPS team (or whoever they hired) did a great job with the virtual conference. Very minimal hiccups, and things started approximately on time. I also appreciated them taking a few virtual questions during the talks. My only gripe is that a few of the fun-sounding talks weren't streamed (specifically, Emergent Stories: AI Agents and New Narratives and Design for Intelligence: Humans, Machines and Nature in Emergent Co-Creation).

If you're not presenting, 90% of the value of a conference is seeing faces and shaking hands. I love the expo and talking to people from random tech companies. That's the real energizing part for me.

Once I sniffled and coughed my way to Thursday, I was feeling low motivation to continue watching paper talks.


  1. Not to brag, but I was second author on the best paper for the evolutionary machine learning track (link) at GECCO 2019. These were good days. At the time, our best experiments came from our first author heating his apartment with the 1080 Ti in his consumer desktop. Simpler times.

  2. Not to brag, but NeurIPS 2019 rejected the same paper I mentioned above. 

  4. As I said, I'm not exactly poring over the NeurIPS publications with this one. This piece won best paper on the datasets and benchmarks track. See here

  6. I say confirming since the original HSTU paper already showed that "generative recommenders" follow scaling laws. The Shopify people pointed this out as well.