Death by dog food
I've been slacking on blog posts. So this one is a filler episode at best.
I learned the term dogfooding recently and it really resonated.
I'm overloading the term a little in this blog, but I hope the story still works.
Augmenting our generations
The main themes I see out in the wild are (1) people grappling with information retrieval and (2) people grappling with an agent loop where it probably shouldn't be.
Regardless, they're making the computer talk and people are excited. Documents come in, something happens, the outputs are somehow economically useful1.
At some point a stakeholder who will decide the fate of a product is brought in to test things out (dogfooding). They try it, a thumb goes up or down, and the technical team's life becomes easier or harder respectively.
Not everyone does this of course, but look up some blogs on RAG systems and usually evaluations are a footnote or not mentioned at all2. The culture of vibey evaluations is strong.
Dog food
Of course, this is maybe the most risky way to evaluate a system. If the thumb goes down (the dog food is just meant for dogs after all) due to a fluke in retrieval or answer generation, life gets hard. Stakeholders lose trust, the team backpedals, and everyone falls back on rehearsed workflows.
I've never liked this "it works when I try it" approach, especially in a particularly stochastic era of AI.
Dogfooding cannot be the answer for even the simplest RAG system.
One can't survive on dog food alone
I think I'm missing regular old machine learning more and more these days. Imagine the days before ChatGPT came out. You're building your favorite model3. You understand your data, clean it up, engineer and select some features, do hyperparameter tuning, and save off your model artifact to model heaven (wherever you'll be deploying it).
You'd never get away with this! We'd have no idea what our model's performance is under some "generally agreed upon method for evaluation": in other words, its performance on a held-out dataset4.
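As a minimal sketch of what that old discipline looks like (with made-up data and a deliberately trivial threshold "model" standing in for your favorite one), hold-out evaluation is just: split first, fit on one piece, report numbers on the other.

```python
import random

# Toy dataset: (feature, label) pairs. A hypothetical stand-in for real data;
# the label is simply whether the feature exceeds 50.
random.seed(0)
data = [(x, int(x > 50)) for x in random.sample(range(100), 60)]

# Hold out 20% of the data BEFORE any modeling happens.
random.shuffle(data)
split = int(len(data) * 0.8)
train, holdout = data[:split], data[split:]

# A deliberately simple "model": a threshold halfway between the class means.
pos = [x for x, y in train if y == 1]
neg = [x for x, y in train if y == 0]
threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

predict = lambda x: int(x > threshold)

# The number you report is accuracy on data the model never saw.
accuracy = sum(predict(x) == y for x, y in holdout) / len(holdout)
print(f"held-out accuracy: {accuracy:.2f}")
```

The point isn't the model; it's that the reported number comes from `holdout`, not from whoever happened to try the system last.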
In my day-to-day, I've been focusing more on the least """agentic""" piece of the puzzle: information retrieval. There's a rich history of evaluating these types of systems, and I think there's lots more we can do in the current era of AI. Specifically, things focused on numbers and facts instead of vibes-based evaluations.
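To give a flavor of that rich history, here is a tiny sketch of two classic retrieval metrics, recall@k and mean reciprocal rank (MRR), over hypothetical labeled queries (the query and document ids are made up for illustration):

```python
# Hypothetical labels: each query maps to its set of relevant document ids.
qrels = {
    "q1": {"d3"},
    "q2": {"d1", "d7"},
    "q3": {"d9"},
}

# Hypothetical ranked results from a retriever, best first.
runs = {
    "q1": ["d3", "d5", "d2"],  # relevant doc ranked first
    "q2": ["d4", "d7", "d1"],  # first relevant doc at rank 2
    "q3": ["d2", "d4", "d8"],  # relevant doc missed entirely
}

def recall_at_k(relevant, ranked, k):
    # Fraction of the relevant docs that appear in the top k results.
    return len(relevant & set(ranked[:k])) / len(relevant)

def reciprocal_rank(relevant, ranked):
    # 1 / rank of the first relevant doc, or 0 if none is retrieved.
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1 / rank
    return 0.0

mean_recall = sum(recall_at_k(qrels[q], runs[q], 3) for q in qrels) / len(qrels)
mrr = sum(reciprocal_rank(qrels[q], runs[q]) for q in qrels) / len(qrels)
print(f"recall@3: {mean_recall:.2f}, MRR: {mrr:.2f}")
# → recall@3: 0.67, MRR: 0.50
```

Once you have numbers like these over a fixed query set, a retrieval change either moves them or it doesn't; no thumbs required.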
I hope to be able to share more on this publicly soon!
1. Soon, I'll pretend to be an economist and talk more about this. ↩
2. Without being a hater, here are some examples: link, link, link, link. Yes, the focus of these posts is not evaluation; I'm just illustrating a point from a few top results in a "blog on rag" Google search. ↩
3. Let me guess... XGBoost. No! LightGBM. It's CatBoost, definitely. Logistic regression who? ↩
4. If you're reading my blog, you've probably also seen this one. Similar idea there. ↩