I highly rate Braintrust.
It wouldn’t be too difficult to build something like that for your own use, but I found it pretty easy to get datasets set up.
It’s essentially a game changer for understanding whether your prompts are working, especially if you’re doing something that requires a high level of consistency.
In our case we would use an LLM for classification, which fits in perfectly with evals.
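For anyone wondering what an eval for a classifier boils down to, here is a rough sketch in plain Python. It does not use Braintrust's SDK. The dataset, the labels, and `fake_classifier` are all made up for illustration, and you would swap the stand-in for your real prompt and model call. It reports accuracy plus how often repeated runs on the same input disagree, which is the consistency part.

```python
from collections import Counter
from typing import Callable

# Hypothetical labelled dataset: (input text, expected label) pairs.
DATASET = [
    ("Refund has not arrived after two weeks", "billing"),
    ("App crashes when I open settings", "bug"),
    ("How do I export my data?", "how_to"),
]


def fake_classifier(text: str) -> str:
    """Stand-in for the real LLM call (assumption: yours returns one label)."""
    lowered = text.lower()
    if "refund" in lowered:
        return "billing"
    if "crash" in lowered:
        return "bug"
    return "how_to"


def run_eval(
    classify: Callable[[str], str],
    dataset: list[tuple[str, str]],
    runs: int = 3,
) -> dict[str, float]:
    """Score a classifier on accuracy and run-to-run consistency."""
    correct = 0
    inconsistent = 0
    for text, expected in dataset:
        # Call the classifier several times to catch flaky outputs.
        outputs = [classify(text) for _ in range(runs)]
        majority, _ = Counter(outputs).most_common(1)[0]
        correct += majority == expected
        inconsistent += len(set(outputs)) > 1
    n = len(dataset)
    return {"accuracy": correct / n, "inconsistent": inconsistent / n}


result = run_eval(fake_classifier, DATASET)
print(result)
```

Because classification has a single right answer per input, scoring is just an exact match on the label, which is why it fits evals so well. Re-run the same dataset after every prompt change and compare the numbers.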
“I admit that I still disagreed with him after the exchange, but I had a new respect for him as a designer because he was able to articulate a rationale for his decision.”
Any competent designer gets really good at justifying their decisions. Everyone has an opinion about design and thinks that their taste is correct.
I’m glad I don’t have to deal with that on the software side.
Marginally related, I feel the same way about honesty, especially in a work context.
I’ve always prided myself on being an honest but considerate person.
A recent experience with a colleague who weaponised my honesty in an attempt to manipulate me has left a foul taste in my mouth. Luckily their contract ended and the problem resolved itself.
But I distinctly remember feeling that while I will be professional and polite, I do not automatically owe anyone my honesty.