Jason Arbon’s Post

🧙♂️ CEO, Checkie.AI test.ai | Google | Microsoft, Chrome | Search, "Automating the World"

Save #AI time ⏰ and money 💰! What is the best AI/LLM for software testers? I created LLM evals with tester-specific prompts to find out: over 2000 prompts evaluated against 20 different quality attributes such as accuracy and creativity.

I did this work to inform decisions powering Checkie.AI's automated #AI #testing. Simple checks use the smaller, faster, and less expensive models. For big-thinking or complex tasks, Checkie.AI uses the larger models. I figured other folks might find this info useful too. You can sign up here if you are interested in seeing #AI test results for your app: https://checkie.ai/

I evaluated models from two LLM providers, OpenAI and Anthropic, using different versions of each:
* OpenAI: GPT35Turbo and GPT4o
* Anthropic: Haiku, Sonnet and Opus

The goal was to test whether LLMs are improving at software testing tasks over time, or with increasing model size.

Findings:
* GPT4o was consistently the best--but by a small margin.
* All models performed well, with an average score of 9.0 out of 10.
* They also performed very similarly across the different attributes, mostly within the margin of error.
* All model responses shared weaknesses in the quality attributes of 'Creativity', 'Interactive Quality', and 'Originality'--as many suspected.
* Models seem to be getting better over time--but only ever so slightly.

Note that the pricing difference between the models can be over 100X!--but they are delivering essentially the same results for testers, so save your $ and use the smaller, faster models in all but the most critical testing applications.

Full results if you want to explore: https://lnkd.in/gBcU6-d9

I used both GPT4o and Claude to evaluate the responses/completions. No humans were harmed in the evaluation process. 🤷♂️
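For anyone who wants to try the "use the smaller model unless the task is critical" idea in code, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `ModelOption` type, the `pick_model` helper, the `tolerance` parameter, and the placeholder scores and costs are all made up. It is not Checkie.AI's routing logic and the numbers are not results from the evaluation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    name: str             # hypothetical label, not a real model ID
    score: float          # illustrative eval score on a 0-10 scale
    relative_cost: float  # illustrative cost relative to the cheapest option


def pick_model(options, critical=False, tolerance=0.5):
    """Return the cheapest option scoring within `tolerance` of the best.

    For critical tasks, return the top scorer regardless of cost.
    """
    best = max(options, key=lambda o: o.score)
    if critical:
        return best
    good_enough = [o for o in options if best.score - o.score <= tolerance]
    return min(good_enough, key=lambda o: o.relative_cost)


# Placeholder numbers for illustration only; NOT results from the post.
OPTIONS = [
    ModelOption("small", 8.8, 1.0),
    ModelOption("medium", 9.0, 10.0),
    ModelOption("large", 9.2, 100.0),
]

print(pick_model(OPTIONS).name)                 # routine check
print(pick_model(OPTIONS, critical=True).name)  # critical task
```

With scores this close together, the tolerance decides almost everything: loosen it and the cheapest model always wins, tighten it and only the top scorer qualifies.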

Jason Arbon

🧙♂️ CEO, Checkie.AI test.ai | Google | Microsoft, Chrome | Search, "Automating the World"

4d

[FWIW, I deleted two threads from folks who like to spread their own misconceptions and fear around #AI, don't understand how it works, and aren't familiar with testing content or 'relevance' at scale. Sigh]

This is great for understanding the capabilities. Jason Arbon, will you also share the data on how you evaluated the LLMs?

Great insights! It's interesting to see how smaller, faster models can save both time and money while still delivering quality results. Thanks for sharing the detailed findings!

Ognjen Ninic

High-performing software QA test engineer ☑ | ISTQB CTAL-TA | Testing makes difference between quality and garbage

4d

Jason Arbon Thank you for sharing the results 🙌

Orlando K.

Ask me how our AI agent will solve QA Engineering problems once and for all

4d

Good job Jason, I love this!

This is some good research! Thanks for sharing! Matt DeYoung
