Checkie.AI's Best AI/LLM for Software Testers

🧙‍♂️ CEO founder, testers.ai

1y Edited

Save #AI time ⏰ and money 💰! What is the best AI/LLM for software testers? I created LLM evals with tester-specific prompts to find out: over 2000 prompts evaluated against 20 different quality attributes such as accuracy and creativity.

I did this work to inform decisions powering Checkie.AI's automated #AI #testing. Simple checks use the smaller, faster, and less expensive models. For big-thinking or complex tasks, Checkie.AI uses the larger models. I figured other folks might find this info useful too. You can sign up here if you are interested in seeing #AI test results for your app: https://checkie.ai/

I evaluated models from two providers, OpenAI and Anthropic, using different versions of each:
* OpenAI: GPT35Turbo and GPT4o
* Anthropic: Haiku, Sonnet and Opus

The goal was to test whether LLMs are improving at software testing tasks over time, or with increasing model size.

Findings:
* GPT4o was consistently the best, but by a small margin.
* All models performed well, with an average score of 9.0 out of 10.
* They also performed very similarly across the different attributes, mostly within the margin of error.
* All model responses shared weaknesses in the quality attributes of 'Creativity', 'Interactive Quality', and 'Originality', as many suspected.
* Models seem to be getting better over time, but only ever so slightly.

Note that pricing between the models can differ by over 100X! But they are delivering essentially the same results for testers, so save your $ and use the smaller, faster models in all but the most critical testing applications.

Full results if you want to explore: https://lnkd.in/gBcU6-d9

I used both GPT4o and Claude to evaluate the responses/completions. No humans were harmed in the evaluation process. 🤷‍♂️
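For anyone who wants to run the same kind of "within the margin of error" check on their own eval scores, here is a minimal Python sketch. It is not the actual Checkie.AI eval code. The function names, the roughly 95% margin (1.96 times the standard error), and the toy scores are all assumptions made up for illustration, not data from the results linked above.

```python
from math import sqrt
from statistics import mean, stdev


def summarize(scores):
    """Return (mean, margin), where margin is an approximate 95% half-width.

    Assumption: margin = 1.96 * standard error of the mean.
    """
    avg = mean(scores)
    margin = 1.96 * stdev(scores) / sqrt(len(scores))
    return avg, margin


def within_margin(scores_a, scores_b):
    """True when the two means differ by no more than the combined margin."""
    mean_a, margin_a = summarize(scores_a)
    mean_b, margin_b = summarize(scores_b)
    combined = sqrt(margin_a ** 2 + margin_b ** 2)
    return abs(mean_a - mean_b) <= combined


# Hypothetical judge scores (0-10) for one quality attribute.
# These numbers are invented for the example, NOT results from the post.
small_model = [9, 9, 8, 10, 9, 9, 8, 9]
large_model = [9, 10, 9, 9, 9, 10, 8, 9]

if __name__ == "__main__":
    print("small:", summarize(small_model))
    print("large:", summarize(large_model))
    print("within margin of error:", within_margin(small_model, large_model))
```

If the check comes back true for the attributes you care about, that supports picking the cheaper model for that kind of task. With only a handful of scores per model the margin is wide, so treat the result as a rough guide.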

8 Comments

Jason Arbon

🧙‍♂️ CEO founder, testers.ai

[FWIW, I deleted two threads from folks who like to spread their own misconceptions and fear around #AI, don't understand how it works, and aren't familiar with testing content or 'relevance' at scale. Sigh]


GrowScale.Win

Great insights! It's interesting to see how smaller, faster models can save both time and money while still delivering quality results. Thanks for sharing the detailed findings!