Two Related Prompts, Different Results: Qwen 3.5 And Gemma 4 Need Different Prompting Than Qwen 3.6
With every new model release there's the "better than Opus 6.13" crowd vs. the "this is so bad, why did they even release it?" camp, and I'm always wondering which one is using it wrong. So I ran a little test: 2 related prompts, 3 models, and 10 runs of each combination. Short prompt:
Expected Answer: 300. Assumption:
Result:
I was really surprised by this: it turns out there aren't just good prompts and bad prompts; even apparently similar models (Qwen 3.5 vs. 3.6) can require different prompting styles. For context, the other prompt contains the exact same sentences, but embedded in a longer story:
The full long prompt can be found here. The full data of the comparison (token input and output counts, model answers, etc.): https://evaluateai.ai/app/comparisons/7d1baf23-49d0-484b-8c59-854dcc2e4f64/results/ Disclaimer: that's my website; I created it specifically to compare (local) LLMs. You can create one or several prompts, point it at your local endpoint, and then compare the results.
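Since the scoring script itself isn't shown, here's a minimal sketch of how a per-combination pass rate could be computed from the 10 runs of each (model, prompt) pair. The answer-extraction heuristic and the expected value "300" are taken from the post; the function names are my own, not from the site.

```python
import re

def extract_answer(text: str):
    """Naive heuristic: take the last number in the model's response."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", text)
    return nums[-1] if nums else None

def pass_rate(responses: list[str], expected: str = "300") -> float:
    """Fraction of runs whose extracted answer matches the expected value."""
    hits = sum(1 for r in responses if extract_answer(r) == expected)
    return hits / len(responses)

# e.g. three runs of one (model, prompt) combination
runs = ["The total is 300.", "Answer: 300", "I think it's 250."]
print(pass_rate(runs))  # 2 of 3 runs match the expected answer
```

In practice you'd collect the responses by hitting your local endpoint (e.g. an OpenAI-compatible API) 10 times per combination and feeding the raw texts into `pass_rate`.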