What have you guys found by comparing models against their smaller versions?
Llama 3, DeepSeek-R1, Gemma 2, Phi-3: they all come in versions with fewer and more parameters. I was wondering what you've found when giving the same prompts and context to the smaller and bigger versions. Especially DeepSeek, since it's the only reasoning model on the list.
I'm planning to run a test this week, exclusively with R1: 1000 prompts each on R1 1.5b and 8b, and then have a model compare not only the responses but also the thought processes.
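In case anyone wants to reproduce this, here's a rough sketch of the harness I have in mind, assuming both models are served locally through Ollama (the model tags, the prompts.txt format, and the results.jsonl layout are all my own choices, not anything official):

```python
# Minimal sketch: run the same prompts against both R1 sizes and capture
# the reasoning trace separately from the final answer for later judging.
import json
import ollama

MODELS = ["deepseek-r1:1.5b", "deepseek-r1:8b"]  # assumed Ollama tags

def split_think(text: str) -> tuple[str, str]:
    """R1-style models wrap their reasoning in <think>...</think>;
    separate it from the final answer so both can be compared later."""
    if "</think>" in text:
        thought, answer = text.split("</think>", 1)
        return thought.replace("<think>", "").strip(), answer.strip()
    return "", text.strip()

with open("prompts.txt") as f:  # one prompt per line (assumed format)
    prompts = [line.strip() for line in f if line.strip()]

with open("results.jsonl", "w") as out:
    for prompt in prompts:
        row = {"prompt": prompt}
        for model in MODELS:
            resp = ollama.chat(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            thought, answer = split_think(resp["message"]["content"])
            row[model] = {"thought": thought, "answer": answer}
        out.write(json.dumps(row) + "\n")
```

The judge step would then just feed each row's two `thought`/`answer` pairs back to a third model with a comparison prompt.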
I haven't used the 1.5b that much, but it does seem to generate shorter responses. I haven't used the DeepSeek website much either, though I can see its thinking is by far the longest. As for the 8b, the thought process always seems to be:
find what the user wants -> remember what you've been told -> remember something else you've been told (if you do remember) -> think for a paragraph or two -> summarize
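If you want to put rough numbers on that impression, a quick pass over the captured traces could compare thought length per model (the file name and keys here match my sketch above, so they're assumptions too):

```python
# Rough comparison of reasoning-trace length per model, read from the
# results.jsonl produced by the sketch above (not an official format).
import json
from statistics import mean

lengths: dict[str, list[int]] = {}
with open("results.jsonl") as f:
    for line in f:
        row = json.loads(line)
        for model, output in row.items():
            if model == "prompt":
                continue  # skip the prompt field, keep only model outputs
            lengths.setdefault(model, []).append(len(output["thought"].split()))

for model, counts in lengths.items():
    print(f"{model}: mean thought length {mean(counts):.0f} words")
```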
Part of the reason I haven't used the smaller model is that, even though it outputs faster, I rarely need the speed. So accuracy it is.