33 lines
1.4 KiB
Markdown
33 lines
1.4 KiB
Markdown
## Refining the agent output
|
|
|
|
Experiments modifying pipeline
|
|
|
|
| Model | % Correct | % Change |
|
|
|------------------|----------:|---------:|
|
|
| BASELINE | 33 | 0 |
|
|
| Improv Prompt | 39.96 | 0.21 |
|
|
| Add Examples | 44.67 | 0.35 |
|
|
| Date | 45.51 | 0.38 |
|
|
| Chain of Thought | 43.38 | 0.31 |
|
|
| Self-Critique | 44.36 | 0.34 |
|
|
|
|
Experiments with different model types:
|
|
| Model | % Correct | % Change |
|
|
|-------------------------------|----------:|---------:|
|
|
| gpt-5-mini | 45.51 | |
|
|
| gpt-5.4-mini | 32.4 | |
|
|
| gpt-5.4-nano | 23.28 | |
|
|
| gpt-4.1-mini | 27.85 | |
|
|
| gpt-4o-mini | 32.47 | |
|
|
| llama3.1:8b-instruct-q4_K_M | ? | |
|
|
| qwen3.5:9b | 0 | |
|
|
|
|
%age valid URLS
|
|
| Model | Number | % Age |
|
|
|-------------------------------|----------:|---------:|
|
|
| gpt-5-mini | 22/405 | 5.43 |
|
|
| gpt-5.4-mini | 29/278 | 10.43 |
|
|
| gpt-5.4-nano | 6/210 | 2.85 |
|
|
| gpt-4.1-mini | 15/269 | 5.57 |
|
|
| gpt-4o-mini | 27/287 | 9.407 |
|
|
| llama3.1:8b-instruct-q4_K_M | ? | ? | |