## Refining the agent output

Experiments modifying the pipeline:

| Pipeline Change  | % Correct | Relative Change vs. Baseline |
|------------------|----------:|-----------------------------:|
| Baseline         | 33        | 0                            |
| Improved Prompt  | 39.96     | 0.21                         |
| Add Examples     | 44.67     | 0.35                         |
| Date             | 45.51     | 0.38                         |
| Chain of Thought | 43.38     | 0.31                         |
| Self-Critique    | 44.36     | 0.34                         |
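
The last column appears to be the improvement over the baseline expressed as a fraction, `(score - baseline) / baseline`, since that formula reproduces the listed values from the % Correct column. A minimal sketch of the arithmetic follows; the function and variable names are illustrative and not taken from the pipeline code.

```python
def relative_change(score: float, baseline: float) -> float:
    """Fractional change of `score` relative to `baseline` (0.21 means +21%)."""
    return (score - baseline) / baseline


# % Correct values copied from the table above.
BASELINE = 33.0
scores = {
    "Improved Prompt": 39.96,
    "Add Examples": 44.67,
    "Date": 45.51,
    "Chain of Thought": 43.38,
    "Self-Critique": 44.36,
}

# Round to two decimals to match the precision used in the table.
changes = {name: round(relative_change(s, BASELINE), 2) for name, s in scores.items()}

for name, change in changes.items():
    print(f"{name:<18}{change:.2f}")
```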

Experiments with different model types:

| Model                         | % Correct | % Change |
|-------------------------------|----------:|---------:|
| gpt-5-mini                    | 45.51     |          |
| gpt-5.4-mini                  | 32.4      |          |
| gpt-5.4-nano                  | 23.28     |          |
| gpt-4.1-mini                  | 27.85     |          |
| gpt-4o-mini                   | 32.47     |          |
| llama3.1:8b-instruct-q4_K_M   | ?         |          |
| qwen3.5:9b                    | 0         |          |

Percentage of valid URLs:

| Model                         | Valid / Total | % Valid |
|-------------------------------|--------------:|--------:|
| gpt-5-mini                    | 22/405        | 5.43    |
| gpt-5.4-mini                  | 29/278        | 10.43   |
| gpt-5.4-nano                  | 6/210         | 2.86    |
| gpt-4.1-mini                  | 15/269        | 5.58    |
| gpt-4o-mini                   | 27/287        | 9.41    |
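
The percentages can be recomputed from the raw counts. The sketch below assumes each count is "valid URLs / total URLs produced" and rounds to two decimals; the helper name `percent_valid` is hypothetical.

```python
def percent_valid(valid: int, total: int) -> float:
    """Share of valid URLs as a percentage, rounded to two decimals."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * valid / total, 2)


# (valid, total) counts copied from the table above.
url_counts = {
    "gpt-5-mini": (22, 405),
    "gpt-5.4-mini": (29, 278),
    "gpt-5.4-nano": (6, 210),
    "gpt-4.1-mini": (15, 269),
    "gpt-4o-mini": (27, 287),
}

percentages = {model: percent_valid(v, t) for model, (v, t) in url_counts.items()}

for model, pct in percentages.items():
    print(f"{model:<14}{pct:>6.2f}")
```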