
Conversation

@droot (Member) commented Aug 14, 2025

No description provided.

@droot requested review from janetkuo and noahlwest August 14, 2025 21:58
@droot merged commit beb33b5 into GoogleCloudPlatform:main Aug 14, 2025
6 checks passed
| Model | Passed | Failed | Pass Rate |
| --- | --- | --- | --- |
| gemini-2.5-flash-preview-04-17 | 10 | 0 | 100% |
| gemini-2.5-pro-preview-03-25 | 10 | 0 | 100% |
| gemma-3-27b-it | 8 | 2 | 80% |
| AWS Bedrock Claude 3.7 Sonnet | 10 | 0 | 100% |
Member

To improve the clarity of the benchmark, could we clarify whether AWS Bedrock is just the access layer? It might be better to list the core model, 'Claude 3.7 Sonnet', so we're comparing the models directly.

Member Author

We will add a column for the LLM provider in the final benchmark report for completeness and reproducibility. I have seen differences in model behavior across providers (in the Bedrock case there are also some prompt-specific changes I'm not entirely sure of), and once we include secondary metrics such as cost and latency, it will become even more important. This will also be critical for open models, where the inference stack (llama.cpp, vLLM) affects even the accuracy of the same model.

/cc @noahlwest
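
For illustration, a minimal Go sketch of what a per-row report entry with a provider column could look like; the type, field names, and provider strings here are hypothetical and not the actual benchmark report schema:

```go
// Hypothetical sketch only: type, field names, and provider values are
// illustrative, not the actual benchmark report schema.
package main

import "fmt"

type benchmarkRow struct {
	Model    string
	Provider string // inference provider/stack, e.g. "gemini", "bedrock", "vllm"
	Passed   int
	Failed   int
}

// passRate returns the percentage of passed runs out of all runs.
func (r benchmarkRow) passRate() float64 {
	total := r.Passed + r.Failed
	if total == 0 {
		return 0
	}
	return 100 * float64(r.Passed) / float64(total)
}

func main() {
	rows := []benchmarkRow{
		{Model: "gemini-2.5-pro-preview-03-25", Provider: "gemini", Passed: 10, Failed: 0},
		{Model: "claude-3.7-sonnet", Provider: "bedrock", Passed: 10, Failed: 0},
	}
	fmt.Println("| Model | Provider | Passed | Failed | Pass Rate |")
	fmt.Println("| --- | --- | --- | --- | --- |")
	for _, r := range rows {
		fmt.Printf("| %s | %s | %d | %d | %.0f%% |\n",
			r.Model, r.Provider, r.Passed, r.Failed, r.passRate())
	}
}
```

Keeping the provider as its own column would also let the same model appear on multiple rows, one per provider or inference stack, which is what the cross-provider behavior differences mentioned above would require.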
