Merged
4 changes: 2 additions & 2 deletions README.md
@@ -442,11 +442,11 @@ Comprehensive evaluation on identical 10-task Kubernetes benchmark with proper C

| Model | Success | Fail | Success Rate |
|-------|---------|------|--------------|
| **AWS Bedrock Claude 3.7 Sonnet** | **10** | **0** | **100%** |
| **AWS Bedrock Claude Sonnet 4** | **9** | **1** | **90%** |
| gemini-2.5-flash-preview-04-17 | 10 | 0 | 100% |
| gemini-2.5-pro-preview-03-25 | 10 | 0 | 100% |
| gemma-3-27b-it | 8 | 2 | 80% |
| AWS Bedrock Claude 3.7 Sonnet | 10 | 0 | 100% |
Member

To improve the clarity of the benchmark, could we clarify if AWS Bedrock is just the access layer? It might be better to list the core model, 'Claude 3.7 Sonnet', to ensure we're comparing the models directly.

Member Author

We will add a column for the LLM provider in the final benchmark report, for completeness and reproducibility. I have seen the same model behave differently across providers (in the Bedrock case, there are also some prompt-specific changes that I am not entirely sure of), and once we include secondary metrics such as cost and latency, the provider will matter even more. It will also be critical for open models, where the inference stack (llama.cpp, vLLM) can affect even the accuracy of the same model.

/cc @noahlwest
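
For illustration only, here is a minimal sketch of how the report table could carry the provider as its own column, so the core model and the access layer are listed separately. The `Result` class, the `render_table` helper, and the `(not stated)` placeholder are hypothetical names introduced for this sketch; the counts come from the table in this diff, and the provider for the Gemini and Gemma rows is not given here, so it is left unfilled rather than guessed.

```python
from dataclasses import dataclass

# Placeholder for rows whose provider is not given in this PR (assumption of the sketch).
UNSTATED = "(not stated)"


@dataclass(frozen=True)
class Result:
    provider: str  # access layer or inference stack, e.g. "AWS Bedrock"
    model: str     # core model name
    success: int
    fail: int

    @property
    def success_rate(self) -> str:
        total = self.success + self.fail
        if total == 0:
            return "n/a"
        return f"{round(100 * self.success / total)}%"


# Counts copied from the benchmark table in this diff.
RESULTS = [
    Result(UNSTATED, "gemini-2.5-flash-preview-04-17", 10, 0),
    Result(UNSTATED, "gemini-2.5-pro-preview-03-25", 10, 0),
    Result(UNSTATED, "gemma-3-27b-it", 8, 2),
    Result("AWS Bedrock", "Claude 3.7 Sonnet", 10, 0),
    Result("AWS Bedrock", "Claude Sonnet 4", 9, 1),
]


def render_table(results):
    """Render the results as a Markdown table with a separate Provider column."""
    lines = [
        "| Provider | Model | Success | Fail | Success Rate |",
        "|----------|-------|---------|------|--------------|",
    ]
    for r in results:
        lines.append(
            f"| {r.provider} | {r.model} | {r.success} | {r.fail} | {r.success_rate} |"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_table(RESULTS))
```

Keeping provider and model in separate fields would also leave room for the secondary metrics mentioned above (cost, latency) to be added per provider later without renaming the rows.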

| AWS Bedrock Claude Sonnet 4 | 9 | 1 | 90% |

**Test Environment**: Kind cluster v1.27.3 with Calico CNI (full NetworkPolicy support)
**Tasks**: create-pod, create-pod-mount-configmaps, create-pod-resources-limits, create-network-policy, fix-crashloop, fix-image-pull, fix-service-routing, list-images-for-pods, scale-deployment, scale-down-deployment