
Commit a6165a5

Merge pull request #396 from trycua/feat/hackathon-notebook
Add Jupyter notebook for the SOTA challenge
2 parents ba72f58 + c5ca6e9

File tree

1 file changed: +188 -0 lines changed


notebooks/hud_hackathon.ipynb

Lines changed: 188 additions & 0 deletions
@@ -0,0 +1,188 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "a5d6b2ed",
   "metadata": {},
   "source": [
    "# Computer-Use Agents SOTA Challenge\n",
    "\n",
    "This notebook demonstrates how to create a computer use agent with Cua and evaluate it using HUD."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "19f92431",
   "metadata": {},
   "source": [
    "## Step 1: Connect to Cloud Services\n",
    "\n",
    "You will need a Cua account to run computer use agents in the cloud and a HUD account to evaluate them.\n",
    "\n",
    "1. Create a Cua account at https://www.trycua.com/\n",
    "2. Start a Cua container at https://www.trycua.com/dashboard/containers\n",
    "3. Create a HUD account at https://www.hud.dev/\n",
    "4. Create a .env file like this:\n",
    "\n",
    "```\n",
    "# Required environment variables:\n",
    "CUA_API_KEY=\n",
    "CUA_CONTAINER_NAME=\n",
    "HUD_API_KEY=\n",
    "\n",
    "# Any LLM provider will work:\n",
    "ANTHROPIC_API_KEY=\n",
    "OPENAI_API_KEY=\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2f23828d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Read the .env file\n",
    "\n",
    "from dotenv import load_dotenv\n",
    "\n",
    "load_dotenv(dotenv_path='../.env')\n",
    "load_dotenv(dotenv_path='.env')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5c8bef64",
   "metadata": {},
   "source": [
    "## Step 2: Create a Computer Use Agent\n",
    "\n",
    "Connect to your running Cua container using the Cua SDK and initialize an agent."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cd4393b0",
   "metadata": {},
   "outputs": [],
   "source": [
    "import logging\n",
    "from pathlib import Path\n",
    "import os\n",
    "\n",
    "from agent import ComputerAgent\n",
    "from computer import Computer, VMProviderType\n",
    "\n",
    "# Connect to your existing cloud container\n",
    "computer = Computer(\n",
    "    os_type=\"linux\",\n",
    "    provider_type=VMProviderType.CLOUD,\n",
    "    api_key=os.getenv(\"CUA_API_KEY\"),\n",
    "    name=os.getenv(\"CUA_CONTAINER_NAME\"),\n",
    "    verbosity=logging.INFO\n",
    ")\n",
    "\n",
    "# Create agent\n",
    "agent = ComputerAgent(\n",
    "    model=\"openai/computer-use-preview\",\n",
    "    tools=[computer],\n",
    "    trajectory_dir=str(Path(\"trajectories\")),\n",
    "    only_n_most_recent_images=3,\n",
    "    verbosity=logging.INFO\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "12b9c22c",
   "metadata": {},
   "source": [
    "## Step 3: Run a Simple Task\n",
    "\n",
    "Try running the computer use agent on a simple task."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f3a32ea8",
   "metadata": {},
   "outputs": [],
   "source": [
    "tasks = [\n",
    "    \"Look for a repository named trycua/cua on GitHub.\"\n",
    "]\n",
    "\n",
    "for i, task in enumerate(tasks):\n",
" print(f\"\\nExecuting task {i}/{len(tasks)}: {task}\")\n",
120+
" async for result in agent.run(task):\n",
121+
" print(result)\n",
122+
" pass\n",
123+
"\n",
124+
" print(f\"\\n✅ Task {i+1}/{len(tasks)} completed: {task}\")"
125+
]
126+
},
127+
{
128+
"cell_type": "markdown",
129+
"id": "eb4edbb5",
130+
"metadata": {},
131+
"source": [
132+
"## Step 4: Evaluate the Agent with HUD\n",
133+
"\n",
134+
"Test your agent's performance on a selection of tasks from the OSWorld benchmark:"
135+
]
136+
},
137+
{
138+
"cell_type": "code",
139+
"execution_count": null,
140+
"id": "6bf0887e",
141+
"metadata": {},
142+
"outputs": [],
143+
"source": [
144+
"import uuid\n",
145+
"from pprint import pprint\n",
146+
"from agent.integrations.hud import run_full_dataset\n",
147+
"\n",
148+
"# Full dataset evaluation (runs via HUD's run_dataset under the hood)\n",
149+
"job_name = f\"osworld-test-{str(uuid.uuid4())[:4]}\"\n",
150+
"\n",
151+
"results = await run_full_dataset(\n",
152+
" dataset=\"hud-evals/OSWorld-Verified-XLang\", # You can also pass a Dataset or a list[dict]\n",
153+
" job_name=job_name, # Optional; defaults to a timestamp for custom datasets\n",
154+
" model=\"openai/computer-use-preview\", # Or any supported model string\n",
155+
" max_concurrent=20, # Tune to your infra\n",
156+
" max_steps=50, # Safety cap per task\n",
157+
" split=\"train[:3]\" # Limit to just 3 tasks\n",
158+
")\n",
159+
"\n",
160+
"# results is a list from hud.datasets.run_dataset; inspect/aggregate as needed\n",
161+
"print(f\"Job: {job_name}\")\n",
162+
"print(f\"Total results: {len(results)}\")\n",
163+
"pprint(results[:3])"
164+
]
165+
},
166+
{
167+
"cell_type": "markdown",
168+
"id": "5b89a103",
169+
"metadata": {},
170+
"source": [
171+
"# Step 5: Improve your Agent\n",
172+
"\n",
173+
"Improve your agent to get the highest score possible on OSWorld-Verified. Here are some ideas to get you started:\n",
174+
"\n",
175+
"- Experiment with different models or combinations of models\n",
176+
"- Try adding your custom tools to the agent\n",
177+
"- Read the ComputerAgent source code, and come up with your own improved version/subclass"
178+
]
179+
}
180+
],
181+
"metadata": {
182+
"language_info": {
183+
"name": "python"
184+
}
185+
},
186+
"nbformat": 4,
187+
"nbformat_minor": 5
188+
}
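
A possible starting point for Step 5's suggestion of adding custom tools: the sketch below assumes that ComputerAgent accepts plain Python callables alongside the computer in its tools list (the diff above only shows tools=[computer]), and fetch_url is a hypothetical helper, not part of the Cua SDK.

```python
# Sketch only: assumes ComputerAgent accepts plain Python callables as extra
# tools next to the computer object from Step 2; fetch_url is hypothetical.
import urllib.request

from agent import ComputerAgent


def fetch_url(url: str) -> str:
    """Fetch a web page and return up to 10 kB of its text, so the agent can
    read documentation without extra GUI navigation."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read(10_000).decode("utf-8", errors="replace")


# `computer` is the cloud container connection created in Step 2.
agent = ComputerAgent(
    model="openai/computer-use-preview",
    tools=[computer, fetch_url],  # assumption: callables are accepted as tools
    trajectory_dir="trajectories",
)
```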
