Merge pull request #396 from trycua/feat/hackathon-notebook

jamesmurdza · web-flow · commit a6165a5a2d59 · 2025-09-05T11:28:21.000-04:00
Add Jupyter notebook for the SOTA challenge
diff --git a/notebooks/hud_hackathon.ipynb b/notebooks/hud_hackathon.ipynb
@@ -0,0 +1,188 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "a5d6b2ed",
+   "metadata": {},
+   "source": [
+    "# Computer-Use Agents SOTA Challenge\n",
+    "\n",
+    "This notebook demonstrates how to create a computer use agent with Cua and evaluate it using HUD."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "19f92431",
+   "metadata": {},
+   "source": [
+    "## Step 1: Connect to Cloud Services\n",
+    "\n",
+    "You will need a Cua account to run computer use agents in the cloud and a HUD account to evaluate them.\n",
+    "\n",
+    "1. Create a Cua account at https://www.trycua.com/\n",
+    "2. Start a Cua container at https://www.trycua.com/dashboard/containers\n",
+    "3. Create a HUD account at https://www.hud.dev/\n",
+    "4. Create a .env file in the repository root or in this notebook's directory, like this:\n",
+    "\n",
+    "```\n",
+    "# Required environment variables:\n",
+    "CUA_API_KEY=\n",
+    "CUA_CONTAINER_NAME=\n",
+    "HUD_API_KEY=\n",
+    "\n",
+    "# Any LLM provider will work:\n",
+    "ANTHROPIC_API_KEY=\n",
+    "OPENAI_API_KEY=\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2f23828d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Load environment variables from .env (parent directory first, then the current directory)\n",
+    "\n",
+    "from dotenv import load_dotenv\n",
+    "\n",
+    "load_dotenv(dotenv_path='../.env')\n",
+    "load_dotenv(dotenv_path='.env')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5c8bef64",
+   "metadata": {},
+   "source": [
+    "## Step 2: Create a Computer Use Agent\n",
+    "\n",
+    "Connect to your running Cua container using the Cua SDK and initialize an agent."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cd4393b0",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import logging\n",
+    "import os\n",
+    "from pathlib import Path\n",
+    "\n",
+    "from agent import ComputerAgent\n",
+    "from computer import Computer, VMProviderType\n",
+    "\n",
+    "# Connect to your existing cloud container\n",
+    "computer = Computer(\n",
+    "    os_type=\"linux\",\n",
+    "    provider_type=VMProviderType.CLOUD,\n",
+    "    api_key=os.getenv(\"CUA_API_KEY\"),\n",
+    "    name=os.getenv(\"CUA_CONTAINER_NAME\"),\n",
+    "    verbosity=logging.INFO\n",
+    ")\n",
+    "\n",
+    "# Create agent\n",
+    "agent = ComputerAgent(\n",
+    "    model=\"openai/computer-use-preview\",\n",
+    "    tools=[computer],\n",
+    "    trajectory_dir=str(Path(\"trajectories\")),\n",
+    "    only_n_most_recent_images=3,\n",
+    "    verbosity=logging.INFO\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "12b9c22c",
+   "metadata": {},
+   "source": [
+    "## Step 3: Run a Simple Task\n",
+    "\n",
+    "Try running the computer use agent on a simple task."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f3a32ea8",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "tasks = [\n",
+    "    \"Look for a repository named trycua/cua on GitHub.\"\n",
+    "]\n",
+    "\n",
+    "for i, task in enumerate(tasks, start=1):\n",
+    "    print(f\"\\nExecuting task {i}/{len(tasks)}: {task}\")\n",
+    "    async for result in agent.run(task):\n",
+    "        # Print each intermediate result as the agent produces it\n",
+    "        print(result)\n",
+    "\n",
+    "    print(f\"\\n✅ Task {i}/{len(tasks)} completed: {task}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "eb4edbb5",
+   "metadata": {},
+   "source": [
+    "## Step 4: Evaluate the Agent with HUD\n",
+    "\n",
+    "Test your agent's performance on three tasks from the OSWorld-Verified benchmark:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6bf0887e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import uuid\n",
+    "from pprint import pprint\n",
+    "from agent.integrations.hud import run_full_dataset\n",
+    "\n",
+    "# Full dataset evaluation (runs via HUD's run_dataset under the hood)\n",
+    "job_name = f\"osworld-test-{str(uuid.uuid4())[:4]}\"\n",
+    "\n",
+    "results = await run_full_dataset(\n",
+    "    dataset=\"hud-evals/OSWorld-Verified-XLang\",  # You can also pass a Dataset or a list[dict]\n",
+    "    job_name=job_name,                           # Optional; defaults to a timestamp for custom datasets\n",
+    "    model=\"openai/computer-use-preview\",         # Or any supported model string\n",
+    "    max_concurrent=20,                           # Tune to your infra\n",
+    "    max_steps=50,                                # Safety cap per task\n",
+    "    split=\"train[:3]\"                            # Limit to just 3 tasks\n",
+    ")\n",
+    "\n",
+    "# results is a list from hud.datasets.run_dataset; inspect/aggregate as needed\n",
+    "print(f\"Job: {job_name}\")\n",
+    "print(f\"Total results: {len(results)}\")\n",
+    "pprint(results[:3])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5b89a103",
+   "metadata": {},
+   "source": [
+    "## Step 5: Improve Your Agent\n",
+    "\n",
+    "Improve your agent to get the highest score possible on OSWorld-Verified. Here are some ideas to get you started:\n",
+    "\n",
+    "- Experiment with different models or combinations of models\n",
+    "- Add your own custom tools to the agent\n",
+    "- Read the ComputerAgent source code and write your own improved version or subclass"
+   ]
+  }
+ ],
+ "metadata": {
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
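
Step 1 of the notebook lists the environment variables it expects in `.env`. A small preflight check along the following lines could catch a missing key before any cloud call is made. This is only a sketch and not part of the PR: `missing_env_vars` is a hypothetical helper, and treating "at least one of `ANTHROPIC_API_KEY` or `OPENAI_API_KEY`" as required is an assumption, since the notebook says any LLM provider will work.

```python
import os
from typing import Mapping

# Names taken from the .env template in Step 1 of the notebook.
REQUIRED_KEYS = ("CUA_API_KEY", "CUA_CONTAINER_NAME", "HUD_API_KEY")
# Assumption: at least one of these provider keys is needed; other providers
# would need their own variable names added here.
LLM_KEYS = ("ANTHROPIC_API_KEY", "OPENAI_API_KEY")


def missing_env_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return descriptions of unset or empty variables (hypothetical helper)."""
    missing = [name for name in REQUIRED_KEYS if not env.get(name)]
    if not any(env.get(name) for name in LLM_KEYS):
        missing.append(" or ".join(LLM_KEYS))
    return missing
```

Called right after `load_dotenv(...)`, `missing_env_vars()` returns an empty list when everything is set, so a cell could raise or print a message whenever the list is non-empty.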