
Conversation

@KRRT7 (Contributor) commented Sep 17, 2025

PR Type

Enhancement, Tests, Bug fix


Description

  • Derive test context from pytest hooks

  • Add async throughput calculation support

  • Extend baseline model with throughput

  • Update tests for new interfaces


Diagram Walkthrough

flowchart LR
  pytestHook["pytest hooks set test context env"] -- "env vars" --> extractor["extract_test_context_from_frame uses env"]
  runParse["run_and_parse_tests returns results thrice"] -- "behavior/perf results" --> calcTP["calculate_async_throughput aggregates stdout"]
  calcTP -- "function throughput" --> baseline["OriginalCodeBaseline.async_throughput"]
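
For orientation, here is a rough sketch (not the PR's actual implementation) of how a wrapped async function could emit the start/end stdout tags counted downstream, using the test context that the pytest hooks export as environment variables. The wrapper name and argument plumbing are illustrative only; the tag format follows the docstring quoted in the reviewer guide below.

import os

async def codeflash_wrap_async(wrapped, loop_index, iteration_id, *args, **kwargs):
    # Hypothetical sketch: read the test context the pytest hooks exported.
    test_module = os.environ["CODEFLASH_TEST_MODULE"]
    test_function = os.environ["CODEFLASH_TEST_FUNCTION"]
    tag = f"{test_module}:{test_function}:{wrapped.__name__}:{loop_index}:{iteration_id}"
    print(f"!$######{tag}######$!")   # start tag
    result = await wrapped(*args, **kwargs)
    print(f"!######{tag}######!")     # end tag; only emitted if the call completes
    return result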

File Walkthrough

Relevant files

Enhancement (6 files)
  codeflash_wrap_decorator.py (+12/-174): Replace stack introspection with env-based test context
  models.py (+1/-0): Add async throughput to baseline model
  function_optimizer.py (+44/-10): Plumb results for throughput; compute and store async throughput
  parse_test_output.py (+63/-0): Add helpers to compute throughput from stdout
  pytest_plugin.py (+22/-0): Set/clear test context env vars per test
  test_runner.py (+1/-1): Log executed pytest command for visibility

Tests (8 files)
  test_async_run_and_parse_tests.py (+10/-10): Adapt to new API and class-name expectations
  test_async_wrapper_sqlite_validation.py (+5/-2): Provide env context; fix sync context extraction test
  test_codeflash_capture.py (+12/-12): Update for triple-return API and expectations
  test_extract_test_context_from_frame.py (+0/-217): Remove obsolete stack-based context tests
  test_instrument_all_and_run.py (+5/-5): Adjust tests to new run_and_parse_tests signature
  test_instrument_tests.py (+22/-22): Update behavioral/perf/line profile calls to new API
  test_instrumentation_run_results_aiservice.py (+5/-5): Align to new return tuple from test runner
  test_pickle_patcher.py (+4/-4): Adapt pickle patch tests to new API

Configuration changes (1 file)
  codeflash.code-workspace (+5/-1): Update launch args to async concurrency example


PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Regex Robustness

The new throughput parsers rely on strict colon-delimited patterns without escaping, which may fail if module/function names contain colons or unexpected characters, or if stdout interleaves tags. Consider anchoring, stricter grouping, and normalization to avoid false positives/negatives.

start_pattern = re.compile(r"!\$######([^:]*):([^:]*):([^:]*):([^:]*):([^:]+)######\$!")
end_pattern = re.compile(r"!######([^:]*):([^:]*):([^:]*):([^:]*):([^:]+)######!")


def calculate_function_throughput_from_stdout(stdout: str, function_name: str) -> int:
    """A completed execution is defined as having both a start tag and matching end tag:
    Start: !$######test_module:test_function:function_name:loop_index:iteration_id######$!
    End:   !######test_module:test_function:function_name:loop_index:iteration_id######!
    """
    start_matches = start_pattern.findall(stdout)
    end_matches = end_pattern.findall(stdout)
    end_matches_set = set(end_matches)

    # Count completed executions for the specific function only
    function_throughput = 0

    for start_match in start_matches:
        # Check if this execution is for the function we're interested in and has a matching end tag
        # function_name is at index 2 in the match tuple
        if start_match in end_matches_set and len(start_match) > 2 and start_match[2] == function_name:
            function_throughput += 1

    return function_throughput
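
One possible hardening along those lines, shown only as a sketch: escape the target function name into the pattern and anchor each tag to its own line. This assumes tags are printed one per line; end-tag timing suffixes, discussed in the suggestions below, would still need separate handling.

import re

def count_completed_executions(stdout: str, function_name: str) -> int:
    # Hypothetical variant: only tags for the escaped function name can match,
    # and MULTILINE anchors reduce false positives from interleaved output.
    fn = re.escape(function_name)
    start = re.compile(rf"^!\$######([^:]*):([^:]*):{fn}:([^:]*):([^:]+)######\$!$", re.MULTILINE)
    end = re.compile(rf"^!######([^:]*):([^:]*):{fn}:([^:]*):([^:]+)######!$", re.MULTILINE)
    end_set = set(end.findall(stdout))
    return sum(1 for m in start.findall(stdout) if m in end_set)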
Excessive Logging

Logging the full pytest command at info level on every run could be noisy and leak paths; consider using debug level or redacting sensitive parts.

blocklist_args = [f"-p no:{plugin}" for plugin in BEHAVIORAL_BLOCKLISTED_PLUGINS if plugin != "cov"]
logger.info(f"{' '.join(coverage_cmd + common_pytest_args + blocklist_args + result_args + test_files)}")
results = execute_test_subprocess(
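
A minimal adjustment, assuming the same logger, would be to route the command to debug level so it only appears when verbose output is requested:

pytest_cmd = coverage_cmd + common_pytest_args + blocklist_args + result_args + test_files
logger.debug(f"Executing pytest command: {' '.join(pytest_cmd)}")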
Env Var Dependence

extract_test_context_from_frame now hard-requires environment variables and raises if missing; ensure all runners (including unittest and non-pytest paths) set these vars or add a safer fallback to prevent crashes.

def extract_test_context_from_frame() -> tuple[str, str | None, str]:
    # test_module = os.environ.get("CODEFLASH_TEST_MODULE")
    test_module = os.environ["CODEFLASH_TEST_MODULE"]
    test_class = os.environ.get("CODEFLASH_TEST_CLASS", None)
    # test_function = os.environ.get("CODEFLASH_TEST_FUNCTION")
    test_function = os.environ["CODEFLASH_TEST_FUNCTION"]

    if test_module and test_function:
        return (test_module, test_class if test_class else None, test_function)

    raise RuntimeError(
        "Test context environment variables not set - ensure tests are run through codeflash test runner"
    )
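
To illustrate the concern, a non-pytest runner would have to export the same variables itself, since no pytest hook fires there. A hypothetical plain-unittest sketch:

import os
import unittest

class ExampleTest(unittest.TestCase):
    def setUp(self):
        # Hypothetical: mirror what the pytest hooks do for pytest runs.
        os.environ["CODEFLASH_TEST_MODULE"] = type(self).__module__
        os.environ["CODEFLASH_TEST_CLASS"] = type(self).__name__
        os.environ["CODEFLASH_TEST_FUNCTION"] = self._testMethodName

    def tearDown(self):
        for var in ("CODEFLASH_TEST_MODULE", "CODEFLASH_TEST_CLASS", "CODEFLASH_TEST_FUNCTION"):
            os.environ.pop(var, None)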


PR Code Suggestions ✨

Explore these optional code suggestions:

Category: Possible issue
Normalize end tags before matching

The start/end tag formats elsewhere include optional extra timing appended to the
end tag, making tuple equality fail and undercounting throughput. Normalize end
matches (strip trailing timing after the last colon) before comparing to start
tuples. This aligns with the normalization used above and ensures completed
executions are counted correctly.

codeflash/verification/parse_test_output.py [81-103]

 start_pattern = re.compile(r"!\$######([^:]*):([^:]*):([^:]*):([^:]*):([^:]+)######\$!")
 end_pattern = re.compile(r"!######([^:]*):([^:]*):([^:]*):([^:]*):([^:]+)######!")
 
 
 def calculate_function_throughput_from_stdout(stdout: str, function_name: str) -> int:
     """A completed execution is defined as having both a start tag and matching end tag:
     Start: !$######test_module:test_function:function_name:loop_index:iteration_id######$!
-    End:   !######test_module:test_function:function_name:loop_index:iteration_id######!
+    End:   !######test_module:test_function:function_name:loop_index:iteration_id[:timing]######!
     """
     start_matches = start_pattern.findall(stdout)
-    end_matches = end_pattern.findall(stdout)
-    end_matches_set = set(end_matches)
+    raw_end_matches = end_pattern.findall(stdout)
 
-    # Count completed executions for the specific function only
+    # Normalize end matches by stripping optional trailing ':timing' from the last group
+    normalized_end_matches = []
+    for m in raw_end_matches:
+        if m and len(m) == 5:
+            last = m[4]
+            parts = last.split(":")
+            if len(parts) > 1:
+                # keep everything except the last timing segment
+                last = ":".join(parts[:-1])
+            normalized_end_matches.append((m[0], m[1], m[2], m[3], last))
+        else:
+            normalized_end_matches.append(m)
+
+    end_matches_set = set(normalized_end_matches)
+
     function_throughput = 0
-
     for start_match in start_matches:
-        # Check if this execution is for the function we're interested in and has a matching end tag
-        # function_name is at index 2 in the match tuple
-        if start_match in end_matches_set and len(start_match) > 2 and start_match[2] == function_name:
+        if len(start_match) > 2 and start_match[2] == function_name and start_match in end_matches_set:
             function_throughput += 1
 
     return function_throughput
Suggestion importance[1-10]: 8


Why: This addresses a subtle correctness bug: end tags may include timing info causing tuple mismatch; normalization aligns with earlier logic and prevents undercounting. High impact for accurate throughput metrics and consistent with PR context.

Impact: Medium
Prevent KeyError on env access

Using direct indexing on os.environ will raise KeyError in environments outside
pytest or if hooks fail, causing hard crashes. Safely read the variables with .get()
and raise the existing RuntimeError only when required fields are missing. This
keeps previous behavior while avoiding unexpected KeyError.

codeflash/code_utils/codeflash_wrap_decorator.py [33-45]

 def extract_test_context_from_frame() -> tuple[str, str | None, str]:
-    # test_module = os.environ.get("CODEFLASH_TEST_MODULE")
-    test_module = os.environ["CODEFLASH_TEST_MODULE"]
+    test_module = os.environ.get("CODEFLASH_TEST_MODULE")
     test_class = os.environ.get("CODEFLASH_TEST_CLASS", None)
-    # test_function = os.environ.get("CODEFLASH_TEST_FUNCTION")
-    test_function = os.environ["CODEFLASH_TEST_FUNCTION"]
+    test_function = os.environ.get("CODEFLASH_TEST_FUNCTION")
 
     if test_module and test_function:
         return (test_module, test_class if test_class else None, test_function)
 
     raise RuntimeError(
         "Test context environment variables not set - ensure tests are run through codeflash test runner"
     )
Suggestion importance[1-10]: 7


Why: The suggestion correctly identifies that direct os.environ indexing can raise KeyError and proposes a safe .get() fallback without changing semantics. It's contextually accurate and reduces unexpected crashes, though not critical.

Impact: Medium
Category: General
Avoid blank class env leakage

Writing an empty string to CODEFLASH_TEST_CLASS forces downstream consumers to distinguish
"no class" from "unset". Use pop(..., None) before setup to avoid stale values, and
only set CODEFLASH_TEST_CLASS when a class exists. This prevents confusing blank
class names and ensures clean state per test.

codeflash/verification/pytest_plugin.py [454-474]

 @pytest.hookimpl(tryfirst=True)
 def pytest_runtest_setup(self, item: pytest.Item) -> None:
+    # Clear any previous test context to avoid leakage between tests
+    for var in ("CODEFLASH_TEST_MODULE", "CODEFLASH_TEST_CLASS", "CODEFLASH_TEST_FUNCTION"):
+        os.environ.pop(var, None)
+
     test_module_name = item.module.__name__ if item.module else "unknown_module"
 
-    test_class_name = None
-    if item.cls:
-        test_class_name = item.cls.__name__
+    test_class_name = item.cls.__name__ if item.cls else None
 
     test_function_name = item.name
     if "[" in test_function_name:
         test_function_name = test_function_name.split("[", 1)[0]
 
     os.environ["CODEFLASH_TEST_MODULE"] = test_module_name
-    os.environ["CODEFLASH_TEST_CLASS"] = test_class_name or ""
+    if test_class_name:
+        os.environ["CODEFLASH_TEST_CLASS"] = test_class_name
     os.environ["CODEFLASH_TEST_FUNCTION"] = test_function_name
 
 @pytest.hookimpl(trylast=True)
 def pytest_runtest_teardown(self, item: pytest.Item) -> None:
     """Clean up test context environment variables after each test."""
     for var in ["CODEFLASH_TEST_MODULE", "CODEFLASH_TEST_CLASS", "CODEFLASH_TEST_FUNCTION"]:
         os.environ.pop(var, None)
Suggestion importance[1-10]: 6


Why: Clearing env vars before setup and omitting empty class values improves robustness and state isolation. It's a reasonable enhancement with moderate impact; existing code already cleans up in teardown so the issue is not severe.

Impact: Low
