Skip to content

Conversation

misrasaurabh1
Copy link
Contributor

@misrasaurabh1 misrasaurabh1 commented Sep 13, 2025

User description

implements variable name anonymization, according to the order in which they are seen


PR Type

Enhancement, Tests


Description

  • Add AST-based code deduplication utilities

  • Normalize local variable names robustly

  • Integrate normalization into optimizer cache

  • Add comprehensive deduplication tests


Diagram Walkthrough

flowchart LR
  A["New dedup utils (normalize/duplicate)"] -- "imported by" --> B["function_optimizer"]
  B["function_optimizer"] -- "uses for candidate cache key" --> C["normalized code"]
  A -- "validated by" --> D["tests/test_code_deduplication.py"]
Loading

File Walkthrough

Relevant files
Enhancement
deduplicate_code.py
Introduce AST-based code normalization utilities                 

codeflash/code_utils/deduplicate_code.py

  • Add AST transformer to normalize local vars
  • Preserve functions, classes, params, imports, builtins
  • Add docstring removal and fingerprinting
  • Provide duplicate detection helper
+235/-0 
function_optimizer.py
Use normalized code for candidate deduplication                   

codeflash/optimization/function_optimizer.py

  • Import new normalize_code utility
  • Replace raw ast.unparse with normalize_code
  • Improve candidate deduplication robustness
+2/-1     
Tests
test_code_deduplication.py
Add tests validating code normalization and dedup               

tests/test_code_deduplication.py

  • Add tests for normalization equivalence
  • Verify handling of imports, exceptions, nesting
  • Ensure differing logic is not equal
+135/-0 

Copy link

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Possible False Positives

Exception handler variable names and comprehension targets are normalized but parameter names, globals, and imports are preserved; consider whether attribute names (e.g., obj.attr) and keyword argument names should remain untouched to avoid structurally different code being treated as duplicates.

def visit_ExceptHandler(self, node):
    """Normalize exception variable names"""
    if node.name:
        node.name = self.get_normalized_name(node.name)
    return self.generic_visit(node)

def visit_comprehension(self, node):
    """Normalize comprehension target variables"""
    # Create new scope for comprehension
    old_mapping = dict(self.var_mapping)
    old_counter = self.var_counter

    # Process the comprehension
    node = self.generic_visit(node)

    # Restore scope
    self.var_mapping = old_mapping
    self.var_counter = old_counter
    return node
Docstring Removal Robustness

Docstring stripping assumes first body element is a Constant; this may miss cases like formatted strings or future annotations and could affect round-tripping; verify behavior across Python versions and ensure it doesn’t remove non-docstring top-level constants.

def remove_docstrings_from_ast(node):
    """Remove docstrings from AST nodes."""
    # Process all nodes in the tree, but avoid recursion
    for current_node in ast.walk(node):
        if isinstance(current_node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef, ast.Module)):
            if (
                current_node.body
                and isinstance(current_node.body[0], ast.Expr)
                and isinstance(current_node.body[0].value, ast.Constant)
                and isinstance(current_node.body[0].value.value, str)
            ):
                current_node.body = current_node.body[1:]
Cache Key Stability

Using normalized code as a cache key changes prior behavior; confirm normalization preserves semantically relevant differences (e.g., differing literals or control flow) to avoid over-deduplication that might skip viable candidates.

normalized_code = normalize_code(candidate.source_code.flat.strip())
if normalized_code in ast_code_to_id:
    logger.info(
        "Current candidate has been encountered before in testing, Skipping optimization candidate."
    )

Copy link

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
General
Guard normalization errors

normalize_code can raise ValueError on bad syntax; if uncaught here it aborts the
loop. Wrap normalization in the existing try/except block or a local try to skip
invalid candidates without crashing.

codeflash/optimization/function_optimizer.py [523-527]

-normalized_code = normalize_code(candidate.source_code.flat.strip())
+try:
+    normalized_code = normalize_code(candidate.source_code.flat.strip())
+except ValueError as e:
+    logger.error(e)
+    continue
 if normalized_code in ast_code_to_id:
     logger.info(
         "Current candidate has been encountered before in testing, Skipping optimization candidate."
     )
Suggestion importance[1-10]: 7

__

Why: normalize_code may raise ValueError, and the current normalization call is outside the existing try/except; wrapping it prevents the loop from aborting on bad syntax. This is accurate and improves robustness, though it's a localized error-handling enhancement.

Medium
Possible issue
Keep comprehension scope consistent

Resetting mappings for comprehensions breaks name consistency between targets and
subsequent loads in the same scope, producing mismatched normalization. Avoid
snapshotting/restoring the mapping; let comprehensions use the current scope so
targets and uses normalize consistently.

codeflash/code_utils/deduplicate_code.py [133-145]

 def visit_comprehension(self, node):
-    """Normalize comprehension target variables"""
-    # Create new scope for comprehension
-    old_mapping = dict(self.var_mapping)
-    old_counter = self.var_counter
+    """Use current scope for comprehension targets"""
+    return self.generic_visit(node)
 
-    # Process the comprehension
-    node = self.generic_visit(node)
-
-    # Restore scope
-    self.var_mapping = old_mapping
-    self.var_counter = old_counter
-    return node
-
Suggestion importance[1-10]: 6

__

Why: Snapshotting/restoring the mapping inside visit_comprehension can indeed cause inconsistencies between targets/loads across surrounding scope; removing it aligns normalization with actual scope rules. This improves correctness for comprehensions but is not a critical bug.

Low
Preserve exception variable names

Normalizing exception variable names can conflate distinct handlers and change
behavior when exception variables shadow globals/nonlocals/parameters. Preserve
exception variable names like parameters and globals to avoid incorrect
deduplication and potential semantic drift.

codeflash/code_utils/deduplicate_code.py [127-131]

 def visit_ExceptHandler(self, node):
-    """Normalize exception variable names"""
-    if node.name:
-        node.name = self.get_normalized_name(node.name)
+    """Preserve exception variable names (do not normalize)"""
     return self.generic_visit(node)
Suggestion importance[1-10]: 5

__

Why: The existing code normalizes ExceptHandler.name, which could over-normalize and slightly affect deduplication semantics; preserving names is a reasonable minor improvement. However, since deduplication already preserves globals/params and this change is not critical for correctness, impact is moderate.

Low

Signed-off-by: Saurabh Misra <[email protected]>
Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Copy link
Contributor

codeflash-ai bot commented Sep 13, 2025

This PR is now faster! 🚀 Saurabh Misra accepted my code suggestion above.

Copy link
Contributor

codeflash-ai bot commented Sep 13, 2025

⚡️ Codeflash found optimizations for this PR

📄 22% (0.22x) speedup for normalize_code in codeflash/code_utils/deduplicate_code.py

⏱️ Runtime : 96.6 milliseconds 79.2 milliseconds (best of 72 runs)

I created a new dependent PR with the suggested changes. Please review:

If you approve, it will be merged into this PR (branch deduplicate-better).

Copy link
Contributor

codeflash-ai bot commented Sep 13, 2025

⚡️ Codeflash found optimizations for this PR

📄 20% (0.20x) speedup for get_code_fingerprint in codeflash/code_utils/deduplicate_code.py

⏱️ Runtime : 142 milliseconds 118 milliseconds (best of 64 runs)

I created a new dependent PR with the suggested changes. Please review:

If you approve, it will be merged into this PR (branch deduplicate-better).

Copy link
Contributor

codeflash-ai bot commented Sep 13, 2025

⚡️ Codeflash found optimizations for this PR

📄 34% (0.34x) speedup for are_codes_duplicate in codeflash/code_utils/deduplicate_code.py

⏱️ Runtime : 404 milliseconds 300 milliseconds (best of 28 runs)

I created a new dependent PR with the suggested changes. Please review:

If you approve, it will be merged into this PR (branch deduplicate-better).

@misrasaurabh1 misrasaurabh1 merged commit 9ac5d34 into main Sep 14, 2025
16 of 19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant