Claude Code (GLM 5.1)

Evaluation Result

Generated by: GLM 5.1 (coding AI agent in Claude Code CLI)
Codebase: Tianshou — a Python reinforcement learning library (~26K lines, 43 source files)
Date: 2026-04-14

Serena’s cross-file refactoring turned an 8-step, error-prone rename chain into a single atomic call. After measuring the delta across 20 tasks, I would ask any developer I work with to install it: not because it replaces my built-in tools, but because it makes the hardest part of my job (changing code safely across files) feel reliable.


1. Headline: What Serena Changes#

(a) Tasks where Serena adds capability. Cross-file refactorings (rename, move) collapse from N separate read-find-edit-verify cycles into a single atomic call. Semantic queries (type hierarchy, find implementations, find references) return structured, symbol-accurate results that text search cannot produce without manual filtering. Symbol-level addressing (name paths) is stable across edits, eliminating stale-line-number rework in multi-step sessions.

(b) Tasks where Serena applies but offers no improvement. Single-file, small-bore edits (changing an error message, renaming a local variable with replace_all, inserting a function when you already know the insertion context) are equally fast with both toolsets. For a 1-line tweak in a 22-line method, Edit sends ~200 chars; Serena’s replace_symbol_body sends ~800 chars (the entire method body). The overhead reverses for full-body rewrites of large methods, where Serena sends only the new body while Edit sends old+new.

(c) Tasks outside Serena’s scope. Reading non-code files (config, TOML, docs), free-text pattern search (FIXME, magic constants, log strings), shell operations, and git workflows are all built-in territory. These are not Serena shortcomings — they are outside its design scope.

Verdict: Serena adds substantial, measurable capability in two areas — cross-file refactoring (1 call vs 5–8) and semantic code queries (structured, symbol-accurate results vs flat text matches) — while providing no meaningful delta for small single-file edits or non-code tasks.


2. Added Value and Differences by Area#

2.1 Cross-file renaming: 1 call replaces 5–8#

  • What changes: Renaming CollectStatsBase to BaseCollectStats across 4 files (10 occurrences) required 1 Serena call vs 1 Grep + 3 Reads + 4 Edits = 8 built-in calls.

  • Frequency: Medium. Any non-trivial rename touches 3–10 files.

  • Value per hit: Saves 4–7 calls and eliminates the partial-update risk of a mid-chain failure.

  • Atomicity: Serena’s rename is all-or-nothing. Built-in chain is not — if Edit 3 of 4 fails, 2 files are updated and 2 are not.
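The atomicity difference can be illustrated with a minimal sketch. This is not Serena’s implementation: `chained_rename`, `atomic_rename`, the in-memory `files` dict, and the "name already defined" failure condition are all hypothetical stand-ins. The point is only the control flow: the chained version mutates as it goes, while the atomic version stages every change and commits once.

```python
import re


def _word(name):
    """Whole-identifier pattern (CollectStats must not match LoggedCollectStats)."""
    return re.compile(rf"\b{re.escape(name)}\b")


def chained_rename(files, old, new):
    """Edit files one by one in place, like a chain of separate edit calls."""
    for path in files:
        if _word(new).search(files[path]):  # hypothetical failure condition
            raise ValueError(f"{path}: {new!r} is already defined")
        files[path] = _word(old).sub(new, files[path])  # earlier files stay changed


def atomic_rename(files, old, new):
    """Stage every change first; touch `files` only if all of them succeed."""
    staged = {}
    for path, text in files.items():
        if _word(new).search(text):  # same hypothetical failure condition
            raise ValueError(f"{path}: {new!r} is already defined")
        staged[path] = _word(old).sub(new, text)
    files.update(staged)  # commit point, reached only when nothing failed
```

If the second of two files triggers the failure, `chained_rename` leaves the first file renamed and the second untouched, whereas `atomic_rename` leaves both exactly as they were.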

2.2 Symbol moving: 1 call replaces 5+#

  • What changes: Moving _nullable_slice to another module required 1 Serena call (moved definition + updated imports in both source and target). Built-in equivalent: read function → write to target → edit source (remove + add import) → edit target (add dependency import) = 5+ calls.

  • Frequency: Low. Module reorganization happens infrequently.

  • Value per hit: Saves 4+ calls and automates the most error-prone step (getting imports right).

  • Caveat: The move tool created a circular import (source imports from target, target imports from source). The tool does not detect or prevent this.
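A manual review for this caveat can be approximated with a small cycle check. The sketch below assumes the import graph has already been extracted into a plain dict (module name to the set of modules it imports); `find_import_cycle` is a hypothetical helper, not part of Serena or Tianshou.

```python
from __future__ import annotations


def find_import_cycle(imports: dict[str, set[str]]) -> list[str] | None:
    """Depth-first search over the import graph; return one cycle or None."""
    visiting: list[str] = []  # modules on the current DFS path
    done: set[str] = set()    # modules fully explored without finding a cycle

    def visit(mod: str) -> list[str] | None:
        if mod in done:
            return None
        if mod in visiting:  # back edge: the path from `mod` to here is a cycle
            return visiting[visiting.index(mod):] + [mod]
        visiting.append(mod)
        for dep in sorted(imports.get(mod, ())):
            cycle = visit(dep)
            if cycle:
                return cycle
        visiting.pop()
        done.add(mod)
        return None

    for mod in sorted(imports):
        cycle = visit(mod)
        if cycle:
            return cycle
    return None


# The two imports added by the move in this evaluation, as a graph.
after_move = {
    "tianshou.data.collector": {"tianshou.data.utils.converter"},
    "tianshou.data.utils.converter": {"tianshou.data.collector"},
}
cycle = find_import_cycle(after_move)
```

Running a check like this after each move would flag the source/target loop before it surfaces as an import-time error.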

2.3 Reference finding: Symbol-accurate vs text-matched#

  • What changes: Finding references to CollectStats with Serena returned 10 semantically categorized results (IMPORT_ELEMENT, REFERENCE_EXPRESSION, NAMED_PARAMETER, etc.) with no false positives from LoggedCollectStats. Grep returned 70+ lines including false positives from LoggedCollectStats and noise from docstrings/comments.

  • Frequency: High. “Who uses this?” is one of the most common codebase questions.

  • Value per hit: Eliminates manual false-positive filtering. However, Serena’s output was 64KB (with context snippets) vs Grep’s ~5KB — a tradeoff of precision vs verbosity.
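The false-positive mechanism is easy to reproduce on a toy file. The sketch below uses Python’s standard `ast` module as a stand-in for a semantic index (Serena’s backend is different); `SOURCE`, `summarize`, `text_matches`, and `symbol_references` are made up for illustration.

```python
import ast

SOURCE = '''
class CollectStats: ...
class LoggedCollectStats(CollectStats): ...

def summarize(stats: CollectStats) -> None:
    """Summarize a CollectStats instance."""
'''


def text_matches(source, name):
    """Grep-like: every line number whose text contains the substring."""
    return [i for i, line in enumerate(source.splitlines(), 1) if name in line]


def symbol_references(source, name):
    """AST-based: line numbers where `name` is used as an identifier."""
    tree = ast.parse(source)
    return sorted(
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.Name) and node.id == name
    )
```

On this input the substring search reports the definition, the `LoggedCollectStats` line, the annotation, and the docstring, while the identifier search reports only the base-class use and the annotation.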

2.4 Type hierarchy: 1 call vs 2+ grep-and-parse cycles#

  • What changes: Getting the full type hierarchy of BaseCollector (supertypes: ABC → object; subtypes: Collector → AsyncCollector) required 1 Serena call. Built-in: parse the class definition line for bases + grep for inheritors + recurse = 2–3 calls, and still no access to external library types.

  • Frequency: Medium. Common when navigating unfamiliar code.

  • Value per hit: Saves 1–2 calls and returns transitive chains that built-ins can’t produce without iteration.
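For intuition, the transitive chain reported above can be reproduced with runtime introspection on toy classes. This is only an analogy: the classes below are empty stand-ins named after the Tianshou ones, and a static tool would derive the same chain without importing anything.

```python
from abc import ABC


class BaseCollector(ABC): ...
class Collector(BaseCollector): ...
class AsyncCollector(Collector): ...


def supertypes(cls):
    """All ancestors in method-resolution order, excluding the class itself."""
    return [c.__name__ for c in cls.__mro__[1:]]


def subtypes(cls):
    """All descendants, found by recursing through __subclasses__()."""
    found = []
    for sub in cls.__subclasses__():
        found.append(sub.__name__)
        found.extend(subtypes(sub))
    return found
```

The recursion in `subtypes` is the step a Grep chain has to perform by hand, one search per level of the hierarchy.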

2.5 Structural overview: Hierarchical vs flat#

  • What changes: get_symbols_overview(depth=1) on a 1551-line file returned a structured hierarchy (classes → methods + attributes) in one call. Grep for ^(class |def |    def ) returned a flat 60-line list of definitions with line numbers but no attribute information or nesting.

  • Frequency: High. Opening any unfamiliar file.

  • Value per hit: Serena shows Protocol fields, class attributes, and method grouping. Grep shows line numbers for navigation but requires more work to understand structure.
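The difference between the two outputs can be sketched with the standard `ast` module standing in for a symbol index. `SOURCE` is a made-up miniature of a stats class (only the attribute names `collect_time` and `returns` come from the evaluated file); `outline` and `grep_like` are hypothetical helpers.

```python
import ast
import re

SOURCE = '''
class CollectStats:
    collect_time: float = 0.0
    returns: list

    def refresh(self): ...

def helper(): ...
'''


def outline(source):
    """Hierarchical view: classes map to their annotated attributes and methods."""
    result = {}
    for node in ast.parse(source).body:
        if isinstance(node, ast.ClassDef):
            members = []
            for child in node.body:
                if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    members.append(child.name + "()")
                elif isinstance(child, ast.AnnAssign) and isinstance(child.target, ast.Name):
                    members.append(child.target.id)
            result[node.name] = members
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            result[node.name + "()"] = []
    return result


def grep_like(source):
    """Flat view: the same line pattern used in the Grep comparison."""
    pattern = re.compile(r"^(class |def |    def )")
    return [line.strip() for line in source.splitlines() if pattern.match(line)]
```

The flat list loses both the attributes and the information that `refresh` belongs to `CollectStats`, which is the gap described above.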

2.6 Method body retrieval: Targeted read vs range read#

  • What changes: Reading Collector._collect (330+ lines) required 1 Serena find_symbol call using the name path. Built-in: need to know the line range (773–1103) from a prior Grep, then Read(offset=773, limit=330).

  • Frequency: High. “Show me this method” is routine.

  • Value per hit: Serena uses stable addressing (name path). Built-in uses ephemeral line numbers that go stale after edits. The practical difference is small when you just read once, but compounds in edit-then-read-again workflows.
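Why a name path survives edits while a line range does not can be shown with a small resolver. This is a sketch built on the standard `ast` module, not Serena’s `find_symbol`; `resolve_name_path` and the two toy classes are hypothetical.

```python
import ast

SOURCE = '''
class BaseCollector:
    def _collect(self):
        raise NotImplementedError

class Collector(BaseCollector):
    def _collect(self):
        return "sync"
'''

_DEFS = (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)


def resolve_name_path(source, name_path):
    """Return the source text of a symbol addressed like 'Collector/_collect'."""
    scope = ast.parse(source).body
    node = None
    for part in name_path.split("/"):
        node = next(
            (n for n in scope if isinstance(n, _DEFS) and n.name == part),
            None,
        )
        if node is None:
            raise KeyError(name_path)
        scope = node.body  # descend into the class or function just matched
    return ast.get_source_segment(source, node)


body = resolve_name_path(SOURCE, "Collector/_collect")
# Prepending lines shifts every line number but not the name path.
shifted = "# new header line\n" * 5 + SOURCE
same_body = resolve_name_path(shifted, "Collector/_collect")
```

The same lookup also separates `BaseCollector/_collect` from `Collector/_collect`, which a search for `def _collect` cannot do without inspecting the enclosing class.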

2.7 Single-file rename: No meaningful difference#

  • What changes: Renaming _nullable_slice (4 occurrences in one file): Serena rename = 1 call. Edit replace_all = 1 call. Identical results.

  • Frequency: High.

  • Value per hit: Zero. Both tools handle this equally well.

2.8 Small edits (1–3 lines): Edit is more token-efficient#

  • What changes: Changing one error message in a 22-line method: Edit sends ~200 chars (old + new string). Serena replace_symbol_body sends ~800 chars (entire method body).

  • Frequency: Very high.

  • Value per hit: Edit saves ~600 chars of payload per small edit. This reverses for full-body rewrites of 50+ line methods.

2.9 Insertion: Stable address vs text anchor#

  • What changes: Inserting a new method after refresh_all_sequence_stats: Serena used 1 insert_after_symbol call with a name path (no line number needed). Edit used 1 Read + 1 Edit with a text anchor (surrounding context for uniqueness).

  • Frequency: Medium.

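  • Value per hit: Saves 1 Read call when the target symbol is already known. When a find_symbol call is used to confirm the target first (as in Task 8), both approaches take 2 calls. Both produce identical results.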
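Symbol-anchored insertion can be sketched with `ast` node positions. The helper below is hypothetical and much simpler than a real tool (it ignores decorators of the following symbol and trailing comments), but it shows that the caller supplies a name, not a line number or a text anchor.

```python
import ast


def insert_after(source, func_name, new_code):
    """Insert `new_code` right after the function named `func_name`."""
    target = next(
        (
            n
            for n in ast.walk(ast.parse(source))
            if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))
            and n.name == func_name
        ),
        None,
    )
    if target is None:
        raise KeyError(func_name)
    indent = " " * target.col_offset  # match the target's nesting level
    block = [""] + [indent + line if line else line for line in new_code.splitlines()]
    lines = source.splitlines()
    lines[target.end_lineno:target.end_lineno] = block  # end_lineno is 1-based
    return "\n".join(lines) + "\n"
```

The new method inherits the indentation of the symbol it follows, so the same call works for module-level functions and for methods inside a class.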

Verdict: Serena’s value concentrates in cross-file operations and semantic queries. For single-file text edits, the built-ins are equally capable and often more token-efficient.


3. Detailed Evidence, Grouped by Capability#

3.1 Codebase Understanding#

Task 1: Repo overview#

Both toolsets use the same approach (ls, find, directory listing). No Serena advantage here.

Task 2: Structural overview of a large file (collector.py, 1551 lines)#

| Step | Serena | Built-in |
| --- | --- | --- |
| Call | get_symbols_overview(depth=1) | Grep "^(class \|def \|    def )" |
| Result | Hierarchical: 14 classes, their methods, attributes, and module-level functions | Flat list of 60 class/function definitions with line numbers |
| Output size | ~1.5KB structured JSON | ~3KB text |
| Next step | find_symbol("Collector/_collect", include_body=True) (direct) | Read(offset=773, limit=330) (needs prior knowledge of line range) |

Serena advantage: Shows attributes (e.g., CollectStats.collect_time, CollectStats.returns) that Grep cannot see. Hierarchical nesting makes the file’s architecture immediately clear.

Built-in advantage: Line numbers enable direct Read calls. Flat output is compact.

Verdict: Serena provides more structural information in one call; Grep’s only extra is line numbers. The gap widens for files with deeply nested classes or dataclass fields.

Task 3: Retrieve a specific method body#

| Step | Serena | Built-in |
| --- | --- | --- |
| Prerequisite | Name path known (Collector/_collect) | Line number known from prior Grep (773) |
| Call | find_symbol(name_path="Collector/_collect", include_body=True) | Read(offset=773, limit=330) |
| Payload sent | ~50 chars (name + path) | ~30 chars (offset + limit) |
| Payload received | Exact method body (~330 lines) | Lines 773–1102 (~330 lines) |
| Correctness | Always exact | Must know/guess the correct limit |

Verdict: Functionally equivalent when line numbers are known. Serena’s name-path addressing degrades gracefully across edits; line numbers do not.

Task 4: Find all references to CollectStats#

| Metric | Serena find_referencing_symbols | Built-in Grep |
| --- | --- | --- |
| Calls | 1 | 1 |
| Output size | 64KB (with context snippets) | ~5KB (70 lines) |
| False positives | 0 (excludes LoggedCollectStats) | 5+ lines from LoggedCollectStats |
| Noise (comments/docs) | Some (docstrings categorized) | Significant (docstrings, comments matched) |
| Semantic categories | Yes (IMPORT, PARAMETER, DECLARATION, REFERENCE) | No |

Serena advantage: Zero false positives. Semantic categorization. Shows import paths and parameter usage separately from code references.

Built-in advantage: Roughly 13x smaller output (~5KB vs 64KB). Faster to scan visually.

Verdict: Serena is more precise but more verbose. For “who uses this in code?” both work; for “rename this safely” Serena’s precision is necessary.

Task 5: Type hierarchy of BaseCollector#

| Metric | Serena type_hierarchy | Built-in (Grep chain) |
| --- | --- | --- |
| Calls | 1 | 2–3 (grep subclasses, parse superclass, recurse) |
| Result | ABC → object (super), Collector → AsyncCollector (sub) | Partial: direct sub/supertypes only, no transitive chain |
| External deps | Shows ABC from abc.pyi | Cannot access |

Verdict: Serena returns complete, transitive hierarchy including external library types in one call. Built-in approach requires iteration and cannot inspect external deps.

Task 6: External dependency symbol lookup#

Serena can read external dependency symbols IF you have the path from a prior tool result (e.g., <ext:abc.pyi|16198efc> from type_hierarchy). Direct search with search_deps=True returned empty results for numpy.array and torch.Tensor. The JetBrains IDE indexing is limited to what the language server has resolved.

Built-in: Can Read site-packages files if you know the path, but discovery is manual.

Verdict: Minor Serena advantage — external symbols are accessible through tool chains but not through direct search. Neither toolset makes this easy.

3.2 Single-File Edits#

Task 7a: Small tweak (1-line change in 22-line method)#

| Metric | Edit | Serena replace_symbol_body |
| --- | --- | --- |
| Prerequisite | 1 Read (6 lines of context) | 1 find_symbol (gets full body) |
| Payload sent | ~200 chars (old + new string) | ~800 chars (full method body) |
| Payload received | Success message | “OK” |
| Total payload | ~200 chars edit + ~300 chars read = ~500 | ~800 chars edit + ~800 chars read = ~1600 |

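Verdict: Edit is roughly 3x more token-efficient for small tweaks inside methods once the prerequisite read is counted (~500 vs ~1600 chars), and 4x on the edit payload alone (~200 vs ~800 chars).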

Task 7b: Medium rewrite (~6 line changes in 20-line method)#

| Metric | Edit | Serena replace_symbol_body |
| --- | --- | --- |
| Payload sent | ~700 chars (old + new, full method) | ~500 chars (new body only) |
| Prerequisite read | ~500 chars | ~500 chars (from prior find_symbol) |

Verdict: Roughly equal. For medium rewrites, payloads converge.

Task 7c: Large rewrite (full body of 55+ line method)#

For a full-body rewrite of a 55-line method:

  • Edit: old (~55 lines) + new (~55 lines) = ~110 lines sent

  • Serena: new body only (~55 lines) sent

Verdict: Serena is ~2x more token-efficient for full-body rewrites. The advantage grows linearly with method size.

Task 8: Insert a new function after an existing one#

| Step | Serena | Built-in |
| --- | --- | --- |
| 1 | find_symbol("refresh_all_sequence_stats") to confirm target | Read(offset=250, limit=10) to find insertion point |
| 2 | insert_after_symbol(name_path, body) | Edit(old_string=anchor, new_string=anchor+new_fn) |
| Total calls | 2 | 2 |
| Payload sent | New function body only (~300 chars) | Anchor context + new function (~400 chars) |

Verdict: No meaningful difference. Both require 2 calls and produce identical results.

Task 9: Rename a private helper (single-file, 4 occurrences)#

| Metric | Serena rename | Edit replace_all |
| --- | --- | --- |
| Calls | 1 | 1 |
| Prerequisites | None | Must have Read the file first |
| Result | All 4 occurrences renamed | All 4 occurrences renamed |

Verdict: Functionally identical. Both are 1 call. Edit’s Read-first requirement is usually satisfied from prior exploration.

3.3 Multi-File Changes#

Task 10: Cross-file rename (CollectStatsBase → BaseCollectStats, 4 files, 10 occurrences)#

| Step | Serena | Built-in |
| --- | --- | --- |
| Find references | Automatic | 1 Grep call |
| Read files | Not needed | 3 Read calls (Edit’s prerequisite) |
| Apply edits | 1 rename call (atomic) | 4 Edit calls (one per file) |
| Verify | Return: “Success” | Manual (4 success messages) |
| Total calls | 1 | 8 (1 grep + 3 reads + 4 edits) |

Verdict: Serena converts an 8-call manual pipeline into 1 atomic operation. This is the single largest efficiency gain observed.

Task 11: Move symbol to another module#

Serena’s move tool:

  1. Moved _nullable_slice from collector.py to converter.py

  2. Added import in source file: from tianshou.data.utils.converter import _nullable_slice

  3. Added dependency import in target: from tianshou.data.collector import _TArrLike

  4. Removed definition from source

Built-in equivalent: Read function body → Write to target → Edit source (remove definition + add import) → Edit target (add dependency import) = 5+ calls.

Issue: The move created a circular import (source ↔ target). Serena does not detect or prevent this.

Verdict: Serena automates the most tedious part (import management) but doesn’t guard against circular dependencies. Saves 4+ calls at the cost of needing manual circular-import review.

Task 12: Move file (segtree.py to parent directory)#

Serena’s move tool:

  1. Moved the file

  2. Updated the one direct import in __init__.py (tianshou.data.utils.segtree → tianshou.data.segtree)

  3. Other files (prio.py, tests) imported via re-export and needed no changes

Built-in equivalent: git mv + grep for old import path + edit each file = 3+ calls.

Verdict: Serena saves 1–2 calls and automatically discovers which imports need updating.
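The "edit each file" step of the built-in route amounts to rewriting a dotted module path in import statements. A minimal regex sketch follows; `rewrite_module_path` is a hypothetical helper, the imported name `SegmentTree` is illustrative, and the pattern deliberately handles only `from X import ...` and `import X` at the start of a line.

```python
import re


def rewrite_module_path(source, old_module, new_module):
    """Replace `old_module` with `new_module` in simple import statements."""
    pattern = re.compile(
        # Line-leading "from"/"import", then the exact module path; the
        # lookahead keeps longer names such as segtree_extra untouched.
        rf"^([ \t]*(?:from|import)[ \t]+){re.escape(old_module)}(?![\w.])",
        re.MULTILINE,
    )
    return pattern.sub(lambda m: m.group(1) + new_module, source)


before = (
    "from tianshou.data.utils.segtree import SegmentTree\n"
    "import tianshou.data.utils.segtree_extra\n"
)
after = rewrite_module_path(
    before, "tianshou.data.utils.segtree", "tianshou.data.segtree"
)
```

A text rewrite like this still needs a prior search to find which files import the old path, which is the discovery step the move tool performed automatically.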

Task 12b: Safe delete#

Serena’s safe_delete correctly refused to delete _HACKY_create_info_batch because it has a usage at line 730. The propagate=true mode (delete symbol + all call sites) failed for all tested symbols in this codebase.

Verdict: The usage-check is valuable (saves you from deleting a used symbol). The propagation feature was non-functional for the tested Python symbols.
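The usage check itself follows a simple rule: look for references first, delete only if there are none. The sketch below applies that rule to top-level functions in a single file using the standard `ast` module; `usages` and `delete_if_unused` are hypothetical names, and the real tool works across files.

```python
import ast


def usages(source, name):
    """Line numbers where `name` is read (called, passed, or otherwise loaded)."""
    return sorted(
        n.lineno
        for n in ast.walk(ast.parse(source))
        if isinstance(n, ast.Name) and n.id == name and isinstance(n.ctx, ast.Load)
    )


def delete_if_unused(source, name):
    """Remove the top-level function `name`, refusing if it is still used."""
    used_at = usages(source, name)
    if used_at:
        raise RuntimeError(f"{name} is still used at line(s) {used_at}")
    target = next(
        (
            n
            for n in ast.parse(source).body
            if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef)) and n.name == name
        ),
        None,
    )
    if target is None:
        raise KeyError(name)
    lines = source.splitlines(keepends=True)
    del lines[target.lineno - 1 : target.end_lineno]  # drop the definition only
    return "".join(lines)
```

Refusing with the offending line numbers mirrors the behaviour observed for `_HACKY_create_info_batch`, where the tool declined the deletion and pointed at the remaining usage.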

Task 13: Inline#

Serena’s inline_symbol failed for all tested symbols (_nullable_slice, BaseCollector/env_num). The tool appears to have limited Python support for inlining.

Verdict: No successful inline demonstrated. Built-in manual inlining remains the only option.

3.4 Reliability and Correctness#

Task 14: Scope precision#

Serena distinguishes BaseCollector/_collect, Collector/_collect, and AsyncCollector/_collect by name path. Grep for def _collect matches all three — manual filtering by class is required.

Verdict: Serena’s name-path addressing eliminates ambiguity that text search cannot resolve.

Task 15: Atomicity#

Serena’s cross-file rename is atomic: 4 files updated in 1 call, all-or-nothing. Built-in: 4 separate Edit calls — if call 3 fails, 2 files are updated and 2 are not.

Verdict: Serena provides atomicity for cross-file operations. Built-in chains are inherently non-atomic.

Task 16: Success signals#

Both return clear success/failure indicators. No meaningful difference.

Verdict: Equal.

3.5 Workflow Effects#

Task 17: Chain three edits in one file#

Edit with text matching: 3 sequential calls, no re-reads needed between them. Text anchors are immune to line-number shifts from prior edits.

Serena replace_symbol_body: 3 sequential calls, no re-reads needed. Name-path addressing is also immune to line-number shifts.

Verdict: No meaningful difference for chained single-file edits.

Task 18: Multi-step exploration across edits#

Serena’s name-path results from exploration remain valid after edits. Built-in line numbers go stale, but Edit uses text matching (not line numbers), so the practical impact is limited to Read calls that need updated offsets.

Verdict: Minor Serena advantage. Name-path stability eliminates the need to re-scan after edits.

3.6 Non-Interesting Tasks#

Task 19: Read non-code file#

Serena tools don’t apply. Read is the correct tool.


4. Token-Efficiency Analysis#

Payload differences across edit sizes#

| Edit type | Edit payload | Serena payload | Winner |
| --- | --- | --- | --- |
| 1-line tweak in 22-line method | ~200 chars | ~800 chars | Edit (4x) |
| 6-line change in 20-line method | ~700 chars | ~500 chars | Roughly equal |
| Full rewrite of 55-line method | ~2200 chars | ~1100 chars | Serena (2x) |
| Full rewrite of 330-line method | ~13,000 chars | ~6,500 chars | Serena (2x) |

Forced reads#

  • Edit requires reading a file before editing it. This adds ~300–2000 chars per file.

  • Serena does not require reading before editing (name-path addressing).

  • For single-file edits where you already read the file, this is neutral.

  • For cross-file operations on files you haven’t read, Serena saves 3–4 forced reads.

Stable vs ephemeral addressing#

  • Serena: name paths (Collector/_collect) are stable across edits. Results from exploration remain valid.

  • Built-in: line numbers are ephemeral. Read results go stale after edits. Edit uses text matching, which is stable.

  • Practical impact: Low for one-shot edits, medium for edit-then-read-again workflows.

Verdict: Edit wins for small tweaks (4x more token-efficient). Serena wins for full-body rewrites (2x more efficient) and cross-file operations (eliminates forced reads). The crossover point is approximately 50% of the method body changing — below that, Edit is more efficient; above that, Serena is.
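The 50% crossover follows from a simple payload model, sketched here under two stated assumptions: Edit sends the changed region twice (old text plus new text of similar length, with no extra context for uniqueness), and replace_symbol_body sends the whole new body once. The function names are hypothetical.

```python
def edit_payload(changed_chars):
    """Edit sends old + new text for the changed region only."""
    return 2 * changed_chars


def serena_payload(body_chars):
    """replace_symbol_body sends the full new body, whatever the change size."""
    return body_chars


def cheaper_tool(body_chars, changed_chars):
    """Name the tool with the smaller payload, or 'equal' at the crossover."""
    edit, serena = edit_payload(changed_chars), serena_payload(body_chars)
    if edit == serena:
        return "equal"
    return "Edit" if edit < serena else "Serena"
```

Under this model the two payloads meet when `2 * changed_chars == body_chars`, that is, when half the body changes. It reproduces the 1-line tweak (about 100 changed chars in an 800-char body: 200 vs 800) and the full 55-line rewrite (1100 changed chars: 2200 vs 1100) from the table above.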


5. Reliability and Correctness (Under Correct Use)#

Precision of matching#

  • Serena: Symbol-accurate. find_referencing_symbols(CollectStats) excludes LoggedCollectStats. No false positives observed.

  • Grep: Text-matched. CollectStats matches LoggedCollectStats, CollectStatsBase, and docstring references. Requires manual filtering.

  • Edit: Text-matched. replace_all replaces exact string matches. For unique symbol names, this is reliable. For common strings, it can over-match.

Scope disambiguation#

  • Serena: Collector/_collect vs AsyncCollector/_collect — correctly distinguished by class-scoped name path.

  • Built-in: def _collect matches all implementations. Must manually verify class context.

Atomicity#

  • Serena cross-file operations: Atomic. Single call, all-or-nothing.

  • Built-in multi-file chains: Non-atomic. Partial state possible if one call fails.

External dependency lookup#

  • Serena: Can read external stubs (e.g., abc.pyi) through paths returned by other tools. Direct search (search_deps=True) returned empty for torch.Tensor and numpy.array. Limited to what JetBrains has indexed.

  • Built-in: Can Read site-packages files if path is known. No semantic indexing.

Verdict: Serena provides strictly more precise semantic matching and atomic cross-file operations. External dependency lookup is limited in both toolsets.


6. Workflow Effects Across a Session#

Where advantages compound#

  1. Explore → edit → re-explore cycle: Serena’s name-path results survive edits. In a long session making multiple changes, this saves re-scanning after each edit. The built-in’s text matching also survives edits (Edit uses text, not line numbers), so the practical gap is smaller than it appears.

  2. Cross-file refactoring chains: Rename a class, then move it, then update all references — each Serena call is atomic and builds on the previous result. With built-ins, each step requires finding all sites, reading files, and editing — the manual equivalent of what Serena automates.

Where advantages diminish#

  1. Repeated small edits in one file: Edit’s text matching is equally stable and more token-efficient for small changes. No Serena advantage.

  2. Exploration without editing: Both toolsets provide usable results. Serena’s are more structured but more verbose.

  3. Non-Python files: Serena’s JetBrains backend provides no value for config files, shell scripts, markdown, or notebooks.

Verdict: Serena’s advantages compound in multi-step cross-file refactoring sessions. They do not compound for single-file iterative editing or non-code work.


7. Unique Capabilities#

  1. Atomic cross-file rename/move — No built-in equivalent. The closest manual process is a grep-find-edit chain that is non-atomic and error-prone. Frequency: medium. Impact: high (eliminates partial-update risk).

  2. Semantic reference finding with categorization: find_referencing_symbols returns zero-false-positive results categorized by usage type (import, parameter, declaration, reference). Built-in Grep cannot distinguish these. Frequency: high. Impact: medium (saves manual filtering).

  3. Type hierarchy traversal — Returns transitive super/subtype chains including external library types in one call. Built-in requires iteration and cannot reach external deps. Frequency: medium. Impact: medium.

  4. Symbol-scoped body retrieval — Read a specific method by name path without reading the surrounding file. Built-in Read requires line-range knowledge. Frequency: high. Impact: low (both require 1 call, difference is stable vs ephemeral addressing).

Verdict: Four unique capabilities, all semantic-code operations. The most impactful is atomic cross-file refactoring. The first three have no practical built-in equivalent; for the fourth, built-in Read matches the call count and differs only in addressing stability.


8. Tasks Outside Serena’s Scope (Built-In Only)#

| Task | Tool | Frequency | Share of daily work |
| --- | --- | --- | --- |
| Read config/TOML/yaml files | Read | High | |
| Free-text search (log strings, TODOs, URLs) | Grep | High | |
| File discovery by name pattern | Glob | Medium | |
| Shell commands (git, pip, pytest) | Bash | High | |
| Write new files from scratch | Write | Medium | |
| Read images, notebooks, PDFs | Read | Low | |

Estimated share of daily work covered by built-in-only tasks: 20–30%. The remaining 70–80% involves reading, editing, and navigating code where Serena’s semantic tools are applicable.

Verdict: Serena covers the code-editing and code-navigation portions of a session. Config reading, text search, and shell operations remain built-in territory.


9. Practical Usage Rule#

Use Serena for: Any cross-file refactoring (rename, move), any “who uses this?” query, any type-hierarchy navigation, any full-method-body replacement, and any situation where you need symbol-accurate results without false positives.

Use built-ins for: Small edits (1–3 lines) inside methods, free-text search, reading non-code files, file discovery, and shell operations.

Hybrid pattern (most efficient): Use Serena to explore (overview, find symbol, find references) and Edit for small targeted changes. Use Serena for any cross-file refactoring. Use Read/Grep/Glob for non-code tasks and text search. This combination captures the strengths of both toolsets.

Verdict: The optimal workflow uses Serena’s semantic tools for code navigation and cross-file refactoring, and built-in Edit for small single-file changes. The two toolsets are complementary — Serena handles the structured code operations, built-ins handle the text and system operations.


Appendix: Call Count Summary#

| Task | Serena calls | Built-in calls | Delta |
| --- | --- | --- | --- |
| Structural overview (1 file) | 1 | 1 | 0 |
| Method body retrieval | 1 | 1 | 0 |
| Find references (1 symbol) | 1 | 1 | 0 (but Serena has 0 false positives vs Grep’s 5+) |
| Type hierarchy | 1 | 2–3 | −1 to −2 |
| Small edit (1 line in 22-line method) | 1 (+1 read) | 1 (+1 read) | 0 |
| Medium edit (6 lines in 20-line method) | 1 | 1 | 0 |
| Insert new method | 1 (+1 confirm) | 1 (+1 read) | 0 |
| Single-file rename (4 occurrences) | 1 | 1 | 0 |
| Cross-file rename (4 files, 10 occurrences) | 1 | 8 | −7 |
| Move symbol to another module | 1 | 5+ | −4+ |
| Move file + update imports | 1 | 3+ | −2+ |
| Safe delete (usage check) | 1 | 1 grep | 0 |

Largest single-task delta: cross-file rename saves 7 calls and provides atomicity.