Evaluation Result
Generated by: GLM 5.1 (coding AI agent in Claude Code CLI)
Codebase: Tianshou — a Python reinforcement learning library (~26K lines, 43 source files)
Date: 2026-04-14
Claude Code (GLM 5.1)
Serena’s cross-file refactoring turned an 8-step, error-prone rename chain into a single atomic call. After measuring the delta across 20 tasks, I would ask any developer I work with to install it: not because it replaces my built-in tools, but because it makes the hardest part of my job (changing code safely across files) reliable.
1. Headline: What Serena Changes
(a) Tasks where Serena adds capability. Cross-file refactorings (rename, move) collapse from N separate read-find-edit-verify cycles into a single atomic call. Semantic queries (type hierarchy, find implementations, find references) return structured, symbol-accurate results that text search cannot produce without manual filtering. Symbol-level addressing (name paths) is stable across edits, eliminating stale-line-number rework in multi-step sessions.
(b) Tasks where Serena applies but offers no improvement. Single-file, small-bore edits (changing an error message, renaming a local variable with replace_all, inserting a function when you already know the insertion context) are equally fast with both toolsets. For a 1-line tweak in a 22-line method, Edit sends ~200 chars; Serena’s replace_symbol_body sends ~800 chars (the entire method body). The overhead reverses for full-body rewrites of large methods, where Serena sends only the new body while Edit sends old+new.
(c) Tasks outside Serena’s scope. Reading non-code files (config, TOML, docs), free-text pattern search (FIXME, magic constants, log strings), shell operations, and git workflows are all built-in territory. These are not Serena shortcomings — they are outside its design scope.
Verdict: Serena adds substantial, measurable capability in two areas — cross-file refactoring (1 call vs 5–8) and semantic code queries (structured, symbol-accurate results vs flat text matches) — while providing no meaningful delta for small single-file edits or non-code tasks.
2. Added Value and Differences by Area
2.1 Cross-file renaming: 1 call replaces 5–8
What changes: Renaming CollectStatsBase to BaseCollectStats across 4 files (10 occurrences) required 1 Serena call vs 1 Grep + 3 Reads + 4 Edits = 8 built-in calls.
Frequency: Medium. Any non-trivial rename touches 3–10 files.
Value per hit: Saves 4–7 calls and eliminates the partial-update risk of a mid-chain failure.
Atomicity: Serena’s rename is all-or-nothing. Built-in chain is not — if Edit 3 of 4 fails, 2 files are updated and 2 are not.
2.2 Symbol moving: 1 call replaces 5+
What changes: Moving _nullable_slice to another module required 1 Serena call (moved definition + updated imports in both source and target). Built-in equivalent: read function → write to target → edit source (remove + add import) → edit target (add dependency import) = 5+ calls.
Frequency: Low. Module reorganization happens infrequently.
Value per hit: Saves 4+ calls and automates the most error-prone step (getting imports right).
Caveat: The move tool created a circular import (source imports from target, target imports from source). The tool does not detect or prevent this.
2.3 Reference finding: Symbol-accurate vs text-matched
What changes: Finding references to CollectStats with Serena returned 10 semantically categorized results (IMPORT_ELEMENT, REFERENCE_EXPRESSION, NAMED_PARAMETER, etc.) with no false positives from LoggedCollectStats. Grep returned 70+ lines including false positives from LoggedCollectStats and noise from docstrings/comments.
Frequency: High. “Who uses this?” is one of the most common codebase questions.
Value per hit: Eliminates manual false-positive filtering. However, Serena’s output was 64KB (with context snippets) vs Grep’s ~5KB — a tradeoff of precision vs verbosity.
2.4 Type hierarchy: 1 call vs 2+ grep-and-parse cycles
What changes: Getting the full type hierarchy of BaseCollector (supertypes: ABC → object; subtypes: Collector → AsyncCollector) required 1 Serena call. Built-in: parse the class definition line for bases + grep for inheritors + recurse = 2–3 calls, and still no access to external library types.
Frequency: Medium. Common when navigating unfamiliar code.
Value per hit: Saves 1–2 calls and returns transitive chains that built-ins can’t produce without iteration.
2.5 Structural overview: Hierarchical vs flat
What changes: get_symbols_overview(depth=1) on a 1551-line file returned a structured hierarchy (classes → methods + attributes) in one call. Grep for ^(class |def | def ) returned a flat 60-line list of definitions with line numbers but no attribute information or nesting.
Frequency: High. Opening any unfamiliar file.
Value per hit: Serena shows Protocol fields, class attributes, and method grouping. Grep shows line numbers for navigation but requires more work to understand structure.
2.6 Method body retrieval: Targeted read vs range read
What changes: Reading Collector._collect (330+ lines) required 1 Serena find_symbol call using the name path. Built-in: need to know the line range (773–1103) from a prior Grep, then Read(offset=773, limit=330).
Frequency: High. “Show me this method” is routine.
Value per hit: Serena uses stable addressing (name path). Built-in uses ephemeral line numbers that go stale after edits. The practical difference is small when you just read once, but compounds in edit-then-read-again workflows.
2.7 Single-file rename: No meaningful difference
What changes: Renaming _nullable_slice (4 occurrences in one file): Serena rename = 1 call. Edit replace_all = 1 call. Identical results.
Frequency: High.
Value per hit: Zero. Both tools handle this equally well.
2.8 Small edits (1–3 lines): Edit is more token-efficient
What changes: Changing one error message in a 22-line method: Edit sends ~200 chars (old + new string). Serena replace_symbol_body sends ~800 chars (entire method body).
Frequency: Very high.
Value per hit: Edit saves ~600 chars of payload per small edit. This reverses for full-body rewrites of 50+ line methods.
2.9 Insertion: Stable address vs text anchor
What changes: Inserting a new method after refresh_all_sequence_stats: Serena used 1 insert_after_symbol call with a name path (no line number needed). Edit used 1 Read + 1 Edit with a text anchor (surrounding context for uniqueness).
Frequency: Medium.
Value per hit: Saves 1 Read call when Serena needs no confirming lookup; with one (as in Task 8), both toolsets take 2 calls. Both produce identical results.
Verdict: Serena’s value concentrates in cross-file operations and semantic queries. For single-file text edits, the built-ins are equally capable and often more token-efficient.
3. Detailed Evidence, Grouped by Capability
3.1 Codebase Understanding
Task 1: Repo overview
Both toolsets use the same approach (ls, find, directory listing). No Serena advantage here.
Task 2: Structural overview of a large file (collector.py, 1551 lines)
| Step | Serena | Built-in |
|---|---|---|
| Call | | |
| Result | Hierarchical: 14 classes, their methods, attributes, and module-level functions | Flat list of 60 class/function definitions with line numbers |
| Output size | ~1.5KB structured JSON | ~3KB text |
| Next step | | |
Serena advantage: Shows attributes (e.g., CollectStats.collect_time, CollectStats.returns) that Grep cannot see. Hierarchical nesting makes the file’s architecture immediately clear.
Built-in advantage: Line numbers enable direct Read calls. Flat output is compact.
Verdict: Serena provides strictly more structural information in one call. The gap widens for files with deeply nested classes or dataclass fields.
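To make the flat-versus-hierarchical contrast concrete, here is a minimal sketch using only Python's standard library. It is not how Serena or Grep work internally; the sample source and the helper names flat_outline and nested_outline are hypothetical, with the field names borrowed from the report.

```python
import ast
import re

# Hypothetical sample; only the names echo the report.
SOURCE = '''
from dataclasses import dataclass

@dataclass
class CollectStats:
    collect_time: float = 0.0
    returns: list = None

    def refresh(self):
        pass

class Collector:
    def _collect(self):
        pass

def helper():
    pass
'''


def flat_outline(source: str) -> list[str]:
    """Grep-style view: definition lines only, no nesting, no attributes."""
    pattern = re.compile(r"^\s*(class|def) (\w+)")
    return [m.group(2) for line in source.splitlines() if (m := pattern.match(line))]


def nested_outline(source: str) -> dict[str, list[str]]:
    """AST view: each top-level class mapped to its annotated fields and methods."""
    outline = {}
    for node in ast.parse(source).body:
        if isinstance(node, ast.ClassDef):
            members = []
            for child in node.body:
                if isinstance(child, ast.AnnAssign) and isinstance(child.target, ast.Name):
                    members.append(child.target.id)
                elif isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    members.append(child.name + "()")
            outline[node.name] = members
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            outline[node.name + "()"] = []
    return outline


print(flat_outline(SOURCE))
# ['CollectStats', 'refresh', 'Collector', '_collect', 'helper']
print(nested_outline(SOURCE))
# {'CollectStats': ['collect_time', 'returns', 'refresh()'], 'Collector': ['_collect()'], 'helper()': []}
```

The flat list loses both the dataclass fields and the information about which class owns which method, which is the gap the task measured.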
Task 3: Retrieve a specific method body
| Step | Serena | Built-in |
|---|---|---|
| Prerequisite | Name path known ( | Line number known from prior Grep (773) |
| Call | | |
| Payload sent | ~50 chars (name + path) | ~30 chars (offset + limit) |
| Payload received | Exact method body (~330 lines) | Lines 773–1102 (~330 lines) |
| Correctness | Always exact | Must know/guess the correct limit |
Verdict: Functionally equivalent when line numbers are known. Serena’s name-path addressing degrades gracefully across edits; line numbers do not.
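The difference between a name path and a line range can be sketched with the standard ast module. This is an illustration under assumptions, not Serena's resolver: the resolve helper and the two sample sources are hypothetical, and the "Class/method" path syntax simply mirrors the form used in this report.

```python
import ast


def resolve(source: str, name_path: str) -> tuple[int, int]:
    """Return (first_line, last_line) of the symbol at a 'Class/method' path.

    Raises StopIteration if any path component is missing (sketch only).
    """
    scope = ast.parse(source).body
    node = None
    for part in name_path.split("/"):
        node = next(
            n for n in scope
            if isinstance(n, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef))
            and n.name == part
        )
        scope = node.body
    return node.lineno, node.end_lineno


BEFORE = '''class BaseCollector:
    def _collect(self):
        raise NotImplementedError

class Collector(BaseCollector):
    def _collect(self):
        return 1
'''

# A later edit adds three lines at the top of the file.
AFTER = '"""Module docstring added by a later edit."""\nimport os\n\n' + BEFORE

print(resolve(BEFORE, "Collector/_collect"))  # (6, 7)
print(resolve(AFTER, "Collector/_collect"))   # (9, 10)
```

The same name path resolves correctly before and after the edit, whereas a cached line range of 6–7 would point at the wrong lines once the file has shifted.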
Task 4: Find all references to CollectStats
| Metric | Serena | Built-in |
|---|---|---|
| Calls | 1 | 1 |
| Output size | 64KB (with context snippets) | ~5KB (70 lines) |
| False positives | 0 (excludes | 5+ lines from |
| Noise (comments/docs) | Some (docstrings categorized) | Significant (docstrings, comments matched) |
| Semantic categories | Yes (IMPORT, PARAMETER, DECLARATION, REFERENCE) | No |
Serena advantage: Zero false positives. Semantic categorization. Shows import paths and parameter usage separately from code references.
Built-in advantage: Output roughly 13x smaller (~5KB vs 64KB). Faster to scan visually.
Verdict: Serena is more precise but more verbose. For “who uses this in code?” both work; for “rename this safely” Serena’s precision is necessary.
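Why text search over-matches can be shown in a few lines. The sketch below compares a substring scan with an identifier-level scan built on the standard ast module; it stands in for the idea of symbol-accurate matching and is not Serena's implementation. The sample source and both helper names are hypothetical.

```python
import ast

# Hypothetical sample: one real usage, plus a similarly named class,
# a docstring mention, and a comment mention.
SOURCE = '''class CollectStats:
    pass

class LoggedCollectStats:
    """Unrelated to CollectStats despite the name."""

def summarize(stats: CollectStats) -> None:
    # CollectStats is mentioned in this comment too
    print(stats)
'''


def text_matches(source: str, name: str) -> list[int]:
    """Grep-style: every line containing the string, wherever it appears."""
    return [i for i, line in enumerate(source.splitlines(), 1) if name in line]


def symbol_matches(source: str, name: str) -> list[int]:
    """Identifier-level: only lines where `name` is used as a name in code."""
    tree = ast.parse(source)
    return sorted({n.lineno for n in ast.walk(tree)
                   if isinstance(n, ast.Name) and n.id == name})


print(text_matches(SOURCE, "CollectStats"))    # [1, 4, 5, 7, 8]
print(symbol_matches(SOURCE, "CollectStats"))  # [7]
```

The text scan reports the declaration, the look-alike class, the docstring, and the comment alongside the one genuine usage on line 7. A word-boundary regex would remove the LoggedCollectStats hit but not the docstring or comment hits.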
Task 5: Type hierarchy of BaseCollector
| Metric | Serena | Built-in (Grep chain) |
|---|---|---|
| Calls | 1 | 2–3 (grep subclasses, parse superclass, recurse) |
| Result | | Partial: direct sub/supertypes only, no transitive chain |
| External deps | Shows | Cannot access |
Verdict: Serena returns complete, transitive hierarchy including external library types in one call. Built-in approach requires iteration and cannot inspect external deps.
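The recursion that a grep chain has to perform by hand is easy to see on live Python classes. The classes below are toy stand-ins that reuse the names from the report; they are not Tianshou's real definitions, and the supertypes and subtypes helpers are hypothetical.

```python
from abc import ABC


# Toy stand-ins that mirror the names in the report.
class BaseCollector(ABC): ...
class Collector(BaseCollector): ...
class AsyncCollector(Collector): ...


def supertypes(cls: type) -> list[str]:
    """Transitive supertypes in method-resolution order, excluding cls itself."""
    return [c.__name__ for c in cls.__mro__[1:]]


def subtypes(cls: type) -> list[str]:
    """Transitive subtypes, found by recursing over direct subclasses."""
    found = []
    for sub in cls.__subclasses__():
        found.append(sub.__name__)
        found.extend(subtypes(sub))
    return found


print(supertypes(BaseCollector))  # ['ABC', 'object']
print(subtypes(BaseCollector))    # ['Collector', 'AsyncCollector']
```

A single grep for subclasses of BaseCollector would find Collector but not AsyncCollector, since the latter never mentions BaseCollector; each extra level costs another search.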
Task 6: External dependency symbol lookup
Serena can read external dependency symbols IF you have the path from a prior tool result (e.g., <ext:abc.pyi|16198efc> from type_hierarchy). Direct search with search_deps=True returned empty results for numpy.array and torch.Tensor. The JetBrains IDE indexing is limited to what the language server has resolved.
Built-in: Can Read site-packages files if you know the path, but discovery is manual.
Verdict: Minor Serena advantage — external symbols are accessible through tool chains but not through direct search. Neither toolset makes this easy.
3.2 Single-File Edits
Task 7a: Small tweak (1-line change in 22-line method)
| Metric | Edit | Serena |
|---|---|---|
| Prerequisite | 1 Read (6 lines of context) | 1 |
| Payload sent | ~200 chars (old + new string) | ~800 chars (full method body) |
| Payload received | Success message | “OK” |
| Total payload | ~200 chars edit + ~300 chars read = ~500 | ~800 chars edit + ~800 chars read = ~1600 |
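Verdict: Edit is roughly 3x more token-efficient for small tweaks inside methods once the prerequisite read is counted (~500 vs ~1600 chars); on the edit payload alone the gap is 4x (~200 vs ~800 chars).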
Task 7b: Medium rewrite (~6 line changes in 20-line method)
| Metric | Edit | Serena |
|---|---|---|
| Payload sent | ~700 chars (old + new, full method) | ~500 chars (new body only) |
| Prerequisite read | ~500 chars | ~500 chars (from prior |
Verdict: Roughly equal. For medium rewrites, payloads converge.
Task 7c: Large rewrite (full body of 55+ line method)
For a full-body rewrite of a 55-line method:
Edit: old (~55 lines) + new (~55 lines) = ~110 lines sent
Serena: new body only (~55 lines) sent
Verdict: Serena is ~2x more token-efficient for full-body rewrites. The ratio stays at about 2x regardless of method size, so the absolute saving grows linearly with it.
Task 8: Insert a new function after an existing one
| Step | Serena | Built-in |
|---|---|---|
| 1 | | |
| 2 | | |
| Total calls | 2 | 2 |
| Payload sent | New function body only (~300 chars) | Anchor context + new function (~400 chars) |
Verdict: No meaningful difference. Both require 2 calls and produce identical results.
Task 9: Rename a private helper (single-file, 4 occurrences)
| Metric | Serena | Edit |
|---|---|---|
| Calls | 1 | 1 |
| Prerequisites | None | Must have Read the file first |
| Result | All 4 occurrences renamed | All 4 occurrences renamed |
Verdict: Functionally identical. Both are 1 call. Edit’s Read-first requirement is usually satisfied from prior exploration.
3.3 Multi-File Changes
Task 10: Cross-file rename (CollectStatsBase → BaseCollectStats, 4 files, 10 occurrences)
| Step | Serena | Built-in |
|---|---|---|
| Find references | Automatic | 1 Grep call |
| Read files | Not needed | 3 Read calls (Edit’s prerequisite) |
| Apply edits | 1 rename call (atomic) | 4 Edit calls (one per file) |
| Verify | Return: “Success” | Manual (4 success messages) |
| Total calls | 1 | 8 (1 grep + 3 reads + 4 edits) |
Verdict: Serena converts an 8-call manual pipeline into 1 atomic operation. This is the single largest efficiency gain observed.
Task 11: Move symbol to another module
Serena’s move tool:
Moved _nullable_slice from collector.py to converter.py
Added import in source file: from tianshou.data.utils.converter import _nullable_slice
Added dependency import in target: from tianshou.data.collector import _TArrLike
Removed definition from source
Built-in equivalent: Read function body → Write to target → Edit source (remove definition + add import) → Edit target (add dependency import) = 5+ calls.
Issue: The move created a circular import (source ↔ target). Serena does not detect or prevent this.
Verdict: Serena automates the most tedious part (import management) but doesn’t guard against circular dependencies. Saves 4+ calls at the cost of needing manual circular-import review.
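The manual circular-import review can be partly automated. The following is a minimal sketch, assuming module sources are available as strings keyed by dotted module name; imported_modules and find_cycle are hypothetical helpers, and the two sample modules contain only the import lines quoted in this task. It flags static cycles for review and does not predict whether a given cycle fails at runtime.

```python
import ast


def imported_modules(source: str) -> set[str]:
    """Collect absolute module names imported by a piece of source code."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            found.add(node.module)
    return found


def find_cycle(sources: dict[str, str]) -> list[str]:
    """Depth-first search over the import graph; return one cycle, or []."""
    known = set(sources)
    graph = {name: sorted(imported_modules(src) & known)
             for name, src in sources.items()}
    visiting: list[str] = []
    done: set[str] = set()

    def visit(module: str) -> list[str]:
        if module in visiting:  # back edge: we have come full circle
            return visiting[visiting.index(module):] + [module]
        if module in done:
            return []
        visiting.append(module)
        for dep in graph[module]:
            cycle = visit(dep)
            if cycle:
                return cycle
        visiting.pop()
        done.add(module)
        return []

    for module in sorted(graph):
        cycle = visit(module)
        if cycle:
            return cycle
    return []


# The two import lines the move produced, as reported in Task 11.
SOURCES = {
    "tianshou.data.collector":
        "from tianshou.data.utils.converter import _nullable_slice\n",
    "tianshou.data.utils.converter":
        "from tianshou.data.collector import _TArrLike\n",
}

print(find_cycle(SOURCES))
# ['tianshou.data.collector', 'tianshou.data.utils.converter', 'tianshou.data.collector']
```

Running a check like this after each move would surface the source ↔ target loop immediately instead of at import time.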
Task 12: Move file (segtree.py to parent directory)
Serena’s move tool:
Moved the file
Updated the one direct import in __init__.py (tianshou.data.utils.segtree → tianshou.data.segtree)
Other files (prio.py, tests) imported via re-export and needed no changes
Built-in equivalent: git mv + grep for old import path + edit each file = 3+ calls.
Verdict: Serena saves 1–2 calls and automatically discovers which imports need updating.
Task 12 (safe delete)
Serena’s safe_delete correctly refused to delete _HACKY_create_info_batch because it has a usage at line 730. The propagate=true mode (delete symbol + all call sites) failed for all tested symbols in this codebase.
Verdict: The usage-check is valuable (saves you from deleting a used symbol). The propagation feature was non-functional for the tested Python symbols.
Task 13: Inline
Serena’s inline_symbol failed for all tested symbols (_nullable_slice, BaseCollector/env_num). The tool appears to have limited Python support for inlining.
Verdict: No successful inline demonstrated. Built-in manual inlining remains the only option.
3.4 Reliability and Correctness
Task 14: Scope precision
Serena distinguishes BaseCollector/_collect, Collector/_collect, and AsyncCollector/_collect by name path. Grep for def _collect matches all three — manual filtering by class is required.
Verdict: Serena’s name-path addressing eliminates ambiguity that text search cannot resolve.
Task 15: Atomicity
Serena’s cross-file rename is atomic: 4 files updated in 1 call, all-or-nothing. Built-in: 4 separate Edit calls — if call 3 fails, 2 files are updated and 2 are not.
Verdict: Serena provides atomicity for cross-file operations. Built-in chains are inherently non-atomic.
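For comparison, this is roughly what it takes to get all-or-nothing behaviour out of plain file operations. It is a sketch under assumptions: rename_in_files is a hypothetical helper, the rename is a plain text replacement (not symbol-accurate), and nothing here describes how Serena implements its rename. Every file is validated and staged before any original is touched; the final swaps are atomic per file, though a crash between swaps could still leave a partial state.

```python
import os
import tempfile
from pathlib import Path


def rename_in_files(paths, old, new):
    """Apply a text rename to every file, or to none if any step fails."""
    staged = []  # (temp_path, final_path) pairs, written but not yet visible
    try:
        for path in paths:
            text = Path(path).read_text()
            if old not in text:
                raise ValueError(f"{old!r} not found in {path}")
            # Stage next to the original so the final swap stays on one filesystem.
            fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
            staged.append((tmp, path))
            with os.fdopen(fd, "w") as handle:
                handle.write(text.replace(old, new))
    except Exception:
        for tmp, _ in staged:
            os.remove(tmp)  # discard staged copies; originals are untouched
        raise
    for tmp, path in staged:
        os.replace(tmp, path)  # swap each staged copy into place
```

With a chain of independent Edit calls there is no staging phase, so a failure on file 3 of 4 leaves the first two files already changed.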
Task 16: Success signals
Both return clear success/failure indicators. No meaningful difference.
Verdict: Equal.
3.5 Workflow Effects
Task 17: Chain three edits in one file
Edit with text matching: 3 sequential calls, no re-reads needed between them. Text anchors are immune to line-number shifts from prior edits.
Serena replace_symbol_body: 3 sequential calls, no re-reads needed. Name-path addressing is also immune to line-number shifts.
Verdict: No meaningful difference for chained single-file edits.
Task 18: Multi-step exploration across edits
Serena’s name-path results from exploration remain valid after edits. Built-in line numbers go stale, but Edit uses text matching (not line numbers), so the practical impact is limited to Read calls that need updated offsets.
Verdict: Minor Serena advantage. Name-path stability eliminates the need to re-scan after edits.
3.6 Non-Interesting Tasks
Task 19: Read non-code file
Serena tools don’t apply. Read is the correct tool.
Task 20: Free-text pattern search
Searching for FIXME|HACK|TODO across the codebase is a text search. Serena’s semantic tools don’t target this. Grep is the correct tool.
Verdict: These tasks are firmly built-in territory. They represent an estimated 20–30% of daily coding work (reading configs, searching for strings, shell operations, git workflows).
4. Token-Efficiency Analysis
Payload differences across edit sizes
| Edit type | Edit payload | Serena payload | Winner |
|---|---|---|---|
| 1-line tweak in 22-line method | ~200 chars | ~800 chars | Edit (4x) |
| 6-line change in 20-line method | ~700 chars | ~500 chars | Roughly equal |
| Full rewrite of 55-line method | ~2200 chars | ~1100 chars | Serena (2x) |
| Full rewrite of 330-line method | ~13,000 chars | ~6,500 chars | Serena (2x) |
Forced reads
Edit requires reading a file before editing it. This adds ~300–2000 chars per file.
Serena does not require reading before editing (name-path addressing).
For single-file edits where you already read the file, this is neutral.
For cross-file operations on files you haven’t read, Serena saves 3–4 forced reads.
Stable vs ephemeral addressing
Serena: name paths (Collector/_collect) are stable across edits. Results from exploration remain valid.
Built-in: line numbers are ephemeral. Read results go stale after edits. Edit uses text matching, which is stable.
Practical impact: Low for one-shot edits, medium for edit-then-read-again workflows.
Verdict: Edit wins for small tweaks (4x more token-efficient). Serena wins for full-body rewrites (2x more efficient) and cross-file operations (eliminates forced reads). The crossover point is approximately 50% of the method body changing — below that, Edit is more efficient; above that, Serena is.
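The 50% crossover follows from a simple payload model, sketched below. The model is an assumption made for illustration: Edit is taken to send the changed span twice (old text plus replacement of equal length), while replace_symbol_body sends the whole new body once. The function names are hypothetical, and the character counts are the approximate figures from the table above.

```python
def edit_payload(changed_chars: int) -> int:
    """Edit sends the old text plus its replacement: about 2x the changed span."""
    return 2 * changed_chars


def serena_payload(body_chars: int) -> int:
    """A full-body replacement sends the whole new body, whatever the change size."""
    return body_chars


def cheaper_tool(body_chars: int, changed_fraction: float) -> str:
    """Name the tool with the smaller payload under this model."""
    edit = edit_payload(round(body_chars * changed_fraction))
    serena = serena_payload(body_chars)
    if edit == serena:
        return "tie"
    return "Edit" if edit < serena else "Serena"


# 1-line tweak: ~100 of ~800 chars change -> 200 vs 800
print(cheaper_tool(800, 0.125))   # Edit
# Full rewrite of a ~1100-char body -> 2200 vs 1100
print(cheaper_tool(1100, 1.0))    # Serena
# Break-even: 2 * changed == body, i.e. half the body changes
print(cheaper_tool(1000, 0.5))    # tie
```

Setting 2 × changed = body gives changed = body / 2, which is where the crossover lands; prerequisite reads are left out of this model.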
5. Reliability and Correctness (Under Correct Use)
Precision of matching
Serena: Symbol-accurate. find_referencing_symbols(CollectStats) excludes LoggedCollectStats. No false positives observed.
Grep: Text-matched. CollectStats matches LoggedCollectStats, CollectStatsBase, and docstring references. Requires manual filtering.
Edit: Text-matched. replace_all replaces exact string matches. For unique symbol names, this is reliable. For common strings, it can over-match.
Scope disambiguation
Serena: Collector/_collect vs AsyncCollector/_collect are correctly distinguished by class-scoped name path.
Built-in: def _collect matches all implementations. Must manually verify class context.
Atomicity
Serena cross-file operations: Atomic. Single call, all-or-nothing.
Built-in multi-file chains: Non-atomic. Partial state possible if one call fails.
External dependency lookup
Serena: Can read external stubs (e.g., abc.pyi) through paths returned by other tools. Direct search (search_deps=True) returned empty for torch.Tensor and numpy.array. Limited to what JetBrains has indexed.
Built-in: Can Read site-packages files if path is known. No semantic indexing.
Verdict: Serena provides strictly more precise semantic matching and atomic cross-file operations. External dependency lookup is limited in both toolsets.
6. Workflow Effects Across a Session
Where advantages compound
Explore → edit → re-explore cycle: Serena’s name-path results survive edits. In a long session making multiple changes, this saves re-scanning after each edit. The built-in’s text matching also survives edits (Edit uses text, not line numbers), so the practical gap is smaller than it appears.
Cross-file refactoring chains: Rename a class, then move it, then update all references — each Serena call is atomic and builds on the previous result. With built-ins, each step requires finding all sites, reading files, and editing — the manual equivalent of what Serena automates.
Where advantages diminish
Repeated small edits in one file: Edit’s text matching is equally stable and more token-efficient for small changes. No Serena advantage.
Exploration without editing: Both toolsets provide usable results. Serena’s are more structured but more verbose.
Non-Python files: Serena’s JetBrains backend provides no value for config files, shell scripts, markdown, or notebooks.
Verdict: Serena’s advantages compound in multi-step cross-file refactoring sessions. They do not compound for single-file iterative editing or non-code work.
7. Unique Capabilities
Atomic cross-file rename/move: No built-in equivalent. The closest manual process is a grep-find-edit chain that is non-atomic and error-prone. Frequency: medium. Impact: high (eliminates partial-update risk).
Semantic reference finding with categorization: find_referencing_symbols returns zero-false-positive results categorized by usage type (import, parameter, declaration, reference). Built-in Grep cannot distinguish these. Frequency: high. Impact: medium (saves manual filtering).
Type hierarchy traversal: Returns transitive super/subtype chains including external library types in one call. Built-in requires iteration and cannot reach external deps. Frequency: medium. Impact: medium.
Symbol-scoped body retrieval: Read a specific method by name path without reading the surrounding file. Built-in Read requires line-range knowledge. Frequency: high. Impact: low (both require 1 call, difference is stable vs ephemeral addressing).
Verdict: Four unique capabilities, all semantic-code operations. The most impactful is atomic cross-file refactoring. None of these have practical built-in equivalents.
8. Tasks Outside Serena’s Scope (Built-In Only)
| Task | Tool | Frequency | Share of daily work |
|---|---|---|---|
| Read config/TOML/yaml files | | High | |
| Free-text search (log strings, TODOs, URLs) | | High | |
| File discovery by name pattern | | Medium | |
| Shell commands (git, pip, pytest) | | High | |
| Write new files from scratch | | Medium | |
| Read images, notebooks, PDFs | | Low | |
Estimated share of daily work covered by built-in-only tasks: 20–30%. The remaining 70–80% involves reading, editing, and navigating code where Serena’s semantic tools are applicable.
Verdict: Serena covers the code-editing and code-navigation portions of a session. Config reading, text search, and shell operations remain built-in territory.
9. Practical Usage Rule
Use Serena for: Any cross-file refactoring (rename, move), any “who uses this?” query, any type-hierarchy navigation, any full-method-body replacement, and any situation where you need symbol-accurate results without false positives.
Use built-ins for: Small edits (1–3 lines) inside methods, free-text search, reading non-code files, file discovery, and shell operations.
Hybrid pattern (most efficient): Use Serena to explore (overview, find symbol, find references) and Edit for small targeted changes. Use Serena for any cross-file refactoring. Use Read/Grep/Glob for non-code tasks and text search. This combination captures the strengths of both toolsets.
Verdict: The optimal workflow uses Serena’s semantic tools for code navigation and cross-file refactoring, and built-in Edit for small single-file changes. The two toolsets are complementary — Serena handles the structured code operations, built-ins handle the text and system operations.
Appendix: Call Count Summary
| Task | Serena calls | Built-in calls | Delta |
|---|---|---|---|
| Structural overview (1 file) | 1 | 1 | 0 |
| Method body retrieval | 1 | 1 | 0 |
| Find references (1 symbol) | 1 | 1 | 0 (but Serena has 0 false positives vs Grep’s 5+) |
| Type hierarchy | 1 | 2–3 | −1 to −2 |
| Small edit (1 line in 22-line method) | 1 (+1 read) | 1 (+1 read) | 0 |
| Medium edit (6 lines in 20-line method) | 1 | 1 | 0 |
| Insert new method | 1 (+1 confirm) | 1 (+1 read) | 0 |
| Single-file rename (4 occurrences) | 1 | 1 | 0 |
| Cross-file rename (4 files, 10 occurrences) | 1 | 8 | −7 |
| Move symbol to another module | 1 | 5+ | −4+ |
| Move file + update imports | 1 | 3+ | −2+ |
| Safe delete (usage check) | 1 | 1 grep | 0 |
Largest single-task delta: cross-file rename saves 7 calls and provides atomicity.