:::{admonition} Evaluation Result
:class: note
**Generated by**: GLM 5.1 (coding AI agent in Claude Code CLI)  
**Codebase**: [Tianshou](https://github.com/thu-ml/tianshou) — a Python reinforcement learning library (~26K lines, 43 source files)  
**Date**: 2026-04-14
:::

# Claude Code (GLM 5.1)

>Serena's cross-file refactoring turned an 8-step, error-prone rename chain into a single atomic call — and after
measuring the delta across 20 tasks, I would ask any developer I work with to install it, not because it replaces my
built-in tools, but because it makes the hardest part of my job (changing code safely across files) suddenly feel
reliable.

---

## 1. Headline: What Serena Changes

**(a) Tasks where Serena adds capability.** Cross-file refactorings (rename, move) collapse from N separate read-find-edit-verify cycles into a single atomic call. Semantic queries (type hierarchy, find implementations, find references) return structured, symbol-accurate results that text search cannot produce without manual filtering. Symbol-level addressing (name paths) is stable across edits, eliminating stale-line-number rework in multi-step sessions.

**(b) Tasks where Serena applies but offers no improvement.** Single-file, small-bore edits (changing an error message, renaming a local variable with `replace_all`, inserting a function when you already know the insertion context) are equally fast with both toolsets. For a 1-line tweak in a 22-line method, Edit sends ~200 chars; Serena's `replace_symbol_body` sends ~800 chars (the entire method body). The overhead reverses for full-body rewrites of large methods, where Serena sends only the new body while Edit sends old+new.

**(c) Tasks outside Serena's scope.** Reading non-code files (config, TOML, docs), free-text pattern search (`FIXME`, magic constants, log strings), shell operations, and git workflows are all built-in territory. These are not Serena shortcomings — they are outside its design scope.

**Verdict:** Serena adds substantial, measurable capability in two areas — cross-file refactoring (1 call vs 5–8) and semantic code queries (structured, symbol-accurate results vs flat text matches) — while providing no meaningful delta for small single-file edits or non-code tasks.

---

## 2. Added Value and Differences by Area

### 2.1 Cross-file renaming: 1 call replaces 5–8

- **What changes:** Renaming `CollectStatsBase` to `BaseCollectStats` across 4 files (10 occurrences) required 1 Serena call vs 1 Grep + 3 Reads + 4 Edits = 8 built-in calls.
- **Frequency:** Medium. Any non-trivial rename touches 3–10 files.
- **Value per hit:** Saves 4–7 calls and eliminates the partial-update risk of a mid-chain failure.
- **Atomicity:** Serena's rename is all-or-nothing. Built-in chain is not — if Edit 3 of 4 fails, 2 files are updated and 2 are not.

### 2.2 Symbol moving: 1 call replaces 5+

- **What changes:** Moving `_nullable_slice` to another module required 1 Serena call (moved definition + updated imports in both source and target). Built-in equivalent: read function → write to target → edit source (remove + add import) → edit target (add dependency import) = 5+ calls.
- **Frequency:** Low. Module reorganization happens infrequently.
- **Value per hit:** Saves 4+ calls and automates the most error-prone step (getting imports right).
- **Caveat:** The move tool created a circular import (source imports from target, target imports from source). The tool does not detect or prevent this.

### 2.3 Reference finding: Symbol-accurate vs text-matched

- **What changes:** Finding references to `CollectStats` with Serena returned 10 semantically categorized results (IMPORT_ELEMENT, REFERENCE_EXPRESSION, NAMED_PARAMETER, etc.) with no false positives from `LoggedCollectStats`. Grep returned 70+ lines including false positives from `LoggedCollectStats` and noise from docstrings/comments.
- **Frequency:** High. "Who uses this?" is one of the most common codebase questions.
- **Value per hit:** Eliminates manual false-positive filtering. However, Serena's output was 64KB (with context snippets) vs Grep's ~5KB — a tradeoff of precision vs verbosity.

### 2.4 Type hierarchy: 1 call vs 2+ grep-and-parse cycles

- **What changes:** Getting the full type hierarchy of `BaseCollector` (supertypes: ABC → object; subtypes: Collector → AsyncCollector) required 1 Serena call. Built-in: parse the class definition line for bases + grep for inheritors + recurse = 2–3 calls, and still no access to external library types.
- **Frequency:** Medium. Common when navigating unfamiliar code.
- **Value per hit:** Saves 1–2 calls and returns transitive chains that built-ins can't produce without iteration.

### 2.5 Structural overview: Hierarchical vs flat

- **What changes:** `get_symbols_overview(depth=1)` on a 1551-line file returned a structured hierarchy (classes → methods + attributes) in one call. Grep for `^(class |def |    def )` returned a flat 60-line list of definitions with line numbers but no attribute information or nesting.
- **Frequency:** High. Opening any unfamiliar file.
- **Value per hit:** Serena shows Protocol fields, class attributes, and method grouping. Grep shows line numbers for navigation but requires more work to understand structure.

### 2.6 Method body retrieval: Targeted read vs range read

- **What changes:** Reading `Collector._collect` (~330 lines) required 1 Serena `find_symbol` call using the name path. Built-in: need to know the line range (773–1102) from a prior Grep, then `Read(offset=773, limit=330)`.
- **Frequency:** High. "Show me this method" is routine.
- **Value per hit:** Serena uses stable addressing (name path). Built-in uses ephemeral line numbers that go stale after edits. The practical difference is small when you just read once, but compounds in edit-then-read-again workflows.

### 2.7 Single-file rename: No meaningful difference

- **What changes:** Renaming `_nullable_slice` (4 occurrences in one file): Serena rename = 1 call. Edit `replace_all` = 1 call. Identical results.
- **Frequency:** High.
- **Value per hit:** Zero. Both tools handle this equally well.

### 2.8 Small edits (1–3 lines): Edit is more token-efficient

- **What changes:** Changing one error message in a 22-line method: Edit sends ~200 chars (old + new string). Serena `replace_symbol_body` sends ~800 chars (entire method body).
- **Frequency:** Very high.
- **Value per hit:** Edit saves ~600 chars of payload per small edit. This reverses for full-body rewrites of 50+ line methods.

### 2.9 Insertion: Stable address vs text anchor

- **What changes:** Inserting a new method after `refresh_all_sequence_stats`: Serena used 1 `insert_after_symbol` call with a name path (no line number needed). Edit used 1 Read + 1 Edit with a text anchor (surrounding context for uniqueness).
- **Frequency:** Medium.
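- **Value per hit:** Saves 1 Read call when the target symbol is already known. In the measured run (Task 8), Serena also spent 1 `find_symbol` call confirming the target, so both toolsets took 2 calls. Both produce identical results.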

**Verdict:** Serena's value concentrates in cross-file operations and semantic queries. For single-file text edits, the built-ins are equally capable and often more token-efficient.

---

## 3. Detailed Evidence, Grouped by Capability

### 3.1 Codebase Understanding

#### Task 1: Repo overview
Both toolsets use the same approach (`ls`, `find`, directory listing). No Serena advantage here.

#### Task 2: Structural overview of a large file (`collector.py`, 1551 lines)

| Step | Serena | Built-in |
|------|--------|----------|
| Call | `get_symbols_overview(depth=1)` | `Grep "^(class \|def \|    def )"` |
| Result | Hierarchical: 14 classes, their methods, attributes, and module-level functions | Flat list of 60 class/function definitions with line numbers |
| Output size | ~1.5KB structured JSON | ~3KB text |
| Next step | `find_symbol("Collector/_collect", include_body=True)` — direct | `Read(offset=773, limit=330)` — needs prior knowledge of line range |

**Serena advantage:** Shows attributes (e.g., `CollectStats.collect_time`, `CollectStats.returns`) that Grep cannot see. Hierarchical nesting makes the file's architecture immediately clear.

**Built-in advantage:** Line numbers enable direct `Read` calls. The flat list is quick to scan, though at ~3KB it is larger than Serena's ~1.5KB output.

**Verdict:** Serena provides strictly more structural information in one call. The gap widens for files with deeply nested classes or dataclass fields.
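
To make the difference concrete, here is a minimal sketch using Python's standard `ast` module on a hypothetical stand-in for `CollectStats`. The field types, the `refresh` method, and the `helper` function are invented for illustration, and this is not how Serena computes its overview. It only shows why a line-anchored regex misses annotated attributes that a parser sees.

```python
import ast
import re

# Hypothetical stand-in; only the names CollectStats, collect_time and
# returns come from the report, everything else is made up.
SAMPLE = '''\
class CollectStats:
    collect_time: float = 0.0
    returns: list

    def refresh(self):
        pass

def helper():
    pass
'''

_FUNCS = (ast.FunctionDef, ast.AsyncFunctionDef)


def overview(source):
    """Return top-level symbols; classes map to their attributes and methods."""
    result = {}
    for node in ast.parse(source).body:
        if isinstance(node, ast.ClassDef):
            result[node.name] = {
                "attributes": [
                    stmt.target.id
                    for stmt in node.body
                    if isinstance(stmt, ast.AnnAssign)
                    and isinstance(stmt.target, ast.Name)
                ],
                "methods": [s.name for s in node.body if isinstance(s, _FUNCS)],
            }
        elif isinstance(node, _FUNCS):
            result[node.name] = "function"
    return result


# The same pattern as the Grep call in the table above.
flat = re.findall(r"^(?:class |def |    def ).*$", SAMPLE, flags=re.M)

print(flat)              # three definition lines, no attributes
print(overview(SAMPLE))  # nested view including collect_time and returns
```

The regex result is enough for navigation, but the annotated fields only appear in the parsed view.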

#### Task 3: Retrieve a specific method body

| Step | Serena | Built-in |
|------|--------|----------|
| Prerequisite | Name path known (`Collector/_collect`) | Line number known from prior Grep (773) |
| Call | `find_symbol(name_path="Collector/_collect", include_body=True)` | `Read(offset=773, limit=330)` |
| Payload sent | ~50 chars (name + path) | ~30 chars (offset + limit) |
| Payload received | Exact method body (~330 lines) | Lines 773–1102 (~330 lines) |
| Correctness | Always exact | Must know/guess the correct limit |

**Verdict:** Functionally equivalent when line numbers are known. Serena's name-path addressing stays valid across edits; line numbers do not.

#### Task 4: Find all references to `CollectStats`

| Metric | Serena `find_referencing_symbols` | Built-in `Grep` |
|--------|----------------------------------|-----------------|
| Calls | 1 | 1 |
| Output size | 64KB (with context snippets) | ~5KB (70 lines) |
| False positives | 0 (excludes `LoggedCollectStats`) | 5+ lines from `LoggedCollectStats` |
| Noise (comments/docs) | Some (docstrings categorized) | Significant (docstrings, comments matched) |
| Semantic categories | Yes (IMPORT, PARAMETER, DECLARATION, REFERENCE) | No |

**Serena advantage:** Zero false positives. Semantic categorization. Shows import paths and parameter usage separately from code references.

**Built-in advantage:** Roughly 13x smaller output (~5KB vs 64KB). Faster to scan visually.

**Verdict:** Serena is more precise but more verbose. For "who uses this in code?" both work; for "rename this safely" Serena's precision is necessary.
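
The false-positive behaviour of text search can be reproduced with a few lines of Python. The sample lines below are hypothetical, not taken from Tianshou. The sketch shows that a plain substring match hits `LoggedCollectStats` and `CollectStatsBase`, and that a word-boundary pattern removes those hits but still matches comments, which a text search cannot tell apart from code.

```python
import re

# Hypothetical lines standing in for grep input.
LINES = [
    "from tianshou.data import CollectStats",
    "stats = LoggedCollectStats()",
    "class CollectStatsBase:",
    "# CollectStats holds timing information",
    "def log(stats: CollectStats) -> None:",
]

# Plain substring search, comparable to grepping for the bare name.
substring_hits = [line for line in LINES if "CollectStats" in line]

# Word-boundary search, comparable to `grep -w`.
word_hits = [line for line in LINES if re.search(r"\bCollectStats\b", line)]

print(len(substring_hits))  # 5
print(len(word_hits))       # 3 (the comment line is still included)
```

A word-boundary search is a cheap improvement for built-in use, but separating imports, parameters, and comments still requires symbol-level information.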

#### Task 5: Type hierarchy of `BaseCollector`

| Metric | Serena `type_hierarchy` | Built-in (Grep chain) |
|--------|------------------------|----------------------|
| Calls | 1 | 2–3 (grep subclasses, parse superclass, recurse) |
| Result | `ABC → object` (super), `Collector → AsyncCollector` (sub) | Partial — direct sub/supertypes only, no transitive chain |
| External deps | Shows `ABC` from `abc.pyi` | Cannot access |

**Verdict:** Serena returns complete, transitive hierarchy including external library types in one call. Built-in approach requires iteration and cannot inspect external deps.
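
For comparison, the same transitive chains can be obtained at runtime through Python introspection once the classes are importable. This is a sketch on empty stand-in classes that mirror the reported hierarchy, not on the real Tianshou classes, and it differs from Serena's approach in that it requires executing the code rather than analysing it statically.

```python
from abc import ABC


# Empty stand-ins that mirror the hierarchy reported above.
class BaseCollector(ABC):
    pass


class Collector(BaseCollector):
    pass


class AsyncCollector(Collector):
    pass


def supertypes(cls):
    """Names of all ancestors in method resolution order, excluding cls."""
    return [base.__name__ for base in cls.__mro__[1:]]


def subtypes(cls):
    """Names of all descendants, depth first."""
    names = []
    for sub in cls.__subclasses__():
        names.append(sub.__name__)
        names.extend(subtypes(sub))
    return names


print(supertypes(BaseCollector))  # ['ABC', 'object']
print(subtypes(BaseCollector))    # ['Collector', 'AsyncCollector']
```

`__subclasses__()` only reports classes that have already been imported, so this route can miss subclasses that a static index would find.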

#### Task 6: External dependency symbol lookup

Serena can read external dependency symbols IF you have the path from a prior tool result (e.g., `<ext:abc.pyi|16198efc>` from `type_hierarchy`). Direct search with `search_deps=True` returned empty results for `numpy.array` and `torch.Tensor`. The JetBrains IDE indexing is limited to what the language server has resolved.

Built-in: Can `Read` site-packages files if you know the path, but discovery is manual.

**Verdict:** Minor Serena advantage — external symbols are accessible through tool chains but not through direct search. Neither toolset makes this easy.

### 3.2 Single-File Edits

#### Task 7a: Small tweak (1-line change in 22-line method)

| Metric | Edit | Serena `replace_symbol_body` |
|--------|------|------------------------------|
| Prerequisite | 1 Read (6 lines of context) | 1 `find_symbol` (gets full body) |
| Payload sent | ~200 chars (old + new string) | ~800 chars (full method body) |
| Payload received | Success message | "OK" |
| Total payload | ~200 chars edit + ~300 chars read = ~500 | ~800 chars edit + ~800 chars read = ~1600 |

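**Verdict:** Edit is ~3x more token-efficient for small tweaks inside methods when the prerequisite read is counted (~500 vs ~1600 chars), and 4x more efficient on the sent payload alone (~200 vs ~800 chars).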

#### Task 7b: Medium rewrite (~6 line changes in 20-line method)

| Metric | Edit | Serena `replace_symbol_body` |
|--------|------|------------------------------|
| Payload sent | ~700 chars (old + new, full method) | ~500 chars (new body only) |
| Prerequisite read | ~500 chars | ~500 chars (from prior `find_symbol`) |

**Verdict:** Roughly equal. For medium rewrites, payloads converge.

#### Task 7c: Large rewrite (full body of 55+ line method)

For a full-body rewrite of a 55-line method:
- Edit: old (~55 lines) + new (~55 lines) = ~110 lines sent
- Serena: new body only (~55 lines) sent

**Verdict:** Serena is ~2x more token-efficient for full-body rewrites. The ratio stays near 2x regardless of method size, so the absolute savings grow linearly with it.

#### Task 8: Insert a new function after an existing one

| Step | Serena | Built-in |
|------|--------|----------|
| 1 | `find_symbol("refresh_all_sequence_stats")` to confirm target | `Read(offset=250, limit=10)` to find insertion point |
| 2 | `insert_after_symbol(name_path, body)` | `Edit(old_string=anchor, new_string=anchor+new_fn)` |
| Total calls | 2 | 2 |
| Payload sent | New function body only (~300 chars) | Anchor context + new function (~400 chars) |

**Verdict:** No meaningful difference. Both require 2 calls and produce identical results.

#### Task 9: Rename a private helper (single-file, 4 occurrences)

| Metric | Serena `rename` | Edit `replace_all` |
|--------|-----------------|-------------------|
| Calls | 1 | 1 |
| Prerequisites | None | Must have Read the file first |
| Result | All 4 occurrences renamed | All 4 occurrences renamed |

**Verdict:** Functionally identical. Both are 1 call. Edit's Read-first requirement is usually satisfied from prior exploration.

### 3.3 Multi-File Changes

#### Task 10: Cross-file rename (`CollectStatsBase` → `BaseCollectStats`, 4 files, 10 occurrences)

| Step | Serena | Built-in |
|------|--------|----------|
| Find references | Automatic | 1 Grep call |
| Read files | Not needed | 3 Read calls (Edit's prerequisite) |
| Apply edits | 1 rename call (atomic) | 4 Edit calls (one per file) |
| Verify | Return: "Success" | Manual (4 success messages) |
| **Total calls** | **1** | **8** (1 grep + 3 reads + 4 edits) |

**Verdict:** Serena converts an 8-call manual pipeline into 1 atomic operation. This is the single largest efficiency gain observed.

#### Task 11: Move symbol to another module

Serena's `move` tool:
1. Moved `_nullable_slice` from `collector.py` to `converter.py`
2. Added import in source file: `from tianshou.data.utils.converter import _nullable_slice`
3. Added dependency import in target: `from tianshou.data.collector import _TArrLike`
4. Removed definition from source

Built-in equivalent: Read function body → Write to target → Edit source (remove definition + add import) → Edit target (add dependency import) = 5+ calls.

**Issue:** The move created a circular import (source ↔ target). Serena does not detect or prevent this.

**Verdict:** Serena automates the most tedious part (import management) but doesn't guard against circular dependencies. Saves 4+ calls at the cost of needing manual circular-import review.
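
The kind of cycle described above can be reproduced in isolation. The sketch below writes two throwaway modules, `demo_source` and `demo_target` (hypothetical names standing in for `collector.py` and `converter.py`), and shows one possible mitigation: guarding the back-import with `typing.TYPE_CHECKING`. This assumes the imported name is needed for annotations only, which the report does not confirm for `_TArrLike`. Whether a real cycle fails at import time also depends on import order, so treat this as an illustration rather than a diagnosis.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Stand-in for the source module after the move: it imports the moved
# helper first and defines the type alias afterwards.
SOURCE = "from demo_target import helper\nTArr = list\n"

# Target with a runtime import back into the source (the cycle).
TARGET_CYCLIC = (
    "from demo_source import TArr\n"
    "def helper(x: TArr) -> TArr:\n"
    "    return x\n"
)

# Possible fix when the alias is used in annotations only.
TARGET_GUARDED = (
    "from __future__ import annotations\n"
    "from typing import TYPE_CHECKING\n"
    "if TYPE_CHECKING:  # skipped at runtime, seen by type checkers\n"
    "    from demo_source import TArr\n"
    "def helper(x: TArr) -> TArr:\n"
    "    return x\n"
)


def import_outcome(target_code):
    """Write both modules to a temp dir, import the source, report the result."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "demo_source.py").write_text(SOURCE)
        Path(tmp, "demo_target.py").write_text(target_code)
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            importlib.import_module("demo_source")
            return "ok"
        except ImportError:
            return "circular import failed"
        finally:
            # Clean up so the next call starts from a fresh state.
            sys.path.remove(tmp)
            for name in ("demo_source", "demo_target"):
                sys.modules.pop(name, None)


print(import_outcome(TARGET_CYCLIC))   # circular import failed
print(import_outcome(TARGET_GUARDED))  # ok
```

A quick import of the affected modules after each move is a cheap way to perform the manual review mentioned in the verdict.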

#### Task 12: Move file (`segtree.py` to parent directory)

Serena's `move` tool:
1. Moved the file
2. Updated the one direct import in `__init__.py` (`tianshou.data.utils.segtree` → `tianshou.data.segtree`)
3. Other files (`prio.py`, tests) imported via re-export and needed no changes

Built-in equivalent: `git mv` + grep for old import path + edit each file = 3+ calls.

**Verdict:** Serena saves 2+ calls and automatically discovers which imports need updating.

#### Task 12b: Safe delete

Serena's `safe_delete` correctly refused to delete `_HACKY_create_info_batch` because it has a usage at line 730. The `propagate=true` mode (delete symbol + all call sites) failed for all tested symbols in this codebase.

**Verdict:** The usage-check is valuable (saves you from deleting a used symbol). The propagation feature was non-functional for the tested Python symbols.

#### Task 13: Inline

Serena's `inline_symbol` failed for all tested symbols (`_nullable_slice`, `BaseCollector/env_num`). The tool appears to have limited Python support for inlining.

**Verdict:** No successful inline demonstrated. Built-in manual inlining remains the only option.

### 3.4 Reliability and Correctness

#### Task 14: Scope precision

Serena distinguishes `BaseCollector/_collect`, `Collector/_collect`, and `AsyncCollector/_collect` by name path. Grep for `def _collect` matches all three — manual filtering by class is required.

**Verdict:** Serena's name-path addressing eliminates ambiguity that text search cannot resolve.
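
A rough idea of how a class-scoped name path removes this ambiguity can be sketched with Python's `ast` module. The `resolve` helper and the sample source are hypothetical, and Serena's actual resolution goes through its own backend. The sketch only illustrates that `Class/method` identifies one definition where `def _collect` matches three, and that the path keeps resolving after lines shift.

```python
import ast

# Hypothetical sample with three same-named methods.
SAMPLE = '''\
class BaseCollector:
    def _collect(self):
        raise NotImplementedError

class Collector(BaseCollector):
    def _collect(self):
        return "sync"

class AsyncCollector(Collector):
    def _collect(self):
        return "async"
'''

_DEFS = (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)


def resolve(source, name_path):
    """Return (first_line, last_line) of the symbol addressed by 'A/b'."""
    node = ast.parse(source)
    for part in name_path.split("/"):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, _DEFS) and child.name == part:
                node = child
                break
        else:
            raise LookupError(f"{part!r} not found in {name_path!r}")
    return node.lineno, node.end_lineno


print(SAMPLE.count("def _collect"))                   # 3 text matches
print(resolve(SAMPLE, "Collector/_collect"))          # (6, 7)
print(resolve(SAMPLE, "AsyncCollector/_collect"))     # (10, 11)

# After an edit shifts every line by two, the same name path still works.
edited = "import os\n\n" + SAMPLE
print(resolve(edited, "Collector/_collect"))          # (8, 9)
```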

#### Task 15: Atomicity

Serena's cross-file rename is atomic: 4 files updated in 1 call, all-or-nothing. Built-in: 4 separate Edit calls — if call 3 fails, 2 files are updated and 2 are not.

**Verdict:** Serena provides atomicity for cross-file operations. Built-in chains are inherently non-atomic.
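
The partial-update scenario can be simulated on temporary files. In the sketch below, `edit_sequentially` mimics a chain of per-file edits and `edit_staged` validates everything in memory before writing. Both helpers are hypothetical, use plain text replacement, and say nothing about how Serena implements its rename. The staged variant is also not atomic in the strict sense, since a disk error during the write phase could still leave mixed state.

```python
import tempfile
from pathlib import Path

OLD, NEW = "CollectStatsBase", "BaseCollectStats"


def edit_sequentially(paths):
    """Write each file as soon as it is processed, like a chain of edits."""
    for path in paths:
        text = path.read_text()
        if OLD not in text:
            raise ValueError(f"{path.name}: {OLD} not found")
        path.write_text(text.replace(OLD, NEW))


def edit_staged(paths):
    """Validate and stage every file first; write only if all succeed."""
    staged = {}
    for path in paths:
        text = path.read_text()
        if OLD not in text:
            raise ValueError(f"{path.name}: {OLD} not found")
        staged[path] = text.replace(OLD, NEW)
    for path, text in staged.items():
        path.write_text(text)


def count_updated_after_failure(edit_fn):
    """Run edit_fn on 4 files whose 3rd edit fails; count rewritten files."""
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i in range(4):
            path = Path(tmp, f"file{i}.py")
            # The third file lacks the old name, so its edit fails.
            path.write_text("x = 1\n" if i == 2 else f"class {OLD}: ...\n")
            paths.append(path)
        try:
            edit_fn(paths)
        except ValueError:
            pass
        return sum(NEW in p.read_text() for p in paths)


print(count_updated_after_failure(edit_sequentially))  # 2
print(count_updated_after_failure(edit_staged))        # 0
```

The sequential run reproduces the "2 updated, 2 not" state described above; the staged run leaves all four files untouched.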

#### Task 16: Success signals

Both return clear success/failure indicators. No meaningful difference.

**Verdict:** Equal.

### 3.5 Workflow Effects

#### Task 17: Chain three edits in one file

Edit with text matching: 3 sequential calls, no re-reads needed between them. Text anchors are immune to line-number shifts from prior edits.

Serena `replace_symbol_body`: 3 sequential calls, no re-reads needed. Name-path addressing is also immune to line-number shifts.

**Verdict:** No meaningful difference for chained single-file edits.

#### Task 18: Multi-step exploration across edits

Serena's name-path results from exploration remain valid after edits. Built-in line numbers go stale, but Edit uses text matching (not line numbers), so the practical impact is limited to `Read` calls that need updated offsets.

**Verdict:** Minor Serena advantage. Name-path stability eliminates the need to re-scan after edits.

### 3.6 Non-Interesting Tasks

#### Task 19: Read non-code file

Serena tools don't apply. `Read` is the correct tool.

#### Task 20: Free-text pattern search

Searching for `FIXME|HACK|TODO` across the codebase is a text search. Serena's semantic tools don't target this. `Grep` is the correct tool.

**Verdict:** These tasks are firmly built-in territory. They represent an estimated 20–30% of daily coding work (reading configs, searching for strings, shell operations, git workflows).

---

## 4. Token-Efficiency Analysis

### Payload differences across edit sizes

| Edit type | Edit payload | Serena payload | Winner |
|-----------|-------------|---------------|--------|
| 1-line tweak in 22-line method | ~200 chars | ~800 chars | Edit (4x) |
| 6-line change in 20-line method | ~700 chars | ~500 chars | Roughly equal |
| Full rewrite of 55-line method | ~2200 chars | ~1100 chars | Serena (2x) |
| Full rewrite of 330-line method | ~13,000 chars | ~6,500 chars | Serena (2x) |

### Forced reads

- Edit requires reading a file before editing it. This adds ~300–2000 chars per file.
- Serena does not require reading before editing (name-path addressing).
- For single-file edits where you already read the file, this is neutral.
- For cross-file operations on files you haven't read, Serena saves 3–4 forced reads.

### Stable vs ephemeral addressing

- Serena: name paths (`Collector/_collect`) are stable across edits. Results from exploration remain valid.
- Built-in: line numbers are ephemeral. `Read` results go stale after edits. `Edit` uses text matching, which is stable.
- Practical impact: Low for one-shot edits, medium for edit-then-read-again workflows.

**Verdict:** Edit wins for small tweaks (4x more token-efficient). Serena wins for full-body rewrites (2x more efficient) and cross-file operations (eliminates forced reads). The crossover point is approximately 50% of the method body changing — below that, Edit is more efficient; above that, Serena is.
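
The 50% crossover follows from a simple payload model, sketched below. It assumes Edit sends the old and new text of the changed span only, while a symbol-body replacement always sends the whole new body. Both function names are hypothetical, and the model ignores the extra anchor context Edit needs for uniqueness, which is why the measured 1-line tweak (~200 chars) is larger than the model would predict.

```python
def edit_payload(body_chars, changed_fraction):
    """Chars sent by a text edit: old span plus new span."""
    return round(2 * changed_fraction * body_chars)


def symbol_payload(body_chars):
    """Chars sent by a full symbol-body replacement: the new body."""
    return body_chars


# Full rewrites from the table above (changed_fraction = 1.0).
print(edit_payload(1100, 1.0), symbol_payload(1100))  # 2200 1100
print(edit_payload(6500, 1.0), symbol_payload(6500))  # 13000 6500

# The two payloads are equal when half of the body changes.
print(edit_payload(1000, 0.5) == symbol_payload(1000))  # True
```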

---

## 5. Reliability and Correctness (Under Correct Use)

### Precision of matching

- Serena: Symbol-accurate. `find_referencing_symbols(CollectStats)` excludes `LoggedCollectStats`. No false positives observed.
- Grep: Text-matched. `CollectStats` matches `LoggedCollectStats`, `CollectStatsBase`, and docstring references. Requires manual filtering.
- Edit: Text-matched. `replace_all` replaces exact string matches. For unique symbol names, this is reliable. For common strings, it can over-match.

### Scope disambiguation

- Serena: `Collector/_collect` vs `AsyncCollector/_collect` — correctly distinguished by class-scoped name path.
- Built-in: `def _collect` matches all implementations. Must manually verify class context.

### Atomicity

- Serena cross-file operations: Atomic. Single call, all-or-nothing.
- Built-in multi-file chains: Non-atomic. Partial state possible if one call fails.

### External dependency lookup

- Serena: Can read external stubs (e.g., `abc.pyi`) through paths returned by other tools. Direct search (`search_deps=True`) returned empty for `torch.Tensor` and `numpy.array`. Limited to what JetBrains has indexed.
- Built-in: Can `Read` site-packages files if path is known. No semantic indexing.

**Verdict:** Serena provides strictly more precise semantic matching and atomic cross-file operations. External dependency lookup is limited in both toolsets.

---

## 6. Workflow Effects Across a Session

### Where advantages compound

1. **Explore → edit → re-explore cycle:** Serena's name-path results survive edits. In a long session making multiple changes, this saves re-scanning after each edit. The built-in's text matching also survives edits (Edit uses text, not line numbers), so the practical gap is smaller than it appears.

2. **Cross-file refactoring chains:** Rename a class, then move it, then update all references — each Serena call is atomic and builds on the previous result. With built-ins, each step requires finding all sites, reading files, and editing — the manual equivalent of what Serena automates.

### Where advantages diminish

1. **Repeated small edits in one file:** Edit's text matching is equally stable and more token-efficient for small changes. No Serena advantage.

2. **Exploration without editing:** Both toolsets provide usable results. Serena's are more structured but more verbose.

3. **Non-Python files:** Serena's JetBrains backend provides no value for config files, shell scripts, markdown, or notebooks.

**Verdict:** Serena's advantages compound in multi-step cross-file refactoring sessions. They do not compound for single-file iterative editing or non-code work.

---

## 7. Unique Capabilities

1. **Atomic cross-file rename/move** — No built-in equivalent. The closest manual process is a grep-find-edit chain that is non-atomic and error-prone. Frequency: medium. Impact: high (eliminates partial-update risk).

2. **Semantic reference finding with categorization** — `find_referencing_symbols` returns zero-false-positive results categorized by usage type (import, parameter, declaration, reference). Built-in Grep cannot distinguish these. Frequency: high. Impact: medium (saves manual filtering).

3. **Type hierarchy traversal** — Returns transitive super/subtype chains including external library types in one call. Built-in requires iteration and cannot reach external deps. Frequency: medium. Impact: medium.

4. **Symbol-scoped body retrieval** — Read a specific method by name path without reading the surrounding file. Built-in `Read` requires line-range knowledge. Frequency: high. Impact: low (both require 1 call, difference is stable vs ephemeral addressing).

**Verdict:** Four unique capabilities, all semantic-code operations. The most impactful is atomic cross-file refactoring. The first three have no practical built-in equivalent; the fourth differs from `Read` mainly in its stable addressing.

---

## 8. Tasks Outside Serena's Scope (Built-In Only)

| Task | Tool | Frequency |
|------|------|-----------|
| Read config/TOML/YAML files | `Read` | High |
| Free-text search (log strings, TODOs, URLs) | `Grep` | High |
| File discovery by name pattern | `Glob` | Medium |
| Shell commands (git, pip, pytest) | `Bash` | High |
| Write new files from scratch | `Write` | Medium |
| Read images, notebooks, PDFs | `Read` | Low |

Estimated share of daily work covered by built-in-only tasks: **20–30%**. The remaining 70–80% involves reading, editing, and navigating code where Serena's semantic tools are applicable.

**Verdict:** Serena covers the code-editing and code-navigation portions of a session. Config reading, text search, and shell operations remain built-in territory.

---

## 9. Practical Usage Rule

**Use Serena for:** Any cross-file refactoring (rename, move), any "who uses this?" query, any type-hierarchy navigation, any full-method-body replacement, and any situation where you need symbol-accurate results without false positives.

**Use built-ins for:** Small edits (1–3 lines) inside methods, free-text search, reading non-code files, file discovery, and shell operations.

**Hybrid pattern (most efficient):** Use Serena to explore (overview, find symbol, find references) and Edit for small targeted changes. Use Serena for any cross-file refactoring. Use `Read`/`Grep`/`Glob` for non-code tasks and text search. This combination captures the strengths of both toolsets.

**Verdict:** The optimal workflow uses Serena's semantic tools for code navigation and cross-file refactoring, and built-in Edit for small single-file changes. The two toolsets are complementary — Serena handles the structured code operations, built-ins handle the text and system operations.
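
The rule above can be written down as a small decision function. The `Task` fields and the `choose_tool` name are hypothetical, and the 0.5 threshold is the approximate crossover from the token-efficiency analysis rather than a hard limit. It is a sketch of the heuristic, not a component of either toolset.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical description of one unit of work."""
    is_code: bool = True            # False for configs, docs, shell, git
    cross_file: bool = False        # rename or move touching several files
    semantic_query: bool = False    # references, type hierarchy, overview
    changed_fraction: float = 0.0   # share of a method body being rewritten


def choose_tool(task):
    """Return 'serena' or 'built-in' following the usage rule above."""
    if not task.is_code:
        return "built-in"
    if task.cross_file or task.semantic_query:
        return "serena"
    # Single-file edit: pick by how much of the body changes.
    return "serena" if task.changed_fraction >= 0.5 else "built-in"


print(choose_tool(Task(cross_file=True)))         # serena
print(choose_tool(Task(changed_fraction=0.05)))   # built-in
print(choose_tool(Task(is_code=False)))           # built-in
```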

---

## Appendix: Call Count Summary

| Task | Serena calls | Built-in calls | Delta |
|------|-------------|----------------|-------|
| Structural overview (1 file) | 1 | 1 | 0 |
| Method body retrieval | 1 | 1 | 0 |
| Find references (1 symbol) | 1 | 1 | 0 (but Serena has 0 false positives vs Grep's 5+) |
| Type hierarchy | 1 | 2–3 | −1 to −2 |
| Small edit (1 line in 22-line method) | 1 (+1 read) | 1 (+1 read) | 0 |
| Medium edit (6 lines in 20-line method) | 1 | 1 | 0 |
| Insert new method | 1 (+1 confirm) | 1 (+1 read) | 0 |
| Single-file rename (4 occurrences) | 1 | 1 | 0 |
| **Cross-file rename (4 files, 10 occurrences)** | **1** | **8** | **−7** |
| **Move symbol to another module** | **1** | **5+** | **−4+** |
| **Move file + update imports** | **1** | **3+** | **−2+** |
| Safe delete (usage check) | 1 | 1 grep | 0 |

Largest single-task delta: cross-file rename saves 7 calls and provides atomicity.
