:::{admonition} Evaluation Result
:class: note
**Generated by:** Claude Opus 4.6 (coding AI agent in Claude Code CLI)  
**Codebase:** [Tianshou](https://github.com/thu-ml/tianshou) — a Python reinforcement learning library (~26K lines, 43 source files)  
:::

# Claude Code (Opus 4.6, medium)

> **One-line summary:** Serena's IDE-backed semantic tools are the most impactful addition to my toolkit: cross-file renames, moves, and reference lookups that would otherwise take 8–12 careful, error-prone steps collapse into one atomic call, and I would ask any developer I work with to set them up.

**Method:** Hands-on, side-by-side execution of 20 tasks using both toolsets. All edits were applied to real files and verified via `git diff`, then reverted.

---

## 1. Headline: What Serena Changes

Serena adds an IDE-backed semantic layer (powered by JetBrains) on top of the built-in text-level tools. The practical delta breaks down into three categories:

**(a) Tasks where Serena adds capability:**
- **Cross-file refactoring** (rename, move symbol, move file): 1 atomic call vs. a chain of 1 Grep + N×(Read+Edit) calls. This is Serena's strongest contribution: multi-file, import-aware operations collapse into single calls that are atomic and semantically correct.
- **Structural code navigation**: Symbol overviews, type hierarchies, and reference finding return structured, scope-disambiguated results that text search cannot produce without manual interpretation.
- **External dependency introspection**: Serena can look up symbols in installed packages via the IDE index without requiring manual environment discovery.

**(b) Tasks where Serena applies but offers no improvement:**
- **Single-file rename of a unique string**: `Edit` with `replace_all=true` achieves the same result in 1 call (after a Read), with comparable effort.
- **Small edits (1–3 lines inside a method)**: Edit's substring matching is more efficient — less payload sent, no requirement to supply the full symbol body.
- **Inserting new code at a known location**: Both approaches require ~1–2 calls. Edit can target by surrounding text; Serena targets by name path. The effort is comparable.

**(c) Tasks outside Serena's scope (built-in only):**
- Non-code files (config, docs, notebooks, changelogs)
- Free-text search across the repo
- Shell commands, git operations, test execution
- File creation from scratch

**Verdict:** Serena's primary contribution is collapsing multi-file, semantically-aware operations (rename, move, reference-finding, type hierarchy) from multi-step manual processes into single atomic calls; for single-file text edits, it is comparable or slightly less efficient.

---

## 2. Added Value and Differences by Area

**1. Cross-file rename/move (Serena: strong positive)**
- Serena: 1 call updates definition + all imports + all usages across N files atomically.
- Built-in: 1 Grep + N×(Read+Edit) = 2N+1 calls, non-atomic.
- Frequency: Moderate (a few times per working session during refactoring work). Value per hit: High — saves 6–10+ calls and eliminates the risk of partial updates.

**2. Structural overview / targeted symbol reading (Serena: moderate positive)**
- `get_symbols_overview(depth=1)`: 1 call returns all classes, methods, and attributes as structured JSON, produced by the IDE's parser rather than by pattern matching, so it is language-agnostic. Built-in equivalent: `Grep` with a language-specific heuristic pattern like `^(class |def )`. This works for simple Python files but is inherently fragile: the pattern must be hand-tuned per language, skips indented methods and `async def` unless extended, shows only the first line of a multi-line signature, and can match lines inside multi-line strings. For non-Python languages, an entirely different regex would be needed.
- `find_symbol(include_body=True)`: Retrieves a specific method body by name path without reading surrounding code. Built-in: requires knowing the line number (via Grep) then Read with offset — 2 calls.
- Frequency: High (many times per session). Value per hit: Low-to-moderate, saving ~1 call and some context window tokens per navigation. The reliability advantage over heuristic Grep patterns should hold for any language the IDE supports, although only Python was exercised here.

**3. Reference finding with structural context (Serena: moderate positive)**
- `find_referencing_symbols`: Returns which *symbols* reference a target, with file grouping and usage type (import, call, type annotation). Built-in `Grep` returns all text matches including docs, comments, and string literals — same recall but lower precision.
- Frequency: Moderate. Value per hit: Moderate — precision matters when planning refactors of widely-used symbols.

**4. Type hierarchy (Serena: positive, niche)**
- 1 call returns full super/sub-type chains transitively. Built-in: requires iterative Grep for `class X(Y)` patterns, manual transitive closure.
- Frequency: Low (occasional during architecture exploration). Value per hit: High per occurrence — several calls saved.

**5. Stable addressing across edits (Serena: moderate positive)**
- Serena addresses symbols by name path (`CollectStats/refresh_return_stats`), which is invariant under edits elsewhere in the file. The built-in workflow is fundamentally line-number-mediated: Grep returns line numbers, Read takes line offsets — and both go stale after any edit to the file. Edit's text matching (`old_string`) is more resilient than line numbers, but it sits at the end of a chain that starts with position-based lookups. After an edit shifts lines, previously noted Grep results and Read offsets are invalid and must be re-acquired.
- This means a multi-edit session with built-ins can require a re-Grep or re-Read between edits to re-establish positions, while Serena's name paths remain valid throughout. In Task 17 (three chained edits), Serena needed 3 calls total; the built-in path needed 4 (1 Read + 3 Edits), and only because the text anchors happened to be unique. Had any Edit failed the uniqueness check, a re-Read would have been required, pushing the count to 5–7.
- Frequency: High (any session with multiple edits to the same file). Value per hit: Moderate — saves 1–2 re-read calls per edit chain, and eliminates the class of errors where stale line numbers cause edits to land in the wrong place.

**6. Single-file small/medium edits (Serena: neutral to slight negative)**
- For a 1-line change inside a 13-line method, Serena's `replace_symbol_body` requires sending the entire 13-line body. Edit sends just the changed line. The prerequisite cost differs: Serena needs `find_symbol(include_body=True)` to get the body; Edit needs `Read` of the relevant lines. Both are 1 prerequisite call + 1 edit call.
- Frequency: Very high. Value per hit: Slightly negative for Serena — more payload for small changes.

**Verdict:** Serena's strongest contributions are in cross-file refactoring (high value, moderate frequency) and structured navigation (moderate value, high frequency); for small text edits, built-ins are slightly more efficient.

---

## 3. Detailed Evidence, Grouped by Capability

### 3.1 Structural Overview (Task 2)

| Metric | Serena `get_symbols_overview(depth=1)` | Built-in `Grep` for class/def |
|---|---|---|
| Calls | 1 | 1 |
| Output structure | JSON tree: 14 classes with nested methods + attributes | Flat list: 54 `class`/`def` lines with line numbers |
| Nesting visible? | Yes — methods grouped under classes | No — indentation implies nesting but no grouping |
| Attributes visible? | Yes | No |
| Output size | ~2.5KB structured JSON | ~3KB flat text |
| Correctness | Always correct (uses IDE parser, language-agnostic) | Heuristic: the regex `^(class \|def \|    def )` is Python-specific, misses `async def` and methods of nested classes, shows only the first line of multi-line signatures, and can match lines inside multi-line strings |
| Language portability | Works unchanged for any language the IDE supports | Requires a new hand-tuned regex per language |
| Next step | `find_symbol("ClassName/method", include_body=True)` — 1 call | `Read(file, offset=line, limit=N)` — 1 call |

Both paths need 1 call for overview + 1 call for drill-down. Serena's output is more structured and reliably correct; Grep's is simpler but fragile. The Grep pattern used here happened to work well for this Python codebase, but would need to be rewritten for Java (`class|interface|enum`, brace-delimited), Rust (`fn|struct|impl|trait`), TypeScript (`class|function|interface`), etc. Even within Python it has blind spots: `async def` methods, methods of nested classes (indented by more than four spaces), and any signature that continues past the `def` line. It also cannot show decorators, so `@overload` variants appear as repeated hits without their decorator.
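
To make the fragility concrete, here is a minimal sketch that runs a line-anchored pattern of the same shape against a toy module and compares it with a parsed overview. Python's standard `ast` module stands in for the IDE parser; this is an illustration, not how Serena or JetBrains works internally, and `SAMPLE`, `regex_overview`, and `parsed_overview` are hypothetical names introduced for this example.

```python
import ast
import re

# Toy module (not taken from Tianshou) with constructs that trip a line regex.
SAMPLE = '''\
class Outer:
    """Docstring with a misleading line.

def not_real(): this line lives inside the docstring
    """

    async def fetch(self):
        return 1

    def multi(
        self,
        x,
    ):
        return x

    class Inner:
        def deep(self):
            return 2
'''

# Same shape as the heuristic pattern discussed above (anchored via .match).
HEURISTIC = re.compile(r"(class |def |    def )")


def regex_overview(source):
    """Return every line whose start matches the heuristic pattern."""
    return [line for line in source.splitlines() if HEURISTIC.match(line)]


def parsed_overview(source):
    """Return name paths for all classes and functions found by the parser."""
    paths = []

    def visit(node, prefix):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
                path = f"{prefix}/{child.name}" if prefix else child.name
                paths.append(path)
                visit(child, path)

    visit(ast.parse(source), "")
    return paths


print(regex_overview(SAMPLE))
print(parsed_overview(SAMPLE))
```

On this sample the regex reports three lines: the class, a false positive from the docstring, and a truncated `def multi(`. It misses `fetch`, `Inner`, and `deep`. The parsed walk returns all five name paths with their nesting.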

**Verdict:** Serena provides a reliably correct structural overview across languages; Grep-based overviews are heuristic approximations that work in simple cases but degrade for complex declarations or non-Python codebases.

### 3.2 Targeted Method Retrieval (Task 3)

| Metric | Serena | Built-in |
|---|---|---|
| Calls | 1 (`find_symbol` with `include_body=True`) | 2 (Grep to find line + Read with offset) |
| Prerequisite | Know the name path | Know the method name |
| Output | Method body only (no surrounding code) | Surrounding code included (must estimate limit) |

For `Collector/_collect` (330 lines): Serena returned exactly 330 lines. Built-in Read returned 332 lines (including the next method's signature).

**Verdict:** Serena saves 1 call and returns precisely the requested body; built-in is adequate but requires line-number discovery.

### 3.3 Cross-File References (Task 4)

| Metric | Serena `find_referencing_symbols` | Built-in `Grep` |
|---|---|---|
| Calls | 1 | 1 |
| Files found | 63 (code files only, structured by symbol type) | 83 (includes docs, notebooks, changelogs, README) |
| Output structure | Grouped by file, categorized (import, call, type annotation) | Flat file list |
| Precision | Code-only references | All textual mentions |

Grep finds 83 files including `README.md`, `CHANGELOG.md`, and `.ipynb` notebooks. Serena finds 63 code files. For the question "who uses this in code?", Serena's result is directly usable. For "where is this mentioned anywhere?", Grep is the right tool.
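
The gap between the two counts is the gap between textual mentions and code references. A rough sketch of that distinction, using Python's standard `tokenize` module as a stand-in for semantic resolution (an assumption for illustration: it separates identifiers from comments and strings but does not resolve scopes the way an IDE index does). The sample source and function names are hypothetical.

```python
import io
import re
import tokenize

# Toy source: the name appears in code, in a comment, in a docstring, and in a string.
SAMPLE = '''\
from pkg import Collector  # Collector is imported here

def build():
    """Return a Collector."""
    return Collector()

note = "Collector docs live elsewhere"
'''


def text_mentions(source, name):
    """Grep-like: line numbers of every line containing the word."""
    pattern = re.compile(rf"\b{re.escape(name)}\b")
    return [n for n, line in enumerate(source.splitlines(), 1) if pattern.search(line)]


def code_references(source, name):
    """Token-based: line numbers where the name occurs as an identifier."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [tok.start[0] for tok in tokens
            if tok.type == tokenize.NAME and tok.string == name]


print(text_mentions(SAMPLE, "Collector"))    # [1, 4, 5, 7]
print(code_references(SAMPLE, "Collector"))  # [1, 5]
```

The two extra text hits (lines 4 and 7) are a docstring and a string literal, the same kind of match that accounts for the docs and notebooks in Grep's file list.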

**Verdict:** Serena provides higher-precision code-usage results; Grep provides broader text-level coverage. Different tools for different questions.

### 3.4 Type Hierarchy (Task 5)

| Metric | Serena `type_hierarchy` | Built-in |
|---|---|---|
| Calls | 1 | 2+ (Grep for `class X(BaseCollector` + Grep for `class Collector(` + potential transitive search) |
| Result | `BaseCollector → ABC → object` (supers), `BaseCollector → Collector → AsyncCollector` (subs), with file locations | `class Collector(BaseCollector[TCollectStats], ...)` at line 551 — requires further reads for AsyncCollector's superclass |

**Verdict:** Serena produces the full transitive hierarchy in 1 call; built-in requires iterative searches.
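
For a sense of what the "iterative searches" amount to, the sketch below performs the manual transitive closure over `class X(Y)` matches. The class declarations are simplified toy strings modeled on the hierarchy reported above, not the actual Tianshou signatures, and the file names and helper names are hypothetical.

```python
import re
from collections import defaultdict, deque

# Simplified stand-ins for source files; only the class lines matter here.
SOURCES = {
    "base.py": "class BaseCollector(ABC):\n    pass\n",
    "collector.py": (
        "class Collector(BaseCollector[TStats], Generic[TStats]):\n    pass\n"
        "class AsyncCollector(Collector):\n    pass\n"
    ),
    "other.py": "class Unrelated(object):\n    pass\n",
}

CLASS_RE = re.compile(r"^class\s+(\w+)\s*\(([^)]*)\)\s*:", re.MULTILINE)


def direct_subclasses(sources):
    """Map each base-class name to the classes that list it as a base."""
    children = defaultdict(list)
    for text in sources.values():
        for name, bases in CLASS_RE.findall(text):
            for base in bases.split(","):
                match = re.match(r"\s*(\w+)", base)  # strip generics like [T]
                if match:
                    children[match.group(1)].append(name)
    return children


def all_subclasses(sources, root):
    """Breadth-first transitive closure: one 'Grep round' per level."""
    children = direct_subclasses(sources)
    found, queue = [], deque([root])
    while queue:
        for child in children.get(queue.popleft(), []):
            if child not in found:
                found.append(child)
                queue.append(child)
    return found


print(all_subclasses(SOURCES, "BaseCollector"))  # ['Collector', 'AsyncCollector']
```

Each level of the hierarchy corresponds to another search in the built-in workflow, and the regex inherits the usual text-level caveats (multi-line base lists, aliased imports).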

### 3.5 External Dependency Lookup (Task 6)

| Metric | Serena | Built-in |
|---|---|---|
| Can retrieve? | Yes, via `find_declaration` + `search_deps=True` | Requires environment discovery: `python -c "import inspect, X; print(inspect.getfile(X))"` then `Read` |
| Infrastructure | IDE index (pre-built) | Working Python environment, correct venv activated |
| Result | Symbol location + documentation | Full source file (if env is set up) |

In this session, the Python environment wasn't directly accessible from bash, so built-in lookup would have required additional setup. Serena retrieved `torch.distributions.Distribution`'s location and docstring via the IDE index.
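
For reference, the built-in route looks roughly like the sketch below once a working interpreter is available. It uses the standard library's `json.JSONDecoder` as a stand-in target, since `torch` may not be installed wherever this is run; the helper names are hypothetical.

```python
import importlib.util
import inspect
import json


def locate_module(module_name):
    """Return the file a top-level module would load from, or None if absent."""
    spec = importlib.util.find_spec(module_name)
    return spec.origin if spec else None


def describe_symbol(obj):
    """Return the defining file and docstring of an already-imported symbol."""
    return {"file": inspect.getfile(obj), "doc": inspect.getdoc(obj)}


print(locate_module("json"))
info = describe_symbol(json.JSONDecoder)
print(info["file"])
print(info["doc"].splitlines()[0])
```

Both helpers only work inside the environment where the dependency is installed, which is exactly the prerequisite that was missing in this session.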

**Verdict:** Serena provides dependency introspection without environment setup; built-in requires a working interpreter.

### 3.6 Small Edit — 1 Line Change (Task 7a)

| Metric | Serena `replace_symbol_body` | Built-in `Edit` |
|---|---|---|
| Prerequisite calls | 1 (`find_symbol` with `include_body=True`) — already done in Task 3 | 1 (`Read` of ~15 lines) |
| Edit call | 1 (send full 13-line body) | 1 (send 1-line old + 1-line new) |
| Payload sent (edit call) | ~550 chars (full body) | ~120 chars (changed line only) |
| Total calls | 1–2 | 2 |

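**Verdict:** For small changes, Edit sends ~4.5× less payload in the edit call. Total call count is the same (2) unless the symbol body was already fetched during exploration, in which case Serena needs only 1.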

### 3.7 Medium Rewrite — ~19 Lines (Task 7b)

| Metric | Serena | Built-in |
|---|---|---|
| Prerequisite | 1 `find_symbol` (already done) | 1 `Read` (~22 lines) |
| Edit payload | ~550 chars (full body) | ~1000 chars (old 19 lines + new 20 lines) |
| Total calls | 1–2 | 2 |

At this scale, Serena's payload is actually *smaller* because Edit must send both old and new text, while Serena sends only the new body. The crossover point is approximately where the changed region exceeds half the method body.
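
The crossover follows from a simple payload model, sketched here under two stated assumptions: Serena sends the whole new body, and Edit sends the changed region twice (old text plus new text of similar size). The function names are hypothetical and the model ignores per-call overhead.

```python
def serena_payload(new_body_chars: int) -> int:
    """replace_symbol_body sends the full replacement body."""
    return new_body_chars


def edit_payload(old_region_chars: int, new_region_chars: int) -> int:
    """Edit sends the old text to match plus the new text to insert."""
    return old_region_chars + new_region_chars


def cheaper_tool(body_chars: int, changed_chars: int) -> str:
    """Compare payloads, assuming the changed region keeps roughly its size."""
    edit = edit_payload(changed_chars, changed_chars)
    serena = serena_payload(body_chars)
    if edit == serena:
        return "tie"
    return "Edit" if edit < serena else "Serena"


print(cheaper_tool(550, 60))   # Edit   (120 vs 550, as in Task 7a)
print(cheaper_tool(550, 275))  # tie    (changed region is half the body)
print(cheaper_tool(550, 500))  # Serena (1000 vs 550, as in Task 7b)
```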

**Verdict:** At medium scale the balance flips: Serena's payload is ~1.8× smaller (~550 vs. ~1000 chars) because it sends only the replacement, not the original.

### 3.8 Large Rewrite — 55 Lines (Task 7c)

| Metric | Serena | Built-in |
|---|---|---|
| Prerequisite | 1 `find_symbol` (already done) | 1 `Read` (~67 lines) |
| Edit payload | ~2200 chars (new body) | ~4400 chars (old 63 lines + new 62 lines) |
| Total calls | 1–2 | 2 |

**Verdict:** For full-method rewrites, Serena sends ~50% less payload because it only sends the replacement, not the original.

### 3.9 Cross-File Rename (Task 10)

| Metric | Serena `rename` | Built-in chain |
|---|---|---|
| Calls | 1 | 1 Grep + 4 Read + 4 Edit = 9 |
| Files affected | 4 (automatically discovered) | 4 (manually discovered via Grep) |
| Import handling | Automatic | Manual — must identify and rewrite each import statement |
| Atomicity | Atomic — all-or-nothing | Sequential — intermediate states are inconsistent |

**Verdict:** Serena reduces a 9-call chain to 1 call for cross-file rename, with atomicity.
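
The call counts in this report follow two small formulas, written out below as a sanity check. The function names are hypothetical; the formulas are the ones stated in the text (1 Grep plus a Read and an Edit per file, and for a file move an extra `git mv`).

```python
def builtin_rename_calls(n_files: int) -> int:
    """1 Grep to discover usages, then Read + Edit for each affected file."""
    return 1 + 2 * n_files


def builtin_file_move_calls(n_import_sites: int) -> int:
    """1 git mv + 1 Grep, then Read + Edit for each import site."""
    return 2 + 2 * n_import_sites


print(builtin_rename_calls(4))     # 9, matching the 4-file rename above
print(builtin_file_move_calls(5))  # 12, matching the file move with 5 import updates
```

Because the built-in cost grows linearly with the number of affected files while Serena's stays at 1, the saving scales with how widely the symbol is used.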

### 3.10 Move Symbol Across Modules (Task 11)

| Metric | Serena `move` | Built-in equivalent |
|---|---|---|
| Calls | 1 | ~8–12 (Read source, Edit source to remove, Edit target to add, Grep for imports, Read+Edit each import site) |
| Import updates | Automatic | Manual — must rewrite `from X import Y` → `from Z import Y` in each file |
| Dependency imports | Automatically adds needed imports to target file | Must manually inspect what the moved function imports and replicate |

Moving `get_stddev_from_dist` from `collector.py` to `stats.py`: Serena updated 3 files (source, target, test) in 1 call, including adding necessary imports to the target module.

**Verdict:** Move is Serena's highest-value single operation — it handles import graph updates that would be error-prone manually.

### 3.11 Move File (Task 12a)

| Metric | Serena `move` | Built-in equivalent |
|---|---|---|
| Calls | 1 | 1 `git mv` + 1 Grep + N×(Read+Edit) for import updates = ~12 |
| Files updated | 5 import updates automatic | Must manually discover and rewrite |

**Verdict:** Same pattern as symbol move — Serena collapses an N-step process.

### 3.12 Safe Delete (Task 12b)

Serena's `safe_delete` in safe mode (default):
- Reports all usages before deleting — acts as a guard.
- For unused symbols, deletes cleanly in 1 call.
- For used symbols, refuses and lists usages.

Built-in equivalent: Grep to check usages → if none found, Read + Edit to delete. 2–3 calls.
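
The manual guard can be approximated as below. This is a text-level sketch of the check only (it does not delete anything), the word-boundary match is a heuristic rather than true usage resolution, and `find_usages`, `can_delete`, and the sample files are hypothetical.

```python
import re

# Toy project: `used` is imported and called elsewhere, `unused` is not.
SOURCES = {
    "a.py": "def used():\n    return 1\n\ndef unused():\n    return 2\n",
    "b.py": "from a import used\nprint(used())\n",
}


def find_usages(sources, name):
    """List (file, line) pairs mentioning `name`, excluding its own definition."""
    mention = re.compile(rf"\b{re.escape(name)}\b")
    definition = re.compile(rf"^\s*(def|class)\s+{re.escape(name)}\b")
    return [
        (path, lineno)
        for path, text in sources.items()
        for lineno, line in enumerate(text.splitlines(), 1)
        if mention.search(line) and not definition.match(line)
    ]


def can_delete(sources, name):
    """Refuse when usages exist, mirroring the safe-mode behaviour described above."""
    usages = find_usages(sources, name)
    return (not usages, usages)


print(can_delete(SOURCES, "used"))    # (False, [('b.py', 1), ('b.py', 2)])
print(can_delete(SOURCES, "unused"))  # (True, [])
```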

**Verdict:** Safe delete adds a safety check that the built-in path must implement manually.

### 3.13 Inline (Task 13)

Serena's inline tool did not work for any Python function tested. This appears to be a limitation of the JetBrains refactoring engine for the Python functions tried here rather than a Serena bug.

**Verdict:** No candidate inlinable in this codebase; inline capability appears unavailable for Python.

### 3.14 Scope Precision (Task 14)

Three methods named `reset_env` exist in `collector.py` (in `BaseCollector`, `Collector`, `AsyncCollector`).

- Serena: `find_symbol("Collector/reset_env")` returns exactly that override's body.
- Grep: Returns 3 line numbers. Determining which belongs to which class requires reading surrounding context.
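
The difference can be reproduced on a toy file. In the sketch below, Python's `ast` module plays the role of the name-path resolver. That is an assumption for illustration only: Serena's actual resolution goes through the IDE, and `resolve_name_path` is a hypothetical helper.

```python
import ast

# Toy version of the three-override situation; bodies are placeholders.
SAMPLE = '''\
class BaseCollector:
    def reset_env(self):
        return "base"

class Collector(BaseCollector):
    def reset_env(self):
        return "collector"

class AsyncCollector(Collector):
    def reset_env(self):
        return "async"
'''


def grep_lines(source, needle):
    """Text search: every line number containing the needle."""
    return [n for n, line in enumerate(source.splitlines(), 1) if needle in line]


def resolve_name_path(source, name_path):
    """Walk 'Class/method' through the parse tree and return the matching node.

    Raises StopIteration if any path component is missing.
    """
    node = ast.parse(source)
    for part in name_path.split("/"):
        node = next(
            child for child in node.body
            if isinstance(child, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef))
            and child.name == part
        )
    return node


print(grep_lines(SAMPLE, "def reset_env"))  # [2, 6, 10]
target = resolve_name_path(SAMPLE, "Collector/reset_env")
print(target.lineno)                         # 6
print(ast.get_source_segment(SAMPLE, target))
```

The text search returns three candidates with no class information; the path lookup returns the one override asked for, together with its exact source span.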

**Verdict:** Serena's name-path addressing eliminates class-level disambiguation that Grep requires.

### 3.15 Chained Edits (Task 17)

Three consecutive edits to methods in `CollectStats`:
- Serena: 3 `replace_symbol_body` calls, 0 intermediate reads. Name paths (`CollectStats/refresh_return_stats`, `CollectStats/refresh_len_stats`, `CollectStats/refresh_std_array_stats`) remained valid across all edits because they are structural identifiers, not positions.
- Built-in: 1 Read + 3 Edit calls = 4 calls in the best case. The Read was required upfront (Edit enforces "must read before editing"). The 3 Edits succeeded without re-reads only because the `old_string` anchors happened to remain unique after each edit. But this is fragile: Edit also enforces a "file has been modified since read" check when external tools modify the file, forcing a re-Read. More fundamentally, if I had needed to *find* these methods first (the typical case when you don't already know the line numbers), each Grep result from before the first edit would have been stale after it — line 232 is no longer line 232 after inserting 2 lines above it.

The core asymmetry: Serena's addressing is *structural* (survives edits by definition), while built-in addressing is *positional* (line numbers from Grep/Read go stale after any insertion or deletion). Edit's text matching partially mitigates this, but only for the final step — the discovery steps (Grep, Read with offset) remain position-dependent.
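
A small sketch of the two built-in addressing modes and how each fails. The helpers are hypothetical: `line_at` models a recorded line number, and `replace_unique` models a content-based edit that refuses non-unique anchors, as described for Edit above.

```python
ORIGINAL = "a = 1\nb = 2\nc = 3\n"


def line_at(text, lineno):
    """Positional lookup: whatever currently sits on that line."""
    return text.splitlines()[lineno - 1]


def insert_lines(text, before_lineno, new_lines):
    """Insert lines ahead of the given 1-based line number."""
    lines = text.splitlines()
    lines[before_lineno - 1:before_lineno - 1] = new_lines
    return "\n".join(lines) + "\n"


def replace_unique(text, old, new):
    """Content-based edit that fails unless the anchor occurs exactly once."""
    if text.count(old) != 1:
        raise ValueError(f"anchor {old!r} occurs {text.count(old)} times")
    return text.replace(old, new)


recorded = 3                      # noted before editing: line 3 is "c = 3"
edited = insert_lines(ORIGINAL, 1, ["# header", "import os"])

print(line_at(ORIGINAL, recorded))  # c = 3
print(line_at(edited, recorded))    # a = 1  (the recorded position is stale)
print(replace_unique(edited, "c = 3", "c = 30").splitlines()[-1])  # c = 30
```

The content anchor survives the shift, but only while it stays unique; `replace_unique("x = 1\nx = 1\n", "x = 1", "x = 2")` raises, which is the case that forces a re-Read in the built-in workflow.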

**Verdict:** Serena's name-path stability eliminates the re-read/re-grep cycle between edits; built-in tools require re-acquiring positions after each edit that shifts line numbers.

---

## 4. Token-Efficiency Analysis

### Payload by edit size

| Edit scale | Serena payload (edit call) | Edit payload (edit call) | Winner |
|---|---|---|---|
| 1-line change in 13-line method | ~550 chars (full body) | ~120 chars | Edit (~4.5×) |
| 19-line medium rewrite | ~550 chars | ~1000 chars | Serena (~1.8×) |
| 55-line full rewrite | ~2200 chars | ~4400 chars | Serena (~2×) |
| Cross-file rename (4 files) | ~100 chars | ~800 chars (4 Edit calls) | Serena (~8×) |
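
The "Winner" column can be recomputed from the approximate payload figures in the table. A short check, with the dictionary keys and the `winner` helper being hypothetical names:

```python
# (Serena chars, Edit chars) as reported in the table above; all values approximate.
PAYLOADS = {
    "1-line change in 13-line method": (550, 120),
    "19-line medium rewrite": (550, 1000),
    "55-line full rewrite": (2200, 4400),
    "Cross-file rename (4 files)": (100, 800),
}


def winner(serena_chars, edit_chars):
    """Return the tool with the smaller payload and the ratio, to one decimal."""
    if serena_chars < edit_chars:
        return "Serena", round(edit_chars / serena_chars, 1)
    return "Edit", round(serena_chars / edit_chars, 1)


ratios = {task: winner(*chars) for task, chars in PAYLOADS.items()}
for task, (tool, ratio) in ratios.items():
    print(f"{task}: {tool} ({ratio}x)")
```

The first row comes out nearer 4.6× than 4.5×, which is within the rounding of the approximate character counts.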

### Prerequisite reads

- Serena: `find_symbol(include_body=True)` returns the symbol body. If you've already navigated to it during exploration, no additional call needed.
- Built-in: `Read` with offset/limit. Always required before Edit (enforced by the tool).

### Stable vs ephemeral addressing

- Serena's name paths (`CollectStats/refresh_return_stats`) are stable identifiers — they survive any edit to surrounding code. No re-read or re-discovery needed between edits.
- The built-in workflow uses ephemeral, position-based addresses in its discovery steps: Grep returns line numbers and Read takes line offsets, and both are invalidated by any insertion or deletion above the target. Edit's `old_string` matching is content-based and more resilient, but it depends on the upstream position-based steps to know *what* to match. After an edit shifts line numbers, any previously recorded Grep results and Read offsets must be re-acquired before they can be used again.
- In a session with N edits to the same file, this costs up to N-1 additional Grep/Read round-trips with built-ins. With Serena, the cost is zero — the same name path works on the first and tenth edit.

**Verdict:** Serena is more token-efficient for medium-to-large edits and cross-file operations; Edit is more efficient for small, localized changes. Across multi-edit sessions, Serena's stable addressing avoids the re-read tax that compounds with each successive built-in edit.

---

## 5. Reliability & Correctness (Under Correct Use)

### Precision of matching
- Serena: Exact symbol resolution via name paths. `Collector/reset_env` unambiguously selects one method.
- Edit: Text matching. Unique strings match correctly; non-unique strings fail (and Edit reports the error).

### Scope disambiguation
- Serena distinguishes overrides, overloads (via indices), and nested classes by name path.
- Grep/Edit cannot distinguish methods with the same name in different classes without reading surrounding context.

### Atomicity
- Serena's cross-file operations (rename, move) are atomic — all files updated or none.
- Built-in multi-file edits are sequential — a failure mid-chain leaves an inconsistent state (recoverable via `git checkout`).

### Semantic queries vs text search
- `find_referencing_symbols` returns code-level references categorized by type. Grep returns all textual mentions.
- `type_hierarchy` returns transitive sub/supertypes. No built-in equivalent without iterative search.

### External dependency lookup
- Serena: Available via IDE index, no environment setup needed. Can retrieve symbol docs and location.
- Built-in: Requires working Python environment, correct venv, and manual file discovery. More powerful when available (full source), but higher setup cost.

**Verdict:** Serena provides stronger correctness guarantees for symbol-level operations (scope, atomicity, semantic precision); built-in tools are reliable for text-level operations with the caveat that scope disambiguation requires manual effort.

---

## 6. Workflow Effects Across a Session

### Compound advantages
- **Exploration → Edit without position re-acquisition**: Serena's overview tools produce name paths that are directly usable as edit targets. The built-in path produces line numbers (from Grep) that are consumed by Read — but after an edit, those line numbers are stale. In a typical explore-edit-explore-edit cycle, Serena's name paths remain valid throughout while built-in line numbers must be re-acquired after each edit. Over 3 explore+edit cycles, this saves ~3–6 intermediate Grep/Read calls.
- **Chained edits without re-reads**: Name-path stability means no re-reads between edits to the same file. For N edits, this saves up to N-1 Read calls. Edit's text matching partially avoids this (if old_strings stay unique), but the upstream discovery steps (Grep line numbers, Read offsets) still go stale.
- **Cross-file refactoring**: When a rename or move is part of a larger change, doing it atomically avoids the need to manually track "which files still need updating."

### Diminishing returns
- For purely exploratory sessions (reading code, no edits), Serena's advantage is modest — `Grep` and `Read` are nearly as fast for navigation, and Serena's structured output doesn't save many calls.
- For sessions dominated by small text edits (config changes, log message tweaks), Serena adds no value.

### Neutral findings
- Both toolsets require similar total calls for single-file work. The difference is ~1 call per operation.
- Output quality for code review (reading diffs, understanding changes) is identical — both require `git diff`.

**Verdict:** Serena's advantages compound across multi-step refactoring sessions where each operation feeds into the next; for read-heavy or small-edit sessions, the advantage is marginal.

---

## 7. Unique Capabilities (No Practical Built-In Equivalent)

1. **Atomic cross-file rename** — 1 call, all imports and usages updated, all-or-nothing. No built-in equivalent without scripting a multi-step chain. Frequency: moderate (refactoring sessions). Impact: high (saves 5–10 calls, eliminates partial-update risk).

2. **Atomic cross-file move** (symbol or file) with import rewriting — includes adding necessary imports to the target module. Frequency: low-to-moderate. Impact: very high per occurrence (saves 8–12 calls, handles import dependency graph).

3. **Type hierarchy traversal** — transitive supertypes and subtypes in 1 call. Built-in requires iterative Grep with manual transitive closure. Frequency: low. Impact: moderate (saves 3–5 calls).

4. **Safe delete with usage check** — reports all usages before deleting, with option to propagate. Built-in requires Grep + manual verification. Frequency: low. Impact: moderate (safety check is the value).

5. **External dependency symbol lookup** via IDE index — no Python environment needed. Frequency: moderate. Impact: moderate (avoids environment setup friction).

**Verdict:** Serena provides 3–5 capabilities with no practical built-in equivalent, concentrated in cross-file refactoring and semantic navigation.

---

## 8. Tasks Outside Serena's Scope (Built-In Only)

- **Non-code file operations**: Reading/editing config files, docs, changelogs, notebooks → `Read`/`Edit`/`Write`
- **Free-text search**: Finding log strings, URLs, magic constants → `Grep`
- **Shell operations**: Running tests, builds, git commands, package management → `Bash`
- **File creation**: New files from scratch → `Write`
- **Glob-based file discovery**: Finding files by pattern → `Glob`
- **Broad codebase search**: When you don't know what you're looking for → `Grep` with regex

Estimated share of daily work: These tasks constitute roughly 40–60% of a typical coding session (reading docs, running tests, searching for patterns, editing config). Serena's augmentation covers the remaining 40–60% where code-level semantic operations apply.

**Verdict:** Built-in tools handle roughly half of daily work that falls entirely outside Serena's scope; Serena augments the code-centric other half.

---

## 9. Practical Usage Rule

| Task type | Use |
|---|---|
| Cross-file rename, move, delete | Serena (unique capability) |
| Understanding class hierarchy or symbol relationships | Serena (1 call vs. iterative search) |
| Getting a structural overview of a large file | Serena (richer structure) or Grep (simpler, faster) |
| Reading a specific method body by name | Serena (direct) or Grep+Read (2 calls) |
| Small edit (1–5 lines inside a method) | Edit (less payload) |
| Full method rewrite | Serena `replace_symbol_body` (less payload, no re-read) |
| Single-file rename of a unique identifier | Edit with `replace_all` (equivalent) |
| Non-code files, config, docs | Built-in `Read`/`Edit` |
| Text search, pattern matching | Built-in `Grep` |
| Anything involving shell, git, tests | Built-in `Bash` |
| External dependency inspection | Serena (no env setup needed) |

**Verdict:** Use Serena for cross-file refactoring and semantic navigation; use built-ins for text-level edits, non-code files, and shell operations; for single-file code edits, choose based on edit size (Edit for small, Serena for full-body rewrites).
