Copilot CLI (GPT-5.4, medium)

Evaluation Result

Generated by: GPT-5.4 (medium) in Copilot CLI
Codebase: ente - a large monorepo in Dart, TypeScript, Go, Rust, and other languages.

As a coding agent, I’d absolutely ask my owner to add Serena because it makes me noticeably sharper and calmer on real code (especially symbol-aware navigation, cross-file refactors, and monorepo dependency jumps), while I still lean on built-ins for tiny text edits and non-code work.

Baseline. I started from source code only, avoided repo docs/notes, and ran each reversible experiment against the repo directly. After every edit/refactor experiment, the working tree returned to its baseline state.

1. Headline: what Serena changes

Serena changes the workflow when the task is about code symbols rather than raw text. In this repo, the practical delta was:

  1. Added capability / materially better workflow. Symbol-aware navigation and refactors in TypeScript: structural overviews, precise code-only references, hierarchy queries, symbol-targeted rename, file move with import updates, and inline. These usually collapsed a 2-6 step built-in chain into one semantic operation after discovery, and reduced manual scope verification.

  2. Applies but offers little or no improvement. Small local edits inside an already-understood method. Built-ins can patch only the changed lines; Serena’s body replacement resends the whole symbol, so it was often less payload-efficient for 1-3 line tweaks.

  3. Outside Serena’s scope. Non-code reads, free-text search, git inspection, config/package files, and other text-first tasks. Built-ins remained the natural tools there.

Two important observed limits constrained Serena’s delta here: the strongest measured gains were concentrated in the TypeScript desktop app and the Rust core crate where I ran the hands-on comparisons, and some refactors still carried diff-shape tradeoffs such as formatting churn or unexpected target-file choices.

Verdict: In this repo, Serena was a strong TypeScript symbol layer on top of the built-ins, not a general replacement for text/file work.

2. Added value and differences by area

| Area | What changed vs built-ins | Frequency | Value per hit |
| --- | --- | --- | --- |
| Cross-file symbol refactors | rename, file move, and inline turned manual search/edit/update chains into one semantic op. The wait -> delay rename updated 4 files from 1 symbol definition; moving http.ts updated the importing file automatically. | Medium | High: typically 2-5 calls saved plus less manual scope checking |
| Code-only discovery | Symbol overview, symbol body retrieval, reference search, and type hierarchy returned code structure directly instead of raw text matches. For wait, Serena returned 3 real code-use files; rg returned 7 files, including docs/comments and English-word hits. | High | Medium: usually 1-3 follow-up reads/filters avoided |
| Stable addressing | Name paths stayed reusable across multiple edits (createMainWindow, openStreetMapUserAgent, wait); built-in line ranges had to be reacquired after edits. | Medium | Medium: less re-reading, less stale-context risk |
| Small in-method edits | Serena was not more efficient. Replacing AutoLauncher/toggleAutoLaunch required resending the full method body, while the built-in patch changed only the touched lines. | High | Low negative: built-ins used smaller edit payloads |
| External dependency lookup in a monorepo | Once indexing was available, Serena resolved Electron types from desktop/node_modules, next-electron-server declarations from the desktop package, and Rust crate symbols from Cargo registry sources. In a monorepo, that removes a manual “which package owns this dependency?” step. | Medium | Medium-High: usually 1-3 searches plus path discovery avoided |

Verdict: Serena’s highest-value delta was semantic refactoring plus dependency-aware code lookup in the TypeScript/Rust parts of the monorepo; its weakest area was tiny local edits.

3. Detailed evidence, grouped by capability

3.1 Codebase understanding

Task 1: high-level repository overview

  • Attempted: top-level layout and likely code-heavy areas.

  • Serena chain: serena-list_dir(.) -> directory list.

  • Built-in chain: not separately needed; Serena added no unique value beyond a normal directory listing here.

  • Payloads: one short directory listing either way.

  • Finding: no meaningful delta; this is plain filesystem exploration.

Verdict: For repo layout, Serena was neutral.

Task 2: structural overview of a large file, plus the concrete next step

  • Target file: desktop/src/main.ts (753 lines).

  • Serena chain: get_symbols_overview(main.ts, depth=1) -> concise symbol map of top-level functions and nested locals under main; next step find_symbol(createMainWindow, include_body=true).

  • Built-in chain: rg on const|function|class|export in main.ts -> flat text hits; next step view of lines 331-439 to read createMainWindow.

  • Payloads observed:

    • Serena overview output: compact symbol list for the file; next-step body fetch returned only the selected symbol body.

    • Built-in overview output: many matching lines without structure; next-step read required ~109 lines of file content.

  • Delta: Serena’s overview was not just shorter; it also supplied stable symbol names for the follow-up call. Built-ins could answer the question, but only after a second text-localizing step.

Verdict: Serena materially improved the “overview -> inspect one function” flow by making the follow-up call symbol-based instead of line-based.

Task 3: retrieve a specific class method body without reading the surrounding file

  • Target symbol: AutoLauncher/toggleAutoLaunch in desktop/src/main/services/auto-launcher.ts.

  • Serena chain: find_symbol(AutoLauncher/toggleAutoLaunch, include_body=true) -> exact method body.

  • Built-in chain: view of the relevant file range (20-40) after locating the method.

  • Payloads observed:

    • Serena returned the 10-line method body only.

    • Built-in read returned 21 lines of surrounding class context.

  • Delta: Serena saved one localization step and avoided unrelated lines.

Verdict: Serena added a real but modest efficiency gain for targeted method retrieval.

Task 5: supertypes / subclasses / implementations

  • Equivalent used: interface hierarchy in web/apps/ensu/src/services/llm/inference.ts, because this TS area had interface implementations rather than rich class inheritance.

  • Serena chain: type_hierarchy(InferenceBackend, both) -> WasmInference, TauriInference; type_hierarchy(WasmInference, both) -> supertype InferenceBackend.

  • Built-in chain: rg InferenceBackend|implements InferenceBackend -> manual reconstruction from four text matches.

  • Payloads observed: Serena returned the hierarchy directly; built-ins returned only raw declarations/usages.

  • Delta: Serena removed the manual synthesis step. Built-ins were sufficient here only because the example was small and the hierarchy shallow.

Verdict: Serena added moderate value for hierarchy queries; the value grows with hierarchy depth.
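To make the queried relationship concrete, here is a minimal sketch of the shape that the hierarchy result describes: one interface with two implementing classes. The `generate` member and its bodies are hypothetical, since the report does not show the real interface; only the three type names come from the run above.

```typescript
// Hypothetical member: the real InferenceBackend interface is not shown in this report.
interface InferenceBackend {
  generate(prompt: string): Promise<string>;
}

// Placeholder bodies; only the "implements" relationship matters here.
class WasmInference implements InferenceBackend {
  generate(prompt: string): Promise<string> {
    return Promise.resolve(`wasm:${prompt}`);
  }
}

class TauriInference implements InferenceBackend {
  generate(prompt: string): Promise<string> {
    return Promise.resolve(`tauri:${prompt}`);
  }
}

// The set a hierarchy query on InferenceBackend reports as implementations.
const implementations: InferenceBackend[] = [
  new WasmInference(),
  new TauriInference(),
];
console.log(implementations.length);
```

With text search, the same set has to be reassembled by reading each `implements InferenceBackend` match; that stays cheap for two classes and grows with every extra level.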

Task 6: external dependency symbol lookup

  • Targets used after indexing was available: BrowserWindow and serveNextAt in the desktop TypeScript app, plus Url and Zeroizing in rust/core.

  • Serena chain (TS):

    • find_declaration(new BrowserWindow(...), include_body=true) -> desktop/node_modules/electron/electron.d.ts, body class BrowserWindow extends Electron.BrowserWindow {}

    • find_declaration(import serveNextAt ... , include_body=true) -> desktop/node_modules/next-electron-server/index.d.ts, body declare function serveNextAt(uri: string, options?: Options): void;

    • find_symbol(..., search_deps=true) on those dependency files returned dependency-side docs.

  • Serena chain (Rust):

    • find_declaration(use reqwest::{Response, Url};, include_body=true) -> <ext:lib.rs|...> external symbol Url[0] with the struct body

    • find_declaration(use zeroize::Zeroizing;, include_body=true) -> <ext:lib.rs|...> external symbol Zeroizing[0] with the struct body

    • find_symbol(..., relative_path=<ext...>, search_deps=true) returned dependency-side docs for those external symbols.

  • Built-in equivalent chain:

    • Manually infer the correct monorepo-local dependency root (desktop/node_modules, not repo root),

    • or manually inspect Cargo metadata / Cargo.lock,

    • then open the resolved dependency files directly (for Rust, under the Cargo registry).

  • Payloads observed: Serena returned the declaration target and a small signature/body directly; built-ins required package-root discovery first, which is a real extra step in a monorepo.

  • Delta: Serena does add capability and efficiency here once indexing exists. The gain is larger in this monorepo than in a single-package repo because dependency ownership is split across package-local Node dependencies and shared Cargo registry sources.

Verdict: With indexing available, Serena added meaningful external dependency lookup, and the value was amplified by the monorepo layout.
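The manual "which package owns this dependency?" step on the TypeScript side can be sketched as a walk up the directory tree, which mirrors how Node resolves packages from `node_modules` folders. This is an illustration of the built-in chain only, not of how Serena resolves declarations. The function name `findPackageRoot` and the in-memory `existingDirs` set (a stand-in for the filesystem) are assumptions made for this example.

```typescript
// Walk up from a source file to the nearest ancestor containing node_modules/<pkg>.
// `existingDirs` stands in for the filesystem so the sketch runs without I/O.
function findPackageRoot(
  sourceFile: string,
  pkg: string,
  existingDirs: Set<string>,
): string | undefined {
  const parts = sourceFile.split("/").slice(0, -1); // drop the file name
  for (let depth = parts.length; depth >= 0; depth--) {
    const prefix = parts.slice(0, depth).join("/");
    const candidate =
      prefix === "" ? `node_modules/${pkg}` : `${prefix}/node_modules/${pkg}`;
    if (existingDirs.has(candidate)) return candidate;
  }
  return undefined; // not installed anywhere above this file
}

// Only the desktop package owns electron; the repo root does not.
const dirs = new Set<string>(["desktop/node_modules/electron"]);
const electronRoot = findPackageRoot("desktop/src/main.ts", "electron", dirs);
console.log(electronRoot);
```

In a single-package repo the first candidate is usually the answer; in a monorepo the walk has to start from the right package, which is the bookkeeping the semantic lookup removed.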

3.2 Single-file edits

Task 7a: small tweak (1-3 lines inside a method)

  • Change: rename local autoLaunch -> launcher inside AutoLauncher/toggleAutoLaunch.

  • Built-in chain: view(auto-launcher.ts, 20-40) -> apply_patch on 3 changed lines -> git diff.

  • Serena chain: find_symbol(toggleAutoLaunch, include_body=true) -> replace_symbol_body(toggleAutoLaunch) -> git diff.

  • Payloads observed:

    • Built-ins: read 21 lines, patch changed 3 logical lines.

    • Serena: fetched 10-line body, resent full 10-line body.

  • Delta: same result, same number of main steps, but the symbolic edit resent untouched lines.

Verdict: For tiny in-method tweaks, built-ins were more payload-efficient and Serena added no real workflow advantage.

Task 7b: medium rewrite (~10-30 lines)

  • Change: rewrite uniqueSavePath to use candidatePath + for loop.

  • Built-in chain: view(main.ts, 500-540) -> apply_patch replacing the function body -> git diff.

  • Serena chain: find_symbol(uniqueSavePath, include_body=true) -> replace_symbol_body(uniqueSavePath) -> git diff.

  • Payloads observed:

    • Built-ins: read 41 lines to safely anchor a ~10-line rewrite.

    • Serena: fetched the 11-line symbol body and resent the rewritten body only.

  • Delta: Here Serena was more efficient: less prerequisite read volume and no dependence on surrounding file context.

Verdict: Serena was better for medium symbol-sized rewrites.

Task 7c: large / whole-body rewrite

  • Change: rewrite the entire createMainWindow body.

  • Built-in chain: view(main.ts, 331-439) -> apply_patch replacing the function body -> git diff.

  • Serena chain: find_symbol(createMainWindow, include_body=true) -> replace_symbol_body(createMainWindow) -> git diff.

  • Payloads observed:

    • Built-ins: read ~109 lines and patched the whole function.

    • Serena: fetched the same symbol body and resent the whole rewritten body.

  • Delta: Serena still avoided a file-range read, but once the symbol itself dominates the payload, the token gap mostly disappears.

Verdict: For whole-body rewrites, Serena’s gain was modest: better addressing, not dramatically smaller payload.
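One way to read the three edit sizes together is as a resend overhead: lines sent in a whole-body replacement divided by lines that actually changed. The sketch below uses the body sizes reported in Tasks 7a-7c; the changed-line counts for 7b and 7c are approximations (the report says "~10" for 7b, and 7c is assumed to change every line), and the metric itself is an illustration rather than something measured in the run.

```typescript
// Lines sent in the edit call divided by lines that actually changed.
// 1.0 means no untouched lines were resent.
function resendOverhead(linesSent: number, linesChanged: number): number {
  if (linesChanged <= 0) throw new Error("linesChanged must be positive");
  return linesSent / linesChanged;
}

const cases = [
  { task: "7a toggleAutoLaunch", bodyLines: 10, changed: 3 },
  { task: "7b uniqueSavePath", bodyLines: 11, changed: 10 }, // "~10-line rewrite"
  { task: "7c createMainWindow", bodyLines: 109, changed: 109 }, // assumed: whole body changed
];

for (const c of cases) {
  console.log(c.task, resendOverhead(c.bodyLines, c.changed).toFixed(2));
}
```

The ratio is highest for the 3-line tweak and approaches 1 as the change covers more of the symbol, which matches the pattern across the three verdicts.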

Task 8: insert a new function at a structural location

  • Insertion: waitSeconds after wait in desktop/src/main/utils/common.ts.

  • Built-in chain: view(common.ts, 1-40) -> apply_patch inserting after the existing function.

  • Serena chain: find_symbol(wait) -> insert_after_symbol(wait).

  • Payloads observed:

    • Built-ins: read 26 lines to place a 1-line function.

    • Serena: no extra file-range read once the symbol name was known.

  • Delta: Serena made the location structural instead of textual.

Verdict: Serena improved insertions when the location is “after symbol X” rather than “after line Y”.

Task 9: rename a private helper used only within one file

  • Target symbol: openStreetMapUserAgent in desktop/src/main.ts.

  • Built-in chain: view/rg to find call site + definition -> apply_patch updating both.

  • Serena chain: rename(openStreetMapUserAgent -> buildOpenStreetMapUserAgent).

  • Payloads observed:

    • Built-ins: two textual sites had to be found and updated manually.

    • Serena: one rename call, terse success response ("Success").

  • Delta: small call-count improvement, larger correctness improvement when the file is bigger or the name is less unique.

Verdict: Serena was somewhat better even for single-file private renames because it removed manual site enumeration.

3.3 Multi-file changes

Task 10: rename a symbol across several files including imports

  • Target symbol: wait -> delay.

  • Built-in chain: read common.ts, main.ts, ffmpeg-worker.ts, ml-worker.ts -> one multi-file apply_patch -> git diff.

  • Serena chain: rename(wait -> delay) on the defining symbol -> git diff.

  • Payloads observed:

    • Built-ins: required reading 4 files and manually updating export, imports, and call sites.

    • Serena: one semantic rename updated the same 4 files.

  • Success signals: built-ins only showed success via resulting diff; Serena returned "Success".

  • Delta: this is one of Serena’s clearest wins: same final diff, far less manual scope work.

Verdict: Serena strongly improved multi-file renames by collapsing discovery + edit into one symbol-based refactor.

Task 11: move a symbol from one module to another, updating imports

  • Target: move nullToUndefined out of common.ts.

  • Serena chain executed: move(nullToUndefined, target_relative_path=http.ts) -> git diff.

  • Observed result: Serena removed the symbol from common.ts, updated ffmpeg-worker.ts, but created a new file desktop/src/main/utils/nullToUndefined.ts instead of merging into http.ts.

  • Built-in equivalent: would require manually copying the symbol into the intended target module, updating imports, then deleting the old definition.

  • Delta: Serena still automated the cross-file update, but did not provide the specific “move into existing module” behavior I was testing.

Verdict: Serena added partial value for symbol moves here, but not the full capability of “move into a chosen existing TS file”.

Task 12: move a file/package and update imports

  • Target: move desktop/src/main/utils/http.ts to desktop/src/main/services/http.ts.

  • Serena chain: move(file http.ts -> services/) -> git diff.

  • Observed result: the file was renamed/moved and ffmpeg-worker.ts import updated from ../utils/http to ./http.

  • Built-in equivalent: locate all imports, move the file, patch each import path, then verify.

  • Delta: this was a real one-call semantic file move.

Verdict: Serena materially improved file moves that require import updates.
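The import rewrite observed above (`../utils/http` becoming `./http`) is the relative specifier recomputed from the importer's directory to the file's new location. A minimal sketch of that computation follows, for POSIX-style paths and `.ts` files only. The helper name `relativeImport` is hypothetical, and the importer path `desktop/src/main/services/ffmpeg-worker.ts` is inferred from the observed specifiers rather than stated in the report.

```typescript
// Compute the relative import specifier from one file to another.
// Handles POSIX-style paths and strips a trailing ".ts"; nothing more.
function relativeImport(fromFile: string, toFile: string): string {
  const fromDir = fromFile.split("/").slice(0, -1);
  const to = toFile.replace(/\.ts$/, "").split("/");
  // Length of the shared directory prefix.
  let shared = 0;
  while (
    shared < fromDir.length &&
    shared < to.length - 1 &&
    fromDir[shared] === to[shared]
  ) {
    shared++;
  }
  const ups = fromDir.length - shared;
  const rest = to.slice(shared).join("/");
  return ups === 0 ? `./${rest}` : "../".repeat(ups) + rest;
}

const importer = "desktop/src/main/services/ffmpeg-worker.ts"; // inferred location
const before = relativeImport(importer, "desktop/src/main/utils/http.ts");
const after = relativeImport(importer, "desktop/src/main/services/http.ts");
console.log(before, "->", after);
```

Doing this by hand means repeating the calculation for every importing file, which is the step the one-call move absorbed.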

Task 12 (safe delete with no remaining usages)

  • Attempted: searched for naturally unused TS symbols in the working areas (main.ts, common.ts, temp.ts, inference.ts) and checked several candidates (registerForEnteLinks, minimumWindowSize, AutoLauncher/isEnabled, openStreetMapUserAgent, safeJson, buildSamplingConfig).

  • Observed result: every plausible candidate still had live references.

  • Outcome: no suitable candidate found in the TS areas where Serena was operational, so I skipped this comparison instead of forcing an invalid input.

Verdict: No evidence either way here because the repo did not offer a clean unused-symbol candidate in the code areas Serena handled reliably.

Task 13: delete a symbol and propagate deletion to call sites

  • Attempted: looked for a helper whose call sites could be semantically removed rather than inlined or manually rewritten.

  • Observed result: the good candidates in this repo were better modeled as inline refactors, not delete-with-propagation.

  • Outcome: no suitable candidate; skipped rather than using an unsafe input.

Verdict: No measured delta here because the available candidates were inline candidates, not safe propagate-delete candidates.

Task 13 (inline a small helper)

  • Target symbol: waitForRendererDevServer.

  • Built-in chain: view call site + definition -> apply_patch replacing await waitForRendererDevServer() with await wait(1000) and deleting the helper -> git diff.

  • Serena chain: inline(waitForRendererDevServer, keep_definition=false) -> git diff.

  • Observed result: both achieved the inline, but Serena also rewrote unrelated import formatting at the top of the file.

  • Success signals: built-ins: resulting diff; Serena: {"status":"SUCCESS"}.

  • Delta: Serena added the unique semantic refactor, but in this run it also introduced format churn outside the logical change.

Verdict: Serena added real inline capability, with a low-frequency but real tradeoff of broader formatting churn.
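For readers unfamiliar with the refactor, the logical change can be sketched as a before/after pair. The bodies below are assumptions reconstructed from the replacement described in this task (`await waitForRendererDevServer()` becoming `await wait(1000)`); the real definitions in main.ts are not reproduced in this report, and the names `startBeforeInline` and `startAfterInline` are hypothetical call sites.

```typescript
// Assumed shape of the shared helper: resolves after `ms` milliseconds.
const wait = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Before: a thin helper plus a call site that goes through it.
const waitForRendererDevServer = (): Promise<void> => wait(1000);
const startBeforeInline = async (): Promise<void> => {
  await waitForRendererDevServer();
};

// After: the call site uses the helper's body directly,
// and the helper definition is deleted (keep_definition=false).
const startAfterInline = async (): Promise<void> => {
  await wait(1000);
};
```

The logical diff is these two edits; anything else in the diff, such as the reformatted imports noted above, is churn to review separately.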

3.4 Reliability & correctness-oriented checks

Task 14: scope precision

  • Demonstrated with: AutoLauncher/toggleAutoLaunch, openStreetMapUserAgent, and InferenceBackend.

  • Serena: symbol names and name paths targeted the exact code entity.

  • Built-ins: text search for names such as wait or writeToTemporaryFile over-matched comments, docs, and multiple textual occurrences.

  • Delta: Serena’s unit of work was the symbol; built-ins’ unit was the matching line.

Verdict: Serena was reliably more precise whenever the target was a symbol rather than a string.
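The over-matching is easy to reproduce on a toy corpus. In the sketch below the four sample lines are invented for illustration, and `isCodeUse` is a crude heuristic standing in for a real semantic reference search; a language server resolves symbols rather than pattern-matching text.

```typescript
// Invented sample lines: two code uses, one comment, one prose sentence.
const sampleLines: string[] = [
  "const wait = (ms: number) => new Promise((r) => setTimeout(r, ms));",
  "await wait(1000);",
  "// wait for the renderer dev server",
  "Please wait while the download completes.",
];

// Text search: every line containing the word matches.
const textHits = sampleLines.filter((line) => /\bwait\b/.test(line));

// Heuristic stand-in for "code uses only": a call or an assignment, not a comment.
const isCodeUse = (line: string): boolean =>
  !line.trim().startsWith("//") && /\bwait\s*(\(|=)/.test(line);
const codeHits = sampleLines.filter(isCodeUse);

console.log(textHits.length, codeHits.length);
```

Here the text search returns all four lines while only two are code uses, the same kind of gap as the 7-file versus 3-file result reported for `wait`.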

Task 15: atomicity

  • Observed: Serena rename/file-move/inline each ran as one refactor operation after symbol selection.

  • Built-ins: a single apply_patch can update multiple files atomically as a patch, but it cannot discover missed sites; semantic completeness remains manual.

  • Delta: Serena’s advantage was not transactional all-or-none patching; it was scope computation.

Verdict: Serena improved semantic completeness more than patch atomicity.

Task 16: success signals

  • Observed Serena success outputs: OK for body replacement, "Success" for rename, JSON result for move, {"status":"SUCCESS"} for inline.

  • Observed built-in success outputs: only indirect evidence via git diff / clean revert.

Verdict: Serena gave clearer machine-readable success signals for refactors than the built-ins did.
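Because the observed success outputs differ by tool, an agent that wants one boolean has to normalize them. The helper below is a hypothetical sketch covering only the three literal forms quoted above (`OK`, `"Success"`, and `{"status":"SUCCESS"}`); it is not part of Serena, and the shape of the JSON result for move is not shown in this report, so that case is deliberately not handled.

```typescript
// Normalize the success outputs observed in this session into a boolean.
function isSerenaSuccess(raw: string): boolean {
  const text = raw.trim();
  // Plain-text forms: OK, Success (optionally wrapped in double quotes).
  const unquoted = text.replace(/^"|"$/g, "");
  if (/^(ok|success)$/i.test(unquoted)) return true;
  // JSON form observed for inline: {"status":"SUCCESS"}.
  try {
    const parsed: unknown = JSON.parse(text);
    if (typeof parsed === "object" && parsed !== null && "status" in parsed) {
      const status = (parsed as { status: unknown }).status;
      return String(status).toUpperCase() === "SUCCESS";
    }
  } catch {
    // Not JSON; treated as not-a-success below.
  }
  return false;
}

console.log(isSerenaSuccess('{"status":"SUCCESS"}'));
```

The built-in path has no equivalent input to normalize; success there is inferred from the resulting `git diff`.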

4. Token-efficiency analysis

By edit size

| Edit size | Built-ins | Serena | More efficient |
| --- | --- | --- | --- |
| Small tweak (toggleAutoLaunch) | Read ~21 lines, patch only changed lines | Fetch 10-line body, resend full 10-line body | Built-ins |
| Medium rewrite (uniqueSavePath) | Read ~41 lines to safely patch ~10 lines | Fetch 11-line body, resend 11-line body | Serena |
| Large rewrite (createMainWindow) | Read ~109 lines, patch whole body | Fetch ~same symbol body, resend whole body | Near tie, slight Serena advantage only from structural targeting |
| Cross-file rename (wait -> delay) | Read 4 files, craft 4-file patch | One rename after discovery | Serena by a large margin |

Forced reads

  • Built-ins often needed a localization read before the edit.

  • Serena avoided that when the symbol was already known, but not when the task itself required understanding the body.

Stable vs ephemeral addressing

  • Serena’s addresses (createMainWindow, wait, openStreetMapUserAgent) stayed useful across later operations.

  • Built-in line ranges from view became stale after edits, so later text operations required re-grepping or re-viewing.
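The staleness problem can be shown on a four-line toy file. In this sketch the file contents are invented, and `findLine` is a simple string-prefix lookup standing in for symbolic addressing; it only illustrates why a remembered line number stops being valid after an earlier edit while a name does not.

```typescript
// Invented file contents, one string per line.
const originalFile: string[] = [
  "const a = 1;",
  "function createMainWindow() {",
  "  return a;",
  "}",
];

// Stand-in for name-based addressing: locate a function header by name.
const findLine = (lines: string[], name: string): number =>
  lines.findIndex((line) => line.startsWith(`function ${name}(`));

// A line-based tool remembers where the function was.
const rememberedLine = findLine(originalFile, "createMainWindow");

// An earlier edit inserts one line at the top, shifting everything down.
const editedFile = ["// new header comment", ...originalFile];

const staleHit = editedFile[rememberedLine]; // no longer the function header
const freshLine = findLine(editedFile, "createMainWindow"); // name still resolves

console.log(staleHit, freshLine);
```

After the insertion the remembered index points at `const a = 1;`, so a line-based edit would need a fresh read first, while the name-based lookup lands on the header again.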

Verdict: Serena is most token-efficient for medium-to-large symbol work and cross-file refactors; built-ins stay leaner for tiny local edits.

5. Reliability & correctness (under correct use)

  • Precision of matching: Serena’s reference search answered “who uses this in code?” better than rg, which mixed real uses with prose/comment matches.

  • Scope disambiguation: Serena targeted exact symbols (AutoLauncher/toggleAutoLaunch, InferenceBackend) rather than relying on unique text strings.

  • Atomicity: Serena computed and updated semantic scope in one refactor call; built-ins could batch edits, but only after manual scope discovery.

  • Semantic queries vs text search: hierarchy and references were the strongest examples. Built-ins could reconstruct them, but only with manual interpretation.

  • External dependencies: after indexing was available, Serena resolved desktop TypeScript dependencies into package-local declaration files under desktop/node_modules and Rust dependencies into external Cargo sources such as url and zeroize. Built-ins could still reach those files, but only after manual package-root or registry-path discovery.

  • Monorepo effect: this repo magnified Serena’s dependency-lookup value because “the dependency source” was not at one obvious global root. Serena jumped from app code to the right package-local or registry-backed dependency context directly.

Verdict: Serena improved correctness by narrowing work to exact symbols and by resolving dependencies across monorepo boundaries.

6. Workflow effects across a session

  • Advantages compounded when I stayed in symbol space. Example: get_symbols_overview(main.ts) produced symbol names that I later reused for find_symbol(createMainWindow), rename(openStreetMapUserAgent), and inline(waitForRendererDevServer).

  • Built-in workflows required refreshes. Across repeated main.ts experiments, I repeatedly had to reacquire ranges with view/rg before editing because prior line-based context was no longer trustworthy.

  • In the monorepo, Serena also compounded by removing package-boundary bookkeeping. In the desktop app I could jump from main.ts into Electron and next-electron-server declarations without first reasoning about workspace roots; in rust/core I could jump into Cargo-registry dependencies through external symbol handles instead of manually reconstructing registry paths from Cargo.lock.

  • The compounding effect disappeared for tiny edits and non-code work, where built-ins were already direct and minimal.

  • One tradeoff compounded too: some Serena refactors carried formatting side effects (notably inline), so the semantic benefit does not guarantee a surgically small diff.

Verdict: Serena’s advantages compound most in code-centric monorepo sessions, where symbol reuse and dependency jumps save both re-reading and package-root discovery work.

7. Unique capabilities

| Capability with no practical one-step built-in equivalent | Frequency | Impact |
| --- | --- | --- |
| Semantic cross-file rename from a single symbol definition | Medium | High |
| Type hierarchy query (implementations / supertypes) | Low-Medium | Medium |
| Inline refactor across call sites | Low | High when applicable |
| File move with import updates | Low-Medium | High |
| External dependency resolution from in-repo code into package-local or registry-backed sources | Medium | Medium-High |

Built-ins can approximate all of these manually, but not as a single semantic operation.

Verdict: Serena did add unique practical capabilities, especially around refactors that require scope computation rather than text replacement.

8. Tasks outside Serena’s scope (built-in only)

  • Reading non-code files like desktop/package.json

  • Free-text search such as ente://app or URL strings

  • Git inspection / diff / cleanup

  • Config/package/changelog/docs/notebook reading

  • Exact textual patching once the line range is already known

In this session, these built-in-only tasks were roughly 40% of the total operational steps by count, but they were usually the low-complexity steps around the more valuable semantic work.

Verdict: A substantial share of everyday terminal work remains built-in-only, but Serena targets the higher-value symbol-heavy slice rather than the whole session.

9. Practical usage rule

Use Serena first when the task is about a code symbol and especially when it spans multiple files, references, or a whole symbol body. Use built-ins first when the task is about text, config/docs, free-text search, git state, or a 1-3 line local tweak. The highest-yield mixed workflow in this repo was: discover/refactor with Serena, inspect non-code and do tiny patches with built-ins.
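The rule above can be written down as a small decision function. This is a sketch of the heuristic as stated, not a tool that exists in either toolset; the `Task` fields and the name `chooseTool` are hypothetical, and the 3-line threshold is taken from the "1-3 line local tweak" wording.

```typescript
type Tool = "serena" | "built-ins";

interface Task {
  targetsCodeSymbol: boolean; // false for config, docs, git state, free-text search
  filesTouched: number;
  linesChanged: number;
  rewritesWholeSymbol: boolean;
}

function chooseTool(task: Task): Tool {
  // Text-first work stays with the built-ins.
  if (!task.targetsCodeSymbol) return "built-ins";
  // Multi-file or whole-symbol work is where the semantic layer pays off.
  if (task.filesTouched > 1 || task.rewritesWholeSymbol) return "serena";
  // Tiny local tweaks: a line patch sends less than a full body replacement.
  if (task.linesChanged <= 3) return "built-ins";
  return "serena";
}

// Example: the wait -> delay rename touched 4 files.
console.log(
  chooseTool({
    targetsCodeSymbol: true,
    filesTouched: 4,
    linesChanged: 4,
    rewritesWholeSymbol: false,
  }),
);
```

The ordering of the checks encodes the priorities: scope first, then span, then edit size.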

Verdict: Choose Serena for symbol semantics and built-ins for text locality.