edit tool corrupts non-UTF-8 bytes #3732

@akashlal

Describe the bug

Summary

The edit tool silently corrupts files that contain bytes which are valid in
legacy single-byte codepages (e.g. CP1252) but invalid as UTF-8. The tool reads
the file as UTF-8, replaces each undecodable byte with the Unicode replacement
character U+FFFD (UTF-8 bytes EF BF BD), then re-encodes the whole file as UTF-8
when writing. The corruption affects bytes the user never intended to edit, and
the resulting change passes git apply cleanly, so it ships unnoticed.
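
The round trip described above can be reproduced in a few lines. This is a minimal Python sketch of the suspected mechanism, not the tool's actual code; the variable names are illustrative.

```python
# Sketch of the suspected lossy round trip (an assumption about the
# mechanism, not the edit tool's real implementation).
original = "// Copyright © Microsoft.\r\n".encode("cp1252")

# Decoding as UTF-8 with replacement turns the lone 0xA9 byte into U+FFFD.
text = original.decode("utf-8", errors="replace")

# Re-encoding as UTF-8 writes U+FFFD as the three bytes EF BF BD.
written = text.encode("utf-8")

print(original.hex(" "))  # contains "a9" at offset 13
print(written.hex(" "))   # contains "ef bf bd" starting at offset 13
```

The output file grows by two bytes per damaged character, and nothing in `written` records what the original byte was, so the damage cannot be undone after the fact.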

Environment

  • Copilot CLI version: 1.0.60
  • Model: Claude Opus 4.7
  • OS: Windows 11 / PowerShell 7
  • Reproduced: 2026-06-09

Affected version

GitHub Copilot CLI 1.0.60

Steps to reproduce the behavior

  1. Create a file containing exactly one non-ASCII byte (0xA9, the © glyph
    in CP1252):

    ```powershell
    $bytes = [System.Text.Encoding]::GetEncoding(1252).GetBytes("// Copyright © Microsoft. all rights reserved.`r`nint main() { return 0; }`r`n")
    [System.IO.File]::WriteAllBytes("sample.cpp", $bytes)
    ```

  2. Verify that the byte at offset 13 (zero-based) is 0xA9:

    … 43 6F 70 79 72 69 67 68 74 20 A9 20 4D 69 63 …

  3. Ask the agent to perform any edit on the file that does not touch the
    copyright line, e.g. "Capitalize the first letter of each sentence in
    sample.cpp."
  4. Re-inspect the byte at offset 13:

    … 43 6F 70 79 72 69 67 68 74 20 EF BF BD 20 4D 69 63 …

The single A9 byte is now EF BF BD (U+FFFD "REPLACEMENT CHARACTER").
The character © is gone; git diff shows a spurious modification on a
line the agent was never asked to touch.
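
For anyone checking the bytes outside PowerShell, the inspection can be scripted. The sketch below is a hypothetical helper, not part of any tool: `hex_window` and `count_replacement_sequences` are names made up for illustration, and the sample bytes are rebuilt in memory instead of being read from `sample.cpp`.

```python
def hex_window(data: bytes, offset: int, before: int = 10, after: int = 5) -> str:
    """Return an uppercase hex dump of the bytes surrounding `offset`."""
    start = max(0, offset - before)
    return data[start:offset + after].hex(" ").upper()


def count_replacement_sequences(data: bytes) -> int:
    """Count occurrences of EF BF BD, the UTF-8 encoding of U+FFFD."""
    return data.count(b"\xef\xbf\xbd")


# Same content as the repro file, built in memory for the example.
sample = (
    "// Copyright © Microsoft. all rights reserved.\r\n"
    "int main() { return 0; }\r\n"
).encode("cp1252")

print(hex_window(sample, 13))               # 43 6F 70 79 72 69 67 68 74 20 A9 20 4D 69 63
print(count_replacement_sequences(sample))  # 0
```

To run it against the real file, replace `sample` with `pathlib.Path("sample.cpp").read_bytes()`. A non-zero count after the edit, where the count was zero before, indicates the corruption.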

Expected behavior

The edit tool should either:

  • (preferred) preserve the original file's byte-level encoding — detect
    the source encoding once, decode/encode round-trip-cleanly, and never emit
    U+FFFD for bytes that were valid in the source; or
  • (fallback) refuse to write the file and surface a clear error when it
    would introduce a U+FFFD character (bytes EF BF BD) that did not exist
    in the input.

In either case the tool must never silently replace bytes outside the
diff hunk the model authored.
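
Both options can be sketched briefly. The following Python is one possible shape under stated assumptions, not a proposal for the tool's actual implementation: the function names are hypothetical, and the detection order (strict UTF-8, then CP1252, then Latin-1 as a lossless last resort) is a simple heuristic chosen for illustration.

```python
from typing import Callable

FFFD_UTF8 = b"\xef\xbf\xbd"  # UTF-8 encoding of U+FFFD


def decode_preserving(data: bytes) -> tuple[str, str]:
    """Return (text, encoding) such that text.encode(encoding) == data."""
    for encoding in ("utf-8", "cp1252"):
        try:
            return data.decode(encoding), encoding  # strict: raises on bad bytes
        except UnicodeDecodeError:
            continue
    # Latin-1 maps every byte 0x00-0xFF to one code point, so it never fails.
    return data.decode("latin-1"), "latin-1"


def edit_bytes(data: bytes, edit: Callable[[str], str]) -> bytes:
    """Preferred option: apply a text edit and re-encode in the source encoding.

    Raises UnicodeEncodeError if the edit adds characters that the source
    encoding cannot represent.
    """
    text, encoding = decode_preserving(data)
    return edit(text).encode(encoding)


def refuse_new_replacements(original: bytes, candidate: bytes) -> bytes:
    """Fallback option: reject output that gained U+FFFD sequences."""
    if candidate.count(FFFD_UTF8) > original.count(FFFD_UTF8):
        raise ValueError("write would introduce U+FFFD not present in the input")
    return candidate


original = "// Copyright © Microsoft. all rights reserved.\r\n".encode("cp1252")
edited = edit_bytes(original, lambda t: t.replace("all rights", "All rights"))
print(hex(edited[13]))  # 0xa9: the copyright byte survives the edit
```

The guard is deliberately independent of the decoder, so it can wrap an existing write path even if encoding detection is never added.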

Additional context

Actual behavior

The file is round-tripped through String / UTF-8 decode-encode. Every byte
in the input that is not a valid UTF-8 sequence is replaced with EF BF BD.
The replacement happens before the diff/patch logic sees the file, so it is
invisible to the model and to any patch-level review.
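
The invisibility can be illustrated by comparing a text-level diff with a byte-level comparison. This Python sketch assumes the decode happens before patching, as described above; the edit to `return 0;` is a stand-in for whatever change the model makes.

```python
import difflib

original = (
    "// Copyright © Microsoft. all rights reserved.\r\n"
    "int main() { return 0; }\r\n"
).encode("cp1252")

# What patch logic would see if the file were lossily decoded first.
seen = original.decode("utf-8", errors="replace")

# A stand-in model edit that only touches the second line.
patched = seen.replace("return 0;", "return 1;")
written = patched.encode("utf-8")

# Text-level diff: keep only added/removed lines, skipping the file headers.
text_changes = [
    line
    for line in difflib.unified_diff(seen.splitlines(), patched.splitlines(), lineterm="")
    if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
]

# Byte-level comparison against what was actually on disk.
byte_changed = [
    i
    for i, (a, b) in enumerate(zip(original.splitlines(), written.splitlines()))
    if a != b
]

print(text_changes)  # only the "int main" line appears
print(byte_changed)  # [0, 1]: the copyright line changed on disk as well
```

The text-level diff is computed between two strings that both already contain U+FFFD, so the copyright line compares equal there even though its bytes on disk differ.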

Labels

area:tools (Built-in tools: file editing, shell, search, LSP, git, and tool call behavior)
