Describe the bug
Summary
The edit tool silently corrupts files that contain bytes which are valid in
legacy single-byte codepages (e.g. CP1252) but invalid as UTF-8. The tool reads
the file as UTF-8, replaces each undecodable byte with the Unicode replacement
character U+FFFD (EF BF BD), then re-encodes the whole file as UTF-8
when writing. The corruption affects bytes the user never intended to edit and
passes git apply cleanly, so it ships unnoticed.
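The mechanism described above can be reproduced outside the tool. The following is a minimal sketch in Python, used here purely for illustration; it assumes the tool behaves like a lenient UTF-8 decode followed by a UTF-8 encode, and it is not the tool's actual code.

```python
# Illustrative only: mimics a lossy UTF-8 read/write cycle, not the tool's real code.
original = "// Copyright © Microsoft.\r\n".encode("cp1252")

# 0xA9 cannot start a UTF-8 sequence, so a lenient decoder substitutes U+FFFD.
text = original.decode("utf-8", errors="replace")

# Re-encoding writes U+FFFD as the three bytes EF BF BD.
written = text.encode("utf-8")

print(original.hex(" "))
print(written.hex(" "))
```

The second hex dump is two bytes longer than the first, because the single A9 byte has become a three-byte sequence.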
Environment
- Copilot CLI version: 1.0.60
- Model: Claude Opus 4.7
- OS: Windows 11 / PowerShell 7
- Reproduced: 2026-06-09
Affected version
GitHub Copilot CLI 1.0.60
Steps to reproduce the behavior
- Create a file containing exactly one CP1252 byte (0xA9, the © glyph):
  ```powershell
  # Encode the source text as CP1252 so that © is stored as the single byte 0xA9.
  $bytes = [System.Text.Encoding]::GetEncoding(1252).GetBytes("// Copyright © Microsoft. all rights reserved.`r`nint main() { return 0; }`r`n")
  [System.IO.File]::WriteAllBytes("sample.cpp", $bytes)
  ```
  Verify that the byte at zero-based offset 13 is 0xA9:
  … 43 6F 70 79 72 69 67 68 74 20 A9 20 4D 69 63 …
- Ask the agent to perform any edit on the file that does not touch the
© character, e.g. "Capitalize the first letter of each sentence in
sample.cpp."
- Re-inspect the byte at offset 13:
… 43 6F 70 79 72 69 67 68 74 20 EF BF BD 20 4D 69 63 …
The single A9 byte is now EF BF BD (U+FFFD "REPLACEMENT CHARACTER").
The character © is gone; git diff shows a spurious modification to a
character the agent was never asked to touch.
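The manual hex inspection in the steps above can also be scripted. This is a rough sketch in Python; `replacement_offsets` is a hypothetical helper name, and the two byte strings are stand-ins for the first line of sample.cpp before and after the edit.

```python
REPLACEMENT = b"\xef\xbf\xbd"  # UTF-8 encoding of U+FFFD


def replacement_offsets(data: bytes) -> list[int]:
    """Return the zero-based offset of every EF BF BD sequence in data."""
    offsets = []
    start = data.find(REPLACEMENT)
    while start != -1:
        offsets.append(start)
        start = data.find(REPLACEMENT, start + len(REPLACEMENT))
    return offsets


# Stand-ins for the first line of sample.cpp before and after the agent's edit.
before = b"// Copyright \xa9 Microsoft. all rights reserved.\r\n"
after = b"// Copyright \xef\xbf\xbd Microsoft. All rights reserved.\r\n"

print(before[13:14].hex())          # the byte at offset 13 before the edit
print(replacement_offsets(before))  # offsets of U+FFFD in the original
print(replacement_offsets(after))   # offsets of U+FFFD after the edit
```

To run the same check against a real file, the bytes could be read with `pathlib.Path("sample.cpp").read_bytes()` and passed to the helper.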
Expected behavior
The edit tool should either:
- (preferred) preserve the original file's byte-level encoding — detect
the source encoding once, decode/encode round-trip-cleanly, and never emit
U+FFFD for bytes that were valid in the source; or
- (fallback) refuse to write the file and surface a clear error when it
would introduce a U+FFFD character that did not exist in the input.
In either case the tool must never silently replace bytes outside the
diff hunk the model authored.
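Both options can be sketched together. The Python below is one possible shape under stated assumptions, not a proposal for the tool's actual implementation: `detect_encoding` and `edit_bytes` are hypothetical names, and the "strict UTF-8, else CP1252" heuristic is a deliberate simplification (real detection would also need to consider BOMs and other codepages).

```python
from typing import Callable


def detect_encoding(data: bytes) -> str:
    """Hypothetical heuristic: strict UTF-8 first, then fall back to CP1252."""
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "cp1252"


def edit_bytes(data: bytes, transform: Callable[[str], str]) -> bytes:
    """Apply transform to the decoded text and re-encode in the source encoding."""
    encoding = detect_encoding(data)
    text = data.decode(encoding)  # strict decode: raises instead of substituting
    result = transform(text)
    # Fallback guard: refuse to write if the edit introduced new U+FFFD characters.
    if result.count("\ufffd") > text.count("\ufffd"):
        raise ValueError("edit would introduce U+FFFD; refusing to write")
    return result.encode(encoding)  # strict encode: raises on unencodable text


source = "// Copyright © Microsoft. all rights reserved.\r\n".encode("cp1252")
edited = edit_bytes(source, lambda t: t.replace("all rights", "All rights"))
print(edited.hex(" "))
```

Because decoding and encoding are both strict, any byte or character that cannot round-trip raises an error rather than being silently replaced, and the A9 byte at offset 13 is written back unchanged.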
Additional context
Actual behavior
The file is round-tripped through String / UTF-8 decode-encode. Every byte
in the input that is not part of a valid UTF-8 sequence is replaced with EF BF BD.
The replacement happens before the diff/patch logic sees the file, so it is
invisible to the model and to any patch-level review.
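Why patch-level review misses the change can be illustrated with a short Python sketch. It assumes, as described above, that the file is decoded leniently before any diff is computed; the two-line file content and the "return 0" to "return 1" edit are made up for the example.

```python
import difflib

on_disk = b"// Copyright \xa9 Microsoft.\r\nint main() { return 0; }\r\n"

# Assumed behaviour: the file is decoded leniently before any diffing happens.
before_text = on_disk.decode("utf-8", errors="replace")
after_text = before_text.replace("return 0", "return 1")  # the intended edit

# A text-level diff compares two strings that both already contain U+FFFD,
# so the corrupted first line looks unchanged.
diff = [
    line
    for line in difflib.unified_diff(
        before_text.splitlines(), after_text.splitlines(), lineterm="", n=0
    )
    if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
]
print(diff)

# A byte-level comparison against the original file catches the extra change.
written = after_text.encode("utf-8")
first_line_changed = written.split(b"\r\n")[0] != on_disk.split(b"\r\n")[0]
print(first_line_changed)
```

The text-level diff lists only the `main()` line, while the byte-level comparison shows that the first line on disk has changed as well.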