Replace docx2txt with the Rust docx2txt-rs for 64-bit installers #708

Open
dscho wants to merge 4 commits into git-for-windows:main from dscho:docx2txt-rs

dscho commented Jun 5, 2026

Replaces the Perl docx2txt MSYS package with the Rust mingw-w64-docx2txt-rs rewrite (https://packages.msys2.org/packages/mingw-w64-x86_64-docx2txt-rs, source at https://github.com/dscho/docx2txt-rs) for the 64-bit Git for Windows flavors. The new binary produces byte-identical output to the original on every fixture and drops the Perl dependency entirely.

To this end, the docx2txt-rs package is added as a dependency of git-extra; this will cause the (64-bit) Git for Windows SDKs to pull in that package. (32-bit does not matter: we do not include docx2txt in MinGit, and a 32-bit docx2txt-rs would not be available anyway.)

dscho added 4 commits June 5, 2026 14:39
The Perl-based `docx2txt` package has been the SDK's `.docx` textconv
helper since Git for Windows began shipping it, but the only consumer
that actually needs it is `astextplain`. The new
`mingw-w64-docx2txt-rs` package
(https://packages.msys2.org/packages/mingw-w64-x86_64-docx2txt-rs) is a
small Rust rewrite that produces byte-identical output to the original
on every fixture, drops the Perl dependency, and is published for
mingw64, ucrt64, clang64 and clangarm64 per the upstream
`mingw_arch=('mingw64' 'ucrt64' 'clang64' 'clangarm64')` declaration
at https://github.com/msys2/MINGW-packages/blob/master/mingw-w64-docx2txt-rs/PKGBUILD.

Declare it as a dependency of the two 64-bit `git-extra` variants we
actually ship, so that the next nightly sync in git-sdk-64 and
git-sdk-arm64 pulls it in automatically, without the SDK repositories
needing their own targeted PRs. The bare `git-extra` MSYS variant and
`package_mingw-w64-i686-git-extra` are deliberately left untouched:
there is no `mingw-w64-i686-docx2txt-rs` upstream and 32-bit
installers are no longer built.

The follow-up commit teaches `astextplain` to prefer the new binary,
falling back to `docx2txt.pl` so the i686 and bare-MSYS variants of
git-extra (which do not gain the new dependency) keep working as long
as the legacy `docx2txt` package is still installed.

`pkgrel` is not bumped manually because the `pkgver()` function in
this PKGBUILD already derives the version from `git rev-list --count`
over the `git-extra/` directory (excluding `git-extra.install`), so
this commit alone will move the auto-derived `pkgver`.
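
To make the counting concrete, here is a minimal sketch of a path-limited `git rev-list --count` with one file excluded. The function name `count_commits` and its arguments are hypothetical; the real `pkgver()` in the PKGBUILD may differ in its details.

```shell
# Sketch only: count the commits that touched a directory while ignoring
# commits that touched nothing but one excluded file inside it.
# count_commits is a hypothetical helper, not taken from the PKGBUILD.
count_commits() {
    # $1: repository root, $2: directory to count, $3: file to exclude
    git -C "$1" rev-list --count HEAD -- "$2" ":(exclude)$3"
}
```

Called as `count_commits . git-extra git-extra/git-extra.install`, this would report how many commits feed the derived version under these assumptions.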

Assisted-by: Opus 4.7
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

The previous commit added a hard dependency on
`${MINGW_PACKAGE_PREFIX}-docx2txt-rs` for the 64-bit `git-extra`
variants, which means `docx2txt.exe` (the Rust binary, named after the
`docx2txt` Cargo package per
https://github.com/dscho/docx2txt-rs/blob/main/Cargo.toml) is
guaranteed to be on `$PATH` for every modern Git for Windows SDK.

Teach the `.docx` branch of `astextplain` to call it. The new CLI per
https://github.com/dscho/docx2txt-rs/blob/main/README.md reads
exclusively from stdin and writes to stdout (no filename argument and
no `-` sentinel), so the invocation is just `docx2txt.exe <"$1"`.

The single line

	docx2txt.exe <"$1" || docx2txt.pl "$1" - || cat "$1"

uses a layered fallback rather than an `if command -v` guard,
matching the style of the other case branches in this script (e.g.
`odt2txt "$1" || cat "$1"`,
`out=$(antiword -m UTF-8 "$1") && sed ... || cat "$1"`) which all let
a missing helper fail naturally into the next fallback. On a modern
64-bit SDK the first leg always succeeds; on the legacy i686 and
bare-MSYS `git-extra` variants (which do not gain the dependency in
the previous commit) the script falls through to the Perl shim if it
is still installed, and finally `cat`s the raw `.docx` only when both
helpers are missing, exactly matching the prior
`docx2txt.pl ... || cat "$1"` semantics.
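
The fall-through behavior can be tried in isolation with a small sketch. `rust_helper_stub` and `perl_helper_stub` are hypothetical names standing in for `docx2txt.exe` and `docx2txt.pl`; neither is expected to exist, so the chain ends at `cat`. The `2>/dev/null` redirections are added here only to keep the demonstration quiet and are not part of the line in `astextplain`.

```shell
# Sketch of a layered fallback: try each converter in turn, end with cat.
# Both *_stub commands are hypothetical and assumed to be missing, so the
# shell reports exit status 127 for each and moves on to the next leg.
to_text() {
    rust_helper_stub <"$1" 2>/dev/null ||
    perl_helper_stub "$1" - 2>/dev/null ||
    cat "$1"
}
```

Note that `||` reacts to any non-zero exit status, not only to a missing command, so a helper that fails after writing partial output would be followed by the output of the next leg.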

The `.exe` suffix is spelled out explicitly so that the lookup never
resolves to the old `/usr/bin/docx2txt` shell wrapper from the legacy
package, whose CLI is completely different (takes a filename, writes
`filename.txt`, no stdin/stdout interface).

`sha256sums[18]` for `astextplain` is updated to match the new file
contents; without this, `makepkg` would refuse to build the package
with an integrity-check failure.
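
As an illustration of the kind of comparison involved, the sketch below checks a file against an expected SHA-256 digest. `verify_sha256` is a hypothetical helper written for this example; it is not how `makepkg` is implemented.

```shell
# Sketch: compare a file's SHA-256 digest with an expected value, similar
# in spirit to the integrity check makepkg runs against sha256sums entries.
# verify_sha256 is a hypothetical helper name.
verify_sha256() {
    # $1: file to check, $2: expected lowercase hex digest
    actual=$(sha256sum "$1" | cut -d' ' -f1)
    [ "$actual" = "$2" ]
}
```

In practice the new digest can be printed with `sha256sum astextplain` or regenerated with `makepkg -g` or `updpkgsums`; which of these was used for this commit is not stated.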

Assisted-by: Opus 4.7
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

With the previous commits making the 64-bit `git-extra` variants
depend on `${MINGW_PACKAGE_PREFIX}-docx2txt-rs` and teaching
`astextplain` to use it, `pactree -u mingw-w64-$PACMAN_ARCH-git-extra`
now pulls the new Rust package into the installer's file enumeration
transitively, with no need to mention it by name.

The legacy `docx2txt` token in the `required=` preflight loop and in
the non-`MINIMAL_GIT` `packages=` list has likewise become redundant
(it was only there because git-extra never used to depend on it).
Drop both so the new package replaces the old one through the
dependency edge alone, without any explicit listing here.
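
One way to confirm that a package arrives through the dependency edge is to search the dependency listing for its name. The sketch below assumes the listing has one package name per line, as `pactree -u` is expected to print; `lists_package` is a hypothetical helper.

```shell
# Sketch: succeed if a package name appears as a whole line in a
# dependency listing read from stdin (one name per line is assumed).
# lists_package is a hypothetical helper name.
lists_package() {
    # $1: exact package name to look for
    grep -qxF -- "$1"
}
```

On an SDK that already carries the new git-extra, a call such as `pactree -u mingw-w64-x86_64-git-extra | lists_package mingw-w64-x86_64-docx2txt-rs` should then succeed.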

This has the additional benefit of avoiding a `pacman -Sy` call in
the build-installers / build-artifacts CI jobs for a package the
existing main-branch SDK snapshot does not yet carry. The runner's
sparse SDK checkout fails `pacman -Sy` with "missing required
signature" and "GPGME error: Invalid crypto engine" because the GPG
backend cannot initialize there; sidestepping the preflight install
sidesteps that failure too.

In CI for this PR specifically, the produced installer artifacts
will therefore ship with neither `docx2txt.pl` nor `docx2txt.exe`,
because the main-branch SDK snapshot still has the OLD git-extra
without the new dependency, so `pactree` will not include
docx2txt-rs this round. Real installers built after the PR merges
and the SDK is rebuilt with the new git-extra will pick docx2txt-rs
up through the dependency edge as intended.

Assisted-by: Opus 4.7
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

The PKGBUILD's `pkgver()` function derives the version from
`git rev-list --count` over `git-extra/`, skipping commits that only
touch `pkgver=` or 64-hex source hashes. After the previous commits
in this series add the `${MINGW_PACKAGE_PREFIX}-docx2txt-rs`
dependency and rewrite `astextplain`, the derived value moves from
`1.1.693.6dc76c4f4` to `1.1.696.8dd445c32`; commit it so the
post-build "ensure worktree is clean" check in the
`build-packages (git-extra, ...)` CI job does not fail with
"Uncommitted changes after build!" when makepkg writes the new
value back into the PKGBUILD.
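
A post-build cleanliness check of this kind could look roughly like the sketch below. The function name `ensure_clean_worktree` is hypothetical and the CI job's actual implementation is not shown in this PR; only the error message is taken from the text above.

```shell
# Sketch: fail when the worktree has modified or untracked files, which is
# what happens if makepkg rewrites pkgver= and the change is not committed.
# ensure_clean_worktree is a hypothetical helper, not the CI job's code.
ensure_clean_worktree() {
    # $1: repository root
    if [ -n "$(git -C "$1" status --porcelain)" ]; then
        echo "Uncommitted changes after build!" >&2
        return 1
    fi
}
```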

This follows the existing project convention seen most recently in
`83dbeadc git-extra: bump pkgrel`.

Assisted-by: Opus 4.7
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
dscho requested a review from rimrul June 5, 2026 18:00
dscho self-assigned this Jun 5, 2026
dscho requested a review from mjcheetham June 5, 2026 18:00
dscho marked this pull request as ready for review June 5, 2026 18:00