Replace docx2txt with the Rust docx2txt-rs for 64-bit installers #708
Open
dscho wants to merge 4 commits into
The Perl-based `docx2txt` package has been the SDK's `.docx` textconv helper since Git for Windows began shipping it, but the only consumer that actually needs it is `astextplain`. The new `mingw-w64-docx2txt-rs` package (https://packages.msys2.org/packages/mingw-w64-x86_64-docx2txt-rs) is a small Rust rewrite that produces byte-identical output to the original on every fixture, drops the Perl dependency, and is published for mingw64, ucrt64, clang64 and clangarm64 per the upstream `mingw_arch=('mingw64' 'ucrt64' 'clang64' 'clangarm64')` declaration at https://github.com/msys2/MINGW-packages/blob/master/mingw-w64-docx2txt-rs/PKGBUILD.

Declare it as a dependency of the two 64-bit `git-extra` variants we actually ship, so that the next nightly sync in git-sdk-64 and git-sdk-arm64 pulls it in automatically, without the SDK repositories needing their own targeted PRs.

The bare `git-extra` MSYS variant and `package_mingw-w64-i686-git-extra` are deliberately left untouched: there is no `mingw-w64-i686-docx2txt-rs` upstream and 32-bit installers are no longer built. The follow-up commit teaches `astextplain` to prefer the new binary, falling back to `docx2txt.pl` so the i686 and bare-MSYS variants of git-extra (which do not gain the new dependency) keep working as long as the legacy `docx2txt` package is still installed.

`pkgrel` is not bumped manually because the `pkgver()` function in this PKGBUILD already derives the version from `git rev-list --count` over the `git-extra/` directory (excluding `git-extra.install`), so this commit alone will move the auto-derived `pkgver`.

Assisted-by: Opus 4.7
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
The previous commit added a hard dependency on
`${MINGW_PACKAGE_PREFIX}-docx2txt-rs` for the 64-bit `git-extra`
variants, which means `docx2txt.exe` (the Rust binary, named after the
`docx2txt` Cargo package per
https://github.com/dscho/docx2txt-rs/blob/main/Cargo.toml) is
guaranteed to be on `$PATH` for every modern Git for Windows SDK.
Teach the `.docx` branch of `astextplain` to call it. The new CLI per
https://github.com/dscho/docx2txt-rs/blob/main/README.md reads
exclusively from stdin and writes to stdout (no filename argument and
no `-` sentinel), so the invocation is just `docx2txt.exe <"$1"`.
The single line

    docx2txt.exe <"$1" || docx2txt.pl "$1" - || cat "$1"

uses a layered fallback rather than an `if command -v` guard,
matching the style of the other case branches in this script (e.g.
`odt2txt "$1" || cat "$1"`,
`out=$(antiword -m UTF-8 "$1") && sed ... || cat "$1"`) which all let
a missing helper fail naturally into the next fallback. On a modern
64-bit SDK the first leg always succeeds; on the legacy i686 and
bare-MSYS `git-extra` variants (which do not gain the dependency in
the previous commit) the script falls through to the Perl shim if it
is still installed, and finally `cat`s the raw `.docx` only when both
helpers are missing, exactly matching the prior
`docx2txt.pl ... || cat "$1"` semantics.
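The fall-through behaviour described above can be reproduced in isolation. This is a minimal sketch rather than the actual `astextplain` code: `fake_rust_helper` and `fake_perl_helper` are hypothetical command names standing in for `docx2txt.exe` and `docx2txt.pl`, chosen on the assumption that neither exists on the machine running the sketch.

```sh
# Layered fallback: each leg runs only when the one before it failed.
# fake_rust_helper and fake_perl_helper are hypothetical names (assumed not to
# be installed); they stand in for docx2txt.exe and docx2txt.pl.
to_text() {
    fake_rust_helper <"$1" 2>/dev/null ||
        fake_perl_helper "$1" - 2>/dev/null ||
        cat "$1"
}

sample=$(mktemp)
printf 'raw bytes\n' >"$sample"
to_text "$sample"    # both helpers are missing, so the last leg (cat) runs
rm -f "$sample"
```

A missing command makes the shell return a non-zero status for that leg, which is all `||` needs to move on; no explicit `command -v` probe is involved.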
The `.exe` suffix is spelled out explicitly so that the lookup never
resolves to the old `/usr/bin/docx2txt` shell wrapper from the legacy
package, whose CLI is completely different (takes a filename, writes
`filename.txt`, no stdin/stdout interface).
`sha256sums[18]` for `astextplain` is updated to match the new file
contents; without this, `makepkg` would refuse to build the package
with an integrity-check failure.
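To illustrate what such an integrity check amounts to, here is a small sketch that compares a file's SHA-256 digest with an expected value. It is not how `makepkg` is implemented; the helper name `sha256_matches` is made up for this example, and it assumes `sha256sum` from GNU coreutils is on `$PATH`.

```sh
# Compare a file's SHA-256 with an expected hex digest, roughly what an
# integrity check does. Assumes sha256sum from GNU coreutils is on PATH.
# sha256_matches is a hypothetical helper name, not part of makepkg.
sha256_matches() {
    actual=$(sha256sum "$1" | cut -d' ' -f1)
    [ "$actual" = "$2" ]
}

# Demo on an empty file, whose SHA-256 is a well-known constant.
empty=$(mktemp)
if sha256_matches "$empty" e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
then
    echo "checksum ok"
else
    echo "checksum mismatch"
fi
rm -f "$empty"
```

Any edit to the file changes the digest, which is why a changed `astextplain` needs its recorded checksum updated in the same commit.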
Assisted-by: Opus 4.7
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
With the previous commits making the 64-bit `git-extra` variants
depend on `${MINGW_PACKAGE_PREFIX}-docx2txt-rs` and teaching
`astextplain` to use it, `pactree -u mingw-w64-$PACMAN_ARCH-git-extra`
now pulls the new Rust package into the installer's file enumeration
transitively, with no need to mention it by name.
The legacy `docx2txt` token in the `required=` preflight loop and in
the non-`MINIMAL_GIT` `packages=` list has likewise become redundant
(it was only there because git-extra never used to depend on it).
Drop both so the new package replaces the old one through the
dependency edge alone, without any explicit listing here.
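One way to confirm that a package arrives through the dependency edge is to search the `pactree -u` listing for it. The sketch below assumes that listing contains one bare package name per line; `listing_has` is a hypothetical helper, and the `mingw-w64-$PACMAN_ARCH-docx2txt-rs` name in the usage comment is inferred by analogy with the `git-extra` name above rather than taken from the installer scripts.

```sh
# Read a dependency listing (one package name per line) on stdin and succeed
# only if the given package name appears as a whole line.
# listing_has is a hypothetical helper name used for illustration.
listing_has() {
    grep -Fxq -- "$1"
}

# Intended use on an SDK where pactree is available (package names assumed):
#   pactree -u "mingw-w64-$PACMAN_ARCH-git-extra" |
#       listing_has "mingw-w64-$PACMAN_ARCH-docx2txt-rs" &&
#       echo "pulled in transitively"
```

Matching whole lines with `-x` avoids a false positive from the legacy `docx2txt` package name, which is a prefix of the new one.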
This has the additional benefit of avoiding a `pacman -Sy` call in
the build-installers / build-artifacts CI jobs for a package the
existing main-branch SDK snapshot does not yet carry. The runner's
sparse SDK checkout fails `pacman -Sy` with "missing required
signature" and "GPGME error: Invalid crypto engine" because the GPG
backend cannot initialize there; sidestepping the preflight install
sidesteps that failure too.
In CI for this PR specifically, the produced installer artifacts
will therefore ship with neither `docx2txt.pl` nor `docx2txt.exe`,
because the main-branch SDK snapshot still has the OLD git-extra
without the new dependency, so `pactree` will not include
docx2txt-rs this round. Real installers built after the PR merges
and the SDK is rebuilt with the new git-extra will pick docx2txt-rs
up through the dependency edge as intended.
Assisted-by: Opus 4.7
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
The PKGBUILD's `pkgver()` function derives the version from
`git rev-list --count` over `git-extra/`, skipping commits that only
touch `pkgver=` or 64-hex source hashes. After the previous commits
in this series add the `${MINGW_PACKAGE_PREFIX}-docx2txt-rs`
dependency and rewrite `astextplain`, the derived value moves from
`1.1.693.6dc76c4f4` to `1.1.696.8dd445c32`; commit it so the
post-build "ensure worktree is clean" check in the
`build-packages (git-extra, ...)` CI job does not fail with
"Uncommitted changes after build!" when makepkg writes the new
value back into the PKGBUILD.
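The counting step behind the derived version can be sketched as follows. This is an approximation under stated assumptions, not the PKGBUILD's actual `pkgver()`: the helper name `count_commits_touching` is invented here, and both the filtering of hash-only commits and the trailing commit hash are left out.

```sh
# Count the commits reachable from HEAD that touch a given path.
# Assumption: this mirrors only the counting step; the real pkgver() also
# skips some commits and appends a commit hash, both omitted here.
# count_commits_touching is a hypothetical helper name.
count_commits_touching() {
    git rev-list --count HEAD -- "$1"
}

# Example (run from the top of a checkout that has a git-extra/ directory):
#   count_commits_touching git-extra
```

Because the count depends on history, every new commit under the directory changes the result, which is why the value written back by the build differs from the one committed earlier.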
This follows the existing project convention seen most recently in
`83dbeadc git-extra: bump pkgrel`.
Assisted-by: Opus 4.7
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Replaces the Perl `docx2txt` MSYS package with the Rust `mingw-w64-docx2txt-rs` rewrite (https://packages.msys2.org/packages/mingw-w64-x86_64-docx2txt-rs, source at https://github.com/dscho/docx2txt-rs) for the 64-bit Git for Windows flavors. The new binary produces byte-identical output to the original on every fixture and drops the Perl dependency entirely.

To this end, the `docx2txt-rs` package is added as a dependency of `git-extra`; this will cause the (64-bit) Git for Windows SDKs to pull in that package. (32-bit does not matter: we do not include `docx2txt` in MinGit, and a 32-bit `docx2txt-rs` would not be available anyway.)