A reasonably fast casefold implementation #122

Open
aneubeck wants to merge 9 commits into main from aneubeck/casefold

Collaborator

aneubeck commented Jun 5, 2026

This is useful in combination with sparse-ngrams indexing.

Copilot AI review requested due to automatic review settings June 5, 2026 08:15
aneubeck requested a review from a team as a code owner June 5, 2026 08:15
Contributor

Copilot AI left a comment

Pull request overview

Adds a new casefold crate providing a compact, fast Unicode simple (1-to-1) case-folding implementation intended for use in case-insensitive indexing (notably alongside sparse-ngrams).

Changes:

  • Introduces casefold::simple_fold(char) -> char backed by a generated paged-bitmap + packed-run table.
  • Adds a build script that parses Unicode CaseFolding.txt to generate the compressed table at build time.
  • Adds a dedicated casefold-benchmarks crate and wires it into the workspace.
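
To make the "paged-bitmap" idea in the first bullet concrete, here is a minimal sketch of a paged bitmap that answers whether a code point has a fold at all. It is not the PR's actual table layout (that is documented in the crate README and not reproduced here). `PagedBitmap`, `PAGE_BITS`, and the 256-code-point page size are assumptions made for illustration, and the packed-run part that stores the mapping itself is left out.

```rust
use std::collections::BTreeMap;

/// log2 of the number of code points per page (hypothetical choice for this sketch).
const PAGE_BITS: u32 = 8;
/// Number of 64-bit words needed to hold one page of bits.
const WORDS_PER_PAGE: usize = (1 << PAGE_BITS) / 64;

/// Hypothetical paged bitmap: one bit per code point, with empty pages shared.
struct PagedBitmap {
    /// Maps a page number (code point >> PAGE_BITS) to a slot in `pages`.
    /// Slot 0 is an all-zero page shared by every empty page.
    page_index: Vec<u16>,
    pages: Vec<[u64; WORDS_PER_PAGE]>,
}

impl PagedBitmap {
    /// Build the bitmap from the set of code points that have a fold.
    fn build(code_points: &[u32]) -> Self {
        // Collect the bits of every non-empty page, ordered by page number.
        let mut by_page: BTreeMap<u32, [u64; WORDS_PER_PAGE]> = BTreeMap::new();
        for &cp in code_points {
            let page = by_page.entry(cp >> PAGE_BITS).or_insert([0; WORDS_PER_PAGE]);
            let offset = cp & ((1 << PAGE_BITS) - 1);
            page[(offset / 64) as usize] |= 1u64 << (offset % 64);
        }
        // The index only needs to reach the highest non-empty page.
        let page_count = by_page.keys().next_back().map_or(0, |&p| p as usize + 1);
        let mut page_index = vec![0u16; page_count];
        let mut pages = vec![[0u64; WORDS_PER_PAGE]];
        for (page, bits) in by_page {
            page_index[page as usize] = pages.len() as u16;
            pages.push(bits);
        }
        Self { page_index, pages }
    }

    /// Two array lookups and a shift: page slot, then word, then bit.
    fn contains(&self, c: char) -> bool {
        let cp = c as u32;
        let Some(&slot) = self.page_index.get((cp >> PAGE_BITS) as usize) else {
            return false; // beyond the last non-empty page
        };
        let offset = cp & ((1 << PAGE_BITS) - 1);
        (self.pages[slot as usize][(offset / 64) as usize] >> (offset % 64)) & 1 == 1
    }
}
```

Because most pages of the Unicode range contain no foldable code points, sharing a single zero page keeps a structure like this small, and a negative lookup never has to touch the mapping data.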
Summary per file:

| File | Description |
| --- | --- |
| crates/casefold/src/lib.rs | Implements simple_fold and the paged-bitmap lookup logic plus correctness/size tests. |
| crates/casefold/README.md | Documents the encoding approach, table layout, and benchmark results. |
| crates/casefold/data/CaseFolding.txt | Vendors Unicode 16.0 CaseFolding.txt used for generating and testing the table. |
| crates/casefold/Cargo.toml | Declares the new casefold crate package metadata. |
| crates/casefold/build.rs | Build-time generator that parses CaseFolding.txt and emits the packed table into OUT_DIR. |
| crates/casefold/benchmarks/performance.rs | Criterion benchmark comparing the table implementation vs a HashMap baseline across workloads. |
| crates/casefold/benchmarks/lib.rs | Benchmark helper code for building the reference HashMap implementation. |
| crates/casefold/benchmarks/Cargo.toml | Declares the casefold-benchmarks crate and its dependencies. |
| Cargo.toml | Adds crates/casefold/benchmarks to the workspace members. |
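
For readers unfamiliar with the input format, the parsing step described for build.rs could look roughly like the sketch below. It is not the PR's generator. `parse_simple_folds` is a hypothetical name, and the only facts relied on are the published CaseFolding.txt layout (`code; status; mapping; # name`) and that simple folding uses the entries with status C (common) and S (simple), skipping F (full) and T (Turkic).

```rust
/// Parse CaseFolding.txt-style text and return (code point, folded code point)
/// pairs for simple (1-to-1) folding, i.e. entries with status C or S.
/// Hypothetical helper for illustration; not the PR's build.rs API.
fn parse_simple_folds(text: &str) -> Vec<(u32, u32)> {
    let mut pairs = Vec::new();
    for line in text.lines() {
        // Drop the trailing "# NAME" comment and surrounding whitespace.
        let data = line.split('#').next().unwrap_or("").trim();
        if data.is_empty() {
            continue; // blank line or pure comment
        }
        // Data lines look like "0041; C; 0061;".
        let mut fields = data.split(';').map(str::trim);
        let (Some(code), Some(status), Some(mapping)) =
            (fields.next(), fields.next(), fields.next())
        else {
            continue; // malformed line: skipped in this sketch
        };
        // C = common, S = simple. F (full) and T (Turkic) are not 1-to-1 defaults.
        if status != "C" && status != "S" {
            continue;
        }
        // Both fields are hexadecimal code points; F entries with several
        // code points in the mapping were already filtered out above.
        if let (Ok(from), Ok(to)) = (
            u32::from_str_radix(code, 16),
            u32::from_str_radix(mapping, 16),
        ) {
            pairs.push((from, to));
        }
    }
    pairs
}
```

A real generator would more likely fail the build on a malformed line than skip it silently. The sketch skips such lines only to stay short.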

Copilot's findings

  • Files reviewed: 9/9 changed files
  • Comments generated: 8

Comment thread crates/casefold/src/lib.rs
Comment thread crates/casefold/build.rs
Comment on lines +39 to +42
```rust
.map(|_| {
    let b = rng.random_range(b'A'..=b'z');
    b as char
})
```
Comment thread crates/casefold/build.rs Outdated
Comment thread crates/casefold/build.rs Outdated
Comment thread crates/casefold/build.rs
Comment on lines +161 to +166
```rust
let ends: Vec<u32> = runs
    .iter()
    .map(|r| r.start + (r.length as u32 - 1) * (r.stride as u32))
    .collect();
let last_covered = *ends.last().unwrap();
```
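
The quoted lines compute the inclusive end of each run as `start + (length - 1) * stride` and take the end of the last run as the highest covered code point. The sketch below shows the same arithmetic in a defensive form, assuming a hypothetical `Run` struct whose field types are guesses. It uses checked arithmetic and a maximum over all runs, so an empty run list, a zero-length run, or an overflow yields `None` instead of panicking. Whether the review comment on these lines raised any of these cases is not visible in this conversation.

```rust
/// Hypothetical shape of a run: `length` code points starting at `start`,
/// spaced `stride` apart. Field types are assumptions for this sketch.
#[derive(Debug, Clone, Copy)]
struct Run {
    start: u32,
    length: u16,
    stride: u8,
}

/// Last code point covered by a run, or None if the run is empty
/// or the arithmetic would overflow u32.
fn run_end(r: &Run) -> Option<u32> {
    let steps = (r.length as u32).checked_sub(1)?;
    r.start.checked_add(steps.checked_mul(r.stride as u32)?)
}

/// Highest code point covered by any run, or None if `runs` is empty
/// or contains a run for which `run_end` fails.
fn last_covered(runs: &[Run]) -> Option<u32> {
    let mut max_end = None;
    for r in runs {
        let end = run_end(r)?;
        max_end = Some(max_end.map_or(end, |m: u32| m.max(end)));
    }
    max_end
}
```

For example, a run with `start = 0x41`, `length = 26`, `stride = 1` ends at `0x5A`, and a run with `start = 0x100`, `length = 24`, `stride = 2` ends at `0x12E`. Taking the maximum rather than the last element also stays correct if the runs are not sorted by their end.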

Comment thread crates/casefold/benchmarks/lib.rs
Comment thread crates/casefold/build.rs Outdated
aneubeck and others added 7 commits June 5, 2026 11:25
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>