A high-performance matcher designed to solve LOGICAL and TEXT VARIATIONS problems in word matching, implemented in Rust.
It's helpful for:
- Precision and Recall: Word matching is a retrieval process: LOGICAL matching improves precision, while TEXT VARIATIONS matching improves recall.
- Content Filtering: Detecting and filtering out offensive or sensitive words.
- Search Engines: Improving search results by identifying relevant keywords.
- Text Analysis: Extracting specific information from large volumes of text.
- Spam Detection: Identifying spam content in emails or messages.
- ···
For detailed implementation, see the Design Document.
- Multiple Matching Methods:
  - Simple Word Matching
  - Regex-Based Matching
  - Similarity-Based Matching
- Text Transformation:
  - Fanjian: Simplify traditional Chinese characters to simplified ones. Example: `蟲艸` -> `虫草`
  - Delete: Remove specific characters. Example: `*Fu&*iii&^%%*&kkkk` -> `Fuiiikkkk`
  - Normalize: Normalize special characters to identifiable characters. Example: `𝜢𝕰𝕃𝙻𝝧 𝙒ⓞᵣℒ𝒟!` -> `hello world!`
  - PinYin: Convert Chinese characters to Pinyin for fuzzy matching. Example: `西安` -> `xi an`, matches `洗按` -> `xi an`, but not `先` -> `xian`
  - PinYinChar: Convert Chinese characters to Pinyin. Example: `西安` -> `xian`, matches `洗按` and `先` -> `xian`
- AND OR NOT Word Matching:
  - Takes into account the number of repetitions of words.
  - Example: `hello&world` matches `hello world` and `world,hello`
  - Example: `无&法&无&天` matches `无无法天` (because `无` is repeated twice), but not `无法天`
  - Example: `hello~helloo~hhello` matches `hello` but not `helloo` and `hhello`
  - A conceptual sketch of these semantics follows this feature list.
- Customizable Exemption Lists: Exclude specific words from matching.
- Efficient Handling of Large Word Lists: Optimized for performance.
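The `&` (AND, repetition-aware) and `~` (NOT) semantics above can be illustrated with a small, self-contained sketch. This is only a conceptual illustration, not the crate's implementation (which is built on aho-corasick automatons); the `word_match` function below is hypothetical.

```rust
use std::collections::HashMap;

/// Naive illustration of the `&` and `~` word semantics described above.
fn word_match(pattern: &str, text: &str) -> bool {
    // Split off NOT parts: "hello~helloo~hhello" -> AND part "hello",
    // NOT parts ["helloo", "hhello"].
    let mut parts = pattern.split('~');
    let and_part = parts.next().unwrap_or("");
    let not_parts: Vec<&str> = parts.collect();

    // Every `&`-separated word must occur at least as many times as it is
    // repeated in the pattern, e.g. "无&法&无&天" requires "无" twice.
    let mut required: HashMap<&str, usize> = HashMap::new();
    for word in and_part.split('&') {
        *required.entry(word).or_insert(0) += 1;
    }
    let and_ok = required
        .iter()
        .all(|(word, count)| text.matches(word).count() >= *count);

    // None of the `~`-separated words may occur in the text.
    let not_ok = not_parts.iter().all(|word| !text.contains(word));

    and_ok && not_ok
}

fn main() {
    assert!(word_match("hello&world", "world,hello"));
    assert!(word_match("无&法&无&天", "无无法天")); // "无" occurs twice, as required
    assert!(!word_match("无&法&无&天", "无法天"));  // "无" occurs only once
    assert!(word_match("hello~helloo~hhello", "hello"));
    assert!(!word_match("hello~helloo~hhello", "hhello"));
}
```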
See the Rust README.
See the Python README.
We provide a dynamic library to link against. See the C README and Java README.
```shell
git clone https://github.com/Lips7/Matcher.git
cd Matcher
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- --default-toolchain nightly -y
cargo build --release
```

Then you should find `libmatcher_c.so`/`libmatcher_c.dylib`/`matcher_c.dll` in the `target/release` directory.
Visit the release page to download the pre-built binary.
Please refer to benchmarks for details.
- Cache middle results across different `ProcessType` `reduce_process_text` function calls. (failed, too slow)
- Try more aho-corasick libraries to improve performance and reduce memory usage.
  - https://github.com/daac-tools/crawdad (produces a char-wise index, not a byte-wise index, which is not acceptable)
  - https://github.com/daac-tools/daachorse (used when the Fanjian, PinYin or PinYinChar transformation is performed)
    - Test char-wise HashMap transformation for Chinese characters. (Too slow)
- Make aho-corasick unsafe.
- Optimize NOT logic word-wise.
- Optimize `RegexMatcher` using `RegexSet`.
- Optimize `SimpleMatcher` when multiple `ProcessType` are used.
  - Consider if there are multiple `ProcessType`:
    - None
    - Fanjian
    - FanjianDelete
    - FanjianDeleteNormalize
    - FanjianNormalize
  - We can construct a chain of transformations:
    - None -> Fanjian -> Delete -> Normalize
    - None -> Fanjian -> Normalize
  - Calculate all possible transformations and cache the results, so that instead of calculating 8 times (Fanjian; Fanjian + Delete; Fanjian + Delete + Normalize; Fanjian + Normalize), we only need to calculate 4 times (Fanjian, Delete, Normalize, Normalize). A caching sketch follows this list of performance items.
- Optimize the process matcher when performing reduce text processing. An offset-tracking sketch follows this list of performance items.
  - Consider that to perform FanjianDeleteNormalize we need to perform Fanjian first, then Delete, then Normalize, so 3 kinds of process matcher are needed to perform the replacements or deletions, and the text has to be scanned 3 times.
  - What if we construct only 1 process matcher whose patterns contain all 3 kinds of patterns (Fanjian, Delete and Normalize)? We could scan the text only once to get all the positions where a replacement or deletion should be performed.
  - We need to take care that byte indices change after a replacement or deletion, so the offset changes must be taken into account.
- Merge multiple aho-corasick matchers into one when multiple `ProcessType` are used.
- When the `dfa` feature is disabled, use daachorse to perform text processing.
  - Do not use it for the simple process function; it is too slow to build.
- Use more regex sets to optimize the regex matcher.
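To make the 8-versus-4 arithmetic above concrete, here is a minimal sketch of the caching idea. The transformation functions are placeholders (the real crate drives these steps with aho-corasick replacement tables), so everything below only illustrates the reuse of intermediate results.

```rust
// Hypothetical placeholder transformations; the real crate applies
// replacement tables, these merely stand in for the three steps.
fn fanjian(s: &str) -> String { s.to_string() }
fn delete(s: &str) -> String { s.chars().filter(|c| !c.is_ascii_punctuation()).collect() }
fn normalize(s: &str) -> String { s.to_lowercase() }

/// Produce every needed variant while running each step only once:
/// Delete reuses the cached Fanjian result and both Normalize calls reuse
/// cached results, so 4 steps replace the 8 needed when each variant is
/// computed from scratch (1 + 2 + 3 + 2).
fn process_all(text: &str) -> [String; 4] {
    let fj = fanjian(text);               // Fanjian
    let fj_del = delete(&fj);             // FanjianDelete (reuses `fj`)
    let fj_del_norm = normalize(&fj_del); // FanjianDeleteNormalize (reuses `fj_del`)
    let fj_norm = normalize(&fj);         // FanjianNormalize (reuses `fj`)
    [fj, fj_del, fj_del_norm, fj_norm]
}

fn main() {
    for variant in process_all("Fu&*K Wor!LD") {
        println!("{variant}");
    }
}
```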
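And a hypothetical sketch of the single-scan idea with offset tracking: all edit positions are byte ranges into the original text, and a running offset accounts for how earlier replacements or deletions shift later indices. `apply_edits` is an illustrative helper, not part of the crate.

```rust
/// Apply (start, end, replacement) edits found by a single scan of `text`.
/// Spans are byte ranges into the ORIGINAL text and must be non-overlapping;
/// `offset` tracks how earlier edits have shifted later byte indices.
fn apply_edits(text: &str, mut edits: Vec<(usize, usize, String)>) -> String {
    edits.sort_by_key(|(start, _, _)| *start);
    let mut out = text.to_string();
    let mut offset: isize = 0;
    for (start, end, replacement) in edits {
        let s = (start as isize + offset) as usize;
        let e = (end as isize + offset) as usize;
        out.replace_range(s..e, &replacement);
        // A deletion shrinks the text, a longer replacement grows it.
        offset += replacement.len() as isize - (end - start) as isize;
    }
    out
}

fn main() {
    // Pretend a single scan of "蟲艸?!" found two edits:
    //   replace "蟲艸" (bytes 0..6) with "虫草", delete "?" (bytes 6..7).
    let edits = vec![
        (0, 6, "虫草".to_string()),
        (6, 7, String::new()),
    ];
    assert_eq!(apply_edits("蟲艸?!", edits), "虫草!");
}
```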
- Cache `get_process_matcher` results globally, instead of caching results inside `SimpleMatcher`.
- Expose `reduce_process_text` to Python.
- Add a new function that can handle a single simple match type.
  - `text_process` is now available.
- Add a fuzzy matcher, https://github.com/lotabout/fuzzy-matcher.
  - Use `rapidfuzz` instead.
- Make `SimpleMatcher` and `Matcher` serializable.
  - Make aho-corasick serializable.
  - See https://github.com/Lips7/aho-corasick.
- Implement NOT logic word-wise.
- Support stable Rust.
- Support iterators.
- A real Java package.
- Build wheels for multiple Python versions.
- Customize the str conversion map.
- Add the Matcher process function to the Python, C and Java bindings.
- For the simple matcher, is it possible to use regex-automata to replace aho-corasick and support regex? (Keep it simple and efficient.)
- Add a simple match type to `RegexMatcher` and `SimMatcher` to pre-process text.
- Try to replace msgpack.
- More precise and convenient `MatchTable`.
- More detailed and rigorous benchmarks.
- More detailed and rigorous tests.
- More detailed simple match type explanation.
- More detailed DESIGN.
- Write a Chinese README.