Learn R Programming

AhoCorasickTrie (version 0.1.2)

Fast Searching for Multiple Keywords in Multiple Texts

Description

Aho-Corasick is an optimal algorithm for finding many keywords in a text. It can locate all matches in a text in O(N+M) time; i.e., the time needed scales linearly with the number of keywords (N) and the size of the text (M). Compare this to the naive approach which takes O(N*M) time to loop through each pattern and scan for it in the text. This implementation builds the trie (the generic name of the data structure) and runs the search in a single function call. If you want to search multiple texts with the same trie, the function will take a list or vector of texts and return a list of matches to each text. By default, all 128 ASCII characters are allowed in both the keywords and the text. A more efficient trie is possible if the alphabet size can be reduced. For example, DNA sequences use at most 19 distinct characters and usually only 4; protein sequences use at most 26 distinct characters and usually only 20. UTF-8 (Unicode) matching is not currently supported.

Copy Link

Version

Install

install.packages('AhoCorasickTrie')

Monthly Downloads

353

Version

0.1.2

License

Apache License 2.0

Issues

Pull Requests

Stars

Forks

Maintainer

Last Published

September 29th, 2020

Functions in AhoCorasickTrie (0.1.2)

AhoCorasickTrie

AhoCorasickTrie: fast searching for multiple keywords in multiple texts
AhoCorasickSearch

Fast searching for one or more keywords in one or more texts
AhoCorasickSearchList

Fast searching for one or more keywords in a list of texts