Questions tagged [string-matching]
38 questions
37
votes
7 answers
What algorithm would you best use for string similarity?
I am designing a plugin to uniquely identify content on various web pages, based on addresses.
So I may have one address which looks like:
1 someawesome street, anytown, F100 211
later I may find this address in a slightly different format.
1…

Squiggs.
- 511
- 1
- 4
- 8
24
votes
5 answers
Is there any good search algorithm for a single character?
I know several basic string-matching algorithms such as KMP or Boyer-Moore, but all of those analyze the pattern before searching.However, if one has a single character, there is not much to analyze.
So is there any better algorithm than the naive…

Christian
- 365
- 2
- 3
7
votes
2 answers
Finding and counting equal substrings in a set of strings
I'm thinking about a way of finding similar parts in Strings. I have a set of strings of varying length i.e:
The quick brown fox jumps
fox force five
the bunny is much quicker than the fox
is
First, i thought of just tokenizing the strings and…

Chris
- 207
- 1
- 5
6
votes
2 answers
Detecting plagiarism – what algorithm?
I'm currently writing a program to read a body of text and compare it to search-engine results (from searching for substrings of the given text), with the goal of detecting plagiarism in, for example, academic papers.
The two strings being compared…

Vivian
- 189
- 1
- 6
6
votes
4 answers
Is "use "abc".equals(myString) instead of myString.equals("abc") to avoid null pointer exception" already problematic in terms of business logic?
I heard numerous times that when comparing Strings in Java, to avoid null pointer exception, we should use "abc".equals(myString) instead of myString.equals("abc"), but my question is, is this idea already problematic in terms of business…

ggrr
- 5,725
- 11
- 35
- 37
5
votes
2 answers
Any reasons why SQL-92 changed *, ? into %, _?
Do you know any reasons why SQL-92 standard has changed glob pattern wildcard characters from * and ? (SQL-89) to % and _?
Currently I need to do mask conversions to allow users searching data with * and ? which they already know from file system.…

miroxlav
- 672
- 4
- 17
4
votes
2 answers
why regex, when using global search and {0,} quantifier, match the end of the string?
I have asked a question here about js, regex, quantifiers and global search. I've understood finally how this works, but, let's take a concrete example and then I`ll write my question.
Based on the same example
var str = 'ddd';
var r =…

Gigi Ionel
- 41
- 3
3
votes
3 answers
Replace strings based on substring match
I have N strings and M search-replace pairs. Each of the strings contains exactly one of the search pair and the whole string needs to be replaced by the replace pair.
Say you have returns,between,paragraphs and turn => foo, tween => bar, rag => baz…

chx
- 363
- 2
- 11
3
votes
2 answers
Burrows-Wheeler transform backward search: how to find suffix index?
BWT backward search algorithm is pretty straightforward if we only need the multiplicity of a pattern. However I also need to find the suffix indices (i.e. positions in the reference string where a pattern occurs). e.g., given string banana and…

user798275
- 131
- 2
3
votes
2 answers
String Searching algorithm
A title for a movie can be ambiguous. (Eg. The Lord of the Rings, Lord of the Rings, Lord of the Rings, The)
There exists a database entry that has a list of movie titles mapped to a unique identifier.
I am trying to write a method that takes a…

Theheist1992
- 39
- 1
3
votes
4 answers
fast n-gram access data structure
TL;DR
Is there a data structure that'd quickly let me match words at any point (e.g., 'foo' matches 'foobar' and 'zoofoo'), and, ideally, returns a list of "characters that show up after the needle" (e.g., 'foo' should return ['b', $]).
I'm…

Tordek
- 454
- 1
- 6
- 9
3
votes
0 answers
String and Suffix Matching from a Suffix List
I'm having trouble finding an algorithm that matches (or fails to match) a substring of a string to a suffix from a list of suffixes. The hits I'm finding are for suffix trees and suffix arrays, but they are not quite what I am looking for.
My…
user118658
2
votes
2 answers
Are there typo-tolerance algorithms (as opposed to string similarity)?
I want to build a search with basic typo tolerance.
There are quite a few string similarity algorithms (and implementations for almost all languages I guess).
However, humans tend to make some typos more frequently than others. E.g
mixing up the…

cis
- 147
- 3
2
votes
0 answers
Algorithm to search very long blacklist in another very long data set
I have two data sets. The first data set has approx. 50.000 movie and song titles and the second one have 20.000 blacklist strings. I am looking for the best algorithm to detect movie/song title which contains blacklisted word(s).
Example:
Dataset…

Eray
- 336
- 2
- 9
2
votes
3 answers
Algorithm for optimizing text compression
I am looking for text compression algorithms (natural language compression, rather than compression of arbitrary binary data).
I have seen for example An Efficient Compression Code for Text Databases. This algorithm basically uses the words as…

Lance
- 2,537
- 15
- 34