2

I want to build a search with basic typo tolerance. There are quite a few string similarity algorithms (and implementations for almost all languages I guess).

However, humans tend to make some typos more frequently than others. E.g

  • mixing up the order of letters when typing, e.g tpying instead of typing
  • hitting the key next to the intended one, e.g. slright instead of alright because s and a are next to each other on most keyboard layouts
  • mixing up letters with similar sounds associated to them, e.g. there instead of their

I think, a good typo tolerance algorithm should take that into account. E.g. the pair alright and slright should get a higher similarity than alright and mlright. As far as I known, no string similarity algorithms does something like this.

Are there free algorithms (and implementations in TypeScript) which do?

(Unfortunately, just gathering masses of data of what my users actually type is not an option for me.)

cis
  • 147
  • 3
  • 1
    There may not be *one* algorithm which takes all of your criteria into account, but there are well-known algorithms for criteria 1 and 3 (Levenshtein-distance and phonetic algorithms like Soundex), and I would not be astonished if there is some algo for criterion no 2 as well. No idea if there exist research papers for combining theses algorithms. Note research request for 3rd party papers or libraries are off-topic for this site. – Doc Brown May 23 '22 at 13:57
  • We can consider keyboard layout for typos, and I believe that cell phones do that. They have the advantage of knowing their own keyboard layout (this information is basically lost on personal computers, and over the web). – Erik Eidt May 23 '22 at 15:04
  • Why don't you want a string similarity algorithm? In fact, what you describe *is* a string similarity algorithm, with a specific definition of "similarity", is it not? – Jörg W Mittag May 23 '22 at 16:28
  • Erik, it’s a while ago. But on MacOS X you could determine exactly which physical key was being pressed, and with which modifier keys. – gnasher729 May 23 '22 at 19:28

2 Answers2

2

I don't think there is a single algorithm that's going to do what you want.

For point 1 (transpositions) you can use Damerau-Levenshtein Distance, or the simpler Optimal String Alignment distance (discussed in the same link).

For point 2 (close characters on the keyboard) you will need some kind of parameterised distance function. I have seen Levenshtein algorithms (can't remember where) that allow different penalties for edit, insert and delete, but I haven't seen anything that gives different penalties for different character substitutions. I think it should be possible to modify those algorithms to do that though.

For point three (homonyms) you would have to understand the grammatical context to know which of "their" or "there" was correct, so I don't see how that could be done with a basic spell checker.

rghome
  • 668
  • 5
  • 12
  • 1
    to Point 3: for a search algorithm, no grammatical analysis is necessary, only a phonetic search like [Soundex](https://en.wikipedia.org/wiki/Soundex). – Doc Brown May 23 '22 at 15:40
  • I agree you could find similar sounding words, but how would you know that the word in the document is wrong? – rghome May 24 '22 at 14:57
  • You don't. But the OP asked for searching, not for spell checking. – Doc Brown May 24 '22 at 17:16
1

There are a couple of string similarity algorithms that could be useful in an editor for detecting typos. The trick is to use the current position, find the start of the current token by moving back searching for a separator, and apply the algorithm on the substring up to the next separator.

These algorithms are good for matching full identifiers, with all the situations that tou have described, e.g. tpying instead of typing.

However, these algorithms are less suitable for detecting typos as they occur, e.g. tpyi… instead of typi… So if your intention is to detect typos as they happen, you’ll need some more creativity and manage some trie. As long as the trie for the currently typed token is valid, you don’t do anything because the token may correspond to a valid one (e.g if you’d have a variable called tpyih). As soon as you reach a char that is not in the trie, you could try a string similarity to check if there is a typo. You should however limit the check to close similarity (i.e distance not larger than 2).

Christophe
  • 74,672
  • 10
  • 115
  • 187