-1

I'm interested in finding a text distance (or string similarity) algorithm which computes a greater distance (or lower similarity) when characters are further apart.

For example, I want the distance between abc and abz to be greater than the distance between abc and abd.

It would be easy to compute a text distance like this for strings of the same length, but I'd like to find one that also works for strings of different lengths.

Common algorithms like Levenshtein, Jaro-Winkler, and Ratcliff-Obershelp compute the same values for these two examples.

Edit: People are asking for a specific distance metric, so let's say it's the absolute difference between character values divided by the length of the longer string. And to keep this simple, only ASCII characters are considered.

Vermillion
  • 159
  • 6
  • 2
    It is pretty trivial to define a few different metrics which fulfills what you described. Hence, your question is clearly underspecified - tell us what purpose your have in mind, what's your goal with this kind of text distance? Are positional differences of characters als important, and how to weight them against "characters diffs"? Are string length's important? And please give us some context. – Doc Brown Sep 23 '22 at 05:04
  • what about strings of different length? if you have abc closer to abcd or abd or abz? – Pieter B Sep 23 '22 at 07:07
  • I was asking to see if anyone knows of an existing string similarity algorithm that takes character distance into account. I'm not trying to completely specify the problem. This is more of an exploratory question. Maybe this isn't the forum for those. – Vermillion Sep 23 '22 at 23:07
  • But for a specific example, let's say that it's the Levenshtein distance combined with a distance metric between characters. – Vermillion Sep 24 '22 at 03:40
  • Is this a lossy function? If abc to xbc is the same distance as abc to abz then it’s lossy because you can’t use the number to remember what abc went to. – candied_orange Sep 26 '22 at 00:40
  • 1
    You are just arbitrary guessing (parts of) a metrics, ignoring most of what I wrote in my comment. Sorry, but that's not a good way of asking on this site - it seems you are just asking fpr curiosity and have no actual problem at hand which could provide the necessary context. – Doc Brown Sep 26 '22 at 03:32
  • @DocBrown I thought this forum allowed these kinds of exploratory questions (more than StackOverflow). Another question on here is "What algorithm would you best use for string similarity" which has 34 upvotes. In my opinion, that question is more vague than mine. – Vermillion Sep 26 '22 at 18:05
  • I would delete this question and start over with a more specific one, but the site won't allow it. – Vermillion Sep 26 '22 at 18:29
  • @Vermillion: if you mean https://softwareengineering.stackexchange.com/questions/330934/what-algorithm-would-you-best-use-for-string-similarity : that question presents a **real world problem with a clear goal**, which gives the context your question currently misses. Yours looks like something hypothetical, something invented just for curiosity. So no, that other question is definitely not more vague than yours. – Doc Brown Sep 26 '22 at 19:25
  • Am I allowed to completely rewrite the title and content of this question then? Since I can't delete it. – Vermillion Sep 26 '22 at 20:11

3 Answers3

1

First: Define what the distance between two characters is. For example, p and b, g and k, d and t Are similar. A and e are reasonably similar. If I had software converting speech to text, and the recognised word is dable, then I suspect it’s really table.

Or you take the keyboard distance. If you see ltunpstf then it’s someone typing “keyboard” blindly with their finger one position to the right.

gnasher729
  • 42,090
  • 4
  • 59
  • 119
  • I'm thinking of the distance being the absolute difference between the character values (e.g. their ASCII values). So the distance between `a` and `c` is 2. – Vermillion Sep 22 '22 at 21:25
  • 2
    @Vermillion Think about unicode what difference you would use for non ascii alphabets – gapsf Sep 23 '22 at 18:20
  • Capital Latin letter A and capital Greek letter Alpha look identical. So what is the difference? – gnasher729 Sep 24 '22 at 22:32
  • 1
    What justification is there that the difference between a and c is much less than between a and z? – gnasher729 Sep 24 '22 at 22:33
1

As others have said, you must first define "distance". Once you have done so, however, standard approaches can be used. I have implemented Levenshtein this way--most changes were counted as two points, but different trailing digits were only counted as one point. In practice it proved quite good at finding what the users generally wanted--other lots in the same project. The basic algorithm works the same no matter how complex your distance logic is. I have never played with the others you mention so I can't address them but I would be surprised if the same approach didn't work.

Loren Pechtel
  • 3,371
  • 24
  • 19
0

I agree with @gnasher729 where it is suggested that you define the meaning of a distance precisely. For example:

  • should Distance("ab","a") be > Distance ("ab","z")?
  • should Distance ("ab","a") = Distance("ab","A")?

Not sure if the code link below could help you or not, it may be useful as a starting point to make the question more specific:

Code for thought-Not actual algorithm.

NoChance
  • 12,412
  • 1
  • 22
  • 39