
I am looking for a way to compare a set of files against a given file, with each comparison yielding a "closeness" metric; I could then order by that metric to find the closest file. I considered using diff, but as far as I know it only reports a yes/no on whether a particular line matches, which is too coarse for my purposes: a word change within a line should count as closer than a completely different line, whereas diff returns a non-match in both cases.

Would I be able to use Soundex effectively on a file of 100 or more lines, or is there a better algorithm? Also, is there a metric that would report a positive match when similar lines appear at drastically different line numbers?

Thanks

Gruffputs

5 Answers


You could use diff and count the number of lines that differ:

diff f1.txt f2.txt | wc -l

This gives you a numeric score: 0 for identical files, growing with the number of lines that differ.
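To pick the closest file out of a whole set, a sketch like the following wraps that count in a loop and sorts numerically (the file names and contents here are hypothetical sample data, included only so the sketch is self-contained):

```shell
#!/bin/sh
# Rank every candidate file by how many lines `diff` reports against
# a reference file; the smallest count is the closest match.
# Sample files (hypothetical contents) make the sketch self-contained.
cd "$(mktemp -d)"
printf 'one\ntwo\nthree\n' > ref.txt
printf 'one\ntwo\nthree\n' > same.txt
printf 'one\ntwo\nfour\n'  > close.txt
printf 'x\ny\nz\n'         > far.txt

rank() {
    for f in *.txt; do
        [ "$f" = ref.txt ] && continue
        n=$(diff ref.txt "$f" | wc -l | tr -d ' ')
        printf '%s %s\n' "$n" "$f"
    done | sort -n
}
rank    # same.txt ranks first with a count of 0
```

Lower numbers mean fewer differing lines, so `sort -n` puts the closest file first.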

Martin York

I've used Levenshtein distance before to good effect. You can find examples in several languages here: http://en.wikibooks.org/wiki/Algorithm_implementation/Strings/Levenshtein_distance.

This algorithm gives you a number greater than or equal to 0 indicating how far apart the two sets of data are, with 0 meaning they are identical.
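As a rough sketch of the algorithm (my own, not taken from the linked page), the classic dynamic-programming formulation can even be run from the shell via awk:

```shell
#!/bin/sh
# Levenshtein distance via the classic dynamic-programming table,
# implemented in awk so it can be called from shell scripts.
lev() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        m = length(a); n = length(b)
        for (i = 0; i <= m; i++) d[i, 0] = i
        for (j = 0; j <= n; j++) d[0, j] = j
        for (i = 1; i <= m; i++)
            for (j = 1; j <= n; j++) {
                cost = (substr(a, i, 1) == substr(b, j, 1)) ? 0 : 1
                del  = d[i-1, j] + 1       # deletion
                ins  = d[i, j-1] + 1       # insertion
                rep  = d[i-1, j-1] + cost  # substitution (or match)
                best = del < ins ? del : ins
                d[i, j] = best < rep ? best : rep
            }
        print d[m, n]
    }'
}

lev kitten sitting    # the textbook example: prints 3
```

The table is quadratic in the input lengths, so for whole files you would compare line-by-line or chunk-by-chunk rather than character-by-character over the full contents.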

JohnL

Vector Space Model

A common method for comparing whole documents is the Vector Space Model, which represents each document as a vector of word frequencies and then measures similarity between documents by comparing those vectors.
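As a minimal sketch of the idea (the file names and sample text are hypothetical), you can build word-frequency vectors with awk and compare them by cosine similarity:

```shell
#!/bin/sh
# Word-frequency sketch of the vector space idea: count the words of
# two files in awk, then print the cosine similarity of the two
# frequency vectors (1.0000 = identical word usage, 0.0000 = disjoint).
cosine() {
    awk '
        FNR == 1 && NR != 1 { second = 1 }   # crossed into the 2nd file
        {
            for (i = 1; i <= NF; i++) {
                w = tolower($i)
                if (second) b[w]++; else a[w]++
            }
        }
        END {
            for (w in a) { dot += a[w] * b[w]; na += a[w] * a[w] }
            for (w in b) nb += b[w] * b[w]
            if (na > 0 && nb > 0)
                printf "%.4f\n", dot / (sqrt(na) * sqrt(nb))
            else
                print "0.0000"
        }' "$1" "$2"
}

# Hypothetical sample files to demonstrate:
cd "$(mktemp -d)"
printf 'the cat sat on the mat\n' > doc1.txt
printf 'the cat sat on the hat\n' > doc2.txt
cosine doc1.txt doc2.txt    # prints 0.8750
```

Note this ignores word order entirely: two files with the same words shuffled score as identical, which is exactly the "similar content, not similar text" caveat below.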

That finds documents with similar content, though, not necessarily similar text. For that you typically use something called Levenshtein distance; I'm not sure it's feasible for whole documents, though, as it's mostly used on short strings.

Also see this

Homde
  • I've used Levenshtein distance on fairly large text files before, though I was actually using it to find out if a source file had a block of text similar to our copyright statement template. – JohnL Jan 18 '11 at 13:51

I believe you could go a long way with diff. If you take a look at the underlying diff algorithms, they might be extensible to a finer granularity than the per-line level:

http://c2.com/cgi/wiki?DiffAlgorithm

svrist

You can still use diff if you alter its inputs so that every character appears on a separate line:

diff <(sed 's/\(.\)/\1\n/g' <file1) <(sed 's/\(.\)/\1\n/g' <file2) | wc -l

The <(...) syntax requires bash and conceptually means "run ..., directing its output to a temporary file somewhere, and replace the <(...) construct with the name of this temporary file, deleting it afterwards" (although it will use FIFOs if available instead of actual temporary files). It's a nice way to use the output of a process as a named input file for a program, which is necessary when the consuming program can't read from stdin or requires 2 separate sources of input (as is the case with diff here).

j_random_hacker