This is somewhat of an XY answer, but given that you started with

> read a body of text and compare it to search-engine results (from searching for substrings of the given text), with the goal of detecting plagiarism in, for example, academic papers.

it seems that text search itself is a good, practical answer to your problem. The basic way of detecting plagiarism would be the following:
- Start with a corpus of documents that the target document could have been plagiarized from.
- Create, e.g., a Lucene based inverted index over those documents (through say Solr or Elasticsearch).
- Split your target document into a set of phrases (e.g. by breaking off each sentence / sub-sentence / every n words).
- Search your corpus for each phrase. Each search returns a (possibly empty) set of documents that the phrase could have been plagiarized from, along with the location(s) in each document it was possibly taken from.
- Collect all of these potential instances of plagiarism. If the number of matched phrases exceeds a small threshold, flag the target document as probably plagiarized.
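The steps above can be sketched in a few lines of Python. This is only a toy stand-in: a plain dictionary plays the role of the Lucene-style inverted index, phrases are exact runs of `n` words, and the function names, the phrase length `n`, and the `threshold` value are all assumptions chosen for illustration rather than anything a real engine prescribes.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def phrases(tokens, n=5):
    """Yield (word_offset, phrase) for every run of n consecutive words."""
    for i in range(len(tokens) - n + 1):
        yield i, " ".join(tokens[i:i + n])

def build_index(corpus, n=5):
    """Map each n-word phrase to the (doc_id, word_offset) pairs where it occurs."""
    index = defaultdict(list)
    for doc_id, text in corpus.items():
        for pos, phrase in phrases(tokenize(text), n):
            index[phrase].append((doc_id, pos))
    return index

def find_matches(target, index, n=5):
    """Return {target_offset: [(doc_id, source_offset), ...]} for matched phrases."""
    hits = {}
    for pos, phrase in phrases(tokenize(target), n):
        if phrase in index:
            hits[pos] = index[phrase]
    return hits

def is_suspicious(target, index, n=5, threshold=0.2):
    """Flag the target if the fraction of matched phrases exceeds the threshold."""
    total = max(len(tokenize(target)) - n + 1, 0)
    if total == 0:
        return False
    return len(find_matches(target, index, n)) / total > threshold

# Hypothetical two-document corpus, purely for illustration.
corpus = {
    "wiki": "The quick brown fox jumps over the lazy dog near the river bank.",
    "essay": "Inverted indexes map terms to the documents that contain them.",
}
index = build_index(corpus, n=4)
target = "As everyone knows, the quick brown fox jumps over the lazy dog."
matches = find_matches(target, index, n=4)
for pos, sources in sorted(matches.items()):
    print(pos, sources)  # where in the target, and where it may have come from
print(is_suspicious(target, index, n=4))
```

Because each hit records both the offset in the target and the offset in the source document, a reviewer can be shown the two passages side by side. A real deployment would swap the dictionary for Solr or Elasticsearch queries and tune `n` and `threshold` against known examples.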
This approach has several advantages over trying to diff strings:
- It allows you to pinpoint exactly what in the target document might have been plagiarized and where it could have come from. This gives human reviewers visibility into the results so they can make informed decisions about them.
- A good indexing solution will buy you the ability to work around misspellings and different stop words / tiny differences in phrasing.
- A good indexing solution will scale very well.
- A self-managed corpus will behave much better than searching the internet. The internet is such a wild and unruly place that you are likely to get spurious matches and miss important ones. That is, Google may catch students copying from Wikipedia, but it is also liable to falsely accuse people of copying from random blogs if you are not very, very careful. It is also liable to miss things like arXiv papers in the field, essays students can buy from shady websites, and past essays written by other students, all of which are very realistic sources of plagiarism.
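To illustrate the point about misspellings and stop words, here is a minimal sketch of fuzzy phrase comparison. It is not how Lucene does it (real engines use analyzers, stop-word filters, and fuzzy queries); it substitutes Python's standard `difflib.SequenceMatcher` for that machinery, and the `STOP_WORDS` set and function names are assumptions made up for the example.

```python
import re
from difflib import SequenceMatcher

# A tiny, hypothetical stop-word list; real analyzers ship much larger ones.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "is", "are", "that"}

def normalize(text):
    """Lowercase, tokenize, and drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def similarity(a, b):
    """Character-level similarity in [0, 1] between the normalized phrases."""
    return SequenceMatcher(None, " ".join(normalize(a)), " ".join(normalize(b))).ratio()

original = "The index maps terms to documents"
reworded = "An index maps the terms to documnets"   # extra stop word plus a typo
unrelated = "Bananas are rich in potassium"

print(similarity(original, reworded))
print(similarity(original, unrelated))
```

The reworded phrase still scores close to 1 despite the typo and the changed stop words, while the unrelated phrase scores much lower, which is the kind of tolerance an exact string diff does not give you.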
If you think about Turn-it-in, their approach is probably similar to this, as they
- Tell you where the essay could have been plagiarized from
- Can include past papers and other sources beyond Wikipedia and the like.
The value that Turn-it-in and similar services can add over just setting up a system like this yourself (which honestly would not be too hard) is
- Size and quality of their reference corpus
- Development time of their UI
- Tuning of their indexing and searching
- Sophistication in how they determine phrases and their thresholds for likely plagiarism.