Your question is a little unclear: the title asks for an algorithm, but you also ask for a data structure. I'll describe how I handled a similar problem with music metadata.
The way I approached it was with a series of string transformations that produced a string with as much ambiguity removed as possible.
A few of the rules:
- Remove all whitespace
- Change all non-ASCII letters to ASCII equivalents. (ö -> o)
- Change everything to uppercase
- If a comma is encountered, swap the right side of the comma with the left
- Remove common words like "the", "of", etc.
- Change Roman numerals to Arabic ("VII" -> 7)
So I'd end up with:
Blue Öyster Cult -> BLUEOYSTERCULT
Amos, Tori -> TORIAMOS
The Red Hot Chili Peppers -> REDHOTCHILIPEPPERS
I then used that normalized string for all comparisons, though it was never exposed to the user. In my case, it simply served as the identifier.
The rules were necessarily a bunch of heuristics developed by experimenting with real CDDB data. It obviously wasn't guaranteed to be foolproof, but it wasn't that hard to find a set that worked most of the time.
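For illustration, here is a minimal Python sketch of the kind of transformation described above. It is not my original code: the function names, the stop-word list, and the choice to recognize only Roman numerals from 2 to 39 are all assumptions made for the example. It also tokenizes first and joins the words at the end, rather than stripping whitespace up front, so that common words can be matched as whole words.

```python
import re
import unicodedata

# Assumed stop-word list; a real one would be tuned against actual data.
STOP_WORDS = {"THE", "OF", "A", "AN", "AND"}

ROMAN_VALUES = {"I": 1, "V": 5, "X": 10}
# Only numerals built from X, V and I (up to 39), to limit false positives
# on ordinary words. This restriction is an assumption of the sketch.
ROMAN_RE = re.compile(r"X{0,3}(IX|IV|V?I{0,3})")


def roman_to_int(word):
    """Convert a validated Roman numeral (I, V, X only) to an integer."""
    total = 0
    for ch, nxt in zip(word, word[1:] + " "):
        value = ROMAN_VALUES[ch]
        # A smaller symbol before a larger one is subtracted (IV = 4).
        total += -value if ROMAN_VALUES.get(nxt, 0) > value else value
    return total


def normalize(name):
    """Reduce a name to an uppercase ASCII key used only for comparisons."""
    # Decompose accented letters, then drop anything that is not ASCII.
    # Letters with no decomposition (e.g. "ø") are simply dropped here.
    decomposed = unicodedata.normalize("NFKD", name)
    text = decomposed.encode("ascii", "ignore").decode("ascii").upper()

    # "Amos, Tori" -> "Tori Amos": swap the two sides of the first comma.
    if "," in text:
        left, right = text.split(",", 1)
        text = right + " " + left

    words = []
    for word in re.findall(r"[A-Z0-9]+", text):
        if word in STOP_WORDS:
            continue
        # Single letters such as "I" are left alone; they are usually words.
        if len(word) > 1 and ROMAN_RE.fullmatch(word):
            word = str(roman_to_int(word))
        words.append(word)
    return "".join(words)


print(normalize("Blue Öyster Cult"))           # BLUEOYSTERCULT
print(normalize("Amos, Tori"))                 # TORIAMOS
print(normalize("The Red Hot Chili Peppers"))  # REDHOTCHILIPEPPERS
```

Two names are then treated as the same entity when their normalized keys are equal. Each rule here can misfire on some input, which is why the real rule set has to be tuned against real data.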
Your issue isn't quite the same. Remakes will be a problem because your titles will match. That might be partially solvable by looking for dates in the title ("Total Recall (2013)") but I suspect that data will often be missing.
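As a rough sketch of the date idea, the snippet below splits a trailing four-digit year in parentheses off a title. The function name and the assumption that the year always appears as a "(YYYY)" suffix are mine; real data may format it differently or omit it entirely.

```python
import re

# Assumed format: a four-digit year in parentheses at the end of the title.
YEAR_SUFFIX = re.compile(r"\s*\((\d{4})\)\s*$")


def split_title_year(title):
    """Return (title_without_year, year), with year None if absent."""
    match = YEAR_SUFFIX.search(title)
    if match is None:
        return title.strip(), None
    return title[:match.start()].strip(), int(match.group(1))


print(split_title_year("Total Recall (2013)"))  # ('Total Recall', 2013)
print(split_title_year("Total Recall"))         # ('Total Recall', None)
```

The year can then be kept alongside the normalized title, so that two entries only match when the titles agree and the years either agree or at least one is missing.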