
I'd like some advice on how to approach this problem. I have a database of ~3000 pictures of people. Their names are built into the filename but there is no standard format. Here are some common name formats: MarySue-042; henry higgins03; J. H. Doe; Jones, Peter; and M N Shyamalan, MD. Some have middle names and some don't; sometimes the last name comes first, sometimes it doesn't.

There are also some non-people names, like "1122 Lundee Street", "MemorialHospital" etc.

I'm renaming them in a standard format. I'd like to build a model that can

  1. Recognize whether a filename probably contains a person's name, and/or
  2. Determine which format the name follows.

I'd like some advice on the best way to do this. My current plan is to build a set of regular expressions for the most common formats and check whether each filename fits one. If a one-off name gets overlooked, I can fix it manually.

What I Tried So Far:

I've built a regular expression for the most common name format, FirstLast-number: `[A-Z][a-z]+[A-Z][a-z]+-[0-9]+`. The problem is that this also matches location names like "MemorialHospital-02". I considered discarding matches where either capitalized chunk exceeds a certain length, but some people have very long names, and that approach would wrongly exclude them.
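For concreteness, the check looks like this in Python (the language is just for illustration; any regex engine behaves the same):

```python
import re

# The FirstLast-number pattern described above.
pattern = re.compile(r"[A-Z][a-z]+[A-Z][a-z]+-[0-9]+")

samples = ["MarySue-042", "MemorialHospital-02", "henry higgins03", "J. H. Doe"]
for s in samples:
    print(f"{s}: {bool(pattern.fullmatch(s))}")
# MarySue-042: True           <- intended hit
# MemorialHospital-02: True   <- the false positive described above
# henry higgins03: False      <- missed: lowercase, space-separated format
# J. H. Doe: False            <- missed: initials format
```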

Furthermore, although this is the most common name format, a significant number of names follow other formats, so I'm still missing a lot.

lll
  • Distinguishing between names of people and not-people may be difficult considering that there may be no difference at all. For example: [Gary Street, Binghamton, New York](https://www.google.com/maps/place/Gary+St,+Binghamton,+NY+13905) and [Gary Street, English rugby coach](https://en.wikipedia.org/wiki/Gary_Street). – 8bittree Feb 13 '17 at 18:42
  • Gary should *totally* move to that street. – FrustratedWithFormsDesigner Feb 13 '17 at 18:56
  • Parsing names is a hard problem – see [Falsehoods Programmers Believe About Names](http://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/) – Dan Pichelman Feb 13 '17 at 19:02
  • Possible duplicate of [Name validation best practices](http://softwareengineering.stackexchange.com/questions/330512/name-validation-best-practices) – gnat Feb 13 '17 at 20:17

5 Answers

Score: 6

Here's how I would approach the problem.

  1. Start by getting a dictionary of persons' names (“John”) from somewhere.
  2. Get a dictionary of ordinary words (“hospital”) and geographical locations (“London”).
  3. For each of the ~3,000 strings, count the occurrences of persons' names, and the occurrences of ordinary words and geographical locations.

If a given string contains only persons' names, it's likely a person. If it contains only ordinary words and locations, it's probably not a person.

Hopefully, the strings that contain both are not too numerous. Those can be handled manually.
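A minimal sketch of that scoring in Python, assuming the three dictionaries are saved as plain-text word lists (the filenames below are hypothetical placeholders):

```python
import re

def load_wordlist(path):
    """Load a newline-delimited word list into a lowercase set."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

NAMES = load_wordlist("names.txt")    # given names and surnames
WORDS = load_wordlist("words.txt")    # ordinary dictionary words
PLACES = load_wordlist("places.txt")  # geographical locations

def classify(filename):
    """Return 'person', 'non-person', or 'ambiguous' for one filename."""
    # Try CamelCase chunks first so "MarySue" splits into ["Mary", "Sue"];
    # fall back to plain letter runs for lowercase or all-caps tokens.
    tokens = re.findall(r"[A-Z][a-z]+|[A-Za-z]+", filename)
    name_hits = sum(t.lower() in NAMES for t in tokens)
    other_hits = sum(t.lower() in WORDS or t.lower() in PLACES for t in tokens)
    if name_hits and not other_hits:
        return "person"
    if other_hits and not name_hits:
        return "non-person"
    return "ambiguous"  # queue these for manual review
```

With decent word lists this sorts the clear cases automatically and funnels everything else into the manual pile.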

Arseni Mourzenko
  • I'd add that possibly > 90% of the task can be automated, and a list of hard-to-decide names can be presented for human consideration. If it's a one-off task, having a bit of human involvement is OK. – 9000 Feb 13 '17 at 19:24
Score: 2

If you only have to do this once:

Write enough conversion algorithms to get the list of exceptions down to... (holds wet finger in the air) ...about 300. Then eyeball the rest. If the exceptions still number 600 after you've spent about an hour on this, though, then just stop there and eyeball the last 600.

If you have to spot and convert names forever:

Write enough converters to get the exceptions down to about 5-10%. If you're in doubt, just go for 10%.

In general, there are far too many exceptions to name rules for you to handle. (There's a blog post out there that I can't find at the moment, listing 40 exceptions to commonly believed rules about names. If you could catch them all, you could probably solve the strong AI problem.) So don't waste time trying. That said, you'll do all right by following the 80-20 rule or even a 90-10.

Score: 1

The simple approach of splitting the filenames on capitalisation and manually sorting the resulting tokens into names, possible names, and non-names will give you a good start: any filename that consists solely of non-names is likely not a person. There will still be a number of corner cases, though, even if all of the names are of English origin.

You could enhance that with a tool like OpenCV and a people recogniser, e.g. peopledetect.py for upright people or facedetect.py for portrait-type pictures (both examples are under 80 lines of code). If all of your photographs are portrait or passport style, then a rule that the detected face must cover over 50% of the image area could further refine the selection.

Assuming that people pictures include exactly one person, while place pictures tend to have either no people or several, this should filter out the majority of the “can't be 100% sure” cases, leaving a very limited number to categorise manually.
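A minimal sketch of that face-area rule, using the stock Haar cascade that ships with OpenCV instead of the sample scripts (the 50% threshold is the one suggested above; the detector parameters are illustrative and worth tuning on your own photos):

```python
import cv2

# Frontal-face Haar cascade bundled with the opencv-python package.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def looks_like_portrait(path, min_face_fraction=0.5):
    """Heuristic: exactly one detected face covering most of the image."""
    img = cv2.imread(path)
    if img is None:
        return False  # unreadable file
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) != 1:
        return False  # no face, or several: probably not a single-person portrait
    x, y, w, h = faces[0]
    return (w * h) / (img.shape[0] * img.shape[1]) >= min_face_fraction
```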

If you regularly have to run very large numbers of files through such filters, you could consider training a machine-learning system on the naming convention, the image content, or both.

Steve Barnes
Score: 0

For general approach, I would:

  1. Certainly build a few regular expressions for potential positive matches, which you're already doing.

  2. Also create some words/expressions for negative matches. E.g., the presence of "[Aa]venue" (or "Hospital") basically guarantees a non-name.

  3. ~3000 items is a small enough set to eventually eyeball. I would dump the output of #1 and #2 to a spreadsheet and scan for corner cases to fix manually; the first two steps are sketched below.
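A sketch of steps 1 and 2 feeding the spreadsheet in step 3 (the pattern lists are illustrative starters, not a complete set):

```python
import csv
import re

# Step 1: positive patterns for formats actually seen in the data.
POSITIVE = [
    re.compile(r"^[A-Z][a-z]+[A-Z][a-z]+-\d+$"),  # MarySue-042
    re.compile(r"^[a-z]+ [a-z]+\d*$"),            # henry higgins03
    re.compile(r"^[A-Z][a-z]+, [A-Z][a-z]+$"),    # Jones, Peter
]
# Step 2: keywords that basically guarantee a non-name.
NEGATIVE = re.compile(r"[Aa]venue|[Ss]treet|[Hh]ospital")

def label(name):
    if NEGATIVE.search(name):
        return "non-name"
    if any(p.match(name) for p in POSITIVE):
        return "probable name"
    return "review"  # corner cases for the manual spreadsheet pass

# Step 3: dump to CSV and eyeball the "review" rows.
with open("labels.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for name in ["MarySue-042", "MemorialHospital-02", "1122 Lundee Street"]:
        writer.writerow([name, label(name)])
```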

Score: 0

If you only have 3000 files, do it manually. Say it takes you 10 seconds to rename and move each file: that's 3,000 × 10 s = 30,000 s, just over 8 hours of work, and you'll be sure you've got them right.

There's always going to be an error rate in any algorithmic approach, so you really need a lot of data and time to evolve something that does better than a human.

Ewan
  • I'm going to have to update these things about once a month. I'm leaning towards regular expressions and then fixing the outliers manually. – lll Feb 13 '17 at 19:48
  • You'll never get them entirely right by this process. If you've got more of them to deal with over time, fixing the process by which they're originally named is much more important. – Jules Feb 14 '17 at 08:17