I'd like some advice on how to approach this problem. I have a database of ~3000 pictures of people. Their names are built into the filename but there is no standard format. Here are some common name formats: MarySue-042; henry higgins03; J. H. Doe; Jones, Peter; and M N Shyamalan, MD.
Some have middle names and some don't; sometimes the last name comes first, sometimes it doesn't.
There are also some non-people names, like "1122 Lundee Street", "MemorialHospital", etc.
I'm renaming them to a standard format. I'd like to build a model that can:
- Recognize a probable name format, and/or
- Determine which format the name follows.
I'd like some advice on the best way to do this. My plan at the moment is to build a set of regular expressions for the most common formats and check whether each filename fits one. If a one-off name gets overlooked, I can change it manually.
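To make the plan concrete, here is a rough sketch of what I have in mind. The labels, the patterns, and the classify helper are placeholders I made up for illustration, covering only the example formats listed above; the real list would need more entries and tuning.

```python
import re

# Hypothetical labels and patterns, tried in order; the first full match wins.
PATTERNS = [
    ("FirstLast-NNN", re.compile(r"[A-Z][a-z]+[A-Z][a-z]+-[0-9]+")),
    ("first last NN", re.compile(r"[a-z]+ [a-z]+[0-9]*")),
    ("Last, First", re.compile(r"[A-Z][a-z]+, [A-Z][a-z]+")),
    # One or more initials (with or without a period), a last name,
    # and an optional all-caps suffix such as ", MD".
    ("Initials Last", re.compile(r"(?:[A-Z]\.? )+[A-Z][a-z]+(?:, [A-Z]+)?")),
]


def classify(stem):
    """Return the label of the first pattern matching the whole stem, else None."""
    for label, pattern in PATTERNS:
        if pattern.fullmatch(stem):
            return label
    return None  # unrecognized: leave for manual review


samples = [
    "MarySue-042", "henry higgins03", "J. H. Doe", "Jones, Peter",
    "M N Shyamalan, MD", "1122 Lundee Street", "MemorialHospital",
]
for stem in samples:
    print(f"{stem!r}: {classify(stem)}")
```

With these placeholder patterns, each of the five example names gets a label, while "1122 Lundee Street" and "MemorialHospital" come back as None and would go to the manual pile.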
What I Tried So Far:
I've built a regular expression for the most common name format, FirstLast-[0-9]. It's [A-Z][a-z]+[A-Z][a-z]+-[0-9]+. The problem is, this also picks up location names like "MemorialHospital-02". I thought about discarding matches where either capitalized part exceeds a certain length, but I have some people with very long names that this approach would wrongly exclude.
Furthermore, although this is the most common name format, a significant number of names are in other formats, so I'm still missing a lot.
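Here is a small reproduction of both problems. The cap of 7 letters is an arbitrary value chosen for the demo, the accepts helper is hypothetical, and "AlexandraWashington-07" is a made-up stand-in for one of the long real names.

```python
import re

# Same regex as above, with capture groups added around the two name parts.
FIRST_LAST_NUM = re.compile(r"([A-Z][a-z]+)([A-Z][a-z]+)-([0-9]+)")


def accepts(stem, max_len=None):
    """True if stem fits FirstLast-NNN and, when max_len is given,
    neither capitalized part is longer than max_len letters."""
    m = FIRST_LAST_NUM.fullmatch(stem)
    if m is None:
        return False
    if max_len is None:
        return True
    return len(m.group(1)) <= max_len and len(m.group(2)) <= max_len


for stem in ["MarySue-042", "MemorialHospital-02", "AlexandraWashington-07"]:
    # Arbitrary cap of 7 letters, for illustration only.
    print(stem, accepts(stem), accepts(stem, max_len=7))
```

Without the cap, all three are accepted, including the hospital. With the cap, the hospital is rejected, but so is the long (made-up) person name, which is exactly the trade-off I'm stuck on.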