It all comes back to pattern recognition. The reason we have facial recognition software is that faces, for the most part, share similar traits, and most photos of people show them facing the camera. Even so, it was very difficult to develop. Now consider a picture of a bomb striking a populated target. How would you define a pattern for that? The picture would contain an explosion: fairly simple for a human to recognize, but rather hard to program for. An explosion can vary widely in size, shape, and color. Suppose you build some sort of rule-based system. What if the picture shows an explosion set off for avalanche control in the mountains? That is neither violent nor offensive, yet it shares many of the same visual characteristics.
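To see why a rule-based approach breaks down, here is a deliberately naive sketch: flag an image as an "explosion" if enough of its pixels fall in a fiery orange/red range. The function name, the color range, and the threshold are all made-up illustrative values, not any real moderation system's logic.

```python
def looks_like_explosion(pixels, threshold=0.3):
    """Naive rule: an image is an 'explosion' if a large fraction of
    its pixels are fiery orange/red.

    pixels: iterable of (r, g, b) tuples, 0-255 per channel.
    threshold and the color bounds below are arbitrary illustrative values.
    """
    pixels = list(pixels)
    fiery = sum(1 for (r, g, b) in pixels
                if r > 180 and g < 160 and b < 100)
    return fiery / len(pixels) >= threshold

# Two hypothetical 100-pixel images with near-identical color statistics:
# one an airstrike photo, one a harmless avalanche-control blast.
airstrike = [(220, 120, 40)] * 60 + [(30, 30, 30)] * 40
avalanche_control = [(210, 130, 50)] * 55 + [(240, 240, 240)] * 45

print(looks_like_explosion(airstrike))          # True
print(looks_like_explosion(avalanche_control))  # True: a false positive
```

The rule fires on both images, because color statistics alone cannot capture *context*, which is exactly the distinction a human makes instantly.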
As many other people have pointed out, it is very hard to define obscenity. The US Supreme Court attempted to in Miller v. California, and as a result we have the so-called Miller test, which consists of three parts:
- Whether "the average person, applying contemporary community standards", would find that the work, taken as a whole, appeals to the prurient interest;
- Whether the work depicts or describes, in a patently offensive way, sexual conduct specifically defined by applicable state law;
- Whether the work, taken as a whole, lacks serious literary, artistic, political, or scientific value.
Nice and ambiguous, which makes it rather hard to program for. That is why most sites that allow image uploads rely on some form of human moderation. For example, pictures posted to online dating sites typically have to follow a set of rules and be approved manually.