Apart from the points mentioned in the other answers (that it is hard to prove operations are independent, and that programmers tend to think serially), there is a third factor that needs to be considered: the cost of parallelization.
The truth is that thread parallelism has very significant costs associated with it:
Thread creation is very expensive: To the kernel, starting a thread is just about the same as starting a process. I'm not sure about the precise cost, but I believe it's on the order of ten microseconds.
Thread communication via mutexes is expensive: Usually this requires a system call on each side, possibly putting a thread to sleep and waking it up again, which costs latency and also leaves you with cold caches and flushed TLBs. On average, taking and releasing a contended mutex costs around a microsecond.
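If you want ballpark numbers for your own machine, a crude benchmark like the following will do. This is only a sketch: the iteration counts are arbitrary, the mutex loop measures the cheap uncontended case (contended handoffs that go through the kernel cost far more), and results vary wildly across OSes and hardware.

```cpp
#include <chrono>
#include <iostream>
#include <mutex>
#include <thread>

int main() {
    using clock = std::chrono::steady_clock;

    // Cost of spawning and joining a thread that does nothing.
    constexpr int kThreads = 1000;
    auto t0 = clock::now();
    for (int i = 0; i < kThreads; ++i)
        std::thread([] {}).join();
    auto t1 = clock::now();
    std::cout << "spawn+join: "
              << std::chrono::duration<double, std::micro>(t1 - t0).count() / kThreads
              << " us per thread\n";

    // Cost of an uncontended lock/unlock pair.
    std::mutex m;
    constexpr int kLocks = 1000000;
    auto t2 = clock::now();
    for (int i = 0; i < kLocks; ++i) {
        m.lock();
        m.unlock();
    }
    auto t3 = clock::now();
    std::cout << "lock+unlock: "
              << std::chrono::duration<double, std::micro>(t3 - t2).count() / kLocks
              << " us per pair\n";
}
```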
So far, so good. Why is this a problem for implicit parallelism? Because implicit parallelism is easiest to prove at small scales. It is one thing to prove that two iterations of a simple loop are independent of each other; it is a whole different thing to prove that printing something to stdout and sending a query to a database are independent of each other and can be executed in parallel (for all the compiler knows, the database process could be on the other side of the stdout pipe!).
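To make the contrast concrete, here is a sketch of the two situations (illustrative shapes only, not taken from any real codebase):

```cpp
#include <cstdio>

// Independence is easy to prove here: every iteration reads and writes
// disjoint array elements, so iterations can run in any order or in
// parallel (assuming the compiler can show a, b, and out don't alias).
void add(const float* a, const float* b, float* out, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

// Independence is undecidable here: both statements perform I/O with
// externally visible ordering. If `db` happens to be a pipe into the
// database process, and that process also reads our stdout, then
// running the two concurrently changes observable behavior.
void report(std::FILE* db, const char* query) {
    std::printf("running query\n");
    std::fputs(query, db);
}
```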
That is, the implicit parallelism that a compiler can actually prove is likely unexploitable, because the costs of parallelization outweigh the advantage of parallel processing. On the other hand, the large-scale parallelism that could really accelerate an application is not provable by a compiler. Just think about how much work a CPU can do within a microsecond: on the order of several thousand instructions. So if the parallel version is supposed to be faster than the serial program, it must keep all CPUs busy for several microseconds between any two mutex operations. That requires really coarse-grained parallelism, which is almost impossible to prove automatically.
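To see that break-even point in action, here is a deliberately fine-grained example: summing 1000 elements serially versus splitting the sum across two threads (the element count is an arbitrary choice, picked to be small). On a typical desktop the serial sum finishes in well under a microsecond, while the two thread creations alone cost tens of microseconds, so the "parallel" version loses:

```cpp
#include <chrono>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    std::vector<long> v(1000, 1);  // tiny workload on purpose
    auto mid = v.begin() + v.size() / 2;

    auto t0 = std::chrono::steady_clock::now();
    long serial = std::accumulate(v.begin(), v.end(), 0L);
    auto t1 = std::chrono::steady_clock::now();

    long lo = 0, hi = 0;
    auto t2 = std::chrono::steady_clock::now();
    std::thread a([&] { lo = std::accumulate(v.begin(), mid, 0L); });
    std::thread b([&] { hi = std::accumulate(mid, v.end(), 0L); });
    a.join();
    b.join();
    auto t3 = std::chrono::steady_clock::now();

    using us = std::chrono::duration<double, std::micro>;
    std::cout << "serial:   " << serial  << " in " << us(t1 - t0).count() << " us\n";
    std::cout << "parallel: " << lo + hi << " in " << us(t3 - t2).count() << " us\n";
}
```

Raise the element count into the millions and the balance tips the other way; the point is that only at that coarse granularity does parallelism pay, and that is exactly the granularity at which independence is hardest to prove.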
Finally, no rule without an exception: exploiting implicit parallelism works where no threads are involved, as is the case with vectorization of the code (using SIMD instruction sets like AVX, AltiVec, etc.). That indeed works best for the small-scale parallelism that's relatively easy to prove.
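For instance, a loop of the following shape is exactly the small-scale case from above, and mainstream compilers will vectorize it automatically (a sketch: `saxpy` is just the classic BLAS kernel name, and `__restrict` is a common compiler extension telling the optimizer the pointers don't alias):

```cpp
// Compile with e.g. `g++ -O3 -march=native -fopt-info-vec` or
// `clang++ -O3 -Rpass=loop-vectorize` and the compiler will report that
// it turned this loop into SIMD instructions (AVX on x86). No threads,
// no mutexes, no kernel involvement, so the overhead argument above
// doesn't apply.
void saxpy(float a, const float* __restrict x, float* __restrict y, int n) {
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```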