TL;DR: Yes, but not for most tasks. That's why the current iteration of this idea is hybrid CPUs with some performance cores and some efficiency cores, now that we have the transistor budget to put that many cores onto a consumer laptop/desktop CPU.
"But can't a system with many simple cores be more efficient than one with a single-digit number of advanced cores for some tasks?"
"For some tasks" is the rub. They're a lot worse for many other tasks, ones that haven't been or can't easily be parallelized. For looping over a medium-sized array, for example, it's often not worth talking to other CPU cores about doing some of the work, because the latency involved is comparable to the time it would take just doing the work in a single thread.
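A rough sketch of that overhead, using Python threads (which have far higher hand-off costs than raw inter-core latency, but the shape of the tradeoff is the same): splitting a medium-sized sum across 4 worker threads versus just doing it in one thread. The chunk sizes and thread count here are arbitrary illustrative choices.

```python
# Illustrative only: compare summing a "medium-sized" array in one
# thread vs. splitting it across 4 threads. The coordination cost
# often swamps the work itself at this size.
import time
from concurrent.futures import ThreadPoolExecutor

data = list(range(10_000))

t0 = time.perf_counter()
single = sum(data)                       # one thread, no coordination
t_single = time.perf_counter() - t0

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    chunks = [data[i::4] for i in range(4)]   # 4 strided chunks
    parallel = sum(pool.map(sum, chunks))     # distribute, then combine
t_parallel = time.perf_counter() - t0

assert single == parallel   # same answer either way
print(f"single: {t_single*1e6:.0f} us, 4 threads: {t_parallel*1e6:.0f} us")
```

On most machines the threaded version loses badly at this problem size; only for much larger arrays does distributing the work pay for itself.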
And a "many simple-ish CPU" processor is worse than GPUs for tasks that are highly parallel, and don't have much data-dependent branching. I.e. where high latency can be tolerated to achieve the high throughput per-power and per-die-area GPUs can provide. So as other answers have pointed out, the middle ground between throughput-optimized GPUs and latency-optimized CPUs isn't very big in terms of commercial demand. (Stuff like CPU SIMD does it well enough for most things, although with power-efficiency becoming ever more important, there is room for hybrid CPUs with some efficiency cores.)
Per-thread performance is very important for things that aren't embarrassingly parallel. (And also because memory bandwidth and cache footprint scale somewhat with the number of threads for many workloads, so there's a lower limit on how simple/small you'd want to make each core without a totally different architecture such as a GPU.)
A system with fewer big cores can use SMT (Simultaneous Multi-Threading, for example hyperthreading) to make those big cores look like 2x as many smaller cores. (Or 4x or 8x, for example in IBM POWER CPUs.) This is not as power-efficient as actually having more smaller cores, but is in the same ballpark. And of course simple OS context-switching makes it possible to run as many software threads as you want on a core, while the reverse is not possible: there's no simple way to use lots of simple cores to run one thread fast.
There are diminishing returns from making a single core ever bigger and more complex. The opposite of this question, Why not make one big CPU core?, has answers covering that tradeoff.
Related: Modern Microprocessors A 90-Minute Guide! has a section on SMT and multi-core, and is excellent background reading on CPU design constraints like power.
Making large cache-coherent systems is hard, which limits scaling to huge numbers of CPU cores. The biggest Xeon and Epyc chips pack 56 or 64 physical cores onto one die.
Compare this with Xeon Phi compute cards, which were pretty much what you're wondering about: AVX-512 bolted onto low-power Silvermont cores, going up to 72 cores per card with some high-bandwidth memory. (And 4-way SMT to hide ALU and memory latency, so it actually supported 4x that many threads.)
They discontinued that line in 2018 due to lack of demand. This article says it's "never seen any commercial success in the market". You couldn't get big speedups from running existing binaries on it; code generally needed to be recompiled to take advantage of AVX-512. (I think Intel's toolchain was supposed to be able to auto-parallelize some loops, so source changes might have been smaller or less necessary than for using GPUs.) And it omitted AVX-512BW, so it wasn't good for high-quality video encoding (x264/x265 as opposed to fixed-function hardware); I think it was primarily good for FP work, which means it was competing with GPUs. (Some of the reason may also have been Intel shifting effort to a new from-the-ground-up architecture for "exascale" computing, after seeing how the computing landscape evolved since the start of the Larrabee project in the mid 2000s; Larrabee was originally designed as a more-programmable GPU based on x86 cores(!), but evolved into Xeon Phi once GPGPU did that job better and Intel found a more appropriate niche.)
Hybrid / heterogeneous CPUs: some fast cores, some efficient cores
The latest iteration of your idea is to have a mix of cores, so you can still have a few fast cores for serial / latency-sensitive stuff.
Some code is only somewhat parallel, or has a few different threads doing separate tasks that are individually serial. (Not really distributing one task across many threads.)
ARM has been doing that for a while (calling it big.LITTLE), and Intel's new Alder Lake design with a mix of Performance (Golden Cove) cores and Efficiency (Gracemont) cores is exactly this: add some relatively-small throughput-optimized cores that don't push so far into the diminishing returns for spending more power to increase per-thread throughput.
So when doing "light weight" work where an E-core is enough to keep up with something that isn't useful to do faster (like playing a video or typing / clicking around on a web page), only that small core needs to be powered up.
Or when doing some number crunching / video encoding / whatever with lots of thread-level parallelism, 4 E-cores in the area of one P-core give you more total throughput. But you still have some P-cores for tasks that aren't (or simply haven't been) parallelized.
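A back-of-envelope version of that area argument (the 4-cores-per-P-core-area figure is roughly what Intel claimed for Gracemont vs. Golden Cove; the per-thread ratio of 0.6 is just an assumed illustrative number, not a measurement):

```python
# Assumed numbers, not measurements: if one E-core delivers ~0.6x the
# per-thread throughput of a P-core, but four E-cores fit in the die
# area of one P-core, the E-core cluster wins on total throughput
# per area -- for work that actually scales across threads.
e_per_thread = 0.6        # assumed: E-core throughput relative to a P-core
e_cores_per_p_area = 4    # rough area ratio, Gracemont vs. Golden Cove
cluster_throughput = e_per_thread * e_cores_per_p_area
assert cluster_throughput > 1.0   # beats one P-core in the same area
print(cluster_throughput)
```

The flip side of the same arithmetic: a single thread pinned to one of those E-cores only runs at 0.6x speed, which is why you still want some P-cores around.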
(I wrote in more detail about Alder Lake on superuser.com.)
Even the E cores on Alder Lake do fairly wide superscalar out-of-order exec, and can have quite good throughput on code where the instruction-level parallelism is easy for the CPU to find. (On ARM big.LITTLE, the little cores are often in-order, but still 3-wide superscalar with stuff like hit-under-miss caches to find some memory-level parallelism. e.g. Cortex-A53)
For most systems for general workloads, it's not commercially viable to not have any latency-optimized cores that have high single-threaded performance. Many tasks don't easily parallelize, or simply haven't been because that's much more programming effort. (Although low-end smartphones sometimes only use low-end cores; people would rather have cheap and slow than no phone at all, and power efficiency matters even more than for laptops.)
Previous many-small-cores CPUs:
I already mentioned Xeon Phi, but years before that another interesting example was Sun UltraSPARC T1, aka Niagara, released in 2005.
It was an 8-core CPU (or 4 or 6 core for other models), at a time when x86 CPUs were only just starting1 to introduce dual-core such as Athlon X2. They weren't even trying to aim for high per-thread performance, which was essential for most interactive use back then. Instead they were aiming at server / database workloads with many connections so there was already plenty of thread-level parallelism for software, even back then.
Each core had a fairly simple pipeline, and was a "barrel" processor, rotating between up to 4 non-stalled logical cores aka hardware threads. (Basically like 4-way SMT, but in-order so instructions from separate threads never mix in an execution unit.) Keeping cores small and simple made locking overhead lower, I think.
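The rotation policy above can be sketched as a toy round-robin scheduler: each cycle, issue from the next hardware thread in order, skipping any that are stalled (e.g. waiting on a cache miss). This is a simplified illustration, not Niagara's actual issue logic.

```python
# Toy model of a barrel processor's thread rotation: one instruction
# issued per cycle, round-robin over hardware threads, skipping
# stalled ones.
def barrel_schedule(stalled_by_cycle, n_threads=4):
    """stalled_by_cycle: one set of stalled thread ids per cycle.
    Returns the thread id issued each cycle (None if all stalled)."""
    order = []
    nxt = 0  # next thread to consider
    for stalled in stalled_by_cycle:
        issued = None
        for i in range(n_threads):
            cand = (nxt + i) % n_threads
            if cand not in stalled:
                issued = cand
                nxt = (cand + 1) % n_threads
                break
        order.append(issued)
    return order

# Thread 2 stalled (say, on a cache miss) for the first three cycles:
print(barrel_schedule([{2}, {2}, {2}, set()]))  # -> [0, 1, 3, 0]
```

As long as at least one thread is runnable, the pipeline stays busy every cycle; that's how the design hides memory latency without out-of-order execution.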
32 logical cores was a huge number back in 2005. (And later generations were multi-socket capable, allowing 2x or 4x that in a whole system.) The wiki article mentions that without special compiler options (to auto-parallelize, I assume), it left a lot of performance on the table for workloads that weren't already parallelized, like gzip
or MySQL (back in 2005, and IDK if they were talking about a single big query or what). So that's the downside of having many weak cores, especially if you really lean into it like Sun did by making it with not much cache, and depending on the barrel processor to hide latency.
Footnote 1: Part of that was that most of the x86 market was for machines that would run Windows, and until Windows XP the mainstream versions weren't SMP-capable IIRC. But servers had been multi-socket for a long time, achieving SMP by having multiple physical CPU packages in separate sockets. But primarily it was that transistor budgets were still at a point where more cache and wider/deeper OoO exec were still providing significant gains.
Parallelization of many tasks is hard
Communication between cores is always going to be pretty expensive (at least high-latency), partly because the physical distances are inherently large, and partly because of the out-of-order execution cores use to hide memory and L3-cache latency. A core can speculatively see its own actions (e.g. via store forwarding), but stores can't be made visible to other cores speculatively, otherwise you'd have to roll them back in every core that saw them on detecting a branch mispredict. (Which is completely impractical in terms of maintaining a consistent point to roll back to, and would defeat the purpose of having it be a separate core.)
So inter-core communication takes at least the latency of the interconnect and cache-coherency protocol, plus the latency of the store buffer and out-of-order exec window. (And the memory-ordering barriers normally involved limit out-of-order execution somewhat.)
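To get a feel for the gap, here's a sketch that measures the round-trip cost of handing a tiny piece of work to another thread and waiting for the reply, versus just doing it locally. Python's queue round-trip (which includes locks and OS wakeups) is orders of magnitude slower than raw hardware inter-core latency, so treat this as a loose analogy for "communication costs dwarf tiny work items", not as a measurement of the interconnect.

```python
# Loose analogy: round-trip to another thread vs. doing trivial work
# locally. Queue hand-off cost stands in for inter-core communication.
import queue
import threading
import time

req, resp = queue.Queue(), queue.Queue()

def worker():
    while True:
        x = req.get()
        if x is None:          # sentinel: shut down
            return
        resp.put(x + 1)        # the "work": one addition

t = threading.Thread(target=worker)
t.start()

N = 1000
t0 = time.perf_counter()
for i in range(N):
    req.put(i)
    assert resp.get() == i + 1   # wait for the other thread's reply
roundtrip = (time.perf_counter() - t0) / N

t0 = time.perf_counter()
for i in range(N):
    _ = i + 1                    # same work, done locally
local = (time.perf_counter() - t0) / N

req.put(None)
t.join()
print(f"round-trip: {roundtrip*1e6:.1f} us/op, local: {local*1e9:.1f} ns/op")
```

The per-operation round-trip is typically microseconds, while the local addition is nanoseconds, which is why work only gets farmed out to other cores in chunks big enough to amortize the hand-off.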