4 bits sensible minimum:
0-9 Numeric data needs 4 bits
0-9 = 10 words.
Next highest binary word size = 4 bits = 16 possible words.
So BCD data (binary coded decimal) = 4 bits
8 bits logical next jump
0-9, a-z, A-Z = 10+26+26 = 62 words.
Could handle with 7 bits = 128 words.
8 is about as easy as 7 and allows 2 x 4 bits so numeric data can be packed 2 per 8 bit byte.
Then 12 bits (not 16) ?:
Next logical size up = 12 bits and the early and very successful PDP-8 used 12 bits. 12 bits used for data and program allow 2^12 = 4096 address locations. As Bill Gates may possibly have once said "4K of memory should be enough for anyone".
The following PDP-11 family used 16 bits.
Doubling for compatability.
If you wish to interoperate with systems at lower and higher levels and if you want to have more capable devices in the same family, then being able to handle 2 words of the smaller system within the larger system word makes much sense.
BUT
The exceptions that prove the rule:
"Always" is such a strong word :-)
1-bit, 12-bit, 18-bit, 36-bit examples below.
18 & 36 bit machines were never microcontrollers.
1 & 12 bit ones were.
The one-bit system mentioned below is really a "random bits as you see fit" system. The one bit data word is essentially a go/no-go flag produced by computation and is used to enable or disable program activity. The program counter is an up counter that advances through memory cyclically with code being enabled or disabled as required. Very very very nasty indeed. By the time it arrived on the market the 8 bit processors of the day were quite mature and the 1-bit processor never really made much sense. I do not know how much use it ever got.
1-bit !!!:
Motorola MC14500B
I got an honourable mention by Jack Gansell for best description of this device :-)
Datasheet - click page for PDF download.

12-bit:
Harris HM-6100 aka Intersil IM6100 - 12 bit minicomputer wannabee](http://www.classiccmp.org/dunfield/other/i6100cfs.pdf)
Based on the vastly successful DEC PDP-8 12 bit minicomputer.
Overview
Program memory and data memory occupy the same memory space. The total size of directly addressable memory is 4 K words. Word size is 12 bits. The 6100 doesn't have stack memory.
Program memory size is 4 K words. All conditional instructions allow the processor to skip the next instruction only. To go conditionally to arbitrary address in memory when certain condition is met the code should execute "skip if the condition is not met" instruction first and put direct or indirect unconditional jump instruction after the skip instruction. Unconditional instructions can be used to jump directly within current page (127 words), or jump indirectly within the whole memory space (4 K words). The 6100 supports subroutine calls, but, due to lack of stack memory, the return address for subroutines is stored in memory. There is no "return from subroutine" instruction - the subroutine should use indirect jump to return back to the caller.
Data memory size is 4 K words. The data can be accessed directly within zero page (0000h - 007Fh) or within current 127-word page. The data can be accessed indirectly anywhere in 4 K words of memory.
Wikipedia - Intersil 6100
The PDP-8 & Intersil 6100 had 16 very rich instructions. There is NO substract instruction.
The ADD instruction is named TADD to remind you that it's 2's-complement add so we don't need no ... subtract instruction.
18 bit, 36 bit other - the PDP family:
Wikipedia Programmed Data Processor
PDP1 - 18 bit
PDP2 - 24 bit died aborning
PDP3, PDP6 - 36 bit
PDP-12 User Handbook (preliminary - Wow.
Despite numbering this is pre pre PDP16 - a PDP-8 on steroids with analog I/O capability - and engineering lab machine. I could have had one for free if I'd wanted, but it would not have fitted anywhere sensible - or insensible.
First computer game I ever played was on one of these.
Space War.
Machine was in two small-room sized cabinets.
You'd open a door and walk inside to do stuff to its internals.