A lot of answers here already... each addressing the issue in its own way.
I've been writing embedded low level software and firmware for over 25 years in a variety of languages - mostly C (but with diversions into Ada, Occam2, PL/M, and a variety of assemblers along the way).
After a long period of thought and trial and error, I have settled into a method that gets results fairly quickly and makes it fairly easy to create test wrappers and harnesses (where they ADD VALUE!).
The method goes something like this:
Write a driver or hardware abstraction code unit for each major peripheral you want to use. Also write one to initialise the processor and get everything set up (this creates the friendly environment). Typically on small embedded processors - your AVR being an example - there might be 10 - 20 such units, all small. These might be units for initialisation, A/D conversion to unscaled memory buffers, bitwise output, pushbutton input (no debounce, just sampled), pulse width modulation drivers, and UART / simple serial drivers that use interrupts and small I/O buffers. There might be a few more - e.g. I2C or SPI drivers for EEPROM, EPROM, or other I2C/SPI devices.
For each of the hardware abstraction layer (HAL) / driver units, I then write a test program. This relies on a serial port (UART) and processor init - so the first test program uses only those two units and just does some basic input and output. This lets me verify that I can start the processor and that I have basic serial I/O working for debug support. Once that works (and only then) do I develop the other HAL test programs, building these on top of the known-good UART and INIT units. So I might have test programs for reading the bitwise inputs and displaying them in a nice form (hex, decimal, whatever) on my serial debug terminal. I can then move on to bigger and more complex things like EEPROM or EPROM test programs - I make most of these menu driven so I can select a test to run, run it, and see the result. I can't SCRIPT it, but usually I don't need to - menu driven is good enough.
Once I have all my HAL units running, I then find a way to get a regular timer tick, typically at a rate somewhere between 4 and 20 ms. This must be regular and generated in an interrupt - the rollover / overflow of a timer counter is usually how it's done. The interrupt handler then INCREMENTS a byte-sized "semaphore". At this point you can also fiddle around with power management if you need to. The idea of the semaphore is that if its value is > 0, you need to run the "main loop".
The EXECUTIVE runs the main loop. It pretty much just waits for that semaphore to become non-zero (though I abstract this detail away). At this point, you can play about with counters to count these ticks (because you know the tick rate), and so you can set flags showing whether the current executive tick falls on an interval of 1 second, 1 minute, or other common intervals you might want to use. Once the executive knows that the semaphore is > 0, it runs a single pass through every "application" process's "update" function.
The application processes effectively sit alongside each other and get run regularly by an "update" tick. This is just a function called by the executive. It is effectively poor man's multitasking with a very simple home-grown RTOS that relies on all applications entering, doing a little piece of work, and exiting. Applications need to maintain their own state variables and can't do long-running calculations, because there is no pre-emptive operating system to force fairness. OBVIOUSLY the cumulative running time of the applications should be smaller than the major tick period.
The above approach is easily extended so you can add things like communication stacks that run asynchronously; comms messages can then be delivered to the applications (you add a new function to each, the "rx_message_handler", and you write a message dispatcher that figures out which application to dispatch each message to).
This approach works for pretty much any communication system you care to name - it has worked for many proprietary systems and open-standards comms systems, and it even works for TCP/IP stacks.
It also has the advantage of being built up in modular pieces with well-defined interfaces. You can pull pieces in and out at any time, or substitute different ones. At each point along the way you can add test harnesses or handlers that build upon the known-good lower layers (the stuff below). I have found that roughly 30% to 50% of a design can benefit from specially written unit tests, which are usually fairly easy to add.
I have taken this all a step further (an idea I nicked from somebody else who has done this) and replaced the HAL layer with an equivalent for the PC. For example, you can use C / C++ and WinForms or similar on a PC and, by writing the code CAREFULLY, emulate each interface (e.g. EEPROM = a disk file read into PC memory) and then run the entire embedded application on a PC. The ability to use a friendly debugging environment can save a vast amount of time and effort. Only really big projects can usually justify this amount of effort.
The above description is not unique to how I do things on embedded platforms - I have come across numerous commercial organisations that do similar. The implementations are usually vastly different, but the principles are frequently much the same.
I hope the above gives a bit of a flavour... this approach works for small embedded systems that run in a few kB with aggressive battery management, through to monsters of 100K or more source lines that run permanently powered. If you run "embedded" on a big OS like Windows CE or the like, then all of the above is completely immaterial. But that's not REAL embedded programming, anyhow.