I've faced a LOT of these issues in the past and you're right, it's a nightmare. I suggest you start by putting yourself firmly on the side of best practice and saying "Getting a copy of live, however tempting, is NOT an option". It puts you in the right headspace from the very beginning. For my reasons why, see a previous answer of mine.
Getting a good test environment is crucial. These often evolve alongside your production environments and help you test upgrade paths as well as regular bugs. Putting time in here, and making sure you have a proper QA team and strategy, will pay dividends down the line.
Having said that, this is real life and there are always issues which are only discovered in live. So, how on earth can you investigate an issue which is occurring on one system, for one client, and nowhere else?
The key is logging.
You have the code, and you have the logs. What you need to do is a process of elimination to work out what's going on at various stages.
But what happens if the logs and data you need don't exist?
Then you're still a step forward: understanding what you need to solve the problem is the first step on the way to solving it. Identify what questions you have (did the code enter this IF statement or skip it?), add the logging that answers them, and prove it.
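To make the "did it enter this IF statement" question concrete, here is a minimal sketch in Python using the standard `logging` module. The function, the logger name and the discount rule are all made up for illustration. The only point is that each branch leaves its own line in the log, tagged with an identifier you can search for.

```python
import logging

logger = logging.getLogger("orders")  # hypothetical logger name


def apply_discount(order_id: str, total: float, is_trade_customer: bool) -> float:
    """Hypothetical business rule, instrumented so the logs can answer
    'did the code enter this IF statement or skip it?'"""
    # Record the inputs that decide which branch runs.
    logger.debug("apply_discount start order_id=%s total=%.2f trade=%s",
                 order_id, total, is_trade_customer)
    if is_trade_customer and total >= 100:
        logger.debug("discount branch TAKEN order_id=%s", order_id)
        total = round(total * 0.9, 2)
    else:
        logger.debug("discount branch SKIPPED order_id=%s", order_id)
    # Record the outcome so the before and after can be compared.
    logger.debug("apply_discount end order_id=%s total=%.2f", order_id, total)
    return total
```

With the order id on every line, the Ops team can filter the log for the one failing order and tell you which path it took, without anyone needing a copy of the live data.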
This is a lot easier said than done, so here are some pointers:
- Your progress on these issues is now inextricably linked to your release schedule, so rapid development and rapid deployment matter more than ever.
- Get the people who write the code to solve the problems, otherwise you'll have one team singing the praises of good logging and another ignoring it.
- NEVER log anything sensitive/inappropriate to insecure logs
- Keep your communication open. A client is a lot more responsive if they know the plan, understand what is in the release and know when they're going to get it.
- Removing developers from live systems does not necessarily mean they can't ask questions. Consider having them pair with the Ops team, asking questions but keeping their hands off the keyboard.
- Seriously consider leaving the logging in place. If it stung you once, keeping the diagnostic resources in there will make it much easier to solve again.
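Two of the pointers above (never log anything sensitive, and leave the logging in place) can be sketched together in Python. Treat this as one possible shape rather than a recipe: the patterns, the `DIAG_LOG_LEVEL` environment variable and the function names are assumptions of mine, and what counts as sensitive depends entirely on your system.

```python
import logging
import os
import re

# Hypothetical patterns; adjust to whatever is sensitive in your system.
_SENSITIVE = [
    (re.compile(r"(password=)\S+", re.IGNORECASE), r"\1[REDACTED]"),
    (re.compile(r"\b\d{13,16}\b"), "[REDACTED-NUMBER]"),
]

_LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
}


class RedactingFilter(logging.Filter):
    """Masks sensitive-looking values in records logged on the logger
    this filter is attached to, before its handlers see them."""

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()  # message with args already merged in
        for pattern, replacement in _SENSITIVE:
            message = pattern.sub(replacement, message)
        record.msg = message
        record.args = None  # args are merged above, so clear them
        return True  # keep the record, just scrubbed


def configure_diagnostics(logger: logging.Logger) -> None:
    # DIAG_LOG_LEVEL is a hypothetical variable name; unknown values fall
    # back to INFO so a typo cannot silently switch logging off.
    level_name = os.environ.get("DIAG_LOG_LEVEL", "INFO").upper()
    logger.setLevel(_LEVELS.get(level_name, logging.INFO))
    logger.addFilter(RedactingFilter())
```

The design choice here is that the diagnostic statements stay in the code permanently and the level is turned up by configuration when a problem comes back, so investigating again does not have to wait for a new release. Pattern-based scrubbing is only a safety net, not a substitute for not logging the sensitive value in the first place.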
The key to cracking this is having a good QA process and doing frequent incremental drops into production (which can help you investigate problems as you go). It's funny, that's the solution to a lot of software development headaches!