The challenge in producing valid empirical evidence for anything is dealing with confounding variables. Confounding variables are factors that tend to be linked to the thing under test and that also affect the outcome being measured. For example, many people who smoke also tend to do little exercise. So when trying to determine whether smoking increases the risk of heart attacks, I have to control for exercise, e.g. by ensuring my group of smokers and my control group do the same amount of exercise. That way, I can be more confident that any differences between the two groups are related to smoking rather than to exercise.
What does this have to do with the question? Well, the same applies to code katas. It might be the case that people who tend to do code katas are also the sort of people who listen to podcasts, read blogs, attend conferences etc. So were I to try to scientifically measure whether there was a significant difference between coders who do katas and those who don't, I'd have to control for all those confounding variables. Even then, it might be the case that the kata group is more likely to drink decaf coffee, and caffeine levels might affect coding skills, so I'd have to control for that too. And so the list would go on. There are so many confounding variables around coding that it would be hard to deal with all of them.
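To make the problem of linked variables concrete, here is a minimal Python sketch using entirely made-up numbers. It assumes a hypothetical hidden trait, `engaged` (the sort of coder who also reads blogs and attends conferences), which makes someone both more likely to do katas and more skilled, while the katas themselves have no effect at all. The names `simulate` and `gap`, the probabilities and the skill scores are all illustrative assumptions, not data from any study.

```python
import random
from statistics import mean


def simulate(n=20000, seed=0):
    """Generate (engaged, does_katas, skill) rows for n imaginary coders."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # Hypothetical hidden trait: half of all coders are "engaged".
        engaged = rng.random() < 0.5
        # Assumption: engaged coders are far more likely to do katas.
        does_katas = rng.random() < (0.8 if engaged else 0.2)
        # Assumption: skill depends on engagement ONLY; katas add nothing.
        skill = rng.gauss(60 if engaged else 50, 5)
        rows.append((engaged, does_katas, skill))
    return rows


def gap(rows):
    """Mean skill of kata-doers minus mean skill of everyone else."""
    kata = [skill for _, does_katas, skill in rows if does_katas]
    no_kata = [skill for _, does_katas, skill in rows if not does_katas]
    return mean(kata) - mean(no_kata)


rows = simulate()

# Naive comparison: ignores engagement, so katas appear to "work".
naive_gap = gap(rows)

# Controlled comparison: compare like with like within each group.
engaged_gap = gap([r for r in rows if r[0]])
unengaged_gap = gap([r for r in rows if not r[0]])

print(f"naive gap:             {naive_gap:.2f}")
print(f"gap among engaged:     {engaged_gap:.2f}")
print(f"gap among not engaged: {unengaged_gap:.2f}")
```

In this toy world the naive comparison shows kata-doers scoring several points higher, yet the gap shrinks to roughly zero once engaged coders are only compared with other engaged coders. The catch is that the sketch only works because I knew in advance which single variable to control for and could observe it perfectly; a real study has neither luxury.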
And even if I could control for all of them, how would I measure whether katas make someone a better coder? "Good coder" is too subjective to measure in this regard, not least because there is no universally agreed "best practice" for writing code that fits every development scenario.
So the answer to your question, "is there scientific evidence showing that katas work?", is very likely "no". It would be a very difficult study to set up, and anyone who did would struggle to have their data set accepted as reliable: the assumptions they'd have to make in controlling for all those linked variables, and in defining what is being measured, would be all too easy to criticise as false.