
I have a stream of integer data and want to perform some statistical analysis on it: I want to calculate its mean and standard deviation. So far that isn't hard, but keep in mind that I am talking about a stream of data and would prefer not to store all of it. There are algorithms that compute both the mean and the standard deviation while keeping the stored state to a minimum - I'd refer to Wikipedia in this matter.
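
For reference, a minimal sketch of such a streaming update (Welford's online algorithm, one of the methods described on Wikipedia) could look like the following; the class and method names are just illustrative:

```python
class RunningStats:
    """Streaming mean and standard deviation (Welford's online algorithm)."""

    def __init__(self):
        self.n = 0        # number of values seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # sum of squared deviations from the running mean

    def push(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def stddev(self):
        # sample standard deviation; needs at least two values
        return (self.m2 / (self.n - 1)) ** 0.5 if self.n > 1 else 0.0
```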

But the problem is that some of the data will be completely absurd compared with the rest. For example, I will receive

1 2 2134 7 -2 14 // 2134 is out of line and junk, don't calculate with it

I know in what range my values will likely fall, but only relative to their average value. So I'd like to know whether there is a good approach to tackle this kind of noise.

It is even more annoying because the junk data will most likely be the first to arrive, and it is not in line with the mean of the rest of the sequence, so I can't precalculate the mean. An example would be something like

1111 1564 13 1645 12 -4 37 90 ...

The junk makes up less than 5% of the data, so at least there isn't much noise, but I have to keep it out of my calculation.


To distinguish between real and junk data: I know that the real data lies in a bounded interval around its mean value - scaled to the example above, something like ±200. "Groups" of junk data can behave the same way, except that their mean differs from the mean of the real data by at least one interval length.

At the beginning, without any data, I have no idea what the mean value of the real data might be.

I can replay only a limited amount (a small fraction) of the data, and only by first saving it to memory.


Is there a good algorithm that uses as little space as possible and works when applied iteratively to a sequence?

WorldSEnder
  • How do you determine what is noise and what is good data? – Adam Zuckerman Aug 06 '15 at 17:54
  • @AdamZuckerman The data lies in an interval within a bounded range of the overall mean of all relevant data. Relative to the example above, this range is about ±200. The junk, however, is separated into several groups, and each group has about the same value range around its own mean – WorldSEnder Aug 06 '15 at 17:56
  • So, at the beginning of the sequence, if you have an excessive number (let's say 9000), you have no way of knowing whether it is valid until after receiving many more data points? – Adam Zuckerman Aug 06 '15 at 18:02
  • @AdamZuckerman, that is correct. At the same time I want to keep the amount of data I keep cached as small as possible – WorldSEnder Aug 06 '15 at 18:05
  • @WorldSEnder If the junk data is separated into distinct groups, perhaps it would be feasible to keep track of a mean value for each of those groups, then decide which mean is the "real" mean by comparing how many values went into each group? – Ordous Aug 06 '15 at 18:34
  • What you need is a [Moving Average](https://en.wikipedia.org/wiki/Moving_average), a [Standard Deviation](https://en.wikipedia.org/wiki/Standard_deviation) and a bit of statistical analysis. You should be able to calculate how many points you need to have in your buffer to get a 99% confidence level in identifying your outlying data points. I'd be surprised if it is more than about twenty. Or, y'know, you can just tweak your algorithm until you get the results you expect. – Robert Harvey Aug 06 '15 at 18:44 (a sketch of this buffered approach appears after the comment thread)
  • @Ordous, the problem here is: how do you decide how to split them into groups at first? If you don't split from the beginning, when do you split a group into two subgroups because the range of its values got too large? I like RobertHarvey's idea. Precalculating a buffer size to get a confidence level seems feasible, as I have an expectation of how often I'll encounter a junk value. – WorldSEnder Aug 06 '15 at 18:48
  • @RobertHarvey That would work if you had the capacity to look at an arbitrary value, rather than a stream of values. The combination of "Junk is likely first to arrive" and "Junk is in several groups with their own valid mean values" means this situation is pretty much actively working against your method. – Ordous Aug 06 '15 at 19:35
  • @WorldSEnder can you replay the data? That is, would it be possible to perform an initial pass to analyze the data in order to determine what the good and bad values are, then process it a second time to do the actual calculations, discarding outlier values as determined by the criteria established in the first pass? – Snowman Aug 06 '15 at 20:04
  • @Snowman, the only way I could replay the data is by saving it and then replaying it from memory. I'd rather not do that. Replaying some of the data would be fine, though - for example the first few values – WorldSEnder Aug 06 '15 at 20:05
  • @Ordous: My assumption is that you would buffer the first 20 or so values before you started identifying outliers. I think you misunderstood my description. – Robert Harvey Aug 06 '15 at 20:15
  • @WorldSEnder if you could save it to memory to begin with, there would be little point in using an algorithm that calculates as it goes. Conceptually, I think of this as trying to calculate the average value from a sensor that is constantly producing data, where there is too much data to fit in a reasonable amount of storage. In other words, at the conceptual level, I assume it is _impossible_ to store the data. However, sometimes in those cases it is still possible to replay it. – Snowman Aug 06 '15 at 20:16
  • @Snowman, what I meant is that I can only save a tiny fraction of all of the data. See the updated question for a short recap of the info I gave in the comments here – WorldSEnder Aug 06 '15 at 20:19
  • @RobertHarvey No, that's exactly what I thought. But from what I understood from the OP's comments, there's a very high probability that all of those 20 or so values are both a) noise and b) noise that is very much like valid values from a different distribution – Ordous Aug 06 '15 at 20:34
  • @Ordous: Then you need a bigger buffer. – Robert Harvey Aug 06 '15 at 20:36
  • @RobertHarvey Indeed. But since most of the first values are noise (again from what I understand...) then you need to buffer around 10% of the data, which is unfeasible. I would completely agree on the method if you could pick random points, or if the noise was in random positions. – Ordous Aug 06 '15 at 20:40
  • @Ordous: Then throw out points until you get a break in the data, at which point you conclude that the "settling time" has passed. – Robert Harvey Aug 06 '15 at 20:45
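
To make the buffered approach discussed in the comments concrete, here is one possible sketch (the warm-up size, the greedy clustering step, and all names are assumptions for illustration, not something settled in the thread): buffer an initial window, treat the largest group of mutually close values as the real data (junk is under 5% overall), seed the running statistics from that group, and afterwards reject any value farther than the known interval from the running mean. How large the warm-up buffer has to be when the junk arrives first is exactly the contentious point above, so it is left as a parameter here.

```python
def filtered_stream_stats(stream, interval, warmup=50):
    """Mean and standard deviation of a stream, ignoring junk groups whose
    mean lies far away from the mean of the real data.

    `interval` is the half-width of the band the real data is known to occupy
    around its mean (about 200 in the question's example); `warmup` is how
    many initial values are buffered before the real group is identified.
    Streams shorter than `warmup` are not handled in this sketch.
    """
    n, mean, m2 = 0, 0.0, 0.0          # Welford state for the accepted values

    def accept(x):
        nonlocal n, mean, m2
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)

    buffer = []
    for x in stream:
        if len(buffer) < warmup:        # phase 1: collect a warm-up window
            buffer.append(x)
            if len(buffer) == warmup:
                # greedy 1-D clustering: add each value to the first group whose
                # running mean is within `interval`, otherwise start a new group
                groups = []
                for v in buffer:
                    for g in groups:
                        if abs(v - sum(g) / len(g)) <= interval:
                            g.append(v)
                            break
                    else:
                        groups.append([v])
                # junk is <5% of the data, so the largest group is taken as real
                for v in max(groups, key=len):
                    accept(v)
            continue
        # phase 2: stream, rejecting values outside the band around the running mean
        if abs(x - mean) <= interval:
            accept(x)

    stddev = (m2 / (n - 1)) ** 0.5 if n > 1 else 0.0
    return mean, stddev
```

Applied to the question's second example with `interval=200` and a warm-up that covers the values shown, the 1111/1564/1645 values end up in groups separate from the dominant one and are dropped, while the small values seed the statistics.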

1 Answer


Signal processing is a huge topic, and a lot depends on your application, but a good start for your research would be a Kalman filter. It's recursive, so it doesn't require storing the entire history, and is commonly used for tasks like filtering outliers out of a sensor signal.
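
As an illustration only (this is not code from the answer), a minimal one-dimensional Kalman filter with a random-walk state model and a simple innovation ("validation") gate might look like the sketch below. The class name, the process/measurement variances and the gate width are assumptions you would have to tune for your data; the gate is a common add-on for rejecting outliers, not something the Kalman filter does by itself.

```python
class ScalarKalman:
    """Minimal 1-D Kalman filter (random-walk model) with an innovation gate
    that ignores measurements far outside the predicted range."""

    def __init__(self, process_var=1.0, measurement_var=200.0 ** 2, gate=3.0):
        self.x = None             # state estimate (the filtered mean)
        self.p = None             # variance of the estimate
        self.q = process_var      # how fast the true mean is allowed to drift
        self.r = measurement_var  # expected measurement noise
        self.gate = gate          # reject if |innovation| > gate * sqrt(S)

    def update(self, z):
        if self.x is None:        # initialise from the first measurement
            self.x, self.p = float(z), self.r
            return self.x
        p_pred = self.p + self.q            # predict: state is a random walk
        s = p_pred + self.r                 # innovation variance
        innovation = z - self.x
        if innovation * innovation > (self.gate ** 2) * s:
            return self.x                   # gated out as an outlier
        k = p_pred / s                      # Kalman gain
        self.x += k * innovation
        self.p = (1.0 - k) * p_pred
        return self.x


kf = ScalarKalman()
print([round(kf.update(z), 1) for z in [1, 2, 2134, 7, -2, 14]])
# the 2134 outlier is gated out and never pulls the estimate away
```

Note that initialising from the first measurement runs into the same "junk arrives first" issue raised in the comments, so in practice you might still combine the filter with a short warm-up buffer.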

Karl Bielefeldt