0

Consider application where users rates products (e.g. 1-5 stars).

Through passage of time, there might be millions of records. One can create desired indexes and/or keep sum and count of all ratings to easily get overall average score of particular product or seller.

The problem I am trying to tackle is how to efficiently provide an average score based on ratings from last year (365 days). In other words, any rating that is older than 365 days MUST NOT be considered for average score.

Three most naïve approaches are:

  1. Calculate it on DB level, using GROUP BY and AVG (or similar DB functions).
  • cons: it might analyze thousands of rows to give actual rating
  1. Have a background process that updates average_score once per day for all products and sellers.
  • cons: average_score is updated only once per day
  1. Recalculate average_score every time new rating is added.
  • cons: average_score is recalculated too often

My question is if there is better/smarter way to do it, taking into account there might be hundreds of thousands ratings per day?

  • 1
    #3 alone does not work, since average ratings change automatically each day, even when there is no new rating. It might be combined with #2, of course. – Doc Brown Feb 02 '22 at 15:30

1 Answers1

2

This depends on how important the one-year window is. If you get a request at 3:15 PM on January 31st of a year, is it important that you exclude a rating from 3:14 on January 31st of the previous year?

A nice thing about averages is that you can break them up and recalculate them. So if it is acceptable to include ratings from one year ago by day (rather than by minute - that is, it would be OK to include the 3:14 request from the previous year), then you can calculate a daily average and use that as a baseline, and re-calculate against that baseline value throughout the day. For example, in the January 31st example, you would have calculated a baseline at the start of the day, and keep track of not only what the average is but also the number of ratings that it was calculated using. So say your baseline was 4.315 stars, based on 1000 ratings. You receive a request at 3:15. You grab all ratings that occurred since the baseline, and find that 20 ratings have given you an average of 2.5 stars. You then multiply the averages back out - 4.315 * 1000 = 4315, 2.5 * 20 = 50. You can then add those together and re-divide: (4315 + 50) / (1000 + 20) = 4.279.

You can use this strategy more generally to break down your recalculations - for example, you could store a monthly average for each month, for example. And if minute-by-minute granularity is truly important, you actually could break this down further, so for example storing weekly averages, and recalculating using the partial weeks for the current week and the one-year-ago week.

autophage
  • 872
  • 4
  • 7
  • It will work, but it's not space-efficient. Keeping daily average per product (or seller) will multiply amount of records by 365 (excluding days without rating). I guess recalculating it once per day is best option from business, efficiency and simplicity perspective. – Maciej Pszczolinski Jan 31 '22 at 16:47
  • 3
    @MaciejPszczolinski: there is no need to store the daily average for each of the 365 days. You just store the total sum of ratings over the last year together with the number of ratings in that period (which requires two attributes in the product record, not 365 additional records). After midnight, you determine the sum and number of the ratings from day 366 in the past on-the-fly, relying on the ratings table having an appropriate index on the timestamps. These value can then be used to adapt the moving average for the next day, – Doc Brown Feb 02 '22 at 15:25
  • @MaciejPszczolinski At an absolute bare minimum, you must be able to track the average/sum of reviews on a daily basis. Otherwise there is no way to determine whether a review is more than 365 days old or not. That is an information theoretic limit. That being said, if you have a need to be more efficient than that, there are weighted average approaches which allow the value of old reviews to taper off without actually requiring you to track individual days. Of course, if you are actually keeping record of reviews, this is a moot point -- they're bigger than the averages. – Cort Ammon Oct 29 '22 at 21:41