5

I have a dataset containing list of users (around 50M).
Each user has an email address, name, and some more data columns.
I want to send a weekly email to those users, and the content of the email will be based on the user's data.
Each user should get the mail on a specific day of the week, based on his registration date.

How do I design a system that will allow me to query the users, and send them the custom email? How can I make that system scalable?

One approach I thought about is writing a worker that will query all the users that need to receive the email on the current day (this will be paginated), and will insert them to a queue where several workers can process them.

The problem with this approach is that I can't add workers that query the data and insert it to the queue, because I can't guarantee that each user will be queried by a single worker.

Maybe there is a specific DB or data access tool I can use? Is there a known pattern for this kind of processing?

I would appreciate your suggestions and thoughts on the subject.

Nicolás
  • 621
  • 6
  • 10
ItayMaoz
  • 159
  • 3
  • Sending hundreds of thousands of emails a week is by far the biggest problem here, other than that, the data set is nowhere near large enough to cause any problems. – biziclop Nov 30 '15 at 11:48
  • Email is just an example, it can be any kind of task that requires querying large amount of data. Do you think that that the simple design I wrote about in the question is sufficient for this dataset? And if so, what is a dataset size that will make it "break"? – ItayMaoz Nov 30 '15 at 11:52
  • Take a look at http://stackoverflow.com/questions/187998/row-offset-in-sql-server . – dbasnett Nov 30 '15 at 12:32
  • I know how to paginate the results, that's not the problem I am asking about. – ItayMaoz Nov 30 '15 at 12:37
  • Ah, okay. So some relatively long-running processing that is done on a per-record basis. If it doesn't have to update the original data (and it seems like that is the case), this could be a use case for map-reduce. – biziclop Nov 30 '15 at 12:44
  • Just to clarify: Are you saying that querying 50M users and registering a part of them into a queue is too big of a daily effort for a single worker? – Vlad GURDIGA Dec 05 '15 at 10:51

1 Answers1

2

Have you considered creating an index on your user table for each day of the week? Then you should be able to quickly retrieve the subset of users you want. Creating the index will probably take a bit of time, but subsequent updates shouldn't be too bad (unless you tend to bulk-populate your user table).

If this doesn't sound feasible, then you might want to consider partitioning or sharding your database. These approaches split the table into separate tables based on some criterion (day of the week, in your case). You'd probably want to create a view that encompasses all the partitions/shards to provide a complete table, but your worker threads could just focus on one.

TMN
  • 11,313
  • 1
  • 21
  • 31