3

I'm currently designing a conceptual web app, where users can submit posts, comments, and "like"/"dislike" both of those. However, I am not sure of how to store the (dis)likes, because of how many there could be and how many queries would be run on their corresponding table. (For example, getting (dis)likes per post, getting which posts/comments the user has liked, calculating which posts are trending based on how many likes they have received recently.)

In the concept, the web app is fairly popular. Let's compare it to Facebook and say it's 1/1000th as popular as Facebook. In 2012, Facebook dealt with 2.7 billion likes per day (probably more now, but we'll go with the 2012 stats). That means that the conceptual web app is going to deal with 2.7 million likes a day, which is almost 30 billion per year. 30 billion inserts a year and many more queries on the table.

I have two main choices when it comes to database system; SQL or NoSQL. I have already chosen MySQL for the other parts of the web app. To my knowledge, NoSQL databases like Cassandra are faster with inserts, but would there be a noticeable general performance difference?

gnat
  • 21,442
  • 29
  • 112
  • 288
tomatocan
  • 39
  • 2
  • 2
    Note that tool/library/product recommendations are off-topic on this site, so I replaced the specific tools you mentioned with "SQL vs NoSQL" as that is the major conceptual difference between the two tools you specified. – Ixrec Mar 23 '16 at 23:13
  • 1
    related: [Why NoSQL over SQL?](http://programmers.stackexchange.com/questions/109192/why-nosql-over-sql) "There is only one scenario where NoSQL is the way to go. And that is when you need to be very very flexible regarding database traffic. That means you have to expect that all of a sudden the database-requests explode by a factor of - let's say - 10 or 100. The ability to react to such situations is called elasticity..." – gnat Mar 24 '16 at 07:12
  • You can store an aggregated value of the total likes/dislikes per post. Do you have a need to itemize all of these for each user and/or question? – JeffO Mar 24 '16 at 13:23

3 Answers3

8

No startup has ever written the first version of their software with this kind of scalability in mind.

Facebook started out in PHP, and wrote a cross-compiler to convert their PHP code to C++ to reduce the number of servers they need by 50%. Twitter made major architectural changes, and got a 3X improvement in speed.

In both cases, they started out with a small but nimble system, typically in a rapid development tool, and switched to more robust, scalable systems later. The ability to write a functioning system and get it to market quickly is all that matters when you're small.

If you ever get as big as Facebook or Twitter, you're scalability woes will be a good problem to have, and you'll have the time and money to fix them properly.

Robert Harvey
  • 198,589
  • 55
  • 464
  • 673
  • 1
    In particular, the Twitter founders, when asked whether the performance of MRI (at that time the only production-ready Ruby implementation, and an extremely slow one) was an obstacle to scaling the service, they said it's exactly the opposite: the agility and flexibility of Ruby was what made the radical architectural changes possible that were required for scaling. Later on, they decided to re-write their home-grown message queue from Ruby to Scala and got another massive boost, which however was not necessarily solely related to the language switch but also to the fact that the Ruby queue … – Jörg W Mittag Mar 23 '16 at 22:54
  • … was the first time any of them had ever written a message queue, and the second time around they had learned to avoid the mistakes they made with the first one. Now they employ a healthy mix of a whole bunch of technologies, still including Ruby on Rails, but also a lot of Scala. – Jörg W Mittag Mar 23 '16 at 22:56
  • And you can read about how the stack sites are built along with others like Basecamp.com and discover that when it comes to high-performing websites, there are a lot of toys between the web server and data storage. – JeffO Mar 24 '16 at 13:30
  • The BaseCamp example is very interesting. Instead of making everything client-side Javascript MVVM, they used a "sheet-based UI" which is generated server-side. You can read more about it at [How Basecamp Next got to be so damn fast without using much client-side UI](https://signalvnoise.com/posts/3112-how-basecamp-next-got-to-be-so-damn-fast-without-using-much-client-side-ui). – Robert Harvey Mar 24 '16 at 16:12
  • @GabrielPetersson I mean, obviously if your *startup* feature is 10 billion rows. – Robert Harvey Apr 25 '23 at 11:38
2

RDBMS vs. NoSQL

The question RDBMS vs. NoSQL, despite the claims of some vendors, is not a simple question of performance and scalability. It's a question of structure of your data and what you intend to do with it.

If your data is highly structured, you can certainly benefit from an RDBMS, and scale it as needed, by using bigger servers, adding multiple processors, distribute your data across several databases using some clever partitioning scheme, and even use some hadoop based SQL implementations having big data in mind... all this if the data structure allows it.

However if your data is not so structured, or has a structure that might evolve quickly, then a NoSQL database such as MongoDB, Aerospike, Cassandra or others could certainly be a more flexible alternative. These databases, due to their flexible structure are also easier to distribute. Some are even container-enabled thus allowing the best possible scalability. But to select the most appropriate one, you also have to take care of the kind of NoSQL you'll need, the read-write patterns in your application, and even some low level system architecture aspects (example: use of in-memory technology, or SSD storage vs. traditional hard disks).

How to choose ?

So the first thing you need is to have a clear view of your application, and the data it has to manage. and deal with the overall architecture of your product.

One of the aspect is the distribution of your webapp accross web-servers for ensuring scalability, and the use of eventual middleware such as kafka as message broker, and eventually use of microservices, with several loosely connected databases.

Another aspect will be the interface between your application and the database. The best approach here, is to design the software architecture so to isolate the interactions with the database, alowing you to change if needed, with a minimum of impact on the remaining of the application. This approach will enable you to start with a database and change if needed without too much worries, gaining from the experience as the dataflows grow.

Christophe
  • 74,672
  • 10
  • 115
  • 187
0

The driving factor in this choice is not speed of operation but the structure of the data.

In your case you have a post with assocIated likes. If you always display the post and is likes together without reference to other data then this fits the no-sql pattern.

You need only select the post by id or perhaps an indexed userid and the db returns the required likes automatically.

If however you wanted to display the average likes per post with todays date, or the post with the most likes. a sql db would be better suited as a single select can calculate the aggregate over the tables.

When you talk about billions of rows you also have to consider the size of the data. Any query which has to look at all the data will be slow regardless of the database structure.

Ewan
  • 70,664
  • 5
  • 76
  • 161