56

I'm planning to design and set up a database to store dictionary entries (usually single words) and their meaning in another language. So, for example, the table Glossary must have entry and definition and each table record has a reference to the id of a record stored in Tag (Each entry must have a tag or category).

Since my data has a structure, I thought using a SQL database (like MySQL) is not a bad idea; but people say MongoDB is much better for performance.

At the client side, the application must be able to provide a search box with autocomplete which consumes a REST API provided by the backend. Is it safe to go with MySQL in such a scenario? or should I use MongoDB or ElasticSearch of any other solution for this? Hundred thousands of records are supposed to be stored and accessed in this way.

enderland
  • 12,091
  • 4
  • 51
  • 63
Aziz Az
  • 491
  • 1
  • 4
  • 7
  • It's not really relational data. Just lists of words and their meaning. It's more document-like data you're talking about. – Pieter B Jun 05 '17 at 20:32
  • 79
    The people telling you things have not done much research into this. The language with the largest vocabulary, English, has less than a million distinct words. This is well within the realm of the performance capabilities of a relational DB. – TheCatWhisperer Jun 05 '17 at 20:43
  • 26
    I don't see anything here that would make me think MySQL wouldn't work fine for that. Performance on a simple lookup wouldn't be an issue, and it has full text search if you need to go that route. – GrandmasterB Jun 05 '17 at 20:48
  • 1
    I think we need to be careful with this question so it doesn't get closed, because software suggestions are off-topic. However given the problem I think relational versus no-sql would be an answerable question. – Greg Burghardt Jun 05 '17 at 22:59
  • 1
    Have you considered that some words have multiple definitions? Take for example "set" with 430 odd meanings. – Tim Jun 06 '17 at 01:03
  • 46
    Regarding "MongoDB is much better for performance"—as an unmodified statement with no scope clarification this is rank nonsense. For an example, see [Command-line Tools can be 235x Faster than your Hadoop Cluster](https://aadrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html) (which I came across from a link in [The Website Obesity Crisis](http://idlewords.com/talks/website_obesity.htm)). – Wildcard Jun 06 '17 at 01:48
  • 82
    I'm so tired of people saying relational databases are bad and MongoDB is better because it is faster. That's like saying cars are bad and we should use airplanes because they travel faster. My advice is to ignore advice like this. – Brandon Jun 06 '17 at 01:53
  • 6
    The idea that NoSQL stores are more efficient applies for *writing*, but dictionary look-up is vastly biased towards *reading* (English doesn't change that fast). Also, they are predominantly for extremely large data sets, but English isn't that big. You're getting bad advice. – Kilian Foth Jun 06 '17 at 06:11
  • 5
    People are telling you this because NoSQL is cool these days. You should properly evaluate the pros and cons and try to ignore the cool factor – mcfedr Jun 06 '17 at 09:07
  • 13
    @Brandon The sad thing is that the whole "NoSQL is so much faster" claims usually boil down to some theoretical explanation of why they ought to be so much better, but in practice that does not even apply for many real world scenarios. See e.g. [here](https://www.enterprisedb.com/postgres-plus-edb-blog/marc-linster/postgres-outperforms-mongodb-and-ushers-new-developer-reality). Their used benchmark suite is open source and available on github too. Hell CERN manages their PB of data with an OracleDB just fine. – Voo Jun 06 '17 at 10:13
  • Someone is selling you something either directly or by proxy: "someone just sold me this super-duper NoSQL system as a silver bullet for all of my persistence problems, must tell everyone how great it is to avoid confronting the possibility that I got suckered by the shiny new thing!". – Jared Smith Jun 06 '17 at 13:42
  • I can see the angel investor now - "If only you would have used a NoSQL database instead of the most popular open source database in the world... we could have invested in you." – corsiKa Jun 06 '17 at 16:51
  • 3
    [MongoDB is Web Scale.](https://www.youtube.com/watch?v=b2F-DItXtZs) – Robert Harvey Jun 06 '17 at 17:33
  • A relational database is a good idea. But MySql... not so much. It's lagged behind the other players for many years now. Take this list: CTEs, FULL JOIN, Lateral Join/Apply, and Windowing functions. They've all been part of the ansi standard since at least 2011 or longer, and are all part of Sql Server, Oracle, and Postgresql, but not MySql. MySql doesn't really qualify as a modern db engine anymore. If you need free or open source, choose Postgresql. – Joel Coehoorn Jun 06 '17 at 23:26

4 Answers4

94

I can't tell you why it's a bad idea. I can tell you a bunch of reasons why a relational database is a good idea though.

  1. Remember that not everyone consults a dictionary for a definition. More times than not, a dictionary is used to find the correct spelling. This means you're not just finding a needle in a haystack, you are searching the haystack for needles that are similar to the one described by the user (if I may use an idiom).

    You won't just be doing primary key look-ups. You'll be doing keyword searches

  2. Words can be related, either in meaning or spelling (read, read, red and reed)

    Whenever you see the word "related" think "Relational Database"

  3. If you need speed, you need caching on top of your relational database, not a broken relational data model

  4. A properly normalized database speeds up primary key look-ups and searches since there is simply fewer bits to sift through.

  5. The people who say normalized databases are slower are referring to the 0.1% of cases where this is true. In the other 99.9% of cases they haven't actually worked with a truly normalized database to see the performance first hand, so ignore them. I have worked with a normalized database. Love it. Don't want to go back. And I'm not a database guy. I'm a C#/JavaScript/HTML/Ruby guy.

  6. Words have an origin. In fact, many words in the same language can have the same origin, which is another word in a different language. For instance, résumé (the thing we upload to recruiters websites so we can get incessant phone calls and e-mails for the next 7 years) is a French word.

  7. A dictionary also defines what kind of word it is (noun, verb, adjective ect). This isn't just a piece of text: "noun" it has meaning as well. Plus with a relational database you can say things like "give me all the nouns for the English language" and since a normalized database will be utilizing foreign keys, and foreign keys have (or should have) indexes, the lookup will be a snap.

  8. Think of how words are pronounced. In English especially, lots of words have the same pronunciation (see my example above with read and reed, or read and red).

    The pronunciation of a word is, itself, another word. A relational database would allow you to use foreign keys to any pronunciations. That information won't be duplicated in a relational database. It gets duplicated like crazy in a no-SQL database.

  9. And now let's talk about plural and singular versions of words. :) Think "boat" and "boats". Or the very fact that a word is "singular" or "plural".

  10. Oh! And now let's talk about past tense, present tense, future tense and present participle (to be honest, I don't know what the crap "present participle" is. I think it has something to do with words ending in "ing" in English or something).

    Look up "run" and you should see the other tenses: ran, runs, running

    In fact, "tense" is another relationship itself.

  11. English doesn't do this so much, but gender is another thing that defines a word. Languages like Spanish have suffixes the define whether the subject of the noun is male or female. If you need to fill in the blanks for a sentence, gender is extremely important in many languages.

    Since you can't always rely on language conventions to determine gender (in Spanish, words ending in "o" are masculine/male, but that's not true for all words), you need an identifying value: Male or Female. This is another relationship that a normalized database handles gracefully even at millions of records.

With all the twisted rules and relationships between words, and even different languages, it's hard for me to imagine this data store as a "document store" like a no-SQL solution provides. There are so many and such a large variety of relationships between words and their components that a relational database is the only sensible solution.

Greg Burghardt
  • 34,276
  • 8
  • 63
  • 114
  • 8
    For #1, indexing is often one of the strengths of the non-relational offerings, not a weakness. – JimmyJames Jun 05 '17 at 20:40
  • 62
    @JimmyJames Don't think for a minute that relational systems aren't using the same kinds of indexes. Many of those techniques were pioneered in that world. – Blrfl Jun 05 '17 at 21:03
  • Good answer Greg. I would like to see added to it (based partially on #2, and other reasons) that the word should be an index, but not the clustered index. – TheCatWhisperer Jun 05 '17 at 21:22
  • 1
    @Blrfl I did not think that for a minute or even a small fraction of one. The point still stands. – JimmyJames Jun 05 '17 at 21:41
  • 15
    "Whenever you see the word "related" think "Relational Database"". I don't agree. The "relational" in "relational database" refers to the tuples themselves. Related is much too broad a term for this statement to hold any water – gardenhead Jun 06 '17 at 01:39
  • 12
    There are also graph databases (Neo4j comes to mind) that are explicitly focused on traversing relationships rather than performing traditional joins. This may be advantageous given that many dictionaries are actually webs of words; for example, the WordNet project uses its own graph-like format, instead of a traditional RDMS. – tucuxi Jun 06 '17 at 12:44
  • 2
    +1 For a small dictionary (and by small I mean dictionaries of any size as long as they're the kind that "normal people" use), RDBMS are perfectly fine and are stable and mature and are understandable by many developers. However "If you need to fill in the blanks for a sentence, gender tense (which is probably the wrong term for this) is extremely important in many languages." > You can't just tack on *-a* to any word in Spanish and make it feminine or *-o* to make it masculine. Your system would have *man-* (hand) as a word and then add *-a* to it, which is wrong. Language is hard. – errantlinguist Jun 06 '17 at 13:04
  • Just to pick a few minor nits: 1) ‘Gender tense’ is indeed not the term; it’s just _gender_. Gense is a property of verbs. 2) English has two true tenses (present and past) and a bunch of periphrastic constructions that are often also called ‘tenses’, but aren’t tenses in narrower terms, including future and: 3) Present participles aren’t tenses, but they do end in _-ing_. A word like _running_ isn’t necessarily a present participle, though: it can also be a gerund, or indeed a verbal noun. 4) I don’t follow your #8—how are identical pronunciations not repeated as much in RDBMS? – Janus Bahs Jacquet Jun 06 '17 at 16:15
  • @JanusBahsJacquet: I clarified #8 and corrected "gender tense". Didn't get a chance yet to correct past/present tense. It feels like English is the only language where the government requires you to take classes on it for 13 years, and then you still can't get it right. :P – Greg Burghardt Jun 06 '17 at 17:26
  • 3
    This answer lacks any understanding of No-SQL databases. 1. You can completely avoid duplication in a No-SQL database, especially in the simple dictionary use case. You add duplication to improve performance (like you do in a relational database sometimes) 2. In a document store like MongoDB you'll build relations via nested documents. Different doesn't mean worse. 3. You can use MongoDBs aggregation framework to combine data from multiple collections. 4. Neither relational nor No-SQL database are good for spell checking, use something like SOLR for that. – Christian Strempfer Jun 06 '17 at 18:49
  • That said, I would advise the OP to use a relational database he is familiar with, but not because of the misleading reasons here. How are points 6 to 11 even related to the database? – Christian Strempfer Jun 06 '17 at 18:52
  • @ChristianStrempfer: Points 6-11 illustrate how "words" can be relational data, and benefit from things like foreign keys, and database normalization. – Greg Burghardt Jun 06 '17 at 18:55
  • 1
    @GregBurghardt Maybe that's what you wanted to write, but it's not there. Reread for example point 6, and check where is the database argument. You should put the last paragraph above it, to make clear what you're talking about there. But actually I would argue that a large variety of relationships are an argument for databases without a fixes schema (-> No SQL). – Christian Strempfer Jun 06 '17 at 19:21
  • 1
    @tucuxi A graph database in this case is a misfit. Specifically, see https://stackoverflow.com/questions/13046442/comparision-of-relational-databases-and-graph-databases "The key difference between a graph and relational database is that relational databases work with sets while graph databases work with paths.". You don't want to work with paths here. – David Tonhofer Jun 06 '17 at 20:44
  • 1
    I've used NoSQL and SQL DB's... not sure how any of these points suggest SQL is better than NoSQL. I could probably build this in MongoDB or MySql, and imagine both would have things they do well and things that are a little trickier. – NPSF3000 Jun 06 '17 at 22:06
  • 5
    I downvoted this answer *just* for "Whenever you see the word 'related' think 'Relational Database'." That's **ridiculous**. I love relational databases, but the relational model *isn't* appropriate for *all* kinds of relationships. Your view of normalized data is also completely wrong. Normalizing data optimizes **edits**, because data isn't duplicated, not searches. (That's why reporting DBs don't normalize. They use dimensional modeling techniques and star schemas.) I don't think you know what you're talking about. The 80 upvotes confirm all my concerns about advice on this site. – jpmc26 Jun 06 '17 at 22:40
  • 1
    @jpmc26: A reporting database is a completely different problem than a dictionary, and involves volumes of data that are orders of magnitude greater. – Greg Burghardt Jun 07 '17 at 12:12
  • 2
    @GregBurghardt It doesn't matter. The point is your answer is filled with sweeping generalities that just aren't true. I'd also be *hesitant* to try to model some of the complex relationships you mention using a third normal form database. It's probably *possible*, but other models might make it easier. Your answer also seems to assert that the ability to group attributes together is some kind of distinct advantage of relational databases, when it most certainly isn't. – jpmc26 Jun 07 '17 at 19:43
27

If you go with the key-value store (which offers you a more impoverished programming model) and it turns out you need more structure (in your case, say, adding a third language), or you need to do more complex queries involving joins, you'll spend a bunch of time reorganizing your keys, denormalizing your data, and/or looping over all the data to find what you need.

If you start with a relational database, you can work through your application's design, code, and try it out concentrating more on the natural data model for your application, rather than on shoehorning it into the key-value form.

Once the application settles down, you can work on performance, by measuring various options. There are quite a few performance tricks to do in SQL before needing to switch technologies. You'll have learned a lot about your application and will be in a much better position to decide if relational is hurting you and if key-value will work for your data model.

If it turns out that key-value is exactly what your application needs, you can switch without having wasted significant investment in the relational model, whereas the other way around you might possibly end up wasting time making the key-value model do things that are trivial in the relational model.

Consider the relational database as an accelerator to getting your application designed, written, and up and running, in the face of ever-changing requirements as you learn more about your domain and users.

When you have millions of users, you'll almost certainly need to refactor the design anyway, even if you had picked key-value to start with.

Erik Eidt
  • 33,282
  • 5
  • 57
  • 91
  • 13
    The epilogue in [this article](http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb/) describes exactly a scenario of changing requirements invalidating a design. It describes one (real) application as "a perfect use case for MongoDB", but then describes how a relatively minor change in requirements, that would have been trivial to implement in a RDBMS, required a decent amount of work and would have moved it to a use case that (as the preceding parts of the article explain) is very much not a good use case of Mongo. – Derek Elkins left SE Jun 06 '17 at 04:27
  • 5
    Sarah's MongoDB article is exactly what we went through with a 1.0 product we had built using it; by 1.1 we were using Postgres. – Joe Jun 06 '17 at 17:37
  • @DerekElkins, super reference, thx! – Erik Eidt Jun 06 '17 at 17:38
  • 1
    " but then describes how a relatively minor change in requirements, that would have been trivial to implement in a RDBMS" Sure, but the opposite is true. We use RDBMS's at work and face issues that would be trivial to solve in MongoDB. Strangely enough, software requirements don't always map perfectly to the capabilities of the tools we use. – NPSF3000 Jun 07 '17 at 04:51
  • @NPSF3000, it would be awesome if you could cite a reference, like a blog or some text that elaborated on that! – Erik Eidt Jun 07 '17 at 05:20
  • @DerekElkins: Sarah Mei might be a good programmer, but she isn't a MongoDB expert. Reads like she worked only at 2 very simple MongoDB projects and made several [beginner's mistakes](http://blog.mongodb.org/post/88473035333/6-rules-of-thumb-for-mongodb-schema-design-part-3). She nested documents although it's a many-to-many relationship (episodes/actors) and nested arrays with unlimited elements (social feed). To her defense the manual probably didn't explain [data modeling](http://docs.mongodb.org/manual/data-modeling/) as good as today. – Christian Strempfer Jun 07 '17 at 05:30
  • @ErikEidt Sorry, we don't currently have a dev blog - MongoDB collects a lot of case studies though. We currently face issues of scaling and schema changes all the time that our MySQL platform doesn't handle as well easily Mongo would. Of course, MySQL does make some features easier to build that can be easily modeled as a join. We also take advantage of columnar RDBMS like BigQuery... they look like SQL but work very differently than row based designs. We also use Redis for a real-time feature (albeit could be anything) and S3 a lot (which is just a key-value store for big values). – NPSF3000 Jun 07 '17 at 05:40
10

For a database this small, it's probably not going to make much difference for performance. A standard RDBMS isn't a terrible idea here because presumably, there should be far more reads than writes of a given entry. Performance doesn't seem to be a primary driver for this. Caching in the application layer also mitigates such concerns.

The other consideration is replication and resilience. Relational databases tend to be designed around a single instance. You should read up on the CAP theorem and consider what matters the most to you.

JimmyJames
  • 24,682
  • 2
  • 50
  • 92
  • How does CAP apply to a relatively normal web app? Depending on your kit it's likely you can sustain thousands of inbound connections and a page caching layer can increase that by an order of magnutude. CAP only starts becoming something you need to consider when distributed systems are the _only_ way of achieving your objective. – Ben Jun 06 '17 at 20:29
  • 2
    @Ben Resiliency is an objective in it's own right. If having a single point of failure is not acceptable for an application, distributed solutions offer a solution. Non-RDBMS solutions tend to be more oriented towards this. It's not simply volume to consider. Latency and availability are concerns. If your requirement is to have 99.9% uptime. You can only be down for about 9 hours a year and losing the data in one db is catastrophic so you need to account for replication/backups/snapshots. It's misguided to think it necessarily simplifies things. – JimmyJames Jun 06 '17 at 20:48
2

These NoSQL databases always sound like a good idea at the outset, but you'll be guaranteed to run into problems when you start dealing with edge cases (e.g. where keywords must by looked up by their value (or part of) for instance.

It would be a safer option to go with a relational database at the outset and then denormalise later. MySQL is awesome for this sort of purpose (simple relational databases with text-based searching), there's not too many use-cases where you'll find it struggling with this sort of data. Just make sure you have your indexes set up correctly and you'll find it will perform at a level comparable (or better when doing a text search) to a NoSQL database, and it will give you the flexibility to modify your app logic without being bound to a concrete data structure.

As you find the most common usage of your data (and if you ever find it's not meeting your performance needs), you can then proceed to de-normalise the data by outputting to a set format that can be loaded into (and retrieved from) a NoSQL schema.

joel.cass
  • 29
  • 1