19

For example, say I want to fetch a User and all of his phone numbers and email addresses. The phone numbers and emails are stored in separate tables, One user to many phones/emails. I can do this quite easily:

SELECT * FROM users user 
    LEFT JOIN emails email ON email.user_id=user.id
    LEFT JOIN phones phone ON phone.user_id=user.id

The problem* with this is that it's returning the user's name, DOB, favorite color, and all the other information stored in the user table over-and-over again for each record (users×emails×phones records), presumably eating up bandwidth and slowing down the results.

Wouldn't it be nicer if it returned a single row for each user, and within that record there was a list of emails and a list of phones? It would make the data much easier to work with too.

I know you can get results like this using LINQ or perhaps other frameworks, but it seems to be a weakness in the underlying design of relational databases.

We could get around this by using NoSQL, but shouldn't there be some middle ground?

Am I missing something? Why doesn't this exist?

* Yes, it's designed this way. I get it. I'm wondering why there isn't an alternative that is easier to work with. SQL could keep doing what it's doing but then they could add a keyword or two to do a little bit of post-processing that returns the data in a nested format instead of a cartesian product.

I know this can be done in a scripting language of your choice, but it requires that the SQL server either sends redundant data (example below) or that you to issue multiple queries like SELECT email FROM emails WHERE user_id IN (/* result of first query */).


Instead of having MySQL return something akin to this:

[
    {
        "name": "John Smith",
        "dob": "1945-05-13",
        "fav_color": "red",
        "email": "johnsmith45@gmail.com",
    },
    {
        "name": "John Smith",
        "dob": "1945-05-13",
        "fav_color": "red",
        "email": "john@smithsunite.com",
    },
    {
        "name": "Jane Doe",
        "dob": "1953-02-19",
        "fav_color": "green",
        "email": "originaljane@deerclan.com",
    }
]

And then having to group on some unique identifier (which means I need to fetch that too!) client-side to reformat the result set how you want it, just return this:

[
    {
        "name": "John Smith",
        "dob": "1945-05-13",
        "fav_color": "red",
        "emails": ["johnsmith45@gmail.com", "john@smithsunite.com"]
    },
    {
        "name": "Jane Doe",
        "dob": "1953-02-19",
        "fav_color": "green",
        "emails": ["originaljane@deerclan.com"],
    }
]

Alternatively, I can issue 3 queries: 1 for the users, 1 for the emails, and 1 for the phone numbers, but then the email and phone number result sets need to contain the user_id so that I can match them back up with the users I previously fetched. Again, redundant data and needless post-processing.

mpen
  • 1,889
  • 19
  • 29
  • 6
    Think of SQL as a spreadsheet, like in Microsoft Excel, then try to figure out how to create a cell value that contains inner cells. It no longer works well as a spreadsheet. What you are looking for is a tree structure, but then you no longer have the benefits of a spreadsheet (i.e. you can't total a column in a tree). Tree structures don't make for very human readable reports. – Reactgular Sep 13 '13 at 16:01
  • @Matthew: Trees actually make for very readable reports, if they're presented in a collapsible tree control that allows the user to "drill down" through the layers on demand. But if you're talking about *printed* reports, then yeah, I'd agree. – Mason Wheeler Sep 13 '13 at 16:54
  • 1
    @MasonWheeler - Depends on how big the tree is, and how deep it goes. I just came across [this tree](http://orteil.dashnet.org/nested), which kindof demonstrates that. – Bobson Sep 13 '13 at 16:57
  • 55
    SQL isn't bad at returning data, you're bad at querying for what you want. As a rule of thumb, if you think a widely used tool is buggy or broken for a common use case, the problem is you. – Sean McSomething Sep 13 '13 at 17:51
  • 12
    @SeanMcSomething So true that it hurts, I couldn't have said it better myself. – WernerCD Sep 13 '13 at 18:35
  • @Mark Read up on Database Normalization: https://en.wikipedia.org/wiki/Database_normalization - most likely you are looking at a normalized database (redundant data removed), and you WANT to look at a denormalized dataset (one line per customer, much stuff repeated/empty). – WernerCD Sep 13 '13 at 18:37
  • 6
    This is a great questions. Answers that are saying "this is the way it is" are missing the point. *Why* is it not possible to return rows with embedded collections of rows? – Chris Pitman Sep 13 '13 at 19:03
  • 8
    @SeanMcSomething: Unless that widely-used tool is C++ or PHP, in which case you're probably right. ;) – Mason Wheeler Sep 13 '13 at 20:41
  • 4
    What you are asking for is for the data to be returned as a nested data structure. Nested data structures are inherently non-relational which is why SQL DBMS's don't return them by default. As noted in the answer, you can get the same effect by requesting a series of rowsets, instead of trying get it all in one query. However, if you *really* want it to return a nested data structure, you can get that with the `SELECT .. FOR XML` option of SQL Server (I am sure that similar options exist for Oracle, etc.). – RBarryYoung Sep 13 '13 at 22:09
  • 1
    @RBarryYoung you should write that up as an answer. – Mr.Mindor Sep 14 '13 at 02:36
  • 1
    SQL SELECT returns a 'relation'. A relation has a single schema (number of columns, etc.) that is determined by the query. Reporting the query result is not the concern of the SQL Engine, its concern is to return the result. You are free to use other means to report the results. – NoChance Sep 14 '13 at 06:25
  • 1
    @SeanMcSomething: I know how to get the data I want. I've been doing it for many years. But I have to post-process it to get it into the format I want. Relational databases are what they are, and I'm not saying they are bugged or flawed, I'm simply wondering why there isn't an alternative that can return data in a friendlier format. Sorry for using subjective words in the question title; I've rephrased it. – mpen Sep 26 '13 at 00:18
  • Hierarchical models used to be popular, but were found to be inflexible. Hence relational databases which are essentially based on set theory. http://www.techopedia.com/definition/19782/hierarchical-database – Jaydee May 25 '15 at 13:16
  • If you're looking for a database that does nested structures like that, the first thing I thought of was Datomic, which doesn't use SQL, but rather a syntax similar to Datalog. In this database,I believe nested structures are more common because of the flexibility of the datalog query language, and how they implement what you would call a cursor. It's not a super fast system, though, but extremely flexible. So goes the tradeoffs between performance and flexibility. – Nick Klauer May 25 '15 at 13:28
  • @Jaydee But hierarchies can already be achieved with traditional RDBMS via foreign key constraints. The only thing that can't be done is polymorphic relations. Throw that in and change the result format to support it, and we get the best of both worlds, no? – mpen May 28 '15 at 17:53
  • With your edit, you are looking for looking for [group_concat](https://dev.mysql.com/doc/refman/5.0/en/group-by-functions.html#function_group-concat). See also the [group-concat tag](http://stackoverflow.com/questions/tagged/group-concat) on Stack Overflow. As described [below](http://programmers.stackexchange.com/a/211431/40980), It will do what you are asking. –  May 28 '15 at 18:29
  • @MichaelT `group_concat` has its limitations. I've exceeded length limits because of it. You also have to worry about delimiters if you plan on splitting it. Furthermore, it might work OK for emails or phone numbers, but not if you need more than one field from the joined table. Again, I'm not looking for a solution to any specific problem, this is simply a "why" question, not "how" -- I can solve it in any number of ways, they just don't feel optimal to me. – mpen May 28 '15 at 21:31
  • You may then wish to look into Postgres and its [json](http://www.postgresql.org/docs/9.3/static/functions-json.html) set of functions and [aggregations](http://www.postgresql.org/docs/9.3/static/functions-aggregate.html). Or with Oracle and its [connect by](http://docs.oracle.com/cd/B19306_01/server.102/b14200/queries003.htm) style queries and some craftiness to get the data you want. –  May 28 '15 at 21:35
  • I'd like to point out that apparently I'm not the only one who thought a nested query format would be a good idea: https://facebook.github.io/react/blog/2015/05/01/graphql-introduction.html – mpen Jul 02 '15 at 05:00
  • 1
    Apparently it is already implemented (sort of) in PostgreSQL http://www.postgresql.org/docs/9.2/static/functions-json.html http://bender.io/2013/09/22/returning-hierarchical-data-in-a-single-sql-query/ – Szabolcs Páll Jan 13 '16 at 12:53
  • 3
    @Sean McSomething - "SQL isn't bad at returning data, you're bad at querying for what you want. As a rule of thumb, if you think a widely used tool is buggy or broken for a common use case, the problem is you." That's just flat out wrong. SQL is inherently flawed in a way that many queries will either require splitting up into multiple queries (inefficient), or returning duplicate data (inefficient). There is often NO POSSIBLE WAY AROUND THIS. It absolutely is a flaw in how SQL was designed. – industry7 Mar 21 '16 at 14:10
  • @industry7, have you ever actually found that the "returning duplicate data" problem *actually* causes a performance problem due to inefficiency? Or are you just assuming that it must be inefficient? – Steve May 01 '21 at 18:57
  • @Steve short answer is yes-ish, longer answer is db performance is complicated by itself and then you still have a whole stack of software on top of that affected by decision in lower layers. – industry7 Aug 19 '21 at 15:52

10 Answers10

52

It's returning exactly what you asked for: a single record set containing the Cartesian product defined by the joins. There are plenty of valid scenarios where that's exactly what you would want, so saying that SQL is giving a bad result (and thus implying that it would be better if you changed it) would actually screw a lot of queries up.

What you're experiencing is known as "Object/Relational Impedance Mismatch," the technical difficulties that arise from the fact that the object-oriented data model and the relational data model are fundamentally different in several ways. LINQ and other frameworks (known as ORMs, Object/Relational Mappers, not coincidentally,) don't magically "get around this;" they just issue different queries. It can be done in SQL too. Here's how I'd do it:

SELECT * FROM users user where [criteria here]

Iterate the list of users and make a list of IDs.

SELECT * from EMAILS where user_id in (list of IDs here)
SELECT * from PHONES where user_id in (list of IDs here)

And then you do the joining client-side. This is how LINQ and other frameworks do it. There's no real magic involved; just a layer of abstraction.

Mason Wheeler
  • 82,151
  • 24
  • 234
  • 309
  • 16
    +1 for "exactly what you asked for". Too often we jump to the conclusion that there is something wrong with the technology rather than the conclusion that we need to learn how to use the technology effectively. – Matt Sep 13 '13 at 17:40
  • 1
    Hibernate will retrieve the root entity and certain collections in a single query when the _eager_ fetch mode is used for those collections; in that case it does the reduction of root entity properties in memory. Other ORMs can likely do the same. – Mike Partridge Sep 13 '13 at 18:36
  • 3
    Actually this is not to blame on the relational model. It copes very nicely with nested relations thank you. This is purely an implementation bug in early versions of SQL. I think more recent versions has added it though. – John Nilsson Sep 13 '13 at 21:37
  • 8
    Are you sure this is an example of object-relational impedance? Seems to me that the relational model perfectly matches the conceptual data model of the OP: each user is associated with a list of zero, one, or more email addresses. That model is also perfectly usable in an OO paradigm (aggregation: the user object has a collection of emails). The limitation is in the technique being used to query the database, which is an implementation detail. There are query techniques around which do return heirarchical data, e.g. [heirarchical DataSets in .Net](http://support.microsoft.com/kb/318454) – MarkJ Sep 13 '13 at 21:59
  • @MarkJ you should write that up as an answer. – Mr.Mindor Sep 14 '13 at 02:35
12

You could use a built in function to concatenate the records together. In MySQL you can use the GROUP_CONCAT() function and in Oracle you can use the LISTAGG() function.

Here is a sample of what a query might look like in MySQL:

SELECT user.*, 
    (SELECT GROUP_CONCAT(DISTINCT emailAddy) FROM emails email WHERE email.user_id = user.id
    ) AS EmailAddresses,
    (SELECT GROUP_CONCAT(DISTINCT phoneNumber) FROM phones phone WHERE phone.user_id = user.id
    ) AS PhoneNumbers
FROM users user 

This would return something like

username    department       EmailAddresses                        PhoneNumbers
Tim_Burton  Human Resources  hr@m.com, tb@me.com, nunya@what.com   231-123-1234, 231-123-1235
Linger
  • 299
  • 3
  • 4
  • 20
  • This seems to be the closest solution (in SQL) to what the OP is attempting to do. He will potentially still have to do client side processing to break the EmailAddresses and PhoneNumbers results into lists. – Mr.Mindor Sep 13 '13 at 20:32
  • 2
    What if the phone number has a "type", like "Cell", "Home", or "Work"? Furthermore, commas are technically allowed in email addresses (if they're quoted) -- how would I split it then? – mpen May 28 '15 at 21:35
12

Deep down, in the guts of a relational database, its all rows and columns. That is the structure that a relational database is optimized to work with. Cursors work on individual rows at a time. Some operations create temporary tables (again, it needs to be rows and columns).

By working with only rows and returning only rows, the system is able to better deal with memory and network traffic.

As mentioned, this allows for certain optimizations to be done (indexes, joins, unions, etc...)

If one was to want a nested tree structure, this requires that one pulls all the data at once. Gone are the optimizations for the cursors on the database side. Likewise, the traffic over the network becomes one big burst that can take much longer than the slow trickle of row by row (this is something that is occasionally lost in today's web world).

Every language has arrays within it. These are easy things to work with and interface with. By using a very primitive structure, the driver between the database and program - no matter what language - can work in a common way. Once one starts adding trees, the structures in the language become more complex and more difficult to traverse.

It isn't that hard for a programing language to convert the rows returned into some other structure. Make it into a tree or a hash set or leave it as a list of rows that you can iterate over.

There is also history at work here. Transferring structured data was something ugly in the days of old. Look at the EDI format to get an idea of what you might be asking for. Trees also imply recursion - which some languages didn't support (the two most important languages of the old days didn't support recursion - recursion didn't enter Fortran until F90 and of the era COBOL didn't either).

And while the languages of today have support for recursion and more advanced data types, there isn't really a good reason to change things. They work, and they work well. The ones that are changing things are the nosql databases. You can store trees in documents in a document based one. LDAP (its actually oldish) is also a tree based system (though its probably not what you're after). Who knows, maybe the next thing in nosql databases will be one that returns back the query as a json object.

However, the 'old' relational databases... they're working with rows because thats what they're good at and everything can talk to them without trouble or translation.

  1. In protocol design, perfection has been reached not when there is nothing left to add, but when there is nothing left to take away.

From RFC 1925 - The Twelve Networking Truths

  • "If one was to want a nested tree structure, this requires that one pulls all the data at once. Gone are the optimizations for the cursors on the database side." -- That doesn't sound true. It would just have to maintain a couple cursors: one for the main table, and then one for each joined table. Depending on the interface, it might return one row and all joined tables in one chunk (partially streamed), or it can stream the subtrees (and perhaps not even query them) until you start iterating them. But yes, that's complicating things a lot. – mpen Sep 27 '13 at 02:19
  • 3
    Every modern language should have some sort of tree class though, no? And wouldn't it be up to the driver to deal with that? I guess the SQL guys still need to design a common format (don't know much about that). The thing that gets me though is that I either have to send 1 query with joins, and get back and filter out the redundant data which each row (the user info, which only changes every Nth row), or issue 1 query (users), and loop over the results, then send two more queries (emails, phones) for each record to fetch the info I need. Either method seems wasteful. – mpen Sep 27 '13 at 02:24
  • @mpen, it's actually far less wasteful than you think. Whilst the client-side API must present the data conceptually as if it is a table, there is nothing to prevent the system just sending one physical copy of the user over the wire. And of course, the database itself often stores the data in normal form. Creating an explicit tree structure from the data once it is on the client, is algorithmically straightforward and involves no significant overhead. – Steve May 01 '21 at 07:46
11

The problem with this is that it's returning the user's name, DOB, favorite color, and all the other information stored

The problem is that you are not being selective enough. You asked for everything when you said

Select * from...

...and you got it (including DOB and favourite colours).

You probably should been a little more (ahem) ...selective, and said something like:

select users.name, emails.email_address, phones.home_phone, phones.bus_phone
from...

It's also possible you might see records that look like duplicates because a user might join to multiple email records, but the field that distinguishes these two is not in your Select statement, so you might want to say something like

select distinct users.name, emails.email_address, phones.home_phone, phones.bus_phone
from...

...over-and-over again for each record...

Also, I notice you're doing a LEFT JOIN. This will join all records on the left of the join (i.e. users) to all records on the right, or in other words:

A left outer join returns all the values from an inner join plus all values in the left table that do not match to the right table.

(http://en.wikipedia.org/wiki/Join_(SQL)#Left_outer_join)

So another question is do you actually need a left join, or would an INNER JOIN have been sufficient? They are very different types of joins.

Wouldn't be nicer if it returned a single row for each user, and within that record there was a list of emails

If you actually want a single column within the result set to contain a list that is generated on-the-fly, that can be done but it varies depending on which database you're using. Oracle has the listagg function.


Ultimately, I think that your problem might be solved if you rewrite your query close to something like this:

select distinct users.name, users.id, emails.email_address, phones.phone_number
from users
  inner join emails on users.user_id = emails.user_id
  inner join phones on users.user_id = phones.user_id
FrustratedWithFormsDesigner
  • 46,105
  • 7
  • 126
  • 176
  • 1
    using * is discouraged but not the crux of his problem. Even if he selects 0 user columns he may still experience a duplication effect since both Phones and Emails have a 1-many relation to Users. Distinct would not prevent a phone number from appearing twice ala phone1/name@hotmail.com, phone1/name@google.com. – mike30 Sep 13 '13 at 16:23
  • 6
    -1: "your problem _might_ be solved" says that you don't know what effect would the change from `left join` to `inner join`. In this case, this won't reduce the "repetitions" the user is complaining about; it would simply omit those users that lack a phone or email. hardly any improvement. also, when interpreting the "all records on the left to all records on the right" skips the `ON` criteria, which prunes all the 'wrong' relations inherent in the Cartesian product but keeps all the repeated fields. – Javier Sep 13 '13 at 21:23
  • @Javier: Yes, which is why I also said *do you actually need a left join, or would an INNER JOIN have been sufficient? * OP's description of the problem makes it *sound* as though they were expecting the result of an inner join. Of course, without any sample data or a description of what they *really* wanted, it's hard to say. I made the suggestion because I've actually seen people (ones I work with) do this: choose the wrong join and then complain when they don't understand the results they get. Having *seen* it, I thought it might have happened here. – FrustratedWithFormsDesigner Sep 16 '13 at 13:54
  • 3
    You're missing the point of the question. In this hypothetical example, I *want* all the user data (name, dob, etc) **and** I want all his/her phone numbers. An inner join excludes users w/ no emails or no phones -- how does that help? – mpen May 25 '15 at 15:53
4

Queries always produce a rectangular (un-jagged) tabular set of data. There are no nested sub-sets within a set. In the world of sets everything is a pure un-nested rectangle.

You can think of a join as putting 2 sets side-by-side. The "on" condition is how the records in each set are matched up. If a user has 3 phone numbers, then you'll see a 3-time duplication in the user info. A rectangular un-jagged set must be produced by the query. It's simply the nature of joining sets with a 1-to-many relationship.

To get what you want, you must use a separate query like Mason Wheeler described.

select * from Phones where user_id=344;

The result of this query is still a rectanglar un-jagged set. As is everything in the world of sets.

mike30
  • 2,788
  • 2
  • 16
  • 19
3

The concept of relational closure basically means that the result of any query is a relation which can be used in other queries as if it was a base table. This is an powerful concept because it makes queries composable.

If SQL allowed you to write queries which output nested data structures, you would break this principle. A nested data structure is not a relation, so you would need a new query language, or complex extensions to SQL, in order to query it further or to join it which other relations.

Basically you would build a hierarchical DBMS on top of a relational DBMS. It will be much more complex for a dubious benefit, and you lose the advantages of a consistently relational system.

I understand why it would sometimes be convenient to be able output hierarchically structured data from SQL, but the cost in the added complexity throughout the DBMS to support this is definitely not worth it.

JacquesB
  • 57,310
  • 21
  • 127
  • 176
2

You have to decide where the bottlenecks exist. The bandwidth between your database and application is usually pretty fast. There's no reason most databases couldn't return 3 separate datasets within one call and no joins. Then you get to join it all together in your app if you want.

Otherwise, you want the database to put this dataset together and then remove all the repeated values in each row that are the result of the joins and not necessarily the rows themselves having duplicate data like two people with the same name or phone number. Seems like a lot of over-head to save on bandwidth. You would be better off focusing on returning less data with better filtering and removing the columns you don't need. Because Select * is never used in production-well that depends.

JeffO
  • 36,816
  • 2
  • 57
  • 124
  • "There's no reason most databases couldn't return 3 separate datasets within one call and no joins" -- How do you get it to return 3 separate datasets with one call? I thought you had to send 3 different queries, which introduces latency between each one? – mpen Sep 26 '13 at 00:28
  • A stored procedure could be called in 1 transaction, and then return as many datasets as you wanted. Maybe a "SelectUserWithEmailsPhones" sproc is needed. – Graham Sep 26 '13 at 12:47
  • 1
    @Mark: you can send (in sql server at least) more than one command at as part of the same batch. cmdText="select * from b; select * from a; select * from c" and then use that as the command text for the sqlcommand. – jmoreno Nov 01 '13 at 05:03
2

Very simply, don't join your data if you want distinct results for a user query and a phone number query, otherwise as others have pointed out the "Set" or data will contain extra fields for every row.

Issue 2 distinct queries instead of one with a join.

In the stored procedure or inline parameterized sql craft 2 queries and return the results of both back. Most database and languages support multiple result sets.

For example, SQL Server and C# accomplish functionality this by using IDataReader.NextResult().

Jon Raynor
  • 10,905
  • 29
  • 47
1

You are missing something. If you want to denormalize your data, you have to do it yourself.

;with toList as (
    select  *, Stuff(( select ',' + (phone.phoneType + ':' + phone.PhoneNumber) 
                    from phones phone
                    where phone.user_id = user.user_id
                    for xml path('')
                  ), 1,1,'') as phoneNumbers
from users user
)
select *
from toList
jmoreno
  • 10,640
  • 1
  • 31
  • 48
-5

Pls refer to the usage of STUFF function which groups multiple rows (phone numbers) of a column (contact) that can be extracted as a single cell of delimited values of a row (user).

Today we extensively using this but facing some high CPU and performance issues. XML data type is another option but is a design change not a query level one.

  • 5
    Please expand on how this solves the question. Rather than saying to "Pls refer to the usage of", provide an example of how this would achieve the question asked. It can also be helpful to quote 3rd party sources where it makes things clearer. – bitsoflogic Jul 17 '18 at 14:08
  • 1
    Looks like `STUFF` is akin to splice. Not sure how that applies to my question. – mpen Jul 17 '18 at 19:49