100

Yesterday I was discussing with a "hobby" programmer (I myself am a professional programmer). We came across some of his work, and he said he always queries all columns in his database (even on/in production server/code).

I tried to convince him not to do so, but wasn't so successful yet. In my opinion a programmer should only query what is actually needed for the sake of "prettiness", efficiency and traffic. Am I mistaken with my view?

the baconing
  • 1,111
  • 2
  • 8
  • 9
  • 1
    I'd say its cos what if the table contents change? adding/removing columns? your still selecting *.. so you're gonna be missing things or pulling back more data than you need. – JF it Apr 03 '14 at 13:15
  • 2
    @JFit That's part of it, but far from the whole story. – jwenting Apr 03 '14 at 14:37
  • 8
    Good reasons at SO - http://stackoverflow.com/questions/3180375/select-vs-select-column – Bratch Apr 03 '14 at 15:40
  • 6
    And good reasons here, [Why is select * considered harmful?](http://stackoverflow.com/questions/3639861/why-is-select-considered-harmful) – Ellie Kesselman Apr 03 '14 at 16:22
  • @gnat can a question really be considered a duplicate of a closed question? (ie because the closed one wasn't really suitable in the first place) – gbjbaanb Apr 04 '14 at 07:30
  • @gbjbaanb sure, why not. As for "first place" part of your comment, see http://meta.stackexchange.com/questions/194476/someone-flagged-my-question-as-already-answered-but-its-not – gnat Apr 04 '14 at 07:39
  • Well if your goal is to convince a hobby programmer and this hobby programmer has complete control over what is going into and out of their tables then good luck making a convincing argument. If "select *" isn't a performance issue for a particular application then there's nothing wrong with "select *". If it is potentially an issue then I'm certain that your justifications will sink in to your hobby programmer as the concept of bandwidth isn't that difficult to comprehend. – Dunk Apr 07 '14 at 14:55
  • luckily i finally convinced him and he will never ever use it in production code again – the baconing Apr 08 '14 at 08:54
  • FYI. the DBA site has a very [related question](http://dba.stackexchange.com/questions/69361/is-select-still-a-big-no-no-on-sql-server-2012) that shows select * is also just plain bad from the DB PoV. Very interesting reading from something other than a programmer viewpoint. – gbjbaanb Jul 02 '14 at 07:38

10 Answers10

181

Schema Changes

  • Fetch by order --- If the code is fetching column # as the way to get the data, a change in the schema will cause the column numbers to readjust. This will mess up the application and bad things will happen.
  • Fetch by name --- If the code is fetching column by name such as foo, and another table in the query adds a column foo, the way this is handled may cause problems when trying to get the right foo column.

Either way, a schema change can cause problems with the extraction of the data.

Further consider if a column that was being used is removed from the table. The select * from ... still works but errors out when trying to pull the data out of the result set. If the column is specified in the query, the query will error out instead giving a clear indiciation as to what and where the problem is.

Data overhead

Some columns can have a significant amount of data associated with them. Selecting back * will pull all the data. Yep, here's that varchar(4096) thats on 1000 rows that you've selected back giving you an additional possible 4 megabytes of data that you're not needing, but is sent across the wire anyways.

Related to the schema change, that varchar might not exist there when you first created the table, but now its there.

Failure to convey intent

When you select back * and get 20 columns but only need 2 of them, you are not conveying the intent of the code. When looking at the query that does a select * one doesn't know what the important parts of it are. Can I change the query to use this other plan instead to make it faster by not including these columns? I don't know because the intent of what the query returns isn't clear.


Lets look at some SQL fiddles that explore those schema changes a bit more.

First, the initial database: http://sqlfiddle.com/#!2/a67dd/1

DDL:

create table one (oneid int, data int, twoid int);
create table two (twoid int, other int);

insert into one values (1, 42, 2);
insert into two values (2, 43);

SQL:

select * from one join two on (one.twoid = two.twoid);

And the columns you get back are oneid=1, data=42, twoid=2, and other=43.

Now, what happens if I add a column to table one? http://sqlfiddle.com/#!2/cd0b0/1

alter table one add column other text;

update one set other = 'foo';

And my results from the same query as before are oneid=1, data=42, twoid=2, and other=foo.

A change in one of the tables disrupts the values of a select * and suddenly your binding of 'other' to an int is going to throw an error and you don't know why.

If instead your SQL statement was

select 
    one.oneid, one.data, two.twoid, two.other
from one join two on (one.twoid = two.twoid);

The change to table one would not have disrupted your data. That query runs the same before the change and after the change.


Indexing

When you do a select * from you are pulling all the rows form all the tables that match the conditions. Even tables you really don't care about. While this means more data is transferred there's another performance issue lurking further down the stack.

Indexes. (related on SO: How to use index in select statement?)

If you are pulling back lots of columns the database plan optimizer may disregard using an index because you are still going to need to fetch all those columns anyways and it would take more time to use the index and then fetch all of the columns in the query than it would be just to do a complete table scan.

If you are just selecting the, say, last name of a user (which you do a lot and so have an index on it), the database can do an index only scan (postgres wiki index only scan, mysql full table scan vs full index scan, Index-Only Scan: Avoiding Table Access).

There is quite a bit of optimizations about reading only from indexes if possible. The information can be pulled in faster on each index page because you're pulling less of it also - you're not pulling in all those other columns for the select *. It is possible for an index only scan to return results on the order of 100x faster (source: Select * is bad).

This isn't saying that a full index scan is great, its still a full scan - but its better than a full table scan. Once you start chasing down all the ways that that select * hurts performance you keep finding new ones.

Related reading

  • 2
    @Tonny I'd agree - but when I answered (first) I never thought this question would generate quite so much discussion and commentary! Its obvious to query only for named columns, isn't it?! – gbjbaanb Apr 04 '14 at 07:28
  • 3
    Breaking everything by adding a column is also a good reason why code should always access columns in a datareader by name not by hard-coded ordinal... – Julia Hayward Apr 04 '14 at 08:51
  • 1
    @gbjbaanb It is to me. But a lot of people come to write SQL queries without a formal background/training. To them is may not be obvious. – Tonny Apr 04 '14 at 09:01
  • I agree with MichaelT's answer. Client-side SQL should probably *never* use "select * from". However, on Oracle, in a local cursor inside a PL/SQL package, I think it may be OK to use it, the reason being that there is no perfomance overhead and that everything is staying completely inside the DB: If the schema changes later and if this has consequences for your code at all, the package body will be invalidated and contain errors. It's always a good idea to compile the schema and check USER_ERRORS after any DDL. – hvb Apr 04 '14 at 10:19
  • This is a better answer than the accepted one (at time of writing this comment), but it still misses the most important point, which is that retrieving all columns virtually guarantees that no index will ever cover the entire query. The "overhead" of a few extra columns is nothing compared to the overhead of doing table scans, index scans, and bookmark lookups instead of faster seeks/range scans/skip scans. – Aaronaught Apr 06 '14 at 02:03
  • 1
    @Aaronaught I've updated it with the additional bit on the indexing issues. Are there any other points that I should bring up for the wrongness of `select *`? –  Apr 07 '14 at 00:59
  • 1
    No other points that I can think of - this pretty much covers it. Too bad my +1 can't put this over the crap that the OP accepted way too quickly... – Aaronaught Apr 08 '14 at 00:08
  • @Aaronaught I'm not complaining too much - I got a populist badge out of it (its shiny gold!). –  Apr 08 '14 at 00:27
  • 3
    Wow, the accepted answer was so poor at actually explaining anything that I down-voted it. Amazed that this isn't the accepted answer. +1. – Ben Lee Apr 08 '14 at 16:25
67

Think about what you're getting back, and how you bind those to variables in your code.

Now think what happens when someone updates the table schema to add (or remove) a column, even one you're not directly using.

Using select * when you're typing queries by hand is fine, not when you're writing queries for code.

gbjbaanb
  • 48,354
  • 6
  • 102
  • 172
  • 1
    When using APIs where in code you access columns by column name, nothing much will happen, unless a column you want to use isn't there anymore. – PlasmaHH Apr 03 '14 at 13:51
  • 8
    Performance, network load, etc. etc. are far more important than convenience of getting the columns back in the order and with the name you want. – jwenting Apr 03 '14 at 14:38
  • 21
    @jwenting really? performance matters more than correctness? Anyway, I don't see that "select *" performs better than selecting only the columns you want. – gbjbaanb Apr 03 '14 at 14:49
  • @gbjbaanb yes, it does. Sadly many people forget that you should write working code rather than pretty code (though if you can combine the two, even better). And poorly performant code, code that causes excessive network traffic for example, or overloads the database server, is NOT correct code. Which is exactly what can easily happen when doing "select *", especially if you're pulling in BLOB or CLOB columns you don't need. – jwenting Apr 03 '14 at 14:51
  • 1
    Someone shouldn't be making schema changes to the database your program uses without knowing what it impacts. If this is happening, then we could equally ask what if someone renames one of the columns, or changes the data type. See the second answer here http://stackoverflow.com/questions/3180375/select-vs-select-column – Bratch Apr 03 '14 at 15:36
  • 9
    @Bratch, in real life production environments, you might have hundreds of applications using the same tables and there's no possible way all those applications can be maintained properly. You are correct in sentiment, but practically, the argument fails just due to the realities of working in copmanies. Schema changes to active tables happens all the time. – user1068 Apr 03 '14 at 15:54
  • 5
    @jwenting think for a bit. Fast code that does not work, is code that *does not work*. If performance matters more than correctness, I have some awesome code that runs in almost 0 milliseconds that I can sell you for a quite reasonable sum :) – gbjbaanb Apr 03 '14 at 16:17
  • 1
    @jwenting There's gross performance negligence which borders on or outright crosses over into incorrect code (making something O(n!) that could easily be O(n), for example), and then there's sacrificing correctness for the sake of (often misunderstood) performance "gains" (such as putting NOLOCK on every single query because "it improves performance and prevents deadlocks"). Performance is a tradeoff, but as gbjbaanb said, code that doesn't need to work is easily the fastest code. – Nick Apr 03 '14 at 17:13
  • @jwenting: I think gbjbannb misunderstood you as saying that performance matters more than correctness, and then you affirmed this misunderstanding. What you originally said was performance is more correct than _convenience_, not correctness, which I think is far less disagreeable. – Mooing Duck Apr 03 '14 at 18:35
  • 18
    I don't understand the point in this answer. If you add a column to a table, both the SELECT * and the SELECT [Columns] will work, the only difference is that if the code needs to bind to the new column, the SELECT [Columns] will need to be modified whereas the SELECT * will not. If a column is removed from a table, the SELECT * will break at the point of binding, whereas the SELECT [Columns] will break when the query is executed. It seems to me that the SELECT * is the more flexible option as any changes to the table would only require changes to the binding. Am I missing something? – TallGuy Apr 03 '14 at 19:26
  • 1
    @TallGuy the point is you do not know which order the columns will appear in (ie if something changes). I prefer to be explicit in these things, so I am not surprised by code that reads columnX yet one day suddenly starts to be populated with columnN. Its sloppy, lazy programming practice. (and its also inefficient if you only use a subset). – gbjbaanb Apr 03 '14 at 20:45
  • 11
    @gbjbaanb then access the columns by name. Anything else would be obviously stupid unless you specified the column order in the query. – user253751 Apr 03 '14 at 22:14
  • 2
    @immibis I agree. While the order of column data returned by a database query might be maintained, using the order of your columns (especially if you're using * instead of specifying them!) to obtain your data seems extremely error-prone. – Craig Otis Apr 03 '14 at 23:26
  • 1
    @MooingDuck he seems to equate correctness with looking pretty, and THAT's what is the problem. Performance should never be sacrificed to make code look "pretty". If it works as it should, what it looks like is not that important (though, as I said, making it look good is rewarding, and it can make maintenance easier, but if it looks good but doesn't perform to specs it's not correct). – jwenting Apr 04 '14 at 06:53
  • @jwenting I fail to see the performance improvement of receiving all columns of a table or join across the network when I only needed 2 columns. – Adriano Varoli Piazza Apr 04 '14 at 16:00
  • 2
    @AdrianoVaroliPiazza: What he _meant_ was that performance (via being explicit about columns) is a better reason to avoid `*` than the inability to access by column. He was NOT suggesting that `*` is a good thing. – Mooing Duck Apr 04 '14 at 16:30
  • 1
    @MooingDuck right, I misread that. – Adriano Varoli Piazza Apr 04 '14 at 17:12
  • 5
    Terrible answer. Almost none of the major frameworks today care about having extra columns in the result set. Even if you don't use a framework, most of the time `SELECT *` should work - I mean, you *are* referencing columns by name, and not by index, aren't you? Index-based mapping is just bad. There are plenty of reasons not to use `SELECT *`, but this isn't one of them. – Aaronaught Apr 06 '14 at 02:01
  • @Aaronaught who says he's using a major framework? who says he isn't using index access? You can assume all you like, but at the end of the day - its an error waiting to happen solely to save someone typing out what they wanted to use - laziness is always poor programming practice. (and index is good - its fast, unlike string compares for everything. There are reasons apps run as fast today as they used to years ago even though we have vastly faster computers. This is one of them) – gbjbaanb Apr 06 '14 at 10:50
  • 2
    Really? You think accessing columns by index instead of by name is a good idea because it's *faster*? Talk about micro-optimization... you're talking about queries that are, *at minimum*, on the order of several milliseconds, and you're saving maybe a dozen CPU instructions at the very high cost of any kind of maintainability. Hard-coding column indexes is "an error waiting to happen" - doing `SELECT *` is almost never going to affect the correctness of any half-decently designed program. – Aaronaught Apr 06 '14 at 13:52
  • @Aaronaught and yet you criticise indexes for being "an error waiting to happen" yet you advocate "select *". String compares v integer compares - they're hugely different, and a program based on string compares is going to be very slow, it may be "a few CPU instructions" (plus a few memory accesses, and cache stalls) but they add up. Sometimes they really add up. – gbjbaanb Apr 06 '14 at 21:53
  • 2
    You've got to be kidding me. The string comparisons are *several orders of magnitude* less expensive than the actual database operation. The fact that you would even suggest that this matters one whit in practice is proof positive that you've never even attempted to profile it. – Aaronaught Apr 08 '14 at 00:04
  • 1
    @Aaronaught seriously, one little bit of bad practice does mean its ok because you do it once. People who use string comparitors use them all over the place, array indexes become keyed by strings and next thing you know, you've slowed everything down - and then you look at benchmarks and say "it must be that big db connection because I profiled the indexes and they look insignificant". Overall performance comes from an eye on how everything works. – gbjbaanb Apr 08 '14 at 07:26
  • 2
    This is utter nonsense. String comparisons **are** used everywhere in practice, and they're not slow. The fastest databases available today are distributed hash tables that take strings as keys. I can assure you that I'm very fussy about performance and know a whole lot about optimization, and the penalty for string comparisons vs. primitive type comparisons is completely irrelevant in every instance except maybe for operating systems and microcontroller software. The primary factor in performance is I/O; secondary is the computational complexity, e.g. O(N²) vs. O(N). – Aaronaught Apr 09 '14 at 00:17
  • String lookups are used everywhere, but I'm not talking about that - I'm saying that if you use a string lookup for picking columns out of your DB query so you can use 'select *' (and pull unnecessary columns down), you're trading convenience for performance. Using a integer index stops this kind of laziness and is a useful tool to ensure good practice. String lookups are much slower than integer indexing, and the mass use of them (for convenience) does have an impact on overall performance as well. It is not something to assume is fair practice because something else is slow. – gbjbaanb Apr 09 '14 at 09:13
  • 1
    @user1068, maybe at companies with poor planning or no change controls, or even worse, poor security practices that allow anyone to make schema changes and break all software accessing the table. Select * or columns wouldn't even matter. Maybe this would be good material for the DBA SE. – Bratch Apr 09 '14 at 22:47
  • 1
    @gbjbaanb "Premature optimization" isn't a compliment. You have much more serious performance problems elsewhere in your code. The only way to know is to follow reasonable practices during development, then profile the running code to find out where the actual performance issues are. I can just about guarantee they won't be where you thought they would be and you'll realize belatedly that you put a lot of effort into optimizing where it didn't matter. As an anti-bonus, you also ended up with code that is much harder to maintain due to use of "magic numbers." Maintenance is hugely important. – Craig Tullis Apr 10 '14 at 02:54
  • Also, it would probably be better to use the term "ordinal" to refer to the positions of the columns in the result set. "Index" doesn't necessarily mean quite the same thing, and has additional unrelated meaning related to databases, so it introduces a little confusion in this context. I suppose the term "position" works pretty well, too. – Craig Tullis Apr 10 '14 at 03:01
  • 2
    I just had a look at some of the .NET DbDataReader code (which won't be all *that* different from other providers). The string indexer (`this[string]`) calls `GetOrdinal(string)`, and then calls `GetValue(int)` with the result. There's a HashTable involved. HashTable lookups are pretty darn fast. The HashTable is cached for follow-on column ordinal look-ups. All in all, this just plain isn't where your performance issues are going to be. Use the column names. It's a much better pattern for maintainable code, and creates a far less brittle dependency on the database. – Craig Tullis Apr 10 '14 at 03:24
  • @Craig I agree, I just disagree with the practice of "I'll do it this way solely because it makes my life easier" when the more efficient way is hardly more difficult. DB query code (for example) is tightly coupled between specifying the columns and reading them, why wouldn't you use an integer positional lookup for half a dozen columns you wrote out 2 lines earlier? Its the attitude more than anything, one that gives rise to using select * in the first place. Optimisation: there's a trade off obviously, but I see much "we'll fix it later" that is just an excuse to be sloppy that costs to fix. – gbjbaanb Apr 10 '14 at 07:33
  • 4
    @gbjbaanb I guess I'm asserting that the performance difference between using ordinals directly and using the names of the columns is a miniscule, unimportant fraction of the overall cost of that call to the database. We're talking micrososeconds vs milliseconds. You're also presuming that the SQL you derive your ordinals from is embedded in your imperative code in the same place, but that is neither a universal best practice nor a guarantee. Using the column names is just a better **pattern** for writing and maintaining solid code. This goes to code **correctness**, *not* mere convenience. – Craig Tullis Apr 10 '14 at 14:56
  • 2
    @Bratch - Your view seems 'academic' to me - ignoring the realities of actual companies and the way work really gets done. Those things you mention are nice, but the reality is almost nobody has that stuff screwed on tight. You should not create maintenance issues when you have a choice, and here there is a clearly better choice. I've been getting paid for this stuff for 15 years, and I've read a lot of comments like yours - "if we were doing it right this would not be a problem" well, in the real world, we do it wrong, and a good developer plans for that. – user1068 Apr 10 '14 at 17:09
38

Another concern: if it's a JOIN query and you're retrieving query results into an associative array (as could be the case in PHP), it's bug-prone.

The thing is that

  1. if table foo has columns id and name
  2. if table bar has columns id and address,
  3. and in your code you are using SELECT * FROM foo JOIN bar ON foo.id = bar.id

guess what happens when someone adds a column name to the bar table.

The code will suddenly stop working properly, because now the name column appears in the results twice and if you're storing the results into an array, data from second name (bar.name) will overwrite the first name (foo.name)!

It's quite a nasty bug because it's very non-obvious. It can take a while to figure out, and there's no way the person adding another column to the table could have anticipated such undesirable side effect.

(True story).

So, don't use *, be in control of what columns you are retrieving and use aliases where appropriate.

Konrad Morawski
  • 9,721
  • 4
  • 37
  • 58
  • okay in this case (which i consider sort of rare) it could be a major issue. But you could still avoid(and most people probably will) it by querying with the wildcard and just add an alias for the identical column names. – the baconing Apr 03 '14 at 12:36
  • 4
    In theory, but if you use a wildcard for convenience you rely on it to automatically give you all columns in existence and never bother to update the query as the tables grow. If you are specifying each and every column, you are forced to go to the query to add another one to your `SELECT` clause and this is when you hopefully spot the name is not unique. BTW I don't think it's so rare in systems with large databases. As I said, I once spent a couple of hours hunting this bug in a big mudball of PHP code. And I found another case just now: http://stackoverflow.com/q/17715049/168719 – Konrad Morawski Apr 03 '14 at 12:40
  • 3
    I spend an hour last week trying to get this through a a consultants head. He is supposed to be a SQL guru... Sigh... – Tonny Apr 03 '14 at 22:30
22

Querying every column might be perfectly legitimate, in many cases.

Always querying every column isn't.

It's more work for your database engine, which has to go off and rummage around its internal metadata to work out which columns it needs to deal with before it can get on with the real business of actually getting the data and sending it back to you. OK, it's not the biggest overhead in the world, but system catalogs can be an appreciable bottleneck.

It's more work for your network, because you're pulling back any number of fields when you might only want one or two of them. If somebody [else] goes and adds a couple of dozen extra fields, all of which contains big chunks of text, you're throughput suddenly goes through the floor - for no readily apparent reason. This is made worse if your "where" clause isn't particularly good and you're pulling back lots of rows as well - that's potentially a lot of data tromping its way across the network to you (i.e. it's going to be slow).

It's more work for your application, having to pull back and store all of this extra data that it quite probably doesn't care about.

You run the risk of columns changing their order. OK, you shouldn't have to worry about this (and you won't if you select only the columns you need) but, if you go get them all at once and somebody [else] decides to rearrange the column order within the table, that carefully crafted, CSV export that you give to accounts down the hall suddenly goes all to pot - again, for no readily apparent reason.

BTW, I've said "someone [else]" a couple of times, above. Remember that databases are inherently multi-user; you may not have the control over them that you think you do.

Phill W.
  • 11,891
  • 4
  • 21
  • 36
  • 3
    I would think that always querying every column may be legitimate for things like schema-agnostic table viewing facilities. Not a terribly common situation, but in the context of internal-use-only tools such things can be handy. – supercat Apr 03 '14 at 15:11
  • 1
    @supercat That is just about the ONLY valid use-case for a "SELECT *" that I can think of. And even then I would prefer to limit the query to "SELECT TOP 10 *" (in MS SQL) or add "LIMIT 10" (mySQL) or add "WHERE ROWNUM <=10" (Oracle). Usually in that case it's more about "what columns are there and some sample data" than the complete content. – Tonny Apr 03 '14 at 22:39
  • @Tonny: SQL Server changed their default scripts to add the `TOP` limitation; I'm not sure how important that is if code reads as many as it cares to display and then disposes the query. I think query responses are processed somewhat lazily, though I don't know the details. In any case, I think that rather than saying it "isn't legitimate", it would be better to say "...is legitimate in *far* fewer"; basically, I'd summarize the legitimate cases as those where the user would have a better idea what's meaningful than the programmer. – supercat Apr 03 '14 at 22:44
  • @supercat I can agree to that. And I really like the way you put it in your last sentence. I have to remember that one. – Tonny Apr 03 '14 at 22:49
11

The short answer is: it depends on what database they use. Relational databases are optimized for extracting the data you need in a fast, reliable and atomic way. On large datasets and complex queries it's much faster and probablly safer than SELECTing * and do the equivalent of joins on the 'code' side. Key-value stores might not have such functionalities implemented, or might not be mature enough to use in production.

That said, you can still populate whatever data structure you're using with SELECT * and work out the rest in code but you'll find performance bottlenecks if you want to scale.

The closest comparison is sorting data: you can use quicksort or bubblesort and the result will be correct. But won't be optimized, and definitely will have issues when you introduce concurrency and need to sort atomically.

Of course, it's cheaper to add RAM and CPUs than investing in a programmer that can do SQL queries and has even a vague understanding of what a JOIN is.

lorenzog
  • 419
  • 2
  • 8
  • Learn SQL! It isn't that hard. It is the "native" language of databases far and wide. It's powerful. It's elegant. It has stood the test of time. And there's no way you're going to write a join on the "code" side that's more efficient than the join in the database, unless you are really inept at doing SQL joins. Consider that in order to do a "code join," you have to pull all the data from both tables in even a simple 2-table join. Or are you pulling index stats and using those to decide which table data to pull before you join? Didn't think so... Learn to use the database correctly, people. – Craig Tullis Apr 05 '14 at 13:23
  • @Craig: SQL is common in *relational* databases far and wide. That's far from the only type of DB, though...and there's a reason more modern database approaches are often called NoSQL. :P No one i know would go calling SQL "elegant" without a heavy dose of irony. It just sucks less than many of the alternatives, as far as relational databases are concerned. – cHao Apr 06 '14 at 08:27
  • @cHao I've been very aware of the various other types of databases out there for *decades*. The Pick "nosql" database has been around forever. "NoSQL" isn't even remotely a new concept. ORM's have also been around forever, and they've always been slow. Slow != Good. As for elegance (LINQ?), you can't convince me this is reasonable or elegant for a where clause: `Customer customer = this._db.Customers.Where( “it.ID = @ID”, new ObjectParameter( “ID”, id ) ).First();` See [Time to Take Offense](http://www.sybase.com/files/White_Papers/Sybase_SQLAnywhere_Top10newFeatures_wp.pdf) on page 2. – Craig Tullis Apr 06 '14 at 18:38
  • @Craig: Don't even get me started on ORM. Nearly every system out there does it horribly, and the abstraction leaks all over the place. That's because *relational DB records aren't objects* -- at best, they're the serializable guts of part of an object. But as for LINQ, you really want to go there? The SQLish equivalent is something like `var cmd = db.CreateCommand(); cmd.CommandText = "SELECT TOP 1 * FROM Customers WHERE ID = @ID"; cmd.Parameters.AddWithValue("@ID", id); var result = cmd.ExecuteReader();`....and then proceed to create a Customer from each row. LINQ beats the pants off that. – cHao Apr 06 '14 at 19:06
  • @Craig: Granted, it's not as elegant as it could be. But it'll never be as elegant as i'd like til it can convert .net code to SQL. :) At which point you could say `var customer = _db.Customers.Where(it => it.id == id).First();`. – cHao Apr 06 '14 at 19:16
  • @cHao I'm going to stick with the `SELECT TOP 1 * FROM Customers WHERE ID = @ID` bit being pretty elegant. I *like* SQL, and I like that it doesn't layer yet another abstraction over the database. I do like extension methods, and LINQ is just a natural (obvious, actually) evolution of extension methods, but don't care tons for LINQ 2 SQL (know how to use it, tho). As to the ADO example, the common cruft should be abstracted into a library. I've done it, others have done it. My most recent favorite candidate is probably the Dapper lightweight ORM from none other than the StackExchange people. – Craig Tullis Apr 06 '14 at 19:34
  • ...except of course that I'd avoid the `select *` in favor of selecting just the required fields. :-) – Craig Tullis Apr 06 '14 at 19:36
  • @Craig: That part, alone, is not too bad. Though if you use some other DBMS, the syntax will end up changing. PostgreSQL and MySQL would say `SELECT * FROM Customers ... LIMIT 1`. And Oracle does some ugliness with a magic `ROWNUM` column. That's one huge problem i have with using SQL directly: a number of common things aren't standardized. Each dialect has quirks that make it incompatible with the others. – cHao Apr 06 '14 at 19:45
  • @cHao Sure, but those differences still have to be handled by somebody, right? So you're just shifting the responsibility to the ORM, and ORMs are almost universally not very fast, and they aren't immune to having bugs of their own that you'll have no control over fixing. Although maybe even the slowest ORMs are "fast enough" in at least some cases. But despite any of this, how often is database portability a *real* issue? – Craig Tullis Apr 06 '14 at 20:37
  • It's also kind of funny in light of all the performance issues with ORMs to see all the comments from people who insist that you should eliminate db constraints and transactions for "performance" reasons, so that you can do all the joins and integrity checks and rollbacks in the middle tier, then they want to use LINQ2SQL or Hibernate in the middle tier. And ORMs aren't immune to complexity, either, not to mention the leaky abstractions you brought up, and the simple fact that your database is going to *far* outlive your application code. – Craig Tullis Apr 06 '14 at 20:39
  • @Craig: I'm not saying "treat the database as if it were an Excel spreadsheet". :) Constraints, indexes, transactions, triggers, etc are essential. But as for portability, my job is to work on an application that talks to a bunch of different databases. Some are PostgreSQL, some MySQL, and some are something else entirely. I'd rather have one front-end hiding away the differences than have to worry about which database is which type and write different code to access each one. – cHao Apr 07 '14 at 04:05
  • @cHao I didn't have the sense that you were saying that, although you and I both know that a number of people more or less do--the whole "modern" notion that a database is merely storage and shouldn't try to be anything more, which is unfortunate. Many devs can't seem to be bothered to really learn databases. Unfortunate. I sympathize with your reasons for preferring to use an ORM, which is fine if you can take the performance hit. I still kind of like abstracting everything but the SQL, which isn't that hard. But then I like stored procedures, too. :-) – Craig Tullis Apr 07 '14 at 05:20
  • @cHao Why does "..and then proceed to create a Customer from each row. LINQ beats the pants off that." In the example you gave above? It is solely because of the object-relational impedance mismatch. – johnny Apr 07 '14 at 14:33
  • @johnny: Were that the whole story, LINQ wouldn't have all that much over straight SQL. There's also the DBMS-agnostic query syntax, and the relative lack of boilerplate/busywork code. (The main ugliness is the `new ObjectParameter`. That could be avoided these days with a dynamic or anonymous-type object, but eh.) – cHao Apr 07 '14 at 22:34
8

IMO, its about being explicit vs implicit. When I write code, I want it to work because I made it work, not just because all of the parts just happen to be there. If you query all records and your code works, then you'll have the tendency to move on. Later on if something changes and now your code doesn't work, its a royal pain to debug lots of queries and functions looking for a value that should be there and the only values reference are *.

Also in an N-tiered approach, its still best to isolate database schema disruptions to the data tier. If your data tier is passing * to the business logic and most likely on the the presentation tier, you are expanding your debugging scope exponentially.

zkent
  • 188
  • 3
  • 3
    This is probably one of the most important reasons here, and it's got just a tiny fraction of the votes. The maintainability of a codebase littered with `select *` is much worse! – Eamon Nerbonne Apr 05 '14 at 09:39
6

because if the table gets new columns then you get all those even when you don't need them. with varchars this can become a lot of extra data that needs to travel from the DB

some DB optimizations may also extract the non fixed length records to a separate file to speed up access to the fixed length parts, using select* defeats the purpose of that

ratchet freak
  • 25,706
  • 2
  • 62
  • 97
1

Apart from overhead, something you want to avoid in the first place, I would say that as an programmer you don't depend on column order defined by the database administrator. You select each column even if you need them all.

  • 3
    Agree, though I'd also recommend pulling out values from a result set by column name in any case. – Rory Hunter Apr 03 '14 at 14:05
  • Seconded, carried. Use the column names, do not depend on the column order. The column order is a brittle dependency. The names should have (you hope) been derived from some actual design effort, or you explicitly alias composite columns or computations or conflicting column names in your query, and reference the explicit alias that you specified. But relying on the order is pretty much just duct tape and prayer... – Craig Tullis Apr 05 '14 at 13:18
1

I don't see any reason why you shouldn't use for the purpose it's build - retrieve all the columns from a database. I see three cases:

  1. A column is added in the database and you want it in code also. a) With * will fail with a proper message. b) Without * will work, but won't do what you expect which is pretty bad.

  2. A column is added in database and you do not want it in code. a) With * will fail; this means that * does no longer applies since it's semantics means "retrieve all". b) Without * will work.

  3. A column is removed Code will fail either way.

Now the most common case is case 1 (since you used * which means all you most probably want all); without * you can have code that works fine but doesn't do what expected which is much, much worst that code that fails with a proper error message.

I'm not taking into consideration the code which retrieves the column data based on column index which is error-prone in my opinion. It's much more logic to retrieve it based on column name.

Random42
  • 10,370
  • 10
  • 48
  • 65
  • Your premise is incorrect. `Select *` was intended more as a convenience for ad-hoc querying, not for application development purposes. Or for use in statistical constructs like `select count(*)` which lets the query engine decide whether to use an index, which index to use and so on and you aren't returning any actual column data. Or for use in clauses like `where exists( select * from other_table where ... )`, which again is an invitation to the query engine to pick the most efficient path on its own and the subquery is only used to constrain results from the main query. Etc. – Craig Tullis Apr 10 '14 at 03:14
  • @Craig I believe every book/tutorial on SQL says that `select *` has the semantics of retrieving all the columns; if your application really needs this, I don't see any reasons why not use it. Can you point to some reference (Oracle, IBM, Microsoft etc.) that mention the purpose for which `select *` was build is not to retrieve all columns? – Random42 Apr 10 '14 at 06:18
  • Well, of course `select *` exists to retrieve all the columns... as a convenience feature, for ad-hoc querying, not because it's a great idea in production software. The reasons are already covered pretty well in the answers on this page, which is why I didn't create my own detailed answer: •) Performance problems, repeatedly marshalling data over the network that you never use, •) problems with column aliasing, •) query plan optimization failures (failure to use indexes in some cases), •) inefficient server I/O in cases where limited select could have *solely* used indexes, etc. – Craig Tullis Apr 10 '14 at 14:44
  • *Maybe* there's an edge case here or there that justifies the use of `select *` in an actual production application, but the nature of an edge case is that it is not the *common* case. :-) – Craig Tullis Apr 10 '14 at 15:01
  • @Craig The reasons are against retrieving all columns from a database not against using `select *`; what I was saying if you really need all the columns, I see no reason why you shouldn't use `select *`; although few there must be scenarios where all the columns are needed. – Random42 Apr 10 '14 at 16:45
  • There are still the issues with index use/optimized query plans and others. Requirements change, schemas change, "all columns" might be from a database view rather than a table and the view definition is subject to potential change. I'm not claiming for a second that I've never used `select *` in an app, just to be clear. But when I reflect back on it, it's almost always been in a "hurry up" situation and wasn't necessarily the best choice. – Craig Tullis Apr 10 '14 at 16:52
  • ALSO--if you ARE using an ORM that can spin up objects for you automagically (Dapper...), and you use `select *`, then you are not only filling the network pipe and server memory with extra unused data on every call, but filling computer memory with bloated objects that contain data and properties that don't get used. ONE OF THE BIG ISSUES IS THE CREATION OF BRITTLE DEPENDENCIES, which is a bad thing. – Craig Tullis Apr 10 '14 at 16:54
  • @Craig Again, you are advising against selecting all the columns from a table which I totally agree with in the vast majority of cases; but using `select *` when you actually want and need all the columns is a totally different thing. – Random42 Apr 10 '14 at 18:20
  • ...which I am simply contending should be, in essentially all case with virtually no exceptions, only when doing ad-hoc, real-time discovery queries against the database using your weapon of choice for your particular database (iSQL, SQL Server Management Studio, TOAD, MySQL Workbench, SQLyog, the command line...). I'm just reiterating that I don't believe there are many, if any, justifiable cases for using `select *` *in production application software*. I've seen it a lot, been doing this a year or two (or twenty-two or more) and have seen it misused *far* more often than not. Peace. :-) – Craig Tullis Apr 10 '14 at 18:40
  • So you think you want all the columns from a table, or from a join, so you use `select *` in your app, right? Then, the schema of one or more of the underlying tables or views changes. After all, in most large environments, your app doesn't have exclusive ownership of the database. So now you're getting back more than you originally bargained for (or less). Lots of finicky little things could happen. You could end up with duplicate column names from entirely different base tables, lots of things. I'm just not going to become a fan of using `select *` *in production apps.* – Craig Tullis Apr 10 '14 at 18:44
1

Think of it this way... if you query all columns from a table that has just a few small string or numeric fields, that total 100k of data. Bad practice, but it will perform. Now add a single field that holds, say, an image or a 10mb word document. now your fast performing query immediately and mysteriously start performing poorly, just because a field was added to the table... you may not need that huge data element, but because you've done Select * from Table you get it anyway.

  • 6
    this seems to merely repeat point already made few hours ago in a [first answer](http://programmers.stackexchange.com/a/234659/31260) and in couple of other answers – gnat Apr 03 '14 at 19:02