8

Background

I am currently in the process of enduring grueling tech interviews for positions that use the .NET stack, some of which include silly questions like this one, and some of which are more valid. I recently came across an issue that may be valid, but I want to check with the community here to be sure.

When asked by an interviewer how I would count the frequency of words in a text document and rank the results, I answered that I would

  1. Use a stream object to read the text file into memory as a string.
  2. Split the string into an array on spaces while ignoring punctuation.
  3. Use LINQ against the array to .GroupBy() and .Count(), then OrderBy() said count (roughly as sketched below).
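
For illustration, here is roughly what I had in mind (a sketch, not the exact code from the interview; the file name and delimiter list are placeholders):

using System;
using System.IO;
using System.Linq;

class WordFrequency
{
    static void Main()
    {
        // Step 1: read the whole file into memory as a single string.
        var text = File.ReadAllText("document.txt");

        // Step 2: split on whitespace and punctuation.
        var words = text.Split(
            new[] { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?' },
            StringSplitOptions.RemoveEmptyEntries);

        // Step 3: group, count, and rank with LINQ.
        var ranked = words
            .GroupBy(word => word)
            .Select(group => new { Word = group.Key, Count = group.Count() })
            .OrderByDescending(entry => entry.Count);

        foreach (var entry in ranked)
            Console.WriteLine("{0}: {1}", entry.Word, entry.Count);
    }
}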

I got this answer wrong for two reasons:

  1. Streaming an entire text file into memory could be disastrous. What if it was an entire encyclopedia? Instead, I should stream one block at a time and begin building a hash table.
  2. LINQ is too expensive and requires too many processing cycles. I should have built a hash table instead and, on each iteration, added a word to the hash table only if it didn't already exist, then incremented its count (roughly as sketched below).
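
As I understood the interviewer, the preferred approach looks something like this (my sketch of it; reading line by line stands in for "one block at a time", and the file name is again a placeholder):

using System;
using System.Collections.Generic;
using System.IO;

class WordFrequencyStreaming
{
    static void Main()
    {
        var counts = new Dictionary<string, int>();
        var separators = new[] { ' ', '\t', '.', ',', ';', ':', '!', '?' };

        // Stream the file rather than loading it all into memory at once.
        using (var reader = new StreamReader("document.txt"))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                foreach (var word in line.Split(separators, StringSplitOptions.RemoveEmptyEntries))
                {
                    int current;
                    counts.TryGetValue(word, out current); // current stays 0 for a new word
                    counts[word] = current + 1;
                }
            }
        }
    }
}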

The first reason seems, well, reasonable. But the second gives me more pause. I thought that one of the selling points of LINQ is that it simply abstracts away lower-level operations like hash tables but that, under the veil, it is still the same implementation.

Question

Aside from a few additional processing cycles to call any abstracted methods, does LINQ require significantly more processing cycles to accomplish a given data-iteration task than a lower-level implementation (such as building a hash table) would?

Matt Cashatt
  • Ask him what dumbass put an entire encyclopedia in a single text file? – JeffO Apr 14 '12 at 16:58
  • At least when in comes to LINQ to Objects, this should be easy enough to test. I doubt you will see a significant difference. But it does depend on the LINQ provider. – Oded Apr 14 '12 at 17:07
  • It's one of those things that should be measured. Build 2 or 3 implementations and record the performance. Generalizations about LINQ or technique X are not useful. I'd say it's poor of the interviewer to declare using LINQ a "wrong" answer, although in heavy-load, server-side processing every millisecond counts. – Lord Tydus Apr 14 '12 at 17:09
  • A quick google for "performance testing linq to objects and loops" found quite a few hits. Some include source code that you can take and use to test for yourself. See [this](http://architecture-blog.rdacorp.com/2009/03/linq-on-objects-performance.html), [this](http://www.dotnetscraps.com/dotnetscraps/post/linq-performance-part-1-linq-to-collection.aspx) and [this](http://codebetter.com/stevehebert/2008/02/06/linq-to-objects-relating-data-structure-organization-to-where-clause-optimization/). – Oded Apr 14 '12 at 17:42
  • @Oded--Thanks Oded. These links were helpful and indicate that LINQ is indeed much slower and would be noticeable in a heavy load situation. Can you please convert your comment to an answer? – Matt Cashatt Apr 14 '12 at 17:46
  • As for interviews, remember there are some "old school" C++ programmers who think you should re-invent the wheel rather than use the .NET libraries. You'll also run into old school VB'ers who want to do all of the data access code by hand rather than using LINQ and EF. – jfrankcarr Apr 14 '12 at 18:25
  • I will not post an answer, as I don't consider a set of links to be an answer... – Oded Apr 14 '12 at 18:36
  • Oded, those examples in the links you provided are very wrong. I can't go into all the details in a comment, but take the second link. It compares "foreach x if x=toFind stop" with a LINQ query which does the equivalent of "select * from list where x like toFind". The difference is that the first one stops when it finds the first instance; the LINQ query always iterates over every entry and will return a collection of ALL items matching the search pattern. Very different. That's not because LINQ is broken, it's because he used the wrong query. – Ian Apr 14 '12 at 23:18

2 Answers

9

I'd say the main weakness of this answer is less its use of LINQ and more the specific operators chosen. GroupBy takes each element and projects it to a key and a value, which go into a lookup. In other words, every word will add something to the lookup.

The naive implementation .GroupBy(e => e) will store a copy of every word in the source material, making the final lookup nearly as large as the original source material. Even if we project out the value with .GroupBy(e => e, e => null) we're creating a large lookup of small values.

What we would want is an operator that preserves only the needed information, which is one copy of each word and the count of that word so far. For that, we can use Aggregate:

words.Aggregate(new Dictionary<string, int>(), (counts, word) =>
{
    int currentCount;
    counts.TryGetValue(word, out currentCount); // leaves currentCount at 0 for a new word
    counts[word] = currentCount + 1;
    return counts;
});
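
If we assign the result of that Aggregate call to a variable (say wordCounts, a name chosen here purely for illustration), ranking is then a single extra pass over the much smaller dictionary:

var ranked = wordCounts.OrderByDescending(pair => pair.Value); // pair.Key is the word, pair.Value its count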

From here, there are several ways we could attempt to make this faster:

  1. Instead of creating many strings while splitting, we could pass around structs that reference the original string and the segment that contains the word, and only copy the segment out when it turns out to be a unique key.
  2. Use Parallel LINQ to aggregate across several cores, then combine the results. This is trivial compared to the legwork required for doing this by hand (see the sketch below).
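
For point 2, a hedged sketch using the PLINQ Aggregate overload that takes a per-partition seed factory and a combiner, assuming words is an IEnumerable<string> as above (whether this actually wins depends on the input and the partitioning, so measure first):

var wordCounts = words.AsParallel().Aggregate(
    // One dictionary per partition, so no locking in the hot loop.
    () => new Dictionary<string, int>(),
    (counts, word) =>
    {
        int current;
        counts.TryGetValue(word, out current);
        counts[word] = current + 1;
        return counts;
    },
    // Merge the per-partition dictionaries once at the end.
    (total, partial) =>
    {
        foreach (var pair in partial)
        {
            int current;
            total.TryGetValue(pair.Key, out current);
            total[pair.Key] = current + pair.Value;
        }
        return total;
    },
    counts => counts);
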
Chris Pitman
  • All good points Chris, thanks. I am going to refrain from accepting for a bit as the question is more general and essentially answered by Oded in the comments above. I just want to give him the opportunity to provide the answer first. Thanks again for your insight, however, which is great. – Matt Cashatt Apr 14 '12 at 18:18
6

I think you had a narrow escape; the interviewer didn't really know what he was talking about. Even worse, he believes there is a "right" answer. If he were someone you would want to work for, I'd expect him to take your initial answer, find out why you chose it, and then challenge you to make it better if he could find problems with it.

LINQ scares people because it looks like magic. It's actually very simple (so simple you would need to be a genius to come up with it).

var result = from item in collection where item.Property > 3 select item;

This is compiled into:

IEnumerable<ItemType> result = collection.Where(item => item.Property > 3);

(Please don't shout if I've got the syntax wrong on either, it's after midnight and I'm in bed :) )

Where is an extension method on IEnumerable<T> which takes a lambda. The lambda is simply (in this case) compiled to a delegate:

bool AMethod(ItemType item)
{
    return item.Property > 3;
}

The Where method simply adds ALL instances of item where AMethod returns true to a collection which is returned.

There is no reason that would be slower than doing a foreach and adding all matching items to a collection in that loop. In fact, the Where extension method is probably doing just that. The real magic comes when you need to inject alternative where criteria.
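
To make that concrete, here is a minimal sketch of a Where-style operator (not the actual framework source; the real LINQ to Objects version yields matches lazily, as the comments below point out):

using System;
using System.Collections.Generic;

static class MyEnumerable
{
    // A hand-rolled stand-in for Enumerable.Where, for illustration only.
    public static IEnumerable<T> MyWhere<T>(this IEnumerable<T> source, Func<T, bool> predicate)
    {
        foreach (var item in source)
        {
            if (predicate(item))
                yield return item; // deferred: nothing runs until enumeration
        }
    }
}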

As I mentioned above in my comment, some of the examples linked to are very wrong. And it's this sort of misinformation that causes the problems.

Finally, if the interviewer had given you a chance, you could have said that:

  • LINQ is easy to read, especially where you start introducing interesting projections and groupings. Easy to read code is eas[y|ier] to maintain code which is a win.

  • It would be really easy to measure the performance if it was actually a bottleneck and replace it with something else.

Ian
  • Overall I agree with you, but regarding the behavior of the Where method: it does not add all matching items to a collection. It stores the information needed to filter items in an expression tree. If the returned iterator is never actually used, no filtering will happen at all. – Codism Apr 15 '12 at 04:57
  • Excellent point, I should have mentioned that. In their example they do use the returned iterator. This was the madness of their test. To extract the one found value (all items in their test data were unique) they had a foreach which iterated the resultant enumerable to display the result. Of course there was only one result so it only printed the one answer. Madness :) – Ian Apr 15 '12 at 07:05
  • While I've not used LINQ, one thing I find distasteful is that it optimizes things like `Count` for a few narrow scenarios that work poorly with encapsulation. Concatenate a million-item list and a four-item iterator, and `Count` should require about 5 operations, but will instead require a million. I wish MS would define `IEnhancedEnumerator` with an `int Move(int)` method which would return 0 in case of success, or return the amount of shortfall in case of failure (so doing `Move(1000003)` on a newly-created `List.Enumerator` from a million-item list would return 3). Any implementation... – supercat Sep 04 '15 at 22:33
  • ...of `IEnumerable` could be wrapped in an implementation of `IEnhancedEnumerator`, but types which implement `IEnhancedEnumerator` directly could allow orders-of-magnitude speedup on many operations, and even things like the return from `Append` could expose the fast-seek ability of the constituents. – supercat Sep 04 '15 at 22:34