10

I have FILE_A, which has over 300,000 lines, and FILE_B, which has over 30 million lines. I created a Bash script that greps each line of FILE_A against FILE_B and writes the result of the grep to a new file.

This whole process is taking over 5 hours.

How can I improve the performance of my script?

I'm using grep -F -m 1 as the grep command. FILE_A looks like this:

123456789 
123455321

and FILE_B is like this:

123456789,123456789,730025400149993,
123455321,123455321,730025400126097,

So in Bash I have a while loop that picks the next line from FILE_A and greps for it in FILE_B. When the pattern is found in FILE_B, I write it to the file result.txt.

while read -r line; do
   grep -F -m1 "$line" 30MFile
done < 300KFile > result.txt
Peter Mortensen
rogerio_marcio

5 Answers

18

Try using grep --file=FILE_A. It almost certainly loads the patterns into memory, meaning it will only scan FILE_B once. One caveat: -m 1 does not carry over to this single-pass form, because there it tells grep to stop after the first matching line in the whole file rather than after one match per pattern, so leave it off:

grep -F --file=300KFile 30MFile
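
If you also want the matches collected in result.txt, as in the original loop, just redirect the output of that single pass (using -f, the short form of --file):

grep -F -f 300KFile 30MFile > result.txt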
Gort the Robot
3

Here is a Perl answer for posterity. I routinely do this for matching 1M lines against 30-35M lines. It takes around 10 seconds to finish.

First, hash up FILE_A:

my %simple_hash;
open my $first_file, '<', 'FILE_A' or die "What have you done?! $!";
while (<$first_file>) {
  chomp;                 ## Watch out for Windows newlines
  $simple_hash{$_} = 1;  ## There may be an even faster way to define this
}
close $first_file;

Then, if your big file is delimited and you know which column to look at, check just for the existence of the hash key as you run down FILE_B. That is much, much faster than checking for equality or matching against a regular expression:

open my $second_file, '<', 'FILE_B' or die "Oh no, not again.. $!";
while (<$second_file>) {
  my ($col1, undef) = split ',';
  if (exists($simple_hash{$col1})) {
    print $_;
  }
}
close $second_file;

If your larger target file isn't nicely parseable, then this script loses its value, as so much of its speed comes from not having to fire up the regular-expression engine.
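
To use it the way the question describes, you could paste the two snippets into one script (the name match_ids.pl below is just a placeholder; the snippets already hard-code FILE_A and FILE_B) and redirect its output:

perl match_ids.pl > result.txt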

Peter Mortensen
Mintx
1

If you don't mind some more involved programming, consider using suffix trees (or a variant).

You can preprocess FILE_B using Ukkonen's algorithm in linear time. Then you query each line of FILE_A in time linear in the line's length and get all the line numbers that match (you might need to adapt the tree a tad), which you can write to a result file.

The whole procedure runs in time O(n + Nm), where n is the length of FILE_B, N is the number of lines in FILE_A, and m is the length of the longest line in FILE_A; this is essentially linear runtime. It beats the quadratic time your original approach needs by orders of magnitude.
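
To put rough numbers on that, using the sizes from the question (and treating each grep in the loop as, in the worst case, a full scan of FILE_B):

   original loop: O(N * n)   -- each of the ~300,000 greps re-reads FILE_B from the top
   suffix tree:   O(n + N*m) -- FILE_B is processed once, then each pattern costs only its own length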

Raphael
1

I found the --mmap flag recently. I haven't had a chance to test it, but I'd be happy to hear about your findings. Here is the description from the man page:

--mmap If  possible, use the mmap(2) system call to read input, instead
      of the default read(2) system call.  In some situations,  --mmap
      yields  better performance.  However, --mmap can cause undefined
      behavior (including core dumps) if an input file  shrinks  while
      grep is operating, or if an I/O error occurs.

See this or this for further info about mmap.
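
I haven't measured it either, but if your grep build supports the flag, it would simply be added to the single-pass command from the --file answer above (note that newer GNU grep releases may ignore or reject --mmap, so check your local man page first):

grep --mmap -F --file=300KFile 30MFile > result.txt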

Ramzi Kahil
  • I'm definitely going to give this a shot and I'll let you know how it goes. How probable is it that I will encounter a core dump? – rogerio_marcio May 29 '12 at 23:11
  • @rogerio_marcio Well, as I understand the man page, "if the file shrinks while grep is operating, or if an I/O error occurs." Not really probable, but you should know this better. (If, as I assume, the file is untouched while grep runs, this should not happen.) – Ramzi Kahil May 29 '12 at 23:19
  • For testing that `--mmap` does not dump anything, I would recommend a run with `--mmap` and one without, and then use `wc` to check that you get the same amount of output. That should be a robust test, considering we ran grep twice with just one flag differing. – Ramzi Kahil May 29 '12 at 23:24
  • @rogerio_marcio Have you tried this? Any insights? – Ramzi Kahil Jun 14 '12 at 10:45
-1

Why don't you put that file in a database? Databases are really good at doing an efficient merge, hash, or nested-loop join like this, and they are really good at utilizing virtual memory.
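
For what it's worth, here is one way that could look; SQLite, the join.db file name, and the table and column names are my own assumptions for the sketch, not anything from the question. Both files are imported, the small one is indexed, and the database does the lookup:

# Illustrative sketch only: import both files into SQLite and let it do the join.
sqlite3 join.db <<'SQL'
.mode csv
CREATE TABLE small(id TEXT);
CREATE TABLE big(c1 TEXT, c2 TEXT, c3 TEXT, c4 TEXT);  -- FILE_B rows end with a comma, hence an empty fourth field
.import 300KFile small
.import 30MFile big
CREATE INDEX small_id ON small(id);
.once result.txt
SELECT big.* FROM big WHERE big.c1 IN (SELECT id FROM small);
SQL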

Andyz Smith