10

I have FILE_A, which has over 300,000 lines, and FILE_B, which has over 30 million lines. I created a Bash script that greps each line of FILE_A against FILE_B and writes the result of the grep to a new file.

This whole process is taking over 5 hours.

How can I improve the performance of my script?

I'm using grep -F -m 1 as the grep command. FILE_A looks like this:

123456789 
123455321

and FILE_B is like this:

123456789,123456789,730025400149993,
123455321,123455321,730025400126097,

So in Bash I have a while loop that picks the next line from FILE_A and greps for it in FILE_B. When the pattern is found in FILE_B, I write it to the file result.txt.

while read -r line; do
   grep -F -m1 "$line" 30MFile
done < 300KFile > result.txt
Peter Mortensen
rogerio_marcio

5 Answers

18

Try using grep --file=FILE_A. It almost certainly loads the patterns into memory, meaning it will only scan FILE_B once. One caveat: -m 1 does not carry over to this single-pass form, because there it tells grep to stop after the first matching line in the whole file rather than after one match per pattern, so leave it off:

grep -F --file=300KFile 30MFile
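
If you also want the matches collected in result.txt, as in the original loop, just redirect the output of that single pass (using -f, the short form of --file):

grep -F -f 300KFile 30MFile > result.txt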
Gort the Robot
3

Here is a Perl answer for posterity. I routinely do this for matching 1M lines against 30-35M lines. It takes around 10 seconds to finish.

First, hash up FILE_A:

my %simple_hash;
open my $first_file, '<', 'FILE_A' or die "What have you done?! $!";
while (<$first_file>) {
  chomp;                 ## Watch out for Windows newlines
  $simple_hash{$_} = 1;  ## There may be an even faster way to define this
}
close $first_file;

Then, if your big file is delimited and you know which column to look at, check just for the existence of the hash key as you run down FILE_B. That is much, much faster than checking for equality or matching against a regular expression:

open my $second_file, '<', 'FILE_B' or die "Oh no, not again.. $!";
while (<$second_file>) {
  my ($col1, undef) = split ',';
  if (exists($simple_hash{$col1})) {
    print $_;
  }
}
close $second_file;

If your larger target file isn't nicely parseable, then this script loses its value, as so much of its speed comes from not having to fire up the regular-expression engine.
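
To use it the way the question describes, you could paste the two snippets into one script (the name match_ids.pl below is just a placeholder; the snippets already hard-code FILE_A and FILE_B) and redirect its output:

perl match_ids.pl > result.txt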

Peter Mortensen
Mintx
1

If you don't mind some more involved programming, consider using suffix trees (or a variant).

You can preprocess FILE_B using Ukkonen's algorithm in linear time. Then you query each line of FILE_A in time linear in the line's length and get all the line numbers that match (you might need to adapt the tree a tad), which you can write to a result file.

The whole procedure runs in time O(n + Nm), where n is the length of FILE_B, N is the number of lines in FILE_A, and m is the length of the longest line in FILE_A; this is essentially linear runtime. It beats the quadratic time your original approach needs by orders of magnitude.
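
To put rough numbers on that, using the sizes from the question (and treating each grep in the loop as, in the worst case, a full scan of FILE_B):

   original loop: O(N * n)   -- each of the ~300,000 greps re-reads FILE_B from the top
   suffix tree:   O(n + N*m) -- FILE_B is processed once, then each pattern costs only its own length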

Raphael
1

I found the --mmap flag recently. I haven't had a chance to test it, but I'd be happy to hear about your findings. Here is the description from the man page:

--mmap If  possible, use the mmap(2) system call to read input, instead
      of the default read(2) system call.  In some situations,  --mmap
      yields  better performance.  However, --mmap can cause undefined
      behavior (including core dumps) if an input file  shrinks  while
      grep is operating, or if an I/O error occurs.

See this or this for further info about mmap.
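
I haven't measured it either, but if your grep build supports the flag, it would simply be added to the single-pass command from the --file answer above (note that newer GNU grep releases may ignore or reject --mmap, so check your local man page first):

grep --mmap -F --file=300KFile 30MFile > result.txt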

Ramzi Kahil
  • I'm definitely going to give this a shot and I'll let you know how it goes. How probable is it that I will encounter a core dump? – rogerio_marcio May 29 '12 at 23:11
  • @rogerio_marcio Well, as I understand the man page, "if the file shrinks while grep is operating, or if an I/O error occurs." Not really probable, but you should know this better. (If, as I assume, the file is untouched while grep runs, this should not happen.) – Ramzi Kahil May 29 '12 at 23:19
  • For testing that `--mmap` does not dump anything, I would recommend a run with `--mmap` and one without, and then use `wc` to check that you get the same amount of output. That should be a robust test, considering we ran grep twice with just one flag differing. – Ramzi Kahil May 29 '12 at 23:24
  • @rogerio_marcio Have you tried this? Any insights? – Ramzi Kahil Jun 14 '12 at 10:45
-1

Why don't you put that file in a database? Databases are really good at doing an efficient merge, hash, or nested-loop join like this, and they are really good at utilizing virtual memory.
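
For what it's worth, here is one way that could look; SQLite, the join.db file name, and the table and column names are my own assumptions for the sketch, not anything from the question. Both files are imported, the small one is indexed, and the database does the lookup:

# Illustrative sketch only: import both files into SQLite and let it do the join.
sqlite3 join.db <<'SQL'
.mode csv
CREATE TABLE small(id TEXT);
CREATE TABLE big(c1 TEXT, c2 TEXT, c3 TEXT, c4 TEXT);  -- FILE_B rows end with a comma, hence an empty fourth field
.import 300KFile small
.import 30MFile big
CREATE INDEX small_id ON small(id);
.once result.txt
SELECT big.* FROM big WHERE big.c1 IN (SELECT id FROM small);
SQL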

Andyz Smith