7

Given a file larger than 5 GB containing simple lines, like Apache's access.log.

I need to get the number of lines.

Constructions like

open(filename).read().count('\n')

would read the whole file and take a very long time.

On the other hand, e.g.

os.stat(filename).st_size

works very fast.

It is possible to get at least an estimate of the number of lines from the size of the first few lines in bytes and the total size of the file.

Is there a more accurate way?

Doc Brown
rezeptor
  • Are you looking for an algorithm or a Python specific implementation? If it's the latter, let me know and I'll migrate this to Stack Overflow. –  Jul 10 '13 at 22:34
  • Any language, not only Python; any way to get the number of lines as fast as possible. – rezeptor Jul 10 '13 at 22:36
  • I'm gonna change your title a tad to better reflect what you want. Feel free to change it back if it's not better. –  Jul 10 '13 at 22:40
  • Thanx. Title is great. – rezeptor Jul 10 '13 at 22:43
  • 10
    To estimate the number of lines without actually counting them, you're going to need some estimate of the number of characters per line. If the file is very large, and the line lengths are statistically regular, count the number of lines in the first megabyte to get an estimate of bytes per line, then divide the total file size by that. (Note: I don't think "optimal" is possible here because you are necessarily trading away accuracy for speed. There is nothing "more optimal" than scanning the whole file if you want a correct answer.) – Gort the Robot Jul 10 '13 at 23:11
  • 5
    @StevenBurnap: that looks like an answer rather than a request for clarification. Why not write it as an answer? – Bryan Oakley Jul 10 '13 at 23:20

5 Answers

6

If I understand your question, you want to be able to ESTIMATE the number of lines in a file without having to iterate through the whole file. A few things spring to mind:

ESTIMATION based on file size

Estimation always involves a trade-off: a reasonable approximation with less work is better than an exact value with more work. I like your idea of establishing an average line size. Say your files generally follow the same pattern (for example, {x, y, z} coordinates from some experiment); then it would be reasonable to assume that a file is made up of:

HEADER
{x, y, z}(1),
{x, y, z}(2),
...
{x, y, z}(n)
FOOTER

If the HEADER and FOOTER follow the same template in all files, then you can treat them as constants. That means you are left with a file size that depends only on the number of {x, y, z} lines. You could make an assumption (based on observation) about the average size of these lines and then make your estimate from there. There is no reason this scenario couldn't be adapted to any situation where the file follows a particular format (of course, if the file doesn't have a regular format, it is going to be tough).
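A minimal sketch of that idea. The HEADER_BYTES, FOOTER_BYTES and AVG_LINE_BYTES constants are assumptions you would measure once from representative files:

import os

# Hypothetical values, measured once from files you have already seen
HEADER_BYTES = 120     # size of the fixed HEADER block
FOOTER_BYTES = 80      # size of the fixed FOOTER block
AVG_LINE_BYTES = 32    # observed average size of one {x, y, z} line

def estimate_lines(filename):
    # Estimate the number of data lines from the file size alone
    data_bytes = os.path.getsize(filename) - HEADER_BYTES - FOOTER_BYTES
    return max(data_bytes, 0) // AVG_LINE_BYTES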

Alternative: put the number of lines into the HEADER

Using Python's ability to iterate over a file, you could calculate the number of lines once and then query this value as often as you like (think of it like sorting an array once and then doing a binary search over it many times). If you regularly need to know how many lines are in a file, this could be a reasonable and accurate way to go about it.
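If you cannot change the file format itself, a small sidecar cache gives the same effect. This is only a sketch; the .linecount file name is an assumption:

import os

def cached_line_count(filename):
    # Count lines once, then reuse the result while the file is unchanged
    cache = filename + '.linecount'   # hypothetical sidecar file
    mtime = os.path.getmtime(filename)
    if os.path.exists(cache):
        with open(cache) as f:
            cached_mtime, count = f.read().split()
        if float(cached_mtime) == mtime:
            return int(count)
    with open(filename) as f:
        count = sum(1 for _ in f)     # the one full pass
    with open(cache, 'w') as f:
        f.write('%s %d' % (mtime, count))
    return count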

In Python, you can count the number of lines without having to load the entire file into memory:

with open(my_file) as f:
    number_lines = sum(1 for line in f)

Using a generator expression like this is also optimised in Python, so it is about as fast as a pure-Python line-by-line count will get :)

Glorfindel
Nick Burns
1

Scan a representative sample of lines (say, 100) for the average line length, then take the file's total length and divide by said average.

If the lines are highly irregular, line count may be less useful than file size as a metric. If the lines are highly regular and complex, parsing them into a relational table may be useful. (If they're highly regular and simple, line length will be accurate.)

If this is a request you get often, consider adjusting the log file to include an alphanumeric counter, allowing you to just find and parse the first and last rows' counter values (see the sketch at the end of this answer).

If this is a one-time ad-hoc operation, just count \ns.

If at all possible, reduce log file size.
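A sketch of that counter idea, assuming (hypothetically) that each line starts with a numeric counter followed by a space; it reads only the tail of the file:

import os

def last_counter(filename, tail_bytes=4096):
    # Read only the last few KB and parse the counter on the last line
    with open(filename, 'rb') as f:
        f.seek(max(os.path.getsize(filename) - tail_bytes, 0))
        last_line = f.read().splitlines()[-1]
    # assumes each line starts with its counter, e.g. b'12345 GET /index.html ...'
    return int(last_line.split()[0])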

Glorfindel
DougM
1

To do an estimate, your sample should be randomized. Don't just look at the first 100 lines. We have random access to these files; use it. Those first lines might be radically different from the rest. It's fine to dive into a random spot, scan for a line terminator, count bytes until you get to the next one, and repeat (sketched below).

Don't bother to exclude lines that have been counted before. It doesn't hurt the math much, and excluding them would only slow you down.

Some minor points:

In addition to file size you need to know the average line length (character count), and the average bytes per character. Or you can make darn sure you're really counting bytes when you get your average per line. The world doesn't just run on ASCII any more.

Also, line terminators can be one byte or two (LF vs. CRLF), which will show up in your file size. Know which you're dealing with when you account for them.
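A rough sketch of this sampling approach; the sample size of 100 is an assumption, the averaging is done in bytes as suggested above, and it does not compensate for the long-line bias mentioned in the comments:

import os
import random

def estimate_line_count(filename, samples=100):
    # Estimate line count by measuring line lengths at random byte offsets
    size = os.path.getsize(filename)
    if size == 0:
        return 0
    lengths = []
    with open(filename, 'rb') as f:
        for _ in range(samples):
            f.seek(random.randrange(size))
            f.readline()            # skip the partial line we landed in
            line = f.readline()     # measure the next full line, in bytes
            if line:
                lengths.append(len(line))
    if not lengths:
        return 0
    avg = sum(lengths) / len(lengths)
    return int(size / avg)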

Glorfindel
candied_orange
  • I would make an additional point: the type of average (mean, median or mode) you would use depends very much on the file you have. – John Go-Soco Nov 30 '20 at 15:43
  • One thing to worry about: by seeking randomly you will oversample long lines, or rather lines _after_ the long lines. You'll have to compensate for that. – John Dvorak Dec 01 '20 at 17:43
1

An estimate is

lines in file = size of file / average line size

where the size of the file is easy to get, and you can read the first chunk of the file to compute the average line size. A Python example:

import os

def line_estimation(filename, first_size=1 << 16):
    # Average line size, in bytes, over the first 64 KB of the file
    with open(filename, 'rb') as file:
        buf = file.read(first_size)
        # assumes the first chunk contains at least one newline
        return len(buf) // buf.count(b'\n')

ans = os.path.getsize(filename) // line_estimation(filename)
spisk
-1

The Linux/Unix way, wc -l, should be much faster than pure Python solutions. Using subprocess, we can do:

import subprocess

int(subprocess.check_output(['wc', '-l', filename]).split()[0])
Arunima