I'm writing a log file viewer for a web application and for that I want to paginate through the lines of the log file. The items in the file are line based with the newest item on the bottom.
So I need a
tail() method that can read
n lines from the bottom and supports an offset. What I came up with looks like this:
def tail(f, n, offset=0): """Reads a n lines from f with an offset of offset lines.""" avg_line_length = 74 to_read = n + offset while 1: try: f.seek(-(avg_line_length * to_read), 2) except IOError: # woops. apparently file is smaller than what we want # to step back, go to the beginning instead f.seek(0) pos = f.tell() lines = f.read().splitlines() if len(lines) >= to_read or pos == 0: return lines[-to_read:offset and -offset or None] avg_line_length *= 1.3
Is this a reasonable approach? What is the recommended way to tail log files with offsets?
The code I ended up using. I think this is the best so far:
def tail(f, n, offset=None): """Reads a n lines from f with an offset of offset lines. The return value is a tuple in the form ``(lines, has_more)`` where `has_more` is an indicator that is `True` if there are more lines in the file. """ avg_line_length = 74 to_read = n + (offset or 0) while 1: try: f.seek(-(avg_line_length * to_read), 2) except IOError: # woops. apparently file is smaller than what we want # to step back, go to the beginning instead f.seek(0) pos = f.tell() lines = f.read().splitlines() if len(lines) >= to_read or pos == 0: return lines[-to_read:offset and -offset or None], \ len(lines) > to_read or pos > 0 avg_line_length *= 1.3
This may be quicker than yours. Makes no assumptions about line length. Backs through the file one block at a time till it's found the right number of '\n' characters.
def tail( f, lines=20 ): total_lines_wanted = lines BLOCK_SIZE = 1024 f.seek(0, 2) block_end_byte = f.tell() lines_to_go = total_lines_wanted block_number = -1 blocks =  # blocks of size BLOCK_SIZE, in reverse order starting # from the end of the file while lines_to_go > 0 and block_end_byte > 0: if (block_end_byte - BLOCK_SIZE > 0): # read the last block we haven't yet read f.seek(block_number*BLOCK_SIZE, 2) blocks.append(f.read(BLOCK_SIZE)) else: # file too small, start from begining f.seek(0,0) # only read what was not read blocks.append(f.read(block_end_byte)) lines_found = blocks[-1].count('\n') lines_to_go -= lines_found block_end_byte -= BLOCK_SIZE block_number -= 1 all_read_text = ''.join(reversed(blocks)) return '\n'.join(all_read_text.splitlines()[-total_lines_wanted:])
I don't like tricky assumptions about line length when -- as a practical matter -- you can never know things like that.
Generally, this will locate the last 20 lines on the first or second pass through the loop. If your 74 character thing is actually accurate, you make the block size 2048 and you'll tail 20 lines almost immediately.
Also, I don't burn a lot of brain calories trying to finesse alignment with physical OS blocks. Using these high-level I/O packages, I doubt you'll see any performance consequence of trying to align on OS block boundaries. If you use lower-level I/O, then you might see a speedup.