Split one file into multiple files based on pattern (cut can occur within lines)


Question

Many solutions exist for splitting files, but my specific requirement is that the split can occur within a line: the cut should happen just before the pattern. Example:

Infile:

<?xml 1><blabla1>
<blabla><blabla2><blabla>
<blabla><blabla>
<blabla><blabla3><blabla><blabla>
<blabla><blabla><blabla><?xml 4>
<blabla>
<blabla><blabla><blabla>
<blabla><?xml 2><blabla><blabla>

With the pattern <?xml, this should become:

Outfile1:

<?xml 1><blabla1>
<blabla><blabla2><blabla>
<blabla><blabla>
<blabla><blabla3><blabla><blabla>
<blabla><blabla><blabla>

Outfile2:

<?xml 4>
<blabla>
<blabla><blabla><blabla>
<blabla>

Outfile3:

<?xml 2><blabla><blabla>

The Perl script in the accepted answer here works fine for my small example, but it fails on my actual, much larger files (about 6 GB). The error is:

panic: sv_setpvn called with negative strlen at /home/.../split.pl line 7, <> chunk 1.

I don't have permission to comment, which is why I started a new post. A Python solution would be even more appreciated, as I understand it better.

5/23/2017 12:25:13 PM

Accepted Answer

This performs the split without reading everything into RAM:

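def files():
    """Yield a new output file handle each time one is requested."""
    n = 0
    while True:
        n += 1
        yield open('/output/dir/%d.part' % n, 'w')


filename = 'in.txt'  # placeholder: path to your input file
pat = '<?xml'
fs = files()
outfile = None  # opened lazily, so no empty 1.part when the input starts with pat

with open(filename) as infile:
    for line in infile:
        items = line.split(pat)
        # Text before the first match belongs to the current file.
        if items[0]:
            if outfile is None:
                outfile = next(fs)
            outfile.write(items[0])
        # Each further piece starts a new file, with the pattern restored.
        for item in items[1:]:
            if outfile is not None:
                outfile.close()
            outfile = next(fs)
            outfile.write(pat + item)

if outfile is not None:
    outfile.close()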

A word of warning: this doesn't work if your pattern spreads across multiple lines (that is, contains "\n"). Consider the mmap solution if this is the case.
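
The mmap answer referred to above is not reproduced here. As a rough sketch of how such an approach might look (an illustration, not that answer's code), the input can be memory-mapped and searched as one byte string, so the pattern may span line breaks. The function name split_on_pattern, the out_dir argument and the block size are hypothetical choices, and mapping a 6 GB file assumes a 64-bit Python build.

```python
import mmap
import os

def split_on_pattern(in_path, out_dir, pat=b'<?xml', block=1 << 20):
    """Split in_path before every occurrence of pat; return the part paths."""
    paths = []
    if os.path.getsize(in_path) == 0:
        return paths  # an empty file cannot be memory-mapped
    with open(in_path, 'rb') as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        size = len(mm)
        start = 0
        while start < size:
            # Search from start + 1 so the pattern opening this piece is skipped.
            nxt = mm.find(pat, start + 1)
            end = size if nxt == -1 else nxt
            path = os.path.join(out_dir, '%d.part' % (len(paths) + 1))
            with open(path, 'wb') as out:
                # Copy in blocks so a huge piece is never held in memory at once.
                for pos in range(start, end, block):
                    out.write(mm[pos:min(pos + block, end)])
            paths.append(path)
            start = end
    return paths
```

If the input does not begin with the pattern, the first part simply holds everything before the first match.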

5/23/2017 12:00:14 PM

Perl can parse large files line by line instead of slurping the whole file into memory. Here is a short script (with explanation):

perl -n -E 'if (/(.*)(<\?xml.*)/s) {
   print $fh $1 if $1;
   open $fh, ">output." . ++$i;
   print $fh $2;
} else { print $fh $_ }'  in.txt

perl -n : The -n flag loops over your file line by line (setting the contents to $_).

-E : Execute the following text (Perl expects a filename by default).

if (/(.*)(<\?xml.*)/s) If a line matches <?xml, split that line (using regex captures) into $1 and $2. The /s modifier lets . match the trailing newline, so $2 keeps the line ending.

print $fh $1 if $1 Print the start of the line to the old file.

open $fh, ">output." . ++$i; Create a new file-handle for writing.

print $fh $2 Print the rest of the line to the new file.

} else { print $fh $_ } If the line didn't match <?xml, just print it to the current file-handle.

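Note: this script assumes your input file starts with <?xml (otherwise $fh is not yet open when the first line is printed). Because (.*) is greedy, a line containing several <?xml occurrences is split only at the last one.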
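
For readers more comfortable with Python, the capture-group split that the regex performs can be illustrated as follows. This is only a sketch of the regex behaviour, not a port of the script; split_line is a hypothetical helper name, and re.DOTALL is an assumption made here so that . also matches the trailing newline.

```python
import re

# Same shape as the Perl pattern: greedy prefix, then '<?xml' and the rest.
LINE_RE = re.compile(r'(.*)(<\?xml.*)', re.DOTALL)

def split_line(line):
    """Return (before, from_pattern); from_pattern is None when nothing matches."""
    m = LINE_RE.match(line)
    if m is None:
        return line, None
    return m.group(1), m.group(2)

# One occurrence: the line is cut just before the pattern.
print(split_line('<blabla><?xml 2><blabla><blabla>\n'))
# Two occurrences: the greedy prefix means only the last one is cut.
print(split_line('<a><?xml 1><b><?xml 2>\n'))
```

The second call shows why a single regex match per line cannot produce more than one cut per line.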


Licensed under: CC-BY-SA with attribution
Not affiliated with: Stack Overflow