Thursday, September 3, 2015

Working with text files in Python 2.7

At first sight, reading text files in Python 2.7 is simple, but if a file is not opened correctly, working with it can lead to unexpected results.
For example, you have a file that looks ASCII-encoded, and you read it line by line, saving each line number together with the line data. If some non-ASCII characters are hidden in that file, you will run into encoding problems. In Python 2 this usually shows up as a UnicodeDecodeError as soon as such a line is implicitly converted to unicode:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x?? in position ??: ordinal not in range(128)
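The error is easy to reproduce without any file at all. Here is a minimal sketch, assuming a UTF-8 encoded line with one non-ASCII character in it; the sample bytes and the helper name try_decode are made up for illustration:

```python
from __future__ import print_function

# Hypothetical sample: "cafe" with an accented last letter, encoded as UTF-8.
# The accented letter takes two bytes, 0xc3 0xa9.
raw_line = b'caf\xc3\xa9'


def try_decode(data, encoding):
    """Return the decoded text, or the error message if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return 'failed: %s' % exc


# The ascii codec rejects the first byte outside the range 0..127.
print(try_decode(raw_line, 'ascii'))
# The same bytes decode fine once the real encoding is named.
print(repr(try_decode(raw_line, 'utf-8')))
```

Decoding explicitly with the right encoding, instead of relying on the implicit ascii conversion, is what the encoding argument discussed later in this post does for you.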
Moreover, if the line endings are not uniform for a single platform (i.e. '\r\n' or '\r' occur alongside '\n'), the line count can be off: the line numbers shown by a text editor and those reported by your Python script will not match.
Let's prepare a test logfile:
filename = 'logfile.log'
with open(filename, 'w') as f:
    f.write('line 1' + '\n')
    f.write('line 2' + '\n')
    f.write('line 3' + '\r')
    f.write('line 4' + '\r\n')
    f.write('line 5' + '\n')
    f.write('line 6' + '\n')
    f.write('line 7')
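If you open it in a text editor, the result will be the following. The listing matches what happens on Windows, where text mode writes every '\n' as '\r\n', so 'line 4\r\n' ends up on disk as 'line 4\r\r\n' and shows up as an extra empty line: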
1 line 1
2 line 2
3 line 3
4 line 4
5
6 line 5
7 line 6
8 line 7
Now suppose we need to find the number of the line that contains the string "line 5":
filename = 'logfile.log'
with open(filename, 'r') as f:
    for line_num, line in enumerate(f):
        if line.startswith('line 5'):
            print '%s: %s' % (line_num + 1, line.strip())
The output is:
4: line 5
Here we get the wrong line number, 4. In the text editor the line number is 6.
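The mismatch comes from what counts as a line break. The sketch below works on an in-memory byte string instead of a file. It assumes the bytes that the preparation script leaves on disk on Windows (on other platforms no '\r' is added before '\n'), and compares splitting on '\n' only with splitting on every common line ending. The helper name find_line is made up for illustration.

```python
from __future__ import print_function

# Assumed on-disk bytes on Windows, where text mode wrote each '\n' as '\r\n'.
data = (b'line 1\r\n' b'line 2\r\n' b'line 3\r' b'line 4\r\r\n'
        b'line 5\r\n' b'line 6\r\n' b'line 7')


def find_line(lines, prefix):
    """Return the 1-based index of the first line starting with prefix, or None."""
    for line_num, line in enumerate(lines, 1):
        if line.startswith(prefix):
            return line_num
    return None


# Splitting on '\n' only keeps 'line 3\rline 4\r\r' together as one line.
only_lf = data.split(b'\n')
# splitlines() on a byte string treats '\n', '\r' and '\r\n' each as one break.
any_ending = data.splitlines()

print(find_line(only_lf, b'line 5'))
print(find_line(any_ending, b'line 5'))
```

The first call reports 4 and the second reports 6, the same two numbers that the script and the text editor disagree on.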

Fortunately, Python offers several ways to open files, each with its own useful arguments:
1) open(name[, mode[, buffering]])
2) codecs.open(filename, mode[, encoding[, errors[, buffering]]])
3) io.open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True)
We have already used the first one; let's try the second and the third.
import io
import codecs


filename = 'logfile.log'
with codecs.open(filename, 'r') as f:
    for line_num, line in enumerate(f):
        if line.startswith('line 5'):
            print '%s: %s' % (line_num + 1, line.strip())

with io.open(filename, 'r') as f:
    for line_num, line in enumerate(f):
        if line.startswith('line 5'):
            print '%s: %s' % (line_num + 1, line.strip())

And the output is:
4: line 5
6: line 5
Here codecs.open gave the same result as the plain open (without an encoding it simply delegates to the built-in open), but io.open gave the expected result.
Thanks to its newline argument, io.open can work in universal newlines mode and read lines regardless of the line ending format (Windows, Unix, Mac OS up to version 9), i.e. lines can end with '\r\n', '\n', or '\r'. This mode is enabled by default (newline=None). Therefore the lines are split correctly and the line numbers match those in text editors.
In Python 3.x the built-in open is io.open, so this is the default interface for accessing files and streams.
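To see what the newline argument changes, here is a small sketch that reads the same bytes with three different settings. The sample content, the temporary file and the helper name read_lines are assumptions for illustration; the file is written in binary mode so the bytes are the same on every platform.

```python
from __future__ import print_function

import io
import os
import tempfile

# Mixed line endings: '\n', then '\r', then '\r\n', and no ending on the last line.
SAMPLE = b'a\nb\rc\r\nd'


def read_lines(path, newline):
    """Read all lines of path as text using the given newline setting."""
    with io.open(path, 'r', encoding='ascii', newline=newline) as f:
        return f.readlines()


fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, 'wb') as f:
        f.write(SAMPLE)
    translated = read_lines(path, None)   # universal newlines, endings become '\n'
    untranslated = read_lines(path, '')   # universal newlines, endings kept as-is
    lf_only = read_lines(path, '\n')      # only '\n' ends a line
finally:
    os.remove(path)

print(translated)
print(untranslated)
print(lf_only)
```

With newline=None and newline='' the file is split into four lines; with newline='\n' the lone '\r' no longer ends a line, so only three lines come back.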

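Both codecs.open and io.open take an encoding argument that specifies the encoding to be used for the file. They also take an errors argument, an optional string that specifies how encoding and decoding errors are to be handled. The difference is in line handling: when an encoding is given, codecs.open opens the underlying file in binary mode, so no newline translation is done, and lines are split by the codec's stream reader, which recognizes '\n', '\r', '\r\n' and also other Unicode line boundaries.
With encoding and errors passed to both calls: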
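The different line handling can matter for real data. The sketch below is an illustration under assumed input, not something taken from the logfile above: it writes two '\n'-terminated lines, the first of which contains the Unicode LINE SEPARATOR character (U+2028), and counts the lines each function yields. The helper name count_lines is made up.

```python
from __future__ import print_function

import codecs
import io
import os
import tempfile

# Two '\n'-terminated lines; the first one contains U+2028 (LINE SEPARATOR).
SAMPLE = u'first\u2028half\nsecond\n'.encode('utf-8')


def count_lines(opener, path):
    """Count the lines produced by iterating a file opened with opener."""
    with opener(path, 'r', encoding='utf-8') as f:
        return sum(1 for _ in f)


fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, 'wb') as f:
        f.write(SAMPLE)
    io_count = count_lines(io.open, path)          # splits on '\n', '\r', '\r\n'
    codecs_count = count_lines(codecs.open, path)  # also splits on U+2028
finally:
    os.remove(path)

print(io_count, codecs_count)
```

So for files that may contain such characters, io.open keeps the line numbers in step with what most text editors show, while codecs.open may report more lines.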
with codecs.open(filename, 'r', encoding='utf-8', errors='replace') as f:
    for line_num, line in enumerate(f):
        if line.startswith('line 5'):
            print '%s: %s' % (line_num + 1, line.strip())

with io.open(filename, 'r', encoding='utf-8', errors='replace') as f:
    for line_num, line in enumerate(f):
        if line.startswith('line 5'):
            print '%s: %s' % (line_num + 1, line.strip())
The output is:
6: line 5
6: line 5
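The test logfile is pure ASCII, so the errors argument has nothing to do in the run above. A minimal sketch of its effect, assuming a file with one byte that is not valid UTF-8 (the sample content and the helper name read_text are made up for illustration):

```python
from __future__ import print_function

import io
import os
import tempfile

# 0xff can never appear in valid UTF-8, so strict decoding of this line fails.
SAMPLE = b'line \xff 1\n'


def read_text(path, errors):
    """Decode the whole file as UTF-8 with the given error handler."""
    with io.open(path, 'r', encoding='utf-8', errors=errors) as f:
        return f.read()


fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, 'wb') as f:
        f.write(SAMPLE)
    replaced = read_text(path, 'replace')  # bad byte becomes U+FFFD
    ignored = read_text(path, 'ignore')    # bad byte is silently dropped
    try:
        read_text(path, 'strict')          # the default handler raises
        strict_failed = False
    except UnicodeDecodeError:
        strict_failed = True
finally:
    os.remove(path)

print(repr(replaced))
print(repr(ignored))
print(strict_failed)
```

With 'replace' the script keeps going and the damaged spot stays visible in the output, which is usually what you want when scanning logs.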
