Let me start by asking: do we really need Python to read large text files? Wouldn't a normal word processor or text editor suffice? By large, I mean extremely large files!
Well, let's look at some evidence for whether we need Python to read such files.
Obtaining the File
To carry out our experiment, we need an extremely large text file. In this tutorial, we will obtain this file from the UCSC Genome Bioinformatics downloads website. The file we will be using in particular is hg38.fa.gz, which the site describes as follows:
"Soft-masked" assembly sequence in one file. Repeats from RepeatMasker and Tandem Repeats Finder (with period of 12 or less) are shown in lower case; non-repeating sequence is shown in upper case.
Don't worry if you didn't understand the above statement, as it uses genetics terminology. What matters in this tutorial is the concept of reading extremely large text files using Python.
Go ahead and download hg38.fa.gz (please be careful, the file is 938 MB). You can use 7-zip to unzip the file, or any other tool you prefer.
After you unzip the file, you will get a file called hg38.fa. Rename it to hg38.txt to obtain a text file.
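If you would rather stay in Python for this step, the standard library can do the decompression too. The following is a minimal sketch, written for Python 3; the function name decompress_gzip and the 1 MB chunk size are my own choices for illustration, not part of the original workflow. It streams the data in chunks, so the whole file never has to fit in memory.

```python
import gzip
import shutil


def decompress_gzip(source_path, target_path, chunk_size=1024 * 1024):
    """Stream-decompress a .gz file into target_path, one chunk at a time."""
    # gzip.open() in 'rb' mode yields the decompressed bytes of the archive.
    with gzip.open(source_path, 'rb') as compressed:
        with open(target_path, 'wb') as plain:
            # copyfileobj() reads and writes chunk_size bytes per step,
            # so memory use stays small regardless of the file size.
            shutil.copyfileobj(compressed, plain, chunk_size)


# Example call (assumes hg38.fa.gz is in the current directory):
# decompress_gzip('hg38.fa.gz', 'hg38.txt')
```

Writing straight to hg38.txt also takes care of the renaming step.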
Opening the File the Traditional Way
What I mean here by the traditional way is using our word processor or text editor to open the file. Let's see what happens when we try to do that.
I first tried using Microsoft Word to open the file, but Word displayed an error message and did not open it.
Opening the file with WordPad and Notepad on a Windows machine did not work either, although it did open using TextEdit on a Mac OS X machine.
But you get the point: it would be nice to have a dependable way to open such extremely large files. In this quick tip, we will see how to do that using Python.
Reading the Text File Using Python
In this section, we are going to see how we can read our large file using Python. Let's say we wanted to read the first 500 lines from our large text file. We can simply do the following:
    # Copy the first 500 lines of hg38.txt into output.txt
    with open('hg38.txt', 'r') as input_file, open('output.txt', 'w') as output_file:
        for _ in range(500):
            line = input_file.readline()
            output_file.write(line)
Notice that we read 500 lines from hg38.txt, line by line, and wrote those lines to a new text file, output.txt.
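The same idea can be wrapped in a reusable function. The sketch below is one possible variation, written for Python 3; the name copy_first_lines is hypothetical. It uses itertools.islice to take lines from the file lazily, so it stops cleanly if the file has fewer lines than requested, and it returns how many lines were actually copied.

```python
from itertools import islice


def copy_first_lines(source_path, target_path, line_count):
    """Copy at most line_count lines from source_path to target_path.

    Returns the number of lines actually copied.
    """
    copied = 0
    with open(source_path, 'r') as source, open(target_path, 'w') as target:
        # A file object is an iterator over its lines; islice() stops after
        # line_count items or at end of file, whichever comes first.
        for line in islice(source, line_count):
            target.write(line)
            copied += 1
    return copied


# Example call (assumes hg38.txt is in the current directory):
# copy_first_lines('hg38.txt', 'output.txt', 500)
```

Because only one line is held in memory at a time, the size of the input file does not matter.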
But suppose we wanted to navigate through the text file directly, without extracting lines and writing them to another text file. That approach would be more flexible.
Navigating Through Large Text Files
The above step let us read a large text file by extracting lines from it and writing them to another text file, but navigating through the large file directly would be preferable.
We can do that by using Python to display the text file in the terminal, 50 lines at a time, as follows:
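    # Written for Python 3: print() and input() replace print and raw_input
    with open('hg38.txt', 'r') as input_file:
        while True:
            # Show the next 50 lines; end='' avoids doubling each newline
            for _ in range(50):
                print(input_file.readline(), end='')
            user_input = input('Type STOP to quit, otherwise press the Enter/Return key ')
            if user_input == 'STOP':
                break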
As you can see from this script, you can now read and navigate through the large text file immediately using your terminal. Whenever you want to quit, you just need to type STOP (case sensitive) in your terminal.
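One thing the script does not do is notice when it reaches the end of the file: it simply keeps prompting. A possible refinement, sketched here for Python 3 with the hypothetical helper names iter_pages and browse, separates the paging logic from the prompt and stops once no lines are left.

```python
from itertools import islice


def iter_pages(lines, page_size=50):
    """Yield successive lists of up to page_size lines until none remain."""
    iterator = iter(lines)
    while True:
        page = list(islice(iterator, page_size))
        if not page:
            # Nothing left to read: end the generator.
            return
        yield page


def browse(path, page_size=50):
    """Show a text file page by page in the terminal; typing STOP quits."""
    with open(path, 'r') as input_file:
        for page in iter_pages(input_file, page_size):
            # Lines keep their own newline, so suppress print's extra one.
            print(''.join(page), end='')
            answer = input('Type STOP to quit, otherwise press the Enter/Return key ')
            if answer == 'STOP':
                break


# Example call (assumes hg38.txt is in the current directory):
# browse('hg38.txt')
```

Only one page of lines is held in memory at a time, so this works the same way on a file of any size.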
I'm sure you will notice how smoothly Python lets you navigate through such an extremely large text file without any issues. Python is again proving itself to be a language striving to make our lives easier!