Introducing Pandas

In this tutorial I will give a basic introduction to pandas. Oh, I don't mean the animal panda, but a Python library!

As mentioned on the pandas website:

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

Thus, pandas is a data analysis library that has the data structures we need to cleanse raw data into a form which is suitable for analysis (i.e. tables). It is important to note here that since pandas performs important tasks such as aligning data for comparison and merging of data sets, handling of missing data, etc., it has become a de facto library for high-level data processing in Python (i.e. statistics). Well, pandas was originally designed to handle financial data, provided that the common alternative is using a spreadsheet (i.e. Microsoft Excel).

The basic data structure of pandas is called DataFrame, which is an ordered collection of columns with names and types, thus looking like a database table where a single row represents a single case (example) and columns represent particular attributes. It should be noted here that the elements in various columns may be of different types.

So, the bottom line is that the pandas library provides us with the data structures and functions necessary for data analysis.

Installing Pandas

Let's now see how we can install pandas on our machines and use it for data analysis. The easiest way to install pandas and avoid any dependency issues is by using Anaconda which pandas comes part of. As mentioned on the Anaconda download page:

Anaconda is a completely free Python distribution (including for commercial use and redistribution). It includes more than 400 of the most popular Python packages for science, math, engineering, and data analysis

The Anaconda distribution is cross-platform, meaning that it can be installed on OS X, Windows, and Linux machines. I'm going to use the OS X installer since I'm working on a Mac OS X El Capitan machine, but of course you can choose the suitable installer for your operating system. I will go with the graphical installer (be careful, it is 339 MB).

Anaconda Mac OS X graphical installer — Anaconda Mac OS X Graphical Installer

After downloading the installer, simply walk through the simple installation wizard steps and you are all set!

All we need to do now in order to use pandas is to import the package as follows:

import pandas as pd

Pandas Data Structures

I have mentioned one of the three pandas data structures above, the DataFrame. I will describe this data structure in this section in addition to the other pandas data structure, Series. There is another data structure called Panel, but I will not describe it in this tutorial as it is not so frequently used, as mentioned in the documentation. DataFrame is a 2D data structure, Series is a 1D data structure, and Panel is a 3D and higher data structure.

DataFrame

The DataFrame is a tabular data structure which is composed of ordered columns and rows. To make things clearer, let's look at the example of creating a DataFrame (table) from a dictionary of lists. The following example shows a dictionary consisting of two keys, Name and Age, and their corresponding list of values.

import pandas as pd
import numpy as np

name_age = {'Name' : ['Ali', 'Bill', 'David', 'Hany', 'Ibtisam'],
'Age' : [32, 55, 20, 43, 30]}
data_frame = pd.DataFrame(name_age)
print data_frame

If you run the above script, you should get an output similar to the following:

Notice that the DataFrame constructor orders the columns alphabetically. If you want to change the order of the columns, you can type the following under data_frame above:

data_frame_2 = pd.DataFrame(name_age, columns = ['Name', 'Age'])

To view the result, simply type: print data_frame_2.

Say you didn't want to use the default labels 0,1,2,..., and wanted to use a, b, c,... instead. In that case, you can use index in the above script as follows:

data_frame_2 = pd.DataFrame(name_age, columns = ['Name', 'Age'], index = ['a', 'b', 'c', 'd', 'e'])

That was very nice, wasn't it? Using DataFrame, we were able to see our data organized in a tabular form.

Series

Series is the second pandas data structure I'm going to talk about. A Series is a one-dimensional (1D) object similar to a column in the table. If we want to create a Series for a list of names, we can do the following:

series = pd.Series(['Ali', 'Bill', 'David', 'Hany', 'Ibtisam'],
index = [1, 2, 3, 4, 5])
print series

The output of this script would be as follows:

Notice that we used index to label the data. Otherwise, the default labels will start from 0,1,2...

Pandas Functions

In this section, I'm going to show you examples of some functions we can use with DataFrame and Series.

Head and Tail

The functions head() and tail() enable us to view a sample of our data, especially when we have a large number of entries. The default number of elements that get displayed are 5, but you can return the customized number you like.

Let's say we have a Series composed of 20,000 random items (numbers):

import pandas as pd
import numpy as np
series = pd.Series(np.random.randn(20000))

Using the head() and tail() methods to observe the first and last five items, respectively, we can do the following:

print series.head()
print series.tail()

The output of this script should be something similar to the following (notice that you might have different values since we are generating random values):

Add

Let's take an example of the add() function, where we will attempt to add two data frames as follows:

import pandas as pd

dictionary_1 = {'A' : [5, 8, 10, 3, 9],
'B' : [6, 1, 4, 8, 7]}
dictionary_2 = {'A' : [4, 3, 7, 6, 1],
'B' : [9, 10, 10, 1, 2]}
data_frame_1 = pd.DataFrame(dictionary_1)
data_frame_2 = pd.DataFrame(dictionary_2)
data_frame_3 = data_frame_1.add(data_frame_2)
print data_frame_1
print data_frame_2
print data_frame_3

The output of the above script is:

You can also perform this addition process by simply using the + operator: data_frame_3 = data_frame_1 + data_frame_2.

Describe

A very nice pandas function is describe(), which generates various summary statistics for our data. For the example in the last section, let's do the following:

print data_frame_3.describe()

The output of this operation will be:

Further Resources

This was just a scratch of the surface on Python's pandas. For more details, you can check the pandas documentation, and you can also check some books like Learning Pandas and Mastering Pandas.

Conclusion

Scientists sometimes need to carry out some statistical operations and display some neat graphs that require them to use a programming language. But, at the same time, they don't want to spend too much time or be faced with a serious learning curve in carrying out such tasks.

As we saw in this tutorial, pandas enabled us to represent data in tabular form and carry out some operations on those tables in a very simple way. Combining pandas with other Python libraries, scientists can even do more advanced tasks such as drawing specialized graphs for their data.

Thus, pandas is a very helpful library and starting point for scientists, economists, statisticians, and anyone willing to perform some data analysis tasks.

HIGHLIGHTS OF THE DAY