Hands-on Tutorial on Python Data Processing Library Pandas – Part 2

Pandas is a Python language package, which is used for data processing in the part one. This is a very common basic programming library when we use Python language for machine learning programming. This article is the second tutorial in the series of pandas tutorial series. We recommend you to read the first pandas introductory tutorial here, before start exploring this.

File Operation

The pandas library provides a series of read_functions for reading files in various formats. They are as follows:

Read_csv
Read_table
Read_fwf
Read_clipboard
Read_excel
Read_hdf
Read_html
Read_json
Read_msgpack
Read_pickle
Read_sas
Read_sql
Read_stata
Read_feather

Reading Excel files

Note: To read the Excel file, you also need to install another library:xlrd

You can do this via pip:

 sudo pip3 install xlrd
$  pip3 show xlrd
Name: xlrd
Version: 1.1.0
Summary: Library for developers to extract data from Microsoft Excel (tm) spreadsheet files
Home-page: http://www.python-excel.org/
Author: John Machin
Author-email: [email protected]
License: BSD
Location: /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages
Requires:

Next, we look at a simple example of reading Excel:

# file_operation.py
import pandas as pd
import numpy as np
df1 = pd.read_excel("data/test.xlsx")
print("df1:\n{}\n".format(df1))
The content of this Excel is as follows:
df1:
   C  Mon
0  D  Tue
1  E  Wed
2  F  Thu
3  G  Fri
4  A  Sat
5  B  Sun

Note: The code and data files for this article are available through the Github repository mentioned at the beginning of the article.

Read CSV file

Next, let’s look at an example of reading a CSV file.
The contents of the first CSV file are as follows:

$ cat test1.csv
C,Mon
D,Tue
E,Wed
F,Thu
G,Fri
A,Sat

The way to read is also very simple:

# file_operation.py
df2 = pd.read_csv("data/test1.csv")
print("df2:\n{}\n".format(df2))

Let’s look at the second example. The contents of this file are as follows:

$ cat test2.csv
C|Mon
D|Tue
E|Wed
F|Thu
G|Fri
A|Sat

Strictly speaking, this is not a CSV file because its data is not separated by commas. In this case, we can read this file by specifying a delimiter, like so:

# file_operation.py
df3 = pd.read_csv("data/test2.csv", sep="|")
print("df3:\n{}\n".format(df3))

In fact, a read_csvvery large number of parameters are supported to adjust the read parameters, as shown in the following table:

parameter	Instructions
Path	file path
Sep or delimiter	Field separator
Header	The number of column names, the default is 0 (first row)
Index_col	Column number or name is used as the index of the row in the result
Names	List of column names for the result
Skiprows	Number of rows skipped from start position
Na_values	Alternative NAsequence of values
Comment	Separate the commented characters by the end of the line
Parse_dates	Try to resolve the data to datetime. The default isFalse
Keep_date_col	If the column is connected to the resolution date, the connected column is preserved. The default is False.
Converters	Column converter
Dayfirst	When parsing dates that can cause ambiguity, store it in an internal form. The default isFalse
Data_parser	The function used to parse the date
Nrows	Number of rows read from file
Iterator	Returns a TextParser object for reading part of the content
Chunksize	Specifies the size of the read block
Skip_footer	The number of rows to ignore at the end of the file
Verbose	Output various parsed output information
Encoding	file encoding
Squeeze	If the parsed data contains only one column, one is returnedSeries
Thousands of	Thousands of separators

See the detailed read_csv function description here: pandas.read_csv

Handle Invalid Value

The real world is not perfect. The data we read often has some invalid values. If you do not deal with these invalid values, it will cause great disruption to the program.

Treat invalid values, there are two main methods: directly ignore these invalid values; or replace invalid values with valid values.

Below I first create a data structure that contains invalid values. Then use pandas.isnafunctions to confirm which values are invalid:

# process_na.py
import pandas as pd
import numpy as np
df = pd.DataFrame([[1.0, np.nan, 3.0, 4.0],
                  [5.0, np.nan, np.nan, 8.0],
                  [9.0, np.nan, np.nan, 12.0],
                  [13.0, np.nan, 15.0, 16.0]])
 
print("df:\n{}\n".format(df));
print("df:\n{}\n".format(pd.isna(df)));****
df:
      0   1 	2 	3
0   1.0 NaN   3.0   4.0
1   5.0 NaN   NaN   8.0
2   9.0 NaN   NaN  12.0
3  13.0 NaN  15.0  16.0
 
df:
   	0 	1      2      3
0  False  True  False  False
1  False  True   True  False
2  False  True   True  False
3  False  True  False  False

Ignoring Invalid Values

We can pandas.DataFrame.dropnadiscard invalid values through functions:

# process_na.py
print("df.dropna():\n{}\n".format(df.dropna()));

Note: dropnaBy default, the original data structure will not be changed. Instead, a new data structure is returned. If you want to change the data directly, you can pass arguments when you call this function inplace = True.

For the original structure, when the invalid value is all discarded, it will no longer be a valid DataFrame, so this line of code is output as follows:

df.dropna():
Empty DataFrame
Columns: [0, 1, 2, 3]
Index: []

We can also choose to discard the column where the entire column is invalid:

# process_na.py
print("df.dropna(axis=1, how='all'):\n{}\n".format(df.dropna(axis=1, how='all')));
df.dropna(axis=1, how='all'):
      0 	2 	3
0   1.0   3.0   4.0
1   5.0   NaN   8.0
2   9.0   NaN  12.0
3  13.0  15.0  16.0

Note: axis=1 Indicates the axis of the column. How can take ‘any’ or ‘all’, the default is the former.

Replace Invalid Value

We can also fill and replace invalid values with valid ones by functions. like this:

# process_na.py
print("df.fillna(1):\n{}\n".format(df.fillna(1)));
df.fillna(1):
      0    1 	2 	3
0   1.0  1.0   3.0   4.0
1   5.0  1.0   1.0   8.0
2   9.0  1.0   1.0  12.0
3  13.0  1.0  15.0  16.0

It may not make sense to replace all invalid values with the same data, so we can specify different data to fill in. For ease of operation, before filling, we can rename, modify the names of rows and columns by methods:

# process_na.py
 
df.rename(index={0: 'index1', 1: 'index2', 2: 'index3', 3: 'index4'},
          columns={0: 'col1', 1: 'col2', 2: 'col3', 3: 'col4'},
          inplace=True);
df.fillna(value={'col2': 2}, inplace=True)
df.fillna(value={'col3': 7}, inplace=True)
print("df:\n{}\n".format(df));
df:
        col1  col2  col3  col4
index1   1.0   2.0   3.0   4.0
index2   5.0   2.0   7.0   8.0
index3   9.0   2.0   7.0  12.0
index4  13.0   2.0  15.0  16.0

Processing Strings

Data is often involved in the processing of strings, then we look at pandas for string manipulation.

The strfield contains a series of functions to process the string. And, these functions automatically handle invalid values.

Here are some examples. In the first set of data, we deliberately set some strings containing spaces:

# process_string.py
import pandas as pd
s1 = pd.Series([' 1', '2 ', ' 3 ', '4', '5']);
print("s1.str.rstrip():\n{}\n".format(s1.str.lstrip()))
print("s1.str.strip():\n{}\n".format(s1.str.strip()))
print("s1.str.isdigit():\n{}\n".format(s1.str.isdigit()))
s1.str.rstrip():
0 	1
1    2
2    3
3 	4
4 	5
dtype: object
s1.str.strip():
0    1
1    2
2    3
3    4
4    5
dtype: object
s1.str.isdigit():
0    False
1    False
2    False
3 	True
4 	True
dtype: bool

Here are some other examples showing the handling of uppercase, lowercase, and string lengths:

# process_string.py
s2 = pd.Series(['Stairway to Heaven', 'Eruption', 'Freebird',
                    'Comfortably Numb', 'All Along the Watchtower'])
print("s2.str.lower():\n{}\n".format(s2.str.lower()))
print("s2.str.upper():\n{}\n".format(s2.str.upper()))
print("s2.str.len():\n{}\n".format(s2.str.len()))
s2.str.lower():
0          stairway to heaven
1                    eruption
2                    freebird
3            comfortably numb
4    all along the watchtower
dtype: object
 
s2.str.upper():
0          STAIRWAY TO HEAVEN
1                    ERUPTION
2                    FREEBIRD
3            COMFORTABLY NUMB
4    ALL ALONG THE WATCHTOWER
dtype: object
 
s2.str.len():
0    18
1 	8
2 	8
3    16
4    24
dtype: int64

Conclusion

In this article, we covered the most basic operations in data processing using pandas. We hope that you understood the tutorial well and if you have any queries, please drop your comment in below comment box. We will get back to you as soon as possible.

Note: I learnt this pandas tutorial from this great resources Pandas – Powerful Python Data Analysis Toolkit and Python Data Analysis by J.Metz. It is a great tutorial and I highly recommend to read it if you are more interested in Pandas and Python data analysis!

Data ManipulationImport DataPandas