Tech Daily

Introduction to Data Analysis Part 1: Using Pandas and Exporting to Excel

Introduction to Data Analysis Part 1: Using Pandas and Exporting to Excel

Introduction

In this multi-part series, I'll be going over some of the basics that I've learned over the past few weeks, starting with Pandas.

By the end of the series, the topics that it will have covered include:

  • Importing data from a CSV or Excel file into Pandas
  • Cleaning up the data and removing unusable data.
  • Converting between datatypes.
  • Exporting and importing data to and from Excel files.
  • Performing operations against the data to create additional data, such as a total for example.
  • Using other libraries such as NumPy to perform numerical-based operations on data instead of Pandas.
  • Creating visualisations of the data with MatPlotLib, another library for Python.

Part one of this series will cover importing data from a CSV file, cleaning it up, converting some of it to different datatypes and then exporting it to Excel.

Let's begin by first looking at what is Pandas.

What is Pandas?

Pandas is an easy to use library for Python that you can use to perform data analysis and manipulation of data that can come from a number of sources, such as:

  • CSV files
  • Excel files
  • JSON files / API responses
  • Python lists, tuples or dictionaries

When you import data with Pandas, the data is stored in what is called a DataFrame. A DataFrame can be thought of as an Excel spreadsheet in terms of how it is presented.

Now, let's make a start using Pandas.

Pre-requisites

Prior to starting, you will need to have Miniconda installed, along with a conda environment configured.

For a list of all the required modules and their dependencies, go to the GitHub repository link at the bottom and use the environment.yml file in there to
install them with conda.

If you need to install and setup Miniconda, please see the Miniconda documentation here for the downloads and setup instructions. There is a file in the GitHub repository linked in the references section called environment.yml. This has a list of all the modules, libraries and dependencies that are needed for this series. You can use the file to install them with conda.

For simplicity, I would recommend using Jupyter Notebooks for any data analysis that you do with Python.

A plugin for Visual Studio Code is available to work with Jupyter Notebooks, if you use it for your text editor or IDE.

There is a Jupyter Notebook available with all the steps below in the GitHub repository linked in the references section.

Using Pandas to Import and Clean-up Data

Step 1. Import Pandas

import pandas as pd
Enter fullscreen mode Exit fullscreen mode

Step 2. Create a Pandas DataFrame From the Data CSV File

In this section, a new Pandas DataFrame will be created from a CSV file.

Once that has been done, we will take a look at some of the data and the datatypes of each column in the DataFrame.

Step 2.1. Create the Pandas DataFrame From the CSV File

order_data = pd.read_csv("data/order-data.csv")
Enter fullscreen mode Exit fullscreen mode

Step 2.2. Show the First Five Rows of the Pandas DataFrame

order_data.head(n = 5)
Enter fullscreen mode Exit fullscreen mode

Output from execution of code

As you can see, the DataFrame looks very similar to a spreadsheet.

The number in the first column that doesn't have a column name is the index. This is automatically created by Pandas when it creates a DataFrame. It can be used for referencing data directly, which will be covered in another part of this series.

Step 2.3. Show the DataTypes of Each Column in the Pandas DataFrame

order_data.info()
Enter fullscreen mode Exit fullscreen mode

Output from execution of code

An object datatype typically implies a string of text.

Step 2.4. Show the Total Row Count in the Pandas DataFrame

print(f"Total Rows (Before NaN Removal): {len(order_data)}")
Enter fullscreen mode Exit fullscreen mode

Output from execution of code

Step 3. Check for NaN (null) Values and Clean-Up

Before performing any kind of data manipulation, the data needs to be cleaned up.

Firstly, the data will be checked for NaN (Not a Number - basically an empty (null value) cell) in any of the rows and columns (axes) and depending upon the criteria, will be either removed or changed.

Here are the rules for the clean-up of the data:

  • order_id missing: Delete row.
  • order_date missing: Delete row.
  • customer_name missing: Set name to "Jane Bloggs".
  • item_vendor_name missing: Set name to "Unknown".
  • item: Remove any "," from the description and replace with " -".
  • item missing: Delete row.
  • item_qty missing: Delete row.
  • price_per_unit missing: Delete row.
  • price_currency missing: Set to "GBP".

Prior to deleting any rows from the Pandas DataFrame, those rows will be added to a new Pandas DataFrame called order_data_removed so that it can then be added to a separate Excel worksheet for reference.

Step 3.1. Check for NaN In All Rows and Columns

order_data.isna().sum()
Enter fullscreen mode Exit fullscreen mode

Output from execution of code

Step 3.2. Replace NaN Values (Where Required)

Replace NaN in customer_name with "Jane Bloggs":

order_data["customer_name"].fillna(value = "Jane Bloggs",
                                   inplace = True)
Enter fullscreen mode Exit fullscreen mode

Note: By default, when you perform the above operation, it won't update the DataFrame permanently. It will only do it for that operation.

To get around this, you can either perform a reassignment (basically putting order_data["customer_name"] = at the beginning of the above) or, in some cases, you can use inplace = True to achieve the same thing. Not all functions have that argument available so you may need to use reassignment for that function (Spoilers: There are some in this article later on (3.3 is the first one)).

Check for any NaN values in customer_name:

print(f'customer_name NaN: {order_data["customer_name"].isna().sum()}')
Enter fullscreen mode Exit fullscreen mode

Output from execution of code

Replace NaN in item_vendor_name with "Unknown":

order_data["item_vendor_name"].fillna(value = "Unknown",
                                      inplace = True)
Enter fullscreen mode Exit fullscreen mode

Check for any NaN values in item_vendor_name:

print(f'item_vendor_name NaN: {order_data["item_vendor_name"].isna().sum()}')
Enter fullscreen mode Exit fullscreen mode

Output from execution of code

Replace NaN in price_currency with "GBP":

order_data["price_currency"].fillna(value = "GBP",
                                   inplace = True)
Enter fullscreen mode Exit fullscreen mode

Check for any NaN values in price_currency:

print(f'price_currency NaN: {order_data["price_currency"].isna().sum()}')
Enter fullscreen mode Exit fullscreen mode

Output from execution of code

Step 3.3. Replace "," with " -" in item Column

Just to play safe, remove any "," in the item column with " -" so that there are no issues with exporting to CSV (if you wanted to) or any other format where "," could cause issues.

There could also be issues with running other operations that you may wish to use.

order_data["item"] = order_data["item"].str.replace(",", " -")
Enter fullscreen mode Exit fullscreen mode

Step 3.4. Convert order_date Column To Date from Object

Check the order_date datatype before converting it and what it currently looks like (yyyy/mm/dd):

print(f'order_date Data Type (Before Conversion): {order_data["order_date"].dtype}')
print(f'order_date (Before Conversion):\n{order_data["order_date"].head(n = 5)}')
Enter fullscreen mode Exit fullscreen mode

Output from execution of code

Now, perform the conversion to a datetime datatype that is more usable for Pandas:

order_data["order_date"] = pd.to_datetime(order_data["order_date"], 
                                          format = "%Y-%m-%d", 
                                          utc = False)
Enter fullscreen mode Exit fullscreen mode

Check the order_date datatype after converting it and what it now looks like (yyyy-mm-dd):

print(f'\norder_date Data Type (After Conversion): {order_data["order_date"].dtype}')
print(f'order_date (After Conversion):\n{order_data["order_date"].head(n = 5)}')
Enter fullscreen mode Exit fullscreen mode

Output from execution of code

Step 3.5. Convert item_qty Column To Integer from Float

Check the item_qty datatype before converting it and what it currently looks like (1.0 for example):

print(f'item_qty Data Type (Before Conversion): {order_data["item_qty"].dtype}')
print(f'item_qty (Before Conversion):\n{order_data["item_qty"].head(n = 5)}')
Enter fullscreen mode Exit fullscreen mode

Output from execution of code

Now, perform the conversion to an integer datatype:

order_data["item_qty"] = order_data["item_qty"].convert_dtypes(convert_integer=True)
Enter fullscreen mode Exit fullscreen mode

Check the item_qty datatype after converting it and what it now looks like (1 for example):

print(f'\nitem_qty Data Type (After Conversion): {order_data["item_qty"].dtype}')
print(f'item_qty (After Conversion):\n{order_data["item_qty"].head(n = 5)}')
Enter fullscreen mode Exit fullscreen mode

Output from execution of code

Step 3.6. Round price_per_unit to Two Decimal Places

Check the price_per_unit before rounding it (up or down) and what it currently looks like (1.234 for example):

print(f'price_per_unit Data Type (Before Rounding): {order_data["price_per_unit"].dtype}')
print(f'price_per_unit (Before Rounding):\n{order_data["price_per_unit"].head(n = 5)}')
Enter fullscreen mode Exit fullscreen mode

Output from execution of code

Next, perform the rounding to two decimal places:

order_data["price_per_unit"] = order_data["price_per_unit"].round(decimals=2)
Enter fullscreen mode Exit fullscreen mode

Lastly, check the price_per_unit after rounding it (up or down) and what it now looks like (1.23 for example):

print(f'\nprice_per_unit Data Type (After Rounding): {order_data["price_per_unit"].dtype}')
print(f'price_per_unit (After rounding):\n{order_data["price_per_unit"].head(n = 5)}')
Enter fullscreen mode Exit fullscreen mode

Output from execution of code

Step 3.7. Remove Lines With NaN Values (Where Required)

Before removing any rows with NaN values, place those rows into a separate Pandas DataFrame so that it can be exported later on, mostly for reference or perhaps to be used for other purposes, should they arise:

order_data_removed = order_data[order_data.isna().any(axis = 1)]
Enter fullscreen mode Exit fullscreen mode

Next, show the first five rows of the order_data_removed Pandas DataFrame:

order_data_removed.head(n = 5)
Enter fullscreen mode Exit fullscreen mode

Output from execution of code

Next, check for total NaN entries in the order_data_removed Pandas DataFrame:

order_data.isna().sum()
Enter fullscreen mode Exit fullscreen mode

Output from execution of code

Lastly, as the clean-up criteria for NaN values has been completed, any remaining NaN values in the order_data Pandas DataFrame can now be removed:

order_data.dropna(inplace = True)
Enter fullscreen mode Exit fullscreen mode

Check for any remaining NaN values. There should be none:

order_data.isna().sum()
Enter fullscreen mode Exit fullscreen mode

Output from execution of code

Check the total number of rows in the order_data Pandas DataFrame:

print(f"Total Rows (After NaN Removal): {len(order_data)}")
Enter fullscreen mode Exit fullscreen mode

Output from execution of code

Step 3.8. Reset the order_data and order_data_removed Indexes

As there have been some rows removed from the order_data Pandas DataFrame, the index will need to be reset, otherwise it will still have the old index in there that started at 0 and ended at 999 (1000 rows).

Note: By default, reset_index will create a new index and place the old index into a new column. To stop the old index being created in a new column, use drop = True.

First, reset the index of the order_data Pandas DataFrame:

order_data.reset_index(inplace = True, 
                       drop = True)
Enter fullscreen mode Exit fullscreen mode

Show the last five lines of the order_data Pandas DataFrame:

order_data.tail(n = 5)
Enter fullscreen mode Exit fullscreen mode

Output from execution of code

Next, reset the index of the order_data_removed Pandas DataFrame:

order_data_removed.reset_index(inplace = True, 
                               drop = True)
Enter fullscreen mode Exit fullscreen mode

Show the last five lines of the order_data_removed Pandas DataFrame:

order_data_removed.tail(n = 5)
Enter fullscreen mode Exit fullscreen mode

Output from execution of code

Step 3.9. Review The Current State of the order_data Pandas DataFrame

Show first five rows to see what the data looks like:

order_data.head(n = 5)
Enter fullscreen mode Exit fullscreen mode

Output from execution of code

Next, check that all of the datatypes are correct:

order_data.info()
Enter fullscreen mode Exit fullscreen mode

Output from execution of code

Step 4. Export The Two DataFrames To Excel

The final step will be to export the two DataFrames (order_data and order_data_removed) to an Excel workbook. Each DataFrame will be on its own worksheet.

with pd.ExcelWriter(path = "order_data_exported.xlsx", 
                    engine = "xlsxwriter",
                    date_format = "YYYY-MM-DD",
                    datetime_format = "YYYY-MM-DD") as writer:

    order_data.to_excel(writer, 
                        index = False,
                        sheet_name = "order_data")

    order_data_removed.to_excel(writer, 
                        index = False,
                        sheet_name = "order_data_removed")
Enter fullscreen mode Exit fullscreen mode

You should now see the Excel file in the same folder where you ran the script / Notebook from. You can open it and review the contents.

Output from execution of code

This concludes part one of this series.

References

GitHub Repository Link

Miniconda Installation Documentation