Categories
Python

Pandas DataFrames in Python: An Introduction

I figure since you have found yourself navigating to this page that you probably have a good amount of data that you are looking to analyze, and you may possibly be wondering the best and most efficient way to answer some of your questions about your data. The answer to your questions can be found with the use of the python package: Pandas.

How to access Pandas

Due to the popularity of Pandas, it has its own conventional abbreviation, so anytime you are importing pandas into python, use the nomenclature below:

import pandas as pd

The Primary Use of Pandas package is the DataFrame

The pandas API defines a pandas dataframe as:

Two-dimensional, size-mutable, potentially heterogeneous tabular data. Data structure also contains labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.

Basically, all that means is that you have data contained in the format to what you see below. Data found in rows and columns:

a screenshot detailing what a pandas dataframe looks like
Example dataframe with labels for data, rows, and columns. Dataset is from Kaggle
DataFrames are extremely useful because they provide a simple method to print a table visualization and then manipulate it the way you want it to be. You can easily reference a row by using the index (the numbers on the far left). The index will be the corresponding row’s number starting at zero unless you specify names to each row. The columns can also be easily referenced by the column name such as “Track Name” or by their position in the dataframe. We will talk in more detail about referencing rows and columns later in this article.

Creation Time!

There are several ways to create a pandas dataframe:

  1. Import data from a .csv file (or other file type e.g. Excel, SQL database)
  2. from a list
  3. from a dictionary
  4. from a numpy array
  5. many, many more!

In general, you will mainly be putting data from a .csv file, or some type of data source (i.e. SQL database), into a pandas dataframe. You will not be making it from scratch since that would take an insanely long time to do depending on the amount of data you have. However, in the off chance that you need to, here is a quick, simple example from a python dictionary:

import pandas as pd
dict1 = {'Exercises': ['Running','Walking','Cycling'],
         'Mileage': [250, 1000, 550]}
df = pd.DataFrame(dict1)
df

Output:

picture of a simple pandas dataframe to show what one looks like
Basic dataframe made from code above

The dictionary keys (‘Exercises’ and ‘Mileage’) become the corresponding column headers. The values in the dictionary being the lists in this example became the individual data points in the dataframe. The order that the lists are in matters since Running will be placed in the first row since it is first in the ‘Exercises’ list and 250 will be placed in the first row in the second column since it is first in the ‘Mileage’ list. Also, you will notice that since I did not specify labels for the index of the dataframe it automatically was labeled 0,1, and 2.

From a .csv File

However, as I said before the most likely way that you will be creating a pandas dataframe is from a csv or other type of file that you will import in to analyze the data. This is easily completed with just the following:

df = pd.read_csv("file_location.../file_name.csv")

pd.read_csv() is an extremely strong and versatile method and will be extremely useful depending on how you are looking to import your data. If your csv file already comes with headers or an index, you can specify this while importing and make your life so much easier. In order to understand the full ability of pd.read_csv() I suggest you look at the pandas API here.

First things first

Now, you have imported your data into your python editor and you are ready to analyze it. However, before we get to the juicy part of answering your analytical questions you have to become familiar with your data and see what it looks like. As the person who is analyzing this data, you have to become comfortable with the dataset. There are four methods that I love to use in order to get to know my data and which Pandas makes super easy.

  1. .head() & .tail()
  2. .info()
  3. .describe()
  4. .sample()
raw_song.head()

The line above is the line that I have in the picture at the top of the page. It will display the first 5 lines of the dataframe and each of the columns to give you an easy summary of what the data looks like. You can also specify a certain number of rows inside the () of the method to show more rows if your heart so desires.

pandas dataframe showing the functionality of .head()
.head() method on song data from Spotify dataset

.tail()

.tail() is the same just displaying the last 5 lines.

raw_song.tail()

"pandas

From these two quick methods, I have a general idea of the column names and just what the data looks like just from a small sample of the dataset. These methods are also really useful especially given a dataset such as the Spotify dataset working with over 3 million rows, you can easily display the dataset and get a quick idea, and it won’t take your computer a long time to display the data.

.info()

.info() is also useful in that it shows me a succinct list of all of the columns, their datatypes, and whether you have any null datapoints or not.

raw_song.info(verbose=True, null_counts=True)

"picture

.describe()

If you have completely integer or float columns (i.e. ‘Position’, ‘Streams’), then .describe() can be a useful method to understand more about your dataset as it will show many descriptive statistics about those columns.

raw_song.describe()
pandas dataframe output showing the .describe() method

.describe() method on song data from Spotify dataset. Notice that only ‘Position’ and ‘Streams’ columns are shown since they were the only two integer columns, the other columns are strings and do not have descriptive stats.

.sample()

Lastly, .sample() will allow you to randomly sample your dataframe and see if any manipulation that you made has incorrectly changed something in your dataset, and can also be great when first exploring your dataset just to get an idea of what the dataset contains exactly that was not already shown in the previous methods.

raw_song.sample(10)
an example of 10 rows gathered by the .sample() method of a pandas dataframe
.sample() method on song data from Spotify dataset.

I use each of these methods consistently while exploring and preparing my datasets for analysis. Anytime I change the data in a column, change a column name, or add/delete a row/column I will then make sure it all changed the way I wanted it to by quickly running at least some of the previous 5 methods.

Selecting a Row(s) or Column(s)

Awesome, now you know how to look at your dataset as a whole, but you really just want to look at a few columns or a few rows and get the rest out of your way.

.loc[] and .iloc[]

These two methods will do exactly that just in different ways depending on the way in which you are able to reference a certain row or column.

If you know the label of a row or a column, use .loc[].

Or if you know the index of a row or a column, use .iloc[].

If you know both, just pick your favorite, flip a coin, or use the one that you know will not potentially change. For example, if you add rows or columns to your dataframe the index will change on these rows/columns and may cause you to reference the wrong row/column later.

Thus, going back to the Spotify dataset. You can just look at the column “Track Name” by using either .loc[] or .iloc[]. With .loc[] since I know the label of the column I would use the following:

.loc[ ]

raw_song.loc[:,'Track Name']

The colon after the first bracket specifies which rows I am referencing, and since I want all rows in the “Track Name” column, I am using the “:”.

functionality of the pandas .loc[] method

.iloc[ ]

I will receive the same output with .iloc[] only this time the index of the “Track Name” column needs to be specified:

raw_song.iloc[:,1]
shows the functionality of the pandas .iloc[] method

.loc[] and .iloc[] work the same for rows, except in this case since both the labels and the indices for the rows are the same, they would look exactly the same.

Slicing and Dicing

Photo by Annie Spratt on Unsplash

Another easy way to get a portion of the DataFrame is just by using [] and specifying the column name inside of the brackets.

raw_song[['Artist','Streams']].head()
how to select certain columns using indexing in pandas

If you use only one column and one set of brackets you will receive a Pandas Series.

raw_song['Streams']
how to select only one column in a pandas dataframe

Adding Rows, Columns from a Pandas DataFrame

Utilizing what we already know from .loc[] we can use this to add a row or a column to a dataframe. You can also add columns in two other ways using .insert() or by acting as you are adding a slice of a dataframe and specifying the name of the column inside square brackets []. If you are trying to add multiple rows and columns you can create a separate dataframe and join this new dataframe of the new column or row to your original dataframe to add that column or row. To do so you would use pd.merge(), concat(), or join(); however, further discussion and examples of these methods will be discussed in a future article and are outside the scope of this post.

Adding a row:

If you decide to use .loc[] to add a row to a dataframe, you can only add it to the bottom. Specifying any other index in the dataframe will erase the value currently in that row and replace it with the data that you are inserting. In this example, the new index is ‘last’ and it shows up at the bottom. Notice that it doesn’t have to be a specific name. Just as long as it is not the same as any of the other indices.

raw_song.loc['last'] = [0,'hello','bluemen',1,"https://open.spotify.com/track/notarealtrack", '2017-02-05','ec']
how to add a column using pandas .loc[] method

Adding a Column:

You can do the same to add a column to a dataframe using .loc[]. Once again, create a new name for the column unless you are trying to replace a column with new values. Either place a single value that can be copied for all rows, or use a list with the same length as the number of rows in the dataframe.

raw_song.loc[:,'new_col'] = 0
raw_song.tail()
how to add a new column to pandas dataframe
.loc[] to add a column to the Spotify dataset. I used 0 to simplify the input

There are two other ways to insert a new column into your dataframe other than at the end.

The method .insert() allows you to specify where to place the column in your dataframe. It takes 3 arguments, the index where it is to be placed, the name of the new column, and the values to be placed as the column data.

raw_song.insert(2,'new_col',0)
raw_song.tail()
how to use insert() to add a column in a specific place in pandas dataframe
Using .insert() to make a new column in the Spotify dataset.

The second way to add a column is by acting like it is a slice of the dataframe. You do this by using the [ ] to name the new column and have it equal the new data.

raw_song['new_col'] = 0
raw_song.tail()
adding a new column at the end of a pandas dataframe
adding a new column to the end of the dataframe

In this way, I cannot specify the location of the new column but is another way of performing the operation.

Deleting Rows, Columns from a Pandas DataFrame

Image for post
Photo by Jan Tinneberg on Unsplash

If you want to get rid of some rows or columns, it is simple, just drop them.

Specify which axis you would like to drop (0 for rows, 1 for columns) and the name of that row or column you would like to drop. Voila! You are good to go!

raw_song.drop(labels='new_col',axis=1)
how to drop a column from the dataframe.
.drop() removed the “new_col” column added in the previous section.

Renaming Indices or Columns

If you are looking to change the index of your dataframe to a different column in your dataframe then use .set_index(). The name of the column goes in the parentheses. However, if you know exactly what you would like to name your index use the .rename() method.

raw_song.rename(index={0:'first'}).head()
how to rename a row index in a dataframe
The dataframe’s first index changed from 0 to “first”.

To rename a column specify which column you want to rename and what you would like to name it inside the {}. This is similar to renaming the index.

raw_song.rename(columns={'Position':'POSITION_RENAMED'}).head()
how to rename a column in the dataframe
The first column changed from “Position” to “POSITION_RENAMED”

How to Iterate Over your Pandas DataFrame

Most data is messy. You must manipulate your dataset to make it as clean as possible. The easiest way to make changes on an entire column is to iterate over every value in the dataframe. The easiest way to do so is already built into pandas along with a for loop:

for index, col in raw_song.iterrows():
    # manipulate data here

How to Write a DataFrame to a file

After completing all of the manipulations to your dataframe, it is now time to export it. Similar to importing the dataset from a file, it is now just the opposite. Pandas have a variety of different file types that you can write the dataframe to. The most common is to write it to a csv file.

pd.to_csv('file_name.csv')

Now you know the basics of Pandas and DataFrames. These are extremely powerful tools in your data analysis toolbox.

Still confused? Contact me to schedule a consultation to get control of your data.

This article was originally posted on Medium.