Data Science Tutorial

Data Analysis with Pandas

A guide on how to perform data analysis with Pandas.

Tirendaz AI
Published in CodeX
6 min read · Nov 17, 2021

Photo by JESHOOTS.COM on Unsplash

Pandas is one of Python’s most important libraries. You can use Pandas for data analysis and data preprocessing. In this post, I’ll cover the following topics:

  • Loading a dataset
  • Exploring the dataset
  • Selecting with the loc and iloc methods
  • Sorting
  • Working with Series
  • Filtering
  • Grouping

Let’s dive in!

Loading the Dataset with Pandas

The dataset I’m going to use in this post is a video game sales dataset. It contains a list of video games with sales greater than 100,000 copies.

You can find the dataset and notebook here.

Before loading the dataset, let’s import the pandas library.

import pandas as pd

One of the most important stages of data analysis is loading the dataset. You can load a CSV file with the read_csv function in Pandas. Let’s use it to load the dataset.

games = pd.read_csv("vgsalesGlobale.csv")

If we want, we can specify an index column using the index_col parameter. The values of that column then become the row labels. Let me show this.

games = pd.read_csv("vgsalesGlobale.csv", index_col = "Name")
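To see what index_col changes without downloading the file, here is a minimal sketch that reads a tiny in-memory CSV. The three rows are made up for illustration and are not taken from the real dataset; only the column names Name, Year, and Genre follow the ones used in this post.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for vgsalesGlobale.csv
csv_text = """Name,Year,Genre
Game A,2006,Sports
Game B,1985,Platform
Game C,2008,Racing
"""

# Default: pandas creates an integer index 0, 1, 2, ...
default_index = pd.read_csv(io.StringIO(csv_text))

# With index_col, the Name column becomes the row labels
named_index = pd.read_csv(io.StringIO(csv_text), index_col="Name")

print(default_index.index.tolist())  # [0, 1, 2]
print(named_index.index.tolist())    # ['Game A', 'Game B', 'Game C']
print(named_index.columns.tolist())  # ['Year', 'Genre']
```

Note that Name is no longer a regular column once it is used as the index, which is what later makes label-based selection with loc possible.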

Exploring the Dataset

You can use the head method to see the first rows of the dataset. Let’s see the first 5 rows of the dataset.

games.head()

There are two main data structures in Pandas: Series and DataFrame. A DataFrame is a table of rows and columns, similar to an Excel sheet. Its columns can hold different data types, such as numeric or categorical values. The other data structure is the Series, which consists of a single column.
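The relationship between the two structures can be sketched with a small example. The DataFrame below is hypothetical (the titles and sales figures are invented); it only shows that selecting one column of a DataFrame gives back a Series with the same index.

```python
import pandas as pd

# Hypothetical mini dataset (values invented for illustration)
df = pd.DataFrame(
    {
        "Genre": ["Sports", "Platform", "Racing"],
        "Global_Sales": [1.5, 2.0, 0.7],
    },
    index=["Game A", "Game B", "Game C"],
)

# The whole table is a DataFrame
print(type(df).__name__)  # DataFrame

# A single column is a Series that shares the DataFrame's index
sales = df["Global_Sales"]
print(type(sales).__name__)          # Series
print(sales.index.equals(df.index))  # True

# Each column carries its own data type
print(df["Genre"].dtype, sales.dtype)
```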

You can see the last rows of the dataset using the tail method.

games.tail()

By default, the last 5 rows are displayed. You can also see the last 10 rows.

games.tail(10)

Let’s take a look at the number of rows in the dataset with Python’s built-in len function.

len(games)  # Output: 16598

You can see the shape of the dataset with the shape attribute.

games.shape  # Output: (16598, 10)

You can see the data types of the columns with the dtypes attribute.

games.dtypes
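The exploration calls above can be tried on any DataFrame. Here is a small sketch on a hypothetical four-row frame (the values are invented), showing how len, shape, head, tail, and dtypes relate to each other.

```python
import pandas as pd

# Hypothetical mini dataset: 4 rows, 2 columns
df = pd.DataFrame(
    {
        "Year": [2006, 1985, 2008, 2009],
        "Genre": ["Sports", "Platform", "Racing", "Sports"],
    }
)

print(len(df))   # 4
print(df.shape)  # (4, 2)

# shape is a (rows, columns) tuple, so it can be unpacked
n_rows, n_cols = df.shape

# head and tail accept the number of rows to return
first_two = df.head(2)
last_one = df.tail(1)

# dtypes lists one data type per column
print(df.dtypes)
```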

Selecting with the loc and iloc methods

You can select any row by its integer position using the iloc indexer. Let’s say we want to see the 300th row. Note that I’m going to pass 299 to iloc because positions start from zero in Python.

games.iloc[299]

Almost everyone knows Super Mario. Let’s select this game by its index label. You can use the loc indexer for this.

games.loc["Super Mario Bros."]

The dataset contains two entries for Super Mario Bros., released in 1985 and 1999, so loc returns both rows.
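The difference between position-based and label-based selection can be sketched on a small hypothetical frame. The titles, years, and platforms below are invented, and one label is repeated on purpose to mimic a game that appears twice in the index.

```python
import pandas as pd

# Hypothetical data; "Game B" appears twice on purpose
df = pd.DataFrame(
    {"Year": [2006, 1985, 1999], "Platform": ["P1", "P2", "P3"]},
    index=["Game A", "Game B", "Game B"],
)

# iloc selects by integer position: position 0 is the first row
first_row = df.iloc[0]
print(type(first_row).__name__)  # Series

# loc selects by label; a unique label returns one row as a Series
unique_label = df.loc["Game A"]
print(type(unique_label).__name__)  # Series

# A repeated label returns every matching row as a DataFrame
repeated_label = df.loc["Game B"]
print(repeated_label["Year"].tolist())  # [1985, 1999]
```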

Sorting

Now let’s sort the games by year using the sort_values method.

games.sort_values(by = "Year").head()

By default, values are sorted in ascending order. To sort in descending order, set the ascending argument to False.

games.sort_values(by = "Year", ascending = False).head()

You can also sort by two columns. Let’s sort the dataset by the Year and Genre columns. Rows are ordered by year first, and genre breaks ties within the same year.

games.sort_values(by = ["Year","Genre"]).head()

You can also sort the dataset by its index.

games.sort_index().head()

The result has its index labels sorted in alphabetical order.
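Here is a small sketch of the sorting calls on hypothetical data (titles, years, and genres are invented). It also shows that sort_values and sort_index return new objects and leave the original DataFrame unchanged.

```python
import pandas as pd

# Hypothetical mini dataset (values invented for illustration)
df = pd.DataFrame(
    {
        "Year": [2008, 1985, 2008, 2006],
        "Genre": ["Racing", "Platform", "Action", "Sports"],
    },
    index=["Game C", "Game B", "Game D", "Game A"],
)

# Newest games first
by_year_desc = df.sort_values(by="Year", ascending=False)
print(by_year_desc["Year"].tolist())  # [2008, 2008, 2006, 1985]

# Sort by Year, then by Genre within the same year
by_year_genre = df.sort_values(by=["Year", "Genre"])
print(by_year_genre.index.tolist())  # ['Game B', 'Game A', 'Game D', 'Game C']

# Sort by the index labels
by_index = df.sort_index()
print(by_index.index.tolist())  # ['Game A', 'Game B', 'Game C', 'Game D']

# The original DataFrame keeps its order
print(df.index.tolist())  # ['Game C', 'Game B', 'Game D', 'Game A']
```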

Working with Series

You can select a single column from the dataset, which is returned as a Series. For example, let’s take a look at the first ten rows of the Publisher column.

games["Publisher"].head(10)

Filtering

Sometimes, you may want to filter rows based on some criteria. For example, let’s see the rows whose genre is Action.

games[games["Genre"] == "Action"]

There is another way to filter. Let me show this. First, I’m going to assign the filter condition to a variable.

games_by_genre = games["Genre"] == "Action"

Next, let’s filter by this variable.

games[games_by_genre].head()

I obtained the same result as above. Now let’s create a criterion with games made in 2010.

games_in_2010 = games["Year"] == 2010

Now let’s filter by the two criteria.

games[games_by_genre & games_in_2010]

You can also select games that were made in 2010 or belong to the action genre. Use the pipe operator (|) for this.

games[games_by_genre | games_in_2010]
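Combining conditions can be sketched on a small hypothetical frame (the rows are invented). The variable names is_action and in_2010 are illustrative. The last line shows one detail worth knowing: when the conditions are written inline, each one needs its own parentheses, because & and | bind more tightly than ==.

```python
import pandas as pd

# Hypothetical mini dataset (values invented for illustration)
df = pd.DataFrame(
    {
        "Genre": ["Action", "Action", "Sports", "Racing"],
        "Year": [2010, 2005, 2010, 2001],
    },
    index=["Game A", "Game B", "Game C", "Game D"],
)

# Each condition is a boolean Series with one True/False per row
is_action = df["Genre"] == "Action"
in_2010 = df["Year"] == 2010

# & keeps rows where both conditions hold
both = df[is_action & in_2010]
print(both.index.tolist())  # ['Game A']

# | keeps rows where at least one condition holds
either = df[is_action | in_2010]
print(either.index.tolist())  # ['Game A', 'Game B', 'Game C']

# Inline version: wrap each comparison in parentheses
inline = df[(df["Genre"] == "Action") & (df["Year"] == 2010)]
```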

You can also filter for values greater or less than a certain value. For example, let’s take a look at the games made after 2015. First, let’s create our criterion.

after_2015 = games["Year"] > 2015

Now let’s filter by this variable.

games[after_2015]

You can also specify a range as a criterion with the between method, which includes both endpoints by default. For example, let’s see the games made between 2000 and 2010.

mid_2000s = games["Year"].between(2000, 2010)

Now let’s filter by this variable.

games[mid_2000s]
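A quick sketch on a hypothetical Series of years shows how between treats its boundaries: by default, both endpoints are included, so it matches the explicit comparison written underneath.

```python
import pandas as pd

# Hypothetical years chosen to sit on and around the boundaries
years = pd.Series([1999, 2000, 2005, 2010, 2011])

# between includes both endpoints by default
mask = years.between(2000, 2010)
print(mask.tolist())  # [False, True, True, True, False]

# Equivalent explicit comparison
same = (years >= 2000) & (years <= 2010)
print(mask.equals(same))  # True
```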

You can also filter by the index. For example, let’s see the games whose names include the word “sport”. Note that the str accessor gives access to string methods. First, let’s convert the index labels to lowercase with str.lower, then find the names that contain “sport” with the contains method.

sport_in_title = games.index.str.lower().str.contains("sport")

Now let’s filter by this variable.

games[sport_in_title]
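The same idea can be sketched on a hypothetical frame with invented titles. The second variant uses the case parameter of str.contains, which performs a case-insensitive match without lowercasing first; it is shown as an alternative, not as the approach used in this post.

```python
import pandas as pd

# Hypothetical titles (invented for illustration)
df = pd.DataFrame(
    {"Year": [2006, 2009, 1985]},
    index=["Beach Sports", "MOTORSPORT Rally", "Puzzle Land"],
)

# Lowercase the labels, then test each one for the substring
has_sport = df.index.str.lower().str.contains("sport")
print(list(has_sport))  # [True, True, False]

# Alternative: let contains ignore case directly
has_sport_nocase = df.index.str.contains("sport", case=False)

# The boolean result can be used to filter the rows
matches = df[has_sport]
print(matches.index.tolist())  # ['Beach Sports', 'MOTORSPORT Rally']
```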

You can find the statistics of the numeric columns in the dataset. For example, let’s find the mean of global sales.

games["Global_Sales"].mean()  # Output: 0.537

Grouping data

You can group rows using the groupby method. For example, let’s group by the Genre column. First, let’s create the groupby object.

genres = games.groupby("Genre")

Now let’s see the total global sales for each genre. I’m going to use the sum method for this.

genres["Global_Sales"].sum()

You can sort the values with the sort_values method.

genres["Global_Sales"].sum().sort_values(ascending = False)
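The grouping steps can be sketched end to end on a small hypothetical frame. The sales figures below are invented and chosen so that the sums are easy to check by hand.

```python
import pandas as pd

# Hypothetical mini dataset (values invented for illustration)
df = pd.DataFrame(
    {
        "Genre": ["Action", "Sports", "Action", "Racing"],
        "Global_Sales": [1.0, 2.5, 0.5, 0.75],
    }
)

# Group the rows by genre and add up the sales within each group
totals = df.groupby("Genre")["Global_Sales"].sum()
print(totals["Action"])  # 1.5

# Rank the genres from highest to lowest total
ranked = totals.sort_values(ascending=False)
print(ranked.index.tolist())  # ['Sports', 'Action', 'Racing']

# For comparison, the mean over all rows, ignoring the groups
print(df["Global_Sales"].mean())  # 1.1875
```

The result of sum is a Series indexed by genre, which is why sort_values can be chained directly onto it.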

Conclusion

Pandas is one of the most widely used libraries for data preprocessing in Python, and it has many more methods than the ones covered here. Once you know these basics, you can explore and reshape your data with ease. You can find this notebook here.

That’s it. I hope you enjoyed it. Thanks for reading.