Working with Text Data in Pandas

Tirendaz AI
Published in Level Up Coding
5 min read · Feb 22, 2021

A guide to getting started with text data in Pandas using a real-world dataset.

Photo by Philippe Bout on Unsplash

Real-world datasets don’t consist only of numbers. They often include text as well, and handling that text is an important part of data analysis. In this post, I’ll cover the following topics:

  • How to use string methods in Pandas?
  • How to use regular expressions in Pandas?
  • Practicing with the IMDb dataset

Before getting started, please don’t forget to subscribe to my YouTube channel, where I create content about AI, data science, machine learning, and deep learning.

Let’s dive in!

How to use string methods in Pandas?

Python is a popular language for data manipulation partly because it handles text well. Its strings come with a set of built-in methods, and Pandas offers vectorized versions of most of them that work on a whole column at once.

For example, let’s convert the word hello to uppercase.
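Here is a minimal sketch of that step in plain Python. The variable names are my own choice for illustration.

```python
# Plain Python: str.upper() returns a new, uppercased string
word = "hello"
upper_word = word.upper()
print(upper_word)  # HELLO
```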

In Pandas, you reach string and regular expression methods through the str accessor. To show this, let me import Pandas and NumPy.

Next, let’s create data.

Let’s capitalize the first letter of each value in this data. First, let’s convert the data to a Series.

The str accessor makes string methods available on Series and Index objects. Let’s capitalize the first letter of each value.
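The steps so far can be sketched roughly as follows. The sample names are hypothetical (the original data is not shown here), and I added one missing value on the assumption that it is useful to see how the str accessor handles it.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; np.nan stands in for a missing value
data = ["tim", "KATE", "suSan", np.nan, "aLEX"]
names = pd.Series(data)

# capitalize() uppercases the first letter and lowercases the rest;
# the missing value is passed through instead of raising an error
capitalized = names.str.capitalize()
print(capitalized)
```

Calling names.capitalize() without .str would fail, because the string methods live on the accessor, not on the Series itself.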

You can also make all the letters lowercase.

You can find the length of each string with the len method.

You can find the names that start with the letter “a” with the startswith method. Note that the match is case-sensitive.
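A short sketch of the lower, len, and startswith calls on a hypothetical series of names. Passing na=False to startswith is my own addition, so that the missing value counts as "no match" and the result can be used directly as a filter.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value
names = pd.Series(["tim", "KATE", "suSan", np.nan, "aLEX"])

lowered = names.str.lower()   # every letter lowercase
lengths = names.str.len()     # number of characters per value

# Case-sensitive prefix test; na=False treats the missing value as no match
starts_with_a = names.str.startswith("a", na=False)
print(names[starts_with_a].tolist())  # ['aLEX']
```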

You can apply string methods to index objects. To show this, let’s create a data frame.

Let’s see the columns of the dataset.

df.columns is an index object. You can use the str attribute for this object. For example, let’s make these column names lowercase and replace spaces with the _ symbol.
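As a rough sketch, here is how that could look on a small hypothetical data frame. The column names and values are made up for illustration, and I pass regex=False because the space is meant as a literal character.

```python
import pandas as pd

# Hypothetical frame; only the column labels matter here
df = pd.DataFrame(
    [[1, 2], [3, 4], [5, 6]],
    columns=["Column A", "Column B"],
)
print(df.columns)

# Chain two string methods on the Index and assign the result back
df.columns = df.columns.str.lower().str.replace(" ", "_", regex=False)
print(list(df.columns))  # ['column_a', 'column_b']
```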

You can also use methods such as split on a Series. To show this, let’s create some data.

Let’s split each string on the underscore. The result is a list per row, and I’m going to use [] on the str accessor to select the element at index 1 from each list.
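A minimal sketch of the split and the indexing step, using hypothetical underscore-separated strings:

```python
import pandas as pd

# Hypothetical data: letters joined by underscores
s = pd.Series(["a_b_c", "c_d_e", "f_g_h"])

# Each row becomes a list of pieces
parts = s.str.split("_")
print(parts[0])  # ['a', 'b', 'c']

# .str[1] picks the element at index 1 from every list
second = s.str.split("_").str[1]
print(second.tolist())  # ['b', 'd', 'g']
```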

You can return the split values as a data frame, one column per piece, by passing expand=True. You can also limit the number of splits with the n parameter. Let me show this.
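Here is a small sketch of both parameters on the same kind of hypothetical data. With n=1 only the first underscore is used, so the rest of the string stays together.

```python
import pandas as pd

# Hypothetical data: letters joined by underscores
s = pd.Series(["a_b_c", "c_d_e", "f_g_h"])

# expand=True returns a DataFrame with one column per piece
expanded = s.str.split("_", expand=True)
print(expanded.shape)  # (3, 3)

# n=1 performs at most one split per string
limited = s.str.split("_", expand=True, n=1)
print(limited.iloc[0].tolist())  # ['a', 'b_c']
```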

How to use regular expressions in Pandas?

You can also use regular expressions in Pandas. To show this, let’s create finance data.

Let’s remove the dollar symbol.

Note that the $ symbol is a meta-character with a special meaning in regular expressions (it matches the end of the string). To match a literal dollar sign, you need to escape it with a backslash. Let’s replace “-$” with “-”.
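The two replacements can be sketched like this. The finance values are hypothetical, and I set the regex argument explicitly in both calls because its default differs between Pandas versions.

```python
import pandas as pd

# Hypothetical finance data stored as text
money = pd.Series(["12", "-$10", "$10,000"])

# Literal replacement: with regex=False the "$" needs no escaping
no_dollar = money.str.replace("$", "", regex=False)
print(no_dollar.tolist())  # ['12', '-10', '10,000']

# Regular expression: "\$" matches a literal dollar sign
fixed_sign = money.str.replace(r"-\$", "-", regex=True)
print(fixed_sign.tolist())  # ['12', '-10', '$10,000']
```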

Practice with the IMDb dataset

I’m going to practice with string methods using a real dataset. The dataset is about the highest-rated movies on IMDb. First, let’s import the dataset.

You can access this dataset from here. Let’s see the first rows of this dataset.

Let’s convert the strings in the title column into uppercase using the upper method.

The column labels of the dataset are stored in an index object, so the str accessor works on them too. Let’s capitalize the first letters of these names using the capitalize method.

Let’s take a look at the dataset using the head method.
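Since the dataset file is not included here, the sketch below uses a tiny hypothetical stand-in. The column names title and actors_list, and the two rows, are my assumptions and not the real file.

```python
import pandas as pd

# Hypothetical stand-in for the IMDb data (assumed column names)
movies = pd.DataFrame({
    "title": ["Fight Club", "The Godfather"],
    "actors_list": [
        "['Brad Pitt', 'Edward Norton']",
        "['Marlon Brando', 'Al Pacino']",
    ],
})

# Uppercase every title
upper_titles = movies["title"].str.upper()
print(upper_titles.tolist())  # ['FIGHT CLUB', 'THE GODFATHER']

# Capitalize the column labels through the Index's str accessor
movies.columns = movies.columns.str.capitalize()
print(movies.head())
```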

You can use the contains method to check whether each value in a column contains a given piece of text. For example, let’s search for the name Brad Pitt in the actor list.
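A minimal sketch of that search, again on a hypothetical stand-in for the dataset (the column names are assumptions). I pass regex=False because the name is plain text, not a pattern.

```python
import pandas as pd

# Hypothetical stand-in for the IMDb data (assumed column names)
movies = pd.DataFrame({
    "title": ["Fight Club", "The Godfather", "Se7en"],
    "actors_list": [
        "['Brad Pitt', 'Edward Norton']",
        "['Marlon Brando', 'Al Pacino']",
        "['Morgan Freeman', 'Brad Pitt']",
    ],
})

# Boolean mask: True where the text appears anywhere in the value
has_pitt = movies["actors_list"].str.contains("Brad Pitt", regex=False)
print(movies.loc[has_pitt, "title"].tolist())  # ['Fight Club', 'Se7en']
```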

You can use the replace method to remove a character. For example, let’s remove the square brackets from the actor list.

Let’s remove the left bracket.
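One way to sketch the bracket removal, using hypothetical actor strings. Square brackets are meta-characters in regular expressions, so here I sidestep escaping by asking for literal replacement with regex=False. This is a choice of mine and not necessarily how the original notebook does it.

```python
import pandas as pd

# Hypothetical actor lists stored as text
actors = pd.Series([
    "['Brad Pitt', 'Edward Norton']",
    "['Marlon Brando', 'Al Pacino']",
])

# Remove the left bracket, then the right one
no_left = actors.str.replace("[", "", regex=False)
no_brackets = no_left.str.replace("]", "", regex=False)
print(no_brackets[0])  # 'Brad Pitt', 'Edward Norton'
```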

That’s it. In this post, I covered how to work with text data in Pandas. I hope you enjoyed it. Thanks for reading. You can find the notebook here.
