Methods For Categorical Data in Pandas

Methods you should know to work with categorical data in Pandas

Tirendaz AI
The Startup

Photo by Marvin Meyer on Unsplash

Your dataset may contain categorical data. A categorical variable takes one of a limited set of values, called categories. Converting such columns to the Pandas categorical type lets you use less memory and makes analysis easier.

I’ll talk about the following topics in this post.

  • Converting to categorical data
  • Working with categorical data
  • The performance of the category types
  • The categorical methods
  • Creating a dummy variable

Before getting started, don’t forget to subscribe to my YouTube channel, where I create content about AI, data science, machine learning, and deep learning.

Let’s dive in!

Converting to Categorical Data

You can convert a variable to a categorical variable. To show this, first, let’s import the Pandas and NumPy libraries.

If your dataset has repeated values, you can inspect them with the unique and value_counts methods. To show this, let’s create some data.

Let’s take a look at the unique values in the data.

Let’s see how many times each value occurs in the data using the value_counts method.
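
The steps above can be sketched as follows. This is a minimal example, and the names in the data are made up for illustration since the original dataset is not shown here.

```python
import pandas as pd

# Hypothetical sample data: a short Series with repeated names.
data = pd.Series(["Tom", "Kim", "Tom", "Tom"] * 2)

# unique() returns the distinct values in order of first appearance.
unique_values = data.unique()
print(unique_values)

# value_counts() returns how many times each distinct value occurs.
counts = data.value_counts()
print(counts)
```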

You can also represent these values with integer codes. To show this, let me create a values variable.

Now, let’s map this variable to data.
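
One way to do this mapping is with the take method. The sketch below assumes that the values variable holds integer codes and that a small lookup Series (called names here, a hypothetical name) holds the distinct values.

```python
import numpy as np
import pandas as pd

# Hypothetical integer codes, one per observation.
values = np.array([0, 1, 0, 0] * 2)

# Hypothetical lookup of distinct names: position 0 is "Tom", position 1 is "Kim".
names = pd.Series(["Tom", "Kim"])

# take() selects rows by position, so each code is turned back into its name.
restored = names.take(values)
print(restored)
```

Storing small integer codes plus one copy of each distinct value is the idea behind the categorical type.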

Categorical Type in Pandas

Pandas has a dedicated categorical type for this kind of data. To show this, let’s print the data variable again.

Let’s assign the length of this variable to a variable N.

Let’s create a DataFrame using this name data.

Let’s take a look at the dataset.
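
Here is a minimal sketch of building such a DataFrame. Only the name column comes from the text. The id and score columns, and the names themselves, are assumptions added to give the table some shape.

```python
import numpy as np
import pandas as pd

# Hypothetical name data with repeated values.
data = ["Tom", "Kim", "Tom", "Tom"] * 2
N = len(data)

# Seeded generator so the random scores are reproducible.
rng = np.random.default_rng(seed=0)

df = pd.DataFrame({
    "name": data,
    "id": np.arange(N),                       # assumed helper column
    "score": rng.integers(50, 100, size=N),   # assumed helper column
})
print(df)
```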

Let’s select the name column.

Let’s take a look at the structure of the name column.

This column is a Series. Let’s convert this Series to the category type.

Now the values in name_cat are categorical. To check this, let’s assign the values in name_cat to x.

Let’s take a look at the structure of these values.

Let’s see the codes.
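
The conversion and the checks above might look like this. It is a sketch with a made-up name column. The variable names name_cat and x follow the text.

```python
import pandas as pd

# Hypothetical DataFrame with a repetitive name column.
df = pd.DataFrame({"name": ["Tom", "Kim", "Tom", "Tom"] * 2})

# Selecting one column gives a Series.
name = df["name"]
print(type(name))

# astype("category") converts the Series to the categorical dtype.
name_cat = name.astype("category")

# The underlying values are a pandas Categorical object.
x = name_cat.values
print(type(x))

# categories holds the distinct values; codes holds one integer per row.
print(x.categories)
print(x.codes)
```

By default the categories are sorted, so here "Kim" gets code 0 and "Tom" gets code 1.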

You can also convert a column of the DataFrame to the category type directly.
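
A minimal sketch of converting a DataFrame column, assuming a hypothetical name column:

```python
import pandas as pd

# Hypothetical DataFrame with a repetitive name column.
df = pd.DataFrame({"name": ["Tom", "Kim", "Tom", "Tom"] * 2})

# Overwrite the column with its categorical version.
df["name"] = df["name"].astype("category")
print(df.dtypes)
```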

You can directly create a categorical variable.

You can also build categorical data directly with the pd.Categorical constructor.
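
Both approaches can be sketched like this, using made-up values:

```python
import pandas as pd

# Create a categorical Series directly by passing dtype="category".
s = pd.Series(["a", "b", "c", "a"], dtype="category")
print(s)

# Build a Categorical object from a plain Python list.
c = pd.Categorical(["a", "b", "c", "a"])
print(c)
```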

If your data is already encoded as integer codes, you can build a categorical from it with the from_codes constructor. To show this, let’s create people and codes variables and then combine them.
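
A short sketch of from_codes. The contents of people and codes are made up for illustration.

```python
import pandas as pd

# Hypothetical categories and the integer code for each observation.
people = ["Tom", "Kim", "Sam"]
codes = [0, 1, 2, 0, 0, 1]

# Each code is an index into the categories list.
cat = pd.Categorical.from_codes(codes, categories=people)
print(cat)
```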

Note that categorical data has no meaningful order by default. You can make the categories ordered by passing ordered=True.

You can also turn an unordered categorical into an ordered one with the as_ordered method.
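
Both ways of getting an ordered categorical can be sketched as follows, again with made-up people and codes:

```python
import pandas as pd

# Hypothetical categories and codes.
people = ["Tom", "Kim", "Sam"]
codes = [0, 1, 2, 0, 0, 1]

# Option 1: ask for an ordered categorical when creating it.
ordered_cat = pd.Categorical.from_codes(codes, categories=people, ordered=True)
print(ordered_cat)

# Option 2: create an unordered categorical, then convert it.
unordered_cat = pd.Categorical.from_codes(codes, categories=people)
converted = unordered_cat.as_ordered()
print(converted)
```

The order follows the position in the categories list, so here it is Tom < Kim < Sam.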

Working with Categorical Data

Once your data is categorical, you can easily work with functions like groupby. To show this, let’s draw some data from a normal distribution.

Let me divide this data into four intervals.

Let’s check the type of this interval variable.

This interval variable is a categorical type. You can assign a label to these ranges.
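
Here is a sketch of the binning step. It assumes the four intervals are quartiles created with pd.qcut; pd.cut would give equal-width bins instead. The sample size and the labels Q1 to Q4 are also assumptions.

```python
import numpy as np
import pandas as pd

# Draw 1000 values from a standard normal distribution (seeded for reproducibility).
rng = np.random.default_rng(seed=42)
data = rng.standard_normal(1000)

# Split the data into four quantile-based intervals.
interval = pd.qcut(data, 4)
print(type(interval))
print(interval.categories)

# Assign a readable label to each interval instead of the raw ranges.
labeled = pd.qcut(data, 4, labels=["Q1", "Q2", "Q3", "Q4"])
print(labeled[:10])
```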

You can calculate some summary statistics using groupby. First, let’s convert the ranges to a Series.

Now, let’s find the minimum and maximum values within each interval.
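
The groupby step might look like the sketch below. It assumes quartile bins from pd.qcut and hypothetical labels Q1 to Q4.

```python
import numpy as np
import pandas as pd

# Seeded normal data and quartile labels (assumed setup).
rng = np.random.default_rng(seed=42)
data = rng.standard_normal(1000)
labeled = pd.Series(pd.qcut(data, 4, labels=["Q1", "Q2", "Q3", "Q4"]), name="quartile")

# Group the raw values by their quartile label and summarize each group.
# observed=False keeps every category in the result, even an empty one.
results = pd.Series(data).groupby(labeled, observed=False).agg(["count", "min", "max"])
print(results)
```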

The Performance of Categorical Types

When working with large datasets, converting repetitive values to categorical variables can improve performance. The categorical version of a DataFrame column usually takes up significantly less memory. For example, let’s create data with ten million elements.

Let’s assign labels to these values.

Let’s convert this data into categorical data.

Now, let’s take a look at the memory usage of categorical and non-categorical data.

As you can see, categorical data uses less memory than non-categorical data.
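
You can check this yourself with the memory_usage method. The sketch below uses one million elements instead of ten million so that it runs quickly, and the four labels are made up.

```python
import pandas as pd

# Smaller than the ten million elements mentioned above, to keep the example fast.
N = 1_000_000

# Hypothetical labels repeated to fill the Series.
labels = pd.Series(["A", "B", "C", "D"] * (N // 4))

# Categorical version of the same data.
categories = labels.astype("category")

# deep=True also counts the memory used by the string objects themselves.
plain_bytes = labels.memory_usage(deep=True)
cat_bytes = categories.memory_usage(deep=True)
print(plain_bytes, cat_bytes)
```

With only four categories, each row of the categorical version is stored as a one-byte integer code.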

The Categorical Methods

Pandas provides some special methods for categorical Series. To show these methods, let’s create a Series.

Let’s convert this data into categorical data.

The cat accessor gives us access to the categorical methods and attributes. For example, let’s use the codes attribute to see the codes of the values in the data.

Whenever you want to use a categorical method on a Series, you go through the cat accessor first and then call the method.
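
Here is a small sketch of the cat accessor. The Series values are made up, and s_ct is used as the name of the categorical Series.

```python
import pandas as pd

# Hypothetical Series with four distinct values, each repeated twice.
s = pd.Series(["a", "b", "c", "d"] * 2)

# Convert to the categorical dtype.
s_ct = s.astype("category")

# Integer code of each value, accessed through .cat
print(s_ct.cat.codes)

# The distinct categories, also accessed through .cat
print(s_ct.cat.categories)
```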

You can use the set_categories method to extend the set of categories beyond the values that appear in the data.
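
For example, assuming a made-up Series of letters, you can add a category "e" that never occurs in the data:

```python
import pandas as pd

# Hypothetical categorical Series.
s_ct = pd.Series(["a", "b", "c", "d"] * 2).astype("category")

# Add a category that does not appear in the data.
s_ext = s_ct.cat.set_categories(["a", "b", "c", "d", "e"])
print(s_ext.cat.categories)

# value_counts now also reports the unused category, with a count of zero.
counts = s_ext.value_counts()
print(counts)
```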

You can use the remove_unused_categories method to remove unused categories. To show this, let’s select the values a and b in the data.

Now, let’s remove the categories that are not used.
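
These two steps can be sketched as follows, using a made-up Series of letters:

```python
import pandas as pd

# Hypothetical categorical Series.
s_ct = pd.Series(["a", "b", "c", "d"] * 2).astype("category")

# Keep only the rows whose value is a or b.
subset = s_ct[s_ct.isin(["a", "b"])]

# The subset still remembers all four categories.
print(subset.cat.categories)

# Drop the categories that no longer appear in the data.
cleaned = subset.cat.remove_unused_categories()
print(cleaned.cat.categories)
```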

Creating a Dummy Variable

Many machine learning models require categorical data to be converted into dummy variables first. To show this, let me use the s_ct data again.

You can use the get_dummies function to convert categorical data into dummy variables.
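
A minimal sketch, assuming s_ct is a categorical Series of made-up letters:

```python
import pandas as pd

# Hypothetical categorical Series.
s_ct = pd.Series(["a", "b", "c", "d"] * 2).astype("category")

# One indicator column per category; each row has exactly one column switched on.
dummies = pd.get_dummies(s_ct)
print(dummies)
```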

That’s it. In this post, I covered how to work with categorical data in Pandas. I hope you enjoyed this post. Thanks for reading. You can find the notebook here.
