Groupby & Aggregate using Pandas

python pandas groupby aggregate © Copyright 2017, Akshay Sehgal | 3rd Oct 21


Table of contents:

  1. Introduction
  2. Syntax
    2.1 Adding more groups/levels
    2.2 Adding more variables/features
  3. Grouping
  4. Aggregation
    4.1 In-built aggregation methods
    4.2 Custom functions with pandas apply
    4.3 Multiple aggregations using agg method
    4.4 Custom functions with agg method
  5. Transform
  6. Advanced Usage
    6.1 Sequential/local grouping of a dataframe
    6.2 Re-indexing to a fixed date range for each group
  7. Other ways of grouping data
    7.1 Using collections' defaultdict
    7.2 Using numpy's split function
    7.3 Using itertools' groupby
  8. References

1. Introduction

"Groupby" is probably one of the most basic data pre-processing step that a Data Scientist should master as soon as possible. Interestingly enough, you find it in almost every scripting language that claims to work well with databases.

Most of us would have been introduced to the SQL GROUPBY statement which allows a user to summarize or aggregate a given dataset. Python brings the pandas groupby method to the table, which is highly pythonic in its syntax and equally versatile, if not more. But the utility of a groupby is much more than just aggregation. In this notebook, I will showcase a few examples, where you could really exploit this method for various other use-cases.

Before we can start writing code, let's explore the basics behind a groupby operation. The core concept behind any groupby operation is a three step process called Split-Apply-Combine.

Here is a diagram to make this more intuitive.

groupby

2. Syntax

The syntax for using a groupby method in Pandas comprises of 2 parts. First is a grouper object and the second is the aggregator. The general structure looks like the following -

dataset.groupby(['grouping column(s)'])['output column(s)'].aggregation()
|____________________________________| |________________________________|
                   |                                    |
        grouper object (split)            aggregation (apply & combine)

This will get more clear as we take an example from actual data. So, let's start by loading a dataset to work with. For this notebook I will use the Titanic dataset which can either be downloaded from Kaggle, or directly loaded using the visualization library called Seaborn.

Question: What is the total number of passengers from each class who survived?

You would usually do this in a single statement as the following:

Note: The reset_index() helps bring the grouping columns from index, back as a column in a dataframe.

Question: What was the average fare for passengers from each town?

2.1 Adding more groups/levels

We can pass a list of features in the groupby() to increase the levels for grouping the data as below.

Question: What was the average fare for male vs female passengers from each town?

2.2 Adding more variables/features

Similarly, we can select a list of variables for which you need to apply the aggreate function.

Question: What was the average fare and age for male vs female passengers from each town?

3. Grouping

Before we go further and try other, more complex scenarios, let's try to understand the data structures we are working with, so that we can be much more creative with our approaches and get a deeper understanding on how they work.

You can imagine the pipeline of the above code to be as -

Let's see what the grouper object looks like for a better understanding.

So, this shows that if we try to iterate over the grouper object, its nothing but a tuple with the key and a dataframe.

Let's see what each of those is.

The key for each of the tuples/groups is the value from the grouper column (in this case the embark_town) and the value is just the complete dataframe filtered for that value! If we try to print one of the dataframe from this grouper, you can see that all the rows in this slice of data contain Queenstown as the embark_town, as shown below.

Similarly, let's see how the grouper object looks like for multiple grouping features. The 'key' in this case is just a tuple with all the group combinations, which, after aggregation, gets set as the index of the final output.

4. Aggregation

There are multiple ways of aggregating your grouper object.

  1. First part of this section is to understand that you can perform aggregations on multiple columns, OR, perform multiple aggregations themselves on different columns, or a combination of both.
  2. Second, you can use apply() or agg() to write your own custom aggregators but Pandas makes it much easier by providing a ton of in-built aggregators such as sum() or mean() as we discussed in the above examples.

Let's try to go through a few scenarios and explore how we can use these aggregations.

4.1 In-built aggregation methods

Pandas provides a ton of aggregation methods to quickly get the statistics you are looking for. Below are a few of the common ones that are use and more details on these can be found on the official pandas documentation.

Function Description
mean( ) Compute mean of groups
median( ) Compute median of groups
mode( ) Compute mode of groups
sum( ) Compute sum of group values
prod( ) Compute product of group values
size( ) Compute group sizes
count( ) Compute count of group
std( ) Standard deviation of groups
var( ) Compute variance of groups
skew( ) Compute skewness/3rd moment of groups
kurt( ) Compute kurtosis/4th moment of groups
sem( ) Standard error of the mean of groups
mad( ) Mean absolute deviation for each group
describe( ) Generates descriptive statistics
first( ) Compute first of group values
last( ) Compute last of group values
nth( ) Take nth value, or a subset if n is a list
min( ) Compute min of group values
max( ) Compute max of group values

4.2 Custom functions with pandas apply

This is by far the most popular way of applying a custom function to a dataframe, or in this case, applying it on each of the dataframe slices for groups defined by the grouper. The behavior of the apply() method with groupby is similar to the standard one.

You can apply it to each row (or column) of a dataframe input (if you have more than one column for aggregation) or to a series (if you have single column for aggregation). Within the function, you can actually either work directly with individual series or just write your own lambda function. Here are a few ways using the apply function.

Question: Get the unique set of ages for all each age category (who column) from each town.

Question: Get the range (min - max) of ages for each age category (who column) from each town.

Question: Get the mean fare by age ratio for each age category (who column) from each town.

4.3 Multiple aggregations using agg method

Sooner or later, you would find it necessary to work with multiple aggregations over multiple columns at once. This is where agg() method comes in. Here is a quick example of how you can use multiple in-built functions over multiple columns at once.

The general way to do this is to create a dictionary with the requirements and pass it to the agg() function. There are a few ways to structure the dictionary -

##Single function per column
{
 'column1': 'function1', 
 'column2': 'function2'
}

##Multiple functions per column
{
 'column1': ['function1', 'function2'], 
 'column2': ['function3', 'function4']
}

Question: Get the mean of fare, AND median of age for each age category (who column) from each town

Question: Get the sum & mean of fare, AND median, min and max of age for each age category (who column) from each town

As of Pandas >= 0.25, another way to define the agg function is define the each column with ('column', 'function').

Let's demonstrate that with an example.

Question: Get the sum & mean of fare, AND min and max of age for each age category (who column) from each town, but rename columns

4.4 Custom functions with agg method

As you might think, just modifying the aggregate functions to include lambda functions is a way to create your own custom functions applied to specific columns. Here are a few examples.

5. Transform

Apart from just aggregating, you can use groupby to transform columns based on the grouper object. This requires using transform() function and returns the same number of rows as the original dataset, but the functions are applied based on the grouping defined. Let's consider the following point.

Question: Create a new column that returns the average fare for the age group (who column) the passenger belongs to.

Notice that the output series is of the length of the original titanic dataframe, but contains only 3 unique values [88.8, 69.8, 77.3], one for each of the ['woman', 'man', 'child']. This make the grouping object highly versatile in the way you would use it for data preprocessing.

6. Advanced Usage

Let's introduce a few advanced cases where you end up using groupby for data preprocessing.

6.1 Sequential/local grouping of a dataframe

The grouper object doesnt need to explicitly come from the dataframe. As long as the length of the grouper is the same as the number of rows in the dataframe, you can assign any grouper to groupby the rows by.

Question: Get sum of the value column of the given dataframe based on the sequentially occuring groups category i.e, in [1,1,2,2,1,1] the first group of 1's should be a separate group than second set of 1's.

We can solve this creating a custom grouper, by shifting the column value by 1 and comparing them with original. If not equal, it will swap the boolean value. Then we can take a cumsum over the boolean to get groups where the value changes consecutively. Here is the solution for a similar problem I solved on Stack Overflow.

6.2 Re-indexing to a fixed date range for each group

Question: A dataframe only contains rows for few dates for each id. The goal is to re-index the dataframe for a fixed date range, but for each of the id individually. Also, fill the missing data with 0 values.

Here we can create a custom reindex using pandas.date_range. Then, after setting the original date column as index, we can apply pandas.DataFrame.reindex along with groupby on the id column to reindex with the new date range for group, while filling empty values as 0.

7. Other ways of grouping data

Here I discuss 3 ways that are popularly used to group data depending on the data structures and libraries you are already working with.

Let's say we have a list of tuples with keys and values which we need to group.

7.1 Using collections' defaultdict

A useful way of grouping data is to use defaultdict. Defaultdict can store the grouping values as keys, and store the values as a list of values (or a custom function on them)

7.2 Using numpy's split function

Another way of splitting an array into a list of sub-arrays based on a grouping key is by using np.split along with the indexes for each group returned by np.unique. Only important thing is, the arrays needs to be sorted explicitly.

7.3 Using itertools' groupby

Itertools provides a groupby api which is actually a sequential/local grouping method.

It's powerful for getting groups from "AAABBBAACCC" as "AAA BBB AA CCC". But in order to get groups as "AAAAA BBB CCC", its necessary to first sort the data by the grouping key.

8. References