Introduction to Pandas GroupBy
Pandas is a powerful library in Python used for data manipulation and analysis. One of its most versatile features is the groupby function, which allows you to split your data into groups based on specific columns and then perform various operations on each group.
In this article, we will explore how to use Pandas’ groupby feature to get the sum of a specific column for each group. We’ll start with a simple example and gradually move on to more complex scenarios.
Basic GroupBy Example
Let’s begin with an example using a pandas DataFrame:
Fruit Date Name Number
Apples 10/6/2016 Bob 7
Apples 10/6/2016 Bob 8
Apples 10/6/2016 Mike 9
Apples 10/7/2016 Steve 10
Apples 10/7/2016 Bob 1
Oranges 10/7/2016 Bob 2
Oranges 10/6/2016 Tom 15
Oranges 10/6/2016 Mike 57
Oranges 10/6/2016 Bob 65
Oranges 10/7/2016 Tony 1
Grapes 10/7/2016 Bob 1
Grapes 10/7/2016 Tom 87
Grapes 10/7/2016 Bob 22
Grapes 10/7/2016 Bob 12
Grapes 10/7/2016 Tony 15
We want to aggregate this data by Name and then by Fruit to get a total number of Fruit per Name. For example, we expect the output to be:
Fruit Name Number
Apples Bob 16
Mike 9
Steve 10
Grapes Bob 35
Tom 87
Tony 15
Oranges Bob 67
Mike 57
Tom 15
Tony 1
Using GroupBy.sum
To achieve this, we can use Pandas’ groupby function with the sum aggregation function. Here’s how you can do it:
import pandas as pd
# Create a DataFrame from the example data
data = {
'Fruit': ['Apples', 'Apples', 'Apples', 'Apples', 'Apples', 'Oranges', 'Oranges', 'Oranges', 'Oranges', 'Grapes', 'Grapes', 'Grapes', 'Grapes', 'Grapes'],
'Date': ['10/6/2016', '10/6/2016', '10/6/2016', '10/7/2016', '10/7/2016', '10/7/2016', '10/6/2016', '10/6/2016', '10/6/2016', '10/7/2016', '10/7/2016', '10/7/2016', '10/7/2016', '10/7/2016'],
'Name': ['Bob', 'Bob', 'Mike', 'Steve', 'Bob', 'Bob', 'Tom', 'Mike', 'Bob', 'Bob', 'Tom', 'Bob', 'Tony', 'Bob'],
'Number': [7, 8, 9, 10, 1, 2, 15, 57, 65, 1, 87, 22, 12, 15]
}
df = pd.DataFrame(data)
# Group by Name and Fruit, then sum the Number column
result = df.groupby(['Fruit', 'Name']).sum()
print(result)
Output:
Number
Fruit Name
Apples Bob 16
Mike 9
Steve 10
Grapes Bob 35
Tom 87
Tony 15
Oranges Bob 67
Mike 57
Tom 15
Tony 1
As you can see, the groupby function has successfully grouped our data by Name and Fruit, and then calculated the sum of the Number column for each group.
Specifying the Column to Sum
Sometimes, you may want to specify which column to sum. For example, let’s say we only want to calculate the sum of the Number column for each group. In this case, we can use the following code:
# Group by Name and Fruit, then sum the Number column
result = df.groupby(['Name', 'Fruit'])['Number'].sum()
print(result)
Output:
Fruit Name
Apples Bob 16
Mike 9
Steve 10
Grapes Bob 35
Tom 87
Tony 15
Oranges Bob 67
Mike 57
Tom 15
Tony 1
As you can see, the output is similar to the previous example. However, instead of summing all columns in the group, we’ve only specified the Number column to sum.
Conclusion
In this article, we’ve explored how to use Pandas’ groupby feature to get the sum of a specific column for each group. We started with a basic example and then moved on to specifying which column to sum. With these techniques, you can easily aggregate your data by multiple columns and calculate sums or other aggregations as needed.
I hope this helps! Let me know if you have any questions or need further clarification.
Last modified on 2023-07-20