It’s Time For You To Understand Pandas Group By Function Towards Data Science – Medium

With its 1.0.0 release on January 29, 2020, pandas reached maturity as a data manipulation library. Pandas provides a framework that is also suitable for OLAP operations, and it is the go-to tool for business intelligence in Python.

In this guide, I would like to explain, through different examples and applications, the groupby function provided by Pandas, which is the equivalent of the GROUP BY clause of the same name in SQL.

Grouping data is one of the most basic operations for generating reports and finding insights in structured data. It helps answer business questions such as: How much revenue do we have for each country? Which are the most sold items by country? How are sales going over time? How are our sales going compared to the previous year for country x?

Let’s Get Started:

The syntax of groupby can be decomposed into four parts:

  • We call the function on a dataframe.
  • We specify the column or list of columns by which we want to group our dataframe, plus any optional arguments (listed in the official Pandas documentation).
  • We define an aggregation function, or a group of aggregation functions, to apply to each column.
  • Optionally, we can chain any function that is applicable to a dataframe, as the combination of the previous steps returns a dataframe.
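The four parts can be sketched on a tiny dataframe. The data below is made up for illustration; only the column names mirror the ones used in this guide.

```python
import pandas as pd

# Hypothetical toy data, not the dataset used in this guide.
df = pd.DataFrame({
    "Country": ["France", "Germany", "France", "Spain", "Germany"],
    "Quantity": [10, 7, 5, 8, 3],
})

result = (
    df                     # 1. start from a dataframe
    .groupby("Country")    # 2. choose the column(s) to group by
    .sum()                 # 3. apply an aggregation function
    .reset_index()         # 4. optionally chain any dataframe method
)
print(result)
```

Each step returns an object that the next step can be chained onto, which is why the whole expression reads like a single SQL statement.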

The Analogy with SQL

Coming from SQL, you can see how the logic behind the usage of groupby is very similar to SQL.

The notebook containing all the code and examples from this guide can be found here.

The Data:

Group Sales by Country:

df[['Country', 'Quantity']].groupby('Country').sum().head()

By using double square brackets we can select columns by name.

However, by default groupby sorts the result by the group keys, not by the aggregated values, so we can chain a function that sorts our dataframe by quantity:

df[['Country', 'Quantity']].groupby('Country').sum().sort_values(by='Quantity', ascending=False).head()
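To make the difference between key order and value order concrete, here is a minimal sketch on made-up numbers (the dataframe is an assumption for illustration only):

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "Country": ["Spain", "France", "Germany", "France", "Spain"],
    "Quantity": [8, 10, 7, 5, 20],
})

# groupby orders the result by the group keys (alphabetical here).
by_key = df[["Country", "Quantity"]].groupby("Country").sum()

# sort_values reorders the result by the aggregated quantity instead.
by_value = by_key.sort_values(by="Quantity", ascending=False)

print(by_key)
print(by_value)
```

The first result lists the countries alphabetically, while the second puts the country with the largest total on top.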

Skipping this second operation leads to a dataframe grouped by the exact timestamp at which each transaction was made, which is not meaningful:

Each transaction is grouped by date at minute level, which leads to data that is not regularly organized. To change the granularity and group at date level, the .dt accessor from pandas comes in handy:

df[['InvoiceDate', 'Quantity']].groupby(df['InvoiceDate'].dt.date).sum(numeric_only=True).sort_values(by='Quantity', ascending=False).head()
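The effect of the granularity can be seen on a few made-up timestamps. This is only a sketch: the data is hypothetical, and selecting the Quantity column after grouping is one of several equivalent ways to write it.

```python
import pandas as pd

# Hypothetical toy data: two sales on the same day, one on the next day.
df = pd.DataFrame({
    "InvoiceDate": pd.to_datetime([
        "2011-01-04 10:00", "2011-01-04 10:01", "2011-01-05 09:30",
    ]),
    "Quantity": [2, 3, 4],
})

# Grouping on the raw timestamps yields one group per distinct minute.
by_timestamp = df.groupby("InvoiceDate")["Quantity"].sum()

# Grouping on the calendar date collapses the two sales from the same day.
by_day = df.groupby(df["InvoiceDate"].dt.date)["Quantity"].sum()

print(len(by_timestamp), len(by_day))
```

Passing a derived Series such as df["InvoiceDate"].dt.date to groupby works because pandas aligns it with the dataframe by index.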

Similarly, in business we sometimes want to group our data by week; in this situation too, the .dt accessor will help us. You can think of it as the DATEPART function in SQL:

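df[['InvoiceDate', 'Quantity']].groupby(df['InvoiceDate'].dt.isocalendar().week).sum(numeric_only=True).plot()

The .plot() function is another example of a function we can then apply to our dataframe. (In recent pandas versions, .dt.isocalendar().week replaces the deprecated .dt.week.)

We can also group by more than one key, for example by month and country:

df[['InvoiceDate', 'Quantity', 'Country']].groupby([df['InvoiceDate'].dt.month, df['Country']]).sum(numeric_only=True).head()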

What can we see here? Our dataframe now has what is called a MultiIndex, which describes the two different levels of aggregation. A MultiIndex is useful when dealing with data that has multiple levels of aggregation, such as by date, by week number, by year and so on.
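A small sketch of what such an index looks like and how to select from it. The data is made up, and the month/country grouping is just one possible pair of levels.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "InvoiceDate": pd.to_datetime(
        ["2011-01-04", "2011-01-20", "2011-02-02", "2011-02-10"]
    ),
    "Country": ["France", "Germany", "France", "France"],
    "Quantity": [2, 3, 4, 1],
})

# Two grouping keys produce a result indexed by (month, country).
monthly = df.groupby([df["InvoiceDate"].dt.month, df["Country"]])["Quantity"].sum()

print(monthly.index.nlevels)      # number of index levels
print(monthly.loc[(2, "France")]) # one cell: February, France
print(monthly.loc[1])             # all countries for January
```

Selecting with a tuple addresses both levels at once, while selecting with a single value keeps the remaining level as the index.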

We can rename the levels of our MultiIndex to make them more consistent with our data by using the set_names method:

year_week = df[['InvoiceDate', 'Quantity', 'Country']]
year_week = year_week.groupby([df['InvoiceDate'].dt.year, df['InvoiceDate'].dt.isocalendar().week, df['Country']]).sum(numeric_only=True)
year_week.index = year_week.index.set_names(['Year', 'Week', 'Country'])

If, for some reason, we want to get rid of the MultiIndex, we can call reset_index() after the aggregation function; this returns a dataframe similar to the result of an SQL query.
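As a rough illustration of flattening the index, assuming made-up data and a year/country grouping chosen only to keep the example short:

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "InvoiceDate": pd.to_datetime(
        ["2010-12-30", "2011-01-04", "2011-01-05", "2011-01-12"]
    ),
    "Country": ["France", "France", "Germany", "France"],
    "Quantity": [1, 2, 3, 4],
})

grouped = df.groupby([df["InvoiceDate"].dt.year, df["Country"]])["Quantity"].sum()
grouped.index = grouped.index.set_names(["Year", "Country"])

# reset_index turns the index levels back into ordinary columns.
flat = grouped.reset_index()
print(flat)
```

The level names set with set_names become the column names of the flat result, so it pays to rename the levels before resetting the index.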

In addition, grouping by year and week number is very common in real life; it gives the opportunity to compare, for example, seasonality between different years:

year_week = df[['InvoiceDate', 'Quantity', 'Country']]
year_week = year_week.groupby([df['InvoiceDate'].dt.year, df['InvoiceDate'].dt.isocalendar().week]).sum(numeric_only=True)
year_week.index = year_week.index.set_names(['Year', 'Week'])
year_week.unstack(level=0).plot(kind='bar', subplots=True, figsize=(15, 4))

Since we only have a few weeks from 2011, the first plot is mostly empty.
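To see what unstack does before plotting, here is a minimal sketch on made-up dates. It assumes a pandas version that provides .dt.isocalendar() and uses the ISO year together with the ISO week so that the two keys stay consistent around New Year.

```python
import pandas as pd

# Hypothetical toy data: one sale in 2010 and three in 2011.
df = pd.DataFrame({
    "InvoiceDate": pd.to_datetime(
        ["2010-12-06", "2011-01-03", "2011-01-05", "2011-12-05"]
    ),
    "Quantity": [5, 2, 3, 7],
})

iso = df["InvoiceDate"].dt.isocalendar()
# Cast to plain integers and rename so the index levels are easy to work with.
year = iso["year"].astype("int64").rename("Year")
week = iso["week"].astype("int64").rename("Week")

year_week = df.groupby([year, week])["Quantity"].sum()

# Move the Year level from the rows to the columns: one column per year.
wide = year_week.unstack(level=0, fill_value=0)
print(wide)
```

With one column per year and one row per week, a bar plot with subplots=True draws one panel per year on a shared week axis, which is what makes the years comparable.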

Similarly, we can build another report by grouping by country and description and sorting in descending order.

In other words, let’s answer the question: which are the most sold items for each country?

items_sold = df[['Description', 'Country', 'Quantity']]
items_sold.groupby(['Country', 'Description']).sum().sort_values(by='Quantity', ascending=False)

Analogously, this is how we would write this query in SQL:

SELECT Country, Description, SUM(Quantity) AS Quantity FROM df GROUP BY Country, Description ORDER BY SUM(Quantity) DESC
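One way to check that the two formulations agree is to run both on the same made-up rows, using an in-memory SQLite database from the Python standard library. The table name sales and the data are assumptions for illustration.

```python
import sqlite3

import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "Country": ["France", "France", "Germany", "France"],
    "Description": ["MUG", "MUG", "MUG", "LAMP"],
    "Quantity": [4, 6, 3, 1],
})

# pandas version: group, aggregate, sort.
from_pandas = (
    df.groupby(["Country", "Description"], as_index=False)["Quantity"].sum()
    .sort_values(by="Quantity", ascending=False)
    .reset_index(drop=True)
)

# SQL version: load the same rows into SQLite and run the GROUP BY query.
conn = sqlite3.connect(":memory:")
df.to_sql("sales", conn, index=False)
from_sql = pd.read_sql_query(
    """
    SELECT Country, Description, SUM(Quantity) AS Quantity
    FROM sales
    GROUP BY Country, Description
    ORDER BY SUM(Quantity) DESC
    """,
    conn,
)
conn.close()

print(from_pandas)
print(from_sql)
```

Passing as_index=False keeps the grouping keys as regular columns, which makes the pandas result line up column by column with the SQL result.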

Grouping with Multiple Aggregation Functions

Let’s group by country and apply sum for quantity and average for the unit price:

country_agg = df[['Country', 'Quantity', 'UnitPrice']]
country_agg = country_agg.groupby(['Country']).agg({'Quantity': 'sum', 'UnitPrice': 'mean'})
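A quick sketch of what the dictionary form returns, on made-up numbers:

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "Country": ["France", "France", "Germany"],
    "Quantity": [4, 6, 3],
    "UnitPrice": [2.0, 4.0, 5.0],
})

# Each key names a column, each value the function applied to that column.
country_agg = df.groupby("Country").agg({"Quantity": "sum", "UnitPrice": "mean"})
print(country_agg)
```

The result keeps one column per dictionary key, so Quantity holds the totals and UnitPrice holds the averages.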

Another important usage of .agg() is when, for example, we want to apply different functions to the same column. Let’s say we want to group by country, get the average unit price and sum the quantity, but at the same time get the minimum and maximum quantity sold:

country_agg = df[['Country', 'Quantity', 'UnitPrice']]
country_agg = country_agg.groupby(['Country']).agg({'Quantity': ['min', 'max', 'sum'], 'UnitPrice': 'mean'})
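Passing a list of functions for a column gives the result two-level column labels. The sketch below, on made-up data, shows those labels and, as an alternative that is not used in this guide, pandas named aggregation, which produces flat column names (the output names are arbitrary choices).

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "Country": ["France", "France", "Germany"],
    "Quantity": [4, 6, 3],
    "UnitPrice": [2.0, 4.0, 5.0],
})

# A list of functions yields columns labelled (column, function).
stats = df.groupby("Country").agg(
    {"Quantity": ["min", "max", "sum"], "UnitPrice": "mean"}
)
print(stats.columns.tolist())

# Named aggregation: output_name=(column, function) gives flat column names.
named = df.groupby("Country").agg(
    min_quantity=("Quantity", "min"),
    max_quantity=("Quantity", "max"),
    total_quantity=("Quantity", "sum"),
    mean_unit_price=("UnitPrice", "mean"),
)
print(named)
```

The flat names are often easier to work with when the result is exported or joined with other tables.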

