Groupby Method Basics
Creating a Groupby Object
To use the groupby method, you first create a groupby object and then apply an aggregation method:
df.groupby('model_year')Common Aggregation Methods
Once the object is grouped, here are common methods to aggregate the data:
mean(): Compute the mean of groupssum(): Compute the sum of group valuessize(): Compute group sizescount(): Count non-NA values in groupsstd(): Compute the standard deviation of groupsvar(): Compute the variance of groupssem(): Compute the standard error of the meandescribe(): Generate descriptive statisticsfirst(): Return the first group valuelast(): Return the last group valuenth(n): Return the nth value (or subset if n is a list)min(): Compute the minimum of group valuesmax(): Compute the maximum of group values
Full list of aggregation methods in the pandas documentation
Aggregating Data
Example of applying aggregation methods:
df.groupby('model_year').mean(numeric_only=True)Example Output
model_yearbecomes the index:
avg_year = df.groupby('model_year').mean(numeric_only=True)
avg_year.index
avg_year.columns
avg_year['mpg']
df.groupby('model_year').mean()['mpg']To view descriptive statistics:
df.groupby('model_year').describe()
df.groupby('model_year').describe().transpose()Grouping by Multiple Columns
Example: Average mpg per model_year and cylinders
df.groupby(['model_year', 'cylinders']).mean()Understanding MultiIndex
After grouping by multiple columns, pandas uses a MultiIndex:
year_cyl = df.groupby(['model_year', 'cylinders']).mean()
year_cyl.index
year_cyl.index.levels
year_cyl.index.namesIndexing and Filtering with MultiIndex
Selecting Data
-
Based on outer index:
year_cyl.loc[70] year_cyl.loc[[70, 72]] -
Based on specific row:
year_cyl.loc[(70, 8)]
Cross-Section (.xs())
Use .xs() to filter specific levels of a MultiIndex:
year_cyl.xs(key=70, axis=0, level='model_year')
year_cyl.xs(key=4, axis=0, level='cylinders')Basic Concept of .xs()
.xs()stands for “cross-section.”- It is used to extract rows or columns based on the value of a specific level in a MultiIndex DataFrame or Series.
When to Use .xs() vs Other Methods
.xs()is ideal for slicing along a particular level in a MultiIndex.- For simpler filtering, you might use
.loc[]or.query()instead, but they may not handle MultiIndex as cleanly.
Comparison with .loc[]
.loc[] is more general but less concise for MultiIndex:
df.loc[('A', slice(None)), :] vs. df.xs('A', level='Group')
Common Pitfalls
- Not Using
level: If you don’t specify thelevel, pandas may not know which level of the index to slice on. - Dropping Levels: By default,
.xs()drops the sliced level. Usedrop_level=Falseif you want to retain it.
Why Is .xs() Powerful?
- It simplifies handling hierarchical data without requiring manual restructuring.
- Useful for slicing specific subsets of large datasets.
Syntax
DataFrame.xs(key, axis=0, level=None, drop_level=True)
Parameters:
key:- The label of the level you want to extract.
axis:- Whether you are slicing rows (
axis=0, default) or columns (axis=1).
- Whether you are slicing rows (
level:- The name or index of the level in the MultiIndex you want to select.
drop_level:- Whether to drop the level you are slicing from the resulting DataFrame/Series (
Trueby default).
- Whether to drop the level you are slicing from the resulting DataFrame/Series (
Filtering Before Grouping
Keep in mind, its usually much easier to filter out values before running a groupby() call, so you should attempt to filter out any values/categories you don’t want to use. For example, its much easier to remove 4 cylinder cars before the groupby() call, very difficult to this sort of thing after a group by.
df[df['cylinders'].isin([6,8])].groupby(['model_year','cylinders']).mean(numeric_only=True)Advanced MultiIndex Operations
Swapping Levels
Swap levels of a MultiIndex:
year_cyl.swaplevel().head()The swaplevel() method in pandas is a powerful tool for rearranging the levels of a MultiIndex in DataFrames or Series. This function allows you to swap two specified levels, which can be particularly useful when you need to reorganize your data for analysis or visualization.
Understanding swaplevel()
- Purpose: To interchange two levels in a MultiIndex, either in the row index or the column index.
- Common Use Cases:
- Reordering hierarchical levels to facilitate data selection or aggregation.
- Preparing data for operations that require a specific level order.
Sorting MultiIndex
Sort the MultiIndex based on a specific level:
year_cyl.sort_index(level='model_year', ascending=False)
year_cyl.sort_index(level='cylinders', ascending=False)Advanced Aggregation with agg()
The agg() method in pandas is a versatile function that allows you to perform one or more aggregation operations across a DataFrame or Series. It’s particularly useful for summarizing data, especially after grouping operations.
Understanding agg()
- Purpose: To apply one or more aggregation functions over a specified axis (rows or columns) of a DataFrame or Series.
- Common Use Cases:
- Calculating summary statistics like sum, mean, min, max, etc.
- Applying multiple aggregation functions simultaneously.
- Aggregating data after grouping operations.
Using agg() on a DataFrame
You can apply multiple aggregation methods to columns:
df.agg(['median', 'mean'])
df.agg(['sum', 'mean'])[['mpg', 'weight']]Custom Aggregation by Column
Pass a dictionary to specify aggregation methods for specific columns:
df.agg({'mpg': ['median', 'mean'], 'weight': ['mean', 'std']})Combining agg() with Groupby
df.groupby('model_year').agg({
'mpg': ['median', 'mean'],
'weight': ['mean', 'std']
})