🔥 .apply() Method
✅ What it does:
- Used to apply a function to a DataFrame column.
- Works with both custom functions and lambda expressions.
- Can process multiple columns together.
📌 Examples:
Apply with a function
def last_four(num):
return str(num)[-4:] #str: to transform CC Number from int to str,then can be sliced.
df['last_four'] = df['CC Number'].apply(last_four)df['total_bill'].mean()
def yelp(price):
if price < 10:
return "$"
elif 10 <= price < 30:
return "$$"
else:
return "$$$"
df['Expensive'] = df['total_bill'].apply(yelp)Like I can create a customize function and apply it into dataframe.
Apply with Lambda
df['total_bill'].apply(lambda bill: bill * 0.18)Apply with Multiple Columns
def quality(total_bill, tip):
return "Generous" if tip / total_bill > 0.25 else "Other"
df['Tip Quality'] = df[['total_bill', 'tip']].apply(lambda df: quality(df['total_bill'], df['tip']), axis=1)✅ Breaking it Down
-
df[['total_bill', 'tip']]- Selects two columns:
total_billandtipfrom the DataFrame. - This creates a subset of the DataFrame with only these two columns.
- Selects two columns:
-
.apply(lambda df: quality(df['total_bill'], df['tip']), axis=1)lambda df: quality(df['total_bill'], df['tip'])- Defines an anonymous function (
lambdafunction) that takes each row (df). - Extracts
total_billandtipfrom the row. - Passes them into the
quality()function.
- Defines an anonymous function (
axis=1- Ensures the function is applied row-wise (each row is treated as
df). - This means the function is applied once per row, using that row’s
total_billandtip.
- Ensures the function is applied row-wise (each row is treated as
-
df['Tip Quality'] = ...- Saves the result into a new column named
"Tip Quality".
- Saves the result into a new column named
-
.apply()should be used row-wise for multiple columns- The
apply()function by default applies the function to each column, not rows. - Since
quality()needs bothtotal_billandtipfrom the same row, you must use.apply()alongaxis=1.
- The
Using np.vectorize() for faster execution
import numpy as np
df['Tip Quality'] = np.vectorize(quality)(df['total_bill'], df['tip'])✅ Tip: np.vectorize() is significantly faster than .apply() when applying functions to multiple columns.
.apply() runs row by row, making it slower for large datasets
np.vectorize() applies the function directly to the entire column, making it much faster!
📊 .describe() – Statistical Summary
df.describe() # General statistics
df.describe().transpose() # Transposed for readability🚀 Use case: Quickly summarize numerical columns.
🔀 .sort_values() – Sorting Data
df.sort_values('tip') # Sort by a single column
df.sort_values(['tip', 'size']) # Sort by multiple columns📌 Tip: Use sort_values() to reorder your DataFrame.
📈 .corr() – Correlation Matrix
df.corr()
df[['total_bill', 'tip']].corr()🔍 Insight: Shows relationships between numerical columns.
What is Correlation?
- Correlation measures the strength and direction of the relationship between two numerical variables.
- The value of correlation ranges from -1 to 1:
- +1 → Perfect positive correlation (as one increases, the other increases).
- 0 → No correlation (variables are independent).
- -1 → Perfect negative correlation (as one increases, the other decreases).
🏆 .idxmin() and .idxmax() – Find Min/Max Index
df['total_bill'].idxmax() # Index of the highest bill
df['total_bill'].idxmin() # Index of the lowest bill
df.iloc[df['total_bill'].idxmax()] # Retrieve row with max value🔍 Insight: Use these methods to find the row index of extreme values.
🔢 .value_counts() – Count Unique Values
df['sex'].value_counts()📌 Use case: Count occurrences of categorical values.
🔍 .unique() and .nunique() – Find Unique Values
df['size'].unique() # List of unique values
df['size'].nunique() # Count of unique values📌 Use case: Check how many unique categories exist in a column.
🔄 .replace() – Replace Values
df['Tip Quality'].replace(to_replace='Other', value='Ok', inplace=True)🚀 Use case: Quickly replace specific values in a DataFrame. Just a few items
🎭 .map() – Value Mapping
my_map = {'Dinner': 'D', 'Lunch': 'L'}
df['time'].map(my_map)📌 Use case: Convert categorical values into abbreviated forms. For many items
🗂 .duplicated() & .drop_duplicates() – Handle Duplicates
df.duplicated() # Returns True for duplicate rows
df.drop_duplicates(inplace=True) # Remove duplicate rows📌 Use case: Clean dataset by removing duplicate entries. It returns a Boolean Series, where:
Falsemeans the row is not a duplicate (i.e., it appears for the first time).Truemeans the row is a duplicate (i.e., it has appeared before in the DataFrame).
📍 .between() – Range Filtering
df[df['total_bill'].between(10, 20, inclusive=True)]📌 Use case: Filter rows within a numerical range.
🎲 .sample() – Random Sampling
df.sample(5) # Get 5 random rows
df.sample(frac=0.1) # Get 10% of the dataset📌 Use case: Useful for testing or creating smaller datasets.
🔝 .nlargest() & .nsmallest() – Get Top/Bottom N Values
df.nlargest(10, 'tip') # Top 10 highest tips
df.nsmallest(5, 'total_bill') # 5 smallest total bills📌 Use case: Identify extreme values in your dataset.
🎯 Final Notes
- Pandas has many built-in methods; this is just a small selection.
- For a deep dive, check the Pandas Documentation.
- Experiment with these methods to get comfortable using them in real-world data analysis! 🚀