🧪 Plot 1: Scatter Plot – Age vs Employment Days

Goal: Show relationship between age and employment duration for people who are employed.

✅ Transform both DAYS_BIRTH and DAYS_EMPLOYED to positive values
✅ Filter out currently unemployed individuals

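import matplotlib.pyplot as plt
import seaborn as sns

# df is assumed to be the already-loaded applications DataFrame.
# .copy() makes df1 an independent frame, so the assignments below
# do not trigger SettingWithCopyWarning.
df1 = df[df['DAYS_EMPLOYED'] < 0].copy()
df1['DAYS_EMPLOYED'] = df1['DAYS_EMPLOYED'] * -1
df1['DAYS_BIRTH'] = df1['DAYS_BIRTH'] * -1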
 
plt.figure(figsize=(12,8))
sns.scatterplot(
    x='DAYS_BIRTH', 
    y='DAYS_EMPLOYED', 
    data=df1, 
    s=5, 
    linewidth=0, 
    alpha=0.1
)
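
To see what the filter and sign flip do, here is a minimal sketch on a made-up frame. The column names come from the code above; the values, including the positive DAYS_EMPLOYED that stands in for a non-employed applicant, are invented for illustration.

```python
import pandas as pd

# Toy data (hypothetical values): negative numbers count days backwards,
# and the positive DAYS_EMPLOYED marks a row the filter should remove.
toy = pd.DataFrame({
    "DAYS_BIRTH": [-12000, -15000, -20000],
    "DAYS_EMPLOYED": [-800, 1000, -3000],
})

# Keep employed rows only; .copy() makes the result an independent frame
employed = toy[toy["DAYS_EMPLOYED"] < 0].copy()

# Flip both columns to positive day counts
employed["DAYS_BIRTH"] = employed["DAYS_BIRTH"] * -1
employed["DAYS_EMPLOYED"] = employed["DAYS_EMPLOYED"] * -1

print(employed)
```

Because of the .copy(), the original toy frame keeps its negative values.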

📈 Plot 2: Distribution of Age (Histogram)

Goal: Show distribution of applicant age in years.

df1['Age in Years'] = df1['DAYS_BIRTH'] / 365
 
plt.figure(figsize=(12,8))
sns.histplot(data=df1, x='Age in Years', bins=50, color='pink')
plt.ylim(0, 14000)
plt.show()

📦 Plot 3: Boxplot – Family Status vs Income (Bottom 50% only)

Goal: Show income distribution by family status for the bottom half of income earners.

Option 1: Using nsmallest (based on values)

bottom_half_df = df.nsmallest(len(df) // 2, columns='AMT_INCOME_TOTAL')
 
sns.boxplot(
    x='NAME_FAMILY_STATUS', 
    y='AMT_INCOME_TOTAL', 
    data=bottom_half_df, 
    hue='FLAG_OWN_REALTY'
)
plt.legend(bbox_to_anchor=(1.2,1))

Option 2: Using sort_values().tail() (based on position)

df_sorted = df.sort_values(by='AMT_INCOME_TOTAL', ascending=False)
bottom_half = df_sorted.tail(len(df) // 2)
 
plt.figure(figsize=(12,6))
sns.boxplot(
    x='NAME_FAMILY_STATUS', 
    y='AMT_INCOME_TOTAL', 
    data=bottom_half, 
    hue='FLAG_OWN_REALTY'
)
plt.legend(bbox_to_anchor=(1.1,1))

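🧠 Note: Both methods return the same number of rows with the same income values, but when several rows tie at the cutoff income they may keep different rows. They also return the rows in opposite order, which can change the order of the categories on the plot.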


🔥 Plot 4: Heatmap – Feature Correlation

Goal: Show correlation between numeric features in the dataset.

# Drop FLAG_MOBIL since it has no variance
df_corr = df.drop('FLAG_MOBIL', axis=1)
 
sns.heatmap(df_corr.corr(numeric_only=True), cmap='viridis')
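
The reason for dropping a zero-variance column can be checked on a small made-up frame: a constant column has no defined correlation with anything, so it shows up as NaN in the matrix. The values below are invented; only the column names are taken from the dataset.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "FLAG_MOBIL": [1, 1, 1, 1],  # constant column: zero variance
    "AMT_INCOME_TOTAL": [100.0, 200.0, 150.0, 300.0],
    "DAYS_BIRTH": [-12000, -15000, -11000, -20000],
})

corr = toy.corr(numeric_only=True)
print(corr)  # correlations involving FLAG_MOBIL come out as NaN

# Find constant columns generically instead of hard-coding the name
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]
print(constant_cols)
```

The same nunique() check could be run on the full DataFrame to confirm that FLAG_MOBIL is the only column worth dropping.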

For Plot 3 I originally used Option 2 and got different results. Here is the explanation:

🧪 1. Your first method: nsmallest(...)

bottom_half_df = df.nsmallest(int(len(df)/2), columns='AMT_INCOME_TOTAL')
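  • This grabs the len(df) // 2 rows with the smallest values in 'AMT_INCOME_TOTAL'.

  • It returns exactly that many rows. If several rows tie at the cutoff value, the default keep='first' keeps the tied rows that appear earliest in the original DataFrame; keep='all' would return every tied row, even if that means more than n rows.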


🧪 2. Your second method: sort_values(...).tail(...)

df_sorted = df.sort_values(by='AMT_INCOME_TOTAL', ascending=False)
half_n = len(df_sorted) // 2
bottom_half = df_sorted.tail(half_n)
  • Here, you’re sorting from largest to smallest, and then taking the last half.

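  • This also gives you rows from the smallest 50%, but which tied rows end up in the tail depends on where the sort places them. sort_values() defaults to kind='quicksort', which is not a stable sort, so the relative order of rows with equal income is not guaranteed. If there are a lot of duplicate values near the middle, the tail may include or exclude different rows than nsmallest.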
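
A small sketch of the tie effect, on invented data where three rows share the income at the cutoff. The index labels and values are hypothetical, and kind="stable" is an assumption added here only so the toy run is reproducible; the original code uses the default sort.

```python
import pandas as pd

# Toy data (hypothetical): rows b, c and d tie at income 200, right at the cutoff
toy = pd.DataFrame(
    {"AMT_INCOME_TOTAL": [100, 200, 200, 200, 300, 400]},
    index=list("abcdef"),
)
n = len(toy) // 2  # 3 rows

via_nsmallest = toy.nsmallest(n, columns="AMT_INCOME_TOTAL")

# kind="stable" is an assumption made for reproducibility of this toy example
via_tail = toy.sort_values(
    by="AMT_INCOME_TOTAL", ascending=False, kind="stable"
).tail(n)

# Which row labels each method kept
print(sorted(via_nsmallest.index), sorted(via_tail.index))

# The income values are the same either way, even if the rows are not
same_values = sorted(via_nsmallest["AMT_INCOME_TOTAL"]) == sorted(via_tail["AMT_INCOME_TOTAL"])
print(same_values)
```

Any rows that differ between the two results can only come from the tied group, which is why the boxplots change only slightly.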


🔍 So what’s the real difference?

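The selected rows differ when:

  • Several rows share the income value at the cutoff. Each method then keeps a different subset of those tied rows.

The plots can also differ even when the selected rows are identical:

  • nsmallest() returns rows in ascending order of income, while sort_values(ascending=False).tail() returns them in descending order. seaborn typically orders string categories by first appearance unless order and hue_order are passed, so the boxes and legend entries may be arranged differently.

The row count is not a source of difference: int(len(df)/2) and len(df) // 2 give the same number, even for an odd number of rows.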

Even though they feel equivalent, nsmallest() and sort + tail() are subtly different in behavior.
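
The row-order effect can be checked without plotting. In this sketch the status labels and incomes are made up; there are no ties, so both methods select the same two rows and only their order differs.

```python
import pandas as pd

# Toy data (hypothetical labels and values, no tied incomes)
toy = pd.DataFrame({
    "NAME_FAMILY_STATUS": ["Married", "Single", "Married", "Single"],
    "AMT_INCOME_TOTAL": [300, 100, 200, 400],
})
n = len(toy) // 2

via_nsmallest = toy.nsmallest(n, columns="AMT_INCOME_TOTAL")
via_tail = toy.sort_values(by="AMT_INCOME_TOTAL", ascending=False).tail(n)

# Same rows, opposite row order, so each category first appears at a different point
order_a = list(via_nsmallest["NAME_FAMILY_STATUS"].unique())
order_b = list(via_tail["NAME_FAMILY_STATUS"].unique())
print(order_a, order_b)

# A fixed category order that both plots could share
fixed_order = sorted(toy["NAME_FAMILY_STATUS"].unique())
print(fixed_order)
```

Passing order=fixed_order (and a matching hue_order) to sns.boxplot in both options should make the two plots directly comparable.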


✅ Recommendation:

If you want a reproducible way to get the bottom 50% based on values (exactly len(df) // 2 rows, with ties at the cutoff resolved by original row order), use:

df.nsmallest(len(df) // 2, columns='AMT_INCOME_TOTAL')
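
If keeping every row tied at the cutoff matters more than getting exactly half the rows, two alternatives are sketched below on invented data: nsmallest with keep='all', and a plain threshold filter at the median. Neither is what the original code does; they are shown only for comparison.

```python
import pandas as pd

# Toy data (hypothetical): rows b, c and d tie at the cutoff income of 200
toy = pd.DataFrame(
    {"AMT_INCOME_TOTAL": [100, 200, 200, 200, 300, 400]},
    index=list("abcdef"),
)
n = len(toy) // 2  # 3

# keep="all" returns every row tied at the cutoff, so it can exceed n rows
with_ties = toy.nsmallest(n, columns="AMT_INCOME_TOTAL", keep="all")

# Alternative: keep every row at or below the median income
cutoff = toy["AMT_INCOME_TOTAL"].median()
at_or_below = toy[toy["AMT_INCOME_TOTAL"] <= cutoff]

print(len(with_ties), len(at_or_below))
```

Both variants trade an exact 50% split for a selection that no longer depends on row order.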

If you’re ever unsure, you can compare both sets:

set1 = set(bottom_half_df.index)
set2 = set(bottom_half.index)
print("Difference in index sets:", set1.symmetric_difference(set2))

That’ll show you the exact rows that differ.
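
To confirm that ties explain the difference, the comparison can be extended to look up the income of each differing row. This is a sketch on invented data that assumes a unique index; kind="stable" is again an assumption for reproducibility, and the variable names mirror the ones used above.

```python
import pandas as pd

# Toy data (hypothetical): rows b, c and d tie at the cutoff income of 200
toy = pd.DataFrame(
    {"AMT_INCOME_TOTAL": [100, 200, 200, 200, 300, 400]},
    index=list("abcdef"),
)
n = len(toy) // 2

bottom_half_df = toy.nsmallest(n, columns="AMT_INCOME_TOTAL")
bottom_half = toy.sort_values(
    by="AMT_INCOME_TOTAL", ascending=False, kind="stable"
).tail(n)

# Index labels that appear in only one of the two selections
diff = set(bottom_half_df.index).symmetric_difference(bottom_half.index)

# Income of every differing row
diff_incomes = toy.loc[sorted(diff), "AMT_INCOME_TOTAL"]
print(diff_incomes)

# If ties are the cause, the differing rows share a single income value
print(diff_incomes.nunique() <= 1)
```

On the real data, replacing toy with df gives the same check.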