Feature Engineering Tricks I Use as an ML Engineer


People often call feature engineering the black art of machine learning. Even if you use the most advanced algorithm, like a deep neural network with millions of parameters, it won’t work well if you give it raw data. As an ML engineer, I actually spend about 70% of my time on feature engineering instead of tuning models. This work is all about turning your domain knowledge into something the model can use. In this article, I’ll share some feature engineering tricks that I use in my own work.

Feature Engineering Tricks

In this guide, I’ll walk you through 4 practical feature engineering tricks I use constantly. We’ll use a real-world scenario based on Loan Recovery Analysis. You can download the dataset I am using from here.

First, let’s load our data:

import pandas as pd  
import numpy as np  
  
# Load the dataset  
df = pd.read_csv('loan recovery.csv')  
  
# Let's verify what we have  
print(df[['Age', 'Monthly_Income', 'Loan_Amount', 'Loan_Type']].head())
   Age  Monthly_Income  Loan_Amount Loan_Type
0   59          215422      1445796      Home
1   49           60893      1044620      Auto
2   35          116520      1923410      Home
3   63          140818      1811663      Home
4   28           76272        88578  Personal

We have a dataset containing borrower details like Income, Loan Amount, and Age. Now, let’s dive into the tricks.

Trick 1: The Ratio Feature

Raw numbers often lack context. A $5,000 monthly EMI (Equated Monthly Instalment) sounds high, right? But if the borrower earns $500,000 a month, it’s peanuts. If they earn $6,000, it’s a crisis.

So, always create a ratio. Here, we calculate the Debt-to-Income (DTI) Ratio. This single number captures the financial burden much better than income or EMI alone:

# Create a ratio of EMI to Income (Debt-to-Income)  
df['DTI_Ratio'] = df['Monthly_EMI'] / df['Monthly_Income']  
  
# Let's see the difference context makes  
print(df[['Monthly_Income', 'Monthly_EMI', 'DTI_Ratio']].head())
   Monthly_Income  Monthly_EMI  DTI_Ratio
0          215422      4856.88   0.022546
1           60893     55433.68   0.910346
2          116520     14324.61   0.122937
3          140818      6249.28   0.044378
4           76272       816.46   0.010705

Models love normalised relationships. A ratio creates a standard scale that applies to everyone, regardless of whether they are a millionaire or a fresh graduate.
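The $5,000-EMI example above can be checked directly on a tiny hypothetical frame (the column names mirror the loan dataset; the numbers are made up). DTI is computed here in the conventional EMI-over-income direction, so a higher value means a heavier burden:

```python
import pandas as pd

# Two hypothetical borrowers with the same EMI but very different incomes
borrowers = pd.DataFrame({
    'Monthly_Income': [6000, 500000],
    'Monthly_EMI': [5000, 5000],
})

# Debt-to-Income: EMI as a fraction of income (higher = heavier burden)
borrowers['DTI_Ratio'] = borrowers['Monthly_EMI'] / borrowers['Monthly_Income']
print(borrowers)
```

The first borrower spends over 80% of their income on the EMI; the second spends 1%. The raw EMI column alone cannot tell them apart.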

Trick 2: Binning

Is a 29-year-old borrower significantly different risk-wise from a 30-year-old? Probably not. But a 25-year-old (early career) is very different from a 55-year-old (approaching retirement). Raw continuous variables like Age can sometimes introduce noise or complexity that the model doesn’t need.

So, always group continuous values into Bins or buckets:

# Define our bins and labels (pd.cut is right-inclusive by default, so age 35  
# lands in 'Young Adult'; ages of 20 or below fall outside the bins and become NaN)  
bins = [20, 35, 55, 100]  
labels = ['Young Adult', 'Middle-Aged', 'Senior']  
  
# Create a new categorical feature  
df['Age_Group'] = pd.cut(df['Age'], bins=bins, labels=labels)  
  
print(df[['Age', 'Age_Group']].head())
   Age    Age_Group
0   59       Senior
1   49  Middle-Aged
2   35  Young Adult
3   63       Senior
4   28  Young Adult

It helps the model handle non-linear relationships. The risk might not increase linearly with every single year of age, but it might jump significantly between Young Adult and Senior.
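If you would rather not hand-pick the bin edges, pandas can also derive them from the data. This is a sketch of quantile binning with `pd.qcut` (the ages here are hypothetical), which gives each bucket roughly the same number of borrowers:

```python
import pandas as pd

# Hypothetical ages; qcut chooses the edges so each bin holds ~the same count
ages = pd.Series([22, 28, 35, 41, 49, 55, 59, 63])
age_quartiles = pd.qcut(ages, q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
print(age_quartiles.value_counts())
```

Hand-picked bins (as above) encode domain knowledge; quantile bins guarantee balanced groups. Which one is right depends on whether the boundaries themselves carry meaning.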

Trick 3: Log Transformation

Money variables (like Income or Loan Amount) rarely follow a nice, normal bell curve. They usually have a long tail: lots of people with average incomes, and a few people with massive incomes. These extreme outliers can confuse models (especially linear ones).

So, always apply a Logarithm. It squashes the huge values down, making the distribution look more Normal:

# Apply log transformation (np.log1p adds 1 to avoid log(0) errors)  
df['Log_Income'] = np.log1p(df['Monthly_Income'])  
  
# Compare raw vs transformed  
print(df[['Monthly_Income', 'Log_Income']].head())
   Monthly_Income  Log_Income
0          215422   12.280359
1           60893   11.016890
2          116520   11.665827
3          140818   11.855231
4           76272   11.242074

It stabilises the variance. The gap between $50k and $60k matters a lot, while the gap between $1M and $1.01M barely does, yet treated strictly linearly both gaps look identical ($10k) to the model. Log scales emphasise relative change rather than absolute differences.
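You can verify the squashing effect numerically. This sketch draws hypothetical long-tailed incomes from a log-normal distribution (a stand-in for real income data) and compares skewness before and after the transform:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Log-normal draws mimic a long-tailed income distribution
incomes = pd.Series(rng.lognormal(mean=11, sigma=1, size=5000))

raw_skew = incomes.skew()            # strongly right-skewed
log_skew = np.log1p(incomes).skew()  # close to symmetric
print(f"skew before: {raw_skew:.2f}, skew after: {log_skew:.2f}")
```

A skewness near zero after the transform is exactly the shape that linear and distance-based models handle best.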

Trick 4: Domain Interaction

Sometimes the most powerful features come from simple business logic. In our dataset, we have Loan_Amount (total borrowed) and Outstanding_Loan_Amount (left to pay).

Here, instead of feeding both raw numbers, let’s calculate Repayment Progress:

# Calculate how much of the loan has already been paid off  
df['Repayment_Progress'] = 1 - (df['Outstanding_Loan_Amount'] / df['Loan_Amount'])  
  
print(df[['Loan_Amount', 'Outstanding_Loan_Amount', 'Repayment_Progress']].head())
   Loan_Amount  Outstanding_Loan_Amount  Repayment_Progress
0      1445796             2.914130e+05            0.798441
1      1044620             6.652042e+05            0.363209
2      1923410             1.031372e+06            0.463779
3      1811663             2.249739e+05            0.875819
4        88578             3.918989e+04            0.557566

This feature directly represents good behaviour. A borrower with 90% progress is very different from one with 5%, even if their outstanding amounts are similar.
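One practical caveat: a plain division breaks if Loan_Amount is ever zero, and data errors can push the ratio outside [0, 1]. Here is a defensive sketch on hypothetical rows (the real dataset may not contain such rows, but pipelines that re-run on fresh data often do):

```python
import numpy as np
import pandas as pd

# Hypothetical rows, including a zero Loan_Amount that would break a raw division
loans = pd.DataFrame({
    'Loan_Amount': [100000, 50000, 0],
    'Outstanding_Loan_Amount': [10000, 50000, 0],
})

# Replace 0 with NaN before dividing, then treat an undefined ratio as "no progress"
ratio = loans['Outstanding_Loan_Amount'] / loans['Loan_Amount'].replace(0, np.nan)
loans['Repayment_Progress'] = (1 - ratio).fillna(0.0).clip(0, 1)
print(loans)
```

The `fillna(0.0)` choice (an empty loan counts as zero progress) is a business decision, not a technical one; adjust it to whatever your domain experts prefer.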

Closing Thoughts

Feature engineering isn’t just about math; it’s about empathy. You are trying to tell the story of the human behind the data points. When we created the DTI_Ratio, we were asking, “Can this person actually afford their life?” When we created Age_Groups, we were acknowledging, “Different life stages have different priorities.”

Better data beats a better algorithm. Don’t just throw raw CSVs into a model. Take the time to craft features that tell the truth.

If you found this article helpful, make sure to follow me on Instagram for daily AI resources and practical learning. And check out my latest book, Hands-On GenAI, LLMs & AI Agents: a step-by-step guide to becoming job-ready in this decade of AI.

The post Feature Engineering Tricks I Use as an ML Engineer appeared first on AmanXai by Aman Kharwal.