Feature Engineering Tricks I Use as an ML Engineer
People often call feature engineering the black art of machine learning. Even the most advanced algorithm, say a deep neural network with millions of parameters, won’t work well if you feed it raw data. As an ML engineer, I spend about 70% of my time on feature engineering rather than tuning models. This work is all about turning your domain knowledge into something the model can use. In this article, I’ll share some feature engineering tricks that I use in my own work.
Feature Engineering Tricks
In this guide, I’ll walk you through 4 practical feature engineering tricks I use constantly. We’ll work through a real-world loan recovery analysis scenario. You can download the dataset I am using from here.
First, let’s load our data:
import pandas as pd
import numpy as np
# Load the dataset
df = pd.read_csv('loan recovery.csv')
# Let's verify what we have
print(df[['Age', 'Monthly_Income', 'Loan_Amount', 'Loan_Type']].head())
Age Monthly_Income Loan_Amount Loan_Type
0 59 215422 1445796 Home
1 49 60893 1044620 Auto
2 35 116520 1923410 Home
3 63 140818 1811663 Home
4 28 76272 88578 Personal
We have a dataset containing borrower details like Income, Loan Amount, and Age. Now, let’s dive into the tricks.
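Before engineering anything, I always run a quick sanity check on shape, dtypes, and missing values. Here is a minimal sketch using this dataset:
# Quick sanity check before creating any features
print(df.shape)          # number of rows and columns
print(df.dtypes)         # confirm numeric columns loaded as numbers
print(df.isna().sum())   # count missing values per column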
Trick 1: The Ratio Feature
Raw numbers often lack context. A $5,000 monthly EMI (Equated Monthly Instalment) sounds high, right? But if the borrower earns $500,000 a month, it’s peanuts. If they earn $6,000, it’s a crisis.
So, always create a ratio. Here, we calculate the Debt-to-Income (DTI) Ratio. This single number captures the financial burden much better than income or EMI alone:
# Create a ratio of EMI to Income
df['DTI_Ratio'] = df['Monthly_EMI'] / df['Monthly_Income']
# Let's see the difference context makes
print(df[['Monthly_Income', 'Monthly_EMI', 'DTI_Ratio']].head())
Monthly_Income Monthly_EMI DTI_Ratio
0 215422 4856.88 0.022546
1 60893 55433.68 0.910346
2 116520 14324.61 0.122937
3 140818 6249.28 0.044378
4 76272 816.46 0.010705
Models love normalised relationships. A ratio creates a standard scale that applies to everyone, regardless of whether they are a millionaire or a fresh graduate.
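One caveat before moving on: if Monthly_Income can ever be zero or missing, the division produces infinities or NaNs. A minimal defensive sketch, assuming you want missing rather than infinite ratios:
# Guard the ratio against zero or missing incomes
df['DTI_Ratio'] = df['Monthly_EMI'] / df['Monthly_Income'].replace(0, np.nan)
# Convert any stray infinities to NaN so downstream steps can impute them
df['DTI_Ratio'] = df['DTI_Ratio'].replace([np.inf, -np.inf], np.nan)
print(df['DTI_Ratio'].isna().sum())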
Trick 2: Binning
Is a 29-year-old borrower significantly different risk-wise from a 30-year-old? Probably not. But a 25-year-old (early career) is very different from a 55-year-old (approaching retirement). Raw continuous variables like Age can sometimes introduce noise or complexity that the model doesn’t need.
So, always group continuous values into Bins or buckets:
# Define our bins and labels
bins = [20, 35, 55, 100]
labels = ['Young Adult', 'Middle-Aged', 'Senior']
# Create a new categorical feature
df['Age_Group'] = pd.cut(df['Age'], bins=bins, labels=labels)
print(df[['Age', 'Age_Group']].head())
Age Age_Group
0 59 Senior
1 49 Middle-Aged
2 35 Young Adult
3 63 Senior
4 28 Young Adult
It helps the model handle non-linear relationships. The risk might not increase linearly with every single year of age, but it might jump significantly between Young Adult and Senior.
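If you don’t have strong domain cut-offs, a variation I often reach for is quantile binning with pd.qcut, which puts roughly equal numbers of borrowers in each bucket. A sketch (the Age_Quartile column name is just illustrative):
# Quantile-based alternative: four equal-sized buckets instead of fixed cut-offs
df['Age_Quartile'] = pd.qcut(df['Age'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
print(df['Age_Quartile'].value_counts())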
Trick 3: Log Transformation
Money variables (like Income or Loan Amount) rarely follow a nice, normal Bell Curve. They usually have a Long Tail: lots of people with average incomes, and a few people with massive incomes. These massive outliers can confuse models (especially linear ones).
So, always apply a Logarithm. It squashes the huge values down, making the distribution look more Normal:
# Apply log transformation (np.log1p adds 1 to avoid log(0) errors)
df['Log_Income'] = np.log1p(df['Monthly_Income'])
# Compare raw vs transformed
print(df[['Monthly_Income', 'Log_Income']].head())
Monthly_Income Log_Income
0 215422 12.280359
1 60893 11.016890
2 116520 11.665827
3 140818 11.855231
4 76272 11.242074
It stabilises the variance. To a linear model, the $10k gap between $50k and $60k looks exactly as important as the $10k gap between $1M and $1.01M, even though the first is a 20% jump and the second is barely 1%. A log scale focuses on relative magnitude rather than raw differences.
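A quick way to check whether the transformation actually helped is to compare skewness before and after; values closer to 0 mean a more symmetric distribution. A sketch using the columns we just created:
# Skewness near 0 indicates a roughly symmetric distribution
print('Raw skew:', df['Monthly_Income'].skew())
print('Log skew:', df['Log_Income'].skew())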
Trick 4: Domain Interaction
Sometimes the most powerful features come from simple business logic. In our dataset, we have Loan_Amount (total borrowed) and Outstanding_Loan_Amount (left to pay).
Here, instead of feeding both raw numbers, let’s calculate Repayment Progress:
# Calculate how much of the loan has already been paid off
df['Repayment_Progress'] = 1 - (df['Outstanding_Loan_Amount'] / df['Loan_Amount'])
print(df[['Loan_Amount', 'Outstanding_Loan_Amount', 'Repayment_Progress']].head())
Loan_Amount Outstanding_Loan_Amount Repayment_Progress
0 1445796 2.914130e+05 0.798441
1 1044620 6.652042e+05 0.363209
2 1923410 1.031372e+06 0.463779
3 1811663 2.249739e+05 0.875819
4 88578 3.918989e+04 0.557566
This feature directly represents good behaviour. A borrower with 90% progress is very different from one with 5%, even if their outstanding amounts are similar.
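One edge case worth guarding: if fees or accrued interest push Outstanding_Loan_Amount above Loan_Amount, the progress goes negative. A minimal sketch, assuming you’d rather cap the feature to the [0, 1] range:
# Cap repayment progress to [0, 1] to absorb fee/interest quirks in the data
df['Repayment_Progress'] = df['Repayment_Progress'].clip(lower=0, upper=1)
print(df['Repayment_Progress'].describe())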
Closing Thoughts
Feature engineering isn’t just about math; it’s about empathy. You are trying to tell the story of the human behind the data points. When we created the DTI_Ratio, we were asking, “Can this person actually afford their life?” When we created Age_Groups, we were acknowledging, “Different life stages have different priorities.”
Better data beats a better algorithm. Don’t just throw raw CSVs into a model. Take the time to craft features that tell the truth.
If you found this article helpful, make sure to follow me on Instagram for daily AI resources and practical learning. And check out my latest book: Hands-On GenAI, LLMs & AI Agents, a step-by-step guide to becoming job-ready in this decade of AI.