# Flight Price Prediction using Machine Learning

I recently worked on an exciting machine learning project to predict flight prices using a dataset of historical flight data. In this post, I’ll walk through the code and explain the theory and techniques used to build the predictive model.

# Data Preprocessing

The first step was to load and preprocess the flight dataset, which was in an Excel file. I used the pandas library to read in the data:

import pandas as pd

df = pd.read_excel(r'C:\Users\siddhant\downloads\ML_Projects\Flight-Price-Prediction\Flight Dataset\Data_Train.xlsx')

Next, I cleaned the data by dropping any rows with missing values and removing unnecessary columns like ‘Route’ and ‘Additional_Info’:

df.dropna(inplace=True)
df.drop(['Route','Additional_Info'],inplace=True,axis=1)

I also performed some feature transformations, such as:

Standardizing city names (e.g. ‘New Delhi’ to ‘Delhi’)
Extracting day and month from the date
Splitting departure and arrival times into hours and minutes
Converting the duration string into separate hours and minutes columns

Here’s an example of extracting the day and month:

journey_date = pd.to_datetime(df['Date_of_Journey'], format='%d/%m/%Y')
df['Day'] = journey_date.dt.day
df['Month'] = journey_date.dt.month

df.drop('Date_of_Journey',inplace=True,axis=1)
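
The other transformations in the list can be handled in a similar way. Here is a minimal sketch on a couple of toy rows. The column names `Dep_Time`, `Arrival_Time` and `Duration`, the string formats, and the new column names are assumptions for illustration, not necessarily what the real dataset uses:

```python
import pandas as pd

# Toy rows; column names and string formats are assumed for illustration.
sample = pd.DataFrame({
    'Destination': ['New Delhi', 'Cochin'],
    'Dep_Time': ['22:20', '05:50'],
    'Arrival_Time': ['01:10 22 Mar', '13:15'],
    'Duration': ['2h 50m', '19h'],
})

# Standardize city names
sample['Destination'] = sample['Destination'].replace({'New Delhi': 'Delhi'})

# Split the departure time into hour and minute columns
dep = pd.to_datetime(sample['Dep_Time'], format='%H:%M')
sample['Dep_hour'] = dep.dt.hour
sample['Dep_min'] = dep.dt.minute

# The arrival time may carry a trailing date, so keep only the first token
arr = pd.to_datetime(sample['Arrival_Time'].str.split().str[0], format='%H:%M')
sample['Arrival_hour'] = arr.dt.hour
sample['Arrival_min'] = arr.dt.minute

# Pull hours and minutes out of the duration string; a missing part counts as 0
hours = sample['Duration'].str.extract(r'(\d+)\s*h', expand=False)
mins = sample['Duration'].str.extract(r'(\d+)\s*m', expand=False)
sample['Duration_hours'] = pd.to_numeric(hours).fillna(0).astype(int)
sample['Duration_mins'] = pd.to_numeric(mins).fillna(0).astype(int)

sample = sample.drop(columns=['Dep_Time', 'Arrival_Time', 'Duration'])
print(sample)
```

Using a regular expression for the duration means values like `19h` or `45m`, where one part is missing, do not need special handling.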

# Exploratory Data Analysis

To visualize relationships in the data, I created some plots using the seaborn library. For example, this boxen plot (an enhanced box plot that shows more quantiles) displays the distribution of price by destination city:

import seaborn as sns

sns.catplot(x='Destination', y='Price', data=df.sort_values('Price', ascending=False), kind='boxen', aspect=3, height=4)
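
A grouped summary can back up what the plot shows with actual numbers. This is a minimal sketch using made-up prices, not values from the real dataset:

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    'Destination': ['Delhi', 'Delhi', 'Cochin', 'Cochin', 'Kolkata'],
    'Price': [4000, 6000, 9000, 11000, 4500],
})

# Count, median and range of price per destination, highest median first
summary = (
    toy.groupby('Destination')['Price']
       .agg(['count', 'median', 'min', 'max'])
       .sort_values('median', ascending=False)
)
print(summary)
```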

# Feature Encoding

Many of the features were categorical, such as airline, source, and destination. To use them in the model, I needed to encode them as numerical features.

For features with only a few categories, I used one-hot encoding:

airline = df[['Airline']] 
airline = pd.get_dummies(airline,drop_first=True)

source = df[['Source']]
source = pd.get_dummies(source,drop_first=True)

destination = df[['Destination']]
destination = pd.get_dummies(destination,drop_first=True)

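For Total_Stops, which has a natural ordering, I mapped each label to an integer (ordinal encoding). The result is assigned back to the column, because calling replace with inplace=True on a single column triggers warnings in newer pandas versions:

df['Total_Stops'] = df['Total_Stops'].map({'non-stop': 0, '1 stop': 1, '2 stops': 2, '3 stops': 3, '4 stops': 4})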
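
With a hand-written mapping like this, it is worth checking that every label in the column is covered by the dictionary. A minimal sketch on a toy Series (the variable names here are only illustrative):

```python
import pandas as pd

stops = pd.Series(['non-stop', '1 stop', '2 stops', '1 stop'])
stop_order = {'non-stop': 0, '1 stop': 1, '2 stops': 2, '3 stops': 3, '4 stops': 4}

encoded = stops.map(stop_order)

# map() yields NaN for any label missing from the dictionary, so check for them
unmapped = stops[encoded.isna()].unique()
assert len(unmapped) == 0, f'Unmapped labels: {unmapped}'

encoded = encoded.astype(int)
print(encoded.tolist())  # [0, 1, 2, 1]
```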

# How One-Hot Encoding Works

Imagine we have a column “Airline” with values like “Air India”, “Indigo”, and “SpiceJet”. One-hot encoding turns this column into several binary columns, each representing one airline.

For example, consider this initial dataset:

Index	Airline
0	Air India
1	Indigo
2	SpiceJet
3	Air India
4	SpiceJet

One-hot encoding transforms it into the following format:

Index	Air India	Indigo	SpiceJet
0	1	0	0
1	0	1	0
2	0	0	1
3	1	0	0
4	0	0	1

In this new table:

Each unique category in the original “Airline” column becomes a separate binary column.
A value of 1 in the binary column indicates the presence of the respective category for that row.
A value of 0 indicates the absence of that category.

For example:

The first row has a 1 under “Air India” and 0 under the others, indicating that the airline is “Air India”.
The second row has a 1 under “Indigo”, indicating the airline is “Indigo”.

This method allows the machine learning model to understand and process categorical data without assuming any ordinal relationship between the categories.
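
The toy table above can be reproduced with `pd.get_dummies`. The sketch below also shows what `drop_first=True`, which the code in this post uses, does to the result: the first category's column is removed, and a row of all zeros then stands for that category. The `dtype=int` argument is only there to print 0/1 instead of booleans:

```python
import pandas as pd

toy = pd.DataFrame({'Airline': ['Air India', 'Indigo', 'SpiceJet', 'Air India', 'SpiceJet']})

# Full one-hot encoding: one column per airline, as in the table above
full = pd.get_dummies(toy[['Airline']], dtype=int)
print(full.columns.tolist())
# ['Airline_Air India', 'Airline_Indigo', 'Airline_SpiceJet']

# drop_first=True removes the first category; all zeros now means 'Air India'
reduced = pd.get_dummies(toy[['Airline']], drop_first=True, dtype=int)
print(reduced.columns.tolist())
# ['Airline_Indigo', 'Airline_SpiceJet']
```

Dropping one column loses no information, since the dropped category is implied when all remaining columns are 0.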

# Code Implementation

Here’s how you can implement one-hot encoding in Python using pandas:

# One-hot encoding
airline = pd.get_dummies(df[['Airline']], drop_first=True)
source = pd.get_dummies(df[['Source']], drop_first=True)
destination = pd.get_dummies(df[['Destination']], drop_first=True)

# Combine the new encoded columns with the dataframe
df = pd.concat([df, airline, source, destination], axis=1)
df.drop(['Airline', 'Source', 'Destination'], axis=1, inplace=True)

# Why One-Hot Encoding?

One-hot encoding prevents the model from assuming any natural ordering between categories. It treats each category as independent, which is important for features like airline names, where there’s no inherent order. If the airlines were instead encoded as 0, 1 and 2, the model could wrongly treat the airline coded 2 as "greater than" the airline coded 1, which would bias its predictions.

# Model Building

With the data preprocessed and encoded, I was ready to build the predictive model. I chose a Random Forest regressor, as it is able to capture complex non-linear relationships.

First I split the data into features (X) and target (y):

X = df.drop('Price',axis=1)
y = df['Price']

Then I used scikit-learn to split into train and test sets:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

To tune the hyperparameters of the random forest, I used randomized search cross-validation over a parameter grid:

from sklearn.model_selection import RandomizedSearchCV 
from sklearn.ensemble import RandomForestRegressor

# Parameter grid to search over
param_grid = {'n_estimators': [100, 200, 300, 400, 500],
              'max_features': [1.0, 'sqrt'],  # 'auto' was removed in scikit-learn 1.3; 1.0 (all features) is its equivalent for regressors
              'max_depth': , 
              'min_samples_split': ,
              'min_samples_leaf': }
              
# Randomized search with 5-fold CV              
rf_random = RandomizedSearchCV(estimator = RandomForestRegressor(), 
                               param_distributions = param_grid,
                               n_iter = 20, cv = 5, random_state=0)

rf_random.fit(X_train, y_train)
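
For readers who want to run the search end to end, here is a small self-contained sketch of the same idea. It uses synthetic data from `make_regression` in place of the flight dataset, and the candidate values for each hyperparameter are hypothetical placeholders, not the ones used in this project. The search is also kept deliberately small (`n_iter=5`, `cv=3`) so it finishes quickly:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the flight features and prices
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Hypothetical candidate values, chosen only for illustration
param_distributions = {
    'n_estimators': [20, 50],
    'max_features': [1.0, 'sqrt'],
    'max_depth': [5, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5],
}

# Sample 5 random combinations and score each with 3-fold cross-validation
search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    random_state=0,
)
search.fit(X_train, y_train)

print(search.best_params_)
# score() on a fitted search uses the refitted best model; for regressors it is R-squared
print(f'Test R-squared: {search.score(X_test, y_test):.2f}')
```

Randomized search samples only `n_iter` combinations instead of trying every one, which keeps the cost manageable when the grid is large.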

The best parameters found were:

rf_random.best_params_

{'n_estimators': 400,
 'min_samples_split': 5,
 'min_samples_leaf': 1,
 'max_features': 'sqrt',
 'max_depth': 20}

# Model Evaluation

With the tuned model, I made predictions on the test set:

y_pred = rf_random.predict(X_test)

To evaluate performance, I plotted the actual vs predicted prices:

import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, alpha=0.6)
plt.xlabel('Actual Price')  
plt.ylabel('Predicted Price')
plt.title('Actual vs Predicted Price')
plt.show()

I also computed the R-squared score on the test set, which was 0.77, indicating the model explains about 77% of the variance in price:

from sklearn.metrics import r2_score

r2 = r2_score(y_test, y_pred)
print(f'R-squared: {r2:.2f}')

R-squared: 0.77
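
R-squared is not an accuracy percentage. It compares the model's squared error with that of always predicting the mean price: 1 minus the residual sum of squares divided by the total sum of squares. The sketch below computes it by hand, along with MAE and RMSE, which report the error in the same units as the price. The numbers are made up for illustration and are not results from the flight model:

```python
import numpy as np

# Made-up values for illustration; not results from the flight model
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 10.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean
r2_manual = 1 - ss_res / ss_tot

mae = np.mean(np.abs(y_true - y_hat))           # mean absolute error
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))  # root mean squared error

print(f'R-squared: {r2_manual:.3f}, MAE: {mae:.2f}, RMSE: {rmse:.2f}')
# R-squared: 0.925, MAE: 0.50, RMSE: 0.61
```

scikit-learn's `r2_score` computes the same quantity as `r2_manual` here.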

# Conclusion

In this project, I built a random forest model that predicts flight prices with an R-squared of 0.77 on the test set. The most important steps were:

Cleaning and preprocessing the data
Visualizing the data to identify patterns
Encoding categorical variables
Tuning the model hyperparameters
Evaluating the final model on a test set

Machine learning is a powerful tool for this kind of price prediction task. With more data and further feature engineering, the model could likely be improved even further.

I hope this post gave you a good overview of the project! Let me know if you have any other questions.