# Flight Price Prediction using Machine Learning
I recently worked on an exciting machine learning project to predict flight prices using a dataset of historical flight data. In this post, I’ll walk through the code and explain the theory and techniques used to build the predictive model.
# Data Preprocessing
The first step was to load and preprocess the flight dataset, which was in an Excel file. I used the pandas library to read in the data:
import pandas as pd
df = pd.read_excel(r'C:\Users\siddhant\downloads\ML_Projects\Flight-Price-Prediction\Flight Dataset\Data_Train.xlsx')
Next, I cleaned the data by dropping any rows with missing values and removing unnecessary columns like ‘Route’ and ‘Additional_Info’:
df.dropna(inplace=True)
df.drop(['Route','Additional_Info'],inplace=True,axis=1)
I also performed some feature transformations, such as:
- Standardizing city names (e.g. ‘New Delhi’ to ‘Delhi’)
- Extracting day and month from the date
- Splitting departure and arrival times into hours and minutes
- Converting the duration string into separate hours and minutes columns
Here’s an example of extracting the day and month:
df['Day'] = pd.to_datetime(df['Date_of_Journey'],format='%d/%m/%Y').dt.day
df['Month'] = pd.to_datetime(df['Date_of_Journey'],format='%d/%m/%Y').dt.month
df.drop('Date_of_Journey',inplace=True,axis=1)
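The other transformations in the list can be sketched the same way. This is a minimal sketch on a toy DataFrame, not the exact code from the project: the column names `Dep_Time` and `Duration`, and the duration format (`2h 50m`, `19h`, `45m`), are assumptions about how the dataset is laid out.

```python
import pandas as pd

# Toy rows standing in for the real data; column names and formats are assumed
toy = pd.DataFrame({
    'Destination': ['New Delhi', 'Cochin', 'Delhi'],
    'Dep_Time': ['22:20', '05:50', '09:25'],
    'Duration': ['2h 50m', '19h', '45m'],
})

# Standardize city names so both spellings count as the same city
toy['Destination'] = toy['Destination'].replace({'New Delhi': 'Delhi'})

# Split an HH:MM time into separate hour and minute columns
dep = pd.to_datetime(toy['Dep_Time'], format='%H:%M')
toy['Dep_hour'] = dep.dt.hour
toy['Dep_min'] = dep.dt.minute

# Pull hours and minutes out of the duration string; a missing part counts as 0
toy['Duration_hours'] = (
    toy['Duration'].str.extract(r'(\d+)\s*h', expand=False).fillna('0').astype(int)
)
toy['Duration_mins'] = (
    toy['Duration'].str.extract(r'(\d+)\s*m', expand=False).fillna('0').astype(int)
)

toy = toy.drop(columns=['Dep_Time', 'Duration'])
print(toy)
```

An arrival-time column can be handled like the departure time, provided its strings are first reduced to the same HH:MM form.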
# Exploratory Data Analysis
To visualize relationships in the data, I created some plots using the seaborn library. For example, this boxen plot (an enhanced box plot that shows more quantiles) shows the distribution of price by destination city:
import seaborn as sns
sns.catplot(x='Destination',y='Price',data=df.sort_values('Price',ascending=False),kind='boxen',aspect=3,height=4)
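A numeric summary can complement the plot. The sketch below computes the median price per destination on a small made-up DataFrame; the prices are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy data standing in for the real dataset; the values are invented
toy = pd.DataFrame({
    'Destination': ['Delhi', 'Cochin', 'Delhi', 'Cochin', 'Kolkata'],
    'Price': [4000, 9000, 6000, 11000, 4500],
})

# Median price per destination, most expensive first
median_price = (
    toy.groupby('Destination')['Price']
       .median()
       .sort_values(ascending=False)
)
print(median_price)
```

The median is used here rather than the mean because ticket prices tend to have a long right tail.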
# Feature Encoding
Many of the features were categorical, such as airline, source, and destination. To use them in the model, I needed to encode them as numerical features.
For features with only a few categories, I used one-hot encoding:
airline = df[['Airline']]
airline = pd.get_dummies(airline,drop_first=True)
source = df[['Source']]
source = pd.get_dummies(source,drop_first=True)
destination = df[['Destination']]
destination = pd.get_dummies(destination,drop_first=True)
For Total_Stops, which has a natural ordering, I mapped each category to an integer (ordinal encoding):
df['Total_Stops'] = df['Total_Stops'].replace({'non-stop':0,'1 stop':1,'2 stops':2,'3 stops':3,'4 stops':4})
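One way to guard against unexpected labels is to use `map` instead of `replace` and check the result. This is a sketch on a toy Series, not the project's code: `map` returns NaN for any label missing from the dictionary, which makes typos or unseen categories easy to detect.

```python
import pandas as pd

# Toy column standing in for df['Total_Stops']
stops = pd.Series(['non-stop', '1 stop', '2 stops', 'non-stop'], name='Total_Stops')

# Same mapping as in the post: more stops -> larger integer
stop_order = {'non-stop': 0, '1 stop': 1, '2 stops': 2, '3 stops': 3, '4 stops': 4}

encoded = stops.map(stop_order)

# Any label not in the dictionary would show up as NaN here
if encoded.isna().any():
    raise ValueError('Found Total_Stops labels that are not in the mapping')

encoded = encoded.astype(int)
print(encoded.tolist())
```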
# How One-Hot Encoding Works
Imagine we have a column “Airline” with values like “Air India”, “Indigo”, and “SpiceJet”. One-hot encoding turns this column into several binary columns, each representing one airline.
For example, consider this initial dataset:
| Index | Airline |
|---|---|
| 0 | Air India |
| 1 | Indigo |
| 2 | SpiceJet |
| 3 | Air India |
| 4 | SpiceJet |
One-hot encoding transforms it into the following format:
| Index | Air India | Indigo | SpiceJet |
|---|---|---|---|
| 0 | 1 | 0 | 0 |
| 1 | 0 | 1 | 0 |
| 2 | 0 | 0 | 1 |
| 3 | 1 | 0 | 0 |
| 4 | 0 | 0 | 1 |
In this new table:
- Each unique category in the original “Airline” column becomes a separate binary column.
- A value of 1 in the binary column indicates the presence of the respective category for that row.
- A value of 0 indicates the absence of that category.
For example:
- The first row has a 1 under “Air India” and 0 under the others, indicating that the airline is “Air India”.
- The second row has a 1 under “Indigo”, indicating the airline is “Indigo”.
This method allows the machine learning model to understand and process categorical data without assuming any ordinal relationship between the categories.
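The five-row example can be reproduced with `pd.get_dummies`. This sketch also shows what `drop_first=True` does: it drops the first category's column, since that category is implied when all the remaining columns are 0. The `dtype=int` argument is only there so the output shows 0/1 rather than booleans.

```python
import pandas as pd

# The five-row example from the table
toy = pd.DataFrame({'Airline': ['Air India', 'Indigo', 'SpiceJet', 'Air India', 'SpiceJet']})

# One binary column per airline, as in the transformed table
full = pd.get_dummies(toy[['Airline']], dtype=int)
print(full)

# With drop_first=True the 'Air India' column is dropped;
# a row with 0 in every remaining column means 'Air India'
reduced = pd.get_dummies(toy[['Airline']], drop_first=True, dtype=int)
print(reduced)
```

Dropping one column avoids perfectly redundant features, which matters most for linear models; tree-based models such as random forests work with either form.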
# Code Implementation
Here’s how you can implement one-hot encoding in Python using pandas:
# One-hot encoding
airline = pd.get_dummies(df[['Airline']], drop_first=True)
source = pd.get_dummies(df[['Source']], drop_first=True)
destination = pd.get_dummies(df[['Destination']], drop_first=True)
# Combine the new encoded columns with the dataframe
df = pd.concat([df, airline, source, destination], axis=1)
df.drop(['Airline', 'Source', 'Destination'], axis=1, inplace=True)
# Why One-Hot Encoding?
One-hot encoding prevents the model from assuming any natural ordering between categories. It treats each category as independent, which is important for features like airline names, where there’s no inherent order. It also avoids the bias that can arise when a model interprets arbitrary integer codes for categories as ordinal values.
# Model Building
With the data preprocessed and encoded, I was ready to build the predictive model. I chose a Random Forest regressor, as it is able to capture complex non-linear relationships.
First I split the data into features (X) and target (y):
X = df.drop('Price',axis=1)
y = df['Price']
Then I used scikit-learn to split into train and test sets:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
To tune the hyperparameters of the random forest, I used randomized search cross-validation over a parameter grid:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
# Parameter grid to search over
param_grid = {'n_estimators': [100, 200, 300, 400, 500],
'max_features': [1.0, 'sqrt'],  # 'auto' was removed in scikit-learn 1.3; for regressors it meant all features (1.0)
'max_depth': ,
'min_samples_split': ,
'min_samples_leaf': }
# Randomized search with 5-fold CV
rf_random = RandomizedSearchCV(estimator = RandomForestRegressor(),
param_distributions = param_grid,
n_iter = 20, cv = 5, random_state=0)
rf_random.fit(X_train, y_train)
The best parameters found were:
rf_random.best_params_
{'n_estimators': 400,
'min_samples_split': 5,
'min_samples_leaf': 1,
'max_features': 'sqrt',
'max_depth': 20}
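For readers who want to run the search end to end, here is a self-contained sketch on synthetic data. The candidate values in `param_distributions` are assumptions chosen for illustration, not the grid used in this project, and the search is kept small (`n_iter=5`, `cv=3`) so it finishes quickly.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic regression data standing in for the encoded flight features
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Hypothetical search space; these candidate values are assumptions
param_distributions = {
    'n_estimators': [50, 100],
    'max_features': [1.0, 'sqrt'],
    'max_depth': [5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

# Sample 5 random combinations and score each with 3-fold cross-validation
search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    random_state=0,
)
search.fit(X_train, y_train)

print(search.best_params_)
# score() uses the refitted best estimator and returns R-squared for regressors
print(search.score(X_test, y_test))
```

Randomized search tries a fixed number of random combinations instead of every one, which keeps the cost bounded as the search space grows.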
# Model Evaluation
With the tuned model, I made predictions on the test set:
y_pred = rf_random.predict(X_test)
To evaluate performance, I plotted the actual vs predicted prices:
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, alpha=0.6)
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.title('Actual vs Predicted Price')
plt.show()
I also computed the R-squared, which was 0.77, indicating the model explains about 77% of the variance in price on the test set:
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)
print(f'R-squared: {r2:.2f}')
R-squared: 0.77
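To make the metric concrete, R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on four made-up prices (not values from the dataset) and checks the result against `r2_score`. It also reports the mean absolute error, which is expressed in the same units as the price.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up actual and predicted prices, for illustration only
y_true = np.array([3000.0, 5000.0, 7000.0, 9000.0])
y_hat = np.array([3500.0, 4500.0, 7500.0, 8000.0])

# R-squared = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_hat) ** 2)          # squared prediction errors
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared deviations from the mean
r2_manual = 1 - ss_res / ss_tot

# Mean absolute error: average size of the error in price units
mae = mean_absolute_error(y_true, y_hat)

print(f'R-squared (manual):  {r2_manual:.4f}')
print(f'R-squared (sklearn): {r2_score(y_true, y_hat):.4f}')
print(f'MAE: {mae:.1f}')
```

Note that R-squared is a measure of explained variance, not a percentage of correct predictions.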
# Conclusion
In this project, I built a random forest model to predict flight prices, reaching an R-squared of 0.77 on the test set. The most important steps were:
- Cleaning and preprocessing the data
- Visualizing the data to identify patterns
- Encoding categorical variables
- Tuning the model hyperparameters
- Evaluating the final model on a test set
Machine learning is a powerful tool for this kind of price prediction task. With more data and further feature engineering, the model could likely be improved even further.
I hope this post gave you a good overview of the project! Let me know if you have any other questions.