
PySpark Analysis on Airport Data

Ojaas Hampiholi · Published in TDS Archive · 6 min read · May 16, 2021


Photo by Carlos Muza on Unsplash

Big data refers to data sets so large and complex that they cannot be handled with conventional processing methods, and to the systems built to manage such volumes and deliver results in near real time with very little latency, covering everything from data management to social media analytics and streaming data. Apache Spark is a unified analytics engine used primarily for large-scale data processing. Its major advantages are processing speed, ease of use, and the generality of platforms and toolkits that can be integrated with it.

PySpark is the Python API for Apache Spark. Among its major advantages, it integrates easily with Java, R, and Scala, and it lets data scientists leverage RDDs to work at higher speeds on distributed clusters that store data in partitions. Disk persistence and powerful caching improve processing speed even further. In this article, we discuss how PySpark can be used in a Google Colaboratory notebook to analyze a large dataset containing about 3.6 million rows and 15 features.

Installing and Initializing PySpark on Google Colab

1. Install the headless OpenJDK 8 runtime in the notebook.
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

2. Download the Spark archive (pre-built for Hadoop) and extract it for further use.

!wget -q https://downloads.apache.org/spark/spark-3.0.2/spark-3.0.2-bin-hadoop2.7.tgz

!tar xf spark-3.0.2-bin-hadoop2.7.tgz

3. Set the JAVA_HOME and SPARK_HOME environment variables.

import os

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.2-bin-hadoop2.7"

4. Install and initialize the findspark library.

!pip install -q findspark

import findspark
findspark.init()   # add PySpark to sys.path using SPARK_HOME
findspark.find()   # confirm which Spark installation was picked up

5. Create the SparkSession and SQLContext.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local") \
    .appName("Colab") \
    .config('spark.ui.port', '4050') \
    .getOrCreate()

from pyspark.sql import SQLContext

# SQLContext expects a SparkContext, so pass the session's context explicitly
sqlContext = SQLContext(spark.sparkContext, spark)
spark

The above steps can be found in the notebook. Once the necessary installations are done and the analysis environment is set up, we need the data: it is open source and can be found on Kaggle under the title USA Airport Dataset. This project uses the version of the dataset available at that link as of 31 March 2021. After downloading it, we can read the CSV into PySpark and register it as a table so that it can be queried with SQLContext.

df = spark.read.csv("./Airports2.csv", header=True, inferSchema=True)
df.createOrReplaceTempView('df')   # register the DataFrame as a temp view named 'df' (registerTempTable is deprecated)

Preliminary Analysis and Insights

The first steps any data analyst would take in a preliminary analysis are to count the number of rows, inspect the schema, and look at statistical descriptions of the features.

df.printSchema()
Schema of Data
df.describe().show()
Statistical Analysis
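
The row count mentioned earlier (about 3.6 million rows) can be verified with a simple action:

# Count the total number of rows in the dataset
df.count()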

The steps above give basic insights into the data and help determine the pre-processing that needs to be applied to get it into the proper format. To understand the data better, we can have a look at the first few rows; Spark transformations and actions can be used to subset the data.

Subset of Data

df.select("Origin_airport","Destination_airport","Passengers","Seats").show(15)
Subset of data

Aggregate of Data

from pyspark.sql import functions as F   # needed for the sum aggregate below

airportAgg_DF = df.groupBy("Origin_airport").agg(F.sum("Passengers"))
airportAgg_DF.show(10)
Aggregate of Data

Research Questions and SQL Solutions

Once the preliminary analysis is complete, we proceed to answer some research questions that can be formulated as SQL queries using SQLContext. The questions and their corresponding answers are discussed briefly below. The code solutions to the questions can be found here.

Find the Airport with Highest Number of Flight Departures

We can see that the most popular airports in terms of flight departures are Chicago O’Hare, Hartsfield-Jackson Atlanta, Dallas/Fort Worth, and Los Angeles International Airports, in that order.
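
Since the full code is linked rather than shown, a query along these lines would produce the ranking; this is a sketch, not the exact solution, and it assumes the dataset stores the per-record flight count in a Flights column:

# Rank origin airports by total departing flights (sketch)
departures_df = sqlContext.sql("""
    SELECT Origin_airport, SUM(Flights) AS Total_Departures
    FROM df
    GROUP BY Origin_airport
    ORDER BY Total_Departures DESC
""")
departures_df.show(5)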

Find the Airport with Highest Number of Passenger Arrivals

We can see that the most popular airports in terms of passenger arrivals are Hartsfield-Jackson Atlanta, Chicago O’Hare, Dallas/Fort Worth, and Los Angeles International Airports, in that order.
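
A corresponding sketch groups by the destination airport and sums the Passengers column shown earlier:

# Rank destination airports by total arriving passengers (sketch)
arrivals_df = sqlContext.sql("""
    SELECT Destination_airport, SUM(Passengers) AS Total_Arriving_Passengers
    FROM df
    GROUP BY Destination_airport
    ORDER BY Total_Arriving_Passengers DESC
""")
arrivals_df.show(5)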

Find the Airport with Most Flight Traffic

We can see that the most popular airports in terms of the number of flights are Chicago O’Hare, Hartsfield-Jackson Atlanta, Dallas/Fort Worth, and Los Angeles International Airports, in that order.
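
Because an airport's total traffic includes flights where it appears as either origin or destination, one plausible formulation unions the two roles before aggregating (again a sketch that assumes a Flights column):

# Total flights touching each airport, as origin or destination (sketch)
traffic_df = sqlContext.sql("""
    SELECT Airport, SUM(Flights) AS Total_Flights
    FROM (
        SELECT Origin_airport AS Airport, Flights FROM df
        UNION ALL
        SELECT Destination_airport AS Airport, Flights FROM df
    ) t
    GROUP BY Airport
    ORDER BY Total_Flights DESC
""")
traffic_df.show(5)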

Find the Airport with Most Passenger Footfall

We can see that the most popular airports in terms of the number of passengers are Dallas/Fort Worth, Hartsfield-Jackson Atlanta, Chicago O’Hare, and Los Angeles International Airports, in that order.
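
The same union approach applied to the Passengers column approximates total footfall per airport (sketch):

# Total passengers passing through each airport, arriving or departing (sketch)
footfall_df = sqlContext.sql("""
    SELECT Airport, SUM(Passengers) AS Total_Passengers
    FROM (
        SELECT Origin_airport AS Airport, Passengers FROM df
        UNION ALL
        SELECT Destination_airport AS Airport, Passengers FROM df
    ) t
    GROUP BY Airport
    ORDER BY Total_Passengers DESC
""")
footfall_df.show(5)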

Find the Occupancy Rate for Most Popular Routes

The occupancy rates of popular flights are somewhere between 48% and 71%, with an average of 60%. This implies that although there are lots of flights operating between airports, most of them are not efficient in terms of passenger traffic. Reducing and rescheduling some flights to increase occupancy rates could help airlines cut their fuel costs while also protecting the environment by reducing their carbon footprint.
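
Occupancy can be derived from the Passengers and Seats columns for the routes with the most flights; a possible sketch:

# Occupancy rate (passengers / seats) on the busiest routes (sketch)
occupancy_df = sqlContext.sql("""
    SELECT Origin_airport, Destination_airport,
           SUM(Flights) AS Total_Flights,
           ROUND(SUM(Passengers) / SUM(Seats) * 100, 2) AS Occupancy_Rate
    FROM df
    WHERE Seats > 0
    GROUP BY Origin_airport, Destination_airport
    ORDER BY Total_Flights DESC
""")
occupancy_df.show(10)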

Find the Number of Flights for Long Distance Journeys

We can observe from the output that in most cases the frequency of long-distance flights is low. However, it is interesting to note that many flights operate between Honolulu, Hawaii (Honolulu International Airport) and New York (John F. Kennedy International and Newark Liberty International Airports) despite the considerable distance between the two.
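
One way to surface such long-haul routes is to filter on a Distance column (assumed here to be in miles) with an illustrative cutoff such as 2,500 miles; the threshold used in the actual solution may differ:

# Flight counts on long-distance routes beyond an assumed 2,500-mile cutoff (sketch)
long_distance_df = sqlContext.sql("""
    SELECT Origin_airport, Destination_airport,
           MAX(Distance) AS Distance,
           SUM(Flights) AS Total_Flights
    FROM df
    WHERE Distance > 2500
    GROUP BY Origin_airport, Destination_airport
    ORDER BY Total_Flights DESC
""")
long_distance_df.show(10)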

Find the Average Distances for Routes with Most Flights

We can see that medium-distance routes (100–300 miles) have the greatest number of flights, with a few notable exceptions. The route between Chicago (ORD) and New York (EWR, LGA) is operated heavily even though the distance between the airports is about 725 miles. Another interesting case is the route between Atlanta (ATL) and Dallas/Fort Worth (DFW), which is a popular service even though the distance is around 720 miles.
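
A sketch of such a query averages the assumed Distance column over each route and ranks routes by flight count:

# Average route distance for the most heavily flown routes (sketch)
route_distance_df = sqlContext.sql("""
    SELECT Origin_airport, Destination_airport,
           SUM(Flights) AS Total_Flights,
           ROUND(AVG(Distance), 0) AS Avg_Distance
    FROM df
    GROUP BY Origin_airport, Destination_airport
    ORDER BY Total_Flights DESC
""")
route_distance_df.show(15)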

Future Work

The output tables can be saved as separate CSV files and used to visualize the data with Python or Tableau, building interactive dashboards that help people choose flights and plan their vacations accordingly. A workflow could also be set up with Apache Airflow to refresh the data daily and update the charts on the dashboard.
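
As an illustrative first step, any of the aggregated DataFrames from the sketches above (for example occupancy_df) could be written out as CSV; the output path here is hypothetical:

# Write an aggregated result to CSV for downstream visualization (illustrative path)
occupancy_df.coalesce(1).write.csv("./output/occupancy_rates", header=True, mode="overwrite")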
