Analyzing Domestic Airlines and Flights
By Olivia Heiner, Eddie Liu, Siddarth Chalasani, Allen Chun, Yupeng William Chen
Every day, over 80,000 planes fly over the United States. With these flights comes data about prices, distances, airlines, airports, and route networks. All of this matters to travelers as they move from one city to the next. Which airlines and airports will get them to their destination in a timely manner? Which airports are well connected to other airports? In this article, we explore some of these questions by analyzing three data sets: the OpenFlights Airports Database, the Kaggle USA Airport Dataset, and the Kaggle 2015 Flight Delays and Cancellations Dataset.
Which airlines face the most delays?
Using the 2015 Kaggle dataset from the U.S. Department of Transportation, which contains the on-time performance of domestic flights by large air carriers, we determined which airlines had the longest flight delays. Ranking the airlines by average departure delay and by average arrival delay produced two lists. In both lists, the same five airlines had the highest delay times. In increasing order of average delay time, these airlines were Envoy Air (MQ), JetBlue (B6), United Airlines (UA), Frontier Airlines (F9), and Spirit Airlines (NK). While all five had an average arrival delay of more than 20 minutes, only two had an average departure delay of more than 20 minutes: Frontier Airlines (21.1 minutes) and Spirit Airlines (21.9 minutes). Spirit Airlines also had the highest average arrival delay, at 25.7 minutes.
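The per-airline averages described above can be sketched in a few lines of Python. This is a minimal illustration, not the code we ran: the column names (AIRLINE, DEPARTURE_DELAY, ARRIVAL_DELAY) are assumed to match the Kaggle flights file, the rows are made-up toy values, and early flights (negative delays) are assumed to count toward the average.

```python
import csv
import io
from collections import defaultdict

# Toy rows for illustration only; these are NOT values from the real dataset.
# Column names are assumed to match the Kaggle flights file.
SAMPLE_CSV = """AIRLINE,DEPARTURE_DELAY,ARRIVAL_DELAY
NK,30,40
NK,10,20
F9,25,15
F9,,
WN,5,-5
WN,15,5
"""

def average_delays_by_airline(csv_file):
    """Return {airline: (mean departure delay, mean arrival delay)} in minutes."""
    sums = defaultdict(lambda: [0.0, 0.0])
    counts = defaultdict(lambda: [0, 0])
    for row in csv.DictReader(csv_file):
        for i, column in enumerate(("DEPARTURE_DELAY", "ARRIVAL_DELAY")):
            value = row[column]
            if value == "":
                # Assumption: rows with no recorded delay (e.g. cancellations) are skipped.
                continue
            sums[row["AIRLINE"]][i] += float(value)
            counts[row["AIRLINE"]][i] += 1
    return {
        airline: tuple(
            s / c if c else float("nan")
            for s, c in zip(sums[airline], counts[airline])
        )
        for airline in sums
    }

averages = average_delays_by_airline(io.StringIO(SAMPLE_CSV))
# Rank airlines from longest to shortest average arrival delay.
ranking = sorted(averages, key=lambda airline: averages[airline][1], reverse=True)
```

With pandas available, grouping the flights table by AIRLINE and taking the mean of the two delay columns would produce the same table in one line.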
One possible explanation for the high delay times is that Spirit and Frontier are known for selling cheap tickets. Throughout the first nine months of 2015, Spirit Airlines had an average fare of less than $68. Compare that to Southwest Airlines, whose average fare was roughly $156, more than double that of Spirit. Frontier Airlines is also a cheap alternative, with some of its lowest one-way fares at $29. For both airlines, however, additional fees and long delays may deter some passengers from choosing them.
What about delays at airports?
While the choice of airline is an important factor in travelers’ decisions, the airports they fly to and from matter as well. We wanted to explore how airports differed in terms of flight delay times. Ranking airports by average departure delay (ADD) and average arrival delay (AAD), grouped by origin and by destination, produced four lists: ADD by origin airport, AAD by origin airport, ADD by destination airport, and AAD by destination airport. The top five airports in all of these categories are Hartsfield-Jackson Atlanta International Airport (ATL), Los Angeles International Airport (LAX), Dallas/Fort Worth International Airport (DFW), Denver International Airport (DEN), and Chicago O’Hare International Airport (ORD). On average, the five airports above had a departure delay of more than 15 minutes, and an average arrival delay of more than 20 minutes.
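As a rough sketch of how the four lists can be built and compared, the snippet below ranks airports for each combination of role (origin or destination) and delay measure, then keeps the airports that appear in every top-n list. The field names are assumed to match the Kaggle flights file, and the flights are toy values invented for illustration.

```python
from collections import defaultdict
from itertools import product

# Toy flights for illustration only; the delays are made up.
FLIGHTS = [
    {"ORIGIN_AIRPORT": "ORD", "DESTINATION_AIRPORT": "ATL", "DEPARTURE_DELAY": 30, "ARRIVAL_DELAY": 35},
    {"ORIGIN_AIRPORT": "ATL", "DESTINATION_AIRPORT": "ORD", "DEPARTURE_DELAY": 20, "ARRIVAL_DELAY": 40},
    {"ORIGIN_AIRPORT": "ORD", "DESTINATION_AIRPORT": "SLC", "DEPARTURE_DELAY": 25, "ARRIVAL_DELAY": 20},
    {"ORIGIN_AIRPORT": "SLC", "DESTINATION_AIRPORT": "ORD", "DEPARTURE_DELAY": 30, "ARRIVAL_DELAY": 30},
    {"ORIGIN_AIRPORT": "SLC", "DESTINATION_AIRPORT": "ATL", "DEPARTURE_DELAY": 0, "ARRIVAL_DELAY": 10},
    {"ORIGIN_AIRPORT": "ATL", "DESTINATION_AIRPORT": "SLC", "DEPARTURE_DELAY": 15, "ARRIVAL_DELAY": 5},
]

ROLES = ("ORIGIN_AIRPORT", "DESTINATION_AIRPORT")
MEASURES = ("DEPARTURE_DELAY", "ARRIVAL_DELAY")

def top_airports(flights, role, measure, n):
    """Return the n airports with the highest mean delay for one role and measure."""
    delays = defaultdict(list)
    for flight in flights:
        delays[flight[role]].append(flight[measure])
    means = {airport: sum(values) / len(values) for airport, values in delays.items()}
    return sorted(means, key=means.get, reverse=True)[:n]

def airports_in_every_list(flights, n):
    """Return the airports that appear in all four top-n lists."""
    top_sets = [
        set(top_airports(flights, role, measure, n))
        for role, measure in product(ROLES, MEASURES)
    ]
    return set.intersection(*top_sets)

worst_by_origin_departure = top_airports(FLIGHTS, "ORIGIN_AIRPORT", "DEPARTURE_DELAY", 2)
common = airports_in_every_list(FLIGHTS, 2)
```

In this toy data only ORD lands in all four top-two lists; on the real data the same intersection is what singles out the five airports named above.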
Among the five, ORD had the longest delays, with an average departure delay of 12.5 minutes and an average arrival delay of 22.5 minutes. We initially thought that flight delays would be highest at airports with bad weather. While this seems plausible for ORD, which is located in a city with cold and snowy winters, it became apparent that geographic location and weather were not the primary indicators of delay times. Instead, the total number of flights and connections going through the airport was more strongly correlated with flight delays. This makes sense, as heavy traffic and many connections require more logistics to get flights through the airport on time. All five of the top airports for delays are very busy and are “well connected” nodes in the flight route network, which will be explored in the following section.
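One way to check a relationship like this is to compute a correlation coefficient between each airport's flight count and its average delay. The sketch below uses the Pearson coefficient, which is an assumption on our part here rather than a record of the exact statistic used, and the airport codes and numbers are hypothetical placeholders.

```python
import math

# Hypothetical per-airport summaries: (number of flights, mean delay in minutes).
# Placeholder codes and values, for illustration only.
AIRPORT_SUMMARY = {
    "AAA": (100, 5.0),
    "BBB": (200, 9.0),
    "CCC": (300, 11.0),
    "DDD": (400, 15.0),
}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

traffic = [flights for flights, _ in AIRPORT_SUMMARY.values()]
delay = [mean_delay for _, mean_delay in AIRPORT_SUMMARY.values()]
r = pearson(traffic, delay)  # close to 1 means busier airports have longer delays
```

A value of r near 1 would support the traffic explanation, while a value near 0 would suggest that something else, such as weather, is doing the work.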
Networks in flight patterns
In network analysis, graphs with nodes and edges are analyzed to find characteristics such as groupings of nodes, influence rankings of nodes, and measures of connectivity. Using the OpenFlights Airports Database and the Kaggle USA Airport Dataset, we represented the flight network as a weighted graph in which each node was an airport and each edge represented a flight route between two airports. We then set the weight of each edge to the distance between the two airports.
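The graph described above can be pictured as a dictionary of dictionaries, with one entry per airport and one weighted entry per route. The sketch below is a simplified stand-in for the merge step: the airport set, the routes, and the distances (roughly in miles) are illustrative values, and "XXX" is a made-up code that shows what happens to a route whose airport is missing from the airport table.

```python
# Illustrative inputs, not the article's data.
AIRPORTS = {"ATL", "ORD", "DFW", "MSP"}
ROUTES = [
    ("ATL", "ORD", 606), ("ORD", "ATL", 606),
    ("ATL", "DFW", 731), ("DFW", "ATL", 731),
    ("ORD", "MSP", 334), ("MSP", "ORD", 334),
    ("ORD", "XXX", 100),  # endpoint missing from AIRPORTS, so it is dropped
]

def build_route_graph(airports, routes):
    """Build a directed, weighted graph as {origin: {destination: distance}}."""
    graph = {airport: {} for airport in airports}
    for origin, destination, distance in routes:
        # Keep only routes whose two endpoints both appear in the airport table.
        if origin in graph and destination in graph:
            graph[origin][destination] = distance
    return graph

graph = build_route_graph(AIRPORTS, ROUTES)
route_count = sum(len(destinations) for destinations in graph.values())
```

Dropping routes with unknown endpoints keeps the graph consistent, but it also means that any airport absent from one of the source tables silently disappears from the analysis.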
Once we merged the data and created the graph, we used Gephi, an open-source network analysis tool, to analyze it. Using Gephi, we hoped to find some meaningful ways to characterize and rank airports. There are many algorithms available for analyzing networks, so part of the challenge was finding the most suitable one for our purposes.
Initially, we tried out the HITS algorithm. HITS is traditionally used to analyze online networks and is similar to the better-known PageRank algorithm. While PageRank produces one ranking of the nodes in terms of their influence in the network, HITS produces two: hubs and authorities. Nodes that rank highly as hubs have many outgoing edges to good authorities, and nodes that rank highly as authorities have many incoming edges from good hubs. We thought this would translate well to the route network: hubs would be airports that are used as sources or connecting airports, whereas authorities would be airports that are popular final destinations. However, we immediately encountered a problem: the lists of the biggest hubs and the biggest authorities were almost identical. This was likely because most regular flight routes have a return flight, so most outgoing flights had a corresponding incoming flight. This led the algorithm to rank the airports with the most flight routes as both the biggest hubs and the biggest authorities. The results were ultimately rather uninteresting, so we next tried ranking airports by their betweenness centrality.
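The effect of return flights on HITS is easy to reproduce on a small example. The sketch below is a plain power-iteration version of HITS written for illustration (it is not Gephi's implementation), run on a toy route network in which every route has a return leg.

```python
# Toy directed route network in which every route has a return flight.
ROUTES = {
    "ORD": ["ATL", "DFW", "MSP"],
    "ATL": ["ORD", "DFW"],
    "DFW": ["ORD", "ATL"],
    "MSP": ["ORD"],
}

def hits(graph, iterations=50):
    """Return (hub scores, authority scores), each normalized to sum to 1."""
    hubs = dict.fromkeys(graph, 1.0)
    authorities = dict.fromkeys(graph, 1.0)
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the nodes that point to it.
        authorities = dict.fromkeys(graph, 0.0)
        for node, targets in graph.items():
            for target in targets:
                authorities[target] += hubs[node]
        # Hub score: sum of the authority scores of the nodes it points to.
        hubs = {node: sum(authorities[t] for t in targets) for node, targets in graph.items()}
        # Normalize so the scores stay bounded from one iteration to the next.
        authority_total = sum(authorities.values())
        hub_total = sum(hubs.values())
        authorities = {node: score / authority_total for node, score in authorities.items()}
        hubs = {node: score / hub_total for node, score in hubs.items()}
    return hubs, authorities

hub_scores, authority_scores = hits(ROUTES)
largest_gap = max(abs(hub_scores[n] - authority_scores[n]) for n in ROUTES)
```

Because the network is symmetric, the hub and authority scores converge to the same values, so the two rankings carry no separate information. NetworkX users can run the same experiment with its hits function.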
Betweenness centrality is a measure of how often a particular node lies on the shortest path between any two other nodes. The following equation is used to calculate the betweenness centrality of a node v:
$$g(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}$$
Here, $\sigma_{st}$ is the number of shortest paths from node s to node t, and $\sigma_{st}(v)$ is the number of those shortest paths that pass through node v. We then normalized these numbers so that all airports had betweenness centrality measures between 0 and 1.
This turned out to be a great measure for ranking airports in the network. Airports with high betweenness centrality would make great hubs for connecting domestic flights, because they are often on the shortest path between two other airports.
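To make the measure concrete, here is a compact sketch of Brandes' algorithm for weighted graphs, which computes the sum defined above without enumerating every path. It is an illustration rather than the Gephi computation we ran: the toy distances are illustrative, edge weights are assumed to be positive, and the normalization by (n - 1)(n - 2) is the usual choice for directed graphs, assumed here.

```python
import heapq

# Toy symmetric route network; distances are illustrative, roughly in miles.
GRAPH = {
    "ATL": {"ORD": 606, "DFW": 731},
    "ORD": {"ATL": 606, "DFW": 802, "MSP": 334},
    "DFW": {"ATL": 731, "ORD": 802},
    "MSP": {"ORD": 334},
}

def betweenness_centrality(graph):
    """Normalized betweenness for a directed graph {u: {v: positive weight}}."""
    centrality = dict.fromkeys(graph, 0.0)
    for source in graph:
        # Dijkstra from source, counting shortest paths and recording predecessors.
        dist = {source: 0}
        sigma = dict.fromkeys(graph, 0)
        sigma[source] = 1
        preds = {node: [] for node in graph}
        done, order = set(), []
        heap = [(0, source)]
        while heap:
            d, node = heapq.heappop(heap)
            if node in done:
                continue  # stale heap entry
            done.add(node)
            order.append(node)
            for neighbor, weight in graph[node].items():
                new_d = d + weight
                if neighbor not in dist or new_d < dist[neighbor]:
                    dist[neighbor] = new_d
                    sigma[neighbor] = sigma[node]
                    preds[neighbor] = [node]
                    heapq.heappush(heap, (new_d, neighbor))
                elif new_d == dist[neighbor]:
                    sigma[neighbor] += sigma[node]  # another equally short path
                    preds[neighbor].append(node)
        # Walk back from the farthest nodes, accumulating path dependencies.
        delta = dict.fromkeys(graph, 0.0)
        for node in reversed(order):
            for pred in preds[node]:
                delta[pred] += sigma[pred] / sigma[node] * (1 + delta[node])
            if node != source:
                centrality[node] += delta[node]
    n = len(graph)
    scale = 1 / ((n - 1) * (n - 2)) if n > 2 else 1.0
    return {node: value * scale for node, value in centrality.items()}

scores = betweenness_centrality(GRAPH)
```

In this toy network ORD sits on every shortest path to and from MSP, so it gets a score of 4/6, while the other three airports score 0. NetworkX offers an equivalent betweenness_centrality function with a weight argument.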
In the figure below, we have plotted the top domestic airports, with node size corresponding to betweenness centrality and color corresponding to the population of the city the airport serves (darker colors indicate more populated cities). Each line between two airports represents a flight route between them. To plot the results, we used the Matplotlib, NetworkX, and Basemap libraries in Python, as well as some help from Tuan Doan Nguyen’s article on flight network visualizations.
The airports with the highest betweenness centrality were Hartsfield-Jackson Atlanta International Airport (ATL), Dallas/Fort Worth International Airport (DFW), Chicago O’Hare International Airport (ORD), and Minneapolis-Saint Paul International Airport (MSP). These results line up well with the airports that are usually considered popular hubs for domestic flights. Furthermore, we noticed that many of the airports with the highest betweenness centrality also appeared at the top of the list of airports with the longest delays. This suggests that there may be a correlation between the popularity of an airport, especially as a hub or connecting airport, and the average length of delays experienced by flights to and from it.
Further analyses could be done using techniques similar to those in this article. For example, incorporating the price of flights would let us find not only the shortest paths between two airports but also the cheapest ones. One important limitation was that our dataset with flight distances did not include the Denver airport or a few other, smaller airports. The reason these airports were excluded from that dataset is unknown, and the omission affected the outcomes, especially since Denver is a high-traffic airport. Obtaining another distance dataset that covers more airports would make the analysis more accurate.
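As a sketch of the cheapest-path idea mentioned above, the same shortest-path search can be run with fares in place of distances as edge weights. The fares and distances below are hypothetical values chosen so that the two answers differ; they are not taken from any dataset.

```python
import heapq

# Hypothetical edge weights for the same three routes, for illustration only.
MILES = {
    "MSP": {"ORD": 334, "ATL": 907},
    "ORD": {"MSP": 334, "ATL": 606},
    "ATL": {"MSP": 907, "ORD": 606},
}
FARES = {
    "MSP": {"ORD": 80, "ATL": 250},
    "ORD": {"MSP": 80, "ATL": 90},
    "ATL": {"MSP": 250, "ORD": 90},
}

def best_path(graph, source, target):
    """Dijkstra's algorithm: return (total weight, path) or None if unreachable."""
    heap = [(0, source, [source])]
    visited = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == target:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, weight in graph[node].items():
            if neighbor not in visited:
                heapq.heappush(heap, (cost + weight, neighbor, path + [neighbor]))
    return None

shortest = best_path(MILES, "MSP", "ATL")
cheapest = best_path(FARES, "MSP", "ATL")
```

In this made-up example the shortest trip is the direct flight, while the cheapest trip connects through ORD, which is the kind of difference a price-weighted analysis would surface.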