parente.dev

Exploration of Airline On-Time Performance

October 21, 2014

This post comes from a sample notebook I created for a developerWorks article about running IPython in Docker on SoftLayer. I had fun toying the data and using Seaborn and Cloudant for the first time, so I decided to include it here for posterity.


In this notebook, we explore a sample of data from the U.S. Department of Transportation (US-DOT) Research and Innovative Technology Administration (RITA) Bureau of Transportation Statistics (BTS). The data comes from the On-Time Performance table:

This table contains on-time arrival data for non-stop domestic flights by major air carriers, and provides such additional items as departure and arrival delays, origin and destination airports, flight numbers, scheduled and actual departure and arrival times, cancelled or diverted flights, taxi-out and taxi-in times, air time, and non-stop distance.

Questions

For the purposes of this notebook, I have captured a subset of the table in a Cloudant database. We will start by connecting to the database and simply looking at the available data. Once we understand the content, we will try to answer the following questions about flights during the month of June, 2014:

  1. What is the distribution of departure delays of at least 15 minutes by state? Arrival delays?
  2. Is there a tendency of flights from one state to another to experience a delay of 15 minutes or more on the arriving end?
  3. How did arrival delay in minutes vary day-by-day?

Connect to Cloudant

To get to the data, we can use a Cloudant client for Python. We'll can install the official client by shelling out to bash and running a pip command right here.

In [1]:
!pip install cloudant

Requirement already satisfied (use --upgrade to upgrade): cloudant in /usr/local/lib/python2.7/dist-packages
Requirement already satisfied (use --upgrade to upgrade): requests-futures==0.9.4 in /usr/local/lib/python2.7/dist-packages (from cloudant)
Requirement already satisfied (use --upgrade to upgrade): requests>=1.2.0 in /usr/lib/python2.7/dist-packages (from requests-futures==0.9.4->cloudant)
Requirement already satisfied (use --upgrade to upgrade): futures>=2.1.3 in /usr/local/lib/python2.7/dist-packages (from requests-futures==0.9.4->cloudant)
Cleaning up...

Now we'll import the cloudant package we just installed and use it to connect to the read-only rita_transtats_2014_06 database in the parente user account.

In [2]:
import cloudant
In [3]:
account = cloudant.Account('parente')
In [4]:
database = account.database('rita_transtats_2014_06')

The cloudant package builds on the popular Python requests package. Almost every object that comes back from the API is a subclass of a requests class. This means we can perform a HTTP GET against the database and get the JSON body of the response with a couple method calls.

In [5]:
database.get().json()
Out[5]:

{u'compact_running': False,
u'db_name': u'rita_transtats_2014_06',
u'disk_format_version': 5,
u'disk_size': 160707408,
u'doc_count': 502500,
u'doc_del_count': 0,
u'instance_start_time': u'0',
u'other': {u'data_size': 265247850},
u'purge_seq': 0,
u'update_seq': u'502516-g1AAAADveJzLYWBgYMlgTmGQT0lKzi9KdUhJMtJLykxPyilN1UvOyS9NScwr0ctLLckBKmRKZEiy\_\_\_\_f1YSA-PrcKJ1JTkAyaR6qMZX84nWmMcCJBkagBRQ736w5mgSNR-AaIbYvDgLACDLUQs'}

From the above, we can see the database contains roughly 500,000 documents. We can grab a couple from the database to inspect locally.

In [6]:
items = []
for i, item in enumerate(database.all_docs(params={'include_docs' : True})):
    if i > 1: break
    items.append(item)
print items

[{u'value': {u'rev': u'1-40fc15f608f95f0428e2e9d6b468e483'}, u'id': u'04da0d01eb0f15d5c56eb1399a000a35', u'key': u'04da0d01eb0f15d5c56eb1399a000a35', u'doc': {u'DISTANCE': 1947.0, u'DEST_AIRPORT_ID': 12892, u'ARR_DEL15': 0.0, u'ORIGIN_STATE_ABR': u'GA', u'_rev': u'1-40fc15f608f95f0428e2e9d6b468e483', u'ARR_DELAY_NEW': 0.0, u'UNIQUE_CARRIER': u'DL', u'ORIGIN_AIRPORT_ID': 10397, u'DISTANCE_GROUP': 8, u'DEP_DEL15': 0.0, u'_id': u'04da0d01eb0f15d5c56eb1399a000a35', u'DEST_STATE_ABR': u'CA', u'DEP_DELAY_NEW': 0.0, u'FL_DATE': u'2014-06-30'}}, {u'value': {u'rev': u'1-4dd24d56dc537210f49fe327c7773178'}, u'id': u'04da0d01eb0f15d5c56eb1399a001198', u'key': u'04da0d01eb0f15d5c56eb1399a001198', u'doc': {u'DISTANCE': 282.0, u'DEST_AIRPORT_ID': 13487, u'ARR_DEL15': 0.0, u'ORIGIN_STATE_ABR': u'NE', u'_rev': u'1-4dd24d56dc537210f49fe327c7773178', u'ARR_DELAY_NEW': 0.0, u'UNIQUE_CARRIER': u'DL', u'ORIGIN_AIRPORT_ID': 13871, u'DISTANCE_GROUP': 2, u'DEP_DEL15': 0.0, u'_id': u'04da0d01eb0f15d5c56eb1399a001198', u'DEST_STATE_ABR': u'MN', u'DEP_DELAY_NEW': 0.0, u'FL_DATE': u'2014-06-30'}}]

The dictionary format is hard to read and contains metadata from Cloudant that we don't care about. Let's use the pandas package to put the data in a tabular, HTML format instead.

In [7]:
import pandas
In [8]:
pandas.DataFrame([item['doc'] for item in items])
Out[8]:
ARR_DEL15 ARR_DELAY_NEW DEP_DEL15 DEP_DELAY_NEW DEST_AIRPORT_ID DEST_STATE_ABR DISTANCE DISTANCE_GROUP FL_DATE ORIGIN_AIRPORT_ID ORIGIN_STATE_ABR UNIQUE_CARRIER _id _rev
0 0 0 0 0 12892 CA 1947 8 2014-06-30 10397 GA DL 04da0d01eb0f15d5c56eb1399a000a35 1-40fc15f608f95f0428e2e9d6b468e483
1 0 0 0 0 13487 MN 282 2 2014-06-30 13871 NE DL 04da0d01eb0f15d5c56eb1399a001198 1-4dd24d56dc537210f49fe327c7773178

Returning to the source of the data, we can get definitions for each of these fields.

ARR_DEL15
Arrival Delay Indicator, 15 Minutes or More (1=Yes)
ARR_DEL15
Arrival Delay Indicator, 15 Minutes or More (1=Yes)
ARR_DELAY_NEW
Difference in minutes between scheduled and actual arrival time. Early arrivals set to 0.
DEP_DEL15
Departure Delay Indicator, 15 Minutes or More (1=Yes)
DEP_DELAY_NEW
Difference in minutes between scheduled and actual departure time. Early departures set to 0.
DEST_AIRPORT_ID
Destination Airport, Airport ID. An identification number assigned by US DOT to identify a unique airport. Use this field for airport analysis across a range of years because an airport can change its airport code and airport codes can be reused.
DEST_STATE_ABR
Destination Airport, State Code
DISTANCE
Distance between airports (miles)
DISTANCE_GROUP
Distance Intervals, every 250 Miles, for Flight Segment
FL_DATE
Flight Date (yyyymmdd)
ORIGIN_AIRPORT_ID
Origin Airport, Airport ID. An identification number assigned by US DOT to identify a unique airport. Use this field for airport analysis across a range of years because an airport can change its airport code and airport codes can be reused.
ORIGIN_STATE_ABR
Origin Airport, State Code
UNIQUE_CARRIER
Unique Carrier Code. When the same code has been used by multiple carriers, a numeric suffix is used for earlier users, for example, PA, PA(1), PA(2). Use this field for analysis across a range of years.

For the purposes of the specific questions stated at the top of this notebook, we only need a subset of the available columns, namely delay metrics, origin and destination states, and the flight date. We'll ignore the other fields.

In [9]:
columns = [u'FL_DATE', u'ORIGIN_STATE_ABR', u'DEST_STATE_ABR', u'ARR_DEL15', u'ARR_DELAY_NEW', u'DEP_DEL15', u'DEP_DELAY_NEW', u'DISTANCE', u'DISTANCE_GROUP',]

Moving forward, we'll assume we only have 1 GB of RAM total. Since we're only dealing with half a million records here, we can probably pull the entire contents of the database into local memory. If the data proves too large, we can rely on the map/reduce and search capabilities in Cloudant to work with the data instead.

Being optimistic, we write a little loop that gets up to 20,000 docs at a time from the database. It stores the 20,000 in a simple Python list. Once the buffer reaches the threshold, we create a DataFrame from the buffer which reduces the data to just the fields we want. We do this chunking because appending to a DataFrame one row at a time is much slower.

In [10]:
%%time
dfs = []
buff = []
for i, item in enumerate(database.all_docs(params={'include_docs' : True})):
    buff.append(item['doc'])
    if i > 0 and i % 20000 == 0:
        print 'Processed #{}'.format(i)
        df = pandas.DataFrame(buff, columns=columns)
        dfs.append(df)
        buff = []
# don't forget the leftovers
df = pandas.DataFrame(buff, columns=columns)
dfs.append(df)

Processed #20000
Processed #40000
Processed #60000
Processed #80000
Processed #100000
Processed #120000
Processed #140000
Processed #160000
Processed #180000
Processed #200000
Processed #220000
Processed #240000
Processed #260000
Processed #280000
Processed #300000
Processed #320000
Processed #340000
Processed #360000
Processed #380000
Processed #400000
Processed #420000
Processed #440000
Processed #460000
Processed #480000
Processed #500000
CPU times: user 29 s, sys: 1.92 s, total: 30.9 s
Wall time: 43.5 s

Now we can build one DataFrame by quickly concatenating all the subframes we built in the loop above.

In [11]:
df = pandas.concat(dfs)

At this point, we have two copies of all the data in memory, which is undesirable. Before we delete the temporary buffer and subframes to free up some RAM, let's ensure the DataFrame row count matches the document count in the database.

In [12]:
assert len(df) == database.get().json()['doc_count']
In [13]:
del dfs
del buff
In [14]:
!free -m
             total       used       free     shared    buffers     cached

Mem: 989 700 289 1 73 178
-/+ buffers/cache: 449 540
Swap: 2047 29 2017

As one last step, we reset the index on the DataFrame so that it is a unique, monotonically increasing integer. As it stands, we have dupes in our index because of our chunking (i.e., each chunk starts at index 0). This reset will prove important in some of our later plots where the index must be unique per row.

In [15]:
df = df.reset_index(drop=True)

Distribution of Delay Counts by State

What is the distribution of departure delays of at least 15 minutes by state? Arrival delays?

Let's look at some basic information about delays to start and work up to delays grouped by state. Because the question asks about delays 15 minutes or longer, we'll focus on the DEP_DEL15 and ARR_DEL15 columns.

In [16]:
df.DEP_DEL15.value_counts() / len(df)
Out[16]:

0 0.731817
1 0.248993
dtype: float64
In [17]:
df.ARR_DEL15.value_counts() / len(df)
Out[17]:

0 0.718265
1 0.258054
dtype: float64

Roughly a quarter of all departures and a quarter of all arrivals have delays. We can look at the distribution more closely once we enable and configure plotting with matplotlib and seaborn.

In [18]:
%matplotlib inline
In [19]:
import matplotlib.pyplot as plt
In [20]:
import seaborn as sns
sns.set_palette("deep", desat=.6)
colors = sns.color_palette("deep")
sns.set_context(rc={"figure.figsize": (18, 5)})

Let's group by state now and sum the number of delays. We'll do it for both departure and arrival delays.

In [21]:
by_origin_state = df.groupby('ORIGIN_STATE_ABR')
departure_delay_counts = by_origin_state.DEP_DEL15.sum()
In [22]:
by_dest_state = df.groupby('DEST_STATE_ABR')
arrival_delay_counts = by_dest_state.ARR_DEL15.sum()

To plot, we'll put both series in a DataFrame so we can view the arrival and departure delays for each state side-by-side.

In [23]:
delay_df = pandas.DataFrame([departure_delay_counts, arrival_delay_counts]).T
In [24]:
delay_df.sort('DEP_DEL15', ascending=False).plot(kind='bar', title='Number of delayed flights by state')
Out[24]:

<matplotlib.axes.\_subplots.AxesSubplot at 0x7f67ac978990>

Big states with big airports appear to be in the top five. But we haven't accounted for how many total flights these states service. We should plot the percentage of flights that are delayed.

In [25]:
pct_departure_delay = departure_delay_counts / df.ORIGIN_STATE_ABR.value_counts()
pct_arrival_delay = arrival_delay_counts / df.DEST_STATE_ABR.value_counts()

Ranking states of origin by their percentage of departures tells a different story than the plot above. For example, here we see Illinois and Arkansas at the top of the list whereas IL was third in total departure delay counts and AR was ranked 25th or so. California, which is #2 in the number of total departure delays is only #17 in percentage of departures delayed. Not bad.

In [26]:
pct_departure_delay.order(ascending=False).plot(kind='bar', title='% flights with departure delays by origin state')
Out[26]:

<matplotlib.axes.\_subplots.AxesSubplot at 0x7f67abb67e90>

Similarly, when we look at destination states ranked by percentage of arrivals delayed, we see some new states at the head of the list. For instance, Delaware, second to last in the total number of delays overall, has the highest percentage of arrival delays for inbound flights. Iowa and Kansas are also new entries near the top.

In [27]:
pct_arrival_delay.order(ascending=False).plot(kind='bar', color=colors[1], title='% flights with arrival delay by destination state')
Out[27]:

<matplotlib.axes.\_subplots.AxesSubplot at 0x7f67ad3fec10>

We can get a sense of the difference between the two percentages for each state by plotting them on the same axes. In the plot below, we find that most states see more arrival delays than departure delays. The disparity seems greatest for smaller, less populated states that don't have huge airports (e.g., DE, IA, RI). We can't say for sure without studying more data or perhaps correlating the disparity with the state's ranking in terms of the total number of flights it serviced. We'll leave that as an exercise for the future.

In [28]:
pct_delay_df = pandas.DataFrame([pct_departure_delay, pct_arrival_delay], index=['PCT_DEP_DEL15', 'PCT_ARR_DEL15']).T
In [29]:
pct_delay_df.sort('PCT_ARR_DEL15', ascending=False).plot(kind='bar', title='Overlapping % delay plots for comparison')
Out[29]:

<matplotlib.axes.\_subplots.AxesSubplot at 0x7f67a96854d0>

Delay Tendency

Is there a tendency of flights from one state to another to experience a delay of 15 minutes or more on the arriving end?

While there are many ways to answer this question, we'll look at visualizations of two metrics:

  1. How many times does a delay occur for an (origin → destination) state pair over all flights during the time period? (This is the support of a simple association rule.)
  2. What percentage of total flights from an origin to a destination are delayed during the time period?

First, we'll compute the support. We do so by grouping all of the arrival delay counts by the origin and destination.

In [30]:
from __future__ import division
In [31]:
delay_counts_df = df[['ORIGIN_STATE_ABR', 'DEST_STATE_ABR', 'ARR_DEL15']].groupby(['ORIGIN_STATE_ABR', 'DEST_STATE_ABR']).sum()
delay_counts_df.head()
Out[31]:
ARR_DEL15
ORIGIN_STATE_ABR DEST_STATE_ABR
AK AK 351
AZ 5
CA 11
CO 21
GA 3

We divide each (origin → destination) delay count by the total number of flights during the period.

In [32]:
support = (delay_counts_df / len(df))
support.head()
Out[32]:
ARR_DEL15
ORIGIN_STATE_ABR DEST_STATE_ABR
AK AK 0.000699
AZ 0.000010
CA 0.000022
CO 0.000042
GA 0.000006

We unstack the multiple indices so that we have a N x N matrix with origins as rows and destinations as columns.

In [33]:
support = support.unstack()
support.head()
Out[33]:
ARR_DEL15
DEST_STATE_ABR AK AL AR AZ CA CO CT DE FL GA ... TT TX UT VA VI VT WA WI WV WY
ORIGIN_STATE_ABR
AK 0.000699 NaN NaN 0.000010 0.000022 0.000042 NaN NaN NaN 0.000006 ... NaN 0.000050 0.000004 NaN NaN NaN 0.000209 NaN NaN NaN
AL NaN NaN NaN NaN NaN 0.000064 NaN NaN 0.000080 0.000364 ... NaN 0.000631 NaN 0.000018 NaN NaN NaN NaN NaN NaN
AR NaN NaN NaN 0.000008 0.000036 0.000098 NaN NaN NaN 0.000163 ... NaN 0.000826 0.000002 NaN NaN NaN NaN NaN NaN NaN
AZ 0.000026 NaN 0.000008 0.000129 0.002559 0.000352 NaN NaN 0.000086 0.000113 ... NaN 0.000722 0.000239 0.000072 NaN NaN 0.000291 0.000062 NaN NaN
CA 0.000056 NaN 0.000008 0.001847 0.011355 0.001409 0.000014 NaN 0.000302 0.000344 ... NaN 0.002113 0.000557 0.000406 NaN NaN 0.001423 0.000068 NaN 0

5 rows × 53 columns

Unfortunately, we now have a multilevel set of columns. One way to remove the outer level is to rotate the matrix, reset the outer index to drop it, and then rotate it back.

In the resulting matrix, each cell represents the proportion of total, system-wide flights that were delayed between an (origin → destination) pair.

In [34]:
support = support.T.reset_index(level=0, drop=True).T
support.head()
Out[34]:
DEST_STATE_ABR AK AL AR AZ CA CO CT DE FL GA ... TT TX UT VA VI VT WA WI WV WY
ORIGIN_STATE_ABR
AK 0.000699 NaN NaN 0.000010 0.000022 0.000042 NaN NaN NaN 0.000006 ... NaN 0.000050 0.000004 NaN NaN NaN 0.000209 NaN NaN NaN
AL NaN NaN NaN NaN NaN 0.000064 NaN NaN 0.000080 0.000364 ... NaN 0.000631 NaN 0.000018 NaN NaN NaN NaN NaN NaN
AR NaN NaN NaN 0.000008 0.000036 0.000098 NaN NaN NaN 0.000163 ... NaN 0.000826 0.000002 NaN NaN NaN NaN NaN NaN NaN
AZ 0.000026 NaN 0.000008 0.000129 0.002559 0.000352 NaN NaN 0.000086 0.000113 ... NaN 0.000722 0.000239 0.000072 NaN NaN 0.000291 0.000062 NaN NaN
CA 0.000056 NaN 0.000008 0.001847 0.011355 0.001409 0.000014 NaN 0.000302 0.000344 ... NaN 0.002113 0.000557 0.000406 NaN NaN 0.001423 0.000068 NaN 0

5 rows × 53 columns

At this point, we have a DataFrame that we can query but no clear idea of where to start looking. A visualization of the entire DataFrame can help us find interesting pairs. We borrow and slightly modify some code from seaborn to plot our asymmetric matrix as a heatmap.

In [37]:
import numpy as np
In [38]:
def asymmatplot(plotmat, names=None, cmap="Greys", cmap_range=None, ax=None, **kwargs):
    '''
    Plot an asymmetric matrix with colormap and statistic values. A modification of the
    symmatplot() function in Seaborn to show the upper-half of the matrix.
    
    See https://github.com/mwaskom/seaborn/blob/master/seaborn/linearmodels.py for the original.
    '''
    if ax is None:
        ax = plt.gca()

    nvars = len(plotmat)

    if cmap_range is None:
        vmax = np.nanmax(plotmat) * 1.15
        vmin = np.nanmin(plotmat) * 1.15
    elif len(cmap_range) == 2:
        vmin, vmax = cmap_range
    else:
        raise ValueError("cmap_range argument not understood")

    mat_img = ax.matshow(plotmat, cmap=cmap, vmin=vmin, vmax=vmax, **kwargs)

    plt.colorbar(mat_img, shrink=.75)

    ax.xaxis.set_ticks_position("bottom")
    ax.set_xticklabels(names, rotation=90)
    ax.set_yticklabels(names)

    minor_ticks = np.linspace(-.5, nvars - 1.5, nvars)
    ax.set_xticks(minor_ticks, True)
    ax.set_yticks(minor_ticks, True)
    major_ticks = np.linspace(0, nvars - 1, nvars)
    ax.set_xticks(major_ticks)
    ax.set_yticks(major_ticks)
    ax.grid(False, which="major")
    ax.grid(True, which="minor", linestyle="-")

    return ax

In the plot, gray boxes represent cases where there are no flights between the origin (row) and destination (column) states. We see mostly light yellowish boxes representing state pairs where delays occur, but in tiny numbers compared to the total number of system-wide flights.

We do see a couple very hot spots, namely in the (CA → CA) and (TX → TX) cells. We also see a few moderately hot spots, for example (TX → CA) and (FL → NY). We can interpret these cells as indicators of where delays tend to occur most across the entire set of flights.

In [39]:
fig, ax = plt.subplots(figsize=(18,18))
asymmatplot(support, names=support.columns, ax=ax, cmap='OrRd')
Out[39]:

<matplotlib.axes.\_subplots.AxesSubplot at 0x7f679b621890>

To understand what the percentage of flights for a particular (origin → destination) state pair that are delayed, we can look at a second metric using the same visualization. Here we compute the total number of flights for each (origin → destination) pair.

In [40]:
trip_counts_df = df[['ORIGIN_STATE_ABR', 'DEST_STATE_ABR', 'FL_DATE']].groupby(['ORIGIN_STATE_ABR', 'DEST_STATE_ABR']).count()

To put trip counts DataFrame and our earlier delay_counts DataFrame on the same axes, we rename the columns to COUNTS in both cases.

In [41]:
delay_counts_df = delay_counts_df.rename_axis({'ARR_DEL15' : 'COUNTS'}, axis=1)
trip_counts_df = trip_counts_df.rename_axis({'FL_DATE' : 'COUNTS'}, axis=1)

Now we divide the delay counts by the total trip counts and perform the same transforms we did previous to produce the N by N matrix. In this case, each cell represents the proportion of flights between each (origin → destination) that were delayed.

In [42]:
mat = (delay_counts_df / trip_counts_df).unstack().T.reset_index(level=0, drop=True).T

In this second heatmap plot, the (CA → CA) and (TX → TX) hotspots from the firt visualization no longer stand out. Though there are many in-state delays for these two states, there are even more flights, keeping the percentage of delayed flights for these in-state trips lower than other routes.

To the contrary, we see some cases where all flights from one state to another had arrival delays:

  • (AR → UT)
  • (MT → NY)
  • (CO → RI) and (RI → CO)

We can also see some other moderately hot spots, such as (AK → NJ) and (OK → MN), which seem to have a higher percentage of delays than other state pairs.

One "crosshair" jumps out in the visualization: the row and column representing Illinois are nearly both filled with non-gray cells. On closer inspection, we see Illinois sends flights to and receives flights from every other state except one: TT, the state code abbreviation for U.S. Pacific Trust Territories and Possessions. And though it is difficult to make accurate relative value judgments from this visualization, it appears the run of cells in the row and column for Illinois are darker than most other row or column runs (e.g., GA).

In [43]:
fig, ax = plt.subplots(figsize=(18,18))
asymmatplot(mat, names=mat.columns, ax=ax, cmap='OrRd', cmap_range=(0., 1.0))
Out[43]:

<matplotlib.axes.\_subplots.AxesSubplot at 0x7f679b051950>

Of course, this plot only shows proportions of delayed flights between two states and doesn't depict the number of flights between the two nor the magnitude of the delay. So while all flights between Rhode Island and Colorado were delayed, we need to keep in mind ...

In [44]:
print delay_counts_df.loc['RI', 'CO']
print trip_counts_df.loc['RI', 'CO']

COUNTS 3
Name: (RI, CO), dtype: float64
COUNTS 3
Name: (RI, CO), dtype: int64

there were only three of them!

A visualization that captures both the proportion of (origin → destination) flights delayed as well as the proportion of total flights represented by that state pair is yet another exercise for the future.

Delay Distribution by Date

How did arrival delay in minutes vary day-by-day?

To address this question, we can group the ARR_DELAY_NEW column by date and look at their descriptive stats. A Tukey box plot by day is a reasonable way for us to start.

In [45]:
fig, ax = plt.subplots(figsize=(18,10))
sns.boxplot(df.ARR_DELAY_NEW, df.FL_DATE, ax=ax)
fig.autofmt_xdate()

Oh my, outliers! If the data are to be trusted, there's at least one flight every day that is over 500 minutes (8 hours) late in arriving. In the most extreme case, a flight scheduled on 2014-06-19 appears to have arrived 1800 minutes (30 hours) late.

Whether this information is accurate or not requires some fact checking against other historical sources. For now, we can turn off the fliers to get a better view of the interquartile range.

In [46]:
fig, ax = plt.subplots(figsize=(18,10))
sns.boxplot(df.ARR_DELAY_NEW, df.FL_DATE, ax=ax, showfliers=False)
fig.autofmt_xdate()

We see that the median arrival delay on most days is zero (or better than zero: this data column counts early arrivals as zeros). Some days see greater skew toward longer delays than others, particularly over the five day periods of Sunday, 2014-06-08 through Friday, 2014-06-13 and Monday, 2014-06-23 through Friday, 2014-06-27. There's also another period of elongation from Wednesday, 2014-06-18 through Thursday, 2014-06-19. These might relate to weather patterns during those weeks or an increase in number of passengers (e.g., summer vacations). Further study and sources of data are required to test these hypotheses.

In [47]:
!cal 6 2014
     June 2014

Su Mo Tu We Th Fr Sa  
 1 2 3 4 5 6 7  
 8 9 10 11 12 13 14  
15 16 17 18 19 20 21  
22 23 24 25 26 27 28  
29 30

Going Further

If you wish to take this exploration further, here are some questions you might consider addressing with additional thinking and data.

  • How accurately can a model predict if a flight will be delayed or not using simple features like origin, destination, day of the week, etc.?
  • What factors (e.g., U.S. weather) help explain the greater median arrival delay and dispersion from 2014-06-08 to 2014-06-13?
  • How do the results above contrast with the results from applying the same analyses to data from June, 2001? June, 2002?

Another Read: This is My Blog on Docker + IPython »

Late last year, I decided to refresh my blog with a more responsive design. In doing so, I looked for ways to simplify my blogofile setup. I looked at Wintersmith, Pelican, and PyGreen but in the end decided to roll my own: five Mako templates, 170 lines of Python, and my posts in Markdown. I was pleased. Mostly. Two aspects of my generator nagged me, both of which I corrected this past weekend.