Problem Set 1, due April 13th at 3:30pm

(in other words, it is due at noon, or one minute after 11:59am).

Before You Start

Make sure to at least take a basic tutorial in the IPython notebook, otherwise you'll be totally lost. For this problem set, you should download INFX574-PS1.ipynb and the flights.zip dataset from Canvas. Create a local copy of the notebook and rename it LASTNAME_FIRSTNAME-PS1.ipynb. Then edit your renamed file directly in your browser by typing:

ipython notebook <name_of_downloaded_file>

You should also make sure the following libraries load correctly (click on the box below and hit Ctrl-Enter)

In [2]:
#IPython is what you are using now to run the notebook
import IPython
print("IPython version:      %6.6s (need at least 1.0)" % IPython.__version__)

# Numpy is a library for working with Arrays
import numpy as np
print("Numpy version:        %6.6s (need at least 1.7.1)" % np.__version__)

# SciPy implements many different numerical algorithms
import scipy as sp
print("SciPy version:        %6.6s (need at least 0.12.0)" % sp.__version__)

# Pandas makes working with data tables easier
import pandas as pd
print("Pandas version:       %6.6s (need at least 0.11.0)" % pd.__version__)

# Module for plotting
import matplotlib
print("Mapltolib version:    %6.6s (need at least 1.2.1)" % matplotlib.__version__)

# SciKit Learn implements several Machine Learning algorithms
import sklearn
print("Scikit-Learn version: %6.6s (need at least 0.13.1)" % sklearn.__version__)
IPython version:       5.1.0 (need at least 1.0)
Numpy version:        1.11.3 (need at least 1.7.1)
SciPy version:        0.18.1 (need at least 0.12.0)
Pandas version:       0.19.2 (need at least 0.11.0)
Mapltolib version:     2.0.0 (need at least 1.2.1)
Scikit-Learn version: 0.18.1 (need at least 0.13.1)

About the Problem Set:

This is the same problem set used by Emma Spiro in INFX573. The only difference is that instead of doing the problem set in R, you will use Python and the IPython notebook.

Instructions:

In this problem set you will perform a basic exploratory analysis on an example dataset, bringing to bear all of your new skills in data manipulation and visualization. You will be required to submit well commented python code, documenting all code used in this problem set, along with a write up answering all questions below. Use figures as appropriate to support your answers, and when required by the problem. This data set uses the NYCFlights13 dataset. You can download the dataset from canvas.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
In [3]:
flights_df= pd.read_csv('flights.csv')
In [4]:
print(flights_df.shape)
print(flights_df.columns)
print(flights_df.dtypes)
(336776, 17)
Index(['Unnamed: 0', 'year', 'month', 'day', 'dep_time', 'dep_delay',
       'arr_time', 'arr_delay', 'carrier', 'tailnum', 'flight', 'origin',
       'dest', 'air_time', 'distance', 'hour', 'minute'],
      dtype='object')
Unnamed: 0      int64
year            int64
month           int64
day             int64
dep_time      float64
dep_delay     float64
arr_time      float64
arr_delay     float64
carrier        object
tailnum        object
flight          int64
origin         object
dest           object
air_time      float64
distance        int64
hour          float64
minute        float64
dtype: object
In [9]:
flights_df.dest.unique()
flights_df.head(10)
Out[9]:
Unnamed: 0 year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time distance hour minute
0 1 2013 1 1 517.0 2.0 830.0 11.0 UA N14228 1545 EWR IAH 227.0 1400 5.0 17.0
1 2 2013 1 1 533.0 4.0 850.0 20.0 UA N24211 1714 LGA IAH 227.0 1416 5.0 33.0
2 3 2013 1 1 542.0 2.0 923.0 33.0 AA N619AA 1141 JFK MIA 160.0 1089 5.0 42.0
3 4 2013 1 1 544.0 -1.0 1004.0 -18.0 B6 N804JB 725 JFK BQN 183.0 1576 5.0 44.0
4 5 2013 1 1 554.0 -6.0 812.0 -25.0 DL N668DN 461 LGA ATL 116.0 762 5.0 54.0
5 6 2013 1 1 554.0 -4.0 740.0 12.0 UA N39463 1696 EWR ORD 150.0 719 5.0 54.0
6 7 2013 1 1 555.0 -5.0 913.0 19.0 B6 N516JB 507 EWR FLL 158.0 1065 5.0 55.0
7 8 2013 1 1 557.0 -3.0 709.0 -14.0 EV N829AS 5708 LGA IAD 53.0 229 5.0 57.0
8 9 2013 1 1 557.0 -3.0 838.0 -8.0 B6 N593JB 79 JFK MCO 140.0 944 5.0 57.0
9 10 2013 1 1 558.0 -2.0 753.0 8.0 AA N3ALAA 301 LGA ORD 138.0 733 5.0 58.0

Some Tips

  • This assignment involves extensive Data frame splitting and aggregation. You should look into the details of the methods groupby, transform, sum, count, mean etc
  • Many of the tasks in the assignment can be done either through the Pandas Data Frame or by converting the data frames to Series. Many of the methods in the numpy are applicable to Series only. When stuck, try to explore the type of object (Pandas Data Frame or Numpy Series) you are dealing with.

Question 1

Let’s explore flights from NYC to Seattle. Use the flights dataset to answer the following questions.

(a) How many flights were there from NYC airports to Seattle in 2013?

In [5]:
# filter the rows by destination, then return the length of filtered rows

len(flights_df[(flights_df['dest']=='SEA')])
Out[5]:
3923

3923 flights

(b) How many airlines fly from NYC to Seattle?

In [13]:
# filter rows by finding the unique carriers that fly from NYC to Seattle, then return the length of data frame

len(flights_df[(flights_df['dest']=='SEA')].carrier.unique())
Out[13]:
5

5 airlines

(c) How many unique air planes fly from NYC to Seattle?

In [14]:
# find the unique air planes by counting distinct values in the tailnum column

len(flights_df[(flights_df['dest']=='SEA')].tailnum.unique())
Out[14]:
936

936 unique air planes

(d) What is the average arrival delay for flights from NC to Seattle?

In [16]:
# calculate the mean value of arr_delay column using arr_delay.mean(), based on the filtered rows

flights_df[(flights_df['dest']=='SEA')].arr_delay.mean()
Out[16]:
-1.0990990990990992

The average arrival delay time is -1.099 minutes, which means they arrived about 1.1 minutes earlier.

(e) What proportion of flights to Seattle come from each NYC airport?

In [169]:
# return the unique origin airports that fly to Seattle: EWR and JFK

origin_list = flights_df[(flights_df['dest']=='SEA')].origin.unique()

# for each origin, calculate the proportion of flights

total_flights = len(flights_df[(flights_df['dest']=='SEA')].flight)
for origin in origin_list:
    flights = len(flights_df[(flights_df['dest']=='SEA') & (flights_df['origin']==origin)].flight)
    proportion = str(round(flights/total_flights, 4) * 100)+'%'
    print('Proportion of flights to Seattle come from '+ origin + ' airport is: ' + proportion)
Proportion of flights to Seattle come from EWR airport is: 46.67%
Proportion of flights to Seattle come from JFK airport is: 53.33%

Proportion of flights to Seattle come from EWR airport is about 46.67%; Proportion of flights to Seattle come from JFK airport is about 53.33%.

Question 2

Flights are often delayed. Consider the following questions exploring delay patterns.

(a) Which date has the largest average departure delay? Which date has the largest average arrival delay?

In [94]:
# first, create a new column called "date":
flights_df['date'] = pd.to_datetime((flights_df.year*10000+flights_df.month*100+flights_df.day).apply(str),format='%Y%m%d')

# we can check the new column by printing the head of the data frame:
# flights_df.head(10)

# use .groupby() to aggregate data for each date, then use .mean() to calculate average delay time
dep_delay = flights_df.groupby('date', as_index = False)['dep_delay'].mean()
arr_delay = flights_df.groupby('date', as_index = False)['arr_delay'].mean()

# print the rows that have maximum delay time for 2 new data frames
print(dep_delay.ix[dep_delay['dep_delay'].idxmax()])
print(arr_delay.ix[arr_delay['arr_delay'].idxmax()])
date         2013-03-08 00:00:00
dep_delay                83.5369
Name: 66, dtype: object
date         2013-03-08 00:00:00
arr_delay                85.8622
Name: 66, dtype: object

2013-03-08

(b) What was the worst day to fly out of NYC in 2013 if you dislike delayed flights?

In [170]:
# print the row that has the maximum delay time for departure
print(dep_delay.ix[dep_delay['dep_delay'].idxmax()])
date         2013-03-08 00:00:00
dep_delay                83.5369
Name: 66, dtype: object

2013-03-08

(c) Are there any seasonal patterns in departure delays for flights from NYC?

In [97]:
%matplotlib inline

# in the same way of creating departure delay data frame, create a data frame that aggregate dep_delay by month
month_delay = flights_df.groupby('month', as_index = False)['dep_delay'].mean()

# print out average departure delay time by month:
print(month_delay)

# plot a line graph:
month_delay.plot(x = 'month', y = 'dep_delay')
    month  dep_delay
0       1  10.036665
1       2  10.816843
2       3  13.227076
3       4  13.938038
4       5  12.986859
5       6  20.846332
6       7  21.727787
7       8  12.611040
8       9   6.722476
9      10   6.243988
10     11   5.435362
11     12  16.576688
Out[97]:
<matplotlib.axes._subplots.AxesSubplot at 0x11d6539b0>

From the table and the graph we can see there is obvious seasonal pattern. The departure delay time gets long in summer, gets longest in July; then decrease in autumn, and gets shortest in November.

(d) On average, how do departure delays vary over the course of a day?

In [98]:
# in the same way above, create a data frame that aggregate dep_delay by hour in a day
hour_delay = flights_df.groupby('hour', as_index = False)['dep_delay'].mean()

# print out average departure delay time by hour:
print(hour_delay)

# plot a line graph:
hour_delay.plot(x = 'hour', y = 'dep_delay')
    hour   dep_delay
0    0.0  128.642452
1    1.0  207.049327
2    2.0  239.921875
3    3.0  304.727273
4    4.0   -5.554098
5    5.0   -4.355644
6    6.0   -1.520552
7    7.0    0.223289
8    8.0    1.091432
9    9.0    4.268295
10  10.0    5.548644
11  11.0    5.652309
12  12.0    7.601892
13  13.0    9.380639
14  14.0    8.090381
15  15.0   10.682049
16  16.0   13.621150
17  17.0   16.756593
18  18.0   18.664671
19  19.0   21.497342
20  20.0   28.266213
21  21.0   42.064964
22  22.0   68.190042
23  23.0   96.946865
24  24.0   64.862069
Out[98]:
<matplotlib.axes._subplots.AxesSubplot at 0x11d857630>

From the table and the graph we can see that in a day, the average departure delay time is long during the early morning, and is longest at 3am; it falls below zero from 4am, then starts increasing again until 23pm. Overall, it is shorter during the day, and longer during the night.

Question 3

Which flight departing NYC in 2013 flew the fastest?
In [174]:
# create a new column "speed" that calculates the speed of each flight (in miles/minute)
flights_df['speed'] = flights_df['distance']/flights_df['air_time']

# sort and filter the data frame by descending speed values and wanted columns
sort_df = flights_df.sort_values(by = 'speed', ascending=False)
sort_df = sort_df[['carrier', 'tailnum', 'flight', 'origin', 'dest', 'date', 'speed']]

# return the first row, which has the highest flight speed
sort_df.head(1)
Out[174]:
carrier tailnum flight origin dest date speed
216447 DL N666DN 1499 LGA ATL 2013-05-25 11.723077

The fatest flight's carrier was DL, tail number was N666DN, flight number was 1499, departed from LGA to ATL in May 25th, 2013. Its speed was 11.72 miles per minute.

Question 4

Which flights (i.e. carrier + flight + dest) happen every day? Where do they fly to?

In [181]:
# create a new column "f_join" that joins "carrier", "flight number" and "destination" together as a string
flights_df['f_join'] = flights_df['carrier'] + flights_df['flight'].astype(str) + ' - ' + flights_df['dest']

# for each unique "f_join", count the number of days that it appears in the data frame
f_count = flights_df.groupby('f_join', as_index = False)['date'].count()

# find the flights that appeared 365 times in the "date" column, which means this flight happened every day in 2013
res = f_count.loc[f_count['date'] == 365]

# return number of wanted flights using len()
print('Number of flights is: ' + str(len(res)))
print(res)
Number of flights is: 18
             f_join  date
790     AA119 - LAX   365
801    AA1357 - SJU   365
813    AA1611 - MIA   365
842     AA181 - LAX   365
941      AA59 - SFO   365
1130   B61783 - MCO   365
1178    B6219 - CLT   365
1241    B6359 - BUR   365
1246    B6371 - FLL   365
1272    B6431 - SRQ   365
1362    B6703 - SJU   365
1735   DL2159 - MCO   365
1805   DL2391 - TPA   365
4631   EV5712 - IAD   365
6561     UA15 - HNL   365
10605   VX251 - LAS   365
10609   VX407 - LAX   365
10613   VX413 - LAX   365

There are 18 flights happen every day. Their carrier, flight number and destination informatoin is shown in the data frame above.

Question 5

Develop one research question you can address using the nycflights2013 dataset. Provide two visualizations to support your exploration of this question. Discuss what you find.

Research Question: does flying distance and time affect arrival delay time?

In [9]:
import statsmodels.api as sm
%matplotlib inline

# create a data frame that stores 2 variables: distance and arrival delay
dist_delay = flights_df.groupby('distance', as_index = False)['arr_delay'].mean()

# plot a scatter plot that takes distance as predictor, and arrival delay as response
x = dist_delay.distance
y = dist_delay.arr_delay
plt.xlabel("Distance")
plt.ylabel("Average Arrival Delay")
plt.scatter(x, y, alpha=0.4)
Out[9]:
<matplotlib.collections.PathCollection at 0x11cf8deb8>
In [129]:
# in the same way, create a data frame that stores 2 variables: air time and arrival delay
dist_delay = flights_df.groupby('air_time', as_index = False)['arr_delay'].mean()

# plot a scatter plot that takes distance as predictor, and arrival delay as response
x = dist_delay.air_time
y = dist_delay.arr_delay
plt.xlabel("Air Time")
plt.ylabel("Average Arrival Delay")
plt.scatter(x, y, alpha=0.4)
Out[129]:
<matplotlib.collections.PathCollection at 0x13e160b38>

Discussion: from these 2 scatter plots we can see that both flying distance and flying time affect the arrival delay time. As the distance increases, the average arrival delay time gets shorter; for flying time between 300 and 400 minutes, the average arrival delay gets longer with the time; also when flying time exceeds 600 minutes, the arrival delay time increases along with the fly time.

Question 6

What weather conditions are associated with flight delays leaving NYC? Use graphics to explore.

In [7]:
# import weather.csv
weather_df= pd.read_csv('weather.csv')

# merge 2 datasets on joined date_hour column
flights_df['date_time'] = pd.to_datetime(flights_df[['year', 'month', 'day', 'hour']])
weather_df['date_time'] = pd.to_datetime(weather_df[['year', 'month', 'day', 'hour']])
new_df = flights_df.merge(weather_df,on='date_time')
# new_df.head(10)

1) Temperature vs. Delay:

In [137]:
# aggregate average departure delay by temperature
temp_delay = new_df.groupby('temp', as_index = False)['dep_delay'].mean()

# plot a scatter plot that takes temperature as predictor, and departure delay as response
plt.xlabel("Temperature")
plt.ylabel("Average Departure Delay")
plt.scatter(temp_delay.temp, temp_delay.dep_delay, alpha=0.4)
Out[137]:
<matplotlib.collections.PathCollection at 0x13faeb588>

2) Humidity vs. Delay:

In [140]:
# aggregate average departure delay by humidity
humid_delay = new_df.groupby('humid', as_index = False)['dep_delay'].mean()

# plot a scatter plot that takes humidity as predictor, and departure delay as response
plt.xlabel("Humidity")
plt.ylabel("Average Departure Delay")
plt.scatter(humid_delay.humid, humid_delay.dep_delay, alpha=0.4)
Out[140]:
<matplotlib.collections.PathCollection at 0x1367034a8>

3) Wind Direction vs. Delay:

In [141]:
# aggregate average departure delay by wind direction 
dir_delay = new_df.groupby('wind_dir', as_index = False)['dep_delay'].mean()

# plot a scatter plot that takes wind direction as predictor, and departure delay as response
plt.xlabel("Wind Direction")
plt.ylabel("Average Departure Delay")
plt.scatter(dir_delay.wind_dir, dir_delay.dep_delay, alpha=0.4)
Out[141]:
<matplotlib.collections.PathCollection at 0x13ad782e8>

4) Wind Speed vs. Delay:

In [191]:
# aggregate average departure delay by wind speed 
speed_delay = new_df.groupby('wind_speed', as_index = False)['dep_delay'].mean()

# plot a scatter plot that takes wind speed as predictor, and departure delay as response
plt.xlabel("Wind Speed")
plt.ylabel("Average Departure Delay")
plt.scatter(speed_delay.wind_speed, speed_delay.dep_delay, alpha=0.4)

# filter out out-liers
plt.xlim(-3,45)
Out[191]:
(-3, 45)

5) Wind Gust vs. Delay:

In [190]:
# aggregate average departure delay by wind gust
gust_delay = new_df.groupby('wind_gust', as_index = False)['dep_delay'].mean()

# plot a scatter plot that takes wind gust as predictor, and departure delay as response
plt.xlabel("Wind Gust")
plt.ylabel("Average Departure Delay")
plt.scatter(gust_delay.wind_gust, gust_delay.dep_delay, alpha=0.4)

# filter out out-liers
plt.xlim(-3,50)
Out[190]:
(-3, 50)

6) Pressure vs. Delay:

In [194]:
# aggregate average departure delay by pressure
pre_delay = new_df.groupby('pressure', as_index = False)['dep_delay'].mean()

# plot a scatter plot that takes the pressure as predictor, and departure delay as response
plt.xlabel("Pressure")
plt.ylabel("Average Departure Delay")
plt.scatter(pre_delay.pressure, pre_delay.dep_delay, alpha=0.4)

# filter out out-liers
plt.ylim(-15,130)
Out[194]:
(-15, 130)

7) Precipitation vs. Delay

In [14]:
# aggregate average departure delay by precipitation
precip_delay = new_df.groupby('precip', as_index = False)['dep_delay'].mean()

# plot a scatter plot that takes the precipitation as predictor, and departure delay as response
plt.xlabel("Precipitation")
plt.ylabel("Average Departure Delay")
plt.scatter(precip_delay.precip, precip_delay.dep_delay, alpha=0.4)

# filter out out-liers
plt.xlim(-0.01,0.5)
Out[14]:
(-0.01, 0.5)

8) Visibility vs. Delay:

In [196]:
# aggregate average departure delay by visibility
visib_delay = new_df.groupby('visib', as_index = False)['dep_delay'].mean()

# plot a scatter plot that takes visibility as predictor, and departure delay as response
plt.xlabel("Visibility")
plt.ylabel("Average Departure Delay")
plt.scatter(visib_delay.visib, visib_delay.dep_delay, alpha=0.4)
Out[196]:
<matplotlib.collections.PathCollection at 0x141f5b908>

Interpretation: from the graphs we can see that overall, temperature, wind speed and wind gust are likely to have a positive effect on the average departure delay. Pressure and visibility are likely to have negative effect on the average departure delay time. wind direction has a positive effect when it's below about 130; and has a negative effect over that number. For humidity and precipitation, I don't see an obvious association from the graph.