(in other words, it is due at noon, or one minute after 11:59am).
Make sure to at least take a basic tutorial in the IPython notebook, otherwise you'll be totally lost. For this problem set, you should download INFX574-PS1.ipynb and the flights.zip dataset from Canvas. Create a local copy of the notebook and rename it LASTNAME_FIRSTNAME-PS1.ipynb. Then edit your renamed file directly in your browser by typing:
ipython notebook <name_of_downloaded_file>
You should also make sure the following libraries load correctly (click on the box below and hit Ctrl-Enter)
#IPython is what you are using now to run the notebook
import IPython
print("IPython version: %6.6s (need at least 1.0)" % IPython.__version__)
# Numpy is a library for working with Arrays
import numpy as np
print("Numpy version: %6.6s (need at least 1.7.1)" % np.__version__)
# SciPy implements many different numerical algorithms
import scipy as sp
print("SciPy version: %6.6s (need at least 0.12.0)" % sp.__version__)
# Pandas makes working with data tables easier
import pandas as pd
print("Pandas version: %6.6s (need at least 0.11.0)" % pd.__version__)
# Module for plotting
import matplotlib
print("Mapltolib version: %6.6s (need at least 1.2.1)" % matplotlib.__version__)
# SciKit Learn implements several Machine Learning algorithms
import sklearn
print("Scikit-Learn version: %6.6s (need at least 0.13.1)" % sklearn.__version__)
This is the same problem set used by Emma Spiro in INFX573. The only difference is that instead of doing the problem set in R, you will use Python and the IPython notebook.
In this problem set you will perform a basic exploratory analysis on an example dataset, bringing to bear all of your new skills in data manipulation and visualization. You will be required to submit well commented python code, documenting all code used in this problem set, along with a write up answering all questions below. Use figures as appropriate to support your answers, and when required by the problem. This data set uses the NYCFlights13 dataset. You can download the dataset from canvas.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
flights_df= pd.read_csv('flights.csv')
print(flights_df.shape)
print(flights_df.columns)
print(flights_df.dtypes)
flights_df.dest.unique()
flights_df.head(10)
Let’s explore flights from NYC to Seattle. Use the flights dataset to answer the following questions.
(a) How many flights were there from NYC airports to Seattle in 2013?
# filter the rows by destination, then return the length of filtered rows
len(flights_df[(flights_df['dest']=='SEA')])
3923 flights
(b) How many airlines fly from NYC to Seattle?
# filter rows by finding the unique carriers that fly from NYC to Seattle, then return the length of data frame
len(flights_df[(flights_df['dest']=='SEA')].carrier.unique())
5 airlines
(c) How many unique air planes fly from NYC to Seattle?
# find the unique air planes by counting distinct values in the tailnum column
len(flights_df[(flights_df['dest']=='SEA')].tailnum.unique())
936 unique air planes
(d) What is the average arrival delay for flights from NC to Seattle?
# calculate the mean value of arr_delay column using arr_delay.mean(), based on the filtered rows
flights_df[(flights_df['dest']=='SEA')].arr_delay.mean()
The average arrival delay time is -1.099 minutes, which means they arrived about 1.1 minutes earlier.
(e) What proportion of flights to Seattle come from each NYC airport?
# return the unique origin airports that fly to Seattle: EWR and JFK
origin_list = flights_df[(flights_df['dest']=='SEA')].origin.unique()
# for each origin, calculate the proportion of flights
total_flights = len(flights_df[(flights_df['dest']=='SEA')].flight)
for origin in origin_list:
flights = len(flights_df[(flights_df['dest']=='SEA') & (flights_df['origin']==origin)].flight)
proportion = str(round(flights/total_flights, 4) * 100)+'%'
print('Proportion of flights to Seattle come from '+ origin + ' airport is: ' + proportion)
Proportion of flights to Seattle come from EWR airport is about 46.67%; Proportion of flights to Seattle come from JFK airport is about 53.33%.
Flights are often delayed. Consider the following questions exploring delay patterns.
(a) Which date has the largest average departure delay? Which date has the largest average arrival delay?
# first, create a new column called "date":
flights_df['date'] = pd.to_datetime((flights_df.year*10000+flights_df.month*100+flights_df.day).apply(str),format='%Y%m%d')
# we can check the new column by printing the head of the data frame:
# flights_df.head(10)
# use .groupby() to aggregate data for each date, then use .mean() to calculate average delay time
dep_delay = flights_df.groupby('date', as_index = False)['dep_delay'].mean()
arr_delay = flights_df.groupby('date', as_index = False)['arr_delay'].mean()
# print the rows that have maximum delay time for 2 new data frames
print(dep_delay.ix[dep_delay['dep_delay'].idxmax()])
print(arr_delay.ix[arr_delay['arr_delay'].idxmax()])
2013-03-08
(b) What was the worst day to fly out of NYC in 2013 if you dislike delayed flights?
# print the row that has the maximum delay time for departure
print(dep_delay.ix[dep_delay['dep_delay'].idxmax()])
2013-03-08
(c) Are there any seasonal patterns in departure delays for flights from NYC?
%matplotlib inline
# in the same way of creating departure delay data frame, create a data frame that aggregate dep_delay by month
month_delay = flights_df.groupby('month', as_index = False)['dep_delay'].mean()
# print out average departure delay time by month:
print(month_delay)
# plot a line graph:
month_delay.plot(x = 'month', y = 'dep_delay')
From the table and the graph we can see there is obvious seasonal pattern. The departure delay time gets long in summer, gets longest in July; then decrease in autumn, and gets shortest in November.
(d) On average, how do departure delays vary over the course of a day?
# in the same way above, create a data frame that aggregate dep_delay by hour in a day
hour_delay = flights_df.groupby('hour', as_index = False)['dep_delay'].mean()
# print out average departure delay time by hour:
print(hour_delay)
# plot a line graph:
hour_delay.plot(x = 'hour', y = 'dep_delay')
From the table and the graph we can see that in a day, the average departure delay time is long during the early morning, and is longest at 3am; it falls below zero from 4am, then starts increasing again until 23pm. Overall, it is shorter during the day, and longer during the night.
Which flight departing NYC in 2013 flew the fastest?
# create a new column "speed" that calculates the speed of each flight (in miles/minute)
flights_df['speed'] = flights_df['distance']/flights_df['air_time']
# sort and filter the data frame by descending speed values and wanted columns
sort_df = flights_df.sort_values(by = 'speed', ascending=False)
sort_df = sort_df[['carrier', 'tailnum', 'flight', 'origin', 'dest', 'date', 'speed']]
# return the first row, which has the highest flight speed
sort_df.head(1)
The fatest flight's carrier was DL, tail number was N666DN, flight number was 1499, departed from LGA to ATL in May 25th, 2013. Its speed was 11.72 miles per minute.
Which flights (i.e. carrier + flight + dest) happen every day? Where do they fly to?
# create a new column "f_join" that joins "carrier", "flight number" and "destination" together as a string
flights_df['f_join'] = flights_df['carrier'] + flights_df['flight'].astype(str) + ' - ' + flights_df['dest']
# for each unique "f_join", count the number of days that it appears in the data frame
f_count = flights_df.groupby('f_join', as_index = False)['date'].count()
# find the flights that appeared 365 times in the "date" column, which means this flight happened every day in 2013
res = f_count.loc[f_count['date'] == 365]
# return number of wanted flights using len()
print('Number of flights is: ' + str(len(res)))
print(res)
There are 18 flights happen every day. Their carrier, flight number and destination informatoin is shown in the data frame above.
Develop one research question you can address using the nycflights2013 dataset. Provide two visualizations to support your exploration of this question. Discuss what you find.
import statsmodels.api as sm
%matplotlib inline
# create a data frame that stores 2 variables: distance and arrival delay
dist_delay = flights_df.groupby('distance', as_index = False)['arr_delay'].mean()
# plot a scatter plot that takes distance as predictor, and arrival delay as response
x = dist_delay.distance
y = dist_delay.arr_delay
plt.xlabel("Distance")
plt.ylabel("Average Arrival Delay")
plt.scatter(x, y, alpha=0.4)
# in the same way, create a data frame that stores 2 variables: air time and arrival delay
dist_delay = flights_df.groupby('air_time', as_index = False)['arr_delay'].mean()
# plot a scatter plot that takes distance as predictor, and arrival delay as response
x = dist_delay.air_time
y = dist_delay.arr_delay
plt.xlabel("Air Time")
plt.ylabel("Average Arrival Delay")
plt.scatter(x, y, alpha=0.4)
Discussion: from these 2 scatter plots we can see that both flying distance and flying time affect the arrival delay time. As the distance increases, the average arrival delay time gets shorter; for flying time between 300 and 400 minutes, the average arrival delay gets longer with the time; also when flying time exceeds 600 minutes, the arrival delay time increases along with the fly time.
What weather conditions are associated with flight delays leaving NYC? Use graphics to explore.
# import weather.csv
weather_df= pd.read_csv('weather.csv')
# merge 2 datasets on joined date_hour column
flights_df['date_time'] = pd.to_datetime(flights_df[['year', 'month', 'day', 'hour']])
weather_df['date_time'] = pd.to_datetime(weather_df[['year', 'month', 'day', 'hour']])
new_df = flights_df.merge(weather_df,on='date_time')
# new_df.head(10)
# aggregate average departure delay by temperature
temp_delay = new_df.groupby('temp', as_index = False)['dep_delay'].mean()
# plot a scatter plot that takes temperature as predictor, and departure delay as response
plt.xlabel("Temperature")
plt.ylabel("Average Departure Delay")
plt.scatter(temp_delay.temp, temp_delay.dep_delay, alpha=0.4)
# aggregate average departure delay by humidity
humid_delay = new_df.groupby('humid', as_index = False)['dep_delay'].mean()
# plot a scatter plot that takes humidity as predictor, and departure delay as response
plt.xlabel("Humidity")
plt.ylabel("Average Departure Delay")
plt.scatter(humid_delay.humid, humid_delay.dep_delay, alpha=0.4)
# aggregate average departure delay by wind direction
dir_delay = new_df.groupby('wind_dir', as_index = False)['dep_delay'].mean()
# plot a scatter plot that takes wind direction as predictor, and departure delay as response
plt.xlabel("Wind Direction")
plt.ylabel("Average Departure Delay")
plt.scatter(dir_delay.wind_dir, dir_delay.dep_delay, alpha=0.4)
# aggregate average departure delay by wind speed
speed_delay = new_df.groupby('wind_speed', as_index = False)['dep_delay'].mean()
# plot a scatter plot that takes wind speed as predictor, and departure delay as response
plt.xlabel("Wind Speed")
plt.ylabel("Average Departure Delay")
plt.scatter(speed_delay.wind_speed, speed_delay.dep_delay, alpha=0.4)
# filter out out-liers
plt.xlim(-3,45)
# aggregate average departure delay by wind gust
gust_delay = new_df.groupby('wind_gust', as_index = False)['dep_delay'].mean()
# plot a scatter plot that takes wind gust as predictor, and departure delay as response
plt.xlabel("Wind Gust")
plt.ylabel("Average Departure Delay")
plt.scatter(gust_delay.wind_gust, gust_delay.dep_delay, alpha=0.4)
# filter out out-liers
plt.xlim(-3,50)
# aggregate average departure delay by pressure
pre_delay = new_df.groupby('pressure', as_index = False)['dep_delay'].mean()
# plot a scatter plot that takes the pressure as predictor, and departure delay as response
plt.xlabel("Pressure")
plt.ylabel("Average Departure Delay")
plt.scatter(pre_delay.pressure, pre_delay.dep_delay, alpha=0.4)
# filter out out-liers
plt.ylim(-15,130)
# aggregate average departure delay by precipitation
precip_delay = new_df.groupby('precip', as_index = False)['dep_delay'].mean()
# plot a scatter plot that takes the precipitation as predictor, and departure delay as response
plt.xlabel("Precipitation")
plt.ylabel("Average Departure Delay")
plt.scatter(precip_delay.precip, precip_delay.dep_delay, alpha=0.4)
# filter out out-liers
plt.xlim(-0.01,0.5)
# aggregate average departure delay by visibility
visib_delay = new_df.groupby('visib', as_index = False)['dep_delay'].mean()
# plot a scatter plot that takes visibility as predictor, and departure delay as response
plt.xlabel("Visibility")
plt.ylabel("Average Departure Delay")
plt.scatter(visib_delay.visib, visib_delay.dep_delay, alpha=0.4)
Interpretation: from the graphs we can see that overall, temperature, wind speed and wind gust are likely to have a positive effect on the average departure delay. Pressure and visibility are likely to have negative effect on the average departure delay time. wind direction has a positive effect when it's below about 130; and has a negative effect over that number. For humidity and precipitation, I don't see an obvious association from the graph.