Exploratory analysis of the air crash data set Import the data set Casualty analysis Model processing time analysis

Exploratory analysis of the air crash data set Import the data set Casualty analysis Model processing time analysis

Written in front: This is the most serious data set I have ever seen. Behind almost every row of data is the price of life and blood. This exploratory analysis is not an attempt to explain anything, but merely an exercise in data processing capabilities. Therefore, this exploratory analysis will only show what the data should look like without making too many evaluations. There is a saying called "Because we cherish peace, we look back on war". Here too, because I cherish life, I look back on the plane crash. The safe flight is now the lives of more than 100,000 innocent people, paying tribute to these great explorers.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Import data set

crash = pd.read_csv("./Airplane_Crashes_and_Fatalities_Since_1908.csv")
crash.info()
<class'pandas.core.frame.DataFrame'>
RangeIndex: 5268 entries, 0 to 5267
Data columns (total 13 columns):
Date 5268 non-null object
Time 3049 non-null object
Location 5248 non-null object
Operator 5250 non-null object
Flight # 1069 non-null object
Route 3562 non-null object
Type 5241 non-null object
Registration 4933 non-null object
cn/In 4040 non-null object
Aboard 5246 non-null float64
Fatalities 5256 non-null float64
Ground 5246 non-null float64
Summary 4878 non-null object
dtypes: float64(3), object(10)
memory usage: 535.1+ KB
crash = crash.drop(["Summary","cn/In","Flight #","Route","Location"],axis=1)
crash.info()
<class'pandas.core.frame.DataFrame'>
RangeIndex: 5268 entries, 0 to 5267
Data columns (total 8 columns):
Date 5268 non-null object
Time 3049 non-null object
Operator 5250 non-null object
Type 5241 non-null object
Registration 4933 non-null object
Aboard 5246 non-null float64
Fatalities 5256 non-null float64
Ground 5246 non-null float64
dtypes: float64(3), object(5)
memory usage: 329.3+ KB
print(crash[2200:2205])
            Date Time Operator Type/
2200 03/06/1968 8:00 Military-US Air Force Fairchild C-123K   
2201 03/08/1968 19:18 Air Manila Fairchild F-27   
2202 03/09/1968 23:20 Military-French Air Force Douglas DC6B   
2203 03/19/1968 19:37 Viking Airways-Air Taxi Cessna 182   
2204 03/23/1968 13:00 Fortaire Aviation-Air Taxi Brantly 305   

     Registration Aboard Fatalities Ground  
2200 54-0590 49.0 49.0 0.0  
2201 PI-C871 14.0 14.0 0.0  
2202 43748 20.0 19.0 0.0  
2203 N2623F 2.0 2.0 0.0  
2204 N2224U 5.0 3.0 0.0  

Casualty analysis

Casualty ranking

print(crash["Fatalities"].sum())
fatal_crash = crash[crash["Fatalities"].notnull()]
fatal_crash = fatal_crash.sort_values(by="Fatalities")
print(fatal_crash[-5:])
105479.0
            Date Time Operator/
3562 06/23/1985 7:15 Air India   
2726 03/03/1974 11:41 Turkish Airlines (THY)   
4455 11/12/1996 18:40 Saudi Arabian Airlines/Kazastan Airlines   
3568 08/12/1985 18:56 Japan Air Lines   
2963 03/27/1977 17:07 Pan American World Airways/KLM   

                                      Type Registration Aboard/
3562 Boeing B-747-237B VT-EFO 329.0   
2726 McDonnell Douglas DC-10-10 TC-JAV 346.0   
4455 Boeing B-747-168B/Ilyushin IL-76TD HZAIH/UN-76435 349.0   
3568 Boeing B-747-SR46 JA8119 524.0   
2963 Boeing B-747-121/Boeing B-747-206B N736PA/PH-BUF 644.0   

      Fatalities Ground  
3562 329.0 0.0  
2726 346.0 0.0  
4455 349.0 0.0  
3568 520.0 0.0  
2963 583.0 0.0  
  • Neterife Air Crash: Two Boeing-747s collided, 583 people killed, also known as the Air Crash
  • JAL 123 crash: Boeing 747 crashed into Mount Fuji, the highest death record in a single plane crash
  • The Chalkidadri collision incident, the most serious aerial collision incident
  • Turkish Airlines Flight 981 crash: Unlocked cargo door causes explosive pressure
  • Air India Flight 182: terrorist attack

Probability of casualties

print(fatal_crash["Fatalities"].sum()/fatal_crash["Aboard"].sum())
0.729700936002
print(fatal_crash[fatal_crash["Fatalities"] != 0]["Fatalities"].sum()/fatal_crash[fatal_crash["Fatalities"] != 0]["Aboard"].sum())
0.754191781605    

Model handling

Processing function

type_crash = fatal_crash["Type"]
def type_handle(x):
    x = str(x)
    if "McDonnell Douglas" in x:
        return "McDonnell Douglas"
    elif "Douglas" in x:
        return "Douglas"
    elif "Boeing" in x:
        return "Boeing"
    elif "Airbus" in x:
        return "Airbus"
    elif "Embraer" in x:
        return "Embraer"
    elif "Ilyushin" in x:
        return "Ilyushin"
    else:
        return "other"
company_crash = type_crash.map(type_handle)
print(pd.value_counts(company_crash))
other 3581
Douglas 984
Boeing 376
McDonnell Douglas 123
Ilyushin 96
Embraer 61
Airbus 35
Name: Type, dtype: int64
fatal_crash["company"] = company_crash
print(fatal_crash[:2])
            Date Time Operator Type Registration/
108 10/21/1926 13:15 Imperial Airways Handley Page W-10 G-EBMS   
5178 11/08/2007 8:00 Juba Air Cargo Antonov 12 ST-JUA   

      Aboard Fatalities Ground year month company  
108 12.0 0.0 0.0 1926 10 other  
5178 4.0 0.0 2.0 2007 11 other  

process result

def airplane_counte(x):
    fatal_ratio = x["Fatalities"].sum()/x["Aboard"].sum()
    crash_time = x.shape[0]
    fatal_num = x["Fatalities"].sum()
    return pd.Series({"fatal_num":fatal_num,"crash_time":crash_time,"fatal_ratio":fatal_ratio})

company = fatal_crash.groupby(['company']).apply(airplane_counte)
print(company)
plt.close()
plt.figure(figsize=(16,4))
plt.subplot(1,3,1)
company['crash_time'].drop("other").plot(kind='bar',title="time")
plt.subplot(1,3,2)
company['fatal_num'].drop("other").plot(kind='bar',title="fatal_num")
plt.subplot(1,3,3)
company['fatal_ratio'].plot(kind='bar',title="fatal_ratio")
plt.show()
                   crash_time fatal_num fatal_ratio
company                                              
Airbus 35.0 2980.0 0.510711
Boeing 376.0 18705.0 0.649434
Douglas 984.0 16899.0 0.794350
Embraer 61.0 644.0 0.779661
Ilyushin 96.0 4547.0 0.883084
McDonnell Douglas 123.0 6827.0 0.531946
other 3581.0 54877.0 0.785854

Analysis results by vendor

Time analysis

year

def get_year(x):
    return x.split("/")[-1]
fatal_crash['year'] = fatal_crash["Date"].map(get_year)
year_fatal = fatal_crash[fatal_crash["year"] != np.NaN][["year","Fatalities","Aboard"]]
year_fatal.info()
<class'pandas.core.frame.DataFrame'>
Int64Index: 5256 entries, 108 to 2963
Data columns (total 3 columns):
year 5256 non-null object
Fatalities 5256 non-null float64
Aboard 5246 non-null float64
dtypes: float64(2), object(1)
memory usage: 164.2+ KB
def year_analysis(x):
    return pd.Series({"fatal_num":x["Fatalities"].sum(),"time":x.shape[0],"fatal_ratio":x["Fatalities"].sum()/x[" Aboard"].sum()})
year = year_fatal.groupby(["year"]).apply(year_analysis)
year = year.sort_index()
year.info()
<class'pandas.core.frame.DataFrame'>
Index: 98 entries, 1908 to 2009
Data columns (total 3 columns):
fatal_num 98 non-null float64
fatal_ratio 98 non-null float64
time 98 non-null float64
dtypes: float64(3)
memory usage: 3.1+ KB
plt.close()
plt.figure(figsize=(16,4))
plt.subplot(1,3,1)
year["fatal_num"].plot(title="fatal_num")
plt.subplot(1,3,2)
year["time"].plot(title="crash_time")
plt.subplot(1,3,3)
year["fatal_ratio"].plot(title="fatal_ratio")
plt.show()

Yearly analysis results

month

def get_month(x):
    return x.split("/")[0]
fatal_crash['month'] = fatal_crash["Date"].map(get_month)
month_fatal = fatal_crash[fatal_crash["month"] != np.NaN][["month","Fatalities","Aboard"]]
month_fatal.info()
<class'pandas.core.frame.DataFrame'>
Int64Index: 5256 entries, 108 to 2963
Data columns (total 3 columns):
month 5256 non-null object
Fatalities 5256 non-null float64
Aboard 5246 non-null float64
dtypes: float64(2), object(1)
memory usage: 164.2+ KB
def month_analysis(x):
    return pd.Series({"fatal_num":x["Fatalities"].sum(),"time":x.shape[0],"fatal_ratio":x["Fatalities"].sum()/x[" Aboard"].sum()})
month = month_fatal.groupby(["month"]).apply(year_analysis)
month = month.sort_index()
print(month)
       fatal_num fatal_ratio time
month                               
01 8425.0 0.768354 494.0
02 7966.0 0.693057 395.0
03 8708.0 0.787057 453.0
04 6769.0 0.711852 378.0
05 7130.0 0.731807 370.0
06 7909.0 0.681399 385.0
07 9232.0 0.700349 427.0
08 10174.0 0.729162 474.0
09 10286.0 0.760349 458.0
10 8388.0 0.778758 452.0
11 10033.0 0.766522 454.0
12 10459.0 0.668478 516.0
plt.close()
plt.figure(figsize=(16,4))
plt.subplot(1,3,1)
month["fatal_num"].plot(kind="bar",title="fatal_num")
plt.subplot(1,3,2)
month["time"].plot(kind="bar",title="crash_time")
plt.subplot(1,3,3)
month["fatal_ratio"].plot(kind="bar",title="fatal_ratio")
plt.show()

Analyze by month

hour

def get_hour(x):
    hour = x.split(":")[0]
    try:
        hour = float(hour)
        if int(hour) == hour and hour <24:
            return hour
        else:
            return np.nan
    except:
        return np.nan
    
time_fatal = fatal_crash[fatal_crash["Time"].isnull() == False]
time_fatal["hour"] = time_fatal["Time"].map(get_hour)
time_fatal.info()
<class'pandas.core.frame.DataFrame'>
Int64Index: 3049 entries, 108 to 2963
Data columns (total 13 columns):
Date 3049 non-null object
Time 3049 non-null object
Operator 3046 non-null object
Type 3048 non-null object
Registration 2952 non-null object
Aboard 3049 non-null float64
Fatalities 3049 non-null float64
Ground 3046 non-null float64
airplane 3049 non-null object
company 3049 non-null object
year 3049 non-null object
month 3049 non-null object
hour 3036 non-null float64
dtypes: float64(4), object(9)
memory usage: 333.5+ KB


c:\users\qiank\appdata\local\programs\python\python35\lib\site-packages\ipykernel_launcher.py:13: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  del sys.path[0]
def hour_analysis(x):
    return pd.Series({"fatal_num":x["Fatalities"].sum(),"time":x.shape[0],"fatal_ratio":x["Fatalities"].sum()/x[" Aboard"].sum()})
hour = time_fatal.groupby(["hour"]).apply(hour_analysis)
plt.close()
plt.figure(figsize=(16,4))
plt.subplot(1,3,1)
hour["fatal_num"].plot(title="fatal_num")
plt.subplot(1,3,2)
hour["time"].plot(title="crash_time")
plt.subplot(1,3,3)
hour["fatal_ratio"].plot(title="fatal_ratio")
plt.show()

Analyze by time

Reference: https://cloud.tencent.com/developer/article/1110773 Exploratory analysis of air crash data set Import data set casualty analysis Model processing time analysis-Cloud + Community-Tencent Cloud