나는야 데이터사이언티스트/PYTHON

[Python]파이썬 데이터 전처리 기초 정리

우주먼지의하루 2020. 3. 5. 09:08
728x90

데이터는 Kaggle에 있는 bostan marathon 데이터를 참고했다.

데이터 불러오기부터 저장까지 아주아주아주아주아주아주 기초가 되는 전처리 방법 정리

 

https://www.kaggle.com/rojour/boston-results

 

Finishers Boston Marathon 2015, 2016 & 2017

This data has the names, times and general demographics of the finishers

www.kaggle.com

 

 

 

 

 

 

 

In [1]:
import pandas as pd
In [12]:
#load the csv file
marathon_2015 = pd.read_csv("C://Users//User//Desktop//boston-results/marathon_results_2015.csv")
marathon_2016 = pd.read_csv("C://Users//User//Desktop//boston-results/marathon_results_2016.csv")
marathon_2017 = pd.read_csv("C://Users//User//Desktop//boston-results/marathon_results_2017.csv")
In [13]:
#add column
marathon_2015['Year'] = '2015'
marathon_2016['Year'] = '2016'
marathon_2017['Year'] = '2017'
In [14]:
#merge files
marathon_2015_2017 = pd.concat([marathon_2015,marathon_2016,marathon_2017],sort=False)
In [15]:
marathon_2015_2017.info()
 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 79638 entries, 0 to 26409
Data columns (total 27 columns):
Unnamed: 0       53008 non-null float64
Bib              79638 non-null object
Name             79638 non-null object
Age              79638 non-null int64
M/F              79638 non-null object
City             79637 non-null object
State            70645 non-null object
Country          79638 non-null object
Citizen          3440 non-null object
Unnamed: 9       158 non-null object
5K               79638 non-null object
10K              79638 non-null object
15K              79638 non-null object
20K              79638 non-null object
Half             79638 non-null object
25K              79638 non-null object
30K              79638 non-null object
35K              79638 non-null object
40K              79638 non-null object
Pace             79638 non-null object
Proj Time        79638 non-null object
Official Time    79638 non-null object
Overall          79638 non-null int64
Gender           79638 non-null int64
Division         79638 non-null int64
Year             79638 non-null object
Unnamed: 8       83 non-null object
dtypes: float64(1), int64(4), object(22)
memory usage: 17.0+ MB
In [16]:
#drop columns
marathon_2015_2017 = marathon_2015_2017.drop(columns = ['Unnamed: 0','Bib', 'Citizen', 'Unnamed: 9', 'Proj Time', 'Unnamed: 8'])
In [17]:
marathon_2015_2017.head()
Out[17]:
  Name Age M/F City State Country 5K 10K 15K 20K ... 25K 30K 35K 40K Pace Official Time Overall Gender Division Year
0 Desisa, Lelisa 25 M Ambo NaN ETH 0:14:43 0:29:43 0:44:57 1:00:29 ... 1:16:07 1:32:00 1:47:59 2:02:39 0:04:56 2:09:17 1 1 1 2015
1 Tsegay, Yemane Adhane 30 M Addis Ababa NaN ETH 0:14:43 0:29:43 0:44:58 1:00:28 ... 1:16:07 1:31:59 1:47:59 2:02:42 0:04:58 2:09:48 2 2 2 2015
2 Chebet, Wilson 29 M Marakwet NaN KEN 0:14:43 0:29:43 0:44:57 1:00:29 ... 1:16:07 1:32:00 1:47:59 2:03:01 0:04:59 2:10:22 3 3 3 2015
3 Kipyego, Bernard 28 M Eldoret NaN KEN 0:14:43 0:29:44 0:45:01 1:00:29 ... 1:16:07 1:32:00 1:48:03 2:03:47 0:05:00 2:10:47 4 4 4 2015
4 Korir, Wesley 32 M Kitale NaN KEN 0:14:43 0:29:44 0:44:58 1:00:28 ... 1:16:07 1:32:00 1:47:59 2:03:27 0:05:00 2:10:49 5 5 5 2015

5 rows × 21 columns

In [18]:
import numpy as np
In [19]:
# Convert using pandas to_timedelta method

marathon_2015_2017['5K'] = pd.to_timedelta(marathon_2015_2017['5K'])
marathon_2015_2017['10K'] = pd.to_timedelta(marathon_2015_2017['10K'])
marathon_2015_2017['15K'] = pd.to_timedelta(marathon_2015_2017['15K'])
marathon_2015_2017['20K'] = pd.to_timedelta(marathon_2015_2017['20K'])
marathon_2015_2017['Half'] = pd.to_timedelta(marathon_2015_2017['Half'])
marathon_2015_2017['25K'] = pd.to_timedelta(marathon_2015_2017['25K'])
marathon_2015_2017['30K'] = pd.to_timedelta(marathon_2015_2017['30K'])
marathon_2015_2017['35K'] = pd.to_timedelta(marathon_2015_2017['35K'])
marathon_2015_2017['40K'] = pd.to_timedelta(marathon_2015_2017['40K'])
marathon_2015_2017['Pace'] = pd.to_timedelta(marathon_2015_2017['Pace'])
marathon_2015_2017['Official Time'] = pd.to_timedelta(marathon_2015_2017['Official Time'])
In [23]:
# Convert time to seconds value using astype method
marathon_2015_2017['5K'] = marathon_2015_2017['5K'].astype('m8[s]').astype(np.int64)
marathon_2015_2017['10K'] = marathon_2015_2017['10K'].astype('m8[s]').astype(np.int64)
marathon_2015_2017['15K'] = marathon_2015_2017['15K'].astype('m8[s]').astype(np.int64)
marathon_2015_2017['20K'] = marathon_2015_2017['20K'].astype('m8[s]').astype(np.int64)
marathon_2015_2017['Half'] = marathon_2015_2017['Half'].astype('m8[s]').astype(np.int64)
marathon_2015_2017['25K'] = marathon_2015_2017['25K'].astype('m8[s]').astype(np.int64)
marathon_2015_2017['30K'] = marathon_2015_2017['30K'].astype('m8[s]').astype(np.int64)
marathon_2015_2017['35K'] = marathon_2015_2017['35K'].astype('m8[s]').astype(np.int64)
marathon_2015_2017['40K'] = marathon_2015_2017['40K'].astype('m8[s]').astype(np.int64)
marathon_2015_2017['Pace'] = marathon_2015_2017['Pace'].astype('m8[s]').astype(np.int64)
marathon_2015_2017['Official Time'] = marathon_2015_2017['Official Time'].astype('m8[s]').astype(np.int64)
In [24]:
marathon_2015_2017.head()
Out[24]:
  Name Age M/F City State Country 5K 10K 15K 20K ... 25K 30K 35K 40K Pace Official Time Overall Gender Division Year
0 Desisa, Lelisa 25 M Ambo NaN ETH 883 1783 2697 3629 ... 4567 5520 6479 7359 296 7757 1 1 1 2015
1 Tsegay, Yemane Adhane 30 M Addis Ababa NaN ETH 883 1783 2698 3628 ... 4567 5519 6479 7362 298 7788 2 2 2 2015
2 Chebet, Wilson 29 M Marakwet NaN KEN 883 1783 2697 3629 ... 4567 5520 6479 7381 299 7822 3 3 3 2015
3 Kipyego, Bernard 28 M Eldoret NaN KEN 883 1784 2701 3629 ... 4567 5520 6483 7427 300 7847 4 4 4 2015
4 Korir, Wesley 32 M Kitale NaN KEN 883 1784 2698 3628 ... 4567 5520 6479 7407 300 7849 5 5 5 2015

5 rows × 21 columns

In [25]:
# Save to CSV file "marathon_2015_2017.csv"
marathon_2015_2017.to_csv("C://Users//User//Desktop//boston-results/marathon_2015_2017.csv", index = None, header=True)

 

반응형