The data used here is the Boston Marathon results dataset on Kaggle.
This post summarizes the most basic preprocessing steps, from loading the data to saving it.
https://www.kaggle.com/rojour/boston-results
In [1]:
import pandas as pd
In [12]:
# Load the yearly result CSV files
marathon_2015 = pd.read_csv("C://Users//User//Desktop//boston-results/marathon_results_2015.csv")
marathon_2016 = pd.read_csv("C://Users//User//Desktop//boston-results/marathon_results_2016.csv")
marathon_2017 = pd.read_csv("C://Users//User//Desktop//boston-results/marathon_results_2017.csv")
In [13]:
# Add a Year column (stored as a string) to each DataFrame
marathon_2015['Year'] = '2015'
marathon_2016['Year'] = '2016'
marathon_2017['Year'] = '2017'
In [14]:
# Merge the three years into a single DataFrame
marathon_2015_2017 = pd.concat([marathon_2015,marathon_2016,marathon_2017],sort=False)
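The load, tag, and merge steps above can also be written as one loop. This is only a minimal sketch: the in-memory CSV text and the names `csv_by_year`, `frames`, and `combined` are made up for illustration and stand in for the real files. It also passes `ignore_index=True`, which the original cell does not, so the merged rows get a fresh 0..n-1 index instead of repeating each file's row numbers.

```python
import io

import pandas as pd

# Hypothetical stand-ins for the yearly CSV files (not the real Kaggle data)
csv_by_year = {
    "2015": "Name,Official Time\nA,2:09:17\nB,2:10:22\n",
    "2016": "Name,Official Time\nC,2:12:45\n",
}

frames = []
for year, text in csv_by_year.items():
    frame = pd.read_csv(io.StringIO(text))  # with real files, pass the path instead
    frame["Year"] = year                    # tag every row with its source year
    frames.append(frame)

# ignore_index=True renumbers the rows so index values are unique after merging
combined = pd.concat(frames, sort=False, ignore_index=True)
print(combined)
```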
In [15]:
marathon_2015_2017.info()
In [16]:
# Drop columns that are not needed for the analysis
marathon_2015_2017 = marathon_2015_2017.drop(columns = ['Unnamed: 0','Bib', 'Citizen', 'Unnamed: 9', 'Proj Time', 'Unnamed: 8'])
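By default `drop` raises a `KeyError` if any listed column is absent. If the list is reused on files whose columns differ, `errors="ignore"` skips the missing names. A small sketch on a toy DataFrame (the frame and the names `results`, `unwanted`, and `cleaned` are assumptions for illustration):

```python
import pandas as pd

# Toy frame: it has "Unnamed: 0" and "Bib" but no "Unnamed: 9" column
results = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "Bib": ["1", "2"],
    "Name": ["A", "B"],
})

unwanted = ["Unnamed: 0", "Bib", "Unnamed: 9"]

# errors="ignore" drops the columns that exist and silently skips the rest
cleaned = results.drop(columns=unwanted, errors="ignore")
print(cleaned.columns.tolist())
```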
In [17]:
marathon_2015_2017.head()
In [18]:
import numpy as np
In [19]:
# Convert using pandas to_timedelta method
marathon_2015_2017['5K'] = pd.to_timedelta(marathon_2015_2017['5K'])
marathon_2015_2017['10K'] = pd.to_timedelta(marathon_2015_2017['10K'])
marathon_2015_2017['15K'] = pd.to_timedelta(marathon_2015_2017['15K'])
marathon_2015_2017['20K'] = pd.to_timedelta(marathon_2015_2017['20K'])
marathon_2015_2017['Half'] = pd.to_timedelta(marathon_2015_2017['Half'])
marathon_2015_2017['25K'] = pd.to_timedelta(marathon_2015_2017['25K'])
marathon_2015_2017['30K'] = pd.to_timedelta(marathon_2015_2017['30K'])
marathon_2015_2017['35K'] = pd.to_timedelta(marathon_2015_2017['35K'])
marathon_2015_2017['40K'] = pd.to_timedelta(marathon_2015_2017['40K'])
marathon_2015_2017['Pace'] = pd.to_timedelta(marathon_2015_2017['Pace'])
marathon_2015_2017['Official Time'] = pd.to_timedelta(marathon_2015_2017['Official Time'])
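The repeated assignments above all follow one pattern, so they can be collapsed into a loop over the column names. The sketch below uses a toy DataFrame, and the `"missing"` value is a hypothetical placeholder, not something confirmed to appear in the Kaggle files. It also adds `errors="coerce"`, which the original cell does not use, so that unparseable strings become `NaT` instead of raising.

```python
import pandas as pd

# Toy stand-in for the merged results; "missing" is a made-up placeholder value
results = pd.DataFrame({
    "5K": ["0:14:43", "0:15:02"],
    "Official Time": ["2:09:17", "missing"],
})

# Extend this list with the other split columns as needed
time_columns = ["5K", "Official Time"]

for column in time_columns:
    # errors="coerce" turns strings that cannot be parsed into NaT
    results[column] = pd.to_timedelta(results[column], errors="coerce")

print(results.dtypes)
```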
In [23]:
# Convert each timedelta to whole seconds (int64) using astype
marathon_2015_2017['5K'] = marathon_2015_2017['5K'].astype('m8[s]').astype(np.int64)
marathon_2015_2017['10K'] = marathon_2015_2017['10K'].astype('m8[s]').astype(np.int64)
marathon_2015_2017['15K'] = marathon_2015_2017['15K'].astype('m8[s]').astype(np.int64)
marathon_2015_2017['20K'] = marathon_2015_2017['20K'].astype('m8[s]').astype(np.int64)
marathon_2015_2017['Half'] = marathon_2015_2017['Half'].astype('m8[s]').astype(np.int64)
marathon_2015_2017['25K'] = marathon_2015_2017['25K'].astype('m8[s]').astype(np.int64)
marathon_2015_2017['30K'] = marathon_2015_2017['30K'].astype('m8[s]').astype(np.int64)
marathon_2015_2017['35K'] = marathon_2015_2017['35K'].astype('m8[s]').astype(np.int64)
marathon_2015_2017['40K'] = marathon_2015_2017['40K'].astype('m8[s]').astype(np.int64)
marathon_2015_2017['Pace'] = marathon_2015_2017['Pace'].astype('m8[s]').astype(np.int64)
marathon_2015_2017['Official Time'] = marathon_2015_2017['Official Time'].astype('m8[s]').astype(np.int64)
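As an alternative to the chained `astype` calls, `Series.dt.total_seconds()` states the intent more directly. It returns the elapsed seconds as floats, which can then be cast to integers when no values are missing. A minimal sketch on a toy Series (the names `splits` and `seconds` are illustrative):

```python
import pandas as pd

# Two sample finish times already parsed as timedeltas
splits = pd.Series(pd.to_timedelta(["0:14:43", "2:09:17"]))

# total_seconds() gives floats; cast to int64 since these are whole seconds
seconds = splits.dt.total_seconds().astype("int64")
print(seconds.tolist())  # [883, 7757]
```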
In [24]:
marathon_2015_2017.head()
In [25]:
# Save to CSV file "marathon_2015_2017.csv"
marathon_2015_2017.to_csv("C://Users//User//Desktop//boston-results/marathon_2015_2017.csv", index=False, header=True)
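A quick way to check the save step is to write the frame out and read it straight back. The sketch below writes to an in-memory buffer instead of a file so it runs anywhere. The toy data and the names `results`, `buffer`, and `reloaded` are assumptions for illustration.

```python
import io

import pandas as pd

# Toy frame with times already converted to seconds
results = pd.DataFrame({"Name": ["A", "B"], "Official Time": [7757, 7822]})

buffer = io.StringIO()
results.to_csv(buffer, index=False, header=True)  # index=False omits the row index column

buffer.seek(0)  # rewind before reading the text back
reloaded = pd.read_csv(buffer)
print(reloaded)
```

Because the index is not written, the reloaded frame has only the original columns and no extra `Unnamed: 0` column.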