
[Python] Loading Large CSV Files Quickly

우주먼지의하루 2020. 3. 23. 03:48

The data comes from the Boston Marathon dataset on Kaggle.

https://www.kaggle.com/rojour/boston-results

Finishers Boston Marathon 2015, 2016 & 2017

This data has the names, times and general demographics of the finishers


Loading CSV files quickly (for better work efficiency)

In [29]:
# Tistory-specific display code (not needed for the analysis)
from IPython.display import display, HTML
display(HTML("<style>.container {width:90% !important;}</style>"))
In [1]:
import pandas as pd
In [4]:
%%time
marathon_2017 = pd.read_csv("C://Users//82106//Desktop//boston-results//marathon_results_2017.csv")
Wall time: 109 ms
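
`%%time` is an IPython cell magic, so it only works inside a notebook. In a plain script, roughly the same measurement can be taken with `time.perf_counter`. The sketch below is only an illustration: it reads a made-up in-memory CSV (the `Bib` and `Age` values are hypothetical, not the Kaggle file), and the number it prints will differ from machine to machine.

```python
import io
import time

import pandas as pd

# Hypothetical in-memory CSV so the sketch runs without the Kaggle file.
csv_text = "Bib,Age\n" + "\n".join(f"{i},{20 + i % 50}" for i in range(100_000))

# Measure the wall-clock time of a single read_csv call.
start = time.perf_counter()
df = pd.read_csv(io.StringIO(csv_text))
elapsed_ms = (time.perf_counter() - start) * 1000

print(f"Wall time: {elapsed_ms:.1f} ms for {len(df)} rows")
```

A single run is noisy, so repeating the measurement a few times gives a fairer picture.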
In [6]:
marathon_2017.head()
Out[6]:
   Unnamed: 0  Bib              Name  Age  M/F           City  State  Country  Citizen  Unnamed: 9  ...      25K      30K      35K      40K     Pace  Proj Time  Official Time  Overall  Gender  Division
0           0   11   Kirui, Geoffrey   24    M       Keringet    NaN      KEN      NaN         NaN  ...  1:16:59  1:33:01  1:48:19  2:02:53  0:04:57          -        2:09:37        1       1         1
1           1   17       Rupp, Galen   30    M       Portland     OR      USA      NaN         NaN  ...  1:16:59  1:33:01  1:48:19  2:03:14  0:04:58          -        2:09:58        2       2         2
2           2   23     Osako, Suguru   25    M   Machida-City    NaN      JPN      NaN         NaN  ...  1:17:00  1:33:01  1:48:31  2:03:38  0:04:59          -        2:10:28        3       3         3
3           3   21  Biwott, Shadrack   32    M  Mammoth Lakes     CA      USA      NaN         NaN  ...  1:17:00  1:33:01  1:48:58  2:04:35  0:05:03          -        2:12:08        4       4         4
4           4    9    Chebet, Wilson   31    M       Marakwet    NaN      KEN      NaN         NaN  ...  1:16:59  1:33:01  1:48:41  2:05:00  0:05:04          -        2:12:35        5       5         5

5 rows × 25 columns

1. Python library for Apache Arrow

Apache Arrow is a cross-language development platform for in-memory data. For details, see https://arrow.apache.org/

In [5]:
! pip install pyarrow
Requirement already satisfied: pyarrow in c:\users\82106\anaconda3\lib\site-packages (0.16.0)
Requirement already satisfied: six>=1.0.0 in c:\users\82106\anaconda3\lib\site-packages (from pyarrow) (1.12.0)
Requirement already satisfied: numpy>=1.14 in c:\users\82106\anaconda3\lib\site-packages (from pyarrow) (1.16.5)
In [15]:
# Convert to the Feather file format
marathon_2017.to_feather("C://Users//82106//Desktop//boston-results//marathon_2017.feather")
In [17]:
%%time
marathon_2017_feather = pd.read_feather("C://Users//82106//Desktop//boston-results//marathon_2017.feather")
Wall time: 40.4 ms
In [18]:
marathon_2017_feather.head()
Out[18]:
   Unnamed: 0  Bib              Name  Age  M/F           City  State  Country  Citizen  Unnamed: 9  ...      25K      30K      35K      40K     Pace  Proj Time  Official Time  Overall  Gender  Division
0           0   11   Kirui, Geoffrey   24    M       Keringet   None      KEN     None        None  ...  1:16:59  1:33:01  1:48:19  2:02:53  0:04:57          -        2:09:37        1       1         1
1           1   17       Rupp, Galen   30    M       Portland     OR      USA     None        None  ...  1:16:59  1:33:01  1:48:19  2:03:14  0:04:58          -        2:09:58        2       2         2
2           2   23     Osako, Suguru   25    M   Machida-City   None      JPN     None        None  ...  1:17:00  1:33:01  1:48:31  2:03:38  0:04:59          -        2:10:28        3       3         3
3           3   21  Biwott, Shadrack   32    M  Mammoth Lakes     CA      USA     None        None  ...  1:17:00  1:33:01  1:48:58  2:04:35  0:05:03          -        2:12:08        4       4         4
4           4    9    Chebet, Wilson   31    M       Marakwet   None      KEN     None        None  ...  1:16:59  1:33:01  1:48:41  2:05:00  0:05:04          -        2:12:35        5       5         5

5 rows × 25 columns

In [19]:
type(marathon_2017_feather)
Out[19]:
pandas.core.frame.DataFrame

2. dask.dataframe

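The Dask package lets you create a virtual dataframe. A virtual dataframe offers functionality similar to a pandas DataFrame, but the data is not all loaded into memory; it can be processed while it stays in one or more files or databases. This is also why the read_csv call below returns so quickly: Dask sets up the work lazily instead of reading the whole file up front. Reference: https://datascienceschool.net/view-notebook/2282b75b2a63448087b77269885c27cb/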

In [22]:
import dask.dataframe
In [26]:
%%time
marathon_2017_desk = dask.dataframe.read_csv("C://Users//82106//Desktop//boston-results//marathon_results_2017.csv")
Wall time: 19 ms
In [27]:
marathon_2017_desk.head()
Out[27]:
   Unnamed: 0  Bib              Name  Age  M/F           City  State  Country  Citizen  Unnamed: 9  ...      25K      30K      35K      40K     Pace  Proj Time  Official Time  Overall  Gender  Division
0           0   11   Kirui, Geoffrey   24    M       Keringet    NaN      KEN      NaN         NaN  ...  1:16:59  1:33:01  1:48:19  2:02:53  0:04:57          -        2:09:37        1       1         1
1           1   17       Rupp, Galen   30    M       Portland     OR      USA      NaN         NaN  ...  1:16:59  1:33:01  1:48:19  2:03:14  0:04:58          -        2:09:58        2       2         2
2           2   23     Osako, Suguru   25    M   Machida-City    NaN      JPN      NaN         NaN  ...  1:17:00  1:33:01  1:48:31  2:03:38  0:04:59          -        2:10:28        3       3         3
3           3   21  Biwott, Shadrack   32    M  Mammoth Lakes     CA      USA      NaN         NaN  ...  1:17:00  1:33:01  1:48:58  2:04:35  0:05:03          -        2:12:08        4       4         4
4           4    9    Chebet, Wilson   31    M       Marakwet    NaN      KEN      NaN         NaN  ...  1:16:59  1:33:01  1:48:41  2:05:00  0:05:04          -        2:12:35        5       5         5

5 rows × 25 columns

In [28]:
# With Dask, you have to call the compute() method to see the value.
marathon_2017_desk['Age'].mean().compute()
Out[28]:
42.587731919727375