Loading CSV files quickly (for a more efficient workflow)

# Tistory-related code (not needed)
from IPython.display import display, HTML
display(HTML("<style>.container {width:90% !important;}</style>"))

import pandas as pd

%%time
marathon_2017 = pd.read_csv("C://Users//82106//Desktop//boston-results//marathon_results_2017.csv")

Wall time: 109 ms

marathon_2017.head()
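
`%%time` is a Jupyter cell magic, so it only works inside a notebook. As a rough sketch of how the same wall-clock measurement could be taken in a plain Python script, the block below uses `time.perf_counter` from the standard library. The helper names `wall_time` and `load_rows` and the throwaway sample file are assumptions made for illustration and are not part of the original notebook.

```python
import csv
import os
import tempfile
import time
from contextlib import contextmanager


@contextmanager
def wall_time(label, results):
    """Store the elapsed wall-clock seconds of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


def load_rows(path):
    """Read every row of a CSV file into a list of dicts (standard library only)."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


# Build a small throwaway CSV so the sketch runs anywhere (hypothetical data).
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.csv")
with open(sample_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Bib", "Age"])
    writer.writerows([[i, 20 + i % 40] for i in range(1000)])

timings = {}
with wall_time("csv", timings):
    rows = load_rows(sample_path)

print(f"Wall time: {timings['csv'] * 1000:.1f} ms for {len(rows)} rows")
```

The same `wall_time` block could wrap any loader, which makes it easy to compare several loading methods in one script.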

1. Python library for Apache Arrow

Apache Arrow is a cross-language development platform for in-memory data. For details, see https://arrow.apache.org/

! pip install pyarrow

Requirement already satisfied: pyarrow in c:\users\82106\anaconda3\lib\site-packages (0.16.0)
Requirement already satisfied: six>=1.0.0 in c:\users\82106\anaconda3\lib\site-packages (from pyarrow) (1.12.0)
Requirement already satisfied: numpy>=1.14 in c:\users\82106\anaconda3\lib\site-packages (from pyarrow) (1.16.5)

# Convert to the Feather file format
marathon_2017.to_feather("C://Users//82106//Desktop//boston-results//marathon_2017.feather")

%%time
marathon_2017_feather = pd.read_feather("C://Users//82106//Desktop//boston-results//marathon_2017.feather")

Wall time: 40.4 ms

marathon_2017_feather.head()

type(marathon_2017_feather)

pandas.core.frame.DataFrame
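
Converting to Feather pays off when the same CSV is loaded repeatedly. One possible way to automate that, sketched under assumptions, is to keep a `.feather` file next to the CSV and reuse it while it is at least as new as the source. The helpers `feather_path_for`, `cache_is_fresh`, and `load_cached` are hypothetical names, not part of pandas or pyarrow; only `pd.read_csv`, `pd.read_feather`, and `DataFrame.to_feather` are real library calls.

```python
from pathlib import Path


def feather_path_for(csv_path):
    """Return the sibling .feather path used as a cache for csv_path."""
    return Path(csv_path).with_suffix(".feather")


def cache_is_fresh(csv_path, cache_path):
    """True if the cache file exists and is at least as new as the CSV."""
    cache_path = Path(cache_path)
    if not cache_path.exists():
        return False
    return cache_path.stat().st_mtime >= Path(csv_path).stat().st_mtime


def load_cached(csv_path):
    """Load csv_path as a DataFrame, reading the Feather cache when it is up to date."""
    # Imported here so the path helpers above work without pandas;
    # pyarrow must also be installed for the Feather calls.
    import pandas as pd

    cache_path = feather_path_for(csv_path)
    if cache_is_fresh(csv_path, cache_path):
        return pd.read_feather(cache_path)

    # First load (or stale cache): parse the CSV once and write the cache.
    df = pd.read_csv(csv_path)
    df.to_feather(cache_path)
    return df
```

With this in place, a call such as `load_cached("marathon_results_2017.csv")` would parse the CSV only on the first run and read the Feather file afterwards.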

2. dask.dataframe

The Dask package lets you create a virtual dataframe. A virtual dataframe offers functionality similar to a pandas DataFrame, but instead of loading all the data into memory, it can process data that remains in one or more files or databases. Reference: https://datascienceschool.net/view-notebook/2282b75b2a63448087b77269885c27cb/

import dask.dataframe

%%time
marathon_2017_dask = dask.dataframe.read_csv("C://Users//82106//Desktop//boston-results//marathon_results_2017.csv")

Wall time: 19 ms

marathon_2017_dask.head()

# Dask evaluates lazily: call the compute() method to get the actual value.
marathon_2017_dask['Age'].mean().compute()

42.587731919727375
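
The idea of processing data that stays on disk can be illustrated without Dask. The sketch below is not how Dask is implemented; it is a standard-library stand-in that computes a column mean by reading one row at a time, so the whole file never has to sit in memory. The function name `streaming_mean` and the inline sample data are hypothetical.

```python
import csv
import io


def streaming_mean(lines, column):
    """Return the mean of a numeric column, reading the CSV one row at a time."""
    total = 0.0
    count = 0
    for row in csv.DictReader(lines):
        value = row[column]
        if value == "":
            # Skip missing values instead of treating them as zero.
            continue
        total += float(value)
        count += 1
    if count == 0:
        raise ValueError(f"no numeric values found in column {column!r}")
    return total / count


# An in-memory stand-in for a file; open(path, newline="") would work the same way.
sample = io.StringIO("Name,Age\nA,24\nB,30\nC,\nD,36\n")
mean_age = streaming_mean(sample, "Age")
print(mean_age)  # prints 30.0
```

Only the running total and the count are kept between rows, which is why memory use does not grow with the size of the file.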

Output of marathon_2017.head() (loaded with pd.read_csv):
	Unnamed: 0	Bib	Name	Age	M/F	City	State	Country	Citizen	Unnamed: 9	...	25K	30K	35K	40K	Pace	Proj Time	Official Time	Overall	Gender	Division
0	0	11	Kirui, Geoffrey	24	M	Keringet	NaN	KEN	NaN	NaN	...	1:16:59	1:33:01	1:48:19	2:02:53	0:04:57	-	2:09:37	1	1	1
1	1	17	Rupp, Galen	30	M	Portland	OR	USA	NaN	NaN	...	1:16:59	1:33:01	1:48:19	2:03:14	0:04:58	-	2:09:58	2	2	2
2	2	23	Osako, Suguru	25	M	Machida-City	NaN	JPN	NaN	NaN	...	1:17:00	1:33:01	1:48:31	2:03:38	0:04:59	-	2:10:28	3	3	3
3	3	21	Biwott, Shadrack	32	M	Mammoth Lakes	CA	USA	NaN	NaN	...	1:17:00	1:33:01	1:48:58	2:04:35	0:05:03	-	2:12:08	4	4	4
4	4	9	Chebet, Wilson	31	M	Marakwet	NaN	KEN	NaN	NaN	...	1:16:59	1:33:01	1:48:41	2:05:00	0:05:04	-	2:12:35	5	5	5

Output of marathon_2017_feather.head() (loaded with pd.read_feather):
	Unnamed: 0	Bib	Name	Age	M/F	City	State	Country	Citizen	Unnamed: 9	...	25K	30K	35K	40K	Pace	Proj Time	Official Time	Overall	Gender	Division
0	0	11	Kirui, Geoffrey	24	M	Keringet	None	KEN	None	None	...	1:16:59	1:33:01	1:48:19	2:02:53	0:04:57	-	2:09:37	1	1	1
1	1	17	Rupp, Galen	30	M	Portland	OR	USA	None	None	...	1:16:59	1:33:01	1:48:19	2:03:14	0:04:58	-	2:09:58	2	2	2
2	2	23	Osako, Suguru	25	M	Machida-City	None	JPN	None	None	...	1:17:00	1:33:01	1:48:31	2:03:38	0:04:59	-	2:10:28	3	3	3
3	3	21	Biwott, Shadrack	32	M	Mammoth Lakes	CA	USA	None	None	...	1:17:00	1:33:01	1:48:58	2:04:35	0:05:03	-	2:12:08	4	4	4
4	4	9	Chebet, Wilson	31	M	Marakwet	None	KEN	None	None	...	1:16:59	1:33:01	1:48:41	2:05:00	0:05:04	-	2:12:35	5	5	5

Output of head() on the Dask dataframe:
	Unnamed: 0	Bib	Name	Age	M/F	City	State	Country	Citizen	Unnamed: 9	...	25K	30K	35K	40K	Pace	Proj Time	Official Time	Overall	Gender	Division
0	0	11	Kirui, Geoffrey	24	M	Keringet	NaN	KEN	NaN	NaN	...	1:16:59	1:33:01	1:48:19	2:02:53	0:04:57	-	2:09:37	1	1	1
1	1	17	Rupp, Galen	30	M	Portland	OR	USA	NaN	NaN	...	1:16:59	1:33:01	1:48:19	2:03:14	0:04:58	-	2:09:58	2	2	2
2	2	23	Osako, Suguru	25	M	Machida-City	NaN	JPN	NaN	NaN	...	1:17:00	1:33:01	1:48:31	2:03:38	0:04:59	-	2:10:28	3	3	3
3	3	21	Biwott, Shadrack	32	M	Mammoth Lakes	CA	USA	NaN	NaN	...	1:17:00	1:33:01	1:48:58	2:04:35	0:05:03	-	2:12:08	4	4	4
4	4	9	Chebet, Wilson	31	M	Marakwet	NaN	KEN	NaN	NaN	...	1:16:59	1:33:01	1:48:41	2:05:00	0:05:04	-	2:12:35	5	5	5
