10 minutes to pandas 4
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
9. Time Series
rng = pd.date_range('1/1/2012',periods=100,freq='S')
ts = pd.Series(np.random.randint(0,500,len(rng)), index=rng)
ts.resample('5Min').sum()
2012-01-01 24106
Freq: 5T, dtype: int32
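The resampling call above collapses second-level data into 5-minute bins. Because the original uses random values, the totals cannot be checked by hand; the following is a minimal sketch with a constant series instead. The names `seconds` and `binned` are illustrative and do not come from the original.

```python
import numpy as np
import pandas as pd

# 600 one-second samples, each equal to 1, starting at midnight
idx = pd.date_range("2012-01-01", periods=600, freq="s")
seconds = pd.Series(np.ones(len(idx), dtype="int64"), index=idx)

# Group the samples into 5-minute bins and add up the values in each bin
binned = seconds.resample("5min").sum()
print(binned)
```

Ten minutes of data fall into two bins of 300 samples each, so each bin sums to 300.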
rng = pd.date_range('3/6/2012 00:00',periods=5, freq='D')
ts = pd.Series(np.random.randn(len(rng)),rng)
ts
2012-03-06 -0.228536
2012-03-07 1.182960
2012-03-08 -0.189565
2012-03-09 -0.968019
2012-03-10 -0.550340
Freq: D, dtype: float64
ts_utc = ts.tz_localize('UTC')
ts_utc
2012-03-06 00:00:00+00:00 -0.228536
2012-03-07 00:00:00+00:00 1.182960
2012-03-08 00:00:00+00:00 -0.189565
2012-03-09 00:00:00+00:00 -0.968019
2012-03-10 00:00:00+00:00 -0.550340
Freq: D, dtype: float64
Convert to another time zone.
ts_utc.tz_convert('US/Eastern')
2012-03-05 19:00:00-05:00 -0.228536
2012-03-06 19:00:00-05:00 1.182960
2012-03-07 19:00:00-05:00 -0.189565
2012-03-08 19:00:00-05:00 -0.968019
2012-03-09 19:00:00-05:00 -0.550340
Freq: D, dtype: float64
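The two steps above do different things: `tz_localize` attaches a time zone to naive timestamps, while `tz_convert` re-expresses the same instants in another zone. A small sketch with fixed values (the variable names are illustrative) that makes the distinction visible:

```python
import pandas as pd

idx = pd.date_range("2012-03-06", periods=3, freq="D")
naive = pd.Series([1.0, 2.0, 3.0], index=idx)

# Attach a time zone: the wall-clock times are now interpreted as UTC
utc = naive.tz_localize("UTC")

# Re-express the same instants in another zone; the values are untouched
eastern = utc.tz_convert("US/Eastern")
print(eastern)
```

Midnight UTC on 2012-03-06 corresponds to 19:00 on the previous day in US/Eastern, which matches the shifted index shown in the output above.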
Convert between timestamp and period representations.
rng = pd.date_range('1/1/2012',periods=5,freq='M')
ts = pd.Series(np.random.randn(len(rng)),index=rng)
ts
2012-01-31 0.031629
2012-02-29 0.875231
2012-03-31 0.005173
2012-04-30 -0.383027
2012-05-31 0.054017
Freq: M, dtype: float64
ps= ts.to_period()
ps
2012-01 0.031629
2012-02 0.875231
2012-03 0.005173
2012-04 -0.383027
2012-05 0.054017
Freq: M, dtype: float64
ps.to_timestamp()
2012-01-01 0.031629
2012-02-01 0.875231
2012-03-01 0.005173
2012-04-01 -0.383027
2012-05-01 0.054017
Freq: MS, dtype: float64
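Note that the round trip does not return the original index: `to_period` keeps only the month, and `to_timestamp` defaults to the start of each period. A minimal sketch, assuming an explicit list of month-end dates rather than `date_range` (the names are illustrative):

```python
import pandas as pd

# Month-end timestamps, written out explicitly for clarity
month_ends = pd.to_datetime(["2012-01-31", "2012-02-29", "2012-03-31"])
ts = pd.Series([10, 20, 30], index=month_ends)

# Timestamps -> monthly periods (the day of the month is dropped)
ps = ts.to_period("M")

# Periods -> timestamps; the default lands on the start of each period
back = ps.to_timestamp()
print(ps)
print(back)
```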
prng = pd.period_range('1990Q1','2000Q4',freq='Q-NOV')
ts = pd.Series(np.random.randn(len(prng)),prng)
ts.index = (prng.asfreq('M','e')+1).asfreq('H','s')+9
ts.head()
1990-03-01 09:00 -1.165074
1990-06-01 09:00 0.790822
1990-09-01 09:00 2.920755
1990-12-01 09:00 0.491993
1991-03-01 09:00 -0.491173
Freq: H, dtype: float64
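The one-line index expression above packs several conversions together. As a sketch, here are the same steps applied to a single `Period`, one at a time; the intermediate variable names are illustrative, and the lowercase `"h"` alias is assumed to be accepted by the installed pandas version.

```python
import pandas as pd

# Fiscal quarter whose year ends in November: 1990Q1 covers Dec 1989 to Feb 1990
quarter = pd.Period("1990Q1", freq="Q-NOV")

last_month = quarter.asfreq("M", how="end")        # last month of the quarter
next_month = last_month + 1                        # the month after the quarter ends
first_hour = next_month.asfreq("h", how="start")   # first hour of that month
nine_am = first_hour + 9                           # nine hours later

print(last_month, next_month, nine_am)
```

The result is 09:00 on the first day of the month following the quarter end, which agrees with the first row of the output above (1990-03-01 09:00).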
10. Categoricals
pandas can include categorical data in a DataFrame.
df = pd.DataFrame({"id":[1,2,3,4,5,6],"raw_grade":['a','b','b','a','a','e']})
df["grade"] = df["raw_grade"].astype('category')
df["grade"]
0 a
1 b
2 b
3 a
4 a
5 e
Name: grade, dtype: category
Categories (3, object): [a, b, e]
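Categories can be renamed to more meaningful labels. Assigning to cat.categories directly no longer works in recent pandas versions, so rename_categories is used here.
df["grade"] = df["grade"].cat.rename_categories(["very good","good","very bad"])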
Reorder the categories and add the missing ones at the same time.
df["grade"] = df["grade"].cat.set_categories(["very bad","bad","medium","good","very good"])
df["grade"]
0 very good
1 good
2 good
3 very good
4 very good
5 very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]
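The rename and reorder steps can also be run on a standalone Series. The sketch below uses a dict with `rename_categories` so the mapping from old to new labels is explicit; the variable names are illustrative.

```python
import pandas as pd

grades = pd.Series(["a", "b", "b", "a", "a", "e"], dtype="category")

# Rename each existing category; a dict makes the mapping explicit
renamed = grades.cat.rename_categories(
    {"a": "very good", "b": "good", "e": "very bad"}
)

# Fix the category order and add categories that have no rows yet
full = renamed.cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"]
)
print(full)
```

Both methods return a new Series, so the result has to be assigned back if it should replace the original column.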
df.sort_values(by="grade")
| | id | raw_grade | grade |
|---|---|---|---|
| 5 | 6 | e | very bad |
| 1 | 2 | b | good |
| 2 | 3 | b | good |
| 0 | 1 | a | very good |
| 3 | 4 | a | very good |
| 4 | 5 | a | very good |
df.groupby("grade", observed=False).size()
grade
very bad 1
bad 0
medium 0
good 2
very good 3
dtype: int64
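Whether empty categories such as "bad" and "medium" appear in the grouped result is controlled by the `observed` argument of `groupby`. A small sketch comparing both settings on a toy frame (`df_demo` and the count variables are illustrative names):

```python
import pandas as pd

grade = pd.Categorical(
    ["good", "good", "very bad"],
    categories=["very bad", "bad", "good"],
)
df_demo = pd.DataFrame({"id": [1, 2, 3], "grade": grade})

# observed=False keeps every declared category, including empty ones
counts_all = df_demo.groupby("grade", observed=False).size()

# observed=True keeps only categories that actually occur in the data
counts_seen = df_demo.groupby("grade", observed=True).size()

print(counts_all)
print(counts_seen)
```

Passing `observed` explicitly avoids depending on the library default, which has differed between pandas versions.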
11. Plotting
ts = pd.Series(np.random.randn(1000), index = pd.date_range('1/1/2000',periods=1000))
ts = ts.cumsum()
ts.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1d24340db08>
On a DataFrame, the plot() method is a convenient way to draw every labeled column at once.
df = pd.DataFrame(np.random.randn(1000,4),index=ts.index,
columns=['A','B','C','D'])
df = df.cumsum()
plt.figure(); df.plot(); plt.legend(loc='best')
<matplotlib.legend.Legend at 0x1d2440e30c8>
<Figure size 432x288 with 0 Axes>
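The empty "Figure ... with 0 Axes" line in the output appears because `plt.figure()` opens a figure that `df.plot()` never uses; `df.plot()` creates its own figure and returns its Axes. The sketch below works with that returned Axes directly. It assumes the non-interactive "Agg" backend so that it runs without a display, and all names are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed

import numpy as np
import pandas as pd

idx = pd.date_range("2000-01-01", periods=100)
frame = pd.DataFrame(
    np.random.default_rng(0).standard_normal((100, 4)),
    index=idx,
    columns=["A", "B", "C", "D"],
).cumsum()

# DataFrame.plot() draws one line per column and returns the Axes it used
ax = frame.plot()
ax.legend(loc="best")

# The legend labels are taken from the column names
labels = [text.get_text() for text in ax.get_legend().get_texts()]
print(labels)
```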
12. Getting Data In / Out
CSV
df.to_csv('foo.csv')
pd.read_csv('foo.csv',index_col=0)
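The `index_col=0` argument matters because `to_csv` writes the index as the first, unnamed column. A minimal round-trip sketch that uses an in-memory buffer instead of a file on disk (the buffer and variable names are illustrative):

```python
import io

import pandas as pd

frame = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["x", "y"])

# Write to an in-memory text buffer instead of a file
buffer = io.StringIO()
frame.to_csv(buffer)

# With index_col=0 the first column becomes the index again
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

# Without it, the saved index comes back as an ordinary column
buffer.seek(0)
plain = pd.read_csv(buffer)

print(restored)
print(plain.columns.tolist())
```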
HDF5
df.to_hdf('foo.h5', key='df')
pd.read_hdf('foo.h5','df')