10 minutes to pandas 4

19 April 2020 - 6 mins read time
Tags: Python TIL

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

9. Time Series

rng = pd.date_range('1/1/2012', periods=100, freq='S')

ts = pd.Series(np.random.randint(0,500,len(rng)), index=rng)

ts.resample('5Min').sum()

2012-01-01    24106
Freq: 5T, dtype: int32
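The resampling step above can be checked on data with known values. This is a minimal sketch, not part of the original notebook: the integers 0..99 and the one-minute bin width are assumptions chosen so the sums are easy to verify by hand.

```python
import numpy as np
import pandas as pd

# 100 one-second observations holding the known values 0..99 (illustrative data).
rng = pd.date_range("2012-01-01", periods=100, freq="s")
ts = pd.Series(np.arange(100), index=rng)

# Downsample to one-minute bins; each bin holds the sum of the seconds it covers.
per_minute = ts.resample("1min").sum()
print(per_minute)
```

The first bin covers 60 seconds and the second covers the remaining 40, so the two sums add up to the total of the original series.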

rng = pd.date_range('3/6/2012 00:00',periods=5, freq='D')

ts = pd.Series(np.random.randn(len(rng)),rng)

ts

2012-03-06   -0.228536
2012-03-07    1.182960
2012-03-08   -0.189565
2012-03-09   -0.968019
2012-03-10   -0.550340
Freq: D, dtype: float64

ts_utc = ts.tz_localize('UTC')

ts_utc

2012-03-06 00:00:00+00:00   -0.228536
2012-03-07 00:00:00+00:00    1.182960
2012-03-08 00:00:00+00:00   -0.189565
2012-03-09 00:00:00+00:00   -0.968019
2012-03-10 00:00:00+00:00   -0.550340
Freq: D, dtype: float64

Convert to another time zone.

ts_utc.tz_convert('US/Eastern')

2012-03-05 19:00:00-05:00   -0.228536
2012-03-06 19:00:00-05:00    1.182960
2012-03-07 19:00:00-05:00   -0.189565
2012-03-08 19:00:00-05:00   -0.968019
2012-03-09 19:00:00-05:00   -0.550340
Freq: D, dtype: float64
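The difference between the two calls above is that tz_localize attaches a time zone to naive timestamps, while tz_convert re-expresses the same instants in another zone. A small sketch with made-up values (the three-day index is an assumption for illustration):

```python
import pandas as pd

# Naive (time-zone-free) daily timestamps with illustrative values.
idx = pd.date_range("2012-03-06", periods=3, freq="D")
s = pd.Series([1.0, 2.0, 3.0], index=idx)

# tz_localize declares that the naive timestamps are UTC.
s_utc = s.tz_localize("UTC")

# tz_convert keeps the instants and only changes how they are displayed.
s_east = s_utc.tz_convert("US/Eastern")

print(s_utc.index[0], "->", s_east.index[0])
# Comparison between tz-aware timestamps is done on the underlying instant.
print(s_utc.index[0] == s_east.index[0])
```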

Convert between timestamp and period representations.

rng = pd.date_range('1/1/2012',periods=5,freq='M')

ts = pd.Series(np.random.randn(len(rng)),index=rng)

ts

2012-01-31    0.031629
2012-02-29    0.875231
2012-03-31    0.005173
2012-04-30   -0.383027
2012-05-31    0.054017
Freq: M, dtype: float64

ps = ts.to_period()

ps

2012-01    0.031629
2012-02    0.875231
2012-03    0.005173
2012-04   -0.383027
2012-05    0.054017
Freq: M, dtype: float64

ps.to_timestamp()

2012-01-01    0.031629
2012-02-01    0.875231
2012-03-01    0.005173
2012-04-01   -0.383027
2012-05-01    0.054017
Freq: MS, dtype: float64
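As the output above shows, to_timestamp() lands on the start of each period by default. The sketch below, which builds the month-end index from explicit dates (an assumption made to keep it independent of frequency aliases), shows how the how argument selects the start or the end of the period.

```python
import pandas as pd

# Month-end timestamps written out explicitly (illustrative values).
stamps = pd.to_datetime(["2012-01-31", "2012-02-29", "2012-03-31"])
ts = pd.Series([1.0, 2.0, 3.0], index=stamps)

# Timestamps -> monthly periods.
ps = ts.to_period("M")

# Periods -> timestamps at the start (default) or the end of each period.
starts = ps.to_timestamp()
ends = ps.to_timestamp(how="end")

print(ps.index)
print(starts.index)
print(ends.index)
```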

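Converting between period and timestamp makes some convenient arithmetic possible. The example below takes a quarterly frequency whose year ends in November and converts it to 9 a.m. on the first day of the month following each quarter end.

prng = pd.period_range('1990Q1','2000Q4',freq='Q-NOV')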

ts = pd.Series(np.random.randn(len(prng)),prng)

ts.index = (prng.asfreq('M','e')+1).asfreq('H','s')+9

ts.head()

1990-03-01 09:00   -1.165074
1990-06-01 09:00    0.790822
1990-09-01 09:00    2.920755
1990-12-01 09:00    0.491993
1991-03-01 09:00   -0.491173
Freq: H, dtype: float64
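The index arithmetic above is easier to follow on a single period. A step-by-step sketch, assuming the lowercase 'h' hourly alias (newer pandas releases prefer it over 'H'); the variable names are illustrative only:

```python
import pandas as pd

# First fiscal quarter of 1990 for a year that ends in November.
q = pd.Period("1990Q1", freq="Q-NOV")

# Step 1: the last month of the quarter, then one month later.
month_after = q.asfreq("M", how="end") + 1

# Step 2: the first hour of that month, then nine hours later.
nine_am = month_after.asfreq("h", how="start") + 9

print(q.start_time, q.end_time)
print(month_after)
print(nine_am)
```

With Q-NOV, the first quarter of 1990 runs from December 1989 through February 1990, which is why the result falls in March.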

10. Categoricals

pandas can include categorical data in a DataFrame.

df = pd.DataFrame({"id":[1,2,3,4,5,6],"raw_grade":['a','b','b','a','a','e']})

df["grade"] = df["raw_grade"].astype('category')

df["grade"]

0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): [a, b, e]
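A categorical column stores each value as a small integer code that points into the list of categories. The following sketch, using the same raw grades as above, shows both parts through the .cat accessor:

```python
import pandas as pd

# Same raw grades as in the DataFrame above.
grades = pd.Series(["a", "b", "b", "a", "a", "e"]).astype("category")

# The distinct categories, in their (here alphabetical) order.
cats = list(grades.cat.categories)

# One integer code per row, indexing into the categories.
codes = grades.cat.codes.tolist()

print(cats)
print(codes)
```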

Categories can be renamed to more meaningful labels.

df["grade"] = df["grade"].cat.rename_categories(["very good","good","very bad"])

Reorder the categories and add the missing categories at the same time.

df["grade"] = df["grade"].cat.set_categories(["very bad","bad","medium","good","very good"])

df["grade"]

0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]

Sorting follows the order of the categories, not lexical order.

df.sort_values(by="grade")

	id	raw_grade	grade
5	6	e	very bad
1	2	b	good
2	3	b	good
0	1	a	very good
3	4	a	very good
4	5	a	very good
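The sorted table above puts "very bad" first because sorting a categorical column uses the category order. A minimal sketch contrasting this with plain string sorting (the three-label series is an assumption for illustration):

```python
import pandas as pd

labels = pd.Series(["very good", "good", "very bad"])

# Plain strings sort lexically.
lexical = labels.sort_values().tolist()

# As a categorical, values sort by the position of their category.
dtype = pd.CategoricalDtype(["very bad", "good", "very good"])
by_category = labels.astype(dtype).sort_values().tolist()

print(lexical)
print(by_category)
```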

Grouping by a categorical column with observed=False also shows empty categories.

df.groupby("grade", observed=False).size()

grade
very bad     1
bad          0
medium       0
good         2
very good    3
dtype: int64
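Whether the empty categories appear in a groupby result is controlled by the observed argument. A small sketch on a three-row frame (the data is made up for illustration) compares both settings:

```python
import pandas as pd

dtype = pd.CategoricalDtype(["very bad", "bad", "medium", "good", "very good"])
df = pd.DataFrame({
    "id": [1, 2, 3],
    "grade": pd.Series(["good", "very good", "good"]).astype(dtype),
})

# observed=False keeps every category, including those with no rows.
all_counts = df.groupby("grade", observed=False).size()

# observed=True keeps only the categories that actually occur.
seen_counts = df.groupby("grade", observed=True).size()

print(all_counts)
print(seen_counts)
```

Passing observed explicitly avoids depending on the default, which differs between pandas versions.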

11. Plotting

ts = pd.Series(np.random.randn(1000), index = pd.date_range('1/1/2000',periods=1000))

ts = ts.cumsum()

ts.plot()

<matplotlib.axes._subplots.AxesSubplot at 0x1d24340db08>

[Figure: output_38_1]
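Outside a notebook, the same plot can be drawn on an explicit Axes and written to a file. This is a sketch under a few assumptions: the non-interactive Agg backend, a fixed random seed, and a temporary file named ts_walk.png, none of which come from the original post.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: render without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Reproducible random walk over 1000 days.
gen = np.random.default_rng(0)
index = pd.date_range("2000-01-01", periods=1000)
ts = pd.Series(gen.standard_normal(1000), index=index).cumsum()

# Draw on an explicit Axes so the figure can be saved afterwards.
fig, ax = plt.subplots()
ts.plot(ax=ax)
ax.set_title("Cumulative sum of random values")

# Save into a temporary directory (hypothetical file name).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ts_walk.png")
    fig.savefig(path)
    saved_bytes = os.path.getsize(path)

plt.close(fig)
print(saved_bytes)
```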

On a DataFrame, the plot() method is a convenient way to plot all of the columns with labels.

df = pd.DataFrame(np.random.randn(1000,4),index=ts.index,
                 columns=['A','B','C','D'])

df = df.cumsum()

plt.figure(); df.plot(); plt.legend(loc='best')

<matplotlib.legend.Legend at 0x1d2440e30c8>

<Figure size 432x288 with 0 Axes>

[Figure: output_42_2]
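The `<Figure size 432x288 with 0 Axes>` line in the output above most likely comes from plt.figure() opening an empty figure before df.plot() creates its own. A minimal alternative sketch, assuming the Agg backend and a fixed seed, works directly with the Axes that df.plot() returns:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

gen = np.random.default_rng(0)
index = pd.date_range("2000-01-01", periods=1000)
df = pd.DataFrame(gen.standard_normal((1000, 4)),
                  index=index, columns=["A", "B", "C", "D"]).cumsum()

# df.plot() creates a figure and returns its Axes; one line per column.
ax = df.plot()
ax.legend(loc="best")

legend_labels = [t.get_text() for t in ax.get_legend().get_texts()]
print(legend_labels)

plt.close(ax.figure)
```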

12. Getting Data In / Out

CSV

df.to_csv('foo.csv')

pd.read_csv('foo.csv',index_col=0)
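CSV stores the index as plain text, so a date index comes back as strings unless it is parsed on reading. A round-trip sketch, assuming a temporary directory and a small integer frame made up for illustration:

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(8).reshape(4, 2),
                  index=pd.date_range("2000-01-01", periods=4),
                  columns=["A", "B"])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "foo.csv")
    df.to_csv(path)
    # index_col=0 restores the index; parse_dates=True turns it back into dates.
    restored = pd.read_csv(path, index_col=0, parse_dates=True)

print(restored)
print(type(restored.index))
```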

HDF5

df.to_hdf('foo.h5', key='df')

pd.read_hdf('foo.h5', key='df')