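6. Grouping
By "group by" we usually mean a process involving one or more of the following steps:
(Splitting) dividing the data into groups according to some criteria
(Applying) running a function on each group independently
(Combining) combining the results into a single data structure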
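The three steps above can be sketched by hand and compared with what groupby does in one call. This is a minimal illustration, not the library's internal implementation; the frame df_demo and its column names "key" and "value" are hypothetical and do not come from the article.

```python
import pandas as pd

# Hypothetical demo data; "key" and "value" are illustrative names only.
df_demo = pd.DataFrame({
    "key": ["foo", "bar", "foo", "bar"],
    "value": [1, 2, 3, 4],
})

# Splitting: collect the rows that belong to each distinct key.
groups = {key: df_demo[df_demo["key"] == key] for key in df_demo["key"].unique()}

# Applying: run a function (here, sum) on each group separately.
sums = {key: int(group["value"].sum()) for key, group in groups.items()}

# Combining: assemble the per-group results into one Series.
manual = pd.Series(sums, name="value").sort_index()

# groupby performs the same three steps in a single call.
grouped = df_demo.groupby("key")["value"].sum()

print(manual)
print(grouped)
```

Both results hold the same per-key totals, which is the sense in which groupby is shorthand for split, apply, combine.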
The DataFrame to work with is:
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
    "C": np.random.randn(8),
    "D": np.random.randn(8),
})
df

     A      B         C         D
0  foo    one  0.961295 -0.281012
1  bar    one  0.901454  0.621284
2  foo    two -0.584834  0.919414
3  bar  three  1.259104 -1.012103
4  foo    two  0.153107  1.108028
5  bar    two  0.115963  1.333981
6  foo    one  1.421895 -1.456916
7  foo  three -2.103125 -1.757291
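1. Group by column A and apply the sum function to each group:
# Select the numeric columns explicitly; recent pandas versions no longer drop the string column B automatically.
df.groupby("A")[["C", "D"]].sum()

            C         D
A
bar  2.276522  0.943161
foo -0.151661 -1.467777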
2. Grouping by multiple columns forms a hierarchical index, to which the function is then applied:
df.groupby(["A", "B"]).sum()

                  C         D
A   B
bar one    0.901454  0.621284
    three  1.259104 -1.012103
    two    0.115963  1.333981
foo one    2.383191 -1.737928
    three -2.103125 -1.757291
    two   -0.431727  2.027441

7. Reshaping
Stack
tuples = list(zip(*[["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
                    ["one", "two", "one", "two", "one", "two", "one", "two"]]))
tuples

[("bar", "one"), ("bar", "two"), ("baz", "one"), ("baz", "two"),
 ("foo", "one"), ("foo", "two"), ("qux", "one"), ("qux", "two")]

index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=["A", "B"])
df2 = df[:4]
df2

                     A         B
first second
bar   one    -0.907306 -0.009961
      two     0.905177 -2.877961
baz   one    -0.356070 -0.373447
      two    -1.496644 -1.958782

stacked = df2.stack()
stacked

first  second
bar    one     A   -0.907306
               B   -0.009961
       two     A    0.905177
               B   -2.877961
baz    one     A   -0.356070
               B   -0.373447
       two     A   -1.496644
               B   -1.958782
dtype: float64
stacked.unstack()

                     A         B
first second
bar   one    -0.907306 -0.009961
      two     0.905177 -2.877961
baz   one    -0.356070 -0.373447
      two    -1.496644 -1.958782
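The round trip between stack() and unstack() is easier to verify with fixed values than with random ones. The sketch below uses a hypothetical frame named wide (the name and the numbers are illustrative, not from the article) and checks that unstacking a stacked frame recovers the original values.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("bar", "one"), ("bar", "two"), ("baz", "one"), ("baz", "two")],
    names=["first", "second"],
)
# Hypothetical fixed values so the result can be checked by eye.
wide = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0, 4.0], "B": [5.0, 6.0, 7.0, 8.0]},
    index=index,
)

# stack() moves the column labels into a new innermost index level,
# so the 4x2 DataFrame becomes a Series with 8 entries.
long = wide.stack()

# unstack() with no argument moves the innermost level back to the columns.
restored = long.unstack()

print(long)
print(restored)
```

Passing a level number or level name to unstack() selects a different index level to move into the columns.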
stacked.unstack(1)

second        one       two
first
bar   A -0.907306  0.905177
      B -0.009961 -2.877961
baz   A -0.356070 -1.496644
      B -0.373447 -1.958782

8. Operations
The DataFrame to work with is:
df

                   A         B         C  D   F
2013-01-01  0.000000  0.000000  0.135704  5 NaN
2013-01-02  0.139027  1.683491 -1.031190  5   1
2013-01-03 -0.596279 -1.211098  1.169525  5   2
2013-01-04  0.367213 -0.020313  2.169802  5   3
2013-01-05  0.224122  1.003625 -0.488250  5   4
2013-01-06  0.186073 -0.537019 -0.252442  5   5
(1) Statistics
1. Compute a descriptive statistic (the mean of each column):
df.mean()

A    0.053359
B    0.153115
C    0.283858
D    5.000000
F    3.000000
dtype: float64
2. Perform the same operation along the other axis (the mean of each row):
df.mean(axis=1)

2013-01-01    1.283926
2013-01-02    1.358266
2013-01-03    1.272430
2013-01-04    2.103341
2013-01-05    1.947899
2013-01-06    1.879322
Freq: D, dtype: float64
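Note that column F above averages to 3.0 even though it contains a NaN: mean() skips missing values by default. A small sketch with fixed, hypothetical values (the frame and column names are illustrative) shows both axes and the NaN handling:

```python
import numpy as np
import pandas as pd

# Hypothetical values; column "x" has one missing entry.
frame = pd.DataFrame({"x": [1.0, 2.0, np.nan], "y": [4.0, 6.0, 8.0]})

# axis=0 (the default): one value per column; NaN is skipped, so x -> (1 + 2) / 2.
col_means = frame.mean()

# axis=1: one value per row; the last row averages only the non-missing 8.0.
row_means = frame.mean(axis=1)

print(col_means)
print(row_means)
```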
3. When operating on objects that have different dimensionality and need alignment, pandas automatically broadcasts along the specified dimension:
dates

DatetimeIndex(["2013-01-01", "2013-01-02", "2013-01-03", "2013-01-04",
               "2013-01-05", "2013-01-06"],
              dtype="datetime64[ns]", freq="D")

s = pd.Series([1, 3, 4, np.nan, 6, 8], index=dates).shift(2)
s

2013-01-01   NaN
2013-01-02   NaN
2013-01-03     1
2013-01-04     3
2013-01-05     4
2013-01-06   NaN
Freq: D, dtype: float64
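The source stops at building s and does not show an operation that uses it. One operation that exercises this alignment is DataFrame.sub with axis="index"; that this is what the author had in mind is an assumption. The sketch below uses a smaller hypothetical frame with fixed values so the result can be checked.

```python
import pandas as pd

dates = pd.date_range("2013-01-01", periods=4)
# Hypothetical fixed values, not the article's random data.
frame = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0, 4.0], "B": [10.0, 20.0, 30.0, 40.0]},
    index=dates,
)

# shift(1) moves values down one row, leaving NaN in the first position.
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=dates).shift(1)

# axis="index" aligns s with the row labels and subtracts it from every column.
result = frame.sub(s, axis="index")

print(result)
```

Rows where s is NaN come out as NaN in every column, because a missing operand propagates through the subtraction.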
(2) Apply
Apply functions to the data:
df.apply(np.cumsum)

                   A         B         C   D   F
2013-01-01  0.000000  0.000000  0.135704   5 NaN
2013-01-02  0.139027  1.683491 -0.895486  10   1
2013-01-03 -0.457252  0.472393  0.274039  15   3
2013-01-04 -0.090039  0.452081  2.443841  20   6
2013-01-05  0.134084  1.455706  1.955591  25  10
2013-01-06  0.320156  0.918687  1.703149  30  15

df.apply(lambda x: x.max() - x.min())

A    0.963492
B    2.894589
C    3.200992
D    0.000000
F    4.000000
dtype: float64
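The two calls above differ in what the function returns: a reducing function yields one value per column, while a function that returns a Series of the same length yields a DataFrame. The sketch below, on a hypothetical frame with fixed values, also shows the axis=1 variant, which the article does not cover.

```python
import pandas as pd

# Hypothetical fixed values for easy checking.
frame = pd.DataFrame({"A": [1, 4, 2], "B": [10, 30, 20]})

# Default axis=0: the function receives each column as a Series.
col_range = frame.apply(lambda x: x.max() - x.min())

# axis=1: the function receives each row as a Series.
row_range = frame.apply(lambda x: x.max() - x.min(), axis=1)

# A function returning a Series of the same length produces a DataFrame.
running = frame.apply(lambda x: x.cumsum())

print(col_range)
print(row_range)
print(running)
```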
(3) String methods
A Series comes with a set of string processing methods in its str attribute, which makes it easy to operate on every element of the array.
s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"])
s.str.lower()

0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object

9. Time Series
1. Time zone representation:
rng = pd.date_range("3/6/2012 00:00", periods=5, freq="D")
ts = pd.Series(np.random.randn(len(rng)), rng)
ts

2012-03-06   -0.932261
2012-03-07   -1.405305
2012-03-08    0.809844
2012-03-09   -0.481539
2012-03-10   -0.489847
Freq: D, dtype: float64

ts_utc = ts.tz_localize("UTC")
ts_utc

2012-03-06 00:00:00+00:00   -0.932261
2012-03-07 00:00:00+00:00   -1.405305
2012-03-08 00:00:00+00:00    0.809844
2012-03-09 00:00:00+00:00   -0.481539
2012-03-10 00:00:00+00:00   -0.489847
Freq: D, dtype: float64
2. Converting to another time zone:
ts_utc.tz_convert("US/Eastern")

2012-03-05 19:00:00-05:00   -0.932261
2012-03-06 19:00:00-05:00   -1.405305
2012-03-07 19:00:00-05:00    0.809844
2012-03-08 19:00:00-05:00   -0.481539
2012-03-09 19:00:00-05:00   -0.489847
Freq: D, dtype: float64
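The difference between the two calls is worth spelling out: tz_localize attaches a time zone to naive timestamps, while tz_convert re-expresses the same instants in another zone. A minimal sketch with fixed, hypothetical values:

```python
import pandas as pd

rng = pd.date_range("2012-03-06", periods=3, freq="D")
naive = pd.Series([1.0, 2.0, 3.0], index=rng)  # hypothetical fixed values

# tz_localize labels the naive timestamps as UTC without shifting them.
utc = naive.tz_localize("UTC")

# tz_convert keeps each instant and changes only how it is displayed.
eastern = utc.tz_convert("US/Eastern")

print(naive.index.tz)          # None: the original index has no time zone
print(utc.index[0])
print(eastern.index[0])
```

The data values are untouched by either call; only the index changes.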
3. Converting between time span representations (timestamps and periods):
# Month-end frequency; newer pandas versions spell this alias "ME" for date_range.
rng = pd.date_range("1/1/2012", periods=5, freq="M")
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ps = ts.to_period()
ts

2012-01-31    0.932519
2012-02-29    0.247016
2012-03-31   -0.946069
2012-04-30    0.267513
2012-05-31   -0.554343
Freq: M, dtype: float64

ps

2012-01    0.932519
2012-02    0.247016
2012-03   -0.946069
2012-04    0.267513
2012-05   -0.554343
Freq: M, dtype: float64

ps.to_timestamp()

2012-01-01    0.932519
2012-02-01    0.247016
2012-03-01   -0.946069
2012-04-01    0.267513
2012-05-01   -0.554343
Freq: MS, dtype: float64

10. Plotting
ts = pd.Series(np.random.randn(1000), index=pd.date_range("1/1/2000", periods=1000))
ts = ts.cumsum()
# Draw the cumulative sum as a line plot (requires matplotlib).
ts.plot()
[Figure: plot of the series ts]
11. Categorical
Since version 0.15, pandas supports Categorical data in a DataFrame.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "raw_grade": ["a", "b", "b", "a", "a", "e"],
})
df

   id raw_grade
0   1         a
1   2         b
2   3         b
3   4         a
4   5         a
5   6         e
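1. Convert the raw grades to a Categorical data type:
# The ordering is passed through CategoricalDtype; astype("category", ordered=True) is not accepted by current pandas.
df["grade"] = df["raw_grade"].astype(pd.CategoricalDtype(ordered=True))
df["grade"]

0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): [a < b < e]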
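2. Rename the categories to more meaningful names:
# Assigning to .cat.categories is no longer supported in current pandas; rename_categories returns the renamed column.
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])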
3. Reorder the categories and add the missing ones at the same time:
df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
df["grade"]

0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (5, object): [very bad < bad < medium < good < very good]
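4. Sorting follows the order of the categories, not lexical order:
# DataFrame.sort was removed from pandas; sort_values is its replacement.
df.sort_values(by="grade")

   id raw_grade      grade
5   6         e   very bad
1   2         b       good
2   3         b       good
0   1         a  very good
3   4         a  very good
4   5         a  very good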
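To make the contrast with lexical order concrete, the sketch below sorts the same labels twice: once as plain strings and once as an ordered categorical. The frame grades and its column names are hypothetical and chosen so that the two orders differ.

```python
import pandas as pd

# Hypothetical labels; as plain strings, "good" sorts before "very bad".
grades = pd.DataFrame({"id": [1, 2, 3], "label": ["good", "very bad", "very good"]})

# Plain strings sort alphabetically.
lexical = grades.sort_values(by="label")["label"].tolist()

# An ordered categorical sorts by the declared category order instead.
order = pd.CategoricalDtype(["very bad", "good", "very good"], ordered=True)
grades["label_cat"] = grades["label"].astype(order)
by_category = grades.sort_values(by="label_cat")["label"].tolist()

print(lexical)
print(by_category)
```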
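5. Grouping by a Categorical column also shows the empty categories:
# observed=False keeps categories that have no rows; the default for this parameter differs between pandas versions.
df.groupby("grade", observed=False).size()

grade
very bad     1
bad          0
medium       0
good         2
very good    3
dtype: int64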
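Whether empty categories appear in the result is controlled by the observed parameter of groupby. The sketch below, using a hypothetical column with one unused category, compares both settings.

```python
import pandas as pd

# Hypothetical data: the category "bad" is declared but never occurs.
grade = pd.Categorical(
    ["good", "good", "very good"],
    categories=["bad", "good", "very good"],
)
frame = pd.DataFrame({"grade": grade})

# observed=False reports every declared category, including empty ones.
all_categories = frame.groupby("grade", observed=False).size()

# observed=True reports only the categories that actually occur.
seen_only = frame.groupby("grade", observed=True).size()

print(all_categories)
print(seen_only)
```

Passing observed explicitly avoids depending on the installed version's default.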