Data Science Library pandas: Notes 1

caiyongji, published 2019-07-31 10:22 / 4001 views


Data Structures: DataFrame

pandas has two core data structures, Series and DataFrame. A Series is similar to a one-dimensional NumPy array and is not covered in detail here; these notes focus on common DataFrame usage.

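A DataFrame is a tabular data structure containing an ordered collection of columns, each of which can hold a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index, and can be thought of as a dictionary of Series that all share the same index.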
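
To make the "dictionary of Series sharing one index" idea concrete, here is a minimal sketch. The animal data and the variable names (ages, sexes, animals) are made up for illustration and are not from the original notes.

```python
import pandas as pd

# Two Series whose labels only partly overlap (hypothetical data).
ages = pd.Series([3, 5, 6], index=["cat", "dog", "lion"])
sexes = pd.Series(["male", "female"], index=["cat", "dog"])

# Each dict key becomes a column; rows are aligned on the shared index.
animals = pd.DataFrame({"age": ages, "sex": sexes})

print(animals)
print(animals["age"])        # a single column is itself a Series
print(animals.loc["lion"])   # "sex" has no value for "lion", so it is missing
```

Because the rows are aligned by label rather than by position, a label that is missing from one Series simply yields a missing value in that column.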

The notes below cover common DataFrame usage and follow the usual pandas import convention:

from pandas import Series,DataFrame
import pandas as pd

Basic DataFrame Operations

1. Creating a DataFrame

The most common way to create a DataFrame is to pass a dictionary whose values are equal-length lists or NumPy arrays.

In [16]: d = {
    ...:     "name":["cat","dog","lion"],
    ...:     "age":[3,5,6],
    ...:     "sex":["male","female","male"]
    ...: }

In [17]: d
Out[17]:
{"name": ["cat", "dog", "lion"],
 "age": [3, 5, 6],
 "sex": ["male", "female", "male"]}

In [18]: df = pd.DataFrame(d)

In [19]: df
Out[19]:
   name  age     sex
0   cat    3    male
1   dog    5  female
2  lion    6    male

2. Viewing a summary of the DataFrame

In [20]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2  # three index entries, from 0 to 2
Data columns (total 3 columns): # column information
name    3 non-null object # string type
age     3 non-null int64 # integer type
sex     3 non-null object # string type
dtypes: int64(1), object(2) # number of columns per dtype
memory usage: 152.0+ bytes # memory used
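
When only part of that summary is needed, the individual attributes can be read directly. A small sketch, assuming the same three-row animal table as above:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["cat", "dog", "lion"],
    "age": [3, 5, 6],
    "sex": ["male", "female", "male"],
})

print(df.shape)             # (number of rows, number of columns)
print(df.columns.tolist())  # column labels in order
print(df.dtypes)            # one dtype per column
```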

3. Slicing and indexing

3.1 Selecting by column

In [24]: df.age
Out[24]:
0    3
1    5
2    6
Name: age, dtype: int64

In [25]: df["age"]
Out[25]:
0    3
1    5
2    6
Name: age, dtype: int64

In [26]: df[["age","name"]]
Out[26]:
   age  name
0    3   cat
1    5   dog
2    6  lion

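3.2 Slicing by row index
Rows can be selected in several ways, for example with the DataFrame indexers ix, loc, and iloc. Note that ix is deprecated, as the warning below shows, and was removed entirely in pandas 1.0.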

In [27]: df.ix[0]
D:work-envioramentanacondaScriptsipython:1: DeprecationWarning:
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
Out[27]:
name     cat
age        3
sex     male
Name: 0, dtype: object

ix can slice by row index, but pandas recommends loc or iloc instead.

In [28]: df.ix[0:1]
D:work-envioramentanacondaScriptsipython:1: DeprecationWarning:
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
Out[28]:
  name  age     sex
0  cat    3    male
1  dog    5  female

In [29]: df[0:1] # list-style slicing
Out[29]:
  name  age   sex
0  cat    3  male

In [30]: df[0:2]
Out[30]:
  name  age     sex
0  cat    3    male
1  dog    5  female

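Rows can likewise be selected with list-style slicing. Note that an ix slice includes the right endpoint while a plain slice does not, so the two select different ranges.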
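
The same endpoint difference shows up between loc and iloc, which are the replacements pandas recommends for ix. A minimal sketch, assuming the three-row animal table with its default integer index:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["cat", "dog", "lion"],
    "age": [3, 5, 6],
    "sex": ["male", "female", "male"],
})

by_label = df.loc[0:1]      # label slice: both endpoints are included
by_position = df.iloc[0:1]  # positional slice: the end is excluded
by_brackets = df[0:1]       # plain slice: positional, the end is excluded

print(len(by_label), len(by_position), len(by_brackets))
```

With the default integer index the labels and positions happen to coincide, which is exactly why the inclusive versus exclusive endpoint is easy to overlook.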

The result of a slice is still a DataFrame, so it can be indexed again (chained indexing).

In [36]: df[0:2]["name"]
Out[36]:
0    cat
1    dog
Name: name, dtype: object

4. Selecting and modifying values

In [37]: df
Out[37]:
   name  age     sex
0   cat    3    male
1   dog    5  female
2  lion    6    male

In [38]: df["age"]
Out[38]:
0    3
1    5
2    6
Name: age, dtype: int64

In [39]: df["age"] = 10 # set every value in the column to 10

In [40]: df
Out[40]:
   name  age     sex
0   cat   10    male
1   dog   10  female
2  lion   10    male

In [41]: df.loc[0, "age"] = 20 # set the first row of the age column to 20 (one loc call avoids chained assignment)

In [42]: df
Out[42]:
   name  age     sex
0   cat   20    male
1   dog   10  female
2  lion   10    male

In [43]: df.age = [3,4,5] # assign a list to set each row's value in the column

In [44]: df
Out[44]:
   name  age     sex
0   cat    3    male
1   dog    4  female
2  lion    5    male
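
A single cell can be written with one indexing call, which avoids the ambiguity of chained assignment such as indexing a column and then a row. The sketch below, using the same animal table, shows three equivalent styles; which one to prefer is a matter of taste.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["cat", "dog", "lion"],
    "age": [3, 5, 6],
    "sex": ["male", "female", "male"],
})

df.loc[0, "age"] = 20   # row label 0, column label "age"
df.at[1, "age"] = 10    # at: fast scalar access by label
df.iloc[2, 1] = 7       # row position 2, column position 1 ("age")

print(df)
```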

5. Filtering data
Rows often need to be filtered by some condition, for example selecting people older than 5 or people who live in Guangzhou.

In [44]: df
Out[44]:
   name  age     sex
0   cat    3    male
1   dog    4  female
2  lion    5    male

In [46]: df.age == 4 # comparison: returns a boolean Series, True where age equals 4
Out[46]:
0    False
1     True
2    False
Name: age, dtype: bool

In [47]: df[df.age == 4] # index with the boolean Series; rows where it is True are returned
Out[47]:
  name  age     sex
1  dog    4  female

In [48]: df[[False,True,False]] # equivalent to the line above
Out[48]:
  name  age     sex
1  dog    4  female

In [51]: df.age > 3 # greater-than and less-than work too
Out[51]:
0    False
1     True
2     True
Name: age, dtype: bool

A handy trick: prefixing one of these comparisons with ~ inverts the result, as follows:

In [54]: df.age == 3
Out[54]:
0     True
1    False
2    False
Name: age, dtype: bool

In [55]: ~(df.age == 3)
Out[55]:
0    False
1     True
2     True
Name: age, dtype: bool

Multiple conditions can be combined as well:

In [57]: df
Out[57]:
   name  age     sex
0   cat    3    male
1   dog    4  female
2  lion    5    male

In [58]: (df.age == 3) & (df.name == "cat")
Out[58]:
0     True
1    False
2    False
dtype: bool

In [59]: df[(df.age == 3) & (df.name == "cat")]
Out[59]:
  name  age   sex
0  cat    3  male
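
Besides & (and), conditions can be combined with | (or), and membership in a set of values can be tested with isin. A short sketch on the same table; each condition needs its own parentheses because & and | bind more tightly than the comparison operators.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["cat", "dog", "lion"],
    "age": [3, 4, 5],
    "sex": ["male", "female", "male"],
})

# Rows that satisfy at least one of the two conditions.
young_or_female = df[(df["age"] < 4) | (df["sex"] == "female")]

# Rows whose name is one of the listed values.
cats_and_lions = df[df["name"].isin(["cat", "lion"])]

print(young_or_female)
print(cats_and_lions)
```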

The pandas query method can also be used for filtering:

In [66]: df.query("age == 3")
Out[66]:
  name  age   sex
0  cat    3  male

In [67]: df.query("(age == 3) & (sex == 'male')")
Out[67]:
  name  age   sex
0  cat    3  male
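
Inside a query string, a local Python variable can be referenced by prefixing it with @, and the keywords and / or can be used in place of & and |. A minimal sketch; the variable name min_age is chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["cat", "dog", "lion"],
    "age": [3, 4, 5],
    "sex": ["male", "female", "male"],
})

min_age = 4  # hypothetical threshold defined outside the query string
older = df.query("age >= @min_age")

# String literals inside the query use the other quote style.
young_males = df.query("sex == 'male' and age < 5")

print(older)
print(young_males)
```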

6. Using loc and iloc

For selecting DataFrame rows, pandas provides the special indexers loc and iloc. They let you select a subset of rows and columns using NumPy-like notation, with either axis labels (loc) or integer positions (iloc).

In [73]: df
Out[73]:
   name  age     sex
0   cat    3    male
1   dog    4  female
2  lion    5    male

In [74]: df.iloc[1] # positional indexing: select the row at position 1
Out[74]:
name       dog
age          4
sex     female
Name: 1, dtype: object

In [75]: df.iloc[0:2] 
Out[75]:
  name  age     sex
0  cat    3    male
1  dog    4  female

If the row labels are strings rather than integers, use loc to select by label.

In [76]: df.index = list("abc") # change the row index to a, b, c

In [77]: df
Out[77]:
   name  age     sex
a   cat    3    male
b   dog    4  female
c  lion    5    male

In [78]: df.loc["a"] # select the row labeled a
Out[78]:
name     cat
age        3
sex     male
Name: a, dtype: object

In [79]: df.loc[["a","b"]]
Out[79]:
  name  age     sex
a  cat    3    male
b  dog    4  female

In [80]: df.iloc[0] # iloc still works
Out[80]:
name     cat
age        3
sex     male
Name: a, dtype: object

iloc indexes by row position, regardless of whether the row labels are integers or strings, whereas loc indexes by row label.
loc and iloc also accept multiple arguments, for rows and columns:

In [83]: df
Out[83]:
   name  age     sex
a   cat    3    male
b   dog    4  female
c  lion    5    male

In [84]: df.iloc[0:2] # select the first and second rows
Out[84]:
  name  age     sex
a  cat    3    male
b  dog    4  female

In [85]: df.iloc[0:2,1] # select a column; column positions start at 0, so this is the second column
Out[85]:
a    3
b    4
Name: age, dtype: int64

In [86]: df.iloc[0:2,[0,1]] # select multiple columns
Out[86]:
  name  age
a  cat    3
b  dog    4
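
The session above only shows the two-argument form for iloc. The label-based counterpart with loc might look like the following sketch, which assumes the same table with the row index a, b, c.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["cat", "dog", "lion"],
        "age": [3, 4, 5],
        "sex": ["male", "female", "male"],
    },
    index=list("abc"),
)

ages_ab = df.loc["a":"b", "age"]               # label slice, end label included
subset = df.loc[["a", "c"], ["name", "sex"]]   # lists of row and column labels
older_names = df.loc[df["age"] > 3, "name"]    # boolean mask for rows, label for column

print(ages_ab)
print(subset)
print(older_names)
```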

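7. Dropping rows or columns from a DataFrame
Dropping one or more entries from an axis is straightforward given an index array or list. Because it involves some data tidying and set logic, the drop method returns a new object with the specified values removed from the given axis: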

For a DataFrame, index values can be dropped from either axis. Calling drop with a sequence of labels removes values from the row labels (axis 0):

In [153]: dd
Out[153]:
   name  age     sex
0   cat    3    male
1   dog    5  female
2  lion    6    male

In [154]: dd.drop([1,2])
Out[154]:
  name  age   sex
0  cat    3  male

Passing axis=1 or axis="columns" drops values from the columns:

In [156]: dd
Out[156]:
   name  age     sex
0   cat    3    male
1   dog    5  female
2  lion    6    male

In [157]: dd.drop("sex",axis=1)
Out[157]:
   name  age
0   cat    3
1   dog    5
2  lion    6

In [158]: dd.drop("sex",axis="columns")
Out[158]:
   name  age
0   cat    3
1   dog    5
2  lion    6
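
drop also accepts the index= and columns= keywords, which some readers find clearer than an axis number. The sketch below additionally checks that the original DataFrame is left untouched, since drop returns a new object.

```python
import pandas as pd

dd = pd.DataFrame({
    "name": ["cat", "dog", "lion"],
    "age": [3, 5, 6],
    "sex": ["male", "female", "male"],
})

without_sex = dd.drop(columns=["sex"])  # same effect as dd.drop("sex", axis=1)
without_first = dd.drop(index=[0])      # same effect as dd.drop([0])

print(without_sex.columns.tolist())
print(without_first.index.tolist())
print(dd.shape)  # the original still has all rows and columns
```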

8. Adding rows and columns to a DataFrame
Take the following DataFrame as an example:

In [182]: dd
Out[182]:
  name  age
0  cat    2
1  dog    3

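Add a row from a dictionary, ignoring the index (note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so this call only works on older versions):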

In [190]: row = {"name":"lion","age":4}

In [191]: dd.append(row,ignore_index=True)
Out[191]:
   name  age
0   cat    2
1   dog    3
2  lion    4
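
On pandas 2.0 and later the same row can be added with pd.concat. This is a sketch of one common replacement, not the approach used in the original session; the name extended is chosen here for illustration.

```python
import pandas as pd

dd = pd.DataFrame({"name": ["cat", "dog"], "age": [2, 3]})
row = {"name": "lion", "age": 4}

# Wrap the dict in a one-row DataFrame, then concatenate and renumber the index.
extended = pd.concat([dd, pd.DataFrame([row])], ignore_index=True)

print(extended)
print(len(dd))  # concat returns a new object; dd itself is unchanged
```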

Use loc to add a row or to overwrite an existing row:

In [206]: dd
Out[206]:
  name  age
0  cat    2
1  dog    3

In [207]: dd.loc[4] = ["lion","4"]

In [208]: dd
Out[208]:
   name age
0   cat   2
1   dog   3
4  lion   4

In [209]: dd.loc[1] = ["dog",5]

In [210]: dd
Out[210]:
   name age
0   cat   2
1   dog   5
4  lion   4
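
The examples in this section only add rows. A column can be added by assigning to a new column label, or with assign, which returns a new DataFrame instead of modifying the original. The column names legs, age_months, and is_adult below are made up for illustration.

```python
import pandas as pd

dd = pd.DataFrame({"name": ["cat", "dog"], "age": [2, 3]})

dd["legs"] = 4                     # a scalar is broadcast to every row
dd["age_months"] = dd["age"] * 12  # a column computed from another column

# assign leaves dd untouched and returns a copy with the extra column.
with_flag = dd.assign(is_adult=dd["age"] >= 3)

print(dd)
print(with_flag)
```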

Common usage in data analysis

The first step is loading data. pandas provides several functions for reading tabular data into a DataFrame; the most commonly used are read_csv and read_table.
Taking read_csv as an example:

In [97]: df = pd.read_csv("dataAnalyst_gbk.csv",encoding="gbk")

In [98]: df
Out[98]:
  city education  top   avg work_year
0   上海        本科    9   8.0        3年
1   廣州        碩士   15  11.0        2年
2   廣州        本科   12  10.0     應屆畢業生
3   北京        本科   13  12.0        2年
4   北京        本科   11   8.0        1年

read_csv decodes files as utf-8 by default, but dataAnalyst_gbk.csv is encoded as gbk, so the encoding parameter must specify the file's encoding for it to be decoded correctly. As the code above shows, read_csv loads tabular data into a DataFrame.

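CSV files use a comma as the delimiter by default. To specify a different delimiter, use the sep parameter; for example, the following reads test.csv with \t (tab) as the delimiter.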

df = pd.read_csv("test.csv",sep="\t")
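
Since test.csv itself is not shown, the following self-contained sketch reads tab-separated text from an in-memory buffer instead of a file. The contents of tsv_text are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical tab-separated content standing in for test.csv.
tsv_text = "name\tage\ncat\t3\ndog\t5\n"

df = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(df)
```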

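To change the column names of the file being read, pass the names parameter. Note that names on its own does not replace the header row: the original first row is kept as data, as the output below shows. To replace it, also pass header=0.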

In [111]: df = pd.read_csv("dataAnalyst_gbk.csv",encoding="gbk",names=["a","b","c","d","e"])

In [112]: df # the column labels are now a, b, c, d, e
Out[112]:
      a          b    c    d          e
0  city  education  top  avg  work_year
1    上海         本科    9  8.0         3年
2    廣州         碩士   15   11         2年
3    廣州         本科   12   10      應屆畢業生
4    北京         本科   13   12         2年
5    北京         本科   11    8         1年
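
In the output above, the file's own header line ends up as row 0 of the data. A sketch of the difference between passing names alone and passing names together with header=0, using a small invented CSV in place of dataAnalyst_gbk.csv:

```python
import io

import pandas as pd

# Hypothetical CSV content with its own header line.
csv_text = "city,top,avg\nShanghai,9,8.0\nBeijing,13,12.0\n"

# names alone: the file's header line is treated as the first data row.
kept = pd.read_csv(io.StringIO(csv_text), names=["a", "b", "c"])

# names plus header=0: the file's header line is discarded and replaced.
renamed = pd.read_csv(io.StringIO(csv_text), names=["a", "b", "c"], header=0)

print(kept)
print(renamed)
```

With header=0 the numeric columns also keep their numeric types, because the text of the old header no longer appears in them.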

Once the data is loaded, it can be analyzed, cleaned, and so on. Some common operations are noted below.

1. head(): shows the first 5 rows by default

For very large DataFrames, head() gives a quick look at the data. It shows the first 5 rows by default, and you can pass a number to control how many rows are shown.

In [116]: df.head()
Out[116]:
  city education  top   avg work_year
0   上海        本科    9   8.0        3年
1   廣州        碩士   15  11.0        2年
2   廣州        本科   12  10.0     應屆畢業生
3   北京        本科   13  12.0        2年
4   北京        本科   11   8.0        1年

In [117]: df.head(3)
Out[117]:
  city education  top   avg work_year
0   上海        本科    9   8.0        3年
1   廣州        碩士   15  11.0        2年
2   廣州        本科   12  10.0     應屆畢業生

To look at the end of the data instead, use tail(), which shows the last 5 rows by default, the opposite of head().
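
A quick sketch of tail on a ten-row DataFrame; the single column n is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"n": range(10)})

last_five = df.tail()    # last 5 rows by default
last_three = df.tail(3)  # or pass the number of rows wanted

print(last_five)
print(last_three)
```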

2. Changing data types
df.info() shows the data type of each column; in practice, these types often need to be changed.

In [120]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
city         5 non-null object
education    5 non-null object
top          5 non-null int64
avg          5 non-null float64
work_year    5 non-null object
dtypes: float64(1), int64(1), object(3)
memory usage: 280.0+ bytes

pandas provides the astype() method for changing data types. For example, the following converts the values of the top column to strings.

In [122]: df.top
Out[122]:
0     9
1    15
2    12
3    13
4    11
Name: top, dtype: int64 # int64 by default

In [123]: df.top.astype("str")
Out[123]:
0     9
1    15
2    12
3    13
4    11
Name: top, dtype: object # now converted to strings

Note that df.top.astype("str") does not modify the original DataFrame; it returns a new Series. To change the original DataFrame, assign the result back: df.top = df.top.astype("str").

In [124]: df.top = df.top.astype("str") 

In [125]: df.top
Out[125]:
0     9
1    15
2    12
3    13
4    11
Name: top, dtype: object
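
astype can also be called on the whole DataFrame with a dictionary that maps column names to target types, converting several columns in one step. A sketch with a cut-down version of the salary table (only the top and avg columns):

```python
import pandas as pd

df = pd.DataFrame({"top": [9, 15, 12], "avg": [8.0, 11.0, 10.0]})

# Convert two columns at once; the result is a new DataFrame.
converted = df.astype({"top": "str", "avg": "int64"})

# Strings that look like integers can be converted back the same way.
back = converted["top"].astype("int64")

print(converted)
print(back)
```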

3. Simple numeric calculations and filtering

In [129]: df
Out[129]:
  city education  top   avg work_year
0   上海        本科    9   8.0        3年
1   廣州        碩士   15  11.0        2年
2   廣州        本科   12  10.0     應屆畢業生
3   北京        本科   13  12.0        2年
4   北京        本科   11   8.0        1年

In [130]: df["avg_2"] = df.avg*2 # add a new column holding twice the avg value

In [131]: df 
Out[131]:
  city education  top   avg work_year  avg_2
0   上海        本科    9   8.0        3年   16.0
1   廣州        碩士   15  11.0        2年   22.0
2   廣州        本科   12  10.0     應屆畢業生   20.0
3   北京        本科   13  12.0        2年   24.0
4   北京        本科   11   8.0        1年   16.0

Find the rows where the average salary is above 10K, or the cities where it is above 10K:

In [133]: df.query("avg>10")
Out[133]:
  city education  top   avg work_year  avg_2
1   廣州        碩士   15  11.0        2年   22.0
3   北京        本科   13  12.0        2年   24.0

In [134]: df.query("avg>10").city
Out[134]:
1    廣州
3    北京
Name: city, dtype: object
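
The same filter can be written with a boolean mask instead of query, and unique removes repeated city names when a city has several matching rows. In this sketch the city names are written in English and only the city and avg columns are kept, purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Shanghai", "Guangzhou", "Guangzhou", "Beijing", "Beijing"],
    "avg": [8.0, 11.0, 10.0, 12.0, 8.0],
})

# Boolean-mask equivalent of df.query("avg>10").
high = df[df["avg"] > 10]

# Select the city column for the matching rows and drop duplicates.
cities = df.loc[df["avg"] > 10, "city"].unique().tolist()

print(high)
print(cities)
```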

