數據排序

有邏輯的排序數據

下載 Colab 同步學習

這章我們繼續使用鐵達尼號數據來學習有邏輯的排序。

.sort_values() 排序數值

與之前一樣,導入 pandas.read_csv() 儲存檔案,然後用 .sample() 輸出任意三行。

import pandas as pd

csv_url = 'https://raw.githubusercontent.com/aicourse-org/dataset/main/titanic/titanic.csv'

df = pd.read_csv(csv_url)
df.sample(3)
survived pclass name sex age sibsp parch ticket fare cabin embarked
745 0 1 Crosby, Capt. Edward Gifford male 70.0 1 1 WE/P 5735 71.0000 B22 S
445 1 1 Dodge, Master. Washington male 4.0 0 2 33638 81.8583 A34 S
229 0 3 Lefebre, Miss. Mathilde female NaN 3 1 4133 25.4667 NaN S

現在我們使用 .sort_value() 方法把表格將年齡 age 由小到大排序:

df.sort_values('age')
survived pclass name sex age sibsp parch ticket fare cabin embarked
803 1 3 Thomas, Master. Assad Alexander male 0.42 0 1 2625 8.5167 NaN C
755 1 2 Hamalainen, Master. Viljo male 0.67 1 1 250649 14.5000 NaN S
644 1 3 Baclini, Miss. Eugenie female 0.75 2 1 2666 19.2583 NaN C
469 1 3 Baclini, Miss. Helene Barbara female 0.75 2 1 2666 19.2583 NaN C
78 1 2 Caldwell, Master. Alden Gates male 0.83 0 2 248738 29.0000 NaN S
859 0 3 Razi, Mr. Raihed male NaN 0 0 2629 7.2292 NaN C
863 0 3 Sage, Miss. Dorothy Edith “Dolly” female NaN 8 2 CA. 2343 69.5500 NaN S
868 0 3 van Melkebeke, Mr. Philemon male NaN 0 0 345777 9.5000 NaN S
878 0 3 Laleff, Mr. Kristo male NaN 0 0 349217 7.8958 NaN S
888 0 3 Johnston, Miss. Catherine Helen “Carrie” female NaN 1 2 W./C. 6607 23.4500 NaN S

891 rows × 11 columns

當然我們也可以用降冪 ascending=False 來排序( ascending 指升序 ):

df.sort_values('age', ascending=False)
survived pclass name sex age sibsp parch ticket fare cabin embarked
630 1 1 Barkworth, Mr. Algernon Henry Wilson male 80.0 0 0 27042 30.0000 A23 S
851 0 3 Svensson, Mr. Johan male 74.0 0 0 347060 7.7750 NaN S
493 0 1 Artagaveytia, Mr. Ramon male 71.0 0 0 PC 17609 49.5042 NaN C
96 0 1 Goldschmidt, Mr. George B male 71.0 0 0 PC 17754 34.6542 A5 C
116 0 3 Connors, Mr. Patrick male 70.5 0 0 370369 7.7500 NaN Q
859 0 3 Razi, Mr. Raihed male NaN 0 0 2629 7.2292 NaN C
863 0 3 Sage, Miss. Dorothy Edith “Dolly” female NaN 8 2 CA. 2343 69.5500 NaN S
868 0 3 van Melkebeke, Mr. Philemon male NaN 0 0 345777 9.5000 NaN S
878 0 3 Laleff, Mr. Kristo male NaN 0 0 349217 7.8958 NaN S
888 0 3 Johnston, Miss. Catherine Helen “Carrie” female NaN 1 2 W./C. 6607 23.4500 NaN S

891 rows × 11 columns

我們可以再加一列 Survived 為排序條件,所以如果在同年齡下存活 Survived = 1 會先顯示。

我們這裡在後方用了切片 [13:20] 顯示14到20行。我們可以看到同樣是62歲,存活的會先排序。

df.sort_values(['age', 'survived'], ascending=False)[13:20]
survived pclass name sex age sibsp parch ticket fare cabin embarked
275 1 1 Andrews, Miss. Kornelia Theodosia female 63.0 1 0 13502 77.9583 D7 S
483 1 3 Turkula, Mrs. (Hedwig) female 63.0 0 0 4134 9.5875 NaN S
570 1 2 Harris, Mr. George male 62.0 0 0 S.W./PP 752 10.5000 NaN S
829 1 1 Stone, Mrs. George Nelson (Martha Evelyn) female 62.0 0 0 113572 80.0000 B28 NaN
252 0 1 Stead, Mr. William Thomas male 62.0 0 0 113514 26.5500 C87 S
555 0 1 Wright, Mr. George male 62.0 0 0 113807 26.5500 NaN S
170 0 1 Van der hoef, Mr. Wyckoff male 61.0 0 0 111240 33.5000 B19 S

你也可以使用 [:20] 來顯示最後二十行。

df.sort_values(['age', 'survived'], ascending=False)[:20]
survived pclass name sex age sibsp parch ticket fare cabin embarked
630 1 1 Barkworth, Mr. Algernon Henry Wilson male 80.0 0 0 27042 30.0000 A23 S
851 0 3 Svensson, Mr. Johan male 74.0 0 0 347060 7.7750 NaN S
96 0 1 Goldschmidt, Mr. George B male 71.0 0 0 PC 17754 34.6542 A5 C
493 0 1 Artagaveytia, Mr. Ramon male 71.0 0 0 PC 17609 49.5042 NaN C
116 0 3 Connors, Mr. Patrick male 70.5 0 0 370369 7.7500 NaN Q
672 0 2 Mitchell, Mr. Henry Michael male 70.0 0 0 C.A. 24580 10.5000 NaN S
745 0 1 Crosby, Capt. Edward Gifford male 70.0 1 1 WE/P 5735 71.0000 B22 S
33 0 2 Wheadon, Mr. Edward H male 66.0 0 0 C.A. 24579 10.5000 NaN S
54 0 1 Ostby, Mr. Engelhart Cornelius male 65.0 0 1 113509 61.9792 B30 C
280 0 3 Duane, Mr. Frank male 65.0 0 0 336439 7.7500 NaN Q
456 0 1 Millet, Mr. Francis Davis male 65.0 0 0 13509 26.5500 E38 S
438 0 1 Fortune, Mr. Mark male 64.0 1 4 19950 263.0000 C23 C25 C27 S
545 0 1 Nicholson, Mr. Arthur Ernest male 64.0 0 0 693 26.0000 NaN S
275 1 1 Andrews, Miss. Kornelia Theodosia female 63.0 1 0 13502 77.9583 D7 S
483 1 3 Turkula, Mrs. (Hedwig) female 63.0 0 0 4134 9.5875 NaN S
570 1 2 Harris, Mr. George male 62.0 0 0 S.W./PP 752 10.5000 NaN S
829 1 1 Stone, Mrs. George Nelson (Martha Evelyn) female 62.0 0 0 113572 80.0000 B28 NaN
252 0 1 Stead, Mr. William Thomas male 62.0 0 0 113514 26.5500 C87 S
555 0 1 Wright, Mr. George male 62.0 0 0 113807 26.5500 NaN S
170 0 1 Van der hoef, Mr. Wyckoff male 61.0 0 0 111240 33.5000 B19 S

.valus_counts()

除了排序,我們可以用 .value_counts() 來輸出每個列中不同變數的數量,例如 sex

df['sex'].value_counts()
male      577
female    314
Name: sex, dtype: int64

一但有了數量,我們就可以輕易地做出圖表。(別擔心,我們會在下面的章節學習如何製作圖表。)

df['sex'].value_counts().plot.bar()
<AxesSubplot:>

.pivot()

在整理數據時,會用到一個很實用的方法 .pivot()

我們在輸出一次鐵達尼號數據:

df.sample(3)
survived pclass name sex age sibsp parch ticket fare cabin embarked
758 0 3 Theobald, Mr. Thomas Leonard male 34.0 0 0 363294 8.050 NaN S
20 0 2 Fynney, Mr. Joseph J male 35.0 0 0 239865 26.000 NaN S
442 0 3 Petterson, Mr. Johan Emil male 25.0 1 0 347076 7.775 NaN S

.pivot() 是一個 pandas DataFrame 將數據重整成你指定的行列排序的方法。如下,我們指定每一列都顯示 survived

df.pivot(columns='survived')
pclass name sex age sibsp parch ticket fare cabin embarked
survived 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
0 3.0 NaN Braund, Mr. Owen Harris NaN male NaN 22.0 NaN 1.0 NaN 0.0 NaN A/5 21171 NaN 7.25 NaN NaN NaN S NaN
1 NaN 1.0 NaN Cumings, Mrs. John Bradley (Florence Briggs Th… NaN female NaN 38.0 NaN 1.0 NaN 0.0 NaN PC 17599 NaN 71.2833 NaN C85 NaN C
2 NaN 3.0 NaN Heikkinen, Miss. Laina NaN female NaN 26.0 NaN 0.0 NaN 0.0 NaN STON/O2. 3101282 NaN 7.9250 NaN NaN NaN S
3 NaN 1.0 NaN Futrelle, Mrs. Jacques Heath (Lily May Peel) NaN female NaN 35.0 NaN 1.0 NaN 0.0 NaN 113803 NaN 53.1000 NaN C123 NaN S
4 3.0 NaN Allen, Mr. William Henry NaN male NaN 35.0 NaN 0.0 NaN 0.0 NaN 373450 NaN 8.05 NaN NaN NaN S NaN
886 2.0 NaN Montvila, Rev. Juozas NaN male NaN 27.0 NaN 0.0 NaN 0.0 NaN 211536 NaN 13.00 NaN NaN NaN S NaN
887 NaN 1.0 NaN Graham, Miss. Margaret Edith NaN female NaN 19.0 NaN 0.0 NaN 0.0 NaN 112053 NaN 30.0000 NaN B42 NaN S
888 3.0 NaN Johnston, Miss. Catherine Helen “Carrie” NaN female NaN NaN NaN 1.0 NaN 2.0 NaN W./C. 6607 NaN 23.45 NaN NaN NaN S NaN
889 NaN 1.0 NaN Behr, Mr. Karl Howell NaN male NaN 26.0 NaN 0.0 NaN 0.0 NaN 111369 NaN 30.0000 NaN C148 NaN C
890 3.0 NaN Dooley, Mr. Patrick NaN male NaN 32.0 NaN 0.0 NaN 0.0 NaN 370376 NaN 7.75 NaN NaN NaN Q NaN

891 rows × 20 columns

所有原本的列都被拆分為 未存活 0 和存活 1。閱讀起來有點困難,下面我們加上 agevalues(數據):

df.pivot(columns='survived', values='age')
survived 0 1
0 22.0 NaN
1 NaN 38.0
2 NaN 26.0
3 NaN 35.0
4 35.0 NaN
886 27.0 NaN
887 NaN 19.0
888 NaN NaN
889 NaN 26.0
890 32.0 NaN

891 rows × 2 columns

我們再加上一個列 sex。我們這裡使用列表 [] 因為有兩個列。

df.pivot(columns='survived', values=['age','sex'])
age sex
survived 0 1 0 1
0 22 NaN male NaN
1 NaN 38 NaN female
2 NaN 26 NaN female
3 NaN 35 NaN female
4 35 NaN male NaN
886 27 NaN male NaN
887 NaN 19 NaN female
888 NaN NaN female NaN
889 NaN 26 NaN male
890 32 NaN male NaN

891 rows × 4 columns

df.pivot(columns='survived', values=['age','sex'])
age sex
survived 0 1 0 1
0 22 NaN male NaN
1 NaN 38 NaN female
2 NaN 26 NaN female
3 NaN 35 NaN female
4 35 NaN male NaN
886 27 NaN male NaN
887 NaN 19 NaN female
888 NaN NaN female NaN
889 NaN 26 NaN male
890 32 NaN male NaN

891 rows × 4 columns

.pivot() 可以將行與列重整。接下來有另一個方法叫 .pivot_table().pivot() 相似,但它可以直接對重組行與列的數據進行聚合計算 (aggregate)。

如下我們使用 .pivot_table() 得到所有其他列的平均值

df.pivot_table(columns='survived')
survived 0 1
age 30.626179 28.343690
fare 22.117887 48.395408
parch 0.329690 0.464912
pclass 2.531876 1.950292
sibsp 0.553734 0.473684

我們可以指定只得到 age 的平均值。

df.pivot_table(columns='survived', values='age')
survived 0 1
age 30.626179 28.34369

.pivot_table() 預設的聚合計算是平均值,可以使用 aggfunc= 來輸出其他的結果,例如中位數。

df.pivot_table(columns='survived', values='age', aggfunc='median')
survived 0 1
age 28.0 28.0

.groupby() 分組

最後我們來介紹另一個常用的方法:分組 .groupby()

df.sample(3)
survived pclass name sex age sibsp parch ticket fare cabin embarked
357 0 2 Funk, Miss. Annie Clemmer female 38.0 0 0 237671 13.00 NaN S
418 0 2 Matthews, Mr. William John male 30.0 0 0 28228 13.00 NaN S
560 0 3 Morrow, Mr. Thomas Rowan male NaN 0 0 372622 7.75 NaN Q

.groupby() 的目的與 .pivot_table() 相同,但是一個更快擷取行與列資訊的方式。

.groupby() 的參數裡填入想擷取的 ,並在後方加上聚合函數。下面我們擷取列 sex 然後計算其他列中男女的平均值。

df.groupby('sex').mean()
survived pclass age sibsp parch fare
sex
female 0.742038 2.159236 27.915709 0.694268 0.649682 44.479818
male 0.188908 2.389948 30.726645 0.429809 0.235702 25.523893

可以使用切片來單獨顯示年齡 age 的平均值:

df.groupby('sex')['age'].mean()
sex
female    27.915709
male      30.726645
Name: age, dtype: float64

也可以在函數裡再增加一個列 survived 來顯示年齡對性別和存活的平均值。

df.groupby(['sex','survived'])['age'].mean()
sex     survived
female  0           25.046875
        1           28.847716
male    0           31.618056
        1           27.276022
Name: age, dtype: float64

後方聚合函數換成 .median() 來顯示中位數:

df.groupby(['sex','survived'])['age'].median()
sex     survived
female  0           24.5
        1           28.0
male    0           29.0
        1           28.0
Name: age, dtype: float64

.groupby() 特點

因為 .groupby() 是一個特別為了得到 聚合計算 的快速方式,所以如果沒有填入聚合函數,就不會有任何結果輸出。

df.groupby(['sex','survived'])
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f9d8767c2e0>

值得一提的是 .groupby() 也可以類似於 .pitvot() 一樣指定列。

下面我們用 .size() 來顯示數量。

df.groupby(['sex','survived']).size()
sex     survived
female  0            81
        1           233
male    0           468
        1           109
dtype: int64

然後使用 .unstack('survived')survived 指定成列。

df.groupby(['sex','survived']).size().unstack('survived')
survived 0 1
sex
female 81 233
male 468 109

sex 為列。

df.groupby(['sex','survived']).size().unstack('sex')
sex female male
survived
0 81 468
1 233 109

再多一層行 pclass

df.groupby(['pclass','sex','survived']).size().unstack('survived')
survived 0 1
pclass sex
1 female 3 91
male 77 45
2 female 6 70
male 91 17
3 female 72 72
male 300 47

OK,現在你有了基本排序的知識,下一章我們學習如何用 pandas 畫基本圖表。

v1.0

請告訴我們

請告訴我們您的問題或勘誤建議。
您的意見是我們前進的動力,非常感激!