Pandas基础

Pandas基础

# 导入包
import numpy as np
import pandas as pd

文件的读取和写入

df_csv = pd.read_csv('train.csv')
df_csv
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 12 columns

df_txt = pd.read_table('study_pandas.txt')
df_txt
apple
df_excel = pd.read_excel('study_pandas.xlsx')
df_excel
abalabala

header = None表示第一行不作为列名
index_col表示把某一列或几列作为索引
usecols表示读取列的集合,默认读取所有列
nrows表示读取的数据行数

pd.read_table('study_pandas.txt', header = None)
0 1 2 3 4
0 0 1 2 3.0
1 1 apple quit lalala NaN
2 2 row key ting NaN
3 3 oppo vivo apple NaN
pd.read_csv('train.csv', index_col=['PassengerId'])
Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
PassengerId
1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ...
887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 11 columns

pd.read_table('study_pandas.txt', usecols=['0', '2'])
0 2
0 apple lalala
1 row ting
2 oppo apple
pd.read_csv('train.csv', nrows=2)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 0 PC 17599 71.2833 C85 C

在读取txt文件时, 经常遇到分割符非空格的情况, read_table有一个分割参数sep, 它使得用户可以自定义分割符号,进行txt数据的读取

pd.read_table('study_pandas.txt', sep='\|\|\|\|', engine='python')
Unnamed: 0 0 1 2 3
0 1 apple quit lalala None
1 4 lili lala pipi pion

sep是正则参数

当索引没有特殊意义时,把index设置为False

基本数据结构

pandas 中具有两种基本的数据存储结构,存储一维 values 的 Series 和存储二维 values 的 DataFrame

Series 一般由四个部分组成,分别是序列的值 data 、索引 index 、存储类型 dtype 、序列的名字 name 。其中,索引也可以指定它的名字,默认为空。

s = pd.Series(data = [100, 'a', {'dic1':5}],
             index = pd.Index(['id1', 20, 'third'], name='my_idx'),
             dtype = 'object', 
             name = 'my_name')
s
my_idx
id1              100
20                 a
third    {'dic1': 5}
Name: my_name, dtype: object
s.values
array([100, 'a', {'dic1': 5}], dtype=object)
s.index
Index(['id1', 20, 'third'], dtype='object', name='my_idx')
s.dtype
dtype('O')
s.name
'my_name'
s.shape
(3,)

如果想要取出单个索引对应的值,可以通过 [index_item] 可以取出。

s['third']
{'dic1': 5}

DataFrame 在 Series 的基础上增加了列索引,一个数据框可以由二维的 data 与行列索引来构造:

data = [[1, 'a', 1.2], [2, 'b', 2.2], [3, 'c', 3.2]]
df = pd.DataFrame(data = data,
                  index = ['row_%d'%i for i in range(3)],
                  columns = ['col_0', 'col_1', 'col_2'])
df
col_0 col_1 col_2
row_0 1 a 1.2
row_1 2 b 2.2
row_2 3 c 3.2

采用从列索引名到数据的映射来构造数据框,同时再加上行索引

df = pd.DataFrame(data = {'col_0': [1, 2, 3], 'col_1':list('abc'),
                         'col_2' : [1.2, 2.2, 3.2]},
                 index = ['row_%d'%i for i in range(3)])
df
col_0 col_1 col_2
row_0 1 a 1.2
row_1 2 b 2.2
row_2 3 c 3.2

由于这种映射关系,在 DataFrame 中可以用 [col_name] 与 [col_list] 来取出相应的列与由多个列组成的表,结果分别为 Series 和 DataFrame

df['col_0']
row_0    1
row_1    2
row_2    3
Name: col_0, dtype: int64
df[['col_0', 'col_1']]
col_0 col_1
row_0 1 a
row_1 2 b
row_2 3 c

通过 .T 可以把 DataFrame 进行转置:

df.T
row_0 row_1 row_2
col_0 1 2 3
col_1 a b c
col_2 1.2 2.2 3.2

常用基本函数

df = pd.read_csv('train.csv')
df.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

head, tail 函数分别表示返回表或者序列的前 n 行和后 n 行,其中 n 默认为5:

df.head(2)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
df.tail(3)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.45 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.00 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.75 NaN Q

info, describe 分别返回表的 信息概况 和表中 数值列对应的主要统计量 :

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
df.describe()
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200

在 Series 和 DataFrame 上定义了许多统计函数,最常见的是 sum, mean, median, var, std, max, min

df_demo = df[['Age', 'Pclass']]
df_demo.mean()
Age       29.699118
Pclass     2.308642
dtype: float64
df_demo.max()
Age       80.0
Pclass     3.0
dtype: float64

需要介绍的是 quantile, count, idxmax 这三个函数,它们分别返回的是分位数、非缺失值个数、最大值对应的索引

df_demo.quantile(0.75)
Age       38.0
Pclass     3.0
Name: 0.75, dtype: float64
df_demo.count()
Age       714
Pclass    891
dtype: int64
df_demo.idxmax()
Age       630
Pclass      0
dtype: int64

操作后返回的是标量,所以又称为聚合函数,它们有一个公共参数 axis ,默认为0代表逐列聚合,如果设置为1则表示逐行聚合

df_demo.mean(axis=1).head() # 这是Age和Pclass的均值,没什么意义
0    12.5
1    19.5
2    14.5
3    18.0
4    19.0
dtype: float64

对序列使用 unique 和 nunique 可以分别得到其唯一值组成的列表和唯一值的个数:

df['Age'].unique()
array([22.  , 38.  , 26.  , 35.  ,   nan, 54.  ,  2.  , 27.  , 14.  ,
        4.  , 58.  , 20.  , 39.  , 55.  , 31.  , 34.  , 15.  , 28.  ,
        8.  , 19.  , 40.  , 66.  , 42.  , 21.  , 18.  ,  3.  ,  7.  ,
       49.  , 29.  , 65.  , 28.5 ,  5.  , 11.  , 45.  , 17.  , 32.  ,
       16.  , 25.  ,  0.83, 30.  , 33.  , 23.  , 24.  , 46.  , 59.  ,
       71.  , 37.  , 47.  , 14.5 , 70.5 , 32.5 , 12.  ,  9.  , 36.5 ,
       51.  , 55.5 , 40.5 , 44.  ,  1.  , 61.  , 56.  , 50.  , 36.  ,
       45.5 , 20.5 , 62.  , 41.  , 52.  , 63.  , 23.5 ,  0.92, 43.  ,
       60.  , 10.  , 64.  , 13.  , 48.  ,  0.75, 53.  , 57.  , 80.  ,
       70.  , 24.5 ,  6.  ,  0.67, 30.5 ,  0.42, 34.5 , 74.  ])
df['Age'].nunique()
88

value_counts 可以得到唯一值和其对应出现的频数

df['Embarked'].value_counts()
Embarked
S    644
C    168
Q     77
Name: count, dtype: int64

如果想要观察多个列组合的唯一值,可以使用 drop_duplicates 。其中的关键参数是 keep ,默认值 first 表示每个组合保留第一次出现的所在行, last 表示保留最后一次出现的所在行, False 表示把所有重复组合所在的行剔除。

df_demo = df[['Pclass', 'Embarked', 'PassengerId']]
df_demo.drop_duplicates(['Pclass', 'Embarked'])
Pclass Embarked PassengerId
0 3 S 1
1 1 C 2
3 1 S 4
5 3 Q 6
9 2 C 10
15 2 S 16
19 3 C 20
61 1 NaN 62
245 1 Q 246
303 2 Q 304

duplicated 和 drop_duplicates 的功能类似,但前者返回了是否为唯一值的布尔列表,其 keep 参数与后者一致。其返回的序列,把重复元素设为 True ,否则为 False 。 drop_duplicates 等价于把 duplicated 为 True 的对应行剔除。

df_demo.duplicated(['Pclass', 'Embarked']).head()
0    False
1    False
2     True
3    False
4     True
dtype: bool

pandas 中的替换函数可以归纳为三类:映射替换、逻辑替换、数值替换,此处介绍 replace 的用法。

df['Sex'].replace({'female':0, 'male':1}).head()
C:\Users\dell\AppData\Local\Temp\ipykernel_20744\89183440.py:1: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
  df['Sex'].replace({'female':0, 'male':1}).infer_objects(copy=False).head()





0    1
1    0
2    0
3    0
4    1
Name: Sex, dtype: int64
df['Sex'].replace(['female', 'male'], [0, 1]).head()
C:\Users\dell\AppData\Local\Temp\ipykernel_20744\1075419171.py:1: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
  df['Sex'].replace(['female', 'male'], [0, 1]).head()





0    1
1    0
2    0
3    0
4    1
Name: Sex, dtype: int64

replace 还有一种特殊的方向替换,指定 method 参数为 ffill 则为用前面一个最近的未被替换的值进行替换, bfill 则使用后面最近的未被替换的值进行替换。

s = pd.Series(['a', 1, 'b', 2, 1, 1, 'a'])
s
0    a
1    1
2    b
3    2
4    1
5    1
6    a
dtype: object
s.replace([1, 2], method='ffill')
C:\Users\dell\AppData\Local\Temp\ipykernel_20744\3681488121.py:1: FutureWarning: The 'method' keyword in Series.replace is deprecated and will be removed in a future version.
  s.replace([1, 2], method='ffill')





0    a
1    a
2    b
3    b
4    b
5    b
6    a
dtype: object
s.replace([1, 2], method='bfill')
C:\Users\dell\AppData\Local\Temp\ipykernel_20744\1140877935.py:1: FutureWarning: The 'method' keyword in Series.replace is deprecated and will be removed in a future version.
  s.replace([1, 2], method='bfill')





0    a
1    b
2    b
3    a
4    a
5    a
6    a
dtype: object

逻辑替换包括了 where 和 mask ,这两个函数是完全对称的: where 函数在传入条件为 False 的对应行进行替换,而 mask 在传入条件为 True 的对应行进行替换,当不指定替换值时,替换为缺失值。

s =pd.Series([-1, 1.2345, 100, -50])
s.where(s < 0, 100)
0     -1.0
1    100.0
2    100.0
3    -50.0
dtype: float64
s.mask(s<0, -50)
0    -50.0000
1      1.2345
2    100.0000
3    -50.0000
dtype: float64

传入的条件只需是与被调用的 Series 索引一致的布尔序列即可:

s_condition = pd.Series([True, False, False, True], index = s.index)
s.mask(s_condition, -50)
0    -50.0000
1      1.2345
2    100.0000
3    -50.0000
dtype: float64

数值替换包含了 round, abs, clip 方法,它们分别表示按照给定精度四舍五入、取绝对值和截断:

s = pd.Series([-1, 1.2345, 100, -50])
s.round(2)
0     -1.00
1      1.23
2    100.00
3    -50.00
dtype: float64
s.abs()
0      1.0000
1      1.2345
2    100.0000
3     50.0000
dtype: float64
s.clip(-1, 100)
0     -1.0000
1      1.2345
2    100.0000
3     -1.0000
dtype: float64

排序共有两种方式,其一为值排序,其二为索引排序,对应的函数是 sort_values 和 sort_index 。

为了演示排序函数,下面先利用 set_index 方法把年级和姓名两列作为索引,对年龄进行排序,默认参数 ascending=True 为升序:

df_demo = df[['Sex', 'Age', 'Fare', 'Name']].set_index(['Sex', 'Name'])
df_demo.sort_values('Age').head()
Age Fare
Sex Name
male Thomas, Master. Assad Alexander 0.42 8.5167
Hamalainen, Master. Viljo 0.67 14.5000
female Baclini, Miss. Eugenie 0.75 19.2583
Baclini, Miss. Helene Barbara 0.75 19.2583
male Caldwell, Master. Alden Gates 0.83 29.0000
df_demo.sort_values('Fare', ascending=False).head()
Age Fare
Sex Name
female Ward, Miss. Anna 35.0 512.3292
male Lesurer, Mr. Gustave J 35.0 512.3292
Cardeza, Mr. Thomas Drake Martinez 36.0 512.3292
female Fortune, Miss. Mabel Helen 23.0 263.0000
male Fortune, Mr. Charles Alexander 19.0 263.0000

在排序中,经常遇到多列排序的问题,比如在年龄相同的情况下,对票价进行排序,并且保持票价降序排列,年龄降序排列:

df_demo.sort_values(['Age', 'Fare'], ascending=[False, False]).head()
Age Fare
Sex Name
male Barkworth, Mr. Algernon Henry Wilson 80.0 30.0000
Svensson, Mr. Johan 74.0 7.7750
Artagaveytia, Mr. Ramon 71.0 49.5042
Goldschmidt, Mr. George B 71.0 34.6542
Connors, Mr. Patrick 70.5 7.7500

索引排序的用法和值排序完全一致,只不过元素的值在索引中,此时需要指定索引层的名字或者层号,用参数 level 表示。另外,需要注意的是字符串的排列顺序由字母顺序决定。

df_demo.sort_index(level = ['Sex', 'Name'], ascending=[True, False]).head()
Age Fare
Sex Name
female de Messemaeker, Mrs. Guillaume Joseph (Emma) 36.0 17.4000
Zabour, Miss. Thamine NaN 14.4542
Zabour, Miss. Hileni 14.5 14.4542
Yrois, Miss. Henriette ("Mrs Harbeck") 24.0 13.0000
Young, Miss. Marie Grice 36.0 135.6333

apply 方法常用于 DataFrame 的行迭代或者列迭代,它的 axis 含义与第2小节中的统计聚合函数一致, apply 的参数往往是一个以序列为输入的函数。例如对于 .mean() ,使用 apply 可以如下地写出

df_demo = df[['Age', 'Fare']]
def my_mean(x):
    res = x.mean()
    return res
df_demo.apply(my_mean)
Age     29.699118
Fare    32.204208
dtype: float64

df_demo.apply(lambda x:x.mean())
Age     29.699118
Fare    32.204208
dtype: float64
posted @ 2024-07-22 07:46  0214jx  阅读(134)  评论(0)    收藏  举报