Pandas基础

# 导入包
import numpy as np
import pandas as pd

文件的读取和写入

df_csv = pd.read_csv('train.csv')
df_csv

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
...	...	...	...	...	...	...	...	...	...	...	...	...
886	887	0	2	Montvila, Rev. Juozas	male	27.0	0	0	211536	13.0000	NaN	S
887	888	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	112053	30.0000	B42	S
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.4500	NaN	S
889	890	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.0000	C148	C
890	891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.7500	NaN	Q

891 rows × 12 columns

df_txt = pd.read_table('study_pandas.txt')
df_txt

	apple

df_excel = pd.read_excel('study_pandas.xlsx')
df_excel

	abalabala

header = None表示第一行不作为列名
index_col表示把某一列或几列作为索引
usecols表示读取列的集合，默认读取所有列
nrows表示读取的数据行数

pd.read_table('study_pandas.txt', header = None)

	0	1	2	3	4
0		0	1	2	3.0
1	1	apple	quit	lalala	NaN
2	2	row	key	ting	NaN
3	3	oppo	vivo	apple	NaN

pd.read_csv('train.csv', index_col=['PassengerId'])

	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
PassengerId
1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
...	...	...	...	...	...	...	...	...	...	...	...
887	0	2	Montvila, Rev. Juozas	male	27.0	0	0	211536	13.0000	NaN	S
888	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	112053	30.0000	B42	S
889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.4500	NaN	S
890	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.0000	C148	C
891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.7500	NaN	Q

891 rows × 11 columns

pd.read_table('study_pandas.txt', usecols=['0', '2'])

	0	2
0	apple	lalala
1	row	ting
2	oppo	apple

pd.read_csv('train.csv', nrows=2)

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38	1	0	PC 17599	71.2833	C85	C

在读取txt文件时，经常遇到分割符非空格的情况， read_table有一个分割参数sep, 它使得用户可以自定义分割符号，进行txt数据的读取

pd.read_table('study_pandas.txt', sep='\|\|\|\|', engine='python')

	Unnamed: 0	0	1	2	3
0	1	apple	quit	lalala	None
1	4	lili	lala	pipi	pion

sep是正则参数

当索引没有特殊意义时，把index设置为False

基本数据结构

pandas 中具有两种基本的数据存储结构，存储一维 values 的 Series 和存储二维 values 的 DataFrame

Series 一般由四个部分组成，分别是序列的值 data 、索引 index 、存储类型 dtype 、序列的名字 name 。其中，索引也可以指定它的名字，默认为空。

s = pd.Series(data = [100, 'a', {'dic1':5}],
             index = pd.Index(['id1', 20, 'third'], name='my_idx'),
             dtype = 'object', 
             name = 'my_name')
s

my_idx
id1              100
20                 a
third    {'dic1': 5}
Name: my_name, dtype: object

s.values

array([100, 'a', {'dic1': 5}], dtype=object)

s.index

Index(['id1', 20, 'third'], dtype='object', name='my_idx')

s.dtype

dtype('O')

s.name

'my_name'

s.shape

(3,)

如果想要取出单个索引对应的值，可以通过 [index_item] 可以取出。

s['third']

{'dic1': 5}

DataFrame 在 Series 的基础上增加了列索引，一个数据框可以由二维的 data 与行列索引来构造：

data = [[1, 'a', 1.2], [2, 'b', 2.2], [3, 'c', 3.2]]
df = pd.DataFrame(data = data,
                  index = ['row_%d'%i for i in range(3)],
                  columns = ['col_0', 'col_1', 'col_2'])
df

	col_0	col_1	col_2
row_0	1	a	1.2
row_1	2	b	2.2
row_2	3	c	3.2

采用从列索引名到数据的映射来构造数据框，同时再加上行索引

df = pd.DataFrame(data = {'col_0': [1, 2, 3], 'col_1':list('abc'),
                         'col_2' : [1.2, 2.2, 3.2]},
                 index = ['row_%d'%i for i in range(3)])
df

	col_0	col_1	col_2
row_0	1	a	1.2
row_1	2	b	2.2
row_2	3	c	3.2

由于这种映射关系，在 DataFrame 中可以用 [col_name] 与 [col_list] 来取出相应的列与由多个列组成的表，结果分别为 Series 和 DataFrame

df['col_0']

row_0    1
row_1    2
row_2    3
Name: col_0, dtype: int64

df[['col_0', 'col_1']]

	col_0	col_1
row_0	1	a
row_1	2	b
row_2	3	c

通过 .T 可以把 DataFrame 进行转置：

df.T

	row_0	row_1	row_2
col_0	1	2	3
col_1	a	b	c
col_2	1.2	2.2	3.2

常用基本函数

df = pd.read_csv('train.csv')
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

head, tail 函数分别表示返回表或者序列的前 n 行和后 n 行，其中 n 默认为5：

df.head(2)

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C

df.tail(3)

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.45	NaN	S
889	890	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.00	C148	C
890	891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.75	NaN	Q

info, describe 分别返回表的信息概况和表中数值列对应的主要统计量：

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

df.describe()

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	446.000000	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	257.353842	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	223.500000	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	446.000000	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	668.500000	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	891.000000	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

在 Series 和 DataFrame 上定义了许多统计函数，最常见的是 sum, mean, median, var, std, max, min

df_demo = df[['Age', 'Pclass']]
df_demo.mean()

Age       29.699118
Pclass     2.308642
dtype: float64

df_demo.max()

Age       80.0
Pclass     3.0
dtype: float64

需要介绍的是 quantile, count, idxmax 这三个函数，它们分别返回的是分位数、非缺失值个数、最大值对应的索引

df_demo.quantile(0.75)

Age       38.0
Pclass     3.0
Name: 0.75, dtype: float64

df_demo.count()

Age       714
Pclass    891
dtype: int64

df_demo.idxmax()

Age       630
Pclass      0
dtype: int64

操作后返回的是标量，所以又称为聚合函数，它们有一个公共参数 axis ，默认为0代表逐列聚合，如果设置为1则表示逐行聚合

df_demo.mean(axis=1).head() # 这是Age和Pclass的均值，没什么意义

0    12.5
1    19.5
2    14.5
3    18.0
4    19.0
dtype: float64

对序列使用 unique 和 nunique 可以分别得到其唯一值组成的列表和唯一值的个数：

df['Age'].unique()

array([22.  , 38.  , 26.  , 35.  ,   nan, 54.  ,  2.  , 27.  , 14.  ,
        4.  , 58.  , 20.  , 39.  , 55.  , 31.  , 34.  , 15.  , 28.  ,
        8.  , 19.  , 40.  , 66.  , 42.  , 21.  , 18.  ,  3.  ,  7.  ,
       49.  , 29.  , 65.  , 28.5 ,  5.  , 11.  , 45.  , 17.  , 32.  ,
       16.  , 25.  ,  0.83, 30.  , 33.  , 23.  , 24.  , 46.  , 59.  ,
       71.  , 37.  , 47.  , 14.5 , 70.5 , 32.5 , 12.  ,  9.  , 36.5 ,
       51.  , 55.5 , 40.5 , 44.  ,  1.  , 61.  , 56.  , 50.  , 36.  ,
       45.5 , 20.5 , 62.  , 41.  , 52.  , 63.  , 23.5 ,  0.92, 43.  ,
       60.  , 10.  , 64.  , 13.  , 48.  ,  0.75, 53.  , 57.  , 80.  ,
       70.  , 24.5 ,  6.  ,  0.67, 30.5 ,  0.42, 34.5 , 74.  ])

df['Age'].nunique()

value_counts 可以得到唯一值和其对应出现的频数

df['Embarked'].value_counts()

Embarked
S    644
C    168
Q     77
Name: count, dtype: int64

如果想要观察多个列组合的唯一值，可以使用 drop_duplicates 。其中的关键参数是 keep ，默认值 first 表示每个组合保留第一次出现的所在行， last 表示保留最后一次出现的所在行， False 表示把所有重复组合所在的行剔除。

df_demo = df[['Pclass', 'Embarked', 'PassengerId']]
df_demo.drop_duplicates(['Pclass', 'Embarked'])

	Pclass	Embarked	PassengerId
0	3	S	1
1	1	C	2
3	1	S	4
5	3	Q	6
9	2	C	10
15	2	S	16
19	3	C	20
61	1	NaN	62
245	1	Q	246
303	2	Q	304

duplicated 和 drop_duplicates 的功能类似，但前者返回了是否为唯一值的布尔列表，其 keep 参数与后者一致。其返回的序列，把重复元素设为 True ，否则为 False 。 drop_duplicates 等价于把 duplicated 为 True 的对应行剔除。

df_demo.duplicated(['Pclass', 'Embarked']).head()

0    False
1    False
2     True
3    False
4     True
dtype: bool

pandas 中的替换函数可以归纳为三类：映射替换、逻辑替换、数值替换,此处介绍 replace 的用法。

df['Sex'].replace({'female':0, 'male':1}).head()

C:\Users\dell\AppData\Local\Temp\ipykernel_20744\89183440.py:1: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
  df['Sex'].replace({'female':0, 'male':1}).infer_objects(copy=False).head()





0    1
1    0
2    0
3    0
4    1
Name: Sex, dtype: int64

df['Sex'].replace(['female', 'male'], [0, 1]).head()

C:\Users\dell\AppData\Local\Temp\ipykernel_20744\1075419171.py:1: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
  df['Sex'].replace(['female', 'male'], [0, 1]).head()





0    1
1    0
2    0
3    0
4    1
Name: Sex, dtype: int64

replace 还有一种特殊的方向替换，指定 method 参数为 ffill 则为用前面一个最近的未被替换的值进行替换， bfill 则使用后面最近的未被替换的值进行替换。

s = pd.Series(['a', 1, 'b', 2, 1, 1, 'a'])
s

0    a
1    1
2    b
3    2
4    1
5    1
6    a
dtype: object

s.replace([1, 2], method='ffill')

C:\Users\dell\AppData\Local\Temp\ipykernel_20744\3681488121.py:1: FutureWarning: The 'method' keyword in Series.replace is deprecated and will be removed in a future version.
  s.replace([1, 2], method='ffill')





0    a
1    a
2    b
3    b
4    b
5    b
6    a
dtype: object

s.replace([1, 2], method='bfill')

C:\Users\dell\AppData\Local\Temp\ipykernel_20744\1140877935.py:1: FutureWarning: The 'method' keyword in Series.replace is deprecated and will be removed in a future version.
  s.replace([1, 2], method='bfill')





0    a
1    b
2    b
3    a
4    a
5    a
6    a
dtype: object

逻辑替换包括了 where 和 mask ，这两个函数是完全对称的： where 函数在传入条件为 False 的对应行进行替换，而 mask 在传入条件为 True 的对应行进行替换，当不指定替换值时，替换为缺失值。

s =pd.Series([-1, 1.2345, 100, -50])
s.where(s < 0, 100)

0     -1.0
1    100.0
2    100.0
3    -50.0
dtype: float64

s.mask(s<0, -50)

0    -50.0000
1      1.2345
2    100.0000
3    -50.0000
dtype: float64

传入的条件只需是与被调用的 Series 索引一致的布尔序列即可：

s_condition = pd.Series([True, False, False, True], index = s.index)
s.mask(s_condition, -50)

0    -50.0000
1      1.2345
2    100.0000
3    -50.0000
dtype: float64

数值替换包含了 round, abs, clip 方法，它们分别表示按照给定精度四舍五入、取绝对值和截断：

s = pd.Series([-1, 1.2345, 100, -50])
s.round(2)

0     -1.00
1      1.23
2    100.00
3    -50.00
dtype: float64

s.abs()

0      1.0000
1      1.2345
2    100.0000
3     50.0000
dtype: float64

s.clip(-1, 100)

0     -1.0000
1      1.2345
2    100.0000
3     -1.0000
dtype: float64

排序共有两种方式，其一为值排序，其二为索引排序，对应的函数是 sort_values 和 sort_index 。

为了演示排序函数，下面先利用 set_index 方法把年级和姓名两列作为索引,对年龄进行排序，默认参数 ascending=True 为升序：

df_demo = df[['Sex', 'Age', 'Fare', 'Name']].set_index(['Sex', 'Name'])

df_demo.sort_values('Age').head()

		Age	Fare
Sex	Name
male	Thomas, Master. Assad Alexander	0.42	8.5167
male	Hamalainen, Master. Viljo	0.67	14.5000
female	Baclini, Miss. Eugenie	0.75	19.2583
female	Baclini, Miss. Helene Barbara	0.75	19.2583
male	Caldwell, Master. Alden Gates	0.83	29.0000

df_demo.sort_values('Fare', ascending=False).head()

		Age	Fare
Sex	Name
female	Ward, Miss. Anna	35.0	512.3292
male	Lesurer, Mr. Gustave J	35.0	512.3292
male	Cardeza, Mr. Thomas Drake Martinez	36.0	512.3292
female	Fortune, Miss. Mabel Helen	23.0	263.0000
male	Fortune, Mr. Charles Alexander	19.0	263.0000

在排序中，经常遇到多列排序的问题，比如在年龄相同的情况下，对票价进行排序，并且保持票价降序排列，年龄降序排列：

df_demo.sort_values(['Age', 'Fare'], ascending=[False, False]).head()

		Age	Fare
Sex	Name
male	Barkworth, Mr. Algernon Henry Wilson	80.0	30.0000
	Svensson, Mr. Johan	74.0	7.7750
	Artagaveytia, Mr. Ramon	71.0	49.5042
	Goldschmidt, Mr. George B	71.0	34.6542
	Connors, Mr. Patrick	70.5	7.7500

索引排序的用法和值排序完全一致，只不过元素的值在索引中，此时需要指定索引层的名字或者层号，用参数 level 表示。另外，需要注意的是字符串的排列顺序由字母顺序决定。

df_demo.sort_index(level = ['Sex', 'Name'], ascending=[True, False]).head()

		Age	Fare
Sex	Name
female	de Messemaeker, Mrs. Guillaume Joseph (Emma)	36.0	17.4000
	Zabour, Miss. Thamine	NaN	14.4542
	Zabour, Miss. Hileni	14.5	14.4542
	Yrois, Miss. Henriette ("Mrs Harbeck")	24.0	13.0000
	Young, Miss. Marie Grice	36.0	135.6333

apply 方法常用于 DataFrame 的行迭代或者列迭代，它的 axis 含义与第2小节中的统计聚合函数一致， apply 的参数往往是一个以序列为输入的函数。例如对于 .mean() ，使用 apply 可以如下地写出

df_demo = df[['Age', 'Fare']]
def my_mean(x):
    res = x.mean()
    return res
df_demo.apply(my_mean)

Age     29.699118
Fare    32.204208
dtype: float64

df_demo.apply(lambda x:x.mean())

Age     29.699118
Fare    32.204208
dtype: float64

posted @ 2024-07-22 07:46 0214jx 阅读(134) 评论(0) 收藏举报

刷新页面返回顶部

0214jx

Pandas基础

Pandas基础

文件的读取和写入

基本数据结构

常用基本函数

公告