Using numpy and pandas

numpy

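Concepts:

1. numpy makes simple, fast scientific computing possible

2. ndarray, a multidimensional array structure that is efficient and saves space

3. Operations apply to whole arrays without explicit loops

4. Tools for reading and writing data on disk and for working with memory-mapped files

5. Linear algebra, random number generation, and Fourier transform functionality

6. Tools for integrating code written in C, C++, and similar languages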

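Usage:

The array method converts a list to an array

import numpy as np

li1 = [1,2,3,4,5,6]

li = np.array(li1)    (the array method converts the list li1 into an array)

np.arange?    (shows the help for arange; the trailing ? works in IPython and Jupyter)

li.dtype    (element type)

li.size    (number of elements)

li.ndim    (number of dimensions)

li.shape    (size of each dimension, i.e. how many rows and columns; for example (2,5) means 2 rows and 5 columns)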
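The attributes above can be checked on a small two-dimensional array. This is only a minimal sketch; the nested list `data` is made-up sample data, not something from the original notes.

```python
import numpy as np

# Hypothetical sample data: 2 rows, 5 columns
data = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
arr = np.array(data)

print(arr.dtype)   # an integer dtype; the exact width depends on the platform
print(arr.size)    # 10 elements in total
print(arr.ndim)    # 2 dimensions
print(arr.shape)   # (2, 5): 2 rows, 5 columns
```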

Type conversion:

li = np.array([1,2,3,4,5],dtype='float')    (the type can be specified at creation)

li.astype("int")    (converts float to int; returns a new array and leaves li unchanged)

linspace: evenly spaced numbers between two values

np.linspace(1,2,num=50,endpoint=False)    (generates 50 evenly spaced numbers starting at 1; endpoint=False means the stop value 2 itself is excluded)
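The effect of endpoint is easier to see with fewer points. The sketch below uses num=5 instead of 50 purely for readability; the variable names are illustrative.

```python
import numpy as np

# endpoint=True is the default: the stop value 2 is the last sample, step 0.25
closed = np.linspace(1, 2, num=5)

# endpoint=False: the stop value is left out, so the step becomes 0.2
half_open = np.linspace(1, 2, num=5, endpoint=False)

print(closed)      # last value is 2.0
print(half_open)   # last value is 1.8
```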

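zeros: generates n zeros

np.zeros(10)    (10 zeros of type float)

np.zeros(10,dtype='int')    (10 zeros of type int)

reshape: changes the shape (dimensions) of an array; the total number of elements must match the new shape, otherwise an error is raised

li = np.ones(10)    (10 ones of type float)

li.reshape((2,5))    (reshapes li into 2 rows and 5 columns)

li.reshape((5,2))    (reshapes li into 5 rows and 2 columns)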
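A short sketch of what "the element count must match" means in practice. The -1 placeholder and the variable names are additions for illustration, not part of the original notes.

```python
import numpy as np

ones = np.ones(10)

grid = ones.reshape((2, 5))    # 2 * 5 == 10, so this works
auto = ones.reshape((5, -1))   # -1 asks numpy to infer the missing size (2 here)

# 3 * 4 == 12 does not match 10 elements, so numpy raises ValueError
try:
    ones.reshape((3, 4))
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc

print(grid.shape, auto.shape)
print(type(mismatch_error).__name__)
```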

Vector operations:

li1 = np.array([1,2,3,4,5])    (the array method converts a list into an array)
li2 = np.array([4,5,6,7,8])
The operators +, -, *, /, ** and // can now be applied element by element
li1 + li2 
li1 - li2
li1 * li2
li1 / li2
li1 ** li2 
li1 // li2
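To make "element by element" concrete, the sketch below compares an array expression with the equivalent explicit loop. The arrays reuse the values from the notes; the names a, b and looped are illustrative.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])
b = np.array([4, 5, 6, 7, 8])

total = a + b        # [5, 7, 9, 11, 13]
product = a * b      # [4, 10, 18, 28, 40]
floor_div = b // a   # [4, 2, 2, 1, 1]

# The same addition written as a Python loop, for comparison only
looped = np.array([x + y for x, y in zip(a, b)])

print(total, product, floor_div)
```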

Array indexing:

One-dimensional:

li = np.array([1,2,3,4,5,6])

li[2]  ---- 3

li[0]  ---- 1

Two-dimensional:

li = np.array([2,3,4,5,6,7,8,9])

li1 = li.reshape(2,4)

li1[0,2]  ------ 4 

li1[1,2]  ------ 8

Array slicing:

One-dimensional:

li = np.array([2,3,4,5,6,7,8,9])

li[0:3]  ------ [2,3,4]

Two-dimensional:

li = np.array([2,3,4,5,6,7,8,9])
li1 = li.reshape(2,4)

li1 ------  array([[2, 3, 4, 5],
                  [6, 7, 8, 9]])
       
li1[0:2,1:3]-----array([[3, 4],
                       [7, 8]])
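The two-dimensional indexing and slicing above can be run as follows. The last step goes slightly beyond the notes: a basic slice is a view of the original array, so writing through it changes the original. The names grid and block are illustrative.

```python
import numpy as np

grid = np.array([2, 3, 4, 5, 6, 7, 8, 9]).reshape(2, 4)

print(grid[0, 2])   # 4: row 0, column 2
print(grid[1, 2])   # 8: row 1, column 2

# Rows 0..1 and columns 1..2 (the stop index is excluded)
block = grid[0:2, 1:3]
print(block)        # [[3 4], [7 8]]

# A basic slice is a view: assigning through it modifies grid as well
block[0, 0] = 100
print(grid[0, 1])   # 100
```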

Boolean indexing:

li = np.array([2, 3, 4, 5, 6, 7, 8, 9])

1. li<5 ------------ [ True,  True,  True, False, False, False, False, False]

2. li[li<5] -------- [2, 3, 4]

3. li[li<=5] ------- [2, 3, 4, 5]

4. Select the even numbers in li that are greater than 5
li[(li>5) & (li%2 ==0)]
5. Select the numbers in li that are greater than 5 or even
li[(li>5)|(li%2 ==0)]
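A runnable sketch of the combined conditions, with the results written out. The parentheses around each comparison are required because & and | bind more tightly than the comparison operators. The names mask, even_and_big and even_or_big are illustrative.

```python
import numpy as np

li = np.array([2, 3, 4, 5, 6, 7, 8, 9])

# A boolean mask has one True/False entry per element
mask = (li > 5) & (li % 2 == 0)

even_and_big = li[mask]                        # [6, 8]
even_or_big = li[(li > 5) | (li % 2 == 0)]     # [2, 4, 6, 7, 8, 9]

print(mask)
print(even_and_big, even_or_big)
```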

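Fancy indexing:

One-dimensional:

li = np.array([2, 3, 4, 5, 6, 7, 8, 9])

li[[1,3,6]] -------- [3, 5, 8]    (simply picks the values at several indices at once)

Two-dimensional:

li1 = np.array([[2, 3, 4, 5],
       [6, 7, 8, 9]])

li1[(0,[0,1,2])] ------[2, 3, 4]    (row 0, columns 0, 1 and 2)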
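A few more fancy-indexing forms on the same two-dimensional array, as a sketch. The variants pairs and cols are additions for illustration.

```python
import numpy as np

li1 = np.array([[2, 3, 4, 5],
                [6, 7, 8, 9]])

# Row 0, columns 0, 1 and 2 (same result as li1[(0, [0, 1, 2])])
row_pick = li1[0, [0, 1, 2]]     # [2, 3, 4]

# Two index lists are paired up: elements (0, 1) and (1, 3)
pairs = li1[[0, 1], [1, 3]]      # [3, 9]

# All rows, columns 0 and 3
cols = li1[:, [0, 3]]            # [[2, 5], [6, 9]]

print(row_pick, pairs)
print(cols)
```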

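Universal functions:

li = np.array([2, 3, 4, 5, 6, 7, 8, 9])

li.max()  ---- 9   maximum
li.min()  ---- 2   minimum
li.sum()  ---- 44  sum
li.mean() ---- 5.5  mean

np.isnan(li)  ----- returns a boolean array that is True wherever an element is NaN

std  --- standard deviation
var  --- variance
argmin   --- index of the minimum value
argmax  ---- index of the maximum value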
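The remaining functions can be tried on the same array. In this sketch the array with_nan and the use of np.nanmean are additions for illustration; var and std are computed with numpy's default (population) definition.

```python
import numpy as np

li = np.array([2, 3, 4, 5, 6, 7, 8, 9])

variance = li.var()       # 5.25 (mean of squared deviations from 5.5)
deviation = li.std()      # square root of the variance
lowest_at = li.argmin()   # 0, the position of the value 2
highest_at = li.argmax()  # 7, the position of the value 9

# Hypothetical array containing a missing value
with_nan = np.array([1.0, np.nan, 3.0])
nan_mask = np.isnan(with_nan)        # [False, True, False]
mean_skipping_nan = np.nanmean(with_nan)   # 2.0, NaN is ignored

print(variance, deviation, lowest_at, highest_at)
print(nan_mask, mean_skipping_nan)
```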

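Random functions:

import numpy as np

1. np.random.rand(10)    (10 floats in the interval [0, 1))

2. np.random.randint(1,10,10)    (10 integers from 1 to 9; the upper bound 10 is excluded)

a = np.arange(10)    (array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]))

3. np.random.shuffle(a)    (shuffles a in place and returns None; afterwards a could be, for example, array([6, 3, 4, 5, 7, 9, 8, 1, 2, 0]))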
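A sketch that runs the three calls and checks their ranges. The seed value 0 is an arbitrary choice made here so that repeated runs give the same numbers; it is not part of the original notes.

```python
import numpy as np

np.random.seed(0)   # arbitrary seed, only for repeatability

floats = np.random.rand(10)            # 10 floats, each in [0, 1)
ints = np.random.randint(1, 10, 10)    # 10 integers, each in 1..9

a = np.arange(10)
result = np.random.shuffle(a)          # shuffles a in place

print(floats.min() >= 0, floats.max() < 1)
print(ints.min(), ints.max())
print(result)   # None: the shuffled data is in a itself
print(a)
```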

pandas

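Concepts:

1. A powerful data analysis toolkit for Python

2. Built on top of numpy

3. Main features:

(1) Data structures with built-in alignment: DataFrame and Series

(2) Integrated time series functionality

(3) A rich set of mathematical operations

(4) Flexible handling of missing data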

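pandas: Series

Characteristics: A Series is an object similar to a one-dimensional array, made up of a set of data and an associated set of data labels (the index)

PS: A Series behaves like a cross between a list (array) and a dictionary

import pandas as pd

li = pd.Series([1,2,3,4,5])    (definition)

li.values    (the values of li)

li.index    (the index of li)

The values and index attributes return the value array and the index array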

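Operations:

When pandas performs arithmetic, it first aligns the operands by index and then computes. If the indexes differ, the index of the result is the union of the two operands' indexes

sr1 = pd.Series([12,23,34], index=['c','a','d'])
sr2 = pd.Series([11,20,10], index=['d','c','a'])

sr1+sr2    (this adds cleanly because both Series have the same index labels)

sr3 = pd.Series([11,20,10,14], index=['d','c','a','b'])

sr1+sr3    (the result contains a NaN (missing value), because only sr3 has the label b)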
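The alignment rule can be checked directly with the values from the notes. The second call, Series.add with fill_value=0, is an addition here: it shows one way to treat a label missing on one side as 0 instead of producing NaN.

```python
import pandas as pd

sr1 = pd.Series([12, 23, 34], index=['c', 'a', 'd'])
sr3 = pd.Series([11, 20, 10, 14], index=['d', 'c', 'a', 'b'])

# Values are matched by label, not by position
plain = sr1 + sr3
# a: 23 + 10 = 33, c: 12 + 20 = 32, d: 34 + 11 = 45, b: NaN (missing in sr1)

# Treat a label that is missing on one side as 0
filled = sr1.add(sr3, fill_value=0)
# b becomes 0 + 14 = 14

print(plain)
print(filled)
```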

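Missing data:

Missing data is represented by NaN (Not a Number), whose value is np.nan. The built-in None value is also treated as NaN.

Handling methods:

dropna()        removes entries whose value is NaN               e.g. sr.dropna()
fillna()        fills in missing data                            e.g. sr.fillna(0)
isnull()        returns a boolean array, True for missing values   e.g. sr.isnull()
notnull()       returns a boolean array, False for missing values  e.g. sr.notnull()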
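The four methods applied to one small Series, as a sketch. The Series sr is made-up sample data; it contains both np.nan and None to show that pandas treats them the same way here.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with two missing entries
sr = pd.Series([1.0, np.nan, 3.0, None])

missing = sr.isnull()     # [False, True, False, True]
present = sr.notnull()    # [True, False, True, False]
dropped = sr.dropna()     # keeps 1.0 and 3.0, with their original labels 0 and 2
filled = sr.fillna(0)     # [1.0, 0.0, 3.0, 0.0]

print(missing.tolist())
print(dropped.tolist())
print(filled.tolist())
```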

DataFrame

Concept: A DataFrame is a tabular data structure containing an ordered set of columns; it can also be viewed as a dictionary of Series that all share one index.

Creation:

df = pd.DataFrame({'one':[1,2,3,4],'two':[4,3,2,1]})

df1 = pd.DataFrame({'one':pd.Series([1,2,3],index=['a','b','c']),
                    'two':pd.Series([1,2,3,4],index=['b','a','c','d'])
                    })
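Building a DataFrame from Series with different indexes shows the shared index at work. This sketch reuses the values from the notes; the name frame is illustrative.

```python
import pandas as pd

frame = pd.DataFrame({
    'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
    'two': pd.Series([1, 2, 3, 4], index=['b', 'a', 'c', 'd']),
})

# The row index is the union of both Series indexes.
# Column 'one' has no value for label 'd', so that cell is NaN.
print(frame)
print(frame.loc['a', 'two'])   # 2: values follow their labels, not their position
```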

Reading and writing csv files:

1. df = pd.read_csv('filename.csv')    (reads the file filename.csv)
2. df.to_csv('xxx/xxx',index=False)    (writes df to the file at the given path; index=False leaves out the row index)

Example:
    df = pd.read_csv('E:/职业资料/数据分析/数据分析--oldboy/James.csv')
    df.to_csv('./a.csv',index=False)
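A round trip through a csv file shows what index=False changes. This is a sketch under assumptions: the DataFrame is made-up sample data and the file is written to a temporary directory chosen here, not to any path from the notes.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3, 4], 'two': [4, 3, 2, 1]})

# Hypothetical temporary location for the example file
path = os.path.join(tempfile.mkdtemp(), 'a.csv')

# Without the row index: reading the file back gives the same table
df.to_csv(path, index=False)
restored = pd.read_csv(path)

# With the row index (the default): it comes back as an extra unnamed column
df.to_csv(path)
with_index = pd.read_csv(path)

print(restored.columns.tolist())
print(with_index.columns.tolist())
```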

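Viewing data:

Common attributes and methods:

Practice with the NBA champions data: df=pd.read_html('https://baike.baidu.com/item/NBA%E6%80%BB%E5%86%A0%E5%86%9B/2173192?fr=aladdin')    (read_html returns a list of DataFrames, one per table found on the page)

msg = df[0]

index                   row index                          e.g. msg.index
T                       transpose (swaps rows and columns)   e.g. msg.T
columns                 column index                       e.g. msg.columns
values                  array of values                    e.g. msg.values
describe()              quick summary statistics           e.g. msg.describe()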

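Indexing and slicing:

A DataFrame has a row index and a column index.

PS: Before running the operations below, first take the column names from the first row: df.columns = df.values[0]

Selecting by label:

df[1]       ----- selects the column labeled 1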
df[['A', 'B']]
df['A'][0]
df[0:10][['A', 'C']]

df.loc[1]   ------ selects the row labeled 1
df.loc[:,['A','B']]
df.loc[:,'A':'C']
df.loc[0,'A']
df.loc[0:10,['A','C']]

Selecting by position:

df.iloc[3]
df.iloc[3,3]
df.iloc[0:3,4:6]
df.iloc[1:5,:]
df.iloc[[1,2,4],[0,3]]

Filtering by boolean values:

df[df['A']>0]
df[df['A'].isin([1,3,5])]
df[df<0] = 0
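The difference between label-based and position-based selection is easiest to see side by side. The DataFrame below is made-up sample data with columns A, B and C; it is an assumption for illustration and not the NBA table from the notes.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({'A': [1, -2, 3, -4],
                   'B': [5, 6, -7, 8],
                   'C': [9, 10, 11, 12]})

# loc slices by label and includes the end label: rows 0, 1 and 2
by_label = df.loc[0:2, ['A', 'C']]

# iloc slices by position and excludes the end: rows 0 and 1
by_pos = df.iloc[0:2, [0, 2]]

# Boolean filters keep the rows where the condition is True
positive = df[df['A'] > 0]
picked = df[df['A'].isin([1, 3, 5])]

# Assigning through a boolean mask replaces every negative cell with 0
clipped = df.copy()
clipped[clipped < 0] = 0

print(len(by_label), len(by_pos))
print(positive)
print(clipped)
```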

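DataFrame data alignment and missing data

DataFrame objects are also aligned during arithmetic; the row index and column index of the result are the unions of the two operands' row indexes and column indexes respectively.

Methods for handling missing data:

dropna(axis=0,how='any',…)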
fillna()
isnull()
notnull()
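For a DataFrame, dropna takes an axis and a how argument. The sketch below uses made-up sample data to show how how='any' and how='all' differ.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: row 0 is partly missing, row 1 is fully missing
df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [np.nan, np.nan, 6.0]})

# how='any': drop a row if at least one of its values is missing
any_rows = df.dropna(axis=0, how='any')   # only row 2 remains

# how='all': drop a row only if every value is missing
all_rows = df.dropna(axis=0, how='all')   # rows 0 and 2 remain

# axis=1 applies the same rule to columns; both columns contain NaN here
no_cols = df.dropna(axis=1, how='any')    # no columns remain

print(any_rows)
print(all_rows)
print(no_cols.shape)
```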

posted @ 2019-06-09 22:43 萤huo虫
