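Quantitative Financial Analysis with Python

Financial concepts

Moving averages:

The short-term moving average crosses below the long-term moving average (price tends to fall) -- death cross

The short-term moving average crosses above the long-term moving average (price tends to rise) -- golden cross

Moving-average signals lag behind the price, however.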

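Financial analysis

Fundamental analysis

Company analysis: financial data, earnings reports
Industry analysis

Technical analysis

Technical indicators of all kinds

Candlestick (K-line) charts
Moving averages
KDJ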

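Quantitative investing

Avoids subjective emotion
Tracks market changes promptly
Observes from multiple angles with multi-layered models
Once an investment strategy is chosen, its effect can be verified by backtesting

Quantitative strategy

Core components

Stock selection
Market timing
Position management
Take-profit and stop-loss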

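Quantitative investing with Python

Third-party modules

numpy: numerical computing
pandas: data analysis
matplotlib: chart plotting

Ways to write strategies in Python

Write your own framework
Online platforms: JoinQuant (聚宽), RiceQuant (米筐), Uqer (优矿), Quantopian
Open-source framework: RQAlpha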

IPython

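tab: auto-completion
?: show help for an object
a.*? lists all attributes and methods of a
a.__*__? lists the double-underscore (special) attributes of a
a.a*? lists the attributes of a whose names start with a
%run runs a script
%paste runs the contents of the clipboard
! runs a system command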
!ipconfig

Magic commands:

%timeit func() measures execution time; suited to code that runs quickly
It runs the code many times to get a stable timing

In [37]: %timeit func(a,b)
The slowest run took 14.25 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 166 ns per loop

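%time func() runs the code once; suited to long-running code
%pdb on enables the debugger, which starts automatically when an exception is raised
p prints a value, q quits
%pdb off disables the debugger
Command history
Type a prefix xx and press the up arrow to filter the history by commands starting with xx
_ is the most recent output
__ is the second most recent output
___ is the third most recent output

_34 is the output of cell 34 (Out[34])
_i34 is the input of cell 34 (In[34])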

jupyter notebook

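Starting the server generates a token that is used for remote connections; PyCharm supports this.

Interacting with an IDE through jupyter notebook

Running jupyter notebook in cmd starts Jupyter, and notebooks can then be edited in the browser.

To connect from PyCharm, enter the token.

When you click Run, enter token=fb9ae30019c4b001f7bd06e1154774d1ba4e1b02a87d557b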

Numpy

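NumPy is a high-performance package for scientific computing and data analysis. Its core is a multidimensional array structure that is fast and memory-efficient.
ndarray: the multidimensional array object

array()		converts a list to an array; dtype can be given explicitly
arange()		NumPy version of range; supports floating-point steps
linspace()	like arange(), but the third argument is the number of elements
zeros()		creates an all-zeros array with the given shape and dtype
ones()		creates an all-ones array with the given shape and dtype
empty()		creates an uninitialized array with the given shape and dtype (arbitrary values)
eye()		creates an identity matrix with the given side length and dtype

Creating an ndarray: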

import numpy as np
np.array([1,2,3])

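An array can be iterated with a for loop:

for i in _:  # _ is the object returned above
    print(i)

All elements of an ndarray must have the same data type.

sys.getsizeof() shows the memory an object occupies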

import sys

a = list(range(100))
sys.getsizeof(a) # 1008

b = np.array(range(100))
sys.getsizeof(b)  # 496

np.dot() dot product

import random

prize = [round(random.uniform(10,20),2) for i in range(20)] # round(x, n): x is the value, n is the number of decimal places
num =[random.randint(1,10) for i in range(20)]
total = 0  # initialize the accumulator (and avoid shadowing the built-in sum)
for i,j in zip(prize,num):
    total+=i*j

Without a for loop:

np_prize = np.array(prize)
np_num = np.array(num)
np.dot(np_prize,np_num)
Or:
np_prize*np_num
_.sum()
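
As a small self-contained check, the three approaches above can be run side by side. The variable names and the fixed seed are illustrative choices, not part of the original notes.

```python
import random
import numpy as np

random.seed(0)  # fixed seed so the run is repeatable
prices = [round(random.uniform(10, 20), 2) for _ in range(20)]
volumes = [random.randint(1, 10) for _ in range(20)]

# Pure-Python accumulation
loop_total = 0.0
for p, v in zip(prices, volumes):
    loop_total += p * v

# Vectorized equivalents
dot_total = np.dot(np.array(prices), np.array(volumes))
mul_total = (np.array(prices) * np.array(volumes)).sum()

print(loop_total, dot_total, mul_total)
```

All three totals agree up to floating-point rounding; the NumPy versions avoid the explicit Python loop.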

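Checking the data type:

In [61]: np_prize.dtype
Out[61]: dtype('float64') # floats default to float64; the default integer type depends on the platform

dtype: the data type of the array elements
size: the number of elements in the array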

np.array(range(10000))
array([   0,    1,    2, ..., 9997, 9998, 9999])
a= _
a.size  # 10000

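shape: the size of each dimension,
returned as a tuple

a.shape
(10000,)  # one-dimensional, 10000 elements

Multidimensional arrays

b = np.array([[1,2,3],[4,5,6]])
b.shape # (2,3): an array with 2 rows and 3 columns

ndim: the number of dimensions

a.ndim -- 1
b.ndim -- 2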

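Type conversion with astype

b = np.array([[1,2,3],[4,5,6]])
b.astype('float32')  # returns a new array; b itself is unchanged
array([[ 1.,  2.,  3.],
       [ 4.,  5.,  6.]], dtype=float32)

Setting the data type at construction

a = range(10)
c = np.array(a,dtype='float32')

arange

np.arange() is NumPy's counterpart to range. Unlike range, the step can be a decimal.
np.arange() returns an ndarray rather than a range object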

z = np.arange(1,4,0.2)

np.array(np.arange(10))  
np.array(range(10))

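np.linspace(0,10,15) returns 15 evenly spaced values from 0 to 10 (both endpoints included); the third argument is the array length
np.zeros np.ones
To create a multidimensional array, pass a shape tuple

np.zeros(10,dtype='int32')  # dtype can be given at creation; the default is float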
In [97]: np.zeros((3,5))
Out[97]:
array([[ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.]])

In [98]: np.zeros((3,5,10))  # a three-dimensional array
Out[98]:
array([[[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]],

       [[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]],

       [[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]]])

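All zeros: np.zeros(10)  All ones: np.ones(10)

The shape is passed as a tuple
Two-dimensional:
np.zeros((3,5))

Three-dimensional:
np.zeros((3,5,10))

np.empty

ones and zeros allocate memory and fill it with values. empty also allocates memory but does not initialize it.
np.empty(10)
The array simply shows whatever values were previously left in that memory

np.eye(5) identity matrix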

In [103]: np.eye(5)
Out[103]:
array([[ 1.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  1.]])

Slicing and indexing

Arithmetic such as powers, addition and subtraction is applied elementwise, with no for loop needed

 a
 array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
 a[0]
 0
 c= a**2
 c
 array([ 0,  1,  4,  9, 16, 25, 36, 49, 64, 81])
 b = a+1
 b
 array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

Operations between two arrays of the same size are also elementwise, for example:

a + b    a * b

Slicing two-dimensional arrays

a = np.arange(15).reshape(3,5) # create a 1-D array, then reshape it to 2-D; the total number of elements must stay the same
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

In [124]: a[0][0]
Out[124]: 0

In [125]: a[1][3]
Out[125]: 8

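Multidimensional arrays can also be indexed with a comma: the left side selects rows and the right side selects columns

a[1,3] -- 8

In [127]: a[:2,1:3]  # slice rows first, then columns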
Out[127]:
array([[1, 2],
       [6, 7]])

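NumPy slices differ from list slices

Slicing a list makes a copy, so changing the new list does not affect the original. Slicing a NumPy array creates a view instead: modifying the view also modifies the original array. This design avoids copying large data sets. Call copy() when an independent array is needed.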

In [129]: a
Out[129]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [130]: a= b
In [131]: b
Out[131]: array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])
In [132]: b[2:6]
Out[132]: array([3, 4, 5, 6])
In [133]: b[0] = 111
In [134]: a
Out[134]: array([111,   2,   3,   4,   5,   6,   7,   8,   9,  10])
In [135]: b
Out[135]: array([111,   2,   3,   4,   5,   6,   7,   8,   9,  10])

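After a = b, both names refer to the same array, so a change made through one name is visible through the other. A slice such as a[:4] behaves the same way, because it is a view that shares memory with a.

b = a[:4].copy() makes an independent copy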
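
The difference between a list slice, an array view and an explicit copy can be checked directly. This is a minimal sketch; the variable names are illustrative.

```python
import numpy as np

lst = list(range(10))
lst_slice = lst[2:6]      # list slicing copies the elements
lst_slice[0] = 111
print(lst[2])             # the original list is unchanged

arr = np.arange(10)
view = arr[2:6]           # array slicing returns a view on the same memory
view[0] = 111
print(arr[2])             # the original array has changed

arr2 = np.arange(10)
independent = arr2[2:6].copy()   # an explicit copy owns its own memory
independent[0] = 111
print(arr2[2])            # unchanged

print(np.shares_memory(arr, view), np.shares_memory(arr2, independent))
```

np.shares_memory is a convenient way to tell whether two arrays overlap in memory.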

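Boolean indexing

Boolean indexing: passing a boolean array of the same size as the index returns an array of the elements at the positions where the mask is True

a = [random.randint(1,20) for i in range(10)]
aa = np.array(a)
aa<5
Out[144]: array([False,  True,  True,  True, False, False, False, False, False, False], dtype=bool)

aa[aa<5]
array([4, 1, 2, 4])  # the boolean mask selects the elements where it is True

aa[b]

where b is a boolean array

Less than 5 and even: combine the conditions with &
aa[(aa<5) & (aa%2==0)]

or: |

not: ~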
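
A short sketch of the three operators on a fixed array (the values are chosen for illustration). Each comparison needs its own parentheses because & and | bind more tightly than < and ==.

```python
import numpy as np

aa = np.array([4, 9, 1, 20, 2, 13, 7, 4, 9, 5])

small = aa[aa < 5]                          # elements below 5
small_even = aa[(aa < 5) & (aa % 2 == 0)]   # below 5 AND even
small_or_big = aa[(aa < 5) | (aa > 10)]     # below 5 OR above 10
not_small = aa[~(aa < 5)]                   # NOT below 5

print(small, small_even, small_or_big, not_small)
```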

NumPy fancy indexing

 aa
 array([ 4,  9,  1, 20,  2, 13,  7,  4,  9,  5])
 aa[[1,3,5]]
 array([ 9, 20, 13])

In [158]: a[:,1:4]  # slicing: rows first, then columns
Out[158]:
array([[ 1,  2,  3],
       [ 6,  7,  8],
       [11, 12, 13]])

In [159]: a[:,[1,3]]  # fancy indexing on the columns
Out[159]:
array([[ 1,  3],  
       [ 6,  8],
       [11, 13]])

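NumPy universal functions

Counterparts of the functions in math
rint, round: round to the nearest integer (exact halves go to the nearest even value)

a = np.arange(0,5,0.2)
np.round(a)
np.rint(a)

ceil: round up

math.ceil(3.5)  # result is 4
np.ceil()

floor: round down

math.floor(3.5) # result is 3
np.floor()

trunc: round toward 0

math.trunc(-3.5) # result is -3
math.trunc(3.5) # result is 3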
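
The four rounding modes can be compared on one array. This is a minimal sketch with illustrative values; note how rint sends exact halves to the even neighbour.

```python
import math
import numpy as np

x = np.array([-3.5, -0.5, 0.5, 1.5, 2.5, 3.7])

print(np.rint(x))    # nearest integer, halves go to the even neighbour
print(np.ceil(x))    # toward +infinity
print(np.floor(x))   # toward -infinity
print(np.trunc(x))   # toward zero

# The math module handles one scalar at a time
print(math.ceil(3.5), math.floor(3.5), math.trunc(-3.5))
```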

np.modf splits an array into its fractional and integer parts and returns them as two arrays

In [188]: a
Out[188]:
array([ 0. ,  0.2,  0.4,  0.6,  0.8,  1. ,  1.2,  1.4,  1.6,  1.8,  2. ,
        2.2,  2.4,  2.6,  2.8,  3. ,  3.2,  3.4,  3.6,  3.8,  4. ,  4.2,
        4.4,  4.6,  4.8])

In [189]: np.modf(a)
Out[189]:
(array([ 0. ,  0.2,  0.4,  0.6,  0.8,  0. ,  0.2,  0.4,  0.6,  0.8,  0. ,
         0.2,  0.4,  0.6,  0.8,  0. ,  0.2,  0.4,  0.6,  0.8,  0. ,  0.2,
         0.4,  0.6,  0.8]),
 array([ 0.,  0.,  0.,  0.,  0.,  1.,  1.,  1.,  1.,  1.,  2.,  2.,  2.,
         2.,  2.,  3.,  3.,  3.,  3.,  3.,  4.,  4.,  4.,  4.,  4.]))

In [190]: x,y = np.modf(a)

In [191]: x
Out[191]:
array([ 0. ,  0.2,  0.4,  0.6,  0.8,  0. ,  0.2,  0.4,  0.6,  0.8,  0. ,
        0.2,  0.4,  0.6,  0.8,  0. ,  0.2,  0.4,  0.6,  0.8,  0. ,  0.2,
        0.4,  0.6,  0.8])

In [192]: y
Out[192]:
array([ 0.,  0.,  0.,  0.,  0.,  1.,  1.,  1.,  1.,  1.,  2.,  2.,  2.,
        2.,  2.,  3.,  3.,  3.,  3.,  3.,  4.,  4.,  4.,  4.,  4.])

isinf

inf means infinity: dividing a nonzero number by 0 gives inf

In [194]: a = np.array([1,2,3])
In [195]: b = np.array([1,2,0])
In [196]: a/b
E:/softinstall2017/anaconda/Scripts/ipython-script.py:1: RuntimeWarning: divide
by zero encountered in true_divide
  if __name__ == '__main__':
Out[196]: array([  1.,   1.,  inf])

In [215]: c
Out[215]: array([  1.,   1.,  inf])
In [216]: np.isinf(c)
Out[216]: array([False, False,  True], dtype=bool)

isnan

When the numerator and the denominator are both 0, the result is nan (not a number), which also stands for a missing value

Test for it with np.isnan()

In [198]: a = np.array([1,2,0])
In [199]: b = np.array([1,2,0])
In [200]: a/b
E:/softinstall2017/anaconda/Scripts/ipython-script.py:1: RuntimeWarning: invalid
 value encountered in true_divide
  if __name__ == '__main__':
Out[200]: array([  1.,   1.,  nan])  # nan: not a number

In [210]: c
Out[210]: array([  1.,   1.,  nan])

In [211]: np.isnan(c)
Out[211]: array([False, False,  True], dtype=bool)

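Select every element that is not nan:
c[~np.isnan(c)]

np.maximum(a,b)

maximum and minimum compare two arrays position by position and return an array holding the larger (or smaller) value at each position

a = np.array([1,2,0])
b = np.array([3,1,6])
np.maximum(a,b)
array([3, 2, 6])  # the result is an array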

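Mathematical and statistical methods

sum: sum of the elements
cumsum: cumulative sum; returns an array in which each element is the sum of all elements up to that position
mean: arithmetic mean
var: variance
min: smallest value in the array
max: largest value in the array
argmin: index of the smallest value
argmax: index of the largest value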

In [222]: a
Out[222]: array([1, 2, 0])

In [223]: b
Out[223]: array([3, 1, 6])

In [224]: np.sum(a) # sum
Out[224]: 3

In [225]: np.mean(a) # mean
Out[225]: 1.0

In [226]: np.var(a)  # variance
Out[226]: 0.66666666666666663

In [227]: np.max(a)  # largest value in the array
Out[227]: 2

In [228]: np.min(a) # smallest value in the array
Out[228]: 0

In [229]: np.argmax(a) # index of the largest value
Out[229]: 1

In [230]: np.argmin(a) # index of the smallest value
Out[230]: 2

In [231]: np.cumsum(a) # cumulative sum
Out[231]: array([1, 3, 3], dtype=int32)

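Random number generation

random.uniform(1.0,10) a random float within the given range
random.random() a random float in [0, 1)
(random.random()*9)+1 scales that to [1, 10)
random.randint() a random integer
random.shuffle(a) shuffles a list in place
random.choice() returns one random element of a str or list

np.random.rand: floats in [0, 1), the counterpart of random.random

The np.random functions accept a shape and return an array of random data

Generating a random two-dimensional array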

In [3]: np.random.randint(1,10,(3,5))
Out[3]:
array([[3, 2, 3, 9, 6],
       [6, 6, 2, 3, 8],
       [7, 6, 3, 1, 5]])

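rand		random array of the given shape (values between 0 and 1); counterpart of random.random
randint		random integers of the given shape
choice		random selections of the given shape
shuffle		same as random.shuffle
uniform		random floats of the given shape within a range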

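Pandas data analysis

Pandas is built on top of NumPy

DataFrame
Series
Time series
Flexible handling of missing data

Install: pip install pandas
Import: import pandas as pd

Series

A Series is like a cross between an array and a dictionary, so features of both can be used; it is ordered

By default the left side of a Series shows the index 0 1 2 3 ..., and the right side shows the values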

In [5]: pd.Series([1,2,3,4])
Out[5]:
0    1
1    2
2    3
3    4
dtype: int64

In [6]: pd.Series([11,22,33,44])
Out[6]:
0    11
1    22
2    33
3    44
dtype: int64

Custom index

Pass a list as index=, for example index=list('abcd') or index=['a','b','c','d']

In [8]: pd.Series([11,22,33,44],index=list('abcd'))
Out[8]:
a    11
b    22
c    33
d    44
dtype: int64

Series features

A Series can be indexed both by key and by position, which shows that it behaves like both a dictionary and a list

In [13]: a
Out[13]:
a    11
b    22
c    33
d    44
dtype: int64

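In [14]: a['a']  # index by key (label)
Out[14]: 11

In [15]: a[0] # index by position (newer pandas versions prefer a.iloc[0])
Out[15]: 11

a.index returns the index object
a.values returns the values

In [17]: a.index
Out[17]: Index(['a', 'b', 'c', 'd'], dtype='object')  # the index is returned as an Index object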

In [18]: a.values
Out[18]: array([11, 22, 33, 44], dtype=int64)

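Series supports NumPy-style features (positions):

Create a Series from an ndarray: Series(arr)
Arithmetic with a scalar: sr*2
Arithmetic between two Series: sr1+sr2
Indexing: sr[0], sr[[1,2,4]]
Slicing: sr[0:2]
Universal functions: np.abs(sr)
Boolean filtering: sr[sr>0]

In [21]: pd.Series(np.array([1,2,3]))  # create a Series from an ndarray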
Out[21]:
0    1
1    2
2    3
dtype: int32

############## arithmetic with a scalar, arithmetic between two Series ##############
In [22]: a
Out[22]:
a    11
b    22
c    33
d    44
dtype: int64

In [23]: a+2
Out[23]:
a    13
b    24
c    35
d    46
dtype: int64

In [26]: a+a
Out[26]:
a    22
b    44
c    66
d    88
dtype: int64
########### indexing, fancy indexing, boolean indexing ##############

In [32]: a[[1,2]]  # fancy indexing
Out[32]:
b    22
c    33
dtype: int64

In [34]: a[a>20]  # boolean indexing
Out[34]:
b    22
c    33
d    44
dtype: int64

In [35]: a>20
Out[35]:
a    False
b     True
c     True
d     True
dtype: bool

Series supports dictionary-style features (labels):

Create a Series from a dictionary: Series(dic)
in operator: 'a' in sr
Key indexing: sr['a'], sr[['a', 'b', 'd']]

In [36]: pd.Series({'a':1,'b':2,'c':3}) # create a Series from a dictionary
Out[36]:
a    1
b    2
c    3
dtype: int64

#### in operator
In [37]: a
Out[37]:
a    11
b    22
c    33
d    44
dtype: int64

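In [38]: 'a' in a  # True when the label is in the index
Out[38]: True

In [39]: 'sss' in a
Out[39]: False

a.get('a') -- 11  returns the value when the label exists
a.get('asdfa',default=0) # get returns None for a missing label; default sets the value to return instead

### Series slicing

In [42]: a['a':'c'] # label slices include both endpoints: with labels there is no general way to name "the next label", whereas a position can simply be incremented by 1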
Out[42]:
a    11
b    22
c    33
dtype: int64

In [43]: a[0:3] # positional slices still exclude the end
Out[43]:
a    11
b    22
c    33
dtype: int64

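Label slices include both endpoints;
this is the only case that does.

With the default index 0 1 2 ..., slices still exclude the end

pandas integer indexes:

a[] accepts both labels and positions, so when the index itself consists of integers, indexing by position becomes ambiguous
a.iloc[-1] always interprets the value as a position
a.loc[] always interprets the value as a label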

a = pd.Series(np.arange(20))
b= a[10:].copy()
In [9]: b
Out[9]:
10    10
11    11
12    12
13    13
14    14
15    15
16    16
17    17
18    18
19    19
dtype: int32

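b[-1] cannot be used here: -1 is looked up as a label, and no such label exists

In [10]: b[-1:]  # a slice is interpreted by position
Out[10]:
19    19
dtype: int32

In [14]: b[10]  # works only because 10 is a label in this index (it is not position 10)
Out[14]: 10


######### loc  iloc   #########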

Out[15]:
10    10
11    11
12    12
13    13
14    14
15    15
16    16
17    17
18    18
19    19
dtype: int32

In [16]: b.iloc[0]  # iloc indexes by position
Out[16]: 10

In [17]: b.loc[10] # loc indexes by label
Out[17]: 10
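
The following sketch rebuilds the Series b from above and contrasts position-based and label-based access. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

sr = pd.Series(np.arange(20))
b = sr[10:].copy()           # labels 10..19 now sit at positions 0..9

first = b.iloc[0]            # position 0
last = b.iloc[-1]            # last position
by_label = b.loc[10]         # label 10

pos_slice = b.iloc[0:3]      # positions 0, 1, 2 (end excluded)
label_slice = b.loc[10:13]   # labels 10, 11, 12, 13 (end included)

missing = False
try:
    b.loc[-1]                # there is no label -1
except KeyError:
    missing = True

print(first, last, by_label, len(pos_slice), len(label_slice), missing)
```

Using iloc and loc explicitly removes the ambiguity that a plain b[...] has on an integer index.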

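Pandas data alignment

In arithmetic, pandas aligns the operands by index before computing. If the indexes differ, the index of the result is the union of the two operands' indexes.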

In [19]:a =  pd.Series(np.array([1,2,3,4]),index=list('abcd'))
Out[19]:
a    1
b    2
c    3
d    4
dtype: int32

In [22]: b =pd.Series(np.array([1,2,3,4]),index=list('acdb'))

In [23]: a.values
Out[23]: array([1, 2, 3, 4])

In [24]: b.values
Out[24]: array([1, 2, 3, 4])

In [25]: a.values + b.values
Out[25]: array([2, 4, 6, 8])  # adding .values adds the arrays position by position

In [26]: a+b  # pandas aligns by label first, then adds
Out[26]:
a    2
b    6
c    5
d    7
dtype: int32

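The result contains every label that appears in either Series.

A label that is missing from one operand yields NaN (a missing value)

In [29]: b =pd.Series(np.array([1,2,3]),index=list('acd'))

In [30]: a =pd.Series(np.array([1,2,3]),index=list('bca'))

In [31]: a+b  # labels without a counterpart give NaN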
Out[31]:
a    4.0
b    NaN
c    4.0
d    NaN
dtype: float64

How can missing values be treated as 0 when two Series are added?

sr1.add(sr2, fill_value=0)

Handling the missing values:
a.add(b,fill_value=0) a label missing on one side is treated as 0

In [56]: a.add(b)
Out[56]:
a    4.0
b    NaN
c    4.0
d    NaN
dtype: float64

In [57]: a.add(b,fill_value = 0)
Out[57]:
a    4.0
b    1.0
c    4.0
d    3.0
dtype: float64

Flexible arithmetic methods: add, sub, div, mul

add, sub, div and mul all accept fill_value to substitute a value (such as 0) for missing entries
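
A minimal sketch of alignment with and without fill_value, reusing the two Series from the transcript above:

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=list('bca'))
b = pd.Series([1, 2, 3], index=list('acd'))

plain = a + b                      # 'b' and 'd' exist on one side only -> NaN
filled = a.add(b, fill_value=0)    # the missing side counts as 0
diff = a.sub(b, fill_value=0)      # same idea for subtraction

print(plain)
print(filled)
print(diff)
```

With fill_value=0 the labels that exist on one side only keep their own value instead of turning into NaN.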

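Missing data in pandas

Missing data in pandas is represented by NaN, equivalent to np.nan; the built-in None is also treated as NaN

Methods for handling missing data:

dropna() drops the entries whose value is NaN
fillna() fills in missing data
isnull() returns a boolean array that is True where values are missing
notnull() returns a boolean array that is False where values are missing
Filtering out missing data: sr.dropna() or sr[sr.notnull()]
Filling missing data: sr.fillna(0)

Note that these pandas methods return a new object; the original is not modified

c.dropna() does not modify the original

In [72]: c.dropna()   # drops the NaN entries of c and returns a new Series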
Out[72]:
a    4.0
c    4.0
dtype: float64

In [73]: c  # the original c is unchanged
Out[73]:
a    4.0
b    NaN
c    4.0
d    NaN
dtype: float64

c[c.isnull()] uses a boolean index
c[~c.isnull()] the non-NaN values
c[c.notnull()] the non-NaN values

In [86]: c[~c.isnull()] # negating the mask selects the values that are not NaN
Out[86]:
a    4.0
c    4.0
dtype: float64

In [87]: c[c.isnull()]  # the result of isnull() is used as a boolean index
Out[87]:
b   NaN
d   NaN
dtype: float64


In [89]: c[c.notnull()]  # notnull() gives the mask of non-NaN values directly
Out[89]:
a    4.0
c    4.0
dtype: float64

c.fillna(0) fills every missing value with 0

The mean can be used as the fill value too: c.fillna(c.mean())

In [91]: c.fillna(0)
Out[91]:
a    4.0
b    0.0
c    4.0
d    0.0
dtype: float64

In [93]: c.fillna(c.mean())  # fill with the mean of c; mean() is a method, so the parentheses are required
Out[93]:
a    4.0
b    4.0
c    4.0
d    4.0
dtype: float64

Series methods are all called with parentheses: mean() add() cumsum()
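
The missing-data methods can be tried on a small Series like c. This is a sketch with illustrative values; it also shows that None is stored as NaN and that the original Series is left untouched.

```python
import numpy as np
import pandas as pd

c = pd.Series([4.0, np.nan, 4.0, None], index=list('abcd'))  # None becomes NaN

dropped = c.dropna()              # new Series without the NaN entries
zero_filled = c.fillna(0)         # NaN -> 0
mean_filled = c.fillna(c.mean())  # NaN -> mean of the non-missing values

print(c.isnull().tolist())
print(dropped.index.tolist(), zero_filled.tolist(), mean_filled.tolist())
print(c.isnull().sum())           # c itself still has its two NaN values
```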

DataFrame

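A DataFrame is a two-dimensional, table-like data structure. It can be viewed as a dictionary of Series that share one index; that index can be thought of as the primary key of a database table

Create one with pd.DataFrame()

Creation method 1:
Each of the columns one and two, taken together with the index, can be viewed as a separate Series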

In [94]: pd.DataFrame({'one':[1,2,3,4],'two':[4,3,2,1]})
    ...:
Out[94]:
   one  two
0    1    4
1    2    3
2    3    2
3    4    1

A custom index can also be given


In [98]: pd.DataFrame({'one':[1,2,3,4],'two':[7,8,9,3]},index=['a','b','c','d'])
Out[98]:
   one  two
a    1    7
b    2    8
c    3    9
d    4    3

Creation method 2:

Build it from Series objects

pd.DataFrame({'one':pd.Series([1,2,3],index=['a','b','c']), 'two':pd.Series([1,2,3,4],index=['b','a','c','d'])})
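
Running the statement above shows that the two Series are aligned on their labels, and that a label missing from one of them becomes NaN. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
    'two': pd.Series([1, 2, 3, 4], index=['b', 'a', 'c', 'd']),
})
print(df)
```

Column one has no value for label d, so that cell is NaN and the column is stored as float.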

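Reading data from a csv file

A csv file separates values with commas

pd.read_csv('filename') also accepts a file object
a.to_csv('') saves a new csv file
Unless told otherwise, to_csv also writes the row index, which shows up as an extra unnamed column when the file is read back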

In [104]: a.to_csv('new_csv')

In [105]: pd.read_csv('new_csv')
Out[105]:
      Unnamed: 0    id        date    open   close    high     low
0              0     0  2007-03-01  22.074  20.657  22.503  20.220
1              1     1  2007-03-02  20.750  20.489  20.944  20.256
2              2     2  2007-03-05  20.300  19.593  20.384  19.218
3              3     3  2007-03-06  19.426  19.977  20.308  19.315
4              4     4  2007-03-07  19.995  20.520  20.706  19.827
5              5     5  2007-03-08  20.353  20.273  20.454  20.167

a.index the row index

2463   904595.00  601318
2464   506834.00  601318
2465   657610.00  601318
2466   667132.00  601318
2467   491294.00  601318
2468   616005.00  601318
2469  1147936.00  601318

[2470 rows x 9 columns]

In [106]: df = a

In [107]: df.index
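Out[107]: RangeIndex(start=0, stop=2470, step=1)  # stop is excluded

a.columns the column index

In [108]: df.columns  # the column labels are strings
Out[108]: Index(['id', 'date', 'open', 'close', 'high', 'low', 'volume', 'code'], dtype='object')

a.values returns the data as a two-dimensional array, without the index
a.T transpose
a.describe() quick summary statistics such as the mean and median (note the parentheses)

The result includes the count, the mean and more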

In [112]: df.describe()
Out[112]:
                id         open        close         high          low  \
count  2470.000000  2470.000000  2470.000000  2470.000000  2470.000000
mean   1234.500000    25.910605    25.932880    26.361828    25.520403
std     713.171905     9.571887     9.580407     9.776995     9.387243
min       0.000000     9.361000     9.361000     9.649000     8.965000
25%     617.250000    18.891000    18.894500    19.166500    18.688000
50%    1234.500000    22.587500    22.571500    22.973000    22.189500
75%    1851.750000    31.769000    31.845500    32.260500    31.483250
max    2469.000000    64.337000    64.333000    66.236000    63.006000

             volume      code
count  2.470000e+03    2470.0
mean   4.781578e+05  601318.0
std    5.650142e+05       0.0
min    2.543527e+04  601318.0
25%    1.948375e+05  601318.0
50%    3.016310e+05  601318.0
75%    4.918206e+05  601318.0
max    8.433281e+06  601318.0

Naming the index
df.index.name = 'new_id' sets the name of the index; it is displayed above the index and no column is added

In [17]: df.index.name = 'new_id'

In [18]: df
Out[18]:
          id        date    open   close
new_id
0          0  2007-03-01  22.074  20.657
1          1  2007-03-02  20.750  20.489
2          2  2007-03-05  20.300  19.593
3          3  2007-03-06  19.426  19.977

Renaming columns
df.rename(columns={'old':'new'}) takes a dictionary that maps old column names to new ones and returns a new DataFrame

In [19]: df.rename(columns={'open':'new_open'})
Out[19]:
          id        date  new_open   close    high     low
new_id

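DataFrame slicing and indexing

df[] indexes by column

df['close']

df[['close','open']]

Slice the rows first, then pick the columns
df[0:10][['open','close']]

loc and iloc: rows go to the left of the comma, columns to the right
df.loc[:,['close','open']]
df.loc[0:10,'open':'close']

In [25]: df.loc[:10,['close','open']]   # rows on the left, columns on the right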
Out[25]:
         close    open
new_id
0       20.657  22.074
1       20.489  20.750
2       19.593  20.300
3       19.977  19.426
4       20.520  19.995
5       20.273  20.353
6       20.101  20.264
7       19.739  19.999
8       19.818  19.783
9       19.841  19.558
10      19.849  20.097

iloc selects by position
df.iloc[:,0:10]

In [26]: df.iloc[:5,0:3]  # row positions first, then column positions
Out[26]:
        id        date    open
new_id
0        0  2007-03-01  22.074
1        1  2007-03-02  20.750
2        2  2007-03-05  20.300
3        3  2007-03-06  19.426
4        4  2007-03-07  19.995

Filtering with boolean values

df[df<20] entries that do not satisfy the condition become NaN

In [31]: df<20
Out[31]:
           id  date   open  close   high    low  volume   code
new_id
0        True  True  False  False  False  False   False  False
1        True  True  False  False  False  False   False  False
2        True  True  False   True  False   True   False  False
3        True  True   True   True  False   True   False  False
4        True  True   True  False  False   True   False  False
5        True  True  False  False  False  False   False  False
6        True  True  False  False  False   True   False  False
7        True  True   True   True   True   True   False  False
8        True  True   True   True   True   True   False  False


In [32]: df[df<20]  # values that are not below 20 become NaN
Out[32]:
          id        date    open   close    high     low  volume  code
new_id
0        0.0  2007-03-01     NaN     NaN     NaN     NaN     NaN   NaN
1        1.0  2007-03-02     NaN     NaN     NaN     NaN     NaN   NaN
2        2.0  2007-03-05     NaN  19.593     NaN  19.218     NaN   NaN
3        3.0  2007-03-06  19.426  19.977     NaN  19.315     NaN   NaN
4        4.0  2007-03-07  19.995     NaN     NaN  19.827     NaN   NaN
5        5.0  2007-03-08     NaN     NaN     NaN     NaN     NaN   NaN
6        6.0  2007-03-09     NaN     NaN     NaN  19.735     NaN   NaN
7        7.0  2007-03-12  19.999  19.739  19.999  19.646     NaN   NaN
8        8.0  2007-03-13  19.783  19.818  19.982  19.699     NaN   NaN

df[df<20].fillna(0)

In [34]: df[df<20].fillna(0)
Out[34]:
          id        date    open   close    high     low  volume  code
new_id
0        0.0  2007-03-01   0.000   0.000   0.000   0.000     0.0   0.0
1        1.0  2007-03-02   0.000   0.000   0.000   0.000     0.0   0.0
2        2.0  2007-03-05   0.000  19.593   0.000  19.218     0.0   0.0
3        3.0  2007-03-06  19.426  19.977   0.000  19.315     0.0   0.0
4        4.0  2007-03-07  19.995   0.000   0.000  19.827     0.0   0.0
5        5.0  2007-03-08   0.000   0.000   0.000   0.000     0.0   0.0
6        6.0  2007-03-09   0.000   0.000   0.000  19.735     0.0   0.0
7        7.0  2007-03-12  19.999  19.739  19.999  19.646     0.0   0.0
8        8.0  2007-03-13  19.783  19.818  19.982  19.699     0.0   0.0

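The comparison above is not reliable on the whole DataFrame because the date column holds strings, which can raise an error; select the numeric columns first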

In [43]: df2 = df.loc[:,'open':'code']
In [44]: df2
Out[44]:
          open   close    high     low      volume    code
new_id
0       22.074  20.657  22.503  20.220  1977633.51  601318
1       20.750  20.489  20.944  20.256   425048.32  601318
2       20.300  19.593  20.384  19.218   419196.74  601318
3       19.426  19.977  20.308  19.315   297727.88  601318
4       19.995  20.520  20.706  19.827   287463.78  601318

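Dates read from a csv file are plain strings unless parsed, so they are matched as strings
df[df['date'].isin(['2017-03-01'])]

Each column of a DataFrame holds a single data type

Dropping missing data in a DataFrame

dropna(how='all') drops a row only when all of its values are NaN; the default is how='any'

axis is 0 (rows) by default; 1 means columns
dropna(how='all',axis=1) drops the columns that are entirely NaN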

In [53]: df3.dropna(how='all',axis=1)
Out[53]:
          close     code     high      low     open      volume
new_id
0        41.314  1202636   45.006   40.440   44.148  3955267.02
1        40.978  1202636   41.888   40.512   41.500   850096.64
2        19.593  1202636   40.768   19.218   40.600   838393.48
3        19.977  1202636   40.616   19.315   19.426   595455.76
4        41.040  1202636   41.412   19.827   19.995   574927.56
5        40.546  1202636   40.908   40.334   40.706   261967.66
6        40.202  1202636   40.706   19.735   40.528   321775.58
7        19.739  1202636   19.999   19.646   19.999   290706.12

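Using functions on a DataFrame

In [54]: df['close'].mean()  # mean of the close column
Out[54]: 25.93288016194329

Sorting by index
df.sort_index(ascending=True) ascending, the default
df.sort_index(ascending=False) descending
Sorting by column
df.sort_values('close',ascending=True) closing price ascending
df.sort_values('close',ascending=False) closing price descending
A list of columns can also be passed; its order gives the sort priority
df2.applymap(lambda x:x+1) applymap applies a function to every element of a DataFrame
map does the same for a Series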
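
A small sketch of the sorting and elementwise calls on made-up prices. For the DataFrame case it uses apply together with Series.map, an illustrative choice that behaves the same across pandas versions, since the name of the DataFrame-wide elementwise method has changed over time.

```python
import pandas as pd

df = pd.DataFrame(
    {'open': [20.3, 22.0, 20.3, 19.4], 'close': [19.6, 20.7, 20.5, 20.0]},
    index=[2, 0, 1, 3],
)

by_index = df.sort_index()                        # rows ordered 0, 1, 2, 3
by_close_desc = df.sort_values('close', ascending=False)
by_two_keys = df.sort_values(['open', 'close'])   # ties on open broken by close

plus_one = df['close'].map(lambda x: x + 1)       # elementwise on a Series
rounded = df.apply(lambda col: col.map(round))    # elementwise on every column

print(by_two_keys)
print(rounded)
```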

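Parameters for reading files with pandas

Reading Excel files

Requires pip3 install xlrd
pd.read_excel('filename')

csv files

pd.read_csv() the default separator is ,
pd.read_table() the default separator is \t

pd.read_table('',sep=',') the separator can also be set explicitly

header=None the file has no header row; columns are numbered 0 1 2 ...
names = ['id','close'] sets the column names

index_col = 'date' uses the given column as the index

parse_dates = True tries to parse the index as dates

parse_dates =['date'] parses the listed columns as dates

na_values = ['none','nan','NaN'] strings to treat as missing values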

In [77]: df1 = pd.read_csv('601318.csv',index_col='date',parse_dates=True)
In [79]: df = df1

In [80]: type(df.index[0])
Out[80]: pandas._libs.tslib.Timestamp

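Parameters for writing files with pandas

na_rep sets the text written for NaN; the default is an empty string

df.to_csv() writes the row index as an extra column; index=False leaves it out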
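
The reading and writing parameters can be tried without touching the disk. This sketch uses an in-memory text buffer as a stand-in for a file such as 601318.csv; the three rows, including the 'none' marker, are made up for illustration.

```python
import io
import pandas as pd

# In-memory stand-in for a csv file (illustrative values)
csv_text = """date,open,close
2007-03-01,22.074,20.657
2007-03-02,20.750,none
2007-03-05,20.300,19.593
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    index_col='date',        # use the date column as the index
    parse_dates=['date'],    # parse that column into timestamps
    na_values=['none'],      # treat the string 'none' as missing
)
print(type(df.index[0]))
print(df)

out = io.StringIO()
df.to_csv(out, na_rep='NULL', index=False)   # write NaN as 'NULL', omit the index
print(out.getvalue())
```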

pandas time objects

datetime.datetime.strptime(str,'%Y-%m-%d')

converts a string to a datetime object
pip3 install python-dateutil

dateutil.parser.parse()

pd.to_datetime(['2017-11-12'])

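Time indexes in pandas

pd.date_range('2017-01-01','2017-08-08')

Frequency:
freq
the default is D (day)

Other values (recent pandas versions prefer lowercase h, min, s and YE for hour, minute, second and year):
H hour
W week
B business day
SM semi-month
T minute
S second
A year

periods = 100 the range has 100 entries

parse_dates = ['date']

df.loc['2017'] all rows in 2017

df.loc['2017-02'] all rows in February 2017

df.loc['2017-02':'2017-08'] time slices include the end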
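
A minimal sketch of date_range and of selecting by partial date strings. It uses a Series with a daily index and .loc, an illustrative setup rather than the stock data from the notes.

```python
import numpy as np
import pandas as pd

days = pd.date_range('2017-01-01', '2017-08-08')            # daily by default
hundred = pd.date_range('2017-01-01', periods=100)           # fixed length
workdays = pd.date_range('2017-01-01', '2017-01-31', freq='B')

s = pd.Series(np.arange(len(days)), index=days)
feb = s.loc['2017-02']                    # every day in February 2017
feb_to_mar = s.loc['2017-02':'2017-03']   # the end month is included

print(len(days), len(hundred), len(workdays), len(feb), len(feb_to_mar))
```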

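Exercise: compute two moving averages, MA5 and MA10, and add them as two new columns; the first 4 values of MA5 are NaN (and the first 9 of MA10)

Each MA5 value is the mean of the 5 most recent closing prices

Each window is taken as a view (slice) of the data

Find the crossing points that form golden crosses and death crosses

df['ma5']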
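
One possible way to do this exercise is sketched below. It uses rolling().mean() rather than slicing windows by hand, shortens the windows to 3 and 5 so that a 15-day made-up price series is enough, and uses illustrative column names (ma_short, ma_long).

```python
import pandas as pd

close = [10, 11, 12, 13, 14, 13, 12, 11, 10, 9, 10, 12, 14, 16, 18]
df = pd.DataFrame(
    {'close': close},
    index=pd.date_range('2017-01-02', periods=len(close), freq='B'),
    dtype=float,
)

# The first n-1 values of an n-day rolling mean are NaN
df['ma_short'] = df['close'].rolling(3).mean()
df['ma_long'] = df['close'].rolling(5).mean()

valid = df.dropna()
above = valid['ma_short'] > valid['ma_long']   # True while the short MA is on top
change = above.astype(int).diff()              # +1: crossed up, -1: crossed down

golden = valid.index[change == 1]    # short MA crosses above the long MA
death = valid.index[change == -1]    # short MA crosses below the long MA

print(golden)
print(death)
```

A cross is detected on the first day on which the ordering of the two averages differs from the previous day.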

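A pandas DataFrame is structured like a database table

df.groupby().get_group() groups the rows by key and returns one group

pd.merge(left,right) joins two tables
pd.merge(left,right,on='key') names the column whose values must match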
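
A short sketch of both operations on made-up tables (the column names are illustrative):

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'price': [10.0, 20.0, 30.0]})
right = pd.DataFrame({'key': ['a', 'b', 'd'], 'volume': [100, 200, 400]})

merged = pd.merge(left, right, on='key')   # inner join: keys present in both

trades = pd.DataFrame({
    'code': ['x', 'y', 'x', 'y'],
    'amount': [100, 200, 300, 400],
})
group_x = trades.groupby('code').get_group('x')   # rows whose code is 'x'
totals = trades.groupby('code')['amount'].sum()   # one total per code

print(merged)
print(group_x)
print(totals)
```

By default merge performs an inner join, so keys c and d, which appear in only one table, are dropped.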

Matplotlib

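Install: pip install matplotlib

Import: import matplotlib.pyplot as plt

Plotting:

Plot function: plt.plot() # draws the figure
Show the figure: plt.show() # displays the figure
Note: once a figure has been shown, its data is cleared; to show it again, the plot must be created again

Plot arguments

Data: a single list is taken as the y values, with x set to 0,1,2,...; two lists are taken as the x and y values.
Line properties: in 'r-o' the first character is the color, the second the line style and the third the marker shape (the symbols resemble what they draw).

Line style linestyle (-, -., --, :)
Marker (v,^,s,*,H,+,x,D,o,…)
Color (b,g,r,y,k,w,…)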

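Drawing one line

plt.plot([1,2,3,4]) # draws a straight line, blue by default
plt.plot([1,2,3,4],'ro') # points only: r is red, o is the dot marker
plt.plot([1,2,3,4],[2,3,4,5],'r-o') # the first list is the x axis, the second the y axis, drawn as a red line with dots

Drawing several lines

Several lines can be given in one plot call, or in several plot calls followed by one plt.show()

plt.plot([1,2,3,6],[2,3,4,5],'r-o',[4,5,7,9],'bD')

plt.plot([1,2,3],'ro') with a single data argument the x values default to 0 1 2

Setting the title and the x and y axis labels

Create the plot first, then set them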

plt.plot([1,2,3,6],[2,3,4,5],'r-o',[4,5,7,9],'bD')
In [111]: plt.xlabel('x')

In [112]: plt.ylabel('y')

In [113]: plt.title('this is a title')

In [114]: plt.show()

Plotting DataFrame data

pandas integrates matplotlib

df[['open','close']].plot()
df['close'].plot()
plt.show()

It is best to plot together only the columns that share the same unit

Drawing a frequency histogram

import numpy as np
x = np.random.randint(0,10,100) # 100 random integers
plt.hist(x)
plt.show()

Or

plt.hist(x,np.arange(10))
plt.show()

Figure and subplots

Create the figure:
fig = plt.figure()
Create the subplots:
ax1 = fig.add_subplot(2,2,1) # the first two arguments give the grid (2 rows by 2 columns); the third is the position of this subplot

ax2 = fig.add_subplot(2,2,2)

ax3 = fig.add_subplot(2,2,3)

ax4 = fig.add_subplot(2,2,4)

ax1.plot()
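
Putting the pieces together, the sketch below draws something in each of the four subplots. It selects the off-screen Agg backend and renders into an in-memory buffer, both assumptions made so that the example runs without a display; in an interactive session plt.show() would be used instead.

```python
import io

import matplotlib
matplotlib.use('Agg')   # assumption: no display is available, render off-screen
import matplotlib.pyplot as plt
import numpy as np

fig = plt.figure()
ax1 = fig.add_subplot(2, 2, 1)
ax2 = fig.add_subplot(2, 2, 2)
ax3 = fig.add_subplot(2, 2, 3)
ax4 = fig.add_subplot(2, 2, 4)

ax1.plot([1, 2, 3, 4], [2, 3, 4, 5], 'r-o')   # red line with dot markers
ax2.plot([1, 2, 3, 4], 'bD')                  # blue diamonds, x = 0, 1, 2, 3
ax3.hist(np.random.randint(0, 10, 100), bins=np.arange(11))
ax4.plot([0, 1, 2], [4, 1, 3], 'g--')         # green dashed line

# On an Axes object the setters are set_title / set_xlabel / set_ylabel
ax1.set_title('line')
ax1.set_xlabel('x')
ax1.set_ylabel('y')

buf = io.BytesIO()
fig.savefig(buf, format='png')   # render the figure into memory
print(len(fig.axes), buf.getbuffer().nbytes > 0)
```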

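Implementing a simple quant framework

TuShare is a free, open-source Python package that provides financial data.
http://tushare.org/index.html

Install lxml first: pip3 install lxml
Install tushare: pip3 install tushare

Order constraints, which are checked in a fixed order:

The stock is suspended from trading
There is not enough cash
The minimum purchase is one lot; quantities are multiples of 100 shares
The order size cannot exceed the day's trading volume
The quantity sold cannot exceed the position held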
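
The constraints listed above could be sketched as two small helper functions. Everything here is hypothetical: the function names, the parameters and the choice to shrink an order rather than reject it are illustrative assumptions, not part of any real framework.

```python
def check_buy(amount, price, cash, day_volume, suspended):
    """Return the number of shares that may be bought (0 if none).

    Hypothetical helper; the checks mirror the constraint list above.
    """
    if suspended:
        return 0
    amount = min(amount, day_volume)           # cannot exceed the day's volume
    amount = min(amount, int(cash // price))   # cannot spend more than the cash
    return amount // 100 * 100                 # round down to whole lots of 100


def check_sell(amount, holding, day_volume, suspended):
    """Return the number of shares that may be sold (0 if none)."""
    if suspended:
        return 0
    return min(amount, holding, day_volume)    # cannot exceed position or volume


print(check_buy(1050, 20.0, 10000.0, 5000, False))
print(check_sell(800, 500, 5000, False))
```

A real backtester would also have to account for fees and slippage when checking the available cash.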

Data localization, commission fees, benchmark

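Online platforms

set_benchmark sets the benchmark

set_option enables dynamic price adjustment, so that dividends and similar events do not distort the price series

get_index_stocks gets the constituent stocks of an index, for example the 300 stocks of the CSI 300

avg_cost average cost per share

Scheduled functions:
run_monthly(handle,1) the second argument is the trading day of the month

Slippage: the matched price of an order can differ from the last traded price

attribute_history gets historical data
unit='1d' backtest by day

Context stores the basic account information
Stock data
Position data
available_cash available cash

position
   security stock code
   price latest price
   total_amount total position
   closeable_amount position that can be sold (T+1)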

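Metrics

alpha: a positive value means excess return relative to the risk taken
Beta: how strongly the investment moves with the benchmark
Sharpe ratio: the larger the better
Maximum drawdown: the largest loss from a peak to a later trough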
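
Two of these metrics are easy to compute by hand. The sketch below is a minimal version under stated assumptions: the function names are illustrative, the risk-free rate is a per-period rate, and 252 trading periods per year are used for annualization.

```python
import numpy as np

def max_drawdown(equity):
    """Largest fractional fall from a running peak of the equity curve."""
    equity = np.asarray(equity, dtype=float)
    running_peak = np.maximum.accumulate(equity)
    return float(np.max(1 - equity / running_peak))

def sharpe_ratio(returns, risk_free=0.0, periods_per_year=252):
    """Annualized Sharpe ratio of per-period returns.

    Assumptions: risk_free is a per-period rate, 252 periods per year.
    """
    excess = np.asarray(returns, dtype=float) - risk_free
    return float(excess.mean() / excess.std(ddof=1) * np.sqrt(periods_per_year))

equity = [100, 120, 90, 110, 80, 130]   # made-up account values
returns = np.diff(equity) / np.array(equity[:-1], dtype=float)

print(max_drawdown(equity))   # fall from the peak of 120 to the trough of 80
print(sharpe_ratio(returns))
```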

It is best to fetch the CSI 300 (HS300) constituents inside handle

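Dual moving average strategy

Timing

Factor-based stock selection

Standard
Multi-factor

Smallest market capitalization, price-earnings ratio
Market capitalization combined with price-earnings ratio

Mean reversion theory

A price that has fallen tends to return to its moving average

Degree of deviation: (MA - P) / MA

Momentum strategy

Buy what has performed well over the preceding period

Reversal strategy

Buy what has performed poorly over the preceding period

Bollinger Bands strategy

Timing

Three lines

The upper line is the price resistance line
The lower line is the support line
The middle line is the n-day moving average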
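
The three lines can be computed with rolling statistics. This sketch assumes the common construction in which the upper and lower bands sit k standard deviations away from the n-day moving average; the function name, the defaults and the synthetic price series are illustrative.

```python
import numpy as np
import pandas as pd

def bollinger(close, n=20, k=2.0):
    """Return the middle, upper and lower bands (n and k are assumed defaults)."""
    middle = close.rolling(n).mean()   # n-day moving average
    std = close.rolling(n).std()       # n-day rolling standard deviation
    return middle, middle + k * std, middle - k * std

# Synthetic prices: a slow upward drift plus an oscillation
close = pd.Series(np.linspace(10, 12, 30) + np.sin(np.arange(30)))
middle, upper, lower = bollinger(close, n=5, k=2.0)

bands = pd.DataFrame({'close': close, 'lower': lower, 'middle': middle, 'upper': upper})
print(bands.tail())
```

A timing rule would then compare the price with the upper and lower bands.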

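PEG strategy

Stock selection strategy

Price-earnings ratio (PE) = share price / earnings per share
PEG = PE / (earnings growth rate × 100)

Suited to growth companies

Alpaca trading rule

Buy at random
Genetic algorithm

Improvement: momentum, buy the stocks with the best returns

Turtle trading

Donchian channel

Upper line
Middle line
Lower line

The core is the stop-loss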

posted @ 2017-08-21 23:11 hzxPeter