Basic Data Statistics in Python (Part 1): Convenient Data Acquisition, Data Preparation and Cleanup, and Data Display

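1. Convenient data acquisition

  1.1 Local data acquisition: opening, reading, writing, and closing files (covered in a separate chapter)

  1.2 Network data acquisition:

    1.2.1 urllib, urllib2, httplib, httplib2 (in Python 3: urllib.request, http.client)

      Regular expressions (covered in a separate chapter)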
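
As a minimal sketch of the Python 3 route named in 1.2.1, the snippet below reads a URL with urllib.request and extracts numbers from the text with a regular expression. The helper names fetch_text and find_numbers are assumptions made for illustration, and the data: URL only stands in for a real web address so that the example runs offline.

```python
import re
import urllib.request


def fetch_text(url, encoding="utf-8", timeout=10):
    """Open url and return the response body decoded as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode(encoding)


def find_numbers(text):
    """Return every decimal number (digits.digits) found in text as floats."""
    return [float(match) for match in re.findall(r"\d+\.\d+", text)]


# Hypothetical input: a data: URL stands in for a real page.
sample_url = "data:text/plain,open=60.37&close=61.83"
page = fetch_text(sample_url)
prices = find_numbers(page)
print(page)
print(prices)
```

With a real http:// or https:// address the same two helpers apply unchanged; only sample_url differs.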

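    1.2.2 Fetching Yahoo Finance data with the matplotlib.finance module

      (matplotlib.finance was deprecated in matplotlib 2.0 and later removed, so the session below needs an older matplotlib)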

In [7]: from matplotlib.finance import quotes_historical_yahoo_ochl

In [8]: from datetime import date

In [9]: from datetime import datetime

In [10]: import pandas as pd

In [11]: today = date.today()

In [12]: start = (today.year-1, today.month, today.day)

In [14]: quotes = quotes_historical_yahoo_ochl('AXP', start, today)  # fetch the quotes

In [15]: fields = ['date', 'open', 'close', 'high', 'low', 'volume']

In [16]: list1 = []

In [18]: for i in range(0,len(quotes)):
    ...:     x = date.fromordinal(int(quotes[i][0]))  # the first column of each row is an ordinal; date.fromordinal turns it into a date
    ...:     y = x.strftime('%Y-%m-%d')  # format the date as YYYY-MM-DD
    ...:     list1.append(y)  # collect the formatted dates in the list
    ...:     

In [19]: quotesdf = pd.DataFrame(quotes,index=list1,columns=fields)  # the dates become the index, fields become the column names

In [20]: quotesdf = quotesdf.drop(['date'],axis=1)  # drop the date column

In [21]: print(quotesdf)
                 open      close       high        low      volume
2016-01-20  60.374146  61.835916  62.336256  60.128882   9043800.0
2016-01-21  61.806486  61.453305  63.101479  61.325767   8992300.0
2016-01-22  57.283819  54.016907  57.774347  53.114334  43783400.0
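
The loop above can also be written as a single list comprehension. The following is a sketch only: quotes_to_frame is a hypothetical helper name, and sample_quotes holds three rows copied from the session outputs in this post (ordinal date first), so nothing is downloaded.

```python
from datetime import date

import pandas as pd

# Three rows copied from this post: (ordinal date, open, close, high, low, volume)
sample_quotes = [
    (735983.0, 60.374146, 61.835916, 62.336256, 60.128882, 9043800.0),
    (735984.0, 61.806486, 61.453305, 63.101479, 61.325767, 8992300.0),
    (735985.0, 57.283819, 54.016907, 57.774347, 53.114334, 43783400.0),
]
fields = ['date', 'open', 'close', 'high', 'low', 'volume']


def quotes_to_frame(rows, columns):
    """Build a DataFrame indexed by YYYY-MM-DD strings and drop the ordinal column."""
    labels = [date.fromordinal(int(row[0])).strftime('%Y-%m-%d') for row in rows]
    frame = pd.DataFrame(rows, index=labels, columns=columns)
    return frame.drop(columns=['date'])


sample_df = quotes_to_frame(sample_quotes, fields)
print(sample_df)
```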

    1.2.3 Fetching corpora and other data with the Natural Language Toolkit (NLTK)

      1. Install nltk: pip install nltk

      2. Download a corpus:

In [1]: import nltk

In [2]: nltk.download()
NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d

Download which package (l=list; x=cancel)?
  Identifier> gutenberg
    Downloading package gutenberg to /root/nltk_data...
      Package gutenberg is already up-to-date!

      3. Load the data:

In [3]: from nltk.corpus import gutenberg

In [4]: print(gutenberg.fileids())
[u'austen-emma.txt', u'austen-persuasion.txt', u'austen-sense.txt', u'bible-kjv.txt', u'blake-poems.txt', u'bryant-stories.txt', u'burgess-busterbrown.txt', u'carroll-alice.txt', u'chesterton-ball.txt', u'chesterton-brown.txt', u'chesterton-thursday.txt', u'edgeworth-parents.txt', u'melville-moby_dick.txt', u'milton-paradise.txt', u'shakespeare-caesar.txt', u'shakespeare-hamlet.txt', u'shakespeare-macbeth.txt', u'whitman-leaves.txt']

In [5]: texts = gutenberg.words('shakespeare-hamlet.txt')

In [6]: texts
Out[6]: [u'[', u'The', u'Tragedie', u'of', u'Hamlet', u'by', ...]
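
gutenberg.words() returns a sequence of tokens, so ordinary Python tools work on it. As a hedged illustration, the sketch below counts token frequencies with collections.Counter. Both top_tokens and sample_tokens are made-up names, and sample_tokens is a stand-in list (not corpus text) so that the example runs without the corpus installed.

```python
from collections import Counter

# Hypothetical stand-in for gutenberg.words(...); not taken from the corpus.
sample_tokens = ['the', 'cat', 'saw', 'The', 'dog', 'the']


def top_tokens(tokens, n):
    """Return the n most frequent tokens, compared case-insensitively."""
    counts = Counter(token.lower() for token in tokens)
    return counts.most_common(n)


most_frequent = top_tokens(sample_tokens, 1)
print(most_frequent)  # [('the', 3)]
```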

2. Data preparation and cleanup

  2.1 Adding column names to the quotes data

In [79]: quotesdf = pd.DataFrame(quotes)

In [80]: quotesdf
Out[80]: 
            0          1          2          3          4           5
0    735983.0  60.374146  61.835916  62.336256  60.128882   9043800.0
1    735984.0  61.806486  61.453305  63.101479  61.325767   8992300.0
2    735985.0  57.283819  54.016907  57.774347  53.114334  43783400.0
3    735988.0  53.428272  53.977664  54.713455  53.114334  18498300.0

[253 rows x 6 columns]

In [81]: fields = ['date','open','close','high','low','volume']

In [82]: quotesdf = pd.DataFrame(quotes,columns=fields)  # set the column names

In [83]: quotesdf
Out[83]: 
         date       open      close       high        low      volume
0    735983.0  60.374146  61.835916  62.336256  60.128882   9043800.0
1    735984.0  61.806486  61.453305  63.101479  61.325767   8992300.0
2    735985.0  57.283819  54.016907  57.774347  53.114334  43783400.0
3    735988.0  53.428272  53.977664  54.713455  53.114334  18498300.0
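
If the DataFrame already exists, the column names can also be assigned afterwards instead of rebuilding the frame. A small sketch with two sample rows from the output above (sample_rows, frame and renamed are illustrative names):

```python
import pandas as pd

sample_rows = [
    (735983.0, 60.374146, 61.835916, 62.336256, 60.128882, 9043800.0),
    (735984.0, 61.806486, 61.453305, 63.101479, 61.325767, 8992300.0),
]
fields = ['date', 'open', 'close', 'high', 'low', 'volume']

frame = pd.DataFrame(sample_rows)  # columns default to 0..5
frame.columns = fields             # assign all six names at once

# rename() changes only the listed names and returns a new frame
renamed = frame.rename(columns={'volume': 'vol'})
print(frame.columns.tolist())
print(renamed.columns.tolist())
```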

  2.2 Adding index labels to the quotes data

In [84]: quotesdf
Out[84]: 
         date       open      close       high        low      volume
0    735983.0  60.374146  61.835916  62.336256  60.128882   9043800.0
1    735984.0  61.806486  61.453305  63.101479  61.325767   8992300.0
2    735985.0  57.283819  54.016907  57.774347  53.114334  43783400.0

[253 rows x 6 columns]

In [85]: quotesdf = pd.DataFrame(quotes, index=range(1,len(quotes)+1),columns=fields)  # change the index from 0, 1, 2, ... to 1, 2, 3, ...

In [86]: quotesdf
Out[86]: 
         date       open      close       high        low      volume
1    735983.0  60.374146  61.835916  62.336256  60.128882   9043800.0
2    735984.0  61.806486  61.453305  63.101479  61.325767   8992300.0
3    735985.0  57.283819  54.016907  57.774347  53.114334  43783400.0
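
The index can likewise be replaced on an existing frame. The sketch below (illustrative names, sample values from the output above) assigns 1-based labels and then, as an alternative, promotes a column to the index with set_index:

```python
import pandas as pd

fields = ['date', 'open', 'close']
frame = pd.DataFrame(
    [(735983.0, 60.374146, 61.835916),
     (735984.0, 61.806486, 61.453305),
     (735985.0, 57.283819, 54.016907)],
    columns=fields,
)

frame.index = range(1, len(frame) + 1)  # labels 1, 2, 3 instead of 0, 1, 2
by_date = frame.set_index('date')       # new frame: 'date' moves from the columns to the index
print(frame.index.tolist())
print(by_date.index.name, by_date.columns.tolist())
```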

  2.3 Date conversion: Gregorian ordinal => ordinary date notation

In [88]: from datetime import date

In [89]: firstday = date.fromordinal(735190)

In [93]: firstday
Out[93]: datetime.date(2013, 11, 18)

In [95]: firstday = firstday.strftime('%Y-%m-%d')

In [96]: firstday
Out[96]: '2013-11-18'
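
The conversion works in both directions. A minimal sketch using only the standard library follows; ordinal_to_text is a hypothetical helper name, and the int() call is there because the quotes rows store the ordinal as a float.

```python
from datetime import date


def ordinal_to_text(ordinal, fmt='%Y-%m-%d'):
    """Convert a proleptic Gregorian ordinal (int or float) to a formatted string."""
    return date.fromordinal(int(ordinal)).strftime(fmt)


text = ordinal_to_text(735190.0)
back = date(2013, 11, 18).toordinal()
print(text)  # 2013-11-18
print(back)  # 735190
```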

  2.4 Creating a time series:

In [120]: import pandas as pd

In [121]: dates = pd.date_range('20170101', periods=7)  # build a date sequence from a start date and a number of periods

In [122]: dates
Out[122]: 
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04','2017-01-05', '2017-01-06', '2017-01-07'],dtype='datetime64[ns]', freq='D')

In [123]: import numpy as np

In [124]: dates = pd.DataFrame(np.random.randn(7,3), index=dates, columns=list('ABC'))  # the date sequence is the index, A/B/C are the column names, the body is 7 rows x 3 columns of random numbers

In [125]: dates
Out[125]: 
                   A         B         C
2017-01-01  0.705927  0.311453  1.455362
2017-01-02 -0.331531 -0.358449  0.175375
2017-01-03 -0.284583 -1.760700 -0.582880
2017-01-04 -0.759392 -2.080658 -2.015328
2017-01-05 -0.517370  0.906072 -0.106568
2017-01-06 -0.252802 -2.135604 -0.692153
2017-01-07 -0.275184  0.142973 -1.262126
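
date_range also accepts a freq argument. Stock quotes exist only for trading days, so a business-day frequency ('B') is often a closer fit than the default calendar-day frequency. Note that 'B' skips weekends only, not exchange holidays. A sketch under that assumption, with a fixed seed so the random values are repeatable (bdays and frame are illustrative names):

```python
import numpy as np
import pandas as pd

# 2017-01-01 is a Sunday, so the first business day is 2017-01-02
bdays = pd.date_range('20170101', periods=5, freq='B')

np.random.seed(0)  # fixed seed: the same random numbers on every run
frame = pd.DataFrame(np.random.randn(5, 3), index=bdays, columns=list('ABC'))

print(bdays[0].date(), bdays[-1].date())
print(frame.shape)
```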

  2.5 Exercises

In [101]: datetime.now()  # show the current date and time
Out[101]: datetime.datetime(2017, 1, 20, 16, 11, 50, 43258)
=========================================
In [108]: datetime.now().month  # show the current month
Out[108]: 1

=========================================
In [126]: import pandas as pd

In [127]: dates = pd.date_range('2015-02-01',periods=10)

In [128]: dates
Out[128]: 
DatetimeIndex(['2015-02-01', '2015-02-02', '2015-02-03', '2015-02-04','2015-02-05', '2015-02-06', '2015-02-07', '2015-02-08','2015-02-09', '2015-02-10'],dtype='datetime64[ns]', freq='D')

In [133]: res = pd.DataFrame(range(1,11),index=dates,columns=['value'])

In [134]: res
Out[134]: 
            value
2015-02-01      1
2015-02-02      2
2015-02-03      3
2015-02-04      4
2015-02-05      5
2015-02-06      6
2015-02-07      7
2015-02-08      8
2015-02-09      9
2015-02-10     10

3. Data display

  3.1 Ways to display the data:

In [180]: quotesdf2.index  # show the index
Out[180]: 
Index([u'2016-01-20', u'2016-01-21', u'2016-01-22', u'2016-01-25',
       ...
       u'2017-01-11', u'2017-01-12', u'2017-01-13', u'2017-01-17',
       u'2017-01-18', u'2017-01-19'],
      dtype='object', length=253)

In [181]: quotesdf2.columns  # show the column names
Out[181]: Index([u'open', u'close', u'high', u'low', u'volume'], dtype='object')

In [182]: quotesdf2.values  # show the underlying values
Out[182]: 
array([[  6.03741455e+01,   6.18359160e+01,   6.23362562e+01,
          6.01288817e+01,   9.04380000e+06],
       ..., 
       [  7.76100010e+01,   7.66900020e+01,   7.77799990e+01,
          7.66100010e+01,   7.79110000e+06]])

In [183]: quotesdf2.describe  # without parentheses this only shows the bound method; call quotesdf2.describe() to get summary statistics
Out[183]: 
<bound method DataFrame.describe of                  open      close       high        low      volume
2016-01-20  60.374146  61.835916  62.336256  60.128882   9043800.0
2016-01-21  61.806486  61.453305  63.101479  61.325767   8992300.0
2016-01-22  57.283819  54.016907  57.774347  53.114334  43783400.0
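
To get the summary statistics themselves, describe has to be called. A minimal sketch on three sample rows taken from this post (sample and stats are illustrative names):

```python
import pandas as pd

sample = pd.DataFrame(
    {'open': [60.374146, 61.806486, 57.283819],
     'close': [61.835916, 61.453305, 54.016907]},
    index=['2016-01-20', '2016-01-21', '2016-01-22'],
)

# describe() returns count, mean, std, min, quartiles and max for each numeric column
stats = sample.describe()
print(stats)
```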

  3.2 Index format: the u prefix marks a unicode string (as printed by Python 2)
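
The index shown in 3.1 has dtype object, which means the dates are stored as plain strings. If date arithmetic is needed, one option is to convert the labels with pd.to_datetime. This is a sketch with sample values, not something the session above does:

```python
import pandas as pd

frame = pd.DataFrame(
    {'close': [61.835916, 61.453305, 54.016907]},
    index=['2016-01-20', '2016-01-21', '2016-01-22'],  # string labels
)

frame.index = pd.to_datetime(frame.index)  # string labels -> DatetimeIndex
span = frame.index[-1] - frame.index[0]    # subtraction now yields a Timedelta
print(type(frame.index).__name__)
print(span.days)
```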

  3.3 Displaying rows:

In [193]: quotesdf.head(2)  # first two rows, using the dedicated method
Out[193]: 
       date       open      close       high        low     volume
1  735983.0  60.374146  61.835916  62.336256  60.128882  9043800.0
2  735984.0  61.806486  61.453305  63.101479  61.325767  8992300.0

In [194]: quotesdf.tail(2)  # last two rows, using the dedicated method
Out[194]: 
         date       open      close       high        low     volume
252  736347.0  77.110001  77.489998  77.610001  76.510002  5988400.0
253  736348.0  77.610001  76.690002  77.779999  76.610001  7791100.0

In [195]: quotesdf[:2]  # first two rows, by slicing
Out[195]: 
       date       open      close       high        low     volume
1  735983.0  60.374146  61.835916  62.336256  60.128882  9043800.0
2  735984.0  61.806486  61.453305  63.101479  61.325767  8992300.0

In [197]: quotesdf[251:]  # last two rows, by slicing (positions 251 and 252 of the 253 rows)
Out[197]: 
         date       open      close       high        low     volume
252  736347.0  77.110001  77.489998  77.610001  76.510002  5988400.0
253  736348.0  77.610001  76.690002  77.779999  76.610001  7791100.0
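
Plain slices such as quotesdf[:2] select rows by position, not by index label, which is why [251:] returns the rows labelled 252 and 253. A negative slice avoids hard-coding the row count. A small sketch with made-up values, using iloc to make the positional intent explicit:

```python
import pandas as pd

# Made-up closing prices; the labels 1..5 mimic the 1-based index from 2.2
frame = pd.DataFrame({'close': [61.8, 61.5, 54.0, 54.0, 55.1]},
                     index=range(1, 6))

first_two = frame.iloc[:2]   # positions 0 and 1 -> labels 1 and 2
last_two = frame.iloc[-2:]   # last two positions -> labels 4 and 5

print(first_two.equals(frame.head(2)))
print(last_two.equals(frame.tail(2)))
```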

4. Data selection

5. Simple statistics and processing

6. Grouping

7. Merge

posted on 2017-01-20 17:38 by 你的踏板车要滑向哪里