Python高级数据处理与可视化（五）---- Python的理工类应用 & Python人文社科类应用

6. Python的理工类应用

　　6.1 简单的三角函数计算

np.sin(x)

　　6.2 一组数据的傅立叶变换

# -*- coding: utf-8 -*-
"""
Created on Tue Jan 24 13:07:01 2017

@author: Wayne
"""

import scipy as sp
import pylab as pl

listA = sp.ones(500)  # 生成一个含500个元素的序列，元素的值为1
listA[100:300] = -1
f = sp.fft(listA)
pl.plot(f)
pl.show()

f = sp.fft(listA)

　　6.3 Biopython：一个使用Python开发计算分子生物学工具的国际社团

　　　　功能：将生物信息学文件分析成Python可利用的数据结构；处理常用的在线生物信息学数据库代码；提供常用生物信息程序的界面

　　　　安装：打开Anaconda Prompt，输入conda install Biopython（conda list 会列出所有已安装的python包和框架）

7. Python的人文社科类应用

　　7.1 NLTK语料库

　　7.2 古腾堡项目

　　　　7.2.1 gutenberg包下载

　　　　7.2.2 计算NLTK中目前收录的gutenberg项目的书

In[30]: from nltk.corpus import gutenberg

In[31]: gutenberg.fileids()
Out[31]: 
[u'austen-emma.txt',
 u'austen-persuasion.txt',
 u'austen-sense.txt',
 u'bible-kjv.txt',
 u'blake-poems.txt',
 u'bryant-stories.txt',
 u'burgess-busterbrown.txt',
 u'carroll-alice.txt',
 u'chesterton-ball.txt',
 u'chesterton-brown.txt',
 u'chesterton-thursday.txt',
 u'edgeworth-parents.txt',
 u'melville-moby_dick.txt',
 u'milton-paradise.txt',
 u'shakespeare-caesar.txt',
 u'shakespeare-hamlet.txt',
 u'shakespeare-macbeth.txt',
 u'whitman-leaves.txt']

gutenberg.fileids()

　　　　7.2.3 一些简单的计算（单词数量，某单词数量，不重复单词数量，多于n个字母的单词）

In [30]: from nltk.corpus import gutenberg

In [31]: gutenberg.fileids()  # 显示gutenberg项目中收录的书籍
Out[31]: 
[u'austen-emma.txt',
 u'austen-persuasion.txt',
 u'austen-sense.txt',
 u'bible-kjv.txt',
 u'blake-poems.txt',
 u'bryant-stories.txt',
 u'burgess-busterbrown.txt',
 u'carroll-alice.txt',
 u'chesterton-ball.txt',
 u'chesterton-brown.txt',
 u'chesterton-thursday.txt',
 u'edgeworth-parents.txt',
 u'melville-moby_dick.txt',
 u'milton-paradise.txt',
 u'shakespeare-caesar.txt',
 u'shakespeare-hamlet.txt',
 u'shakespeare-macbeth.txt',
 u'whitman-leaves.txt']

In [32]: allwords = gutenberg.words('shakespeare-hamlet.txt')  # 所有的单词

In [33]: len(allwords)  # 一共的单词数量
Out[33]: 37360

In [34]: len(set(allwords))  # 不重复的单词数量
Out[34]: 5447

In [35]: allwords.count('Hamlet')  # 单词'Hamlet'出现的次数
Out[35]: 99

In [36]: A = set(allwords)  # 不重复的单词集合

In [37]: longwords = [w for w in A if len(w)>12]  # A集合中，长度大于12的单词

In [38]: sorted(longwords)  # 按大小写字母排序
Out[38]: 
[u'Circumstances',
 u'Guildensterne',
 u'Incontinencie',
 u'Recognizances',
 u'Vnderstanding',
 u'determination',
 u'encompassement',
 u'entertainment',
 u'imperfections',
 u'indifferently',
 u'instrumentall',
 u'reconcilement',
 u'stubbornnesse',
 u'transformation',
 u'vnderstanding']

longwords = [w for w in A if len(w)>12]

　　　　7.2.4 NLTK中自己的一些函数

# -*- coding: utf-8 -*-
"""
Created on Tue Jan 24 13:07:01 2017

@author: Wayne
"""

from nltk.corpus import gutenberg
from nltk.probability import *

allwords = gutenberg.words('shakespeare-hamlet.txt')
fd2 = FreqDist([sx.lower() for sx in allwords if sx.isalpha()])  # allwords中的是字母的提取出来并小写,做成频率分布
print fd2.B()  # 所有的单词数量
print fd2.N()  # 不同的单词数量
fd2.tabulate(20)  # 提取前二十个制作成表格
fd2.plot(20)
fd2.plot(20, cumulative=True)  # cumulative表示累积

fd2 = FreqDist([sx.lower() for sx in allwords if sx.isalpha()])

　　7.3 就职演说语料库

　　　　7.3.1 下载 inaugural 包

　　　　7.3.2 计算 'freedom' 出现的频率

In [61]: from nltk.corpus import inaugural

In [62]: fd3 = FreqDist([s for s in inaugural.words()])

In [63]: fd3.freq('freedom')
Out[63]: 0.0011939479191683535

fd3 = FreqDist([s for s in inaugural.words()])

　　　　7.3.3 将1950年后的机制演说根据单词长度进行频率统计：

# -*- coding: utf-8 -*-
"""
Created on Tue Jan 24 13:07:01 2017

@author: Wayne
"""

from nltk.corpus import inaugural
from nltk.probability import ConditionalFreqDist
from matplotlib import pyplot as plt

cfd = ConditionalFreqDist(                       # 条件频率统计
            (fileid, len(w))                     # 根据单词长度进行统计
            for fileid in inaugural.fileids()
            for w in inaugural.words(fileid)
            if fileid>'1950')                    # 1950年以后
print cfd.items()[:40]
cfd.plot()

ConditionalFreqDist

posted on 2017-01-25 00:49 你的踏板车要滑向哪里阅读(306) 评论(0) 收藏举报

刷新页面返回顶部

你的踏板车要滑向哪里

Python高级数据处理与可视化（五）---- Python的理工类应用 & Python人文社科类应用

导航

公告