My first complete Python program: which starting letter is most common among all Python 3 packages?

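After a few days of reading Dive Into Python 3, I did not expect my first program to turn out to be Python 2.7.

I had not programmed in a very long time, and I have only known Python for a few days, so the program is rather ugly. I expect that once I improve, a few lines of code will do the job.

Here is the code: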

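# Count how many package names start with each letter a-z.
# Runs under both Python 2.7 and Python 3.

letters = [chr(i) for i in range(97, 123)]  # 'a' .. 'z'

# Each line of 1.txt is "name<TAB>description"; keep only the name.
namelist = []
with open(r'1.txt') as fp:
    for line in fp:
        name = line.split('\t')[0].strip()
        if name:  # skip blank lines
            namelist.append(name)

namelist = sorted(namelist, key=str.lower)

print("total:" + str(len(namelist)))

for letter in letters:
    # Compare the first character against both the lower and upper case letter.
    sp = [x for x in namelist if x[0] in (letter, letter.upper())]
    print(letter + ":" + str(len(sp)))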
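
For comparison, here is a shorter sketch of the same tally in Python 3, using `collections.Counter`. It is only an illustration: the function name `count_by_first_letter` and the inline `sample_lines` are made up for this example, and unlike the program above it counts whatever the first character is (digits included), not only a-z.

```python
from collections import Counter

# Hypothetical inline sample in the same "name<TAB>description" format as 1.txt.
sample_lines = [
    "mycloud\tWork distribution for small clusters.",
    "pyNFFT\tA pythonic wrapper around NFFT",
    "uptime\tCross-platform uptime library",
    "cchardet\tUniversal encoding detector.",
    "mytools\tmes outils pour python",
]


def count_by_first_letter(lines):
    """Count package names by their lowercased first character."""
    names = [line.split("\t")[0].strip() for line in lines]
    return Counter(name[0].lower() for name in names if name)


counts = count_by_first_letter(sample_lines)

# Print letters from most to least common.
for letter, n in counts.most_common():
    print(letter + ":" + str(n))
```

To run it on the real data, the list could be replaced by the lines read from 1.txt.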


The contents of 1.txt look like this:

mycloud	Work distribution for small clusters.
pyNFFT	A pythonic wrapper around NFFT
uptime	Cross-platform uptime library
cchardet	Universal encoding detector. This library is faster than chardet.
pySmartDL	A Smart Download Manager for Python
mytools	mes outils pour python
ddlib	A set of common functions by DDarko.org
webapp2_static	Simple handler to Serve static files on non Google App Engine (GAE) webapp2 environments
lineup	Distributed Pipeline Framework for Python
pyrasite	Inject code into a running Python process

The wide gap in the middle is actually a tab character.

Running it on the whole file gives:

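So what was the point of writing this program? Here is the story:

The network at home is the miserable education network, so when I need a package I cannot download it from python.org;
quite by accident, I noticed that package URLs on the Python site look like this: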

https://pypi.python.org/packages/source/l/lxml/lxml-3.3.3.tar.gz
https://pypi.python.org/packages/source/w/wgetdb/wgetdb-0.1.4.tar.gz

See the pattern? Packages are stored in directories grouped by the first letter of the name, and the files are all in tar.gz format.
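
As a small sketch, the pattern can be written as a function that builds the URL from a package name and a version. The layout is inferred only from the two URLs above, so it is an assumption rather than a documented rule; the helper name `source_url` is made up for this example, and I have not checked how names starting with an uppercase letter or a digit are filed.

```python
# Assumed layout, inferred from the two example URLs only:
#   <base>/<first letter of name>/<name>/<name>-<version>.tar.gz
BASE = "https://pypi.python.org/packages/source"


def source_url(name, version):
    """Build the guessed source-archive URL for a package name and version."""
    return "{0}/{1}/{2}/{2}-{3}.tar.gz".format(BASE, name[0], name, version)


print(source_url("lxml", "3.3.3"))
print(source_url("wgetdb", "0.1.4"))
```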
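Then, again by chance, I saw that the Python 3 packages page provides a list of every package as "name + purpose", so I wondered: could I write a program to download all the packages in bulk?
The basic idea is to get the package names, group them by letter, then iterate over them, parsing each package page and downloading.
The program above handles the package names, but after running it I did not go any further, because there are 4000+ packages! Besides, connections from the public network to the official Python site also seem slow, so I doubt a bulk download would work.
Two things remain to be done for this program:
- work out the version naming rules for packages, so that among the several gz files on each package's page only the highest version is downloaded;
- find a way to download files in bulk; I have found an article on this and attached it at the end.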
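
For the first item, a minimal sketch of picking the highest version, assuming file names of the form `name-X.Y.Z.tar.gz` with purely numeric, dot-separated versions. Real version strings can be more complicated (suffixes such as `b1` or `rc2`), and this sketch simply skips anything it cannot parse. The helper names and the two extra lxml file names are made up for illustration.

```python
import re


def version_key(filename):
    """Turn 'lxml-3.3.3.tar.gz' into (3, 3, 3); return None if it does not fit."""
    match = re.search(r"-(\d+(?:\.\d+)*)\.tar\.gz$", filename)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def newest(filenames):
    """Return the file name with the highest numeric version, or None."""
    candidates = [(version_key(f), f) for f in filenames]
    candidates = [c for c in candidates if c[0] is not None]
    if not candidates:
        return None
    return max(candidates)[1]


# Hypothetical listing for one package page.
files = ["lxml-3.3.3.tar.gz", "lxml-3.10.0.tar.gz", "lxml-3.2.tar.gz"]
print(newest(files))
```

Comparing tuples of integers matters here: as plain strings, "3.3.3" would sort above "3.10.0".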

But I am not actually going to keep writing it. The value of this program is that it made clear how I should study next:

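Read DIP3 (Dive Into Python 3) three times, running everything in the shell to see the results rather than just reading through. Everything this program needed is covered in the first 3 chapters, which I had already read, yet I had to look things up before I could use it. DIP3 really is a good book: the logic is clear, it is not tiring to read, and the explanations are thorough. For example, the start of chapter 4 covers encodings, and a few short paragraphs make it perfectly clear.
Choosing to learn Python 3 was right: it seems far more sensible than Python 2, and by now many packages are available for it, so this is the general trend. Also, porting approaches written for Python 2 programs to Python 3 deepens understanding.
Once I have worked through DIP3 thoroughly, study web page parsing and database operations in parallel. After comparing sqlite3, MySQL and Oracle, I ended up choosing MariaDB, - -!.
- sqlite is very convenient, but my data volume may get fairly large in the future, which would make it awkward;
- MySQL is actually a reasonable fit and has plenty of material, but it has stagnated since Oracle acquired it and is clearly not the trend, so there would be a sizable migration cost later;
- Oracle is a very good database, but solving problems with it gets complicated and often depends on experts;
- MariaDB is largely compatible with MySQL, especially for connecting from Python. More importantly, it represents the trend in open-source databases and will become more and more popular: Wikipedia and Google have both switched, there is plenty of material abroad, and the community in China will gradually pick it up too, at which point material will be plentiful.
Then learn some data analysis methods. There is reportedly a book called <Python for Data Analysis> that is worth reading; JD.com sells a Chinese edition, an 80 MB data set for it is available online, and someone on Douban has written fairly complete reading notes.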

Appendix: batch-downloading files from a website

#! encoding=utf-8
# Python 2 script (uses urllib2 and print statements).

import urllib2
import re
import os

def Download(url, output):
    print "downloading..." + url
    response = urllib2.urlopen(url)
    resourceFile = open(output, "wb")
    resourceFile.write(response.read())
    resourceFile.close()
    print "downloaded"

def Action(url, ext="pdf", output="."):

    #1.domain
    index = url.rfind("/")
    domain = url[0:index+1]
    print domain
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)

    #2.content
    content = response.read()
#    print content

    #3.resource
    mode = '\"([^\"]+' + ext + ')\"'
    pattern = re.compile(mode)
    strMatch = pattern.findall(content)
    size = len(strMatch)
    print "file num: " + str(size)
    for i in range(0, size, 1):
#        print strMatch[i]
        one = strMatch[i]
        partIndex = one.rfind('/')
        if not one.startswith('http://'):
            if -1 != partIndex:
                directDir = one[0:partIndex+1]
            else:
                directDir = ""
#            print directDir
            try:
                os.makedirs(output + "/" + directDir)
            except Exception, e:
                pass
            fileUrl = domain + one
            fileOutput = output + "/" + one
            print fileUrl
            print fileOutput
            Download(fileUrl, fileOutput)
        else:
            print one
            print "........."
            print one[partIndex:]
            fileOutput = output + "/" + one[partIndex:]
            print fileOutput
            Download(one, fileOutput)
    #5.download

if __name__ == '__main__':
    print "download"
    url = "http://compgeom.cs.uiuc.edu/~jeffe/teaching/algorithms/"
    Action("http://tech.qq.com/", "jpg")


It comes from here:

http://blog.csdn.net/infoworld/article/details/9337619
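
Since the script above is Python 2, here is a sketch of how its link-extraction step might look in Python 3. It is not the original author's code. It covers only the step that finds file links in a page, leaving out the download itself. The function name `find_file_urls`, the sample HTML and the example.org/example.com addresses are made up for illustration. Two things differ from the original: `urllib.parse.urljoin` replaces the manual `domain + one` concatenation, and the pattern requires a literal dot before the extension.

```python
import re
from urllib.parse import urljoin


def find_file_urls(page_url, html, ext="pdf"):
    """Return absolute URLs of double-quoted links in html that end with .ext."""
    pattern = re.compile(r'"([^"]+\.' + re.escape(ext) + r')"')
    # urljoin leaves absolute links alone and resolves relative ones
    # against the URL of the page they were found on.
    return [urljoin(page_url, link) for link in pattern.findall(html)]


# Hypothetical page content.
html = ('<a href="notes/01.pdf">1</a> '
        '<a href="http://example.com/a.pdf">2</a> '
        '<a href="x.html">3</a>')

for file_url in find_file_urls("http://example.org/course/", html):
    print(file_url)
```

Each resulting URL could then be fetched with `urllib.request.urlopen`, the Python 3 counterpart of `urllib2.urlopen`.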

posted @ 2014-03-11 17:19 tmtfinder
