How to Find All the God-Tier Replies on Zhihu

Every now and then I run into a god-tier reply on Zhihu and can only marvel. Here are a few to start with:

When you were a kid, what unconscionable methods did your parents use to make you study hard? - Zhihu
They gave me this face.

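Is it acceptable for a job applicant to agree to an interview and then skip it without any explanation? - Zhihu
You companies also keep saying "go home and wait for our call" and then never follow up.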

What does it mean to have seen the world? - Zhihu
Being able to enjoy the best and to endure the worst.

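God-tier replies are always concise, sharp in their sarcasm, right on point, and sometimes laugh-out-loud funny. So how do we find all of them on Zhihu? Some people suggest looking for answers with many votes and few words. As a computer science student I am not going to page through them one by one, so let's write a crawler that fetches Zhihu questions and keeps the top-voted answer for each question. First we need a list of questions, which can be found at http://www.zhihu.com/log/questions. The code for crawling the question list is as follows: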

import cStringIO
import gzip
import json
import urllib
import urllib2

from bs4 import BeautifulSoup


def getQuestions(start,offset='20'):
    # Cookies are sent through the Cookie header below, so no cookie
    # processor is installed. Copy Cookie and _xsrf from a logged-in browser session.
    header = {"Accept":"*/*",
    # Only gzip is requested because only gzip is decoded below.
    "Accept-Encoding":"gzip",
    "Accept-Language":"zh-CN,zh;q=0.8,en;q=0.6",
    "Connection":"keep-alive",
    # Content-Length is left out: urllib2 computes it from the POST body.
    "Content-Type":"application/x-www-form-urlencoded; charset=utf-8",
    "Cookie":"*************",
    "Host":"www.zhihu.com",
    "Origin":"http://www.zhihu.com",
    "Referer":"http://www.zhihu.com/log/questions",
    "User-Agent":"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.137 Safari/537.36",
    "X-Requested-With":"XMLHttpRequest"
    }

    parms = {'start':start,
            'offset':offset,
            '_xsrf':'*************'}
    url = 'http://www.zhihu.com/log/questions'
    req = urllib2.Request(url,headers=header,data=urllib.urlencode(parms))
    content = urllib2.urlopen( req ).read()
    # The response body is gzip-compressed JSON; msg[1] holds the HTML fragment.
    html = gzip.GzipFile(fileobj = cStringIO.StringIO(content)).read()
    html = json.loads(html)['msg'][1]  # parse as JSON instead of calling eval on remote data
    pageSoup = BeautifulSoup(html)
    questions = []
    items = pageSoup.find_all('div',{'class':'zm-item'})
    for item in items:
        # The question id is the last path segment of the link.
        qid = item.find_all('a',{'target':'_blank'})[0].get('href').rsplit('/',1)[1]
        questions.append(qid)
    # The id of the last log item on this page, e.g. for requesting the next page.
    lastId = items[-1].get('id').split('-')[1]
    return questions,lastId
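
The function above returns one page of question ids plus the id of the last item. A minimal sketch of how those pages might be chained together follows. It assumes that the returned last id can be fed back in as the next `start`, which the post does not confirm; `crawl_question_ids` and `fake_fetch` are hypothetical names, and the sketch is written for Python 3 with a stand-in fetcher so it runs without network access.

```python
def crawl_question_ids(fetch_page, first_start, max_pages):
    """Collect question ids page by page, dropping duplicates.

    fetch_page(start) must return (question_ids, last_id), the same shape
    that getQuestions returns. Feeding last_id back in as the next start
    is an assumption about the endpoint.
    """
    seen = set()
    ordered = []
    start = first_start
    for _ in range(max_pages):
        question_ids, last_id = fetch_page(start)
        if not question_ids:
            break  # nothing left to read
        for qid in question_ids:
            if qid not in seen:
                seen.add(qid)
                ordered.append(qid)
        if last_id == start:
            break  # the cursor did not advance
        start = last_id
    return ordered


def fake_fetch(start):
    """Stand-in for getQuestions: three canned pages keyed by cursor."""
    pages = {
        "": (["101", "102"], "a"),
        "a": (["102", "103"], "b"),
        "b": ([], "b"),
    }
    return pages[start]


print(crawl_question_ids(fake_fetch, "", max_pages=10))
```

Passing the fetcher in as an argument keeps the loop testable; in a real run it would be replaced by a call to the crawler function, ideally with a pause between requests.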

With the question list in hand, we fetch the top-voted answer for each question. The code is as follows:

def getArticle(url):
    # getPage and formatStr are helper functions that are not shown in this post.
    page = getPage(url)
    pageSoup = BeautifulSoup(page)
    title = str(pageSoup.title).replace('<title>','').replace('</title>','').strip()
    items = pageSoup.find_all('div',{'class':'zm-item-answer'})
    if not items:
        return None  # the question has no answers
    # The first answer on the page is taken as the top-voted one.
    top = items[0]
    answer = top.find('div',{'class':'fixed-summary zm-editable-content clearfix'}).get_text().strip()
    vote = top.find('div',{'class':'zm-item-vote-info '}).get('data-votecount').strip()
    answer = formatStr(answer)
    ans_len = len(answer)  # full length, measured before truncation
    if ans_len > 100:
        answer = answer[0:100]  # keep only the first 100 characters
    title = formatStr(title)
    out = [title, answer, str(ans_len),vote,url]
    return out
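
The `formatStr` helper is not shown in the post. One plausible guess is that it flattens whitespace so that each title and answer fits on a single line of output. The sketch below is only that guess, under the hypothetical name `format_str_sketch`, and is not the author's implementation.

```python
import re


def format_str_sketch(text):
    """Collapse every run of whitespace into one space and trim the ends.

    With str patterns in Python 3, \\s also matches Unicode whitespace
    such as the full-width space U+3000.
    """
    return re.sub(r"\s+", " ", text).strip()


print(format_str_sketch("  line one\n\tline two\u3000end "))
```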

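We now have, for each question, its title, its top-voted answer, and its link. Next we need to turn the rule "short reply, many votes" into a number. Clearly the likelihood that a reply is a god-tier reply rises with its vote count and falls as its text gets longer. The implementation needs some care, though: some replies are just a perfect picture with no words at all, so the text length is 0 and the score needs smoothing. Also, the shorter a reply is, the pithier it should be. So I defined the following formula: $$Score=\frac{vote}{5+\frac{answer\_len^2}{10}}$$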

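Overall, the formula treats the likelihood of being a god-tier reply as proportional to the vote count and, for longer replies, roughly inversely proportional to the square of the reply length. The added 5 smooths the score for picture-only replies: with a length of 0 the denominator is 5 rather than 0.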
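
To make the formula concrete, here is a small sketch that evaluates it for one fixed vote count and three reply lengths. The function name `score` and the sample numbers are mine, chosen only for illustration.

```python
def score(vote, answer_len):
    """Score = vote / (5 + answer_len^2 / 10), as defined above."""
    return vote / (5 + answer_len ** 2 / 10)


# The same 1000 votes at three different reply lengths.
for length in (0, 10, 100):
    print(length, round(score(1000, length), 2))
# 0 200.0
# 10 66.67
# 100 1.0
```

A picture-only reply (length 0) keeps a finite score of vote / 5, while a 100-character reply with the same votes scores about 200 times lower.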

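I let the crawler run overnight and collected 20,000 questions, then computed the Score for each and took the top 1000. To enjoy the god-tier replies, head over to GitHub.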
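
The ranking step could look roughly like the sketch below. It assumes each record has the shape `[title, answer, ans_len, vote, url]` with `ans_len` and `vote` stored as numeric strings; `top_replies` and the sample records are hypothetical.

```python
import heapq


def score(vote, answer_len):
    """Score = vote / (5 + answer_len^2 / 10)."""
    return vote / (5 + answer_len ** 2 / 10)


def top_replies(records, k=1000):
    """Return the k records with the highest Score, best first.

    Each record is [title, answer, ans_len, vote, url], where ans_len
    and vote are numeric strings (an assumption about the stored data).
    """
    return heapq.nlargest(
        k, records, key=lambda r: score(int(r[3]), int(r[2]))
    )


# Made-up records, for illustration only.
sample = [
    ["Q1", "short", "5", "300", "1"],
    ["Q2", "x" * 80, "80", "5000", "2"],
    ["Q3", "", "0", "100", "3"],
]
for title, _, ans_len, vote, _ in top_replies(sample, k=2):
    print(title, ans_len, vote)
```

`heapq.nlargest` avoids sorting the whole list when only the top k entries are needed, although for 20,000 records a plain `sorted` call would work just as well.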

Too few questions were crawled, so many more pithy god-tier replies are still waiting to be found.

Please credit the source when reposting: http://www.cnblogs.com/fengfenggirl/

posted @ 2014-09-01 16:27 CodeMeals
