task03 - Paper Code Statistics
Paper Code Statistics
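3.1 Task description
- Task topic: paper code statistics, i.e. statistics on the code links that appear in the papers.
- Task content: use regular expressions to extract code links, page counts, and figure counts.
- Task outcome: learn regular expressions.
3.2 Data processing steps
In the original arXiv dataset, authors often give a concrete code link in a paper's comments or abstract field, so we need to find the code links in these fields.
- Determine where the data appears;
- Match it with regular expressions;
- Compute the related statistics.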
3.3 Regular expressions
Python 3 regular expressions
Reference: https://www.runoob.com/python3/python3-reg-expressions.html
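3.3.1 Related functions
re.match()
Tries to match a pattern at the start of the string. If the pattern does not match at the start, it returns None (that is, if the beginning of the string does not fit the pattern, the match fails). On success it returns a match object.
re.match(pattern, string, flags=0)
- pattern: the regular expression to match
- string: the string to search
- flags: flags that control how the regular expression is matched
Call span() to get the position of the match:
re.match('www', 'www.runoob.com').span()  # (0, 3)
Use group() or groups() to get the matched text:
group(num)  # text of group num; several group numbers may be passed, which returns a tuple of those groups
groups()    # returns a tuple containing all groups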
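The calls described above can be tried on a small string. This is a minimal sketch; the sample text and the two-group pattern are illustrative assumptions, not taken from the dataset.

```python
import re

# Two capture groups: the first word and the second word.
m = re.match(r'(\w+) (\w+)', 'hello world, regex')

whole = m.group(0)        # the entire match: 'hello world'
first = m.group(1)        # text captured by group 1: 'hello'
both = m.group(1, 2)      # several group numbers return a tuple
all_groups = m.groups()   # tuple with every group
position = m.span()       # (start, end) of the entire match

# match() only succeeds at the start of the string.
no_match = re.match(r'world', 'hello world')

print(whole, first, both, all_groups, position, no_match)
```

Because a failed match returns None, real code should check the result before calling group() or span().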
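re.search()
Scans the whole string and returns the first successful match. On success it returns a match object; otherwise it returns None.
re.search(pattern, string, flags=0)
The parameters are the same as for re.match().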
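A short sketch of the difference between re.match() and re.search(), reusing the 'www.runoob.com' string from the re.match() example:

```python
import re

text = 'www.runoob.com'

at_start = re.match('com', text)    # None: 'com' is not at position 0
anywhere = re.search('com', text)   # first occurrence anywhere in the string

print(at_start)          # None
print(anywhere.span())   # (11, 14)
```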
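re.sub()
Replaces the matches in a string.
re.sub(pattern, repl, string, count=0, flags=0)
- pattern: the pattern string of the regular expression
- repl: the replacement string; it may also be a function
- string: the original string to search and replace in
- count: the maximum number of replacements; the default 0 replaces all matches
- flags: the matching mode used at compile time, as a number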
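The following sketch exercises the parameters listed above: a string replacement, a function replacement, and count. The sample strings and the helper name double are assumptions chosen for illustration.

```python
import re

phone = '2004-959-559 # a sample phone number'

# String replacement: drop the trailing comment.
no_comment = re.sub(r'\s*#.*$', '', phone)

# Remove every character that is not a digit.
digits = re.sub(r'\D', '', no_comment)

# Function replacement: repl receives the match object and returns a string.
def double(match):
    return str(int(match.group()) * 2)

doubled = re.sub(r'\d+', double, 'A23G4HFD567')

# count=1 limits the substitution to the first match.
first_only = re.sub(r'-', '', no_comment, count=1)

print(no_comment, digits, doubled, first_only)
```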
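re.compile()
Compiles a regular expression into a regular expression object (a compiled pattern), for use with its match() and search() methods.
re.compile(pattern[, flags])
- pattern: a regular expression in string form
- flags: optional, the matching mode
- re.I: ignore case
- re.L: make the special classes \w, \W, \b, \B, \s, \S depend on the current locale
- re.M: multi-line mode
- re.S: make '.' match any character including newline (by default '.' does not match newline)
- re.U: make the special classes \w, \W, \b, \B, \d, \D, \s, \S depend on the Unicode character property database
- re.X: ignore whitespace and comments after '#' in the pattern, for readability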
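pattern = re.compile(r'\d+')
m = pattern.match('one12twothree34four', 3, 10)  # start matching at index 3 ('1'); starting at index 2 ('e') would return None
m.group(0)  # '12'; equivalent to m.group(), likewise below
m.start(0)  # 3
m.end(0)    # 5
m.span(0)   # (3, 5)
pattern = re.compile(r'([a-z]+) ([a-z]+)', re.I)  # two groups; re.I ignores case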
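The two-group, case-insensitive pattern compiled on the last line above is not applied to any text. A minimal sketch of how it might be used, with an assumed sample string:

```python
import re

pattern = re.compile(r'([a-z]+) ([a-z]+)', re.I)
m = pattern.match('Hello World Wide Web')

whole = m.group(0)      # 'Hello World': only the first two words match
first_span = m.span(1)  # position of group 1
second_span = m.span(2) # position of group 2
words = m.groups()      # ('Hello', 'World')

print(whole, first_span, second_span, words)
```

Without re.I the pattern would fail here, because [a-z] does not cover the capital letters.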
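re.findall()
Finds all substrings matched by the regular expression and returns them as a list; if nothing matches, it returns an empty list.
re.findall(pattern, string, flags=0)
# or
pattern.findall(string[, pos[, endpos]])
- pattern: the pattern to match
- string: the string to search
- pos: optional, the start position in the string, default 0
- endpos: optional, the end position in the string, default the length of the string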
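A small sketch of both call forms follows. The sample strings are assumptions. Note that when the pattern contains several groups, findall() returns tuples of the groups rather than the whole match.

```python
import re

text = 'runoob 123 google 456'

# No groups: a list of the matched substrings.
numbers = re.findall(r'\d+', text)

# pos/endpos are only available on a compiled pattern.
pattern = re.compile(r'\d+')
window = pattern.findall('run88oob123google456', 0, 10)  # searches 'run88oob12'

# Several groups: each result is a tuple of the groups.
pairs = re.findall(r'(\w+)=(\d+)', 'set width=20 and height=10')

print(numbers, window, pairs)
```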
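re.finditer()
Finds all substrings matched by the regular expression and returns them as an iterator of match objects.
re.finditer(pattern, string, flags=0)
- pattern: the regular expression to match
- string: the string to search
- flags: flags that control how the regular expression is matched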
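Unlike findall(), finditer() yields match objects, so positions are available as well as text. A minimal sketch on an assumed sample string:

```python
import re

text = '12a32bc43jf3'

# Each item is a match object with group(), span(), and so on.
matches = [(m.group(), m.span()) for m in re.finditer(r'\d+', text)]

for value, span in matches:
    print(value, span)
```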
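re.split()
Splits the string at every substring that matches and returns a list; if nothing matches, it returns a list containing the original string (no split).
re.split(pattern, string[, maxsplit=0, flags=0])
- pattern: the regular expression to match
- string: the string to split
- maxsplit: the number of splits; maxsplit=1 splits once, and the default 0 means no limit
- flags: flags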
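A brief sketch of re.split() on an assumed sample string, including the no-match case and the effect of a capture group in the pattern:

```python
import re

text = 'runoob, runoob, runoob.'

parts = re.split(r'\W+', text)               # trailing '' comes from the final '.'
with_seps = re.split(r'(\W+)', text)         # a capture group keeps the separators
once = re.split(r'\W+', text, maxsplit=1)    # split only at the first separator
unsplit = re.split(r'a+', 'hello world')     # no match: the original string in a list

print(parts, with_seps, once, unsplit)
```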
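3.3.2 Regular expression objects
Regular expression modifiers (flags)
Modifier | Description |
---|---|
re.I | Makes matching case-insensitive |
re.L | Locale-aware matching |
re.M | Multi-line matching; affects ^ and $ |
re.S | Makes . match every character, including newline |
re.U | Interprets characters according to the Unicode character set. This flag affects \w, \W, \b, \B. |
re.X | Allows a more flexible layout (whitespace and comments in the pattern) so that the regular expression is easier to read. |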
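The effect of the most common flags can be seen on a two-line string. This is a sketch with assumed sample text; flags are combined with the | operator.

```python
import re

text = 'First line\nsecond LINE'

ignore_case = re.findall(r'line', text, re.I)       # matches 'line' and 'LINE'
line_starts = re.findall(r'^\w+', text, re.M)       # ^ matches at each line start
default_dot = re.search(r'First.*LINE', text)       # None: '.' stops at the newline
dotall = re.search(r'First.*LINE', text, re.S)      # '.' now crosses the newline

# re.X: whitespace and '#' comments inside the pattern are ignored.
verbose = re.compile(r'''
    (\d+)      # the number
    \s+pages   # whitespace, then the literal word pages
''', re.X)
pages = verbose.search('37 pages, 15 figures').group(1)

combined = re.findall(r'^second.line$', text, re.I | re.M)

print(ignore_case, line_starts, default_dot, dotall.group(), pages, combined)
```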
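Regular expression patterns (pattern)
Pattern | Description |
---|---|
^ | Matches the start of the string |
$ | Matches the end of the string. |
. | Matches any character except newline; when the re.DOTALL flag is given, it matches any character including newline. |
[...] | A set of characters, listed individually: [amk] matches 'a', 'm', or 'k' |
[^...] | Characters not in the set: [^abc] matches any character other than a, b, c. |
re* | Matches 0 or more of the preceding expression. |
re+ | Matches 1 or more of the preceding expression. |
re? | Matches 0 or 1 of the preceding expression (greedy). Appended to another quantifier, as in re*? or re+?, it makes that quantifier non-greedy. |
re{n} | Matches exactly n of the preceding expression. For example, "o{2}" cannot match the "o" in "Bob" but matches the two o's in "food". |
re{n,} | Matches at least n of the preceding expression. For example, "o{2,}" cannot match the "o" in "Bob" but matches all the o's in "foooood". "o{1,}" is equivalent to "o+", and "o{0,}" is equivalent to "o*". |
re{n,m} | Matches n to m repetitions of the preceding expression, greedily |
a\|b | Matches a or b |
(re) | Matches the expression in parentheses and also forms a group |
(?imx) | Turns on the optional flags i, m, or x. In Python's re this applies to the whole pattern and belongs at its start. |
(?-imx) | Turns off the optional flags i, m, or x. Python's re accepts this only in the scoped form (?-imx: re). |
(?: re) | Like (...), but does not form a group |
(?imx: re) | Uses the optional flags i, m, or x inside the parentheses |
(?-imx: re) | Does not use the optional flags i, m, or x inside the parentheses |
(?#...) | Comment. |
(?= re) | Positive lookahead. Succeeds if the contained expression matches at the current position and fails otherwise. The lookahead consumes no characters: once it has been tried, the matching engine has not advanced at all. |
(?! re) | Negative lookahead. The opposite of the positive lookahead: succeeds when the contained expression does not match at the current position. |
(?> re) | Independent (atomic) group that avoids backtracking (supported by Python's re since version 3.11). |
\w | Matches letters, digits, and underscore |
\W | Matches any character that is not a letter, digit, or underscore |
\s | Matches any whitespace character; for ASCII text equivalent to [ \t\n\r\f\v]. |
\S | Matches any non-whitespace character |
\d | Matches any digit; for ASCII text equivalent to [0-9]. |
\D | Matches any non-digit |
\A | Matches the start of the string |
\Z | Matches the end of the string. In Python's re it matches only at the very end; in Perl-style engines it also matches before a final newline. |
\z | Matches the very end of the string (Perl-style engines; in Python's re use \Z). |
\G | Matches the position where the last match finished (not supported by Python's re). |
\b | Matches a word boundary, i.e. the position between a word character and a non-word character such as a space. For example, 'er\b' matches the 'er' in "never" but not the 'er' in "verb". |
\B | Matches a non-word boundary. 'er\B' matches the 'er' in "verb" but not the 'er' in "never". |
\n, \t, etc. | Match a newline, a tab, and so on. |
\1...\9 | Matches the content of the n-th group. |
\10 | Matches the content of the 10th group if that group has matched. Some engines otherwise read it as an octal character code. |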
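A few of the patterns in the table are easier to understand when run. The sketch below uses assumed sample strings to show greedy versus non-greedy quantifiers, counted repetition, lookahead, a backreference, and a non-capturing group.

```python
import re

html = '<b>bold</b> and <i>italic</i>'
greedy = re.findall(r'<.+>', html)    # one match spanning the whole string
lazy = re.findall(r'<.+?>', html)     # each tag on its own

exact = re.findall(r'o{2}', 'Bob food')          # exactly two o's
at_least = re.findall(r'o{2,}', 'Bob foooood')   # two or more o's

comment = '37 pages, 15 figures'
followed = re.findall(r'\d+(?= pages)', comment)            # number before ' pages'
not_followed = re.findall(r'\b\d+\b(?! pages)', comment)    # number not before ' pages'

# \1 must repeat whatever group 1 captured.
repeated = re.search(r'\b(\w+) \1\b', 'this is is a test').group()

# (?:...) groups without capturing, so findall returns whole matches.
runs = re.findall(r'(?:ab)+', 'ababab ab')

print(greedy, lazy, exact, at_least, followed, not_followed, repeated, runs)
```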
3.3.3 Regular expression testing website
https://tool.oschina.net/regex/
3.4 Code
import seaborn as sns
from bs4 import BeautifulSoup
import re
import requests
import json
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm
# Load the data
data=pd.read_csv("data.csv")
D:\anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3146: DtypeWarning: Columns (0) have mixed types.Specify dtype option on import or set low_memory=False.
has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
data.head()
| | id | submitter | authors | title | comments | journal-ref | doi | report-no | categories | license | abstract | versions | update_date | authors_parsed |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 704 | Pavel Nadolsky | C. Bal\'azs, E. L. Berger, P. M. Nadolsky, C.-... | Calculation of prompt diphoton production cros... | 37 pages, 15 figures; published version | Phys.Rev.D76:013009,2007 | 10.1103/PhysRevD.76.013009 | ANL-HEP-PR-07-12 | hep-ph | NaN | A fully differential calculation in perturba... | [{'version': 'v1', 'created': 'Mon, 2 Apr 2007... | 2008-11-26 | [['Balázs', 'C.', ''], ['Berger', 'E. L.', '']... |
1 | 704 | Louis Theran | Ileana Streinu and Louis Theran | Sparsity-certifying Graph Decompositions | To appear in Graphs and Combinatorics | NaN | NaN | NaN | math.CO cs.CG | http://arxiv.org/licenses/nonexclusive-distrib... | We describe a new algorithm, the $(k,\ell)$-... | [{'version': 'v1', 'created': 'Sat, 31 Mar 200... | 2008-12-13 | [['Streinu', 'Ileana', ''], ['Theran', 'Louis'... |
2 | 704 | Hongjun Pan | Hongjun Pan | The evolution of the Earth-Moon system based o... | 23 pages, 3 figures | NaN | NaN | NaN | physics.gen-ph | NaN | The evolution of Earth-Moon system is descri... | [{'version': 'v1', 'created': 'Sun, 1 Apr 2007... | 2008-01-13 | [['Pan', 'Hongjun', '']] |
3 | 704 | David Callan | David Callan | A determinant of Stirling cycle numbers counts... | 11 pages | NaN | NaN | NaN | math.CO | NaN | We show that a determinant of Stirling cycle... | [{'version': 'v1', 'created': 'Sat, 31 Mar 200... | 2007-05-23 | [['Callan', 'David', '']] |
4 | 704 | Alberto Torchinsky | Wael Abu-Shammala and Alberto Torchinsky | From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a... | NaN | Illinois J. Math. 52 (2008) no.2, 681-689 | NaN | NaN | math.CA math.FA | NaN | In this paper we show how to compute the $\L... | [{'version': 'v1', 'created': 'Mon, 2 Apr 2007... | 2013-10-15 | [['Abu-Shammala', 'Wael', ''], ['Torchinsky', ... |
# Keep only the fields we need
data=data.loc[:,['categories','abstract','comments']]
data.head()
| | categories | abstract | comments |
---|---|---|---|
0 | hep-ph | A fully differential calculation in perturba... | 37 pages, 15 figures; published version |
1 | math.CO cs.CG | We describe a new algorithm, the $(k,\ell)$-... | To appear in Graphs and Combinatorics |
2 | physics.gen-ph | The evolution of Earth-Moon system is descri... | 23 pages, 3 figures |
3 | math.CO | We show that a determinant of Stirling cycle... | 11 pages |
4 | math.CA math.FA | In this paper we show how to compute the $\L... | NaN |
# Match "xx pages"
data['pages']=data['comments'].apply(lambda x:re.findall('[1-9][0-9]* pages',str(x)))
data.head()
| | categories | abstract | comments | pages |
---|---|---|---|---|
0 | hep-ph | A fully differential calculation in perturba... | 37 pages, 15 figures; published version | [37 pages] |
1 | math.CO cs.CG | We describe a new algorithm, the $(k,\ell)$-... | To appear in Graphs and Combinatorics | [] |
2 | physics.gen-ph | The evolution of Earth-Moon system is descri... | 23 pages, 3 figures | [23 pages] |
3 | math.CO | We show that a determinant of Stirling cycle... | 11 pages | [11 pages] |
4 | math.CA math.FA | In this paper we show how to compute the $\L... | NaN | [] |
Statistics on pages
# Keep only the papers that report a page count
data=data[ data.pages.apply(len)>0 ]
data.head()
| | categories | abstract | comments | pages |
---|---|---|---|---|
0 | hep-ph | A fully differential calculation in perturba... | 37 pages, 15 figures; published version | [37 pages] |
2 | physics.gen-ph | The evolution of Earth-Moon system is descri... | 23 pages, 3 figures | [23 pages] |
3 | math.CO | We show that a determinant of Stirling cycle... | 11 pages | [11 pages] |
5 | cond-mat.mes-hall | We study the two-particle wave function of p... | 6 pages, 4 figures, accepted by PRA | [6 pages] |
6 | gr-qc | A rather non-standard quantum representation... | 16 pages, no figures. Typos corrected to match... | [16 pages] |
# Convert the list in pages to a number (first match only)
data['pages'] = data['pages'].apply(lambda x: float(x[0].replace(' pages', '')))
data.head()
| | categories | abstract | comments | pages |
---|---|---|---|---|
0 | hep-ph | A fully differential calculation in perturba... | 37 pages, 15 figures; published version | 37.0 |
2 | physics.gen-ph | The evolution of Earth-Moon system is descri... | 23 pages, 3 figures | 23.0 |
3 | math.CO | We show that a determinant of Stirling cycle... | 11 pages | 11.0 |
5 | cond-mat.mes-hall | We study the two-particle wave function of p... | 6 pages, 4 figures, accepted by PRA | 6.0 |
6 | gr-qc | A rather non-standard quantum representation... | 16 pages, no figures. Typos corrected to match... | 16.0 |
data['pages'].describe().astype('int')
count 1089180
mean 17
std 22
min 1
25% 8
50% 13
75% 22
max 11232
Name: pages, dtype: int32
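The findall / filter / replace sequence above could also be written with a capture group, which returns the number directly. This is a sketch in plain Python under stated assumptions: the helper name page_count is hypothetical, and accepting the singular "1 page" is an extension the original pattern does not make.

```python
import re

# Hypothetical helper, not part of the original notebook.
PAGES = re.compile(r'\b([1-9][0-9]*)\s+pages?\b')

def page_count(comment):
    """Return the first page count found in a comments value, or None."""
    m = PAGES.search(str(comment))   # str() also handles NaN (becomes 'nan')
    return int(m.group(1)) if m else None

samples = [
    '37 pages, 15 figures; published version',
    'To appear in Graphs and Combinatorics',
    '1 page',
    float('nan'),
]
counts = [page_count(c) for c in samples]
print(counts)   # [37, None, 1, None]
```

With pandas, such a helper could be passed to data['comments'].apply(...) in place of the lambda.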
# Select the main category
data['categories'] = data['categories'].apply(lambda x: x.split(' ')[0])
data['categories'] = data['categories'].apply(lambda x: x.split('.')[0])
data.head(20)
| | categories | abstract | comments | pages |
---|---|---|---|---|
0 | hep-ph | A fully differential calculation in perturba... | 37 pages, 15 figures; published version | 37.0 |
2 | physics | The evolution of Earth-Moon system is descri... | 23 pages, 3 figures | 23.0 |
3 | math | We show that a determinant of Stirling cycle... | 11 pages | 11.0 |
5 | cond-mat | We study the two-particle wave function of p... | 6 pages, 4 figures, accepted by PRA | 6.0 |
6 | gr-qc | A rather non-standard quantum representation... | 16 pages, no figures. Typos corrected to match... | 16.0 |
9 | math | Partial cubes are isometric subgraphs of hyp... | 36 pages, 17 figures | 36.0 |
10 | math | In this paper we present an algorithm for co... | 14 pages; title changed; to appear in Experime... | 14.0 |
13 | math | In this article we discuss a relation betwee... | 18 pages, 1 figure | 18.0 |
14 | hep-th | The pure spinor formulation of the ten-dimen... | 22 pages; signs and coefficients adjusted for ... | 22.0 |
15 | hep-ph | In this work, we evaluate the lifetimes of t... | 17 pages, 3 figures and 1 table | 17.0 |
16 | astro-ph | Results from spectroscopic observations of t... | 10 pages, 11 figures (figures 3, 4, 7 and 8 at... | 10.0 |
17 | hep-th | We give a prescription for how to compute th... | 20 pages, v2: an overall sign and typos corrected | 20.0 |
18 | math | In this note we give a new method for gettin... | 6 pages, Journal-ref added | 6.0 |
19 | hep-ex | The shape of the hadronic form factor f+(q2)... | 21 pages, 13 postscript figures, submitted to ... | 21.0 |
20 | nlin | Spatiotemporal pattern formation in a produc... | 5 pages, 4 figures | 5.0 |
21 | math | We present Lie group integrators for nonline... | 20 pages, 4 figures | 20.0 |
22 | astro-ph | The very nature of the solar chromosphere, i... | 4 pages, 2 figures, to appear in the proceedin... | 4.0 |
24 | cond-mat | We present recent advances in understanding ... | 41 pages, 13 figures, in "Polarons in Advanced... | 41.0 |
26 | cond-mat | We describe a peculiar fine structure acquir... | 4 pages, 2 figures; mistakes due to an erroneo... | 4.0 |
27 | math | We prove pfaffian and hafnian versions of Li... | 10 pages | 10.0 |
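The two split calls used to select the main category can be checked on a few category strings taken from the tables above. The function name main_category is a hypothetical wrapper for illustration.

```python
def main_category(categories):
    """Return the top-level part of the first listed category."""
    first = categories.split(' ')[0]   # primary category, e.g. 'math.CO'
    return first.split('.')[0]         # part before the dot, e.g. 'math'

samples = ['hep-ph', 'math.CO cs.CG', 'physics.gen-ph', 'cond-mat.mes-hall']
result = [main_category(s) for s in samples]
print(result)   # ['hep-ph', 'math', 'physics', 'cond-mat']
```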
# Average page count per category
plt.figure(figsize=(12, 6))
data.groupby(['categories'])['pages'].mean().plot(kind='bar')
<AxesSubplot:xlabel='categories'>
Statistics on figures
data['figures'] = data['comments'].apply(lambda x: re.findall('[1-9][0-9]* figures', str(x)))
data = data[data['figures'].apply(len) > 0]
data['figures'] = data['figures'].apply(lambda x: float(x[0].replace(' figures', '')))
data.head()
| | categories | abstract | comments | pages | figures |
---|---|---|---|---|---|
0 | hep-ph | A fully differential calculation in perturba... | 37 pages, 15 figures; published version | 37.0 | 15.0 |
2 | physics | The evolution of Earth-Moon system is descri... | 23 pages, 3 figures | 23.0 | 3.0 |
5 | cond-mat | We study the two-particle wave function of p... | 6 pages, 4 figures, accepted by PRA | 6.0 | 4.0 |
9 | math | Partial cubes are isometric subgraphs of hyp... | 36 pages, 17 figures | 36.0 | 17.0 |
15 | hep-ph | In this work, we evaluate the lifetimes of t... | 17 pages, 3 figures and 1 table | 17.0 | 3.0 |
# Average figure count per category
plt.figure(figsize=(12, 6))
data.groupby(['categories'])['figures'].mean().plot(kind='bar')
<AxesSubplot:xlabel='categories'>
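The pattern '[1-9][0-9]* figures' requires the plural, so a comment such as "18 pages, 1 figure" is dropped by the filter. The sketch below compares it with a relaxed pattern; the relaxed pattern is an assumption for illustration, and the comments are abridged from the tables above.

```python
import re

strict = re.compile(r'[1-9][0-9]* figures')              # pattern used above
relaxed = re.compile(r'\b([1-9][0-9]*)\s+figures?\b')    # assumption: also accept "1 figure"

comments = [
    '36 pages, 17 figures',
    '18 pages, 1 figure',
    '17 pages, 3 figures and 1 table',
    '16 pages, no figures. Typos corrected',
]

strict_hits = [strict.findall(c) for c in comments]
relaxed_hits = [relaxed.findall(c) for c in comments]

print(strict_hits)    # [['17 figures'], [], ['3 figures'], []]
print(relaxed_hits)   # [['17'], ['1'], ['3'], []]
```

Neither pattern handles wordings such as "13 postscript figures", so the averages describe only the papers whose comments follow the simple "N figures" form.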
Extracting the code (GitHub) links of the papers
# Keep the papers that mention github
data_with_code = data[
(data.comments.str.contains('github')==True)|
(data.abstract.str.contains('github')==True)
]
data_with_code
| | categories | abstract | comments | pages | figures |
---|---|---|---|---|---|
253172 | astro-ph | Solar tomography has progressed rapidly in r... | 21 pages, 6 figures, 5 tables | 21.0 | 6.0 |
254226 | astro-ph | We describe a hybrid Fourier/direct space co... | 10 pages, 6 figures. Submitted to Astronomy an... | 10.0 | 6.0 |
296182 | astro-ph | REBOUND is a new multi-purpose N-body code w... | 10 pages, 9 figures, accepted by A&A, source c... | 10.0 | 9.0 |
300298 | physics | This article proposes a way to improve the p... | 10 pages, 7 figures. CODE: https://github.com/... | 10.0 | 7.0 |
311500 | physics | The interaction of distinct units in physica... | Preprint. 24 pages, 4 figures, 2 tables. Sourc... | 24.0 | 4.0 |
... | ... | ... | ... | ... | ... |
1381266 | cs | Sequence-based place recognition methods for... | 9 pages, 6 figures, 2 tables | 9.0 | 6.0 |
1381310 | cs | The target identification in brain-computer ... | 12 pages, 6 figures | 12.0 | 6.0 |
1381509 | eess | In this paper, we study the problem of imagi... | 10 pages, 2 figures, to be published in STACOM... | 10.0 | 2.0 |
1381606 | astro-ph | We derive a simple prescription for includin... | 14 pages; 6 figures; 3 appendices | 14.0 | 6.0 |
1382418 | cs | Rotation detection serves as a fundamental b... | 12 pages, 6 figures, 8 tables | 12.0 | 6.0 |
2175 rows × 5 columns
data_with_code['text'] = data_with_code['abstract'].fillna('') + data_with_code['comments'].fillna('')
<ipython-input-36-6de94aa2a570>:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
data_with_code['text'] = data_with_code['abstract'].fillna('') + data_with_code['comments'].fillna('')
data_with_code.head()
| | categories | abstract | comments | pages | figures | text |
---|---|---|---|---|---|---|
253172 | astro-ph | Solar tomography has progressed rapidly in r... | 21 pages, 6 figures, 5 tables | 21.0 | 6.0 | Solar tomography has progressed rapidly in r... |
254226 | astro-ph | We describe a hybrid Fourier/direct space co... | 10 pages, 6 figures. Submitted to Astronomy an... | 10.0 | 6.0 | We describe a hybrid Fourier/direct space co... |
296182 | astro-ph | REBOUND is a new multi-purpose N-body code w... | 10 pages, 9 figures, accepted by A&A, source c... | 10.0 | 9.0 | REBOUND is a new multi-purpose N-body code w... |
300298 | physics | This article proposes a way to improve the p... | 10 pages, 7 figures. CODE: https://github.com/... | 10.0 | 7.0 | This article proposes a way to improve the p... |
311500 | physics | The interaction of distinct units in physica... | Preprint. 24 pages, 4 figures, 2 tables. Sourc... | 24.0 | 4.0 | The interaction of distinct units in physica... |
pattern = r'[a-zA-Z]+://github[^\s]*'  # raw string; [a-zA-Z] instead of [a-zA-z], which also matched [ \ ] ^ _ and `
data_with_code['code_flag']=data_with_code['text'].str.findall(pattern).apply(len)
<ipython-input-47-108c14cee0bc>:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
data_with_code['code_flag']=data_with_code['text'].str.findall(pattern).apply(len)
data_with_code.head()
| | categories | abstract | comments | pages | figures | text | code_flag |
---|---|---|---|---|---|---|---|
253172 | astro-ph | Solar tomography has progressed rapidly in r... | 21 pages, 6 figures, 5 tables | 21.0 | 6.0 | Solar tomography has progressed rapidly in r... | 0 |
254226 | astro-ph | We describe a hybrid Fourier/direct space co... | 10 pages, 6 figures. Submitted to Astronomy an... | 10.0 | 6.0 | We describe a hybrid Fourier/direct space co... | 1 |
296182 | astro-ph | REBOUND is a new multi-purpose N-body code w... | 10 pages, 9 figures, accepted by A&A, source c... | 10.0 | 9.0 | REBOUND is a new multi-purpose N-body code w... | 1 |
300298 | physics | This article proposes a way to improve the p... | 10 pages, 7 figures. CODE: https://github.com/... | 10.0 | 7.0 | This article proposes a way to improve the p... | 2 |
311500 | physics | The interaction of distinct units in physica... | Preprint. 24 pages, 4 figures, 2 tables. Sourc... | 24.0 | 4.0 | The interaction of distinct units in physica... | 1 |
data_with_code['text'].str.findall(pattern)
253172 []
254226 [https://github.com/elsner/arkcos]
296182 [https://github.com/hannorein/rebound]
300298 [https://github.com/dcasadei/psde, https://git...
311500 [http://github.com/ntamas/netctrl]
...
1381266 []
1381310 [https://github.com/osmanberke/Deep-SSVEP-BCI.]
1381509 []
1381606 [https://github.com/alexander-mead/BNL]
1382418 [https://github.com/Thinklab-SJTU/DCL_RetinaNe...
Name: text, Length: 2175, dtype: object
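Because [^\s]* runs until the next whitespace, a link at the end of a sentence keeps its trailing punctuation, as in the entry ending in "Deep-SSVEP-BCI." above. One possible cleanup, sketched on an assumed sample sentence built from two links in the output, is to strip common trailing punctuation after matching:

```python
import re

pattern = r'[a-zA-Z]+://github[^\s]*'
text = ('Code is available at https://github.com/osmanberke/Deep-SSVEP-BCI. '
        'See also (http://github.com/ntamas/netctrl).')

raw = re.findall(pattern, text)

# Assumption: links do not legitimately end in these characters.
clean = [url.rstrip('.,;:)') for url in raw]

print(raw)
print(clean)
```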
# Count the papers with at least one GitHub link in each category
data_with_code = data_with_code[data_with_code['code_flag'] >= 1]
plt.figure(figsize=(12, 6))
data_with_code.groupby(['categories'])['code_flag'].count().plot(kind='bar')
<AxesSubplot:xlabel='categories'>
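Note that count() on code_flag gives the number of papers per category, not the number of links; summing code_flag would give the latter. The difference is sketched below in plain Python on five (category, code_flag) pairs taken from the table above.

```python
from collections import Counter

# (category, code_flag) pairs from the first rows of data_with_code.
rows = [('astro-ph', 0), ('astro-ph', 1), ('astro-ph', 1),
        ('physics', 2), ('physics', 1)]

# Papers with at least one link: what the filter plus count() computes.
papers = Counter(cat for cat, n in rows if n >= 1)

# Total number of links: what a sum over code_flag would compute.
links = Counter()
for cat, n in rows:
    links[cat] += n

print(dict(papers))   # {'astro-ph': 2, 'physics': 2}
print(dict(links))    # {'astro-ph': 2, 'physics': 3}
```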