task01: Paper Data Statistics

https://github.com/datawhalechina/team-learning-data-mining/tree/master/AcademicTrends

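Task description

  • Topic: counting papers, i.e. counting the number of papers in each area of computer science for the whole of 2019;
  • Content: understanding the problem, then reading the data with Pandas and computing the statistics;
  • Outcome: learning the basic operations of Pandas.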

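Dataset

  • Source: dataset link

  • The dataset has the following fields:

    • id: arXiv ID, which can be used to access the paper;
    • submitter: the person who submitted the paper;
    • authors: the authors of the paper;
    • title: the title of the paper;
    • comments: number of pages, figures, and other information;
    • journal-ref: information about the journal in which the paper was published;
    • doi: Digital Object Identifier, https://www.doi.org;
    • report-no: report number;
    • categories: the categories or tags the paper belongs to in the arXiv system;
    • license: the license of the paper;
    • abstract: the abstract of the paper;
    • versions: the versions of the paper;
    • authors_parsed: the parsed author information.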
  • Sample record

    {
      "id": "0704.0001",
      "submitter": "Pavel Nadolsky",
      "authors": "C. Bal\\'azs, E. L. Berger, P. M. Nadolsky, C.-P. Yuan",
      "title": "Calculation of prompt diphoton production cross sections at Tevatron and LHC energies",
      "comments": "37 pages, 15 figures; published version",
      "journal-ref": "Phys.Rev.D76:013009,2007",
      "doi": "10.1103/PhysRevD.76.013009",
      "report-no": "ANL-HEP-PR-07-12",
      "categories": "hep-ph",
      "license": null,
      "abstract": "  A fully differential calculation in perturbative quantum chromodynamics is presented for the production of massive photon pairs at hadron colliders. All next-to-leading order perturbative contributions from quark-antiquark, gluon-(anti)quark, and gluon-gluon subprocesses are included, as well as all-orders resummation of initial-state gluon radiation valid at next-to-next-to-leading logarithmic accuracy. The region of phase space is specified in which the calculation is most reliable. Good agreement is demonstrated with data from the Fermilab Tevatron, and predictions are made for more detailed tests with CDF and DO data. Predictions are shown for distributions of diphoton pairs produced at the energy of the Large Hadron Collider (LHC). Distributions of the diphoton pairs from the decay of a Higgs boson are contrasted with those produced from QCD processes at the LHC, showing that enhanced sensitivity to the signal can be obtained with judicious selection of events.",
      "versions": [
        {"version": "v1", "created": "Mon, 2 Apr 2007 19:18:42 GMT"},
        {"version": "v2", "created": "Tue, 24 Jul 2007 20:10:27 GMT"}
      ],
      "update_date": "2008-11-26",
      "authors_parsed": [
        ["Balázs", "C.", ""],
        ["Berger", "E. L.", ""],
        ["Nadolsky", "P. M.", ""],
        ["Yuan", "C. -P.", ""]
      ]
    }
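The nested fields are easier to read with a concrete record in hand. The sketch below is not part of the original notebook: it hard-codes a trimmed copy of the sample record and pulls out the category list, the authors' display names, and the number of versions. The helper name `summarize_record` is made up for illustration.

```python
# Trimmed copy of the sample record (only the fields used here).
record = {
    "id": "0704.0001",
    "categories": "hep-ph",
    "versions": [
        {"version": "v1", "created": "Mon, 2 Apr 2007 19:18:42 GMT"},
        {"version": "v2", "created": "Tue, 24 Jul 2007 20:10:27 GMT"},
    ],
    "authors_parsed": [
        ["Balázs", "C.", ""],
        ["Berger", "E. L.", ""],
        ["Nadolsky", "P. M.", ""],
        ["Yuan", "C. -P.", ""],
    ],
}


def summarize_record(rec):
    """Hypothetical helper: flatten the nested fields of one arXiv record."""
    return {
        "id": rec["id"],
        # categories is a single space-separated string
        "categories": rec["categories"].split(" "),
        # entries start with [last name, first name(s), ...]; some records
        # carry extra items, so index instead of unpacking
        "authors": [f"{a[1]} {a[0]}".strip() for a in rec["authors_parsed"]],
        "n_versions": len(rec["versions"]),
    }


summary = summarize_record(record)
print(summary)
```

Indexing `a[0]` and `a[1]` rather than unpacking three values is deliberate: some `authors_parsed` entries in the dataset contain more than three items.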

Introduction to arXiv paper categories

Code

First, import the modules:

import seaborn as sns
from bs4 import BeautifulSoup
import re
import requests
import json
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm

Load the data from the JSON file (one record per line) and store it in a DataFrame

data = []
with open("arxiv-metadata-oai-snapshot.json",'r') as f:
    for i, line in enumerate(tqdm(f)):
        record = json.loads(line)  # each line of the file is one JSON object
        data.append(record)
        if i == 0:
            print(record)  # print the first record to inspect its fields

3987it [00:00, 39567.97it/s]

{'id': '0704.0001', 'submitter': 'Pavel Nadolsky', 'authors': "C. Bal\\'azs, E. L. Berger, P. M. Nadolsky, C.-P. Yuan", 'title': 'Calculation of prompt diphoton production cross sections at Tevatron and\n  LHC energies', 'comments': '37 pages, 15 figures; published version', 'journal-ref': 'Phys.Rev.D76:013009,2007', 'doi': '10.1103/PhysRevD.76.013009', 'report-no': 'ANL-HEP-PR-07-12', 'categories': 'hep-ph', 'license': None, 'abstract': '  A fully differential calculation in perturbative quantum chromodynamics is\npresented for the production of massive photon pairs at hadron colliders. All\nnext-to-leading order perturbative contributions from quark-antiquark,\ngluon-(anti)quark, and gluon-gluon subprocesses are included, as well as\nall-orders resummation of initial-state gluon radiation valid at\nnext-to-next-to-leading logarithmic accuracy. The region of phase space is\nspecified in which the calculation is most reliable. Good agreement is\ndemonstrated with data from the Fermilab Tevatron, and predictions are made for\nmore detailed tests with CDF and DO data. Predictions are shown for\ndistributions of diphoton pairs produced at the energy of the Large Hadron\nCollider (LHC). Distributions of the diphoton pairs from the decay of a Higgs\nboson are contrasted with those produced from QCD processes at the LHC, showing\nthat enhanced sensitivity to the signal can be obtained with judicious\nselection of events.\n', 'versions': [{'version': 'v1', 'created': 'Mon, 2 Apr 2007 19:18:42 GMT'}, {'version': 'v2', 'created': 'Tue, 24 Jul 2007 20:10:27 GMT'}], 'update_date': '2008-11-26', 'authors_parsed': [['Balázs', 'C.', ''], ['Berger', 'E. L.', ''], ['Nadolsky', 'P. M.', ''], ['Yuan', 'C. -P.', '']]}


1796911it [05:07, 5838.67it/s] 
data=pd.DataFrame(data)
data.to_csv("data.csv",index=False)
print(data.shape)
print(data.columns)
data.head()
(1796911, 14)
Index(['id', 'submitter', 'authors', 'title', 'comments', 'journal-ref', 'doi',
       'report-no', 'categories', 'license', 'abstract', 'versions',
       'update_date', 'authors_parsed'],
      dtype='object')
id submitter authors title comments journal-ref doi report-no categories license abstract versions update_date authors_parsed
0 0704.0001 Pavel Nadolsky C. Bal\'azs, E. L. Berger, P. M. Nadolsky, C.-... Calculation of prompt diphoton production cros... 37 pages, 15 figures; published version Phys.Rev.D76:013009,2007 10.1103/PhysRevD.76.013009 ANL-HEP-PR-07-12 hep-ph None A fully differential calculation in perturba... [{'version': 'v1', 'created': 'Mon, 2 Apr 2007... 2008-11-26 [[Balázs, C., ], [Berger, E. L., ], [Nadolsky,...
1 0704.0002 Louis Theran Ileana Streinu and Louis Theran Sparsity-certifying Graph Decompositions To appear in Graphs and Combinatorics None None None math.CO cs.CG http://arxiv.org/licenses/nonexclusive-distrib... We describe a new algorithm, the $(k,\ell)$-... [{'version': 'v1', 'created': 'Sat, 31 Mar 200... 2008-12-13 [[Streinu, Ileana, ], [Theran, Louis, ]]
2 0704.0003 Hongjun Pan Hongjun Pan The evolution of the Earth-Moon system based o... 23 pages, 3 figures None None None physics.gen-ph None The evolution of Earth-Moon system is descri... [{'version': 'v1', 'created': 'Sun, 1 Apr 2007... 2008-01-13 [[Pan, Hongjun, ]]
3 0704.0004 David Callan David Callan A determinant of Stirling cycle numbers counts... 11 pages None None None math.CO None We show that a determinant of Stirling cycle... [{'version': 'v1', 'created': 'Sat, 31 Mar 200... 2007-05-23 [[Callan, David, ]]
4 0704.0005 Alberto Torchinsky Wael Abu-Shammala and Alberto Torchinsky From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a... None Illinois J. Math. 52 (2008) no.2, 681-689 None None math.CA math.FA None In this paper we show how to compute the $\L... [{'version': 'v1', 'created': 'Mon, 2 Apr 2007... 2013-10-15 [[Abu-Shammala, Wael, ], [Torchinsky, Alberto, ]]
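The tqdm output reports roughly five minutes for the full load, and all 14 columns end up in memory even though the statistics here only use `id`, `categories`, and `update_date`. One possible way to lighten this, sketched under my own assumptions, is to keep only the needed keys while parsing and optionally stop after a few records. The function name `read_records` and its `limit` parameter are hypothetical, and the two input lines are made-up stand-ins for the real file.

```python
import io
import json

import pandas as pd


def read_records(lines, columns, limit=None):
    """Hypothetical helper: parse JSON lines, keeping only `columns`.

    `lines` is any iterable of JSON strings, e.g. an open file handle.
    `limit` stops early, which is handy for quick experiments.
    """
    rows = []
    for i, line in enumerate(lines):
        if limit is not None and i >= limit:
            break
        record = json.loads(line)
        rows.append({key: record.get(key) for key in columns})
    return pd.DataFrame(rows, columns=columns)


# Two made-up lines standing in for arxiv-metadata-oai-snapshot.json.
fake_file = io.StringIO(
    '{"id": "0704.0001", "categories": "hep-ph", "update_date": "2008-11-26", "abstract": "..."}\n'
    '{"id": "0704.0002", "categories": "math.CO cs.CG", "update_date": "2008-12-13", "abstract": "..."}\n'
)
small = read_records(fake_file, columns=["id", "categories", "update_date"])
print(small)
```

With the real file, the handle returned by `open("arxiv-metadata-oai-snapshot.json")` would take the place of `fake_file`.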

Inspect the categories column of data

data["categories"].describe()
count     395123
unique     28321
top        cs.CV
freq       12076
Name: categories, dtype: object
data["categories"].head(20)
0                hep-ph
1         math.CO cs.CG
2        physics.gen-ph
3               math.CO
4       math.CA math.FA
5     cond-mat.mes-hall
6                 gr-qc
7     cond-mat.mtrl-sci
8              astro-ph
9               math.CO
10      math.NT math.AG
11              math.NT
12              math.NT
13      math.CA math.AT
14               hep-th
15               hep-ph
16             astro-ph
17               hep-th
18      math.PR math.AG
19               hep-ex
Name: categories, dtype: object
unique_categories = set(c for cats in data["categories"] for c in cats.split(' '))
print(len(unique_categories))
unique_categories
176





{'acc-phys',
 'adap-org',
 'alg-geom',
 'ao-sci',
 'astro-ph',
 'astro-ph.CO',
 'astro-ph.EP',
 'astro-ph.GA',
 'astro-ph.HE',
 'astro-ph.IM',
 'astro-ph.SR',
 'atom-ph',
 'bayes-an',
 'chao-dyn',
 'chem-ph',
 'cmp-lg',
 'comp-gas',
 'cond-mat',
 'cond-mat.dis-nn',
 'cond-mat.mes-hall',
 'cond-mat.mtrl-sci',
 'cond-mat.other',
 'cond-mat.quant-gas',
 'cond-mat.soft',
 'cond-mat.stat-mech',
 'cond-mat.str-el',
 'cond-mat.supr-con',
 'cs.AI',
 'cs.AR',
 'cs.CC',
 'cs.CE',
 'cs.CG',
 'cs.CL',
 'cs.CR',
 'cs.CV',
 'cs.CY',
 'cs.DB',
 'cs.DC',
 'cs.DL',
 'cs.DM',
 'cs.DS',
 'cs.ET',
 'cs.FL',
 'cs.GL',
 'cs.GR',
 'cs.GT',
 'cs.HC',
 'cs.IR',
 'cs.IT',
 'cs.LG',
 'cs.LO',
 'cs.MA',
 'cs.MM',
 'cs.MS',
 'cs.NA',
 'cs.NE',
 'cs.NI',
 'cs.OH',
 'cs.OS',
 'cs.PF',
 'cs.PL',
 'cs.RO',
 'cs.SC',
 'cs.SD',
 'cs.SE',
 'cs.SI',
 'cs.SY',
 'dg-ga',
 'econ.EM',
 'econ.GN',
 'econ.TH',
 'eess.AS',
 'eess.IV',
 'eess.SP',
 'eess.SY',
 'funct-an',
 'gr-qc',
 'hep-ex',
 'hep-lat',
 'hep-ph',
 'hep-th',
 'math-ph',
 'math.AC',
 'math.AG',
 'math.AP',
 'math.AT',
 'math.CA',
 'math.CO',
 'math.CT',
 'math.CV',
 'math.DG',
 'math.DS',
 'math.FA',
 'math.GM',
 'math.GN',
 'math.GR',
 'math.GT',
 'math.HO',
 'math.IT',
 'math.KT',
 'math.LO',
 'math.MG',
 'math.MP',
 'math.NA',
 'math.NT',
 'math.OA',
 'math.OC',
 'math.PR',
 'math.QA',
 'math.RA',
 'math.RT',
 'math.SG',
 'math.SP',
 'math.ST',
 'mtrl-th',
 'nlin.AO',
 'nlin.CD',
 'nlin.CG',
 'nlin.PS',
 'nlin.SI',
 'nucl-ex',
 'nucl-th',
 'patt-sol',
 'physics.acc-ph',
 'physics.ao-ph',
 'physics.app-ph',
 'physics.atm-clus',
 'physics.atom-ph',
 'physics.bio-ph',
 'physics.chem-ph',
 'physics.class-ph',
 'physics.comp-ph',
 'physics.data-an',
 'physics.ed-ph',
 'physics.flu-dyn',
 'physics.gen-ph',
 'physics.geo-ph',
 'physics.hist-ph',
 'physics.ins-det',
 'physics.med-ph',
 'physics.optics',
 'physics.plasm-ph',
 'physics.pop-ph',
 'physics.soc-ph',
 'physics.space-ph',
 'plasm-ph',
 'q-alg',
 'q-bio',
 'q-bio.BM',
 'q-bio.CB',
 'q-bio.GN',
 'q-bio.MN',
 'q-bio.NC',
 'q-bio.OT',
 'q-bio.PE',
 'q-bio.QM',
 'q-bio.SC',
 'q-bio.TO',
 'q-fin.CP',
 'q-fin.EC',
 'q-fin.GN',
 'q-fin.MF',
 'q-fin.PM',
 'q-fin.PR',
 'q-fin.RM',
 'q-fin.ST',
 'q-fin.TR',
 'quant-ph',
 'solv-int',
 'stat.AP',
 'stat.CO',
 'stat.ME',
 'stat.ML',
 'stat.OT',
 'stat.TH',
 'supr-con'}
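The same set can also be obtained with pandas string methods, which additionally give the number of papers per individual category. This is a minimal sketch on a made-up Series standing in for `data["categories"]`, not the notebook's own code.

```python
import pandas as pd

# Made-up stand-in for data["categories"].
categories = pd.Series(["hep-ph", "math.CO cs.CG", "math.CO", "cs.CG cs.AI"])

# split -> one list per paper; explode -> one row per (paper, category)
per_category = categories.str.split(" ").explode()

unique_categories = set(per_category)
counts = per_category.value_counts()

print(len(unique_categories))
print(counts)
```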

Extract the year from update_date (the date of the last update, not of the first submission)

data["update_date"].head(10)
0    2008-11-26
1    2008-12-13
2    2008-01-13
3    2007-05-23
4    2013-10-15
5    2015-05-13
6    2008-11-26
7    2009-02-05
8    2010-03-18
9    2007-05-23
Name: update_date, dtype: object
data["year"] = pd.to_datetime(data["update_date"]).dt.year
data["year"].head(20)
0     2008
1     2008
2     2008
3     2007
4     2013
5     2015
6     2008
7     2009
8     2010
9     2007
10    2008
11    2007
12    2008
13    2009
14    2009
15    2008
16    2009
17    2007
18    2007
19    2015
Name: year, dtype: int64
del data["update_date"] 
len(data.columns)
14
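Because `update_date` records the most recent update, a paper first submitted in 2007 can end up with `year` equal to 2019. If the submission year is what matters, one option is to read it from the first entry of `versions`. The sketch below uses made-up rows and assumes the `created` strings keep the RFC 2822 style seen in the sample record; the column name `submit_year`, the second id, and both `update_date` values are invented for illustration.

```python
from email.utils import parsedate_to_datetime

import pandas as pd

# Made-up rows in the same shape as the dataset.
df = pd.DataFrame({
    "id": ["0704.0001", "1901.00001"],
    "versions": [
        [{"version": "v1", "created": "Mon, 2 Apr 2007 19:18:42 GMT"},
         {"version": "v2", "created": "Tue, 24 Jul 2007 20:10:27 GMT"}],
        [{"version": "v1", "created": "Tue, 1 Jan 2019 10:00:00 GMT"}],
    ],
    "update_date": ["2019-08-19", "2019-01-03"],
})

# Year of the last update, as in the notebook.
df["year"] = pd.to_datetime(df["update_date"]).dt.year

# Year of the first version, parsed from its "created" timestamp.
df["submit_year"] = df["versions"].apply(
    lambda v: parsedate_to_datetime(v[0]["created"]).year
)

print(df[["id", "year", "submit_year"]])
```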

Keep the data from 2019 onward

data = data[data["year"] >= 2019]
data.reset_index(drop=True, inplace=True) 
data.head()
id submitter authors title comments journal-ref doi report-no categories license abstract versions authors_parsed year
0 0704.0297 Sung-Chul Yoon Sung-Chul Yoon, Philipp Podsiadlowski and Step... Remnant evolution after a carbon-oxygen white ... 15 pages, 15 figures, 3 tables, submitted to M... None 10.1111/j.1365-2966.2007.12161.x None astro-ph None We systematically explore the evolution of t... [{'version': 'v1', 'created': 'Tue, 3 Apr 2007... [[Yoon, Sung-Chul, ], [Podsiadlowski, Philipp,... 2019
1 0704.0342 Patrice Ntumba Pungu B. Dugmore and PP. Ntumba Cofibrations in the Category of Frolicher Spac... 27 pages None None None math.AT None Cofibrations are defined in the category of ... [{'version': 'v1', 'created': 'Tue, 3 Apr 2007... [[Dugmore, B., ], [Ntumba, PP., ]] 2019
2 0704.0360 Zaqarashvili T.V. Zaqarashvili and K Murawski Torsional oscillations of longitudinally inhom... 6 pages, 3 figures, accepted in A&A None 10.1051/0004-6361:20077246 None astro-ph None We explore the effect of an inhomogeneous ma... [{'version': 'v1', 'created': 'Tue, 3 Apr 2007... [[Zaqarashvili, T. V., ], [Murawski, K, ]] 2019
3 0704.0525 Sezgin Ayg\"un Sezgin Aygun, Ismail Tarhan, Husnu Baysal On the Energy-Momentum Problem in Static Einst... This submission has been withdrawn by arXiv ad... Chin.Phys.Lett.24:355-358,2007 10.1088/0256-307X/24/2/015 None gr-qc None This paper has been removed by arXiv adminis... [{'version': 'v1', 'created': 'Wed, 4 Apr 2007... [[Aygun, Sezgin, ], [Tarhan, Ismail, ], [Baysa... 2019
4 0704.0535 Antonio Pipino Antonio Pipino (1,3), Thomas H. Puzia (2,4), a... The Formation of Globular Cluster Systems in M... 32 pages (referee format), 9 figures, ApJ acce... Astrophys.J.665:295-305,2007 10.1086/519546 None astro-ph None The most massive elliptical galaxies show a ... [{'version': 'v1', 'created': 'Wed, 4 Apr 2007... [[Pipino, Antonio, ], [Puzia, Thomas H., ], [M... 2019

Extract the category information from the arXiv website

# Download and parse the arXiv category taxonomy page
website_url = requests.get('https://arxiv.org/category_taxonomy').text 
soup = BeautifulSoup(website_url,'lxml') 
root = soup.find('div',{'id':'category_taxonomy_list'}) 
tags = root.find_all(["h2","h3","h4","p"], recursive=True)

# Variables that hold the parsed information
level_1_name = ""
level_2_name = ""
level_2_code = ""
level_1_names = []
level_2_codes = []
level_2_names = []
level_3_codes = []
level_3_names = []
level_3_notes = []

for t in tags:
    if t.name == "h2":
        level_1_name = t.text    
        level_2_code = t.text
        level_2_name = t.text
    elif t.name == "h3":
        raw = t.text
        level_2_code = re.sub(r"(.*)\((.*)\)",r"\2",raw) # regex: pattern (.*)\((.*)\); replacement "\2"; input string: raw
        level_2_name = re.sub(r"(.*)\((.*)\)",r"\1",raw)
    elif t.name == "h4":
        raw = t.text
        level_3_code = re.sub(r"(.*) \((.*)\)",r"\1",raw)
        level_3_name = re.sub(r"(.*) \((.*)\)",r"\2",raw)
    elif t.name == "p":
        notes = t.text
        level_1_names.append(level_1_name)
        level_2_names.append(level_2_name)
        level_2_codes.append(level_2_code)
        level_3_names.append(level_3_name)
        level_3_codes.append(level_3_code)
        level_3_notes.append(notes)
        
# Build a DataFrame from the information collected above
df_taxonomy = pd.DataFrame({
    'group_name' : level_1_names,
    'archive_name' : level_2_names,
    'archive_id' : level_2_codes,
    'category_name' : level_3_names,
    'categories' : level_3_codes,
    'category_description': level_3_notes
    
})
df_taxonomy.to_csv("df_taxonomy.csv",index=False)
# Note: groupby() returns a GroupBy object and does not modify df_taxonomy;
# the result is not assigned, so the row order shown below is unchanged.
df_taxonomy.groupby(["group_name","archive_name"])
df_taxonomy.head()
group_name archive_name archive_id category_name categories category_description
0 Computer Science Computer Science Computer Science Artificial Intelligence cs.AI Covers all areas of AI except Vision, Robotics...
1 Computer Science Computer Science Computer Science Hardware Architecture cs.AR Covers systems organization and hardware archi...
2 Computer Science Computer Science Computer Science Computational Complexity cs.CC Covers models of computation, complexity class...
3 Computer Science Computer Science Computer Science Computational Engineering, Finance, and Science cs.CE Covers applications of computer science to the...
4 Computer Science Computer Science Computer Science Computational Geometry cs.CG Roughly includes material in ACM Subject Class...
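Each heading is run through `re.sub` twice with the same pattern, once per capture group. A single match with named groups does the same job and makes a non-matching heading visible instead of passing it through unchanged. The sketch below is only an illustration: `parse_h4` is a hypothetical helper, and the sample strings are written in the shape the h4 pattern expects rather than copied from the live page.

```python
import re

# Same shape as the h4 pattern: "<code> (<name>)"
H4_PATTERN = re.compile(r"(?P<code>.*) \((?P<name>.*)\)")


def parse_h4(raw):
    """Hypothetical helper: split 'cs.AI (Artificial Intelligence)' into (code, name)."""
    m = H4_PATTERN.fullmatch(raw.strip())
    if m is None:
        # re.sub would silently return the input unchanged in this case
        return raw, raw
    return m.group("code"), m.group("name")


print(parse_h4("cs.AI (Artificial Intelligence)"))  # ('cs.AI', 'Artificial Intelligence')
print(parse_h4("no parentheses"))                   # ('no parentheses', 'no parentheses')
```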

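Without splitting the categories string: merge data with the taxonomy on the raw categories value. A paper whose categories field lists several categories matches no taxonomy row, so it is left out of the counts.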

_df = data.merge(df_taxonomy, on="categories", how="left").drop_duplicates(["id","group_name"]).groupby("group_name").agg({"id":"count"}).sort_values(by="id",ascending=False).reset_index()
 
_df
   group_name                                      id
0  Physics                                      79985
1  Mathematics                                  51567
2  Computer Science                             40067
3  Statistics                                    4054
4  Electrical Engineering and Systems Science    3297
5  Quantitative Biology                          1994
6  Quantitative Finance                           826
7  Economics                                      576
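To see which papers fall out of a merge on the raw `categories` string, `merge` accepts `indicator=True`, which adds a `_merge` column marking rows that found no partner. The frames below are toy data of my own, not the real dataset. A cross-listed string such as "math.CO cs.CG" matches nothing, receives a NaN `group_name`, and is then dropped by `groupby`, which skips NaN keys by default.

```python
import pandas as pd

# Toy stand-ins for data and df_taxonomy.
papers = pd.DataFrame({
    "id": ["p1", "p2", "p3"],
    "categories": ["cs.AI", "math.CO cs.CG", "hep-ph"],
})
taxonomy = pd.DataFrame({
    "categories": ["cs.AI", "cs.CG", "math.CO", "hep-ph"],
    "group_name": ["Computer Science", "Computer Science", "Mathematics", "Physics"],
})

merged = papers.merge(taxonomy, on="categories", how="left", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]

print(merged[["id", "categories", "group_name", "_merge"]])
print(len(unmatched), "paper(s) without a group")
```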

If a paper belongs to several categories, give it one record per category

newdata=data.copy(deep=True)
tmps=data.categories.str.split(" ")
newdata.categories=tmps
newdata.head(20)
id submitter authors title comments journal-ref doi report-no categories license abstract versions authors_parsed year
0 0704.0297 Sung-Chul Yoon Sung-Chul Yoon, Philipp Podsiadlowski and Step... Remnant evolution after a carbon-oxygen white ... 15 pages, 15 figures, 3 tables, submitted to M... None 10.1111/j.1365-2966.2007.12161.x None [astro-ph] None We systematically explore the evolution of t... [{'version': 'v1', 'created': 'Tue, 3 Apr 2007... [[Yoon, Sung-Chul, ], [Podsiadlowski, Philipp,... 2019
1 0704.0342 Patrice Ntumba Pungu B. Dugmore and PP. Ntumba Cofibrations in the Category of Frolicher Spac... 27 pages None None None [math.AT] None Cofibrations are defined in the category of ... [{'version': 'v1', 'created': 'Tue, 3 Apr 2007... [[Dugmore, B., ], [Ntumba, PP., ]] 2019
2 0704.0360 Zaqarashvili T.V. Zaqarashvili and K Murawski Torsional oscillations of longitudinally inhom... 6 pages, 3 figures, accepted in A&A None 10.1051/0004-6361:20077246 None [astro-ph] None We explore the effect of an inhomogeneous ma... [{'version': 'v1', 'created': 'Tue, 3 Apr 2007... [[Zaqarashvili, T. V., ], [Murawski, K, ]] 2019
3 0704.0525 Sezgin Ayg\"un Sezgin Aygun, Ismail Tarhan, Husnu Baysal On the Energy-Momentum Problem in Static Einst... This submission has been withdrawn by arXiv ad... Chin.Phys.Lett.24:355-358,2007 10.1088/0256-307X/24/2/015 None [gr-qc] None This paper has been removed by arXiv adminis... [{'version': 'v1', 'created': 'Wed, 4 Apr 2007... [[Aygun, Sezgin, ], [Tarhan, Ismail, ], [Baysa... 2019
4 0704.0535 Antonio Pipino Antonio Pipino (1,3), Thomas H. Puzia (2,4), a... The Formation of Globular Cluster Systems in M... 32 pages (referee format), 9 figures, ApJ acce... Astrophys.J.665:295-305,2007 10.1086/519546 None [astro-ph] None The most massive elliptical galaxies show a ... [{'version': 'v1', 'created': 'Wed, 4 Apr 2007... [[Pipino, Antonio, ], [Puzia, Thomas H., ], [M... 2019
5 0704.0710 Joerg Junkersfeld J. Junkersfeld (for the CB-ELSA collaboration) Photoproduction of pi0 omega off protons for E... 8 pages, 13 figures Eur.Phys.J.A31:365-372,2007 10.1140/epja/i2006-10302-7 None [nucl-ex] None Differential and total cross-sections for ph... [{'version': 'v1', 'created': 'Thu, 5 Apr 2007... [[Junkersfeld, J., , for the CB-ELSA collabora... 2019
6 0704.0752 Davoud Kamani Davoud Kamani Actions for the Bosonic String with the Curved... 8 pages, Latex, no figure, Some minor changes ... Braz. J. Phys. 38, 268-271 (2008) 10.1590/S0103-97332008000200010 None [hep-th] None At first we introduce an action for the stri... [{'version': 'v1', 'created': 'Thu, 5 Apr 2007... [[Kamani, Davoud, ]] 2020
7 0704.0803 Josephine Nanao Walter A. Simmons and Sandip S. Pakvasa Geometric Phase and Superconducting Flux Quant... 5 pages, pdf format None None None [quant-ph] None In a ring of s-wave superconducting material... [{'version': 'v1', 'created': 'Thu, 5 Apr 2007... [[Simmons, Walter A., ], [Pakvasa, Sandip S., ]] 2019
8 0704.0880 Qiuping A. Wang Q. A. Wang (ISMANS), F. Tsobnang (ISMANS), S. ... Stochastic action principle and maximum entropy This work is a further development of the idea... Chaos, Solitons and Fractals, 40(2009)2550-2556 None None [cond-mat.stat-mech] None A stochastic action principle for stochastic... [{'version': 'v1', 'created': 'Fri, 6 Apr 2007... [[Wang, Q. A., , ISMANS], [Tsobnang, F., , ISM... 2020
9 0704.0981 Xuan Hien Nguyen Xuan Hien Nguyen Construction of Complete Embedded Self-Similar... 30 pages Adv. Differential Equations 15 (2010), no. 5-6... None None [math.DG] None We study the Dirichlet problem associated to... [{'version': 'v1', 'created': 'Sat, 7 Apr 2007... [[Nguyen, Xuan Hien, ]] 2019
10 0704.1000 Liming Zhang L.M. Zhang, et al (for the Belle Collaboration) Measurement of D0-D0bar mixing in D0->Ks pi+ p... 6 pages, 4 figures, Submitted to Physical Revi... Phys.Rev.Lett.99:131803,2007 10.1103/PhysRevLett.99.131803 BELLE-CONF-0702 [hep-ex] None We report a measurement of D0-D0bar mixing i... [{'version': 'v1', 'created': 'Sat, 7 Apr 2007... [[Zhang, L. M., ]] 2019
11 0704.1245 Pamela Klaassen P.D. Klaassen and C.D. Wilson Outflow and Infall in a Sample of Massive Star... 34 pages, 9 figures, accepted for publication ... Astrophys.J.663:1092-1102,2007 10.1086/518760 None [astro-ph] None We present single pointing observations of S... [{'version': 'v1', 'created': 'Tue, 10 Apr 200... [[Klaassen, P. D., ], [Wilson, C. D., ]] 2019
12 0704.1369 Kazuya Aoki K. Aoki (for the PHENIX Collaboration) Double Helicity Asymmetry of Inclusive pi0 Pro... 4 pages, 3 figures, to be published in the Pro... AIPConf.Proc.915:339-342,2007 10.1063/1.2750791 None [hep-ex] None The proton spin structure is not understood ... [{'version': 'v1', 'created': 'Wed, 11 Apr 200... [[Aoki, K., , for the PHENIX Collaboration]] 2019
13 0704.1403 Alberto S. Cattaneo Alberto S. Cattaneo, Florian Schaetz Equivalences of Higher Derived Brackets 16 pages; minor changes; corrected typos; to a... J. Pure Appl. Algebra, 212, 2450-2460 (2008) 10.1016/j.jpaa.2008.03.013 None [math.QA, math.DG, math.SG] None This note elaborates on Th. Voronov's constr... [{'version': 'v1', 'created': 'Wed, 11 Apr 200... [[Cattaneo, Alberto S., ], [Schaetz, Florian, ]] 2020
14 0704.1430 Simone Zaggia R. Y. Momany, E.V. Held, I. Saviane, S. Zaggia, L... The blue plume population in dwarf spheroidal ... Accepted for publication in Astronomy & Astrop... None 10.1051/0004-6361:20067024 None [astro-ph] None Abridged... Blue stragglers (BSS) are though... [{'version': 'v1', 'created': 'Wed, 11 Apr 200... [[Momany, Y., ], [Held, E. V., ], [Saviane, I.... 2019
15 0704.1445 Yasha Gindikin Yasha Gindikin and Vladimir A. Sablikov Deformed Wigner crystal in a one-dimensional q... 10 pages, 11 figures. Misprints fixed Phys. Rev. B 76, 045122 (2007) 10.1103/PhysRevB.76.045122 None [cond-mat.str-el, cond-mat.mes-hall] http://arxiv.org/licenses/nonexclusive-distrib... The spatial Fourier spectrum of the electron... [{'version': 'v1', 'created': 'Wed, 11 Apr 200... [[Gindikin, Yasha, ], [Sablikov, Vladimir A., ]] 2019
16 0704.1454 James R. Graham James R. Graham (1), Bruce Macintosh (2), Rene... Ground-Based Direct Detection of Exoplanets wi... White paper submitted to the NSF-NASA-DOE Astr... None None None [astro-ph] None The Gemini Planet (GPI) imager is an "extrem... [{'version': 'v1', 'created': 'Wed, 11 Apr 200... [[Graham, James R., ], [Macintosh, Bruce, ], [... 2019
17 0704.1507 David Ardila D.R. Ardila, D.A. Golimowski, J.E. Krist, M. C... HST/ACS Coronagraphic Observations of the Dust... Accepted to ApJ None None None [astro-ph] None We present ACS/HST coronagraphic observation... [{'version': 'v1', 'created': 'Wed, 11 Apr 200... [[Ardila, D. R., ], [Golimowski, D. A., ], [Kr... 2019
18 0704.1579 Jose Alfonso Lopez Aguerri J. A. L. Aguerri, R. Sanchez-Janssen and C. Mu... A Study of Catalogued Nearby Galaxy Clusters i... 19 pages, 11 figures, accepted for publication... None 10.1051/0004-6361:20066478 None [astro-ph] None We have selected a sample of 88 nearby (z<0.... [{'version': 'v1', 'created': 'Thu, 12 Apr 200... [[Aguerri, J. A. L., ], [Sanchez-Janssen, R., ... 2019
19 0704.1776 Joerg Junkersfeld H. van Pee, O. Bartholomy, V. Crede (for the C... Photoproduction of pi0-mesons off protons from... 17 pages, 17 figures Eur.Phys.J.A31:61-77,2007 10.1140/epja/i2006-10160-3 None [nucl-ex] None Photoproduction of pi0 mesons was studied wi... [{'version': 'v1', 'created': 'Fri, 13 Apr 200... [[van Pee, H., , for the CB-ELSA Collaboration... 2019
explode_data=newdata.explode("categories")
explode_df = explode_data.merge(df_taxonomy, on="categories", how="left").drop_duplicates(["id","group_name"]).groupby("group_name").agg({"id":"count"}).sort_values(by="id",ascending=False).reset_index()
explode_df
   group_name                                      id
0  Physics                                     173191
1  Computer Science                            134283
2  Mathematics                                 116930
3  Statistics                                   39655
4  Electrical Engineering and Systems Science   24834
5  Quantitative Biology                          8140
6  Quantitative Finance                          3680
7  Economics                                     2595
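A toy example makes the effect of `explode` followed by de-duplication easier to follow. It is a sketch with invented papers, and it uses `nunique` as an alternative to `drop_duplicates(["id","group_name"])` plus `count`; both count each paper once per group. A paper with two computer science categories is counted once under Computer Science, while a cross-listed paper is counted once in each group it touches, so the totals can add up to more than the number of papers.

```python
import pandas as pd

# Toy papers whose categories are already split into lists.
papers = pd.DataFrame({
    "id": ["p1", "p2"],
    "categories": [["cs.AI", "cs.LG", "stat.ML"], ["math.CO", "cs.CG"]],
})
taxonomy = pd.DataFrame({
    "categories": ["cs.AI", "cs.LG", "cs.CG", "stat.ML", "math.CO"],
    "group_name": ["Computer Science", "Computer Science", "Computer Science",
                   "Statistics", "Mathematics"],
})

# One row per (paper, category).
exploded = papers.explode("categories")
merged = exploded.merge(taxonomy, on="categories", how="left")

# Distinct papers per group.
counts = merged.groupby("group_name")["id"].nunique().sort_values(ascending=False)
print(counts)
```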

If a paper belongs to several categories, keep only the first one

# Keep only the first category of each paper
first_data=newdata.copy(deep=True)
first_data.categories=tmps.str[0]
first_data.head(20)
id submitter authors title comments journal-ref doi report-no categories license abstract versions authors_parsed year
0 0704.0297 Sung-Chul Yoon Sung-Chul Yoon, Philipp Podsiadlowski and Step... Remnant evolution after a carbon-oxygen white ... 15 pages, 15 figures, 3 tables, submitted to M... None 10.1111/j.1365-2966.2007.12161.x None astro-ph None We systematically explore the evolution of t... [{'version': 'v1', 'created': 'Tue, 3 Apr 2007... [[Yoon, Sung-Chul, ], [Podsiadlowski, Philipp,... 2019
1 0704.0342 Patrice Ntumba Pungu B. Dugmore and PP. Ntumba Cofibrations in the Category of Frolicher Spac... 27 pages None None None math.AT None Cofibrations are defined in the category of ... [{'version': 'v1', 'created': 'Tue, 3 Apr 2007... [[Dugmore, B., ], [Ntumba, PP., ]] 2019
2 0704.0360 Zaqarashvili T.V. Zaqarashvili and K Murawski Torsional oscillations of longitudinally inhom... 6 pages, 3 figures, accepted in A&A None 10.1051/0004-6361:20077246 None astro-ph None We explore the effect of an inhomogeneous ma... [{'version': 'v1', 'created': 'Tue, 3 Apr 2007... [[Zaqarashvili, T. V., ], [Murawski, K, ]] 2019
3 0704.0525 Sezgin Ayg\"un Sezgin Aygun, Ismail Tarhan, Husnu Baysal On the Energy-Momentum Problem in Static Einst... This submission has been withdrawn by arXiv ad... Chin.Phys.Lett.24:355-358,2007 10.1088/0256-307X/24/2/015 None gr-qc None This paper has been removed by arXiv adminis... [{'version': 'v1', 'created': 'Wed, 4 Apr 2007... [[Aygun, Sezgin, ], [Tarhan, Ismail, ], [Baysa... 2019
4 0704.0535 Antonio Pipino Antonio Pipino (1,3), Thomas H. Puzia (2,4), a... The Formation of Globular Cluster Systems in M... 32 pages (referee format), 9 figures, ApJ acce... Astrophys.J.665:295-305,2007 10.1086/519546 None astro-ph None The most massive elliptical galaxies show a ... [{'version': 'v1', 'created': 'Wed, 4 Apr 2007... [[Pipino, Antonio, ], [Puzia, Thomas H., ], [M... 2019
5 0704.0710 Joerg Junkersfeld J. Junkersfeld (for the CB-ELSA collaboration) Photoproduction of pi0 omega off protons for E... 8 pages, 13 figures Eur.Phys.J.A31:365-372,2007 10.1140/epja/i2006-10302-7 None nucl-ex None Differential and total cross-sections for ph... [{'version': 'v1', 'created': 'Thu, 5 Apr 2007... [[Junkersfeld, J., , for the CB-ELSA collabora... 2019
6 0704.0752 Davoud Kamani Davoud Kamani Actions for the Bosonic String with the Curved... 8 pages, Latex, no figure, Some minor changes ... Braz. J. Phys. 38, 268-271 (2008) 10.1590/S0103-97332008000200010 None hep-th None At first we introduce an action for the stri... [{'version': 'v1', 'created': 'Thu, 5 Apr 2007... [[Kamani, Davoud, ]] 2020
7 0704.0803 Josephine Nanao Walter A. Simmons and Sandip S. Pakvasa Geometric Phase and Superconducting Flux Quant... 5 pages, pdf format None None None quant-ph None In a ring of s-wave superconducting material... [{'version': 'v1', 'created': 'Thu, 5 Apr 2007... [[Simmons, Walter A., ], [Pakvasa, Sandip S., ]] 2019
8 0704.0880 Qiuping A. Wang Q. A. Wang (ISMANS), F. Tsobnang (ISMANS), S. ... Stochastic action principle and maximum entropy This work is a further development of the idea... Chaos, Solitons and Fractals, 40(2009)2550-2556 None None cond-mat.stat-mech None A stochastic action principle for stochastic... [{'version': 'v1', 'created': 'Fri, 6 Apr 2007... [[Wang, Q. A., , ISMANS], [Tsobnang, F., , ISM... 2020
9 0704.0981 Xuan Hien Nguyen Xuan Hien Nguyen Construction of Complete Embedded Self-Similar... 30 pages Adv. Differential Equations 15 (2010), no. 5-6... None None math.DG None We study the Dirichlet problem associated to... [{'version': 'v1', 'created': 'Sat, 7 Apr 2007... [[Nguyen, Xuan Hien, ]] 2019
10 0704.1000 Liming Zhang L.M. Zhang, et al (for the Belle Collaboration) Measurement of D0-D0bar mixing in D0->Ks pi+ p... 6 pages, 4 figures, Submitted to Physical Revi... Phys.Rev.Lett.99:131803,2007 10.1103/PhysRevLett.99.131803 BELLE-CONF-0702 hep-ex None We report a measurement of D0-D0bar mixing i... [{'version': 'v1', 'created': 'Sat, 7 Apr 2007... [[Zhang, L. M., ]] 2019
11 0704.1245 Pamela Klaassen P.D. Klaassen and C.D. Wilson Outflow and Infall in a Sample of Massive Star... 34 pages, 9 figures, accepted for publication ... Astrophys.J.663:1092-1102,2007 10.1086/518760 None astro-ph None We present single pointing observations of S... [{'version': 'v1', 'created': 'Tue, 10 Apr 200... [[Klaassen, P. D., ], [Wilson, C. D., ]] 2019
12 0704.1369 Kazuya Aoki K. Aoki (for the PHENIX Collaboration) Double Helicity Asymmetry of Inclusive pi0 Pro... 4 pages, 3 figures, to be published in the Pro... AIPConf.Proc.915:339-342,2007 10.1063/1.2750791 None hep-ex None The proton spin structure is not understood ... [{'version': 'v1', 'created': 'Wed, 11 Apr 200... [[Aoki, K., , for the PHENIX Collaboration]] 2019
13 0704.1403 Alberto S. Cattaneo Alberto S. Cattaneo, Florian Schaetz Equivalences of Higher Derived Brackets 16 pages; minor changes; corrected typos; to a... J. Pure Appl. Algebra, 212, 2450-2460 (2008) 10.1016/j.jpaa.2008.03.013 None math.QA None This note elaborates on Th. Voronov's constr... [{'version': 'v1', 'created': 'Wed, 11 Apr 200... [[Cattaneo, Alberto S., ], [Schaetz, Florian, ]] 2020
14 0704.1430 Simone Zaggia R. Y. Momany, E.V. Held, I. Saviane, S. Zaggia, L... The blue plume population in dwarf spheroidal ... Accepted for publication in Astronomy & Astrop... None 10.1051/0004-6361:20067024 None astro-ph None Abridged... Blue stragglers (BSS) are though... [{'version': 'v1', 'created': 'Wed, 11 Apr 200... [[Momany, Y., ], [Held, E. V., ], [Saviane, I.... 2019
15 0704.1445 Yasha Gindikin Yasha Gindikin and Vladimir A. Sablikov Deformed Wigner crystal in a one-dimensional q... 10 pages, 11 figures. Misprints fixed Phys. Rev. B 76, 045122 (2007) 10.1103/PhysRevB.76.045122 None cond-mat.str-el http://arxiv.org/licenses/nonexclusive-distrib... The spatial Fourier spectrum of the electron... [{'version': 'v1', 'created': 'Wed, 11 Apr 200... [[Gindikin, Yasha, ], [Sablikov, Vladimir A., ]] 2019
16 0704.1454 James R. Graham James R. Graham (1), Bruce Macintosh (2), Rene... Ground-Based Direct Detection of Exoplanets wi... White paper submitted to the NSF-NASA-DOE Astr... None None None astro-ph None The Gemini Planet (GPI) imager is an "extrem... [{'version': 'v1', 'created': 'Wed, 11 Apr 200... [[Graham, James R., ], [Macintosh, Bruce, ], [... 2019
17 0704.1507 David Ardila D.R. Ardila, D.A. Golimowski, J.E. Krist, M. C... HST/ACS Coronagraphic Observations of the Dust... Accepted to ApJ None None None astro-ph None We present ACS/HST coronagraphic observation... [{'version': 'v1', 'created': 'Wed, 11 Apr 200... [[Ardila, D. R., ], [Golimowski, D. A., ], [Kr... 2019
18 0704.1579 Jose Alfonso Lopez Aguerri J. A. L. Aguerri, R. Sanchez-Janssen and C. Mu... A Study of Catalogued Nearby Galaxy Clusters i... 19 pages, 11 figures, accepted for publication... None 10.1051/0004-6361:20066478 None astro-ph None We have selected a sample of 88 nearby (z<0.... [{'version': 'v1', 'created': 'Thu, 12 Apr 200... [[Aguerri, J. A. L., ], [Sanchez-Janssen, R., ... 2019
19 0704.1776 Joerg Junkersfeld H. van Pee, O. Bartholomy, V. Crede (for the C... Photoproduction of pi0-mesons off protons from... 17 pages, 17 figures Eur.Phys.J.A31:61-77,2007 10.1140/epja/i2006-10160-3 None nucl-ex None Photoproduction of pi0 mesons was studied wi... [{'version': 'v1', 'created': 'Fri, 13 Apr 200... [[van Pee, H., , for the CB-ELSA Collaboration... 2019
first_df=first_data.merge(df_taxonomy, on="categories", how="left").drop_duplicates(["id","group_name"]).groupby("group_name").agg({"id":"count"}).sort_values(by="id",ascending=False).reset_index()
first_df
   group_name                                      id
0  Physics                                     162521
1  Computer Science                            103933
2  Mathematics                                  92523
3  Electrical Engineering and Systems Science   14555
4  Statistics                                   11618
5  Quantitative Biology                          5205
6  Quantitative Finance                          2151
7  Economics                                     1776
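The same merge, de-duplicate, group, and sort chain appears three times, once per counting strategy. Wrapping it in a function makes the three strategies easy to compare side by side. This is a sketch on toy data; the name `count_by_group` is hypothetical and not part of the original notebook.

```python
import pandas as pd


def count_by_group(papers, taxonomy):
    """Hypothetical helper wrapping the merge/dedupe/count chain."""
    return (
        papers.merge(taxonomy, on="categories", how="left")
        .drop_duplicates(["id", "group_name"])
        .groupby("group_name")
        .agg({"id": "count"})
        .sort_values(by="id", ascending=False)
        .reset_index()
    )


# Toy stand-ins for df_taxonomy and data.
taxonomy = pd.DataFrame({
    "categories": ["cs.AI", "cs.CG", "math.CO"],
    "group_name": ["Computer Science", "Computer Science", "Mathematics"],
})
papers = pd.DataFrame({
    "id": ["p1", "p2", "p3"],
    "categories": ["cs.AI", "math.CO cs.CG", "math.CO"],
})

split = papers["categories"].str.split(" ")

# 1) raw string, 2) one row per category, 3) first category only
unsplit_df = count_by_group(papers, taxonomy)
explode_df = count_by_group(papers.assign(categories=split).explode("categories"), taxonomy)
first_df = count_by_group(papers.assign(categories=split.str[0]), taxonomy)

for name, df in [("unsplit", unsplit_df), ("explode", explode_df), ("first", first_df)]:
    print(name, dict(zip(df["group_name"], df["id"])))
```

On this toy input the cross-listed paper `p2` is missing from the first result, counted in both groups in the second, and counted only under Mathematics in the third.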

Distribution of papers across the top-level groups when only the first category is kept

mydf=first_df
fig = plt.figure(figsize=(15,12))
explode = (0, 0, 0, 0.2, 0.3, 0.3, 0.2, 0.1) 
plt.pie(mydf["id"],  labels=mydf["group_name"], autopct='%1.2f%%', 
startangle=160, explode=explode)
plt.tight_layout()
plt.show()
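The `explode` tuple is hard-coded to eight entries, so it would break if the number of groups changed. One alternative, sketched here under my own assumptions, derives the offsets from each group's share. The 5% threshold and the uniform 0.2 offset are arbitrary choices of mine and will not reproduce the hand-tuned offsets exactly. The counts are copied from the first-category table, and the `Agg` backend is selected only so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Counts copied from the first-category table.
first_df = pd.DataFrame({
    "group_name": ["Physics", "Computer Science", "Mathematics",
                   "Electrical Engineering and Systems Science", "Statistics",
                   "Quantitative Biology", "Quantitative Finance", "Economics"],
    "id": [162521, 103933, 92523, 14555, 11618, 5205, 2151, 1776],
})

share = first_df["id"] / first_df["id"].sum()
# Pull out every slice below 5% so that its label stays readable.
explode = [0.2 if s < 0.05 else 0.0 for s in share]

fig, ax = plt.subplots(figsize=(15, 12))
ax.pie(first_df["id"], labels=first_df["group_name"], autopct="%1.2f%%",
       startangle=160, explode=explode)
fig.tight_layout()
plt.close(fig)  # in a notebook, call plt.show() instead
```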



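Count the papers in each computer science category (2019 and 2020). Because the merge uses data with the unsplit categories column, only papers that list a single category are counted.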

group_name="Computer Science"
cats = data.merge(df_taxonomy, on="categories").query("group_name == @group_name")
mydf=cats.groupby(["year","category_name"]).count().reset_index().pivot(index="category_name", columns="year",values="id") 
mydf["2019+2020"]=mydf[2019]+mydf[2020]
mydf.sort_values("2019+2020",ascending=False)
category_name                                     2019   2020  2019+2020
Computer Vision and Pattern Recognition           5559   6517      12076
Computation and Language                          2153   2906       5059
Cryptography and Security                         1067   1238       2305
Robotics                                           917   1298       2215
Networking and Internet Architecture               864    783       1647
Data Structures and Algorithms                     711    902       1613
Distributed, Parallel, and Cluster Computing       715    774       1489
Software Engineering                               659    804       1463
Artificial Intelligence                            558    757       1315
Human-Computer Interaction                         420    580       1000
Logic in Computer Science                          470    504        974
Computers and Society                              346    564        910
Machine Learning                                   177    538        715
Databases                                          282    342        624
Computer Science and Game Theory                   281    323        604
Information Retrieval                              245    331        576
Programming Languages                              268    294        562
Systems and Control                                415    133        548
Social and Information Networks                    202    325        527
Neural and Evolutionary Computing                  235    279        514
Computational Geometry                             199    216        415
Computational Complexity                           131    188        319
Computational Engineering, Finance, and Science    108    205        313
Formal Languages and Automata Theory               152    137        289
Digital Libraries                                  125    157        282
Graphics                                           116    151        267
Hardware Architecture                               95    159        254
Emerging Technologies                              101     84        185
Multiagent Systems                                  85     90        175
Discrete Mathematics                                84     81        165
Multimedia                                          76     66        142
Other Computer Science                              67     69        136
Performance                                         45     51         96
Symbolic Computation                                44     36         80
Mathematical Software                               27     45         72
Operating Systems                                   36     33         69
Numerical Analysis                                  40     11         51
Sound                                                7      4         11
General Literature                                   5      5         10
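With `pivot`, a category that has no papers in one of the years gets NaN in that cell, and the `2019+2020` sum then becomes NaN as well. `pd.crosstab` is one way around this, since it counts occurrences directly and fills missing combinations with 0. The sketch below uses invented rows in place of `cats`.

```python
import pandas as pd

# Invented stand-in for the merged frame `cats`.
cats = pd.DataFrame({
    "id": ["p1", "p2", "p3", "p4"],
    "year": [2019, 2019, 2020, 2020],
    "category_name": ["Robotics", "Sound", "Robotics", "Robotics"],
})

# Rows: category, columns: year, cells: number of papers (0 when absent).
table = pd.crosstab(cats["category_name"], cats["year"])
table["2019+2020"] = table[2019] + table[2020]
table = table.sort_values("2019+2020", ascending=False)
print(table)
```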
posted @ 2021-01-13 12:13  Zfancy