Task 02: Paper Author Statistics
https://github.com/datawhalechina/team-learning-data-mining/tree/master/AcademicTrends
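Task description
- Task topic: paper author statistics; find the names of the 10 authors who appear most frequently across all papers;
- Task content: count paper authors, reading the data with Pandas and using string operations;
- Task outcome: learn Pandas string operations;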
Code
Import modules
import seaborn as sns
from bs4 import BeautifulSoup
import re
import requests
import json
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm
from ast import literal_eval
Read the data from the previously saved CSV file
data=pd.read_csv("data.csv")
data.head()
D:\anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3146: DtypeWarning: Columns (0) have mixed types.Specify dtype option on import or set low_memory=False.
has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
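The DtypeWarning above reports mixed types in column 0 (`id`), and in the preview that follows the ids show up as numbers such as 704 rather than as identifier strings. A minimal sketch of one way to avoid this, assuming the ids are meant to be kept as text: pass `dtype={"id": str}` to `pd.read_csv`. The two-row CSV below is made up for illustration and only stands in for `data.csv`.

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for data.csv (values are illustrative).
csv_text = "id,title\n0704.0001,Paper A\n0704.0010,Paper B\n"

# Default parsing infers a numeric type, so leading and trailing zeros are lost.
inferred = pd.read_csv(io.StringIO(csv_text))

# Forcing the id column to str keeps each identifier exactly as written.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"id": str})

print(inferred["id"].tolist())
print(as_text["id"].tolist())
```

Reading the column as text also means pandas no longer has to guess a type for it chunk by chunk, which is what triggers the warning.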
| | id | submitter | authors | title | comments | journal-ref | doi | report-no | categories | license | abstract | versions | update_date | authors_parsed |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 704 | Pavel Nadolsky | C. Bal\'azs, E. L. Berger, P. M. Nadolsky, C.-... | Calculation of prompt diphoton production cros... | 37 pages, 15 figures; published version | Phys.Rev.D76:013009,2007 | 10.1103/PhysRevD.76.013009 | ANL-HEP-PR-07-12 | hep-ph | NaN | A fully differential calculation in perturba... | [{'version': 'v1', 'created': 'Mon, 2 Apr 2007... | 2008-11-26 | [['Balázs', 'C.', ''], ['Berger', 'E. L.', '']... |
1 | 704 | Louis Theran | Ileana Streinu and Louis Theran | Sparsity-certifying Graph Decompositions | To appear in Graphs and Combinatorics | NaN | NaN | NaN | math.CO cs.CG | http://arxiv.org/licenses/nonexclusive-distrib... | We describe a new algorithm, the $(k,\ell)$-... | [{'version': 'v1', 'created': 'Sat, 31 Mar 200... | 2008-12-13 | [['Streinu', 'Ileana', ''], ['Theran', 'Louis'... |
2 | 704 | Hongjun Pan | Hongjun Pan | The evolution of the Earth-Moon system based o... | 23 pages, 3 figures | NaN | NaN | NaN | physics.gen-ph | NaN | The evolution of Earth-Moon system is descri... | [{'version': 'v1', 'created': 'Sun, 1 Apr 2007... | 2008-01-13 | [['Pan', 'Hongjun', '']] |
3 | 704 | David Callan | David Callan | A determinant of Stirling cycle numbers counts... | 11 pages | NaN | NaN | NaN | math.CO | NaN | We show that a determinant of Stirling cycle... | [{'version': 'v1', 'created': 'Sat, 31 Mar 200... | 2007-05-23 | [['Callan', 'David', '']] |
4 | 704 | Alberto Torchinsky | Wael Abu-Shammala and Alberto Torchinsky | From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a... | NaN | Illinois J. Math. 52 (2008) no.2, 681-689 | NaN | NaN | math.CA math.FA | NaN | In this paper we show how to compute the $\L... | [{'version': 'v1', 'created': 'Mon, 2 Apr 2007... | 2013-10-15 | [['Abu-Shammala', 'Wael', ''], ['Torchinsky', ... |
data.shape
(1796911, 14)
## Select the papers in the AI category (cs.AI)
data2 = data[data['categories'].apply(lambda x: 'cs.AI' in x)]
data2.head()
| | id | submitter | authors | title | comments | journal-ref | doi | report-no | categories | license | abstract | versions | update_date | authors_parsed |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
46 | 704.005 | Igor Grabec | T. Kosel and I. Grabec | Intelligent location of simultaneously active ... | 5 pages, 5 eps figures, uses IEEEtran.cls | NaN | NaN | NaN | cs.NE cs.AI | NaN | The intelligent acoustic emission locator is... | [{'version': 'v1', 'created': 'Sun, 1 Apr 2007... | 2009-09-29 | [['Kosel', 'T.', ''], ['Grabec', 'I.', '']] |
49 | 704.005 | Igor Grabec | T. Kosel and I. Grabec | Intelligent location of simultaneously active ... | 5 pages, 7 eps figures, uses IEEEtran.cls | NaN | NaN | NaN | cs.NE cs.AI | NaN | Part I describes an intelligent acoustic emi... | [{'version': 'v1', 'created': 'Sun, 1 Apr 2007... | 2007-05-23 | [['Kosel', 'T.', ''], ['Grabec', 'I.', '']] |
303 | 704.03 | Carlos Gershenson | Carlos Gershenson | The World as Evolving Information | 16 pages. Extended version, three more laws of... | Minai, A., Braha, D., and Bar-Yam, Y., eds. Un... | 10.1007/978-3-642-18003-3_10 | NaN | cs.IT cs.AI math.IT q-bio.PE | http://arxiv.org/licenses/nonexclusive-distrib... | This paper discusses the benefits of describ... | [{'version': 'v1', 'created': 'Tue, 3 Apr 2007... | 2013-04-05 | [['Gershenson', 'Carlos', '']] |
984 | 704.098 | Mohd Abubakr | Mohd Abubakr, R.M.Vinay | Architecture for Pseudo Acausal Evolvable Embe... | 4 pages, 2 figures. Submitted to SASO 2007 | NaN | NaN | NaN | cs.NE cs.AI | NaN | Advances in semiconductor technology are con... | [{'version': 'v1', 'created': 'Sat, 7 Apr 2007... | 2007-05-23 | [['Abubakr', 'Mohd', ''], ['Vinay', 'R. M.', '']] |
1027 | 704.103 | Jianlin Cheng | Jianlin Cheng | A neural network approach to ordinal regression | 8 pages | NaN | NaN | NaN | cs.LG cs.AI cs.NE | NaN | Ordinal regression is an important type of l... | [{'version': 'v1', 'created': 'Sun, 8 Apr 2007... | 2007-05-23 | [['Cheng', 'Jianlin', '']] |
len(data2)
28061
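The filter above runs a Python substring test inside `apply`. Since this task is about Pandas string operations, the same selection can also be sketched with the `.str` accessor. This assumes `categories` is a space-separated string, as in the preview; the three-row frame is made up for illustration.

```python
import pandas as pd

# Toy frame mimicking the space-separated 'categories' column (made-up rows).
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "categories": ["cs.NE cs.AI", "math.CO cs.CG", "cs.AI"],
})

# Substring match; regex=False treats the dot in 'cs.AI' literally.
mask_contains = toy["categories"].str.contains("cs.AI", regex=False)

# Exact-token match: split on whitespace, then test list membership.
mask_exact = toy["categories"].str.split().apply(lambda cats: "cs.AI" in cats)

ai_papers = toy[mask_exact]
print(ai_papers["title"].tolist())
```

The exact-token variant is the stricter of the two: it cannot match a category whose name merely contains `cs.AI` as a substring.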
data2.head(20)
| | id | submitter | authors | title | comments | journal-ref | doi | report-no | categories | license | abstract | versions | update_date | authors_parsed |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
46 | 704.005 | Igor Grabec | T. Kosel and I. Grabec | Intelligent location of simultaneously active ... | 5 pages, 5 eps figures, uses IEEEtran.cls | NaN | NaN | NaN | cs.NE cs.AI | NaN | The intelligent acoustic emission locator is... | [{'version': 'v1', 'created': 'Sun, 1 Apr 2007... | 2009-09-29 | [['Kosel', 'T.', ''], ['Grabec', 'I.', '']] |
49 | 704.005 | Igor Grabec | T. Kosel and I. Grabec | Intelligent location of simultaneously active ... | 5 pages, 7 eps figures, uses IEEEtran.cls | NaN | NaN | NaN | cs.NE cs.AI | NaN | Part I describes an intelligent acoustic emi... | [{'version': 'v1', 'created': 'Sun, 1 Apr 2007... | 2007-05-23 | [['Kosel', 'T.', ''], ['Grabec', 'I.', '']] |
303 | 704.03 | Carlos Gershenson | Carlos Gershenson | The World as Evolving Information | 16 pages. Extended version, three more laws of... | Minai, A., Braha, D., and Bar-Yam, Y., eds. Un... | 10.1007/978-3-642-18003-3_10 | NaN | cs.IT cs.AI math.IT q-bio.PE | http://arxiv.org/licenses/nonexclusive-distrib... | This paper discusses the benefits of describ... | [{'version': 'v1', 'created': 'Tue, 3 Apr 2007... | 2013-04-05 | [['Gershenson', 'Carlos', '']] |
984 | 704.098 | Mohd Abubakr | Mohd Abubakr, R.M.Vinay | Architecture for Pseudo Acausal Evolvable Embe... | 4 pages, 2 figures. Submitted to SASO 2007 | NaN | NaN | NaN | cs.NE cs.AI | NaN | Advances in semiconductor technology are con... | [{'version': 'v1', 'created': 'Sat, 7 Apr 2007... | 2007-05-23 | [['Abubakr', 'Mohd', ''], ['Vinay', 'R. M.', '']] |
1027 | 704.103 | Jianlin Cheng | Jianlin Cheng | A neural network approach to ordinal regression | 8 pages | NaN | NaN | NaN | cs.LG cs.AI cs.NE | NaN | Ordinal regression is an important type of l... | [{'version': 'v1', 'created': 'Sun, 8 Apr 2007... | 2007-05-23 | [['Cheng', 'Jianlin', '']] |
1393 | 704.139 | Tarik Had\v{z}i\'c | Tarik Hadzic, Rune Moller Jensen, Henrik Reif ... | Calculating Valid Domains for BDD-Based Intera... | NaN | NaN | NaN | NaN | cs.AI | NaN | In these notes we formally describe the func... | [{'version': 'v1', 'created': 'Wed, 11 Apr 200... | 2007-05-23 | [['Hadzic', 'Tarik', ''], ['Jensen', 'Rune Mol... |
1408 | 704.141 | Yao Hengshuai | Yao HengShuai | Preconditioned Temporal Difference Learning | This paper has been withdrawn by the author. L... | NaN | NaN | NaN | cs.LG cs.AI | NaN | This paper has been withdrawn by the author.... | [{'version': 'v1', 'created': 'Wed, 11 Apr 200... | 2012-06-11 | [['HengShuai', 'Yao', '']] |
1674 | 704.168 | Kristina Lerman | Anon Plangprasopchok and Kristina Lerman | Exploiting Social Annotation for Automatic Res... | 6 pages, submitted to AAAI07 workshop on Infor... | NaN | NaN | NaN | cs.AI cs.CY cs.DL | NaN | Information integration applications, such a... | [{'version': 'v1', 'created': 'Thu, 12 Apr 200... | 2016-09-08 | [['Plangprasopchok', 'Anon', ''], ['Lerman', '... |
1675 | 704.168 | Kristina Lerman | Kristina Lerman, Anon Plangprasopchok and Chio... | Personalizing Image Search Results on Flickr | 12 pages, submitted to AAAI07 workshop on Inte... | NaN | NaN | NaN | cs.IR cs.AI cs.CY cs.DL cs.HC | NaN | The social media site Flickr allows users to... | [{'version': 'v1', 'created': 'Thu, 12 Apr 200... | 2007-05-23 | [['Lerman', 'Kristina', ''], ['Plangprasopchok... |
1782 | 704.178 | Francesco Santini | Stefano Bistarelli, Ugo Montanari, Francesca R... | Unicast and Multicast Qos Routing with Soft Co... | 45 pages | NaN | NaN | NaN | cs.LO cs.AI cs.NI | NaN | We present a formal model to represent and s... | [{'version': 'v1', 'created': 'Fri, 13 Apr 200... | 2009-09-29 | [['Bistarelli', 'Stefano', ''], ['Montanari', ... |
2009 | 704.201 | Juliana Bernardes | Juliana S Bernardes, Alberto Davila, Vitor San... | A study of structural properties on profiles HMMs | 6 pages, 7 figures | NaN | NaN | NaN | cs.AI | http://arxiv.org/licenses/nonexclusive-distrib... | Motivation: Profile hidden Markov Models (pH... | [{'version': 'v1', 'created': 'Mon, 16 Apr 200... | 2008-12-11 | [['Bernardes', 'Juliana S', ''], ['Davila', 'A... |
2082 | 704.208 | Hassan Satori | H. Satori, M. Harti and N. Chenfour | Introduction to Arabic Speech Recognition Usin... | 4 pages, 3 figures and 2 tables, was in Inform... | NaN | NaN | NaN | cs.CL cs.AI | NaN | In this paper Arabic was investigated from t... | [{'version': 'v1', 'created': 'Tue, 17 Apr 200... | 2007-05-23 | [['Satori', 'H.', ''], ['Harti', 'M.', ''], ['... |
2200 | 704.22 | Hassan Satori | H. Satori, M. Harti and N. Chenfour | Arabic Speech Recognition System using CMU-Sph... | 5 pages, 3 figures and 2 tables, in French | NaN | NaN | NaN | cs.CL cs.AI | NaN | In this paper we present the creation of an ... | [{'version': 'v1', 'created': 'Tue, 17 Apr 200... | 2007-05-23 | [['Satori', 'H.', ''], ['Harti', 'M.', ''], ['... |
3156 | 704.316 | Giorgio Terracina | Giorgio Terracina, Nicola Leone, Vincenzino Li... | Experimenting with recursive queries in databa... | To appear in Theory and Practice of Logic Prog... | NaN | NaN | NaN | cs.AI cs.DB | NaN | This paper considers the problem of reasonin... | [{'version': 'v1', 'created': 'Tue, 24 Apr 200... | 2007-05-23 | [['Terracina', 'Giorgio', ''], ['Leone', 'Nico... |
3358 | 704.336 | Alex Smola J | Quoc Le and Alexander Smola | Direct Optimization of Ranking Measures | NaN | NaN | NaN | NaN | cs.IR cs.AI | NaN | Web page ranking and collaborative filtering... | [{'version': 'v1', 'created': 'Wed, 25 Apr 200... | 2007-05-23 | [['Le', 'Quoc', ''], ['Smola', 'Alexander', '']] |
3394 | 704.34 | Marko A. Rodriguez | Marko A. Rodriguez | General-Purpose Computing on a Semantic Networ... | NaN | Emergent Web Intelligence: Advanced Semantic T... | NaN | LA-UR-07-2885 | cs.AI cs.PL | http://creativecommons.org/licenses/publicdomain/ | This article presents a model of general-pur... | [{'version': 'v1', 'created': 'Wed, 25 Apr 200... | 2010-06-08 | [['Rodriguez', 'Marko A.', '']] |
3432 | 704.343 | Tshilidzi Marwala | Tshilidzi Marwala and Bodie Crossingham | Bayesian approach to rough set | 20 pages, 3 figures | NaN | NaN | NaN | cs.AI | NaN | This paper proposes an approach to training ... | [{'version': 'v1', 'created': 'Wed, 25 Apr 200... | 2007-05-23 | [['Marwala', 'Tshilidzi', ''], ['Crossingham',... |
3452 | 704.345 | Tshilidzi Marwala | S. Mohamed, D. Rubin, and T. Marwala | An Adaptive Strategy for the Classification of... | 9 pages, 5 tables, 3 figures | NaN | NaN | NaN | cs.AI q-bio.QM | NaN | One of the major problems in computational b... | [{'version': 'v1', 'created': 'Wed, 25 Apr 200... | 2007-06-25 | [['Mohamed', 'S.', ''], ['Rubin', 'D.', ''], [... |
3514 | 704.351 | Jegor Uglov Mr | J. Uglov, V. Schetinin, C. Maple | Comparing Robustness of Pairwise and Multiclas... | NaN | NaN | 10.1155/2008/468693 | NaN | cs.AI | NaN | Noise, corruptions and variations in face im... | [{'version': 'v1', 'created': 'Thu, 26 Apr 200... | 2016-02-17 | [['Uglov', 'J.', ''], ['Schetinin', 'V.', ''],... |
3885 | 704.389 | W Saba | Walid S. Saba | A Note on Ontology and Ordinary Language | 19 pages, 1 figure | NaN | NaN | NaN | cs.AI cs.CL | NaN | We argue for a compositional semantics groun... | [{'version': 'v1', 'created': 'Mon, 30 Apr 200... | 2007-05-23 | [['Saba', 'Walid S.', '']] |
# literal_eval from the ast module converts the string form of a list back into a list
tmplist=literal_eval(data2.authors_parsed.iloc[5])
tmplist
[['Hadzic', 'Tarik', ''],
['Jensen', 'Rune Moller', ''],
['Andersen', 'Henrik Reif', '']]
# Collect the authors of all papers into one list
all_authors=[]
for i in range(0,len(data2)):
all_authors.extend(literal_eval(data2.authors_parsed.iloc[i]))
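# Join only the non-empty name parts; slicing with [:-1] would cut the last character of a non-empty suffix
authors_names=[' '.join(part for part in x if part) for x in all_authors]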
authors_names[0:5]
['Kosel T.',
'Grabec I.',
'Kosel T.',
'Grabec I.',
'Gershenson Carlos']
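The loop above parses and flattens the author lists one row at a time. As an alternative sketch (not the method used in this post), the same flattening can be written with `Series.apply` and `Series.explode`. The two-row frame below is a made-up stand-in for `data2`, with `authors_parsed` stored as strings as it is in the CSV.

```python
from ast import literal_eval
import pandas as pd

# Toy stand-in for data2: authors_parsed stored as strings, as in the CSV.
toy = pd.DataFrame({
    "authors_parsed": [
        "[['Kosel', 'T.', ''], ['Grabec', 'I.', '']]",
        "[['Gershenson', 'Carlos', '']]",
    ]
})

# Parse each string into a list of [last, first, suffix] lists,
# then explode so that every author gets a row of their own.
authors = toy["authors_parsed"].apply(literal_eval).explode()

# Join the non-empty name parts with single spaces.
names = authors.apply(lambda parts: " ".join(p for p in parts if p))
print(names.tolist())
```

`explode` keeps the original row index on every author, so each name can still be traced back to the paper it came from.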
authors_names = pd.DataFrame(authors_names)
authors_names.head()
| | 0 |
---|---|
0 | Kosel T. |
1 | Grabec I. |
2 | Kosel T. |
3 | Grabec I. |
4 | Gershenson Carlos |
# Plot a horizontal bar chart of the 10 most frequent authors
plt.figure(figsize=(10, 6))
authors_names[0].value_counts().head(10).plot(kind='barh')
plt.ylabel('Author')
plt.xlabel('Count')
Text(0.5, 0, 'Count')
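The chart conveys the ranking visually, but the task asks for the Top 10 names themselves. A minimal sketch of two ways to get the ranking as numbers, using a short made-up list in place of the author-name strings built above: `value_counts().head(n)` in Pandas, or `collections.Counter.most_common(n)` from the standard library.

```python
from collections import Counter
import pandas as pd

# Made-up author names standing in for the author-name strings.
names = ["Kosel T.", "Grabec I.", "Kosel T.", "Grabec I.", "Kosel T.", "Gershenson Carlos"]

# pandas: value_counts sorts by frequency in descending order.
top_pd = pd.Series(names).value_counts().head(2)

# Standard library: Counter.most_common returns (name, count) pairs.
top_counter = Counter(names).most_common(2)

print(top_pd.to_dict())
print(top_counter)
```

On the full data, replacing `2` with `10` would give the Top 10 in either form.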
# Count last names (the first element of each parsed author entry)
authors_lastnames = [x[0] for x in all_authors]
authors_lastnames = pd.DataFrame(authors_lastnames)
plt.figure(figsize=(10, 6))
authors_lastnames[0].value_counts().head(10).plot(kind='barh')
plt.ylabel('Last name')
plt.xlabel('Count')
Text(0.5, 0, 'Count')
authors_lastnames.value_counts()
Wang 1591
Zhang 1561
Li 1395
Liu 1220
Chen 1126
...
Merayo 1
Mercat 1
Mercer 1
Merchant 1
'Baya 1
Length: 26326, dtype: int64