Along with this book I have provided what is, in big data terms, a very small but nonetheless real-world graph. It is stored in GraphML, a standard XML format for describing graphs that can be used to move graphs between applications. The graph, air-routes, is a model I built of the world airline route network, and it is fairly accurate.
The air-routes.graphml file can be downloaded from the sample-data folder in the GitHub repository at the following URL: https://github.com/krlawrence/graph/tree/master/sample-data
 
Of course, in the real world, airlines add and delete routes all the time, so please don't use this graph to plan your next vacation or business trip! However, as a learning tool I hope you will find it useful and easy to relate to. If you feel so inclined, you can load the file into a text editor and examine how it is laid out. As you work with graphs you will want to become familiar with popular graph serialization formats. Two common ones are GraphML and GraphSON. The latter is a JSON format that is defined by Apache TinkerPop and heavily used in that environment. GraphML is recognized by TinkerPop and by many other tools as well, such as Gephi, a popular open source tool for visualizing graph data. A lot of graph ingestion tools also still use comma-separated values (CSV) files.
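To make the GraphML layout a little more concrete, here is a minimal Python sketch that reads a tiny GraphML document using only the standard library. The embedded sample is a made-up miniature graph, not an excerpt from air-routes.graphml. The labelV and labelE keys, the namespace URI, and the placeholder codes AAA and BBB are assumptions for illustration; check the real file before relying on any of them.

```python
import xml.etree.ElementTree as ET

# Namespace commonly declared by GraphML files (assumed here; adjust it if the
# file you are reading declares a different one or none at all).
NS = {"g": "http://graphml.graphdrawing.org/xmlns"}

# Hypothetical miniature graph, NOT copied from air-routes.graphml.
SAMPLE = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <key id="labelV" for="node" attr.name="labelV" attr.type="string"/>
  <key id="code" for="node" attr.name="code" attr.type="string"/>
  <key id="labelE" for="edge" attr.name="labelE" attr.type="string"/>
  <key id="dist" for="edge" attr.name="dist" attr.type="int"/>
  <graph id="demo" edgedefault="directed">
    <node id="1"><data key="labelV">airport</data><data key="code">AAA</data></node>
    <node id="2"><data key="labelV">airport</data><data key="code">BBB</data></node>
    <edge id="3" source="1" target="2">
      <data key="labelE">route</data><data key="dist">100</data>
    </edge>
  </graph>
</graphml>
"""


def read_graphml(text):
    """Return (nodes, edges) parsed from a GraphML string.

    nodes maps a node id to its {key: value} data entries.
    edges is a list of (source id, target id, {key: value}) tuples.
    All values are left as strings; the attr.type hints are not applied.
    """
    root = ET.fromstring(text)

    def props(elem):
        # Collect the <data key="..."> children of a node or edge element.
        return {d.get("key"): d.text for d in elem.findall("g:data", NS)}

    nodes = {n.get("id"): props(n) for n in root.findall(".//g:node", NS)}
    edges = [
        (e.get("source"), e.get("target"), props(e))
        for e in root.findall(".//g:edge", NS)
    ]
    return nodes, edges


nodes, edges = read_graphml(SAMPLE)
```

To try the same function on the downloaded file, read its contents into a string first and pass that string to read_graphml.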
We will briefly look at loading and saving graph data in Sections 2 and 4. I take a look at different ways to work with graph data stored in text format files, including importing and exporting graph data, in the "COMMON GRAPH SERIALIZATION FORMATS" section towards the end of the book.
The air-routes graph contains several vertex types that are specified using labels. The most common ones are airport and country. There are also vertices for each of the seven continents (continent) and a single version vertex that I provided as a way to test which version of the graph you are using.
Routes between airports are modeled as edges. These edges carry the route label and include the distance between the two connected airport vertices as a property called dist. Connections between countries and airports are modeled using an edge with a contains label.
Each airport vertex has many properties associated with it that give various details about the airport, including its IATA and ICAO codes, its description, the city it is in and its geographic location. Specifically, each airport vertex has a unique ID, a label of airport and the following properties. The word in parentheses indicates the type of the property.

Once the air-routes graph is loaded, we can use Gremlin to show what properties an airport vertex has. As an example, here is what the Austin airport vertex looks like. I will explain the steps that make up the Gremlin query shortly. First we need to dig a little bit into how to load the data and configure a few preferences.

Even though the airport vertex label is airport, I chose to also have a property called type that contains the string airport. This was done to aid with indexing when working with other graph database systems and is explained in more detail later in this book.
You may have noticed that the values for each property are represented as lists (or arrays if you prefer), even though each list only contains one element. The reasons for this will be explored later in this book, but the quick explanation is that TinkerPop allows us to associate a list of values with any vertex property. We will explore ways that you can take advantage of this capability in the "Attaching multiple values (lists or sets) to a single property" section.
The full details of all the features contained in the air-routes graph can be learned by reading the comments at the start of the air-routes.graphml file or by reading the README.txt file.
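As a rough illustration of the list-valued property shape described above, the following Python sketch models a property map in which every value is a list, and shows one way to collapse single-element lists for display. The dictionary is hand-written for illustration and is not real query output. The property names code and city are assumptions, and unwrap_single is a hypothetical helper, not part of TinkerPop.

```python
# Hypothetical property map shaped like the list-valued output described in
# the text. The keys "code" and "city" are assumed names; only "type" with the
# value "airport" is stated in the text.
austin = {
    "type": ["airport"],
    "code": ["AUS"],
    "city": ["Austin"],
}


def unwrap_single(props):
    """Collapse one-element lists to plain values; keep longer lists as lists."""
    return {
        key: (values[0] if len(values) == 1 else values)
        for key, values in props.items()
    }


flat = unwrap_single(austin)

# A property that really does hold several values is left untouched.
multi = unwrap_single({"code": ["AUS", "KAUS"]})
```

Keeping multi-valued lists intact matters because a vertex property can hold more than one value, so unwrapping must not throw any of them away.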
The graph currently contains a total of 3,619 vertices and 50,148 edges. Of these, 3,374 vertices are airports and 43,400 of the edges represent routes. While in big data terms this is really a tiny graph, it is plenty big enough for us to build up and experiment with some very interesting Gremlin queries.
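As a quick sanity check on those counts, the arithmetic below works out what is left once airports and routes are subtracted. It assumes the graph contains only the four vertex labels named earlier (airport, country, continent, version). Reading the non-route remainder as contains edges is also an assumption. The remainder comes to exactly two edges per airport, which would fit one contains edge from a country and one from a continent, but the text above only states the country case.

```python
# Figures quoted in the text.
total_vertices, airports = 3619, 3374
total_edges, routes = 50148, 43400
continents, versions = 7, 1

# Vertices that are not airports.
other_vertices = total_vertices - airports

# Assumption: the only remaining labels are country, continent and version.
countries = other_vertices - continents - versions

# Edges that are not routes (assumed here to be the "contains" edges).
non_route_edges = total_edges - routes

# How many non-route edges there are per airport vertex.
non_route_edges_per_airport = non_route_edges / airports

print(other_vertices, countries, non_route_edges, non_route_edges_per_airport)
```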
Lastly, here are some statistics and facts about the air-routes graph. If you want to see a lot more statistics, check the README.txt file that is included with the air-routes graph.

Here are the top 15 airports sorted by overall number of routes (in and out). In graph terminology this is often called the degree of the vertex, or just vertex degree.
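The idea of ranking vertices by degree can be sketched in a few lines of Python. This is not the Gremlin query used in the book. It is a standalone illustration over a made-up list of directed edges, where the vertex names A, B and C and the function top_by_degree are hypothetical.

```python
from collections import Counter


def vertex_degrees(edges):
    """Count incoming plus outgoing edges for every vertex."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1  # outgoing edge
        degree[target] += 1  # incoming edge
    return degree


def top_by_degree(edges, n):
    """Return the n vertices with the highest overall degree."""
    return vertex_degrees(edges).most_common(n)


# Hypothetical directed route edges between three placeholder vertices.
sample_edges = [("A", "B"), ("B", "A"), ("A", "C"), ("C", "A"), ("B", "C")]

degrees = vertex_degrees(sample_edges)
busiest = top_by_degree(sample_edges, 1)
```

In this sample, A has two outgoing and two incoming edges, so it ranks first with a degree of 4.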

Throughout this book you will find Gremlin queries that can be used to generate many of these statistics.
There is a sample script called graph-stats.groovy in the sample-code folder of the GitHub repository that shows how to generate some statistics about the graph. The script can be found at the following URL: https://github.com/krlawrence/graph/tree/master/sample-code

 

 

 

 
 
posted on 2022-04-01 22:40  bokeyuannicheng0000