There are a number of ways that graph data can be stored in a file. In this section I provide a
brief overview of a few of them. As you will see, there are a number of ways you can represent
graph data using simple CSV files. There are also XML formats, JSON formats, and many more. Of
these, the one that still seems to be supported across most tools and platforms is GraphML.
However, not all of the features offered by Apache TinkerPop can be expressed using GraphML. So
let's take a look at some of the more commonly used formats.
8.1. Comma Separated Values (CSV)
There are a number of ways that a graph can be stored using CSV files. There is no single preferred
format that I am aware of. However, a common and convenient way, especially when vertices
contain lots of properties, is to use two CSV files. One contains all of the vertex data and the
other contains all of the edge data.
8.1.1. Using two CSV files to represent the air-routes data
If we were to store the airport data from the air-routes graph in CSV format we might do something
like the example below. Note that to improve readability I have not included every property (or
indeed every airport) in this example. Notice how each vertex has a unique ID assigned. This is
important because when we define the edges we will need the vertex IDs to build the connections.
For the route data, the edges in our graph, we might use a format like the one below. I did not
include an edge ID as we typically let the graph system assign those. For completeness I did include
a label. However, when every edge is of the same type, you could choose to leave this out so long as
the program ingesting the data knows what label to assign. Most graph systems require edges to
have a label even if it is optional for vertices. This is equally true for the airport data. However, in
cases where vertices or edges within the same CSV file are of different types, it is best to always
include the label for each entry.
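As a rough illustration of the two-file approach, the Python sketch below reads a vertex file and an edge file and links them through the vertex IDs. The column names, the three airports, and the distances are assumptions made up for this sketch; they are not the layout of the actual air-routes files, and `load_graph` is a hypothetical helper.

```python
import csv
import io

# Hypothetical vertex file: one row per airport, each with a unique ID.
VERTICES_CSV = """id,label,code,city
1,airport,ATL,Atlanta
2,airport,AUS,Austin
3,airport,DFW,Dallas
"""

# Hypothetical edge file: no edge IDs, endpoints are given as vertex IDs.
EDGES_CSV = """from,to,label,dist
1,3,route,729
2,3,route,190
2,1,route,811
"""


def load_graph(vertices_text, edges_text, default_edge_label="route"):
    """Return (vertices keyed by ID, list of edge tuples)."""
    vertices = {}
    for row in csv.DictReader(io.StringIO(vertices_text)):
        vertices[row["id"]] = row

    edges = []
    for row in csv.DictReader(io.StringIO(edges_text)):
        # Every edge must refer to vertex IDs defined in the vertex file.
        if row["from"] not in vertices or row["to"] not in vertices:
            raise ValueError(f"edge refers to an unknown vertex: {row}")
        # Fall back to a default label if the label column is missing or empty.
        label = row.get("label") or default_edge_label
        edges.append((row["from"], label, row["to"], int(row["dist"])))
    return vertices, edges


vertices, edges = load_graph(VERTICES_CSV, EDGES_CSV)
for source, label, target, dist in edges:
    print(vertices[source]["code"], label, vertices[target]["code"], dist)
```

Keeping the IDs as strings avoids guessing at the ID type a particular graph system expects; a real loader would convert them as needed.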
Some graph systems provide ingestion tools that, when presented with CSV files like the ones
described here, can figure out how to process them and build a graph. However, in many other
situations you may find yourself writing your own scripts or small programs to do it.
I often find myself writing Ruby or Groovy scripts that generate CSV or GraphML files so that a
graph system can ingest them. In some cases I have used scripts to take CSV or GraphML data and
generate the Gremlin statements that would create the graph. This is very similar to another
common practice, namely, using a script to generate INSERT statements when working with SQL
databases.
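The idea of turning CSV rows into Gremlin statements can be sketched as follows. This is a minimal Python illustration rather than one of the Ruby or Groovy scripts mentioned above; the CSV columns, the sample rows, and the helper names (`quote`, `vertex_statements`, `edge_statements`) are all assumptions. The generated text uses the standard `addV`, `addE`, `property`, `has`, `as` and `from` steps.

```python
import csv
import io

# Hypothetical input data; the columns are chosen only for this sketch.
AIRPORTS_CSV = """code,city
AUS,Austin
DFW,Dallas
"""

ROUTES_CSV = """from,to,dist
AUS,DFW,190
"""


def quote(value):
    """Wrap a value in single quotes, escaping backslashes and quotes."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def vertex_statements(text, label="airport"):
    """Yield one addV statement per CSV row, with one property per column."""
    for row in csv.DictReader(io.StringIO(text)):
        props = "".join(f".property({quote(k)},{quote(v)})" for k, v in row.items())
        yield f"g.addV({quote(label)}){props}"


def edge_statements(text, label="route"):
    """Yield one addE statement per CSV row, looking vertices up by code."""
    for row in csv.DictReader(io.StringIO(text)):
        yield (
            f"g.V().has('code',{quote(row['from'])}).as('a')"
            f".V().has('code',{quote(row['to'])})"
            f".addE({quote(label)}).from('a')"
            f".property('dist',{int(row['dist'])})"
        )


statements = list(vertex_statements(AIRPORTS_CSV)) + list(edge_statements(ROUTES_CSV))
print("\n".join(statements))
```

Vertex statements are emitted before edge statements so that the lookups in the edge statements can find their endpoints when the output is run in order.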
I have also written Java and Groovy programs that read a CSV file and use the TinkerPop API
or the Gremlin Server REST API to insert vertices and edges into a graph. If you work with graph
systems for a while you will probably find yourself doing similar things.
8.1.2. Adjacency matrix format
The examples above of how CSV files can be used to store data about vertices and edges
present a convenient way to do it. However, this is by no means the only way you could do it. For
graphs that do not contain properties you could lay the graph out using an adjacency matrix as
shown below. The letters represent the vertex labels, a 1 indicates there is an edge between
two vertices, and a 0 indicates no edge. This format can be useful if your vertices and edges do not
have properties and the graph is small, but in general it is not a great way to represent large
graphs.
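To make the matrix layout concrete, here is a minimal Python sketch that reads a small matrix and recovers the edges. The four-vertex matrix is a hypothetical example invented for this sketch, and it assumes rows are sources and columns are targets.

```python
import csv
import io

# Hypothetical 4-vertex graph. Row = source vertex, column = target vertex.
MATRIX_CSV = """,A,B,C,D
A,0,1,1,0
B,0,0,1,0
C,1,0,0,1
D,0,0,0,0
"""


def matrix_to_edges(text):
    """Return a list of (source, target) pairs for every cell holding a 1."""
    rows = list(csv.reader(io.StringIO(text)))
    columns = rows[0][1:]  # header row, skipping the empty corner cell
    edges = []
    for row in rows[1:]:
        source = row[0]
        for target, cell in zip(columns, row[1:]):
            if cell.strip() == "1":
                edges.append((source, target))
    return edges


edges = matrix_to_edges(MATRIX_CSV)
print(edges)
```

Even in this tiny case the file stores 16 cells to describe 5 edges. The number of cells grows with the square of the vertex count, which is why the format suits only small graphs.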
8.1.3. Adjacency List format
The adjacency matrix shown above could also be represented as an adjacency list. In this case, the
first column of each row represents a vertex. The remaining columns of each row represent all of the
other vertices that this vertex is connected to.
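A minimal Python sketch of reading such a file is shown here. The four-vertex data is hypothetical and `parse_adjacency_list` is a name chosen for this sketch.

```python
import csv
import io

# Hypothetical adjacency list: first column is a vertex, the rest are the
# vertices it connects to. A vertex with no outgoing edges has a single column.
ADJACENCY_CSV = """A,B,C
B,C
C,A,D
D
"""


def parse_adjacency_list(text):
    """Return a dict mapping each vertex to the list of its neighbors."""
    adjacency = {}
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # ignore blank lines
        vertex, neighbors = row[0], row[1:]
        adjacency[vertex] = neighbors
    return adjacency


adjacency = parse_adjacency_list(ADJACENCY_CSV)
for vertex, neighbors in adjacency.items():
    print(vertex, "->", ", ".join(neighbors) or "(none)")
```

Unlike a matrix, each row only lists the edges that exist, so the file size grows with the number of edges rather than with the square of the number of vertices.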
While this is a simple example, it is possible to represent a more complex graph such as the
air-routes graph in this way. We could build a more complex CSV file where the vertex and its
properties are listed first, followed by all of the other vertices it connects to and the properties of
those edges.
Some graph database systems actually store their graphs to disk using a variation of this format.
JanusGraph, in fact, uses a system a lot like this when storing vertex and edge data to its persistent
store.
8.1.4. Edge List format
When using an edge list format, each line represents an edge. So our simple example could be
represented as follows. Only a few edges are shown.
There are many ways you could construct an edge list. By way of another simple example, we could
represent routes in the air-routes graph in a format similar to that shown below. In this case we
also include the label of the edge between each pair of vertices. The vertices are represented by
their ID values.
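As a sketch of how a labeled edge list might be consumed, the Python below parses lines of the form source ID, edge label, target ID, and then groups the targets by source. The three rows and the column order are assumptions for illustration only, not actual air-routes data.

```python
import csv
import io
from collections import defaultdict

# Hypothetical labeled edge list: source vertex ID, edge label, target vertex ID.
EDGE_LIST_CSV = """1,route,3
2,route,3
2,route,1
"""


def parse_edge_list(text):
    """Return a list of (source_id, label, target_id) tuples."""
    edges = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # ignore blank lines
        source, label, target = (field.strip() for field in row)
        edges.append((int(source), label, int(target)))
    return edges


def out_neighbors(edges):
    """Group target IDs by source ID, which yields an adjacency list."""
    adjacency = defaultdict(list)
    for source, _label, target in edges:
        adjacency[source].append(target)
    return dict(adjacency)


edges = parse_edge_list(EDGE_LIST_CSV)
neighbors = out_neighbors(edges)
print(edges)
print(neighbors)
```

The grouping step shows that an edge list and an adjacency list carry the same connectivity information. One practical difference is that a vertex with no edges never appears in an edge list.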
If you wanted to export a very simple version of the air-routes graph, using just the airport IATA
codes and the edge labels, you could write a Gremlin query to do it for you as follows. Only the first
10 results returned are shown.
There is a sample program called GraphFromCSV.java in the sample programs
folder that shows how to read a CSV file like the one above and create a graph
from it.
If you wanted to print the list without the enclosing square brackets, you could take advantage of
the Java forEachRemaining method from the Iterator interface to add a bit of post-processing to the
end of the query. Once again, only the first 10 results are shown.