Join Query in Solr

Solr has two broad kinds of join: the regular Join and the Block Join.
This page covers the regular Join only.

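Working with Join made me realize that Solr's so-called cross-node querying has a real limitation.
A Solr query is evaluated per replica (that is, per internal Core) on each node. Each evaluation must finish entirely within one core, then entirely within one node, and only then are the partial results merged.
As a consequence, if the data is spread over multiple shards, or a replica of some shard is not on the current node, the query either fails or returns fewer results than it should.
In our production environment Join is a hard requirement, so we were forced to configure the collection with a single shard and a replica count >= the number of Solr nodes.
With that configuration, very little is left of the cloud mode Solr advertises.
So Solr's computation model suits work that can truly be partitioned; for computations with dependencies between docs, it cannot deliver a real cloud mode.
I don't know whether Elasticsearch can do this.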

Unless stated otherwise, the tests below use a single shard; multi-shard cases are called out explicitly.

Preparation

Let's look at some test results to see what Join does.

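Starting Solr Cloud

Standalone Solr is the simpler case, so only Solr Cloud is used as the example here.
First I bring up a Solr Cloud with Docker (my local host machine runs Windows).
To keep things simple while still showing how multiple shards behave differently, we start 1 ZooKeeper node and 2 Solr nodes.

Run on the command line: docker-compose.exe -f solr_cloud.yml up -d
The content of solr_cloud.yml is: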

version: '3.7'
services:
  solr1:
    image: solr:8.5.0
    container_name: solr1
    ports:
     - "8981:8983"
    environment:
      - ZK_HOST=zoo1:2181
    networks:
      - solr
    depends_on:
      - zoo1

  solr2:
    image: solr:8.5.0
    container_name: solr2
    ports:
     - "8982:8983"
    environment:
      - ZK_HOST=zoo1:2181
    networks:
      - solr
    depends_on:
      - zoo1

  zoo1:
    image: zookeeper:3.5
    container_name: zoo1
    restart: always
    hostname: zoo1
    ports:
      - 2181:2181
    environment:
      ZOO_MY_ID: 1
      ZOO_SERVERS: server.1=0.0.0.0:2888:3888;2181
      ZOO_4LW_COMMANDS_WHITELIST: mntr,conf,ruok
    networks:
      - solr

networks:
  solr:

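Creating the collection

Once the Solr Cloud is up, create a collection with the default configset (_default).
Send the HTTP command with a tool such as Postman (all HTTP commands below were sent this way, but they can be executed from the Solr Admin Console just as well):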

POST http://localhost:8981/solr/admin/collections?action=CREATE&name=collection1&numShards=1&replicationFactor=1

We now have a collection with the following properties:

name: collection1
shard num: 1
replication num: 1 (per shard)
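
If you prefer a script over Postman, the CREATE URL above can be assembled programmatically. This is only a minimal sketch: the helper name create_collection_url is my own, and the base URL assumes the 8981 port mapping from solr_cloud.yml. It builds the URL and does not send the request.

```python
from urllib.parse import urlencode

# Assumption: solr1 is reachable on host port 8981, as mapped in solr_cloud.yml.
SOLR_BASE = "http://localhost:8981/solr"


def create_collection_url(name, num_shards, replication_factor, base=SOLR_BASE):
    """Build the Collections API CREATE URL (hypothetical helper, not a Solr API)."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
    }
    # urlencode keeps the insertion order of the dict and escapes the values.
    return f"{base}/admin/collections?{urlencode(params)}"


url = create_collection_url("collection1", 1, 1)
print(url)
```

The resulting string can be pasted into any HTTP client; changing num_shards to 2 gives the setup used by the multi-shard example.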

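This probably has no effect on the tests below, but it is worth noting: I have created a single-shard, single-replica collection, which is not the same thing as Solr started in standalone mode.
I have 2 Solr nodes, but an index directory was created on only one of them. In other words, when I query this collection, only the node that holds the replica receives a query request it actually has to process (the other node could not do the work even if it wanted to).

Preparing the schema

Everything below uses the default dynamicField definitions, so no extra fields are created.
In the default dynamicField definitions, *_i is int, *_s is string, and *_ss is multi-valued string.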

Regular Join

Example 1-1: from single-valued string to single-valued string (single shard)

Preparing the index data

Run the index command:

POST http://localhost:8981/solr/collection1/update?commit=true&overwrite=true
[
{"id":"001", "single_s":"001", "other_s":"aa"},
{"id":"002", "single_s":"003", "other_s":"aa"},
{"id":"003", "single_s":"005", "other_s":"aa"},
{"id":"004", "single_s":"004", "other_s":"bb"},
{"id":"005"}
]

Test

GET http://localhost:8981/solr/collection1/select?fq={!join to=id from=single_s}other_s:aa&q=*:*&fl=id

Explanation:

other_s:aa matches the docs with id 001/002/003.
{!join to=id from=single_s}other_s:aa is roughly the SQL select * from collection1 where id in (select single_s from collection1 where other_s contains 'aa'), which returns the docs with id 001/003/005.

Final result: 001/003/005 are returned.
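
To make the two steps concrete, here is a small in-memory simulation of the join on the data from this example. It is only a sketch of the semantics described above, not how Solr implements it; the function simulate_join and its parameters are hypothetical names.

```python
# The five docs indexed in Example 1-1.
docs = [
    {"id": "001", "single_s": "001", "other_s": "aa"},
    {"id": "002", "single_s": "003", "other_s": "aa"},
    {"id": "003", "single_s": "005", "other_s": "aa"},
    {"id": "004", "single_s": "004", "other_s": "bb"},
    {"id": "005"},
]


def simulate_join(docs, from_field, to_field, inner_query):
    """Simulate {!join from=... to=...}inner_query over one set of docs."""
    # Step 1: run the inner query and collect the values of the "from" field.
    from_values = set()
    for doc in docs:
        if inner_query(doc) and from_field in doc:
            from_values.add(doc[from_field])
    # Step 2: keep every doc whose "to" field equals one of those values.
    matched = []
    for doc in docs:
        if doc.get(to_field) in from_values:
            matched.append(doc["id"])
    return sorted(matched)


result = simulate_join(
    docs,
    from_field="single_s",
    to_field="id",
    inner_query=lambda doc: doc.get("other_s") == "aa",
)
print(result)
```

Docs 001, 002 and 003 match the inner query and contribute the single_s values 001, 003 and 005, so the simulation ends with the same ids as the query above.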

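Example 1-2: from single-valued string to single-valued string (multiple shards)

With the collection created with shard num 2, and the index data and query unchanged from Example 1-1, the result is different:

005 is lost!

Note: this result is not guaranteed to be the same every time. It depends on how the index routing rule assigns docs to shards.

The reason is that Solr evaluates the join on a single replica: the statement select * from collection1 where id in (select single_s from collection1 where other_s contains 'aa') is evaluated within one shard at a time.
Since there are 2 shards, fetch the doc ids on each shard separately: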

GET http://localhost:8981/solr/collection1/select?q=*:*&shards=shard1&fl=id

This returns the doc ids held on shard1.

GET http://localhost:8981/solr/collection1/select?q=*:*&shards=shard2&fl=id

This returns the doc ids held on shard2.

So the join condition, where doc1.single_s=doc2.id and doc1.other_s='aa', is evaluated on each of the two shards separately.

select * from collection1.shard1 where id in (select single_s from collection1.shard1 where other_s contains 'aa') returns only 001
select * from collection1.shard2 where id in (select single_s from collection1.shard2 where other_s contains 'aa') returns only 003

The overall result is therefore 001 and 003.

How was 005 lost?
On shard2, other_s:aa matched the doc with id 003, whose single_s is 005. Solr then looked for a doc with id 005 on shard2, found none, and dropped the record.
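
The per-shard evaluation can be reproduced with a short simulation. Treat it as a sketch under assumptions: the shard assignment below is hypothetical, chosen to be consistent with the results described above (where doc 004 lives is a pure guess and does not affect the outcome), and join is my own helper rather than anything from Solr.

```python
def join(docs, from_field, to_field, inner_query):
    """Evaluate the join inside one set of docs only (one shard, or everything)."""
    from_values = {d[from_field] for d in docs if inner_query(d) and from_field in d}
    return sorted(d["id"] for d in docs if d.get(to_field) in from_values)


# Assumption: a doc-to-shard assignment that matches the behaviour in the text.
shards = {
    "shard1": [
        {"id": "001", "single_s": "001", "other_s": "aa"},
        {"id": "004", "single_s": "004", "other_s": "bb"},
        {"id": "005"},
    ],
    "shard2": [
        {"id": "002", "single_s": "003", "other_s": "aa"},
        {"id": "003", "single_s": "005", "other_s": "aa"},
    ],
}


def is_aa(doc):
    return doc.get("other_s") == "aa"


# What happens with 2 shards: each shard joins against its own docs, then results merge.
per_shard = {name: join(docs, "single_s", "id", is_aa) for name, docs in shards.items()}
merged = sorted(doc_id for ids in per_shard.values() for doc_id in ids)

# What a single shard holding every doc would return.
all_docs = [doc for docs in shards.values() for doc in docs]
whole = join(all_docs, "single_s", "id", is_aa)

lost = sorted(set(whole) - set(merged))
print(per_shard, merged, lost)
```

With this assignment shard2 collects the values 003 and 005 but holds no doc with id 005, so only 003 survives there, and 005 shows up in lost.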

Example 1-3: from number to number

This behaves the same way as Example 1-1. The index data is:

[
{"id":"001", "single_i":11,"single_2_i":11, "other_s":"aa"},
{"id":"002", "single_i":22,"single_2_i":33, "other_s":"aa"},
{"id":"003", "single_i":33,"single_2_i":77, "other_s":"aa"},
{"id":"004", "single_i":44,"single_2_i":44, "other_s":"bb"},
{"id":"005", "single_i":55,"single_2_i":66, "other_s":"aa"},
{"id":"006", "single_i":66}
]

Query:

GET http://localhost:8981/solr/collection1/select?fq={!join to=single_i from=single_2_i}other_s:aa&q=*:*&fl=id

Example 1-4: from string to number

Nothing is returned, and no error is raised.

Example 1-5: from number to string

An exception is thrown.

ERROR (qtp2048537720-22) [c:collection1 s:shard1 r:core_node3 x:collection1_shard1_replica_n1] o.a.s.s.HttpSolrCall null:java.lang.IllegalStateException: unexpected docvalues type SORTED for field 'single_s' (expected one of [SORTED_NUMERIC, NUMERIC]). Re-index with correct docvalues type.
        at org.apache.lucene.index.DocValues.checkField(DocValues.java:317)
        at org.apache.lucene.index.DocValues.getSortedNumeric(DocValues.java:389)
        at org.apache.solr.search.join.GraphPointsCollector.doSetNextReader(GraphPointsCollector.java:50)
        at org.apache.lucene.search.SimpleCollector.getLeafCollector(SimpleCollector.java:33)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:652)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:445)
        at org.apache.solr.search.JoinQuery$JoinQueryWeight.getDocSet(JoinQParserPlugin.java:387)
        at org.apache.solr.search.JoinQuery$JoinQueryWeight.scorer(JoinQParserPlugin.java:311)

Example 1-6: from multi-valued string to single-valued string (single shard)

Preparing the index data

Run the index command:

POST http://localhost:8981/solr/collection1/update?commit=true&overwrite=true
[
{"id":"001", "multi_ss":["001", "002"]},
{"id":"002", "multi_ss":["003"]},
{"id":"003", "multi_ss":["006", "005"]},
{"id":"004", "multi_ss":["005"]},
{"id":"005"}
]

Test

GET http://localhost:8981/solr/collection1/select?fq={!join to=id from=multi_ss }*:*&q=*:*&fl=id

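The where clause is effectively the same as in 1-1, except that multi_ss now carries multiple values. For example, the doc with id 001 points to two docs through its multi_ss: 001 and 002.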
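
The post does not list the result of this query, so the following is only a simulation of the semantics described for Example 1-1, extended to a multi-valued from field. The variable names are mine, and the output is what the simulation produces, not a captured Solr response.

```python
# The five docs indexed in Example 1-6.
docs = [
    {"id": "001", "multi_ss": ["001", "002"]},
    {"id": "002", "multi_ss": ["003"]},
    {"id": "003", "multi_ss": ["006", "005"]},
    {"id": "004", "multi_ss": ["005"]},
    {"id": "005"},
]

# Step 1: the inner query *:* matches every doc, so gather every multi_ss value.
# A multi-valued field contributes each of its values individually.
from_values = {value for doc in docs for value in doc.get("multi_ss", [])}

# Step 2: keep the docs whose id is among the collected values.
matched = sorted(doc["id"] for doc in docs if doc["id"] in from_values)
print(matched)
```

In the simulation, 004 is dropped because no multi_ss value points at it, and the value 006 matches nothing because no such doc exists.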

Cross-Collection/Core Join

This is in fact handled by the same implementation class as the regular Join: org.apache.solr.search.JoinQParserPlugin.
The parameters it supports already include:

final String fromField = qparser.getParam("from");
final String fromIndex = qparser.getParam("fromIndex");
final String toField = qparser.getParam("to");
final String v = qparser.localParams.get(QueryParsing.V);

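When fromIndex is absent, it simply defaults to the current core (in Solr Cloud, that is one replica of a shard).
So in that respect it behaves exactly like the regular Join.

The difference is that it imposes additional requirements on the from collection (the to collection has none):

It must have a single shard. (This case is similar to the regular Join above: there is no error, but the result is incorrect, with fewer docs than expected.)
Its shard must have a replica on every Solr node (in practice, on every node that hosts a shard of the to collection); otherwise the query fails with an error!

The error looks like this: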

    "error":{
        "metadata":[
            "error-class",
            "org.apache.solr.common.SolrException",
            "root-error-class",
            "org.apache.solr.common.SolrException",
            "error-class",
            "org.apache.solr.client.solrj.impl.BaseHttpSolrClient$RemoteSolrException",
            "root-error-class",
            "org.apache.solr.client.solrj.impl.BaseHttpSolrClient$RemoteSolrException"
        ],
        "msg":"Error from server at null: SolrCloud join: No active replicas for collection1 found in node 172.23.0.3:8983_solr",
        "code":400
    }

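Why doesn't the regular Join have the second requirement?
Because once a join query has been dispatched to a shard, that shard's replica is by definition already present on the current node.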
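
The two requirements can be written down as a small pre-flight check. This is purely an illustration of the rules stated above: check_join_placement, the node names and the placement dictionaries are all hypothetical, and nothing here queries a real cluster.

```python
def check_join_placement(from_shards, to_nodes):
    """Check the two requirements on the from collection.

    from_shards: dict mapping each shard of the from collection to the set of
                 nodes that host one of its replicas.
    to_nodes:    set of nodes that host a shard of the to collection.
    Returns a list of problems; an empty list means both requirements hold.
    """
    problems = []
    # Requirement 1: the from collection must have exactly one shard.
    if len(from_shards) != 1:
        problems.append("from collection is not single-shard")
    # Requirement 2: every node serving the to collection needs a local replica.
    from_nodes = set().union(*from_shards.values())
    for node in sorted(to_nodes - from_nodes):
        problems.append(f"no replica of the from collection on node {node}")
    return problems


# Assumption: node names are made up for the illustration.
ok = check_join_placement({"shard1": {"solr1", "solr2"}}, {"solr1", "solr2"})
missing = check_join_placement({"shard1": {"solr1"}}, {"solr1", "solr2"})
print(ok, missing)
```

The second call mirrors the situation behind the error message above: the to collection has a shard on a node where the from collection has no replica.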

Example 2-1: from multi-valued string to single-valued string (from single shard, to multiple shards)

Creating the new collections

Rebuild the environment and create two collections.

POST http://localhost:8981/solr/admin/collections?action=CREATE&name=collection1&numShards=1&replicationFactor=2

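We now have a collection with the following properties:

name: collection1
shard num: 1
replication num: 2 (per shard)

Since we started exactly 2 Solr nodes and replicas are spread evenly by default at creation time, collection1 ends up with one replica of its shard on each of the two Solr nodes.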

POST http://localhost:8981/solr/admin/collections?action=CREATE&name=collection2&numShards=2&replicationFactor=1

We now have a second collection with the following properties:

name: collection2
shard num: 2
replication num: 1 (per shard)

That gives us 2 collections:

collection1: single shard, 2 replicas/shard
collection2: multiple shards, 1 replica/shard

Preparing the index data

Run the index commands:

POST http://localhost:8981/solr/collection1/update?commit=true&overwrite=true
[
{"id":"b001", "multi_ss":["001", "002"]},
{"id":"b002", "multi_ss":["003"]},
{"id":"c003", "multi_ss":["006", "005"]}
]

POST http://localhost:8981/solr/collection2/update?commit=true&overwrite=true
[
{"id":"001"},
{"id":"002"},
{"id":"003"},
{"id":"004"},
{"id":"005"}
]

Test

GET http://localhost:8981/solr/collection2/select?fq={!join fromIndex=collection1 to=id from=multi_ss}id:b00*&q=*:*&fl=id

Translated, this join is roughly:
select * from collection2 where id in (select multi_ss from collection1 where id like 'b00%')
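
The post does not show the response for this query, so here is a simulation of the cross-collection semantics on the data above. It is a sketch only: it ignores shard and replica placement entirely and assumes the from collection is fully visible to every shard of the to collection, which is what the setup in this example is meant to guarantee.

```python
# Docs indexed into collection1 (the from side) and collection2 (the to side).
collection1 = [
    {"id": "b001", "multi_ss": ["001", "002"]},
    {"id": "b002", "multi_ss": ["003"]},
    {"id": "c003", "multi_ss": ["006", "005"]},
]
collection2 = [{"id": "001"}, {"id": "002"}, {"id": "003"}, {"id": "004"}, {"id": "005"}]

# Step 1: run the inner query id:b00* against collection1 (a simple prefix match)
# and collect every multi_ss value of the matching docs.
from_values = {
    value
    for doc in collection1
    if doc["id"].startswith("b00")
    for value in doc.get("multi_ss", [])
}

# Step 2: keep the collection2 docs whose id is among those values.
matched = sorted(doc["id"] for doc in collection2 if doc["id"] in from_values)
print(matched)
```

In the simulation, c003 does not match the prefix b00, so its values 006 and 005 never enter the join, and 005 is not returned even though it exists in collection2.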

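Join across multiple collections/Cores

You will surely ask: what if I need to chain the hops? Can I search collection1, join to collection2, and then join from collection2 to collection3?
Of course you can.
But that goes beyond what a single Join statement supports.
You need the subQuery feature.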

posted @ 2020-11-09 18:09 爪哇国的小蚂蚁
