内部优化 -- m_lira_courtbulletin_solr_01.sql脚本增量导入、在where字句的过滤条件中使用select子查询过滤数据、APP层dd表逻辑

内部优化 -- m_lira_courtbulletin_solr_01.sql脚本增量导入、在where字句的过滤条件中使用select子查询过滤数据、APP层dd表逻辑

背景:m_lira_courtbulletin_solr_01.sql目前全量插入时间较长(3小时左右),影响整体跑批性能

该表为parquet表

方案:在从lira_courtbulletin_solr拿数据时使用updatetime增量写入(insert into)

原代码:

insert overwrite table fin_dw.m_lira_courtbulletin_solr_01
select 
  cbid,
  batchno,
  uuid,
  ptype,
  ptypename,
  title,
  caseno,
  casereason,
  partys,
  court,
  casedate,
  pdate,
  province,
  city,
  keywords,
  datasource,
  noticeaddr,
  docuclass,
  docuclassname,
  targetamount,
  collectiondate,
  crawlerid,
  inputtime,
  judgementresult,
  structedent,
  partyent,
  plaintiffent,
  structedinfo,
  checkstatus,
  updatetime,
  pronouns,
  judicialprocess,
  doctpye,
  judgebasis,
  courtclass,
  nerinfo,
  docid
  from fin_operation.lira_courtbulletin_solr -- kudu表
  where ptype in ('14','15','16','17','18','19','22','23')
	and nvl(structedent,'')<>''
	;

修改之后的代码示例:

insert into table fin_dw.m_lira_courtbulletin_solr_01
select 
  cbid,
  batchno,
  uuid,
  ptype,
  ptypename,
  title,
  caseno,
  casereason,
  partys,
  court,
  casedate,
  pdate,
  province,
  city,
  keywords,
  datasource,
  noticeaddr,
  docuclass,
  docuclassname,
  targetamount,
  collectiondate,
  crawlerid,
  inputtime,
  judgementresult,
  structedent,
  partyent,
  plaintiffent,
  structedinfo,
  checkstatus,
  updatetime,
  pronouns,
  judicialprocess,
  doctpye,
  judgebasis,
  courtclass,
  nerinfo,
  docid
  from fin_operation.lira_courtbulletin_solr
  where ptype in ('14','15','16','17','18','19','22','23')
	and nvl(structedent,'')<>'' and updatetime > (select max(updatetime) from fin_dw.m_lira_courtbulletin_solr_01)
;

但是这样可能会出现重复数据:fin_operation.lira_courtbulletin_solr为kudu表,会根据主键(cbid)去重。

​ 因为如果上游新增一条数据,存入kudu表后会根据主键(cbid)去重,那么该条bcid的数据就是最新的了并且老数据会被覆盖,那么根据上面的逻辑,该条数据就会被 insert intofin_dw.m_lira_courtbulletin_solr_01 ,但是parquet表并不会根据主键去重,所以该cbid的数据就会有两条。

所以选择使用APP层dd表(全量表)逻辑:

-- 该临时表表结构和fin_dw.m_lira_courtbulletin_solr_01保持一致

CREATE TABLE fin_dw.t_m_lira_courtbulletin_solr_01_01 (
  cbid STRING,
  batchno BIGINT,
  uuid STRING,
  ptype STRING,
  ptypename STRING,
  title STRING,
  caseno STRING,
  casereason STRING,
  partys STRING,
  court STRING,
  casedate STRING,
  pdate STRING,
  province STRING,
  city STRING,
  keywords STRING,
  datasource STRING,
  noticeaddr STRING,
  docuclass STRING,
  docuclassname STRING,
  targetamount STRING,
  collectiondate STRING,
  crawlerid STRING,
  inputtime STRING,
  judgementresult STRING,
  structedent STRING,
  partyent STRING,
  plaintiffent STRING,
  structedinfo STRING,
  checkstatus STRING,
  updatetime STRING,
  pronouns STRING,
  judicialprocess STRING,
  doctpye STRING,
  judgebasis STRING,
  courtclass STRING,
  nerinfo STRING,
  docid STRING
)
 COMMENT '临时表01 -- 诉讼结构化数据01'
STORED AS PARQUET;
-- 将近七天增量数据插入临时表
insert overwrite table fin_dw.t_m_lira_courtbulletin_solr_01_01
select 
  cbid,
  batchno,
  uuid,
  ptype,
  ptypename,
  title,
  caseno,
  casereason,
  partys,
  court,
  casedate,
  pdate,
  province,
  city,
  keywords,
  datasource,
  noticeaddr,
  docuclass,
  docuclassname,
  targetamount,
  collectiondate,
  crawlerid,
  inputtime,
  judgementresult,
  structedent,
  partyent,
  plaintiffent,
  structedinfo,
  checkstatus,
  updatetime,
  pronouns,
  judicialprocess,
  doctpye,
  judgebasis,
  courtclass,
  nerinfo,
  docid
  from fin_operation.lira_courtbulletin_solr
  where ptype in ('14','15','16','17','18','19','22','23')
	and nvl(structedent,'')<>'' and updatetime > to_timestamp(from_unixtime(unix_timestamp(subdate(now(), 7))), 'yyyy-MM-dd')
;

--增量数据插入目标表
insert overwrite table fin_dw.m_lira_courtbulletin_solr_01
select 
  a.cbid,
  a.batchno,
  a.uuid,
  a.ptype,
  a.ptypename,
  a.title,
  a.caseno,
  a.casereason,
  a.partys,
  a.court,
  a.casedate,
  a.pdate,
  a.province,
  a.city,
  a.keywords,
  a.datasource,
  a.noticeaddr,
  a.docuclass,
  a.docuclassname,
  a.targetamount,
  a.collectiondate,
  a.crawlerid,
  a.inputtime,
  a.judgementresult,
  a.structedent,
  a.partyent,
  a.plaintiffent,
  a.structedinfo,
  a.checkstatus,
  a.updatetime,
  a.pronouns,
  a.judicialprocess,
  a.doctpye,
  a.judgebasis,
  a.courtclass,
  a.nerinfo,
  a.docid
from fin_dw.m_lira_courtbulletin_solr_01 as a 
left join fin_dw.t_m_lira_courtbulletin_solr_01_01 as b
on a.cbid = b.cbid
where b.cbid is null -- 过滤出没有变化的数据
union all 
select 
  cbid,
  batchno,
  uuid,
  ptype,
  ptypename,
  title,
  caseno,
  casereason,
  partys,
  court,
  casedate,
  pdate,
  province,
  city,
  keywords,
  datasource,
  noticeaddr,
  docuclass,
  docuclassname,
  targetamount,
  collectiondate,
  crawlerid,
  inputtime,
  judgementresult,
  structedent,
  partyent,
  plaintiffent,
  structedinfo,
  checkstatus,
  updatetime,
  pronouns,
  judicialprocess,
  doctpye,
  judgebasis,
  courtclass,
  nerinfo,
  docid
from fin_dw.t_m_lira_courtbulletin_solr_01_01 --近七天更新数据
;
posted @ 2022-08-10 20:45  赤兔胭脂小吕布  阅读(45)  评论(0)    收藏  举报