异常企业数据清洗
一、数据导入
(1)增值税发票数据,文件名zzsfp
(2)发票对应货物明细数据,文件名zzsfp_hwmx
(3)企业信息,文件名nsrxx
建表语句:
CREATE EXTERNAL TABLE hive.zzsfp(
fp_nid STRING,
xf_id STRING,
gf_id STRING,
je float,
se float,
jshj float,
kpyf STRING,
kprq STRING,
zfbz STRING)
COMMENT 'Now create hive.zzsfp!'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/zzsfp'
CREATE EXTERNAL TABLE hive.zzsfp_hwmx(
fp_nid STRING,
date_key STRING,
hwmc STRING,
ggxh STRING,
dw STRING,
sl double,
dj double,
je double,
se double,
spbm string
)
COMMENT 'Now create hive.zsfp_hwmx'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/zzsfp_hwmx'
CREATE EXTERNAL TABLE hive.nsrxx(
hydm STRING,
nsr_id STRING,
djzclx_dm STRING,
kydjrq STRING,
xgrq STRING,
label STRING
)
COMMENT 'Now create hive.nsrxx!'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/nsrxx'
二、数据清洗
1.去掉zzsfp表中多余的括号
insert into zzsfp1 select regexp_replace(fp_nid, "\\(", '') as fp_nid,xf_id,gf_id ,je,se, jshj,kpyf,kprq,regexp_replace(zfbz, "\\)", '') as zfbz from zzsfp (下面两个语句类似)
2.去掉nsrxx表中多余的括号
3.去掉zzsfp_hwmx中多余的引号
至此,三个表数据清洗完毕
三、数据分析
总览:增值税发票总数:4436075 企业总数33829
3.1企业开发票情况分析:
增值税发票表中销方企业有119633家,说明增值税发票表中的发票有不属于33829家企业以内的企业。
查询后共有22788家企业在销方企业中
购方企业有162760家,说明增值税发票表中的发票有不属于33829家企业以内的企业。
查询后共有29173家企业在购方企业中
销项发票共有2168877张 进项发票440003张
3.2建立销项发票表和进项发票表
3.2.1销项发票表
进一步分离出
3.2.2进项发票表
3.3异常企业筛选标准一:企业增值税发票进项与出项严重不符即出现只出不进或者只进不出的企业
3.3.1建立不开进项发票的企业id表
CREATE EXTERNAL TABLE hive.nojx_qiye(
Qiye_id int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
然后查看没有开进项发票企业开销项发票的情况然后筛选出销项发票<2的企业
select xf_id,count(fp_nid) from xxfp where xf_id in (select qiye_id from nojx_qiye) group by xf_id having count(fp_nid < 2);
共308家,查询发现这些企业销项发票总数均为1;
3.3.2建立没有开销项发票的企业id表
CREATE EXTERNAL TABLE hive.noxx_qiye(
Qiye_id int
)ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
然后查看没有开销项发票企业开进项发票的情况然后筛选出进项发票<2的企业
select gf_id,count(fp_nid) from jxfp where gf_id in (select qiye_id from noxx_qiye) group by gf_id having count(fp_nid < 2);
共1493家
CREATE EXTERNAL TABLE hive.3_1(
Id_1 String,
Count int
)
Insert into 3_1 select xf_id,count(fp_nid) from xxfp where xf_id in (select qiye_id from nojx_qiye) group by xf_id having count(fp_nid < 2);
CREATE EXTERNAL TABLE hive.3_2(
Id_2 String,
Count int
)
CREATE EXTERNAL TABLE hive.3_3(
Id_3 String,
Count int
)
3.3.3调整发票数量参数结合两个表得出可能异常的企业id
四、数据导入
将异常企业id导入mysql
异常企业表:
CREATE EXTERNAL TABLE hive.error(
id String
)
4.1创建临时表
create table hive.error_inner(id string) COMMENT 'Welcome to XMU dblab! Now create inner table inner_user_log ' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
在mysql中创建表:
Sqoop命令进行导入
sqoop export --connect jdbc:mysql://localhost:3306/dbtaobao --username root --password 123456 --table error --export-dir '/user/hive/warehouse/hive.db/error_inner' --fields-terminated-by ',';
查看mysql数据库,发现数据已成功导入,
至此,异常企业清洗完毕。
浙公网安备 33010602011771号