第三周周总结——kettle的简单使用以及MYSQL数据库去重

一、kettle的简单使用

主要是跟据德拓云端培训门户中的“大数据采集技术与应用”来进行学习,链接:http://113.31.104.47/portal/#/course/dashboard/b34d160db64624732ef152a1118af11a

但是在安装spoon的过程中,自己的spoon不能正确打开,出现的问题以及解决方法如下:

出现的问题:双击spoon.bat文件之后,cmd出现1s消失,spoon未打开。

解决方法:文本文件的方式打开spoon.bat,

if "%PENTAHO_DI_JAVA_OPTIONS%"=="" set PENTAHO_DI_JAVA_OPTIONS="-Xms1024m" "-Xmx2048m" "-XX:MaxPermSize=256m"

修改为

if "%PENTAHO_DI_JAVA_OPTIONS%"=="" set PENTAHO_DI_JAVA_OPTIONS="-Xms512m" "-Xmx512m" "-XX:MaxPermSize=256m"

完成后,重新运行Spoon.bat,即可成功运行。
来源:https://blog.csdn.net/yue31313/article/details/79898002

 

另外,使用spoon连接MYSQL数据库时,还需注意将自己数据库对应的驱动包,放入软件的lib目录下!!!

二、MYSQL数据库去重操作

(1)sql语句一

下面方法我在使用时数据库会报错,sql语句应用时出错:

You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ------;

查阅资料可能跟数据库的类型有关,我觉得下列语句更适用于Oracle数据库,因为oracle数据库的表中的每一行数据都有一个唯一的标识符,或者称为rowid,下列语句中对应的即Oracle中的rowid。

1、查找表中多余的重复记录,重复记录是根据单个字段(peopleId)来判断

select * from people
where peopleId in (select peopleId from people group by peopleId having count(peopleId) > 1)

2、删除表中多余的重复记录,重复记录是根据单个字段(peopleId)来判断,只留有rowid最小的记录

delete from people 
where peopleId in (select peopleId from people group by peopleId having count(peopleId) > 1)
and rowid not in (select min(rowid) from people group by peopleId having count(peopleId )>1)

3、查找表中多余的重复记录(多个字段) 

select * from vitae a
where (a.peopleId,a.seq) in (select peopleId,seq from vitae group by peopleId,seq having count(*) > 1)

4、删除表中多余的重复记录(多个字段),只留有rowid最小的记录

delete from vitae a
where (a.peopleId,a.seq) in (select peopleId,seq from vitae group by peopleId,seq having count(*) > 1)
and rowid not in (select min(rowid) from vitae group by peopleId,seq having count(*)>1)

5、查找表中多余的重复记录(多个字段),不包含rowid最小的记录

select * from vitae a
where (a.peopleId,a.seq) in (select peopleId,seq from vitae group by peopleId,seq having count(*) > 1)
and rowid not in (select min(rowid) from vitae group by peopleId,seq having count(*)>1)

————————————————
版权声明:本文为CSDN博主「Java高知」的原创文章。
原文链接:https://blog.csdn.net/tjcyjd/article/details/8950621

(2)sql语句二

接着查询,我发现了另一种适合自己的方法,即利用下面的代码来进行去重并保留一条的操作;

DELETE
FROM
    dept
WHERE
    (dname, db_source) IN (
        SELECT
            t.dname,
            t.db_source
        FROM
            (
                SELECT
                    dname,
                    db_source
                FROM
                    dept
                GROUP BY
                    dname,
                    db_source
                HAVING
                    count(1) > 1
            ) t
    )
AND deptno NOT IN (
    SELECT
        dt.mindeptno
    FROM
        (
            SELECT
                min(deptno) AS mindeptno
            FROM
                dept
            GROUP BY
                dname,
                db_source
            HAVING
                count(1) > 1
        ) dt
)

来源:

 

版权声明:本文为CSDN博主「千千的大千世界」的原创文章。
原文链接:https://blog.csdn.net/n950814abc/article/details/82284838

posted @ 2020-03-07 17:13  祺&Qi  阅读(551)  评论(0编辑  收藏  举报