Kafka console consumer script:
./kafka-console-consumer.sh --bootstrap-server IP:9092 --topic topicname
Kafka console producer script:
./kafka-console-producer.sh --broker-list IP:9092 --topic testtopic
Reset a consumer group's offsets to a specified position:
./kafka-consumer-groups.sh --bootstrap-server IP:9092 --group testgroupid --reset-offsets --topic testtopic --to-offset 27543 --execute
./kafka-consumer-groups.sh --bootstrap-server IP:9092 --group testgroupid --reset-offsets --topic testtopic --to-latest --execute
This is generally only done when the data at some offset is corrupt and has to be skipped, or when there is a specific need to roll back/replay data.
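To preview what a reset would do before committing it, kafka-consumer-groups.sh also accepts --dry-run in place of --execute (a sketch reusing the group/topic names above):
./kafka-consumer-groups.sh --bootstrap-server IP:9092 --group testgroupid --reset-offsets --topic testtopic --to-offset 27543 --dry-run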
Exception encountered while programming against Kafka:
异常提示:org.apache.kafka.common.KafkaException: Record for partition <> at offset 449883 is invalid, cause: Record is corrupt (stored crc = 2171407101, computed crc = 1371274824)
The Apache JIRA has no resolution for it either: https://issues.apache.org/jira/browse/KAFKA-4888
In our case the corruption was caused by multiple consumer threads grabbing records from the same consumer without a lock (KafkaConsumer is not thread-safe); adding a lock restored normal operation.
Replace line breaks in a file with spaces from the shell:
cat gushiwen_content.txt | tr "\n" " "
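If the goal is to strip actual carriage-return characters (\r, e.g. from Windows line endings) rather than newlines, a tr variant (the output filename is just an example):
tr -d '\r' < gushiwen_content.txt > gushiwen_content_unix.txt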
Check a consumer group's consumption status (current offsets and lag):
./kafka-consumer-groups.sh --bootstrap-server IP:9092 --group estest --describe
Export data from Hive:
hive -e "select xxx,count(*) as cc from Axxx group by xxx order by cc desc limit 1000000000" >> /jiazhuang/test.txt
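An alternative export (Hive 0.11+) that writes directly to a directory with an explicit field delimiter; the output path here is made up:
hive -e "INSERT OVERWRITE LOCAL DIRECTORY '/jiazhuang/export' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' select xxx,count(*) as cc from Axxx group by xxx order by cc desc"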
---------------------------------------------------------------------------------------------------------------------------------------
Create a Hive external table over HBase
Create the HBase table
(1) Inspect the HBase table's structure
hbase(main):005:0> describe 'classes'
DESCRIPTION ENABLED
'classes', {NAME => 'user', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', true
VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => '2147483647', KEEP_DELETED_CELLS => '
false', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
(2) Insert 2 rows of data
put 'classes','001','user:name','jack'
put 'classes','001','user:age','20'
put 'classes','002','user:name','liza'
put 'classes','002','user:age','18'
(3) View the data in classes
hbase(main):016:0> scan 'classes'
ROW COLUMN+CELL
001 column=user:age, timestamp=1404980824151, value=20
001 column=user:name, timestamp=1404980772073, value=jack
002 column=user:age, timestamp=1404980963764, value=18
002 column=user:name, timestamp=1404980953897, value=liza
(4) Create the external Hive table and verify with a query
create external table classes(id int, name string, age int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,user:name,user:age")
TBLPROPERTIES("hbase.table.name" = "classes");
(5) Query from Hive and check the data
select * from classes;
OK
1 jack 20
2 liza 18
3 NULL NULL  -- row 003 (added separately with only an age cell) has no name, so name is padded with NULL; age is NULL because the stored value exceeds the int range
---------------------------------------------------------------------------------------------------------------------------------------
Hadoop commands:
hadoop fs -mkdir /user//jz
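Adding -p creates any missing parent directories as well:
hadoop fs -mkdir -p /user//jz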
Set up ZooKeeper + Kafka with Docker:
docker pull wurstmeister/zookeeper
docker pull wurstmeister/kafka
docker run -d --name zookeeper --publish 2181:2181 --volume /data/kafka:/data/kafka zookeeper:latest
docker run -d --name kafka --publish 9092:9092 --link zookeeper --env KAFKA_ZOOKEEPER_CONNECT=zookeeper:2181 --env KAFKA_ADVERTISED_HOST_NAME=IP --env KAFKA_ADVERTISED_PORT=9092 --volume /etc/localtime:/etc/localtime wurstmeister/kafka:latest
Enter a container with bash:
docker exec -it 775c7c9ee1e1 /bin/bash
View a container's logs (e.g. to check errors):
docker logs --tail=100 d59039c77409
AWK:
Merge multiple files into one, de-duplicating repeated lines:
awk '!a[$0]++' 1.txt 2.txt 3.txt
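If sorted output is acceptable, sort -u is an equivalent way to merge and de-duplicate:
sort -u 1.txt 2.txt 3.txt > merged.txt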
Shuffle the lines of a file and write the result out:
cat /data/jiazhuang/url/combine_1101.txt | awk -F "\t" ' BEGIN{ srand(); }{ value=int(rand()*10000000); print value"\t"$0 }' | sort | awk -F "\t" '{print $2}' > /data/jiazhuang/url/shuffle_1101.txt
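On systems with GNU coreutils, shuf does the same shuffle in one step (same input/output paths as above):
shuf /data/jiazhuang/url/combine_1101.txt > /data/jiazhuang/url/shuffle_1101.txt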
match() usage (drop any line that contains 第, 章 or chapterId=):
awk '{if(!match($0,"第") && !match($0,"章") && !match($0,"chapterId=")) print $0}'
Set difference of two files via shell (the grep below prints lines of a.txt that do not match any line of b.txt, i.e. roughly a - b; swap the two files for b - a):
grep -F -v -f b.txt a.txt | sort | uniq
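For an exact line-level difference (instead of grep's fixed-string matching), comm on sorted copies is an alternative; comm -23 prints lines that appear only in the first file:
comm -23 <(sort a.txt) <(sort b.txt) > a_minus_b.txt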
Resolving jar conflicts in a Maven project:
e.g. excluding netty:
<dependency>
    <groupId>org.elasticsearch.client</groupId>
    <artifactId>transport</artifactId>
    <version>5.2.2</version>
    <exclusions>
        <exclusion>
            <groupId>io.netty</groupId>
            <artifactId>netty-all</artifactId>
        </exclusion>
    </exclusions>
</dependency>
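To find which dependencies pull in the conflicting jar in the first place, Maven's dependency tree is the usual starting point (the -Dincludes filter is optional):
mvn dependency:tree -Dincludes=io.netty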
Test how the default ES analyzer tokenizes text:
curl 'http://MASTERIP:9200/webpage/_analyze?pretty=true' -d '{"text":"这里是好记性不如烂笔头感叹号的博客园"}'
Specify an analyzer (jieba):
curl 'http://XXXX:9202/webpage/_analyze?analyzer=jieba_index&pretty=true' -d '{"text":"这里是好记性不如烂笔头感叹号的博客园"}'
Specify an analyzer (IK):
curl 'http://MASTERIP:9200/webpage/_analyze?analyzer=ik_max_word&pretty=true' -d '{"text":"这里是好记性不如烂笔头感叹号的博客园"}'
A small difference in field types in Elasticsearch 5.x:
Difference between text and keyword:
A text field is analyzed by ES (with the analyzer specified in the mapping, or with the default analyzer from the .yml config if none is specified), so it is not suited to exact term queries;
a keyword field is not analyzed: it supports exact term queries, but not analyzed full-text (query_string style) matching.
The old string type is deprecated and has been split into text and keyword.
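A minimal mapping sketch of the usual pattern, an analyzed text field with a keyword sub-field for exact term queries (index, type and field names here are made up; syntax is for ES 5.x):
curl -X PUT "IP:9200/my_index" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "my_type": {
      "properties": {
        "title": {
          "type": "text",
          "fields": {
            "raw": { "type": "keyword", "ignore_above": 256 }
          }
        }
      }
    }
  }
}'
A term query on title.raw then matches the whole original string, while a match query on title goes through the analyzer.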
When submitting data to ES via bulk requests:
If an ActionRequestValidationException is thrown, it is usually because the bulk request's data set is empty.
If a RemoteTransportException is thrown, raise thread_pool.bulk.queue_size: 6000 in elasxxxxxxx.yml, or pass it at startup with -E thread_pool.bulk.queue_size=6000.
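To check whether bulk requests are actually piling up or being rejected, the bulk thread pool can be inspected with the _cat API (the columns below are standard _cat/thread_pool headers):
curl "IP:9200/_cat/thread_pool/bulk?v&h=node_name,name,active,queue,rejected"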
Troubleshooting an HBase error:
org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 91 actions: UnknownHostException: 91 times,
This usually means the hosts file only contains some of the ZooKeeper/HBase node IPs, so requests fail intermittently; adding the missing IP-to-hostname entries to /etc/hosts fixes it.
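The entries are plain IP-to-hostname mappings; for example (IPs and hostnames below are placeholders), append lines like these to /etc/hosts:
192.168.0.11  hbase-regionserver-1
192.168.0.12  zookeeper-node-1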
Elasticsearch tuning operations:
Disable swap:
swapoff -a
Force a segment merge:
curl -X POST "IP:9200(HTTP port)/indexname/_forcemerge"
Clear caches:
http://IP:9200(HTTP port)/*/_cache/clear
Optimize the index (_optimize is the pre-2.x name; from ES 2.x on it is replaced by _forcemerge above):
http://IP:9200(HTTP port)/*/_optimize
http://IP:9200(HTTP port)/indexname/_optimize
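The effect of a merge/optimize can be verified afterwards by listing per-index segment counts with the _cat API:
curl "http://IP:9200(HTTP port)/_cat/segments/indexname?v"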
Set a dynamic mapping in ES that ignores case (lowercase normalizer on keyword sub-fields):
curl -X PUT "XXXX:9201/my_index" -H 'Content-Type: application/json' -d'{"settings": { "analysis": { "normalizer": { "my_normalizer": { "type": "custom", "char_filter": [], "filter": ["lowercase", "asciifolding"] } } } }, "mappings": { "my_type": { "dynamic_templates": [ { "integers": { "match_mapping_type": "long", "mapping": { "type": "integer" } } }, { "strings": { "match_mapping_type": "string", "mapping": { "type": "text", "fields": { "keyword": { "type": "keyword", "normalizer": "my_normalizer", "ignore_above": 256 } } } } } ] } }}'
curl -X PUT "XXXX:9201/my_index/my_type/1" -H 'Content-Type: application/json' -d'{ "my_integer": 5, "my_string": "Some string" }'
Delete bad data (delete-by-query):
curl -X POST "http://IP:9200(HTTP port)/indexname/typename/_delete_by_query?conflicts=proceed" -H 'Content-Type: application/json' -d'
{
"query": {
"match_all": {}
}
}'
Passing parameters in pyspider:
In the first_page function, call self.crawl(url, callback=self.last_page, save={'current_url': url, 'path': path_dir}); in the last_page function, response.save['current_url'] and response.save['path'] return the values of url and path_dir.
Fixing SSL errors in pyspider:
Add the parameter validate_cert=False:
self.crawl('http://www.reeoo.com', callback=self.index_page, validate_cert=False)
Install pip3 under Python 3:
wget --no-check-certificate https://pypi.python.org/packages/source/p/pip/pip-8.0.2.tar.gz#md5=3a73c4188f8dbad6a1e6f6d44d117eeb
tar -zxvf pip-8.0.2.tar.gz
cd pip-8.0.2
python3 setup.py build
python3 setup.py install
Load two files from HDFS into Hive (load data inpath moves the files, so they disappear from their original HDFS location afterwards):
create table tmp_url1205 (url1 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
load data inpath '/user/final_1203_2.txt' into table tmp_url1205;
create table tmp_1205_2 (url2 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
load data inpath '/user/req1203.txt' into table tmp_1205_2;
Compute the difference of the two large files (URLs in tmp_1205_2 that are not in tmp_url1205):
nohup hive -e "SELECT url2 FROM testdb.tmp_1205_2 LEFT OUTER JOIN testdb.tmp_url1205 on (url1 = url2) where url1 is null " >/data/jz/url_dif.txt 2>&1 &
Count the lines of an HDFS file:
$ hadoop fs -cat /user/test.txt | wc -l
743504
Scan certain HBase columns and write the output to a file:
echo "scan 'namespace:table',{COLUMN=>['familyname:fieldkey','familyname:fieldkey'], LIMIT=>2000}" | ./hbase shell > /user/hbase_test.txt
Related reading:
https://www.cnblogs.com/BigFishFly/p/6380046.html
http://www.pyspider.cn/book/pyspider/Response-17.html
Hive parse_url() examples:
select parse_url('http://facebook.com/path/p1.php?query=1', 'HOST') from dual;      --- facebook.com
select parse_url('http://facebook.com/path/p1.php?query=1', 'REF') from dual;       --- NULL
select parse_url('http://facebook.com/path/p1.php?query=1', 'PATH') from dual;      --- /path/p1.php
select parse_url('http://facebook.com/path/p1.php?query=1', 'QUERY') from dual;     --- NULL
select parse_url('http://facebook.com/path/p1.php?query=1', 'FILE') from dual;      --- /path/p1.php?query=1
select parse_url('http://facebook.com/path/p1.php?query=1', 'AUTHORITY') from dual; --- facebook.com
select parse_url('http://facebook.com/path/p1.php?query=1', 'USERINFO') from dual;  --- NULL
================================================================================
Build Docker containers (ZK + Kafka):
docker run -d --name zookeeper2 --publish 2182:2181 wurstmeister/zookeeper:latest
docker run -d --name spiderkafka -p 9093:9093 --link zookeeper2 -e KAFKA_BROKER_ID=1 -e KAFKA_ZOOKEEPER_CONNECT=zookeeper2:2181 -e KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://IP.IP.IP.IP:9093 -e KAFKA_LISTENERS=PLAINTEXT://0.0.0.0:9093 -t wurstmeister/kafka
Create topics:
./kafka-topics.sh --zookeeper IP.IP.IP.IP:2182 --create --topic urlqueue --partitions 10 --replication-factor 1
./kafka-topics.sh --zookeeper IP.IP.IP.IP:2182 --create --topic spiderdata --partitions 20 --replication-factor 1
./kafka-console-consumer.sh --bootstrap-server IP.IP.IP.IP:9093 --topic spiderdata --from-beginning
================================================================================
Set topic data size limits (retention.bytes) and cleanup policy:
./kafka-topics.sh --zookeeper IP.IP.IP.IP:2182 --alter --topic urlqueue --config retention.bytes=10374182400
./kafka-topics.sh --zookeeper IP.IP.IP.IP:2182 --alter --topic spiderdata --config retention.bytes=30374182400
./kafka-topics.sh --zookeeper IP.IP.IP.IP:2182 --alter --topic urlqueue --config cleanup.policy=delete
./kafka-topics.sh --zookeeper IP.IP.IP.IP:2182 --alter --topic spiderdata --config cleanup.policy=delete
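The applied topic configs can be verified with --describe:
./kafka-topics.sh --zookeeper IP.IP.IP.IP:2182 --describe --topic urlqueue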
Kafka monitoring tool:
https://github.com/quantifind/KafkaOffsetMonitor/releases
