
Kafka console consumer script:

  ./kafka-console-consumer.sh --bootstrap-server IP:9092 --topic topicname

Kafka console producer script:

  ./kafka-console-producer.sh --broker-list IP:9092 --topic testtopic

Reset a consumer group's offset to a specific position:

  ./kafka-consumer-groups.sh --bootstrap-server IP:9092 --group testgroupid --reset-offsets  --topic testtopic --to-offset  27543 --execute

  ./kafka-consumer-groups.sh --bootstrap-server IP:9092 --group testgroupid --reset-offsets  --topic testtopic  --to-latest --execute

  Usually only run when a record at some offset is corrupt and has to be skipped, or when there is a special need to roll data back.
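  A note: depending on the Kafka version, kafka-consumer-groups.sh also accepts --dry-run for reset-offsets, which prints the planned offsets without applying them, so a reset can be previewed before rerunning with --execute:

  ./kafka-consumer-groups.sh --bootstrap-server IP:9092 --group testgroupid --reset-offsets --topic testtopic --to-offset 27543 --dry-run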

Exceptions hit while programming against Kafka:

  Exception message: org.apache.kafka.common.KafkaException: Record for partition <> at offset 449883 is invalid, cause: Record is corrupt (stored crc = 2171407101, computed crc = 1371274824)

  Not resolved on the Apache JIRA either: https://issues.apache.org/jira/browse/KAFKA-4888

  The cause was multiple threads sharing the consumer without a lock and racing for records, which corrupted the data; after adding a lock everything went back to normal.
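  To inspect the on-disk segment around a suspicious offset on the broker side, Kafka ships a log dump tool; the segment file path below is only an example:

  ./kafka-run-class.sh kafka.tools.DumpLogSegments --deep-iteration --print-data-log --files /kafka-logs/testtopic-0/00000000000000000000.log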

 

Shell: replace the newlines in a file with spaces:

cat gushiwen_content.txt | tr "\n" " " 
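If the goal is specifically to strip Windows carriage returns (\r) rather than joining lines, a small variant (the output file name here is just an example):

tr -d "\r" < gushiwen_content.txt > gushiwen_content_unix.txt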

Check a consumer group's consumption status:

  ./kafka-consumer-groups.sh --bootstrap-server IP:9092 --group estest --describe

Export data from Hive:

  hive -e "select xxx,count(*) as cc from Axxx group by xxx order by cc desc limit 1000000000" >> /jiazhuang/test.txt

---------------------------------------------------------------------------------------------------------------------------------------

Create a Hive external table over an HBase table

Create the HBase table

(1) Inspect the structure of the HBase table

hbase(main):005:0> describe 'classes'
DESCRIPTION ENABLED
'classes', {NAME => 'user', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', true
VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => '2147483647', KEEP_DELETED_CELLS => '
false', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
(2) Insert two rows

put 'classes','001','user:name','jack'
put 'classes','001','user:age','20'
put 'classes','002','user:name','liza'
put 'classes','002','user:age','18'
(3) Scan the data in classes

hbase(main):016:0> scan 'classes'
ROW COLUMN+CELL
001 column=user:age, timestamp=1404980824151, value=20
001 column=user:name, timestamp=1404980772073, value=jack
002 column=user:age, timestamp=1404980963764, value=18
002 column=user:name, timestamp=1404980953897, value=liza

(4) Create the external Hive table and verify with a query

create external table classes(id int, name string, age int)

STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,user:name,user:age")
TBLPROPERTIES("hbase.table.name" = "classes");

(5) Query from Hive and check the new data


select * from classes;
OK
1 jack 20
2 liza 18
3 NULL NULL -- NULLs here: row 003 has no name, so it is padded with NULL, and age is NULL because the value exceeds the int range

 

 ---------------------------------------------------------------------------------------------------------------------------------------

Hadoop commands:

  hadoop fs -mkdir /user//jz

Set up ZooKeeper + Kafka with Docker:
  docker pull wurstmeister/zookeeper

  docker pull wurstmeister/kafka

  docker run -d --name zookeeper --publish 2181:2181 --volume /data/kafka:/data/kafka zookeeper:latest

  docker run -d --name kafka --publish 9092:9092 --link zookeeper --env KAFKA_ZOOKEEPER_CONNECT=zookeeper:2181 --env KAFKA_ADVERTISED_HOST_NAME=IP --env KAFKA_ADVERTISED_PORT=9092 --volume /etc/localtime:/etc/localtime wurstmeister/kafka:latest

  Enter a container via bash:

  docker exec -it 775c7c9ee1e1 /bin/bash

  Check a container's error logs:

    docker logs --tail=100 d59039c77409

AWK:

 Merge the contents of multiple files into one, removing duplicate lines:

  awk '!a[$0]++' 1.txt  2.txt  3.txt

 

  Shuffle the lines and write them out:

  cat /data/jiazhuang/url/combine_1101.txt  | awk -F "\t"  ' BEGIN{ srand(); }{ value=int(rand()*10000000); print value"\t"$0 }' |  sort  | awk  -F  "\t" '{print $2}' > /data/jiazhuang/url/shuffle_1101.txt
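  A simpler alternative for the same shuffle, assuming GNU coreutils' shuf is available on the machine:

  shuf /data/jiazhuang/url/combine_1101.txt > /data/jiazhuang/url/shuffle_1101.txt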

  Usage of awk's match function:

        awk   '{if(!match($0,"第") && !match($0,"章") && !match($0,"chapterId=")) print $0}'

 Take the difference of two files' contents with a shell one-liner:

  a - b (lines in a.txt that do not appear in b.txt):

    grep -F -v -f b.txt a.txt | sort | uniq

 

  

How to resolve jar conflicts in a Maven project:

  e.g. netty:


<dependency>
  <groupId>org.elasticsearch.client</groupId>
  <artifactId>transport</artifactId>
  <version>5.2.2</version>
  <exclusions>
    <exclusion>
      <groupId>io.netty</groupId>
      <artifactId>netty-all</artifactId>
    </exclusion>
  </exclusions>
</dependency>
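To see which dependencies drag in the conflicting netty jars in the first place, the standard maven-dependency-plugin tree view is a quick diagnostic (run in the project root):

mvn dependency:tree -Dincludes=io.netty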

Test the tokenization of the default ES analyzer:
  curl 'http://MASTERIP:9200/webpage/_analyze?pretty=true' -d '{"text":"这里是好记性不如烂笔头感叹号的博客园"}'

Specify an analyzer (jieba_index):

  curl 'http://XXXX:9202/webpage/_analyze?analyzer=jieba_index&pretty=true' -d '{"text":"这里是好记性不如烂笔头感叹号的博客园"}'

Specify an analyzer (ik_max_word):
  curl 'http://MASTERIP:9200/webpage/_analyze?analyzer=ik_max_word&pretty=true' -d '{"text":"这里是好记性不如烂笔头感叹号的博客园"}'

 

 


A small difference in field types in Elasticsearch 5.X.X:
The difference between text and keyword:
A text field is analyzed (tokenized) by ES, using the analyzer specified on the field or, if none is specified, the default analyzer from ES's .yml; a term query against the full original value therefore generally does not match.
A keyword field is not analyzed, so exact term queries work on it, but it is not meant for full-text query_string searches.

 

The string type will be deprecated; it has been split into text and keyword.
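A minimal sketch of the practical difference; the index name test_index and the field names are made up for illustration. Exact term queries are normally aimed at the keyword field:

curl -X PUT "XXXX:9201/test_index" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "my_type": {
      "properties": {
        "title_text":    { "type": "text" },
        "title_keyword": { "type": "keyword" }
      }
    }
  }
}'

curl -X POST "XXXX:9201/test_index/my_type/_search" -H 'Content-Type: application/json' -d'
{ "query": { "term": { "title_keyword": "Some Exact Value" } } }'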

 

When submitting data to ES with bulk:

  If an ActionRequestValidationException is thrown, it is usually because the bulk request's data set is empty.

  If a RemoteTransportException is thrown, tune the setting in elasxxxxxxx.yml: thread_pool.bulk.queue_size: 6000, or pass it when starting the node: -E thread_pool.bulk.queue_size=6000
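  To check whether bulk requests are actually being queued up or rejected (and whether the bigger queue helps), the _cat thread pool API shows the live numbers; the columns listed are standard _cat fields:

  curl 'http://IP:9200/_cat/thread_pool/bulk?v&h=node_name,name,active,queue,rejected'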

Troubleshooting an HBase error:


org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 91 actions: UnknownHostException: 91 times,

 

This usually means the hosts file only contains the IPs of some of the ZooKeeper or HBase nodes, so the error shows up intermittently; just add the missing IP-to-hostname mappings in /etc/hosts.
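A sketch of what the /etc/hosts entries look like; the IPs and hostnames below are placeholders, and every ZooKeeper/HBase node in the cluster needs its own line:

192.168.0.11  hbase-master01
192.168.0.12  hbase-regionserver01
192.168.0.13  zookeeper01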

 

 

Elasticsearch tuning operations:

Disable swap:
swapoff -a

Force a segment merge:
curl -X POST "IP:9200(HTTP port)/indexname/_forcemerge"

Clear caches:
http://IP:9200(HTTP port)/*/_cache/clear


Index optimization:

http://IP:9200(HTTP port)/*/_optimize

http://IP:9200(HTTP port)/indexname/_optimize
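The last three lines above are endpoints rather than full commands; _cache/clear is invoked with POST, for example as below (note that newer ES versions drop _optimize in favor of the _forcemerge call shown earlier):

curl -X POST "http://IP:9200(HTTP port)/*/_cache/clear"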

 

How to set a dynamic mapping in ES that ignores case:

curl -X PUT "XXXX:9201/my_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "char_filter": [],
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "dynamic_templates": [
        {
          "integers": {
            "match_mapping_type": "long",
            "mapping": { "type": "integer" }
          }
        },
        {
          "strings": {
            "match_mapping_type": "string",
            "mapping": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "normalizer": "my_normalizer",
                  "ignore_above": 256
                }
              }
            }
          }
        }
      ]
    }
  }
}'

curl -X PUT "XXXX:9201/my_index/my_type/1" -H 'Content-Type: application/json' -d'{  "my_integer": 5,   "my_string": "Some string" }'


Delete bad data:
curl -X POST "http://IP:9200(HTTP port)/indexname/typename/_delete_by_query?conflicts=proceed" -H 'Content-Type: application/json' -d'
{
"query": {
"match_all": {}
}
}'

 

How to pass parameters in pyspider:

  In the first_page function, call self.crawl(url, callback=self.last_page, save={'current_url': url, 'path': path_dir}); then in the last_page function, response.save['current_url'] and response.save['path'] give back the values of url and path_dir.

 

Fix for SSL errors in pyspider:

  Add the parameter validate_cert=False:

  self.crawl('http://www.reeoo.com', callback=self.index_page, validate_cert=False)  

 

Install pip3 under Python 3:

wget --no-check-certificate  https://pypi.python.org/packages/source/p/pip/pip-8.0.2.tar.gz#md5=3a73c4188f8dbad6a1e6f6d44d117eeb

tar -zxvf pip-8.0.2.tar.gz

cd pip-8.0.2

python3 setup.py build

python3 setup.py install


Load two files from HDFS into Hive (the source files are removed from HDFS once loaded):
create table tmp_url1205 (url1 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;

load data inpath '/user/final_1203_2.txt' into table tmp_url1205;

create table tmp_1205_2  (url2 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;

load data inpath '/user/req1203.txt' into table tmp_1205_2;


Compute the difference of two large files:
nohup hive -e "SELECT url2 FROM testdb.tmp_1205_2 LEFT OUTER JOIN testdb.tmp_url1205 on (url1 = url2) where url1 is null " >/data/jz/url_dif.txt 2>&1 &

Count the lines of a file on HDFS:
$ hadoop fs -cat /user/test.txt | wc -l
743504

Scan specific HBase columns and write the output to a file:
echo "scan 'namespace:table',{COLUMN=>['familyname:fieldkey','familyname:fieldkey'], LIMIT=>2000}" | ./hbase shell > /user/hbase_test.txt



 

Related blogs for further reading:

  https://www.cnblogs.com/BigFishFly/p/6380046.html

  http://www.pyspider.cn/book/pyspider/Response-17.html

 
Installing the iostat command:
  The package is actually called sysstat:
  yum install sysstat, then iostat will work.
 
 
Hive's parse_url function:
parse_url(url, partToExtract[, key]) - extracts a part from a URL
Parses a URL string; the partToExtract options are [HOST, PATH, QUERY, REF, PROTOCOL, FILE, AUTHORITY, USERINFO].
 
Examples:
select parse_url('http://facebook.com/path/p1.php?query=1', 'PROTOCOL') from dual;   -- http
select parse_url('http://facebook.com/path/p1.php?query=1', 'HOST') from dual;       -- facebook.com
select parse_url('http://facebook.com/path/p1.php?query=1', 'REF') from dual;        -- NULL
select parse_url('http://facebook.com/path/p1.php?query=1', 'PATH') from dual;       -- /path/p1.php
select parse_url('http://facebook.com/path/p1.php?query=1', 'QUERY') from dual;      -- query=1
select parse_url('http://facebook.com/path/p1.php?query=1', 'FILE') from dual;       -- /path/p1.php?query=1
select parse_url('http://facebook.com/path/p1.php?query=1', 'AUTHORITY') from dual;  -- facebook.com
select parse_url('http://facebook.com/path/p1.php?query=1', 'USERINFO') from dual;   -- NULL
 
 


================================================================================
Build Docker containers (ZooKeeper + Kafka):
docker run -d --name zookeeper2 --publish 2182:2181 wurstmeister/zookeeper:latest


docker run -d --name spiderkafka -p 9093:9093 --link zookeeper2 -e KAFKA_BROKER_ID=1 -e KAFKA_ZOOKEEPER_CONNECT=zookeeper2:2181 -e KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://IP.IP.IP.IP:9093 -e KAFKA_LISTENERS=PLAINTEXT://0.0.0.0:9093 -t wurstmeister/kafka


Create topics:
./kafka-topics.sh --zookeeper IP.IP.IP.IP:2182 --create --topic urlqueue --partitions 10 --replication-factor 1
./kafka-topics.sh --zookeeper IP.IP.IP.IP:2182 --create --topic spiderdata --partitions 20 --replication-factor 1
./kafka-console-consumer.sh --bootstrap-server IP.IP.IP.IP:9093 --topic spiderdata --from-beginning

================================================================================

Set a size cap on topic data (retention):
./kafka-topics.sh --zookeeper IP.IP.IP.IP:2182 --alter --topic urlqueue --config retention.bytes=10374182400
./kafka-topics.sh --zookeeper IP.IP.IP.IP:2182 --alter --topic spiderdata --config retention.bytes=30374182400
./kafka-topics.sh --zookeeper IP.IP.IP.IP:2182 --alter --topic urlqueue --config cleanup.policy=delete
./kafka-topics.sh --zookeeper IP.IP.IP.IP:2182 --alter --topic spiderdata --config cleanup.policy=delete
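To confirm the overrides took effect, describe the topic and check its Configs line (same ZooKeeper address as above):

./kafka-topics.sh --zookeeper IP.IP.IP.IP:2182 --describe --topic urlqueue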

 

Kafka monitoring middleware:

https://github.com/quantifind/KafkaOffsetMonitor/releases

 

 
