Study Notes, Sep 27

  1. Today at a Glance
  2. Environment Topology (after security is enabled)

    | Node  | New Role             | Sample Principal        |
    | ----- | -------------------- | ----------------------- |
    | node1 | KDC (Kerberos)       | kadmin/admin@HADOOP.COM |
    | node2 | YARN RM + Queue Mgmt | yarn/node2@HADOOP.COM   |
    | node3 | Phoenix RS           | hbase/node3@HADOOP.COM  |
  3. Key Concepts
    3.1 Kerberos Flow
    Client → AS: obtain a TGT
    TGT → TGS: exchange the TGT for a Service Ticket
    Service Ticket → Hadoop NameNode: complete the SASL handshake (see the shell check below)
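
    A quick way to watch this flow from a client shell; a minimal sketch, assuming the hdfs service keytab from section 4.1 is already distributed:

```bash
# Step 1: ask the AS for a TGT, non-interactively via the keytab
kinit -kt /etc/security/keytabs/hdfs.service.keytab hdfs/node1@HADOOP.COM

# The cache now holds krbtgt/HADOOP.COM@HADOOP.COM (the TGT);
# service tickets show up after the first authenticated RPC
klist

# Steps 2-3 happen implicitly: the client asks the TGS for a NameNode
# service ticket and completes the SASL handshake
hdfs dfs -ls /
```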
    3.2 YARN Capacity Scheduler
    Hierarchical queues: root → prod/dev → child queues spark/flink
    Preemption: fires when prod has less than 10% idle capacity while dev occupies more than 20% (our cluster's rule; the switches that enable it are sketched below)
    Per-user cap: yarn.scheduler.capacity.root.dev.user-limit-factor=0.5 (one dev user gets at most half the queue)
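
    The 10%/20% rule itself is cluster tuning; the generic yarn-site.xml switches that enable preemption are sketched here (property names from the stock ProportionalCapacityPreemptionPolicy; values illustrative):

```xml
<!-- yarn-site.xml: let the RM run a monitor that preempts containers -->
<property>
  <name>yarn.resourcemanager.scheduler.monitor.enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.monitor.policies</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy</value>
</property>
<!-- cap how much of the cluster may be reclaimed per round (default 0.1) -->
<property>
  <name>yarn.resourcemanager.monitor.capacity.preemption.total_preemption_per_round</name>
  <value>0.1</value>
</property>
```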
    3.3 Phoenix Indexes
    Global index = a separate physical table; covered (INCLUDE) columns avoid fetch-back to the data table
    Local index = lives in the same region as the data; efficient prefix filtering
    Write amplification: every write hits the WAL plus the index table; phoenix.index.wal.disabled=true trades crash safety for speed (index updates lost on a crash); DDL for both index types is sketched below
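
    Side-by-side DDL for the two index types; a sketch against the ORDERS table from section 4.4 (idx_user_g / idx_user_l are illustrative names):

```sql
-- Global index: materialized as its own table; INCLUDE makes it covering,
-- so queries on user_id never touch the base table
CREATE INDEX idx_user_g ON ORDERS(user_id) INCLUDE(amount);

-- Local index: co-located with the data regions; cheaper writes,
-- but reads fan out to every region
CREATE LOCAL INDEX idx_user_l ON ORDERS(user_id);
```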
    3.4 Data Warehouse Zipper Tables
    Close a chain: end_date = today - 1; open a chain: start_date = today (Hive sketch below)
    Hive SQL dedupes with row_number() over (partition by user_id order by ts)
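
    A minimal Hive sketch of the close/open step, assuming dwd.user_log_chain and ods.user_log from section 4.5 and '2025-09-24' as the load date. Dedup via row_number is omitted, and because this reads and overwrites the same table, production jobs usually stage into a dated partition first:

```sql
INSERT OVERWRITE TABLE dwd.user_log_chain
SELECT * FROM (
  -- existing rows: close open chains for users seen again today
  SELECT c.user_id, c.event_type, c.ts, c.start_date,
         CASE WHEN c.end_date = '9999-12-31' AND n.user_id IS NOT NULL
              THEN cast(date_sub('2025-09-24', 1) AS string)  -- close: end_date = today - 1
              ELSE c.end_date END AS end_date
  FROM dwd.user_log_chain c
  LEFT JOIN (SELECT DISTINCT user_id FROM ods.user_log) n
    ON c.user_id = n.user_id
  UNION ALL
  -- new rows: open a fresh chain starting today
  SELECT user_id, event_type, ts,
         '2025-09-24' AS start_date, '9999-12-31' AS end_date
  FROM ods.user_log
) merged;
```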
  4. Hands-On Walkthrough
    4.1 Installing Kerberos

```bash
# node1
yum -y install krb5-server krb5-workstation
vim /var/kerberos/krb5kdc/kdc.conf   # realms = HADOOP.COM
kdb5_util create -s
systemctl enable --now krb5kdc kadmin

# Create principals
kadmin.local -q "addprinc -randkey hdfs/node1@HADOOP.COM"
kadmin.local -q "xst -k /etc/security/keytabs/hdfs.service.keytab hdfs/node1@HADOOP.COM"

# Distribute the keytab to all nodes and chmod 400
```
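
After distribution, a cheap sanity check on each node (both commands ship with krb5-workstation):

```bash
# Entries and KVNOs in the keytab should match what the KDC issued
klist -kt /etc/security/keytabs/hdfs.service.keytab

# Authenticate once from the keytab, then throw the ticket away
kinit -kt /etc/security/keytabs/hdfs.service.keytab hdfs/node1@HADOOP.COM && kdestroy
```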

    4.2 Enabling Security in Hadoop

core-site.xml:

```xml
<property><name>hadoop.security.authentication</name><value>kerberos</value></property>
<property><name>hadoop.security.authorization</name><value>true</value></property>
```

hdfs-site.xml:

```xml
<property><name>dfs.block.access.token.enable</name><value>true</value></property>
<property><name>dfs.datanode.data.dir.perm</name><value>700</value></property>
```

Startup order:

```bash
kinit -kt /etc/security/keytabs/hdfs.service.keytab hdfs/node1@HADOOP.COM
start-dfs.sh
```

Verification: with no ticket, hdfs dfs -ls / fails with Permission denied (user=root, code=401).

    4.3 YARN Capacity Queues

capacity-scheduler.xml:

```xml
<property><name>yarn.scheduler.capacity.root.queues</name><value>prod,dev</value></property>
<property><name>yarn.scheduler.capacity.root.prod.capacity</name><value>70</value></property>
<property><name>yarn.scheduler.capacity.root.dev.capacity</name><value>30</value></property>
<property><name>yarn.scheduler.capacity.root.dev.maximum-allocation-vcores</name><value>4</value></property>
<property><name>yarn.scheduler.capacity.root.prod.user-limit-factor</name><value>1</value></property>
<property><name>yarn.scheduler.capacity.root.prod.state</name><value>RUNNING</value></property>
<property><name>yarn.scheduler.capacity.root.dev.state</name><value>RUNNING</value></property>
<property><name>yarn.scheduler.capacity.root.prod.acl_submit_applications</name><value>prod</value></property>
<property><name>yarn.scheduler.capacity.root.dev.acl_submit_applications</name><value>dev</value></property>
```

Refresh:

```bash
yarn rmadmin -refreshQueues
```

Submit to the dev queue:

```bash
spark-submit --master yarn --queue dev --class ...
```

    4.4 Phoenix Index Experiment

```sql
-- Create the orders base table
CREATE TABLE ORDERS (
  id VARCHAR PRIMARY KEY,
  user_id VARCHAR,
  amount DECIMAL,
  ts BIGINT
) SALT_BUCKETS=6;

-- Write 10M rows
upsert into orders select ...

-- Query without an index
select * from orders where user_id='u1234';   -- 14.2 s, full table scan

-- Create a global index
CREATE INDEX idx_user ON ORDERS(user_id) INCLUDE(amount);

-- Same query now takes 1.1 s; EXPLAIN shows RANGE SCAN OVER idx_user
```
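
To reproduce the plan check, Phoenix's standard EXPLAIN works directly in sqlline:

```sql
-- Should report a RANGE SCAN over the index table instead of a full scan
EXPLAIN SELECT * FROM ORDERS WHERE user_id = 'u1234';
```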
Write test with 100k rows:
- without the index: 6.8 s
- with the index: 7.3 s
Throughput drops by ≈ 7%, which is acceptable.
    4.5 Data Warehouse Layers

ODS layer (raw):

```sql
CREATE EXTERNAL TABLE ods.user_log(
  user_id STRING,
  event_type STRING,
  ts BIGINT,
  json STRING
) STORED AS TEXTFILE
LOCATION '/data/ods/user_log/';

LOAD DATA INPATH '/tmp/user_log_20250924.txt' INTO TABLE ods.user_log;
```
DWD layer (cleaning + zipper):

```sql
WITH tmp AS (
  SELECT *, row_number() OVER (PARTITION BY user_id ORDER BY ts DESC) rn
  FROM ods.user_log
)
INSERT OVERWRITE TABLE dwd.user_log_chain
SELECT user_id, event_type, ts, '2025-09-24' start_date, '9999-12-31' end_date
FROM tmp
WHERE rn = 1;
```
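
With the chain in place, any past date can be reconstructed with a range predicate; a small sketch (same pattern the DWS query below uses):

```sql
-- All records whose validity interval covers 2025-09-20
SELECT user_id, event_type
FROM dwd.user_log_chain
WHERE start_date <= '2025-09-20' AND end_date >= '2025-09-20';
```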
DWS layer (daily-active wide table):

```sql
-- dt must be a partition column for the static-partition insert below
CREATE TABLE dws.user_daily (
  user_id STRING,
  first_event STRING,
  last_event STRING,
  event_cnt INT
) PARTITIONED BY (dt STRING)
STORED AS ORC;

INSERT OVERWRITE TABLE dws.user_daily PARTITION(dt='2025-09-24')
SELECT user_id, min(event_type), max(event_type), count(*)
FROM dwd.user_log_chain
WHERE start_date <= '2025-09-24' AND end_date >= '2025-09-24'
GROUP BY user_id;
```
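
A quick spot check on the freshly written partition (plain Hive, assuming the insert above has run):

```sql
-- The new partition should be listed, and the row count should be plausible
SHOW PARTITIONS dws.user_daily;
SELECT count(*) FROM dws.user_daily WHERE dt = '2025-09-24';
```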
Result: 1 GB of raw data shrinks to a 350 MB wide table.
    4.6 Differential Snapshots & Rolling Deletion

```bash
# Take a differential snapshot based on the last one
$VMRUN snapshot node1.vmx "diff_$(date +%F)" -memory false -quiesce true

# Locate the most recent full snapshot (newest mtime wins)
LAST=$(find /backup -name "full_*.vmsn" -printf '%T@ %p\n' | sort -n | tail -1 | awk '{print $2}')

# Drop snapshots older than 7 days
find /backup -name "*.vmsn" -mtime +7 -delete
```
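
Before wiring the rolling deletion into cron, a dry run is cheap (same predicate, -print instead of -delete):

```bash
# List what would be removed, without deleting anything
find /backup -name "*.vmsn" -mtime +7 -print
```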
Space savings:
- full_20Sep: 37 GB
- diff_25Sep: 14 GB (incremental changes only)
