Setting up a k8s cluster from 0 to 1

 

Documentation

Two official documents:

One is the official tutorial: https://kubernetes.io/docs/tutorials/

The other is the production-environment installation guide: https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/

The tutorial suggests using the learning environment to get started quickly. I tried it and found that the learning environment has a learning curve of its own, and it gets in the way of learning the real concepts and tools, since a newcomer to k8s already has plenty of new concepts and tools to absorb.

So I prepared two virtual machines and went straight to a production-style install: ubuntu24 server + ubuntu24 server

 

Preparation

sudo apt-get install -y apt-transport-https ca-certificates curl gpg

curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.32/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg

echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.32/deb/ /' | sudo tee /etc/apt/sources.list.d/kubernetes.list

sudo apt-get update

Disable swap

sudo swapoff -a
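Note that swapoff -a only lasts until the next reboot; to make it stick, the swap entry in /etc/fstab also has to be commented out. A minimal sketch (check your own /etc/fstab before running it):

sudo sed -i '/\sswap\s/ s/^/#/' /etc/fstab   # comment out any swap lines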

 

Install kubeadm

sudo apt-get install -y kubelet kubeadm kubectl

sudo apt-mark hold kubelet kubeadm kubectl
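The official install page also suggests enabling the kubelet service (the package normally does this already, so it's optional):

sudo systemctl enable --now kubelet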

 

Install Docker

sudo apt install docker.io

Configure containerd

sudo mkdir -p /etc/containerd/

containerd config default | sudo tee /etc/containerd/config.toml

sudo vim /etc/containerd/config.toml  # set SystemdCgroup = true
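If you prefer a non-interactive edit, a sed one-liner should do the same thing, assuming the stock template where the option appears exactly once as SystemdCgroup = false:

sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml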

sudo systemctl restart containerd

Management tool

crictl
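crictl needs to be pointed at the containerd socket, otherwise it cycles through deprecated endpoints and prints warnings. A minimal /etc/crictl.yaml, assuming containerd's default socket path:

cat <<EOF | sudo tee /etc/crictl.yaml
runtime-endpoint: unix:///run/containerd/containerd.sock
image-endpoint: unix:///run/containerd/containerd.sock
EOF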

Quick test

docker run hello-world 

Viewing log details

During deployment Docker is the underlying dependency and many problems surface there, so it pays to get into the habit of glancing at the Docker logs:

journalctl -u docker.service -f

journalctl -u containerd.service -f

 

Control-plane node

Generate the default config (not needed unless you want to customize it)

kubeadm config print init-defaults | sudo tee default.yaml
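If you do generate and customize it, the file can be fed back to init; otherwise the flags in the next step are enough:

sudo kubeadm init --config default.yaml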

Initialize the control plane

sudo kubeadm init --pod-network-cidr 192.168.100.0/24

The pod CIDR must be set, it has to match the value used later when editing the Calico config, and it must not overlap with the physical network.

If you don't want to use the NIC of the default gateway for cluster communication, specify the advertise address:

sudo kubeadm init --apiserver-advertise-address 192.168.100.105

sudo kubeadm init --apiserver-advertise-address 192.168.100.105 --pod-network-cidr 192.168.100.0/24

 

// Not sure why, but kubelet wouldn't start; it worked after swapoff -a.

If you are not root:

    mkdir -p $HOME/.kube
    sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
    sudo chown $(id -u):$(id -g) $HOME/.kube/config

If you are root:

    export KUBECONFIG=/etc/kubernetes/admin.conf

Rolling back an init

sudo kubeadm reset

sudo rm -rf /etc/cni/net.d

rm -rf $HOME/.kube

If the initialization runs into problems or fails, just reset before initializing again.

Reference: https://kubernetes.io/docs/reference/setup-tools/kubeadm/kubeadm-reset/

 

Install a CNI

There are plenty of CNIs; I blindly picked Calico.

Docs:

https://docs.tigera.io/calico/latest/getting-started/kubernetes/self-managed-onprem/onpremises

https://docs.tigera.io/calico/latest/getting-started/kubernetes/quickstart

https://docs.tigera.io/calico/latest/getting-started/kubernetes/hardway/overview

 

kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.29.2/manifests/tigera-operator.yaml

curl https://raw.githubusercontent.com/projectcalico/calico/v3.29.2/manifests/custom-resources.yaml -O

Edit custom-resources.yaml and change the cidr to 192.168.100.0/24.
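For reference, the bit that needs editing is the IP pool inside the Installation resource; it looks roughly like this (other fields left at whatever the downloaded manifest ships with):

apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  calicoNetwork:
    ipPools:
    - cidr: 192.168.100.0/24   # must match --pod-network-cidr from kubeadm init
      encapsulation: VXLANCrossSubnet
      natOutgoing: Enabled
      nodeSelector: all()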

kubectl create -f custom-resources.yaml

Uninstall

kubectl delete -f custom-resources.yaml

kubectl delete -f https://raw.githubusercontent.com/projectcalico/calico/v3.29.2/manifests/tigera-operator.yaml

 

Worker node

This command was generated by the kubeadm init above:

kubeadm join 192.168.56.105:6443 --token 1fz06v.zsbq4swb0743z6ov \
       --discovery-token-ca-cert-hash sha256:8d585c09512dec4d94a08cd3f098d7fbe15c5b2155ddd25fc7c8c54e182de7a8
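If the token has expired by the time the worker joins (tokens are valid for 24 hours by default), a fresh join command can be printed on the control-plane node:

kubeadm token create --print-join-command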

 

Let the control-plane node schedule workloads

kubectl taint nodes --all node-role.kubernetes.io/control-plane-

Done

Final result: a handful of pods end up running, as shown in the screenshot:

 

 

View nodes

kubectl get nodes -o wide

View containers

crictl ps -a      // shows the mapping between containers and pods

ctr container list     // this is another, lower-level containerd interface command
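Note that containers created through the CRI live in containerd's k8s.io namespace, so ctr usually needs the namespace flag to show anything:

sudo ctr -n k8s.io containers list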

View pods

kubectl get pods -A

kubectl describe pods -A | less

kubectl describe pods/csi-node-driver-ppxhc -n calico-system

View logs

journalctl -u kubelet -f

journalctl -u containerd.service -f

kubectl describe pods/csi-node-driver-ppxhc -n calico-system

kubectl logs -f pods/calico-node-2rprg -n calico-system   // the -f has to go in the right place

Run a command inside a pod

kubectl exec -it calico-node-2rprg -n calico-system -- bash

 

Deployment

kubectl create deployment kubernetes-bootcamp --image=gcr.io/google-samples/kubernetes-bootcamp:v1
kubectl get deployments

Deployment result:

 

 

Create a service

kubectl expose deployment/kubernetes-bootcamp --type="NodePort" --port 8080
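The assigned NodePort can be read straight from the service, e.g.:

kubectl get service kubernetes-bootcamp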

That NodePort (32480 in my case) can then be hit directly on the node IP:

curl 192.168.56.107:32480

(Interestingly, netstat does not show this port as being listened on; presumably k8s intercepts the traffic directly in the kernel, so it never reaches a transport-layer socket.)
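This matches how kube-proxy's default iptables mode works: the NodePort is implemented as DNAT rules in the kernel rather than a userspace listener. Something like this should show the relevant chain (assuming iptables mode):

sudo iptables -t nat -L KUBE-NODEPORTS -n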

 

Scale up

 kubectl scale deployments/kubernetes-bootcamp --replicas=4

 

Verify
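For example, checking that the deployment now reports 4/4 replicas and that the new pods are spread across the nodes:

kubectl get deployments
kubectl get pods -o wide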

 

Removing a node

kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

kubectl delete node <node-name>

 

Troubleshooting

Q:The connection to the server 192.168.56.107:6443 was refused - did you specify the right host or port?

A: The container runtime was not installed/configured correctly.

References:

https://blog.csdn.net/ldjjbzh626/article/details/128400797

https://kubernetes.io/docs/setup/production-environment/container-runtimes/

 

Q: bootcamp is always Pending

 

A:

Running kubectl describe pod kubernetes-bootcamp-9bc58d867-hmm4j shows:

0/2 nodes are available:
1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: },
1 node(s) had untolerated taint {node.kubernetes.io/not-ready: }.
preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling.

Check the error:

kubectl get nodes

 

journalctl -u kubelet -r

"Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"

 

Q: Calico won't start

Because of the firewall (GFW), registry-1.docker.io could not be reached.

 

 

Even after getting through the firewall it still failed, because the Alibaba Cloud DNS (223.5.5.5) was in use; after switching to 8.8.8.8 it worked.
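One way to confirm a DNS suspicion like this is to query both resolvers directly and compare the answers:

nslookup registry-1.docker.io 223.5.5.5
nslookup registry-1.docker.io 8.8.8.8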

 

Q: After a reset, creating the Calico resources fails

root@k8s-node0:~# kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.29.2/manifests/tigera-operator.yaml
error: error validating "https://raw.githubusercontent.com/projectcalico/calico/v3.29.2/manifests/tigera-operator.yaml": error validating data: failed to download openapi: 
Get "https://192.168.56.107:6443/openapi/v2?timeout=32s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly
because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes"); if you choose to ignore these errors, turn validation off with --validate=false

Cause: after a reset you also need to run: rm -rf $HOME/.kube

 

Q: After installing Calico, the csi-node-driver pod never reaches Running and is stuck in ContainerCreating

No idea why, but after restarting the kubelet, docker, and containerd services it magically recovered.
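The restarts amount to something like this (bottom layer first; the order is a guess, not something I verified):

sudo systemctl restart containerd
sudo systemctl restart docker
sudo systemctl restart kubelet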

 

Q: The calico-node pod running on node1 keeps restarting and repeatedly enters CrashLoopBackOff

Inspecting it shows the following log:

  Warning  Unhealthy       7s (x6 over 2m22s)     kubelet  Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix
/var/run/calico/bird.ctl: connect: connection refused

 

posted on 2025-02-18 16:15 by toong