Notes to avoid fighting the same problems twice

Transferring lots of small files

A replacement for sftp

rsync -avz --progress -e ssh user@host:remote_path local_path

Speeding up PIL

pip uninstall pillow
sudo apt-get install libjpeg-dev
sudo apt-get install zlib1g-dev
sudo apt-get install libpng-dev
pip install pillow-simd

Hugging Face mirror

export HF_ENDPOINT=https://hf-mirror.com
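
When the download is started from a Python script instead of a shell, the same variable can be set in-process. A minimal sketch, assuming the Hugging Face libraries read HF_ENDPOINT when they are first imported, so the assignment has to come before that import. The commented import is only illustrative.

```python
import os

# Same mirror as the export line above; setdefault keeps a value
# that was already exported in the shell.
os.environ.setdefault("HF_ENDPOINT", "https://hf-mirror.com")

# Import Hugging Face libraries only after this point, for example:
# from huggingface_hub import snapshot_download

print(os.environ["HF_ENDPOINT"])
```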

Upgrading g++ inside conda

  • Latest
conda install -c conda-forge gcc gxx

You can also pin a version here by appending =11.2 to the package names.

  • 5.4.0
conda install https://anaconda.org/brown-data-science/gcc/5.4.0/download/linux-64/gcc-5.4.0-0.tar.bz2

One version combination that works for bias_act:
cuda=12.0 gcc=8.5.0 gxx=8.5.0

ZSH: the three essentials

wget https://gitee.com/mirrors/oh-my-zsh/raw/master/tools/install.sh
chmod +x install.sh
./install.sh

plugins=(git zsh-syntax-highlighting zsh-autosuggestions)

git clone https://github.com/zsh-users/zsh-syntax-highlighting.git ${ZSH_CUSTOM:-~/.oh-my-zsh/custom}/plugins/zsh-syntax-highlighting
git clone https://github.com/zsh-users/zsh-autosuggestions ${ZSH_CUSTOM:-~/.oh-my-zsh/custom}/plugins/zsh-autosuggestions
git clone https://github.com/romkatv/powerlevel10k.git ${ZSH_CUSTOM:-~/.oh-my-zsh/custom}/themes/powerlevel10k

ZSH_THEME="powerlevel10k/powerlevel10k"

Slurm: show why a node is drained (full-length reason)

sinfo -o "%200E %9u %19H %N"

Can't optimize non-leaf tensor

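When creating the tensor, call .cuda() before requires_grad_(). Calling .cuda() on a tensor that already requires grad returns a new non-leaf tensor, which the optimizer rejects.
Correct form: xxx = torch.zeros_like(ccc).cuda().requires_grad_(True)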
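
The ordering issue can be reproduced without a GPU. A minimal sketch, assuming only that torch is installed: a dtype conversion stands in for .cuda() in the wrong-order case so the snippet also runs on CPU-only machines, and ccc is just a placeholder tensor.

```python
import torch

ccc = torch.ones(2, 3)  # placeholder for the tensor from the note
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Wrong order: the tensor already requires grad when it is converted.
# The conversion (.double() here, standing in for .cuda()) is recorded
# by autograd, so the result is no longer a leaf.
bad = torch.zeros_like(ccc).requires_grad_(True).double()
print(bad.is_leaf)  # False

try:
    torch.optim.SGD([bad], lr=0.1)
except ValueError as err:
    print(err)

# Right order: move to the target device first, then mark it as trainable.
good = torch.zeros_like(ccc).to(device).requires_grad_(True)
print(good.is_leaf)  # True
optimizer = torch.optim.SGD([good], lr=0.1)
```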

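Uneven GPU memory usage

Every GPU's process takes up an extra 1789MB on GPU 0.
The likely cause is torch.load: it restores tensors onto the device they were saved from, so pass map_location='cpu'.
This mostly shows up with DECA; line 89 of deca.py needs the argument added.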
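
A minimal sketch of the map_location fix, assuming torch is installed. The checkpoint here is a hypothetical one written to a temporary directory only so the snippet is self-contained; it is not DECA's actual file.

```python
import os
import tempfile

import torch

# Hypothetical checkpoint, created here just for the demonstration.
ckpt_path = os.path.join(tempfile.mkdtemp(), "ckpt.pt")
torch.save({"weight": torch.ones(2, 2)}, ckpt_path)

# Without map_location, tensors saved from cuda:0 are restored onto cuda:0
# in every process. Loading onto the CPU first avoids that extra allocation.
state = torch.load(ckpt_path, map_location="cpu")
print(state["weight"].device)  # cpu

# Each process can then move the weights to its own device, e.g.
# model.load_state_dict(state); model.to(f"cuda:{local_rank}")
```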

Another cause is that NCCL itself needs GPU memory, and the more cards you use, the more it takes.

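A bizarre one: torch.jit.load hangs

After chasing this for a long time, it turned out that baton = FileBaton(os.path.join(build_directory, 'lock')) inside _jit_compile was hanging.

https://www.jianshu.com/p/a0d769971b2a gives the fix: clearing ~/.cache/torch_extensions is enough. It looks like a concurrency problem where the lock can never be acquired.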
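
The cleanup can also be scripted. A minimal sketch using only the standard library: clear_torch_extensions is a hypothetical helper name, and it assumes the cache lives either where the TORCH_EXTENSIONS_DIR environment variable points or at the default ~/.cache/torch_extensions.

```python
import os
import shutil
from pathlib import Path


def clear_torch_extensions(cache_root=None):
    """Delete the extension build cache; return True if something was removed."""
    if cache_root is None:
        cache_root = os.environ.get(
            "TORCH_EXTENSIONS_DIR",
            os.path.join(Path.home(), ".cache", "torch_extensions"),
        )
    cache_root = Path(cache_root)
    if not cache_root.is_dir():
        return False
    # Removes compiled extensions together with any stale 'lock' files.
    shutil.rmtree(cache_root)
    return True
```

Call clear_torch_extensions() only when no other process is compiling an extension, since the extensions are rebuilt from scratch on the next load.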

A nicer-looking bash prompt

Add to ~/.bashrc:

PS1="\[\033[m\]|\[\033[1;35m\]\t\[\033[m\]|\[\e[1;31m\]\u\[\e[1;36m\]\[\033[m\]@\[\e[1;36m\]\h\[\033[m\]:\[\e[0m\]\[\e[1;32m\][\W]> \[\e[0m\]"
alias ls='ls --color'

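Quick fix for VS Code Server "XHR failed"

Save as a shell script:

read commit_id

# Create the directory up front; ${commit_id} is the commit hash entered above (the long hex string, for newcomers)
mkdir -p ~/.vscode-server/bin/${commit_id}

# Enter the directory and download the server package
cd ~/.vscode-server/bin/${commit_id}
# This China mirror downloads fast; mind the Remote-SSH release channel, this URL is for stable
wget https://vscode.cdn.azure.cn/stable/${commit_id}/vscode-server-linux-x64.tar.gz --no-check-certificate

# Extract into the current directory; since we already cd'd, this is exactly where VS Code looks
# When the files are found, VS Code skips the download and directly starts the remote terminal and its processes
tar zxvf vscode-server-linux-x64.tar.gz --strip 1
# This command is especially important; without it the fix does not work
touch ~/.vscode-server/bin/${commit_id}/0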

clear does not work inside a conda environment

Either export these in the current shell or add them to your shell config file:

export TERMINFO=/usr/share/terminfo
export TERM=vt100

Installing g++ without root privileges

conda install https://anaconda.org/brown-data-science/gcc/5.4.0/download/linux-64/gcc-5.4.0-0.tar.bz2

Update: this also works

conda install -c conda-forge gcc
conda install -c conda-forge gxx
posted @ 2022-10-19 15:25  GhostCai