Problems and solutions when running chineseocr on the GPU

System: Ubuntu 18.0

CUDA: 10.0.130

On this setup only tensorflow 1.14.0 or later could be imported; earlier versions fail at import with:

ImportError: libcudnn.so.7: cannot open shared object file: No such file or directory

Failed to load the native TensorFlow runtime.

See https://www.tensorflow.org/install/errors

for some common reasons and solutions. Include the entire stack trace
above this error message when asking for help.

 

chineseocr, on the other hand, supports tensorflow only up to 1.13.1; newer versions fail with:

tensorflow.python.framework.errors_impl.InvalidArgumentError: You must feed a value for placeholder tensor 'Placeholder_368' with dtype float and shape [2]
[[{{node Placeholder_368}}]]

Solution:

In keras_yolo3.py, lines 365-366:

boxes  = concatenate(boxes, axis=0)
scores = concatenate(scores, axis=0)

change to

boxes  = K.concatenate(boxes, axis=0)
scores = K.concatenate(scores, axis=0)

Installed versions after the change:

keras==2.2.4 
tensorflow==1.14.0 
tensorflow-gpu==1.14.0
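
To confirm that an environment really matches these pins, the output of `pip freeze` can be compared against them. This is only a minimal sketch: the helper names `parse_freeze` and `find_mismatches` are hypothetical and not part of chineseocr, and the sample text stands in for real `pip freeze` output.

```python
# Pins copied from the version list above.
PINS = {"keras": "2.2.4", "tensorflow": "1.14.0", "tensorflow-gpu": "1.14.0"}


def parse_freeze(text):
    """Parse `pip freeze` style lines ("name==version") into a dict."""
    installed = {}
    for line in text.splitlines():
        name, sep, version = line.strip().partition("==")
        if sep:  # skip lines that are not of the form name==version
            installed[name.lower()] = version
    return installed


def find_mismatches(installed, pins=PINS):
    """Return {package: (expected, found)} for every pin that is not met.

    A package that is not installed is reported with found=None.
    """
    mismatches = {}
    for name, expected in pins.items():
        found = installed.get(name)
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


# Sample input standing in for real `pip freeze` output (assumption).
sample = "Keras==2.2.4\ntensorflow==1.13.1\n"
print(find_mismatches(parse_freeze(sample)))
```

An empty dict means all three pins are satisfied.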

 

After upgrading, another error still appears:

2020-08-29 18:47:06.157935: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:502] remapper failed: Invalid argument: Subshape must have computed start >= end since stride is negative, but is 0 and 2 (computed from start 0 and end 9223372036854775807 over shape with rank 2 and stride-1)
2020-08-29 18:47:06.286745: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:502] remapper failed: Invalid argument: Subshape must have computed start >= end since stride is negative, but is 0 and 2 (computed from start 0 and end 9223372036854775807 over shape with rank 2 and stride-1)

Solution (see here for details):

In keras_yolo3.py, lines 120-121:

box_xy = (K.sigmoid(feats[..., :2]) + grid) / K.cast(grid_shape[::-1], K.dtype(feats))
box_wh = K.exp(feats[..., 2:4]) * anchors_tensor / K.cast(input_shape[::-1], K.dtype(feats))

change to

box_xy = (K.sigmoid(feats[..., :2]) + grid) / K.cast(grid_shape[..., ::-1], K.dtype(feats))
box_wh = K.exp(feats[..., 2:4]) * anchors_tensor / K.cast(input_shape[..., ::-1], K.dtype(feats))

 

PyTorch also raised an error:

Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 3 does not equal 0 (while checking arguments for cudnn_convolution)

Cause: PyTorch loads the weights and the input onto different devices.

When loading the trained weights, the original code kept them on the CPU (see here):

 map_location=lambda storage, loc: storage

Solution: load everything onto the same device:

device = torch.device('cuda', GPUID) if GPU and torch.cuda.is_available() else torch.device('cpu')
model = CRNN(32, 1, len(alphabet) + 1, 256, 1, lstmFlag=LSTMFLAG).to(device)
trainWeights = torch.load(ocrModel, map_location=device)
...
image = image.to(device)

 

When running TF, the GPU is detected, but it is not actually used at run time.

The detection code below is adapted from this article:

import tensorflow as tf

# List the devices TensorFlow can see (TF 1.x API).
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
print("============")

# Build a small graph.
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
# Create a session with log_device_placement set to True,
# so the device chosen for each op is logged.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Run the op.
print(sess.run(c))

 

It turned out that a message printed at import tensorflow had been overlooked:

2020-08-29 23:08:12.874793: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory
2020-08-29 23:08:12.874804: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1663] Cannot dlopen some GPU libraries. Skipping registering GPU devices...
2020-08-29 23:08:12.874832: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-08-29 23:08:12.874840: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0 1 2 3 
2020-08-29 23:08:12.874847: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N Y Y Y 
2020-08-29 23:08:12.874854: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 1: Y N Y Y 
2020-08-29 23:08:12.874861: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 2: Y Y N Y 
2020-08-29 23:08:12.874867: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 3: Y Y Y N
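
The condition reported in this log can also be checked directly from Python, without importing TensorFlow, by asking the dynamic loader to open the libraries. This is a minimal sketch: `can_load` is a hypothetical helper, and `libcudart.so.10.0` is included on the assumption that the CUDA 10.0 runtime is the other library TensorFlow needs here.

```python
import ctypes


def can_load(library_name):
    """Return True if the dynamic loader can open `library_name`."""
    try:
        ctypes.CDLL(library_name)
        return True
    except OSError:
        # Raised when the library is not on the loader's search path.
        return False


for name in ("libcudnn.so.7", "libcudart.so.10.0"):
    print(name, "OK" if can_load(name) else "NOT FOUND")
```

If `libcudnn.so.7` shows as NOT FOUND, TensorFlow will skip registering the GPU devices exactly as in the log above.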

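According to some articles, the cuDNN link may not have been set up when CUDA was installed. Fixing this needs sudo, so the solution will be added later; some references for now: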

https://github.com/tensorflow/tensorflow/issues/20271

https://blog.csdn.net/weixin_40298200/article/details/79420758

 

Update, solution: install cuDNN 7.6:

wget http://file.ppwwyyxx.com/nvidia/cudnn-10.0-linux-x64-v7.6.4.38.tgz
tar xzvf cudnn-10.0-linux-x64-v7.6.4.38.tgz

sudo cp cuda/include/cudnn*.h /usr/local/cuda-10.0/include
sudo cp cuda/lib64/libcudnn* /usr/local/cuda-10.0/lib64
sudo chmod a+r /usr/local/cuda-10.0/include/cudnn*.h /usr/local/cuda-10.0/lib64/libcudnn*
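
After copying, the installed cuDNN version can be read back from the header. This is a minimal sketch: `parse_cudnn_version` is a hypothetical helper, and it assumes the version macros (`CUDNN_MAJOR`, `CUDNN_MINOR`, `CUDNN_PATCHLEVEL`) are defined in `cudnn.h`, as they are in cuDNN 7.x.

```python
import re


def parse_cudnn_version(header_text):
    """Extract (major, minor, patch) from the text of cudnn.h, or None."""
    parts = []
    for key in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(r"^\s*#define\s+%s\s+(\d+)" % key,
                          header_text, re.MULTILINE)
        if match is None:
            return None
        parts.append(int(match.group(1)))
    return tuple(parts)


# Same include directory as in the cp command above.
HEADER = "/usr/local/cuda-10.0/include/cudnn.h"
try:
    with open(HEADER, errors="replace") as f:
        print("cuDNN version:", parse_cudnn_version(f.read()))
except OSError as err:
    print("cannot read", HEADER, "-", err)
```

For the archive used above, the result should start with 7 and 6.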

On how CUDA and cuDNN relate, see this article.

In addition, the nvcc command did not work on the server, because the CUDA directories were not on PATH and LD_LIBRARY_PATH. Add the following to ~/.bashrc:

if [ -d "/usr/local/cuda-10.0/bin/" ]; then
    export PATH=/usr/local/cuda-10.0/bin${PATH:+:${PATH}}
    export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
fi
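
After opening a new shell, the effect of this snippet can be verified from Python. This is a minimal sketch: `missing_entries` is a hypothetical helper, and the prefix `/usr/local/cuda-10.0` is taken from the snippet above.

```python
import os
import shutil

CUDA_HOME = "/usr/local/cuda-10.0"  # same prefix as in the ~/.bashrc snippet


def missing_entries(environ, cuda_home=CUDA_HOME):
    """Return the variables that do not contain the expected CUDA directory."""
    expected = {
        "PATH": cuda_home + "/bin",
        "LD_LIBRARY_PATH": cuda_home + "/lib64",
    }
    missing = []
    for var, directory in expected.items():
        # Compare entry by entry, ignoring a trailing slash.
        entries = [e.rstrip("/") for e in environ.get(var, "").split(":")]
        if directory not in entries:
            missing.append(var)
    return missing


print("variables missing the CUDA directory:", missing_entries(os.environ))
print("nvcc resolves to:", shutil.which("nvcc"))
```

An empty list and a non-None nvcc path indicate that the ~/.bashrc change has taken effect.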

 

posted @ 2020-08-30 02:25  Sherrrry