python使用impyla链接Hive查询数据
系统依赖:
linux: # 查看是否安装 yum list installed #yum更多命令:https://www.cnblogs.com/AganRun/p/8118339.html # 安装 yum install gcc libffi-devel python-devel openssl-devel -y windows: Microsoft Visual C++ 14.0 #下载地址:https://pan.baidu.com/s/1zW9bEaPEUVaKJJFxNWjM3g 提取码: tefz
pip安装python2.7依赖模块:
# 依赖包异常地址 # https://www.lfd.uci.edu/~gohlke/pythonlibs/ pip install bitarray==0.9.3 pip install pure-sasl==0.6.1 pip install thrift_sasl==0.2.1 --no-deps pip install thrift==0.9.3 pip install thriftpy==0.3.9 pip install impyla==0.14.1
pip install six==1.9.0
pip安装python3X依赖
pip install bitarray==1.1.0 pip install pure-sasl== 0.6.2 pip install thrift_sasl==0.4a1 --no-deps pip install thrift==0.13.0 pip install thriftpy2== 0.4.8 pip install impyla==0.16.2
pip install six==1.12.0
安装时候遇到的坑:
#坑ONE(centos7安装) #安装bitarray报错:command 'gcc' failed with exit status 1 . . . creating build/lib.linux-x86_64-2.7 creating build/lib.linux-x86_64-2.7/bitarray copying bitarray/__init__.py -> build/lib.linux-x86_64-2.7/bitarray copying bitarray/test_bitarray.py -> build/lib.linux-x86_64-2.7/bitarray running build_ext building 'bitarray._bitarray' extension creating build/temp.linux-x86_64-2.7 creating build/temp.linux-x86_64-2.7/bitarray gcc -pthread -fno-strict-aliasing -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -DNDEBUG -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -fPIC -I/usr/include/python2.7 -c bitarray/_bitarray.c -o build/temp.linux-x86_64-2.7/bitarray/_bitarray.o bitarray/_bitarray.c:12:20: fatal error: Python.h: No such file or directory #include "Python.h" ^ compilation terminated. error: command 'gcc' failed with exit status 1 ---------------------------------------- ERROR: Command errored out with exit status 1: /bin/python2 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-QNmzub/bitarray/setup.py'"'"'; __file__='"'"'/tmp/pip-install-QNmzub/bitarray/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-FdlQ9q/install-record.txt --single-version-externally-managed --compile Check the logs for full command output. #因为缺少openssl等相关组件导致的,使用yum install gcc libffi-devel python-devel openssl-devel -y 安装即可解决
#坑TWO(windows环境3.6安装)
ERROR: Command errored out with exit status 1:
command: 'D:\python3.6\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\ly\\AppData\\Local\\Temp\\pip-install-2mhw5trv\\bitarr
ay\\setup.py'"'"'; __file__='"'"'C:\\Users\\ly\\AppData\\Local\\Temp\\pip-install-2mhw5trv\\bitarray\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__fil
e__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d 'C:\Users\ly\AppData\Local\Temp\pi
p-wheel-zpjie3cg' --python-tag cp36
cwd: C:\Users\ly\AppData\Local\Temp\pip-install-2mhw5trv\bitarray\
Complete output (11 lines):
running bdist_wheel
running build
running build_py
creating build
creating build\lib.win-amd64-3.6
creating build\lib.win-amd64-3.6\bitarray
copying bitarray\test_bitarray.py -> build\lib.win-amd64-3.6\bitarray
copying bitarray\__init__.py -> build\lib.win-amd64-3.6\bitarray
running build_ext
building 'bitarray._bitarray' extension
error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools
----------------------------------------
ERROR: Failed building wheel for bitarray
缺少C++组件,下载地址:https://pan.baidu.com/s/1zW9bEaPEUVaKJJFxNWjM3g 提取码: tefz 或者通过https://pypi.org/project/impyla/0.16.2/ 下载impyla使用python setup.py install安装
#坑THREE
WARNING:impala._thrift_api:Unable to import 'sasl'. Fallback to 'puresasl'.
INFO:thrift.transport.TSocket:Could not connect to ('172.18.35.108', 10000)
Traceback (most recent call last):
File "/usr/lib64/python2.7/site-packages/thrift/transport/TSocket.py", line 104, in open
handle.connect(sockaddr)
File "/usr/lib64/python2.7/socket.py", line 224, in meth
return getattr(self._sock,name)(*args)
error: [Errno 110] Connection timed out
ERROR:thrift.transport.TSocket:Could not connect to any of [('172.18.35.108', 10000)]
Traceback (most recent call last):
File "query_hive_by_impyla.py", line 88, in <module>
hive.get_connection()
File "query_hive_by_impyla.py", line 36, in get_connection
self.conn = connect(host=self.host, port=self.port, auth_mechanism='PLAIN')
File "/usr/lib/python2.7/site-packages/impala/dbapi.py", line 150, in connect
http_path=http_path)
File "/usr/lib/python2.7/site-packages/impala/hiveserver2.py", line 820, in connect
transport.open()
File "/usr/lib/python2.7/site-packages/thrift_sasl/__init__.py", line 61, in open
self._trans.open()
File "/usr/lib64/python2.7/site-packages/thrift/transport/TSocket.py", line 113, in open
raise TTransportException(TTransportException.NOT_OPEN, msg)
thrift.transport.TTransport.TTransportException: Could not connect to any of [('172.18.35.108', 10000)]
[superadmin@yc-nsg-d20 liying.a]$ python query_hive_by_impyla.py
WARNING:impala._thrift_api:Unable to import 'sasl'. Fallback to 'puresasl'.
Traceback (most recent call last):
File "query_hive_by_impyla.py", line 90, in <module>
print('result:'+hive.get_result())
File "query_hive_by_impyla.py", line 42, in get_result
if self.conn in None:
TypeError: argument of type 'NoneType' is not iterable
#坑FOUR 执行脚本报错: Traceback (most recent call last): File "query_hive_by_impyla.py", line 51, in <module> cursor = conn.cursor() File "/usr/lib/python2.7/site-packages/impala/hiveserver2.py", line 124, in cursor session = self.service.open_session(user, configuration) File "/usr/lib/python2.7/site-packages/impala/hiveserver2.py", line 1057, in open_session resp = self._rpc('OpenSession', req) File "/usr/lib/python2.7/site-packages/impala/hiveserver2.py", line 985, in _rpc response = self._execute(func_name, request) File "/usr/lib/python2.7/site-packages/impala/hiveserver2.py", line 1002, in _execute return func(request) File "/usr/lib/python2.7/site-packages/impala/_thrift_gen/TCLIService/TCLIService.py", line 175, in OpenSession self.send_OpenSession(req) File "/usr/lib/python2.7/site-packages/impala/_thrift_gen/TCLIService/TCLIService.py", line 182, in send_OpenSession args.write(self._oprot) File "/usr/lib/python2.7/site-packages/impala/_thrift_gen/TCLIService/TCLIService.py", line 1250, in write oprot.trans.write(fastbinary.encode_binary(self, (self.__class__, self.thrift_spec))) TypeError: expecting list of size 2 for struct args #乍一看完全蒙蔽,我没有定义集合或者任何东西,为啥报错,完全搞不懂,其实就是组件版本不对导致的问题
#可能一 最后和uat(我之前偶然间安装的能够正常执行的环境)对比发现impyla版本sit是0.16.2,uat是0.14.1
#可能二 thrift (0.13.0)导致的,按照上边版本要求降级为 thrift==0.9.3就好了 #那好卸载重新降级试试,果然降级之后一路畅通。可能因为自己运气比较好,好多错误都没遇到
# 坑FIVE(windows 3.7安装)
WARNING:impala._thrift_api:Unable to import 'sasl'. Fallback to 'puresasl'.
Traceback (most recent call last):
File "D:\PyCharm 2019.2.4\helpers\pydev\pydevd.py", line 1415, in _exec
pydev_imports.execfile(file, globals, locals) # execute the script
File "D:\PyCharm 2019.2.4\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "E:/Code/python/KFC20190830-py/hive/query_hive_by_impyla.py", line 88, in <module>
hive.get_connection()
File "E:/Code/python/KFC20190830-py/hive/query_hive_by_impyla.py", line 33, in get_connection
self.conn = connect(host=self.host, port=self.port, auth_mechanism='PLAIN')
File "D:\python3.6\lib\site-packages\impyla-0.16.2-py3.6.egg\impala\dbapi.py", line 150, in connect
http_path=http_path)
File "D:\python3.6\lib\site-packages\impyla-0.16.2-py3.6.egg\impala\hiveserver2.py", line 825, in connect
transport.open()
File "D:\python3.6\lib\site-packages\thrift_sasl\__init__.py", line 87, in open
self._send_message(self.START, chosen_mech)
File "D:\python3.6\lib\site-packages\thrift_sasl\__init__.py", line 106, in _send_message
self._trans.write(header + body)
TypeError: can't concat bytes to str
解决办法:
修改impyla包impyla/thrift路径下__init__.py的_send_message()函数,如下:
header = struct.pack(">BI", status, len(body))
if(type(body) is str):
body = body.encode()
self._trans.write(header + body)
self._trans.flush()
参考:https://blog.csdn.net/qq_35958094/article/details/83582865
#坑SIX
WARNING:impala._thrift_api:Unable to import 'sasl'. Fallback to 'puresasl'.
Traceback (most recent call last):
File "D:\PyCharm 2019.2.4\helpers\pydev\pydevd.py", line 1415, in _exec
pydev_imports.execfile(file, globals, locals) # execute the script
File "D:\PyCharm 2019.2.4\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "E:/Code/python/KFC20190830-py/hive/query_hive_by_impyla.py", line 87, in <module>
hive.get_connection()
File "E:/Code/python/KFC20190830-py/hive/query_hive_by_impyla.py", line 33, in get_connection
self.conn = connect(host=self.host, port=self.port, auth_mechanism='PLAIN')
File "D:\python3.6\lib\site-packages\impyla-0.16.2-py3.6.egg\impala\dbapi.py", line 150, in connect
http_path=http_path)
File "D:\python3.6\lib\site-packages\impyla-0.16.2-py3.6.egg\impala\hiveserver2.py", line 825, in connect
transport.open()
File "D:\python3.6\lib\site-packages\thrift_sasl\__init__.py", line 67, in open
if not self._trans.isOpen():
AttributeError: 'TSocket' object has no attribute 'isOpen'
因thrift-sasl版本为0.3.0导致的将版本升级为0.4或者降级为0.2.0即可
pip安装组件到指定site-packages
pip install -t /home/bigdata/local/python/lib/python2.7/site-packages kafka
# 其他方法,为尝试
Python 2.6:
python2.6 -m pip install beautifulsoup4
Python 2.7
python2.7 -m pip install beautifulsoup4
开发的python脚本:
#-*- coding:utf-8 –*- import logging import os import re import sys import time from logging.handlers import TimedRotatingFileHandler from impala.dbapi import connect reload(sys) sys.setdefaultencoding('utf-8') ''' author:杏仁拌饭 python-version: 2.7 date: 2019-11-26 13:57 description: ''' LOGGER_PATH = './log/queryHiveByImpyla' # Hive查询辅助类 class HiveConnect: def __init__(self, host='localhost', port=10000, user=None, password=None, hql='show databases'): self.host = host self.port = port self.user = user self.password = password self.hql = hql def get_connection(self): self.conn = None if self.user is None or self.password is None: # hive, 无密码, 无kerberos self.conn = connect(host=self.host, port=self.port, auth_mechanism='PLAIN') elif self.user is not None and self.password is not None: # impala, 有密码, 有kerberos self.conn = connect(host=self.host, port=self.port, auth_mechanism='PLAIN', user=self.user, password=self.password) def get_result(self): if self.conn is None: # 链接为空 logging.error('链接为空 :host:%s, port:%s user:%s, password:%s\n' % (self.host, self.port, self.user, self.password)) # 异常退出 sys.exit(0) cursor = self.conn.cursor() cursor.execute(self.hql) # 打印结果集的schema logging.info('query Result: %s' % cursor.description) # 将查询到的数据集合打印出来 return cursor.fetchall() # for result in cursor.fetchall(): # print(result) #初始化日志记录器 def initLogging(logFilename): """Init for logging """ checkFileExistence(logFilename) # time.strftime('%Y-%m-%d %H:%M:%S',time.localtime(time.time())) 获取并且格式化当前时间 # logFilename = time.strftime('%Y-%m-%d %H:%M:%S',time.localtime(time.time())) + logFilename log_fmt = '%(asctime)s\tFile \"%(filename)s\",line %(lineno)s\t%(levelname)s: %(message)s' formatter = logging.Formatter(log_fmt) log_file_handler = TimedRotatingFileHandler(filename=logFilename, when="d", interval=1, backupCount=15) log_file_handler.suffix = "%Y%m%d%H" log_file_handler.extMatch = re.compile(r"[\w\W]*\d{10}$") log_file_handler.setFormatter(formatter) logging.basicConfig(level=logging.INFO) logging.getLogger().addHandler(log_file_handler) # 创建文件夹 def checkFileExistence(fileName): # 判断文件是否存在 path.dirname(logFilename)-获取文件的父目录 not()时取反 path.abspath(logFilename) - 获取绝对路径 if not(os.path.exists(os.path.dirname(fileName))): # 创建文件夹 os.makedirs(os.path.dirname(fileName)) #------------------------------------------------------实现---------------------------------------------------------- if __name__ == '__main__': # 初始化日志程序 nowTime = time.strftime('%Y%m%d', time.localtime(time.time())) initLogging(LOGGER_PATH + nowTime + '.log') try: # hive = HiveConnect(host='youIp', port=10000) hive = HiveConnect(host='youIp', port=10000, hql='select * from recom_fusion.recommend limit 10') hive.get_connection() for result in hive.get_result(): # 输出第一行结果 print(result) for row in result: #输出第一行第一列结果 print(row) except Exception, e: logging.error('hive Exception, reason:%s' % e)
未使用组件,直接使用cmd方式查询:
#-*- coding:utf-8 –*- import commands import logging import os import re import sys import time from logging.handlers import TimedRotatingFileHandler reload(sys) sys.setdefaultencoding('utf-8') ''' author:杏仁拌饭 python-version: 2.7 date: 2019-11-25 16:57 description: ''' LOGGER_PATH = './log/queryHive' #-----------------------------------------------utils------------------------- #日志辅助类 class LoggerInit: def __init__(self, logpath='./runlog'): self.logFilename = logpath #初始化日志记录器 def initLogging(self): """Init for logging """ #校验日志文件是否存在 self.checkFileExistence() # time.strftime('%Y-%m-%d %H:%M:%S',time.localtime(time.time())) 获取并且格式化当前时间 # logFilename = time.strftime('%Y-%m-%d %H:%M:%S',time.localtime(time.time())) + logFilename log_fmt = '%(asctime)s\tFile \"%(filename)s\",line %(lineno)s\t%(levelname)s: %(message)s' formatter = logging.Formatter(log_fmt) log_file_handler = TimedRotatingFileHandler(filename=self.logFilename, when="d", interval=1, backupCount=15) log_file_handler.suffix = "%Y%m%d%H" log_file_handler.extMatch = re.compile(r"[\w\W]*\d{10}$") log_file_handler.setFormatter(formatter) logging.basicConfig(level=logging.INFO) logging.getLogger().addHandler(log_file_handler) # 创建文件夹 def checkFileExistence(self): # 判断文件是否存在 path.dirname(logFilename)-获取文件的父目录 not()时取反 path.abspath(logFilename) - 获取绝对路径 if not(os.path.exists(os.path.dirname(self.logFilename))): # 创建文件夹 os.makedirs(os.path.dirname(self.logFilename)) def exec_command(command): logging.info("执行语句:" + command) return commands.getstatusoutput(command) #----------------------------utils end----------------------------------------------- if __name__ == '__main__': # 初始化日志程序 nowTime = time.strftime('%Y%m%d', time.localtime(time.time())) logger = LoggerInit(logpath=LOGGER_PATH + nowTime + '.log') logger.initLogging() hql = 'hive -e "set hive.cli.print.header=true;set mapred.job.queue.name=pms;set mapred.job.name=[HQL]exps_prepro_query;select * from recom_fusion.recommend limit 10;"' (status, out) = exec_command(hql) if status == 0: logging.info("***************[DONE]hive sql: %s Done****************" % hql) logging.info("[OUT]: %s" % out) else: logging.error("*********************CMD ERROR************************") logging.error("[STATUS]: %d" % status) logging.error("[OUT]: %s" % out)
开发过程拜读的文章:
gcc错误:https://support.huaweicloud.com/trouble-ecs/zh-cn_topic_0171671781.html https://blog.csdn.net/hellofuturecyj/article/details/84312189
impaly依赖以及版本号和问题:https://blog.csdn.net/wangdi620/article/details/90291636 https://www.jianshu.com/p/798a76acda31
impaly与pyhive比较:https://blog.csdn.net/bluejoe2000/article/details/51527029 (个人首先使用的是pyhive,因为各种依赖组件下载和未知问题得不到解决才弃用pyhive另寻他法找到的impaly)
pyhive链接Hive:https://www.cnblogs.com/Jesse-Li/p/8933776.html https://blog.csdn.net/weixin_41734687/article/details/83934228 https://www.cnblogs.com/drjava/p/10631975.html https://blog.csdn.net/a6822342/article/details/80713652
HiveServer2应用介绍:https://www.jianshu.com/p/4ef28607fc04 https://zhidao.baidu.com/question/588756373084105205.html https://blog.csdn.net/calicobox/article/details/75098172
windows下载Thrift:http://thrift.apache.org/tutorial/py
windows环境下载python的一些组件:https://www.lfd.uci.edu/~gohlke/pythonlibs/#bitarray
安装sasl异常:https://www.jianshu.com/p/10aca9967e93
链接Hive方式:https://blog.csdn.net/clany888/article/details/82989068 https://ask.hellobi.com/blog/ysfyb/18251 https://www.cnblogs.com/xjh713/p/9813119.html
pip组件下载:https://pypi.org/project/bitarray/0.9.3/#files https://blog.csdn.net/weixin_34234829/article/details/94016630
python获取shell输出结果:https://www.cnblogs.com/meitian/p/6814591.html