python使用impyla链接Hive查询数据

系统依赖:

linux:
# 查看是否安装
yum list installed #yum更多命令:https://www.cnblogs.com/AganRun/p/8118339.html
# 安装
yum install gcc libffi-devel python-devel openssl-devel -y 

windows:
Microsoft Visual C++ 14.0 #下载地址:https://pan.baidu.com/s/1zW9bEaPEUVaKJJFxNWjM3g 提取码: tefz

 

pip安装python2.7依赖模块:

# 依赖包异常地址
# https://www.lfd.uci.edu/~gohlke/pythonlibs/
pip install bitarray==0.9.3
pip install pure-sasl==0.6.1
pip install thrift_sasl==0.2.1 --no-deps
pip install thrift==0.9.3
pip install thriftpy==0.3.9
pip install impyla==0.14.1
pip install six==1.9.0

pip安装python3X依赖

pip install bitarray==1.1.0
pip install pure-sasl== 0.6.2
pip install thrift_sasl==0.4a1 --no-deps
pip install thrift==0.13.0
pip install thriftpy2== 0.4.8
pip install impyla==0.16.2
pip install six==1.12.0

 

安装时候遇到的坑

#坑ONE(centos7安装)
#安装bitarray报错:command 'gcc' failed with exit status 1
.
.
.
    creating build/lib.linux-x86_64-2.7
    creating build/lib.linux-x86_64-2.7/bitarray
    copying bitarray/__init__.py -> build/lib.linux-x86_64-2.7/bitarray
    copying bitarray/test_bitarray.py -> build/lib.linux-x86_64-2.7/bitarray
    running build_ext
    building 'bitarray._bitarray' extension
    creating build/temp.linux-x86_64-2.7
    creating build/temp.linux-x86_64-2.7/bitarray
    gcc -pthread -fno-strict-aliasing -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -DNDEBUG -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -fPIC -I/usr/include/python2.7 -c bitarray/_bitarray.c -o build/temp.linux-x86_64-2.7/bitarray/_bitarray.o
    bitarray/_bitarray.c:12:20: fatal error: Python.h: No such file or directory
     #include "Python.h"
                        ^
    compilation terminated.
    error: command 'gcc' failed with exit status 1
    ----------------------------------------
ERROR: Command errored out with exit status 1: /bin/python2 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-QNmzub/bitarray/setup.py'"'"'; __file__='"'"'/tmp/pip-install-QNmzub/bitarray/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-FdlQ9q/install-record.txt --single-version-externally-managed --compile Check the logs for full command output.

#因为缺少openssl等相关组件导致的,使用yum install gcc libffi-devel python-devel openssl-devel -y 安装即可解决

#坑TWO(windows环境3.6安装)

ERROR: Command errored out with exit status 1:
command: 'D:\python3.6\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\ly\\AppData\\Local\\Temp\\pip-install-2mhw5trv\\bitarr
ay\\setup.py'"'"'; __file__='"'"'C:\\Users\\ly\\AppData\\Local\\Temp\\pip-install-2mhw5trv\\bitarray\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__fil
e__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d 'C:\Users\ly\AppData\Local\Temp\pi
p-wheel-zpjie3cg' --python-tag cp36
cwd: C:\Users\ly\AppData\Local\Temp\pip-install-2mhw5trv\bitarray\
Complete output (11 lines):
running bdist_wheel
running build
running build_py
creating build
creating build\lib.win-amd64-3.6
creating build\lib.win-amd64-3.6\bitarray
copying bitarray\test_bitarray.py -> build\lib.win-amd64-3.6\bitarray
copying bitarray\__init__.py -> build\lib.win-amd64-3.6\bitarray
running build_ext
building 'bitarray._bitarray' extension
error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools
----------------------------------------
ERROR: Failed building wheel for bitarray

缺少C++组件,下载地址:https://pan.baidu.com/s/1zW9bEaPEUVaKJJFxNWjM3g 提取码: tefz  或者通过https://pypi.org/project/impyla/0.16.2/ 下载impyla使用python setup.py install安装



#坑THREE

WARNING:impala._thrift_api:Unable to import 'sasl'. Fallback to 'puresasl'.
INFO:thrift.transport.TSocket:Could not connect to ('172.18.35.108', 10000)
Traceback (most recent call last):
File "/usr/lib64/python2.7/site-packages/thrift/transport/TSocket.py", line 104, in open
handle.connect(sockaddr)
File "/usr/lib64/python2.7/socket.py", line 224, in meth
return getattr(self._sock,name)(*args)
error: [Errno 110] Connection timed out
ERROR:thrift.transport.TSocket:Could not connect to any of [('172.18.35.108', 10000)]
Traceback (most recent call last):
File "query_hive_by_impyla.py", line 88, in <module>
hive.get_connection()
File "query_hive_by_impyla.py", line 36, in get_connection
self.conn = connect(host=self.host, port=self.port, auth_mechanism='PLAIN')
File "/usr/lib/python2.7/site-packages/impala/dbapi.py", line 150, in connect
http_path=http_path)
File "/usr/lib/python2.7/site-packages/impala/hiveserver2.py", line 820, in connect
transport.open()
File "/usr/lib/python2.7/site-packages/thrift_sasl/__init__.py", line 61, in open
self._trans.open()
File "/usr/lib64/python2.7/site-packages/thrift/transport/TSocket.py", line 113, in open
raise TTransportException(TTransportException.NOT_OPEN, msg)
thrift.transport.TTransport.TTransportException: Could not connect to any of [('172.18.35.108', 10000)]
[superadmin@yc-nsg-d20 liying.a]$ python query_hive_by_impyla.py
WARNING:impala._thrift_api:Unable to import 'sasl'. Fallback to 'puresasl'.
Traceback (most recent call last):
File "query_hive_by_impyla.py", line 90, in <module>
print('result:'+hive.get_result())
File "query_hive_by_impyla.py", line 42, in get_result
if self.conn in None:
TypeError: argument of type 'NoneType' is not iterable

#坑FOUR
执行脚本报错:
Traceback (most recent call last):
File "query_hive_by_impyla.py", line 51, in <module>
cursor = conn.cursor()
File "/usr/lib/python2.7/site-packages/impala/hiveserver2.py", line 124, in cursor
session = self.service.open_session(user, configuration)
File "/usr/lib/python2.7/site-packages/impala/hiveserver2.py", line 1057, in open_session
resp = self._rpc('OpenSession', req)
File "/usr/lib/python2.7/site-packages/impala/hiveserver2.py", line 985, in _rpc
response = self._execute(func_name, request)
File "/usr/lib/python2.7/site-packages/impala/hiveserver2.py", line 1002, in _execute
return func(request)
File "/usr/lib/python2.7/site-packages/impala/_thrift_gen/TCLIService/TCLIService.py", line 175, in OpenSession
self.send_OpenSession(req)
File "/usr/lib/python2.7/site-packages/impala/_thrift_gen/TCLIService/TCLIService.py", line 182, in send_OpenSession
args.write(self._oprot)
File "/usr/lib/python2.7/site-packages/impala/_thrift_gen/TCLIService/TCLIService.py", line 1250, in write
oprot.trans.write(fastbinary.encode_binary(self, (self.__class__, self.thrift_spec)))
TypeError: expecting list of size 2 for struct args

#乍一看完全蒙蔽,我没有定义集合或者任何东西,为啥报错,完全搞不懂,其实就是组件版本不对导致的问题
#可能一 最后和uat(我之前偶然间安装的能够正常执行的环境)对比发现impyla版本sit是0.16.2,uat是0.14.1
#可能二 thrift (0.13.0)导致的,按照上边版本要求降级为 thrift==0.9.3就好了 #那好卸载重新降级试试,果然降级之后一路畅通。可能因为自己运气比较好,好多错误都没遇到

# 坑FIVE(windows 3.7安装)

WARNING:impala._thrift_api:Unable to import 'sasl'. Fallback to 'puresasl'.
Traceback (most recent call last):
File "D:\PyCharm 2019.2.4\helpers\pydev\pydevd.py", line 1415, in _exec
pydev_imports.execfile(file, globals, locals) # execute the script
File "D:\PyCharm 2019.2.4\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "E:/Code/python/KFC20190830-py/hive/query_hive_by_impyla.py", line 88, in <module>
hive.get_connection()
File "E:/Code/python/KFC20190830-py/hive/query_hive_by_impyla.py", line 33, in get_connection
self.conn = connect(host=self.host, port=self.port, auth_mechanism='PLAIN')
File "D:\python3.6\lib\site-packages\impyla-0.16.2-py3.6.egg\impala\dbapi.py", line 150, in connect
http_path=http_path)
File "D:\python3.6\lib\site-packages\impyla-0.16.2-py3.6.egg\impala\hiveserver2.py", line 825, in connect
transport.open()
File "D:\python3.6\lib\site-packages\thrift_sasl\__init__.py", line 87, in open
self._send_message(self.START, chosen_mech)
File "D:\python3.6\lib\site-packages\thrift_sasl\__init__.py", line 106, in _send_message
self._trans.write(header + body)
TypeError: can't concat bytes to str

解决办法:

修改impyla包impyla/thrift路径下__init__.py的_send_message()函数,如下:

header = struct.pack(">BI", status, len(body))
if(type(body) is str):
body = body.encode()
self._trans.write(header + body)
self._trans.flush()
参考:https://blog.csdn.net/qq_35958094/article/details/83582865

#坑SIX

WARNING:impala._thrift_api:Unable to import 'sasl'. Fallback to 'puresasl'.
Traceback (most recent call last):
File "D:\PyCharm 2019.2.4\helpers\pydev\pydevd.py", line 1415, in _exec
pydev_imports.execfile(file, globals, locals) # execute the script
File "D:\PyCharm 2019.2.4\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "E:/Code/python/KFC20190830-py/hive/query_hive_by_impyla.py", line 87, in <module>
hive.get_connection()
File "E:/Code/python/KFC20190830-py/hive/query_hive_by_impyla.py", line 33, in get_connection
self.conn = connect(host=self.host, port=self.port, auth_mechanism='PLAIN')
File "D:\python3.6\lib\site-packages\impyla-0.16.2-py3.6.egg\impala\dbapi.py", line 150, in connect
http_path=http_path)
File "D:\python3.6\lib\site-packages\impyla-0.16.2-py3.6.egg\impala\hiveserver2.py", line 825, in connect
transport.open()
File "D:\python3.6\lib\site-packages\thrift_sasl\__init__.py", line 67, in open
if not self._trans.isOpen():
AttributeError: 'TSocket' object has no attribute 'isOpen'

 

因thrift-sasl版本为0.3.0导致的将版本升级为0.4或者降级为0.2.0即可

 

 

pip安装组件到指定site-packages

pip install -t /home/bigdata/local/python/lib/python2.7/site-packages kafka

# 其他方法,为尝试

 Python 2.6:

  python2.6 -m pip install beautifulsoup4

 Python 2.7

  python2.7 -m pip install beautifulsoup4


 

开发的python脚本:

#-*- coding:utf-8 –*-

import logging
import os
import re
import sys
import time
from logging.handlers import TimedRotatingFileHandler
from impala.dbapi import connect

reload(sys)
sys.setdefaultencoding('utf-8')

'''
    author:杏仁拌饭
    python-version: 2.7
    date: 2019-11-26 13:57
    description:
'''

LOGGER_PATH = './log/queryHiveByImpyla'

# Hive查询辅助类
class HiveConnect:
    def __init__(self, host='localhost', port=10000, user=None, password=None, hql='show databases'):
        self.host = host
        self.port = port
        self.user = user
        self.password = password
        self.hql = hql

    def get_connection(self):
        self.conn = None
        if self.user is None or self.password is None:
            # hive, 无密码, 无kerberos
            self.conn = connect(host=self.host, port=self.port, auth_mechanism='PLAIN')
        elif self.user is not None and self.password is not None:
            # impala, 有密码, 有kerberos
            self.conn = connect(host=self.host, port=self.port, auth_mechanism='PLAIN', user=self.user, password=self.password)

    def get_result(self):
        if self.conn is None:
            # 链接为空
            logging.error('链接为空 :host:%s, port:%s user:%s, password:%s\n' % (self.host, self.port, self.user, self.password))
            # 异常退出
            sys.exit(0)
        cursor = self.conn.cursor()
        cursor.execute(self.hql)
        # 打印结果集的schema
        logging.info('query Result: %s' % cursor.description)
        # 将查询到的数据集合打印出来
        return cursor.fetchall()
        # for result in cursor.fetchall():
        #     print(result)

#初始化日志记录器
def initLogging(logFilename):
    """Init for logging
    """
    checkFileExistence(logFilename)
    # time.strftime('%Y-%m-%d %H:%M:%S',time.localtime(time.time())) 获取并且格式化当前时间
    # logFilename = time.strftime('%Y-%m-%d %H:%M:%S',time.localtime(time.time())) + logFilename

    log_fmt = '%(asctime)s\tFile \"%(filename)s\",line %(lineno)s\t%(levelname)s: %(message)s'
    formatter = logging.Formatter(log_fmt)
    log_file_handler = TimedRotatingFileHandler(filename=logFilename, when="d", interval=1, backupCount=15)
    log_file_handler.suffix = "%Y%m%d%H"
    log_file_handler.extMatch = re.compile(r"[\w\W]*\d{10}$")
    log_file_handler.setFormatter(formatter)
    logging.basicConfig(level=logging.INFO)
    logging.getLogger().addHandler(log_file_handler)

# 创建文件夹
def checkFileExistence(fileName):
    # 判断文件是否存在 path.dirname(logFilename)-获取文件的父目录 not()时取反 path.abspath(logFilename) - 获取绝对路径
    if not(os.path.exists(os.path.dirname(fileName))):
        # 创建文件夹
        os.makedirs(os.path.dirname(fileName))

#------------------------------------------------------实现----------------------------------------------------------

if __name__ == '__main__':
    # 初始化日志程序
    nowTime = time.strftime('%Y%m%d', time.localtime(time.time()))
    initLogging(LOGGER_PATH + nowTime + '.log')

    try:
        # hive = HiveConnect(host='youIp', port=10000)
        hive = HiveConnect(host='youIp', port=10000, hql='select * from recom_fusion.recommend limit 10')
        hive.get_connection()
        for result in hive.get_result():
            # 输出第一行结果
            print(result)
            for row in result:
                #输出第一行第一列结果
                print(row)
    except Exception, e:
        logging.error('hive Exception, reason:%s' % e)

 

未使用组件,直接使用cmd方式查询:

#-*- coding:utf-8 –*-

import commands
import logging
import os
import re
import sys
import time
from logging.handlers import TimedRotatingFileHandler

reload(sys)
sys.setdefaultencoding('utf-8')

'''
    author:杏仁拌饭
    python-version: 2.7
    date: 2019-11-25 16:57
    description:
'''

LOGGER_PATH = './log/queryHive'

#-----------------------------------------------utils-------------------------
#日志辅助类
class LoggerInit:
    def __init__(self, logpath='./runlog'):
        self.logFilename = logpath
    #初始化日志记录器
    def initLogging(self):
        """Init for logging
        """
        #校验日志文件是否存在
        self.checkFileExistence()
        # time.strftime('%Y-%m-%d %H:%M:%S',time.localtime(time.time())) 获取并且格式化当前时间
        # logFilename = time.strftime('%Y-%m-%d %H:%M:%S',time.localtime(time.time())) + logFilename

        log_fmt = '%(asctime)s\tFile \"%(filename)s\",line %(lineno)s\t%(levelname)s: %(message)s'
        formatter = logging.Formatter(log_fmt)
        log_file_handler = TimedRotatingFileHandler(filename=self.logFilename, when="d", interval=1, backupCount=15)
        log_file_handler.suffix = "%Y%m%d%H"
        log_file_handler.extMatch = re.compile(r"[\w\W]*\d{10}$")
        log_file_handler.setFormatter(formatter)
        logging.basicConfig(level=logging.INFO)
        logging.getLogger().addHandler(log_file_handler)

    # 创建文件夹
    def checkFileExistence(self):
        # 判断文件是否存在 path.dirname(logFilename)-获取文件的父目录 not()时取反 path.abspath(logFilename) - 获取绝对路径
        if not(os.path.exists(os.path.dirname(self.logFilename))):
            # 创建文件夹
            os.makedirs(os.path.dirname(self.logFilename))

def exec_command(command):
    logging.info("执行语句:" + command)
    return commands.getstatusoutput(command)

#----------------------------utils end-----------------------------------------------

if __name__ == '__main__':
    # 初始化日志程序
    nowTime = time.strftime('%Y%m%d', time.localtime(time.time()))
    logger = LoggerInit(logpath=LOGGER_PATH + nowTime + '.log')
    logger.initLogging()
    hql = 'hive -e "set hive.cli.print.header=true;set mapred.job.queue.name=pms;set mapred.job.name=[HQL]exps_prepro_query;select * from recom_fusion.recommend limit 10;"'
    (status, out) = exec_command(hql)
    if status == 0:
        logging.info("***************[DONE]hive sql: %s Done****************" % hql)
        logging.info("[OUT]: %s" % out)
    else:
        logging.error("*********************CMD ERROR************************")
        logging.error("[STATUS]: %d" % status)
        logging.error("[OUT]: %s" % out)

 

 

 

开发过程拜读的文章:

gcc错误:https://support.huaweicloud.com/trouble-ecs/zh-cn_topic_0171671781.html  https://blog.csdn.net/hellofuturecyj/article/details/84312189

impaly依赖以及版本号和问题:https://blog.csdn.net/wangdi620/article/details/90291636    https://www.jianshu.com/p/798a76acda31

impaly与pyhive比较:https://blog.csdn.net/bluejoe2000/article/details/51527029  (个人首先使用的是pyhive,因为各种依赖组件下载和未知问题得不到解决才弃用pyhive另寻他法找到的impaly)

pyhive链接Hive:https://www.cnblogs.com/Jesse-Li/p/8933776.html     https://blog.csdn.net/weixin_41734687/article/details/83934228    https://www.cnblogs.com/drjava/p/10631975.html https://blog.csdn.net/a6822342/article/details/80713652

HiveServer2应用介绍:https://www.jianshu.com/p/4ef28607fc04    https://zhidao.baidu.com/question/588756373084105205.html    https://blog.csdn.net/calicobox/article/details/75098172

windows下载Thrift:http://thrift.apache.org/tutorial/py  

windows环境下载python的一些组件:https://www.lfd.uci.edu/~gohlke/pythonlibs/#bitarray

安装sasl异常:https://www.jianshu.com/p/10aca9967e93

链接Hive方式:https://blog.csdn.net/clany888/article/details/82989068    https://ask.hellobi.com/blog/ysfyb/18251 https://www.cnblogs.com/xjh713/p/9813119.html

pip组件下载:https://pypi.org/project/bitarray/0.9.3/#files https://blog.csdn.net/weixin_34234829/article/details/94016630

python获取shell输出结果:https://www.cnblogs.com/meitian/p/6814591.html

 

 

posted @ 2019-11-26 22:27  杏仁拌饭  阅读(1826)  评论(0)    收藏  举报