[Spark] Reading GBK-encoded files with Spark

import logging

def output_mapper(line):
    """Map one (position, line) record read from a GBK-encoded file.

    The input file is GBK-encoded; reading it with Spark's
    GBKFileInputFormat converts it to UTF-8 automatically. Keys are
    the position in the file, and values are the line of text,
    converted to UTF-8 Text.

    Args:
        line: (position, "bidword \t sp \t tag_info")
    Returns:
        list: [bidword, sp, tag_info, theDate], or None for a malformed line
    """
    try:
        global theDate  # run date, assumed to be set elsewhere in the job
        value = line[1]
        bidword, sp, tag_info = value.strip().split('\t')
        return [bidword, sp, tag_info, theDate]
    except Exception as e:
        logging.error("output_mapper error: {}".format(e))
        return None

test_df = sc.hadoopFile(test_file,
                        "org.apache.spark.input.GBKFileInputFormat",
                        "org.apache.hadoop.io.LongWritable",
                        "org.apache.hadoop.io.Text") \
            .map(output_mapper) \
            .filter(lambda x: x is not None) \
            .toDF()
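
If the patched GBKFileInputFormat is not on the classpath, a rough fallback is to read the raw bytes and decode GBK on the Python side. This is only a sketch: sc.binaryFiles loads each file whole, so it only suits reasonably small files, and the column names passed to toDF are my own illustration, not part of the original job.

# Fallback sketch without GBKFileInputFormat: read whole files as raw
# bytes and decode GBK in Python before splitting into lines.
raw_lines = sc.binaryFiles(test_file) \
              .flatMap(lambda kv: kv[1].decode("gbk", errors="replace").splitlines())
test_df = raw_lines.map(lambda v: output_mapper((0, v))) \
                   .filter(lambda x: x is not None) \
                   .toDF(["bidword", "sp", "tag_info", "date"])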

Reference:

https://www.wangt.cc/2019/11/feature%EF%BC%9Aspark%E6%94%AF%E6%8C%81gbk%E6%96%87%E4%BB%B6%E8%AF%BB%E5%8F%96%E5%8A%9F%E8%83%BD/

The Javadoc of GBKFileInputFormat from the linked patch:

/**
 * FileInputFormat for GBK-encoded files. Files are broken into lines. Either linefeed
 * or carriage-return is used to signal end of line. Keys are the position in the file,
 * and values are the line of text and will be converted to UTF-8 Text.
 */
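
As a quick sanity check (a hypothetical local-mode example, assuming a SparkContext sc and the GBK input format on the classpath), write one GBK-encoded line and read it back:

# Write a single GBK-encoded line to a local file (hypothetical path),
# then read it back through GBKFileInputFormat; the value should come
# out as readable text, the key as the byte offset in the file.
with open("/tmp/gbk_sample.txt", "wb") as f:
    f.write(u"关键词\tsp\ttag".encode("gbk"))

rdd = sc.hadoopFile("file:///tmp/gbk_sample.txt",
                    "org.apache.spark.input.GBKFileInputFormat",
                    "org.apache.hadoop.io.LongWritable",
                    "org.apache.hadoop.io.Text")
print(rdd.first())  # expect (0, u'关键词\tsp\ttag')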
