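Introduction

A while ago I had an idea: write a blog post once and publish it in several places. That led to this project, which selectively syncs Hexo posts to cnblogs (博客园).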

GitHub link: https://github.com/JeffersonQin/hexo-cnblogs-sync

Modifying the NexT theme to add a notice to the post header

In themes/next/layout/_macro/post.swig, add the following at a suitable place:

{% if post.cnblogs %}
    <br>
    <span class="post-meta-item">
        <span class="post-meta-item-icon">
            <i class="fas fa-rss"></i>
        </span>
    <span class="post-meta-item-text" id="cnblogs_sync_text"><a href="https://www.cnblogs.com/jeffersonqin/">博客园</a> 同步已开启</span>
    </span>
{% endif %}

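With this in place, whether a post gets synced is controlled entirely by its front matter. This post's front matter, for example, looks like this: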

---
title: 博客园与Hexo同步发布的方法
date: 2020-09-15 00:31:57
tags:
- Hexo
- Blog
- cnblogs
- python
- html
- css
categories: Hexo
cnblogs: true
---

Fine-tuning Hexo's generated HTML with BS4

First, scan the output files in the public folder:

import os

orig_dir = "../public/"
repo_dir = "../public_cnblogs/"
ignore_files = ["../public/index.html"]
ignore_dirs = ["../public/page/", "../public/archives"]
index_files = []  # every index.html that might need to be posted

# Filtering the files needed to post
for root, dirs, files in os.walk(orig_dir):
    for file in files:
        file_name = str(file)
        # Normalise Windows separators so the prefix checks below work
        file_dir = root.replace('\\', '/')
        file_path = os.path.join(root, file).replace('\\', '/')

        flag = True

        for ignore_dir in ignore_dirs:
            if file_dir.startswith(ignore_dir):
                flag = False
                break
        for ignore_file in ignore_files:
            if file_path == ignore_file:
                flag = False
                break

        if file_name == 'index.html' and flag:
            index_files.append(file_path)
            print('\033[42m\033[37m[LOGG]\033[0m File Detected:', file_path)

print('\033[44m\033[37m[INFO]\033[0m File Detection Ended.')
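The scan above can also be written as a reusable function. The following is only a minimal sketch: the names find_post_pages and demo are made up for illustration, and the directory layout is a throwaway one built in a temporary folder, not the real public folder.

```python
import os
import tempfile


def find_post_pages(public_dir, ignore_files=(), ignore_dirs=()):
    """Return every index.html under public_dir that is not ignored."""
    found = []
    for root, _dirs, files in os.walk(public_dir):
        file_dir = root.replace("\\", "/")
        # Skip whole directory trees such as the pagination pages.
        if any(file_dir.startswith(d) for d in ignore_dirs):
            continue
        for name in files:
            file_path = os.path.join(root, name).replace("\\", "/")
            if name == "index.html" and file_path not in ignore_files:
                found.append(file_path)
    return sorted(found)


def demo():
    """Build a throwaway site layout (hypothetical paths) and scan it."""
    with tempfile.TemporaryDirectory() as tmp:
        public = tmp.replace("\\", "/") + "/public/"
        for rel in ("", "page/2/", "archives/", "2020/09/15/my-post/"):
            os.makedirs(public + rel, exist_ok=True)
            with open(public + rel + "index.html", "w", encoding="utf-8") as f:
                f.write("<html></html>")
        pages = find_post_pages(
            public,
            ignore_files=[public + "index.html"],
            ignore_dirs=[public + "page/", public + "archives"],
        )
        # Report paths relative to the temporary public folder.
        return [p[len(public):] for p in pages]


if __name__ == "__main__":
    print(demo())
```

Only the post page survives the filter; the home page, the pagination pages and the archives are skipped.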

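Then use BS4 to manipulate the generated HTML:

from bs4 import BeautifulSoup

# Filter the articles that are to be synced
# (`header` is an HTML string defined elsewhere in the script)

resource_dict = {}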

for index_file in index_files:
    post_body = ""
    with open(index_file, 'r', encoding='utf-8') as f:
        html_doc = f.read()
        soup = BeautifulSoup(html_doc, "html.parser")
        check_msg = soup.select('span[id=cnblogs_sync_text]')
        if (len(check_msg) == 0): continue
        post_body_list = soup.select('div[class=post-body]')
        if (len(post_body_list) == 0): continue
        print('\033[42m\033[37m[LOGG]\033[0m Target Detected:', index_file)
        for child in soup.descendants:
            if (child.name == 'img'):
                if ('data-original' in child.attrs and 'src' in child.attrs):
                    child['src'] = 'https://gyrojeff.moe' + child['data-original']
                elif ('src' in child.attrs):
                    child['src'] = 'https://gyrojeff.moe' + child['src']
            if (child.name == 'a'):
                if ('href' in child.attrs):
                    if (str(child['href']).startswith('/') and not str(child['href']).startswith('//')):
                        child['href'] = 'https://gyrojeff.moe' + child['href']
        post_body = soup.select('div[class=post-body]')[0]
        math_blocks = post_body.select('script[type="math/tex; mode=display"]')
        for math_block in math_blocks:
            math_string = str(math_block).replace('<script type="math/tex; mode=display">', '<p>$$').replace('</script>', '\n$$</p>')
            math_block.replace_with(BeautifulSoup(math_string, 'html.parser'))
    save_dir = os.path.join(repo_dir, index_file[len(orig_dir):-len("index.html")])
    if not os.path.exists(save_dir): os.makedirs(save_dir)
    copyright_div = str(soup.select('div[class=my_post_copyright]')[0])
    with open(save_dir + 'index.html', 'w', encoding='utf-8') as f:
        f.write(header + '\n' + copyright_div + '\n' + str(post_body))
    tags = soup.select('div[class=post-tags]')
    tags_text = []
    if len(tags) != 0:
        tags_div = tags[0].select('a')
        if len(tags_div) > 0:
            for tag in tags_div:
                tags_text.append(tag.contents[1][1:])
    
    resource_dict[save_dir + 'index.html'] = {
        'tags': tags_text, 
        'title': soup.select('meta[property="og:title"]')[0]['content']
    }
    print('\033[44m\033[37m[INFO]\033[0m File Generated:', save_dir + 'index.html')

print('\033[44m\033[37m[INFO]\033[0m File Generation Ended.')
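The link rewriting in the loop above can be isolated into a small helper. This is a sketch rather than the script's actual code: the name absolutize is made up, and unlike the img branch above (which prefixes every src unconditionally) it applies the root-relative check to every URL, so an address that is already absolute is left alone.

```python
SITE_ROOT = "https://gyrojeff.moe"  # site origin used in the script above


def absolutize(url, site_root=SITE_ROOT):
    """Prefix root-relative URLs with the site origin; leave others untouched."""
    # "/images/a.png" is root-relative, "//cdn.example.com/x" is protocol-relative.
    if url.startswith("/") and not url.startswith("//"):
        return site_root + url
    return url


if __name__ == "__main__":
    for candidate in ("/images/a.png", "//cdn.example.com/x.js", "#top"):
        print(candidate, "->", absolutize(candidate))
```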

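Most of the code above depends on your own setup: you have to work out how the generated HTML relates to the HTML you actually want. The math_blocks part is worth a closer look, because it neatly works around the fact that MathJax formulas are only emitted inside <script> tags. Each such tag is matched and replaced with a $$...$$ pair, so cnblogs' Markdown renderer can render the formula directly (Markdown is compatible with HTML, after all). The header used here is something I extracted myself from this NexT theme; its source is on my GitHub (link at the end of this post).

Publishing posts via MetaWeblog

cnblogs supports the MetaWeblog API, although categories are somewhat problematic and do not really work.

Original work: https://github.com/1024th/cnblogs_githook

The delete call in the original had a problem, however, so I changed it here.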
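To make the formula replacement concrete, here is a small stand-in that uses a regular expression instead of BS4. It is only a sketch of the same substitution; the function name scripts_to_dollars and the sample HTML are made up for illustration.

```python
import re

# Matches the display-math wrapper that the script above looks for.
DISPLAY_MATH = re.compile(
    r'<script type="math/tex; mode=display">(.*?)</script>', re.DOTALL
)


def scripts_to_dollars(html):
    """Rewrite each display-math <script> as a <p>$$ ... $$</p> block."""
    # A function is used as the replacement so that backslashes in the
    # LaTeX source are copied verbatim instead of being read as escapes.
    return DISPLAY_MATH.sub(lambda m: "<p>$$" + m.group(1) + "\n$$</p>", html)


sample = '<p>Euler:</p><script type="math/tex; mode=display">e^{i\\pi}+1=0</script>'
converted = scripts_to_dollars(sample)

if __name__ == "__main__":
    print(converted)
```

The surrounding HTML is left untouched; only the script wrapper is swapped for dollar delimiters.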

import xmlrpc.client as xmlrpclib
import json
import datetime
import time
import getpass

'''
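Configuration dictionary:
type | description (example)
str  | metaWeblog url, found in the blog settings ('https://rpc.cnblogs.com/metaweblog/1024th')
str  | appkey, the blog address name ('1024th')
str  | blogid, no need to enter this by hand; it is obtained via getUsersBlogs
str  | usr, login username
str  | passwd, login password
str  | rootpath, root path where posts are stored (under git management)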
'''

'''
POST:
dateTime	dateCreated - Required when posting.
string	description - Required when posting.
string	title - Required when posting.
array of string	categories (optional)
struct Enclosure	enclosure (optional)
string	link (optional)
string	permalink (optional)
any	postid (optional)
struct Source	source (optional)
string	userid (optional)
any	mt_allow_comments (optional)
any	mt_allow_pings (optional)
any	mt_convert_breaks (optional)
string	mt_text_more (optional)
string	mt_excerpt (optional)
string	mt_keywords (optional)
string	wp_slug (optional)
'''

class MetaWebBlogClient():
    def __init__(self, configpath):
        '''
        @configpath: path to the configuration file
        '''
        self._configpath = configpath
        self._config = None
        self._server = None
        self._mwb = None
    
    def createConfig(self):
        '''
        Create the configuration
        '''
        while True:
            cfg = {}
            for item in [("url", "MetaWebBlog URL: "),
                        ("appkey", "blog address name (the user part of the URL): "),
                        ("usr", "login username: ")]:
                cfg[item[0]] = input("Enter " + item[1])
            cfg['passwd'] = getpass.getpass('Enter login password: ')
            try:
                server = xmlrpclib.ServerProxy(cfg["url"])
                userInfo = server.blogger.getUsersBlogs(cfg["appkey"], cfg["usr"], cfg["passwd"])
                print(userInfo[0])
                # {'blogid': 'xxx', 'url': 'xxx', 'blogName': 'xxx'}
                cfg["blogid"] = userInfo[0]["blogid"]
                break
            except Exception:  # keep Ctrl+C working instead of looping forever
                print("An error occurred!")
        with open(self._configpath, "w", encoding="utf-8") as f:
            json.dump(cfg, f, indent=4, ensure_ascii=False)
        
    def existConfig(self):
        '''
        Return whether the configuration exists
        '''
        try:
            with open(self._configpath, "r", encoding="utf-8") as f:
                try:
                    cfg = json.load(f)
                    if cfg == {}:
                        return False
                    else:
                        return True
                except json.decoder.JSONDecodeError:  # the file is empty
                    return False
        except:
            with open(self._configpath, "w", encoding="utf-8") as f:
                json.dump({}, f)
                return False

    def readConfig(self):
        '''
        Read the configuration
        '''
        if not self.existConfig():
            self.createConfig()

        with open(self._configpath, "r", encoding="utf-8") as f:
            self._config = json.load(f)
            self._server = xmlrpclib.ServerProxy(self._config["url"])
            self._mwb = self._server.metaWeblog

    def getUsersBlogs(self):
        '''
        Get blog information
        @return: {
            string  blogid
            string	url
            string	blogName
        }
        '''
        userInfo = self._server.blogger.getUsersBlogs(self._config["appkey"], self._config["usr"], self._config["passwd"])
        return userInfo

    def getRecentPosts(self, num):
        '''
        Read information about recent posts
        '''
        return self._mwb.getRecentPosts(self._config["blogid"], self._config["usr"], self._config["passwd"], num)

    def newPost(self, post, publish):
        '''
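        Publish a new post
        @post: content to publish
        @publish: whether to make the post public
        '''
        while True:
            try:
                postid = self._mwb.newPost(self._config['blogid'], self._config['usr'], self._config['passwd'], post, publish)
                break
            except Exception:
                # Wait a little, then retry until the call succeeds
                time.sleep(5)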
        return postid

    def editPost(self, postid, post, publish):
        '''
        Update an existing post
        @postid: ID of the existing post
        @post: content to publish
        @publish: whether to publish publicly
        '''
        self._mwb.editPost(postid, self._config['usr'], self._config['passwd'], post, publish)

    def deletePost(self, postid, publish):
        '''
        Delete a post
        '''
        return self._server.blogger.deletePost(self._config['appkey'], postid, self._config['usr'], self._config['passwd'], publish)

    def getCategories(self):
        '''
        Get post categories
        '''
        return self._mwb.getCategories(self._config['blogid'], self._config['usr'], self._config['passwd'])

    def getPost(self, postid):
        '''
        Read post information
        @postid: post ID
        @return: POST
        '''
        return self._mwb.getPost(postid, self._config['usr'], self._config['passwd'])

    def newMediaObject(self, file):
        '''
        Upload a resource file (image, audio, video...)
        @file: {
            base64	bits
            string	name
            string	type
        }
        @return: URL
        '''
        return self._mwb.newMediaObject(self._config['blogid'], self._config['usr'], self._config['passwd'], file)
    
    def newCategory(self, categoray):
        '''
        Create a new category
        @categoray: {
            string	name
            string	slug (optional)
            integer	parent_id
            string	description (optional)
        }
        @return : categorayid
        '''
        return self._server.wp.newCategory(self._config['blogid'], self._config['usr'], self._config['passwd'], categoray)
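As a rough illustration of what gets passed to newPost, the sketch below assembles a post struct from the fields listed in the POST comment above and serialises the call locally. The helper name build_post is made up, and sending tags as a comma-separated mt_keywords string is an assumption rather than something this post confirms.

```python
import datetime
import xmlrpc.client


def build_post(title, html_body, tags=()):
    """Assemble a MetaWeblog post struct (hypothetical helper)."""
    return {
        "title": title,
        "description": html_body,  # the post body
        "dateCreated": datetime.datetime.now(),
        # Assumption: tags are sent as a comma-separated mt_keywords string.
        "mt_keywords": ",".join(tags),
    }


post = build_post("Hello", "<p>Hi</p>", ["Hexo", "Blog"])

# Serialise the call locally to check that every value is XML-RPC compatible.
# Nothing is sent over the network; the credentials are placeholders.
payload = xmlrpc.client.dumps(
    ("blogid", "usr", "passwd", post, True),
    methodname="metaWeblog.newPost",
)

if __name__ == "__main__":
    print(payload)
```

With the class above, the same struct would be handed over as client.newPost(post, True) once client.readConfig() has been called.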

Version management with GitPython

Before generating the new HTML documents, delete the old ones first (everything except the .git directory):

import os
import shutil

if os.path.exists(repo_dir):
    for sub_dir in os.listdir(repo_dir):
        if os.path.isdir(os.path.join(repo_dir, sub_dir)) and sub_dir != '.git': 
            shutil.rmtree(os.path.join(repo_dir, sub_dir))
        if os.path.isfile(os.path.join(repo_dir, sub_dir)):
            os.remove(os.path.join(repo_dir, sub_dir))
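The same cleanup can be wrapped in a function and tried out safely on a temporary directory. This is a minimal sketch; the names clean_output_dir and demo, and the files it creates, are made up for illustration.

```python
import os
import shutil
import tempfile


def clean_output_dir(repo_dir):
    """Delete everything inside repo_dir except the .git directory."""
    if not os.path.exists(repo_dir):
        return
    for name in os.listdir(repo_dir):
        path = os.path.join(repo_dir, name)
        if os.path.isdir(path):
            if name != ".git":
                shutil.rmtree(path)
        elif os.path.isfile(path):
            os.remove(path)


def demo():
    """Create a fake repository folder, clean it, and list what is left."""
    with tempfile.TemporaryDirectory() as repo:
        os.makedirs(os.path.join(repo, ".git"))
        os.makedirs(os.path.join(repo, "2020", "09"))
        with open(os.path.join(repo, "stale.html"), "w", encoding="utf-8") as f:
            f.write("old")
        clean_output_dir(repo)
        return sorted(os.listdir(repo))


if __name__ == "__main__":
    print(demo())
```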

Below is a class that wraps GitPython:

import git
import os

class RepoScanner():

    def __init__(self, repopath):
        self._root = repopath
        try:
            self._repo = git.Repo(self._root)
        except:
            # TODO: color log
            print('\033[44m\033[37m[INFO]\033[0m Fail to open git repo at: %s' % (repopath))
            while (True):
                in_content = input('\033[44m\033[37m[INFO]\033[0m Try to create a new repo? [y/n]: ')
                if (in_content == 'y' or in_content == 'Y'):
                    break
                if (in_content == 'n' or in_content == 'N'):
                    return
            try:
                self._repo = git.Repo.init(path=self._root)
            except Exception as e:
                raise e

    def getNewFiles(self):
        return self._repo.untracked_files

    def scan(self):
        diff = [ item.a_path for item in self._repo.index.diff(None) ]
        deleted = []
        changed = []
        for item in diff:
            if not os.path.exists(os.path.join(self._root, item)):
                deleted.append(item)
            else: changed.append(item)
        return {'new': self.getNewFiles(), 'deleted': deleted, 'changed': changed}
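The scan method tells deleted files from changed ones by checking whether each reported path still exists on disk. That step can be tried without a git repository at all, as in the sketch below; the function name classify and the file names are made up for illustration.

```python
import os
import tempfile


def classify(root, diff_paths):
    """Split paths reported by a diff into deleted and changed ones."""
    deleted, changed = [], []
    for rel_path in diff_paths:
        if os.path.exists(os.path.join(root, rel_path)):
            changed.append(rel_path)
        else:
            deleted.append(rel_path)
    return {"deleted": deleted, "changed": changed}


def demo():
    """Pretend a diff reported two paths, only one of which still exists."""
    with tempfile.TemporaryDirectory() as root:
        with open(os.path.join(root, "kept.html"), "w", encoding="utf-8") as f:
            f.write("edited")
        return classify(root, ["kept.html", "gone.html"])


if __name__ == "__main__":
    print(demo())
```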

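Some details remain, such as storing article IDs locally, which I won't go into here. This post only gives the general approach; see GitHub for the full code.

Note

If you run into encoding problems with Chinese characters (for example in file paths reported by git), remember to run the following command:

git config --global core.quotepath false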
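One possible way to connect the scan result to the MetaWeblog calls is sketched below. It is not the project's actual code: the function plan_actions and the local {path: postid} mapping are assumptions made purely for illustration.

```python
def plan_actions(scan_result, post_ids):
    """Turn a scan result into (action, path, postid) tuples.

    post_ids is a hypothetical {path: postid} mapping kept locally.
    """
    actions = []
    for path in scan_result["new"]:
        actions.append(("newPost", path, None))
    for path in scan_result["changed"]:
        # A changed file without a known ID is treated as a new post.
        if path in post_ids:
            actions.append(("editPost", path, post_ids[path]))
        else:
            actions.append(("newPost", path, None))
    for path in scan_result["deleted"]:
        # Only posts that were published before can be deleted remotely.
        if path in post_ids:
            actions.append(("deletePost", path, post_ids[path]))
    return actions


example = plan_actions(
    {"new": ["a/index.html"], "changed": ["b/index.html"], "deleted": ["c/index.html"]},
    {"b/index.html": "101", "c/index.html": "102"},
)

if __name__ == "__main__":
    for action in example:
        print(action)
```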

GitHub

https://github.com/JeffersonQin/hexo-cnblogs-sync
