Installing Transformers/SpaCy on an Android phone (Termux): a Python data science development environment

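  1. Install Rust (the Python library safetensors depends on Rust) and enable a local mirror to speed up crate downloads:
    $ rm -rf ~/.cargo # remove all leftover old Rust versions
    $ pkg install rust # ideally exit all Termux sessions after installing
    $ mkdir -p ~/.cargo # recreate Rust's user configuration directory
    $ export THU='https://mirrors.tuna.tsinghua.edu.cn/rustup'
    $ echo "export RUSTUP_DIST_SERVER=${THU}" >> ~/.cargo/env
    $ rustc --version # run this after restarting Termux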
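
The `rustc --version` check above can also be scripted. The following is a minimal sketch using only the Python standard library; the helper name `tool_version` is hypothetical and not part of any tool mentioned in this post.

```python
import shutil
import subprocess


def tool_version(binary):
    """Return the output of `<binary> --version`, or None if it is not on PATH."""
    path = shutil.which(binary)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Some tools print their version on stderr instead of stdout.
    return result.stdout.strip() or result.stderr.strip()


if __name__ == "__main__":
    for name in ("rustc", "cargo"):
        print(name, "->", tool_version(name) or "not found")
```

If `rustc` shows up as "not found", restart Termux as noted above and try again.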

  2. Install pytorch, opencv, numpy, scipy, pandas, pillow… in Termux:
    $ pkg install python opencv-python vim-python
    $ pkg install python-{numpy,scipy,pandas,pillow}
    $ pkg install python-torch{,audio,vision}
    $ pkg install protobuf{,-dev} google{test,-glog}
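
After the packages in step 2 are installed, it can help to confirm that Python actually sees them. This is a rough sketch, assuming the usual import names (opencv-python imports as `cv2`, pillow as `PIL`); the function `missing_modules` is a hypothetical helper, and it only looks the modules up without importing them.

```python
import importlib.util

# Import names for the packages from step 2 (assumed mapping:
# opencv-python -> cv2, pillow -> PIL).
MODULES = ["numpy", "scipy", "pandas", "PIL", "cv2",
           "torch", "torchaudio", "torchvision"]


def missing_modules(names):
    """Return the names that cannot be located on the current Python path."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    absent = missing_modules(MODULES)
    print("missing:", ", ".join(absent) if absent else "none")
```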

  3. Install SpaCy (Industrial-Strength Natural Language Processing, ExplosionAI GmbH):
    # Edit the dependency files (pyproject.toml, setup.cfg, setup.py) of thinc/spacy{,-transformers}
    # so that they use the latest numpy (1.25.0);

    $ pip install thinc spacy spacy-pkuseg

  4. Install SentencePiece (unsupervised tokenizer and detokenizer, from Google)
    $ git clone https://github.com/google/sentencepiece.git
    $ cd sentencepiece
    $ mkdir build && cd build
    # Building on Android requires passing the -llog linker flag
    $ LDFLAGS="-llog" cmake .. \
        -DSPM_ENABLE_SHARED=ON \
        -DCMAKE_INSTALL_PREFIX=/data/data/com.termux/files/usr
    $ make install
    $ cd ../python && python setup.py bdist_wheel
    $ pip install dist/sentencepiece*.whl

  5. Install the Python libraries that transformers depends on
    $ pip install safetensors # Rust downloads crates over the network during the build
    $ pip install protobuf tokenizers sentencepiece

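  6. Install the transformers library (Transformer and BERT originated in Google papers; with the preparation above done, this step goes smoothly):
    # The transformers library is part of the very popular open-source suite from Huggingface.co;
    # even Google/Microsoft/Meta(Facebook)/Amazon/Intel/… use Huggingface's libraries.
    # Register a huggingface.co account; the token needed by the Python library huggingface_hub
    # can be created and retrieved at https://huggingface.co/settings/tokens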
    $ pip install transformers[sentencepiece]
    $ pip install spacy-{alignments,transformers}
    $ pip install huggingface-hub
    $ git config --global credential.helper store
    $ huggingface-cli login # log in and store the account token

  7. Download SpaCy's multilingual model files (spacy_models):
    # Latest releases: https://github.com/explosion/spacy-models/releases?q=en_core
    $ python -m spacy download zh_core_web_trf
    $ python -m spacy download en_core_web_trf
    # When downloading spacy_models, it is best to use a browser that supports
    # resumable downloads (e.g. Microsoft Edge): print the file URL before
    # downloading with the spacy library, then paste it into Edge to download.
    # BASEURL='https://github.com/explosion/spacy-models/releases/download'
    # ENCW=${BASEURL}/en_core_web_trf-3.5.0/en_core_web_trf-3.5.0-py3-none-any.whl
    # ZHCW=${BASEURL}/zh_core_web_trf-3.5.0/zh_core_web_trf-3.5.0-py3-none-any.whl
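
The wheel URLs in the comments above follow one pattern, so they can be generated rather than typed by hand. A small sketch, assuming every release uses the same `name-version/name-version-py3-none-any.whl` layout as the two 3.5.0 examples; `model_wheel_url` is a hypothetical helper name.

```python
BASE_URL = "https://github.com/explosion/spacy-models/releases/download"


def model_wheel_url(name, version):
    """Build the release URL of a spaCy model wheel from its name and version."""
    tag = f"{name}-{version}"
    return f"{BASE_URL}/{tag}/{tag}-py3-none-any.whl"


if __name__ == "__main__":
    # Print the URLs so they can be pasted into a browser with resumable downloads.
    for model in ("en_core_web_trf", "zh_core_web_trf"):
        print(model_wheel_url(model, "3.5.0"))
```

A wheel downloaded this way can then be installed by passing its local path to `pip install`.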

  8. spaCy package naming conventions
    In general, spaCy expects all pipeline packages to follow the naming convention of:
    [lang]_[name]
    For spaCy’s pipelines, we also chose to divide the name into three components:

    1. Type: Capabilities (e.g. core for general-purpose pipeline with tagging,
      parsing, lemmatization and named entity recognition, or dep for only
      tagging, parsing and lemmatization).
    2. Genre: Type of text the pipeline is trained on, e.g. web or news.
    3. Size: Package size indicator, sm, md, lg or trf.
      sm and trf pipelines have no static word vectors.
      For pipelines with default vectors,
      md has a reduced word vector table with 20k unique vectors for ~500k words,
      lg has a large word vector table with ~500k entries.
      For pipelines with floret vectors,
      md vector tables have 50k entries and lg vector tables have 200k entries.

    For example, en_core_web_sm is a small English pipeline trained on written web text
    (blogs, news, comments), that includes vocabulary, syntax and entities.
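
The naming convention above can be illustrated by splitting a package name into its parts. This is only a sketch: `PipelineName` and `parse_pipeline_name` are hypothetical names, and the code assumes the four-part `[lang]_[type]_[genre]_[size]` form used by spaCy's own pipelines, so third-party packages that only follow `[lang]_[name]` are rejected.

```python
from typing import NamedTuple


class PipelineName(NamedTuple):
    lang: str   # language code, e.g. "en" or "zh"
    type: str   # capabilities, e.g. "core" or "dep"
    genre: str  # kind of training text, e.g. "web" or "news"
    size: str   # "sm", "md", "lg" or "trf"


def parse_pipeline_name(name):
    """Split a name such as 'en_core_web_sm' into its four components."""
    lang, _, rest = name.partition("_")
    parts = rest.split("_")
    if not lang or len(parts) != 3 or not all(parts):
        raise ValueError(f"expected [lang]_[type]_[genre]_[size], got {name!r}")
    return PipelineName(lang, *parts)


if __name__ == "__main__":
    print(parse_pipeline_name("en_core_web_sm"))
```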

  9. spaCy package versioning
    Additionally, the pipeline package versioning reflects both the compatibility with spaCy,
    as well as the model version. A package version a.b.c translates to:
    a: spaCy major version. For example, 2 for spaCy v2.x.
    b: spaCy minor version. For example, 3 for spaCy v2.3.x.
    c: Model version.
    Different model config: e.g. from being trained on different data, with different
    parameters, for different numbers of iterations, with different vectors, etc.
    For a detailed compatibility overview, see the compatibility.json.
    This is also the source of spaCy’s internal compatibility check,
    performed when you run the download command.
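
The a.b.c scheme above can be sketched as a small parser. Note the assumptions: versions are taken to be three plain integers, and `same_spacy_series` only compares the a.b prefix. It does not reproduce spaCy's real check, which reads compatibility.json. Both function names are hypothetical.

```python
def parse_package_version(version):
    """Split 'a.b.c' into (spaCy major, spaCy minor, model version)."""
    parts = version.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected a.b.c, got {version!r}")
    major, minor, model = (int(p) for p in parts)
    return major, minor, model


def same_spacy_series(package_version, spacy_version):
    """True when the package's a.b equals the major.minor of the spaCy version."""
    pkg_major, pkg_minor, _ = parse_package_version(package_version)
    spacy_major, spacy_minor = (int(p) for p in spacy_version.split(".")[:2])
    return (pkg_major, pkg_minor) == (spacy_major, spacy_minor)


if __name__ == "__main__":
    print(parse_package_version("3.5.0"))       # (3, 5, 0)
    print(same_spacy_series("3.5.0", "3.5.3"))  # True
```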

posted @ 2023-06-20 13:27  abaelhe