The Python tldextract module

Install the latest release from PyPI:

pip install tldextract

Or install the latest development version:

pip install -e 'git+https://github.com/john-kurkowski/tldextract.git#egg=tldextract'

Command-line usage; separate multiple URLs with spaces:

tldextract http://forums.bbc.co.uk
# forums bbc co.uk
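
The split shown above depends on knowing that co.uk is a public suffix. As a rough illustration only (this is not tldextract's actual implementation), the idea can be sketched with the standard library and a tiny hardcoded suffix set. The names `SUFFIXES` and `split_host` are hypothetical, and the real data comes from the Public Suffix List.

```python
from urllib.parse import urlsplit

# Hypothetical, tiny stand-in for the real Public Suffix List.
SUFFIXES = {"com", "uk", "co.uk", "org"}


def split_host(url):
    """Split a URL's host into (subdomain, domain, suffix) using the longest matching suffix."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    # Try the longest candidate suffix first by dropping labels from the left one at a time.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in SUFFIXES:
            domain = labels[i - 1] if i > 0 else ""
            subdomain = ".".join(labels[: i - 1]) if i > 1 else ""
            return subdomain, domain, candidate
    # No known suffix: this sketch treats the whole host as the domain.
    return "", host, ""


print(split_host("http://forums.bbc.co.uk"))
```

Matching the longest suffix first is what keeps `co.uk` together instead of treating `co` as the domain.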

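The first time the module runs, it updates its suffix list with a live HTTP request. The updated suffix set is cached indefinitely in /path/to/tldextract/.tld_set. (Arguably, runtime bootstrapping like this should not be the default behavior, for example on production systems. But I want you to have the latest suffixes, especially when I have not kept this code up to date.) To avoid this fetch or to control the cache location, use your own extract callable, either by setting the TLDEXTRACT_CACHE environment variable or by setting the cache_file path in the TLDExtract initializer.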

import tldextract

# extract callable that falls back to the included TLD snapshot, no live HTTP fetching
no_fetch_extract = tldextract.TLDExtract(suffix_list_urls=None)
no_fetch_extract('http://www.google.com')

# extract callable that reads/writes the updated TLD set to a different path
custom_cache_extract = tldextract.TLDExtract(cache_file='/path/to/your/cache/file')
custom_cache_extract('http://www.google.com')

# extract callable that doesn't use caching
no_cache_extract = tldextract.TLDExtract(cache_file=False)
no_cache_extract('http://www.google.com')

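If you want to stay up to date with the latest suffix definitions (they do not change often), delete the cache file occasionally, or run the update command: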

tldextract --update

or:

env TLDEXTRACT_CACHE="~/tldextract.cache" tldextract --update

It is also recommended to delete the file after upgrading this library.
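
Deleting the cache file can also be scripted. A minimal sketch, assuming the cache path is whatever TLDEXTRACT_CACHE points to; the helper name `remove_cache` is hypothetical and not part of tldextract.

```python
import os


def remove_cache(path):
    """Delete the cache file at `path` if it exists; return True if a file was removed."""
    path = os.path.expanduser(path)  # expand a leading "~" to the home directory
    try:
        os.remove(path)
        return True
    except FileNotFoundError:
        return False


# Assumption: the cache location is taken from TLDEXTRACT_CACHE when it is set.
cache_path = os.environ.get("TLDEXTRACT_CACHE")
if cache_path:
    remove_cache(cache_path)
```

With the file gone, the suffix list should be downloaded again the next time it is needed, since the download happens when the cache file does not exist.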

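Advanced usage

Specifying your own URL or file for the suffix list data

You can specify your own input data in place of the default Mozilla Public Suffix List: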

extract = tldextract.TLDExtract(
 suffix_list_urls=["http://foo.bar.baz"],
 # Recommended: Specify your own cache file, to minimize ambiguities about where
 # tldextract is getting its data, or cached data, from.
 cache_file='/path/to/your/cache/file')

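The snippet above fetches from the URL you specified the first time the suffix list needs to be downloaded (that is, if the cache_file does not exist). If you want to use input data from your local file system, just use the file:// protocol: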

extract = tldextract.TLDExtract(
 suffix_list_urls=["file:///absolute/path/to/your/local/suffix/list/file"],
 cache_file='/path/to/your/cache/file')

Use an absolute path when specifying the suffix_list_urls keyword argument. os.path is your friend.
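
One way to build such a URL from a relative path is sketched below, using only the standard library. The helper name `to_file_url` and the file name `my_suffix_list.dat` are hypothetical and used for illustration only.

```python
import os
from pathlib import Path


def to_file_url(path):
    """Turn a possibly relative filesystem path into an absolute file:// URL."""
    absolute = os.path.abspath(os.path.expanduser(path))
    return Path(absolute).as_uri()


# Hypothetical local file name, used only for illustration.
suffix_list_url = to_file_url("my_suffix_list.dat")
print(suffix_list_url)
```

The resulting string can then be passed as an entry in the suffix_list_urls list.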

Posted on 2019-06-25 14:29 by 小黑崽
