fast-ai-数值线性代数讲义中文版-v2-全-

fast.ai 数值线性代数讲义中文版 v2（全）

原文：fastai/numerical-linear-algebra-v2

译者：飞龙

协议：CC BY-NC-SA 4.0

一、我们为什么在这里

你可以在此博客文章中阅读数值线性代数课程的概述。该课程最初在旧金山大学数据科学的研究生项目中讲授。YouTube 上提供了课程视频（请注意，笔记本的编码和视频编号不对应，因为有些笔记本需要超过 1 个视频才能覆盖）。

你可以在我们的 fast.ai 论坛上提出有关该课程的问题。

注意：未来的课程代码比这个更多。

为什么学习数值线性代数？

本课程的关键问题：我们如何以可接受的速度和可接受的准确度进行矩阵计算？

20 世纪的十大科学与工程算法列表包括线性代数的矩阵分解方法。它还包括 QR 算法，我们将介绍，以及 Krylov 迭代方法，我们将看到它的一个例子（见这里）。

来源：十大算法

在选择或设计矩阵计算算法时，要记住以下四点：

内存使用
速度
准确度
可扩展性/并行

通常会在这些类别之间进行权衡。

动机

矩阵无处不在 - 任何可以放在 Excel 电子表格中的东西都是矩阵，语言和图片也可以表示为矩阵。了解矩阵算法的选项以及如何做出权衡，可能为你的解决方案带来巨大差异。例如，近似矩阵计算通常比精确矩阵计算快数千倍。

这不仅仅是了解现有库的内容，而是了解它们的工作原理。这是因为你通常可以制作算法的变体，库不支持它们，从而为你提供所需的性能或准确度。此外，此领域目前正在快速发展，特别是在与深度学习，推荐系统，近似算法和图形分析相关的领域，因此你经常会发现最近的结果可能会对你的项目产生重大影响，但是不在你的库中。

了解算法的真正工作原理。有助于调试和加速解决方案。

矩阵计算

矩阵计算有两种关键类型，它们以多种不同的方式组合在一起。这些是：

矩阵和张量积
矩阵分解

所以基本上我们将组合矩阵，并将它们再次分开！

矩阵和张量积

矩阵和向量的积：

下面的矩阵给出了 1 年内从 1 个健康状态转移到另一个健康状态的概率。如果一组的当前健康状况是：

85% 无症状
10% 有症状
5% 的艾滋病
0% 死亡

1 年内每个健康状况的百分比是多少？

来源：马尔科夫链的概念

答案

import numpy as np

# Exercise: Use Numpy to compute the answer to the above

# array([[ 0.765 ,  0.1525,  0.0645,  0.018 ]])

矩阵和矩阵的积

来源：几种线性代数工具的简单的现实世界的应用

答案

# Exercise: Use Numpy to compute the answer to the above

'''
array([[ 50. ,  49. ],
       [ 58.5,  61. ],
       [ 43.5,  43.5]])
'''

图像数据

图像可以表示为矩阵。

来源：Adam Geitgey

卷积

卷积是卷积神经网络（CNN）的核心，它是一种深度学习，产生了过去几年图像识别的巨大进步。它们现在也越来越多地用于语音，例如 Facebook AI 的语音翻译结果比 RNN 快 9 倍（目前最流行的语音翻译方法）。

现在，在图像分类方面，计算机比人更准确。

来源：Andrej Karpathy

来源：Nvidia

你可以将卷积视为一种特殊的矩阵积。

以下3张图片均来自优秀博客文章“不同观点下的 CNN”，由 fast.ai 学生撰写：

卷积将过滤器应用于图像的每个部分：

神经网络观点：

矩阵乘法观点：

让我们在这个笔记本中看看，如何使用卷积进行边缘检测（最初来自 fast.ai 深度学习课程）。

矩阵分解

我们将在本课程的每一天讨论矩阵分解，并将在以后的课程中介绍以下示例：

主题建模（NMF 和 SVD，SVD 使用 QR）一组文档可以由术语 - 文档矩阵表示。

来源：信息检索导论

来源：NMF 教程

背景移除（截断 SVD）：

噪声移除（鲁棒 PCA，使用 SVD）：

这个示例来自 Jean Kossaifi 的博客。

谷歌的 PageRank 算法（特征值分解）：

来源：什么是 PageRank？

其它分解和应用的列表。

准确度

浮点数算术

为了理解准确性，我们首先需要了解计算机（有限和离散）如何存储数字（无限且连续）

练习

花点时间看看下面的函数f。在尝试运行它之前，在纸上写下x1 = f(110)的输出。现在，（仍在纸上）将其代入f并计算x2 = f(x1)。继续进行 10 次迭代。

这个例子来自 Greenbaum 和 Chartier 的 Numerical Methods 的第 107 页。

def f(x):
    if x <= 1/2:
        return 2 * x
    if x > 1/2:
        return 2*x - 1

仅仅在你写下你认为答案应该是什么之后，运行下面的代码：

x = 1/10
for i in range(80):
    print(x)
    x = f(x)

'''
0.1
0.2
0.4
0.8
0.6000000000000001
0.20000000000000018
0.40000000000000036
0.8000000000000007
0.6000000000000014
0.20000000000000284
0.4000000000000057
0.8000000000000114
0.6000000000000227
0.20000000000004547
0.40000000000009095
0.8000000000001819
0.6000000000003638
0.2000000000007276
0.4000000000014552
0.8000000000029104
0.6000000000058208
0.20000000001164153
0.40000000002328306
0.8000000000465661
0.6000000000931323
0.20000000018626451
0.40000000037252903
0.8000000007450581
0.6000000014901161
0.20000000298023224
0.4000000059604645
0.800000011920929
0.6000000238418579
0.20000004768371582
0.40000009536743164
0.8000001907348633
0.6000003814697266
0.20000076293945312
0.40000152587890625
0.8000030517578125
0.600006103515625
0.20001220703125
0.4000244140625
0.800048828125
0.60009765625
0.2001953125
0.400390625
0.80078125
0.6015625
0.203125
0.40625
0.8125
0.625
0.25
0.5
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
'''

哪里不对？

问题：数学是连续的或无限的，但计算机是离散的和有限的

计算机的数字表示的两个局限：

它们不能是随意大小
它们之间肯定存在差距

我们需要关心准确性的原因是，计算机无法存储无限准确的数字。可以创建给出非常错误答案的计算（特别是在多次重复操作时，因为每个操作可能会使错误成倍增加）。

计算机如何存储数字：

尾数也可以称为有效数。

IEEE双精度算术：

数字可以大到1.79×10308，小到2.23×10-308。

区间[1,2]由离散子集表示：

区间[2,4]表示为：

浮点并不是等间距的。

来源：你从未想过要了解浮点数，但将被迫了解

机器`ε`

1 和下一个较大数字之间的距离的一半。这可能因计算机而异。IEEE双精度标准规定：

浮点运算的两个重要属性

实数x与其最接近的浮点近似值fl(x)之间的差值，总是小于机器ε，相对而言。对于某些ε，其中：

*为任何操作（+，-，×，÷），⊛是它的浮点模拟，对于一些ε，其中：

也就是说，浮点运算的每个操作都精确到最大为机器ε的相对误差。

历史

事后看来，浮点运算似乎是一个明确的选择，但存在许多存储数字的方法：

定点算术
对数和半对数数系统
连续分数
有理数
有理数的可能无限字符串
级别索引数字系统
固定斜杠和浮动斜杠数字系统
2-adic 数字

对于参考，请参阅“浮点运算手册”的第 1 章（免费）。是的，有一本关于浮点的完整的 16 章书！

浮点运算的时间线历史

公元前 1600：巴比伦基数 60 的系统是最早的浮点系统（Donald Knuth）。使用基数 60 的浮点表示来表示尾数（如果两个数字的比率是 60 的幂，则表示相同）
1630：滑动规则。仅操纵有效数字（基数 10）
1914：Leonardo Torres 和 Quevedo 使用浮点运算描述了 Babbage 的分析引擎的的机电实现。
1941：第一个真正的现代实现。 Konrad Zuse 的Z3电脑。使用基数 2，具有 14 位有效数字，7 位指数和 1 个符号位。
1985：IEEE 754-1985 二进制浮点运算标准发布。提高了准确性，可靠性和便携性。 William Kahan 主导。

“已经引入了许多不同的方法来在计算机上近似实数。然而，浮点运算是迄今为止在现代计算机中最常用的表示实数的方法。使用有限集合（“机器数字”）模拟无限连续集（实数）并不是一项简单的任务：必须在速度，准确性，动态范围，易用性和实现以及内存之间找到明智的妥协。如果选择得当的参数（基数，精度，极值指数等），浮点运算似乎对于大多数数值应用来说是一个非常好的折衷方案。”

尽管基数 2（二进制）似乎是计算机中非常明显的赢家，但在各个方面使用了各种其他基数值：

早期机器 PDP-10，Burroughs 570 和 6700 使用基数 8
基数 16 的 IBM 360
基数 10 的财务计算，袖珍计算器，Maple
基数 3 的俄罗斯 SETUN 计算机（1958 年）。优点：最小化beta x p（符号乘位数），对于固定的最大可表示数字beta^p - 1。舍入等于截断
基数 2 最常见。理由：易于实现。研究表明（带有隐式前导位），这比其他所有基数都具有更好的最坏情况或平均精度。

条件作用和稳定性

由于我们无法在计算机上准确表示数字（由于存储的有限性以及浮点结构中数字之间的间隔），因此了解输入中的小扰动如何影响输出变得非常重要。

“稳定的算法几乎可以为几乎正确的问题提供正确的答案。”--Trefethen

条件作用：数学问题的扰动行为（例如最小二乘）

稳定性：用于在计算机上解决该问题的算法的扰动行为（例如，最小二乘算法，householder，回代，高斯消除）

示例：矩阵的特征值：

import scipy.linalg as la 

A = np.array([[1., 1000], [0, 1]])
B = np.array([[1, 1000], [0.001, 1]])

print(A)

print(B)

'''
[[    1.  1000.]
 [    0.     1.]]
[[  1.00000000e+00   1.00000000e+03]
 [  1.00000000e-03   1.00000000e+00]]
'''

np.set_printoptions(suppress=True, precision=4)

wA, vrA = la.eig(A)
wB, vrB = la.eig(B)

wA, wB
'''
(array([ 1.+0.j,  1.+0.j]), array([ 2.+0.j,  0.+0.j]))
'''

提醒：浮点运算的两个属性

实数x与其最接近的浮点近似值fl(x)之间的差值总是小于机器ε，相对而言。

浮点运算的每个运算+，-，×，÷都精确到最大为机器ε的相对误差。

我们将看到的例子：

经典与修正的 Gram-Schmidt 精度
Gram-Schmidt 与 Householder（计算 QR 分解的两种不同方式），答案的正交性如何
方程组的条件

近似的准确度

我们很少需要大规模地进行高精度的矩阵计算。事实上，我们经常进行某种机器学习，而不太准确的方法可以防止过度拟合。

在许多情况下，输入数据可能不那么精确，因此使用输入数据和计算来寻求更高的准确度是没有意义的。

随机数据结构
布隆过滤器
HyperLogLog
局部敏感哈希
跳过列表
Count-min 草图
最小哈希

示例：布隆过滤器能够使用每个元素的<10位，搜索集合的成员性，具有 1% 的假阳性。这通常表示数千次的内存使用减少。

通过对所有退回的项目进行第二阶段（确切）检查，可以轻松处理假阳性 - 对于稀有项目，这可能非常有效。例如，许多网络浏览器使用布隆过滤器来创建一组被阻止的页面（例如，带有病毒的页面），因为被阻止的网页只是整个网络的一小部分。可以通过获取布隆过滤器返回的任何内容并使用完整的精确列表检查 Web 服务来处理假阳性。（详细信息请参阅布隆过滤器教程）。

随机算法

Karger 算法（图的最小切割）
随机回归
蒙特卡罗模拟
随机 LU 分解（高斯消元）
随机 SVD

如果我们接受一些精度降低，那么我们通常可以通过使用近似算法将速度提高几个数量级（和/或减少内存使用）。这些算法通常以一定的概率给出正确的答案。通过多次重新运行算法，你通常可以加倍增加该概率！

昂贵的错误

以下示例来自 Greenbaum 和 Chartier。

欧洲航天局在阿丽亚娜 5 火箭上 10 年花费了 70 亿美元。

当你尝试将 64 位数放入 16 位空间（整数溢出）时会发生什么：

from IPython.display import YouTubeVideo
YouTubeVideo("PK_yguLapgA")

https://www.youtube.com/embed/PK_yguLapgA

这是一个浮点错误，耗资 4.75 亿英镑：

1994 纽约时报关于英特尔奔腾错误的文章

资源：浮点运算的更多信息，请参阅 Trefethen & Bau 的第 13 讲和 Greenbaum & Chartier 的第 5 章。

内存使用

稀疏 VS 密集

上面我们介绍了如何存储数字，现在让我们来谈谈如何存储矩阵。节省内存（和计算）的关键方法不是存储所有矩阵。相反，只需存储非零元素。这称为稀疏存储，它非常适合稀疏矩阵，即大多数元素为零的矩阵。

以下是有限元问题的矩阵示例，该问题出现在工程中（例如，在对平面周围的气流进行建模时）。在此示例中，非零元素为黑色，零元素为白色：

来源

还有特殊类型的结构化矩阵，例如对角线，三对角线，hessenberg 和三角，它们都表现稀疏性的特定模式，可以利用它们来减少内存和计算。

与稀疏矩阵相反的是密集矩阵，以及密集存储，其仅仅指代主要包含非零的矩阵，其中每个元素被显式存储。由于稀疏矩阵是有用且常见的，因此数值线性代数侧重于通过尽可能多的操作来保持稀疏性。

速度

速度差异来自许多方面，特别是：

计算复杂性
向量化
扩展到多个核心和节点
局部性

计算复杂性

算法通常使用关于矩阵中的行数和列数的计算复杂度来表示。例如。你可能会发现一个描述为O(n^2 m)的算法。

计算复杂性和O符号经常出现在技术评论中，所以实践它们是个好主意。你可以在这些网站中了解概念和实践：

向量化

现代 CPU 和 GPU 可以在单个核心上同时对多个元素应用操作。例如，在一个步骤中选取向量中的 4 个浮点数的指数。这称为 SIMD。你不会显式编写 SIMD 代码（它往往需要汇编语言或特殊的 C “内在函数”），而是在像 numpy 这样的库中使用向量化操作，而后者又依赖于专门调整的向量化底层线性代数 API（特别是 BLAS 和 LAPACK）。

矩阵计算包：BLAS 和 LAPACK

BLAS（基本线性代数子程序）：底层矩阵和向量算术运算的规范。这些是用于执行基本向量和矩阵运算的标准积木。 BLAS 起源于 1979 年的 Fortran 库。BLAS 库的示例包括：AMD 核心数学库（ACML），ATLAS，英特尔数学核心库（MKL）和 OpenBLAS。

LAPACK 是用 Fortran 编写的，提供了解决线性方程组，特征值问题和奇异值问题的例程。矩阵因式分解（LU，Cholesky，QR，SVD，Schur）。处理密集和带状矩阵，但不处理一般稀疏矩阵。实数和复数，单精度和双精度。

20 世纪 70 年代和 80 年代：EISPACK（特征值例程）和 LINPACK（线性方程和线性最小二乘例程）库。

LAPACK 最初的目标：使 LINAPCK 和 EISPACK 在共享内存的向量和并行处理器上高效运行，并在现代基于缓存的体系结构上利用缓存（最初于 1992 年发布）。 EISPACK 和 LINPACK 忽略了多层内存架构，并且花费了太多时间来移动数据。

LAPACK 使用高度优化的块操作实现（在每台机器上大量实现），以便尽可能多的计算由 BLAS 执行。

局部性

使用较慢的方式来访问数据（例如，通过互联网）可以比快速方式（例如来自寄存器）慢十亿倍。但慢速存储器远多于快速存储器。因此，一旦我们在快速存储中拥有数据，我们就希望在那里进行所需的任何计算，而不是每次需要时都需要多次加载。此外，对于大多数类型的存储，访问彼此相邻的数据项要快得多，因此我们应该尽量使用附近存储的任何数据，我们知道我们很快就需要它。这两个问题被称为局部性。

不同类型内存的速度

以下是每个人都应该知道的一些数字（来自传奇人物 Jeff Dean）：

L1 高速缓存引用：0.5 ns
L2 缓存引用：7 ns
主存储器引用 / RAM：100 ns
通过 1 Gbps 网络发送 2K 字节：20,000 ns
从内存顺序读取 1 MB：250,000 ns
在同一数据中心内往返：500,000 ns
磁盘搜索：10,000,000 ns
从网络顺序读取 1 MB：10,000,000 ns
从磁盘顺序读取 1 MB：30,000,000 ns
发送数据包 CA-> Netherlands-> CA：150,000,000 ns

这是一个更新的交互式版本，其中包含这些数字如何变化的时间表。

关键结论：每个连续的存储器类型（至少）比之前的存储器类型差一个数量级。磁盘搜索非常慢。

这个视频有一个很好的例子，展示了几种计算照片模糊的方法，并进行了各种权衡。不要担心出现的 C 代码，只关注矩阵计算的红色和绿色运动图像。

虽然视频是关于一种名为 Halide 的新语言，但它很好地说明了它所引发的问题是普遍存在的。观看 1-13 分钟：

from IPython.display import YouTubeVideo
YouTubeVideo("3uiEyEKji0M")

https://www.youtube.com/embed/3uiEyEKji0M

局部性很难。潜在的权衡：

用于节省内存带宽的冗余计算
牺牲并行性来获得更好的复用

临时性

当计算结果存储在 RAM 中的临时变量中，然后加载该变量来对其进行另一次计算时，就会出现“临时性”问题。这比将数据保存在缓存或寄存器中，并在将最终结果存储到 RAM 之前，进行所有必要的计算要慢很多个数量级。这对我们来说尤其是一个问题，因为 numpy 通常为每一个操作或功能创造临时性。例如。a = b * c^2 + ln(d)将创建四个临时值（因为有四个操作和函数）。

扩展到多个核心和节点

我们有一个单独的可扩展性章节，但值得注意的是，这对速度也很重要 - 如果我们无法扩展我们拥有的所有计算资源，我们将会遇到计算速度较慢的问题。

可扩展性/并行化

通常我们会发现我们拥有的数据比用于处理内存或计算时间要多。在这种情况下，我们希望能够在多个核（在一个计算机内）或节点（即网络上的多个计算机）上扩展我们的算法。虽然我们将研究跨多个核的扩展（称为并行化），但我们不会在本课程中处理多节点扩展。通常，可扩展算法是指，输入可以分解成较小的部分，每个部分由不同的核心/计算机处理，然后在最后重新组合在一起。

十、实现 QR 分解

我们在计算特征值时使用 QR 分解并计算最小二乘回归。它是数值线性代数中的重要组成部分。

“数值线性代数中的一种算法比其他算法更重要：QR 分解。” --Trefethen，第 48 页

回想一下，对于任何矩阵A，A = QR，其中Q是正交的，R是上三角。

提醒：我们在上一课中看到的 QR 算法使用 QR 分解，但不要混淆二者。

NumPy 中

import numpy as np

np.set_printoptions(suppress=True, precision=4)

n = 5
A = np.random.rand(n,n)
npQ, npR = np.linalg.qr(A)

检查Q是正交的：

np.allclose(np.eye(n), npQ @ npQ.T), np.allclose(np.eye(n), npQ.T @ npQ)

# (True, True)

检查R是三角。

npR

'''
array([[-0.8524, -0.7872, -1.1163, -1.2248, -0.7587],
       [ 0.    , -0.9363, -0.2958, -0.7666, -0.632 ],
       [ 0.    ,  0.    ,  0.4645, -0.1744, -0.3542],
       [ 0.    ,  0.    ,  0.    ,  0.4328, -0.2567],
       [ 0.    ,  0.    ,  0.    ,  0.    ,  0.1111]])
'''

当向量b投影到直线a上时，其投影p是b沿着直线a的一部分。

让我们看看沉浸式线性代数在线版的第 3.2.2 节：投影的交互图。

来源：沉浸式数学

以下是将向量投影到平面上的样子：

来源：最小二乘回归的线性代数视角

当向量b投影到直线a上时，其投影p是b沿着直线a的一部分。所以p是a的一些倍数。设 $\mathbf{p} = \hat{x}\mathbf{a}$ 其中 $\hat{x}$ 是标量。

正交性

投影的关键是正交性：从b到p的直线（可以写成 $\mathbf{b} - \hat{x}\mathbf{a}$ ）垂直于a。

这意味着：

$\mathbf{a} \cdot (\mathbf{b} - \hat{x}\mathbf{a}) = 0$

所以：

$\hat{x} = \frac{\mathbf{a} \cdot \mathbf{b}}{\mathbf{a} \cdot \mathbf{a}}$

Gram-Schmidt

经典的 Gram-Schmidt（不稳定）

对于每列j，计算单一投影：

其中与 $q_1, ..., q_{j-1}$ 的跨度正交的空间。

def cgs(A):
    m, n = A.shape
    Q = np.zeros([m,n], dtype=np.float64)
    R = np.zeros([n,n], dtype=np.float64)
    for j in range(n):
        v = A[:,j]
        for i in range(j):
            R[i,j] = np.dot(Q[:,i], A[:,j])
            v = v - (R[i,j] * Q[:,i])
        R[j,j] = np.linalg.norm(v)
        Q[:, j] = v / R[j,j]
    return Q, R

Q, R = cgs(A)

np.allclose(A, Q @ R)

# True

检查Q是酉矩阵。

np.allclose(np.eye(len(Q)), Q.dot(Q.T))

# True

np.allclose(npQ, -Q)

# True

R

'''
array([[ 0.02771,  0.02006, -0.0164 , ...,  0.00351,  0.00198,  0.00639],
       [ 0.     ,  0.10006, -0.00501, ...,  0.07689, -0.0379 , -0.03095],
       [ 0.     ,  0.     ,  0.01229, ...,  0.01635,  0.02988,  0.01442],
       ..., 
       [ 0.     ,  0.     ,  0.     , ...,  0.     , -0.     , -0.     ],
       [ 0.     ,  0.     ,  0.     , ...,  0.     ,  0.     , -0.     ],
       [ 0.     ,  0.     ,  0.     , ...,  0.     ,  0.     ,  0.     ]])
'''

Gram-Schmidt 应该让你想起来一点 Arnoldi 迭代（用于将矩阵转换为海森堡形式），因为它也是一个结构化的正交化。

改进版 Gram-Schmidt

经典（不稳定的）Gram-Schmidt：对于每列j，计算单一投影：

其中与 $q_1, ..., q_{j-1}$ 的跨度正交的空间。

改进版 Gram-Schmidt：对于每列j，计算n - 1个投影：

$P_j = P_{\perp q_{j-1}\cdots\perp q_{2}\perp q_{1}}$

import numpy as np
n = 3
A = np.random.rand(n,n).astype(np.float64)

def cgs(A):
    m, n = A.shape
    Q = np.zeros([m,n], dtype=np.float64)
    R = np.zeros([n,n], dtype=np.float64)
    for j in range(n):
        v = A[:,j]
        for i in range(j):
            R[i,j] = np.dot(Q[:,i], A[:,j])
            v = v - (R[i,j] * Q[:,i])
        R[j,j] = np.linalg.norm(v)
        Q[:, j] = v / R[j,j]
    return Q, R

def mgs(A):
    V = A.copy()
    m, n = A.shape
    Q = np.zeros([m,n], dtype=np.float64)
    R = np.zeros([n,n], dtype=np.float64)
    for i in range(n):
        R[i,i] = np.linalg.norm(V[:,i])
        Q[:,i] = V[:,i] / R[i,i]
        for j in range(i, n):
            R[i,j] = np.dot(Q[:,i],V[:,j])
            V[:,j] = V[:,j] - R[i,j]*Q[:,i]
    return Q, R

Q, R = mgs(A)

np.allclose(np.eye(len(Q)), Q.dot(Q.T.conj()))

# True

np.allclose(A, np.matmul(Q,R))

# True

Householder

引言

$\begin{array}{ l | l | c } \hline Gram-Schmidt & Triangular, Orthogonalization & A R_1 R_2 \cdots R_n = Q \ Householder & Orthogonal, Triangularization & Q_n \cdots Q_2 Q_1 A = R \ \hline \end{array}$

Householder 反射产生更接近正交的矩阵Q，具有舍入误差

Gram-Schmidt 可以部分停止，留下A的前n列的简化 QR。

初始化

import numpy as np
n = 4
A = np.random.rand(n,n).astype(np.float64)

Q = np.zeros([n,n], dtype=np.float64)
R = np.zeros([n,n], dtype=np.float64)

A

'''
array([[ 0.5435,  0.6379,  0.4011,  0.5773],
       [ 0.0054,  0.8049,  0.6804,  0.0821],
       [ 0.2832,  0.2416,  0.8656,  0.8099],
       [ 0.1139,  0.9621,  0.7623,  0.5648]])
'''

from scipy.linalg import block_diag

np.set_printoptions(5)

算法

我添加了更多的计算和更多的信息，因为它说明了算法的工作原理。此版本也返回 Householder 反射。

def householder_lots(A):
    m, n = A.shape
    R = np.copy(A)
    V = []
    Fs = []
    for k in range(n):
        v = np.copy(R[k:,k])
        v = np.reshape(v, (n-k, 1))
        v[0] += np.sign(v[0]) * np.linalg.norm(v)
        v /= np.linalg.norm(v)
        R[k:,k:] = R[k:,k:] - 2*np.matmul(v, np.matmul(v.T, R[k:,k:]))
        V.append(v)
        F = np.eye(n-k) - 2 * np.matmul(v, v.T)/np.matmul(v.T, v)
        Fs.append(F)
    return R, V, Fs

检查R是上三角。

R

'''
array([[-0.62337, -0.84873, -0.88817, -0.97516],
       [ 0.     , -1.14818, -0.86417, -0.30109],
       [ 0.     ,  0.     , -0.64691, -0.45234],
       [-0.     ,  0.     ,  0.     , -0.26191]])
'''

作为检查，我们将使用分块矩阵F计算和R。矩阵F是 householder 反射。

请注意，这不是一种处理Q的有效计算方式。在大多数情况下，你实际上并不需要Q。例如，如果你使用 QR 来求解最小二乘，则只需要Q * b。

对于隐式计算乘积Q * b或Qx的技巧，请参阅 Trefethen 第 74 页。
请参阅这些讲义，了解 Householder 的不同实现，它同时计算Q，作为R的一部分。

QT = np.matmul(block_diag(np.eye(3), F[3]), 
               np.matmul(block_diag(np.eye(2), F[2]), 
                         np.matmul(block_diag(np.eye(1), F[1]), F[0])))

F[1]

'''
array([[-0.69502,  0.10379, -0.71146],
       [ 0.10379,  0.99364,  0.04356],
       [-0.71146,  0.04356,  0.70138]])
'''

block_diag(np.eye(1), F[1])

'''
array([[ 1.     ,  0.     ,  0.     ,  0.     ],
       [ 0.     , -0.69502,  0.10379, -0.71146],
       [ 0.     ,  0.10379,  0.99364,  0.04356],
       [ 0.     , -0.71146,  0.04356,  0.70138]])
'''

block_diag(np.eye(2), F[2])

'''
array([[ 1.     ,  0.     ,  0.     ,  0.     ],
       [ 0.     ,  1.     ,  0.     ,  0.     ],
       [ 0.     ,  0.     , -0.99989,  0.01452],
       [ 0.     ,  0.     ,  0.01452,  0.99989]])
'''

block_diag(np.eye(3), F[3])

'''
array([[ 1.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.],
       [ 0.,  0.,  1.,  0.],
       [ 0.,  0.,  0., -1.]])
'''

np.matmul(block_diag(np.eye(1), F[1]), F[0])

'''
array([[-0.87185, -0.00861, -0.45431, -0.18279],
       [ 0.08888, -0.69462,  0.12536, -0.70278],
       [-0.46028,  0.10167,  0.88193, -0.00138],
       [-0.14187, -0.71211,  0.00913,  0.68753]])
'''

QT

'''
array([[-0.87185, -0.00861, -0.45431, -0.18279],
       [ 0.08888, -0.69462,  0.12536, -0.70278],
       [ 0.45817, -0.112  , -0.88171,  0.01136],
       [ 0.14854,  0.71056, -0.02193, -0.68743]])
'''

R2 = np.matmul(block_diag(np.eye(3), F[3]), 
               np.matmul(block_diag(np.eye(2), F[2]),
                         np.matmul(block_diag(np.eye(1), F[1]), 
                                   np.matmul(F[0], A))))

np.allclose(A, np.matmul(np.transpose(QT), R2))

# True

np.allclose(R, R2)

# True

这是 Householder 的简洁版本（尽管我创建了一个新的R，而不是覆盖A，并原地计算它）。

def householder(A):
    m, n = A.shape
    R = np.copy(A)
    Q = np.eye(m)
    V = []
    for k in range(n):
        v = np.copy(R[k:,k])
        v = np.reshape(v, (n-k, 1))
        v[0] += np.sign(v[0]) * np.linalg.norm(v)
        v /= np.linalg.norm(v)
        R[k:,k:] = R[k:,k:] - 2 * v @ v.T @ R[k:,k:]
        V.append(v)
    return R, V

RH, VH = householder(A)

检查R是对角的。

RH

'''
array([[-0.62337, -0.84873, -0.88817, -0.97516],
       [-0.     , -1.14818, -0.86417, -0.30109],
       [-0.     , -0.     , -0.64691, -0.45234],
       [-0.     ,  0.     ,  0.     , -0.26191]])
'''

VH

'''
[array([[ 0.96743],
        [ 0.00445],
        [ 0.2348 ],
        [ 0.09447]]), array([[ 0.9206 ],
        [-0.05637],
        [ 0.38641]]), array([[ 0.99997],
        [-0.00726]]), array([[ 1.]])]
'''

np.allclose(R, RH)

# True

def implicit_Qx(V,x):
    n = len(x)
    for k in range(n-1,-1,-1):
        x[k:n] -= 2*np.matmul(v[-k], np.matmul(v[-k], x[k:n]))

A

'''
array([[ 0.54348,  0.63791,  0.40114,  0.57728],
       [ 0.00537,  0.80485,  0.68037,  0.0821 ],
       [ 0.2832 ,  0.24164,  0.86556,  0.80986],
       [ 0.11395,  0.96205,  0.76232,  0.56475]])
'''

经典和改良的 Gram-Schmidt 都需要2mn^2个浮点运算。

陷阱

有些事情需要注意：

当你复制值时 VS 当你有两个指向同一内存位置的变量时
长度为n的向量与1 x n矩阵之间的差异（np.matmul以不同方式处理它们）

类比

	`A=QR`	`A=QHQ*`
正交结构化	Householder	Householder
结构化正交	Gram-Schmidt	Arnoldi

Gram-Schmidt 和 Arnoldi：连续的三角运算，可以部分停止，前n列是正确的。

Householder：连续的正交运算。在存在舍入误差的情况下产生更接近正交的A。

请注意，要计算海森堡化简A = QHQ *，将 Householder 反射应用于A的两侧，而不是仅应用于一侧。

示例

以下示例来自 Trefethen 和 Bau 的第 9 讲，尽管从 MATLAB 翻译成 Python。

示例：经典与改进的 Gram-Schmidt

这个例子是 Trefethen 第 9 节的实验 2。我们想要构造一个方阵A，它具有随机奇异向量和广泛变化的奇异值，间隔为 $2^{-1}$ 和 $2^{-(n + 1)}$ 之间的 2 的倍数。

import matplotlib.pyplot as plt
from matplotlib import rcParams
%matplotlib inline

n = 100
U, X = np.linalg.qr(np.random.randn(n,n))   # 将 U 设为随机正交矩阵
V, X = np.linalg.qr(np.random.randn(n,n))   # 将 V 设为随机正交矩阵
S = np.diag(np.power(2,np.arange(-1,-(n+1),-1), dtype=float))  # 将 S 设为对角矩阵 w/ exp 
                                                              # 值在 2^-1 和 2^-(n+1) 之间

A = np.matmul(U,np.matmul(S,V))

QC, RC = cgs(A)
QM, RM = mgs(A)

plt.figure(figsize=(10,10))
plt.semilogy(np.diag(S), 'r.', basey=2, label="True Singular Values")
plt.semilogy(np.diag(RM), 'go', basey=2, label="Modified Gram-Shmidt")
plt.semilogy(np.diag(RC), 'bx', basey=2, label="Classic Gram-Shmidt")
plt.legend()
rcParams.update({'font.size': 18})

type(A[0,0]), type(RC[0,0]), type(S[0,0])

# (numpy.float64, numpy.float64, numpy.float64)

eps = np.finfo(np.float64).eps; eps

# 2.2204460492503131e-16

np.log2(eps), np.log2(np.sqrt(eps))

# (-52.0, -26.0)

示例：正交性的数值损失

这个例子是 Trefethen 第 9 节的实验 3。

A = np.array([[0.70000, 0.70711], [0.70001, 0.70711]])

A

'''
array([[ 0.7    ,  0.70711],
       [ 0.70001,  0.70711]])
'''

Gram-Schmidt：

Q1, R1 = mgs(A)

Householder：

R2, V, F = householder_lots(A)
Q2T = np.matmul(block_diag(np.eye(1), F[1]), F[0])

NumPy 的 Householder：

Q3, R3 = np.linalg.qr(A)

检查 QR 分解是否能用：

np.matmul(Q1, R1)

'''
array([[ 0.7   ,  0.7071],
       [ 0.7   ,  0.7071]])
'''

np.matmul(Q2T.T, R2)

'''
array([[ 0.7   ,  0.7071],
       [ 0.7   ,  0.7071]])
'''

np.matmul(Q3, R3)

'''
array([[ 0.7   ,  0.7071],
       [ 0.7   ,  0.7071]])
'''

检查Q多么接近完美正交。

np.linalg.norm(np.matmul(Q1.T, Q1) - np.eye(2))  # 改进的 Gram-Schmidt

# 3.2547268868202263e-11

np.linalg.norm(np.matmul(Q2T.T, Q2T) - np.eye(2))  # 我们的 Householder 实现

# 1.1110522984689321e-16

np.linalg.norm(np.matmul(Q3.T, Q3) - np.eye(2))  # Numpy（它使用 Householder）

# 2.5020189909116529e-16

GS（Q1）不如 Householder（Q2T，Q3）稳定。

二、SVD 背景消除

我们今天的目标：

加载和格式化数据

让我们使用 BMC 2012 背景建模挑战数据集中的真实视频 003。

导入所需的库：

import imageio
imageio.plugins.ffmpeg.download()

'''
Imageio: 'ffmpeg.linux64' was not found on your computer; downloading it now.
Try 1. Download from https://github.com/imageio/imageio-binaries/raw/master/ffmpeg/ffmpeg.linux64 (27.2 MB)
Downloading: 8192/28549024 bytes (0.02260992/28549024 bytes (7.9%5455872/28549024 bytes (19.18790016/28549024 bytes (30.812189696/28549024 bytes (42.7%15687680/28549024 bytes (54.9%18898944/28549024 bytes (66.2%22134784/28549024 bytes (77.5%25518080/28549024 bytes (89.4%28549024/28549024 bytes (100.0%)
  Done
File saved as /home/racheltho/.imageio/ffmpeg/ffmpeg.linux64.
'''

import moviepy.editor as mpe
import numpy as np
import scipy

%matplotlib inline
import matplotlib.pyplot as plt

np.set_printoptions(precision=4, suppress=True)

video = mpe.VideoFileClip("movie/Video_003.avi")

video.subclip(0,50).ipython_display(width=300)

'''
100%|█████████▉| 350/351 [00:00<00:00, 914.11it/s]
'''

video.duration

# 113.57

辅助方法

def create_data_matrix_from_video(clip, fps=5, scale=50):
    return np.vstack([scipy.misc.imresize(rgb2gray(clip.get_frame(i/float(fps))).astype(int), 
                      scale).flatten() for i in range(fps * int(clip.duration))]).T

def rgb2gray(rgb):
    return np.dot(rgb[...,:3], [0.299, 0.587, 0.114])

格式化数据

一个时刻的图像是120像素乘160像素（缩放时）。我们可以将该图片展开为一个很高的列。因此，我们不是拥有120×160的 2D 图像，而是拥有1×19,200的一列。

这不是人类可读的，但它很方便，因为它可以让我们将来自不同时间的图像叠加在一起，将视频全部放入 1 个矩阵中。如果我们拍摄视频 100 秒，每隔百分之一秒一张图像（所以有 10,000 个不同的图像，每个图像来自不同的时间点），我们将拥有10,000×19,200的矩阵，代表视频！

scale = 0.50   # Adjust scale to change resolution of image
dims = (int(240 * scale), int(320 * scale))
fps = 60      # frames per second

M = create_data_matrix_from_video(video.subclip(0,100), fps, scale)
# M = np.load("movie/med_res_surveillance_matrix_60fps.npy")

print(dims, M.shape)

# (120, 160) (19200, 6000)

plt.imshow(np.reshape(M[:,140], dims), cmap='gray');

由于create_data_from_matrix有点慢，我们将保存矩阵。通常，只要你的预处理步骤较慢，最好保存结果以备将来使用。

np.save("movie/med_res_surveillance_matrix_60fps.npy", M)

plt.figure(figsize=(12, 12))
plt.imshow(M[::3,:], cmap='gray')

# <matplotlib.image.AxesImage at 0x7f92be09d710>

问题：那些黑波浪线是什么？水平线是什么？

奇异值分解

SVD介绍

“将矩阵分解为我们关心的更简单，更有意义的片段的便捷方式” - 大卫奥斯汀

“我忘了学习的最重要的线性代数概念” - Daniel Lemire

SVD的应用：

主成分分析
数据压缩
伪逆
协同过滤
主题建模
背景消除
删除损坏的数据

U, s, V = np.linalg.svd(M, full_matrices=False)

这非常慢，因此你可能希望保存结果以便将来使用。

np.save("movie/U.npy", U)
np.save("movie/s.npy", s)
np.save("move/V.npy", V)

将来，你只需加载已保存的内容：

U = np.load("movie/U.npy")
s = np.load("movie/s.npy")
V = np.load("movie/V.npy")

U, s, V是什么样子？

U.shape, s.shape, V.shape

# ((19200, 6000), (6000,), (6000, 6000))

练习

检查它们是否是M的分解。

# Exercise: 

# True

他们正是。

np.allclose(M, reconstructed_matrix)

# True

np.set_printoptions(suppress=True, precision=0)

`s`的属性

np.diag(s[:6])

你看到s的顺序了吗？

s[0:2000:50]

'''
array([ 1341720.,    10528.,     6162.,     4235.,     3174.,     2548.,
           2138.,     1813.,     1558.,     1346.,     1163.,     1001.,
            841.,      666.,        0.,        0.,        0.,        0.,
              0.,        0.,        0.,        0.,        0.,        0.,
              0.,        0.,        0.,        0.,        0.,        0.,
              0.,        0.,        0.,        0.,        0.,        0.,
              0.,        0.,        0.,        0.])
'''

len(s)

# 6000

s[700]

# 3.2309523518534773e-10

np.set_printoptions(suppress=True, precision=4)

U是个巨大的矩阵，所以仅仅查看一小部分：

U[:5,:5]

'''
array([[-0.0083, -0.0009, -0.0007,  0.003 , -0.0002],
       [-0.0083, -0.0013, -0.0005,  0.0034, -0.0001],
       [-0.0084, -0.0012,  0.0002,  0.0045, -0.0003],
       [-0.0085, -0.0011,  0.0001,  0.0044, -0.    ],
       [-0.0086, -0.0013, -0.0002,  0.004 ,  0.0001]])
'''

寻找背景

U.shape, s.shape, V.shape

# ((19200, 6000), (6000,), (6000, 6000))

low_rank = np.expand_dims(U[:,0], 1) * s[0] * np.expand_dims(V[0,:], 0)

plt.figure(figsize=(12, 12))
plt.imshow(low_rank, cmap='gray')

# <matplotlib.image.AxesImage at 0x7f1cc3e2c9e8>

plt.imshow(np.reshape(low_rank[:,0], dims), cmap='gray');

如何得到里面的人？

plt.imshow(np.reshape(M[:,0] - low_rank[:,0], dims), cmap='gray');

高分辨率版本。

plt.imshow(np.reshape(M[:,140] - low_rank[:,140], dims), cmap='gray');

制作视频

我受到了 fast.ai 学生萨米尔·穆萨（Samir Moussa）的启发来制作人物视频。

from moviepy.video.io.bindings import mplfig_to_npimage

def make_video(matrix, dims, filename):
    mat_reshaped = np.reshape(matrix, (dims[0], dims[1], -1))
    
    fig, ax = plt.subplots()
    def make_frame(t):
        ax.clear()
        ax.imshow(mat_reshaped[...,int(t*fps)])
        return mplfig_to_npimage(fig)
    
    animation = mpe.VideoClip(make_frame, duration=int(10))
    animation.write_videofile('videos/' + filename + '.mp4', fps=fps)

make_video(M - low_rank, dims, "figures2")

'''
[MoviePy] >>>> Building video videos/figures2.mp4
[MoviePy] Writing video videos/figures2.mp4

100%|█████████▉| 600/601 [00:39<00:00, 15.22it/s]

[MoviePy] Done.
[MoviePy] >>>> Video ready: videos/figures2.mp4 
'''

mpe.VideoFileClip("videos/figures2.mp4").subclip(0,10).ipython_display(width=300)

# 100%|█████████▉| 600/601 [00:00<00:00, 858.48it/s]

SVD 分解不同尺寸矩阵的速度

import timeit
import pandas as pd

m_array = np.array([100, int(1e3), int(1e4)])
n_array = np.array([100, int(1e3), int(1e4)])

index = 
pd.MultiIndex.from_product([m_array, n_array], names=['# rows', '# cols'])

pd.options.display.float_format = '{:,.3f}'.format
df = pd.DataFrame(index=m_array, columns=n_array)

# %%prun
for m in m_array:
    for n in n_array:      
        A = np.random.uniform(-40,40,[m,n])  
        t = timeit.timeit('np.linalg.svd(A, full_matrices=False)', number=3, globals=globals())
        df.set_value(m, n, t)

df/3

	100	1000	10000
100	0.006	0.009	0.043
1000	0.004	0.259	0.992
10000	0.019	0.984	218.726

一个视频中的两个背景

我们现在使用 BMC 2012 背景建模挑战数据集中的真实视频 008，以及上面使用的 003。

from moviepy.editor import concatenate_videoclips

video2 = mpe.VideoFileClip("movie/Video_008.avi")

concat_video = concatenate_videoclips([video2.subclip(0,20), video.subclip(0,10)])
concat_video.write_videofile("movie/concatenated_video.mp4")

'''
[MoviePy] >>>> Building video movie/concatenated_video.mp4
[MoviePy] Writing video movie/concatenated_video.mp4

100%|█████████▉| 300/301 [00:00<00:00, 481.76it/s]

[MoviePy] Done.
[MoviePy] >>>> Video ready: movie/concatenated_video.mp4 
'''

concat_video = mpe.VideoFileClip("movie/concatenated_video.mp4")

现在回到我们的背景消除问题

concat_video.ipython_display(width=300, maxduration=160)

# 100%|█████████▉| 300/301 [00:00<00:00, 533.88it/s]

scale = 0.5   # 调整比例来更改图像的分辨率
dims = (int(240 * scale), int(320 * scale))

N = create_data_matrix_from_video(concat_video, fps, scale)
# N = np.load("low_res_traffic_matrix.npy")
np.save("med_res_concat_video.npy", N)

N.shape

# (19200, 1800)

plt.imshow(np.reshape(N[:,200], dims), cmap='gray');

U_concat, s_concat, V_concat = np.linalg.svd(N, full_matrices=False)

这很慢，因此你可能希望保存结果以便将来使用。

np.save("movie/U_concat.npy", U_concat)
np.save("movie/s_concat.npy", s_concat)
np.save("movie/V_concat.npy", V_concat)

将来，你只需加载已保存的内容：

U_concat = np.load("movie/U_concat.npy")
s_concat = np.load("movie/s_concat.npy")
V_concat = np.load("movie/V_concat.npy")

low_rank = U_concat[:,:10] @ np.diag(s_concat[:10]) @ V_concat[:10,:]

U的最小主成分：

plt.imshow(np.reshape(U_concat[:, 1], dims), cmap='gray')

# <matplotlib.image.AxesImage at 0x7f92bcf47da0>

plt.imshow(np.reshape(U_concat[:, 2], dims), cmap='gray')

# <matplotlib.image.AxesImage at 0x7f92bc691a90>

plt.imshow(np.reshape(U_concat[:, 3], dims), cmap='gray')

# <matplotlib.image.AxesImage at 0x7f92bc5aa240>

背景移除：

plt.imshow(np.reshape((N - low_rank)[:, -40], dims), cmap='gray')

# <matplotlib.image.AxesImage at 0x7f92bc540908>

plt.imshow(np.reshape((N - low_rank)[:, 240], dims), cmap='gray')

# <matplotlib.image.AxesImage at 0x7f92bc4d7f28>

数据压缩旁注

假设我们采用 700 个奇异值（记住，总共有 10000 个奇异值）。

s[0:1000:50]

'''
array([ 1341719.6552,    10527.5148,     6162.0638,     4234.9367,
           3174.0389,     2548.4273,     2138.1887,     1812.9873,
           1557.7163,     1345.805 ,     1163.2866,     1000.5186,
            841.4604,      665.7271,        0.    ,        0.    ,
              0.    ,        0.    ,        0.    ,        0.    ])
'''

k = 700
compressed_M = U[:,:k] @ np.diag(s[:k]) @ V[:k,:]

plt.figure(figsize=(12, 12))
plt.imshow(compressed_M, cmap='gray')

# <matplotlib.image.AxesImage at 0x7fefa0076ac8>

plt.imshow(np.reshape(compressed_M[:,140], dims), cmap='gray');

np.allclose(compressed_M, M)

# True

np.linalg.norm(M - compressed_M)

# 2.864899899979104e-09

U[:,:k].shape, s[:k].shape, V[:k,:].shape

# ((19200, 700), (700,), (700, 6000))

节省的空间为对于 700 个奇异值的U, s, V中的数据比原始矩阵。

((19200 + 1 + 6000) * 700) / (19200 * 6000)

# 0.1531310763888889

我们只需要存储 15.3% 的数据，并且可以将精度保持在1e-5！很棒！

很漂亮！！！但...
SVD 的运行时复杂度为O(min(m^2 n, m n^2))
缺点：这真的很慢（同样，我们摒弃了很多计算）。

%time u, s, v = np.linalg.svd(M, full_matrices=False)

'''
CPU times: user 5min 38s, sys: 1.53 s, total: 5min 40s
Wall time: 57.1 s
'''

M.shape

# (19200, 6000)

三、使用 NMF 和 SVD 的主题建模

主题建模是开始使用矩阵分解的好方法。我们从术语 - 文档矩阵开始：

来源：信息检索导论

我们可以将其分解为一个高的窄矩阵乘以一个宽的扁矩阵（中间可能有对角矩阵）。

请注意，此表示不考虑单词顺序或句子结构。这是一个词袋的例子。

动机

考虑最极端的情况 - 使用两个向量的外积重建矩阵。显然，在大多数情况下，我们无法准确地重建矩阵。但是，如果我们有一个向量，带有每个单词在所有单词中的相对频率，而另一个向量具有每个文档的平均单词数，那么外积将尽可能接近。

现在考虑将矩阵增加到两列和两行。现在最佳分解是将文档聚类成两组，每组具有尽可能彼此不同的单词分布，但在簇中的文档中尽可能相似。我们将这两个组称为“主题”。我们会根据每个主题中最常出现的词汇将这些词汇分为两组。

今天的课程中

我们将采用几个不同类别的文档数据集，并为它们查找主题（由单词组组成）。了解实际类别有助于我们评估我们发现的主题是否有意义。

我们将尝试使用两种不同的矩阵因式分解：奇异值分解（SVD）和非负矩阵分解（NMF）。

import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn import decomposition
from scipy import linalg
import matplotlib.pyplot as plt

%matplotlib inline
np.set_printoptions(suppress=True)

附加资源

数据源：新闻组是 Usenet 上的讨论组，它在网络真正起飞之前的 80 年代和 90 年代很流行。该数据集包括 18,000 个新闻组帖子，带有 20 个主题。
Chris Manning 的矩阵分解和 LSI 的书
Sklearn 的截断 SVD LSI 的细节

其它教程

Scikit-Learn：文本文档的核外分类：使用 Reuters-21578 数据集（标有 ~100 个类别的路透社文章），HashingVectorizer
使用人文和社会科学主题模型进行文本分析：使用 Jane Austen，Charlotte Bronte，Victor Hugo 等人的英国和法国文学数据集

建立数据

Scikit Learn 附带了许多内置数据集，以及加载工具来加载多个标准外部数据集。这是一个很好的资源，数据集包括波士顿房价，人脸图像，森林斑块，糖尿病，乳腺癌等。我们将使用新闻组数据集。

新闻组是 Usenet 上的讨论组，它在网络真正起飞之前的 80 年代和 90 年代很流行。该数据集包括 18,000 个新闻组帖子，带有 20 个主题。

categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
remove = ('headers', 'footers', 'quotes')
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, remove=remove)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories, remove=remove)

newsgroups_train.filenames.shape, newsgroups_train.target.shape

# ((2034,), (2034,))

我们来看看一些数据。你能猜出这些消息属于哪个类别？

print("\n".join(newsgroups_train.data[:3]))

'''
Hi,

I've noticed that if you only save a model (with all your mapping planes
positioned carefully) to a .3DS file that when you reload it after restarting
3DS, they are given a default position and orientation.  But if you save
to a .PRJ file their positions/orientation are preserved.  Does anyone
know why this information is not stored in the .3DS file?  Nothing is
explicitly said in the manual about saving texture rules in the .PRJ file. 
I'd like to be able to read the texture rule information, does anyone have 
the format for the .PRJ file?

Is the .CEL file format available from somewhere?

Rych


Seems to be, barring evidence to the contrary, that Koresh was simply
another deranged fanatic who thought it neccessary to take a whole bunch of
folks with him, children and all, to satisfy his delusional mania. Jim
Jones, circa 1993.


Nope - fruitcakes like Koresh have been demonstrating such evil corruption
for centuries.

 >In article <1993Apr19.020359.26996@sq.sq.com>, msb@sq.sq.com (Mark Brader) 

MB>                                                             So the
MB> 1970 figure seems unlikely to actually be anything but a perijove.

JG>Sorry, _perijoves_...I'm not used to talking this language.

Couldn't we just say periapsis or apoapsis?
'''

提示：perijove 的定义是离木星中心最近的木星卫星轨道上的点。

np.array(newsgroups_train.target_names)[newsgroups_train.target[:3]]

'''
array(['comp.graphics', 'talk.religion.misc', 'sci.space'], 
      dtype='<U18')
'''

target属性是类别的整数索引。

newsgroups_train.target[:10]

# array([1, 3, 2, 0, 2, 0, 2, 1, 2, 1])

num_topics, num_top_words = 6, 8

接下来，scikit learn 有一个方法可以为我们提取所有字数。

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

vectorizer = CountVectorizer(stop_words='english')
vectors = vectorizer.fit_transform(newsgroups_train.data).todense() # (documents, vocab)
vectors.shape #, vectors.nnz / vectors.shape[0], row_means.shape

# (2034, 26576)

print(len(newsgroups_train.data), vectors.shape)

# 2034 (2034, 26576)

vocab = np.array(vectorizer.get_feature_names())

vocab.shape

# (26576,)

vocab[7000:7020]

'''
array(['cosmonauts', 'cosmos', 'cosponsored', 'cost', 'costa', 'costar',
       'costing', 'costly', 'costruction', 'costs', 'cosy', 'cote',
       'couched', 'couldn', 'council', 'councils', 'counsel', 'counselees',
       'counselor', 'count'],
      dtype='<U80')
'''

奇异值分解（SVD）

“SVD并不像应该的那样出名。” - 吉尔伯特斯特朗

我们显然希望，在一个主题中最常出现的单词在另一个主题中出现的频率较低 - 否则该单词不会成为分离这两个主题的不错选择。因此，我们希望主题是正交的。

SVD 算法将矩阵分解为具有正交列的和具有正交行的矩阵（以及包含每个因子的相对重要性的对角矩阵）。

来源：Facebook 研究：快速随机 SVD

SVD 是精确分解，因为它产生的矩阵足够大，完全覆盖原始矩阵。 SVD 在线性代数中的使用非常广泛，特别是在数据科学中，包括：

语义分析
协同过滤/推荐（获的 Netflix 奖项）
计算 Moore-Penrose 伪逆
数据压缩
主成分分析（将在后面介绍）

%time U, s, Vh = linalg.svd(vectors, full_matrices=False)

'''
CPU times: user 1min 4s, sys: 8.82 s, total: 1min 13s
Wall time: 13.3 s
'''

print(U.shape, s.shape, Vh.shape)

# (2034, 2034) (2034,) (2034, 26576)

确认这是输入的分解。

答案

# 练习：确认 U，s，Vh 是 var 向量的分解

# True

确认U, V正交。

答案

# 练习：确认`U, V`正交

# True

主题

关于奇异值s我们能说什么？

plt.plot(s);

plt.plot(s[:10])

# [<matplotlib.lines.Line2D at 0x7fcada6c6828>]

num_top_words=8

def show_topics(a):
    top_words = lambda t: [vocab[i] for i in np.argsort(t)[:-num_top_words-1:-1]]
    topic_words = ([top_words(t) for t in a])
    return [' '.join(t) for t in topic_words]

show_topics(Vh[:10])

'''
['critus ditto propagandist surname galacticentric kindergarten surreal imaginative',
 'jpeg gif file color quality image jfif format',
 'graphics edu pub mail 128 3d ray ftp',
 'jesus god matthew people atheists atheism does graphics',
 'image data processing analysis software available tools display',
 'god atheists atheism religious believe religion argument true',
 'space nasa lunar mars probe moon missions probes',
 'image probe surface lunar mars probes moon orbit',
 'argument fallacy conclusion example true ad argumentum premises',
 'space larson image theory universe physical nasa material']
'''

我们得到的主题匹配我们期望的簇的类型！尽管事实上这是一个无监督的算法 - 也就是说，我们从未真正告诉算法我们的文档是如何分组的。

稍后我们将更详细地回顾 SVD。目前，重要的一点是我们有一个工具可以让我们将矩阵精确地分解为正交列和正交行。

非负矩阵分解（NMF）

动机

来源：NMF 教程

更加可解释的方法：

来源：NMF 教程

理念

不是将我们的因式限制为正交，而是另一种想法将它们限制为非负的。 NMF 是非负数据集V到非负矩阵W，H的因子分解：

V = WH

通常正因式会更容易解释（这也是 NMF 受欢迎的原因）。

来源：NMF 教程

非负矩阵分解（NMF）是一种非精确因子分解，它将因子分解为一个窄的和一个短的正矩阵。 NMF 是 NP 难和不唯一的。它有许多变体，通过添加不同的约束来创建。

NMF 的应用

来源：NMF 教程

来自 sklearn 的 NMF

首先我们使用 sklearn 的 NMF 实现：

m,n=vectors.shape
d=5  # 主题数量

clf = decomposition.NMF(n_components=d, random_state=1)

W1 = clf.fit_transform(vectors)
H1 = clf.components_

show_topics(H1)

'''
['jpeg image gif file color images format quality',
 'edu graphics pub mail 128 ray ftp send',
 'space launch satellite nasa commercial satellites year market',
 'jesus matthew prophecy people said messiah david isaiah',
 'image data available software processing ftp edu analysis',
 'god atheists atheism religious believe people religion does']
'''

TF-IDF

主题频率 - 逆文档频率（TF-IDF）是一种规范术语计数的方法，通过考虑它们在文档中出现的频率，文档的持续时间以及该术语的常见/稀有程度。

TF =（文档中术语t的出现次数）/（文档中的单词数）

IDF = log（文档数量/包含术语t的文档数）

vectorizer_tfidf = TfidfVectorizer(stop_words='english')
vectors_tfidf = vectorizer_tfidf.fit_transform(newsgroups_train.data) # (documents, vocab)

W1 = clf.fit_transform(vectors_tfidf)
H1 = clf.components_

show_topics(H1)

'''
['don people just think like know say religion',
 'thanks graphics files image file program windows format',
 'space nasa launch shuttle orbit lunar moon earth',
 'ico bobbe tek beauchaine bronx manhattan sank queens',
 'god jesus bible believe atheism christian does belief',
 'objective morality values moral subjective science absolute claim']
'''

plt.plot(clf.components_[0])

# [<matplotlib.lines.Line2D at 0x7f0e1039a7b8>]

clf.reconstruction_err_

# 43.71292605795277

NMF 总结

优点：快速且易于使用！

缺点：需要多年的研究和专业知识才能创建

注意：

对于 NMF，矩阵高度需要至少与宽度一样，否则我们会遇到fit_transform的错误
可以在CountVectorizer中使用df_min来仅查看至少k个分割文本中的单词

使用 SGD 在 NumPy 中从零开始编写 NMF

梯度下降

标准梯度下降的关键思想：

随机选择一些权重来开始
循环：
- 使用权重来计算预测
- 计算损失的导数
- 更新权重
多次重复步骤 2。最终我们得到了一些不错的权重。

关键：我们希望减少损失，导数告诉我们最陡的下降方向。

请注意，损失，误差和成本都是用于描述相同内容的术语。

让我们来看看梯度下降导论笔记本（最初来自 fast.ai 深度学习课程）。

随机梯度下降（SGD）

随机梯度下降是一种非常有用的优化方法（它也是深度学习的核心，它用于反向传播）。

对于标准梯度下降，我们使用所有数据来计算损失，可能非常慢。在随机梯度下降中，我们仅根据我们的数据样本（有时称为小批量）来计算我们的损失函数。我们会在不同的数据样本上得到不同的损失值，因此这就是随机的原因。事实证明，这仍然是一种有效的优化方式，而且效率更高！

我们可以在这个电子表格中看到它是如何工作的（最初来自 fast.ai 深度学习课程）。

资源：

对 NMF 应用 SGD

目标：将V(mxn)分解为：

V ≈ WH

其中W(mxd)和H(dxn)，W, H >= 0，我们最小化V − WH 的 Frobenius 范式。

方法：我们将随机选择正的W和H，然后使用 SGD 进行优化。

要使用 SGD，我们需要知道损失函数的梯度。

资料来源：

lam=1e3
lr=1e-2
m, n = vectors_tfidf.shape

W1 = clf.fit_transform(vectors)
H1 = clf.components_

show_topics(H1)

'''
['jpeg image gif file color images format quality',
 'edu graphics pub mail 128 ray ftp send',
 'space launch satellite nasa commercial satellites year market',
 'jesus matthew prophecy people said messiah david isaiah',
 'image data available software processing ftp edu analysis',
 'god atheists atheism religious believe people religion does']
'''

mu = 1e-6
def grads(M, W, H):
    R = W@H-M
    return R@H.T + penalty(W, mu)*lam, W.T@R + penalty(H, mu)*lam # dW, dH

def penalty(M, mu):
    return np.where(M>=mu,0, np.min(M - mu, 0))

def upd(M, W, H, lr):
    dW,dH = grads(M,W,H)
    W -= lr*dW; H -= lr*dH

def report(M,W,H): 
    print(np.linalg.norm(M-W@H), W.min(), H.min(), (W<0).sum(), (H<0).sum())

W = np.abs(np.random.normal(scale=0.01, size=(m,d)))
H = np.abs(np.random.normal(scale=0.01, size=(d,n)))

report(vectors_tfidf, W, H)

# 44.4395133509 5.67503308167e-07 2.49717354504e-07 0 0

upd(vectors_tfidf,W,H,lr)

report(vectors_tfidf, W, H)

# 44.4194155587 -0.00186845669883 -0.000182969569359 509 788

for i in range(50): 
    upd(vectors_tfidf,W,H,lr)
    if i % 10 == 0: report(vectors_tfidf,W,H)

'''
44.4071645597 -0.00145791197281 -0.00012862260312 343 1174
44.352156176 -0.000549676823494 -9.16363641124e-05 218 4689
44.3020593384 -0.000284017335617 -0.000130903875061 165 9685
44.2468609535 -0.000279317810433 -0.000182173029912 169 16735
44.199218 -0.000290092649623 -0.000198140867356 222 25109
'''

show_topics(H)

'''
['cview file image edu files use directory temp',
 'moral like time does don software new years',
 'god jesus bible believe objective exist atheism belief',
 'thanks graphics program know help looking windows advance',
 'space nasa launch shuttle orbit station moon lunar',
 'people don said think ico tek bobbe bronx']
'''

这种训练非常缓慢！大量的参数要调整，但仍然很慢（或爆炸）。

PyTorch

PyTorch 是一个用于张量和动态神经网络的 Python 框架，具有 GPU 加速功能。许多核心贡献者都在 Facebook 的 AI 团队中工作。在许多方面，它与 Numpy 类似，只是增加了使用 GPU 的并行化。

从 PyTorch 文档来看：

进一步学习：如果你想了解动态神经网络是什么，你可能希望观看 Facebook AI 研究员和 PyTorch 核心贡献者 Soumith Chintala 的演讲。

如果你想更加了解 PyTorch，你可以尝试本教程或通过示例学习。

GPU 的注意事项：如果你不使用GPU，则需要从以下方法中删除.cuda()。本课程不需要使用 GPU，但我认为有些人会感兴趣。要了解如何使用 GPU 创建 AWS 实例，你可以观看 fast.ai 的配置课程。

import torch
import torch.cuda as tc
from torch.autograd import Variable

def V(M): return Variable(M, requires_grad=True)

v=vectors_tfidf.todense()

t_vectors = 
torch.Tensor(v.astype(np.float32)).cuda()

mu = 1e-5

def grads_t(M, W, H):
    R = W.mm(H)-M
    return (R.mm(H.t()) + penalty_t(W, mu)*lam, 
        W.t().mm(R) + penalty_t(H, mu)*lam) # dW, dH

def penalty_t(M, mu):
    return (M<mu).type(tc.FloatTensor)*torch.clamp(M - mu, max=0.)

def upd_t(M, W, H, lr):
    dW,dH = grads_t(M,W,H)
    W.sub_(lr*dW); H.sub_(lr*dH)

def report_t(M,W,H): 
    print((M-W.mm(H)).norm(2), W.min(), H.min(), (W<0).sum(), (H<0).sum())

t_W = tc.FloatTensor(m,d)
t_H = tc.FloatTensor(d,n)
t_W.normal_(std=0.01).abs_(); 
t_H.normal_(std=0.01).abs_();

d=6; lam=100; lr=0.05

for i in range(1000): 
    upd_t(t_vectors,t_W,t_H,lr)
    if i % 100 == 0: 
        report_t(t_vectors,t_W,t_H)
        lr *= 0.9

'''
44.392791748046875 -0.0060190423391759396 -0.0004986411076970398 1505 2282
43.746803283691406 -0.009054708294570446 -0.011047963984310627 2085 23854
43.702056884765625 -0.008214150555431843 -0.014783496037125587 2295 24432
43.654273986816406 -0.006195350084453821 -0.006913348101079464 2625 22663
43.646759033203125 -0.004638500511646271 -0.003197424579411745 2684 23270
43.645320892333984 -0.005679543130099773 -0.00419645756483078 3242 24199
43.6449089050293 -0.0041352929547429085 -0.00843129213899374 3282 25030
43.64469528198242 -0.003943094052374363 -0.00745873199775815 3129 26220
43.64459991455078 -0.004347225651144981 -0.007400824688374996 3031 26323
43.64434051513672 -0.0039044099394232035 -0.0067480215802788734 3930 33718
'''

show_topics(t_H.cpu().numpy())

'''
['objective morality values moral subjective science absolute claim',
 'god jesus bible believe atheism christian does belief',
 'ico bobbe tek bronx beauchaine manhattan sank queens',
 'thanks graphics files image file program windows know',
 'space nasa launch shuttle orbit lunar moon earth',
 'don people just think like know say religion']
'''

plt.plot(t_H.cpu().numpy()[0])

# [<matplotlib.lines.Line2D at 0x7fe4173f1d68>]

t_W.mm(t_H).max()

0.43389660120010376

t_vectors.max()

0.9188119769096375

PyTorch：自动梯度

在上面，我们使用了我们的损失函数梯度的知识，在 PyTorch 中从零编写 SGD。但是，PyTorch 有一个自动梯度包，我们可以使用它。这非常有用，因为我们可以在我们不知道导数是什么的问题上使用自动梯度。

我们在下面使用的方法非常通用，几乎可以用于任何优化问题。

在 PyTorch 中，变量与张量具有相同的 API，但变量记住了用于创建它们的操作。这让我们可以求导。

PyTorch 自动梯度简介

示例从官方文档中的本教程获取。

x = Variable(torch.ones(2, 2), requires_grad=True)
print(x)

'''
Variable containing:
 1  1
 1  1
[torch.FloatTensor of size 2x2]
'''

print(x.data)

'''
 1  1
 1  1
[torch.FloatTensor of size 2x2]
'''

print(x.grad)

'''
Variable containing:
 0  0
 0  0
[torch.FloatTensor of size 2x2]
'''

y = x + 2
print(y)

'''
Variable containing:
 3  3
 3  3
[torch.FloatTensor of size 2x2]
'''

z = y * y * 3
out = z.sum()
print(z, out)

'''
Variable containing:
 27  27
 27  27
[torch.FloatTensor of size 2x2]
 Variable containing:
 108
[torch.FloatTensor of size 1]
'''

out.backward()
print(x.grad)

'''
Variable containing:
 18  18
 18  18
[torch.FloatTensor of size 2x2]
'''

对 NMF 使用自动梯度

lam=1e6

pW = Variable(tc.FloatTensor(m,d), requires_grad=True)
pH = Variable(tc.FloatTensor(d,n), requires_grad=True)
pW.data.normal_(std=0.01).abs_()
pH.data.normal_(std=0.01).abs_();

def report():
    W,H = pW.data, pH.data
    print((M-pW.mm(pH)).norm(2).data[0], W.min(), H.min(), (W<0).sum(), (H<0).sum())

def penalty(A):
    return torch.pow((A<0).type(tc.FloatTensor)*torch.clamp(A, max=0.), 2)

def penalize(): return penalty(pW).mean() + penalty(pH).mean()

def loss(): return (M-pW.mm(pH)).norm(2) + penalize()*lam

M = Variable(t_vectors).cuda()

opt = torch.optim.Adam([pW,pH], lr=1e-3, betas=(0.9,0.9))
lr = 0.05
report()

# 43.66044616699219 -0.0002547535696066916 -0.00046720390673726797 319 8633

使用自动梯度，如何应用 SGD：

for i in range(1000): 
    opt.zero_grad()
    l = loss()
    l.backward()
    opt.step()
    if i % 100 == 99: 
        report()
        lr *= 0.9     # 学习率衰减

'''
43.628597259521484 -0.022899555042386055 -0.26526615023612976 692 82579
43.62860107421875 -0.021287493407726288 -0.2440912425518036 617 77552
43.628597259521484 -0.020111067220568657 -0.22828206419944763 576 77726
43.628604888916016 -0.01912039890885353 -0.21654289960861206 553 84411
43.62861251831055 -0.018248897045850754 -0.20736189186573029 544 75546
43.62862014770508 -0.01753264293074608 -0.19999365508556366 491 78949
43.62862777709961 -0.016773322597146034 -0.194113627076149 513 83822
43.628639221191406 -0.01622121036052704 -0.18905577063560486 485 74101
43.62863540649414 -0.01574397087097168 -0.18498440086841583 478 85987
43.628639221191406 -0.015293922275304794 -0.18137598037719727 487 74023
'''

h = pH.data.cpu().numpy()
show_topics(h)

'''
['god jesus bible believe atheism christian belief does',
 'thanks graphics files image file windows program format',
 'just don think people like ve graphics religion',
 'objective morality values moral subjective science absolute claim',
 'ico bobbe tek beauchaine bronx manhattan sank queens',
 'space nasa shuttle launch orbit lunar moon data']
'''
plt.plot(h[0]);

比较方法

Scikit-Learn 的 NMF

快
没有参数调整
依靠几十年的学术研究，花了专家很长时间来实现

来源：Python Numfa 文档

使用 PyTorch 和 SGD

花了我们一个小时来实现，没有必要成为 NMF 的专家
参数很繁琐
没有那么快（在 numpy 中尝试过，我们不得不切换到 PyTorch）

截断 SVD

当我们计算 NMF 时，通过计算我们感兴趣的列的子集，我们节省了大量时间。有没有办法通过 SVD 获得这个好处？就在这里！它被称为截断 SVD。我们只对与最大奇异值对应的向量感兴趣。

来源：Facebook 研究：快速随机 SVD

经典算法对于分解的缺陷

矩阵“非常大”
数据通常缺失或不准确。当输入的不精确限制输出的精度时，为什么要花费额外的计算资源？
数据传输现在在算法时间中起主要作用。即使需要更多的浮点运算（flop），需要更少数据传递的技术也可以极大加速。
重要的是利用 GPU。

来源：Halko

随机算法的优势

内在稳定
性能保证不依赖于细微的光谱特性
所需的矩阵向量积可以并行完成

来源：Halko

四、随机化 SVD

第一部分：随机投影（使用词向量）

本节的目的是用单词向量的具体例子，来说明随机投影保留结构的想法！

要在机器学习中使用语言（例如，Skype 翻译器如何在语言之间进行翻译，或 Gmail 智能回复如何自动为你的电子邮件建议可能的回复），我们需要将单词表示为向量。

我们可以使用 Google 的 Word2Vec 或 Stanford 的 GloVe 将单词表示为 100 维向量。例如，这里是“python”这个词在 GloVe 中的向量：

vecs[wordidx['python']]

'''
array([ 0.2493,  0.6832, -0.0447, -1.3842, -0.0073,  0.651 , -0.3396,
       -0.1979, -0.3392,  0.2669, -0.0331,  0.1592,  0.8955,  0.54  ,
       -0.5582,  0.4624,  0.3672,  0.1889,  0.8319,  0.8142, -0.1183,
       -0.5346,  0.2416, -0.0389,  1.1907,  0.7935, -0.1231,  0.6642,
       -0.7762, -0.4571, -1.054 , -0.2056, -0.133 ,  0.1224,  0.8846,
        1.024 ,  0.3229,  0.821 , -0.0694,  0.0242, -0.5142,  0.8727,
        0.2576,  0.9153, -0.6422,  0.0412, -0.6021,  0.5463,  0.6608,
        0.198 , -1.1393,  0.7951,  0.4597, -0.1846, -0.6413, -0.2493,
       -0.4019, -0.5079,  0.8058,  0.5336,  0.5273,  0.3925, -0.2988,
        0.0096,  0.9995, -0.0613,  0.7194,  0.329 , -0.0528,  0.6714,
       -0.8025, -0.2579,  0.4961,  0.4808, -0.684 , -0.0122,  0.0482,
        0.2946,  0.2061,  0.3356, -0.6417, -0.6471,  0.1338, -0.1257,
       -0.4638,  1.3878,  0.9564, -0.0679, -0.0017,  0.5296,  0.4567,
        0.6104, -0.1151,  0.4263,  0.1734, -0.7995, -0.245 , -0.6089,
       -0.3847, -0.4797], dtype=float32)
'''

目标：使用随机性将此值从 100 维减少到 20。检查相似的单词是否仍然组合在一起。

更多信息：如果你对词嵌入感兴趣并想要更多细节，我在这里提供了一个更长的学习小组（带有代码演示）。

风格说明：我使用可折叠标题和 jupyter 主题。

加载数据

import pickle
import numpy as np
import re
import json

np.set_printoptions(precision=4, suppress=True)

该数据集可从这里获得。要从命令行下载和解压缩文件，你可以运行：

wget http://files.fast.ai/models/glove_50_glove_100.tgz 
tar xvzf glove_50_glove_100.tgz

你需要更新以下路径，来指定存储数据的位置。

path = "../data/"

vecs = np.load(path + "glove_vectors_100d.npy")

with open(path + "words.txt") as f:
    content = f.readlines()
words = [x.strip() for x in content] 

wordidx = json.load(open(path + "wordsidx.txt"))

我们的数据的样子

我们有一个单词的长列表。

len(words)

# 400000

words[:10]

# ['the', ',', '.', 'of', 'to', 'and', 'in', 'a', '"', "'s"]

words[600:610]

'''
['together',
 'congress',
 'index',
 'australia',
 'results',
 'hard',
 'hours',
 'land',
 'action',
 'higher']
'''

wordidx允许我们查找单词来找出它的索引：

wordidx['python']

# 20019

words[20019]

# 'python'

作为向量的单词

单词“python”由 100 维向量表示：

vecs[wordidx['python']]

'''
array([ 0.2493,  0.6832, -0.0447, -1.3842, -0.0073,  0.651 , -0.3396,
       -0.1979, -0.3392,  0.2669, -0.0331,  0.1592,  0.8955,  0.54  ,
       -0.5582,  0.4624,  0.3672,  0.1889,  0.8319,  0.8142, -0.1183,
       -0.5346,  0.2416, -0.0389,  1.1907,  0.7935, -0.1231,  0.6642,
       -0.7762, -0.4571, -1.054 , -0.2056, -0.133 ,  0.1224,  0.8846,
        1.024 ,  0.3229,  0.821 , -0.0694,  0.0242, -0.5142,  0.8727,
        0.2576,  0.9153, -0.6422,  0.0412, -0.6021,  0.5463,  0.6608,
        0.198 , -1.1393,  0.7951,  0.4597, -0.1846, -0.6413, -0.2493,
       -0.4019, -0.5079,  0.8058,  0.5336,  0.5273,  0.3925, -0.2988,
        0.0096,  0.9995, -0.0613,  0.7194,  0.329 , -0.0528,  0.6714,
       -0.8025, -0.2579,  0.4961,  0.4808, -0.684 , -0.0122,  0.0482,
        0.2946,  0.2061,  0.3356, -0.6417, -0.6471,  0.1338, -0.1257,
       -0.4638,  1.3878,  0.9564, -0.0679, -0.0017,  0.5296,  0.4567,
        0.6104, -0.1151,  0.4263,  0.1734, -0.7995, -0.245 , -0.6089,
       -0.3847, -0.4797], dtype=float32)
'''

这让我们可以做一些有用的计算。例如，我们可以使用距离度量，看到两个单词有多远：

from scipy.spatial.distance import cosine as dist

较小的数字意味着两个单词更接近，较大的数字意味着它们更加分开。

相似单词之间的距离很短：

dist(vecs[wordidx["puppy"]], vecs[wordidx["dog"]])

# 0.27636240676695256

dist(vecs[wordidx["queen"]], vecs[wordidx["princess"]])

# 0.20527545040329642

并且无关词之间的距离很高：

dist(vecs[wordidx["celebrity"]], vecs[wordidx["dusty"]])

# 0.98835787578057777

dist(vecs[wordidx["avalanche"]], vecs[wordidx["antique"]])

# 0.96211070091611983

偏见

有很多偏见的机会：

dist(vecs[wordidx["man"]], vecs[wordidx["genius"]])

# 0.50985148631697985

dist(vecs[wordidx["woman"]], vecs[wordidx["genius"]])

# 0.6897833082810727

我只是检查了几对词之间的距离，因为这是说明这个概念的快速而简单的方式。这也是一种非常嘈杂的方法，研究人员用更系统的方式解决这个问题。

我在这个学习小组上更深入地讨论了偏见。

可视化

让我们可视化一些单词！

我们将使用 Plotly，一个制作交互式图形的 Python 库（注意：以下所有内容都是在不创建帐户的情况下完成的，使用免费的离线版 Plotly）。

方法

import plotly
import plotly.graph_objs as go    
from IPython.display import IFrame

def plotly_3d(Y, cat_labels, filename="temp-plot.html"):
    trace_dict = {}
    for i, label in enumerate(cat_labels):
        trace_dict[i] = go.Scatter3d(
            x=Y[i*5:(i+1)*5, 0],
            y=Y[i*5:(i+1)*5, 1],
            z=Y[i*5:(i+1)*5, 2],
            mode='markers',
            marker=dict(
                size=8,
                line=dict(
                    color='rgba('+ str(i*40) + ',' + str(i*40) + ',' + str(i*40) + ', 0.14)',
                    width=0.5
                ),
                opacity=0.8
            ),
            text = my_words[i*5:(i+1)*5],
            name = label
        )

    data = [item for item in trace_dict.values()]
    layout = go.Layout(
        margin=dict(
            l=0,
            r=0,
            b=0,
            t=0
        )
    )

    plotly.offline.plot({
        "data": data,
        "layout": layout,
    }, filename=filename)

def plotly_2d(Y, cat_labels, filename="temp-plot.html"):
    trace_dict = {}
    for i, label in enumerate(cat_labels):
        trace_dict[i] = go.Scatter(
            x=Y[i*5:(i+1)*5, 0],
            y=Y[i*5:(i+1)*5, 1],
            mode='markers',
            marker=dict(
                size=8,
                line=dict(
                    color='rgba('+ str(i*40) + ',' + str(i*40) + ',' + str(i*40) + ', 0.14)',
                    width=0.5
                ),
                opacity=0.8
            ),
            text = my_words[i*5:(i+1)*5],
            name = label
        )

    data = [item for item in trace_dict.values()]
    layout = go.Layout(
        margin=dict(
            l=0,
            r=0,
            b=0,
            t=0
        )
    )

    plotly.offline.plot({
        "data": data,
        "layout": layout
    }, filename=filename)

此方法将挑选出 3 个维度，最能将我们的类别彼此分开（存储在dist_btwn_cats中），同时最小化给定类别中单词的距离（存储在dist_within_cats中）。

def get_components(data, categories, word_indices):
    num_components = 30
    pca = decomposition.PCA(n_components=num_components).fit(data.T)
    all_components = pca.components_
    centroids = {}
    print(all_components.shape)
    for i, category in enumerate(categories):
        cen = np.mean(all_components[:, i*5:(i+1)*5], axis = 1)
        dist_within_cats = np.sum(np.abs(np.expand_dims(cen, axis=1) - all_components[:, i*5:(i+1)*5]), axis=1)
        centroids[category] = cen
    dist_btwn_cats = np.zeros(num_components)
    for category1, averages1 in centroids.items():
        for category2, averages2 in centroids.items():
            dist_btwn_cats += abs(averages1 - averages2)
            clusterness = dist_btwn_cats / dist_within_cats
    comp_indices = np.argpartition(clusterness, -3)[-3:]
    return all_components[comp_indices]

准备数据

让我们绘制几个不同类别的单词：

my_words = [
            "maggot", "flea", "tarantula", "bedbug", "mosquito", 
            "violin", "cello", "flute", "harp", "mandolin",
            "joy", "love", "peace", "pleasure", "wonderful",
            "agony", "terrible", "horrible", "nasty", "failure", 
            "physics", "chemistry", "science", "technology", "engineering",
            "poetry", "art", "literature", "dance", "symphony",
           ]

categories = [
              "bugs", "music", 
              "pleasant", "unpleasant", 
              "science", "arts"
             ]

同样，我们需要使用wordidx字典查找单词的索引：

my_word_indices = np.array([wordidx[word] for word in my_words])

vecs[my_word_indices].shape

# (30, 100)

现在，我们将组合我们的单词与我们整个单词集中的前 10,000 个单词（其中一些单词已经存在），并创建嵌入矩阵。

embeddings = np.concatenate((vecs[my_word_indices], vecs[:10000,:]), axis=0); embeddings.shape

# (10030, 100)

在 3D 中查看单词

单词有 100 个维度，我们需要一种在 3D 中可视化它们的方法。

我们将使用主成分分析（PCA），这是一种广泛使用的技术，具有许多应用，包括在较低维度可视化高维数据集！

PCA

from collections import defaultdict
from sklearn import decomposition

components = get_components(embeddings, categories, my_word_indices)
plotly_3d(components.T[:len(my_words),:], categories, "pca.html")

# (30, 10030)

IFrame('pca.html', width=600, height=400)

随机投影

Johnson-Lindenstrauss 引理：（来自维基百科）高维空间中的一小组点可以嵌入到更低维度的空间中，使点之间的距离几乎保留（使用随机投影证明）。

有用的是，能够以保持距离的方式减少数据的维度。 Johnson-Lindenstrauss 引理是这种类型的经典结果。

embeddings.shape

# (10030, 100)

rand_proj = embeddings @ np.random.normal(size=(embeddings.shape[1], 40)); rand_proj.shape

# (10030, 40)

# pca = decomposition.PCA(n_components=3).fit(rand_proj.T)
# components = pca.components_
components = get_components(rand_proj, categories, my_word_indices)
plotly_3d(components.T[:len(my_words),:], categories, "pca-rand-proj.html")

# (30, 10030)

IFrame('pca-rand-proj.html', width=600, height=400)

第二部分：用于背景消除的随机 SVD

我们今天的目标：

加载和格式化数据

让我们使用 BMC 2012 背景模型挑战数据集中的真实视频 003。

导入所需的库：

import imageio
imageio.plugins.ffmpeg.download()

import moviepy.editor as mpe
import numpy as np
import scipy

%matplotlib inline
import matplotlib.pyplot as plt

scale = 0.50   # 调整比例来更改图像的分辨率
dims = (int(240 * scale), int(320 * scale))
fps = 60      # 每秒的帧

M = np.load("movie/med_res_surveillance_matrix_60fps.npy")

print(dims, M.shape)

# (120, 160) (19200, 6000)

plt.imshow(np.reshape(M[:,140], dims), cmap='gray');

plt.figure(figsize=(12, 12))
plt.imshow(M, cmap='gray')

# <matplotlib.image.AxesImage at 0x7f601f315fd0>

将来，你可以加载已保存的内容：

U = np.load("U.npy")
s = np.load("s.npy")
V = np.load("V.npy")

U, S, V是什么样呢？

U.shape, s.shape, V.shape

# ((19200, 6000), (6000,), (6000, 6000))

检查它们是M的分解。

reconstructed_matrix = U @ np.diag(s) @ V

np.allclose(M, reconstructed_matrix)

# True

是的。

移除背景

low_rank = np.expand_dims(U[:,0], 1) * s[0] * np.expand_dims(V[0,:], 0)

plt.figure(figsize=(12, 12))
plt.imshow(low_rank, cmap='gray')

# <matplotlib.image.AxesImage at 0x7f1cc3e2c9e8>

plt.imshow(np.reshape(low_rank[:,0], dims), cmap='gray');

我们如何获取里面的人？

plt.imshow(np.reshape(M[:,0] - low_rank[:,0], dims), cmap='gray');

SVD 对不同大小的矩阵的速度

s是对角矩阵的对角线。

np.set_printoptions(suppress=True, precision=4)

import timeit
import pandas as pd

m_array = np.array([100, int(1e3), int(1e4)])
n_array = np.array([100, int(1e3), int(1e4)])

index = pd.MultiIndex.from_product([m_array, n_array], names=['# rows', '# cols'])

pd.options.display.float_format = '{:,.3f}'.format
df = pd.DataFrame(index=m_array, columns=n_array)

# %%prun
for m in m_array:
    for n in n_array:      
        A = np.random.uniform(-40,40,[m,n])  
        t = timeit.timeit('np.linalg.svd(A, full_matrices=False)', number=3, globals=globals())
        df.set_value(m, n, t)

df/3

	100	1000	10000
100	0.006	0.009	0.043
1000	0.004	0.259	0.992
10000	0.019	0.984	218.726

很好！！！但是...

缺点：这真的很慢（同样，我们摒弃了很多计算）。

%time u, s, v = np.linalg.svd(M, full_matrices=False)

'''
CPU times: user 5min 38s, sys: 1.53 s, total: 5min 40s
Wall time: 57.1 s
'''

M.shape

# (19200, 6000)

随机化 SVD 的最简单版本

想法：让我们使用更小的矩阵！

我们还没有找到更好的通用 SVD 方法，我们只会使用我们在较小矩阵上使用的方法，该矩阵与原始矩阵的范围大致相同。

def simple_randomized_svd(M, k=10):
    m, n = M.shape
    transpose = False
    if m < n:
        transpose = True
        M = M.T
        
    rand_matrix = np.random.normal(size=(M.shape[1], k))  # short side by k
    Q, _ = np.linalg.qr(M @ rand_matrix, mode='reduced')  # long side by k
    smaller_matrix = Q.T @ M                              # k by short side
    U_hat, s, V = np.linalg.svd(smaller_matrix, full_matrices=False)
    U = Q @ U_hat
    
    if transpose:
        return V.T, s.T, U.T
    else:
        return U, s, V

%time u, s, v = simple_randomized_svd(M, 10)

'''
CPU times: user 3.06 s, sys: 268 ms, total: 3.33 s
Wall time: 789 ms
'''

U_rand, s_rand, V_rand = simple_randomized_svd(M, 10)

low_rank = np.expand_dims(U_rand[:,0], 1) * s_rand[0] * np.expand_dims(V_rand[0,:], 0)

plt.imshow(np.reshape(low_rank[:,0], dims), cmap='gray');

我们如何获取里面的人？

这个方法在做什么

rand_matrix = np.random.normal(size=(M.shape[1], 10))

rand_matrix.shape

# (6000, 10)

plt.imshow(np.reshape(rand_matrix[:4900,0], (70,70)), cmap='gray');

temp = M @ rand_matrix; temp.shape

# (19200, 10)

plt.imshow(np.reshape(temp[:,0], dims), cmap='gray');

plt.imshow(np.reshape(temp[:,1], dims), cmap='gray');

Q, _ = np.linalg.qr(M @ rand_matrix, mode='reduced'); Q.shape

# (19200, 10)

np.dot(Q[:,0], Q[:,1])

# -3.8163916471489756e-17

plt.imshow(np.reshape(Q[:,0], dims), cmap='gray');

plt.imshow(np.reshape(Q[:,1], dims), cmap='gray');

smaller_matrix = Q.T @ M; smaller_matrix.shape

# (10, 6000)

U_hat, s, V = np.linalg.svd(smaller_matrix, full_matrices=False)

U = Q @ U_hat

plt.imshow(np.reshape(U[:,0], dims), cmap='gray');

reconstructed_small_M = U @ np.diag(s) @ V

以及人。

plt.imshow(np.reshape(M[:,0] - reconstructed_small_M[:,0], dims), cmap='gray');

时间比较

from sklearn import decomposition
import fbpca

完整的 SVD：

%time u, s, v = np.linalg.svd(M, full_matrices=False)

'''
CPU times: user 5min 38s, sys: 1.53 s, total: 5min 40s
Wall time: 57.1 s
'''

我们的（过度简化）的randomized_svd：

%time u, s, v = simple_randomized_svd(M, 10)

'''
CPU times: user 2.37 s, sys: 160 ms, total: 2.53 s
Wall time: 641 ms
'''

Sklearn：

%time u, s, v = decomposition.randomized_svd(M, 10)

'''
CPU times: user 19.2 s, sys: 1.44 s, total: 20.7 s
Wall time: 3.67 s
'''

来自 Facebook fbpca 库的随机 SVD：

%time u, s, v = fbpca.pca(M, 10)

'''
CPU times: user 7.28 s, sys: 424 ms, total: 7.7 s
Wall time: 1.37 s
'''

我会选择 fbpca，因为它比 sklearn 更快，比我们简单的实现更健壮，更准确。

以下是 Facebook Research 的一些结果：

`k`变化下的时间和准确度

import timeit
import pandas as pd

U_rand, s_rand, V_rand = fbpca.pca(M, 700, raw=True)
reconstructed = U_rand @ np.diag(s_rand) @ V_rand

np.linalg.norm(M - reconstructed)

# 1.1065914828881536e-07

plt.imshow(np.reshape(reconstructed[:,140], dims), cmap='gray');

pd.options.display.float_format = '{:,.2f}'.format
k_values = np.arange(100,1000,100)
df_rand = pd.DataFrame(index=["time", "error"], columns=k_values)

# df_rand = pd.read_pickle("svd_df")

for k in k_values:
    U_rand, s_rand, V_rand = fbpca.pca(M, k, raw=True)
    reconstructed = U_rand @ np.diag(s_rand) @ V_rand
    df_rand.set_value("error", k, np.linalg.norm(M - reconstructed))
    t = timeit.timeit('fbpca.pca(M, k)', number=3, globals=globals())
    df_rand.set_value("time", k, t/3)

df_rand.to_pickle("df_rand")

df_rand

	100	200	300	400	500	600	700	800	900	1000
time	2.07	2.57	3.45	6.44	7.99	9.02	10.24	11.70	13.30	10.87
error	58,997.27	37,539.54	26,569.89	18,769.37	12,559.34	6,936.17	0.00	0.00	0.00	0.00

df = pd.DataFrame(index=["error"], columns=k_values)

for k in k_values:
    reconstructed = U[:,:k] @ np.diag(s[:k]) @ V[:k,:]
    df.set_value("error", k, np.linalg.norm(M - reconstructed))

df.to_pickle("df")

fig, ax1 = plt.subplots()
ax1.plot(df.columns, df_rand.loc["time"].values, 'b-', label="randomized SVD time")
ax1.plot(df.columns, np.tile([57], 9), 'g-', label="SVD time")
ax1.set_xlabel('k: # of singular values')
# Make the y-axis label, ticks and tick labels match the line color.
ax1.set_ylabel('time', color='b')
ax1.tick_params('y', colors='b')
ax1.legend(loc = 0)

ax2 = ax1.twinx()
ax2.plot(df.columns, df_rand.loc["error"].values, 'r--', label="randomized SVD error")
ax2.plot(df.columns, df.loc["error"].values, 'm--', label="SVD error")
ax2.set_ylabel('error', color='r')
ax2.tick_params('y', colors='r')
ax2.legend(loc=1)

#fig.tight_layout()
plt.show()

数学细节

随机 SVD 背后的处理

下面是一个计算截断 SVD 的过程，在“带有随机性的搜索结构：用于构造近似矩阵分解的概率算法”中描述，并在此博客文章中总结：

1.计算A的近似范围。也就是说，我们希望Q具有r个正交列，使得：

2.构造，它比较小r×n。

3.通过标准方法计算B的 SVD（因为B小于A所以更快），。

由于：

如果我们设置U = QS，那么我们有一个低秩的近似值。

那么我们如何找到`Q`（步骤 1）？

为了估计A的范围，我们可以只取一堆随机向量，来求解形成的子空间。我们可以用作为列来形成矩阵W。现在，我们采用AW = QR的 QR 分解，然后Q的列形成AW的标准正交基，这是A的范围。

由于乘积矩阵AW的行比列更多，因此列大致正交。这是一个简单的概率 - 有很多行，列很少，列不可能是线性相关的。

为什么 $M \sim Q Q^T M$

我们试图找到矩阵Q，使得 $M \approx QQ^TM$ 。我们对M的范围很感兴趣，我们称之为MX。 Q有正交列，因此（但不是I，因为Q是矩形的）。

$QR=MX \ QQTQR=QQTMX \ QR=QQ^TMX$

于是...

如果X是单位，我们就做成了（但是X会太大，我们不会得到我们想要的加速）。在我们的问题中，X只是一个小的随机矩阵。 Johnson-Lindenstrauss 引理为其工作原理提供了一些理由。

QR 分解

我们稍后将深入了解QR分解。现在，你只需要知道A = QR，其中Q由正交列组成，R是上三角形。 Trefethen 说 QR 分解是数值线性代数中最重要的思想！我们一定会将回顾它。

我们该如何选择r？
假设我们的矩阵有 100 列，我们想要U和V中的5列。为了安全起见，我们应该将矩阵投影到正交基上，其中的行数和列数多于 5（让我们使用 15）。最后，我们取U和V的前 5 列。

因此，即使我们的投影只是近似的，通过使它比我们需要的更大，我们可以弥补精度的损失（因为我们随后只采用了一个子集）。

这与随机均有何不同

test = M @ np.random.normal(size=(M.shape[1], 2)); test.shape

# (4800, 2)

随机均值：

plt.imshow(np.reshape(test[:,0], dims), cmap='gray');

均值图像：

plt.imshow(np.reshape(M.mean(axis=1), dims), cmap='gray')

# <matplotlib.image.AxesImage at 0x7f83f4093fd0>

ut, st, vt = np.linalg.svd(test, full_matrices=False)

plt.imshow(np.reshape(smaller_matrix[0,:], dims), cmap='gray');

plt.imshow(np.reshape(smaller_matrix[1,:], dims), cmap='gray');

plt.imshow(np.reshape(M[:,140], dims), cmap='gray');

第三部分：用于主体建模的随机 SVD

随机 SVD

提醒：完整的 SVD 很慢。这是我们使用 Scipy 的 Linalg SVD 进行的计算：

import numpy as np

vectors = np.load("topics/vectors.npy")

vectors.shape

# (2034, 26576)

%time U, s, Vh = linalg.svd(vectors, full_matrices=False)

'''
CPU times: user 27.2 s, sys: 812 ms, total: 28 s
Wall time: 27.9 s
'''

print(U.shape, s.shape, Vh.shape)

# (2034, 2034) (2034,) (2034, 26576)

运行的是，还有更快的方法：

%time u, s, v = decomposition.randomized_svd(vectors, 5)

'''
CPU times: user 144 ms, sys: 8 ms, total: 152 ms
Wall time: 154 ms
'''

SVD 的运行时复杂度为O(min(m^2 n,m n^2))。

问题：我们如何加快速度？（没有 SVD 研究的新突破的情况下）。

想法：让我们使用更小的矩阵（n更小）！

我们不使用m×n的整个矩阵A计算 SVD，而是使用B = AQ，它只是m×r，并且r << n。

我们还没有找到更好的 SVD 通用方法，我们只是在较小的矩阵上使用我们的方法。

%time u, s, v = decomposition.randomized_svd(vectors, 5)

'''
CPU times: user 144 ms, sys: 8 ms, total: 152 ms
Wall time: 154 ms
'''

u.shape, s.shape, v.shape

# ((2034, 5), (5,), (5, 26576))

show_topics(v)

'''
['jpeg image edu file graphics images gif data',
 'jpeg gif file color quality image jfif format',
 'space jesus launch god people satellite matthew atheists',
 'jesus god matthew people atheists atheism does graphics',
 'image data processing analysis software available tools display']
'''

随机 SVD，第二版

from scipy import linalg

方法randomized_range_finder找到一个正交矩阵，其范围近似于A的范围（我们的算法中的步骤 1）。为此，我们使用 LU 和 QR 分解，我们将在稍后深入介绍这两种分解。

我使用sklearn.extmath.randomized_svd源代码作为指南。

# 计算一个正交矩阵，其范围近似于A的范围
# power_iteration_normalizer 可以是 safe_sparse_dot（快但不稳定），LU（二者之间）或 QR（慢但最准确）
def randomized_range_finder(A, size, n_iter=5):
    Q = np.random.normal(size=(A.shape[1], size))
    
    for i in range(n_iter):
        Q, _ = linalg.lu(A @ Q, permute_l=True)
        Q, _ = linalg.lu(A.T @ Q, permute_l=True)
        
    Q, _ = linalg.qr(A @ Q, mode='economic')
    return Q

这里是我们的随机 SVD 方法。

def randomized_svd(M, n_components, n_oversamples=10, n_iter=4):
    
    n_random = n_components + n_oversamples
    
    Q = randomized_range_finder(M, n_random, n_iter)
    print(Q.shape)
    # project M to the (k + p) dimensional space using the basis vectors
    B = Q.T @ M
    print(B.shape)
    # compute the SVD on the thin matrix: (k + p) wide
    Uhat, s, V = linalg.svd(B, full_matrices=False)
    del B
    U = Q @ Uhat
    print(U.shape)
    
    return U[:, :n_components], s[:n_components], V[:n_components, :]

u, s, v = randomized_svd(vectors, 5)

'''
(2034, 15)
(15, 26576)
(2034, 15)
'''

测试

vectors.shape

# (2034, 26576)

Q = np.random.normal(size=(vectors.shape[1], 10)); Q.shape

# (26576, 10)

Q2, _ = linalg.qr(vectors @ Q, mode='economic'); Q2.shape

# (2034, 10)

Q2.shape

# (2034, 10)

测试结束

%time u, s, v = randomized_svd(vectors, 5)

'''
CPU times: user 136 ms, sys: 0 ns, total: 136 ms
Wall time: 137 ms
'''

u.shape, s.shape, v.shape

# ((2034, 5), (5,), (5, 26576))

show_topics(v)

'''
['jpeg image edu file graphics images gif data',
 'edu graphics data space pub mail 128 3d',
 'space jesus launch god people satellite matthew atheists',
 'space launch satellite commercial nasa satellites market year',
 'image data processing analysis software available tools display']
'''

在改变主题数时，写一个循环来计算分解的误差。绘制结果。

答案

# 在改变主题数时，写一个循环来计算分解的误差。绘制结果

plt.plot(range(0,n*step,step), error)

# [<matplotlib.lines.Line2D at 0x7fe3f8a1b438>]

%time u, s, v = decomposition.randomized_svd(vectors, 5)

'''
CPU times: user 144 ms, sys: 8 ms, total: 152 ms
Wall time: 154 ms
'''

%time u, s, v = decomposition.randomized_svd(vectors.todense(), 5)

'''
CPU times: user 2.38 s, sys: 592 ms, total: 2.97 s
Wall time: 2.96 s
'''

扩展资源

随机算法的整个课程

五、LU 分解

fbpca和我们自己的randomized_range_finder方法都使用 LU 分解，它将矩阵分解为下三角矩阵和上三角矩阵的乘积。

高斯消元

本节基于 Trefethen 的 20-22 讲座。

如果你不熟悉高斯消元或需要复习，请观看此可汗学院视频。

让我们手动使用高斯消元来回顾：

$A=\begin{pmatrix} 1 & -2 & -2 & -3 \ 3 & -9 & 0 & -9 \ -1 & 2 & 4 & 7 \ -3 & -6 & 26 & 2 \end{pmatrix}$

答案：

$LU = \begin{bmatrix} 1 & 0 & 0 & 0\ 3 & 1 & 0 & 0 \ -1 & 0 & 1 & 0 \ -3 & 4 & -2 & 1\end{bmatrix} \cdot \begin{bmatrix} 1 & -2 & -2 & -3 \ 0 & -3 & 6 & 0 \ 0 & 0 & 2 & 4 \ 0 & 0 & 0 & 1 \end{bmatrix}$

以上示例来自 Trefethen 的讲座 20,21。

高斯消元通过在左侧应用线性变换，将线性方程组变换为上三角形方程组。它是三角形三角化。

$L_{m-1} \dots L_2 L_1 A = U$

L是单位下三角形：所有对角线元素都是 1。

def LU(A):
    U = np.copy(A)
    m, n = A.shape
    L = np.eye(n)
    for k in range(n-1):
        for j in range(k+1,n):
            L[j,k] = U[j,k]/U[k,k]
            U[j,k:n] -= L[j,k] * U[k,k:n]
    return L, U

A = np.array([[2,1,1,0],[4,3,3,1],[8,7,9,5],[6,7,9,8]]).astype(np.float)

L, U = LU(A)

np.allclose(A, L @ U)

# True

LU分解很有用！

求解Ax = b变为LUx = b：

找到A = LU
解Ly = b
解Ux = y
完事

工作量

高斯消元的工作量： $2\cdot\frac{1}{3} n^3$

内存

在上面，我们创建了两个新的矩阵，L和U。但是，我们可以将L和U的值存储在矩阵A中（覆盖原始矩阵）。由于L的对角线都是 1，因此不需要存储。在原地进行因式分解或计算，是数值线性代数中用于节省内存的常用技术。注意：如果你将来需要再次使用原始矩阵A，则不希望这样做。其中一个作业问题是重写 LU 方法来原地操作。

考虑矩阵：

$A = \begin{bmatrix} 10^{-20} & 1 \ 1 & 1 \end{bmatrix}$

A = np.array([[1e-20, 1], [1,1]])

手动使用高斯消元法计算L和U：

# 练习：

np.set_printoptions(suppress=True)

# 练习：

L2, U2 = LU(A)

'''
[[  1.00000000e-20   1.00000000e+00]
 [  0.00000000e+00  -1.00000000e+20]]
'''

L2, U2

'''
(array([[  1.00000000e+00,   0.00000000e+00],
        [  1.00000000e+20,   1.00000000e+00]]),
 array([[  1.00000000e-20,   1.00000000e+00],
        [  0.00000000e+00,  -1.00000000e+20]]))
'''

np.allclose(L1, L2)

# True

np.allclose(U1, U2)

# True

np.allclose(A, L2 @ U2)

# False

这是使用交换主元进行 LU 分解的动机。

这也说明 LU 分解是稳定的，但不是向后稳定的。（剧透：即使部分交换主元，LU 对某些矩阵来说“爆炸性不稳定”，但在实践中稳定）

稳定性

问题f的算法 $\hat f$ 是稳定的，如果对于每个x：

$\frac{\lVert \hat{f}(x) - f(y) \rVert}{ \lVert f(y) \rVert } = \mathcal{O}(\varepsilon_{machine})$

对于一些y：

$\frac{\lVert y - x \rVert }{\lVert x \rVert} = \mathcal{O}(\varepsilon_{machine})$

一个稳定的算法几乎可以为几乎正确的问题提供正确的答案（Trefethen，第 104 页）。

翻译：

正确的问题：x
几乎正确的问题：y
正确答案：f
几乎正确的问题的正确答案：f(y)

向后稳定

向后稳定性比稳定性更强大，更简单。

问题f的算法 $\hat f$ 是向后稳定的，如果对于每个x，

$\hat{f}(x) = f(y)$

对于一些y：

$\frac{\lVert y - x \rVert }{\lVert x \rVert} = \mathcal{O}(\varepsilon_{machine})$

向后稳定的算法为几乎正确的问题提供了正确的答案（Trefethen，第 104 页）。

翻译：

正确的问题：x
几乎正确的问题：y
正确答案：f
几乎正确的问题的正确答案：f(y)

带有交换主元的 LU 分解

让我们看看矩阵：

$\hat{A} = \begin{bmatrix} 1 & 1 \ 10^{-20} & 1 \end{bmatrix}$

A = np.array([[1,1], [1e-20, 1]])

手动使用高斯消元法计算L和U：

$\hat{L} = \begin{bmatrix} 1 & 0 \ 10^{-20} & 1 \end{bmatrix}$

$\hat{U} = \begin{bmatrix} 1 & 1 \ 0 & 1 - 10^{-20} \end{bmatrix}$

L, U = LU(A)

np.allclose(A, L @ U)

# True

想法：我们可以切换行的顺序，来获得更稳定的答案！这相当于乘以置换矩阵P。例如，

$\begin{bmatrix} 0 & 1 \ 1 & 0 \end{bmatrix} \cdot \begin{bmatrix} 10^{-20} & 1 \ 1 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 1 \ 10^{-20} & 1 \end{bmatrix}$

$PA = \hat{A}$

对PA应用高斯消元。

在每个步骤中，选择列k中的最大值，并将该行移动到行k。

作业

def swap(a,b):
    temp = np.copy(a)
    a[:] = b
    b[:] = temp

a=np.array([1,2,3])
b=np.array([3,2,1])
swap(a,b)
a,b

# 练习：重新编写上面的 LU 分解以使用交换主元

示例

A = np.array([[2,1,1,0],[4,3,3,1],[8,7,9,5],[6,7,9,8]]).astype(np.float)

L, U, P = LU_pivot(A)

可以比较下面 Trefethen，第 159 页的答案：

A

'''
array([[ 2.,  1.,  1.,  0.],
       [ 4.,  3.,  3.,  1.],
       [ 8.,  7.,  9.,  5.],
       [ 6.,  7.,  9.,  8.]])
'''

U

'''
array([[ 8.        ,  7.        ,  9.        ,  5.        ],
       [ 0.        ,  1.75      ,  2.25      ,  4.25      ],
       [ 0.        ,  0.        , -0.28571429,  0.57142857],
       [ 0.        ,  0.        ,  0.        , -2.        ]])
'''

P

'''
array([[ 0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  1.],
       [ 1.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.]])
'''

部分交换主元可以置换行。这是一种普遍的做法，这通常是 LU 分解的意思。

完全交换主元可以置换行和列。完全交换主元非常耗时，很少在实践中使用。

示例

考虑方程组：

$\begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 1 \ -1 & 1 & 0 & 0 & 0 & 1 \ -1 & -1 & 1 & 0 & 0 & 1 \ -1 & -1 & -1 & 1 & 0 & 1 \ -1 & -1 & -1 & -1 & 1 & 1 \ -1 & -1 & -1 & -1 & -1 & 1 \end{bmatrix} \mathbf{x} = \begin{bmatrix} 1 \ 1 \ 1 \ 1 \ 2 \ 1 \end{bmatrix}$

def make_matrix(n):
    A = np.eye(n)
    for i in range(n):
        A[i,-1] = 1
        for j in range(i):
            A[i,j] = -1
    return A 

def make_vector(n):
    b = np.ones(n)
    b[-2] = 2
    return b

make_vector(7)

# array([ 1.,  1.,  1.,  1.,  1.,  2.,  1.])

练习

练习：让我们在5×5方程组上使用高斯消元法。

Scipy 也有这种功能。让我们看看最后 5 个方程的解，其中n = 10,20,30,40,50,60。

np.set_printoptions(precision=3, suppress=True)

?scipy.linalg.solve

for n, ls in zip(range(10, 70, 10), ['--', ':', '-', '-.', '--', ':']):
    soln = scipy.linalg.lu_solve(scipy.linalg.lu_factor(make_matrix(n)), make_vector(n))
    plt.plot(soln[-5:], ls)
    print(soln[-5:])

'''
[-0.062 -0.125 -0.25   0.5    1.002]
[-0.062 -0.125 -0.25   0.5    1.   ]
[-0.062 -0.125 -0.25   0.5    1.   ]
[-0.062 -0.125 -0.25   0.5    1.   ]
[-0.062 -0.125 -0.25   0.5    1.   ]
[ 0.  0.  0.  0.  1.]
'''

当n = 60时会发生什么？

定理：让矩阵A的因式分解PA = LU通过高斯消元和部分交换主元来计算。所得矩阵（由计算机使用浮点算术） $\hat P$ ， $\hat L$ 和 $\hat U$ 满足：

$\hat{L}\hat{U} = \hat{P} A + \delta A, \quad \frac{\delta A}{A} = \mathcal{O}(\rho \varepsilon_{machine})$

其中ρ是增长因子。

$\rho = \frac{max_{i,j} \lvert u_{ij} \rvert }{max_{i,j} \lvert a_{ij} \rvert }$

对于我们上面的矩阵， $\rho = 2^{m-1}$ 。

理论上不稳定，实际上稳定

大多数算法（例如 QR）的稳定性很简单。具有部分交换主元的高斯消元不是这种情况。只有当L和/或U相对于A的大小较大时，才会出现高斯消元（有或没有交换主元）的不稳定性。

Trefethen：“尽管有（22.4）这样的例子，部分交换主元的高斯消元在实践中是完全稳定的......在计算的五十年中，在自然环境下不会出现产生爆炸性不稳定性的矩阵问题。”【虽然人为的例子很容易构造】

虽然有些矩阵会导致不稳定，但由于统计原因，占所有矩阵的比例非常小，因此“从不”出现。 “如果你随机挑选十亿个矩阵，你几乎肯定找不到高斯消元不稳定的矩阵。”

扩展阅读

高斯消元/ LU 分解 - Trefethn 讲座 20
交换主元 - Trefethn 讲座 21
高斯消除的稳定性 - Trefethn 讲座 22

随机投影发生了什么？

我们在下面的矩阵中采用线性组合（带有随机权重）：

plt.figure(figsize=(12, 12))
plt.imshow(M, cmap='gray')

# <matplotlib.image.AxesImage at 0x7f601f315fd0>

这就像一个随机加权平均值。如果你选取其中的几个，你最终会得到彼此不高度相关的列（大致正交）。

Johnson-Lindenstrauss 引理：（来自维基百科）高维空间中的一小组点可以嵌入到更低维度的空间中，使得点之间的距离几乎保持不变。

我们期望，能够以保留相关结构的方式，减少数据的维度。 Johnson-Lindenstrauss 引理是这种类型的经典结果。

高斯消元的历史

有趣的事实：高斯并没有发明高斯消元，但可能在 Cholesky 之前发现了 Cholesky 因子分解

— Rachel Thomas (@math_rachel) 2017 年 6 月 6 日

根据维基百科，Stigler 的 Eponymy 定律：“没有任何科学发现以它的原始发现者命名。例子包括哈勃定律，它是由 Georges Lemaître 在 Edwin Hubble 两年之前得到的，毕达哥拉斯定理在毕达哥拉斯之前为巴比伦数学家所知，哈雷彗星是自公元前 240 年以来天文学家观察到的彗星。Stigler 本人将社会学家 Robert K. Merton 命名为 Stigler 定律的发现者，表明它遵循自己的法令，尽管这一现象之前曾被其他人注意到。”

迷人的高斯消元的历史。一些亮点：

公元前 20 0年左右，高斯消元的第一个书面记录在中文书籍“九章算术”中。
古代中国人使用彩色竹棒放在“计数板”的列中。
日本数学家 Seki Kowa（1643-1708）在 1683 年之前推进了中国的淘汰消元，并发明了行列式。大约在同一时间，莱布尼兹独立地发现了相似的发现，但是 Kowa 和莱布尼兹都没有因为他们的发现而受到赞扬。
高斯称消元方法是“众所周知的”并且从未声称已经发明了它，尽管他可能已经发明了 Cholesky 分解。

这里有更多历史

加速高斯消元

并行 LU 分解：LU 分解可以完全并行化

随机化 LU 分解（2016 年文章）：随机 LU 完全为在标准 GPU 上运行而实现，无需任何 GPU-CPU 数据传输。

`scipy.linalg` vs `lu_solve`

n = 60
A = make_matrix(n)
b = make_vector(n)

这个问题有很大的增长因子= 259。我们使用scipy.linalg.lu_solve获得了错误的答案，但使用scipy.linalg.solve得到了正确的答案。什么是scipy.linalg.solve呢？

print(scipy.linalg.lu_solve(scipy.linalg.lu_factor(A), b)[-5:])
print(scipy.linalg.solve(A, b)[-5:])

'''
[ 0.  0.  0.  0.  1.]
[-0.062 -0.125 -0.25   0.5    1.   ]
'''

%%timeit
soln = scipy.linalg.lu_solve(scipy.linalg.lu_factor(A), b)
soln[-5:]

# 91.2 µs ± 192 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%%timeit
soln = scipy.linalg.solve(A, b)
soln[-5:]

# 153 µs ± 5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

查看scipy的源代码，我们看到它正在调用LAPACK例程gesvx。这是sgesvx的 Fortran 源代码（s指的是单个，也有用于浮点的dgesvx和复数的cgesvx）。在注释中，我们看到它正在计算 reciprocal 主元增长因子，因此它考虑了这个增长因子，并做了一些比普通的部分主元 LU 分解更复杂的事情。

分块矩阵

经典的矩阵乘法

问题：计算两个n×n的矩阵A×B = C的矩阵乘法的计算复杂度（大O）是多少？

你可以在 Codecademy 学习（或复习）大O。

它的样子是：

for i=1 to n
    {read row i of A into fast memory}
    for j=1 to n
        {read col j of B into fast memory}
        for k=1 to n
            C[i,j] += A[i,k] x B[k,j]
        {write C[i,j] back to slow memory}

问题：进行了多少次读写操作？

分块矩阵相乘

将A，B，C分成大小为N/n × N/n的N×N个块。

来源

它的样子是：

for i=1 to N
    for j=1 to N
        for k=1 to N
            {read block (i,k) of A}
            {read block (k,j) of B}
            block (i,j) of C += block of A times block of B
        {write block (i,j) of C back to slow memory}

问题 1：这个的大O是什么？

问题 2：进行了多少次读写操作？

六、使用鲁棒回归的 CT 扫描的压缩感知

广播

术语广播描述了在算术运算期间，如何处理不同形状的数组。 Numpy 首先使用广播一词，但现在用于其他库，如 Tensorflow 和 Matlab；规则因库而异。

来自 Numpy 文档：

广播提供了一种向量化数组操作的方法，使循环在 C 而不是 Python 中出现。它可以不制作不必要的数据副本而实现，并且通常可以产生高效实现。最简单的广播示例在数组乘以标量时发生。

a = np.array([1.0, 2.0, 3.0])
b = 2.0
a * b

# array([ 2.,  4.,  6.])

v=np.array([1,2,3])
print(v, v.shape)

# [1 2 3] (3,)

m=np.array([v,v*2,v*3]); m, m.shape

'''
(array([[1, 2, 3],
        [2, 4, 6],
        [3, 6, 9]]), (3, 3))
'''

n = np.array([m*1, m*5])

n

'''
array([[[ 1,  2,  3],
        [ 2,  4,  6],
        [ 3,  6,  9]],

       [[ 5, 10, 15],
        [10, 20, 30],
        [15, 30, 45]]])
'''

n.shape, m.shape

# ((2, 3, 3), (3, 3))

我们可以使用广播来将矩阵和数组相加：

m+v

'''
array([[ 2,  4,  6],
       [ 3,  6,  9],
       [ 4,  8, 12]])
'''

注意如果我们转置数组会发生什么：

v1=np.expand_dims(v,-1); v1, v1.shape

'''
(array([[1],
        [2],
        [3]]), (3, 1))
'''

m+v1

'''
array([[ 2,  3,  4],
       [ 4,  6,  8],
       [ 6,  9, 12]])
'''

通用的 NumPy 广播规则

操作两个数组时，NumPy 会逐元素地比较它们的形状。它从最后的维度开始，并向前移动。如果满足：

他们是相等的，或者
其中一个是 1

两个维度兼容。

数组不需要具有相同数量的维度。例如，如果你有一个256×256×3的 RGB 值数组，并且你希望将图像中的每种颜色缩放不同的值，则可以将图像乘以具有 3 个值的一维数组。根据广播规则排列这些数组的尾部轴的大小，表明它们是兼容的：

Image  (3d array): 256 x 256 x 3
Scale  (1d array):             3
Result (3d array): 256 x 256 x 3

回顾

v = np.array([1,2,3,4])
m = np.array([v,v*2,v*3])
A = np.array([5*m, -1*m])

v.shape, m.shape, A.shape

# ((4,), (3, 4), (2, 3, 4))

下列操作有效嘛？

A

A + v

A.T + v

A.T.shape

（SciPy 中的）稀疏矩阵

具有大量零的矩阵称为稀疏（稀疏是密集的反义）。对于稀疏矩阵，仅仅存储非零值，可以节省大量内存。

另一个大型稀疏矩阵的例子：

来源

这是最常见的稀疏存储格式：

逐坐标（scipy 称 COO）
压缩稀疏行（CSR）
压缩稀疏列（CSC）

让我们来看看这些例子。

实际上还有更多格式。

如果非零元素的数量与行（或列）的数量成比例而不是与行列的乘积成比例，则通常将一类矩阵（例如，对角）称为稀疏。

Scipy 实现

来自 Scipy 稀疏矩阵文档

为了有效地构造矩阵，请使用dok_matrix或lil_matrix。 lil_matrix类支持基本切片和花式索引，其语法与 NumPy 数组类似。如下所示，COO 格式也可用于有效地构造矩阵
要执行乘法或求逆等操作，首先要将矩阵转换为 CSC 或 CSR 格式。
CSR，CSC 和 COO 格式之间的所有转换都是高效的线性时间操作。

今天：CT 扫描

引言

“数学真的可以拯救你的生命吗？当然可以！！” （可爱的文章）

（CAT 和 CT 扫描指代相同的过程。CT 扫描是更现代的术语）

本课程基于 Scikit-Learn 示例压缩感知：使用 L1 先验的层析成像重建（Lasso）。

我们今天的目标

读取 CT 扫描的结果并构建原始图像。

对于（特定位置和特定角度的）每个 X 射线，我们进行单次测量。我们需要从这些测量中构建原始图像。此外，我们不希望患者经历大量辐射，因此我们收集的数据少于图片区域。

我们会看到：

来源：压缩感知

来源

导入

%matplotlib inline
import numpy as np, matplotlib.pyplot as plt, math
from scipy import ndimage, sparse

np.set_printoptions(suppress=True)

生成数据

引言

我们将使用生成的数据（不是真正的 CT 扫描）。生成数据涉及一些有趣的 numpy 和线性代数，我们稍后会再回过头来看。

代码来自 Scikit-Learn 示例压缩感知：使用 L1 先验的层析成像重建（Lasso）。

生成图像

def generate_synthetic_data():
    rs = np.random.RandomState(0)
    n_pts = 36
    x, y = np.ogrid[0:l, 0:l]
    mask_outer = (x - l / 2) ** 2 + (y - l / 2) ** 2 < (l / 2) ** 2
    mx,my = rs.randint(0, l, (2,n_pts))
    mask = np.zeros((l, l))
    mask[mx,my] = 1
    mask = ndimage.gaussian_filter(mask, sigma=l / n_pts)
    res = (mask > mask.mean()) & mask_outer
    return res ^ ndimage.binary_erosion(res)

l = 128
data = generate_synthetic_data()

plt.figure(figsize=(5,5))
plt.imshow(data, cmap=plt.cm.gray);

`generate_synthetic_data`在做什么

l=8; n_pts=5
rs = np.random.RandomState(0)

x, y = np.ogrid[0:l, 0:l]; x,y

'''
(array([[0],
        [1],
        [2],
        [3],
        [4],
        [5],
        [6],
        [7]]), array([[0, 1, 2, 3, 4, 5, 6, 7]]))
'''

x + y

'''
array([[ 0,  1,  2,  3,  4,  5,  6,  7],
       [ 1,  2,  3,  4,  5,  6,  7,  8],
       [ 2,  3,  4,  5,  6,  7,  8,  9],
       [ 3,  4,  5,  6,  7,  8,  9, 10],
       [ 4,  5,  6,  7,  8,  9, 10, 11],
       [ 5,  6,  7,  8,  9, 10, 11, 12],
       [ 6,  7,  8,  9, 10, 11, 12, 13],
       [ 7,  8,  9, 10, 11, 12, 13, 14]])
'''

(x - l/2) ** 2 

'''
array([[ 16.],
       [  9.],
       [  4.],
       [  1.],
       [  0.],
       [  1.],
       [  4.],
       [  9.]])
'''

(x - l/2) ** 2 + (y - l/2) ** 2

'''
array([[ 32.,  25.,  20.,  17.,  16.,  17.,  20.,  25.],
       [ 25.,  18.,  13.,  10.,   9.,  10.,  13.,  18.],
       [ 20.,  13.,   8.,   5.,   4.,   5.,   8.,  13.],
       [ 17.,  10.,   5.,   2.,   1.,   2.,   5.,  10.],
       [ 16.,   9.,   4.,   1.,   0.,   1.,   4.,   9.],
       [ 17.,  10.,   5.,   2.,   1.,   2.,   5.,  10.],
       [ 20.,  13.,   8.,   5.,   4.,   5.,   8.,  13.],
       [ 25.,  18.,  13.,  10.,   9.,  10.,  13.,  18.]])
'''

mask_outer = (x - l/2) ** 2 + (y - l/2) ** 2 < (l/2) ** 2; mask_outer

'''
array([[False, False, False, False, False, False, False, False],
       [False, False,  True,  True,  True,  True,  True, False],
       [False,  True,  True,  True,  True,  True,  True,  True],
       [False,  True,  True,  True,  True,  True,  True,  True],
       [False,  True,  True,  True,  True,  True,  True,  True],
       [False,  True,  True,  True,  True,  True,  True,  True],
       [False,  True,  True,  True,  True,  True,  True,  True],
       [False, False,  True,  True,  True,  True,  True, False]], dtype=bool)
'''

plt.imshow(mask_outer, cmap='gray')

# <matplotlib.image.AxesImage at 0x7efcd9303278>

mask = np.zeros((l, l))
mx,my = rs.randint(0, l, (2,n_pts))
mask[mx,my] = 1; mask

'''
array([[ 0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.]])
'''

plt.imshow(mask, cmap='gray')

# <matplotlib.image.AxesImage at 0x7efcd9293940>

mask = ndimage.gaussian_filter(mask, sigma=l / n_pts)

plt.imshow(mask, cmap='gray')

# <matplotlib.image.AxesImage at 0x7efcd922c0b8>

res = np.logical_and(mask > mask.mean(), mask_outer)
plt.imshow(res, cmap='gray');

plt.imshow(ndimage.binary_erosion(res), cmap='gray');

plt.imshow(res ^ ndimage.binary_erosion(res), cmap='gray');

生成投影

代码

def _weights(x, dx=1, orig=0):
    x = np.ravel(x)
    floor_x = np.floor((x - orig) / dx)
    alpha = (x - orig - floor_x * dx) / dx
    return np.hstack((floor_x, floor_x + 1)), np.hstack((1 - alpha, alpha))


def _generate_center_coordinates(l_x):
    X, Y = np.mgrid[:l_x, :l_x].astype(np.float64)
    center = l_x / 2.
    X += 0.5 - center
    Y += 0.5 - center
    return X, Y

def build_projection_operator(l_x, n_dir):
    X, Y = _generate_center_coordinates(l_x)
    angles = np.linspace(0, np.pi, n_dir, endpoint=False)
    data_inds, weights, camera_inds = [], [], []
    data_unravel_indices = np.arange(l_x ** 2)
    data_unravel_indices = np.hstack((data_unravel_indices,
                                      data_unravel_indices))
    for i, angle in enumerate(angles):
        Xrot = np.cos(angle) * X - np.sin(angle) * Y
        inds, w = _weights(Xrot, dx=1, orig=X.min())
        mask = (inds >= 0) & (inds < l_x)
        weights += list(w[mask])
        camera_inds += list(inds[mask] + i * l_x)
        data_inds += list(data_unravel_indices[mask])
    proj_operator = sparse.coo_matrix((weights, (camera_inds, data_inds)))
    return proj_operator

投影运算符

l = 128

proj_operator = build_projection_operator(l, l // 7)

proj_operator

'''
<2304x16384 sparse matrix of type '<class 'numpy.float64'>'
    with 555378 stored elements in COOrdinate format>
'''

维度：角度（l // 7），位置（l），每个图像（l x l）

proj_t = np.reshape(proj_operator.todense().A, (l//7,l,l,l))

第一个坐标指的是线的角度，第二个坐标指代线的位置。

索引为 3 的角度的直线：

plt.imshow(proj_t[3,0], cmap='gray');

plt.imshow(proj_t[3,1], cmap='gray');

plt.imshow(proj_t[3,2], cmap='gray');

plt.imshow(proj_t[3,40], cmap='gray');

垂直位置 40 处的其他直线：

plt.imshow(proj_t[4,40], cmap='gray');

plt.imshow(proj_t[15,40], cmap='gray');

plt.imshow(proj_t[17,40], cmap='gray');

X 射线和数据之间的交点

接下来，我们想看看直线如何与我们的数据相交。请记住，这就是数据的样子：

plt.figure(figsize=(5,5))
plt.imshow(data, cmap=plt.cm.gray)
plt.axis('off')
plt.savefig("images/data.png")

proj = proj_operator @ data.ravel()[:, np.newaxis]

角度为 17，位置为 40 的穿过数据的 X 射线：

plt.figure(figsize=(5,5))
plt.imshow(data + proj_t[17,40], cmap=plt.cm.gray)
plt.axis('off')
plt.savefig("images/data_xray.png")

它们相交的地方。

both = data + proj_t[17,40]
plt.imshow((both > 1.1).astype(int), cmap=plt.cm.gray);

那条 X 射线的强度：

np.resize(proj, (l//7,l))[17,40]

# 6.4384498372605989

角度为 3，位置为 14 的穿过数据的 X 射线：

plt.imshow(data + proj_t[3,14], cmap=plt.cm.gray);

它们相交的地方。

both = data + proj_t[3,14]
plt.imshow((both > 1.1).astype(int), cmap=plt.cm.gray);

CT 扫描的测量结果在这里是一个小数字：

np.resize(proj, (l//7,l))[3,14]

# 2.1374953737965541

proj += 0.15 * np.random.randn(*proj.shape)

关于`*args`

a = [1,2,3]
b = [4,5,6]

c = list(zip(a, b))

c

# [(1, 4), (2, 5), (3, 6)]

list(zip(*c))

# [(1, 2, 3), (4, 5, 6)]

投影（CT 读取）

plt.figure(figsize=(7,7))
plt.imshow(np.resize(proj, (l//7,l)), cmap='gray')
plt.axis('off')
plt.savefig("images/proj.png")

回归

现在我们将尝试仅从投影中恢复数据（CT 扫描的测量值）。

线性回归：`Xβ=y`

我们的矩阵A是投影算子。这是我们不同 X 射线上方的 4d 矩阵（角度，位置，x，y）：

plt.figure(figsize=(12,12))
plt.title("X: Projection Operator")
plt.imshow(proj_operator.todense().A, cmap='gray')

# <matplotlib.image.AxesImage at 0x7efcd414ed30>

我们正在求解原始数据x。我们将 2D 数据展开为单个列。

plt.figure(figsize=(5,5))
plt.title("beta: Image")
plt.imshow(data, cmap='gray')

plt.figure(figsize=(4,12))
# 我正在平铺列，使其更容易看到
plt.imshow(np.tile(data.ravel(), (80,1)).T, cmap='gray')

# <matplotlib.image.AxesImage at 0x7efcd3b1e278>

我们的向量y是展开的测量值矩阵：

plt.figure(figsize=(8,8))
plt.imshow(np.resize(proj, (l//7,l)), cmap='gray')

plt.figure(figsize=(10,10))
plt.imshow(np.tile(proj.ravel(), (20,1)).T, cmap='gray')

# <matplotlib.image.AxesImage at 0x7efcd34f8710>

使用 Sklearn 线性回归重构图像

from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge

# 用 L2（岭）惩罚重建
rgr_ridge = Ridge(alpha=0.2)
rgr_ridge.fit(proj_operator, proj.ravel())
rec_l2 = rgr_ridge.coef_.reshape(l, l)
plt.imshow(rec_l2, cmap='gray')

# <matplotlib.image.AxesImage at 0x7efcd453d5c0>

18*128

# 2304

18 x 128 x 128 x 128

L1 范数产生稀疏性

单位球 $\lVert x \rVert_1 = 1$ 在 L1 范数中是菱形。它的极值是角：

来源

类似的视角是看损失函数的轮廓：

来源

$\lVert \cdot \rVert_1$ 是 L1 范数。最小化 L1 范数会产生稀疏值。对于矩阵，L1 范数等于最大绝对列范数。

$\lVert \cdot \rVert_*$ 是核范数，它是奇异值的 L1 范数。试图最小化它会产生稀疏的奇异值 -> 低秩。

proj_operator.shape

# (2304, 16384)

# 使用 L1（Lasso）惩罚重建 α 的最佳值
# 使用 LassoCV 交叉验证来确定
rgr_lasso = Lasso(alpha=0.001)
rgr_lasso.fit(proj_operator, proj.ravel())
rec_l1 = rgr_lasso.coef_.reshape(l, l)
plt.imshow(rec_l1, cmap='gray')

# <matplotlib.image.AxesImage at 0x7efcd4919cf8>

这里的 L1 惩罚明显优于 L2 惩罚！

七、线性回归和健康结果

糖尿病数据集

我们将使用来自糖尿病患者的数据集。数据由 442 个样本和 10 个变量（都是生理特征）组成，因此它很高而且很窄。因变量是基线后一年疾病进展的定量测量。

这是一个经典的数据集，由 Efron，Hastie，Johnstone 和 Tibshirani 在他们的最小角度回归的论文中使用，也是 scikit-learn 中包含的众多数据集之一。

data = datasets.load_diabetes()

feature_names=['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

trn,test,y_trn,y_test = train_test_split(data.data, data.target, test_size=0.2)

trn.shape, test.shape

# ((353, 10), (89, 10))

Sklearn 中的线性回归

考虑系统Xβ=y，其中X的行比列更多。当你有比变量更多的数据样本时会发生这种情况。我们想要找到 $\hat \beta$ 来最小化：

$\big\vert\big\vert X\beta - y \big\vert\big\vert_2$

让我们从使用 sklearn 实现开始：

regr = linear_model.LinearRegression()
%timeit regr.fit(trn, y_trn)

# 458 µs ± 62.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

pred = regr.predict(test)

有一些指标来表示我们的预测有多好，会很有帮助。我们将研究均方范数（L2）和平均绝对误差（L1）。

def regr_metrics(act, pred):
    return (math.sqrt(metrics.mean_squared_error(act, pred)), 
    metrics.mean_absolute_error(act, pred))

regr_metrics(y_test, regr.predict(test))

# (75.36166834955054, 60.629082113104403)

多项式特征

线性回归找到最佳系数βi：

$x_0\beta_0 + x_1\beta_1 + x_2\beta_2 = y$

添加多项式特征仍然是线性回归问题，只需更多项：

$x_0\beta_0 + x_1\beta_1 + x_2\beta_2 + x_0^2\beta_3 + x_0 x_1\beta_4 + x_0 x_2\beta_5 + x_1^2\beta_6 + x_1 x_2\beta_7 + x_2^2\beta_8 = y$

我们需要使用原始数据X来计算其他多项式特征。

trn.shape

# (353, 10)

现在，我们想通过添加更多功能，来尝试提高模型的表现。目前，我们的模型在每个变量中都是线性的，但我们可以添加多项式特征来改变它。

poly = PolynomialFeatures(include_bias=False)

trn_feat = poly.fit_transform(trn)

', '.join(poly.get_feature_names(feature_names))

# 'age, sex, bmi, bp, s1, s2, s3, s4, s5, s6, age^2, age sex, age bmi, age bp, age s1, age s2, age s3, age s4, age s5, age s6, sex^2, sex bmi, sex bp, sex s1, sex s2, sex s3, sex s4, sex s5, sex s6, bmi^2, bmi bp, bmi s1, bmi s2, bmi s3, bmi s4, bmi s5, bmi s6, bp^2, bp s1, bp s2, bp s3, bp s4, bp s5, bp s6, s1^2, s1 s2, s1 s3, s1 s4, s1 s5, s1 s6, s2^2, s2 s3, s2 s4, s2 s5, s2 s6, s3^2, s3 s4, s3 s5, s3 s6, s4^2, s4 s5, s4 s6, s5^2, s5 s6, s6^2'

trn_feat.shape

# (353, 65)

regr.fit(trn_feat, y_trn)

# LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

regr_metrics(y_test, regr.predict(poly.fit_transform(test)))

# (55.747345922929185, 42.836164292252235)

时间对于特征数是平方的，对于样本数是线性的，所以这将变得非常慢！

%timeit poly.fit_transform(trn)

# 635 µs ± 9.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

加速特征生成

我们想加快速度。我们将使用 Numba，一个直接将代码编译为 C 的 Python 库。

Numba 是一个编译器。

资源

Jake VanderPlas 的这个教程是一个很好的介绍。在这里，Jake 使用 Numba 实现了一个非平凡的算法（非均匀快速傅里叶变换）。

Cython 是另一种选择。我发现 Cython 主要比 Numba 更多的知识（它更接近 C），但提供类似 Numba 的加速。

这里是预先编译（AOT）编译器，即时编译（JIT）编译器和解释器之间差异的全面回答。

使用向量化和原生代码进行实验

让我们先了解一下 Numba，然后我们将回到我们的糖尿病数据集回归的多项式特征问题。

%matplotlib inline

import math, numpy as np, matplotlib.pyplot as plt
from pandas_summary import DataFrameSummary
from scipy import ndimage

from numba import jit, vectorize, guvectorize, cuda, float32, void, float64

我们将展示以下方面的影响：

避免内存分配和副本（比 CPU 计算慢）
更好的局部性
向量化

如果我们一次在整个数组上使用 numpy，它会创建大量的临时值，并且不能使用缓存。如果我们一次使用 numba 循环遍历数组项，那么我们就不必分配大型临时数组，并且可以复用缓存数据，因为我们正在对每个数组项进行多次计算。

# 无类型和没有向量化
def proc_python(xx,yy):
    zz = np.zeros(nobs, dtype='float32')
    for j in range(nobs):   
        x, y = xx[j], yy[j] 
        x = x*2 - ( y * 55 )
        y = x + y*2         
        z = x + y + 99      
        z = z * ( z - .88 ) 
        zz[j] = z           
    return zz

    nobs = 10000
x = np.random.randn(nobs).astype('float32')
y = np.random.randn(nobs).astype('float32')

%timeit proc_python(x,y) 

# 49.8 ms ± 1.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

NumPy

Numpy 让我们对其向量化：

# 有类型和向量化
def proc_numpy(x,y):
    z = np.zeros(nobs, dtype='float32')
    x = x*2 - ( y * 55 )
    y = x + y*2         
    z = x + y + 99      
    z = z * ( z - .88 ) 
    return z

np.allclose( proc_numpy(x,y), proc_python(x,y), atol=1e-4 )

# True

%timeit proc_numpy(x,y)    # Typed and vectorized

# 35.9 µs ± 166 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Numba

Numba 提供几种不同的装饰器。我们将尝试两种不同的方法：

@jit：非常一般
@vectorize：不需要编写for循环。操作相同大小的向量时很有用

首先，我们将使用 Numba 的jit（即时）编译器装饰器，而无需显式向量化。这避免了大量内存分配，因此我们有更好的局部性：

@jit()
def proc_numba(xx,yy,zz):
    for j in range(nobs):   
        x, y = xx[j], yy[j] 
        x = x*2 - ( y * 55 )
        y = x + y*2         
        z = x + y + 99      
        z = z * ( z - .88 ) 
        zz[j] = z           
    return zz

z = np.zeros(nobs).astype('float32')
np.allclose( proc_numpy(x,y), proc_numba(x,y,z), atol=1e-4 )

# True

%timeit proc_numba(x,y,z)

# 6.4 µs ± 17.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

现在我们将使用 Numba 的vectorize装饰器。 Numba 的编译器以比普通 Python 和 Numpy 更聪明的方式优化它。它为你写了一个 Numpy ufunc，传统上它涉及编写 C 并且不那么简单。

@vectorize
def vec_numba(x,y):
    x = x*2 - ( y * 55 )
    y = x + y*2         
    z = x + y + 99      
    return z * ( z - .88 ) 

np.allclose(vec_numba(x,y), proc_numba(x,y,z), atol=1e-4 )

# True

%timeit vec_numba(x,y)

# 5.82 µs ± 14.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Numba 很棒。看看这有多快！

Numba 多项式特征

@jit(nopython=True)
def vec_poly(x, res):
    m,n=x.shape
    feat_idx=0
    for i in range(n):
        v1=x[:,i]
        for k in range(m): res[k,feat_idx] = v1[k]
        feat_idx+=1
        for j in range(i,n):
            for k in range(m): res[k,feat_idx] = v1[k]*x[k,j]
            feat_idx+=1

行序和列序存储

来自 Eli Bendersky 的博客文章：

“矩阵的行序布局将第一行放在连续的内存中，然后是第二行放在它后面，然后是第三行，依此类推。列序布局将第一列放在连续内存中，然后放入第二列，等等....虽然知道特定数据集使用哪种布局对于良好的性能至关重要，但对于哪种布局“更好”的问题，没有单一的答案。”

“事实证明，匹配算法与数据布局的工作方式，可以决定应用程序的性能。”

“简短的说法是：始终按照布局顺序遍历数据。”

列序布局：Fortran，Matlab，R 和 Julia

行序布局：C，C ++，Python，Pascal，Mathematica

trn = np.asfortranarray(trn)
test = np.asfortranarray(test)

m,n=trn.shape
n_feat = n*(n+1)//2 + n
trn_feat = np.zeros((m,n_feat), order='F')
test_feat = np.zeros((len(y_test), n_feat), order='F')

vec_poly(trn, trn_feat)
vec_poly(test, test_feat)

regr.fit(trn_feat, y_trn)

# LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

regr_metrics(y_test, regr.predict(test_feat))

# (55.74734592292935, 42.836164292252306)

%timeit vec_poly(trn, trn_feat)

# 7.33 µs ± 19.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

回想一下，这是 sklearn PolynomialFeatures实现的时间，它是由专家创建的：

%timeit poly.fit_transform(trn)

# 635 µs ± 9.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

605/7.7

# 78.57142857142857

这是一个大问题！ Numba 太神奇了！只需一行代码，我们就可以获得比 scikit 学习快 78 倍的速度（由专家优化）。

正则化和噪声

正则化是一种减少过拟合，并创建更好地泛化到新数据的模型的方法。

正则化

Lasso 回归使用 L1 惩罚，产生稀疏系数。参数α用于加权惩罚项。 Scikit Learn 的LassoCV使用许多不同的α值进行交叉验证。

观看 Lasso 回归的 Coursera 视频，了解更多信息。

reg_regr = linear_model.LassoCV(n_alphas=10)

reg_regr.fit(trn_feat, y_trn)

'''
/home/jhoward/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/coordinate_descent.py:484: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Fitting data with very small alpha may cause precision problems.
  ConvergenceWarning)

LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True,
    max_iter=1000, n_alphas=10, n_jobs=1, normalize=False, positive=False,
    precompute='auto', random_state=None, selection='cyclic', tol=0.0001,
    verbose=False)
'''

reg_regr.alpha_

# 0.0098199431661591518

regr_metrics(y_test, reg_regr.predict(test_feat))

# (50.0982471642817, 40.065199085003101)

噪声

现在我们将为数据添加一些噪音。

idxs = np.random.randint(0, len(trn), 10)

y_trn2 = np.copy(y_trn)
y_trn2[idxs] *= 10 # label noise

regr = linear_model.LinearRegression()
regr.fit(trn, y_trn)
regr_metrics(y_test, regr.predict(test))

# (51.1766253181518, 41.415992803872754)

regr.fit(trn, y_trn2)
regr_metrics(y_test, regr.predict(test))

# (62.66110319520415, 53.21914420254862)

Huber 损失是一种损失函数，对异常值的敏感度低于平方误差损失。对于小的误差值，它是二次的，对于大的值，它是线性的。

$L(x)= \begin{cases} \frac{1}{2}x^2, & \text{for } \lvert x\rvert\leq \delta \ \delta(\lvert x \rvert - \frac{1}{2}\delta), & \text{otherwise} \end{cases}$

hregr = linear_model.HuberRegressor()
hregr.fit(trn, y_trn2)
regr_metrics(y_test, hregr.predict(test))

# (51.24055602541746, 41.670840571376822)

八、如何实现线性回归

在上一课中，我们使用 scikit learn 的实现计算了糖尿病数据集的最小二乘线性回归。今天，我们将看看如何编写自己的实现。

起步

from sklearn import datasets, linear_model, metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
import math, scipy, numpy as np
from scipy import linalg

np.set_printoptions(precision=6)

data = datasets.load_diabetes()

feature_names=['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

trn,test,y_trn,y_test = train_test_split(data.data, data.target, test_size=0.2)

trn.shape, test.shape

# ((353, 10), (89, 10))

def regr_metrics(act, pred):
    return (math.sqrt(metrics.mean_squared_error(act, pred)), 
    metrics.mean_absolute_error(act, pred))

sklearn 如何实现它

sklearn 是如何做到这一点的？通过检查源代码，你可以看到在密集的情况下，它调用scipy.linalg.lstqr，它调用 LAPACK 方法：

选项是'gelsd'，'gelsy'，'gelss'。默认值'gelsd'是个好的选择，但是，'gelsy'在许多问题上更快一些。'gelss'由于历史原因而使用。它通常更慢但是使用更少内存。

gelsd：使用 SVD 和分治方法
gelsy：使用 QR 分解
gelss：使用 SVD

Scipy 稀疏最小二乘

我们不会详细介绍稀疏版本的最小二乘法。如果你有兴趣，请参考以下信息：

Scipy 稀疏最小二乘使用称为 Golub 和 Kahan 双对角化的迭代方法。

Scipy 稀疏最小二乘源代码：预处理是减少迭代次数的另一种方法。如果有可能有效地求解相关系统M*x = b，其中M以某种有用的方式近似A（例如，M-A具有低秩或其元素相对于A的元素较小），则 LSQR 可以在系统A*M(inverse)*z = b更快地收敛。之后可以通过求解M * x = z来恢复x。

如果A是对称的，则不应使用 LSQR！替代方案是对称共轭梯度法（cg）和/或 SYMMLQ。 SYMMLQ 是对称 cg 的一种实现，适用于任何对称的A，并且比 LSQR 更快收敛。如果A是正定的，则存在对称 cg 的其他实现，每次迭代需要的工作量比 SYMMLQ 少一些（但需要相同的迭代次数）。

`linalg.lstqr`

sklearn 实现为我们添加一个常数项（因为对于我们正在学习的直线，y截距可能不是 0）。我们现在需要手工完成：

trn_int = np.c_[trn, np.ones(trn.shape[0])]
test_int = np.c_[test, np.ones(test.shape[0])]

由于linalg.lstsq允许我们指定，我们想要使用哪个 LAPACK 例程，让我们尝试它们并进行一些时间比较：

%timeit coef, _,_,_ = linalg.lstsq(trn_int, y_trn, lapack_driver="gelsd")

# 290 µs ± 9.24 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit coef, _,_,_ = linalg.lstsq(trn_int, y_trn, lapack_driver="gelsy")

# 140 µs ± 91.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit coef, _,_,_ = linalg.lstsq(trn_int, y_trn, lapack_driver="gelss")

# 199 µs ± 228 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

朴素解法

回想一下，我们想找到 $\hat x$ ，来最小化：

$\big\vert\big\vert Ax - b \big\vert\big\vert_2$

另一种思考方式是，我们对向量b最接近A的子空间（称为A的范围）的地方感兴趣。这是b在A上的投影。由于 $b - A \hat x$ 必须垂直于A的子空间，我们可以看到：

$A^T (b - A\hat{x}) = 0$

使用了因为要相乘A和 $b - A\hat{x}$ 的每列。

这让我们得到正规方程：

$x = (ATA){-1}A^T b$

def ls_naive(A, b):
     return np.linalg.inv(A.T @ A) @ A.T @ b

%timeit coeffs_naive = ls_naive(trn_int, y_trn)

# 45.8 µs ± 4.65 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

coeffs_naive = ls_naive(trn_int, y_trn)
regr_metrics(y_test, test_int @ coeffs_naive)

# (57.94102134545707, 48.053565198516438)

正规方程（Cholesky）

正规方程：

如果A具有满秩，则伪逆 $(ATA){-1}A^T$ 是正方形，埃尔米特正定矩阵。解决这种系统的标准方法是 Cholesky 分解，它找到上三角形R，满足。

以下步骤基于 Trefethen 的算法 11.1：

A = trn_int

b = y_trn

AtA = A.T @ A
Atb = A.T @ b

警告：对于Cholesky，Numpy 和 Scipy 默认为不同的上/下三角。

R = scipy.linalg.cholesky(AtA)

np.set_printoptions(suppress=True, precision=4)
R

'''
array([[  0.9124,   0.1438,   0.1511,   0.3002,   0.2228,   0.188 ,
         -0.051 ,   0.1746,   0.22  ,   0.2768,  -0.2583],
       [  0.    ,   0.8832,   0.0507,   0.1826,  -0.0251,   0.0928,
         -0.3842,   0.2999,   0.0911,   0.15  ,   0.4393],
       [  0.    ,   0.    ,   0.8672,   0.2845,   0.2096,   0.2153,
         -0.2695,   0.3181,   0.3387,   0.2894,  -0.005 ],
       [  0.    ,   0.    ,   0.    ,   0.7678,   0.0762,  -0.0077,
          0.0383,   0.0014,   0.165 ,   0.166 ,   0.0234],
       [  0.    ,   0.    ,   0.    ,   0.    ,   0.8288,   0.7381,
          0.1145,   0.4067,   0.3494,   0.158 ,  -0.2826],
       [  0.    ,   0.    ,   0.    ,   0.    ,   0.    ,   0.3735,
         -0.3891,   0.2492,  -0.3245,  -0.0323,  -0.1137],
       [  0.    ,   0.    ,   0.    ,   0.    ,   0.    ,   0.    ,
          0.6406,  -0.511 ,  -0.5234,  -0.172 ,  -0.9392],
       [  0.    ,   0.    ,   0.    ,   0.    ,   0.    ,   0.    ,
          0.    ,   0.2887,  -0.0267,  -0.0062,   0.0643],
       [  0.    ,   0.    ,   0.    ,   0.    ,   0.    ,   0.    ,
          0.    ,   0.    ,   0.2823,   0.0636,   0.9355],
       [  0.    ,   0.    ,   0.    ,   0.    ,   0.    ,   0.    ,
          0.    ,   0.    ,   0.    ,   0.7238,   0.0202],
       [  0.    ,   0.    ,   0.    ,   0.    ,   0.    ,   0.    ,
          0.    ,   0.    ,   0.    ,   0.    ,  18.7319]])
'''

检查我们的分解：

np.linalg.norm(AtA - R.T @ R)

# 4.5140158187158533e-16

$A^T A x = A^T b \ R^T R x = A^T b \R^T w = A^T b \R x = w$

w = scipy.linalg.solve_triangular(R, Atb, lower=False, trans='T')

检查我们的结果是否符合预期总是好的：（以防我们输入错误的参数，函数没有返回我们想要的东西，或者有时文档甚至过时）。

np.linalg.norm(R.T @ w - Atb)

# 1.1368683772161603e-13

coeffs_chol = scipy.linalg.solve_triangular(R, w, lower=False)

np.linalg.norm(R @ coeffs_chol - w)

# 6.861429794408013e-14

def ls_chol(A, b):
    R = scipy.linalg.cholesky(A.T @ A)
    w = scipy.linalg.solve_triangular(R, A.T @ b, trans='T')
    return scipy.linalg.solve_triangular(R, w)

%timeit coeffs_chol = ls_chol(trn_int, y_trn)

# 111 µs ± 272 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

coeffs_chol = ls_chol(trn_int, y_trn)
regr_metrics(y_test, test_int @ coeffs_chol)

# (57.9410213454571, 48.053565198516438)

QR 分解

$A x = b \ A = Q R \ Q R x = b \ R x = Q^T b$

def ls_qr(A,b):
    Q, R = scipy.linalg.qr(A, mode='economic')
    return scipy.linalg.solve_triangular(R, Q.T @ b)

%timeit coeffs_qr = ls_qr(trn_int, y_trn)

# 205 µs ± 264 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

coeffs_qr = ls_qr(trn_int, y_trn)
regr_metrics(y_test, test_int @ coeffs_qr)

# (57.94102134545711, 48.053565198516452)

SVD

$A x = b \ A = U \Sigma V \ \Sigma V x = U^T b \ \Sigma w = U^T b \ x = V^T w$

SVD 给出伪逆。

def ls_svd(A,b):
    m, n = A.shape
    U, sigma, Vh = scipy.linalg.svd(A, full_matrices=False)
    w = (U.T @ b)/ sigma
    return Vh.T @ w

%timeit coeffs_svd = ls_svd(trn_int, y_trn)

# 1.11 ms ± 320 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit coeffs_svd = ls_svd(trn_int, y_trn)

# 266 µs ± 8.49 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

coeffs_svd = ls_svd(trn_int, y_trn)
regr_metrics(y_test, test_int @ coeffs_svd)

# (57.941021345457244, 48.053565198516687)

最小二乘回归的随机 Sketching 技巧

线性 Sketching（Woodruff）

抽取r×n随机矩阵S，r << n
计算S A和S b
找到回归SA x = Sb的精确解x

时间比较

import timeit
import pandas as pd

def scipylstq(A, b):
    return scipy.linalg.lstsq(A,b)[0]

row_names = ['Normal Eqns- Naive',
             'Normal Eqns- Cholesky', 
             'QR Factorization', 
             'SVD', 
             'Scipy lstsq']

name2func = {'Normal Eqns- Naive': 'ls_naive', 
             'Normal Eqns- Cholesky': 'ls_chol', 
             'QR Factorization': 'ls_qr',
             'SVD': 'ls_svd',
             'Scipy lstsq': 'scipylstq'}

m_array = np.array([100, 1000, 10000])
n_array = np.array([20, 100, 1000])

index = pd.MultiIndex.from_product([m_array, n_array], names=['# rows', '# cols'])

pd.options.display.float_format = '{:,.6f}'.format
df = pd.DataFrame(index=row_names, columns=index)
df_error = pd.DataFrame(index=row_names, columns=index)

# %%prun
for m in m_array:
    for n in n_array:
        if m >= n:        
            x = np.random.uniform(-10,10,n)
            A = np.random.uniform(-40,40,[m,n])   # removed np.asfortranarray
            b = np.matmul(A, x) + np.random.normal(0,2,m)
            for name in row_names:
                fcn = name2func[name]
                t = timeit.timeit(fcn + '(A,b)', number=5, globals=globals())
                df.set_value(name, (m,n), t)
                coeffs = locals()[fcn](A, b)
                reg_met = regr_metrics(b, A @ coeffs)
                df_error.set_value(name, (m,n), reg_met[0])

df

# rows	100			1000			10000
# cols	20	100	1000	20	100	1000	20	100	1000
Normal Eqns- Naive	0.001276	0.003634	NaN	0.000960	0.005172	0.293126	0.002226	0.021248	1.164655
Normal Eqns- Cholesky	0.001660	0.003958	NaN	0.001665	0.004007	0.093696	0.001928	0.010456	0.399464
QR Factorization	0.002174	0.006486	NaN	0.004235	0.017773	0.213232	0.019229	0.116122	2.208129
SVD	0.003880	0.021737	NaN	0.004672	0.026950	1.280490	0.018138	0.130652	3.433003
Scipy lstsq	0.004338	0.020198	NaN	0.004320	0.021199	1.083804	0.012200	0.088467	2.134780

df_error

# rows	100			1000			10000
# cols	20	100	1000	20	100	1000	20	100	1000
Normal Eqns- Naive	1.702742	0.000000	NaN	1.970767	1.904873	0.000000	1.978383	1.980449	1.884440
Normal Eqns- Cholesky	1.702742	0.000000	NaN	1.970767	1.904873	0.000000	1.978383	1.980449	1.884440
QR Factorization	1.702742	0.000000	NaN	1.970767	1.904873	0.000000	1.978383	1.980449	1.884440
SVD	1.702742	0.000000	NaN	1.970767	1.904873	0.000000	1.978383	1.980449	1.884440
Scipy lstsq	1.702742	0.000000	NaN	1.970767	1.904873	0.000000	1.978383	1.980449	1.884440

store = pd.HDFStore('least_squares_results.h5')
store['df'] = df

'''
C:\Users\rache\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:2881: PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->floating,key->block0_values] [items->[(100, 20), (100, 100), (100, 1000), (1000, 20), (1000, 100), (1000, 1000), (5000, 20), (5000, 100), (5000, 1000)]]

  exec(code_obj, self.user_global_ns, self.user_ns)
'''

注解

我用魔术指令%prun来测量我的代码。

替代方案：最小绝对偏差（L1 回归）

异常值的敏感度低于最小二乘法。
没有闭式解，但可以通过线性规划解决。

条件作用和稳定性

条件数

条件数是一个指标，衡量输入的小变化导致输出变化的程度。

问题：为什么我们在数值线性代数中，关心输入的小变化的相关行为？

相对条件数由下式定义：

$\kappa = \sup_{\delta x} \frac{|\delta f|}{| f(x) |}\bigg/ \frac{| \delta x |}{| x |}$

其中 $\delta x$ 是无穷小。

根据 Trefethen（第 91 页），如果κ很小（例如 1, 10, 10^2），问题是良态的，如果κ很大（例如10^6, 10^16），那么问题是病态的。

条件作用：数学问题的扰动行为（例如最小二乘）

稳定性：用于在计算机上解决该问题的算法的扰动行为（例如，最小二乘算法，householder，回代，高斯消除）

条件作用的例子

计算非对称矩阵的特征值的问题通常是病态的。

A = [[1, 1000], [0, 1]]
B = [[1, 1000], [0.001, 1]]

wA, vrA = scipy.linalg.eig(A)
wB, vrB = scipy.linalg.eig(B)

wA, wB

'''
(array([ 1.+0.j,  1.+0.j]),
 array([  2.00000000e+00+0.j,  -2.22044605e-16+0.j]))
'''

矩阵的条件数

乘积 $| A| |A^{-1} |$ 经常出现，它有自己的名字：A的条件数。注意，通常我们谈论问题的条件作用，而不是矩阵。

A的条件数涉及：

给定Ax = b中的A和x，计算b
给定Ax = b中的A和b，计算x

未交待清楚的事情

完整和简化分解

SVD

来自 Trefethen 的图：

对于所有矩阵，QR 分解都存在

与 SVD 一样，有 QR 分解的完整版和简化版。

矩阵的逆是不稳定的

from scipy.linalg import hilbert

n = 5
hilbert(n)

'''
array([[ 1.    ,  0.5   ,  0.3333,  0.25  ,  0.2   ],
       [ 0.5   ,  0.3333,  0.25  ,  0.2   ,  0.1667],
       [ 0.3333,  0.25  ,  0.2   ,  0.1667,  0.1429],
       [ 0.25  ,  0.2   ,  0.1667,  0.1429,  0.125 ],
       [ 0.2   ,  0.1667,  0.1429,  0.125 ,  0.1111]])
'''

n = 14
A = hilbert(n)
x = np.random.uniform(-10,10,n)
b = A @ x

A_inv = np.linalg.inv(A)

np.linalg.norm(np.eye(n) - A @ A_inv)

# 5.0516495470543212

np.linalg.cond(A)

# 2.2271635826494112e+17

A @ A_inv

'''
array([[ 1.    ,  0.    , -0.0001,  0.0005, -0.0006,  0.0105, -0.0243,
         0.1862, -0.6351,  2.2005, -0.8729,  0.8925, -0.0032, -0.0106],
       [ 0.    ,  1.    , -0.    ,  0.    ,  0.0035,  0.0097, -0.0408,
         0.0773, -0.0524,  1.6926, -0.7776, -0.111 , -0.0403, -0.0184],
       [ 0.    ,  0.    ,  1.    ,  0.0002,  0.0017,  0.0127, -0.0273,
         0.    ,  0.    ,  1.4688, -0.5312,  0.2812,  0.0117,  0.0264],
       [ 0.    ,  0.    , -0.    ,  1.0005,  0.0013,  0.0098, -0.0225,
         0.1555, -0.0168,  1.1571, -0.9656, -0.0391,  0.018 , -0.0259],
       [-0.    ,  0.    , -0.    ,  0.0007,  1.0001,  0.0154,  0.011 ,
        -0.2319,  0.5651, -0.2017,  0.2933, -0.6565,  0.2835, -0.0482],
       [ 0.    , -0.    ,  0.    , -0.0004,  0.0059,  0.9945, -0.0078,
        -0.0018, -0.0066,  1.1839, -0.9919,  0.2144, -0.1866,  0.0187],
       [-0.    ,  0.    , -0.    ,  0.0009, -0.002 ,  0.0266,  0.974 ,
        -0.146 ,  0.1883, -0.2966,  0.4267, -0.8857,  0.2265, -0.0453],
       [ 0.    ,  0.    , -0.    ,  0.0002,  0.0009,  0.0197, -0.0435,
         1.1372, -0.0692,  0.7691, -1.233 ,  0.1159, -0.1766, -0.0033],
       [ 0.    ,  0.    , -0.    ,  0.0002,  0.    , -0.0018, -0.0136,
         0.1332,  0.945 ,  0.3652, -0.2478, -0.1682,  0.0756, -0.0212],
       [ 0.    , -0.    , -0.    ,  0.0003,  0.0038, -0.0007,  0.0318,
        -0.0738,  0.2245,  1.2023, -0.2623, -0.2783,  0.0486, -0.0093],
       [-0.    ,  0.    , -0.    ,  0.0004, -0.0006,  0.013 , -0.0415,
         0.0292, -0.0371,  0.169 ,  1.0715, -0.09  ,  0.1668, -0.0197],
       [ 0.    , -0.    ,  0.    ,  0.    ,  0.0016,  0.0062, -0.0504,
         0.1476, -0.2341,  0.8454, -0.7907,  1.4812, -0.15  ,  0.0186],
       [ 0.    , -0.    ,  0.    , -0.0001,  0.0022,  0.0034, -0.0296,
         0.0944, -0.1833,  0.6901, -0.6526,  0.2556,  0.8563,  0.0128],
       [ 0.    ,  0.    ,  0.    , -0.0001,  0.0018, -0.0041, -0.0057,
        -0.0374, -0.165 ,  0.3968, -0.2264, -0.1538, -0.0076,  1.005 ]])
'''

row_names = ['Normal Eqns- Naive',
             'QR Factorization', 
             'SVD', 
             'Scipy lstsq']

name2func = {'Normal Eqns- Naive': 'ls_naive', 
             'QR Factorization': 'ls_qr',
             'SVD': 'ls_svd',
             'Scipy lstsq': 'scipylstq'}

pd.options.display.float_format = '{:,.9f}'.format
df = pd.DataFrame(index=row_names, columns=['Time', 'Error'])

for name in row_names:
    fcn = name2func[name]
    t = timeit.timeit(fcn + '(A,b)', number=5, globals=globals())
    coeffs = locals()[fcn](A, b)
    df.set_value(name, 'Time', t)
    df.set_value(name, 'Error', regr_metrics(b, A @ coeffs)[0])

SVD 在这里最好

不要重新运行。

df

	Time	Error
Normal Eqns- Naive	0.001334339	3.598901966
QR Factorization	0.002166139	0.000000000
SVD	0.001556937	0.000000000
Scipy lstsq	0.001871590	0.000000000

即使A是稀疏的， $A^{-1}$ 通常是密集的。对于大型矩阵， $A^{-1}$ 放不进内存。

运行时间

矩阵求逆：
矩阵乘法：
Cholesky：
QR，Gram Schmidt：， $m \ge n$ （Trefethen 第 8 章）
QR，Householder：2（Trefethen 第 10 章）
求解三角形系统：

为什么 Cholesky 较快：

来源：斯坦福凸优化：数值线性代数背景幻灯片

QR 最优的一个案例

m=100
n=15
t=np.linspace(0, 1, m)

# 范德蒙矩阵
A=np.stack([t**i for i in range(n)], 1)
b=np.exp(np.sin(4*t))

# 这将使解决方案标准化为 1
b /= 2006.787453080206

from matplotlib import pyplot as plt
%matplotlib inline

plt.plot(t, b)

# [<matplotlib.lines.Line2D at 0x7fdfc1fa7eb8>]

检查我们得到了 1：

1 - ls_qr(A, b)[14]

# 1.4137685733217609e-07

不好的条件数：

kappa = np.linalg.cond(A); kappa

# 5.827807196683593e+17

row_names = ['Normal Eqns- Naive',
             'QR Factorization', 
             'SVD', 
             'Scipy lstsq']

name2func = {'Normal Eqns- Naive': 'ls_naive', 
             'QR Factorization': 'ls_qr',
             'SVD': 'ls_svd',
             'Scipy lstsq': 'scipylstq'}

pd.options.display.float_format = '{:,.9f}'.format
df = pd.DataFrame(index=row_names, columns=['Time', 'Error'])

for name in row_names:
    fcn = name2func[name]
    t = timeit.timeit(fcn + '(A,b)', number=5, globals=globals())
    coeffs = locals()[fcn](A, b)
    df.set_value(name, 'Time', t)
    df.set_value(name, 'Error', np.abs(1 - coeffs[-1]))

df

	Time	Error
Normal Eqns- Naive	0.001565099	1.357066025
QR Factorization	0.002632104	0.000000116
SVD	0.003503785	0.000000116
Scipy lstsq	0.002763502	0.000000116

通过正规方程求解最小二乘的解决方案通常是不稳定的，尽管对于小条件数的问题是稳定的。

低秩

m = 100
n = 10
x = np.random.uniform(-10,10,n)
A2 = np.random.uniform(-40,40, [m, int(n/2)])   # removed np.asfortranarray
A = np.hstack([A2, A2])

A.shape, A2.shape

# ((100, 10), (100, 5))

b = A @ x + np.random.normal(0,1,m)

row_names = ['Normal Eqns- Naive',
             'QR Factorization', 
             'SVD', 
             'Scipy lstsq']

name2func = {'Normal Eqns- Naive': 'ls_naive', 
             'QR Factorization': 'ls_qr',
             'SVD': 'ls_svd',
             'Scipy lstsq': 'scipylstq'}

pd.options.display.float_format = '{:,.9f}'.format
df = pd.DataFrame(index=row_names, columns=['Time', 'Error'])

for name in row_names:
    fcn = name2func[name]
    t = timeit.timeit(fcn + '(A,b)', number=5, globals=globals())
    coeffs = locals()[fcn](A, b)
    df.set_value(name, 'Time', t)
    df.set_value(name, 'Error', regr_metrics(b, A @ coeffs)[0])

df

	Time	Error
Normal Eqns- Naive	0.001227640	300.658979382
QR Factorization	0.002315920	0.876019803
SVD	0.001745647	1.584746056
Scipy lstsq	0.002067989	0.804750398

比较

比较我们的结果和上面：

df

# rows	100			1000			10000
# cols	20	100	1000	20	100	1000	20	100	1000
Normal Eqns- Naive	0.001276	0.003634	NaN	0.000960	0.005172	0.293126	0.002226	0.021248	1.164655
Normal Eqns- Cholesky	0.001660	0.003958	NaN	0.001665	0.004007	0.093696	0.001928	0.010456	0.399464
QR Factorization	0.002174	0.006486	NaN	0.004235	0.017773	0.213232	0.019229	0.116122	2.208129
SVD	0.003880	0.021737	NaN	0.004672	0.026950	1.280490	0.018138	0.130652	3.433003
Scipy lstsq	0.004338	0.020198	NaN	0.004320	0.021199	1.083804	0.012200	0.088467	2.134780

来自 Trefethen（第 84 页）：

正规方程式/ Cholesky 在生效时速度最快。 Cholesky 只能用于对称正定矩阵。此外，对于具有高条件数或具有低秩的矩阵，正规方程/ Cholesky 是不稳定的。

数值分析师推荐通过 QR 进行线性回归，作为多年的标准方法。它自然，优雅，适合“日常使用”。

九、PageRank 和特征值分解

两个方便的技巧

以下是我们今天将使用的两个工具，它们通常很有用。

1）Psutil 是检查内存使用情况的好方法。这在这里很有用，因为我们使用更大的数据集。

import psutil

process = psutil.Process(os.getpid())
t = process.memory_info()

t.vms, t.rss

# (19475513344, 17856520192)

def mem_usage():
    process = psutil.Process(os.getpid())
    return process.memory_info().rss / psutil.virtual_memory().total

mem_usage()

# 0.13217061955758594

2）TQDM 提供了进度条。

from time import sleep

# 不使用 TQDM
s = 0
for i in range(10):
    s += i
    sleep(0.2)
print(s)

# 45

# 使用 TQDM
from tqdm import tqdm

s = 0
for i in tqdm(range(10)):
    s += i
    sleep(0.2)
print(s)

'''
100%|██████████| 10/10 [00:02<00:00,  4.96it/s]
45
'''

动机

回顾

什么是S VD？
SVD 有哪些应用？

额外的 SVD 应用

我最近遇到的一个有趣的 SVD 用法是为 Word2Vec 词嵌入消除偏见，来自词嵌入中的量化和减少刻板印象（Bolukbasi 等）。

Word2Vec 是 Google 发布的一个有用的库，它将单词表示为向量。向量的相似性捕获了语义，并且可以找到类比，例如巴黎：法国::东京：日本。

来源：词的向量表示

然而，这些嵌入可能隐式编码偏见，例如父亲：医生::母亲：护士，和男人：计算机程序员::女人：家庭主妇。

一种为空间消除偏见的方法涉及使用 SVD 来降低维数（Bolukbasi 论文）。

你可以在词嵌入中阅读关于偏见的更多信息：

向量空间数学如何揭示隐藏在语言中的性别歧视（麻省理工学院技术评论）
ConceptNet：更好，更少刻板印象的词向量
从语料库中自动导出的语义必然包含人类偏见（优秀且非常有趣的论文！）

考虑 SVD 的方法

数据压缩
SVD 使用更小更好的特征替换大量特征
所有矩阵都是对角的（如果你使用基于域和范围的更改）

SVD 的视角

我们通常从矩阵的角度谈论SVD，

$A = U \Sigma V^T$

但我们也可以用向量来思考它。 SVD 为我们提供了一组正交向量和。

$A v_j = \sigma_j u_j$

是标量，称为奇异值。

问题：这个让你想起了什么？

答案

SVD 与特征分解之间的关系：A的左奇异向量是的特征向量。A的右奇异向量是的特征向量。 A的非零奇异值是（和）的特征值的平方根。

SVD 是特征值分解的推广。并非所有矩阵都具有特征值，但所有矩阵都具有奇异值。

让我们忘记 SVD，并讨论如何找到对称正定矩阵的特征值......

SVD 的扩展资源

特征值分解

用于计算 SVD 的最佳经典方法，是用于计算特征值的方法的变体。除了它们与 SVD 的链接之外，特征值分解也是有用的。以下是特征值分解的一些实际应用：

快速矩阵求幂
第 n 个斐波纳契数
ODE 的行为
马尔科夫链（医疗保健经济学，PageRank）
Iris 数据集的线性判别分析

查看 3 Blue 1 Brown 视频：基变换和特征值和特征向量。

“特征值是一种了解矩阵核心的方式......矩阵的所有困难都被扫除了”--Strang

术语：埃尔米特矩阵是共轭转置等于自己的矩阵。在实值矩阵的情况下（这是我们在本课程中考虑的所有矩阵），埃尔米特的意思与对称相同。

DBpedia 数据集

让我们从幂方法开始，它找到一个特征向量。只是一个特征向量有什么用？你可能想知道。这实际上是 PageRank 的基础（阅读价值 25,000,000,000 美元的Eigenvector：Google 背后的线性代数来了解更多信息）

我们将使用来自 DBpedia 的维基百科链接数据集，而不是试图排名互联网上所有网站的重要性。 DBpedia 提供 125 种语言的结构化维基百科数据。

“完整的 DBpedia 数据集包含用于 125 种不同语言的 3800 万个标签和摘要，2520 万个图像链接和 2980 万个外部网页的链接；8090 万个维基百科类别的链接，4120 万个 YAGO 类别的链接” - 关于 DBpedia

今天的课程灵感来自这个 SciKit 学习示例。

导入

import os, numpy as np, pickle
from bz2 import BZ2File
from datetime import datetime
from pprint import pprint
from time import time
from tqdm import tqdm_notebook
from scipy import sparse

from sklearn.decomposition import randomized_svd
from sklearn.externals.joblib import Memory
from urllib.request import urlopen

下载数据

我们拥有的数据是：

重定向：重定向到其他 URL 的 URL
链接：哪些页面链接到哪些其他页面

注意：这需要一段时间。

PATH = 'data/dbpedia/'
URL_BASE = 'http://downloads.dbpedia.org/3.5.1/en/'
filenames = ["redirects_en.nt.bz2", "page_links_en.nt.bz2"]

for filename in filenames:
    if not os.path.exists(PATH+filename):
        print("Downloading '%s', please wait..." % filename)
        open(PATH+filename, 'wb').write(urlopen(URL_BASE+filename).read())

redirects_filename = PATH+filenames[0]
page_links_filename = PATH+filenames[1]

图的邻接矩阵

我们将构造一个图的邻接矩阵，表示哪个页面指向哪个页面。

来源：PageRank 和 HyperLink 生成的主题搜索

幂将为你提供，通过两步从一个页面到另一个页面有多少种方式。你可以在这些笔记中看到更详细的示例，如适用于航空旅行。

我们希望跟踪哪些页面指向哪些页面。我们将它存储在一个方形矩阵中，位置(r, c)为 1，表示行r中的主题指向列c中的主题

你可以在此处更加了解图。

数据格式

文件中的一样看起来：

<http://dbpedia.org/resource/AfghanistanHistory> <http://dbpedia.org/property/redirect> <http://dbpedia.org/resource/History_of_Afghanistan> .

在下面的切片中，+1, -1来删除<>。

DBPEDIA_RESOURCE_PREFIX_LEN = len("http://dbpedia.org/resource/")
SLICE = slice(DBPEDIA_RESOURCE_PREFIX_LEN + 1, -1)

def get_lines(filename): return (line.split() for line in BZ2File(filename))

遍历重定向并创建来源到目的地的字典。

def get_redirect(targ, redirects):
    seen = set()
    while True:
        transitive_targ = targ
        targ = redirects.get(targ)
        if targ is None or targ in seen: break
        seen.add(targ)
    return transitive_targ

def get_redirects(redirects_filename):
    redirects={}
    lines = get_lines(redirects_filename)
    return {src[SLICE]:get_redirect(targ[SLICE], redirects) 
                for src,_,targ,_ in tqdm_notebook(lines, leave=False)}

redirects = get_redirects(redirects_filename)

mem_usage()

# 13.766303744

def add_item(lst, redirects, index_map, item):
    k = item[SLICE]
    lst.append(index_map.setdefault(redirects.get(k, k), len(index_map)))

limit=119077682 #5000000

# 计算整数索引映射
index_map = dict() # links->IDs
lines = get_lines(page_links_filename)
source, destination, data = [],[],[]
for l, split in tqdm_notebook(enumerate(lines), total=limit):
    if l >= limit: break
    add_item(source, redirects, index_map, split[0])
    add_item(destination, redirects, index_map, split[2])
    data.append(1)

n=len(data); n

# 119077682

查看我们的数据

以下步骤仅用于说明我们的数据中的信息及其结构。他们效率不高。

让我们看看index_map中的项目类型：

index_map.popitem()

# (b'1940_Cincinnati_Reds_Team_Issue', 9991173)

让我们看一下索引映射中的一个项目：

1940_Cincinnati_Reds_Team_Issue具有索引9991173.，这仅在目标列表中显示一次：

[i for i,x in enumerate(source) if x == 9991173]

# [119077649]

source[119077649], destination[119077649]

# (9991173, 9991050)

现在，我们要检查哪个页面是源（具有索引 9991050）。注意：通常你不应通过搜索其值来访问字典。这是低效的，不是使用字典的方式。

for page_name, index in index_map.items():
    if index == 9991050:
        print(page_name)

# b'W711-2'

我们可以在维基百科上看到辛辛那提红队问题重定向到 W711-2：

test_inds = [i for i,x in enumerate(source) if x == 9991050]

len(test_inds)

# 47

test_inds[:5]

# [119076756, 119076757, 119076758, 119076759, 119076760]

test_dests = [destination[i] for i in test_inds]

现在，我们要检查哪个页面是源（具有索引 9991174）：

for page_name, index in index_map.items():
    if index in test_dests:
        print(page_name)

'''
b'Baseball'
b'Ohio'
b'Cincinnati'
b'Flash_Thompson'
b'1940'
b'1938'
b'Lonny_Frey'
b'Cincinnati_Reds'
b'Ernie_Lombardi'
b'Baseball_card'
b'James_Wilson'
b'Trading_card'
b'Detroit_Tigers'
b'Baseball_culture'
b'Frank_McCormick'
b'Bucky_Walters'
b'1940_World_Series'
b'Billy_Werber'
b'Ival_Goodman'
b'Harry_Craft'
b'Paul_Derringer'
b'Johnny_Vander_Meer'
b'Cigarette_card'
b'Eddie_Joost'
b'Myron_McCormick'
b'Beckett_Media'
b'Icarus_affair'
b'Ephemera'
b'Sports_card'
b'James_Turner'
b'Jimmy_Ripple'
b'Lewis_Riggs'
b'The_American_Card_Catalog'
b'Rookie_card'
b'Willard_Hershberger'
b'Elmer_Riddle'
b'Joseph_Beggs'
b'Witt_Guise'
b'Milburn_Shoffner'
'''

我们可以看到列表中的项目出现在维基百科页面中：

创建矩阵

下面我们使用 Scipy 的 COO 格式创建一个稀疏矩阵，并将其转换为 CSR。

问题：COO 和 CSR 是什么？为什么我们要用 COO 创建它然后马上转换它？

X = sparse.coo_matrix((data, (destination,source)), shape=(n,n), dtype=np.float32)
X = X.tocsr()

del(data,destination, source)

X

'''
<119077682x119077682 sparse matrix of type '<class 'numpy.float32'>'
    with 93985520 stored elements in Compressed Sparse Row format>
'''

names = {i: name for name, i in index_map.items()}

mem_usage()

# 12.903882752

保存矩阵以便不会重复计算

pickle.dump(X, open(PATH+'X.pkl', 'wb'))
pickle.dump(index_map, open(PATH+'index_map.pkl', 'wb'))

X = pickle.load(open(PATH+'X.pkl', 'rb'))
index_map = pickle.load(open(PATH+'index_map.pkl', 'rb'))

names = {i: name for name, i in index_map.items()}

X

'''
<119077682x119077682 sparse matrix of type '<class 'numpy.float32'>'
    with 93985520 stored elements in Compressed Sparse Row format>
'''

幂方法

动机

如果n×n矩阵A具有n个线性独立的特征向量 $v_1,, \ldots v_n$ ，则它是可对角化的。

那么对于一些标量，任何w都可以表示为 $w = \sum_{j=1}^n c_j v_j$ 。

练习：证明：

$A^k w = \sum_{j=1}^n c_j \lambda_j^k v_j$

问题：对于较大的k，表现如何？

这是幂方法的灵感来源。

代码

def show_ex(v):
    print(', '.join(names[i].decode() for i in np.abs(v.squeeze()).argsort()[-1:-10:-1]))

?np.squeeze

如何规范化稀疏矩阵：

S = sparse.csr_matrix(np.array([[1,2],[3,4]]))
Sr = S.sum(axis=0).A1; Sr

# array([4, 6], dtype=int64)

numpy.matrix.A1。

S.indices

# array([0, 1, 0, 1], dtype=int32)

S.data

# array([1, 2, 3, 4], dtype=int64)

S.data / np.take(Sr, S.indices)

# array([ 0.25   ,  0.33333,  0.75   ,  0.66667])

def power_method(A, max_iter=100):
    n = A.shape[1]
    A = np.copy(A)
    A.data /= np.take(A.sum(axis=0).A1, A.indices)

    scores = np.ones(n, dtype=np.float32) * np.sqrt(A.sum()/(n*n)) # initial guess
    for i in range(max_iter):
        scores = A @ scores
        nrm = np.linalg.norm(scores)
        scores /= nrm
        print(nrm)

    return scores

问题：为什么要对每次迭代的得分标准化？

scores = power_method(X, max_iter=10)

'''
0.621209
0.856139
1.02793
1.02029
1.02645
1.02504
1.02364
1.02126
1.019
1.01679
'''

show_ex(scores)

'''
Living_people, Year_of_birth_missing_%28living_people%29, United_States, United_Kingdom, Race_and_ethnicity_in_the_United_States_Census, France, Year_of_birth_missing, World_War_II, Germany
'''

mem_usage()

# 11.692331008

在实践中使用的许多高级特征值算法是幂方法的变体。

在第 3 课：背景消除中，我们使用了 Facebook 的快速随机 pca/svd 库 fbpca。查看我们使用的 pca 方法的源代码。它使用的是幂方法！

深入研究

查看 Google 网页排名，幂迭代和 Google Matrix 的第二个特征值，观看它收敛过程中的分布的动画。

幂方法的收敛速度是最大特征值与第二大特征值的比率。它可以通过添加移位来加速。为了找到除最大值之外的特征值，可以使用称为 deflation 的方法。详细信息，请参阅 Greenbaum 和 Chartier 的第 12.1 章。

Krylov 子空间：在幂迭代中，注意我们每次乘以矩阵A，有效地计算

$Ab,,A2b,,A3b,,A^4b, \ldots$

将这些向量作为列的矩阵称为 Krylov 矩阵，由这些向量跨越的空间是 Krylov 子空间。请记住这些内容以便之后使用。

与 SVD 比较

%time U, s, V = randomized_svd(X, 3, n_iter=3)

'''
CPU times: user 8min 40s, sys: 1min 20s, total: 10min 1s
Wall time: 5min 56s
'''

mem_usage()

# 28.353073152

# 根据主要奇异向量的顶级维基百科页面
show_ex(U.T[0])

# List_of_World_War_II_air_aces, List_of_animated_feature-length_films, List_of_animated_feature_films:2000s, International_Swimming_Hall_of_Fame, List_of_museum_ships, List_of_linguists, List_of_television_programs_by_episode_count, List_of_game_show_hosts, List_of_astronomers

show_ex(U.T[1])

# List_of_former_United_States_senators, List_of_United_States_Representatives_from_New_York, List_of_United_States_Representatives_from_Pennsylvania, Members_of_the_110th_United_States_Congress, List_of_Justices_of_the_Ohio_Supreme_Court, Congressional_endorsements_for_the_2008_United_States_presidential_election, List_of_United_States_Representatives_in_the_110th_Congress_by_seniority, List_of_Members_of_the_United_States_House_of_Representatives_in_the_103rd_Congress_by_seniority, List_of_United_States_Representatives_in_the_111th_Congress_by_seniority

show_ex(V[0])

# United_States, Japan, United_Kingdom, Germany, Race_and_ethnicity_in_the_United_States_Census, France, United_States_Army_Air_Forces, Australia, Canada

show_ex(V[1])

# Democratic_Party_%28United_States%29, Republican_Party_%28United_States%29, Democratic-Republican_Party, United_States, Whig_Party_%28United_States%29, Federalist_Party, National_Republican_Party, Catholic_Church, Jacksonian_democracy

练习：以各种方式规范化数据。不要覆盖邻接矩阵，而是创建一个新矩阵。了解你的结果有何不同。

特征值分解与 SVD：

SVD 涉及 2 个基，特征分解涉及 1 个基
SVD 基是正交的，特征基通常不是正交的
所有矩阵都有 SVD，并非所有矩阵（甚至不是所有的矩阵）都有一个特征值分解。

QR 算法

我们使用幂方法，找到对应维基百科链接矩阵的最大特征值的特征向量。这个特征向量给了我们每个维基百科页面的相对重要性（就像简化的 PageRank）。

接下来，让我们看一个方法，查找对称正定矩阵的所有特征值。该方法包括数值线性代数中的 2 种基本算法，并且是许多更复杂方法的基础。

Google Matrix 的第二个特征值：具有“implications for the convergence rate of the standard PageRank algorithm as the web scales, for the stability of PageRank to perturbations to the link structure of the web, for the detection of Google spammers, and for the design of algorithms to speed up PageRank”。

避免混淆：QR 算法和 QR 分解

QR 算法使用称为 QR 分解的东西。两者都很重要，所以不要混淆他们。 QR 分解将矩阵A = QR分解为一组正交列Q和三角矩阵R。我们将在未来的课程中查看几种计算 QR 分解的方法。现在，只要知道它给了我们一个正交矩阵和一个三角矩阵。

线性代数

如果存在非奇异矩阵X：

$B = X^{-1}AX$

则两个矩阵A和B是相似的。

观看这个：基变换

定理：如果X是非奇异的，那么A和 $X^{-1}AX$ 具有相同的特征值。

其他资源

回顾：线性组合，跨度和基向量。

上述定理的证明（以及更多！）请参阅第 24 讲。

算法

QR 算法的最基本版本：

for k=1,2,...
    Q, R = A
    A = R @ Q

在适当的假设下，该算法收敛于A的 Schur 形式！

工作原理

再写一次，只有下标：

对于 $k=1,2,\ldots$

$Q_k, R_k = A_{k-1} \ A_k = R_kQ_k$

我们可以将其视为构建，和的序列。

$A_k = Q_k , R_k \ Q_k^{-1} , A_k = R_k$

所以：

$R_k Q_k = Q_k^{-1} , A_k , Q_k \ A_k = Q_k^{-1} \ldots Q_2^{-1} Q_1^{-1} A Q_1 Q_2 \dots Q_k$

Trefethen 在第 216-217 页证明了以下内容：

$A^k = Q_1 Q_2 \dots Q_k R_k R_{k-1}\dots R_1$

要点：QR 算法为连续的幂构造的正交基。记住A的幂与特征值分解之间的密切关系。

要了解更多信息，请阅读瑞利熵。

纯 QR

n = 6
A = np.random.rand(n,n)
AT = A @ A.T

def pure_qr(A, max_iter=1000):
    Ak = np.copy(A)
    n = A.shape[0]
    QQ = np.eye(n)
    for k in range(max_iter):
        Q, R = np.linalg.qr(Ak)
        Ak = R @ Q
        QQ = QQ @ Q
        if k % 100 == 0:
            print(Ak)
            print("\n")
    return Ak, QQ

A

'''
array([[ 0.62878,  0.23258,  0.63909,  0.90223,  0.94772,  0.80247],
       [ 0.64361,  0.52469,  0.92231,  0.32869,  0.58532,  0.75104],
       [ 0.44363,  0.00427,  0.62418,  0.47093,  0.6762 ,  0.28078],
       [ 0.14629,  0.76324,  0.23316,  0.55208,  0.21712,  0.20255],
       [ 0.56122,  0.08282,  0.12788,  0.10419,  0.40358,  0.69272],
       [ 0.41172,  0.06411,  0.92162,  0.53139,  0.27901,  0.61592]])
'''

Ak, Q = pure_qr(A)

'''
[[ 2.65646  0.21386  0.16765  0.75256  0.61251  0.93518]
 [ 0.52744  0.47579  0.17052 -0.41086 -0.21182 -0.01876]
 [ 0.29923  0.06964  0.11173  0.1879  -0.29101  0.60032]
 [ 0.2274   0.46162 -0.26654  0.08899  0.24631  0.26447]
 [-0.06093  0.02892  0.34162  0.07533  0.02393 -0.05456]
 [-0.06025  0.02694 -0.11675 -0.00927 -0.11939 -0.00767]]


[[ 2.78023  0.52642  0.0395  -0.11135  0.1569   1.15184]
 [ 0.       0.18624 -0.297   -0.07256 -0.04537  0.27907]
 [ 0.       0.69328  0.34105 -0.12198  0.11029  0.0992 ]
 [-0.      -0.0494  -0.02057  0.09461  0.59466  0.09115]
 [-0.       0.00008 -0.02659 -0.40372  0.06542  0.38612]
 [-0.       0.       0.       0.      -0.      -0.11832]]


[[ 2.78023 -0.12185 -0.51401  0.17625 -0.07467  1.15184]
 [ 0.       0.2117  -0.70351  0.09974 -0.02986  0.00172]
 [ 0.       0.28284  0.32635 -0.0847  -0.08488 -0.29104]
 [-0.      -0.00068 -0.00088 -0.01282  0.54836  0.13447]
 [-0.       0.      -0.00102 -0.45718  0.16208 -0.37726]
 [-0.       0.       0.       0.      -0.      -0.11832]]


[[ 2.78023 -0.33997  0.4049   0.17949  0.06291  1.15184]
 [ 0.       0.48719 -0.48788 -0.05831 -0.12286 -0.23486]
 [ 0.       0.49874  0.05104 -0.07191  0.03638  0.17261]
 [ 0.       0.00002  0.       0.02128  0.41958  0.3531 ]
 [ 0.      -0.       0.00002 -0.58571  0.1278  -0.18838]
 [ 0.       0.       0.      -0.       0.      -0.11832]]


[[ 2.78023  0.35761  0.38941  0.07462  0.17493  1.15184]
 [ 0.       0.0597  -0.55441 -0.0681  -0.04456  0.14084]
 [ 0.       0.43221  0.47853 -0.06068  0.12117  0.25519]
 [-0.      -0.      -0.       0.16206  0.45708  0.37724]
 [-0.       0.      -0.      -0.54821 -0.01298  0.1336 ]
 [ 0.       0.       0.       0.      -0.      -0.11832]]


[[ 2.78023  0.06853 -0.52424  0.05224 -0.18287  1.15184]
 [ 0.       0.36572 -0.6889   0.07864 -0.09263  0.105  ]
 [ 0.       0.29772  0.17251 -0.09836 -0.02347 -0.27191]
 [ 0.      -0.      -0.       0.13719  0.57888 -0.20884]
 [ 0.       0.      -0.      -0.42642  0.01189 -0.34139]
 [ 0.       0.       0.       0.      -0.      -0.11832]]


[[ 2.78023 -0.52782  0.03045 -0.14389 -0.12436  1.15184]
 [ 0.       0.25091 -0.27593  0.08994 -0.06581 -0.28672]
 [ 0.       0.7107   0.28732  0.10154  0.04751 -0.05245]
 [ 0.      -0.      -0.       0.0297  -0.59054 -0.01234]
 [ 0.      -0.       0.       0.41475  0.11938 -0.40001]
 [ 0.       0.       0.       0.       0.      -0.11832]]


[[ 2.78023  0.1533   0.50599  0.18983  0.01158  1.15184]
 [ 0.       0.18627 -0.69511 -0.0991  -0.00189  0.01621]
 [ 0.       0.29151  0.35196  0.05638 -0.10949  0.29102]
 [ 0.      -0.      -0.      -0.02207 -0.48261  0.25246]
 [ 0.      -0.       0.       0.52268  0.17116  0.31053]
 [ 0.       0.       0.       0.       0.      -0.11832]]


[[ 2.78023  0.29683 -0.43751 -0.13852  0.13032  1.15184]
 [ 0.       0.48375 -0.53231 -0.01164  0.13482  0.216  ]
 [ 0.       0.45431  0.05448 -0.07972 -0.01795 -0.19571]
 [ 0.      -0.       0.       0.10042 -0.40743 -0.39915]
 [ 0.      -0.       0.       0.59786  0.04867 -0.02893]
 [ 0.       0.       0.       0.       0.      -0.11832]]


[[ 2.78023 -0.39373 -0.35284  0.00838 -0.19     1.15184]
 [ 0.       0.05184 -0.51278  0.05752 -0.0564  -0.16496]
 [ 0.       0.47384  0.48639  0.09426  0.09806 -0.24031]
 [ 0.      -0.      -0.       0.17043 -0.52593  0.30622]
 [ 0.      -0.       0.       0.47936 -0.02134 -0.25766]
 [ 0.       0.       0.       0.       0.      -0.11832]]
'''

让我们和特征值比较。

np.linalg.eigvals(A)

'''
array([ 2.78023+0.j     , -0.11832+0.j     ,  0.26911+0.44246j,
        0.26911-0.44246j,  0.07454+0.49287j,  0.07454-0.49287j])
'''

检查Q是正交的。

np.allclose(np.eye(n), Q @ Q.T), np.allclose(np.eye(n), Q.T @ Q)

# (True, True)

非常非常慢。

实际的 QR（带移位的 QR）

不是将分解为，

1）获取 QR 分解：

2）设置：

$A_{k+1} = R_k Q_k + s_k I$

选择来近似A的特征值。我们将使用。

在数值线性代数（包括幂方法，逆迭代和瑞利商迭代）中的许多算法中都出现了添加移位以加速收敛的想法。

作业：向 QR 算法添加移位。

# 练习：向 QR 算法添加移位
# 练习：def practical_qr(A, iters=10):
# 练习：    return Ak, Q

Ak, Q = practical_qr(A, 10)

'''
[ 5.16264  2.53603  0.31923  0.35315  0.97569  0.43615]
[ 7.99381  0.05922  0.34478  0.29482  0.79026  0.29999]
[ 8.00491  0.04358  0.89735  0.26386  0.26182  0.31135]
[ 8.00493  0.13648  0.91881  0.14839  0.24313  0.33115]
[ 8.00493  0.43377  0.62809  0.13429  0.24592  0.33589]
[ 8.00493  0.81058  0.25128  0.13297  0.24722  0.3359 ]
[ 8.00493  0.98945  0.07221  0.13292  0.24747  0.3359 ]
[ 8.00493  1.0366   0.02497  0.13296  0.24751  0.3359 ]
[ 8.00493  1.04688  0.01465  0.13299  0.24752  0.3359 ]
[ 8.00493  1.04902  0.0125   0.13301  0.24753  0.3359 ]
'''

检查Q是正交的。

np.allclose(np.eye(n), Q @ Q.T), np.allclose(np.eye(n), Q.T @ Q)

# (True, True)

让我们和特征值比较。

np.linalg.eigvals(A)

'''
array([ 2.68500+0.j     ,  0.19274+0.41647j,  0.19274-0.41647j,
       -0.35988+0.43753j, -0.35988-0.43753j, -0.18346+0.j     ])
'''

问题：这比未移位的版本更好（它甚至不能保证收敛），但仍然很慢！事实上，它是O(n^4)，这是非常糟糕的。

在对称矩阵的情况下，它是O(n^3)

但是，如果从海森堡矩阵（第一个子对角线下面是零）开始，它会更快：O(n^3)，如果是对称的则是O(n^2)。

两阶段方法

在实践中，使用两阶段方法来找到特征值：

将化简缩减为海森堡形式（第一个子对角线下方是零）
让海森堡矩阵收敛到三角矩阵的迭代过程。三角矩阵的特征值是对角线上的值，所以我们完成了！

来源：Trefethen，第 25 讲

在埃尔米特矩阵的情况下，这种方法甚至更快，因为中间步骤也是埃尔米特（并且埃尔米特海森堡矩阵是三对角的）。

来源：Trefethen，第 25 讲

阶段 1 在有限步骤中达到精确解，而阶段 2 理论上从不会达到精确解。

我们已经完成了第 2 步：QR 算法。请记住，只使用 QR 算法是可能的，但极其慢。

Arnoldi 迭代

我们可以将 Arnoldi 迭代用于阶段 1（以及 QR 算法用于阶段 2）。

初始化

import numpy as np
n = 5
A0 = np.random.rand(n,n)  #.astype(np.float64)
A = A0 @ A0.T

np.set_printoptions(precision=5, suppress=True)

线性代数回顾：投影

当向量b投影到直线a上时，其投影p是b沿着直线a的一部分。

让我们看看沉浸式线性代数在线版的第 3.2.2 节：投影的交互图。

来源：沉浸式数学

以下是将向量投影到平面上的样子：

来源：最小二乘回归的线性代数视角

当向量b投影到直线a上时，其投影p是b沿着直线a的一部分。所以p是a的一些倍数。设 $\mathbf{p} = \hat{x}\mathbf{a}$ 其中 $\hat{x}$ 是标量。

正交性

投影的关键是正交性：从b到p的直线（可以写成 $\mathbf{b} - \hat{x}\mathbf{a}$ ）垂直于a。

这意味着：

$\mathbf{a} \cdot (\mathbf{b} - \hat{x}\mathbf{a}) = 0$

所以：

$\hat{x} = \frac{\mathbf{a} \cdot \mathbf{b}}{\mathbf{a} \cdot \mathbf{a}}$

算法

动机

我们想要Q中的正交列和海森堡矩阵H，使得AQ = QH。

迭代式思考它：

$A Q_n = Q_{n+1} H_n$

其中 $Q_{n + 1}$ 为 $n \times n + 1$ ，为 $n + 1 \times n$ 。这会创建一个可解决的递归关系。

来源：Trefethen 第 30 讲

Arnoldi 算法的伪代码

对于 Q 的第一列，以任意向量开始（标准化以便范数为 1）
for n=1,2,3...
    v = A @ nth col of Q
    for j=1,...n
        将 v 投影到 q_j，从 v 减去投影
        打算捕获 v 的一部分，它没有由 Q 的前一列跨越
        将系数储存在 H 中
    标准化 v, 之后使它为 Q 的第 (n+1) 列

请注意，我们将A乘以Q中的前一个向量，并删除与Q的现有列不正交的分量。

问题：A的重复乘法会让你想起什么吗？

答案：幂方法也包括A的迭代乘法！

Arnoldi 迭代如何工作

通过 Arnoldi 迭代，我们找到了 Krylov 子空间的标准正交基。 Krylov 矩阵：

$K = \left[b ; Ab ; A^2b ; \dots ; A^{n-1}b \right]$

具有 QR 分解：

这与在 Arnoldi 迭代中发现的Q相同。请注意，Arnoldi 迭代不会明确计算K或R。

直觉：K包含关于A的最大特征值的良好信息，并且 QR 分解通过一次剥离一个近似特征向量，来揭示该信息。

Arnoldi 迭代是两件事：

许多数值线性代数迭代算法的基础
一种寻找非厄米矩阵特征值的技术（Trefethen，第 257 页）

Arnoldi 如何定位特征值

执行 Arnoldi 迭代
使用 QR 算法定期计算海森堡矩阵H的特征值（称为 Arnoldi 估计值或 Ritz 值）
检查这些值是否收敛。如果是，它们可能是A的特征值。

实现

# 分解方阵 A @ Q ~= Q @ H
def arnoldi(A):
    m, n = A.shape
    assert(n <= m)
    
    # 海森堡矩阵
    H = np.zeros([n+1,n]) #, dtype=np.float64)
    # 正交列
    Q = np.zeros([m,n+1]) #, dtype=np.float64)
    # Q 的第 1 列是具有单位范数的随机列
    b = np.random.rand(m)
    Q[:,0] = b / np.linalg.norm(b)
    for j in range(n):
        v = A @ Q[:,j]
        for i in range(j+1):
            # 这来自 v 投影到 q 的公式。
            # 由于列 q 是正交的，q dot q = 1
            H[i,j] = np.dot(Q[:,i], v)
            v = v - (H[i,j] * Q[:,i])
        H[j+1,j] = np.linalg.norm(v)
        Q[:,j+1] = v / H[j+1,j]
        
        # 打印这个看到收敛，在实践中使用会很慢
        print(np.linalg.norm(A @ Q[:,:-1] - Q @ H))
    return Q[:,:-1], H[:-1,:]

Q, H = arnoldi(A)

'''
8.59112969133
4.45398729097
0.935693639985
3.36613943339
0.817740180293
'''

检查H是三对角的。

H

'''
array([[ 5.62746,  4.05085, -0.     ,  0.     , -0.     ],
       [ 4.05085,  3.07109,  0.33036,  0.     , -0.     ],
       [ 0.     ,  0.33036,  0.98297,  0.11172, -0.     ],
       [ 0.     ,  0.     ,  0.11172,  0.29777,  0.07923],
       [ 0.     ,  0.     ,  0.     ,  0.07923,  0.06034]])
'''

练习

编写代码来验证：

AQ = QH
Q是正交的

答案

# 练习：

# True

# 练习：

# True

一般案例：

一般矩阵：现在我们可以在我们的一般矩阵A（非对称）上执行此操作。在这种情况下，我们得到的是海森堡而不三对角。

Q0, H0 = arnoldi(A0)

'''
1.44287067354
1.06234006889
0.689291414367
0.918098818651
4.7124490411e-16
'''

检查H是海森堡的。

H0

'''
array([[ 1.67416,  0.83233, -0.39284,  0.10833,  0.63853],
       [ 1.64571,  1.16678, -0.54779,  0.50529,  0.28515],
       [ 0.     ,  0.16654, -0.22314,  0.08577, -0.02334],
       [ 0.     ,  0.     ,  0.79017,  0.11732,  0.58978],
       [ 0.     ,  0.     ,  0.     ,  0.43238, -0.07413]])
'''

np.allclose(A0 @ Q0, Q0 @ H0)

# True

np.allclose(np.eye(len(Q0)), Q0.T @ Q0), np.allclose(np.eye(len(Q0)), Q0 @ Q0.T)

# (True, True)

放到一起

def eigen(A, max_iter=20):
    Q, H = arnoldi(A)
    Ak, QQ = practical_qr(H, max_iter)
    U = Q @ QQ
    D = np.diag(Ak)
    return U, D

n = 10
A0 = np.random.rand(n,n)
A = A0 @ A0.T

U, D = eigen(A, 40)

'''
14.897422908
1.57451192745
1.4820012435
0.668164424736
0.438450319682
0.674050723258
1.19470880942
0.217103444634
0.105443975462
3.8162597576e-15
[ 27.34799   1.22613   1.29671   0.70253   0.49651   0.56779   0.60974
   0.70123   0.19209   0.04905]
[ 27.34981   1.85544   1.04793   0.49607   0.44505   0.7106    1.00724
   0.07293   0.16058   0.04411]
[ 27.34981   2.01074   0.96045   0.54926   0.61117   0.8972    0.53424
   0.19564   0.03712   0.04414]
[ 27.34981   2.04342   0.94444   0.61517   0.89717   0.80888   0.25402
   0.19737   0.03535   0.04414]
[ 27.34981   2.04998   0.94362   0.72142   1.04674   0.58643   0.21495
   0.19735   0.03534   0.04414]
[ 27.34981   2.05129   0.94496   0.90506   0.95536   0.49632   0.21015
   0.19732   0.03534   0.04414]
[ 27.34981   2.05156   0.94657   1.09452   0.79382   0.46723   0.20948
   0.1973    0.03534   0.04414]
[ 27.34981   2.05161   0.94863   1.1919    0.70539   0.45628   0.20939
   0.19728   0.03534   0.04414]
[ 27.34981   2.05162   0.95178   1.22253   0.67616   0.45174   0.20939
   0.19727   0.03534   0.04414]
[ 27.34981   2.05162   0.95697   1.22715   0.66828   0.44981   0.2094
   0.19725   0.03534   0.04414]
[ 27.34981   2.05162   0.96563   1.22124   0.66635   0.44899   0.20941
   0.19724   0.03534   0.04414]
[ 27.34981   2.05162   0.97969   1.20796   0.66592   0.44864   0.20942
   0.19723   0.03534   0.04414]
[ 27.34981   2.05162   1.00135   1.18652   0.66585   0.44849   0.20943
   0.19722   0.03534   0.04414]
[ 27.34981   2.05162   1.03207   1.15586   0.66584   0.44843   0.20943
   0.19722   0.03534   0.04414]
[ 27.34981   2.05162   1.07082   1.11714   0.66584   0.4484    0.20944
   0.19721   0.03534   0.04414]
[ 27.34981   2.05162   1.11307   1.07489   0.66585   0.44839   0.20944
   0.1972    0.03534   0.04414]
[ 27.34981   2.05162   1.15241   1.03556   0.66585   0.44839   0.20945
   0.1972    0.03534   0.04414]
[ 27.34981   2.05162   1.18401   1.00396   0.66585   0.44839   0.20945
   0.1972    0.03534   0.04414]
[ 27.34981   2.05162   1.20652   0.98145   0.66585   0.44839   0.20946
   0.19719   0.03534   0.04414]
[ 27.34981   2.05162   1.22121   0.96676   0.66585   0.44839   0.20946
   0.19719   0.03534   0.04414]
[ 27.34981   2.05162   1.23026   0.95771   0.66585   0.44839   0.20946
   0.19719   0.03534   0.04414]
[ 27.34981   2.05162   1.23563   0.95234   0.66585   0.44839   0.20946
   0.19718   0.03534   0.04414]
[ 27.34981   2.05162   1.23876   0.94921   0.66585   0.44839   0.20947
   0.19718   0.03534   0.04414]
[ 27.34981   2.05162   1.24056   0.94741   0.66585   0.44839   0.20947
   0.19718   0.03534   0.04414]
[ 27.34981   2.05162   1.24158   0.94639   0.66585   0.44839   0.20947
   0.19718   0.03534   0.04414]
[ 27.34981   2.05162   1.24216   0.94581   0.66585   0.44839   0.20947
   0.19718   0.03534   0.04414]
[ 27.34981   2.05162   1.24249   0.94548   0.66585   0.44839   0.20947
   0.19718   0.03534   0.04414]
[ 27.34981   2.05162   1.24268   0.94529   0.66585   0.44839   0.20947
   0.19718   0.03534   0.04414]
[ 27.34981   2.05162   1.24278   0.94519   0.66585   0.44839   0.20947
   0.19718   0.03534   0.04414]
[ 27.34981   2.05162   1.24284   0.94513   0.66585   0.44839   0.20947
   0.19718   0.03534   0.04414]
[ 27.34981   2.05162   1.24288   0.94509   0.66585   0.44839   0.20947
   0.19718   0.03534   0.04414]
[ 27.34981   2.05162   1.2429    0.94507   0.66585   0.44839   0.20947
   0.19717   0.03534   0.04414]
[ 27.34981   2.05162   1.24291   0.94506   0.66585   0.44839   0.20947
   0.19717   0.03534   0.04414]
[ 27.34981   2.05162   1.24291   0.94506   0.66585   0.44839   0.20947
   0.19717   0.03534   0.04414]
[ 27.34981   2.05162   1.24292   0.94505   0.66585   0.44839   0.20947
   0.19717   0.03534   0.04414]
[ 27.34981   2.05162   1.24292   0.94505   0.66585   0.44839   0.20947
   0.19717   0.03534   0.04414]
[ 27.34981   2.05162   1.24292   0.94505   0.66585   0.44839   0.20948
   0.19717   0.03534   0.04414]
[ 27.34981   2.05162   1.24292   0.94505   0.66585   0.44839   0.20948
   0.19717   0.03534   0.04414]
[ 27.34981   2.05162   1.24292   0.94505   0.66585   0.44839   0.20948
   0.19717   0.03534   0.04414]
[ 27.34981   2.05162   1.24292   0.94505   0.66585   0.44839   0.20948
   0.19717   0.03534   0.04414]
'''

D

'''
array([ 5.10503,  0.58805,  0.49071, -0.65174, -0.60231, -0.37664,
       -0.13165,  0.0778 , -0.10469, -0.29771])
np.linalg.eigvals(A)
array([ 27.34981,   2.05162,   1.24292,   0.94505,   0.66585,   0.44839,
         0.20948,   0.19717,   0.04414,   0.03534])
'''

np.linalg.norm(U @ np.diag(D) @ U.T - A)

# 0.0008321887107978883

np.allclose(U @ np.diag(D) @ U.T, A, atol=1e-3)

# True

扩展阅读

让我们找一些特征值！

来自非对称特征值问题章节：

注意，“直接”方法仍然必须迭代，因为找到特征值在数学上等同于找到多项式的零，因此不存在非迭代方法。如果经验表明它（几乎）永远不会在固定数量的迭代中收敛，我们就直接调用一个方法。

迭代方法通常仅对特征值和特征向量的子集提供近似，并且通常仅运行足够长的时间，来获得一些足够精确的特征值而不是大量的特征值。

我们的最终算法：带移位的海森堡 QR 算法。

阅读更多：

对称特征问题和 SVD
特征值问题的迭代方法：Rayleigh-Ritz 方法，Lanczos 算法

接下来

我们将来会编写自己的 QR 分解（两种不同的方式！），但首先我们将看到另一种可以使用 QR 分解的方法：计算线性回归。

最后

其它笔记

对称矩阵自然会带来：

距离矩阵
关系矩阵（Facebook 或 LinkedIn）
常微分方程
我们将研究正定矩阵，因为这可以保证所有特征值都是实的。

注意：在 NLA 令人困惑的语言中，QR 算法是直接的，因为你一次处理所有列。在其他数学/ CS 语言中，QR 算法是迭代的，因为它迭代地收敛并且永远不会达到精确的解。

结构正交化。在 NLA 语言中，Arnoldi 迭代被认为是一种迭代算法，因为你可以在某种程度上停止并完成一些列。

用于将矩阵转换为海森堡形式的 Gram-Schmidt 式迭代。

posted @ 2026-03-26 08:48 布客飞龙II 阅读(104) 评论(0) 收藏举报

刷新页面返回顶部

龙哥盟

人最大的痛苦就是说一些自己都不相信的话。

fast-ai-数值线性代数讲义中文版-v2-全-

fast.ai 数值线性代数讲义中文版 v2（全）

一、我们为什么在这里

为什么学习数值线性代数？

动机

矩阵计算

矩阵和张量积

矩阵和向量的积：

答案

矩阵和矩阵的积

答案

图像数据

卷积

矩阵分解

准确度

浮点数算术

练习

问题：数学是连续的或无限的，但计算机是离散的和有限的

机器ε

浮点运算的两个重要属性

历史

浮点运算的时间线历史

条件作用和稳定性

近似的准确度

随机算法

昂贵的错误

内存使用

稀疏 VS 密集

速度

计算复杂性

向量化

矩阵计算包：BLAS 和 LAPACK

局部性

不同类型内存的速度

临时性

扩展到多个核心和节点

可扩展性/并行化

十、实现 QR 分解

NumPy 中

正交性

Gram-Schmidt

经典的 Gram-Schmidt（不稳定）

改进版 Gram-Schmidt

Householder

引言

初始化

算法

陷阱

类比

示例

示例：经典与改进的 Gram-Schmidt

示例：正交性的数值损失

二、SVD 背景消除

加载和格式化数据

辅助方法

格式化数据

奇异值分解

SVD介绍

练习

s的属性

寻找背景

制作视频

SVD 分解不同尺寸矩阵的速度

一个视频中的两个背景

现在回到我们的背景消除问题

数据压缩旁注

三、使用 NMF 和 SVD 的主题建模

动机

今天的课程中

附加资源

其它教程

建立数据

奇异值分解（SVD）

答案

答案

主题

非负矩阵分解（NMF）

动机

机器`ε`

`s`的属性

`k`变化下的时间和准确度

那么我们如何找到`Q`（步骤 1）？

为什么 $M \sim Q Q^T M$

`scipy.linalg` vs `lu_solve`