Solutions | DeepML Solutions

Contents

    1. Matrix times Vector
    4. Calculate Mean by Row or Column
    5. Scalar Multiplication of a Matrix
    14. Linear Regression Using Normal Equation
    15. Linear Regression Using Gradient Descent
    17. K-Means Clustering
    19. Principal Component Analysis (PCA) Implementation
    20. Decision Tree Learning
    22. Sigmoid Activation Function Understanding
    24. Single Neuron
    27. Transformation Matrix from Basis B to C
    31. Divide Dataset Based on Feature Threshold
    34. One-Hot Encoding of Nominal Values
    38. Implement AdaBoost Fit Method
    39. Implementation of Log Softmax Function
    43. Implement Ridge Regression Loss Function
    44. Leaky ReLU Activation Function
    45. Linear Kernel Function
    47. Implement Gradient Descent Variants with MSE Loss
    48. Implement Reduced Row Echelon Form (RREF) Function
    49. Implement Adam Optimization Algorithm
    64. Implement Gini Impurity Calculation for a Set of Classes
    70. Calculate Image Brightness
    77. Calculate Performance Metrics for a Classification Model
    79. Binomial Distribution Probability
    84. Phi Transformation for Polynomial Features
    85. Positional Encoding Calculator
    87. Adam Optimizer
    95. Calculate the Phi Coefficient
    100. Implement the Softplus Activation Function
    100. Implement the Softsign Activation Function
    102. Implement the Swish Activation Function
    107. Implement Masked Self-Attention
    121. Vector Element-wise Sum


1. Matrix times Vector

Deep-ML | Matrix times Vector

(Easy)
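
Code

A minimal sketch; the function name and the convention of returning -1 when the matrix and vector dimensions do not match are assumptions from the usual problem template:

def matrix_dot_vector(a: list[list[float]], b: list[float]) -> list[float] | int:
    # Each row of a must have exactly len(b) entries
    if any(len(row) != len(b) for row in a):
        return -1
    # Dot product of each row with the vector
    return [sum(x * y for x, y in zip(row, b)) for row in a]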


4. Calculate Mean by Row or Column

Deep-ML | Calculate Mean by Row or Column

Code

import numpy as np

def calculate_matrix_mean(matrix: list[list[float]], mode: str) -> list[float]:
    # For column means, transpose the matrix so both modes iterate over rows
    rows = matrix if mode == 'row' else zip(*matrix)
    return [float(np.mean(list(row))) for row in rows]

5. Scalar Multiplication of a Matrix

Deep-ML | Scalar Multiplication of a Matrix

(Easy)
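
Code

A minimal sketch (the function name scalar_multiply is an assumption from the usual problem template):

def scalar_multiply(matrix: list[list[int | float]], scalar: int | float) -> list[list[int | float]]:
    # Multiply every element of the matrix by the scalar
    return [[element * scalar for element in row] for row in matrix]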


14. Linear Regression Using Normal Equation

Deep-ML | Linear Regression Using Normal Equation

Prerequisites

Normal Equation

The goal of linear regression is to find parameters \(\theta\) that minimize the mean squared error (MSE) between the predictions \(X\theta\) and the targets \(y\):

\[J(\theta) = \frac{1}{2m} \| X\theta - y \|^2 \]

Taking the derivative and setting it to zero, the normal equation solves for the optimal parameters \(\theta\) directly:

\[\theta = (X^T X)^{-1} X^T y \]

Code

import numpy as np

def linear_regression_normal_equation(X: list[list[float]], y: list[float]) -> list[float]:
	X = np.array(X)
	y = np.array(y)
	theta = np.linalg.inv(X.T @ X) @ X.T @ y
	return np.round(theta, 4).flatten().tolist()

15. Linear Regression Using Gradient Descent

Deep-ML | Linear Regression Using Gradient Descent

Prerequisites

Gradient Descent

The weight update rule is as follows.

\[\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \]

See No. 47 for a detailed discussion.

Code

NumPy matrix multiplication follows the rule X(m, n) @ Y(n, p); note that the array y of shape (3,) must be reshaped into a column vector of shape (3, 1).

import numpy as np

def linear_regression_gradient_descent(X: np.ndarray, y: np.ndarray, alpha: float, iterations: int) -> np.ndarray:
	m, n = X.shape
	theta = np.zeros((n, 1))

	for _ in range(iterations):
		h = np.dot(X, theta)           # (m, 1) predictions
		# reshape(-1, 1): -1 lets NumPy infer the number of rows
		err = h - y.reshape(-1, 1)     # (m, 1) residuals
		update = np.dot(X.T, err) / m  # (n, 1) average gradient
		theta -= alpha * update

	return theta

17. K-Means Clustering

Deep-ML | K-Means Clustering

Prerequisites

K-Means Clustering

Clustering is NP-hard, so heuristic algorithms such as K-means are used in practice. For a sample set \(D = \{x_1, x_2, ..., x_m\}\) and a cluster partition \((C_1, C_2, ..., C_k)\), K-means minimizes the sum of squared errors

\[\text{SSE} = \sum_{i}^{k} \sum_{x \in C_i} {||x - {\mu}_i||}_2^2 \]

The algorithm proceeds as follows.

  1. Initialize \(C_t = \varnothing, (t = 1, 2, ..., k)\)
  2. For each \(x \in D\), compute its distance \(d_i, (i = 1, 2, ..., k)\) to each centroid, assign \(x\) the label \(\lambda_i\) of the nearest centroid, and update \(C_i = C_i \cup \{x\}\)
  3. For each cluster \(C_i\), recompute the centroid from all points in \(C_i\)

    \[\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x \]

  4. Repeat (1) – (3) until the centroids \(\mu = \{\mu_1, \mu_2, ..., \mu_k\}\) no longer change
  5. Output the cluster partition \(C\)

Code

The implementation uses Euclidean distance for the cluster assignments.

def k_means_clustering(points: list[tuple[float, float]], k: int, initial_centroids: list[tuple[float, float]], max_iterations: int) -> list[tuple[float, float]]:
	
	def euc_distance(p1, p2):
		n_dims = len(p1)
		return sum((p1[i] - p2[i]) ** 2 for i in range(n_dims)) ** 0.5
	
	def do_clustering(centroids):
		labels = []
		for point in points:
			dist = [euc_distance(point, centroids[i]) for i in range(k)]
			labels.append(dist.index(min(dist)))
		return labels
	
	def update_centroids(labels, old_centroids, use_random=True):
		new_centroids = []
		for i in range(k):
			subpoints = [points[j] for j in range(len(points)) if labels[j] == i]
			if subpoints:
				n_dims = len(subpoints[0])
				# Mean of the assigned points along each dimension
				new_centroid = tuple(
					sum(point[d] for point in subpoints) / len(subpoints) for d in range(n_dims)
				)
				new_centroids.append(new_centroid)
			else:
				# Empty cluster: re-seed with a random point or keep the old centroid
				if use_random:
					import random
					new_centroids.append(random.choice(points))
				else:
					new_centroids.append(old_centroids[i])

		return new_centroids

	current_centroids = initial_centroids
	for _ in range(max_iterations):
		current_labels = do_clustering(current_centroids)
		new_centroids = update_centroids(current_labels, current_centroids, True)

		if all(euc_distance(c, n) < 1e-6 for c, n in zip(current_centroids, new_centroids)):
			break
		current_centroids = new_centroids

	return current_centroids


19. Principal Component Analysis (PCA) Implementation

Deep-ML | Principal Component Analysis (PCA) Implementation

Prerequisites

Principal Component Analysis (PCA)

PCA is a standard dimensionality-reduction algorithm. For \(n\) samples of \(d\)-dimensional data, the procedure is as follows.

  1. Arrange the original data into an \(n \times d\) matrix \(X\) (one sample per row)
  2. Standardize each column (attribute) of \(X\): \(x_i \leftarrow \frac{x_i - \mu_i}{\sigma_i}\)
  3. Compute the covariance matrix of the standardized data
  4. Compute the eigenvalues of the covariance matrix and their corresponding eigenvectors
  5. Stack the eigenvectors as rows, ordered by decreasing eigenvalue, and take the first \(k\) rows to form the matrix \(P\)
  6. \(Y = XP^T\) is the data reduced to \(k\) dimensions.

The problem only asks for the top \(k\) eigenvectors, so the implementation covers steps (1) – (5).

Code

import numpy as np

def pca(data, k):
    # Standardize the data
    data_standardized = (data - np.mean(data, axis=0)) / np.std(data, axis=0)
    
    # Compute the covariance matrix
    covariance_matrix = np.cov(data_standardized, rowvar=False)
    
    # Eigen decomposition
    eigenvalues, eigenvectors = np.linalg.eig(covariance_matrix)
    
    # Sort the eigenvectors by decreasing eigenvalues
    idx = np.argsort(eigenvalues)[::-1]
    eigenvalues_sorted = eigenvalues[idx]
    eigenvectors_sorted = eigenvectors[:,idx]
    
    # Select the top k eigenvectors (principal components)
    principal_components = eigenvectors_sorted[:, :k]
    
    return np.round(principal_components, 4)

20. Decision Tree Learning

Deep-ML | Decision Tree Learning

Prerequisites

ID3 Decision Tree

Reference code (reposted)

This problem asks for an ID3 decision tree. The core of ID3 is choosing the splitting feature by information gain and then building the tree recursively.

First, compute the entropy of the sample set

\[E = - \sum_{i=1}^n p(x_i) \log_2{p(x_i)} \]

Next, introduce the conditional entropy

\[E(Y|X) = \sum_{i=1}^n p(x_i) \, E(Y|x_i) \]

In the implementation, after splitting on a feature, each of the resulting subsets computes its own entropy, which gives \(E(Y|x_i)\); see the chooseBestFeatureToSplit function in the reference code. The splitDataSet function also removes the current feature, which makes the recursive tree construction easier.

Finally, the information gain is

\[Gain = E(D) - E(D|A) \]

With these concepts, ID3 selects features by information gain \(G\): at each step it picks the attribute \(A\) attaining \(\arg\max_{A} G(A)\), i.e. the attribute whose split reduces the uncertainty of the dataset \(D\) the most.
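
Code

A minimal sketch of the quantities above (entropy, conditional entropy of a split, and information gain). The helper names are illustrative and do not follow the reference code's chooseBestFeatureToSplit / splitDataSet interface:

import math
from collections import Counter

def entropy(labels):
    # E = -sum_i p_i * log2(p_i) over the class frequencies
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(dataset, labels, feature_idx):
    # Gain = E(D) - sum_v |D_v| / |D| * E(D_v), splitting on one feature
    n = len(dataset)
    conditional = 0.0
    for value in set(row[feature_idx] for row in dataset):
        subset = [labels[i] for i, row in enumerate(dataset) if row[feature_idx] == value]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional

def choose_best_feature(dataset, labels):
    # ID3: pick the feature with the largest information gain
    n_features = len(dataset[0])
    return max(range(n_features), key=lambda f: information_gain(dataset, labels, f))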


22. Sigmoid Activation Function Understanding

Deep-ML | Sigmoid Activation Function Understanding

Prerequisites

Sigmoid

\[\sigma(z) = \frac{1}{1 + e^{-z}} \]

Code

return round(1 / (1 + math.exp(-z)), 4)

24. Single Neuron

Deep-ML | Single Neuron

Prerequisites

MSE (Mean Squared Error)

\[MSE = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y_i})^2 \]

Code

import math

def single_neuron_model(features: list[list[float]], labels: list[int], weights: list[float], bias: float) -> (list[float], float):
	# Your code here

	def sigmoid(z):
		return 1 / (1 + math.exp(-z))
	
	def dot(x, w):
		return sum([x_ * w_ for x_, w_ in zip(x, w)])
	
	probabilities = []

	for feature_vectors in features:
		Z = dot(feature_vectors, weights) + bias
		probabilities.append(sigmoid(Z))
	
	mse = sum([(p - float(l)) ** 2 for p, l in zip(probabilities, labels)]) / len(labels)

	# ROUND
	probabilities = [round(p, 4) for p in probabilities]
	mse = round(mse, 4)

	return probabilities, mse


27. Transformation Matrix from Basis B to C

Deep-ML | Transformation Matrix from Basis B to C

Prerequisites

Change-of-Basis Matrix

The change-of-basis matrix \(P\) from basis \(B\) to basis \(C\) satisfies \(BX = CX'\) with \(X' = PX\), therefore

\[P = C^{-1} \cdot B \]

Code

import numpy as np

def transform_basis(B: list[list[int]], C: list[list[int]]) -> list[list[float]]:
	B, C = np.array(B), np.array(C)
	C_inv = np.linalg.inv(C)
	P = C_inv @ B
	return P

31. Divide Dataset Based on Feature Threshold

Deep-ML | Divide Dataset Based on Feature Threshold

This one is rated Easy; a minimal sketch is given below.
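
Code

A minimal sketch, assuming the usual convention that samples with a feature value greater than or equal to the threshold go into the first group:

import numpy as np

def divide_on_feature(X: np.ndarray, feature_i: int, threshold: float) -> list:
    # Boolean mask selecting rows whose feature_i value reaches the threshold
    mask = X[:, feature_i] >= threshold
    return [X[mask], X[~mask]]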


34. One-Hot Encoding of Nominal Values

Deep-ML | One-Hot Encoding of Nominal Values

Write a one-hot encoding function. A good exercise in NumPy usage, in particular integer-array indexing.

Code

import numpy as np

def to_categorical(x, n_col=None):
	n_col = n_col or np.amax(x) + 1
	one_hots = np.zeros((x.shape[0], n_col))
	one_hots[np.arange(x.shape[0]), x] = 1  # integer-array indexing: set one position per row
	return one_hots

38. Implement AdaBoost Fit Method

Deep-ML | Implement AdaBoost Fit Method

Prerequisites

Boosting

An ensemble learning method whose main idea is to boost weak base learners into a strong learner. The steps are:

  1. Train an initial base learner on the training set with all sample weights equal;
  2. Adjust the sample weights in the training set according to how the previous learner performed on it, then train a new base learner on the reweighted data;
  3. Repeat step 2 until \(M\) base learners have been obtained; the final ensemble is a combination of the \(M\) base learners.

As this shows, Boosting is a sequential process.

AdaBoost

For AdaBoost, the two key steps to remember are

  1. computing the weight coefficient \(\alpha_m\) of each base learner
  2. updating the training-sample weights \(W\)

The full procedure is as follows.

Consider a binary classification problem with dataset \(D = \{ (x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N) \}\), where each \(x_i\) is a column vector with \(d\) elements, i.e. \(x_i \in \mathcal{X} \subseteq \mathbb{R}^d\), and \(y_i\) is a scalar with \(y \in \{ +1, -1 \}\).

1) Initialize the sample weights

\[W_1=(w_1^{(1)}, w_{2}^{(1)},\cdots,w_{N}^{(1)}),\, w_{i}^{(1)}=\frac 1 N, \,i = 1,2,\cdots,N \]

2) For \(m = 1, 2, \ldots, M\), repeat the following steps to obtain \(M\) base learners:

  1. Train on the data under the sample-weight distribution \(W_m\) to obtain the \(m\)-th base learner \(G_m(x)\)

\[G_m(x): \mathcal{X} \rightarrow y \in \{+1,-1\} \]

  2. Compute the classification error rate of \(G_m(x)\) on the weighted training set. Here \(I(\cdot)\) is the indicator function, which equals 1 when the condition in parentheses holds and 0 otherwise. \(P(G_m(x_i) \neq y_i)\) denotes the probability that the prediction of base classifier \(G_m(x)\) on \(x_i\) disagrees with the true label \(y_i\); it is denoted \(\epsilon\)

\[\epsilon_m = \sum_{i=1}^N P(G_m(x_i) \neq y_i)=\sum_{i=1}^N w_{i}^{(m)} I(G_m(x_i) \neq y_i) \]

3) Compute the weight coefficient of \(G_m(x)\) (i.e. the weight of this base learner in the final ensemble)

\[\alpha_m = \cfrac{1}{2}\ln\cfrac{1-\epsilon_m}{\epsilon_m} \]

4) Update the training-sample weights:

\[W_{m+1} = (w_1^{(m+1)}, w_{2}^{(m+1)},\cdots,w_{N}^{(m+1)}) \\ w_i^{(m+1)} = \cfrac{w_i^{(m)}}{Z_m}e^{-\alpha_my_iG_m(x_i)}, \, i=1,2,\cdots,N \]

5) Here \(Z_m\) is a normalization factor, chosen so that the elements of \(W_{m+1}\) sum to 1:

\[Z_m=\sum_{i=1}^N w_i^{(m)} e^{-\alpha_my_iG_m(x_i)} \]

6) Form the final linear combination of the classifiers

\[f(x) = \sum_{m=1}^{M}\alpha_{m}G_{m}(x) \]

7) The final classifier is

\[G(x) = \text{sign}(f(x))=\text{sign}\left(\sum_{m=1}^M \alpha_m G_m(x)\right) \]

Summary

The code itself is not complicated; work through it alongside the derivation above. The key points to remember are the updates of \(\alpha_m\) and \(W\).

The explanation above is reposted from 博客园 (cnblogs).
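
Code

A minimal sketch of the fit method, using decision stumps as base learners on labels in {-1, +1}. The stump search and the n_clf parameter are assumptions for illustration, not the exact Deep-ML template:

import numpy as np

def adaboost_fit(X, y, n_clf=5):
    n_samples, n_features = X.shape
    w = np.full(n_samples, 1 / n_samples)  # step 1): uniform sample weights
    clfs = []

    for _ in range(n_clf):
        # Search for the stump (feature, threshold, polarity) with minimal weighted error
        best, min_err = None, float('inf')
        for feat in range(n_features):
            for thresh in np.unique(X[:, feat]):
                for polarity in (1, -1):
                    pred = np.where(polarity * X[:, feat] < polarity * thresh, -1, 1)
                    err = np.sum(w[pred != y])
                    if err < min_err:
                        min_err, best = err, (feat, thresh, polarity)

        feat, thresh, polarity = best
        # step 3): learner weight alpha_m (small epsilon avoids division by zero)
        alpha = 0.5 * np.log((1 - min_err) / (min_err + 1e-10))
        pred = np.where(polarity * X[:, feat] < polarity * thresh, -1, 1)
        # steps 4)-5): reweight the samples and normalize by Z_m
        w *= np.exp(-alpha * y * pred)
        w /= np.sum(w)
        clfs.append((feat, thresh, polarity, alpha))

    return clfs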


39. Implementation of Log Softmax Function

Deep-ML | Implementation of Log Softmax Function

Prerequisites

Softmax

\[\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}} \]

Log Softmax

Taking the logarithm of the expression above directly is numerically unstable. To avoid this, subtract \(\max(x)\):

\[\log \, \text{softmax}(x_i) = x_i - \max(x) - \log\left(\sum_{j=1}^{n} e^{x_j - \max(x)}\right) \]

Code

import numpy as np

def log_softmax(scores: list) -> np.ndarray:
	sub_max = scores - np.max(scores)
	log_submax = np.log(np.sum(np.exp(sub_max)))
	return sub_max - log_submax

43. Implement Ridge Regression Loss Function

Deep-ML | Implement Ridge Regression Loss Function

Prerequisites

Ridge Regression

MSE loss plus an \(\ell_2\) penalty on the weights.

\[L(\beta) = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^p \beta_j^2 \]

Code

import numpy as np

def ridge_loss(X: np.ndarray, w: np.ndarray, y_true: np.ndarray, alpha: float) -> float:
    y_pred = X @ w
    mse = np.mean((y_true - y_pred) ** 2)
    l2_penalty = np.sum(w ** 2)
    loss = mse + alpha * l2_penalty
    return loss

44. Leaky ReLU Activation Function

Deep-ML | Leaky ReLU Activation Function

Prerequisites

Leaky ReLU

Leaky ReLU addresses the dying-ReLU problem by allowing a small, non-zero slope for negative inputs.

\[f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{if } x \leq 0 \end{cases} \]
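
Code

A minimal sketch (the default \(\alpha = 0.01\) is an assumption):

def leaky_relu(z: float, alpha: float = 0.01) -> float:
    # Pass positive inputs through unchanged; scale negative inputs by alpha
    return z if z > 0 else alpha * z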


45. Linear Kernel Function

Deep-ML | Linear Kernel Function

It can be implemented by hand:

return sum(x1_ * x2_ for x1_, x2_ in zip(x1, x2))

or computed directly with np.inner.


47. Implement Gradient Descent Variants with MSE Loss

Deep-ML | Implement Gradient Descent Variants with MSE Loss

Prerequisites

This problem asks you to optimize a simple fully connected layer (without bias)

\[\mathbf{\hat{y}} = \mathbf{W} \cdot \mathbf{x} \]

using gradient descent. The loss is MSE, and its gradient with respect to the weights is

\[\frac{\partial \text{MSE}}{\partial W} = \frac{1}{n} \sum_{i=1}^{n} -2(y_i - \hat{y}_i) X_i \]

Batch Gradient Descent

Process the entire dataset and update the parameters with the average gradient

\[\theta = \theta - \alpha \cdot \frac{2}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)}) x^{(i)} \]

where \(m\) is the number of samples.

Stochastic Gradient Descent (SGD)

\[\theta = \theta - \alpha \cdot 2 (h_{\theta}(x^{(i)}) - y^{(i)}) x^{(i)} \]

Mini-Batch Gradient Descent

The most commonly used variant: process a small batch of samples at a time.

\[\theta = \theta - \alpha \cdot \frac{2}{b} \sum_{i=1}^{b} (h_{\theta}(x^{(i)}) - y^{(i)}) x^{(i)} \]

where \(b\) is the batch size.

Code

import numpy as np

def gradient_descent(X, y, weights, learning_rate, n_iterations, batch_size=1, method='batch'):
	m = len(y)

	if method == 'batch':
		batch_size = m
	elif method == 'stochastic':
		batch_size = 1
	# for 'mini_batch', use the batch_size argument as given
	
	for _ in range(n_iterations):
		for i in range(0, m, batch_size):
			X_batch = X[i: i + batch_size]
			y_batch = y[i: i + batch_size]

			pred = X_batch.dot(weights)
			err = pred - y_batch
			gradient = 2 * X_batch.T.dot(err) / len(X_batch)  # average gradient over the (possibly partial) batch
			
			# update
			weights -= learning_rate * gradient

	return weights


48. Implement Reduced Row Echelon Form (RREF) Function

Deep-ML | Implement Reduced Row Echelon Form (RREF) Function

Prerequisites

Gaussian Elimination

  1. Iterate over the columns from left to right and find the pivot (a non-zero entry) in the current column.
  2. Swap rows so that the row containing the pivot becomes the current row.
  3. Normalize the pivot row: divide it by the pivot value so the pivot becomes 1.
  4. Use the pivot row to eliminate the current column from every other row, so the rest of the column becomes 0.
  5. Move on to the next column, until all pivots are processed or the number of rows is exhausted.

Code

import numpy as np

def rref(matrix):
    mat = matrix.astype(np.float64)
    rows, cols = mat.shape
    pivot = 0

    for col in range(cols):
        if pivot >= rows:
            break
        # Find the row with the largest absolute value in the current column (partial pivoting)
        current_col = mat[pivot:, col]
        max_row = np.argmax(np.abs(current_col)) + pivot
        if mat[max_row, col] == 0:
            continue
        # Swap rows
        if max_row != pivot:
            mat[[pivot, max_row]] = mat[[max_row, pivot]]
        # Normalize the pivot row
        pivot_val = mat[pivot, col]
        mat[pivot] = mat[pivot] / pivot_val
        # Eliminate the current column from all other rows
        for i in range(rows):
            if i != pivot:
                factor = mat[i, col]
                mat[i] -= factor * mat[pivot]
        pivot += 1
    return mat

49. Implement Adam Optimization Algorithm

Deep-ML | Implement Adam Optimization Algorithm

This problem duplicates No. 87.


64. Implement Gini Impurity Calculation for a Set of Classes

Deep-ML | Implement Gini Impurity Calculation for a Set of Classes

Prerequisites

Gini Impurity

Commonly used to choose the best split at a decision tree node.

\[\text{Gini Impurity} = 1 - \sum_{i = 1}^{n} p_i^2 \]

where \(p_i = \frac{n_i}{n}\) is the proportion of samples in class \(i\).

Code


import numpy as np
from collections import Counter

def gini_impurity(y):
	"""
	Calculate Gini Impurity for a list of class labels.

	:param y: List of class labels
	:return: Gini Impurity rounded to three decimal places
	"""
	n_labels = len(y)
	probs = [float(p) / n_labels for _, p in Counter(y).items()]
	
	return round(1 - sum(p ** 2 for p in probs), 3)

70. Calculate Image Brightness

Deep-ML | Calculate Image Brightness

The main point here is input validation; otherwise the task is straightforward. A minimal sketch is given below.
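
Code

A minimal sketch of the validation plus mean computation; returning -1 for invalid input and rounding to two decimals are assumptions from the usual problem template:

def calculate_brightness(img: list[list[int]]) -> float:
    # Reject empty images, ragged rows, and out-of-range pixel values
    if not img or not img[0]:
        return -1
    width = len(img[0])
    for row in img:
        if len(row) != width or any(not (0 <= p <= 255) for p in row):
            return -1
    # Brightness = mean pixel value
    total = sum(sum(row) for row in img)
    return round(total / (len(img) * width), 2)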


77. Calculate Performance Metrics for a Classification Model

Deep-ML | Calculate Performance Metrics for a Classification Model

Prerequisites

Confusion Matrix

\[\begin{array}{|c|c|c|} \hline & \text{Predicted Positive} & \text{Predicted Negative} \\ \hline \text{Actual Positive} & TP & FN \\ \hline \text{Actual Negative} & FP & TN \\ \hline \end{array} \]

Accuracy

From the confusion matrix definitions,

\[\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]

Precision

\[\text{Precision} = \frac{TP}{TP + FP} \]

Negative Predictive Value

\[\text{Negative Predictive Value} = \frac{TN}{TN + FN} \]

Recall

\[\text{Recall} = \frac{TP}{TP + FN} \]

Specificity

\[\text{Specificity} = \frac{TN}{TN + FP} \]

F1 Score

The F1 score balances the trade-off between \(FN\)s and \(FP\)s.

\[\text{F1 Score} = 2 \times \frac{1}{\frac{1}{\text{Precision}} + \frac{1}{\text{Recall}}} \]

Code

def performance_metrics(actual: list[int], predicted: list[int]) -> tuple:
	# Implement your code here
	n = len(actual)
	TP = sum(1 if a == p == 1 else 0 for a, p in zip(actual, predicted))
	TN = sum(1 if a == p == 0 else 0 for a, p in zip(actual, predicted))
	FN = sum(actual) - TP
	FP = sum(predicted) - TP

	accuracy = float(TP + TN) / (TP + TN + FP + FN)
	precision = float(TP) / (TP + FP)
	negativePredictive = float(TN) / (TN + FN)
	recall = float(TP) / (TP + FN)
	specificity = float(TN) / (TN + FP)
	f1 = 2.0 / (1 / precision + 1 / recall)
	confusion_matrix = [[TP, FN], [FP, TN]]
	

	return confusion_matrix, round(accuracy, 3), round(f1, 3), round(specificity, 3), round(negativePredictive, 3)

79. Binomial Distribution Probability

Deep-ML | Binomial Distribution Probability

Prerequisites

Binomial Distribution

\[P(X=k) = \binom{n}{k} \cdot p^k \cdot (1-p)^{n-k} \]

Code

import math

def binomial_probability(n, k, p):
	C = math.comb(n, k)
	probability = C * pow(p, k) * pow(1 - p, n - k)
	return round(probability, 5)

84. Phi Transformation for Polynomial Features

Deep-ML | Phi Transformation for Polynomial Features

(Easy)
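
Code

A minimal sketch: each scalar x is mapped to its powers [x^0, x^1, ..., x^degree]; the function name and the empty-list fallback for invalid input are assumptions:

def phi_transform(data: list[float], degree: int) -> list[list[float]]:
    # Each sample becomes a row of monomials up to the given degree
    if not data or degree < 0:
        return []
    return [[float(x) ** d for d in range(degree + 1)] for x in data]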


85. Positional Encoding Calculator

Deep-ML | Positional Encoding Calculator

Prerequisites

Transformer Positional Encoding

For a position \(pos\) and a dimension index \(i\) (\(0 \le i < d_{model}\)), the positional encoding is computed as follows. For even dimensions:

\[\sin{\left(\frac{pos}{10000^{\frac{i}{d_{model}}}}\right)} \]

For odd dimensions:

\[\cos{\left(\frac{pos}{10000^{\frac{i - 1}{d_{model}}}}\right)} \]

Code

(Generated by GPT-4o)

import numpy as np

def pos_encoding(position: int, d_model: int):
    # Validate inputs
    if position == 0 or d_model <= 0:
        return -1

    # Allocate the positional encoding array
    pos_enc = np.zeros((1, position, d_model), dtype=np.float16)

    # Fill in the encodings: sin for even dimensions, cos for odd dimensions
    for pos in range(position):
        for i in range(d_model):
            if i % 2 == 0:
                pos_enc[0, pos, i] = np.sin(pos / (10000 ** (i / d_model)))
            else:
                pos_enc[0, pos, i] = np.cos(pos / (10000 ** ((i - 1) / d_model)))

    return pos_enc

87. Adam Optimizer

Deep-ML | Adam Optimizer

Prerequisites

Adam

Adam (Kingma & Ba, 2014) is an effective optimizer that uses exponentially weighted moving averages to estimate the momentum (first moment) and the second moment of the gradient. Specifically,

\[\begin{aligned} \mathbf{v}_t & \leftarrow \beta_1 \mathbf{v}_{t-1} + (1 - \beta_1) \mathbf{g}_t, \\ \mathbf{s}_t & \leftarrow \beta_2 \mathbf{s}_{t-1} + (1 - \beta_2) \mathbf{g}_t^2. \end{aligned} \]

Typically \(\beta_1 = 0.9\) and \(\beta_2 = 0.999\). Next, correct the bias to obtain \(\hat{v}\) and \(\hat{s}\):

\[\hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1 - \beta_1^t} \text{ and } \hat{\mathbf{s}}_t = \frac{\mathbf{s}_t}{1 - \beta_2^t}. \]

Finally, rescale the gradient in a way very similar to RMSProp, giving the update quantity \(\mathbf{g}_t'\):

\[\mathbf{g}_t' = \frac{\eta \hat{\mathbf{v}}_t}{\sqrt{\hat{\mathbf{s}}_t} + \epsilon}. \]

Then update the parameters:

\[\mathbf{x}_t \leftarrow \mathbf{x}_{t-1} - \mathbf{g}_t'. \]

The code below implements this whole procedure.

Code

import numpy as np

def adam_optimizer(parameter, grad, m, v, t, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
	"""
	Update parameters using the Adam optimizer.
	Adjusts the learning rate based on the moving averages of the gradient and squared gradient.
	:param parameter: Current parameter value
	:param grad: Current gradient
	:param m: First moment estimate
	:param v: Second moment estimate
	:param t: Current timestep
	:param learning_rate: Learning rate (default=0.001)
	:param beta1: First moment decay rate (default=0.9)
	:param beta2: Second moment decay rate (default=0.999)
	:param epsilon: Small constant for numerical stability (default=1e-8)
	:return: tuple: (updated_parameter, updated_m, updated_v)
	"""
	# Your code here
	m = beta1 * m + (1 - beta1) * grad
	v = beta2 * v + (1 - beta2) * np.square(grad)
	m_bias_corr = m / (1 - np.power(beta1, t))
	v_bias_corr = v / (1 - np.power(beta2, t))
	grad_1 = learning_rate * m_bias_corr / (np.sqrt(v_bias_corr) + epsilon)
	parameter -= grad_1

	return np.round(parameter,5), np.round(m,5), np.round(v,5)
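
A quick usage example with illustrative inputs (the starting values and gradient below are made up):

parameter, m, v = 1.0, 0.0, 0.0
parameter, m, v = adam_optimizer(parameter, grad=0.1, m=m, v=v, t=1)
# parameter -> 0.999, m -> 0.01, v -> 1e-05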


95. Calculate the Phi Coefficient

Deep-ML | Calculate the Phi Coefficient

Prerequisites

The Phi Coefficient

\[\phi = \frac{(x_{00}x_{11}) - (x_{01}x_{10})}{\sqrt{(x_{00}+x_{01})(x_{10}+x_{11})(x_{00}+x_{10})(x_{01}+x_{11})}} \]

Code

import math

def phi_corr(x: list[int], y: list[int]) -> float:
	n = len(x)

	x_00 = sum(1 for i in range(n) if x[i] == y[i] == 0)
	x_11 = sum(1 for i in range(n) if x[i] == y[i] == 1)
	x_10 = sum(1 for i in range(n) if x[i] == 1 != y[i])
	x_01 = sum(1 for i in range(n) if x[i] == 0 != y[i])

	sqrt_ = math.sqrt((x_00 + x_01) * (x_10 + x_11) * (x_00 + x_10) * (x_01 + x_11))
	val = (x_00 * x_11 - x_01 * x_10) / sqrt_
	return round(val,4)

100. Implement the Softplus Activation Function

Deep-ML | Implement the Softplus Activation Function

Prerequisites

Softplus

\[\text{Softplus}(x) = \log (1 + e^x) \]

Code

return round(np.log(1 + np.exp(x)), 4)

100. Implement the Softsign Activation Function

Deep-ML | Implement the Softsign Activation Function

Prerequisites

Softsign

\[\text{Softsign}(x) = \frac{x}{1 + |x|} \]

Code

return round(x / (1 + abs(x)), 4)

102. Implement the Swish Activation Function

Deep-ML | Implement the Swish Activation Function

Prerequisites

Swish Activation Function

\[\text{Swish}(x) = x \cdot \sigma(x) \text{ where } \sigma(x) = \frac{1}{1 + e^{-x}} \]

Code

sigma = lambda x: 1 / (1 + np.exp(-x))
return x * sigma(x)

107. Implement Masked Self-Attention

Deep-ML | Implement Masked Self-Attention

Prerequisites

Masked Self-Attention

Masked attention is a variation of the attention mechanism used primarily in sequence modeling tasks, such as language modeling and text generation. The key idea behind masked attention is to control the flow of information by selectively masking certain elements in the input sequence. This ensures that the model attends only to valid positions when computing attention scores.

Given \(Q \in \mathbb{R}^{n \times d_K}, K \in \mathbb{R}^{m \times d_K}, V \in \mathbb{R}^{m \times d_V}\), compute the attention scores

\[\text{score} = \frac{QK^T}{\sqrt{d_K}} \]

The mask has the form

\[\text{mask} = \begin{bmatrix} 0 & -\infty & -\infty \\ 0 & 0 & -\infty \\ 0 & 0 & 0 \end{bmatrix} \]

Apply the mask:

\[\text{maskedScore} = \text{score} + \text{mask} \]

Before the softmax, subtract the maximum score in each row to stabilize the computation

\[A = \frac{\exp(\text{maskedScore} - \text{maskedScore}_{\text{max}})}{\sum \exp(\text{maskedScore} - \text{maskedScore}_{\text{max}})} \]

The output is

\[\text{Output} = A \cdot V \]

Code

import numpy as np

def compute_qkv(X: np.ndarray, W_q: np.ndarray, W_k: np.ndarray, W_v: np.ndarray):
	Q = np.dot(X, W_q)
	K = np.dot(X, W_k)
	V = np.dot(X, W_v)
	return Q, K, V

def masked_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray, mask: np.ndarray) -> np.ndarray:
	d_k = Q.shape[1]
	scores = np.matmul(Q, K.T) / np.sqrt(d_k)

	# Apply the mask: an additive mask (0 / -inf entries) can simply be added
	scores = scores + mask
	# Alternatively, for a boolean mask, use:
	# scores = np.where(mask, -1e9, scores)

	attention_weights = np.exp(scores - np.max(scores, axis=1, keepdims=True))
	attention_weights = attention_weights / np.sum(attention_weights, axis=1, keepdims=True)
	return np.matmul(attention_weights, V)
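
The additive causal mask shown in the formula above can be constructed, for example, with np.triu (a sketch; the sequence length 3 is arbitrary):

# Entries strictly above the diagonal get a large negative value, the rest stay 0
mask = np.triu(np.full((3, 3), -1e9), k=1)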

121. Vector Element-wise Sum

Deep-ML | Vector Element-wise Sum

Code

return [sum(c) for c in zip(a, b)] if len(a) == len(b) else -1