Time Series
import pandas as pd
data = pd.read_csv(r'../data/data.csv')
data
# With data this nice, why not just regress on it...
|    | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 | x11 | x12 | x13 | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3831732 | 181.54 | 448.19 | 7571.00 | 6212.70 | 6370241 | 525.71 | 985.31 | 60.62 | 65.66 | 120.0 | 1.029 | 5321 | 64.87 | 
| 1 | 3913824 | 214.63 | 549.97 | 9038.16 | 7601.73 | 6467115 | 618.25 | 1259.20 | 73.46 | 95.46 | 113.5 | 1.051 | 6529 | 99.75 | 
| 2 | 3928907 | 239.56 | 686.44 | 9905.31 | 8092.82 | 6560508 | 638.94 | 1468.06 | 81.16 | 81.16 | 108.2 | 1.064 | 7008 | 88.11 | 
| 3 | 4282130 | 261.58 | 802.59 | 10444.60 | 8767.98 | 6664862 | 656.58 | 1678.12 | 85.72 | 91.70 | 102.2 | 1.092 | 7694 | 106.07 | 
| 4 | 4453911 | 283.14 | 904.57 | 11255.70 | 9422.33 | 6741400 | 758.83 | 1893.52 | 88.88 | 114.61 | 97.7 | 1.200 | 8027 | 137.32 | 
| 5 | 4548852 | 308.58 | 1000.69 | 12018.52 | 9751.44 | 6850024 | 878.26 | 2139.18 | 92.85 | 152.78 | 98.5 | 1.198 | 8549 | 188.14 | 
| 6 | 4962579 | 348.09 | 1121.13 | 13966.53 | 11349.47 | 7006896 | 923.67 | 2492.74 | 94.37 | 170.62 | 102.8 | 1.348 | 9566 | 219.91 | 
| 7 | 5029338 | 387.81 | 1248.29 | 14694.00 | 11467.35 | 7125979 | 978.21 | 2841.65 | 97.28 | 214.53 | 98.9 | 1.467 | 10473 | 271.91 | 
| 8 | 5070216 | 453.49 | 1370.68 | 13380.47 | 10671.78 | 7206229 | 1009.24 | 3203.96 | 103.07 | 202.18 | 97.6 | 1.560 | 11469 | 269.10 | 
| 9 | 5210706 | 533.55 | 1494.27 | 15002.59 | 11570.58 | 7251888 | 1175.17 | 3758.62 | 109.91 | 222.51 | 100.1 | 1.456 | 12360 | 300.55 | 
| 10 | 5407087 | 598.33 | 1677.77 | 16884.16 | 13120.83 | 7376720 | 1348.93 | 4450.55 | 117.15 | 249.01 | 101.7 | 1.424 | 14174 | 338.45 | 
| 11 | 5744550 | 665.32 | 1905.84 | 18287.24 | 14468.24 | 7505322 | 1519.16 | 5154.23 | 130.22 | 303.41 | 101.5 | 1.456 | 16394 | 408.86 | 
| 12 | 5994973 | 738.97 | 2199.14 | 19850.66 | 15444.93 | 7607220 | 1696.38 | 6081.86 | 128.51 | 356.99 | 102.3 | 1.438 | 17881 | 476.72 | 
| 13 | 6236312 | 877.07 | 2624.24 | 22469.22 | 18951.32 | 7734787 | 1863.34 | 7140.32 | 149.87 | 429.36 | 103.4 | 1.474 | 20058 | 838.99 | 
| 14 | 6529045 | 1005.37 | 3187.39 | 25316.72 | 20835.95 | 7841695 | 2105.54 | 8287.38 | 169.19 | 508.84 | 105.9 | 1.515 | 22114 | 843.14 | 
| 15 | 6791495 | 1118.03 | 3615.77 | 27609.59 | 22820.89 | 7946154 | 2659.85 | 9138.21 | 172.28 | 557.74 | 97.5 | 1.633 | 24190 | 1107.67 | 
| 16 | 7110695 | 1304.48 | 4476.38 | 30658.49 | 25011.61 | 8061370 | 3263.57 | 10748.28 | 188.57 | 664.06 | 103.2 | 1.638 | 29549 | 1399.16 | 
| 17 | 7431755 | 1700.87 | 5243.03 | 34438.08 | 28209.74 | 8145797 | 3412.21 | 12423.44 | 204.54 | 710.66 | 105.5 | 1.670 | 34214 | 1535.14 | 
| 18 | 7512997 | 1969.51 | 5977.27 | 38053.52 | 30490.44 | 8222969 | 3758.39 | 13551.21 | 213.76 | 760.49 | 103.0 | 1.825 | 37934 | 1579.68 | 
| 19 | 7599295 | 2110.78 | 6882.85 | 42049.14 | 33156.83 | 8323096 | 4454.55 | 15420.14 | 228.46 | 852.56 | 102.6 | 1.906 | 41972 | 2088.14 | 
AutoGluon : Regression
train_data_reg = data.iloc[0:-2,:]   # all rows except the last two
test_data_reg = data.iloc[-2:,:]     # hold out the last two rows for testing
from autogluon.tabular import TabularDataset, TabularPredictor
train_data_reg_auto = TabularDataset(train_data_reg)
test_data_reg_auto = TabularDataset(test_data_reg)
predictor = TabularPredictor(label='y').fit(train_data=train_data_reg_auto)
predictions = predictor.predict(test_data_reg_auto)
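fit() also takes standard budget/quality knobs; a hedged sketch with illustrative values (predictor_tuned is hypothetical and not used below):
# Illustrative only: pin the eval metric and cap the training budget
# (eval_metric / presets / time_limit are standard TabularPredictor options)
predictor_tuned = TabularPredictor(label='y', eval_metric='root_mean_squared_error').fit(
    train_data=train_data_reg_auto, presets='best_quality', time_limit=120)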
## AutoGluon infers your prediction problem is: 'regression'
## (because dtype of label-column == float and many unique label-values observed).
No path specified. Models will be saved in: "AutogluonModels/ag-20220331_070120\"
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20220331_070120\"
AutoGluon Version:  0.4.0
Python Version:     3.9.7
Operating System:   Windows
Train Data Rows:    18
Train Data Columns: 13
Label Column: y
Preprocessing data ...
AutoGluon infers your prediction problem is: 'regression' (because dtype of label-column == float and many unique label-values observed).
	Label info (max, min, mean, stddev): (1535.14, 64.87, 482.99222, 462.62691)
	If 'regression' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    693.54 MB
	Train Data (Original)  Memory Usage: 0.0 MB (0.0% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
	Stage 1 Generators:
		Fitting AsTypeFeatureGenerator...
	Stage 2 Generators:
		Fitting FillNaFeatureGenerator...
	Stage 3 Generators:
		Fitting IdentityFeatureGenerator...
	Stage 4 Generators:
		Fitting DropUniqueFeatureGenerator...
	Types of features in original data (raw dtype, special dtypes):
		('float', []) : 10 | ['x2', 'x3', 'x4', 'x5', 'x7', ...]
		('int', [])   :  3 | ['x1', 'x6', 'x13']
	Types of features in processed data (raw dtype, special dtypes):
		('float', []) : 10 | ['x2', 'x3', 'x4', 'x5', 'x7', ...]
		('int', [])   :  3 | ['x1', 'x6', 'x13']
	0.1s = Fit runtime
	13 features in original data used to generate 13 features in processed data.
	Train Data (Processed) Memory Usage: 0.0 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.09s ...
AutoGluon will gauge predictive performance using evaluation metric: 'root_mean_squared_error'
	To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 14, Val Rows: 4
Fitting 11 L1 models ...
Fitting model: KNeighborsUnif ...
	-61.9471	 = Validation score   (root_mean_squared_error)
	0.01s	 = Training   runtime
	0.02s	 = Validation runtime
Fitting model: KNeighborsDist ...
	-29.1196	 = Validation score   (root_mean_squared_error)
	0.01s	 = Training   runtime
	0.01s	 = Validation runtime
Fitting model: LightGBMXT ...
	-334.4967	 = Validation score   (root_mean_squared_error)
	0.2s	 = Training   runtime
	0.01s	 = Validation runtime
Fitting model: LightGBM ...
	-334.4967	 = Validation score   (root_mean_squared_error)
	0.22s	 = Training   runtime
	0.0s	 = Validation runtime
Fitting model: RandomForestMSE ...
	-54.861	 = Validation score   (root_mean_squared_error)
	0.67s	 = Training   runtime
	0.05s	 = Validation runtime
Fitting model: CatBoost ...
	-31.2363	 = Validation score   (root_mean_squared_error)
	0.94s	 = Training   runtime
	0.0s	 = Validation runtime
Fitting model: ExtraTreesMSE ...
	-23.7411	 = Validation score   (root_mean_squared_error)
	0.65s	 = Training   runtime
	0.1s	 = Validation runtime
Fitting model: NeuralNetFastAI ...
No improvement since epoch 0: early stopping
	-290.8096	 = Validation score   (root_mean_squared_error)
	0.35s	 = Training   runtime
	0.02s	 = Validation runtime
Fitting model: XGBoost ...
	-30.2997	 = Validation score   (root_mean_squared_error)
	0.2s	 = Training   runtime
	0.02s	 = Validation runtime
Fitting model: NeuralNetTorch ...
	Warning: Exception caused NeuralNetTorch to fail during training... Skipping this model.
		float division by zero
Detailed Traceback:
Traceback (most recent call last):
  File "D:\miniConda_Python\lib\site-packages\autogluon\core\trainer\abstract_trainer.py", line 1074, in _train_and_save
    model = self._train_single(X, y, model, X_val, y_val, **model_fit_kwargs)
  File "D:\miniConda_Python\lib\site-packages\autogluon\core\trainer\abstract_trainer.py", line 1032, in _train_single
    model = model.fit(X=X, y=y, X_val=X_val, y_val=y_val, **model_fit_kwargs)
  File "D:\miniConda_Python\lib\site-packages\autogluon\core\models\abstract\abstract_model.py", line 577, in fit
    out = self._fit(**kwargs)
  File "D:\miniConda_Python\lib\site-packages\autogluon\tabular\models\tabular_nn\torch\tabular_nn_torch.py", line 196, in _fit
    self._train_net(train_dataset=train_dataset,
  File "D:\miniConda_Python\lib\site-packages\autogluon\tabular\models\tabular_nn\torch\tabular_nn_torch.py", line 350, in _train_net
    f"Train loss: {round(total_train_loss / total_train_size, 4)}, "
ZeroDivisionError: float division by zero
Fitting model: LightGBMLarge ...
	-43.8279	 = Validation score   (root_mean_squared_error)
	0.32s	 = Training   runtime
	0.01s	 = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
	-4.0605	 = Validation score   (root_mean_squared_error)
	0.4s	 = Training   runtime
	0.0s	 = Validation runtime
AutoGluon training complete, total runtime = 4.9s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20220331_070120\")
predictions
18    1409.151489
19    1405.571655
Name: y, dtype: float32
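Since the test split keeps its y column, the predictor can score it directly; a minimal sketch using standard TabularPredictor helpers, instead of unpickling models by hand as below:
predictor.evaluate(test_data_reg_auto)                   # scores the held-out label column
predictor.leaderboard(test_data_reg_auto, silent=True)   # per-model validation/test scores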
import joblib
# note: this pickle comes from an earlier run directory (ag-20220331_035208), not the run trained above
model = joblib.load("AutogluonModels/ag-20220331_035208/models/WeightedEnsemble_L2/model.pkl")
X_test = test_data_reg_auto.iloc[:,0:-1]   # feature columns
y_true = test_data_reg_auto.iloc[:,-1]     # label column
y_pred = model.predict(X_test)
y_pred
array([1190.105 , 1180.3666], dtype=float32)
y_true
18    1579.68
19    2088.14
Name: y, dtype: float64
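For reference, the training log above already names the supported reload path; a sketch using TabularPredictor.load on the run directory (which restores preprocessing too) rather than joblib on a raw model.pkl:
predictor_loaded = TabularPredictor.load("AutogluonModels/ag-20220331_070120")
y_pred_loaded = predictor_loaded.predict(test_data_reg_auto.drop(columns=['y']))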
## Only just noticed that Python functions don't need parameter types declared...
from sklearn.metrics import mean_squared_error, mean_absolute_error
def metrics(y_true, y_pred):
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    print("mae = {} , mse = {}".format(mae, mse))
## Pretty extreme numbers, though that seems about right
## Standardizing first would make these look better, but having gone with AutoGluon already, I couldn't be bothered
metrics(y_true,y_pred)
mae = 648.6742211914062 , mse = 487910.64153920766
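Since AutoGluon scores with root_mean_squared_error, the comparable RMSE is just the square root of the MSE above; a quick sketch:
import numpy as np
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # same units as the label
print("rmse =", rmse)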
Time Series
import numpy as np
data_time_series = np.array(data.iloc[:,-1])   # the y column as a 1-D array
import seaborn as sns
from matplotlib import pyplot as plt
ax = sns.lineplot(data = data_time_series)
plt.show()

Weighted Moving Average : Dumb
data_time_series_train = data_time_series[0:-2]
data_time_series_test = data_time_series[-2:]
data_time_series_train_array = np.array(data_time_series_train).reshape(1,18)
weight = np.array([0.1,0.2,0.2,0.5])
record = []
for x in range(2,0,-1):
    series = data_time_series_train_array[0,-5-x:-1-x]   # a sliding 4-point window
    res = np.multiply(series,weight)
    record.append(np.sum(res) + 0.5 * np.sum(res))
    print(" %d : %f" %(2015 - x + 1 , np.sum(res) + 0.5 * np.sum(res)))
    ## res + 0.5*res : the weighted sum plus a flat 50% uplift. Dumb.
metrics(y_true,record)
 2014 : 1088.397000
 2015 : 1406.899500
mae = 586.26175 , mse = 352723.802464625
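For contrast, a conventional recursive WMA weights the most recent four observations and feeds each forecast back in; a sketch (wma_forecast is hypothetical, not the code above):
def wma_forecast(history, weights):
    """One-step-ahead weighted moving average over the latest window."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                      # normalize the weights to sum to 1
    return float(np.dot(history[-len(w):], w))

history = list(data_time_series_train)
wma_preds = []
for _ in range(2):                       # the two held-out years
    step = wma_forecast(history, [0.1, 0.2, 0.2, 0.5])
    wma_preds.append(step)
    history.append(step)                 # recursive: reuse the forecast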
Torch : LSTM
Pad the output back to the task's data shape, and de-normalize it.
As if I could have written this from scratch myself...
# data_time_series : the y column sliced out earlier
data_time_series_lstm = data_time_series   # the normalization below builds a new array, so the original stays intact
# Min-max normalization to [0, 1], done by hand so max_value / min_value
# remain available for the inverse transform later
max_value = np.max(data_time_series_lstm)
min_value = np.min(data_time_series_lstm)
data_time_series_lstm = (data_time_series_lstm - min_value) / (max_value - min_value)
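The same transform via sklearn, for reference; a sketch showing MinMaxScaler also provides the inverse_transform needed later:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
normalized = scaler.fit_transform(data_time_series.reshape(-1, 1)).ravel()  # to [0, 1]
restored = scaler.inverse_transform(normalized.reshape(-1, 1)).ravel()     # back to the original scale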
ax = sns.lineplot(data = data_time_series_lstm)
plt.show()

import torch
DAYS_FOR_TRAIN = 5
def create_dataset(data, days_for_train=5) -> (np.array, np.array):
    """
        Build a supervised dataset from the sequence `data`.
        Each input is a window of length days_for_train and each output has length 1,
        i.e. days_for_train days of history predict the next day.
        For a sequence of length d this yields (d - days_for_train) input/output pairs.
        Row layout: [x1, ..., x_days_for_train] -> y
    """
    dataset_x, dataset_y = [], []
    for i in range(len(data)-days_for_train):
        _x = data[i:(i+days_for_train)]
        dataset_x.append(_x)
        dataset_y.append(data[i+days_for_train])
    return (np.array(dataset_x), np.array(dataset_y))
dataset_x, dataset_y = create_dataset(data_time_series_lstm, DAYS_FOR_TRAIN)
dataset_y
array([0.06092612, 0.07662843, 0.1023294 , 0.10094056, 0.1164847 ,
       0.13521675, 0.17001685, 0.20355662, 0.38260835, 0.38465949,
       0.51540328, 0.65947204, 0.72668008, 0.74869395, 1.        ])
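A quick shape check: a 20-point series with a 5-step window should yield 15 pairs:
print(dataset_x.shape, dataset_y.shape)  # expected: (15, 5) (15,)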
# Train/test split: the first 70% of the pairs go to training
train_size = int(len(dataset_x) * 0.7)
train_x = dataset_x[:train_size]
train_y = dataset_y[:train_size]
# Reshape: PyTorch RNNs consume input of shape (seq_len, batch_size, feature_size)
train_x = train_x.reshape(-1, 1, DAYS_FOR_TRAIN)
train_y = train_y.reshape(-1, 1, 1)
# Convert to PyTorch tensors
train_x = torch.from_numpy(train_x).to(torch.float32)
train_y = torch.from_numpy(train_y).to(torch.float32)
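With 15 pairs, train_size = int(15 * 0.7) = 10, so a quick sanity check on the tensors:
print(train_x.shape, train_y.shape)  # expected: torch.Size([10, 1, 5]) torch.Size([10, 1, 1])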
from torch import nn
class LSTM_Regression(nn.Module):
    """
        使用LSTM进行回归
        参数:
        - input_size: feature size
        - hidden_size: number of hidden units
        - output_size: number of output
        - num_layers: layers of LSTM to stack
    """
    def __init__(self, input_size, hidden_size, output_size=1, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers)
        self.fc = nn.Linear(hidden_size, output_size)
    def forward(self, _x):
        x, _ = self.lstm(_x)  # _x is input, size (seq_len, batch, input_size)
        s, b, h = x.shape  # x is output, size (seq_len, batch, hidden_size)
        x = x.view(s*b, h)
        x = self.fc(x)
        x = x.view(s, b, -1)  # reshape back to (seq_len, batch, output_size)
        return x
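A hypothetical forward-pass check (nn.LSTM defaults to batch_first=False, so the input is (seq_len, batch, input_size)):
dummy = torch.randn(3, 1, DAYS_FOR_TRAIN)                # seq_len=3, batch=1, input_size=5
print(LSTM_Regression(DAYS_FOR_TRAIN, 8)(dummy).shape)   # torch.Size([3, 1, 1])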
model = LSTM_Regression(DAYS_FOR_TRAIN, 8, output_size=1, num_layers=2)
loss_function = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
for i in range(100):
    out = model(train_x)
    loss = loss_function(out, train_y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    if (i+1) % 10 == 0:
        print('Epoch: {}, Loss:{:.5f}'.format(i+1, loss.item()))
Epoch: 10, Loss:0.00674
Epoch: 20, Loss:0.00327
Epoch: 30, Loss:0.00204
Epoch: 40, Loss:0.00116
Epoch: 50, Loss:0.00085
Epoch: 60, Loss:0.00071
Epoch: 70, Loss:0.00064
Epoch: 80, Loss:0.00059
Epoch: 90, Loss:0.00057
Epoch: 100, Loss:0.00055
model = model.eval()  # switch to evaluation mode
# Note: the model runs on the FULL series here; its output is DAYS_FOR_TRAIN shorter
# than the original, so pad the front to equal length before plotting.
dataset_x = dataset_x.reshape(-1, 1, DAYS_FOR_TRAIN)  # (seq_len, batch_size, feature_size)
dataset_x = torch.tensor(dataset_x).to(torch.float32)
with torch.no_grad():
    pred_test = model(dataset_x)  # output over the full series: (seq_len, batch_size, output_size)
pred_test = pred_test.view(-1).numpy()
pred_test = np.concatenate((np.zeros(DAYS_FOR_TRAIN), pred_test))  # zero-pad so lengths match
assert len(pred_test) == len(data_time_series_lstm)
plt.plot(pred_test, 'r', label='prediction')
plt.plot(data_time_series_lstm, 'b', label='real')
plt.plot((train_size, train_size), (0, 1), 'g--')
plt.legend(loc='best')
plt.show()

### De-normalize back to the original scale
pred_test = pred_test * (max_value - min_value) + min_value  # inverse of the min-max transform (the zero padding maps to min_value)
ax = sns.lineplot(data = pred_test)
plt.show()

pred_test
array([   0.        ,    0.        ,    0.        ,    0.        ,
          0.        ,  141.97164252,  181.31122203,  194.75042542,
        205.88793585,  228.55081975,  277.00616729,  366.82627218,
        508.25917347,  683.14472015,  851.27406525,  981.55267148,
       1076.44698354, 1146.16705891, 1205.26890924, 1250.06205714])
RIP.
metrics(data_time_series, pred_test)  # compare on the original scale (the first DAYS_FOR_TRAIN entries are just padding)
mae = 464.65907024157497 , mse = 414380.31475062034