翻译——1_Project Overview, Data Wrangling and Exploratory Analysis-checkpoint

为提高提高大学能源效率进行建筑能源需求预测

本文翻译哈佛大学的能源分析和预测报告,这是原文

暂无数据源,个人认为学习分析方法就足够

内容:

  1. 项目概述
  2. 了解数据
  3. 探索性分析
  4. 使用不同的机器学习方法进行预测
  5. 总结
  6. 结论
  7. 讨论

1. 项目概述

用机器学习来进行能源预测,希望能够节约能源

2. 了解数据

有三种类型的能源消耗,电力,冷水和热水。图显示了哈佛工厂供应的冷水和热水的建筑物。

图:哈佛的冷水和热水供应。(左:冷水,用蓝色标出。右边:热水,用黄色突出显示。)

我们选择了一栋建筑,并获得了2011年7月1日至2014年10月31日的能耗数据。由于仪表故障,有几个月的数据丢失。数据分辨率是每小时一次。在原始数据中,每小时的数据是仪表读数。为了得到每小时的消耗,我们需要抵消数据然后减去。我们有2012年1月到2014年10月的每小时天气和能源数据(2.75年)。天气数据来自剑桥气象站。

在本节中,我们将完成以下任务。

。手动从哈佛能源见证网站下载原始数据,获取每小时的电力、冷水和热水。

。干净的天气数据,增加了更多的功能,包括冷度,热度和湿度比。

。根据假期、学年和周末估算每日入住率。

。创建与小时相关的特性,即cos(hourOfDay * 2 * pi / 24)。

。合并电力、冷水和热水数据流与天气、时间和占用功能。

%matplotlib inline 

import requests 
from StringIO import StringIO
import numpy as np
import pandas as pd # pandas
import matplotlib.pyplot as plt # module for plotting 
import datetime as dt # module for manipulating dates and times
import numpy.linalg as lin # 执行线性代数运算的模块
from __future__ import division
from math import log10,exp

pd.options.display.mpl_style = 'default'

原始能耗数据

原始数据从哈佛能源见证网站下载

然后我们用Pandas 把它们放在一个dataframe里。

file = 'Data/Org/0701-0930-2011.xls' 
df = pd.read_excel(file, header = 0, skiprows = np.arange(0,6))

files = ['Data/Org/1101-1130-2011.xls', 
         'Data/Org/1201-2011-0131-2012.xls',
         'Data/Org/0201-0331-2012.xls',
         'Data/Org/0401-0531-2012.xls',
         'Data/Org/0101-0228-2013.xls',
         'Data/Org/0301-0430-2013.xls',
         'Data/Org/0501-0630-2013.xls',
         'Data/Org/0701-0831-2013.xls',
         'Data/Org/0901-1031-2013.xls',
         'Data/Org/1101-1231-2013.xls',
         'Data/Org/0101-0228-2014.xls',
         'Data/Org/0301-0430-2014.xls', 
         'Data/Org/0501-0630-2014.xls', 
         'Data/Org/0701-0831-2014.xls',
         'Data/Org/0901-1031-2014.xls']

for file in files:
    data = pd.read_excel(file, header = 0, skiprows = np.arange(0,6))
    df = df.append(data)

df.head()
WARNING *** file size (2481102) not 512 + multiple of sector size (512)
WARNING *** file size (848833) not 512 + multiple of sector size (512)
WARNING *** file size (1694257) not 512 + multiple of sector size (512)
WARNING *** file size (1640459) not 512 + multiple of sector size (512)
WARNING *** file size (1667907) not 512 + multiple of sector size (512)
WARNING *** file size (847258) not 512 + multiple of sector size (512)
WARNING *** file size (1691449) not 512 + multiple of sector size (512)
WARNING *** file size (1666647) not 512 + multiple of sector size (512)
WARNING *** file size (1665736) not 512 + multiple of sector size (512)
WARNING *** file size (1614814) not 512 + multiple of sector size (512)
WARNING *** file size (1665980) not 512 + multiple of sector size (512)
WARNING *** file size (1667276) not 512 + multiple of sector size (512)
WARNING *** file size (1691736) not 512 + multiple of sector size (512)
WARNING *** file size (1666704) not 512 + multiple of sector size (512)
WARNING *** file size (1665920) not 512 + multiple of sector size (512)
WARNING *** file size (1614900) not 512 + multiple of sector size (512)
WARNING *** file size (1666228) not 512 + multiple of sector size (512)
WARNING *** file size (1666191) not 512 + multiple of sector size (512)
WARNING *** file size (1691845) not 512 + multiple of sector size (512)
WARNING *** file size (1663846) not 512 + multiple of sector size (512)
Unnamed: 0 Unnamed: 1 Gund Bus-A 15 Min Block Demand - kW Gund Bus-A CurrentA - Amps Unnamed: 4 Unnamed: 5 Gund Bus-A CurrentB - Amps Unnamed: 7 Gund Bus-A CurrentC - Amps Unnamed: 9 ... Gund Main Demand - Tons Gund Main Energy - Ton-Days Gund Main FlowRate - gpm Gund Main FlowTotal - kgal(1000) Gund Main SignalAeration - Count Gund Main SignalStrength - Count Gund Main SonicVelocity - Ft/Sec Gund Main TempDelta - Deg F Gund Main TempReturn - Deg F Gund Main TempSupply - Deg F
0 2011-07-01 01:00:00 White 48.458733 65.977882 NaN NaN 52.631417 NaN 55.603840 NaN ... 4.677294 17912.537804 6.916454 48168.083414 0.693405 57.208127 1437.640543 16.238684 59.757447 43.516103
1 2011-07-01 02:00:00 White 40.472697 57.230223 NaN NaN 42.483092 NaN 50.243230 NaN ... 4.586403 17912.853518 6.739337 48168.645429 0.567355 57.082909 1438.030719 16.263573 59.710199 43.495128
2 2011-07-01 03:00:00 #d2e4b0 39.472809 55.487443 NaN NaN 41.911784 NaN 48.482163 NaN ... 4.462877 17913.169232 6.725142 48169.207444 0.441304 57.001646 1439.111130 15.797043 59.248158 43.457344
3 2011-07-01 04:00:00 White 39.198879 55.849806 NaN NaN 41.525529 NaN 48.987457 NaN ... 4.696993 17913.484946 7.041330 48169.769458 0.315254 57.000000 1440.768604 15.947392 59.207097 43.267682
4 2011-07-01 05:00:00 White 39.297522 55.736219 NaN NaN 41.299381 NaN 48.710408 NaN ... 4.550372 17913.800660 6.863004 48170.331473 0.189204 57.000000 1442.426077 15.903679 59.282707 43.372615

5 rows × 55 columns

以上是原始的每小时数据。

正如你所看到的,它很乱。首先要删除没有意义的列。

df.rename(columns={'Unnamed: 0':'Datetime'}, inplace=True)
nonBlankColumns = ['Unnamed' not in s for s in df.columns]
columns = df.columns[nonBlankColumns]
df = df[columns]
df = df.set_index(['Datetime'])
df.index.name = None
df.head()
Gund Bus-A 15 Min Block Demand - kW Gund Bus-A CurrentA - Amps Gund Bus-A CurrentB - Amps Gund Bus-A CurrentC - Amps Gund Bus-A CurrentN - Amps Gund Bus-A EnergyReal - kWhr Gund Bus-A Freq - Hertz Gund Bus-A Max Monthly Demand - kW Gund Bus-A PowerApp - kVA Gund Bus-A PowerReac - kVAR ... Gund Main Demand - Tons Gund Main Energy - Ton-Days Gund Main FlowRate - gpm Gund Main FlowTotal - kgal(1000) Gund Main SignalAeration - Count Gund Main SignalStrength - Count Gund Main SonicVelocity - Ft/Sec Gund Main TempDelta - Deg F Gund Main TempReturn - Deg F Gund Main TempSupply - Deg F
2011-07-01 01:00:00 48.458733 65.977882 52.631417 55.603840 15.982278 1796757.502803 59.837524 96.117915 48.757073 12.344712 ... 4.677294 17912.537804 6.916454 48168.083414 0.693405 57.208127 1437.640543 16.238684 59.757447 43.516103
2011-07-01 02:00:00 40.472697 57.230223 42.483092 50.243230 13.423762 1796800.145991 60.005569 96.117915 42.238685 12.967984 ... 4.586403 17912.853518 6.739337 48168.645429 0.567355 57.082909 1438.030719 16.263573 59.710199 43.495128
2011-07-01 03:00:00 39.472809 55.487443 41.911784 48.482163 13.478933 1796840.146023 59.833880 96.117915 41.278573 12.732046 ... 4.462877 17913.169232 6.725142 48169.207444 0.441304 57.001646 1439.111130 15.797043 59.248158 43.457344
2011-07-01 04:00:00 39.198879 55.849806 41.525529 48.987457 13.603309 1796879.023607 59.673044 96.117915 41.345776 12.687845 ... 4.696993 17913.484946 7.041330 48169.769458 0.315254 57.000000 1440.768604 15.947392 59.207097 43.267682
2011-07-01 05:00:00 39.297522 55.736219 41.299381 48.710408 13.797331 1796918.273558 59.986672 96.117915 41.166736 12.437842 ... 4.550372 17913.800660 6.863004 48170.331473 0.189204 57.000000 1442.426077 15.903679 59.282707 43.372615

5 rows × 48 columns

然后我们打印出所有的列名。只有几根柱子可用来获得每小时的电力、冷水和热水。

for item in df.columns:
    print item
Gund Bus-A 15 Min Block Demand - kW
Gund Bus-A CurrentA - Amps
Gund Bus-A CurrentB - Amps
Gund Bus-A CurrentC - Amps
Gund Bus-A CurrentN - Amps
Gund Bus-A EnergyReal - kWhr
Gund Bus-A Freq - Hertz
Gund Bus-A Max Monthly Demand - kW
Gund Bus-A PowerApp - kVA
Gund Bus-A PowerReac - kVAR
Gund Bus-A PowerReal - kW
Gund Bus-A TruePF - PF
Gund Bus-A VoltageAB - Volts
Gund Bus-A VoltageAN - Volts
Gund Bus-A VoltageBC - Volts
Gund Bus-A VoltageBN - Volts
Gund Bus-A VoltageCA - Volts
Gund Bus-A VoltageCN - Volts
Gund Bus-B 15 Min Block Demand - kW
Gund Bus-B CurrentA - Amps
Gund Bus-B CurrentB - Amps
Gund Bus-B CurrentC - Amps
Gund Bus-B CurrentN - Amps
Gund Bus-B EnergyReal - kWhr
Gund Bus-B Freq - Hertz
Gund Bus-B Max Monthly Demand - kW
Gund Bus-B PowerApp - kVA
Gund Bus-B PowerReac - kVAR
Gund Bus-B PowerReal - kW
Gund Bus-B TruePF - PF
Gund Bus-B VoltageAB - Volts
Gund Bus-B VoltageAN - Volts
Gund Bus-B VoltageBC - Volts
Gund Bus-B VoltageBN - Volts
Gund Bus-B VoltageCA - Volts
Gund Bus-B VoltageCN - Volts
Gund Condensate Counter - count
Gund Condensate FlowTotal - LBS
Gund Main Demand - Tons
Gund Main Energy - Ton-Days
Gund Main FlowRate - gpm
Gund Main FlowTotal - kgal(1000)
Gund Main SignalAeration - Count
Gund Main SignalStrength - Count
Gund Main SonicVelocity - Ft/Sec
Gund Main TempDelta - Deg F
Gund Main TempReturn - Deg F
Gund Main TempSupply - Deg F

电力

以电力为例,“Gund Bus A”和“Gund Bus B”。“EnergyReal - kWhr”记录累计消耗量。我们不确定什么是“PowerReal”。为了以防万一,我们也把它放进了电日计。

electricity=df[['Gund Bus-A EnergyReal - kWhr',
                'Gund Bus-B EnergyReal - kWhr',
                'Gund Bus-A PowerReal - kW',
                'Gund Bus-B PowerReal - kW',]]
electricity.head()
Gund Bus-A EnergyReal - kWhr Gund Bus-B EnergyReal - kWhr Gund Bus-A PowerReal - kW Gund Bus-B PowerReal - kW
2011-07-01 01:00:00 1796757.502803 3657811.582122 47.184015 63.486186
2011-07-01 02:00:00 1796800.145991 3657873.464938 40.208796 61.270542
2011-07-01 03:00:00 1796840.146023 3657934.837505 39.209866 61.464394
2011-07-01 04:00:00 1796879.023607 3657995.470348 39.378507 59.396581
2011-07-01 05:00:00 1796918.273558 3658054.470285 39.240837 58.911729

通过检查每月的能耗来验证我们的数据处理方法

为了检验我们对数据的理解是否正确,我们想从每小时的数据中计算出每个月的用电量,然后将结果与facalities提供的每个月的数据进行比较,这些数据也可以在Energy Witness上找到。

以下是facalities提供的月度数据,"Bus A & B"以月度形式称为"CE603B kWh"和"CE604B kWh"。请注意,查表周期不是公历月份。

file = 'Data/monthly electricity.csv' 
monthlyElectricityFromFacility = pd.read_csv(file, header=0)
monthlyElectricityFromFacility
monthlyElectricityFromFacility = monthlyElectricityFromFacility.set_index(['month'])
monthlyElectricityFromFacility.head()
startDate endDate CE603B kWh CE604B kWh
month
Jul 11 6/16/11 7/18/11 43968.1 106307.1
Aug 11 7/18/11 8/17/11 41270.1 83121.1
Sep 11 8/17/11 9/16/11 51514.1 107083.1
Oct 11 9/16/11 10/18/11 65338.1 114350.1
Nov 11 10/18/11 11/17/11 65453.1 115318.1

我们用“EnergyReal - kWhr”列表示两个电表。我们计算了查表周期的开始日期和结束日期的数字,用结束日期的数字减去开始日期的数字,就得到了每月的电量消耗。

monthlyElectricityFromFacility['startDate'] = pd.to_datetime(monthlyElectricityFromFacility['startDate'], format="%m/%d/%y")
values = monthlyElectricityFromFacility.index.values

keys = np.array(monthlyElectricityFromFacility['startDate'])

dates = {}
for key, value in zip(keys, values):
    dates[key] = value

sortedDates =  np.sort(dates.keys())
sortedDates = sortedDates[sortedDates > np.datetime64('2011-11-01')]

months = []
monthlyElectricityOrg = np.zeros((len(sortedDates) - 1, 2))
for i in range(len(sortedDates) - 1):
    begin = sortedDates[i]
    end = sortedDates[i+1]
    months.append(dates[sortedDates[i]])
    monthlyElectricityOrg[i, 0] = (np.round(electricity.loc[end,'Gund Bus-A EnergyReal - kWhr'] 
                           -  electricity.loc[begin,'Gund Bus-A EnergyReal - kWhr'], 1))
    monthlyElectricityOrg[i, 1] = (np.round(electricity.loc[end,'Gund Bus-B EnergyReal - kWhr'] 
                           -  electricity.loc[begin,'Gund Bus-B EnergyReal - kWhr'], 1))

monthlyElectricity = pd.DataFrame(data = monthlyElectricityOrg, index = months, columns = ['CE603B kWh', 'CE604B kWh'])


plt.figure()
fig, ax = plt.subplots()
fig = monthlyElectricity.plot(marker = 'o', figsize=(15,6), rot = 40, fontsize = 13, ax = ax, linestyle='')
fig.set_axis_bgcolor('w')
plt.xlabel('Billing month', fontsize = 15)
plt.ylabel('kWh', fontsize = 15)
plt.tick_params(which=u'major', reset=False, axis = 'y', labelsize = 13)
plt.xticks(np.arange(0,len(months)),months)
plt.title('Original monthly consumption from hourly data',fontsize = 17)

text = 'Meter malfunction'
ax.annotate(text, xy = (9, 4500000), 
            xytext = (5, 2), fontsize = 15,
            textcoords = 'offset points', ha = 'center', va = 'top')

ax.annotate(text, xy = (8, -4500000), 
            xytext = (5, 2), fontsize = 15, 
            textcoords = 'offset points', ha = 'center', va = 'bottom')

ax.annotate(text, xy = (14, -2500000), 
            xytext = (5, 2), fontsize = 15, 
            textcoords = 'offset points', ha = 'center', va = 'bottom')

ax.annotate(text, xy = (15, 2500000), 
            xytext = (5, 2), fontsize = 15, 
            textcoords = 'offset points', ha = 'center', va = 'top')

plt.show()

上面是使用我们的数据处理方法的每月用电量图。显然,这两个电表故障了好几个月。有两组圆点“CE603B”和“CE604B”,它们来自两个电表。有两个电表。它们的总和就是该建筑物的总耗电量。

monthlyElectricity.loc['Aug 12','CE604B kWh'] = np.nan
monthlyElectricity.loc['Sep 12','CE604B kWh'] = np.nan
monthlyElectricity.loc['Feb 13','CE603B kWh'] = np.nan
monthlyElectricity.loc['Mar 13','CE603B kWh'] = np.nan


fig,ax = plt.subplots(1, 1,figsize=(15,8))
#ax.set_axis_bgcolor('w')
plt.bar(np.arange(0, len(monthlyElectricity))-0.5,monthlyElectricity['CE603B kWh'], label='Our data processing from hourly data')
plt.plot(monthlyElectricityFromFacility.loc[months,'CE603B kWh'],'or', label='Facility data')
plt.xticks(np.arange(0,len(months)),months)
plt.xlabel('Month',fontsize=15)
plt.ylabel('kWh',fontsize=15)
plt.xlim([0, len(monthlyElectricity)])
plt.legend()
ax.set_xticklabels(months, rotation=40, fontsize=13)
plt.tick_params(which=u'major', reset=False, axis = 'y', labelsize = 15)
plt.title('Comparison between our data processing and facilities, Meter CE603B',fontsize=20)

text = 'Meter malfunction \n estimated by facility'
ax.annotate(text, xy = (14, monthlyElectricityFromFacility.loc['Feb 13','CE603B kWh']), 
            xytext = (5, 50), fontsize = 15, 
            arrowprops=dict(facecolor='black', shrink=0.15),
            textcoords = 'offset points', ha = 'center', va = 'bottom')

ax.annotate(text, xy = (15, monthlyElectricityFromFacility.loc['Mar 13','CE603B kWh']), 
            xytext = (5, 50), fontsize = 15, 
            arrowprops=dict(facecolor='black', shrink=0.15),
            textcoords = 'offset points', ha = 'center', va = 'bottom')
plt.show()

fig,ax = plt.subplots(1, 1,figsize=(15,8))
#ax.set_axis_bgcolor('w')
plt.bar(np.arange(0, len(monthlyElectricity))-0.5, monthlyElectricity['CE604B kWh'], label='Our data processing from hourly data')
plt.plot(monthlyElectricityFromFacility.loc[months,'CE604B kWh'],'or', label='Facility data')
plt.xticks(np.arange(0,len(months)),months)
plt.xlabel('Month',fontsize=15)
plt.ylabel('kWh',fontsize=15)
plt.xlim([0, len(monthlyElectricity)])
plt.legend()
ax.set_xticklabels(months, rotation=40, fontsize=13)
plt.tick_params(which=u'major', reset=False, axis = 'y', labelsize = 15)
plt.title('Comparison between our data processing and facilities, Meter CE604B',fontsize=20)


ax.annotate(text, xy = (9, monthlyElectricityFromFacility.loc['Sep 12','CE604B kWh']), 
            xytext = (5, 50), fontsize = 15, 
            arrowprops=dict(facecolor='black', shrink=0.15),
            textcoords = 'offset points', ha = 'center', va = 'bottom')

ax.annotate(text, xy = (8, monthlyElectricityFromFacility.loc['Aug 12','CE604B kWh']), 
            xytext = (5, 50), fontsize = 15, 
            arrowprops=dict(facecolor='black', shrink=0.15),
            textcoords = 'offset points', ha = 'center', va = 'bottom')
plt.show()

图中的点是设施每月提供的数据。它们是钞票上的账单。当电表发生故障时,该设施只是对表的能量消耗进行估算。我们排除的数据点不在图中。它们只是比正常水平高得多(是其他正常消费水平的30倍)。“仪表故障,设施预估”是指本月仪表出现故障,这个点是设施预估的。

2013年2月和3月,电表“CE603B”出现故障。2012年8月和9月,电表“CE604B”出现故障。我们把这几个月的关键字设为np.nan。简单地把他们排除在回归之外。再次绘制它们,并与设施月度数据进行比较。 当然,当仪表出故障时除外。在这几个月里,这些设施为了记账而估算能源消耗。对于我们来说,我们只是在回归中剔除了这些月份的每小时和每天的数据点。

建筑总用电量是这电表的总和。为了以防万一,我们把“power”和“energy”放在一起比较

electricity['energy'] = electricity['Gund Bus-A EnergyReal - kWhr'] + electricity['Gund Bus-B EnergyReal - kWhr']
electricity['power'] = electricity['Gund Bus-A PowerReal - kW'] + electricity['Gund Bus-B PowerReal - kW']
electricity.head()
Gund Bus-A EnergyReal - kWhr Gund Bus-B EnergyReal - kWhr Gund Bus-A PowerReal - kW Gund Bus-B PowerReal - kW energy power startTime endTime
2011-07-01 00:00:00 NaN NaN NaN NaN NaN NaN 2011-07-01 00:00:00 2011-07-01 01:00:00
2011-07-01 01:00:00 1796757.502803 3657811.582122 47.184015 63.486186 5454569.084925 110.670201 2011-07-01 01:00:00 2011-07-01 02:00:00
2011-07-01 02:00:00 1796800.145991 3657873.464938 40.208796 61.270542 5454673.610929 101.479338 2011-07-01 02:00:00 2011-07-01 03:00:00
2011-07-01 03:00:00 1796840.146023 3657934.837505 39.209866 61.464394 5454774.983528 100.674260 2011-07-01 03:00:00 2011-07-01 04:00:00
2011-07-01 04:00:00 1796879.023607 3657995.470348 39.378507 59.396581 5454874.493955 98.775088 2011-07-01 04:00:00 2011-07-01 05:00:00

获得每小时消耗

这个小时和下一个小时之间的每小时能耗是下一个小时减去这个小时的电表kWh读数(dataframe中的“energy”)。我们假设电表读数是在一小时结束时记录的。为了避免混淆,我们还在dataframe中标记了抄表的开始时间和结束时间。我们将下一小时的每小时电量与推导出的每小时电量进行了比较,大多数时候,电量接近每小时电量。有时,会有很大的不同。

电表记录了累计能耗。假设今天一开始,数字是10。一小时后,用电量为1,然后电表号加1变成11,等等。所以仪表的数量应该不断增加。然而,我们发现,有时,水表读数突然下降到0,过了一段时间,又得到一个很高的正数。这将产生负的每小时消费,然后是一个荒谬的高正电荷数。我们用“t + 1时刻的电表读数- t时刻的电表读数”来计算每小时/每天的能源消耗。如果结果是负数,或者高得离谱,我们就认为仪表出了故障。我们把这些数据点设为np.nan。通过查看excel文件中的原始数据和每月的图表,我们计算出了仪表发生故障的确切日期。

除此之外,偶尔还会有更多的无意义数据点,它们比正常值高出一千倍。正常的小时消耗量范围在100到400之间。我们创建了一个过滤器:“index = abs(hourlyEnergy) < 200000”,这意味着只保留低于200000的值。

# 如果有任何遗漏的小时,重新索引,以获得整个时间跨度。填写nan数据。
hourlyTimestamp = pd.date_range(start = '2011/7/1', end = '2014/10/31', freq = 'H')

# 不知何故,重索引并不能很好地工作。2011年10月和其他几个小时的失踪。
# 基本上就是原始长度的长度。
# electricity.reindex(hourlyTimestamp, inplace = True, fill_value = np.nan)

startTime = hourlyTimestamp
endTime = hourlyTimestamp + np.timedelta64(1,'h')
hourlyTime = pd.DataFrame(data = np.transpose([startTime, endTime]), index = hourlyTimestamp, columns = ['startTime', 'endTime'])

electricity = electricity.join(hourlyTime, how = 'outer')

# 以防万一,为了使用diff方法,时间戳必须是按升序排列的。
electricity.sort_index(inplace = True)
hourlyEnergy = electricity.diff(periods=1)['energy']

hourlyElectricity = pd.DataFrame(data = hourlyEnergy.values, index = hourlyEnergy.index, columns = ['electricity-kWh'])
hourlyElectricity = hourlyElectricity.join(hourlyTime, how = 'inner')

print "Data length: ", len(hourlyElectricity)/24, " days"
hourlyElectricity.head()

Data length:  1218.04166667  days
electricity-kWh startTime endTime
2011-07-01 00:00:00 NaN 2011-07-01 00:00:00 2011-07-01 01:00:00
2011-07-01 01:00:00 NaN 2011-07-01 01:00:00 2011-07-01 02:00:00
2011-07-01 02:00:00 104.526004 2011-07-01 02:00:00 2011-07-01 03:00:00
2011-07-01 03:00:00 101.372599 2011-07-01 03:00:00 2011-07-01 04:00:00
2011-07-01 04:00:00 99.510428 2011-07-01 04:00:00 2011-07-01 05:00:00

以上是小时耗电量。数据被导出到excel文件中。我们假设电表读数是在一小时结束时记录的。

# 过滤数据,保留NaN并生成两个excels,有NaN和没有NaN

hourlyElectricity.loc[abs(hourlyElectricity['electricity-kWh']) > 100000,'electricity-kWh'] = np.nan

time = hourlyElectricity.index
index = ((time > np.datetime64('2012-07-26')) & (time < np.datetime64('2012-08-18'))) \
        | ((time > np.datetime64('2013-01-21')) & (time < np.datetime64('2013-03-08')))

hourlyElectricity.loc[index,'electricity-kWh'] = np.nan
hourlyElectricityWithoutNaN = hourlyElectricity.dropna(axis=0, how='any')

hourlyElectricity.to_excel('Data/hourlyElectricity.xlsx')
hourlyElectricityWithoutNaN.to_excel('Data/hourlyElectricityWithoutNaN.xlsx')
plt.figure()
fig = hourlyElectricity.plot(fontsize = 15, figsize = (15, 6))
plt.tick_params(which=u'major', reset=False, axis = 'y', labelsize = 15)
fig.set_axis_bgcolor('w')
plt.title('All the hourly electricity data', fontsize = 16)
plt.ylabel('kWh')
plt.show()

plt.figure()
fig = hourlyElectricity.iloc[26200:27400,:].plot(marker = 'o',label='hourly electricity', fontsize = 15, figsize = (15, 6))
plt.tick_params(which=u'major', reset=False, axis = 'y', labelsize = 15)
fig.set_axis_bgcolor('w')
plt.title('Hourly electricity data of selected days', fontsize = 16)
plt.ylabel('kWh')
plt.legend()
plt.show()

上面是每小时的数据图。在第一个图中,空白部分是由于仪表故障时数据丢失造成的。在第二张图中,你可以看到白天和晚上,工作日和周末的区别。

获得日常消耗

我们通过“reindex”来获得每天的用电量。

dailyTimestamp = pd.date_range(start = '2011/7/1', end = '2014/10/31', freq = 'D')
electricityReindexed = electricity.reindex(dailyTimestamp, inplace = False)

# 以防万一,为了使用diff方法,时间戳必须是按升序排列的。
electricityReindexed.sort_index(inplace = True)
dailyEnergy = electricityReindexed.diff(periods=1)['energy']

dailyElectricity = pd.DataFrame(data = dailyEnergy.values, index = electricityReindexed.index - np.timedelta64(1,'D'), columns = ['electricity-kWh'])
dailyElectricity['startDay'] = dailyElectricity.index
dailyElectricity['endDay'] = dailyElectricity.index + np.timedelta64(1,'D')

# 过滤数据,保留NaN并生成两个excels,有NaN和没有NaN
dailyElectricity.loc[abs(dailyElectricity['electricity-kWh']) > 2000000,'electricity-kWh'] = np.nan

time = dailyElectricity.index
index = ((time > np.datetime64('2012-07-26')) & (time < np.datetime64('2012-08-18'))) | ((time > np.datetime64('2013-01-21')) & (time < np.datetime64('2013-03-08')))

dailyElectricity.loc[index,'electricity-kWh'] = np.nan
dailyElectricityWithoutNaN = dailyElectricity.dropna(axis=0, how='any')

dailyElectricity.to_excel('Data/dailyElectricity.xlsx')
dailyElectricityWithoutNaN.to_excel('Data/dailyElectricityWithoutNaN.xlsx')

dailyElectricity.head()
electricity-kWh startDay endDay
2011-06-30 NaN 2011-06-30 2011-07-01
2011-07-01 NaN 2011-07-01 2011-07-02
2011-07-02 3630.398480 2011-07-02 2011-07-03
2011-07-03 3750.648885 2011-07-03 2011-07-04
2011-07-04 4568.724044 2011-07-04 2011-07-05

以上是日用电量。

plt.figure()
fig = dailyElectricity.plot(figsize = (15, 6))
fig.set_axis_bgcolor('w')
plt.title('All the daily electricity data', fontsize = 16)
plt.ylabel('kWh')
plt.show()

plt.figure()
fig = dailyElectricity.iloc[1000:1130,:].plot(marker = 'o', figsize = (15, 6))
fig.set_axis_bgcolor('w')
plt.title('Daily electricity data of selected days', fontsize = 16)
plt.ylabel('kWh')
plt.show()

上面是每日电量图。

超低耗电量发生在学期结束后,包括圣诞节假期。另外,如图2所示,开学时夏季的能耗较低。

冷水

我们处理冷水数据的方式与电力相同。

chilledWater = df[['Gund Main Energy - Ton-Days']]
chilledWater.head()
Gund Main Energy - Ton-Days
2011-07-01 01:00:00 17912.537804
2011-07-01 02:00:00 17912.853518
2011-07-01 03:00:00 17913.169232
2011-07-01 04:00:00 17913.484946
2011-07-01 05:00:00 17913.800660
file = 'Data/monthly chilled water.csv' 
monthlyChilledWaterFromFacility = pd.read_csv(file, header=0)
monthlyChilledWaterFromFacility.set_index(['month'], inplace = True)
monthlyChilledWaterFromFacility.head()
startDate endDate chilledWater
month
11-Jul 6/12/11 7/12/11 2258
11-Aug 7/12/11 8/12/11 2095
11-Sep 8/12/11 9/12/11 2200
11-Oct 9/12/11 10/12/11 1664
11-Nov 10/12/11 11/12/11 447
monthlyChilledWaterFromFacility['startDate'] = pd.to_datetime(monthlyChilledWaterFromFacility['startDate'], format="%m/%d/%y")
values = monthlyChilledWaterFromFacility.index.values

keys = np.array(monthlyChilledWaterFromFacility['startDate'])

dates = {}
for key, value in zip(keys, values):
    dates[key] = value

sortedDates =  np.sort(dates.keys())
sortedDates = sortedDates[sortedDates > np.datetime64('2011-11-01')]

months = []
monthlyChilledWaterOrg = np.zeros((len(sortedDates) - 1))
for i in range(len(sortedDates) - 1):
    begin = sortedDates[i]
    end = sortedDates[i+1]
    months.append(dates[sortedDates[i]])
    monthlyChilledWaterOrg[i] = (np.round(chilledWater.loc[end,:] -  chilledWater.loc[begin,:], 1))
    

monthlyChilledWater = pd.DataFrame(data = monthlyChilledWaterOrg, index = months, columns = ['chilledWater-TonDays'])

fig,ax = plt.subplots(1, 1,figsize=(15,8))
#ax.set_axis_bgcolor('w')
#plt.plot(monthlyChilledWater, label='Our data processing from hourly data', marker = 'x', markersize = 15, linestyle = '')
plt.bar(np.arange(len(monthlyChilledWater))-0.5, monthlyChilledWater.values, label='Our data processing from hourly data')
plt.plot(monthlyChilledWaterFromFacility[5:-1]['chilledWater'],'or', label='Facility data')
plt.xticks(np.arange(0,len(months)),months)
plt.xlabel('Month',fontsize=15)
plt.ylabel('kWh',fontsize=15)
plt.xlim([0,len(months)])
plt.legend()
ax.set_xticklabels(months, rotation=40, fontsize=13)
plt.tick_params(which=u'major', reset=False, axis = 'y', labelsize = 15)
plt.title('Data Validation: comparison between our data processing and facilities',fontsize=20)

text = 'Match! Our processing method is valid.'
ax.annotate(text, xy = (15, 2000), 
            xytext = (5, 50), fontsize = 15, 
            textcoords = 'offset points', ha = 'center', va = 'bottom')

plt.show()

hourlyTimestamp = pd.date_range(start = '2011/7/1', end = '2014/10/31', freq = 'H')
chilledWater.reindex(hourlyTimestamp, inplace = True)

# 以防万一,为了使用diff方法,时间戳必须是按升序排列的。
chilledWater.sort_index(inplace = True)
hourlyEnergy = chilledWater.diff(periods=1)

hourlyChilledWater = pd.DataFrame(data = hourlyEnergy.values, index = hourlyEnergy.index, columns = ['chilledWater-TonDays'])
hourlyChilledWater['startTime'] = hourlyChilledWater.index
hourlyChilledWater['endTime'] = hourlyChilledWater.index + np.timedelta64(1,'h')

hourlyChilledWater.loc[abs(hourlyChilledWater['chilledWater-TonDays']) > 50,'chilledWater-TonDays'] = np.nan

hourlyChilledWaterWithoutNaN = hourlyChilledWater.dropna(axis=0, how='any')

hourlyChilledWater.to_excel('Data/hourlyChilledWater.xlsx')
hourlyChilledWaterWithoutNaN.to_excel('Data/hourlyChilledWaterWithoutNaN.xlsx')

plt.figure()
fig = hourlyChilledWater.plot(fontsize = 15, figsize = (15, 6))
plt.tick_params(which=u'major', reset=False, axis = 'y', labelsize = 15)
fig.set_axis_bgcolor('w')
plt.title('All the hourly chilled water data', fontsize = 16)
plt.ylabel('Ton-Days')
plt.show()

hourlyChilledWater.head()

chilledWater-TonDays startTime endTime
2011-07-01 01:00:00 NaN 2011-07-01 01:00:00 2011-07-01 02:00:00
2011-07-01 02:00:00 0.315714 2011-07-01 02:00:00 2011-07-01 03:00:00
2011-07-01 03:00:00 0.315714 2011-07-01 03:00:00 2011-07-01 04:00:00
2011-07-01 04:00:00 0.315714 2011-07-01 04:00:00 2011-07-01 05:00:00
2011-07-01 05:00:00 0.315714 2011-07-01 05:00:00 2011-07-01 06:00:00
dailyTimestamp = pd.date_range(start = '2011/7/1', end = '2014/10/31', freq = 'D')
chilledWaterReindexed = chilledWater.reindex(dailyTimestamp, inplace = False)

chilledWaterReindexed.sort_index(inplace = True)
dailyEnergy = chilledWaterReindexed.diff(periods=1)['Gund Main Energy - Ton-Days']

dailyChilledWater = pd.DataFrame(data = dailyEnergy.values, index = chilledWaterReindexed.index - np.timedelta64(1,'D'), columns = ['chilledWater-TonDays'])
dailyChilledWater['startDay'] = dailyChilledWater.index
dailyChilledWater['endDay'] = dailyChilledWater.index + np.timedelta64(1,'D')

dailyChilledWaterWithoutNaN = dailyChilledWater.dropna(axis=0, how='any')

dailyChilledWater.to_excel('Data/dailyChilledWater.xlsx')
dailyChilledWaterWithoutNaN.to_excel('Data/dailyChilledWaterWithoutNaN.xlsx')

plt.figure()
fig = dailyChilledWater.plot(fontsize = 15, figsize = (15, 6))
plt.tick_params(which=u'major', reset=False, axis = 'y', labelsize = 15)
fig.set_axis_bgcolor('w')
plt.title('All the daily chilled water data', fontsize = 16)
plt.ylabel('Ton-Days')
plt.show()


dailyChilledWater.head()

10

chilledWater-TonDays startDay endDay
2011-06-30 NaN 2011-06-30 2011-07-01
2011-07-01 NaN 2011-07-01 2011-07-02
2011-07-02 54.741028 2011-07-02 2011-07-03
2011-07-03 55.649728 2011-07-03 2011-07-04
2011-07-04 109.049077 2011-07-04 2011-07-05

热水

steam = df[['Gund Condensate FlowTotal - LBS']]
steam.head()
Gund Condensate FlowTotal - LBS
2011-07-01 01:00:00 15443350.388455
2011-07-01 02:00:00 15443459.322917
2011-07-01 03:00:00 15443574.687500
2011-07-01 04:00:00 15443701.953125
2011-07-01 05:00:00 15443818.359375
file = 'Data/monthly steam.csv' 
monthlySteamFromFacility = pd.read_csv(file, header=0)
monthlySteamFromFacility.set_index(['month'], inplace = True)
monthlySteamFromFacility.head()
startDate endDate steam
month
Jul 11 6/17/2011 7/21/2011 0.0
Aug 11 7/21/2011 8/20/2011 0.0
Sep 11 8/20/2011 9/17/2011 0.0
Oct 11 9/17/2011 10/20/2011 246.5
Nov 11 10/20/2011 11/20/2011 786.1
monthlySteamFromFacility['startDate'] = pd.to_datetime(monthlySteamFromFacility['startDate'], format="%m/%d/%Y")
values = monthlySteamFromFacility.index.values

keys = np.array(monthlySteamFromFacility['startDate'])

dates = {}
for key, value in zip(keys, values):
    dates[key] = value

sortedDates =  np.sort(dates.keys())
sortedDates = sortedDates[sortedDates > np.datetime64('2011-11-01')]

months = []
monthlySteamOrg = np.zeros((len(sortedDates) - 1))
for i in range(len(sortedDates) - 1):
    begin = sortedDates[i]
    end = sortedDates[i+1]
    months.append(dates[sortedDates[i]])
    monthlySteamOrg[i] = (np.round(steam.loc[end,:] -  steam.loc[begin,:], 1))
    

monthlySteam = pd.DataFrame(data = monthlySteamOrg, index = months, columns = ['steam-LBS'])

# 867 LBS ~= 1MMBTU steam

fig,ax = plt.subplots(1, 1,figsize=(15,8))
#ax.set_axis_bgcolor('w')
#plt.plot(monthlySteam/867, label='Our data processing from hourly data')
plt.bar(np.arange(len(monthlySteam))-0.5, monthlySteam.values/867, label='Our data processing from hourly data')
plt.plot(monthlySteamFromFacility.loc[months,'steam'],'or', label='Facility data')
plt.xticks(np.arange(0,len(months)),months)
plt.xlabel('Month',fontsize=15)
plt.ylabel('Steam (MMBTU)',fontsize=15)
plt.xlim([0,len(months)])
plt.legend()
ax.set_xticklabels(months, rotation=40, fontsize=13)
plt.tick_params(which=u'major', reset=False, axis = 'y', labelsize = 15)
plt.title('Comparison between our data processing and facilities - Steam',fontsize=20)

text = 'Match! Our processing method is valid.'
ax.annotate(text, xy = (9, 1500), 
            xytext = (5, 50), fontsize = 15, 
            textcoords = 'offset points', ha = 'center', va = 'bottom')

plt.show()

9.5

hourlyTimestamp = pd.date_range(start = '2011/7/1', end = '2014/10/31', freq = 'H')
steam.reindex(hourlyTimestamp, inplace = True)

# Just in case, in order to use diff method, timestamp has to be in asending order.
steam.sort_index(inplace = True)
hourlyEnergy = steam.diff(periods=1)

hourlySteam = pd.DataFrame(data = hourlyEnergy.values, index = hourlyEnergy.index, columns = ['steam-LBS'])
hourlySteam['startTime'] = hourlySteam.index
hourlySteam['endTime'] = hourlySteam.index + np.timedelta64(1,'h')

hourlySteam.loc[abs(hourlySteam['steam-LBS']) > 100000,'steam-LBS'] = np.nan

plt.figure()
fig = hourlySteam.plot(fontsize = 15, figsize = (15, 6))
plt.tick_params(which=u'major', reset=False, axis = 'y', labelsize = 17)
fig.set_axis_bgcolor('w')
plt.title('All the hourly steam data', fontsize = 16)
plt.ylabel('LBS')
plt.show()

hourlySteamWithoutNaN = hourlySteam.dropna(axis=0, how='any')

hourlySteam.to_excel('Data/hourlySteam.xlsx')
hourlySteamWithoutNaN.to_excel('Data/hourlySteamWithoutNaN.xlsx')

hourlySteam.head()

steam-LBS startTime endTime
2011-07-01 01:00:00 NaN 2011-07-01 01:00:00 2011-07-01 02:00:00
2011-07-01 02:00:00 108.934462 2011-07-01 02:00:00 2011-07-01 03:00:00
2011-07-01 03:00:00 115.364583 2011-07-01 03:00:00 2011-07-01 04:00:00
2011-07-01 04:00:00 127.265625 2011-07-01 04:00:00 2011-07-01 05:00:00
2011-07-01 05:00:00 116.406250 2011-07-01 05:00:00 2011-07-01 06:00:00
dailyTimestamp = pd.date_range(start = '2011/7/1', end = '2014/10/31', freq = 'D')
steamReindexed = steam.reindex(dailyTimestamp, inplace = False)

steamReindexed.sort_index(inplace = True)
dailyEnergy = steamReindexed.diff(periods=1)['Gund Condensate FlowTotal - LBS']

dailySteam = pd.DataFrame(data = dailyEnergy.values, index = steamReindexed.index - np.timedelta64(1,'D'), columns = ['steam-LBS'])
dailySteam['startDay'] = dailySteam.index
dailySteam['endDay'] = dailySteam.index + np.timedelta64(1,'D')

plt.figure()
fig = dailySteam.plot(fontsize = 15, figsize = (15, 6))
plt.tick_params(which=u'major', reset=False, axis = 'y', labelsize = 15)
fig.set_axis_bgcolor('w')
plt.title('All the daily steam data', fontsize = 16)
plt.ylabel('LBS')
plt.show()

dailySteamWithoutNaN = dailyChilledWater.dropna(axis=0, how='any')

dailySteam.to_excel('Data/dailySteam.xlsx')
dailySteamWithoutNaN.to_excel('Data/dailySteamWithoutNaN.xlsx')

dailySteam.head()

steam-LBS startDay endDay
2011-06-30 NaN 2011-06-30 2011-07-01
2011-07-01 NaN 2011-07-01 2011-07-02
2011-07-02 3250.651042 2011-07-02 2011-07-03
2011-07-03 3271.276042 2011-07-03 2011-07-04
2011-07-04 3236.718750 2011-07-04 2011-07-05

天气数据

原始数据

天气有两个数据源。

  • 2014年: GSD(设计研究生院)楼顶上的天气。
  • 2012 & 2013年: 从位于Cambridge, MA的气象站购买天气数据。请注意,2012年和2013年的数据来自不同的气象站。

这是经过一些清洗后的原始天气数据,包括单位转换。间隔是5分钟。

weather2014 = pd.read_excel('Data/weather-2014.xlsx')
weather2014.head()
Datetime T-C RH-% Tdew-C windDirection windSpeed-m/s pressure-mbar solarRadiation-W/m2
0 2014-01-01 00:00:00 -5.02 53.2 -13.07 256 4.5 1020.8 1
1 2014-01-01 00:05:00 -5.14 54.3 -12.93 257 4.0 1020.7 1
2 2014-01-01 00:10:00 -5.08 53.4 -13.08 258 3.5 1021.1 1
3 2014-01-01 00:15:00 -5.17 52.6 -13.35 257 2.8 1021.0 1
4 2014-01-01 00:20:00 -5.23 52.9 -13.33 248 2.5 1021.2 1

通过重采样方法转换为每小时一次。

这是2014年每小时的天气数据。

weather2014 = weather2014.set_index('Datetime')
weather2014 = weather2014.resample('H')
weather2014.head()
T-C RH-% Tdew-C windDirection windSpeed-m/s pressure-mbar solarRadiation-W/m2
Datetime
2014-01-01 00:00:00 -5.281667 52.858333 -13.393333 253.500000 2.775000 1021.158333 1
2014-01-01 01:00:00 -5.725000 51.650000 -14.090833 253.000000 2.350000 1021.700000 1
2014-01-01 02:00:00 -6.002500 50.766667 -14.555833 239.250000 1.133333 1022.708333 1
2014-01-01 03:00:00 -6.320833 49.675000 -15.114167 234.333333 1.191667 1023.233333 1
2014-01-01 04:00:00 -6.535833 49.708333 -15.305833 227.333333 1.633333 1023.925000 1

这里是2013年和2014年的原始天气数据,经过一些清洗,包括单位转换。可以看出,Datetime格式不正确。

weather2012and2013 = pd.read_excel('Data/weather-2012-2013.xlsx')
weather2012and2013.head()
Datetime RH-% windDirection solarRadiation-W/m2 T-C Tdew-C pressure-mbar windSpeed-m/s
0 2012-01-01-01 87 310 0 4 1.9 1004 4.166670
1 2012-01-01-02 87 280 0 4 1.9 1005 4.166670
2 2012-01-01-03 81 270 0 5 1.9 1006 4.166670
3 2012-01-01-04 76 290 0 6 1.9 1007 4.722226
4 2012-01-01-05 87 280 0 4 1.9 1007 3.055558

这是清洗后每小时的数据。

weather2012and2013['Datetime'] = pd.to_datetime(weather2012and2013['Datetime'], format='%Y-%m-%d-%H')
weather2012and2013 = weather2012and2013.set_index('Datetime')
weather2012and2013.head()
RH-% windDirection solarRadiation-W/m2 T-C Tdew-C pressure-mbar windSpeed-m/s
Datetime
2012-01-01 01:00:00 87 310 0 4 1.9 1004 4.166670
2012-01-01 02:00:00 87 280 0 4 1.9 1005 4.166670
2012-01-01 03:00:00 81 270 0 5 1.9 1006 4.166670
2012-01-01 04:00:00 76 290 0 6 1.9 1007 4.722226
2012-01-01 05:00:00 87 280 0 4 1.9 1007 3.055558
每小时天气数据

合并两个文件,增加更多的功能,包括冷度,热度,湿度比和除湿。

这是每小时的天气数据。

# 合并两个天气文件
hourlyWeather = weather2014.append(weather2012and2013)
hourlyWeather.index.name = None
hourlyWeather.sort_index(inplace = True)

# 增加更多的特征
# 将相对湿度转换为相对湿度

Mw=18.0160 # 水的分子量
Md=28.9660 # 干燥空气的分子量
R =  8.31432E3 # 气体常数
Rd = R/Md # 干燥空气的比气体常数
Rv = R/Mw # 蒸汽的比气体常数
Lv = 2.5e6 # 水蒸气凝结放热[J kg-1]
eps = Mw/Md

# 饱和压力
def esat(T):
    ''' 给定空气温度(单位[K]),得到给定压力(单位[Pa]) '''
    from numpy import log10
    TK = 273.15
    e1 = 101325.0
    logTTK = log10(T/TK)
    esat =  e1*10**(10.79586*(1-TK/T)-5.02808*logTTK+ 1.50474*1e-4*(1.-10**(-8.29692*(T/TK-1)))+ 0.42873*1e-3*(10**(4.76955*(1-TK/T))-1)-2.2195983) 
    return esat

def rh2sh(RH,p,T):
    '''用途:将相对湿度(无单位)换算成比湿度(湿度比)[kg/kg]'''
    es = esat(T)
    W = Mw/Md*RH*es/(p-RH*es)

    return W/(1.+W)


p = hourlyWeather['pressure-mbar'] * 100
RH = hourlyWeather['RH-%'] / 100
T = hourlyWeather['T-C'] + 273.15
w = rh2sh(RH,p,T)

hourlyWeather['humidityRatio-kg/kg'] = w
hourlyWeather['coolingDegrees'] = hourlyWeather['T-C'] - 12
hourlyWeather.loc[hourlyWeather['coolingDegrees'] < 0, 'coolingDegrees'] = 0

hourlyWeather['heatingDegrees'] = 15 - hourlyWeather['T-C']
hourlyWeather.loc[hourlyWeather['heatingDegrees'] < 0, 'heatingDegrees'] = 0

hourlyWeather['dehumidification'] = hourlyWeather['humidityRatio-kg/kg'] - 0.00886
hourlyWeather.loc[hourlyWeather['dehumidification'] < 0, 'dehumidification'] = 0

#hourlyWeather.to_excel('Data/hourlyWeather.xlsx')
hourlyWeather.head()
RH-% T-C Tdew-C pressure-mbar solarRadiation-W/m2 windDirection windSpeed-m/s humidityRatio-kg/kg coolingDegrees heatingDegrees dehumidification
2012-01-01 01:00:00 87 4 1.9 1004 0 310 4.166670 0.004396 0 11 0
2012-01-01 02:00:00 87 4 1.9 1005 0 280 4.166670 0.004391 0 11 0
2012-01-01 03:00:00 81 5 1.9 1006 0 270 4.166670 0.004380 0 10 0
2012-01-01 04:00:00 76 6 1.9 1007 0 290 4.722226 0.004401 0 9 0
2012-01-01 05:00:00 87 4 1.9 1007 0 280 3.055558 0.004382 0 11 0
plt.figure()
fig = hourlyWeather.plot(y = 'T-C', figsize = (15, 6))
fig.set_axis_bgcolor('w')
plt.title('All hourly temperture', fontsize = 16)
plt.ylabel(r'Temperature ($\circ$C)')
plt.show()

plt.figure()
fig = hourlyWeather.plot(y = 'solarRadiation-W/m2', figsize = (15, 6))
fig.set_axis_bgcolor('w')
plt.title('All hourly solar radiation', fontsize = 16)
plt.ylabel(r'$W/m^2$', fontsize = 13)
plt.show()

plt.figure()
fig = hourlyWeather['2014-10'].plot(y = 'T-C', figsize = (15, 6), marker = 'o')
fig.set_axis_bgcolor('w')
plt.title('Selected hourly temperture',fontsize = 16)
plt.ylabel(r'Temperature ($\circ$C)',fontsize = 13)
plt.show()

plt.figure()
fig = hourlyWeather['2014-10'].plot(y = 'solarRadiation-W/m2', figsize = (15, 6), marker ='o')
fig.set_axis_bgcolor('w')
plt.title('Selected hourly solar radiation', fontsize = 16)
plt.ylabel(r'$W/m^2$', fontsize = 13)
plt.show()

每天的天气数据
dailyWeather = hourlyWeather.resample('D')
#dailyWeather.to_excel('Data/dailyWeather.xlsx')
dailyWeather.head()
RH-% T-C Tdew-C pressure-mbar solarRadiation-W/m2 windDirection windSpeed-m/s humidityRatio-kg/kg coolingDegrees heatingDegrees dehumidification
2012-01-01 76.652174 7.173913 3.073913 1004.956522 95.260870 236.086957 4.118361 0.004796 0 7.826087 0
2012-01-02 55.958333 5.833333 -2.937500 994.625000 87.333333 253.750000 5.914357 0.003415 0 9.166667 0
2012-01-03 42.500000 -3.208333 -12.975000 1002.125000 95.708333 302.916667 6.250005 0.001327 0 18.208333 0
2012-01-04 41.541667 -7.083333 -16.958333 1008.250000 98.750000 286.666667 5.127319 0.000890 0 22.083333 0
2012-01-05 46.916667 -0.583333 -9.866667 1002.041667 90.750000 258.333333 5.162041 0.001746 0 15.583333 0
plt.figure()
fig = dailyWeather.plot(y = 'T-C', figsize = (15, 6), marker ='o')
fig.set_axis_bgcolor('w')
plt.title('All daily temperture', fontsize = 16)
plt.ylabel(r'Temperature ($\circ$C)', fontsize = 13)
plt.show()

plt.figure()
fig = dailyWeather['2014'].plot(y = 'T-C', figsize = (15, 6), marker ='o')
fig.set_axis_bgcolor('w')
plt.title('Selected daily temperture', fontsize = 16)
plt.ylabel(r'Temperature ($\circ$C)', fontsize = 13)
plt.show()

plt.figure()
fig = dailyWeather['2014'].plot(y = 'solarRadiation-W/m2', figsize = (15, 6), marker ='o')
fig.set_axis_bgcolor('w')
plt.title('Selected daily solar radiation', fontsize = 16)
plt.ylabel(r'$W/m^2$', fontsize = 14)
plt.show()

与占用有关的特征

这是一个0到1之间的数。0表示无人居住,1表示正常居住。这是一个基于假期、周末和学校学术日历的估计。

holidays = pd.read_excel('Data/holidays.xlsx')
holidays.head()
startDate endDate value
0 2011-07-01 2011-09-06 0.5
1 2011-10-10 2011-10-11 0.6
2 2011-11-24 2011-11-28 0.2
3 2011-12-22 2011-12-24 0.1
4 2011-12-24 2012-01-02 0.0
hourlyTimestamp = pd.date_range(start = '2011/7/1', end = '2014/10/31', freq = 'H')
occupancy = np.ones(len(hourlyTimestamp)) 

hourlyOccupancy = pd.DataFrame(data = occupancy, index = hourlyTimestamp, columns = ['occupancy'])


Saturdays = hourlyOccupancy.index.weekday == 5
Sundays = hourlyOccupancy.index.weekday == 6
hourlyOccupancy.loc[Saturdays, 'occupancy'] = 0.5 
hourlyOccupancy.loc[Sundays, 'occupancy'] = 0.5 


for i in range(len(holidays)):
    timestamp = pd.date_range(start = holidays.loc[i, 'startDate'], end = holidays.loc[i, 'endDate'], freq = 'H')
    hourlyOccupancy.loc[timestamp, 'occupancy'] = holidays.loc[i, 'value']

#hourlyHolidays['Datetime'] = pd.to_datetime(hourlyHolidays['Datetime'], format="%Y-%m-%d %H:%M:%S")
hourlyOccupancy['cosHour'] = np.cos((hourlyOccupancy.index.hour - 3) * 2 * np.pi / 24)

dailyOccupancy = hourlyOccupancy.resample('D')
dailyOccupancy.drop('cosHour', axis = 1, inplace = True)
合并能源消耗数据与天气和占用特征
hourlyElectricityWithFeatures = hourlyElectricity.join(hourlyWeather, how = 'inner')
hourlyElectricityWithFeatures = hourlyElectricityWithFeatures.join(hourlyOccupancy, how = 'inner')
hourlyElectricityWithFeatures.dropna(axis=0, how='any', inplace = True)
hourlyElectricityWithFeatures.to_excel('Data/hourlyElectricityWithFeatures.xlsx')
hourlyElectricityWithFeatures.head()
electricity-kWh startTime endTime RH-% T-C Tdew-C pressure-mbar solarRadiation-W/m2 windDirection windSpeed-m/s humidityRatio-kg/kg coolingDegrees heatingDegrees dehumidification occupancy cosHour
2012-01-01 01:00:00 111.479277 2012-01-01 01:00:00 2012-01-01 02:00:00 87 4 1.9 1004 0 310 4.166670 0.004396 0 11 0 0 0.866025
2012-01-01 02:00:00 117.989395 2012-01-01 02:00:00 2012-01-01 03:00:00 87 4 1.9 1005 0 280 4.166670 0.004391 0 11 0 0 0.965926
2012-01-01 03:00:00 119.010131 2012-01-01 03:00:00 2012-01-01 04:00:00 81 5 1.9 1006 0 270 4.166670 0.004380 0 10 0 0 1.000000
2012-01-01 04:00:00 116.005587 2012-01-01 04:00:00 2012-01-01 05:00:00 76 6 1.9 1007 0 290 4.722226 0.004401 0 9 0 0 0.965926
2012-01-01 05:00:00 111.132977 2012-01-01 05:00:00 2012-01-01 06:00:00 87 4 1.9 1007 0 280 3.055558 0.004382 0 11 0 0 0.866025
hourlyChilledWaterWithFeatures = hourlyChilledWater.join(hourlyWeather, how = 'inner')
hourlyChilledWaterWithFeatures = hourlyChilledWaterWithFeatures.join(hourlyOccupancy, how = 'inner')
hourlyChilledWaterWithFeatures.dropna(axis=0, how='any', inplace = True)
hourlyChilledWaterWithFeatures.to_excel('Data/hourlyChilledWaterWithFeatures.xlsx')

hourlySteamWithFeatures = hourlySteam.join(hourlyWeather, how = 'inner')
hourlySteamWithFeatures = hourlySteamWithFeatures.join(hourlyOccupancy, how = 'inner')
hourlySteamWithFeatures.dropna(axis=0, how='any', inplace = True)
hourlySteamWithFeatures.to_excel('Data/hourlySteamWithFeatures.xlsx')

dailyElectricityWithFeatures = dailyElectricity.join(dailyWeather, how = 'inner')
dailyElectricityWithFeatures = dailyElectricityWithFeatures.join(dailyOccupancy, how = 'inner')
dailyElectricityWithFeatures.dropna(axis=0, how='any', inplace = True)
dailyElectricityWithFeatures.to_excel('Data/dailyElectricityWithFeatures.xlsx')

dailyChilledWaterWithFeatures = dailyChilledWater.join(dailyWeather, how = 'inner')
dailyChilledWaterWithFeatures = dailyChilledWaterWithFeatures.join(dailyOccupancy, how = 'inner')
dailyChilledWaterWithFeatures.dropna(axis=0, how='any', inplace = True)
dailyChilledWaterWithFeatures.to_excel('Data/dailyChilledWaterWithFeatures.xlsx')

dailySteamWithFeatures = dailySteam.join(dailyWeather, how = 'inner')
dailySteamWithFeatures = dailySteamWithFeatures.join(dailyOccupancy, how = 'inner')
dailySteamWithFeatures.dropna(axis=0, how='any', inplace = True)
dailySteamWithFeatures.to_excel('Data/dailySteamWithFeatures.xlsx')

特性说明

命名法(按字母顺序)

coolingDegrees: 如果T-C - 12>0,则为T-C-12,否则为0。假设当室外温度低于12C时,不需要冷却,这对许多建筑来说是真实的。这对于每天的预测是有用的,因为每小时的平均冷却度比每小时的平均温度要好。

cosHour: cos(hourOfDay⋅2𝜋24)cos(hourOfDay⋅2π24)

dehumidification: 如果humidityRatio - 0.00886 > 0,则为humidityRatio - 0.00886,否则为 0。这将有助于冷水预测,特别是日常冷水预测。

heatingDegrees: 如果15 - T-C >0,则为 15 - T-C,否则为 0。假设室外温度高于15C时,不需要加热。这对于每日的预测是有用的,因为每小时的平均加热度要比每小时的平均温度好。

occupancy: 介于0和1之间的数字。0表示无人居住,1表示正常居住。这是一个基于假期、周末和学校学术日历的估计。

pressure-mbar: 大气压力

RH-% : 相对湿度

T-C : Dry-bulb 温度

Tdew-C : Dew-point 温度

湿度

湿度比是预测冷水的重要因素,因为冷水也用于干燥排到房间里的空气。使用湿度比比使用相对湿度和露点温度更有效。

%matplotlib inline 

import requests 
from StringIO import StringIO
import numpy as np
import pandas as pd # pandas
import matplotlib.pyplot as plt # module for plotting 
import datetime as dt # module for manipulating dates and times
import numpy.linalg as lin # module for performing linear algebra operations
from __future__ import division
import matplotlib

pd.options.display.mpl_style = 'default'

每月能源消耗

pd.options.display.mpl_style = 'default'
consumption = pd.read_csv('Data/Monthly_Energy_Gund.csv')
for i in range(len(consumption)):
    consumption['CW-kBtu'][i] = float(consumption['CW-kBtu'].values[i].replace(',', ''))
    consumption['EL-kBtu'][i] = float(consumption['EL-kBtu'].values[i].replace(',', ''))
    consumption['ST-kBtu'][i] = float(consumption['ST-kBtu'].values[i].replace(',', ''))

time_index = np.arange(len(consumption)) 
plt.figure(figsize=(15,7))
b1 = plt.bar(time_index, consumption['EL-kBtu'], width = 0.6, color='g')
b2 = plt.bar(time_index, consumption['ST-kBtu'], bottom=consumption['EL-kBtu'], width = 0.6, color='r')
b3 = plt.bar(time_index, consumption['CW-kBtu'], bottom=consumption['EL-kBtu']+consumption['ST-kBtu'], width = 0.6, color='b')

plt.xticks(time_index+0.5, consumption['Time'], rotation=90)
plt.title('Monthly Energy consumption')
plt.xlabel('Month')
plt.ylabel('Consumption (kBtu)')
plt.legend( (b1, b2, b3), ('Electricity', 'Steam', 'Chilled Water') )

电力能源消费格局

首先,让我们看看我们可以找到什么在每小时和每天的电力能源消耗。

hourlyElectricity = pd.read_excel('Data/hourlyElectricity.xlsx')

index = (hourlyElectricity['startTime'] >= np.datetime64('2011-07-03')) & (hourlyElectricity['startTime'] < np.datetime64('2014-10-26'))
hourlyElectricityForVisualization = hourlyElectricity.loc[index,'electricity-kWh']

print "Data length: ", len(hourlyElectricityForVisualization)/24/7, " weeks"
Data length:  173.0  weeks
data = hourlyElectricityForVisualization.values
data = data.reshape((len(data)/24/7,24*7))

from mpl_toolkits.axes_grid1 import make_axes_locatable

yTickLabels = pd.DataFrame(data = pd.date_range(start = '2011-07-03', 
                                                end   = '2014-10-25', 
                                                freq  = '4W', 
                                                columns=['datetime'])
yTickLabels['date'] = yTickLabels['datetime'].apply(lambda x: x.strftime('%Y-%m-%d'))

s1 = ['Sun ', 'Mon ', 'Tue ', 'Wed ', 'Thu ', 'Fri ', 'Sat ']
s2 = ['12AM ', '6 AM', '12PM', '6 PM']
s1 = np.repeat(s1, 4)
s2 = np.tile(s2, 7)
xTickLabels = np.char.add(s1, s2)

fig = plt.figure(figsize=(20,30))
ax = plt.gca()
im = ax.imshow(data, vmin =0, vmax = 500, interpolation='nearest', origin='upper')
# 在ax的右边创建一个坐标轴。cax的宽度为ax的5%,cax与ax之间的填充物固定在0.05英寸处。
divider = make_axes_locatable(ax)
cax = divider.append_axes("right", size="3%", pad=0.2)
ax.set_yticks(range(0,173,4))
ax.set_yticklabels(labels = yTickLabels['date'], fontsize = 14)

ax.set_xticks(range(0,168,6))
ax.set_xticklabels(labels = xTickLabels, fontsize = 14, rotation = 90)

plt.colorbar(im, cax=cax)

上图是三年内每小时用电量的累积图。banlk部分表示丢失的数据。

dailyElectricity = pd.read_excel('Data/dailyElectricity.xlsx')

index = (dailyElectricity['startDay'] >= np.datetime64('2011-07-03')) 
	   &(dailyElectricity['startDay'] <  np.datetime64('2014-10-19'))
dailyElectricityForVisualization = dailyElectricity.loc[index,'electricity-kWh']

print "Data length: ", len(dailyElectricityForVisualization)/7, " weeks"

data = dailyElectricityForVisualization.values
data = data.reshape((len(data)/7/4,7*4))

from mpl_toolkits.axes_grid1 import make_axes_locatable

yTickLabels = pd.DataFrame(data = pd.date_range(start = '2011-07-03', 
                                                end = '2014-10-25', 
                                                freq = '4W', 
                                                columns=['datetime'])
yTickLabels['date'] = yTickLabels['datetime'].apply(lambda x: x.strftime('%Y-%m-%d'))

s = ['Sun ', 'Mon ', 'Tue ', 'Wed ', 'Thu ', 'Fri ', 'Sat ']
xTickLabels = np.tile(s, 4)

fig = plt.figure(figsize=(14,15))
ax = plt.gca()
im = ax.imshow(data, interpolation='nearest', origin='upper')
divider = make_axes_locatable(ax)
cax = divider.append_axes("right", size="3%", pad=0.2)
ax.set_yticks(range(43))
ax.set_yticklabels(labels = yTickLabels['date'], fontsize = 14)

ax.set_xticks(range(28))
ax.set_xticklabels(labels = xTickLabels, fontsize = 14, rotation = 90)

plt.colorbar(im, cax=cax)
plt.show()

plt.figure()
fig = dailyElectricity.plot(figsize = (15, 6))
fig.set_axis_bgcolor('w')
plt.title('All the daily electricity data', fontsize = 16)
plt.ylabel('kWh')
plt.show()

Data length:  172.0  weeks

上面是一张热图和一张日常用电量图。空白部分表示缺少数据。

dailyElectricity = pd.read_excel('Data/dailyElectricity.xlsx')
weeklyElectricity = dailyElectricity.asfreq('W', how='sume', normalize=False)

plt.figure()
fig = weeklyElectricity['2012-01':'2014-01'].plot(figsize = (15, 6), 
                                                  fontsize = 15, 
                                                  marker = 'o', 
                                                  linestyle='--')
fig.set_axis_bgcolor('w')
plt.title('Weekly electricity data', fontsize = 16)
plt.ylabel('kWh')
ax = plt.gca()
plt.show()

上面是每周消费的图表。折线部分表示数据丢失。

很明显,在决赛期间是最大的消耗。然后突然下降。重复的模式非常明显。

发现

  • 电表现出强烈的周期性。你可以清楚地看到白天和晚上,工作日和周末的区别。

  • 似乎在每个学期中,期末考试的用电量都会达到峰值,这可能代表了学习模式。学生们正越来越努力地准备期末考试。学期结束后会有一个假期,包括圣诞假期。1月和夏季学期的用电量相对较低,而春假期间校园可以相对空闲

能耗与特征的关系

我们考虑的主要特征

在这一节中,我们将电力、冷冻水和蒸汽消耗(每小时和每天)与各种特性进行对比。

# 从预处理结果中读取数据

hourlyElectricityWithFeatures =                            		pd.read_excel('Data/hourlyElectricityWithFeatures.xlsx')
hourlyChilledWaterWithFeatures = pd.read_excel('Data/hourlyChilledWaterWithFeatures.xlsx')
hourlySteamWithFeatures = pd.read_excel('Data/hourlySteamWithFeatures.xlsx')

dailyElectricityWithFeatures = pd.read_excel('Data/dailyElectricityWithFeatures.xlsx')
dailyChilledWaterWithFeatures = pd.read_excel('Data/dailyChilledWaterWithFeatures.xlsx')
dailySteamWithFeatures = pd.read_excel('Data/dailySteamWithFeatures.xlsx')

dailyChilledWaterWithFeatures.head()
chilledWater-TonDays startDay endDay RH-% T-C Tdew-C pressure-mbar solarRadiation-W/m2 windDirection windSpeed-m/s humidityRatio-kg/kg coolingDegrees heatingDegrees dehumidification occupancy
2012-01-01 0.961857 2012-01-01 2012-01-02 76.652174 7.173913 3.073913 1004.956522 95.260870 236.086957 4.118361 0.004796 0 7.826087 0 0.0
2012-01-02 0.981725 2012-01-02 2012-01-03 55.958333 5.833333 -2.937500 994.625000 87.333333 253.750000 5.914357 0.003415 0 9.166667 0 0.3
2012-01-03 1.003672 2012-01-03 2012-01-04 42.500000 -3.208333 -12.975000 1002.125000 95.708333 302.916667 6.250005 0.001327 0 18.208333 0 0.3
2012-01-04 1.483192 2012-01-04 2012-01-05 41.541667 -7.083333 -16.958333 1008.250000 98.750000 286.666667 5.127319 0.000890 0 22.083333 0 0.3
2012-01-05 3.465091 2012-01-05 2012-01-06 46.916667 -0.583333 -9.866667 1002.041667 90.750000 258.333333 5.162041 0.001746 0 15.583333 0 0.3

上面我们打印出所有的功能。

holidays = pd.read_excel('Data/holidays.xlsx')
holidays
startDate endDate value
0 2011-07-01 2011-09-06 0.5
1 2011-10-10 2011-10-11 0.6
2 2011-11-24 2011-11-28 0.2
3 2011-12-22 2011-12-24 0.1
4 2011-12-24 2012-01-02 0.0
5 2012-01-02 2012-01-23 0.3
6 2012-03-10 2012-03-19 0.4
7 2012-05-17 2012-09-04 0.5
8 2012-05-28 2012-05-29 0.2
9 2012-10-08 2012-10-09 0.6
10 2012-11-22 2012-11-26 0.2
11 2012-12-22 2012-12-24 0.1
12 2012-12-24 2013-01-02 0.0
13 2013-01-02 2013-01-27 0.3
14 2013-01-20 2013-01-21 0.1
15 2013-03-16 2013-03-25 0.4
16 2013-05-18 2013-09-03 0.5
17 2013-10-14 2013-10-15 0.6
18 2013-11-28 2013-12-02 0.2
19 2013-12-20 2013-12-24 0.1
20 2013-12-24 2014-01-02 0.0
21 2014-01-02 2014-01-26 0.3
22 2014-03-16 2014-03-24 0.4
23 2014-05-17 2014-09-02 0.5

上面是“占用”的设置。全部入住率被赋值为1。

能源消耗与特征
fig, ax = plt.subplots(3, 2, sharey='row', figsize = (15, 12))
fig.subplots_adjust(hspace = 0.1, wspace = 0.1)

hourlyElectricityWithFeatures.plot(kind = 'scatter', 
                                   x = 'T-C', 
                                   y = 'electricity-kWh', 
                                   ax = ax[0,0])
hourlyElectricityWithFeatures.plot(kind = 'scatter', 
                                   x = 'coolingDegrees', 
                                   y = 'electricity-kWh', 
                                   ax = ax[0,1])
hourlyChilledWaterWithFeatures.plot(kind = 'scatter', 
                                    x = 'T-C', 
                                    y = 'chilledWater-TonDays', 
                                    ax = ax[1,0])
hourlyChilledWaterWithFeatures.plot(kind = 'scatter', 
                                    x = 'coolingDegrees', 
                                    y = 'chilledWater-TonDays', 
                                    ax = ax[1,1])
hourlySteamWithFeatures.plot(kind = 'scatter', 
                             x = 'T-C', 
                             y = 'steam-LBS', 
                             ax = ax[2,0])
hourlySteamWithFeatures.plot(kind = 'scatter', 
                             x = 'heatingDegrees', 
                             y = 'steam-LBS', 
                             ax = ax[2,1])

for i in range(3):    
    ax[i,0].tick_params(which=u'major', 
                        reset=False, 
                        axis = 'y', 
                        labelsize = 13)
    #ax[i,0].set_axis_bgcolor('w')

for i in range(2):    
    ax[2,i].tick_params(which=u'major', 
                        reset=False, 
                        axis = 'x', 
                        labelsize = 13)

ax[2,0].set_xlabel(r'Temperature ($^\circ$C)', fontsize = 13)
ax[2,0].set_xlim([-20,40])
ax[0,0].set_title(
    'Hourly energy use versus ourdoor temperature', 
    fontsize = 15)

ax[2,1].set_xlabel(r'Cooling/Heating degrees ($^\circ$C)', 
                   fontsize = 13)
#ax[2,1].set_xlim([0,30])
ax[0,1].set_title(
    'Hourly energy use versus cooling/heating degrees', 
    fontsize = 15)

plt.show()

冷冻水和蒸汽与温度密切相关。然而,仅使用室外温度或冷却/加热温度来预测每小时的冷冻水和蒸汽是不够的。

fig, ax = plt.subplots(3, 2, sharey='row', figsize = (15, 12))
fig.subplots_adjust(hspace = 0.1, wspace = 0.1)

dailyElectricityWithFeatures.plot(kind = 'scatter', 
                                  x = 'T-C', 
                                  y = 'electricity-kWh', 
                                  ax = ax[0,0])
dailyElectricityWithFeatures.plot(kind = 'scatter', 
                                  x = 'coolingDegrees', 
                                  y = 'electricity-kWh', 
                                  ax = ax[0,1])
dailyChilledWaterWithFeatures.plot(kind = 'scatter', 
                                   x = 'T-C', 
                                   y = 'chilledWater-TonDays', 
                                   ax = ax[1,0])
dailyChilledWaterWithFeatures.plot(kind = 'scatter', 
                                   x = 'coolingDegrees', 
                                   y = 'chilledWater-TonDays', 
                                   ax = ax[1,1])
dailySteamWithFeatures.plot(kind = 'scatter', 
                            x = 'T-C', 
                            y = 'steam-LBS', 
                            ax = ax[2,0])
dailySteamWithFeatures.plot(kind = 'scatter', 
                            x = 'heatingDegrees', 
                            y = 'steam-LBS', 
                            ax = ax[2,1])

for i in range(3):    
    ax[i,0].tick_params(which=u'major', reset=False, axis = 'y', labelsize = 13)
    #ax[i,0].set_axis_bgcolor('w')

for i in range(2):    
    ax[2,i].tick_params(which=u'major', reset=False, axis = 'x', labelsize = 13)

ax[2,0].set_xlabel(r'Temperature ($^\circ$C)', fontsize = 13)
ax[2,0].set_xlim([-20,40])
ax[0,0].set_title('Daily energy use versus ourdoor temperature', fontsize = 15)

ax[2,1].set_xlabel(r'Cooling/Heating degrees ($^\circ$C)', fontsize = 13)
#ax[2,1].set_xlim([0,30])
ax[0,1].set_title('Daily energy use versus cooling/heating degrees', fontsize = 15)

plt.show()

日常冷冻水和蒸汽与室外温度有很强的线性关系。如果使用冷却/加热度而不是T-C,则应避免逐步线性回归。

fig, ax = plt.subplots(3, 2, sharex = 'col', sharey='row', figsize = (15, 12))
fig.subplots_adjust(hspace = 0.1, wspace = 0.1)

hourlyElectricityWithFeatures.plot(kind = 'scatter', x = 'humidityRatio-kg/kg', y = 'electricity-kWh', ax = ax[0,0])
hourlyElectricityWithFeatures.plot(kind = 'scatter', x = 'dehumidification', y = 'electricity-kWh', ax = ax[0,1])
hourlyChilledWaterWithFeatures.plot(kind = 'scatter', x = 'humidityRatio-kg/kg', y = 'chilledWater-TonDays', ax = ax[1,0])
hourlyChilledWaterWithFeatures.plot(kind = 'scatter', x = 'dehumidification', y = 'chilledWater-TonDays', ax = ax[1,1])
hourlySteamWithFeatures.plot(kind = 'scatter', x = 'humidityRatio-kg/kg', y = 'steam-LBS', ax = ax[2,0])
hourlySteamWithFeatures.plot(kind = 'scatter', x = 'dehumidification', y = 'steam-LBS', ax = ax[2,1])

for i in range(3):    
    ax[i,0].tick_params(which=u'major', reset=False, axis = 'y', labelsize = 13)
    #ax[i,0].set_axis_bgcolor('w')

for i in range(2):    
    ax[2,i].tick_params(which=u'major', reset=False, axis = 'x', labelsize = 13)

ax[2,0].set_xlabel(r'Humidity ratio (kg/kg)', fontsize = 13)
ax[2,0].set_xlim([0,0.02])
ax[0,0].set_title('Hourly energy use versus humidity ratio', fontsize = 15)

ax[2,1].set_xlabel(r'Dehumidification', fontsize = 13)
ax[2,1].set_xlim([0,0.01])
ax[0,1].set_title('Hourly energy use versus dehumidification', fontsize = 15)

plt.show()

湿度比肯定有助于预测冷冻水的消耗量,它优于RH和Tdrew。

fig, ax = plt.subplots(3, 2, sharex = 'col', sharey='row', figsize = (15, 12))
fig.subplots_adjust(hspace = 0.1, wspace = 0.1)

dailyElectricityWithFeatures.plot(kind = 'scatter', 
                                  x = 'humidityRatio-kg/kg', 
                                  y = 'electricity-kWh', 
                                  ax = ax[0,0])
dailyElectricityWithFeatures.plot(kind = 'scatter', 
                                  x = 'dehumidification', 
                                  y = 'electricity-kWh', 
                                  ax = ax[0,1])
dailyChilledWaterWithFeatures.plot(kind = 'scatter', 
                                   x = 'humidityRatio-kg/kg', 
                                   y = 'chilledWater-TonDays', 
                                   ax = ax[1,0])
dailyChilledWaterWithFeatures.plot(kind = 'scatter', 
                                   x = 'dehumidification', 
                                   y = 'chilledWater-TonDays', 
                                   ax = ax[1,1])
dailySteamWithFeatures.plot(kind = 'scatter', 
                            x = 'humidityRatio-kg/kg', 
                            y = 'steam-LBS', 
                            ax = ax[2,0])
dailySteamWithFeatures.plot(kind = 'scatter', 
                            x = 'dehumidification', 
                            y = 'steam-LBS', 
                            ax = ax[2,1])

for i in range(3):    
    ax[i,0].tick_params(which=u'major', reset=False, axis = 'y', labelsize = 13)
    #ax[i,0].set_axis_bgcolor('w')

for i in range(2):    
    ax[2,i].tick_params(which=u'major', reset=False, axis = 'x', labelsize = 13)

ax[2,0].set_xlabel(r'Humidity ratio (kg/kg)', fontsize = 13)
ax[2,0].set_xlim([0,0.02])
ax[0,0].set_title('Daily energy use versus humidity ratio', fontsize = 15)

ax[2,1].set_xlabel(r'Dehumidification', fontsize = 13)
ax[2,1].set_xlim([0,0.01])
ax[0,1].set_title('Daily energy use versus dehumidification', fontsize = 15)

plt.show()

除湿是用来预测冷水的,而不是蒸汽。

fig, ax = plt.subplots(3, 2, sharex = 'col', figsize = (15, 12))
fig.subplots_adjust(hspace = 0.1, wspace = 0.15)

hourlyElectricityWithFeatures.plot(kind = 'scatter', 
                                   x = 'occupancy', 
                                   y = 'electricity-kWh', 
                                   ax = ax[0,0])
dailyElectricityWithFeatures.plot(kind = 'scatter', 
                                  x = 'occupancy', 
                                  y = 'electricity-kWh', 
                                  ax = ax[0,1])
hourlyChilledWaterWithFeatures.plot(kind = 'scatter', 
                                    x = 'occupancy', 
                                    y = 'chilledWater-TonDays', 
                                    ax = ax[1,0])
dailyChilledWaterWithFeatures.plot(kind = 'scatter', 
                                   x = 'occupancy', 
                                   y = 'chilledWater-TonDays', 
                                   ax = ax[1,1])
hourlySteamWithFeatures.plot(kind = 'scatter', 
                             x = 'occupancy', 
                             y = 'steam-LBS', 
                             ax = ax[2,0])
dailySteamWithFeatures.plot(kind = 'scatter', 
                            x = 'occupancy', 
                            y = 'steam-LBS', 
                            ax = ax[2,1])

for i in range(3):    
    ax[i,0].tick_params(which=u'major', reset=False, axis = 'y', labelsize = 13)
    #ax[i,0].set_axis_bgcolor('w')

for i in range(2):    
    ax[2,i].tick_params(which=u'major', reset=False, axis = 'x', labelsize = 13)

ax[2,0].set_xlabel(r'Occupancy', fontsize = 13)
#ax[2,0].set_xlim([0,0.02])
ax[0,0].set_title('Hourly energy use versus occupancy', fontsize = 15)

ax[2,1].set_xlabel(r'Occupancy', fontsize = 13)

#ax[2,1].set_xlim([0,0.01])
ax[0,1].set_title('Daily energy use versus occupancy', fontsize = 15)

plt.show()

入住率来源于学年、假期和周末。基本上,我们只是给假期、周末和夏天赋了一个较低的值。cosHour,入住率可能有用,也可能没用,因为它们只是对入住率的估计。

fig, ax = plt.subplots(3, 1, sharex = 'col', figsize = (8, 12))
fig.subplots_adjust(hspace = 0.1, wspace = 0.15)

hourlyElectricityWithFeatures.plot(kind = 'scatter', 
                                   x = 'cosHour', 
                                   y = 'electricity-kWh', 
                                   ax = ax[0])
hourlyChilledWaterWithFeatures.plot(kind = 'scatter', 
                                    x = 'cosHour', 
                                    y = 'chilledWater-TonDays', 
                                    ax = ax[1])
hourlySteamWithFeatures.plot(kind = 'scatter', 
                             x = 'cosHour', 
                             y = 'steam-LBS', 
                             ax = ax[2])

for i in range(3):    
    ax[i].tick_params(which=u'major', reset=False, axis = 'y', labelsize = 13)
    #ax[i,0].set_axis_bgcolor('w')

ax[2].tick_params(which=u'major', reset=False, axis = 'x', labelsize = 13)

ax[2].set_xlabel(r'cosHour', fontsize = 13)
#ax[2,0].set_xlim([0,0.02])
ax[0].set_title('Hourly energy use versus cosHourOfDay', fontsize = 15)

plt.show()

能源利用与cosHour之间存在一定的联系.

fig, ax = plt.subplots(3, 2, sharex = 'col', sharey = 'row', figsize = (15, 12))
fig.subplots_adjust(hspace = 0.1, wspace = 0.15)

hourlyElectricityWithFeatures.plot(kind = 'scatter', 
                                   x = 'solarRadiation-W/m2', 
                                   y = 'electricity-kWh', 
                                   ax = ax[0,0])
hourlyElectricityWithFeatures.plot(kind = 'scatter',
                                   x = 'windSpeed-m/s', 
                                   y = 'electricity-kWh', 
                                   ax = ax[0,1])
hourlyChilledWaterWithFeatures.plot(kind = 'scatter', 
                                    x = 'solarRadiation-W/m2', 
                                    y = 'chilledWater-TonDays', 
                                    ax = ax[1,0])
hourlyChilledWaterWithFeatures.plot(kind = 'scatter', 
                                    x = 'windSpeed-m/s', 
                                    y = 'chilledWater-TonDays', 
                                    ax = ax[1,1])
hourlySteamWithFeatures.plot(kind = 'scatter', 
                             x = 'solarRadiation-W/m2', 
                             y = 'steam-LBS', 
                             ax = ax[2,0])
hourlySteamWithFeatures.plot(kind = 'scatter', 
                             x = 'windSpeed-m/s', 
                             y = 'steam-LBS', 
                             ax = ax[2,1])

for i in range(3):    
    ax[i,0].tick_params(which=u'major', reset=False, axis = 'y', labelsize = 13)
    #ax[i,0].set_axis_bgcolor('w')

for i in range(2):    
    ax[2,i].tick_params(which=u'major', reset=False, axis = 'x', labelsize = 13)

ax[2,0].set_xlabel(r'Solar radiation (W/m2)', fontsize = 13)
#ax[2,0].set_xlim([0,0.02])
ax[0,0].set_title('Hourly energy use versus solar radiation', fontsize = 15)

ax[2,1].set_xlabel(r'Wind speed (m/s)', fontsize = 13)

#ax[2,1].set_xlim([0,0.01])
ax[0,1].set_title('Hourly energy use versus wind speed', fontsize = 15)

plt.show()

fig, ax = plt.subplots(3, 2, sharex = 'col', sharey = 'row', figsize = (15, 12))
fig.subplots_adjust(hspace = 0.1, wspace = 0.15)

dailyElectricityWithFeatures.plot(kind = 'scatter', 
                                  x = 'solarRadiation-W/m2', 
                                  y = 'electricity-kWh', 
                                  ax = ax[0,0])
dailyElectricityWithFeatures.plot(kind = 'scatter', 
                                  x = 'windSpeed-m/s', 
                                  y = 'electricity-kWh', 
                                  ax = ax[0,1])
dailyChilledWaterWithFeatures.plot(kind = 'scatter', 
                                   x = 'solarRadiation-W/m2', 
                                   y = 'chilledWater-TonDays', 
                                   ax = ax[1,0])
dailyChilledWaterWithFeatures.plot(kind = 'scatter', 
                                   x = 'windSpeed-m/s', 
                                   y = 'chilledWater-TonDays', 
                                   ax = ax[1,1])
dailySteamWithFeatures.plot(kind = 'scatter', 
                            x = 'solarRadiation-W/m2', 
                            y = 'steam-LBS', 
                            ax = ax[2,0])
dailySteamWithFeatures.plot(kind = 'scatter', 
                            x = 'windSpeed-m/s', 
                            y = 'steam-LBS', 
                            ax = ax[2,1])

for i in range(3):    
    ax[i,0].tick_params(which=u'major', reset=False, axis = 'y', labelsize = 13)
    #ax[i,0].set_axis_bgcolor('w')

for i in range(2):    
    ax[2,i].tick_params(which=u'major', reset=False, axis = 'x', labelsize = 13)

ax[2,0].set_xlabel(r'Solar radiation (W/m2)', fontsize = 13)
#ax[2,0].set_xlim([0,0.02])
ax[0,0].set_title('Daily energy use versus solar radiation', fontsize = 15)

ax[2,1].set_xlabel(r'Wind speed (m/s)', fontsize = 13)

#ax[2,1].set_xlim([0,0.01])
ax[0,1].set_title('DAily energy use versus wind speed', fontsize = 15)

plt.show()

太阳辐射和风速并不那么重要,它们与温度有关。

发现

  • 电力与天气数据(温度)无关。利用天气信息来预测电力的想法是行不通的。我认为这主要取决于时间/入住率。但是我们仍然可以做一些模式探索来找出白天/晚上、工作日/周末、学校日/假期的用电模式。实际上,我们应该从月度数据中注意到这一点。

  • 冷冻水和蒸汽与温度和湿度密切相关。日冷水、蒸汽消耗量与冷热度呈良好的线性关系。因此,简单的线性回归可能已经足够精确了。

  • 虽然冷水和热水的消耗与天气有很强的相关性,但是根据上面的图用天气信息来预测每小时的冷冻水和蒸汽是不够的。这是因为运行计划影响每小时的能源消耗。每小时的冷水和蒸汽预测必须包括用房和运行计划。

  • 湿度比肯定有助于预测冷水的消耗量,它优于RH和Tdrew。

  • 冷和热温度将有助于预测每天的冷冻水和蒸汽。如果使用冷却/加热度而不是T-C,则应避免逐步线性回归。

  • 入住率来源于学年、假期和周末。基本上,我们只是给假期、周末和夏天赋了一个较低的值。cosHour,入住率可能有用,也可能没用,因为它们只是对入住率的估计。

下一篇 2_Linear Regression and Support Vector Regression

posted @ 2020-02-04 16:59  小萝卜鸭  阅读(605)  评论(1编辑  收藏  举报