Python数据分析-日期数据和时间序列数据处理
1. 主要内容
1.1. 日期数据处理
- DataFrame的日期数据转换
- dt对象的使用
- 获取日期区间的数据
- 按不同时期统计并显示数据
1.2. 时间序列数据处理
- 重采样(Resample()方法)
- 降采样处理
- 升采样处理
- 时间序列数据汇总(ohlc()函数)
- 移动窗口数据计算(rolling()函数)
2. 日期数据处理
2.1. DataFrame的日期数据转换
主要使用函数:pd.to_datetime(),函数格式如下:
pandas.to_datetime(arg, errors='raise', dayfirst=False, yearfirst=False, utc=False, format=None, exact=_NoDefault.no_default, unit=None, infer_datetime_format=_NoDefault.no_default, origin='unix', cache=True)
参数说明:
- arg:int, float, str, datetime, list, tuple, 1-d array, Series, DataFrame/dict-like
The object to convert to a datetime. If a DataFrame is provided, the method expects minimally the following columns: "year", "month", "day". The column “year” must be specified in 4-digit format.
- errors:{‘ignore’, ‘raise’, ‘coerce’}, default ‘raise’
- If 'raise', then invalid parsing will raise an exception.
- If 'coerce', then invalid parsing will be set as NaT.
- If 'ignore', then invalid parsing will return the input.
- dayfirst:bool, default False
Specify a date parse order if arg is str or is list-like. If True, parses dates with the day first, e.g. "10/11/12" is parsed as 2012-11-10.
Warning:dayfirst=True is not strict, but will prefer to parse with day first.
- yearfirst:bool, default False
Specify a date parse order if arg is str or is list-like.
- If True parses dates with the year first, e.g. "10/11/12" is parsed as 2010-11-12.
- If both dayfirst and yearfirst are True, yearfirst is preceded (same as dateutil).
Warning:yearfirst=True is not strict, but will prefer to parse with year first.
- utc:bool, default False
Control timezone-related parsing, localization and conversion.
- If True, the function always returns a timezone-aware UTC-localized Timestamp, Series or DatetimeIndex. To do this, timezone-naive inputs are localized as UTC, while timezone-aware inputs are converted to UTC.
- If False (default), inputs will not be coerced to UTC. Timezone-naive inputs will remain naive, while timezone-aware ones will keep their time offsets. Limitations exist for mixed offsets (typically, daylight savings).
Warning:In a future version of pandas, parsing datetimes with mixed time zones will raise an error unless utc=True. Please specify utc=True to opt in to the new behaviour and silence this warning. To create a Series with mixed offsets and object dtype, please use apply and datetime.datetime.strptime.
- format:str, default None
The strftime to parse time, e.g."%d/%m/%Y".
- “ISO8601”, to parse any ISO8601 time string (not necessarily in exactly the same format);
- “mixed”, to infer the format for each element individually. This is risky, and you should probably use it along with dayfirst.
Warning:If a DataFrame is passed, then format has no effect.
- exact:bool, default True
Control how format is used:
- If True, require an exact format match.
- If False, allow the format to match anywhere in the target string.
Cannot be used alongside format='ISO8601' or format='mixed'.
- unit:str, *default ‘ns’
The unit of the arg (D,s,ms,us,ns) denote the unit, which is an integer or float number. This will be based off the origin. Example, with unit='ms' and origin='unix', this would calculate the number of milliseconds to the unix epoch start.
- infer_datetime_format:bool, default False
If Trueand no format is given, attempt to infer the format of the datetime strings based on the first non-NaN element, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by ~5-10x.
- origin:scalar, default ‘unix’
Define the reference date. The numeric values would be parsed as number of units (defined by unit) since this reference date.
- If 'unix'(or POSIX) time; origin is set to 1970-01-01.
- If 'julian', unit must be'D', and origin is set to beginning of Julian Calendar. Julian day number 0 is assigned to the day starting at noon on January 1, 4713 BC.
- If Timestamp convertible (Timestamp, dt.datetime, np.datetimt64 or date string), origin is set to Timestamp identified by origin.
- If a float or integer, origin is the difference (in units determined by the unitargument) relative to 1970-01-01.
- cache:bool, default True
If True, use a cache of unique, converted dates to apply the datetime conversion. May produce significant speed-up when parsing duplicate date strings, especially ones with timezone offsets. The cache is only used when there are at least 50 values. The presence of out-of-bounds values will render the cache unusable and may slow down parsing.
代码示例1:
1 df = pd.DataFrame({'year': [2015, 2016], 2 'month': [2, 3], 3 'day': [4, 5]}) 4 res = pd.to_datetime(df) 5 print(res) 6 7 # 结果 8 # 0 2015-02-04 9 # 1 2016-03-05 10 # dtype: datetime64[ns] 11 12 ################################################### 13 14 res1 = pd.to_datetime(1490195805, unit='s') 15 res2 = pd.to_datetime(1490195805433502912, unit='ns') 16 print(res1) 17 print(res2) 18 19 # 结果 20 # 2017-03-22 15:16:45 21 # 2017-03-22 15:16:45.433502912
代码示例2:
1 res = pd.to_datetime([1, 2, 3], unit='D', 2 origin=pd.Timestamp('1960-01-01')) 3 print(res) 4 5 # 结果 6 # DatetimeIndex(['1960-01-02', '1960-01-03', '1960-01-04'], dtype='datetime64[ns]', freq=None) 7 8 res = pd.to_datetime('2018-10-26 12:00:00.0000000011', 9 format='%Y-%m-%d %H:%M:%S.%f') 10 print(res) 11 12 # 结果 13 # 2018-10-26 12:00:00.000000001 14 15 ################################################################# 16 17 res = pd.to_datetime('13000101', format='%Y%m%d', errors='coerce') 18 print(res) 19 20 # 结果 21 # NaT
代码示例3:
1 from datetime import datetime 2 res = pd.to_datetime(["2020-01-01 01:00:00-01:00", 3 datetime(2020, 1, 1, 3, 0)]) 4 print(res) 5 6 # 结果 7 # DatetimeIndex(['2020-01-01 01:00:00-01:00', '2020-01-01 02:00:00-01:00'], dtype='datetime64[ns, pytz.FixedOffset(-60)]', freq=None) 8 9 ################################################## 10 11 res = pd.to_datetime(['2018-10-26 12:00', '2018-10-26 13:00'], utc=True) 12 print(res) 13 14 # 结果 15 # DatetimeIndex(['2018-10-26 12:00:00+00:00', '2018-10-26 13:00:00+00:00'], dtype='datetime64[ns, UTC]', freq=None)
2.2. dt 对象的使用
主要是对dt.year,dt.month,dt.day等接口的调用。大多数都是见名知意,就不赘述。
代码示例:
1 import pandas as pd 2 # 解决数据输出时列名不对齐的问题 3 pd.set_option('display.unicode.east_asian_width', True) 4 5 df = pd.DataFrame({'原日期': ['2019.1.05', '2019.2.15', '2019.3.25', '2019.6.25', '2019.9.15', '2019.12.31']}) 6 df['日期'] = pd.to_datetime(df['原日期']) 7 print(df) 8 df['年'], df['月'], df['日'] = df['日期'].dt.year, df['日期'].dt.month, df['日期'].dt.day 9 df['星期几'] = df['日期'].dt.day_name() 10 df['季度'] = df['日期'].dt.quarter 11 df['是否年底'] = df['日期'].dt.is_year_end 12 print(df) 13 14 # 结果 15 # 原日期 日期 16 # 0 2019.1.05 2019-01-05 17 # 1 2019.2.15 2019-02-15 18 # 2 2019.3.25 2019-03-25 19 # 3 2019.6.25 2019-06-25 20 # 4 2019.9.15 2019-09-15 21 # 5 2019.12.31 2019-12-31 22 23 # 原日期 日期 年 月 日 星期几 季度 是否年底 24 # 0 2019.1.05 2019-01-05 2019 1 5 Saturday 1 False 25 # 1 2019.2.15 2019-02-15 2019 2 15 Friday 1 False 26 # 2 2019.3.25 2019-03-25 2019 3 25 Monday 1 False 27 # 3 2019.6.25 2019-06-25 2019 6 25 Tuesday 2 False 28 # 4 2019.9.15 2019-09-15 2019 9 15 Sunday 3 False 29 # 5 2019.12.31 2019-12-31 2019 12 31 Tuesday 4 True
2.3. 获取日期区间的数据
类似于列表切片。例如:
df2 = df1['2018-05-11':'2018-06-10'] # 即可获取对应区间的数据
2.4. 按不同时期统计并显示数据
伪代码:
1 import pandas as pd 2 3 df = pd.DataFrame(pd.read_excel("文件")) 4 df1=df[['订单付款时间','买家会员名','联系手机','买家实际支付金额']] 5 df1 = df1.set_index('订单付款时间') # 将date设置为index 6 7 print('---------按月统计数据-----------') 8 #“MS”是每个月第一天为开始日期,“M”是每个月最后一天 9 print(df1.resample('M').sum().to_period('M')) 10 11 print('---------按季统计数据-----------') 12 #“QS”是每个季度第一天为开始日期,“Q”是每个季度最后一天 13 print(df1.resample('QS').sum()) 14 15 print('---------按年统计数据-----------') 16 #“AS”是每年第一天为开始日期,“A”是每年最后一天 17 print(df1.resample('AS').sum()) 18 19 20 print('---------按年统计并显示数据-----------') 21 #“AS”是每年第一天为开始日期,“A”是每年最后一天 22 print(df1.resample('AS').sum().to_period('A')) 23 print('---------按季度统计并显示数据-----------') 24 print(df1.resample('Q').sum().to_period('Q')) 25 print('---------按月统计并显示数据-----------') 26 print(df1.resample('M').sum().to_period('M')) 27 df2=df1.resample('M').sum().to_period('M') 28 print('---------按星期统计并显示数据-----------') 29 print(df1.resample('w').sum().to_period('W').head())
3. 时间序列数据处理
3.1. 重采样(Resample()方法)
DataFrame.resample(rule, axis=_NoDefault.no_default, closed=None, label=None, convention=_NoDefault.no_default, kind=_NoDefault.no_default, on=None, level=None, origin='start_day', offset=None, group_keys=False)
参数说明:
- rule:DateOffset, Timedelta or str
The offset string or object representing target conversion.
- axis:{0 or ‘index’, 1 or ‘columns’}, default 0
Which axis to use for up- or down-sampling. For Series this parameter is unused and defaults to 0. Must be DatetimeIndex, TimedeltaIndex or PeriodIndex.
- closed:{‘right’, ‘left’}, default None
Which side of bin interval is closed. The default is ‘left’ for all frequency offsets except for ‘ME’, ‘YE’, ‘QE’, ‘BME’, ‘BA’, ‘BQE’, and ‘W’ which all have a default of ‘right’.
- label:{‘right’, ‘left’}, default None
Which bin edge label to label bucket with. The default is ‘left’ for all frequency offsets except for ‘ME’, ‘YE’, ‘QE’, ‘BME’, ‘BA’, ‘BQE’, and ‘W’ which all have a default of ‘right’.
- convention:{‘start’, ‘end’, ‘s’, ‘e’}, default ‘start’
For PeriodIndex only, controls whether to use the start or end of rule.
- kind:{‘timestamp’, ‘period’}, optional, default None
Pass ‘timestamp’ to convert the resulting index to a DateTimeIndex or ‘period’ to convert it to a PeriodIndex. By default the input representation is retained.
- on:str, optional
For a DataFrame, column to use instead of index for resampling. Column must be datetime-like.
- level:str or int, optional
For a MultiIndex, level (name or number) to use for resampling. level must be datetime-like.
- origin:Timestamp or str, default ‘start_day’
The timestamp on which to adjust the grouping. The timezone of origin must match the timezone of the index. If string, must be one of the following:
- ‘epoch’: origin is 1970-01-01
- ‘start’: origin is the first value of the timeseries
- ‘start_day’: origin is the first day at midnight of the timeseries
- ‘end’: origin is the last value of the timeseries
- ‘end_day’: origin is the ceiling midnight of the last day
- offset:Timedelta or str, default is None
An offset timedelta added to the origin.
- group_keys:bool, default False
Whether to include the group keys in the result index when using.apply()on the resampled object.
代码示例1:
1 index = pd.date_range('02/02/2020', periods=9, freq='T') 2 series = pd.Series(range(9), index=index) 3 print(series) 4 print(series.resample('3T').sum()) 5 6 # 结果 7 # 2020-02-02 00:00:00 0 8 # 2020-02-02 00:01:00 1 9 # 2020-02-02 00:02:00 2 10 # 2020-02-02 00:03:00 3 11 # 2020-02-02 00:04:00 4 12 # 2020-02-02 00:05:00 5 13 # 2020-02-02 00:06:00 6 14 # 2020-02-02 00:07:00 7 15 # 2020-02-02 00:08:00 8 16 # Freq: T, dtype: int64 17 # 2020-02-02 00:00:00 3 18 # 2020-02-02 00:03:00 12 19 # 2020-02-02 00:06:00 21 20 # Freq: 3T, dtype: int64
代码示例2:
1 index = pd.date_range('1/1/2000', periods=9, freq='min') 2 series = pd.Series(range(9), index=index) 3 4 def custom_resampler(arraylike): 5 return np.sum(arraylike) + 5 6 7 print(series.resample('3min').apply(custom_resampler)) 8 9 # 结果 10 # 2000-01-01 00:00:00 8 11 # 2000-01-01 00:03:00 17 12 # 2000-01-01 00:06:00 26 13 # Freq: 3T, dtype: int64
3.2. 降采样处理、升采样处理
降采样(downsampling)是指将数据从高频率转换为低频率,例如将每分钟的数据转换为每小时的数据。
升采样是指将数据从低频率转换为高频率,例如将每小时的数据转换为每分钟的数据。
在Python的pandas库中,可以使用resample()函数进行降采样和升采样处理。以下是示例代码:
1 import pandas as pd 2 3 # 创建一个时间序列数据 4 date_rng = pd.date_range(start='1/1/2020', end='1/10/2020', freq='H') 5 df = pd.DataFrame(date_rng, columns=['date']) 6 df['data'] = range(len(df)) 7 8 # 将日期设置为索引 9 df = df.set_index('date') 10 11 # 降采样:将数据从小时转换为天 12 downsampled = df.resample('D').mean() 13 print("降采样结果:") 14 print(downsampled) 15 16 # 升采样:将数据从天转换为小时 17 upsampled = df.resample('H').ffill() 18 print("升采样结果:") 19 print(upsampled)
3.3. 时间序列数据汇总(ohlc()函数)
1 import pandas as pd 2 import numpy as np 3 4 rng = pd.date_range('2/2/2020', periods=12, freq='T') 5 s1 = pd.Series(np.arange(12), index=rng) 6 print(s1.resample('5min').ohlc())
3.4. 移动窗口数据计算(rolling()函数)
1 import pandas as pd 2 index = pd.date_range('20200201', '20200215') 3 data = [3, 6, 7, 4, 2, 1, 3, 8, 9, 10, 12, 15, 13, 22, 14] 4 s1_data = pd.Series(data, index=index) 5 print(s1_data) 6 print(s1_data.rolling(3).mean())
时间:2024年4月17日

Python数据分析-日期数据和时间序列数据处理
浙公网安备 33010602011771号