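Python Data Analysis: Data Shifting and Data Conversion
1. Data shifting
shift() moves the data in a DataFrame (or its index) by a given number of periods. Combined with other methods, it supports many common tasks, such as comparing each row with the previous one. The syntax is as follows: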
DataFrame.shift(periods=1, freq=None, axis=0, fill_value=_NoDefault.no_default, suffix=None)
Shifts the index by the desired number of periods, with an optional time frequency (freq).
Parameters:
- periods: int or Sequence
Number of periods to shift. Can be positive or negative. If an iterable of ints, the data will be shifted once by each int. This is equivalent to shifting by one value at a time and concatenating all resulting frames. The resulting columns will have the shift suffixed to their column names. For multiple periods, axis must not be 1.
- freq: DateOffset, tseries.offsets, timedelta, or str, optional
Offset to use from the tseries module or time rule (e.g. ‘EOM’). If freq is specified then the index values are shifted but the data is not realigned. That is, use freq if you would like to extend the index when shifting and preserve the original data. If freq is specified as “infer” then it will be inferred from the freq or inferred_freq attributes of the index. If neither of those attributes exist, a ValueError is thrown.
- axis: {0 or ‘index’, 1 or ‘columns’, None}, default 0
Shift direction. For Series this parameter is unused and defaults to 0.
- fill_value: object, optional
The scalar value to use for newly introduced missing values. The default depends on the dtype of self. For numeric data, np.nan is used. For datetime, timedelta, or period data, NaT is used. For extension dtypes, self.dtype.na_value is used.
- suffix: str, optional
If str and periods is an iterable, this is added after the column name and before the shift value for each shifted column name.
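As a small illustration of the fill_value description above: integer columns cannot hold np.nan, so shifting without fill_value upcasts them to float, while an integer fill_value lets the column keep its integer dtype. The one-column frame below is made up for this sketch.

```python
import pandas as pd

# Hypothetical one-column frame, used only for illustration.
df = pd.DataFrame({"Col1": [10, 20, 15]})

shifted_default = df.shift(periods=1)               # gap is filled with NaN
shifted_filled = df.shift(periods=1, fill_value=0)  # gap is filled with 0

# NaN forces the integer column to become a float column.
print(shifted_default["Col1"].dtype)
# An integer fill value fits the original dtype, so it stays integer.
print(shifted_filled["Col1"].dtype)
```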
Code examples:
```python
import pandas as pd

df = pd.DataFrame({"Col1": [10, 20, 15, 30, 45],
                   "Col2": [13, 23, 18, 33, 48],
                   "Col3": [17, 27, 22, 37, 52]},
                  index=pd.date_range("2020-01-01", "2020-01-05"))
print(df)

# Output:
#             Col1  Col2  Col3
# 2020-01-01    10    13    17
# 2020-01-02    20    23    27
# 2020-01-03    15    18    22
# 2020-01-04    30    33    37
# 2020-01-05    45    48    52
```
```python
# Shift the data down by three rows; the vacated rows become NaN.
df1 = df.shift(periods=3)
print(df1)

# Output:
#             Col1  Col2  Col3
# 2020-01-01   NaN   NaN   NaN
# 2020-01-02   NaN   NaN   NaN
# 2020-01-03   NaN   NaN   NaN
# 2020-01-04  10.0  13.0  17.0
# 2020-01-05  20.0  23.0  27.0
```
```python
# Shift the data one column to the right.
df1 = df.shift(periods=1, axis="columns")
print(df1)

# Output:
#             Col1  Col2  Col3
# 2020-01-01   NaN    10    13
# 2020-01-02   NaN    20    23
# 2020-01-03   NaN    15    18
# 2020-01-04   NaN    30    33
# 2020-01-05   NaN    45    48
```
```python
# Fill the vacated rows with 0 instead of NaN.
df1 = df.shift(periods=3, fill_value=0)
print(df1)

# Output:
#             Col1  Col2  Col3
# 2020-01-01     0     0     0
# 2020-01-02     0     0     0
# 2020-01-03     0     0     0
# 2020-01-04    10    13    17
# 2020-01-05    20    23    27
```
```python
# With freq, the index is shifted by three days and the data stays in place.
df1 = df.shift(periods=3, freq="D")
print(df1)

# Output:
#             Col1  Col2  Col3
# 2020-01-04    10    13    17
# 2020-01-05    20    23    27
# 2020-01-06    15    18    22
# 2020-01-07    30    33    37
# 2020-01-08    45    48    52
```
```python
# freq="infer" takes the frequency from the index (daily here).
df1 = df.shift(periods=3, freq="infer")
print(df1)

# Output:
#             Col1  Col2  Col3
# 2020-01-04    10    13    17
# 2020-01-05    20    23    27
# 2020-01-06    15    18    22
# 2020-01-07    30    33    37
# 2020-01-08    45    48    52
```
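To illustrate the remark at the start of this section that shift() is most useful in combination with other operations, here is a minimal sketch that computes day-over-day changes. The column names change and pct_change are hypothetical and chosen only for this example.

```python
import pandas as pd

df = pd.DataFrame({"Col1": [10, 20, 15, 30, 45]},
                  index=pd.date_range("2020-01-01", "2020-01-05"))

# Align each row with the value from the previous day.
previous = df["Col1"].shift(1)

# Absolute and relative change versus the previous day.
# The first row has no predecessor, so both results are NaN there.
df["change"] = df["Col1"] - previous
df["pct_change"] = df["change"] / previous
print(df)
```

pandas also provides Series.diff() and Series.pct_change() for these two cases; the shift-based form simply makes the mechanism explicit.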
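2. Data conversion
Data conversion typically includes splitting one column into multiple columns, transposing rows and columns, and converting a DataFrame to a dictionary, a list, or tuples.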
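2.1. What this article covers
- Transposing rows and columns
- Converting a Series to a dictionary
- Converting a Series to a list
- Converting a DataFrame to HTML
Note: pandas offers many other conversion methods; refer to the official documentation as needed.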
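The overview above also mentions converting a DataFrame to a dictionary, a list, or tuples, and splitting one column into several, none of which the following sections cover. As a brief sketch with made-up data, these conversions might look like this:

```python
import pandas as pd

# Hypothetical data, used only for illustration.
df = pd.DataFrame({"name": ["Ann", "Bob"], "score": [90, 85]})

as_dict = df.to_dict()                     # {column: {index: value}}
as_records = df.to_dict(orient="records")  # one dict per row
as_lists = df.values.tolist()              # one list per row
as_tuples = list(df.itertuples(index=False, name=None))  # one tuple per row

# Split one text column into several columns on a separator.
full_names = pd.Series(["Ann Lee", "Bob Ray"])
parts = full_names.str.split(" ", expand=True)

print(as_dict)
print(as_records)
print(as_lists)
print(as_tuples)
print(parts)
```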
2.2. Transposing rows and columns
To swap the rows and columns of a DataFrame, use the df.T property, defined as follows:
property DataFrame.T
Returns: DataFrame
Code example:
```python
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
print(df)
df1 = df.T
print(df1)

# Output:
#    col1  col2
# 0     1     3
# 1     2     4
#
#       0  1
# col1  1  2
# col2  3  4
```
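One detail worth knowing when transposing: the result keeps the original dtype only when all columns share one. The sketch below, using made-up frames, shows that transposing a frame with mixed column types yields object columns.

```python
import pandas as pd

# All columns are integers, so the transpose stays integer.
uniform = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

# Text and integers are mixed, so each transposed column holds both kinds.
mixed = pd.DataFrame({"name": ["Ann", "Bob"], "score": [90, 85]})

uniform_t = uniform.T
mixed_t = mixed.T

print(uniform_t.dtypes)  # integer dtype for every column
print(mixed_t.dtypes)    # object dtype for every column
```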
2.3. Converting a Series to a dictionary
To convert a Series to a dictionary, use Series.to_dict(), defined as follows:
Series.to_dict(*, into=<class 'dict'>)
Parameters:
- into: class, default dict
The collections.abc.MutableMapping subclass to use as the return object. Can be the actual class or an empty instance of the mapping type you want. If you want a collections.defaultdict, you must pass it initialized.
Returns: collections.abc.MutableMapping
Key-value representation of Series.
Code examples:
```python
s = pd.Series([1, 2, 3, 4])
print(s.to_dict())

# Output:
# {0: 1, 1: 2, 2: 3, 3: 4}
```
```python
from collections import OrderedDict, defaultdict

print(s.to_dict(into=OrderedDict))

# Output:
# OrderedDict([(0, 1), (1, 2), (2, 3), (3, 4)])
# (Python 3.12+ prints this as OrderedDict({0: 1, 1: 2, 2: 3, 3: 4}))
```
```python
# A defaultdict must be passed as an initialized instance, not as the class.
dd = defaultdict(list)
print(s.to_dict(into=dd))

# Output:
# defaultdict(<class 'list'>, {0: 1, 1: 2, 2: 3, 3: 4})
```
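The requirement to pass an initialized defaultdict is easy to miss. The sketch below, with a made-up labeled Series, shows that the returned mapping keeps the default factory and that passing the bare class raises a TypeError.

```python
from collections import defaultdict

import pandas as pd

# Hypothetical labeled Series, used only for illustration.
s = pd.Series([3, 5], index=["apple", "pear"])

# Passing an initialized defaultdict works; the result keeps its default factory.
counts = s.to_dict(into=defaultdict(int))
print(counts["apple"])   # value taken from the Series
print(counts["banana"])  # missing key, so the int() default is returned

# Passing the bare class instead of an instance is rejected.
raised = False
try:
    s.to_dict(into=defaultdict)
except TypeError as exc:
    raised = True
    print("TypeError:", exc)
```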
2.4. Converting a Series to a list
To convert a Series to a list, use Series.to_list(), defined as follows:
Series.to_list()
Returns: list
Code examples:
```python
s = pd.Series([1, 2, 3])
print(s.to_list())

# Output:
# [1, 2, 3]
```
```python
# Index objects provide the same method.
idx = pd.Index([1, 2, 3])
print(idx)

# Output:
# Index([1, 2, 3], dtype='int64')

print(idx.to_list())

# Output:
# [1, 2, 3]
```
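As a small aside on what the resulting list contains: for numeric data, to_list() returns plain Python scalars rather than NumPy scalars, and for datetime data it returns pandas Timestamp objects. A quick sketch with made-up values:

```python
import pandas as pd

numbers = pd.Series([1, 2, 3])
first_element = numbers.iloc[0]        # a NumPy integer scalar
first_listed = numbers.to_list()[0]    # a plain Python int

dates = pd.Series(pd.to_datetime(["2020-01-01", "2020-01-02"]))
first_date = dates.to_list()[0]        # a pandas Timestamp

print(type(first_element))
print(type(first_listed))
print(type(first_date))
```

This matters when the list is passed to code that expects built-in types, for example a JSON encoder.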
2.5. Converting a DataFrame to HTML
To convert a DataFrame to an HTML table, use DataFrame.to_html(), defined as follows:
DataFrame.to_html(buf=None, *, columns=None, col_space=None, header=True, index=True, na_rep='NaN', formatters=None, float_format=None, sparsify=None, index_names=True, justify=None, max_rows=None, max_cols=None, show_dimensions=False, decimal='.', bold_rows=True, classes=None, escape=True, notebook=False, border=None, table_id=None, render_links=False, encoding=None)
Parameters:
- buf: str, Path or StringIO-like, optional, default None
Buffer to write to. If None, the output is returned as a string.
- columns: array-like, optional, default None
The subset of columns to write. Writes all columns by default.
- col_space: str or int, list or dict of int or str, optional
The minimum width of each column in CSS length units. An int is assumed to be px units.
- header: bool, optional
Whether to print column labels, default True.
- index: bool, optional, default True
Whether to print index (row) labels.
- na_rep: str, optional, default ‘NaN’
String representation of NaN to use.
- formatters: list, tuple or dict of one-param. functions, optional
Formatter functions to apply to columns’ elements by position or name. The result of each function must be a unicode string. List/tuple must be of length equal to the number of columns.
- float_format: one-parameter function, optional, default None
Formatter function to apply to columns’ elements if they are floats. This function must return a unicode string and will be applied only to the non-NaN elements, with NaN being handled by na_rep.
- sparsify: bool, optional, default True
Set to False for a DataFrame with a hierarchical index to print every multiindex key at each row.
- index_names: bool, optional, default True
Prints the names of the indexes.
- justify: str, default None
How to justify the column labels. If None, uses the option from the print configuration (controlled by set_option), ‘right’ out of the box. Valid values are left, right, center, justify, justify-all, start, end, inherit, match-parent, initial, and unset.
- max_rows: int, optional
Maximum number of rows to display in the console.
- max_cols: int, optional
Maximum number of columns to display in the console.
- show_dimensions: bool, default False
Display DataFrame dimensions (number of rows by number of columns).
- decimal: str, default ‘.’
Character recognized as decimal separator, e.g. ‘,’ in Europe.
- bold_rows: bool, default True
Make the row labels bold in the output.
- classes: str or list or tuple, default None
CSS class(es) to apply to the resulting html table.
- escape: bool, default True
Convert the characters <, >, and & to HTML-safe sequences.
- notebook: {True, False}, default False
Whether the generated HTML is for IPython Notebook.
- border: int
A border=border attribute is included in the opening <table> tag. Default pd.options.display.html.border.
- table_id: str, optional
A css id is included in the opening <table> tag if specified.
- render_links: bool, default False
Convert URLs to HTML links.
- encoding: str, default “utf-8”
Set character encoding.
Returns: str or None
If buf is None, returns the result as a string. Otherwise returns None.
Code example:
```python
df = pd.DataFrame(data={'col1': [1, 2], 'col2': [4, 3]})
df1 = df.to_html()
print(df1)

# Output:
# <table border="1" class="dataframe">
#   <thead>
#     <tr style="text-align: right;">
#       <th></th>
#       <th>col1</th>
#       <th>col2</th>
#     </tr>
#   </thead>
#   <tbody>
#     <tr>
#       <th>0</th>
#       <td>1</td>
#       <td>4</td>
#     </tr>
#     <tr>
#       <th>1</th>
#       <td>2</td>
#       <td>3</td>
#     </tr>
#   </tbody>
# </table>
```
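The example above uses only the defaults. As a hedged sketch of how a few of the parameters listed earlier interact, the snippet below hides the index, replaces NaN, formats floats, adds a CSS class and id, and writes to a buffer. The data, the class name report, and the id scores are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical data: one name contains "&", one score is missing.
df = pd.DataFrame({"team": ["A&B", "C"], "score": [91.5, float("nan")]})

html = df.to_html(
    index=False,                   # omit the row labels
    na_rep="-",                    # text shown for missing values
    float_format="{:.1f}".format,  # applied to non-NaN floats only
    classes="report",              # extra CSS class on the <table> tag
    table_id="scores",             # id attribute on the <table> tag
)
print(html)  # "&" is escaped because escape defaults to True

# When buf is given, the HTML is written there and None is returned.
buffer = io.StringIO()
result = df.to_html(buf=buffer)
print(result)
```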
Date: February 7, 2024