An Overview of the pandas Module in Python

I. Definition

pandas official website: https://pandas.pydata.org/

pandas is a Python package that provides fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. It also has the broader goal of becoming the most powerful and flexible open-source data analysis/manipulation tool available in any language, and it is already well on its way toward that goal.

pandas is well suited for many different kinds of data:

  Tabular data with heterogeneously-typed columns, as in an SQL table or an Excel spreadsheet

  Ordered and unordered (not necessarily fixed-frequency) time series data

  Arbitrary matrix data (homogeneously or heterogeneously typed) with row and column labels

  Any other form of observational/statistical data sets; the data need not be labeled at all to be placed into a pandas data structure

II. Features

The two primary data structures of pandas, Series (one-dimensional) and DataFrame (two-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. For R users, DataFrame provides everything R's data.frame provides and much more. pandas is built on top of NumPy and is intended to integrate well with many other third-party libraries in a scientific computing environment.

Here are just a few of the things that pandas does well:

  Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data

  Size mutability: columns can be inserted into and deleted from DataFrame and higher-dimensional objects

  Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations

  Powerful, flexible group-by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data

  Easy conversion of ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects

  Intelligent label-based slicing, fancy indexing, and subsetting of large data sets

  Intuitive merging and joining of data sets

  Flexible reshaping and pivoting of data sets

  Hierarchical labeling of axes (possible to have multiple labels per tick)

  Robust IO tools for loading data from flat files (CSV and delimited), Excel files, and databases, and for saving/loading data in the ultrafast HDF5 format

  Time series-specific functionality: date range generation and frequency conversion, moving window statistics, date shifting, and lagging
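As a quick illustration of two of the items above (split-apply-combine group-by and intuitive merging), here is a minimal sketch; the group names and scores are invented for the example:

import pandas as pd

# Split-apply-combine: mean score per group
scores = pd.DataFrame({'group': ['a', 'a', 'b', 'b'],
                       'score': [1.0, 2.0, 3.0, 4.0]})
print(scores.groupby('group')['score'].mean())

# Intuitive merging: SQL-style join on a shared key
labels = pd.DataFrame({'group': ['a', 'b'], 'label': ['first', 'second']})
print(scores.merge(labels, on='group', how='left'))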

Many of these features are there to address shortcomings frequently encountered in other languages and scientific research environments. For data scientists, working with data typically falls into several stages: munging and cleaning the data, analyzing and modeling it, and then organizing the results into a form suitable for plotting or tabular display. pandas is an ideal tool for all of these tasks.

III. Data Types

pandas is built on two core data types: Series and DataFrame.

1. Series

A Series is a one-dimensional data structure in which every element carries a label, much like a NumPy array whose elements are labeled. The labels can be numbers or strings.

import numpy as np 
import pandas as pd 
s = pd.Series([1, 2, 5, np.nan, 6, 8]) 
print(s) 
'''
输出:
0    1.0 
1    2.0 
2    5.0 
3    NaN 
4    6.0 
5    8.0 
dtype: float64 
'''

2. DataFrame

A DataFrame is a two-dimensional table structure. A pandas DataFrame can hold many different data types, and each axis has its own labels. You can think of it as a dictionary of Series objects.

# Create a DataFrame:
# First build a sequence of dates to use as the index
dates = pd.date_range('20130101', periods=6)
print(type(dates))

# Create the DataFrame; index sets the row labels, columns sets the column names
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df)

'''
输出:
<class 'pandas.core.indexes.datetimes.DatetimeIndex'> 
                   A         B         C         D 
2013-01-01  0.406575 -1.356139  0.188997 -1.308049 
2013-01-02 -0.412154  0.123879  0.907458  0.201024 
2013-01-03  0.576566 -1.875753  1.967512 -1.044405 
2013-01-04  1.116106 -0.796381  0.432589  0.764339 
2013-01-05 -1.851676  0.378964 -0.282481  0.296629 
2013-01-06 -1.051984  0.960433 -1.313190 -0.093666 
'''

# Create a DataFrame from a dict
df2 = pd.DataFrame({'A': 1.,
   'B': pd.Timestamp('20130102'),
   'C': pd.Series(1, index=list(range(4)), dtype='float32'),
   'D': np.array([3] * 4, dtype='int32'),
   'E': pd.Categorical(["test", "train", "test", "train"]),
   'F': 'foo'})
print(df2) 

'''
输出:
     A          B    C  D      E    F 
0  1.0 2013-01-02  1.0  3   test  foo 
1  1.0 2013-01-02  1.0  3  train  foo 
2  1.0 2013-01-02  1.0  3   test  foo 
3  1.0 2013-01-02  1.0  3  train  foo  
'''

IV. Usage

1. Importing the modules

import pandas as pd
import numpy as np

2. Reading data files (CSV / JSON / Excel)

df = pd.read_csv('file.csv')    # read_csv takes the file path as its first argument (there is no path= keyword)

'''
Commonly used read_csv parameters:
header=None                use default column names 0, 1, 2, 3, ... (no header row in the file)
names=['A', 'B', 'C'...]   supply custom column names
index_col='A'|['A', 'B'...]  use the given column(s) as the row index; pass a list for a MultiIndex
skiprows=[0,1,2]           row numbers to skip, counted from the top of the file (0-based);
                           skipfooter (formerly skip_footer) counts from the end of the file
nrows=N                    read only the first N rows
chunksize=M                return an iterator (TextFileReader) that yields M rows at a time;
                           useful when the data is too large to hold in memory at once
sep=':'                    field delimiter, default ','; choose one that matches the file
                           (with sep=None and the Python engine, pandas tries to infer it)
skip_blank_lines=False     default is True (blank lines are skipped); if False, blank lines become NaN rows
converters={'col1': func}  apply func to the values of the selected column; often used for ID-like
                           columns to keep them as strings instead of converting them to int
dfjs = pd.read_json('file.json')  also accepts a JSON-format string
dfex = pd.read_excel('file.xls', sheet_name=[0,1..])  read several sheets and get back a dict of DataFrames
'''
# df.to_csv() writes a DataFrame back out to a CSV file
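A minimal sketch combining a few of these parameters; the file name 'file.csv' and the column names are placeholders for this example:

import pandas as pd

# Skip the original header line, supply our own column names, and read only the first 100 rows;
# converters keeps the id column as strings so leading zeros are not lost
df = pd.read_csv('file.csv', header=None, names=['id', 'date', 'value'],
                 skiprows=[0], nrows=100, converters={'id': str})

# For very large files, iterate over the data in chunks instead of loading it all at once
for chunk in pd.read_csv('file.csv', chunksize=10000):
    print(chunk.shape)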

3. Querying the data

df.shape                        # number of rows and columns
df.dtypes                       # data type of each column
df.columns                      # all column names
df.head(n)                      # first n rows (n defaults to 5)
df.tail(n)                      # last n rows (n defaults to 5)
df.head(1)['date']              # the 'date' column of the first row
df.head(1)['date'][0]           # the scalar 'date' value of the first row
df.describe(include='all')      # include='all' summarizes every column, not just the numeric ones
df.columns.tolist()             # column names as a Python list
df.T                            # transpose of the data
df.notnull()                    # True wherever a value is not null
df.isnull()                     # True wherever a value is null; works on the whole DataFrame or a single column
df["col"]                       # return the column named "col"
df[["name", "age"]]             # return the 'name' and 'age' columns
df['col'].unique()              # unique values of a column (zeros may appear if missing values were filled with 0)
df = pd.read_excel(file, skiprows=[2])  # skip the listed rows (0-indexed) when reading; here row 2 is not read
df.loc[0]                       # the row labeled 0 (the first row here)
df.loc[0]["name"]               # value of the 'name' column in the row labeled 0
df.loc[2:4]                     # rows labeled 2 through 4; unlike Python slicing, .loc includes both endpoints
df.loc[[2, 5, 10]]              # rows labeled 2, 5, and 10; the labels must be wrapped in a list
df.loc[:, 'test1']              # the 'test1' column; ':' means all rows, the comma separates rows from columns
df.loc[:, ['test1', 'test2']]   # the 'test1' and 'test2' columns for all rows
df.loc[1, ['test1', 'test2']]   # 'test1' and 'test2' values of the row labeled 1
df.at[1, 'test1']               # scalar at row label 1, column 'test1'; like .loc but for a single value
df.iloc[0]                      # first row by position
df.iloc[0:2, 0:2]               # first two rows and first two columns by position
df.iloc[[1, 2, 4], [0, 2]]      # rows 1, 2, 4 and columns 0, 2 by position
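To make the selection calls above concrete, here is a small self-contained sketch (the column names and values are invented for the example):

import pandas as pd

df = pd.DataFrame({'name': ['Ann', 'Bob', 'Cid'],
                   'age': [23, 31, 27],
                   'city': ['NY', 'LA', 'SF']})

print(df.shape)            # (3, 3)
print(df.loc[0, 'name'])   # 'Ann', label-based selection
print(df.iloc[0:2, 0:2])   # first two rows and columns, position-based selection
print(df[df['age'] > 25])  # boolean filtering: rows where age > 25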

4. Data processing

(1) Getting the data (basic information about an Excel file)

# coding=utf-8
import pandas as pd
import numpy as np

excel_data = pd.read_excel("test.xlsx")
print(excel_data.shape)           # number of rows and columns
print(excel_data.index)           # the row index
print(excel_data.columns)         # all column names
print(excel_data.info())          # concise summary: columns, non-null counts, dtypes, memory usage
print(excel_data.dtypes)          # data type of each column
#Help on function read_excel in module pandas.io.excel:

read_excel(*args, **kwargs)
    Read an Excel table into a pandas DataFrame
    
    Parameters
    ----------
    io : string, path object (pathlib.Path or py._path.local.LocalPath),
        file-like object, pandas ExcelFile, or xlrd workbook.
        The string could be a URL. Valid URL schemes include http, ftp, s3,
        and file. For file URLs, a host is expected. For instance, a local
        file could be file://localhost/path/to/workbook.xlsx
    sheet_name : string, int, mixed list of strings/ints, or None, default 0
    
        Strings are used for sheet names, Integers are used in zero-indexed
        sheet positions.
    
        Lists of strings/integers are used to request multiple sheets.
    
        Specify None to get all sheets.
    
        str|int -> DataFrame is returned.
        list|None -> Dict of DataFrames is returned, with keys representing
        sheets.
    
        Available Cases
    
        * Defaults to 0 -> 1st sheet as a DataFrame
        * 1 -> 2nd sheet as a DataFrame
        * "Sheet1" -> 1st sheet as a DataFrame
        * [0,1,"Sheet5"] -> 1st, 2nd & 5th sheet as a dictionary of DataFrames
        * None -> All sheets as a dictionary of DataFrames
    
    sheetname : string, int, mixed list of strings/ints, or None, default 0
    
        .. deprecated:: 0.21.0
           Use `sheet_name` instead
    
    header : int, list of ints, default 0
        Row (0-indexed) to use for the column labels of the parsed
        DataFrame. If a list of integers is passed those row positions will
        be combined into a ``MultiIndex``. Use None if there is no header.
    names : array-like, default None
        List of column names to use. If file contains no header row,
        then you should explicitly pass header=None
    index_col : int, list of ints, default None
        Column (0-indexed) to use as the row labels of the DataFrame.
        Pass None if there is no such column.  If a list is passed,
        those columns will be combined into a ``MultiIndex``.  If a
        subset of data is selected with ``usecols``, index_col
        is based on the subset.
    parse_cols : int or list, default None
    
        .. deprecated:: 0.21.0
           Pass in `usecols` instead.
    
    usecols : int or list, default None
        * If None then parse all columns,
        * If int then indicates last column to be parsed
        * If list of ints then indicates list of column numbers to be parsed
        * If string then indicates comma separated list of Excel column letters and
          column ranges (e.g. "A:E" or "A,C,E:F").  Ranges are inclusive of
          both sides.
    squeeze : boolean, default False
        If the parsed data only contains one column then return a Series
    dtype : Type name or dict of column -> type, default None
        Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32}
        Use `object` to preserve data as stored in Excel and not interpret dtype.
        If converters are specified, they will be applied INSTEAD
        of dtype conversion.
    
        .. versionadded:: 0.20.0
    
    engine: string, default None
        If io is not a buffer or path, this must be set to identify io.
        Acceptable values are None or xlrd
    converters : dict, default None
        Dict of functions for converting values in certain columns. Keys can
        either be integers or column labels, values are functions that take one
        input argument, the Excel cell content, and return the transformed
        content.
    true_values : list, default None
        Values to consider as True
    
        .. versionadded:: 0.19.0
    
    false_values : list, default None
        Values to consider as False
    
        .. versionadded:: 0.19.0
    
    skiprows : list-like
        Rows to skip at the beginning (0-indexed)
    nrows : int, default None
        Number of rows to parse
    
        .. versionadded:: 0.23.0
    
    na_values : scalar, str, list-like, or dict, default None
        Additional strings to recognize as NA/NaN. If dict passed, specific
        per-column NA values. By default the following values are interpreted
        as NaN: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan',
        '1.#IND', '1.#QNAN', 'N/A', 'NA', 'NULL', 'NaN', 'n/a', 'nan',
        'null'.
    keep_default_na : bool, default True
        If na_values are specified and keep_default_na is False the default NaN
        values are overridden, otherwise they're appended to.
    verbose : boolean, default False
        Indicate number of NA values placed in non-numeric columns
    thousands : str, default None
        Thousands separator for parsing string columns to numeric.  Note that
        this parameter is only necessary for columns stored as TEXT in Excel,
        any numeric columns will automatically be parsed, regardless of display
        format.
    comment : str, default None
        Comments out remainder of line. Pass a character or characters to this
        argument to indicate comments in the input file. Any data between the
        comment string and the end of the current line is ignored.
    skip_footer : int, default 0
    
        .. deprecated:: 0.23.0
           Pass in `skipfooter` instead.
    skipfooter : int, default 0
        Rows at the end to skip (0-indexed)
    convert_float : boolean, default True
        convert integral floats to int (i.e., 1.0 --> 1). If False, all numeric
        data will be read in as floats: Excel stores all numbers as floats
        internally
    
    Returns
    -------
    parsed : DataFrame or Dict of DataFrames
        DataFrame from the passed in Excel file.  See notes in sheet_name
        argument for more information on when a Dict of Dataframes is returned.
read_excel parameter reference
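A short sketch using a few of the parameters documented above; 'test.xlsx' and the sheet/column choices are placeholders:

import pandas as pd

# Read the first sheet, use row 0 as the header, keep only Excel columns A:C,
# and read the (hypothetical) 'id' column as text rather than numbers
df = pd.read_excel('test.xlsx', sheet_name=0, header=0,
                   usecols='A:C', dtype={'id': str})

# Passing a list of sheets returns a dict of DataFrames keyed by sheet
sheets = pd.read_excel('test.xlsx', sheet_name=[0, 1])
print(type(sheets))   # <class 'dict'>
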
Getting rows
excel_data.head(5)                   # first 5 rows
excel_data.tail(5)                   # last 5 rows
excel_data.loc[0]                    # the row labeled 0 (the first row)
excel_data.loc[2:4]                  # rows labeled 2 through 4; .loc includes both endpoints
excel_data.loc[[2, 5, 10]]           # rows labeled 2, 5, and 10; the labels must be wrapped in a list
excel_data.iloc[0]                   # first row by position

Getting columns
excel_data["name"]                   # the 'name' column
excel_data[["name", "age"]]          # the 'name' and 'age' columns
excel_data["name"].unique()          # unique values of the 'name' column (zeros may appear if missing values were filled with 0)

Getting a specific row and column
excel_data.head(5)["name"]           # the 'name' column of the first 5 rows
excel_data.head(5)["name"][0]        # the scalar 'name' value of the first row
excel_data.at[1, "age"]              # scalar at row label 1, column 'age'
excel_data.loc[0]["name"]            # 'name' value of the row labeled 0
excel_data.loc[:, "age"]             # the 'age' column; ':' means all rows, the comma separates rows from columns
excel_data.loc[:, ["age", "time"]]   # 'age' and 'time' columns for all rows
excel_data.loc[1, ["age", "time"]]   # 'age' and 'time' values of the row labeled 1
excel_data.iloc[0:2, 0:2]            # first two rows and first two columns by position
excel_data.iloc[[1, 2, 4], [0, 2]]   # rows 1, 2, 4 and columns 0, 2 by position


Checking for null values
excel_data.notnull()                 # True wherever a value is not null
excel_data.isnull()                  # True wherever a value is null; works on the whole DataFrame or a single column
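In practice the boolean frames returned by isnull()/notnull() are usually aggregated; a small sketch (the 'age' column is hypothetical):

print(excel_data.isnull().sum())               # number of missing values per column
print(excel_data.isnull().sum().sum())         # total number of missing values
print(excel_data[excel_data['age'].isnull()])  # rows where the 'age' column is missing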

(2) Data cleaning and transformation

1) Adding
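A minimal sketch of adding data: new columns can be created by plain assignment or inserted at a given position (the column names below are hypothetical):

excel_data['source'] = 'test.xlsx'                      # add a constant column
excel_data['age_plus_one'] = excel_data['age'] + 1      # add a column derived from an existing one
excel_data.insert(0, 'row_id', range(len(excel_data)))  # insert a column at the first position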

2) Deleting

a. Drop invalid rows and columns (rows or columns that are entirely blank and therefore carry no information)

b. Drop specified rows and columns

Help on method drop in module pandas.core.frame:

drop(self, labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise') method of pandas.core.frame.DataFrame instance
    Drop specified labels from rows or columns.
    
    Remove rows or columns by specifying label names and corresponding
    axis, or by specifying directly index or column names. When using a
    multi-index, labels on different levels can be removed by specifying
    the level.
    
    Parameters
    ----------
    labels : single label or list-like
        Index or column labels to drop.
    axis : {0 or 'index', 1 or 'columns'}, default 0
        Whether to drop labels from the index (0 or 'index') or
        columns (1 or 'columns').
    index, columns : single label or list-like
        Alternative to specifying axis (``labels, axis=1``
        is equivalent to ``columns=labels``).
    
        .. versionadded:: 0.21.0
    level : int or level name, optional
        For MultiIndex, level from which the labels will be removed.
    inplace : bool, default False
        If True, do operation inplace and return None.
    errors : {'ignore', 'raise'}, default 'raise'
        If 'ignore', suppress error and only existing labels are
        dropped.
excel_data.drop
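A short sketch of drop() for case b above; the row labels and column names are placeholders:

trimmed = excel_data.drop(index=[0, 1])           # drop the rows labeled 0 and 1
trimmed = trimmed.drop(columns=['age', 'time'])   # drop two columns by name
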
#Help on method dropna in module pandas.core.frame:

dropna(self, axis=0, how='any', thresh=None, subset=None, inplace=False) method of pandas.core.frame.DataFrame instance
    Remove missing values.
    
    See the :ref:`User Guide <missing_data>` for more on which values are
    considered missing, and how to work with missing data.
    
    Parameters
    ----------
    axis : {0 or 'index', 1 or 'columns'}, default 0
        Determine if rows or columns which contain missing values are
        removed.
    
        * 0, or 'index' : Drop rows which contain missing values.
        * 1, or 'columns' : Drop columns which contain missing value.
    
        .. deprecated:: 0.23.0: Pass tuple or list to drop on multiple
        axes.
    how : {'any', 'all'}, default 'any'
        Determine if row or column is removed from DataFrame, when we have
        at least one NA or all NA.
    
        * 'any' : If any NA values are present, drop that row or column.
        * 'all' : If all values are NA, drop that row or column.
    thresh : int, optional
        Require that many non-NA values.
    subset : array-like, optional
        Labels along other axis to consider, e.g. if you are dropping rows
        these would be a list of columns to include.
    inplace : bool, default False
        If True, do operation inplace and return None.
excel_data.dropna
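A sketch of dropna() for case a above (removing rows or columns that are entirely empty) as well as ordinary missing-value removal; the 'age' column is a placeholder:

cleaned = excel_data.dropna(how='all')            # drop rows in which every value is NA
cleaned = cleaned.dropna(axis=1, how='all')       # drop columns in which every value is NA
cleaned = cleaned.dropna(subset=['age'])          # drop rows whose 'age' value is missing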

3) Modifying

#Help on method fillna in module pandas.core.frame:

fillna(self, value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs) method of pandas.core.frame.DataFrame instance
    Fill NA/NaN values using the specified method
    
    Parameters
    ----------
    value : scalar, dict, Series, or DataFrame
        Value to use to fill holes (e.g. 0), alternately a
        dict/Series/DataFrame of values specifying which value to use for
        each index (for a Series) or column (for a DataFrame). (values not
        in the dict/Series/DataFrame will not be filled). This value cannot
        be a list.
    method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None
        Method to use for filling holes in reindexed Series
        pad / ffill: propagate last valid observation forward to next valid
        backfill / bfill: use NEXT valid observation to fill gap
    axis : {0 or 'index', 1 or 'columns'}
    inplace : boolean, default False
        If True, fill in place. Note: this will modify any
        other views on this object, (e.g. a no-copy slice for a column in a
        DataFrame).
    limit : int, default None
        If method is specified, this is the maximum number of consecutive
        NaN values to forward/backward fill. In other words, if there is
        a gap with more than this number of consecutive NaNs, it will only
        be partially filled. If method is not specified, this is the
        maximum number of entries along the entire axis where NaNs will be
        filled. Must be greater than 0 if not None.
    downcast : dict, default is None
        a dict of item->dtype of what to downcast if possible,
        or the string 'infer' which will try to downcast to an appropriate
        equal type (e.g. float64 to int64 if possible)
excel_data.fillna
excel_data.reindex()            # conform the data to a new index
excel_data.rename()             # rename index labels or column names
excel_data.replace()            # replace given values with other values
excel_data.astype()             # cast columns to another dtype
excel_data.duplicated()         # boolean Series marking duplicated rows
excel_data['col'].unique()      # unique values; unique() is a Series method, so call it on a single column
excel_data.drop_duplicates()    # drop duplicated rows
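A combined sketch of the modification methods above; the column names and replacement values are placeholders:

fixed = excel_data.fillna({'age': 0, 'name': 'unknown'})  # fill NaN column by column
fixed = fixed.fillna(method='ffill')                      # or carry the last valid value forward
fixed = fixed.rename(columns={'name': 'full_name'})       # rename a column
fixed = fixed.replace('N/A', np.nan)                      # replace a sentinel value with NaN
fixed = fixed.astype({'age': 'int64'})                    # cast a column to another dtype
fixed = fixed.drop_duplicates()                           # drop fully duplicated rows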

5. A typical data analysis workflow

(1) Load the data

(2) Remove null values, empty rows, and duplicate rows

  Methods for handling missing values:

    dropna()
    fillna()
    isna()
    notna()

(3) Filter the data

(4) Group the data

(5) Sort the data (a combined sketch of steps (3) through (5) follows below)
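Putting steps (3) through (5) together, a minimal sketch on a hypothetical sales table (the 'region' and 'amount' columns are invented for the example):

import pandas as pd

sales = pd.DataFrame({'region': ['north', 'south', 'north', 'south'],
                      'amount': [100, 80, 150, 60]})

filtered = sales[sales['amount'] > 70]                 # (3) filter rows
grouped = filtered.groupby('region')['amount'].sum()   # (4) group and aggregate
result = grouped.sort_values(ascending=False)          # (5) sort
print(result)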

https://pandas.pydata.org/docs/reference/general_functions.html

posted @ 2019-03-25 17:27  Einewhaw