Python data structures: pandas (1)

No small talk, straight to the good stuff.

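I. Data structures

　　(1) Series: a one-dimensional array, similar to a one-dimensional NumPy array. Both closely resemble Python's built-in list. A Series can hold different data types: strings, boolean values, numbers and so on (mixed values are stored with the generic object dtype).

　　(2) Time series: a Series indexed by time.

　　(3) DataFrame: a two-dimensional, table-like data structure. Much of its functionality resembles data.frame in R. A DataFrame can be thought of as a container of Series.

　　(4) Panel: a three-dimensional array, which can be thought of as a container of DataFrames. Note that Panel was deprecated in pandas 0.20 and removed in 0.25; a DataFrame with a MultiIndex is the usual replacement.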
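As a rough illustration of the first three structures, here is a minimal sketch. The variable names and sample values are made up for this example and are not from the original post; Panel is left out because it no longer exists in current pandas.

```python
import pandas as pd

# A Series holding mixed Python objects falls back to the generic "object" dtype.
mixed = pd.Series(["text", True, 3.14])

# A time series is simply a Series whose index is a DatetimeIndex.
ts = pd.Series([1.0, 2.0, 3.0], index=pd.date_range("2019-01-01", periods=3))

# A DataFrame can be seen as a dict-like container of Series sharing one index.
df = pd.DataFrame({"a": pd.Series([1, 2, 3]), "b": pd.Series([4.0, 5.0, 6.0])})

print(mixed.dtype)              # object
print(type(ts.index).__name__)  # DatetimeIndex
print(type(df["a"]).__name__)   # Series
```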

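II. Basic usage

　　1. Create a Series object: an object similar to a one-dimensional array. Below, a Series is built from a range of integers.

　　　　Note: a Series consists of data and an index. The index is shown on the left and the data on the right; the index is created automatically.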

import pandas as pd

ser_obj = pd.Series(range(10, 20))   # build a Series from the integers 10..19
print('type(ser_obj):\n', type(ser_obj))   # the pandas type is <class 'pandas.core.series.Series'>
print('ser_obj=\n', ser_obj)

type(ser_obj):
 <class 'pandas.core.series.Series'>
ser_obj=
 0    10
1    11
2    12
3    13
4    14
5    15
6    16
7    17
8    18
9    19
dtype: int64
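The automatic index is only the default. A small sketch of supplying explicit labels instead; the labels "a", "b", "c" are illustrative assumptions, not part of the original example.

```python
import pandas as pd

# Default: pandas generates a RangeIndex 0..n-1.
auto = pd.Series(range(10, 13))

# Explicit labels (illustrative) replace the automatic index.
labeled = pd.Series(range(10, 13), index=["a", "b", "c"])

print(list(auto.index))     # [0, 1, 2]
print(list(labeled.index))  # ['a', 'b', 'c']
print(labeled["b"])         # 11
```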

　　2. Get the values and the index:

print(ser_obj)   # show all of the data

0    10
1    11
2    12
3    13
4    14
5    15
6    16
7    17
8    18
9    19
dtype: int64

print(type(ser_obj))         # type of the object: <class 'pandas.core.series.Series'>
print(ser_obj.values)        # the underlying values: [10 11 12 13 14 15 16 17 18 19]
print(type(ser_obj.values))  # type of the values: <class 'numpy.ndarray'>

print(ser_obj.index)         # the index object: RangeIndex(start=0, stop=10, step=1)
print(type(ser_obj.index))   # type of the index: <class 'pandas.core.indexes.range.RangeIndex'>

print(ser_obj.items())        # a lazy iterator of (index, value) pairs: <zip object at 0x000000000B8DEAC8>
print(type(ser_obj.items()))  # <class 'zip'>
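Printing the zip object is not very informative, so in practice items() is usually consumed in a loop. A minimal sketch that rebuilds the same Series so it runs on its own:

```python
import pandas as pd

ser_obj = pd.Series(range(10, 20))

# items() lazily yields (index, value) pairs, so it is normally consumed in a loop.
for idx, value in ser_obj.head(3).items():
    print(idx, value)

# Materialize the pairs when a concrete list is needed.
pairs = list(ser_obj.head(3).items())
```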

　　3. Preview the data

print(ser_obj.head(3))

0    10
1    11
2    12
dtype: int64

Take a look at the source of head():

def head(self, n=5):  # by default, the first 5 rows
    """
    Return the first `n` rows.

    This function returns the first `n` rows for the object based
    on position. It is useful for quickly testing if your object
    has the right type of data in it.

    Parameters
    ----------
    n : int, default 5
        Number of rows to select.

    Returns
    -------
    obj_head : type of caller
        The first `n` rows of the caller object.

    See Also
    --------
    pandas.DataFrame.tail: Returns the last `n` rows.

    Examples
    --------
    >>> df = pd.DataFrame({'animal':['alligator', 'bee', 'falcon', 'lion',
    ...                    'monkey', 'parrot', 'shark', 'whale', 'zebra']})
    >>> df
          animal
    0  alligator
    1        bee
    2     falcon
    3       lion
    4     monkey
    5     parrot
    6      shark
    7      whale
    8      zebra

    Viewing the first 5 lines

    >>> df.head()
          animal
    0  alligator
    1        bee
    2     falcon
    3       lion
    4     monkey

    Viewing the first `n` lines (three in this case)

    >>> df.head(3)
          animal
    0  alligator
    1        bee
    2     falcon
    """
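Since head() selects rows purely by position, it can be compared with plain positional slicing, and tail() is its counterpart for the end of the data. A small sketch on the Series used earlier:

```python
import pandas as pd

ser_obj = pd.Series(range(10, 20))

first3 = ser_obj.head(3)   # first three rows by position
last3 = ser_obj.tail(3)    # last three rows by position

# head(n) selects the same rows as positional slicing with iloc[:n].
print(first3.equals(ser_obj.iloc[:3]))  # True
print(list(last3))                      # [17, 18, 19]
```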

　　4. Get data by index

print(ser_obj[0])  #10

print(ser_obj[8])   #18
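With the default integer index, a label and a position happen to coincide. When the difference matters, the explicit accessors loc (by label) and iloc (by position) make the intent clear. A minimal sketch:

```python
import pandas as pd

ser_obj = pd.Series(range(10, 20))

print(ser_obj.loc[8])            # 18, lookup by index label
print(ser_obj.iloc[-1])          # 19, lookup by position (last element)
print(list(ser_obj.loc[2:4]))    # [12, 13, 14], label slices include the end label
print(list(ser_obj.iloc[2:4]))   # [12, 13], positional slices exclude the end
```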

　　5. The mapping between index and data is preserved in the results of array operations

print(ser_obj*2)

0    20
1    22
2    24
3    26
4    28
5    30
6    32
7    34
8    36
9    38
dtype: int64

print(ser_obj[ser_obj>15])

6    16
7    17
8    18
9    19
dtype: int64
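The same index-to-data bookkeeping applies when two Series are combined: values are matched by label rather than by position. A small sketch with made-up labels:

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["z", "y", "w"])

# Values are matched by label, not by position;
# labels present on only one side produce NaN in the result.
total = a + b
print(total)
```

Here "y" and "z" exist in both Series and are summed, while "w" and "x" come out as NaN.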

　　6. Build a Series from a dict

# Build a Series from a dict: the keys become the index
year_data = {2001: 17.8, 2002: 20.1, 2003: 16.5, 2004: 19.9, 2005: 20.2, 2006: 22.6}
ser_obj2 = pd.Series(year_data)

print(ser_obj2.head())  # prints the first 5 rows by default

2001    17.8
2002    20.1
2003    16.5
2004    19.9
2005    20.2
dtype: float64


print(ser_obj2.index)   # print the index of ser_obj2

Int64Index([2001, 2002, 2003, 2004, 2005, 2006], dtype='int64')
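Because the dict keys become index labels, values can be looked up by year, and an explicit index= argument can select or reorder keys. A minimal sketch; the year 1999 is deliberately absent from the dict to show what happens to a missing key.

```python
import pandas as pd

year_data = {2001: 17.8, 2002: 20.1, 2003: 16.5}

# Dict keys become the index, so values are looked up by year.
ser = pd.Series(year_data)
print(ser[2002])  # 20.1

# Passing index= selects and orders keys; a key missing from the dict yields NaN.
subset = pd.Series(year_data, index=[2003, 2001, 1999])
print(subset)
```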

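　　7. Set the name attribute

　　Both the Series and its index have a name: ser_obj.name = ... and ser_obj.index.name = ...

ser_obj2.name = 'temp'         # name the Series "temp"
ser_obj2.index.name = 'year'   # name the index "year"
print(ser_obj2.head())         # print the first 5 rows
print(ser_obj2.name)           # print the name of the Series
print(ser_obj2.index.name)     # print the name of the index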
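One place where these names pay off is when a Series is turned into a DataFrame. A small sketch, using a shortened version of the data above:

```python
import pandas as pd

ser = pd.Series({2001: 17.8, 2002: 20.1}, name="temp")
ser.index.name = "year"

# The Series name becomes the column name and the index name is carried over.
frame = ser.to_frame()
print(list(frame.columns))  # ['temp']
print(frame.index.name)     # year

# reset_index() turns the named index into an ordinary column.
print(list(ser.reset_index().columns))  # ['year', 'temp']
```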

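　　8. The pandas DataFrame data structure

　　　　(1) Similar to a multi-dimensional array or tabular data

　　　　(2) Each column can have a different data type

　　　　(3) The index consists of a row index and a column index

　　　　(4) A DataFrame can be built from an ndarray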

import numpy as np
array = np.random.rand(5, 4)   # a 5x4 array of random numbers drawn uniformly from [0, 1)
print(array)

df_obj = pd.DataFrame(array)   # convert the array into a DataFrame
print(df_obj.head())

[[0.16638712 0.7711124  0.72202224 0.2714576 ]
 [0.39650865 0.01447041 0.41879748 0.27559135]
 [0.46626184 0.67238444 0.72607271 0.93931229]
 [0.41514637 0.23213519 0.68909139 0.83395236]
 [0.84700412 0.3739937  0.64183245 0.64426823]]
          0         1         2         3
0  0.166387  0.771112  0.722022  0.271458
1  0.396509  0.014470  0.418797  0.275591
2  0.466262  0.672384  0.726073  0.939312
3  0.415146  0.232135  0.689091  0.833952
4  0.847004  0.373994  0.641832  0.644268
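The row and column labels above are the automatic 0..n-1 defaults. A minimal sketch of naming them explicitly; it uses a small deterministic array instead of random numbers so the values are predictable, and the labels are made up for illustration.

```python
import numpy as np
import pandas as pd

array = np.arange(6).reshape(3, 2)  # deterministic stand-in for the random array

# columns= and index= replace the automatic 0..n-1 labels.
df = pd.DataFrame(array, columns=["x", "y"], index=["r1", "r2", "r3"])
print(df)
print(df.shape)           # (3, 2)
print(df.loc["r2", "y"])  # 3
```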

　　　　(5) Build a DataFrame from a dict

# Build a DataFrame from a dict: each key becomes a column, scalars are repeated for every row

dict_data={'A':1,
           'B':pd.Timestamp('20190101'),
           'C':pd.Series(1,index=list(range(4)),dtype='float32'),
           'D':np.array([3]*4,dtype='int32'),
           'E':pd.Categorical(['python','java','C++','C#']),
           'F':'ChinaHadoop'
           }

df_obj2 = pd.DataFrame(dict_data)
print(df_obj2.head())


The result:

   A          B    C  D       E            F
0  1 2019-01-01  1.0  3  python  ChinaHadoop
1  1 2019-01-01  1.0  3    java  ChinaHadoop
2  1 2019-01-01  1.0  3     C++  ChinaHadoop
3  1 2019-01-01  1.0  3      C#  ChinaHadoop
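The printed table hides the fact that every column keeps its own data type. A small sketch that inspects the types with the dtypes attribute, using a trimmed-down version of the dict above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": 1,                                               # scalar, repeated for every row
    "C": pd.Series(1, index=range(4), dtype="float32"),
    "D": np.array([3] * 4, dtype="int32"),
    "E": pd.Categorical(["python", "java", "C++", "C#"]),
})

# One dtype per column; the result is itself a Series indexed by column name.
print(df.dtypes)
```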

　　　　(6) Get data by column label (the result is a Series)

　　　　　　df_obj[col_idx] or df_obj.col_idx

dict_data={'A':1,
           'B':pd.Timestamp('20190101'),
           'C':pd.Series(1,index=list(range(4)),dtype='float32'),
           'D':np.array([3]*4,dtype='int32'),
           'E':pd.Categorical(['python','java','C++','C#']),
           'F':'ChinaHadoop'
           }

df_obj2 = pd.DataFrame(dict_data)
print(df_obj2.head())

# Get data by column label
print(df_obj2['A'])
print(type(df_obj2['A']))  # type of column A: <class 'pandas.core.series.Series'>
print(df_obj2.A)           # attribute-style access to the same column

0    1
1    1
2    1
3    1
Name: A, dtype: int64
<class 'pandas.core.series.Series'>
0    1
1    1
2    1
3    1
Name: A, dtype: int64
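A single label returns a Series, while a list of labels returns a DataFrame. A minimal sketch with a made-up three-column frame:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})

one = df["A"]          # single label -> Series
many = df[["A", "C"]]  # list of labels -> DataFrame

print(type(one).__name__)   # Series
print(type(many).__name__)  # DataFrame
```

Attribute access such as df.A only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute; bracket access works for any name.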

　　　　(7) Add a column, much like adding a key-value pair to a dict

　　　　　　df_obj[new_col_idx] = data

df_obj2['G']= df_obj2['D']+4
print(df_obj2)

   A          B    C  D       E            F  G
0  1 2019-01-01  1.0  3  python  ChinaHadoop  7
1  1 2019-01-01  1.0  3    java  ChinaHadoop  7
2  1 2019-01-01  1.0  3     C++  ChinaHadoop  7
3  1 2019-01-01  1.0  3      C#  ChinaHadoop  7
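Assigning with brackets modifies the DataFrame in place. When the original should stay untouched, assign() is an alternative that returns a new DataFrame. A small sketch on a one-column stand-in for the data above:

```python
import pandas as pd

df = pd.DataFrame({"D": [3, 3, 3, 3]})

# assign() returns a new DataFrame and leaves the original untouched.
df_with_g = df.assign(G=df["D"] + 4)

print(list(df.columns))         # ['D']
print(list(df_with_g.columns))  # ['D', 'G']
print(list(df_with_g["G"]))     # [7, 7, 7, 7]
```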

　　　　(8) Delete a column

　　　　　　del df_obj[col_idx]

# Delete the column
del df_obj2['G']
print(df_obj2)

   A          B    C  D       E            F
0  1 2019-01-01  1.0  3  python  ChinaHadoop
1  1 2019-01-01  1.0  3    java  ChinaHadoop
2  1 2019-01-01  1.0  3     C++  ChinaHadoop
3  1 2019-01-01  1.0  3      C#  ChinaHadoop
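Like bracket assignment, del changes the DataFrame in place. A non-destructive alternative is drop(), which returns a copy without the listed columns. A minimal sketch with a made-up two-column frame:

```python
import pandas as pd

df = pd.DataFrame({"D": [3, 3], "G": [7, 7]})

# drop(columns=...) returns a new DataFrame; the original keeps column G.
dropped = df.drop(columns=["G"])

print(list(dropped.columns))  # ['D']
print(list(df.columns))       # ['D', 'G']
```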

　　9. The Index object

　　　　(1) The index of both a Series and a DataFrame is an Index object

print(type(df_obj2.index))   # index type of the DataFrame: an Index (sub)class, exact name depends on the pandas version
print(type(ser_obj2.index))  # index type of the Series: likewise an Index (sub)class

　　　　(2) Immutable: this keeps the data safe

# df_obj2.index[0] = 3    # raises TypeError: Index does not support mutable operations
# ser_obj2.index[2] = 1   # raises TypeError: Index does not support mutable operations
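The immutability can be observed safely by catching the exception. The sketch below also shows that replacing the whole index with a new object is still allowed; the replacement labels are made up for illustration.

```python
import pandas as pd

ser = pd.Series([17.8, 20.1, 16.5], index=[2001, 2002, 2003])

try:
    ser.index[0] = 1999   # element-wise assignment is rejected
except TypeError as exc:
    print("TypeError:", exc)

# Replacing the whole index with a new object is allowed.
ser.index = [1999, 2002, 2003]
print(list(ser.index))  # [1999, 2002, 2003]
```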

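　　　　(3) Common kinds of Index

　　　　　　Index

　　　　　　Int64Index (removed in pandas 2.0, where a plain Index with dtype int64 takes its place)

　　　　　　MultiIndex, a hierarchical ("multi-level") index

　　　　　　DatetimeIndex, a timestamp index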
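A minimal sketch of creating two of these index kinds directly. The level names "key" and "num" and all sample values are assumptions made for this example.

```python
import pandas as pd

# DatetimeIndex: three consecutive days.
dates = pd.date_range("2019-01-01", periods=3, freq="D")

# MultiIndex: two levels, built from (key, num) tuples.
multi = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 2), ("b", 1)], names=["key", "num"]
)
ser = pd.Series([10, 20, 30], index=multi)

print(type(dates).__name__)  # DatetimeIndex
print(multi.nlevels)         # 2
print(list(ser.loc["a"]))    # [10, 20], all rows whose first level is "a"
```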

Posted 2018-12-31 17:21 by stone1234567890
