ArmRoundMan


DataFrames are the central data structure in the pandas API. A DataFrame is like a spreadsheet, with numbered rows and named columns.

To make the examples below easier to follow, import the required modules first.

import numpy as np  # needed by the np.array examples below
import pandas as pd

1. Create, access and modify.

Read a .csv file into a pandas DataFrame:

chicago_taxi_dataset = pd.read_csv("https://download.mlcc.google.com/mledu-datasets/chicago_taxi_train.csv")

Basic argument of read_csv():

  • filepath_or_buffer: various

Either a path to a file (a str, pathlib.Path, or py._path.local.LocalPath), a URL (including http, ftp, and S3 locations), or any object with a read() method (such as an open file or StringIO).
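
As a minimal sketch of the "any object with a read() method" case, the snippet below parses CSV text held in memory. The CSV content and the names csv_text and small_df are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV text standing in for a real file.
csv_text = "temperature,activity\n0,3\n10,7\n20,9\n"

# StringIO exposes read(), so read_csv() accepts it just like a path or URL.
small_df = pd.read_csv(io.StringIO(csv_text))
print(small_df)
```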

The following code instantiates the pd.DataFrame class to generate a DataFrame.

# Create and populate a 5x2 NumPy array.
my_data = np.array([[0, 3], [10, 7], [20, 9], [30, 14], [40, 15]])

# Create a Python list that holds the names of the two columns.
my_column_names = ['temperature', 'activity']

# Create a DataFrame.
my_dataframe = pd.DataFrame(data=my_data, columns=my_column_names)

# Print the entire DataFrame
print(my_dataframe)

Inspect its .index attribute; the result is

RangeIndex(start=0, stop=5, step=1)

If an index argument is passed at construction, say by changing line 8 of the code above (the pd.DataFrame call) to

my_dataframe = pd.DataFrame(data=my_data, columns=my_column_names, index=[10, 20, 30, 40, 50])

the index attribute is

Index([10, 20, 30, 40, 50], dtype='int64')

 len(my_dataframe.index)  counts the number of rows of the DataFrame.

The DataFrame.reset_index(level=None, *, drop=False, inplace=False, col_level=0, col_fill='', allow_duplicates=_NoDefault.no_default, names=None) method resets the index, where

  • level: int, str, tuple, or list

    Only remove the given levels from the index. Removes all levels by default.

  • drop: bool. If False (the default), the original index is inserted into the DataFrame as a column; if True, it is discarded.
  • inplace: bool, whether to modify the DataFrame rather than creating a new one.
  • names: int, str or 1-dimensional list

    Using the given string, rename the DataFrame column which contains the original index data. If the DataFrame has a MultiIndex, this has to be a list or tuple with length equal to the number of levels.

It returns a new DataFrame or None if inplace=True.

DataFrame.set_index(keys, *, drop=True, append=False, inplace=False, verify_integrity=False) does the reverse, where

  • keys: label or array-like or list of labels/arrays. This parameter can be either a single column key or one of the other types described in the pandas documentation.
  • drop: bool, whether to delete the columns that are used as the new index.
  • append: bool, whether to append the columns to the existing index or replace the original index. For more information, see the pandas documentation.
  • inplace: bool, the same as the one above.
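
The round trip between reset_index() and set_index() can be sketched as follows. The toy DataFrame and the variable names kept, dropped and restored are illustrative assumptions, not part of the original examples.

```python
import pandas as pd

# Hypothetical toy data; the index labels 10, 20, 30 are illustrative.
toy = pd.DataFrame(
    {"temperature": [0, 10, 20], "activity": [3, 7, 9]},
    index=[10, 20, 30],
)

# drop=False (default): the old index becomes a column named "index".
kept = toy.reset_index()

# drop=True: the old index is discarded.
dropped = toy.reset_index(drop=True)

# set_index() goes the other way: a column becomes the index.
restored = kept.set_index("index")

print(list(kept.columns))
print(list(dropped.index))
print(list(restored.index))
```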

 

 

You may add a new column to an existing pandas DataFrame just by assigning values to a new column name.

# Create a new column named adjusted.
my_dataframe["adjusted"] = my_dataframe["activity"] + 2
# Print the entire DataFrame
print(my_dataframe)

Pandas provides multiple ways to isolate specific rows, columns, slices or cells in a DataFrame.

print(my_dataframe.activity)  # Attribute access; same as my_dataframe['activity']
print("Rows #0, #1, and #2:")
print(my_dataframe.head(3), '\n')

print("Row #2:")
print(my_dataframe.iloc[[2]], '\n')  # The type of the result is DataFrame.
print("Row #2:")
print(my_dataframe.iloc[2], '\n')  # The type of the result is Series.
print("Rows #1, #2, and #3:")
print(my_dataframe[1:4], '\n')  # The slice starts at the second row (position 1),
# not the first.

print("Column 'temperature':")
print(my_dataframe['temperature'])

# Select several columns at once by passing a list of column names.
training_df = chicago_taxi_dataset[['TRIP_MILES', 'TRIP_SECONDS', 'FARE', 'COMPANY', 'PAYMENT_TYPE', 'TIP_RATE']]
training_df.head(200)

 

To get random samples of a DataFrame, use the DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None, ignore_index=False) method, where

  • n: int, optional

    Number of items from axis to return. Cannot be used with frac. Default = 1 if frac = None.

  • frac: float, optional

    Fraction of axis items to return. Cannot be used with n.

  • replace: bool, whether to allow sampling of the same row more than once. It must be set to True if frac > 1.
  • random_state: If int, array-like, or BitGenerator, seed for random number generator. If np.random.RandomState or np.random.Generator, use as given.
  • axis: {0 or ‘index’, 1 or ‘columns’, None}, default is stat axis for given data type.
  • ignore_index: bool, deciding if the resulting index will be labeled 0, 1, …, n-1.

It returns a new object of same type as caller containing n items.
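
A small sketch of sample() on made-up data follows. The DataFrame toy and the seed value 0 are arbitrary choices for illustration.

```python
import pandas as pd

# Hypothetical toy data with ten rows.
toy = pd.DataFrame({"x": range(10)})

# random_state fixes the seed, so repeated calls return the same rows.
first = toy.sample(n=3, random_state=0)
second = toy.sample(n=3, random_state=0)

# frac > 1 requires replace=True, because rows have to repeat.
# ignore_index=True relabels the result 0, 1, ..., n-1.
oversampled = toy.sample(frac=1.5, replace=True, random_state=0, ignore_index=True)

print(len(first), len(oversampled))
```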

 

Q: What's the difference between Series and DataFrame? 

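A: The former is a single column of the latter (Google Gemini insists it is a row, and I don't know why). Selecting one row, as with my_dataframe.iloc[2] above, also returns a Series, which may be what Gemini had in mind.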

How to index a particular cell of the DataFrame?

# Create a Python list that holds the names of the four columns.
my_column_names = ['Eleanor', 'Chidi', 'Tahani', 'Jason']

# Create a 3x4 numpy array, each cell populated with a random integer.
my_data = np.random.randint(low=0, high=101, size=(3, 4))

# Create a DataFrame.
df = pd.DataFrame(data=my_data, columns=my_column_names)

# Print the entire DataFrame
print(df)

# Print the value in row #1 of the Eleanor column (chained indexing).
print("\nSecond row of the Eleanor column: %d\n" % df['Eleanor'][1])

Q: How to convert a Series to ndarray?

A: The Series.values property returns the Series as an ndarray or ndarray-like object, depending on the dtype. However, it is recommended to use Series.array or Series.to_numpy() instead, depending on whether you need a reference to the underlying data or a NumPy array.
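
The two recommended accessors can be compared in a short sketch. The Series s and the names arr and ext are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])

arr = s.to_numpy()  # a NumPy ndarray
ext = s.array       # a pandas ExtensionArray wrapping the underlying data

print(type(arr).__name__)
print(type(ext).__name__)
```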

The following code shows how to add a new column to an existing DataFrame by computing it row by row from other columns:

# Create a column named Janet whose contents are the sum
# of two other columns.
df['Janet'] = df['Tahani'] + df['Jason']

# Print the enhanced DataFrame
print(df)

Pandas provides two different ways to duplicate a DataFrame:

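  • Referencing: the two names stay linked, so a change made through one is visible through the other.
  • Copying: the two objects are independent of each other.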

 

# Create a reference by assigning df to a new variable.
print("Experiment with a reference:")
reference_to_df = df

# Print the starting value of a particular cell.
print("  Starting value of df: %d" % df['Jason'][1])
print("  Starting value of reference_to_df: %d\n" % reference_to_df['Jason'][1])

# Modify a cell in df. Use .at rather than chained indexing for assignment:
# df['Jason'][1] = ... may write to a temporary copy instead of df.
df.at[1, 'Jason'] = df['Jason'][1] + 5
print("  Updated df: %d" % df['Jason'][1])
print("  Updated reference_to_df: %d\n\n" % reference_to_df['Jason'][1])

What's the difference among .at, .loc, .iloc, .iat and chained indexing? There are a lot of differences; the notes below cover .iloc.

.iloc is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array. Allowed inputs are:

  • An integer.

  • A list or array of integers.

  • A slice object with ints, e.g. 1:7.

  • A boolean array.

  • A tuple of row and column indexes. The tuple elements consist of one of the above inputs, e.g. (0, 1).

Note: Deprecated since version 2.2.0: Returning a tuple from a callable is deprecated for pandas.DataFrame.iloc, so it is better not to pass a callable.
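
As a rough side-by-side of the indexers, the sketch below reads the same cell four ways on a made-up DataFrame with string row labels. The data and variable names are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data with string row labels.
toy = pd.DataFrame({"Eleanor": [1, 2, 3], "Jason": [4, 5, 6]}, index=["a", "b", "c"])

by_label = toy.loc["b", "Jason"]    # label based
by_position = toy.iloc[1, 1]        # integer position based
cell_label = toy.at["b", "Jason"]   # single cell, by label
cell_position = toy.iat[1, 1]       # single cell, by position

# .iloc also accepts a boolean array: keep rows 0 and 2.
masked = toy.iloc[[True, False, True]]

# Assignment through a single indexer call updates toy itself.
toy.at["b", "Jason"] = 50

print(by_label, by_position, cell_label, cell_position)
print(masked)
```

.at and .iat only address a single cell, while .loc and .iloc also accept lists, slices and boolean arrays.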

To make a copy of a DataFrame, use the DataFrame.copy(deep=True) method. When deep=True (the default), a new object is created with a copy of the calling object's data and indices. Modifications to the data or indices of the copy will not be reflected in the original object. Things might be a bit more complicated for deep=False after pandas 3.0.

 

The following code creates a copy for such an experiment:

copy_of_my_dataframe = my_dataframe.copy()
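
The difference between a reference and a copy can be sketched on a small made-up DataFrame. The names original, reference and independent are illustrative.

```python
import pandas as pd

original = pd.DataFrame({"Jason": [10, 20, 30]})

reference = original           # the same object under a second name
independent = original.copy()  # deep copy by default

# Modify one cell through the original name.
original.at[1, "Jason"] = 99

print(reference.at[1, "Jason"])    # 99: the reference sees the change
print(independent.at[1, "Jason"])  # 20: the copy does not
```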

  DataFrame.drop(labels=None, *, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')  drops specified labels from rows or columns, where

  • labels: single label or list-like

    Index or column labels to drop. A tuple will be used as a single label and not treated as a list-like.

  •  axis: {0 or ‘index’, 1 or ‘columns’}, whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’).

  • index: single label or list-like

    Alternative to specifying axis (labels, axis=0 is equivalent to index=labels).

  •  columns: single label or list-like

    Alternative to specifying axis (labels, axis=1 is equivalent to columns=labels).

  •  level: int or level name, optional

    For MultiIndex, level from which the labels will be removed.

  • inplace: bool, whether to do the operation in place and return None, or to return a copy.

  • errors: {‘ignore’, ‘raise’}. If ‘ignore’, suppress error and only existing labels are dropped.

It returns a DataFrame with the specified index or column labels removed, or None if inplace=True.
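
A brief sketch of drop() on made-up data follows. The column names a, b, c and the label "missing" are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

no_b = toy.drop(columns="b")      # same as toy.drop("b", axis=1)
no_first_row = toy.drop(index=0)  # same as toy.drop(0, axis=0)

# errors="ignore" skips labels that do not exist instead of raising.
unchanged = toy.drop(columns="missing", errors="ignore")

print(list(no_b.columns))
print(list(no_first_row.index))
```

Because inplace defaults to False, toy itself keeps all three columns and both rows.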

 

2. Data exploration.

To preview first n rows of a large DataFrame, use  DataFrame.head(n=5)  and it returns the same type as caller.

Use the DataFrame.describe(percentiles=None, include=None, exclude=None) method to view descriptive statistics about the dataset, where

  • percentiles: list-like of numbers, optional. All should fall between 0 and 1. The default is [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles.
  • include: ‘all’, list-like of dtypes or None, optional. A white list of data types to include in the result. Ignored for  Series . 
    • ‘all’ : All columns of the input will be included in the output.

 training_df.describe(include='all')  results in

        TRIP_MILES    TRIP_SECONDS  FARE          COMPANY    PAYMENT_TYPE  TIP_RATE
count   31694.000000  31694.000000  31694.000000  31694      31694         31694.000000
unique  NaN           NaN           NaN           31         7             NaN
top     NaN           NaN           NaN           Flash Cab  Credit Card   NaN
freq    NaN           NaN           NaN           7887       14142         NaN
mean    8.289463      1319.796397   23.905210     NaN        NaN           12.965785
std     7.265672      928.932873    16.970022     NaN        NaN           15.517765
min     0.500000      60.000000     3.250000      NaN        NaN           0.000000
25%     1.720000      548.000000    9.000000      NaN        NaN           0.000000
50%     5.920000      1081.000000   18.750000     NaN        NaN           12.200000
75%     14.500000     1888.000000   38.750000     NaN        NaN           20.800000
max     68.120000     7140.000000   159.250000    NaN        NaN           648.600000

# How many cab companies are in the dataset?
Try the DataFrame.nunique(axis=0, dropna=True) method, where
  • axis: {0 or ‘index’, 1 or ‘columns’}
  • dropna: bool, whether to exclude NaN in the counts.

It returns a Series. The Series.nunique(dropna=True) method returns an int.

 num_unique_companies =  training_df['COMPANY'].nunique()

 

# What is the most frequent payment type?
First, count the frequency of each distinct row in the DataFrame with DataFrame.value_counts(subset=None, normalize=False, sort=True, ascending=False, dropna=True), where
  • subset: label or list of labels of columns to use when counting unique combinations, optional.
  • normalize: bool, whether to return proportions rather than frequencies.
  • sort: bool. Sort by frequencies when True, otherwise by DataFrame column values(original order).
  • ascending: bool.
  • dropna: bool, whether to exclude rows containing NaN.

It returns a Series. The Series.value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=True) method returns a Series containing counts of unique values, where

  • bins: int, optional. Number of bins.

    Rather than count values, group them into half-open bins, a convenience for pd.cut, only works with numeric data.

Second, find the entry with the largest frequency with the Series.idxmax(axis=0, skipna=True, *args, **kwargs) method, where
  • axis: {0 or ‘index’}. Unused; it is needed for compatibility with DataFrame.
  • skipna: bool, whether to exclude NA/null values. If the entire Series is NA, the result will be NA.
  • *args, **kwargs: Additional arguments and keywords have no effect but might be accepted for compatibility with NumPy.

It returns the index label of the maximum value.

most_freq_payment_type = training_df['PAYMENT_TYPE'].value_counts().idxmax()
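
The counting steps above can be tried on a tiny made-up Series. The payment labels below are invented and are not taken from the taxi dataset.

```python
import pandas as pd

# Hypothetical payment types for five trips.
payments = pd.Series(["Cash", "Credit Card", "Cash", "Mobile", "Cash"])

counts = payments.value_counts()                # frequency of each value
shares = payments.value_counts(normalize=True)  # proportions instead of counts
most_frequent = counts.idxmax()                 # label with the highest count
distinct = payments.nunique()                   # number of distinct values

print(most_frequent, distinct)
```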

Often we want summary statistics of the numerical features, and before that we need to find all of the numerical columns. To do this, first try DataFrame.select_dtypes(include=None, exclude=None), where
  • include, exclude: scalar or list-like. For example, 
    • To select all numeric types, use np.number or 'number'

It returns the subset of the DataFrame.

 Second, try  DataFrame.columns  to get labels of the subset DataFrame.
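
Both steps can be chained, as in this sketch on made-up data. The column values are invented; only the column names echo the taxi dataset.

```python
import pandas as pd

# Hypothetical two-row frame mixing numeric and text columns.
toy = pd.DataFrame(
    {"FARE": [9.0, 18.75], "COMPANY": ["A", "B"], "TRIP_SECONDS": [548, 1081]}
)

# Keep only the numeric columns, then read off their labels.
numeric_columns = toy.select_dtypes(include="number").columns

print(list(numeric_columns))
```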
# What is the maximum fare?
Try  Series.max(axis=0, skipna=True, numeric_only=False, **kwargs) method, where
  • axis: {index (0)} is unused here, too.
  • skipna: bool, the same as the one above.
  • numeric_only: bool, not implemented for Series.

It returns a scalar.

 max_fare = training_df['FARE'].max()

A DataFrame-level aggregation with the same parameters is DataFrame.mean(axis=0, skipna=True, numeric_only=False, **kwargs), where

  • axis: {index (0), columns (1)}

    For DataFrames, specifying  axis=None  will apply the aggregation across both axes.

  • numeric_only: bool, whether to include only float, int or boolean data.

To calculate the standard deviation of all numerical features, try DataFrame.std(axis=0, skipna=True, ddof=1, numeric_only=False, **kwargs), where
  • ddof: int, Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements. To have the same behaviour as numpy.std, use ddof=0 (instead of the default ddof=1).

It returns a Series or DataFrame (if level is specified). For a Series, try Series.std(axis=None, skipna=True, ddof=1, numeric_only=False, **kwargs) instead.
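
The effect of ddof can be checked against NumPy in a short sketch. The four values in s are arbitrary.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

sample_std = s.std()            # ddof=1: divides by N - 1
population_std = s.std(ddof=0)  # ddof=0: divides by N, like numpy.std

print(sample_std, population_std, np.std(s.to_numpy()))
```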
 
# What is the mean distance across all trips?
Try the Series.mean(axis=0, skipna=True, numeric_only=False, **kwargs) method. Its return type and parameters are the same as above.
 mean_distance = training_df['TRIP_MILES'].mean()
 
# How many missing values are there across the features?
First, flag every missing value with the isnull() method:
missing_values = training_df.isnull()

It returns a boolean same-sized DataFrame indicating if the values are NA.

Second, count the number of True values with the DataFrame.sum(axis=0, skipna=True, numeric_only=False, min_count=0, **kwargs) method, where

  • axis: {index (0), columns (1)}
  • skipna: bool, whether to exclude NA/null values when computing the result.
  • numeric_only: bool, whether to include only float, int, boolean columns.
  • min_count: int, required number of valid values to perform the operation. If fewer than min_count non-NA values are present, the result will be NA.

It returns a Series or scalar. The same method of Series is Series.sum(axis=None, skipna=True, numeric_only=False, min_count=0, **kwargs), where

  • axis, numeric_only are the same as Series.max.


 missing_values = training_df.isnull().sum().sum()
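
The two chained sum() calls can be unpacked on a small made-up frame: the first gives a per-column count, the second a grand total. The data below is invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing fare and two missing company names.
toy = pd.DataFrame({"FARE": [9.0, np.nan, 38.75], "COMPANY": ["A", None, None]})

per_column = toy.isnull().sum()  # Series: missing count for each column
total = per_column.sum()         # scalar: missing count over the whole frame

print(per_column)
print(total)
```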

To view the correlation matrix among features (possibly including the label), try the DataFrame.corr(method='pearson', min_periods=1, numeric_only=False) method, where

  • method: the correlation method:
    • pearson : standard correlation coefficient

    • kendall : Kendall Tau correlation coefficient

    • spearman : Spearman rank correlation

    • callable: callable with input two 1d ndarrays and returning a float. 
  • numeric_only: Same as the one in DataFrame.mean().

It returns a DataFrame of correlation matrix.


 training_df.corr(numeric_only = True)

This gives the following result:

              TRIP_MILES  TRIP_SECONDS  FARE       TIP_RATE
TRIP_MILES    1.000000    0.800855      0.975344   -0.049594
TRIP_SECONDS  0.800855    1.000000      0.830292   -0.084294
FARE          0.975344    0.830292      1.000000   -0.070979
TIP_RATE      -0.049594   -0.084294     -0.070979  1.000000
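
For a result that is easy to verify by hand, the sketch below uses made-up data in which fare is an exact linear function of miles, so their Pearson correlation is 1. The column names are hypothetical.

```python
import pandas as pd

# Hypothetical data: fare = 2 * miles + 3, plus one text column.
toy = pd.DataFrame(
    {
        "miles": [1.0, 2.0, 3.0, 4.0],
        "fare": [5.0, 7.0, 9.0, 11.0],
        "company": ["A", "B", "A", "B"],
    }
)

# numeric_only=True leaves the text column out of the matrix.
matrix = toy.corr(numeric_only=True)

print(matrix)
```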

4. Options and settings

Options have a case-insensitive name. You can get/set options directly as attributes of the top-level  options  attribute:
pd.options.display.max_rows = 10
If max_rows is exceeded, pandas switches to a truncated view. Depending on `large_repr`, objects are either centrally truncated or printed as a summary view. A value of None means unlimited.

display.float_format: callable. The callable should accept a floating point number and return a string with the desired format of the number. For example, using Python's Format Specification Mini-Language:

pd.options.display.float_format = "{:.1f}".format
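
To try a display option without changing it globally, one possibility is pd.option_context, which restores the previous value when the block exits. The DataFrame below is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"value": [1.23456, 2.0]})

# Inside the with-block, floats are rendered with one decimal place.
with pd.option_context("display.float_format", "{:.1f}".format):
    formatted = toy.to_string()

print(formatted)
```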

 
posted on 2024-08-22 20:08  后生那各膊客圆了