Pandas Cheat Sheet (continuously updated)

The DataFrame is the central data structure in the pandas API. It is like a spreadsheet, with numbered rows and named columns.

To make the examples below easy to run, import the required modules first.

import numpy as np
import pandas as pd

1. Create, access and modify.

Read a .csv file into a pandas DataFrame:

chicago_taxi_dataset = pd.read_csv("https://download.mlcc.google.com/mledu-datasets/chicago_taxi_train.csv")

Basic argument of read_csv():

filepath_or_buffer: various
Either a path to a file (a str, pathlib.Path, or py._path.local.LocalPath), a URL (including http, ftp, and S3 locations), or any object with a read() method (such as an open file or StringIO).
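As a small sketch of the "any object with a read() method" case, the snippet below feeds an in-memory StringIO buffer to read_csv. The CSV text and the names csv_text, buffer and small_df are made up for illustration and are not part of the taxi dataset.

```python
import io

import pandas as pd

# Hypothetical CSV content, used only for illustration.
csv_text = "TRIP_MILES,FARE\n1.5,7.25\n3.0,12.50\n"

# StringIO exposes a read() method, so read_csv accepts it like an open file.
buffer = io.StringIO(csv_text)
small_df = pd.read_csv(buffer)

print(small_df)
```

The same call works unchanged with a local path or a URL in place of the buffer.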

The following code instantiates the pd.DataFrame class to create a DataFrame.

 1 # Create and populate a 5x2 NumPy array.
 2 my_data = np.array([[0, 3], [10, 7], [20, 9], [30, 14], [40, 15]])
 3 
 4 # Create a Python list that holds the names of the two columns.
 5 my_column_names = ['temperature', 'activity']
 6 
 7 # Create a DataFrame.
 8 my_dataframe = pd.DataFrame(data=my_data, columns=my_column_names)
 9 
10 # Print the entire DataFrame
11 print(my_dataframe)

Inspecting its .index attribute gives

RangeIndex(start=0, stop=5, step=1)

If an index argument is passed at construction, say by changing line 8 of the code above to

my_dataframe = pd.DataFrame(data=my_data, columns=my_column_names, index=[10, 20, 30, 40, 50])

the index attribute is

Index([10, 20, 30, 40, 50], dtype='int64')

len(my_dataframe.index) counts the number of rows of the DataFrame.

The DataFrame.reset_index(level=None, *, drop=False, inplace=False, col_level=0, col_fill='', allow_duplicates=_NoDefault.no_default, names=None) method resets the index, where

level: int, str, tuple, or list
Only remove the given levels from the index. Removes all levels by default.
drop: bool. If False (default), the original index is inserted into the DataFrame as a column; if True, it is discarded.
inplace: bool, whether to modify the DataFrame rather than creating a new one.
names: int, str or 1-dimensional list
Using the given string, rename the DataFrame column which contains the original index data. If the DataFrame has a MultiIndex, this has to be a list or tuple with length equal to the number of levels.

It returns a new DataFrame, or None if inplace=True.
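A minimal sketch of the drop and names parameters, using a small made-up DataFrame (demo_df and the column name old_index are illustrative only). It assumes pandas 1.5 or later, where names is available.

```python
import pandas as pd

# A small DataFrame with a custom index, for illustration only.
demo_df = pd.DataFrame(
    {"temperature": [0, 10, 20], "activity": [3, 7, 9]},
    index=[10, 20, 30],
)

# drop=False (default): the old index becomes a column, named via `names`.
kept = demo_df.reset_index(names="old_index")

# drop=True: the old index is discarded.
dropped = demo_df.reset_index(drop=True)

print(kept)
print(dropped)
```

Neither call changes demo_df itself, because inplace is left at its default of False.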

DataFrame.set_index(keys, *, drop=True, append=False, inplace=False, verify_integrity=False) does the reverse: it sets the index from existing columns (or arrays), where

keys: label or array-like or list of labels/arrays. This parameter can be either a single column key or one of the other types described in the pandas documentation.
drop: bool, whether to delete the columns used as the new index.
append: bool, whether to append the columns to the existing index rather than replace it.
inplace: bool, the same as the one above.
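The following sketch illustrates drop and append on a made-up DataFrame; the names demo_df, indexed, kept and appended are assumptions chosen for this example.

```python
import pandas as pd

# Illustrative data only.
demo_df = pd.DataFrame({"city": ["A", "B", "C"], "temperature": [0, 10, 20]})

# drop=True (default): 'city' moves from the columns into the index.
indexed = demo_df.set_index("city")

# drop=False: 'city' becomes the index and also stays as a column.
kept = demo_df.set_index("city", drop=False)

# append=True: 'city' is added to the existing index, giving a MultiIndex.
appended = demo_df.set_index("city", append=True)

print(indexed.loc["B", "temperature"])
```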

You may add a new column to an existing pandas DataFrame just by assigning values to a new column name.

# Create a new column named adjusted.
my_dataframe["adjusted"] = my_dataframe["activity"] + 2
# Print the entire DataFrame
print(my_dataframe)

pandas provides multiple ways to isolate specific rows, columns, slices or cells in a DataFrame.

print(my_dataframe.activity)  # Same as my_dataframe['activity']
print("Rows #0, #1, and #2:")
print(my_dataframe.head(3), '\n')

print("Row #2:")
print(my_dataframe.iloc[[2]], '\n')  # The type of the result is DataFrame.
print("Row #2:")
print(my_dataframe.iloc[2], '\n')  # The type of the result is Series.
print("Rows #1, #2, and #3:")
print(my_dataframe[1:4], '\n')  # Note the slice starts from the second row, not the first.

print("Column 'temperature':")
print(my_dataframe['temperature'])

# Select a subset of columns from the taxi dataset (note the double brackets).
training_df = chicago_taxi_dataset[['TRIP_MILES', 'TRIP_SECONDS', 'FARE', 'COMPANY', 'PAYMENT_TYPE', 'TIP_RATE']]
training_df.head(200)

To get random samples of a DataFrame, use the DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None, ignore_index=False) method, where

n: int, optional
Number of items from axis to return. Cannot be used with frac. Default = 1 if frac = None.
frac: float, optional
Fraction of axis items to return. Cannot be used with n.
replace: bool, whether to allow sampling of the same row more than once. Must be True if frac > 1.
random_state: If int, array-like, or BitGenerator, seed for the random number generator. If np.random.RandomState or np.random.Generator, use as given.
axis: {0 or ‘index’, 1 or ‘columns’, None}, default is the stat axis for the given data type.
ignore_index: bool. If True, the resulting index is labeled 0, 1, …, n-1.

It returns a new object of the same type as the caller, containing n items.
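A short sketch of n, frac, replace and ignore_index on a made-up ten-row DataFrame (demo_df and the seed value 0 are arbitrary choices for illustration):

```python
import pandas as pd

# Ten illustrative rows.
demo_df = pd.DataFrame({"x": range(10)})

# Three distinct rows; random_state makes the draw reproducible.
three_rows = demo_df.sample(n=3, random_state=0)

# Half of the rows, relabeled 0..4 because ignore_index=True.
half = demo_df.sample(frac=0.5, random_state=0, ignore_index=True)

# frac > 1 requires replace=True, so rows may repeat.
upsampled = demo_df.sample(frac=1.5, replace=True, random_state=0)

print(len(three_rows), len(half), len(upsampled))
```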

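Q: What's the difference between a Series and a DataFrame?

A: A Series is a one-dimensional labeled array, and each column of a DataFrame is a Series. (Google Gemini insists it is a row. That is also defensible: selecting a single row, e.g. with .iloc[2], returns a Series as well.)

Q: How do you index a particular cell of a DataFrame?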

# Create a Python list that holds the names of the four columns.
my_column_names = ['Eleanor', 'Chidi', 'Tahani', 'Jason']

# Create a 3x4 NumPy array, each cell populated with a random integer.
my_data = np.random.randint(low=0, high=101, size=(3, 4))

# Create a DataFrame.
df = pd.DataFrame(data=my_data, columns=my_column_names)

# Print the entire DataFrame
print(df)

# Print the value in row #1 of the Eleanor column (chained indexing).
print("\nSecond row of the Eleanor column: %d\n" % df['Eleanor'][1])

Q: How to convert a Series to ndarray?

A: The Series.values property returns the Series as an ndarray or ndarray-like object, depending on the dtype. However, it is recommended to use Series.array or Series.to_numpy() instead, depending on whether you need a reference to the underlying data or a NumPy array.
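A minimal sketch of the two recommended accessors, using a made-up Series (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3], name="activity")

# to_numpy() always returns a NumPy ndarray.
as_numpy = s.to_numpy()

# .array returns a pandas ExtensionArray wrapping the underlying data.
as_array = s.array

print(type(as_numpy).__name__, type(as_array).__name__)
```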

The following code shows how to add a new column to an existing DataFrame through a row-by-row calculation across columns:

# Create a column named Janet whose contents are the sum
# of two other columns.
df['Janet'] = df['Tahani'] + df['Jason']

# Print the enhanced DataFrame
print(df)

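pandas provides two different ways to duplicate a DataFrame:

Referencing: the two names stay linked, so a change made through one is visible through the other.
Copying: the two objects are independent of each other.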

# Create a reference by assigning df to a new variable.
print("Experiment with a reference:")
reference_to_df = df

# Print the starting value of a particular cell.
print("  Starting value of df: %d" % df['Jason'][1])
print("  Starting value of reference_to_df: %d\n" % reference_to_df['Jason'][1])

# Modify a cell in df. Use .at rather than chained indexing here: assigning
# through df['Jason'][1] may write to a temporary copy instead of df.
df.at[1, 'Jason'] = df['Jason'][1] + 5
print("  Updated df: %d" % df['Jason'][1])
print("  Updated reference_to_df: %d\n\n" % reference_to_df['Jason'][1])

Q: What's the difference among .at, .loc, .iloc, .iat and chained indexing?

A: .loc and .at are label based, while .iloc and .iat are integer-position based. .at and .iat access a single cell, whereas .loc and .iloc can also select rows, columns and slices. Chained indexing such as df['Jason'][1] performs two separate indexing operations, which is fine for reading but unreliable for assignment.

.iloc is primarily integer-position based (from 0 to length-1 of the axis), but may also be used with a boolean array. Allowed inputs are:

An integer.
A list or array of integers.
A slice object with ints, e.g. 1:7.
A boolean array.
A tuple of row and column indexes. The tuple elements consist of one of the above inputs, e.g. (0, 1).

Note: .iloc also accepts a callable, but returning a tuple from a callable is deprecated since version 2.2.0 for pandas.DataFrame.iloc, so it is better not to pass a callable.
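The four indexers can be compared side by side on a small made-up DataFrame with string row labels (demo_df and its values are illustrative only):

```python
import pandas as pd

# String row labels make the label/position distinction visible.
demo_df = pd.DataFrame(
    {"Tahani": [1, 2, 3], "Jason": [4, 5, 6]},
    index=["a", "b", "c"],
)

by_label = demo_df.loc["b", "Jason"]        # label based
by_position = demo_df.iloc[1, 1]            # integer-position based
scalar_label = demo_df.at["b", "Jason"]     # label based, single cell
scalar_position = demo_df.iat[1, 1]         # position based, single cell

# Assign through a single indexer rather than chained indexing.
demo_df.at["b", "Jason"] = 50

# .iloc with a boolean array keeps the rows flagged True.
masked = demo_df.iloc[[True, False, True]]

print(by_label, by_position, scalar_label, scalar_position)
print(masked)
```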

To make a copy of a DataFrame, use the DataFrame.copy(deep=True) method. When deep=True (default), a new object will be created with a copy of the calling object’s data and indices. Modifications to the data or indices of the copy will not be reflected in the original object. Things might be a bit more complicated for deep=False after pandas 3.0.

The following code starts an experiment with a copy:

copy_of_my_dataframe = my_dataframe.copy()
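A self-contained sketch contrasting a reference with a deep copy; the DataFrame and the names original, reference and independent are made up for this example.

```python
import pandas as pd

original = pd.DataFrame({"Jason": [10, 20, 30]})

reference = original            # same object under a second name
independent = original.copy()   # deep copy by default

# Modify a cell through the original name.
original.at[1, "Jason"] = 99

print(reference.at[1, "Jason"])    # follows the change
print(independent.at[1, "Jason"])  # keeps the old value
```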

DataFrame.drop(labels=None, *, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise') drops the specified labels from rows or columns, where

labels: single label or list-like
Index or column labels to drop. A tuple will be used as a single label and not treated as a list-like.
axis: {0 or ‘index’, 1 or ‘columns’}, whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’).
index: single label or list-like
Alternative to specifying axis (labels, axis=0 is equivalent to index=labels).
columns: single label or list-like
Alternative to specifying axis (labels, axis=1 is equivalent to columns=labels).
level: int or level name, optional
For MultiIndex, level from which the labels will be removed.
inplace: bool, whether to do the operation in place and return None, or return a copy.
errors: {‘ignore’, ‘raise’}. If ‘ignore’, suppress the error and drop only the existing labels.

It returns the DataFrame with the specified index or column labels removed, or None if inplace=True.
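A brief sketch of the equivalent ways to call drop, on a made-up DataFrame (the column name "missing" is deliberately absent to show errors='ignore'):

```python
import pandas as pd

demo_df = pd.DataFrame(
    {"temperature": [0, 10, 20], "activity": [3, 7, 9], "adjusted": [5, 9, 11]}
)

# Drop a column, using either columns= or labels plus axis=1.
no_adjusted = demo_df.drop(columns="adjusted")
same_as_columns = demo_df.drop("adjusted", axis=1)

# Drop a row by its index label.
no_first_row = demo_df.drop(index=0)

# With errors='ignore', a label that does not exist is skipped silently.
tolerant = demo_df.drop(columns="missing", errors="ignore")

print(no_adjusted)
```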

2. Data exploration.

To preview the first n rows of a large DataFrame, use DataFrame.head(n=5); it returns the same type as the caller.

Use the DataFrame.describe(percentiles=None, include=None, exclude=None) method to view descriptive statistics about the dataset, where

percentiles: list-like of numbers, optional. All should fall between 0 and 1. The default is [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles.
include: ‘all’, list-like of dtypes or None, optional. A white list of data types to include in the result. Ignored for Series.
- ‘all’ : All columns of the input will be included in the output.

training_df.describe(include='all') results in

	TRIP_MILES	TRIP_SECONDS	FARE	COMPANY	PAYMENT_TYPE	TIP_RATE
count	31694.000000	31694.000000	31694.000000	31694	31694	31694.000000
unique	NaN	NaN	NaN	31	7	NaN
top	NaN	NaN	NaN	Flash Cab	Credit Card	NaN
freq	NaN	NaN	NaN	7887	14142	NaN
mean	8.289463	1319.796397	23.905210	NaN	NaN	12.965785
std	7.265672	928.932873	16.970022	NaN	NaN	15.517765
min	0.500000	60.000000	3.250000	NaN	NaN	0.000000
25%	1.720000	548.000000	9.000000	NaN	NaN	0.000000
50%	5.920000	1081.000000	18.750000	NaN	NaN	12.200000
75%	14.500000	1888.000000	38.750000	NaN	NaN	20.800000
max	68.120000	7140.000000	159.250000	NaN	NaN	648.600000
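To experiment with describe without downloading the taxi data, here is a sketch on a tiny made-up DataFrame. The values are invented and only the column names FARE and COMPANY echo the dataset.

```python
import pandas as pd

# Invented values, for illustration only.
demo_df = pd.DataFrame(
    {"FARE": [3.25, 9.0, 18.75, 38.75], "COMPANY": ["A", "A", "B", "C"]}
)

# include='all' mixes numeric statistics with unique/top/freq for text columns.
summary = demo_df.describe(include="all")

# Custom percentiles on a single numeric column.
fare_summary = demo_df["FARE"].describe(percentiles=[0.1, 0.9])

print(summary)
print(fare_summary)
```

Statistics that do not apply to a column's type show up as NaN, as in the table above.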

# How many cab companies are in the dataset? Try the DataFrame.nunique(axis=0, dropna=True) method, where

axis: {0 or ‘index’, 1 or ‘columns’}, the axis to use: 0 or ‘index’ counts per column, 1 or ‘columns’ counts per row.
dropna: bool, whether to exclude NaN in the counts.

It returns a Series. The Series.nunique(dropna=True) method returns an int.

num_unique_companies = training_df['COMPANY'].nunique()

# What is the most frequent payment type?

First, count the frequency of each distinct row in the DataFrame with DataFrame.value_counts(subset=None, normalize=False, sort=True, ascending=False, dropna=True), where

subset: label or list of labels of columns to use when counting unique combinations, optional.
normalize: bool, whether to return proportions rather than frequencies.
sort: bool. Sort by frequencies when True, otherwise by DataFrame column values (original order).
ascending: bool, whether to sort in ascending order.
dropna: bool, whether to exclude rows containing NaN.

It returns a Series. The Series.value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=True) method returns a Series containing counts of unique values, where

bins: int, optional. Number of bins.
Rather than count values, group them into half-open bins, a convenience for pd.cut; this only works with numeric data.

Second, find the entry with the largest frequency with the Series.idxmax(axis=0, skipna=True, *args, **kwargs) method, where

axis: {0 or ‘index’}, unused. It is needed for compatibility with DataFrame.
skipna: bool, whether to exclude NA/null values. If the entire Series is NA, the result will be NA.
*args, **kwargs: Additional arguments and keywords have no effect but might be accepted for compatibility with NumPy.

It returns the index label of the maximum value.

most_freq_payment_type = training_df['PAYMENT_TYPE'].value_counts().idxmax()
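The two steps can be tried on a small made-up Series of payment types (the values below are invented and are not taken from the taxi dataset):

```python
import pandas as pd

payments = pd.Series(
    ["Cash", "Credit Card", "Cash", "Mobile", "Credit Card", "Credit Card"],
    name="PAYMENT_TYPE",
)

# Step 1: frequency of each distinct value, most frequent first.
counts = payments.value_counts()

# normalize=True returns proportions instead of counts.
shares = payments.value_counts(normalize=True)

# Step 2: the label with the highest count.
most_frequent = counts.idxmax()

print(counts)
print(most_frequent)
```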

We often want summary statistics of numerical features, so we first need to find all the numerical columns. First, try DataFrame.select_dtypes(include=None, exclude=None), where

include, exclude: scalar or list-like. For example,
- To select all numeric types, use np.number or 'number'

It returns the subset of the DataFrame containing only the selected dtypes.

Second, use DataFrame.columns to get the column labels of that subset.
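A minimal sketch of both steps on a made-up DataFrame (the rows and the company name "Sun Taxi" are invented for illustration):

```python
import pandas as pd

demo_df = pd.DataFrame(
    {
        "TRIP_MILES": [1.5, 3.0],
        "TRIP_SECONDS": [300, 600],
        "COMPANY": ["Flash Cab", "Sun Taxi"],
    }
)

# Step 1: keep only the numeric columns.
numeric_df = demo_df.select_dtypes(include="number")

# Step 2: read their labels.
numeric_columns = list(numeric_df.columns)

print(numeric_columns)
```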

# What is the maximum fare?

Try the Series.max(axis=0, skipna=True, numeric_only=False, **kwargs) method, where

axis: {index (0)}, unused here as well.
skipna: bool, same as the one above.
numeric_only: not implemented for Series.

It returns a scalar.

max_fare = training_df['FARE'].max()

The DataFrame counterparts, such as DataFrame.max and DataFrame.mean(axis=0, skipna=True, numeric_only=False, **kwargs), take the same parameters, where

axis: {index (0), columns (1)}
For DataFrames, specifying axis=None will apply the aggregation across both axes.
numeric_only: bool, whether to include only float, int or boolean data.

To calculate the standard deviation of all numerical features, try DataFrame.std(axis=0, skipna=True, ddof=1, numeric_only=False, **kwargs), where

ddof: int, Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements. To have the same behaviour as numpy.std, use ddof=0 (instead of the default ddof=1).

It returns a Series or DataFrame (if level specified). For a Series, try Series.std(axis=None, skipna=True, ddof=1, numeric_only=False, **kwargs) instead.
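The effect of ddof can be checked on a made-up Series whose population standard deviation is exactly 2 (the values are chosen only to make the arithmetic easy):

```python
import numpy as np
import pandas as pd

# Mean is 5 and the squared deviations sum to 32.
fares = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_std = fares.std()             # ddof=1: sqrt(32 / 7)
population_std = fares.std(ddof=0)   # ddof=0: sqrt(32 / 8) = 2.0
numpy_std = np.std(fares.to_numpy()) # NumPy defaults to ddof=0

print(sample_std, population_std, numpy_std)
```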

# What is the mean distance across all trips?

Try the Series.mean(axis=0, skipna=True, numeric_only=False, **kwargs) method. Its parameters are the same as those of Series.max, and it also returns a scalar.

mean_distance = training_df['TRIP_MILES'].mean()

# How many feature values are missing?

First, flag every missing value with the isnull() method:

missing_values = training_df.isnull()

It returns a boolean DataFrame of the same size indicating whether each value is NA.

Second, count the number of True values with the DataFrame.sum(axis=0, skipna=True, numeric_only=False, min_count=0, **kwargs) method, where

axis: {index (0), columns (1)}
skipna: bool, whether to exclude NA/null values when computing the result.
numeric_only: bool, whether to include only float, int, boolean columns.
min_count: int, required number of valid values to perform the operation. If fewer than min_count non-NA values are present, the result will be NA.

It returns a Series or scalar. The Series counterpart is Series.sum(axis=None, skipna=True, numeric_only=False, min_count=0, **kwargs), where

axis, numeric_only are the same as in Series.max.

Summing twice, first per column and then across columns, gives the total number of missing values:

missing_values = training_df.isnull().sum().sum()
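A small sketch of the same two-step count on a made-up DataFrame with three missing cells (the data is invented for illustration):

```python
import numpy as np
import pandas as pd

demo_df = pd.DataFrame(
    {"FARE": [3.25, np.nan, 9.0], "COMPANY": ["Flash Cab", None, None]}
)

# First sum: number of missing values per column.
missing_per_column = demo_df.isnull().sum()

# Second sum: total number of missing values in the DataFrame.
total_missing = missing_per_column.sum()

print(missing_per_column)
print(total_missing)
```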

To view the correlation matrix among features (possibly including the label), try the DataFrame.corr(method='pearson', min_periods=1, numeric_only=False) method, where

method: the method of correlation:
- pearson : standard correlation coefficient
- kendall : Kendall Tau correlation coefficient
- spearman : Spearman rank correlation
- callable: callable with input two 1d ndarrays and returning a float.
numeric_only: same as the one in DataFrame.mean().

It returns the correlation matrix as a DataFrame. For example,

training_df.corr(numeric_only=True)

gives the following result:

	TRIP_MILES	TRIP_SECONDS	FARE	TIP_RATE
TRIP_MILES	1.000000	0.800855	0.975344	-0.049594
TRIP_SECONDS	0.800855	1.000000	0.830292	-0.084294
FARE	0.975344	0.830292	1.000000	-0.070979
TIP_RATE	-0.049594	-0.084294	-0.070979	1.000000
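For a quick check without the taxi data, the sketch below uses a made-up DataFrame in which fare is an exact linear function of miles, so their Pearson correlation should be 1. The column names are illustrative, and numeric_only is assumed to be available (pandas 1.5 or later).

```python
import pandas as pd

demo_df = pd.DataFrame(
    {
        "miles": [1.0, 2.0, 3.0, 4.0],
        "fare": [5.0, 7.0, 9.0, 11.0],   # fare = 2 * miles + 3
        "company": ["a", "b", "a", "b"],  # non-numeric, skipped below
    }
)

# numeric_only=True leaves the text column out of the matrix.
corr_matrix = demo_df.corr(numeric_only=True)

print(corr_matrix)
```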

4. Options and settings

Option names are case-insensitive. You can get and set options directly as attributes of the top-level options attribute:

pd.options.display.max_rows = 10

display.max_rows: If max_rows is exceeded, switch to truncate view. Depending on `large_repr`, objects are either centrally truncated or printed as a summary view. 'None' value means unlimited.
display.float_format: callable. The callable should accept a floating point number and return a string with the desired format of the number. For example, using Python's Format Specification Mini-Language:

pd.options.display.float_format = "{:.1f}".format
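If the settings should apply only temporarily, one possible approach is pd.option_context, which restores the previous values when the block exits. This is a sketch with a made-up Series; the names s, formatted and rows_inside are illustrative.

```python
import pandas as pd

s = pd.Series([1.23456, 2.0])

# Options set here are reverted automatically after the with block.
with pd.option_context(
    "display.float_format", "{:.1f}".format,
    "display.max_rows", 10,
):
    formatted = repr(s)
    rows_inside = pd.get_option("display.max_rows")

print(formatted)
```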

Posted on 2024-08-22 20:08 by 后生那各膊客圆了
