Dive into Deep Learning - Extra Chapter 2

Pandas

Pandas is a very popular data analysis library. These notes follow the Kaggle tutorial for a quick tour.

Creating, Reading and Writing

DataFrame

Creating

import pandas as pd

pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'],
              'Sue': ['Pretty good.', 'Bland.']},
             index=['Product A', 'Product B'])

fruit_sales = pd.DataFrame([[35, 21], [41, 34]], columns=['Apples', 'Bananas'],
                index=['2017 Sales', '2018 Sales'])

Series

Creating

pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')

Reading files

Read a CSV file

wine_reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv")

Check the shape

wine_reviews.shape

Show the first five rows

wine_reviews.head()

Use a column of the source file as the index (index_col=0 picks the first column)

wine_reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
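
To see what index_col changes, here is a minimal sketch that reads the same CSV text twice. The CSV content is made up for illustration and stands in for the real wine-reviews file; io.StringIO is used only so the example runs without a file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV text: the first column has no header and holds a row id.
csv_text = ",country,points\n0,Italy,87\n1,Portugal,90\n"

# Without index_col, the id column is read as ordinary data.
without_index = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column becomes the row index instead.
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(without_index.columns.tolist())
print(with_index.columns.tolist())
```

The first frame carries an extra 'Unnamed: 0' column, while the second keeps only the real data columns.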

Writing files

Save a DataFrame as a CSV file

animals.to_csv("cows_and_goats.csv")

Indexing, Selecting and Assigning

Access a column of a DataFrame (that is, select one of its Series)

reviews.country

reviews['country']

Access a specific value

reviews['country'][0]

Indexing in pandas

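Index-based selection: iloc
- Both loc and iloc take the row first and the column second. Plain bracket indexing such as reviews['country'][0] goes the other way: column first, then row.

Get a row

reviews.iloc[0]

Get a column

reviews.iloc[:, 0]

Get specific rows and columns

reviews.iloc[1:3, 0]

reviews.iloc[[0, 1, 2], 0]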
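
The iloc calls above can be tried on a small frame. This is only a sketch: the data and the string row labels ("w1" to "w4") are invented, chosen to show that iloc ignores labels and works purely by position.

```python
import pandas as pd

# Made-up data; the index uses string labels on purpose.
reviews = pd.DataFrame(
    {"country": ["Italy", "Portugal", "US", "France"],
     "points": [87, 90, 92, 88]},
    index=["w1", "w2", "w3", "w4"],
)

first_row = reviews.iloc[0]           # row at position 0, returned as a Series
first_col = reviews.iloc[:, 0]        # every row of the column at position 0
middle = reviews.iloc[1:3, 0]         # positions 1 and 2 (3 is excluded)
picked = reviews.iloc[[0, 1, 2], 0]   # an explicit list of positions

print(first_row["country"])
print(middle.tolist())
```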
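
Label-based selection: loc
- loc selects by index label rather than by position. Any standard data type can serve as an index label.

Get a single entry

reviews.loc[0, 'country']

Get specific rows and columns

reviews.loc[:, ['taster_name', 'taster_twitter_handle', 'points']]

Differences between iloc and loc
- iloc slices are half-open (start included, end excluded); loc slices include both endpoints.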
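
The slicing difference is easy to miss when the index is the default 0, 1, 2, ... because positions and labels look identical. A minimal sketch with made-up data:

```python
import pandas as pd

# Default integer index 0..4, so labels and positions coincide.
reviews = pd.DataFrame({"points": [87, 90, 92, 88, 95]})

by_position = reviews.iloc[0:3]  # positions 0, 1, 2: the end is excluded
by_label = reviews.loc[0:3]      # labels 0, 1, 2, 3: the end is included

print(len(by_position), len(by_label))
```

The same slice 0:3 therefore returns three rows with iloc and four rows with loc.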

Manipulating the index

Set the index column (ID)

reviews.set_index("title")
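
A small sketch of what set_index() returns, using invented titles. Note that it produces a new DataFrame and leaves the original untouched, so the result has to be assigned to a name to be kept.

```python
import pandas as pd

# Made-up data for illustration.
reviews = pd.DataFrame({"title": ["Wine A", "Wine B"], "points": [87, 90]})

indexed = reviews.set_index("title")  # 'title' moves from the columns to the index

# Rows can now be looked up by title with loc.
score = indexed.loc["Wine B", "points"]
print(score)
```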

Conditional selection

Boolean conditions

reviews.country == 'Italy'

# and
reviews.loc[(reviews.country == 'Italy') & (reviews.points >= 90)]

# or
reviews.loc[(reviews.country == 'Italy') | (reviews.points >= 90)]

# isin
reviews.loc[reviews.country.isin(['Italy', 'France'])]

# isnull/notnull
reviews.loc[reviews.price.notnull()]
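
The filters above can be checked on a four-row frame. The data is made up. The parentheses around each comparison matter, because & and | bind more tightly than == and >= in Python.

```python
import pandas as pd

# Made-up data; one price is missing.
reviews = pd.DataFrame({
    "country": ["Italy", "Italy", "France", "US"],
    "points": [92, 85, 91, 88],
    "price": [20.0, None, 35.0, 15.0],
})

is_italy = reviews.country == "Italy"   # a boolean Series, one value per row

italy_and_good = reviews.loc[(reviews.country == "Italy") & (reviews.points >= 90)]
italy_or_good = reviews.loc[(reviews.country == "Italy") | (reviews.points >= 90)]
italy_or_france = reviews.loc[reviews.country.isin(["Italy", "France"])]
has_price = reviews.loc[reviews.price.notnull()]

print(len(italy_and_good), len(italy_or_good), len(italy_or_france), len(has_price))
```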

Assigning data

reviews['critic'] = 'everyone'

reviews['index_backwards'] = range(len(reviews), 0, -1)

Summary Functions and Maps

Summary functions

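Summary statistics (the output depends on the column's data type: numeric columns get count, mean, std, min, quartiles and max; text columns get count, unique, top and freq)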

reviews.points.describe()
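
As a quick check of how describe() adapts to the data type, the sketch below calls it on one numeric and one text column. The values are invented.

```python
import pandas as pd

# Made-up data for illustration.
reviews = pd.DataFrame({
    "points": [80, 90, 100],
    "taster_name": ["Ann", "Bob", "Ann"],
})

numeric_summary = reviews.points.describe()     # count, mean, std, min, quartiles, max
text_summary = reviews.taster_name.describe()   # count, unique, top, freq

print(numeric_summary["mean"])
print(text_summary["top"])
```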

Mean of a column

reviews.points.mean()

List the unique values

reviews.taster_name.unique()

List the unique values and how often each one occurs

reviews.taster_name.value_counts()

Maps

map()

Re-center the data so that its mean is 0, and return the mapped result

review_points_mean = reviews.points.mean()

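# map() processes every value in the 'points' column. The argument is an anonymous (lambda) function that subtracts the mean from each value, so the result is a new Series in which every element is the original points value minus the mean.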
reviews.points.map(lambda p: p - review_points_mean)

apply()

Pass a custom function to apply() and choose the axis

def remean_points(row):  # re-center the points value of each row
    row.points = row.points - review_points_mean
    return row

# axis='columns' applies the function to each row;
# axis='index' would apply it to each column
reviews.apply(remean_points, axis='columns')

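Neither map() nor apply() modifies the original data; each returns a new, transformed Series or DataFrame.

Pandas has built-in versions of many common mapping operations. The following is faster: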
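
A minimal sketch confirming that map() and apply() leave the original frame alone. The data is made up, and this variant of remean_points copies the row before changing it, which is a defensive choice for the example rather than something the tutorial requires.

```python
import pandas as pd

# Made-up data for illustration.
reviews = pd.DataFrame({"country": ["Italy", "US", "France"],
                        "points": [80, 90, 100]})
review_points_mean = reviews.points.mean()  # 90.0

# map() works value by value on a single Series.
centered = reviews.points.map(lambda p: p - review_points_mean)

def remean_points(row):
    row = row.copy()  # work on a copy so the input row is never touched
    row["points"] = row["points"] - review_points_mean
    return row

# apply() with axis='columns' calls the function once per row.
remeaned = reviews.apply(remean_points, axis="columns")

print(centered.tolist())
print(reviews.points.tolist())  # still the original values
```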

review_points_mean = reviews.points.mean()
reviews.points - review_points_mean

Built-in operators (+, -, >, <, ==, and so on) also work between Series of equal length. They are faster than map() and apply(), but less flexible.

reviews.country + " - " + reviews.region_1
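
To show that the built-in operators give the same answer as map(), here is a short sketch on invented data. It compares the two results directly and also runs the string concatenation from above.

```python
import pandas as pd

# Made-up data for illustration.
reviews = pd.DataFrame({
    "country": ["Italy", "US"],
    "region_1": ["Etna", "Napa Valley"],
    "points": [87, 91],
})
review_points_mean = reviews.points.mean()  # 89.0

via_map = reviews.points.map(lambda p: p - review_points_mean)
vectorized = reviews.points - review_points_mean   # same values, no Python-level loop

labels = reviews.country + " - " + reviews.region_1  # element-wise string concatenation

print(via_map.equals(vectorized))
print(labels.tolist())
```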

Grouping and Sorting

Groupwise analysis

Gives the same counts as value_counts(), but ordered by group key rather than by frequency

reviews.groupby('points').points.count()
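
A small sketch comparing the two approaches on made-up scores. The counts match; only the ordering of the result differs.

```python
import pandas as pd

# Made-up scores for illustration.
reviews = pd.DataFrame({"points": [90, 88, 90, 90, 88, 95]})

by_group = reviews.groupby("points").points.count()  # ordered by the group key
by_value = reviews.points.value_counts()             # ordered by count, largest first

print(by_group.index.tolist())
print(by_value.index.tolist())
```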

Group by points and get the cheapest price in each group

reviews.groupby('points').price.min()

apply() gives direct access to the grouped data; the function receives one group at a time as a DataFrame

# groupby first splits reviews by the unique values of 'winery', so each group holds all the rows for one winery. apply() then calls the lambda once per group, with df being that group's rows. df.title.iloc[0] takes the 'title' of the group's first row, i.e. the title of the first wine listed for each winery.
reviews.groupby('winery').apply(lambda df: df.title.iloc[0])

reviews.groupby(['country', 'province']).apply(lambda df: df.loc[df.points.idxmax()])

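Use agg() to run several functions on the grouped data at once

# price is a Series here, so agg() applies each function to every group's prices: len gives the number of rows in the group (missing prices included), while min and max give the lowest and highest price.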
reviews.groupby(['country']).price.agg([len, min, max])
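
The same aggregation on a tiny invented frame. As an assumption of this sketch, "min" and "max" are passed as strings instead of the Python built-ins; to my knowledge recent pandas versions warn about the built-ins, and the resulting column names are the same either way.

```python
import pandas as pd

# Made-up prices for illustration.
reviews = pd.DataFrame({
    "country": ["Italy", "Italy", "US", "US"],
    "price": [20.0, 35.0, 15.0, 50.0],
})

# One output column per function: 'len', 'min', 'max'.
stats = reviews.groupby(["country"]).price.agg([len, "min", "max"])

print(stats)
```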

Multi-indexes

Grouping by more than one key produces a multi-level index (MultiIndex)

Convert a multi-level index back to a single-level index

countries_reviewed.reset_index()
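
The notes never show how countries_reviewed is built, so the sketch below assumes it comes from a two-key groupby on country and province, counting descriptions with len. The data is made up.

```python
import pandas as pd

# Made-up data for illustration.
reviews = pd.DataFrame({
    "country": ["Italy", "Italy", "US", "US", "US"],
    "province": ["Sicily", "Tuscany", "California", "California", "Oregon"],
    "description": ["a", "b", "c", "d", "e"],
})

# Assumed definition: one row per (country, province) pair, with a 'len' column.
countries_reviewed = reviews.groupby(["country", "province"]).description.agg([len])
print(type(countries_reviewed.index).__name__)  # MultiIndex

# reset_index() turns both index levels back into ordinary columns.
flat = countries_reviewed.reset_index()
print(flat.columns.tolist())
```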

Sorting

Sort by the values of one column

# sort_values() sorts in ascending order by default
countries_reviewed.sort_values(by='len')

# descending order
countries_reviewed.sort_values(by='len', ascending=False)

Sort by index

countries_reviewed.sort_index()

Sort by more than one column

countries_reviewed.sort_values(by=['country', 'len'])
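
The sorting calls can be compared side by side on a small frame. Here countries_reviewed is a made-up, already flattened stand-in with a 'len' column.

```python
import pandas as pd

# Made-up stand-in for the flattened countries_reviewed frame.
countries_reviewed = pd.DataFrame({
    "country": ["US", "Italy", "US", "Italy"],
    "province": ["Oregon", "Tuscany", "California", "Sicily"],
    "len": [5, 9, 30, 2],
})

ascending = countries_reviewed.sort_values(by="len")
descending = countries_reviewed.sort_values(by="len", ascending=False)
by_two = countries_reviewed.sort_values(by=["country", "len"])  # country first, then len
restored = descending.sort_index()  # back to the original row order

print(ascending["len"].tolist())
print(by_two["province"].tolist())
```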

Data Types and Missing Values

Dtypes

Get the data type of one column

reviews.price.dtype

Get the data type of every column

reviews.dtypes

Convert a column to another data type

reviews.points.astype('float64')
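
A short sketch of astype() on made-up data. Like most pandas methods it returns a converted copy, so the original column keeps its integer type unless the result is assigned back.

```python
import pandas as pd

# Made-up data for illustration.
reviews = pd.DataFrame({"points": [87, 90], "price": [20.0, 35.5]})

as_float = reviews.points.astype("float64")  # converted copy

print(reviews.points.dtype)  # still an integer type
print(as_float.dtype)        # float64
```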

Missing data

Select the entries that are NaN

reviews[pd.isnull(reviews.country)]

Fill in NaN values

reviews.region_2.fillna("Unknown")

Replace a non-null value

reviews.taster_twitter_handle.replace("@kerinokeefe", "@kerino")
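
The three operations above, run on a three-row frame with invented values. Both fillna() and replace() return new Series and leave reviews as it was.

```python
import pandas as pd

# Made-up data with some missing entries.
reviews = pd.DataFrame({
    "country": ["Italy", None, "US"],
    "region_2": [None, "Napa", None],
    "taster_twitter_handle": ["@kerinokeefe", "@vossroger", None],
})

missing_country = reviews[pd.isnull(reviews.country)]   # rows where country is missing
filled = reviews.region_2.fillna("Unknown")             # missing values become "Unknown"
replaced = reviews.taster_twitter_handle.replace("@kerinokeefe", "@kerino")

print(len(missing_country))
print(filled.tolist())
```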

Renaming and Combining

Renaming

Rename index labels or column names

reviews.rename(columns={'points': 'score'})

To change the index itself, set_index() is usually more convenient than rename()

Both the row index and the column index can have their own name attribute. The complementary rename_axis() method may be used to change these names.

reviews.rename_axis("wines", axis='rows').rename_axis("fields", axis='columns')
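
To separate the two ideas, this sketch uses rename() to change individual labels and rename_axis() to change the name attached to each axis. The data and the "first" label are invented.

```python
import pandas as pd

# Made-up data for illustration.
reviews = pd.DataFrame({"country": ["Italy"], "points": [87]})

renamed_col = reviews.rename(columns={"points": "score"})  # change a column label
renamed_row = reviews.rename(index={0: "first"})           # change a row label

# rename_axis() names the axes themselves, not the labels on them.
named = reviews.rename_axis("wines", axis="rows").rename_axis("fields", axis="columns")

print(renamed_col.columns.tolist())
print(named.index.name, named.columns.name)
```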

Combining

concat(): the simplest approach, stacking DataFrames that have the same columns

canadian_youtube = pd.read_csv("../input/youtube-new/CAvideos.csv")
british_youtube = pd.read_csv("../input/youtube-new/GBvideos.csv")

pd.concat([canadian_youtube, british_youtube])
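
A minimal sketch of concat() with two tiny invented frames in place of the YouTube CSV files. One detail worth knowing is that each frame keeps its own index, so labels can repeat; ignore_index=True, shown here as an optional extra, renumbers the rows.

```python
import pandas as pd

# Made-up stand-ins for the two CSV files.
canadian_youtube = pd.DataFrame({"title": ["A", "B"], "views": [10, 20]})
british_youtube = pd.DataFrame({"title": ["C"], "views": [30]})

stacked = pd.concat([canadian_youtube, british_youtube])
renumbered = pd.concat([canadian_youtube, british_youtube], ignore_index=True)

print(stacked.index.tolist())     # original labels, so 0 appears twice
print(renumbered.index.tolist())  # fresh labels 0..2
```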

join(): combine DataFrames that share an index

left = canadian_youtube.set_index(['title', 'trending_date'])
right = british_youtube.set_index(['title', 'trending_date'])

left.join(right, lsuffix='_CAN', rsuffix='_UK')  # the suffixes distinguish columns that share a name in left and right
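
The join can be traced on two small invented frames. By default join() keeps every row of the left frame, so a video that trended only in Canada gets a missing value in the UK column.

```python
import pandas as pd

# Made-up stand-ins for the two CSV files.
canadian_youtube = pd.DataFrame({
    "title": ["A", "B"],
    "trending_date": ["17.14.11", "17.14.11"],
    "views": [10, 20],
})
british_youtube = pd.DataFrame({
    "title": ["B", "C"],
    "trending_date": ["17.14.11", "17.14.11"],
    "views": [200, 300],
})

left = canadian_youtube.set_index(["title", "trending_date"])
right = british_youtube.set_index(["title", "trending_date"])

# Both frames have a 'views' column, so the suffixes keep them apart.
joined = left.join(right, lsuffix="_CAN", rsuffix="_UK")

print(joined)
```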

merge()

Ao Wei's blog