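Before We Begin

You need basic Python before working through this post; you can study the earlier Python_tutorial first or pick it up elsewhere.
Learning NumPy and Pandas is, at its core, learning their data structures: NumPy's Array, and Pandas' Series and DataFrame.
After finishing, you may feel the way you do after completing a Data Structures course: "I seem to have learned all of it, but I can't apply it when a problem comes up." The remedy is still more practice (learning by doing).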

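Jupyter

A web-based application for interactive computing. It can be used across the whole workflow: development, documentation, running code, and presenting results.
- Syntax highlighting, indentation, and tab completion while editing
- Code runs directly in the browser, with the results shown beneath each code cell
- Notebooks can be exported to different formats, such as pdf, html, and py files
- Markdown is supported for writing documentation and notes alongside the code
You can also run ipynb files directly in VS Code.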

NumPy

The NumPy tutorial I followed is by a YouTuber, Derek Banas.

Code from the tutorial on GitHub: https://github.com/derekbanas/NumPy-Tutorial
My own ipynb files: https://github.com/Carp2i/NumPy-Pandas-Matplotlib/tree/master/NumPy

Among the topics it covers: Einstein summation, Kronecker products, condition numbers, applying Cramer's rule, and NumPy's finance package.

# imports
import numpy as np
import matplotlib.pyplot as plt
from numpy import random

Creating

Creating arrays

From a list

list_1 = [1, 2, 3, 4, 5]
np_arr_1 = np.array(list_1, dtype=np.int8)

From functions
Many of these functions (methods) have the same names as in Matlab.

np.arange(1, 10)            # integers in [1, 10), i.e. 1 through 9
np.linspace(0, 5, 6)        # 6 evenly spaced points over [0, 5], both endpoints included
np.zeros(4)                 # 1-D array of four zeros
np.ones((2, 3))             # 2x3 array of ones

np.random.randint(10, 50, 5)            # 1-D array of 5 random integers in [10, 50)
np.random.randint(10, 50, size=(2, 3))  # 2x3 array of random integers in [10, 50)
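
The endpoint behavior of these creation functions is easy to mix up, so here is a minimal sketch that checks it. The variable names are only illustrative and are not part of the original notes.

```python
import numpy as np

ints = np.arange(1, 10)        # the stop value 10 is excluded
points = np.linspace(0, 5, 6)  # the stop value 5 is included
rand = np.random.randint(10, 50, size=(2, 3))  # the upper bound 50 is excluded

print(ints)        # [1 2 3 4 5 6 7 8 9]
print(points)      # [0. 1. 2. 3. 4. 5.]
print(rand.shape)  # (2, 3)
```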

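Inspecting array attributes

# An array's element type and dimensions are stored as attributes
arr.shape   # shape of the array (length along each axis)
arr.size    # total number of elements
arr.dtype   # element type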
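
Since `size` and `shape` are easy to confuse, the following sketch prints both for a small array. The 2x3 array is a made-up example, not data from the tutorial.

```python
import numpy as np

# Hypothetical 2x3 example array
arr = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int8)

shape = arr.shape   # (2, 3): length along each axis
size = arr.size     # 6: total number of elements
ndim = arr.ndim     # 2: number of axes
dtype = arr.dtype   # int8: element type
print(shape, size, ndim, dtype)
```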

Viewing a function's documentation

np.random.randint?    # IPython/Jupyter syntax for showing the docstring

Slicing & Indexes

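Indexing

Indexing and slicing are always important.

# Index element values with square brackets, just as with a list
arr[0, 0] = 2
# itemset does the same thing, but the position must be given as a tuple
# (ndarray.itemset was removed in NumPy 2.0; use arr[0, 1] = 1 there)
arr.itemset((0, 1), 1)

# Two equivalent ways to read an element
arr[0, 1]
arr.item(0, 1)

# take returns an array of the elements at the given flat indices (row-major numbering)
np.take(arr, [0, 3, 6])
# put takes the flat indices to overwrite as its second argument and the new values as its third
np.put(arr, [0, 3, 6], [10, 10, 10])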
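
To make the row-major flat numbering concrete, here is a small sketch on an assumed 3x3 array holding 0 through 8, where flat index k corresponds to position (k // 3, k % 3).

```python
import numpy as np

arr = np.arange(9).reshape(3, 3)   # hypothetical 3x3 array holding 0..8

# Flat indices 0, 3, 6 are positions (0, 0), (1, 0), (2, 0): the first column
picked = np.take(arr, [0, 3, 6])
print(picked)                      # [0 3 6]

np.put(arr, [0, 3, 6], [10, 10, 10])   # modifies arr in place
print(arr[:, 0])                       # [10 10 10]
```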

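Slicing

# Matlab-like start:stop:step notation
arr[:5:2]   # indices in [0, 5) with a step of 2
# Second column of a 2-D array
arr[:, 1]
arr[::-1]   # rows in reverse order

# A condition inside the brackets acts as a boolean mask; matching elements
# are collected in row-major order into a 1-D array
evens = arr[arr % 2 == 0]
# This keeps only the even elements

np.unique(arr)
# Sorted, with duplicates removed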
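
A short sketch of what the boolean mask looks like before it is used for selection. The input values are invented for illustration.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 8]])  # hypothetical data

mask = arr % 2 == 0               # boolean array with the same shape as arr
evens = arr[mask]                 # matching elements, flattened in row-major order
unique_evens = np.unique(evens)   # sorted, duplicates removed

print(evens)          # [2 4 6 8 8]
print(unique_evens)   # [2 4 6 8]
```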

Reshape Arrays

The reshape method

arr.reshape((1, 9))

The resize function

np.resize(np_m_arr_1, (2, 5))
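
The two differ in how they treat a size mismatch: `reshape` requires the element count to stay the same, while `np.resize` repeats the data to fill a larger shape. A minimal sketch with an assumed six-element input:

```python
import numpy as np

a = np.arange(6)                    # hypothetical input: [0 1 2 3 4 5]

reshaped = a.reshape((2, 3))        # same 6 elements, new shape
resized = np.resize(a, (2, 5))      # 10 slots: the data repeats to fill them
print(resized)
# [[0 1 2 3 4]
#  [5 0 1 2 3]]

try:
    a.reshape((2, 5))               # 6 elements cannot fill 10 slots
    reshape_failed = False
except ValueError:
    reshape_failed = True
print(reshape_failed)               # True
```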

Other transformations

# Transpose a 2-D array
arr.transpose()
# Swap two axes
arr.swapaxes(0, 1)
# flatten defaults to row-major order; pass 'F' for column-major order
arr.flatten('F')

# Sort in place along each row
arr.sort(axis=1)

Stacking & Splitting

Stacking

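Stack arrays horizontally or vertically.

arr1 = np.random.randint(10, size=(2, 2))
arr2 = np.random.randint(10, size=(2, 2))

# Vertical / horizontal stacking
np.vstack((arr1, arr2))
np.hstack((arr1, arr2))

# Return a copy of arr1 without its second row (index 1 along axis 0)
np.delete(arr1, 1, 0)

# Another way to stack vertically / horizontally
# Note that the arrays are passed together as one tuple
# (np.row_stack is an alias of np.vstack and is deprecated in NumPy 2.0)
np.row_stack((arr1, arr2))
np.column_stack((arr1, arr2))

arr3 = np.random.randint(10, size=(2, 10))
np.hsplit(arr3, (2, 4))   # split at columns 2 and 4 into three arrays: [:, :2], [:, 2:4], [:, 4:]
np.vsplit(....)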
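
As a sanity check on what `np.hsplit` returns when given a tuple of indices, the sketch below splits an assumed 2x10 array and stacks the pieces back together. Nothing is deleted; the pieces together hold every column.

```python
import numpy as np

arr3 = np.arange(20).reshape(2, 10)   # hypothetical 2x10 array

# Split points at columns 2 and 4 give three pieces
left, middle, right = np.hsplit(arr3, (2, 4))
print(left.shape, middle.shape, right.shape)   # (2, 2) (2, 2) (2, 6)

# Stacking the pieces horizontally restores the original array
restored = np.hstack((left, middle, right))
print(np.array_equal(restored, arr3))          # True
```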

Copying

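Plain assignment does not make a copy.
Because of how Python works, assignment only makes two names point to the same contents.

If you modify the contents through one name, the other name sees the change too.

The correct way to copy

arr2 = arr1.copy()   # copies the contents into a new, independent array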
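
The difference can be checked directly. This sketch also includes a slice, which in NumPy is a view that shares memory with the original; that detail goes slightly beyond the notes above.

```python
import numpy as np

arr1 = np.array([1, 2, 3])

alias = arr1               # plain assignment: a second name for the same array
view = arr1[:2]            # basic slicing returns a view sharing the same memory
independent = arr1.copy()  # a real copy with its own data

arr1[0] = 99

print(alias[0])        # 99, changed along with arr1
print(view[0])         # 99, the view sees the change as well
print(independent[0])  # 1, the copy is unaffected
```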

Basic Math

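np.add(arr1, arr2)        # the second argument may also be a scalar
# Similar element-wise arithmetic: np.subtract, np.multiply, np.divide
np.remainder(arr1, arr2)  # element-wise remainder
np.power(arr1, arr2)      # element-wise power, with arr2 as the exponent
np.sqrt(arr)              # square root; np.cbrt() gives the cube root

np.gcd.reduce([9, 12, 15])    # greatest common divisor of all elements in the list
np.lcm.reduce([9, 12, 15])    # least common multiple of all elements in the list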
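
A worked example of the element-wise functions, using two small arrays chosen only for illustration:

```python
import numpy as np

arr1 = np.array([7, 8, 9])   # hypothetical inputs
arr2 = np.array([2, 3, 4])

plus_ten = np.add(arr1, 10)        # scalar is applied to every element: 17, 18, 19
diff = np.subtract(arr1, arr2)     # 5, 5, 5
rem = np.remainder(arr1, arr2)     # 7 % 2, 8 % 3, 9 % 4 -> 1, 2, 1
powers = np.power(arr1, arr2)      # 7**2, 8**3, 9**4 -> 49, 512, 6561

gcd_all = np.gcd.reduce([9, 12, 15])   # 3
lcm_all = np.lcm.reduce([9, 12, 15])   # 180
```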

Reading from files

File reading is always important.

import pandas as pd
from numpy import genfromtxt

# Reading with pandas
ic_sales = pd.read_csv('icecreamsales.csv').to_numpy()
ic_sales
# Reading with NumPy; this style is seen much less often
ic_sales2 = genfromtxt('icecreamsales.csv', delimiter=',')
# Drop the NaN entries from each row
ic_sales2 = [row[~np.isnan(row)] for row in ic_sales2]

Statistics Function

arr1 = np.arange(1, 6)
np.mean(arr1) # also np.median, np.average, np.std, np.var
np.var([4, 6, 3, 5, 3])
# nanmedian, nanmean, nanstd, nanvar do the same but ignore NaN values

np.percentile(ic_sales, 50, axis=0)
ic_sales[:, 1]

np.corrcoef(ic_sales[:, 0], ic_sales[:, 1]) # correlation coefficient
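
Two details are worth seeing in numbers: a single NaN poisons the plain functions but not the nan-aware ones, and `np.var` divides by n unless `ddof=1` is passed. The array with a NaN is an invented example; the second list is the one used above.

```python
import numpy as np

data = np.array([4.0, 6.0, np.nan, 5.0, 3.0])   # hypothetical data with a gap

plain_mean = np.mean(data)     # nan: one NaN makes the whole result NaN
nan_mean = np.nanmean(data)    # 4.5: mean of 4, 6, 5, 3

values = [4, 6, 3, 5, 3]
pop_var = np.var(values)              # divides by n     -> 1.36
sample_var = np.var(values, ddof=1)   # divides by n - 1 -> 1.7
print(plain_mean, nan_mean, pop_var, sample_var)
```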

Trig Functions

# I had to look this word up: "trig" stands for trigonometric functions
# The value of pi is stored in np.pi
t_arr = np.linspace(-np.pi, np.pi, 200)
plt.plot(t_arr, np.sin(t_arr))  # the second argument could use np.cos instead

# Inverse trig functions
np.arctan(1)

# Also arctan2, sinh, cosh, tanh, arcsinh, arccosh, arctanh

# Degrees to radians
np.deg2rad(180)
# Radians to degrees
np.rad2deg(np.pi)

# Hypotenuse of a right triangle, given its two legs
np.hypot(10, 10)

Linear Algebra Function

from numpy import linalg as LA
# Matrix multiplication with Dot Product
np.dot(arr1, arr2)
# multi_dot takes a list of arrays and chains the dot products in sequence
LA.multi_dot([arr1, arr2, arr3])

# Inner product
np.inner(arr1, arr2)

# Tensor Dot Product
arr_9 = np.array([[[1, 2],
                    [3, 4]],
                    [[5, 6],
                    [7, 8]]])

arr_10 = np.array([[1, 2], [3, 4]], dtype=object)
np.tensordot(arr_9, arr_10)

# Matrix squared (the matrix multiplied by itself)
LA.matrix_power(arr, 2)

# Compute eigenvalues
LA.eig(arr)       # returns both the eigenvalues and the eigenvectors
LA.eigvals(arr)   # returns only the eigenvalues

# Get Vector Norm sqrt(sum(x**2))
LA.norm(arr)

# Get Multiplicative Inverse of a matrix
LA.inv(arr)

# Condition number of a matrix, a measure of how sensitive it is to small changes (a bit beyond the scope here)
LA.cond(arr)

# Determinants are used to compute volume and area, to solve systems
# of equations, and more. A determinant combines the values of a matrix
# into a single number.
# For a matrix to have an inverse, its determinant must not equal 0
# det([[a, b], [c, d]]) = a * d - b * c

arr_12 = np.array([[1, 2], [3, 4]])
LA.det(arr_12)  # 1*4 - 2*3

# Extra topics: Einstein summation, Kronecker products, Cramer's rule
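
For the extra topics, here is a minimal sketch of each, assuming small 2x2 inputs that I made up. It is only meant to show the shape of the calls, not to reproduce the tutorial's own examples.

```python
import numpy as np
from numpy import linalg as LA

A = np.array([[1., 2.], [3., 4.]])   # hypothetical matrices
B = np.array([[0., 1.], [1., 0.]])

# Einstein summation: 'ij,jk->ik' sums over the shared index j (a matrix product)
product = np.einsum('ij,jk->ik', A, B)
trace = np.einsum('ii', A)           # repeated index sums the diagonal: 1 + 4

# Kronecker product: every entry of A scales a full copy of B, giving a 4x4 matrix
K = np.kron(A, B)

# Cramer's rule for A x = b: x_i = det(A_i) / det(A),
# where A_i is A with column i replaced by b
b = np.array([5., 6.])
x = np.empty(2)
for i in range(2):
    A_i = A.copy()
    A_i[:, i] = b
    x[i] = LA.det(A_i) / LA.det(A)

print(x)                 # close to [-4, 4.5]
print(LA.solve(A, b))    # the same system solved directly, for comparison
```

Cramer's rule is shown here only because it is easy to follow; for real systems `LA.solve` is the usual choice.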

Saving & Loading

arr_15 = np.array([[1, 2], [3, 4]])
np.save('randarray', arr_15)
print('arr_15\n', arr_15)
arr_16 = np.load('randarray.npy')
arr_16

np.savetxt('randcsv.csv', arr_15)
arr_17 = np.loadtxt('randcsv.csv')
arr_17
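
One difference between the two formats is that `.npy` keeps the dtype while a text file comes back as floats. The sketch below writes to a temporary directory so nothing is left behind; passing `delimiter=','` is my addition, so that the text file really is comma-separated.

```python
import os
import tempfile

import numpy as np

arr = np.array([[1, 2], [3, 4]])

with tempfile.TemporaryDirectory() as tmp:
    npy_path = os.path.join(tmp, 'randarray.npy')
    csv_path = os.path.join(tmp, 'randcsv.csv')

    np.save(npy_path, arr)                      # binary format, dtype preserved
    from_npy = np.load(npy_path)

    np.savetxt(csv_path, arr, delimiter=',')    # plain text, comma-separated
    from_csv = np.loadtxt(csv_path, delimiter=',')

print(from_npy.dtype == arr.dtype)   # True
print(from_csv.dtype)                # float64
```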

Comparison Functions

carr_1 = np.array([2, 3])
carr_2 = np.array([3, 2])
np.greater_equal(carr_1, carr_2)
np.less_equal(carr_1, carr_2)
np.not_equal(carr_1, carr_2)
np.equal(carr_1, carr_2)

Pandas

Although Morvan's (莫烦) course was recorded fairly long ago (2017), NumPy and Pandas have not changed much in the last couple of years.
Course videos: https://www.bilibili.com/video/BV1Ex411L7oT?p=12&spm_id_from=333.1007.top_right_bar_window_history.content.click
Code provided by Morvan: https://github.com/MorvanZhou/tutorials/tree/master/numpy%26pandas
My own ipynb files:

Pandas provides numerous tools for working with tabular data like you'd find in spreadsheets or databases.
It is widely used for data preparation and cleaning as well as analysis, and it can work with a wide variety of data types.

# import module
import pandas as pd
import numpy as np

Data Structure

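Pandas has two basic data structures, Series and DataFrame (note the capitalization).

Series

Series: a basic data structure; a one-dimensional labeled array that can hold any data type and is indexed by label.

DataFrame

DataFrame: a basic data structure; a two-dimensional array made up of an ordered collection of columns.

The row labels are stored in index.
The column labels are stored in columns.
Both are commonly strings, although the default labels are integers starting from 0.

Creating the data structures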

# Without an explicit index, the default is 0 to n-1
s = pd.Series([1, 3, 6, np.nan, 44, 1], name='ndarray')

# Build the list of index labels (one label per row, all of the same type)
dates = pd.date_range('20130101', periods=6)
df1 = pd.DataFrame(np.random.randn(6, 4), index=dates)

# Without an explicit index, integer labels 0, 1, 2, 3, ... are generated automatically
df2 = pd.DataFrame(np.arange(12).reshape((3, 4)))

df3 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})

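As the example above shows, DataFrame construction also applies broadcasting: values with too few dimensions, such as scalars, are broadcast to fill the whole column.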
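
A reduced version of that constructor call makes the broadcasting visible. The three columns below are a subset I picked for brevity.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': 1.0,                                          # scalar
                   'C': pd.Series(1, index=range(4), dtype='float32'),  # sets the length: 4 rows
                   'F': 'foo'})                                       # scalar

print(df.shape)          # (4, 3)
print(df['A'].tolist())  # [1.0, 1.0, 1.0, 1.0]
print(df['F'].tolist())  # ['foo', 'foo', 'foo', 'foo']
```

The Series fixes the number of rows, and the two scalars are repeated to match it.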

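Common operations

Object attributes

dtypes
Note that it is dtypes (with an s).

df.dtypes   # data type of each column

index

df.index    # the row labels, along with their type

columns

df.columns
# Returns the column labels, along with their type
# The labels are usually strings

values

# Returns the element values in row-major order,
# without the index and columns
df.values

Commonly used methods

.describe() is a common way to get a quick statistical summary of the data
.T transposes the table
.sort_index() and .sort_values() sort the data (the .sort method seen in older tutorials has been removed from pandas)

Selecting elements

Indexing

Indexing is always important.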

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=['A', 'B', 'C', 'D'])
df_A = df['A']
print(f'{df_A} \n {df}')

Slicing

Slicing is just as important as indexing.

print(df['A'], df.A)
df1 = df[0:3]
df2 = df['20130102':'20130104']
print(f'{df1}\n{df2}')

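A DataFrame can be indexed directly with a string (key), like a dictionary.
It can be sliced like an Array or a two-dimensional list.
Labels can also take the place of the usual integer positions.

Select by label: loc

With the .loc accessor in pandas, you can select elements by label.

# Takes the row whose index is '20130102' and returns it as a Series
print(df.loc['20130102'])

# .loc is used like positional indexing, with labels replacing the integer positions
print(df.loc[:, ['A', 'B']])
print(df.loc['20130102', ['A', 'B']])
# These print columns A and B for every row, then for the row 20130102 only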

Select by position: iloc

print(df.iloc[[1, 3, 5], 1:3])

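Morvan's videos use the DataFrame.ix method, which allowed label and position indexing to be mixed.
It was deprecated and has since been removed (as of pandas 1.0).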
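
One difference between the two remaining accessors is worth checking by hand: a label slice with `.loc` includes its end label, while a position slice with `.iloc` excludes its end position. A sketch using the same table as above:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=['A', 'B', 'C', 'D'])

by_label = df.loc['20130102':'20130104', ['A', 'B']]   # label slice: end label included
by_position = df.iloc[1:3, 0:2]                        # position slice: end position excluded

print(by_label.shape)      # (3, 2)
print(by_position.shape)   # (2, 2)
```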

# Boolean indexing
print(df)
print(df[df.A <8])

Putting a condition inside the brackets returns the rows that satisfy it.

Setting values in Pandas

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6,4)), index=dates, columns = ['A', 'B', 'C', 'D'])

df.iloc[2, 2] = 1111
df.loc['20130101', 'B'] = 222
df[df.A>0] = 0
# A new, empty column can be added directly
df['F'] = np.nan
# An already defined Series can be added as a column as well
df['E'] = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130101', periods=6))

print(df)

Assigning to a column that does not exist adds it as a new column.
Scalar values are broadcast to every row.
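
When the new column is a Series, pandas matches it to the rows by index label rather than by position. The sketch below uses a small made-up table to show this, plus a conditional assignment restricted to one column via `.loc`.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x2 table: A = 0, 2, 4 and B = 1, 3, 5
df = pd.DataFrame(np.arange(6).reshape((3, 2)), index=['x', 'y', 'z'], columns=['A', 'B'])

df['F'] = np.nan                                   # scalar broadcast to every row
df['E'] = pd.Series([10, 20], index=['z', 'x'])    # aligned by label; 'y' has no value

df.loc[df.A > 2, 'B'] = 0                          # only column B, only rows where A > 2

print(df)
```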

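Handling missing values

If a value is missing, its place is filled with np.nan.

Detecting missing values

Using the .isnull() method

np.any(df.isnull()) == True checks the whole table at once

print(df.isnull())                  # Boolean table of the same shape; True marks a missing value
print(np.any(df.isnull()) == True)  # True if the table contains any missing value

Ways to handle missing values

print(df.dropna(axis=0, how='all'))  # how = {'any', 'all'}: drop rows where any / all values are missing
print(df.fillna(value=0))            # replace np.nan with the given value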
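
The effect of `how='any'` versus `how='all'` is clearest on a tiny table with one partly missing row and one fully missing row. The data here is invented for the purpose.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [np.nan, np.nan, 6.0]})   # hypothetical data

has_missing = df.isnull().to_numpy().any()   # True: at least one NaN somewhere

drop_all = df.dropna(axis=0, how='all')      # drops only row 1 (every value missing)
drop_any = df.dropna(axis=0, how='any')      # keeps only row 2 (no value missing)
filled = df.fillna(value=0)                  # NaN replaced by 0

print(len(drop_all), len(drop_any))          # 2 1
```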

Reading and writing files

Reading
- read_csv
- read_excel
- …
Writing
- to_csv
- to_excel
- …

data = pd.read_csv('student.csv')
print(data)
# If the csv file has no index column,
## a default index counting up from 0 is attached
data.to_csv('student_copy.csv')
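
A side effect of the default index: `to_csv` writes it out as an extra first column, which then shows up when the file is read back. The sketch below demonstrates this with an in-memory buffer and a made-up table, so it does not depend on `student.csv`.

```python
import io

import pandas as pd

data = pd.DataFrame({'name': ['Ann', 'Bob'], 'score': [90, 85]})   # hypothetical data

buf = io.StringIO()
data.to_csv(buf)                 # the index is written as an unnamed first column
buf.seek(0)
with_index = pd.read_csv(buf)
print(with_index.columns.tolist())   # ['Unnamed: 0', 'name', 'score']

buf2 = io.StringIO()
data.to_csv(buf2, index=False)   # leave the index out
buf2.seek(0)
clean = pd.read_csv(buf2)
print(clean.columns.tolist())        # ['name', 'score']
```

Reading with `index_col=0` is another way to avoid the extra column.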

Combining

Concatenating

Combining data that share the same columns

df1 = pd.DataFrame(np.ones((3, 4))*0, columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.ones((3, 4))*1, columns=['a', 'b', 'c', 'd'])
df3 = pd.DataFrame(np.ones((3, 4))*2, columns=['a', 'b', 'c', 'd'])

print(df1)
print(df2)
print(df3)

res = pd.concat([df1, df2, df3], axis=0, ignore_index=True)
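# The last argument discards the original index labels and renumbers the rows
## Without it the index would be 0 1 2 0 1 2 0 1 2
print(res)

The default is ignore_index=False.
With the default, the result keeps the original labels, so the same index value appears more than once.
With ignore_index=True, the rows are renumbered from 0 to n-1.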
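
The two settings can be compared side by side on smaller frames (2x2 here, chosen for brevity):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.zeros((2, 2)), columns=['a', 'b'])
df2 = pd.DataFrame(np.ones((2, 2)), columns=['a', 'b'])

kept = pd.concat([df1, df2], axis=0)                           # original labels kept
renumbered = pd.concat([df1, df2], axis=0, ignore_index=True)  # fresh labels

print(kept.index.tolist())         # [0, 1, 0, 1]
print(renumbered.index.tolist())   # [0, 1, 2, 3]
print(len(kept.loc[0]))            # 2: the label 0 now matches two rows
```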

join

res = pd.concat([df1, df2], join='inner', ignore_index=False)
# join = {'outer', 'inner'}
## outer: keep all columns and fill the gaps with np.nan
## inner: drop the columns that are not present in every input

# join_axes appears in Morvan's videos, but it was removed in 2020

# append
df1 = pd.DataFrame(np.ones((3, 4))*0, columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.ones((3, 4))*1, columns=['a', 'b', 'c', 'd'])
df3 = pd.DataFrame(np.ones((3, 4))*2, columns=['a', 'b', 'c', 'd'])
s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
# Append df2 and df3 to df1 and rebuild the index
# (DataFrame.append was removed in pandas 2.0; pd.concat does the same job)
res = pd.concat([df1, df2, df3], ignore_index=True)
print(res)
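
The `join` option only matters when the inputs have different columns, which the frames above do not. Here is a sketch with two assumed frames that overlap on columns b and c only:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.zeros((2, 3)), columns=['a', 'b', 'c'])   # hypothetical frames
df2 = pd.DataFrame(np.ones((2, 3)), columns=['b', 'c', 'd'])

outer = pd.concat([df1, df2], join='outer', ignore_index=True)  # union of columns
inner = pd.concat([df1, df2], join='inner', ignore_index=True)  # shared columns only

print(outer.columns.tolist())   # ['a', 'b', 'c', 'd']
print(inner.columns.tolist())   # ['b', 'c']
print(outer['a'].isna().sum())  # 2: rows from df2 have no value for 'a'
```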

merge

# merging two df by key/keys. (may be used in database)
# simple example
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                                  'A': ['A0', 'A1', 'A2', 'A3'],
                                  'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                                    'C': ['C0', 'C1', 'C2', 'C3'],
                                    'D': ['D0', 'D1', 'D2', 'D3']})
print(left)
print(right)
# Merge on the column 'key'
res = pd.merge(left, right, on='key')
print(res)

# consider two keys
## merging on two key columns works the same way
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                             'key2': ['K0', 'K1', 'K0', 'K1'],
                             'A': ['A0', 'A1', 'A2', 'A3'],
                             'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                              'key2': ['K0', 'K0', 'K0', 'K0'],
                              'C': ['C0', 'C1', 'C2', 'C3'],
                              'D': ['D0', 'D1', 'D2', 'D3']})
print(left)
print(right)
res = pd.merge(left, right, on=['key1', 'key2'], how='inner')  # default for how='inner'
# how = ['left', 'right', 'outer', 'inner']
res = pd.merge(left, right, on=['key1', 'key2'], how='left')
print(res)

# indicator
df1 = pd.DataFrame({'col1':[0,1], 'col_left':['a','b']})
df2 = pd.DataFrame({'col1':[1,2,2],'col_right':[2,2,2]})
print(df1)
print(df2)
res = pd.merge(df1, df2, on='col1', how='outer', indicator=True)
# give the indicator a custom name
res = pd.merge(df1, df2, on='col1', how='outer', indicator='indicator_column')


# merged by index
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                                  'B': ['B0', 'B1', 'B2']},
                                  index=['K0', 'K1', 'K2'])
right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                                     'D': ['D0', 'D2', 'D3']},
                                      index=['K0', 'K2', 'K3'])
print(left)
print(right)
# left_index and right_index
res = pd.merge(left, right, left_index=True, right_index=True, how='outer')
res = pd.merge(left, right, left_index=True, right_index=True, how='inner')

# handle overlapping
boys = pd.DataFrame({'k': ['K0', 'K1', 'K2'], 'age': [1, 2, 3]})
girls = pd.DataFrame({'k': ['K0', 'K0', 'K3'], 'age': [4, 5, 6]})
res = pd.merge(boys, girls, on='k', suffixes=['_boy', '_girl'], how='inner')
print(res)

# DataFrame.join is similar to merge; if you know merge, you will understand join
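
To see how `how` changes the row count and what the indicator column reports, the sketch below reruns the indicator example from above with three different settings.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [0, 1], 'col_left': ['a', 'b']})
df2 = pd.DataFrame({'col1': [1, 2, 2], 'col_right': [2, 2, 2]})

outer = pd.merge(df1, df2, on='col1', how='outer', indicator=True)
inner = pd.merge(df1, df2, on='col1', how='inner')
left = pd.merge(df1, df2, on='col1', how='left')

# _merge records where each row came from: left_only, right_only, or both
print(outer[['col1', '_merge']])
print(len(outer), len(inner), len(left))   # 4 1 2
```

Key 0 exists only on the left, key 1 on both sides, and key 2 twice on the right, which accounts for the four rows of the outer merge.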

Pandas visualization

The visualization part of this chapter is really just pandas calling the matplotlib library.

import matplotlib.pyplot as plt

plot data

# Series
data = pd.Series(np.random.randn(1000), index=np.arange(1000))
data = data.cumsum()
# plt.plot(x=, y=)
data.plot()   # use the built-in plot method
plt.show()

# DataFrame
data = pd.DataFrame(np.random.randn(1000, 4),
        index=np.arange(1000),
        columns=list('ABCD'))
data.plot()
# show() is pyplot's call for displaying the figure
plt.show()

Common plot kinds

‘bar’
‘hist’
‘box’
‘kde’
‘scatter’
‘hexbin’

Writing a scatter plot

ax = data.plot.scatter(x='A', y='B', color='DarkBlue', label='Class 1')
data.plot.scatter(x='A', y='C', color='DarkGreen', label='Class 2', ax=ax)
plt.show()