The Difference Between Scaling and Normalization

Date: 2022-07-28

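Scaling and normalization are two operations we routinely apply during data preprocessing, but they are often confused with each other, and much of the discussion online is muddled.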

import pandas as pd
import numpy as np

# for Box-Cox Transformation
from scipy import stats

# for min_max scaling
from mlxtend.preprocessing import minmax_scaling
from sklearn import preprocessing

# plotting modules
import seaborn as sns
import matplotlib.pyplot as plt

# set seed for reproducibility
np.random.seed(0)
1. Scaling

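Feature scaling changes the range of the data without changing the shape of its distribution. Examples are min-max scaling and the Z-score (there are four main methods; for details see: Feature_scaling). Two of them are shown here.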
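To make the two methods concrete, here is a minimal NumPy sketch of the formulas behind them. The helper names `min_max_scale` and `z_score` are illustrative and do not come from any library, and the sketch assumes the population standard deviation (ddof=0), which is what scikit-learn's StandardScaler uses.

```python
import numpy as np


def min_max_scale(x, new_min=0.0, new_max=1.0):
    """Linearly map x onto [new_min, new_max] (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        # Constant input: every value maps to the lower bound.
        return np.full_like(x, new_min)
    return new_min + (x - x.min()) * (new_max - new_min) / span


def z_score(x):
    """Subtract the mean and divide by the population std (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    std = x.std()  # ddof=0
    if std == 0:
        return np.zeros_like(x)
    return (x - x.mean()) / std


rng = np.random.default_rng(0)
sample = rng.beta(5, 1, 1000) * 60

mm = min_max_scale(sample)
zs = z_score(sample)

print("min-max range:", mm.min(), mm.max())
print("z-score mean/std:", zs.mean(), zs.std())
```

Both functions are of the form a * x + b, so they shift and stretch the values but never reorder them or change their relative spacing.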

Min-Max scale:

original_data = np.random.beta(5, 1, 1000) * 60

# min-max scale the data between 0 and 1
scaled_data = minmax_scaling(original_data, columns=[0])
# or, equivalently, with scikit-learn
scaled_data = preprocessing.minmax_scale(original_data)

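# plot both together to compare
# (sns.distplot is deprecated; histplot with kde=True draws a comparable plot)
fig, ax = plt.subplots(1, 2)
sns.histplot(original_data, kde=True, ax=ax[0])
ax[0].set_title("Original Data")
sns.histplot(scaled_data, kde=True, ax=ax[1])
ax[1].set_title("Scaled data")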

Z-score:

s_scaler = preprocessing.StandardScaler(with_mean=True, with_std=True)
df_s = s_scaler.fit_transform(original_data.reshape(-1,1))

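# plot both together to compare
# (sns.distplot is deprecated; histplot with kde=True draws a comparable plot)
fig, ax = plt.subplots(1, 2)
sns.histplot(original_data, kde=True, ax=ax[0])
ax[0].set_title("Original Data")
# fit_transform returns a column vector, so flatten it before plotting
sns.histplot(df_s.ravel(), kde=True, ax=ax[1])
ax[1].set_title("Scaled data")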
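One way to check the claim that scaling leaves the shape of the distribution untouched is to compare a shape statistic before and after. The sketch below uses sample skewness, computed by a hand-written `skewness` helper (an illustrative name, not from the original article), on data generated the same way as above. It is a rough numerical check under those assumptions, not a proof.

```python
import numpy as np


def skewness(x):
    """Sample skewness: mean of the cubed standardized values (hypothetical helper)."""
    x = np.asarray(x, dtype=float).ravel()
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))


rng = np.random.default_rng(0)
data = rng.beta(5, 1, 1000) * 60

# The same two scalings as in the section above, written out by hand.
min_max = (data - data.min()) / (data.max() - data.min())
standardized = (data - data.mean()) / data.std()

print("skewness, original:    ", skewness(data))
print("skewness, min-max:     ", skewness(min_max))
print("skewness, standardized:", skewness(standardized))
```

The three printed values should agree up to floating-point error: the axis labels change, the histogram's shape does not.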
2. Normalization

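Normalization, by contrast, changes the shape of the distribution. The Box-Cox transformation, for example, reshapes the data so that it more closely follows a normal distribution.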

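# normalize the beta-distributed data with boxcox (the input must be strictly positive)
# stats.boxcox returns a tuple: (transformed data, fitted lambda)
normalized_data = stats.boxcox(original_data)

# plot both together to compare
# (sns.distplot is deprecated; histplot with kde=True draws a comparable plot)
fig, ax = plt.subplots(1, 2)
sns.histplot(original_data, kde=True, ax=ax[0])
ax[0].set_title("Original Data")
sns.histplot(normalized_data[0], kde=True, ax=ax[1])
ax[1].set_title("Normalized data")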

Now try a different distribution:

original_data = np.random.exponential(size=1000)
# normalize the exponential data with boxcox
normalized_data = stats.boxcox(original_data)

# plot both together to compare
# (sns.distplot is deprecated; histplot with kde=True draws a comparable plot)
fig, ax = plt.subplots(1, 2)
sns.histplot(original_data, kde=True, ax=ax[0])
ax[0].set_title("Original Data")
sns.histplot(normalized_data[0], kde=True, ax=ax[1])
ax[1].set_title("Normalized data")
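To see what Box-Cox does beyond the plots, the following sketch inspects the fitted lambda, reproduces the transform by hand, and compares skewness before and after. It assumes exponential data generated with NumPy's `default_rng` (so the exact numbers will differ from the plots above), and it uses `scipy.special.inv_boxcox` to undo the transform. Treat it as an illustration rather than a recipe.

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

rng = np.random.default_rng(0)
data = rng.exponential(size=1000)  # strictly positive and right-skewed

# boxcox fits lambda by maximum likelihood and returns (transformed, lambda).
transformed, lam = stats.boxcox(data)

# The transform itself: (x**lambda - 1) / lambda, or log(x) when lambda == 0.
manual = np.log(data) if lam == 0 else (data ** lam - 1) / lam

# The transform is invertible as long as lambda is kept.
restored = inv_boxcox(transformed, lam)

print("fitted lambda:", lam)
print("skewness before:", stats.skew(data))
print("skewness after: ", stats.skew(transformed))
```

Unlike min-max or Z-score scaling, this mapping is nonlinear, which is why the skewness moves toward zero instead of staying fixed.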

References:

  1. https://www.kaggle.com/alexisbcook/scaling-and-normalization
  2. https://en.wikipedia.org/wiki/Feature_scaling