Normalization, Standardization, and Outlier Removal

Deep Learning & Machine Learning / Deep Learning & Machine Learning Concepts, 2023. 10. 23. 09:23, @webnautes

This post contains code that implements normalization, standardization, and outlier removal.

2021. 9. 17 - First written

2022. 4. 15

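Normalization rescales values into the range 0 to 1. Standardization rescales values so that the mean is 0 and the standard deviation is 1. Standardization changes only the location and scale of the data, not the shape of its distribution, so the result is a standard normal distribution only if the data was normally distributed to begin with. With standardization, the range of the values is not fixed.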
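The two definitions above can be written as two small functions. This is a minimal sketch, not code from the original post: the function names `min_max_normalize` and `standardize` and the four-element sample array are assumptions made for illustration.

```python
import numpy as np

def min_max_normalize(x):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def standardize(x):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

# Hypothetical sample data, chosen only to keep the arithmetic easy to follow
data = np.array([2.0, 4.0, 6.0, 8.0])
normalized = min_max_normalize(data)
standardized = standardize(data)

print(normalized)    # every value lies in the range 0 to 1
print(standardized)  # mean 0 and standard deviation 1; the range depends on the data
```

Both functions divide by a quantity that is zero when every value in the input is identical, so a constant array would need separate handling.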

According to the link below, standardization is the better choice when the data follows a normal distribution, and normalization is the better choice when it does not.

(Reference: https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-machine-learning-normalization-standardization/)

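The threshold is a matter of choice, but values outside the range from mean - 3 * standard deviation to mean + 3 * standard deviation are usually treated as outliers. After standardization the mean is 0 and the standard deviation is 1, so this corresponds to standardized values outside the range -3 to 3.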
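The 3-standard-deviation rule can be wrapped in a helper that returns the surviving values in their original units. This is a sketch under assumptions: the name `remove_outliers_zscore`, the `threshold` parameter, and the sample values are all hypothetical and are not part of the original post.

```python
import numpy as np

def remove_outliers_zscore(x, threshold=3.0):
    """Return the values of x whose z-score is strictly within +/- threshold."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()   # standardize using the data's own mean and std
    return x[np.abs(z) < threshold]

# Hypothetical data: 20 ordinary values plus one extreme value
values = np.array([10.0] * 10 + [12.0] * 10 + [100.0])
cleaned = remove_outliers_zscore(values)

print(values.size, cleaned.size)   # the extreme value is dropped
print(cleaned.max())
```

Note that the extreme value itself inflates the standard deviation used in the check, so on very small samples a single extreme point may not reach the threshold.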

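import numpy as np
from matplotlib import pyplot as plt

# 100 random integers from -10 to 19, plus two injected extreme values
a = np.random.randint(-10, 20, 100)
a[0] = 100
a[1] = -100

# One subplot per step
fig, ax = plt.subplots(4)

# Normalization (min-max scaling): map the values into the range 0 to 1
a_max = a.max()
a_min = a.min()
a_ = (a - a_min) / (a_max - a_min)

# Standardization: shift to mean 0 and scale to standard deviation 1
mean = np.mean(a)
std = np.std(a)
a_2 = (a - mean) / std

# Outlier removal: keep only standardized values strictly between -3 and 3
mask = np.logical_and(a_2 < 3, a_2 > -3)
a_3 = a_2[mask]

# Plot each step; the last two share the same Y-axis limits for comparison
ax[0].plot(a)
ax[1].plot(a_)
ax[2].set_ylim([-7, 7])
ax[2].plot(a_2)
ax[3].set_ylim([-7, 7])
ax[3].plot(a_3)

ax[0].set_title('source')
ax[1].set_title('normalization')
ax[2].set_title('standardization')
ax[3].set_title('standardization & remove outlier')

fig.tight_layout()
plt.show()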

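Looking at the Y axis of each graph, you can see that the range differs from case to case.

The first row is the source data. The second row is the result of normalization, so the values fall in the range 0 to 1. The third row is the result of standardization, where the minimum and maximum depend on the data. In the last row the outliers have been removed, so the highest spike and the lowest spike are gone.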
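The ranges described above can also be checked numerically instead of read off the plots. The following sketch repeats the same steps without matplotlib and prints the minimum, maximum, and number of values at each step. The fixed seed and the use of `np.random.default_rng` are assumptions made for reproducibility; the original code uses an unseeded `np.random.randint`.

```python
import numpy as np

rng = np.random.default_rng(0)    # seed is an assumption, used only for reproducibility
a = rng.integers(-10, 20, 100)    # same value range as the plotted data: -10 to 19
a[0] = 100                        # injected extreme values, as in the plotted data
a[1] = -100

normalized = (a - a.min()) / (a.max() - a.min())
standardized = (a - a.mean()) / a.std()
filtered = standardized[np.abs(standardized) < 3]

steps = [
    ("source", a),
    ("normalization", normalized),
    ("standardization", standardized),
    ("standardization & remove outlier", filtered),
]
for name, arr in steps:
    print(f"{name}: min={arr.min():.3f}, max={arr.max():.3f}, n={arr.size}")
```

With this data only the two injected values lie more than 3 standard deviations from the mean, so the last step keeps 98 of the 100 values.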

References

https://soo-jjeong.tistory.com/121

https://blog.finxter.com/how-to-find-outliers-in-python-easily/#Method_3_Remove_Outliers_From_NumPy_Array_Using_npmean_and_npstd

https://stml.tistory.com/45

https://stackoverflow.com/questions/38459793/numpy-boolean-indexing-multiple-conditions

Attribution-NonCommercial-ShareAlike
