Applying Standardization with a Pipeline in XGBoost

2024. 7. 8. 22:36 @webnautes

If you combine XGBoost and standardization into a single pipeline and train that pipeline, later inference through the pipeline applies the standardization as well.

Besides StandardScaler, the scaler used in this post, try RobustScaler, MinMaxScaler, Normalizer, QuantileTransformer, and PowerTransformer in your own tests. Which scaler works well depends on the dataset.

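In my tests, Optuna sometimes returned the same best parameter values with and without a scaler, and across different scalers. Even so, the resulting models gave different inference results. Note that, depending on the data, even that difference can be negligible.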

2024. 7. 4 First written

2024. 7. 5 Checking whether standardization is applied correctly.

2024. 7. 8 Confirmed through several tests that standardization is applied; added notes on scalers

The example code in this post uses XGBClassifier.

import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
from sklearn.pipeline import Pipeline
import optuna
import joblib
import time

# Load the iris dataset.
iris = load_iris()

# Separate the features and the labels.
X, y = iris.data, iris.target

# print(X.shape)
# (150, 4)
# print(y.shape)
# (150,)

# Split the data 8:2 into a training set and a hold-out set,
# then split the hold-out set in half into validation and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_validation, X_test, y_validation, y_test = train_test_split(X_test, y_test, test_size=0.5, random_state=42)

RANDOM_SEED = 42

def objective(trial):

    # Fixed XGBoost settings plus the search space that Optuna optimizes.
    params = {
        "objective": "multi:softprob",
        "eval_metric": 'mlogloss',
        "booster": 'gbtree',
        "tree_method": 'hist',
        "max_depth": trial.suggest_int("max_depth", 4, 10),
        "learning_rate": trial.suggest_float('learning_rate', 0.0001, 0.99),
        'n_estimators': trial.suggest_int("n_estimators", 1000, 10000, step=100),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
        "colsample_bylevel": trial.suggest_float("colsample_bylevel", 0.5, 1.0),
        "colsample_bynode": trial.suggest_float("colsample_bynode", 0.5, 1.0),
        "reg_lambda": trial.suggest_float("reg_lambda", 1e-2, 1.0),
        "reg_alpha": trial.suggest_float("reg_alpha", 1e-2, 1.0),
        'subsample': trial.suggest_float('subsample', 0.6, 1.0),
        'min_child_weight': trial.suggest_int('min_child_weight', 2, 15),
        "gamma": trial.suggest_float("gamma", 0.1, 1.0),
        "random_state": RANDOM_SEED,
    }

    # Build the pipeline: the scaler runs first, then the model.
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('model', XGBClassifier(**params))
    ])

    # The scaler is fitted on the training data only.
    pipeline.fit(X_train, y_train)

    # Score the trial on the validation set.
    y_pred = pipeline.predict(X_validation)

    accuracy = accuracy_score(y_validation, y_pred)

    return accuracy

# Start hyperparameter optimization with Optuna.
start_time = time.time()
study = optuna.create_study(direction="maximize", sampler=optuna.samplers.TPESampler(seed=RANDOM_SEED))
study.optimize(objective, n_trials=15)
end_time = time.time()

print(f'Elapsed time = {end_time - start_time:.2f} s')
print(f'Best accuracy: {study.best_value:.4f}')
print('Best hyperparameters:', study.best_params)

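# Train the final pipeline with the best hyperparameters found by Optuna.
# Note: study.best_params holds only the values chosen via trial.suggest_*.
# The fixed settings in objective() (objective, eval_metric, booster,
# tree_method) are not passed here, so XGBoost's defaults apply to them.
best_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', XGBClassifier(random_state=RANDOM_SEED, **study.best_params))
])

best_pipeline.fit(X_train, y_train)

# Save the pipeline (scaler and model together).
joblib.dump(best_pipeline, 'best_pipeline.pkl')

print("The pipeline was saved to 'best_pipeline.pkl'.")

# Load the saved pipeline.
loaded_pipeline = joblib.load('best_pipeline.pkl')
print("Loaded the pipeline from 'best_pipeline.pkl'.")

# Run inference on the test data.
y_pred = loaded_pipeline.predict(X_test)

# Compute the accuracy.
accuracy = accuracy_score(y_test, y_pred)
print(f'Test Accuracy: {accuracy:.4f}')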

Here is the output.

[I 2024-07-04 22:50:57,644] A new study created in memory with name: no-name-4e47d203-18b8-4d88-95c8-3a6d3f8a00a3

[I 2024-07-04 22:51:01,126] Trial 0 finished with value: 1.0 and parameters: {'max_depth': 6, 'learning_rate': 0.941212091915176, 'n_estimators': 7600, 'colsample_bytree': 0.7993292420985183, 'colsample_bylevel': 0.5780093202212182, 'colsample_bynode': 0.5779972601681014, 'reg_lambda': 0.06750277604651747, 'reg_alpha': 0.8675143843171859, 'subsample': 0.8404460046972835, 'min_child_weight': 11, 'gamma': 0.1185260448662222}. Best is trial 0 with value: 1.0.

[I 2024-07-04 22:51:02,453] Trial 1 finished with value: 1.0 and parameters: {'max_depth': 10, 'learning_rate': 0.8241349701283375, 'n_estimators': 2900, 'colsample_bytree': 0.5909124836035503, 'colsample_bylevel': 0.5917022549267169, 'colsample_bynode': 0.6521211214797689, 'reg_lambda': 0.5295088673159155, 'reg_alpha': 0.4376255684556946, 'subsample': 0.7164916560792167, 'min_child_weight': 10, 'gamma': 0.22554447458683766}. Best is trial 0 with value: 1.0.

[I 2024-07-04 22:51:04,793] Trial 2 finished with value: 1.0 and parameters: {'max_depth': 6, 'learning_rate': 0.3627615886764254, 'n_estimators': 5100, 'colsample_bytree': 0.8925879806965068, 'colsample_bylevel': 0.5998368910791798, 'colsample_bynode': 0.7571172192068059, 'reg_lambda': 0.596490423173422, 'reg_alpha': 0.05598590859279775, 'subsample': 0.8430179407605753, 'min_child_weight': 4, 'gamma': 0.1585464336867516}. Best is trial 0 with value: 1.0.

[I 2024-07-04 22:51:08,580] Trial 3 finished with value: 1.0 and parameters: {'max_depth': 10, 'learning_rate': 0.9559791495405063, 'n_estimators': 8300, 'colsample_bytree': 0.6523068845866853, 'colsample_bylevel': 0.5488360570031919, 'colsample_bynode': 0.8421165132560784, 'reg_lambda': 0.44575096880220527, 'reg_alpha': 0.13081785249633104, 'subsample': 0.798070764044508, 'min_child_weight': 2, 'gamma': 0.9183883618709039}. Best is trial 0 with value: 1.0.

[I 2024-07-04 22:51:10,308] Trial 4 finished with value: 1.0 and parameters: {'max_depth': 5, 'learning_rate': 0.6559308092820068, 'n_estimators': 3800, 'colsample_bytree': 0.7600340105889054, 'colsample_bylevel': 0.7733551396716398, 'colsample_bynode': 0.5924272277627636, 'reg_lambda': 0.9698887814869129, 'reg_alpha': 0.7773814951275034, 'subsample': 0.9757995766256756, 'min_child_weight': 14, 'gamma': 0.6381099809299766}. Best is trial 0 with value: 1.0.

[I 2024-07-04 22:51:11,558] Trial 5 finished with value: 1.0 and parameters: {'max_depth': 10, 'learning_rate': 0.08769872778119511, 'n_estimators': 2700, 'colsample_bytree': 0.522613644455269, 'colsample_bylevel': 0.6626651653816322, 'colsample_bynode': 0.6943386448447411, 'reg_lambda': 0.27863554145615693, 'reg_alpha': 0.83045013406041, 'subsample': 0.7427013306774357, 'min_child_weight': 5, 'gamma': 0.5884264748424236}. Best is trial 0 with value: 1.0.

[I 2024-07-04 22:51:12,291] Trial 6 finished with value: 1.0 and parameters: {'max_depth': 4, 'learning_rate': 0.7941947912484238, 'n_estimators': 1600, 'colsample_bytree': 0.9934434683002586, 'colsample_bylevel': 0.8861223846483287, 'colsample_bynode': 0.5993578407670862, 'reg_lambda': 0.015466895952366375, 'reg_alpha': 0.8173068141702858, 'subsample': 0.8827429375390468, 'min_child_weight': 12, 'gamma': 0.7941433120173511}. Best is trial 0 with value: 1.0.

[I 2024-07-04 22:51:13,234] Trial 7 finished with value: 1.0 and parameters: {'max_depth': 4, 'learning_rate': 0.35494522468597545, 'n_estimators': 2000, 'colsample_bytree': 0.9315517129377968, 'colsample_bylevel': 0.811649063413779, 'colsample_bynode': 0.6654490124263246, 'reg_lambda': 0.0729227667831634, 'reg_alpha': 0.3178724984985056, 'subsample': 0.7300733288106989, 'min_child_weight': 12, 'gamma': 0.6738017242196918}. Best is trial 0 with value: 1.0.

[I 2024-07-04 22:51:14,162] Trial 8 finished with value: 1.0 and parameters: {'max_depth': 10, 'learning_rate': 0.4675455544178136, 'n_estimators': 2000, 'colsample_bytree': 0.8566223936114975, 'colsample_bylevel': 0.8803925243084487, 'colsample_bynode': 0.7806385987847482, 'reg_lambda': 0.7732575081550154, 'reg_alpha': 0.4988576404007468, 'subsample': 0.8090931317527976, 'min_child_weight': 7, 'gamma': 0.12287721406968567}. Best is trial 0 with value: 1.0.

[I 2024-07-04 22:51:17,294] Trial 9 finished with value: 1.0 and parameters: {'max_depth': 4, 'learning_rate': 0.031211750911298235, 'n_estimators': 6700, 'colsample_bytree': 0.6571779905381634, 'colsample_bylevel': 0.7542853455823514, 'colsample_bynode': 0.9537832369630466, 'reg_lambda': 0.25679930685738617, 'reg_alpha': 0.41627909380527345, 'subsample': 0.9022204554172195, 'min_child_weight': 5, 'gamma': 0.1692819188459137}. Best is trial 0 with value: 1.0.

[I 2024-07-04 22:51:21,802] Trial 10 finished with value: 1.0 and parameters: {'max_depth': 7, 'learning_rate': 0.677494576859693, 'n_estimators': 9800, 'colsample_bytree': 0.7873088047582556, 'colsample_bylevel': 0.5076838686640521, 'colsample_bynode': 0.5193625999805915, 'reg_lambda': 0.2628890241775469, 'reg_alpha': 0.9763785008594151, 'subsample': 0.6387875118993118, 'min_child_weight': 15, 'gamma': 0.35268485881573364}. Best is trial 0 with value: 1.0.

[I 2024-07-04 22:51:24,461] Trial 11 finished with value: 1.0 and parameters: {'max_depth': 8, 'learning_rate': 0.9443520379760726, 'n_estimators': 5800, 'colsample_bytree': 0.5168163024120214, 'colsample_bylevel': 0.6533350465013262, 'colsample_bynode': 0.5049940036906653, 'reg_lambda': 0.564647252800192, 'reg_alpha': 0.6225250258868753, 'subsample': 0.6409964914633319, 'min_child_weight': 10, 'gamma': 0.35495166487149704}. Best is trial 0 with value: 1.0.

[I 2024-07-04 22:51:28,043] Trial 12 finished with value: 1.0 and parameters: {'max_depth': 8, 'learning_rate': 0.7816033479496383, 'n_estimators': 7800, 'colsample_bytree': 0.6859734324501636, 'colsample_bylevel': 0.6685543299982314, 'colsample_bynode': 0.6135695058254393, 'reg_lambda': 0.4410646602598562, 'reg_alpha': 0.2572442790499297, 'subsample': 0.7110618229950962, 'min_child_weight': 9, 'gamma': 0.36136601525256384}. Best is trial 0 with value: 1.0.

[I 2024-07-04 22:51:29,843] Trial 13 finished with value: 1.0 and parameters: {'max_depth': 8, 'learning_rate': 0.9849751658531088, 'n_estimators': 3900, 'colsample_bytree': 0.578760087301028, 'colsample_bylevel': 0.9826689730327278, 'colsample_bynode': 0.6826162761840294, 'reg_lambda': 0.7407173587049704, 'reg_alpha': 0.6255122381161018, 'subsample': 0.6923589423551142, 'min_child_weight': 11, 'gamma': 0.26746690877197943}. Best is trial 0 with value: 1.0.

[I 2024-07-04 22:51:34,418] Trial 14 finished with value: 1.0 and parameters: {'max_depth': 6, 'learning_rate': 0.8227435395768457, 'n_estimators': 10000, 'colsample_bytree': 0.7840223021755295, 'colsample_bylevel': 0.5879525450539488, 'colsample_bynode': 0.5715343140007164, 'reg_lambda': 0.1700697447308468, 'reg_alpha': 0.6373793033316469, 'subsample': 0.7773094334895967, 'min_child_weight': 8, 'gamma': 0.492998683292441}. Best is trial 0 with value: 1.0.

Elapsed time = 36.77 s

Best accuracy: 1.0000

Best hyperparameters: {'max_depth': 6, 'learning_rate': 0.941212091915176, 'n_estimators': 7600, 'colsample_bytree': 0.7993292420985183, 'colsample_bylevel': 0.5780093202212182, 'colsample_bynode': 0.5779972601681014, 'reg_lambda': 0.06750277604651747, 'reg_alpha': 0.8675143843171859, 'subsample': 0.8404460046972835, 'min_child_weight': 11, 'gamma': 0.1185260448662222}

The pipeline was saved to 'best_pipeline.pkl'.

Loaded the pipeline from 'best_pipeline.pkl'.

Test Accuracy: 0.9333

License: Attribution, NonCommercial, ShareAlike
