Applying Standardization with a Pipeline in XGBoost

If you combine XGBoost and standardization into a single pipeline and train it, later inference through that pipeline applies the standardization as well.

 

Besides the StandardScaler used in this post, try RobustScaler, MinMaxScaler, Normalizer, QuantileTransformer, and PowerTransformer in your own tests. Which scaler works well depends on the dataset.



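In my tests, Optuna's best parameter values could come out the same before and after applying a scaler, or across different scalers. Even so, I did see differences at model inference. Keep in mind that, depending on the data, even that difference can be negligible.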



2024. 7. 4  First written

2024. 7. 5  Checking whether standardization is applied correctly.

2024. 7. 8  Confirmed through several tests that standardization is applied, and added the notes on scalers



The example code in this post uses XGBClassifier.

 

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
from sklearn.pipeline import Pipeline
import optuna
import joblib
import time



# Load the iris dataset.
iris = load_iris()

# Separate the features and the labels.
X, y = iris.data, iris.target

# print(X.shape)
# (150, 4)
# print(y.shape)
# (150,)


RANDOM_SEED = 42

# Split off 20% of the data, then split that half and half,
# which gives train : validation : test = 8 : 1 : 1.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RANDOM_SEED)
X_validation, X_test, y_validation, y_test = train_test_split(X_test, y_test, test_size=0.5, random_state=RANDOM_SEED)




def objective(trial):

    # XGBoost parameters: the first four are fixed, the rest are tuned by Optuna.
    params = {
        "objective": "multi:softprob",
        "eval_metric": 'mlogloss',
        "booster": 'gbtree',
        "tree_method": 'hist',
        "max_depth": trial.suggest_int("max_depth", 4, 10),
        "learning_rate": trial.suggest_float('learning_rate', 0.0001, 0.99),
        'n_estimators': trial.suggest_int("n_estimators", 1000, 10000, step=100),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
        "colsample_bylevel": trial.suggest_float("colsample_bylevel", 0.5, 1.0),
        "colsample_bynode": trial.suggest_float("colsample_bynode", 0.5, 1.0),
        "reg_lambda": trial.suggest_float("reg_lambda", 1e-2, 1.0),
        "reg_alpha": trial.suggest_float("reg_alpha", 1e-2, 1.0),
        'subsample': trial.suggest_float('subsample', 0.6, 1.0),
        'min_child_weight': trial.suggest_int('min_child_weight', 2, 15),
        "gamma": trial.suggest_float("gamma", 0.1, 1.0),
        "random_state": RANDOM_SEED,
    }
   
    # Build the pipeline: the scaler runs first, then the model.
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('model', XGBClassifier(**params))
    ])
   
    pipeline.fit(X_train, y_train)

    y_pred = pipeline.predict(X_validation)

    accuracy = accuracy_score(y_validation, y_pred)

    return accuracy


# Start the hyperparameter optimization with Optuna.
start_time = time.time()
study = optuna.create_study(direction="maximize", sampler=optuna.samplers.TPESampler(seed=RANDOM_SEED))
study.optimize(objective, n_trials=15)
end_time = time.time()

print(f'Elapsed time = {end_time - start_time:.2f} s')
print(f'Best accuracy: {study.best_value:.4f}')
print('Best hyperparameters:', study.best_params)


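# Train the final pipeline with the best hyperparameters found by Optuna.
# study.best_params holds only the tuned values, so the fixed settings
# from objective() are passed again explicitly.
best_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', XGBClassifier(
        objective="multi:softprob",
        eval_metric='mlogloss',
        booster='gbtree',
        tree_method='hist',
        random_state=RANDOM_SEED,
        **study.best_params,
    ))
])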

best_pipeline.fit(X_train, y_train)


# Save the pipeline (scaler and model together).
joblib.dump(best_pipeline, 'best_pipeline.pkl')

print("Pipeline saved to 'best_pipeline.pkl'.")



# Load the saved pipeline.
loaded_pipeline = joblib.load('best_pipeline.pkl')
print("Pipeline loaded from 'best_pipeline.pkl'.")


# Run inference on the test data.
y_pred = loaded_pipeline.predict(X_test)


# Compute the accuracy.
accuracy = accuracy_score(y_test, y_pred)
print(f'Test Accuracy: {accuracy:.4f}')



Here is the output.

 

[I 2024-07-04 22:50:57,644] A new study created in memory with name: no-name-4e47d203-18b8-4d88-95c8-3a6d3f8a00a3

[I 2024-07-04 22:51:01,126] Trial 0 finished with value: 1.0 and parameters: {'max_depth': 6, 'learning_rate': 0.941212091915176, 'n_estimators': 7600, 'colsample_bytree': 0.7993292420985183, 'colsample_bylevel': 0.5780093202212182, 'colsample_bynode': 0.5779972601681014, 'reg_lambda': 0.06750277604651747, 'reg_alpha': 0.8675143843171859, 'subsample': 0.8404460046972835, 'min_child_weight': 11, 'gamma': 0.1185260448662222}. Best is trial 0 with value: 1.0.

[I 2024-07-04 22:51:02,453] Trial 1 finished with value: 1.0 and parameters: {'max_depth': 10, 'learning_rate': 0.8241349701283375, 'n_estimators': 2900, 'colsample_bytree': 0.5909124836035503, 'colsample_bylevel': 0.5917022549267169, 'colsample_bynode': 0.6521211214797689, 'reg_lambda': 0.5295088673159155, 'reg_alpha': 0.4376255684556946, 'subsample': 0.7164916560792167, 'min_child_weight': 10, 'gamma': 0.22554447458683766}. Best is trial 0 with value: 1.0.

[I 2024-07-04 22:51:04,793] Trial 2 finished with value: 1.0 and parameters: {'max_depth': 6, 'learning_rate': 0.3627615886764254, 'n_estimators': 5100, 'colsample_bytree': 0.8925879806965068, 'colsample_bylevel': 0.5998368910791798, 'colsample_bynode': 0.7571172192068059, 'reg_lambda': 0.596490423173422, 'reg_alpha': 0.05598590859279775, 'subsample': 0.8430179407605753, 'min_child_weight': 4, 'gamma': 0.1585464336867516}. Best is trial 0 with value: 1.0.

[I 2024-07-04 22:51:08,580] Trial 3 finished with value: 1.0 and parameters: {'max_depth': 10, 'learning_rate': 0.9559791495405063, 'n_estimators': 8300, 'colsample_bytree': 0.6523068845866853, 'colsample_bylevel': 0.5488360570031919, 'colsample_bynode': 0.8421165132560784, 'reg_lambda': 0.44575096880220527, 'reg_alpha': 0.13081785249633104, 'subsample': 0.798070764044508, 'min_child_weight': 2, 'gamma': 0.9183883618709039}. Best is trial 0 with value: 1.0.

[I 2024-07-04 22:51:10,308] Trial 4 finished with value: 1.0 and parameters: {'max_depth': 5, 'learning_rate': 0.6559308092820068, 'n_estimators': 3800, 'colsample_bytree': 0.7600340105889054, 'colsample_bylevel': 0.7733551396716398, 'colsample_bynode': 0.5924272277627636, 'reg_lambda': 0.9698887814869129, 'reg_alpha': 0.7773814951275034, 'subsample': 0.9757995766256756, 'min_child_weight': 14, 'gamma': 0.6381099809299766}. Best is trial 0 with value: 1.0.

[I 2024-07-04 22:51:11,558] Trial 5 finished with value: 1.0 and parameters: {'max_depth': 10, 'learning_rate': 0.08769872778119511, 'n_estimators': 2700, 'colsample_bytree': 0.522613644455269, 'colsample_bylevel': 0.6626651653816322, 'colsample_bynode': 0.6943386448447411, 'reg_lambda': 0.27863554145615693, 'reg_alpha': 0.83045013406041, 'subsample': 0.7427013306774357, 'min_child_weight': 5, 'gamma': 0.5884264748424236}. Best is trial 0 with value: 1.0.

[I 2024-07-04 22:51:12,291] Trial 6 finished with value: 1.0 and parameters: {'max_depth': 4, 'learning_rate': 0.7941947912484238, 'n_estimators': 1600, 'colsample_bytree': 0.9934434683002586, 'colsample_bylevel': 0.8861223846483287, 'colsample_bynode': 0.5993578407670862, 'reg_lambda': 0.015466895952366375, 'reg_alpha': 0.8173068141702858, 'subsample': 0.8827429375390468, 'min_child_weight': 12, 'gamma': 0.7941433120173511}. Best is trial 0 with value: 1.0.

[I 2024-07-04 22:51:13,234] Trial 7 finished with value: 1.0 and parameters: {'max_depth': 4, 'learning_rate': 0.35494522468597545, 'n_estimators': 2000, 'colsample_bytree': 0.9315517129377968, 'colsample_bylevel': 0.811649063413779, 'colsample_bynode': 0.6654490124263246, 'reg_lambda': 0.0729227667831634, 'reg_alpha': 0.3178724984985056, 'subsample': 0.7300733288106989, 'min_child_weight': 12, 'gamma': 0.6738017242196918}. Best is trial 0 with value: 1.0.

[I 2024-07-04 22:51:14,162] Trial 8 finished with value: 1.0 and parameters: {'max_depth': 10, 'learning_rate': 0.4675455544178136, 'n_estimators': 2000, 'colsample_bytree': 0.8566223936114975, 'colsample_bylevel': 0.8803925243084487, 'colsample_bynode': 0.7806385987847482, 'reg_lambda': 0.7732575081550154, 'reg_alpha': 0.4988576404007468, 'subsample': 0.8090931317527976, 'min_child_weight': 7, 'gamma': 0.12287721406968567}. Best is trial 0 with value: 1.0.

[I 2024-07-04 22:51:17,294] Trial 9 finished with value: 1.0 and parameters: {'max_depth': 4, 'learning_rate': 0.031211750911298235, 'n_estimators': 6700, 'colsample_bytree': 0.6571779905381634, 'colsample_bylevel': 0.7542853455823514, 'colsample_bynode': 0.9537832369630466, 'reg_lambda': 0.25679930685738617, 'reg_alpha': 0.41627909380527345, 'subsample': 0.9022204554172195, 'min_child_weight': 5, 'gamma': 0.1692819188459137}. Best is trial 0 with value: 1.0.

[I 2024-07-04 22:51:21,802] Trial 10 finished with value: 1.0 and parameters: {'max_depth': 7, 'learning_rate': 0.677494576859693, 'n_estimators': 9800, 'colsample_bytree': 0.7873088047582556, 'colsample_bylevel': 0.5076838686640521, 'colsample_bynode': 0.5193625999805915, 'reg_lambda': 0.2628890241775469, 'reg_alpha': 0.9763785008594151, 'subsample': 0.6387875118993118, 'min_child_weight': 15, 'gamma': 0.35268485881573364}. Best is trial 0 with value: 1.0.

[I 2024-07-04 22:51:24,461] Trial 11 finished with value: 1.0 and parameters: {'max_depth': 8, 'learning_rate': 0.9443520379760726, 'n_estimators': 5800, 'colsample_bytree': 0.5168163024120214, 'colsample_bylevel': 0.6533350465013262, 'colsample_bynode': 0.5049940036906653, 'reg_lambda': 0.564647252800192, 'reg_alpha': 0.6225250258868753, 'subsample': 0.6409964914633319, 'min_child_weight': 10, 'gamma': 0.35495166487149704}. Best is trial 0 with value: 1.0.

[I 2024-07-04 22:51:28,043] Trial 12 finished with value: 1.0 and parameters: {'max_depth': 8, 'learning_rate': 0.7816033479496383, 'n_estimators': 7800, 'colsample_bytree': 0.6859734324501636, 'colsample_bylevel': 0.6685543299982314, 'colsample_bynode': 0.6135695058254393, 'reg_lambda': 0.4410646602598562, 'reg_alpha': 0.2572442790499297, 'subsample': 0.7110618229950962, 'min_child_weight': 9, 'gamma': 0.36136601525256384}. Best is trial 0 with value: 1.0.

[I 2024-07-04 22:51:29,843] Trial 13 finished with value: 1.0 and parameters: {'max_depth': 8, 'learning_rate': 0.9849751658531088, 'n_estimators': 3900, 'colsample_bytree': 0.578760087301028, 'colsample_bylevel': 0.9826689730327278, 'colsample_bynode': 0.6826162761840294, 'reg_lambda': 0.7407173587049704, 'reg_alpha': 0.6255122381161018, 'subsample': 0.6923589423551142, 'min_child_weight': 11, 'gamma': 0.26746690877197943}. Best is trial 0 with value: 1.0.

[I 2024-07-04 22:51:34,418] Trial 14 finished with value: 1.0 and parameters: {'max_depth': 6, 'learning_rate': 0.8227435395768457, 'n_estimators': 10000, 'colsample_bytree': 0.7840223021755295, 'colsample_bylevel': 0.5879525450539488, 'colsample_bynode': 0.5715343140007164, 'reg_lambda': 0.1700697447308468, 'reg_alpha': 0.6373793033316469, 'subsample': 0.7773094334895967, 'min_child_weight': 8, 'gamma': 0.492998683292441}. Best is trial 0 with value: 1.0.

Elapsed time = 36.77 s

Best accuracy: 1.0000

Best hyperparameters: {'max_depth': 6, 'learning_rate': 0.941212091915176, 'n_estimators': 7600, 'colsample_bytree': 0.7993292420985183, 'colsample_bylevel': 0.5780093202212182, 'colsample_bynode': 0.5779972601681014, 'reg_lambda': 0.06750277604651747, 'reg_alpha': 0.8675143843171859, 'subsample': 0.8404460046972835, 'min_child_weight': 11, 'gamma': 0.1185260448662222}

Pipeline saved to 'best_pipeline.pkl'.

Pipeline loaded from 'best_pipeline.pkl'.

Test Accuracy: 0.9333


