If you build XGBoost and standardization into a single pipeline and train it, the standardization is automatically applied again when you later run inference through that pipeline.
Besides StandardScaler, which this post uses, also try RobustScaler, MinMaxScaler, Normalizer, QuantileTransformer, and PowerTransformer. Which scaler works best differs from dataset to dataset.
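A minimal sketch for trying the scalers side by side, assuming the X_train / y_train / X_validation / y_validation split created in the example code further down and an XGBClassifier with default settings (these choices are mine, for illustration only):

# Sketch: compare several scalers in the same pipeline shape as the main example.
from sklearn.preprocessing import (StandardScaler, RobustScaler, MinMaxScaler,
                                   Normalizer, QuantileTransformer, PowerTransformer)
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

scalers = {
    'StandardScaler': StandardScaler(),
    'RobustScaler': RobustScaler(),
    'MinMaxScaler': MinMaxScaler(),
    'Normalizer': Normalizer(),
    'QuantileTransformer': QuantileTransformer(n_quantiles=100),
    'PowerTransformer': PowerTransformer(),
}

for name, scaler in scalers.items():
    # Only the scaler step changes; the model settings stay fixed.
    pipe = Pipeline([('scaler', scaler), ('model', XGBClassifier(random_state=42))])
    pipe.fit(X_train, y_train)
    acc = accuracy_score(y_validation, pipe.predict(X_validation))
    print(f'{name}: validation accuracy = {acc:.4f}')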
In my tests I found that Optuna can return the same best hyperparameters with and without a scaler, or with different scalers, yet the model's predictions at inference time still differ. Keep in mind, though, that depending on the data even this difference can be negligible.
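One quick way to check this on your own data is to train the same settings with and without the scaler and compare the predicted probabilities. A minimal sketch, assuming the variables from the example code below ('passthrough' simply disables the scaler step in a Pipeline):

# Sketch: identical XGBoost settings, with and without standardization.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

with_scaler = Pipeline([('scaler', StandardScaler()),
                        ('model', XGBClassifier(random_state=42))])
without_scaler = Pipeline([('scaler', 'passthrough'),   # scaler step disabled
                           ('model', XGBClassifier(random_state=42))])

with_scaler.fit(X_train, y_train)
without_scaler.fit(X_train, y_train)

# Even with the same hyperparameters, scaled inputs can shift the outputs slightly.
diff = np.abs(with_scaler.predict_proba(X_test) - without_scaler.predict_proba(X_test))
print('max probability difference:', diff.max())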
2024. 7. 4 First written
2024. 7. 5 Checking whether the standardization is actually being applied.
2024. 7. 8 Confirmed through several tests that standardization is applied, and added the scaler-related notes.
The example code in this post uses XGBClassifier.
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
from sklearn.pipeline import Pipeline
import optuna
import joblib
import time

# Load the iris dataset.
iris = load_iris()

# Split the dataset into features and labels.
X, y = iris.data, iris.target
# print(X.shape)  # (150, 4)
# print(y.shape)  # (150,)

# Split the dataset into train and test at an 8:2 ratio,
# then split the held-out 20% in half into validation and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_validation, X_test, y_validation, y_test = train_test_split(X_test, y_test, test_size=0.5, random_state=42)

RANDOM_SEED = 42

def objective(trial):
    # XGBoost hyperparameters to optimize.
    params = {
        "objective": "multi:softprob",
        "eval_metric": 'mlogloss',
        "booster": 'gbtree',
        "tree_method": 'hist',
        "max_depth": trial.suggest_int("max_depth", 4, 10),
        "learning_rate": trial.suggest_float('learning_rate', 0.0001, 0.99),
        'n_estimators': trial.suggest_int("n_estimators", 1000, 10000, step=100),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
        "colsample_bylevel": trial.suggest_float("colsample_bylevel", 0.5, 1.0),
        "colsample_bynode": trial.suggest_float("colsample_bynode", 0.5, 1.0),
        "reg_lambda": trial.suggest_float("reg_lambda", 1e-2, 1.0),
        "reg_alpha": trial.suggest_float("reg_alpha", 1e-2, 1.0),
        'subsample': trial.suggest_float('subsample', 0.6, 1.0),
        'min_child_weight': trial.suggest_int('min_child_weight', 2, 15),
        "gamma": trial.suggest_float("gamma", 0.1, 1.0),
        "random_state": RANDOM_SEED,
    }

    # Build the pipeline: standardization followed by the model.
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('model', XGBClassifier(**params))
    ])

    pipeline.fit(X_train, y_train)

    y_pred = pipeline.predict(X_validation)
    accuracy = accuracy_score(y_validation, y_pred)

    return accuracy

# Run hyperparameter optimization with Optuna.
start_time = time.time()

study = optuna.create_study(direction="maximize", sampler=optuna.samplers.TPESampler(seed=RANDOM_SEED))
study.optimize(objective, n_trials=15)

end_time = time.time()
print(f'Execution time = {end_time - start_time:.2f} seconds')

print(f'Best accuracy: {study.best_value:.4f}')
print('Best hyperparameters:', study.best_params)

# Train the final pipeline with the best hyperparameters found by Optuna.
best_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', XGBClassifier(random_state=RANDOM_SEED, **study.best_params))
])
best_pipeline.fit(X_train, y_train)

# Save the pipeline.
joblib.dump(best_pipeline, 'best_pipeline.pkl')
print("Pipeline saved as 'best_pipeline.pkl'.")

# Load the saved pipeline.
loaded_pipeline = joblib.load('best_pipeline.pkl')
print("Pipeline loaded from 'best_pipeline.pkl'.")

# Run inference on the test data.
y_pred = loaded_pipeline.predict(X_test)

# Compute the accuracy.
accuracy = accuracy_score(y_test, y_pred)
print(f'Test Accuracy: {accuracy:.4f}')
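To confirm that standardization really is applied inside the saved pipeline at inference time, a minimal sketch, assuming the code above has already been run, is to inspect the fitted scaler step and reproduce a prediction manually:

# Sketch: verify the loaded pipeline standardizes before predicting.
scaler = loaded_pipeline.named_steps['scaler']
model = loaded_pipeline.named_steps['model']

# The fitted StandardScaler keeps the training mean/std it applies at inference.
print('feature means used by the scaler:', scaler.mean_)
print('feature stds used by the scaler:', scaler.scale_)

# Manually scaling and predicting should match the pipeline's own predictions.
manual_pred = model.predict(scaler.transform(X_test))
print('manual == pipeline predictions:', (manual_pred == loaded_pipeline.predict(X_test)).all())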
Here is the execution result.
[I 2024-07-04 22:50:57,644] A new study created in memory with name: no-name-4e47d203-18b8-4d88-95c8-3a6d3f8a00a3
[I 2024-07-04 22:51:01,126] Trial 0 finished with value: 1.0 and parameters: {'max_depth': 6, 'learning_rate': 0.941212091915176, 'n_estimators': 7600, 'colsample_bytree': 0.7993292420985183, 'colsample_bylevel': 0.5780093202212182, 'colsample_bynode': 0.5779972601681014, 'reg_lambda': 0.06750277604651747, 'reg_alpha': 0.8675143843171859, 'subsample': 0.8404460046972835, 'min_child_weight': 11, 'gamma': 0.1185260448662222}. Best is trial 0 with value: 1.0.
[I 2024-07-04 22:51:02,453] Trial 1 finished with value: 1.0 and parameters: {'max_depth': 10, 'learning_rate': 0.8241349701283375, 'n_estimators': 2900, 'colsample_bytree': 0.5909124836035503, 'colsample_bylevel': 0.5917022549267169, 'colsample_bynode': 0.6521211214797689, 'reg_lambda': 0.5295088673159155, 'reg_alpha': 0.4376255684556946, 'subsample': 0.7164916560792167, 'min_child_weight': 10, 'gamma': 0.22554447458683766}. Best is trial 0 with value: 1.0.
[I 2024-07-04 22:51:04,793] Trial 2 finished with value: 1.0 and parameters: {'max_depth': 6, 'learning_rate': 0.3627615886764254, 'n_estimators': 5100, 'colsample_bytree': 0.8925879806965068, 'colsample_bylevel': 0.5998368910791798, 'colsample_bynode': 0.7571172192068059, 'reg_lambda': 0.596490423173422, 'reg_alpha': 0.05598590859279775, 'subsample': 0.8430179407605753, 'min_child_weight': 4, 'gamma': 0.1585464336867516}. Best is trial 0 with value: 1.0.
[I 2024-07-04 22:51:08,580] Trial 3 finished with value: 1.0 and parameters: {'max_depth': 10, 'learning_rate': 0.9559791495405063, 'n_estimators': 8300, 'colsample_bytree': 0.6523068845866853, 'colsample_bylevel': 0.5488360570031919, 'colsample_bynode': 0.8421165132560784, 'reg_lambda': 0.44575096880220527, 'reg_alpha': 0.13081785249633104, 'subsample': 0.798070764044508, 'min_child_weight': 2, 'gamma': 0.9183883618709039}. Best is trial 0 with value: 1.0.
[I 2024-07-04 22:51:10,308] Trial 4 finished with value: 1.0 and parameters: {'max_depth': 5, 'learning_rate': 0.6559308092820068, 'n_estimators': 3800, 'colsample_bytree': 0.7600340105889054, 'colsample_bylevel': 0.7733551396716398, 'colsample_bynode': 0.5924272277627636, 'reg_lambda': 0.9698887814869129, 'reg_alpha': 0.7773814951275034, 'subsample': 0.9757995766256756, 'min_child_weight': 14, 'gamma': 0.6381099809299766}. Best is trial 0 with value: 1.0.
[I 2024-07-04 22:51:11,558] Trial 5 finished with value: 1.0 and parameters: {'max_depth': 10, 'learning_rate': 0.08769872778119511, 'n_estimators': 2700, 'colsample_bytree': 0.522613644455269, 'colsample_bylevel': 0.6626651653816322, 'colsample_bynode': 0.6943386448447411, 'reg_lambda': 0.27863554145615693, 'reg_alpha': 0.83045013406041, 'subsample': 0.7427013306774357, 'min_child_weight': 5, 'gamma': 0.5884264748424236}. Best is trial 0 with value: 1.0.
[I 2024-07-04 22:51:12,291] Trial 6 finished with value: 1.0 and parameters: {'max_depth': 4, 'learning_rate': 0.7941947912484238, 'n_estimators': 1600, 'colsample_bytree': 0.9934434683002586, 'colsample_bylevel': 0.8861223846483287, 'colsample_bynode': 0.5993578407670862, 'reg_lambda': 0.015466895952366375, 'reg_alpha': 0.8173068141702858, 'subsample': 0.8827429375390468, 'min_child_weight': 12, 'gamma': 0.7941433120173511}. Best is trial 0 with value: 1.0.
[I 2024-07-04 22:51:13,234] Trial 7 finished with value: 1.0 and parameters: {'max_depth': 4, 'learning_rate': 0.35494522468597545, 'n_estimators': 2000, 'colsample_bytree': 0.9315517129377968, 'colsample_bylevel': 0.811649063413779, 'colsample_bynode': 0.6654490124263246, 'reg_lambda': 0.0729227667831634, 'reg_alpha': 0.3178724984985056, 'subsample': 0.7300733288106989, 'min_child_weight': 12, 'gamma': 0.6738017242196918}. Best is trial 0 with value: 1.0.
[I 2024-07-04 22:51:14,162] Trial 8 finished with value: 1.0 and parameters: {'max_depth': 10, 'learning_rate': 0.4675455544178136, 'n_estimators': 2000, 'colsample_bytree': 0.8566223936114975, 'colsample_bylevel': 0.8803925243084487, 'colsample_bynode': 0.7806385987847482, 'reg_lambda': 0.7732575081550154, 'reg_alpha': 0.4988576404007468, 'subsample': 0.8090931317527976, 'min_child_weight': 7, 'gamma': 0.12287721406968567}. Best is trial 0 with value: 1.0.
[I 2024-07-04 22:51:17,294] Trial 9 finished with value: 1.0 and parameters: {'max_depth': 4, 'learning_rate': 0.031211750911298235, 'n_estimators': 6700, 'colsample_bytree': 0.6571779905381634, 'colsample_bylevel': 0.7542853455823514, 'colsample_bynode': 0.9537832369630466, 'reg_lambda': 0.25679930685738617, 'reg_alpha': 0.41627909380527345, 'subsample': 0.9022204554172195, 'min_child_weight': 5, 'gamma': 0.1692819188459137}. Best is trial 0 with value: 1.0.
[I 2024-07-04 22:51:21,802] Trial 10 finished with value: 1.0 and parameters: {'max_depth': 7, 'learning_rate': 0.677494576859693, 'n_estimators': 9800, 'colsample_bytree': 0.7873088047582556, 'colsample_bylevel': 0.5076838686640521, 'colsample_bynode': 0.5193625999805915, 'reg_lambda': 0.2628890241775469, 'reg_alpha': 0.9763785008594151, 'subsample': 0.6387875118993118, 'min_child_weight': 15, 'gamma': 0.35268485881573364}. Best is trial 0 with value: 1.0.
[I 2024-07-04 22:51:24,461] Trial 11 finished with value: 1.0 and parameters: {'max_depth': 8, 'learning_rate': 0.9443520379760726, 'n_estimators': 5800, 'colsample_bytree': 0.5168163024120214, 'colsample_bylevel': 0.6533350465013262, 'colsample_bynode': 0.5049940036906653, 'reg_lambda': 0.564647252800192, 'reg_alpha': 0.6225250258868753, 'subsample': 0.6409964914633319, 'min_child_weight': 10, 'gamma': 0.35495166487149704}. Best is trial 0 with value: 1.0.
[I 2024-07-04 22:51:28,043] Trial 12 finished with value: 1.0 and parameters: {'max_depth': 8, 'learning_rate': 0.7816033479496383, 'n_estimators': 7800, 'colsample_bytree': 0.6859734324501636, 'colsample_bylevel': 0.6685543299982314, 'colsample_bynode': 0.6135695058254393, 'reg_lambda': 0.4410646602598562, 'reg_alpha': 0.2572442790499297, 'subsample': 0.7110618229950962, 'min_child_weight': 9, 'gamma': 0.36136601525256384}. Best is trial 0 with value: 1.0.
[I 2024-07-04 22:51:29,843] Trial 13 finished with value: 1.0 and parameters: {'max_depth': 8, 'learning_rate': 0.9849751658531088, 'n_estimators': 3900, 'colsample_bytree': 0.578760087301028, 'colsample_bylevel': 0.9826689730327278, 'colsample_bynode': 0.6826162761840294, 'reg_lambda': 0.7407173587049704, 'reg_alpha': 0.6255122381161018, 'subsample': 0.6923589423551142, 'min_child_weight': 11, 'gamma': 0.26746690877197943}. Best is trial 0 with value: 1.0.
[I 2024-07-04 22:51:34,418] Trial 14 finished with value: 1.0 and parameters: {'max_depth': 6, 'learning_rate': 0.8227435395768457, 'n_estimators': 10000, 'colsample_bytree': 0.7840223021755295, 'colsample_bylevel': 0.5879525450539488, 'colsample_bynode': 0.5715343140007164, 'reg_lambda': 0.1700697447308468, 'reg_alpha': 0.6373793033316469, 'subsample': 0.7773094334895967, 'min_child_weight': 8, 'gamma': 0.492998683292441}. Best is trial 0 with value: 1.0.
Execution time = 36.77 seconds
Best accuracy: 1.0000
Best hyperparameters: {'max_depth': 6, 'learning_rate': 0.941212091915176, 'n_estimators': 7600, 'colsample_bytree': 0.7993292420985183, 'colsample_bylevel': 0.5780093202212182, 'colsample_bynode': 0.5779972601681014, 'reg_lambda': 0.06750277604651747, 'reg_alpha': 0.8675143843171859, 'subsample': 0.8404460046972835, 'min_child_weight': 11, 'gamma': 0.1185260448662222}
Pipeline saved as 'best_pipeline.pkl'.
Pipeline loaded from 'best_pipeline.pkl'.
Test Accuracy: 0.9333