[Programming][R] Implementing Random Forest in R

Installing the package

To use random forests, install the 'randomForest' package and load it as a library.

install.packages("randomForest")
library(randomForest)

A detailed description of the randomForest package is available in the reference manual linked below :)

https://cran.r-project.org/web/packages/randomForest/randomForest.pdf
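
The modeling code in this post uses X_train, y_train, X_test and y_test without showing how they were created. As a minimal sketch, here is one way such objects could be built. It assumes the built-in iris data and a stratified 70/30 split; the ratio is only inferred from the 35 and 15 rows per class in the outputs further down, and train_idx is a hypothetical name.

```r
# Assumption: iris stands in for the data used in this post, and the 70/30
# ratio is inferred from the printed confusion matrices, not stated anywhere.
set.seed(2021)

# Sample 70% of the row indices within each species (stratified split).
train_idx <- unlist(lapply(
  split(seq_len(nrow(iris)), iris$Species),
  function(idx) sample(idx, size = round(0.7 * length(idx)))
))

# Predictors and labels are kept in separate data frames.
X_train <- iris[train_idx, 1:4]
y_train <- iris[train_idx, "Species", drop = FALSE]
X_test  <- iris[-train_idx, 1:4]
y_test  <- iris[-train_idx, "Species", drop = FALSE]

table(y_train$Species)  # 35 rows per class
table(y_test$Species)   # 15 rows per class
```

Keeping y_train and y_test as one-column data frames is what makes expressions like y_train$Species work.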

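Modeling

First, use the ntree and mtry parameters to find the best random forest model.

ntree: the number of trees to grow, i.e. the number of decision trees the model builds.
mtry: the number of variables randomly sampled as candidates at each split. For classification the default is floor(sqrt(number of variables)); for regression it is the number of variables divided by 3.

In the code, ntree should not be too small, and beyond about 250 the results stay at a similar level, so the range was set from 50 to 200. To cover a wide range of mtry, the range was widened from the default sqrt(number of variables) minus 3 up to the default plus 3.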
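
To make the default concrete, here is a small sketch of how the default mtry depends on the number of variables. default_mtry is a hypothetical helper written for this post, not a function of the randomForest package; it only mirrors the defaults described in the package documentation.

```r
# Hypothetical helper: the default mtry for p predictor variables.
# Classification uses floor(sqrt(p)); regression uses p / 3 (at least 1).
default_mtry <- function(p, classification = TRUE) {
  if (classification) floor(sqrt(p)) else max(floor(p / 3), 1)
}

default_mtry(4)           # 2
default_mtry(10)          # 3
default_mtry(10, FALSE)   # 3
```

With four predictors, the classification default is therefore 2.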

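p <- ncol(X_train)  # number of predictor variables

param_ntree <- c(50, 100, 150, 200)
# Default mtry minus 3 up to plus 3; e.g. p = 4 gives -1:5, so some values fall outside 1..p
param_mtry <- (floor(sqrt(p)) - 3):(floor(sqrt(p)) + 3)

Next, set importance = TRUE so that the variable importance can be examined later.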

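set.seed(2021)  # fix the seed so the grid search is reproducible
for (i in param_ntree) {
  for (j in param_mtry) {
    # Fit one forest per (ntree, mtry) pair; importance = TRUE stores variable importance
    model_params <- randomForest(as.factor(y_train$Species)  ~ .
                               , data = X_train, ntree = i, mtry = j, importance = TRUE)

    # Print the pair, then the model summary (OOB error rate and confusion matrix)
    cat('ntree: ', i , '\n', 'mtry: ', j ,'\n')
    print(model_params)

  }
}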

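From the output of the for loop, find the combination with the smallest OOB estimate of error rate, and choose the parameters whose confusion matrix shows high accuracy (the cases predicted as 0 when the true class is 0 and as 1 when it is 1, i.e. the diagonal entries).

If an mtry value falls outside the valid range, the warning below is shown. The run does not stop; mtry is automatically reset to a valid value and fitting continues.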
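
Comparing many printed summaries by eye is error-prone, so here is a minimal sketch that records the OOB error of each fit and picks the smallest one. It uses the built-in iris data directly as a stand-in for X_train and y_train, restricts mtry to the valid values 1 to 4, and introduces the names grid, fit and best; all of these are assumptions for illustration, not the code that produced the results in this post.

```r
library(randomForest)

# Assumption: iris stands in for the training data used in this post.
set.seed(2021)

# One row per (ntree, mtry) combination.
grid <- expand.grid(ntree = c(50, 100, 150, 200), mtry = 1:4)
grid$oob_error <- NA_real_

for (k in seq_len(nrow(grid))) {
  fit <- randomForest(Species ~ ., data = iris,
                      ntree = grid$ntree[k], mtry = grid$mtry[k])
  # err.rate has one row per tree; the last row is the OOB error of the full forest.
  grid$oob_error[k] <- fit$err.rate[grid$ntree[k], "OOB"]
}

# Combination with the smallest OOB error (the first one if there are ties).
best <- grid[which.min(grid$oob_error), ]
print(best)
```

On a data set this small, several combinations often tie or differ by a single sample, so the winner can change with the seed.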

In randomForest.default(m, y,...) : invalid mtry: reset to within valid range

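In this run, ntree: 100 with mtry: -1 gave the smallest error rate. Since -1 is not a valid value, it was automatically reset to 1.

The model with the smallest error rate is therefore selected and fitted.

model_rf <- randomForest(as.factor(y_train$Species)  ~ .
                      , data = X_train, ntree = 100, mtry = 1, importance = TRUE)
print(model_rf)  # show the fitted model summary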
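
Because out-of-range values are reset, several entries of the grid end up fitting the same mtry. The sketch below clamps the grid into 1..p ahead of time so that only distinct valid values remain. It assumes p = 4 and that the reset simply moves a value to the nearest valid one, which matches -1 becoming 1 here but is otherwise my assumption; raw_mtry and clamped are hypothetical names.

```r
p <- 4  # assumption: four predictor variables

# The grid as built in this post: default mtry minus 3 up to plus 3.
raw_mtry <- (floor(sqrt(p)) - 3):(floor(sqrt(p)) + 3)  # -1 0 1 2 3 4 5

# Move each value into the valid range 1..p.
clamped <- pmax(1, pmin(p, raw_mtry))                  # 1 1 1 2 3 4 4

# Drop the duplicates so each valid mtry is fitted only once.
param_mtry <- unique(clamped)                          # 1 2 3 4
```

Using the clamped grid avoids the warning and removes the duplicate fits for mtry = 1 and mtry = 4.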

Call:
 randomForest(formula = as.factor(y_train$Species) ~ ., data = X_train, ntree = 100, mtry = 1, importance = TRUE)
               Type of random forest: classification
                     Number of trees: 100
No. of variables tried at each split: 1

        OOB estimate of  error rate: 5.71%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         35          0         0  0.00000000
versicolor      0         32         3  0.08571429
virginica       0          3        32  0.08571429

Prediction and conclusion

Use the predict function to predict the classes of X_test.

prediction <- predict(model_rf, X_test)

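To evaluate the predictions we will use confusionMatrix, which comes from the caret library. Install caret first if it is not installed yet, then load it.

library(caret)  # run install.packages("caret") once if the package is missing

# confusionMatrix expects factors with the same levels for both arguments
confusionMatrix(prediction, as.factor(y_test$Species))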

Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         15          0         0
  versicolor      0         15         2
  virginica       0          0        13

Overall Statistics

               Accuracy : 0.9556
                 95% CI : (0.8485, 0.9946)
    No Information Rate : 0.3333
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.9333

 Mcnemar's Test P-Value : NA

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            1.0000           0.8667
Specificity                 1.0000            0.9333           1.0000
Pos Pred Value              1.0000            0.8824           1.0000
Neg Pred Value              1.0000            1.0000           0.9375
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.3333           0.2889
Detection Prevalence        0.3333            0.3778           0.2889
Balanced Accuracy           1.0000            0.9667           0.9333

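The accuracy is about 96% (43 of the 45 test samples are classified correctly), so the model predicts this test set very well.

To see the importance of each variable, use varImpPlot.

(type = 1 : mean decrease in accuracy, type = 2 : mean decrease in node impurity)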
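
To show where these numbers come from, here is a short sketch that recomputes the accuracy and the per-class sensitivity by hand from the confusion matrix reported above. The matrix entries are copied from that output; the variable names cm, accuracy and sensitivity are just for illustration.

```r
# Confusion matrix copied from the output above (rows: prediction, columns: reference).
classes <- c("setosa", "versicolor", "virginica")
cm <- matrix(c(15,  0,  0,
                0, 15,  2,
                0,  0, 13),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

# Accuracy: correctly classified samples (the diagonal) over all samples.
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity per class: correct predictions over the true count of that class.
sensitivity <- diag(cm) / colSums(cm)

round(accuracy, 4)     # 0.9556
round(sensitivity, 4)  # 1.0000 1.0000 0.8667
```

The only two mistakes are virginica samples predicted as versicolor, which is why virginica has the lowest sensitivity.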

varImpPlot(model_rf, type = 2, col = 1, cex = 1)
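
If you want the numbers behind the plot, the importance function of the randomForest package returns them as a matrix. The sketch below is an illustration only: it fits a forest on the built-in iris data as a stand-in for model_rf, and the names fit, gini and gini_sorted are assumptions, so the values will not match the model in this post.

```r
library(randomForest)

# Assumption: iris stands in for the training data used in this post.
set.seed(2021)
fit <- randomForest(Species ~ ., data = iris,
                    ntree = 100, mtry = 1, importance = TRUE)

# type = 2: mean decrease in node impurity (Gini), one row per variable.
gini <- importance(fit, type = 2)

# Sort from most to least important; drop = FALSE keeps the matrix shape.
gini_sorted <- gini[order(gini[, 1], decreasing = TRUE), , drop = FALSE]
print(gini_sorted)
```

Passing type = 1 instead gives the mean decrease in accuracy, which is only available when the model was fitted with importance = TRUE.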
