I have a dataset where I'd like to perform anomaly detection with an Isolation Forest. I don't have any way to validate the model (my data is not labeled – that's why I'm using unsupervised learning) – how can I tell if the model is working all right? I could do a train-test split, but again, how do I know if the predictions are correct if I'm using unlabelled data (plus I'd like to have as many words in my tf-idf vectoriser as possible, but that's another question)? There isn't any data about the amount of contamination either. How should I fine-tune the parameters/validate the results?


#### Best Answer

For Isolation Forest, here is one approach to validation.

From the paper and the scikit-learn implementation, we know there are two key parameters: n_estimators and max_samples.

- At n_estimators = 100, the average path length (from which the anomaly score is derived) has converged.
- At max_samples = 256 (the default), different datasets converge to a similar AUC, which means there is no need to spend time on larger subsample sizes.
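As a minimal sketch of fitting an Isolation Forest with those two parameters (the toy data below is purely illustrative, not from the paper):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest

# Toy data: one dense cluster plus a few injected scattered points.
X, _ = make_blobs(n_samples=500, centers=[[0, 0]], cluster_std=1.0,
                  random_state=0)
X = np.vstack([X, np.random.RandomState(0).uniform(-8, 8, size=(10, 2))])

# The two key parameters: n_estimators and max_samples.
model = IsolationForest(n_estimators=100, max_samples=256, random_state=0)
model.fit(X)

scores = model.score_samples(X)  # lower score = more anomalous
labels = model.predict(X)        # -1 for outliers, +1 for inliers
```

With `score_samples` you get a continuous anomaly score per point, which is what you track for convergence rather than the hard -1/+1 labels.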

In the paper, the authors test different values of n_estimators and max_samples to make sure the results have converged.

So, in your case, you can use a grid search to check which combination of n_estimators and max_samples reaches convergence quickly.
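One way to make "reaches convergence" concrete is to sweep n_estimators and measure how much the anomaly scores change between successive settings; when the change flattens out, adding trees no longer matters. A sketch of that check (the grid values and toy data are my own assumptions, not from the answer):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest

# Illustrative data; substitute your own feature matrix here.
X, _ = make_blobs(n_samples=500, centers=[[0, 0]], random_state=0)

grid = [10, 50, 100, 200, 400]  # hypothetical sweep over n_estimators
deltas = []
prev = None
for n in grid:
    scores = IsolationForest(n_estimators=n, max_samples=256,
                             random_state=0).fit(X).score_samples(X)
    if prev is not None:
        # Mean absolute change in scores vs. the previous, smaller forest.
        deltas.append(np.mean(np.abs(scores - prev)))
        print(f"n_estimators={n}: mean |score change| = {deltas[-1]:.4f}")
    prev = scores
```

The same loop can be nested over a few max_samples values to search both parameters; pick the smallest setting after which the score change is negligible.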

In my case, I used the Titanic dataset. After scaling the raw data, there were about 120 features. I ran a grid search and found that at n_estimators = 200 and max_samples = 256 the outlier predictions begin to converge.

This is my way of validating unsupervised outlier detection.

Hope it helps.
