Predictive Equipment Failures

Manish M Dalvi
17 min read · Feb 24, 2021

Contents

  1. Business Problem
  2. ML Problem Formulation
  3. Business Constraints
  4. Overview of Dataset
  5. Performance Metric
  6. Existing Solutions
  7. My First Cut Approach
  8. Exploratory Data Analysis
  9. Data Preprocessing
  10. Feature Engineering
  11. Data Preparation for Model
  12. Base Models
  13. Machine Learning Models
  14. Summary
  15. Deployment
  16. Future Work
  17. Code Repository
  18. References

Business Problem


The case study “Predictive Equipment Failures” deals with predicting the failure of a piece of equipment before it actually fails. While working at an MNC, I was tasked with building IoT solutions for the company’s underlying infrastructure. The entire IoT system was built with sensors connected over the Internet to the cloud, to monitor readings and control devices such as HVAC units and diesel generators. The important point is that when a device like an AC or a diesel generator fails, the cost of an emergency repair at that instant is many times the cost of the same repair done as planned maintenance. Such costs cut into the company’s revenue, even though the company had the data beforehand; it simply was not able to convert that data into useful information.

This case study deals with exactly such a scenario, posed by the company “ConocoPhillips” in West Texas. The company operates several oil wells classified as stripper wells. Stripper wells produce low volumes of oil at the individual well level, but across the US they account for a significant share of domestic oil production. Since the wells have low throughput, a failure of the oil-extraction equipment hits the profit margin severely. The company has therefore attached 107 sensors to the surface and down-hole equipment, and has provided a dataset of the sensor values and the equipment downtime. Using this dataset, we need to predict whether the equipment will fail in the near future. This would help the company dispatch a maintenance team to fix the equipment at the surface, or even send a workover rig to the well to pull the down-hole equipment and resolve the issue.

ML Problem Formulation

The task is to predict the failure of a piece of equipment in advance, which makes this a binary classification problem where “1” represents a down-hole equipment failure and “0” represents no failure.

Business Constraints

The main business constraints for this problem would be as follows.

  1. We should not miss any failures, since missed failures lead to high costs.
  2. There is no strict latency constraint, but the model should return a prediction within a few seconds to a minute.
  3. Interpretability is not essential, but if available it would help identify the exact part of the equipment that might fail in the future.

Overview of Dataset

The dataset provided by “ConocoPhillips” consists of data from 107 sensors on the surface-level and down-hole equipment. It contains two types of sensor columns:
a. Measure columns — a single measurement per sensor
b. Histogram bin columns — a set of 10 columns per sensor, whose bins show the sensor’s distribution over time

Thus, in total there are 100 measure columns and 7 histogram sensors with 10 bins each (70 columns), resulting in 170 sensor columns. The target is “1” for equipment failure and “0” for no failure, with the vast majority of targets being “0” and very few being “1”, making the dataset highly imbalanced. There are 60,000 rows in total.

The dataset was obtained from the following Kaggle Competition page https://www.kaggle.com/c/equipfails/

Performance Metric

Since the dataset is highly imbalanced, the appropriate performance metrics are precision, recall and F1 score, especially because we must not miss failure cases. Metrics like accuracy and ROC-AUC would report deceptively high values due to the imbalance and would not give a clear picture of performance on the minority class.

Among the F1 variants, we choose the macro-averaged F1 score, since we want to treat both classes as equals. Micro-averaging is dominated by whichever class has the most instances, so on an imbalanced dataset it yields a high score that says little about the minority class, which is not what this problem calls for.
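A toy illustration (numbers invented for the example, not taken from this dataset) of why the macro average is the safer choice:

from sklearn.metrics import f1_score

# A classifier that predicts "no failure" for everything
# on a 98:2 imbalanced sample
y_true = [0] * 98 + [1] * 2
y_pred = [0] * 100

print(f1_score(y_true, y_pred, average='micro'))  # 0.98 -- looks great
print(f1_score(y_true, y_pred, average='macro'))  # ~0.49 -- exposes the miss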


Existing Solutions

A few solutions are available for this dataset. The solution posted on the challenge’s Kaggle page reports an F1 score of 0.99267 with a Random Forest model.

Since the challenge is closed, we will not be able to submit our results to it for evaluation.

My First Cut Approach

With a few solutions already available online, my first-cut approach is to try methods that have not been attempted yet: XGBoost and Random Forest classifiers tuned with Random Search CV, followed by Voting and Stacking classifiers. Since the dataset is imbalanced, it is also worth trying a custom ensemble of roughly 30–50 decision trees with a logistic regression as the meta-classifier to predict the final output.

Exploratory Data Analysis

Target Variable Distribution: This is a binary classification problem. Observing the target data, we see that the dataset is highly imbalanced, with 59,000 non-failure points and 1,000 down-hole failure points.

import matplotlib.pyplot as plt

# Count the occurrences of each target class and plot them as a bar chart
targets = data_df['target'].value_counts()
figure = plt.figure()
axes = figure.add_axes([0, 0, 1, 1])
target_values = ['Non Failure Cases', 'Failure Cases']
counts = [targets[0], targets[1]]
print("No of Non Failure Cases", targets[0])
print("No of Failure Cases", targets[1])
axes.bar(target_values, counts)
plt.show()
Total Count of Target Values

Non-Temporal Sensor Distributions: The PDF plots of the various sensors show heavy overlap between the two classes for most distributions. Only a few sensors have distributions that can differentiate between the two target classes.

Distribution of Non Temporal Sensor values

The sensors whose distributions are visually separable are important, since they contribute more to predicting the final outcome. Among the most important are sensor1_measure, sensor8_measure, sensor14_measure, sensor15_measure, sensor16_measure and sensor17_measure.

We can understand the distributions further by drawing box plots of the sensors whose 10th percentile for one class is greater than the 90th percentile for the other class; a sketch of this check follows the plot below.

Box Plot of Sensors which have distributions that are separable
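A minimal sketch of that percentile check, assuming sensor_columns holds the measure-column names and the “na” strings have already been converted to NaN:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

separable_sensors = []
for column in sensor_columns:
    failure = data_df.loc[data_df['target'] == 1, column].dropna()
    non_failure = data_df.loc[data_df['target'] == 0, column].dropna()
    # Separable if the 10th percentile of one class exceeds
    # the 90th percentile of the other
    if (np.percentile(failure, 10) > np.percentile(non_failure, 90) or
            np.percentile(non_failure, 10) > np.percentile(failure, 90)):
        separable_sensors.append(column)

# Box plots of the separable sensors, split by target class
for column in separable_sensors:
    sns.boxplot(data=data_df, x='target', y=column)
    plt.show()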

Null Values in each Feature: Visually inspecting the provided dataset shows that a few features contain a significant number of null values. It is worth looking at the distribution of null values across all features.

Null value count from few sensors
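A quick sketch of that check:

# Percentage of nulls per feature, largest first
null_percentage = data_df.isnull().mean().sort_values(ascending=False) * 100
print(null_percentage.head(20))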

Zero Values in each Feature: Similar to null values, it is worth noting the features where zeros make up more than 95% of the values. This helps in dropping features whose contribution to the dataset is insignificant.

# Collect columns where zeros make up at least 95% of the 60,000 rows
zero_columns = []
for column in data_df.columns:
    if column != 'target' and column != 'id':
        value = data_df[column].value_counts()
        if 0 in value:
            if value[0] >= (0.95 * 60000):
                zero_columns.append(column)
                print(column, '---> % of zeros', (value[0] * 100 / 60000))
Columns with zeros more than 95%

Temporal Sensor Distributions: The distribution plots of the temporal (histogram bin) sensors show a significant amount of overlap between the two classes. Some distributions concentrate around zero, reflecting the high number of zeros in those feature columns.

Histogram bins probability distribution plots

Correlation: Correlated features affect linear models, since the features are interdependent. If two features are highly correlated, one of them is usually sufficient and the other often need not be used. In this case study, correlation was checked among the important sensors that had highly separable distributions.

Important Sensors and their correlation

Another correlation is found between the temporal histogram bin data. Subsequent bins of the same sensor have higher correlation since they are temporally dependent.

# Heatmap of pairwise correlations between the histogram-bin columns
bin_correlation = histogram_data.corr()
plt.figure(figsize=(25, 25))
sns.heatmap(bin_correlation)
plt.title("Bin Correlation of time data")
plt.show()
Correlation of Histogram Bins

TSNE Plot: The TSNE plot of the dataset shows that the two classes form groups but overlap in a particular region, so most of the data is visually separable while a few points remain hard to distinguish. There are also a few outliers, visible as blue dots scattered around the plot.

The following TSNE was obtained with a perplexity of 20 and 1000 iterations.

from sklearn.manifold import TSNE

perplexity = [5, 10, 20, 40]
for ppx in perplexity:
    # Project the important features to 2D for each perplexity value
    tsne = TSNE(n_components=2, verbose=0, perplexity=ppx, n_iter=1000)
    tsne_data = tsne.fit_transform(important_df)
    df = pd.DataFrame({'x': tsne_data[:, 0], 'y': tsne_data[:, 1], 'label': y_true})
    sns.lmplot(data=df, x='x', y='y', hue='label', fit_reg=False,
               height=8, palette="Set1", markers=['s', 'o'])
    plt.title("Perplexity:{}".format(ppx))
    plt.show()
TSNE plot with Perplexity 20 and 1000 iterations

Data Preprocessing

Several columns in the dataset contain the string “na” instead of NaN, which would cause problems later, so these are replaced with NumPy NaN values.

We have seen that many columns contain a high number of null and zero values. Features with more than 50% null values or more than 95% zero values need to be dropped. Before dropping them, these features were checked for correlation with the target variable; no correlation was found, so they were dropped. A sketch of this step follows.
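A minimal sketch of the “na” replacement and the two drop thresholds described above:

import numpy as np

data_df = data_df.replace('na', np.nan)

# Fraction of nulls and zeros per column
null_fraction = data_df.isnull().mean()
zero_fraction = (data_df == 0).mean()

drop_columns = set(null_fraction[null_fraction > 0.5].index) | \
               set(zero_fraction[zero_fraction > 0.95].index)
data_df = data_df.drop(columns=list(drop_columns))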

The important non-temporal sensors whose distributions were highly separable were checked for correlation. Many sensors turned out to be correlated; those with more than 99% correlation were dropped.

# Drop one sensor from each highly correlated (>99%) pair
high_correlated_columns = ['sensor45_measure', 'sensor32_measure',
                           'sensor46_measure', 'sensor47_measure',
                           'sensor67_measure']
for column in high_correlated_columns:
    if column in X_train.columns:
        X_train = X_train.drop([column], axis=1)
        X_cv = X_cv.drop([column], axis=1)
        X_test = X_test.drop([column], axis=1)

Similarly, the histogram bins were checked for correlation.

We see high correlation between bins of the same sensor. These need not be removed, since the bins are linked in time and are therefore bound to have somewhat higher correlation. However, bin 5 of sensor 64 had a high correlation with bin 6 of sensor 7, so it was dropped.

These features can be dropped before the train-test split, since entire columns are removed rather than modified, so no information leaks from test to train.

The next step is to split the data into train, CV and test sets before applying feature engineering. Note that stratification is used so that each split preserves the class distribution.

from sklearn.model_selection import train_test_split

# Split into train/test, then carve a CV set out of the train portion;
# stratify keeps the class ratio identical in every split
X_train, X_test, y_train, y_test = train_test_split(
    data_df, y_true, stratify=y_true, test_size=0.2)
X_train, X_cv, y_train, y_cv = train_test_split(
    X_train, y_train, stratify=y_train, test_size=0.2)

Feature Engineering

Before feature engineering, we see there are several rows with NaN values that need to be replaced. Two imputation methods are used in this case study:

  1. Median-based imputation: The median of each column in the train dataset is used to replace the NaN values in that column.

# Medians computed on the train split only, then applied to all splits
median_values = X_train.median()
X_train_median = X_train.fillna(median_values)
X_cv_median = X_cv.fillna(median_values)
X_test_median = X_test.fillna(median_values)

2. KNN-based imputation: A KNN imputer is fitted on the train dataset and then used to replace the NaN values in the train, CV and test parts of the dataset.

from sklearn.impute import KNNImputer

# Fit on train only, then transform CV and test to avoid leakage
knn_imputer = KNNImputer(weights="distance")
X_train_knn_imputation = knn_imputer.fit_transform(X_train)
X_cv_knn_imputation = knn_imputer.transform(X_cv)
X_test_knn_imputation = knn_imputer.transform(X_test)

Coming to feature engineering, several features are added to the dataset, as follows:

  1. Truncated SVD: Using Truncated SVD, we reduce the dimensionality of the dataset from over 170 dimensions to 4. These 4 components correspond to the top 4 singular vectors, which capture the directions of highest variance and thus carry the most information for decisions.

from sklearn.decomposition import TruncatedSVD

T_SVD = TruncatedSVD(n_components=4, n_iter=20)
X_train_median_tSVD = T_SVD.fit_transform(X_train_median)
X_cv_median_tSVD = T_SVD.transform(X_cv_median)
# transform (not fit_transform) so test reuses the components learnt on train
X_test_median_tSVD = T_SVD.transform(X_test_median)

2. Bin Average: The average across the bins of each temporal sensor is added as a new feature. Since the bin values have high variance, it is worth extracting their average and observing its contribution to the final result.

def add_average_bin_feature(df, column_names):
    # column_names maps each sensor name to the list of its bin columns
    for sensor in column_names.keys():
        bins = column_names[sensor]
        temp_df = df[bins]
        df[sensor + '_bin_average'] = temp_df.mean(axis=1)
    return df

Data Preparation for Model

As seen before, the dataset is highly imbalanced. Left as-is, the model would fit the majority class well while largely ignoring the minority class. To increase the count of the minority class, we use SMOTE (Synthetic Minority Oversampling TEchnique), which oversamples the minority class carefully: it finds the k nearest neighbors of each minority point and generates synthetic points along the line segments joining them. This produces new, unique data rather than mere sampling with repetition.

SMOTE is applied to the minority class, increasing it to 10% of the majority class; a sampling strategy of 0.1 means “oversample until the minority reaches 10% of the majority”.

from imblearn.over_sampling import SMOTE

# sampling_strategy=0.1: grow the minority class to 10% of the majority
oversampling = SMOTE(sampling_strategy=0.1)
X_train, y_train = oversampling.fit_resample(X_train, y_train)

A few columns have high variation, with values ranging from zero into the tens of thousands. These differing scales prevent ML models from performing at their best, so the dataset is normalized with MinMaxScaler, fitted on the train data and then used to transform the CV and test data.

from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on train data only, then transform CV and test
mnmx_scaler = MinMaxScaler()
X_train_median_os = pd.DataFrame(mnmx_scaler.fit_transform(X_train))
X_cv_median = pd.DataFrame(mnmx_scaler.transform(X_cv_median))
X_test_median = pd.DataFrame(mnmx_scaler.transform(X_test_median))
Normalization using MinMaxScaler

Base Models

Before moving on to the ML models, it is good to test the dataset on reference models that act as a benchmark: the ML models should perform better than these base models. All models from here on are tested with two datasets, one with median imputation and one with KNN-based imputation.

  1. Random Model: This model randomly generates 1s and 0s as the output. As expected, it has a low test F1 score of 0.34.
Random Model Confusion Matrix
def random_model(train_shape, cv_shape, test_shape):
    # Predict 0 or 1 uniformly at random for every point
    train = np.random.randint(2, size=train_shape)
    cv = np.random.randint(2, size=cv_shape)
    test = np.random.randint(2, size=test_shape)
    return train, cv, test

2. All 1s: This model predicts 1 for every point. Since our dataset is overwhelmingly 0s, this drives the test F1 score down to 0.01.

def model_ones(train_shape, cv_shape, test_shape):
    # Predict failure (1) for every point
    train = np.ones(train_shape)
    cv = np.ones(cv_shape)
    test = np.ones(test_shape)
    return train, cv, test

3. All 0s: This model predicts 0 for every point. The F1 score rises because the dataset is mostly 0s, but it stands at only ~0.5 since the minority class is completely misclassified; see the sketch below.
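For completeness, a sketch of this baseline mirroring model_ones above:

def model_zeros(train_shape, cv_shape, test_shape):
    # Predict "no failure" (0) for every point
    train = np.zeros(train_shape)
    cv = np.zeros(cv_shape)
    test = np.zeros(test_shape)
    return train, cv, test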

4. Sensor 35 based Model: We use only the sensor 35 values to predict the output. Sensor 35 has well-separated distributions for the two classes and hence performs well, with a test F1 score of 0.65.

The figure lists the train, CV and test F1 scores for each important sensor.
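A hypothetical sketch of such a single-sensor baseline, assuming the median-imputed splits: sweep candidate thresholds over the train percentiles of sensor35_measure and keep the one with the best macro F1.

from sklearn.metrics import f1_score
import numpy as np

best_threshold, best_f1 = None, 0.0
for threshold in np.percentile(X_train_median['sensor35_measure'], range(1, 100)):
    preds = (X_train_median['sensor35_measure'] > threshold).astype(int)
    score = f1_score(y_train, preds, average='macro')
    if score > best_f1:
        best_threshold, best_f1 = threshold, score

# Evaluate the chosen threshold on the test split
test_preds = (X_test_median['sensor35_measure'] > best_threshold).astype(int)
print("Test macro F1:", f1_score(y_test, test_preds, average='macro'))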

Machine Learning Models

As we have seen, the dataset is highly imbalanced, so tree-based models such as Random Forest and XG Boost are a good fit. Apart from tree-based models, it is worth trying a basic model such as KNN, as well as more complex ensemble, stacking and voting classifiers, to get a broader picture.

K Nearest Neighbor

KNN is mainly used for classification problems. It finds the “k” nearest neighbors of the point under consideration and classifies it by majority vote among them. The value of “k” is a hyperparameter that can be tuned to the dataset; a further hyperparameter lets neighbors be weighted by their distance to the point under consideration.

Using Random Search Cross validation, the best hyper parameters were found as follows.

Hyper Parameter Tuning for Left: KNN with Median Imputed Data, Right: KNN with KNN Imputed Data
from sklearn.neighbors import KNeighborsClassifier

x_cfl = KNeighborsClassifier(metric='manhattan', n_neighbors=3, weights='distance')
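For reference, a sketch of the random search itself; the parameter grid here is an assumption, since the exact search space is not shown in the post.

from sklearn.model_selection import RandomizedSearchCV

# Hypothetical search space
param_distributions = {
    'n_neighbors': [3, 5, 7, 11, 15],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan'],
}
search = RandomizedSearchCV(KNeighborsClassifier(), param_distributions,
                            n_iter=10, scoring='f1_macro', cv=3, n_jobs=-1)
search.fit(X_train_median_os, y_train)
print(search.best_params_)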

KNN achieved good test F1 scores of 0.84 (median-imputed data) and 0.83 (KNN-imputed data), well above the base models.

The test confusion matrix shows 43 misclassified points for the median-imputed data and 52 for the KNN-imputed data.

from sklearn.metrics import plot_confusion_matrix

plot = plot_confusion_matrix(x_cfl, X_test, y_test)
plot.ax_.set_title('Test Confusion Matrix')
plt.show()
KNN Test Confusion Matrix — Left: Median Imputed Data; Right: KNN Imputed Data

Random Forest

Random Forest is an extension of decision trees. It uses bootstrapping and aggregation, together known as bagging: several decision trees are trained on bootstrap samples of the train dataset, and a majority vote among them classifies each point.

Random Forest was tuned over its hyperparameters, and the best values were found to be a max depth of 100 with over 2000 estimators; a sketch of fitting a forest with those values follows the figure below.

Hyper Parameter Tuning for Left: RF with Median Imputed Data, Right: RF with KNN Imputed Data
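A minimal sketch using the reported best values; the remaining settings are assumptions.

from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(n_estimators=2000, max_depth=100, n_jobs=-1)
rf_clf.fit(X_train_median_os, y_train)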

Random Forest improved on the KNN results by a small margin, with test F1 scores of 0.889 (median imputation) and 0.879 (KNN imputation).

Random Forest Test Confusion Matrix — Left: Median Imputed Data; Right: KNN Imputed Data

XG Boost

XG Boost, short for Extreme Gradient Boosting, is an ML algorithm based on gradient-boosted decision trees (GBDT). On small-to-medium structured (tabular) data, XG Boost usually outperforms most other tree-based algorithms.

Gradient boosting works by training each subsequent model on the errors of the previous models. The error decreases with every model added in this way, up to the point where the ensemble starts to overfit.

XG Boost has a large number of hyperparameters, and Random Search was used to obtain the best values on the train dataset; a sketch of such a search follows the figure below.

Hyper Parameter Tuning for Left: XG Boost with Median Imputed Data, Right: XG Boost with KNN Imputed Data
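A sketch of the search; the parameter grid is an assumption, since the exact space searched is not shown in the post.

from xgboost import XGBClassifier
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical search space
param_distributions = {
    'n_estimators': [100, 500, 1000, 2000],
    'max_depth': [3, 5, 10],
    'learning_rate': [0.01, 0.05, 0.1, 0.3],
}
search = RandomizedSearchCV(XGBClassifier(), param_distributions,
                            n_iter=10, scoring='f1_macro', cv=3, n_jobs=-1)
search.fit(X_train_median_os, y_train)
x_cfl = search.best_estimator_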

We can inspect the feature importances of the trained model. The top 5 features in the graph are the same features that showed highly separable class distributions during EDA.

# Plot the 25 most important features of the tuned XG Boost model
features = X_train_knn_os.columns
importances = x_cfl.feature_importances_
indices = np.argsort(importances)[-25:]
plt.figure(figsize=(10, 12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='r', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
Feature Importance obtained from XG Boost classifier

XG Boost performed on par with the best models so far, with test F1 scores of 0.887 (median imputation) and 0.877 (KNN imputation).

Test Confusion Matrix for Left: XG Boost with Median Imputed Data, Right: XG Boost with KNN Imputed Data

Since XG Boost performed well, it was also tried on the data without SMOTE oversampling of the minority class. However, the results were poor, with a test F1 score of about 0.5: the minority class was too small to learn from, hence the low macro-averaged F1.

Stacking Classifiers

A stacking classifier trains several ML classifiers on the given data; their outputs are then used to train a meta-classifier, which predicts the final label. The final output is therefore not a majority vote but the prediction of a trained model.

from mlxtend.classifier import StackingClassifier

# Base classifiers are the tuned RF, XG Boost and KNN models;
# another KNN serves as the meta-classifier
sclf = StackingClassifier(classifiers=[rf_clf, xgb_clf, knn_clf],
                          use_probas=False,
                          average_probas=False,
                          meta_classifier=knn_meta)
sclf.fit(X_train, y_train.values.ravel())

We used the same three algorithms with their already-tuned hyperparameters as base classifiers (KNN, Random Forest and XG Boost), with another KNN as the meta-classifier. For the meta-KNN, a “k” value of 5 worked best after trying several values.

The stacking classifier improved slightly over XG Boost, with test F1 scores of 0.894 (median imputation) and 0.891 (KNN imputation).

Test Confusion Matrix for Left: Stacking Classifiers with Median Imputed Data, Right: Stacking Classifiers with KNN Imputed Data

Voting Classifiers

Voting classifiers are similar to stacking classifiers, except the meta-classifier is removed: the outputs of the base classifiers are combined by majority vote to produce the final output.

from sklearn.ensemble import VotingClassifier

# 'hard' voting takes the majority vote of the three tuned classifiers
clf = VotingClassifier(estimators=[('rf', rf), ('xgb', xgb), ('knn', knn)],
                       voting='hard', n_jobs=-1)
clf.fit(X_train, y_train.values.ravel())

Three classifiers were used in the voting classifier, namely KNN, Random Forest and XG Boost, with the hyperparameters obtained from the tuning above.

Voting Classifiers provided the best results with a test F1 score of 0.898 (Median Imputation) and 0.889 (KNN Imputation).

Test Confusion Matrix for Left: Voting Classifiers with Median Imputed Data, Right: Voting Classifiers with KNN Imputed Data

Ensemble Model

In this model, we use “n” decision trees, each trained on data randomly sampled with repetition. The outputs of the decision trees are provided as input to a meta-classifier, here a logistic regression.

The entire dataset is split into D1, D2 and Test in a 40:40:20 ratio. D1 is randomly sampled with repetition (a bootstrap sample) to train each decision tree. D2 is then passed through the trained trees, and their outputs are used to train the logistic regression meta-classifier. Finally, the test data is passed through the trees and their outputs are fed to the meta-classifier to obtain the final prediction. A sketch of this pipeline follows.
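A minimal sketch of the D1/D2 scheme, assuming (X_d1, y_d1) and (X_d2, y_d2) are the 40:40 splits described above and n_trees is illustrative:

from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
import numpy as np

n_trees = 50
trees = []
for _ in range(n_trees):
    # Bootstrap sample of D1 (sampling with repetition) for each tree
    idx = np.random.choice(len(X_d1), size=len(X_d1), replace=True)
    trees.append(DecisionTreeClassifier().fit(X_d1.iloc[idx], y_d1.iloc[idx]))

# Each tree's predictions on D2 become one input feature of the meta-classifier
meta_features = np.column_stack([t.predict(X_d2) for t in trees])
meta_clf = LogisticRegression(C=0.1).fit(meta_features, y_d2)

# Final prediction: pass test data through the trees, then the meta-classifier
test_features = np.column_stack([t.predict(X_test) for t in trees])
y_pred = meta_clf.predict(test_features)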

The decision trees were kept at their default settings, while the logistic regression was tuned with random-search cross-validation. The number of decision trees is itself a hyperparameter and was tested at 10, 20, 50 and 75 trees.

Hyper Parameter Tuning for Left: Logistic Regression with Median Imputed Data, Right: Logistic Regression with KNN Imputed Data

50 decision trees with logistic regression at C=0.1 gave good results, with a test F1 score of 0.881 (both median- and KNN-imputed data).

Test Confusion Matrix for Left: Ensemble Model with Median Imputed Data, Right: Ensemble Model with KNN Imputed Data

The ensemble model was tried both on the feature-engineered dataset without SMOTE and on the dataset without feature engineering or SMOTE; the feature-engineered version gave better results. SMOTE was not used in this model because the data is already sampled with replacement, so points are already repeated; applying SMOTE on top would create further replication and overfit the model. The results of the other tried-and-tested combinations can be found in the summary.

Summary

  1. With every model, we saw improvement in the F1 scores. A few sensors, such as sensors 85, 35, 82 and 9, and bin 1 of sensor 25, are the top features contributing to classification.
  2. The bin-average features added during feature engineering carry a good amount of importance, with several of them among the top 25 features.
  3. The Voting and Stacking classifiers performed best when given the best hyperparameters for each imputation variant.
  4. Overall, the Voting classifier with median imputation had the highest F1 score, ~0.90, across all models including the ensemble model. Stacking and XG Boost were not far behind, at ~0.89.

The overall summary of all the tried and tested models is as follows.

Result Summary

Deployment

The case study was deployed on an AWS EC2 instance with a basic HTML page and a Flask server. The website has validation checks so that it works smoothly. Since there are 170 columns, it is advisable to download the sample text file, fill it with the desired data and upload it. The link to the website is as follows.

http://ec2-3-139-100-121.us-east-2.compute.amazonaws.com:8080/index
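For orientation, a hypothetical sketch of what such a prediction endpoint can look like; the model file name, form field and route are assumptions, not the deployed code.

import joblib
import pandas as pd
from flask import Flask, request

app = Flask(__name__)
model = joblib.load('voting_classifier.pkl')  # assumed pickled model

@app.route('/predict', methods=['POST'])
def predict():
    # The uploaded file is expected to contain the 170 sensor columns
    df = pd.read_csv(request.files['datafile'])
    prediction = model.predict(df)
    return {'failure_predicted': int(prediction[0])}

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)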

Future Work

This entire case study used classical machine learning models. It could also be tried with deep learning models, for example using an LSTM on the histogram bins, since they are time-series data.

More hyperparameters can be tuned, and more engineered features, such as moving averages, can be added.

Instead of median- and KNN-based imputation, deep-learning-based imputation could be used to provide more accurate imputations.

We could also try small CNNs after converting each row of raw numerical data into a 13x13 image (~169 features), with values scaled between 0 and 1, and note their performance.

Code Repository

The code developed as part of the case study can be found on my GitHub account.

https://github.com/ManishDalvi/Predictive-Equipment-Failures/

Contact me on LinkedIn.

