Better Predict Future Oilfield Production using Machine Learning Technology

Predicting anything is a true science. In the oilfield, production predictions are paramount to drilling in the right place for the best possible result. That means analyzing all the data available to us, as accurately as possible. Considering the vast amount of data, this can be a bit overwhelming. By using Random Forest technology – a simple machine learning algorithm based on decision trees – we can better predict future production.

Decision Trees

We use decision trees to build classification or regression models in the form of a tree structure. They are useful for breaking down a data set into smaller and smaller subsets while incrementally developing an associated decision tree. The final result is a tree with a root node, decision nodes, and leaf nodes. A decision node has two or more branches. A leaf node is an end point that represents a classification or decision. The root node is the topmost decision node in the tree and corresponds to the best predictor. Decision trees can handle both categorical and numerical data.
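To make the idea of a decision node concrete, here is a minimal sketch of how a single split could be chosen. It assumes Gini impurity as the split criterion (a common choice, but the article does not say which criterion was used), and the latitude values and labels are invented for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted_impurity) of the best 'value <= threshold' split."""
    best = (None, float("inf"))
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best

# Invented toy data: latitude (degrees) and a production class per well.
latitude = [47.5, 47.6, 47.7, 48.1, 48.2, 48.3]
label = ["low", "low", "low", "high", "high", "high"]

threshold, impurity = best_threshold(latitude, label)
print(threshold, impurity)
```

A full decision tree repeats this search over every predictor at every node and stops at a leaf when the subset is pure or a depth limit is reached.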

Let’s use a simple random forest model to illustrate how the decision tree works.

In our example, we chose a simple dataset from the Bakken Formation with limited predictors. In real practice, you would want to use an entire dataset with more predictors, but we want to keep it simple to illustrate how the model works. For illustration purposes, we chose these four features as predictors:

Latitude
Longitude
True Vertical Depth
Perforation Length

For the outcome, we chose IP Boevd21. Thus, the chosen features are used to predict the outcome of production shown as IP Boevd21.

To begin, we get the data for our example from PetroDE:

In PetroDE, do a search on the Bakken Formation with the time period set to All Time.
In the Analysis Card, select Download Search and save the search result as a CSV file.

Notice that the CSV file contains over 17,000 wells (or rows).

Figure: Bakken Wells (17,605 wells in the Bakken Formation)

Next, we load the CSV file into our random forest workflow, which first filters the data. The following filters reduce the number of rows to just over 9,000:

Well Status = producing
Hole Direction = Horizontal
IPBoevd21 > 0
TVD > 1000
Perforation Length > 100
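The filters above could be applied with a few lines of Python. This is only a sketch: the column names are assumptions about the exported CSV, the sample rows are invented, and the status and direction comparisons are made case-insensitive as a convenience.

```python
import csv
import io

# Invented sample rows; the column names are assumptions about the export.
SAMPLE_CSV = """Well Status,Hole Direction,IPBoevd21,TVD,Perforation Length
producing,Horizontal,412.5,10450,9200
producing,Vertical,35.0,9800,150
shut-in,Horizontal,280.0,10900,8700
producing,Horizontal,0,11020,9500
producing,Horizontal,655.1,10760,50
producing,Horizontal,198.3,10310,4400
"""

def to_float(text):
    """Parse a numeric cell, treating blanks or bad values as missing (None)."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None

def keep(row):
    """Apply the five filters listed above to one CSV row."""
    ip = to_float(row["IPBoevd21"])
    tvd = to_float(row["TVD"])
    perf = to_float(row["Perforation Length"])
    return (
        row["Well Status"].strip().lower() == "producing"
        and row["Hole Direction"].strip().lower() == "horizontal"
        and ip is not None and ip > 0
        and tvd is not None and tvd > 1000
        and perf is not None and perf > 100
    )

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
filtered = [row for row in rows if keep(row)]
print(f"{len(filtered)} of {len(rows)} rows kept")
```

For the real export, replace the in-memory sample with the downloaded file, for example by opening it with `open(path, newline="")` and passing the file object to `csv.DictReader`.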

Cross Correlation

Next, we perform some simple exploratory data analysis to show the cross-correlation between predictors and outcome:

As we can see, there is no strong correlation between any pair of predictors. That is good. But there is also no strong correlation between any single predictor and the IP Boevd21 outcome. Because correlation only measures linear association, this could mean that the relationship between this set of predictors and the outcome is more complicated than a simple linear one. Let’s explore further.
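A correlation table of this kind can be sketched as follows. The example assumes the Pearson coefficient (the article does not name the measure it used) and runs on a handful of invented values rather than the Bakken data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Invented values for a handful of wells; not taken from the Bakken data.
columns = {
    "Latitude": [47.6, 47.9, 48.1, 48.3, 47.7],
    "Longitude": [-103.1, -102.8, -103.6, -103.3, -102.9],
    "TVD": [10450, 10900, 10310, 11020, 10760],
    "Perforation Length": [9200, 8700, 4400, 9500, 9900],
    "IP Boevd21": [412.5, 280.0, 198.3, 655.1, 530.2],
}

names = list(columns)
matrix = {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Print one row of the correlation table per column.
for a in names:
    print(a.ljust(20), " ".join(f"{matrix[a][b]:6.2f}" for b in names))
```

Values near +1 or -1 indicate a strong linear relationship, and values near 0 indicate little or none.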

Classify the Outcome

To keep the analysis easy to interpret, we use a classification model instead of a regression model.
Let’s split our dataset into three classification categories using statistical tertiles: Low, Medium, and High production. Now we have a categorical outcome in our dataset.
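As a rough illustration of the tertile split, the sketch below labels each well using the two cut points returned by Python's `statistics.quantiles`. The production values are invented, and the handling of values that fall exactly on a cut point is an assumption.

```python
import statistics

def tertile_labels(values):
    """Label each value Low, Medium, or High using the two tertile cut points."""
    low_cut, high_cut = statistics.quantiles(values, n=3)
    labels = []
    for value in values:
        if value <= low_cut:
            labels.append("Low")
        elif value <= high_cut:
            labels.append("Medium")
        else:
            labels.append("High")
    return labels

# Invented IP Boevd21 values for nine wells.
ip_values = [120, 950, 430, 60, 510, 1200, 310, 780, 240]
labels = tertile_labels(ip_values)
print(list(zip(ip_values, labels)))
```

Because the cut points are tertiles, each class ends up with roughly one third of the wells, which keeps the three classes balanced.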

Check Model Performance

We can check the model performance by splitting our dataset into train and test datasets using a traditional 80% / 20% proportion. To simplify the results for illustration purposes, we also limit the decision tree size to four levels by setting the max_depth parameter of the algorithm to 4. Otherwise, the trees would be too large to display. In real practice, you would not fix the depth this low: deeper trees can capture more detail and often improve accuracy, although the depth should still be checked against the test data to avoid overfitting.

Number of observations in the training data: 7324
Number of observations in the test data: 1832

Train Accuracy :: 0.5951665756417258
Test Accuracy :: 0.5709606986899564
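The 80% / 20% split itself can be sketched with the standard library alone. The function name, the seed, and the choice to round the test share up are assumptions made for this illustration; with that rounding, the sketch happens to reproduce the observation counts reported above.

```python
import math
import random

def train_test_split_indices(n_rows, test_fraction=0.2, seed=42):
    """Shuffle row indices and split them into train and test lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed for a repeatable split
    n_test = math.ceil(n_rows * test_fraction)  # round the test share up
    return indices[n_test:], indices[:n_test]

# 9156 is the sum of the two observation counts reported above.
train_idx, test_idx = train_test_split_indices(7324 + 1832)
print("Number of observations in the training data:", len(train_idx))
print("Number of observations in the test data:", len(test_idx))
```

Shuffling before splitting matters here, because an export sorted by location or date would otherwise put systematically different wells into the two sets.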

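The train and test results are not great but not bad, considering our very small set of predictors and reduced tree size: with three equally sized classes, random guessing would score about 0.33. The test accuracy is close to the train accuracy, so we don’t have a significant overfitting problem. The modest accuracy on both sets does suggest some underfitting, which is to be expected with so few predictors and such a shallow tree.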

Next, let’s look at a confusion matrix for the test set.

As we can see, everything looks good for the first two prediction classes, low and medium. But the third prediction class, high, is not well predicted. This is most likely due to our limited set of predictors, which is not enough to separate the high class from the low class. If we look at the bottom-left corner cell, we can see that a lot of high-producing wells are predicted as low. Thus, we have a lot of false negative predictions for the high class. This tells us that we need to add more predictors in order to separate high-producing wells from low-producing wells in the same limited territory.
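For readers who want to see how such a matrix is built, here is a small sketch. It assumes rows are actual classes and columns are predicted classes, ordered low, medium, high, so that the bottom-left cell counts high wells predicted as low, as described above. The labels are invented and are not the article's test set.

```python
CLASSES = ["low", "medium", "high"]

def confusion_matrix(actual, predicted, classes=CLASSES):
    """Rows are actual classes, columns are predicted classes."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

def accuracy(matrix):
    """Share of predictions on the diagonal (actual == predicted)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Invented labels for ten wells; not the article's test set.
actual = ["low", "low", "medium", "medium", "high",
          "high", "high", "low", "medium", "high"]
predicted = ["low", "low", "medium", "low", "low",
             "high", "low", "medium", "medium", "medium"]

matrix = confusion_matrix(actual, predicted)
for name, row in zip(CLASSES, matrix):
    print(name.ljust(6), row)
print("high wells predicted as low:", matrix[2][0])
print("accuracy:", accuracy(matrix))
```

The accuracy figure is simply the diagonal of the matrix divided by the total number of wells, so the matrix shows where the errors behind a single accuracy number come from.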

Predictor Importance Ranking

The Random Forest algorithm calculates the importance of our existing predictors and ranks their contribution for predicting a particular outcome. As shown in the figure below, geographical coordinates are ranked as the most important in our test dataset.

Our random forest model, which we limited to four levels for our test dataset, contains decision trees such as the following one.

Interpreting the Random Forest Model with LIME

Our Random Forest Model can also be interpreted with the LIME (Local Interpretable Model-Agnostic Explanations) algorithm. Using LIME, we can select any row from the test dataset, make a prediction based on the decision tree, and explain it.
For example, if we look at row #1, the decision tree points to medium. Using LIME as illustrated below, we see that the most important predictor is latitude <= 47.76, which points to medium. Other predictors point to NOT medium, but they are not enough to change the decision.

Looking at row #2, we see below that all predictors point to high, with the most important predictors for that classification being latitude and longitude.

In row #3, we see a borderline decision that may be a false negative. The Prediction Probabilities of low and high are nearly equal. The longitude > -103.52 points to a low classification, but latitude > 48.20 points to a NOT low classification (i.e., high in this case, since high is ranked higher in probability than medium). To improve the model, we need to add more predictors, as previously stated.
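One simple way to spot rows like this one is to compare the two highest class probabilities. The sketch below is not part of LIME; the probabilities are invented to resemble the row #3 case, and the 0.05 cut-off is a hypothetical value chosen only for illustration.

```python
def top_two_margin(probabilities):
    """Return (best_class, runner_up, margin) for one row's class probabilities."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    (best, p_best), (second, p_second) = ranked[0], ranked[1]
    return best, second, p_best - p_second

# Invented probabilities, loosely shaped like the row #3 case described above.
row_probabilities = {"low": 0.41, "medium": 0.20, "high": 0.39}

best, second, margin = top_two_margin(row_probabilities)
THRESHOLD = 0.05  # hypothetical cut-off for calling a prediction borderline
status = "borderline" if margin < THRESHOLD else "clear"
print(best, second, round(margin, 2), status)
```

Rows flagged this way are good candidates for a closer look with LIME, since a small change in one predictor could flip the predicted class.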

Conclusion

The power of using the Random Forest machine learning model for predicting production or any other outcome is easy to see when you can inspect the results in a decision tree or LIME diagram. In this example, we used a very small set of predictors, and we expect that a richer set of predictors would give more accurate results. Predicting outcomes with machine learning algorithms makes a lot of sense!