Biota Skills Evaluation – Reservoir Engineering / Data Science
The goal of this notebook is to assess several sets of skills that are used on a daily basis at Biota.
Instructions:
• Each section contains its own set of questions, which should be answered to the best of your ability.
• If the answer to a direct question is not known to you currently, state that you have not seen that command or usage before. Then search the internet for the answer and provide it.
• If you cannot answer a question, provide a list of your thought processes, and what you tried along the way. This is important for Biota since we often want to accomplish the right thing, but may need help on the execution. This is materially different from not knowing what the most appropriate thing to do is in the first place.
Packages:
• This evaluation should be completable with only the basic packages listed below. If you find yourself needing different packages, please install them and note which packages you used and why.
Execute the cells below to import the relevant packages and functions.
In [ ]:
%matplotlib inline
In [4]:
# general packages
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn as sk
import scipy as sp
Unix
Unix commands are used often at Biota and in data science in general.
Write the commands necessary to:
1. Create a directory named “test”
2. Create an empty file named “foo.txt”
3. Create a directory inside of “test” called “temp”
4. Copy “foo.txt” into “temp”
5. Change the name of “foo.txt” to “bar.txt”
6. Print the path to the current working directory
7. Change the current working directory to home
8. Explain the difference between a relative and absolute path
In [ ]:
GitHub
1. Write the command to clone a repository called “git@github.com:user/foo” (not a real repository)
2. Change into that directory
3. Create a branch called “test”
4. Switch to that branch
5. Create an empty file named “foo.txt”
6. Add that file to the staging area
7. Commit that file
8. Push that file to the master branch on GitHub
Statistics
1) Give a brief definition of p-values, type 1 errors, and type 2 errors
In [ ]:
2) Explain the bias/variance tradeoff
In [ ]:
3) Explain the complexity/interpretability tradeoff
In [ ]:
4) Define and explain the differences in predicting continuous and categorical labeled data
In [ ]:
5) Give a definition and explain the purpose of splitting data into a test and training set, and the purpose of cross-validation
In [ ]:
6) Explain the “curse of dimensionality”. How does this affect classifiers such as random forests or support vector machines?
In [ ]:
Reservoir Engineering
Part 1)
1) Briefly describe the main elements of a petroleum system. In this context, where is the main focus for conventional vs. unconventional resource development?
In [ ]:
2) Under what conditions is a reservoir called a gas condensate reservoir? What would be the main consequences of excessive pressure drop in these reservoirs? What would be your recommended remedy?
In [ ]:
3) What reservoir properties are required/preferred for a successful gas flooding EOR project? How would you choose your injection gas (CO2, N2, natural gas)? What are the main challenges and potential solutions in heterogeneous reservoirs?
In [ ]:
4) What are the main drive mechanisms of production from conventional vs. unconventional reservoirs?
In [ ]:
5) How does the in situ stress state affect the complexity of the resulting hydraulic fracture system? What is the role of pre-existing natural fractures/faults?
In [ ]:
6) Unconventional wells are known for their large decline rates. What are some potential ways of extending the life of an unconventional well?
In [ ]:
7) Noting that subsurface microbes (extracted from drill cuttings and produced fluids) represent variations in chemical and physical properties within hydrocarbon reservoirs, what would be your value proposition for DNA Diagnostics in conventional reservoirs?
In [ ]:
Part 2)
1) Create a variable named WellData by reading in the well metadata as a pandas dataframe from the file WellData.csv. Set the WellName column as the index. (The well metadata includes average geologic and completions parameters for 115 wells in addition to their 2-year cumulative oil production.)
In [ ]:
2) Use a groupby operation to show how many wells are landed in formations A and B. Which target formation seems to be the main focus of the operator?
In [ ]:
3) Why do you think the operator is investing more capital in developing the formation you identified in question 2) above? Using matplotlib and/or seaborn, create a visual to summarize how the main geologic parameters (porosity, saturation, thickness, pressure) differ between formations A and B.
In [ ]:
4) Is there a significant difference in how formation A vs. formation B wells have been completed? Using matplotlib and/or seaborn, create a visual to summarize how the main completions parameters (lateral length, injected fluid, injected proppant, average stage length) differ between formation A- and formation B-landed wells.
In [ ]:
5) How does the production response vary between formation A- and formation B-landed wells? Create visualizations as necessary. What do you think is causing the observed difference in production response from these two formations?
In [ ]:
6) Perform a statistical test to determine if the observed difference in mean production response (between formations A and B) is statistically significant.
In [ ]:
7) Using scikit-learn, divide the data into training and testing sets (75% training, 25% testing).
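As a hedged illustration of the mechanics only, the sketch below runs a 75/25 split with scikit-learn's `train_test_split` on a small synthetic table. The table `demo` and the feature columns `Porosity` and `LateralLength` are hypothetical stand-ins, not the contents of WellData.csv; only `CumOil_24Months` is taken from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the well table; feature column names are hypothetical.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Porosity": rng.uniform(0.04, 0.12, size=100),
    "LateralLength": rng.uniform(4000, 10000, size=100),
    "CumOil_24Months": rng.uniform(50_000, 300_000, size=100),
})

X = demo.drop(columns="CumOil_24Months")  # explanatory variables
y = demo["CumOil_24Months"]               # response variable

# test_size=0.25 holds out 25% of the rows; random_state makes the shuffle repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape)
```

Fixing `random_state` is a choice made here for reproducibility; any seed, or none, would also satisfy the 75/25 requirement.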
In [ ]:
8) Using scikit-learn, train a multiple linear regression model for CumOil_24Months using all the geologic and completions parameters as your explanatory variables. Test the model, plot the results, and summarize relevant statistics.
In [ ]:
9) Is it a good practice to include all features in the model? Explain how model complexity may impact in-sample and out-of-sample accuracy.
In [ ]:
10) What would be your strategy for building the best parsimonious model? How would you choose the most important features?
In [ ]:
11) Using scikit-learn, perform principal component analysis (PCA) of the geologic features. Plot the resulting PC1 and PC2 as a scatter plot and color the points based on TargetFormation.
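For orientation, here is a minimal sketch of a two-component projection using scikit-learn's `PCA` class, colored by a group label. It runs on synthetic data: the arrays `group_a` and `group_b` and their four unnamed features are assumptions made for illustration, not the geologic columns of WellData.csv. Standardizing first is also an assumption; it is a common choice when features have different units.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Two synthetic groups of 40 "wells" with 4 features each (hypothetical data).
group_a = rng.normal(loc=0.0, scale=1.0, size=(40, 4))
group_b = rng.normal(loc=3.0, scale=1.0, size=(40, 4))
features = np.vstack([group_a, group_b])
labels = np.array(["A"] * 40 + ["B"] * 40)

# Put every feature on a comparable scale before projecting.
scaled = StandardScaler().fit_transform(features)

# Keep the first two principal components.
pca = PCA(n_components=2)
scores = pca.fit_transform(scaled)

# Scatter PC1 vs. PC2, one color per group label.
fig, ax = plt.subplots()
for name in ("A", "B"):
    mask = labels == name
    ax.scatter(scores[mask, 0], scores[mask, 1], label=f"Formation {name}")
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.legend(title="TargetFormation")
```

With `%matplotlib inline` active, the figure renders when the cell finishes. `pca.explained_variance_ratio_` reports how much variance each component captures.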
In [ ]:
12) Is PCA successful in unsupervised clustering of wells into formation A vs. formation B wells? If yes, which principal component separates formation A from formation B wells?
In [ ]:
Part 3)
1) Read the following data files for a given well:
• WellLogs.csv which includes petro-physical well logs. Each zone (formation) is identified with a label.
• Formations.csv which includes top and bottom MDs of each zone.
• DNAFormationContributions.csv which includes an estimate of how much each zone contributes to total liquid production.
2) Calculate the brittleness index for each depth, defined as the fraction of total rock volume that is made of quartz and calcite. Hint: Use the last 6 columns of WellLogs.csv.
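The definition above can be read as (quartz volume + calcite volume) / (sum of all component volumes). The sketch below applies that reading to a tiny hand-made table. The column names (`MD`, `VQuartz`, `VCalcite`, `VClay`, `VKerogen`, `VOil`, `VWater`) and the values are hypothetical, and it assumes the last six columns together make up the total rock volume; the actual layout of WellLogs.csv may differ.

```python
import pandas as pd

# Hypothetical volume-fraction columns; the real names in WellLogs.csv may differ.
logs = pd.DataFrame({
    "MD": [9000.0, 9000.5, 9001.0],
    "VQuartz": [0.30, 0.20, 0.10],
    "VCalcite": [0.20, 0.10, 0.05],
    "VClay": [0.30, 0.50, 0.65],
    "VKerogen": [0.05, 0.05, 0.05],
    "VOil": [0.10, 0.05, 0.05],
    "VWater": [0.05, 0.10, 0.10],
})

# Select the last six columns before adding any new column to the frame.
volume_cols = logs.columns[-6:]
total_volume = logs[volume_cols].sum(axis=1)

# Brittleness index: share of the total volume that is quartz plus calcite.
logs["BrittlenessIndex"] = (logs["VQuartz"] + logs["VCalcite"]) / total_volume
print(logs[["MD", "BrittlenessIndex"]])
```

Dividing by the row-wise total rather than assuming the fractions sum to one keeps the result a true fraction even if the logged volumes are not perfectly normalized.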
In [ ]:
3) Calculate average properties for each zone using the groupby function.
In [ ]:
4) Create a visualization with 5 log tracks vs. depth: gamma ray, brittleness index, oil volume, water volume, and formation contribution to liquid production.
In [ ]:
5) How does the brittleness index for zones A4-A6 compare with other zones? How about contribution to production according to DNA diagnostics?
In [ ]:
6) Generally speaking, how does formation contribution (to liquid production) correlate with formation-average brittleness index? How would you explain the observed relationship?
In [ ]:
7) Are zones B3 and B4 more likely to contribute to water production or oil production? Why?
In [ ]:
Thanks! Please save this Jupyter notebook with inline images and email it to shojaei@biota.com. Name the file “firstname_lastname_eval.ipynb”; for example, mine would be “hasan_shojaei_eval.ipynb”.