MATH 60603A Statistical Learning Assignment #3
Fall 2021
Predictive maintenance game
• Individual assignment.
• Upload your decision(s) before 8:30 AM (EST) on December 3rd, 2021.
• You are required to provide your R code (upload it on ZoneCours).
• This round of business simulation is worth 10% of your final grade.
Context:
Your company has operations in a very remote area where centrifugal pumps are used to extract valuable liquids. Pump
maintenance is done by a contractor. All pumps get maintained once a year, but there is an option to repair selected pumps
after 6 months at a cost. Your role will be to decide which pumps, if any, get the month 6 maintenance. The pumps will all be
maintained at the end of the 12 months, whether they had a maintenance done at 6 months or not.
When pumps are maintained, the contractor may change the bearings, the seals, or replace the whole pump. Due to the
remote location, every pump maintenance has a base cost of $ 500, plus the cost of the repair, namely:
• $ 4 000: Whole pump replaced
• $ 200: Bearings replaced
• $ 100: Seals replaced
• $ 250: Bearings and seals replaced (cheaper to replace both at once)
Even if no parts are replaced, the contractor performs an inspection and a tuning of the pump at no additional cost.
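The price list above can be turned into a small helper. This is only a sketch: the function name and the repair labels ("none", "seals", "bearings", "both", "pump") are made up for illustration, and the way repairs are coded in repairs.csv may differ.

```r
# Cost of one maintenance visit, following the price list above.
# The labels below are hypothetical; repairs.csv may code repairs differently.
maintenance_cost <- function(repair = c("none", "seals", "bearings", "both", "pump")) {
  repair <- match.arg(repair)
  base_cost <- 500  # charged for every visit, even an inspection only
  repair_cost <- c(none = 0, seals = 100, bearings = 200, both = 250, pump = 4000)
  unname(base_cost + repair_cost[repair])
}

maintenance_cost("none")  # 500
maintenance_cost("both")  # 750
maintenance_cost("pump")  # 4500
```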
Sensors on the pump keep track of energy use, volume of liquid extracted, as well as vibrations of the pump. Because of their
remote location, the data are just stored locally as there are no networks available. The extracted liquid gets collected every
month, and the data are then brought back on a physical device by the transporter.
The company has a total of 100 000 pumps. They provide you with a summary of the sensor data for the first 5 months of the
year to decide which pumps will be maintained at month 6. Last year, a study was also done on 200 pumps that were randomly
split in two groups. The first group got a maintenance at 6 months, then the regular maintenance at 12 months, but the
second group just got the regular maintenance at 12 months. The data from that study are available to you, namely the details
of the maintenance and 12 months of sensor data.
Beyond the cost of repairs, pumps consume electrical energy that you pay $0.10 per kWh. The liquid that you extract is worth
$0.10 per cubic meter. Your objective is to net the largest possible profit during the 12-month period (value of liquid, minus
energy and repair costs).
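As a sketch of the objective, the profit of a single pump can be written as follows. The function and argument names are assumptions, and the numbers in the example call are invented to show the arithmetic; they do not come from the data.

```r
# Net profit of one pump over 12 months:
# value of liquid, minus energy cost, minus repair costs.
# volume_m3 and energy_kwh are monthly values (hypothetical argument names).
net_profit <- function(volume_m3, energy_kwh, repair_costs) {
  price_liquid <- 0.10  # $ per cubic meter
  price_energy <- 0.10  # $ per kWh
  sum(volume_m3) * price_liquid - sum(energy_kwh) * price_energy - sum(repair_costs)
}

# Invented numbers, for illustration only
net_profit(volume_m3    = rep(10000, 12),
           energy_kwh   = rep(2000, 12),
           repair_costs = 500)
```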
Sensor Data
A zip file containing the data is available on ZoneCours. The data is split between different files:
• repairs.csv: The list of repairs that were done on the 200 pumps used for the study last year.
• Sensors-study.csv: Sensor data for 12 months for the 200 pumps in the study last year.
• Sensors-score.csv: Sensor data for the first 5 months of the year to help you decide which pumps should be
maintained.
How to play the game:
You need to prepare a one-column CSV file with the list of IDs of the pumps to repair at 6 months. Upload this decision on
https://dsgame.hec.ca/play (consult ZoneCours for instructions on setting up your access, and joining the game). When you
upload a decision, you will get immediate feedback on the net profit. The platform allows for multiple uploads per person,
up to 99, which means that you may try many different solutions.
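A decision file could be written along these lines. The pump IDs, the column name ID and the file name are placeholders, and whether the platform expects a header row is an assumption to check against the instructions on ZoneCours.

```r
# Hypothetical IDs; in practice they come from your own decision rule.
pumps_to_repair <- c(17, 204, 5931)

# One column, one row per pump to repair at month 6.
# The column name "ID" is an assumption, not a documented requirement.
decision <- data.frame(ID = pumps_to_repair)

out_file <- file.path(tempdir(), "decision.csv")
write.csv(decision, out_file, row.names = FALSE)
```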
On the upload platform, you will see not only your results but also those of the whole group. While the game is being played,
you will see the “interim leaderboard.” Before the deadline of the assignment, you must select your final decision as one of
your uploads. When the game ends, the “real-life leaderboard” will be unveiled and will prevail for the final ranking. The
“real-life leaderboard” plays the role of a test set: it is held out until the end to measure the performance of everybody's
final answers.
Method:
Your objective is based on a business outcome: net profit. The process of prediction will involve cleaning, combining,
analyzing, and modeling data. We do not give further instructions on the methods used; you are on your own for that. There
is no single right answer, and multiple strategies can support the business problem reasonably well. We expect each
student to come up with their own approach.
Evaluation:
Look for the baseline on the upload platform. It corresponds to the profit made with no extra maintenance. To get a passing
grade, you must do better than that!
The evaluation will be based on the results at the end of the game. Each student must select one of their uploads as their
final answer, and that answer will prevail. You get:
• 0% – if you are below the baseline on the interim leaderboard; at least 50% if your profit is above.
• 100% – Top 10% of students on the “real-life leaderboard.”
• The rest of the marks will be linearly interpolated using the following equation with values from the “real-life
leaderboard”:
50 {1 + (x − B)/(T − B)}
where x is your profit, B the baseline, and T the profit of the last student with 100% from the previous criterion.
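The grading rule can be sketched as a function of the three quantities in the formula. The function name is made up, the threshold is called T_top because T is shorthand for TRUE in R, and the sketch simplifies by applying the baseline check to the same profit x, whereas the rule above checks the baseline on the interim leaderboard.

```r
# Grade in percent from profit x, baseline B and top-10% threshold T_top.
# Simplification: the baseline check uses x itself.
grade <- function(x, B, T_top) {
  if (x < B) return(0)                      # below the baseline
  min(100, 50 * (1 + (x - B) / (T_top - B)))  # capped at 100 for the top 10%
}

# Invented values: halfway between baseline and threshold
grade(x = 150, B = 100, T_top = 200)  # 75
```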
You must upload your R code on ZoneCours. It will not be reviewed systematically, only if clarification is needed. Your
grade could be reduced if irregularities are found in the R code, or if there is evidence you have not used R.
Data dictionary:
The variable names for repair are self-explanatory.
For the sensors, the data is monthly for each pump. You will find:
• Volume of liquid extracted in cubic meters
• Total Energy used by the pump in kWh
• PSDxxxx: Power Spectral Density (in g^2/Hz) at frequency xxxx Hz
When sensors are positioned on a mechanical device, the signal is recorded, then analyzed in terms of energy at different
frequencies. Fourier transforms are typically used to analyze the frequency domain. Instead of providing raw data that would
require a lot of preprocessing, we provide a few summary statistics of the average level of energy at 6 predefined frequencies.
In a real-world example, you would likely have to read some engineering papers, talk to experts, and try to figure out what
type of information is usually relevant. Note that PSD could be calculated on different sensors measuring different directions
of displacement, but we simplify the problem by providing just one summary value for each of a few frequencies.
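Since the sensor files hold one row per pump and month, a common first step is to collapse them to one row per pump. The sketch below uses toy data and hypothetical column names (ID, Month, Volume, Energy); check the header of the CSV files for the real ones. It shows one possible summary, not a recommended method.

```r
# Toy data with hypothetical column names; the real files may differ.
sensors <- data.frame(
  ID     = rep(1:2, each = 5),
  Month  = rep(1:5, times = 2),
  Volume = c(100, 98, 97, 95, 90, 100, 101, 99, 100, 102),
  Energy = c(50, 51, 52, 54, 58, 50, 50, 49, 50, 51)
)

# One row per pump: average monthly volume and energy over the 5 months
features <- aggregate(cbind(Volume, Energy) ~ ID, data = sensors, FUN = mean)

# Cubic meters extracted per kWh, a simple efficiency indicator
features$VolumePerKwh <- features$Volume / features$Energy
features
```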
Have fun!