Since the emergence of Data Science, Machine Learning has become a full-fledged discipline. Like Software Development, Machine Learning has its own life cycle, and the practice of managing it is now called Machine Learning Engineering or MLOps. The most important part of that life cycle is defining the problem correctly. If this goes wrong, the entire project may go haywire.
Introducing Goal Definition
This is the most important phase of the Machine Learning life cycle. It typically consists of documenting three requirements:
- Problem Statement
- Business Constraints
- ML Problem Constraints
Problem Statement
Before defining a problem, one must be sure of where not to use Machine Learning. Remember that Machine Learning is about extrapolating results from observed patterns, so its predictions cannot always be traced back to explicit rules. Hence, if your system needs to be fully explainable, ML is probably not the right tool.
Having said that, if you decide to use ML, the first step is to identify the business problem you want to solve. In other words, what task do you want your system to perform? The following are tasks that an ML algorithm can perform:
- Automate
- Alert/Prompt
- Organize
- Annotate
- Extract
- Recommend
- Classify
- Quantify
- Synthesize
- Transform
- Answer a question
- Detect Anomaly
Business Constraints
Once you decide on the task of the model, it is very important to define the Business Constraints. To understand this, one needs to look at an ML system as a product rather than as a science experiment. In other words, your ML system is meant to be consumed by an end-user. Hence, some of the constraints to consider while designing an ML system are as follows:
- Latency: How quickly results must arrive depends on the business problem and the expected end-user experience. No one will use a system that takes an undue amount of time to produce results.
- Interpretability: As mentioned earlier, ML systems are not completely explainable. However, in critical applications, it becomes imperative for the model to explain its results to a certain degree. For instance, in a cancer prediction scenario, it is important for the physician using the model to know the rationale behind its prediction.
- Consumption: Lastly, it is essential to define how the user will consume the model. Will it be through a web form, a dashboard, or a sensor output? The answer determines the deployment strategy of the model.
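The latency constraint above can be turned into a concrete check. The following is a minimal sketch, assuming the model is exposed as a plain Python callable; the names `measure_latency_ms` and `meets_latency_budget`, the 95th-percentile default, and the budget value are illustrative assumptions, not something prescribed by this article.

```python
import math
import time


def measure_latency_ms(predict, inputs, repeats=5):
    """Time each call to `predict` and return the latencies in milliseconds."""
    latencies = []
    for x in inputs:
        for _ in range(repeats):
            start = time.perf_counter()
            predict(x)
            latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def meets_latency_budget(latencies_ms, budget_ms, percentile=95):
    """Return True if the given percentile of latency is within the budget.

    Uses the nearest-rank percentile: the smallest observed value such that
    at least `percentile` percent of the samples are at or below it.
    """
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1] <= budget_ms


if __name__ == "__main__":
    # Hypothetical stand-in for a real model: doubles its input.
    dummy_model = lambda x: x * 2
    samples = measure_latency_ms(dummy_model, inputs=range(100))
    print("Within a 50 ms budget:", meets_latency_budget(samples, budget_ms=50))
```

Checking a high percentile rather than the average is a design choice: it reflects what the slowest users experience, which is usually what the business constraint is about.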
ML Problem Constraints
Based on the above business constraints, the ML problem and its constraints are framed. The first step is deciding whether the ML problem is supervised or unsupervised. Supervised learning is further divided into Classification and Regression, while unsupervised learning includes techniques such as Clustering, PCA, and Anomaly Detection.
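The framing decision described above can be sketched as a small helper. This is only an illustration under simplifying assumptions: the function name `frame_ml_problem` and its two boolean inputs are hypothetical, and real projects often need more nuance than a two-question decision.

```python
def frame_ml_problem(has_labels, target_is_numeric=False):
    """Map two basic facts about the data to a coarse ML problem type.

    has_labels: whether each example comes with a known target value.
    target_is_numeric: whether that target is a continuous number
        (ignored when there are no labels).
    """
    if not has_labels:
        # No target to learn from: clustering, PCA, anomaly detection, etc.
        return ("unsupervised", None)
    if target_is_numeric:
        return ("supervised", "regression")
    return ("supervised", "classification")


if __name__ == "__main__":
    # Predicting a house price: labeled data with a numeric target.
    print(frame_ml_problem(has_labels=True, target_is_numeric=True))
    # Grouping customers with no predefined segments: no labels.
    print(frame_ml_problem(has_labels=False))
```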
Lastly, once you decide the ML problem to be solved, it is essential to decide on the metric you are optimizing for. This depends very much on the business use case. For instance, in a cancer prediction problem where a false positive would lead to unnecessary treatment, it is important for the model to be as precise as possible, so precision is the key metric to optimize. If missing a true case is the costlier error, Recall would take priority instead. Other possible metrics include F1-Score, Log Loss, or RMSE in the case of regression.
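To make the classification metrics mentioned above concrete, here is a minimal sketch that computes precision, recall, and F1-Score from true and predicted labels using only the standard library. The function name `precision_recall_f1` and the toy labels are illustrative assumptions; in practice a library such as scikit-learn would typically be used.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Precision: of everything predicted positive, how much was correct?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything actually positive, how much was found?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Toy labels, invented for illustration: 1 = disease present, 0 = absent.
    y_true = [1, 1, 1, 0, 0, 0]
    y_pred = [1, 1, 0, 1, 0, 0]
    print(precision_recall_f1(y_true, y_pred))
```

Returning 0.0 when a denominator is zero is a convention chosen here for simplicity; other conventions (such as raising an error or returning NaN) are equally defensible.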
Conclusion
I hope this article helps readers with the basics of ML problem formulation. It is not exhaustive, but an attempt to lay out certain recommendations. Moreover, the Machine Learning life cycle is iterative, so the goal may change once you move into its later stages.
P.S.
This article is inspired by Machine Learning Engineering by Andriy Burkov.