UNLOCKING ARTIFICIAL INTELLIGENCE II – The Life Cycle of a Data Science Project
Identifying the stages we must follow when addressing an Artificial Intelligence (AI) project is essential for structuring the work and analysing which resources are necessary and at what stage their involvement will be most relevant. In this way we will be better prepared for any eventuality, able to estimate the effort required and to assess the project's feasibility and the potential reduction of its operational costs. Identifying these stages allows managers and the project manager to understand the scope of the project, to monitor and treat the risks inherent in this new technology, and to clarify and solve the management problems that arise throughout the project's development. Each phase of an AI project has its own tasks and completion requirements, which will be more or less critical depending on the maturity, knowledge, skill and resource limitations with which the company addresses such projects, as well as on how quickly the market adopts the resulting service or product.
On the other hand, we must take into account that AI sits within a multidisciplinary context (statistics and optimization, computer science, robotics, natural language processing and generation, business management, analytics, etc.) where different professional profiles coexist, each covering a particular function. In this sense, each AI project carries its own requirements, which define its demands and scope. We find something similar in the modelling of human intelligence, where a series of phases and stages are distinguished and different needs and abilities favour decision-making, such as the understanding of language and the environment, reasoning, and models of learning.
Similarly, each AI ecosystem, infrastructure or third-party software used for developing the models and implementing the learning system has its own requirements that will shape decision-making: which resources are needed, and which stages and deadlines we must contend with. At the same time, each problem to be solved and each product to be developed has its own business needs, and these will determine the flow and monitoring of the phases applied throughout the life cycle of an AI project.
In all the decision-support systems that currently coexist, the techniques and technologies of Big Data for the transmission, storage and analysis of large volumes of data have become indispensable, as has the application of advanced analytics that makes it possible to predict future situations and select the optimal strategy and decision. For this reason, the techniques encompassed by Data Science, among which Data Mining and Machine Learning (ML), together with Deep Learning (DL), stand out, are the most used today, since they make it possible to enrich and create business knowledge, thereby optimizing decision-making and extracting new information hidden in the data the company generates. We can identify these systems as an evolution of the so-called KDD (Knowledge Discovery in Databases) systems. Different studies carried out over the last decade confirm that more than 70% of the effort in an AI project is devoted to processing the data needed to obtain the knowledge required for decision-making and its contribution to the business.
It should be emphasized that Data Science and this new AI do not only involve treating and managing large volumes of data in an orderly manner, a fact that already differentiates them from KDD systems (an interactive and iterative methodology); it is also necessary to attend to the variability, quality and representativeness of the data, creating a final data product. This data product can be a scorecard, a recommender, a classifier, or any output that facilitates decisions and actions. For this it is necessary to maintain the processes of data disposition, classification, selection, cleaning and reduction, as well as the introduction of prior knowledge and the interpretation of the results; without them, it would not be possible to achieve acceptable reliability and efficiency in decision-making. This resembles how a human being learns from the environment: we capture information, store what reaches us from different sources (the senses), filter it, keep the information that interests us, and reason and decide on that basis. In this final reasoning step we apply acquired experience, purpose or intention, and emotions, to discriminate which actions may be less beneficial in light of mistakes made in past situations and the information obtained from different sources. These aspects are taken into account in the life cycle of a Data Science project, whether it uses Data Mining techniques, Machine Learning techniques or, more recently, Deep Learning.
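The selection, cleaning and reduction steps mentioned above can be sketched in a few lines. This is a minimal illustration, not a production pipeline; the record fields (`customer_id`, `age`, `spend`, `notes`) are hypothetical:

```python
# Minimal sketch of the data-disposition steps described above:
# selection and reduction (keep only the fields relevant to the
# decision) followed by cleaning (drop incomplete records).

raw_records = [
    {"customer_id": 1, "age": 34, "spend": 120.0, "notes": "vip"},
    {"customer_id": 2, "age": None, "spend": 80.0, "notes": ""},  # incomplete
    {"customer_id": 3, "age": 45, "spend": 95.5, "notes": "new"},
]

def prepare(records, keep_fields=("customer_id", "age", "spend")):
    """Select the relevant fields and drop records with missing values."""
    cleaned = []
    for rec in records:
        reduced = {k: rec.get(k) for k in keep_fields}       # reduction
        if all(v is not None for v in reduced.values()):     # cleaning
            cleaned.append(reduced)
    return cleaned

clean = prepare(raw_records)
```

The incomplete record is discarded and the irrelevant `notes` field is dropped, leaving only data the downstream model can rely on.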
Just as software development has a Software Engineering discipline with different methodologies that optimize timescales, improve software quality and facilitate interconnection between all members of the project team, Data Science and AI projects must be built around a multidisciplinary team. That team must work in a coordinated, communicative and integrated manner, in an agile way, with all the departments in the project's value chain, and with a culture focused on data and on generating useful knowledge, supported by strong standardization and automation of processes that allow the project to scale, however complex it may be. We can identify this type of project as a combination of science and engineering: statistical knowledge, advanced mathematics and applied research are necessary, but so are knowledge of algorithms; skills in the treatment and analysis of heterogeneous data; software management; the organization and diligence to anticipate and resolve risks; the adaptation of the data, algorithms and infrastructure to the business and the objectives set; and, of course, deployment into production, evaluating its reliability and its compliance with ethical and regulatory restrictions as well as with the needs and expectations of customers. Today there are different methodologies for managing and optimizing this life cycle.
The CRISP-DM (Cross Industry Standard Process for Data Mining) model, created in 1999 by SPSS, NCR and DaimlerChrysler, maintains a standard six-phase process. It was conceived for the development of Data Mining projects and has subsequently been used with great success in other AI projects that involve collecting and analysing large volumes of data. Another standard used for Data Mining projects based on commercial tools is the SEMMA model (Sample, Explore, Modify, Model, Assess), based on the CRISP-DM standard and created by the SAS Institute in 1998. Given that Data Mining projects are sensitive to, and often short of, data-generating sources, another model, Catalyst, created in 2003, has been carried over from CRM projects to Data Science projects. One of the most important differences between these methodologies is that while CRISP-DM, KDD or Catalyst focus on business needs and understanding, the SEMMA methodology is more geared towards the use of statistics for data sampling. All of these methodologies emphasize the identification of data sources and their preparation and processing, as well as the need to evaluate the knowledge-extraction algorithm against the data we handle and the objectives set. On the other hand, the evaluation of results in CRISP-DM is carried out against both the performance of the chosen model and the objectives set, while SEMMA refers only to the performance of the model. In the case of Catalyst, results are evaluated only against the objectives and requirements of the business strategy. One aspect to bear in mind is that SEMMA is oriented towards SAS tools, and therefore towards the algorithms and models that SAS provides, whereas with the other methodologies the Data Science analyst can use whatever tools and models they prefer.
Of the methodologies described, CRISP-DM is the most widely used in the market: different vendors provide tools for monitoring projects with this methodology, and there is a worldwide network of organizations using it. Among the most prominent members of this network are Teradata, SGI, SPSS, IBM and OHRA, together with prestigious consultancies such as Deloitte, ICL, ABB, etc.
The CRISP-DM (Cross Industry Standard Process for Data Mining) methodology comprises the following phases:
- Business Understanding: This phase identifies the objectives to be achieved after a detailed study of the business and of the customer's requirements and needs, and creates a strategic plan to achieve those objectives with minimum reliability and quality requirements. Regulation concerning cybersecurity and the privacy of data and computer systems must also be taken into account.
- Data Acquisition: Identify the data necessary to achieve the objectives. Recognize the data sources. Describe the types of data we will work with and identify those that are really necessary. Recognize quality problems, such as repeated, incomplete, inconsistent or erroneous data, among others.
- Data Preparation: Process the data flows, solve missing-data problems, control inconsistencies in the data flows, and perform data cleaning and standardization, variable generation, integration of different data sets, etc.
- Modelling: Determine which model or technique is most appropriate for the problem at hand and which techniques to apply given the data, resources and needs available. It is common to return to the previous phase to rework the data so that the input matches the needs of the model. In this phase, the system evaluation and performance tests must also be created, in order to study the quality and reliability of the results obtained with the selected model against the objectives set.
- Evaluation and Interpretation: Visualization and analysis of the results obtained and their correspondence with the desired objectives, reliability and quality.
- Model Deployment: The knowledge and results obtained are presented and delivered to the client.
- Operations: Carry out the actions the client deems relevant according to the results obtained. In addition, the project moves to a phase of monitoring and maintenance of the model, driven, for example, by the period of validity of the results or models used, or by business objectives that may vary over time. The reliability of the model's results may decline, in which case the project must be restarted from the beginning.
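The iterative character of these phases can be sketched as a simple control loop: if evaluation falls short of an agreed quality threshold, the cycle returns to data preparation and modelling rather than proceeding to deployment. The phase names follow the list above; the evaluation scores and the 0.7 threshold are illustrative, not prescribed values:

```python
# Hypothetical sketch of CRISP-DM as an iterative loop: weak evaluation
# sends the project back to data preparation/modelling; only a passing
# evaluation unlocks deployment and operations.

PHASES = ["business_understanding", "data_acquisition", "data_preparation",
          "modelling", "evaluation", "deployment", "operations"]

def run_cycle(eval_scores, threshold=0.7):
    """Walk the phases; eval_scores simulates one evaluation result per pass.

    Returns the list of phases executed, in order.
    """
    executed = []
    scores = list(eval_scores)
    while True:
        for phase in PHASES[:5]:  # up to and including evaluation
            # On repeat passes, only preparation/modelling/evaluation re-run.
            if phase in ("business_understanding", "data_acquisition") and executed:
                continue
            executed.append(phase)
        if scores.pop(0) >= threshold:   # evaluation passed?
            executed += PHASES[5:]       # deployment and operations
            return executed

# First evaluation fails (0.6), the second passes (0.8).
trace = run_cycle([0.6, 0.8])
```

The trace shows preparation, modelling and evaluation executed twice, while business understanding runs only once and deployment waits for a passing evaluation.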
One aspect worth highlighting, and one not foreseen in the methodologies described, is the study of the impact on the client of incorporating and integrating these models and information-flow systems into their internal processes.
We can find another methodology focused on cloud development, which is widely used by the large companies that offer AI and machine-learning services. One of the big differences between traditional Software Engineering and AI projects is that while the former focuses on deterministic code development, AI projects are based on creating models for data treatment and knowledge generation. These models can lose their reliability over time, so both the model and the data flows that feed it require follow-up and maintenance. It is at this point that the new methodology called ModelOps comes in: it automates the deployment of the model, as well as its monitoring, supervision and maintenance, thereby accelerating development and improving the scalability of this type of project. ModelOps is based on the DevOps methodology used for application development, but focuses on accelerating the model-creation process from its initial laboratory phase, through validation and testing, to deployment with the expected quality and reliability according to the established objectives. On the other hand, the strong boom in this new AI and in Machine Learning has led large companies such as Microsoft, Google, Amazon and IBM to create sets of models and tools for developing these machine-learning projects in an agile way on cloud technology. This new set of services, a model bank managed with the ModelOps methodology, allows such projects to be developed and managed in an agile way, facilitating the democratization of these technologies and bringing them within everyone's reach thanks to their high degree of abstraction and ease of use.
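The kind of reliability monitoring that ModelOps automates can be illustrated with a minimal drift check: compare the model's recent accuracy in production against its validation baseline and flag it for retraining when it degrades. The metric, threshold and numbers below are illustrative assumptions, not drawn from any specific platform:

```python
# Hedged sketch of a ModelOps-style reliability check: flag a model
# for retraining when its recent mean accuracy falls more than a
# tolerance below the accuracy measured at validation time.

def needs_retraining(baseline_accuracy, recent_accuracies, tolerance=0.05):
    """Return True when recent accuracy drops below baseline - tolerance."""
    if not recent_accuracies:
        return False  # no production evidence yet
    recent_mean = sum(recent_accuracies) / len(recent_accuracies)
    return recent_mean < baseline_accuracy - tolerance

# The model was validated at 92% accuracy; production scores have drifted.
flag = needs_retraining(0.92, [0.90, 0.85, 0.83])
```

A real platform would compute such metrics continuously from logged predictions and trigger the retraining pipeline automatically; the decision rule itself is this simple comparison.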
The high flexibility of this methodology, based on these cloud tools and services, allows the entire life cycle of an AI or Machine Learning project to be built with a few clicks. It focuses on the concept of the pipeline: the user joins the tasks to be performed at each step of the process (workflows) with connectors. For example, it allows any quality control the Data Science analyst wants to be included (security controls, bias checks, variability checks, compliance checks, etc.), as well as keeping track of data and models throughout the entire project life cycle. It also makes it possible to improve the models automatically through a feedback loop from the user interface to the backend model. These tools usually provide a user interface in which an acyclic graph can be built, where each node represents a task to be performed and each edge defines the control flow. The methodology is also based on an event pattern, which allows the runtime behaviour of models and applications to be controlled. One of its advantages is the possibility of managing and maintaining different versions of the same model (trained with different data sets or with different values of its configuration parameters). Services and tools based on the ModelOps methodology generally provide an integrated cloud platform that allows users to manage and implement models using a collaborative and automated workflow, usable both in company production environments and in research environments. Some examples of platforms are:
- Watson Machine Learning (WML): https://www.ibm.com/es-es/cloud/machine-learning
- Azure Machine Learning (AML): https://azure.microsoft.com/es-es/free/machine-learning/
- AWS Machine Learning: https://aws.amazon.com/es/machine-learning/
- Cloud Machine Learning: https://cloud.google.com/products/ai/
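The pipeline-as-acyclic-graph idea described above can be sketched with the Python standard library: tasks are nodes, edges define the control flow, and a topological sort gives a valid execution order. The task names are hypothetical, chosen only to echo the quality controls mentioned earlier:

```python
# Minimal sketch of a ModelOps-style pipeline: an acyclic graph where
# each node is a task and each edge defines the control flow. A
# topological sort yields an order in which every task runs only after
# its upstream dependencies. Requires Python 3.9+ for graphlib.
from graphlib import TopologicalSorter

# Maps each task to the set of tasks it depends on.
pipeline = {
    "ingest": set(),
    "clean": {"ingest"},
    "quality_check": {"clean"},   # e.g. bias / compliance checks
    "train": {"quality_check"},
    "evaluate": {"train"},
    "deploy": {"evaluate"},
}

order = list(TopologicalSorter(pipeline).static_order())
```

Real platforms execute independent branches of such a graph in parallel and attach event handlers to each node; the sort above only captures the ordering constraint that the edges encode.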
A particular life cycle for working with this ModelOps methodology, in this case within the Microsoft ecosystem for Machine Learning, is reflected in the image (1 Azure Machine Learning life cycle).
In this life cycle we can observe the same phases found in the CRISP-DM methodology, such as business understanding, data acquisition and modelling, but they are not executed strictly iteratively: the scalability and flexibility of this methodology and platform make it possible to work on different necessary tasks of the project life cycle in parallel.
Beyond the question of whether these different models improve, or at least do not delay, the time between the milestones of a project and its life cycle, it is necessary to bear in mind that every project, and Data Science projects especially, suffers repeated delays. In general, working with data from largely unknown sources increases uncertainty and the associated risks, with possible delays caused by not knowing the actual format of the data, not knowing the type of source, or simply not noticing that the data are not properly formatted for the tasks executed later. This means that the phases in the life cycle of a Data Science project should not be considered linear; they are highly iterative and cyclic, with large dependencies between the Data Science team and the other teams involved in the project. What should be clear, however, is that the transition from one stage to another should not be made without at least 70% confirmation and compliance with that stage. This will help maintain high reliability in the final result.