This post continues our series on the Cross Industry Standard Process for Data Mining (CRISP-DM) methodology. Following the previous post on Phase 1, Business Understanding, it covers Phase 2: the CRISP-DM data understanding process.
In the second stage, you collect the data listed in the project resources. This includes loading the data where that is needed for data understanding. For instance, if you use a particular tool to understand the data, you first need to load the data into that tool. You can also seek professional assistance for data understanding. If you acquire multiple data sources for your data mining project, you need to decide when and how they should be integrated.
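As a minimal sketch of loading and integrating two sources, the Python example below reads two small CSV tables and joins them on a shared key. The sample data, the field names (`customer_id`, `region`, `amount`), and the helper functions `load_rows` and `left_join` are all hypothetical and only illustrate the idea; your own tool may handle loading and integration differently.

```python
import csv
import io

# Hypothetical sample data standing in for two project sources.
CUSTOMERS_CSV = """customer_id,region
1,North
2,South
3,East
"""

ORDERS_CSV = """order_id,customer_id,amount
100,1,25.50
101,1,10.00
102,2,7.25
103,4,99.00
"""


def load_rows(text):
    """Load CSV text into a list of dicts, one per record."""
    return list(csv.DictReader(io.StringIO(text)))


def left_join(left, right, key):
    """Attach matching right-hand fields to each left-hand row.

    Assumes `key` is unique in `right`. Left rows without a match are
    kept and flagged, so integration gaps stay visible.
    """
    lookup = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = lookup.get(row[key])
        merged = dict(row)
        merged.update(match or {})
        merged["matched"] = match is not None
        joined.append(merged)
    return joined


customers = load_rows(CUSTOMERS_CSV)
orders = load_rows(ORDERS_CSV)
orders_with_region = left_join(orders, customers, "customer_id")

# Orders whose customer is absent from the second source.
unmatched = [r["order_id"] for r in orders_with_region if not r["matched"]]
print(len(orders_with_region), "orders loaded; unmatched order ids:", unmatched)
```

Flagging unmatched rows instead of dropping them is a deliberate choice here: it surfaces integration problems early, which is useful material for the data collection report.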
Initial report for data collection
List all the data sources here, along with their locations and the methods you used to acquire them. Record any problems you encountered while acquiring them, together with the solutions you reached. This record will help when you replicate the project or execute similar projects in future.
Report for data description
In this report, describe the data you have acquired. This covers the ‘surface’ or ‘gross’ properties of the data: the format of the data, the quantity of data (for instance, the number of records and fields in each table), the identities of the fields, and any other surface features. Finally, evaluate whether the acquired data meets your requirements.
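The surface properties described above can be gathered with a few lines of code. The following Python sketch counts records and fields in a table and guesses a coarse type for each field. The sample table and the helpers `infer_type` and `describe` are hypothetical, and the type inference is deliberately simple.

```python
import csv
import io

# Hypothetical sample table; in practice this would be your acquired data.
SAMPLE_CSV = """customer_id,region,age,signup_date
1,North,34,2021-03-01
2,South,,2021-04-15
3,East,51,2021-04-20
"""


def infer_type(values):
    """Guess a coarse field type from its non-empty string values."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "empty"
    for caster, name in ((int, "integer"), (float, "float")):
        try:
            for v in non_empty:
                caster(v)
            return name
        except ValueError:
            continue
    return "text"


def describe(text):
    """Return record count, field count, and an inferred type per field."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    fields = reader.fieldnames or []
    return {
        "records": len(rows),
        "fields": len(fields),
        "types": {f: infer_type([r[f] for r in rows]) for f in fields},
    }


description = describe(SAMPLE_CSV)
print(description)
```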
The data exploration stage addresses data mining questions using reporting techniques and data visualization. These analyses typically cover the following aspects:
- Key attribute distribution
- Relationships between pairs or small numbers of attributes
- Attributes of important sub-populations
- Simple aggregation results
- Simple statistical analyses
These analyses may directly address your data mining goals. They may also contribute to or refine the data description and data quality reports, and they feed into the other data preparation steps needed for further analysis.
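To make the exploration aspects listed above concrete, here is a small Python sketch that computes the distribution of a key attribute, a simple aggregation per sub-population, and the correlation between a pair of attributes. The records, the field names (`region`, `age`, `spend`), and the `pearson` helper are hypothetical; Pearson correlation is just one possible way to look at a pairwise relationship.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical records; field names are illustrative only.
records = [
    {"region": "North", "age": 30, "spend": 100.0},
    {"region": "North", "age": 40, "spend": 140.0},
    {"region": "South", "age": 50, "spend": 180.0},
    {"region": "South", "age": 60, "spend": 220.0},
    {"region": "East", "age": 20, "spend": 90.0},
]

# Distribution of a key attribute.
region_counts = Counter(r["region"] for r in records)

# Simple aggregation per sub-population: mean spend by region.
spend_by_region = defaultdict(list)
for r in records:
    spend_by_region[r["region"]].append(r["spend"])
mean_spend = {region: mean(vals) for region, vals in spend_by_region.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Relationship between a pair of attributes.
age_spend_corr = pearson(
    [r["age"] for r in records],
    [r["spend"] for r in records],
)

print(region_counts)
print(mean_spend)
print(round(age_spend_corr, 3))
```

Results like these are natural inputs to the data exploration report, alongside any plots you produce.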
Report for data exploration
In this report, describe the results of your data exploration, including first findings or initial hypotheses and how they impact the rest of the project. You may also include plots and graphs here to indicate data characteristics that suggest further examination of interesting subsets of the data.
Verifying quality of data
Data quality verification is one of the most important tasks in data mining. Examine the quality of the data, addressing questions such as:
- Is the data complete, covering all the required cases?
- Is the data correct, or does it contain errors? If it does, how common are they?
- Does the data contain missing values? If so, where do they occur, how are they represented, and how common are they?
In the data quality report, list the results of the data quality verification. If quality problems exist, suggest possible remedies. These solutions generally depend heavily on knowledge of both the data and the business.
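The missing-value questions above (where they occur, how they are represented, how common they are) lend themselves to a quick check. The Python sketch below counts missing values per field and records which markers represent them. The sample table, the `MISSING_MARKERS` set, and the `missing_value_report` helper are assumptions for illustration; the markers in your own data may differ.

```python
import csv
import io
from collections import Counter

# Hypothetical sample; missing values appear as "", "NA", and "?".
SAMPLE_CSV = """customer_id,region,age
1,North,34
2,,NA
3,East,?
4,South,
"""

# Assumed set of markers that stand for "missing" in this data.
MISSING_MARKERS = {"", "NA", "N/A", "?", "null"}


def missing_value_report(text, markers=MISSING_MARKERS):
    """Per field: how many values are missing and which markers represent them."""
    rows = list(csv.DictReader(io.StringIO(text)))
    report = {}
    for field in rows[0].keys() if rows else []:
        found = Counter(r[field] for r in rows if r[field] in markers)
        missing = sum(found.values())
        report[field] = {
            "missing": missing,              # how many values are missing
            "share": missing / len(rows),    # how common they are
            "markers": dict(found),          # how they are represented
        }
    return report


quality = missing_value_report(SAMPLE_CSV)
for field, stats in quality.items():
    print(field, stats)
```

A check like this only covers missing values. Judging completeness and correctness still requires the business knowledge mentioned above.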
In summary, data understanding focuses on studying and comprehending the data available to the project.
If you have data mining questions, you can seek our assistance.
The next post covers Phase 3, which deals with data preparation. There, you will learn in detail about data analysis and feature selection.