In this post, you will learn about the Data Preparation phase of CRISP-DM (Cross-Industry Standard Process for Data Mining), the third stage in the data mining process. The previous post covered the Data Understanding phase.

CRISP-DM Data Preparation Phase

Data Preparation (Step 3)

Select the data

In the third stage of the project, you decide which data you are going to use for analysis. Base your selection criteria on the relevance of the available data to your data mining goals, on the quality of the data at your disposal, and on technical constraints such as limits on data volume or data types. Remember that selecting data covers both the records (rows) and the attributes (columns) of a table.

For each item you include or exclude, list it along with the reason for your decision.
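The selection step can be sketched in a few lines of plain Python. This is only an illustration: the field names, the `MIN_YEAR` cut-off, and the reasons recorded are all assumptions made up for the example, not part of CRISP-DM itself.

```python
# Hypothetical raw records; field names are illustrative only.
records = [
    {"customer_id": 1, "region": "north", "age": 34, "notes": "call back", "year": 2023},
    {"customer_id": 2, "region": "south", "age": 51, "notes": "", "year": 2019},
    {"customer_id": 3, "region": "north", "age": 27, "notes": "vip", "year": 2024},
]

# Column selection: every attribute gets a decision and a documented reason.
attribute_decisions = {
    "customer_id": ("include", "key needed to join with other tables"),
    "region": ("include", "relevant to the data mining goal"),
    "age": ("include", "relevant to the data mining goal"),
    "notes": ("exclude", "free text of poor quality"),
    "year": ("exclude", "only used as a row filter"),
}

# Row selection: keep only recent records (an assumed criterion).
MIN_YEAR = 2022


def select_data(rows, decisions, min_year):
    """Keep rows that pass the year filter and only the included columns."""
    kept_columns = [
        name for name, (decision, _reason) in decisions.items()
        if decision == "include"
    ]
    return [
        {name: row[name] for name in kept_columns}
        for row in rows
        if row["year"] >= min_year
    ]


selected = select_data(records, attribute_decisions, MIN_YEAR)
for row in selected:
    print(row)
```

Keeping the decisions in one dictionary means the rationale for inclusion and exclusion is stored next to the code that applies it.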

Cleaning your data

Cleaning your data is one of the most important stages of data preparation. The goal is to raise the quality of the data to the level required by the analysis techniques you have selected. You may select clean subsets of the data, insert suitable default values, or use more advanced techniques such as estimating missing values by modelling.

Report of data cleaning: Here, you describe the decisions and actions taken to address data quality issues. Consider every transformation made for cleaning purposes and the impact it might have on the analysis results.
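As a rough sketch of cleaning plus reporting, the example below fills a missing number with the mean of the known values and a missing category with a default, and logs each action. Mean imputation is used here as a simple stand-in for estimating missing values by modelling; the store records and field names are hypothetical.

```python
from statistics import mean

# Hypothetical store records with missing values marked as None.
rows = [
    {"store": "A", "floor_space": 1200.0, "mall_type": "strip"},
    {"store": "B", "floor_space": None, "mall_type": None},
    {"store": "C", "floor_space": 800.0, "mall_type": "indoor"},
]

# Each cleaning action is appended here so it can go into the cleaning report.
cleaning_log = []


def clean(rows, log):
    """Return cleaned copies of the rows and record every change in the log."""
    known = [r["floor_space"] for r in rows if r["floor_space"] is not None]
    fill_value = mean(known)  # simple estimate for missing floor space
    cleaned = []
    for original in rows:
        row = dict(original)  # copy so the raw data stays untouched
        if row["floor_space"] is None:
            row["floor_space"] = fill_value
            log.append(f"store {row['store']}: floor_space set to mean {fill_value}")
        if row["mall_type"] is None:
            row["mall_type"] = "unknown"  # assumed default category
            log.append(f"store {row['store']}: mall_type set to default 'unknown'")
        cleaned.append(row)
    return cleaned


cleaned_rows = clean(rows, cleaning_log)
for entry in cleaning_log:
    print(entry)
```

Because imputed values can shift averages and distributions, the log gives you a starting point for assessing how cleaning may affect the analysis results.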

Constructing required data: This task covers constructive data preparation operations, such as producing derived attributes, creating entirely new records, or transforming the values of existing attributes.

Derived attributes: Derived attributes are new attributes constructed from one or more existing attributes in the same record. For instance, you can calculate area from the width and length variables.
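The width and length example might look like this in plain Python. The record layout and the `add_area` helper are hypothetical names chosen for the illustration.

```python
# Hypothetical records with width and length in the same unit.
plots = [
    {"plot_id": 1, "width": 4.0, "length": 5.0},
    {"plot_id": 2, "width": 3.0, "length": 2.5},
]


def add_area(record):
    """Return a copy of the record with a derived 'area' attribute."""
    return {**record, "area": record["width"] * record["length"]}


plots_with_area = [add_area(p) for p in plots]
for plot in plots_with_area:
    print(plot)
```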

Generated records: Here, you describe the creation of entirely new records, such as records for customers who made no purchase during the previous year.
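One possible way to generate such records is to compare the customer list against the purchase table and create a zero-amount record for every customer who does not appear in it. The identifiers, the `generated` flag, and the zero amount are assumptions made for this sketch.

```python
# Hypothetical customer list and last year's purchase table.
customer_ids = [101, 102, 103]
purchases = [
    {"customer_id": 101, "amount": 25.0},
    {"customer_id": 103, "amount": 40.0},
    {"customer_id": 101, "amount": 10.0},
]


def generate_no_purchase_records(customer_ids, purchases):
    """Create one new record for each customer with no purchase on file."""
    buyers = {p["customer_id"] for p in purchases}
    return [
        {"customer_id": cid, "amount": 0.0, "generated": True}
        for cid in customer_ids
        if cid not in buyers
    ]


generated = generate_no_purchase_records(customer_ids, purchases)
print(generated)
```

Flagging the new records makes it possible to tell generated rows from observed ones later in the project.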

Integrated data: This is one of the important steps in the data mining methodology, in which information from multiple databases, tables or records is combined to create new records or values.

Merged data: Merging tables means joining two or more tables that hold different information about the same objects. For instance, in a retail chain, one table may contain the general characteristics of each store, such as type of mall and floor space. Another table might contain the demographics of the surrounding area. Each table contains one record per store. You can merge them into a new table that also has one record per store by combining the fields from the original tables.
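A minimal sketch of the store merge, assuming both tables share a `store_id` key (a hypothetical name, as are the other fields). It behaves like an inner join: stores missing from the second table are dropped.

```python
# Hypothetical tables, each with one record per store.
store_info = [
    {"store_id": 1, "mall_type": "strip", "floor_space": 1200},
    {"store_id": 2, "mall_type": "indoor", "floor_space": 800},
]
demographics = [
    {"store_id": 1, "median_income": 52000, "population": 30000},
    {"store_id": 2, "median_income": 61000, "population": 45000},
]


def merge_on_key(left, right, key):
    """Join two tables on a shared key, keeping only rows present in both."""
    right_by_key = {row[key]: row for row in right}
    merged = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            merged.append({**row, **match})
    return merged


stores = merge_on_key(store_info, demographics, "store_id")
for store in stores:
    print(store)
```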

Aggregations: Aggregations are operations in which you compute new values by summarizing information from multiple records or tables. For instance, you can convert a table of customer purchases that has one record per purchase into a new table with one record per customer. Here, the fields can include the number of purchases, the average purchase amount, the percentage of orders charged to credit cards and so on.
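The purchase summary could be sketched as follows, going from one record per purchase to one record per customer. The field names (`paid_by_card`, `pct_card` and so on) are hypothetical.

```python
from collections import defaultdict

# Hypothetical purchase table: one record per purchase.
purchases = [
    {"customer_id": 1, "amount": 20.0, "paid_by_card": True},
    {"customer_id": 1, "amount": 40.0, "paid_by_card": False},
    {"customer_id": 2, "amount": 15.0, "paid_by_card": True},
]


def aggregate_by_customer(purchases):
    """Summarize purchases into one record per customer."""
    grouped = defaultdict(list)
    for purchase in purchases:
        grouped[purchase["customer_id"]].append(purchase)

    summary = []
    for customer_id, items in grouped.items():
        count = len(items)
        card_orders = sum(1 for item in items if item["paid_by_card"])
        summary.append({
            "customer_id": customer_id,
            "num_purchases": count,
            "avg_amount": sum(item["amount"] for item in items) / count,
            "pct_card": 100.0 * card_orders / count,
        })
    return summary


customer_summary = aggregate_by_customer(purchases)
for row in customer_summary:
    print(row)
```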

The next post covers Phase 4, Modelling, which applies computational learning algorithms to the prepared data. You can reach out to us at PGBS for professional data mining support.