Unwanted data is something that we need to reject and eliminate. But what actually is bad data? Well, it includes info that is incorrect or inappropriate for the current context. This also embraces info that is obsolete as well. Addresses and titles which are no longer valid, old phone numbers and corrupted ones add to the collection. Marketing efforts take a hit when bad info is found in between reliable sources. You cannot afford to miss informative advertisements and messages targeted at prospective customers. They will become uncomfortable and reluctant if irrelevant or inaccurate information reaches them. To makes things worse some of the marketing efforts never reach customers due to unavailable phone numbers or incorrect addresses.

data cleaning in data mining

Data cleansing techniques for deleting unwanted data

Companies instruct their marketing staff to simply go on with the canvassing process until you find errors, fix them and proceed. It is just a temporary patch and not a solution to the issue. Bad info requires identification and removal which makes campaigns more successful. Fields have to be checked to validate if they contain valid data or not. Improving its quality using any of the most operative data cleansing techniques is more important than anything else.

The five steps cleansing processing method is a quick approach. It begins with creating an invalid and a valid data file to start with. You can now move forward and check for invalid characters within fields. Valid info can then be moved to valid file and invalid ones to the other file. Outside vendors can also help you with data cleansing services since it is time-consuming and requires expertise. This is useful especially before a new campaigning pattern is ready and set. However, the whole of your efforts will be ruined if you rely on an unprofessional service provider. So, be careful before you hire anyone.

These splendid methods used for cleansing may not be enough to clean a huge database but as a starter, it will surely help your organization to move forward. Removal of bad data not only helps organizations to reach prospective customers but it also increases the efficiency of the overall system making it more operative. Good info which is left after cleansing will be dependable and can be aimed for future marketing resource generation. Analysis and research trends can point to inefficiency which can be further validated. Now we can check the step by step process for data cleansing in excel.

Data Cleansing Methods in Excel

Excel is a versatile spreadsheet program that helps you carry out extensive calculations, enter data, and use customized or in-built functions easily. This Excel data cleaning guide would help you understand ways to clean data in easy to follow steps. Let’s get started.

1) Removing Extra Spaces Among Inputs

Let us look at an example. The below text has been inputted in different ways with varying spaces.

How to cleanse data

How       to cleanse data

How       to cleanse          data

How                                         to cleanse data

The first line is the accepted input method with single space among words. In the remaining three lines, the gap between words is more with leading as well as trailing spaces. Even after the last word has been typed, few spaces have been given in the last sentence.

Often, when data is imported from other sources, from a text file or a database, such irregularities show up.

‘Trim’ function is used for cleaning this data for eliminating the unwanted additional spaces. The syntax is:

TRIM(Text)

The text argument can be typed manually or can be referred to a cell carrying input text. Suppose, any of the last three aforesaid statements is entered in cell A1. In the argument part, use cell reference A1.

The trim function would eliminate additional, leading and trailing spaces between words and leave just one space between the words, as is the standard practice. Just practice with an above set of examples and you would find that Trim has gotten rid of all unwanted spaces automatically for you.

2) Selecting and entering data in blank cells together

If you have copied content from cell(s) and want to use only the text without the formulas or formatting, you need to paste the values only using ‘Paste Special’ option. If you have gaps in the dataset, you can address them collectively in Excel.

Suppose, you have built a table carrying employee names and number of leaves availed of by them in a month. If an employee has not taken single leave, there would appear a gap in dataset against his entry. If you don’t want to leave empty spaces in the dataset, you can insert ‘Not Availed’ against all gaps.

One option is navigating to each individual cell and manually typing ‘Not Availed’. This would be a tough task if the dataset is large. To accomplish the task,

  • You need to get the complete data set selected
  • From the top ribbon, choose ‘Find and Select’ (the menu with magnifier icon)
  • From its drop-down, select ‘Go to Special’ to reach its dialog box
  • Alternately, you can press F5 on the keyboard
  • A blank cell reference dialog box opens
  • Click on ‘Special’ button to reach the special dialog box
  • Click ‘Blanks’ and press OK
  • All blank cells within dataset would be selected and get highlighted in grey
  • The first blank cell would be in white to indicate that it is active
  • Type ‘Not Availed’ and then press Ctrl+Enter
  • As soon as the command is pressed, ‘Not Availed’ would be copied in all blank cells

3) Converting ‘Numbers Stored as Text’ into Numbers

Please take a look at the below pattern of number entry in excel cells.

456

456

‘456

The ‘456’ is right aligned in the cell, in the first instance, which is normal for numbers. Text always is left aligned whereas the number is right aligned in Excel cells. In the remaining two instances, you would find that the numbers are left aligned. This implies that the numbers are stored as text.

For converting the last two entries into numbers, you can follow any of the two available ways. In the second case, you can

  • Right-click the cell and move to ‘Format Cells’
  • Select ‘General’
  • Or, you can navigate to formatting box and type ‘General’ and press ‘Enter’
  • The left aligned 2nd entry above would be converted into the number

For the 3rd entry above, a leading apostrophe is used for entering the number as text in the cell. To get it converted back to number

  • Click on any blank cell and type 1 in it
  • Click the ‘456 and copy it
  • Move to the cell with ‘1’ entered in it
  • Choose ‘Paste Special’ option from ‘Paste’ to open the dialog box
  • From the ‘operation’ category, choose ‘multiply’ and press OK
  • This action would get the entry ‘456 multiplied by 1
  • Any number multiplied by 1 would not undergo any change in value
  • But the leading apostrophe is removed and the text is converted into number format

4) Removing Duplicates

Suppose, you have a large dataset that is carrying multiple entries of same members with values. If you wish to eliminate the duplicates of individual members, you can do so in two ways.

  • In the first method, you need to use ‘Conditional Formatting’
  • Select the entire data set
  • Navigate to ‘Home’ and select ‘Conditional Formatting’
  • From the drop-down menu, click on ‘Highlight Cell Rules’
  • Navigate down to ‘Duplicate Values’ from the corresponding drop-down menu
  • Once you select this, all duplicate values would be highlighted along with formatting
  • A dialog box would also appear with the highlight formatting option
  • The default cell formatting is ‘Red fil with Dark red Text’; retain this option
  • Click OK and all the duplicates would be highlighted
  • All members of the dataset with names and corresponding duplicate entries are highlighted
  • Any entry that appears twice would be highlighted
  • Select the duplicate entries that you want to remove
  • Manually delete the entries from the dataset

There is another method to remove the duplicates from the dataset.

  • Select the complete dataset
  • In the top ribbon, navigate to ‘Data’ tab
  • Here, you would come across ‘Remove Duplicates’ icon
  • The icon is represented by two columns of 5 cells each with an intervening arrow
  • Click the option
  • The ‘Remove Duplicate’ dialog box is opened
  • If your dataset has headers, check the ‘My Data Has Headers’ option
  • If you don’t select this, the first row would also be considered as part of the dataset
  • All the columns in your dataset would appear in the dialog box
  • Check marking option would be available in front of each column
  • Check the columns which you want to check for the presence of duplicate data
  • Selecting and clicking on Ok would remove all duplicate values from individual columns
  • Only the unique values are left behind
  • For qualifying to get removed, the entries have to be exact duplicates in selected columns
  • If multiple columns are selected, duplicate values are scanned for row-wise
  • If all the entries in a given row are duplicate of another row in the dataset
  • Then only it is deleted
  • If you choose individual columns, all duplicate entries in that column are removed

5) Highlighting the errors

While carrying out data preprocessing in Excel, we have to check for errors and get them highlighted for removal. Suppose, you want to work on the dataset of 6 corporations. You have the revenues generated by them for 4 years and the corresponding net income values. With the available figures, you need to calculate the net income margin i.e. net income divided by revenues.

It is possible for errors to be present in the data for one or more corporations. This can be possible if the revenue or income figures for the companies are not present. We would get a ‘division error’ when trying to divide net income by revenues. The problem becomes serious when the dataset is huge and manually segregating errors and setting them right are a challenge. You can handle large datasets for error highlighting by two methods.

  • In the first method, you can use ‘Conditional Formatting’
  • Get the complete dataset selected
  • In the Home tab on the top ribbon, click on ‘Conditional Formatting’
  • From the drop-down menu, select ‘New Rule’ option
  • ‘New Formatting Rule’ dialog box would open up
  • Click on 2nd option i.e. ‘Format only cells that contain’
  • Select ‘errors’ from the drop-down under ‘Format only cells with’ highlighted in bold
  • This option would give you the chance to format those cells which carry errors
  • Next, to the Preview box space, you have the Format option
  • Click on it and choose the ‘Font Style’ and ‘Strikethrough’ option for error values
  • You can select the color in which an error cell would be highlighted
  • Click Ok
  • All cells which carry errors would be highlighted in your desired format within the dataset

There is also another way to accomplish this. For selecting those cells which contain errors,

  • Select the entire dataset that you want to work with
  • Press F5 on the keyboard
  • This would make ‘Go To’ dialog box to appear
  • Click on ‘Special’ button
  • The ‘Go To special’ dialog box would open up
  • Select Formulas
  • The moment you select ‘Formulas’, 4 options underneath it would get activated
  • Uncheck the first 3 options
  • Just check the ‘errors’ option and click on OK
  • The entire array of cells which contain errors would get selected and highlighted
  • You can manually remove the errors in the individual cells
  • Or, you may type something such as ‘Not Available’ and then press Ctrl+Enter
  • Pressing the button duo would get the ‘Not Available’ message replicated in all cells
  • All cells imply the cells which earlies had errors in them

6) Converting Text to Various Cases (Lower, Upper, Proper)

You can have text entered in the cells in various ways. Sometimes, all the characters in the text would be in uppercase or they may be in lower case. In certain cells, the individual characters or words may be in the upper and lowercase mix. In order to ensure consistency in text presentation, three formulas are available for proper formatting.

  • Getting all textual content converted into lowercase

You need to use the ‘Lower’ function. The syntax is LOWER() and the argument can be single or many words or cell reference. Enter ‘GOOD GOD’ and it would be converted to ‘good god’.

  • Getting all textual content converted into uppercase

You need to use the ‘Upper’ function. The syntax is UPPER() and the argument can be single or many words or cell reference. Enter ‘good god’ and it would be converted to ‘GOOD GOD’.

  • Getting all textual content converted into a proper case

You need to use the ‘Proper’ function. The syntax is PROPER() and the argument can be single or many words or cell reference. Enter ‘gOOd GOD’ and it would be converted to ‘Good God’. The textual matter would be converted into sentence case with each word in the sentence having its first letter converted to a capital letter.

If you have many cells underneath the cell on which you have applied the formula, you just need to hit Ctrl+Enter after entering the formula. Dragging would replicate the formula in all selected cells.

With the aforesaid formulas, you can easily introduce consistency in all the text contained in the workbook. If you are going to use a workbook shared with you by others or you have to work on an excel file with text entered by the different person, you are going to experience the problem of inconsistency in cell formatting. These three formulas would return consistency to text appearance instantly.

7) Checking the Spelling

When you are working with a large dataset, you often would come across text entries. In MS Word and MS Powerpoint, you can easily check if the text contains spelling error, grammatical mistake or any other problem in it. The Word and PowerPoint programs highlight the errors with underlines of various colors. This feature is missing in MS Excel.

Despite this shortcoming, you can resort to another method in MS Excel for checking spelling related errors and setting them right.

  • Select the entire dataset for which you want to run the spell check
  • Press F7 on Keyboard
  • This would bring up the Spelling: Language dialog box
  • This feature is the same that you come across in Word or PowerPoint
  • All words that Excel would consider as having spelling mistake would be shown
  • Corresponding suggestions for correcting the same with standard words will be shown
  • Once you have adopted the changes or discarded the same, the spell check is over
  • ‘The Spell Check is Complete: You’re Good to Go’ box would appear
  • Click OK and your content is spelling error free

8) Deleting the Formatting

When considering how to clean data in Excel, you have to deal with various types of formats which you want to be removed quickly. Suppose, you have a worksheet carrying various formatting types which you want cleared to give uniform formatting. To do so,

  • Select the complete set of data
  • Navigate to Home tab
  • In the ‘Tell Me What To Do’ box, enter Clear
  • From Clear, choose the ‘Clear Formats’ option
  • All differing formats would be cleared off the worksheet
  • The ‘Clear all’ option would also do the same thing
  • But, by exercising this option, the content in the selected cells would also be removed
  • You can also use three additional Clear options
  • One would clear the comments, other all the hyperlinks, and the third only content

Other Data Cleansing Tools to Try

Apart from MS Excel, you have various open-source tools for carrying out data modeling, analysis and cleansing. Some of the prominent tools are

  • JASP

It is an open source statistical software that runs with the help of COS and carries features like that of SPSS

  • Rattle

This graphical user interface facilitates using R for machine learning with relative ease

  • RapidMiner

This is a machine learning package with a user friendly interface

  • Orange

Using this graphical user interface, which is open source, you can use Python for machine learning

  • Talend Data Preparation

It can be used for cleaning data as well as preparing the same with smarts

  • Trifacta Wranger

This tool is also employed for cleaning data. You can also avail of ‘match by example’ feature which is unique to this tool.

All the aforesaid tools are open source and free to use. You can complement your understanding of Excel with the usage of these tools for better analyzing, modeling and cleaning large datasets.

Advantages of Data Cleansing

The benefits of getting the data cleaned are numerous. Some of them have been enlisted below:

  • All serious errors in the data are removed
  • Inconsistencies in formatting are eliminated
  • When data is collected from multiple sources, errors are gotten rid of
  • Data from multiple sources can be reconciled and used together as a single set
  • Efficiency of data usage can be enhanced significantly
  • Any information from the data can be extracted at ease
  • Customers and employers can be kept content with proper information
  • You can easily integrate various functions in the cleaned data
  • Your objectives related to data can be materialized effectively

Conclusion

In Data Science, cleaning data is an integral element. The process can be fragmented into four subprocesses for ease of understanding. The data is initially collected. It is then subjected to cleansing. The cleaned data is analyzed and modeled. The final result is then published for the intended users.

It is important to understand the gravity of data cleaning process. Skipping any step related to it will make you come across problems. For making the raw data compatible with data analytics tools like Python or R, you need to use proper cleaning techniques.