55 10.1 Cleaning And Restructuring Data In Excel
Emese Felvegi; Noreen Brown; Barbara Lave; Julie Romey; Mary Schatz; Diane Shingledecker; and Robert McCarn
Before we can work with our data, we need to make sure it’s valid, accurate, and reliable. In the age of Big Data, companies may spend just as much or more on maintaining the health and cleaning their data as they spend on collecting or purchasing it in the first place. Consider the issues that can stem from missing or wrong values, duplicates, and typos. The validity, accuracy, and reliability of your calculations depend on your ability to keep your data up-to-date. Many estimates show that about 30% of your data may become inaccurate over time (JD Supra, 2019; Strategic DB, 2019) and even small data sets can be costly to clean, let alone files that are tens or hundreds of thousands of records deep – or much more if you are using large scale databases.
There are many data cleaning solutions out there for a wide range of file formats, data volumes, or budgets. However, there are many things we can accomplish using Excel functions and features so that you can process our data quickly and effectively. Instead of purchasing an application, assigning data cleaning to an employee, or hiring a service to scrub your data, for records under a million per sheet, Excel can save you a great deal of time and funds using a variety of functions and features. Table 10.1 shows you some important functions that can help you clean up your data.
CLEAN | Removes all nonprintable characters from text. |
TRIM | Removes all spaces from text except for single spaces between words. |
CONCATENATE | Join two or more text strings into one string. |
LEFT | Returns a string containing a specified number of characters from the left side of a string. |
RIGHT | Returns a string containing a specified number of characters from the right side of a string. |
MID | Returns a specific number of characters from a text string. |
SEARCH | SEARCH returns the number of the character at which a specific character or text string is first found. |
FIND and FINDB | Locate one text string within a second text string. |
UPPER | Converts text to uppercase. |
LOWER | Converts text to lowercase. |
PROPER | Capitalizes the first letter in a text string and any other letters in text that follow any character other than a letter. Converts all other letters to lowercase letters. |
TEXT | Change the way a number appears by applying formatting to it with format codes. |
VALUE | Converts a text string that represents a number to a number. |
Table 10.1 A sample of text and data cleaning functions in Excel.
The following sections show the functions above in action. The Ch10_Data_File contains four sheets. The Documentation sheet notes the sources of our data. Text_FUNC sheet features a variety of common errors you may see in a data set, including line breaks in the wrong place, extra spaces or no spaces in between words, non-printing characters, improperly capitalized or all upper case, all lower case text, ill-formatted data values. The DataGen_Companies sheet contains a set of “dummy” (plausible, but not real) data about companies generated at https://www.generatedata.com/ that the author of this chapter intentionally injected with common errors seen in data in order to unfold and process it for the sake of practicing Excel functions for the Chapter Practice section. The Mockaroo_Cars sheet is a “dummy” dataset about consumers and their addresses generated at https://mockaroo.com/, this data set will be used for the Mail Merge section. Both of these “dummy” data sets are archived here for educational purposes.
Figure 10.1.1 below shows the Text_FUNC sheet with a variety of common errors seen in data you import from other sources. The CONCATENATE & TRIM range is an example of how a single line of text can be created from the contents of three rows by nesting two Excel functions. CONCATENATE on its own will merge the three cells into one, but alone, it does nothing about the extra spaces we see in the text. TRIM will remove all spaces, which means we need to add ” ” in order for Excel to add the needed blank cells in between words.
The LEFT, RIGHT, MID range in columns A:C illustrate another common set of functions used to process data. Oftentimes data comes in large chunks merged together. While we can use the Data > Text to Columns feature with delimiters to tell Excel where we want our data split, the LEFT, RIGHT, MID functions will process data from certain directions depending on where in the string is the text or number we wish to extract. B9 and B10 show a part number we can extract portions of using the MID function into C9, C10. B12 and B13 show course numbers we can extract portions of using the RIGHT and LEFT functions into C12, C13.
Figure 10.1.2 shows the formulas in columns A:C to illustrate the combination of CONCATENATE and TRIM nested in a variety of ways to find the best configuration to output the way we want our text to appear with the syntax for LEFT, RIGHT, and MID showing underneath.
Figure 10.1.3 below shows the formulas in columns F:H to illustrate the different between FIND and SEARCH, as well as show the UPPER, LOWER, PROPER, VALUE and TEXT functions used to produce the contents for data in those ranges.
More Examples
Visit the Official Microsoft site for a list of common text functions in Excel.
Observe the variety of tasks you can achieve by using relatively simple formulas and nested alternatives.
“Note: Although you can use the TEXT function to change formatting, it’s not the only way. You can change the format without a formula by pressing CTRL+1 (or +1 on the Mac), then pick the format you want from the Format Cells > Number dialog (Source).”
Consider possible uses of these functions in order to clean your data. We will revisit these functions and the use of delimiters in the Chapter Practice.
ATTRIBUTION
Chapter by Emese Felvégi. CC BY-NC-SA 3.0. Dummy data sets from https://www.generatedata.com/ and from https://mockaroo.com archived here for educational purposes.