In almost any data mining project, about 80% of the work is data cleaning (Dasu and Johnson 2003). That is, getting data ready for analysis. This short blog post aims to shed some light on a small but important part of data cleaning that Wickham (2014) called data tidying. We first outline properties of tidy data and present examples of messy forms in which data can be received then demonstrate how one can go about tidying the messy data.
Data tidying as presented by Wickham (2014) involves structuring datasets into a specific format that makes the analysis of said datasets easier. Once the process is complete the product is a tidy dataset. Tidy data is a standard way of mapping the meaning of a dataset to its structure.
The structure of tidy data
The following are key properties of tidy data as presented by Wickham :
- Each variable forms a column containing values of the variable for each observation
- Each observation forms a row containing values of each variable
- Each type of observational unit forms a table
These properties lead to a type of rectangular data table with rows representing observations and columns representing variables or features. These properties also describe Codd’s 3rd normal form (Codd 1990) but with the focus on one dataset as opposed to many linked datasets common in most relational databases.
In some cases data may not be provided in a tidy format required by most data analysis tools, in which case one will be tasked with tidying the data. below we present two examples of such cases and how one may go about tidying the data in R. Tidying is done with methods offered by the tidyr (Wickham, 2015a) package in R.
1. Data in a wide format
Suppose one is given data about the performance of students in two subjects in the format shown in table 1 below
The properties that describe each student are Name, Subject and Mark. The structure as shown in table 1 above is thus in violation of the guidelines of tidy data. The column names are values instead of variable names. A tidy version of the data in table 1 would be as depicted by table 2 below
1.1 Transforming the data in R
First we read in the data from a text file and confirm its structure
library(readr) Data<- read_delim("messy.txt", delim=" ",skip=1, col_names=c("Name","Sesotho","Mathematics")) Data
# A tibble: 4 x 3 Name Sesotho Mathematics <chr> <dbl> <dbl> 1 Teboho 90 70 2 Sipho 55 60 3 Selepe 80 80 4 Mabesa 77 23
The structure is indeed as shown in table 1. We then proceed to tidy the data using the function gather() from the package tidyr as follows:
library(tidyr) tidyData<- Data %>% gather(Subject,Mark, Sesotho:Mathematics) tidyData
# A tibble: 8 x 3 Name Subject Mark <chr> <chr> <dbl> 1 Teboho Sesotho 90 2 Sipho Sesotho 55 3 Selepe Sesotho 80 4 Mabesa Sesotho 77 5 Teboho Mathematics 70 6 Sipho Mathematics 60 7 Selepe Mathematics 80 8 Mabesa Mathematics 23
2. Multiple variables stored in one column
Suppose that the dataset presented in table 1 has an extra column Gender_Age giving the gender and age of each student as shown in table 3 below:
2.1 Transforming the data in R
We first read in the data
Data2<-read_delim("messy2.txt", delim=" ",skip=1, col_names=c("Name","Sesotho","Mathematics","Gender_Age")) Data2
# A tibble: 4 x 4 Name Sesotho Mathematics Gender_Age <chr> <dbl> <dbl> <chr> 1 Teboho 90 70 Female_20 2 Sipho 55 60 Male_21 3 Selepe 80 80 Male_20 4 Mabesa 77 23 Male_22
The data is clearly not in a tidy format as the column names are values instead of variable names and the last column contains two values that should be in different columns. Bellow we use the functions gather() and separate() to transform the data:
tidyData2<- gather(Data2,Subject,Mark, Sesotho:Mathematics) %>% separate(Gender_Age, c("Gender","Age")) tidyData2
# A tibble: 8 x 5 Name Gender Age Subject Mark <chr> <chr> <chr> <chr> <dbl> 1 Teboho Female 20 Sesotho 90 2 Sipho Male 21 Sesotho 55 3 Selepe Male 20 Sesotho 80 4 Mabesa Male 22 Sesotho 77 5 Teboho Female 20 Mathematics 70 6 Sipho Male 21 Mathematics 60 7 Selepe Male 20 Mathematics 80 8 Mabesa Male 22 Mathematics 23
The result as seen above is a dataset that is tidy as intended.
We end the blog post here noting that data may be received in multiple messy formats and the two cases presented above are by no means exhaustive of said messy formats. The tidyr package does however provide powerful tools that can help, so it is worthwhile exploring the documentation when faced with any tidying task.
If you need to get in touch with us, you can contact us here
Codd EF (1990). The Relational Model for Database Management: Version 2. Addison-Wesley Longman Publishing, Boston.
Dasu T, Johnson T (2003). Exploratory Data Mining and Data Cleaning. John Wiley & Sons.
Wickham, H. (2014). Tidy data. The Journal of Statistical Software, 59.
Wickham, H. (2015a). tidyr: Easily Tidy Data with ‘spread()‘ and ‘gather()‘ Functions. R package version 0.8.2.