Introduction

In almost any data mining project, about 80% of the work is data cleaning (Dasu and Johnson 2003). That is, getting data ready for analysis. This short blog post aims to shed some light on a small but important part of data cleaning that Wickham (2014) called data tidying. We first outline properties of tidy data and present examples of messy forms in which data can be received then demonstrate how one can go about tidying the messy data.

Data tidying

Data tidying as presented by Wickham (2014) involves structuring datasets into a specific format that makes the analysis of said datasets easier. Once the process is complete the product is a tidy dataset. Tidy data is a standard way of mapping the meaning of a dataset to its structure.

The structure of tidy data

The following are key properties of tidy data as presented by Wickham :

  • Each variable forms a column containing values of the variable for each observation
  • Each observation forms a row containing values of each variable
  • Each type of observational unit forms a table

These properties lead to a type of rectangular data table with rows representing observations and columns representing variables or features. These properties also describe Codd’s 3rd normal form (Codd 1990) but with the focus on one dataset as opposed to many linked datasets common in most relational databases.

Some examples

In some cases data may not be provided in a tidy format required by most data analysis tools, in which case one will be tasked with tidying the data. below we present two examples of such cases and how one may go about tidying the data in R. Tidying is done with methods offered by the tidyr (Wickham, 2015a) package in R.

1. Data in a wide format

Suppose one is given data about the performance of students in two subjects in the format shown in table 1 below

SesothoMathematics
Teboho9070
Sipho5560
Selepe8080
Mabesa7723

The properties that describe each student are Name, Subject and Mark. The structure as shown in table 1 above is thus in violation of the guidelines of tidy data. The column names are values instead of variable names. A tidy version of the data in table 1 would be as depicted by table 2 below

NameSubjectMark
TebohoSesotho90
TebohoMathematics70
SiphoSesotho55
SiphoMathematics60
SelepeSesotho80
SelepeMathematics80
MabesaSesotho77
MabesaMathematics23

1.1 Transforming the data in R

First we read in the data from a text file and confirm its structure

library(readr)
Data<- read_delim("messy.txt", delim=" ",skip=1, col_names=c("Name","Sesotho","Mathematics"))
Data
# A tibble: 4 x 3
  Name   Sesotho Mathematics
  <chr>    <dbl>       <dbl>
1 Teboho      90          70
2 Sipho       55          60
3 Selepe      80          80
4 Mabesa      77          23

The structure is indeed as shown in table 1. We then proceed to tidy the data using the function gather() from the package tidyr as follows:

library(tidyr)
tidyData<- Data %>% gather(Subject,Mark, Sesotho:Mathematics)
tidyData
# A tibble: 8 x 3
  Name   Subject      Mark
  <chr>  <chr>       <dbl>
1 Teboho Sesotho        90
2 Sipho  Sesotho        55
3 Selepe Sesotho        80
4 Mabesa Sesotho        77
5 Teboho Mathematics    70
6 Sipho  Mathematics    60
7 Selepe Mathematics    80
8 Mabesa Mathematics    23

2. Multiple variables stored in one column

Suppose that the dataset presented in table 1 has an extra column Gender_Age giving the gender and age of each student as shown in table 3 below:

SesothoMathematicsGender_Age
Teboho9070Female_20
Sipho5560Male_21
Selepe8080Male_20
Mabesa7723Male_22

2.1 Transforming the data in R

We first read in the data

Data2<-read_delim("messy2.txt", delim=" ",skip=1,
                  col_names=c("Name","Sesotho","Mathematics","Gender_Age"))
Data2
# A tibble: 4 x 4
  Name   Sesotho Mathematics Gender_Age
  <chr>    <dbl>       <dbl> <chr>     
1 Teboho      90          70 Female_20 
2 Sipho       55          60 Male_21   
3 Selepe      80          80 Male_20   
4 Mabesa      77          23 Male_22   

The data is clearly not in a tidy format as the column names are values instead of variable names and the last column contains two values that should be in different columns. Bellow we use the functions gather() and separate() to transform the data:

tidyData2<- gather(Data2,Subject,Mark, Sesotho:Mathematics) %>% separate(Gender_Age, c("Gender","Age"))
tidyData2
# A tibble: 8 x 5
  Name   Gender Age   Subject      Mark
  <chr>  <chr>  <chr> <chr>       <dbl>
1 Teboho Female 20    Sesotho        90
2 Sipho  Male   21    Sesotho        55
3 Selepe Male   20    Sesotho        80
4 Mabesa Male   22    Sesotho        77
5 Teboho Female 20    Mathematics    70
6 Sipho  Male   21    Mathematics    60
7 Selepe Male   20    Mathematics    80
8 Mabesa Male   22    Mathematics    23

The result as seen above is a dataset that is tidy as intended.
 

We end the blog post here noting that data may be received in multiple messy formats and the two cases presented above are by no means exhaustive of said messy formats. The tidyr package does however provide powerful tools that can help, so it is worthwhile exploring the documentation when faced with any tidying task.
 

If you need to get in touch with us, you can contact us here
 

References

Codd EF (1990). The Relational Model for Database Management: Version 2. Addison-Wesley Longman Publishing, Boston.

Dasu T, Johnson T (2003). Exploratory Data Mining and Data Cleaning. John Wiley & Sons.

Wickham, H. (2014). Tidy data. The Journal of Statistical Software, 59.

Wickham, H. (2015a). tidyr: Easily Tidy Data with ‘spread()‘ and ‘gather()‘ Functions. R package version 0.8.2.

Categories: Data Cleaning

Leave a Reply

Your email address will not be published. Required fields are marked *