Turning travel data into tidy data

Turning travel data into tidy data

In the previous posts we walked though the basics of reading in the data, quickly plotting some variables, and creating flags to help with future analysis.  Now we’re going to transform the data in to tidy format so that we have variables in columns, observations in rows, and only one type per dataset.  You can learn more about this format here.

First, we’ll use the library called ‘reshape2’ to pivot the dataset called temps on the variable named “MST”, making it tall instead of wide.  Each row will have the date in the column titled “MST”, a column for variable that contains the name of each variable, and a column for the value associated with each observation:

#Tidy data time
library("reshape2")
tidy_temps<-melt(temps, "MST")

We can check that it worked by using the command ‘head’:
head(tidy_temps)
MST variable value
1 2013-1-1 Max.TemperatureF 30
2 2013-1-2 Max.TemperatureF 31
3 2013-1-3 Max.TemperatureF 25
4 2013-1-4 Max.TemperatureF 37
5 2013-1-5 Max.TemperatureF 49
6 2013-1-6 Max.TemperatureF 41

Next, we’re going to recreate the month variable to use in later analysis:

tidy_temps$Date <- ymd(tidy_temps$MST)
tidy_temps$month <-month(ymd(tidy_temps$Date), label = TRUE, abbr = TRUE)

Finally, we’re going to subset the data so that we’re only looking at the days classified as ‘perfect’ according to the definition in the Creating flag variables in R post.

#Look at just perfect days
perfect<- subset(tidy_temps, tidy_temps$variable=='perfect')

Now that we’re down to just the data we want, we’re going to summarize it into a new table named ‘ps’:

ps <- perfect %>%
group_by(variable, value) %>%
summarise(n=n()) %>%
group_by(variable) %>%
mutate(perc=100*n/sum(n))

The table named ps is just 3 lines by 4 columns -Variable, Value, N, and Percent:
variable value n perc

1 perfect 0 319 80.1507538
2 perfect 1 78 19.5979899
3 perfect 1 0.2512563

Now that we have the data in a summarized format it is very easy to graph it.

#Perfect days
p<- ggplot(ps, aes(x=value, y=perc, label=round((perc), digit=2), fill=value))
p+
geom_bar(stat="identity") +
xlab("Perfect Status, 1=Yes, 0=No")+
ylab("Percent")+
geom_text(aes(label=round(perc)), vjust=-0.2)

Perfect days

This chart tells us that only 20% of the days in the dataset fit the criteria that I’ve used to determine a “perfect” day in Sedona.  Let’s break this out by month to see when they are most likely to occur:

#Summarize data by month
ps <- perfect %>%
group_by(month,value) %>%
summarise(n=n()) %>%
group_by(month) %>%
mutate(perc=100*n/sum(n))

#Perfect days by month
ggplot(ps, aes(x=month, y=perc, fill=value)) +
geom_bar(stat="identity") +
ylab("percent")

Perfect Days by Month

There we have it – the highest proportion of days that meet my ideal criteria fall in October and March.  In October there’s a greater than 50% chance that the weather will be ideal compared to June, July, and August when none of the days meet my ideal criteria. March is close to 50% and April and November have decent odds as well.

 

Leave a Reply

Your email address will not be published. Required fields are marked *