Today I read an article about an R package named Waffle, which is hands down the best name I’ve heard for an R package. You can learn about the creation of this package here. I had some difficulty finding the article that originally alerted me to this package again, because googling “waffle r” returns an awful lot of actual waffle recipes, and the images can lead to a whole different exploration.
Eventually I was able to find the article about Creating Infographic Style Charts using the R Waffle package, which is a great resource if you’re interested in learning more about this type of analysis. The author goes through all the steps for getting the packages loaded and started, so I’m just going to jump into some different examples of how this package can be used.
Continuing with our Sedona trip planning, I’m going to use this methodology to look into Grand Canyon visitation so I can plan our visit around the busiest time. The National Park Service offers lots of access to its data, and I was able to pull a report of monthly visitation here.
Once I downloaded the CSV, I needed to manually remove the header and add one temporary column name to load the data into R without errors.
First, read the CSV file into R. Be sure to set quote = "\"" so that counts in the thousands, which contain commas, are not parsed out as separate columns.
visits<- read.csv(file="Grand_Canyon_Visitation_2016.csv", header=TRUE, sep=",", quote = "\"", row.names=NULL)
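To see what the quote setting is doing, here is a small sketch that reads a CSV from an inline string instead of a file. The two rows and their counts are made up for illustration and are not from the NPS report; stringsAsFactors = FALSE is my addition so the column comes back as plain text.

```r
# Made-up sample rows: the counts are quoted and contain thousands separators
csv_text <- 'TrafficCounter,TrafficCount
"TRAFFIC COUNT (SITE A)","1,234"
"TRAFFIC COUNT (SITE B)","56,789"'

# With quote = "\"", the comma inside "1,234" is not treated as a field separator
demo <- read.csv(text = csv_text, header = TRUE, sep = ",",
                 quote = "\"", stringsAsFactors = FALSE)

print(ncol(demo))          # number of columns parsed
print(demo$TrafficCount)   # still text at this point, commas included
```

The counts arrive as text with the commas intact, which is why the cleanup step that follows is needed.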
Before we can do any analysis on the data, we need to convert the traffic counts from a factor to a character. Then, the comma has to be removed so that R will be able to convert it to a numeric variable. If you don’t take out the comma, you will get an NA for any number value (and yes, I learned that the hard way!)
# gsub() returns character, so this also covers the factor-to-character step
visits$TrafficCount <- as.numeric(gsub(",", "", visits$TrafficCount))
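Here is a quick sketch of the NA problem on a few made-up values (not rows from the real file), showing the conversion with and without stripping the comma first.

```r
# Made-up counts as they would look straight out of the CSV
raw <- c("1,124", "168,653", "0")

# Converting directly: anything containing a comma becomes NA (with a warning)
direct <- suppressWarnings(as.numeric(raw))

# Stripping the commas first gives usable numbers
clean <- as.numeric(gsub(",", "", raw, fixed = TRUE))

print(direct)
print(clean)
```

Only the value without a comma survives the direct conversion; after gsub() all three convert cleanly.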
Next, we’re going to do some quick exploratory analysis on the visitation by month using a package called Mosaic.
# Summary stats on visits
favstats(visits$TrafficCount)
 min   Q1  median       Q3    max     mean      sd  n missing
   0 1124 19709.5 42604.25 168653 40123.29 53071.1 48       0
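The columns above look like the output of mosaic's favstats(). If you don't have mosaic installed, a similar summary can be sketched in base R. This is my own rough equivalent on a made-up vector, not the code used for the numbers above.

```r
# Made-up values standing in for the monthly traffic counts
x <- c(0, 10, 20, 30, 40)

# Quartiles using R's default quantile method
q <- quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE, na.rm = TRUE)

stats <- c(
  min     = min(x, na.rm = TRUE),
  Q1      = q[1],
  median  = q[2],
  Q3      = q[3],
  max     = max(x, na.rm = TRUE),
  mean    = mean(x, na.rm = TRUE),
  sd      = sd(x, na.rm = TRUE),
  n       = sum(!is.na(x)),    # non-missing values
  missing = sum(is.na(x))      # missing values
)

print(stats)
```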
OK, now we know the range of what we’re dealing with. The max visitors to any one of the locations in any given month is 168,653 and the mean is about 40,123. Let’s use some waffle charts to understand more. First, summarize the data by site using dplyr functions:
sites <- select(visits, TrafficCounter, TrafficCount)
# A tibble: 4 x 2
1 TRAFFIC COUNT (DESERT VIEW)     289792
2 TRAFFIC COUNT (NORTH RIM)       127096
3 TRAFFIC COUNT (SOUTH DISTRICT) 1493761
4 TRAFFIC COUNT (TUWEEP)           15269
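The code shown stops at select(), so the grouping step that produces one total per site isn't visible. A minimal sketch of how totals like these could be computed with dplyr, assuming the counts are already numeric; the toy data and the TotalVisits column name are hypothetical.

```r
library(dplyr)

# Toy data with the same two columns as the post (values are made up)
toy <- data.frame(
  TrafficCounter = c("SITE A", "SITE A", "SITE B"),
  TrafficCount   = c(100, 250, 40),
  stringsAsFactors = FALSE
)

# One row per site, with the counts summed across months
by_site <- toy %>%
  select(TrafficCounter, TrafficCount) %>%
  group_by(TrafficCounter) %>%
  summarise(TotalVisits = sum(TrafficCount))

print(by_site)
```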
Now we have our inputs to the waffle chart and we can build the code. Notice that I have updated the color listing to make the palette mimic the palette available in the viridis package.
sites <- c(`South District`=1493761, `Desert View`=289792,
           `North Rim`=127096, `Tuweep`=15269)
waffle(sites/14000, rows=5, use_glyph = "child", glyph_size = 6,
       title="Total Grand Canyon Visitors by Site, 2016",
       xlab="1 person = 14K people")
And now we have our final chart, which makes it easy to see that the South District accounts for the vast majority of visitors in 2016, roughly 78% of the 1.93 million total.
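To check the chart against the numbers, here is a small base R sketch of the scaling arithmetic using the site totals from above. I round explicitly here; how waffle itself treats fractional values isn't covered in this post, so take the glyph counts as approximate.

```r
# Site totals from the summary table
sites <- c(`South District` = 1493761, `Desert View` = 289792,
           `North Rim` = 127096, `Tuweep` = 15269)

# Approximate number of glyphs per site at 14,000 visitors per glyph
squares <- round(sites / 14000)

# Each site's share of total visitation, as a percentage
share <- round(100 * sites / sum(sites), 1)

print(squares)
print(share)
```

At this scale Tuweep gets about one glyph, so a smaller divisor would be needed to show any detail for the low-traffic sites.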
While I’ve simulated the viridis palette in this chart, there is also an example of making waffle charts from scratch here that lets you embed that palette directly.