Analyzing Weather Data in R programming

Now that we’ve completed all of the prep work (see Reading Data), we are ready to begin analyzing the temperature data we’ve brought into R. To start, we need to format a few of the existing fields and create some new ones.

We begin by telling R that the date field really is a date. The code below creates a new field, “Date”, in the temps table by converting the existing field, “MST”, using the year/month/day format. Notice that we’re using another library, this one called “lubridate”. If you don’t already have lubridate, check out the last post on installing packages (see Getting Started).

#Format date
library(lubridate)
temps$Date <- ymd(temps$MST)

It’s always a good idea to double-check that a newly created field behaves as you expect. Typing “temps$Date” at the console prompt should bring up a list of dates.
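If the full list is too long to eyeball, a few targeted checks do the same job. Here is a minimal sketch; the three-row `temps` table is made up for illustration, and I'm assuming the `MST` field holds text such as "2013-1-1".

```r
library(lubridate)

# Hypothetical stand-in for the real temps table (values invented for illustration)
temps <- data.frame(MST = c("2013-1-1", "2013-1-2", "2013-7-15"),
                    stringsAsFactors = FALSE)

temps$Date <- ymd(temps$MST)

head(temps$Date)          # first few parsed dates
class(temps$Date)         # should report "Date"
sum(is.na(temps$Date))    # number of values that failed to parse
```

A non-zero count on the last line would mean some rows did not match the year/month/day pattern and need a closer look.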

Using a lubridate function again, we are going to create a second date variable that displays the month of each date.

temps$month <- month(temps$Date)

If you view the new variable you’ll see numbers ranging from 1-12. However, since we’re going to be using this variable to create graphs, I’d rather see the abbreviated month name. A simple tweak to the code gives us this result instead of the numeric value:

temps$month <- month(temps$Date, label = TRUE, abbr = TRUE)
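One detail worth knowing before graphing: with `label = TRUE`, lubridate returns an ordered factor rather than plain text, which is why the months will later appear in calendar order instead of alphabetically. A small sketch on made-up dates (not the Sedona data):

```r
library(lubridate)

# Three invented dates, one each in January, July and December
dates <- ymd(c("2013-01-15", "2013-07-04", "2013-12-25"))

m <- month(dates, label = TRUE, abbr = TRUE)

is.ordered(m)   # TRUE: the labels keep calendar order
levels(m)       # all twelve month abbreviations, in the session's locale
table(m)        # counts per month, including months with no data
```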

And now, it’s finally time to run some analysis. Let’s start with some summary statistics on mean temperature, which R’s summary() function reports for the Mean.TemperatureF field:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   4.00   34.00   43.00   45.09   60.00   75.00       1
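The NA's entry means one day has no recorded mean temperature. summary() sets that day aside on its own, but functions like mean() return NA unless you tell them to drop missing values. A quick sketch on a made-up vector (not the Sedona data):

```r
# Made-up temperatures with one missing reading
x <- c(30, 45, NA, 60)

summary(x)              # reports the NA separately
mean(x)                 # NA, because one value is missing
mean(x, na.rm = TRUE)   # 45, computed from the three recorded values
```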

The results confirm my initial guess that there’s plenty of variation. Average temps in Sedona range from a minimum of 4°F to a maximum of 75°F, with an average of 45°F. This gives us a good overall understanding, but what I really want is average temp by month. A box plot is a great way to compare the ranges in temperature for different months:

boxplot(Mean.TemperatureF ~ month, data = temps, col = "grey",
        main = "Average Temperature Range in Sedona by Month",
        ylab = "Average Temp", xlab = "Month")
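If you'd also like the monthly averages as plain numbers to go with the plot, base R's tapply() is one way to get them. This is just a sketch: the four-row table is invented, and only the field names match the ones used in this post.

```r
# Hypothetical miniature of the temps table; values are invented
temps <- data.frame(
  month = factor(c("Jan", "Jan", "Jul", "Jul"), levels = c("Jan", "Jul")),
  Mean.TemperatureF = c(30, 40, 64, NA)
)

# Average of the daily means for each month, ignoring missing days
monthly <- tapply(temps$Mean.TemperatureF, temps$month, mean, na.rm = TRUE)
monthly
```

On the real table the result would have one entry per month, in the same order as the boxes in the plot.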

I like box plots because they tell you so much in one view. The grey shaded box covers the middle 50% of the data points, from the first quartile to the third. You can interpret it as the expected range for normal average temps that month. For example, the expected range in January is mid-twenties to upper thirties and the expected range of average temps in July is all in the mid-sixties.

The size of the grey box tells you how large the range is, so you can see there’s a lot more variation in the winter months than there is in the summer months. Next, the “whiskers”, which are the lines extending from the box, show you the outer ends of the typical range: by default, R draws them out to the most extreme data point that lies within 1.5 times the box height of the box. Anything beyond the whiskers is drawn as a separate dot and flagged as a potential outlier, meaning an unusually warm or cold day for that month. There are a few dots that extend beyond these ranges, but as we would expect, most of the temps fall into the expected statistical range.
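To see exactly where the box, whiskers and dots come from, base R's boxplot.stats() returns the numbers a box plot is drawn from. Here is a small sketch on made-up values with one obvious extreme (not the Sedona data):

```r
# Made-up values with one obvious extreme
x <- c(1, 2, 3, 4, 5, 100)

s <- boxplot.stats(x)   # uses the same default whisker rule (coef = 1.5) as boxplot()

s$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
s$out     # points beyond the whiskers, drawn as individual dots
```

Here the box runs from 2 to 5, so the upper whisker can reach at most 5 + 1.5 × 3 = 9.5. It stops at 5, the largest value inside that limit, and 100 is reported as an outlier.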

One thing that can help with interpreting the results is to add some guidelines for desired temps. Adding on to the graphing code allows me to highlight the upper and lower limit of the average temps I’d like to have if we travel to Sedona:

boxplot(Mean.TemperatureF ~ month, data = temps, col = "grey",
        main = "Average Temperature Range in Sedona by Month",
        ylab = "Average Temp", xlab = "Month")
#Upper limit of the desired range
abline(h = 72, col = "red")
#Lower limit of the desired range
abline(h = 55, col = "blue")
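The two lines can also be backed up with a number: the share of days in each month whose average temp falls between them. The sketch below is one way to do that with tapply(); the four-row table is invented, and the names `lower`, `upper`, `in_range` and `share` are just my own labels for illustration.

```r
# Hypothetical miniature of the temps table; values are invented
temps <- data.frame(
  month = factor(c("Jan", "Jan", "Jul", "Jul"), levels = c("Jan", "Jul")),
  Mean.TemperatureF = c(30, 56, 64, 70)
)

lower <- 55   # blue line in the plot
upper <- 72   # red line in the plot

# TRUE for each day whose average temp sits inside the desired band
in_range <- temps$Mean.TemperatureF >= lower & temps$Mean.TemperatureF <= upper

# Proportion of days per month inside the band (the mean of TRUE/FALSE values)
share <- tapply(in_range, temps$month, mean, na.rm = TRUE)
share
```

In this toy table half of the January days and all of the July days land in the band; on the real data the months with the highest shares should match the boxes that sit between the two lines.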

There we have it: a quick view to tell me that June through September are the months most likely to have average temps in the desired range. Check out Improving Travel Analysis Graphs in R to see how to customize these charts.
