In this notebook we use the tidyverse, a collection of packages with many convenient functions for data analysis. In particular, we use the dplyr package for data wrangling, the forcats package to handle categorical data, and the ggplot2 package to display our data.
library(tidyverse)
library(forcats)
A key idea in the tidyverse is piping: passing the result of one function call as the first argument of the next. The tidyverse provides the pipe operator %>% through the magrittr package.
The following example shows a sequence of commands written as nested function calls.
summarize(group_by(arrange(mutate(tb, options ...), options ...), options ...), options ...)
Piping makes these same commands easier to understand.
tb %>%
mutate(options ...) %>%
arrange(options ...) %>%
group_by(options ...) %>%
summarize(options ...)
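To make the comparison concrete, here is a minimal sketch on a small made-up tibble. The data and the names toy, doubled and total are illustrative assumptions, not part of the dataset used in this notebook. Both forms should produce the same result; only the reading order differs.

```r
library(tidyverse)

# Toy data, made up purely for illustration
toy <- tibble(
  group = c("a", "a", "b", "b"),
  value = c(1, 2, 3, 4)
)

# Nested form: has to be read from the inside out
nested <- summarize(
  group_by(mutate(toy, doubled = value * 2), group),
  total = sum(doubled)
)

# Piped form: reads top to bottom, one step per line
piped <- toy %>%
  mutate(doubled = value * 2) %>%
  group_by(group) %>%
  summarize(total = sum(doubled))

# Compare the two results
all.equal(nested, piped)
```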
Whenever you want to do something new in R, search online for what you want to do; there is a good chance a function already exists that does it. Here we use the read_csv function to load the data, with the n_max parameter to read in only the first 1000 rows.
tb <- read_csv('properties_2016.csv', n_max=1000)
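If the data file is not at hand, the effect of n_max can be checked on a throwaway file. The sketch below writes a small made-up CSV to a temporary path and reads back only its first rows. The file contents and the names tmp and small are assumptions for illustration, and show_col_types = FALSE assumes a reasonably recent version of readr.

```r
library(tidyverse)

# Write a small made-up CSV to a temporary file
tmp <- tempfile(fileext = ".csv")
write_csv(tibble(id = 1:50, value = (1:50) * 2), tmp)

# Read only the first 10 data rows; the header row is not counted
small <- read_csv(tmp, n_max = 10, show_col_types = FALSE)

nrow(small)
```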
The first thing we can do is take a look at the data.
tb %>% head(10)
We note that there are a lot of NA values. Let's do something interesting with them by first counting the number of NAs in each column and then plotting the results.
There is more than one way to count the number of NA values; we show a few ways below.
sum(is.na(tb$latitude)) # the base R way
tb$latitude %>% is.na %>% sum # using piping
We can use the summarize_all function from the dplyr package to apply a function to every column of a data.frame (tibble). For example, below we count the number of NAs in each column.
na_nums <- tb %>% summarize_all(. %>% is.na %>% sum)
na_nums
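The same per-column count can be written in other ways. The sketch below shows two alternatives on a small made-up tibble: across(), which recent dplyr versions offer for applying a function to many columns, and colSums() from base R. The data and the names toy, na_across and na_base are illustrative assumptions.

```r
library(tidyverse)

# Toy data with a known number of NAs per column
toy <- tibble(
  x = c(1, NA, 3),
  y = c(NA, NA, "c"),
  z = c(TRUE, FALSE, TRUE)
)

# dplyr: apply the NA count to every column with across()
na_across <- toy %>%
  summarize(across(everything(), ~ sum(is.na(.x))))

# Base R: is.na() on a data frame gives a logical matrix,
# and colSums() counts the TRUE values in each column
na_base <- colSums(is.na(toy))

na_across
na_base
```

Note that across() returns a one-row tibble, while colSums() returns a named numeric vector, so the choice may depend on what the next step expects.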
Now that we have our counts, let's visualize them. Below we create a tibble of the data to plot.
ggdat <- tibble(
col_name = names(na_nums),
num_na = as.numeric(na_nums)
)
ggdat %>% head(10)
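Another way to get from a one-row table of counts to this long layout is pivot_longer from the tidyr package, which is loaded with the tidyverse. The sketch below uses a made-up one-row tibble as a stand-in for na_nums. The column names and values are assumptions for illustration.

```r
library(tidyverse)

# Hypothetical stand-in for na_nums: one row, one count per column
toy_counts <- tibble(col_a = 3L, col_b = 0L, col_c = 7L)

# Turn every column into a (col_name, num_na) row
toy_long <- toy_counts %>%
  pivot_longer(everything(), names_to = "col_name", values_to = "num_na")

toy_long
```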
options(repr.plot.height=4)
ggplot(ggdat) +
geom_col(aes(x=col_name, y=num_na)) +
theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5))
Notice that the values on the x-axis are in alphabetical order. We can change that by first sorting the data in the desired order and then converting the col_name column into a factor whose levels follow the order in which they appear, using fct_inorder.
ggdat_ordered <- tibble(
col_name = names(na_nums),
num_na = as.numeric(na_nums)
) %>% arrange(-num_na) %>% mutate(col_name=fct_inorder(col_name))
ggdat_ordered %>% head(10)
ggplot(ggdat_ordered) +
geom_col(aes(x=col_name, y=num_na)) +
theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5))
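The sort-then-fct_inorder approach above can also be collapsed into one step with fct_reorder from forcats, which orders the levels of a factor by the values of another variable. The following is a minimal sketch on made-up data. The names toy_dat and toy_reordered and the values are assumptions for illustration.

```r
library(tidyverse)

# Toy data, made up purely for illustration
toy_dat <- tibble(
  col_name = c("alpha", "beta", "gamma"),
  num_na = c(5, 20, 10)
)

# Order the factor levels by num_na, largest first
toy_reordered <- toy_dat %>%
  mutate(col_name = fct_reorder(col_name, num_na, .desc = TRUE))

# The levels now follow the counts rather than the alphabet
levels(toy_reordered$col_name)

# The bars are drawn in level order
p <- ggplot(toy_reordered) +
  geom_col(aes(x = col_name, y = num_na)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
p
```

With fct_reorder the rows of the tibble keep their original order; only the factor levels, and therefore the order of the bars, change.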