Loading packages

In this notebook we will make use of the tidyverse package which comes with a lot of convenient functions for doing data analysis. In particular, we will use the dplyr package for data wrangling, forcats package to easily handle categorical data, and the ggplot2 package to display our data.

In [1]:
library(tidyverse)
library(forcats)
Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ---------------------------------------------------
filter(): dplyr, stats
lag():    dplyr, stats

Piping Commands

A key idea in Tidyverse is the that of piping. The tidyverse achieves this using the magrittr package.

The following example of code shows a sequence of commands using nested function calls.

summarize(group_by(arrange(mutate(tb, options ...), options ...), options ...), options ...)

Piping makes these same commands easier to understand.

tb %>%
  mutate(options ...) %>%
  arrange(options ...) %>%
  group_by(options ...) %>%
  summarize(options ...)

Loading the data

Whenever you want to do something new in R, just google what you want to do; there is a good chance there is a function out there that does what you want. In this case we used the read_csv function to load the data. We used the n_max parameter to read in only the first 1000 rows.

In [2]:
tb <- read_csv('properties_2016.csv', n_max=1000)
Parsed with column specification:
cols(
  .default = col_integer(),
  architecturalstyletypeid = col_character(),
  bathroomcnt = col_double(),
  bedroomcnt = col_double(),
  calculatedbathnbr = col_double(),
  calculatedfinishedsquarefeet = col_double(),
  fips = col_character(),
  hashottuborspa = col_character(),
  lotsizesquarefeet = col_double(),
  pooltypeid10 = col_character(),
  pooltypeid2 = col_character(),
  propertycountylandusecode = col_character(),
  propertyzoningdesc = col_character(),
  rawcensustractandblock = col_character(),
  roomcnt = col_double(),
  typeconstructiontypeid = col_character(),
  yearbuilt = col_double(),
  fireplaceflag = col_character(),
  structuretaxvaluedollarcnt = col_double(),
  taxvaluedollarcnt = col_double(),
  landtaxvaluedollarcnt = col_double()
  # ... with 3 more columns
)
See spec(...) for full column specifications.

Exploring the Data

The first thing we can do is take a look at the data.

In [3]:
tb %>% head(10)
parcelidairconditioningtypeidarchitecturalstyletypeidbasementsqftbathroomcntbedroomcntbuildingclasstypeidbuildingqualitytypeidcalculatedbathnbrdecktypeid⋯numberofstoriesfireplaceflagstructuretaxvaluedollarcnttaxvaluedollarcntassessmentyearlandtaxvaluedollarcnttaxamounttaxdelinquencyflagtaxdelinquencyyearcensustractandblock
10754147NA NA NA 0 0 NA NA NA NA ⋯ NA NA NA 9 2015 9 NANA NA NA
10759547NA NA NA 0 0 NA NA NA NA ⋯ NA NA NA 27516 2015 27516 NANA NA NA
10843547NA NA NA 0 0 NA NA NA NA ⋯ NA NA 650756 1413387 2015 762631 20800.37NA NA NA
10859147NA NA NA 0 0 3 7 NA NA ⋯ 1 NA 571346 1156834 2015 585488 14557.57NA NA NA
10879947NA NA NA 0 0 4 NA NA NA ⋯ NA NA 193796 433491 2015 239695 5725.17NA NA NA
10898347NA NA NA 0 0 4 7 NA NA ⋯ 1 NA 176383 283315 2015 106932 3661.28NA NA NA
10933547NA NA NA 0 0 NA NA NA NA ⋯ NA NA 397945 554573 2015 156628 6773.34NA NA NA
10940747NA NA NA 0 0 NA NA NA NA ⋯ 1 NA 101998 688486 2015 586488 7857.84NA NA NA
10954547NA NA NA 0 0 NA NA NA NA ⋯ NA NA NA 9 2015 9 NANA NA NA
10976347NA NA NA 0 0 3 7 NA NA ⋯ 1 NA 218440 261201 2015 42761 4054.76NA NA NA

We note that there are a lot of NA values. Let's do something interesting with them by first counting the number of NAs in each column and then plotting the results.

There is more than one way to count the number of NA values; we show a few ways below.

In [4]:
sum(is.na(tb$latitude)) # the base R way
tb$latitude %>% is.na %>% sum # using piping
0
0

We can use the summarize_all function from the dplyr package to apply a function to every columnn of a data.frame (tibble). For example, below we count the number of NAs in each column.

In [5]:
na_nums <- tb %>% summarize_all(. %>% is.na %>% sum)
na_nums
parcelidairconditioningtypeidarchitecturalstyletypeidbasementsqftbathroomcntbedroomcntbuildingclasstypeidbuildingqualitytypeidcalculatedbathnbrdecktypeid⋯numberofstoriesfireplaceflagstructuretaxvaluedollarcnttaxvaluedollarcntassessmentyearlandtaxvaluedollarcnttaxamounttaxdelinquencyflagtaxdelinquencyyearcensustractandblock
0 987 1000994 0 0 939 940 289 943 ⋯ 278 1000163 132 0 136 31 990 990 295

Visualizing the Data

Now that we have our counts, let's visualize them. Below we create a tibble of the data to plot.

In [6]:
ggdat <- tibble(
    col_name = names(na_nums),
    num_na = as.numeric(na_nums)
) 
ggdat %>% head(10)
col_namenum_na
parcelid 0
airconditioningtypeid 987
architecturalstyletypeid1000
basementsqft 994
bathroomcnt 0
bedroomcnt 0
buildingclasstypeid 939
buildingqualitytypeid 940
calculatedbathnbr 289
decktypeid 943
In [7]:
options(repr.plot.height=4)
ggplot(ggdat) +
  geom_col(aes(x=col_name, y=num_na)) +
  theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5))

Notice that the values on the x-axis are in alphabetical order. We can change that by first sorting the data in the desired format and then converting the col_name column into an ordered factor.

In [8]:
ggdat_ordered <- tibble(
    col_name = names(na_nums),
    num_na = as.numeric(na_nums)
) %>% arrange(-num_na) %>% mutate(col_name=fct_inorder(col_name))
ggdat_ordered %>% head(10)
col_namenum_na
architecturalstyletypeid1000
hashottuborspa 1000
pooltypeid10 1000
pooltypeid2 1000
typeconstructiontypeid 1000
fireplaceflag 1000
finishedsquarefeet6 999
basementsqft 994
storytypeid 994
taxdelinquencyflag 990
In [9]:
ggplot(ggdat_ordered) +
  geom_col(aes(x=col_name, y=num_na)) +
  theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5))