Loading packages¶

In this notebook we will make use of the tidyverse package which comes with a lot of convenient functions for doing data analysis. In particular, we will use the dplyr package for data wrangling, forcats package to easily handle categorical data, and the ggplot2 package to display our data.

library(tidyverse)
library(forcats)

Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ---------------------------------------------------
filter(): dplyr, stats
lag():    dplyr, stats

Piping Commands¶

A key idea in Tidyverse is the that of piping. The tidyverse achieves this using the magrittr package.

The following example of code shows a sequence of commands using nested function calls.

summarize(group_by(arrange(mutate(tb, options ...), options ...), options ...), options ...)

Piping makes these same commands easier to understand.

tb %>%
  mutate(options ...) %>%
  arrange(options ...) %>%
  group_by(options ...) %>%
  summarize(options ...)

Loading the data¶

Whenever you want to do something new in R, just google what you want to do; there is a good chance there is a function out there that does what you want. In this case we used the read_csv function to load the data. We used the n_max parameter to read in only the first 1000 rows.

tb <- read_csv('properties_2016.csv', n_max=1000)

Parsed with column specification:
cols(
  .default = col_integer(),
  architecturalstyletypeid = col_character(),
  bathroomcnt = col_double(),
  bedroomcnt = col_double(),
  calculatedbathnbr = col_double(),
  calculatedfinishedsquarefeet = col_double(),
  fips = col_character(),
  hashottuborspa = col_character(),
  lotsizesquarefeet = col_double(),
  pooltypeid10 = col_character(),
  pooltypeid2 = col_character(),
  propertycountylandusecode = col_character(),
  propertyzoningdesc = col_character(),
  rawcensustractandblock = col_character(),
  roomcnt = col_double(),
  typeconstructiontypeid = col_character(),
  yearbuilt = col_double(),
  fireplaceflag = col_character(),
  structuretaxvaluedollarcnt = col_double(),
  taxvaluedollarcnt = col_double(),
  landtaxvaluedollarcnt = col_double()
  # ... with 3 more columns
)
See spec(...) for full column specifications.

Exploring the Data¶

The first thing we can do is take a look at the data.

tb %>% head(10)

We note that there are a lot of NA values. Let's do something interesting with them by first counting the number of NAs in each column and then plotting the results.

There is more than one way to count the number of NA values; we show a few ways below.

sum(is.na(tb$latitude)) # the base R way
tb$latitude %>% is.na %>% sum # using piping

We can use the summarize_all function from the dplyr package to apply a function to every columnn of a data.frame (tibble). For example, below we count the number of NAs in each column.

na_nums <- tb %>% summarize_all(. %>% is.na %>% sum)
na_nums

Visualizing the Data¶

Now that we have our counts, let's visualize them. Below we create a tibble of the data to plot.

ggdat <- tibble(
    col_name = names(na_nums),
    num_na = as.numeric(na_nums)
) 
ggdat %>% head(10)

options(repr.plot.height=4)
ggplot(ggdat) +
  geom_col(aes(x=col_name, y=num_na)) +
  theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5))

Notice that the values on the x-axis are in alphabetical order. We can change that by first sorting the data in the desired format and then converting the col_name column into an ordered factor.

ggdat_ordered <- tibble(
    col_name = names(na_nums),
    num_na = as.numeric(na_nums)
) %>% arrange(-num_na) %>% mutate(col_name=fct_inorder(col_name))
ggdat_ordered %>% head(10)

ggplot(ggdat_ordered) +
  geom_col(aes(x=col_name, y=num_na)) +
  theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5))

col_name	num_na
parcelid	0
airconditioningtypeid	987
architecturalstyletypeid	1000
basementsqft	994
bathroomcnt	0
bedroomcnt	0
buildingclasstypeid	939
buildingqualitytypeid	940
calculatedbathnbr	289
decktypeid	943

parcelid	airconditioningtypeid	architecturalstyletypeid	basementsqft	buildingclasstypeid	buildingqualitytypeid	calculatedbathnbr	decktypeid	⋯	numberofstories	fireplaceflag	structuretaxvaluedollarcnt	taxvaluedollarcnt	assessmentyear	landtaxvaluedollarcnt	taxamount	taxdelinquencyflag	taxdelinquencyyear	censustractandblock
10754147	NA	NA	NA	NA	NA	NA	NA	⋯	NA	NA	NA	9	2015	9	NA	NA	NA	NA
10759547	NA	NA	NA	NA	NA	NA	NA	⋯	NA	NA	NA	27516	2015	27516	NA	NA	NA	NA
10843547	NA	NA	NA	NA	NA	NA	NA	⋯	NA	NA	650756	1413387	2015	762631	20800.37	NA	NA	NA
10859147	NA	NA	NA	3	7	NA	NA	⋯	1	NA	571346	1156834	2015	585488	14557.57	NA	NA	NA
10879947	NA	NA	NA	4	NA	NA	NA	⋯	NA	NA	193796	433491	2015	239695	5725.17	NA	NA	NA
10898347	NA	NA	NA	4	7	NA	NA	⋯	1	NA	176383	283315	2015	106932	3661.28	NA	NA	NA
10933547	NA	NA	NA	NA	NA	NA	NA	⋯	NA	NA	397945	554573	2015	156628	6773.34	NA	NA	NA
10940747	NA	NA	NA	NA	NA	NA	NA	⋯	1	NA	101998	688486	2015	586488	7857.84	NA	NA	NA
10954547	NA	NA	NA	NA	NA	NA	NA	⋯	NA	NA	NA	9	2015	9	NA	NA	NA	NA
10976347	NA	NA	NA	3	7	NA	NA	⋯	1	NA	218440	261201	2015	42761	4054.76	NA	NA	NA

col_name	num_na
architecturalstyletypeid	1000
hashottuborspa	1000
pooltypeid10	1000
pooltypeid2	1000
typeconstructiontypeid	1000
fireplaceflag	1000
finishedsquarefeet6	999
basementsqft	994
storytypeid	994
taxdelinquencyflag	990