In this report, I'm going to examine the Newton building permits dataset and identify three notable patterns in Newton's housing situation from 2010 to 2018.
## Data Cleaning
library(tidyverse)  # readr, dplyr, tidyr, stringr, ggplot2
library(lubridate)  # mdy() date parsing
newton<-read_csv("newton_fl.csv")
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   Parcel_num = col_integer(),
##   Permit..Number = col_integer(),
##   year = col_integer()
## )
## See spec(...) for full column specifications.
#merge the three pairs of columns that record the same information
newton<-newton %>%
unite(issue.date,
c(Permit..Issue.Date, Issue.Date), sep = " ") %>%
unite(permit.no,
c(Permit..Number, Permit.No), sep = " ") %>%
unite(notes,
c(Note.Text, Notes), sep = " ")
newton$issue.date<-str_remove(newton$issue.date, "NA")
newton$permit.no<-str_remove(newton$permit.no, "NA")
newton$notes<-str_remove(newton$notes, "NA")
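As an aside, the unite-then-strip-"NA" approach above works here, but pasting the literal string "NA" and removing it afterwards can corrupt values that happen to contain those letters. A sketch of an alternative using `dplyr::coalesce()`, on toy columns that mimic the half-filled pairs (assuming, as in the data above, at most one column of each pair is non-missing per row):

```r
library(dplyr)

# toy pair of half-filled columns, mimicking Permit..Number / Permit.No
permit_a <- c(101, NA, 103)
permit_b <- c(NA, 202, NA)

# coalesce() takes the first non-missing value position by position,
# so no "NA" string ever has to be pasted in and stripped back out
coalesce(permit_a, permit_b)  # 101 202 103
```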
#Select useful variables
newton.1<-select(newton,
Parcel_num,
Parcel.Address,
Apply.Date,
Month,
year,
issue.date,
permit.no,
Permit.Type,
Applied.Value,
Fee.Amount,
notes,
X, Y)
#convert applied value and fee from characters to numbers
newton.1$Applied.Value<-parse_number(newton.1$Applied.Value)
## Warning: 6 parsing failures.
## # A tibble: 5 x 4
##     row   col expected actual
##   <int> <int> <chr>    <chr>
## 1 10925    NA a number ###########
## 2 11865    NA a number ###########
## 3 11923    NA a number ###########
## 4 11965    NA a number ###########
## 5 18383    NA a number ##########
## # ... with 1 more row
## See problems(...) for more details.
newton.1$Fee.Amount<-parse_number(newton.1$Fee.Amount)
## Warning: 12 parsing failures.
## # A tibble: 5 x 4
##     row   col expected actual
##   <int> <int> <chr>    <chr>
## 1  8189    NA a number -
## 2 10290    NA a number -
## 3 13336    NA a number #######
## 4 13457    NA a number #######
## 5 13520    NA a number #######
## # ... with 7 more rows
## See problems(...) for more details.
#convert application date and issue date from character to Date
newton.1$Apply.Date<-mdy(newton.1$Apply.Date)
## Warning: 30 failed to parse.
newton.1$issue.date<-mdy(newton.1$issue.date)
## Warning: 411 failed to parse.
#convert the variable year from integer into factor
newton.1$year<-as.factor(newton.1$year)
#normalize full month names to the standard three-letter abbreviations
newton.1$Month<-gsub("April", "Apr", newton.1$Month)
newton.1$Month<-gsub("June", "Jun", newton.1$Month)
newton.1$Month<-gsub("July", "Jul", newton.1$Month)
newton.1$Month<-gsub("Sept", "Sep", newton.1$Month)
newton.1<-mutate(newton.1,
month = match(Month, month.abb))
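The `match(Month, month.abb)` call above relies on R's built-in `month.abb` vector: `match()` returns each value's position in that vector, i.e. the month number, and `NA` for anything that is not an exact three-letter abbreviation, which is why the full names had to be shortened first. A minimal illustration:

```r
# month.abb is the built-in vector c("Jan", "Feb", ..., "Dec");
# match() returns each value's position in it, i.e. the month number,
# and NA for anything that does not match exactly
match(c("Apr", "Dec", "April"), month.abb)  # 4 12 NA
```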
## 1. The change in applied value over the last nine years
In this section, I plot comparative boxplots showing the distribution of applied value for each year from 2010 to 2018.
ggplot(newton.1,
aes(y=log(Applied.Value), x=year))+
geom_boxplot()+
ggtitle("Boxplot of Applied Value (2010-2018)")
## Warning: Removed 521 rows containing non-finite values (stat_boxplot).
The chart above shows the distribution of applied value for each year. Because a handful of permits have very large applied values while many others have an applied value of 0, I take the natural logarithm of the data to correct the skewness before plotting; the 521 removed rows in the warning are the zero or missing values, whose logarithm is not finite. The median and interquartile range look similar every year, which suggests that permit values in Newton have not changed much since 2010. This may also imply that Newton's economy was relatively steady over the last several years.
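As an aside, permits with an applied value of 0 vanish from the boxplot entirely because log(0) is -Inf. If keeping them mattered, `log1p()` is a common alternative transformation; this is a sketch of the transformation only, not a change to the plot above:

```r
vals <- c(0, 10, 1000)
log(vals)    # first value is -Inf, so zero-valued permits get dropped
log1p(vals)  # log(1 + x) is 0 at x = 0, so zeros survive the transform
```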
## 2. Seasonal Trends of Applied Value
Besides the year-over-year trend of applied value, I'm also interested in whether applied value shows a seasonal pattern within the year.
#aggregate the median applied value by month and year
median.data<-aggregate(Applied.Value~month+year,
data=newton.1, median)
median.data<-filter(median.data, year != 2018)
median.data.1<-ts(median.data$Applied.Value,
start = c(2010, 1), frequency = 12)
#aggregate the overall applied value by month and year
sum.data<-aggregate(Applied.Value~month+year,
data=newton.1, sum)
sum.data<-filter(sum.data, year != 2018)
sum.data.1<-ts(sum.data$Applied.Value,
start = c(2010,1), frequency = 12)
#create boxplot
boxplot(median.data.1~cycle(median.data.1),
main="Boxplot of seasonal median applied value",
xlab="month", ylab="median applied value($)")
boxplot(log(sum.data.1)~cycle(sum.data.1),
main="Boxplot of seasonal summed applied value",
xlab="month", ylab="log summed applied value($)")
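The `boxplot(x ~ cycle(x))` idiom above works because `cycle()` labels every observation of a time series with its position inside the period (1 through 12 for monthly data), so grouping on it pools the same calendar month across years. A minimal illustration:

```r
# a two-year monthly series; cycle() recovers each value's month index
x <- ts(1:24, start = c(2010, 1), frequency = 12)
as.integer(head(cycle(x), 13))  # 1 2 3 ... 12 1 -- wraps after December
```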
Both graphs above show seasonal changes in applied value. The x-axes are months; each boxplot shows the distribution of the aggregated applied value for a particular month across the years 2010-2017. The two charts differ only in the aggregation method: the first takes the median applied value for each month, while the second sums the values for each month. They reveal two distinct seasonal patterns. For the median, the first two months are higher than the rest of the year, although the year-to-year variation is distinctly large for January, so that signal may be less meaningful. The medians for May and October are usually among the lowest of the year, while those for July and December are usually among the highest. The picture looks very different when the values are summed: the total applied value in January is the lowest, while May is relatively high. These two charts give a glimpse of how permits' applied value fluctuates in a typical year; however, because the information is highly aggregated, we must be careful in interpreting this trend.
## 3. Word Clouds
In this section, I plot two word clouds from the permit notes for 2010 and 2017 to see what the major reasons for applying for a permit were in each year, and whether they differ.
library(tm)
library(wordcloud)
#first let's check out the permit notes in 2010
newton2010<-filter(newton.1, year==2010)
corpus<-Corpus(VectorSource(newton2010$notes))
corpus<-tm_map(corpus, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(corpus, content_transformer(tolower)):
## transformation drops documents
corpus<-tm_map(corpus, removeNumbers)
## Warning in tm_map.SimpleCorpus(corpus, removeNumbers): transformation drops
## documents
corpus<-tm_map(corpus, removePunctuation)
## Warning in tm_map.SimpleCorpus(corpus, removePunctuation): transformation
## drops documents
corpus<-tm_map(corpus, removeWords,
c("and","the","with","all","one","two","only"))
## Warning in tm_map.SimpleCorpus(corpus, removeWords, c("and", "the",
## "with", : transformation drops documents
corpus<-tm_map(corpus, stripWhitespace)
## Warning in tm_map.SimpleCorpus(corpus, stripWhitespace): transformation
## drops documents
tdm<-TermDocumentMatrix(corpus)
m<-as.matrix(tdm)
v<-sort(rowSums(m), decreasing = T)
d<-data.frame(word = names(v),freq=v)
wordcloud(d$word, d$freq, random.order = F,
rot.per = 0.3, scale = c(4,.5),
max.words = 200, colors = brewer.pal(8, "Dark2"))
#now let's turn to those permit notes in 2017
newton2017<-filter(newton.1, year==2017)
corpus<-Corpus(VectorSource(newton2017$notes))
corpus<-tm_map(corpus, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(corpus, content_transformer(tolower)):
## transformation drops documents
corpus<-tm_map(corpus, removeNumbers)
## Warning in tm_map.SimpleCorpus(corpus, removeNumbers): transformation drops
## documents
corpus<-tm_map(corpus, removePunctuation)
## Warning in tm_map.SimpleCorpus(corpus, removePunctuation): transformation
## drops documents
corpus<-tm_map(corpus, removeWords,
c("and","the","with","all","one","two","new"))
## Warning in tm_map.SimpleCorpus(corpus, removeWords, c("and", "the",
## "with", : transformation drops documents
corpus<-tm_map(corpus, stripWhitespace)
## Warning in tm_map.SimpleCorpus(corpus, stripWhitespace): transformation
## drops documents
tdm<-TermDocumentMatrix(corpus)
m<-as.matrix(tdm)
v<-sort(rowSums(m), decreasing = T)
d<-data.frame(word = names(v),freq=v)
wordcloud(d$word, d$freq, random.order = F,
rot.per = 0.3, scale = c(4,.5),
max.words = 250, colors = brewer.pal(8, "Dark2"))
Several things are worth mentioning about the two word clouds above. First, the high-frequency words are more evenly spread in 2010 than in 2017: in 2017, a handful of words appear disproportionately often while the rest appear only once or twice, indicating a high degree of concentration on certain activities and objects. Second, in 2010 the most frequent words were "new" and "replace"; in 2017, although "replace" still shows up often, most notes mention "install" and "existing". (Note, however, that "new" was removed as a stopword from the 2017 corpus but not the 2010 one, so the two clouds are not strictly comparable on that word.) This suggests that 2010 may have marked the start of a new cycle of housing and building renewal, or even an expansion of the housing market, while in 2017 most permits reflected installations and additions to existing structures rather than major overhauls. Finally, in both years, most permits aim at changes to windows, kitchens, bathrooms, and basements, which hints at the rooms that get the most attention from a typical Newton household. The word clouds are suggestive, but information from other years would be needed to draw a more comprehensive picture of the situation.
## 4. Two strange things about the dataset
The first thing about the Newton building permit dataset that bothers me is that the raw data contain many redundant and unclear variables. Specifically, Parcel_num and Parcel Id are the same variable; the permit number and the issue date are each split across two partially filled columns; and the variable Units is entirely blank.
Second, several rows among the permit records look like some kind of metadata. For example, rows 3306 to 3308 contain no information about any particular permit, yet the fee amount recorded in row 3308 is abnormally high. There are many more like them, and at this point I'm not sure what they really mean.
irredata<-newton.1[is.na(newton.1$Parcel.Address), ]
irredata
## # A tibble: 20 x 14
##    Parcel_num Parcel.Address Apply.Date Month year  issue.date permit.no
##         <int> <chr>          <date>     <chr> <fct> <date>     <chr>
##  1          5 <NA>           NA         Apr   2011  NA         " NA"
##  2        157 <NA>           NA         Apr   2011  NA         " NA"
##  3          7 <NA>           NA         May   2011  NA         " NA"
##  4          5 <NA>           NA         May   2011  NA         " NA"
##  5          1 <NA>           NA         May   2011  NA         " NA"
##  6        179 <NA>           NA         May   2011  NA         " NA"
##  7          6 <NA>           NA         Jun   2011  NA         " NA"
##  8          8 <NA>           NA         Jun   2011  NA         " NA"
##  9        227 <NA>           NA         Jun   2011  NA         " NA"
## 10          5 <NA>           NA         Aug   2011  NA         " NA"
## 11          7 <NA>           NA         Aug   2011  NA         " NA"
## 12        212 <NA>           NA         Aug   2011  NA         " NA"
## 13          8 <NA>           NA         Sep   2011  NA         " NA"
## 14         10 <NA>           NA         Sep   2011  NA         " NA"
## 15        227 <NA>           NA         Sep   2011  NA         " NA"
## 16          5 <NA>           NA         Mar   2012  NA         " NA"
## 17         NA <NA>           NA         Dec   2015  NA         " 1"
## 18          1 <NA>           NA         Feb   2016  NA         " 1"
## 19          1 <NA>           NA         May   2016  NA         " 1"
## 20        449 <NA>           NA         May   2016  NA         " 449"
## # ... with 7 more variables: Permit.Type <chr>, Applied.Value <dbl>,
## #   Fee.Amount <dbl>, notes <chr>, X <chr>, Y <chr>, month <int>