Take Home Exercise 01

In this Take Home Exercise, I will explore the demographic of the city of Engagement, Ohio USA.

JUNQIU NI (MITB)
2022-04-24

1. Overview

In this take home exercise, I will reveal the demographic of the city of Engagement, Ohio USA by using a patchwork of pie charts to visualize the distribution of participants and a Diverging stacked bar chart to visualize the joviality across different groups of participants.

2. Getting Started

Before we get started, it is important to ensure that the R packages have been installed. If yes, we will load the R packages. If they have yet to be installed, we will install the R packages and load them onto R environment.

The chunk code below will do the trick.

packages = c('tidyverse',"HH","vcd",'knitr','ggdist', 'ggridges',
             'patchwork', 'ggthemes', 'hrbrthemes',
             'ggrepel', 'ggforce')

for(p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}

3. Importing Data

The code chunk below import “Participants.csv” from the data folder into R by using read_csv() of readr and save it as an tibble data frame called “participant_data”.

participant_data <- read_csv("data/Participants.csv")

4. Data Wrangling

We group the age data into 4 groups including “18-28”, “29-38”, “39-49” and “50-60”. Also, we group joviality data into 4 groups “0.2-0.4”, “0.4-0.6”, “0.6-0.8” and “0.8-1.0”.

participant_data["age_group"] = cut(participant_data$age, c(18,29,39,50,60),
                                    c("18-28","29-38","39-49","50-60"), include.lowest=TRUE)
participant_data["joviality_group"] = cut(participant_data$joviality, c(0,0.2,0.4,0.6,0.8,1), 
                                          c("0-0.2", "0.2-0.4","0.4-0.6","0.6-0.8","0.8-1.0"), include.lowest=TRUE)
participant_data
# A tibble: 1,011 × 9
   participantId householdSize haveKids   age educationLevel     
           <dbl>         <dbl> <lgl>    <dbl> <chr>              
 1             0             3 TRUE        36 HighSchoolOrCollege
 2             1             3 TRUE        25 HighSchoolOrCollege
 3             2             3 TRUE        35 HighSchoolOrCollege
 4             3             3 TRUE        21 HighSchoolOrCollege
 5             4             3 TRUE        43 Bachelors          
 6             5             3 TRUE        32 HighSchoolOrCollege
 7             6             3 TRUE        26 HighSchoolOrCollege
 8             7             3 TRUE        27 Bachelors          
 9             8             3 TRUE        20 Bachelors          
10             9             3 TRUE        35 Bachelors          
# … with 1,001 more rows, and 4 more variables: interestGroup <chr>,
#   joviality <dbl>, age_group <fct>, joviality_group <fct>

We calculate the number of population in different degree of joviality by different groups including householdSize, haveKids, age_group, educationLevel, and interestGroup.

t1=table(participant_data$householdSize,participant_data$joviality_group)
t2=table(participant_data$haveKids,participant_data$joviality_group)
t3=table(participant_data$age_group,participant_data$joviality_group)
t4=table(participant_data$educationLevel,participant_data$joviality_group)
t5=table(participant_data$interestGroup,participant_data$joviality_group)

participant=as.data.frame(rbind(t1,t2,t3,t4,t5))
participant <- cbind(Type = rownames(participant), participant)
rownames(participant) <- 1:nrow(participant)

participant["Subtable"] <- c("householdSize","householdSize","householdSize",
                          "haveKids","haveKids",
                          "age_group","age_group","age_group","age_group",
                        "educationLevel","educationLevel","educationLevel","educationLevel",
                        "interestGroup","interestGroup","interestGroup","interestGroup","interestGroup",
                      "interestGroup","interestGroup","interestGroup","interestGroup","interestGroup")

participant=participant %>%
  bind_rows(summarise_all(., ~if(is.numeric(.)) sum(.) else "Total"))

The below participant table shows the distribution of participants of different groups.

kable(participant)
Type 0-0.2 0.2-0.4 0.4-0.6 0.6-0.8 0.8-1.0 Subtable
1 78 76 58 51 74 householdSize
2 70 83 70 81 69 householdSize
3 59 57 67 55 63 householdSize
FALSE 148 159 128 132 143 haveKids
TRUE 59 57 67 55 63 haveKids
18-28 52 47 54 61 59 age_group
29-38 48 55 40 39 62 age_group
39-49 58 53 58 47 46 age_group
50-60 49 61 43 40 39 age_group
Bachelors 46 57 33 44 52 educationLevel
Graduate 30 35 34 31 40 educationLevel
HighSchoolOrCollege 116 100 116 97 96 educationLevel
Low 15 24 12 15 18 educationLevel
A 21 26 17 21 17 interestGroup
B 18 22 14 22 15 interestGroup
C 22 19 14 18 29 interestGroup
D 21 22 15 22 16 interestGroup
E 18 15 13 13 24 interestGroup
F 17 26 25 17 21 interestGroup
G 19 22 23 18 26 interestGroup
H 27 25 22 22 15 interestGroup
I 20 17 25 12 22 interestGroup
J 24 22 27 22 21 interestGroup
Total 1035 1080 975 935 1030 Total

The below codes prepare the number of particpants of different groups for further visualization

t6=table(participant_data$householdSize)
t7=table(participant_data$haveKids)
t9=table(participant_data$educationLevel)
t10=table(participant_data$interestGroup)

t6=as.data.frame(t6)
t7=as.data.frame(t7)
t9=as.data.frame(t9)

5. A Patchwork of 3 Pie Charts

We would like to see the education level distribution, household size distribution and havekids distribution of the participants.

5.1 Sketch of Proposed Design

an image

We would like to see the education level distribution, household size distribution and havekids distribution of the participants using pie charts of ggplot(), gem_bar() and coord_polar().

5.2 A Pie Chart of Household Size Distribution Among Participants

The household size distribution is plotted as below.

ggplot(t6,aes(x="", y=Freq, fill=Var1))+
  geom_bar(width = 1, stat = "identity")+
  coord_polar("y", start=0)

We use scale_fill_brewer() to change the color, ggtitle() to add the title, theme_ipsum() to change the features of the plot and amend the labs.

pie1=ggplot(t6,aes(x="", y=Freq, fill=Var1))+
  geom_bar(width = 1, stat = "identity")+
  coord_polar("y", start=0)+
  scale_fill_brewer(palette="Spectral")+
  ggtitle("Household Size Distribution")+
  labs(fill='Household Size')+
  xlab("") + 
  ylab("Participants")+
  theme_ipsum()

pie1

Similarly, we have the distribution of havekids and education level.

5.3 A Pie Chart of HaveKids Distribution Among Participants

pie2=ggplot(t7,aes(x="", y=Freq, fill=Var1))+
  geom_bar(width = 1, stat = "identity")+
  coord_polar("y", start=0)+
  scale_fill_brewer(palette="Spectral")+
  ggtitle("HaveKids Distribution")+
  labs(fill='HaveKids')+
  xlab("") + 
  ylab("Participants")+
  theme_ipsum()

pie2

5.4 A Pie Chart of Education Level Distribution Among Participants

pie3=ggplot(t9,aes(x="", y=Freq, fill=Var1))+
  geom_bar(width = 1, stat = "identity")+
  coord_polar("y", start=0)+
  scale_fill_brewer(palette="Spectral")+
  scale_fill_brewer(palette="Spectral")+
  ggtitle("Education Level Distribution")+
  labs(fill='Education Level')+
  xlab("") + 
  ylab("Participants")+
  theme_ipsum()

pie3

5.5 Patchwork of 3 pie charts

We use patchwork to combine the three pie charts. The patch work finally gives a overview of distribution of paricipants in terms of education level, household size and have kids or not.

patchwork <- (pie1 /pie2  /pie3)
patchwork

6. A Diverging Stacked Bar Chart

6.1 Sketch of Proposed Design

an image We will use diverging stacked bar chart to show the joviality level of different groups of participants.

6.2 Plot the Diverging Stacked Bar Chart Using likert

The code chunk below plot the diverging stacked bar chart by using likert to show the joviality level of different groups of participants.

likert(Type ~ . | Subtable, data = participant, as.percent = TRUE, layout = c(1, 6))

The plot above is unreadable, we want to scale the plot to visualize better by add scale feature to likert.

likert(Type ~ . | Subtable, data = participant, as.percent = TRUE,
       scales = list(y = list(relation = "free")), layout = c(1, 6))

We want to sort the bars to make it easier to interpret so we add the order.

dsbc=likert(Type ~ . | Subtable, data = participant, as.percent = TRUE, positive.order = TRUE, 
       scales = list(y = list(relation = "free")), layout = c(1, 6))
dsbc

7. Final Visualization and Findings

patchwork

From the pie charts, we can find that around only 30% of participants have kids. The education level of more than half of the participants is highschool or college or below.

dsbc

The diverging stacked bar chart shows that participants with only one household size may have lower joviality than those with 2 or 3. Participants with kids also have a higher level of joviality. Young people are more likely to be jovial than the older. Also, higher education level may come with higher level of joviality. Participants with Interest Group of C,E and G have higher level of joviality than those with Interest Group H.