R.3 — Using Naive Bayes to Predict Voting in the 2016 Presidential Election

Jun. 14, 2018

Categories: R-Posts Tags: R Machine Learning Probability Bayes Politics


It’s a dark and stormy night. You step out of a black Toyota Corolla and mutter a rushed “thanks” to your Uber driver. You fumble for your office security pass and make your way into the building, through the lobby, and up to the eighth floor. The receptionist is here — staying late; you just got back from dinner while your computer was teaching itself how to learn — and you give her a wave as you enter the office suite.

You unlock your office door and glance at your computer screen. “P(A|B) = P(A) * P(B|A) / P(B)” flashes repeatedly on the monitor. You just figured out how to predict voting behavior using a self-tuned computer algorithm. Have your eyes been opened to a new world of political prediction? Maybe. Maybe not.

You step into the world of Naive Bayes.


I just finished reading Pedro Domingos’ great book “The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World”, a master text about developing the universal machine learning algorithm. I decided to test out a Bayesian learning algorithm called Naive Bayes on some survey data I’ve been using and try to predict individual voting behavior. The results are good, though not quite surprising — 94% out-of-sample accuracy — and I’ll share the process with you now.

Here’s how to use Naive Bayes classifiers in R to predict voting behavior in the 2016 U.S. presidential election.


What is Naive Bayes?


Before starting in on the number crunching, the theoretical question should be answered: what is Naive Bayes? Naive Bayes (NB) — strictly speaking, a Naive Bayes Classifier — is a supervised machine learning algorithm based on Bayes Theorem that uses conditional probability with (naive) assumptions about independent connections between variables (called features). NB asserts that the probability of features 1, 2, and 3 occurring in an observation are independent of features 4, 5, and 6 — wrong for any data that has variables that are correlated with other variables.

However, this naive assumption still performs well on data that have dependent features — a testament to the power of probabilistic inference (and the superiority of the posterior!).

So how does Naive Bayes classification work?

Bayes Theorem

As mentioned, Naive Bayes classification relies on Bayes Theorem to predict a variable using conditional probability. Bayes theorem is defined as P(A|B) = P(A) * P(B|A) / P(B), allowing you to compute the chance of some event A happening if event B is true as well.

For example, that someone has a disease if that they test positive, given that you know (1) how often contract the virus — that’s P(A) — and (2) how often people who have tested positive for the virus actually have the virus — that’s P(B|A) — and (3) the probability of testing positive for it. Running with this example: if 0.2% of the population has the virus, testing positive is 90% accurate, and 8% of people test positive, then the probability that you have AIDS given that you test positive for it is ** P(Disease | Test Positive) = 0.002 * 0.90 / 0.08** = 0.025 = 2.5% — much better than 90% odds!

A Naive Bayes classifier uses Bayes Theorem to compute the probability that y is true given that x1 is true, y given x2, x1 given x2, etc. until it has computed the probability of each feature occurring if all/each other is also true.

Let’s use a Naive Bayes classifier to predict voting for Hillary Clinton or Donald Trump in the 2016 U.S. presidential election, given that an individual has some subset of other qualities and attitudes.


What Data?


To train a Naive Bayes classifier to predict voting behavior in 2016, I first need data on Americans’ voting behavior, demographics, and beliefs. I turn to the oft-used Cooperative Congressional Election Study, a survey of ~64,000 Americans before and after the 2016 U.S. presidential election, to do so.

I’m going to use the following variables (or features) to train my Naive Bayes classifier to predict whether or not someone voted for Hillary Clinton in 2016.

The variables are all categorical and have anywhere from three to five response options.

Here’s the code to import and wrangle the CCES data. *NOTE: This code was written before improvements to the case_when() function in the tidyverse’s dplyr package, and as such is… messy… Thanks to my good friend Alexander Agadjanian for passing the script along and saving me a lot of time.

# setup 
rm(list = ls())
source("~/setup_elliott.R")
library(foreign)
library(mice)
library(e1071)

do_impute = FALSE


# reading in and cleaning cces data ----

# set working directory to the data folder
cces.raw <- read.dta("../../data_no_export/post/2018-06-14-r3-naive-bayes-politcs/CCES16_Common_OUTPUT_Feb2018_VV.dta")



# MUTATE: demog recoding -----
cces <- cces.raw %>%
  mutate(race = ifelse(race == "White", "White",
                       ifelse(race == "Black", "Black",
                              ifelse(race == "Hispanic" | hispanic == "Yes", "Hispanic/Latino",
                                     ifelse(race == "Asian", "Asian/Asian-American",
                                            ifelse(race == "Native American", "Native American", 
                                                   ifelse(race == "Mixed" | race == "Other" | race == "Middle Eastern", "Other", NA)))))),
         race4 = ifelse(race == "White", "White",
                        ifelse(race == "Black", "Black",
                               ifelse(race == "Hispanic/Latino", "Hispanic/Latino",
                                      ifelse(race == "Other" | race == "Asian/Asian-American" | race == "Native American", "Other", NA)))),
         race4.r = ifelse(race == "White", "1. White",
                          ifelse(race == "Black", "2. Black",
                                 ifelse(race == "Hispanic/Latino", "3. Hispanic/Latino",
                                        ifelse(race == "Other" | race == "Asian/Asian-American" | race == "Native American", "4. Other", NA)))),
         gender = ifelse(gender == "Male", "Male",
                         ifelse(gender == "Female", "Female", NA)),
         age = 2017 - birthyr,
         agegroup = ifelse(age >= 18 & age <= 29, "18-29",
                           ifelse(age >= 30 & age <= 44, "30-44",
                                  ifelse(age >= 45 & age <= 54, "45-54",
                                         ifelse(age >= 55 & age <= 64, "55-64",
                                                ifelse(age >= 65, "65+", NA))))),
         age.2 = ifelse(age >= 18 & age <= 44, "18-44",
                        ifelse(age >= 45, "45+", NA)),
         educ6 = ifelse(educ == "No HS", "Non-High School Graduate",
                        ifelse(educ == "High school graduate", "High School Graduate",
                               ifelse(educ == "Some college", "Some College",
                                      ifelse(educ == "2-year", "2-Year College Degree",
                                             ifelse(educ == "4-year", "4-Year College Degree",
                                                    ifelse(educ == "Post-grad", "Postgraduate Degree", NA)))))),
         educ4 = ifelse(educ == "No HS" | educ == "High school graduate", "HS or Less",
                        ifelse(educ == "Some college", "Some College",
                               ifelse(educ == "2-year" | educ == "4-year", "College Degree",
                                      ifelse(educ == "Post-grad", "Postgraduate Degree", NA)))),
         educ4.r = ifelse(educ == "No HS" | educ == "High school graduate", "1. HS or Less",
                          ifelse(educ == "Some college", "2. Some College",
                                 ifelse(educ == "2-year" | educ == "4-year", "3. College Degree",
                                        ifelse(educ == "Post-grad", "4. Postgraduate Degree", NA)))),
         educ2 = ifelse(educ == "No HS" | educ == "High school graduate" | educ == "Some college", "Less Than College Degree",
                        ifelse(educ == "2-year" | educ == "4-year" | educ == "Post-grad", "College Degree or More", NA)))

cces <- cces %>%
  mutate(pid7.pre = ifelse(pid7 == "Not sure" | pid7 == "Skipped" | pid7 == "Not Asked", NA, 
                           ifelse(pid7 == "Strong Democrat", "Strong Democrat",
                                  ifelse(pid7 == "Not very strong Democrat", "Not very strong Democrat",
                                         ifelse(pid7 == "Lean Democrat", "Lean Democrat",
                                                ifelse(pid7 == "Independent", "Independent",
                                                       ifelse(pid7 == "Lean Republican", "Lean Republican",
                                                              ifelse(pid7 == "Not very strong Republican", "Not very strong Republican",
                                                                     ifelse(pid7 == "Strong Republican", "Strong Republican", NA)))))))), 
         pid3.pre = ifelse(pid7.pre == "Strong Democrat" | pid7.pre == "Not very strong Democrat" | pid7.pre == "Lean Democrat", "Democrat",
                           ifelse(pid7.pre == "Strong Republican" | pid7.pre == "Not very strong Republican" | pid7.pre == "Lean Republican", "Republican",
                                  ifelse(pid7.pre == "Independent", "Independent", NA))),
         ideo5.pre = ifelse(ideo5 == "Not sure" | ideo5 == "Skipped" | ideo5 == "Not Asked", NA, 
                            ifelse(ideo5 == "Very liberal", "Very liberal",
                                   ifelse(ideo5 == "Liberal", "Liberal",
                                          ifelse(ideo5 == "Moderate", "Moderate",
                                                 ifelse(ideo5 == "Conservative", "Conservative",
                                                        ifelse(ideo5 == "Very conservative", "Very conservative", NA)))))),
         ideo5.pre.r = ifelse(ideo5 == "Not sure" | ideo5 == "Skipped" | ideo5 == "Not Asked", NA, 
                              ifelse(ideo5 == "Very liberal", 1,
                                     ifelse(ideo5 == "Liberal", 2,
                                            ifelse(ideo5 == "Moderate", 3,
                                                   ifelse(ideo5 == "Conservative", 4,
                                                          ifelse(ideo5 == "Very conservative", 5, NA)))))),
         ideo3.pre = ifelse(ideo5.pre == "Very liberal" | ideo5.pre == "Liberal", "Liberal",
                            ifelse(ideo5.pre == "Very conservative" | ideo5.pre == "Conservative", "Conservative",
                                   ifelse(ideo5.pre == "Moderate", "Moderate", NA))))

# need to make sure conditions on initial and branch questions are fulfilled

cces <- cces %>%
  mutate(pid7.post = ifelse(CC16_421_dem == "Strong Democrat" & CC16_421a == "Democrat", "Strong Democrat",
                            ifelse(CC16_421_dem == "Not so strong Democrat" & CC16_421a == "Democrat", "Not so strong Democrat",
                                   ifelse(CC16_421b == "The Democratic Party" & (CC16_421a == "Independent" | CC16_421a == "Other"), "Lean Democrat",
                                          ifelse(CC16_421b == "The Republican Party" & (CC16_421a == "Independent" | CC16_421a == "Other"), "Lean Republican",
                                                 ifelse(CC16_421_rep == "Not so strong Republican" & CC16_421a == "Republican", "Not so strong Republican",
                                                        ifelse(CC16_421_rep == "Strong Republican" & CC16_421a == "Republican", "Strong Republican",
                                                               ifelse((CC16_421b == "Neither" | CC16_421b == "Not sure") & (CC16_421a == "Independent" | CC16_421a == "Other"), "Pure Independent", NA))))))),
         pid7.post.r = ifelse(CC16_421_dem == "Strong Democrat" & CC16_421a == "Democrat", 1,
                              ifelse(CC16_421_dem == "Not so strong Democrat" & CC16_421a == "Democrat", 2,
                                     ifelse(CC16_421b == "The Democratic Party" & (CC16_421a == "Independent" | CC16_421a == "Other"), 3,
                                            ifelse(CC16_421b == "The Republican Party" & (CC16_421a == "Independent" | CC16_421a == "Other"), 5,
                                                   ifelse(CC16_421_rep == "Not so strong Republican" & CC16_421a == "Republican", 6,
                                                          ifelse(CC16_421_rep == "Strong Republican" & CC16_421a == "Republican", 7,
                                                                 ifelse((CC16_421b == "Neither" | CC16_421b == "Not sure") & (CC16_421a == "Independent" | CC16_421a == "Other"), 4, NA))))))),
         pid3.post = ifelse(pid7.post == "Strong Democrat" | pid7.post == "Not so strong Democrat" | pid7.post == "Lean Democrat", "Democrat",
                            ifelse(pid7.post == "Strong Republican" | pid7.post == "Not so strong Republican" | pid7.post == "Lean Republican", "Republican",
                                   ifelse(pid7.post == "Pure Independent", "Independent", NA))))

# MUTATE: voting recoding --------

cces <- cces %>%
  mutate(vote.all = ifelse(CC16_410a == "Hillary Clinton (Democrat)", "Hillary Clinton", 
                           ifelse(CC16_410a == "Donald Trump (Republican)", "Donald Trump", 
                                  ifelse(CC16_410a == "Gary Johnson (Libertarian)", "Gary Johnson",
                                         ifelse(CC16_410a == "Jill Stein (Green)", "Jill Stein",
                                                ifelse(CC16_410a == "Evan McMullin (Independent)", "Evan McMullin",
                                                       ifelse(CC16_410a == "Other", "Other", NA)))))),
         vote.3 = ifelse(CC16_410a == "I didn't vote in this election" | CC16_410a == "I'm not sure" | CC16_410a == "Skipped" | CC16_410a == "Not Asked" | is.na(CC16_410a), NA,
                         ifelse(CC16_410a == "Hillary Clinton (Democrat)", "Hillary Clinton",
                                ifelse(CC16_410a == "Donald Trump (Republican)", "Donald Trump", "Other"))),
         vote.2 = ifelse(CC16_410a == "Hillary Clinton (Democrat)", "Clinton",
                         ifelse(CC16_410a == "Donald Trump (Republican)", "Trump", NA)),
         general_turnout = ifelse(CL_E2016GVM != "", "Voted", "Didn't Vote"),
         primary_turnout = ifelse(CL_E2016PPVM != "", "Voted", "Didn't Vote"),
         pre.vote.all = ifelse(CC16_364 == "I already voted (early or absentee)" & CC16_364b == "Donald Trump (Republican)", "Donald Trump",
                               ifelse(CC16_364 == "I already voted (early or absentee)" & CC16_364b == "Hillary Clinton (Democrat)", "Hillary Clinton",
                                      ifelse(CC16_364 == "I already voted (early or absentee)" & CC16_364b == "Gary Johnson (Libertarian)", "Gary Johnson",
                                             ifelse(CC16_364 == "I already voted (early or absentee)" & CC16_364b == "Jill Stein (Green)", "Jill Stein",
                                                    ifelse(CC16_364 != "I already voted (early or absentee)" & CC16_364c == "Donald Trump (Republican)", "Donald Trump",
                                                           ifelse(CC16_364 != "I already voted (early or absentee)" & CC16_364c == "Hillary Clinton (Democrat)", "Hillary Clinton",
                                                                  ifelse(CC16_364 != "I already voted (early or absentee)" & CC16_364c == "Gary Johnson (Libertarian)", "Gary Johnson",
                                                                         ifelse(CC16_364 != "I already voted (early or absentee)" & CC16_364c == "Jill Stein (Green)", "Jill Stein",
                                                                                ifelse(CC16_364 != "I already voted (early or absentee)" & CC16_364c == "Other", "Other",
                                                                                       ifelse(CC16_364 != "I already voted (early or absentee)" & CC16_364c == "I won't vote in this election", "Won't Vote",
                                                                                              ifelse(CC16_364 != "I already voted (early or absentee)" & CC16_364c == "I'm not sure", "Not Sure",
                                                                                                     ifelse(CC16_364 == "I already voted (early or absentee)" & CC16_364b == "Other", "Other", NA)))))))))))))

# MUTATE: income & religion recoding ------
cces <- cces %>%
  mutate(income.6 = ifelse(faminc == "Less than $10,000" | faminc == "$10,000 - $19,999" | faminc == "$20,000 - $29,999", "$30k>",
                           ifelse(faminc == "$30,000 - $39,999" | faminc == "$40,000 - $49,999", "$30k-50k",
                                  ifelse(faminc == "$50,000 - $59,999" | faminc == "$60,000 - $69,999", "$50k-70k",
                                         ifelse(faminc == "$70,000 - $79,999" | faminc == "$80,000 - $99,999", "$70k-100k",
                                                ifelse(faminc == "$100,000 - $119,999" | faminc == "$120,000 - $149,999" | faminc == "150,000 - $199,999", "$100k-200k",
                                                       ifelse(faminc == "$200,000 - $249,999" | faminc == "$250,000 - $349,999" | faminc == "$350,000 - $499,999" | faminc == "$500,000 or more", "$200k<", NA)))))),
         income.6.r = ifelse(faminc == "Less than $10,000" | faminc == "$10,000 - $19,999" | faminc == "$20,000 - $29,999", "1. $30k>",
                             ifelse(faminc == "$30,000 - $39,999" | faminc == "$40,000 - $49,999", "2. $30k-50k",
                                    ifelse(faminc == "$50,000 - $59,999" | faminc == "$60,000 - $69,999", "3. $50k-70k",
                                           ifelse(faminc == "$70,000 - $79,999" | faminc == "$80,000 - $99,999", "4. $70k-100k",
                                                  ifelse(faminc == "$100,000 - $119,999" | faminc == "$120,000 - $149,999" | faminc == "150,000 - $199,999", "5. $100k-200k",
                                                         ifelse(faminc == "$200,000 - $249,999" | faminc == "$250,000 - $349,999" | faminc == "$350,000 - $499,999" | faminc == "$500,000 or more", "6. $200k<", NA)))))),
         retired = ifelse(employ == "Retired", 1,
                          ifelse(employ == "Full-time" | employ == "Homemaker" | employ == "Other" | employ == "Part-time" | employ == "Permanently disabled" | employ == "Student" | employ == "Unemployed", 0, NA)),
         rel.imp = ifelse(pew_religimp == "Not at all important", 1,
                          ifelse(pew_religimp == "Not too important", 2,
                                 ifelse(pew_religimp == "Somewhat important", 3,
                                        ifelse(pew_religimp == "Very important", 4, NA)))),
         union.status = ifelse(union == "I am not now, nor have I been, a member of a labor union", "Never union member",
                               ifelse(union == "Yes, I am currently a member of a labor union", "Current union member",
                                      ifelse(union == "I formerly was a member of a labor union", "Former union member", NA))),
         church.cat = ifelse(pew_churatd == "More than once a week", "1. More than once a week",
                             ifelse(pew_churatd == "Once a week", "2. Once a week",
                                    ifelse(pew_churatd == "Once or twice a month", "3. Once or twice a month",
                                           ifelse(pew_churatd == "A few times a year", "4. A few times a year",
                                                  ifelse(pew_churatd == "Seldom", "5. Seldom",
                                                         ifelse(pew_churatd == "Never", "6. Never", NA)))))),
         church.cont = ifelse(pew_churatd == "More than once a week", 6,
                              ifelse(pew_churatd == "Once a week", 5,
                                     ifelse(pew_churatd == "Once or twice a month", 4,
                                            ifelse(pew_churatd == "A few times a year", 3,
                                                   ifelse(pew_churatd == "Seldom", 2,
                                                          ifelse(pew_churatd == "Never", 1, NA)))))))

cces$pew_churatd[cces$pew_churatd == "Don't know"] <- NA
cces$pew_churatd[cces$pew_churatd == "Skipped"] <- NA
cces$pew_churatd[cces$pew_churatd == "Not Asked"] <- NA

# attitudes  -------
cces <- cces %>%
  mutate(read_news = case_when(CC16_300_3 == "Yes" ~ "Yes",
                               CC16_300_3 == "No" ~ "No"),
         racial_resent = case_when(as.numeric(CC16_422d) %in% 1:5 ~ as.character(CC16_422d),
                                   TRUE ~ NA_character_),
         health_care = case_when(as.numeric(CC16_426_2) %in% 1:5 ~ as.character(CC16_426_2),
                                 TRUE ~ NA_character_))

The one thing Naive Bayes can’t do is fill in missing data. I have to do that myself, using multiple imputation with chained equations — the namesake of the mice package in R.

# set working directory to the data folder
setwd("~/Dropbox/Projects/CCES Naive Bayes Predict Party") # point your working directory to the downloaded CCES data

# select variables to predict vote for naive bayes clsfer -----
survey <- cces %>%
  select(vote.2,
         pid7.post,
         ideo5.pre,
         race4,
         agegroup,
         gender,
         income.6,
         union.status,
         educ4,
         read_news,
         racial_resent,
         health_care)

# impute responses for missing variables ----------
# read: https://www.r-bloggers.com/imputing-missing-data-with-r-mice-package/

# actually impute responses

# what's missing?
do_impute = FALSE

if(do_impute){
  print("imputing")
  
  # what's missing? 
  survey %>% 
    mutate(missing = ifelse(is.na(race4),1,0)) %>%
    summarise(missing = sum(missing),
              n = n(),
              prop = missing/n) 
  
  # only keep individuals with > 50% variables
  survey$na_count <- apply(survey, 1,  # count no. of missing varaibles
                           function(x){sum(is.na(x))/length(survey)})
  
  ggplot(survey,aes(x=na_count)) + 
    geom_bar(aes(y = (..count..)/sum(..count..))) + 
    scale_y_continuous() +
    theme_few() +
    labs(title=" ",
         subtitle = " ",
         x = "Percentage of Variables Missing",
         y = "Percentage of Respondents")
  
  survey <- select(survey,-na_count)
  
  # set up imputation
  md.pattern(survey)
  
  # make vars into factors, else imputation will be numeric (y i k e s)
  survey[,2:length(survey)] <- lapply(survey[,2:length(survey)],as.factor)
  
  impute_me <- select(survey,-vote.2)
  
  # do the imputation
  system.time(imputed_data <- mice(impute_me,m=1,maxit=1,meth='pmm',seed=500))
  completedData <- complete(imputed_data, 1) # instead of: 'long'
  imputed_cces <- completedData # instead of: aggregate(completedData, by = list(completedData$.id),FUN= mean)
  
  # add back vote
  imputed_cces <- survey %>% 
    select(vote.2) %>%
    bind_cols(imputed_cces)
  
  # write it
  saveRDS(imputed_data,"imputed.RDS")
  write.csv(imputed_cces,"imputed.csv",row.names = F)
} else {
  #print("importing")
  imputed_cces <- read.csv("imputed.csv",stringsAsFactors = F)
}

survey <- imputed_cces

Now that we have all the data coded and imputed, the fun begins – let’s train the Naive Bayes algorithm on a selection of the CCES survey!


Naive Bayes Training, and Prediction


Splitting the CCES survey into training and testing data is relatively easy. All we have to do is tell R to sample the row numbers for 75% of the dataset, pick out those rows for the training set, and pick out the rest for the testing set.

# naive bayes --------
survey.na_omit <- survey %>% na.omit()

# split into train and test 
smp_size <- floor(0.75 * nrow(survey.na_omit))


set.seed(123)
train_ind <- sample(seq_len(nrow(survey.na_omit)), size = smp_size)

train <- survey.na_omit[train_ind, ]
test <- survey.na_omit[-train_ind, ]

Now, our training data is ready for Naive Bayes to fit a model. Let’s get to it!

# Fitting the Naive Bayes model
Naive_Bayes_Model = naiveBayes(vote.2 ~., data=train)

Here’s what the model output looks like:

# Print the model summary
Naive_Bayes_Model
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##   Clinton     Trump 
## 0.5424547 0.4575453 
## 
## Conditional probabilities:
##          pid7.post
## Y         Lean Democrat Lean Republican Not so strong Democrat
##   Clinton   0.155746574     0.007573936            0.201851407
##   Trump     0.007910490     0.194198974            0.046465222
##          pid7.post
## Y         Not so strong Republican Pure Independent Strong Democrat
##   Clinton              0.028011541      0.075018033     0.526388555
##   Trump                0.228263968      0.140393387     0.012614025
##          pid7.post
## Y         Strong Republican
##   Clinton       0.005409954
##   Trump         0.370153934
## 
##          ideo5.pre
## Y         Conservative     Liberal    Moderate Very conservative
##   Clinton  0.067444097 0.365773022 0.374368839       0.013584996
##   Trump    0.478406499 0.034920182 0.290692702       0.188996579
##          ideo5.pre
## Y         Very liberal
##   Clinton  0.178829045
##   Trump    0.006984036
## 
##          race4
## Y              Black Hispanic/Latino      Other      White
##   Clinton 0.16494350      0.09383265 0.06570089 0.67552296
##   Trump   0.01460946      0.04967218 0.05480331 0.88091505
## 
##          agegroup
## Y              18-29      30-44      45-54      55-64        65+
##   Clinton 0.11493147 0.25252465 0.16572493 0.23725655 0.22956239
##   Trump   0.05017104 0.17196408 0.17973204 0.29104903 0.30708381
## 
##          gender
## Y            Female      Male
##   Clinton 0.5964174 0.4035826
##   Trump   0.4939424 0.5060576
## 
##          income.6
## Y         $100k-200k     $200k<   $30k-50k      $30k>   $50k-70k
##   Clinton 0.16121664 0.03798990 0.21098822 0.21104833 0.18273623
##   Trump   0.15122577 0.03128563 0.21942702 0.19427024 0.19947263
##          income.6
## Y          $70k-100k
##   Clinton 0.19602068
##   Trump   0.20431870
## 
##          educ4
## Y         College Degree HS or Less Postgraduate Degree Some College
##   Clinton      0.3912599  0.1797307           0.1910916    0.2379178
##   Trump        0.3358751  0.2974629           0.1109607    0.2557013
## 
##          read_news
## Y                No       Yes
##   Clinton 0.4468021 0.5531979
##   Trump   0.5518814 0.4481186
## 
##          racial_resent
## Y         Neither agree nor disagree Somewhat agree Somewhat disagree
##   Clinton                 0.09840106     0.30620341        0.03913200
##   Trump                   0.21037628     0.15329247        0.23268244
##          racial_resent
## Y         Strongly agree Strongly disagree
##   Clinton     0.52344314        0.03282039
##   Trump       0.03727195        0.36637685
## 
##          health_care
## Y         Greatly decrease Greatly increase    Maintain Slightly decrease
##   Clinton      0.008776148      0.376713152 0.238759317       0.020497716
##   Trump        0.122006842      0.110461802 0.411701824       0.164623717
##          health_care
## Y         Slightly increase
##   Clinton       0.355253667
##   Trump         0.191205815

That took about 0.5 seconds to do, thanks to the power of conditional probability and a small number of features! Now it’s time to make predictions with the data. We can do that with the predict() function and our test data. type = "raw" means I want to return probabilities from the prediction, not predicted values (IE: 0.75 instead of "Clinton")

# Prediction on the dataset
NB_Predictions = predict(Naive_Bayes_Model,
                         as.matrix(test),
                         type = "raw") # type="class" or "raw"

Here’s what the predicted probability of voting for Hillary Clinton looks for the first ten test cases:

NB_Predictions[,1][1:10]
##  [1] 0.00001785202 0.64584735255 0.89277757934 0.97321907159 0.00177587372
##  [6] 0.99308096869 0.99931374743 0.02802852835 0.99618927544 0.94931948233

And here’s the confusion matrix to check accuracy

NB_Predictions <- NB_Predictions[,1] %>% 
  as.vector()

NB_Predictions <- ifelse(NB_Predictions>0.5,"Clinton","Trump")

table(NB_Predictions,test$vote.2) # Confusion matrix to check accuracy
##               
## NB_Predictions Clinton Trump
##        Clinton    5136   384
##        Trump       364  4339

As we can see, the Naive Bayes classifier correctly predicted 5136 of 5500 Clinton voters in the testing data!

The real test is how much better Naive Bayes does than other traditional predictive methods. Below, I specify and train a logistic regression model using glm() and evaluate the predictions.

# what about logit? 
train.logit <- train %>% 
  mutate(vote.2.binary = ifelse(vote.2=="Clinton",1,0))

fit <- glm(vote.2.binary ~ . ,
           select(train.logit,-vote.2),
           family=binomial(link='logit'))


LG_Predictions <- ifelse(predict(fit,test) > 0.5,"Clinton","Trump")
table(LG_Predictions, test$vote.2)
##               
## LG_Predictions Clinton Trump
##        Clinton    5086   289
##        Trump       414  4434

With a total accuracy — share of Clinton and Trump voters identified correctly — of 93.1%, the logit model does just the slightest bit better than the Naive Bayes classifier, which has an accuracy of 92.7%. This is likely because of logistic regression’s ability to learn well from a medium amount of data. However, had we done this example with just thirty or forty training observation, the tendency of logit to overfit the training set would have made it perform much worse than the NB approach. The power of Naive Bayes really comes into gear with a small amount of data and/or a large (think: fifty) number of features.


Conclusion


But wait, where’s the machine learning? Naive Bayes does one thing well — learning from a large set of data with independent variables (*as we saw, the method performed just fine with the highly correlated demographic and attitudinal variables) in a probabilistic way that avoids overfitting and selects the ideal posterior for all predictors. What it does not do at all is learn — at least in the typical machine learning way, in which the computer would teach itself the better way to make the model.

For this, I foresee a solution that uses random forest decision trees alongside the Naive Bayes approach to first select the a number of features of a dataset, then fit the Naive Bayes model to them and evaluate predictions, then repeat until the best features have been combined with the best models to produce the best predictions of whatever observation we are trying to identify.

A future project, perhaps? For now…



Have any questions, comments, or concerns? You know where to find me.

— Elliott






comments powered by Disqus