Name: Aidan Hussain¶

Project Title: The Effects of Mass Shootings on Presidential Voting Trends¶

Website: https://aidanhussain.github.io/cmps3160project/¶

This project delves into a critical and timely issue, exploring the intersection between gun violence and political trends in the United States. The primary focus is to examine how the prevalence of mass shootings influences presidential voting patterns. This research is not only academically stimulating but also has profound societal implications, offering insights into how critical issues like gun violence can shape political landscapes in the U.S.

Gun violence has been at the forefront of political rhetoric in recent years and has led to polarized stances on gun rights. With the 2024 Presidential Election less than one year away, it is crucial for each candidate and party to understand the factors that shift voting trends. The results of this analysis could influence both Democrats' and Republicans' stances on the issue of gun rights moving forward. Supporting evidence of the importance of the gun rights issue as it relates to politics is linked below:

  • https://www.bloomberg.com/news/articles/2023-12-08/
  • https://www.pewresearch.org/politics/2023/06/28/gun-violence-widely-viewed-as-a-major-and-growing-national-problem/
  • https://www.vox.com/23142734/las-vegas-unlv-mass-shooting

At the heart of our study is the question: Does an increase in gun violence in a state correlate with a shift towards the Democratic party in presidential elections?

This question stems from the Democratic party's general stance favoring stricter gun control laws.

In [170]:
#Clone my git repository
!git clone https://github.com/aidanhussain/cmps3160project.git
%cd cmps3160project
fatal: destination path 'cmps3160project' already exists and is not an empty directory.
/content/drive/MyDrive/Colab Notebooks/cmps3160project
In [171]:
#Import all the libraries we'll need
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import math
import scipy.stats as stats
import numpy as np
import statsmodels.api as sm

DATA & PREPROCESSING¶

Data Set 1: US Population by State¶

Link: https://www.kaggle.com/datasets/alexandrepetit881234/us-population-by-state

This dataset includes each US state's population as of the 2020 US Census. This dataset will be integral to normalizing/scaling raw statistics from the other datasets used throughout this project. I plan to calculate per capita deaths and injuries (using data from the mass shootings dataset) with these population totals as the denominators, and I also plan to use this population data to help standardize some of the data from the election datasets.

The dataset includes 5 variables relating to each state's population total. The variables of particular interest are 'State' and '2020_census' (the total population for that state as of the 2020 census).
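
For example, once the mass shooting totals are assembled, a state's per capita deaths will simply be total deaths divided by this population figure; for California, that will work out to 191 / 39,538,223 ≈ 0.0000048, matching the per capita table later in this notebook.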

In [172]:
us_populations_raw = pd.read_csv("us_pop_by_state.csv")
us_populations_raw.dtypes #Check dtypes of each column
Out[172]:
rank                float64
state                object
state_code           object
2020_census           int64
percent_of_total    float64
dtype: object

All the dtypes are correct. Great!

In [173]:
us_populations_raw.head(10)
Out[173]:
rank state state_code 2020_census percent_of_total
0 1.0 California CA 39538223 0.1191
1 2.0 Texas TX 29145505 0.0874
2 3.0 Florida FL 21538187 0.0647
3 4.0 New York NY 20201249 0.0586
4 5.0 Pennsylvania PA 13002700 0.0386
5 6.0 Illinois IL 12801989 0.0382
6 7.0 Ohio OH 11799448 0.0352
7 8.0 Georgia GA 10711908 0.0320
8 9.0 North Carolina NC 10439388 0.0316
9 10.0 Michigan MI 10077331 0.0301

Now, let's tidy this dataset up by dropping the total row, renaming the '2020_census' column to 'Population', and renaming Washington D.C.'s entry from 'DC' to 'District of Columbia'.

In [174]:
us_populations_tidy = us_populations_raw.copy()
us_populations_tidy = us_populations_tidy[:-1] #Drop total row
us_populations_tidy = us_populations_tidy.rename({'2020_census': 'Population'}, axis=1)
us_populations_tidy = us_populations_tidy.set_index('state_code')
us_populations_tidy.at['DC', 'state'] = 'District of Columbia' #Rename Washington D.C. entry
us_populations_tidy = us_populations_tidy.reset_index()
us_populations_tidy.head(10)
Out[174]:
state_code rank state Population percent_of_total
0 CA 1.0 California 39538223 0.1191
1 TX 2.0 Texas 29145505 0.0874
2 FL 3.0 Florida 21538187 0.0647
3 NY 4.0 New York 20201249 0.0586
4 PA 5.0 Pennsylvania 13002700 0.0386
5 IL 6.0 Illinois 12801989 0.0382
6 OH 7.0 Ohio 11799448 0.0352
7 GA 8.0 Georgia 10711908 0.0320
8 NC 9.0 North Carolina 10439388 0.0316
9 MI 10.0 Michigan 10077331 0.0301

Data Set 2: US Mass Shootings¶

Link: https://www.kaggle.com/datasets/zusmani/us-mass-shootings-last-50-years

I am using this dataset because I think that it is both interesting (in terms of the questions it can help answer) and applicable (to very real issues in the world today). My goal is to use this dataset, alongside other data, to identify the effect that the prevalence of gun violence (or, more specifically, mass shootings) has on US presidential voting trends.

This data set consists of mass shootings in the United States between August 2016 and April 2021 (at least the segment of data I am using does; the actual dataset link includes other datasets with different formatting that are not being used for this project). The dataset includes 8 variables that describe each mass shooting event. The variables of particular interest include Incident Date, State, # Killed, and # Injured.

The dataset uses the FBI's definition of a mass shooting (4+ casualties, excluding the shooter) as the basis for an incident's inclusion. Further information regarding this definition is available here: https://www.ojp.gov/ncjrs/virtual-library/abstracts/analysis-recent-mass-shootings

I think that it will be very interesting to investigate whether a greater prevalence of gun violence in a state has any impact on voting trends in that state. In essence, are states where gun violence is more prevalent shifting to vote more in favor of the Democratic party (which typically supports stricter gun laws)? Of course, to conduct this analysis, I will need to design this project carefully to ensure that I am answering the correct question.

In [175]:
mass_shootings_raw = pd.read_csv("Mass shooting data.csv")
mass_shootings_raw.dtypes #Check dtypes of each column; 'Incident Date' needs to be converted to datetime

mass_shootings_raw['Incident Date'] = pd.to_datetime(mass_shootings_raw['Incident Date']) #Convert 'Incident Date' to datetime

mass_shootings_raw.dtypes #Check dtypes of each column
Out[175]:
Incident ID                int64
Incident Date     datetime64[ns]
State                     object
City Or County            object
Address                   object
# Killed                   int64
# Injured                  int64
Operations               float64
dtype: object

Now that 'Incident Date' has been converted, all the dtypes are set correctly! Now, let's take a look at what data we have:

Thankfully, the original data source was well organized, so no meaningful changes were needed to display the data in a digestible and logical way. Now, let's start thinking about ETL by finding an interesting statistic and generating an interesting graph from this dataset!
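
As a quick sanity check (a sketch outside the original pipeline), if the dataset strictly follows the 4+ casualty definition cited above, every incident should have at least four people killed or injured:

In [ ]:
#Sketch: verify the 4+ casualty inclusion criterion; should print True if the definition holds
print(((mass_shootings_raw['# Killed'] + mass_shootings_raw['# Injured']) >= 4).all())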

In [176]:
mass_shootings_raw.head(10)
Out[176]:
Incident ID Incident Date State City Or County Address # Killed # Injured Operations
0 1978561 2021-04-15 District of Columbia Washington 1800 block of West Virginia Ave NE 0 4 NaN
1 1978635 2021-04-15 Indiana Indianapolis 8951 Mirabel Rd 8 5 NaN
2 1978652 2021-04-15 Illinois Chicago 600 block of N Sawyer Ave 0 4 NaN
3 1978616 2021-04-15 Florida Pensacola 700 Truman Ave 0 6 NaN
4 1976538 2021-04-13 Maryland Baltimore 2300 block of Hoffman St 0 4 NaN
5 1975296 2021-04-12 Illinois Chicago I-290 and S Damen Ave 1 3 NaN
6 1974943 2021-04-11 Kansas Wichita 200 block of N Battin St 1 3 NaN
7 1975004 2021-04-11 Washington Seattle 306 23rd Ave S 0 4 NaN
8 1974088 2021-04-10 Tennessee Memphis 4315 S 3rd St 1 3 NaN
9 1973692 2021-04-10 Missouri Koshkonong US-63 and MO-F 1 3 NaN
In [177]:
#Sum of # Killed by State
mass_shootings_raw.groupby(['State'])['# Killed'].sum().sort_values(ascending=False).head(10)
Out[177]:
State
Texas             229
California        214
Illinois          152
Florida           134
Louisiana          84
Pennsylvania       83
Missouri           83
Ohio               79
North Carolina     77
Nevada             72
Name: # Killed, dtype: int64
In [178]:
#Sum of # Killed + # Injured by State
mass_shootings_raw['# Killed or Injured'] = mass_shootings_raw['# Killed'] + mass_shootings_raw['# Injured']
mass_shootings_raw.groupby(['State'])['# Killed or Injured'].sum().sort_values(ascending=False).head(10)
Out[178]:
State
Illinois        1069
California       969
Texas            754
Florida          649
Nevada           561
Pennsylvania     495
Louisiana        463
Ohio             424
New York         410
Missouri         344
Name: # Killed or Injured, dtype: int64

That's interesting: Illinois jumps past Texas and California when injuries are included (1069 killed + injured), and Texas is also surpassed by California. It will be interesting to explore whether the choice of metric used to measure the prevalence of gun violence changes any conclusions we draw as we progress through this project.

Anyway, now let's transform this dataset into the format we require for the purposes of our project. The goal of this transformation is to end with a dataframe that has one entry per state (the rows) and the total number of deaths, the total number of injuries, and the total number of deaths + injuries (the columns) from between 2016-11-08 and 2020-11-03 (the two election days). Additionally, we want to scale the numeric variables by each state's population to ensure our data is sufficiently normalized.

We will create two versions of this dataframe: one that includes all shootings, and one that includes only shootings with 8 or more deaths.

In [179]:
#Includes all shootings
mass_shootings_clean = mass_shootings_raw.copy()
mass_shootings_clean = mass_shootings_clean[(mass_shootings_clean['Incident Date'] > '2016-11-08') & (mass_shootings_clean['Incident Date'] < '2020-11-03')] #Filter the DF to the correct dates
mass_shootings_clean = mass_shootings_clean[['State','# Killed','# Injured','# Killed or Injured']]
mass_shootings_clean = mass_shootings_clean.groupby('State').sum()
mass_shootings_clean.head(10)
Out[179]:
# Killed # Injured # Killed or Injured
State
Alabama 52 184 236
Alaska 1 8 9
Arizona 19 58 77
Arkansas 17 120 137
California 191 653 844
Colorado 22 106 128
Connecticut 10 59 69
Delaware 8 25 33
District of Columbia 13 131 144
Florida 112 401 513
In [180]:
#Includes only shootings with 8+ deaths
mass_shootings_clean_minimum8 = mass_shootings_raw.copy()
mass_shootings_clean_minimum8 = mass_shootings_clean_minimum8[(mass_shootings_clean_minimum8['Incident Date'] > '2016-11-08') & (mass_shootings_clean_minimum8['Incident Date'] < '2020-11-03')] #Filter the DF to the correct dates
mass_shootings_clean_minimum8 = mass_shootings_clean_minimum8[mass_shootings_clean_minimum8['# Killed'] >= 8] #Keep only shootings with 8+ deaths
mass_shootings_clean_minimum8 = mass_shootings_clean_minimum8[['State','# Killed','# Injured','# Killed or Injured']]
mass_shootings_clean_minimum8 = mass_shootings_clean_minimum8.groupby('State').sum()
mass_shootings_clean_minimum8.head(10)
Out[180]:
# Killed # Injured # Killed or Injured
State
California 13 2 15
Florida 17 17 34
Mississippi 8 1 9
Nevada 59 441 500
Ohio 10 17 27
Pennsylvania 11 7 18
Texas 77 80 157
Virginia 13 4 17

Alright, now, let's scale these numeric variables by each state's total population per the 2020 Census.

In [181]:
#Includes all shootings
mass_shootings_clean_normalized = mass_shootings_clean.merge(us_populations_tidy,how = 'left', left_on='State', right_on = 'state')
mass_shootings_clean_normalized['# Killed per Capita'] = mass_shootings_clean_normalized['# Killed'] / mass_shootings_clean_normalized['Population']
mass_shootings_clean_normalized['# Injured per Capita'] = mass_shootings_clean_normalized['# Injured'] / mass_shootings_clean_normalized['Population']
mass_shootings_clean_normalized['# Killed or Injured per Capita'] = mass_shootings_clean_normalized['# Killed or Injured'] / mass_shootings_clean_normalized['Population']
mass_shootings_clean_normalized = mass_shootings_clean_normalized[['state','# Killed per Capita','# Injured per Capita','# Killed or Injured per Capita']]
mass_shootings_clean_normalized.head(10)
Out[181]:
state # Killed per Capita # Injured per Capita # Killed or Injured per Capita
0 Alabama 0.000010 0.000037 0.000047
1 Alaska 0.000001 0.000011 0.000012
2 Arizona 0.000003 0.000008 0.000011
3 Arkansas 0.000006 0.000040 0.000045
4 California 0.000005 0.000017 0.000021
5 Colorado 0.000004 0.000018 0.000022
6 Connecticut 0.000003 0.000016 0.000019
7 Delaware 0.000008 0.000025 0.000033
8 District of Columbia 0.000019 0.000190 0.000209
9 Florida 0.000005 0.000019 0.000024
In [182]:
#Only includes shootings with 8+ deaths
mass_shootings_clean_normalized_minimum8 = mass_shootings_clean_minimum8.merge(us_populations_tidy,how = 'left', left_on='State', right_on = 'state')
mass_shootings_clean_normalized_minimum8['# Killed per Capita'] = mass_shootings_clean_normalized_minimum8['# Killed'] / mass_shootings_clean_normalized_minimum8['Population']
mass_shootings_clean_normalized_minimum8['# Injured per Capita'] = mass_shootings_clean_normalized_minimum8['# Injured'] / mass_shootings_clean_normalized_minimum8['Population']
mass_shootings_clean_normalized_minimum8['# Killed or Injured per Capita'] = mass_shootings_clean_normalized_minimum8['# Killed or Injured'] / mass_shootings_clean_normalized_minimum8['Population']
mass_shootings_clean_normalized_minimum8 = mass_shootings_clean_normalized_minimum8[['state','# Killed per Capita','# Injured per Capita','# Killed or Injured per Capita']]
mass_shootings_clean_normalized_minimum8.head(10)
Out[182]:
state # Killed per Capita # Injured per Capita # Killed or Injured per Capita
0 California 3.287958e-07 5.058396e-08 3.793797e-07
1 Florida 7.892958e-07 7.892958e-07 1.578592e-06
2 Mississippi 2.701535e-06 3.376919e-07 3.039227e-06
3 Nevada 1.900397e-05 1.420466e-04 1.610506e-04
4 Ohio 8.474973e-07 1.440745e-06 2.288243e-06
5 Pennsylvania 8.459781e-07 5.383497e-07 1.384328e-06
6 Texas 2.641917e-06 2.744849e-06 5.386765e-06
7 Virginia 1.506130e-06 4.634246e-07 1.969555e-06

Great! Now our mass shooting/gun violence dataset is in the correct format to proceed with the rest of our project.

It's interesting that only 8 states qualify as having had a mass shooting with 8 or more deaths. This small number of observations will likely make it difficult to draw statistically significant conclusions, but I am still including it out of curiosity.
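
As a quick check (a sketch, not part of the main pipeline), we can count the qualifying states directly:

In [ ]:
#Sketch: number of states with at least one 8+ death shooting in the filtered window
print(len(mass_shootings_clean_minimum8))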

Data Set 3: Results of the 2016 Presidential Elections in the United States¶

Link: https://www.statista.com/statistics/630799/preliminary-results-of-the-2016-presidential-election/

This dataset includes information about how each US state voted in the 2016 presidential election, by candidate. The dataset includes the 2 major candidates: Donald Trump and Hillary Clinton.

The dataset includes 3 variables relating to each state's 2016 presidential vote: 'State', (% of votes for) 'Hillary Clinton', and (% of votes for) 'Donald Trump'.

This will be one of two presidential election datasets we will use to investigate the potential effects of gun violence on American voting trends; the other dataset will feature 2020 Presidential Election results.

In [183]:
election_results_2016_raw = pd.read_csv("2016_election_results.csv")
election_results_2016_raw.dtypes #Check dtypes of each column; need to remove extraneous column...

election_results_2016_raw = election_results_2016_raw.iloc[:,0:3] #Tidy up data
election_results_2016_raw.head(10)
Out[183]:
State Hillary Clinton Donald Trump
0 Alabama 34.7 62.7
1 Alaska 37.6 52.8
2 Arizona 45.5 49.0
3 Arkansas 33.7 60.6
4 California 62.3 31.9
5 Colorado 48.2 43.3
6 Connecticut 54.7 41.0
7 Delaware 53.4 41.9
8 District of Columbia 92.8 4.2
9 Florida 47.8 49.0

Great, the datatypes are correct, and the extraneous column has been removed!

However, for consistency, let's normalize the vote proportions of Hillary Clinton and Donald Trump so they add up to 100% (because we do not care about the immaterial portion of votes for other candidates). Also, let's rename the columns from the candidates' names to the corresponding parties (Hillary Clinton -> Democrat Proportion, Donald Trump -> Republican Proportion).

In [184]:
election_results_2016_tidy = election_results_2016_raw.copy()
election_results_2016_tidy['Democrat Proportion'] = election_results_2016_tidy['Hillary Clinton'] / (election_results_2016_tidy['Hillary Clinton'] + election_results_2016_tidy['Donald Trump'])
election_results_2016_tidy['Republican Proportion'] = election_results_2016_tidy['Donald Trump'] / (election_results_2016_tidy['Hillary Clinton'] + election_results_2016_tidy['Donald Trump'])
election_results_2016_tidy = election_results_2016_tidy.drop(['Hillary Clinton','Donald Trump'], axis = 1)

election_results_2016_tidy.sort_values(by='State').head(10)
Out[184]:
State Democrat Proportion Republican Proportion
0 Alabama 0.356263 0.643737
1 Alaska 0.415929 0.584071
2 Arizona 0.481481 0.518519
3 Arkansas 0.357370 0.642630
4 California 0.661359 0.338641
5 Colorado 0.526776 0.473224
6 Connecticut 0.571578 0.428422
7 Delaware 0.560336 0.439664
8 District of Columbia 0.956701 0.043299
9 Florida 0.493802 0.506198

Great! Now our data is normalized and formatted in a way that will be convenient for us as we work through this project.

Data Set 4: 2020 USA Presidential Election Results¶

Link: https://www.kaggle.com/datasets/paultimothymooney/percent-voting-for-democratic-party-by-state

This dataset includes information about how each US state voted in the 2020 presidential election, by political party (Democrat or Republican).

The dataset includes 6 variables; however, 3 are redundant (state identifiers). The variables of interest to us are 'state', 'DEM', and 'REP'.

This is the second presidential election dataset we will use to investigate the potential effects of gun violence on American voting trends; the other being the 2016 election results dataset from above.

In [185]:
election_results_2020_raw = pd.read_csv("democratic_vs_republican_votes_by_usa_state_2020.csv")
election_results_2020_raw.dtypes #Check dtypes of each column

election_results_2020_raw.head(10)
Out[185]:
state DEM REP usa_state usa_state_code percent_democrat
0 Alabama 843473 1434159 Alabama AL 37.032892
1 Alaska 45758 80999 Alaska AK 36.098993
2 Arizona 1643664 1626679 Arizona AZ 50.259682
3 Arkansas 420985 761251 Arkansas AR 35.609218
4 California 9315259 4812735 California CA 65.934760
5 Colorado 1753416 1335253 Colorado CO 56.769307
6 Connecticut 1059252 699079 Connecticut CT 60.241900
7 Delaware 295413 199857 Delaware DE 59.646859
8 District of Columbia 258561 14449 District of Columbia DC 94.707520
9 Florida 5294767 5667834 Florida FL 48.298456

The dtypes are all correct, so we are ready to proceed!

Now, let's normalize the vote proportions of Democrat and Republican votes so they add up to 100% (because we do not care about the immaterial portion of votes for other candidates). Let's also only keep the 'state' variable as an identifier for now.

In [186]:
election_results_2020_tidy = election_results_2020_raw.copy()
election_results_2020_tidy['Democrat Proportion'] = election_results_2020_tidy['DEM'] / (election_results_2020_tidy['DEM'] + election_results_2020_tidy['REP'])
election_results_2020_tidy['Republican Proportion'] = election_results_2020_tidy['REP'] / (election_results_2020_tidy['DEM'] + election_results_2020_tidy['REP'])
election_results_2020_tidy = election_results_2020_tidy.drop(['DEM','REP','usa_state','usa_state_code','percent_democrat'], axis = 1)

election_results_2020_tidy.sort_values(by='state').head(10)
Out[186]:
state Democrat Proportion Republican Proportion
0 Alabama 0.370329 0.629671
1 Alaska 0.360990 0.639010
2 Arizona 0.502597 0.497403
3 Arkansas 0.356092 0.643908
4 California 0.659348 0.340652
5 Colorado 0.567693 0.432307
6 Connecticut 0.602419 0.397581
7 Delaware 0.596469 0.403531
8 District of Columbia 0.947075 0.052925
9 Florida 0.482985 0.517015

EXPLORATORY DATA ANALYSIS¶

Now, before we continue with our modeling, let's conduct some exploratory data analysis so we can start to understand the data we are working with. First, let's visualize the states with the largest populations.

In [187]:
#Let's visualize the states with the largest population
graph_vals = us_populations_tidy.copy()
graph_vals = graph_vals[['state','Population']]
graph_vals = graph_vals.sort_values(by='Population', ascending = False)
graph_vals['Population'] = graph_vals['Population'] / 1e6
graph_vals = graph_vals.set_index('state').head(10)

graph_vals.plot.bar()
plt.ylabel('Population (in millions)')
plt.xlabel('State')
plt.title('Top 10 US States by Population')
plt.xticks(rotation=45)
plt.show()

This visualization helps show why California and Texas have so much gun violence (per the dataset): they are WAY more populous than other states. Interestingly, Illinois isn't nearly as large as those two states, but still had a lot of gun violence. That's an alarming observation!

Now, let's create an interesting graphic: a breakdown of the number of mass shooting deaths per quarter.

In [188]:
#Let's visualize the # of gun deaths per quarter
graph_vals = mass_shootings_raw.copy()
graph_vals = graph_vals[['Incident Date','# Killed']]
graph_vals['Incident Quarter'] = graph_vals['Incident Date'].dt.to_period('Q') #Bucket each incident into a calendar quarter
graph_vals = graph_vals.groupby('Incident Quarter').sum()

graph_vals.plot.bar()

plt.ylabel('# Killed')
plt.xlabel('Time Period')
plt.title('Mass Shooting Deaths by Quarter')
plt.xticks(rotation=45)
plt.show()
<ipython-input-188-966e79a0d50b>:5: FutureWarning: The default value of numeric_only in DataFrameGroupBy.sum is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
  graph_vals = graph_vals.groupby('Incident Quarter').sum()
In [189]:
graph_vals.mean()/3 #Calculate average number of mass shooting deaths per month
Out[189]:
# Killed    35.033333
dtype: float64

It is interesting that there seems to be some cyclicality to the graph. I wonder why? Also, even a quick glance suggests that the trend is upward-sloping, meaning gun violence deaths attributed to mass shootings seem to be on the rise. That's scary. I wonder if this trend has impacted voting trends. After all, roughly 35 unnecessary, human-caused deaths per month attributed to mass shootings is certainly eye-catching.

Now, let's look at the prevalence of mass shooting deaths per state on a per capita basis.

In [190]:

graph_vals = mass_shootings_clean_normalized.copy()
graph_vals = graph_vals[['state','# Killed per Capita']]
graph_vals.set_index('state', inplace=True)
graph_vals = graph_vals.sort_values(by='# Killed per Capita', ascending = False)
graph_vals.plot(kind='bar', figsize=(15, 7))

plt.title('Mass Shootings Deaths per Capita by State')
plt.xlabel('State')
plt.ylabel('# Killed per Capita')

plt.legend(title='Metrics', bbox_to_anchor=(1.05, 1), loc='upper left')

plt.tight_layout()
plt.show()

Interesting. Nevada is #1 (probably due to the tragic 2017 mass shooting at the Route 91 Harvest music festival in Las Vegas that made national news). A harrowing look at that shooting is available here: https://www.cnn.com/interactive/2017/10/us/las-vegas-shooting-cnnphotos/

Now, if we only include shootings with 8+ deaths, how does this visualization change?

In [191]:
graph_vals = mass_shootings_clean_normalized_minimum8.copy()
graph_vals = graph_vals[['state','# Killed per Capita']]
graph_vals.set_index('state', inplace=True)
graph_vals = graph_vals.sort_values(by='# Killed per Capita', ascending = False)
graph_vals.plot(kind='bar', figsize=(15, 7))

plt.title('Mass Shootings Deaths per Capita by State, Minimum 8 Deaths')
plt.xlabel('State')
plt.ylabel('# Killed per Capita')

plt.legend(title='Metrics', bbox_to_anchor=(1.05, 1), loc='upper left')

plt.tight_layout()
plt.show()

Nevada is now #1 by a wide margin (not surprising). It will be interesting to see whether our decision on where to set the threshold for what counts as a mass shooting has any impact on our findings.

Now, let's take a look at some interesting graphics pertaining to how states voted in the 2016 Presidential Election. The following visualization shows the split of Democrat and Republican voting proportions in the election, starting with the highest Democrat proportion (Washington D.C.) and ending with the highest Republican proportion (Wyoming). On its own, this data does not tell us much more than the surface-level political preferences of each state, but it will become very useful as we progress through this project!

In [192]:
# Sort by Democrat voting proportion
election_results_2016_tidy.sort_values(by='Democrat Proportion', ascending=False, inplace=True)

# Create a stacked bar chart
plt.figure(figsize=(15, 8))
plt.bar(election_results_2016_tidy['State'], election_results_2016_tidy['Democrat Proportion'], label='Democrat Vote Proportion', color='blue')
plt.bar(election_results_2016_tidy['State'], election_results_2016_tidy['Republican Proportion'], bottom=election_results_2016_tidy['Democrat Proportion'], label='Republican Vote Proportion', color='red')
plt.xlabel('State')
plt.ylabel('Vote Proportion')
plt.title('2016 Presidential Election Results by State')
plt.xticks(rotation=90)
plt.legend()
plt.show()

Now, let's use the same visualization technique as before to show the split of Democrat and Republican voting proportions in the 2020 Presidential Election:

In [193]:
# Sort by Democrat voting proportion
election_results_2020_tidy.sort_values(by='Democrat Proportion', ascending=False, inplace=True)

# Create a stacked bar chart
plt.figure(figsize=(15, 8))
plt.bar(election_results_2020_tidy['state'], election_results_2020_tidy['Democrat Proportion'], label='Democrat Vote Proportion', color='blue')
plt.bar(election_results_2020_tidy['state'], election_results_2020_tidy['Republican Proportion'], bottom=election_results_2020_tidy['Democrat Proportion'], label='Republican Vote Proportion', color='red')
plt.xlabel('State')
plt.ylabel('Vote Proportion')
plt.title('2020 Presidential Election Results by State')
plt.xticks(rotation=90)
plt.legend()
plt.show()

HYPOTHESIS & FURTHER SET-UP¶

Given the investigative question set forth at the beginning of this project, as well as the data we have looked at thus far, my hypothesis is that an increase in the prevalence of mass gun violence in a state between 2016 and 2020 will lead to a greater shift toward the Democratic Party in the 2020 Presidential Election.

Combining Data Sets¶

Now that we have loaded and tidied our datasets, let's start to combine them so we can investigate the connection between gun violence/mass shooting prevalence and voting trends.

First, let's combine the election results datasets so we can calculate the change in voting proportions from 2016 to 2020.

In [194]:
election_results_df = election_results_2016_tidy.merge(election_results_2020_tidy, how = 'outer', left_on='State', right_on = 'state', suffixes=('_2016', '_2020'))
election_results_df = election_results_df.drop(['state'], axis=1)
election_results_df.head(10)
Out[194]:
State Democrat Proportion_2016 Republican Proportion_2016 Democrat Proportion_2020 Republican Proportion_2020
0 District of Columbia 0.956701 0.043299 0.947075 0.052925
1 Hawaii 0.674620 0.325380 0.650426 0.349574
2 California 0.661359 0.338641 0.659348 0.340652
3 Vermont 0.652081 0.347919 0.671562 0.328438
4 Massachusetts 0.646872 0.353128 0.667834 0.332166
5 Maryland 0.639875 0.360125 0.648533 0.351467
6 New York 0.617861 0.382139 0.564811 0.435189
7 Illinois 0.590095 0.409905 0.579234 0.420766
8 Washington 0.587662 0.412338 0.603309 0.396691
9 Rhode Island 0.582983 0.417017 0.603370 0.396630

Cool! Now, let's investigate this newly created dataframe. Let's calculate the change in Democrat and Republican voting proportions between 2016 and 2020.

In [195]:
election_results_df['Democrat Proportion Change'] = election_results_df['Democrat Proportion_2020'] - election_results_df['Democrat Proportion_2016']
election_results_df['Republican Proportion Change'] = election_results_df['Republican Proportion_2020'] - election_results_df['Republican Proportion_2016']

Now, let's take a look at the five states with the highest increase in the proportion of Democrat votes, followed by the five states with the highest increase in the proportion of Republican votes.

In [196]:
election_results_df.sort_values(by='Democrat Proportion Change', ascending = False).head(5)
Out[196]:
State Democrat Proportion_2016 Republican Proportion_2016 Democrat Proportion_2020 Republican Proportion_2020 Democrat Proportion Change Republican Proportion Change
16 Colorado 0.526776 0.473224 0.567693 0.432307 0.040917 -0.040917
13 Delaware 0.560336 0.439664 0.596469 0.403531 0.036133 -0.036133
40 Nebraska 0.364793 0.635207 0.400725 0.599275 0.035932 -0.035932
17 Maine 0.516129 0.483871 0.551852 0.448148 0.035723 -0.035723
20 New Hampshire 0.502110 0.497890 0.536212 0.463788 0.034102 -0.034102
In [197]:
election_results_df.sort_values(by='Republican Proportion Change', ascending = False).head(5)
Out[197]:
State Democrat Proportion_2016 Republican Proportion_2016 Democrat Proportion_2020 Republican Proportion_2020 Democrat Proportion Change Republican Proportion Change
32 Alaska 0.415929 0.584071 0.360990 0.639010 -0.054939 0.054939
6 New York 0.617861 0.382139 0.564811 0.435189 -0.053050 0.053050
1 Hawaii 0.674620 0.325380 0.650426 0.349574 -0.024195 0.024195
33 Mississippi 0.409184 0.590816 0.395477 0.604523 -0.013706 0.013706
7 Illinois 0.590095 0.409905 0.579234 0.420766 -0.010861 0.010861

Now, let's visualize the change in Democrat vote proportion in each state:

In [198]:
election_results_df = election_results_df.sort_values(by='Democrat Proportion Change', ascending = False)

states = election_results_df['State']
dem_change = election_results_df['Democrat Proportion Change']

fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(10, 8))

# Plotting Democrat Proportion Change
sns.barplot(x=states, y=dem_change, ax=axes, color="blue")
axes.set_title('Change in Democrat Voting Proportion (2016-2020)')
axes.set_ylabel('Proportion Change')
axes.set_xlabel('State')
axes.set_xticklabels(states, rotation=90)

plt.tight_layout()

plt.show()

Unsurprisingly, most states saw an increase in the proportion of votes cast for the Democratic party; this should be expected given the outcomes of the 2016 election (Trump beats Clinton) and 2020 election (Biden beats Trump).

However, this prompts an interesting predicament regarding what data we truly care about. If we are trying to figure out the impacts of gun violence on voting, then logically we only care about changes in voting proportions relative to other states. In other words, we don't care that the entire country shifted to voting more Democratic; we care about which states shifted MORE and which shifted LESS.

Thus, let's back out this "overall voting" factor. We will calculate the population-weighted average change in voting proportion and subtract that value from each state's change in proportion, leaving us with the proportion shift beyond what would be expected from the overarching national voting trend. We use the weighted average rather than the arithmetic average because each state, logically, should not be equally weighted when calculating a national factor.
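
Concretely, letting pop_i denote state i's 2020 Census population and Δ_i its raw change in Democrat voting proportion, the factor we compute below is:

overall factor = ( Σ_i pop_i · Δ_i ) / ( Σ_i pop_i )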

In [199]:
#Calculate the overall voting factor
election_results_overallfactor = election_results_df.merge(us_populations_tidy,how = 'left', left_on='State', right_on = 'state')
election_results_overallfactor = election_results_overallfactor[['Democrat Proportion Change','Population']] #Keep only the columns needed for the weighted average
total_population = election_results_overallfactor['Population'].sum()
election_results_overallfactor['Contribution to Overall Factor'] = election_results_overallfactor['Population'] / total_population
election_results_overallfactor['Contribution to Overall Factor'] = election_results_overallfactor['Contribution to Overall Factor'] * election_results_overallfactor['Democrat Proportion Change']
election_results_overallfactor = election_results_overallfactor[['Contribution to Overall Factor']].sum()
election_results_overallfactor
Out[199]:
Contribution to Overall Factor    0.007143
dtype: float64

We have now calculated the "overall voting" factor as 0.0071 (0.71 percentage points), meaning that, on a population-weighted average basis, a state's Democrat voting proportion increased by 0.71 percentage points. We can now back this out of our original calculation of the change in Democrat and Republican voting proportions.
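
As an aside, the same population-weighted mean can be computed in a single call (a sketch, assuming every state name matches between the two frames; np is imported at the top of the notebook):

In [ ]:
#Sketch: equivalent one-line computation of the overall voting factor
merged = election_results_df.merge(us_populations_tidy, how='left', left_on='State', right_on='state')
print(np.average(merged['Democrat Proportion Change'], weights=merged['Population']))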

In [200]:
#Create columns to store our new adjusted calculations
election_results_df['Democrat Proportion Change Adjusted'] = election_results_df['Democrat Proportion Change'] - election_results_overallfactor.iloc[0]
election_results_df['Republican Proportion Change Adjusted'] = election_results_df['Republican Proportion Change'] + election_results_overallfactor.iloc[0]

#Clean up the DF by dropping unnecessary columns
election_results_df_clean = election_results_df[['State','Democrat Proportion Change Adjusted','Republican Proportion Change Adjusted']]
election_results_df_clean.head(10)
Out[200]:
State Democrat Proportion Change Adjusted Republican Proportion Change Adjusted
16 Colorado 0.033774 -0.033774
13 Delaware 0.028990 -0.028990
40 Nebraska 0.028789 -0.028789
17 Maine 0.028580 -0.028580
20 New Hampshire 0.026959 -0.026959
37 Kansas 0.026304 -0.026304
50 Wyoming 0.025055 -0.025055
11 Connecticut 0.023698 -0.023698
38 Montana 0.020595 -0.020595
19 Minnesota 0.020438 -0.020438

Cool, now we have a clean dataframe with the variables we want (pertaining to the 2016 and 2020 US Presidential Elections).

Now, let's recreate that same visualization of the change in Democrat voting proportion, this time with the overall factor included to adjust the values.

In [201]:
election_results_df_clean = election_results_df_clean.sort_values(by='Democrat Proportion Change Adjusted', ascending = False)

states = election_results_df_clean['State']
dem_change = election_results_df_clean['Democrat Proportion Change Adjusted']

fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(10, 8))

# Plotting Democrat Proportion Change
sns.barplot(x=states, y=dem_change, ax=axes, color="blue")
axes.set_title('Change in Democrat Voting Proportion, Adjusted (2016-2020)')
axes.set_ylabel('Proportion Change')
axes.set_xlabel('State')
axes.set_xticklabels(states, rotation=90)

plt.tight_layout()

plt.show()

The picture is very similar, but now more indicative of changes within a state (rather than at the national level).

MODEL & CONCLUSION¶

The Path Forward: 2 Possible Models¶

Now that most of our data is assembled, let's think about two ways we could possibly proceed:

  • Correlation Analysis: We could proceed with a correlation analysis, calculating the correlations between the (adjusted) change in a state's Democrat/Republican voting proportion and the per capita mass shooting deaths (or injuries) in that state. This approach would be a fairly high-level way to assess whether there appears to be a linear relationship between the two statistics. It could allow us to draw conclusions regarding the impact of mass shooting casualties on changes in voting trends, but would not be very useful for generating future predictions. (A minimal sketch of this approach appears after this list.)

  • Regression Analysis: Another possible model would be a regression that uses the (adjusted) change in voting trends (either Democrat or Republican, since the two are complements) as the dependent variable, and the # of deaths per capita, # of injuries per capita, and potentially other statistics we could find (to reduce omitted variable bias) as independent variables. Such a model would allow us to check for statistical significance, as well as make future predictions (assuming our model has any predictive power, in the statistical sense).
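
For reference, here is a minimal sketch of what the correlation-analysis route might look like, using the dataframes assembled above (pairwise Pearson correlations via pandas):

In [ ]:
#Sketch of the correlation-analysis alternative (not the path pursued below)
corr_df = mass_shootings_clean_normalized.merge(election_results_df_clean, how='left', left_on='state', right_on='State')
corr_df[['# Killed per Capita', '# Injured per Capita', '# Killed or Injured per Capita',
         'Democrat Proportion Change Adjusted']].corr()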

Performing Regression Analysis¶

Ultimately, I decided to pursue regression analysis as opposed to correlation analysis because it offers greater specificity regarding the relationship between variables. Namely, regression analysis gives us a model with greater predictive power.

Given the available data in our datasets, the regression will use the per capita mass shooting statistics (from between the 2016 and 2020 election dates) as the independent features and the adjusted change in each state's Democrat voting proportion as the dependent variable. In principle that allows up to 51 observations (50 states + Washington D.C.), though only states that appear in the shootings data survive the merge (46 observations, as the regression output below shows). The decision to use the absolute prevalence of mass shooting casualties (as opposed to calculating the change in shooting prevalence from 2012-2016 to 2016-2020) was made on the basis of the data available to us. I believe this simplification is acceptable because our dependent variable, the change in Democrat voting proportion, implicitly "bakes" any mass shootings that occurred prior to the 2016 election into its baseline (those shootings would have theoretically impacted the Democrat vote share in 2016); thus, any changes from 2016-2020 (above and beyond those expected based on the country's change in preference at large) should be related strictly to mass shootings that occurred between 2016 and 2020, not to the change in prevalence from a previous period.

All the following analysis uses the mass shootings dataframe that includes all shootings...

First, we need to create a DataFrame that combines all our dependent and independent features:

In [202]:
regression_data = mass_shootings_clean_normalized.merge(election_results_df_clean,how = 'left', left_on='state', right_on = 'State')
regression_data = regression_data[['State','# Killed per Capita','# Injured per Capita','# Killed or Injured per Capita','Democrat Proportion Change Adjusted']] #Keep only the columns we want, with the dependent variable last
regression_data.head(10)
Out[202]:
State # Killed per Capita # Injured per Capita # Killed or Injured per Capita Democrat Proportion Change Adjusted
0 Alabama 0.000010 0.000037 0.000047 0.006923
1 Alaska 0.000001 0.000011 0.000012 -0.062082
2 Arizona 0.000003 0.000008 0.000011 0.013972
3 Arkansas 0.000006 0.000040 0.000045 -0.008421
4 California 0.000005 0.000017 0.000021 -0.009154
5 Colorado 0.000004 0.000018 0.000022 0.033774
6 Connecticut 0.000003 0.000016 0.000019 0.023698
7 Delaware 0.000008 0.000025 0.000033 0.028990
8 District of Columbia 0.000019 0.000190 0.000209 -0.016769
9 Florida 0.000005 0.000019 0.000024 -0.017960

Great! Now we have a centralized dataframe that has all our independent and dependent variables combined in a clean, easy-to-use way!

Now, let's run some single-variable regressions to get an early idea of how our dependent and independent datapoints are related.

In [203]:
x_features = ['# Killed per Capita', '# Injured per Capita', '# Killed or Injured per Capita']

#From Demo 8
def regress_with_stats(df_penrose, observations):
  fig, ax = plt.subplots(1, 3, figsize=(15,5), sharex=False)

  for i,o in enumerate(observations):
      slope, intercept, r_value, p_value, std_err = stats.linregress(df_penrose[o],
                                                                     df_penrose['Democrat Proportion Change Adjusted'])
      # Pack these into a nice title
      diag_str = "p-value=%.1g\nr-value=%.3f\nstd err.=%.3f\nslope=%.3f\nintercept=%.3f" % (p_value, r_value, std_err, slope, intercept)
      df_penrose.plot.scatter(x=o, y='Democrat Proportion Change Adjusted', title=diag_str, ax=ax[i])
      # Make points and line
      pts = np.linspace(df_penrose[o].min(), df_penrose[o].max(), 500)
      line = slope * pts + intercept
      ax[i].plot(pts, line, lw=1, color='red')



regress_with_stats(regression_data, x_features)

That is very interesting: it appears that all our independent features (relating to mass shootings) are negatively related to the increase in the vote share Democrats received between the 2016 and 2020 Presidential elections. In other words, states with a greater per capita prevalence of mass shooting casualties seem to have had a greater DECREASE in Democrat voting share.

However, let's run an actual multi-variable linear regression before we expand too much on this phenomenon. Let's run a regression with '# Killed per Capita' and '# Injured per Capita' as our two independent variables. I will not be scaling these independent features because they are already on directly comparable scales (closely related per capita figures), so I do not see the need for scaling.

In [204]:
df_ind = regression_data[['# Killed per Capita','# Injured per Capita']]
df_target = regression_data['Democrat Proportion Change Adjusted']

X = df_ind
y = df_target

model = sm.OLS(y, X).fit()

predictions = model.predict(X) # make the predictions by the model

# Print out the statistics
model.summary()
Out[204]:
OLS Regression Results
Dep. Variable: Democrat Proportion Change Adjusted R-squared (uncentered): 0.083
Model: OLS Adj. R-squared (uncentered): 0.041
Method: Least Squares F-statistic: 1.980
Date: Sun, 10 Dec 2023 Prob (F-statistic): 0.150
Time: 16:54:44 Log-Likelihood: 115.91
No. Observations: 46 AIC: -227.8
Df Residuals: 44 BIC: -224.2
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
# Killed per Capita 1653.4746 842.391 1.963 0.056 -44.253 3351.202
# Injured per Capita -273.8772 145.056 -1.888 0.066 -566.218 18.463
Omnibus: 23.754 Durbin-Watson: 1.900
Prob(Omnibus): 0.000 Jarque-Bera (JB): 38.468
Skew: -1.588 Prob(JB): 4.43e-09
Kurtosis: 6.159 Cond. No. 12.7


Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Interesting. So our multi-variable linear regression model assigns a positive coefficient to '# Killed per Capita' and a negative coefficient to '# Injured per Capita'. At surface level, this does not seem to make much real-world sense. It fits our hypothesis that an increase in # Killed per Capita would increase the Democrat vote share, but it does not seem to follow that # Injured per Capita would decrease that vote share.

The coefficients on both independent features are borderline statistically significant (P = 0.056 and P = 0.066, just above the conventional 0.05 threshold), which is encouraging.
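
(These p-values can also be pulled programmatically from the fitted statsmodels results object; a quick sketch:)

In [ ]:
#Sketch: extract coefficients and their p-values from the fitted results
print(model.params)
print(model.pvalues)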

Thinking abstractly, I suppose it is possible that mass shooting deaths are more memorable (and death statistics are more often broadcast in the media), which could push Democrat vote share up, while # Injured may be more of a byproduct of states where guns are widely circulated and possessed (which tend to be more Republican states), and those states may have become less Democrat-leaning as a result of increased nationwide political polarization.

To try to get to the bottom of this, let's run a single-variable linear regression with '# Killed or Injured per Capita' as the lone independent variable.

In [205]:
df_ind = regression_data[['# Killed or Injured per Capita']]
df_target = regression_data['Democrat Proportion Change Adjusted']

X = df_ind
y = df_target

model = sm.OLS(y, X).fit()

predictions = model.predict(X) # make the predictions by the model

# Print out the statistics
model.summary()
Out[205]:
OLS Regression Results
Dep. Variable: Democrat Proportion Change Adjusted R-squared (uncentered): 0.001
Model: OLS Adj. R-squared (uncentered): -0.021
Method: Least Squares F-statistic: 0.03366
Date: Sun, 10 Dec 2023 Prob (F-statistic): 0.855
Time: 16:54:44 Log-Likelihood: 113.94
No. Observations: 46 AIC: -225.9
Df Residuals: 45 BIC: -224.1
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
# Killed or Injured per Capita -11.1120 60.564 -0.183 0.855 -133.094 110.870
Omnibus: 27.820 Durbin-Watson: 1.878
Prob(Omnibus): 0.000 Jarque-Bera (JB): 53.646
Skew: -1.754 Prob(JB): 2.24e-12
Kurtosis: 6.960 Cond. No. 1.00


Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Okay, so this regression has a negative coefficient, but it is nowhere near significant (P = 0.855), so we are going to disregard this regression when drawing conclusions.

Now, let's run a single-variable regression using only '# Killed per Capita' as the independent feature.

In [206]:
df_ind = regression_data[['# Killed per Capita']]
df_target = regression_data['Democrat Proportion Change Adjusted']

X = df_ind
y = df_target

model = sm.OLS(y, X).fit()

predictions = model.predict(X) # make the predictions by the model

# Print out the statistics
model.summary()
Out[206]:
OLS Regression Results
Dep. Variable: Democrat Proportion Change Adjusted R-squared (uncentered): 0.008
Model: OLS Adj. R-squared (uncentered): -0.014
Method: Least Squares F-statistic: 0.3734
Date: Sun, 10 Dec 2023 Prob (F-statistic): 0.544
Time: 16:54:44 Log-Likelihood: 114.12
No. Observations: 46 AIC: -226.2
Df Residuals: 45 BIC: -224.4
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
# Killed per Capita 247.3217 404.723 0.611 0.544 -567.832 1062.475
Omnibus: 23.555 Durbin-Watson: 1.988
Prob(Omnibus): 0.000 Jarque-Bera (JB): 37.667
Skew: -1.584 Prob(JB): 6.62e-09
Kurtosis: 6.100 Cond. No. 1.00


Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.

So, the coefficient on # Killed per Capita is indeed positive, but the statistical significance is still lacking (P = 0.544).

Thus, I cannot confidently draw any conclusions from this single-variable regression either.

Given the results of both of these single-variable linear regressions, I am disappointed in the lack of viable conclusions I can draw. However, the original multi-variable analysis (where both coefficients were borderline statistically significant) continues to intrigue me, as does the abstract explanation I put forth in the subsequent commentary.

My desire to build off that model (and its issues) naturally follows...

What if we added the original 2016 Republican proportion as an independent feature alongside '# Killed per Capita'? This could help assess whether our uncompelling results are attributable to greater polarization in certain states (an effect that may be inadvertently captured by the '# Injured per Capita' feature in the original multi-variable linear regression).

To do so, let's add that feature into our regression_data dataframe. Further, since we now have a variable that does not share the same (or an extremely similar) distribution as '# Killed per Capita', let's scale our features.
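
Recall that z-score standardization rescales each column as z = (x − μ) / σ, i.e., subtract the column mean and divide by the column standard deviation, so every feature ends up with mean 0 and standard deviation 1.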

In [207]:
from scipy.stats import zscore

#Add 2016 Republican Proportion to regression data dataframe
regression_data_new = regression_data.merge(election_results_2016_tidy, how = 'left', on='State')
regression_data_new = regression_data_new[['# Killed per Capita','# Injured per Capita','Democrat Proportion Change Adjusted','Republican Proportion']]
regression_data_new = regression_data_new.rename(columns = {'Republican Proportion' : '2016 Republican Proportion'})



regression_data_scaled = regression_data_new.apply(zscore) #Scale our regression data using Z-Score standardization
regression_data_scaled.head(10)
Out[207]:
# Killed per Capita # Injured per Capita Democrat Proportion Change Adjusted 2016 Republican Proportion
0 0.968178 0.304632 0.048431 0.942122
1 -0.918970 -0.439247 -3.502862 0.460929
2 -0.647380 -0.520192 0.411214 -0.067733
3 -0.019845 0.397921 -0.741233 0.933193
4 -0.190833 -0.277030 -0.778971 -1.518397
5 -0.405121 -0.223702 1.430293 -0.433021
6 -0.622933 -0.281479 0.911743 -0.794337
7 0.491780 -0.024243 1.184073 -0.703673
8 2.753904 4.741135 -1.170850 -3.900255
9 -0.113279 -0.216209 -1.232157 -0.167092

Now we have data that is properly formatted and scaled to run our regression analysis:

In [208]:
df_ind = regression_data_scaled[['# Killed per Capita', '2016 Republican Proportion']]
df_target = regression_data_scaled['Democrat Proportion Change Adjusted']

X = df_ind
y = df_target

model = sm.OLS(y, X).fit()

predictions = model.predict(X) # make the predictions by the model

# Print out the statistics
model.summary()
Out[208]:
OLS Regression Results
Dep. Variable: Democrat Proportion Change Adjusted R-squared (uncentered): 0.099
Model: OLS Adj. R-squared (uncentered): 0.058
Method: Least Squares F-statistic: 2.405
Date: Sun, 10 Dec 2023 Prob (F-statistic): 0.102
Time: 16:54:44 Log-Likelihood: -62.885
No. Observations: 46 AIC: 129.8
Df Residuals: 44 BIC: 133.4
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
# Killed per Capita -0.1592 0.149 -1.070 0.291 -0.459 0.141
2016 Republican Proportion 0.2304 0.149 1.548 0.129 -0.070 0.530
Omnibus: 35.854 Durbin-Watson: 2.022
Prob(Omnibus): 0.000 Jarque-Bera (JB): 99.764
Skew: -2.073 Prob(JB): 2.17e-22
Kurtosis: 8.905 Cond. No. 1.33


Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Frustrating! Now, the coefficient on # Killed per Capita is negative (suggesting greater mass gun violence predicts a decrease in Democrat vote share), and the statistical significance is greater than on our previous model that did not include the beginning Republican voter proportion (P = 0.291 versus P = 0.544).

NOW, let's run these regressions again, but only look at mass shootings with 8 or more deaths (trying to separate out indiscriminate mass shootings from more run-of-the-mill shootings).

In [209]:
regression_data = mass_shootings_clean_normalized_minimum8.merge(election_results_df_clean,how = 'left', left_on='state', right_on = 'State')
regression_data = regression_data[['State','# Killed per Capita','# Injured per Capita','# Killed or Injured per Capita','Democrat Proportion Change Adjusted']] #Keep only the columns we want, with the dependent variable last
regression_data.head(10)
Out[209]:
State # Killed per Capita # Injured per Capita # Killed or Injured per Capita Democrat Proportion Change Adjusted
0 California 3.287958e-07 5.058396e-08 3.793797e-07 -0.009154
1 Florida 7.892958e-07 7.892958e-07 1.578592e-06 -0.017960
2 Mississippi 2.701535e-06 3.376919e-07 3.039227e-06 -0.020849
3 Nevada 1.900397e-05 1.420466e-04 1.610506e-04 -0.006744
4 Ohio 8.474973e-07 1.440745e-06 2.288243e-06 -0.005255
5 Pennsylvania 8.459781e-07 5.383497e-07 1.384328e-06 -0.000096
6 Texas 2.641917e-06 2.744849e-06 5.386765e-06 0.010151
7 Virginia 1.506130e-06 4.634246e-07 1.969555e-06 0.013921

Now, let's run some single-variable regressions to get an early idea of how our dependent and independent datapoints are related.

In [210]:
x_features = ['# Killed per Capita', '# Injured per Capita', '# Killed or Injured per Capita']

#Reuse the regress_with_stats helper defined earlier (from Demo 8)
regress_with_stats(regression_data, x_features)

Again, it appears that all our independent features (relating to mass shootings) are negatively related to the increase in the vote share Democrats received between the 2016 and 2020 Presidential elections. In other words, states with a greater per capita prevalence of mass shooting casualties seem to have had a greater DECREASE in Democrat voting share.

Let's run a regression with '# Killed per Capita' as our independent variable.

In [211]:
df_ind = regression_data[['# Killed per Capita']]
df_target = regression_data['Democrat Proportion Change Adjusted']

X = df_ind
y = df_target

model = sm.OLS(y, X).fit()

predictions = model.predict(X) # make the predictions by the model

# Print out the statistics
model.summary()
/usr/local/lib/python3.10/dist-packages/scipy/stats/_stats_py.py:1806: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=8
  warnings.warn("kurtosistest only valid for n>=20 ... continuing "
Out[211]:
OLS Regression Results
Dep. Variable: Democrat Proportion Change Adjusted R-squared (uncentered): 0.055
Model: OLS Adj. R-squared (uncentered): -0.080
Method: Least Squares F-statistic: 0.4040
Date: Sun, 10 Dec 2023 Prob (F-statistic): 0.545
Time: 16:54:47 Log-Likelihood: 24.056
No. Observations: 8 AIC: -46.11
Df Residuals: 7 BIC: -46.03
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
# Killed per Capita -417.0750 656.184 -0.636 0.545 -1968.703 1134.553
Omnibus: 0.500 Durbin-Watson: 0.621
Prob(Omnibus): 0.779 Jarque-Bera (JB): 0.465
Skew: 0.031 Prob(JB): 0.793
Kurtosis: 1.821 Cond. No. 1.00


Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.

So, only including mass shootings with 8+ deaths, the coefficient on # Killed per Capita is negative, but the P-value is far from statistically significant (P = 0.545).

Now, let's again add in the 2016 Republican vote share to make sure we do not ignore potential effects of polarization.

In [212]:

#Add 2016 Republican Proportion to regression data dataframe
regression_data_new = regression_data.merge(election_results_2016_tidy, how = 'left', on='State')
regression_data_new = regression_data_new[['# Killed per Capita','# Injured per Capita','Democrat Proportion Change Adjusted','Republican Proportion']]
regression_data_new = regression_data_new.rename(columns = {'Republican Proportion' : '2016 Republican Proportion'})



regression_data_scaled = regression_data_new.apply(zscore) #Scale our regression data using Z-Score standardization
regression_data_scaled.head(10)
Out[212]:
# Killed per Capita # Injured per Capita Democrat Proportion Change Adjusted 2016 Republican Proportion
0 -0.552904 -0.396303 -0.406560 -2.279994
1 -0.474667 -0.380480 -1.175525 0.111112
2 -0.149782 -0.390153 -1.427815 1.318644
3 2.619958 2.645367 -0.196133 -0.160686
4 -0.464778 -0.366525 -0.066067 0.627843
5 -0.465037 -0.385855 0.384399 0.074417
6 -0.159911 -0.338590 1.279250 0.691584
7 -0.352878 -0.387460 1.608451 -0.382920
In [213]:
df_ind = regression_data_scaled[['# Killed per Capita', '2016 Republican Proportion']]
df_target = regression_data_scaled['Democrat Proportion Change Adjusted']

X = df_ind
y = df_target

model = sm.OLS(y, X).fit()

predictions = model.predict(X) # make the predictions by the model

# Print out the statistics
model.summary()
/usr/local/lib/python3.10/dist-packages/scipy/stats/_stats_py.py:1806: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=8
  warnings.warn("kurtosistest only valid for n>=20 ... continuing "
Out[213]:
OLS Regression Results
Dep. Variable: Democrat Proportion Change Adjusted R-squared (uncentered): 0.013
Model: OLS Adj. R-squared (uncentered): -0.317
Method: Least Squares F-statistic: 0.03825
Date: Sun, 10 Dec 2023 Prob (F-statistic): 0.963
Time: 16:54:47 Log-Likelihood: -11.301
No. Observations: 8 AIC: 26.60
Df Residuals: 6 BIC: 26.76
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
# Killed per Capita -0.0512 0.406 -0.126 0.904 -1.044 0.942
2016 Republican Proportion -0.0980 0.406 -0.242 0.817 -1.091 0.895
Omnibus: 0.702 Durbin-Watson: 0.374
Prob(Omnibus): 0.704 Jarque-Bera (JB): 0.556
Skew: 0.255 Prob(JB): 0.757
Kurtosis: 1.813 Cond. No. 1.04


Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.

So, even trying to drill down to only the more "traditional" mass shootings, my findings still dispute the hypothesis that an increase in mass shootings leads to a greater Democrat vote share.

Conclusion¶

Given the results of my models, the only logical conclusion I can draw is that an increased prevalence of mass gun violence (regardless of how it's measured) does NOT suggest a greater increase in the Democrat vote share in the next presidential election. In fact, my findings, although not statistically significant, actually indicate the opposite: Republicans tend to see a relative increase in their vote share in states that experience more mass shooting violence. While that does not (to me, at least) make much intuitive sense, the numbers are the numbers. A real-world takeaway from this analysis could be that Democrats should focus less on gun control in their political rhetoric, while Republicans may be wise to remain fixed on their pro-gun-rights stances.

Collaboration Plan¶

For this project, I plan to work solo, so I will not need to collaborate with a partner. Yet, it is still crucial to formulate a plan to ensure that I stay on track while working on the project. As such, I plan to set aside a specific block of time each week to work on this project, thereby ensuring that I do not fall behind and risk submitting a project without the adequate level of detail expected.

In [214]:
#Download as HTML...
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
In [215]:
%cd /content/drive/MyDrive/'Colab Notebooks'
/content/drive/MyDrive/Colab Notebooks
In [216]:
%%shell
jupyter nbconvert --to html 'CMPS3160 Voting Project.ipynb'
[NbConvertApp] Converting notebook CMPS3160 Voting Project.ipynb to html
[NbConvertApp] Writing 725785 bytes to CMPS3160 Voting Project.html
Out[216]: