Bible Common Words and Sentiment

MEA Text Project

Bible Subdivision Comparisons

In this project, I will analyze the KJV Bible by splitting it into five subdivisions. I want to find the most common words in each subdivision and compare their sentiments. I believe poetry will have the most positive sentiment, and prophecy or history the most negative.

First, I have to load all my packages and import my text.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidytext)
library(textdata)
library(readr)
library(wordcloud2)

kjv <- read_csv("kjv.csv")
Rows: 31102 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): citation, book, text
dbl (2): chapter, verse

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
kjv |> 
  unnest_tokens(word, text) -> kjv2

As an overview, I would like to see the most common words of the entire Bible before breaking it down by subdivision.

kjv2 |> 
  anti_join(stop_words) |> 
  count(word, sort = TRUE) |> 
  head(100) -> Bible_Common_Words 
Joining with `by = join_by(word)`
wordcloud2(Bible_Common_Words)

On Biblerr.com, each book of the Bible can be grouped into five subdivisions: history, poetry, prophecy, gospels, and epistles.

The first step in this project is to sort the books into the subdivisions they fall into. I will use dplyr's filter() to accomplish this.

kjv2 |> 
  anti_join(stop_words) |> 
  filter(book %in% c("Genesis", "Exodus", "Leviticus", "Numbers", 
                   "Deuteronomy", "Joshua", "Judges", "Ruth", 
                   "1 Samuel", "2 Samuel", "1 Kings", "2 Kings", 
                   "1 Chronicles", "2 Chronicles", "Ezra", "Nehemiah", 
                   "Esther", "Acts")) -> History_Books
Joining with `by = join_by(word)`
kjv2 |> 
  anti_join(stop_words) |> 
  filter(book %in% c("Job", "Psalms", "Proverbs", "Ecclesiastes", "Song of Solomon", 
                     "Lamentations")) -> Poetry_Books
Joining with `by = join_by(word)`
kjv2 |> 
  anti_join(stop_words) |> 
  filter(book %in% c("Isiah", "Jeremiah", "Ezekiel", "Daniel", "Hosea",
                     "Joel", "Amos", "Obadiah", "Jonah", "Micah", "Habakkuk",
                     "Zephaniah", "Haggai", "Zechariah", 
                     "Malachi", "Revelation")) -> Prophecy_Books
Joining with `by = join_by(word)`
kjv2 |>
  anti_join(stop_words) |> 
  filter(book %in% c("Matthew", "Mark", "Luke", "John")) -> Gospels_Books
Joining with `by = join_by(word)`
kjv2 |>
  anti_join(stop_words) |> 
  filter(book %in% c("Romans", "1 Corinthians", "2 Corinthians", "Galations", 
                     "Ephesians", "Philippians", "Colossians", "1 Thessalonians",
                     "2 Thessalonians", "1 Timothy", "2 Timothy", "Titus", "Philemon", 
                     "Hebrews", "James", "1 Peter", "2 Peter", "1 John", "2 John", 
                     "3 John", "Jude")) -> Epistles_Books 
Joining with `by = join_by(word)`
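
The five filter() calls above could also be written as a single join against a lookup table that maps each book to its subdivision. This is only a sketch: the subdivisions tibble and the kjv_by_subdivision name below are hypothetical, and the table would need all of the books listed before it reproduced the five groups above.

# Hypothetical lookup table mapping books to subdivisions (only a few rows shown)
subdivisions <- tribble(
  ~book,       ~subdivision,
  "Genesis",   "History",
  "Psalms",    "Poetry",
  "Jeremiah",  "Prophecy",
  "Matthew",   "Gospels",
  "Romans",    "Epistles"
  # ... remaining books would be added here
)

kjv2 |> 
  anti_join(stop_words, by = "word") |> 
  inner_join(subdivisions, by = "book") -> kjv_by_subdivision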

Next, I would like to see the most common words in each of the five subdivisions. I will also make a chart for each so the top ten common words are easy to see. I am only looking at the top ten in each subdivision to keep the charts from becoming cluttered.

History_Books |> 
  count(word, sort = TRUE) |> 
  head(10) |> 
  arrange(desc(n)) -> History_Common_Words

History_Common_Words |> 
  ggplot(aes(word, n, fill = word)) + geom_col() + theme_classic() + 
  labs (title= "Most Common History Words")

Poetry_Books |> 
  count(word, sort = TRUE) |> 
  head(10) |> 
  arrange(desc(n)) -> Poetry_Common_Words


Poetry_Common_Words |> 
  ggplot(aes(word, n, fill = word)) + geom_col() + theme_classic() + 
  labs (title= "Most Common Poetry Words") 

Prophecy_Books |> 
  count(word, sort = TRUE) |> 
  head(10) |> 
  arrange(desc(n)) -> Prophecy_Common_Words

Prophecy_Common_Words |> 
  ggplot(aes(word, n, fill = word)) + geom_col() + theme_classic() + 
  labs (title= "Most Common Prophecy Words") 

Gospels_Books |> 
  count(word, sort = TRUE) |> 
  head(10) |> 
  arrange(desc(n)) -> Gospels_Common_Words

Gospels_Common_Words |> 
  ggplot(aes(word, n, fill = word)) + geom_col() + theme_classic() + 
  labs (title= "Most Common Gospel Words") 

Epistles_Books |> 
  count(word, sort = TRUE) |> 
  head(10) |> 
  arrange(desc(n)) -> Epistles_Common_Words

Epistles_Common_Words |> 
  ggplot(aes(word, n, fill = word)) + geom_col() + theme_classic() + 
  labs (title= "Most Common Epistle Words") 

After viewing the most common words in each subdivision, I noticed similarities and differences. Each subdivision shares words such as "God," "Lord," "ye," and "thy," but each also has words that make it unique. The history subdivision has words such as "Israel" and "children." "Israel" makes sense because these books explain the context and setting of the Bible. "Children" is interesting because, while it is not an uncommon word, it only appears in the top ten of this subdivision. History also has a prominent number of nouns among its common words. The poetry subdivision has words such as "heart" and "wicked"; in contrast to the nouns in history, these are more emotional words that carry sentiment. The prophecy subdivision seems most similar to the history subdivision: it also has "Israel" in its top ten and a large number of nouns among its most common words. The gospels subdivision has the most variations of names for God among its most common words, with "Jesus," "Lord," "Father," and "God" all in the top ten. Finally, the epistles subdivision has fairly similar words to the rest, but "spirit" stuck out to me; it is in the top ten, possibly because the term "Holy Spirit" is used so often in the epistles. I made a bar plot for each subdivision to visualize these results.

After analyzing the most common words, it would be interesting to see the sentiment in each subdivision. This will help show which subdivision carries the most emotion. Using the AFINN lexicon, which scores words from -5 (most negative) to +5 (most positive), I will list the top ten positive and top ten negative words in each subdivision. I will also find the average sentiment of each subdivision and create a bar plot at the end.

History_Books |> 
  count(word) |> 
  arrange(desc(n)) |> 
  inner_join(get_sentiments('afinn')) -> History_Sentiments
Joining with `by = join_by(word)`
top_positive_History <- History_Sentiments |> 
  filter(value > 0) |> 
  arrange(desc(value)) |> 
  slice_head(n = 10)
top_negative_History <- History_Sentiments |> 
  filter(value < 0) |> 
  arrange(value) |> 
  slice_head(n = 10)


sum(History_Sentiments$value) -> total_history_sentiment
total_history_words <- sum(History_Sentiments$n)
average_history_sentiment <- total_history_sentiment / total_history_words
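
Note that sum(History_Sentiments$value) adds each distinct word's AFINN score once, while total_history_words counts every occurrence. If the goal is to weight each word's score by how often it appears, a frequency-weighted mean could be used instead; a minimal sketch assuming the History_Sentiments tibble from above, with weighted_history_sentiment as a hypothetical name.

# Weight each AFINN value by the word's count, then divide by total occurrences
History_Sentiments |> 
  summarise(weighted_avg = sum(value * n) / sum(n)) |> 
  pull(weighted_avg) -> weighted_history_sentiment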

Poetry_Books |> 
  count(word) |> 
  arrange(desc(n)) |> 
  inner_join(get_sentiments('afinn')) -> Poetry_Sentiments
Joining with `by = join_by(word)`
top_positive_Poetry <- Poetry_Sentiments |>
  filter(value > 0) |> 
  arrange(desc(value)) |> 
  slice_head(n = 10) 
print(top_positive_Poetry)
# A tibble: 10 × 3
   word          n value
   <chr>     <int> <dbl>
 1 rejoice      81     4
 2 wonderful    11     4
 3 rejoicing     9     4
 4 triumph       8     4
 5 rejoiced      4     4
 6 praise      162     3
 7 love         68     3
 8 glad         44     3
 9 joy          38     3
10 beloved      36     3
top_negative_Poetry <- Poetry_Sentiments |>
  filter(value < 0) |> 
  arrange(value) |> 
  slice_head(n = 10) 
print(top_negative_Poetry)
# A tibble: 10 × 3
   word            n value
   <chr>       <int> <dbl>
 1 hell           16    -4
 2 ass             5    -4
 3 fraud           1    -4
 4 whore           1    -4
 5 evil          126    -3
 6 anger          50    -3
 7 destruction    41    -3
 8 hate           39    -3
 9 destroy        35    -3
10 die            26    -3
sum(Poetry_Sentiments$value) -> total_poetry_sentiment
total_poetry_words <- sum(Poetry_Sentiments$n)
average_poetry_sentiment <- total_poetry_sentiment / total_poetry_words

Prophecy_Books |> 
  count(word) |> 
  arrange(desc(n)) |> 
  inner_join(get_sentiments('afinn')) -> Prophecy_Sentiments
Joining with `by = join_by(word)`
top_positive_Prophecy <- Prophecy_Sentiments |> 
  filter(value > 0) |> 
  arrange(desc(value)) |> 
  slice_head(n = 10) 
print(top_positive_Prophecy)
# A tibble: 10 × 3
   word          n value
   <chr>     <int> <dbl>
 1 rejoice      25     4
 2 rejoiced      5     4
 3 rejoicing     3     4
 4 wonderful     1     4
 5 love         25     3
 6 pleasant     18     3
 7 praise       18     3
 8 joy          16     3
 9 loved        15     3
10 glad         12     3
top_negative_Prophecy <- Prophecy_Sentiments |> 
  filter(value < 0) |> 
  arrange(value) |> 
  slice_head(n = 10) 
print(top_negative_Prophecy)
# A tibble: 10 × 3
   word         n value
   <chr>    <int> <dbl>
 1 bastard      1    -5
 2 hell        11    -4
 3 ass          6    -4
 4 whore        5    -4
 5 evil       147    -3
 6 die         76    -3
 7 anger       69    -3
 8 destroy     69    -3
 9 violence    33    -3
10 dead        30    -3
sum(Prophecy_Sentiments$value) -> total_prophecy_sentiment
total_prophecy_words <- sum(Prophecy_Sentiments$n)
average_prophecy_sentiment <- total_prophecy_sentiment / total_prophecy_words

Gospels_Books |> 
  count(word) |> 
  arrange(desc(n)) |> 
  inner_join(get_sentiments('afinn')) -> Gospels_Sentiments
Joining with `by = join_by(word)`
top_positive_Gospels <- Gospels_Sentiments |> 
  filter(value > 0) |> 
  arrange(desc(value)) |> 
  slice_head(n = 10) 
print(top_positive_Gospels)
# A tibble: 10 × 3
   word          n value
   <chr>     <int> <dbl>
 1 rejoice      13     4
 2 heavenly      8     4
 3 miracle       7     4
 4 rejoiced      6     4
 5 wonderful     2     4
 6 rejoicing     1     4
 7 love         51     3
 8 joy          24     3
 9 loved        24     3
10 faithful     11     3
top_negative_Gospels <- Gospels_Sentiments |> 
  filter(value < 0) |> 
  arrange(value) |> 
  slice_head(n = 10) 
print(top_negative_Gospels)
# A tibble: 10 × 3
   word         n value
   <chr>    <int> <dbl>
 1 cock        12    -5
 2 hell        15    -4
 3 ass          7    -4
 4 damned       1    -4
 5 dead        69    -3
 6 evil        47    -3
 7 kill        29    -3
 8 die         27    -3
 9 destroy     21    -3
10 betrayed    18    -3
sum(Gospels_Sentiments$value) -> total_Gospels_sentiment
total_Gospels_words <- sum(Gospels_Sentiments$n)
average_Gospels_sentiment <- total_Gospels_sentiment / total_Gospels_words

Epistles_Books |> 
  count(word) |> 
  arrange(desc(n)) |> 
  inner_join(get_sentiments('afinn')) -> Epistles_Sentiments
Joining with `by = join_by(word)`
top_positive_Epistles <- Epistles_Sentiments |> 
  filter(value > 0) |> 
  arrange(desc(value)) |> 
  slice_head(n = 10)
print(top_positive_Epistles)
# A tibble: 10 × 3
   word          n value
   <chr>     <int> <dbl>
 1 rejoice      26     4
 2 heavenly     14     4
 3 rejoicing     9     4
 4 rejoiced      5     4
 5 triumph       1     4
 6 win           1     4
 7 love        122     3
 8 beloved      54     3
 9 faithful     33     3
10 joy          33     3
top_negative_Epistles <- Epistles_Sentiments |> 
  filter(value < 0) |> 
  arrange(value) |> 
  slice_head(n = 10) 
print(top_negative_Epistles)
# A tibble: 10 × 3
   word         n value
   <chr>    <int> <dbl>
 1 bastards     1    -5
 2 damned       2    -4
 3 hell         2    -4
 4 ass          1    -4
 5 fraud        1    -4
 6 tortured     1    -4
 7 dead        77    -3
 8 evil        71    -3
 9 died        18    -3
10 die         15    -3
sum(Epistles_Sentiments$value) -> total_Epistles_sentiment
total_Epistles_words <- sum(Epistles_Sentiments$n)
average_Epistles_sentiment <- total_Epistles_sentiment / total_Epistles_words
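
The count/join/sum steps above are repeated for every subdivision. A small helper function could compute the same average for any of the *_Books tibbles; this is only a sketch, and average_sentiment is a hypothetical helper not used elsewhere in the project.

# Hypothetical helper: average AFINN sentiment for one subdivision's tokens,
# using the same sum(value) / sum(n) calculation as the chunks above
average_sentiment <- function(books) {
  books |> 
    count(word) |> 
    inner_join(get_sentiments("afinn"), by = "word") |> 
    summarise(avg = sum(value) / sum(n)) |> 
    pull(avg)
}

average_sentiment(Poetry_Books)   # should match average_poetry_sentiment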

average_sentiments <- c(average_history_sentiment, average_poetry_sentiment,
                        average_prophecy_sentiment, average_Gospels_sentiment,
                        average_Epistles_sentiment)

categories <- c("History", "Poetry", "Prophecy", "Gospels", "Epistles")

category_colors <- c("blue", "green", "orange", "red", "purple")


barplot(average_sentiments, names.arg = categories,
        main = "Average Sentiment Scores by Book Category",
        xlab = "Book Category",
        ylab = "Average Sentiment Score",
        col = category_colors)
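
Since the rest of the project uses ggplot2, the same comparison could also be drawn with geom_col() after collecting the averages into a tibble; a minimal sketch assuming the average_sentiments and categories vectors defined above.

tibble(category = categories, average = average_sentiments) |> 
  ggplot(aes(category, average, fill = category)) + geom_col() + theme_classic() + 
  labs(title = "Average Sentiment Scores by Book Category", 
       x = "Book Category", y = "Average Sentiment Score")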

"God" is the most common word that holds sentiment in every subdivision except the gospels, where the most common word with sentiment is "Jesus." This makes sense since Jesus is who each gospel book is about. In three of the subdivisions, history, prophecy, and epistles, the word with the strongest negative sentiment is "bastard." In poetry it is "hell," and in the gospels it is "cock." In every subdivision the most common word with the most positive sentiment is "rejoice." After evaluating the sentiments in each subdivision, I wanted to see the average sentiment of each subdivision as a whole. The subdivision with the highest average sentiment is history, and the subdivision with the lowest is prophecy. I made a bar graph to better visualize this.

In conclusion, I was correct that prophecy has the most negative sentiment, but I was incorrect in predicting that poetry would have the most positive sentiment; history does instead.