Skip to main content

Workflow

  1. Collect songs from Wikipedia in the Jupyter notebook `Collecting Billboard Top 100 From Wikipedia`.
  2. Import it into Google Sheets and manually populate the "Lyrics" column.
    1. Took the top result for "(song_title) lyrics" in a Google search that didn't contain extraneous features ([CHORUS], [VERSE], and the like)
  3. Export it as a .csv file and import it into this Jupyter notebook for data exploration.

CTA Analysis:

  1. Preprocessing Lyrics without punctuation, lyrics by phrases, lyrics by words.
  2. Basic: Unique artists in decade, unique artists, number of times artists appear, unique phrases, average character count, average word count.
  3. Frequency: Parts of speech distribution (by decade), sepculative words, personal pronouns (total, first vs second, first person singular vs first person plural, genders), lexical diversity.
  4. Network graph
  5. Sentiment analysis (by phrases)
  6. Common words: direct count (without stopwords), TF-IDF, topic modeling