Workflow:
- Collect songs from Wikipedia in the Jupyter notebook `Collecting Billboard Top 100 From Wikipedia`.
- Import the song list into Google Sheets and manually populate the "Lyrics" column.
- For each song, take the top Google search result for "(song_title) lyrics" that does not contain extraneous section markers ([CHORUS], [VERSE], and the like).
- Export the sheet as a .csv file and import it into this Jupyter notebook for data exploration.
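
The final import step can be sketched as follows. This is a minimal illustration using only the standard library, not the notebook's actual loading code: the column names `Year`, `Title`, and `Artist` are assumptions (only "Lyrics" is named above), and the inline `SAMPLE_CSV` is a made-up stand-in for the exported file.

```python
import csv
import io
from collections import defaultdict

# Hypothetical stand-in for the exported .csv. Only the "Lyrics" column name
# comes from the workflow; the other columns and all rows are assumptions.
SAMPLE_CSV = """Year,Title,Artist,Lyrics
1965,Song A,Artist One,"la la la"
1968,Song B,Artist Two,"na na na"
1972,Song C,Artist One,"do re mi"
"""


def load_songs(handle):
    """Read song rows from an open CSV file handle into a list of dicts."""
    return list(csv.DictReader(handle))


def unique_artists_by_decade(songs):
    """Map each decade (e.g. 1960) to the set of artists that appear in it."""
    by_decade = defaultdict(set)
    for row in songs:
        decade = int(row["Year"]) // 10 * 10
        by_decade[decade].add(row["Artist"])
    return dict(by_decade)


songs = load_songs(io.StringIO(SAMPLE_CSV))
artists = unique_artists_by_decade(songs)
```

With a real export, `io.StringIO(SAMPLE_CSV)` would be replaced by something like `open("songs.csv", newline="", encoding="utf-8")`, where the file name is again hypothetical.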
CTA Analysis:
- Preprocessing: lyrics without punctuation, lyrics by phrases, lyrics by words.
- Basic: unique artists per decade, unique artists overall, number of times each artist appears, unique phrases, average character count, average word count.
- Frequency: parts-of-speech distribution (by decade), speculative words, personal pronouns (total, first vs. second person, first-person singular vs. first-person plural, gender), lexical diversity.
- Network graph
- Sentiment analysis (by phrases)
- Common words: direct count (without stopwords), TF-IDF, topic modeling
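
A few of the steps above (the three preprocessing views, lexical diversity, and first- vs. second-person pronoun counts) could look roughly like this. It is a sketch under stated assumptions rather than the notebook's own code: a "phrase" is taken to be one non-empty line of lyrics, lexical diversity is taken to be the type-token ratio (unique words divided by total words), and the pronoun sets are short illustrative lists, not exhaustive ones.

```python
import string
from collections import Counter

# Illustrative pronoun sets (assumption: not the notebook's actual lists).
FIRST_PERSON = {"i", "me", "my", "mine", "we", "us", "our", "ours"}
SECOND_PERSON = {"you", "your", "yours"}


def strip_punctuation(lyrics):
    """Remove ASCII punctuation; note this also turns "don't" into "dont"."""
    return lyrics.translate(str.maketrans("", "", string.punctuation))


def to_phrases(lyrics):
    """Split lyrics into phrases, assuming one phrase per non-empty line."""
    return [line.strip() for line in lyrics.splitlines() if line.strip()]


def to_words(lyrics):
    """Lowercase, strip punctuation, and split on whitespace."""
    return strip_punctuation(lyrics).lower().split()


def lexical_diversity(words):
    """Type-token ratio: unique words over total words (0.0 if empty)."""
    return len(set(words)) / len(words) if words else 0.0


def pronoun_counts(words):
    """Count first- and second-person pronouns in a list of words."""
    counts = Counter(words)
    return {
        "first": sum(counts[w] for w in FIRST_PERSON),
        "second": sum(counts[w] for w in SECOND_PERSON),
    }


lyrics = "I love you, you love me\nWe are happy\n"
phrases = to_phrases(lyrics)
words = to_words(lyrics)
diversity = lexical_diversity(words)
pronouns = pronoun_counts(words)
```

Keeping the phrase view separate from the word view lets phrase-level steps such as sentiment analysis work on the original lines, while word-level counts work on the normalized tokens.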