In this post I explore the changing trends of the Billboard Hot 100 chart over the past 60 years. Increased "Chart Stickiness" reflects underlying changes in the ways we consume music as well as the rule-changes of the Billboard songs. Further audio analysis reveals that the popular songs of today are more energetic and danceable than ever before!
Exploring the connection between artists:
One of my favorite ways to find music is by surfing related artists: I pick a band I've been hooked on, click through the list of similar artists, find one I like, then click on their similar artists, and so on. Definitely a nerdy way to try and source new bands, but I've had some great success (this week I happened across Gillbanks - they have a weird Radiohead-ish feel at times and a Two Door Cinema Club feel at others... weird but it works).
All of this artist surfing got me thinking about what makes one artist "related" to another - touring together, similar influences, same record label, sharing bandmates - the list could go on. In my opinion, Spotify has tackled this question of relatedness in a really great and fundamental way:
In other words: My sister Kendal says "Yo, you saw St. Paul and the Broken Bones? I love that band - you should check out Nathaniel Rateliff and the Night Sweats, they have a similar vibe." Not only are the bands similar in their exotic names, but they can be considered related because Kendal often listens to them together. This example is a loose interpretation; Spotify's metrics seem to be based more on playlist proximity and probably some population measures of artist overlap by listening session.
Imagine "related artists" to be sort of like neighbors; the two artists seem to get along, so they live in similar areas so they can hang out. Those areas (aka communities) can be roughly understood as "genres."
Now, I don't particularly like the concept of genres; in most musical contexts, they are a clunky and prejudicial way of bucketing artists. ["I wouldn't like Chris Stapleton I HATE country!!" or "Banks? Weird name. Plus I don't like electronica, just Adele.."]. I appreciate the attempt to let a listener know what they are getting into, but genres are heavy-handed in their labeling and don't encourage exploration. What we will see in the following graphs is how user-level listening data can indicate how two artists fit together, both in the meta sense of genres and on a more useful granular level.
This is the fun part. With the idea of "communities" and neighborhoods in mind, I knew I wanted to create a network graph that plotted artists and their relations (more on this in a few - I added a more technical section at the end for those interested). What I needed was data, so I built a program in Python to ping Spotify's API and pull back a list of artist relationships. The simplified workflow is as follows:
- Define a list of 10 artists - we will call this the To_Parse list. I used: Kings of Leon, Kendrick Lamar, The War on Drugs, Zac Brown Band, The Rolling Stones, Daft Punk, Spoon, Kanye West, Rage Against the Machine, Atomic Man.
- I chose these artists to get a diverse picture of each "mega-genre". You'll notice that in a moment of complete vanity, I added my own band Atomic Man to the mix. This had an unexpected and interesting outcome that we'll discuss later!
- For each artist in this list, pull back their top 20 related artists and store them like: [Kings of Leon, The Strokes].
- If any of these 20 "related artists" has more than 100,000 followers on Spotify, add it to the To_Parse list.
- Repeat until at least 10,000 artists have been searched for.
- For every artist in the data set, reach out to the EchoNest API for genre information and then programmatically group these sub-genres into more general labels.
This program allowed me to collect information on exactly 7,609 artists' top 20 similar artists.
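The workflow above is basically a breadth-first crawl with a follower cutoff. Here is a minimal sketch of that loop, not my actual program: the two lookup functions are hypothetical stand-ins for the Spotify API calls, and the artist names and follower counts are made up so the example runs on its own.

```python
from collections import deque

# Made-up data standing in for API responses (assumption, not real Spotify data).
TOY_RELATED = {
    "big_seed": ["big_1", "big_2"],
    "big_1": ["big_seed", "big_3"],
    "big_2": ["big_1"],
    "big_3": ["big_1"],
    "tiny_seed": ["tiny_1"],
}
TOY_FOLLOWERS = {
    "big_seed": 3_000_000, "big_1": 2_500_000, "big_2": 1_500_000,
    "big_3": 4_000_000, "tiny_seed": 500, "tiny_1": 800,
}

def get_related(artist):
    """Hypothetical helper: a real crawler would ask the API for related artists."""
    return TOY_RELATED.get(artist, [])

def get_followers(artist):
    """Hypothetical helper: a real crawler would ask the API for follower counts."""
    return TOY_FOLLOWERS.get(artist, 0)

def crawl(seeds, min_followers=100_000, max_searched=10_000):
    to_parse = deque(seeds)   # the To_Parse list
    queued = set(seeds)       # everything that has ever been queued
    edges = []                # (artist, related artist) pairs
    searched = 0
    while to_parse and searched < max_searched:
        artist = to_parse.popleft()
        searched += 1
        for related in get_related(artist):
            edges.append((artist, related))
            # Only popular, not-yet-queued artists get explored further.
            if related not in queued and get_followers(related) > min_followers:
                queued.add(related)
                to_parse.append(related)
    return edges

edges = crawl(["big_seed", "tiny_seed"])
```

Note how the manually seeded small artist still contributes its own edges, but none of its neighbors clear the cutoff, so the crawl never expands from them.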
Network graphs are often used in data science to visualize many relationships between members of a group. Recently a Duke researcher used graph theory to explain how meta-data could have implicated Paul Revere as a potential terrorist in the eyes of the British government - it's a super cool topic, but I won't delve into the more technical aspects till the end.
For now, it is only important to imagine each artist as a dot and the "related status" to be a line between those dots. Many algorithms have been developed to determine where each dot should be placed in order to properly visualize how some dots are closer neighbors than others and highlight the overall trends of a network.
I used a program called Gephi to import my data and apply 7 different placement algorithms. These algorithms weight relationships differently and allow us to see how artists are naturally clumped by "close relation." Although the dots are colored according to a genre variable I calculated, their placement does NOT take genre into account, but rather shows how frequently two artists are listened to in conjunction with one another. Take a look at the graphs here:
What is most interesting to me is how the algorithms were able to place artists into clusters that accurately reflected "closeness." The fact that most clusters are homogeneously colored indicates that Spotify's use of related artist status is a good substitute for "genre"; it captures large trends in music similarity but also allows for more nuanced relationships and groupings.
The 7 different graph layouts above allow us to look at the network from different perspectives, but none of the underlying relationships are changed. On average, this sample of related artists had ~8 degrees of separation (sorry Kevin Bacon...), but this number would likely drop if I were to include artists with fewer followers or utilize a larger data set of artists. I chose these cutoffs to keep the amount of small artist noise down and work with a manageable amount of data for the interactive chart below.
Once I found a graph layout that showed some good community clustering, I realized it would be helpful to interactively zoom and label the points. I used the Python network-analysis package igraph to import Gephi's GML file and hosted the resulting interactive visualization on Plotly after spending way too long figuring out how to format it :). You can hover over the chart to view artist name, number of followers, meta-genre and top three sub-genres. Use the crosshair pointer to select a section to zoom (to un-zoom click the "autoscale" button that looks like [X] in the top right corner). Notice anything cool? Let me know in the comments!
(PS: This graph doesn't work as well on mobile so you might want to check it out on a desktop/laptop)
Exploring the Graph
So who is the most important artist in this dataset? Who is the least important?
These are really useful and interesting questions that I chose to answer using the PageRank algorithm developed by Google founders Larry Page and Sergey Brin. This algorithm was used to allow the earliest versions of Google to decide which webpages were most useful and reputable and rank results accordingly. The algorithm has been subsequently abandoned for much more complex ones, but it is still really useful for characterizing network connectivity (have you ever gone past page 1 on Google in the past 5 years?).
Essentially PageRank gives each dot a value based on how many incoming links it has, with each link weighted by the importance of the dot it comes from (and diluted by how many outgoing links that dot has). I ran the PageRank algorithm on the data sample in order to determine the top 15 most important and 5 least important bands:
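For the curious, here is a bare-bones sketch of the idea in plain Python. It is the textbook iterative version with the commonly used damping factor of 0.85, which is an assumption on my part and not necessarily what any given library does under the hood; the four-node graph is made up.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Textbook PageRank by repeated redistribution of rank along links."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    # Start with rank spread evenly across all nodes.
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node gets a small baseline share...
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            # ...plus a slice of each in-linking node's rank, split across
            # that node's outgoing links. Nodes with no outgoing links
            # spread their rank evenly over everyone.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank

# Made-up toy graph: three nodes point at C, and C points back at A.
toy_edges = [("A", "C"), ("B", "C"), ("C", "A"), ("D", "C")]
scores = pagerank(toy_edges)
ranked = sorted(scores, key=scores.get, reverse=True)
```

In the toy graph, C comes out on top because everyone links to it, and A beats B and D only because the important node C links to it.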
You'll notice that the majority of the "important" artists as determined by PageRank are mainstream pop acts; I think this is a result of the non-random sample I collected, which does include slightly more pop artists than other types, combined with the fact that these sorts of artists will probably have more listeners and relationships due to the Billboard effect.
The least important artists list is where MY band Atomic Man gets to shine! Since we are as-of-yet undiscovered and were placed into the dataset manually, we are a bit of an anomaly. All of our related artists had fewer than 100,000 followers, so they were not added to the To_Parse list; therefore our neighborhood exploration died a sudden and painful death. (Please feel free to listen to our music extensively and help us build better network value!)
The Long and Winding Road
Another metric I looked into was the "diameter" of the graph - basically who are the two furthest apart artists in this entire graph? I leveraged a shortest-path algorithm to track the walk between these points, which involves 27 different artists; it starts with "Herbie Hancock" and ends with Latin sensation "El Indio." I placed the top song by each artist on the path in a playlist to the right. Listening through in order seems like a natural progression from artist to artist, further indicating the power of related-powered playlisting.
While analyzing this data, I came across dozens of bizarre band names and sub-genres I had never heard of before. My favorite of the lot was "Viking Metal" - I don't really understand what that means but it's provocative and super intense. I used the same algorithm described above to figure out the shortest path necessary to take us from the sensational Viking Metal band "Amon Amarth" to our one and only true Queen: Beyonce. Surprisingly, this path only involves 10 artists and actually seems pretty natural (that is if you can stomach the Viking Metal at the beginning..."Twilight of the Thunder God" is my new ring tone...)
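Both of these paths boil down to shortest-path searches. A minimal sketch using breadth-first search on an unweighted graph is below; the five-node graph is a made-up placeholder, and the brute-force "try every starting point" approach is just the simplest way to show the idea, not necessarily how a graph library would do it.

```python
from collections import deque

def bfs_paths(graph, start):
    """Return {node: shortest path from start} for every reachable node."""
    paths = {start: [start]}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in paths:  # first visit is always the shortest
                paths[neighbor] = paths[node] + [neighbor]
                queue.append(neighbor)
    return paths

def longest_shortest_path(graph):
    """Brute force: run BFS from every node and keep the longest result."""
    best = []
    for start in graph:
        for path in bfs_paths(graph, start).values():
            if len(path) > len(best):
                best = path
    return best

# Made-up toy network: a chain A-B-C-D with an extra node E hanging off B.
toy = {
    "A": ["B"],
    "B": ["A", "C", "E"],
    "C": ["B", "D"],
    "D": ["C"],
    "E": ["B"],
}

a_to_d = bfs_paths(toy, "A")["D"]       # shortest walk between two chosen artists
furthest = longest_shortest_path(toy)   # the two furthest-apart artists
```

Swap in two artist names for the start and end and you have a playlist generator.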
Follow these playlists and let me know what you think of the paths in the comments - I've been really enjoying their diversity this week!
I have been working on this post for several months - it took a while to figure out the best way to ping the Spotify API and pull back the data and then finagle it into an interactive chart. I plan on posting my code once I get a public GitHub set up, but for now I'll put my answers to anyone's technical questions here - just drop a note in the comments! More to follow..
As always, special thanks to: Katie for putting up with me, my parents for listening to me ramble about related artists for nearly 6 months and Brian/Stephen for proof reading.
When a job posting says "we need a rock star!" I think of Keith Richards chain smoking and cursing at vague Python errors. I could do that..
It's 10 AM on a Tuesday and I'm anxiously refreshing the 9:30 Club ticketing page, trying to buy tickets for St. Paul and the Broken Bones' New Year's Eve show. A few minutes after my first attempt, the show is sold out and I am out of luck. After shopping around on secondhand ticket sites, I find two GA tickets priced just above the original price on the official page - wanting to hear "Call Me" as the ball drops, I anxiously pull the trigger.
Those who buy tickets are familiar with the question: should I buy now or wait to see if the prices drop later? There are a host of factors in play that can alter ticket prices: total venue capacity, event exclusivity (DMB tours every summer, The Stones won't be touring for much longer..), tour scale (nationwide vs. regional tours have differing production costs) and of course the demand for tickets.
In my last blog post, I scraped data from the secondhand ticket aggregation site SeatGeek in order to map music venues across the United States. This time, I sought to explore whether or not there is a "sweet spot" of when to buy concert tickets from resellers.
For this post, I worked with Chris Leydon from SeatGeek to pull some sales information on the big acts of last summer. Although transaction level data were unavailable for privacy reasons, Chris sent over the median ticket price by day for each concert held by the summer's biggest acts, namely: Taylor Swift, Ed Sheeran, U2, The Rolling Stones, Kenny Chesney, Billy Joel, Rush, Grateful Dead, Foo Fighters, and One Direction.
First I plotted the daily median price of every event in the data set (see above). I colored each event line by artist, which highlights the first takeaway: daily prices are quite volatile. This, coupled with the privacy reasons mentioned before, is why Chris recommended using the median price.
The upside to using median-price by day is that we are able to visualize general fluctuations in ticket prices over time without worrying much about outlier postings. The downside is that, without more detailed information on sales volume (i.e., the quantity of tickets sold by section and price point), it's hard to determine what may have driven price fluctuations.
Nevertheless, at a high level, the answer to "When is the best time to buy tickets?" appears to depend on your artist of choice, while varying little across a given artist's concerts:
These graphs show the rather extreme trends by artist; some thoughts on their creation/interpretation:
- Note that each graph has a different Y-axis tailored to an artist's individual range of median ticket prices.
- I have pooled all events for an artist together and used the stat_smooth function from R's ggplot2 package to generate a LOESS curve for the artist;
- A LOESS curve is a method for locally weighted polynomial regression;
- The surrounding shadow area denotes the 95% confidence interval (aka the range in which the smoothed trend itself is estimated to fall, not the spread of individual daily prices);
- Essentially the goal is to fit a line to a scatter plot (typically with noisy data) that may not have a clear linear line of best fit;
- For each data point in the plot, a simple regression is fit using only the points in a surrounding subset/window of the data, with the points closest to it weighted most heavily. The fitted value at that point becomes one point on the curve, and stringing these fitted values together gives the line of best fit;
- The above is a simplification of the process, but essentially it's a quick and dirty way to pull a trend out of scattered and temporal data, which is just what we have here;
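To make the LOESS bullets above a bit more concrete, here is a stripped-down sketch in plain Python. It is not what R's smoother does internally (no robustness iterations, no confidence band); it just assumes a local straight-line fit with the common tricube weighting, and the price series is made up.

```python
def loess(xs, ys, frac=0.5):
    """Locally weighted linear regression; returns one smoothed y per x."""
    n = len(xs)
    k = min(n, max(3, int(round(frac * n))))  # points per local window
    smoothed = []
    for x0 in xs:
        # The distance to the k-th nearest point sets the window width.
        dists = sorted(abs(x - x0) for x in xs)
        width = dists[k - 1] or 1.0
        # Tricube weights: nearby points count most, and points at or
        # beyond the window edge count for nothing.
        w = [(1 - min(abs(x - x0) / width, 1.0) ** 3) ** 3 for x in xs]
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, xs)) / sw
        my = sum(wi * y for wi, y in zip(w, ys)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        # Evaluate the local weighted line at x0.
        smoothed.append(my + slope * (x0 - mx))
    return smoothed

# Made-up daily median prices with one noisy spike on day 4.
days = list(range(9))
prices = [10, 10, 10, 10, 100, 10, 10, 10, 10]
trend = loess(days, prices)
```

The spike on day 4 gets pulled most of the way back toward its neighbors, which is exactly the "trend out of noisy data" behavior we want. A smaller `frac` follows the data more closely; a larger one smooths harder.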
At the conclusion of these initial peeks I have more questions than answers about the secondary ticketing market! Clearly there are some noticeable (probably exploitable) fluctuations in ticket prices as we get closer to the D-Day of a concert. Back in January, New York Attorney General Eric Schneiderman said “Ticketing is a fixed game..” and his office released a report on the ticket markets which opened:
The New York Attorney General (“NYAG”) regularly receives complaints from New Yorkers frustrated by their inability to purchase tickets to concerts and other events that appear to sell out within moments of the tickets’ release. These consumers wonder how the same tickets can then appear moments later on StubHub or another ticket resale site, available for resale at substantial markups. In response to these complaints, NYAG has been investigating the entire industry and the process by which event tickets are distributed – from the moment a venue is booked through the sale of tickets to the public. This Report outlines the findings of our investigation. [http://www.ag.ny.gov/pdfs/Ticket_Sales_Report.pdf, January 2016]
The available data suggest that as we get closer to an event, ticket prices are more volatile (likely due to increased supply and demand as concert goers finalize their schedules), so it is possible that one could exploit a sizable price drop. I have found that, outside of "must-see-shows," the best approach is to buy the afternoon of a show and, if things don't go as planned, hit up the box office as soon as the openers hit the stage. SeatGeek has developed behind-the-scenes analytics to create a "Deal Score" for a current ticket offering on a 0-to-100 scale, which takes into account the pricing for similar events and seating charts. There is certainly more to learn about the secondary pricing market, but this is a great start!
A major benefit of life in DC is the vibrant music scene. Generally speaking, ticket prices are reasonable, and the wide offering of venue sizes makes the city an attractive addition to the tour schedules of both big-name artists and those yet to be found. Two summers ago I convinced a friend to head to DC9, one of the District's smaller alternative venues to check out Royal Blood. A year later we barely managed to snag tickets for their sold-out 9:30 Club performance. This experience inspired me to ask some questions about music venues' size, location and pricing which I will explore over the next few posts. I'll go over the findings and graphs first and the coding/procedure at the bottom!
The first and most simple questions any frequent concert goer would ask:
- Where do we listen to music?
- How much does it cost?
To answer these questions I turned to common online ticket retailers in order to procure some concert data. Unfortunately Ticketmaster, Ticketfly and other primary market ticket sellers make it difficult to learn about event offerings; my web scrapers for those sites were frequently 404'ed. Instead, I turned to the secondary ticket market SeatGeek, which provides an awesome API and detailed documentation on how to pull down ticket data. Using secondary marketplace data for concert tickets does come with some limitations since scalped tickets are often multiple times more expensive than box office prices. Regardless, the SeatGeek dataset can answer some questions for us:
1. Music venues are geographically clustered. 60% of music venues in the United States are located within 5 miles of another venue. Analysis shows that these "neighborhood clusters" often have a consistent mix of small/medium/large locales, which presumably cater to artists of differing popularity. This clustering trend is apparent in the city-level graphs shown above; certain avenues and neighborhoods contain nearly all of a city's venues. It is likely that venues congregate due to specific sound ordinances in cities, which also has the positive outcome of proximity to bars and clubs.
Even with significant clustering, the effect of sound ordinance on venue location is still a hot topic. Officials in DC are currently considering the extension of stricter residential area sound laws into business/commercial areas. According to the owner of a DC rock club, Black Cat, the move "could effectively shut down D.C.'s live music venues." A better understanding of the relatively tight clustering of venues could help DC officials recognize the value of conserving more relaxed sound-level laws for "rock-n-roll-avenue".
% of Venues by # Neighboring Venues
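A "within 5 miles of another venue" figure like the one in finding 1 can be computed with a great-circle distance check. Here is a minimal sketch of that idea in Python, assuming each venue comes with a latitude and longitude; the three venues and their coordinates are made up, and the all-pairs loop is fine for a sketch but would want a spatial index on a big data set.

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in miles."""
    r = 3958.8  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def neighbor_counts(venues, radius_miles=5.0):
    """For each venue, count the other venues within the radius."""
    counts = {}
    for name, (lat, lon) in venues.items():
        counts[name] = sum(
            1
            for other, (olat, olon) in venues.items()
            if other != name
            and haversine_miles(lat, lon, olat, olon) <= radius_miles
        )
    return counts

def share_with_neighbor(venues, radius_miles=5.0):
    """Fraction of venues that have at least one neighbor in range."""
    counts = neighbor_counts(venues, radius_miles)
    return sum(1 for c in counts.values() if c > 0) / len(counts)

# Made-up venues: two close together, one a few hundred miles away.
toy_venues = {
    "venue_a": (38.90, -77.03),
    "venue_b": (38.92, -77.02),
    "venue_c": (40.70, -74.00),
}
counts = neighbor_counts(toy_venues)
share = share_with_neighbor(toy_venues)
```

The `counts` dictionary is also what you would bucket to get a "% of venues by number of neighboring venues" chart.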
2. More people = More Music. This might seem like a no-brainer but it's interesting to see the effect of population density on the availability of music venues. To explore this, I mapped each venue in the SeatGeek data set to its respective FIPS code (think ZIP code but better) and pulled in 2010 US Census Bureau population counts.
More populated areas are more likely to have more venues, a trend clearly seen on the All United States map at the start of the post - larger cities are easily accessible to musicians via air and highway and have the greatest concentration of music fans. (The Pearson correlation between population and number of venues within a FIPS area is .82 with a p-value below 0.001.)
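If you haven't met the Pearson correlation before, it is just the covariance of the two variables divided by the product of their standard deviations. A quick sketch of the calculation in plain Python, with made-up numbers rather than the actual population and venue counts:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Co-movement of the two variables around their means...
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    # ...scaled by how much each one varies on its own.
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up example: "population" and "venue count" for four areas.
population = [1, 2, 3, 4]
venues = [1, 3, 2, 4]
r = pearson_r(population, venues)
```

A value near 1 means the two move together, near -1 means they move in opposite directions, and near 0 means no linear relationship.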
Population-heavy areas (cities) don't cost more per ticket than less populated areas ('burbs/country). There is no statistical relationship between population size and average ticket price. Concert cost is relatively consistent across venues nationally, likely due to nationwide tours maintaining prices between cities. This observation's extension into the real world is, however, fairly limited by the data set, which contains mostly city-based venues. Perhaps with a larger/historic data set we could see just the opposite trend with time series analysis.
Where are the most expensive concerts in the nation? This question is only partially answered by the choropleth map below (think heat map by geographic area). The large grey swatches show that we don't have data for every FIPS county in the US. Even with the missing data, we can see that for our dataset of SeatGeek secondhand market tickets, the tip of Florida, New York City and southern California have some of the most expensive tickets. The scale, however, is pretty tight, and our correlation findings above showed no strong relationship between an area's population and its ticket prices.
Without a more robust data set of primary market ticket prices we only have a partial glimpse of the nation's ticket prices, but it does allow us to visualize the geographic distribution.
So how did I make these graphs? I used a mix of Python, R (ggmap and sqldf libraries primarily), SVG manipulation, Excel and coffee. Back in college I used ArcGIS a bit, but on a tight time frame and 0 budget I went the freeware route. I did try some freeware alternatives to ArcGIS but they were slow and clunky.
Ultimately the process was as follows. I am more than happy to share code and data if requested!
- Use Python to pull all events from the SeatGeek API - push to R
- Import/Stack/Munge events data. Summarize individual events to venue level. Explore correlations and plot points for specific cities and US in total. Use the "noncensus" library to crosswalk ZIP codes to FIPS codes and pull in 2010 census population values. Summarize average ticket price by FIPS code and export to txt. Other packages used: ggmap, ggplot2, sqldf, Rcmdr, Hmisc.
- Use Nathan Yau's method from Flowing Data and Python to edit a FIPS code SVG (scalable vector graphic) document from Wikipedia. Import SVG to Adobe Illustrator and spruce it up a bit.
- Listened to lots of Mo Lowda and the Humble during the post creation. This band is like a jazzy Kings of Leon from Temple University and they absolutely rock out.
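For anyone who wants a feel for the summarize-and-crosswalk step in the list above, here is a rough sketch. I did this part in R, so treat the Python below as an illustration of the logic only; the event rows, ZIP codes, FIPS codes and column names are all made up.

```python
from collections import defaultdict
from statistics import mean

# Made-up rows: (venue, zip_code, average_ticket_price)
events = [
    ("venue_a", "00001", 40.0),
    ("venue_a", "00001", 60.0),
    ("venue_b", "00002", 30.0),
    ("venue_c", "00003", 100.0),
]
# Hypothetical ZIP -> FIPS crosswalk table (placeholder codes).
zip_to_fips = {"00001": "99001", "00002": "99001", "00003": "99002"}

def venue_averages(rows):
    """Collapse individual events to one average price per venue."""
    prices = defaultdict(list)
    for venue, zip_code, price in rows:
        prices[(venue, zip_code)].append(price)
    return {key: mean(vals) for key, vals in prices.items()}

def fips_averages(venue_avgs, crosswalk):
    """Average the venue-level prices within each FIPS code."""
    by_fips = defaultdict(list)
    for (venue, zip_code), avg in venue_avgs.items():
        fips = crosswalk.get(zip_code)
        if fips is not None:  # skip venues we cannot place
            by_fips[fips].append(avg)
    return {fips: mean(vals) for fips, vals in by_fips.items()}

def to_tsv(fips_avgs):
    """Render the summary as tab-separated text, ready to write to a txt file."""
    lines = ["fips\tavg_price"]
    lines += [f"{fips}\t{avg:.2f}" for fips, avg in sorted(fips_avgs.items())]
    return "\n".join(lines)

by_venue = venue_averages(events)
by_fips = fips_averages(by_venue, zip_to_fips)
report = to_tsv(by_fips)
```

Averaging to venue level first means a venue with lots of events doesn't drown out its neighbors when the FIPS-level average is taken.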