diff options
Diffstat (limited to 'blog/2022-11-10-artist-correlation.md')
-rw-r--r-- | blog/2022-11-10-artist-correlation.md | 78 |
1 files changed, 78 insertions, 0 deletions
diff --git a/blog/2022-11-10-artist-correlation.md b/blog/2022-11-10-artist-correlation.md new file mode 100644 index 0000000..d52e613 --- /dev/null +++ b/blog/2022-11-10-artist-correlation.md @@ -0,0 +1,78 @@ +# Correlating music artists + +I listen to a lot of music and so every few months my music collection gets +boring again. So far I have asked friends to recommend me music but I am running +out of friend too now. Therefore I came up with a new solution during a few +days. + +I want to find new music that i might like too. After some research I found that +there is [Musicbrainz](https://musicbrainz.org/) (a database of all artists and +recordings ever made) and [Listenbrainz](https://listenbrainz.org/) (a service +to which you can submit what you are listening to). Both databases are useful +for this project. The high-level goal is to know, what people that have a lot of +music in common with me, like to listen to. For that the shared number of +listeners for each artist is relevant. I use the word 'a listen', to refer to +one playthrough of a track. + +## The Procedure + +### Parse data & drop unnecessary detail + +All of the JSON files of listenbrainz are parsed and only information about how +many listens each user has submitted for what artist are kept. The result is +stored in a B-tree map on my disk (the +[sled library](https://crates.io/crates/sledg) is great for that). + +- First mapping created: `(user, artist) -> shared listens`. +- (Also created a name lookup: `artist -> artist name`) + +The B-Tree stores values ordered, such that i can iterate through all artists of +a user, by scanning the prefix `(user, …`. + +### Create a graph + +Next an undirected graph with weighted edges is generated where nodes are +artists and edges are shared listens. For each user, each edge connecting +artists they listen to, the weight is incremented by the sum of the logarhythms +of either one's playthrough count for that user. This means that artists that +share listeners are connected and because of the logarhythms, users that listen +to an artist _a lot_ won't be weighted proportionally. + +Mapping: `(artist, artist) -> weight`. (Every key `(x, y)` is identical with +`(y, x)` so that edges are undirectional.) + +### Query artists + +The graph tree can now be queried by scanning with a prefix of one artist +(`("The Beatles", …`) and all correlated artists are returned with a weight. The +top-weighted results are kept and saved. + +### Notes + +Two issues appeared during this project that lead to the following fixes: + +- Limit one identity to 32 artists at most because the edge count grows + quadratically (100 artists -> 10000 edges) +- When parsing data the user id is made dependent of the time to seperate arists + when music tastes changing over time. Every 10Ms (~4 months) the user ids + change. + +## Results + +In a couple of minutes I rendered about 2.2 million HTML documents with my +results. They are available at `https://metamuffin.org/artist-correl/{name}`. +Some example links: + +- [The Beatles](https://metamuffin.org/artist-correl/The%20Beatles) +- [Aimer](https://metamuffin.org/artist-correl/Aimer) +- [Rammstein](https://metamuffin.org/artist-correl/Rammstein) +- [Mitski](https://metamuffin.org/artist-correl/Mitski) + +## Numbers + +- Musicbrainz: 15GB +- Listenbrainz: 350GB +- Extracted listening data: 23GB +- Graph: 56GB +- Rendered HTML: 2.3GB +- Compressed HTML (squashfs with zstd): 172MB |