author | metamuffin <metamuffin@disroot.org> | 2023-02-13 20:25:04 +0100 |
---|---|---|
committer | metamuffin <metamuffin@disroot.org> | 2023-02-13 20:25:04 +0100 |
commit | c19adca147d38562b3f4a06cb2205e043bc24856 (patch) | |
tree | 808ceebd163294cc66ed8882885348b914ab1125 /content/articles/2022-11-10-artist-correlation.md | |
parent | 77eef59404acaed6faa636239bd18010e34a91de (diff) | |
restructure for embedding into my website
Diffstat (limited to 'content/articles/2022-11-10-artist-correlation.md')
-rw-r--r-- | content/articles/2022-11-10-artist-correlation.md | 78 |
1 files changed, 0 insertions, 78 deletions
diff --git a/content/articles/2022-11-10-artist-correlation.md b/content/articles/2022-11-10-artist-correlation.md
deleted file mode 100644
index d52e613..0000000
--- a/content/articles/2022-11-10-artist-correlation.md
+++ /dev/null
@@ -1,78 +0,0 @@
-# Correlating music artists
-
-I listen to a lot of music and so every few months my music collection gets
-boring again. So far I have asked friends to recommend me music but I am running
-out of friends too now. Therefore I came up with a new solution during a few
-days.
-
-I want to find new music that I might like too. After some research I found that
-there is [Musicbrainz](https://musicbrainz.org/) (a database of all artists and
-recordings ever made) and [Listenbrainz](https://listenbrainz.org/) (a service
-to which you can submit what you are listening to). Both databases are useful
-for this project. The high-level goal is to know what people who have a lot of
-music in common with me like to listen to. For that the shared number of
-listeners for each artist is relevant. I use the word 'a listen' to refer to
-one playthrough of a track.
-
-## The Procedure
-
-### Parse data & drop unnecessary detail
-
-All of the JSON files of listenbrainz are parsed and only information about how
-many listens each user has submitted for what artist is kept. The result is
-stored in a B-tree map on my disk (the
-[sled library](https://crates.io/crates/sledg) is great for that).
-
-- First mapping created: `(user, artist) -> shared listens`.
-- (Also created a name lookup: `artist -> artist name`)
-
-The B-Tree stores values ordered, such that I can iterate through all artists of
-a user, by scanning the prefix `(user, …`.
-
-### Create a graph
-
-Next an undirected graph with weighted edges is generated where nodes are
-artists and edges are shared listens. For each user, each edge connecting
-artists they listen to, the weight is incremented by the sum of the logarithms
-of either one's playthrough count for that user. This means that artists that
-share listeners are connected and because of the logarithms, users that listen
-to an artist _a lot_ won't be weighted proportionally.
-
-Mapping: `(artist, artist) -> weight`. (Every key `(x, y)` is identical with
-`(y, x)` so that edges are undirected.)
-
-### Query artists
-
-The graph tree can now be queried by scanning with a prefix of one artist
-(`("The Beatles", …`) and all correlated artists are returned with a weight. The
-top-weighted results are kept and saved.
-
-### Notes
-
-Two issues appeared during this project that led to the following fixes:
-
-- Limit one identity to 32 artists at most because the edge count grows
-  quadratically (100 artists -> 10000 edges)
-- When parsing data the user id is made dependent on the time to separate artists
-  when music tastes change over time. Every 10Ms (~4 months) the user ids
-  change.
-
-## Results
-
-In a couple of minutes I rendered about 2.2 million HTML documents with my
-results. They are available at `https://metamuffin.org/artist-correl/{name}`.
-Some example links:
-
-- [The Beatles](https://metamuffin.org/artist-correl/The%20Beatles)
-- [Aimer](https://metamuffin.org/artist-correl/Aimer)
-- [Rammstein](https://metamuffin.org/artist-correl/Rammstein)
-- [Mitski](https://metamuffin.org/artist-correl/Mitski)
-
-## Numbers
-
-- Musicbrainz: 15GB
-- Listenbrainz: 350GB
-- Extracted listening data: 23GB
-- Graph: 56GB
-- Rendered HTML: 2.3GB
-- Compressed HTML (squashfs with zstd): 172MB
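
For readers of this commit, the "Create a graph" and "Query artists" steps of the removed article can be illustrated with a small in-memory sketch. This is not the author's code: the article stores its maps in an on-disk B-tree, while the sketch below uses plain Python dictionaries. The names `build_graph`, `query`, and `MAX_ARTISTS_PER_USER` are hypothetical. Two details are assumptions, since the article does not spell them out: the 32-artist cap keeps each user's most-played artists, and both `(x, y)` and `(y, x)` are stored so that a prefix scan on either artist finds the edge.

```python
import math
from bisect import bisect_left
from collections import defaultdict
from itertools import combinations

# Cap from the article's notes; keeping the most-played artists is an assumption.
MAX_ARTISTS_PER_USER = 32


def build_graph(listens):
    """Turn {(user, artist): listen_count} into {(artist, artist): weight}."""
    per_user = defaultdict(dict)
    for (user, artist), count in listens.items():
        per_user[user][artist] = count

    weights = defaultdict(float)
    for artists in per_user.values():
        # Keep at most MAX_ARTISTS_PER_USER artists, most-played first.
        top = sorted(artists.items(), key=lambda kv: (-kv[1], kv[0]))
        top = top[:MAX_ARTISTS_PER_USER]
        for (a, count_a), (b, count_b) in combinations(top, 2):
            # Sum of logarithms, so heavy listeners do not dominate linearly.
            w = math.log(count_a) + math.log(count_b)
            # Assumption: store both orientations of the undirected edge.
            weights[(a, b)] += w
            weights[(b, a)] += w
    return dict(weights)


def query(weights, artist, limit=10):
    """Return up to `limit` (other_artist, weight) pairs, heaviest first."""
    # An ordered on-disk map would already be sorted; here we sort per call.
    keys = sorted(weights)
    start = bisect_left(keys, (artist,))
    found = []
    for key in keys[start:]:
        if key[0] != artist:
            break  # left the prefix range for this artist
        found.append((key[1], weights[key]))
    return sorted(found, key=lambda kv: -kv[1])[:limit]


if __name__ == "__main__":
    example = {
        ("u1", "A"): 10,
        ("u1", "B"): 100,
        ("u2", "A"): 10,
        ("u2", "C"): 10,
    }
    for name, weight in query(build_graph(example), "A"):
        print(name, round(weight, 3))
```

Note that a single listen contributes `log(1) = 0` to an edge, which follows directly from the weighting rule as the article states it.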