Diffstat (limited to 'content/articles')
-rw-r--r-- | content/articles/2022-11-10-artist-correlation.md | 63 |
1 files changed, 63 insertions, 0 deletions
diff --git a/content/articles/2022-11-10-artist-correlation.md b/content/articles/2022-11-10-artist-correlation.md
new file mode 100644
index 0000000..c8eaa62
--- /dev/null
+++ b/content/articles/2022-11-10-artist-correlation.md
@@ -0,0 +1,63 @@
+# Correlating music artists
+
+I listen to a lot of music, so every few months my music collection gets boring
+again. So far I have asked friends to recommend me music, but I am running out
+of friends too now. Therefore I came up with a new solution.
+
+I want to find new music that I might like too. After some research I found that
+there is [Musicbrainz](https://musicbrainz.org/) (a database of all artists and
+recordings ever made) and [Listenbrainz](https://listenbrainz.org/) (a service
+to which you can submit what you are listening to). Both databases are useful
+for this project. The high-level goal is to find out what people who have a lot
+of listens in common like to listen to.
+
+## The Procedure
+
+### Parse data & drop unnecessary detail
+
+I parse all of the JSON files of Listenbrainz and only keep information about
+how often each user listens to each artist. The result is stored in a B-tree
+map on my disk (the [sled library](https://crates.io/crates/sled) is great for
+that).
+
+- First mapping created: `(user, artist) -> shared listens`.
+- (Also created a name lookup: `artist -> artist name`)
+
+The B-tree stores its keys in order, so I can iterate through all artists of
+a user by scanning the prefix `(user, …`.
+
+### Create a graph
+
+Next, an undirected graph with weighted edges is generated, where nodes are
+artists and edges are shared listens. For every user, each pair of artists they
+listen to receives the sum of that user's listens to the two artists.
+
+Mapping: `(artist, artist) -> weight`.
+
+Every key `(x, y)` is treated as identical to `(y, x)` so that edges are undirected.
+
+### Query artists
+
+The graph tree can now be queried by scanning with a prefix of one artist
+(`("The Beatles", …`), and all correlated artists are returned with a weight.
+The top 16 results are kept and saved.
+
+## Results
+
+In a couple of minutes I rendered about 2.2 million HTML documents with my
+results. They are available at `https://metamuffin.org/artist-correl/{name}`.
+Some example links:
+
+- [The Beatles](https://metamuffin.org/artist-correl/The%20Beatles)
+- [Aimer](https://metamuffin.org/artist-correl/Aimer)
+- [Rammstein](https://metamuffin.org/artist-correl/Rammstein)
+- [Mitski](https://metamuffin.org/artist-correl/Mitski)
+
+## Numbers
+
+- Musicbrainz: 15GB
+- Listenbrainz: 350GB
+- Extracted listening data: 11GB
+- Graph: 24GB
+- Rendered HTML: 8.4GB
+- Compressed HTML (squashfs with zstd): 105MB
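
Note on the procedure described in the article: the graph construction and the prefix query can be sketched in a few lines of Rust. This is only an illustration under assumptions, not the code behind the commit. It uses an in-memory `std::collections::BTreeMap` in place of the on-disk sled tree, plain strings as user and artist keys, and it stores every edge in both directions so that a prefix scan on either artist finds it. The names `Listens`, `Graph`, `build_graph` and `top_correlated` are hypothetical.

```rust
use std::collections::BTreeMap;

// Hypothetical stand-ins for the two on-disk trees from the article.
// Keys are plain strings here; the real key encoding is not given.
type Listens = BTreeMap<(String, String), u64>; // (user, artist) -> listens
type Graph = BTreeMap<(String, String), u64>; // (artist, artist) -> weight

/// Build the weighted artist graph. For each user, every pair of artists
/// gains the sum of that user's listens to both (one reading of the article).
/// Assumption: each edge is stored as (x, y) and (y, x) so either side can
/// be found by prefix.
fn build_graph(listens: &Listens) -> Graph {
    // Collect each user's artists with their listen counts.
    let mut per_user: BTreeMap<&str, Vec<(&str, u64)>> = BTreeMap::new();
    for ((user, artist), n) in listens {
        per_user
            .entry(user.as_str())
            .or_default()
            .push((artist.as_str(), *n));
    }

    let mut graph = Graph::new();
    for artists in per_user.values() {
        // Visit every unordered pair once; this is quadratic per user.
        for (i, &(a, na)) in artists.iter().enumerate() {
            for &(b, nb) in &artists[i + 1..] {
                let w = na + nb;
                *graph.entry((a.to_string(), b.to_string())).or_insert(0) += w;
                *graph.entry((b.to_string(), a.to_string())).or_insert(0) += w;
            }
        }
    }
    graph
}

/// Scan all keys whose first component is `artist` and keep the `k`
/// heaviest neighbours (ties broken by name).
fn top_correlated(graph: &Graph, artist: &str, k: usize) -> Vec<(String, u64)> {
    let mut hits: Vec<(String, u64)> = graph
        // Ordered keys: start at the smallest key with this artist first...
        .range((artist.to_string(), String::new())..)
        // ...and stop as soon as the first component changes.
        .take_while(|((a, _), _)| a.as_str() == artist)
        .map(|((_, b), w)| (b.clone(), *w))
        .collect();
    hits.sort_by(|x, y| y.1.cmp(&x.1).then_with(|| x.0.cmp(&y.0)));
    hits.truncate(k);
    hits
}
```

Calling `top_correlated(&graph, "The Beatles", 16)` would correspond to the "top 16" query from the article. Storing both directions doubles the number of keys; a single canonical ordering per pair would halve it but would need a second lookup path for the reversed key.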