-rw-r--r-- content/articles/2022-11-10-artist-correlation.md | 63
1 files changed, 63 insertions, 0 deletions
diff --git a/content/articles/2022-11-10-artist-correlation.md b/content/articles/2022-11-10-artist-correlation.md
new file mode 100644
index 0000000..c8eaa62
--- /dev/null
+++ b/content/articles/2022-11-10-artist-correlation.md
@@ -0,0 +1,63 @@
+# Correlating music artists
+
+I listen to a lot of music, so every few months my music collection gets boring
+again. So far I have asked friends to recommend music to me, but I am running
+out of friends too. Therefore I came up with a new solution.
+
+I want to find new music that I might like too. After some research I found that
+there is [Musicbrainz](https://musicbrainz.org/) (a database of artists and
+recordings) and [Listenbrainz](https://listenbrainz.org/) (a service
+to which you can submit what you are listening to). Both databases are useful
+for this project. The high-level goal is to find out what people who have a lot
+of listens in common like to listen to.
+
+## The Procedure
+
+### Parse data & drop unnecessary detail
+
+I parse all of the Listenbrainz JSON files and only keep how often each user
+listens to each artist. The result is stored in a B-tree
+map on my disk (the [sled library](https://crates.io/crates/sled) is great for
+that).
+
+- First mapping created: `(user, artist) -> listen count`.
+- (Also created a name lookup: `artist -> artist name`)
+
+The B-tree stores its keys in sorted order, so I can iterate through all artists
+of a user by scanning the prefix `(user, …`.
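The prefix scan can be sketched with the standard library's `BTreeMap` standing in for the on-disk tree. This is only an illustration, not the code used for the project: the numeric `UserId` and `ArtistId` types and the `artists_of` helper are assumptions made for the example.

```rust
use std::collections::BTreeMap;

// Hypothetical numeric IDs; the article does not say how users and artists are keyed.
type UserId = u32;
type ArtistId = u32;

/// Return all (artist, listen count) pairs of one user by scanning the
/// contiguous key range whose first component is that user.
fn artists_of(
    listens: &BTreeMap<(UserId, ArtistId), u64>,
    user: UserId,
) -> Vec<(ArtistId, u64)> {
    listens
        .range((user, ArtistId::MIN)..=(user, ArtistId::MAX))
        .map(|(&(_, artist), &count)| (artist, count))
        .collect()
}

fn main() {
    let mut listens = BTreeMap::new();
    listens.insert((1, 10), 5);
    listens.insert((2, 10), 7);
    listens.insert((1, 30), 2);
    listens.insert((2, 20), 1);
    // Only the entries of user 1 are visited, in key order.
    println!("{:?}", artists_of(&listens, 1));
}
```

Because tuples compare component by component, all keys of one user sit next to each other in the tree, and the scan never touches other users' entries.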
+
+### Create a graph
+
+Next, an undirected graph with weighted edges is generated, where nodes are
+artists and edges are shared listens. For every user, the edge between each pair
+of artists they listen to is increased by the sum of that user's listens to the
+two artists.
+
+Mapping: `(artist, artist) -> weight`.
+
+Every key `(x, y)` is treated as identical to `(y, x)`, so edges are undirected.
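A minimal in-memory sketch of this step, assuming numeric artist IDs and assuming that the edge weight grows by the sum of one user's listens to both artists. `edge_key` and `add_user` are hypothetical names chosen for the example, and sorting the pair is just one way to make `(x, y)` and `(y, x)` coincide.

```rust
use std::collections::BTreeMap;

// Hypothetical numeric artist ID.
type ArtistId = u32;

/// Order the pair so that (x, y) and (y, x) map to the same key.
fn edge_key(a: ArtistId, b: ArtistId) -> (ArtistId, ArtistId) {
    if a <= b { (a, b) } else { (b, a) }
}

/// Add one user's contribution: every pair of artists the user listens to
/// gains the sum of the user's listens to both artists.
fn add_user(
    graph: &mut BTreeMap<(ArtistId, ArtistId), u64>,
    artists: &[(ArtistId, u64)],
) {
    for (i, &(a, listens_a)) in artists.iter().enumerate() {
        // Start after index i so each unordered pair is visited once.
        for &(b, listens_b) in &artists[i + 1..] {
            *graph.entry(edge_key(a, b)).or_insert(0) += listens_a + listens_b;
        }
    }
}

fn main() {
    let mut graph = BTreeMap::new();
    add_user(&mut graph, &[(10, 5), (30, 2)]);
    add_user(&mut graph, &[(30, 1), (10, 4), (20, 3)]);
    for ((a, b), weight) in &graph {
        println!("{a} <-> {b}: {weight}");
    }
}
```

With a single sorted key, a prefix scan for one artist only finds edges where that artist is the smaller ID. Storing each edge in both orientations is one possible way around that, at the cost of doubling the size of the tree.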
+
+### Query artists
+
+The graph tree can now be queried by scanning with a prefix of one artist
+(`("The Beatles", …`) and all correlated artists are returned with a weight. The
+top 16 results are kept and saved.
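The query could look roughly like the sketch below. It makes two assumptions: "top" means highest weight, and every edge is stored in both orientations so that a single prefix scan finds all neighbours. `top_correlated` is a hypothetical helper, and `BTreeMap` again stands in for the on-disk tree.

```rust
use std::collections::BTreeMap;

// Hypothetical numeric artist ID.
type ArtistId = u32;

/// Scan all edges whose key starts with `artist` and keep the `n` heaviest.
/// Assumes each edge is stored as both (x, y) and (y, x).
fn top_correlated(
    graph: &BTreeMap<(ArtistId, ArtistId), u64>,
    artist: ArtistId,
    n: usize,
) -> Vec<(ArtistId, u64)> {
    let mut hits: Vec<(ArtistId, u64)> = graph
        .range((artist, ArtistId::MIN)..=(artist, ArtistId::MAX))
        .map(|(&(_, other), &weight)| (other, weight))
        .collect();
    // Heaviest first; ties are broken by artist ID to keep the result stable.
    hits.sort_by(|x, y| y.1.cmp(&x.1).then(x.0.cmp(&y.0)));
    hits.truncate(n);
    hits
}

fn main() {
    let mut graph = BTreeMap::new();
    for (a, b, w) in [(1, 2, 12), (1, 3, 4), (1, 4, 7)] {
        graph.insert((a, b), w);
        graph.insert((b, a), w);
    }
    // The article keeps the top 16; a smaller n is used here for brevity.
    println!("{:?}", top_correlated(&graph, 1, 2));
}
```

Sorting all neighbours is the simplest approach. For artists with very many edges, a bounded heap of size `n` would avoid holding the full list in memory.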
+
+## Results
+
+In a couple of minutes I rendered about 2.2 million HTML documents with my
+results. They are available at `https://metamuffin.org/artist-correl/{name}`.
+Some example links:
+
+- [The Beatles](https://metamuffin.org/artist-correl/The%20Beatles)
+- [Aimer](https://metamuffin.org/artist-correl/Aimer)
+- [Rammstein](https://metamuffin.org/artist-correl/Rammstein)
+- [Mitski](https://metamuffin.org/artist-correl/Mitski)
+
+## Numbers
+
+- Musicbrainz: 15GB
+- Listenbrainz: 350GB
+- Extracted listening data: 11GB
+- Graph: 24GB
+- Rendered HTML: 8.4GB
+- Compressed HTML (squashfs with zstd): 105MB