# Correlating music artists

I listen to a lot of music, so every few months my music collection gets boring
again. So far I have asked friends to recommend music to me, but I am running
out of friends too. Therefore I came up with a new solution.

I want to find new music that I might like too. After some research I found
[Musicbrainz](https://musicbrainz.org/) (a database of all artists and
recordings ever made) and [Listenbrainz](https://listenbrainz.org/) (a service
to which you can submit what you are listening to). Both databases are useful
for this project. The high-level goal is to find out what people who have a lot
of listens in common like to listen to.

## The Procedure

### Parse data & drop unnecessary detail

I parse all of the JSON files of Listenbrainz and only keep information about
how often each user listens to each artist. The result is stored in a B-tree
map on my disk (the [sled library](https://crates.io/crates/sled) is great for
that).

- First mapping created: `(user, artist) -> listen count`.
- (Also created a name lookup: `artist -> artist name`)

The B-tree stores its keys in order, so I can iterate through all artists of
a user by scanning the prefix `(user, …`.
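
The counting and the prefix scan can be sketched roughly like this. It is only an illustration under a few assumptions: an in-memory `std::collections::BTreeMap` stands in for the on-disk sled tree, users and artists are plain `u32` ids, and the names `count_listens` and `artists_of_user` are made up for this example.

```rust
use std::collections::BTreeMap;

// Assumption: users and artists are identified by numeric ids.
type UserId = u32;
type ArtistId = u32;

/// Count how often each user listened to each artist.
fn count_listens(listens: &[(UserId, ArtistId)]) -> BTreeMap<(UserId, ArtistId), u64> {
    let mut counts: BTreeMap<(UserId, ArtistId), u64> = BTreeMap::new();
    for &(user, artist) in listens {
        *counts.entry((user, artist)).or_insert(0) += 1;
    }
    counts
}

/// All artists of one user, found by scanning the ordered key range
/// that starts with `user` (the in-memory analogue of a prefix scan).
fn artists_of_user(
    counts: &BTreeMap<(UserId, ArtistId), u64>,
    user: UserId,
) -> Vec<(ArtistId, u64)> {
    counts
        .range((user, ArtistId::MIN)..=(user, ArtistId::MAX))
        .map(|(&(_, artist), &n)| (artist, n))
        .collect()
}
```

Because tuple keys are ordered by their first element, all entries of one user sit next to each other, and the range only touches those entries.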

### Create a graph

Next, an undirected graph with weighted edges is generated, where nodes are
artists and edge weights measure shared listens. For every user, the edge
between each pair of artists they listen to is increased by the sum of that
user's listens to the two artists.

Mapping: `(artist, artist) -> weight`.

Every key `(x, y)` is treated as identical to `(y, x)`, so that edges are undirected.
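
A minimal sketch of this step, again with an in-memory `BTreeMap` and numeric ids as stand-ins. Two things here are my assumptions rather than details from the pipeline: the weight added per user is the plain sum of both listen counts, and each edge is written under both `(x, y)` and `(y, x)` so that a later scan by either artist finds it. `build_graph` is a hypothetical name.

```rust
use std::collections::BTreeMap;

// Assumption: users and artists are identified by numeric ids.
type UserId = u32;
type ArtistId = u32;

/// Turn `(user, artist) -> listen count` into `(artist, artist) -> weight`.
fn build_graph(
    counts: &BTreeMap<(UserId, ArtistId), u64>,
) -> BTreeMap<(ArtistId, ArtistId), u64> {
    // Group the rows by user so each user's artists can be paired up.
    let mut per_user: BTreeMap<UserId, Vec<(ArtistId, u64)>> = BTreeMap::new();
    for (&(user, artist), &n) in counts {
        per_user.entry(user).or_default().push((artist, n));
    }

    let mut graph: BTreeMap<(ArtistId, ArtistId), u64> = BTreeMap::new();
    for artists in per_user.values() {
        for (i, &(a, na)) in artists.iter().enumerate() {
            // Visit every unordered pair of this user's artists once.
            for &(b, nb) in &artists[i + 1..] {
                let w = na + nb;
                // Store both directions (assumption) so either artist works as a prefix.
                *graph.entry((a, b)).or_insert(0) += w;
                *graph.entry((b, a)).or_insert(0) += w;
            }
        }
    }
    graph
}
```

Note that the inner loops are quadratic in the number of artists per user, which is one plausible reason the graph ends up larger than the extracted listening data.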

### Query artists

The graph tree can now be queried by scanning with a prefix of one artist
(`("The Beatles", …`) and all correlated artists are returned with a weight. The
top 16 results are kept and saved.
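
The query could look roughly like the following sketch. It assumes numeric artist ids instead of names (the name lookup would translate them back), an in-memory `BTreeMap` instead of the sled tree, and a hypothetical function name `top_correlated`; the tie-breaking rule is also my own choice.

```rust
use std::collections::BTreeMap;

// Assumption: artists are identified by numeric ids.
type ArtistId = u32;

/// Return the `k` artists with the heaviest edges to `artist`.
fn top_correlated(
    graph: &BTreeMap<(ArtistId, ArtistId), u64>,
    artist: ArtistId,
    k: usize,
) -> Vec<(ArtistId, u64)> {
    // Scan all keys whose first element is `artist`.
    let mut hits: Vec<(ArtistId, u64)> = graph
        .range((artist, ArtistId::MIN)..=(artist, ArtistId::MAX))
        .map(|(&(_, other), &w)| (other, w))
        .collect();
    // Heaviest edges first; ties are broken by artist id for a stable order.
    hits.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));
    hits.truncate(k);
    hits
}
```

Calling `top_correlated(&graph, artist, 16)` would then correspond to keeping the top 16 results.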

## Results

In a couple of minutes I rendered about 2.2 million HTML documents with my
results. They are available at `https://metamuffin.org/artist-correl/{name}`.
Some example links:

- [The Beatles](https://metamuffin.org/artist-correl/The%20Beatles)
- [Aimer](https://metamuffin.org/artist-correl/Aimer)
- [Rammstein](https://metamuffin.org/artist-correl/Rammstein)
- [Mitski](https://metamuffin.org/artist-correl/Mitski)
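
Artist names with spaces or other special characters need percent-encoding in the URL, as in `The%20Beatles`. A small sketch of how such a link could be built, assuming every byte outside the RFC 3986 unreserved set is escaped; `artist_url` is a hypothetical helper, not necessarily how the site generates its links.

```rust
/// Build the result URL for an artist name, percent-encoding every byte
/// that is not an RFC 3986 "unreserved" character.
fn artist_url(name: &str) -> String {
    let mut url = String::from("https://metamuffin.org/artist-correl/");
    for byte in name.bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                url.push(byte as char)
            }
            // Everything else (spaces, slashes, non-ASCII bytes) becomes %XX.
            _ => url.push_str(&format!("%{:02X}", byte)),
        }
    }
    url
}
```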

## Numbers

- Musicbrainz: 15GB
- Listenbrainz: 350GB
- Extracted listening data: 11GB
- Graph: 24GB
- Rendered HTML: 8.4GB
- Compressed HTML (squashfs with zstd): 105MB