# Correlating music artists

I listen to a lot of music, so every few months my music collection gets
boring again. So far I have asked friends to recommend music to me, but I am
running out of friends now too. So I spent a few days coming up with a new
solution.

I want to find new music that I might like. After some research I found
[Musicbrainz](https://musicbrainz.org/) (a database of all artists and
recordings ever made) and [Listenbrainz](https://listenbrainz.org/) (a service
to which you can submit what you are listening to). Both databases are useful
for this project. The high-level goal is to find out what people who have a lot
of music in common with me like to listen to. For that, the number of listeners
shared between artists is relevant. I use the term 'a listen' to refer to one
playthrough of a track.

## The Procedure

### Parse data & drop unnecessary detail

All of the Listenbrainz JSON files are parsed, and only the information about
how many listens each user has submitted for which artist is kept. The result is
stored in a B-tree map on my disk (the
[sled library](https://crates.io/crates/sled) is great for that).

- First mapping created: `(user, artist) -> listens`.
- (Also created a name lookup: `artist -> artist name`)

The B-tree stores keys in order, so I can iterate through all artists of a user
by scanning the prefix `(user, …`.
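The prefix scan can be illustrated with a minimal sketch. It is not the code
used for this project: it assumes an in-memory `std::collections::BTreeMap` as
a stand-in for the on-disk sled tree, and the names `record_listens` and
`artists_of` are made up for illustration.

```rust
use std::collections::BTreeMap;

/// Key of the first mapping: (user, artist). Tuples are ordered by user
/// first, so all entries of one user sit next to each other in the tree.
type Key = (String, String);

/// Add `n` listens of `artist` by `user` (hypothetical helper).
fn record_listens(tree: &mut BTreeMap<Key, u64>, user: &str, artist: &str, n: u64) {
    *tree.entry((user.to_string(), artist.to_string())).or_insert(0) += n;
}

/// Collect all (artist, listens) pairs of one user with a range scan.
fn artists_of(tree: &BTreeMap<Key, u64>, user: &str) -> Vec<(String, u64)> {
    // (user, "") is the smallest possible key for this user.
    let start = (user.to_string(), String::new());
    tree.range(start..)
        // Stop as soon as the scan reaches the next user.
        .take_while(|((u, _), _)| u.as_str() == user)
        .map(|((_, artist), n)| (artist.clone(), *n))
        .collect()
}
```

Because the keys are ordered, the scan only touches the entries of the
requested user instead of walking the whole tree.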

### Create a graph

Next, an undirected graph with weighted edges is generated, where nodes are
artists and edge weights reflect shared listens. For each user and each pair of
artists they listen to, the weight of the edge connecting the two artists is
incremented by the sum of the logarithms of the user's playthrough counts for
both artists. This means that artists that share listeners are connected, and
because of the logarithms, a user who listens to an artist _a lot_ does not
contribute proportionally more weight.

Mapping: `(artist, artist) -> weight`. (Every key `(x, y)` is treated as
identical to `(y, x)`, so that edges are undirected.)
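A rough sketch of the weight update for one user could look like the following.
It makes several assumptions that are not stated above: the natural logarithm
is used, both `(x, y)` and `(y, x)` are stored so that a prefix scan on either
artist finds the edge, and a `BTreeMap` stands in for the on-disk tree. The
name `add_user` is hypothetical.

```rust
use std::collections::BTreeMap;

/// Key of the graph mapping: (artist, artist).
type Edge = (String, String);

/// Add one user's contribution to the graph. `listens` holds that user's
/// (artist, playthrough count) pairs, each artist appearing once.
fn add_user(graph: &mut BTreeMap<Edge, f64>, listens: &[(String, u64)]) {
    for (i, (a, na)) in listens.iter().enumerate() {
        // Visit every unordered pair of artists exactly once.
        for (b, nb) in &listens[i + 1..] {
            // Sum of the logarithms of both playthrough counts
            // (natural logarithm is an assumption).
            let w = (*na as f64).ln() + (*nb as f64).ln();
            // Store both directions so the edge is found from either artist
            // (assumption about how the symmetry is implemented).
            *graph.entry((a.clone(), b.clone())).or_insert(0.0) += w;
            *graph.entry((b.clone(), a.clone())).or_insert(0.0) += w;
        }
    }
}
```

With the natural logarithm, 10 playthroughs contribute about 2.3 and 1000
playthroughs about 6.9, so a hundredfold increase in listens only triples the
contribution.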

### Query artists

The graph tree can now be queried by scanning with a prefix of one artist
(`("The Beatles", …`), which returns all correlated artists, each with a weight.
The top-weighted results are kept and saved.
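As a sketch of such a query, again assuming a `BTreeMap` in place of the
on-disk tree and with `top_correlated` and its parameter `k` being names chosen
only for this example:

```rust
use std::collections::BTreeMap;

/// Key of the graph mapping: (artist, artist).
type Edge = (String, String);

/// Return the `k` highest-weighted neighbours of `artist`.
fn top_correlated(graph: &BTreeMap<Edge, f64>, artist: &str, k: usize) -> Vec<(String, f64)> {
    // (artist, "") is the smallest possible key with this artist as prefix.
    let start = (artist.to_string(), String::new());
    let mut hits: Vec<(String, f64)> = graph
        .range(start..)
        // Stop once the first component is a different artist.
        .take_while(|((a, _), _)| a.as_str() == artist)
        .map(|((_, other), w)| (other.clone(), *w))
        .collect();
    // Sort by weight, highest first; total_cmp gives a total order on f64.
    hits.sort_by(|x, y| y.1.total_cmp(&x.1));
    hits.truncate(k);
    hits
}
```

Sorting all neighbours is the simplest option; for artists with very many
edges, a bounded heap would avoid holding every hit in memory.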

### Notes

Two issues appeared during this project that led to the following fixes:

- Limit one identity to 32 artists at most, because the edge count grows
  quadratically (100 artists -> roughly 10000 edges)
- When parsing data, the user id is made dependent on the time, to separate
  artists when music tastes change over time. Every 10 Ms (10 megaseconds,
  ~4 months) the user ids change.
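Both fixes can be sketched in a few lines. This is only an illustration under
assumptions: timestamps are taken to be Unix seconds, the identity is modelled
as a `(user, period)` pair, and the 32 artists kept are the most-listened ones,
since the selection rule is not spelled out above. All names are hypothetical.

```rust
/// Length of one identity period in seconds: 10 Ms = 10_000_000 s.
const PERIOD_SECS: u64 = 10_000_000;
/// Maximum number of artists kept per identity.
const MAX_ARTISTS: usize = 32;

/// Derive a time-dependent identity from a user name and a Unix timestamp.
/// Listens of the same user in different periods get different identities.
fn identity(user: &str, listened_at: u64) -> (String, u64) {
    (user.to_string(), listened_at / PERIOD_SECS)
}

/// Keep only the `MAX_ARTISTS` most-listened artists of one identity
/// (which artists are kept is an assumption).
fn limit_artists(mut listens: Vec<(String, u64)>) -> Vec<(String, u64)> {
    // Sort by playthrough count, highest first.
    listens.sort_by(|a, b| b.1.cmp(&a.1));
    listens.truncate(MAX_ARTISTS);
    listens
}
```

With the limit in place, one identity contributes at most 32 · 31 / 2 = 496
undirected edges.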

## Results

In a couple of minutes I rendered about 2.2 million HTML documents with my
results. They are available at `https://metamuffin.org/artist-correl/{name}`.
Some example links:

- [The Beatles](https://metamuffin.org/artist-correl/The%20Beatles)
- [Aimer](https://metamuffin.org/artist-correl/Aimer)
- [Rammstein](https://metamuffin.org/artist-correl/Rammstein)
- [Mitski](https://metamuffin.org/artist-correl/Mitski)

## Numbers

- Musicbrainz: 15GB
- Listenbrainz: 350GB
- Extracted listening data: 23GB
- Graph: 56GB
- Rendered HTML: 2.3GB
- Compressed HTML (squashfs with zstd): 172MB