Creating a Personalized Mastodon Post Recommendation System

July 4th, 2023

So I've been using Mastodon for the last few months, and it's the social network I'm most active on. However, I feel like I'm missing out on a lot, mostly because I don't get any recommendation notifications.

For me, both Twitter & Mastodon were places where I follow people that I feel have a high signal-to-noise ratio, and Twitter did a decent job (admittedly sometimes annoying, but decent enough) of suggesting tweets & notifying me about them, so I could regularly go and consume the most interesting content.

In short, for me, Twitter's tweet suggestion notifications acted as a prompt to go consume Twitter content, and I felt that regularly checking Twitter was worth it.

Unsurprisingly, and unfortunately (probably only to me), Mastodon doesn't have a recommendation/notification system! So I got increasingly inactive on Mastodon.

Then one day it struck me that since Mastodon is all open, I can just go and make a recommendation system for me. So here we go.

So basically, my target is to pick a sufficiently interesting post from my timeline every few days.

To do that I can just use an embedding model.

Since I didn't want to waste time picking a model that I wasn't sure would work, I just went with the obvious pick: OpenAI's latest model.

Turns out that's the text-embedding-ada-002 model.

From the OpenAI sample code:

import pandas as pd
import pickle
import numpy as np

from openai.embeddings_utils import (
    get_embedding,
    distances_from_embeddings,
)

EMBEDDING_MODEL = 'text-embedding-ada-002'

embedding_cache_path = 'data/embeddings_cache.pkl'

# load the cache if it exists, and save a copy to disk
try:
    embedding_cache = pd.read_pickle(embedding_cache_path)
except FileNotFoundError:
    embedding_cache = {}

with open(embedding_cache_path, 'wb') as embedding_cache_file:
    pickle.dump(embedding_cache, embedding_cache_file)

def embedding_from_string(
    string: str,
    model: str=EMBEDDING_MODEL,
    embedding_cache=embedding_cache
) -> np.ndarray:
    '''Return embedding of a given string, using a cache to avoid recomputing.'''
    if (string, model) not in embedding_cache.keys():
        embedding_cache[(string, model)] = get_embedding(string, model)
        with open(embedding_cache_path, 'wb') as embedding_cache_file:
            pickle.dump(embedding_cache, embedding_cache_file)
    return np.array(embedding_cache[(string, model)])

This allows us to use embedding_from_string without constantly worrying about spending too much quota on embedding the same string.
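
As a quick sanity check (assuming the OpenAI API key is set in the environment; the sample string is just a placeholder), calling it twice with the same string only hits the API once, since the second call is served from the on-disk cache:

# the first call hits the OpenAI API and stores the result in the pickle cache
embedding = embedding_from_string('Hello, Mastodon!')
print(embedding.shape)  # text-embedding-ada-002 embeddings have 1536 dimensions

# the second call with the same string is answered from the cache, no API request
embedding_again = embedding_from_string('Hello, Mastodon!')
assert (embedding == embedding_again).all()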

So now, with that in place, to pick an interesting post I just went with the simplest method possible: embed every post that I've liked, average them, and then calculate the distance to every post in the timeline.

So here is the code to calculate the average embedding of my favourited posts:

from bs4 import BeautifulSoup
from mastodon import Mastodon

mastodon = Mastodon(access_token='goranmoomin_usercred.secret')

# fetch all favourited posts
favourites = mastodon.favourites()
favourites_cache = favourites
while favourites := mastodon.fetch_next(favourites):
    favourites_cache += favourites

# embed all of the favourited posts
favourite_embeddings = []
for favourite in favourites_cache:
    soup = BeautifulSoup(favourite['content'], features='html.parser')
    text = soup.get_text()
    embedding = embedding_from_string(text)
    favourite_embeddings.append(embedding)

# and average them
favourites_embedding = np.average(favourite_embeddings, axis=0)

… and here is the code to calculate each post's distance from favourites_embedding and sort the posts from the latest timeline:

from bs4 import BeautifulSoup
from mastodon import Mastodon
from datetime import timedelta

mastodon = Mastodon(access_token='goranmoomin_usercred.secret')

# fetch the latest timeline
timeline = mastodon.timeline()
timeline_cache = timeline
while timeline := mastodon.fetch_next(timeline):
    if timeline_cache[0]['created_at'] - timeline[0]['created_at'] > timedelta(days=1):
        break
    timeline_cache += timeline
    
# a sort key function that calculates the distance with favourites
def _distance_from_favourites_embedding(text):
    # handle the empty string case (usually a post with only an image):
    # return a large distance of 1. so these posts sort last
    if text == '':
        return 1.
    embedding = embedding_from_string(text)
    distances = distances_from_embeddings(embedding, [favourites_embedding])
    return distances[0]

texts = []
for status in timeline_cache:
    # the content of reblogs is an empty string, so handle them separately
    content = status['reblog']['content'] if status['reblog'] else status['content']
    soup = BeautifulSoup(content, features='html.parser')
    text = soup.get_text()
    texts.append(text)

# sort them
texts = sorted(texts, key=_distance_from_favourites_embedding)

# print them!
for i, text in enumerate(texts):
    print(i, _distance_from_favourites_embedding(text), text, sep='\t')

Surprisingly, this super-naive approach turns out to work perfectly. Here are the first 10 results for me:

0	0.12469403873877571	I am overjoyed that not one, but TWO #Mastodon clients are now available on the #Mac, running on #MacCatalyst.I worked a TON on Mac Catalyst for over five years until I left #Apple last year, in the sincere hope that it would make it easier for authors to write great iOS apps that can also be polished into excellent apps for the Mac. It pleases me to no end that both @ivory and @MonaApp used Catalyst to port their fantastic #Mastodon clients to the Mac, allowing the Mac to become first-class peers of their iOS products.Congratulations to both apps for achieving this! As an ex-Catalyst plumber, I’m grinning from ear to ear.
1	0.1257537443177793	@matdevdugIt took me a few years of occasional effort to get proficient at Rust; it's one of the hardest languages I ever learned. I greatly enjoy it now. I got better at staying on a smooth path most of the time. There's something addicting about making it all fit together. I have confidence in big refactors.Have I just climbed that hill and now want to convince myself it was worth it? In part, but it also opened many areas of software previously closed, where I need that performance.
2	0.12802969954985488	> feels like an iMessage replacement— a happy Element X nightly user.> I can’t go back to the old Element, even if it’s a nightly— another happy Element X nightly user.
3	0.12863189501840622	I am now running a #Smalltalk-80 emulator on my MacBook Pro, since I thought may be the best way to understand the original implementation of MVC would be to try it. The environment is interesting just by itself though. It is recognizably an ancestor of Macintosh style GUIs, while also missing some pretty basic affordances that came with the Lisa and Macintosh.
4	0.1291937033145475	I've been working my way through the new SwiftUI data flow macros and have re-written my post from 4 years ago as well as updated the sample app. Check it out at https://troz.net/post/2023/swiftui-data-flow-2023/ #Swift #SwiftUI
5	0.12964436359488318	Love that 90% of GitHub projects have mad lib READMEs “<Project> is an [opinionated | minimal | performant] <Obscure Programming Language> implementation of <Only Word The Reader Cares About> using <Framework Last Updated in 2012> that adds <Inscrutable List of Features>.”
6	0.1305046908853189	@arroz @lorentey @mjtsai This is definitely also fair. Apple doesn't make a lot of performance promises, and historically hasn't provided a lot of source code, so we have to guess at what is most efficient, and methods on the type "feel" efficient.When you know more, it feels obvious that there's no particular reason an NSString extension would be the most efficient way to interact with String, but Xcode completion doesn't make it obvious what are ObjC bridging extensions.
7	0.1322295115884221	@janriemer*mumbles something with #NixOS on mobile phones*😁@chrichri @purism
8	0.1324972216941176	Looking back over some of my posts around the introduction of the original iPad in 2010. Found a fun one:“…one of the amazing things about the iPad is that an entire iPhone-app's worth of complexity and power can be implemented as a mere pop-up window in [an] iPad application”Speaking of squandering things…
9	0.13466618189885537	New blog post — "The Xerox Smalltalk-80 GUI Was Weird”. https://collindonnell.com/the-xerox-smalltalk-80-gui-was-weird

This is exactly what I'd consider interesting, though I'm not sure if this is also mostly because I follow the right people :)

I also want to check posts from people I don't follow, to see whether this is a valid method for finding more posts that I'd be interested in, but that's future work.
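
If I ever get around to it, I imagine it'd mostly be a matter of pointing the same sort key at the federated timeline instead of the home timeline. A rough, untested sketch (assuming Mastodon.py's timeline_public() and the helpers defined above):

# score posts from people I don't follow by reusing the same sort key
# on the federated timeline
public_timeline = mastodon.timeline_public(limit=40)

public_texts = []
for status in public_timeline:
    # same text extraction as for the home timeline
    content = status['reblog']['content'] if status['reblog'] else status['content']
    text = BeautifulSoup(content, features='html.parser').get_text()
    public_texts.append(text)

public_texts = sorted(public_texts, key=_distance_from_favourites_embedding)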

For me, right now, I just need a way to regularly send myself a notification with one of those posts, and that's where ntfy.sh comes in.

So basically, you just POST to an ntfy.sh topic URL, and you get a push notification. For example…

$ curl -d 'New Mastodon post: I am overjoyed that not one, but TWO #Mastodon clients are now available on the #Mac, running on…' -H 'Actions: view, Open Post in Ivory, ivory:///status/110645585722834036' ntfy.sh/<topic>

This command results in the following notification, and tapping the 'Open Post in Ivory' button opens the corresponding post in Ivory.

Example Notification

BTW, below is from Ivory's URL scheme documentation.

Tabs

ivory://acct/home (or timeline)
ivory://acct/mentions
ivory://acct/lists
ivory://acct/favorites
ivory://acct/bookmarks
ivory://acct/statistics
ivory://acct/profile
ivory://acct/search

Modal callback_url=<url> valid for all the below

ivory://acct/openURL?url=<url>
ivory://acct/status/status_id (from acct's instance)
ivory://acct/user_profile/user_acct
ivory://acct/post
ivory://acct/post/text
ivory://acct/post?text=<text>&in_reply_to_status_url=<url>

Don't be fooled by the acct part: that's the account of the signed-in user themselves, so you can just leave it blank. If you want to open another user's profile, it's ivory:///user_profile/user_acct, where user_acct is the placeholder for the account you want to display.

Anyway, you just need to find out the status id of the timeline post, which is status['id'] in Python. I was a bit too lazy, so I just created a text-to-status dict and got the id from that.

statuses = {}

for status in timeline_cache:
    # extract the text the same way as before, so it matches the keys in texts
    content = status['reblog']['content'] if status['reblog'] else status['content']
    soup = BeautifulSoup(content, features='html.parser')
    text = soup.get_text()
    statuses[text] = status

for i, text in enumerate(texts):
    status = statuses[text]
    print(i, _distance_from_favourites_embedding(text), status['id'], text, sep='\t')

Now that I have the ID, I can just curl/request appropriately & schedule this to run every day.
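
Just to sketch what that request would look like in Python (the ntfy.sh topic name below is a made-up placeholder, and the 120-character truncation is arbitrary), the curl call from earlier translates to roughly this:

import requests

# pick the post closest to my favourites and look up its status id
best_text = texts[0]
best_status = statuses[best_text]

# send the notification to a (placeholder) ntfy.sh topic, with an action button
# that opens the post in Ivory
requests.post(
    'https://ntfy.sh/my-mastodon-recommendations',
    data=f'New Mastodon post: {best_text[:120]}…'.encode('utf-8'),
    headers={'Actions': f"view, Open Post in Ivory, ivory:///status/{best_status['id']}"},
)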

That, I'll leave as future work. Which means that I have all this worked out and yet I'm not getting any notifications, but that's how I roll :)