Everyone who reads this blog knows how fond I am of Letterboxd: I use it to reference films in my weekly roundups, and I’m a ‘pro’ user, meaning that I support the website with €16 per year (which is more or less the average price of a pizza in Brussels - I’m Italian, pizza is my reference unit for many things).
Among the many nice perks of the ‘pro’ membership, there is a stats page which shows in ‘real-time’ - i.e., without having to wait for the end of the year - some charts and information about how my ‘film buff year’ and ‘life’ are going: as of the time of writing, I have an alarming 307 viewings logged for 2024 - so unless I go into a coma by the end of the month, I will definitely shatter last year’s already worrying record, sign of a thriving social life!
Who was the actor I most watched this year? Michael Caine… I didn’t expect that.
Who was the actor I most watched ever? Samuel L. Jackson, but that’s because of Nick Fury’s sometimes very short appearances… the second most watched ever? Willem Dafoe, which is a pleasant surprise as he’s not a franchise person (sure, two Spider-men, but that’s an exception).
Film most watched ever? Doctor Sleep, which I expected (but only because Letterboxd didn’t exist when Back to the Future came out… I must have seen it more than twenty times).
But there are a lot of statistics that I would like to see and that are not on this page: how did my taste evolve? Who’s my most rated cinematographer? What are the most striking differences between my taste and the rest of the world (ok, Aliens and Gladiator… but what else?)
So, I would like to run my own analysis: out of curiosity, but also to try and improve my skills in data visualisation.
My plan was:
- export my history of watched films and ratings. Letterboxd offers a nice ’export’ feature which, in a couple of seconds, allows the download of the entire interaction with the website (ratings and timeline but also reviews, likes, watchlist)
- knowing the TMDb ID for each film, use the TMDb API to extract as much information as possible about cast and crew, etcetera
- go to the Winchester, have a nice cold pint, then come back home, be curious and have fun
Step one, easy. The export zip file contains three files I’m interested in:
- diary.csv: this file lists everything I’ve ’logged’ (meaning: everything with a ‘watched date’), and includes the date I logged the film on the website, film title, release year, Letterboxd URI, rating, a flag showing whether it was a rewatch, tags (empty for me because I don’t use them) and the aforementioned ‘watched date’; films are sorted chronologically by this date;
- ratings.csv: the list of all the ratings I gave on the website (meaning: everything with a ‘star rating’, regardless of whether I noted the watching date or not), detailing the date I logged the film on the website, the film title, year, Letterboxd URI, and the rating;
- watched.csv: the list of all the films I marked as watched (regardless of the fact I rated them, or logged them), which simply lists the date I logged the film on the website, the film title, year and Letterboxd URI. I’m not sure this file is useful at all for my purpose.
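A minimal sketch of how diary.csv could be read with Python's standard library. The header names (`Name`, `Watched Date`, and so on) and the `Yes` value for the rewatch flag are assumptions inferred from the field descriptions above, so check them against the first line of your own export; the sample row is made up.

```python
import csv
import io
from datetime import date


def read_diary(handle):
    """Parse diary rows into dicts with typed fields.

    The column names used here are assumed, not confirmed by the export docs.
    """
    entries = []
    for row in csv.DictReader(handle):
        entries.append({
            "title": row["Name"],
            "year": int(row["Year"]) if row["Year"] else None,
            "uri": row["Letterboxd URI"],
            # Unrated viewings are assumed to have an empty rating cell.
            "rating": float(row["Rating"]) if row["Rating"] else None,
            "rewatch": row["Rewatch"] == "Yes",
            "watched": date.fromisoformat(row["Watched Date"]),
        })
    return entries


# Made-up sample row, only to show the assumed shape of the file.
SAMPLE = io.StringIO(
    "Date,Name,Year,Letterboxd URI,Rating,Rewatch,Tags,Watched Date\n"
    "2024-12-02,Shaun of the Dead,2004,https://boxd.it/EXAMPLE,4.5,Yes,,2024-12-01\n"
)
diary = read_diary(SAMPLE)
```

With a real export, the same function would be called on `open("diary.csv", newline="", encoding="utf-8")` instead of the in-memory sample.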
The natural starting point, since I’m interested not only in my appreciation of a film, but also in how it’s linked to my chronology of watching, is the ‘diary’ file.
Step two… here comes the first problem with my plan: none of these files includes the TMDb ID for a film. I had taken it for granted that they would, because (1) that’s where Letterboxd grabs all its information, (2) each Letterboxd page includes the relevant link to TMDb, and (3) the ID is included in each user’s Letterboxd RSS feed. But no, it’s not in the export.
Since the exports themselves don’t include any useful information about films, I need to make this link with a richer data source.
So, revised plan:
- export my history of watched films and ratings. Letterboxd offers a nice ’export’ feature which, in a couple of seconds, allows the download of the entire interaction with the website (ratings and timeline but also reviews, likes, watchlist)
- link the information in the export file to each film’s TMDb ID
- knowing the TMDb ID for each film, use the TMDb API to extract as much information as possible about cast and crew, etcetera
- go to the Winchester, have a nice cold pint, then come back home, be curious and have fun
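For the TMDb step of the plan, here is a rough sketch of what fetching cast and crew could look like once an ID is known. It assumes the TMDb v3 `movie/{id}/credits` endpoint with an `api_key` query parameter and the job title `Director of Photography` for cinematographers; both should be checked against the current TMDb documentation. The sample response and the names in it are invented.

```python
import json
import urllib.parse
import urllib.request

TMDB_API = "https://api.themoviedb.org/3"  # assumed base URL of the v3 API


def credits_url(tmdb_id, api_key):
    """Build the URL of the credits endpoint for one film."""
    query = urllib.parse.urlencode({"api_key": api_key})
    return f"{TMDB_API}/movie/{tmdb_id}/credits?{query}"


def fetch_credits(tmdb_id, api_key):
    """Download and decode the credits JSON (needs a valid API key)."""
    with urllib.request.urlopen(credits_url(tmdb_id, api_key), timeout=30) as resp:
        return json.load(resp)


def cinematographers(credits):
    """Names of crew members whose job is (assumed to be) 'Director of Photography'."""
    return [
        member["name"]
        for member in credits.get("crew", [])
        if member.get("job") == "Director of Photography"
    ]


# Invented response fragment, shaped like the assumed credits payload.
sample_credits = {
    "id": 12345,
    "cast": [{"name": "John Smith", "character": "Hero"}],
    "crew": [
        {"name": "Jane Doe", "job": "Director of Photography"},
        {"name": "Alex Roe", "job": "Director"},
    ],
}
print(cinematographers(sample_credits))
```

Counting these names over the whole diary would be one way to answer the "most rated cinematographer" question.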
The second problem: there is no unique ID for a film [1]. Even worse, the ‘Letterboxd URI’ field (which is always a shortened URI, so it’s not immediately clear what it links to) has a different meaning depending on the export file:
- in diary.csv, it doesn’t link to the film’s main page, but to a page created for this specific watch; the format of the destination URI is:
  - ‘https://letterboxd.com/’ +
  - username +
  - ‘/film/’ +
  - film-title +
  - (in case of multiple watches of the same film) a zero-based number identifying the viewing
- in ratings.csv and watched.csv, it links to the film page, following the schema:
  - ‘https://letterboxd.com/film/’ +
  - film-title
So, if it’s the third time I’ve logged Shaun of the Dead:
- the diary URI will be: https://letterboxd.com/feadin/film/shaun-of-the-dead/2/
- the rating URI will be: https://letterboxd.com/film/shaun-of-the-dead/
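The two schemas can be taken apart mechanically once the shortened URI has been resolved (which still costs a request). The sketch below only assumes the path layouts described above, `/<user>/film/<slug>/[<n>/]` for diary entries and `/film/<slug>/` for film pages; the function names are hypothetical.

```python
from urllib.parse import urlparse


def parse_letterboxd_uri(uri):
    """Split a resolved Letterboxd URI into user, slug and viewing index.

    'user' is None for a film page; 'viewing' is 0 when there is no
    numeric suffix.
    """
    parts = [p for p in urlparse(uri).path.split("/") if p]
    if len(parts) >= 2 and parts[0] == "film":
        # Film page: /film/<slug>/
        return {"user": None, "slug": parts[1], "viewing": 0}
    if len(parts) >= 3 and parts[1] == "film":
        # Diary page: /<user>/film/<slug>/[<n>/]
        has_index = len(parts) > 3 and parts[3].isdigit()
        return {
            "user": parts[0],
            "slug": parts[2],
            "viewing": int(parts[3]) if has_index else 0,
        }
    raise ValueError(f"unrecognised Letterboxd URI: {uri}")


def film_page_uri(diary_uri):
    """Derive the film page URI from a resolved diary URI."""
    slug = parse_letterboxd_uri(diary_uri)["slug"]
    return f"https://letterboxd.com/film/{slug}/"


entry = parse_letterboxd_uri("https://letterboxd.com/feadin/film/shaun-of-the-dead/2/")
print(entry, film_page_uri("https://letterboxd.com/feadin/film/shaun-of-the-dead/2/"))
```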
Furthermore, the film-title part of the URI, normally a dash-separated lowercase version of the English title, also includes the film year in case of ambiguity, and a further numeric suffix if that’s not enough.
All of this makes perfect sense from a technical point of view, but it’s not very helpful to me: unless I know exactly how many films with a given title exist in general and for a given year, I have no way of determining with certainty the correct film URI based on the information I have in these exports.
This implies that, even if I found online a reference list matching the Letterboxd ID and the TMDb ID, I wouldn’t be sure to identify the right film.
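To make the ambiguity concrete, this sketch lists the slugs a given title and year could map to. The `slugify` rule is only a guess at how Letterboxd builds the film-title part (ASCII-folded, lowercase, dash-separated), and the order of the suffixes is an assumption too; nothing in the export says which candidate is the right one.

```python
import re
import unicodedata


def slugify(title):
    """Guess at the slug rule: ASCII-fold, lowercase, dashes between words."""
    folded = unicodedata.normalize("NFKD", title)
    folded = folded.encode("ascii", "ignore").decode("ascii")
    folded = folded.lower().replace("'", "")
    return re.sub(r"[^a-z0-9]+", "-", folded).strip("-")


def candidate_slugs(title, year, max_suffix=2):
    """All slugs the film might have: plain, with year, with year and suffix."""
    base = slugify(title)
    candidates = [base, f"{base}-{year}"]
    candidates += [f"{base}-{year}-{n}" for n in range(1, max_suffix + 1)]
    return candidates


print(candidate_slugs("Swan Song", 2021, max_suffix=1))
```

Each candidate would still have to be checked against the site, which is why title and year alone cannot settle the match.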
Anyway, the only source I’ve found that links these two pieces of information is the Wikidata project, which also exposes an endpoint for a query service. Maybe it will come in handy another time.
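For reference, a query against that service might look like the sketch below. It assumes the public endpoint `https://query.wikidata.org/sparql` and the property numbers P6127 (Letterboxd film ID) and P4947 (TMDb movie ID) as I understand them; verify both on Wikidata before relying on this. The sample result is invented.

```python
import json
import urllib.parse
import urllib.request

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"  # assumed endpoint

# Property numbers are assumptions to be verified on wikidata.org.
QUERY = """
SELECT ?slug ?tmdb WHERE {
  ?film wdt:P6127 ?slug ;
        wdt:P4947 ?tmdb .
}
LIMIT 100
"""


def run_query(query=QUERY):
    """Send the query and return the decoded SPARQL JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        f"{SPARQL_ENDPOINT}?{params}",
        headers={"User-Agent": "letterboxd-stats-example/0.1"},
    )
    with urllib.request.urlopen(request, timeout=60) as resp:
        return json.load(resp)


def slug_to_tmdb(results):
    """Turn SPARQL JSON bindings into a {letterboxd_slug: tmdb_id} dict."""
    return {
        binding["slug"]["value"]: int(binding["tmdb"]["value"])
        for binding in results["results"]["bindings"]
    }


# Invented result, shaped like the standard SPARQL JSON format.
sample_results = {
    "results": {
        "bindings": [
            {"slug": {"value": "some-film"}, "tmdb": {"value": "12345"}},
        ]
    }
}
print(slug_to_tmdb(sample_results))
```

Coverage would depend on how many films carry both identifiers in Wikidata, so this could complement scraping rather than replace it.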
I also understand these exports are not made for my use case, but only to comply with regulations about users’ rights to export their data. But Letterboxd is more than happy to support TMDb IDs and IMDb IDs when importing data, so this feels just like a way to ‘keep users in’ (TMDb wouldn’t complain if their IDs were provided in these exports).
What to do then? I even tried to play with Kaggle’s Movies Daily Update Dataset, but it lists too many non-film related entries, and the data quality is not good enough to even make a simple ‘title + year’ matching reliable.
The universal remedy, in this case, is dear old web scraping… which is not good web citizenship, I know, but I feel like I’m out of options.
Furthermore, starting from diary.csv, I would need to scrape two pages for each entry: the URI in the diary entry would take me to my viewing page, where I would need to find the URI of the movie page, and only then could I open the movie page to get the TMDb ID.
So my current sub-plan for step 2 is:
- join diary and ratings by film title and year (not 100% guaranteed to be correct - I hope not to fall into the ‘swan song’ case);
- get the TMDb ID for each film by scraping the main movie page (linked by ratings); I’ll apply a 10-second delay between requests to avoid being too invasive;
- save once and for all the matching Letterboxd URI - TMDb ID in a SQLite database.
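The sub-plan above can be sketched as follows. Everything here is illustrative: the CSV header names are assumed, the TMDb ID is pulled from the first `themoviedb.org/movie/<id>` link found in the page (the post only says such a link exists), and `fetch_html` is a hypothetical caller-supplied function that downloads a page.

```python
import re
import sqlite3
import time

TMDB_LINK = re.compile(r"themoviedb\.org/movie/(\d+)")


def join_by_title_year(diary, ratings):
    """Attach the film-page URI from ratings to each diary row.

    The key is (title, year), so homonymous films of the same year collide.
    """
    film_uri = {(r["Name"], r["Year"]): r["Letterboxd URI"] for r in ratings}
    return [dict(d, film_uri=film_uri.get((d["Name"], d["Year"]))) for d in diary]


def extract_tmdb_id(html):
    """Return the ID from the first TMDb movie link in the page, if any."""
    match = TMDB_LINK.search(html)
    return int(match.group(1)) if match else None


def open_cache(path="letterboxd_tmdb.sqlite"):
    """Open (or create) the SQLite table mapping Letterboxd URIs to TMDb IDs."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS film_ids "
        "(letterboxd_uri TEXT PRIMARY KEY, tmdb_id INTEGER)"
    )
    return db


def tmdb_id_for(db, uri, fetch_html, delay=10):
    """Look the URI up in the cache; scrape (and then wait) only on a miss."""
    row = db.execute(
        "SELECT tmdb_id FROM film_ids WHERE letterboxd_uri = ?", (uri,)
    ).fetchone()
    if row:
        return row[0]
    tmdb_id = extract_tmdb_id(fetch_html(uri))
    if tmdb_id is not None:
        db.execute("INSERT INTO film_ids VALUES (?, ?)", (uri, tmdb_id))
        db.commit()
    time.sleep(delay)  # be polite: pause after every real request
    return tmdb_id
```

Passing `fetch_html` in keeps the network code separate, so the cache and the parsing can be tried on saved pages first; a real run could wrap `urllib.request.urlopen` and keep the default 10-second delay.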
If any reader has a better idea about how to proceed, please get in touch!
But really, Letterboxd, do you need to make things so complicated [2]? I would be willing to agree to an increase of my membership by the price of a Belgian beer if you added the de facto standard IDs to your exports.
[1] Actually, Letterboxd does have, of course, a unique numeric ID for each film, which is used for instance in the path for posters, and sprinkled here and there in each page’s source code… but it’s not exposed anywhere.
[2] Another surprise, to be dealt with later on: I thought the ratings rating and the diary rating would match, but no, they are two separate scores; the ratings rating usually matches the last diary rating for a film, but ‘correcting’ it in the film main page only changes the former, not the latter. Again, it makes sense from an engineering point of view, but it’s not the way I expected it to work.