How I created a Twitter bot that posts about science fiction books

Posted on Fri 31 March 2023 in blog

As an avid reader I thought it would be cool to combine my love of literature and data to create a Twitter bot that posts about sci-fi books :

Screenshot of the Project Gutenberg Sci-Fi books Twitter account

Specifically I wanted to recommend science fiction books that you could read free of charge, something made possible thanks to Project Gutenberg, a volunteer-run organisation that hosts a collection of public domain works.

Data source/collection

PG very helpfully offers a CSV feed, which I used to download their catalog:

import csv

import requests

IDS_CSV = "data/IDs_log.csv"
SF_CATALOG = "data/sf_catalog.csv"
PG_CATALOG = "data/pg_catalog.csv"

# Access Project Guntenberg CSV feed and write catalog to CSV file.
URL = "https://www.gutenberg.org/cache/epub/feeds/pg_catalog.csv"

with open(PG_CATALOG, 'wb') as f:
    content = requests.get(URL, stream=True)
    for line in content.iter_lines():
        f.write(line+'\n'.encode())

Here's a snippet of the catalog:

Text# Type Issued Title Language Authors Subjects LoCC Bookshelves
64 Text 1993-05-01 The Gods of Mars en Burroughs, Edgar Rice, 1875-1950 Science fiction; Mars (Planet) -- Fiction; Life on other planets -- Fiction; Carter, John (Fictitious character) -- Fiction; Dejah Thoris (Fictitious character) -- Fiction PS Science Fiction
155 Text 2006-01-12 The Moonstone en Collins, Wilkie, 1824-1889 England -- Fiction; Country homes -- Fiction; Police -- England -- Fiction; Jewelry theft -- Fiction; East Indians -- England -- Fiction; Mystery fiction PR Detective Fiction; Mystery Fiction

Data transformation

Filtering the data

Using the Subjects field, I identified science fiction books (and only books) and created a new CSV file of science fiction books:

# Create CSV file of science fiction books.
with open(PG_CATALOG) as input_file, open(SF_CATALOG, 'w') as output_file:
    field_names = ['Text#', 'Title', 'Authors']
    csv_reader = csv.DictReader(input_file)
    csv_writer = csv.DictWriter(output_file, field_names)
    csv_writer.writeheader()
    for row in csv_reader:
        if "Science Fiction" in row["Bookshelves"] and row["Type"] == "Text":
            csv_writer.writerow({"Text#":row["Text#"], "Title":row["Title"], "Authors":row["Authors"]})

Cleaning the data

The Authors field metadata is in the "surname-first" format (used in academic and scientific writing), which is unfortunately not as readable for the intended end user of this bot:

For example:

Text# Title Authors
28767 The Defenders Dick, Philip K., 1928-1982; Emshwiller, Ed, 1925-1990 [Illustrator]

In order to clean this data, I create a Book class and implemented a _clean_author_metadata() method:

"""Book class module."""

class Book:
    """A book from the Project Guntenberg collection."""

    def __init__(self, text_ID, title, authors):
        """ Constructor method."""

        self.text_ID = str(text_ID)
        self.title = title
        self.URL = f"https://www.gutenberg.org/ebooks/{text_ID}"
        self.authors = self._clean_authors_data(authors)

    def _clean_authors_data(self, authors):
        """Clean the author metadata."""

        individual_authors = authors.split(";")
        cleaned_authors = []
        for author in individual_authors:
            author_words = author.split(" ")
            new_author_words = []
            for word in author_words:
                try:
                    int(word[0])
                except ValueError:
                    if word != "[Illustrator]":
                        new_word = word.replace(",", "")
                        new_author_words.insert(-1, new_word)
                    else:
                        new_author_words.append(word)
                except IndexError:
                    pass
            cleaned_author = " ".join(new_author_words)
            cleaned_authors.append(cleaned_author)
        return cleaned_authors

Using our The Defenders example from before, here is the cleaned data:

Choosing a book to post about

Before a random book is selected from the science fiction catalog to be posted to Twitter, there are a couple of checks that take place.

When a book is posted about by the Twitter bot, the book's unique ID (the Text# field that you can see in the catalog snippet above) is written to an "IDs" CSV - basically a log of which books have been posted about.

To begin with, we verify if the number of IDs recorded is equivalent to the number of science fiction books listed in the catalog. If this is the case, the bot has posted about all the SF books available and a fresh, empty IDs CSV file is created so that we can start posting again:

import os 
import csv
import random

import tweepy 

from book import Book

IDS_CSV = "data/IDs_log.csv"
SF_CATALOG = "data/sf_catalog.csv"
HEADERS = ["Text#", "Title"]

with open(SF_CATALOG) as sf_csv, open(IDS_CSV) as IDs_csv:
    sf_csv_reader = csv.DictReader(sf_csv)
    sf_rows = [row for row in sf_csv_reader]
    sf_rows_count = len(sf_rows)

    IDs_csv_reader = csv.DictReader(IDs_csv)
    IDs_rows = [row for row in IDs_csv_reader]
    IDs_rows_count = len(IDs_rows)

# Check to see if the Twitterbot has gone through all books in the SF catalog.
if sf_rows_count == IDs_rows_count:
    with open(IDS_CSV, "w") as f:
        csv_writer = csv.DictWriter(f, HEADERS)
        csv_writer.writeheader()
        csv_writer.writerow({"Text#":"0", "Title":"0"})

When a random book is selected from the SF catalog, we then check that this book hasn't been posted before. Note that there are two steps in this check:

  • There is a check to make sure the random book's Text# ID isn't present in our IDs catalog.

  • There is a check to ensure the book's title doesn't match anything that's been posted about before. That's because there are books that are present in the catalog more than once - for example, there are two version of The Strange Case of Dr. Jekyll and Mr. Hyde - so even if the Text# ID is different, it might be the same book. If the same book has posted before, it's Text# ID is written to the IDs catalog:

random_pick = sf_rows[random.randint(0, sf_rows_count)]

# Check to make sure the book chosen hasn't already been posted previously.
books_to_log = []
flag = False
while flag == False:
    with open(IDS_CSV) as f:
        csv_reader = csv.DictReader(f)
        text_rows = [row['Text#'] for row in csv_reader]
        title_rows = [row['Title'] for row in csv_reader]
    if random_pick["Title"] in title_rows and random_pick["Text#"] not in text_rows:
        same_book = Book(random_pick["Text#"], random_pick["Title"],random_pick["Authors"])
        books_to_log.append(same_book)
    if random_pick["Text#"] in text_rows:
        random_pick = sf_rows[random.randint(0, sf_rows_count)]
        continue
    else:
        flag = True
        book_pick = Book(random_pick["Text#"], random_pick["Title"],random_pick["Authors"])
        books_to_log.append(book_pick)
        with open(IDS_CSV, 'a') as f:
            csv_writer = csv.DictWriter(f, HEADERS)
            for book in books_to_log:
                csv_writer.writerow({"Text#":book.text_ID, "Title":book.title})

Once a random book has been selected that hasn't been posted about before, its Text# is logged to the IDs CSV and the book is posted to Twitter:

# Post to Twitter.
authors_string = " and ".join(book_pick.authors)
client.create_tweet(text=f"Check out {book_pick.title} by {authors_string}. #ebook #sciencefiction {book_pick.URL}")

Here's an example:

Automation

In order to make my Twitter bot automatically post a random book recommendation twice a day (at 7am and 7pm UTC time), I utilized Github Actions to create a customized workflow. This allowed me to automate the process of selecting and posting a book recommendation without the need for manual intervention:

name: Post book tweet

on: 
  schedule: 
  - cron: "0 7 * * *"
  - cron: "0 19 * * *"

permissions: write-all

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
    - name: Install Python
      uses: actions/setup-python@v4
      with:
        python-version: '3.10'
    - name: Checkout repository content
      uses: actions/checkout@master
    - name: Install requirements.txt
      run: |
        python -m pip install --upgrade pip
        pip install -r requirements.txt
    - name: Set enviromental secrets and execute twitter_bot.py
      env:
          PG_TWITTER_ACCESS_TOKEN: ${{ secrets.PG_TWITTER_ACCESS_TOKEN }}
          PG_TWITTER_ACCESS_TOKEN_SECRET: ${{ secrets.PG_TWITTER_ACCESS_TOKEN_SECRET }}
          PG_TWITTER_BEARER_TOKEN: ${{ secrets.PG_TWITTER_BEARER_TOKEN }}
          PG_TWITTER_CONSUMER_KEY: ${{ secrets.PG_TWITTER_CONSUMER_KEY }}
          PG_TWITTER_CONSUMER_SECRET: ${{ secrets.PG_TWITTER_CONSUMER_SECRET }}
      run: |
          python src/twitter_bot.py
    - name: update repo
      run: |
        git config user.email ${{ secrets.EMAIL }}
        git config user.name "Ben"
        git config user.username ben-n93
        git config user.password ${{ secrets.PERSONAL_ACCESS_TOKEN}}
        git add --all
        git commit -m "update"
        git push

In order to ensure that my Twitter bot could post about any new sci-fi books added to the PG catalog, I also used GitHub Actions to automate the extraction of the catalog data on the first day of every month (I won't include the YAML file here but you can view it in my repo).

You can view the full source code on Github.