Python script for backing up Twitter statuses

Update, 14 October 2012: I don’t think the script below works anymore, but I’ve found that the All My Tweets web site is good for my purposes. As of today, it creates a single page of your 3,200 most recent tweets, which is the current limit of the API provided by Twitter.

Mt. Vernon Street, Boston, MA

I was looking for a way to back up my tweets (I find that I’m reluctantly giving over to calling Twitter posts by that name, but I don’t feel good about it) and found this “recipe” at ActiveState’s web site:

Recipe 576594: Backup/download your tweets or anyone’s tweets.

It’s a nice, simple Python script that saves messages to a text file. It works great, although I wanted a bit more: it doesn’t save timestamps, and it leaves HTML link tags in the text. The recipe also introduced me to a nifty library, “Beautiful Soup”, which is a “Python HTML/XML parser designed for quick turnaround projects like screen-scraping.”

Since I’ve been intending to get back into learning and using Python, this was a good opportunity to play around a bit. Starting from Zach’s code and with the nice Beautiful Soup documentation, it didn’t take long to figure out how to extract more from the Twitter pages. Also there was some fun with regular expressions to strip out the HTML.
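To make the regular-expression part concrete, here is a minimal sketch of the link stripping, using the compact one-line forms of the two patterns that appear in the script's comments. The function name strip_links and the sample string are made up for illustration, and the markup it expects is the old Twitter page HTML, which may well have changed since.

```python
import re

# Compact forms of the two patterns: capture the href target and
# throw away the rest of the anchor element.
USER_LINK = re.compile(r'@<a [^>]*href="/([^"]*)"[^<]*</a>')
PLAIN_LINK = re.compile(r'<a [^>]*href="([^"]*)"[^<]*</a>')


def strip_links(html):
    """Replace anchor tags with plain text (hypothetical helper).

    Mentions become @user; every other link becomes its bare URL.
    """
    # Handle mentions first so they keep the short @name form
    text = USER_LINK.sub(r'@\1', html)
    return PLAIN_LINK.sub(r'\1', text)


# Illustrative input, modeled on the examples in the script's comments
sample = ('reading @<a href="/scarpent">scarpent</a> at '
          '<a href="http://bit.ly/Xxlch" rel="nofollow" '
          'target="_blank">http://bit.ly/Xxlch</a>')
print(strip_links(sample))
```

The order matters: if the general link pattern ran first, a mention would be replaced by its relative URL ("/scarpent") rather than the user name.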

Of course, this is kind of a brittle way to do things and it makes assumptions about HTML that can change at any time at the whims of Twitter’s web people, but I’m okay with that. (And I’m aware that there are Python libraries for interfacing with the Twitter API.)

So why do I want to back up all this ephemera? Well, I’m a bit of a saver. If I’m taking the trouble to post a stream of pointless drivel, then I’d like to have some record of it under my control.

And here it is! The script loops through your (or anyone’s) archive pages and saves a page at a time (currently 20 posts), along with the datestamp and status ID of each post. A nice feature would be for the script to incrementally add only newer posts to the file, but I didn’t get into that. I also didn’t do anything with the times, so the topmost posts will say “3 minutes ago” and so on, and the year isn’t displayed on posts from the current year.
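The incremental idea could look something like the sketch below. It is not part of the script. It assumes that status IDs only ever increase over time, that pages list posts newest first, and that each saved metadata line ends with the status ID in parentheses, the way the script writes it. The names highest_saved_id and newer_than are hypothetical.

```python
import re

# Assumed metadata line format: "<date text> (<status id>)"
META_ID = re.compile(r'\((\d+)\)\s*$')


def highest_saved_id(lines):
    """Return the largest status ID found in a previous backup, or 0.

    Caveat: a tweet whose own text ends in "(digits)" would be
    misread as a metadata line by this simple check.
    """
    ids = [int(m.group(1)) for m in map(META_ID.search, lines) if m]
    return max(ids, default=0)


def newer_than(entries, last_id):
    """Keep (status_id, text) pairs newer than last_id.

    entries are assumed to be newest first, so scanning can stop at
    the first ID that was already saved.
    """
    fresh = []
    for status_id, text in entries:
        if int(status_id) <= last_id:
            break
        fresh.append((status_id, text))
    return fresh


# Illustrative data only
saved = ['an old post', '3:15 PM Mar 18th (1329714004)', '']
page = [('1329720000', 'a new post'),
        ('1329714004', 'an old post'),
        ('1329000000', 'an older post')]
print(newer_than(page, highest_saved_id(saved)))
```

Once a page yields fewer fresh entries than it contains, the loop over pages could stop, since everything older is already in the file.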

I’ve had my fun with it and don’t know that I’ll be doing any more tinkering with it for a while or ever. But you can take the ball and run with it if you want. Feel free to take the code and turn it into something amazing. I’ll put this in the public domain, after disclaiming any responsibility for what the code does. It’s pretty simple — you can see for yourself how dangerous it is.

#!/usr/bin/python3

# started from: http://code.activestate.com/recipes/576594/

import sys
import time
import datetime
import re
from urllib.request import urlopen
from BeautifulSoup import BeautifulSoup

# Replace USERNAME with your twitter username
url = 'http://twitter.com/USERNAME?page=%s'

# r'@<a [^>]*href="/([^"]*)"[^<]*</a>'
pattern_user = r'''(?x)   # verbose mode
    @                     # start of twitter user link
    <a[ ][^>]*href="/     # 'a' opening tag to the start of href url
    ([^"]*)"              # capture the user part of url to \1
    [^<]*                 # link text: any chars other than '<', up to:
    </a>                  # the 'a' closing tag'''
    # matches @<a href="/scarpent">scarpent</a>

# r'<a [^>]*href="([^"]*)"[^<]*</a>'
pattern_link = r'''(?x)   # verbose mode
    <a[ ][^>]*href="      # 'a' opening tag to the start of href url
    ([^"]*)"              # capture entire url to \1
    [^<]*                 # link text: any chars other than '<', up to:
    </a>                  # the 'a' closing tag'''
    # matches <a href="http://bit.ly/Xxlch" rel="nofollow"
    #                            target="_blank">http://bit.ly/Xxlch</a>

# capture numeric status id from "published timestamp" span
re_status_id = re.compile(r'.*/status/([0-9]*).*')
    # e.g. http://twitter.com/scarpent/status/1329714004

print('tweets saved at: ' + str(datetime.datetime.today()) + '\n')

num_tweets = 0
delay = 5
for x in range(1,2000):
    f = urlopen(url % x)
    # f = open('twitter-feed.htm', 'rb')
    soup = BeautifulSoup(f.read())
    f.close()
    tweets = soup.findAll('li', {'class': re.compile(r'.*\bstatus\b.*')})
    if len(tweets) == 0:
        break

    for tweet in tweets:
        num_tweets += 1
        content = str(tweet.find('span', 'entry-content').renderContents(),
                      'utf8')

        content = re.sub(pattern_user, r'@\1', content)
        content = re.sub(pattern_link, r'\1', content)
        print(content)

        date_time = str(
            tweet.find('span', 'published timestamp').renderContents().strip(), 'utf8')
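        m = re_status_id.search(tweet.find('a', 'entry-date')['href'])
        # use a placeholder when the href doesn't match, so we never reuse
        # the previous tweet's id (or hit a NameError on the first tweet)
        status_id = m.group(1) if m else 'unknown'
        meta = date_time + ' (' + status_id + ')\n'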
        print(meta)

    sys.stderr.write('%d tweets saved\nwaiting %d seconds for next page...\n' %
                     (num_tweets, delay))
    # be nice to twitter's servers
    time.sleep(delay)

print('%d tweets' % num_tweets)

Updated 3/20/09: Removed explicit writing to “tweets” file. Seems more Unix-like to redirect stdout to a file instead. (The status messages with ongoing counts now go to stderr.)
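As a small illustration of that split, here is a sketch (not the script itself) of sending tweet text to stdout and progress notes to stderr. The function name report and its out/err parameters are assumptions added so the streams can be swapped out for testing.

```python
import sys


def report(tweet_lines, count, out=None, err=None):
    """Write tweet text to stdout and a progress note to stderr."""
    out = sys.stdout if out is None else out
    err = sys.stderr if err is None else err
    for line in tweet_lines:
        out.write(line + '\n')   # the data you want to keep
    err.write('%d tweets saved\n' % count)   # status only, not saved


report(['first tweet', 'second tweet'], 2)
```

Run with something like `python3 backup.py > tweets.txt` (both file names are placeholders), only the tweet lines land in the file, while the running counts still show up in the terminal.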

Updated 10/15/09: Changed from span “published” to “published timestamp”.

And, speaking of things being brittle and at the mercy of changes to Twitter, the day after I first posted this, Twitter changed the behavior of how more posts are displayed. But for now the [url]?page=N referencing scheme still works. Maybe I’ll end up having to play with the API after all…

4 thoughts on “Python script for backing up Twitter statuses”

  1. Hi, Bruce!

    Yeah, I have an Identi.ca account, and maybe I should set up to post there first and then automate a post to Twitter in turn. I guess I’m going with Twitter purely for network effects, and it’s all vaporous enough that I’m not concerned about it at the moment. (But maybe should re-think this and use Identi.ca as my primary account out of principle.)

  2. Very good – had to add a random delay to ensure Twitter did not remotely shut off.

    Thanks for sharing.

    Code Additions:

    import random, time
    #Random Sleep to trick Twitter Servers
    time.sleep(random.randint(1,6))
