Unofficial Kiwi Farms Community Project: Twitter Archiver

James Smith

Some lolcows like to post dumb stuff to Twitter and then delete it. I wrote this shell script to archive users' new tweets; I run it as a cron job every 5 minutes:
Also available at https://paste.debian.net/hidden/601bfe30/
Code:
#!/bin/bash
# Lock file so overlapping cron runs don't step on each other
if [ -e /tmp/tweetcrawler.busy ]; then
  exit 0
fi

touch /tmp/tweetcrawler.busy

cd /home/tweetcrawler/tools/twitter/tweetcrawler

while read user; do

  if [ ! -e html/${user} ]; then
    mkdir html/${user}
  fi
  if [ ! -e html/${user}/oldtweets.txt ]; then
    touch html/${user}/oldtweets.txt
  fi

  echo -n > html/${user}/newtweets.txt

  curl -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36" -Ls "https://twitter.com/search?f=tweets&q=from%3A${user}" |\
   grep 'class="tweet-timestamp js-permalink js-nav js-tooltip"' |\
   sed "s/^  <a href=\"\(\/[0-9A-Za-z_]*\/status\/[0-9]*\).*data-time=\"\(.*\)\" data-time-ms.*/\2 https:\/\/www.twitter.com\1/" > html/${user}/newtweets.txt

  echo -n > html/${user}/tweetlist-top.html
  while read currenttweet; do
    # Skip tweets we've already seen
    grep -q "${currenttweet}" html/${user}/oldtweets.txt
    if [ ! $? -eq 0 ]; then
      echo ${currenttweet} >> html/${user}/oldtweets.txt
      # Submit the tweet URL to archive.today, then screenshot it with CutyCapt under Xvfb
      archive=$(/usr/local/bin/archiveis $(echo ${currenttweet} | cut -d " " -f 2) 2> /dev/null)
      xvfb-run -s "-screen 0 670x1080x24" cutycapt --min-width=670 --min-height=1080 --url=$(echo ${currenttweet} | cut -d " " -f 2) --out=html/${user}/$(date -d @$(echo ${currenttweet} | cut -d " " -f 1) +%Y%m%d%Hh%Mm%Ss)$(echo ${currenttweet} | sed 's/.*\//_/').png

      echo "      <li><a href=\"$(date -d @$(echo ${currenttweet} | cut -d " " -f 1) +%Y%m%d%Hh%Mm%Ss)$(echo ${currenttweet} | sed 's/.*\//_/').png\">$(date -d @$(echo ${currenttweet} | cut -d " " -f 1) +%Y%m%d%Hh%Mm%Ss)$(echo ${currenttweet} | sed 's/.*\// /')</a></li>" >> html/${user}/tweetlist-forward.html
      sort -r html/${user}/tweetlist-forward.html > html/${user}/tweetlist-backward.html

      # tweetlist.html does not exist on the very first run; the resulting error is harmless
      cp html/${user}/tweetlist.html html/${user}/tweetlist-bottom.html
      echo "      <li><a href=\"$(date -d @$(echo ${currenttweet} | cut -d " " -f 1) +%Y%m%d%Hh%Mm%Ss)$(echo ${currenttweet} | sed 's/.*\//_/').png\">$(date -d @$(echo ${currenttweet} | cut -d " " -f 1) +%Y%m%d%Hh%Mm%Ss)$(echo ${currenttweet} | sed 's/.*\// /')</a></li>" >> html/${user}/tweetlist-top.html
      if [ ! -z ${archive} ]; then
        echo "      <li><a href=\"${archive}\">$(date -d @$(echo ${currenttweet} | cut -d " " -f 1) +%Y%m%d%Hh%Mm%Ss)$(echo ${currenttweet} | sed 's/.*\// /') : external archive</a></li>" >> html/${user}/tweetlist-top.html
      fi
      cat html/${user}/tweetlist-top.html html/${user}/tweetlist-bottom.html > html/${user}/tweetlist.html
    fi
  done < html/${user}/newtweets.txt

done < users

rm /tmp/tweetcrawler.busy
It takes a screenshot of the tweet, saves it to a directory, submits the tweet to archive.today, and writes the results to an HTML file sorted newest to oldest.

Upon the first run it will give you an error saying some files don't exist, which is fine; it creates the files for you. I could have silenced the errors but didn't. This script requires a few tools that you might not already have installed, including curl, archiveis, xvfb-run, and cutycapt. Before the first run you have to set up a new user called tweetcrawler and create the directories, including /home/tweetcrawler/tools/twitter/tweetcrawler/html. To install archiveis you can run pip install --user archiveis as the tweetcrawler user if it's not included in your package manager. xvfb-run should be included in the xvfb package under Debian and Ubuntu relatives. Write the list of usernames you want to archive to /home/tweetcrawler/tools/twitter/tweetcrawler/users, one name per line.
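For the record, the one-time setup boils down to something like this (a rough sketch: package names assume Debian/Ubuntu, and the tweetcrawler.sh filename in the cron line is just a placeholder for wherever you save the script):
Code:
# Rough setup sketch; adjust package names for your distro
apt-get install curl xvfb cutycapt python-pip
useradd -m tweetcrawler

# Then, as the tweetcrawler user:
pip install --user archiveis
mkdir -p /home/tweetcrawler/tools/twitter/tweetcrawler/html
touch /home/tweetcrawler/tools/twitter/tweetcrawler/users   # one username per line

# crontab -e entry for the 5-minute schedule; "tweetcrawler.sh" is a placeholder name
# */5 * * * * /home/tweetcrawler/tools/twitter/tweetcrawler/tweetcrawler.sh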

@3119967d0c modified the script to use archivenow, which supports additional archive sites like the Wayback Machine, and to run archive tasks through Tor using torsocks with the IsolatePID option turned on. The modified script is here:
Also available at https://0bin.net/paste/H3+lEnXCX6OchH01#9fk9qmjRN0PtZy+Gjsye7JtSo-OGaI+CjyYLLUHr+zW
Code:
#!/bin/bash
if [ -e /tmp/tweetcrawler.busy ]; then
  exit 0
fi

touch /tmp/tweetcrawler.busy

cd /home/tweetcrawler/tools/twitter/tweetcrawler

while read user; do

  if [ ! -e html/${user} ]; then
    mkdir html/${user}
  fi
  if [ ! -e html/${user}/oldtweets.txt ]; then
    touch html/${user}/oldtweets.txt
  fi
 
  echo -n > html/${user}/newtweets.txt
 
  useragent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.12 Safari/537.36"
 
  torsocks curl -A "$useragent" -Ls "https://twitter.com/search?f=tweets&q=from%3A${user}" |\
  grep 'class="tweet-timestamp js-permalink js-nav js-tooltip"' |\
  sed "s/^  <a href=\"\(\/[0-9A-Za-z_]*\/status\/[0-9]*\).*data-time=\"\(.*\)\" data-time-ms.*/\2 https:\/\/www.twitter.com\1/" > html/${user}/newtweets.txt
 
  echo -n > html/${user}/tweetlist-top.html
 
  while read currenttweet; do
    grep -q "${currenttweet}" html/${user}/oldtweets.txt
    if [ ! $? -eq 0 ]; then
      echo ${currenttweet} >> html/${user}/oldtweets.txt
     
      # Push the tweet URL to the Wayback Machine (--ia) and archive.today (--is) via archivenow, over Tor
      iarchive=$(torsocks /home/tweetcrawler/.local/bin/archivenow --ia $(echo ${currenttweet} | cut -d " " -f 2))
      archiveis=$(torsocks /home/tweetcrawler/.local/bin/archivenow --is $(echo ${currenttweet} | cut -d " " -f 2))
     
      currenttweetid=$(echo ${currenttweet} | sed 's/.*\//_/')
      dtstring=$(date -d @$(echo ${currenttweet} | cut -d " " -f 1) +%Y%m%d%Hh%Mm%Ss)
     
      # Screenshot the tweet with wkhtmltoimage (through Tor), giving JavaScript time to render
      xvfb-run -s "-screen 0 670x1080x24" torsocks wkhtmltoimage --width 670 --height 1080 --javascript-delay 1750 --custom-header "User-Agent" "${useragent}" $(echo ${currenttweet} | cut -d " " -f 2) html/${user}/${dtstring}${currenttweetid}_u.png >> /dev/null
     
      # Optimize the screenshot with pngcrush, then remove the unoptimized original
      pngcrush -q -blacken -reduce html/${user}/${dtstring}${currenttweetid}_u.png html/${user}/${dtstring}${currenttweetid}.png

      rm html/${user}/${dtstring}${currenttweetid}_u.png
     
      echo "      <li><a href=\"$(date -d @$(echo ${currenttweet} | cut -d " " -f 1) +%Y%m%d%Hh%Mm%Ss)$(echo ${currenttweet} | sed 's/.*\//_/').png\">$(date -d @$(echo ${currenttweet} | cut -d " " -f 1) +%Y%m%d%Hh%Mm%Ss)$(echo ${currenttweet} | sed 's/.*\// /')</a></li>" >> html/${user}/tweetlist-forward.html
      sort -r html/${user}/tweetlist-forward.html > html/${user}/tweetlist-backward.html
     
      cp html/${user}/tweetlist.html html/${user}/tweetlist-bottom.html
      echo "      <li><a href=\"$(date -d @$(echo ${currenttweet} | cut -d " " -f 1) +%Y%m%d%Hh%Mm%Ss)$(echo ${currenttweet} | sed 's/.*\//_/').png\">$(date -d @$(echo ${currenttweet} | cut -d " " -f 1) +%Y%m%d%Hh%Mm%Ss)$(echo ${currenttweet} | sed 's/.*\// /')</a></li>" >> html/${user}/tweetlist-top.html
     
      if [ ! -z ${archiveis} ]; then
        echo "      <li><a href=\"${archiveis}\">$(date -d @$(echo ${currenttweet} | cut -d " " -f 1) +%Y%m%d%Hh%Mm%Ss)$(echo ${currenttweet} | sed 's/.*\// /') : Archive.is</a></li>" >> html/${user}/tweetlist-top.html
      fi
      if [ ! -z ${iarchive} ]; then
        echo "      <li><a href=\"${iarchive}\">$(date -d @$(echo ${currenttweet} | cut -d " " -f 1) +%Y%m%d%Hh%Mm%Ss)$(echo ${currenttweet} | sed 's/.*\// /') : Wayback Machine</a></li>" >> html/${user}/tweetlist-top.html
      fi
      cat html/${user}/tweetlist-top.html html/${user}/tweetlist-bottom.html > html/${user}/tweetlist.html
    fi
  done < html/${user}/newtweets.txt

done < users

rm /tmp/tweetcrawler.busy

These scripts could probably be improved with the twint tool.
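For what it's worth, twint also has a command-line mode, so a first pass at "improved with twint" could be as simple as the line below. This is a sketch based on the twint README at the time; the flags may change between versions, and the username is just an example borrowed from later in this thread:
Code:
# Sketch only: pull a user's recent tweets to CSV with the twint CLI (flags per its README, may have changed)
twint -u TheyCallMeDSP --limit 20 --csv -o tweets.csv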
 
I would add to your list ol' Ethan Ralph. I briefly added him before realizing at least two other people had autoarchiving on him.

For what it's worth, I've been running the script pretty consistently targeting about 40-45 accounts for the last month or so. Most of these have been people who are normally not of general interest (unpleasant journalists, for example; I happened to capture a bunch of the pile-on tweets after the assault on Andy Ngo because I was already monitoring them) or are delicate flowers I don't want to expose to general scrutiny, at least until they quit Twitter.

It's been very good, but I have had the odd problem where some of the programs the script launches hang. Generally it's the other screenshot program I switched to from the one you're using, which I'm reconsidering. The virtual server I'm using is pretty low-end, so I'm assuming this is caused by timeouts or some sort of weird memory-exhaustion issue, but I'm also illiterate in shell scripting, so I wouldn't rule out the possibility that I'm doing something really dumb. So I'm having a go at a more Cadillacish solution that I hope will enable more frequent scraping of a larger number of accounts, faster.

Basically, I want to split things out as follows:
* 'Ingestion' script uses Twint modules to read the previous page's worth of tweets for a user into a database (currently SQLite). This is currently implemented as a sequential loop and is at best no faster than @SoapQueen1's very slick method with the curl and the regexp processing. One benefit of doing this with Twint is being able to easily capture text, links, location information, etc., and search that data from the DB. So far this process has been trouble-free for a week or so with the same list of accounts, and because I've done it in Python I think I should be able to parallelize it quite easily to cut down on the time that tweets go unarchived. I also have things set up to capture public Telegram channel messages; I'm planning to expand this to private Telegram chat messages (by acting as a Telegram client) and stories in RSS feeds, but the structure I have set up could fit in with monitoring other sources of lolz like Reddit users too.
* 'Archiving' script reads the tweets, Telegram messages, etc. that have been retrieved by the ingestion script but are not yet archived/saved to a screenshot, Archive.today, Wayback, local WARC archive, etc., to produce a list to loop through and archive. As it gets a successful response back, it stores the URL in the DB. The idea is that this gets run with a different flag from different cron jobs to tell it what destination it's archiving to; this way the different destinations can be archived to simultaneously without overloading any one destination (Archive.today is quite vulnerable to this if you archive a lot of very complex stuff). Will fuck out if a certain archive method fails >5 times for a particular item. Still a WIP.
** I intend to use cutycapt for screenshots, but the ideal would be to use headless Mozilla with Selenium automation. That avoids any fucking around with X servers, and should make it a lot easier to run simple scripts to do things like unhide 'sensitive' images etc. If anyone has experience with Selenium and Python I could definitely use some help on that (there's a rough no-X-server screenshot sketch after this list).
* 'Reporting' script produces static webpages for each user with the content of the tweets and all available archive links. Would figure to leverage some existing JS table widgets to make that look as sexy as possible. Haven't started yet, should be simple.
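Not Selenium, but for the "no X server" half of that: recent Firefox builds can screenshot a page directly in headless mode, so a stopgap along these lines works without Xvfb. The URL is a placeholder, and this can't click through "sensitive content" prompts, which is exactly where Selenium scripting would still be needed:
Code:
# Sketch: headless Firefox screenshot, no Xvfb required (Firefox 57+)
# USER/TWEETID are placeholders; width matches the 670px used in the scripts above
firefox --headless --window-size=670,1080 --screenshot /tmp/tweet.png "https://twitter.com/USER/status/TWEETID"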

I'm hoping that by moving to Python, I'll be more readily able to capture error messages and logging from the external applications that I can make some sense of. Unlikely, but you never know.

I'll post my repo once I have these pieces in place. Part of the reason I want to have a DB is to make it easier to add other features: for example, the ability to capture and hold (and optionally, archive) tweets back to the beginning of time for a user. Also, leveraging the collected DB data and Twint, the ability to compare what is currently being returned from a search of a user's timeline vs. what has been captured in the past, thus enabling easy identification of spicy deleted tweets. But that is for the future!
 
I disabled local screenshots on my machine. I will probably also disable the local HTML file generation because honestly it's not that useful (especially without the local screenshots).

What I'd like to do with it eventually:
  1. Twint integration
  2. ArchiveNow integration
  3. Tor integration
  4. Packaging and Instructions
  5. Windows Support (not sure how to go about that)
  6. Additional Data
    • Media / Quoted Tweets
    • Replies (keep an up-to-date list)
    • Date Deleted (if deleted)
  7. Display (probably a website)
I'm going to start moving it over to Python 3 since that's probably the best direction to take it.

It looks like you can't use pip3 to install the twint module at the moment but this works:
Code:
apt-get install git python3-pip
pip3 install --user --upgrade -e git+https://github.com/twintproject/twint.git@origin/master#egg=twint
 
The script regexps are currently broken, probably due to the Twitter redesign rollout. I don't have the time to fix it right now.
 
Alright, I got distracted by some other non-Twitter-related archiving projects. But I've patched together enough from my new project to get collections going again for the Twitter users I'm monitoring.

This is up at:
https://git.kiwifarms.net/naqqer/slurper

It's pretty straightforward. At this point I haven't had time to add proper logging or error handling, or to figure out how to handle concurrent DB connections, so what's there so far is:
* an SQLite db structure defined in db_setup.sql, to be deployed to a new SQLite3 db under data/collections.db. New Twitter users to be monitored can be added with "insert into twitter_users (twitter_username) VALUES ('TheyCallMeDSP');" (see the sketch after this list)
* ongoing_spider.py and archive_collected.py scripts, which collect tweets from these users into the DB and archive them to the Internet Archive and archive.is respectively.
* the runrepeatedly.sh shell script, which runs those two scripts sequentially forever to ensure everything gets collected and pushed to archives.
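For anyone following along at home, getting a fresh checkout to the point where it's collecting looks roughly like this. A sketch only: the sqlite3 invocations are my own phrasing, but the file and table names come straight from the description above:
Code:
# Sketch: deploy the schema, add a user to monitor, then let the collect/archive loop run
git clone https://git.kiwifarms.net/naqqer/slurper
cd slurper
mkdir -p data   # in case the directory isn't in the repo
sqlite3 data/collections.db < db_setup.sql
sqlite3 data/collections.db "insert into twitter_users (twitter_username) VALUES ('TheyCallMeDSP');"
bash runrepeatedly.sh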

Basically nothing other than the above is implemented properly, the database schema and scripts are going to need a lot of rework, and even what is working is a bit fucky, but it should work for collecting tweets and archiving them.
 
I can't run that on the systems I have because none of them have Python 3.6, so I won't be able to implement it for a few weeks.
 
I get a "This page not found" HTML page when I run curl using the following command:
Bash:
curl -A "[user-agent]" "https://twitter.com/search?f=live&q=(from%3Atheycallmedsp)" > dsp.html
Pasting that link into the browser on any computer works perfectly, and the user-agent is the one I got from searching "what is my user agent" on DuckDuckGo.
 