Some lolcows like to post dumb stuff to Twitter and then delete it. I wrote this shell script to archive users' new tweets, which I run as a cron job every 5 minutes.
It takes a screenshot of each new tweet, saves it to a directory, submits the tweet to archive.today, and writes the results to an HTML file sorted newest to oldest.
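The every-five-minutes schedule is a single crontab entry for the tweetcrawler user. (The script's filename here is an assumption; the post doesn't name it. The script does its own cd, so no working-directory setup is needed in cron.)

```crontab
*/5 * * * * /home/tweetcrawler/tools/twitter/tweetcrawler/tweetcrawler.sh
```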
Also available at https://paste.debian.net/hidden/601bfe30/
Code:
#!/bin/bash
# Simple lock so overlapping cron runs exit immediately.
if [ -e /tmp/tweetcrawler.busy ]; then
    exit 0
fi
touch /tmp/tweetcrawler.busy
cd /home/tweetcrawler/tools/twitter/tweetcrawler
while read -r user; do
    if [ ! -e "html/${user}" ]; then
        mkdir "html/${user}"
    fi
    if [ ! -e "html/${user}/oldtweets.txt" ]; then
        touch "html/${user}/oldtweets.txt"
    fi
    echo -n > "html/${user}/newtweets.txt"
    # Scrape the user's recent tweets from the search page and reduce
    # each to "<unix-timestamp> <tweet-url>", one per line.
    curl -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36" -Ls "https://twitter.com/search?f=tweets&q=from%3A${user}" |\
    grep 'class="tweet-timestamp js-permalink js-nav js-tooltip"' |\
    sed "s/^ <a href=\"\(\/[0-9A-Za-z_]*\/status\/[0-9]*\).*data-time=\"\(.*\)\" data-time-ms.*/\2 https:\/\/www.twitter.com\1/" > "html/${user}/newtweets.txt"
    echo -n > "html/${user}/tweetlist-top.html"
    while read -r currenttweet; do
        # Only handle tweets we haven't seen before (-F: match literally).
        if ! grep -qF "${currenttweet}" "html/${user}/oldtweets.txt"; then
            echo "${currenttweet}" >> "html/${user}/oldtweets.txt"
            # Submit the tweet URL to archive.today.
            archive=$(/usr/local/bin/archiveis $(echo ${currenttweet} | cut -d " " -f 2) 2> /dev/null)
            # Screenshot the tweet page into <datetime>_<statusid>.png.
            xvfb-run -s "-screen 0 670x1080x24" cutycapt --min-width=670 --min-height=1080 --url=$(echo ${currenttweet} | cut -d " " -f 2) --out="html/${user}/$(date -d @$(echo ${currenttweet} | cut -d " " -f 1) +%Y%m%d%Hh%Mm%Ss)$(echo ${currenttweet} | sed 's/.*\//_/').png"
            echo " <li><a href=\"$(date -d @$(echo ${currenttweet} | cut -d " " -f 1) +%Y%m%d%Hh%Mm%Ss)$(echo ${currenttweet} | sed 's/.*\//_/').png\">$(date -d @$(echo ${currenttweet} | cut -d " " -f 1) +%Y%m%d%Hh%Mm%Ss)$(echo ${currenttweet} | sed 's/.*\// /')</a></li>" >> "html/${user}/tweetlist-forward.html"
            sort -r "html/${user}/tweetlist-forward.html" > "html/${user}/tweetlist-backward.html"
            cp "html/${user}/tweetlist.html" "html/${user}/tweetlist-bottom.html"
            echo " <li><a href=\"$(date -d @$(echo ${currenttweet} | cut -d " " -f 1) +%Y%m%d%Hh%Mm%Ss)$(echo ${currenttweet} | sed 's/.*\//_/').png\">$(date -d @$(echo ${currenttweet} | cut -d " " -f 1) +%Y%m%d%Hh%Mm%Ss)$(echo ${currenttweet} | sed 's/.*\// /')</a></li>" >> "html/${user}/tweetlist-top.html"
            if [ -n "${archive}" ]; then
                echo " <li><a href=\"${archive}\">$(date -d @$(echo ${currenttweet} | cut -d " " -f 1) +%Y%m%d%Hh%Mm%Ss)$(echo ${currenttweet} | sed 's/.*\// /') : external archive</a></li>" >> "html/${user}/tweetlist-top.html"
            fi
            cat "html/${user}/tweetlist-top.html" "html/${user}/tweetlist-bottom.html" > "html/${user}/tweetlist.html"
        fi
    done < "html/${user}/newtweets.txt"
done < users
rm /tmp/tweetcrawler.busy
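To make the pipeline concrete, here is a sketch of what the script does with one line of newtweets.txt. The username and status ID are made up, and `-u` is used so the timestamp renders the same everywhere; the script itself uses local time.

```shell
# A newtweets.txt line is "<unix-timestamp> <tweet-url>":
line="1558000000 https://www.twitter.com/someuser/status/1129000000000000000"

# Timestamp part of the screenshot filename (UTC here for determinism)
stamp=$(date -u -d @"${line%% *}" +%Y%m%d%Hh%Mm%Ss)

# The sed expression replaces everything up to the last "/" with "_",
# leaving the status ID
id=$(echo "${line}" | sed 's/.*\//_/')

# The screenshot is saved under html/<user>/ with this name:
echo "${stamp}${id}.png"   # 2019051609h46m40s_1129000000000000000.png
```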
Upon the first run it will give you an error saying some files don't exist, which is fine; it creates the files for you. I could have muted the errors but didn't. This script requires a few tools that you might not already have installed, including curl, archiveis, xvfb-run, and cutycapt. Before the first run you have to set up a new user called tweetcrawler and make directories including /home/tweetcrawler/tools/twitter/tweetcrawler/html. To install archiveis you can run pip install --user archiveis as the tweetcrawler user if it's not included in your package manager. xvfb-run may be included in the xvfb package under Debian and Ubuntu relatives. Write the list of usernames you want to archive to /home/tweetcrawler/tools/twitter/tweetcrawler/users, one name per line.

@3119967d0c modified the script to use archivenow, which supports additional archive sites like the Wayback Machine, and to run archive tasks through Tor using torsocks with the IsolatePID option turned on. The modified script follows. Also available at https://0bin.net/paste/H3+lEnXCX6OchH01#9fk9qmjRN0PtZy+Gjsye7JtSo-OGaI+CjyYLLUHr+zW
Code:
#!/bin/bash
# Simple lock so overlapping cron runs exit immediately.
if [ -e /tmp/tweetcrawler.busy ]; then
    exit 0
fi
touch /tmp/tweetcrawler.busy
cd /home/tweetcrawler/tools/twitter/tweetcrawler
while read -r user; do
    if [ ! -e "html/${user}" ]; then
        mkdir "html/${user}"
    fi
    if [ ! -e "html/${user}/oldtweets.txt" ]; then
        touch "html/${user}/oldtweets.txt"
    fi
    echo -n > "html/${user}/newtweets.txt"
    useragent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.12 Safari/537.36"
    # Scrape the user's recent tweets through Tor and reduce each to
    # "<unix-timestamp> <tweet-url>", one per line.
    torsocks curl -A "$useragent" -Ls "https://twitter.com/search?f=tweets&q=from%3A${user}" |\
    grep 'class="tweet-timestamp js-permalink js-nav js-tooltip"' |\
    sed "s/^ <a href=\"\(\/[0-9A-Za-z_]*\/status\/[0-9]*\).*data-time=\"\(.*\)\" data-time-ms.*/\2 https:\/\/www.twitter.com\1/" > "html/${user}/newtweets.txt"
    echo -n > "html/${user}/tweetlist-top.html"
    while read -r currenttweet; do
        # Only handle tweets we haven't seen before (-F: match literally).
        if ! grep -qF "${currenttweet}" "html/${user}/oldtweets.txt"; then
            echo "${currenttweet}" >> "html/${user}/oldtweets.txt"
            # Submit to the Wayback Machine and archive.today via Tor.
            iarchive=$(torsocks /home/tweetcrawler/.local/bin/archivenow --ia $(echo ${currenttweet} | cut -d " " -f 2))
            archiveis=$(torsocks /home/tweetcrawler/.local/bin/archivenow --is $(echo ${currenttweet} | cut -d " " -f 2))
            currenttweetid=$(echo ${currenttweet} | sed 's/.*\//_/')
            dtstring=$(date -d @$(echo ${currenttweet} | cut -d " " -f 1) +%Y%m%d%Hh%Mm%Ss)
            # Screenshot the tweet page, then compress the PNG.
            xvfb-run -s "-screen 0 670x1080x24" torsocks wkhtmltoimage --width 670 --height 1080 --javascript-delay 1750 --custom-header "User-Agent" "${useragent}" $(echo ${currenttweet} | cut -d " " -f 2) "html/${user}/${dtstring}${currenttweetid}_u.png" >> /dev/null
            pngcrush -q -blacken -reduce "html/${user}/${dtstring}${currenttweetid}_u.png" "html/${user}/${dtstring}${currenttweetid}.png"
            rm "html/${user}/${dtstring}${currenttweetid}_u.png"
            echo " <li><a href=\"${dtstring}${currenttweetid}.png\">${dtstring}$(echo ${currenttweet} | sed 's/.*\// /')</a></li>" >> "html/${user}/tweetlist-forward.html"
            sort -r "html/${user}/tweetlist-forward.html" > "html/${user}/tweetlist-backward.html"
            cp "html/${user}/tweetlist.html" "html/${user}/tweetlist-bottom.html"
            echo " <li><a href=\"${dtstring}${currenttweetid}.png\">${dtstring}$(echo ${currenttweet} | sed 's/.*\// /')</a></li>" >> "html/${user}/tweetlist-top.html"
            if [ -n "${archiveis}" ]; then
                echo " <li><a href=\"${archiveis}\">${dtstring}$(echo ${currenttweet} | sed 's/.*\// /') : Archive.is</a></li>" >> "html/${user}/tweetlist-top.html"
            fi
            if [ -n "${iarchive}" ]; then
                echo " <li><a href=\"${iarchive}\">${dtstring}$(echo ${currenttweet} | sed 's/.*\// /') : Wayback Machine</a></li>" >> "html/${user}/tweetlist-top.html"
            fi
            cat "html/${user}/tweetlist-top.html" "html/${user}/tweetlist-bottom.html" > "html/${user}/tweetlist.html"
        fi
    done < "html/${user}/newtweets.txt"
done < users
rm /tmp/tweetcrawler.busy
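Note that PID isolation is a torsocks configuration option, not a command-line flag. A minimal torsocks.conf might look like the fragment below; the addresses are torsocks defaults, and the file's location can vary by distro.

```conf
# /etc/tor/torsocks.conf (location may vary)
TorAddress 127.0.0.1
TorPort 9050
# Give each process its own circuit so archive requests are isolated.
IsolatePID 1
```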
These scripts could probably be improved with the twint tool.