Archival Tools - How to archive anything.

I'm trying to archive tweets from a Twitter username, but they have 98.2K tweets. It seems like Twitter changed something in their code, because Twint isn't working anymore. Is there any other scraper I could use?
 
Just throwing the stuff I use on here because I'm probably the laziest fuck in existence, and if I can archive shit, so can you.

9Convert - Website for downloading YouTube videos. Lets you download streams as soon as they are up on a channel.
Online Video Trimmer - What it says on the tin. A super easy tool that lets you split videos up. I love it because I can choose when/where to split the video. Also has a bunch of other tools like cropping, flipping, etc.
Youtube Livestream Theater Mode - Chrome extension. Great if you're screen-recording and want to get the live chat along with the video.
OBS Studio - Don't forget good old OBS Studio in case you need to record a stream live/as you are watching it, because somebody likes to delete their streams right after they air :)
 
@Null have you gone through the thread and found tools for archiving twitter and Instagram for the OP? They're fickle and in demand. I don't know of a way to archive Twitter, and it would be very useful.

Also, for those who like to experiment technically: ffmpeg can be tweaked to trade compression efficiency against subjective measures of video quality. HEVC/x265 and VBR might offer better quality, if your playback supports them. I haven't messed around with it much, but x264 should be compatible with everything.
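
A minimal sketch of the kind of tweak meant here (filenames are placeholders; CRF 28 is a common starting point for x265, roughly comparable to CRF 23 for x264):
Code:
# Re-encode the video track to HEVC/x265 at constant quality,
# copying the audio through untouched.
ffmpeg -i input.mp4 -c:v libx265 -crf 28 -preset medium -c:a copy output.mp4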
 
@Null have you gone through the thread and found tools for archiving twitter and Instagram for the OP? They're fickle and in demand. I don't know of a way to archive Twitter, and it would be very useful.
There is tweetsave for individual tweets: you paste the link and the tweet gets saved to several archives. They also have a Firefox plugin.

For TikTok there is tikup, which uploads videos to the Internet Archive unless you turn that off with --no-upload. You get all of a user's videos with tikup --no-upload --folder name_of_your_archive_folder tiktok_username. I never upload because the content isn't worth the storage it would take up. Laughing about it here is enough.
 
There is tweetsave for individual tweets: you paste the link and the tweet gets saved to several archives. They also have a Firefox plugin.

For TikTok there is tikup, which uploads videos to the Internet Archive unless you turn that off with --no-upload. You get all of a user's videos with tikup --no-upload --folder name_of_your_archive_folder tiktok_username. I never upload because the content isn't worth the storage it would take up. Laughing about it here is enough.
I already have archival tools for these (Internet Archive and yt-dlp). I should have specified: I mean archiving whole accounts.
 
Youtube-dl has not been updated in 4 months because Susan went after them hard. Use yt-dlp now.
Just checked: it has a slightly different menu when you go to download something, but like Youtube-dl it still needs ffmpeg to merge files together properly. It's wicked fast compared to Youtube-dl as well. If you already have ffmpeg, just copy it into your new yt-dlp folder.
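
A minimal sketch of a typical invocation (the URL is a placeholder; assumes ffmpeg is on your PATH or sitting next to the yt-dlp binary):
Code:
# Download the best video and audio streams and have ffmpeg
# merge them into a single MP4.
yt-dlp -f "bestvideo+bestaudio/best" --merge-output-format mp4 "https://www.youtube.com/watch?v=VIDEO_ID"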
 
Why are only webm and mp4 supported? What about mkv? What about ts?
For what program? MKV is a container format that wraps encoded streams, and TS files are just MPEG transport-stream chunks.
 
For what program? MKV is a container format that wraps encoded streams, and TS files are just MPEG transport-stream chunks.
I'm talking about the farms. Mkv and ts are not supported as attachments, which is some niggerkike shit due to both not being uncommon.

And one more thing, it appears that HEVC is also not supported here (is it supported?), despite being up to twice as efficient as AVC. I try to, within reason, avoid shitting up my HDD with various big videos, such as VJlink stream rips (because that nigger likes to randomly yeet streams off YouTube) so I recently started transcoding many videos to HEVC to slice the file size without shitting up the quality. Now, HEVC is not even some enthusiast only format, as our terrestrial TV channels use HEVC that allows HD video at bitrates SD video was broadcast with using MPEG-2 back before we switched to DVB-T2. I also have a totally legit rip of the Karekano anime that's encoded in HEVC at about 1.1 Mbps at SD resolution. And it wasn't plagued by compression artifacts.
 
It should be trivial to turn TS files into MP4, right, if the stream is MPEG format?
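
A minimal sketch (filenames are placeholders): since TS and MP4 are just different containers around the same streams, ffmpeg can rewrap without re-encoding:
Code:
# Copy the audio/video bitstreams as-is into an MP4 container;
# no re-encode, so this finishes in seconds.
ffmpeg -i input.ts -c copy output.mp4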

HEVC support depends on hardware, and I don't think it is widely supported? It is used by TV broadcasts because free software doesn't matter to them, but it does matter for software companies.
 
@Null have you gone through the thread and found tools for archiving twitter and Instagram for the OP? They're fickle and in demand. I don't know of a way to archive Twitter, and it would be very useful.

Also, for those who like to experiment technically: ffmpeg can be tweaked to trade compression efficiency against subjective measures of video quality. HEVC/x265 and VBR might offer better quality, if your playback supports them. I haven't messed around with it much, but x264 should be compatible with everything.
You can use archive.today to archive individual posts on twitter and instagram, if I recall.
 
I tried to archive a private group FB page where the mod is calling me a 'disgusting homophobe', but all it saves is the banner and page rules, I'm guessing because the page is private.
Anyone got a workaround to help out an old boomer?

Edited to add: I have posted in A&H without archiving and evaded the promised strangling so far. You would be helping me out there. I know inmates post newspaper articles from behind a paywall, which may be the same situation here.
 
I tried to archive a private group FB page where the mod is calling me a 'disgusting homophobe', but all it saves is the banner and page rules, I'm guessing because the page is private.
Anyone got a workaround to help out an old boomer?

Edited to add: I have posted in A&H without archiving and evaded the promised strangling so far. You would be helping me out there. I know inmates post newspaper articles from behind a paywall, which may be the same situation here.
Try the browser extension "SingleFile"
 
Try the browser extension "SingleFile"
Thanks for that. It seems Safari has its own web archiving tool, which I found while Googling SingleFile.

If using Chrome on macOS:
 
I bring good news in this post: the Internet Archive Wayback Machine has a Save Page Now 2 (SPN2) public API. It is a real thing, authenticated with modified S3 keys. Hopefully it will get more people archiving.
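
A minimal sketch of using it with curl (assumes you've generated S3-style keys on your archive.org account; ACCESSKEY, SECRET, and the target url are placeholders):
Code:
# Submit a capture request; the JSON response includes a job_id
# that can be polled at https://web.archive.org/save/status/<job_id>.
curl -s -X POST "https://web.archive.org/save" \
  -H "Accept: application/json" \
  -H "Authorization: LOW ACCESSKEY:SECRET" \
  -d "url=https://example.com/page-to-save"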
 
I'm trying to archive tweets from a Twitter username, but they have 98.2K tweets. It seems like Twitter changed something in their code, because Twint isn't working anymore. Is there any other scraper I could use?
I already have archival tools for these (Internet Archive and yt-dlp). I should have specified: I mean archiving whole accounts.

Archiving Twitter Accounts

WARNING: Using these tools in this manner is likely a violation of twitter's terms of service, so don't do any of this without taking the proper precautions. If you have to ask what that encompasses, you probably shouldn't be doing this. Read through all of the info thoroughly before proceeding.

No twitter account required:
Software required:
snscrape
Wayback Machine SPN Scripts (regular) / (tor version)

1. Get the user's full list of tweet urls (torsocks optional):
torsocks snscrape twitter-user username | tee -a username-fullurls.txt
2. Run ./iacheck.sh username-fullurls.txt to generate toarchive-username.txt
Code:
#!/bin/bash
#
# IA Wayback Machine archive check
# Usage: ./iacheck.sh listofurls

echo "[$(date +"%Y-%m-%d %T")] Script Started"
echo "[$(date +"%Y-%m-%d %T")] Total urls to check: $(wc -l "$1" | awk '{ print $1 }')"
echo "[$(date +"%Y-%m-%d %T")] Output to file: toarchive-$1"

while read -r line; do
  # Print the url only when the availability API reports no existing snapshot.
  curl -4 --tcp-fastopen --compressed --tr-encoding -L -s "http://web.archive.org/wayback/available?url=${line}" | jq -r 'select(.archived_snapshots == {}) | .url' | tee -a toarchive-"$1"
  sleep 1
done <"$1"
3. Send the unarchived urls to archive.org via spn.sh or spn-tor.sh:
./spn.sh -n -q -p '4' -c '-4 --http2-prior-knowledge --compressed --tr-encoding' -d '&skip_first_archive=1' toarchive-username.txt

Notes:
The spn.sh scripts use the Wayback Machine SPN2 API to archive urls. The Tor version has also worked well in testing. You can also authenticate with an archive.org account to enable more connections; see the spn.sh GitHub page for details.

If you only want to get a user's posts since a certain date, you can use the --since DATE option:
torsocks snscrape --since 2022-01-25 twitter-user username > username-01-25.txt
  • Get a list of all tweet urls for a user
  • Check those urls for preexisting archives on web.archive.org
  • Send any unarchived tweets to web.archive.org for archiving
No twitter account required:
Software required:
snscrape

Run ./twb.sh username DATE, with DATE being the date you want to view tweets since. Change $HOME/Desktop/test/ to your preferred path.
#!/bin/bash
#
# ./twb.sh username DATE
# v002 rewrite

torsocks snscrape --since "$2" twitter-user "$1" > "$1".txt
urls=$(while read -r line; do
echo -n ' <DT><A HREF="https://archive.today/?run=1&url='
echo ''"$line"'">'"https://archive.today/?run=1&url=$line"'</A>'
done <"$1".txt)

touch "$1-$2".html
cat >"$1-$2".html <<EOF

<!DOCTYPE NETSCAPE-Bookmark-file-1>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
<TITLE>$1</TITLE>
<B>tweets from: $1</B> <a href="https://twitter.com/$1">twitter</a> <a href="https://web.archive.org/web/*/https://twitter.com/$1/*">archive.org</a> <a href="https://archive.today/https://twitter.com/$1/*">archive.today</a><br>
<B>tweets since: $2</B>
<DL><p>
$urls
</DL><p>

EOF
brave-browser --password-store=basic file://$HOME/Desktop/test/"$1-$2".html &>/dev/null &
rm "$1".txt
#firefox -new-tab file://$HOME/Desktop/test/"$1-$2".html &>/dev/null &
Notes:
Cleaned up things quite a bit since the first version and consolidated both scripts into one. This uses snscrape to get the tweet IDs, so no twitter account is required. The script pulls the list of tweet IDs, prepends the archive.md (KF wordfilters .today to .md) url, creates an HTML file of the links, then opens that file in your browser as a clickable list. Click the first link and solve the captcha; after that you can just go down the list, middle-clicking each url to open it in a new tab and archive it.
  • Get a list of tweet urls for a user
  • Preface urls with archive.today link
  • Open html page of links in web browser
  • Solve captcha, then middle click away
Twitter account required:
Software required:
twarc2
jq
snscrape
Wayback Machine SPN Scripts (regular) / (tor version)

Borrowing some text from my post about archiving to archive.today:
You need to have a twitter account. Remember that everything you search for via the API is logged somewhere and tied to the IP, email, and phone number that registered the account. Use a burner SIM and a non-personal email. With that warning out of the way, visit https://developer.twitter.com/ and it will grant you instant API access. Create a project and save all the secret keys locally (you do use a password manager, right?).

Install twarc. Run twarc2 configure and enter the bearer token from the twitter dev site.

1. Get the full list of tweet urls:
torsocks snscrape twitter-user username | tee -a username-fullurls.txt

2. Create a file of the tweet ids:
awk -F/ '{print $6}' username-fullurls.txt > username-tweetids.txt

3. Use twarc2 to "hydrate" the tweets:
twarc2 hydrate username-tweetids.txt username-hydrated.jsonl

4. Drop the identifying twarc info from the jsonl (optional, for privacy):
jq -c 'del(.__twarc)' username-hydrated.jsonl > "tmp" && mv -f "tmp" username-hydrated.jsonl

5. Convert the jsonl into a csv file:
twarc2 csv username-hydrated.jsonl username-csv.csv

You now have a .jsonl of the data directly from twitter and also a .csv to make it easier to work with. On to the archiving of the unarchived tweets at archive.org.
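
For a quick sanity check of the .jsonl (a sketch, assuming twarc2's usual layout of one API response per line, each with a data array of tweets):
Code:
# Print the text of the first few hydrated tweets.
jq -r '.data[]?.text' username-hydrated.jsonl | head -n 5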

6. (optional) Run ./iacheck.sh username-fullurls.txt to generate toarchive-username.txt:
Code:
#!/bin/bash
#
# IA Wayback Machine archive check
# Usage: ./iacheck.sh listofurls

echo "[$(date +"%Y-%m-%d %T")] Script Started"
echo "[$(date +"%Y-%m-%d %T")] Total urls to check: $(wc -l "$1" | awk '{ print $1 }')"
echo "[$(date +"%Y-%m-%d %T")] Output to file: toarchive-$1"

while read -r line; do
  # Print the url only when the availability API reports no existing snapshot.
  curl -4 --tcp-fastopen --compressed --tr-encoding -L -s "http://web.archive.org/wayback/available?url=${line}" | jq -r 'select(.archived_snapshots == {}) | .url' | tee -a toarchive-"$1"
  sleep 1
done <"$1"

7. (optional) Send the unarchived urls to archive.org via spn.sh or spn-tor.sh:
./spn.sh -n -q -p '4' -c '-4 --http2-prior-knowledge --compressed --tr-encoding' -d '&skip_first_archive=1' toarchive-username.txt

Notes:
The spn.sh scripts use the Wayback Machine SPN2 API to archive urls. The Tor version has also worked well in testing. You can also authenticate with an archive.org account to enable more connections; see the spn.sh GitHub page for details.

Theoretically, snscrape can pull the .jsonl data from twitter, allowing you to bypass steps 1-5 above and skip using a twitter account. You can use the following command to do that (torsocks optional):
torsocks snscrape --jsonl twitter-user username | tee -a username-raw_tweets.jsonl

I personally choose the twarc method because it is considered the standard in academia/data science when it comes to creating/sharing/manipulating twitter datasets. If I'm taking the time to archive something, I'd prefer to get it right the first time, especially content as easily removed as tweets. I'm not sure whether the jsonl that snscrape outputs is as close to the pure API output as what twarc saves. If someone wants to test/compare the results, please share your findings.

WARNING: twitter makes a big deal about not sharing anything other than dehydrated tweet ids, so if you're sharing the jsonl files, csv files derived from them, etc., then share them at your own risk. I'm not sure how much of a stink they actually make about it beyond dropping API access and banning your account, but there are warnings all over everything I've been reading while researching this, so you may want to be discreet.
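
If you do need to share a dataset, a minimal sketch of producing the shareable id list (assuming twarc2's dehydrate subcommand, which strips a jsonl down to tweet ids):
Code:
# Reduce the hydrated jsonl to a plain list of tweet ids; recipients
# can re-hydrate the ids through the API themselves.
twarc2 dehydrate username-hydrated.jsonl username-ids.txt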
  • Saving the .jsonl data of all tweets locally for further data analysis
  • "Hydrating" a list of tweet ids that you downloaded somewhere
  • (possible no-account workaround using snscrape)
Notes:
I've cleaned up the jq line in the iacheck script so it simply selects the urls with no existing snapshot. Hopefully I've made things easy to follow and everyone will have success in their archiving endeavors. Any mistakes, tips, tricks, improvements or suggestions? Let me know.
2022-01-26 added twb script for browser archiving with .today
 