Archival Tools - How to archive anything.

I'm trying to archive tweets from a Twitter username, but they have 98.2K tweets. It seems like Twitter changed something in their code, because Twint isn't working anymore. Is there any other scraper I could use?
 
Just throwing the stuff I use on here because I'm probably the laziest fuck in existence, and if I can archive shit, so can you.

9Convert - Website for downloading YouTube videos. Lets you download streams as soon as they are up on a channel.
Online Video Trimmer - What it says on the tin. A super easy tool that lets you split videos up. I love it because I can choose when/where to split the video. Also has a bunch of other tools like cropping, flipping, etc.
Youtube Livestream Theater Mode - Chrome extension. Great if you're screen-recording and want to get the live chat along with the video.
OBS Studio - Don't forget good old OBS Studio in case you need to record a stream live/as you are watching it, because somebody likes to delete their streams right after they air :)
 
@Null have you gone through the thread and found tools for archiving twitter and Instagram for the OP? They're fickle and in demand. I don't know of a way to archive Twitter, and it would be very useful.

Also, for those who like to experiment technically: ffmpeg can be tweaked to trade compression efficiency against subjective measures of video quality. HEVC/x265 and VBR might offer better quality, if your playback supports them. I haven't messed around with it much, but x264 should be compatible with everything.
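
A minimal sketch of the kind of tweak meant here (filenames are placeholders; CRF 28 is a common starting point for x265, roughly comparable to CRF 23 for x264):
Code:
# Re-encode the video track to HEVC/x265 at constant quality,
# copying the audio through untouched.
ffmpeg -i input.mp4 -c:v libx265 -crf 28 -preset medium -c:a copy output.mp4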
 
@Null have you gone through the thread and found tools for archiving twitter and Instagram for the OP? They're fickle and in demand. I don't know of a way to archive Twitter, and it would be very useful.
There is tweetsave for individual tweets: you paste the link and the tweet gets saved to several archives. They also have a Firefox plugin.

For TikTok there is tikup, which uploads videos to the Internet Archive unless you turn that off with --no-upload. You get all of a user's videos with tikup --no-upload --folder name_of_your_archive_folder tiktok_username. I never upload because the content isn't worth the storage it would take up. Laughing about it here is enough.
 
There is tweetsave for individual tweets: you paste the link and the tweet gets saved to several archives. They also have a Firefox plugin.

For TikTok there is tikup, which uploads videos to the Internet Archive unless you turn that off with --no-upload. You get all of a user's videos with tikup --no-upload --folder name_of_your_archive_folder tiktok_username. I never upload because the content isn't worth the storage it would take up. Laughing about it here is enough.
I already have archival tools for these (Internet Archive and yt-dlp). I should have specified: I mean archiving whole accounts.
 
Youtube-dl has not been updated in 4 months because Susan went after them hard. Use yt-dlp now.
Just checked: it has a slightly different menu when you go to download something, but like Youtube-dl it still needs ffmpeg to merge files together properly. It's wicked fast compared to Youtube-dl as well. If you already have ffmpeg, just copy it into your new yt-dlp folder.
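
A minimal sketch of a typical invocation (the URL is a placeholder; assumes ffmpeg is on your PATH or sitting next to the yt-dlp binary):
Code:
# Download the best video and audio streams and have ffmpeg
# merge them into a single MP4.
yt-dlp -f "bestvideo+bestaudio/best" --merge-output-format mp4 "https://www.youtube.com/watch?v=VIDEO_ID"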
 
Why are only webm and mp4 supported? What about mkv? What about ts?
For what program? MKV is a container format that wraps encoded streams, and TS files are just MPEG transport-stream chunks.
 
For what program? MKV is a container format that wraps encoded streams, and TS files are just MPEG transport-stream chunks.
I'm talking about the farms. Mkv and ts are not supported as attachments, which is some niggerkike shit due to both not being uncommon.

And one more thing, it appears that HEVC is also not supported here (is it supported?), despite being up to twice as efficient as AVC. I try to, within reason, avoid shitting up my HDD with various big videos, such as VJlink stream rips (because that nigger likes to randomly yeet streams off YouTube) so I recently started transcoding many videos to HEVC to slice the file size without shitting up the quality. Now, HEVC is not even some enthusiast only format, as our terrestrial TV channels use HEVC that allows HD video at bitrates SD video was broadcast with using MPEG-2 back before we switched to DVB-T2. I also have a totally legit rip of the Karekano anime that's encoded in HEVC at about 1.1 Mbps at SD resolution. And it wasn't plagued by compression artifacts.
 
It should be trivial to turn TS files into MP4, right, if the stream is MPEG format?
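
A minimal sketch (filenames are placeholders): since TS and MP4 are just different containers around the same streams, ffmpeg can rewrap without re-encoding:
Code:
# Copy the audio/video bitstreams as-is into an MP4 container;
# no re-encode, so this finishes in seconds.
ffmpeg -i input.ts -c copy output.mp4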

HEVC support depends on hardware, and I don't think it is widely supported? It is used by TV broadcasts because free software doesn't matter to them, but it does matter for software companies.
 
@Null have you gone through the thread and found tools for archiving twitter and Instagram for the OP? They're fickle and in demand. I don't know of a way to archive Twitter, and it would be very useful.

Also, for those who like to experiment technically: ffmpeg can be tweaked to trade compression efficiency against subjective measures of video quality. HEVC/x265 and VBR might offer better quality, if your playback supports them. I haven't messed around with it much, but x264 should be compatible with everything.
You can use archive.today to archive individual posts on twitter and instagram, if I recall.
 
I tried to archive a private group FB page where the mod is calling me a 'disgusting homophobe', but all it saves is the banner and page rules, I'm guessing because the page is private.
Anyone got a workaround to help out an old boomer?

Edited to add: I have posted in A&H without archiving and evaded the promised strangling so far. You would be helping me out there. I know inmates post newspaper articles from behind a paywall, which may be the same situation here.
 
I tried to archive a private group FB page where the mod is calling me a 'disgusting homophobe', but all it saves is the banner and page rules, I'm guessing because the page is private.
Anyone got a workaround to help out an old boomer?

Edited to add: I have posted in A&H without archiving and evaded the promised strangling so far. You would be helping me out there. I know inmates post newspaper articles from behind a paywall, which may be the same situation here.
Try the browser extension "SingleFile"
 
Try the browser extension "SingleFile"
Thanks for that. It seems Safari has its own web archiving tool, which I found while Googling SingleFile.

If using Chrome on macOS:
 
I bring good news in this post: the Internet Archive Wayback Machine has a Save Page Now 2 (SPN2) public API. It is a real thing, authenticated with modified S3 keys. Hopefully it will get more people archiving.
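
A minimal sketch of using it with curl (assumes you've generated S3-style keys on your archive.org account; ACCESSKEY, SECRET, and the target url are placeholders):
Code:
# Submit a capture request; the JSON response includes a job_id
# that can be polled at https://web.archive.org/save/status/<job_id>.
curl -s -X POST "https://web.archive.org/save" \
  -H "Accept: application/json" \
  -H "Authorization: LOW ACCESSKEY:SECRET" \
  -d "url=https://example.com/page-to-save"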
 
I'm trying to archive tweets from a Twitter username, but they have 98.2K tweets. It seems like Twitter changed something in their code, because Twint isn't working anymore. Is there any other scraper I could use?
I already have archival tools for these (Internet Archive and yt-dlp). I should have specified: I mean archiving whole accounts.

Archiving Twitter Accounts

WARNING: Using these tools in this manner is likely a violation of twitter's terms of service, so don't do any of this without taking the proper precautions. If you have to ask what that encompasses, you probably shouldn't be doing this. Read through all of the info thoroughly before proceeding.

No twitter account required:
Software required:
snscrape
Wayback Machine SPN Scripts (regular) / (tor version)

1. Get the user's full list of tweet urls (torsocks optional):
torsocks snscrape twitter-user username | tee -a username-fullurls.txt
2. Run ./iacheck.sh username-fullurls.txt to generate toarchive-username.txt
Code:
#!/bin/bash
#
# IA Wayback Machine archive check
# Usage: ./iacheck.sh listofurls

echo "[$(date +"%Y-%m-%d %T")] Script Started"
echo "[$(date +"%Y-%m-%d %T")] Total urls to check: $(wc -l "$1" | awk '{ print $1 }')"
echo "[$(date +"%Y-%m-%d %T")] Output to file: toarchive-$1"

while read -r line; do
  # Print the url only when the availability API reports no existing snapshot.
  curl -4 --tcp-fastopen --compressed --tr-encoding -L -s "http://web.archive.org/wayback/available?url=${line}" | jq -r 'select(.archived_snapshots == {}) | .url' | tee -a toarchive-"$1"
  sleep 1
done <"$1"
3. Send the unarchived urls to archive.org via spn.sh or spn-tor.sh:
./spn.sh -n -q -p '4' -c '-4 --http2-prior-knowledge --compressed --tr-encoding' -d '&skip_first_archive=1' toarchive-username.txt

Notes:
The spn.sh scripts use the Wayback Machine SPN2 API to archive urls. The Tor version has also worked well in testing. You can also authenticate with an archive.org account to enable more connections; see the spn.sh GitHub page for details.

If you only want to get a user's posts since a certain date, you can use the --since DATE option:
torsocks snscrape --since 2022-01-25 twitter-user username > username-01-25.txt
  • Get a list of all tweet urls for a user
  • Check those urls for preexisting archives on web.archive.org
  • Send any unarchived tweets to web.archive.org for archiving
No twitter account required:
Software required:
snscrape

Run ./twb.sh username DATE, with DATE being the date you want to view tweets since. Change $HOME/Desktop/test/ to your preferred path.
#!/bin/bash
#
# ./twb.sh username DATE
# v002 rewrite

torsocks snscrape --since "$2" twitter-user "$1" > "$1".txt
urls=$(while read -r line; do
echo -n ' <DT><A HREF="https://archive.today/?run=1&url='
echo ''"$line"'">'"https://archive.today/?run=1&url=$line"'</A>'
done <"$1".txt)

touch "$1-$2".html
cat >"$1-$2".html <<EOF

<!DOCTYPE NETSCAPE-Bookmark-file-1>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
<TITLE>$1</TITLE>
<B>tweets from: $1</B> <a href="https://twitter.com/$1">twitter</a> <a href="https://web.archive.org/web/*/https://twitter.com/$1/*">archive.org</a> <a href="https://archive.today/https://twitter.com/$1/*">archive.today</a><br>
<B>tweets since: $2</B>
<DL><p>
$urls
</DL><p>

EOF
brave-browser --password-store=basic file://$HOME/Desktop/test/"$1-$2".html &>/dev/null &
rm "$1".txt
#firefox -new-tab file://$HOME/Desktop/test/"$1-$2".html &>/dev/null &
Notes:
Cleaned up things quite a bit since the first version and consolidated both scripts into one. This uses snscrape to get the tweet IDs, so no twitter account is required. The script pulls the list of tweet IDs, prepends the archive.md (KF wordfilters .today to .md) url, creates an HTML file of the links, then opens that file in your browser as a clickable list. Click the first link and solve the captcha; after that you can just go down the list, middle-clicking each url to open it in a new tab and archive it.
  • Get a list of tweet urls for a user
  • Preface urls with archive.today link
  • Open html page of links in web browser
  • Solve captcha, then middle click away
Twitter account required:
Software required:
twarc2
jq
snscrape
Wayback Machine SPN Scripts (regular) / (tor version)

Borrowing some text from my post about archiving to archive.today:
You need to have a twitter account. Remember that everything you search for via the API is logged somewhere and tied to the IP, email, and phone number that registered the account. Use a burner SIM and a non-personal email. With that warning out of the way, visit https://developer.twitter.com/ and it will grant you instant API access. Create a project and save all the secret keys locally (you do use a password manager, right?).

Install twarc. Run twarc2 configure and enter the bearer token from the twitter dev site.

1. Get the full list of tweet urls:
torsocks snscrape twitter-user username | tee -a username-fullurls.txt

2. Create a file of the tweet ids:
awk -F/ '{print $6}' username-fullurls.txt > username-tweetids.txt

3. Use twarc2 to "hydrate" the tweets:
twarc2 hydrate username-tweetids.txt username-hydrated.jsonl

4. Drop the identifying twarc info from the jsonl (optional, for privacy):
jq -c 'del(.__twarc)' username-hydrated.jsonl > "tmp" && mv -f "tmp" username-hydrated.jsonl

5. Convert the jsonl into a csv file:
twarc2 csv username-hydrated.jsonl username-csv.csv

You now have a .jsonl of the data directly from twitter and also a .csv to make it easier to work with. On to the archiving of the unarchived tweets at archive.org.
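
For a quick sanity check of the .jsonl (a sketch, assuming twarc2's usual layout of one API response per line, each with a data array of tweets):
Code:
# Print the text of the first few hydrated tweets.
jq -r '.data[]?.text' username-hydrated.jsonl | head -n 5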

6. (optional) Run ./iacheck.sh username-fullurls.txt to generate toarchive-username.txt:
Code:
#!/bin/bash
#
# IA Wayback Machine archive check
# Usage: ./iacheck.sh listofurls

echo "[$(date +"%Y-%m-%d %T")] Script Started"
echo "[$(date +"%Y-%m-%d %T")] Total urls to check: $(wc -l "$1" | awk '{ print $1 }')"
echo "[$(date +"%Y-%m-%d %T")] Output to file: toarchive-$1"

while read -r line; do
  # Print the url only when the availability API reports no existing snapshot.
  curl -4 --tcp-fastopen --compressed --tr-encoding -L -s "http://web.archive.org/wayback/available?url=${line}" | jq -r 'select(.archived_snapshots == {}) | .url' | tee -a toarchive-"$1"
  sleep 1
done <"$1"

7. (optional) Send the unarchived urls to archive.org via spn.sh or spn-tor.sh:
./spn.sh -n -q -p '4' -c '-4 --http2-prior-knowledge --compressed --tr-encoding' -d '&skip_first_archive=1' toarchive-username.txt

Notes:
The spn.sh scripts use the Wayback Machine SPN2 API to archive urls. The Tor version has also worked well in testing. You can also authenticate with an archive.org account to enable more connections; see the spn.sh GitHub page for details.

Theoretically, snscrape can pull the .jsonl data from twitter, allowing you to bypass steps 1-5 above and skip using a twitter account. You can use the following command to do that (torsocks optional):
torsocks snscrape --jsonl twitter-user username | tee -a username-raw_tweets.jsonl

I personally choose the twarc method because it is considered the standard in academia/data science when it comes to creating/sharing/manipulating twitter datasets. If I'm taking the time to archive something, I'd prefer to get it right the first time, especially content as easily removed as tweets. I'm not sure whether the jsonl that snscrape outputs is as close to the pure API output as what twarc saves. If someone wants to test/compare the results, please share your findings.

WARNING: twitter makes a big deal about not sharing anything other than dehydrated tweet ids, so if you're sharing the jsonl files, csv files derived from them, etc., then share them at your own risk. I'm not sure how much of a stink they actually make about it beyond dropping API access and banning your account, but there are warnings all over everything I've been reading while researching this, so you may want to be discreet.
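
If you do need to share a dataset, a minimal sketch of producing the shareable id list (assuming twarc2's dehydrate subcommand, which strips a jsonl down to tweet ids):
Code:
# Reduce the hydrated jsonl to a plain list of tweet ids; recipients
# can re-hydrate the ids through the API themselves.
twarc2 dehydrate username-hydrated.jsonl username-ids.txt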
  • Saving the .jsonl data of all tweets locally for further data analysis
  • "Hydrating" a list of tweet ids that you downloaded somewhere
  • (possible no-account workaround using snscrape)
Notes:
I've cleaned up the jq line in the iacheck script so it simply selects the urls with no existing snapshot. Hopefully I've made things easy to follow and everyone will have success in their archiving endeavors. Any mistakes, tips, tricks, improvements or suggestions? Let me know.
2022-01-26 added twb script for browser archiving with .today
 