A question I have: how do archive.today and the Wayback Machine get past DDoS protection like Cloudflare or KiwiFlare, while Ghostarchive and Megalodon have a harder time? What's the trick Wayback and archive.today use to get around this? Does anyone know exactly?
Wayback has always had some advantages. I have no doubt that they have plenty of IP space. And everyone loves them. I wouldn't be surprised if Cloudflare, let alone individual site owners, just whitelists Wayback crawlers by default (Cloudflare even has a direct partnership with the Internet Archive/Wayback). I'm sure Megalodon, like WebCitation back in the day, would be in the same position, if anyone outside Japan knew who the fuck they were.
The heroes behind Archive.Today will have to work harder. I mean, I don't like to stereotype Russians, but surely there's some equivalent of six degrees of Kevin Bacon where every Russian is three degrees from someone running a residential proxy network with thousands of IP addresses around the world. As for how they almost always bypass paywalls, much like Google's crawler does by identifying itself as Googlebot? Well, one could speculate as to what user agent they use... it probably isn't 'ARCHIVE.IS ARCHIVE CRAWLER'.
@clipartfan92 I don't know how much control you have over it, but can you make it so that the Ghostarchive bookmarklet you made back in February in this post opens up to the captcha that sometimes appears when archiving with Ghostarchive? I ask because one of the biggest problems I face with that bookmarklet is that any attempt at archiving with it just gets stuck in a loop. I've compared archiving a site from the bookmarklet to entering the link manually through the site itself, and things that took literal seconds to archive through the site just got stuck in limbo with the bookmarklet. I suspect the main reason for this behavior is the captcha itself; I really don't think there's any other reason for it not working.
The Archive.is bookmarklet already opens up to its Cloudflare captcha, and it doesn't get in the way of archiving enough to really complain about, so I wouldn't mind having to go through Ghostarchive's captcha. The only gripe I have is that one pass through the captcha doesn't carry over to new archives, meaning you have to go through it again and again, but this happens sparingly.
archive.today is Cloudflared? I always get slapped with a Google reCAPTCHA when using a proxy. Another question: does anyone know if those scripts to archive multiple links on archive.today still work?
I think that happens because of the jankiness of the ghostarchive site. It may seem like they archive faster using the webpage but I think it's all luck of the draw with ghostarchive's site load, connectivity, and the remote site that's being archived. I've been having those same problems for a while now and I'll either get a loop or a familiar 502 Bad Gateway error page from Cloudflare. I just visited with Mullvad Browser and didn't get any captcha (using a VPN). I attempted to archive three urls. The first one got stuck in the loop. The second archived within 30 seconds with zero issues. The third got stuck in the loop and eventually went to a 502 error. If I refreshed the page, it would show it back in the loop again.
Pure speculation, but the captcha coming up might have something to do with the IP address/VPN endpoint you're using. I've got them in the past and other times I haven't.
I rarely, if ever, use the bookmarklet or actual webpage and mostly send the archive requests via curl and scripting. It doesn't make the process work any better but it saves me time. Here's a curl command to get the archive to start the process and display the link in a terminal (replace the $1 variable with the url):
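A minimal sketch of what such a curl submission can look like. The `archive.ph/submit/` endpoint and the Refresh/Location reply headers are assumptions based on how the site behaves in a browser, not a documented API, so treat this as a starting point:

```shell
#!/bin/sh
# Sketch: submit a url to archive.today and print the new snapshot link.
# Endpoint and header behavior are assumptions, not a documented API.

# pull the first archive.* link out of the response headers
extract_link() {
  grep -oE 'https?://archive\.[a-z]+/[^;[:space:]]+' | head -n 1
}

# only fire the request if a url was actually passed in
if [ -n "$1" ]; then
  curl -si --data-urlencode "url=$1" "https://archive.ph/submit/" | extract_link
fi
```

The site usually answers with either a `Refresh:` or `Location:` header pointing at the in-progress snapshot, which is what the grep pulls out.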
I think they just misspoke on the Cloudflare part. I've had no luck with those scripts in a couple of years, and the constant cookie hassles weren't worth it when I looked into it. There may be a modern solution that uses Selenium and one of those captcha solver plugins, but that's beyond my pay grade.
If I've got a list of urls for archive.today, I run them through a script that takes the text list of urls, prefaces each with the archive url, creates an html file, then opens that in a browser. Once you solve the first captcha, you can click a bunch of them before you get hit with another captcha. This code is atrocious and I haven't touched it in a few years, much less cleaned it up, but it works absolutely fine. Plz no bully, I'm just an idiot that plays on computers as a hobby and sar'd this up from bits and pieces from here and there. If anyone could rewrite this properly in bash or python, please do the needful, thank you sar!
Bash:
#!/bin/bash
#
# archive.today link script
# usage: ./script.sh urls.txt
#
if [[ -z "$1" ]]; then
    echo "Error: No filename provided"
    exit 1
fi
if [[ ! -f "$1" ]]; then
    echo "Error: File does not exist"
    exit 1
fi
if [[ ! -s "$1" ]]; then
    echo "Error: File is empty"
    exit 1
fi
cwd=$(pwd)
shortname=$(echo "$1" | sed 's/\.txt$//')
# build one archive.today run link per url in the input file
urls=$(while read -r line; do
    printf '    <dt><a href="https://archive.today/?run=1&url=%s">%s</a>\n' "$line" "$line"
done <"$1")
cat >"$shortname".html <<CREATEHTML
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<title>$shortname</title>
<style>
body {
    background-color: #000000;
    color: #ffffff;
}
</style>
</head>
<body>
<b>$shortname</b><br>
<br>
Archived urls:<br>
<dl>
$urls
</dl>
</body>
</html>
CREATEHTML
echo 'file://'"$cwd"'/'"$shortname"'.html'
brave-browser file://"$cwd"/"$shortname".html &>/dev/null &
#flatpak run com.brave.Browser file://"$cwd"/"$shortname".html 2>/dev/null &
If you want to archive a list of urls to archive.org, you can use this script (there's also a tor branch if that suits your needs). Depending on how much you're archiving, you're probably going to hit some sort of daily IP submission limit, so using something like proxychains to rotate around VPN endpoints is probably needed.
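Not the script linked above, but a minimal sketch of the same idea, assuming the Wayback Machine's public Save Page Now behavior (a plain GET of `web.archive.org/save/<url>` triggers a capture). The 15-second sleep is a guess at a polite rate, not a documented limit:

```shell
#!/bin/sh
# Sketch: read urls from a text file and submit each one to the
# Wayback Machine's Save Page Now endpoint.

# build the Save Page Now request url for a given page
spn_url() {
  printf 'https://web.archive.org/save/%s' "$1"
}

# loop over the input file, printing the HTTP status for each submission
if [ -f "$1" ]; then
  while read -r url; do
    echo "saving: $url"
    curl -s -o /dev/null -w '%{http_code}\n' "$(spn_url "$url")"
    sleep 15
  done <"$1"
fi
```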
That's way more than I wanted to type, hopefully someone gets some use out of it.
Is there an alternative to yt-dlp? I'm a bit code-illiterate, and trying to figure out how to get this to work is killing me. My original plan was to use a yt-to-mp4 converter, but I'm going to assume an omni-tool such as yt-dlp is the way to go for archival purposes.
I recommended this site to another user: SSVid.net. It's able to rip Twitter and YouTube videos (for the time being before they change their API again) along with a few other sites.
The interface is... busy, but pretty much anything you can do with yt-dlp you can do with JDownloader.
The only real advantage yt-dlp has is that it can automatically extract browser cookies for, say, YouTube, based on a command-line flag (which you could add as a default to the yt-dlp config file if you liked). With JDownloader, I believe you have to use a browser extension to manually export these and then import them into JDownloader. On the other hand: a) you probably shouldn't be using cookies from your actual main browser or a Google account you actually use to archive from YouTube, in case they detect this and decide to lock you out of your account; b) you should only have to do this once, so who really cares.
The main actual reasons I don't use it are that I like using busted old laptops with tiny amounts of RAM, and JDownloader will use more than yt-dlp (though it's not that bad really); and that I'm not aware of JDownloader having any good options to download SponsorBlock data, or to integrate that plus subtitles and chapters directly into a single MP4 video, at least without writing some batch/shell scripts. But that only really matters for my purpose, which is downloading videos to watch on my phone later; it doesn't matter for archive purposes, just download things like subtitles as separate files.
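For reference, the yt-dlp flags being discussed look roughly like this (flag names as of recent yt-dlp releases; check `yt-dlp --help` for your version, and the browser name and url are just placeholders):

```shell
# Pull cookies straight from a browser profile, tag SponsorBlock segments
# as chapters, and mux subtitles/chapters into a single mp4.
yt-dlp --cookies-from-browser firefox \
       --sponsorblock-mark all \
       --embed-subs --embed-chapters \
       --merge-output-format mp4 \
       "https://www.youtube.com/watch?v=VIDEO_ID"
```

Any of these can go in the yt-dlp config file (one flag per line) so you don't have to retype them.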
I run Windows 10 and downloaded the first .exe for Windows I saw:
Now it says Win8+, but I really don't know if that means Windows 8 as a standalone OS or Windows 8 and beyond. When I open up the executable, anything I type makes it shut down completely. I tried to copy-paste some of the third-party package manager installs such as Scoop and Chocolatey, but those just shut down the executable as well.
It sort of sounds like you're just trying to double-click "yt-dlp.exe". What you need to do is open a command line (cmd or PowerShell) and run it that way. For instance, in cmd with yt-dlp in Downloads:
Code:
Win+R, type cmd, press Enter (opens Command Prompt)
C:\Users\dave> cd Downloads
C:\Users\dave\Downloads> yt-dlp "https://youtube.com/some/video/goes_here"
As another user mentioned, you can also download a GUI for it.
Did you install ffmpeg? When it says Win8+, it means Windows 8 and all later versions. Download the .exe file, go to the folder where yt-dlp.exe is, and open a command prompt in that same directory.
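For what it's worth, dropping ffmpeg.exe into the same folder as yt-dlp.exe is usually enough (yt-dlp checks its own directory and your PATH), and there's also a flag to point at it explicitly. The path below is just an example:

```shell
# Tell yt-dlp exactly where the ffmpeg binaries live instead of
# relying on PATH or the yt-dlp.exe folder.
yt-dlp --ffmpeg-location "C:\ffmpeg\bin" "https://www.youtube.com/watch?v=VIDEO_ID"
```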
I have tried to the best of my abilities to do this the way that you and DavidS877 stated earlier. With the folder in Downloads, and trying to run it with the commands both of you gave me, I keep getting this error:
'ytp.exe' is not recognized as an internal or external command,
operable program or batch file.
Edit: My chimp brain got it to work. Now do I throw FFmpeg into the folder with yt-dlp, or do I add it a different way? I was prompted about this as I got it running:
I've discovered something quite great: Yandex saves web caches, which you can rearchive on archive.today. That's good and all, but another great thing I found is that they also save caches of Wayback Machine pages, meaning snapshots of excluded domains like soyjak.party can still be viewed as they were archived, thanks to Yandex. Example. This means sites that become excluded aren't actually gone entirely; the only downside is that browsing these archives is really inconvenient.