Archival Tools - How to archive anything.

On the theme of livestream archiving, is there a tool like Open Video Downloader that can clip a long archived stream rather than downloading the whole thing? As in, you select a start timestamp and an end point, and only that section is saved. I'm talking specifically about YouTube.
Youtube-dl can do that:
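For example, with the actively maintained yt-dlp fork, something along these lines should save only the chosen section (the timestamps and VIDEO_ID are just placeholders; see yt-dlp's --download-sections option):
Code:
yt-dlp --download-sections "*01:15:00-01:30:00" --force-keyframes-at-cuts "https://www.youtube.com/watch?v=VIDEO_ID"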
 
Youtube-dl can do that:
I assume this is only in the terminal/command prompt version? I've been using the GUI-based one and it is super finicky. Might as well try this.
 
Anyone having trouble accessing archive.today? It seems like I'm getting a 400 error.

ETA: Looks like it's back up. 👍
 
Tips and Tricks for Archiving

NEVER USE WAYBACKMACHINE. THEY CAN DELETE YOUR SNAPSHOT, IT'S NOT WORTH IT!!!!

Only use it for research.


How to archive Instagram? Let's start!

Here's a fan account for Boji the dog, a dog that takes public transit in Istanbul.


To archive the account, first go to Picuki (https://www.picuki.com/) and enter the username of whoever you want to archive.

And boom, you should have a link to the profile's Picuki mirror. If you want to be quicker, you can go straight to https://www.picuki.com/profile/* (replace the asterisk with the username).
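If you want to check that the Picuki mirror for a profile actually loads before archiving it, a quick sketch (USERNAME is a placeholder):
Bash:
# prints only the HTTP status code; 200 means the Picuki page exists
curl -s -o /dev/null -w "%{http_code}\n" "https://www.picuki.com/profile/USERNAME"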

Take the URL and run it through https://archive.ph/
Voila! Boji has been archived!!!
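To find a snapshot again later, archive.today also seems to accept the target URL appended straight after its own address, e.g.:
Code:
https://archive.ph/newest/https://www.picuki.com/profile/USERNAME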


Just a nice tutorial for you all! Semper Fidelis to the nice farmer who taught me this trick.

Anyways farmers, stay archiving!
 
Is there some way to archive someone's reddit profile that isn't in the old reddit layout? I need to archive someone who has links to their other social media on their profile. They're an 18+ account too.
 
Tips and Tricks for Archiving
[…]
Take the URL and run it through https://archive.ph/
Voila! Boji has been archived!!!
Correction! Twitter is kind of retarded right now. Upload a screenshot of the tweet to the site directly and archive using Ghost Archive: https://ghostarchive.org/
 
Correction! Twitter is kind of retarded right now. Upload a screenshot of the tweet to the site directly and archive using Ghost Archive: https://ghostarchive.org/
Additional note: Musk also instituted really strict rate limits, so GhostArchive may just archive a "You have been ratelimited" page. If so, just try again; once I had to try three times to get it to archive.
 
yt-dlp can archive Twitter videos again thanks to this commit. yt-dlp by default will use the description in the filename, which can cause errors; use something like this instead:
Code:
yt-dlp --add-metadata --embed-thumbnail --restrict-filenames --output "twitter-%(uploader_id)s-%(upload_date)s-%(display_id)s.%(height)sp.%(ext)s" tweeturl
 
Earlier, I wanted to look for a tweet and archive it, but I was too lazy to log in. I used the Google Cache link for the profile since it was right there in the search results, and then clicked on the individual tweet since those aren't blocked anymore. You can do that for any profile using this format, which I've set up to open in one click for a list of Twitter accounts I have.


It's probably not good enough for tweeter deleters, but it could be updated frequently enough to be useful.
 
Any ideas on how to archive a page that is locked to members only on Archive of Our Own, aka AO3? It seems difficult unless I make a throwaway account, and even then I'm not sure if it would be archivable on archive.today. Has anyone tried something like this or found a workaround? It's a page that is locked, so only logged-in users can view it. Site link: https://archiveofourown.org/
 
Any ideas on how to archive a page that is locked to members only on Archive of Our Own, aka AO3? It seems difficult unless I make a throwaway account, and even then I'm not sure if it would be archivable on archive.today. Has anyone tried something like this or found a workaround? It's a page that is locked, so only logged-in users can view it. Site link: https://archiveofourown.org/
It can't be archived unless the guy running archive.today specifically signs in with an account (like he has for Twitter and Reddit). Your best bet is probably to just copy and paste the entire fic.
 
Use FanFictionDownloader, which will not only download it but also add in data on the URL, the time it was downloaded, the time the story was posted, the author, etc. If it's adults-only, though, you will need a throwaway account to pop into the downloader to get it.
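Assuming that means FanFicFare (the maintained successor of the old FanFictionDownLoader; just my guess), the command-line version would be roughly:
Code:
python3 -m pip install fanficfare
fanficfare -f epub "https://archiveofourown.org/works/WORK_ID"
WORK_ID is a placeholder; for adults-only or locked works you'd first set is_adult and your throwaway AO3 login in FanFicFare's personal.ini.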
 
Additional note: Musk also instituted really strict rate limits, so GhostArchive may just archive a "You have been ratelimited" page. If so, just try again; once I had to try three times to get it to archive.
I wonder if that could still work by using Nitter instead of Twitter?

Btw, when I accessed Archive.today recently, I got a captcha and still had to click a dozen times before I got access to it.
 
Earlier, I wanted to look for a tweet and archive it, but I was too lazy to log in. I used the Google Cache link for the profile since it was right there in the search results
I think this trick was killed a day after I posted about it.

Btw, when I accessed Archive.today recently, I got a captcha and still had to click a dozen times before I got access to it.
Archive.today always gives captchas to me when I use VPNs or Tor to access it. It's just something you have to deal with. Usually I'll archive the stuff or get archive links opened before I turn on a VPN or Tor.

The worst part is that you don't see the final URL so you have to deal with it or copy the text of the link you found.
 
Archive.today always gives captchas to me when I use VPNs or Tor to access it. It's just something you have to deal with. Usually I'll archive the stuff or get archive links opened before I turn on a VPN or Tor.

The worst part is that you don't see the final URL so you have to deal with it or copy the text of the link you found.
What's stranger is that I got the captcha when I wasn't on Tor and my VPN was off. Looks like I'll use GhostArchive more often.
 
For a few months now I've been including speech-recognition transcripts for audio and video I've posted into threads here. I do this through a Python script that uses the AssemblyAI API, geared to what I usually want, specifically the option for labelling speakers and adjusting the spoken language for more accurate results.

AssemblyAI is a paid service but it's cheap at $0.65 per hour of transcription. I've found the results to be very good, and the transcriptions complete quickly. Their documentation is fairly good as well.

You can upload video files for transcription, but it's much quicker to upload just the audio, which doesn't need to be very high-quality either. Here's a one-liner with ffmpeg ($1 is the input video, $2 is the output name without the extension):
Bash:
ffmpeg -i "$1" -vn -ac 1 -ab 96000 "$2.mp3"

To get up and running you'll need Python 3.10 or newer, plus the requests module installed via pip (python3 -m pip install requests). If you have exiftool installed, it'll print an estimate of the time needed for transcription.

Usage message:
Code:
usage: atranscribe.py [-h] -f FILE [-n N] [-l CODE] [-k API_KEY]

transcribe an audio file

options:
  -h, --help            show this help message and exit
  -f FILE, --file FILE  file to be transcribed
  -n N, --number-speakers N
                        number of expected speakers
  -l CODE, --language CODE
                        language to be transcribed (can include region specifier, eg en_uk)
  -k API_KEY, --key API_KEY
                        AssemblyAI API key

The script writes out three files, named after the uploaded audio/video file with a different extension: forum-ready BBCode in a .bbcode file, Markdown in a .md file, as well as a JSON file containing the full transcription response (for when something goes wrong).

You can also provide your AssemblyAI API key via the ASSEMBLY_AI_KEY environment variable.
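Putting it together, a typical run might look like this (the file names are placeholders, and atranscribe.py is the script name from the usage message above):
Bash:
# convert the video to small mono audio, then transcribe with two expected speakers
ffmpeg -i interview.mp4 -vn -ac 1 -ab 96000 interview.mp3
export ASSEMBLY_AI_KEY="your-api-key"
python3 atranscribe.py -f interview.mp3 -n 2 -l en_us
# writes interview.bbcode, interview.md and interview.json next to the input file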

Python code below; most of the types are just data wrappers to sanity-check the spelling of payload key names in the API.
Python:
"""Assembly AI transcription wrapper.

Python package requirements:
    - requests

Optional external program requirements:
    - exiftool (to display estimated transcription time)
"""

import argparse
import dataclasses
import json
import os
import subprocess
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path
from time import sleep

import requests

SUPPORTED_EXTENSIONS = {
    ".3ga",
    ".webm",
    ".8svx",
    ".mts",
    ".m2ts",
    ".ts",
    ".aac",
    ".mov",
    ".ac3",
    ".mp2",
    ".aif",
    ".mp4",
    ".m4p",
    ".m4v",
    ".aiff",
    ".mxf",
    ".alac",
    ".amr",
    ".ape",
    ".au",
    ".dss",
    ".flac",
    ".flv",
    ".m4a",
    ".m4b",
    ".m4p",
    ".m4r",
    ".mp3",
    ".mpga",
    ".ogg",
    ".oga",
    ".mogg",
    ".opus",
    ".qcp",
    ".tta",
    ".voc",
    ".wav",
    ".wma",
    ".wv",
}
# Uploaded files are limited to 2.2GB.
UPLOAD_SIZE_LIMIT_IN_BYTES = 2200 * (10**6)

ACCEPTABLE_LANGUAGES = {
    "en_au",
    "es",
    "it",
    "fr_fr",
    "nl",
    "fr_ca",
    "de",
    "en_us",
    "fr",
    "ja",
    "en_uk",
    "hi",
    "pt",
    "en",
}

BASE_URL = "https://api.assemblyai.com/v2"


@dataclass(frozen=True)
class Word:
    confidence: float
    end: int
    speaker: str
    start: int
    text: str


@dataclass(frozen=True)
class Utterance:
    confidence: float
    end: int
    speaker: str
    start: int
    text: str
    words: list[Word]


@dataclass(frozen=True)
class Args:
    file: Path
    language: str
    n_speakers: int | None
    key: str | None


def parse_arguments() -> Args:
    parser = argparse.ArgumentParser(description="transcribe an audio file")
    parser.add_argument(
        "-f",
        "--file",
        metavar="FILE",
        type=Path,
        required=True,
        help="file to be transcribed",
    )
    parser.add_argument(
        "-n",
        "--number-speakers",
        dest="n_speakers",
        metavar="N",
        type=int,
        required=False,
        help="number of expected speakers",
    )
    parser.add_argument(
        "-l",
        "--language",
        dest="language",
        metavar="CODE",
        default="en_us",
        type=str,
        required=False,
        help="language to be transcribed (can include region specifier, eg en_uk)",
    )
    parser.add_argument(
        "-k",
        "--key",
        dest="key",
        metavar="API_KEY",
        type=str,
        required=False,
        help="AssemblyAI API key",
    )
    args = parser.parse_args()

    file = args.file
    assert file.exists(), f"{file} does not exist"
    assert file.suffix in SUPPORTED_EXTENSIONS, f"Unsupported filetype: {file.suffix}"
    assert (
        file.stat().st_size < UPLOAD_SIZE_LIMIT_IN_BYTES
    ), "File is too large; upload limit is 2.2GB."

    lang = args.language
    if lang not in ACCEPTABLE_LANGUAGES:
        langs = ", ".join(sorted(ACCEPTABLE_LANGUAGES))
        msg = f"Unsupported language: {lang}\n" f"Must be one of: {langs}"
        raise RuntimeError(msg)

    return Args(
        file=file,
        language=lang,
        n_speakers=args.n_speakers,
        key=args.key,
    )


def get_duration(file: Path) -> int:
    """Get the duration of `file` to the nearest second."""
    args = ["exiftool", "-n", "-S", "-t", "-Duration"]
    raw = subprocess.check_output(
        [*args, file],
        encoding="utf-8",
    )
    return round(float(raw.strip()))


def calculate_processing_bounds(seconds: int) -> tuple[int, int]:
    """Calculate lower and upper bounds of transcription time in seconds.

    AssemblyAI states their transcription time is 15-30% of the file's length,
    so this is a (over-cautious) approximation.
    """
    upper = seconds // 3
    lower = upper // 2
    return (lower, upper)


def format_seconds(s: int) -> str:
    """Format integer seconds into XmYYs."""
    return f"{s // 60}m{s % 60:02}s"


def log(message: str) -> None:
    """Print a message prefixed with the current time."""
    now = datetime.now().strftime("%H:%M")
    print(f"{now} :: {message}")


def millis_to_timestamp(m: int) -> str:
    """Convert milliseconds to HH:MM:SS duration string."""
    total_seconds = m // 1000
    hours = total_seconds // 60 // 60
    minutes = total_seconds // 60 % 60
    seconds = total_seconds % 60
    return f"{hours:02}:{minutes:02}:{seconds:02}"


def _format_part(part: Utterance, opening: str, closing: str) -> str:
    timestamp = millis_to_timestamp(part.start)
    speaker = part.speaker
    text = part.text
    return f"[{timestamp}]\n{opening}{speaker}{closing}: {text}"


def format_part_md(part: Utterance) -> str:
    """Format part as markdown."""
    return _format_part(part, "**", "**")


def format_part_bb(part: Utterance) -> str:
    """Format part as BBCode."""
    return _format_part(part, "[B]", "[/B]")


def attempt_to_print_time_estimate(file: Path) -> None:
    try:
        duration = get_duration(file)
    except (FileNotFoundError, subprocess.CalledProcessError):
        # exiftool is not installed or could not read the file
        return
    lower_s, upper_s = calculate_processing_bounds(duration)
    lower = format_seconds(lower_s)
    upper = format_seconds(upper_s)
    log(f"Transcription expected to take between {lower} and {upper}.")


def upload_audio(file: Path, headers: dict[str, str]) -> str:
    file_bytes = file.read_bytes()
    log("Uploading audio file...")
    upload_response = requests.post(
        BASE_URL + "/upload", headers=headers, data=file_bytes
    )
    url: str = upload_response.json()["upload_url"]
    log(f"Audio uploaded to {url}")
    return url


@dataclass(frozen=True)
class SubmissionResponse:
    id: str
    url: str


@dataclass(frozen=True)
class TranscriptPayload:
    audio_url: str
    language_code: str
    speaker_labels: bool
    speakers_expected: int | None


def submit_file_for_transcription(
    payload: TranscriptPayload, headers: dict[str, str]
) -> SubmissionResponse:
    response = requests.post(
        BASE_URL + "/transcript",
        headers=headers,
        json=dataclasses.asdict(payload),
    ).json()

    if error := response.get("error"):
        raise RuntimeError(f"Error response from API: {error}")

    transcript_id = response["id"]
    transcript_url = f"{BASE_URL}/transcript/{transcript_id}"
    return SubmissionResponse(id=transcript_id, url=transcript_url)


def main(args: Args) -> None:
    # Key provided as an argument overrides the environment variable.
    key = args.key or os.getenv("ASSEMBLY_AI_KEY")
    if key is None:
        raise RuntimeError(
            "ASSEMBLY_AI_KEY environment variable must be set"
            " or --key must be given as an argument."
        )
    headers = {"authorization": key}

    audio_url = upload_audio(args.file, headers)
    payload = TranscriptPayload(
        audio_url=audio_url,
        language_code=args.language,
        speaker_labels=args.n_speakers is not None,
        speakers_expected=args.n_speakers,
    )
    transcript_response = submit_file_for_transcription(payload, headers)
    polling_endpoint = transcript_response.url

    attempt_to_print_time_estimate(args.file)
    # Print the transcript_id in case you need to look it up in the web UI.
    log(f"Transcript ID: {transcript_response.id}")
    log("Polling for completion...")

    while True:
        status_response = requests.get(polling_endpoint, headers=headers).json()
        if status_response["status"] == "completed":
            log("Completed transcription.")
            break
        elif status_response["status"] == "error":
            error = status_response["error"]
            raise RuntimeError(f"Transcript failed: {error}")
        else:
            sleep(3)

    # Write out the raw JSON response
    json_file = args.file.with_suffix(".json")
    log(f"Writing response JSON to {json_file}")
    json_file.write_text(json.dumps(status_response, indent=2))

    # Write out the transcription text
    if raw_utterances := status_response.get("utterances"):
        # Write per-speaker sections
        utterances = [Utterance(**obj) for obj in raw_utterances]
        output_md = "\n\n".join([format_part_md(u) for u in utterances])
        output_bb = (
            "[QUOTE]\n"
            + "\n\n".join([format_part_bb(u) for u in utterances])
            + "\n[/QUOTE]"
        )
    else:
        # Transcript not split up by speaker, so write the whole thing.
        output_md = status_response["text"]
        output_bb = status_response["text"]

    md_file = args.file.with_suffix(".md")
    log(f"Writing Markdown transcript to {md_file}")
    md_file.write_text(output_md)

    bb_file = args.file.with_suffix(".bbcode")
    log(f"Writing BBCode transcript to {bb_file}")
    bb_file.write_text(output_bb)


if __name__ == "__main__":
    args = parse_arguments()
    main(args)
 
For a few months now I've been including speech-recognition transcripts for audio and video I've posted into threads here. I do this through a Python script that uses the AssemblyAI API, geared to what I usually want, specifically the option for labelling speakers and adjusting the spoken language for more accurate results.
Now I'm going to feel like a dunce if I ever transcribe something manually again.
 