Install these requirements with your usual package manager:
- zsh - the script itself isn't bash compatible, sorry
- sqlite3
- jq
- curl (if you want to download the pictures and videos)
- python3
For example:
brew install zsh sqlite3 jq curl python3
And install these python packages either with pipx (recommended) or pip:
(If you're on a Mac, check out the datasette website for a standalone application. You won't have to install anything below that starts with "datasette", but you will have to download the plugins separately - see the website for details.)
- snscrape
- sqlite-utils (you may be able to install this with your usual package manager instead)
- datasette (ditto)
For example:
Code:
pipx install snscrape
pipx install sqlite-utils
pipx install datasette
As well as these datasette plugins so they're accessible to datasette:
- datasette-cluster-map
- datasette-media
- datasette-render-html
- datasette-render-markdown
For example:
Code:
pipx inject datasette datasette-cluster-map datasette-media datasette-render-html datasette-render-markdown
Download the tonydb.sh.txt file, rename it to tonydb.sh, and make it executable with chmod u+x tonydb.sh.
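For example, assuming you've saved the file into the directory you're working in:
Code:
mv tonydb.sh.txt tonydb.sh
chmod u+x tonydb.sh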
Create text files for your target's social media accounts. So for Tony:
Code:
==> instagram-accounts.txt <==
erininthemorning
==> reddit-accounts.txt <==
_supernovasky_
erininthemorning
==> twitter-accounts.txt <==
erininthemorn
erininthenight
realitybias
benchmarkpol
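If you'd rather create those files from the command line, something like this does the same thing (swap in your own target's account names, obviously):
Code:
printf '%s\n' erininthemorning > instagram-accounts.txt
printf '%s\n' _supernovasky_ erininthemorning > reddit-accounts.txt
printf '%s\n' erininthemorn erininthenight realitybias benchmarkpol > twitter-accounts.txt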
Then run it with zsh tonydb.sh scrape, which will download all of the person's posts. This can take some time, especially for Twitter - Tony's main account took just under 10 minutes.
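Under the hood it's snscrape doing the work, so if you want to sanity-check snscrape on a single account by hand (or just see what the raw JSON looks like), something like this works - the output filename here is just an example:
Code:
snscrape --jsonl twitter-user erininthemorn > twitter-erininthemorn.jsonl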
If you want a local copy of all of the photos and videos Tony's posted, run zsh tonydb.sh media. It'll come to a couple of gigs total.
If you already have a bunch of .jsonl files from snscrape, stick them in a directory, set the json_dir environment variable to the directory path, and run zsh tonydb.sh load. This will skip the scraping step and just create the database from the existing files. Inserting the Twitter data will still take some time because it needs a lot of reshaping to be useful.
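For example (the directory path here is just a placeholder):
Code:
json_dir=~/snscrape-output zsh tonydb.sh load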
Once that's done, you can start up datasette with the following command:
Code:
datasette erin-reed.db -m metadata.json --static assets:assets
It'll tell you something like:
Code:
INFO: Uvicorn running on http://127.0.0.1:8041 (Press CTRL+C to quit)
Visit e.g. http://127.0.0.1:8041 in your web browser and search away!
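If you'd rather poke at the database from the command line, sqlite-utils can do that too - for instance, listing the tables and their row counts (the table names will depend on what the script created):
Code:
sqlite-utils tables erin-reed.db --counts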
When you want to update the database, just run zsh tonydb.sh scrape again. snscrape pulls down the entire account each time, which is why it's slow, but the database is set up to only add in new posts.
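If you want the updates to happen on a schedule, a crontab entry along these lines would do it - the path and timing are just an example:
Code:
# run the scrape every night at 03:00; point this at wherever tonydb.sh and the accounts files live
0 3 * * * cd /path/to/tonydb && zsh tonydb.sh scrape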
You can add more accounts to the *-accounts.txt files as you find them, and they'll be added the next time you run scrape. You can also remove old accounts that you don't expect to be updated (e.g. Tony's old Twitter accounts and Reddit account), which will save some time - those records will stay in the database from the first run, but they won't be checked if they're not in the accounts files.
Feel free to PM me when it breaks or if you get stuck.