Best way to download a website?

Aidan · Dec 30, 2021

Medulseur said:
Just trying to figure out how to install Wget is holding me up a little. But I am sure I could figure it out if given enough time.

Well that's a bash script so won't work natively on Windows. I think WSL uses bash by default so you could use that. Any linux vm would work just as well. If you're new to VMs then just use a usb to transfer to your actual computer, otherwise set up filesharing your preferred way.

I'll also recommend redoing it in your language of choice.

edit: I realize my script screws up on one of the comics due to an issue #9.5. I will have to go back and account for that. Just an FYI, I'll update the script in a new post and in the old one when I do.
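
On the Wget install question, here is a rough sketch for checking whether wget is already available inside WSL or a Linux VM. The helper name have_cmd is made up for this example; sudo apt install wget is the usual install command on Ubuntu and its derivatives.

```bash
#!/usr/bin/env bash

# have_cmd is a hypothetical helper name, not part of the download script.
# It succeeds when the named command can be found on PATH.
have_cmd() {
    command -v "$1" > /dev/null 2>&1
}

if have_cmd wget; then
    echo "wget is installed"
else
    # Usual install command on Ubuntu, including Ubuntu under WSL
    echo "wget is missing; try: sudo apt install wget"
fi
```

The same check works for any other tool the script depends on.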

Medulseur · Dec 30, 2021

Aidan said:
Well that's a bash script so won't work natively on Windows. I think WSL uses bash by default so you could use that. Any linux vm would work just as well. If you're new to VMs then just use a usb to transfer to your actual computer, otherwise set up filesharing your preferred way.

I'll also recommend redoing it in your language of choice.

edit: I realize my script screws up on one of the comics due to an issue #9.5. I will have to go back and account for that. Just an FYI, I'll update the script in a new post and in the old one when I do.

Oh no, virtual machines? This is starting to encompass all of the things I have wanted to get into for the past few years but have never got around to doing. lol

Aidan · Dec 30, 2021

Medulseur said:
Oh no, virtual machines? This is starting to encompass all of the things I have wanted to get into for the past few years but have never got around to doing. lol

It can if you want it to. That script just won't work on Windows natively is all.

Medulseur · Dec 30, 2021

Aidan said:
It can if you want it to. That script just won't work on Windows natively is all.

As complicated as it seems I do like the results better. HTTrack is easier, sure, but I hate copying an entire website just to get some jpegs.

Aidan · Dec 30, 2021

It's not really that complicated, the script is the most complicated part.
>get some generic linux distro iso
>download vmware or virtualbox
>imagine not downloading virtualbox to support open source software
>create virtual machine
>install linux on virtual machine
>copypaste script into text file
>rename text file to script.sh
>be leet hacker, type bash script.sh in terminal
>wait for download
>run again for good measure
>plug usb into computer
>virtual machine ask if you want it to go to host or vm
>click vm
>copy files to usb
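
The "copypaste script into text file" and "bash script.sh" steps above might look roughly like this in the VM's terminal. This is only a sketch: the one-line stand-in script and the scratch directory are assumptions for the example, and the real script text would go between the EOF markers instead.

```bash
#!/usr/bin/env bash

# Work in a scratch directory so nothing in your home directory is touched
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stand-in for the real script: paste the actual script text between the EOF markers
cat > script.sh <<'EOF'
#!/usr/bin/env bash
echo "script ran from $(pwd)"
EOF

# Running it through bash means the file does not need to be made executable
bash script.sh
```

Running it a second time is harmless, which is why the steps above suggest doing so for good measure.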

Medulseur · Dec 30, 2021

Aidan said:
It's not really that complicated, the script is the most complicated part.
>get some generic linux distro iso
>download vmware or virtualbox
>imagine not downloading virtualbox to support open source software
>create virtual machine
>install linux on virtual machine
>copypaste script into text file
>rename text file to script.sh
>be leet hacker, type bash script.sh in terminal
>wait for download
>run again for good measure
>plug usb into computer
>virtual machine ask if you want it to go to host or vm
>click vm
>copy files to usb

Thanks for the step by step! Going to give it a try.
When I go to download VirtualBox it says 6.0 and below support software virtualization but the newer versions don't. Does that mean I want 6.0 instead of 6.1.3?

Aidan · Dec 30, 2021

Just get 6.1.3, you're not using software virtualization.

Medulseur · Dec 30, 2021

Aidan said:
Just get 6.1.3, you're not using software virtualization.

Any particular distro I should get or should I just go with Ubuntu?

Aidan · Dec 30, 2021

Medulseur said:
Any particular distro I should get or should I just go with Ubuntu?

Doesn't matter for this but if you plan to tinker aside from this then any distro you want to explore is fine. Ubuntu and its derivatives will work well.

@Medulseur @ing you again in case you don't refresh. I found a typo in my script so let me know before you intend to download if I haven't posted the revision.

Medulseur · Dec 30, 2021

Aidan said:
Doesn't matter for this but if you plan to tinker aside from this then any distro you want to explore is fine. Ubuntu and its derivatives will work well.

@Medulseur @ing you again in case you don't refresh. I found a typo in my script so let me know before you intend to download if I haven't posted the revision.

Yeah I haven't copied it yet because I saw you mention a typo. I'll wait until you are able to fix it.

Aidan · Dec 30, 2021

Medulseur said:
Yeah I haven't copied it yet because I saw you mention a typo. I'll wait until you are able to fix it.

Haven't verified it works 100% but I think it's fine now.

Fixed typo where double-digit issues were still preceded by a 0.
Added a shitty fix to handle the 9.5 issue.

Bash:

#!/usr/bin/env bash

#root url
parent_url="https://elfquest.com/read/"

#Arrays for each comic on the webpage. Each index in each array relates to each other array
#lazy associative arrays
quests=("ELFQUEST" "SIEGE_AT_BLUE_MOUNTAIN" "KINGS_OF_THE_BROKEN_WHEEL" "WOLFRIDER" "DREAMTIME" "HIDDEN_YEARS" "SHARDS" "SEARCHER_AND_THE_SWORD" "THE_DISCOVERY")
#quests=( "WOLFRIDER" "DREAMTIME" "HIDDEN_YEARS" "SHARDS" "SEARCHER_AND_THE_SWORD" "THE_DISCOVERY")

#Number of issues for each comic
num_issues=( 21 8 9 1 1 29 16 1 1 )
#num_issues=(  1 1 29 16 1 1 )
directories=( "OQ" "SABM" "KOBW" "WR" "DTC" "HY" "SH" "SAS" "DISC" )
#directories=(  "WR" "DTC" "HY" "SH" "SAS" "DISC" )
filenames=( "oq" "sabm" "kobw" "awr" "dtc" "hy" "sh" "sas" "disc" )
#filenames=(  "awr" "dtc" "hy" "sh" "sas" "disc" )

#Lazy bools for oddball quests/comics
#Initially added to facilitate HIDDEN_YEARS which has 30 issues due to an issue #9.5
#Using an array in case this comes up again
#Using C-style 1 = true and 0 = false
#booleans=( 0 )
#bool_index=0

declare -A booleans
booleans["HY"]=0


#used for troubleshooting to help indicate arrays worked as intended
for i in ${!quests[@]}
do
    echo Quest: ${quests[$i]} Issues: ${num_issues[$i]}
done



#Go through each quest/comic and then go through each issue of the comic
for i in ${!quests[@]}
do

    echo ${quests[$i]}
    curr_issue=1

    #Go through each issue of the current comic/quest
    while [ "$curr_issue" -le "${num_issues[$i]}" ]
    do
        #Start on page 0 which appears to always be cover page
        curr_page=0
        #echo "$curr_issue "

        #If there's only one issue then there is no subdirectory
        #Subdirectories are for each issue only
        if [ ${num_issues[$i]} -eq 1 ]
        then
#            url="$parent_url${directories[$i]}-"
            url="$parent_url${directories[$i]}/${filenames[$i]}-"
        else
            #Check if the current issue is < 10 and prepend with a 0 if so
            if [ $curr_issue -lt 10 ]
            then
                url="$parent_url${directories[$i]}/${directories[$i]}0${curr_issue}/${filenames[$i]}0${curr_issue}-"
            else
                url="$parent_url${directories[$i]}/${directories[$i]}${curr_issue}/${filenames[$i]}${curr_issue}-"
            fi
        fi

        #Specific check for the HIDDEN_YEARS comic edition 9.5
        #NOTE - Bash does not handle floating point values (eg 9.5) and I'm not using awk or anything to work around it

        #If the current issue is 10, the current comic is HIDDEN YEARS and the HY bool has NOT been set (so remains 0),
        #then issue 9.5 has not been downloaded yet and is due next.
        #Set URL accordingly and curr_issue to 9.5 (a string, not a float)
        if [ $curr_issue -eq 10 -a ${directories[$i]} = "HY" -a ${booleans["HY"]} -eq 0 ]
        then
            url="$parent_url${directories[$i]}/${directories[$i]}09.5/${filenames[$i]}09.5-"
            #Mark HY issue 9.5 as handled
            booleans["HY"]=1

            curr_issue="9.5"
        fi


        #Wget each image until failure
        #Susceptible to premature failure and so may not finish an issue. A few runs should be fine

        #Wget options
        #--no-clobber           Don't overwrite existing files
        #--continue             Continue partial downloads
        #--verbose              Verbose output
        #--timeout=5            Wait 5 seconds before moving on from timeout
        #--directory-prefix     Directory to store files in. Example - ELFQUEST/issue_1/

        while wget --no-clobber --continue --verbose --timeout=5 --directory-prefix="${quests[$i]}/issue_$curr_issue/" "$url${curr_page}.jpg"
        do
            #increment to next page
            ((curr_page+=1))
            #sleep for 1-3 seconds (excessive but safe in my experience)
            sleep $((1 + $RANDOM % 3))
        done



        #NOTE - This will run on every run for better or worse. Comment it out if undesired
        #Convert each issue into a pdf using ImageMagick's convert tool
        #Check if convert is installed before attempting to make a pdf
        which convert > /dev/null
        if [ $? -eq 0 ]
        then
            #If appropriate pdfs directory does not exist, create it
            if [ ! -d ${quests[$i]}/pdfs ]
            then
                mkdir ${quests[$i]}/pdfs
            fi

            if [ ${num_issues[$i]} -gt 1 ]
            then
                convert $(ls ${quests[$i]}/issue_$curr_issue/*jpg | sort -n -t - -k 2) ${quests[$i]}/pdfs/${quests[$i]}_$curr_issue.pdf
            else
                convert $(ls ${quests[$i]}/issue_$curr_issue/*jpg | sort -n -t - -k 2) ${quests[$i]}/pdfs/${quests[$i]}.pdf
            fi
        fi


        #If the current issue is 9.5, the current comic is HIDDEN YEARS and the HY bool has been set to 1
        #Then set curr_issue=10
        #Else increment by 1
        if [ $curr_issue = "9.5" -a ${directories[$i]} = "HY" -a ${booleans["HY"]} -eq 1 ]
        then
            curr_issue=10
        else
            #increment to next issue of current comic
            ((curr_issue+=1))
        fi

    done
done
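
The "double-digit issues were still preceded by a 0" typo comes from gluing the 0 on by hand. A possible alternative, sketched here and not what the script above does, is to let printf do the padding. pad_issue and issue_url are hypothetical helper names; the URL layout is copied from the script.

```bash
#!/usr/bin/env bash

# pad_issue is a hypothetical helper, not part of the script above.
# Prints an issue number zero-padded to two digits: 9 becomes 09, 10 stays 10.
# Pass the number WITHOUT a leading zero, since printf may read 09 as bad octal.
pad_issue() {
    printf '%02d' "$1"
}

# issue_url is also hypothetical; it mirrors the URL layout used by the script:
#   <parent_url><DIR>/<DIR><NN>/<file><NN>-
# $1 = directory (e.g. OQ), $2 = filename prefix (e.g. oq), $3 = issue number
issue_url() {
    padded=$(pad_issue "$3")
    printf '%s%s/%s%s/%s%s-' "https://elfquest.com/read/" "$1" "$1" "$padded" "$2" "$padded"
}

issue_url OQ oq 9; echo
issue_url OQ oq 10; echo
```

With this there is only one code path for single and double-digit issues, so the less-than-10 branch is no longer needed.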

Medulseur · Dec 30, 2021

Aidan said:
Haven't verified it works 100% but I think it's fine now.

Thanks friend! I'll get this VM set up and try out your solution and report back to you with my results. Once again, thanks for taking the time to actually program something like this.

Kendall Motor Oil · Dec 30, 2021

You can use grab-site. It is command line based and probably requires linux. I use it for my archival and it works well. You use a WARC viewer to view the site.

Aidan · Dec 30, 2021

Medulseur said:
Thanks friend! I'll get this VM set up and try out your solution and report back to you with my results. Once again, thanks for taking the time to actually program something like this.

Sure thing. I always encourage people to store copies of things they care about that are online, and I enjoy helping achieve that when I can. People with local copies often end up as curators of things they care about, and this is a small segment of the internet that is extraordinarily important and underrated until it affects you.

My fix isn't actually a great fix, but I'm working on that right now. When it's done I'll update the script again, both in a new post and in the original post with the script. Practically speaking, the current version of the script downloads the HIDDEN YEARS comic issue 9.5 into the issue_9 directory. If this has already happened then you can just hop into that directory in the terminal and do the following.

Bash:

#Execute each command one at a time, this isn't meant to be a script

#Make sure you're in the issue_9 directory with some hy09.5 files
ls hy09.5*

#That should show all the relevant files in the directory. If so, you can continue
#Make a directory for those files
mkdir ../issue_9.5

#move the files into that newly made directory
mv -v hy09.5* ../issue_9.5/

#make sure they're in there
ls ../issue_9.5/

#Make sure none remain in the issue_9 directory as a sanity check
ls hy09.5*

edit - I just realized you're probably not familiar with the terminal, so here's how to get to that directory.
Open a terminal and go to where you ran the script from. If you don't know where that is, it's probably the home directory, so after opening the terminal type ls
to see if the directories are there. If they are then do

Bash:

cd HIDDEN_YEARS/issue_9
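
If typing the commands one at a time feels error-prone, the manual steps in this post could be wrapped into one function along these lines. This is a sketch, and move_issue_9_5 is a made-up name. It takes the path to the HIDDEN_YEARS directory instead of expecting you to cd into issue_9 first.

```bash
#!/usr/bin/env bash

# move_issue_9_5 is a hypothetical helper wrapping the manual steps above.
# $1 = path to the HIDDEN_YEARS directory created by the download script
move_issue_9_5() {
    src="$1/issue_9"
    dest="$1/issue_9.5"

    # Nothing to do if no hy09.5 files ended up in issue_9
    if ! ls "$src"/hy09.5* > /dev/null 2>&1; then
        echo "no hy09.5 files in $src"
        return 0
    fi

    # -p: do not complain if the directory already exists
    mkdir -p "$dest"
    # -v: print each file as it is moved
    mv -v "$src"/hy09.5* "$dest"/
}

# Usage, from the directory where you ran the download script:
#   move_issue_9_5 HIDDEN_YEARS
```

The regular issue 9 files start with hy09- and so are not matched by the hy09.5* pattern.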

Aidan · Dec 30, 2021

Updated script

Added a check for HY issue 9.5 and changed boolean array to be associative
Added pdf collation for each issue stored in a directory called "pdfs"

Bash:

#!/usr/bin/env bash

#root url
parent_url="https://elfquest.com/read/"

#Arrays for each comic on the webpage. Each index in each array relates to each other array
#lazy associative arrays
quests=("ELFQUEST" "SIEGE_AT_BLUE_MOUNTAIN" "KINGS_OF_THE_BROKEN_WHEEL" "WOLFRIDER" "DREAMTIME" "HIDDEN_YEARS" "SHARDS" "SEARCHER_AND_THE_SWORD" "THE_DISCOVERY")
#quests=( "WOLFRIDER" "DREAMTIME" "HIDDEN_YEARS" "SHARDS" "SEARCHER_AND_THE_SWORD" "THE_DISCOVERY")

#Number of issues for each comic
num_issues=( 21 8 9 1 1 29 16 1 1 )
#num_issues=(  1 1 29 16 1 1 )
directories=( "OQ" "SABM" "KOBW" "WR" "DTC" "HY" "SH" "SAS" "DISC" )
#directories=(  "WR" "DTC" "HY" "SH" "SAS" "DISC" )
filenames=( "oq" "sabm" "kobw" "awr" "dtc" "hy" "sh" "sas" "disc" )
#filenames=(  "awr" "dtc" "hy" "sh" "sas" "disc" )

#Lazy bools for oddball quests/comics
#Initially added to facilitate HIDDEN_YEARS which has 30 issues due to an issue #9.5
#Using an array in case this comes up again
#Using C-style 1 = true and 0 = false
#booleans=( 0 )
#bool_index=0

declare -A booleans
booleans["HY"]=0


#used for troubleshooting to help indicate arrays worked as intended
for i in ${!quests[@]}
do
    echo Quest: ${quests[$i]} Issues: ${num_issues[$i]}
done



#Go through each quest/comic and then go through each issue of the comic
for i in ${!quests[@]}
do

    echo ${quests[$i]}
    curr_issue=1

    #Go through each issue of the current comic/quest
    while [ "$curr_issue" -le "${num_issues[$i]}" ]
    do
        #Start on page 0 which appears to always be cover page
        curr_page=0
        #echo "$curr_issue "

        #If there's only one issue then there is no subdirectory
        #Subdirectories are for each issue only
        if [ ${num_issues[$i]} -eq 1 ]
        then
#            url="$parent_url${directories[$i]}-"
            url="$parent_url${directories[$i]}/${filenames[$i]}-"
        else
            #Check if the current issue is < 10 and prepend with a 0 if so
            if [ $curr_issue -lt 10 ]
            then
                url="$parent_url${directories[$i]}/${directories[$i]}0${curr_issue}/${filenames[$i]}0${curr_issue}-"
            else
                url="$parent_url${directories[$i]}/${directories[$i]}${curr_issue}/${filenames[$i]}${curr_issue}-"
            fi
        fi

        #Specific check for the HIDDEN_YEARS comic edition 9.5
        #NOTE - Bash does not handle floating point values (eg 9.5) and I'm not using awk or anything to work around it

        #If the current issue is 10, the current comic is HIDDEN YEARS and the HY bool has NOT been set (so remains 0),
        #then issue 9.5 has not been downloaded yet and is due next.
        #Set URL accordingly and curr_issue to 9.5 (a string, not a float)
        if [ $curr_issue -eq 10 -a ${directories[$i]} = "HY" -a ${booleans["HY"]} -eq 0 ]
        then
            url="$parent_url${directories[$i]}/${directories[$i]}09.5/${filenames[$i]}09.5-"
            #Mark HY issue 9.5 as handled
            booleans["HY"]=1

            curr_issue="9.5"
        fi


        #Wget each image until failure
        #Susceptible to premature failure and so may not finish an issue. A few runs should be fine
       
        #Wget options
        #--no-clobber           Don't overwrite existing files
        #--continue             Continue partial downloads
        #--verbose              Verbose output
        #--timeout=5            Wait 5 seconds before moving on from timeout
        #--directory-prefix     Directory to store files in. Example - ELFQUEST/issue_1/

        while wget --no-clobber --continue --verbose --timeout=5 --directory-prefix="${quests[$i]}/issue_$curr_issue/" "$url${curr_page}.jpg"
        do
            #increment to next page
            ((curr_page+=1))
            #sleep for 1-3 seconds (excessive but safe in my experience)
            sleep $((1 + $RANDOM % 3))
        done
       


        #NOTE - This will run on every run for better or worse. Comment it out if undesired
        #Convert each issue into a pdf using ImageMagick's convert tool
        #Check if convert is installed before attempting to make a pdf
        which convert > /dev/null
        if [ $? -eq 0 ]
        then
            #If appropriate pdfs directory does not exist, create it
            if [ ! -d ${quests[$i]}/pdfs ]
            then
                mkdir ${quests[$i]}/pdfs
            fi

            if [ ${num_issues[$i]} -gt 1 ]
            then
                convert $(ls ${quests[$i]}/issue_$curr_issue/*jpg | sort -n -t - -k 2) ${quests[$i]}/pdfs/${quests[$i]}_$curr_issue.pdf
            else
                convert $(ls ${quests[$i]}/issue_$curr_issue/*jpg | sort -n -t - -k 2) ${quests[$i]}/pdfs/${quests[$i]}.pdf
            fi
        fi


        #If the current issue is 9.5, the current comic is HIDDEN YEARS and the HY bool has been set to 1
        #Then set curr_issue=10
        #Else increment by 1
        if [ $curr_issue = "9.5" -a ${directories[$i]} = "HY" -a ${booleans["HY"]} -eq 1 ]
        then
            curr_issue=10
        else
            #increment to next issue of current comic
            ((curr_issue+=1))
        fi

    done
done
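
Another way to deal with issue 9.5, sketched here as an alternative and not what the script above does, is to stop treating issues as numbers and instead loop over a list of text labels with 09.5 slotted in after 09. issue_labels is a hypothetical helper name; the URL layout is the one the script uses.

```bash
#!/usr/bin/env bash

# issue_labels is a hypothetical helper, not part of the script above.
# $1 = number of whole-numbered issues, $2 = optional extra label (e.g. 09.5)
# Prints one zero-padded label per line, with the extra label placed directly
# after the whole-numbered issue it follows (09.5 goes after 09).
issue_labels() {
    extra=${2:-}
    n=1
    while [ "$n" -le "$1" ]; do
        label=$(printf '%02d' "$n")
        echo "$label"
        # ${extra%%.*} strips everything from the first dot: 09.5 becomes 09
        if [ -n "$extra" ] && [ "${extra%%.*}" = "$label" ]; then
            echo "$extra"
        fi
        n=$((n + 1))
    done
}

# HIDDEN_YEARS: 29 whole-numbered issues plus 09.5, so 30 URL prefixes in total
for label in $(issue_labels 29 09.5); do
    echo "https://elfquest.com/read/HY/HY${label}/hy${label}-"
done
```

Since the labels are plain strings, no boolean flag or float workaround is needed, and the same label can be reused for the directory name.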

Medulseur · Dec 30, 2021

Aidan said:
Updated script

I'm having a bit of an issue getting virtualbox to work. I am getting an error called
AMD-V is disabled in the BIOS (or by the host OS) (VERR_SVM_DISABLED).
and from what I have read I need to go into my PC bios to fix this

Aidan · Dec 30, 2021

Medulseur said:
I'm having a bit of an issue getting virtualbox to work. I am getting an error called
AMD-V is disabled in the BIOS (or by the host OS) (VERR_SVM_DISABLED).
and from what I have read I need to go into my PC bios to fix this

I haven't had to do it before but it sounds like what you say, you gotta go into your bios and click an option to enable virtualization.

Medulseur · Dec 30, 2021

Aidan said:
I haven't had to do it before but it sounds like what you say, you gotta go into your bios and click an option to enable virtualization.

This may have to wait until tomorrow. Getting pretty late here and I don't feel like messing with my bios tonight.
Might just skip the VM altogether and install Ubuntu on my old laptop that was bogged down by Windows 10

Aidan · Dec 30, 2021

I went ahead and added a snippet to make pdfs as it iterates through the comics as well, since I assume it's preferred due to convenience. It won't run unless you have convert installed. The Bash scripts on the 1st page and on this page are both up to date.
To install it on Ubuntu run sudo apt install imagemagick

Naming convention is not good but you can rename them or ask for help scripting that later. Let me know if you run into any issues.
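
As a starting point for the renaming, here is a rough sketch that zero-pads single-digit issue numbers so a plain alphabetical listing puts issue 2 before issue 10. pad_pdf_names is a made-up helper name, and it assumes the QUEST_N.pdf names the script produces (for example ELFQUEST/pdfs/ELFQUEST_1.pdf).

```bash
#!/usr/bin/env bash

# pad_pdf_names is a hypothetical helper, not part of the download script.
# $1 = a pdfs directory, e.g. ELFQUEST/pdfs
# Renames QUEST_1.pdf to QUEST_01.pdf and QUEST_9.5.pdf to QUEST_09.5.pdf.
# Files such as QUEST_10.pdf or WOLFRIDER.pdf match neither pattern and are left alone.
pad_pdf_names() {
    for f in "$1"/*_[0-9].pdf "$1"/*_[0-9].[0-9].pdf; do
        # Skip the unexpanded pattern when nothing matches
        [ -e "$f" ] || continue
        base=${f%.pdf}      # strip the extension: .../ELFQUEST_1
        num=${base##*_}     # text after the last underscore: 1
        mv -v "$f" "${base%_*}_0${num}.pdf"
    done
}

# Usage, from the directory where you ran the download script:
#   pad_pdf_names ELFQUEST/pdfs
```

Running it twice is safe, because already padded names no longer match either pattern.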

Frail Snail · Dec 30, 2021

Medulseur said:
Yeah I am a bit worried about that because HTTrack is still going at it but it's limited to about 30kb/s
I was under the impression that wget is linux only. Would it work on windows?
This page and all the comics on it are what I am trying to download.

Their forum seems to imply that, if the download speed limit is set to blank, it defaults to the slowest setting of 25 KB/s to prevent abuse of the servers. Try setting it manually to a sane value if you haven't. Just don't go overboard or you will 100% get rate limited.
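
For anyone going the wget route instead of HTTrack, the closest equivalent to that setting is wget's --limit-rate option. The sketch below only prints the command so nothing is downloaded by accident; polite_wget_cmd is a made-up helper name and the 200k cap is an arbitrary example value, not a recommendation from the site.

```bash
#!/usr/bin/env bash

# polite_wget_cmd is a hypothetical helper that only PRINTS a wget command.
# Nothing is downloaded until you copy the printed line and run it yourself.
# $1 = bandwidth cap understood by wget (e.g. 200k), $2 = URL to fetch
polite_wget_cmd() {
    printf "wget --limit-rate=%s --no-clobber '%s'\n" "$1" "$2"
}

# 200k is an assumed example value; pick whatever the server tolerates
polite_wget_cmd 200k "https://elfquest.com/read/OQ/OQ01/oq01-0.jpg"
```

The Bash script in this thread throttles itself differently, by sleeping between pages.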
