Best way to download a website?


Medulseur

SUPPERTIME
kiwifarms.net
Joined
Nov 11, 2020
I want to download a website for offline viewing. Can I get any recommendations for what software to use for this? I've already looked at some solutions such as HTTrack, WebCopy, and SiteSucker, but I would like some input from my fellow Kiwis.
 
I've only used httrack for automated website downloading. Be careful about how you set the depth, and if the target site blocks you for being a bot, try changing the user agent or JavaScript settings. I'm not sure how to circumvent a captcha, though.
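For what it's worth, httrack also has a command-line mode where the depth and user agent are explicit options. A rough sketch with a placeholder URL; worth double-checking the flags against httrack --help:
Bash:
# Mirror a site into ./mirror, limit recursion depth to 3,
# and present a browser-like user agent string.
httrack "https://example.com/" -O ./mirror -r3 -F "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"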

Back in the day when I still used Firefox, if it was something simple I'd use an addon to open the sequence of pages manually, then automate (depending on laziness) a loop of Ctrl+S, Enter, Ctrl+W until all the tabs were gone.

If you're after a wiki, I think you can export the whole thing through a special page (Special:Export on MediaWiki wikis), no need to crawl the site.
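For MediaWiki wikis specifically, that export can be scripted too. A rough sketch with a placeholder wiki URL and page titles:
Bash:
# Special:Export returns page wikitext wrapped in XML.
# Single page:
curl -o page.xml "https://wiki.example.com/wiki/Special:Export/Some_Page"
# Several pages at once (newline-separated titles, current revisions only):
curl -o export.xml \
     --data-urlencode $'pages=Page_One\nPage_Two' \
     --data "curonly=1" \
     "https://wiki.example.com/wiki/Special:Export"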

If the links follow a formula, curl or wget are the "basic" way to do it.
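As a sketch of what that looks like, assuming made-up URLs where the pages are numbered page-1.jpg through page-30.jpg:
Bash:
# wget in a loop, with a polite pause between requests
for i in $(seq 1 30); do
    wget --no-clobber "https://example.com/comic/page-$i.jpg"
    sleep 2
done

# curl can expand numeric ranges on its own
curl --remote-name "https://example.com/comic/page-[1-30].jpg"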

If you'd browse the site manually first anyways, someone recommended this thing in software endorsements. I've never used it but it sounds nifty.
 
Thank you, all of that was really helpful. I'm trying to save this website that has a comic series that's free to read, but the owner won't let you download it. I decided to go with HTTrack since that seems to be what everybody else recommends too. That endorsed program does sound neat, though knowing me I'd forget I had it on and end up archiving an entire day of browsing.
 
Thank you, all of that was really helpful. I'm trying to save this website that has a comic series that's free to read, but the owner won't let you download it.
If it's something like xkcd (a daily/weekly strip published straight to the web, not pirated manga), check whether it has an RSS feed. Some feeds include the full content, and there are programs that can automate downloads from RSS. I'm not sure if there's one that specifically grabs images, but I figure it at least gets around right-click protection.
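If the feed does carry the images, a rough shell sketch like this could pull them out (the feed URL is a placeholder and the grep pattern just assumes plain jpg/png links sitting in the XML):
Bash:
# Scrape image URLs out of an RSS feed and download each one.
curl -s "https://example.com/comic/rss" \
  | grep -oE 'https?://[^"<[:space:]]+\.(jpg|png)' \
  | sort -u \
  | while read -r img; do
        wget --no-clobber "$img"
        sleep 2
    done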

That endorsed program does sound neat, though knowing me I'd forget I had it on and end up archiving an entire day of browsing.
Haha same. I'm lazy but I always figured that if I used it, I'd try to see if I could tie it to a browser profile so I can just have one special window for perishables but still do most of my browsing on another window.
 
I tried looking for an RSS feed but I couldn't find a link, and looking at the site's source didn't help either. The website lets you right-click and save, but the series I'm after has a lot of pages and it would take forever to save them one at a time.
 
The website lets you right-click and save, but the series I'm after has a lot of pages and it would take forever to save them one at a time.
Dunno why, but that reminds me of Hipster Hitler. If you've already started with the automation it probably won't be worth it, but another thing you could do is check the URL format of the images and see whether they're all in the same folder on the server. If they are, check whether that folder is an open directory.
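A quick way to check, with a made-up folder URL: request the folder itself and see whether the server hands back a listing instead of an error.
Bash:
# If directory listing is enabled this prints 200 and an "Index of ..." page;
# otherwise you'll usually see a 403 or 404.
curl -s -o /dev/null -w "%{http_code}\n" "https://example.com/comics/issue01/"
curl -s "https://example.com/comics/issue01/" | grep -i "index of"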
 
No open directory, at least not that I could find. I would bet they're all in folders, though, since each image is labeled XX-1, XX-2, etc.
 
I've never used httrack and would personally just use wget in a script to download just the images, assuming the image URLs are formulaic. As mentioned earlier, nowadays you need to take care not to get flagged as a bot and blocked or banned when you do this sort of thing, but 99% of the time in my limited experience a short interval between requests works fine.

If you want to send me the link, or another comic on the site I can use as an example, I can try to zip it up and send it to you, or otherwise work something out.
 
Yeah, I am a bit worried about that because HTTrack is still going at it, but it's limited to about 30 kB/s.
I was under the impression that wget is Linux-only. Would it work on Windows?
This page and all the comics on it are what I am trying to download.
 
wget is open source and works on Windows just fine. It's one of the GNU utilities that's on most Linux distros by default, though.
I'll take a crack at it in a bit.
 
There's also curl, if you need more options. wget is a lot simpler, though.

I've messed with downloading shit externally with PHP using its curl implementation.

You can specify cookies and/or request headers from your local PC if a website requires some sort of bullshit authentication via those methods (curl can do this; I'm not sure about wget).
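A minimal sketch of that with curl, assuming a made-up image URL and a session cookie copied out of your browser:
Bash:
# Send a cookie and a browser-like User-Agent along with the request.
curl --cookie "session=PASTE_VALUE_FROM_YOUR_BROWSER" \
     --header "User-Agent: Mozilla/5.0" \
     --output page-1.jpg \
     "https://example.com/comics/page-1.jpg"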
 
This seems pretty doable with a bash script since they seem to have ordered everything consistently. I'll post what I get here when it's done.
wget also supports cookies and some other stuff, but curl could be used just as well, especially the way I intend to do it. There's probably a nice way to get wget to do everything recursively here, but crafting that line would take me longer than just writing a simple script.
 
Here's the first issue of the first comic. Doing all of them will take a while, so I wanted to provide this one early for feedback.
Each issue can be made into a PDF too, but I consider it bad form to open PDFs from randoms on the net, so I won't bother unless it's requested. If you want to turn the images into a PDF yourself, ImageMagick's convert utility is great for it and it's what I use.
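For anyone who wants to do the PDF step themselves, a rough sketch (the issue_1 folder name is just an example; some distros restrict PDF output in ImageMagick's policy.xml, which would need loosening first):
Bash:
# ls -v sorts numerically so page 2 comes before page 10.
convert $(ls -v issue_1/*.jpg) issue_1.pdf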

To spare Null some storage and bandwidth, it may be best to share all of them using something like OnionShare. If you have somewhere I could upload them, that will work too.
 

Looks like it worked pretty well! I may have to try wget in the future. I am always shy about using command line programs because I am a big dumb who likes purdy buttons. Don't worry about downloading more because HTTrack finished. Next time, though, I will put on my big boy pants and try wget.
 
Well, you may not like it, but the solution is posted below for reference and convenience.

Using wget you could do this much more concisely, even on a single line, but it's been a while and I didn't feel like running it over and over to make sure that one very long line worked. This is a fairly indirect implementation of the same thing.
Bash:
#!/usr/bin/env bash

#root url
parent_url="https://elfquest.com/read/"

#Arrays describing each comic on the page. The same index in every array refers to the same comic
#(lazy parallel arrays instead of an associative structure)
quests=("ELFQUEST" "SIEGE_AT_BLUE_MOUNTAIN" "KINGS_OF_THE_BROKEN_WHEEL" "WOLFRIDER" "DREAMTIME" "HIDDEN_YEARS" "SHARDS" "SEARCHER_AND_THE_SWORD" "THE_DISCOVERY")
#quests=( "WOLFRIDER" "DREAMTIME" "HIDDEN_YEARS" "SHARDS" "SEARCHER_AND_THE_SWORD" "THE_DISCOVERY")

#Number of issues for each comic
num_issues=( 21 8 9 1 1 29 16 1 1 )
#num_issues=(  1 1 29 16 1 1 )
directories=( "OQ" "SABM" "KOBW" "WR" "DTC" "HY" "SH" "SAS" "DISC" )
#directories=(  "WR" "DTC" "HY" "SH" "SAS" "DISC" )
filenames=( "oq" "sabm" "kobw" "awr" "dtc" "hy" "sh" "sas" "disc" )
#filenames=(  "awr" "dtc" "hy" "sh" "sas" "disc" )

#Lazy bools for oddball quests/comics
#Initially added to facilitate HIDDEN_YEARS which has 30 issues due to an issue #9.5 
#Using an array in case this comes up again
#Using C-style 1 = true and 0 = false
#booleans=( 0 )
#bool_index=0

declare -A booleans
booleans["HY"]=0


#used for troubleshooting to help indicate arrays worked as intended
for i in ${!quests[@]}
do
    echo Quest: ${quests[$i]} Issues: ${num_issues[$i]}
done



#Go through each quest/comic and then go through each issue of the comic
for i in ${!quests[@]}
do

    echo ${quests[$i]}
    curr_issue=1

    #Go through each issue of the current comic/quest
    while [ "$curr_issue" -le "${num_issues[$i]}" ]
    do
        #Start on page 0 which appears to always be cover page
        curr_page=0
        #echo "$curr_issue "

        #If there's only one issue then there is no subdirectory
        #Subdirectories are for each issue only
        if [ ${num_issues[$i]} -eq 1 ]
        then
#            url="$parent_url${directories[$i]}-"
            url="$parent_url${directories[$i]}/${filenames[$i]}-"
        else
            #Check if the current issue is < 10 and prepend with a 0 if so
            if [ $curr_issue -lt 10 ] 
            then
                url="$parent_url${directories[$i]}/${directories[$i]}0${curr_issue}/${filenames[$i]}0${curr_issue}-"
            else
                url="$parent_url${directories[$i]}/${directories[$i]}${curr_issue}/${filenames[$i]}${curr_issue}-"
            fi
        fi

        #Specific check for the HIDDEN_YEARS comic edition 9.5
        #NOTE - Bash does not handle floating point values (eg 9.5) and I'm not using awk or anything to work around it

        #If the current issue is 9.5, the current comic is HIDDEN YEARS and the HY bool has NOT been set (so remains 0)
        #Set URL accordingly and curr_issue to 9.5 (a string, not a float)
        if [ $curr_issue -eq 10 -a ${directories[$i]} = "HY" -a ${booleans["HY"]} -eq 0 ]
        then
            url="$parent_url${directories[$i]}/${directories[$i]}09.5/${filenames[$i]}09.5-"
            #Set first boolean index to 1
            booleans["HY"]=1

            curr_issue="9.5"
        fi


        #Wget each image until failure
        #Susceptible to premature failure and so may not finish an issue. A few runs should be fine
        
        #Wget options
        #--no-clobber           Don't overwrite existing files
        #--continue             Continue partial downloads
        #--verbose              Verbose output
        #--timeout=5            Wait 5 seconds before moving on from timeout
        #--directory-prefix     Directory to store files in. Example - ELFQUEST/issue_1/

        while wget --no-clobber --continue --verbose --timeout=5 --directory-prefix="${quests[$i]}/issue_$curr_issue/" "$url${curr_page}.jpg"
        do
            #increment to next page
            ((curr_page+=1))
            #sleep for 1-3 seconds between requests (short, but has been enough to avoid getting blocked in my experience)
            sleep $((1 + $RANDOM % 3))
        done
        


        #NOTE - This will run on every run for better or worse. Comment it out if undesired
        #Convert each issue into a pdf using ImageMagick's convert tool
        #Check if convert is installed before attempting to make a pdf
        which convert > /dev/null
        if [ $? -eq 0 ]
        then
            #If appropriate pdfs directory does not exist, create it
            if [ ! -d ${quests[$i]}/pdfs ]
            then
                mkdir ${quests[$i]}/pdfs
            fi

            if [ ${num_issues[$i]} -gt 1 ]
            then
                convert $(ls ${quests[$i]}/issue_$curr_issue/*jpg | sort -n -t - -k 2) ${quests[$i]}/pdfs/${quests[$i]}_$curr_issue.pdf
            else
                convert $(ls ${quests[$i]}/issue_$curr_issue/*jpg | sort -n -t - -k 2) ${quests[$i]}/pdfs/${quests[$i]}.pdf
            fi
        fi


        #If the current issue is 9.5, the current comic is HIDDEN YEARS and the HY bool has been set to 1
        #Then set curr_issue=10
        #Else increment by 1
        if [ $curr_issue = "9.5" -a ${directories[$i]} = "HY" -a ${booleans["HY"]} -eq 1 ]
        then
            curr_issue=10
        else
            #increment to next issue of current comic
            ((curr_issue+=1))
        fi

    done
done
 
Well, you may not like it, but the solution is posted below for reference and convenience.
Yeah that is pretty intimidating. I recognize a bit of it because I am trying to learn how to code but most of it goes right over my head. You put all that together though? Seems like wget is better suited for people who actually know stuff.
 
Yeah that is pretty intimidating. I recognize a bit of it because I am trying to learn how to code but most of it goes right over my head.
Bash is a terrible reference for coding practices, so unless you end up writing shell scripts at some point it's not that handy, but you have it in case you want to run it. I was going to post it anyway since I'm a stranger on the internet; Golden Rule and all.

You put all that together though? Seems like wget is better suited for people who actually know stuff.
Yes I wrote that all up but it looks like more than it is.
  • for loop to walk through the arrays which effectively represent each comic
  • while loops to go through each issue and to download each image
  • if/else to handle single digit values that need to be prepended with 0 in the URL
The wget line itself is basic and can be figured out from looking up its usage; no-clobber, continue, and verbose get used a lot. You could get more advanced and give it a list of URLs (one per comic) to accept, which implicitly rejects the others, and then tell it to download only jpgs recursively (rough sketch below). However, I don't think there's a way to tell it to write the files in an organized fashion beyond copying the directory structure in the URL.
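Roughly what that more advanced invocation might look like, purely as a sketch I haven't run against the site (it only works if wget can discover the image links by crawling the HTML under that path):
Bash:
# Crawl under /read/, only follow links inside /read/OQ, keep only jpgs,
# and pause between requests so the server isn't hammered.
wget --recursive --level=5 --no-parent \
     --include-directories="/read/OQ" \
     --accept "*.jpg" \
     --wait=2 --random-wait \
     --directory-prefix=ELFQUEST_raw \
     "https://elfquest.com/read/"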

edit: Actually if you're learning to code, this would be a good thing to implement in your language of choice depending on your current level. The logic would be the exact same at worst. Even in C this would be trivial since wget is doing the hard part.
 
So if I wanted to download the rest of the comics I would just change the names and directories to match then run the script?
 
So if I wanted to download the rest of the comics I would just change the names and directories to match then run the script?
Yeah the only thing you would need to change, in this specific case, is what's in the arrays. Modifying the script to be interactive may be better if you decide to do one-off downloads from the same site or something. Use the read command if so.
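A rough sketch of that interactive variant using read (the prompts and variable names are only illustrative):
Bash:
# Ask for the values that would otherwise live in the arrays.
read -r -p "Comic name (e.g. ELFQUEST): " quest
read -r -p "Directory code (e.g. OQ): " dir
read -r -p "Filename prefix (e.g. oq): " fname
read -r -p "Number of issues: " issues
echo "Will fetch $issues issue(s) of $quest from ${dir}/ as ${fname}-*.jpg"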


https://elfquest.com/read/OQ/OQ01/oq01-0.jpg

To add a new comic that's stored at https://elfquest.com/read/ you'd need to do the following.
  1. Verify that it's ordered the same way (probably a safe assumption)
  2. Add the name of the comic to the quests array.
    • You can delete the others if they've already been downloaded; otherwise it'll try to download them again, which takes time.
    • All of the arrays need to be modified in step with each other. Poor, lazy design.
  3. Add the number of issues to the num_issues array
  4. Add the parent directory of the comic to the directories array
  5. Add the filename letters to the filenames array.
The OQ01 directory is determined from the parent directory and $curr_issue. The rest of the filename similarly uses the $curr_issue and $curr_page variables. The filetype being "jpg" rather than "jpeg" or any other format is assumed, but as far as I could tell that was always true.
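Concretely, adding a hypothetical comic stored under https://elfquest.com/read/FQ/FQ01/fq01-0.jpg would just mean appending one matching entry to each array (placed after the existing definitions):
Bash:
quests+=( "FINAL_QUEST" )   # local folder name
num_issues+=( 4 )           # how many issues it has
directories+=( "FQ" )       # parent directory on the server
filenames+=( "fq" )         # filename prefix of each image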

edit - by the way, I never stopped downloading, so if you decide you do want them, or need help converting any or all of them to PDF, just let me know. I'll hold onto them for at least a week.
 
Yeah the only thing you would need to change, in this specific case, is what's in the arrays. Modifying the script to be interactive may be better if you decide to do one-off downloads from the same site or something. Use the read command if so.
Just trying to figure out how to install wget is holding me up a little, but I'm sure I could figure it out given enough time.
 