
What should I use to crawl/download/archive an entire website? It's all simple static pages (no JavaScript), but has lots of links to download small binary files, which I also want to preserve. Any OS -- just want the best tools.


@cancel In the few times I did it in the past I used wget --mirror with a few tweaked parameters for directory traversal and domain-spanning.
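
For reference, a minimal sketch of that kind of invocation (not the exact command from the post; example.com is a placeholder):

    # --mirror = recursive, with timestamping and unlimited depth
    # --page-requisites = also grab the images/CSS needed to render each page
    # --no-parent = don't ascend above the starting directory
    wget --mirror --page-requisites --no-parent https://example.com/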

@blindcoder It only seems to download .html, images, css, etc.

@cancel It can only follow links in HTML code, naturally, but it'll follow all hyperlinks regardless of data type.

@blindcoder No, it's not downloading .zip files that are linked from .html files.

@cancel @blindcoder are the binary files hosted on the same domain as the html+images+css? I think wget needs explicit options to allow getting from multiple domains in recursive mode, and it might also need options to limit recursion depth in that case to avoid downloading the whole internet...
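
The options being described here are wget's --span-hosts / --domains and --level; a sketch with placeholder domains:

    # follow links onto other hosts, but only the listed ones,
    # and cap recursion at 5 levels to avoid downloading the whole internet
    wget --recursive --level=5 \
         --span-hosts --domains=example.com,files.example.com \
         https://example.com/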

@mathr @blindcoder They're on the same domain. It looks like wget either requires the file extensions to be added to a list of accepted extensions, or needs robots.txt to be ignored. I can do the latter, but I'm not sure how to do the former, because there are many varied file extensions I want to back up. Is there a way to wildcard it?

@cancel @mathr Well, wget does respect robots.txt by default. Try with this: wget -e robots=off
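
Putting that together with the extension question above: in recursive mode within one domain, wget normally grabs every linked file type anyway unless robots.txt or an accept/reject list gets in the way, and -A/--accept does take wildcard patterns if you want one. A sketch (example.com is a placeholder):

    # mirror while ignoring robots.txt (and be polite about it with a delay)
    wget --mirror --page-requisites -e robots=off --wait=1 https://example.com/
    # if you really do want an accept list, -A takes wildcard patterns, e.g.:
    #   wget --mirror -e robots=off -A '*.zip,*.pdf,*.tar.gz' https://example.com/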

@cancel @mathr I think wget also respects rel=nofollow but I don't know how to turn that off...

@cancel I usually use httrack, specifically the webhttrack frontend.

@csepp Yeah, I'm trying that now... I set it to something pretty conservative (1 connection, at most 1 request per second) and the site blocked me :/ really not good for preservation... why would you set your server up to do that
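
For what it's worth, the kind of throttled httrack run described here would look roughly like this on the command line (flags from memory, so double-check the man page; example.com is a placeholder):

    # roughly: 1 connection, at most 1 connection per second, stay on the site
    httrack "https://example.com/" -O ./mirror -c1 -%c1 "+*.example.com/*"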

@cancel Oof. Well, I thiiink httrack is smart enough to retry them later, but in any case, that sucks.

@cancel Damn. Then I'm out of ideas. Archive Team's wget fork might have some kinda countermeasures, or swarm capability, or something? Maybe? Or if they don't block Tor, you could use that to get a fresh IP... idk.
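
On the Tor idea: if the site doesn't block exit nodes, wrapping the crawler in torsocks is one way to come from a different IP. A sketch, not a recommendation to hit the site any harder:

    # route wget through a local Tor client via torsocks
    torsocks wget --mirror --page-requisites --wait=2 https://example.com/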

@csepp It seems like it's already backed up on archive.org, but I wanted my own copy :/

@cancel Oh, in that case I think there are scripts for downloading from them and stripping the additional WBM frontend stuff. I wish they'd just let people download their WARCs though.
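
One such tool (not necessarily the scripts meant above) is the wayback_machine_downloader Ruby gem, which pulls the archived files themselves rather than the Wayback Machine's wrapped pages:

    gem install wayback_machine_downloader
    wayback_machine_downloader https://example.com/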

@cancel i think there’s a wget invocation that can help you here

@cancel I have just used wget in the past. It's pretty simple to have it recursively download everything linked on a page. Don't remember the flags right now, but finding a guide would not be too difficult

@cancel As a more complete solution, if you want more than one website, I know of this: https://archivebox.io/

@cancel ugh masto made the link smaller, need to have the protocol in front
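
For anyone curious, ArchiveBox's quickstart is roughly the following (a sketch based on its docs; pip is one of several supported install routes):

    pip install archivebox
    mkdir archive && cd archive
    archivebox init                          # create a new archive collection here
    archivebox add 'https://example.com/'    # add a site to the archive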

@cancel Also, if you'd prefer something more GUI-based, http://www.httrack.com/ is pretty good

@cancel yeeee, extinct did a 420 ych and i got my dragon done up in it ^_^

@cancel ah i just saw the date -- it got tossed over my timeline a few mins ago so i thought it was recent aaa

@cancel Ugh, I've been wanting to download docs, references and guides (like CSS or the i3wm guide) forever, but have yet to find a good way. All I've got so far is a bunch of folders I can't use.
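
If the problem is that the mirrored folders aren't browsable offline, the wget flags that usually fix that are --convert-links and --adjust-extension. A hedged sketch for a docs subtree (the URL is a placeholder):

    # -k/--convert-links rewrites links to point at the local copies;
    # -E/--adjust-extension adds .html where the server used extension-less URLs
    wget --recursive --no-parent --page-requisites \
         --convert-links --adjust-extension \
         https://example.com/docs/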
