Linux Today: Linux News On Internet Time.

MachineOfTheMonth: Slurping down websites

Jul 21, 2001, 19:00

[ Thanks to Glenn Mullikin for this link. ]

It's often nice to have your favorite web page retrieved automatically. This article explains how to do that with sirobot, with hands-on examples for a number of popular sites:

"Monolithic applications are great. I use them and enjoy such programs as KDE Konqueror, Mozilla, the Pan newsreader and others. However, when it comes to doing custom things, sometimes it's useful to use more basic tools that allow you to hook them up with other programs such that they cooperate to get a job done.

In this article I propose to take a look at ways to minimize your online time but still get your favorite website. I'll be using the sirobot web mirroring tool and hooking it up with perl to do the dirty work. Here is what the man page says about sirobot:

...The problem isn't pulling down a certain page; the problem is figuring out the right syntax to use for the URL. Doing this requires analysis on a case-by-case basis. Before we get started, I'll admit that some sites are not amenable to pulling down. For example, on Kuro5hin.org, if you wanted to pull down a specific story in flat mode, how would you get the URL for it? From my examination, the URL doesn't appear in the web browser's URL window when you're in flat mode, so without a URL we can't pull down a story in flat mode. But let's look at some sites that seemed to work."
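As a rough illustration of the case-by-case URL analysis the excerpt describes, here is a minimal Python sketch (the article itself uses perl with sirobot, so this is a stand-in, not the author's code). The hostnames, URL patterns, and the "sid" and "mode" parameter names are all hypothetical; a real site's URL scheme has to be worked out by inspecting that site. The sketch only builds URL strings and does no fetching, so the result would still need to be handed to a mirroring tool.

```python
from datetime import date
from urllib.parse import urlencode, urlsplit

# Hypothetical per-site URL patterns (assumptions for illustration only;
# none of these come from the article or from any real site).
SITE_PATTERNS = {
    "daily-archive": "http://news.example.com/archive/{year:04d}/{month:02d}/{day:02d}/index.html",
    "story-by-id": "http://stories.example.org/story?{query}",
}


def daily_archive_url(day):
    """Build the URL for a site that files its pages under a date path."""
    return SITE_PATTERNS["daily-archive"].format(
        year=day.year, month=day.month, day=day.day
    )


def story_url(story_id, mode="flat"):
    """Build the URL for a site that selects a story via query parameters."""
    query = urlencode({"sid": story_id, "mode": mode})
    return SITE_PATTERNS["story-by-id"].format(query=query)


def looks_fetchable(url):
    """Return True only for absolute http(s) URLs with a host part."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


if __name__ == "__main__":
    for url in (daily_archive_url(date.today()), story_url(42)):
        print(url, looks_fetchable(url))
```

A site like the Kuro5hin.org case in the excerpt, where the display mode never shows up in the address bar, has no pattern to put in such a table, which is why it cannot be retrieved this way.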
