Grep the Web
Sep 02, 1999, 15:31
By Martin Vermeer
[ The opinions expressed by authors on Linux Today are their
own. They speak only for themselves and not for Linux Today. ]
As routine users of regular expressions for finding
stuff on our hard disk or lines within our source files, we tend to
forget that searching and selecting stuff like this is far from a
trivial activity. Regular expressions are a programming language; a
simple one, but a programming paradigm nevertheless. As in all true
programming languages, there is a non-trivial relationship between
code written and results obtained; a relationship that will only
open up to the user after an extended period of exercise by trial
and error.
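To make that relationship between code written and results obtained concrete, here is a minimal sketch in Python. The sample lines and the helper name grep are made up for illustration; the point is only that three patterns that look nearly alike select quite different lines.

```python
import re

def grep(pattern, lines):
    """Return the lines containing a match for pattern, in the spirit of grep(1)."""
    rx = re.compile(pattern)
    return [line for line in lines if rx.search(line)]

# Hypothetical sample input, not taken from any real file.
LINES = [
    "passwd: password file in use",
    "password changed",
    "pass the salt",
]

# An unanchored pattern matches anywhere in the line: all three lines.
print(grep(r"pass", LINES))

# Word boundaries restrict the match to the whole word "password":
# the third line drops out.
print(grep(r"\bpassword\b", LINES))

# Anchoring at the start and demanding a word end leaves only the
# line that begins with the word "pass" itself.
print(grep(r"^pass\b", LINES))
```

Nothing in the patterns themselves announces these differences; they only become obvious after running them against real input a few times.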
Most ordinary computer users never see regular expressions. In
fact, searching for generic filenames in an Explorer-type applet
will likely baffle them, and files the names of which are not
exactly remembered will only be unearthed by the needle-in-haystack
approach. Now, however, with the exploding popularity of the World
Wide Web, one kind of regular expression is both available to, and
potentially extremely useful for, the average,
non-computer-knowledgeable user. The search expressions that the
popular search engines accept can be seen as a very simple, and
somewhat atypical, kind of regular expression.
Even using these very simple regular expressions requires
learning by doing, though the people designing these search engines
do everything in their power to make them as easy as possible to
use. As I have noticed, most non-programming-capable users simply
never learn this well, and therefore fail to ever realize for
themselves the potential of the Web as a source -- the most
versatile and complete source in history -- of information on any
subject you can possibly imagine.
There is a touching testimonial in The AltaVista Story
(Osborne McGraw-Hill, ISBN 0-07-882435-4) which underlines this
point. Annie Warren of Digital Equipment explains how she showed a
friend with an active interest in genealogy how AltaVista could be harnessed to
dig out relevant information; and, as she recounts about her
friend, "all of a sudden, she needed a PC and an account with an
Internet Service Provider." I too can testify from experience how
powerfully rewarding it can be to show someone -- not just how to
solve one problem, but the basic skills needed to solve a whole
class of problems over and over again.
I venture that one effective form of free software advocacy is
teaching the effective mining of the World Wide Web as a knowledge
base for IT support. Even among computing professionals, the
potential of "grepping the Web" (in a manner of speaking) is
substantially underestimated. A colleague of mine, a computer
professional with over a decade of experience on several operating
systems, got stuck trying to change passwords on the Digital Unix
Alpha mainframe of our institute. The machine claimed that the
password file was "in use", as if it were somehow locked, the way a
device like a modem can be if it is reserved by another user.
When told of this situation, I volunteered that there was
probably a lock file somewhere that was erroneously left over from
an aborted operation and that would have to be manually removed. As
to the name and location of this lock file, I had nothing useful to
offer, and nothing was found in the obvious places either. I
suggested therefore that he paste the exact error message (in
quotes) into the AltaVista search window, well realizing that
chances of success were pretty minimal. After all, what may work
for Linux is unlikely to work for an OS that you would be very
unlikely to install on your home computer, that has a limited free
software tradition (well, there is of course DECUS) and whose
installed base is several orders of magnitude smaller.
Still, several documents came up and the first one contained
detailed instructions including the identity of the lock file to be
removed. Had my advice not worked, we presumably would have turned
to vendor support, paying good money for it -- for a piece of
knowledge freely available Out There!
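The quoting trick used in this story can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name exact_phrase_query, the parameter name "q" and the sample message are hypothetical, and no particular search engine's interface is implied. The sketch wraps the message in double quotes, so that it is treated as one phrase, and URL-encodes the result.

```python
from urllib.parse import urlencode

def exact_phrase_query(message, param="q"):
    """Wrap an error message in double quotes and URL-encode it.

    The parameter name "q" is an assumption made for this sketch.
    """
    # Embedded double quotes would end the phrase early, so turn them
    # into spaces, then collapse runs of whitespace.
    cleaned = " ".join(message.replace('"', " ").split())
    return urlencode({param: '"' + cleaned + '"'})

# Hypothetical error text; the real message is not quoted in this article.
print(exact_phrase_query("  password file   in use "))
# prints: q=%22password+file+in+use%22
```

Without the quotes, the same words would be searched for independently, and a short, common message would drown in unrelated hits.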
This is a rather trivial example not even involving a true
regular-expression search. I have found AltaVista invaluable for
this kind of thing. Often I remember vaguely that something exists,
and by craftily combining "plus" conditions that have to be met
simultaneously, and using the "minus" prefix to exclude categories
of irrelevant stuff coming up, I usually manage to find what I'm
after. Google isn't bad either;
its syntax assumes that conditions are always to be "anded"
together, usually a realistic assumption.
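The "plus", "minus" and "anded" conditions described above can be sketched as a small matcher in Python. This is a toy model, not any search engine's actual implementation: the names parse_query and matches, the sample documents, the treatment of unprefixed terms as optional, and the use of plain case-insensitive substring matching are all assumptions made for illustration.

```python
import re

# An optional +/- sign followed by either a "quoted phrase" or a bare word.
TOKEN = re.compile(r'([+-]?)(?:"([^"]*)"|(\S+))')

def parse_query(query):
    """Split a query into required (+), excluded (-) and optional terms."""
    required, excluded, optional = [], [], []
    for sign, phrase, word in TOKEN.findall(query):
        term = (phrase or word).lower()
        if not term:
            continue  # skip empty phrases such as ""
        if sign == "+":
            required.append(term)
        elif sign == "-":
            excluded.append(term)
        else:
            optional.append(term)
    return required, excluded, optional

def matches(query, text, implicit_and=False):
    """Decide whether text satisfies query.

    With implicit_and=True every unprefixed term is treated as required,
    which models the "anded" behaviour mentioned in the text.
    """
    required, excluded, optional = parse_query(query)
    if implicit_and:
        required, optional = required + optional, []
    text = text.lower()
    if any(term in text for term in excluded):
        return False
    if not all(term in text for term in required):
        return False
    # If nothing is required, at least one optional term must appear.
    return bool(required) or not optional or any(t in text for t in optional)

# Hypothetical mini "web" of three documents.
DOCS = {
    "a": "Digital Unix: password file in use, remove the stale lock file",
    "b": "Linux: password file in use after aborted passwd",
    "c": "How to choose a good password",
}

query = '+"password file" +"in use" -linux'
print([name for name, text in DOCS.items() if matches(query, text)])
```

With this query only document "a" survives: "b" is thrown out by the minus condition and "c" lacks the required phrases. Calling matches("password lock", text, implicit_and=True) narrows the same three documents to "a" as well, whereas without implicit_and all three would qualify.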
The existing search engines are still notoriously inefficient,
especially in inexperienced hands, and even the best of them have
indexed only some 20% of the Web's content. There is ample scope
for improvement, an issue that the people operating Google appear
to be trying to address.
Recently I saw someone complain -- apropos of something entirely
different -- that there were no Linux drivers for the IEEE-488
(GPIB) bus or tcl/tk tools for use with it in a laboratory
environment. This perceived deficiency was what kept him locked
into Visual Basic. Now I had read in Linux Journal about the
project and remembered seeing something along these lines; a
quick search and I could help set a Linux lover free.
Here we have a tool to debunk the myth of the need for vendor
support. Teach people to fish and they will be no more hungry;
teach them regular expressions and they will be helpless nevermore.
The glittering prize of empowerment, a literacy tool just for the
taking by anyone with the patience to teach it.
"It's an enchanted world, Hobbes!"
Martin Vermeer is a research professor and
department head at the Finnish Geodetic Institute, as well as
"docent" at Helsinki University, Department of Geophysics. He uses
Linux both at work and at home.