WDVL.com: The Perl You Need to Know Special: Introduction to mod_perl
Apr 11, 2000, 18:45
By Aaron Weiss
Mod_perl, the module that makes for a happy but complex marriage between Perl and the Apache web server, can ultimately offer significant performance improvements in Perl-backed web sites.
Perl is powerful and flexible as a backend language for web developers, as the Perl You Need to Know series no doubt illustrates. However, serving many pages which rely on Perl processing can come at a cost in memory and time. This month we introduce the wonders of mod_perl, an Apache module which integrates Perl into the Apache web server. We'll begin by discussing the reasoning behind mod_perl and its uses, pros, and cons, and in follow-up articles delve into some code-specific issues when working in a mod_perl environment. Readers should already be familiar with the Perl covered in the Perl You Need to Know series -- furthermore, you'll need hands-on access to your own Apache server to employ mod_perl.
The Story of Forks
Apache, as you may know, is a very popular web server. So popular, in fact, that as of March 2000 Apache is believed to power some 60% of web sites on the Internet -- and, thank goodness for open source, it's free to boot. What an age to be alive! A web server would be an extremely simple thing if your site only ever attracted a single visitor at a time. With 6 billion people on this planet, that's rather unlikely. Instead, the web server must juggle and serve a number of suitors simultaneously, not unlike a harried waitress scurrying between restaurant patrons. Web servers in general employ one of several schemes for handling incoming requests, some schemes more efficient than others.
Apache, in its current 1.x incarnation, is what they call a pre-forking server. This does not mean Apache is older than silverware ("the time before forks"). Rather, it means that the parent Apache process "spawns" (like a demon) a number of child processes that lie in wait anticipating an incoming connection. When a request comes in and one child is busy, another child handles it. If all the children are busy, Apache may birth more children depending on the server's configuration, or -- when the maximum number of children has been born -- additional requests are queued and the end user must wait for the next available child process.
Each child spawned takes up space and resources -- namely, memory and possibly processing time (depending on what it's doing). Ideally, Apache keeps just enough children alive to handle incoming requests. If additional children must be spawned to handle a surge of requests, Apache will ruthlessly kill them once the surge has passed, lest they lie around forever idle, simply consuming resources. The world of Apache is a brutal place.
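The size of this pool of children is governed by a handful of directives in Apache 1.3's httpd.conf. The fragment below is a sketch only; the numbers are illustrative rather than recommendations, and the right values depend on your memory and traffic.

```apache
# Process-pool settings in httpd.conf (Apache 1.3); values are illustrative.

# Children spawned when the server starts.
StartServers     5

# If fewer than this many children are idle, the parent spawns more.
MinSpareServers  5

# If more than this many children are idle, the parent kills the extras.
MaxSpareServers  10

# Upper limit on simultaneous children; requests beyond it wait in a queue.
MaxClients       150
```

These settings matter more once mod_perl enters the picture, since every child in the pool carries the weight of whatever is built into the server.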
How does all of this relate to Perl? A connection request arrives at an Apache child process, and requests, for example, a CGI script. The CGI process occurs external to the Apache child, which means that the child must fork a new process to launch the CGI script. In the case of a CGI coded in Perl, the Perl interpreter must be launched since Perl is not a compiled language. The interpreter is launched as a separate process, it compiles and executes the Perl code, and returns the results to the Apache child, who then passes them along to the visitor. Works great, except for two problems: it's slow, since the Perl script has to be re-interpreted every time it is run, and it consumes even more memory, because the Perl interpreter must be launched for each execution of the Perl script.
The above describes your standard garden variety CGI environment. For sites with low traffic and/or low processing demands, CGI is easy to implement and the costs are still reasonable (keep in mind that "slow" in computer terms is still very, very fast in human terms). Where the CGI model begins to break down is with sites that must process more than a handful of simultaneous requests for Perl scripts, especially when those scripts perform heavier work such as database queries. A web site with these needs will quickly become bogged down by the sheer inefficiency of CGI, wasting memory and leaving visitors frustrated with noticeable wait times.
Enter the Hero
One sunny (or cloudy, we just don't know) day, a bright fellow named Doug MacEachern resolved to marry Perl and Apache, so that rather than interacting as two foreign independent entities, the two would be joined in holy matrimony, with the advantages of both combined in union, able to tackle the world till obsolescence do they part. With a knack for hacking, but perhaps not such a gift for names, Doug named his new hybrid mod_perl. Put more accurately, mod_perl is an Apache module that integrates the Perl interpreter into the Apache web server.
The benefits of this integration are twofold:
Getting the Goods
Most sites that run Apache are based on Unix-like operating systems such as Linux or FreeBSD, although Apache is also available for the Windows platforms. You will need to be running an Apache web server, preferably the newest stable release available (1.3.12 at the time of writing) to make use of mod_perl, although there are plug-ins similar to mod_perl for other web servers (nsapi_perl for Netscape servers, or the commercial PerlEx by ActiveState for O'Reilly, Microsoft, and Netscape servers).
On Apache under a Unix-like operating system, you can download the source for mod_perl (current version is 1.22). Alternatively, if you are familiar with the CPAN.pm module, the command "install Bundle::Apache", entered at the CPAN shell prompt, will install mod_perl and several related Perl modules that you may or may not wish to use. You can also install mod_perl manually, from the source link above, and then type "perldoc Bundle::Apache" to view a list of related modules that you can retrieve and install if you wish.
Apache is also available for Windows, but many Windows Perl coders use ActiveState's popular port, ActivePerl. This is a problem for us here, because mod_perl will not (yet) work under Windows with ActivePerl. There is hope -- you can freely download a fully bundled set of binaries containing Apache, mod_perl, and an alternate port of Perl all for Windows 95/98/NT.
While Windows users have downloaded binaries, many Unix-like users have downloaded source code. The vagaries of compiling anything under a Unix environment are complex, but in a typical scenario you can rely on the built-in compilation scripts included with Apache and mod_perl. The compilation procedure involves building mod_perl first, which then in turn builds the Apache binaries -- the end result will be a new Apache httpd binary. The installation summary below is reproduced from Stas Bekman's thorough "mod_perl Guide" -- you can skip the first five lines if you've already downloaded and unpacked the Apache and mod_perl sources (which is what these lines do).
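A sketch of that sequence follows, assuming the sources are unpacked under /usr/src. The x.x.x and x.xx strings are placeholders for version numbers (1.3.12 and 1.22 at the time of writing) and must be substituted before anything will run. The download URLs are assumptions and may well have moved.

```shell
# Fetch and unpack the Apache and mod_perl sources (placeholders: x.x.x, x.xx).
cd /usr/src
lwp-download http://www.apache.org/dist/apache_x.x.x.tar.gz
lwp-download http://perl.apache.org/dist/mod_perl-x.xx.tar.gz
tar zxvf apache_x.x.x.tar.gz
tar zxvf mod_perl-x.xx.tar.gz

# Configure mod_perl, pointing it at the Apache source tree.
cd mod_perl-x.xx
perl Makefile.PL APACHE_SRC=../apache_x.x.x/src \
    DO_HTTPD=1 USE_APACI=1 EVERYTHING=1

# Build, test, and install mod_perl, then install the new Apache.
make && make test && make install
cd ../apache_x.x.x
make install
```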
As illustrated, you simply need to unpack the Apache and mod_perl sources into respective subdirectories, then change into the mod_perl source directory and execute the "perl Makefile.PL" command illustrated above. This tells the build process where to find the Apache sources and what options to build in -- the above routine defaults to "everything", which is satisfactory for most uses and certainly for a first-time experience. Finally, the sources are all built while your computer churns and smokes for a few minutes, and installed into place, typically /usr/local/apache.
Assuming a /usr/local/apache destination, the new httpd (the binary for the Apache server) will be found in /usr/local/apache/bin.
Gee, it's huge
If you've previously compiled an Apache server you may have noticed that the typical httpd size is between 300K and 400K. Now, with mod_perl integrated, the httpd has ballooned to over 1 megabyte. Perl is, you can see, as William Shatner would shill, "big! really big!" This brings us to the subject of tradeoffs.
Life is a box of compromises. Buffalo wings and cheesecake are a swell meal, but make you fatter. Chicken broth and celery stalks are slimming and dull. And so it is with the Apache web server, which is much more robust with a belly full of Perl. The trouble in the henhouse is that Apache, as we discussed, is pre-forking -- which means that a fat parent server will spawn fat children. Several of them. Isn't that always the way. That's the cost of doing business when you want to execute heavy Perl scripts with aplomb, but most web sites are composed of more than simply Perl scripts -- such as static web pages. And a static web page is like a sheet of paper, lightweight. Unfortunately, if your site is running mod_perl and has many static pages to serve in addition to Perl scripts, that is one fat child process running around carrying a tiny load.
So it's a battle of inefficiencies: vanilla Apache is inefficient at executing Perl scripts via CGI, while mod_perl beefed up Apache is inefficient at serving simple web pages. You need to consider the general breakdown of pages served by your site -- are we looking at 90% Perl scripts vs. 10% simple pages, or 10% Perl scripts vs. 90% simple pages? Likely somewhere in between. At the extremes, your best choice is to choose the most efficient server for most of the time. In a scenario where 10% of your requests trigger Perl scripts, it might be justifiable to live with the relative penalty of CGI for the benefit of a small and compact server process, allowing for more simultaneous visitors in a given amount of memory. If you serve relatively few simple pages, the advantages of a beefy mod_perl server will pay off more than the penalty of a few extra though large processes. Many readers find themselves somewhere between these two poles, though -- say, 30/70 or 40/60 or 50/50.
A nifty solution to this quandary is to run two Apache servers. One Apache server is the small, compact vanilla version while the other is the robust and hefty mod_perl-enabled Apache server. Incoming requests are then routed to the mod_perl server when Perl scripts are required, while simple page requests are handled by the lightweight server. Elegant enough, but the devil is in the details. Ultimately, this is the preferred solution when you can't justify serving all content from either a slim or fat Apache server, but it has its own pitfalls. You'll need to maintain a separate installation tree for each Apache server, including separate configuration files, and each server will spit out its own log files, making the job of analyzing traffic a bit more complicated. The mod_perl server is typically configured to listen on an alternate network port, such as 8080, but you don't want end users to see this -- all pages should appear to come from one server lest problems arise with firewalls, bookmarks, and so on. This is solved by employing internal proxying within the slim Apache server's configuration file, to pass requests for Perl scripts to the mod_perl server "behind the scenes". That's the short of it -- the long is simply too long and too off-topic for this article, but we again direct you to Stas Bekman's thorough coverage of multiple server arrangements.
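To give a flavor of the proxying involved, here is a minimal sketch. It assumes the slim server was built with mod_proxy, that both servers run on the same machine, and that the Perl scripts live under a /cgi-perl/ URL path. The host name and path are assumptions for illustration, not a complete two-server setup.

```apache
# --- httpd.conf of the slim front-end server (the one visitors see) ---
# Hand any request under /cgi-perl/ to the mod_perl server on port 8080.
ProxyPass        /cgi-perl/ http://localhost:8080/cgi-perl/
# Rewrite redirect headers from the back end so port 8080 never leaks out.
ProxyPassReverse /cgi-perl/ http://localhost:8080/cgi-perl/

# --- httpd.conf of the hefty mod_perl back-end server ---
# Listen on the alternate port instead of the default 80.
Port 8080
```

Everything outside /cgi-perl/ is served directly by the slim server, so static pages never touch a fat child process.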
For the sake of simplicity in this introduction, we'll assume a single Apache server which is mod_perl enabled, even if this is not the ideal architecture for sites with lots of static content. The Apache server is configured, prior to launch, in the very long but well commented httpd.conf file which, in a default installation, is found in the /usr/local/apache/conf subdirectory. Once again, and not to pass the buck too often, Apache server configuration is a career unto itself, so we will focus only on configuration of the mod_perl aspect.
Simply put, we want to tell Apache to process Perl scripts via the Apache::Registry module, which is mod_perl's pseudo-CGI environment. This allows us to run Perl scripts written for a typical CGI environment (such as using the CGI.pm module) under mod_perl, which is technically not a CGI extension.
The default httpd.conf file installed with Apache is not configured to use mod_perl; instead, it is configured to execute scripts via CGI. You will probably find a configuration directive in your httpd.conf file that looks something like:
ScriptAlias /cgi-bin/ "/usr/local/apache/cgi-bin/"
This directive tells Apache that any files in the relative path /cgi-bin/ should be considered scripts, and launched accordingly. You need to consider whether all scripts on your web site will be Perl and handled by mod_perl, or whether there are other scripts that may still need to execute via CGI. The safest approach is to retain at least one subdirectory for traditional old-style CGI scripts and one subdirectory for your mod_perl Perl scripts. The ScriptAlias directive above must only point to a path with CGI scripts, and not to the path where you want Perl scripts executed from. Let's say, then, that you create a new path -- /usr/local/apache/cgi-perl/ for your mod_perl enabled scripts.
Of course, if you are running mod_perl scripts exclusively, you could simply comment out the ScriptAlias directive by preceding it with a pound symbol (#), and simply use the cgi-bin/ path for your Perl scripts.
Now we're ready to add mod_perl specific configuration directives. If you scroll through the httpd.conf file, you'll find a section which contains the commented heading "Aliases: Add here as many aliases as you need ...". It's easiest to scroll down towards the end of this section, just before it is closed with the </IfModule> tag, and add our new alias here.
Alias /cgi-perl/ "/usr/local/apache/cgi-perl/"
<Location /cgi-perl>
SetHandler perl-script
PerlHandler Apache::Registry
Options ExecCGI
PerlSendHeader On
</Location>
Above, we define an alias, linking /cgi-perl/ to the system path /usr/local/apache/cgi-perl/. The <Location> directive references this alias and defines a number of attributes for it. First, we tell Apache to let mod_perl handle these files via the SetHandler directive, and we tell mod_perl to handle them using its Apache::Registry module. The Registry module is basically the star of the show here, as it is what handles emulating a CGI environment and compiling/caching the Perl code. We tell Apache to handle these files as executable via the ExecCGI parameter, otherwise the server would try to send the script as a text file to the end user -- yikes! Finally, we tell Apache to send an HTTP header to the browser on script execution -- this is not strictly necessary if your Perl script is well behaved about sending the header itself, such as by the CGI->header() method of the CGI.pm module.
Start Your Coding
Our mod_perl Apache server is ready to serve. That's the good news. But, like any high performance piece of machinery, mod_perl is not going to provide its optimum benefits right out of the box like this. Before you're ready to tweak and tune, however, it's important to get used to developing scripts in the mod_perl environment (and for better or worse, there is a lot of tweaking and tuning that can be done under the hood). Of course, you'll want to save your Perl scripts to the system directory aliased to /cgi-perl/ or whatever name you chose. Whether you are adapting existing scripts or writing anew, your Perl should interact with the browser just as you did before, via the CGI.pm module, which we looked at way back in Part 2 of the Perl You Need to Know. You can retrieve parameters and send output to the browser just as before, but keep in mind that although we continue to use the label "CGI" as a manner of speaking, scripts executed by mod_perl are not technically using the CGI extension.
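As a first test, a small script along the following lines can confirm that the setup works. This is a minimal sketch, assuming it is saved under a hypothetical name such as hello.pl in the directory aliased to /cgi-perl/. The "name" parameter and the greeting() helper are made up for illustration. The script checks the MOD_PERL environment variable, which mod_perl sets, to report which environment produced the response.

```perl
#!/usr/bin/perl
# hello.pl (hypothetical name): runs unchanged under plain CGI or mod_perl.
use strict;
use warnings;
use CGI;

# Build the response text. Everything the sub needs is passed in as an
# argument, so it does not depend on lexical variables declared outside it.
sub greeting {
    my ($name) = @_;
    my $mode = $ENV{MOD_PERL} ? "mod_perl ($ENV{MOD_PERL})" : 'plain CGI';
    return "Hello, $name! This response was produced under $mode.";
}

my $q = CGI->new;

# Read the optional "name" parameter, falling back to a default.
my $name = $q->param('name');
$name = 'world' unless defined $name && length $name;

# Send the HTTP header ourselves, then the body.
print $q->header('text/plain');
print greeting($name), "\n";
```

Requesting the same file once through /cgi-bin/ and once through /cgi-perl/ should show the two different modes, which is a quick way to verify that Apache::Registry is really handling the script.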
Although many Perl scripts will run as-is in the mod_perl environment, you are not yet taking full advantage of mod_perl's benefits. We'll close out this month's installment looking at pre-loading Perl modules. Next month we'll look some more at optimizations, and also at some thorny pitfalls in coding practice that could undermine Perl scripts that otherwise work fine outside of mod_perl.
Your Perl scripts most probably begin by linking in some modules via the use statement. At the least, you probably have:
#!/usr/bin/perl
use CGI;
Because your script invocations will likely keep using many of the same modules, one mod_perl optimization is to pre-load these modules, allowing mod_perl to compile them once and keep them resident in memory. Future script executions do not then need to recompile these modules, shaving a few more milliseconds off total execution time. The typical way you can pre-load Perl modules is with the PerlModule directive, which you can place in Apache's httpd.conf file along with your other mod_perl directives:
Alias /cgi-perl/ "/usr/local/apache/cgi-perl/"
PerlModule CGI
<Location /cgi-perl>
SetHandler perl-script
PerlHandler Apache::Registry
Options ExecCGI
PerlSendHeader On
</Location>
You can list any other Perl modules you wish to pre-load in the one PerlModule directive, simply separated by spaces. There is a slightly more sophisticated method of pre-loading modules that involves using the PerlRequire directive to load a short script that contains "use ()" statements for each module -- this is not a necessary step to begin with, but is nicely illustrated in Vivek Khera's mod_perl_tuning document.
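For the curious, the PerlRequire approach can be sketched as follows. The file name and location (startup.pl under /usr/local/apache/conf) are assumptions chosen for illustration, and the module list is only an example; substitute whatever your own scripts load.

```perl
# Hypothetical file: /usr/local/apache/conf/startup.pl
# Loaded once at server start by this line in httpd.conf:
#   PerlRequire /usr/local/apache/conf/startup.pl
use strict;

# The empty import lists compile each module now but import nothing here.
# Individual scripts still issue their own "use" to get the names they need.
use CGI ();
use CGI::Carp ();

# A file loaded this way must finish by returning a true value.
1;
```

Compared with a long PerlModule line, a startup script keeps the module list in one Perl file that can be syntax-checked on its own with "perl -c startup.pl".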
Just because you've pre-loaded a Perl module does not mean that you forgo the "use" statement in your Perl script. Leave those in as they are. Perl will not waste time recompiling the module sources, but it will import necessary elements of the module into your script's namespace, allowing you to leave calls to the module unchanged in syntax within your script.
Take Home Message: Optimizations
It is tempting and simple to walk away from an introduction to mod_perl thinking that it magically takes care of all optimizations. The magical mod_perl genie just compiles your Perl and everything is milk and honey. Not so fast! The ways in which mod_perl compiles and caches code vary depending on how it is used -- before we become immersed in details next month, go to sleep tonight with a good overview of the ways in which mod_perl can optimize Perl execution.
"Better Than Nothing" Optimization:
"Hands Off" Optimization:
"Extreme but Bloated" Optimization: