Multiplexed I/O with poll()

In the last column, I talked about how to read and write from multiple file
descriptors seemingly simultaneously with the select() function call. Using multiplexed I/O
lets programs block while waiting for notification that some file descriptors are ready for reading
or writing.

Using multiplexed I/O of this form is quite common. High-performance network servers —
including mail, workgroup, and Web servers — all must handle multiple simultaneous connections,
often thousands of connections. While select() looks like it would handle this kind of load
just fine, it has a fatal flaw that limits its performance. Figure 1 shows the prototype of the select() function.

Figure 1: A Prototype of the select() Function

int select(int numfd, fd_set * readfds, fd_set * writefds,
fd_set * exceptfds, struct timeval * tv);

Think about what happens when a program wants to use select() to block until either file
descriptor 1000 or 1001 is ready to be read from. It will have to call select() with a
numfd of 1002 and the appropriate value of readfds. Easy enough, right?

The Problem with select()

Now try to think about things from the kernel’s point of view. The kernel gets a request from a
user-space program to monitor some file descriptors for reading. It knows that the file descriptors
are smaller than 1002, but that’s all it knows. To figure out which file descriptors the program is
interested in, the kernel needs to check all 1,002 file descriptors, one by one.

Needless to say, this is terrible for performance. As most busy servers call select()
quite often, checking 1,002 file descriptors when the program really only cares about two is quite
inefficient. Additionally, if the file descriptors were instead 10001 and 10002, things would be
worse by an order of magnitude. All of this means that select() has a very poor property:

The performance of select() is directly related to the value of the file descriptors
being monitored. No other Linux system call depends on file descriptors’ values. (The number of file
descriptors monitored would be a more natural characteristic for performance to depend on.) As a
specific example of this, select() performs very poorly once the file descriptor values get large.

select() has one additional problem — how many file descriptors should fd_set be
able to hold? Ideally, it should be able to hold as many file descriptors as a process can have
open. Traditionally, Linux allowed only 1,024 file descriptors per process, so this was reasonable.
However, now that patches are available for the Linux kernel that raise that limit, fd_set
needs to grow. In fact, it now needs to hold hundreds of thousands of file descriptors. If
fd_set were actually enlarged, each fd_set structure would measure at least 12K in
size! Needless to say, handling fd_set structures of that size would have a noticeable
performance impact!

The 2.2 Linux kernel introduced the poll() system call. poll() is supported by
most Unix systems and is included in the Single Unix Specification. It duplicates the function of
select() but scales much better. Every glibc-based Linux system provides
poll(); those running on older kernels emulate poll() via select() in the C
library, but applications need not know the difference. Here’s what poll() looks like:

int poll(struct pollfd *ufds, unsigned int nfds, int timeout);

The first parameter, ufds, must point to an array of struct pollfd. Each element
in the array specifies a file descriptor that the program is interested in monitoring, and what
events on that file descriptor the program would like to know about. The next parameter,
nfds, tells the kernel how many total items are in the ufds array. The final
parameter, timeout, is the maximum time, in milliseconds, that the kernel should wait for the
activities the ufds array specifies. If timeout contains a negative value, the kernel
will wait forever. If nfds is zero, poll() becomes a simple millisecond sleep.

Building the ufds array is the most complicated part of using poll(). Each
struct pollfd looks like:

struct pollfd {
    int fd;         /* file descriptor */
    short events;   /* requested events */
    short revents;  /* returned events */
};

The fd element contains the file descriptor whose events this structure describes. The
events field is a bitmap of events the application would find interesting, and the
revents field is filled in on return from poll() with a list of the events that have
actually occurred on the file descriptor. Programs can monitor one or more of POLLIN,
POLLOUT, or POLLPRI. The first two monitor whether there is data to read and whether
writing will block. The last tells the application if there is out-of-band data available,
which can occur on sockets. These values can be logically ORed together, letting the program
monitor a single file descriptor for both reading and writing.

On return from poll(), the revents field is filled with the same POLLIN,
POLLOUT, and POLLPRI values, indicating which of the “interesting” events the file
descriptor is now ready for. The following bits may also be set in revents:

POLLERR: This is set if an error condition has occurred on the file descriptor.

POLLHUP: This is set when the file descriptor refers to a terminal that has been hung up.

POLLNVAL: If fd doesn’t refer to a valid (open) file descriptor, this value is set.

poll() returns -1 if an error occurred, 0 if the timeout was reached, or the
number of file descriptors that have nonzero revents fields.

Listing One is a sample program that performs exactly the same function as the
select() test program in last month’s Compile Time.

Listing One: The poll() System Call at Work

#include <fcntl.h>
#include <stdio.h>
#include <sys/poll.h>
#include <sys/time.h>
#include <unistd.h>

/* For simplicity, all error checking has been left out */

int main(int argc, char ** argv) {
    int fd;
    char buf[1024];
    int i;
    struct pollfd pfds[2];

    fd = open(argv[1], O_RDONLY);

    while (1) {
        pfds[0].fd = 0;
        pfds[0].events = POLLIN;

        pfds[1].fd = fd;
        pfds[1].events = POLLIN;

        poll(pfds, 2, -1);

        if (pfds[0].revents & POLLIN) {
            i = read(0, buf, 1024);
            if (!i) {
                printf("stdin closed\n");
                return 0;
            }
            write(1, buf, i);
        }

        if (pfds[1].revents & POLLIN) {
            i = read(fd, buf, 1024);
            if (!i) {
                printf("file closed\n");
                return 0;
            }
            write(1, buf, i);
        }
    }
}

If you compare Listing One to the select() test program, you’ll find very few differences.

The easiest way to run this program is to use a named pipe for the file specified on the command
line. Here’s how (assuming you have the source code for the above program in the file testpoll.c):

# make testpoll
# mknod mypipe p
# ./testpoll mypipe

Since the poor scalability of select() motivated us to look at poll(), we should
look at poll()‘s performance characteristics. The number of struct pollfd items passed
to it is going to have an effect on performance, which is reasonable, since the amount of work the
system call is doing is directly related to the number of items in that array. However, the value of
the individual file descriptors has no effect. This is a major win for poll() over select().

The other advantage to poll() is that the largest file descriptor it can handle is
limited by the size of an int, not by the size of another structure (such as an
fd_set). This lets poll() work with Linux kernels that allow hundreds of thousands of
file descriptors per process, which is extremely useful in some cases.

So much for multiplexed I/O — once you understand nonblocking I/O, select(), and
poll(), you’re pretty much covered. There is one other method for multiplexing I/O requests
— asynchronous I/O — which I’ll cover in a future article.


Since I have some room left in this column, I’ll talk about pipes. Pipes use the same API as
regular files — in fact, I suggested using a pipe when running the poll() program I wrote.
Since they look just like normal files, programs don’t normally care whether they’re writing to a
pipe or a file (or a terminal device for that matter). This flexibility lets most programs deal with
pipes quite well, a fact Linux takes advantage of with the pipe (|) shell construct. This replaces
one program’s standard output and another program’s standard input with a single pipe, which allows
the first program to talk to the second.

There are two types of pipes: named and unnamed. Simply enough, named pipes have a name in the
filesystem and unnamed pipes do not. An unnamed pipe disappears once no processes are reading from
it, while named pipes survive process termination (and even reboots) and must be removed with unlink().
We’re only interested in discussing unnamed pipes right now, but the two are quite similar. The first
thing to know is how to create a pipe:

int pipe(int fds[2]);

By passing an array of two file descriptors to pipe(), you create a pipe. The first file
descriptor in the array becomes the end of the pipe that can be read from, and the second is the
writable end of the pipe. Any data written to fds[1] will show up in fds[0] in the
same order it was written. (You’ll sometimes hear Linux pipes called FIFO pipes, because the First
data In is the First data Out.) Pipes are half-duplex — you can send data in only one direction. The
pipe() system call returns 0 on success, and nonzero on failure. The only reasons
pipe() would fail are a shortage of system resources, the parameter being invalid, or the
process having too many file descriptors already open.

Listing Two is a simple program that creates a pipe and sends a message down it.

Listing Two: Creating a Pipe and Sending a Message

#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char ** argv) {
    int fds[2];
    char message[80];
    char * string = "Hello World";
    int size;

    pipe(fds);

    write(fds[1], string, strlen(string));

    /* read sizeof(message) - 1 bytes to leave room for the trailing '\0' */
    size = read(fds[0], message, sizeof(message) - 1);
    message[size] = '\0';

    printf("Got the message '%s' from the pipe\n", message);

    return 0;
}

Pretty simple stuff really. The most interesting bit is that the read() call doesn’t wait
for (sizeof(message) – 1) bytes to show up in the pipe before returning; it returns the first
hunk of data that appears, no matter how short it is. A pipe can hold a relatively small amount of
data before it fills up; once it fills up, any write()s to the pipe will block until some
data is removed from the pipe by a reader. (Of course, the various multiplexed I/O techniques we’ve
introduced let programs avoid blocking if it becomes necessary!) I’ll jazz this program up a bit in
Listing Three by having it create a child process, which generates the message to send to the parent.

Listing Three: Interprocess Communication Through a Pipe

#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char ** argv) {
    int fds[2];
    char message[80];
    char * string = "Hello World";
    int size;

    pipe(fds);

    if (!fork()) {
        /* the child writes the message and exits */
        write(fds[1], string, strlen(string));
        return 0;
    }

    /* read sizeof(message) - 1 bytes to leave room for the trailing '\0' */
    size = read(fds[0], message, sizeof(message) - 1);
    message[size] = '\0';

    printf("Got the message '%s' from the pipe\n", message);

    return 0;
}

This isn’t much different from the first example, but it shows something that would be much more
difficult without pipes — interprocess communication. The two processes are communicating with each
other, albeit very simply.

There are two special cases to think about. The first is, what happens when the writer closes
its end of the pipe, and the reader tries to read? When there are no writers, and no data in the
pipe, read() requests on the pipe will return immediately with 0 bytes read (but not an error
condition). It’s also worth mentioning that both select() and poll() consider a pipe
with no writers ready for reading, since a read() on the file descriptor won’t block. The
next question is, what happens when you write to a pipe with no remaining readers? Let’s give that a
try (Listing Four).

Listing Four: Writing to a Pipe With No Readers

#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char ** argv) {
    int fds[2];
    char * string = "Hello World";

    pipe(fds);
    close(fds[0]);    /* close the only read end */

    write(fds[1], string, strlen(string));

    printf("Wrote the message into the pipe\n");

    return 0;
}

Here’s what happened when I ran it:

# gcc -Wall -o writetest writetest.c
# ./writetest
Broken pipe

Broken pipe? Where did that come from? And why didn’t I see the results of my printf()?
Obviously, something more drastic than a failed write() is occurring here.

Linux considers writing to a pipe with no readers a pretty serious matter, since data could get
lost. Rather than let a process blithely ignore error codes from write() and go about its
business, Linux sends the process a signal called SIGPIPE. Unless the process has
specifically asked to handle the SIGPIPE signal, Linux kills the process. The “Broken pipe”
message is from the shell, which is telling you that the process
died because of a broken pipe. We’ll talk much more about signals in the next column.

For now, you can get write() to fail with EPIPE rather than die by telling the kernel
you’d like to ignore the signal. Listing Five is a test program that does just that. You’ll
understand more about what’s going on in this program in another month.

Listing Five: Ignoring SIGPIPE

#include <stdio.h>
#include <string.h>
#include <sys/signal.h>
#include <unistd.h>

int main(int argc, char ** argv) {
    int fds[2];
    char * string = "Hello World";

    /* ignore SIGPIPE; write() will now fail with EPIPE instead
       of killing the process */
    signal(SIGPIPE, SIG_IGN);

    pipe(fds);
    close(fds[0]);    /* close the only read end */

    if (write(fds[1], string, strlen(string)) < 0)
        perror("write");
    else
        printf("Wrote the message into the pipe\n");

    return 0;
}