Multiplexed I/O with poll()

In the last column, I talked about how to read and write from multiple file
descriptors seemingly simultaneously with the select() function call. Using multiplexed I/O
lets programs block while waiting for notification that some file descriptors are ready for reading
or writing.

Using multiplexed I/O of this form is quite common. High-performance network servers —
including mail, workgroup, and Web servers — all must handle multiple simultaneous connections,
often thousands of connections. While select() looks like it would handle this kind of load
just fine, it has a fatal flaw that limits its performance. Figure 1 shows the prototype of the select() function.

Figure 1: A Prototype of the select() Function

int select(int numfd, fd_set * readfds, fd_set * writefds,
fd_set * exceptfds, struct timeval * tv);

Think about what happens when a program wants to use select() to block until either file
descriptor 1000 or 1001 is ready to be read from. It will have to call select() with a
numfd of 1002 and the appropriate value of readfds. Easy enough, right?

The Problem with select()

Now try to think about things from the kernel’s point of view. The kernel gets a request from a
user-space program to monitor some file descriptors for reading. It knows that the file descriptors
are smaller than 1002, but that’s all it knows. To figure out which file descriptors the program is
interested in, the kernel needs to check all 1,002 file descriptors, one by one.

Needless to say, this is terrible for performance. As most busy servers call select()
quite often, checking 1,002 file descriptors when the program really only cares about two is quite
inefficient. Additionally, if the file descriptors were instead 10001 and 10002, things would be
worse by an order of magnitude. All of this means that select() has a very poor property:

The performance of select() is directly related to the value of the file descriptors
being monitored. No other Linux system call depends on file descriptors’ values. (The number of file
descriptors monitored would be a more natural characteristic for performance to depend on.) As a
specific example of this, select() performs very poorly once the file descriptor values get large.

select() has one additional problem — how many file descriptors should fd_set be
able to hold? Ideally, it should be able to hold as many file descriptors as a process can have
open. Traditionally, Linux allowed only 1,024 file descriptors per process, so this was reasonable.
However, now that patches are available for the Linux kernel that raise that limit, fd_set
needs to grow. In fact, it now needs to hold hundreds of thousands of file descriptors. If
fd_set were actually enlarged, each fd_set structure would measure at least 12K in
size! Needless to say, handling fd_set structures of that size would have a noticeable
performance impact!

The 2.2 Linux kernel introduced the poll() system call. poll() is supported by
most Unix systems and is included in the Single Unix Specification. It duplicates the function of
select() but scales much better. Every glibc-based Linux system provides
poll(); those running on older kernels emulate poll() via select() in the C
library, but applications need not know the difference. Here’s what poll() looks like:

int poll(struct pollfd *ufds, unsigned int nfds, int timeout);

The first parameter, ufds, must point to an array of struct pollfd. Each element
in the array specifies a file descriptor that the program is interested in monitoring, and what
events on that file descriptor the program would like to know about. The next parameter,
nfds, tells the kernel how many total items are in the ufds array. The final
parameter, timeout, is the maximum time, in milliseconds, that the kernel should wait for the
activities the ufds array specifies. If timeout contains a negative value, the kernel
will wait forever. If nfds is zero, poll() becomes a simple millisecond sleep.

Building the ufds array is the most complicated part of using poll(). Each
struct pollfd looks like:

struct pollfd {
    int fd;         /* file descriptor */
    short events;   /* requested events */
    short revents;  /* returned events */
};

The fd element contains the file descriptor whose events this structure describes. The
events field is a bitmap of events the application would find interesting, and the
revents field is filled in on return from poll() with a list of the events that have
actually occurred on the file descriptor. Programs can monitor one or more of POLLIN,
POLLOUT, or POLLPRI. The first two monitor whether there is data to read and whether
writing will block. The last tells the application if there is out-of-band data available,
which can occur on sockets. These values can be logically ORed together, letting the program
monitor a single file descriptor for both reading and writing.

On return from poll(), the revents field is filled with the same POLLIN,
POLLOUT, and POLLPRI values, indicating which of the “interesting” events the file
descriptor is now ready for. The following bits may also be set in revents:

POLLERR: This is set if an error condition has occurred on the file descriptor.

POLLHUP: This is set when the file descriptor refers to a terminal that has been hung up.

POLLNVAL: If fd doesn’t refer to a valid (open) file descriptor, this value is set.

poll() returns -1 if an error occurred, 0 if the timeout was reached, or the
number of file descriptors that have nonzero revents fields.

Listing One is a sample program that performs exactly the same function as the
select() test program in last month’s Compile Time.

Listing One: The poll() System Call at Work

#include <fcntl.h>
#include <stdio.h>
#include <sys/poll.h>
#include <sys/time.h>
#include <unistd.h>

/* For simplicity, all error checking has been left out */

int main(int argc, char ** argv) {
    int fd;
    char buf[1024];
    int i;
    struct pollfd pfds[2];

    fd = open(argv[1], O_RDONLY);

    while (1) {
        pfds[0].fd = 0;
        pfds[0].events = POLLIN;

        pfds[1].fd = fd;
        pfds[1].events = POLLIN;

        poll(pfds, 2, -1);

        if (pfds[0].revents & POLLIN) {
            i = read(0, buf, 1024);
            if (!i) {
                printf("stdin closed\n");
                return 0;
            }
            write(1, buf, i);
        }

        if (pfds[1].revents & POLLIN) {
            i = read(fd, buf, 1024);
            if (!i) {
                printf("file closed\n");
                return 0;
            }
            write(1, buf, i);
        }
    }
}

If you compare Listing One to the select() test program, you’ll find very few differences.

The easiest way to run this program is to use a named pipe for the file specified on the command
line. Here’s how (assuming you have the source code for the above program in the file testpoll.c):

# make testpoll
# mknod mypipe p
# ./testpoll mypipe

Since the poor scalability of select() motivated us to look at poll(), we should
look at poll()‘s performance characteristics. The number of struct pollfd items passed
to it is going to have an effect on performance, which is reasonable, since the amount of work the
system call is doing is directly related to the number of items in that array. However, the value of
the individual file descriptors has no effect. This is a major win for poll() over select().

The other advantage to poll() is that the largest file descriptor it can handle is
limited by the size of an int, not by the size of another structure (such as an
fd_set). This lets poll() work with Linux kernels that allow hundreds of thousands of
file descriptors per process, which is extremely useful in some cases.

So much for multiplexed I/O — once you understand nonblocking I/O, select(), and
poll(), you’re pretty much covered. There is one other method for multiplexing I/O requests
— asynchronous I/O — which I’ll cover in a future article.


Since I have some room left in this column, I’ll talk about pipes. Pipes use the same API as
regular files — in fact, I suggested using a pipe when running the poll() program I wrote.
Since they look just like normal files, programs don’t normally care whether they’re writing to a
pipe or a file (or a terminal device for that matter). This flexibility lets most programs deal with
pipes quite well, a fact Linux takes advantage of with the pipe (|) shell construct. This replaces
one program’s standard output and another program’s standard input with a single pipe, which allows
the first program to talk to the second.

There are two types of pipes: named and unnamed. Simply enough, named pipes have a name in the
filesystem and unnamed pipes do not. An unnamed pipe disappears once no processes are reading from
it, while named pipes survive process termination (and even reboots) and must be removed with unlink().
We’re only interested in discussing unnamed pipes right now, but the two are quite similar. The first
thing to know is how to create a pipe:

int pipe(int fds[2]);

By passing an array of two file descriptors to pipe(), you create a pipe. The first file
descriptor in the array becomes the end of the pipe that can be read from, and the second is the
writable end of the pipe. Any data written to fds[1] will show up in fds[0] in the
same order it was written. (You’ll sometimes hear Linux pipes called FIFO pipes, because the First
data In is the First data Out.) Pipes are half-duplex — you can send data in only one direction. The
pipe() system call returns 0 on success, and nonzero on failure. The only reasons
pipe() would fail are a shortage of system resources, the parameter being invalid, or the
process having too many file descriptors already open.

Listing Two is a simple program that creates a pipe and sends a message down it.

Listing Two: Creating a Pipe and Sending a Message

#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char ** argv) {
    int fds[2];
    char message[80];
    char * string = "Hello World";
    int size;

    pipe(fds);

    write(fds[1], string, strlen(string));

    /* read sizeof(message) - 1 bytes to leave room for the trailing '\0' */
    size = read(fds[0], message, sizeof(message) - 1);
    message[size] = '\0';

    printf("Got the message '%s' from the pipe\n", message);

    return 0;
}

Pretty simple stuff really. The most interesting bit is that the read() call doesn’t wait
for (sizeof(message) – 1) bytes to show up in the pipe before returning; it returns the first
hunk of data that appears, no matter how short it is. A pipe can hold a relatively small amount of
data before it fills up; once it fills up, any write()s to the pipe will block until some
data is removed from the pipe by a reader. (Of course, the various multiplexed I/O techniques we’ve
introduced let programs avoid blocking if it becomes necessary!) I’ll jazz this program up a bit in
Listing Three by having it create a child process, which generates the message to send to the parent.

Listing Three: Interprocess Communication Through a Pipe

#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char ** argv) {
    int fds[2];
    char message[80];
    char * string = "Hello World";
    int size;

    pipe(fds);

    if (!fork()) {
        /* the child writes the message and exits */
        write(fds[1], string, strlen(string));
        return 0;
    }

    /* read sizeof(message) - 1 bytes to leave room for the trailing '\0' */
    size = read(fds[0], message, sizeof(message) - 1);
    message[size] = '\0';

    printf("Got the message '%s' from the pipe\n", message);

    return 0;
}

This isn’t much different from the first example, but it shows something that would be much more
difficult without pipes — interprocess communication. The two processes are communicating with each
other, albeit very simply.

There are two special cases to think about. The first is, what happens when the writer closes
its end of the pipe, and the reader tries to read? When there are no writers, and no data in the
pipe, read() requests on the pipe will return immediately with 0 bytes read (but not an error
condition). It’s also worth mentioning that both select() and poll() consider a pipe
with no writers ready for reading, since a read() on the file descriptor won’t block. The
next question is, what happens when you write to a pipe with no remaining readers? Let’s give that a
try (Listing Four).

Listing Four: Writing to a Pipe With No Readers

#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char ** argv) {
    int fds[2];
    char * string = "Hello World";

    pipe(fds);
    close(fds[0]);    /* close the only read end */

    write(fds[1], string, strlen(string));

    printf("Wrote the message into the pipe\n");

    return 0;
}

Here’s what happened when I ran it:

# gcc -Wall -o writetest writetest.c
# ./writetest
Broken pipe

Broken pipe? Where did that come from? And why didn’t I see the results of my printf()?
Obviously, something more drastic than a failed write() is occurring here.

Linux considers writing to a pipe with no readers a pretty serious matter, since data could get
lost. Rather than let a process blithely ignore error codes from write() and go about its
business, Linux sends the process a signal called SIGPIPE. Unless the process has
specifically asked to handle the SIGPIPE signal, Linux kills the process. The “Broken pipe”
message is from the shell, which is telling you that the process
died because of a broken pipe. We’ll talk much more about signals in the next column.

For now, you can get write() to fail with EPIPE rather than die by telling the kernel
you’d like to ignore the signal. Listing Five is a test program that does just that. You’ll
understand more about what’s going on in this program in another month.

Listing Five: Ignoring SIGPIPE

#include <stdio.h>
#include <string.h>
#include <sys/signal.h>
#include <unistd.h>

int main(int argc, char ** argv) {
    int fds[2];
    char * string = "Hello World";

    /* ignore SIGPIPE; write() will now fail with EPIPE instead
       of killing the process */
    signal(SIGPIPE, SIG_IGN);

    pipe(fds);
    close(fds[0]);    /* close the only read end */

    if (write(fds[1], string, strlen(string)) < 0)
        perror("write");
    else
        printf("Wrote the message into the pipe\n");

    return 0;
}