How to Hack Your Server

Writing Apache Modules

Many times faster than CGI, the Apache API gives developers direct access to the server core. This is the first in a series of articles dealing with the Apache API.

by Sameer Parekh

The Apache Group designed the Apache web server with modularity in mind. When they rewrote the server core for the 0.8.x release of Apache, they built into the core an extensible module API in order to provide a consistent interface for adding functionality. They separated the bulk of the server's operations into a set of modules so the server core would be a minimal set of operations.

The group designed the module structure with a number of motivations. First, the Apache Group is seriously concerned with server performance. By abstracting most of the server's operations into separate modules, the Apache Group made it possible for server administrators to easily remove modules performing functions they don't need, improving their application's performance. Second, third-party developers can easily develop for Apache using the extensible module API, adding to its general functionality. Apache originally grew from a series of patches to the then-popular NCSA Httpd server. With a module API, functionality can now be added to the server without an ugly set of patches.

Finally, in addition to providing incredible flexibility, the Apache API lets web engineers build applications that previously had to use the slower CGI system. Netscape Communications Corporation has done some benchmark tests and found that using a server API provides a significant performance improvement over the CGI interface.

In this article, we provide an introduction to programming for the Apache Server API. We dissect an existing module, config_log_module, which provides web server administrators a configurable alternative to the standard NCSA Common Log Format.

The config_log_module provides server administrators with the ability to create custom log lines, using a ``printf'' style configuration directive. The ``LogFormat'' directive is used to specify the exact format of the log line. For example:

LogFormat "%h %l %u %t \"%r\" %s %b" 

is the LogFormat directive used to emulate the standard common log format. The initial comments in the mod_log_config.c source file describe all the LogFormat directives. (See Listing 1.)
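To make the ``printf''-style idea concrete, here is a minimal sketch in C of how such a format string could be expanded into a log line. It is not the module's own code: the demo_request structure and the demo_format_log function are hypothetical names invented for this illustration, and only the %h, %u, %s and %b directives are handled.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the data a logger sees. */
typedef struct {
    const char *remote_host;  /* %h */
    const char *remote_user;  /* %u */
    int status;               /* %s */
    long bytes_sent;          /* %b */
} demo_request;

/* Expand fmt into out (at most outlen bytes, always NUL-terminated).
 * Unknown directives are copied through unchanged. */
void demo_format_log(const char *fmt, const demo_request *r,
                     char *out, size_t outlen)
{
    size_t used = 0;

    if (outlen == 0)
        return;
    out[0] = '\0';

    while (*fmt != '\0' && used + 1 < outlen) {
        char piece[64];

        if (fmt[0] == '%' && fmt[1] != '\0') {
            switch (fmt[1]) {
            case 'h': snprintf(piece, sizeof piece, "%s", r->remote_host); break;
            case 'u': snprintf(piece, sizeof piece, "%s", r->remote_user); break;
            case 's': snprintf(piece, sizeof piece, "%d", r->status); break;
            case 'b': snprintf(piece, sizeof piece, "%ld", r->bytes_sent); break;
            default:  snprintf(piece, sizeof piece, "%%%c", fmt[1]); break;
            }
            fmt += 2;
        } else {
            piece[0] = *fmt++;
            piece[1] = '\0';
        }

        /* Append the piece, truncating if the buffer is nearly full. */
        strncat(out, piece, outlen - used - 1);
        used = strlen(out);
    }
}
```

A caller would fill in a demo_request, pass a format such as "%h %u %s %b" and a buffer, and write the resulting line to the log. The real module parses the format once at configuration time rather than on every request, which this sketch does not attempt.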

The core data structure in a module is the ``module'' structure. When building a module, the application developer defines this structure and fills it with pointers to the functions the server should call in order to invoke the module's operations. The module structure for mod_log_config.c is as follows:

module config_log_module = {
 init_config_log,	/* initializer */
 NULL,			/* create per-dir config */
 NULL,			/* merge per-dir config */
 make_config_log_state,	/* server config */
 NULL,			/* merge server config */
 config_log_cmds,	/* command table */
 NULL,			/* handlers */
 NULL,			/* filename translation */
 NULL,			/* check_user_id */
 NULL,			/* check auth */
 NULL,			/* check access */
 NULL,			/* type_checker */
 NULL,			/* fixups */
 config_log_transaction	/* logger */
};

The NULL entries in this table refer to portions of the server API which the config_log_module does not use. We do not describe those functions in this article.
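The idea of a table of function pointers with NULL for unused slots can be sketched in a few lines of C. This is an illustration under simplifying assumptions, not the server's own dispatch code: demo_module, demo_run_loggers, demo_count_logger and demo_logged are hypothetical names, and the table has only two slots.

```c
#include <stddef.h>

/* Hypothetical two-slot module table; the real structure has many more. */
typedef struct {
    const char *name;
    int (*type_checker)(const char *uri);  /* may be NULL */
    int (*logger)(const char *uri);        /* may be NULL */
} demo_module;

/* Example hook: counts the requests it was asked to log. */
int demo_logged = 0;

int demo_count_logger(const char *uri)
{
    (void)uri;               /* a real logger would write a log line */
    demo_logged++;
    return 0;
}

/* Call the logger slot of every module that provides one.
 * Returns how many loggers actually ran. */
int demo_run_loggers(const demo_module **modules, size_t count,
                     const char *uri)
{
    int ran = 0;
    size_t i;

    for (i = 0; i < count; i++) {
        if (modules[i]->logger == NULL)
            continue;        /* this module opted out of the phase */
        modules[i]->logger(uri);
        ran++;
    }
    return ran;
}
```

A module that only logs, like the one discussed here, would fill in the logger slot and leave the others NULL, so the dispatcher simply skips it in every other phase.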

The LogFormat directive is defined in the ``command table'', ``config_log_cmds'', which is as follows:

command_rec config_log_cmds[] = {
{ "TransferLog", set_config_log, NULL, RSRC_CONF,
      TAKE1, "the filename of the access log" },
{ "LogFormat", log_format, NULL, RSRC_CONF, TAKE1,
      "a log format string (see docs)" },
{ NULL }
};

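This table is an array of command_rec structures, one for each configuration directive, terminated by an entry whose name is NULL. Each entry has the following fields, in order: the name of the directive; the function the server calls when it encounters the directive; an optional data pointer handed to that function (NULL here); a flag stating where the directive may appear (RSRC_CONF, the server configuration files); the argument format (TAKE1 here); and a usage message for the directive.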

The config_log_cmds structure contains two directives: TransferLog, which names the file in which the log is stored, and LogFormat, which gives the format of each log line. By specifying TAKE1 as the format of the configuration option, the module directs the Apache configuration core to look for one, and only one, option following the configuration directive. Other possible settings for the configuration format include TAKE2 and FLAG, which tell the core to look for two options or to accept the directive as an on/off switch, respectively. We use only the TAKE1 format in this article.
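The way a configuration core might use such a table can be sketched as follows. This is a simplified illustration, not Apache's parser: the demo_ names and the DEMO_ constants are hypothetical, the table keeps only two of the fields, and directive names are compared case-sensitively for brevity.

```c
#include <stddef.h>
#include <string.h>

enum demo_args_how { DEMO_TAKE1, DEMO_TAKE2, DEMO_FLAG };

typedef struct {
    const char *name;             /* directive name; NULL ends the table */
    enum demo_args_how args_how;  /* how many arguments it expects */
} demo_command;

const demo_command demo_cmds[] = {
    { "TransferLog", DEMO_TAKE1 },
    { "LogFormat",   DEMO_TAKE1 },
    { NULL,          DEMO_TAKE1 }
};

/* Return NULL if the directive is known and was given an acceptable
 * number of arguments, or an error message otherwise. */
const char *demo_check_directive(const demo_command *table,
                                 const char *name, int argc,
                                 const char *first_arg)
{
    const demo_command *c;

    for (c = table; c->name != NULL; c++) {
        if (strcmp(c->name, name) != 0)
            continue;
        switch (c->args_how) {
        case DEMO_TAKE1:
            return argc == 1 ? NULL : "takes exactly one argument";
        case DEMO_TAKE2:
            return argc == 2 ? NULL : "takes exactly two arguments";
        case DEMO_FLAG:
            if (argc == 1 && (strcmp(first_arg, "on") == 0 ||
                              strcmp(first_arg, "off") == 0))
                return NULL;
            return "must be on or off";
        }
    }
    return "unknown directive";
}
```

The NULL-named entry at the end of the table is what stops the loop, which is why the real command table also ends with { NULL }.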

Now that we have seen how the Apache configuration core understands the module-specific configuration, we will look at how the core processes and stores the configuration data internally so that the module may access this data when necessary.

The config_log_module stores its module-specific configuration options in a structure. Modules can define for themselves how they store their configuration options. Some modules that need only one option may use just a simple null-terminated string rather than a C structure. The config_log_module's structure, known as config_log_state, is typedef'ed as follows:

typedef struct {
    char *fname;
    array_header *format;
    int log_fd;
} config_log_state; 

The Apache API requires one function in order to properly allocate memory for the configuration structure. The comments for the module structure define this function as the ``server config'' function. (There also exists a ``per-dir config'' function, which is not used by config_log_module.)

The config_log_module uses the make_config_log_state() function to allocate memory for the data structure:

void *make_config_log_state (pool *p, server_rec *s)
{
  config_log_state *cls =
    (config_log_state *)palloc (p, sizeof (config_log_state));

  cls->fname = NULL;
  cls->format = NULL;
  cls->log_fd = -1;

  return (void *)cls;
}

The make_config_log_state function takes, as arguments, a pointer to the ``Apache memory pool'' and a pointer to the server-wide configuration structure. Apache uses an internal memory allocation system to prevent memory leaks; we do not describe it in detail here.

Very simply, make_config_log_state allocates enough memory for the module configuration data structure, initializes its fields to empty values (NULL pointers and a file descriptor of -1), and returns a pointer to the newly allocated memory. Notice that the memory allocation uses ``palloc'', which is Apache's internal memory allocation function. A module should never use ``malloc'' to allocate memory. All memory allocations should be made using Apache's set of ``pool'' memory allocation functions. (Apache internally takes care of deallocating such memory, which is why there is no ``pfree''.)
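For readers unfamiliar with pool allocation, the following C sketch shows the general technique: every allocation is linked into its pool, and the whole pool is released in one sweep, so there is no per-object free. It is an illustration only and makes no claim about how Apache implements its pools; demo_pool, demo_palloc and demo_clear_pool are hypothetical names.

```c
#include <stdlib.h>

/* Header placed in front of every allocation. The long double member
 * is there only to keep the memory after the header well aligned. */
typedef union demo_block {
    union demo_block *next;   /* previously allocated block */
    long double force_align;
} demo_block;

typedef struct {
    demo_block *blocks;       /* every allocation made from this pool */
    int count;                /* number of live allocations */
} demo_pool;

/* Allocate size bytes that live as long as the pool does. */
void *demo_palloc(demo_pool *p, size_t size)
{
    demo_block *b = malloc(sizeof *b + size);

    if (b == NULL)
        return NULL;
    b->next = p->blocks;
    p->blocks = b;
    p->count++;
    return b + 1;             /* usable memory starts after the header */
}

/* Release everything allocated from the pool in one sweep. */
void demo_clear_pool(demo_pool *p)
{
    while (p->blocks != NULL) {
        demo_block *next = p->blocks->next;
        free(p->blocks);
        p->blocks = next;
        p->count--;
    }
}
```

Code that allocates from such a pool never frees individual objects. Whoever owns the pool clears it when the work it was created for is finished, which is the property that makes leaks hard to introduce.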

Once the memory for all the module's configuration structures is allocated, the server parses the configuration files and calls the functions named in the command_rec structure for that module. Setting the LogFormat, for example, is done with the log_format function:

char *log_format (cmd_parms *cmd, void *dummy, char *arg)
{
  char *err_string = NULL;
  config_log_state *cls =
    get_module_config (cmd->server->module_config,
                       &config_log_module);

  cls->format =
    parse_log_string (cmd->pool, arg, &err_string);
  return err_string;
}

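The second argument to the function carries the module's per-directory configuration. Since config_log_module keeps no per-directory configuration, the argument isn't used, so we call it ``dummy'' in the function definition. The single option that the TAKE1 directive accepts arrives in the third argument, ``arg''.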

The function first uses the standard get_module_config function to retrieve the data structure that make_config_log_state allocated and initialized for this module. It then passes the option given to LogFormat, which arrives in ``arg'', to parse_log_string, which parses the format string into its component parts and sets err_string to an error message if the directive is badly formatted. The parsed result is assigned to the format field of the configuration structure. The function returns a NULL pointer on success or, if an error has occurred, a pointer to a character string containing an error message, which is printed to STDERR.
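One common way to implement this kind of lookup is a vector of opaque pointers with one slot per module. The C sketch below illustrates that technique under stated assumptions: the demo_ types and functions are hypothetical, the slot count is fixed, and nothing here is taken from the server source.

```c
#include <stddef.h>

#define DEMO_MAX_MODULES 8

/* Hypothetical module descriptor: just a slot number in the vector. */
typedef struct {
    int module_index;
} demo_module_id;

/* Hypothetical per-server config vector: one opaque pointer per module. */
typedef struct {
    void *slots[DEMO_MAX_MODULES];
} demo_config_vector;

/* Store a module's private configuration in its own slot. */
void demo_set_module_config(demo_config_vector *v,
                            const demo_module_id *m, void *conf)
{
    v->slots[m->module_index] = conf;
}

/* Fetch it back; the caller casts the result to its own type. */
void *demo_get_module_config(const demo_config_vector *v,
                             const demo_module_id *m)
{
    return v->slots[m->module_index];
}
```

Because each slot is a void pointer, the core never needs to know what a module stores there. Only the module itself interprets the pointer, exactly as log_format does when it assigns the result to a config_log_state pointer.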

Finally, once both the TransferLog and LogFormat directives have been processed, the server is ready to initialize itself. The config_log_module's initialization requires that it open a file on disk (or, if the ``| ...'' format was passed to TransferLog, open a pipe to a child process) for logging purposes:

void init_config_log (server_rec *s, pool *p)
{
  /* First, do "physical" server, which gets
   * default log fd and format for the virtual
   * servers, if they don't override...
   */
  config_log_state *default_conf =
    open_config_log (s, p, NULL);

  /* Then, virtual servers */
  for (s = s->next; s; s = s->next)
    open_config_log (s, p, default_conf);
}

init_config_log calls open_config_log() for every server (the main server and all virtual hosts) being run with Apache. It walks the linked list of servers through the next pointer in the server_rec structure, calling open_config_log for each one, as shown in Listing 2.
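The pattern of handling the main server first and then letting each virtual host fall back to its settings can be sketched in C as follows. The demo_server structure and demo_apply_default_log function are hypothetical and track only a log file name; they stand in for the idea, not for the code in Listing 2.

```c
#include <stddef.h>

typedef struct demo_server {
    const char *log_fname;        /* NULL means "not set for this host" */
    struct demo_server *next;     /* next virtual host, or NULL */
} demo_server;

/* Give every virtual host without its own log file the main server's.
 * Returns the number of hosts that inherited the default. */
int demo_apply_default_log(demo_server *main_server)
{
    int inherited = 0;
    demo_server *s;

    /* The main server is the head of the list; start with its successor. */
    for (s = main_server->next; s != NULL; s = s->next) {
        if (s->log_fname == NULL) {
            s->log_fname = main_server->log_fname;
            inherited++;
        }
    }
    return inherited;
}
```

A virtual host that sets its own log file keeps it, while one that sets nothing ends up logging to the same place as the main server.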

The open_config_log function does the file opening and process spawning necessary for storing the logs, which are written in accordance with the LogFormat directive. (The TransferLog directive allows a pipe in the form of ``| ...'' to be executed, to which the log lines are sent.)
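Telling a piped log from an ordinary file comes down to checking the first character of the TransferLog argument. A minimal C sketch of that check, using a hypothetical helper named demo_piped_log_command, might look like this:

```c
#include <stddef.h>

/* Return the command to run if spec names a piped log ("| command"),
 * or NULL if spec is an ordinary file name. */
const char *demo_piped_log_command(const char *spec)
{
    if (spec == NULL || spec[0] != '|')
        return NULL;
    spec++;                          /* skip the '|' */
    while (*spec == ' ' || *spec == '\t')
        spec++;                      /* skip blanks before the command */
    return spec;
}
```

When the helper returns a command, the caller would spawn it and write log lines to its standard input. When it returns NULL, the caller would open the named file for appending instead.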

Once everything is set up, the server can begin to accept requests. As this module is only a logging module, it doesn't use any of the API functionality other than the ``logger'' function.

The logger function, like the rest of the functions used in the process of handling a request, takes as an argument a pointer to the request_rec data structure. request_rec stores all the data pertaining to a particular request made to the server. The logger function uses the data stored within this structure to find the information it needs to log to the file.

The request_rec structure is defined as follows:

struct request_rec {

  pool *pool;
  conn_rec *connection;
  server_rec *server;

  /* If we wind up getting redirected,
   * pointer to the request we redirected to.
   */
  request_rec *next;

  /* If this is an internal redirect,
   * pointer to where we redirected *from*.
   */
  request_rec *prev;

  /* If this is a sub_request (see request.h)
   * pointer back to the main request.
   */
  request_rec *main;

  /* Info about the request itself... we
   * begin with stuff that only
   * protocol.c should ever touch...
   */

  char *the_request;   /* First line of
                        * request, so we can log it
                        */
  int assbackwards;    /* HTTP/0.9, "simple"
                        * request
                        */
  int proxyreq;        /* A proxy request */
  int header_only;     /* HEAD request, as opposed
                        * to GET
                        */
  char *protocol;      /* Protocol, as given to
                        * us, or HTTP/0.9
                        */
  char *status_line;   /* Status line, if set
                        * by script
                        */
  int status;          /* In any case */

  /* Request method, two ways; also,
   * protocol, etc..  Outside of protocol.c,
   * look, but don't touch.
   */
  char *method;        /* GET, HEAD, POST, etc. */
  int method_number;   /* M_GET, M_POST, etc. */

  int sent_bodyct;     /* byte count in stream
                        * is for body
                        */

  /* MIME header environments, in and out.
   * Also, an array containing environment
   * variables to be passed to subprocesses, so
   * people can write modules to add to that
   * environment.
   * The difference between headers_out and
   * err_headers_out is that the latter are printed
   * even on error and persist across internal
   * redirects (so the headers printed for
   * ErrorDocument handlers will have them).
   * The 'notes' table is for notes from one module
   * to another, with no other set purpose in
   * mind...
   */
  table *headers_in;
  table *headers_out;
  table *err_headers_out;
  table *subprocess_env;
  table *notes;

  char *content_type;  /* Break these out --- we
                        * dispatch on 'em
                        */
  char *handler;       /* What we *really* dispatch
                        * on
                        */
  char *content_encoding;
  char *content_language;
  int no_cache;

  /* What object is being requested (either
   * directly, or via include
   * or content-negotiation mapping).
   */
  char *uri;           /* complete URI for a proxy req, or
                        * URL path for a non-proxy req
                        */
  char *filename;
  char *path_info;
  char *args;          /* QUERY_ARGS, if any */

  struct stat finfo;   /* ST_MODE set to zero
                        * if no such file
                        */

  /* Various other config info which may change with
   * .htaccess files. These are config vectors, with
   * one void* pointer for each module (the thing
   * pointed to being the module's business).
   */
  void *per_dir_config;  /* Options set in config
                          * files, etc.
                          */
  void *request_config;  /* Notes on *this* request */

  /* a linked list of the configuration directives in
   * the .htaccess files accessed by this request.
   * N.B. always add to the head of the list, _never_
   * to the end.  That way, a sub request's list can
   * (temporarily) point to a parent's list
   */
  const struct htaccess_result *htaccess;
};

This article does not go into the details of how the config_log_module actually does its logging and parses the LogFormat format string, as that is just standard C. The config_log_transaction function prototype, however, is as follows:

int config_log_transaction(request_rec *r); 

The function depends on the request_rec structure, some elements of which we describe here:

pool is a pointer to the pool of memory from which allocations should be made while processing this one HTTP request. After processing this request, all allocations made from this pool are freed.
connection is a pointer to the conn_rec structure, which describes details of the connection, such as the local socket address, remote socket address, etc. We do not discuss the conn_rec in detail in this article.
server is a pointer to the server_rec, which points to all the configuration information specific to the server (i.e., either the main server or one of the virtualhost servers) under which this request was made. Most important within the server_rec structure is the ``module_config'' pointer, which is used by the get_module_config function to return module-specific configuration directives.
main, next and prev point to the chain of request_rec structures which were processed through internal redirects for the current single HTTP request. Some requests may result in an internal redirect, which creates a separate logical request even though the client sent only a single HTTP request.
the_request is a string which merely contains the first line of the request (e.g., GET /index.html HTTP/1.0).
assbackwards is a boolean flag that tells whether or not we're processing an old-style HTTP/0.9 ``simple'' request.
status is the HTTP status return code pertaining to the request (e.g., 200 for ``Document Follows'' or 404 for ``Not Found''). httpd.h contains a list of all the available status codes the server currently supports.
headers_in is a pointer to a ``table'' structure (the table structure is not described in this article) which lists all of the incoming HTTP headers which the client sent to the server.
headers_out is a ``table'' pointer to the headers the server sends back to the client.
subprocess_env is another ``table'' pointer with all of the environment variables that are set for CGI, SSI, etc.
content_type contains the MIME Content-type for use with dispatching the actual request handlers. This Content-type may be an actual MIME type, or it may be an internal type (e.g., CGI_MAGIC_TYPE) used to dispatch the request to a specific module's handler based on various criteria.
uri contains the URL path for a given request. (A request GET /index.html HTTP/1.0 would store /index.html in the uri variable within request_rec.)
filename is the full path to the file, if the request has translated to an actual file in the file system. In some instances (the proxy module, for example), the ``filename'' is not a representation of a file in the file system, but a proxy URL, perhaps.
finfo is a ``stat'' structure with information about the file, if it exists in the file system. If it doesn't, the server sets finfo.st_mode equal to zero.
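A logger that wants to record both what the client originally asked for and what the server finally delivered has to follow the prev and next pointers to the two ends of the redirect chain. The C sketch below shows that walk on a cut-down structure; demo_req and the two helper functions are hypothetical and carry only the fields needed for the illustration.

```c
#include <stddef.h>

typedef struct demo_req {
    const char *uri;
    struct demo_req *prev;   /* request we were redirected from */
    struct demo_req *next;   /* request we were redirected to */
} demo_req;

/* First request in the chain: what the client actually asked for. */
const demo_req *demo_original_request(const demo_req *r)
{
    while (r->prev != NULL)
        r = r->prev;
    return r;
}

/* Last request in the chain: what the server finally answered with. */
const demo_req *demo_final_request(const demo_req *r)
{
    while (r->next != NULL)
        r = r->next;
    return r;
}
```

Given any request in the chain, the first helper yields the structure whose request line belongs in the log, and the second yields the one that holds the outcome of the final internal redirect.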

That wraps up this issue's description of an existing module. To take a look at the full source to this module, grab the Apache source from and look at mod_log_config.c. Next time, we'll write a new module from scratch.

Sameer Parekh is the President of Community ConneXion, Inc., the Internet Privacy Provider. Parekh has been listed by Newsweek as one of the ``50 People Who Matter Most'' on the Internet. In addition to making available within the United States and Canada a commercial SSL-encrypting version of the Apache server, his company provides privacy on the Internet to those who need it with an extensive array of services, including anonymous mail and web accounts. Information is available on their web pages at *