Capturing Input/Output of Another Process in C

In my travels in C programming, I periodically need to run another process and redirect its standard output back to the first process. While this is straightforward to do, it is not always obvious how. This article explains how it is done in three sections.

  • High Level Overview
  • Explanation of each line
  • Code Sample

High Level Overview

  • Create three pipe(2)s for standard input, output and error
  • fork(2) the process
  • The child process runs dup2(2) on the pipes to redirect its own standard input, output and error to them.
  • The parent process reads from the pipe(2) descriptors as needed.

Explanation

A pipe(2) is a Unix system call API that creates two file descriptors. Data written to one end of the pipe can be read by the other. It provides simple FIFO functionality without the need to maintain an associated data structure. The process should initially create three pipe(2) file descriptor pairs for standard input, output and error. For our purposes, it will be used to bridge communication between the parent and second process.
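
To make the two ends of a pipe concrete, here is a minimal sketch that writes a short message into one end and reads it back from the other, all within a single process. The helper name pipe_roundtrip is my own invention for illustration, and the sketch assumes the message is small enough to fit in the pipe's buffer so the write does not block.

```c
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write msg into a pipe and read it back out of
 * the other end. Returns the number of bytes read, or -1 on error.
 */
ssize_t
pipe_roundtrip(const char *msg, char *out, size_t outlen)
{
	int fds[2];	/* fds[0] is the read end, fds[1] is the write end */
	ssize_t n;

	if (outlen == 0 || pipe(fds) == -1)
		return -1;

	/* Assumes msg fits in the pipe buffer, so write(2) will not block */
	if (write(fds[1], msg, strlen(msg)) == -1) {
		close(fds[0]);
		close(fds[1]);
		return -1;
	}
	close(fds[1]);	/* closing the write end lets the reader see EOF */

	n = read(fds[0], out, outlen - 1);
	close(fds[0]);
	if (n == -1)
		return -1;
	out[n] = '\0';	/* read(2) does not NUL-terminate the buffer */
	return n;
}
```

Calling pipe_roundtrip("hello", buf, sizeof(buf)) should leave "hello" in buf. In the real program the writer and the reader are different processes, which is where fork(2) comes in.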

Next, our program will run a standard Unix fork(2), which creates a copy of the running process, including its stack and machine code, but with a different process ID. In the parent, fork(2) returns the process ID (pid) of the child, while in the child it returns 0.
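
The two return values can be seen in a small sketch. The function name fork_and_wait is hypothetical; it forks, has the child exit immediately with a given code, and has the parent collect that code with waitpid(2).

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child that exits with the given code and
 * return the code the parent observes, or -1 on error.
 */
int
fork_and_wait(int code)
{
	pid_t pid;
	int status;

	pid = fork();
	if (pid == -1)
		return -1;	/* fork failed, no child was created */

	if (pid == 0) {
		/* Child: fork(2) returned 0 */
		_exit(code);
	}

	/* Parent: fork(2) returned the child's pid */
	if (waitpid(pid, &status, 0) == -1)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, fork_and_wait(7) should return 7 in the parent.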

dup2(2)’s documentation says it “duplicates” a file descriptor, but I found this to be a misleading misnomer. In layman’s terms, dup2(oldfd, newfd) closes whatever newfd originally referred to and makes newfd point at the same thing as oldfd, so any later reads or writes on newfd are redirected to oldfd’s target. For our uses, the child process will use dup2(2) to redirect its standard input, output and error to the pipe(2) descriptors.
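
As a sketch of that redirection inside a single process, the hypothetical helper below temporarily points descriptor 1 (standard output) at a pipe, writes a message to standard output, restores the original descriptor and then reads the message back from the pipe. It assumes the message is short enough to fit in the pipe buffer.

```c
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: capture what this process writes to its own
 * stdout. Returns the number of bytes captured, or -1 on error.
 */
ssize_t
capture_own_stdout(const char *msg, char *out, size_t outlen)
{
	int fds[2];
	int saved;
	ssize_t n;

	if (outlen == 0 || pipe(fds) == -1)
		return -1;

	saved = dup(STDOUT_FILENO);	/* keep a handle on the real stdout */
	if (saved == -1 || dup2(fds[1], STDOUT_FILENO) == -1) {
		if (saved != -1)
			close(saved);
		close(fds[0]);
		close(fds[1]);
		return -1;
	}
	close(fds[1]);	/* descriptor 1 now keeps the write end open */

	/* This write goes into the pipe, not to the terminal */
	n = write(STDOUT_FILENO, msg, strlen(msg));

	/* Restore stdout; this also closes the last write end of the pipe */
	dup2(saved, STDOUT_FILENO);
	close(saved);

	if (n != -1)
		n = read(fds[0], out, outlen - 1);
	close(fds[0]);
	if (n == -1)
		return -1;
	out[n] = '\0';
	return n;
}
```

The child process in the full example does the same thing, except that it never restores the descriptors, so whatever program it later executes inherits the redirection.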

At this point, the child process will run execl(3), which replaces the program running in the current process with a new one. This is different from spawning a new process, such as through system(3), though the effect would be the same. Now, because of the dup2(2) calls, any reads or writes to standard input, output or error will be redirected to the respective pipe(2)s.

On the other end, the parent process will use the other end of the pipe(2) to read or write to the child process, thus accomplishing our objective.

Example Code

#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SAMPLE_STRING	"Bismillah"

int main() {
	int fdstdin[2];
	int fdstdout[2];
	int fdstderr[2];
	int pid;

	pipe(fdstdin);
	pipe(fdstdout);
	pipe(fdstderr);

	pid = fork();

	if (pid == 0) { /* Child process */
		int ret;
		/*
		 * dup2(oldfd, newfd): make the second argument (newfd)
		 * refer to the same pipe end as the first argument (oldfd)
		 */
		dup2(fdstdin[0], STDIN_FILENO);
		dup2(fdstdout[1], STDOUT_FILENO);
		dup2(fdstderr[1], STDERR_FILENO);

		close(fdstdin[1]);
		close(fdstdout[0]);
		close(fdstderr[0]);

		/* This simulates a simple writing to stderr */
		system("printf Hi > /dev/stderr");

		/* Simulates writing from stdin to a test file */
		system("cat > `mktemp`");

		/* Typical method is to run execl(3). Here just using printf */
		ret = execl("/usr/bin/printf", "printf", "Hello World!", NULL);

		if (ret == -1) {
			/*
			 * execl(3) only returns if an error occurs, and
			 * then it returns -1. Any debugging messages to
			 * the console would be interpreted as output of
			 * the process. Therefore, we will simply exit.
			 * The parent process's read(2) will then return 0
			 * (end-of-file) once the child has exited.
			 */
			exit(128);
		}
	}
	else { /* Parent process */
		char buf[1000];
		ssize_t n;

		/* Close the ends of the pipes that only the child uses */
		close(fdstdin[0]);
		close(fdstdout[1]);
		close(fdstderr[1]);

		/* Read from the stderr; read(2) does not NUL-terminate */
		n = read(fdstderr[0], buf, sizeof(buf) - 1);
		buf[n > 0 ? n : 0] = '\0';
		printf("Stderr message from child, simulated by a "
		    "system(): %s\n", buf);

		/* Sending data to the stdin of the child process */
		printf("Sending string '%s' to stdin, written to mktemp file.\n",
		    SAMPLE_STRING);

		write(fdstdin[1], SAMPLE_STRING, strlen(SAMPLE_STRING));
		/* Closing the stdin pipe so the child's cat sees end-of-file */
		close(fdstdin[1]);

		/* Read from the stdout */
		n = read(fdstdout[0], buf, sizeof(buf) - 1);
		buf[n > 0 ? n : 0] = '\0';
		printf("Stdout message from child, run with an execl(): %s\n",
		    buf);
	}
}

Compiling and running this code should give you the following output.

$ ./redirect
Stderr message from child, simulated by a system(): Hi
Sending string 'Bismillah' to stdin, written to mktemp file.
Stdout message from child, run with an execl(): Hello World!
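
For everyday use I would wrap the same steps in a helper. The following is only a sketch under my own assumptions: the name run_and_capture is hypothetical, it captures standard output only, it uses execvp(3) so the command is looked up in PATH, it loops on read(2) because one read is not guaranteed to return everything, and it reaps the child with waitpid(2).

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run argv[0] with its stdout redirected into a
 * pipe and copy up to outlen-1 bytes of that output into out.
 * Returns the child's exit status, or -1 on error.
 */
int
run_and_capture(char *const argv[], char *out, size_t outlen)
{
	int fds[2];
	pid_t pid;
	size_t used = 0;
	ssize_t n;
	int status;

	if (outlen == 0 || pipe(fds) == -1)
		return -1;

	pid = fork();
	if (pid == -1) {
		close(fds[0]);
		close(fds[1]);
		return -1;
	}

	if (pid == 0) { /* Child process */
		close(fds[0]);
		if (dup2(fds[1], STDOUT_FILENO) == -1)
			_exit(127);
		close(fds[1]);
		execvp(argv[0], argv);
		_exit(127);	/* only reached if execvp(3) failed */
	}

	/* Parent process: read until EOF or until the buffer is full */
	close(fds[1]);
	while (used < outlen - 1 &&
	    (n = read(fds[0], out + used, outlen - 1 - used)) > 0)
		used += (size_t)n;
	out[used] = '\0';
	close(fds[0]);

	if (waitpid(pid, &status, 0) == -1)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Given an argument vector such as { "echo", "hello", NULL }, this should return 0 and leave "hello" followed by a newline in the buffer, assuming an echo binary is on the PATH.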

I hope this helps someone going forward! Thoughts?

This work is heavily based off of Cameron Zwarich’s excellent 1998 article Pipes in Unix from C-Scene, issue #4. I have it in hard-copy from 2001 and periodically refer back to it.

Avoiding Redundancy with Function Pointers

I am currently writing OpenGit, a BSD-licensed re-implementation of Linus Torvalds’s Git (lower-cased going forward). This frequently involves reviewing git’s source code to understand how it works under the hood. One of the things I consistently encounter is git performing similar and sizable tasks in multiple different ways and places, resulting in redundancy and a higher maintenance cost.

In this brief entry, I will discuss a classic problem and how I solve it: When minor variants of a routine result in multiple implementations.

Example Pseudo-Code Problem

Git makes heavy use of zlib, a library used to compress (deflate) and decompress (inflate) arbitrary data. In the process, git will perform different subroutines, such as calculating an object’s cyclic redundancy check (CRC) or SHA1 value, both on deflated and inflated data, consuming the data in different ways. Rather than having a single inflation function, git re-implements the inflation code in numerous different ways.

To help understand the problem, let’s use this sample pseudo-code of two decompression routines. The primary difference between the two functions is that one executes additional_routine_one() on the decompressed data, while the other executes additional_routine_two() on the compressed data.

void
decompress_routine_one(int fd, char *data, uint8_t *bin)
{
   int size;
   char compressed_data[1000];
   char *decompressed_data;

   do {
      size = read(fd, compressed_data, 1000);

      ... decompress the data ...

      additional_routine_one(decompressed_data, bin);
   } while(size > 0);
}

void
decompress_routine_two(int fd, char *var)
{
   int size;
   char compressed_data[1000];
   char *decompressed_data;

   do {
      size = read(fd, compressed_data, 1000);
      additional_routine_two(compressed_data, size, var);

      ... decompress the data ...

   } while(size > 0);
}

decompress_routine_one(fd, data, bin);
decompress_routine_two(fd, var);

In this example, both will decompress the data in the do-while loop, but perform different tasks and require different arguments for the routine.

  • The first executes additional_routine_one(), which uses two arguments: the decompressed_data and the bin variable.
  • The second executes additional_routine_two(), which utilizes the raw compressed data, and the variables var and size.

In other words, not only does the additional task change, the location of the task changes. This could be further complicated by the number of arguments that the additional routine utilizes.

Possible Solution

The best approach I have settled on is to implement handler function pointers at the appropriate points in the primary routine. The routine skips a handler that is not needed by checking whether its pointer is set to NULL. This is, of course, not my own creation, but based on reviewing Linux and BSD (oh, and Windows) kernel implementations. Consider the following alternative.

typedef void datahandler(char *, int, void *);

void
additional_routine_one(char *data, int size, void *arg)
{
   char *x = arg;
   ... do something ...
}

void
additional_routine_two(char *data, int size, void *arg)
{
   int *x = arg;
   ... do something ...
}

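void
decompress_routine(int fd, datahandler decompressed_handler, void *darg,
    datahandler compressed_handler, void *carg)
{
   int size;
   char buf[1000];
   char *decompressed_data;
   int decompressed_size;

   do {
      size = read(fd, buf, 1000);

      /* Runs on the raw compressed data, before decompression */
      if (compressed_handler)
         compressed_handler(buf, size, carg);

      ... decompress the data ...

      /* Runs on the decompressed data */
      if (decompressed_handler)
         decompressed_handler(decompressed_data, decompressed_size, darg);
   } while(size > 0);
}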

decompress_routine(fd, additional_routine_one, bin, NULL, NULL);
decompress_routine(fd, NULL, NULL, additional_routine_two, var);

In this example, decompress_routine performs the same complex decompression algorithm, but rather than having two separate functions, it checks at the appropriate points whether a function pointer was passed. If so, it calls the handler with the respective argument.

Additionally, if the program must run both additional_routine_one and additional_routine_two the program can run:

decompress_routine(fd, additional_routine_one, bin, additional_routine_two, var);

Finally, in cases where I require more than one argument, I typically pass a pointer to a structure with the data I need.
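
Here is a compilable sketch of that idea. Everything in it is hypothetical and simplified: instead of decompressing, for_each_chunk just walks a buffer in fixed-size chunks, and struct count_arg stands in for whatever bundle of values a real handler would need.

```c
#include <stddef.h>

typedef void datahandler(const char *, size_t, void *);

/* Hypothetical bundle of values that one handler needs */
struct count_arg {
	size_t chunks;
	size_t bytes;
};

/* Handler: recovers its structure from the opaque void pointer */
void
count_handler(const char *data, size_t size, void *arg)
{
	struct count_arg *c = arg;

	(void)data;	/* this handler only counts, it ignores the data */
	c->chunks++;
	c->bytes += size;
}

/* Walk buf in pieces of at most chunk bytes, calling handler if set */
void
for_each_chunk(const char *buf, size_t len, size_t chunk,
    datahandler *handler, void *arg)
{
	size_t off, n;

	if (chunk == 0)
		return;
	for (off = 0; off < len; off += n) {
		n = len - off < chunk ? len - off : chunk;
		if (handler)
			handler(buf + off, n, arg);
	}
}
```

Passing a ten-byte buffer with a chunk size of four and a zeroed struct count_arg should leave chunks at 3 and bytes at 10. Passing NULL for the handler simply skips the call.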

Potential Tradeoffs

  • Performance: There is a trivial cost to verifying if the function pointer is set to NULL or not. This may be irrelevant for general-purpose applications, but could be something to consider for extremely high-performing systems, where memory or disk utilization is not a concern.
  • Code clarity: Having a series of NULLs is “ugly” code. However, this can be trivially resolved by using macros to hide away the NULL references.
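
As a sketch of the macro approach, assuming a routine shaped like the one above, the wrappers below hide the NULL arguments. The names process, process_plain, process_before, process_after and add_length are all made up for this illustration.

```c
#include <stddef.h>

typedef void datahandler(const char *, size_t, void *);

/* Hypothetical primary routine with two optional handler slots */
void
process(const char *buf, size_t len, datahandler *before, void *barg,
    datahandler *after, void *aarg)
{
	if (before)
		before(buf, len, barg);

	/* ... the main work would happen here ... */

	if (after)
		after(buf, len, aarg);
}

/* Wrappers that hide the unused NULL slots from the caller */
#define process_plain(buf, len) \
	process((buf), (len), NULL, NULL, NULL, NULL)
#define process_before(buf, len, h, arg) \
	process((buf), (len), (h), (arg), NULL, NULL)
#define process_after(buf, len, h, arg) \
	process((buf), (len), NULL, NULL, (h), (arg))

/* Example handler: adds the buffer length to a running total */
void
add_length(const char *buf, size_t len, void *arg)
{
	(void)buf;
	*(size_t *)arg += len;
}
```

A call such as process_before(data, size, add_length, &total) then reads like a purpose-built function while still sharing the single implementation.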

Thoughts? Comments? Threats?

Passing by Reference: C’s Garbage Collection

The C programming language has no built-in garbage-collection mechanism – and it very likely never will. This can (and does) lead to memory leaks by even the best programmers. It is also an impetus for the Rust language. However, depending on your use-case, it is still possible to structure your code to use the stack as a sort of zero-cost “garbage collector”.

Let’s jump directly into the code!

This is how many applications instantiate and utilize a structure or arbitrary object.

struct resource *instance;
instance = malloc(sizeof(struct resource));
get_resource(instance);
...
free(instance);

While this is a perfectly fine snippet of code, it requires the program to explicitly free(3) instance when it is no longer needed or risk a memory leak. There is also a slight performance loss from the malloc(3) and free(3).

Therefore, lately I have been using another method.

struct resource instance;
get_resource(&instance);

Rather than allocating memory from the heap, this uses the stack. The variable is “destroyed” as soon as it falls out of scope, without the need for a free(3).

The downside, of course, is losing the ability to pass the pointer elsewhere after the function that declared the variable returns. But this can be overcome by declaring the variable in a function that is a common parent of all those that need it.
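
A compilable sketch of that arrangement follows. The fields of struct resource and the bodies of get_resource, use_resource and parent_function are placeholders I made up; only the shape matters: the parent owns the variable on its stack and lends a pointer to the functions it calls.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical resource; the real fields depend on the application */
struct resource {
	char name[32];
	int value;
};

/* Fills in a caller-owned structure and allocates nothing itself */
void
get_resource(struct resource *res)
{
	snprintf(res->name, sizeof(res->name), "example");
	res->value = 42;
}

/* Any callee can borrow the structure while the parent is running */
int
use_resource(const struct resource *res)
{
	return res->value + (int)strlen(res->name);
}

int
parent_function(void)
{
	struct resource instance;	/* lives in this stack frame */

	get_resource(&instance);
	return use_resource(&instance);
	/* instance is reclaimed automatically when this returns */
}
```

The pointer must not be stored anywhere that outlives parent_function, since the memory is gone once that function returns.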

Thoughts?

SHA1 on FreeBSD Snippet

I needed some code that produces SHA1 digests for a project I am working on. I hunted through FreeBSD’s sha1(1) code and produced this minimal snippet. Hopefully this helps someone else in the future.

Compile and run as follows:

$ cc shatest.c -o shatest -lmd
$ ./shatest
10d0b55e0ce96e1ad711adaac266c9200cbc27e4
$ printf "bismillah" | sha1
10d0b55e0ce96e1ad711adaac266c9200cbc27e4

Thanks to FreeBSD for maintaining such clean code!

Why Numerous Programming Languages?

There are numerous programming languages out there, some general purpose and some built for specific purposes. Here are some of the languages I’ve come across.

  • Assembly Language – This is not so much a language, as a way to write raw CPU instructions in a way that’s more human readable. I’ve only seen it used to write simple libraries and low-level operating system functions.
  • BASIC – A beginner-oriented programming language used to perform simple tasks or write games.
  • C/C++ – These are general purpose languages that compile to native machine code, which means dealing directly with memory and operating system specifics. In an ordinary program, their manipulation of the hardware can only be through the operating system.
  • C# – A C-like language from Microsoft that runs on the .NET runtime and calls upon the .NET library.
  • Java – A general purpose language that does not run directly on the physical hardware. It was primarily built to make the binary executable portable across all physical platforms and OS’s.
  • Perl – An interpreted scripting language. It was initially created as a “glue language” to perform simple tasks or fit into unique places (such as a robust CGI language).
  • PHP – A web scripting language that is interpreted through a PHP interpreter.
  • Python – Object-oriented, multi-platform, interpreted language (which means it requires an interpreter). Never used it, so here it is.
  • Ruby – I don’t know much about, so here’s a link.

This list could go on forever. I should also add Fortran and Pascal to this list (but I won’t).

There is no “best language”; there are just different languages for different purposes. But if you are going to learn a language for general purposes, I would suggest C++, one of those .NET languages or Java.

Found Old Chat Server Project

During my high school years, I used to be part of an “underground” IRC server. We would talk about security-related topics and the latest exploits, usually about some Unix variant. Even though no one would really care about our late-night computer conversations, I thought it best that we chat over an encrypted medium. Since I knew nothing about how SSL could transparently encrypt IRC daemons and clients, I decided to write my own encrypted chat server. Below is the C code, using the Unix API, for the very simple framework. I planned on adding encryption, for which I was learning the GMP library. I wrote the code below in my junior year of high school.

I tried to re-create this project in undergrad, but kept failing and never figured out why. I randomly found a printout of this structure among some old papers.

Pretty certain there’s a memory leak here somewhere, but eh…

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <netdb.h>
#include <netinet/in.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/time.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <arpa/inet.h>

struct clientinfo {
   int fd;
   struct sockaddr_in info;
};

/*
 * connects[0] is the listening socket and connects[1..numcon] are the
 * clients, so the array always holds numcon+1 entries.
 */

/* Remove the client at index location; returns the new client count */
int rmfd(struct clientinfo **connects, int location, int numcon) {
   int count;
   struct clientinfo *temp;

   /* Shift the remaining clients down over the removed entry */
   for(count=location;count<numcon;count++)
      (*connects)[count] = (*connects)[count+1];

   /* Shrink to numcon entries: the listener plus numcon-1 clients */
   temp = realloc(*connects, sizeof(struct clientinfo)*numcon);
   if (temp != NULL)
      *connects = temp;
   return numcon-1;
}

/* Make room for one more client; returns the new count or -1 on error */
int addfd(struct clientinfo **connects, int numcon) {
   struct clientinfo *temp;

   temp = realloc(*connects, sizeof(struct clientinfo)*(numcon+2));
   if (temp == NULL)
      return -1;
   *connects = temp;
   return numcon+1;
}

int main(int argc, char **argv) {
   int check;
   socklen_t sin_size;
   int numcon;
   int maxfd;
   int readtest;
   int writecount;
   char rwbuf[1024];
   struct timeval timer;
   struct clientinfo *connects;
   struct clientinfo newclient;
   fd_set sockrd;

   if (argc != 2) {
      fprintf(stderr, "usage: %s port\n", argv[0]);
      exit(1);
   }

   numcon = 0;
   connects = calloc(1, sizeof(struct clientinfo));
   if (connects == NULL) {
      perror("calloc");
      exit(1);
   }
   connects[0].info.sin_family = AF_INET;
   connects[0].info.sin_port = htons(atoi(argv[1]));
   connects[0].info.sin_addr.s_addr = htonl(INADDR_ANY);

   connects[0].fd = socket(AF_INET, SOCK_STREAM, 0);
   if (connects[0].fd == -1) {
      perror("socket");
      exit(1);
   }
   check = bind(connects[0].fd, (struct sockaddr *)&connects[0].info, sizeof(connects[0].info));
   if(check == -1) {
      perror("bind");
      exit(1);
   }
   check = listen(connects[0].fd, 1024);
   if (check == -1) {
      perror("listen");
      exit(1);
   }
   for(;;) {
      /* select(2) modifies the set and may modify the timer: rebuild both */
      FD_ZERO(&sockrd);
      FD_SET(connects[0].fd, &sockrd);
      maxfd = connects[0].fd;
      for(check=1;check<=numcon;check++) {
         FD_SET(connects[check].fd, &sockrd);
         if (connects[check].fd > maxfd)
            maxfd = connects[check].fd;
      }
      timer.tv_sec = 60;
      timer.tv_usec = 0;

      if (select(maxfd+1, &sockrd, NULL, NULL, &timer) <= 0)
         continue; /* timeout or error: nothing is ready */

      if (FD_ISSET(connects[0].fd, &sockrd)) {
         sin_size = sizeof(newclient.info);
         newclient.fd = accept(connects[0].fd, (struct sockaddr *)&newclient.info, &sin_size);
         if (newclient.fd == -1)
            continue;
         /* select(2) cannot watch descriptors at or above FD_SETSIZE */
         if (newclient.fd >= FD_SETSIZE || (check = addfd(&connects, numcon)) == -1) {
            close(newclient.fd);
            continue;
         }
         numcon = check;
         connects[numcon] = newclient;
      }
      else {
         for(readtest=1;readtest<=numcon;readtest++) {
            if(FD_ISSET(connects[readtest].fd, &sockrd)) {
               check = read(connects[readtest].fd, rwbuf, sizeof(rwbuf));
               if (check <= 0) {
                  /* End-of-file or error: drop the client */
                  close(connects[readtest].fd);
                  numcon = rmfd(&connects, readtest, numcon);
                  readtest--; /* the next client shifted into this slot */
               }
               else {
                  /* Relay only the bytes that were actually read */
                  for(writecount=1;writecount<=numcon;writecount++) {
                     if (writecount != readtest)
                        write(connects[writecount].fd, rwbuf, check);
                  }
               }
            }
         }
      }
   }
}