39 Efficiency and program transformations

Efficiency considerations are of course important when building software libraries. Below, we discuss some efficiency aspects of stream processors that have attracted our attention while working on the Fudget library.

We can distinguish two kinds of efficiency:

While execution efficiency was, and to a large extent still is, a weak point of functional language implementations (as compared to C, for example), programmer efficiency is a strong point, and one of the reason why functional languages are interesting in the first place. There is often a trade-off between the two.

In the following section, we will discuss, in the context of the Fudgets system, how we obtain reasonable execution efficiency. We have not tried to quantify programmer efficiency in a way that permits further comparison or judgement.

39.1 Execution efficiency

Two factors that influence how fast fudgets program run are:

39.1.1 Efficiency of the interface to the window system

The efficiency of the interface to the window system was a concern right from the start of the work on fudgets. The initial implementation used conventional text I/O to talk to a C program which called routines in Xlib and returned the results (see Section 22.3.1). The C program also forwarded events from X to the functional program. This was not a very efficient implementation and hence we tried to minimise the amount of data passed between the window system and the functional program. Although this was done to avoid problems caused by slow execution of functional programs, an additional positive effect is that fudget programs perform well when using a low bandwidth connection (e.g., modem connection) between the X server and the application. Some figures to back up this statement are given in Figure 101. (One can not draw any general conclusions on performance from these, of course.)
Two tiny programs:
The Fudgets counter example from Section 9.4:13s
The Motif counter example from Figure 112:14s

Some small programs:

The Fudget calculator from Section 9.8 with 15 buttons:13s
The Fudgets calculator Cla with 28 buttons:16s
The calculator xcalc with 40 buttons, from X Windows distribution:18s

Some larger programs:

The Fudgets WWW browser from Chapter 32:22s
Mosaic 2.6:75s
Netscape 3.0:137s
Netscape 4.04j2:313s

Figure 101. Comparing the startup times of some programs when running via a 16.8kbps modem connection.

When it comes to event processing, we naturally wanted to to minimise the number of events that has to be handled by the functional program. Fortunately, the X Windows system can do a lot of event processing for the application. By setting event masks and button grabs appropriately, you can often eliminate all insignificant events, i.e., all events that are sent to the application program carry some meaningful information. In simpler window systems, the application has to deal with every little mouse move and button/key presses/releases by itself.

As an example, consider the implementation of a command button. It should behave as follows:The following type of events are thus of interest to the program:In the X windows system a button grab, (see XGrabButton() in the Xlib manual) with an event mask that selects button presses/releases and enter/leave window events (each GUI element is a window), can be used to select exactly these events, with only one small exception: if the mouse pointer enters the button area, the mouse button is pressed, the pointer leaves the button area and the mouse button is released, the program will receive an insignificant button release event. The important thing is that no unnecessary motion events will be received.

39.1.2 Efficiency of the Fudget combinators

39.1.2.1 Efficiency of different representations of stream processors
Two of the stream-processor representations presented above have been used in practice. Early versions of the Fudget library used list functions and synthetic oracles (Section 20.4.1). We later changed to the continuation-based representation (Section 20.4.2) since it proved to be slightly more efficient with the compiler we used (HBC [Aug97]).

We also tried a third representation,

data SP i o = StepSP [o] (i->SP i o)
which was slightly less efficient than the continuation-based representation.

In the discussion below, we assume the continuation-based representation (although some of the ideas can be carried over to other representations).

39.1.2.2 Program transformations for efficiency
Using loopThroughRight (Section 18.2) is a general way to adapt an existing stream processor for use in a new context. Another simple and common way to adapt a stream processor is by mapping a function on the elements of the input or output stream:

sp -==- mapSP g
mapSP f -==- sp
For example, tagged parallel composition of fudgets, >+<, can be defined like

fud1 >+< fud2 =
  mapSP post -==- (fud1 -+- fud2) -==- mapSP pre
where pre and post are the appropriate re-tagging functions. However, if implemented directly, such a definition has a rather high overhead.

By transforming >+< to a form not involving -==-, -+- and mapSP, but instead recursion and pattern matching on the stream-processor constructors, a more efficient solution can be obtained.

Programs transformations of this kind are tedious to do by hand, but it could still be worthwhile if the resulting code is to be included in a library. The above described transformation has been done by hand in the Fudget library. We measured the effect on a communication intensive fudget program containing a parallel composition of 50 fudgets (the Space Invaders program described in Chapter 35). The transformation reduced the CPU time consumption by over 35%. Encouraged by this result, we also transformed some more combinators (for example >^=< and >=^< discussed in Section 13.1) in the same way.

It would of course be nicer to have these transformations done automatically, especially when they are needed in application programs. The kind of automatic transformation that would be useful here is deforestation [Wad90], which eliminates intermediate data structures (applications of PutSP and GetSP in this case) by using certain unfold/fold transformations.

39.1.2.3 A practical semi-automatic transformation
Between the manual and fully automatic implementations of the above program transformation is a semi-automatic alternative. It is interesting because it requires less work than the manual solution and it is more likely to be supported by a compiler than the fully automatic solution.

The manual work required in this solution is located in the library. The application programmer need not be aware of it. The automatic work required is inlining (unfold) by the compiler. It actually works even without inlining, but the efficiency gain is not as big.

The expressions we wish to optimise are of the kind illustrated above: a stream-processor combinator applied to mapSP f, for some f. The trick is to make mapSP a constructor in the stream-processor data type:

data SP a b =  PutSP b (SP a b)
            |  GetSP (a -> SP a b)
            |  MapSP (a -> b)
            |  NullSP
Since the type is abstract, adding constructors to it like this will not be visible to application programmers.

Now that MapSP is a constructor, the implementation of serial composition (as shown in Figure 45) can be extended to handle the case when one or both arguments are applications of MapSP in a more efficient way. This means that an expression like

MapSP f -==- MapSP g
can evaluate to

MapSP (f . g)
With inlining, this step can be taken by the compiler and the composition f . g can then be optimised further. Without inlining, we have at least eliminated a use of -==-, and thereby reduced the number of generated applications of the PutSP and GetSP constructors.

We have not tested the above ideas in the Fudget library.

39.1.2.4 Performance measurements
To get some idea of how high the communication overhead is in the fudgets system, we performed some simple measurements.

The first test measures the efficiency of serial composition and compares the operators >==<, >^^=< and -==-. We measured the time it took to send around 5000 messages through a serial composition of a varying number of identity fudgets (or identity stream processors). The program used is shown in Figure 102. It was compiled with HBC and run on a Pentium Pro 200Mhz under NetBSD 1.2 [Neta]. The results are shown in Figure 103. We can see that the time grows roughly linearly with the length of the composition and that serial composition of fudgets is much more expensive than serial composition of stream processors.

The last table in Figure 103 shows the performance of the function composition map id . ... . map id. It is more efficient than the other serial compositions.

import Fudgets

main = fudlogue mainF

mainF = nullF >==< tstF >==< concatSP >^^=< stdinF

tstF = case argReadKey "comb" 1 of
         1 -> nest (idF>==<) idF depth
         2 -> nest (idSP>^^=<) idF depth
         3 -> absF (nest (idSP -==-) idSP depth)

nest f z 0 = z
nest f z n = f (nest f z (n-1))

depth = argReadKey "depth" 0

Figure 102. A program to measure the efficiency of serial composition.

time ./Internal <testinput1 -S -h8M - -depth n

-comb 1 (idF >==< idF >==< ... >==< idF):

nUser timeGCsGC timeMax heapMax stack
00.080u00.00
504.173u400.1062432836
1008.475u800.25958401410
20018.418u1630.951598843042
40044.521u3343.352880203768

-comb 2 (idSP >^^=< idSP >^^=< ... >^^=< idF):

nUser timeGCsGC timeMax heapMax stack
00.100u00.00
500.755u80.0135712210
1001.452u150.0342270272
2002.869u300.0853504642
4005.706u590.18769761218

-comb 3 (idSP -==- idSP -==- ... -==- idSP):

nUser timeGCsGC timeMax heapMax stack
00.076u00.00
500.411u40.0132524171
1000.812u70.0035840348
2001.574u150.0441956651
4003.047u300.07549241260

map id . map id . ... . map id

nUser timeGCsGC timeMax heapMax stack
00.000u00.00
500.094u20.00395698
1000.203u40.005616275
2000.434u80.009712566
4000.906u160.0117616473

Figure 103. The efficiency of serial composition.

The second test measures the efficiency of parallel composition. We measured the time it took to send about 70000 messages through one of the fudgets in a parallel composition of identity fudgets. The program used is shown in Figure 104. The parallel compositions were created by listF,

listF :: (Eq a) => [(a, F b c)] -> F (a, b) (a, c)
which internally constructs a balanced binary tree of parallel compositions and a table for translating the addresses of the fudgets to positions in the tree. The results are shown in Figure 105. The depth of the tree, and hence the time it takes to send a message to a particular fudget in tree grows logarithmically with the size of the parallel composition. However, since all that is known about the address type is that it is an instance of the Eq class, the table lookup has to be implemented as a linear search and hence the lookup time varies linearly with the position in the list. From the results we see that the time of the table lookup soon becomes the dominating factor. This suggests that it would be a good idea to provide alternative combinators to listF, which require the address type to be an instance of the Ord class, or even the Ix class, to reduce the time complexity of the table lookup time to logarithmic or constant, respectively.
import Fudgets

main = fudlogue mainF

mainF = nullF >==< tstF >==< concatSP >^^=< stdinF

tstF = listF [(i,idF) | i<-[1..size]] >=^< (,) sel

size = argReadKey "size" 1
sel = argReadKey "sel" 1

Figure 104. A program to measure the efficiency of parallel composition.

time ./Internal2 <testinput2 - -size n -sel k

k=1
n2log ntimeGCsGC timemax heap
101.352u110.0229908
3252.179u180.0334924
25682.457u220.0766744
2048112.989u270.31317776
n=256
k2log ntimeGCsGC timemax heap
6484.640u330.11
12787.001u440.12
12886.801u440.16
256811.126u670.2487880

Figure 105. The efficiency of parallel composition of fudgets.

39.1.3 Space efficiency

A problem that almost inevitably occurs at some point when developing programs in lazy functional languages is space leaks. In early versions of the Fudget library, streams were represented as (potentially infinite) lists. As discussed in Section 20.4.1, this gave us problems with streams being retained indefinitely, eventually causing programs to run out of memory. This problem was first solved by changing the way the compiler (HBC) treats pattern bindings, as described in [Spa93]. Later, the switch to the continuation-based representation of stream processors also eliminated the problem.