HacWrite - 39 Efficiency and program transformations

In the following section, we will discuss, in the context of the Fudgets system, how we obtain reasonable execution efficiency. We have not tried to quantify programmer efficiency in a way that permits further comparison or judgement.

39.1 Execution efficiency

Two factors that influence how fast fudgets program run are:

the efficiency of the interface to the window system X Windows, in particular efficiency of event processing.
the efficiency of the fudget combinators.

39.1.1 Efficiency of the interface to the window system

The efficiency of the interface to the window system was a concern right from the start of the work on fudgets. The initial implementation used conventional text I/O to talk to a C program which called routines in Xlib and returned the results (see Section 22.3.1). The C program also forwarded events from X to the functional program. This was not a very efficient implementation and hence we tried to minimise the amount of data passed between the window system and the functional program. Although this was done to avoid problems caused by slow execution of functional programs, an additional positive effect is that fudget programs perform well when using a low bandwidth connection (e.g., modem connection) between the X server and the application. Some figures to back up this statement are given in Figure 101. (One can not draw any general conclusions on performance from these, of course.)

Two tiny programs:
The Fudgets counter example from Section 9.4: 13s
The Motif counter example from Figure 112: 14s
Some small programs:
The Fudget calculator from Section 9.8 with 15 buttons: 13s
The Fudgets calculator Cla with 28 buttons: 16s
The calculator xcalc with 40 buttons, from X Windows distribution: 18s
Some larger programs:
The Fudgets WWW browser from Chapter 32: 22s
Mosaic 2.6: 75s
Netscape 3.0: 137s
Netscape 4.04j2: 313s
Figure 101. Comparing the startup times of some programs when running via a 16.8kbps modem connection.

When it comes to event processing, we naturally wanted to to minimise the number of events that has to be handled by the functional program. Fortunately, the X Windows system can do a lot of event processing for the application. By setting event masks and button grabs appropriately, you can often eliminate all insignificant events, i.e., all events that are sent to the application program carry some meaningful information. In simpler window systems, the application has to deal with every little mouse move and button/key presses/releases by itself.

As an example, consider the implementation of a command button. It should behave as follows:

When the pointer is over the area of the screen occupied by the button and the user presses the mouse button, the button image should be changed to look depressed (that is, pressed down, not discouraged!). If the mouse button is released when the button appears depressed, the button command is triggered.
If the mouse pointer leaves the button image, it should revert to its normal (raised) appearance (indicating that nothing will happen if the user now releases the mouse button). If the user returns the pointer to the button image, without releasing the mouse button, the button should return to its depressed appearance.

The following type of events are thus of interest to the program:

Mouse button presses (but only if the pointer is within the command button).
Mouse button releases (but only if the mouse button has been pressed while the pointer was inside the command button).
Mouse motion (but only if the mouse button has been pressed while the pointer was inside the command button and only if the pointer crosses the border of the screen area occupied by the command button).

In the X windows system a button grab, (see XGrabButton() in the Xlib manual) with an event mask that selects button presses/releases and enter/leave window events (each GUI element is a window), can be used to select exactly these events, with only one small exception: if the mouse pointer enters the button area, the mouse button is pressed, the pointer leaves the button area and the mouse button is released, the program will receive an insignificant button release event. The important thing is that no unnecessary motion events will be received.

39.1.2 Efficiency of the Fudget combinators

39.1.2.1 Efficiency of different representations of stream processors

Two of the stream-processor representations presented above have been used in practice. Early versions of the Fudget library used list functions and synthetic oracles (Section 20.4.1). We later changed to the continuation-based representation (Section 20.4.2) since it proved to be slightly more efficient with the compiler we used (HBC [Aug97]).

We also tried a third representation,

data SP i o = StepSP [o] (i->SP i o)

which was slightly less efficient than the continuation-based representation.

In the discussion below, we assume the continuation-based representation (although some of the ideas can be carried over to other representations).

39.1.2.2 Program transformations for efficiency

Using loopThroughRight (Section 18.2) is a general way to adapt an existing stream processor for use in a new context. Another simple and common way to adapt a stream processor is by mapping a function on the elements of the input or output stream:

sp -==- mapSP g
mapSP f -==- sp

For example, tagged parallel composition of fudgets, >+<, can be defined like

fud1 >+< fud2 =
  mapSP post -==- (fud1 -+- fud2) -==- mapSP pre

where pre and post are the appropriate re-tagging functions. However, if implemented directly, such a definition has a rather high overhead.

By transforming >+< to a form not involving -==-, -+- and mapSP, but instead recursion and pattern matching on the stream-processor constructors, a more efficient solution can be obtained.

Programs transformations of this kind are tedious to do by hand, but it could still be worthwhile if the resulting code is to be included in a library. The above described transformation has been done by hand in the Fudget library. We measured the effect on a communication intensive fudget program containing a parallel composition of 50 fudgets (the Space Invaders program described in Chapter 35). The transformation reduced the CPU time consumption by over 35%. Encouraged by this result, we also transformed some more combinators (for example >^=< and >=^< discussed in Section 13.1) in the same way.

It would of course be nicer to have these transformations done automatically, especially when they are needed in application programs. The kind of automatic transformation that would be useful here is deforestation [Wad90], which eliminates intermediate data structures (applications of PutSP and GetSP in this case) by using certain unfold/fold transformations.

39.1.2.3 A practical semi-automatic transformation

Between the manual and fully automatic implementations of the above program transformation is a semi-automatic alternative. It is interesting because it requires less work than the manual solution and it is more likely to be supported by a compiler than the fully automatic solution.

The manual work required in this solution is located in the library. The application programmer need not be aware of it. The automatic work required is inlining (unfold) by the compiler. It actually works even without inlining, but the efficiency gain is not as big.

The expressions we wish to optimise are of the kind illustrated above: a stream-processor combinator applied to mapSP f, for some f. The trick is to make mapSP a constructor in the stream-processor data type:

data SP a b =  PutSP b (SP a b)
            |  GetSP (a -> SP a b)
            |  MapSP (a -> b)
            |  NullSP

Since the type is abstract, adding constructors to it like this will not be visible to application programmers.

Now that MapSP is a constructor, the implementation of serial composition (as shown in Figure 45) can be extended to handle the case when one or both arguments are applications of MapSP in a more efficient way. This means that an expression like

MapSP f -==- MapSP g

can evaluate to

MapSP (f . g)

With inlining, this step can be taken by the compiler and the composition f . g can then be optimised further. Without inlining, we have at least eliminated a use of -==-, and thereby reduced the number of generated applications of the PutSP and GetSP constructors.

We have not tested the above ideas in the Fudget library.

39.1.2.4 Performance measurements

To get some idea of how high the communication overhead is in the fudgets system, we performed some simple measurements.

The first test measures the efficiency of serial composition and compares the operators >==<, >^^=< and -==-. We measured the time it took to send around 5000 messages through a serial composition of a varying number of identity fudgets (or identity stream processors). The program used is shown in Figure 102. It was compiled with HBC and run on a Pentium Pro 200Mhz under NetBSD 1.2 [Neta]. The results are shown in Figure 103. We can see that the time grows roughly linearly with the length of the composition and that serial composition of fudgets is much more expensive than serial composition of stream processors.

The last table in Figure 103 shows the performance of the function composition map id . ... . map id. It is more efficient than the other serial compositions.

import Fudgets

main = fudlogue mainF

mainF = nullF >==< tstF >==< concatSP >^^=< stdinF

tstF = case argReadKey "comb" 1 of
         1 -> nest (idF>==<) idF depth
         2 -> nest (idSP>^^=<) idF depth
         3 -> absF (nest (idSP -==-) idSP depth)

nest f z 0 = z
nest f z n = f (nest f z (n-1))

depth = argReadKey "depth" 0

Figure 102. A program to measure the efficiency of serial composition.

time ./Internal <testinput1 -S -h8M - -depth n
-comb 1 (idF >==< idF >==< ... >==< idF):
n User time GCs GC time Max heap Max stack
0 0.080u 0 0.00
50 4.173u 40 0.10 62432 836
100 8.475u 80 0.25 95840 1410
200 18.418u 163 0.95 159884 3042
400 44.521u 334 3.35 288020 3768
-comb 2 (idSP >^^=< idSP >^^=< ... >^^=< idF):
n User time GCs GC time Max heap Max stack
0 0.100u 0 0.00
50 0.755u 8 0.01 35712 210
100 1.452u 15 0.03 42270 272
200 2.869u 30 0.08 53504 642
400 5.706u 59 0.18 76976 1218
-comb 3 (idSP -==- idSP -==- ... -==- idSP):
n User time GCs GC time Max heap Max stack
0 0.076u 0 0.00
50 0.411u 4 0.01 32524 171
100 0.812u 7 0.00 35840 348
200 1.574u 15 0.04 41956 651
400 3.047u 30 0.07 54924 1260
map id . map id . ... . map id
n User time GCs GC time Max heap Max stack
0 0.000u 0 0.00
50 0.094u 2 0.00 3956 98
100 0.203u 4 0.00 5616 275
200 0.434u 8 0.00 9712 566
400 0.906u 16 0.01 17616 473
Figure 103. The efficiency of serial composition.

The second test measures the efficiency of parallel composition. We measured the time it took to send about 70000 messages through one of the fudgets in a parallel composition of identity fudgets. The program used is shown in Figure 104. The parallel compositions were created by listF,

listF :: (Eq a) => [(a, F b c)] -> F (a, b) (a, c)

which internally constructs a balanced binary tree of parallel compositions and a table for translating the addresses of the fudgets to positions in the tree. The results are shown in Figure 105. The depth of the tree, and hence the time it takes to send a message to a particular fudget in tree grows logarithmically with the size of the parallel composition. However, since all that is known about the address type is that it is an instance of the Eq class, the table lookup has to be implemented as a linear search and hence the lookup time varies linearly with the position in the list. From the results we see that the time of the table lookup soon becomes the dominating factor. This suggests that it would be a good idea to provide alternative combinators to listF, which require the address type to be an instance of the Ord class, or even the Ix class, to reduce the time complexity of the table lookup time to logarithmic or constant, respectively.

import Fudgets

main = fudlogue mainF

mainF = nullF >==< tstF >==< concatSP >^^=< stdinF

tstF = listF [(i,idF) | i<-[1..size]] >=^< (,) sel

size = argReadKey "size" 1
sel = argReadKey "sel" 1

Figure 104. A program to measure the efficiency of parallel composition.

time ./Internal2 <testinput2 - -size n -sel k
k=1
n 2log n time GCs GC time max heap
1 0 1.352u 11 0.02 29908
32 5 2.179u 18 0.03 34924
256 8 2.457u 22 0.07 66744
2048 11 2.989u 27 0.31 317776
n=256
k 2log n time GCs GC time max heap
64 8 4.640u 33 0.11
127 8 7.001u 44 0.12
128 8 6.801u 44 0.16
256 8 11.126u 67 0.24 87880
Figure 105. The efficiency of parallel composition of fudgets.

39.1.3 Space efficiency

A problem that almost inevitably occurs at some point when developing programs in lazy functional languages is space leaks. In early versions of the Fudget library, streams were represented as (potentially infinite) lists. As discussed in Section 20.4.1, this gave us problems with streams being retained indefinitely, eventually causing programs to run out of memory. This problem was first solved by changing the way the compiler (HBC) treats pattern bindings, as described in [Spa93]. Later, the switch to the continuation-based representation of stream processors also eliminated the problem.