In the previous post, I talked about the first part of the Performance Lab for the Computer Science 2021 course. Part II is optimizing the smooth operation. Each pixel has a red, blue, green value. For smooth, we average the values of all the pixels around one particular pixel and replace that one pixel by the average, hence ‘smoothing.’

The baseline code is given to us again and we need to optimize it to make it run faster.


void naive_smooth(int dim, pixel *src, pixel *dst)
{
int i j;
for ( i = 0; i < dim; i++)
for ( j = 0; j < dim; i++)
dst[RIDX(i, j, dim)] = avg(dim, i, j, src);
return;
}

The baseline code simple traverses all the pixels – but we do better. What I did was first create a two-dimensional array the size of the image and first calculate all the values of each pixel and store it in the array. The second part of the operation, I can call from the array and calculate the average and then make the replacement of that pixel with the average. It is not incredibly efficient, and probably won’t pass your speedup benchmark, but you could implement some loop unrolling, code motion or blocking to speed it up.

The key, however, is to figure out how to calculate pixels on the edge of the image (all images are squares for this lab). The difference is, if a pixel is on the edge of the image, the number of surrounding pixels is different. If you’re smoothing a pixel on a corner, there would be three. For a pixel on the edge, but not a corner, it would be five. If the pixel is not on an edge that number would be eight.

In my computer science 2021 course called Machine Architecture and Organization, we’re working on Lab IV, or the Performance Lab, the last lab in this semester. The first three labs (data lab, bomb lab, and architecture lab, respectively) were pretty interesting and fun, but this Performance Lab has a practical application.

There are two parts to the lab, each worth 50 points: the first portion deals with rotating a picture 90 degrees counter-clockwise. The second portion, we smooth or blur an image by replacing a pixel value with the average of the pixels around it.

I’m still working on Part I, so I’ll only be talking about that now:
- To start, we’re given the baseline code that works!


void naive_rotate( int dim, pixel *src, pixel *dst)
{
int i, j;
for( i = 0; i < dim; i++ )
for ( j = 0; j < dim; j++ )
dst[RIDX( dim-1-j, i, dim )] = src[RIDX( i, j, dim )];
return;
}

There are some things I could do to optimize the code that I see right away: a lot of redundant calls. Making use of the following will help me optimize the algorithm to a desired CPE (cycles per element):

Code Motion: moving statements, expressions or functions out of the body of the loop – especially things that don’t change over the iterations of the loop, otherwise known as a loop invariant.

Blocking: as paraphrased from my textbook, blocking is organizing data structures into large chunks. The instructions on that chunk are computed and then discarded before another chunk is loaded.

Loop Unrolling: reducing the number of iterations in a loop by calculating multiple iterations in one call to the loop. I had trouble understand this concept for a while, but referring to this simple example I found on Wikipedia helped me a lot:

We need to write a function that deletes 100 items using the function delete(). At first thought, we would probably consider writing the following function:


int x;
for ( x = 0; x < 100; x++ )
{
delete(x);
}

No doubt it works, but instead of deleting one item per iteration, why not delete, say, five at a time?
int x;


for ( x = 0; x < 100; x+=5 )
{
delete(x);
delete(x+1);
delete(x+2);
delete(x+3);
delete(x+4);
}

Using a combination of the methods mentioned above, I’m supposed to make the code run as fast as possible on pictures of multiple dimensions.

UPDATE: I passed my benchmark speed – not by much, however. I unrolled and I also used function pointers, which another student graciously hinted for me to do.