Tuesday, 7 October 2008

Don't forget the old ways

Recently I've been spending a large part of my time writing quite a complex bit of new code to process the seismic data we acquire using our 3D chirp system (those of you who've been paying attention will remember me talking about this a little while ago). Going through this process I've become increasingly aware of just how much those of us in active science research take for granted the insane amounts of computing power available to us.

Although the process I'm trying to accomplish (3D seismic imaging) is neither conceptually complex nor difficult to write as an algorithm, it is difficult to do well. This stems from two main factors: firstly, our obsession with data redundancy leads to enormous amounts of data being pumped in at the beginning; and secondly, the complex coupling between acoustic waves and their host medium makes modelling the propagation of sound in geologically complex areas computationally expensive (equally, this can be thought of as a problem resulting from our desire as scientists to push the limit and explore more challenging environments).

The former problem is the one I've mainly been struggling with. The nature of our system means that producing an imaged volume requires the manipulation of an incredible amount of data. For an average survey you're talking about 10 - 15 million spatial samples, each consisting of 3000 - 4000 measurements of the reflection energy recorded at different times. To image this, the data has to be converted into the Fourier (frequency) domain, since the propagation of acoustic wavefronts through a medium is frequency dependent. This produces a further 1500 - 2000 frequency components to handle for each time measurement, giving a total of about 1.2 x 10^14 (120 trillion) data samples to manipulate (at the upper end, that's 15 x 10^6 x 4000 x 2000)! Not to mention a windowed Fourier transform before and after!
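To make the transform step concrete, here's a minimal per-trace sketch. The post doesn't say which FFT library we actually use; FFTW is just one common choice, and the names and sizes below are purely illustrative. It also shows where the 1500 - 2000 figure comes from: a real-valued trace of NT samples yields NT/2 + 1 complex frequency bins.

/* Per-trace forward FFT: one real trace in, its frequency components out.
 * FFTW assumed purely for illustration; names are hypothetical.
 * Compile with: gcc trace_fft.c -lfftw3 -lm */
#include <stdio.h>
#include <fftw3.h>

#define NT 4000              /* time samples per trace                  */
#define NF (NT / 2 + 1)      /* complex frequency bins for a real input */

int main(void)
{
    double *trace = fftw_malloc(sizeof(double) * NT);
    fftw_complex *spec = fftw_malloc(sizeof(fftw_complex) * NF);

    for (int t = 0; t < NT; t++)
        trace[t] = 0.0;      /* dummy data standing in for a real trace */

    /* Real-to-complex transform: 4000 samples -> 2001 frequency bins.
     * In the real system this runs once per trace, i.e. millions of times. */
    fftw_plan p = fftw_plan_dft_r2c_1d(NT, trace, spec, FFTW_ESTIMATE);
    fftw_execute(p);

    printf("%d time samples -> %d frequency components\n", NT, NF);

    fftw_destroy_plan(p);
    fftw_free(spec);
    fftw_free(trace);
    return 0;
}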

When handling data in these sorts of volumes it's easy to forget that altering the location of a single calculation within the code can dramatically affect its run time. Something as simple as moving a calculation out of a for loop, so that it is computed only once per frequency component on each time series, can literally shave days off the processing time. Equally, a clever bit of sorting of your input data can enable the application of a single calculation to time series from multiple surface locations.
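As a sketch of the kind of hoisting I mean, here's a toy frequency-domain kernel; all the names and sizes are made up for illustration, not lifted from our actual code.

/* Loop-invariant hoisting, sketched on a toy frequency-domain kernel.
 * Compile with: gcc hoist.c -lm */
#include <stdio.h>
#include <math.h>

#define NF 2000        /* frequency components per trace */
#define NT 4000        /* output time samples            */

int main(void)
{
    static double spectrum[NF]; /* stand-in for one trace's spectrum */
    static double image[NT];    /* the imaged output trace           */
    const double pi = 3.14159265358979323846;
    const double df = 0.5, dt = 0.001;

    for (int f = 0; f < NF; f++)
        spectrum[f] = 1.0 / (f + 1);    /* dummy data */

    /* Naive form: w depends only on f, yet gets recomputed on every
     * one of the NF * NT inner iterations:
     *
     *   for (int f = 0; f < NF; f++)
     *       for (int t = 0; t < NT; t++) {
     *           double w = 2.0 * pi * f * df;
     *           image[t] += spectrum[f] * cos(w * t * dt);
     *       }
     */

    /* Hoisted form: w is computed once per frequency component. Over
     * millions of traces the saved work adds up to real time. */
    for (int f = 0; f < NF; f++) {
        const double w = 2.0 * pi * f * df;   /* loop-invariant in t */
        for (int t = 0; t < NT; t++)
            image[t] += spectrum[f] * cos(w * t * dt);
    }

    printf("image[0] = %g\n", image[0]);
    return 0;
}

The sorting trick works the same way one level up: group traces that share acquisition geometry, and a term computed once can be applied to the whole group rather than recomputed per trace.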

These techniques are not new, as sundialsvcs pointed out in this thread on LinuxQuestions.org that I recently stumbled across. There is a tendency, with processor power and memory available at increasingly reasonable prices, for our coding to become more slapdash and less optimized. The application of a little thought, and some of the tricks commonly employed when computers were beasts that filled entire rooms, could, I think, cut swathes off our processing time (by "our" I mean scientists').

Also, talk about cutting energy usage to save the environment - imagine how much greener it would make us!
