Overview
Eric Ehlers, Principal Consultant at Post Trade Solutions, makes his debut hosting the Ahead of the Curve podcast alongside his colleague Zeyu (Jerry) Shen.
They explain how to speed up processing time for complex calculations in the Open Source Risk Engine (ORE) by using GPGPU (general-purpose computing on graphics processing units).
Listen to the podcast
Welcome, dear listeners, to another exciting episode of Acadia's Ahead of the Curve. Today we're going to talk about the ORE GPGPU project, which is running Open Source Risk Engine on GPUs for improved performance. My name is Eric Ehlers, I work in Expert Services, and I'm joined by Jerry.

-Hi, I am Zeyu Shen. My colleagues also call me Jerry. I'm a Senior Quant Analyst in Quant Services, and I'm based in Boston.

-The ORE GPGPU project, I'll just break that down. Normally, when you run calculations on your computer, they run on the CPU, the central processing unit. Going back 20 or 30 years, graphics cards were introduced, which contain GPUs, graphics processing units. Of course, graphics cards were originally intended for graphics, but people very quickly realised that they could take other calculations and speed them up by running them on graphics cards. That's how you got GPGPU, which is general-purpose GPU programming. The ORE GPGPU project is to take ORE risk calculations and run them on graphics cards for improved performance. The main motivation for this project is that we have some backtesting calculations which currently take several days to run, and we're trying to get that runtime down to a few hours. There are multiple implementations of GPGPU available. One of them is CUDA. Do you want to tell us more about that?

-Sure. CUDA is a platform developed by NVIDIA. In CUDA, we refer to the CPU as the host, and we refer to the GPU as the device. CUDA is basically a set of APIs that allows the CPU to interact with GPUs, so you can do a lot of things: for example, memory allocations on your device, memory copies between your host and device, and launching kernels from the host. CUDA also has many libraries, and some of them are pretty useful for derivative pricing. For example, the cuRAND library is used to generate random numbers, and it supports many double-precision random number generators: for example, a quasi-random generator, the Sobol sequence, and a pseudo-random generator, the Mersenne Twister. These two are very commonly used in industry. Another pretty useful library is the cuBLAS library, which is used for linear algebra; you can do matrix multiplication with it, for example. In CUDA it is pretty easy to implement your own matrix multiplication, but the library routines are optimised at the compiler level, so you will see much better performance using these libraries compared to your own code.

-Thanks for that overview. CUDA is a proprietary implementation for NVIDIA graphics cards. Another implementation of GPGPU is called OpenCL, which is a free, open standard. With OpenCL, you can have multiple builds. The first one is a base case where the code runs on your CPU, just like a vanilla build, so zero increase in performance. Then, also with OpenCL, you can take an application like ORE and run it on whichever graphics cards are available on your computer. For example, on my work laptop I have an Intel graphics card and an NVIDIA graphics card, and with OpenCL I can run ORE on the CPU, on the Intel graphics card, and on the NVIDIA graphics card. I can compare those three implementations, benchmark them for performance, and see which runs faster.
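To make the host/device interaction Jerry describes concrete, here is a minimal CUDA C++ sketch of the three operations he mentions: allocating memory on the device, copying between host and device, and launching a kernel from the host. The toy payoff kernel and all the values are invented for illustration; none of this is ORE code.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Toy kernel: apply the same call-option payoff to every path, in parallel.
__global__ void payoff(const double* spots, double* out, double strike, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = fmax(spots[i] - strike, 0.0);
}

int main() {
    const int n = 1 << 20;                    // one million simulated spot values
    const size_t bytes = n * sizeof(double);

    double* hSpots = new double[n];
    double* hOut   = new double[n];
    for (int i = 0; i < n; ++i) hSpots[i] = 90.0 + 2e-5 * i;     // dummy inputs

    double *dSpots, *dOut;
    cudaMalloc(&dSpots, bytes);               // memory allocation on the device
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dSpots, hSpots, bytes, cudaMemcpyHostToDevice);   // host -> device copy
    payoff<<<(n + 255) / 256, 256>>>(dSpots, dOut, 100.0, n);    // kernel launch from the host
    cudaMemcpy(hOut, dOut, bytes, cudaMemcpyDeviceToHost);       // device -> host copy

    std::printf("payoff of last path: %f\n", hOut[n - 1]);
    cudaFree(dSpots);
    cudaFree(dOut);
    delete[] hSpots;
    delete[] hOut;
    return 0;
}
```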
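The cuRAND generators Jerry mentions can be driven along the following lines; again, a sketch rather than ORE code. cuBLAS is used in a similar host-side style, for example its cublasDgemm routine for double-precision matrix multiplication.

```cpp
#include <cuda_runtime.h>
#include <curand.h>

int main() {
    const size_t n = 1 << 20;        // a multiple of the Sobol dimension count below
    double* d;
    cudaMalloc(&d, n * sizeof(double));

    // Pseudo-random numbers: a Mersenne Twister variant, generated in double precision.
    curandGenerator_t mt;
    curandCreateGenerator(&mt, CURAND_RNG_PSEUDO_MTGP32);
    curandSetPseudoRandomGeneratorSeed(mt, 42ULL);
    curandGenerateUniformDouble(mt, d, n);
    curandDestroyGenerator(mt);

    // Quasi-random numbers: a Sobol sequence, here in 8 dimensions.
    curandGenerator_t sobol;
    curandCreateGenerator(&sobol, CURAND_RNG_QUASI_SOBOL64);
    curandSetQuasiRandomGeneratorDimensions(sobol, 8);
    curandGenerateUniformDouble(sobol, d, n);
    curandDestroyGenerator(sobol);

    cudaFree(d);
    return 0;
}
```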
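The device comparison Eric describes starts with discovering what OpenCL can see on the machine, typically the CPU, an integrated Intel GPU, and a discrete NVIDIA GPU. A minimal enumeration sketch:

```cpp
#include <cstdio>
#include <vector>
#include <CL/cl.h>

int main() {
    cl_uint numPlatforms = 0;
    clGetPlatformIDs(0, nullptr, &numPlatforms);
    std::vector<cl_platform_id> platforms(numPlatforms);
    clGetPlatformIDs(numPlatforms, platforms.data(), nullptr);

    for (cl_platform_id p : platforms) {
        cl_uint numDevices = 0;
        clGetDeviceIDs(p, CL_DEVICE_TYPE_ALL, 0, nullptr, &numDevices);
        std::vector<cl_device_id> devices(numDevices);
        clGetDeviceIDs(p, CL_DEVICE_TYPE_ALL, numDevices, devices.data(), nullptr);
        for (cl_device_id d : devices) {
            char name[256];
            clGetDeviceInfo(d, CL_DEVICE_NAME, sizeof(name), name, nullptr);
            std::printf("device: %s\n", name);  // e.g. CPU, Intel GPU, NVIDIA GPU
        }
    }
    return 0;
}
```

Once the same kernels run on each of these devices, benchmarking them against each other is just a matter of timing the same workload on each.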
As I said, OpenCL is free and open, and we already have an OpenCL implementation of ORE on GitHub. For anybody watching this podcast who might be interested in running ORE on GPGPU, there's a build already out there, it's free, and you can install it and fire it up. We also have a tutorial in the pipeline. Hopefully, by the time people see this podcast, the tutorial will be online, and it provides step-by-step instructions on how to run ORE using OpenCL. This framework that we use to run ORE on GPGPU, both for OpenCL and for CUDA, we call the ORE GPGPU framework. Do you want to talk us through that in a little bit more detail?

-Sure. The framework is basically a layer outside ORE. When we first started developing this framework, ORE already had millions of lines of code. We thought it made more sense to develop something outside of ORE instead of changing all the code and the pricers to support GPUs, because otherwise it would be very time-consuming and probably not even doable. Currently, the framework is mainly used for scripted trades. Scripted trades mostly use Monte Carlo simulation for pricing, which is pretty slow compared to other trade types; it is one of the main bottlenecks in ORE. By using GPUs, we can increase the speed by a lot, because Monte Carlo simulations are naturally large numbers of simple, similar calculations, so they are naturally suitable for GPU processing. Currently, we support three frameworks in ORE. The first one is basic CPU, the second one is the OpenCL framework, and the third one is the CUDA framework, which is currently under development. Users can choose which framework to use in the pricingengine.xml, and when you select either the CUDA or the OpenCL framework, you can also specify which GPU you want the framework to run on. In the framework, we are also using the runtime compilers of CUDA and OpenCL. The reason is that, for example, for static backtesting, we reprice the trades against historical data thousands of times. We build the kernels the first time we run those pricings; then, for subsequent repricings of a trade, we just reuse those kernels, which is very efficient. In the ORE framework we are also using the Mersenne Twister random number generator in double precision, because according to our testing, when using single-precision random number generators, the result can be 6 to 7 per cent away from the result you get using the CPU.

-You've highlighted a couple of important aspects of GPGPU programming. A graphics card was originally designed to run graphics, and it's optimised for graphics processing. For example, you might take an image with a million pixels and want to transform every single one of those pixels in the same way, in parallel. If you're going to take a financial application like ORE and run it on a GPU, it takes a lot of expertise to design that implementation. You can't just take any part of ORE and run it on a GPU. If you're just loading data from a database, for example, that's not going to benefit from running on a GPU. You need a large number of similar calculations that you can run in parallel. What you just described, taking a Monte Carlo analysis and running it on a GPU, is exactly the kind of calculation that can be perfectly optimised on a GPU.
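The build-once, reuse-many-times pattern Jerry describes looks roughly like this with OpenCL's runtime compiler. The kernel source and names are illustrative only, and error handling is omitted:

```cpp
#include <CL/cl.h>

// Kernel source compiled at runtime; this toy payoff stands in for the
// generated pricing kernels of the real framework.
const char* src =
    "#pragma OPENCL EXTENSION cl_khr_fp64 : enable\n"
    "__kernel void payoff(__global const double* s, __global double* o, double k) {\n"
    "    size_t i = get_global_id(0);\n"
    "    o[i] = fmax(s[i] - k, 0.0);\n"
    "}\n";

int main() {
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, nullptr);
    cl_device_id device;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);

    cl_int err;
    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);

    // Compile the kernel once, at the first pricing...
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, nullptr, &err);
    clBuildProgram(prog, 1, &device, nullptr, nullptr, nullptr);
    cl_kernel kernel = clCreateKernel(prog, "payoff", &err);

    // ...then reuse the same kernel object for every subsequent repricing,
    // only updating its arguments (e.g. a new historical scenario) each time:
    // for (each historical date) { clSetKernelArg(...); clEnqueueNDRangeKernel(...); }

    clReleaseKernel(kernel);
    clReleaseProgram(prog);
    clReleaseContext(ctx);
    return 0;
}
```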
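The framework selection Jerry mentions lives in pricingengine.xml. The sketch below only illustrates the idea: the parameter names are hypothetical, so refer to the ORE documentation and the forthcoming tutorial for the actual schema.

```xml
<!-- Hypothetical sketch; parameter names are illustrative, not ORE's actual schema. -->
<Product type="ScriptedTrade">
  <Model>Generic</Model>
  <Engine>Generic</Engine>
  <EngineParameters>
    <!-- choose the basic CPU, OpenCL, or CUDA framework -->
    <Parameter name="ComputeFramework">OpenCL</Parameter>
    <!-- with OpenCL or CUDA, also pick which GPU to run on -->
    <Parameter name="ComputeDevice">0</Parameter>
  </EngineParameters>
</Product>
```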
So far, we've been talking about running ORE GPGPU on, for example, a developer laptop. You can get much better performance by running it on high-end hardware. Do you want to talk a little bit about that?

-Sure. Once we finish the development of the CUDA framework, our next step is to migrate it to AWS, which stands for Amazon Web Services. The reason we're doing that is that we can utilise A100 GPUs there. The A100 is one of NVIDIA's latest GPUs; it has low latency and a lot of tensor cores you can use for GPU processing. We expect to see a huge performance boost after we migrate ORE to the cloud.

-In conclusion: for the ORE GPGPU project, we already have the OpenCL implementation released as open source, we have a tutorial in the pipeline, and currently under development is the CUDA implementation, which will also be released as open source. Anybody can download that and fire it up. Then, internally, we're working on migrating that build onto Amazon Web Services, where we will run it on A100s for maximum performance. Our goal is to take our backtesting service and speed it up by orders of magnitude. Jerry, that was very interesting. Thank you very much.

-Thank you for having me. It's a pleasure.

-I hope you enjoyed this conversation as much as we did. Thank you very much for joining us. If you would like to delve deeper into the topic of this podcast, then please visit us at acadia.inc.