Jun 27, 2012

Xeon Phi and AMD’s GCN Squeezing Nvidia’s TESLA

Intel’s Xeon Phi seemed like a doomed architecture back when Intel was attempting to compete with the likes of ATi and Nvidia. The company even scrapped the Larrabee project, but all that work was not thrown away.

Intel redirected its many-core architecture towards high performance computing. We believe that much of the work was done on the software side, as Intel’s main purpose was to make software integration much easier for HPC users. The idea is a good one and the result is practical, although we’d rather have anything but x86 inside. For now, if Intel’s x86-based Xeon Phi offers the better result, it deserves all the credit. Nvidia’s main problem is that CUDA takes a whole lot of work to program for and that its new Kepler architecture is less powerful where raw double-precision computing power is involved. A science center must therefore pay for thousands of man-hours to port an application’s source code from x86 to CUDA just to take advantage of Nvidia’s Tesla. This is the added cost of choosing an Nvidia Tesla accelerator card for your server or supercomputer. Not only does the center have to pay for the extra man-hours of coding, but the final deployment of the server is also delayed by weeks or even months.

Intel brags that porting your code to its MIC accelerators will take no more than a few days. Since the performance of the current Xeon Phi version is almost equal to that of Nvidia’s Kepler-based Tesla, the server owner will think twice before sticking Tesla cards inside, given the additional funds required for all the software optimization work. So what’s there left for Nvidia to do? If only Nvidia’s Kepler were faster. There is one faster card where DP FP64 is concerned, and that is AMD’s Tahiti GPU.

The second problem Nvidia has with its new Kepler architecture is that its raw compute power is actually less impressive than that of the company’s previous architecture.

Sure, Kepler is easier to program for and it is actually able to run a basic operating system, but more raw power would have made it stand tall ahead of Intel’s new MIC product line. Nvidia’s main problem is that Intel touts 1 TFLOP of real-world double-precision (FP64) performance with its first iteration of Xeon Phi cards. AMD fares quite well on that front, as the current Radeon HD 7970 Tahiti GPU is able to deliver 947 GFLOPs for a much lower price than Xeon Phi, while the new Radeon HD 7970 GHz Edition actually surpasses Intel’s goal by a significant margin of about 12%. Offering this much performance without any “professional” price tag is quite an achievement for AMD’s team. By contrast, Nvidia’s top performing part where DP FP64 performance is concerned is the Fermi-based Tesla M2090 card, which is rated at a real-world double-precision (FP64) performance of 665 Gigaflops, or 0.66 TFLOP.
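To put the figures quoted above side by side, here is a minimal Python sketch that expresses each one as a share of Intel's 1 TFLOP target. The dictionary and the `relative_to` helper are our own illustrative names, and the Xeon Phi entry is Intel's stated target rather than a measured result.

```python
# Peak double-precision (FP64) figures as quoted in this article, in GFLOPs.
# The Xeon Phi number is Intel's stated target, not a measurement.
quoted_dp_gflops = {
    "Intel Xeon Phi (target)": 1000.0,
    "AMD Radeon HD 7970": 947.0,
    "Nvidia Tesla M2090 (Fermi)": 665.0,
}


def relative_to(baseline: str, figures: dict[str, float]) -> dict[str, float]:
    """Return each entry's throughput as a fraction of the baseline entry."""
    base = figures[baseline]
    return {name: value / base for name, value in figures.items()}


ratios = relative_to("Intel Xeon Phi (target)", quoted_dp_gflops)

# Print the cards from fastest to slowest.
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:30s} {ratio:6.1%} of Intel's 1 TFLOP target")
```

On these numbers the Radeon HD 7970 lands at roughly 95% of Intel's target and the Tesla M2090 at about two thirds of it.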

How did Nvidia end up with a new generation of GPU compute accelerators that are slower than the previous generation? The answer is that Nvidia was not targeting DP FP64 performance with its current Tesla generation, and that it built the new Tesla K10 GPU compute cards using two Kepler GPUs. Thus, Nvidia’s K10 is able to achieve an impressive peak of 4.6 TFLOPs of single-precision compute performance. That’s 343% of the single-precision performance of the Fermi-based Tesla M2090 card, but that’s not what Intel is offering. Remember that Intel emphasizes double-precision FP64 performance rather than single-precision.

Unfortunately, Nvidia’s DP FP64 performance with its Kepler GPU is over 6 times slower than what Fermi is able to put out. Kepler’s DP FP64 performance sits at just 95 Gigaflops, or 0.09 TFLOP.
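The gap is easy to check from the two figures quoted in this article. The snippet below is only a quick sanity check; the constant and function names are ours.

```python
# FP64 figures quoted in this article, in GFLOPs.
FERMI_M2090_DP_GFLOPS = 665.0  # Fermi-based Tesla M2090
KEPLER_DP_GFLOPS = 95.0        # a single Kepler GPU


def throughput_ratio(old_gflops: float, new_gflops: float) -> float:
    """How many times more FP64 throughput the old part has than the new one."""
    return old_gflops / new_gflops


factor = throughput_ratio(FERMI_M2090_DP_GFLOPS, KEPLER_DP_GFLOPS)
print(f"Fermi delivers {factor:.1f}x the FP64 throughput of a Kepler GPU")
```

With 665 against 95 Gigaflops the ratio works out to 7, which is in line with the "over 6 times" claim.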

The cards are clearly targeted at different applications, and at this point we believe that Nvidia would have been better off with a 28-nm-based Fermi with increased performance and lower thermals. Practically, a dual-GPU Fermi Tesla card built with 28-nm GPUs, but clocked at the same frequencies, would be able to put out over 1.3 TFLOPs of DP FP64 performance. Nvidia could really pull this one out of its hat if the company decided to take this route. Now, many of our readers are probably thinking about the possibility that Nvidia could combine the best of both architectures and achieve the impressive single-precision performance of Kepler and the high DP FP64 performance of Fermi. We believe that that’s exactly what Nvidia’s K20 is going for. The GK110 GPU inside will most likely provide competitive DP FP64 performance and even better single-precision raw power.
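The 1.3 TFLOPs estimate for a hypothetical dual-GPU Fermi card is simple arithmetic. The sketch below assumes perfect scaling across the two GPUs and unchanged clocks, which is our simplifying assumption, and the function name is purely illustrative.

```python
def multi_gpu_peak_tflops(single_gpu_gflops: float, gpu_count: int = 2) -> float:
    """Peak throughput in TFLOPs, assuming it scales linearly with GPU count."""
    return single_gpu_gflops * gpu_count / 1000.0


# 665 GFLOPs is the Tesla M2090 FP64 figure quoted in this article.
projected = multi_gpu_peak_tflops(665.0)
print(f"Hypothetical dual-GPU Fermi card: {projected:.2f} TFLOPs FP64 peak")
```

A real card would likely need lower clocks to stay inside its power budget, so treat the result as an upper bound.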

Therefore, while Intel used its clout and money to kick Nvidia’s Tesla out of some of the supercomputers and servers that are now being built, Nvidia might strike back with a new set of Tesla products that will offer much better performance. It is also important to note that with Intel’s Xeon Phi we’re talking about theoretical performance, as the cards are not out yet, while Nvidia’s Tesla K10 cards are up for grabs. Nobody can deny Intel’s performance achievements, and we believe that the simpler method of Xeon Phi coding and optimization is a considerable advantage over Nvidia’s CUDA. On the other hand, Intel will have a tough road ahead if the next TESLA K20 card offers 1.7 or 1.9 TFLOPs of DP FP64 raw computing power.

Nvidia is not at all defenseless before Intel’s money, market influence, software development, process manufacturing superiority and the general success of the Xeon Phi.

It’s obvious that Intel made excellent use of the remains of its Larrabee project, and Knights Corner, MIC or Xeon Phi, whatever you’d like to call it, is, at the moment, an interesting product. We’re sure there’s a great deal of marketing and PR talk in Intel’s claim that porting applications to Xeon Phi is only a “matter of days,” instead of weeks or months. Nvidia has two main strong points now. The first one is that its upcoming GK110 GPU, which will power the Tesla K20 card, is set to bring more than three times the DP FP64 performance of Nvidia’s previous Tesla generation powered by the Fermi architecture. We know that the Tesla M2090 Fermi-based GPU compute accelerator card is able to process a strong 0.66 TFLOPs of DP FP64 operations, and if the new K20 is rated at over 1.9 TFLOPs, Intel’s Xeon Phi doesn’t look so powerful anymore.

Intel can brag and sing about the easy porting advantage of Xeon Phi all day, but no supercomputer maker is going to give up 100 PFLOPs of performance and limit the project to 50 PFLOPs just because the slower option is easier to port to. Supercomputing clients usually have very complex projects to run on their mega servers, and if one technology can deliver the result in one month, while the other will deliver it in twice the time, we have a hard time believing that the client will choose the slower hardware. The second strong point the Kepler-CUDA-GK110 combination gives Nvidia is the continuity of the platform itself, and the fact that the CUDA porting could already be done before GK110 reaches the client.
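The trade-off between porting time and raw speed can be made concrete with a toy model. The sketch below is our own back-of-the-envelope illustration, not anything published by Intel or Nvidia: it assumes time to result is simply porting time plus compute time, and every number in the example (60 extra days of porting, a 2x speed advantage) is hypothetical.

```python
def days_to_result(porting_days: float, reference_compute_days: float,
                   speedup: float) -> float:
    """Total days until the result: porting first, then the compute job.

    reference_compute_days is how long the job would take on a reference
    machine; speedup is how much faster the chosen platform runs it.
    """
    return porting_days + reference_compute_days / speedup


def break_even_compute_days(extra_porting_days: float, fast_speedup: float,
                            slow_speedup: float) -> float:
    """Workload size (in reference compute days) above which the faster but
    harder-to-port platform delivers the result first."""
    if fast_speedup <= slow_speedup:
        raise ValueError("fast_speedup must exceed slow_speedup")
    return extra_porting_days / (1.0 / slow_speedup - 1.0 / fast_speedup)


# Hypothetical: the fast platform costs 60 extra days of porting but runs 2x faster.
threshold = break_even_compute_days(60.0, fast_speedup=2.0, slow_speedup=1.0)
print(f"The faster platform wins once the workload exceeds {threshold:.0f} compute days")

# A workload of 365 reference compute days, well above the threshold:
fast_total = days_to_result(60.0, 365.0, speedup=2.0)
slow_total = days_to_result(0.0, 365.0, speedup=1.0)
print(f"fast platform: {fast_total:.1f} days, easy-to-port platform: {slow_total:.1f} days")
```

Under these made-up numbers, any installation that expects to run more than about four months of compute work comes out ahead with the faster hardware, which is the intuition behind the argument above.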

Nvidia’s way is the CUDA way, and the fact is that a lot of coding and optimization work is needed to fully enjoy the performance of Nvidia’s TESLA cards. HPC clients might see Intel’s easier Xeon Phi coding as a way to reduce the cost of the software work that needs to be done.

On the other hand, HPC clients really care about performance. We have a hard time deciding if savings on software coding are more important than the end performance of the installation. We’re inclined to believe that, in the HPC or supercomputing world, money is usually not an issue and, more importantly, the small amount of money that software porting and optimization represents is not as important when compared with the total cost of the hardware and implementation. Considering that we’re talking about tens of thousands of dollars worth of man-hours spent coding and optimizing, the client paying for the server might give Intel’s Xeon Phi a thought if the performance were the same. The thing is that performance is not going to be the same. If Nvidia achieves its targets with the GK110 GPU, the DP FP64 performance will be almost twice what Intel’s Xeon Phi brings to the table.

Some might wonder what the point is in going for Kepler now. Why not wait for Xeon Phi or TESLA K20? The answer is that, if you want your supercomputer ready at the end of this year, you can safely go with Nvidia’s TESLA K10 that’s based on the new Kepler architecture. Sure, there is more CUDA programming to do, but you’ll be able to have your server ready much earlier than if you wait for Xeon Phi or TESLA K20. Having the final installation ready faster is only one of the advantages TESLA K10 offers. The second advantage is that, if you’ve ported your applications to CUDA and you’ve already had them optimized for the Kepler architecture, you can simply swap the TESLA K10 cards for the K20 models when they hit the market.

Once this upgrade is finalized, your supercomputer will likely have roughly 10 times the DP FP64 raw computing power of the initial Kepler K10 installation (going by the figures above, about 1.9 TFLOPs against two Kepler GPUs at 95 Gigaflops each) and almost twice the raw power of a similar Xeon Phi installation. There is nothing Intel can do this year or the next that would allow it to achieve a doubling of Xeon Phi’s DP FP64 performance and, from a pure performance point of view, Nvidia’s GK110 is a definite winner. Once we factor in AMD’s GCN, we’ll clearly see why Nvidia’s TESLA is being squeezed hard in the HPC market, but this will follow in the sixth part of our GPU compute analysis.

Intel Phi Logo
Image credits to Intel

Intel Xeon Phi Coprocessor Accelerator Card
Image credits to Intel

Nvidia TESLA K10 Card
Image credits to NVIDIA

Nvidia TESLA K20 Card based on the GK110 GPU
Image credits to NVIDIA

Nvidia TESLA K10 & K20 Performance Targets
Image credits to Hardware.fr
