Homegrown 93PFLOP supercomputer cements China’s Top500 lead

The Top500 list of the most powerful supercomputers in the world got a new king today. The Sunway TaihuLight, a new installation at the National Supercomputing Center in Wuxi, China, took the top spot with a Linpack score of 93 petaflops. TaihuLight nearly triples the performance of the former number-one system, the Tianhe-2. That's quite impressive by itself, but TaihuLight achieves its performance entirely with Chinese ShenWei CPUs and a custom interconnect. Tianhe-2 relied on two Intel Ivy Bridge CPUs and three Xeon Phi coprocessors in each of its 16,000 nodes to achieve its 33.9-PFLOP Linpack score, according to the Top500 page on that system.
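A quick check of that comparison, as a short Python sketch that uses only the figures quoted in this paragraph: 93 PFLOPS over 33.9 PFLOPS works out to roughly 2.7x, and Tianhe-2's per-node makeup multiplies out to tens of thousands of Intel parts. The 16,000-node count is the rounded figure given above, so the part totals are approximate.

```python
# Linpack (Rmax) scores quoted in the article, in petaflops.
TAIHULIGHT_PFLOPS = 93.0
TIANHE2_PFLOPS = 33.9

# Tianhe-2 node makeup as described: 2 Ivy Bridge CPUs + 3 Xeon Phi per node,
# across roughly 16,000 nodes (a rounded figure, so totals are approximate).
TIANHE2_NODES = 16_000
CPUS_PER_NODE = 2
PHIS_PER_NODE = 3

speedup = TAIHULIGHT_PFLOPS / TIANHE2_PFLOPS
tianhe2_cpus = TIANHE2_NODES * CPUS_PER_NODE
tianhe2_phis = TIANHE2_NODES * PHIS_PER_NODE

print(f"TaihuLight vs Tianhe-2: {speedup:.2f}x")
print(f"Tianhe-2 parts: {tianhe2_cpus:,} Xeons, {tianhe2_phis:,} Xeon Phis")
```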

The TaihuLight system. Source: Jack Dongarra, "Report on the Sunway TaihuLight System"

Each of TaihuLight's 40,960 nodes contains a single ShenWei SW26010 CPU, designed by the National Research Center of Parallel Computer Engineering and Technology. According to the Top500 press release, that chip is a 64-bit RISC part with SIMD and out-of-order execution capability. Each of these 260-core chips is capable of three teraflops of compute performance on its own. The Top500 notes that the 93-PFLOP Linpack score of the machine falls short of its 125-PFLOP theoretical maximum, although the organization suggests that's to be expected with the Linpack benchmark.
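The numbers in this paragraph hang together, as a small Python sketch built only from the article's figures shows: 40,960 chips of 260 cores each give the 10,649,600-core total the Top500 cites, spreading the 125-PFLOP theoretical peak evenly across the chips lands at about three teraflops apiece, and the Linpack run reaches roughly three-quarters of peak. The even per-chip split is an assumption for illustration.

```python
NODES = 40_960            # one SW26010 per node
CORES_PER_CHIP = 260
PEAK_PFLOPS = 125.0       # theoretical maximum (Rpeak) quoted by the Top500
LINPACK_PFLOPS = 93.0     # measured Linpack score (Rmax)

total_cores = NODES * CORES_PER_CHIP
# Assumption: the system peak divides evenly across chips (PFLOPS -> TFLOPS is x1000).
per_chip_tflops = PEAK_PFLOPS * 1_000 / NODES
# Linpack efficiency: fraction of the theoretical peak actually achieved.
efficiency = LINPACK_PFLOPS / PEAK_PFLOPS

print(f"{total_cores:,} cores")
print(f"{per_chip_tflops:.2f} TFLOPS per chip")
print(f"{efficiency:.1%} of peak on Linpack")
```

Run as-is, this reports 10,649,600 cores, 3.05 TFLOPS per chip, and 74.4% of peak.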

Each node has 32GB of DDR3 memory, for a total of 1.3PB across the entire system. The Top500 notes that TaihuLight has a rather small amount of memory relative to its 10,649,600 cores. By comparison, Tianhe-2 has a similar amount of memory for only 3.12 million cores. Each CPU core also gets just 12KB of instruction cache per core and 64KB of "local scratchpad" memory, rather than the L1-L2-L3 cache hierarchy we might be used to seeing these days. Although the Top500 says TaihuLight's 15.3MW power draw during its Linpack run "will certainly earn it a place in the upper reaches of the Green500 list" for power efficiency, the organization believes the machine's efficiency would suffer if it had a more traditional amount of memory for its size.
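The memory and power figures above can be turned into per-core and per-watt numbers with a few lines of Python. This sketch assumes decimal units (1GB = 10^9 bytes), and, because the article only says Tianhe-2 has "a similar amount" of memory, it treats the two systems' totals as equal when comparing per-core shares.

```python
NODES = 40_960
CORES_PER_CHIP = 260
GB_PER_NODE = 32           # DDR3 per node
LINPACK_PFLOPS = 93.0
POWER_MW = 15.3            # power draw during the Linpack run
TIANHE2_CORES = 3_120_000

total_gb = NODES * GB_PER_NODE
total_pb = total_gb / 1_000_000                 # decimal units: 1 PB = 10**6 GB
mb_per_core = GB_PER_NODE * 1_000 / CORES_PER_CHIP
# 1 PFLOPS = 1e6 GFLOPS and 1 MW = 1e6 W, so the two factors cancel.
gflops_per_watt = (LINPACK_PFLOPS * 1e6) / (POWER_MW * 1e6)

# Assumption: treat Tianhe-2's total memory as equal to TaihuLight's, so the
# per-core share shrinks by the ratio of core counts.
core_ratio = NODES * CORES_PER_CHIP / TIANHE2_CORES

print(f"{total_pb:.2f} PB total, about {mb_per_core:.0f} MB per core")
print(f"{gflops_per_watt:.2f} GFLOPS per watt on Linpack")
print(f"about {1 / core_ratio:.0%} of Tianhe-2's memory per core")
```

That works out to 1.31PB in total, roughly 123MB per core, about 6.08 GFLOPS per watt, and, under the equal-memory assumption, about 29% of Tianhe-2's memory per core.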

Another view of TaihuLight. Source: Jack Dongarra, "Report on the Sunway TaihuLight System"

According to a paper on the system by Top500 contributor Jack Dongarra, TaihuLight has already run three scientific simulations that have been picked as finalists for the Association for Computing Machinery's Gordon Bell Prize, and two of those programs have achieved 30 to 40 petaflops of sustained performance on the machine. Dongarra says that the fact that TaihuLight is running "Gordon Bell contender applications" means that it's "not just a stunt machine." The Top500 says TaihuLight will be used for research and engineering work in "climate, weather and earth systems modeling, life science research, advanced manufacturing, and data analytics." If you're a researcher in those areas, perhaps it's time to figure out how to get some time on this beast.
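For scale, the 30 to 40 petaflops sustained by two of the Gordon Bell finalists can be set against the machine's 125-PFLOP peak and 93-PFLOP Linpack score. This Python sketch uses only those quoted figures; it puts the real applications at roughly a quarter to a third of theoretical peak.

```python
PEAK_PFLOPS = 125.0
LINPACK_PFLOPS = 93.0
# Low and high ends of the sustained performance quoted for two finalist codes.
APP_PFLOPS = (30.0, 40.0)

fractions_of_peak = [app / PEAK_PFLOPS for app in APP_PFLOPS]
fractions_of_linpack = [app / LINPACK_PFLOPS for app in APP_PFLOPS]

for app, peak_frac, hpl_frac in zip(APP_PFLOPS, fractions_of_peak, fractions_of_linpack):
    print(f"{app:.0f} PFLOPS sustained: {peak_frac:.0%} of peak, "
          f"{hpl_frac:.0%} of the Linpack score")
```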

    • tipoo
    • 3 years ago

    Is it me, or does the core layout sound like a return of the Cell? The MPE is the old PPE, and the CPEs are the old SPEs. There’s one Management Processor controlling each group of 64 Compute Processors, which have a simplified design compared to what we’re used to, with 12KB of instruction memory and 64KB of scratchpad memory, similar to the SPEs’ local memory. Even smaller than that, actually: those were 256KB, which already made developers cry at night (but I’m assuming they know how much they need for the problems they want to solve with this).
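    To make the scratchpad comparison concrete: with no cache to fall back on, code on a CPE-style core has to stage its working set through the local store explicitly, much as Cell SPE code did. The Python sketch below is a simplified illustration, not ShenWei or Cell code. It assumes float64 data and two equal buffers (one in, one out) sharing the local store, models the DMA transfers as plain list-slice copies, and ignores code size and double buffering. It shows how the tile size falls straight out of the local-store size, which is why 64KB versus 256KB matters to programmers.

```python
ELEM_BYTES = 8  # assumption: float64 data


def tile_elems(local_store_bytes: int, buffers: int = 2) -> int:
    """Largest tile (in elements) such that `buffers` tiles fit in the local store."""
    return local_store_bytes // (ELEM_BYTES * buffers)


def scale_via_local_store(src, alpha, local_store_bytes):
    """Compute alpha * x for every x in src, touching main memory only via bulk copies.

    Returns the result and the number of tiles (get/put round trips) needed.
    """
    tile = tile_elems(local_store_bytes)
    dst = [0.0] * len(src)
    tiles = 0
    for start in range(0, len(src), tile):
        local_in = src[start:start + tile]          # stand-in for a DMA "get" into the scratchpad
        local_out = [alpha * x for x in local_in]   # compute touches local data only
        dst[start:start + tile] = local_out         # stand-in for a DMA "put" back to main memory
        tiles += 1
    return dst, tiles


if __name__ == "__main__":
    data = [float(i) for i in range(100_000)]
    for name, size in (("SW26010 CPE (64KB)", 64 * 1024), ("Cell SPE (256KB)", 256 * 1024)):
        result, tiles = scale_via_local_store(data, 2.0, size)
        assert result == [2.0 * x for x in data]
        print(f"{name}: {tile_elems(size)} elements per tile, {tiles} round trips")
```

    With these assumptions a 64KB store holds 4,096-element tiles (25 round trips for 100,000 elements), while a 256KB store holds 16,384-element tiles (7 round trips).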

    • albundy
    • 3 years ago

    do they still have to call India for tech support?

    • Fonbu
    • 3 years ago

    And so it begins. Only a matter of time until China creates its Doomsday machine.
    [url<]https://archive.org/details/DoomsdayMachine1972[/url<]

    • TheJack
    • 3 years ago

    Does it run Crysis?

      • ronch
      • 3 years ago

      No.

    • divide_by_zero
    • 3 years ago

    I, for one, welcome our new Chinese overlords.

      • ronch
      • 3 years ago

      Dunno if that was meant to be sarcastic.

        • divide_by_zero
        • 3 years ago

        Yes indeed it was; it’s a Simpsons reference.

        [url<]https://www.youtube.com/watch?v=8lcUHQYhPTE[/url<]

    • Takeshi7
    • 3 years ago

    [quote<]The Top500 notes that TaihuLight has a rather small amount of memory relative to its 10,649,600 cores.[/quote<] You don't need much memory when you're using the machine to hack the encryption of foreign governments. Just lots of computing power. That's what I suspect is going on here.

      • Beahmont
      • 3 years ago

      If you’re the Chinese, you don’t spend billions on a supercomputer that reduces the time to break encryption from centuries to decades when you can spend hundreds of millions on espionage and have the answer today or next week.

    • ronch
    • 3 years ago

    1. Shenwei uses which ISA? Alpha?
    2. “TaihuLight??? What kind of stupid name is that?!??” -Mad Dog Tannen
    3. Oh I bet those things aren’t going to be the only things this computer will be used for, knowing China.
    4. Isn’t RAM cheap these days?
    5. Who fabbed those Shenwei chips?
    6. Come on, world!!! We gotta take the crown back from the China!!!
    7. Can it run… er… never mind.

      • jihadjoe
      • 3 years ago

      [quote<]2. "TaihuLight??? What kind of stupid name is that?!??" -Mad Dog Tannen[/quote<] lol Agreed, the English name of this computer is a mess. If you were to pronounce the Chinese, it would read "shenwei taihu zhi guang", which translates directly to "Divine Light of Lake Tai". But instead of translating it completely, the English name mixes transliteration, an untranslated word, and translation. Shenwei is transliterated to "Sunway", "Taihu" (Lake Tai, or Lake Taihu if you prefer) is left untranslated, and "guang" is translated into "light".

        • tipoo
        • 3 years ago

        Ah. So they kind of pulled an OPM Mumen Rider.

        • ronch
        • 3 years ago

        I love how Shenwei became Sunway. LOL

    • Unknown-Error
    • 3 years ago

    China gives the US a nice middle finger for denying it Xeon Phi and other HPC devices. ShenWei is one of several homegrown processors that also include Loongson, UPU, Mars etc.

    People who claim these will be used to make nuclear weapons are off by quite a bit. In fact, weapons designers in China have been complaining about their lack of computing power compared to people doing academic/scientific work in China.

      • ronch
      • 3 years ago

      “if you don’t sell us your technology (so we can pry it open, see how it works, and make a poor copy of it), we’ll steal it and build our own and call it homegrown!”

    • Neutronbeam
    • 3 years ago

    Tech Report tests and reviews the system or it didn’t happen. I’m looking at YOU, “Big Iron” Kampman!

    …still laughing. On Mondays it seems to be easier to amuse myself…

    EDIT: Added “it”

    • DrDominodog51
    • 3 years ago

    Supercomputers used in nuclear weapon development for 1000.

    • chuckula
    • 3 years ago

    From the paper you linked:

    [quote<]Shenwei-64 Instruction Set (this is NOT related to the DEC Alpha instruction set)[/quote<]

      • Jeff Kampman
      • 3 years ago

      The Top500 itself seems confused on this point, but that statement is far more authoritative than anything else I’ve seen about what’s going on under the hood of these chips. Removed the Alpha bit.

        • chuckula
        • 3 years ago

        There’s definitely an information vacuum at this time.

      • the
      • 3 years ago

      They also say that the compute cores are 64-bit and use 256-bit-wide vectors.

      The Top500 article indicates that this is likely based on the Alpha architecture, even though the Chinese say otherwise. Previous generations were widely reported to be based on older Alpha designs. I suspect that the Chinese have updated parts of the ISA, as Alpha never shipped with a vector unit (though one had been planned).

      • Rza79
      • 3 years ago

      Well, Loongson processors were also supposedly using their own instruction set, but it was nothing more than the MIPS instruction set with four instructions missing.
      Who’s to say it isn’t a simple modification of the Alpha instruction set?
      It does raise huge questions about how and why the SW processors are so similar at a higher level to the 21164. It can’t be pure coincidence.
      Intel sold the design to them? It seems more likely they stole it.

        • ronch
        • 3 years ago

        Given how China steals their designs left and right, I wouldn’t be surprised if this is the case here again.

        • Unknown-Error
        • 3 years ago

        Alpha AXP ISA was developed by DEC not Intel.

        Loongson was supposed to have their own ISA? Never heard of that. They did copy/reverse engineer many MIPS instructions. They didn’t have a MIPS license, so they didn’t use all of the MIPS ISA, especially some patented stuff. But an agreement was reached almost 10 years ago, and now Loongson directly licenses MIPS32/MIPS64.

        So, did China reverse engineer Alpha AXP, MIPS ISAs? Quite likely. An embargo on critical components by the USA would have been disastrous. They either had to create something from scratch or reverse engineer an existing system. US denying them Xeon Phi only confirmed the risk of sanctions.

          • the
          • 3 years ago

          I’ve been under the impression that the Chinese got the Alpha CPU designs through industrial espionage in the late 1990s.

            • Unknown-Error
            • 3 years ago

            Quite possible. When it comes to critical systems, civilian or military, they use espionage, reverse engineering and their own designs even if they could buy them off the shelf. It’s a matter of quickly catching up to the rest of the world after years of neglect, and a matter of security. Devices manufactured elsewhere can easily have compromising elements. You know what the NSA did to routers heading to China.

            A few examples of espionage: the Americans will never export anything from the F-22/F-35 to China, so they use espionage to gain access to at least some components. No surprises there.

            When it comes to reverse engineering, a good example would be the Su-27 Flanker. While importing Su-27/30/35S from the Russians, they also created their own copy, the J-11B, with almost 100% indigenous material. This gives them some breathing room in case things go sour. And they continue to improve upon the original product. The most recent J-11D prototype is as good as any other Flanker with the exception of the Su-35. And remember that the export versions are “monkey versions” compared to the ones in Russian service.

            But the Russians are now getting into binding long-term joint projects with China, which are beneficial to both sides and reduce unauthorized copies. They may have even come to an agreement regarding the Su-27 copies.

          • UberGerbil
          • 3 years ago

          Alpha was developed by DEC, but all the IP was sold to Intel as part of the Compaq merger. So if Alpha IP was sold/licensed to China, it certainly could have been done by Intel after 2001.

            • the
            • 3 years ago

            Except that after receiving the Alpha IP, Intel immediately wound it down so it couldn’t compete with Itanium. That architecture died because of business politics between HP, Compaq/DEC and Intel, not because it lagged significantly behind in performance or market share.

    • chuckula
    • 3 years ago

    [quote<]Each CPU core also gets just 12KB of instruction cache per core and 64KB of "local scratchpad" memory, rather than the L1-L2-L3 cache hierarchy we might be used to seeing these days. [/quote<] Red Flag [pun intended] there. You don't need an amazing memory hierarchy to run Linpack. You do need one for more complex real-world workloads. If you didn't, Intel sure wouldn't waste all those transistors and all that die space on packing huge L3 caches into Xeons.
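    The point about Linpack can be put in numbers with arithmetic intensity (flops per byte of data). The Python sketch below is an illustration, not anything from the Top500 or Dongarra's report. It uses the standard HPL operation count of (2/3)n³ + 2n² for solving an n×n dense system against the 8n² bytes of a float64 matrix, which is a best case that assumes each element is fetched once, and compares it with a STREAM-style triad, a[i] = b[i] + s·c[i], at 2 flops per 24 bytes (counting only the three 8-byte array accesses). Dense LU reuses each matrix element many times, so its intensity grows with n and a small local store goes a long way; streaming or irregular kernels get no such reuse.

```python
def hpl_flops(n: int) -> float:
    """Standard HPL/Linpack operation count for solving an n x n dense system."""
    return (2.0 / 3.0) * n**3 + 2.0 * n**2


def hpl_intensity(n: int) -> float:
    """Flops per byte of matrix footprint (float64, 8 bytes per element).

    Best case: assumes every matrix element is moved from memory only once.
    """
    return hpl_flops(n) / (8.0 * n**2)


# STREAM-style triad a[i] = b[i] + s * c[i]: one multiply + one add per element,
# reading b and c and writing a (3 x 8 bytes), ignoring write-allocate traffic.
TRIAD_INTENSITY = 2.0 / 24.0

for n in (1_000, 10_000, 100_000):
    ratio = hpl_intensity(n) / TRIAD_INTENSITY
    print(f"n={n:>7,}: {hpl_intensity(n):10.1f} flops/byte "
          f"({ratio:,.0f}x the triad's {TRIAD_INTENSITY:.3f})")
```

    Algebraically the HPL figure simplifies to n/12 + 1/4 flops per byte, so even a modest n = 10,000 problem has roughly ten thousand times the triad's intensity.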

      • AnotherReader
      • 3 years ago

      The organization of each core group, one management processing element (MPE) and 64 computing processing elements (CPE), suggests data flow similar to a GPU or the PS3’s Cell. This would be suited to a narrower set of algorithms than conventional machines.

      • Zizy
      • 3 years ago

      Depends which algorithm.
      If you want to design a new nuclear bomb, for example, you need enough memory to hold the whole problem and as many flops as you can get, so you can run as many simulations as possible to optimize the thingy.
      Also, a cache hierarchy is needed for typical branchy code. Simpler stuff, which is quite common in many HPC cases (I believe the majority of them), doesn’t need it as much and gets by with simple, tiny out-of-order cores as well.

      This is kind of a further-simplified Phi or GPU, both of which are already way simpler than an ordinary Xeon. In return it offers more cores and better performance … as long as your algorithm fits the hardware.
