

# High Performance Computing Ecosystem and Trends

Moscow State University Summer Supercomputing Academy 27 June 2016

Andrey Semin

Principal Engineer, HPC Technology Manager, Europe, Middle East and Africa

# Legal Disclaimers

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.

Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance.

Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at <a href="https://www-ssl.intel.com/content/www/us/en/high-performance-computing/path-to-aurora.html">https://www-ssl.intel.com/content/www/us/en/high-performance-computing/path-to-aurora.html</a>.

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit <u>http://www.intel.com/performance</u>.

Intel, the Intel logo, Xeon, Xeon Phi, Intel Optane and 3D XPoint are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries.

\*Other names and brands may be claimed as the property of others.

© 2016 Intel Corporation. All rights reserved.

#### **Optimization Notice**

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel.

Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.




## Moore's Law and Parallelism



Original data collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond and C. Batten. Dotted line extrapolations by C. Moore.




# CPU Parallelism is Already a MUST



Configurations: Intel Performance Projections as of Q1 2015. Chart illustrates relative performance of the shipped products available on ark.intel.com.

<sup>1</sup> Not launched



# **CPU Compute Growth Trends**





# **Growing Challenges in HPC**

#### "The Walls" System Bottlenecks

- Memory | I/O | Storage
- Energy Efficient Performance
- Space | Resiliency | Unoptimized Software

#### Divergent Infrastructure

- Resources Split Among Modeling and Simulation | Big Data Analytics | Machine Learning | Visualization

#### Barriers to Extending Usage

- Democratization at Every Scale | Cloud Access | Exploration of New Parallel Programming Models

Andrey Semin | 28 June 2016 | Slide 6



# Intel® Scalable System Framework

#### A Holistic Design Solution for All HPC Needs

[Figure labels: Compute, Memory/Storage, Fabric, Software, Intel Silicon Photonics; Performance, Reliability & Resiliency, Power Efficiency, Price]

- Small Clusters Through Supercomputers
- Compute and Data-Centric Computing
- Standards-Based Programmability
- On-Premise and Cloud-Based

- **Compute:** Intel® Xeon® Processors, Intel® Xeon Phi™ Processors, Intel® Xeon Phi™ Coprocessors, Intel® Server Boards and Platforms
- **Memory/Storage:** Intel® Solutions for Lustre\*, Intel® Optane™ Technology, 3D XPoint™ Technology, Intel® SSDs
- **Fabric:** Intel® Omni-Path Architecture, Intel® True Scale Fabric, Intel® Ethernet, Intel® Silicon Photonics
- **Software:** HPC System Software Stack, Intel® Software Tools, Intel® Cluster Ready Program, Intel Supported SDVis



# How It Works



#### **Innovative Technologies**

- **Compute:** Intel® Xeon Phi™ Processors, Intel® Xeon® Processors, Intel® Xeon Phi™ Coprocessors
- **Memory/Storage:** High Bandwidth On-Package Memory, Intel® Optane™ Technology, Intel® Solutions for Lustre\* software
- **Fabric:** Intel® Omni-Path Architecture, Intel® Silicon Photonics, Intel® True Scale Fabric
- **Software:** HPC System Software Stack, Intel® Parallel Studio Software Suite, Intel® Math Kernel Library, Intel® Compilers

### Tighter Integration and Co-Design



#### Increased System Density | Reduced System Power Consumption


# **High Performance Compute**





#### **Common Programming Model**



# Intel<sup>®</sup> Xeon Phi<sup>™</sup> x200 Product Family

#### (codename Knights Landing)





#### Compute

- Intel<sup>®</sup> Xeon<sup>®</sup> Processor Binary-Compatible
- 3+ TFLOPS<sup>1</sup>, 3X ST (single-thread) perf. vs. Knights Corner (KNC)
- 2D Mesh Architecture
- Out-of-Order Cores

#### **On-Package Memory**

- Up to **16 GB** at launch
- 5X Bandwidth vs DDR (over 400GB/s)<sup>2</sup>

1<sup>st</sup> Intel processor to integrate

1. Over 3 Teraflops of peak theoretical double-precision performance is preliminary and based on current expectations of cores, clock frequency and floating point operations per cycle.
2. Projected result based on internal Intel analysis of STREAM benchmark using a Knights Landing processor with 16GB of ultra high-bandwidth versus DDR4 memory only with all channels populated.
3. Source: Intel internal information
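Footnote 1 describes peak performance as a product of cores, clock frequency and floating-point operations per cycle. The sketch below only illustrates that arithmetic. The function name and the inputs in the example call (68 cores, 1.4 GHz, 32 double-precision operations per cycle per core) are illustrative assumptions, not figures taken from the slide.

```python
def peak_gflops(cores, clock_ghz, flops_per_cycle_per_core):
    """Theoretical peak in GFLOPS: cores x clock (GHz) x FLOPs per cycle per core.

    This is an upper bound from the hardware parameters alone; sustained
    application performance is normally lower.
    """
    if cores <= 0 or clock_ghz <= 0 or flops_per_cycle_per_core <= 0:
        raise ValueError("all inputs must be positive")
    return cores * clock_ghz * flops_per_cycle_per_core


if __name__ == "__main__":
    # Assumed, illustrative inputs (not from the slide):
    # 68 cores, 1.4 GHz, 32 DP FLOPs per cycle per core.
    gflops = peak_gflops(cores=68, clock_ghz=1.4, flops_per_cycle_per_core=32)
    # Under these assumptions the result is roughly 3.05 TFLOPS.
    print(f"{gflops / 1000:.2f} TFLOPS")
```

Changing any one of the three factors scales the peak linearly, which is why core count and vector width matter as much as clock frequency.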



# Intel® Xeon® Processors

#### At the Heart of Intel® Scalable System Framework

THE HEART OF THE DATA CENTER

[Figure: Core Single Thread IPC Performance]

| Feature | Xeon E5-2600 v3 (Haswell-EP, 22nm) | Xeon E5-2600 v4 (Broadwell-EP, 14nm) |
|---|---|---|
| Cores Per Socket | Up to 18 | Up to 22 |
| Threads Per Socket | Up to 36 threads | Up to 44 threads |
| Last-level Cache (LLC) | Up to 45 MB | Up to 55 MB |
| QPI Speed (GT/s) | 2x QPI 1.1 channels 6.4, 8.0, 9.6 GT/s | 2x QPI 1.1 channels 6.4, 8.0, 9.6 GT/s |
| PCIe\* Lanes / Controllers / Speed (GT/s) | 40 / 10 / PCIe\* 3.0 (2.5, 5, 8 GT/s) | 40 / 10 / PCIe\* 3.0 (2.5, 5, 8 GT/s) |
| Memory Population | 4 channels of up to 3 RDIMMs or 3 LRDIMMs | + 3DS LRDIMM<sup>&</sup> |
| Max Memory Speed (MT/s) | Up to 2133 | Up to 2400 |
| TDP (W) | 160 (Workstation only), 145, 135, 120, 105, 90, 85, 65, 55 | 160 (Workstation only), 145, 135, 120, 105, 90, 85, 65, 55 |

<sup>#</sup> Requires BIOS and firmware update

<sup>&</sup> 3D Stacked DIMMs depend on market availability
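The memory figures above (4 channels per socket, up to 2133 versus up to 2400) translate into theoretical peak bandwidth with simple arithmetic. The sketch below assumes the speeds are in MT/s and that each DDR4 channel moves 8 bytes per transfer (a 64-bit data bus). The function name is hypothetical, and measured bandwidth will be lower than this ceiling.

```python
def ddr_peak_bandwidth_gbs(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Theoretical peak bandwidth in GB/s for one socket.

    channels             -- number of memory channels
    mega_transfers_per_s -- DDR speed grade in MT/s (e.g. 2400 for DDR4-2400)
    bytes_per_transfer   -- assumed 8 bytes (64-bit data bus per channel)
    """
    return channels * mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9


if __name__ == "__main__":
    v3 = ddr_peak_bandwidth_gbs(4, 2133)  # about 68.3 GB/s
    v4 = ddr_peak_bandwidth_gbs(4, 2400)  # about 76.8 GB/s
    print(f"E5-2600 v3: {v3:.1f} GB/s, E5-2600 v4: {v4:.1f} GB/s")
    print(f"ratio: {v4 / v3:.3f}")
```

Under these assumptions the move from 2133 to 2400 raises the per-socket ceiling by about 12.5%.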

# Haswell and Broadwell Core Microarchitecture



All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice. Intel may make changes to specifications and product descriptions at any time, without notice.


# Intel<sup>®</sup> Xeon<sup>®</sup> Processor E5 v4 Family: Core Improvements

#### Extract more parallelism in scheduling uops

- Reduced instruction latencies (ADC, CMOV, PCLMULQDQ)
- Larger out-of-order scheduler (60->64 entries)
- New instructions (ADCX/ADOX)

#### Improved performance on large data sets

- Larger L2 TLB (1K->1.5K entries)
- New L2 TLB for 1GB pages (16 entries)
- 2nd TLB page miss handler for parallel page walks
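One way to read the TLB figures above is as "reach": the amount of memory that can be addressed without a TLB miss, which is entries multiplied by page size. The sketch below assumes "1K" and "1.5K" mean 1024 and 1536 entries and that the base page size is 4 KiB; the helper name is hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def tlb_reach_bytes(entries, page_size_bytes):
    """Memory covered by a fully populated TLB: entries x page size."""
    return entries * page_size_bytes


if __name__ == "__main__":
    # L2 TLB with 4 KiB pages (assumed base page size): 1K -> 1.5K entries
    old_reach = tlb_reach_bytes(1024, 4 * KIB)   # 4 MiB
    new_reach = tlb_reach_bytes(1536, 4 * KIB)   # 6 MiB
    # New 16-entry L2 TLB for 1 GiB pages
    huge_reach = tlb_reach_bytes(16, 1 * GIB)    # 16 GiB

    print(old_reach // MIB, "MiB ->", new_reach // MIB, "MiB with 4 KiB pages")
    print(huge_reach // GIB, "GiB with 1 GiB pages")
```

The gap between a few MiB and 16 GiB of reach is why large pages matter for workloads with large data sets.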

#### Improved address prediction for branches and returns

- Increased Branch Prediction Unit Target Array from 8 ways to 10

#### Floating Point Instruction performance improvements

- Faster vector floating point multiplier (5 to 3 cycles)
- 1024 Radix divider for reduced latency, increased throughput
- Split Scalar divides for increased parallelism/bandwidth
- Faster vector Gather

### **Broadwell:** What's new






# Integrated Voltage Regulator (IVR)



- IVR integrates legacy power delivery onto processor package/die
  - Will require small socket TDP increase (10W per SKU vs. IVB-EP)
- IVR enables power management benefits
- Simplified platform power design
- Platform flexibility for future SKUs and products
- Better architectural flexibility



# Home Snoop w/DIR+OSB Provides up to 15% more Bandwidth vs Early Snoop on E5-26xx v3



#### Memory Read Latency and Bandwidth

Source as of 21 July 2015: Intel internal measurements on platform with two E5-26xx v4 (22C, CLR:2.8GHz), Turbo enabled, 4x32GB 1DPC DDR4-2400, RHEL 7.0. Platform with two E5-2699 v3, Turbo enabled, 4x32GB DDR4-2133, RHEL 7.0.




# **High Performance Computing Performance**



Results based on Intel® internal measurements as of February 29, 2016. Configurations: see next slide.





# High Performance Memory and Storage



- **High-Bandwidth Memory:** Configurable Modes; Integrated into the Processor
- **Intel® Optane™ Technology** (built with 3D XPoint™ Technology): SSDs; DIMMs
- **Intel Solutions for Lustre\* Software:** The Most Widely Used File System for HPC

#### New Technologies Are Bringing Memory Closer to Compute





<sup>1</sup> Projected result based on internal Intel analysis of STREAM benchmark using a Knights Landing processor with 16GB of ultra high-bandwidth versus DDR4 memory with all channels populated.

<sup>2</sup> Projected result based on internal Intel analysis comparison of 16GB of ultra high-bandwidth memory to 16GB of GDDR5 memory used in the Intel® Xeon Phi™ coprocessor 7120P.



- 10x More Dense than Conventional Memory<sup>3</sup>
- Intel® Optane™ SSDs 5-7x Current Flagship NAND-Based SSDs (IOPS)<sup>1</sup>

#### DRAM-like performance

- Intel<sup>®</sup> DIMMs Based on 3D-XPoint<sup>™</sup>
- 1,000x Faster than NAND<sup>1</sup>
- 1,000x the Endurance of NAND<sup>2</sup>

<sup>1</sup> Performance difference based on comparison between 3D XPoint<sup>™</sup> Technology and other industry NAND.

<sup>2</sup> Endurance difference based on comparison between 3D XPoint<sup>™</sup> Technology and other industry NAND.

<sup>3</sup> Density difference based on comparison between 3D XPoint<sup>™</sup> Technology and other industry DRAM.



# NAND Flash and 3D XPoint<sup>™</sup> Technology



3D XPoint<sup>™</sup> Technology



and expanding use cases










# Intel<sup>®</sup> Solutions for Lustre\* Software

#### The Speed of Lustre\* with the Support of Intel

- Intel® Enterprise Edition for Lustre\* Software v2.4
  - Support for "Distributed Namespace" (DNE) Feature to Scale Out the Metadata Performance of Lustre\*
  - Support for the Latest OS: Red Hat\* 6.7-7 and SUSE\* 11sp4-12
  - Parallel Read IO Performance & HSM Scalability Improvements
- Intel® Cloud Edition for Lustre\* Software v1.2
  - Support for Over-the-Wire and Storage Encryption
  - Disaster Recovery from File System Snapshots
  - Simplified File System Mounting on Clients
  - Support for Intel® Xeon® Processor E5-2600 v3 Product Family-Based Instances
- Intel® Foundation Edition for Lustre\* Software v2.8
  - Delivers the Latest Functions and Features
  - Fully Supported by Intel

# **EXTREME SCALE** STORAGE FOR **HPC**



# **Tighter System-Level Integration**



Memory


# Intel<sup>®</sup> Omni-Path Architecture

#### Evolutionary Approach, Revolutionary Features, End-to-End Solution








- **Silicon:** OEM custom designs; HFI and Switch ASICs
  - HFI silicon: up to 2 ports (50 GB/s total b/w)
  - Switch silicon: up to 48 ports (1200 GB/s total b/w)
- **Software:** Open Source Host Software and Fabric Manager
- **Cables**
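The totals quoted for the silicon (up to 2 ports at 50 GB/s, up to 48 ports at 1200 GB/s) both work out to 25 GB/s per port. The sketch below shows one reading of that number, assuming a 100 Gb/s link rate per direction counted in both directions. The link rate and the function name are assumptions for illustration; the slide states only the totals.

```python
def aggregate_bandwidth_gbytes(ports, link_gbits_per_direction=100):
    """Total bidirectional bandwidth in GB/s across all ports.

    Assumes each port carries link_gbits_per_direction Gb/s in each
    direction, and that the total counts both directions.
    """
    per_port_gbytes = 2 * link_gbits_per_direction / 8  # Gb/s -> GB/s, both directions
    return ports * per_port_gbytes


if __name__ == "__main__":
    print("HFI, 2 ports:    ", aggregate_bandwidth_gbytes(2), "GB/s")
    print("Switch, 48 ports:", aggregate_bandwidth_gbytes(48), "GB/s")
```

With these assumptions the function reproduces the 50 GB/s and 1200 GB/s totals.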

#### Better Scaling vs. EDR

- 48 Radix Chip Ports
- Up to 26% More Servers than InfiniBand\* EDR within the Same Budget<sup>1</sup>
- Up to 60% Lower Power and Cooling Costs<sup>2</sup>
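Switch radix affects how many nodes a fat tree can hold at a given depth. The sketch below uses the textbook model of a full-bisection fat tree built from identical switches, where a tree with `tiers` levels of radix-`k` switches supports `2 * (k/2)^tiers` end nodes. It is a simplified illustration, not Intel's sizing or pricing methodology, and the function names are hypothetical.

```python
def fat_tree_max_nodes(radix, tiers):
    """Max end nodes in a full-bisection fat tree of identical switches.

    Every switch except those in the top tier splits its ports evenly
    between downlinks and uplinks, giving 2 * (radix / 2) ** tiers nodes.
    """
    if radix < 2 or radix % 2 != 0:
        raise ValueError("radix must be an even number >= 2")
    if tiers < 1:
        raise ValueError("tiers must be >= 1")
    return 2 * (radix // 2) ** tiers


def tiers_needed(nodes, radix):
    """Smallest number of switch tiers that can connect `nodes` end nodes."""
    tiers = 1
    while fat_tree_max_nodes(radix, tiers) < nodes:
        tiers += 1
    return tiers


if __name__ == "__main__":
    for radix in (36, 48):
        print(f"radix {radix}: two tiers reach {fat_tree_max_nodes(radix, 2)} nodes, "
              f"750 nodes need {tiers_needed(750, radix)} tiers")
```

In this model a 750-node cluster fits in two tiers of 48-port switches but needs a third tier with 36-port switches, which means more switch chips and cables for the same node count.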

#### **Configurable / Resilient**

- Job Prioritization (Traffic Flow Optimization)
- No-Compromise Resiliency (Packet Integrity Protection and Dynamic Lane Scaling)

#### Robust product offerings and ecosystem

- End-to-end Intel product line
- >100 OEM designs<sup>3</sup>
- Strong ecosystem with 70+ Fabric Builders members

#### Maximizes price-performance, freeing up cluster budgets for increased compute and storage capability

1. Assumes a 750-node cluster, and number of switch chips required is based on a full bisectional bandwidth (FBB) Fat-Tree configuration. Intel® OPA uses one fully-populated 768-port director switch, and Mellanox EDR solution uses a combination of 648-port director switches and 36-port edge switches. Mellanox component pricing from www.kernelsoftware.com, with prices as of November 3, 2015. Compute node pricing based on Dell PowerEdge R730 server from www.dell.com. Intel® OPA pricing based on Intel MSRP pricing on ark.intel.com.
2. Assumes a 750-node cluster, and number of switch chips required is based on a full bisectional bandwidth (FBB) Fat-Tree configuration. Intel® OPA uses one fully-populated 768-port director switch, and Mellanox EDR solution uses a combination of director switches and edge switches. Mellanox figures based on Mellanox CS7500 Director Switch, Mellanox SB7700 Edge switch, and Mellanox ConnectX-4 VPI adapter card installation documentation. Intel® OPA figures based on product briefs posted on www.intel.com as of November 16, 2015. Intel® OPA pricing based on Intel MSRP pricing on ark.intel.com.
3. Source: Intel internal information. Design win count based on OEM and HPC storage vendors who are planning to offer either Intel-branded or custom switch products, along with the total number of OEM platforms that are currently planned to support custom and/or standard Intel<sup>®</sup> OPA adapters. Design win count as of November 1, 2015 and subject to change without notice based on vendor product plans.






<sup>1</sup> Based on Intel projections for Wolf River and Prairie River maximum messaging rates, compared to Mellanox CS7500 Director Switch and Mellanox ConnectX-4 adapter and Mellanox SB7700/SB7790 Edge switch product briefs posted on <u>www.mellanox.com</u> as of July 1, 2015, compared to Intel measured data that was calculated from difference between back to back osu\_latency test through one switch hop. 10ns variation due to "near" and "far" ports on an Intel® OPA edge switch. All tests performed using Intel® Xeon® E5-2697v3 with Turbo Mode enabled.

# **Intel® Software Solutions**



## Intel® Software Defined Visualization

#### **Low Cost**

- No Dedicated Viz Cluster

#### **Excellent Performance**

- Less Data Movement, I/O
- Invest Power, Space, Budget in Greater Compute Capability

#### **High Fidelity**

- Work with Larger Data Sets – Not Constrained by GPU Memory

## Intel® Parallel Studio

#### **Faster Code**

Boost Application Performance on Current and Next-Gen CPUs

#### Create Code Faster

Utilizing a Toolset that Simplifies Creating Fast and Reliable Parallel Code

## HPC System Software Stack

#### An Open Community Effort

- Broad Range of Ecosystem Partners
- Open Source Availability
- Benefits the Entire HPC Ecosystem
- Accelerate Application Development
- Turnkey to Customizable

#### **Open Software Available Today!**





Actual configurations depend on specific OEM offerings and implementation.



Intel<sup>®</sup> Cluster Ready

# Summary: a Holistic Architectural Approach

