Mile High Milestone: Tegra K1 “Denver” Will Be First 64-bit ARM Processor for Android

by Nick Stam

Our 32-bit Tegra K1 mobile processor has been racking up praise for bringing amazing performance and true console-quality graphics to the mobile space.

It “handily beats every other ARM SoC” in GPU performance benchmarks, according to Anandtech. And “the GPU performance is what stands out with the Tegra K1, nothing else on the market today is really able to get even close,” according to PC Perspective.

Now, eight months after unveiling Tegra K1’s 32-bit version, we’re providing further architectural details of the chip’s 64-bit version at HOT CHIPS, a technical conference on high-performance chips.

You can get more technical details here, while below is a general view of what we presented:

This new version of Tegra K1 pairs our 192-core Kepler architecture-based GPU with our own custom-designed, 64-bit, dual-core “Project Denver” CPU, which is fully ARMv8 architecture compatible. Further, the Denver-based Tegra K1 is fully pin-compatible with the 32-bit Tegra K1 for ease of implementation and faster time to market.

With its exceptional performance and superior energy efficiency, the 64-bit Tegra K1 is the world’s first 64-bit ARM processor for Android, and completely outpaces other ARM-based mobile processors.

Tegra K1 Denver

Highest Single-Core CPU Throughput

Denver is designed for the highest single-core CPU throughput, and also delivers industry-leading dual-core performance. Each of the two Denver cores implements a 7-way superscalar microarchitecture (up to 7 concurrent micro-ops can be executed per clock), and includes a 128KB 4-way L1 instruction cache, a 64KB 4-way L1 data cache, and a 2MB 16-way L2 cache, which services both cores.
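To put those cache figures in perspective, here is a small sketch that derives the number of sets in each cache from its size and associativity. The 64-byte line size is an assumption made only for illustration, since it is not stated above, and the helper name cache_sets is hypothetical.

```python
# Sketch: number of sets in a set-associative cache.
# ASSUMPTION: 64-byte cache lines. The article does not state the line size.
LINE_BYTES = 64

KB = 1024
MB = 1024 * KB


def cache_sets(size_bytes, ways, line_bytes=LINE_BYTES):
    """Return sets = size / (ways * line size)."""
    sets, remainder = divmod(size_bytes, ways * line_bytes)
    if remainder:
        raise ValueError("size must be a multiple of ways * line size")
    return sets


# Sizes and associativity as given in the article.
denver_caches = {
    "L1 instruction": (128 * KB, 4),
    "L1 data": (64 * KB, 4),
    "L2 (shared)": (2 * MB, 16),
}

for name, (size, ways) in denver_caches.items():
    print(f"{name}: {cache_sets(size, ways)} sets")
```

A different line size would change the set counts proportionally, so treat the output as a function of that assumption rather than as a published specification.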

Denver implements an innovative process called Dynamic Code Optimization, which optimizes frequently used software routines at runtime into dense, highly tuned microcode-equivalent routines. These are stored in a dedicated, 128MB main-memory-based optimization cache. Once read into the instruction cache, the optimized micro-ops are executed, then re-fetched and re-executed from the instruction cache for as long as they are needed and capacity allows.

Effectively, this reduces the need to re-optimize the software routines. Instead of using hardware to extract the instruction-level parallelism (ILP) inherent in the code, Denver extracts the ILP once via software techniques, and then executes those routines repeatedly, thus amortizing the cost of ILP extraction over the many execution instances.
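The amortization argument can be sketched with a simple cost model: pay a one-time optimization cost, then save cycles on every later run. The function names and all cycle counts below are hypothetical values chosen for illustration, not measurements of Denver.

```python
import math


def total_cycles_direct(runs, cycles_per_run):
    """Cycles spent when the routine is never optimized."""
    return runs * cycles_per_run


def total_cycles_optimized(runs, optimize_cycles, cycles_per_run_optimized):
    """One-time optimization cost plus the cheaper per-run cost."""
    return optimize_cycles + runs * cycles_per_run_optimized


def break_even_runs(optimize_cycles, cycles_direct, cycles_optimized):
    """Smallest run count at which optimizing first is no slower.

    Returns None when the optimized routine is not faster per run.
    """
    saving = cycles_direct - cycles_optimized
    if saving <= 0:
        return None
    return math.ceil(optimize_cycles / saving)


# Hypothetical numbers: 10,000 cycles to optimize, 200 cycles per
# unoptimized run, 100 cycles per optimized run.
n = break_even_runs(10_000, 200, 100)
print(n)                                         # 100
print(total_cycles_direct(n, 200))               # 20000
print(total_cycles_optimized(n, 10_000, 100))    # 20000
```

Past the break-even point, each additional run widens the gap, which is why the approach pays off for routines that execute many times.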

As part of the Dynamic Code Optimization process, Denver looks across a window of hundreds of instructions and unrolls loops, renames registers, removes unused instructions, and reorders the code in various ways for optimal speed. This effectively doubles the performance of the base-level hardware through the conversion of ARM code to highly optimized microcode routines and increases the execution energy efficiency.
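To make one of these transformations concrete, here is a toy sketch of removing unused instructions from a short straight-line window. It shows the general compiler technique (a backward liveness scan), not Denver's actual optimizer. The instruction format, register names and the function remove_unused are all hypothetical, and the sketch assumes instructions have no side effects other than writing their destination register.

```python
def remove_unused(instructions, live_out):
    """Drop instructions whose result is never read.

    instructions: list of (dest, op, sources) tuples in program order.
    live_out: registers still needed after the window ends.
    Assumes no side effects beyond writing `dest`.
    """
    live = set(live_out)
    kept = []
    # Walk backward so we know which registers are needed later.
    for dest, op, sources in reversed(instructions):
        if dest in live:
            live.discard(dest)    # this write satisfies the later reads
            live.update(sources)  # its inputs are now needed
            kept.append((dest, op, sources))
        # Otherwise nothing reads this result, so the instruction is dropped.
    kept.reverse()
    return kept


window = [
    ("r1", "add", ("r0", "r0")),
    ("r2", "mul", ("r1", "r1")),  # overwritten below before being read
    ("r2", "add", ("r1", "r0")),
    ("r3", "sub", ("r2", "r1")),
]

for instruction in remove_unused(window, live_out={"r3"}):
    print(instruction)
```

Here the multiply is dropped because its result in r2 is overwritten before anything reads it, leaving three instructions instead of four.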

The slight overhead of the dynamic optimization process is outweighed by the performance gains of already having optimized code ready to execute. In cases where code may not be frequently reused, Denver can process those ARM instructions directly without going through the dynamic optimization process, delivering the best of both worlds!
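The choice between the two paths can be pictured as a run counter with a threshold. The sketch below is a minimal illustration under stated assumptions: the class ToyDispatcher, the threshold of three runs and the string labels are invented for this example, and nothing above says how Denver actually decides that a routine is frequently used.

```python
from collections import Counter

HOT_THRESHOLD = 3  # ASSUMPTION: illustrative value, not a Denver parameter


class ToyDispatcher:
    """Toy model: run cold routines directly, optimize hot ones once."""

    def __init__(self, threshold=HOT_THRESHOLD):
        self.threshold = threshold
        self.run_counts = Counter()
        self.optimization_cache = {}  # routine id -> optimized form

    def execute(self, routine_id):
        """Return the path taken: 'direct', 'optimize' or 'cached'."""
        if routine_id in self.optimization_cache:
            return "cached"  # reuse the already-optimized routine
        self.run_counts[routine_id] += 1
        if self.run_counts[routine_id] >= self.threshold:
            # Routine is now considered hot: optimize once and store it.
            self.optimization_cache[routine_id] = f"optimized:{routine_id}"
            return "optimize"
        return "direct"  # rarely used code skips optimization


dispatcher = ToyDispatcher()
print([dispatcher.execute("inner_loop") for _ in range(5)])
print(dispatcher.execute("one_time_init"))
```

With the assumed threshold, the loop runs directly twice, is optimized on its third run and is served from the cache afterward, while the one-time routine never pays the optimization cost.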

Dynamic Code Optimization works with all standard ARM-based applications, requiring no customization from developers, and without added power consumption versus other ARM mobile processors. That’s because the 7-way superscalar design allows faster throughput than would otherwise be possible at the same clock speed.

NVIDIA Tegra K1 64-bit Denver CPU

Denver’s remarkable design delivers great performance for both single- and multi-threaded applications, as well as multitasking scenarios. Its two CPU cores can attain significantly higher performance than existing four- to eight-core mobile CPUs on most mobile workloads.

Denver also features new low-latency power-state transitions, in addition to extensive power-gating and dynamic voltage and clock scaling based on workloads. Combining Dynamic Code Optimization, a 7-way superscalar design and efficient power usage, Denver’s performance will rival some mainstream PC-class CPUs at significantly reduced power consumption.

This means that future mobile devices using our 64-bit Tegra K1 chip can offer PC-class performance for standard apps, extended battery life and the best web browsing experience – all while opening new possibilities for gaming, content creation and enterprise apps.

Look forward to some amazing mobile devices based on the 64-bit Tegra K1 from our partners later this year. And for hard-core Android fans, take note that we’re already developing the next version of Android, “L,” on the 64-bit Tegra K1.