Our 32-bit Tegra K1 mobile processor has been racking up praise for bringing amazing performance and true console-quality graphics to the mobile space.

It “handily beats every other ARM SoC” in GPU performance benchmarks, according to AnandTech. And “the GPU performance is what stands out with the Tegra K1, nothing else on the market today is really able to get even close,” according to PC Perspective.

Now, eight months after unveiling Tegra K1’s 32-bit version, we’re providing further architectural details of the chip’s 64-bit version at HOT CHIPS, a technical conference on high-performance chips.

You can get more technical details here; below is an overview of what we presented:

This new version of Tegra K1 pairs our 192-core Kepler architecture-based GPU with our own custom-designed, 64-bit, dual-core “Project Denver” CPU, which is fully compatible with the ARMv8 architecture. Further, Denver is fully pin-compatible with the 32-bit Tegra K1 for ease of implementation and faster time to market.

With its exceptional performance and superior energy efficiency, the 64-bit Tegra K1 is the world’s first 64-bit ARM processor for Android, and completely outpaces other ARM-based mobile processors.

Tegra K1 Denver

Highest Single-Core CPU Throughput

Denver is designed for the highest single-core CPU throughput, and also delivers industry-leading dual-core performance. Each of the two Denver cores implements a 7-way superscalar microarchitecture (up to 7 concurrent micro-ops can be executed per clock), and includes a 128KB 4-way L1 instruction cache, a 64KB 4-way L1 data cache, and a 2MB 16-way L2 cache, which services both cores.

Denver implements an innovative process called Dynamic Code Optimization, which optimizes frequently used software routines at runtime into dense, highly tuned microcode-equivalent routines. These are stored in a dedicated, 128MB main-memory-based optimization cache. Once read into the instruction cache, the optimized micro-ops can be re-fetched and executed repeatedly, for as long as they are needed and capacity allows.

Effectively, this reduces the need to re-optimize the software routines. Instead of using hardware to extract the instruction-level parallelism (ILP) inherent in the code, Denver extracts the ILP once via software techniques, and then executes those routines repeatedly, thus amortizing the cost of ILP extraction over the many execution instances.

As part of the Dynamic Code Optimization process, Denver looks across a window of hundreds of instructions and unrolls loops, renames registers, removes unused instructions, and reorders the code in various ways for optimal speed. This effectively doubles the performance of the base-level hardware through the conversion of ARM code to highly optimized microcode routines and increases the execution energy efficiency.
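To make those rewrites concrete, here is a toy Python sketch of two of them, dead-instruction removal and loop unrolling, on a made-up mini-IR of `(dest, op, args)` tuples. This is purely illustrative: it is not Denver's actual algorithm, and every name here is invented for the example.

```python
def remove_dead(instrs, live_out):
    """Drop instructions whose destination is never read later
    (a backward liveness sweep over the toy IR)."""
    live, kept = set(live_out), []
    for dest, op, args in reversed(instrs):
        if dest in live:
            kept.append((dest, op, args))
            live.discard(dest)
            # The kept instruction's inputs become live.
            live.update(a for a in args if isinstance(a, str))
    return list(reversed(kept))

def unroll(body, times):
    """Repeat a loop body `times` times, renaming temporaries (names
    starting with 't') so the copies are independent -- a crude stand-in
    for register renaming. Assumes temporaries are local to one
    iteration, i.e. no value flows between iterations through them."""
    out = []
    for i in range(times):
        for dest, op, args in body:
            rename = lambda x: f"{x}_{i}" if isinstance(x, str) and x.startswith("t") else x
            out.append((rename(dest), op, tuple(rename(a) for a in args)))
    return out
```

For example, running `remove_dead` over a three-instruction sequence where only `t2` is live at the end drops an unused `t3` computation entirely.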

The slight overhead of the dynamic optimization process is outweighed by the performance gains of already having optimized code ready to execute. In cases where code may not be frequently reused, Denver can process those ARM instructions directly without going through the dynamic optimization process, delivering the best of both worlds!
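The execute-directly-or-optimize policy described above can be sketched in a few lines of Python. This is a hedged illustration of the general idea, not NVIDIA's implementation: the class, the threshold value, and the no-op `optimize` stand-in are all invented for the example.

```python
HOT_THRESHOLD = 16  # assumed value for illustration; the real threshold is not public

class DynamicOptimizer:
    """Toy model of a profile-then-optimize dispatch policy."""

    def __init__(self):
        self.hotness = {}             # routine -> execution count so far
        self.optimization_cache = {}  # routine -> optimized version

    def optimize(self, routine):
        # Stand-in for the real work (unrolling, renaming, reordering);
        # here it just returns the routine unchanged.
        return routine

    def execute(self, routine):
        # Hot path: reuse the optimized form, amortizing the one-time
        # translation cost over every subsequent execution.
        if routine in self.optimization_cache:
            return self.optimization_cache[routine]()

        self.hotness[routine] = self.hotness.get(routine, 0) + 1
        if self.hotness[routine] >= HOT_THRESHOLD:
            # Routine just became hot: optimize once, cache, and run it.
            self.optimization_cache[routine] = self.optimize(routine)
            return self.optimization_cache[routine]()

        # Cold code: execute directly, skipping the optimization step.
        return routine()
```

The key property is that translation happens at most once per routine, while rarely-run code never pays for it at all.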

Dynamic Code Optimization works with all standard ARM-based applications, requiring no customization from developers, and without added power consumption versus other ARM mobile processors. That’s because the 7-wide superscalar design delivers higher throughput at the same clock speed, rather than relying on higher clock frequencies that would burn more power.

NVIDIA Tegra K1 64-bit Denver CPU

Denver’s remarkable design delivers great performance for both single- and multi-threaded applications, as well as multitasking scenarios. The dual-CPU cores can attain significantly higher performance than existing four- to eight-core mobile CPUs on most mobile workloads.

Denver also features new low latency power-state transitions, in addition to extensive power-gating and dynamic voltage and clock scaling based on workloads. Combining Dynamic Code Optimization, 7-way superscalar design and efficient power usage, Denver’s performance will rival some mainstream PC-class CPUs at significantly reduced power consumption.
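The workload-based scaling mentioned above follows a familiar DVFS pattern: step the clock up when the core is busy, step it down when it is mostly idle. The sketch below is an invented illustration of that policy; the frequency steps and utilization thresholds are not Denver's actual operating points.

```python
# Illustrative frequency ladder in MHz (assumed values, not real Denver states).
FREQ_STEPS_MHZ = [500, 1000, 1500, 2000, 2500]

def next_frequency(current_mhz, utilization):
    """Pick the next frequency step given utilization in [0.0, 1.0]."""
    i = FREQ_STEPS_MHZ.index(current_mhz)
    if utilization > 0.85 and i < len(FREQ_STEPS_MHZ) - 1:
        return FREQ_STEPS_MHZ[i + 1]   # busy: step up for performance
    if utilization < 0.30 and i > 0:
        return FREQ_STEPS_MHZ[i - 1]   # mostly idle: step down to save power
    return current_mhz                  # in between: stay put
```

A real governor would also gate voltage with frequency and hold power-gated cores in low-latency sleep states, but the up/down hysteresis above is the core of the idea.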

This means that future mobile devices using our 64-bit Tegra K1 chip can offer PC-class performance for standard apps, extended battery life and the best web browsing experience – all while opening new possibilities for gaming, content creation and enterprise apps.

Look forward later this year to some amazing mobile devices based on the 64-bit Tegra K1 from our partners. And for hard-core Android fans, take note that we’re already developing the next version of Android – “L” – on the 64-bit Tegra K1.

  • A Popov

    I agree that the GPU is really better than anything on the market
    nowadays (just take a look at the PC-grade graphics of the Trine 2 and
    UE4 demos – wow, now that’s really a huge leap forward).

    But on the other side, this article is full of phrases like “great performance”, “significantly higher performance than other 8-core ARM chips” and “PC-class performance”, and comments about emulation of NEON. I’ve read the white paper and it has *no* mention of NEON or SIMD. NVidia, you need some real-life CPU (not GPU, we know it is good) benchmarks. Run a LINPACK benchmark (and some other tests, please) on it and prove that your CPU is better than its 8-core rivals.

  • A Popov

    There’s no 64-bit Android for ARM. Yet.

  • http://www.tuttoandroid.net/ Riccardo Robecchi

    Yes, I agree. I think that Windows HAS to run on ARM, but it’ll do that with some key differences from how it’s doing it now (BTW, is this tense’s grammar right?).
    I think Microsoft is moving Windows Phone onto tablets (or Windows onto phones, if you please), as it should have done from the beginning. We’ll see the true convergence of platforms with one software that rules them all. And if they want to do that and support current tablet hardware, they have to support Tegra – so I think we’ll see Tegra tablets in the future.
    As for the legacy Windows support, I think Microsoft got it wrong all along. In my opinion, they should have ported Windows Phone to tablets or created a drastically different Windows-on-ARM product with no desktop and tablet-only apps. Touch Office support should have been a killer app, but we see none even today. If they put all the puzzle pieces together, they’ll definitely have the ultimate platform out there.

  • vasras

    That’s the only thing that matters now. nVidia screwed it up before…

  • shaun walsh

    Oh, I completely zoomed over the Trine 2 part. I was thinking of the T-Rex benchmark they did that only lasted 2.5 hours but had stupidly high frame rates. They went back and capped the fps, and battery life surpassed the iPad’s.

  • shaun walsh

    Where do you see the Trine 2 test with battery life?

  • kron123456789

    Definitely not here. You know, there is one useful thing called the Internet. You can find anything there.

  • shaun walsh

    I found it. It was actually 3 hours

  • shaun walsh

    How is it not efficient? In perf/watt it kills the A7 for GPU. Now, the A15s aren’t efficient, but hey, they aren’t Intel.

  • Drew Forester

    I’d love to see something like this powering Ubuntu Touch devices, even though that OS will probably never actually make it to market. But it’s nice to dream.

  • Brian Caulfield

    Thanks for the feedback!

  • fteoOpty64

    Congrats, NV. This achievement is going to be significant for the industry and will surely make Intel very nervous! I am sure NV is doing serious R&D on multi-socket chipsets in order to scale Denver-type SoCs to many distributed nodes forming a single system, or many smaller virtual systems. Great for automotive and other vehicular applications for sure. Best of luck rolling out this chip into products so the world can enjoy the benefits and develop software to take more advantage of this architecture. OpenCL is great and should be pushed further ….

  • fteoOpty64

    @Slacker: Dude, there is nothing VLIW in the architecture. It is all standard ARM architecture – just pre-execution code optimisation by the other CPU to populate the optimization cache. No code-morphing either, but some intelligent re-ordering done pre-execution. It looks like an alternative to an OoO design without the complex transistor/power overhead, using another core for pre-processing of code in an optimised way. Of course, there would be a penalty if re-ordering goes wrong, but if that is rare enough (like 0.001%), then it is a great trade-off. I am sure they simulated this design very, very thoroughly on their supercomputers, so they know it works well.

    This architecture will throw some benchmarks into complete twists, so we shall see soon as products roll out and benchmarks get done in n ways by thousands of people.

  • fteoOpty64

    Just because NV has not designed a CPU before does not mean they cannot! Look at the Apple A7, which is a jolly great CPU yet consumes just a little more than the outgoing ARM equivalent. You got it wrong about a Transmeta-type implementation; there is none in the Denver design. Someone seems to have thrown in that old architecture and started talking VLIW. There is no VLIW anywhere in this design. Go look for it.
    The reality is the proof of the pudding in real-world benchmarking, and we shall see soon – surely NV has done that before and knows. They will have the last laugh here, I am sure!
    Preliminary benchmarks show Denver to be almost twice as fast as the S800!

  • fteoOpty64

    If you think it is possible to merge Win8.1 into WP, then you are sorely mistaken about how OSes work and how apps work in their frameworks. It is just not possible due to the different CPU architectures, x86 vs ARM. WinRT was crippled at birth and remained so, which led to its demise. Had MS “opened” up RT, the story might have been different. As it is, thanks to Bay Trail CPUs and low-power Haswell chips, the Win tablets are running pure x86. The “RT Madness” might not have happened had MS made RT the same as Pro but opened up RT to freely installable 3rd-party programs. They LOCKED it up to be almost useless!
    With Android and iOS being so strong, there is NOT going to be any more “Windows-on-ARM”. That boat sank a long time ago, sorry.

  • fteoOpty64

    You can extend battery life by capping the framerate to 30 fps, among other things. I was told it is possible to get 4-plus hours of gameplay rather than 2.5 hrs unrestricted. Besides, when playing games you are likely to hold the controller, so the tablet can be plugged in on a table or some surface.

  • fteoOpty64

    Would love to see a Shield Tablet 10 with a 10-inch 2560x1600 screen, bundled with two WiFi controllers for dual-play kids!

  • fteoOpty64

    Considering the Maxwell architecture has been on the market for some time, NV can crank up the Maxwell version of the GPU and name it M2. They can then churn out a “large-tablet” version of Denver with 2 Maxwell SMX cores. How’s that?

  • fteoOpty64

    Actually, for a large tablet (i.e. 10 or 12 inches), full Ubuntu together with a virtualized Android would be really cool. Full Ubuntu is for keyboard-and-mouse people. You get into a terminal and IDE to do some serious work, while Android is more for web and social media stuff. Audio can be backgrounded by either. Imagine a work machine with your own virtualized Android, VPNed to your home server!

  • fteoOpty64

    Dude, it is cheaper if you just buy a GT-750 card today! That is faster than the K1, as it eats 55 watts TDP max. Costs around 100 bucks or less.

  • haakonks

    Wow! That’s a lot of sweeping statements!
    Let me just answer your point number 3. According to the slides from Nvidia’s presentation at HOT CHIPS, Denver supports NEON, with 2 x 128-bit units (FP0 and FP1 in the execution pipeline).

  • Nicolai Behmann

    Are you planning an update of the Jetson TK1 dev board to the 64-bit Denver TK1 version, maybe even this year?

  • Drew Forester

    You just brought a tear to my eye.

    I have to laugh about the K1 powering Acer’s new Chromebook. Maybe the other ARM Chromebooks aren’t quite up to par, but the incredible versatility and power of the K1 just seems a bit much for a glorified web browser. That said, if a K1 showed up in a Firefox OS device, I would still be all over it, but mostly because I just like Firefox OS as an open source concept.

    But Ubuntu Touch would rock on this chip. I mean… holy crap. Yeah.

    If NVIDIA isn’t already part of Canonical’s little cabal, they should be.

  • Mexor

    So you expect Apple to release a GPU better than the K1 in September? I imagine when playing Trine 2 it is the GPU performance and GPU efficiency that plays the vastly bigger role, isn’t it? I have a feeling if people want a jump in efficiency with equal or greater GPU performance and ability they will have to wait for NVidia’s Erista next year.

    The issue for NVidia is that GPU performance is not at this time as important for tablets as it is for desktops. They are trying to help drive a seemingly inevitable change along that direction with their Shield line, but it also seems like they are trying to differentiate their CPU as well with Denver. If they can do that they would be expected to be the dominant player in the Android tablet space, replacing Qualcomm there. But that remains to be seen.

    However I don’t see how you can expect these other companies to suddenly match NVidia’s graphics prowess. Consider this: if they do match it, they will do it coming from an entirely different direction. Nvidia successfully brought their bread-and-butter PC GPU technology and expertise to the mobile space which caused sudden jumps in GPU performance and perf/watt in mobile. Since the other players don’t have PC GPU technology and expertise they would need to make similar jumps in performance and perf/watt through another avenue at almost exactly the same time. Seems unlikely to me.

  • The Calm Critic

    They’d be clinically insane if there’s not gonna be any. At the very least you gotta have one reference Denver-driven Shield device for devs, tighten that sh*t up and then profit from the mass market afterwards.

    ..or so I hope…

  • robjl

    Will there be a developer board for the 64-bit Tegra K1? I would be very interested to get one.

  • nobodyspecial

    I’d rather see people start using game benchmarks (meaning ACTUAL GAMES) than this crap. What does Linpack prove to me compared to what I’ll actually be using a device like this for? I want to know how good it is in stuff I’ll ACTUALLY DO on tablets/phones etc. I’m not sure why mobile games haven’t started building in benchmarking like they usually do on PCs. It would certainly be closer to what we are really doing than most of this synthetic crap they benchmark today.

    Linpack is fairly useless in this usage scenario, and I’d hope nobody in tablets/phones would be optimizing for that stuff. It’s a waste of time and resources when they should be concentrating on optimizing gaming, because that’s where 90% of the revenue from Google Play is coming from (and 80+% at the Apple store, and 65% at Amazon). It’s the games, not Linpack, that matter ;)

  • Michael Gainor

    Why is it not quad core?

  • Rich- Don’

    mini-hdmi on phone!

  • Kane

    No. Your playback speed will be too fast.

  • http://juanrga.com juanrga

    It has been a long wait for an Nvidia CPU, and finally it is a very interesting product. I have been reading the whitepaper and it mentions that DCO can optimize over a 100-instruction window. Does this mean that Denver could be classified as a KIP (Kilo-Instruction Processor)?

    The rumour is that this is 256-bit VLIW hardware _at the metal level_, but someone here rejected that. If this is not a VLIW design, I have serious difficulties understanding how Nvidia could design a 7-wide core when the rest of academia/industry has difficulty going above 4-wide, due to the quadratic complexity law for superscalar designs.