Hyper-Threading Technology Explained Architecture and Microarchitecture

Virtually all contemporary operating systems divide their workload into processes and threads that can be independently scheduled and dispatched to run on a processor. The same division of workload can be found in many high-performance applications such as database engines, scientific computation programs, engineering-workstation tools, and multimedia programs. To gain access to increased processing power, most contemporary operating systems and applications are also designed to execute in dual- or multi-processor environments, where – through the use of symmetric multiprocessing (SMP) – processes and threads can be dispatched to run on a pool of processors. Hyper-Threading technology leverages this support for process- and thread-level parallelism by implementing two logical processors on a single chip. This configuration allows a thread to be executed on each logical processor. Instructions from both threads are simultaneously dispatched for execution by the processor core. The processor core executes these two threads concurrently, using out-of-order instruction scheduling to keep as many of its execution units as possible busy during each clock cycle. Architecturally, a processor with Hyper-Threading technology is viewed as consisting of two logical processors, each of which has its own IA-32 architectural state. After power-up and initialisation, each logical processor can be individually halted, interrupted, or directed to execute a specified thread, independently of the other logical processor on the chip. The logical processors share the execution resources of the processor core, which include the execution engine, the caches, the system bus interface, and the firmware. Legacy software will run correctly on an HT-enabled processor, and the code modifications needed to get the optimum benefit from the technology are relatively simple....

Pentium Northwood

For several months after the Pentium 4 began shipping in late 2000, the leadership in the battle to have the fastest processor on the market alternated between Intel and rival AMD, with no clear winner emerging. However, towards the end of 2001 AMD managed to gain a clear advantage with its Athlon XP family of processors. Intel’s response came at the beginning of 2002, in the shape of the Pentium 4 Northwood core, manufactured using the 0.13-micron process technology first deployed on the company’s Tualatin processor in mid-2001. The transition to the smaller fabrication technology represents a particularly important advance for the Pentium 4. When it was originally released as a 0.18-micron processor, it featured a core that was almost 70% larger than that of the competition. A larger core means that there is a greater chance of a defect falling on any single die, which lowers the yield of the part. A larger core also means that fewer CPUs can be produced per wafer, which makes each CPU more expensive to manufacture. The 0.13-micron Northwood core addresses this issue. Compared with the original 0.18-micron Willamette die’s surface area of 217mm², the Northwood’s is a mere 146mm². What this means is that on current 200mm wafers, Intel is now able to produce approximately twice as many Pentium 4 processors per wafer as was possible on the 0.18-micron process. Architecturally, the new Northwood core doesn’t differ much at all from its predecessor, and most of its differences can be attributed to the smaller process technology. First off, Intel exploited the opportunity this gave them to increase the transistor count...
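The die-size argument above can be checked with a rough calculation. The sketch below uses a common first-order estimate of gross dies per wafer (wafer area divided by die area, minus an edge-loss term) and a simple Poisson yield model. The die areas and wafer diameter come from the text; the defect density is a purely hypothetical value chosen for illustration, and the function names are not from any Intel source.

```python
import math

WAFER_DIAMETER_MM = 200      # from the text
WILLAMETTE_AREA_MM2 = 217    # from the text
NORTHWOOD_AREA_MM2 = 146     # from the text
ASSUMED_DEFECTS_PER_CM2 = 0.5  # hypothetical value, not from the source


def gross_dies_per_wafer(wafer_diameter_mm, die_area_mm2):
    """First-order estimate: whole-wafer die count minus dies lost at the edge."""
    radius = wafer_diameter_mm / 2
    whole = math.pi * radius ** 2 / die_area_mm2
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)
    return int(whole - edge_loss)


def poisson_yield(die_area_mm2, defects_per_cm2):
    """Fraction of dies expected to be defect-free under a Poisson model."""
    area_cm2 = die_area_mm2 / 100
    return math.exp(-area_cm2 * defects_per_cm2)


def good_dies_per_wafer(die_area_mm2):
    """Gross die count scaled by the modelled yield."""
    gross = gross_dies_per_wafer(WAFER_DIAMETER_MM, die_area_mm2)
    return gross * poisson_yield(die_area_mm2, ASSUMED_DEFECTS_PER_CM2)


gross_ratio = (gross_dies_per_wafer(WAFER_DIAMETER_MM, NORTHWOOD_AREA_MM2)
               / gross_dies_per_wafer(WAFER_DIAMETER_MM, WILLAMETTE_AREA_MM2))
good_ratio = (good_dies_per_wafer(NORTHWOOD_AREA_MM2)
              / good_dies_per_wafer(WILLAMETTE_AREA_MM2))

print(f"gross dies ratio: {gross_ratio:.2f}")
print(f"good dies ratio:  {good_ratio:.2f}")
```

On geometry alone the smaller die gives roughly one and a half times as many candidates per wafer. Once a yield model is applied, the ratio of working dies rises further, because the smaller die is also less likely to contain a defect. How close this comes to the "approximately twice" quoted above depends entirely on the assumed defect density.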

Pentium 4

In early 2000, Intel unveiled details of its first new IA-32 core since the Pentium Pro – introduced in 1995. Previously codenamed Willamette – after a river that runs through Oregon – it was announced a few months later that the new generation of microprocessors would be marketed under the brand name Pentium 4 and be aimed at the advanced desktop market rather than servers. Representing the biggest change to Intel’s 32-bit architecture since the Pentium Pro in 1995, the Pentium 4’s increased performance is largely due to architectural changes that allow the device to operate at higher clock speeds and logic changes that allow more instructions to be processed per clock cycle. Foremost amongst these is the Pentium 4 processor’s internal pipeline – referred to as Hyper Pipeline – which comprises 20 pipeline stages versus ten for the P6 microarchitecture. A typical pipeline has a fixed amount of work that is required to decode and execute an instruction. This work is performed by individual logical operations called gates. Each logic gate consists of multiple transistors. By increasing the number of stages in a pipeline, fewer gates are required per stage. Because each gate requires some amount of time (delay) to provide a result, decreasing the number of gates in each stage allows the clock rate to be increased. It also allows more instructions to be in flight, at various stages of decode and execution, in the pipeline. Although these benefits are offset somewhat by the overhead of additional gates required to manage the added stages, the overall effect of increasing the number of pipeline stages is a reduction in...
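The relationship between pipeline depth and clock rate described above can be sketched with a toy model: the fixed amount of logic is split evenly across the stages, and each stage pays a fixed overhead for the extra gates that manage it. Only the stage counts (10 and 20) come from the text. The total logic depth, per-gate delay and per-stage overhead are hypothetical values picked for illustration.

```python
TOTAL_GATE_DELAYS = 200       # hypothetical: gate delays needed per instruction
GATE_DELAY_PS = 50            # hypothetical: delay of one gate, in picoseconds
STAGE_OVERHEAD_PS = 100       # hypothetical: per-stage management overhead


def max_clock_mhz(stages):
    """Clock rate allowed by the slowest stage, assuming logic is split evenly."""
    gates_per_stage = TOTAL_GATE_DELAYS / stages
    cycle_ps = gates_per_stage * GATE_DELAY_PS + STAGE_OVERHEAD_PS
    return 1e6 / cycle_ps     # a 1 ps cycle corresponds to 1e6 MHz


p6_like = max_clock_mhz(10)   # ten stages, as in the P6 microarchitecture
p4_like = max_clock_mhz(20)   # twenty stages, as in the Hyper Pipeline

print(f"10 stages: {p6_like:.0f} MHz")
print(f"20 stages: {p4_like:.0f} MHz")
print(f"speed-up:  {p4_like / p6_like:.2f}x")
```

In this model, doubling the stage count raises the attainable clock rate but by less than a factor of two, which mirrors the point that the per-stage overhead offsets part of the benefit.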

Pentium Roadmap

The table below presents the anticipated roadmap of future Intel mainstream desktop processor developments. Note that currently there is no public roadmap information for Pentium developments beyond H1 2007. This page will be updated when the data becomes available, though of course the focus of Intel is undoubtedly shifting towards Core technology.

H1 2007
Intel Pentium 4 processor 672 supporting HTT or greater – 2MB L2 cache – 3.60 GHz – 800 MHz FSB – Intel 945/955X Express Chipset family
Intel Celeron D 360 or greater – 512KB L2 cache – 3.46 GHz – 533 MHz FSB – Intel 945X/946X Express Chipset family

Pentium Tualatin

It had been Intel’s original intention to introduce the Tualatin processor core long before it actually did, as a logical progression of the Pentium III family that would – as a consequence of its finer process technology – allow higher clock frequencies. In the event, the company was forced to switch its focus to the (still 0.18-micron) Pentium 4, on the basis that it represented a better short-term prospect in its ongoing clocking war with AMD than the Tualatin which, of course, would require a wholesale switch to a 0.13-micron fabrication process. As a consequence it was not until mid-2001 that the new core appeared. The Tualatin is essentially a 0.13-micron die shrink of its Coppermine predecessor. It does, however, offer one additional performance-enhancing feature – Data Prefetch Logic (DPL). DPL analyses data access patterns and uses available FSB bandwidth to prefetch data into the processor’s L2 cache. If the prediction is incorrect, there is no associated performance penalty. If it is correct, the time needed to fetch the data from main memory is avoided. Although Tualatin processors are nominally Socket 370 compliant, clocking, voltage and signal level differences effectively mean that they will not work in existing Pentium III motherboards. Since the release of the Pentium Pro, all Intel P6 processors have used Gunning Transceiver Logic+ (GTL+) technology for their FSB. The GTL+ implementation actually changed slightly from the Pentium Pro to the Pentium II/III, the latter implementing what is known as the Assisted Gunning Transceiver Logic+ (AGTL+) bus. Both GTL+ and AGTL+ use 1.5V signalling. Tualatin sees a further change, this time to an AGTL signalling bus that uses...

Pentium III

Intel’s successor to the Pentium II, formerly codenamed Katmai, came to market in the spring of 1999. MMX had introduced the technique called Single Instruction Multiple Data (SIMD), which enables one instruction to perform the same function on several pieces of data simultaneously, improving the speed at which sets of data requiring the same operations can be processed. The new processor introduces 70 new Streaming SIMD Extensions. Fifty of these are intended to improve floating-point performance. In order to assist data manipulation there are eight new 128-bit floating-point registers. In combination, these enhancements can lead to up to four floating-point results being returned at each cycle of the processor. There are also 12 New Media instructions to complement the existing 57 integer MMX instructions by providing further support for multimedia data processing. The final eight instructions are referred to by Intel as the New Cacheability instructions. They improve the efficiency of the CPU’s Level 1 cache and allow sophisticated software developers to boost the performance of their applications or games. Beyond this, the Pentium III makes no other architectural improvements. It still fits into Slot 1 motherboards, albeit with simplified packaging – the new SECC2 cartridge allows a heatsink to be mounted directly onto the processor card and uses less plastic in the casing. The CPU still has 32KB of Level 1 cache and will initially ship in 450MHz and 500MHz models with a frontside bus speed of 100MHz and 512KB of half-speed Level 2 cache, as in the Pentium II. This means...
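The figure of four floating-point results per cycle follows from the register width: a 128-bit register holds four 32-bit single-precision values, and one SIMD instruction operates on all four lanes. The sketch below models that in plain Python, using a 16-byte buffer as a stand-in for a register. It illustrates the data layout only and is not how the hardware or any real intrinsic is implemented; the function names are hypothetical.

```python
import struct

REGISTER_BITS = 128                   # register width, from the text
LANE_BITS = 32                        # single-precision float width
LANES = REGISTER_BITS // LANE_BITS    # values per register
LAYOUT = f"<{LANES}f"                 # little-endian packed floats


def pack_register(values):
    """Pack LANES floats into one 16-byte 'register' image."""
    return struct.pack(LAYOUT, *values)


def unpack_register(register):
    """Return the lanes of a 'register' image as a tuple of floats."""
    return struct.unpack(LAYOUT, register)


def packed_add(reg_a, reg_b):
    """Model of one SIMD add: the same operation applied to every lane."""
    lanes_a = unpack_register(reg_a)
    lanes_b = unpack_register(reg_b)
    return pack_register([a + b for a, b in zip(lanes_a, lanes_b)])


a = pack_register([1.0, 2.0, 3.0, 4.0])
b = pack_register([10.0, 20.0, 30.0, 40.0])
print(LANES, unpack_register(packed_add(a, b)))
```

A scalar loop would need one add per element, whereas the packed form expresses the same work as a single operation over all lanes, which is where the per-cycle gain comes from.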
