ARM Teaching Resources, Guest Lectures/Workshops, & Research Papers

The ARM University Program provides a variety of resources and materials for students and faculty who want to build ARM projects on real hardware, incorporate ARM into new or existing curricula, or gain familiarity with ARM. These include links to books, real-time OSes, training materials, real academic curricula, academic research papers, application notes, and other documentation.

Books Suited for Academia

Real ARM Course Curricula

Lab and Teaching Materials

Guest Lectures and Workshops

ARM-related Research Papers

ARM Assembly Language: Fundamentals and Techniques

In English, by William Hohl
Published by CRC

ISBN-10: 1439806101
ISBN-13: 978-1439806104
Errata List

Embedded Systems: Introduction to the ARM Cortex-M3

In English, by Jonathan W. Valvano
Published by CreateSpace

ISBN-10: 1477508996
ISBN-13: 978-1477508992

Embedded Systems: Real-Time Interfacing to the ARM Cortex-M3

In English, by Jonathan W. Valvano
Published by CreateSpace

ISBN-10: 1463590156
ISBN-13: 978-1463590154

Embedded Systems: Real-Time Operating Systems for the ARM Cortex-M3

In English, by Jonathan W. Valvano
Published by CreateSpace

ISBN-10: 1466468866
ISBN-13: 978-1466468863

Computer Organization and Design: The Hardware/Software Interface - ARM Edition

In English, by David Patterson and John Hennessy
Fourth Edition
Published by Morgan Kaufmann

ISBN-10: 8131222748
ISBN-13: 978-8131222744

Fast and Effective Embedded Systems Design: Applying the ARM mbed

In English, by Rob Toulson and Tim Wilmshurst
Published by Newnes

ISBN: 978-0-08-097768-3

ARM Microcontroller Interfacing

In English, by Warwick A. Smith
Published by Elektor

ISBN-10: 0905705912
ISBN-13: 978-0905705910

ARM Microcontrollers, Part 1: 35 Projects for Beginners

In English, by Bert Van Dam
Published by Elektor

ISBN-10: 0905705947
ISBN-13: 978-0905705941

Assembly Language Programming: ARM Cortex-M3

In English, by Vincent Mahout
Published by Wiley-ISTE

ISBN-10: 1848213298
ISBN-13: 978-1848213296

Fundamentals of Embedded Software with the ARM® Cortex-M3

In English, by Daniel W. Lewis
Published by Prentice Hall

ISBN-10: 0132916541
ISBN-13: 978-0132916547

Getting Started with the Internet of Things: Connecting Sensors and Microcontrollers to the Cloud

In English, by Cuno Pfister
Published by O'Reilly Media

ISBN-10: 1449393578
ISBN-13: 978-1449393571

Various Micrium on ARM Textbooks

In English
Published by Micrium

The Definitive Guide to the ARM Cortex-M3

In English, by Joseph Yiu
Published by Newnes
First Edition Errata Document (71KB PDF)

ISBN-10: 0750685344
ISBN-13: 978-0750685344

The Definitive Guide to the ARM Cortex-M0

In English, by Joseph Yiu
Published by Newnes

ISBN-10: 0123854776
ISBN-13: 978-0123854773

ARM System-on-chip Architecture

In English, by Steve Furber
Second Edition
Published by Addison Wesley

ISBN: 0-201-67519-6

Computers as Components: Principles of Embedded Computing System Design

In English, by Wayne Wolf
Published by Morgan Kaufmann

ISBN: 1-55860-541-X

ARM Assembly Language - an Introduction

In English, by J.R. Gibson
Published by Lulu.com

ISBN: 978-1-84753-696-9

ARM System Developer's Guide

In English, by Andrew Sloss, Dominic Symes, and Chris Wright
Published by Morgan Kaufmann

ISBN: 1-55860-874-5

C Programming for Embedded Microcontrollers

In English, by Warwick A. Smith
Published by Elektor

ISBN: 978-0-905705-80-4

Free Cortex-A Series Programmers Guide (free registration required)

In English, edited by ARM
Published by ARM

Free ARMv7-AR, ARMv7-M, ARMv6-M, and ARMv5 Architecture Reference Manual Downloads

In English, edited by David Seal
Published by Addison-Wesley

ISBN: 0-201-73719-1

ARM-based Embedded System Development Tutorial

ARM嵌入式系统基础教程
In Chinese, by Ligong Zhou
Author: 周立功
Published by BUAAP

ISBN: 7811240408

ARM Embedded System Experiment Tutorial (Part II)

ARM嵌入式系统实验教程
In Chinese, by Ligong Zhou
Author: 周立功
Published by BUAAP

ISBN: 7810777297

ARM Based Embedded Software Development Tutorial

ARM嵌入式系统软件开发实例
In Chinese, by Ligong Zhou
Author: 周立功
Published by BUAAP

ISBN: 7810775774

Other ARM-related Books

Real ARM Course Curricula

US and Canada

Auburn University
- ELEC 5260/ELEC 6260: Embedded Computing Systems

Carnegie Mellon University
- ECE 349: Embedded Real-Time Systems
- ECE 549: Embedded Systems Design

Georgia Tech
- ECE4180: Embedded Systems Design
- ECE4894: Embedded Computing Systems
- CS4803PGC/CS8803PGC: Design and Programming of Game Consoles

University of Texas - Austin
- EE319K: Introduction to Embedded Systems
- EE 345M/EE 380L.6: Real Time Operating Systems for Embedded Systems
- EE445L Microprocessor Applications and Organization
- EE 382N-4: Advanced Embedded Systems Architecture

University of Waterloo
- ECE455: Embedded Software
- ECE254: Operating Systems and Systems Programming

University of Wisconsin
- ECE 315: Introductory Microprocessor Laboratory
- ECE 353: Introduction to Microprocessor Systems
- ECE 453: Embedded Microprocessor System Design

UK and Europe

École Polytechnique Fédérale de Lausanne (EPFL)
- EE-310: Microprogrammed Embedded Systems

Imperial College - University of London
- EE2-19: Introduction to Computer Architecture
- EE1-9: Introduction to Computer Architecture and Systems
- EE4-52: Embedded Systems

University of Manchester
- COMP15111: Fundamentals of Computer Architecture
- COMP22712: Microcontrollers

University of Slovenia
- Academic Material and Various Links

Far East

National Tsing Hua University
- EE 5255: SOC Design Lab

Latin America

University of Buenos Aires
- [66.48] Seminars on Electronics: Embedded Systems

Rest of World

University of New South Wales
- ELEC 2041: Microprocessors and Interfacing

University of Cape Town
- EEE 3074W: Embedded Systems

Lab and Teaching Materials

Lab Manuals and Exercises

Slides and Academic Teaching Material

Application Notes for Students and Faculty

Other Projects and Resources

Guest Lectures and Workshops

ARM currently offers technical workshops and seminars to universities worldwide. Depending on availability and locale, it may be possible for our university staff or ARM engineers to deliver a lecture at your school. The following lectures are currently offered:

ARM Processors and Architecture Overview

This 1 - 2 hour lecture covers the company's business model and industry and the basics of the ARM architecture, including the programmer's model, basic instruction sets, pipelines for the core families, AMBA, and development tools. Aimed at third- or fourth-year students and faculty members, this presentation is designed to answer students' and faculty members' questions at the most technical level. The lecture also includes a demo of the latest ARM hardware and software technologies.

ARM Architecture Comprehensive Overview

This 1.5 - 2 hour lecture covers the company's business model and industry and the basics of the ARM architecture (including all architecture families), as well as the programmer's model, instruction sets, pipelines for the core families, AMBA, an overview of Mali graphics processors, energy management schemes, and development tools. Aimed at third- or fourth-year students, graduate students, and faculty members, this presentation is designed to answer students' and faculty members' questions at the most technical level. The lecture also includes a demo of the latest ARM hardware and software technologies.

ARM/NXP mbed Hands-on Workshop

This 1 - 1.5 hour hands-on workshop shows the power and flexibility of the ARM/NXP mbed platform and how it is useful in an academic setting. With falling costs and increasing processor complexity, microcontrollers are becoming cheaper, more powerful, and more interactive. MCUs are now truly solutions looking for problems, where anyone can conceive of a microcontroller application. The problem until now has been turning the idea into a prototype quickly and experimenting with the technology. ARM has changed this with mbed, a rapid prototyping platform designed to simplify getting started with microcontrollers. Using a web-based compiler and a very simple drag-and-drop interface, and without the need for expensive tools, a new user can write and execute a "hello world" program in about sixty seconds. A lab setting with internet-connected host machines is required.

ARM Cortex-M and v6/7-M Introductions

This 1 or 1.5 hour lecture (short and long versions offered) covers the basics of the Cortex-M0, M3, or M4 processor, including an introduction to the ARM v7-M or v6-M architecture, instruction sets, programmer's model, exception and interrupt handling, data path, pipeline, recommended programming tools, and a demonstration of the latest Cortex-M-based hardware.

ARM Cortex-A and v7-A Introductions

This 1 or 1.5 hour lecture (short and long versions offered) covers the basics of the Cortex-A processors, including an introduction to the ARM v7-A architecture, instruction sets, programmer's model, exception and interrupt handling, data path, pipeline, recommended programming tools, and a demonstration of the latest Cortex-A-based hardware.

ARM SoC Design with AMBA

This 1 or 1.5-hour lecture (short and long versions offered) covers the basics of ARM processors and architectures, SoC design, the AMBA bus protocol, and how to connect an ARM core with different peripherals in an SoC.

Introduction to ARM and the Microprocessor Industry

This 0.5 - 1 hour lecture covers the company's business model and the current state of the industry. The lecture also includes a demo of the latest ARM hardware and software technologies.

Other lectures are available upon request, including advanced topics like multicore issues and GPUs.

Please note that there is no marketing or advertising done during the presentations, and the lectures focus entirely on technical issues. If you are interested in having a guest lecturer, please send an email to [email protected], including your university name, address, and contact information.

ARM-related Research Papers

Want your ARM-related research paper listed here? Contact the ARM University Relations team for approval.

A number of research papers have recently been written around ARM technologies:

Accurate system-level performance modeling and workload characterization for mobile internet devices, 2008
Abstract: As mobile applications and devices become ubiquitous, consumer demands for performance, power efficiency, and connectivity are increasing. The software framework existing on mobile internet devices is a complex interaction of real-time tasks, non-real-time applications, and operating system management routines. Traditional simulation approaches are poorly suited to modeling the overall performance characteristics of such systems. Additionally, many traditional benchmark suites used in academia and industry for microprocessor benchmarking and design have been found to be unrepresentative of mobile workloads. This paper presents multiple frameworks utilized for accurately modeling system-level performance of embedded systems used for mobile applications.

An Open-Source Platform for Power Converters Teaching, 2009
Abstract: This work presents a new approach for teaching power converters through the use of power inverter example experiments. This scheme is based on custom-designed hardware and a software platform built on several open-source tools. The platform is controlled by a 32-bit microprocessor, which gives the student the possibility to modify the experiments through the control firmware. All the hardware and software necessary to design and implement the control is open-source.

Analysis of hardware prefetching across virtual page boundaries, 2007
Abstract: Data cache prefetching in the L2 is at the forefront of prefetching research. In this paper, the authors analyze the impact of virtual page boundaries on these prefetchers.

AnySP: anytime anywhere anyway signal processing, 2009
Abstract: This paper proposes an example architecture, referred to as AnySP, for the next generation mobile signal processing. AnySP uses a co-design approach where the next generation wireless signal processing and high-definition video algorithms are analyzed to create a domain specific programmable architecture. At the heart of AnySP is a configurable single-instruction multiple-data datapath that is capable of processing wide vectors or multiple narrow vectors simultaneously. Results show that AnySP is capable of sustaining 4G wireless processing and high-definition video throughput rates, and will approach the 1000 Mops/mW efficiency barrier when scaled to 45nm.

ARM Cortex-A8: A High Performance Processor for Low Power Applications, Nov. 2007
Abstract: This paper details the ARM Cortex-A8, a microprocessor targeted at systems requiring high performance for both general purpose and media applications while maintaining a low, sub 1 Watt, power profile and a small silicon footprint.

ARMs for the Poor: Selecting a Processor for Teaching Computer Architecture, Oct. 2010
Abstract: Those teaching computer architecture and organization courses have to choose a target processor to illustrate the basic principles of instruction set design. This paper suggests that it is time to choose the ARM processor architecture that is markedly different to those used in most current courses.

Design and Implementation of Turbo Decoders for Software Defined Radio, Oct. 2006
Abstract: This paper presents a case study of algorithm-architecture co-design of Turbo decoder for SDR.

Designing dependable multicore system with unreliable components, June 2009
Abstract: Single-core chip architectures do not scale well due to various design and reliability challenges. Multicore systems with large numbers of cores are becoming common to take advantage of Moore's law. However, there are various reliability concerns in the nanoscale era due to spatial, temporal, and dynamic variations. The only way to enable sustained scaling of multicore systems is to make the architecture robust by using adaptive design techniques and having redundant cores which can replace faulty ones.

Developing an Intermediate Embedded-Systems Course with an Emphasis on Collaboration, Oct 2011
Abstract: Embedded systems are computing devices designed to perform specific tasks as part of larger systems such as digital cameras, measuring instruments, cars, etc. Technological advances have added complexity to embedded-systems development, which needs to be reflected in academic curricula. This paper presents the design and delivery of an intermediate embedded-systems course that follows up on a typical introduction to microcontrollers and, while doing so, it offers an example of engineering-course development that makes use of collaborative learning, outbound ties and learning modes of a community of practice. The paper explains planning, learning objectives, activities and other aspects of the course, including hardware and software tools used as well as lessons learned. A section is dedicated to student projects and current results.

DVFS in loop accelerators using BLADES, 2008
Abstract: Hardware accelerators are common in embedded systems that have high performance requirements but must still operate within stringent energy constraints. To facilitate short time-to-market and reduced non-recurring engineering costs, automatic systems that can rapidly generate hardware with both power and performance in mind are extremely attractive. This paper proposes the BLADES (Better-than-worst-case Loop Accelerator Design) system for automatically designing self-tuning hardware accelerators that dynamically select their best operating frequency and voltage based on environmental conditions, silicon variation, and input data characteristics.

Energy-Efficient Simultaneous Thread Fetch from Different Cache Levels in a Soft Real-Time SMT Processor, 2008
Abstract: This paper focuses on the instruction fetch resources in a real-time SMT processor to provide an energy-efficient configuration for a soft real-time application running as a high priority thread as fast as possible while still offering decent progress in low priority or non-real-time thread(s). The authors propose a fetch mechanism, Fetch-around, where a high priority thread accesses the L1 ICache, and low priority threads directly access the L2. This allows both the high and low priority threads to simultaneously fetch instructions, while preventing the low priority threads from thrashing the high priority thread's ICache data. Overall, the authors show an energy-performance metric that is 13% better than the next best policy when the high performance thread priority is 10x that of the low performance thread.

Exploring Variability and Performance in a Sub-200-mV Processor, April 2008
Abstract: In this study, we explore the design of a subthreshold processor for use in ultra-low-energy sensor systems. We describe an 8-bit subthreshold processor that has been designed with energy efficiency as the primary constraint. The processor, which is functional below Vdd=200 mV, consumes only 3.5 pJ/inst at Vdd=350 mV and, under a reverse body bias, draws only 11 nW at Vdd=160 mV. Process and temperature variations in subthreshold circuits can cause dramatic fluctuations in performance and energy consumption and can lead to robustness problems. We investigate the use of body biasing to adapt to process and temperature variations. Test-chip measurements show that body biasing is particularly effective in subthreshold circuits and can eliminate performance variations with minimal energy penalties. Reduced performance is also problematic at low voltages, so we investigate global and local techniques for improving performance while maintaining energy efficiency.

From SODA to Scotch: The Evolution of a Wireless Baseband Processor, Nov. 2008
Abstract: This paper presents the architectural evolution of SODA, a fully programmable multicore architecture to meet the real-time requirements of 3G wireless protocols, going from a research design to a commercial prototype, including the goals, tradeoffs and final design choices.

Impact of Technology and Voltage Scaling on the Soft Error Susceptibility in Nanoscale CMOS, 2008
Abstract: With each technology node shrink, a silicon chip becomes more susceptible to soft errors. The susceptibility further increases as the voltage is scaled down to save energy. Based on analysis on cells from commercial libraries, the authors quantified the increase in the soft error probability across 65nm and 45nm technology nodes at different supply voltages using the Qcrit based simulation methodology. The Qcrit for both bit cells and latches decreases by ~30% as the designs are scaled from 65nm to 45nm. This decrease is expected to continue with further technology scaling as well. The results show that at nominal voltage, the Qcrit for a latch is just ~20% more than that of the bit cell in sub-65nm technology nodes. This work shows that in sub-65nm technology nodes with aggressive voltage scaling, it is equally critical to solve the soft error problems in logic (latches, flip-flops) as it is in SRAMs.

Implementing Embedded Security on Dual-Virtual-CPU Systems, Nov. 2007
Abstract: This article is an introduction to a new hardware technology that provides a low-cost, high-performance, isolated environment to store and process sensitive data for embedded systems, and a case study of the design of a programmable security software framework.

Implementing the Cortex-M0 DesignStart Processor in a Low-end FPGA, Oct. 2010
(Original work in Spanish presented at SASE 2011)
Abstract: ARM has recently launched a low cost, reduced version of the Cortex-M0 processor (Cortex-M0 DesignStart™), which could be synthesized into a FPGA or used for silicon implementations. This article shows the results of the implementation of the Cortex-M0 DesignStart processor in a low-end FPGA from Xilinx, extending the implementations available for Cortex-M processors in FPGA.

Low-cost Techniques for Reducing Branch Context Pollution in a Soft Real-time Embedded Multithreaded Processor, Oct. 2007
Abstract: This paper proposes two low-cost and novel branch history buffer handling schemes aiming at skewing the branch prediction accuracy in favor of a real-time thread for a soft real-time embedded multithreaded processor.

PicoServer: Using 3D stacking technology to build energy efficient servers, Oct. 2008
Abstract: This article extends prior work to show that a straightforward use of 3D stacking technology enables the design of compact energy-efficient servers. The proposed architecture, called PicoServer, employs 3D technology to bond one die containing several simple, slow processing cores to multiple memory dies sufficient for a primary memory.

RazorII: In Situ Error Detection and Correction for PVT and SER Tolerance, Jan. 2009
Abstract: Traditional adaptive methods that compensate for PVT variations need safety margins and cannot respond to rapid environmental changes. In this paper, the authors present a design (RazorII) which implements a flip-flop with in situ detection and architectural correction of variation-induced delay errors. Error detection is based on flagging spurious transitions in the state-holding latch node. The RazorII flip-flop naturally detects logic and register SER.

Reconfigurable energy efficient near threshold cache architectures, 2008
Abstract: Battery life is an important concern for modern embedded processors. Supply voltage scaling techniques can provide an order of magnitude reduction in energy. Current commercial memory technologies have been limited in the degree of supply voltage scaling that can be performed if they are to meet yield and reliability constraints. This has limited designers from exploring the near threshold operating regions for embedded processors. Summarizing prior work, the authors show how proper sizing of memory cells can guarantee that the memory cell reliability in the near threshold supply voltage region matches that of a standard memory cell.

Scan Based Methodology for Reliable State Retention Power Gating Designs, 2010
Abstract: Power gating is an effective technique for reducing leakage power which involves powering off idle circuits through power switches, but those power-gated circuits which need to retain their states store their data in state retention registers. When power-gated circuits are switched from sleep to active mode, a sudden rush of current has the potential to corrupt the stored data in the state retention registers, which could be a reliability problem. This paper presents a methodology for improving the reliability of power-gated designs by protecting the integrity of state retention registers through state monitoring and correction. This is achieved by scan chain data encoding and decoding. The methodology is compatible with EDA tool design and power gating control flows. A detailed analysis of the proposed methodology's capability in detecting and correcting errors is given, including the area overhead and energy consumption of the protection circuitry. The methodology is validated using an FPGA, showing that it is possible to correct all single errors with a Hamming code and detect all multiple errors with a CRC-16 code. To the best of the authors' knowledge, this is the first study in the area of reliable power gating designs through state monitoring and correction.

Selective state retention design using symbolic simulation, April 2009
Abstract: This paper presents a case study based on symbolic simulation for assisting designers in designing and implementing selective retention correctly. To the best of the author's knowledge, this is the first study in the area of rigorous design and implementation of selective state retention.

SOC-C: Efficient Programming Abstractions for Heterogeneous Multicore Systems on Chip, 2008
Abstract: This paper tackles problems experienced in the mapping of applications onto complex SoCs with a set of language extensions which allows the programmer to introduce pipeline parallelism into sequential programs, manage distributed memories, and express the desired mapping of tasks to resources.

SODA: A High-Performance DSP Architecture for Software-Defined Radio, Feb. 2007
Abstract: Software-defined radio (SDR) belongs to an emerging class of applications with the processing requirements of a supercomputer but the power constraints of a mobile terminal. The authors developed the Signal-Processing On-Demand Architecture (SODA), a fully programmable architecture that supports SDR, by examining two widely differing protocols, W-CDMA and 802.11a. It meets power-performance requirements by separating control and data processing and by employing ultrawide SIMD execution.

STEEL: a technique for stress-enhanced standard cell library design, 2008
Abstract: Mobility degradation and device scaling limitations have led process engineers to develop new techniques that introduce mechanical stress in MOSFET channels, which results in enhanced carrier transport. New fabrication steps strive to increase carrier mobility which, consequently, increases both Ion and Ioff in CMOS devices. However, most stress-enhancement techniques are dependent on layout parameters and their effects can be exploited within standard cell library design. In this work, the authors propose a new standard cell library design methodology that shares VDD and VSS source/drain connections across standard cell boundaries.

Stress aware layout optimization, 2008
Abstract: Process-induced mechanical stress is used to enhance carrier transport and achieve higher drive currents in current CMOS technologies. In this paper, the authors study how stress-induced performance enhancements are affected by layout properties and suggest guidelines for improving layouts so that performance gains are maximized.

Teaching for Evolution towards Embedded Multi-sensor Interfaces, 2011
Abstract: This paper outlines experiences in bringing up an ARM-based course with significant sensing and actuating capabilities, on top of the already existing infrastructure support for training embedded wireless system design. The role of the suitable development platform, design kits, and the lab experiments is outlined, and the expected outcomes are highlighted. We emphasize the role of the scaffolding principle, which now applies not only to the single course but also to our overall experience in developing such courses.

The challenges of correlating silicon and models in high variability CMOS processes, 2009
Abstract: This talk discusses one of the key challenges of post-silicon validation: namely, the difficulty inherent in correlating observed behavior with modeled behavior. Validation must account for a large number of sources of inherent variability in the silicon, ranging from those inherent in the device and wire models themselves through approximations made in library modeling, extraction, tool algorithms and so on. Examples are given for validating standard cell and memory based designs as well as a general methodology that can be used to enable chip bring-up.

The Use of Compiler Optimizations for Embedded Systems Software, Sept. 2008
Abstract: This paper discusses the fundamental differences between hand-optimizing code to take advantage of a particular processor's compiler and applying built-in optimization options to proven and well-polished code. Examples of common, built-in compiler options are presented using a simulated ARM processor and C compiler, along with a simple methodology that can be applied to any embedded compiler for finding an optimal set of compiler options.

Thread Priority-Aware Random Replacement in TLBs for a High-Performance Real-Time SMT Processor, 2007
Abstract: This paper proposes a novel random replacement method in fully or set associative structures such as TLBs to improve the performance of the main or high-priority thread running in an SMT processor along with other low-priority threads.

Using a Web 2.0 Approach for Embedded Microcontroller Systems, 2010
Abstract: This paper describes Georgia Tech faculty experiences using a new approach for teaching an embedded systems design course and the associated laboratory. A cloud-based C/C++ compiler and file server are used for software development along with a low-cost 32-bit microcontroller board. Student resources include an eBook, web-based reference materials and assignments, an online user forum, and wiki pages with sample microcontroller application code. In laboratory assignments, breadboards are used to rapidly build prototype systems using the microcontroller, networking, and other I/O subsystems using small breakout boards with a wide variety of sensors, displays, and drivers. Software development is done in any web browser, all student files are stored on the web server, and downloading code to the microcontroller functions in the same way as a simple USB flash drive.

Way guard: a segmented counting bloom filter approach to reducing energy for set-associative caches, 2009
Abstract: The design trend of caches in modern processors continues to increase their capacity with higher associativity to cope with large data footprints and take advantage of feature size shrink, which, unfortunately, also leads to higher energy consumption. This paper presents a technique using segmented counting Bloom filters called "Way Guard" to reduce the number of redundant way lookups in large set-associative caches to achieve dynamic energy savings. This Way Guard mechanism looks up only 25-30% of the cache ways on average and saves up to 65% of the L2 energy and up to 70% of the L1 cache energy.

Worst-case design and margin for embedded SRAM, 2007
Abstract: An important aspect of Design for Yield for embedded SRAM is identifying the expected worst case behavior in order to guarantee that sufficient design margin is present. Previously, this has involved multiple simulation corners and extreme test conditions. It is shown that statistical concerns and device variability now require a different approach, based on work in Extreme Value Theory. This method is used to develop a lower-bound for variability-related yield in memories.

The following legacy research papers on ARM technologies still hold educational value:

A Combined Hardware-Software Approach for Low-Power SoCs: Applying Adaptive Voltage Scaling and Intelligent Energy Management Software, Dec. 2002
Abstract: Increased functionality and performance demands are challenging System-on-Chip (SoC) designers to seek better methods for optimizing available battery power in portable applications. Key areas of exploration include dynamic voltage scaling and improved software algorithms for the control of power modes. While adaptive voltage scaling optimizes power use based on temporal environmental conditions, Intelligent Energy Management (IEM) algorithms optimize power consumption based on the dynamic workload of the processor. IEM software and hardware monitor the execution and communication characteristics of workloads and predictively set the performance of the processor to the level that minimizes energy use, while still meeting application deadlines.

AMBA: Enabling Reusable On-chip Designs, Aug. 1997
Abstract: AMBA's goal is to help designers of embedded CPU systems meet challenges like design for low power consumption and test access. This article describes some of AMBA's design methodology and provides a set of specifications that will aid designers in making detailed comparisons with other buses.

ARM MPEG-4 AAC LC Decoder Technical Specification, June 2003
Abstract: Technical specification for the ARM MPEG-2 AAC Low Complexity profile decoder. This document details the performance of the decoder when it has been integrated into an ARM-supplied example player.

ARM7TDMI Power Consumption, Aug. 1997
Abstract: Portable and handheld products require processors that consume less power than those in desktop and other powered applications. As a result, designers must analyze power use in the early stage of design both at the circuit and system levels. RISC processors, such as the ARM7TDMI, have both strengths and weaknesses as far as power consumption is concerned.

Automatic Performance Setting for Dynamic Voltage Scaling, May 2001
Abstract: The emphasis on processors that are both low power and high performance has resulted in the incorporation of dynamic voltage scaling into processor designs. This feature allows one to make fine granularity trade-offs between power use and performance, provided there is a mechanism in the OS to control that trade-off. A novel software approach to automatically controlling dynamic voltage scaling in order to optimize energy use is described in this paper.

Combined Dynamic Voltage Scaling and Adaptive Body Biasing for Lower Power Microprocessors under Dynamic Workloads, Aug. 2002
Abstract: Dynamic voltage scaling (DVS) reduces the power consumption of processors when peak performance is unnecessary. However, the power savings achievable by DVS alone are becoming limited as leakage power increases. In this paper, the authors show how adaptive body biasing (ABB) and DVS can be used simultaneously to reduce power in high-performance processors.

Drowsy Caches: Simple Techniques for Reducing Leakage Power, Nov. 2003
Abstract: On-chip caches represent a sizable fraction of the total power consumption of microprocessors. Although large caches can significantly improve performance, they have the potential to increase power consumption. As feature sizes shrink, the dominant component of this power loss will be leakage. However, during a fixed period of time the activity in a cache is centered on only a small subset of the lines. This behavior can be exploited to cut the leakage power of large caches by putting the cold cache lines into a state-preserving, low-power drowsy mode. Moving lines into and out of the drowsy state incurs a slight performance loss. In this paper, the authors investigate policies and circuit techniques for implementing drowsy caches. They show that with simple architectural techniques, about 80%-90% of the cache lines can be maintained in a drowsy state without affecting performance by more than 1%.
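The drowsy-mode idea in the abstract above can be illustrated with a small model. This is a minimal sketch, not the paper's implementation: it assumes one simple policy (put every line into drowsy mode after a fixed number of accesses, and wake a line when it is touched), and the class name, the access-count window, and the one-cycle wake penalty are all hypothetical choices made for illustration.

```python
class DrowsyCacheModel:
    """Toy model of a cache whose lines are either awake or drowsy.

    Assumptions (not from the paper): the sweep window is counted in
    accesses rather than cycles, and waking a line costs `wake_penalty`.
    """

    def __init__(self, num_lines, window, wake_penalty=1):
        self.awake = [False] * num_lines  # every line starts drowsy
        self.window = window              # accesses between sweeps
        self.wake_penalty = wake_penalty  # extra cycles to wake a line
        self.accesses_since_sweep = 0

    def access(self, line):
        """Touch a line and return the extra cycles this access costs."""
        penalty = 0
        if not self.awake[line]:
            self.awake[line] = True
            penalty = self.wake_penalty
        self.accesses_since_sweep += 1
        if self.accesses_since_sweep >= self.window:
            # Periodic sweep: put every line back into drowsy mode.
            self.awake = [False] * len(self.awake)
            self.accesses_since_sweep = 0
        return penalty

    def drowsy_fraction(self):
        """Fraction of lines currently in the low-leakage drowsy state."""
        return self.awake.count(False) / len(self.awake)
```

Because accesses cluster on a few hot lines, most lines stay drowsy between sweeps and only the first touch of a line after a sweep pays the wake penalty.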

Drowsy Instruction Caches, Sept. 2002
Abstract: This paper extends the architectural control mechanism of the drowsy cache to reduce leakage power consumption of instruction caches without significant impact on execution time. The results show that data and instruction caches require different control strategies for efficient execution.

Embedded Control Problems, Thumb, and the ARM7TDMI, Oct. 1995
Abstract: High-end embedded control applications such as cellular phones, disk drives, and modems demand more performance from their controllers yet still require low costs. By implementing a second, compressed instruction set, our architectural innovation Thumb reduces RISC code size, providing 32-bit RISC performance at 8-/16-bit system cost.

Power-Smart System-On-Chip Architecture for Embedded Cryptosystems, Sept. 2005
Abstract: In embedded cryptosystems, sensitive information can leak via timing, power, and electromagnetic channels. This paper introduces a novel power-smart system-on-chip architecture that provides support for masking these channels by controlling, in real-time, the power and the current consumption of a system to predefined programmable values.

Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation, Nov. 2003
Abstract: With increasing clock frequencies and silicon integration, power-aware computing has become a critical concern in the design of embedded processors and systems-on-chip. One of the more effective and widely used methods for power-aware computing is dynamic voltage scaling (DVS). In order to obtain the maximum power savings from DVS, it is essential to scale the supply voltage as low as possible while ensuring correct operation of the processor. In this paper, the authors propose a new approach to DVS, called Razor, based on dynamic detection and correction of circuit timing errors. The key idea of Razor is to tune the supply voltage by monitoring the error rate during circuit operation, thereby eliminating the need for voltage margins and exploiting the data dependence of circuit delay.
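The feedback idea in the abstract above (tune the supply voltage by watching the error rate) can be sketched as a simple control loop. This is only an illustration under stated assumptions: the step size, voltage limits, target error rate, and the toy error model are hypothetical, and the circuit-level error detection and correction that Razor relies on is not modeled at all.

```python
def next_voltage_mv(voltage_mv, errors_per_10k, target_per_10k=100,
                    step_mv=10, v_min_mv=800, v_max_mv=1200):
    """Raise the supply when errors exceed the target, otherwise lower it."""
    if errors_per_10k > target_per_10k:
        voltage_mv += step_mv
    else:
        voltage_mv -= step_mv
    # Clamp to the (assumed) safe operating range.
    return max(v_min_mv, min(v_max_mv, voltage_mv))


def toy_errors_per_10k(voltage_mv):
    """Hypothetical error model: no errors at or above 1000 mV,
    then 10 more errors per 10,000 cycles for each millivolt below it."""
    return max(0, (1000 - voltage_mv) * 10)


def settle(start_mv=1200, iterations=50):
    """Run the loop and return the voltage after every step."""
    voltage = start_mv
    history = [voltage]
    for _ in range(iterations):
        voltage = next_voltage_mv(voltage, toy_errors_per_10k(voltage))
        history.append(voltage)
    return history
```

With these assumed numbers the loop walks down from 1200 mV and then hovers around the point where the error rate meets the target, rather than stopping at a fixed worst-case margin.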

The Dangers of Living with an X (bugs hidden in your Verilog), Oct. 2003
Abstract: The semantics of X in Verilog RTL are extremely dangerous as RTL bugs can be masked, allowing RTL simulations to incorrectly pass where netlist simulations can fail. Such X-bugs are often missed because formal equivalence checkers are configured to ignore them, which is a particular concern given that equivalence checking is fast replacing netlist simulations. This paper gives examples of such problems in order to raise awareness of X issues in many different parts of the design flow, which are often poorly understood by RTL designers and EDA vendors alike.

Thread-level Parallelism and Interactive Performance of Desktop Applications, Aug. 2000
Abstract: Unlike server workloads, the primary requirement of interactive applications is to respond to user events under human perception bounds rather than to maximize end-to-end throughput. In this paper the authors report on the thread-level parallelism and interactive response time of a variety of desktop applications.

Thread level parallelism of desktop applications, April 2004
Abstract: Multiprocessing is already prevalent in servers, where multiple clients present an obvious source of thread-level parallelism. The case for multiprocessing is less clear for desktop applications; however, processor architects are already designing processors that count on the availability of concurrently runnable threads. In this paper the authors examine a wide variety of existing desktop workloads (over 50) across three different operating systems (Windows NT, BeOS and Linux) and quantify the amount and nature of thread-level parallelism (TLP) in the system. The results show that OS and application structure have a significant effect on TLP. While most of the workloads exhibit only moderate amounts of parallelism (under 1.5), there is evidence that many of these workloads are not inherently single-threaded.
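A figure such as "under 1.5" can be read as an average number of concurrently running threads. The abstract does not state the formula, so the sketch below assumes one plausible definition: average concurrency measured over non-idle time only. The function name and the input format (fraction of time with exactly i threads running, index 0 being idle) are hypothetical.

```python
def thread_level_parallelism(time_fractions):
    """Average number of running threads over non-idle time.

    time_fractions[i] is the fraction of total time during which exactly
    i threads were running; index 0 is idle time. This definition is an
    assumption made for illustration.
    """
    if abs(sum(time_fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    idle = time_fractions[0]
    if idle >= 1.0:
        raise ValueError("system was never busy")
    # Weight each concurrency level by the time spent at that level.
    busy_thread_time = sum(i * c for i, c in enumerate(time_fractions))
    return busy_thread_time / (1.0 - idle)
```

For example, a system that is idle half the time, runs one thread 40% of the time, and runs two threads 10% of the time has a value of about 1.2 under this definition.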

Vertigo: Automatic Performance-Setting for Linux, Oct. 2002
Abstract: Combining high performance with low power consumption is becoming one of the primary objectives of processor designs. Instead of relying just on sleep mode for conserving power, an increasing number of processors take advantage of the fact that reducing the clock frequency and corresponding operating voltage of the CPU can yield a quadratic decrease in energy use. However, performance reduction can only be beneficial if it is done transparently, without causing the software to miss its deadlines. In this paper, the authors describe the implementation and performance-setting algorithms used in Vertigo, the authors' power management extensions for Linux.
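The quadratic saving and the deadline constraint mentioned in the abstract above can be shown with a short worked sketch. It is not Vertigo's algorithm: it assumes that dynamic energy per cycle scales with the square of the supply voltage and that voltage scales linearly with frequency, and both function names are hypothetical.

```python
def relative_energy(freq_ratio):
    """Energy for a fixed number of cycles, relative to full speed.

    Assumes energy per cycle is proportional to V^2 and that V is
    proportional to f, so the result is freq_ratio squared.
    """
    if not 0.0 < freq_ratio <= 1.0:
        raise ValueError("freq_ratio must be in (0, 1]")
    voltage_ratio = freq_ratio  # assumed linear voltage/frequency relation
    return voltage_ratio ** 2


def pick_frequency(cycles, deadline_s, freqs_hz):
    """Choose the slowest available frequency that still meets the deadline.

    Falls back to the fastest frequency if none can meet it.
    """
    feasible = [f for f in freqs_hz if cycles / f <= deadline_s]
    return min(feasible) if feasible else max(freqs_hz)
```

Under these assumptions, a task of 100 million cycles with a 0.5 s deadline on a CPU offering 100, 200, and 400 MHz can run at 200 MHz, which uses about a quarter of the energy of running the same cycles at 400 MHz.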