ARCHITECTURAL CONSIDERATIONS
Charles D. Cranor
R. Gopalakrishnan
Peter Z. Onufryk
AT&T Labs, Research

The growth of the Internet is creating a demand for broadband access equipment and network-enabled consumer appliances. At the heart of these products are communications processors—devices that integrate processing, networking, and system support functions into a single, low-cost system on a chip (SOC). The primary challenges in the design of these devices are minimizing cost and time to market, and maximizing flexibility. Die size and packaging are the major factors that determine cost. The rapid pace at which Internet applications and services are evolving increases the pressure to reduce development time. Thus, designers of communications processors are continually looking for ways to speed design and verification. Rapid change is also increasing the importance of flexibility. Communications processors are often adapted to applications that may not have been anticipated, or may not even have existed, when the chip was designed.

A large body of research and experience in the design of network adaptors for workstations exists [1-3]. It may appear that a communications processor consists of nothing more than integrating these designs on a chip. However, this is not the case, since the system requirements and constraints for workstations and communications processors are fundamentally different. Designers must carefully manage latency in communications processors to reduce on-chip buffering. Network interface cards (NICs) typically used in workstations plug into I/O buses that have high latencies. This forces a NIC to include large buffers and encourages large burst transfers for efficiency. For example, a 10-/100-Mbps Ethernet NIC can have as much as 12 Kbytes of buffering [4]. Since communications processors often contain multiple network interfaces, placing such large buffers on chip may not be possible and is certainly not cost effective.

Space requirements in workstations are not as stringent as those in the SOC environment of communications processors. This allows workstation NICs to include considerable processing power. For example, "intelligent" NICs contain on-board processors [2]. Even "dumb" workstation NICs are actually quite intelligent. For example, it is common for a NIC to contain a complex DMA controller and buffer management unit. In a communications processor this functionality is typically shared among multiple network interfaces to reduce die size. Network interfaces in these [...]
[...] identifier fields in the cell's header. We call this form of channel customization interface-specific processing since it requires functionality beyond simple data movement. Other examples of interface-specific processing are multiplexing and demultiplexing of data based on the time slot for a time-division multiplexing bus, and searching through multiple DMA descriptors for an optimal-size buffer to store a received Ethernet frame. Despite the complexity of designing a multichannel DMA controller, a number of communications processors such as the AMD Am186CC [7], the NETsilicon Net+ARM [8], and the Euphony processor [9] use this approach.

The design of a multichannel DMA controller with the features necessary to support multiple network interfaces can be as complex as a programmable processor. For this reason, some designers have chosen to replace multichannel DMA controllers with a dedicated processor for data transfers and interface-specific processing. This eliminates the complexity of designing the DMA controller, provides flexibility, and allows modifications and enhancements to be made in software. The Motorola MPC860 [10] and the Virata Helium [11] use this approach.

Adding a second processor for data transfers and interface-specific processing eliminates the complexity of designing a DMA controller and provides flexibility. However, it also introduces the software complexity and partitioning issues associated with developing code for multiple processors. This is especially true if the architecture of the processor that handles communications tasks differs from that of the main CPU. Since processor functions must be replicated in this approach (for example, two bus interface units, two ALUs), it may increase die size. This approach also leads to a static partitioning of functions onto processors: idle cycles on one processor cannot be used to enhance the performance of tasks running on the other.

Applications with low data rates that can tolerate high latencies without requiring large on-chip buffers do not require a DMA controller or a dedicated processor. Instead, an interrupt handler running on the main CPU may perform these operations. The T.sqware TS702 [12] uses this approach.

We believe that a communications processor for low-cost consumer applications should contain a single processor that performs all data movement, interface-specific processing, and application processing. This becomes especially true as embedded processors reach speeds of 500 MHz and higher. The availability of processor cores capable of performing these tasks would reduce communications processor design and verification time, increase their flexibility, and simplify software development. UNUM is an architecture for this type of processor core.

Multithreaded CPU for event processing

Performing data movement and interface-specific processing on the same CPU as application processing dramatically increases the number of processor events that must be serviced. The key to minimizing communications processor cost is minimizing die size, which means minimizing on-chip buffering. Small on-chip buffers impose tight constraints on acceptable event service latency and result in small burst transfers, thus increasing the number of events.

To illustrate the importance of minimizing event service latency, consider a cut-through transfer of a 1,518-byte Ethernet frame from a receive FIFO to memory. Using a 64-byte burst transfer results in 24 data transfer and request events. To prevent overflow, the receive FIFO must be large enough to accommodate the worst-case event service latency. A large event service latency not only reduces the maximum throughput but also requires larger FIFOs to prevent overflow. This, in turn, results in higher queuing delays that further increase worst-case event service latency.
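As a rough illustration of this arithmetic (our own sketch, not part of the UNUM design), the following C fragment computes the number of burst events per frame and a simple lower bound on receive FIFO size; the one-event-per-burst assumption and the sizing rule are simplifications.

```c
#include <stdio.h>

/* One event is assumed per burst; the receive FIFO must absorb whatever
 * arrives during the worst-case event service latency. */
static unsigned burst_events(unsigned frame_bytes, unsigned burst_bytes)
{
    /* ceil(frame / burst): a 1,518-byte frame with 64-byte bursts -> 24 events */
    return (frame_bytes + burst_bytes - 1) / burst_bytes;
}

static unsigned min_fifo_bytes(double line_rate_bps, double worst_latency_s,
                               unsigned burst_bytes)
{
    /* Bytes that can arrive while the event is pending, plus one burst
     * already staged for transfer. */
    double backlog = line_rate_bps / 8.0 * worst_latency_s;
    return (unsigned)(backlog + 0.5) + burst_bytes;
}

int main(void)
{
    printf("events per 1,518-byte frame: %u\n", burst_events(1518, 64));
    /* e.g. 100-Mbps line, 2-us worst-case service latency -> 25 + 64 bytes */
    printf("min FIFO: %u bytes\n", min_fifo_bytes(100e6, 2e-6, 64));
    return 0;
}
```

With a 1,518-byte frame and 64-byte bursts, the first call reproduces the 24 events cited above; the FIFO figure is only a rough bound that grows directly with worst-case service latency.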
Current processors service external events using either polling or interrupts. Infrequent polling results in large event service latencies, while frequent polling consumes large amounts of processing. Both of these are unacceptable. The worst-case event service latency for an interrupt with a full-context save for a high-performance embedded processor is on the order of several microseconds. Although typical performance is much better, designers must consider worst-case performance during system design, since taking the best or average case will lead to conditions such as buffer overflow or underflow.

A major component of interrupt latency is
saving and restoring the state of the interrupted context. Techniques used to reduce this overhead include

• coding interrupt service routines in assembly language to use a small number of registers,
• switching to an alternate register set for interrupt processing,
• saving processor registers to unused floating-point registers, and
• providing on-chip memory for saving and restoring state.

[Figure: external events feed an event mapper and a context scheduler that holds a program counter (PC) and priority for each context (Context 0 through Context n); the scheduler supplies a context ID (CID) and PC to the CPU pipeline, which shares an instruction memory and a register file (31n × 32).]
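To give a software flavor of what the block diagram suggests, the fragment below models a priority-driven context scheduler in C; the structure, field names, and the fixed context count are illustrative assumptions and do not describe the actual UNUM hardware.

```c
#include <stdint.h>

#define NCONTEXTS 8          /* assumed number of hardware contexts */

/* One entry per hardware context: its own PC and a priority,
 * as suggested by the PC/Priority boxes in the figure. */
struct context {
    uint32_t pc;
    uint8_t  priority;       /* larger value = more urgent */
    uint8_t  ready;          /* set by the event mapper when its event fires */
};

static struct context ctx[NCONTEXTS];

/* Event mapper: an external event wakes the context bound to it. */
void raise_event(unsigned context_id)
{
    if (context_id < NCONTEXTS)
        ctx[context_id].ready = 1;
}

/* Context scheduler: fetch from the highest-priority ready context.
 * Returns the context ID (CID) whose PC feeds the pipeline,
 * or -1 if no context is ready. */
int schedule_next(void)
{
    int best = -1;
    for (unsigned i = 0; i < NCONTEXTS; i++)
        if (ctx[i].ready && (best < 0 || ctx[i].priority > ctx[best].priority))
            best = (int)i;
    return best;
}
```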
[...] dirty data from the data cache as it is written to a network interface, potentially eliminating an unnecessary future write-back.

All data flowing between the memory and I/O buses passes through an aligner in the IBIU. For aligned transfers, the aligner simply passes unmodified data from one bus to another. For unaligned transfers, the aligner uses a holding register, shifter, and multiplexer to align data as it flows from one bus to the other. Figure 4 provides an example of this for an unaligned 4-word transfer.

[Figure 4. Unaligned 4-word fly-by transfer from memory to I/O device starting at address 6. Five words are read from memory and four aligned words are written to the I/O device; the aligner's holding register and shifter/multiplexer produce each output word.]
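The behavior in Figure 4 can be modeled in a few lines of C. The sketch below mirrors the holding-register-and-shift operation for a big-endian word layout; the function name, interface, and byte order are our assumptions for illustration rather than the IBIU's actual datapath.

```c
#include <stdint.h>

/* Software model of the aligner's holding register + shifter/multiplexer.
 * byte_offset must be 1..3 (e.g. a transfer starting at address 6 has
 * offset 2 within its word); words are treated as big-endian byte groups. */
void aligned_flyby(const uint32_t *mem_words,  /* word-aligned source */
                   unsigned byte_offset,
                   uint32_t *io_out,            /* aligned words for the I/O bus */
                   unsigned out_words)
{
    unsigned shift = 8 * byte_offset;
    uint32_t hold = mem_words[0];                /* holding register */

    for (unsigned i = 0; i < out_words; i++) {
        uint32_t next = mem_words[i + 1];        /* out_words + 1 memory reads */
        /* splice the trailing bytes of the held word with the leading bytes
         * of the next one, so each output word is fully aligned */
        io_out[i] = (hold << shift) | (next >> (32 - shift));
        hold = next;
    }
}
```

For the transfer in Figure 4 (start address 6, offset 2 within a word), producing four output words requires five memory reads, matching the figure.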
Putting it all together

The ability to service external events with extremely low overhead together with high-performance data transfer instructions allows UNUM to perform data movement and interface-specific processing functions in software. Combining a UNUM processor core with network interface cores allows communications processors to be rapidly constructed. A typical high-speed network interface in UNUM maps to two processor contexts, one for input processing and one for output processing. Threshold logic in the output FIFO of a network interface generates an event whenever there is room for an output data transfer. Similarly, threshold logic in the input FIFO generates an event whenever enough data exists for a complete data transfer or an end-of-packet is detected. The event-handling routines may perform interface-specific processing.
Simulation results

To better quantify the benefits of the UNUM architecture for data movement and interface-specific processing, we created a cycle-accurate simulator of a UNUM-based communications processor. We based the CPU in the simulator on the MIPS32 ISA, which was enhanced to support multiple hardware contexts and data movement instructions. The simulator also modeled the caches, memory system, counter/timers, a console, and an ATM interface.

We simulated a 200-MHz UNUM processor with an 8-Kbyte, two-way set-associative instruction cache; a 2-Kbyte, two-way set-associative data cache; and a 4-word write buffer. We configured our simulated 32-bit system bus to run at 100 MHz and the memory system to consist of 100-MHz SDRAM. All of the benchmarks were written in C and compiled with an enhanced MIPS GCC 2.8.1 compiler with "-O3" optimization. Other than the ones mentioned in the next section, we did not perform hand assembly language optimizations. In addition, we assumed that event and interrupt handlers were locked in the instruction cache.

Data movement

For our initial measurements we wrote a data movement micro-benchmark that simulated the transfer of a 1,518-byte packet from
memory to a network interface using a range of burst transfer sizes. We measured the resulting bandwidth. We examined two data movement mechanisms, one using UNUM data movement instructions and another using PIO. Our PIO function moved data using an optimized hand-coded assembly routine based on the BSD bcopy() function. We examined three CPU configurations: UNUM (hardware context switch with state preservation), fast interrupts (alternate register set with no interrupt state preservation), and normal processor interrupts. For normal interrupts we assumed an overhead of 1 µs. We ran our benchmark for best- and worst-case data cache scenarios for state information and with the assumption that data to be moved is not present in the data cache.
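The PIO path is essentially a word-copy loop. The C sketch below shows the shape of such a bcopy()-style inner loop; the actual benchmark used a hand-tuned assembly routine, and the alignment assumptions and unrolling factor here are ours.

```c
#include <stdint.h>
#include <stddef.h>

/* Word-at-a-time programmed-I/O copy in the spirit of BSD bcopy(); a
 * simplified C stand-in for the hand-coded assembly routine, assuming
 * word-aligned src/dst and a length that is a multiple of 16 bytes. */
static void pio_copy(volatile uint32_t *dst, const uint32_t *src, size_t bytes)
{
    for (size_t i = 0; i < bytes / sizeof(uint32_t); i += 4) {
        /* unrolled by four to cut loop overhead, as a tuned copy would */
        dst[i + 0] = src[i + 0];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
    }
}
```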
Figure 5 provides the results of the data movement benchmark. PIO-based data movement results in the worst performance. The highest achievable bandwidth using PIO was 40 Mbytes/sec. This was true regardless of the type of CPU used (UNUM, fast interrupts, or regular interrupts) and caching assumptions, since the cost of PIO dominates all other overheads.

[Figure 5. Data movement performance: achieved bandwidth versus data transfer size (64 to 512 bytes).]

Making use of UNUM's data movement instructions improved results significantly. For an SOC environment with small on-chip buffers we expect burst sizes in the range of [...] all data cache hits. Given the small size of data caches in SOC communications processors, we expect the actual achieved bandwidth to be closer to the lower end of this range.

UNUM with data movement instructions produced the best results: 212 Mbytes/sec. Since the state of the event service routine fits within a UNUM context, no state information needs to be loaded from memory. This explains why UNUM outperforms fast interrupts with data cache hits, and it is also the reason why the UNUM curve is unaffected by data cache misses.

Note that for very small burst sizes, UNUM events with PIO outperform fast interrupts with data movement instructions. This is because for small bursts the overhead of loading the event service routine state exceeds that of performing memory-to-memory PIO transfers.

ATM Soft-SAR

Our second benchmark measures the ability of UNUM to perform complex interface-specific processing. For this benchmark we selected ATM AAL5 Segmentation and Reassembly (SAR), since it represents a class of applications in which the processing performed on received data depends on its content.

ATM AAL5 SAR transmit processing consists of segmenting protocol data units (PDUs) to be sent on an ATM virtual circuit into fixed-length cells and attaching a header to each cell. The PDU is padded to contain an integral number of cells, and the last cell has fields that indicate the data length, a user-to-user byte, and a CRC-32 value. SAR receive processing consists of reassembling received cells into PDUs, checking the length and CRC-32 fields, and passing the payload to upper layers.
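The per-PDU framing arithmetic implied by this description is sketched below. The 48-byte cell payload and the 8-byte trailer follow the AAL5 definition (the trailer's CPI byte is part of that definition even though it is not discussed above); the structure layout and names are ours.

```c
#include <stdint.h>
#include <stddef.h>

#define ATM_PAYLOAD  48u   /* AAL5 payload bytes carried per ATM cell */
#define AAL5_TRAILER  8u   /* UU, CPI, 16-bit length, CRC-32 */

/* Number of cells needed to carry a PDU: payload plus trailer, padded
 * up to an integral number of 48-byte cell payloads. */
static unsigned aal5_cells(size_t pdu_len)
{
    return (unsigned)((pdu_len + AAL5_TRAILER + ATM_PAYLOAD - 1) / ATM_PAYLOAD);
}

/* Trailer carried at the end of the last cell of the PDU (illustrative
 * layout; on the wire these fields are packed with any required padding
 * in front of them). */
struct aal5_trailer {
    uint8_t  uu;        /* user-to-user byte */
    uint8_t  cpi;       /* common part indicator */
    uint16_t length;    /* PDU length in bytes */
    uint32_t crc32;     /* CRC-32 over the whole padded PDU */
};
```

For example, a 1,536-byte PDU plus its 8-byte trailer pads out to 33 cells under this rule.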
What makes SAR processing challenging is that for each received ATM cell considerable work must be performed. First, the identifier field (VPI/VCI) in the cell header is used to
look up the virtual circuit that the cell belongs to. This lookup returns a data structure that contains a pointer to a reassembly buffer and current CRC-32 for the packet being reassembled. The payload of the cell is then appended to the reassembly buffer, and the reassembly buffer pointer and CRC-32 are updated. Additional processing is required to handle boundary conditions such as end of frame and end of buffer. Due to the complexity of SAR processing, most systems implement this in custom hardware.
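A per-cell receive step of the kind described above might look roughly like the following. The VC table, field names, helper functions (vc_lookup, crc32_update, pdu_complete, handle_overrun), and the UNI header layout are our own placeholders rather than the code that runs on the UNUM contexts.

```c
#include <stdint.h>
#include <stddef.h>

#define CELL_PAYLOAD 48

struct vc_state {                 /* per-virtual-circuit reassembly state */
    uint8_t  *reasm_ptr;          /* where the next payload bytes go */
    uint8_t  *reasm_end;          /* end of the reassembly buffer */
    uint32_t  crc;                /* running CRC-32 for the PDU */
};

/* Assumed helpers: a VPI/VCI lookup and an incremental CRC-32. */
struct vc_state *vc_lookup(uint32_t vpi_vci);
uint32_t crc32_update(uint32_t crc, const uint8_t *buf, size_t len);
void     pdu_complete(struct vc_state *vc);
void     handle_overrun(struct vc_state *vc);

void atm_rx_cell(uint32_t header, const uint8_t payload[CELL_PAYLOAD])
{
    /* 1. map the cell's VPI/VCI to its virtual circuit (UNI: 8-bit VPI + 16-bit VCI) */
    struct vc_state *vc = vc_lookup((header >> 4) & 0x00ffffffu);
    if (vc == NULL)
        return;                                   /* unknown VC: drop cell */

    /* 2. boundary condition: reassembly buffer full */
    if (vc->reasm_ptr + CELL_PAYLOAD > vc->reasm_end) {
        handle_overrun(vc);
        return;
    }

    /* 3. append payload, update running CRC-32, advance pointer */
    for (int i = 0; i < CELL_PAYLOAD; i++)
        vc->reasm_ptr[i] = payload[i];
    vc->crc = crc32_update(vc->crc, payload, CELL_PAYLOAD);
    vc->reasm_ptr += CELL_PAYLOAD;

    /* 4. last cell of the PDU (AAL5 end-of-frame indication in the PT field) */
    if (header & 0x2)
        pdu_complete(vc);
}
```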
The ATM interface in the system we simulated consisted of a physical layer interface (for example, Utopia), a transmit and receive FIFO, a CRC-32 calculator, and control and status registers. The interface generates an ATM receive event when a cell is present in the receive FIFO and an ATM transmit event when space exists for a cell in the transmit FIFO. We wrote code for SAR processing on the UNUM processor. The SAR software uses three hardware contexts. The first performs ATM receive event processing, the second performs ATM transmit event processing, and the third performs ATM transmit cell scheduling.
Our first experiment measured the maximum achievable throughput, assuming an infinite line rate and FIFOs. Figure 6 shows the throughput as a function of AAL5 frame size for half-duplex transmit, half-duplex receive, and full-duplex operation. In all three cases, the throughput increases with frame size since the per-frame overhead is amortized over a larger number of cells. The highest throughput we observed was 570 Mbps, which occurs for the half-duplex receive case. Transmit throughput is lower because of the extra overhead associated with cell scheduling. These results show that low-overhead event processing and high-performance data movement instructions allow UNUM to sustain a very high throughput. As a comparison, to sustain a receive throughput of 570 Mbps, a single-context CPU would have to service an interrupt every 750 ns.

[Figure 6. Maximum ATM SAR throughput: receive, send, and full-duplex throughput (Mbps) versus PDU size (512 to 3,072 bytes).]

Our second experiment measured UNUM processor utilization and the FIFO size necessary to sustain a full-duplex line rate of 25 Mbps. We measured a CPU utilization of 13.4% with a frame size of 1,536 bytes. This means that even when transmitting and receiving at full line rate, 86% of the CPU is available for other processing. More important than throughput is worst-case latency, which determines the required on-chip buffering. Using UNUM, a 25-Mbps, full-duplex line rate requires just a four-cell transmit and receive FIFO.

UNUM simplifies design of communications processors, lowers their cost, and closely integrates data movement and computation, thereby enabling fly-by processing. UNUM's ability to perform fly-by processing is well suited to applications such as encryption, coding, overload control, packet classification, and packet telephony. The emergence of broadband access networks is making these applications increasingly important for low-cost consumer devices. We are continuing our investigation of UNUM for these and other application areas. MICRO

References
1. C. Dalton et al., "Afterburner," IEEE Network, Jul. 1993, pp. 36-43.
2. H. Kanakia and D. Cheriton, "The VMP Network Adaptor Board (NAB): High Performance Network Communication for Multiprocessors," Proc. Symp. Communication Architectures and Protocols, ACM, New York, 1988, pp. 175-187.
3. K.K. Ramakrishnan, "Performance Considerations in Designing Network Interfaces," IEEE J. Selected Areas in Communications, Vol. 11, No. 2, Feb. 1993, pp. 203-219.
4. Am79C973/Am79C975 PCnet—Fast III Single Chip 10/100 Mbps PCI Ethernet Controller with Integrated PHY Data Sheet, Advanced Micro Devices, Sunnyvale, Calif.
5. P. Druschel et al., "Network Subsystem Design," IEEE Network, Jul. 1993, pp. 8-17.
6. T. von Eicken et al., "U-Net: A User-Level Network Interface for Parallel and Distributed Computing," Proc. 15th Ann. ACM Symp. Operating Systems Principles, ACM, Dec. 1995, pp. 40-53.
7. Am186CC Communications Controller User's Manual, Advanced Micro Devices, Sunnyvale, Calif.
8. NET+ARM Hardware Reference Guide, NETsilicon, Waltham, Mass.
9. P.Z. Onufryk, "Euphony: A Signal Processor for ATM," EE Times, Jan. 20, 1997, pp. 54, 80.
10. PowerQuicc: Motorola MPC860 User Manual, Motorola, Austin, Tex.
11. HELIUM IC-000148 Preliminary Data Sheet, VIRATA, Cambridge, UK.
12. TS702 Advanced Communication Controller Data Book, T.sqware Inc., Santa Clara, Calif.

Charles D. Cranor is a senior technical staff member at AT&T Labs–Research in Florham Park, New Jersey. His interests include networking, operating systems, and computer architecture. Cranor received a bachelor's degree in electrical engineering from the University of Delaware and a master's and doctorate in computer science from Washington University in St. Louis, Missouri. He is a member of the IEEE, ACM, and USENIX, and a kernel developer for the open-source BSD operating systems projects.