
ARCHITECTURAL CONSIDERATIONS FOR CPU AND NETWORK INTERFACE INTEGRATION

THE AUTHORS DESCRIBE UNUM, AN ARCHITECTURE FOR INTEGRATING COMMUNICATIONS FUNCTIONALITY INTO THE CPU. UNUM NOT ONLY SIMPLIFIES THE DESIGN OF COMMUNICATIONS PROCESSORS BUT ALSO IMPROVES THEIR PERFORMANCE AND PROVIDES THEM WITH GREATER FLEXIBILITY.

Charles D. Cranor
R. Gopalakrishnan
Peter Z. Onufryk
AT&T Labs, Research

The growth of the Internet is creating a demand for broadband access equipment and network-enabled consumer appliances. At the heart of these products are communications processors—devices that integrate processing, networking, and system support functions into a single, low-cost system on a chip (SOC). The primary challenges in the design of these devices are minimizing cost and time to market, and maximizing flexibility. Die size and packaging are the major factors that determine cost. The rapid pace at which Internet applications and services are evolving increases the pressure to reduce development time. Thus, designers of communications processors are continually looking for ways to speed design and verification. Rapid change is also increasing the importance of flexibility. Communications processors are often adapted to applications that may not have been anticipated, or even existed, when the chip was designed.

A large body of research and experience in the design of network adaptors for workstations exists.1-3 It may appear that a communications processor consists of nothing more than integrating these designs on a chip. However, this is not the case since the system requirements and constraints for workstations and communications processors are fundamentally different. Designers must carefully manage latency in communications processors to reduce on-chip buffering. Network interface cards (NICs) typically used in workstations plug into I/O buses that have high latencies. This forces a NIC to include large buffers and encourages large burst transfers for efficiency. For example, a 10-/100-Mbps Ethernet NIC can have as much as 12 Kbytes of buffering.4 Since communications processors often contain multiple network interfaces, placing such large buffers on chip may not be possible and is certainly not cost effective.

Space requirements in workstations are not as stringent as those in the SOC environment of communications processors. This allows workstation NICs to include considerable processing power. For example, "intelligent" NICs contain on-board processors.2 Even "dumb" workstation NICs are actually quite intelligent. For example, it is common for a NIC to contain a complex DMA controller and buffer management unit. In a communications processor this functionality is typically shared among multiple network interfaces to reduce die size. Network interfaces in these devices consist simply of a data link interface and buffers.

18 0272-1732/00/$10.00 © 2000 IEEE


The integration of processing and networking in the same device offers an opportunity to rethink the way CPUs and network interfaces are designed. Most communications processor CPU cores use Instruction Set Architectures (ISAs) that were initially developed for workstation processors and optimized for SPEClike benchmark performance. Network adaptor research has focused on reducing memory copies and host CPU processing,5,6 both of which lead to complex interface-specific hardware that is not appropriate for communications processors.

We introduce UNUM, an architecture for communications processors that supports extremely fast event processing and high-performance data movement. With these capabilities, functions typically performed in custom hardware can be moved to software executing on the main CPU.

Design approaches

The design of a communications processor rarely begins from scratch. Cores are either licensed from external intellectual property (IP) vendors or are available from internal sources. With the emergence of on-chip bus standards and a thriving IP industry, it would appear that a communications processor could be rapidly designed by licensing standard CPU and network interface cores and tying the whole system together with a multichannel DMA controller. It has been our experience, as well as that of others, that the design and verification of this type of DMA controller is both complex and time consuming. This is especially true when features necessary for high performance such as unaligned transfers and cache coherency are incorporated.

Figure 1 shows a simplified block diagram of a typical multichannel DMA controller.

[Figure 1. Multichannel DMA controller: DMA channel select logic and DMA request signals feed a single DMA state machine backed by a channel state RAM, attached to the on-chip bus.]

Since only one DMA channel can be active on a bus at any given time, we can save die area by designing a single DMA state machine that is shared by all DMA channels. Since there is considerable state information associated with each DMA channel (source address, destination address, byte count, and descriptor pointer), this state is commonly stored in a RAM rather than individual registers to reduce chip area. When a DMA channel becomes active, the controller transfers its state from RAM to the DMA state machine. After the operation completes, the controller writes back the updated channel state to RAM. Arbitration logic determines which DMA channel is serviced next.

Unlike other parts of a communications processor for which standard cores are readily available, the DMA controller must often be designed from scratch. This is because the DMA controller is highly system dependent. These system dependencies include performance requirements, on-chip bus architecture, memory controller design, as well as the type and number of supported network interfaces.

Diversity in network interface requirements has the greatest impact on DMA design. In addition to transferring data, high-performance, descriptor-based DMA controllers also transfer control and status information between DMA descriptors in memory and a network interface. This allows the DMA controller to execute sequences of transfers autonomously. The format and content of these descriptors typically needs to be modified for each type of network interface. Also, the basic function of a DMA channel itself may need to be modified. For example, the destination address for a received ATM cell is not the address of a buffer pointer in a descriptor as it would be for an Ethernet frame. Instead, it is the address of a reassembly buffer that is dependent on the virtual circuit identifier fields in the cell's header.

JANUARY–FEBRUARY 2000 19
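The shared-state-machine organization of Figure 1 can be sketched in a few lines of C. This is a minimal model for illustration only; the structure name, field layout, and function below are assumptions, not taken from any of the cited controllers.

```c
#include <stdint.h>

/* Per-channel DMA state that Figure 1 keeps in a RAM rather than in
 * per-channel registers; field names are hypothetical. */
struct dma_channel_state {
    uint32_t src;    /* source address */
    uint32_t dst;    /* destination address */
    uint32_t count;  /* bytes remaining */
    uint32_t desc;   /* descriptor pointer */
};

/* One shared state machine services whichever channel arbitration selects:
 * load the channel's state from the state RAM, run one burst, write the
 * updated state back. */
void service_channel(struct dma_channel_state ram[], int ch, uint32_t burst)
{
    struct dma_channel_state s = ram[ch];     /* state RAM -> state machine */
    uint32_t n = s.count < burst ? s.count : burst;
    s.src += n;                               /* model the data movement */
    s.dst += n;
    s.count -= n;
    ram[ch] = s;                              /* write back updated state */
}
```

The attraction of this organization is that the state machine (the expensive part) exists once, while per-channel cost is only a RAM entry; the complexity the authors describe comes from everything this sketch omits, such as descriptor fetch, unaligned transfers, and cache coherency.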

We call this form of channel customization interface-specific processing since it requires functionality beyond simple data movement. Other examples of interface-specific processing are multiplexing and demultiplexing of data based on the time slot for a time-division multiplexing bus, and searching through multiple DMA descriptors for an optimal size buffer to store a received Ethernet frame. Despite the complexity of designing a multichannel DMA controller, a number of communications processors such as the AMD Am186CC,7 the NETsilicon Net+ARM,8 and the Euphony processor9 use this approach.

The design of a multichannel DMA controller with the features necessary to support multiple network interfaces can be as complex as a programmable processor. For this reason, some designers have chosen to replace multichannel DMA controllers with a dedicated processor for data transfers and interface-specific processing. This eliminates the complexity of designing the DMA controller, provides flexibility, and allows modifications and enhancements to be made in software. The Motorola MPC860,10 and the Virata Helium11 use this approach.

Adding a second processor for data transfers and interface-specific processing eliminates the complexity of designing a DMA and provides flexibility. However, it also introduces the software complexity and partitioning issues associated with developing code for multiple processors. This is especially true if the architecture of the processor that handles communications tasks differs from that of the main CPU. Since processor functions must be replicated in this approach (for example, two bus interface units, two ALUs), it may increase die size. This approach also leads to a static partitioning of functions onto processors. Idle cycles on one processor cannot be used to enhance performance of tasks running on the other.

Applications with low data rates that can tolerate high latencies without requiring large on-chip buffers do not require a DMA controller or a dedicated processor. Instead, an interrupt handler running on the main CPU may perform these operations. The T.sqware TS702,12 uses this approach.

We believe that a communications processor for low-cost consumer applications should contain a single processor that performs all data movement, interface-specific processing, and application processing. This becomes especially true as embedded processors reach speeds of 500 MHz and higher. The availability of processor cores capable of performing these tasks would reduce communications processor design and verification time, increase their flexibility, and simplify software development. UNUM is an architecture for this type of processor core.

Multithreaded CPU for event processing

Performing data movement and interface-specific processing on the same CPU as application processing dramatically increases the number of processor events that must be serviced. The key to minimizing communications processor cost is minimizing die size, which means minimizing on-chip buffering. Small on-chip buffers impose tight constraints on acceptable event service latency and result in small burst transfers, thus increasing the number of events.

To illustrate the importance of minimizing event service latency, consider a cut-through transfer of a 1,518-byte Ethernet frame from a receive FIFO to memory. Using a 64-byte burst transfer results in 24 data transfer request events. To prevent overflow, the receive FIFO must be large enough to accommodate the worst-case event service latency. A large event service latency not only reduces the maximum throughput but also requires larger FIFOs to prevent overflow. This, in turn, results in higher queuing delays that further increase worst-case event service latency.

Current processors service external events using either polling or interrupts. Infrequent polling results in large event service latencies, while frequent polling consumes large amounts of processing. Both of these are unacceptable. The worst-case event service latency for an interrupt with a full-context save for a high-performance embedded processor is on the order of several microseconds. Although typical performance is much better, designers must consider worst-case performance during system design since taking the best or average case will lead to conditions such as buffer overflow or underflow.

A major component of interrupt latency is saving and restoring the state of the interrupted context.

20 IEEE MICRO
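The arithmetic behind this sizing argument can be sketched directly. Only the 1,518-byte frame and 64-byte burst figures come from the text; the line rate and worst-case latency used in the usage example below are assumed values for illustration.

```c
/* Number of burst-transfer request events needed to move one frame
 * (rounding up to whole bursts). */
int burst_events(int frame_bytes, int burst_bytes)
{
    return (frame_bytes + burst_bytes - 1) / burst_bytes;
}

/* Bytes that keep arriving while an event waits for service: the receive
 * FIFO must hold at least this much beyond its request threshold to
 * avoid overflow under worst-case event service latency. */
double fifo_slack_bytes(double line_rate_bps, double worst_latency_s)
{
    return line_rate_bps / 8.0 * worst_latency_s;
}
```

For the 1,518-byte frame with 64-byte bursts, `burst_events` gives the 24 events quoted above; at an assumed 100-Mbps line rate and an assumed 2-µs worst-case latency, 25 bytes of slack are needed, which is why latency rather than raw throughput drives FIFO sizing.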
Techniques used to reduce this overhead include

• coding interrupt service routines in assembly language to use a small number of registers,
• switching to an alternate register set for interrupt processing,
• saving processor registers to unused floating-point registers, and
• providing on-chip memory for saving and restoring state.

Even if interrupt overhead were eliminated, the overhead of loading and updating the event service routine state from memory would still remain. This is because interrupt service routines do not retain state across invocations. For example, a data transfer event service routine must load the starting address, byte count, destination address, and possibly a descriptor pointer on entry, then update the byte count on exit.

To eliminate the latency and overhead of interrupts, UNUM employs multiple hardware contexts with priorities. By allowing the state of an event service routine to be preserved in a CPU context across invocations, we eliminate the overhead of retrieving and updating the event service routine state. Figure 2 is a block diagram of a UNUM processor. It consists of three major components: an event mapper, a context scheduler, and a CPU pipeline.

[Figure 2. Example UNUM CPU: external events enter an event mapper; a context scheduler holds a PC and priority for each of contexts 0 through n and presents a context ID (CID) and PC to the CPU pipeline, which shares instruction memory, a 31n × 32 register file, ALU, adder, and data memory.]

The function of the event mapper is to initiate event service routine execution when an external event occurs. Associated with each possible external event are event mapper registers that contain the context, address, and priority of the corresponding event service routine. When an event occurs, the event mapper uses this information to initiate event service routine execution by setting the program counter and priority of the corresponding hardware context to that of the event. In cases where multiple events occur simultaneously, or multiple pending events map to the same hardware context, the event mapper uses the priority to determine the order of invocation.

The context scheduler issues instructions to the CPU pipeline. Each cycle, the context scheduler examines the priority of all active contexts and issues the next instruction from the context with the highest priority. In cases where multiple active contexts share the highest priority, the scheduler issues instructions from these contexts in a round-robin manner.

The UNUM pipeline is a simple single-issue RISC pipeline augmented to support concurrent execution of instructions from multiple contexts. The 31 × 32 register file of a typical RISC processor is expanded to a 31n × 32 register file, where n is the number of supported hardware contexts. When the context scheduler selects a context from which to issue an instruction, it presents the CPU pipeline with a context ID (CID) and a program counter value.
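The scheduler's per-cycle issue decision can be modeled in a few lines of C. This is a behavioral sketch, not the hardware: the types, the active/priority encoding, and the use of a "last issued" index to rotate among equal-priority contexts are all assumptions.

```c
#define NCTX 4  /* number of hardware contexts in this toy model */

struct context {
    int active;    /* context has a runnable event handler */
    int priority;  /* larger value = higher priority */
};

/* Return the context ID to issue from this cycle, or -1 if all are idle.
 * 'last' is the context issued from previously; scanning starts just
 * after it, so contexts tied at the highest priority rotate round-robin. */
int select_context(const struct context ctx[NCTX], int last)
{
    int best = -1, best_pri = -1;
    for (int i = 1; i <= NCTX; i++) {
        int c = (last + i) % NCTX;
        /* Strict '>' keeps the first candidate found among equal
         * priorities, which is the next one after 'last'. */
        if (ctx[c].active && ctx[c].priority > best_pri) {
            best = c;
            best_pri = ctx[c].priority;
        }
    }
    return best;
}
```

Calling this once per cycle with the previously issued context reproduces the behavior described above: strict priority between levels, round-robin within a level.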

The CID, together with a register number from the fetched instruction, forms the register's actual address in the register file. Pipeline bypass and interlock logic also uses the CID. Thus, aside from modifying the instruction issue logic, expanding the register file, and adding a CID to bypass and interlock logic equations, UNUM employs a traditional single-issue RISC pipeline.

Multithreading has historically been used to tolerate memory latency. In UNUM, multithreading reduces event service latency. A UNUM processor may be designed to both tolerate memory latency and reduce event service latency.

Data movement instructions

General-purpose processors have poor data movement capabilities. Programmed I/O (PIO) generates memory-to-memory transfers that require twice the bus bandwidth of fly-by DMA operations. Some system designers have used special hardware to perform fly-by transfers as a side effect of address ranges, but this leads to complex software and does not scale well to multiple interfaces. PIO operations using noncacheable loads and stores result in single-word data transfers that achieve poor bus utilization. Using cacheable loads and stores to generate burst transfers results in data cache pollution, while using block loads and stores, present in some processors, increases register pressure. Unaligned PIO operations are extremely inefficient. This is especially true when transfers must be performed to an aligned, fixed-width memory device, such as a FIFO port. Finally, PIO operations tie up the CPU.

Since data movement is one of the primary functions of a communications processor, we have incorporated data movement instructions into UNUM. Figure 3 shows the system architecture of a communications processor based on UNUM.

[Figure 3. UNUM-based communications processor: the UNUM processor with instruction and data caches connects through an internal BIU and aligner, which splits the on-chip bus into a memory bus (to an external BIU and memory controller) and an I/O bus serving the I/O devices and external I/O.]

The CPU core interfaces to the rest of the system through an internal bus interface unit (IBIU). In addition to performing the operations of a traditional bus interface unit, the IBIU incorporates a data mover and aligner that segments the on-chip bus into a memory bus and an I/O bus. The CPU initiates a data movement operation by issuing an instruction to the data mover. Since data movement fully utilizes on-chip buses, an implementation may either stall the CPU pipeline until the operation completes or allow the pipeline to continue execution from on-chip caches as long as there are no misses.

UNUM data movement instructions perform fly-by transfers between memory and devices on the I/O bus. The TM2D instruction transfers data from memory to an interface, while the TD2M instruction transfers data in the opposite direction. In both cases, fly-by data bypasses the data cache and does not pollute it. To maintain cache consistency, the cache supplies dirty data during TM2D processing, and performs cache invalidates during TD2M. Efficient processing of network data is supported through direct transfers between a network interface and the data cache. The TD2C instruction loads data directly into the data cache from an interface, eliminating an unnecessary transfer through memory. The TM2DD instruction discards dirty data from the data cache as it is written to a network interface, potentially eliminating an unnecessary future write-back.
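The consistency rules for TM2D and TD2M can be captured in a toy functional model. TM2D and TD2M are the paper's instruction names, but the one-line-per-memory-line "cache," the structures, and every identifier below are invented purely for illustration.

```c
#include <stdint.h>
#include <string.h>

enum { LINE = 16, NLINES = 4 };  /* toy geometry, not the simulated chip's */

struct line { int valid, dirty; uint8_t data[LINE]; };

struct sys {
    uint8_t mem[NLINES * LINE];
    struct line dcache[NLINES];  /* direct-mapped, same index as memory */
};

/* TM2D: memory -> I/O device, one line. The cache supplies dirty data in
 * place of stale memory; the transfer never allocates a cache line. */
void tm2d(struct sys *s, int idx, uint8_t dev[LINE])
{
    if (s->dcache[idx].valid && s->dcache[idx].dirty)
        memcpy(dev, s->dcache[idx].data, LINE);
    else
        memcpy(dev, &s->mem[idx * LINE], LINE);
}

/* TD2M: I/O device -> memory, one line. Matching cache lines are
 * invalidated so a later load sees the new data; again, no allocation,
 * so the transfer cannot pollute the cache. */
void td2m(struct sys *s, int idx, const uint8_t dev[LINE])
{
    memcpy(&s->mem[idx * LINE], dev, LINE);
    s->dcache[idx].valid = 0;
}
```

TD2C and TM2DD would extend this model by, respectively, allocating into the cache on a device-to-cache transfer and clearing the dirty bit as a line is written out.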
All data flowing between the memory and the I/O buses passes through an aligner in the IBIU. For aligned transfers, the aligner simply passes unmodified data from one bus to another. For unaligned transfers, the aligner uses a holding register, shifter, and multiplexer to align data as it flows from one bus to the other. Figure 4 provides an example of this for an unaligned 4-word transfer.

[Figure 4. Unaligned 4-word fly-by transfer from memory to I/O device starting at address 6: five words are read from memory and four are written to the I/O device, with the aligner's holding register, shifter, and multiplexer combining adjacent memory words to produce each output word.]

Putting it all together

The ability to service external events with extremely low overhead together with high-performance data transfer instructions allows UNUM to perform data movement and interface-specific processing functions in software. Combining a UNUM processor core with network interface cores allows communications processors to be rapidly constructed. A typical high-speed network interface in UNUM maps to two processor contexts, one for input processing and one for output processing. Threshold logic in the output FIFO of a network interface generates an event whenever there is room for an output data transfer. Similarly, threshold logic in the input FIFO generates an event whenever enough data exists for a complete data transfer or an end-of-packet is detected. The event-handling routines may perform interface-specific processing.

Simulation results

To better quantify the benefits of the UNUM architecture on data movement and interface-specific processing, we created a cycle-accurate simulator of a UNUM-based communications processor. We based the CPU in the simulator on the MIPS32 ISA, which was enhanced to support multiple hardware contexts and data movement instructions. The simulator also modeled the caches, memory system, counter/timers, a console, and an ATM interface.

We simulated a 200-MHz UNUM processor with an 8-Kbyte, two-way set-associative instruction cache; 2-Kbyte, two-way set-associative data cache; and a 4-word write buffer. We configured our simulated 32-bit system bus to run at 100 MHz and the memory system to consist of 100-MHz SDRAM. All of the benchmarks were written in C and compiled with an enhanced MIPS GCC 2.8.1 compiler with "-O3" optimization. Other than the ones mentioned in the next section, we did not perform hand assembly language optimizations. In addition, we assumed that event and interrupt handlers were locked in the instruction cache.

Data movement

For our initial measurements we wrote a data movement micro-benchmark that simulated the transfer of a 1,518-byte packet from memory to a network interface using a range of burst transfer sizes.
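The holding-register scheme of Figure 4 can be sketched in C, assuming little-endian byte order and 32-bit words; the function name and interface are hypothetical. Each output word combines the tail of the previously fetched memory word (the holding register) with the head of the current one.

```c
#include <stdint.h>

/* src: aligned memory words, beginning with the word that contains the
 * first (unaligned) source byte; offset: that byte's position (0-3)
 * within it. Writes nwords aligned words to out, reading nwords + 1
 * words of memory, just as in Figure 4's 5-read/4-write example. */
void align_copy(const uint32_t *src, int offset, uint32_t *out, int nwords)
{
    int shift = offset * 8;
    uint32_t hold = *src++;          /* prime the holding register */
    for (int i = 0; i < nwords; i++) {
        uint32_t next = *src++;
        /* Shifter + multiplexer: low part from the held word, high part
         * from the new one. (Guard shift == 0: a 32-bit shift is UB.) */
        out[i] = shift ? (hold >> shift) | (next << (32 - shift)) : hold;
        hold = next;                 /* new word becomes the held one */
    }
}
```

With `offset = 2` this reproduces Figure 4's transfer starting at byte address 6: the first output word carries bytes 6 through 9.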

We measured the resulting bandwidth. We examined two data movement mechanisms, one using UNUM data movement instructions and another using PIO. Our PIO function moved data using an optimized hand-coded assembly routine based on the BSD bcopy() function. We examined three CPU configurations: UNUM (hardware context switch with state preservation), fast interrupts (alternate register set with no interrupt state preservation), and normal processor interrupts. For normal interrupts we assumed an overhead of 1 µs. We ran our benchmark for best- and worst-case data cache scenarios for state information and with the assumption that data to be moved is not present in the data cache.

Figure 5 provides the results of the data movement benchmark.

[Figure 5. Data movement performance: bandwidth (Mbytes/sec) versus data transfer size (64 to 512 bytes) for UNUM events and data movement instructions; fast interrupts with UNUM data movement instructions (D-cache hits and D-cache misses); interrupts with UNUM data movement instructions; and UNUM events with PIO.]

PIO-based data movement results in the worst performance. The highest achievable bandwidth using PIO was 40 Mbytes/sec. This was true regardless of the type of CPU used (UNUM, fast interrupts, or regular interrupts) and caching assumptions since the cost of PIO dominates all other overheads.

Making use of UNUM's data movement instructions improved results significantly. For an SOC environment with small on-chip buffers we expect burst sizes in the range of 64 bytes. Using this burst size, a normal interrupt-based system achieved 47 Mbytes/sec. Data cache misses had little effect since we assumed a fixed interrupt overhead of 1 µs, which dominated. Replacing normal interrupts with fast interrupts improved performance to 104 Mbytes/sec, assuming all data cache misses, and 189 Mbytes/sec, assuming all data cache hits. Given the small size of data caches in SOC communications processors, we expect the actual achieved bandwidth to be closer to the lower end of this range.

UNUM with data movement instructions produced the best results: 212 Mbytes/sec. Since the state of the event service routine fits within a UNUM context, no state information needs to be loaded from memory. This explains why UNUM outperforms fast interrupts with data cache hits, and it also is the reason why the UNUM curve is unaffected by data cache misses.

Note that for very small burst sizes, UNUM events with PIO outperforms fast interrupts with data movement instructions. This is because for small bursts the overhead of loading the event service routine state exceeds that of performing memory-to-memory PIO transfers.

ATM Soft-SAR

Our second benchmark measures the ability of UNUM to perform complex interface-specific processing. For this benchmark we selected ATM AAL5 Segmentation and Reassembly (SAR) since it represents a class of applications in which the processing performed on received data depends on its content.

ATM AAL5 SAR transmit processing consists of segmenting protocol data units (PDUs) to be sent on an ATM virtual circuit into fixed-length cells and attaching a header to each cell. The PDU is padded to contain an integral number of cells, and the last cell has fields that indicate the data length, a user-to-user byte, and a CRC-32 value. SAR receive processing consists of reassembling received cells into PDUs, checking the length and CRC-32 fields, and passing the payload to upper layers.

What makes SAR processing challenging is that for each received ATM cell considerable work must be performed. First, the identifier field (VPI/VCI) in the cell header is used to look up the virtual circuit that the cell belongs to.
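The per-cell CRC-32 update at the heart of this receive path can be sketched with the standard reflected CRC-32; AAL5 uses the same generator polynomial as Ethernet. Keeping the running crc value in the per-virtual-circuit data structure is what lets each 48-byte payload be folded in incrementally. A bitwise version (the function name and interface are our own, and a real implementation would be table-driven or, as in the simulated system, a hardware calculator):

```c
#include <stdint.h>
#include <stddef.h>

/* Fold len bytes into a running reflected CRC-32 (polynomial 0xEDB88320).
 * Start with crc = 0xFFFFFFFF for a new PDU, call once per received cell
 * payload, and complement the result after the last cell to compare
 * against the CRC-32 field in the AAL5 trailer. */
uint32_t crc32_update(uint32_t crc, const uint8_t *p, size_t len)
{
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (crc & 1 ? 0xEDB88320u : 0);
    }
    return crc;
}
```

Because the update chains, processing a PDU cell by cell yields exactly the same value as one pass over the reassembled buffer, which is what makes per-cell software (or hardware-assisted) CRC maintenance possible.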
This lookup returns a data structure that contains a pointer to a reassembly buffer and the current CRC-32 for the packet being reassembled. The payload of the cell is then appended to the reassembly buffer, and the reassembly buffer pointer and CRC-32 are updated. Additional processing is required to handle boundary conditions such as end of frame and end of buffer. Due to the complexity of SAR processing, most systems implement this in custom hardware.

The ATM interface in the system we simulated consisted of a physical layer interface (for example, Utopia), a transmit and receive FIFO, a CRC-32 calculator, and control and status registers. The interface generates an ATM receive event when a cell is present in the receive FIFO and an ATM transmit event when space exists for a cell in the transmit FIFO. We wrote code for SAR processing on the UNUM processor. The SAR software uses three hardware contexts. The first performs ATM receive event processing, the second performs ATM transmit event processing, and the third performs ATM transmit cell scheduling.

Our first experiment measured the maximum achievable throughput, assuming an infinite line rate and FIFOs. Figure 6 shows the throughput as a function of AAL5 frame size for half-duplex transmit, half-duplex receive, and full-duplex operation.

[Figure 6. Maximum ATM SAR throughput: throughput (Mbps) versus PDU size (512 to 3,072 bytes) for receive, send, and full-duplex operation.]

In all three cases, the throughput increases with frame size since the per-frame overhead is amortized over a larger number of cells. The highest throughput we observed was 570 Mbps, which occurs for the half-duplex receive case. Transmit throughput is lower because of the extra overhead associated with cell scheduling. These results show that low-overhead event processing and high-performance data movement instructions allow UNUM to sustain a very high throughput. As a comparison, to sustain a receive throughput of 570 Mbps, a single-context CPU would have to service an interrupt every 750 ns.

Our second experiment measured UNUM processor utilization and the FIFO size necessary to sustain a full-duplex line rate of 25 Mbps. We measured a CPU utilization of 13.4% with a frame size of 1,536 bytes. This means that even when transmitting and receiving at full line rate, 86% of the CPU is available for other processing. More important than throughput is worst-case latency, which determines required on-chip buffering. Using UNUM, a 25-Mbps, full-duplex line rate requires just a four-cell transmit and receive FIFO.

UNUM simplifies design of communications processors, lowers their cost, and closely integrates data movement and computation, thereby enabling fly-by processing. UNUM's ability to perform fly-by processing is well suited to applications such as encryption, coding, overload control, packet classification, and packet telephony. The emergence of broadband access networks is making these applications increasingly important for low-cost consumer devices. We are continuing our investigation of UNUM for these and other application areas. MICRO

References
1. C. Dalton et al., "Afterburner," IEEE Network, Jul. 1993, pp. 36-43.
2. H. Kanakia and D. Cheriton, "The VMP Network Adaptor Board (NAB): High Performance Network Communication for Multiprocessors," Proc. Symp. Communication Architectures and Protocols, ACM, New York, 1988, pp. 175-187.
3. K.K. Ramakrishnan, "Performance Considerations in Designing Network Interfaces," IEEE J. Selected Areas in Communications, Vol. 11, No. 2, Feb. 1993, pp. 203-219.
4. Am79C973/Am79C975 PCnet—Fast III Single Chip 10/100 Mbps PCI Ethernet Controller with Integrated PHY Data Sheet, Advanced Micro Devices, Sunnyvale, Calif.
5. P. Druschel et al., "Network Subsystem Design," IEEE Network, Jul. 1993, pp. 8-17.
6. T. von Eicken et al., "U-Net: A User Level Network Interface for Parallel and Distributed Computing," Proc. 15th Ann. ACM Symp. Operating Systems Principles, ACM, Dec. 1995, pp. 40-53.
7. Am186CC Communications Controller User's Manual, Advanced Micro Devices, Sunnyvale, Calif.
8. NET+ARM Hardware Reference Guide, NETsilicon, Waltham, Mass.
9. P.Z. Onufryk, "Euphony: A Signal Processor for ATM," EE Times, Jan. 20, 1997, pp. 54, 80.
10. PowerQuicc: Motorola MPC860 User Manual, Motorola, Austin, Tex.
11. HELIUM IC-000148 Preliminary Data Sheet, VIRATA, Cambridge, UK.
12. TS702 Advanced Communication Controller Data Book, T.sqware Inc., Santa Clara, Calif.

Charles D. Cranor is a senior technical staff member at AT&T Labs–Research in Florham Park, New Jersey. His interests include networking, operating systems, and computer architecture. Cranor received a bachelor's degree in electrical engineering from the University of Delaware and a master's and doctorate in computer science from Washington University in St. Louis, Missouri. He is a member of the IEEE, ACM, and USENIX, and a kernel developer for the open-source BSD operating systems projects.

R. Gopalakrishnan is a senior technical staff member at AT&T Labs–Research in Florham Park, New Jersey. His interests include packet telephony systems, service differentiation and I/O performance optimizations in server operating systems, multimedia networking, and IP multicast. Gopalakrishnan received the BTech degree in electrical engineering from IIT Kanpur in India, the MTech degree in computer science from IIT Delhi, and the DSc degree in computer science from Washington University, St. Louis. He is a member of the ACM.

Peter Z. Onufryk is a technology consultant with AT&T Labs–Research in Florham Park, New Jersey. He was the lead architect of two communications processors and has worked on a number of research and military computers. Onufryk received his bachelor's degree in electrical engineering from Rutgers University, master's in electrical engineering from Purdue University, and PhD in electrical and computer engineering from Rutgers University. He is a member of the IEEE, IEEE Computer Society, and ACM.

Direct comments about this article to P.Z. Onufryk, Room B009, AT&T Labs–Research, 180 Park Ave., Bldg. 103, Florham Park, NJ 07932; [email protected].
