ARCHITECTURAL CONSIDERATIONS
Charles D. Cranor
R. Gopalakrishnan
Peter Z. Onufryk
AT&T Labs, Research

The growth of the Internet is creating a demand for broadband access equipment and network-enabled consumer appliances. At the heart of these products are communications processors—devices that integrate processing, networking, and system support functions into a single, low-cost system on a chip (SOC). The primary challenges in the design of these devices are minimizing cost and time to market, and maximizing flexibility. Die size and packaging are the major factors that determine cost. The rapid pace at which Internet applications and services are evolving increases the pressure to reduce development time. Thus, designers of communications processors are continually looking for ways to speed design and verification. Rapid change is also increasing the importance of flexibility. Communications processors are often adapted to applications that may not have been anticipated, or may not even have existed, when the chip was designed.

A large body of research and experience in the design of network adaptors for workstations exists [1-3]. It may appear that a communications processor consists of nothing more than integrating these designs on a chip. However, this is not the case, since the system requirements and constraints for workstations and communications processors are fundamentally different. Designers must carefully manage latency in communications processors to reduce on-chip buffering. Network interface cards (NICs) typically used in workstations plug into I/O buses that have high latencies. This forces a NIC to include large buffers and encourages large burst transfers for efficiency. For example, a 10-/100-Mbps Ethernet NIC can have as much as 12 Kbytes of buffering [4]. Since communications processors often contain multiple network interfaces, placing such large buffers on chip may not be possible and is certainly not cost effective.

Space requirements in workstations are not as stringent as those in the SOC environment of communications processors. This allows workstation NICs to include considerable processing power. For example, "intelligent" NICs contain on-board processors [2]. Even "dumb" workstation NICs are actually quite intelligent. For example, it is common for a NIC to contain a complex DMA controller and buffer management unit. In a communications processor this functionality is typically shared among multiple network interfaces to reduce die size. Network interfaces in these [...]
[...] identifier fields in the cell's header. We call this form of channel customization interface-specific processing since it requires functionality beyond simple data movement. Other examples of interface-specific processing are multiplexing and demultiplexing of data based on the time slot for a time-division multiplexing bus, and searching through multiple DMA descriptors for an optimal-size buffer to store a received Ethernet frame. Despite the complexity of designing a multichannel DMA controller, a number of communications processors such as the AMD Am186CC [7], the NETsilicon Net+ARM [8], and the Euphony processor [9] use this approach.

The design of a multichannel DMA controller with the features necessary to support multiple network interfaces can be as complex as a programmable processor. For this reason, some designers have chosen to replace multichannel DMA controllers with a dedicated processor for data transfers and interface-specific processing. This eliminates the complexity of designing the DMA controller, provides flexibility, and allows modifications and enhancements to be made in software. The Motorola MPC860 [10] and the Virata Helium [11] use this approach.

Adding a second processor for data transfers and interface-specific processing eliminates the complexity of designing a DMA controller and provides flexibility. However, it also introduces the software complexity and partitioning issues associated with developing code for multiple processors. This is especially true if the architecture of the processor that handles communications tasks differs from that of the main CPU. Since processor functions must be replicated in this approach (for example, two bus interface units, two ALUs), it may increase die size. This approach also leads to a static partitioning of functions onto processors: idle cycles on one processor cannot be used to enhance the performance of tasks running on the other.

Applications with low data rates that can tolerate high latencies without requiring large on-chip buffers do not require a DMA controller or a dedicated processor. Instead, an interrupt handler running on the main CPU may perform these operations. The T.sqware TS702 [12] uses this approach.

We believe that a communications processor for low-cost consumer applications should contain a single processor that performs all data movement, interface-specific processing, and application processing. This becomes especially true as embedded processors reach speeds of 500 MHz and higher. The availability of processor cores capable of performing these tasks would reduce communications processor design and verification time, increase their flexibility, and simplify software development. UNUM is an architecture for this type of processor core.

Multithreaded CPU for event processing

Performing data movement and interface-specific processing on the same CPU as application processing dramatically increases the number of processor events that must be serviced. The key to minimizing communications processor cost is minimizing die size, which means minimizing on-chip buffering. Small on-chip buffers impose tight constraints on acceptable event service latency and result in small burst transfers, thus increasing the number of events.

To illustrate the importance of minimizing event service latency, consider a cut-through transfer of a 1,518-byte Ethernet frame from a receive FIFO to memory. Using a 64-byte burst transfer results in 24 data transfer and request events. To prevent overflow, the receive FIFO must be large enough to accommodate the worst-case event service latency. A large event service latency not only reduces the maximum throughput but also requires larger FIFOs to prevent overflow. This, in turn, results in higher queuing delays that further increase worst-case event service latency.
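As a rough illustration of this arithmetic (our own sketch, not part of the UNUM design), the following C fragment computes the number of burst events per frame and a simple lower bound on receive FIFO size; the one-event-per-burst assumption and the sizing rule are simplifications.

```c
#include <stdio.h>

/* One event is assumed per burst; the receive FIFO must absorb whatever
 * arrives during the worst-case event service latency. */
static unsigned burst_events(unsigned frame_bytes, unsigned burst_bytes)
{
    /* ceil(frame / burst): a 1,518-byte frame with 64-byte bursts -> 24 events */
    return (frame_bytes + burst_bytes - 1) / burst_bytes;
}

static unsigned min_fifo_bytes(double line_rate_bps, double worst_latency_s,
                               unsigned burst_bytes)
{
    /* Bytes that can arrive while the event is pending, plus one burst
     * already staged for transfer. */
    double backlog = line_rate_bps / 8.0 * worst_latency_s;
    return (unsigned)(backlog + 0.5) + burst_bytes;
}

int main(void)
{
    printf("events per 1,518-byte frame: %u\n", burst_events(1518, 64));
    /* e.g. 100-Mbps line, 2-us worst-case service latency -> 25 + 64 bytes */
    printf("min FIFO: %u bytes\n", min_fifo_bytes(100e6, 2e-6, 64));
    return 0;
}
```

With a 1,518-byte frame and 64-byte bursts, the first call reproduces the 24 events cited above; the FIFO figure is only a rough bound that grows directly with worst-case service latency.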
Current processors service external events using either polling or interrupts. Infrequent polling results in large event service latencies, while frequent polling consumes large amounts of processing. Both of these are unacceptable. The worst-case event service latency for an interrupt with a full-context save for a high-performance embedded processor is on the order of several microseconds. Although typical performance is much better, designers must consider worst-case performance during system design, since taking the best or average case will lead to conditions such as buffer overflow or underflow.

A major component of interrupt latency is
saving and restoring the state of the interrupted context. Techniques used to reduce this overhead include

• coding interrupt service routines in assembly language to use a small number of registers,
• switching to an alternate register set for interrupt processing,
• saving processor registers to unused floating-point registers, and
• providing on-chip memory for saving and restoring state.

[Figure: external events feed an event mapper and a context scheduler that holds a program counter (PC) and priority for each context (Context 0 through Context n); the scheduler supplies a context ID (CID) and PC to the CPU pipeline, which shares an instruction memory and a register file (31n × 32).]
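To give a software flavor of what the block diagram suggests, the fragment below models a priority-driven context scheduler in C; the structure, field names, and the fixed context count are illustrative assumptions and do not describe the actual UNUM hardware.

```c
#include <stdint.h>

#define NCONTEXTS 8          /* assumed number of hardware contexts */

/* One entry per hardware context: its own PC and a priority,
 * as suggested by the PC/Priority boxes in the figure. */
struct context {
    uint32_t pc;
    uint8_t  priority;       /* larger value = more urgent */
    uint8_t  ready;          /* set by the event mapper when its event fires */
};

static struct context ctx[NCONTEXTS];

/* Event mapper: an external event wakes the context bound to it. */
void raise_event(unsigned context_id)
{
    if (context_id < NCONTEXTS)
        ctx[context_id].ready = 1;
}

/* Context scheduler: fetch from the highest-priority ready context.
 * Returns the context ID (CID) whose PC feeds the pipeline,
 * or -1 if no context is ready. */
int schedule_next(void)
{
    int best = -1;
    for (unsigned i = 0; i < NCONTEXTS; i++)
        if (ctx[i].ready && (best < 0 || ctx[i].priority > ctx[best].priority))
            best = (int)i;
    return best;
}
```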
[...] dirty data from the data cache as it is written to a network interface, potentially eliminating an unnecessary future write-back.

All data flowing between the memory and I/O buses passes through an aligner in the IBIU. For aligned transfers, the aligner simply passes unmodified data from one bus to another. For unaligned transfers, the aligner uses a holding register, shifter, and multiplexer to align data as it flows from one bus to the other. Figure 4 provides an example of this for an unaligned 4-word transfer.

[Figure 4. Unaligned 4-word fly-by transfer from memory to I/O device starting at address 6. Five words are read from memory and four aligned words are written to the I/O device; the aligner's holding register and shifter/multiplexer produce each output word.]
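The behavior in Figure 4 can be modeled in a few lines of C. The sketch below mirrors the holding-register-and-shift operation for a big-endian word layout; the function name, interface, and byte order are our assumptions for illustration rather than the IBIU's actual datapath.

```c
#include <stdint.h>

/* Software model of the aligner's holding register + shifter/multiplexer.
 * byte_offset must be 1..3 (e.g. a transfer starting at address 6 has
 * offset 2 within its word); words are treated as big-endian byte groups. */
void aligned_flyby(const uint32_t *mem_words,  /* word-aligned source */
                   unsigned byte_offset,
                   uint32_t *io_out,            /* aligned words for the I/O bus */
                   unsigned out_words)
{
    unsigned shift = 8 * byte_offset;
    uint32_t hold = mem_words[0];                /* holding register */

    for (unsigned i = 0; i < out_words; i++) {
        uint32_t next = mem_words[i + 1];        /* out_words + 1 memory reads */
        /* splice the trailing bytes of the held word with the leading bytes
         * of the next one, so each output word is fully aligned */
        io_out[i] = (hold << shift) | (next >> (32 - shift));
        hold = next;
    }
}
```

For the transfer in Figure 4 (start address 6, offset 2 within a word), producing four output words requires five memory reads, matching the figure.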
Putting it all together

The ability to service external events with extremely low overhead together with high-performance data transfer instructions allows UNUM to perform data movement and interface-specific processing functions in software. Combining a UNUM processor core with network interface cores allows communications processors to be rapidly constructed. A typical high-speed network interface in UNUM maps to two processor contexts, one for input processing and one for output processing. Threshold logic in the output FIFO of a network interface generates an event whenever there is room for an output data transfer. Similarly, threshold logic in the input FIFO generates an event whenever enough data exists for a complete data transfer or an end-of-packet is detected. The event-handling routines may perform interface-specific processing.
Simulation results

To better quantify the benefits of the UNUM architecture for data movement and interface-specific processing, we created a cycle-accurate simulator of a UNUM-based communications processor. We based the CPU in the simulator on the MIPS32 ISA, which was enhanced to support multiple hardware contexts and data movement instructions. The simulator also modeled the caches, memory system, counter/timers, a console, and an ATM interface.

We simulated a 200-MHz UNUM processor with an 8-Kbyte, two-way set-associative instruction cache; a 2-Kbyte, two-way set-associative data cache; and a 4-word write buffer. We configured our simulated 32-bit system bus to run at 100 MHz and the memory system to consist of 100-MHz SDRAM. All of the benchmarks were written in C and compiled with an enhanced MIPS GCC 2.8.1 compiler with "-O3" optimization. Other than the ones mentioned in the next section, we did not perform hand assembly language optimizations. In addition, we assumed that event and interrupt handlers were locked in the instruction cache.

Data movement

For our initial measurements we wrote a data movement micro-benchmark that simulated the transfer of a 1,518-byte packet from
memory to a network interface using a range of burst transfer sizes. We measured the resulting bandwidth. We examined two data movement mechanisms, one using UNUM data movement instructions and another using PIO. Our PIO function moved data using an optimized hand-coded assembly routine based on the BSD bcopy() function. We examined three CPU configurations: UNUM (hardware context switch with state preservation), fast interrupts (alternate register set with no interrupt state preservation), and normal processor interrupts. For normal interrupts we assumed an overhead of 1 µs. We ran our benchmark for best- and worst-case data cache scenarios for state information and with the assumption that data to be moved is not present in the data cache.
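The PIO path is essentially a word-copy loop. The C sketch below shows the shape of such a bcopy()-style inner loop; the actual benchmark used a hand-tuned assembly routine, and the alignment assumptions and unrolling factor here are ours.

```c
#include <stdint.h>
#include <stddef.h>

/* Word-at-a-time programmed-I/O copy in the spirit of BSD bcopy(); a
 * simplified C stand-in for the hand-coded assembly routine, assuming
 * word-aligned src/dst and a length that is a multiple of 16 bytes. */
static void pio_copy(volatile uint32_t *dst, const uint32_t *src, size_t bytes)
{
    for (size_t i = 0; i < bytes / sizeof(uint32_t); i += 4) {
        /* unrolled by four to cut loop overhead, as a tuned copy would */
        dst[i + 0] = src[i + 0];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
    }
}
```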
Figure 5 provides the results of the data movement benchmark. PIO-based data movement results in the worst performance. The highest achievable bandwidth using PIO was 40 Mbytes/sec. This was true regardless of the type of CPU used (UNUM, fast interrupts, or regular interrupts) and caching assumptions, since the cost of PIO dominates all other overheads.

[Figure 5. Data movement performance: achieved bandwidth versus data transfer size (64 to 512 bytes).]

Making use of UNUM's data movement instructions improved results significantly. For an SOC environment with small on-chip buffers we expect burst sizes in the range of [...] all data cache hits. Given the small size of data caches in SOC communications processors, we expect the actual achieved bandwidth to be closer to the lower end of this range.

UNUM with data movement instructions produced the best results: 212 Mbytes/sec. Since the state of the event service routine fits within a UNUM context, no state information needs to be loaded from memory. This explains why UNUM outperforms fast interrupts with data cache hits, and it is also the reason why the UNUM curve is unaffected by data cache misses.

Note that for very small burst sizes, UNUM events with PIO outperform fast interrupts with data movement instructions. This is because for small bursts the overhead of loading the event service routine state exceeds that of performing memory-to-memory PIO transfers.

ATM Soft-SAR

Our second benchmark measures the ability of UNUM to perform complex interface-specific processing. For this benchmark we selected ATM AAL5 Segmentation and Reassembly (SAR), since it represents a class of applications in which the processing performed on received data depends on its content.

ATM AAL5 SAR transmit processing consists of segmenting protocol data units (PDUs) to be sent on an ATM virtual circuit into fixed-length cells and attaching a header to each cell. The PDU is padded to contain an integral number of cells, and the last cell has fields that indicate the data length, a user-to-user byte, and a CRC-32 value. SAR receive processing consists of reassembling received cells into PDUs, checking the length and CRC-32 fields, and passing the payload to upper layers.
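The per-PDU framing arithmetic implied by this description is sketched below. The 48-byte cell payload and the 8-byte trailer follow the AAL5 definition (the trailer's CPI byte is part of that definition even though it is not discussed above); the structure layout and names are ours.

```c
#include <stdint.h>
#include <stddef.h>

#define ATM_PAYLOAD  48u   /* AAL5 payload bytes carried per ATM cell */
#define AAL5_TRAILER  8u   /* UU, CPI, 16-bit length, CRC-32 */

/* Number of cells needed to carry a PDU: payload plus trailer, padded
 * up to an integral number of 48-byte cell payloads. */
static unsigned aal5_cells(size_t pdu_len)
{
    return (unsigned)((pdu_len + AAL5_TRAILER + ATM_PAYLOAD - 1) / ATM_PAYLOAD);
}

/* Trailer carried at the end of the last cell of the PDU (illustrative
 * layout; on the wire these fields are packed with any required padding
 * in front of them). */
struct aal5_trailer {
    uint8_t  uu;        /* user-to-user byte */
    uint8_t  cpi;       /* common part indicator */
    uint16_t length;    /* PDU length in bytes */
    uint32_t crc32;     /* CRC-32 over the whole padded PDU */
};
```

For example, a 1,536-byte PDU plus its 8-byte trailer pads out to 33 cells under this rule.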
What makes SAR processing challenging is that for each received ATM cell considerable work must be performed. First, the identifier field (VPI/VCI) in the cell header is used to
look up the virtual circuit that the cell belongs to. This lookup returns a data structure that contains a pointer to a reassembly buffer and current CRC-32 for the packet being reassembled. The payload of the cell is then appended to the reassembly buffer, and the reassembly buffer pointer and CRC-32 are updated. Additional processing is required to handle boundary conditions such as end of frame and end of buffer. Due to the complexity of SAR processing, most systems implement this in custom hardware.
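A per-cell receive step of the kind described above might look roughly like the following. The VC table, field names, helper functions (vc_lookup, crc32_update, pdu_complete, handle_overrun), and the UNI header layout are our own placeholders rather than the code that runs on the UNUM contexts.

```c
#include <stdint.h>
#include <stddef.h>

#define CELL_PAYLOAD 48

struct vc_state {                 /* per-virtual-circuit reassembly state */
    uint8_t  *reasm_ptr;          /* where the next payload bytes go */
    uint8_t  *reasm_end;          /* end of the reassembly buffer */
    uint32_t  crc;                /* running CRC-32 for the PDU */
};

/* Assumed helpers: a VPI/VCI lookup and an incremental CRC-32. */
struct vc_state *vc_lookup(uint32_t vpi_vci);
uint32_t crc32_update(uint32_t crc, const uint8_t *buf, size_t len);
void     pdu_complete(struct vc_state *vc);
void     handle_overrun(struct vc_state *vc);

void atm_rx_cell(uint32_t header, const uint8_t payload[CELL_PAYLOAD])
{
    /* 1. map the cell's VPI/VCI to its virtual circuit (UNI: 8-bit VPI + 16-bit VCI) */
    struct vc_state *vc = vc_lookup((header >> 4) & 0x00ffffffu);
    if (vc == NULL)
        return;                                   /* unknown VC: drop cell */

    /* 2. boundary condition: reassembly buffer full */
    if (vc->reasm_ptr + CELL_PAYLOAD > vc->reasm_end) {
        handle_overrun(vc);
        return;
    }

    /* 3. append payload, update running CRC-32, advance pointer */
    for (int i = 0; i < CELL_PAYLOAD; i++)
        vc->reasm_ptr[i] = payload[i];
    vc->crc = crc32_update(vc->crc, payload, CELL_PAYLOAD);
    vc->reasm_ptr += CELL_PAYLOAD;

    /* 4. last cell of the PDU (AAL5 end-of-frame indication in the PT field) */
    if (header & 0x2)
        pdu_complete(vc);
}
```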
The ATM interface in the system we simulated consisted of a physical layer interface (for example, Utopia), a transmit and receive FIFO, a CRC-32 calculator, and control and status registers. The interface generates an ATM receive event when a cell is present in the receive FIFO and an ATM transmit event when space exists for a cell in the transmit FIFO. We wrote code for SAR processing on the UNUM processor. The SAR software uses three hardware contexts. The first performs ATM receive event processing, the second performs ATM transmit event processing, and the third performs ATM transmit cell scheduling.
Our first experiment measured the maximum achievable throughput, assuming an infinite line rate and FIFOs. Figure 6 shows the throughput as a function of AAL5 frame size for half-duplex transmit, half-duplex receive, and full-duplex operation. In all three cases, the throughput increases with frame size since the per-frame overhead is amortized over a larger number of cells. The highest throughput we observed was 570 Mbps, which occurs for the half-duplex receive case. Transmit throughput is lower because of the extra overhead associated with cell scheduling. These results show that low-overhead event processing and high-performance data movement instructions allow UNUM to sustain a very high throughput. As a comparison, to sustain a receive throughput of 570 Mbps, a single-context CPU would have to service an interrupt every 750 ns.

[Figure 6. Maximum ATM SAR throughput: receive, send, and full-duplex throughput (Mbps) versus PDU size (512 to 3,072 bytes).]

Our second experiment measured UNUM processor utilization and the FIFO size necessary to sustain a full-duplex line rate of 25 Mbps. We measured a CPU utilization of 13.4% with a frame size of 1,536 bytes. This means that even when transmitting and receiving at full line rate, 86% of the CPU is available for other processing. More important than throughput is worst-case latency, which determines the required on-chip buffering. Using UNUM, a 25-Mbps, full-duplex line rate requires just a four-cell transmit and receive FIFO.

UNUM simplifies design of communications processors, lowers their cost, and closely integrates data movement and computation, thereby enabling fly-by processing. UNUM's ability to perform fly-by processing is well suited to applications such as encryption, coding, overload control, packet classification, and packet telephony. The emergence of broadband access networks is making these applications increasingly important for low-cost consumer devices. We are continuing our investigation of UNUM for these and other application areas. MICRO

References
1. C. Dalton et al., "Afterburner," IEEE Network, Jul. 1993, pp. 36-43.
2. H. Kanakia and D. Cheriton, "The VMP Network Adaptor Board (NAB): High Performance Network Communication for Multiprocessors," Proc. Symp. Communication Architectures and Protocols, ACM, New York, 1988, pp. 175-187.
3. K.K. Ramakrishnan, "Performance Considerations in Designing Network Interfaces," IEEE J. Selected Areas in Communications, Vol. 11, No. 2, Feb. 1993, pp. 203-219.
4. Am79C973/Am79C975 PCnet—Fast III Single Chip 10/100 Mbps PCI Ethernet Controller with Integrated PHY Data Sheet, Advanced Micro Devices, Sunnyvale, Calif.
5. P. Druschel et al., "Network Subsystem Design," IEEE Network, Jul. 1993, pp. 8-17.
6. T. von Eicken et al., "U-Net: A User-Level Network Interface for Parallel and Distributed Computing," Proc. 15th Ann. ACM Symp. Operating Systems Principles, ACM, Dec. 1995, pp. 40-53.
7. Am186CC Communications Controller User's Manual, Advanced Micro Devices, Sunnyvale, Calif.
8. NET+ARM Hardware Reference Guide, NETsilicon, Waltham, Mass.
9. P.Z. Onufryk, "Euphony: A Signal Processor for ATM," EE Times, Jan. 20, 1997, pp. 54, 80.
10. PowerQuicc: Motorola MPC860 User Manual, Motorola, Austin, Tex.
11. HELIUM IC-000148 Preliminary Data Sheet, VIRATA, Cambridge, UK.
12. TS702 Advanced Communication Controller Data Book, T.sqware Inc., Santa Clara, Calif.

Charles D. Cranor is a senior technical staff member at AT&T Labs–Research in Florham Park, New Jersey. His interests include networking, operating systems, and computer architecture. Cranor received a bachelor's degree in electrical engineering from the University of Delaware and a master's and doctorate in computer science from Washington University in St. Louis, Missouri. He is a member of the IEEE, ACM, and USENIX, and a kernel developer for the open-source BSD operating systems projects.