ICE Project Descriptions: Summer 2000
1. Automated Synthesis of Embedded Multiprocessors

Unlike general-purpose multiprocessors, multiprocessor systems for embedded applications (such as cellular phones, videoconferencing systems, and radar devices) can be streamlined to support a specific set of high-level functions. Furthermore, issues of backward compatibility and fast compilation times are not of major concern because these systems are rarely, if ever, modified after production. This dramatically increases the design space that can be considered when implementing embedded computing systems.

Because the design space for an embedded application is vast and complex, the development of automated tools for system-level synthesis is of increasing importance. Such a tool takes as input a high-level language specification of an embedded application; a library of hardware components (such as different types of microprocessors, application-specific integrated circuits, memories, and buses); and a set of optimization constraints and objectives (for example, "find an implementation that minimizes the overall rate"). Given these inputs, the tool attempts to derive an efficient hardware architecture (a collection of processing components and their interconnections) and a mapping of the specified application onto this architecture.

This project will involve the design, implementation, and evaluation of algorithms for system-level synthesis. Students will gain experience designing and implementing complex software that involves the following concepts:

- Technology- and implementation-dependent modeling of application functionality using fundamentals of graph theory.
- Deterministic heuristics versus stochastic optimization techniques, such as genetic algorithms and simulated annealing, for exploring complex design spaces, as well as the interaction of these two optimization methodologies.
- Multi-objective (Pareto) optimization. In a design space involving implementation metrics such as throughput, power consumption, and cost, a Pareto point is a design that no other design surpasses in every dimension (see the sketch below).
- Modeling and simulation of embedded multiprocessor architectures.
- Hardware/software co-design.

Students will gain practical research experience related to the fields of electronic design automation (representative companies: Cadence and Synopsys), compiler technology (Hewlett-Packard, Microsoft), and digital signal processing (Texas Instruments, Motorola).
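For illustration, the Pareto-dominance test at the heart of multi-objective optimization can be sketched in a few lines of C. The metrics, their better/worse directions, and the sample design points below are assumptions made for this example only, not part of any particular synthesis tool:

```c
/* Illustrative Pareto-dominance test over three design metrics.
 * The metrics and their "smaller/larger is better" senses are assumptions. */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    double throughput;  /* larger is better  */
    double power;       /* smaller is better */
    double cost;        /* smaller is better */
} DesignPoint;

/* Design a dominates design b if a is no worse in every dimension
 * and strictly better in at least one. */
static bool dominates(const DesignPoint *a, const DesignPoint *b)
{
    bool no_worse = a->throughput >= b->throughput &&
                    a->power      <= b->power      &&
                    a->cost       <= b->cost;
    bool better   = a->throughput >  b->throughput ||
                    a->power      <  b->power      ||
                    a->cost       <  b->cost;
    return no_worse && better;
}

int main(void)
{
    DesignPoint candidates[] = {
        { 100.0, 2.0, 50.0 },
        {  90.0, 1.5, 40.0 },
        {  80.0, 2.5, 60.0 },   /* dominated by the two designs above */
    };
    size_t n = sizeof(candidates) / sizeof(candidates[0]);

    /* A candidate is a Pareto point if no other candidate dominates it. */
    for (size_t i = 0; i < n; i++) {
        bool dominated = false;
        for (size_t j = 0; j < n; j++)
            if (i != j && dominates(&candidates[j], &candidates[i]))
                dominated = true;
        printf("design %zu: %s\n", i, dominated ? "dominated" : "Pareto point");
    }
    return 0;
}
```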
2. Building Experimental Testbeds for Embedded Systems

Experimental testbeds are used to go beyond simulation, to test new design methodologies and algorithms on real applications. The building phase is required to truly appreciate the intricate details that make the difference between an embedded system that works reliably and one with sporadic errors that can cause catastrophic failure. UMD has already had success in building such testbeds; for example, the computer-controlled electric train testbed was developed by three undergraduate research assistants in the SERTS Laboratory, and the project received an honorable mention (top-ten finish) in the 1998 Motorola University Design Competition. The testbed is designed to experiment with the detection and handling of errors in embedded real-time systems. Similar testbeds need to be built for other graduate and faculty research in the department; undergraduate researchers can contribute by providing the manpower needed to create these testbeds, while gaining hands-on experience in building a real system.
3. Characterization of Control Independence in Programs

Many studies have shown that significant amounts of parallelism can be extracted from ordinary programs if a processor can accurately look ahead arbitrarily far into the dynamic instruction stream. Control-flow changes caused by conditional branches are a major impediment to determining which of the distant instructions belong to the dynamic instruction stream. This project investigates the use of control-independence information for extracting this "distant parallelism". Earlier studies demonstrated that utilizing control independence is a viable means to extract distant parallelism. The primary objective of this project is to perform a detailed characterization of control independence, in terms of granularity and the available parallelism at each granularity level. This characterization is a first step in determining which situations require special attention. Once it is complete, the knowledge can be applied to the development of specialized hardware architectures that exploit control independence at different granularity levels.
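The idea can be illustrated with a small, hypothetical code fragment (the variables and the branch are invented for this example): instructions beyond the point where the two paths of a branch reconverge execute regardless of the branch's outcome, so they are control-independent of it and are candidates for early fetch and execution.

```c
/* Hypothetical fragment illustrating control independence.
 * The statements inside the if/else are control-DEPENDENT on the branch:
 * whether they execute depends on the branch outcome.
 * The statements after the reconvergence point execute on either path,
 * so they are control-INDEPENDENT of the branch and could, in principle,
 * be fetched and executed before the branch resolves (subject to data
 * dependences, such as the one on y). */
int example(int x, int a, int b)
{
    int y, z;

    if (x > 0)          /* hard-to-predict branch          */
        y = a + 1;      /* control-dependent on the branch */
    else
        y = b - 1;      /* control-dependent on the branch */

    /* reconvergence point: both paths merge here */
    z = a * b;          /* control-independent and data-independent of the branch */
    return y + z;       /* control-independent, but data-dependent on y           */
}
```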
4. Design of Dynamically Reconfigurable FPGAs Using Neuron-MOS Technology

The technology of Field Programmable Gate Arrays (FPGAs) has advanced dramatically, and the "system on a programmable chip" is becoming a reality. Reconfigurability has been pushed to the point where run-time reconfigurable hardware systems are also becoming a reality. These new systems are based on so-called multi-context FPGAs. For example, NEC of Japan has successfully produced a CMOS chip, called the Dynamically Reconfigurable Logic Element, which has eight layers of SRAM to store eight programming contexts. A more advanced dynamically reconfigurable device is being developed at NTT Communication Science Laboratories in Japan, where Prof. Nakajima was a very active participant in the project during his sabbatical leave in the 1998-99 academic year. The device is based on neuron-MOS, a CMOS device with floating gates. Prof. Nakajima and his colleagues at NTT have worked closely with Prof. T. Shibata of the University of Tokyo, who invented the neuron-MOS device. They have shown that neuron-MOS can realize many functions that are well suited for dynamically reconfigurable digital systems. Two students from the University of Maryland spent the summer of 1999 at the NTT Labs under the supervision of Prof. Nakajima and initiated full-custom VLSI design work for the new device. Much work remains before the new device can be introduced to the market.

The objective of the project is to develop a new FPGA device that can reconfigure itself dynamically. To show that this is a marketable device, we will make a thorough comparison of functionality, area, speed, and power between the new device and conventional CMOS devices. Subprojects suitable for undergraduate activities will involve (1) full-custom VLSI design and circuit simulation of this new chip and, for comparison purposes, of conventional CMOS VLSI chips, and (2) architectural- and logic-level studies of digital systems to be built from this new device. By participating in this project, undergraduate students will have a rare opportunity to appreciate the importance of a team-based approach to developing a new device, which requires knowledge of device physics, analog and digital electronic circuits, digital systems, and VLSI, together with skill in using CAD tools such as MAGIC and HSPICE.
5. Dynamic Memory Management in Embedded Real-Time Systems

Memory management has recently made the transition from general-purpose systems to embedded systems, in part to facilitate the rapid development of embedded applications. It is playing an increasingly significant role in embedded systems as more designers take advantage of low-overhead embedded operating systems that provide virtual memory (for example, Windows CE or Inferno), and as more designers choose object-oriented software platforms in which run-time garbage collection is pervasive (for example, Sun's Java Virtual Machine or Hewlett-Packard's runtime environment). However, the MMUs in today's embedded processors are virtually identical to those in high-performance processors, despite the fact that embedded systems have significantly different goals. Most embedded processors have either a full MMU or none at all; Windows CE compliance requires a full MMU. A few exceptions exist, such as the rudimentary MMUs of the ARM740T and ARM940T, which are simple protection units rather than full address-translation units: they support some but not all of the features of virtual memory. That design is worth exploring but cannot be used to support Windows CE.

This project explores the design space for embedded-system memory management and characterizes the issues on both the hardware and software sides of the interface (Jacob & Mudge, 1997; Jacob & Mudge, 1998). We are also developing a combined hardware-software approach to real-time memory management that achieves the following goals: (1) the performance of the memory-management software is deterministic and lends itself to simple timing analysis; (2) the memory-management code is extremely small; and (3) the memory-management hardware is smaller and less power-hungry than in present designs (MMUs often use structures that are relatively large and consume considerable power).
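As a hedged illustration of goal (1), the sketch below shows the kind of fixed-size block pool allocator commonly used when deterministic timing matters: allocation and deallocation are each a handful of pointer operations, so their worst-case execution time is easy to bound. The block and pool sizes are arbitrary assumptions, the code ignores concurrency, and it is not the project's actual design.

```c
/* Minimal fixed-size block pool allocator: O(1) allocate and free, so
 * worst-case timing is trivial to analyze.  Sizes are illustrative, and a
 * real RTOS version would guard the free list with interrupt masking. */
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE  64          /* bytes per block (assumption)    */
#define NUM_BLOCKS  128         /* blocks in the pool (assumption) */

typedef union block {
    union block *next;               /* when free: link to next free block */
    uint8_t      data[BLOCK_SIZE];   /* when allocated: user payload       */
} block_t;

static block_t  pool[NUM_BLOCKS];
static block_t *free_list = NULL;

/* Thread all blocks onto a singly linked free list.  Call once at startup. */
void pool_init(void)
{
    free_list = NULL;
    for (int i = NUM_BLOCKS - 1; i >= 0; i--) {
        pool[i].next = free_list;
        free_list = &pool[i];
    }
}

/* Constant-time allocation: pop the head of the free list. */
void *pool_alloc(void)
{
    block_t *b = free_list;
    if (b != NULL)
        free_list = b->next;
    return b;                        /* NULL if the pool is exhausted */
}

/* Constant-time deallocation: push the block back onto the free list. */
void pool_free(void *p)
{
    block_t *b = p;
    b->next = free_list;
    free_list = b;
}
```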
6. Embedded DRAM Organizations

The growing gap between memory access time and processor speed has led processor architects, DRAM architects, and memory-system designers to rely heavily on high-performance mechanisms such as lockup-free caches, out-of-order execution, hardware and software prefetching, and multithreading. These mechanisms are quite effective at reducing, hiding, or tolerating large memory latencies; however, they do so at the expense of exacerbating the memory bandwidth problem (Burger, 1996). One trend that is helping to solve the bandwidth problem is the development of new DRAM architectures, such as Synchronous DRAM, Enhanced Synchronous DRAM, Synchronous Link, Virtual Channel, and Rambus. All of these architectures are improvements over the traditional DRAM architecture; our studies show that the newest members of the set reduce bandwidth overhead by a factor of four compared to the oldest members (Cuppu, 1999). Another trend that can help solve the bandwidth problem is the use of embedded DRAM-processor organizations that incorporate the DRAM array onto the same die as the processor core (Kozyrakis, 1997; Sase, 1997; Nunomura, 1997). This provides several benefits, including a wider memory bus, a faster memory bus, and drastically reduced energy consumption (Fromm, 1997), due largely to the reduced number of off-chip memory requests.

This project investigates future issues in memory-system design, processor organization, and execution models. To date, we have performed a thorough performance evaluation of DRAM architectures (Cuppu, 1999) and are currently investigating their real-time behavior. The embedded-DRAM organization, much like its sibling the system-on-a-chip, is well positioned to serve as a foundation for a host of microprocessor-based execution models that can exploit tremendous memory bandwidth. Likely models include vector processing, single-chip parallel processing, and DSP; we have been investigating the appropriateness of DSP.
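A back-of-the-envelope calculation shows why a wider, faster on-chip bus matters; the widths and clock rates below are illustrative assumptions, not figures from the cited studies.

```c
/* Illustrative peak-bandwidth arithmetic; all parameters are assumptions. */
#include <stdio.h>

int main(void)
{
    /* Conventional off-chip bus: 64 bits wide at 100 MHz. */
    double off_chip = (64.0 / 8.0) * 100e6;       /* bytes per second */

    /* Hypothetical embedded-DRAM bus: 256 bits wide at 200 MHz. */
    double on_chip  = (256.0 / 8.0) * 200e6;

    printf("off-chip peak: %.1f MB/s\n", off_chip / 1e6);   /*  800.0 MB/s */
    printf("on-chip  peak: %.1f MB/s\n", on_chip  / 1e6);   /* 6400.0 MB/s */
    printf("ratio: %.1fx\n", on_chip / off_chip);            /*    8.0x    */
    return 0;
}
```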
7. Hardware/Software Co-design of Device Drivers and I/O Hardware

As part of a grant in the NSF Experimental Software Systems program, UMD is investigating hardware/software co-design of a real-time operating system and a microcontroller architecture. One of our discoveries is that a primary difficulty in creating device drivers for embedded systems is the poor hardware/software interface to the I/O. We have already begun designing and building new I/O circuitry that, when combined with object-based device drivers, allows I/O devices to be interchanged while the same software drivers are reused. Last summer, undergraduate students in the NSF-sponsored Research Internships in Telecommunications used this new method to develop interchangeable RF and IR devices for wireless communication between sensors and actuators. Similar work is needed for many other I/O devices used in embedded systems.
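One common way to realize "object-based device drivers" is a table of function pointers behind a fixed interface; the sketch below is a hypothetical example of the idea (all names and operations are invented for illustration and are not the project's actual driver API). Application code binds to the abstract operations, so an RF transport can be swapped for an IR one without changing the application.

```c
/* Hypothetical object-based driver interface: the application binds to the
 * abstract operations, so RF and IR transports can be interchanged without
 * touching application code.  All names are illustrative. */
#include <stddef.h>
#include <stdio.h>

typedef struct comm_driver {
    int (*init)(void);
    int (*send)(const void *buf, size_t len);
    int (*recv)(void *buf, size_t len);
} comm_driver;

/* --- One concrete "device": a stand-in for an RF transceiver ------------ */
static int rf_init(void)                        { puts("rf: init"); return 0; }
static int rf_send(const void *buf, size_t len) { (void)buf; printf("rf: sent %zu bytes\n", len); return (int)len; }
static int rf_recv(void *buf, size_t len)       { (void)buf; (void)len; return 0; }

static const comm_driver rf_driver = { rf_init, rf_send, rf_recv };

/* --- Application code: written once against the abstract interface ------ */
static int send_reading(const comm_driver *dev, int reading)
{
    return dev->send(&reading, sizeof reading);
}

int main(void)
{
    const comm_driver *dev = &rf_driver;   /* could equally point at an IR driver */
    dev->init();
    send_reading(dev, 42);
    return 0;
}
```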
8. Memory System Support for Pointer-Based Applications

The performance of commercial microprocessors continues to improve at a staggering pace. However, our ability to feed these processors with data fast enough to keep them constantly busy is falling far behind. In the time it takes industry to double the performance of processors, memory-system performance improves by only 7%. Consequently, application performance becomes increasingly limited by the memory system. The Vortex project at the University of Maryland is investigating architectural support to address the memory performance bottleneck. A major thrust of the project is to provide support for pointer-based applications: programs that make heavy use of dynamic data structures such as linked lists and trees. Pointer-based applications frequently perform pointer-dereferencing operations to traverse the linked elements within a large dynamic data structure. Also known as pointer chasing, such dereferencing operations often lead to poor memory performance because they give rise to memory access patterns that lack both temporal and spatial locality, two crucial properties for high performance on conventional memory systems.

In the Vortex project, several architectural techniques are under development to increase the memory performance of pointer-based applications. First, novel prefetching techniques aggressively schedule prefetch requests to provide higher levels of memory-latency tolerance for pointer-chasing loads. Second, support in the memory controller enables software control of the memory fetch size to increase the effective memory bandwidth of sparse memory accesses. Third, support for application-controlled data movement is provided to efficiently support streaming data access patterns. Finally, in addition to these architectural techniques, the Vortex project strives to understand the nature of the memory access patterns in pointer-based applications from several important application domains, including databases, search engines, sparse-matrix codes, and compression algorithms.
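The pointer-chasing problem, and the limits of a purely software response to it, can be sketched briefly. The example below uses __builtin_prefetch, the GCC/Clang intrinsic for issuing a non-binding prefetch hint; it illustrates the problem rather than the project's own (architectural) techniques.

```c
/* Linked-list traversal: each load of n->next depends on the previous one,
 * so accesses have little spatial or temporal locality. */
#include <stddef.h>

struct node {
    struct node *next;
    int          key;
};

long sum_keys(const struct node *n)
{
    long sum = 0;
    while (n != NULL) {
        /* Greedy software prefetch: hint the cache to start fetching the next
         * node while this one is processed.  The lead time is limited to the
         * work done on the current node, which is one reason more aggressive
         * prefetch scheduling (as investigated in Vortex) is attractive. */
        __builtin_prefetch(n->next, 0 /* read */, 1 /* low temporal locality */);
        sum += n->key;          /* work on the current node   */
        n = n->next;            /* the pointer-chasing load   */
    }
    return sum;
}
```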
9. Monitoring and Analysis of Real-Time Software

Research advances in real-time scheduling have filled the literature with theoretical methods and algorithms that can handle almost every situation that may arise during the design of a real-time system. In practice, however, few of these results are in use. The primary obstacle is that the theory relies on hypothetical examples that often are not indicative of the complexity of a real system. Undergraduate researchers can monitor and analyze the inner details of real-time software to provide data collections that give a better picture of what really happens inside an application. Such data is invaluable to graduate and faculty researchers in real-time operating systems, compiler design, and computer architecture.
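As a minimal flavor of such monitoring, assuming a POSIX-like environment (an embedded target would more likely read a hardware cycle counter or use a logic analyzer), one can timestamp each job of a task and record its measured execution time for later analysis:

```c
/* Minimal sketch of execution-time monitoring for one task, assuming a
 * POSIX clock is available; the "task" here is only a placeholder. */
#include <stdio.h>
#include <time.h>

static double now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

static void task_body(void)
{
    volatile long x = 0;                 /* stand-in for the real task's work */
    for (long i = 0; i < 100000; i++)
        x += i;
}

int main(void)
{
    for (int job = 0; job < 10; job++) {
        double start = now_us();
        task_body();
        double end = now_us();
        /* In a real system this record would go to a trace buffer, not stdout. */
        printf("job %d: execution time %.1f us\n", job, end - start);
    }
    return 0;
}
```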
10. Plastic Cell Architecture: A New Paradigm for Dynamically Reconfigurable Hardware Systems for Network-Oriented Computing

The most important feature of the advancement of high-level programming languages, from Fortran to C, C++, Java, and Jini, is the power to delay decisions about introducing data structures and functions until the program executes. With the two-level structure of a CPU and memory, a great many functions have been implemented successfully as software programs on von Neumann computers. The ultimate goal of the Plastic Cell Architecture (PCA) is to construct dynamically reconfigurable hardware that performs the same functions that the software system delivers. Invented at NTT Research Laboratories in Japan two years ago, the PCA is expected to play a major role in computer networks and cellular phone networks in the next century. Its dynamic reconfigurability is well suited to software-radio applications on cellular phone networks, as well as to handling load increases in devices such as servers and routers on computer networks. The PCA consists of a two-dimensional array of Plastic Cells (PCs), with each cell containing a built-in part (BP) and a plastic part (PP).

At the University of Maryland, we have developed alternative architectures for the PP and designed and fabricated their corresponding full-custom VLSI chips. During Prof. Nakajima's sabbatical leave at NTT Communication Science Laboratories in Japan in the 1998-99 academic year, one of his Ph.D. students spent seven months at the same NTT Lab and initiated the development of a software package to evaluate PP and PC architectures. During the summer of 1999, an undergraduate student from Virginia Tech worked at the same NTT Lab and added one software module to the package.

The objective of the project is to develop (1) a full-custom VLSI chip for the PCA and (2) a set of CAD tools to implement dynamically reconfigurable computing systems on a PCA chip. Subprojects suitable for undergraduate research activities will involve (1) HDL-based design and full-custom VLSI design of the BP, (2) architectural design and full-custom VLSI design of the PP, (3) development of an improved architecture evaluator for the PP and PC, and (4) development of a simulator for the PCA. By participating in this project, undergraduate students will appreciate the importance of a team-based approach to developing a new computing hardware system from the transistor level to the architectural level, as well as the integration of theory, hardware, and software into a complete system development.
12. Techniques for Minimizing Code Size in Compilation Targeting Embedded Systems

Embedded systems are the class of application-specific computer systems used as controllers and monitors in a variety of consumer and business applications. They are ubiquitous today in household appliances, consumer electronics, communication systems, remote sensing, and vehicle control. While many similarities exist between general-purpose computer systems and embedded systems, many of the design criteria differ. For embedded systems, low cost, low power, and small code size are often far more important than performance at any cost.

An interesting project in this space is compilation of high-level code targeting embedded systems, with the objective of minimizing code size. Small code size is desirable when the entire machine-code program is stored on-chip, contributing to low silicon area and power dissipation. Of course, the compiler must simultaneously optimize for the best performance possible at that code size. Note that code size is increased by several compiler transformations commonly employed to improve performance, such as loop unrolling and procedure inlining. Given a certain code-size budget, an interesting question is where these code-size-increasing transformations can be employed most profitably while keeping the code size within budget. Such research would likely profit from profiling information coupled with intelligent heuristics. The work will involve implementing the compiler algorithms and simulating the results. Evaluation would compare performance with both unoptimized code and code optimized without regard to code size. Time permitting, a comparison with hand-optimized code will also be done.