The interface between a process running under an operating system and the world outside its memory space is the "system call", a request for service from the operating system. The usual approach taken in operating system design has been to provide distinct system calls to obtain service for each type of available local resource (Fig. 3).
If a network becomes available, system calls for network communication are added to the others already supported by the operating system. Some problems with this approach are the Dual Access and Dual Service Dichotomies discussed below. It is argued here that operating systems to be connected to a network (particularly a high speed local area network) should be based on a pure message-passing monitor (Fig. 4)
The title of this paper has at least two interpretations that are consistent with the intent of the author:
The basic approach taken here will be to describe the components of a single machine operating system being implemented at the Lawrence Livermore Laboratory (LLL). The presentation will be largely machine independent, however, and will include discussion of the integration of the described system into a network of similar and dissimilar systems.
Another important development at LLL that began about the time of the first LTSS was networking. It started with a local packet switched message network for sharing terminals and a star type local file transport network for sharing central file storage (e.g. the trillion bit IBM photodigital storage device). These early networks worked out so well that they eventually multiplied to include a Computer Hardcopy Output Recording network, a Remote Job Entry Terminal network, a data acquisition network, a tape farm, a high speed MASS storage network, and others. The entire interconnected facility has retained the name "Octopus" [12, 27] from its earliest days as a star topology.
Recent developments in high speed local networking [5, 11, 17] are making it easier to flexibly connect new high speed processors into a comprehensive support network like Octopus. This very ease of hardware interconnection, however, is forcing some rethinking of software interconnection issues to ensure that the software interconnects as easily as the hardware [26, 27].
Recently the network systems group at LLL has started down a significant new fork in the LTSS development path. The new version of LTSS is different enough from the existing versions that it has been variously called New LTSS or Network LTSS (NLTSS). Many of the reasons for the new development have little to do with networking. For example, NLTSS shares resources with capabilities [4, 8, 10, 16, 20, 21, 28]. This allows it to combine the relatively ad hoc sharing mechanisms of older LTSS versions into a more general facility providing principle-of-least-privilege domain protection. It is only the lowest level network related motivations for the NLTSS development, however, that we will consider here.
When a processor is added to a mature high speed local area network like Octopus, it needs very little in the way of peripherals [27]. For example, when a Cray-1 computer was recently added to Octopus, it came with only some disks, a console, and a high speed network interface. All of the other peripherals (terminals, mass storage, printers, film output, tape drives, card readers, etc.) are accessed through the network. The operating system on a machine like this faces two basic problems when it is connected to the network:
A. Giving its local processes access to the resources available elsewhere on the network.
B. Making its local resources available to the rest of the network.
Typical third generation operating systems have concerned themselves with supplying local processes access to local resources. They do this via operating system calls. There are system calls for reading and writing files (access to the storage resource), running processes (access to the processing resource), reading and writing tapes (access to a typical peripheral resource), etc. When networks came along, it was natural to supply access to the network resources by supporting system calls to send and receive data on the network (Fig. 3).
Unfortunately, however, this approach is fraught with difficulties for existing operating systems. Just supporting general network communication is not at all an easy task, especially for operating systems without a flexible interprocess communication facility. In fact, if flexible network communication system calls are added to an operating system, they provide a de facto interprocess communication mechanism (though usually with too much overhead for effective local use).
Even systems that are able to add flexible network communication calls create a dual access problem for their users (Fig. 5). For example, consider a user programming a utility to read and write magnetic tapes. If a tape drive is remote, it must be accessed via the network communication system calls. On the other hand, if the drive is local, it must be accessed directly via a tape access operating system call. Since any resource may be local or remote, users must always be prepared to access each resource in two possible ways.
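To make the dichotomy concrete, the following C sketch shows the two code paths such a tape utility is forced to carry. Every system call name here is invented for illustration; none is the actual LTSS interface.

```c
/* Hedged illustration of the dual access dichotomy.  Every system
 * call name here (tape_read, net_send, net_recv) is invented for
 * illustration; none is the actual LTSS interface.                */

extern int tape_read(int drive, char *buf, int len);  /* local    */
extern int net_send(int conn, char *req, int len);    /* remote   */
extern int net_recv(int conn, char *buf, int len);

/* The utility must carry two complete access paths for the same
 * logical operation, because the drive may be local or remote.    */
int read_tape_block(int drive_is_local, int id, char *buf, int len)
{
    char req[] = "READ";               /* encoded remote request   */

    if (drive_is_local)
        return tape_read(id, buf, len);     /* direct system call  */

    if (net_send(id, req, sizeof req) < 0)  /* ship the request    */
        return -1;                          /* over the network    */
    return net_recv(id, buf, len);          /* and await the data  */
}
```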
The problem of making local resources available to a network has proven difficult for existing operating systems. The usual approach is to have one or more "server" processes waiting for requests from the network (Fig. 6). These server processes then make local system calls to satisfy the remote requests and return results through the network. Examples of this type of server (though somewhat complicated by access control and translation issues) are the ARPA network file transfer and Telnet server programs [6, 7]. With this approach there are actually two service codes for each resource, one in the system monitor for local service and one in the server process for remote access.
The major network motivation for the New LTSS development is to solve problems A. and B. in future versions of LTSS in such a way as to avoid the dual access and dual service dichotomies. By doing so, NLTSS also reaps some consequential benefits such as user and server mobility, user extendibility, and others.
NLTSS provides only a single message system call (described in the next section). Figure 7 illustrates the view that an NLTSS process has of the world outside its memory space. Deciding how and where to deliver message data is the responsibility of the NLTSS message system and the rest of the distributed data delivery network.
NLTSS uses the opposite approach. Since all NLTSS resource requests are made and serviced with message exchanges, the message system is the only part of NLTSS that need distinguish between local and remote transfers (Fig. 9). Also, since the distinction made by the message system is independent of the message content, NLTSS eliminates the dual access dichotomy rather than just moving it away from the user as the RSEXEC and similar systems do.
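A minimal sketch of this single distinction follows, assuming an invented address structure and routine names; the text specifies only that the choice is made once, inside the message system, independent of message content.

```c
/* Minimal sketch of the one local/remote decision in NLTSS, made
 * inside the message system and independent of message content.
 * The address structure and routine names are assumptions.        */

struct net_address { int machine; int port; };

extern int  local_machine;                       /* this host      */
extern void match_local(struct net_address to);  /* match against
                                                    local receives */
extern void send_to_network(struct net_address to);

void route(struct net_address to)
{
    if (to.machine == local_machine)
        match_local(to);        /* local transfer: match and copy  */
    else
        send_to_network(to);    /* remote: hand to network drivers */
}
```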
There have been many operating system designs and implementations that supply all resource access through a uniform interprocess communication facility [1, 2, 3, 8, 10, 15, 16, 21, 24, 28]. These interprocess communication mechanisms generally do not extend readily into a network, however. For example, in a system that utilizes shared memory for communication, remote processes have difficulty communicating with processes that expect such direct memory access. Capability based systems generally experience difficulty extending the capability passing mechanism into the network [4, 8, 10, 16, 20, 21, 28].
NLTSS is certainly not the first pure message-passing system [1, 15, 24]. In fact, it is remarkably similar to a system proposed by Walden [24]. Any contributions that NLTSS has to make will come from the care that was given to exclude system overhead and yet still support full service access to local and remote resources through a uniform message-passing mechanism.
Since all resource access in NLTSS is provided through the message system, the message system interface is a key element in the system design. The major goal of the NLTSS message system interface design was to supply a simple, flexible communication facility with an average local system overhead comparable to the cost of copying the communicated data. To do this it was necessary to minimize the number of times that the message system must be called. Another important goal was to allow data transfers from processes directly to and from local peripherals without impacting the uniformity of the message system interface.
The NLTSS Buffer Table
- Link
- Action bits (Activate, Cancel, and Wait)
- Send/Receive bit
- Done bit
- Beginning (BOM) and end (EOM) of message bits
- Receive-From-Any and Receive-To-Any bits
- To and From network addresses
- Base and length buffer description
- Buffer offset pointer
- Status
The Buffer Table fields are used as follows:
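As a concrete illustration, the fields listed above might be declared roughly as follows in C. This is a hedged sketch only: the names, widths, and layout are assumptions, not the actual NLTSS declarations (which are written in the Model language).

```c
/* A hedged C sketch of a Buffer Table.  The field names follow
 * the list above; the types, widths, and layout are assumptions
 * for illustration, not the actual NLTSS declarations.            */

struct net_address { int machine; int port; };

struct buffer_table {
    struct buffer_table *link;      /* chains tables into the list
                                       handed to the message system */
    unsigned activate : 1;          /* action bits: start the       */
    unsigned cancel   : 1;          /* transfer, cancel it, or      */
    unsigned wait     : 1;          /* wait for a Done bit          */
    unsigned send     : 1;          /* send/receive bit             */
    unsigned done     : 1;          /* set on completion            */
    unsigned bom      : 1;          /* beginning-of-message bit     */
    unsigned eom      : 1;          /* end-of-message bit           */
    unsigned recv_from_any : 1;     /* match any "From" address     */
    unsigned recv_to_any   : 1;     /* match any "To" address       */
    struct net_address to, from;    /* network addresses            */
    char *base;                     /* base and length buffer       */
    long  length;                   /* description                  */
    long  offset;                   /* buffer offset pointer        */
    int   status;                   /* completion/error status      */
};

/* Plausibly, the single NLTSS message system call then takes a
 * linked list of these tables (signature assumed):                */
int msgsys(struct buffer_table *list);
```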
The paucity and simplicity of the NLTSS system calls allow its monitor to be quite small and simple (a distinct advantage at LLL where memory is always in short supply and security is an important consideration).
Essentially all that is in the NLTSS monitor is the message call handler and device drivers for directly attached hardware devices (Fig. 4). In the case of the CPU device, the driver contains the innermost parts of the scheduler (the so-called Alternator) and memory manager (that is, those parts that implement mechanism, not policy).
One property of the current NLTSS monitor implementations is that each device driver must share some resident memory with a hardware interface process for its device. For example, the storage driver must share some memory with the storage access process, and the alternator must share some memory with the process server. This situation is a little awkward on machines that don't have memory mapping hardware. On systems with only base and bounds memory protection, for example, it forces the lowest level device interface processes to be resident.
The file system illustrates several features of the NLTSS design and implementations.
The basic service provided by the file system is to allow processes to read and write data stored outside their memory spaces. The way in which a process gets access to a file involves the NLTSS capability protocol [26] and is beyond the scope of this paper. We will assume that the file server has been properly instructed to accept requests on a file from a specific network address. The trust that the servers have in the "From" address delivered by the message system is the basis for the higher-level NLTSS capability protection mechanisms [10, 14].
The simplest approach for a file server to take might be to respond to a message of the form "Read", portion description (To file server, From requesting process) with a message containing either "OK", data or "Error" (To requesting process, From file server).
Unfortunately, this approach would require that the file server be responsible for both storage allocation (primarily a policy matter) and storage access (a mechanism). Either that or the file server would have to flush all data transfers through itself on their way to or from a separate storage access process.
The mechanism implemented in NLTSS is pictured in Fig. 13. To read or write a file, a process must activate three Buffer Tables. For reading, it activates a send of the command to the file server, a receive for the returned status, and a separate receive for the data returned from the storage access process. For writing, it activates similar command and status Buffer Tables, but in place of a data receive, it activates a data send to the storage access process.
This example illustrates the importance of the linkage mechanism in the message system interface. In most systems a file access request requires only one system call. Through the linkage mechanism, NLTSS shares this property. In fact, in NLTSS a process can initiate and/or wait on an arbitrary number of other transfers at the same time. For example, when initiating a file request, it may be desirable to also send an alarm request (return a message after T units of time) and wait for either the file status message or the alarm response.
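For illustration, a read under this mechanism might chain its three Buffer Tables into a single message system call, reusing the assumed struct buffer_table and msgsys() declarations from the sketch above; the addresses, buffer sizes, and request encoding are likewise invented.

```c
/* Illustration only: the three Buffer Tables of an NLTSS file
 * read, chained into one msgsys() call.  Assumes the struct
 * buffer_table and msgsys() declarations sketched earlier; all
 * addresses and encodings are invented.                           */

extern struct net_address file_server;      /* holds the file      */
extern struct net_address storage_access;   /* moves the data      */
extern int msgsys(struct buffer_table *list);

static char read_cmd[64];      /* encoded "Read", portion desc.    */
static char status_buf[64];    /* "OK"/"Error" from file server    */
static char data_buf[4096];    /* data from storage access proc.   */

static struct buffer_table cmd, status, data;

void read_file_portion(void)
{
    /* 1. Send the "Read" command to the file server. */
    cmd.send = 1; cmd.activate = 1;
    cmd.to = file_server;
    cmd.base = read_cmd; cmd.length = sizeof read_cmd;
    cmd.link = &status;

    /* 2. Receive the returned status from the file server. */
    status.activate = 1; status.wait = 1;   /* wait on the status  */
    status.from = file_server;
    status.base = status_buf; status.length = sizeof status_buf;
    status.link = &data;

    /* 3. Receive the data, sent separately by the storage access
     *    process (not by the file server).                        */
    data.activate = 1;
    data.from = storage_access;
    data.base = data_buf; data.length = sizeof data_buf;
    data.link = 0;

    msgsys(&cmd);    /* one call activates all three transfers */
}
```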
When the file server gets a read or write request, it translates the logical file access request into one or more physical storage access requests that it sends to the storage access process. In this request it includes the network address for the data transfer (this was included in the original "Read" or "Write" request). Having received the storage access request, the access process can receive the written data and write it to storage or read the data from storage and send it to the reading process.
This mechanism works fine in the case where the requesting process and the storage access process are on separate machines (note that the file server can be on yet a third machine). In this case the data must be buffered as it is transferred to or from storage. In the case where the requesting process and the storage access process are on the same machine, however, it is possible to transfer the data directly to or from the memory space of the requesting process. In fact, many third generation operating systems perform this type of direct data transfer.
To be a competitive stand-alone operating system, NLTSS must also take advantage of this direct transfer opportunity. In our implementations, the mechanism to take advantage of direct I/O requires an addition to the message system.
There are two additional action bits available in the Buffer Tables of device access processes, IOLock and IOUnLock. If a device access process wants to attempt a direct data transfer, it sets the IOLock bit in its Buffer Table before activation. If the message system finds a match in a local process, instead of copying the data, it will lock the matching process in memory and return the Base address (absolute), Length and Offset of its buffer in the IOLocking Buffer Table. The device access process can then transfer the data directly to or from storage. The IOUnLock operation releases the lock on the requesting process's memory and updates the status of the formerly locked Buffer Table.
The most important aspect of this direct I/O mechanism is that it has no effect on the operation of the requesting process OR on that of the file server. Only the device access process (which already has to share resident memory to interact with its device driver) and the message system need be aware of the direct I/O mechanism.
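A hedged sketch of the write path from the device access process's point of view follows, assuming the earlier struct buffer_table is extended with the two extra action bits and that a storage_write() primitive exists.

```c
/* Hedged sketch of a device access process using the direct I/O
 * bits.  Assumes the earlier struct buffer_table gains iolock and
 * iounlock action bits; storage_write() is an invented primitive. */

extern int  msgsys(struct buffer_table *list);
extern void storage_write(char *abs_base, long length, long offset);

void write_direct(struct buffer_table *bt)
{
    bt->activate = 1;          /* receive the data being written    */
    bt->wait = 1;
    bt->iolock = 1;            /* ask for a memory lock, not a copy */
    msgsys(bt);

    /* If the message system matched a local process, that process
     * is now locked in memory and bt->base (absolute), bt->length,
     * and bt->offset describe its buffer directly.  (How a local
     * match is indicated is an assumption here.)                   */
    storage_write(bt->base, bt->length, bt->offset);

    bt->iounlock = 1;          /* release the lock and update the   */
    msgsys(bt);                /* formerly locked Buffer Table      */
}
```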
The example of an NLTSS semaphore [9, 10] server can be used to further illustrate the flexibility of the NLTSS message system. The basic idea of the semaphore server is to implement a logical semaphore resource to support the following operations:
1. "P" (lock): if the semaphore value is positive, decrement it and notify the requester immediately; otherwise add the requester to the notification list.
2. "V" (unlock): if the notification list is not empty, notify the first waiting process; otherwise increment the semaphore value.
Typically such a semaphore resource is used by several processes to coordinate exclusive access to a shared resource (a file for example). In this case, after the semaphore value is initialized to 1, each process sends a "P" request to the semaphore server to lock the resource and awaits notification before accessing it (note that the first such locking process will get an immediate notification). After accessing the resource, each process sends a "V" request to the semaphore server to unlock the resource.
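Under the same assumed declarations, a client's side of this locking discipline might look as follows (the request encodings and addresses are invented for illustration):

```c
/* Illustrative client side of the locking discipline: send "P",
 * wait for notification, access the resource, send "V".  Assumes
 * the earlier declarations; encodings and addresses are invented. */

extern struct net_address semaphore_server;
extern int  msgsys(struct buffer_table *list);
extern void access_shared_file(void);      /* the guarded work     */

void with_lock(void)
{
    static char p_req[] = "P", v_req[] = "V", note[8];
    struct buffer_table p = {0}, n = {0}, v = {0};

    /* Send the "P" request and wait for the notification in a
     * single message system call.                                 */
    p.send = 1; p.activate = 1;
    p.to = semaphore_server;
    p.base = p_req; p.length = sizeof p_req;
    p.link = &n;

    n.activate = 1; n.wait = 1;            /* block until notified */
    n.from = semaphore_server;
    n.base = note; n.length = sizeof note;

    msgsys(&p);

    access_shared_file();                  /* exclusive access     */

    /* Unlock: send the "V" request. */
    v.send = 1; v.activate = 1;
    v.to = semaphore_server;
    v.base = v_req; v.length = sizeof v_req;
    msgsys(&v);
}
```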
An NLTSS implementation of such a server might keep the value of the semaphore and a notification list for each supported semaphore. The server would at all times keep a linked list of Buffer Tables used for submission to the message system. This list would be initialized with some number (chosen to optimize performance) of receive Buffer Tables "To" the semaphore server and "From" any. These Buffer Tables would also have their activate and wait action bits turned on.
The semaphore server need only call the message system after making a complete scan of its receive Buffer Tables without finding any requests to process (i.e. any with Done bits on). Any Done receive requests can be processed as indicated above (1. and 2.). If a notification is to be sent, an appropriate send Buffer Table with only the Activate action bit on can be added to the Buffer Table list for the next message system call. These send Buffer Tables are removed from the list after every message system call.
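A hedged sketch of such a server loop follows, again reusing the assumed declarations above; NRECV, handle_request(), and rearm() are invented for illustration.

```c
/* Hedged sketch of the semaphore server loop described above:
 * scan the pre-activated receive Buffer Tables for Done requests,
 * queue notification sends, and call the message system only when
 * no processing remains.  Assumes the earlier struct buffer_table
 * and msgsys() declarations.                                      */

#define NRECV 8                        /* chosen to optimize
                                          performance              */

extern int  msgsys(struct buffer_table *list);
extern void handle_request(struct buffer_table *bt);  /* P or V;
                                  may queue a notification send    */
extern void rearm(struct buffer_table *bt);  /* reset Done, ready
                                                for the next recv  */

struct buffer_table recvs[NRECV];      /* receives "To" the server,
                                          "From" any; Activate and
                                          Wait action bits on      */
struct buffer_table *pending_sends;    /* notification sends with
                                          only Activate on         */

void server_loop(void)
{
    for (;;) {
        int found = 0;

        /* Scan all receive Buffer Tables for completed requests. */
        for (int i = 0; i < NRECV; i++)
            if (recvs[i].done) {
                handle_request(&recvs[i]);
                rearm(&recvs[i]);
                found = 1;
            }

        if (!found) {
            /* Nothing left to process: link the receives and any
             * queued notification sends into one list and make a
             * single message system call.                         */
            for (int i = 0; i < NRECV - 1; i++)
                recvs[i].link = &recvs[i + 1];
            recvs[NRECV - 1].link = pending_sends;
            msgsys(recvs);
            pending_sends = 0;  /* send tables are removed from the
                                   list after every call           */
        }
    }
}
```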
Processes may in general be waiting on some receive completions to supply more data, and for some send completions to free up more output buffer space. Even in this most general situation, however, they need only call the message system when they have no processing left to do.
This semaphore server example can be compared with that given in [10] to illustrate how the network operating system philosophy has evolved at LLL over the years. In earlier designs, for example, capabilities were handled only by the resident monitor. In the NLTSS implementations, the resident monitor handles only the communication and hardware multiplexing described here. Resource access in NLTSS is still managed by capabilities, but this matter is handled as a protocol between the users and servers [26]. The integrity of the capability access protection mechanism is built on the simpler data protection and address control implemented in the distributed network message system of which NLTSS can be a component [10, 14].
There are currently two versions of NLTSS running in an emulation mode, one on a CDC 7600 and one on a Cray-1. These fledgling implementations are being used to experiment with higher-level system protocols, to develop and debug libraries, etc. The systems will be made completely operational in this mode (except for device drivers) before being installed as the resident monitor on any machines.
The NLTSS monitor and most of the servers are being written in a language named Model [18, 19], a Pascal-based language with data type extension that was developed at the Los Alamos Scientific Laboratory. Model generates an intermediate language, U-Code (similar to Pascal's P-Code). We expect this feature to help somewhat in moving NLTSS from machine to machine.
An important issue facing NLTSS is compatibility with existing software. We expect little difficulty in supporting the type of requests available from most of the library support routines at LLL. Reading and writing files, terminal I/O, etc., pose no difficulty. The areas that cause the most compatibility problems are those library routines that deal with very system specific features of the existing LTSS systems.
For example, some existing software at LLL depends on a linear process chain structure supported by the LTSS system. Even though the NLTSS message system and capability-type process access protection are much more general, we do plan to implement a fairly elaborate service facility under NLTSS that mimics the linear LTSS process structure. It is hoped that the use of this type of software will gradually lessen as users become more familiar with the basic NLTSS services. In any case, since this mimicry is not part of the NLTSS monitor, its use causes no more performance degradation than that caused by running a brief additional user program.
Since NLTSS supplies all of its services through its message system, processes on machines that can communicate with the NLTSS machine can access NLTSS resources just as if they were local (except for performance). Also, since NLTSS allows its processes to communicate with other machines via the message system, any resources available on the network are available to NLTSS processes.
Resource sharing is somewhat complicated by problems at both the very low and very high end of the communication protocol scale. At the low end, there is the problem of mapping the NLTSS message exchange into whatever transport level protocol is available on the network (for example, what do you do with the X.25 qualifier bit?). This problem is somewhat eased at LLL by using an in-house protocol developed particularly to suit local network applications [13, 25].
At the high end of the protocol scale, there is the problem of service request-reply standards. The greatest difficulties involved in design of message standards for a pure message-passing service are those resulting from the domain restriction of the serving process(es). Access control and accounting are examples of mechanisms that require distributed coordination. Most third generation operating systems assume that they control the entire computing facility. This assumption is incorrect in a network like Octopus and creates some serious problems. For example, resources serviced on one machine can't be accessed from another, accounts may "run out" on one machine and not on another, etc. Discussion of the distributed mechanisms that NLTSS utilizes for services that require distributed control is beyond the scope of this paper. Some of these mechanisms are described in [26]. Additional details of the NLTSS message standards will be described in later publications.
Implementation of a pure message-passing operating system that efficiently utilizes the hardware resources available to it is a considerable technical challenge. It is a challenge that must be met, however, if the current software difficulties involved in interconnecting operating systems to networks are to be overcome. These software interconnection issues are particularly pressing in a mature high performance local network like the LLL Octopus network. It is hoped that the NLTSS development effort will further the state of the art in software network interconnections by giving birth to a viable message-passing operating system in the demanding environment of the Octopus network.