Distributed Network Manager Requirements

Distributed Network Manager (DNM)

Requirements Specification

Version 2.0

S. Ron Oliver, 25 June 1996

Copyright, All Rights Reserved. Permission is granted for use and dissemination of this material for strictly educational, not-for-profit purposes, as long as proper credit is noted. Constructive comments or suggestions are very much appreciated, preferably via email.

sroliver@calpoly.edu

0 Introduction

This document briefly describes the functional, operational, architectural, performance, and security requirements for a Network Management Application that is to be developed as a distributed program. This project was initiated by the author and students in Computer Science 405 (Computer Communications II), at California Polytechnic State University, in the Spring quarter of 1996. The level of granularity in the present version is rather high, due to constraints of time. Lessons learned during the Spring 1996 quarter are reflected herein. A sufficiently detailed specification will not likely take form prior to Version 4.0, possibly later.

0.1 Product Description

Contemporary networks have become very extensive and complex, in many ways. Except in certain isolated, geographically limited networks, there is very little automated network management capability available today. Almost all useful network management utilities are proprietary and function well on only a limited set of extant network technology. Several network management protocols have been defined and standardized. But the most widely used protocol, SNMP, is very limited and poorly implemented in many cases.

To address the rapidly increasing need for highly functional, automated network management, an application such as the one described here must be fully specified, implemented, and made readily available to all network administrators. This capability will go far beyond the limited data collection mode of SNMP, provide for useful statistical analysis, including long term trend analysis, support automatic network reconfiguration when appropriate, and assist in security management. The application will operate in secure mode, and will be distributable over a number of network nodes.

0.2 Distributed Components

This application is to be distributed to ensure high availability and to contain the overhead imposed on network nodes and links so it does not unacceptably degrade performance for the primary missions of those components. These capabilities are further described in Section 5.2. This application will be fully distributed. Among other things, this means that process components will be able to run on processors of different types and in different OS environments. Throughout this document the term 'process component' refers to a process that runs in a given host, and is a component of the Network Management Application. This term is chosen to distinguish these entities from algorithmic or computational processes that may be discussed in some places. At times, when the context permits, the term may be abbreviated to 'process' or 'component'. For performance efficiency and architectural simplicity, at most one process component will be resident on a given processor in the network. However, where appropriate, a process component may be multithreaded.

Another potential benefit of a fully distributed application is that some of the process components may be tailored to do specialized parts of the application. This permits a means of modularizing the application to keep the complexity of all process components low, hence enhancing the potential for high reliability.

It is required that the application be designed so that some process components provide a User Interface that permits extensive control and use of the application, while others provide only limited direct user interface capability. Full function user interfaces are intended to be available in secure facilities where access is restricted to trusted individuals. Limited function user interface process components will reside in processors where physical security may be limited. Access will be further secured via user authentication, as described in Section 1.5.

The set of distributed process components will be organized in an hierarchical fashion with at least one root component. It will be possible for there to be 2 or more root process components that cooperate to perform the full function of a single virtual root component. This may be necessary, for instance, to meet the performance requirement that no process component consume more than specified amounts of resources on any processor or link. The virtual root process component will deal with some number of supporting process components. The role of the latter is similar to that of agents in the distributed SNMP application. They collect data and perform specified functions at the nodes in which they reside, including control of some nodes and devices in which there are no component processes of the Network Management Application. A supporting process component, in addition to performing under the direction of the virtual root component, may serve as a surrogate root process. That is, it may interact with other component processes, not directly accessed by the virtual root, as though it were the root process component. The process components that support a surrogate root may, in turn, be surrogates to another set of process components, etc. In this manner the total network management function may be effectively partitioned to ensure performance containment requirements are met, while serving a network of unlimited size and complexity.

Throughout this document the terms 'active mode' and 'surrogate mode' will be used consistently with the notion just introduced. When a component process performs a specific function in active mode it ensures the function is fully executed by controlling that execution. This might include doing statistical computations, displaying information, causing network resources to be reconfigured, etc. When a component process performs a specific function in surrogate mode, it ensures the function is executed by passing sufficient information on to its surrogate or parent process component. In many cases a process component will be limited to surrogate mode for certain functions as part of the security measures to be taken.
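The distinction between active mode and surrogate mode can be illustrated with a minimal sketch. The class name ProcessComponent, the perform method, and the function names used here are hypothetical and are not part of this specification. The sketch assumes that a component with no local capability for a function hands the request to its parent.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class ProcessComponent:
    """Hypothetical model of one process component in the hierarchy."""
    name: str
    parent: Optional["ProcessComponent"] = field(default=None, repr=False)
    # Functions this component is allowed to execute itself (active mode).
    active_functions: Dict[str, Callable[[str], str]] = field(default_factory=dict)
    children: List["ProcessComponent"] = field(default_factory=list)

    def add_child(self, child: "ProcessComponent") -> None:
        child.parent = self
        self.children.append(child)

    def perform(self, function: str, argument: str) -> str:
        handler = self.active_functions.get(function)
        if handler is not None:
            # Active mode: this component controls the execution itself.
            return f"{self.name}: {handler(argument)}"
        if self.parent is None:
            raise LookupError(f"no component can perform {function!r}")
        # Surrogate mode: pass the request on to the parent component.
        return self.parent.perform(function, argument)


root = ProcessComponent("root", active_functions={"display": lambda text: f"displayed {text}"})
surrogate_root = ProcessComponent("surrogate")
leaf = ProcessComponent("leaf", active_functions={"collect": lambda target: f"collected {target}"})
root.add_child(surrogate_root)
surrogate_root.add_child(leaf)
```

In this sketch the leaf collects data in active mode, while a display request issued at the leaf travels through the surrogate root and is executed by the root.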

0.3 Data Representation and Organization

At a minimum, the application will support the SNMP MIB II data structure, and other standardized SNMP MIB data structures that are widely in use. These capabilities will be extended as needed to represent data elements required to perform the expanded set of functions this application performs. Such extensions will be defined in full compliance with the ISO Abstract Syntax Notation (ASN.1) used to specify SNMP MIBs. Beyond the data types used in SNMP the full set of data types specified for SNMPv2 will be supported and used, where appropriate.

1 Component Interface Module

Each process component will support a standard interface to other components and possibly to an interactive user. Since a significant subset of commands to be specified for, and data to be retrieved from, one process component by another are identical to those specified / retrieved by the interactive user, both functions must be served by the same Module.

1.1 Privileged User Interface

Any process component that supports an interactive user interface (UI) will do so only for properly authorized privileged users. In particular, access to the process will require authentication beyond normal password access. The means of authentication are specified in more detail in Section 1.5. It shall not be permitted for any user to interact with a process component remotely. That is, only users who have direct physical access to the processor in which the process resides shall be given the option to be authenticated.

For early implementations the user interface shall be a simple, character-oriented interface. This requirement is levied to permit development efforts to focus primarily on the functional aspects of the application, and to avoid the trap of developing a GUI with no capability. Future versions will provide a fully functional GUI that permits the user to effectively use a mouse, and minimizes the need to type. All versions of the UI shall be based on an hierarchical set of menus and displays.

The user will be able to set parameters and options that govern the work being done by the application. Some of these may be application-wide; most will be specific to one or more of the process components.

Some options selected will cause one or more process components to collect information much in the way SNMP agents currently collect information. The UI shall be readily extensible to support data collection and configuration management functions beyond those supported by SNMP or SNMPv2. Some options will cause statistical computations to be done using the collected information. The user shall be able to specify when, where, and how collected or computed information is to be displayed or reported. The user may specify certain reports to take place according to an automated cycle. Some reports will be updated in real time. Some reports will be available on demand.

Certain functions of the application will cause things to happen automatically. In particular, for instance, node discovery, described in Section 3.1, will automatically determine the network configuration. Under some circumstances it shall be possible for the user to manually override or intervene with such automatic functions.

The UI shall provide an option for a set of commands to be applied to a set of 2 or more network components at the same time, to reduce the need for repetitive operator interactions. The privileged user may specify that commands to be issued are directed to 2 or more network components by identifying a list of target entities. Subsequent commands will apply to all entities in the list until the list is modified or deactivated.
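The target list behavior described above might look like the following minimal sketch. The class CommandSession and its method names are hypothetical. The send callable stands in for whatever mechanism actually delivers a command to one network component.

```python
from typing import Callable, Dict, List


class CommandSession:
    """Hypothetical UI session that applies each command to a target list."""

    def __init__(self, send: Callable[[str, str], str]) -> None:
        self._send = send                # delivers one command to one target
        self._targets: List[str] = []

    def set_targets(self, targets: List[str]) -> None:
        # Replaces the active list; later commands go to every entry.
        self._targets = list(targets)

    def clear_targets(self) -> None:
        # Deactivates the list.
        self._targets = []

    def issue(self, command: str) -> Dict[str, str]:
        if not self._targets:
            raise ValueError("no active target list")
        return {target: self._send(target, command) for target in self._targets}


session = CommandSession(lambda target, command: f"{command} -> {target}")
session.set_targets(["router-a", "router-b"])
results = session.issue("show status")
```

Each call to issue reaches every entity in the list until the list is modified or deactivated.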

The UI shall include a simple help feature to remind the user of all currently implemented and applicable commands, with a brief description of each.

1.2 Interface with Peer Application Components

All process components will interface with one or more other process components. In some cases this will entail receiving, from a parent component in the hierarchy, commands precisely like those the user may specify at the privileged process, or sending reports to another process much like the reports generated by the privileged process. All process-to-process interactions will use the OSF DCE/RPC Interprocess Communication capability to achieve such interaction. However, the architecture of process component interaction shall be message-passing rather than pure client-server interaction. In particular, this permits command processing to be identical in all process components, whether the commands come from an interactive user or from an upstream (parent) process component. This choice of architectures also facilitates user authentication.

All process-to-process interactions will be protected using a CRC-16 EDC capability. Most interactions will be relatively low volume data transfers. Some interactions, such as statistical reports, may become voluminous enough to represent an undesirably high burden on the available network link components. When a transmission threatens to cause a network link component to be overutilized, data compression shall be used to avoid exceeding the threshold. If available data compression techniques do not ensure thresholds will not be exceeded, the compressed transmission shall be partitioned and sent at delayed times to avoid exceeding the threshold. All transmissions shall be sent encrypted, using the method identified in Section 1.5.
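The compress, partition, and check sequence can be sketched as follows. This is an illustration under stated assumptions: the specification does not say which CRC-16 variant is intended, so the sketch uses the CCITT polynomial provided by Python's binascii.crc_hqx, and it uses zlib for compression. The function names and the framing layout (payload followed by a 2-byte check value) are hypothetical. Encryption and the delayed sending of parts are omitted.

```python
import binascii
import zlib
from typing import List


def crc16(data: bytes) -> int:
    # binascii.crc_hqx computes a 16-bit CRC with the CCITT polynomial.
    # Which CRC-16 variant DNM should use is an assumption here.
    return binascii.crc_hqx(data, 0)


def frame(payload: bytes) -> bytes:
    """Append a 2-byte big-endian CRC to the payload."""
    return payload + crc16(payload).to_bytes(2, "big")


def unframe(message: bytes) -> bytes:
    """Verify the trailing CRC and return the payload."""
    payload, received = message[:-2], int.from_bytes(message[-2:], "big")
    if crc16(payload) != received:
        raise ValueError("CRC mismatch")
    return payload


def prepare(report: bytes, max_payload: int) -> List[bytes]:
    """Compress a report and split it into framed parts of bounded size."""
    compressed = zlib.compress(report)
    parts = [compressed[i:i + max_payload]
             for i in range(0, len(compressed), max_payload)]
    return [frame(part) for part in parts]


def reassemble(messages: List[bytes]) -> bytes:
    """Check every part, join the payloads, and decompress."""
    return zlib.decompress(b"".join(unframe(m) for m in messages))
```

A sender would call prepare with a payload limit derived from the link utilization threshold and transmit the parts at spaced intervals; the receiver calls reassemble once all parts have arrived.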

A process component will be specifically configured (via initialization and/or the privileged user interface) to interact only with specific peer components. It will accept commands from only one component (its parent) and will provide information only to specified components. All commands received shall be authenticated to ensure they were issued by the parent process component. A process component will be configured to accept data (reports) only from specific peer components. All data received shall be authenticated to ensure it was sent by a legal source process component. Details of authentication are discussed in Section 1.5.

1.3 SNMP Manager / Agent Interface

Much of the data collection to be done is already being done in some environments by SNMP Managers and Agents, or by components designed to respond to SNMP agents. To the greatest extent feasible, the Network Management Application shall interact with and take advantage of existing SNMP resources to collect information, rather than replicate those functions. The specific alternatives and means by which such interaction shall take place are detailed in a report entitled SNMP Software Design Document for the SNMP Manager / Agent Interface, by Scott Metzger. (The title of the current version of that document is somewhat misleading. A future version will be entitled Software Design for the DNM / SNMP Agent Interface.) Additional useful information may be found in The MIB-II, as it Relates to the Distributed Network Manager, by Joshua Lehan. These reports are two of the many technical documents provided with the full set of DNM documentation.

1.4 Route Control Agent Interface

Certain key functions to be automated will involve determining link costs more accurately than is normally done by existing Route Control software. To the greatest extent practical the Network Management Application shall interact with Route Control Agents to provide them the service of maintaining the best possible routing information, where 'best possible' means the most accurate information available to help routers select routes that avoid congestion and/or other network problems that might arise. The specific alternatives and means by which such interaction shall take place are further described in Section 4.2.2.

1.5 Security Management

It is critical that the Network Management Application be as secure as possible in its operations. It shall avoid being compromised by unauthorized persons. It shall secure network information from unauthorized access. And it shall detect and report attempted breaches of security whenever possible.

All transmissions between process components shall be protected using Public Key Encryption. No interactive user shall be granted access to a process that supports an interactive user, without being authenticated. The specific methods of authentication and encryption are described in SCAMP: Secure Cryptographic Authorization and Management Protocol, by Nathan Lawson.

In a future version of the Network Management Application it shall be possible to specify data monitoring and alarm notification requests that detect and indicate possible network trespassing. The mechanisms used to specify and handle security monitoring will be similar to those used to specify and handle data collection and statistical performance conditions that cause alarms, as specified in Chapter 4. The main difference will be that the privileged user will identify specific tell-tale network traffic conditions that cause alarm. These may be much different from simple statistical conditions. These mechanisms, when specified, will be documented in a supplement that has limited, controlled distribution.

2 Protocol Module(s)

When necessary to effectively accomplish the objectives of the Network Management Application, it will interact with other processes and capabilities on the network. These include SNMP Managers and Agents, Routers, and Route Control Processes. To effect these interactions, the Network Management Application shall use the appropriate protocol. The process components themselves shall use an abstract set of communication procedures that will be the same for all protocols to be supported. Linkage to procedures that implement the appropriate protocol correctly shall be done at process component build time. In this way, for instance, a process component might originally interact with an SNMP agent via SNMPv1, but later be upgraded to use SNMPv2. To the extent that the new features of SNMPv2 do not require a new or revised functional or procedural interface, the upgrade shall be possible without recompiling the process component; it need only be relinked. The Protocol Module, therefore, is a library or set of libraries. Protocols to be supported, at minimum, are SNMP, SNMPv2, CMIP, RMON, and routing protocols as determined during the prototypical stages of development. This module (or set of modules) shall be readily extensible to support new protocols as they emerge.
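The idea of an abstract set of communication procedures shared by all protocols can be sketched as below. Python has no link step, so this sketch only approximates build-time linkage by passing an implementation object to component code. The names ProtocolModule, InMemoryProtocol, and rename_node are hypothetical; sysName is an object from the MIB-II system group and is used purely as an example.

```python
from abc import ABC, abstractmethod
from typing import Dict, Tuple


class ProtocolModule(ABC):
    """Hypothetical abstract procedures, identical for every protocol."""

    @abstractmethod
    def get(self, address: str, element: str) -> str:
        """Read one data element from the device at the given address."""

    @abstractmethod
    def set(self, address: str, element: str, value: str) -> None:
        """Write one data element on the device at the given address."""


class InMemoryProtocol(ProtocolModule):
    """Stand-in implementation; a real module would speak SNMP, CMIP, etc."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, str], str] = {}

    def get(self, address: str, element: str) -> str:
        return self._store[(address, element)]

    def set(self, address: str, element: str, value: str) -> None:
        self._store[(address, element)] = value


def rename_node(protocol: ProtocolModule, address: str, name: str) -> str:
    """Component code depends only on the abstract interface."""
    protocol.set(address, "sysName", name)
    return protocol.get(address, "sysName")
```

Because rename_node never names a concrete protocol, swapping one implementation for another leaves the component code unchanged, which is the property the relinking requirement is after.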

For each specific protocol supported, security features provided by that protocol shall be used to the fullest extent possible, and to the extent applicable for the purposes of necessary interactions. The Protocol Module(s) shall provide a security submodule that will serve to determine the highest level of security available for a specific interaction, and ensure the interaction takes advantage of that security. This submodule shall provide generic encrypt / decrypt services so that a variety of types of encryption may be used, as dictated by specific protocols or protocol implementations, and the use of different encryption methods shall be transparent to the process components.

2.1 SNMP Support

The first versions of this application will support full interaction with SNMPv1, at a minimum. In particular, any correctly formatted MIB data structure may be imported and used to interact with devices and agents that support it. Such data elements may be presented to the privileged user for selection for relevant monitoring and management activities.

3 Configuration Control Module

The basis for managing any network is knowing its configuration. Thus, this module is in some senses the 'core' of the application. This chapter details the manner in which the application will learn and maintain up to date information about the network configuration.

3.1 Node / Component Discovery

By definition, every process component will have a limited sphere of control. That is, it will concentrate on all nodes / components within a specific set of detectable nodes / components, and ignore all others except when they may not be ignored to accomplish a specific function. Sphere of control might be defined by specifying a range or set of IP addresses. Alternatively, MAC addresses might be used. More complex methods of specifying sphere of control will be supported as a future enhancement.
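A sphere of control defined by IP address ranges could be represented as in the following minimal sketch, which uses Python's standard ipaddress module. The class name SphereOfControl and its methods are hypothetical, and the addresses shown are arbitrary examples.

```python
import ipaddress
from typing import Iterable, List


class SphereOfControl:
    """Hypothetical sphere of control given as IP networks or single hosts."""

    def __init__(self, networks: Iterable[str]) -> None:
        # A bare address such as "192.168.5.7" becomes a one-host network.
        self._networks = [ipaddress.ip_network(n) for n in networks]

    def contains(self, address: str) -> bool:
        ip = ipaddress.ip_address(address)
        return any(ip in network for network in self._networks)

    def filter(self, addresses: Iterable[str]) -> List[str]:
        """Keep only the detected addresses that this component should manage."""
        return [a for a in addresses if self.contains(a)]


sphere = SphereOfControl(["10.1.0.0/24", "192.168.5.7"])
managed = sphere.filter(["10.1.0.42", "10.2.0.1", "192.168.5.7"])
```

Node discovery would apply such a filter to every detected address and ignore the rest.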

Upon startup, every process component shall execute an algorithm to determine the number and types of nodes within its sphere of control, as well as the number and types of communication links that interconnect those nodes. For each such component key information will be learned as well, such as capacity. For some components and for some characteristics of others, discovering the information automatically may not be possible at the time of startup. It shall be possible for the details to be completed via manual intervention or via a pre-set configuration information file. Note that some important characteristics of many nodes in a network have to do with how that node routes traffic from itself to different destinations. That is, expected route 'costs' must be known. This information is generally known only to route control agents, and will thus need to be learned from them.

Information about the configuration of network components within the sphere of control of a process component will be retained in memory for ease and efficiency of access.

As an option, the privileged user may direct that the specific configuration discovered or otherwise known by a process component shall be saved in a file for later use. If it will save processing time, a process component may be configured to start with such a file and simply modify, delete, or add values during node discovery. In some cases this may require privileged user interaction.

As new nodes are added to the network within the sphere of control of a process component, these will eventually be discovered, as well, and added to the configuration.

A key function of each process component is to generate a report, called the network status report, documenting its current knowledge of network components within its sphere of control. At a minimum this report will be a readable, formatted ASCII report. As a future enhancement, graphical format report options will be specified and implemented. The network status report will include the status of each node/device in the sphere of control of the process component. Status for a given component includes an indication of whether or not it is controllable by this process component. Other status information identifies whether the node is inoperative, operating in a degraded mode, or operating well. The report also includes a summary of key component characteristics.

The privileged user may request a network status report at any time. A report may be requested for all known nodes/devices, or for a specific subset of nodes/devices, specified by address list, device type, or the value of some data element of the status report (e.g., all inoperative devices, etc.).
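A formatted ASCII status report with subset selection might be produced along these lines. The record layout, column widths, and names (NodeStatus, Condition, status_report) are assumptions made for illustration; the specification does not fix a report format.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable


class Condition(Enum):
    INOPERATIVE = "inoperative"
    DEGRADED = "degraded"
    OPERATING = "operating well"


@dataclass(frozen=True)
class NodeStatus:
    address: str
    device_type: str
    controllable: bool
    condition: Condition


def status_report(nodes: Iterable[NodeStatus],
                  select: Callable[[NodeStatus], bool] = lambda node: True) -> str:
    """Return a readable ASCII report for the nodes accepted by select."""
    lines = [f"{'ADDRESS':<16}{'TYPE':<10}{'CONTROLLABLE':<14}CONDITION"]
    for node in nodes:
        if select(node):
            controllable = "yes" if node.controllable else "no"
            lines.append(f"{node.address:<16}{node.device_type:<10}"
                         f"{controllable:<14}{node.condition.value}")
    return "\n".join(lines)


nodes = [
    NodeStatus("10.1.0.1", "router", True, Condition.OPERATING),
    NodeStatus("10.1.0.9", "host", False, Condition.INOPERATIVE),
]
# Report only inoperative devices, as in the example given in the text.
report = status_report(nodes, lambda n: n.condition is Condition.INOPERATIVE)
```

Selection by address list or device type follows the same pattern with a different select function.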

3.2 Configuration Change Notification

Over time, of course, the network configuration will change. Components will be added or removed intentionally. Components will become unavailable at times. To the greatest extent possible, a process component will be able to receive automatic notification from other nodes in its sphere of control concerning these changes. In some cases, the process component must be able to automatically detect these changes, as in failure of a node to respond for some specified length of time. In other cases, the information will have to be supplied via manual intervention. In all cases, if the configuration change is relevant only to displays or reports produced locally to the process component, the appropriate display / report database will simply be updated. If the configuration change is to a part of the network for which displays or reports are generated upstream from the process component, the change notification will be forwarded to all such upstream process components. As an option, unexpected changes of configuration may cause an alarm notice to be produced locally and/or at specified upstream locations. This function is further described in the next section.

3.3 ALARM Management

Under some circumstances specific events need to be known or dealt with in real time. The process component that detects such events will generate an alarm. (In the SNMP network management literature such notifications are called traps. The term alarm is more accurate and more commonly used for such event notifications; trap is a term that applies to a broader class of asynchronous events.) Some unexpected changes of configuration may be deemed important enough to cause an alarm. In some cases, even a planned configuration change (that is, one made intentionally by operations personnel) might generate an alarm. If a key node is going to be taken down, it may be important to notify other nodes so they may adjust their routing tables, etc. Operations personnel may or may not know about all such dependencies. Other alarms have to do with exceeding certain threshold values. Each process component, at a minimum, will monitor its own use of CPU and memory resources in its resident processor, and generate an alarm when those thresholds are exceeded. Similarly, it will monitor its utilization of communication links for performing its network management functions, to ensure it does not consume more than a preset percentage of that resource. It shall also be possible to request that alarms be generated when utilization of a link exceeds specified thresholds, or when error rates for a link reach a certain level. When a processing node (host, router, etc.) becomes unresponsive (excessive delay in responding) for a certain length of time, this may also be cause for alarm.

There shall be a mechanism to specify and enable/disable alarms. Specification identifies the component and resource being monitored, the threshold value or other event that causes alarm, and the action to be taken (handling) when an alarm occurs. This will normally be a two-step process. It may be desirable to specify an alarm condition, but not always have alarm detection enabled.

For an alarm that has been specified and enabled, it may be desirable to disable the alarm temporarily. It may also be desirable, at times, to edit or delete the alarm specification.
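The two-step specify and enable process, together with temporary disabling and deletion, can be sketched as follows. AlarmSpec, AlarmTable, and the field names are hypothetical, and the sketch covers only simple threshold alarms.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class AlarmSpec:
    component: str          # component being monitored
    resource: str           # resource of that component
    threshold: float        # value that causes the alarm when exceeded
    action: str             # handling to apply when the alarm occurs
    enabled: bool = False   # specifying an alarm does not enable it


class AlarmTable:
    """Hypothetical registry of alarm specifications."""

    def __init__(self) -> None:
        self._specs: Dict[str, AlarmSpec] = {}

    def specify(self, name: str, spec: AlarmSpec) -> None:
        self._specs[name] = spec      # also used to edit an existing alarm

    def enable(self, name: str) -> None:
        self._specs[name].enabled = True

    def disable(self, name: str) -> None:
        self._specs[name].enabled = False

    def delete(self, name: str) -> None:
        del self._specs[name]

    def triggered(self, name: str, observed: float) -> bool:
        spec = self._specs[name]
        return spec.enabled and observed > spec.threshold


table = AlarmTable()
table.specify("link-load", AlarmSpec("link-7", "utilization", 0.8, "notify"))
table.enable("link-load")
```

Keeping the enabled flag separate from the specification lets an alarm be switched off temporarily without losing its definition.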

Alarm specification, enabling, editing, disabling, and deletion shall be accomplished only by a privileged user, both for the sphere of control for the local process component and for downstream components. Some cases of alarm definition and enabling may be accomplished automatically at process startup time.

There are three levels of action to be taken when an alarm occurs. In some cases the responsible process component will simply record the fact that the alarm has occurred (increment a counter) and ignore it. Note that in some cases incrementing such a counter to exceed a preset value may generate yet another alarm.

The second level of alarm handling is to notify an operator and/or an upstream process component that the alarm has occurred. This may involve changing a display or textual report, or sending a message upstream. Note that the upstream process receiving this message may combine it with other information to detect a specific event and generate yet another alarm.

The third level of alarm handling is to notify and react. In addition to performing the notify function precisely as for level two handling, a specific action must be performed. This means executing a procedure that does more than generate an informative message. There are two types of action that might be taken. Some preset actions will be provided. For a simple point-to-point link, for instance, one preset action might be to reset (disconnect and reestablish connection) the link. As the development and deployment of the Network Management Application evolves over time, a set of preset actions will be clearly defined and documented in Attachment B.

An alternative to doing preset actions is to execute a specific user-supplied procedure or event handler. In this case the application process will simply execute the user-provided function. This option will provide a valuable means of developing and testing proposed preset actions, as well as handling very specialized circumstances.
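The three levels of alarm handling can be summarized in a short sketch. The names Handling and AlarmHandler are hypothetical. Counting occurrences at every level, not only at level one, is an assumption of this sketch. The action callable may be either a preset action or a user-supplied procedure.

```python
from enum import Enum
from typing import Callable, Optional


class Handling(Enum):
    RECORD = 1             # level one: count the occurrence and ignore it
    NOTIFY = 2             # level two: notify an operator and/or upstream
    NOTIFY_AND_REACT = 3   # level three: notify, then execute an action


class AlarmHandler:
    def __init__(self, level: Handling,
                 notify: Callable[[str], None],
                 action: Optional[Callable[[str], None]] = None) -> None:
        if level is Handling.NOTIFY_AND_REACT and action is None:
            raise ValueError("level three handling requires an action")
        self.level = level
        self.count = 0
        self._notify = notify
        self._action = action

    def handle(self, description: str) -> None:
        self.count += 1    # assumption: every level records the occurrence
        if self.level is Handling.RECORD:
            return
        self._notify(description)
        if self.level is Handling.NOTIFY_AND_REACT:
            self._action(description)


notices = []
resets = []
handler = AlarmHandler(Handling.NOTIFY_AND_REACT, notices.append, resets.append)
handler.handle("link-7 error rate exceeded")
```

Here the two lists stand in for an operator display and a link reset procedure; a deployment would pass real notification and action routines instead.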

At a minimum, each process component shall monitor its own resource consumption and raise an alarm when it exceeds the specified limit. This will include percentage of CPU at the host processor and total memory occupied by the process component. Memory occupied may grow, for instance, as the process component accumulates more and more information about the configuration of the network within its sphere of control, and/or statistics being monitored for some or all of the network components within that sphere of control. Some process components will be the primary monitoring component for specific network links used to interact with other process components. (Typically a parent plays this role for each child with which it interacts.) Network Management Application link utilization (as opposed to total link utilization) must also be monitored to ensure the Network Management Application does not consume an inordinate amount of capacity. Threshold values for CPU, memory, and link resource utilization by DNM components will be set via initialization configuration file or via the privileged user interface.

3.4 Other Privileged User Functions

The privileged user may receive a report identifying all controllable parameters for a specific device or set of devices. The report will include current values and range of permitted values.

The privileged user may explicitly cause a controlled device to shutdown, restart, or be reconfigured, to the extent such a device is controllable in these ways.

The privileged user, once authorized access, will be able to query and change passwords and other elements of security control. There will also be support to generate reports of all detected attempts to violate network security.

As a future enhancement, specific detail from the network status report will be exported in the .c3e file format for use with COMNET III (COMNET III is a trademark of CACI, Inc.). Also, specific statistical information characterizing network loads over a specified period of time will be provided in an electronic format that may be used to drive network simulations using COMNET III.

4 Data Collection and Reporting

As noted in Chapter 3, the basis of network management is information about the configuration of the network. Much of the important information has to do with the way the network is currently operating, and cannot be gathered via the node discovery process. Most of this dynamic information has to do with the volume of traffic and rate and type of errors detected. The Network Management Application will actively collect information about traffic and errors, maintain simple summary statistics, and do more involved statistical computations. The information so gathered will be reported on demand or periodically at specific intervals.

The data collection and reporting function entails as many as three activities: collect data, do computations, generate reports for presentation to some peripheral device. Raw data collection will always take place in active mode. Collection of data as inputs to a computation function may take place in active or surrogate mode. Presentation of reports may be in active or surrogate mode.

4.1 Simple Data Collection

This function will be very similar to that of SNMP Managers and Agents, as currently defined. The same types of statistics will be collected and retained for reporting. To the greatest extent possible the Network Management Application will avoid replicating the functionality provided by SNMP software and will take advantage of it, as previously specified. When necessary, the application will replicate the function of SNMP.

For most purposes, the precise information being collected will be specified via configuration file during the startup process. The privileged user will be able to manually revise the nature of simple information collection at any time.

If and when it is deemed appropriate, simple data elements will be added to the set of possible information to be gathered. This will entail extending the SNMP MIB data structures.

As a future enhancement, the Network Management Application will also collect data on the status and workload of intelligent devices, beyond the nature of their network traffic and queue lengths. This data might include CPU utilization, memory and disk utilization, or key information for network service applications, such as email servers, web servers, name servers, etc. For such services, statistics concerning application queue lengths and service delays might be of interest.

In general, there are a number of different types of devices of possible interest in managing a network. Each such device may have a different set of parameters of interest. An exhaustive list of devices and parameters would be extensive. In the early versions of the network management application, a relatively small set of information available for a relatively few common devices will be supported. As the application evolves additional monitoring capabilities will be implemented. For ease of access and understanding, details of specific data collection capabilities supported are documented in Attachment A. For each type of device or technology, a specific section of Attachment A describes the data to be monitored.

4.2 STATS Subsystem

Beyond the relatively 'raw' data collection capability specified for SNMP, the Network Management Application will regularly perform statistical computations. The specific computations to be performed by any process component will vary, depending on the functional charter of that component. In some cases a process component will perform all relevant (requested) computations using data collected in its sphere of control. In other cases a process component may perform some computations, but forward some collected or computed information to another component for additional computations to be performed.

4.2.1 Simple Statistics

It shall be possible to request simple statistical summaries for nodes, links, and/or ports of a node. Simple statistics include periodic utilization, throughput, and error rate computations. Counts or values per unit time and averages per set of unit times shall be computed. Time units may be seconds, minutes, hours, or days. The measurement period may be any number of a larger time unit, including days, weeks, or months. The peak time unit during the measurement period will also be identified and the associated quantity reported. When simple statistics collection is initiated, a start time and an optional stop time will be specified. When one measurement period has passed, a report is generated, counters are reset to 0, and the process begins again, until the stop time has passed. If a stop time is not specified, computations continue until the process is explicitly terminated by the privileged user.
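
As an illustration only, the per-period summary described above (total, average per time unit, and peak time unit) might be sketched as follows. The names PeriodSummary and summarize_period, and the sample packet counts, are assumptions made for this example and are not part of the specification.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class PeriodSummary:
    total: int        # sum of the counts over the measurement period
    average: float    # mean count per time unit
    peak_unit: int    # index of the time unit with the highest count
    peak_value: int   # count observed in that peak time unit


def summarize_period(counts_per_unit: Sequence[int]) -> PeriodSummary:
    """Summarize one measurement period from its per-time-unit counts."""
    if not counts_per_unit:
        raise ValueError("a measurement period needs at least one time unit")
    total = sum(counts_per_unit)
    # On a tie, max() keeps the earliest time unit with the peak value.
    peak_unit = max(range(len(counts_per_unit)),
                    key=counts_per_unit.__getitem__)
    return PeriodSummary(
        total=total,
        average=total / len(counts_per_unit),
        peak_unit=peak_unit,
        peak_value=counts_per_unit[peak_unit],
    )


# Hypothetical data: MAC level packets counted in six one-hour time units.
hourly_packets = [120, 340, 90, 410, 410, 60]
summary = summarize_period(hourly_packets)
```

After a report is produced from such a summary, the counters for the next measurement period would start again from 0, as required above.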

The focus of these statistics shall be on MAC level packets, at a minimum. As a future enhancement it shall be possible to extend the statistics computation features to higher-level protocols.

It shall also be possible to compute port and node availability statistics. This involves keeping track of the length of time a given port or node is unavailable from the start of the measurement period until 'now', or until the end of the period. An availability computation can be done on a periodic basis, say a week, and repeated every week until the specified stop time, or until explicitly terminated by the privileged user. Administrative down time shall be distinguished from unintentional down time. Raw availability is total up time over total time. Effective availability factors out administrative down time to focus on unintended down time. It is total up time over (total time - administrative down time).
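
The two availability ratios defined above can be written out directly. The following is a minimal sketch; the function names and the one-week figures are assumptions chosen for illustration.

```python
def raw_availability(up_time: float, total_time: float) -> float:
    """Raw availability: total up time over total time."""
    if total_time <= 0:
        raise ValueError("total_time must be positive")
    return up_time / total_time


def effective_availability(up_time: float, total_time: float,
                           admin_down_time: float) -> float:
    """Effective availability: up time over (total time - administrative down time)."""
    scheduled_time = total_time - admin_down_time
    if scheduled_time <= 0:
        raise ValueError("administrative down time covers the whole period")
    return up_time / scheduled_time


# Hypothetical one-week period, in hours: 4 hours of administrative down
# time and 2 hours of unintentional down time.
total_hours = 168.0
admin_down = 4.0
unintended_down = 2.0
up_hours = total_hours - admin_down - unintended_down  # 162 hours

raw = raw_availability(up_hours, total_hours)                         # 162 / 168
effective = effective_availability(up_hours, total_hours, admin_down) # 162 / 164
```

In this example the effective figure is higher than the raw figure, because the 4 hours of administrative down time are removed from the denominator.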

4.2.2 Link Cost Computation

Two process components may cooperate in establishing link cost computations, if requested to do so. This involves the two processes synchronizing their time clocks, and periodically measuring delay for transmissions between the two. The rate of measure can be varied, and the length of time over which measurements will be taken to determine a representative delay value can be specified in a manner similar to the periodicity and length of measurement period specifications for simple statistics, as described in Section 4.2.1. The 'cost' of a link may be specified to be a function of delay and/or transmission rate and/or error rate. Some preset cost formulae will be provided for selection. Alternatively, the application may execute a user-provided algorithm. The cost computation may be different for each link.
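
The following sketch shows one way preset cost formulae and a user-provided algorithm could share a common interface. This document does not define any cost formula; the formulae, names, and values below are hypothetical, and the use of a simple mean as the representative delay is an assumption.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, Sequence


@dataclass(frozen=True)
class LinkMeasurement:
    delay_ms: float     # representative delay over the measurement period
    rate_bps: float     # transmission rate of the link
    error_rate: float   # fraction of transmissions in error


# A cost formula maps one link's measurements to a single cost value.
CostFormula = Callable[[LinkMeasurement], float]


def representative_delay(samples_ms: Sequence[float]) -> float:
    """Assumed policy: the mean of the periodic delay measurements."""
    return mean(samples_ms)


def delay_only(m: LinkMeasurement) -> float:
    """Hypothetical preset: cost equals delay."""
    return m.delay_ms


def delay_and_errors(m: LinkMeasurement) -> float:
    """Hypothetical preset: delay inflated by the error rate."""
    return m.delay_ms * (1.0 + m.error_rate)


PRESET_FORMULAE: Dict[str, CostFormula] = {
    "delay_only": delay_only,
    "delay_and_errors": delay_and_errors,
}


def link_cost(m: LinkMeasurement, formula: CostFormula) -> float:
    """Apply the formula selected for this particular link."""
    return formula(m)


delay = representative_delay([10.0, 12.0, 14.0])
measurement = LinkMeasurement(delay_ms=delay, rate_bps=10_000_000.0,
                              error_rate=0.5)

preset_cost = link_cost(measurement, PRESET_FORMULAE["delay_and_errors"])
# A user-provided algorithm is simply another callable of the same shape.
user_cost = link_cost(measurement, lambda m: 1e9 / m.rate_bps)
```

Because each call names its own formula, the cost computation may differ from link to link, as stated above.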

This computation is provided primarily as a service to Route Control Agents. In future implementations a protocol, or modifications to some existing protocol, will be specified to permit Route Control Agents to interact directly with the Network Management Application to request and receive link cost computation updates. When this protocol is implemented authorization to honor such requests will still be under control of the privileged user. Manual authorization might be required in some circumstances, or the user may be able to pre-authorize this service. In early implementations the request for link cost evaluation will be made by the privileged user, and the results will be reported to the user. The user may then use the information to update tables used by the relevant Route Control Agents using currently available methods.

4.2.3 Trend Analysis

The simple statistics computation capability described in Section 4.2.1 represents a substantial improvement over the simpler data collection capabilities of SNMP Managers and Agents. However, for effective network planning and management, it is critical to be able to do long term trend analysis so that the need for costly or complex network upgrades can be predicted well in advance of the time when performance degrades to an unacceptable level without such upgrades.

The fundamental inputs to trend analysis are simply the data values computed for nodes, links, and ports, as described in Section 4.2.1. One important difference is that, as described, the privileged user may activate and deactivate, or change the nature of, simple statistics computations at will. To properly provide input for trend analysis, the data computed via the methods specified in Section 4.2.1 must be collected in a consistent manner over the period of time analyzed. Thus, if trend analysis is specified for a given statistic for a specific port or node, the appropriate computation functions should be automatically set up, and the privileged user's ability to cancel or deactivate those computations should be limited. At the very least, should the user attempt to deactivate or modify the base computations, a warning message should remind the user that doing so may compromise the validity of the previously requested trend analysis. After the warning message is displayed, the user should be given the option to cancel the deactivation or modification request.

It shall be possible to do trend analysis on utilization or throughput of a link, error rate at a port, or availability of a node or port. At a minimum, a trend analysis will result in a periodic report of the nature of the statistic whose trend is being monitored. The periodicity of the report may be specified by the privileged user. In addition, it shall be possible for the user to specify thresholds that will cause alarms. Such alarms might fire, for instance, if utilization reaches a certain level or the rate of utilization increase changes dramatically. A marked increase in error rate or decrease in availability might also trigger an alarm.
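
A minimal sketch of the two alarm conditions mentioned above follows: a statistic reaching a given level, and a sharp increase between consecutive periods. The function name, the threshold values, and the weekly utilization figures are assumptions for illustration; a real implementation would likely use a more robust trend estimate than a period-to-period difference.

```python
from typing import List, Sequence


def trend_alarms(values: Sequence[float], level_threshold: float,
                 jump_threshold: float) -> List[str]:
    """Return alarm messages for a series of periodic statistic values.

    An alarm fires when a value reaches level_threshold, or when the
    increase over the previous period reaches jump_threshold.
    """
    alarms: List[str] = []
    for i, value in enumerate(values):
        if value >= level_threshold:
            alarms.append(
                f"period {i}: level {value:.2f} reached "
                f"threshold {level_threshold:.2f}")
        if i > 0 and value - values[i - 1] >= jump_threshold:
            alarms.append(
                f"period {i}: increase of {value - values[i - 1]:.2f} "
                f"reached threshold {jump_threshold:.2f}")
    return alarms


# Hypothetical weekly link utilization, as a fraction of capacity.
weekly_utilization = [0.40, 0.42, 0.45, 0.70, 0.85]
alarms = trend_alarms(weekly_utilization, level_threshold=0.80,
                      jump_threshold=0.20)
```

With these figures, one alarm is raised for the jump in period 3 and one for the level reached in period 4.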

4.3 Reports

For any data collection, statistical computation, or trend analysis requested, a report will be generated at a specified rate or at specific times. The user may direct that such reports be made to a specific location. Usually such reports will only be made to physically secure locations, but it shall be possible to override this restriction. In general, a report will be delivered to a location other than where the relevant data is being collected or computed, just as computations may, in general, be done at a location other than where the relevant data is collected. When a report is generated but not presented to a local peripheral device, the process component producing the report is said to be reporting the results in surrogate mode. It will format the report appropriately and forward it toward the active reporting process component.

5 Distribution Control

As the Network Management Application process components initialize themselves, or as the application distributes its functions to different host processors, it is important for each process component to be cognizant of the identity and location of at least some of its peer component processes. This knowledge is critical to ensure that information flow and application reconfiguration take place smoothly, as well as to maintain application security.

5.1 Information Flow Management

A given process component will know at any given point in time exactly the peer components from which it may expect to receive information. It will also know which of those components may be providing directive information (indirectly from the privileged user) or collected, computed, or report information. This knowledge will be used, in part, to ensure security. It will also be used, in combination with the type of information received, and possibly based on some part of the information, to determine whether to use the information actively or pass it along, and to where. Similarly, each process component will maintain information on where it might forward information that it generates or receives but does not retain.

In some cases it will make sense for a given process component to have primary and secondary destinations for certain types of information. For example, if a specific display device is inoperative, a report might best be written to a file, rather than discarded. This capability will be implemented in a future version of the application.
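
Since this capability is deferred to a future version, the following is only a sketch of how primary and secondary destinations might be tried in order. The names deliver, DestinationUnavailable, and broken_display are hypothetical, and a list stands in for the file to which the report would be written.

```python
from typing import Callable, List, Sequence

# A destination is any callable that accepts a formatted report.
Destination = Callable[[str], None]


class DestinationUnavailable(Exception):
    """Raised by a destination that cannot accept a report right now."""


def deliver(report: str, destinations: Sequence[Destination]) -> int:
    """Try each destination in priority order.

    Returns the index of the destination that accepted the report.
    """
    for index, send in enumerate(destinations):
        try:
            send(report)
            return index
        except DestinationUnavailable:
            continue  # fall through to the next (secondary) destination
    raise DestinationUnavailable("no destination accepted the report")


def broken_display(report: str) -> None:
    """Stand-in for an inoperative display device."""
    raise DestinationUnavailable("display inoperative")


saved_reports: List[str] = []  # stand-in for a report file
used = deliver("weekly utilization report",
               [broken_display, saved_reports.append])
```

Here the primary destination fails, so the report is retained by the secondary destination rather than discarded.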

5.2 Network Management Application Configuration Change Control

As previously specified, the Network Management Application shall maintain high availability without imposing excessive processing or communication bandwidth consumption on operational components of the network. This shall be achieved, in part, via an explicit distributed architecture.

High availability is achieved by ensuring that for every operational component of the application a duplicate, equivalent component (or set of components) may be launched in a different computer system should the primary component, or the system in which it resides, become unavailable. In network segments where there is a peer node that has access to the same network components as a node in which a process component function resides, one backup process may be sufficient. This might be the case, for instance, in an Ethernet network segment. In other cases it may be necessary to have two or more backup processes to replicate the same management capability. Whenever a process component becomes unavailable, the parent or some peer process component will detect this situation and ensure the duplicate process component(s) go(es) into operation, inheriting the abandoned work load. It is required that every process component be backed up in this manner. This requirement shall be implemented in a future version of the application.

Controlled performance impact is achieved by providing target resource consumption thresholds for the network management processes. Threshold values will be a function of the capacity of each component, its operational mission, and the specific network management philosophy being implemented. In particular, the desired values may change from time to time. Thus, they must be dynamically provided by the privileged user, possibly via the initialization / configuration file. It is not appropriate for them to be specified in this document or "hardwired" in the application.

Each process will monitor its utilization of resources and will alert another (parent) process when a threshold is about to be exceeded. It will then cooperate with the parent process in spawning a helper process component in a different processor and/or using different communication link resources, depending on which resource (CPU, memory, or link bandwidth) is being over-utilized. In particular, some of the task assignments of the busy process component must be reassigned to the helper process. Spawning the helper and distributing the work load shall take place before the specified threshold is actually exceeded.
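
Because the helper must be in place before a threshold is actually exceeded, a process component needs some early-warning rule. The sketch below assumes one possible rule, alerting at a fixed fraction of each threshold. The margin parameter, the function name, and all numeric values are assumptions; actual thresholds are to be supplied by the privileged user, as stated in this section.

```python
from typing import Dict, List


def resources_needing_help(usage: Dict[str, float],
                           thresholds: Dict[str, float],
                           margin: float = 0.9) -> List[str]:
    """Return the resources whose usage has reached margin * threshold.

    margin < 1.0 makes the alert fire before the threshold itself is
    exceeded, leaving time to spawn a helper and reassign work.
    """
    return [name for name, limit in thresholds.items()
            if usage.get(name, 0.0) >= margin * limit]


# Hypothetical thresholds, as fractions of each resource's capacity.
thresholds = {"cpu": 0.50, "memory": 0.40, "link": 0.10}
current_usage = {"cpu": 0.47, "memory": 0.20, "link": 0.05}

over_utilized = resources_needing_help(current_usage, thresholds)
```

The returned resource names indicate which kind of helper placement is needed: a different processor for CPU or memory pressure, or different communication link resources for bandwidth pressure.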