OpenSM Release Notes 3.2 ============================= Version: OpenSM 3.2.x Repo: git://git.openfabrics.org/~sashak/management.git Date: Dec 2008 1 Overview ---------- This document describes the contents of the OpenSM 3.2 release. OpenSM is an InfiniBand compliant Subnet Manager and Administration, and runs on top of OpenIB. The OpenSM version for this release is opensm-3.2.5 This document includes the following sections: 1 This Overview section (describing new features and software dependencies) 2 Known Issues And Limitations 3 Unsupported IB compliance statements 4 Bug Fixes 5 Main Verification Flows 6 Qualified Software Stacks and Devices 1.1 Major New Features * Cached Routing OpenSM provides an optional unicast routing cache (enabled by '-A' or '--ucast_cache' options). When enabled, unicast routing cache prevents routing recalculation (which is a heavy task in a large cluster) when there was no topology change detected during the heavy sweep, or when the topology change does not require new routing calculation, e.g. when one or more CAs/RTRs/leaf switches going down, or one or more of these nodes coming back after being down. * Routing Chaining Routing chaining is the ability to configure the order in which routing algorithms are applied in opensm, i.e. '-R ftree,updn,minhop' - try using ftree routing. If ftree fails, try updn. If updn fails, try minhop. * IPv6 Solicited Node Multicast addresses consolidation When this mode is used (enabled with --consolidate_ipv6_snm_req option) OpenSM will map all IPv6 Solicited Node Multicast address join requests into a single Multicast group with address ff10:601b::1:ff00:0. In this way limited MLID space is saved. This IBA noncompliant feature is very useful with large (~> 1024 nodes) clusters. * OpenSM sweep state machine rework Huge and buggy OpenSM sweep state machine was fully rewritten in safer and more effective synchronous manner. * Multi lid routing balancing for updn/minhop routing algorithms When LMC > 0 is used OpenSM will ensure to generate routing paths via different switches and when possible chassis. * Preserve base lid routes when LMC > 0 When LMC > 0 is used OpenSM will preserve routing paths for base lids as it would be with LMC = 0. In this way traffic on each LID level is not affected by LMC changes. * Ordered routing paths balancing This adds ability to predefine the port order in which routing paths balancing is performed by OpenSM. Helps to improve performance dramatically (40-50%) for applications with known communication pattern. Activated with --guid_routing_order_file command line option. * Unified OpenSM configuration Now there is "conventional" config file instead of hidden option cache file (opensm.opts). OpenSM will find this in a default place (consult man page for exact value) or the file name can be specified with '-F' command line option. Also there is an option ('-c') to generate config file template. * Query remote SMs during light sweep Master OpenSM will query remote standby SMs periodically to catch its possible state changes and react accordingly (as required by IBA spec). * Predefined port ids for Up/Down algorithm This is useful as Up/Down fine tuning tool - the algorithm will use predefined port IDs instead of GUIDs for its decision about direction. Activated with --ids_guid_file command line option. * Improved plugin API version 2. Now OpenSM will provide to plugins the access to all data structures. This make it possible to implement powerful multi purpose plugins. All OpenSM header files are installed now and specific configuration/build options are exported via generated osm_config.h header file. * Many code improvements, optimizations and cleanups * Automatic daily snapshots generation. This is is not a "feature", but simplifies the access to recent OpenSM bits. 1.2 Minor New Features: * Cleanup cl_qlock_pool memory allocator - speedup memory allocations * Support for configurable (via OSM_UMAD_MAX_PENDING environment variable) size of pending MADs pool. * Set packet life time to subnet timeout option rather than default * Enforce routing paths rebalancing on switch reconnection * In Up/Down routing algorithm compare GUID values in host byte order * Add 'switchbalance' and 'lidbalance' commands for OpenSM console * Respond to new trap 144 node description update flag * Add '--connect_roots' command line options. This preserves connectivity between root nodes in Up/Down routing algorithm * Setting SL in the IPoIB MCast groups in accordance with QoS policy * Dump auto detected root node guids in Up/Down routing algorithm * Unify OpenSM dumpers code * Unify various guid files parsers - add generic nodenamemap style parser * When root node guids were provided in file update the list on each Up/Down run * During ./configure show values of configuration dirs and files * Make prefix routes config file name configurable * Add a Performance Manager HOWTO to the docs and the dist * Support separate SA and SM keys as clarified in IBA 1.2.1 * Remove AM_MAINTAINER_MODE in ./configure * Make vendor type OSM_VENDOR_INTF_OPENIB (libibumad) to be default * Build osm_perfmgr_db.* content only when PerfMgr is enabled. * Move PerfMgr event_db_dump_file to common OpenSM dump dir * Allow space separated strings as values in OpenSM config * Support for multiple event plugins * Add '--version' command line option * Add '--create-config ' command line option * Speedup and simplify logging code * Speedup multicast processing in SA DB * In log messages convert unicast LIDs from hex to decimal format and GIDs from hex to IPv6 address format * Handle all possible ports in "ignore-guids" file * Add 'reroute' console command * Remove many install-exec-hook from Makefiles * Some cleanups in LASH routing algorithm code * In Makefiles remove -rpath and explicit -lpthread, -ldl from LDFLAGS (move to configurator) * Install all OpenSM header files * Improve locking in SM Info receiver * Add new OSM_EVENT_ID_SUBNET_UP event for plugins * Redo lex and yacc files generation in conventional way * Add a missing Node Description check on light sweep. * Move vendor specific compilation defines from command to generated config.h file * Provide useful error message when log file opening fails * Add generated osm_config.h file with OpenSM specific defines * Display port number in decimal in log messages * Replace osm_vendor_select.h by generated osm_config.h * Unify options listing in OpenSM usage message * LFT buffers handling simplification * Add 'dump_conf' console command * OpenSM performs sweep on SIGCONT (coming out of suspend). * When our SM is in Standby state and its priority is increased (via console command), notify master SM by sending Trap 144. * When entering standby state (after discovery) notify master SM with Trap 144. * support more PortInfo:CapabilityMask bits * When babbling port policy is on disable the port with the least hop count. 1.3 Library API Changes None 1.4 Software Dependencies OpenSM depends on the installation of either OFED 1.x, OpenIB gen2 (e.g. IBG2 distribution), OpenIB gen1 (e.g. IBGD distribution), or Mellanox VAPI stacks. The qualified driver versions are provided in Table 2, "Qualified IB Stacks". Also, building of QoS manager policy file parser requires flex, and either bison or byacc installed. 1.5 Supported Devices Firmware The main task of OpenSM is to initialize InfiniBand devices. The qualified devices and their corresponding firmware versions are listed in Table 3. 2 Known Issues And Limitations ------------------------------ * No Service / Key associations: There is no way to manage Service access by Keys. * No SM to SM SMDB synchronization: Puts the burden of re-registering services, multicast groups, and inform-info on the client application (or IB access layer core). * When running with QoS with default configuration (opensm -Q), OpenSM prints list of "Invalid Cached Option" error messages. This does not affect OpenSM functionality. 3 Unsupported IB Compliance Statements -------------------------------------- The following section lists all the IB compliance statements which OpenSM does not support. Please refer to the IB specification for detailed information regarding each compliance statement. * C14-22 (Authentication): M_Key M_KeyProtectBits and M_KeyLeasePeriod shall be set in one SubnSet method. As a work-around, an OpenSM option is provided for defining the protect bits. * C14-67 (Authentication): On SubnGet(SMInfo) and SubnSet(SMInfo) - if M_Key is not zero then the SM shall generate a SubnGetResp if the M_Key matches, or silently drop the packet if M_Key does not match. * C15-0.1.23.4 (Authentication): InformInfoRecords shall always be provided with the QPN set to 0, except for the case of a trusted request, in which case the actual subscriber QPN shall be returned. * o13-17.1.2 (Event-FWD): If no permission to forward, the subscription should be removed and no further forwarding should occur. * C14-24.1.1.5 and C14-62.1.1.22 (Initialization): GUIDInfo - SM should enable assigning Port GUIDInfo. * C14-44 (Initialization): If the SM discovers that it is missing an M_Key to update CA/RT/SW, it should notify the higher level. * C14-62.1.1.12 (Initialization): PortInfo:M_Key - Set the M_Key to a node based random value. * C14-62.1.1.13 (Initialization): PortInfo:P_KeyProtectBits - set according to an optional policy. * C14-62.1.1.24 (Initialization): SwitchInfo:DefaultPort - should be configured for random FDB. * C14-62.1.1.32 (Initialization): RandomForwardingTable should be configured. * o15-0.1.12 (Multicast): If the JoinState is SendOnlyNonMember = 1 (only), then the endport should join as sender only. * o15-0.1.8 (Multicast): If a request for creating an MCG with fields that cannot be met, return ERR_REQ_INVALID (currently ignores SL and FlowLabelTClass). * C15-0.1.8.6 (SA-Query): Respond to SubnAdmGetTraceTable - this is an optional attribute. * C15-0.1.13 Services: Reject ServiceRecord create, modify or delete if the given ServiceP_Key does not match the one included in the ServiceGID port and the port that sent the request. * C15-0.1.14 (Services): Provide means to associate service name and ServiceKeys. 4 Bug Fixes ----------- 4.1 Major Bug Fixes * Set SA attribute offset to 0 when no records are returned * Send trap 64 only after new ports are in ACTIVE state. * Fix in sending client reregistration bit * Fix default OpenSM SM (and SA) Key byte order * Fix in sending Multicast groups creation/deletion notification (Traps 66,67) * Don't startup automatically on SuSE based systems 4.2 Other Bug Fixes * opensm/osm_console.c: fix seg fault when running "portstatus ca" in the console * opensm: fix potential core dumps where osm_node_get_physp_ptr can return NULL * opensm/osm_mcast_mgr: limit spanning tree creation recursion to value of max hops (64) * opensm: switch LFTs incremental update fix * opensm/osm_state_mgr.c: fix segmentation fault * opensm: eliminate some potential NULL pointer dereferences * opensm/osm_console.c: fix guid parsing * opensm: fix off by 1 issue with max_lid and max_multicat_lid_ho * opensm: fix potentially wrong port_guid initialization * opensm/configure.in: fix wrong HAVE_DEFAULT_OPENSM_CONFIG_FILE define generation * opensm: fix snprintf() usage * opensm/osm_sa_lft_record: validate LFT block number * opensm/osm_sa_lft_record: pass block parameter in host byte order * opensm/include/Makefile.am: don't duplicate header files in EXTRA_DIST * opensm/osm_sa_class_port_info.c: fix over bound array access * osmtest/osmt_service.c: fix over bound array access * osmtest: fix qpn encoding in osmtest_informinfo_request() * opensm/osm_vendor_mlx_sa.c: handling attribute offset of 0 * opensm: fix segfault corner case when osm_console_init fails * opensm/console: close console socket on cleanup path * opensm/osm_ucast_lash: fix buffer overflow * opensm: fix broken IPv6 SNM consolidation code * opensm/osm_sa_lft_record.c: fix block number encoding byte order * opensm/osm_sa: fix memory leak in SA responder * opensm/osm_mcast_mgr: fix memory leak * opensm: fix qos config parsing bugs * opensm/osm_mcast_tbl.c: fix sending invalid MF block due to max mlid overflow * opensm: log_max_size config parameter in MB * opensm/osm_ucast_lash: fix extra memory allocations * opensm: fix race in main OpenSM flow * opensm/ftree: fix GUID check against cn_guid_file * opensm/ftree: save FLT buffers memory allocations * opensm/osm_sa_link_record.c: prevent potential endless recursion * opensm: remove SM from sm_guid_tbl when IsSM port capability flag is not set * opensm: fix QoS config bug * opensm: don't reassign zeroed params from config file * Other less critical or visible bugs were also fixed. 5 Main Verification Flows ------------------------- OpenSM verification is run using the following activities: * osmtest - a stand-alone program * ibmgtsim (IB management simulator) based - a set of flows that simulate clusters, inject errors and verify OpenSM capability to respond and bring up the network correctly. * small cluster regression testing - where the SM is used on back to back or single switch configurations. The regression includes multiple OpenSM dedicated tests. * cluster testing - when we run OpenSM to setup a large cluster, perform hand-off, reboots and reconnects, verify routing correctness and SA responsiveness at the ULP level (IPoIB and SDP). 5.1 osmtest osmtest is an automated verification tool used for OpenSM testing. Its verification flows are described by list below. * Inventory File: Obtain and verify all port info, node info, link and path records parameters. * Service Record: - Register new service - Register another service (with a lease period) - Register another service (with service p_key set to zero) - Get all services by name - Delete the first service - Delete the third service - Added bad flows of get/delete non valid service - Add / Get same service with different data - Add / Get / Delete by different component mask values (services by Name & Key / Name & Data / Name & Id / Id only ) * Multicast Member Record: - Query of existing Groups (IPoIB) - BAD Join with insufficient comp mask (o15.0.1.3) - Create given MGID=0 (o15.0.1.4) - Create given MGID=0xFF12A01C,FE800000,00000000,12345678 (o15.0.1.4) - Create BAD MGID=0xFA. (o15.0.1.6) - Create BAD MGID=0xFF12A01B w/ link-local not set (o15.0.1.6) - New MGID with invalid join state (o15.0.1.9) - Retry of existing MGID - See JoinState update (o15.0.1.11) - BAD RATE when connecting to existing MGID (o15.0.1.13) - Partial JoinState delete request - removing FullMember (o15.0.1.14) - Full Delete of a group (o15.0.1.14) - Verify Delete by trying to Join deleted group (o15.0.1.14) - BAD Delete of IPoIB membership (no prev join) (o15.0.1.15) * GUIDInfo Record: - All GUIDInfoRecords in subnet are obtained * MultiPathRecord: - Perform some compliant and noncompliant MultiPathRecord requests - Validation is via status in responses and IB analyzer * PKeyTableRecord: - Perform some compliant and noncompliant PKeyTableRecord queries - Validation is via status in responses and IB analyzer * LinearForwardingTableRecord: - Perform some compliant and noncompliant LinearForwardingTableRecord queries - Validation is via status in responses and IB analyzer * Event Forwarding: Register for trap forwarding using reports - Send a trap and wait for report - Unregister non-existing * Trap 64/65 Flow: Register to Trap 64-65, create traps (by disconnecting/connecting ports) and wait for report, then unregister. * Stress Test: send PortInfoRecord queries, both single and RMPP and check for the rate of responses as well as their validity. 5.2 IB Management Simulator OpenSM Test Flows: The simulator provides ability to simulate the SM handling of virtual topologies that are not limited to actual lab equipment availability. OpenSM was simulated to bring up clusters of up to 10,000 nodes. Daily regressions use smaller (16 and 128 nodes clusters). The following test flows are run on the IB management simulator: * Stability: Up to 12 links from the fabric are randomly selected to drop packets at drop rates up to 90%. The SM is required to succeed in bringing the fabric up. The resulting routing is verified to be correct as well. * LID Manager: Using LMC = 2 the fabric is initialized with LIDs. Faults such as zero LID, Duplicated LID, non-aligned (to LMC) LIDs are randomly assigned to various nodes and other errors are randomly output to the guid2lid cache file. The SM sweep is run 5 times and after each iteration a complete verification is made to ensure that all LIDs that could possibly be maintained are kept, as well as that all nodes were assigned a legal LID range. * Multicast Routing: Nodes randomly join the 0xc000 group and eventually the resulting routing is verified for completeness and adherence to Up/Down routing rules. * osmtest: The complete osmtest flow as described in the previous table is run on the simulated fabrics. * Stress Test: This flow merges fabric, LID and stability issues with continuous PathRecord, ServiceRecord and Multicast Join/Leave activity to stress the SM/SA during continuous sweeps. InformInfo Set/Delete/Get were added to the test such both existing and non existing nodes perform them in random order. 5.3 OpenSM Regression Using a back-to-back or single switch connection, the following set of tests is run nightly on the stacks described in table 2. The included tests are: * Stress Testing: Flood the SA with queries from multiple channel adapters to check the robustness of the entire stack up to the SA. * Dynamic Changes: Dynamic Topology changes, through randomly dropping SMP packets, used to test OpenSM adaptation to an unstable network & verify DB correctness. * Trap Injection: This flow injects traps to the SM and verifies that it handles them gracefully. * SA Query Test: This test exhaustively checks the SA responses to all possible single component mask. To do that the test examines the entire set of records the SA can provide, classifies them by their field values and then selects every field (using component mask and a value) and verifies that the response matches the expected set of records. A random selection using multiple component mask bits is also performed. 5.4 Cluster testing: Cluster testing is usually run before a distribution release. It involves real hardware setups of 16 to 32 nodes (or more if a beta site is available). Each test is validated by running all-to-all ping through the IB interface. The test procedure includes: * Cluster bringup * Hand-off between 2 or 3 SM's while performing: - Node reboots - Switch power cycles (disconnecting the SM's) * Unresponsive port detection and recovery * osmtest from multiple nodes * Trap injection and recovery 6 Qualified Software Stacks and Devices --------------------------------------- OpenSM Compatibility -------------------- Note that OpenSM version 3.2.1 and earlier used a value of 1 in host byte order for the default SM_Key, so there is a compatibility issue with these earlier versions of OpenSM when the 3.2.2 or later version is running on a little endian machine. This affects SM handover as well as SA queries (saquery tool in infiniband-diags). Table 2 - Qualified IB Stacks ============================= Stack | Version -----------------------------------------|-------------------------- OFED | 1.4 OFED | 1.3 OFED | 1.2 OFED | 1.1 OFED | 1.0 OpenIB Gen2 (IBG2 distribution) | 1.0 OpenIB Gen1 (IBGD distribution) | 1.8.0 VAPI (Mellanox InfiniBand HCA Driver) | 3.2 and later Table 3 - Qualified Devices and Corresponding Firmware ====================================================== Mellanox Device | FW versions ------------------------------------|------------------------------- InfiniScale | fw-43132 5.2.000 (and later) InfiniScale III | fw-47396 0.5.000 (and later) InfiniScale IV | fw-48436 7.1.000 (and later) InfiniHost | fw-23108 3.5.000 (and later) InfiniHost III Lx | fw-25204 1.2.000 (and later) InfiniHost III Ex (InfiniHost Mode) | fw-25208 4.8.200 (and later) InfiniHost III Ex (MemFree Mode) | fw-25218 5.3.000 (and later) ConnectX IB | fw-25408 2.3.000 (and later) QLogic/PathScale Device | Note --------|----------------------------------------------------------- iPath | QHT6040 (PathScale InfiniPath HT-460) iPath | QHT6140 (PathScale InfiniPath HT-465) iPath | QLE6140 (PathScale InfiniPath PE-880) iPath | QLE7240 iPath | QLE7280 Note 1: OpenSM does not run on an IBM Galaxy (eHCA) as it does not expose QP0 and QP1. However, it does support it as a device on the subnet. Note 2: QoS firmware and Mellanox devices HCAs: QoS supported by ConnectX. QoS-enabled FW release is 2_5_000 and later. Switches: QoS supported by InfiniScale III Any InfiniScale III FW that is supported by OpenSM supports QoS.