Lustre* Software Release 2.x

Operations Manual

Notwithstanding Intel’s ownership of the copyright in the modifications to the original version of this Operations Manual, as between Intel and Oracle, Oracle and/or its affiliates retain sole ownership of the copyright in the unmodified portions of this Operations Manual.

Important Notice from Intel

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. Lustre is a registered trademark of Oracle Corporation.

*Other names and brands may be claimed as the property of others.

THE ORIGINAL LUSTRE 2.x FILESYSTEM: OPERATIONS MANUAL HAS BEEN MODIFIED: THIS OPERATIONS MANUAL IS A MODIFIED VERSION OF, AND IS DERIVED FROM, THE LUSTRE 2.0 FILESYSTEM: OPERATIONS MANUAL PUBLISHED BY ORACLE AND AVAILABLE AT [http://www.lustre.org/]. MODIFICATIONS (collectively, the “Modifications”) HAVE BEEN MADE BY INTEL CORPORATION (“Intel”). ORACLE AND ITS AFFILIATES HAVE NOT REVIEWED, APPROVED, SPONSORED, OR ENDORSED THIS MODIFIED OPERATIONS MANUAL, OR ENDORSED INTEL, AND ORACLE AND ITS AFFILIATES ARE NOT RESPONSIBLE OR LIABLE FOR ANY MODIFICATIONS THAT INTEL HAS MADE TO THE ORIGINAL OPERATIONS MANUAL.

NOTHING IN THIS MODIFIED OPERATIONS MANUAL IS INTENDED TO AFFECT THE NOTICE PROVIDED BY ORACLE BELOW IN RESPECT OF THE ORIGINAL OPERATIONS MANUAL AND SUCH ORACLE NOTICE CONTINUES TO APPLY TO THIS MODIFIED OPERATIONS MANUAL EXCEPT FOR THE MODIFICATIONS; THIS INTEL NOTICE SHALL APPLY ONLY TO MODIFICATIONS MADE BY INTEL. AS BETWEEN YOU AND ORACLE: (I) NOTHING IN THIS INTEL NOTICE IS INTENDED TO AFFECT THE TERMS OF THE ORACLE NOTICE BELOW; AND (II) IN THE EVENT OF ANY CONFLICT BETWEEN THE TERMS OF THIS INTEL NOTICE AND THE TERMS OF THE ORACLE NOTICE, THE ORACLE NOTICE SHALL PREVAIL.

Your use of any Intel software shall be governed by separate license terms containing restrictions on use and disclosure and are protected by intellectual property laws.

The information contained herein is subject to change without notice and is not warranted to be error-free. If you find any errors, please report them to us in writing.

This work is licensed under a Creative Commons Attribution-Share Alike 3.0 United States License. To view a copy of this license and obtain more information about Creative Commons licensing, visit Creative Commons Attribution-Share Alike 3.0 United States [http://creativecommons.org/licenses/by-sa/3.0/us] or send a letter to Creative Commons, 171 2nd Street, Suite 300, San Francisco, California 94105, USA.

Important Notice from Oracle

This software and related documentation are provided under a license agreement containing restrictions on use and disclosure and are protected by intellectual property laws. Except as expressly permitted in your license agreement or allowed by law, you may not use, copy, reproduce, translate, broadcast, modify, license, transmit, distribute, exhibit, perform, publish, or display any part, in any form, or by any means. Reverse engineering, disassembly, or decompilation of this software, unless required by law for interoperability, is prohibited.

The information contained herein is subject to change without notice and is not warranted to be error-free. If you find any errors, please report them to us in writing.

If this is software or related software documentation that is delivered to the U.S. Government or anyone licensing it on behalf of the U.S. Government, the following notice is applicable:

U.S. GOVERNMENT RIGHTS. Programs, software, databases, and related documentation and technical data delivered to U.S. Government customers are "commercial computer software" or "commercial technical data" pursuant to the applicable Federal Acquisition Regulation and agency-specific supplemental regulations. As such, the use, duplication, disclosure, modification, and adaptation shall be subject to the restrictions and license terms set forth in the applicable Government contract, and, to the extent applicable by the terms of the Government contract, the additional rights set forth in FAR 52.227-19, Commercial Computer Software License (December 2007). Oracle America, Inc., 500 Oracle Parkway, Redwood City, CA 94065.

This software or hardware is developed for general use in a variety of information management applications. It is not developed or intended for use in any inherently dangerous applications, including applications which may create a risk of personal injury. If you use this software or hardware in dangerous applications, then you shall be responsible to take all appropriate fail-safe, backup, redundancy, and other measures to ensure its safe use. Oracle Corporation and its affiliates disclaim any liability for any damages caused by use of this software or hardware in dangerous applications.

Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners.

AMD, Opteron, the AMD logo, and the AMD Opteron logo are trademarks or registered trademarks of Advanced Micro Devices. Intel and Intel Xeon are trademarks or registered trademarks of Intel Corporation. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. UNIX is a registered trademark licensed through X/Open Company, Ltd.

This software or hardware and documentation may provide access to or information on content, products, and services from third parties. Oracle Corporation and its affiliates are not responsible for and expressly disclaim all warranties of any kind with respect to third-party content, products, and services. Oracle Corporation and its affiliates will not be responsible for any loss, costs, or damages incurred due to your access to or use of third-party content, products, or services.

Copyright © 2011, Oracle et/ou ses affiliés. Tous droits réservés.

Ce logiciel et la documentation qui l’accompagne sont protégés par les lois sur la propriété intellectuelle. Ils sont concédés sous licence et soumis à des restrictions d’utilisation et de divulgation. Sauf disposition de votre contrat de licence ou de la loi, vous ne pouvez pas copier, reproduire, traduire, diffuser, modifier, breveter, transmettre, distribuer, exposer, exécuter, publier ou afficher le logiciel, même partiellement, sous quelque forme et par quelque procédé que ce soit. Par ailleurs, il est interdit de procéder à toute ingénierie inverse du logiciel, de le désassembler ou de le décompiler, excepté à des fins d’interopérabilité avec des logiciels tiers ou tel que prescrit par la loi.

Les informations fournies dans ce document sont susceptibles de modification sans préavis. Par ailleurs, Oracle Corporation ne garantit pas qu’elles soient exemptes d’erreurs et vous invite, le cas échéant, à lui en faire part par écrit.

Si ce logiciel, ou la documentation qui l’accompagne, est concédé sous licence au Gouvernement des Etats-Unis, ou à toute entité qui délivre la licence de ce logiciel ou l’utilise pour le compte du Gouvernement des Etats-Unis, la notice suivante s’applique :

U.S. GOVERNMENT RIGHTS. Programs, software, databases, and related documentation and technical data delivered to U.S. Government customers are "commercial computer software" or "commercial technical data" pursuant to the applicable Federal Acquisition Regulation and agency-specific supplemental regulations. As such, the use, duplication, disclosure, modification, and adaptation shall be subject to the restrictions and license terms set forth in the applicable Government contract, and, to the extent applicable by the terms of the Government contract, the additional rights set forth in FAR 52.227-19, Commercial Computer Software License (December 2007). Oracle America, Inc., 500 Oracle Parkway, Redwood City, CA 94065.

Ce logiciel ou matériel a été développé pour un usage général dans le cadre d’applications de gestion des informations. Ce logiciel ou matériel n’est pas conçu ni n’est destiné à être utilisé dans des applications à risque, notamment dans des applications pouvant causer des dommages corporels. Si vous utilisez ce logiciel ou matériel dans le cadre d’applications dangereuses, il est de votre responsabilité de prendre toutes les mesures de secours, de sauvegarde, de redondance et autres mesures nécessaires à son utilisation dans des conditions optimales de sécurité. Oracle Corporation et ses affiliés déclinent toute responsabilité quant aux dommages causés par l’utilisation de ce logiciel ou matériel pour ce type d’applications.

Oracle et Java sont des marques déposées d’Oracle Corporation et/ou de ses affiliés.Tout autre nom mentionné peut correspondre à des marques appartenant à d’autres propriétaires qu’Oracle.

AMD, Opteron, le logo AMD et le logo AMD Opteron sont des marques ou des marques déposées d’Advanced Micro Devices. Intel et Intel Xeon sont des marques ou des marques déposées d’Intel Corporation. Toutes les marques SPARC sont utilisées sous licence et sont des marques ou des marques déposées de SPARC International, Inc. UNIX est une marque déposée concédée sous licence par X/Open Company, Ltd.

Ce logiciel ou matériel et la documentation qui l’accompagne peuvent fournir des informations ou des liens donnant accès à des contenus, des produits et des services émanant de tiers. Oracle Corporation et ses affiliés déclinent toute responsabilité ou garantie expresse quant aux contenus, produits ou services émanant de tiers. En aucun cas, Oracle Corporation et ses affiliés ne sauraient être tenus pour responsables des pertes subies, des coûts occasionnés ou des dommages causés par l’accès à des contenus, produits ou services tiers, ou à leur utilisation.

This work is licensed under a Creative Commons Attribution-Share Alike 3.0 United States License. To view a copy of this license and obtain more information about Creative Commons licensing, visit Creative Commons Attribution-Share Alike 3.0 United States or send a letter to Creative Commons, 171 2nd Street, Suite 300, San Francisco, California 94105, USA.


Table of Contents

Preface
1. About this Document
1.1. UNIX* Commands
1.2. Shell Prompts
1.3. Related Documentation
1.4. Documentation, Support, and Training
2. Revisions
I. Introducing the Lustre* File System
1. Understanding Lustre Architecture
1.1. What a Lustre File System Is (and What It Isn't)
1.1.1. Lustre Features
1.2. Lustre Components
1.2.1. Management Server (MGS)
1.2.2. Lustre File System Components
1.2.3. Lustre Networking (LNet)
1.2.4. Lustre Cluster
1.3. Lustre File System Storage and I/O
1.3.1. Lustre File System and Striping
2. Understanding Lustre Networking (LNet)
2.1. Introducing LNet
2.2. Key Features of LNet
2.3. Lustre Networks
2.4. Supported Network Types
3. Understanding Failover in a Lustre File System
3.1. What is Failover?
3.1.1. Failover Capabilities
3.1.2. Types of Failover Configurations
3.2. Failover Functionality in a Lustre File System
3.2.1. MDT Failover Configuration (Active/Passive)
3.2.2. MDT Failover Configuration (Active/Active) L 2.4
3.2.3. OST Failover Configuration (Active/Active)
II. Installing and Configuring Lustre
4. Installation Overview
4.1. Steps to Installing the Lustre Software
5. Determining Hardware Configuration Requirements and Formatting Options
5.1. Hardware Considerations
5.1.1. MGT and MDT Storage Hardware Considerations
5.1.2. OST Storage Hardware Considerations
5.2. Determining Space Requirements
5.2.1. Determining MGT Space Requirements
5.2.2. Determining MDT Space Requirements
5.2.3. Determining OST Space Requirements
5.3. Setting ldiskfs File System Formatting Options
5.3.1. Setting Formatting Options for an ldiskfs MDT
5.3.2. Setting Formatting Options for an ldiskfs OST
5.3.3. File and File System Limits
5.4. Determining Memory Requirements
5.4.1. Client Memory Requirements
5.4.2. MDS Memory Requirements
5.4.3. OSS Memory Requirements
5.5. Implementing Networks To Be Used by the Lustre File System
6. Configuring Storage on a Lustre File System
6.1. Selecting Storage for the MDT and OSTs
6.1.1. Metadata Target (MDT)
6.1.2. Object Storage Server (OST)
6.2. Reliability Best Practices
6.3. Performance Tradeoffs
6.4. Formatting Options for ldiskfs RAID Devices
6.4.1. Computing file system parameters for mkfs
6.4.2. Choosing Parameters for an External Journal
6.5. Connecting a SAN to a Lustre File System
7. Setting Up Network Interface Bonding
7.1. Network Interface Bonding Overview
7.2. Requirements
7.3. Bonding Module Parameters
7.4. Setting Up Bonding
7.4.1. Examples
7.5. Configuring a Lustre File System with Bonding
7.6. Bonding References
8. Installing the Lustre Software
8.1. Preparing to Install the Lustre Software
8.1.1. Software Requirements
8.1.2. Environmental Requirements
8.2. Lustre Software Installation Procedure
9. Configuring Lustre Networking (LNet)
9.1. Configuring LNet via lnetctl L 2.7
9.1.1. Configuring LNet
9.1.2. Adding, Deleting and Showing networks
9.1.3. Adding, Deleting and Showing routes
9.1.4. Enabling and Disabling Routing
9.1.5. Showing routing information
9.1.6. Configuring Routing Buffers
9.1.7. Importing YAML Configuration File
9.1.8. Exporting Configuration in YAML format
9.1.9. Showing LNet Traffic Statistics
9.1.10. YAML Syntax
9.2. Overview of LNet Module Parameters
9.2.1. Using a Lustre Network Identifier (NID) to Identify a Node
9.3. Setting the LNet Module networks Parameter
9.3.1. Multihome Server Example
9.4. Setting the LNet Module ip2nets Parameter
9.5. Setting the LNet Module routes Parameter
9.5.1. Routing Example
9.6. Testing the LNet Configuration
9.7. Configuring the Router Checker
9.8. Best Practices for LNet Options
9.8.1. Escaping commas with quotes
9.8.2. Including comments
10. Configuring a Lustre File System
10.1. Configuring a Simple Lustre File System
10.1.1. Simple Lustre Configuration Example
10.2. Additional Configuration Options
10.2.1. Scaling the Lustre File System
10.2.2. Changing Striping Defaults
10.2.3. Using the Lustre Configuration Utilities
11. Configuring Failover in a Lustre File System
11.1. Setting Up a Failover Environment
11.1.1. Selecting Power Equipment
11.1.2. Selecting Power Management Software
11.1.3. Selecting High-Availability (HA) Software
11.2. Preparing a Lustre File System for Failover
11.3. Administering Failover in a Lustre File System
III. Administering Lustre
12. Monitoring a Lustre File System
12.1. Lustre Changelogs
12.1.1. Working with Changelogs
12.1.2. Changelog Examples
12.2. Lustre Jobstats
12.2.1. How Jobstats Works
12.2.2. Enable/Disable Jobstats
12.2.3. Check Job Stats
12.2.4. Clear Job Stats
12.2.5. Configure Auto-cleanup Interval
12.3. Lustre Monitoring Tool (LMT)
12.4. CollectL
12.5. Other Monitoring Options
13. Lustre Operations
13.1. Mounting by Label
13.2. Starting Lustre
13.3. Mounting a Server
13.4. Unmounting a Server
13.5. Specifying Failout/Failover Mode for OSTs
13.6. Handling Degraded OST RAID Arrays
13.7. Running Multiple Lustre File Systems
13.8. Creating a sub-directory on a given MDT L 2.4
13.9. Creating a directory striped across multiple MDTs L 2.8
13.10. Setting and Retrieving Lustre Parameters
13.10.1. Setting Tunable Parameters with mkfs.lustre
13.10.2. Setting Parameters with tunefs.lustre
13.10.3. Setting Parameters with lctl
13.11. Specifying NIDs and Failover
13.12. Erasing a File System
13.13. Reclaiming Reserved Disk Space
13.14. Replacing an Existing OST or MDT
13.15. Identifying To Which Lustre File an OST Object Belongs
14. Lustre Maintenance
14.1. Working with Inactive OSTs
14.2. Finding Nodes in the Lustre File System
14.3. Mounting a Server Without Lustre Service
14.4. Regenerating Lustre Configuration Logs
14.5. Changing a Server NID
14.6. Adding a New MDT to a Lustre File System L 2.4
14.7. Adding a New OST to a Lustre File System
14.8. Removing and Restoring OSTs
14.8.1. Removing a MDT from the File System L 2.4
14.8.2. Working with Inactive MDTs L 2.4
14.8.3. Removing an OST from the File System
14.8.4. Backing Up OST Configuration Files
14.8.5. Restoring OST Configuration Files
14.8.6. Returning a Deactivated OST to Service
14.9. Aborting Recovery
14.10. Determining Which Machine is Serving an OST
14.11. Changing the Address of a Failover Node
14.12. Separate a combined MGS/MDT
15. Managing Lustre Networking (LNet)
15.1. Updating the Health Status of a Peer or Router
15.2. Starting and Stopping LNet
15.2.1. Starting LNet
15.2.2. Stopping LNet
15.3. Multi-Rail Configurations with LNet
15.4. Load Balancing with an InfiniBand* Network
15.4.1. Setting Up lustre.conf for Load Balancing
15.5. Dynamically Configuring LNet Routes L 2.4
15.5.1. lustre_routes_config
15.5.2. lustre_routes_conversion
15.5.3. Route Configuration Examples
16. Upgrading a Lustre File System
16.1. Release Interoperability and Upgrade Requirements
16.2. Upgrading to Lustre Software Release 2.x (Major Release)
16.3. Upgrading to Lustre Software Release 2.x.y (Minor Release)
17. Backing Up and Restoring a File System
17.1. Backing up a File System
17.1.1. Lustre_rsync
17.2. Backing Up and Restoring an MDT or OST (ldiskfs Device Level)
17.3. Backing Up an OST or MDT (ldiskfs File System Level)
17.4. Restoring a File-Level Backup
17.5. Using LVM Snapshots with the Lustre File System
17.5.1. Creating an LVM-based Backup File System
17.5.2. Backing up New/Changed Files to the Backup File System
17.5.3. Creating Snapshot Volumes
17.5.4. Restoring the File System From a Snapshot
17.5.5. Deleting Old Snapshots
17.5.6. Changing Snapshot Volume Size
18. Managing File Layout (Striping) and Free Space
18.1. How Lustre File System Striping Works
18.2. Lustre File Layout (Striping) Considerations
18.2.1. Choosing a Stripe Size
18.3. Setting the File Layout/Striping Configuration (lfs setstripe)
18.3.1. Specifying a File Layout (Striping Pattern) for a Single File
18.3.2. Setting the Striping Layout for a Directory
18.3.3. Setting the Striping Layout for a File System
18.3.4. Creating a File on a Specific OST
18.4. Retrieving File Layout/Striping Information (getstripe)
18.4.1. Displaying the Current Stripe Size
18.4.2. Inspecting the File Tree
18.4.3. Locating the MDT for a remote directory
18.5. Managing Free Space
18.5.1. Checking File System Free Space
18.5.2. Stripe Allocation Methods
18.5.3. Adjusting the Weighting Between Free Space and Location
18.6. Lustre Striping Internals
19. Managing the File System and I/O
19.1. Handling Full OSTs
19.1.1. Checking OST Space Usage
19.1.2. Taking a Full OST Offline
19.1.3. Migrating Data within a File System
19.1.4. Returning an Inactive OST Back Online
19.2. Creating and Managing OST Pools
19.2.1. Working with OST Pools
19.2.2. Tips for Using OST Pools
19.3. Adding an OST to a Lustre File System
19.4. Performing Direct I/O
19.4.1. Making File System Objects Immutable
19.5. Other I/O Options
19.5.1. Lustre Checksums
19.5.2. Ptlrpc Thread Pool
20. Lustre File System Failover and Multiple-Mount Protection
20.1. Overview of Multiple-Mount Protection
20.2. Working with Multiple-Mount Protection
21. Configuring and Managing Quotas
21.1. Working with Quotas
21.2. Enabling Disk Quotas
21.2.1. Enabling Disk Quotas (Lustre Software Prior to Release 2.4)
21.2.2. Enabling Disk Quotas (Lustre Software Release 2.4 and later) L 2.4
21.3. Quota Administration
21.4. Quota Allocation
21.5. Quotas and Version Interoperability
21.6. Granted Cache and Quota Limits
21.7. Lustre Quota Statistics
21.7.1. Interpreting Quota Statistics
22. Hierarchical Storage Management (HSM) L 2.5
22.1. Introduction
22.2. Setup
22.2.1. Requirements
22.2.2. Coordinator
22.2.3. Agents
22.3. Agents and copytool
22.3.1. Archive ID, multiple backends
22.3.2. Registered agents
22.3.3. Timeout
22.4. Requests
22.4.1. Commands
22.4.2. Automatic restore
22.4.3. Request monitoring
22.5. File states
22.6. Tuning
22.6.1. hsm_controlpolicy
22.6.2. max_requests
22.6.3. policy
22.6.4. grace_delay
22.7. change logs
22.8. Policy engine
22.8.1. Robinhood
23. Mapping UIDs and GIDs with Nodemap L 2.9
23.1. Setting a Mapping
23.1.1. Defining Terms
23.1.2. Deciding on NID Ranges
23.1.3. Describing and Deploying a Sample Mapping
23.2. Altering Properties
23.2.1. Managing the Properties
23.2.2. Mixing Properties
23.3. Enabling the Feature
23.4. Verifying Settings
23.5. Ensuring Consistency
24. Configuring Shared-Secret Key (SSK) Security L 2.9
24.1. SSK Security Overview
24.1.1. Key features
24.2. SSK Security Flavors
24.2.1. Secure RPC Rules
24.3. SSK Key Files
24.3.1. Key File Management
24.4. Lustre GSS Keyring
24.4.1. Setup
24.4.2. Server Setup
24.4.3. Debugging GSS Keyring
24.4.4. Revoking Keys
24.5. Role of Nodemap in SSK
24.6. SSK Examples
24.6.1. Securing Client to Server Communications
24.6.2. Securing MGS Communications
24.6.3. Securing Server to Server Communications
24.7. Viewing Secure PtlRPC Contexts
25. Managing Security in a Lustre File System
25.1. Using ACLs
25.1.1. How ACLs Work
25.1.2. Using ACLs with the Lustre Software
25.1.3. Examples
25.2. Using Root Squash
25.2.1. Configuring Root Squash
25.2.2. Enabling and Tuning Root Squash
25.2.3. Tips on Using Root Squash
IV. Tuning a Lustre File System for Performance
26. Testing Lustre Network Performance (LNet Self-Test)
26.1. LNet Self-Test Overview
26.1.1. Prerequisites
26.2. Using LNet Self-Test
26.2.1. Creating a Session
26.2.2. Setting Up Groups
26.2.3. Defining and Running the Tests
26.2.4. Sample Script
26.3. LNet Self-Test Command Reference
26.3.1. Session Commands
26.3.2. Group Commands
26.3.3. Batch and Test Commands
26.3.4. Other Commands
27. Benchmarking Lustre File System Performance (Lustre I/O Kit)
27.1. Using Lustre I/O Kit Tools
27.1.1. Contents of the Lustre I/O Kit
27.1.2. Preparing to Use the Lustre I/O Kit
27.2. Testing I/O Performance of Raw Hardware (sgpdd-survey)
27.2.1. Tuning Linux Storage Devices
27.2.2. Running sgpdd-survey
27.3. Testing OST Performance (obdfilter-survey)
27.3.1. Testing Local Disk Performance
27.3.2. Testing Network Performance
27.3.3. Testing Remote Disk Performance
27.3.4. Output Files
27.4. Testing OST I/O Performance (ost-survey)
27.5. Testing MDS Performance (mds-survey)
27.5.1. Output Files
27.5.2. Script Output
27.6. Collecting Application Profiling Information (stats-collect)
27.6.1. Using stats-collect
28. Tuning a Lustre File System
28.1. Optimizing the Number of Service Threads
28.1.1. Specifying the OSS Service Thread Count
28.1.2. Specifying the MDS Service Thread Count
28.2. Binding MDS Service Thread to CPU Partitions L 2.3
28.3. Tuning LNet Parameters
28.3.1. Transmit and Receive Buffer Size
28.3.2. Hardware Interrupts (enable_irq_affinity)
28.3.3. Binding Network Interface Against CPU Partitions L 2.3
28.3.4. Network Interface Credits
28.3.5. Router Buffers
28.3.6. Portal Round-Robin
28.3.7. LNet Peer Health
28.4. libcfs Tuning L 2.3
28.4.1. CPU Partition String Patterns
28.5. LND Tuning
28.6. Network Request Scheduler (NRS) Tuning L 2.4
28.6.1. First In, First Out (FIFO) policy
28.6.2. Client Round-Robin over NIDs (CRR-N) policy
28.6.3. Object-based Round-Robin (ORR) policy
28.6.4. Target-based Round-Robin (TRR) policy
28.6.5. Token Bucket Filter (TBF) policy L 2.6
28.7. Lockless I/O Tunables
28.8. Server-Side Advice and Hinting L 2.9
28.8.1. Overview
28.8.2. Examples
28.9. Large Bulk IO (16MB RPC) L 2.9
28.9.1. Overview
28.9.2. Usage
28.10. Improving Lustre I/O Performance for Small Files
28.11. Understanding Why Write Performance is Better Than Read Performance
V. Troubleshooting a Lustre File System
29. Lustre File System Troubleshooting
29.1. Lustre Error Messages
29.1.1. Error Numbers
29.1.2. Viewing Error Messages
29.2. Reporting a Lustre File System Bug
29.2.1. Searching the Jira* Bug Tracker for Duplicate Tickets
29.3. Common Lustre File System Problems
29.3.1. OST Object is Missing or Damaged
29.3.2. OSTs Become Read-Only
29.3.3. Identifying a Missing OST
29.3.4. Fixing a Bad LAST_ID on an OST
29.3.5. Handling/Debugging "Bind: Address already in use" Error
29.3.6. Handling/Debugging Error "- 28"
29.3.7. Triggering Watchdog for PID NNN
29.3.8. Handling Timeouts on Initial Lustre File System Setup
29.3.9. Handling/Debugging "LustreError: xxx went back in time"
29.3.10. Lustre Error: "Slow Start_Page_Write"
29.3.11. Drawbacks in Doing Multi-client O_APPEND Writes
29.3.12. Slowdown Occurs During Lustre File System Startup
29.3.13. Log Message 'Out of Memory' on OST
29.3.14. Setting SCSI I/O Sizes
30. Troubleshooting Recovery
30.1. Recovering from Errors or Corruption on a Backing ldiskfs File System
30.2. Recovering from Corruption in the Lustre File System
30.2.1. Working with Orphaned Objects
30.3. Recovering from an Unavailable OST
30.4. Checking the file system with LFSCK L 2.3
30.4.1. LFSCK switch interface
30.4.2. Check the LFSCK global status
30.4.3. LFSCK status interface
30.4.4. LFSCK adjustment interface
31. Debugging a Lustre File System
31.1. Diagnostic and Debugging Tools
31.1.1. Lustre Debugging Tools
31.1.2. External Debugging Tools
31.2. Lustre Debugging Procedures
31.2.1. Understanding the Lustre Debug Messaging Format
31.2.2. Using the lctl Tool to View Debug Messages
31.2.3. Dumping the Buffer to a File (debug_daemon)
31.2.4. Controlling Information Written to the Kernel Debug Log
31.2.5. Troubleshooting with strace
31.2.6. Looking at Disk Content
31.2.7. Finding the Lustre UUID of an OST
31.2.8. Printing Debug Messages to the Console
31.2.9. Tracing Lock Traffic
31.2.10. Controlling Console Message Rate Limiting
31.3. Lustre Debugging for Developers
31.3.1. Adding Debugging to the Lustre Source Code
31.3.2. Accessing the ptlrpc Request History
31.3.3. Finding Memory Leaks Using leak_finder.pl
VI. Reference
32. Lustre File System Recovery
32.1. Recovery Overview
32.1.1. Client Failure
32.1.2. Client Eviction
32.1.3. MDS Failure (Failover)
32.1.4. OST Failure (Failover)
32.1.5. Network Partition
32.1.6. Failed Recovery
32.2. Metadata Replay
32.2.1. XID Numbers
32.2.2. Transaction Numbers
32.2.3. Replay and Resend
32.2.4. Client Replay List
32.2.5. Server Recovery
32.2.6. Request Replay
32.2.7. Gaps in the Replay Sequence
32.2.8. Lock Recovery
32.2.9. Request Resend
32.3. Reply Reconstruction
32.3.1. Required State
32.3.2. Reconstruction of Open Replies
32.3.3. Multiple Reply Data per Client L 2.8
32.4. Version-based Recovery
32.4.1. VBR Messages
32.4.2. Tips for Using VBR
32.5. Commit on Share
32.5.1. Working with Commit on Share
32.5.2. Tuning Commit On Share
32.6. Imperative Recovery
32.6.1. MGS role
32.6.2. Tuning Imperative Recovery
32.6.3. Configuration Suggestions for Imperative Recovery
32.7. Suppressing Pings
32.7.1. "suppress_pings" Kernel Module Parameter
32.7.2. Client Death Notification
33. Lustre Parameters
33.1. Introduction to Lustre Parameters
33.1.1. Identifying Lustre File Systems and Servers
33.2. Tuning Multi-Block Allocation (mballoc)
33.3. Monitoring Lustre File System I/O
33.3.1. Monitoring the Client RPC Stream
33.3.2. Monitoring Client Activity
33.3.3. Monitoring Client Read-Write Offset Statistics
33.3.4. Monitoring Client Read-Write Extent Statistics
33.3.5. Monitoring the OST Block I/O Stream
33.4. Tuning Lustre File System I/O
33.4.1. Tuning the Client I/O RPC Stream
33.4.2. Tuning File Readahead and Directory Statahead
33.4.3. Tuning OSS Read Cache
33.4.4. Enabling OSS Asynchronous Journal Commit
33.4.5. Tuning the Client Metadata RPC Stream L 2.8
33.5. Configuring Timeouts in a Lustre File System
33.5.1. Configuring Adaptive Timeouts
33.5.2. Setting Static Timeouts
33.6. Monitoring LNet
33.7. Allocating Free Space on OSTs
33.8. Configuring Locking
33.9. Setting MDS and OSS Thread Counts
33.10. Enabling and Interpreting Debugging Logs
33.10.1. Interpreting OST Statistics
33.10.2. Interpreting MDT Statistics
34. User Utilities
34.1. lfs
34.1.1. Synopsis
34.1.2. Description
34.1.3. Options
34.1.4. Examples
34.1.5. See Also
34.2. lfs_migrate
34.2.1. Synopsis
34.2.2. Description
34.2.3. Options
34.2.4. Examples
34.2.5. See Also
34.3. filefrag
34.3.1. Synopsis
34.3.2. Description
34.3.3. Options
34.3.4. Examples
34.4. mount
34.5. Handling Timeouts
35. Programming Interfaces
35.1. User/Group Upcall
35.1.1. Synopsis
35.1.2. Description
35.1.3. Parameters
35.1.4. Data Structures
35.2. l_getidentity Utility
35.2.1. Synopsis
35.2.2. Description
35.2.3. Files
36. Setting Lustre Properties in a C Program (llapi)
36.1. llapi_file_create
36.1.1. Synopsis
36.1.2. Description
36.1.3. Examples
36.2. llapi_file_get_stripe
36.2.1. Synopsis
36.2.2. Description
36.2.3. Return Values
36.2.4. Errors
36.2.5. Examples
36.3. llapi_file_open
36.3.1. Synopsis
36.3.2. Description
36.3.3. Return Values
36.3.4. Errors
36.3.5. Example
36.4. llapi_quotactl
36.4.1. Synopsis
36.4.2. Description
36.4.3. Return Values
36.4.4. Errors
36.5. llapi_path2fid
36.5.1. Synopsis
36.5.2. Description
36.5.3. Return Values
36.6. llapi_ladvise L 2.9
36.6.1. Synopsis
36.6.2. Description
36.6.3. Return Values
36.6.4. Errors
36.7. Example Using the llapi Library
36.7.1. See Also
37. Configuration Files and Module Parameters
37.1. Introduction
37.2. Module Options
37.2.1. LNet Options
37.2.2. SOCKLND Kernel TCP/IP LND
37.2.3. Portals LND Linux (ptllnd)
37.2.4. MX LND
38. System Configuration Utilities
38.1. e2scan
38.1.1. Synopsis
38.1.2. Description
38.1.3. Options
38.2. l_getidentity
38.2.1. Synopsis
38.2.2. Description
38.2.3. Options
38.2.4. Files
38.3. lctl
38.3.1. Synopsis
38.3.2. Description
38.3.3. Setting Parameters with lctl
38.3.4. Options
38.3.5. Examples
38.3.6. See Also
38.4. ll_decode_filter_fid
38.4.1. Synopsis
38.4.2. Description
38.4.3. Examples
38.4.4. See Also
38.5. ll_recover_lost_found_objs
38.5.1. Synopsis
38.5.2. Description
38.5.3. Options
38.5.4. Example
38.6. llobdstat
38.6.1. Synopsis
38.6.2. Description
38.6.3. Example
38.6.4. Files
38.7. llog_reader
38.7.1. Synopsis
38.7.2. Description
38.7.3. See Also
38.8. llstat
38.8.1. Synopsis
38.8.2. Description
38.8.3. Options
38.8.4. Example
38.8.5. Files
38.9. llverdev
38.9.1. Synopsis
38.9.2. Description
38.9.3. Options
38.9.4. Examples
38.10. lshowmount
38.10.1. Synopsis
38.10.2. Description
38.10.3. Options
38.10.4. Files
38.11. lst
38.11.1. Synopsis
38.11.2. Description
38.11.3. Modules
38.11.4. Utilities
38.11.5. Example Script
38.12. lustre_rmmod.sh
38.13. lustre_rsync
38.13.1. Synopsis
38.13.2. Description
38.13.3. Options
38.13.4. Examples
38.13.5. See Also
38.14. mkfs.lustre
38.14.1. Synopsis
38.14.2. Description
38.14.3. Examples
38.14.4. See Also
38.15. mount.lustre
38.15.1. Synopsis
38.15.2. Description
38.15.3. Options
38.15.4. Examples
38.15.5. See Also
38.16. plot-llstat
38.16.1. Synopsis
38.16.2. Description
38.16.3. Options
38.16.4. Example
38.17. routerstat
38.17.1. Synopsis
38.17.2. Description
38.17.3. Output
38.17.4. Example
38.17.5. Files
38.18. tunefs.lustre
38.18.1. Synopsis
38.18.2. Description
38.18.3. Options
38.18.4. Examples
38.18.5. See Also
38.19. Additional System Configuration Utilities
38.19.1. Application Profiling Utilities
38.19.2. More /proc Statistics for Application Profiling
38.19.3. Testing / Debugging Utilities
38.19.4. Fileset Feature L 2.9
39. LNet Configuration C-API
39.1. General API Information
39.1.1. API Return Code
39.1.2. API Common Input Parameters
39.1.3. API Common Output Parameters
39.2. The LNet Configuration C-API
39.2.1. Configuring LNet
39.2.2. Enabling and Disabling Routing
39.2.3. Adding Routes
39.2.4. Deleting Routes
39.2.5. Showing Routes
39.2.6. Adding a Network Interface
39.2.7. Deleting a Network Interface
39.2.8. Showing Network Interfaces
39.2.9. Adjusting Router Buffer Pools
39.2.10. Showing Routing information
39.2.11. Showing LNet Traffic Statistics
39.2.12. Adding/Deleting/Showing Parameters through a YAML Block
39.2.13. Adding a route code example
Glossary
Index

List of Figures

1.1. Lustre file system components in a basic cluster
1.2. Lustre cluster at scale
1.3. Layout EA on MDT pointing to file data on OSTs
1.4. Lustre client requesting file data
1.5. File striping on a Lustre file system
3.1. Lustre failover configuration for an active/passive MDT
3.2. Lustre failover configuration for active/active MDTs
3.3. Lustre failover configuration for OSTs
22.1. Overview of the Lustre file system HSM
28.1. The internal structure of TBF policy
38.1. Lustre fileset

List of Tables

1.1. Lustre File System Scalability and Performance
1.2. Storage and hardware requirements for Lustre file system components
5.1. Default Inode Ratios Used for Newly Formatted OSTs
5.2. File and file system limits
8.1. Packages Installed on Lustre Servers
8.2. Packages Installed on Lustre Clients
8.3. Network Types Supported by Lustre LNDs
10.1. Default stripe pattern
24.1. SSK Security Flavor Protections
24.2. lgss_sk Parameters
24.3. lsvcgssd Parameters
24.4. Key Descriptions

List of Examples

28.1. lustre.conf

Preface

The Lustre* Software Release 2.x Operations Manual provides detailed information and procedures to install, configure and tune a Lustre file system. The manual covers topics such as failover, quotas, striping, and bonding. This manual also contains troubleshooting information and tips to improve the operation and performance of a Lustre file system.

1. About this Document

This document is maintained by Intel in DocBook format. The canonical version is available at http://wiki.hpdd.intel.com/display/PUB/Documentation.

1.1. UNIX* Commands

This document may not contain information about basic UNIX* operating system commands and procedures such as shutting down the system, booting the system, and configuring devices. Refer to the following for this information:

  • Software documentation that you received with your system

  • Red Hat* Enterprise Linux* documentation, which is at: http://docs.redhat.com/docs/en-US/index.html

    Note

    The Lustre client module is available for many different Linux* versions and distributions. The Red Hat Enterprise Linux distribution is the best supported and tested platform for Lustre servers.

1.2. Shell Prompts

The shell prompt used in the example text indicates whether a command can or should be executed by a regular user, or whether it requires superuser permission to run. Also, the machine type is often included in the prompt to indicate whether the command should be run on a client node, on an MDS node, an OSS node, or the MGS node.

Some examples are listed below, but other prompt combinations are also used as needed for the example.

Shell                         Prompt
Regular user                  machine$
Superuser (root)              machine#
Regular user on the client    client$
Superuser on the MDS          mds#
Superuser on the OSS          oss#
Superuser on the MGS          mgs#

1.3. Related Documentation

Application: Latest information
  Title: Lustre Software Release 2.x Change Logs
  Format: Wiki page
  Location: Online at http://wiki.hpdd.intel.com/display/PUB/Documentation

Application: Service
  Title: Lustre Software Release 2.x Operations Manual
  Format: PDF, HTML
  Location: Online at http://wiki.hpdd.intel.com/display/PUB/Documentation

1.4. Documentation, Support, and Training

These web sites provide additional resources:

2. Revisions

The Lustre* File System Release 2.x Operations Manual is a community-maintained work. Versions of the manual are continually built as suggestions for changes and improvements arrive. Suggestions for improvements can be submitted through the ticketing system maintained at https://jira.hpdd.intel.com/browse/LUDOC. Instructions for providing a patch to the existing manual are available at: https://wiki.hpdd.intel.com/display/PUB/Making+changes+to+the+Lustre+Manual+source.

This manual currently covers all the 2.x Lustre software releases. Features that are specific to individual releases are identified within the table of contents using a shorthand notation (e.g., 'L 2.4' marks a feature specific to Lustre software release 2.4), and within the text using a distinct box. For example:

Introduced in Lustre 2.4

Lustre software release version 2.4 includes support for multiple metadata servers.

Which version?

The current version of Lustre that is in use on the client can be found using the command lctl get_param version, for example:

$ lctl get_param version
version=
lustre: 2.7.59
kernel: patchless_client
build:  v2_7_59_0-g703195a-CHANGED-3.10.0.lustreopa

Only the latest revision of this document is made readily available because changes are continually arriving. The current and latest revision of this manual is available from links maintained at: http://lustre.opensfs.org/documentation/.

Revision History
Revision 0: Built on 19 April 2017 14:57:14Z
Continuous build of Manual.

Part I. Introducing the Lustre* File System

Chapter 1. Understanding Lustre Architecture

This chapter describes the Lustre architecture and features of the Lustre file system. It includes the following sections:

1.1.  What a Lustre File System Is (and What It Isn't)

The Lustre architecture is a storage architecture for clusters. The central component of the Lustre architecture is the Lustre file system, which is supported on the Linux operating system and provides a POSIX* standard-compliant UNIX file system interface.

The Lustre storage architecture is used for many different kinds of clusters. It is best known for powering many of the largest high-performance computing (HPC) clusters worldwide, with tens of thousands of client systems, petabytes (PB) of storage and hundreds of gigabytes per second (GB/sec) of I/O throughput. Many HPC sites use a Lustre file system as a site-wide global file system, serving dozens of clusters.

The ability of a Lustre file system to scale capacity and performance for any need reduces the need to deploy many separate file systems, such as one for each compute cluster. Storage management is simplified by avoiding the need to copy data between compute clusters. In addition to aggregating storage capacity of many servers, the I/O throughput is also aggregated and scales with additional servers. Moreover, throughput and/or capacity can be easily increased by adding servers dynamically.

While a Lustre file system can function in many work environments, it is not necessarily the best choice for all applications. It is best suited for uses that exceed the capacity that a single server can provide, though in some use cases, a Lustre file system can perform better with a single server than other file systems due to its strong locking and data coherency.

A Lustre file system is currently not particularly well suited for "peer-to-peer" usage models where clients and servers are running on the same node, each sharing a small amount of storage, due to the lack of data replication at the Lustre software level. In such uses, if one client/server fails, then the data stored on that node will not be accessible until the node is restarted.

1.1.1.  Lustre Features

Lustre file systems run on a variety of vendors' kernels. For more details, see the Lustre Test Matrix in Section 8.1, “Preparing to Install the Lustre Software”.

A Lustre installation can be scaled up or down with respect to the number of client nodes, disk storage and bandwidth. Scalability and performance are dependent on available disk and network bandwidth and the processing power of the servers in the system. A Lustre file system can be deployed in a wide variety of configurations that can be scaled well beyond the size and performance observed in production systems to date.

Table 1.1, “Lustre File System Scalability and Performance” shows some of the scalability and performance characteristics of a Lustre file system. For a full list of Lustre file and filesystem limits see Table 5.2, “File and file system limits”.

Table 1.1. Lustre File System Scalability and Performance

Feature: Client Scalability
  Current Practical Range: 100-100000
  Known Production Usage: 50000+ clients, many in the 10000 to 20000 range

Feature: Client Performance
  Current Practical Range: Single client: I/O 90% of network bandwidth; Aggregate: 10 TB/sec I/O
  Known Production Usage: Single client: 4.5 GB/sec I/O (FDR IB, OPA1), 1000 metadata ops/sec; Aggregate: 2.5 TB/sec I/O

Feature: OSS Scalability
  Current Practical Range: Single OSS: 1-32 OSTs per OSS; Single OST: 300M objects, 128TB per OST (ldiskfs) or 500M objects, 256TB per OST (ZFS); OSS count: 1000 OSSs, with up to 4000 OSTs
  Known Production Usage: Single OSS: 32x 8TB OSTs per OSS (ldiskfs), 8x 32TB OSTs per OSS (ldiskfs), or 1x 72TB OST per OSS (ZFS); OSS count: 450 OSSs with 1000 4TB OSTs, 192 OSSs with 1344 8TB OSTs, 768 OSSs with 768 72TB OSTs

Feature: OSS Performance
  Current Practical Range: Single OSS: 15 GB/sec; Aggregate: 10 TB/sec
  Known Production Usage: Single OSS: 10 GB/sec; Aggregate: 2.5 TB/sec

Feature: MDS Scalability
  Current Practical Range: Single MDS: 1-4 MDTs per MDS; Single MDT: 4 billion files, 8TB per MDT (ldiskfs) or 64 billion files, 64TB per MDT (ZFS); MDS count: 1 primary + 1 standby (Introduced in Lustre 2.4: 256 MDSs, with up to 256 MDTs)
  Known Production Usage: Single MDS: 3 billion files; MDS count: 7 MDS with 7 2TB MDTs in production, 256 MDS with 256 64GB MDTs in testing

Feature: MDS Performance
  Current Practical Range: 50000/s create operations, 200000/s metadata stat operations
  Known Production Usage: 15000/s create operations, 50000/s metadata stat operations

Feature: File system Scalability
  Current Practical Range: Single File: 32 PB max file size (ldiskfs) or 2^63 bytes (ZFS); Aggregate: 512 PB space, 1 trillion files
  Known Production Usage: Single File: multi-TB max file size; Aggregate: 55 PB space, 8 billion files


Other Lustre software features are:

  • Performance-enhanced ext4 file system: The Lustre file system uses an improved version of the ext4 journaling file system to store data and metadata. This version, called ldiskfs, has been enhanced to improve performance and provide additional functionality needed by the Lustre file system.

  • Introduced in Lustre 2.4

    With the Lustre software release 2.4 and later, it is also possible to use ZFS as the backing filesystem for Lustre for the MDT, OST, and MGS storage. This allows Lustre to leverage the scalability and data integrity features of ZFS for individual storage targets.

  • POSIX standard compliance: The full POSIX test suite passes in an identical manner to a local ext4 file system, with limited exceptions on Lustre clients. In a cluster, most operations are atomic so that clients never see stale data or metadata. The Lustre software supports mmap() file I/O.

  • High-performance heterogeneous networking: The Lustre software supports a variety of high performance, low latency networks and permits Remote Direct Memory Access (RDMA) for InfiniBand* (utilizing OpenFabrics Enterprise Distribution (OFED*)), Intel OmniPath®, and other advanced networks for fast and efficient network transport. Multiple RDMA networks can be bridged using Lustre routing for maximum performance. The Lustre software also includes integrated network diagnostics.

  • High-availability: The Lustre file system supports active/active failover using shared storage partitions for OSS targets (OSTs). Lustre software release 2.3 and earlier releases offer active/passive failover using a shared storage partition for the MDS target (MDT). The Lustre file system can work with a variety of high availability (HA) managers to allow automated failover and has no single point of failure (NSPF). This allows application-transparent recovery. Multiple mount protection (MMP) provides integrated protection from errors in highly-available systems that would otherwise cause file system corruption.

  • Introduced in Lustre 2.4

    With Lustre software release 2.4 or later servers and clients, it is possible to configure active/active failover of multiple MDTs. This allows scaling the metadata performance of Lustre filesystems with the addition of MDT storage devices and MDS nodes.

  • Security: By default, TCP connections are only allowed from privileged ports. UNIX group membership is verified on the MDS.

  • Access control list (ACL), extended attributes: The Lustre security model follows that of a UNIX file system, enhanced with POSIX ACLs. Noteworthy additional features include root squash.

  • Interoperability: The Lustre file system runs on a variety of CPU architectures and mixed-endian clusters and is interoperable between successive major Lustre software releases.

  • Object-based architecture: Clients are isolated from the on-disk file structure, enabling upgrading of the storage architecture without affecting the client.

  • Byte-granular file and fine-grained metadata locking: Many clients can read and modify the same file or directory concurrently. The Lustre distributed lock manager (LDLM) ensures that files are coherent between all clients and servers in the file system. The MDT LDLM manages locks on inode permissions and pathnames. Each OST has its own LDLM for locks on file stripes stored thereon, which scales the locking performance as the file system grows.

  • Quotas: User and group quotas are available for a Lustre file system.

  • Capacity growth: The size of a Lustre file system and aggregate cluster bandwidth can be increased without interruption by adding new OSTs and MDTs to the cluster.

  • Controlled file layout: The layout of files across OSTs can be configured on a per-file, per-directory, or per-file-system basis. This allows file I/O to be tuned to specific application requirements within a single file system. The Lustre file system uses RAID-0 striping and balances space usage across OSTs.

  • Network data integrity protection: A checksum of all data sent from the client to the OSS protects against corruption during data transfer.

  • MPI I/O: The Lustre architecture has a dedicated MPI ADIO layer that optimizes parallel I/O to match the underlying file system architecture.

  • NFS and CIFS export: Lustre files can be re-exported using NFS (via Linux knfsd) or CIFS (via Samba), enabling them to be shared with non-Linux clients such as Microsoft* Windows* and Apple* Mac OS X*.

  • Disaster recovery tool: The Lustre file system provides an online distributed file system check (LFSCK) that can restore consistency between storage components in case of a major file system error. A Lustre file system can operate even in the presence of file system inconsistencies, and LFSCK can run while the filesystem is in use, so LFSCK is not required to complete before returning the file system to production.

  • Performance monitoring: The Lustre file system offers a variety of mechanisms to examine performance and tuning.

  • Open source: The Lustre software is licensed under the GPL 2.0 license for use with the Linux operating system.

1.2.  Lustre Components

An installation of the Lustre software includes a management server (MGS) and one or more Lustre file systems interconnected with Lustre networking (LNet).

A basic configuration of Lustre file system components is shown in Figure 1.1, “Lustre file system components in a basic cluster”.

Figure 1.1. Lustre file system components in a basic cluster

Lustre file system components in a basic cluster

1.2.1.  Management Server (MGS)

The MGS stores configuration information for all the Lustre file systems in a cluster and provides this information to other Lustre components. Each Lustre target contacts the MGS to provide information, and Lustre clients contact the MGS to retrieve information.

It is preferable that the MGS have its own storage space so that it can be managed independently. However, the MGS can be co-located and share storage space with an MDS as shown in Figure 1.1, “Lustre file system components in a basic cluster”.
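
For illustration, a minimal sketch of how an MGS might be set up on its own storage and how a later target can be pointed at it; the block devices, the file system name lustre, and the NID mgs@tcp0 are placeholders, not recommendations:

mgs# mkfs.lustre --mgs /dev/sda
mgs# mount -t lustre /dev/sda /mnt/mgs
mds# mkfs.lustre --fsname=lustre --mgsnode=mgs@tcp0 --mdt --index=0 /dev/sdb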

1.2.2. Lustre File System Components

Each Lustre file system consists of the following components:

  • Metadata Servers (MDS) - The MDS makes metadata stored in one or more MDTs available to Lustre clients. Each MDS manages the names and directories in the Lustre file system(s) and provides network request handling for one or more local MDTs.

  • Metadata Targets (MDT) - For Lustre software release 2.3 and earlier, each file system has one MDT. The MDT stores metadata (such as filenames, directories, permissions and file layout) on storage attached to an MDS. An MDT on a shared storage target can be available to multiple MDSs, although only one can access it at a time. If an active MDS fails, a standby MDS can serve the MDT and make it available to clients. This is referred to as MDS failover.

    Introduced in Lustre 2.4

    Since Lustre software release 2.4, multiple MDTs are supported in the Distributed Namespace Environment (DNE). In addition to the primary MDT that holds the filesystem root, it is possible to add additional MDS nodes, each with their own MDTs, to hold sub-directory trees of the filesystem.

    Introduced in Lustre 2.8

    Since Lustre software release 2.8, DNE also allows the filesystem to distribute files of a single directory over multiple MDT nodes. A directory which is distributed across multiple MDTs is known as a striped directory.

  • Object Storage Servers (OSS): The OSS provides file I/O service and network request handling for one or more local OSTs. Typically, an OSS serves between two and eight OSTs, up to 16 TB each. A typical configuration is an MDT on a dedicated node, two or more OSTs on each OSS node, and a client on each of a large number of compute nodes.

  • Object Storage Target (OST): User file data is stored in one or more objects, each object on a separate OST in a Lustre file system. The number of objects per file is configurable by the user and can be tuned to optimize performance for a given workload.

  • Lustre clients: Lustre clients are computational, visualization or desktop nodes that are running Lustre client software, allowing them to mount the Lustre file system.

The Lustre client software provides an interface between the Linux virtual file system and the Lustre servers. The client software includes a management client (MGC), a metadata client (MDC), and multiple object storage clients (OSCs), one corresponding to each OST in the file system.

A logical object volume (LOV) aggregates the OSCs to provide transparent access across all the OSTs. Thus, a client with the Lustre file system mounted sees a single, coherent, synchronized namespace. Several clients can write to different parts of the same file simultaneously, while, at the same time, other clients can read from the file.

A logical metadata volume (LMV) aggregates the MDCs to provide transparent access across all the MDTs in a similar manner as the LOV does for file access. This allows the client to see the directory tree on multiple MDTs as a single coherent namespace, and striped directories are merged on the clients to form a single visible directory to users and applications.
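
As a minimal sketch of how these client components come into play, mounting the file system on a client only requires naming the MGS and the file system; the MGC then retrieves the configuration, and the MDC, OSCs, LOV, and LMV are set up automatically. The NID mgs@tcp0, the file system name lustre, and the mount point below are placeholders:

client# mount -t lustre mgs@tcp0:/lustre /mnt/lustre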

Table 1.2, “Storage and hardware requirements for Lustre file system components” provides the requirements for attached storage for each Lustre file system component and describes desirable characteristics of the hardware used.

Table 1.2.  Storage and hardware requirements for Lustre file system components

MDSs
  Required attached storage: 1-2% of file system capacity
  Desirable hardware characteristics: Adequate CPU power, plenty of memory, fast disk storage.

OSSs
  Required attached storage: 1-128 TB per OST, 1-8 OSTs per OSS
  Desirable hardware characteristics: Good bus bandwidth. Recommended that storage be balanced evenly across OSSs and matched to network bandwidth.

Clients
  Required attached storage: No local storage needed
  Desirable hardware characteristics: Low latency, high bandwidth network.


For additional hardware requirements and considerations, see Chapter 5, Determining Hardware Configuration Requirements and Formatting Options.

1.2.3.  Lustre Networking (LNet)

Lustre Networking (LNet) is a custom networking API that provides the communication infrastructure that handles metadata and file I/O data for the Lustre file system servers and clients. For more information about LNet, see Chapter 2, Understanding Lustre Networking (LNet).

1.2.4.  Lustre Cluster

At scale, a Lustre file system cluster can include hundreds of OSSs and thousands of clients (see Figure 1.2, “ Lustre cluster at scale”). More than one type of network can be used in a Lustre cluster. Shared storage between OSSs enables failover capability. For more details about OSS failover, see Chapter 3, Understanding Failover in a Lustre File System.

Figure 1.2.  Lustre cluster at scale

Lustre file system cluster at scale

1.3.  Lustre File System Storage and I/O

In Lustre software release 2.0, Lustre file identifiers (FIDs) were introduced to replace UNIX inode numbers for identifying files or objects. A FID is a 128-bit identifier that contains a unique 64-bit sequence number, a 32-bit object ID (OID), and a 32-bit version number. The sequence number is unique across all Lustre targets in a file system (OSTs and MDTs). This change enabled future support for multiple MDTs (introduced in Lustre software release 2.4) and ZFS (introduced in Lustre software release 2.4).
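
A file's FID can be displayed from a client with the lfs utility, and a FID can be mapped back to a path name; the file name and FID below are illustrative only:

client$ lfs path2fid /mnt/lustre/testfile
[0x200000400:0x1:0x0]
client$ lfs fid2path /mnt/lustre [0x200000400:0x1:0x0]
/mnt/lustre/testfile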

Also introduced in release 2.0 is an ldiskfs feature named FID-in-dirent (also known as dirdata), in which the FID is stored as part of the name of the file in the parent directory. This feature significantly improves performance for ls command executions by reducing disk I/O. The FID-in-dirent is generated at the time the file is created.

Note

The FID-in-dirent feature is not backward compatible with the release 1.8 ldiskfs disk format. Therefore, when an upgrade from release 1.8 to release 2.x is performed, the FID-in-dirent feature is not automatically enabled. For upgrades from release 1.8 to releases 2.0 through 2.3, FID-in-dirent can be enabled manually but only takes effect for new files.

For more information about upgrading from Lustre software release 1.8 and enabling FID-in-dirent for existing files, see Chapter 16, Upgrading a Lustre File System.

Introduced in Lustre 2.4

The LFSCK file system consistency checking tool released with Lustre software release 2.4 provides functionality that enables FID-in-dirent for existing files. It includes the following functionality:

  • Generates IGIF mode FIDs for existing files from a 1.8 version file system.

  • Verifies the FID-in-dirent for each file and regenerates the FID-in-dirent if it is invalid or missing.

  • Verifies the linkEA entry for each file and regenerates the linkEA if it is invalid or missing. The linkEA consists of the file name and parent FID. It is stored as an extended attribute in the file itself. Thus, the linkEA can be used to reconstruct the full path name of a file.
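
As a sketch of how the LFSCK namespace checks described above might be started and monitored on an MDS (the target name lustre-MDT0000 is a placeholder for the actual MDT):

mds# lctl lfsck_start -M lustre-MDT0000 -t namespace
mds# lctl get_param mdd.lustre-MDT0000.lfsck_namespace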

Information about where file data is located on the OST(s) is stored as an extended attribute called layout EA in an MDT object identified by the FID for the file (see Figure 1.3, “Layout EA on MDT pointing to file data on OSTs”). If the file is a regular file (not a directory or symbolic link), the MDT object points to 1-to-N OST object(s) on the OST(s) that contain the file data. If the MDT file points to one object, all the file data is stored in that object. If the MDT file points to more than one object, the file data is striped across the objects using RAID 0, and each object is stored on a different OST. (For more information about how striping is implemented in a Lustre file system, see Section 1.3.1, “Lustre File System and Striping”.)

Figure 1.3. Layout EA on MDT pointing to file data on OSTs

Layout EA on MDT pointing to file data on OSTs

When a client wants to read from or write to a file, it first fetches the layout EA from the MDT object for the file. The client then uses this information to perform I/O on the file, directly interacting with the OSS nodes where the objects are stored. This process is illustrated in Figure 1.4, “Lustre client requesting file data”.

Figure 1.4. Lustre client requesting file data

Lustre client requesting file data

The available bandwidth of a Lustre file system is determined as follows:

  • The network bandwidth equals the aggregated bandwidth of the OSSs to the targets.

  • The disk bandwidth equals the sum of the disk bandwidths of the storage targets (OSTs) up to the limit of the network bandwidth.

  • The aggregate bandwidth equals the minimum of the disk bandwidth and the network bandwidth.

  • The available file system space equals the sum of the available space of all the OSTs.

1.3.1.  Lustre File System and Striping

One of the main factors leading to the high performance of Lustre file systems is the ability to stripe data across multiple OSTs in a round-robin fashion. Users can optionally configure for each file the number of stripes, stripe size, and OSTs that are used.

Striping can be used to improve performance when the aggregate bandwidth to a single file exceeds the bandwidth of a single OST. The ability to stripe is also useful when a single OST does not have enough free space to hold an entire file. For more information about benefits and drawbacks of file striping, see Section 18.2, “ Lustre File Layout (Striping) Considerations”.

Striping allows segments or 'chunks' of data in a file to be stored on different OSTs, as shown in Figure 1.5, “File striping on a Lustre file system”. In the Lustre file system, a RAID 0 pattern is used in which data is "striped" across a certain number of objects. The number of objects in a single file is called the stripe_count.

Each object contains a chunk of data from the file. When the chunk of data being written to a particular object exceeds the stripe_size, the next chunk of data in the file is stored on the next object.

Default values for stripe_count and stripe_size are set for the file system. The default value for stripe_count is 1 stripe per file and the default value for stripe_size is 1 MB. The user may change these values on a per-directory or per-file basis. For more details, see Section 18.3, “Setting the File Layout/Striping Configuration (lfs setstripe)”.
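
For example, the following is a sketch of setting a non-default layout on a directory so that new files created in it use a 4 MB stripe_size and a stripe_count of 3, then displaying the resulting layout. The mount point and directory name are hypothetical.

[client#] lfs setstripe -S 4M -c 3 /mnt/testfs/results
[client#] lfs getstripe /mnt/testfs/results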

In Figure 1.5, “File striping on a Lustre file system”, the stripe_size for File C is larger than the stripe_size for File A, allowing more data to be stored in a single stripe for File C. The stripe_count for File A is 3, resulting in data striped across three objects, while the stripe_count for File B and File C is 1.

No space is reserved on the OST for unwritten data. File A in Figure 1.5, “File striping on a Lustre file system” is a sparse file that is missing chunk 6.

Figure 1.5. File striping on a Lustre file system

File striping pattern across three OSTs for three different data files. The file is sparse and missing chunk 6.

The maximum file size is not limited by the size of a single target. In a Lustre file system, files can be striped across multiple objects (up to 2000), and each object can be up to 16 TB in size with ldiskfs, or up to 256PB with ZFS. This leads to a maximum file size of 31.25 PB for ldiskfs or 8EB with ZFS. Note that a Lustre file system can support files up to 2^63 bytes (8EB), limited only by the space available on the OSTs.

Note

Versions of the Lustre software prior to Release 2.2 limited the maximum stripe count for a single file to 160 OSTs.

Although a single file can only be striped over 2000 objects, Lustre file systems can have thousands of OSTs. The I/O bandwidth to access a single file is the aggregated I/O bandwidth to the objects in a file, which can be as much as the aggregate bandwidth of up to 2000 servers. On systems with more than 2000 OSTs, clients can do I/O using multiple files to utilize the full file system bandwidth.

For more information about striping, see Chapter 18, Managing File Layout (Striping) and Free Space.

Chapter 2. Understanding Lustre Networking (LNet)

This chapter introduces Lustre networking (LNet). It includes the following sections:

2.1.  Introducing LNet

In a cluster using one or more Lustre file systems, the network communication infrastructure required by the Lustre file system is implemented using the Lustre networking (LNet) feature.

LNet supports many commonly-used network types, such as InfiniBand and IP networks, and allows simultaneous availability across multiple network types with routing between them. Remote direct memory access (RDMA) is permitted when supported by underlying networks using the appropriate Lustre network driver (LND). High availability and recovery features enable transparent recovery in conjunction with failover servers.

An LND is a pluggable driver that provides support for a particular network type. For example, ksocklnd is the driver that implements the TCP Socket LND supporting TCP networks. LNDs are loaded into the driver stack, with one LND for each network type in use.

For information about configuring LNet, see Chapter 9, Configuring Lustre Networking (LNet).

For information about administering LNet, see Part III, “Administering Lustre”.

2.2. Key Features of LNet

Key features of LNet include:

  • RDMA, when supported by underlying networks

  • Support for many commonly-used network types

  • High availability and recovery

  • Support of multiple network types simultaneously

  • Routing among disparate networks

LNet permits end-to-end read/write throughput at or near peak bandwidth rates on a variety of network interconnects.

2.3. Lustre Networks

A Lustre network is composed of clients and servers running the Lustre software. It need not be confined to one LNet subnet, but can span several networks provided routing is possible between the networks. In a similar manner, a single network can have multiple LNet subnets.

The Lustre networking stack is composed of two layers, the LNet code module and the LND. The LNet layer operates above the LND layer in a manner similar to the way the network layer operates above the data link layer. The LNet layer is connectionless, asynchronous, and does not verify that data has been transmitted, while the LND layer is connection-oriented and typically does verify data transmission.

LNets are uniquely identified by a label composed of a string corresponding to an LND and a number, such as tcp0, o2ib0, or o2ib1, that uniquely identifies each LNet. Each node on an LNet has at least one network identifier (NID). A NID is a combination of the address of the network interface and the LNet label in the form: address@LNet_label.

Examples:

192.168.1.2@tcp0
10.13.24.90@o2ib1
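
On a node where LNet is configured, the locally configured NIDs can be listed with the lctl utility. The command below is a minimal sketch; the output shown is illustrative only.

[client#] lctl list_nids
192.168.1.2@tcp0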

In certain circumstances it might be desirable for Lustre file system traffic to pass between multiple LNets. This is possible using LNet routing. It is important to realize that LNet routing is not the same as network routing. For more details about LNet routing, see Chapter 9, Configuring Lustre Networking (LNet).

2.4. Supported Network Types

The LNet code module includes LNDs to support many network types including:

  • InfiniBand: OpenFabrics OFED (o2ib)

  • TCP (any network carrying TCP traffic, including GigE, 10GigE, and IPoIB)

  • Cray: Seastar

  • Myrinet: MX

  • RapidArray: ra

  • Quadrics: Elan

Chapter 3. Understanding Failover in a Lustre File System

This chapter describes failover in a Lustre file system. It includes:

3.1.  What is Failover?

In a high-availability (HA) system, unscheduled downtime is minimized by using redundant hardware and software components, together with software that automates recovery when a failure occurs. If a failure condition occurs, such as the loss of a server or storage device or a network or software fault, the system's services continue with minimal interruption. Generally, availability is specified as the percentage of time the system is required to be available.

Availability is accomplished by replicating hardware and/or software so that when a primary server fails or is unavailable, a standby server can be switched into its place to run applications and associated resources. This process, called failover, is automatic in an HA system and, in most cases, completely application-transparent.

A failover hardware setup requires a pair of servers with a shared resource (typically a physical storage device, which may be based on SAN, NAS, hardware RAID, SCSI or Fibre Channel (FC) technology). The method of sharing storage should be essentially transparent at the device level; the same physical logical unit number (LUN) should be visible from both servers. To ensure high availability at the physical storage level, we encourage the use of RAID arrays to protect against drive-level failures.

Note

The Lustre software does not provide redundancy for data; it depends exclusively on redundancy of backing storage devices. The backing OST storage should be RAID 5 or, preferably, RAID 6 storage. MDT storage should be RAID 1 or RAID 10.

3.1.1.  Failover Capabilities

To establish a highly-available Lustre file system, power management software or hardware and high availability (HA) software are used to provide the following failover capabilities:

  • Resource fencing - Protects physical storage from simultaneous access by two nodes.

  • Resource management - Starts and stops the Lustre resources as a part of failover, maintains the cluster state, and carries out other resource management tasks.

  • Health monitoring - Verifies the availability of hardware and network resources and responds to health indications provided by the Lustre software.

These capabilities can be provided by a variety of software and/or hardware solutions. For more information about using power management software or hardware and high availability (HA) software with a Lustre file system, see Chapter 11, Configuring Failover in a Lustre File System.

HA software is responsible for detecting failure of the primary Lustre server node and controlling the failover. The Lustre software works with any HA software that includes resource (I/O) fencing. For proper resource fencing, the HA software must be able to completely power off the failed server or disconnect it from the shared storage device. If two active nodes have access to the same storage device, data may be severely corrupted.

3.1.2.  Types of Failover Configurations

Nodes in a cluster can be configured for failover in several ways. They are often configured in pairs (for example, two OSSs attached to a shared storage device), but other failover configurations are also possible. Failover configurations include:

  • Active/passive pair - In this configuration, the active node provides resources and serves data, while the passive node is usually standing by idle. If the active node fails, the passive node takes over and becomes active.

  • Active/active pair - In this configuration, both nodes are active, each providing a subset of resources. In case of a failure, the second node takes over resources from the failed node.

In Lustre software releases prior to release 2.4, MDSs can be configured as an active/passive pair, while OSSs can be deployed in an active/active configuration that provides redundancy without extra overhead. Often the standby MDS is the active MDS for another Lustre file system or the MGS, so no nodes are idle in the cluster.

Introduced in Lustre 2.4

Lustre software release 2.4 introduces metadata targets for individual sub-directories. Active-active failover configurations are available for MDSs that serve MDTs on shared storage.

3.2.  Failover Functionality in a Lustre File System

The failover functionality provided by the Lustre software can be used for the following failover scenario. When a client attempts to do I/O to a failed Lustre target, it continues to try until it receives an answer from any of the configured failover nodes for the Lustre target. A user-space application does not detect anything unusual, except that the I/O may take longer to complete.

Failover in a Lustre file system requires that two nodes be configured as a failover pair, which must share one or more storage devices. A Lustre file system can be configured to provide MDT or OST failover.

  • For MDT failover, two MDSs can be configured to serve the same MDT. Only one MDS node can serve an MDT at a time.

    Introduced in Lustre 2.4

    Lustre software release 2.4 allows multiple MDTs. By placing two or more MDT partitions on storage shared by two MDSs, one MDS can fail and the remaining MDS can begin serving the unserved MDT. This is described as an active/active failover pair.

  • For OST failover, multiple OSS nodes can be configured to be able to serve the same OST. However, only one OSS node can serve the OST at a time. An OST can be moved between OSS nodes that have access to the same storage device using umount/mount commands.

The --servicenode option is used to set up nodes in a Lustre file system for failover at creation time (using mkfs.lustre) or later when the Lustre file system is active (using tunefs.lustre). For explanations of these utilities, see Section 38.14, “mkfs.lustre” and Section 38.18, “tunefs.lustre”.
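
For example, the following is a sketch of formatting an OST so that either of two OSS nodes can serve it; the file system name, NIDs, OST index, and device path are hypothetical. The same --servicenode option can also be added to an existing target with tunefs.lustre.

[oss#] mkfs.lustre --fsname=testfs --ost --index=0 --mgsnode=10.2.0.1@tcp0 \
       --servicenode=10.2.0.5@tcp0 --servicenode=10.2.0.6@tcp0 /dev/sdb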

Failover capability in a Lustre file system can be used to upgrade the Lustre software between successive minor versions without cluster downtime. For more information, see Chapter 16, Upgrading a Lustre File System.

For information about configuring failover, see Chapter 11, Configuring Failover in a Lustre File System.

Note

The Lustre software provides failover functionality only at the file system level. In a complete failover solution, failover functionality for system-level components, such as node failure detection or power control, must be provided by a third-party tool.

Caution

OST failover functionality does not protect against corruption caused by a disk failure. If the storage media (i.e., physical disk) used for an OST fails, it cannot be recovered by functionality provided in the Lustre software. We strongly recommend that some form of RAID be used for OSTs. Lustre functionality assumes that the storage is reliable, so it adds no extra reliability features.

3.2.1.  MDT Failover Configuration (Active/Passive)

Two MDSs are typically configured as an active/passive failover pair as shown in Figure 3.1, “Lustre failover configuration for an active/passive MDT”. Note that both nodes must have access to shared storage for the MDT(s) and the MGS. The primary (active) MDS manages the Lustre system metadata resources. If the primary MDS fails, the secondary (passive) MDS takes over these resources and serves the MDTs and the MGS.

Note

In an environment with multiple file systems, the MDSs can be configured in a quasi active/active configuration, with each MDS managing metadata for a subset of the Lustre file systems.

Figure 3.1. Lustre failover configuration for an active/passive MDT

Lustre failover configuration for an MDT

Introduced in Lustre 2.4

3.2.2.  MDT Failover Configuration (Active/Active)

Multiple MDTs became available with the advent of Lustre software release 2.4. MDTs can be set up as an active/active failover configuration. A failover cluster is built from two MDSs as shown in Figure 3.2, “Lustre failover configuration for active/active MDTs”.

Figure 3.2. Lustre failover configuration for active/active MDTs

Lustre failover configuration for two MDTs

3.2.3.  OST Failover Configuration (Active/Active)

OSTs are usually configured in a load-balanced, active/active failover configuration. A failover cluster is built from two OSSs as shown in Figure 3.3, “Lustre failover configuration for OSTs”.

Note

OSSs configured as a failover pair must have shared disks/RAID.

Figure 3.3. Lustre failover configuration for OSTs

Lustre failover configuration for OSTs

In an active/active configuration, 50% of the available OSTs are assigned to one OSS and the remaining OSTs are assigned to the other OSS. Each OSS serves as the primary node for half the OSTs and as a failover node for the remaining OSTs.

In this mode, if one OSS fails, the other OSS takes over all of the failed OSTs. The clients attempt to connect to each OSS serving the OST, until one of them responds. Data on the OST is written synchronously, and the clients replay transactions that were in progress and uncommitted to disk before the OST failure.

For more information about configuring failover, see Chapter 11, Configuring Failover in a Lustre File System.

Part II. Installing and Configuring Lustre

Part II describes how to install and configure a Lustre file system. You will find information in this section about:

Table of Contents

4. Installation Overview
4.1. Steps to Installing the Lustre Software
5. Determining Hardware Configuration Requirements and Formatting Options
5.1. Hardware Considerations
5.1.1. MGT and MDT Storage Hardware Considerations
5.1.2. OST Storage Hardware Considerations
5.2. Determining Space Requirements
5.2.1. Determining MGT Space Requirements
5.2.2. Determining MDT Space Requirements
5.2.3. Determining OST Space Requirements
5.3. Setting ldiskfs File System Formatting Options
5.3.1. Setting Formatting Options for an ldiskfs MDT
5.3.2. Setting Formatting Options for an ldiskfs OST
5.3.3. File and File System Limits
5.4. Determining Memory Requirements
5.4.1. Client Memory Requirements
5.4.2. MDS Memory Requirements
5.4.3. OSS Memory Requirements
5.5. Implementing Networks To Be Used by the Lustre File System
6. Configuring Storage on a Lustre File System
6.1. Selecting Storage for the MDT and OSTs
6.1.1. Metadata Target (MDT)
6.1.2. Object Storage Target (OST)
6.2. Reliability Best Practices
6.3. Performance Tradeoffs
6.4. Formatting Options for ldiskfs RAID Devices
6.4.1. Computing file system parameters for mkfs
6.4.2. Choosing Parameters for an External Journal
6.5. Connecting a SAN to a Lustre File System
7. Setting Up Network Interface Bonding
7.1. Network Interface Bonding Overview
7.2. Requirements
7.3. Bonding Module Parameters
7.4. Setting Up Bonding
7.4.1. Examples
7.5. Configuring a Lustre File System with Bonding
7.6. Bonding References
8. Installing the Lustre Software
8.1. Preparing to Install the Lustre Software
8.1.1. Software Requirements
8.1.2. Environmental Requirements
8.2. Lustre Software Installation Procedure
9. Configuring Lustre Networking (LNet)
9.1. Configuring LNet via lnetctl (introduced in Lustre 2.7)
9.1.1. Configuring LNet
9.1.2. Adding, Deleting and Showing networks
9.1.3. Adding, Deleting and Showing routes
9.1.4. Enabling and Disabling Routing
9.1.5. Showing routing information
9.1.6. Configuring Routing Buffers
9.1.7. Importing YAML Configuration File
9.1.8. Exporting Configuration in YAML format
9.1.9. Showing LNet Traffic Statistics
9.1.10. YAML Syntax
9.2. Overview of LNet Module Parameters
9.2.1. Using a Lustre Network Identifier (NID) to Identify a Node
9.3. Setting the LNet Module networks Parameter
9.3.1. Multihome Server Example
9.4. Setting the LNet Module ip2nets Parameter
9.5. Setting the LNet Module routes Parameter
9.5.1. Routing Example
9.6. Testing the LNet Configuration
9.7. Configuring the Router Checker
9.8. Best Practices for LNet Options
9.8.1. Escaping commas with quotes
9.8.2. Including comments
10. Configuring a Lustre File System
10.1. Configuring a Simple Lustre File System
10.1.1. Simple Lustre Configuration Example
10.2. Additional Configuration Options
10.2.1. Scaling the Lustre File System
10.2.2. Changing Striping Defaults
10.2.3. Using the Lustre Configuration Utilities
11. Configuring Failover in a Lustre File System
11.1. Setting Up a Failover Environment
11.1.1. Selecting Power Equipment
11.1.2. Selecting Power Management Software
11.1.3. Selecting High-Availability (HA) Software
11.2. Preparing a Lustre File System for Failover
11.3. Administering Failover in a Lustre File System

Chapter 4. Installation Overview

This chapter provides an overview of the procedures required to set up, install, and configure a Lustre file system.

Note

If the Lustre file system is new to you, you may find it helpful to refer to Part I, “Introducing the Lustre* File System” for a description of the Lustre architecture, file system components and terminology before proceeding with the installation procedure.

4.1.  Steps to Installing the Lustre Software

To set up Lustre file system hardware and install and configure the Lustre software, refer to the chapters below in the order listed:

  1. (Required) Set up your Lustre file system hardware.

    See Chapter 5, Determining Hardware Configuration Requirements and Formatting Options - Provides guidelines for configuring hardware for a Lustre file system including storage, memory, and networking requirements.

  2. (Optional - Highly Recommended) Configure storage on Lustre storage devices.

    See Chapter 6, Configuring Storage on a Lustre File System - Provides instructions for setting up hardware RAID on Lustre storage devices.

  3. (Optional) Set up network interface bonding.

    See Chapter 7, Setting Up Network Interface Bonding - Describes setting up network interface bonding to allow multiple network interfaces to be used in parallel to increase bandwidth or redundancy.

  4. (Required) Install Lustre software.

    See Chapter 8, Installing the Lustre Software - Describes preparation steps and a procedure for installing the Lustre software.

  5. (Optional) Configure Lustre Networking (LNet).

    See Chapter 9, Configuring Lustre Networking (LNet) - Describes how to configure LNet if the default configuration is not sufficient. By default, LNet will use the first TCP/IP interface it discovers on a system. LNet configuration is required if you are using InfiniBand or multiple Ethernet interfaces.

  6. (Required) Configure the Lustre file system.

    See Chapter 10, Configuring a Lustre File System - Provides an example of a simple Lustre configuration procedure and points to tools for completing more complex configurations.

  7. (Optional) Configure Lustre failover.

    See Chapter 11, Configuring Failover in a Lustre File System - Describes how to configure Lustre failover.

Chapter 5. Determining Hardware Configuration Requirements and Formatting Options

This chapter describes hardware configuration requirements for a Lustre file system including:

5.1.  Hardware Considerations

A Lustre file system can utilize any kind of block storage device such as single disks, software RAID, hardware RAID, or a logical volume manager. In contrast to some networked file systems, the block devices are only attached to the MDS and OSS nodes in a Lustre file system and are not accessed by the clients directly.

Since the block devices are accessed by only one or two server nodes, a storage area network (SAN) that is accessible from all the servers is not required. Expensive switches are not needed because point-to-point connections between the servers and the storage arrays normally provide the simplest and best attachments. (If failover capability is desired, the storage must be attached to multiple servers.)

For a production environment, it is preferable that the MGS have separate storage to allow future expansion to multiple file systems. However, it is possible to run the MDS and MGS on the same machine and have them share the same storage device.

For best performance in a production environment, dedicated clients are required. For a non-production Lustre environment or for testing, a Lustre client and server can run on the same machine. However, dedicated clients are the only supported configuration.

Warning

Performance and recovery issues can occur if you put a client on an MDS or OSS:

  • Running the OSS and a client on the same machine can cause issues with low memory and memory pressure. If the client consumes all the memory and then tries to write data to the file system, the OSS will need to allocate pages to receive data from the client but will not be able to perform this operation due to low memory. This can cause the client to hang.

  • Running the MDS and a client on the same machine can cause recovery and deadlock issues and impact the performance of other Lustre clients.

Only servers running on 64-bit CPUs are tested and supported. 64-bit CPU clients are typically used for testing to match expected customer usage and to avoid limitations due to the 4 GB limit for RAM size, 1 GB low-memory limitation, and 16 TB file size limit of 32-bit CPUs. Also, due to kernel API limitations, performing backups of Lustre software release 2.x file systems on 32-bit clients may cause backup tools to confuse files that have the same 32-bit inode number.

The storage attached to the servers typically uses RAID to provide fault tolerance and can optionally be organized with logical volume management (LVM), which is then formatted as a Lustre file system. Lustre OSS and MDS servers read, write and modify data in the format imposed by the file system.

The Lustre file system uses journaling file system technology on both the MDTs and OSTs. For an MDT, as much as a 20 percent performance gain can be obtained by placing the journal on a separate device.

The MDS can effectively utilize a lot of CPU cycles. A minimum of four processor cores is recommended. More are advisable for file systems with many clients.

Note

Lustre clients running on architectures with different endianness are supported. One limitation is that the PAGE_SIZE kernel macro on the client must be as large as the PAGE_SIZE of the server. In particular, ia64 or PPC clients with large pages (up to 64kB pages) can run with x86 servers (4kB pages). If you are running x86 clients with ia64 or PPC servers, you must compile the ia64 kernel with a 4kB PAGE_SIZE (so the server page size is not larger than the client page size).

5.1.1.  MGT and MDT Storage Hardware Considerations

MGT storage requirements are small (less than 100 MB even in the largest Lustre file systems), and the data on an MGT is only accessed on a server/client mount, so disk performance is not a consideration. However, this data is vital for file system access, so the MGT should be reliable storage, preferably mirrored RAID1.

MDS storage is accessed in a database-like access pattern with many seeks and reads and writes of small amounts of data. High throughput to MDS storage is not important. Storage types that provide much lower seek times, such as high-RPM SAS or SSD drives, can be used for the MDT.

For maximum performance, the MDT should be configured as RAID1 with an internal journal and two disks from different controllers.

If you need a larger MDT, create multiple RAID1 devices from pairs of disks, and then make a RAID0 array of the RAID1 devices. This ensures maximum reliability because multiple disk failures only have a small chance of hitting both disks in the same RAID1 device.

Doing the opposite (RAID1 of a pair of RAID0 devices) has a 50% chance that even two disk failures can cause the loss of the whole MDT device. The first failure disables an entire half of the mirror and the second failure has a 50% chance of disabling the remaining mirror.

Introduced in Lustre 2.4

If multiple MDTs are going to be present in the system, each MDT should be specified for the anticipated usage and load. For details on how to add additional MDTs to the filesystem, see Section 14.6, “Adding a New MDT to a Lustre File System”.

Introduced in Lustre 2.4

Warning

MDT0 contains the root of the Lustre file system. If MDT0 is unavailable for any reason, the file system cannot be used.

Introduced in Lustre 2.4

Note

Using the DNE feature it is possible to dedicate additional MDTs to sub-directories off the file system root directory stored on MDT0, or arbitrarily for lower-level subdirectories, using the lfs mkdir -i mdt_index command. If an MDT serving a subdirectory becomes unavailable, any subdirectories on that MDT and all directories beneath it will also become inaccessible. Configuring multiple levels of MDTs is an experimental feature for the 2.4 release, and is fully functional in the 2.8 release. This is typically useful for top-level directories to assign different users or projects to separate MDTs, or to distribute other large working sets of files to multiple MDTs.
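
For illustration only (the mount point, directory name, and MDT index are hypothetical), a remote directory served by a secondary MDT could be created as follows:

[client#] lfs mkdir -i 1 /mnt/testfs/project1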

Introduced in Lustre 2.8

Note

Starting in the 2.8 release it is possible to spread a single large directory across multiple MDTs using the DNE striped directory feature by specifying multiple stripes (or shards) at creation time using the lfs mkdir -c stripe_count command, where stripe_count is often the number of MDTs in the filesystem. Striped directories should typically not be used for all directories in the filesystem, since this incurs extra overhead compared to non-striped directories, but they are useful for larger directories (over 50k entries) where many output files are being created at one time.
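
As a sketch (the path and stripe count are illustrative), a directory striped across two MDTs could be created as follows:

[client#] lfs mkdir -c 2 /mnt/testfs/shared_output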

5.1.2. OST Storage Hardware Considerations

The data access pattern for the OSS storage is a streaming I/O pattern that is dependent on the access patterns of applications being used. Each OSS can manage multiple object storage targets (OSTs), one for each volume, with I/O traffic load-balanced between servers and targets. An OSS should be configured to have a balance between the network bandwidth and the attached storage bandwidth to prevent bottlenecks in the I/O path. Depending on the server hardware, an OSS typically serves between 2 and 8 targets, with each target between 24 and 48 TB, but a target may be up to 256 terabytes (TB) in size.

Lustre file system capacity is the sum of the capacities provided by the targets. For example, 64 OSSs, each with two 8 TB OSTs, provide a file system with a capacity of nearly 1 PB. If each OST uses ten 1 TB SATA disks (8 data disks plus 2 parity disks in a RAID-6 configuration), it may be possible to get 50 MB/sec from each drive, providing up to 400 MB/sec of disk bandwidth per OST. If this system is used as storage backend with a system network, such as the InfiniBand network, that provides a similar bandwidth, then each OSS could provide 800 MB/sec of end-to-end I/O throughput. (Although the architectural constraints described here are simple, in practice it takes careful hardware selection, benchmarking and integration to obtain such results.)

5.2.  Determining Space Requirements

The desired performance characteristics of the backing file systems on the MDT and OSTs are independent of one another. The size of the MDT backing file system depends on the number of inodes needed in the total Lustre file system, while the aggregate OST space depends on the total amount of data stored on the file system. If MGS data is to be stored on the MDT device (co-located MGT and MDT), add 100 MB to the required size estimate for the MDT.

Each time a file is created on a Lustre file system, it consumes one inode on the MDT and one inode for each OST object over which the file is striped. Normally, each file's stripe count is based on the system-wide default stripe count. However, this can be changed for individual files using the lfs setstripe option. For more details, see Chapter 18, Managing File Layout (Striping) and Free Space.

In a Lustre ldiskfs file system, all the MDT inodes and OST objects are allocated when the file system is first formatted. When the file system is in use and a file is created, metadata associated with that file is stored in one of the pre-allocated inodes and does not consume any of the free space used to store file data. The total number of inodes on a formatted ldiskfs MDT or OST cannot be easily changed. Thus, the number of inodes created at format time should be generous enough to cover near-term expected usage, with some room for growth, to avoid the effort of adding more storage later.

By default, the ldiskfs file system used by Lustre servers to store user-data objects and system data reserves 5% of space that cannot be used by the Lustre file system. Additionally, an ldiskfs Lustre file system reserves up to 400 MB on each OST, and up to 4GB on each MDT for journal use and a small amount of space outside the journal to store accounting data. This reserved space is unusable for general storage. Thus, at least this much space will be used per OST before any file object data is saved.

Introduced in Lustre 2.4

With a ZFS backing filesystem for the MDT or OST, the space allocation for inodes and file data is dynamic, and inodes are allocated as needed. A minimum of 4kB of usable space (before mirroring) is needed for each inode, exclusive of other overhead such as directories, internal log files, extended attributes, ACLs, etc. ZFS also reserves approximately 3% of the total storage space for internal and redundant metadata, which is not usable by Lustre. Since the size of extended attributes and ACLs is highly dependent on kernel versions and site-specific policies, it is best to over-estimate the amount of space needed for the desired number of inodes, and any excess space will be utilized to store more inodes.

5.2.1.  Determining MGT Space Requirements

Less than 100 MB of space is typically required for the MGT. The size is determined by the total number of servers in the Lustre file system cluster(s) that are managed by the MGS.

5.2.2.  Determining MDT Space Requirements

When calculating the MDT size, the important factor to consider is the number of files to be stored in the file system, which requires at least 4 KiB of usable space on the MDT per inode. Since MDTs typically use RAID-1+0 mirroring, the total storage needed will be double this.

Please note that the actual used space per MDT depends on the number of files per directory, the number of stripes per file, whether files have ACLs or user xattrs, and the number of hard links per file. The storage required for Lustre file system metadata is typically 1-2 percent of the total file system capacity depending upon file size.

For example, if the average file size is 5 MB and you have 500 TB of usable OST space, then you can calculate the minimum total number of inodes needed for the MDT and OSTs as follows:

(500 TB * 1000000 MB/TB) / 5 MB/inode = 100M inodes

For details about formatting options for ldiskfs MDT and OST file systems, see Section 5.3.1, “Setting Formatting Options for an ldiskfs MDT”.

It is recommended that the MDT have at least twice the minimum number of inodes to allow for future expansion and allow for an average file size smaller than expected. Thus, the minimum space for an ldiskfs MDT should be approximately:

2 KiB/inode x 100 million inodes x 2 = 400 GiB ldiskfs MDT

Note

If the average file size is very small, 4 KB for example, the MDT will use as much space for each file as the space used on the OST. However, this is an uncommon usage for a Lustre filesystem.

Note

If the MDT has too few inodes, this can cause the space on the OSTs to be inaccessible since no new files can be created. Be sure to determine the appropriate size of the MDT needed to support the file system before formatting the file system. It is possible to increase the number of inodes after the file system is formatted, depending on the storage. For ldiskfs MDT filesystems the resize2fs tool can be used if the underlying block device is on an LVM logical volume and the underlying logical volume size can be increased. For ZFS, new (mirrored) VDEVs can be added to the MDT pool to increase the total space available for inode storage. Inodes will be added approximately in proportion to space added.
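
The following is only a rough sketch of the LVM-based approach mentioned above, assuming hypothetical names for the MDT mount point, volume group, and logical volume. The MDT must be unmounted (and should be backed up) before resizing, and the Lustre-patched e2fsprogs should be used.

[mds#] umount /mnt/mdt0
[mds#] lvextend -L +500G /dev/vg_mdt/lv_mdt0
[mds#] e2fsck -f /dev/vg_mdt/lv_mdt0
[mds#] resize2fs /dev/vg_mdt/lv_mdt0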

Introduced in Lustre 2.4

Note

Note that the number of total and free inodes reported by lfs df -i for ZFS MDTs and OSTs is estimated based on the current average space used per inode. When a ZFS filesystem is first formatted, this free inode estimate will be very conservative (low) due to the high ratio of directories to regular files created for internal Lustre metadata storage, but this estimate will improve as more files are created by regular users and the average file size will better reflect actual site usage.

Introduced in Lustre 2.4

Note

Starting in release 2.4, using the DNE remote directory feature it is possible to increase the total number of inodes of a Lustre filesystem, as well as increasing the aggregate metadata performance, by configuring additional MDTs into the filesystem, see Section 14.6, “Adding a New MDT to a Lustre File System” for details.

5.2.3.  Determining OST Space Requirements

For the OST, the amount of space taken by each object depends on the usage pattern of the users/applications running on the system. The Lustre software defaults to a conservative estimate for the average object size (between 64KB per object for 10GB OSTs, and 1MB per object for 16TB and larger OSTs). If you are confident that the average file size for your applications will be larger than this, you can specify a larger average file size (fewer total inodes for a given OST size) to reduce file system overhead and minimize file system check time. See Section 5.3.2, “Setting Formatting Options for an ldiskfs OST” for more details.

5.3.  Setting ldiskfs File System Formatting Options

By default, the mkfs.lustre utility applies these options to the Lustre backing file system used to store data and metadata in order to enhance Lustre file system performance and scalability. These options include:

  • flex_bg - When the flag is set to enable this flexible-block-groups feature, block and inode bitmaps for multiple groups are aggregated to minimize seeking when bitmaps are read or written and to reduce read/modify/write operations on typical RAID storage (with 1 MB RAID stripe widths). This flag is enabled on both OST and MDT file systems. On MDT file systems the flex_bg factor is left at the default value of 16. On OSTs, the flex_bg factor is set to 256 to allow all of the block or inode bitmaps in a single flex_bg to be read or written in a single I/O on typical RAID storage.

  • huge_file - Setting this flag allows files on OSTs to be larger than 2 TB in size.

  • lazy_journal_init - This extended option is enabled to prevent a full overwrite of the 400 MB journal that is allocated by default in a Lustre file system, which reduces the file system format time.

To override the default formatting options, use arguments to mkfs.lustre to pass formatting options to the backing file system:

--mkfsoptions='backing fs options'

For other mkfs.lustre options, see the Linux man page for mke2fs(8).

5.3.1. Setting Formatting Options for an ldiskfs MDT

The number of inodes on the MDT is determined at format time based on the total size of the file system to be created. The default bytes-per-inode ratio ("inode ratio") for an MDT is optimized at one inode for every 2048 bytes of file system space. It is recommended that this value not be changed for MDTs.

This setting takes into account the space needed for additional ldiskfs filesystem-wide metadata, such as the journal (up to 4 GB), bitmaps, and directories, as well as files that Lustre uses internally to maintain cluster consistency. There is additional per-file metadata such as file layout for files with a large number of stripes, Access Control Lists (ACLs), and user extended attributes.

It is possible to reserve less than the recommended 2048 bytes per inode for an ldiskfs MDT when it is first formatted by adding the --mkfsoptions="-i bytes-per-inode" option to mkfs.lustre. Decreasing the inode ratio tunable bytes-per-inode will create more inodes for a given MDT size, but will leave less space for extra per-file metadata. The inode ratio must always be strictly larger than the MDT inode size, which is 512 bytes by default. It is recommended to use an inode ratio at least 512 bytes larger than the inode size to ensure the MDT does not run out of space.
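
For example, the following is a hedged sketch of formatting a combined MGS/MDT with a 1024-byte inode ratio instead of the 2048-byte default; the file system name, index, and device path are illustrative, and the chosen ratio must still respect the constraints described above.

[mds#] mkfs.lustre --mgs --mdt --fsname=testfs --index=0 --mkfsoptions="-i 1024" /dev/sda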

The size of the inode may be changed by adding the --stripe-count-hint=N option to have mkfs.lustre automatically calculate a reasonable inode size based on the default stripe count that will be used by the filesystem, or directly by specifying the --mkfsoptions="-I inode-size" option. Increasing the inode size will provide more space in the inode for a larger Lustre file layout, ACLs, user and system extended attributes, SELinux and other security labels, and other internal metadata. However, if these features or other in-inode xattrs are not needed, the larger inode size will hurt metadata performance as 2x, 4x, or 8x as much data would be read or written for each MDT inode access.

5.3.2. Setting Formatting Options for an ldiskfs OST

When formatting an OST file system, it can be beneficial to take local file system usage into account. When doing so, try to reduce the number of inodes on each OST, while keeping enough margin for potential variations in future usage. This helps reduce the format and file system check time and makes more space available for data.

The table below shows the default bytes-per-inode ratio ("inode ratio") used for OSTs of various sizes when they are formatted.

Table 5.1. Default Inode Ratios Used for Newly Formatted OSTs

LUN/OST size    Default Inode ratio    Total inodes
under 10GB      1 inode/16KB           640 - 655k
10GB - 1TB      1 inode/68KiB          153k - 15.7M
1TB - 8TB       1 inode/256KB          4.2M - 33.6M
over 8TB        1 inode/1MB            8.4M - 134M


In environments with few small files, the default inode ratio may result in far too many inodes for the average file size. In this case, performance can be improved by increasing the number of bytes-per-inode. To set the inode ratio, use the --mkfsoptions="-i bytes-per-inode" argument to mkfs.lustre to specify the expected average (mean) size of OST objects. For example, to create an OST with an expected average object size of 8MB run:

[oss#] mkfs.lustre --ost --mkfsoptions="-i $((8192 * 1024))" ...

Note

OSTs formatted with ldiskfs are limited to a maximum of 320 million to 1 billion objects. Specifying a very small bytes-per-inode ratio for a large OST that causes this limit to be exceeded can either cause premature out-of-space errors that prevent the full OST space from being used, or waste space and slow down e2fsck more than necessary. The default inode ratios are chosen to ensure that the total number of inodes remains below this limit.

Note

File system check time on OSTs is affected by a number of variables in addition to the number of inodes, including the size of the file system, the number of allocated blocks, the distribution of allocated blocks on the disk, disk speed, CPU speed, and the amount of RAM on the server. Reasonable file system check times for valid filesystems are 5-30 minutes per TB, but may increase significantly if substantial errors are detected and need to be repaired.

For more details about formatting MDT and OST file systems, see Section 6.4, “ Formatting Options for ldiskfs RAID Devices”.

5.3.3. File and File System Limits

Table 5.2, “File and file system limits” describes current known limits of Lustre. These limits are imposed by either the Lustre architecture or the Linux virtual file system (VFS) and virtual memory subsystems. In a few cases, a limit is defined within the code and can be changed by re-compiling the Lustre software. Instructions to install from source code are beyond the scope of this document, and can be found elsewhere online. In these cases, the indicated limit was used for testing of the Lustre software.

Table 5.2. File and file system limits

Limit

Value

Description

Maximum number of MDTs

Introduced in Lustre 2.4

256

The Lustre software release 2.3 and earlier allows a maximum of 1 MDT per file system, but a single MDS can host multiple MDTs, each one for a separate file system.

Introduced in Lustre 2.4

The Lustre software release 2.4 and later requires one MDT for the filesystem root. At least 255 more MDTs can be added to the filesystem and attached into the namespace with DNE remote or striped directories.

Maximum number of OSTs

8150

The maximum number of OSTs is a constant that can be changed at compile time. Lustre file systems with up to 4000 OSTs have been tested. Multiple OST file systems can be configured on a single OSS node.

Maximum OST size

128TB (ldiskfs), 256TB (ZFS)

This is not a hard limit. Larger OSTs are possible, but typical production systems do not go beyond the stated limit per OST because Lustre can add capacity and performance with additional OSTs, and having more OSTs improves aggregate I/O performance and minimizes contention.

With 32-bit kernels, due to page cache limits, 16TB is the maximum block device size, which in turn applies to the size of OST. It is strongly recommended to run Lustre clients and servers with 64-bit kernels.

Maximum number of clients

131072

The maximum number of clients is a constant that can be changed at compile time. Up to 30000 clients have been used in production.

Maximum size of a file system

512 PB (ldiskfs), 1EB (ZFS)

Each OST can have a file system up to the Maximum OST size limit, and the Maximum number of OSTs can be combined into a single filesystem.

Maximum stripe count

2000

This limit is imposed by the size of the layout that needs to be stored on disk and sent in RPC requests, but is not a hard limit of the protocol. The number of OSTs in the filesystem can exceed the stripe count, but this limits the number of OSTs across which a single file can be striped.

Maximum stripe size

< 4 GB

The amount of data written to each object before moving on to next object.

Minimum stripe size

64 KB

Due to the 64 KB PAGE_SIZE on some 64-bit machines, the minimum stripe size is set to 64 KB.

Maximum object size

16TB (ldiskfs), 256TB (ZFS)

The amount of data that can be stored in a single object. An object corresponds to a stripe. The ldiskfs limit of 16 TB for a single object applies. For ZFS the limit is the size of the underlying OST. Files can consist of up to 2000 stripes, each stripe can be up to the maximum object size.

Maximum file size

16 TB on 32-bit systems

 

31.25 PB on 64-bit ldiskfs systems, 8EB on 64-bit ZFS systems

Individual files have a hard limit of nearly 16 TB on 32-bit systems imposed by the kernel memory subsystem. On 64-bit systems this limit does not exist. Hence, files can be 2^63 bytes (8EB) in size if the backing filesystem can support large enough objects.

A single file can have a maximum of 2000 stripes, which gives an upper single file limit of 31.25 PB for 64-bit ldiskfs systems. The actual amount of data that can be stored in a file depends upon the amount of free space in each OST on which the file is striped.

Maximum number of files or subdirectories in a single directory

10 million files (ldiskfs), 2^48 (ZFS)

The Lustre software uses the ldiskfs hashed directory code, which has a limit of about 10 million files, depending on the length of the file name. The limit on subdirectories is the same as the limit on regular files.

Introduced in Lustre 2.8

Note

Starting in the 2.8 release it is possible to exceed this limit by striping a single directory over multiple MDTs with the lfs mkdir -c command, which increases the single directory limit by a factor of the number of directory stripes used.

Lustre file systems are tested with ten million files in a single directory.

Maximum number of files in the file system

4 billion (ldiskfs), 256 trillion (ZFS)

Introduced in Lustre 2.4

up to 256 times the per-MDT limit

The ldiskfs filesystem imposes an upper limit of 4 billion inodes per filesystem. By default, the MDT filesystem is formatted with one inode per 2KB of space, meaning 512 million inodes per TB of MDT space. This can be increased initially at the time of MDT filesystem creation. For more information, see Chapter 5, Determining Hardware Configuration Requirements and Formatting Options.

Introduced in Lustre 2.4

The ZFS filesystem dynamically allocates inodes and does not have a fixed ratio of inodes per unit of MDT space, but consumes approximately 4KB of space per inode, depending on the configuration.

Introduced in Lustre 2.4

Each additional MDT can hold up to the above maximum number of additional files, depending on available space and the distribution of directories and files in the filesystem.

Maximum length of a filename

255 bytes (filename)

This limit is 255 bytes for a single filename, the same as the limit in the underlying filesystems.

Maximum length of a pathname

4096 bytes (pathname)

The Linux VFS imposes a full pathname length of 4096 bytes.

Maximum number of open files for a Lustre file system

No limit

The Lustre software does not impose a maximum for the number of open files, but the practical limit depends on the amount of RAM on the MDS. No "tables" for open files exist on the MDS, as they are only linked in a list to a given client's export. Each client process typically has a limit of several thousand open files, which depends on its ulimit.


 

Note

Introduced in Lustre 2.2

In Lustre software releases prior to version 2.2, the maximum stripe count for a single file was limited to 160 OSTs. In version 2.2, the wide striping feature was added to support files striped over up to 2000 OSTs. In order to store the large layout for such files in ldiskfs, the ea_inode feature must be enabled on the MDT, but no similar tunable is needed for ZFS MDTs. This feature is disabled by default at mkfs.lustre time. In order to enable this feature, specify --mkfsoptions="-O ea_inode" at MDT format time, or use tune2fs -O ea_inode to enable it after the MDT has been formatted. Using either the deprecated large_xattr or preferred ea_inode feature name results in ea_inode being shown in the file system feature list.
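
As a consolidated sketch of the two methods described in this note, the first command enables ea_inode at format time and the second enables it on an already formatted (and unmounted) MDT; the device path is hypothetical.

[mds#] mkfs.lustre --mdt --mkfsoptions="-O ea_inode" ...
[mds#] tune2fs -O ea_inode /dev/sda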

5.4. Determining Memory Requirements

This section describes the memory requirements for each Lustre file system component.

5.4.1.  Client Memory Requirements

A minimum of 2 GB RAM is recommended for clients.

5.4.2. MDS Memory Requirements

MDS memory requirements are determined by the following factors:

  • Number of clients

  • Size of the directories

  • Load placed on server

The amount of memory used by the MDS is a function of how many clients are on the system, and how many files they are using in their working set. This is driven, primarily, by the number of locks a client can hold at one time. The number of locks held by clients varies by load and memory availability on the server. Interactive clients can hold in excess of 10,000 locks at times. On the MDS, memory usage is approximately 2 KB per file, including the Lustre distributed lock manager (DLM) lock and kernel data structures for the files currently in use. Having file data in cache can improve metadata performance by a factor of 10x or more compared to reading it from disk.

MDS memory requirements include:

  • File system metadata : A reasonable amount of RAM needs to be available for file system metadata. While no hard limit can be placed on the amount of file system metadata, if more RAM is available, then disk I/O is needed less often to retrieve the metadata.

  • Network transport : If you are using TCP or other network transport that uses system memory for send/receive buffers, this memory requirement must also be taken into consideration.

  • Journal size : By default, the journal size is 400 MB for each Lustre ldiskfs file system. This can pin up to an equal amount of RAM on the MDS node per file system.

  • Failover configuration : If the MDS node will be used for failover from another node, then the RAM for each journal should be doubled, so the backup server can handle the additional load if the primary server fails.

5.4.2.1. Calculating MDS Memory Requirements

By default, 400 MB are used for the file system journal. Additional RAM is used for caching file data for the larger working set, which is not actively in use by clients but should be kept "hot" for improved access times. Approximately 1.5 KB per file is needed to keep a file in cache without a lock.

For example, for a single MDT on an MDS with 1,000 clients, 16 interactive nodes, and a 2 million file working set (of which 400,000 files are cached on the clients):

Operating system overhead = 512 MB

File system journal = 400 MB

1000 * 4-core clients * 100 files/core * 2kB = 800 MB

16 interactive clients * 10,000 files * 2kB = 320 MB

1,600,000 file extra working set * 1.5kB/file = 2400 MB

Thus, the minimum requirement for a system with this configuration is at least 4 GB of RAM. However, additional memory may significantly improve performance.

For directories containing 1 million or more files, more memory may provide a significant benefit. For example, in an environment where clients randomly access one of 10 million files, having extra memory for the cache significantly improves performance.

5.4.3. OSS Memory Requirements

When planning the hardware for an OSS node, consider the memory usage of several components in the Lustre file system (i.e., journal, service threads, file system metadata, etc.). Also, consider the effect of the OSS read cache feature, which consumes memory as it caches data on the OSS node.

In addition to the MDS memory requirements mentioned in Section 5.4.2, “MDS Memory Requirements”, the OSS requirements include:

  • Service threads : The service threads on the OSS node pre-allocate a 4 MB I/O buffer for each ost_io service thread, so these buffers do not need to be allocated and freed for each I/O request.

  • OSS read cache : OSS read cache provides read-only caching of data on an OSS, using the regular Linux page cache to store the data. Just like caching from a regular file system in the Linux operating system, OSS read cache uses as much physical memory as is available.

The same calculation applies to files accessed from the OSS as for the MDS, but the load is distributed over many more OSS nodes, so the amount of memory required for locks, inode cache, etc. listed under MDS is spread out over the OSS nodes.

Because of these memory requirements, the following calculations should be taken as determining the absolute minimum RAM required in an OSS node.

5.4.3.1. Calculating OSS Memory Requirements

The minimum recommended RAM size for an OSS with two OSTs is computed below:

Ethernet/TCP send/receive buffers (4 MB * 512 threads) = 2048 MB

400 MB journal size * 2 OST devices = 800 MB

1.5 MB read/write per OST IO thread * 512 threads = 768 MB

600 MB file system read cache * 2 OSTs = 1200 MB

1000 * 4-core clients * 100 files/core * 2kB = 800MB

16 interactive clients * 10,000 files * 2kB = 320MB

1,600,000 file extra working set * 1.5kB/file = 2400MB

DLM locks + file system metadata TOTAL = 3520MB

Per OSS DLM locks + file system metadata = 3520MB/6 OSS = 600MB (approx.)

Per OSS RAM minimum requirement = 4096MB (approx.)

This consumes about 1,400 MB just for the pre-allocated buffers, and an additional 2 GB for minimal file system and kernel usage. Therefore, for a non-failover configuration, the minimum RAM would be 4 GB for an OSS node with two OSTs. Adding additional memory on the OSS will improve the performance of reading smaller, frequently-accessed files.

For a failover configuration, the minimum RAM would be at least 6 GB. For 4 OSTs on each OSS in a failover configuration, 10 GB of RAM is reasonable. When the OSS is not handling any failed-over OSTs, the extra RAM will be used as a read cache.

As a reasonable rule of thumb, about 2 GB of base memory plus 1 GB per OST can be used. In failover configurations, about 2 GB per OST is needed.

5.5. Implementing Networks To Be Used by the Lustre File System

As a high performance file system, the Lustre file system places heavy loads on networks. Thus, a network interface in each Lustre server and client is commonly dedicated to Lustre file system traffic. This is often a dedicated TCP/IP subnet, although other network hardware can also be used.

A typical Lustre file system implementation may include the following:

  • A high-performance backend network for the Lustre servers, typically an InfiniBand (IB) network.

  • A larger client network.

  • Lustre routers to connect the two networks.

Lustre networks and routing are configured and managed by specifying parameters to the Lustre Networking (lnet) module in /etc/modprobe.d/lustre.conf.
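
For example, a minimal sketch of such a module configuration for a node with one Ethernet interface on a TCP LNet and one InfiniBand interface on an o2ib LNet (the interface names are illustrative) would be a single line in /etc/modprobe.d/lustre.conf:

options lnet networks=tcp0(eth0),o2ib0(ib0)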

To prepare to configure Lustre networking, complete the following steps:

  1. Identify all machines that will be running Lustre software and the network interfaces they will use to run Lustre file system traffic. These machines will form the Lustre network.

    A network is a group of nodes that communicate directly with one another. The Lustre software includes Lustre network drivers (LNDs) to support a variety of network types and hardware (see Chapter 2, Understanding Lustre Networking (LNet) for a complete list). The standard rules for specifying networks apply to Lustre networks. For example, two TCP networks on two different subnets (tcp0 and tcp1) are considered to be two different Lustre networks.

  2. If routing is needed, identify the nodes to be used to route traffic between networks.

    If you are using multiple network types, then you will need a router. Any node with appropriate interfaces can route Lustre networking (LNet) traffic between different network hardware types or topologies; the node may be a server, a client, or a standalone router. LNet can route messages between different network types (such as TCP-to-InfiniBand) or across different topologies (such as bridging two InfiniBand or TCP/IP networks). Routing will be configured in Chapter 9, Configuring Lustre Networking (LNet).

  3. Identify the network interfaces to include in or exclude from LNet.

    If not explicitly specified, LNet uses either the first available interface or a pre-defined default for a given network type. Interfaces that LNet should not use (such as an administrative network or IP-over-IB) can be excluded.

    Network interfaces to be used or excluded will be specified using the lnet kernel module parameters networks and ip2nets as described in Chapter 9, Configuring Lustre Networking (LNet).

  4. To ease the setup of networks with complex network configurations, determine a cluster-wide module configuration.

    For large clusters, you can configure the networking setup for all nodes by using a single, unified set of parameters in the lustre.conf file on each node. Cluster-wide configuration is described in Chapter 9, Configuring Lustre Networking (LNet).
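
    For example, a single cluster-wide lustre.conf entry using ip2nets might look like the following sketch; the interface names and subnets here are assumptions for illustration, and the syntax is covered in Chapter 9, Configuring Lustre Networking (LNet):

    options lnet 'ip2nets="o2ib0(ib0) 10.2.0.*; tcp0(eth0) 10.1.0.*"'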

Note

We recommend that you use 'dotted-quad' notation for IP addresses rather than host names to make it easier to read debug logs and debug configurations with multiple interfaces.

Chapter 6. Configuring Storage on a Lustre File System

This chapter describes best practices for storage selection and file system options to optimize performance on RAID, and includes the following sections:

Note

It is strongly recommended that storage used in a Lustre file system be configured with hardware RAID. The Lustre software does not support redundancy at the file system level and RAID is required to protect against disk failure.

6.1.  Selecting Storage for the MDT and OSTs

The Lustre architecture allows the use of any kind of block device as backend storage. The characteristics of such devices, particularly in the case of failures, vary significantly and have an impact on configuration choices.

This section describes issues and recommendations regarding backend storage.

6.1.1. Metadata Target (MDT)

I/O on the MDT is typically mostly reads and writes of small amounts of data. For this reason, we recommend that you use RAID 1 for MDT storage. If you require more capacity for an MDT than one disk provides, we recommend RAID 1 + 0 or RAID 10.

6.1.2. Object Storage Server (OST)

A quick calculation makes it clear that without further redundancy, RAID 6 is required for large clusters and RAID 5 is not acceptable:

For a 2 PB file system (2,000 disks of 1 TB capacity) assume the mean time to failure (MTTF) of a disk is about 1,000 days. This means that the expected failure rate is 2000/1000 = 2 disks per day. Repair time at 10% of disk bandwidth is 1000 GB at 10MB/sec = 100,000 sec, or about 1 day.

For a RAID 5 stripe that is 10 disks wide, during 1 day of rebuilding, the chance that a second disk in the same array will fail is about 9/1000 or about 1% per day. After 50 days, you have a 50% chance of a double failure in a RAID 5 array leading to data loss.

Therefore, RAID 6 or another double parity algorithm is needed to provide sufficient redundancy for OST storage.
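
The arithmetic above can be reproduced with a short shell sketch; all of the inputs are the assumed values from the example:

# assumed values from the example above
disks=2000; mttf_days=1000; raid5_width=10
echo "expected disk failures per day: $(( disks / mttf_days ))"    # 2
# Chance that one of the remaining 9 disks in the same RAID 5 array fails
# during the ~1 day rebuild: (raid5_width - 1)/mttf_days = 9/1000, about 1%,
# compounding toward a ~50% chance of a double failure over 50 days.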

For better performance, we recommend that you create RAID sets with 4 or 8 data disks plus one or two parity disks. Using larger RAID sets will negatively impact performance compared to having multiple independent RAID sets.

To maximize performance for small I/O request sizes, storage configured as RAID 1+0 can yield much better results but will increase cost or reduce capacity.

6.2. Reliability Best Practices

RAID monitoring software is recommended to quickly detect faulty disks and allow them to be replaced to avoid double failures and data loss. Hot spare disks are recommended so that rebuilds happen without delays.

Backups of the metadata file systems are recommended. For details, see Chapter 17, Backing Up and Restoring a File System.

6.3. Performance Tradeoffs

A writeback cache can dramatically increase write performance on many types of RAID arrays if the writes are not done at full stripe width. Unfortunately, unless the RAID array has battery-backed cache (a feature only found in some higher-priced hardware RAID arrays), interrupting the power to the array may result in out-of-sequence writes or corruption of RAID parity and future data loss.

If writeback cache is enabled, a file system check is required after the array loses power. Data may also be lost because of this.

Therefore, we recommend against the use of writeback cache when data integrity is critical. You should carefully consider whether the benefits of using writeback cache outweigh the risks.

6.4.  Formatting Options for ldiskfs RAID Devices

When formatting an ldiskfs file system on a RAID device, it can be beneficial to ensure that I/O requests are aligned with the underlying RAID geometry. This ensures that Lustre RPCs do not generate unnecessary disk operations which may reduce performance dramatically. Use the --mkfsoptions parameter to specify additional parameters when formatting the OST or MDT.

For RAID 5, RAID 6, or RAID 1+0 storage, specifying the following option to the --mkfsoptions parameter option improves the layout of the file system metadata, ensuring that no single disk contains all of the allocation bitmaps:

-E stride=chunk_blocks

The chunk_blocks variable is in units of 4096-byte blocks and represents the amount of contiguous data written to a single disk before moving to the next disk. This is alternately referred to as the RAID stripe size. This is applicable to both MDT and OST file systems.

For more information on how to override the defaults while formatting MDT or OST file systems, see Section 5.3, “ Setting ldiskfs File System Formatting Options ”.

6.4.1. Computing file system parameters for mkfs

For best results, use RAID 5 with 5 or 9 disks or RAID 6 with 6 or 10 disks, each on a different controller. The stripe width is the optimal minimum I/O size. Ideally, the RAID configuration should allow 1 MB Lustre RPCs to fit evenly on a single RAID stripe without an expensive read-modify-write cycle. Use this formula to determine the stripe_width, where number_of_data_disks does not include the RAID parity disks (1 for RAID 5 and 2 for RAID 6):

stripe_width_blocks = chunk_blocks * number_of_data_disks = 1 MB 

If the RAID configuration does not allow chunk_blocks to fit evenly into 1 MB, select stripe_width_blocks such that it is as close to 1 MB as possible, but not larger.

The stripe_width_blocks value must equal chunk_blocks * number_of_data_disks. Specifying the stripe_width_blocks parameter is only relevant for RAID 5 or RAID 6, and is not needed for RAID 1 plus 0.

Run mkfs.lustre --reformat on the file system device (/dev/sdc), specifying the RAID geometry to the underlying ldiskfs file system, where:

--mkfsoptions "other_options -E stride=chunk_blocks, stripe_width=stripe_width_blocks"

A RAID 6 configuration with 6 disks has 4 data and 2 parity disks. The chunk_blocks <= 1024KB/4 = 256KB.

Because the number of data disks is a power of two, the stripe width is exactly 1 MB.

--mkfsoptions "other_options -E stride=chunk_blocks, stripe_width=stripe_width_blocks"...

6.4.2. Choosing Parameters for an External Journal

If you have configured a RAID array and use it directly as an OST, it contains both data and metadata. For better performance, we recommend putting the OST journal on a separate device, by creating a small RAID 1 array and using it as an external journal for the OST.

In a Lustre file system, the default journal size is 400 MB. A journal size of up to 1 GB has shown increased performance but diminishing returns are seen for larger journals. Additionally, a copy of the journal is kept in RAM. Therefore, make sure you have enough memory available to hold copies of all the journals.

The file system journal options are specified to mkfs.lustre using the --mkfsoptions parameter. For example:

--mkfsoptions "other_options -j -J device=/dev/mdJ" 

To create an external journal, perform these steps for each OST on the OSS:

  1. Create a 400 MB (or larger) journal partition (RAID 1 is recommended).

    In this example, /dev/sdb is a RAID 1 device.

  2. Create a journal device on the partition. Run:

    oss# mke2fs -b 4096 -O journal_dev /dev/sdb journal_size

    The value of journal_size is specified in units of 4096-byte blocks. For example, 262144 for a 1 GB journal size. (A combined example is shown after these steps.)

  3. Create the OST.

    In this example, /dev/sdc is the RAID 6 device to be used as the OST, run:

    [oss#] mkfs.lustre --ost ... \
    --mkfsoptions="-J device=/dev/sdb1" /dev/sdc
  4. Mount the OST as usual.
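
Putting steps 2 and 3 together for a 1 GB journal, and assuming the journal is created directly on the whole /dev/sdb RAID 1 device as in step 2, the commands might look like this sketch:

oss# mke2fs -b 4096 -O journal_dev /dev/sdb 262144
oss# mkfs.lustre --ost ... --mkfsoptions="-J device=/dev/sdb" /dev/sdc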

6.5. Connecting a SAN to a Lustre File System

Depending on your cluster size and workload, you may want to connect a SAN to a Lustre file system. Before making this connection, consider the following:

  • In many SAN file systems, clients allocate and lock blocks or inodes individually as they are updated. The design of the Lustre file system avoids the high contention that some of these blocks and inodes may have.

  • The Lustre file system is highly scalable and can have a very large number of clients. SAN switches do not scale to a large number of nodes, and the cost per port of a SAN is generally higher than other networking.

  • File systems that allow direct-to-SAN access from the clients have a security risk because clients can potentially read any data on the SAN disks, and misbehaving clients can corrupt the file system for many reasons, including faulty file system, network, or other kernel software, bad cabling, bad memory, and so on. The risk increases as the number of clients directly accessing the storage increases.

Chapter 7. Setting Up Network Interface Bonding

This chapter describes how to use multiple network interfaces in parallel to increase bandwidth and/or redundancy. Topics include:

Note

Using network interface bonding is optional.

7.1. Network Interface Bonding Overview

Bonding, also known as link aggregation, trunking and port trunking, is a method of aggregating multiple physical network links into a single logical link for increased bandwidth.

Several different types of bonding are available in the Linux distribution. All these types are referred to as 'modes', and use the bonding kernel module.

Modes 0 to 3 allow load balancing and fault tolerance by using multiple interfaces. Mode 4 aggregates a group of interfaces into a single virtual interface where all members of the group share the same speed and duplex settings. This mode is described under IEEE spec 802.3ad, and it is referred to as either 'mode 4' or '802.3ad.'

7.2. Requirements

The most basic requirement for successful bonding is that both endpoints of the connection must be capable of bonding. In a normal case, the non-server endpoint is a switch. (Two systems connected via crossover cables can also use bonding.) Any switch used must explicitly handle 802.3ad Dynamic Link Aggregation.

The kernel must also be configured with bonding. All supported Lustre kernels have bonding functionality. The network driver for the interfaces to be bonded must have the ethtool functionality to determine slave speed and duplex settings. All recent network drivers implement it.

To verify that your interface works with ethtool, run:

# which ethtool
/sbin/ethtool
 
# ethtool eth0
Settings for eth0:
           Supported ports: [ TP MII ]
           Supported link modes:   10baseT/Half 10baseT/Full
                                   100baseT/Half 100baseT/Full
           Supports auto-negotiation: Yes
           Advertised link modes:  10baseT/Half 10baseT/Full
                                   100baseT/Half 100baseT/Full
           Advertised auto-negotiation: Yes
           Speed: 100Mb/s
           Duplex: Full
           Port: MII
           PHYAD: 1
           Transceiver: internal
           Auto-negotiation: on
           Supports Wake-on: pumbg
           Wake-on: d
           Current message level: 0x00000001 (1)
           Link detected: yes
 
# ethtool eth1
 
Settings for eth1:
   Supported ports: [ TP MII ]
   Supported link modes:   10baseT/Half 10baseT/Full
                           100baseT/Half 100baseT/Full
   Supports auto-negotiation: Yes
   Advertised link modes:  10baseT/Half 10baseT/Full
   100baseT/Half 100baseT/Full
   Advertised auto-negotiation: Yes
   Speed: 100Mb/s
   Duplex: Full
   Port: MII
   PHYAD: 32
   Transceiver: internal
   Auto-negotiation: on
   Supports Wake-on: pumbg
   Wake-on: d
   Current message level: 0x00000007 (7)
   Link detected: yes

To quickly check whether your kernel supports bonding, run:

# grep ifenslave /sbin/ifup
# which ifenslave
/sbin/ifenslave

7.3. Bonding Module Parameters

Bonding module parameters control various aspects of bonding.

Outgoing traffic is mapped across the slave interfaces according to the transmit hash policy. For bonding, we recommend setting the xmit_hash_policy option to layer3+4. This policy uses upper layer protocol information, when available, to generate the hash, which allows traffic to a particular network peer to span multiple slaves, although a single connection does not span multiple slaves.

xmit_hash_policy=layer3+4

The miimon option enables users to monitor the link status. (The parameter is a time interval in milliseconds.) It makes an interface failure transparent to avoid serious network degradation during link failures. A reasonable default setting is 100 milliseconds:

miimon=100

For a busy network, increase the timeout.
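
Taken together, these module parameters are typically placed on the bonding options line in the modprobe configuration rather than run as shell commands; the following is a sketch only, and the mode shown is an assumption for illustration:

options bond0 mode=802.3ad miimon=100 xmit_hash_policy=layer3+4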

7.4. Setting Up Bonding

To set up bonding:

  1. Create a virtual 'bond' interface by creating a configuration file:

    # vi /etc/sysconfig/network-scripts/ifcfg-bond0
  2. Append the following lines to the file.

    DEVICE=bond0
    IPADDR=192.168.10.79 # Use the free IP Address of your network
    NETWORK=192.168.10.0
    NETMASK=255.255.255.0
    USERCTL=no
    BOOTPROTO=none
    ONBOOT=yes
  3. Attach one or more slave interfaces to the bond interface. Modify the eth0 and eth1 configuration files (using a VI text editor).

    1. Use the VI text editor to open the eth0 configuration file.

      # vi /etc/sysconfig/network-scripts/ifcfg-eth0
    2. Modify/append the eth0 file as follows:

      DEVICE=eth0
      USERCTL=no
      ONBOOT=yes
      MASTER=bond0
      SLAVE=yes
      BOOTPROTO=none
    3. Use the VI text editor to open the eth1 configuration file.

      # vi /etc/sysconfig/network-scripts/ifcfg-eth1
    4. Modify/append the eth1 file as follows:

      DEVICE=eth1
      USERCTL=no
      ONBOOT=yes
      MASTER=bond0
      SLAVE=yes
      BOOTPROTO=none
      
  4. Set up the bond interface and its options in /etc/modprobe.d/bond.conf. Start the slave interfaces by your normal network method.

    # vi /etc/modprobe.d/bond.conf
    
    1. Append the following lines to the file.

      alias bond0 bonding
      options bond0 mode=balance-alb miimon=100
      
    2. Load the bonding module.

      # modprobe bonding
      # ifconfig bond0 up
      # ifenslave bond0 eth0 eth1
      
  5. Start/restart the slave interfaces (using your normal network method).

    Note

    You must modprobe the bonding module for each bonded interface. If you wish to create bond0 and bond1, two entries in the bond.conf file are required.

    The examples below are from systems running Red Hat Enterprise Linux. For setup, use the /etc/sysconfig/network-scripts/ifcfg-* files. The website referenced below includes detailed instructions for other configuration methods, instructions for using DHCP with bonding, and other setup details. We strongly recommend consulting this website.

    http://www.linuxfoundation.org/collaborate/workgroups/networking/bonding

  6. Check /proc/net/bonding to determine status on bonding. There should be a file there for each bond interface.

    # cat /proc/net/bonding/bond0
    Ethernet Channel Bonding Driver: v3.0.3 (March 23, 2006)
     
    Bonding Mode: load balancing (round-robin)
    MII Status: up
    MII Polling Interval (ms): 0
    Up Delay (ms): 0
    Down Delay (ms): 0
     
    Slave Interface: eth0
    MII Status: up
    Link Failure Count: 0
    Permanent HW addr: 4c:00:10:ac:61:e0
     
    Slave Interface: eth1
    MII Status: up
    Link Failure Count: 0
    Permanent HW addr: 00:14:2a:7c:40:1d
    
  7. Use ethtool or ifconfig to check the interface state. ifconfig lists the first bonded interface as 'bond0.'

    ifconfig
    bond0      Link encap:Ethernet  HWaddr 4C:00:10:AC:61:E0
       inet addr:192.168.10.79  Bcast:192.168.10.255 \     Mask:255.255.255.0
       inet6 addr: fe80::4e00:10ff:feac:61e0/64 Scope:Link
       UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500 Metric:1
       RX packets:3091 errors:0 dropped:0 overruns:0 frame:0
       TX packets:880 errors:0 dropped:0 overruns:0 carrier:0
       collisions:0 txqueuelen:0
       RX bytes:314203 (306.8 KiB)  TX bytes:129834 (126.7 KiB)
     
    eth0       Link encap:Ethernet  HWaddr 4C:00:10:AC:61:E0
       inet6 addr: fe80::4e00:10ff:feac:61e0/64 Scope:Link
       UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500 Metric:1
       RX packets:1581 errors:0 dropped:0 overruns:0 frame:0
       TX packets:448 errors:0 dropped:0 overruns:0 carrier:0
       collisions:0 txqueuelen:1000
       RX bytes:162084 (158.2 KiB)  TX bytes:67245 (65.6 KiB)
       Interrupt:193 Base address:0x8c00
     
    eth1       Link encap:Ethernet  HWaddr 4C:00:10:AC:61:E0
       inet6 addr: fe80::4e00:10ff:feac:61e0/64 Scope:Link
       UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500 Metric:1
       RX packets:1513 errors:0 dropped:0 overruns:0 frame:0
       TX packets:444 errors:0 dropped:0 overruns:0 carrier:0
       collisions:0 txqueuelen:1000
       RX bytes:152299 (148.7 KiB)  TX bytes:64517 (63.0 KiB)
       Interrupt:185 Base address:0x6000
    

7.4.1. Examples

This is an example showing bond.conf entries for bonding Ethernet interfaces eth0 and eth1 to bond0:

# cat /etc/modprobe.d/bond.conf
alias eth0 8139too
alias eth1 via-rhine
alias bond0 bonding
options bond0 mode=balance-alb miimon=100
 
# cat /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
BOOTPROTO=none
NETMASK=255.255.255.0
IPADDR=192.168.10.79 # (Assign here the IP of the bonded interface.)
ONBOOT=yes
USERCTL=no
 
ifcfg-ethx 
# cat /etc/sysconfig/network-scripts/ifcfg-eth0
TYPE=Ethernet
DEVICE=eth0
HWADDR=4c:00:10:ac:61:e0
BOOTPROTO=none
ONBOOT=yes
USERCTL=no
IPV6INIT=no
PEERDNS=yes
MASTER=bond0
SLAVE=yes

In the following example, the bond0 interface is the master (MASTER) while eth0 and eth1 are slaves (SLAVE).

Note

All slaves of bond0 have the same MAC address (Hwaddr) as bond0. All modes except TLB and ALB share this single MAC address; TLB and ALB require a unique MAC address for each slave.

$ /sbin/ifconfig
 
bond0     Link encap:Ethernet  HWaddr 00:C0:F0:1F:37:B4
inet addr:XXX.XXX.XXX.YYY Bcast:XXX.XXX.XXX.255 Mask:255.255.252.0
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500  Metric:1
RX packets:7224794 errors:0 dropped:0 overruns:0 frame:0
TX packets:3286647 errors:1 dropped:0 overruns:1 carrier:0
collisions:0 txqueuelen:0
 
eth0      Link encap:Ethernet  HWaddr 00:C0:F0:1F:37:B4
inet addr:XXX.XXX.XXX.YYY Bcast:XXX.XXX.XXX.255 Mask:255.255.252.0
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500  Metric:1
RX packets:3573025 errors:0 dropped:0 overruns:0 frame:0
TX packets:1643167 errors:1 dropped:0 overruns:1 carrier:0
collisions:0 txqueuelen:100
Interrupt:10 Base address:0x1080
 
eth1      Link encap:Ethernet  HWaddr 00:C0:F0:1F:37:B4
inet addr:XXX.XXX.XXX.YYY Bcast:XXX.XXX.XXX.255 Mask:255.255.252.0
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500  Metric:1
RX packets:3651769 errors:0 dropped:0 overruns:0 frame:0
TX packets:1643480 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
Interrupt:9 Base address:0x1400

7.5. Configuring a Lustre File System with Bonding

The Lustre software uses the IP address of the bonded interfaces and requires no special configuration. The bonded interface is treated as a regular TCP/IP interface. If needed, specify bond0 using the lnet networks parameter in /etc/modprobe.d/lustre.conf:

options lnet networks=tcp(bond0)

7.6. Bonding References

We recommend the following bonding references:

Chapter 8. Installing the Lustre Software

This chapter describes how to install the Lustre software from RPM packages. It includes:

For hardware and system requirements and hardware configuration information, see Chapter 5, Determining Hardware Configuration Requirements and Formatting Options.

8.1.  Preparing to Install the Lustre Software

You can install the Lustre software from downloaded packages (RPMs) or directly from the source code. This chapter describes how to install the Lustre RPM packages. Instructions to install from source code are beyond the scope of this document, and can be found elsewhere online.

The Lustre RPM packages are tested on current versions of Linux enterprise distributions at the time they are created. See the release notes for each version for specific details.

8.1.1. Software Requirements

To install the Lustre software from RPMs, the following are required:

  • Lustre server packages . The required packages for Lustre 2.9 EL7 servers are listed in the table below, where ver refers to the Lustre release and kernel version (e.g., 2.9.0-1.el7) and arch refers to the processor architecture (e.g., x86_64). These packages are available in the Lustre Releases repository, and may differ depending on your distro and version.

    Table 8.1. Packages Installed on Lustre Servers

    Package Name                              Description
    kernel-ver_lustre.arch Linux kernel with Lustre software patches (often referred to as "patched kernel")
    lustre-ver.arch Lustre software command line tools
    kmod-lustre-ver.arch Lustre-patched kernel modules
    kmod-lustre-osd-ldiskfs-ver.arch Lustre back-end file system tools for ldiskfs-based servers.
    lustre-osd-ldiskfs-mount-ver.arch Helper library for mount.lustre and mkfs.lustre for ldiskfs-based servers.
    kmod-lustre-osd-zfs-ver.arch Lustre back-end file system tools for ZFS. This is an alternative to lustre-osd-ldiskfs (kmod-spl and kmod-zfs available separately).
    lustre-osd-zfs-mount-ver.arch Helper library for mount.lustre and mkfs.lustre for ZFS-based servers (zfs utilities available separately).
    e2fsprogs Utilities to maintain Lustre ldiskfs back-end file system(s)
    lustre-tests-ver_lustre.arch Lustre I/O Kit benchmarking tools (Included in Lustre software as of release 2.2)


  • Lustre client packages . The required packages for Lustre 2.9 EL7 clients are listed in the table below, where ver refers to the Linux distribution (e.g., 3.6.18-348.1.1.el5). These packages are available in the Lustre Releases repository.

    Table 8.2. Packages Installed on Lustre Clients

    Package Name                              Description
    kmod-lustre-client-ver.arch Patchless kernel modules for client
    lustre-client-ver.arch Client command line tools
    lustre-client-dkms-ver.arch Alternate client RPM to kmod-lustre-client with Dynamic Kernel Module Support (DKMS) installation. This avoids the need to install a new RPM for each kernel update, but requires a full build environment on the client.


    Note

    The version of the kernel running on a Lustre client must be the same as the version of the kmod-lustre-client-ver package being installed, unless the DKMS package is installed. If the kernel running on the client is not compatible, a kernel that is compatible must be installed on the client before the Lustre file system software is used.

  • Lustre LNet network driver (LND) . The Lustre LNDs provided with the Lustre software are listed in the table below. For more information about Lustre LNet, see Chapter 2, Understanding Lustre Networking (LNet).

    Table 8.3. Network Types Supported by Lustre LNDs

    Supported Network Types       Notes
    TCP                           Any network carrying TCP traffic, including GigE, 10GigE, and IPoIB
    InfiniBand network            OpenFabrics OFED (o2ib)
    gni                           Gemini (Cray)

Note

The InfiniBand and TCP Lustre LNDs are routinely tested during release cycles. The other LNDs are maintained by their respective owners.

  • High availability software . If needed, install third party high-availability software. For more information, see Section 11.2, “Preparing a Lustre File System for Failover”.

  • Optional packages. Optional packages provided in the Lustre Releases repository may include the following (depending on the operating system and platform):

    • kernel-debuginfo, kernel-debuginfo-common, lustre-debuginfo, lustre-osd-ldiskfs-debuginfo - Versions of required packages with debugging symbols and other debugging options enabled for use in troubleshooting.

    • kernel-devel - Portions of the kernel tree needed to compile third party modules, such as network drivers.

    • kernel-firmware - Standard Red Hat Enterprise Linux distribution that has been recompiled to work with the Lustre kernel.

    • kernel-headers - Header files installed under /usr/include and used when compiling user-space, kernel-related code.

    • lustre-source - Lustre software source code.

    • (Recommended) perf, perf-debuginfo, python-perf, python-perf-debuginfo - Linux performance analysis tools that have been compiled to match the Lustre kernel version.

8.1.2. Environmental Requirements

Before installing the Lustre software, make sure the following environmental requirements are met.

  • (Required) Disable Security-Enhanced Linux* (SELinux) on all Lustre servers. The Lustre software does not support SELinux. Therefore, the SELinux system extension must be disabled on all Lustre nodes. Also, make sure other security extensions (such as the Novell AppArmor* security system) and network packet filtering tools (such as iptables) do not interfere with the Lustre software.

  • (Required) Use the same user IDs (UID) and group IDs (GID) on all clients. If use of supplemental groups is required, see Section 35.1, “User/Group Upcall” for information about supplementary user and group cache upcall (identity_upcall).

  • (Recommended) Provide remote shell access to clients. It is recommended that all cluster nodes have remote shell client access to facilitate the use of Lustre configuration and monitoring scripts. Parallel Distributed SHell (pdsh) is preferable, although Secure SHell (SSH) is acceptable.

  • (Recommended) Ensure client clocks are synchronized. The Lustre file system uses client clocks for timestamps. If clocks are out of sync between clients, files will appear with different time stamps when accessed by different clients. Drifting clocks can also cause problems by, for example, making it difficult to debug multi-node issues or correlate logs, which depend on timestamps. We recommend that you use Network Time Protocol (NTP) to keep client and server clocks in sync with each other. For more information about NTP, see: http://www.ntp.org.

8.2. Lustre Software Installation Procedure

Caution

Before installing the Lustre software, back up ALL data. The Lustre software contains kernel modifications that interact with storage devices and may introduce security issues and data loss if not installed, configured, or administered properly.

To install the Lustre software from RPMs, complete the steps below.

  1. Verify that all Lustre installation requirements have been met.

  2. Download the e2fsprogs RPMs for your platform from the Lustre Releases repository.

  3. Download the Lustre server RPMs for your platform from the Lustre Releases repository. See Table 8.1, “Packages Installed on Lustre Servers” for a list of required packages.

  4. Install the Lustre server and e2fsprogs packages on all Lustre servers (MGS, MDSs, and OSSs).

    1. Log onto a Lustre server as the root user

    2. Use the yum command to install the packages:

      # yum --nogpgcheck install pkg1.rpm pkg2.rpm ...
      

    3. Verify the packages are installed correctly:

      # rpm -qa|egrep "lustre|kernel"|sort
      

    4. Reboot the server.

    5. Repeat these steps on each Lustre server.

  5. Download the Lustre client RPMs for your platform from the Lustre Releases repository. See Table 8.2, “Packages Installed on Lustre Clients” for a list of required packages.

  6. Install the Lustre client packages on all Lustre clients.

    Note

    The version of the kernel running on a Lustre client must be the same as the version of the kmod-lustre-client-ver package being installed, unless the DKMS package is installed. If not, a compatible kernel must be installed on the client before the Lustre client packages are installed.

    1. Log onto a Lustre client as the root user.

    2. Use the yum command to install the packages:

      # yum --nogpgcheck install pkg1.rpm pkg2.rpm ...
      

    3. Verify the packages were installed correctly:

      # rpm -qa|egrep "lustre|kernel"|sort
      

    4. Reboot the client.

    5. Repeat these steps on each Lustre client.

To configure LNet, go to Chapter 9, Configuring Lustre Networking (LNet). If default settings will be used for LNet, go to Chapter 10, Configuring a Lustre File System.

Chapter 9. Configuring Lustre Networking (LNet)

This chapter describes how to configure Lustre Networking (LNet). It includes the following sections:

Note

Configuring LNet is optional.

LNet will use the first TCP/IP interface it discovers on a system (e.g., eth0) if it is loaded using the lctl network up command. If this network configuration is sufficient, you do not need to configure LNet. LNet configuration is required if you are using InfiniBand or multiple Ethernet interfaces.

Introduced in Lustre 2.7

The lnetctl utility can be used to initialize LNet without bringing up any network interfaces. This gives flexibility to the user to add interfaces after LNet has been loaded.

Introduced in Lustre 2.7

Dynamic LNet Configuration (DLC) also introduces a C API to enable configuring LNet programmatically. See Chapter 39, LNet Configuration C-API.

Introduced in Lustre 2.7

9.1. Configuring LNet via lnetctl

The lnetctl utility can be used to initialize and configure the LNet kernel module after it has been loaded via modprobe. In general the lnetctl format is as follows:

lnetctl cmd subcmd [options]

The following configuration items are managed by the tool:

  • Configuring/unconfiguring LNet

  • Adding/removing/showing Networks

  • Adding/removing/showing Routes

  • Enabling/Disabling routing

  • Configuring Router Buffer Pools

9.1.1. Configuring LNet

After LNet has been loaded via modprobe, the lnetctl utility can be used to configure LNet without bringing up the networks specified in the module parameters. It can also be used to configure the network interfaces specified in the module parameters by providing the --all option.

lnetctl lnet configure [--all]
# --all: load NI configuration from module parameters

The lnetctl utility can also be used to unconfigure LNet.

lnetctl lnet unconfigure

9.1.2. Adding, Deleting and Showing networks

Networks can be added and deleted after the LNet kernel module is loaded.

lnetctl net add: add a network
        --net: net name (ex tcp0)
        --if: physical interface (ex eth0)
        --peer_timeout: time to wait before declaring a peer dead
        --peer_credits: defines the max number of inflight messages
        --peer_buffer_credits: the number of buffer credits per peer
        --credits: Network Interface credits
        --cpts: CPU Partitions configured net uses
        --help: display this help text

Example:
lnetctl net add --net tcp2 --if eth0
                --peer_timeout 180 --peer_credits 8

Networks can be deleted as shown below:

net del: delete a network
        --net: net name (ex tcp0)

Example:
lnetctl net del --net tcp2

All or a subset of the configured networks can be shown. The output can be non-verbose or verbose.

net show: show networks
        --net: net name (ex tcp0) to filter on
        --verbose: display detailed output per network

Examples:
lnetctl net show
lnetctl net show --verbose
lnetctl net show --net tcp2 --verbose

Below are examples of non-detailed and detailed network configuration show.

# non-detailed show
> lnetctl net show --net tcp2
net:
    - nid: 192.168.205.130@tcp2
      status: up
      interfaces:
          0: eth3

# detailed show
> lnetctl net show --net tcp2 --verbose
net:
    - nid: 192.168.205.130@tcp2
      status: up
      interfaces:
          0: eth3
      tunables:
          peer_timeout: 180
          peer_credits: 8
          peer_buffer_credits: 0
          credits: 256

9.1.3. Adding, Deleting and Showing routes

A set of routes can be added to identify how LNet messages are to be routed.

lnetctl route add: add a route
        --net: net name (ex tcp0) LNet message is destined to.
               This cannot be a local network.
        --gateway: gateway node nid (ex 10.1.1.2@tcp) to route
                   all LNet messages destined for the identified
                   network
        --hop: number of hops to final destination
               (1 < hops < 255)
        --priority: priority of route (0 - highest prio)

Example:
lnetctl route add --net tcp2 --gateway 192.168.205.130@tcp1 --hop 2 --prio 1

Routes can be deleted via the following lnetctl command.

lnetctl route del: delete a route
        --net: net name (ex tcp0)
        --gateway: gateway nid (ex 10.1.1.2@tcp)

Example:
lnetctl route del --net tcp2 --gateway 192.168.205.130@tcp1

Configured routes can be shown via the following lnetctl command.

lnetctl route show: show routes
        --net: net name (ex tcp0) to filter on
        --gateway: gateway nid (ex 10.1.1.2@tcp) to filter on
        --hop: number of hops to final destination
               (1 < hops < 255) to filter on
        --priority: priority of route (0 - highest prio)
                    to filter on
        --verbose: display detailed output per route

Examples:
# non-detailed show
lnetctl route show

# detailed show
lnetctl route show --verbose

When showing routes the --verbose option outputs more detailed information. All show and error output are in YAML format. Below are examples of both non-detailed and detailed route show output.

#Non-detailed output
> lnetctl route show
route:
    - net: tcp2
      gateway: 192.168.205.130@tcp1

#detailed output
> lnetctl route show --verbose
route:
    - net: tcp2
      gateway: 192.168.205.130@tcp1
      hop: 2
      priority: 1
      state: down

9.1.4. Enabling and Disabling Routing

When an LNet node is configured as a router it will route LNet messages that are not destined for itself. This feature can be enabled or disabled as follows.

lnetctl set routing [0 | 1]
# 0 - disable routing feature
# 1 - enable routing feature

9.1.5. Showing routing information

When routing is enabled on a node, the tiny, small and large routing buffers are allocated. See Section 28.3, “ Tuning LNet Parameters” for more details on router buffers. This information can be shown as follows:

lnetctl routing show: show routing information

Example:
lnetctl routing show

An example of the show output:

> lnetctl routing show
routing:
    - cpt[0]:
          tiny:
              npages: 0
              nbuffers: 2048
              credits: 2048
              mincredits: 2048
          small:
              npages: 1
              nbuffers: 16384
              credits: 16384
              mincredits: 16384
          large:
              npages: 256
              nbuffers: 1024
              credits: 1024
              mincredits: 1024
    - enable: 1

9.1.6. Configuring Routing Buffers

The configured routing buffer values specify the number of buffers in each of the tiny, small and large groups.

It is often desirable to configure the tiny, small and large routing buffers to values other than the default. These are global values; when set, they are used by all configured CPU partitions. If routing is enabled, the values set take effect immediately. If a larger number of buffers is specified, additional buffers are allocated to satisfy the configuration change. If fewer buffers are configured, the excess buffers are freed as they become unused. If routing is not enabled, the values are not changed. The buffer values are reset to the defaults if routing is turned off and on again.

The lnetctl 'set' command can be used to set these buffer values. A VALUE greater than 0 sets the number of buffers accordingly. A VALUE of 0 resets the number of buffers to the system defaults.

set tiny_buffers:
      set tiny routing buffers
               VALUE must be greater than or equal to 0

set small_buffers: set small routing buffers
        VALUE must be greater than or equal to 0

set large_buffers: set large routing buffers
        VALUE must be greater than or equal to 0

Usage examples:

> lnetctl set tiny_buffers 4096
> lnetctl set small_buffers 8192
> lnetctl set large_buffers 2048

The buffers can be set back to the default values as follows:

> lnetctl set tiny_buffers 0
> lnetctl set small_buffers 0
> lnetctl set large_buffers 0

9.1.7. Importing YAML Configuration File

Configuration can be described in YAML format and can be fed into the lnetctl utility. The lnetctl utility parses the YAML file and performs the specified operation on all entities described therein. If no operation is defined in the command as shown below, the default operation is 'add'. The YAML syntax is described in a later section.

lnetctl import FILE.yaml
lnetctl import < FILE.yaml

The 'lnetctl import' command provides three optional parameters to define the operation to be performed on the configuration items described in the YAML file.

# if no options are given to the command the "add" command is assumed
              # by default.
lnetctl import --add FILE.yaml
lnetctl import --add < FILE.yaml

# to delete all items described in the YAML file
lnetctl import --del FILE.yaml
lnetctl import --del < FILE.yaml

# to show all items described in the YAML file
lnetctl import --show FILE.yaml
lnetctl import --show < FILE.yaml
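
As a sketch only, a small YAML file combining a network definition and a route definition might look like the following; the interface name, network names, and gateway NID are illustrative assumptions, and the full syntax is described in the YAML Syntax section below:

net:
    - net: tcp2
      interfaces:
          0: eth3
route:
    - net: o2ib0
      gateway: 192.168.205.130@tcp2

Such a file could then be applied with 'lnetctl import --add FILE.yaml' as shown above.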

9.1.8. Exporting Configuration in YAML format

The lnetctl utility provides the 'export' command to dump the current LNet configuration in YAML format:

lnetctl export FILE.yaml
lnetctl export > FILE.yaml

9.1.9. Showing LNet Traffic Statistics

The lnetctl utility can dump the LNet traffic statistics as follows:

lnetctl stats show

9.1.10. YAML Syntax

The lnetctl utility can take in a YAML file describing the configuration items that need to be operated on and perform one of the following operations: add, delete or show on the items described therein.

Net, routing and route YAML blocks are all defined as a YAML sequence, as shown in the following sections. The stats YAML block is a YAML object. Each sequence item can take a seq_no field. This seq_no field is returned in the error block, which allows the caller to associate the error with the item that caused it. The lnetctl utility makes a best effort at configuring the items defined in the YAML file; it does not stop processing the file at the first error.

Below is the YAML syntax describing the various configuration elements which can be operated on via DLC. Not all YAML elements are required for all operations (add/delete/show). The system ignores elements which are not pertinent to the requested operation.

9.1.10.1. Network Configuration

net:
   - net: <network.  Ex: tcp or o2ib>
     interfaces:
         0: <physical interface>
     detail: <This is only applicable for show command.  1 - output detailed info.  0 - basic output>
     tunables:
        peer_timeout: <Integer. Timeout before consider a peer dead>
        peer_credits: <Integer. Transmit credits for a peer>
        peer_buffer_credits: <Integer. Credits available for receiving messages>
        credits: <Integer.  Network Interface credits>
        SMP: <An array of integers of the form: "[x,y,...]", where each
             integer represents the CPT to associate the network interface with>
        seq_no: <integer.  Optional.  User generated, and is passed back in the YAML error block>

Neither the seq_no nor the detail field appears in the show output.

9.1.10.2. Enable Routing and Adjust Router Buffer Configuration

routing:
    - tiny: <Integer. Tiny buffers>
      small: <Integer. Small buffers>
      large: <Integer. Large buffers>
      enable: <0 - disable routing.  1 - enable routing>
      seq_no: <Integer.  Optional.  User generated, and is passed back in the YAML error block>

The seq_no field does not appear in the show output.

9.1.10.3. Show Statistics

statistics:
    seq_no: <Integer. Optional.  User generated, and is passed back in the YAML error block>

The seq_no field does not appear in the show output.

9.1.10.4. Route Configuration

route:
  - net: <network. Ex: tcp or o2ib>
    gateway: <nid of the gateway in the form <ip>@<net>: Ex: 192.168.29.1@tcp>
    hop: <an integer between 1 and 255. Optional>
    detail: <This is only applicable for show commands.  1 - output detailed info.  0. basic output>
    seq_no: <integer. Optional. User generated, and is passed back in the YAML error block>

Neither the seq_no nor the detail field appears in the show output.

9.2.  Overview of LNet Module Parameters

LNet kernel module (lnet) parameters specify how LNet is to be configured to work with Lustre, including which NICs will be configured to work with Lustre and the routing to be used with Lustre.

Parameters for LNet can be specified in the /etc/modprobe.d/lustre.conf file. In some cases the parameters may have been stored in /etc/modprobe.conf, but this has been deprecated since before RHEL5 and SLES10, and having a separate /etc/modprobe.d/lustre.conf file simplifies administration and distribution of the Lustre networking configuration. This file contains one or more entries with the syntax:

options lnet parameter=value

To specify the network interfaces that are to be used for Lustre, set either the networks parameter or the ip2nets parameter (only one of these parameters can be used at a time):

  • networks - Specifies the networks to be used.

  • ip2nets - Lists globally-available networks, each with a range of IP addresses. LNet then identifies locally-available networks through address list-matching lookup.

See Section 9.3, “Setting the LNet Module networks Parameter” and Section 9.4, “Setting the LNet Module ip2nets Parameter” for more details.

To set up routing between networks, use:

  • routes - Lists networks and the NIDs of routers that forward to them.

See Section 9.5, “Setting the LNet Module routes Parameter” for more details.

A router checker can be configured to enable Lustre nodes to detect router health status, avoid routers that appear dead, and reuse those that restore service after failures. See Section 9.7, “Configuring the Router Checker” for more details.

For a complete reference to the LNet module parameters, see Chapter 37, Configuration Files and Module Parameters (LNet Options).

Note

We recommend that you use 'dotted-quad' notation for IP addresses rather than host names to make it easier to read debug logs and debug configurations with multiple interfaces.

9.2.1. Using a Lustre Network Identifier (NID) to Identify a Node

A Lustre network identifier (NID) is used to uniquely identify a Lustre network endpoint by node ID and network type. The format of the NID is:

network_id@network_type

Examples are:

10.67.73.200@tcp0
10.67.75.100@o2ib

The first entry above identifies a TCP/IP node, while the second entry identifies an InfiniBand node.

When a mount command is run on a client, the client uses the NID of the MDS to retrieve configuration information. If an MDS has more than one NID, the client should use the appropriate NID for its local network.

To determine the appropriate NID to specify in the mount command, use the lctl command. To display MDS NIDs, run on the MDS:

lctl list_nids
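
On a multihomed MDS, the output lists one NID per line, for example (the addresses shown are illustrative):

mds# lctl list_nids
10.67.73.200@tcp0
10.67.75.100@o2ib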

To determine if a client can reach the MDS using a particular NID, run on the client:

lctl which_nid MDS_NID

9.3. Setting the LNet Module networks Parameter

If a node has more than one network interface, you'll typically want to dedicate a specific interface to Lustre. You can do this by including an entry in the lustre.conf file on the node that sets the LNet module networks parameter:

options lnet networks=comma-separated list of networks

This example specifies that a Lustre node will use a TCP/IP interface and an InfiniBand interface:

options lnet networks=tcp0(eth0),o2ib(ib0)

This example specifies that the Lustre node will use the TCP/IP interface eth1:

options lnet networks=tcp0(eth1)

Depending on the network design, it may be necessary to specify explicit interfaces. To explicitly specify that interface eth2 be used for network tcp0 and eth3 be used for tcp1 , use this entry:

options lnet networks=tcp0(eth2),tcp1(eth3)

When more than one interface is available during the network setup, Lustre chooses the best route based on the hop count. Once the network connection is established, Lustre expects the network to stay connected. In a Lustre network, connections do not fail over to another interface, even if multiple interfaces are available on the same node.

Note

LNet lines in lustre.conf are only used by the local node to determine what to call its interfaces. They are not used for routing decisions.

9.3.1. Multihome Server Example

If a server with multiple IP addresses (multihome server) is connected to a Lustre network, certain configuration settings are required. An example illustrating these settings consists of a network with the following nodes:

  • Server svr1 with three TCP NICs (eth0, eth1, and eth2) and an InfiniBand NIC.

  • Server svr2 with three TCP NICs (eth0, eth1, and eth2) and an InfiniBand NIC. Interface eth2 will not be used for Lustre networking.

  • TCP clients, each with a single TCP interface.

  • InfiniBand clients, each with a single Infiniband interface and a TCP/IP interface for administration.

To set the networks option for this example:

  • On each server, svr1 and svr2, include the following line in the lustre.conf file:

options lnet networks=tcp0(eth0),tcp1(eth1),o2ib
  • For TCP-only clients, the first available non-loopback IP interface is used for tcp0. Thus, TCP clients with only one interface do not need to have options defined in the lustre.conf file.

  • On the InfiniBand clients, include the following line in the lustre.conf file:

options lnet networks=o2ib

Note

By default, Lustre ignores the loopback (lo0) interface. Lustre does not ignore IP addresses aliased to the loopback. If you alias IP addresses to the loopback interface, you must specify all Lustre networks using the LNet networks parameter.

Note

If the server has multiple interfaces on the same subnet, the Linux kernel will send all traffic using the first configured interface. This is a limitation of Linux, not Lustre. In this case, network interface bonding should be used. For more information about network interface bonding, see Chapter 7, Setting Up Network Interface Bonding.

9.4. Setting the LNet Module ip2nets Parameter

The ip2nets option is typically used when a single, universal lustre.conf file is used on all servers and clients. Each node identifies the locally available networks based on the listed IP address patterns that match the node's local IP addresses.

Note that the IP address patterns listed in the ip2nets option are only used to identify the networks that an individual node should instantiate. They are not used by LNet for any other communications purpose.

For the example below, the nodes in the network have these IP addresses:

  • Server svr1: eth0 IP address 192.168.0.2, IP over Infiniband (o2ib) address 132.6.1.2.

  • Server svr2: eth0 IP address 192.168.0.4, IP over Infiniband (o2ib) address 132.6.1.4.

  • TCP clients have IP addresses 192.168.0.5-255.

  • Infiniband clients have IP over Infiniband (o2ib) addresses 132.6.[2-3].2, .4, .6, .8.

The following entry is placed in the lustre.conf file on each server and client:

options lnet 'ip2nets="tcp0(eth0) 192.168.0.[2,4]; \
tcp0 192.168.0.*; o2ib0 132.6.[1-3].[2-8/2]"'

Each entry in ip2nets is referred to as a 'rule'.

The order of LNet entries is important when configuring servers. If a server node can be reached using more than one network, the first network specified in lustre.conf will be used.

Because svr1 and svr2 match the first rule, LNet uses eth0 for tcp0 on those machines. (Although svr1 and svr2 also match the second rule, the first matching rule for a particular network is used).

The [2-8/2] format indicates a range of 2-8 stepped by 2; that is 2,4,6,8. Thus, a client at 132.6.3.5 will not find a matching o2ib network.

9.5. Setting the LNet Module routes Parameter

The LNet module routes parameter is used to identify routers in a Lustre configuration. These parameters are set in modprobe.conf on each Lustre node.

Routes are typically set to connect to segregated subnetworks or to cross-connect two different types of networks, such as tcp and o2ib.

The LNet routes parameter specifies a semicolon-separated list of router definitions. Each route is defined as a network number, followed by a list of routers:

routes=net_type router_NID(s)

This example specifies bi-directional routing in which TCP clients can reach Lustre resources on the IB networks and IB servers can access the TCP networks:

options lnet 'ip2nets="tcp0 192.168.0.*; \
  o2ib0(ib0) 132.6.1.[1-128]"' 'routes="tcp0   132.6.1.[1-8]@o2ib0; \
  o2ib0 192.168.0.[1-8]@tcp0"'

All LNet routers that bridge two networks are equivalent. They are not configured as primary or secondary, and the load is balanced across all available routers.

The number of LNet routers is not limited. Enough routers should be used to handle the required file serving bandwidth plus a 25 percent margin for headroom.

9.5.1. Routing Example

On the clients, place the following entry in the lustre.conf file:

lnet networks="tcp" routes="o2ib0 192.168.0.[1-8]@tcp0"

On the router nodes, use:

lnet networks="tcp o2ib" forwarding=enabled 

On the MDS, use the reverse as shown below:

lnet networks="o2ib0" routes="tcp0 132.6.1.[1-8]@o2ib0" 

To start the routers, run:

modprobe lnet
lctl network configure

9.6. Testing the LNet Configuration

After configuring Lustre Networking, it is highly recommended that you test your LNet configuration using the LNet Self-Test provided with the Lustre software. For more information about using LNet Self-Test, see Chapter 26, Testing Lustre Network Performance (LNet Self-Test).

9.7. Configuring the Router Checker

In a Lustre configuration in which different types of networks, such as a TCP/IP network and an Infiniband network, are connected by routers, a router checker can be run on the clients and servers in the routed configuration to monitor the status of the routers. In a multi-hop routing configuration, router checkers can be configured on routers to monitor the health of their next-hop routers.

A router checker is configured by setting LNet parameters in lustre.conf by including an entry in this form:

options lnet router_checker_parameter=value

The router checker parameters are:

  • live_router_check_interval - Specifies a time interval in seconds after which the router checker will ping the live routers. The default value is 0, meaning no checking is done. To set the value to 60, enter:

    options lnet live_router_check_interval=60
  • dead_router_check_interval - Specifies a time interval in seconds after which the router checker will check for dead routers. The default value is 0, meaning no checking is done. To set the value to 60, enter:

    options lnet dead_router_check_interval=60
  • auto_down - Enables/disables (1/0) the automatic marking of router state as up or down. The default value is 1. To disable router marking, enter:

    options lnet auto_down=0
  • router_ping_timeout - Specifies a timeout for the router checker when it checks live or dead routers. The router checker sends a ping message to each dead or live router once every dead_router_check_interval or live_router_check_interval respectively. The default value is 50. To set the value to 60, enter:

    options lnet router_ping_timeout=60

    Note

    The router_ping_timeout is consistent with the default LND timeouts. You may have to increase it on very large clusters if the LND timeout is also increased. For larger clusters, we suggest increasing the check interval.

  • check_routers_before_use - Specifies that routers are to be checked before use. Set to off by default. If this parameter is set to on, the dead_router_check_interval parameter must be given a positive integer value.

    options lnet check_routers_before_use=on
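
Combined in lustre.conf, a router checker configuration using the values from the examples above might look like the following sketch:

options lnet live_router_check_interval=60 dead_router_check_interval=60 \
        router_ping_timeout=60 check_routers_before_use=on auto_down=1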

The router checker obtains the following information from each router:

  • Time the router was disabled

  • Elapsed disable time

If the router checker does not get a reply message from the router within router_ping_timeout seconds, it considers the router to be down.

If a router is marked 'up' and responds to a ping, the timeout is reset.

If 100 packets have been sent successfully through a router, the sent-packets counter for that router will have a value of 100.

9.8. Best Practices for LNet Options

For the networks, ip2nets, and routes options, follow these best practices to avoid configuration errors.

9.8.1. Escaping commas with quotes

Depending on the Linux distribution, commas may need to be escaped using single or double quotes. In the extreme case, the options entry would look like this:

options lnet 'networks="tcp0,elan0"' 'routes="tcp [2,10]@elan0"'

Added quotes may confuse some distributions. Messages such as the following may indicate an issue related to added quotes:

lnet: Unknown parameter 'networks'

A 'Refusing connection - no matching NID' message generally points to an error in the LNet module configuration.

9.8.2. Including comments

Place the semicolon terminating a comment immediately after the comment. LNet silently ignores everything between the # character at the beginning of the comment and the next semicolon.

In this incorrect example, LNet silently ignores pt11 192.168.0.[92,96], resulting in these nodes not being properly initialized. No error message is generated.

options lnet ip2nets="pt10 192.168.0.[89,93]; # comment
      with semicolon BEFORE comment \ pt11 192.168.0.[92,96];

This correct example shows the required syntax:

options lnet ip2nets="pt10 192.168.0.[89,93] \
# comment with semicolon AFTER comment; \
pt11 192.168.0.[92,96] # comment

Do not add an excessive number of comments. The Linux kernel limits the length of character strings used in module options (usually to 1KB, but this may differ between vendor kernels). If you exceed this limit, errors result and the specified configuration may not be processed correctly.

Chapter 10. Configuring a Lustre File System

This chapter shows how to configure a simple Lustre file system comprised of a combined MGS/MDT, an OST and a client. It includes:

10.1.  Configuring a Simple Lustre File System

A Lustre file system can be set up in a variety of configurations by using the administrative utilities provided with the Lustre software. The procedure below shows how to configure a simple Lustre file system consisting of a combined MGS/MDS, one OSS with two OSTs, and a client. For an overview of the entire Lustre installation procedure, see Chapter 4, Installation Overview.

This configuration procedure assumes you have completed the following:

The following optional steps should also be completed, if needed, before the Lustre software is configured:

  • Set up a hardware or software RAID on block devices to be used as OSTs or MDTs. For information about setting up RAID, see the documentation for your RAID controller or Chapter 6, Configuring Storage on a Lustre File System.

  • Set up network interface bonding on Ethernet interfaces. For information about setting up network interface bonding, see Chapter 7, Setting Up Network Interface Bonding.

  • Set lnet module parameters to specify how Lustre Networking (LNet) is to be configured to work with a Lustre file system and test the LNet configuration. LNet will, by default, use the first TCP/IP interface it discovers on a system. If this network configuration is sufficient, you do not need to configure LNet. LNet configuration is required if you are using InfiniBand or multiple Ethernet interfaces.

For information about configuring LNet, see Chapter 9, Configuring Lustre Networking (LNet). For information about testing LNet, see Chapter 26, Testing Lustre Network Performance (LNet Self-Test).

  • Run the benchmark script sgpdd-survey to determine baseline performance of your hardware. Benchmarking your hardware will simplify debugging performance issues that are unrelated to the Lustre software and ensure you are getting the best possible performance with your installation. For information about running sgpdd-survey, see Chapter 27, Benchmarking Lustre File System Performance (Lustre I/O Kit). A sample invocation is sketched after the note below.

Note

The sgpdd-survey script overwrites the device being tested so it must be run before the OSTs are configured.
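
A sample sgpdd-survey invocation, assuming the Lustre I/O kit is installed and that /dev/sdc is a disk whose contents can be destroyed, might look like the following sketch (the size, crghi and thrhi values are illustrative and should be tuned for your hardware):

oss# size=128 crghi=16 thrhi=32 scsidevs="/dev/sdc" sgpdd-survey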

To configure a simple Lustre file system, complete these steps:

  1. Create a combined MGS/MDT file system on a block device. On the MDS node, run:

    mkfs.lustre --fsname=fsname --mgs --mdt --index=0 /dev/block_device

    The default file system name (fsname) is lustre.

    Note

    If you plan to create multiple file systems, the MGS should be created separately on its own dedicated block device, by running:

    mkfs.lustre --fsname=fsname --mgs /dev/block_device

    See Section 13.7, “Running Multiple Lustre File Systems” for more details.

  2. Optional for Lustre software release 2.4 and later. Add additional MDTs:

    mkfs.lustre --fsname=fsname --mgsnode=nid --mdt --index=1 /dev/block_device

    Note

    Up to 4095 additional MDTs can be added.

  3. Mount the combined MGS/MDT file system on the block device. On the MDS node, run:

    mount -t lustre /dev/block_device /mount_point

    Note

    If you have created an MGS and an MDT on separate block devices, mount them both.
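
    For example, with a separate MGS on /dev/sda and an MDT on /dev/sdb (device names and mount points here are illustrative), mount the MGS first and then the MDT:

    mds# mount -t lustre /dev/sda /mnt/mgs
    mds# mount -t lustre /dev/sdb /mnt/mdt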

  4. Create the OST. On the OSS node, run:

    mkfs.lustre --fsname=fsname --mgsnode=MGS_NID --ost --index=OST_index /dev/block_device

    When you create an OST, you are formatting an ldiskfs or ZFS file system on a block storage device like you would with any local file system.

    You can have as many OSTs per OSS as the hardware or drivers allow. For more information about storage and memory requirements for a Lustre file system, see Chapter 5, Determining Hardware Configuration Requirements and Formatting Options.

    You can only configure one OST per block device. You should create an OST that uses the raw block device and does not use partitioning.

    You should specify the OST index number at format time in order to simplify translating the OST number in error messages or file striping to the OSS node and block device later on.

    If you are using block devices that are accessible from multiple OSS nodes, ensure that you mount the OSTs from only one OSS node at a time. It is strongly recommended that multiple-mount protection be enabled for such devices to prevent serious data corruption. For more information about multiple-mount protection, see Chapter 20, Lustre File System Failover and Multiple-Mount Protection.

    Note

    The Lustre software currently supports block devices up to 128 TB on Red Hat Enterprise Linux 5 and 6 (up to 8 TB on other distributions). If the device size is only slightly larger than 16 TB, it is recommended that you limit the file system size to 16 TB at format time. We recommend that you not place DOS partitions on top of RAID 5/6 block devices due to negative impacts on performance, but instead format the whole disk for the file system.

  5. Mount the OST. On the OSS node where the OST was created, run:

    mount -t lustre 
    /dev/block_device 
    /mount_point
    

    Note

    To create additional OSTs, repeat Step 4 and Step 5, specifying the next higher OST index number.

  6. Mount the Lustre file system on the client. On the client node, run:

    mount -t lustre MGS_node:/fsname /mount_point

    Note

    To create additional clients, repeat Step 6.

    Note

    If you have a problem mounting the file system, check the syslogs on the client and all the servers for errors and also check the network settings. A common issue with newly-installed systems is that hosts.deny or firewall rules may prevent connections on port 988.
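
    If a mount fails, basic LNet connectivity from the client to the servers can be checked with the lctl ping command before retrying (replace MGS_NID with the NID of the MGS, for example 10.2.0.1@tcp):

    client# lctl ping MGS_NID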

  7. Verify that the file system started and is working correctly. Do this by running lfs df, dd and ls commands on the client node.

  8. (Optional) Run benchmarking tools to validate the performance of hardware and software layers in the cluster. Available tools include:

10.1.1.  Simple Lustre Configuration Example

To see the steps to complete for a simple Lustre file system configuration, follow this example in which a combined MGS/MDT and two OSTs are created to form a file system called temp. Three block devices are used, one for the combined MGS/MDS node and one for each OSS node. Common parameters used in the example are listed below, along with individual node parameters.

Common Parameters     Value            Description

MGS node              10.2.0.1@tcp0    Node for the combined MGS/MDS
file system           temp             Name of the Lustre file system
network type          TCP/IP           Network type used for Lustre file system temp

Node Parameters       Value            Description

MGS/MDS node
  MGS/MDS node        mdt0             MDS in Lustre file system temp
  block device        /dev/sdb         Block device for the combined MGS/MDS node
  mount point         /mnt/mdt         Mount point for the mdt0 block device (/dev/sdb) on the MGS/MDS node

First OSS node
  OSS node            oss0             First OSS node in Lustre file system temp
  OST                 ost0             First OST in Lustre file system temp
  block device        /dev/sdc         Block device for the first OSS node (oss0)
  mount point         /mnt/ost0        Mount point for the ost0 block device (/dev/sdc) on the oss0 node

Second OSS node
  OSS node            oss1             Second OSS node in Lustre file system temp
  OST                 ost1             Second OST in Lustre file system temp
  block device        /dev/sdd         Block device for the second OSS node (oss1)
  mount point         /mnt/ost1        Mount point for the ost1 block device (/dev/sdd) on the oss1 node

Client node
  client node         client1          Client in Lustre file system temp
  mount point         /lustre          Mount point for Lustre file system temp on the client1 node

Note

We recommend that you use 'dotted-quad' notation for IP addresses rather than host names to make it easier to read debug logs and debug configurations with multiple interfaces.

For this example, complete the steps below:

  1. Create a combined MGS/MDT file system on the block device. On the MDS node, run:

    [root@mds /]# mkfs.lustre --fsname=temp --mgs --mdt --index=0 /dev/sdb
    

    This command generates this output:

        Permanent disk data:
    Target:            temp-MDT0000
    Index:             0
    Lustre FS: temp
    Mount type:        ldiskfs
    Flags:             0x75
       (MDT MGS first_time update )
    Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
    Parameters: mdt.identity_upcall=/usr/sbin/l_getidentity
     
    checking for existing Lustre data: not found
    device size = 16MB
    2 6 18
    formatting backing filesystem ldiskfs on /dev/sdb
       target name             temp-MDTffff
       4k blocks               0
       options                 -i 4096 -I 512 -q -O dir_index,uninit_groups -F
    mkfs_cmd = mkfs.ext2 -j -b 4096 -L temp-MDTffff  -i 4096 -I 512 -q -O 
    dir_index,uninit_groups -F /dev/sdb
    Writing CONFIGS/mountdata 
    
  2. Mount the combined MGS/MDT file system on the block device. On the MDS node, run:

    [root@mds /]# mount -t lustre /dev/sdb /mnt/mdt
    

    This command generates this output:

    Lustre: temp-MDT0000: new disk, initializing 
    Lustre: 3009:0:(lproc_mds.c:262:lprocfs_wr_identity_upcall()) temp-MDT0000:
    group upcall set to /usr/sbin/l_getidentity
    Lustre: temp-MDT0000.mdt: set parameter identity_upcall=/usr/sbin/l_getidentity
    Lustre: Server temp-MDT0000 on device /dev/sdb has started 
    
  3. Create and mount ost0.

    In this example, the OSTs (ost0 and ost1) are being created on different OSS nodes (oss0 and oss1 respectively).

    1. Create ost0. On oss0 node, run:

      [root@oss0 /]# mkfs.lustre --fsname=temp --mgsnode=10.2.0.1@tcp0 --ost
      --index=0 /dev/sdc
      

      The command generates this output:

          Permanent disk data:
      Target:            temp-OST0000
      Index:             0
      Lustre FS: temp
      Mount type:        ldiskfs
      Flags:             0x72
      (OST first_time update)
      Persistent mount opts: errors=remount-ro,extents,mballoc
      Parameters: mgsnode=10.2.0.1@tcp
       
      checking for existing Lustre data: not found
      device size = 16MB
      2 6 18
      formatting backing filesystem ldiskfs on /dev/sdc
         target name             temp-OST0000
         4k blocks               0
         options                 -I 256 -q -O dir_index,uninit_groups -F
      mkfs_cmd = mkfs.ext2 -j -b 4096 -L temp-OST0000  -I 256 -q -O
      dir_index,uninit_groups -F /dev/sdc
      Writing CONFIGS/mountdata 
      
    2. Mount ost0 on the OSS on which it was created. On oss0 node, run:

      [root@oss0 /]# mount -t lustre /dev/sdc /mnt/ost0
      

      The command generates this output:

      LDISKFS-fs: file extents enabled 
      LDISKFS-fs: mballoc enabled
      Lustre: temp-OST0000: new disk, initializing
      Lustre: Server temp-OST0000 on device /dev/sdc has started
      

      Shortly afterwards, this output appears:

      Lustre: temp-OST0000: received MDS connection from 10.2.0.1@tcp0
      Lustre: MDS temp-MDT0000: temp-OST0000_UUID now active, resetting orphans 
      
  4. Create and mount ost1.

    1. Create ost1. On oss1 node, run:

      [root@oss1 /]# mkfs.lustre --fsname=temp --mgsnode=10.2.0.1@tcp0 \
                 --ost --index=1 /dev/sdd
      

      The command generates this output:

          Permanent disk data:
      Target:            temp-OST0001
      Index:             1
      Lustre FS: temp
      Mount type:        ldiskfs
      Flags:             0x72
      (OST first_time update)
      Persistent mount opts: errors=remount-ro,extents,mballoc
      Parameters: mgsnode=10.2.0.1@tcp
       
      checking for existing Lustre data: not found
      device size = 16MB
      2 6 18
      formatting backing filesystem ldiskfs on /dev/sdd
         target name             temp-OST0001
         4k blocks               0
         options                 -I 256 -q -O dir_index,uninit_groups -F
      mkfs_cmd = mkfs.ext2 -j -b 4096 -L temp-OST0001  -I 256 -q -O
      dir_index,uninit_groups -F /dev/sdd
      Writing CONFIGS/mountdata 
      
    2. Mount ost1 on the OSS on which it was created. On oss1 node, run:

      [root@oss1 /]# mount -t lustre /dev/sdd /mnt/ost1
      

      The command generates this output:

      LDISKFS-fs: file extents enabled 
      LDISKFS-fs: mballoc enabled
      Lustre: temp-OST0001: new disk, initializing
      Lustre: Server temp-OST0001 on device /dev/sdd has started
      

      Shortly afterwards, this output appears:

      Lustre: temp-OST0001: received MDS connection from 10.2.0.1@tcp0
      Lustre: MDS temp-MDT0000: temp-OST0001_UUID now active, resetting orphans 
      
  5. Mount the Lustre file system on the client. On the client node, run:

    [root@client1 /]# mount -t lustre 10.2.0.1@tcp0:/temp /lustre
    

    This command generates this output:

    Lustre: Client temp-client has started
    
  6. Verify that the file system started and is working by running the lfs df, dd and ls commands on the client node.

    1. Run the lfs df -h command:

      [root@client1 /] lfs df -h 
      

      The lfs df -h command lists space usage per OST and the MDT in human-readable format. This command generates output similar to this:

      UUID               bytes      Used      Available   Use%    Mounted on
      temp-MDT0000_UUID  8.0G      400.0M       7.6G        0%      /lustre[MDT:0]
      temp-OST0000_UUID  800.0G    400.0M     799.6G        0%      /lustre[OST:0]
      temp-OST0001_UUID  800.0G    400.0M     799.6G        0%      /lustre[OST:1]
      filesystem summary:  1.6T    800.0M       1.6T        0%      /lustre
      
    2. Run the lfs df -ih command.

      [root@client1 /] lfs df -ih
      

      The lfs df -ih command lists inode usage per OST and the MDT. This command generates output similar to this:

      UUID              Inodes      IUsed       IFree   IUse%     Mounted on
      temp-MDT0000_UUID   2.5M        32         2.5M      0%       /lustre[MDT:0]
      temp-OST0000_UUID   5.5M        54         5.5M      0%       /lustre[OST:0]
      temp-OST0001_UUID   5.5M        54         5.5M      0%       /lustre[OST:1]
      filesystem summary: 2.5M        32         2.5M      0%       /lustre
      
    3. Run the dd command:

      [root@client1 /] cd /lustre
      [root@client1 /lustre] dd if=/dev/zero of=/lustre/zero.dat bs=4M count=2
      

      The dd command verifies write functionality by creating a file containing all zeros (0s). In this command, an 8 MB file is created. This command generates output similar to this:

      2+0 records in
      2+0 records out
      8388608 bytes (8.4 MB) copied, 0.159628 seconds, 52.6 MB/s
      
    4. Run the ls command:

      [root@client1 /lustre] ls -lsah
      

      The ls -lsah command lists files and directories in the current working directory. This command generates output similar to this:

      total 8.0M
      4.0K drwxr-xr-x  2 root root 4.0K Oct 16 15:27 .
      8.0K drwxr-xr-x 25 root root 4.0K Oct 16 15:27 ..
      8.0M -rw-r--r--  1 root root 8.0M Oct 16 15:27 zero.dat 
       
      

Once the Lustre file system is configured, it is ready for use.

10.2.  Additional Configuration Options

This section describes how to scale the Lustre file system or make configuration changes using the Lustre configuration utilities.

10.2.1.  Scaling the Lustre File System

A Lustre file system can be scaled by adding OSTs or clients. For instructions on creating additional OSTs, repeat Step 4 and Step 5 above. For mounting additional clients, repeat Step 6 for each client.

10.2.2.  Changing Striping Defaults

The default settings for the file layout stripe pattern are shown in Table 10.1, “Default stripe pattern”.

Table 10.1. Default stripe pattern

File Layout Parameter   Default   Description

stripe_size             1 MB      Amount of data to write to one OST before moving to the next OST.

stripe_count            1         The number of OSTs to use for a single file.

start_ost               -1        The first OST where objects are created for each file. The default -1 allows the MDS to choose the starting index based on available space and load balancing. It is strongly recommended not to change this default.


Use the lfs setstripe command described in Chapter 18, Managing File Layout (Striping) and Free Space to change the file layout configuration.
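
For example, a minimal sketch that sets a 4 MB stripe size and a stripe count of 2 on a directory, so that files created in it inherit that layout (the directory path is illustrative):

client# lfs setstripe -S 4M -c 2 /mnt/testfs/shared_dir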

10.2.3.  Using the Lustre Configuration Utilities

If additional configuration is necessary, several configuration utilities are available:

  • mkfs.lustre - Use to format a disk for a Lustre service.

  • tunefs.lustre - Use to modify configuration information on a Lustre target disk.

  • lctl - Use to directly control Lustre features via an ioctl interface, allowing various configuration, maintenance and debugging features to be accessed.

  • mount.lustre - Use to start a Lustre client or target service.

For examples using these utilities, see Chapter 38, System Configuration Utilities.

The lfs utility is useful for configuring and querying a variety of options related to files. For more information, see Chapter 34, User Utilities.

Note

Some sample scripts are included in the directory where the Lustre software is installed. If you have installed the Lustre source code, the scripts are located in the lustre/tests sub-directory. These scripts enable quick setup of some simple standard Lustre configurations.

Chapter 11. Configuring Failover in a Lustre File System

This chapter describes how to configure failover in a Lustre file system. It includes:

For an overview of failover functionality in a Lustre file system, see Chapter 3, Understanding Failover in a Lustre File System.

11.1. Setting Up a Failover Environment

The Lustre software provides failover mechanisms only at the layer of the Lustre file system. No failover functionality is provided for system-level components such as failing hardware or applications, or even for the entire failure of a node, as would typically be provided in a complete failover solution. Failover functionality such as node monitoring, failure detection, and resource fencing must be provided by external HA software, such as PowerMan or the open source Corosync and Pacemaker packages provided by Linux operating system vendors. Corosync provides support for detecting failures, and Pacemaker provides the actions to take once a failure has been detected.

11.1.1. Selecting Power Equipment

Failover in a Lustre file system requires the use of a remote power control (RPC) mechanism, which comes in different configurations. For example, Lustre server nodes may be equipped with IPMI/BMC devices that allow remote power control. In the past, software or even “sneakerware” has been used, but these are not recommended. For recommended devices, refer to the list of supported RPC devices on the website for the PowerMan cluster power management utility:

http://code.google.com/p/powerman/wiki/SupportedDevs

11.1.2. Selecting Power Management Software

Lustre failover requires RPC and management capability to verify that a failed node is shut down before I/O is directed to the failover node. This avoids double-mounting the two nodes and the risk of unrecoverable data corruption. A variety of power management tools will work. Two packages that have been commonly used with the Lustre software are PowerMan and Linux-HA (aka STONITH).

The PowerMan cluster power management utility is used to control RPC devices from a central location. PowerMan provides native support for several RPC varieties and Expect-like configuration simplifies the addition of new devices. The latest versions of PowerMan are available at:

http://code.google.com/p/powerman/

STONITH, or “Shoot The Other Node In The Head”, is a set of power management tools provided with the Linux-HA package prior to Red Hat Enterprise Linux 6. Linux-HA has native support for many power control devices, is extensible (uses Expect scripts to automate control), and provides the software to detect and respond to failures. With Red Hat Enterprise Linux 6, Linux-HA is being replaced in the open source community by the combination of Corosync and Pacemaker. For Red Hat Enterprise Linux subscribers, cluster management using CMAN is available from Red Hat.

11.1.3. Selecting High-Availability (HA) Software

The Lustre file system must be set up with high-availability (HA) software to enable a complete Lustre failover solution. Except for PowerMan, the HA software packages mentioned above provide both power management and cluster management. For information about setting up failover with Pacemaker, see:

11.2. Preparing a Lustre File System for Failover

To prepare a Lustre file system to be configured and managed as an HA system by a third-party HA application, each storage target (MGT, MDT, OST) must be associated with a second node to create a failover pair. This configuration information is then communicated by the MGS to a client when the client mounts the file system.

The per-target configuration is relayed to the MGS at mount time. Some rules related to this are:

  • When a target is initially mounted, the MGS reads the configuration information from the target (such as mgt vs. ost, failnode, fsname) to configure the target into a Lustre file system. If the MGS is reading the initial mount configuration, the mounting node becomes that target's “primary” node.

  • When a target is subsequently mounted, the MGS reads the current configuration from the target and, as needed, will reconfigure the MGS database target information.

When the target is formatted using the mkfs.lustre command, the failover service node(s) for the target are designated using the --servicenode option. In the example below, an OST with index 0 in the file system testfs is formatted with two service nodes designated to serve as a failover pair:

mkfs.lustre --reformat --ost --fsname testfs --mgsnode=192.168.10.1@o2ib \
              --index=0 --servicenode=192.168.10.7@o2ib \
              --servicenode=192.168.10.8@o2ib \  
              /dev/sdb

More than two potential service nodes can be designated for a target. The target can then be mounted on any of the designated service nodes.

When HA is configured on a storage target, the Lustre software enables multi-mount protection (MMP) on that storage target. MMP prevents multiple nodes from simultaneously mounting and thus corrupting the data on the target. For more about MMP, see Chapter 20, Lustre File System Failover and Multiple-Mount Protection.

If the MGT has been formatted with multiple service nodes designated, this information must be conveyed to the Lustre client in the mount command used to mount the file system. In the example below, NIDs for two MGSs that have been designated as service nodes for the MGT are specified in the mount command executed on the client:

mount -t lustre 10.10.120.1@tcp1:10.10.120.2@tcp1:/testfs /lustre/testfs

When a client mounts the file system, the MGS provides configuration information to the client for the MDT(s) and OST(s) in the file system along with the NIDs for all service nodes associated with each target and the service node on which the target is mounted. Later, when the client attempts to access data on a target, it will try the NID for each specified service node until it connects to the target.

Prior to Lustre software release 2.0, the --failnode option to mkfs.lustre was used to designate a failover service node for a primary server for a target. When the --failnode option is used, certain restrictions apply:

  • The target must be initially mounted on the primary service node, not the failover node designated by the --failnode option.

  • If the tunefs.lustre --writeconf option is used to erase and regenerate the configuration log for the file system, a target cannot be initially mounted on a designated failnode.

  • If a --failnode option is added to a target to designate a failover server for the target, the target must be re-mounted on the primary node before the --failnode option takes effect.
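
For comparison with the --servicenode example above, a target formatted with the older --failnode option might look like the following sketch (the NIDs and device name are illustrative); the target must then be mounted on its primary node first:

mkfs.lustre --reformat --ost --fsname testfs --mgsnode=192.168.10.1@o2ib \
              --index=1 --failnode=192.168.10.8@o2ib /dev/sdc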

11.3. Administering Failover in a Lustre File System

For additional information about administering failover features in a Lustre file system, see:

Part III. Administering Lustre

Part III provides information about tools and procedures to use to administer a Lustre file system. You will find information in this section about:

Tip

The starting point for administering a Lustre file system is to monitor all logs and console logs for system health:

- Monitor logs on all servers and all clients.

- Invest in tools that allow you to condense logs from multiple systems.

- Use the logging resources provided in the Linux distribution.

Table of Contents

12. Monitoring a Lustre File System
12.1. Lustre Changelogs
12.1.1. Working with Changelogs
12.1.2. Changelog Examples
12.2. Lustre Jobstats
12.2.1. How Jobstats Works
12.2.2. Enable/Disable Jobstats
12.2.3. Check Job Stats
12.2.4. Clear Job Stats
12.2.5. Configure Auto-cleanup Interval
12.3. Lustre Monitoring Tool (LMT)
12.4. CollectL
12.5. Other Monitoring Options
13. Lustre Operations
13.1. Mounting by Label
13.2. Starting Lustre
13.3. Mounting a Server
13.4. Unmounting a Server
13.5. Specifying Failout/Failover Mode for OSTs
13.6. Handling Degraded OST RAID Arrays
13.7. Running Multiple Lustre File Systems
13.8. Creating a sub-directory on a given MDT (Lustre 2.4)
13.9. Creating a directory striped across multiple MDTs (Lustre 2.8)
13.10. Setting and Retrieving Lustre Parameters
13.10.1. Setting Tunable Parameters with mkfs.lustre
13.10.2. Setting Parameters with tunefs.lustre
13.10.3. Setting Parameters with lctl
13.11. Specifying NIDs and Failover
13.12. Erasing a File System
13.13. Reclaiming Reserved Disk Space
13.14. Replacing an Existing OST or MDT
13.15. Identifying To Which Lustre File an OST Object Belongs
14. Lustre Maintenance
14.1. Working with Inactive OSTs
14.2. Finding Nodes in the Lustre File System
14.3. Mounting a Server Without Lustre Service
14.4. Regenerating Lustre Configuration Logs
14.5. Changing a Server NID
14.6. Adding a New MDT to a Lustre File System (Lustre 2.4)
14.7. Adding a New OST to a Lustre File System
14.8. Removing and Restoring OSTs
14.8.1. Removing a MDT from the File System (Lustre 2.4)
14.8.2. Working with Inactive MDTs (Lustre 2.4)
14.8.3. Removing an OST from the File System
14.8.4. Backing Up OST Configuration Files
14.8.5. Restoring OST Configuration Files
14.8.6. Returning a Deactivated OST to Service
14.9. Aborting Recovery
14.10. Determining Which Machine is Serving an OST
14.11. Changing the Address of a Failover Node
14.12. Separate a combined MGS/MDT
15. Managing Lustre Networking (LNet)
15.1. Updating the Health Status of a Peer or Router
15.2. Starting and Stopping LNet
15.2.1. Starting LNet
15.2.2. Stopping LNet
15.3. Multi-Rail Configurations with LNet
15.4. Load Balancing with an InfiniBand* Network
15.4.1. Setting Up lustre.conf for Load Balancing
15.5. Dynamically Configuring LNet Routes (Lustre 2.4)
15.5.1. lustre_routes_config
15.5.2. lustre_routes_conversion
15.5.3. Route Configuration Examples
16. Upgrading a Lustre File System
16.1. Release Interoperability and Upgrade Requirements
16.2. Upgrading to Lustre Software Release 2.x (Major Release)
16.3. Upgrading to Lustre Software Release 2.x.y (Minor Release)
17. Backing Up and Restoring a File System
17.1. Backing up a File System
17.1.1. Lustre_rsync
17.2. Backing Up and Restoring an MDT or OST (ldiskfs Device Level)
17.3. Backing Up an OST or MDT (ldiskfs File System Level)
17.4. Restoring a File-Level Backup
17.5. Using LVM Snapshots with the Lustre File System
17.5.1. Creating an LVM-based Backup File System
17.5.2. Backing up New/Changed Files to the Backup File System
17.5.3. Creating Snapshot Volumes
17.5.4. Restoring the File System From a Snapshot
17.5.5. Deleting Old Snapshots
17.5.6. Changing Snapshot Volume Size
18. Managing File Layout (Striping) and Free Space
18.1. How Lustre File System Striping Works
18.2. Lustre File Layout (Striping) Considerations
18.2.1. Choosing a Stripe Size
18.3. Setting the File Layout/Striping Configuration (lfs setstripe)
18.3.1. Specifying a File Layout (Striping Pattern) for a Single File
18.3.2. Setting the Striping Layout for a Directory
18.3.3. Setting the Striping Layout for a File System
18.3.4. Creating a File on a Specific OST
18.4. Retrieving File Layout/Striping Information (getstripe)
18.4.1. Displaying the Current Stripe Size
18.4.2. Inspecting the File Tree
18.4.3. Locating the MDT for a remote directory
18.5. Managing Free Space
18.5.1. Checking File System Free Space
18.5.2. Stripe Allocation Methods
18.5.3. Adjusting the Weighting Between Free Space and Location
18.6. Lustre Striping Internals
19. Managing the File System and I/O
19.1. Handling Full OSTs
19.1.1. Checking OST Space Usage
19.1.2. Taking a Full OST Offline
19.1.3. Migrating Data within a File System
19.1.4. Returning an Inactive OST Back Online
19.2. Creating and Managing OST Pools
19.2.1. Working with OST Pools
19.2.2. Tips for Using OST Pools
19.3. Adding an OST to a Lustre File System
19.4. Performing Direct I/O
19.4.1. Making File System Objects Immutable
19.5. Other I/O Options
19.5.1. Lustre Checksums
19.5.2. Ptlrpc Thread Pool
20. Lustre File System Failover and Multiple-Mount Protection
20.1. Overview of Multiple-Mount Protection
20.2. Working with Multiple-Mount Protection
21. Configuring and Managing Quotas
21.1. Working with Quotas
21.2. Enabling Disk Quotas
21.2.1. Enabling Disk Quotas (Lustre Software Prior to Release 2.4)
21.2.2. Enabling Disk Quotas (Lustre Software Release 2.4 and later)
21.3. Quota Administration
21.4. Quota Allocation
21.5. Quotas and Version Interoperability
21.6. Granted Cache and Quota Limits
21.7. Lustre Quota Statistics
21.7.1. Interpreting Quota Statistics
22. Hierarchical Storage Management (HSM) (Lustre 2.5)
22.1. Introduction
22.2. Setup
22.2.1. Requirements
22.2.2. Coordinator
22.2.3. Agents
22.3. Agents and copytool
22.3.1. Archive ID, multiple backends
22.3.2. Registered agents
22.3.3. Timeout
22.4. Requests
22.4.1. Commands
22.4.2. Automatic restore
22.4.3. Request monitoring
22.5. File states
22.6. Tuning
22.6.1. hsm_controlpolicy
22.6.2. max_requests
22.6.3. policy
22.6.4. grace_delay
22.7. change logs
22.8. Policy engine
22.8.1. Robinhood
23. Mapping UIDs and GIDs with Nodemap (Lustre 2.9)
23.1. Setting a Mapping
23.1.1. Defining Terms
23.1.2. Deciding on NID Ranges
23.1.3. Describing and Deploying a Sample Mapping
23.2. Altering Properties
23.2.1. Managing the Properties
23.2.2. Mixing Properties
23.3. Enabling the Feature
23.4. Verifying Settings
23.5. Ensuring Consistency
24. Configuring Shared-Secret Key (SSK) Security (Lustre 2.9)
24.1. SSK Security Overview
24.1.1. Key features
24.2. SSK Security Flavors
24.2.1. Secure RPC Rules
24.3. SSK Key Files
24.3.1. Key File Management
24.4. Lustre GSS Keyring
24.4.1. Setup
24.4.2. Server Setup
24.4.3. Debugging GSS Keyring
24.4.4. Revoking Keys
24.5. Role of Nodemap in SSK
24.6. SSK Examples
24.6.1. Securing Client to Server Communications
24.6.2. Securing MGS Communications
24.6.3. Securing Server to Server Communications
24.7. Viewing Secure PtlRPC Contexts
25. Managing Security in a Lustre File System
25.1. Using ACLs
25.1.1. How ACLs Work
25.1.2. Using ACLs with the Lustre Software
25.1.3. Examples
25.2. Using Root Squash
25.2.1. Configuring Root Squash
25.2.2. Enabling and Tuning Root Squash
25.2.3. Tips on Using Root Squash

Chapter 12. Monitoring a Lustre File System

This chapter provides information on monitoring a Lustre file system and includes the following sections:

12.1.  Lustre Changelogs

The changelogs feature records events that change the file system namespace or file metadata. Changes such as file creation, deletion, renaming, attribute changes, etc. are recorded with the target and parent file identifiers (FIDs), the name of the target, and a timestamp. These records can be used for a variety of purposes:

  • Capture recent changes to feed into an archiving system.

  • Use changelog entries to exactly replicate changes in a file system mirror.

  • Set up "watch scripts" that take action on certain events or directories.

  • Maintain a rough audit trail (file/directory changes with timestamps, but no user information).

Changelog record types are:

Value   Description

MARK    Internal recordkeeping
CREAT   Regular file creation
MKDIR   Directory creation
HLINK   Hard link
SLINK   Soft link
MKNOD   Other file creation
UNLNK   Regular file removal
RMDIR   Directory removal
RNMFM   Rename, original
RNMTO   Rename, final
IOCTL   ioctl on file or directory
TRUNC   Regular file truncated
SATTR   Attribute change
XATTR   Extended attribute change
UNKNW   Unknown operation

FID-to-full-pathname and pathname-to-FID functions are also included to map target and parent FIDs into the file system namespace.

12.1.1.  Working with Changelogs

Several commands are available to work with changelogs.

12.1.1.1.  lctl changelog_register

Because changelog records take up space on the MDT, the system administrator must register changelog users. The registrants specify which records they are "done with", and the system purges up to the greatest common record.

To register a new changelog user, run:

lctl --device fsname-MDTnumber changelog_register

Changelog entries are not purged beyond a registered user's set point (see lfs changelog_clear).

12.1.1.2.  lfs changelog

To display the metadata changes on an MDT (the changelog records), run:

lfs changelog fsname-MDTnumber [startrec [endrec]] 

Specifying the start and end records is optional.

These are sample changelog records:

2 02MKDIR 4298396676 0x0 t=[0x200000405:0x15f9:0x0] p=[0x13:0x15e5a7a3:0x0]\
 pics 
3 01CREAT 4298402264 0x0 t=[0x200000405:0x15fa:0x0] p=[0x200000405:0x15f9:0\
x0] chloe.jpg 
4 06UNLNK 4298404466 0x0 t=[0x200000405:0x15fa:0x0] p=[0x200000405:0x15f9:0\
x0] chloe.jpg 
5 07RMDIR 4298405394 0x0 t=[0x200000405:0x15f9:0x0] p=[0x13:0x15e5a7a3:0x0]\
 pics 

12.1.1.3.  lfs changelog_clear

To clear old changelog records for a specific user (records that the user no longer needs), run:

lfs changelog_clear mdt_name userid endrec

The changelog_clear command indicates that changelog records previous to endrec are no longer of interest to a particular user userid, potentially allowing the MDT to free up disk space. An endrec value of 0 indicates the current last record. To run changelog_clear, the changelog user must be registered on the MDT node using lctl.

When all changelog users are done with records < X, the records are deleted.

12.1.1.4.  lctl changelog_deregister

To deregister (unregister) a changelog user, run:

lctl --device mdt_device changelog_deregister userid       

changelog_deregister cl1 effectively does a changelog_clear cl1 0 as it deregisters.

12.1.2. Changelog Examples

This section provides examples of different changelog commands.

12.1.2.1. Registering a Changelog User

To register a new changelog user for a device (lustre-MDT0000):

# lctl --device lustre-MDT0000 changelog_register
lustre-MDT0000: Registered changelog userid 'cl1'

12.1.2.2. Displaying Changelog Records

To display changelog records on an MDT (lustre-MDT0000):

$ lfs changelog lustre-MDT0000
1 00MARK  19:08:20.890432813 2010.03.24 0x0 t=[0x10001:0x0:0x0] p=[0:0x0:0x\
0] mdd_obd-lustre-MDT0000-0 
2 02MKDIR 19:10:21.509659173 2010.03.24 0x0 t=[0x200000420:0x3:0x0] p=[0x61\
b4:0xca2c7dde:0x0] mydir 
3 14SATTR 19:10:27.329356533 2010.03.24 0x0 t=[0x200000420:0x3:0x0] 
4 01CREAT 19:10:37.113847713 2010.03.24 0x0 t=[0x200000420:0x4:0x0] p=[0x20\
0000420:0x3:0x0] hosts 

Changelog records include this information:

rec# 
operation_type(numerical/text) 
timestamp 
datestamp 
flags 
t=target_FID 
p=parent_FID 
target_name

Displayed in this format:

rec# operation_type(numerical/text) timestamp datestamp flags t=target_FID \
p=parent_FID target_name

For example:

4 01CREAT 19:10:37.113847713 2010.03.24 0x0 t=[0x200000420:0x4:0x0] p=[0x20\
0000420:0x3:0x0] hosts

12.1.2.3. Clearing Changelog Records

To notify a device that a specific user (cl1) no longer needs records (up to and including 3):

$ lfs changelog_clear  lustre-MDT0000 cl1 3

To confirm that the changelog_clear operation was successful, run lfs changelog; only records after id-3 are listed:

$ lfs changelog lustre-MDT0000
4 01CREAT 19:10:37.113847713 2010.03.24 0x0 t=[0x200000420:0x4:0x0] p=[0x20\
0000420:0x3:0x0] hosts

12.1.2.4. Deregistering a Changelog User

To deregister a changelog user (cl1) for a specific device (lustre-MDT0000):

# lctl --device lustre-MDT0000 changelog_deregister cl1
lustre-MDT0000: Deregistered changelog user 'cl1'

The deregistration operation clears all changelog records for the specified user (cl1).

$ lfs changelog lustre-MDT0000
5 00MARK  19:13:40.858292517 2010.03.24 0x0 t=[0x40001:0x0:0x0] p=[0:0x0:0x\
0] mdd_obd-lustre-MDT0000-0 

Note

MARK records typically indicate changelog recording status changes.

12.1.2.5. Displaying the Changelog Index and Registered Users

To display the current, maximum changelog index and registered changelog users for a specific device (lustre-MDT0000):

# lctl get_param  mdd.lustre-MDT0000.changelog_users 
mdd.lustre-MDT0000.changelog_users=current index: 8 
ID    index 
cl2   8

12.1.2.6. Displaying the Changelog Mask

To show the current changelog mask on a specific device (lustre-MDT0000):

# lctl get_param  mdd.lustre-MDT0000.changelog_mask 

mdd.lustre-MDT0000.changelog_mask= 
MARK CREAT MKDIR HLINK SLINK MKNOD UNLNK RMDIR RNMFM RNMTO OPEN CLOSE IOCTL\
 TRUNC SATTR XATTR HSM 

12.1.2.7. Setting the Changelog Mask

To set the current changelog mask on a specific device (lustre-MDT0000):

# lctl set_param mdd.lustre-MDT0000.changelog_mask=HLINK 
mdd.lustre-MDT0000.changelog_mask=HLINK 
$ lfs changelog_clear lustre-MDT0000 cl1 0 
$ mkdir /mnt/lustre/mydir/foo
$ cp /etc/hosts /mnt/lustre/mydir/foo/file
$ ln /mnt/lustre/mydir/foo/file /mnt/lustre/mydir/myhardlink

Only item types that are in the mask show up in the changelog.

$ lfs changelog lustre-MDT0000
9 03HLINK 19:19:35.171867477 2010.03.24 0x0 t=[0x200000420:0x6:0x0] p=[0x20\
0000420:0x3:0x0] myhardlink

12.2.  Lustre Jobstats

The Lustre jobstats feature is available starting in Lustre software release 2.3. It collects file system operation statistics for user processes running on Lustre clients, and exposes them via procfs on the server using the unique Job Identifier (JobID) provided by the job scheduler for each job. Job schedulers known to be able to work with jobstats include: SLURM, SGE, LSF, Loadleveler, PBS and Maui/MOAB.

Since jobstats is implemented in a scheduler-agnostic manner, it is likely that it will be able to work with other schedulers also.

12.2.1.  How Jobstats Works

The Lustre jobstats code on the client extracts the unique JobID from an environment variable within the user process, and sends this JobID to the server with the I/O operation. The server tracks statistics for operations whose JobID is given, indexed by that ID.

A Lustre setting on the client, jobid_var, specifies which variable to use. Any environment variable can be specified. For example, SLURM sets the SLURM_JOB_ID environment variable with the unique job ID on each client when the job is first launched on a node, and the SLURM_JOB_ID will be inherited by all child processes started below that process.

Lustre can also be configured to generate a synthetic JobID from the user's process name and User ID, by setting jobid_var to a special value, procname_uid.

The setting of jobid_var need not be the same on all clients. For example, one could use SLURM_JOB_ID on all clients managed by SLURM, and use procname_uid on clients not managed by SLURM, such as interactive login nodes.

It is not possible to have different jobid_var settings on a single node, since it is unlikely that multiple job schedulers are active on one client. However, the actual JobID value is local to each process environment and it is possible for multiple jobs with different JobIDs to be active on a single client at one time.
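
To confirm which JobID environment variable a particular scheduler sets on your system, it can be inspected from within a running job before configuring jobid_var; the variable names below are those listed in the table in the next section, and the output shown is illustrative:

$ env | egrep 'SLURM_JOB_ID|JOB_ID|LSB_JOBID|LOADL_STEP_ID|PBS_JOBID|ALPS_APP_ID'
SLURM_JOB_ID=14002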

12.2.2.  Enable/Disable Jobstats

Jobstats are disabled by default. The current state of jobstats can be verified by checking lctl get_param jobid_var on a client:

$ lctl get_param jobid_var
jobid_var=disable
      

To enable jobstats on the testfs file system with SLURM:

# lctl conf_param testfs.sys.jobid_var=SLURM_JOB_ID

The lctl conf_param command to enable or disable jobstats should be run on the MGS as root. The change is persistent, and will be propagated to the MDS, OSS, and client nodes automatically when it is set on the MGS and for each new client mount.

To temporarily enable jobstats on a client, or to use a different jobid_var on a subset of nodes, such as nodes in a remote cluster that use a different job scheduler, or interactive login nodes that do not use a job scheduler at all, run the lctl set_param command directly on the client node(s) after the filesystem is mounted. For example, to enable the procname_uid synthetic JobID on a login node run:

# lctl set_param jobid_var=procname_uid

The lctl set_param setting is not persistent, and will be reset if the global jobid_var is set on the MGS or if the filesystem is unmounted.

The following table shows the environment variables which are set by various job schedulers. Set jobid_var to the value for your job scheduler to collect statistics on a per job basis.

Job Scheduler                                          Environment Variable

Simple Linux Utility for Resource Management (SLURM)   SLURM_JOB_ID
Sun Grid Engine (SGE)                                  JOB_ID
Load Sharing Facility (LSF)                            LSB_JOBID
Loadleveler                                            LOADL_STEP_ID
Portable Batch Scheduler (PBS)/MAUI                    PBS_JOBID
Cray Application Level Placement Scheduler (ALPS)      ALPS_APP_ID

There are two special values for jobid_var: disable and procname_uid. To disable jobstats, specify jobid_var as disable:

# lctl conf_param testfs.sys.jobid_var=disable

To track job stats per process name and user ID (for debugging, or if no job scheduler is in use on some nodes such as login nodes), specify jobid_var as procname_uid:

# lctl conf_param testfs.sys.jobid_var=procname_uid

12.2.3.  Check Job Stats

Metadata operation statistics are collected on MDTs. These statistics can be accessed for all file systems and all jobs on the MDT via lctl get_param mdt.*.job_stats. For example, clients running with jobid_var=procname_uid:

# lctl get_param mdt.*.job_stats
job_stats:
- job_id:          bash.0
  snapshot_time:   1352084992
  open:            { samples:     2, unit:  reqs }
  close:           { samples:     2, unit:  reqs }
  mknod:           { samples:     0, unit:  reqs }
  link:            { samples:     0, unit:  reqs }
  unlink:          { samples:     0, unit:  reqs }
  mkdir:           { samples:     0, unit:  reqs }
  rmdir:           { samples:     0, unit:  reqs }
  rename:          { samples:     0, unit:  reqs }
  getattr:         { samples:     3, unit:  reqs }
  setattr:         { samples:     0, unit:  reqs }
  getxattr:        { samples:     0, unit:  reqs }
  setxattr:        { samples:     0, unit:  reqs }
  statfs:          { samples:     0, unit:  reqs }
  sync:            { samples:     0, unit:  reqs }
  samedir_rename:  { samples:     0, unit:  reqs }
  crossdir_rename: { samples:     0, unit:  reqs }
- job_id:          mythbackend.0
  snapshot_time:   1352084996
  open:            { samples:    72, unit:  reqs }
  close:           { samples:    73, unit:  reqs }
  mknod:           { samples:     0, unit:  reqs }
  link:            { samples:     0, unit:  reqs }
  unlink:          { samples:    22, unit:  reqs }
  mkdir:           { samples:     0, unit:  reqs }
  rmdir:           { samples:     0, unit:  reqs }
  rename:          { samples:     0, unit:  reqs }
  getattr:         { samples:   778, unit:  reqs }
  setattr:         { samples:    22, unit:  reqs }
  getxattr:        { samples:     0, unit:  reqs }
  setxattr:        { samples:     0, unit:  reqs }
  statfs:          { samples: 19840, unit:  reqs }
  sync:            { samples: 33190, unit:  reqs }
  samedir_rename:  { samples:     0, unit:  reqs }
  crossdir_rename: { samples:     0, unit:  reqs }
    

Data operation statistics are collected on OSTs. Data operations statistics can be accessed via lctl get_param obdfilter.*.job_stats, for example:

$ lctl get_param obdfilter.*.job_stats
obdfilter.myth-OST0000.job_stats=
job_stats:
- job_id:          mythcommflag.0
  snapshot_time:   1429714922
  read:    { samples: 974, unit: bytes, min: 4096, max: 1048576, sum: 91530035 }
  write:   { samples:   0, unit: bytes, min:    0, max:       0, sum:        0 }
  setattr: { samples:   0, unit:  reqs }
  punch:   { samples:   0, unit:  reqs }
  sync:    { samples:   0, unit:  reqs }
obdfilter.myth-OST0001.job_stats=
job_stats:
- job_id:          mythbackend.0
  snapshot_time:   1429715270
  read:    { samples:   0, unit: bytes, min:     0, max:      0, sum:        0 }
  write:   { samples:   1, unit: bytes, min: 96899, max:  96899, sum:    96899 }
  setattr: { samples:   0, unit:  reqs }
  punch:   { samples:   1, unit:  reqs }
  sync:    { samples:   0, unit:  reqs }
obdfilter.myth-OST0002.job_stats=job_stats:
obdfilter.myth-OST0003.job_stats=job_stats:
obdfilter.myth-OST0004.job_stats=
job_stats:
- job_id:          mythfrontend.500
  snapshot_time:   1429692083
  read:    { samples:   9, unit: bytes, min: 16384, max: 1048576, sum: 4444160 }
  write:   { samples:   0, unit: bytes, min:     0, max:       0, sum:       0 }
  setattr: { samples:   0, unit:  reqs }
  punch:   { samples:   0, unit:  reqs }
  sync:    { samples:   0, unit:  reqs }
- job_id:          mythbackend.500
  snapshot_time:   1429692129
  read:    { samples:   0, unit: bytes, min:     0, max:       0, sum:       0 }
  write:   { samples:   1, unit: bytes, min: 56231, max:   56231, sum:   56231 }
  setattr: { samples:   0, unit:  reqs }
  punch:   { samples:   1, unit:  reqs }
  sync:    { samples:   0, unit:  reqs }
    

12.2.4.  Clear Job Stats

Accumulated job statistics can be reset by writing to the job_stats proc file.

Clear statistics for all jobs on the local node:

# lctl set_param obdfilter.*.job_stats=clear

Clear statistics only for job 'bash.0' on lustre-MDT0000:

# lctl set_param mdt.lustre-MDT0000.job_stats=bash.0

12.2.5.  Configure Auto-cleanup Interval

By default, if a job is inactive for 600 seconds (10 minutes), statistics for this job will be dropped. This expiration value can be changed temporarily via:

# lctl set_param *.*.job_cleanup_interval={max_age}

It can also be changed permanently, for example to 700 seconds via:

# lctl conf_param testfs.mdt.job_cleanup_interval=700

The job_cleanup_interval can be set to 0 to disable the auto-cleanup. Note that if auto-cleanup of Jobstats is disabled, then all statistics will be kept in memory forever, which may eventually consume all memory on the servers. In this case, any monitoring tool should explicitly clear individual job statistics as they are processed, as shown above.
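
If auto-cleanup is disabled, a monitoring tool might snapshot and then clear the statistics in one pass. The following sketch (file path and scheduling are illustrative) shows the idea for an OSS; on an MDS the equivalent mdt.*.job_stats parameters would be used:

oss# lctl get_param obdfilter.*.job_stats > /var/log/job_stats.$(date +%s)
oss# lctl set_param obdfilter.*.job_stats=clear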

12.3.  Lustre Monitoring Tool (LMT)

The Lustre Monitoring Tool (LMT) is a Python-based, distributed system that provides a top-like display of activity on server-side nodes (MDS, OSS and portals routers) on one or more Lustre file systems. It does not provide support for monitoring clients. For more information on LMT, including the setup procedure, see:

https://github.com/chaos/lmt/wiki

LMT questions can be directed to:

lmt-discuss@googlegroups.com

12.4.  CollectL

CollectL is another tool that can be used to monitor a Lustre file system. You can run CollectL on a Lustre system that has any combination of MDSs, OSTs and clients. The collected data can be written to a file for continuous logging and played back at a later time. It can also be converted to a format suitable for plotting.

For more information about CollectL, see:

http://collectl.sourceforge.net

Lustre-specific documentation is also available. See:

http://collectl.sourceforge.net/Tutorial-Lustre.html

12.5.  Other Monitoring Options

A variety of standard tools are available publicly including the following:

Another option is to script a simple monitoring solution that looks at various reports from ifconfig, as well as the procfs files generated by the Lustre software.

Chapter 13. Lustre Operations

Once you have the Lustre file system up and running, you can use the procedures in this section to perform these basic Lustre administration tasks.

13.1.  Mounting by Label

The file system name is limited to 8 characters. We have encoded the file system and target information in the disk label, so you can mount by label. This allows system administrators to move disks around without worrying about issues such as SCSI disk reordering or getting the /dev/device wrong for a shared target. Soon, file system naming will be made as fail-safe as possible. Currently, Linux disk labels are limited to 16 characters. To identify the target within the file system, 8 characters are reserved, leaving 8 characters for the file system name:

fsname-MDT0000 or 
fsname-OST0a19

To mount by label, use this command:

mount -t lustre -L file_system_label /mount_point

This is an example of mount-by-label:

mds# mount -t lustre -L testfs-MDT0000 /mnt/mdt

Caution

Mount-by-label should NOT be used in a multi-path environment or when snapshots are being created of the device, since multiple block devices will have the same label.

Although the file system name is internally limited to 8 characters, you can mount the clients at any mount point, so file system users are not subjected to short names. Here is an example:

client# mount -t lustre mds0@tcp0:/short /mnt/long_mountpoint_name

13.2.  Starting Lustre

On the first start of a Lustre file system, the components must be started in the following order:

  1. Mount the MGT.

    Note

    If a combined MGT/MDT is present, Lustre will correctly mount the MGT and MDT automatically.

  2. Mount the MDT.

    Note

    Introduced in Lustre 2.4

    Mount all MDTs if multiple MDTs are present.

  3. Mount the OST(s).

  4. Mount the client(s).

13.3.  Mounting a Server

Starting a Lustre server is straightforward and only involves the mount command. Lustre servers can be added to /etc/fstab:

mount -t lustre

The mount command generates output similar to this:

/dev/sda1 on /mnt/test/mdt type lustre (rw)
/dev/sda2 on /mnt/test/ost0 type lustre (rw)
192.168.0.21@tcp:/testfs on /mnt/testfs type lustre (rw)

In this example, the MDT, an OST (ost0) and file system (testfs) are mounted.

The corresponding /etc/fstab entries for the MDT and OST look like this:

LABEL=testfs-MDT0000 /mnt/test/mdt lustre defaults,_netdev,noauto 0 0
LABEL=testfs-OST0000 /mnt/test/ost0 lustre defaults,_netdev,noauto 0 0

In general, it is wise to specify noauto and let your high-availability (HA) package manage when to mount the device. If you are not using failover, make sure that networking has been started before mounting a Lustre server. If you are running Red Hat Enterprise Linux, SUSE Linux Enterprise Server, Debian operating system (and perhaps others), use the _netdev flag to ensure that these disks are mounted after the network is up.

We are mounting by disk label here. The label of a device can be read with e2label. The label of a newly-formatted Lustre server may end in FFFF if the --index option is not specified to mkfs.lustre, meaning that it has yet to be assigned. The assignment takes place when the server is first started, and the disk label is updated. It is recommended that the --index option always be used, which will also ensure that the label is set at format time.
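
For example, the label of a formatted target can be read back with e2label (the device name is taken from the example output above):

# e2label /dev/sda2
testfs-OST0000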

Caution

Do not do this when the client and OSS are on the same node, as memory pressure between the client and OSS can lead to deadlocks.

Caution

Mount-by-label should NOT be used in a multi-path environment.

13.4.  Unmounting a Server

To stop a Lustre server, use the umount /mount_point command.

For example, to stop ost0 on mount point /mnt/test, run:

$ umount /mnt/test

Gracefully stopping a server with the umount command preserves the state of the connected clients. The next time the server is started, it waits for clients to reconnect, and then goes through the recovery procedure.

If the force ( -f) flag is used, then the server evicts all clients and stops WITHOUT recovery. Upon restart, the server does not wait for recovery. Any currently connected clients receive I/O errors until they reconnect.

Note

If you are using loopback devices, use the -d flag. This flag cleans up loop devices and can always be safely specified.

13.5.  Specifying Failout/Failover Mode for OSTs

In a Lustre file system, an OST that has become unreachable because it fails, is taken off the network, or is unmounted can be handled in one of two ways:

  • In failout mode, Lustre clients immediately receive errors (EIOs) after a timeout, instead of waiting for the OST to recover.

  • In failover mode, Lustre clients wait for the OST to recover.

By default, the Lustre file system uses failover mode for OSTs. To specify failout mode instead, use the --param="failover.mode=failout" option as shown below (entered on one line):

oss# mkfs.lustre --fsname=fsname --mgsnode=mgs_NID --param=failover.mode=failout --ost --index=ost_index /dev/ost_block_device

In the example below, failout mode is specified for an OST in the file system testfs, which uses the MGS mds0 (entered on one line).

oss# mkfs.lustre --fsname=testfs --mgsnode=mds0 --param=failover.mode=failout 
      --ost --index=3 /dev/sdb 

Caution

Before running this command, unmount all OSTs that will be affected by a change in failover/ failout mode.

Note

After initial file system configuration, use the tunefs.lustre utility to change the mode. For example, to set the failout mode, run:

$ tunefs.lustre --param failover.mode=failout /dev/ost_device

13.6.  Handling Degraded OST RAID Arrays

The Lustre software includes functionality that allows it to be notified when an external RAID array has degraded performance (resulting in reduced overall file system performance), either because a disk has failed and has not yet been replaced, or because a disk was replaced and is undergoing a rebuild. To avoid a global performance slowdown due to a degraded OST, the MDS can avoid the OST for new object allocation if it is notified of the degraded state.

A parameter for each OST, called degraded, specifies whether the OST is running in degraded mode or not.

To mark the OST as degraded, use:

lctl set_param obdfilter.{OST_name}.degraded=1

To mark that the OST is back in normal operation, use:

lctl set_param obdfilter.{OST_name}.degraded=0

To determine if OSTs are currently in degraded mode, use:

lctl get_param obdfilter.*.degraded

If the OST is remounted due to a reboot or other condition, the flag resets to 0.

It is recommended that this be implemented by an automated script that monitors the status of individual RAID devices.
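
A minimal sketch of such a script, assuming the OST testfs-OST0003 is backed by a Linux md RAID device and that the script is run periodically (for example from cron), might look like this; adapt the status check to your RAID controller:

#!/bin/sh
# Mark the OST degraded while the md array is rebuilding, clear the flag otherwise.
if grep -q recovery /proc/mdstat; then
    lctl set_param obdfilter.testfs-OST0003.degraded=1
else
    lctl set_param obdfilter.testfs-OST0003.degraded=0
fi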

13.7.  Running Multiple Lustre File Systems

Lustre supports multiple file systems provided the combination of NID:fsname is unique. Each file system must be allocated a unique name during creation with the --fsname parameter. Unique names for file systems are enforced if a single MGS is present. If multiple MGSs are present (for example if you have an MGS on every MDS) the administrator is responsible for ensuring file system names are unique. A single MGS and unique file system names provides a single point of administration and allows commands to be issued against the file system even if it is not mounted.

Lustre supports multiple file systems on a single MGS. With a single MGS fsnames are guaranteed to be unique. Lustre also allows multiple MGSs to co-exist. For example, multiple MGSs will be necessary if multiple file systems on different Lustre software versions are to be concurrently available. With multiple MGSs additional care must be taken to ensure file system names are unique. Each file system should have a unique fsname among all systems that may interoperate in the future.

By default, the mkfs.lustre command creates a file system named lustre. To specify a different file system name (limited to 8 characters) at format time, use the --fsname option:

mkfs.lustre --fsname=file_system_name

Note

The MDT, OSTs and clients in the new file system must use the same file system name (prepended to the device name). For example, for a new file system named foo, the MDT and two OSTs would be named foo-MDT0000, foo-OST0000, and foo-OST0001.

To mount a client on the file system, run:

client# mount -t lustre mgsnode:/new_fsname /mount_point

For example, to mount a client on file system foo at mount point /mnt/foo, run:

client# mount -t lustre mgsnode:/foo /mnt/foo

Note

If clients will be mounted on several file systems, add the following line to the /etc/xattr.conf file to avoid problems when files are moved between the file systems: lustre.* skip

Note

To ensure that a new MDT is added to an existing MGS, create the MDT by specifying: --mdt --mgsnode=mgs_NID.

A Lustre installation with two file systems ( foo and bar) could look like this, where the MGS node is mgsnode@tcp0 and the mount points are /mnt/foo and /mnt/bar.

mgsnode# mkfs.lustre --mgs /dev/sda
mdtfoonode# mkfs.lustre --fsname=foo --mgsnode=mgsnode@tcp0 --mdt --index=0
/dev/sdb
ossfoonode# mkfs.lustre --fsname=foo --mgsnode=mgsnode@tcp0 --ost --index=0
/dev/sda
ossfoonode# mkfs.lustre --fsname=foo --mgsnode=mgsnode@tcp0 --ost --index=1
/dev/sdb
mdtbarnode# mkfs.lustre --fsname=bar --mgsnode=mgsnode@tcp0 --mdt --index=0
/dev/sda
ossbarnode# mkfs.lustre --fsname=bar --mgsnode=mgsnode@tcp0 --ost --index=0
/dev/sdc
ossbarnode# mkfs.lustre --fsname=bar --mgsnode=mgsnode@tcp0 --ost --index=1
/dev/sdd

To mount a client on file system foo at mount point /mnt/foo, run:

client# mount -t lustre mgsnode@tcp0:/foo /mnt/foo

To mount a client on file system bar at mount point /mnt/bar, run:

client# mount -t lustre mgsnode@tcp0:/bar /mnt/bar
Introduced in Lustre 2.4

13.8.  Creating a sub-directory on a given MDT

Lustre 2.4 enables individual sub-directories to be serviced by unique MDTs. An administrator can allocate a sub-directory to a given MDT using the command:

client# lfs mkdir -i mdt_index /mount_point/remote_dir

This command will allocate the sub-directory remote_dir onto the MDT with index mdt_index. For more information on adding additional MDTs and mdt_index, see Section 14.6, “Adding a New MDT to a Lustre File System”.
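For example, assuming a file system mounted at /mnt/testfs with an additional MDT at index 1 (names are illustrative), the following creates remote_dir1 on MDT0001 and confirms which MDT serves it:

client# lfs mkdir -i 1 /mnt/testfs/remote_dir1
client# lfs getstripe -M /mnt/testfs/remote_dir1
1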

Warning

An administrator can allocate remote sub-directories to separate MDTs. Creating remote sub-directories in parent directories not hosted on MDT0 is not recommended. This is because the failure of the parent MDT will leave the namespace below it inaccessible. For this reason, by default it is only possible to create remote sub-directories off MDT0. To relax this restriction and enable remote sub-directories off any MDT, an administrator must issue the following command on the MGS:

mgs# lctl conf_param fsname.mdt.enable_remote_dir=1

For Lustre filesystem 'scratch', the command executed is:

mgs# lctl conf_param scratch.mdt.enable_remote_dir=1

To verify the configuration setting execute the following command on any MDS:

mds# lctl get_param mdt.*.enable_remote_dir
Introduced in Lustre 2.8

With Lustre software version 2.8, a new tunable is available to allow users with a specific group ID to create and delete remote and striped directories. This tunable is enable_remote_dir_gid. For example, setting this parameter to the 'wheel' or 'admin' group ID allows users with that GID to create and delete remote and striped directories. Setting this parameter to -1 on MDT0 permanently allows any non-root user to create and delete remote and striped directories. On the MGS execute the following command:

mgs# lctl conf_param fsname.mdt.enable_remote_dir_gid=-1

For the Lustre filesystem 'scratch', the command expands to:

mgs# lctl conf_param scratch.mdt.enable_remote_dir_gid=-1

The change can be verified by executing the following command on every MDS:

mds# lctl get_param mdt.*.enable_remote_dir_gid

Introduced in Lustre 2.8

13.9.  Creating a directory striped across multiple MDTs

The Lustre 2.8 DNE feature enables individual files in a given directory to store their metadata on separate MDTs (a striped directory) once additional MDTs have been added to the filesystem, see Section 14.6, “Adding a New MDT to a Lustre File System”. The result of this is that metadata requests for files in a striped directory are serviced by multiple MDTs and metadata service load is distributed over all the MDTs that service a given directory. By distributing metadata service load over multiple MDTs, performance can be improved beyond the limit of single MDT performance. Prior to the development of this feature, all files in a directory had to record their metadata on a single MDT.

The command to stripe a directory over mdt_count MDTs is:

client# lfs mkdir -c mdt_count /mount_point/new_directory

The striped directory feature is most useful for distributing single large directories (50k entries or more) across multiple MDTs, since it incurs more overhead than non-striped directories.
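For example, to create a directory striped across four MDTs and then inspect its layout (the mount point and directory name are illustrative), run:

client# lfs mkdir -c 4 /mnt/testfs/big_dir
client# lfs getdirstripe /mnt/testfs/big_dir

The lfs getdirstripe command reports the stripe count and the MDTs holding the directory shards.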

13.10.  Setting and Retrieving Lustre Parameters

Several options are available for setting parameters in Lustre:

13.10.1. Setting Tunable Parameters with mkfs.lustre

When the file system is first formatted, parameters can simply be added as a --param option to the mkfs.lustre command. For example:

mds# mkfs.lustre --mdt --param="sys.timeout=50" /dev/sda

For more details about creating a file system, see Chapter 10, Configuring a Lustre File System. For more details about mkfs.lustre, see Chapter 38, System Configuration Utilities.

13.10.2. Setting Parameters with tunefs.lustre

If a server (OSS or MDS) is stopped, parameters can be added to an existing file system using the --param option to the tunefs.lustre command. For example:

oss# tunefs.lustre --param=failover.node=192.168.0.13@tcp0 /dev/sda

With tunefs.lustre, parameters are additive: new parameters are specified in addition to old parameters; they do not replace them. To erase all old tunefs.lustre parameters and just use newly specified parameters, run:

mds# tunefs.lustre --erase-params --param=new_parameters

The tunefs.lustre command can be used to set any parameter that is settable in a /proc/fs/lustre file and that has its own OBD device, so it can be specified as obdname|fsname.obdtype.proc_file_name=value. For example:

mds# tunefs.lustre --param mdt.identity_upcall=NONE /dev/sda1

For more details about tunefs.lustre, see Chapter 38, System Configuration Utilities.

13.10.3. Setting Parameters with lctl

When the file system is running, the lctl command can be used to set parameters (temporary or permanent) and report current parameter values. Temporary parameters are active as long as the server or client is not shut down. Permanent parameters live through server and client reboots.

Note

The lctl list_param command enables users to list all parameters that can be set. See Section 13.10.3.4, “Listing Parameters”.

For more details about the lctl command, see the examples in the sections below and Chapter 38, System Configuration Utilities.

13.10.3.1. Setting Temporary Parameters

Use lctl set_param to set temporary parameters on the node where it is run. These parameters map to items in /proc/{fs,sys}/{lnet,lustre}. The lctl set_param command uses this syntax:

lctl set_param [-n] obdtype.obdname.proc_file_name=value

For example:

# lctl set_param osc.*.max_dirty_mb=1024
osc.myth-OST0000-osc.max_dirty_mb=32
osc.myth-OST0001-osc.max_dirty_mb=32
osc.myth-OST0002-osc.max_dirty_mb=32
osc.myth-OST0003-osc.max_dirty_mb=32
osc.myth-OST0004-osc.max_dirty_mb=32

13.10.3.2. Setting Permanent Parameters

Use the lctl conf_param command to set permanent parameters. In general, the lctl conf_param command can be used to specify any parameter settable in a /proc/fs/lustre file, with its own OBD device. The lctl conf_param command uses this syntax (same as the mkfs.lustre and tunefs.lustre commands):

obdname|fsname.obdtype.proc_file_name=value

Here are a few examples of lctl conf_param commands:

mgs# lctl conf_param testfs-MDT0000.sys.timeout=40
$ lctl conf_param testfs-MDT0000.mdt.identity_upcall=NONE
$ lctl conf_param testfs.llite.max_read_ahead_mb=16
$ lctl conf_param testfs-MDT0000.lov.stripesize=2M
$ lctl conf_param testfs-OST0000.osc.max_dirty_mb=29.15
$ lctl conf_param testfs-OST0000.ost.client_cache_seconds=15
$ lctl conf_param testfs.sys.timeout=40 

Caution

Parameters specified with the lctl conf_param command are set permanently in the file system's configuration file on the MGS.

Introduced in Lustre 2.5

13.10.3.3. Setting Permanent Parameters with lctl set_param -P

Use the lctl set_param -P to set parameters permanently. This command must be issued on the MGS. The given parameter is set on every host using lctl upcall. Parameters map to items in /proc/{fs,sys}/{lnet,lustre}. The lctl set_param command uses this syntax:

lctl set_param -P obdtype.obdname.proc_file_name=value

For example:

# lctl set_param -P osc.*.max_dirty_mb=1024
osc.myth-OST0000-osc.max_dirty_mb=32
osc.myth-OST0001-osc.max_dirty_mb=32
osc.myth-OST0002-osc.max_dirty_mb=32
osc.myth-OST0003-osc.max_dirty_mb=32
osc.myth-OST0004-osc.max_dirty_mb=32 

Use the -d option (only with -P) to delete a permanent parameter. Syntax:

lctl set_param -P -d obdtype.obdname.proc_file_name

For example:

# lctl set_param -P -d osc.*.max_dirty_mb 

13.10.3.4. Listing Parameters

To list Lustre or LNet parameters that are available to set, use the lctl list_param command. For example:

lctl list_param [-FR] obdtype.obdname

The following arguments are available for the lctl list_param command.

-F Add '/', '@' or '=' for directories, symlinks and writeable files, respectively

-R Recursively lists all parameters under the specified path

For example:

oss# lctl list_param obdfilter.lustre-OST0000 
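For example, to recursively list every parameter under a single OST, marking directories and writeable files (output varies by version and configuration):

oss# lctl list_param -FR obdfilter.lustre-OST0000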

13.10.3.5. Reporting Current Parameter Values

To report current Lustre parameter values, use the lctl get_param command with this syntax:

lctl get_param [-n] obdtype.obdname.proc_file_name

This example reports data on RPC service times.

oss# lctl get_param -n ost.*.ost_io.timeouts
service : cur 1 worst 30 (at 1257150393, 85d23h58m54s ago) 1 1 1 1 

This example reports the amount of space this client has reserved for writeback cache with each OST:

client# lctl get_param osc.*.cur_grant_bytes
osc.myth-OST0000-osc-ffff8800376bdc00.cur_grant_bytes=2097152
osc.myth-OST0001-osc-ffff8800376bdc00.cur_grant_bytes=33890304
osc.myth-OST0002-osc-ffff8800376bdc00.cur_grant_bytes=35418112
osc.myth-OST0003-osc-ffff8800376bdc00.cur_grant_bytes=2097152
osc.myth-OST0004-osc-ffff8800376bdc00.cur_grant_bytes=33808384

13.11.  Specifying NIDs and Failover

If a node has multiple network interfaces, it may have multiple NIDs, which must all be identified so other nodes can choose the NID that is appropriate for their network interfaces. Typically, NIDs are specified in a list delimited by commas (,). However, when failover nodes are specified, the NIDs are delimited by a colon (:) or by repeating a keyword such as --mgsnode= or --servicenode=.

To display the NIDs of all servers in networks configured to work with the Lustre file system, run (while LNet is running):

lctl list_nids

In the example below, mds0 and mds1 are configured as a combined MGS/MDT failover pair and oss0 and oss1 are configured as an OST failover pair. The Ethernet address for mds0 is 192.168.10.1, and for mds1 is 192.168.10.2. The Ethernet addresses for oss0 and oss1 are 192.168.10.20 and 192.168.10.21 respectively.

mds0# mkfs.lustre --fsname=testfs --mdt --mgs \
        --servicenode=192.168.10.2@tcp0 \
        --servicenode=192.168.10.1@tcp0 /dev/sda1
mds0# mount -t lustre /dev/sda1 /mnt/test/mdt
oss0# mkfs.lustre --fsname=testfs --servicenode=192.168.10.20@tcp0 \
        --servicenode=192.168.10.21 --ost --index=0 \
        --mgsnode=192.168.10.1@tcp0 --mgsnode=192.168.10.2@tcp0 \
        /dev/sdb
oss0# mount -t lustre /dev/sdb /mnt/test/ost0
client# mount -t lustre 192.168.10.1@tcp0:192.168.10.2@tcp0:/testfs \
        /mnt/testfs
mds0# umount /mnt/test/mdt
mds1# mount -t lustre /dev/sda1 /mnt/test/mdt
mds1# lctl get_param mdt.testfs-MDT0000.recovery_status

Where multiple NIDs are specified separated by commas (for example, 10.67.73.200@tcp,192.168.10.1@tcp), the two NIDs refer to the same host, and the Lustre software chooses the best one for communication. When a pair of NIDs is separated by a colon (for example, 10.67.73.200@tcp:10.67.73.201@tcp), the two NIDs refer to two different hosts and are treated as a failover pair (the Lustre software tries the first one, and if that fails, it tries the second one.)

Two options to mkfs.lustre can be used to specify failover nodes. Introduced in Lustre software release 2.0, the --servicenode option is used to specify all service NIDs, including those for primary nodes and failover nodes. When the --servicenode option is used, the first service node to load the target device becomes the primary service node, while nodes corresponding to the other specified NIDs become failover locations for the target device. An older option, --failnode, specifies just the NIDs of failover nodes. For more information about the --servicenode and --failnode options, see Chapter 11, Configuring Failover in a Lustre File System.
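As an illustration of the older style, the OST in the example above could instead be formatted with --failnode, listing only the failover partner (the NIDs reuse the hypothetical addresses from the example); in that case, the node that formats and first mounts the target is the primary service node:

oss0# mkfs.lustre --fsname=testfs --ost --index=0 \
        --mgsnode=192.168.10.1@tcp0 --mgsnode=192.168.10.2@tcp0 \
        --failnode=192.168.10.21@tcp0 /dev/sdb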

13.12.  Erasing a File System

If you want to erase a file system and permanently delete all the data in the file system, run this command on your targets:

$ "mkfs.lustre --reformat"

If you are using a separate MGS and want to keep other file systems defined on that MGS, then set the writeconf flag on the MDT for that file system. The writeconf flag causes the configuration logs to be erased; they are regenerated the next time the servers start.

To set the writeconf flag on the MDT:

  1. Unmount all clients/servers using this file system, run:

    $ umount /mnt/lustre
    
  2. Permanently erase the file system and, presumably, replace it with another file system, run:

    $ mkfs.lustre --reformat --fsname spfs --mgs --mdt --index=0 /dev/{mdsdev}
    
  3. If you have a separate MGS (that you do not want to reformat), then add the --writeconf flag to mkfs.lustre on the MDT, run:

    $ mkfs.lustre --reformat --writeconf --fsname spfs --mgsnode=mgs_nid --mdt --index=0 /dev/mds_device
    

Note

If you have a combined MGS/MDT, reformatting the MDT reformats the MGS as well, causing all configuration information to be lost. You can then start building your new file system. Nothing needs to be done with old disks that will not be part of the new file system; just do not mount them.

13.13.  Reclaiming Reserved Disk Space

All current Lustre installations run the ldiskfs file system internally on service nodes. By default, ldiskfs reserves 5% of the disk space to avoid file system fragmentation. In order to reclaim this space, run the following command on your OSS for each OST in the file system:

tune2fs [-m reserved_blocks_percent] /dev/{ostdev}

You do not need to shut down Lustre before running this command or restart it afterwards.
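For example, the commands below (with a placeholder device name) first report the current reservation and then lower it to 1% of the device; weigh this against the warning that follows before changing the setting:

oss# tune2fs -l /dev/sdb | grep -i reserved
oss# tune2fs -m 1 /dev/sdb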

Warning

Reducing the space reservation can cause severe performance degradation as the OST file system becomes more than 95% full, due to difficulty in locating large areas of contiguous free space. This performance degradation may persist even if the space usage drops below 95% again. It is recommended NOT to reduce the reserved disk space below 5%.

13.14.  Replacing an Existing OST or MDT

To copy the contents of an existing OST to a new OST (or an old MDT to a new MDT), follow the process for either OST/MDT backups in Section 17.2, “ Backing Up and Restoring an MDT or OST (ldiskfs Device Level)” or Section 17.3, “ Backing Up an OST or MDT (ldiskfs File System Level)”. For more information on removing an MDT, see Section 14.8.1, “Removing a MDT from the File System”.

13.15.  Identifying To Which Lustre File an OST Object Belongs

Use this procedure to identify the file containing a given object on a given OST.

  1. On the OST (as root), run debugfs to display the file identifier (FID) of the file associated with the object.

    For example, if the object is 34976 on /dev/lustre/ost_test2, the debug command is shown below (objects are hashed into 32 subdirectories under /O/0, so d$((34976 % 32)) resolves to d0):

    # debugfs -c -R "stat /O/0/d$((34976 % 32))/34976" /dev/lustre/ost_test2 
    

    The command output is:

    debugfs 1.42.3.wc3 (15-Aug-2012)
    /dev/lustre/ost_test2: catastrophic mode - not reading inode or group bitmaps
    Inode: 352365   Type: regular    Mode:  0666   Flags: 0x80000
    Generation: 2393149953    Version: 0x0000002a:00005f81
    User:  1000   Group:  1000   Size: 260096
    File ACL: 0    Directory ACL: 0
    Links: 1   Blockcount: 512
    Fragment:  Address: 0    Number: 0    Size: 0
    ctime: 0x4a216b48:00000000 -- Sat May 30 13:22:16 2009
    atime: 0x4a216b48:00000000 -- Sat May 30 13:22:16 2009
    mtime: 0x4a216b48:00000000 -- Sat May 30 13:22:16 2009
    crtime: 0x4a216b3c:975870dc -- Sat May 30 13:22:04 2009
    Size of extra inode fields: 24
    Extended attributes stored in inode body:
      fid = "b9 da 24 00 00 00 00 00 6a fa 0d 3f 01 00 00 00 eb 5b 0b 00 00 00 0000
    00 00 00 00 00 00 00 00 " (32)
      fid: objid=34976 seq=0 parent=[0x24dab9:0x3f0dfa6a:0x0] stripe=1
    EXTENTS:
    (0-64):4620544-4620607
    
  2. For Lustre software release 2.x file systems, the parent FID will be of the form [0x200000400:0x122:0x0] and can be resolved directly using the lfs fid2path [0x200000400:0x122:0x0] /mnt/lustre command on any Lustre client, and the process is complete.

  3. In this example, the parent inode FID is an upgraded 1.x inode (because the first part of the FID is below 0x200000400). The MDT inode number is 0x24dab9 with generation 0x3f0dfa6a, and the pathname needs to be resolved using debugfs.

  4. On the MDS (as root), use debugfs to find the file associated with the inode:

    # debugfs -c -R "ncheck 0x24dab9" /dev/lustre/mdt_test 
    

    Here is the command output:

    debugfs 1.42.3.wc2 (15-Aug-2012)
    /dev/lustre/mdt_test: catastrophic mode - not reading inode or group bitmap\
    s
    Inode      Pathname
    2415289    /ROOT/brian-laptop-guest/clients/client11/~dmtmp/PWRPNT/ZD16.BMP
    

The command lists the inode and pathname associated with the object.

Note

The debugfs ncheck command is a brute-force search that may take a long time to complete.

Note

To find the Lustre file from a disk LBA, follow the steps listed in the document at this URL: http://smartmontools.sourceforge.net/badblockhowto.html. Then, follow the steps above to resolve the Lustre filename.

Chapter 14. Lustre Maintenance

Once you have the Lustre file system up and running, you can use the procedures in this section to perform these basic Lustre maintenance tasks:

14.1.  Working with Inactive OSTs

To mount a client or an MDT with one or more inactive OSTs, run commands similar to this:

client# mount -o exclude=testfs-OST0000 -t lustre \
           uml1:/testfs /mnt/testfs
client# lctl get_param lov.testfs-clilov-*.target_obd

To activate an inactive OST on a live client or MDT, use the lctl activate command on the OSC device. For example:

lctl --device 7 activate
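The OSC device number can be found with lctl dl; the number in the first column of the matching line is the value to pass to --device. For example (the device number 7 is illustrative):

client# lctl dl | grep osc
client# lctl --device 7 activate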

Note

A colon-separated list can also be specified. For example, exclude=testfs-OST0000:testfs-OST0001.

14.2.  Finding Nodes in the Lustre File System

There may be situations in which you need to find all nodes in your Lustre file system or get the names of all OSTs.

To get a list of all Lustre nodes, run this command on the MGS:

# lctl get_param mgs.MGS.live.*

Note

This command must be run on the MGS.

In this example, file system testfs has three nodes, testfs-MDT0000, testfs-OST0000, and testfs-OST0001.

mgs:/root# lctl get_param mgs.MGS.live.* 
                fsname: testfs 
                flags: 0x0     gen: 26 
                testfs-MDT0000 
                testfs-OST0000 
                testfs-OST0001 

To get the names of all OSTs, run this command on the MDS:

mds:/root# lctl get_param lov.*-mdtlov.target_obd 

Note

This command must be run on the MDS.

In this example, there are two OSTs, testfs-OST0000 and testfs-OST0001, which are both active.

mgs:/root# lctl get_param lov.testfs-mdtlov.target_obd 
0: testfs-OST0000_UUID ACTIVE 
1: testfs-OST0001_UUID ACTIVE 

14.3.  Mounting a Server Without Lustre Service

If you are using a combined MGS/MDT, but you only want to start the MGS and not the MDT, run this command:

mount -t lustre /dev/mdt_partition -o nosvc /mount_point

The mdt_partition variable is the combined MGS/MDT block device.

In this example, the combined MGS/MDT is testfs-MDT0000 and the mount point is /mnt/test/mdt.

$ mount -t lustre -L testfs-MDT0000 -o nosvc /mnt/test/mdt

14.4.  Regenerating Lustre Configuration Logs

If the Lustre file system configuration logs are in a state where the file system cannot be started, use the writeconf command to erase them. After the writeconf command is run and the servers restart, the configuration logs are re-generated and stored on the MGS (as in a new file system).

You should only use the writeconf command if:

  • The configuration logs are in a state where the file system cannot start

  • A server NID is being changed

The writeconf command is destructive to some configuration items (i.e., OST pools information and items set via conf_param), and should be used with caution. To avoid problems:

  • Shut down the file system before running the writeconf command

  • Run the writeconf command on all servers (MDT first, then OSTs)

  • Start the file system in this order:

    • MGS (or the combined MGS/MDT)

    • MDT

    • OSTs

    • Lustre clients

Caution

The OST pools feature enables a group of OSTs to be named for file striping purposes. If you use OST pools, be aware that running the writeconf command erases all pools information (as well as any other parameters set via lctl conf_param). We recommend that the pools definitions (and conf_param settings) be executed via a script, so they can be reproduced easily after a writeconf is performed.

To regenerate Lustre file system configuration logs:

  1. Shut down the file system in this order.

    1. Unmount the clients.

    2. Unmount the MDT.

    3. Unmount all OSTs.

  2. Make sure the MDT and OST devices are available.

  3. Run the writeconf command on all servers.

    Run writeconf on the MDT first, and then the OSTs.

    1. On the MDT, run:

      mdt# tunefs.lustre --writeconf /dev/mdt_device
    2. On each OST, run:

      ost# tunefs.lustre --writeconf /dev/ost_device

  4. Restart the file system in this order.

    1. Mount the MGS (or the combined MGS/MDT).

    2. Mount the MDT.

    3. Mount the OSTs.

    4. Mount the clients.

After the writeconf command is run, the configuration logs are re-generated as servers restart.

14.5.  Changing a Server NID

In Lustre software release 2.3 or earlier, the tunefs.lustre --writeconf command is used to rewrite all of the configuration files.

Introduced in Lustre 2.4

If you need to change the NID on the MDT or OST, a new replace_nids command was added in Lustre software release 2.4 to simplify this process. The replace_nids command differs from tunefs.lustre --writeconf in that it does not erase the entire configuration log, precluding the need to execute the writeconf command on all servers and to re-specify all permanent parameter settings. However, the writeconf command can still be used if desired.

Change a server NID in these situations:

  • New server hardware is added to the file system, and the MDS or an OSS is being moved to the new machine.

  • New network card is installed in the server.

  • You want to reassign IP addresses.

To change a server NID:

  1. Update the LNet configuration in the /etc/modprobe.conf file so the list of server NIDs is correct. Use lctl list_nids to view the list of server NIDs.

    The lctl list_nids command indicates which network(s) are configured to work with the Lustre file system.

  2. Shut down the file system in this order:

    1. Unmount the clients.

    2. Unmount the MDT.

    3. Unmount all OSTs.

  3. If the MGS and MDS share a partition, start the MGS only:

    mount -t lustre MDT_partition -o nosvc mount_point
  4. Run the replace_nids command on the MGS:

    lctl replace_nids devicename nid1[,nid2,nid3 ...]

    where devicename is the Lustre target name, e.g. testfs-OST0013 (a worked example follows this procedure)

  5. If the MGS and MDS share a partition, stop the MGS:

    umount mount_point
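For example, to carry out step 4 for target testfs-OST0013 with a single new NID (the address is illustrative):

mgs# lctl replace_nids testfs-OST0013 192.168.10.22@tcp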

Note

The replace_nids command also cleans all old, invalidated records out of the configuration log, while preserving all other current settings.

Note

The previous configuration log is backed up on the MGS disk with the suffix '.bak'.

Introduced in Lustre 2.4

14.6. Adding a New MDT to a Lustre File System

Additional MDTs can be added using the DNE feature to serve one or more remote sub-directories within a filesystem, in order to increase the total number of files that can be created in the filesystem, to increase aggregate metadata performance, or to isolate user or application workloads from other users of the filesystem. It is possible to have multiple remote sub-directories reference the same MDT. However, the root directory will always be located on MDT0. To add a new MDT into the file system:

  1. Discover the maximum MDT index. Each MDT must have a unique index.

    client$ lctl dl | grep mdc
    36 UP mdc testfs-MDT0000-mdc-ffff88004edf3c00 4c8be054-144f-9359-b063-8477566eb84e 5
    37 UP mdc testfs-MDT0001-mdc-ffff88004edf3c00 4c8be054-144f-9359-b063-8477566eb84e 5
    38 UP mdc testfs-MDT0002-mdc-ffff88004edf3c00 4c8be054-144f-9359-b063-8477566eb84e 5
    39 UP mdc testfs-MDT0003-mdc-ffff88004edf3c00 4c8be054-144f-9359-b063-8477566eb84e 5
    
  2. Add the new block device as a new MDT at the next available index. In this example, the next available index is 4.

    mds# mkfs.lustre --reformat --fsname=testfs --mdt --mgsnode=mgsnode --index 4 /dev/mdt4_device
    
  3. Mount the MDTs.

    mds# mount -t lustre /dev/mdt4_device /mnt/mdt4
    
  4. In order to start creating new files and directories on the new MDT(s), they must be attached into the namespace at one or more sub-directories using the lfs mkdir command. All files and directories below those created with lfs mkdir will also be created on the same MDT unless otherwise specified.

    client# lfs mkdir -i 3 /mnt/testfs/new_dir_on_mdt3
    client# lfs mkdir -i 4 /mnt/testfs/new_dir_on_mdt4
    client# lfs mkdir -c 4 /mnt/testfs/new_directory_striped_across_4_mdts
    

14.7.  Adding a New OST to a Lustre File System

To add an OST to existing Lustre file system:

  1. Add a new OST by running the following commands:

    oss# mkfs.lustre --fsname=spfs --mgsnode=mds16@tcp0 --ost --index=12 /dev/sda
    oss# mkdir -p /mnt/test/ost12
    oss# mount -t lustre /dev/sda /mnt/test/ost12
  2. Migrate the data (possibly).

    The file system is quite unbalanced when new empty OSTs are added. New file creations are automatically balanced. If this is a scratch file system or files are pruned at a regular interval, then no further work may be needed.

    New files being created will preferentially be placed on the empty OST. As old files are deleted, they will release space on the old OST.

    Files existing prior to the expansion can optionally be rebalanced with an in-place copy, which can be done with a simple script. The basic method is to copy existing files to a temporary file, then move the temp file over the old one. This should not be attempted with files which are currently being written to by users or applications. This operation redistributes the stripes over the entire set of OSTs.

    For example, to rebalance all files within /mnt/lustre/dir, enter:

    client# lfs_migrate /mnt/lustre/dir

    To migrate files within the /test file system on OST0004 that are larger than 4GB in size, enter:

    client# lfs find /test -obd test-OST0004 -size +4G | lfs_migrate -y

    See Section 34.2, “ lfs_migrate” for more details.

14.8.  Removing and Restoring OSTs

OSTs can be removed from and restored to a Lustre file system. Removing an OST means the OST is deactivated in the file system, not permanently removed.

Note

A removed OST still appears in the file system; do not create a new OST with the same name.

You may want to remove (deactivate) an OST and prevent new files from being written to it in several situations:

  • Hard drive has failed and a RAID resync/rebuild is underway

  • OST is nearing its space capacity

  • OST storage has failed permanently

Introduced in Lustre 2.4

14.8.1. Removing a MDT from the File System

If the MDT is permanently inaccessible, lfs rmdir {directory} can be used to delete the directory entry. A normal rmdir will report an IO error due to the remote MDT being inactive. After the remote directory has been removed, the administrator should mark the MDT as permanently inactive with:

lctl conf_param {MDT name}.mdc.active=0

A user can identify which MDT holds a remote sub-directory using the lfs utility. For example:

client$ lfs getstripe -M /mnt/lustre/remote_dir1
1
client$ mkdir /mnt/lustre/local_dir0
client$ lfs getstripe -M /mnt/lustre/local_dir0
0

The getstripe [--mdt-index|-M] parameters return the index of the MDT that is serving the given directory.

Introduced in Lustre 2.4

14.8.2.  Working with Inactive MDTs

Files located on or below an inactive MDT are inaccessible until the MDT is activated again. Clients accessing an inactive MDT will receive an EIO error.

14.8.3.  Removing an OST from the File System

When removing an OST, remember that the MDT does not communicate directly with OSTs. Rather, each OST has a corresponding OSC which communicates with the MDT. It is necessary to determine the device number of the OSC that corresponds to the OST. Then, you use this device number to deactivate the OSC on the MDT.

To remove an OST from the file system:

  1. For the OST to be removed, determine the device number of the corresponding OSC on the MDT.

    1. List all OSCs on the node, along with their device numbers. Run:

      lctl dl | grep osc

      For example: lctl dl | grep osc

      11 UP osc testfs-OST-0000-osc-cac94211 4ea5b30f-6a8e-55a0-7519-2f20318ebdb4 5
      12 UP osc testfs-OST-0001-osc-cac94211 4ea5b30f-6a8e-55a0-7519-2f20318ebdb4 5
      13 IN osc testfs-OST-0000-osc testfs-MDT0000-mdtlov_UUID 5
      14 UP osc testfs-OST-0001-osc testfs-MDT0000-mdtlov_UUID 5
    2. Determine the device number of the OSC that corresponds to the OST to be removed.

  2. Temporarily deactivate the OSC on the MDT. On the MDT, run:

    mds# lctl --device lustre_devno deactivate

    For example, based on the command output in Step 1, to deactivate device 13 (the MDT’s OSC for OST-0000), the command would be:

    mds# lctl --device 13 deactivate

    This marks the OST as inactive on the MDS, so no new objects are assigned to the OST. This does not prevent use of existing objects for reads or writes.

    Note

    Do not deactivate the OST on the clients. Doing so causes errors (EIOs) and causes the copy out to fail.

    Caution

    Do not use lctl conf_param to deactivate the OST. It permanently sets a parameter in the file system configuration.

  3. Discover all files that have objects residing on the deactivated OST.

    Depending on whether the deactivated OST is available or not, the data from that OST may be migrated to other OSTs, or may need to be restored from backup.

    1. If the OST is still online and available, find all files with objects on the deactivated OST, and copy them to other OSTs in the file system:

      client# lfs find --obd ost_name /mount/point | lfs_migrate -y
    2. If the OST is no longer available, delete the files on that OST and restore them from backup:

      client# lfs find --obd ost_uuid -print0 /mount/point | \
                 tee /tmp/files_to_restore | xargs -0 -n 1 unlink

      The list of files that need to be restored from backup is stored in /tmp/files_to_restore. Restoring these files is beyond the scope of this document.

  4. Deactivate the OST.

    1. If there is expected to be a replacement OST in some short time (a few days), the OST can temporarily be deactivated on the clients using:

      client# lctl set_param osc.fsname-OSTnumber-*.active=0

      Note

      This setting is only temporary and will be reset if the clients are remounted or rebooted. It needs to be run on all clients.

    If there is not expected to be a replacement for this OST in the near future, permanently deactivate it on all clients and the MDS by running the following command on the MGS:

    mgs# lctl conf_param ost_name.osc.active=0

    Note

    A deactivated OST still appears in the file system configuration, though a new OST with the same name can be created using the --replace option for mkfs.lustre.
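A minimal sketch of formatting a replacement OST at the old index with --replace is shown below (option values are placeholders; see Section 14.8.5, “Restoring OST Configuration Files” for the full replacement procedure):

oss# mkfs.lustre --ost --reformat --replace --fsname=testfs \
        --mgsnode=mgsnode@tcp0 --index=0 /dev/new_ost_device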

14.8.4.  Backing Up OST Configuration Files

If the OST device is still accessible, then the Lustre configuration files on the OST should be backed up and saved for future use in order to avoid difficulties when a replacement OST is returned to service. These files rarely change, so they can and should be backed up while the OST is functional and accessible. If the deactivated OST is still available to mount (i.e. has not permanently failed or is unmountable due to severe corruption), an effort should be made to preserve these files.

  1. Mount the OST file system.

    oss# mkdir -p /mnt/ost
    oss# mount -t ldiskfs /dev/ost_device /mnt/ost

  2. Back up the OST configuration files.

    oss# tar cvf ost_name.tar -C /mnt/ost last_rcvd \
               CONFIGS/ O/0/LAST_ID

  3. Unmount the OST file system.

    oss# umount /mnt/ost

14.8.5.  Restoring OST Configuration Files

If the original OST is still available, it is best to follow the OST backup and restore procedure given in either Section 17.2, “ Backing Up and Restoring an MDT or OST (ldiskfs Device Level)”, or Section 17.3, “ Backing Up an OST or MDT (ldiskfs File System Level)” and Section 17.4, “ Restoring a File-Level Backup”.

To replace an OST that was removed from service due to corruption or hardware failure, the file system needs to be formatted using mkfs.lustre, and the Lustre file system configuration should be restored, if available.

If the OST configuration files were not backed up, due to the OST file system being completely inaccessible, it is still possible to replace the failed OST with a new one at the same OST index.

  1. Format the OST file system.

    oss# mkfs.lustre --ost --index=old_ost_index other_options \
               /dev/new_ost_dev

  2. Mount the OST file system.

    oss# mkdir /mnt/ost
    oss# mount -t ldiskfs /dev/new_ost_dev /mnt/ost

  3. Restore the OST configuration files, if available.

    oss# tar xvf ost_name.tar -C /mnt/ost
  4. Recreate the OST configuration files, if unavailable.

    Follow the procedure in Section 29.3.4, “Fixing a Bad LAST_ID on an OST” to recreate the LAST_ID file for this OST index. The last_rcvd file will be recreated when the OST is first mounted using the default parameters, which are normally correct for all file systems. The CONFIGS/mountdata file is created by mkfs.lustre at format time, but has flags set that request it to register itself with the MGS. It is possible to copy these flags from another working OST (which should be the same):

    oss1# debugfs -c -R "dump CONFIGS/mountdata /tmp/ldd" /dev/other_osdev
    oss1# scp /tmp/ldd oss0:/tmp/ldd
    oss0# dd if=/tmp/ldd of=/mnt/ost/CONFIGS/mountdata bs=4 count=1 seek=5 skip=5 conv=notrunc
  5. Unmount the OST file system.

    oss# umount /mnt/ost

14.8.6. Returning a Deactivated OST to Service

If the OST was permanently deactivated, it needs to be reactivated in the MGS configuration.

mgs# lctl conf_param ost_name.osc.active=1

If the OST was temporarily deactivated, it needs to be reactivated on the MDS and clients.

mds# lctl --device lustre_devno activate
client# lctl set_param osc.fsname-OSTnumber-*.active=1

14.9.  Aborting Recovery

You can abort recovery with either the lctl utility or by mounting the target with the abort_recov option (mount -o abort_recov). When starting a target, run:

mds# mount -t lustre -L mdt_name -o abort_recov /mount_point

Note

The recovery process is blocked until all OSTs are available.

14.10.  Determining Which Machine is Serving an OST

In the course of administering a Lustre file system, you may need to determine which machine is serving a specific OST. Identifying the machine's IP address is not sufficient, because IP is only one of several networking protocols the Lustre software uses; LNet identifies nodes by NID rather than by IP address. To identify the NID that is serving a specific OST, run one of the following commands on a client (you do not need to be a root user):

client$ lctl get_param osc.fsname-OSTnumber*.ost_conn_uuid

For example:

client$ lctl get_param osc.*-OST0000*.ost_conn_uuid 
osc.testfs-OST0000-osc-f1579000.ost_conn_uuid=192.168.20.1@tcp

- OR -

client$ lctl get_param osc.*.ost_conn_uuid 
osc.testfs-OST0000-osc-f1579000.ost_conn_uuid=192.168.20.1@tcp
osc.testfs-OST0001-osc-f1579000.ost_conn_uuid=192.168.20.1@tcp
osc.testfs-OST0002-osc-f1579000.ost_conn_uuid=192.168.20.1@tcp
osc.testfs-OST0003-osc-f1579000.ost_conn_uuid=192.168.20.1@tcp
osc.testfs-OST0004-osc-f1579000.ost_conn_uuid=192.168.20.1@tcp

14.11.  Changing the Address of a Failover Node

To change the address of a failover node (e.g., to use node X instead of node Y), run this command on the OSS/OST partition (depending on which option was used to originally identify the NID):

oss# tunefs.lustre --erase-params --servicenode=NID /dev/ost_device

or

oss# tunefs.lustre --erase-params --failnode=NID /dev/ost_device

For more information about the --servicenode and --failnode options, see Chapter 11, Configuring Failover in a Lustre File System.

14.12.  Separate a combined MGS/MDT

These instructions assume the MGS node will be the same as the MDS node. For instructions on how to move MGS to a different node, see Section 14.5, “ Changing a Server NID”.

These instructions are for doing the split without shutting down other servers and clients.

  1. Stop the MDS.

    Unmount the MDT

    umount -f /dev/mdt_device 
  2. Create the MGS.

    mds# mkfs.lustre --mgs --device-size=size /dev/mgs_device
  3. Copy the configuration data from MDT disk to the new MGS disk.

    mds# mount -t ldiskfs -o ro /dev/mdt_device /mdt_mount_point
    mds# mount -t ldiskfs -o rw /dev/mgs_device /mgs_mount_point 
    mds# cp -r /mdt_mount_point/CONFIGS/filesystem_name-* /mgs_mount_point/CONFIGS/. 
    mds# umount /mgs_mount_point
    mds# umount /mdt_mount_point

    See Section 14.4, “ Regenerating Lustre Configuration Logs” for an alternative method.

  4. Start the MGS.

    mgs# mount -t lustre /dev/mgs_device /mgs_mount_point

    Check to make sure it knows about all your file systems:

    mgs:/root# lctl get_param mgs.MGS.filesystems
  5. Remove the MGS option from the MDT, and set the new MGS nid.

    mds# tunefs.lustre --nomgs --mgsnode=new_mgs_nid /dev/mdt-device
  6. Start the MDT.

    mds# mount -t lustre /dev/mdt_device /mdt_mount_point

    Check to make sure the MGS configuration looks right:

    mgs# lctl get_param mgs.MGS.live.filesystem_name

Chapter 15. Managing Lustre Networking (LNet)

This chapter describes some tools for managing Lustre networking (LNet) and includes the following sections:

15.1.  Updating the Health Status of a Peer or Router

There are two mechanisms to update the health status of a peer or a router:

  • LNet can actively check the health status of all routers and mark them as dead or alive automatically. By default, this is off. To enable it, set auto_down and, if desired, check_routers_before_use (see the example after this list). This initial check may cause a pause equal to router_ping_timeout at system startup, if there are dead routers in the system.

  • When there is a communication error, all LNDs notify LNet that the peer (not necessarily a router) is down. This mechanism is always on, and there is no parameter to turn it off. However, if you set the LNet module parameter auto_down to 0, LNet ignores all such peer-down notifications.
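The example below sketches how the active router checks can be enabled through LNet module options; the file name follows the usual modprobe convention and the timeout value is illustrative:

options lnet auto_down=1 check_routers_before_use=1 router_ping_timeout=50

Place the line in the LNet module configuration file (for example, /etc/modprobe.d/lustre.conf) and reload the LNet module for it to take effect.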

Several key differences between the two mechanisms:

  • The router pinger only checks routers for their health, while LNDs notice all dead peers, regardless of whether they are routers or not.

  • The router pinger actively checks the router health by sending pings, but LNDs only notice a dead peer when there is network traffic going on.

  • The router pinger can bring a router from alive to dead or vice versa, but LNDs can only bring a peer down.

15.2. Starting and Stopping LNet

The Lustre software automatically starts and stops LNet, but it can also be manually started in a standalone manner. This is particularly useful to verify that your networking setup is working correctly before you attempt to start the Lustre file system.

15.2.1. Starting LNet

To start LNet, run:

$ modprobe lnet
$ lctl network up

To see the list of local NIDs, run:

$ lctl list_nids

This command tells you the network(s) configured to work with the Lustre file system.

If the networks are not correctly setup, see the modules.conf "networks=" line and make sure the network layer modules are correctly installed and configured.

To get the best remote NID, run:

$ lctl which_nid NIDs

where NIDs is the list of available NIDs.

This command takes the "best" NID from a list of the NIDs of a remote host. The "best" NID is the one that the local node uses when trying to communicate with the remote node.

15.2.1.1. Starting Clients

To start a TCP client, run:

mount -t lustre mdsnode:/mdsA/client /mnt/lustre/

To start an Elan client, run:

mount -t lustre 2@elan0:/mdsA/client /mnt/lustre

15.2.2. Stopping LNet

Before the LNet modules can be removed, LNet references must be removed. In general, these references are removed automatically when the Lustre file system is shut down, but for standalone routers, an explicit step is needed to stop LNet. Run:

lctl network unconfigure

Note

Attempting to remove Lustre modules prior to stopping the network may result in a crash or an LNet hang. If this occurs, the node must be rebooted (in most cases). Make sure that the Lustre network and Lustre file system are stopped prior to unloading the modules. Be extremely careful using rmmod -f.

To unconfigure the LNet network, run:

modprobe -r lnd_and_lnet_modules

Note

To remove all Lustre modules, run:

$ lustre_rmmod

15.3. Multi-Rail Configurations with LNet

To aggregate bandwidth across both rails of a dual-rail IB cluster (o2iblnd) [1] using LNet, consider these points:

  • LNet can work with multiple rails, however, it does not load balance across them. The actual rail used for any communication is determined by the peer NID.

  • Multi-rail LNet configurations do not provide an additional level of network fault tolerance. The configurations described below are for bandwidth aggregation only.

  • A Lustre node always uses the same local NID to communicate with a given peer NID. The criteria used to determine the local NID are:

    • Introduced in Lustre 2.5

      Lowest route priority number (lower number, higher priority).

    • Fewest hops (to minimize routing), and

    • Appears first in the "networks" or "ip2nets" LNet configuration strings

15.4. Load Balancing with an InfiniBand* Network

In this configuration, the Lustre file system contains OSSs with two InfiniBand HCAs. Lustre clients have only one InfiniBand HCA using OFED-based InfiniBand o2ib drivers. Load balancing between the HCAs on the OSS is accomplished through LNet.

15.4.1. Setting Up lustre.conf for Load Balancing

To configure LNet for load balancing on clients and servers:

  1. Set the lustre.conf options.

    Depending on your configuration, set lustre.conf options as follows:

    • Dual HCA OSS server

    options lnet networks="o2ib0(ib0),o2ib1(ib1)"
    • Client with the odd IP address

    options lnet ip2nets="o2ib0(ib0) 192.168.10.[103-253/2]"
    • Client with the even IP address

    options lnet ip2nets="o2ib1(ib0) 192.168.10.[102-254/2]"
  2. Run the modprobe lnet command and create a combined MGS/MDT file system.

    The following commands create an MGS/MDT or OST file system and mount the targets on the servers.

    modprobe lnet
    # mkfs.lustre --fsname lustre --mgs --mdt /dev/mdt_device
    # mkdir -p /mount_point
    # mount -t lustre /dev/mdt_device /mount_point

    For example:

    modprobe lnet
    mds# mkfs.lustre --fsname lustre --mdt --mgs /dev/sda
    mds# mkdir -p /mnt/test/mdt
    mds# mount -t lustre /dev/sda /mnt/test/mdt   
    mds# mount -t lustre mgs@o2ib0:/lustre /mnt/mdt
    oss# mkfs.lustre --fsname lustre --mgsnode=mds@o2ib0 --ost --index=0 /dev/sda
    oss# mkdir -p /mnt/test/ost
    oss# mount -t lustre /dev/sda /mnt/test/ost   
    oss# mount -t lustre mgs@o2ib0:/lustre /mnt/ost0
  3. Mount the clients.

    client# mount -t lustre mgs_node:/fsname /mount_point

    This example shows an IB client being mounted.

    client# mount -t lustre
    192.168.10.101@o2ib0,192.168.10.102@o2ib1:/mds/client /mnt/lustre

As an example, consider a two-rail IB cluster running the OFED stack with these IPoIB address assignments.

                   ib0                              ib1
Servers            192.168.0.*                      192.168.1.*
Clients            192.168.[2-127].*                192.168.[128-253].*

You could create these configurations:

  • A cluster with more clients than servers. The fact that an individual client cannot get two rails of bandwidth is unimportant because the servers are typically the actual bottleneck.

ip2nets="o2ib0(ib0),    o2ib1(ib1)      192.168.[0-1].*                     \
                                            #all servers;\
                   o2ib0(ib0)      192.168.[2-253].[0-252/2]       #even cl\
ients;\
                   o2ib1(ib1)      192.168.[2-253].[1-253/2]       #odd cli\
ents"

This configuration gives every server two NIDs, one on each network, and statically load-balances clients between the rails.

  • A single client that must get two rails of bandwidth, and it does not matter if the maximum aggregate bandwidth is only (# servers) * (1 rail).

ip2nets="       o2ib0(ib0)                      192.168.[0-1].[0-252/2]     \
                                            #even servers;\
           o2ib1(ib1)                      192.168.[0-1].[1-253/2]         \
                                        #odd servers;\
           o2ib0(ib0),o2ib1(ib1)           192.168.[2-253].*               \
                                        #clients"

This configuration gives every server a single NID on one rail or the other. Clients have a NID on both rails.

  • All clients and all servers must get two rails of bandwidth.

ip2nets="o2ib0(ib0),o2ib2(ib1)  192.168.[0-1].[0-252/2]     #even servers;\
         o2ib1(ib0),o2ib3(ib1)  192.168.[0-1].[1-253/2]     #odd servers;\
         o2ib0(ib0),o2ib3(ib1)  192.168.[2-253].[0-252/2]   #even clients;\
         o2ib1(ib0),o2ib2(ib1)  192.168.[2-253].[1-253/2]   #odd clients"

This configuration includes two additional proxy o2ib networks to work around the simplistic NID selection algorithm in the Lustre software. It connects "even" clients to "even" servers with o2ib0 on rail0, and "odd" servers with o2ib3 on rail1. Similarly, it connects "odd" clients to "odd" servers with o2ib1 on rail0, and "even" servers with o2ib2 on rail1.

Introduced in Lustre 2.4

15.5. Dynamically Configuring LNet Routes

Two scripts are provided: lustre/scripts/lustre_routes_config and lustre/scripts/lustre_routes_conversion.

lustre_routes_config sets or cleans up LNet routes from the specified config file. The /etc/sysconfig/lnet_routes.conf file can be used to automatically configure routes on LNet startup.

lustre_routes_conversion converts a legacy routes configuration file to the new syntax, which is parsed by lustre_routes_config.

15.5.1.  lustre_routes_config

lustre_routes_config usage is as follows

lustre_routes_config [--setup|--cleanup|--dry-run|--verbose] config_file
         --setup: configure routes listed in config_file
         --cleanup: unconfigure routes listed in config_file
         --dry-run: echo commands to be run, but do not execute them
         --verbose: echo commands before they are executed 

The format of the file which is passed into the script is as follows:

network: { gateway: gateway@exit_network [hop: hop] [priority: priority] }

An LNet router is identified when its local NID appears within the list of routes. However, this cannot be achieved by using this script, since the script only adds extra routes after the router is identified. To ensure that a router is identified correctly, make sure to add its local NID in the routes parameter in the modprobe lustre configuration file. See Section 37.1, “ Introduction”.
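For example, a routes file in the new syntax might contain the following entries (network names and gateway NIDs are hypothetical):

o2ib: { gateway: 10.10.1.1@tcp priority: 1 }
tcp4: { gateway: 10.10.1.2@tcp }

The file can then be applied with:

lustre_routes_config --setup /etc/sysconfig/lnet_routes.conf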

15.5.2. lustre_routes_conversion

lustre_routes_conversion usage is as follows:

lustre_routes_conversion legacy_file new_file

lustre_routes_conversion takes as a first parameter a file with routes configured as follows:

network [hop] gateway@exit_network[:priority];

The script then converts each routes entry in the provided file to:

network: { gateway: gateway@exit_network [hop: hop] [priority: priority] }

and appends each converted entry to the output file passed in as the second parameter to the script.

15.5.3. Route Configuration Examples

Below is an example of a legacy LNet route configuration. A legacy configuration file can have multiple entries.

tcp1 10.1.1.2@tcp0:1;
tcp2 10.1.1.3@tcp0:2;
tcp3 10.1.1.4@tcp0;

Below is an example of the converted LNet route configuration. The following would be the result of the lustre_routes_conversion script, when run on the above legacy entries.

tcp1: { gateway: 10.1.1.2@tcp0 priority: 1 }
tcp2: { gateway: 10.1.1.3@tcp0 priority: 2 }
tcp3: { gateway: 10.1.1.4@tcp0 }


[1] Multi-rail configurations are only supported by o2iblnd; other IB LNDs do not support multiple interfaces.

Chapter 16. Upgrading a Lustre File System

This chapter describes interoperability between Lustre software releases. It also provides procedures for upgrading from Lustre software release 1.8 to Lustre software release 2.x, from a Lustre software release 2.x to a more recent Lustre software release 2.x (major release upgrade), and from a Lustre software release 2.x.y to a more recent Lustre software release 2.x.y (minor release upgrade). It includes the following sections:

16.1.  Release Interoperability and Upgrade Requirements

Lustre software release 2.x (major) upgrade:

  • All servers must be upgraded at the same time, while some or all clients may be upgraded independently of the servers.

  • All servers must be upgraded to a Linux kernel supported by the Lustre software. See the Lustre Release Notes for your Lustre version for a list of tested Linux distributions.

  • Clients to be upgraded must be running a compatible Linux distribution as described in the Release Notes.

Lustre software release 2.x.y release (minor) upgrade:

  • All servers must be upgraded at the same time, while some or all clients may be upgraded.

  • Rolling upgrades are supported for minor releases allowing individual servers and clients to be upgraded without stopping the Lustre file system.

16.2.  Upgrading to Lustre Software Release 2.x (Major Release)

The procedure for upgrading from a Lustre software release 2.x to a more recent 2.x release of the Lustre software is described in this section.

Note

This procedure can also be used to upgrade Lustre software release 1.8.6-wc1 or later to any Lustre software release 2.x. To upgrade other versions of Lustre software release 1.8.x, contact your support provider.

Note

Introduced in Lustre 2.2

In Lustre software release 2.2, a feature has been added that allows striping across up to 2000 OSTs. By default, this "wide striping" feature is disabled. It is activated by setting the large_xattr or ea_inode option on the MDT using either mkfs.lustre or tune2fs. For example, after upgrading an existing file system to Lustre software release 2.2 or later, wide striping can be enabled by running the following command on the MDT device before mounting it:

tune2fs -O large_xattr

Once the wide striping feature is enabled and in use on the MDT, it is not possible to directly downgrade the MDT file system to an earlier version of the Lustre software that does not support wide striping. To disable wide striping:

  1. Delete all wide-striped files.

    OR

    Use lfs_migrate with the option -c stripe_count (set stripe_count to 160) to move the files to another location.

  2. Unmount the MDT.

  3. Run the following command to turn off the large_xattr option:

    tune2fs -O ^large_xattr

Using either mkfs.lustre or tune2fs with the large_xattr or ea_inode option results in ea_inode appearing in the file system feature list.
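Whether the feature is enabled on an MDT can be checked by inspecting the ldiskfs feature list on the unmounted device, for example (the device name is a placeholder); ea_inode or large_xattr appears in the Filesystem features line when wide striping is enabled:

mdt# tune2fs -l /dev/mdt_device | grep -i features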

Introduced in Lustre 2.3

Note

To generate a list of all files with more than 160 stripes use lfs find with the --stripe-count option:

lfs find ${mountpoint} --stripe-count=+160
Introduced in Lustre 2.4

Note

In Lustre software release 2.4, a new feature allows using multiple MDTs, which can each serve one or more remote sub-directories in the file system. The root directory is always located on MDT0.

Note that clients running a release prior to the Lustre software release 2.4 can only see the namespace hosted by MDT0 and will return an IO error if an attempt is made to access a directory on another MDT.

To upgrade a Lustre software release 2.x to a more recent major release, complete these steps:

  1. Create a complete, restorable file system backup.

    Caution

    Before installing the Lustre software, back up ALL data. The Lustre software contains kernel modifications that interact with storage devices and may introduce security issues and data loss if not installed, configured, or administered properly. If a full backup of the file system is not practical, a device-level backup of the MDT file system is recommended. See Chapter 17, Backing Up and Restoring a File System for a procedure.

  2. Shut down the file system by unmounting all clients and servers in the order shown below (unmounting a block device causes the Lustre software to be shut down on that node):

    1. Unmount the clients. On each client node, run:

      umount -a -t lustre
    2. Unmount the MDT. On the MDS node, run:

      umount -a -t lustre
    3. Unmount all the OSTs. On each OSS node, run:

      umount -a -t lustre
  3. Upgrade the Linux operating system on all servers to a compatible (tested) Linux distribution and reboot.

  4. Upgrade the Linux operating system on all clients to Red Hat Enterprise Linux 6 or other compatible (tested) distribution and reboot.

  5. Download the Lustre server RPMs for your platform from the Lustre Releases repository. See Table 8.1, “Packages Installed on Lustre Servers” for a list of required packages.

  6. Install the Lustre server packages on all Lustre servers (MGS, MDSs, and OSSs).

    1. Log onto a Lustre server as the root user

    2. Use the yum command to install the packages:

      # yum --nogpgcheck install pkg1.rpm pkg2.rpm ... 

    3. Verify the packages are installed correctly:

      rpm -qa|egrep "lustre|wc"

    4. Repeat these steps on each Lustre server.

  7. Download the Lustre client RPMs for your platform from the Lustre Releases repository. See Table 8.2, “Packages Installed on Lustre Clients” for a list of required packages.

    Note

    The version of the kernel running on a Lustre client must be the same as the version of the lustre-client-modules-ver package being installed. If not, a compatible kernel must be installed on the client before the Lustre client packages are installed.

  8. Install the Lustre client packages on each of the Lustre clients to be upgraded.

    1. Log onto a Lustre client as the root user.

    2. Use the yum command to install the packages:

      # yum --nogpgcheck install pkg1.rpm pkg2.rpm ... 

    3. Verify the packages were installed correctly:

      # rpm -qa|egrep "lustre|kernel"

    4. Repeat these steps on each Lustre client.

  9. (Optional) For upgrades to Lustre software release 2.2 or higher, to enable wide striping on an existing MDT, run the following command on the MDT :

    mdt# tune2fs -O large_xattr device

    For more information about wide striping, see Section 18.6, “Lustre Striping Internals”.

  10. (Optional) For upgrades to Lustre software release 2.4 or higher, to format an additional MDT, complete these steps:

    1. Determine the index used for the first MDT (each MDT must have unique index). Enter:

      client$ lctl dl | grep mdc
      36 UP mdc lustre-MDT0000-mdc-ffff88004edf3c00 
            4c8be054-144f-9359-b063-8477566eb84e 5

      In this example, the next available index is 1.

    2. Add the new block device as a new MDT at the next available index by entering (on one line):

      mds# mkfs.lustre --reformat --fsname=filesystem_name --mdt \
          --mgsnode=mgsnode --index 1 /dev/mdt1_device
  11. (Optional) If you are upgrading to Lustre software release 2.3 or higher from Lustre software release 2.2 or earlier and want to enable the quota feature, complete these steps:

    1. Before setting up the file system, enter on both the MDS and OSTs:

      tunefs.lustre --quota
    2. When setting up the file system, enter:

      lctl conf_param $FSNAME.quota.mdt=$QUOTA_TYPE
      lctl conf_param $FSNAME.quota.ost=$QUOTA_TYPE
  12. (Optional) If you are upgrading from Lustre software release 1.8, you must manually enable the FID-in-dirent feature. On the MDS, enter:

    tune2fs -O dirdata /dev/mdtdev

    Warning

    This step is not reversible. Do not complete this step until you are sure you will not be downgrading the Lustre software.

    Introduced in Lustre 2.4

    This step only enables FID-in-dirent for newly created files. If you are upgrading to Lustre software release 2.4, you can use namespace LFSCK to enable FID-in-dirent for the existing files. For the case of upgrading from Lustre software release 1.8, it is important to note that if you do NOT enable dirdata via the tune2fs command above, the namespace LFSCK will NOT generate FID-in-dirent for the existing files. For more information about FID-in-dirent and related functionalities in LFSCK, see Section 1.3, “ Lustre File System Storage and I/O”.

  13. Start the Lustre file system by starting the components in the order shown in the following steps:

    1. Mount the MGT. On the MGS, run

      mgs# mount -a -t lustre
    2. Mount the MDT(s). On each MDT, run:

      mds# mount -a -t lustre
    3. Mount all the OSTs. On each OSS node, run:

      oss# mount -a -t lustre

      Note

      This command assumes that all the OSTs are listed in the /etc/fstab file. OSTs that are not listed in the /etc/fstab file must be mounted individually by running the mount command:

      mount -t lustre /dev/block_device /mount_point
    4. Mount the file system on the clients. On each client node, run:

      client# mount -a -t lustre

Note

The mounting order described in the steps above must be followed for the initial mount and registration of a Lustre file system after an upgrade. For a normal start of a Lustre file system, the mounting order is MGT, OSTs, MDT(s), clients.

If you have a problem upgrading a Lustre file system, see Section 29.2, “Reporting a Lustre File System Bug” for some ways to get help.

16.3.  Upgrading to Lustre Software Release 2.x.y (Minor Release)

Rolling upgrades are supported for upgrading from any Lustre software release 2.x.y to a more recent minor release 2.x.z within the same release series. This allows the Lustre file system to continue to run while individual servers (or their failover partners) and clients are upgraded one at a time. The procedure for upgrading a Lustre software release 2.x.y to a more recent minor release is described in this section.

To upgrade Lustre software release 2.x.y to a more recent minor release, complete these steps:

  1. Create a complete, restorable file system backup.

    Caution

    Before installing the Lustre software, back up ALL data. The Lustre software contains kernel modifications that interact with storage devices and may introduce security issues and data loss if not installed, configured, or administered properly. If a full backup of the file system is not practical, a device-level backup of the MDT file system is recommended. See Chapter 17, Backing Up and Restoring a File System for a procedure.

  2. Download the Lustre server RPMs for your platform from the Lustre Releases repository. See Table 8.1, “Packages Installed on Lustre Servers” for a list of required packages.

  3. For a rolling upgrade, complete any procedures required to keep the Lustre file system running while the server to be upgraded is offline, such as failing over a primary server to its secondary partner.

  4. Unmount the Lustre server to be upgraded (MGS, MDS, or OSS).

  5. Install the Lustre server packages on the Lustre server.

    1. Log onto the Lustre server as the root user.

    2. Use the yum command to install the packages:

      # yum --nogpgcheck install pkg1.rpm pkg2.rpm ... 

    3. Verify the packages are installed correctly:

      # rpm -qa|egrep "lustre|kernel"

    4. Mount the Lustre server to restart the Lustre software on the server:

      server# mount -a -t lustre
    5. Repeat these steps on each Lustre server.

  6. Download the Lustre client RPMs for your platform from the Lustre Releases repository. See Table 8.2, “Packages Installed on Lustre Clients” for a list of required packages.

  7. Install the Lustre client packages on each of the Lustre clients to be upgraded.

    1. Log onto a Lustre client as the root user.

    2. Use the yum command to install the packages:

      # yum --nogpgcheck install pkg1.rpm pkg2.rpm ... 

    3. Verify the packages were installed correctly:

      # rpm -qa|egrep "lustre|kernel"

    4. Mount the Lustre client to restart the Lustre software on the client:

      client# mount -a -t lustre
    5. Repeat these steps on each Lustre client.

If you have a problem upgrading a Lustre file system, see Section 29.2, “Reporting a Lustre File System Bug” for some suggestions on how to get help.

Chapter 17. Backing Up and Restoring a File System

This chapter describes how to back up and restore a Lustre file system at the file-system level, device level, and file level. Each backup approach is described in the following sections:

It is strongly recommended that sites perform periodic device-level backup of the MDT(s) (Section 17.2, “ Backing Up and Restoring an MDT or OST (ldiskfs Device Level)”), for example twice a week with alternate backups going to a separate device, even if there is not enough capacity to do a full backup of all of the filesystem data. Even if there are separate file-level backups of some or all files in the filesystem, having a device-level backup of the MDT can be very useful in case of MDT failure or corruption. Being able to restore a device-level MDT backup can avoid the significantly longer process of restoring the entire filesystem from backup. Since the MDT is required for access to all files, its loss would otherwise force full restore of the filesystem (if that is even possible) even if the OSTs are still OK.

Performing a periodic device-level MDT backup can be done relatively inexpensively because the storage need only be connected to the primary MDS (it can be manually connected to the backup MDS in the rare case it is needed), and only needs good linear read/write performance. While the device-level MDT backup is not useful for restoring individual files, it is the most efficient way to handle the case of MDT failure or corruption.

17.1.  Backing up a File System

Backing up a complete file system gives you full control over the files to back up, and allows restoration of individual files as needed. File system-level backups are also the easiest to integrate into existing backup solutions.

File system backups are performed from a Lustre client (or many clients working in parallel in different directories) rather than on individual server nodes; this is no different than backing up any other file system.

However, due to the large size of most Lustre file systems, it is not always possible to get a complete backup. We recommend that you back up subsets of a file system. This includes subdirectories of the entire file system, filesets for a single user, files incremented by date, and so on, so that restores can be done more efficiently.

Note

Lustre internally uses a 128-bit file identifier (FID) for all files. To interface with user applications, 64-bit inode numbers are returned by the stat(), fstat(), and readdir() system calls to 64-bit applications, and 32-bit inode numbers are returned to 32-bit applications.

Some 32-bit applications accessing Lustre file systems (on both 32-bit and 64-bit CPUs) may experience problems with the stat(), fstat() or readdir() system calls under certain circumstances, though the Lustre client should return 32-bit inode numbers to these applications.

In particular, if the Lustre file system is exported from a 64-bit client via NFS to a 32-bit client, the Linux NFS server will export 64-bit inode numbers to applications running on the NFS client. If the 32-bit applications are not compiled with Large File Support (LFS), then they return EOVERFLOW errors when accessing the Lustre files. To avoid this problem, Linux NFS clients can use the kernel command-line option "nfs.enable_ino64=0" to force the NFS client to present 32-bit inode numbers to applications.
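
As a hedged illustration only (/etc/default/grub and grub2-mkconfig are assumptions for a grub2-based NFS client; the boot-loader configuration mechanism varies by distribution), the option can be appended to the kernel command line of the NFS client and the boot configuration regenerated before rebooting:

GRUB_CMDLINE_LINUX="... nfs.enable_ino64=0"
client# grub2-mkconfig -o /boot/grub2/grub.cfg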

Workaround: We very strongly recommend that backups using tar(1), and other utilities that depend on the inode number to uniquely identify an inode, be run on 64-bit clients. The 128-bit Lustre file identifiers cannot be uniquely mapped to a 32-bit inode number, and as a result these utilities may operate incorrectly on 32-bit clients. While there is still a small chance of inode number collisions with 64-bit inodes, the FID allocation pattern is designed to avoid collisions for long periods of usage.

17.1.1.  Lustre_rsync

The lustre_rsync feature keeps the entire file system in sync on a backup by replicating the file system's changes to a second file system (the second file system need not be a Lustre file system, but it must be sufficiently large). lustre_rsync uses Lustre changelogs to efficiently synchronize the file systems without having to scan (directory walk) the Lustre file system. This efficiency is critically important for large file systems, and distinguishes the Lustre lustre_rsync feature from other replication/backup solutions.

17.1.1.1.  Using Lustre_rsync

The lustre_rsync feature works by periodically running lustre_rsync, a userspace program used to synchronize changes in the Lustre file system onto the target file system. The lustre_rsync utility keeps a status file, which enables it to be safely interrupted and restarted without losing synchronization between the file systems.

The first time that lustre_rsync is run, the user must specify a set of parameters for the program to use. These parameters are described in the following table and in Section 38.13, “lustre_rsync”. On subsequent runs, these parameters are stored in the status file, and only the name of the status file needs to be passed to lustre_rsync.

Before using lustre_rsync:

  • Register a changelog user on the MDT to be synchronized. For details, see the changelog_register parameter in Chapter 38, System Configuration Utilities (lctl).

    - AND -

  • Verify that the Lustre file system (source) and the replica file system (target) are identical before registering the changelog user. If the file systems differ, use a utility such as regular rsync (not lustre_rsync) to make them identical.
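
    For example, a minimal sketch of such an initial copy using regular rsync (the options and paths are illustrative only; adjust them to your environment):

      client# rsync -aSH /mnt/lustre/ /mnt/target/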

The lustre_rsync utility uses the following parameters:

Parameter

Description

--source=src

The path to the root of the Lustre file system (source) which will be synchronized. This is a mandatory option if a valid status log created during a previous synchronization operation (--statuslog) is not specified.

--target=tgt

The path to the root where the source file system will be synchronized (target). This is a mandatory option if the status log created during a previous synchronization operation (--statuslog) is not specified. This option can be repeated if multiple synchronization targets are desired.

--mdt=mdt

The metadata device to be synchronized. A changelog user must be registered for this device. This is a mandatory option if a valid status log created during a previous synchronization operation (--statuslog) is not specified.

--user=userid

The changelog user ID for the specified MDT. To use lustre_rsync, the changelog user must be registered. For details, see the changelog_register parameter in Chapter 38, System Configuration Utilities (lctl). This is a mandatory option if a valid status log created during a previous synchronization operation (--statuslog) is not specified.

--statuslog=log

A log file to which synchronization status is saved. When the lustre_rsync utility starts, if a status log from a previous synchronization operation is specified, then the state is read from the log, and the otherwise-mandatory --source, --target, and --mdt options can be skipped. Specifying the --source, --target, and/or --mdt options in addition to the --statuslog option causes the corresponding parameters in the status log to be overridden. Command line options take precedence over options in the status log.

--xattr yes|no

Specifies whether extended attributes (xattrs) are synchronized or not. The default is to synchronize extended attributes.

Note

Disabling xattrs causes Lustre striping information not to be synchronized.

--verbose

Produces verbose output.

--dry-run

Shows the output of lustre_rsync commands (copy, mkdir, etc.) on the target file system without actually executing them.

--abort-on-err

Stops processing the lustre_rsync operation if an error occurs. The default is to continue the operation.

17.1.1.2.  lustre_rsync Examples

Sample lustre_rsync commands are listed below.

Register a changelog user for an MDT (e.g. testfs-MDT0000):

# lctl --device testfs-MDT0000 changelog_register
Registered changelog userid 'cl1'

Synchronize a Lustre file system (/mnt/lustre) to a target file system (/mnt/target).

$ lustre_rsync --source=/mnt/lustre --target=/mnt/target \
           --mdt=testfs-MDT0000 --user=cl1 --statuslog sync.log  --verbose 
Lustre filesystem: testfs 
MDT device: testfs-MDT0000 
Source: /mnt/lustre 
Target: /mnt/target 
Statuslog: sync.log 
Changelog registration: cl1 
Starting changelog record: 0 
Errors: 0 
lustre_rsync took 1 seconds 
Changelog records consumed: 22

After the file system undergoes changes, synchronize the changes onto the target file system. Only the statuslog name needs to be specified, as it records all the parameters passed earlier.

$ lustre_rsync --statuslog sync.log --verbose 
Replicating Lustre filesystem: testfs 
MDT device: testfs-MDT0000 
Source: /mnt/lustre 
Target: /mnt/target 
Statuslog: sync.log 
Changelog registration: cl1 
Starting changelog record: 22 
Errors: 0 
lustre_rsync took 2 seconds 
Changelog records consumed: 42

To synchronize a Lustre file system (/mnt/lustre) to two target file systems (/mnt/target1 and /mnt/target2), run:

$ lustre_rsync --source=/mnt/lustre --target=/mnt/target1 \
           --target=/mnt/target2 --mdt=testfs-MDT0000 --user=cl1  \
           --statuslog sync.log

17.2.  Backing Up and Restoring an MDT or OST (ldiskfs Device Level)

In some cases, it is useful to do a full device-level backup of an individual device (MDT or OST), before replacing hardware, performing maintenance, etc. Doing full device-level backups ensures that all of the data and configuration files are preserved in the original state and is the easiest method of doing a backup. For the MDT file system, it may also be the fastest way to perform the backup and restore, since it can do large streaming read and write operations at the maximum bandwidth of the underlying devices.

Note

Keeping an updated full backup of the MDT is especially important because permanent failure or corruption of the MDT file system renders the much larger amount of data in all the OSTs largely inaccessible and unusable. The storage needed for one or two full MDT device backups is much smaller than doing a full filesystem backup, and can use less expensive storage than the actual MDT device(s) since it only needs to have good streaming read/write speed instead of high random IOPS.

Introduced in Lustre 2.3

Warning

In Lustre software release 2.0 through 2.2, the only successful way to back up and restore an MDT is to do a device-level backup as described in this section. File-level restore of an MDT is not possible before Lustre software release 2.3, as the Object Index (OI) file cannot be rebuilt after restore without the OI Scrub functionality. Since Lustre software release 2.3, Object Index files are automatically rebuilt at first mount after a restore is detected (see LU-957), and file-level backup is supported (see Section 17.3, “Backing Up an OST or MDT (ldiskfs File System Level)”).

If hardware replacement is the reason for the backup or if a spare storage device is available, it is possible to do a raw copy of the MDT or OST from one block device to the other, as long as the new device is at least as large as the original device. To do this, run:

dd if=/dev/{original} of=/dev/{newdev} bs=4M

If hardware errors cause read problems on the original device, use the command below to allow as much data as possible to be read from the original device while skipping sections of the disk with errors:

dd if=/dev/{original} of=/dev/{newdev} bs=4k conv=sync,noerror \
      count={original size in 4kB blocks}

Even in the face of hardware errors, the ldiskfs file system is very robust and it may be possible to recover the file system data after running e2fsck -fy /dev/{newdev} on the new device, along with ll_recover_lost_found_objs for OST devices.

Introduced in Lustre 2.6

With Lustre software version 2.6 and later, there is no longer a need to run ll_recover_lost_found_objs on the OSTs, since the LFSCK scanning will automatically move objects from lost+found back into their correct locations on the OST after directory corruption.

In order to ensure that the backup is fully consistent, the MDT or OST must be unmounted, so that there are no changes being made to the device while the data is being transferred. If the reason for the backup is preventative (i.e. an MDT backup on a running MDS in case of future failures), then it is possible to perform a consistent backup from an LVM snapshot. If an LVM snapshot is not available, and taking the MDS offline for a backup is unacceptable, it is also possible to perform a backup from the raw MDT block device. While the backup from the raw device will not be fully consistent due to ongoing changes, the vast majority of ldiskfs metadata is statically allocated, and inconsistencies in the backup can be fixed by running e2fsck on the backup device; this is still much better than not having any backup at all.
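
For example, a minimal sketch of a snapshot-based device-level MDT backup, assuming the MDT is the LVM logical volume /dev/vgmds/mdt and that /backup has sufficient space (all names and sizes here are placeholders; see Section 17.5, “Using LVM Snapshots with the Lustre File System” for more on LVM snapshots):

mds# lvcreate -L50G -s -n mdt_snap /dev/vgmds/mdt
mds# dd if=/dev/vgmds/mdt_snap of=/backup/mdt_backup.img bs=4M
mds# lvremove -f /dev/vgmds/mdt_snap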

17.3.  Backing Up an OST or MDT (ldiskfs File System Level)

This procedure provides an alternative to backing up or migrating the data of an OST or MDT at the file level. At the file level, unused space is omitted from the backup, and the process may complete more quickly with a smaller total backup size. Backing up a single OST device is not necessarily the best way to perform backups of the Lustre file system, since the files stored in the backup are not usable without the metadata stored on the MDT and additional file stripes that may be on other OSTs. However, it is the preferred method for migration of OST devices, especially when it is desirable to reformat the underlying file system with different configuration options or to reduce fragmentation.

Note

Prior to Lustre software release 2.3, the only successful way to perform an MDT backup and restore is to do a device-level backup as is described in Section 17.2, “ Backing Up and Restoring an MDT or OST (ldiskfs Device Level)”. The ability to do MDT file-level backups is not available for Lustre software release 2.0 through 2.2, because restoration of the Object Index (OI) file does not return the MDT to a functioning state. Since Lustre software release 2.3, Object Index files are automatically rebuilt at first mount after a restore is detected (see LU-957), so file-level MDT restore is supported.

For Lustre software release 2.3 and newer with MDT file-level backup support, substitute mdt for ost in the instructions below.

  1. Make a mountpoint for the file system.

    [oss]# mkdir -p /mnt/ost
  2. Mount the file system.

    [oss]# mount -t ldiskfs /dev/{ostdev} /mnt/ost
  3. Change to the mountpoint being backed up.

    [oss]# cd /mnt/ost
  4. Back up the extended attributes.

    [oss]# getfattr -R -d -m '.*' -e hex -P . > ea-$(date +%Y%m%d).bak

    Note

    If the tar(1) command supports the --xattr option, the getfattr step may be unnecessary as long as tar does a backup of the trusted.* attributes. However, completing this step is not harmful and can serve as an added safety measure.

    Note

    In most distributions, the getfattr command is part of the attr package. If the getfattr command returns errors like Operation not supported, then the kernel does not correctly support EAs. Stop and use a different backup method.

  5. Verify that the ea-$date.bak file has properly backed up the EA data on the OST.

    Without this attribute data, the restore process may be missing extra data that can be very useful in case of later file system corruption. Look at this file with more or a text editor. Each object file should have a corresponding item similar to this:

    # file: O/0/d0/100992
    trusted.fid= \
    0x0d822200000000004a8a73e500000000808a0100000000000000000000000000
  6. Back up all file system data.

    [oss]# tar czvf {backup file}.tgz [--xattrs] --sparse .

    Note

    The tar --sparse option is vital for backing up an MDT. In order to have --sparse behave correctly, and complete the backup of an MDT in finite time, the version of tar must be specified. Correctly functioning versions of tar include the Lustre software enhanced version of tar at https://wiki.hpdd.intel.com/display/PUB/Lustre+Tools#LustreTools-lustre-tar, the tar from a Red Hat Enterprise Linux distribution (version 6.3 or more recent), and the GNU tar version 1.25 or more recent.

    Warning

    The tar --xattrs option is only available in GNU tar distributions from Red Hat or Intel.

  7. Change directory out of the file system.

    [oss]# cd -
  8. Unmount the file system.

    [oss]# umount /mnt/ost

    Note

    When restoring an OST backup on a different node as part of an OST migration, you also have to change server NIDs and use the --writeconf command to re-generate the configuration logs. See Chapter 14, Lustre Maintenance (Changing a Server NID).
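
    A hedged sketch of the writeconf step only (the device path is a placeholder, and the full NID-change procedure in Chapter 14 must still be followed):

    oss# tunefs.lustre --writeconf /dev/{newdev}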

17.4.  Restoring a File-Level Backup

To restore data from a file-level backup, you need to format the device, restore the file data and then restore the EA data.

  1. Format the new device.

    [oss]# mkfs.lustre --ost --index {OST index} {other options} /dev/{newdev}
  2. Set the file system label.

    [oss]# e2label /dev/{newdev} {fsname}-OST{index in hex}
  3. Mount the file system.

    [oss]# mount -t ldiskfs /dev/{newdev} /mnt/ost
  4. Change to the new file system mount point.

    [oss]# cd /mnt/ost
  5. Restore the file system backup.

    [oss]# tar xzvpf {backup file} [--xattrs] --sparse
  6. Restore the file system extended attributes.

    [oss]# setfattr --restore=ea-${date}.bak

    Note

    If the --xattrs option is supported by tar and specified in the step above, this step is redundant.

  7. Verify that the extended attributes were restored.

    [oss]# getfattr -d -m ".*" -e hex O/0/d0/100992
    trusted.fid= \
    0x0d822200000000004a8a73e500000000808a0100000000000000000000000000
  8. Remove old OI files.

    [oss]# rm -f oi.16*
  9. Remove old CATALOGS.

    [oss]# rm -f CATALOGS

    Note

    This step is optional for the MDT side only. The CATALOGS file records the llog file handlers that are used for recovering cross-server updates. Before OI scrub rebuilds the OI mappings for the llog files, the related recovery will fail if it runs faster than the background OI scrub, which will result in a failure of the whole mount process. OI scrub is an online tool, so a mount failure means that the OI scrub will be stopped. Removing the old CATALOGS file avoids this potential trouble. The side effect of removing the old CATALOGS file is that the recovery for related cross-server updates will be aborted; however, this can be handled by LFSCK after the system mount is up.

  10. Change directory out of the file system.

    [oss]# cd -
  11. Unmount the new file system.

    [oss]# umount /mnt/ost
Introduced in Lustre 2.3

If the file system was used between the time the backup was made and when it was restored, then the online LFSCK tool (part of Lustre code after version 2.3) will automatically be run to ensure the file system is coherent. If all of the device file systems were backed up at the same time after the entire Lustre file system was stopped, this step is unnecessary. In either case, the file system will be immediately usable, although there may be I/O errors reading from files that are present on the MDT but not the OSTs, and files that were created after the MDT backup will not be accessible or visible. See Section 30.4, “Checking the file system with LFSCK” for details on using LFSCK.

17.5.  Using LVM Snapshots with the Lustre File System

If you want to perform disk-based backups (because, for example, access to the backup system needs to be as fast as to the primary Lustre file system), you can use the Linux LVM snapshot tool to maintain multiple, incremental file system backups.

Because LVM snapshots cost CPU cycles as new files are written, taking snapshots of the main Lustre file system will probably result in unacceptable performance losses. You should create a new, backup Lustre file system and periodically (e.g., nightly) back up new/changed files to it. Periodic snapshots can be taken of this backup file system to create a series of "full" backups.

Note

Creating an LVM snapshot is not as reliable as making a separate backup, because the LVM snapshot shares the same disks as the primary MDT device, and depends on the primary MDT device for much of its data. If the primary MDT device becomes corrupted, this may result in the snapshot being corrupted.

17.5.1.  Creating an LVM-based Backup File System

Use this procedure to create a backup Lustre file system for use with the LVM snapshot mechanism.

  1. Create LVM volumes for the MDT and OSTs.

    Create LVM devices for your MDT and OST targets. Make sure not to use the entire disk for the targets; save some room for the snapshots. The snapshots start out as 0 size, but grow as you make changes to the current file system. If you expect to change 20% of the file system between backups, the most recent snapshot will be 20% of the target size, the next older one will be 40%, etc. Here is an example:

    cfs21:~# pvcreate /dev/sda1
       Physical volume "/dev/sda1" successfully created
    cfs21:~# vgcreate vgmain /dev/sda1
       Volume group "vgmain" successfully created
    cfs21:~# lvcreate -L200G -nMDT0 vgmain
       Logical volume "MDT0" created
    cfs21:~# lvcreate -L200G -nOST0 vgmain
       Logical volume "OST0" created
    cfs21:~# lvscan
       ACTIVE                  '/dev/vgmain/MDT0' [200.00 GB] inherit
       ACTIVE                  '/dev/vgmain/OST0' [200.00 GB] inherit
  2. Format the LVM volumes as Lustre targets.

    In this example, the backup file system is called main and designates the current, most up-to-date backup.

    cfs21:~# mkfs.lustre --fsname=main --mdt --index=0 /dev/vgmain/MDT0
     No management node specified, adding MGS to this MDT.
        Permanent disk data:
     Target:     main-MDT0000
     Index:      0
     Lustre FS:  main
     Mount type: ldiskfs
     Flags:      0x75
                   (MDT MGS first_time update )
     Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
     Parameters:
    checking for existing Lustre data
     device size = 200GB
     formatting backing filesystem ldiskfs on /dev/vgmain/MDT0
             target name  main-MDT0000
             4k blocks     0
             options        -i 4096 -I 512 -q -O dir_index -F
     mkfs_cmd = mkfs.ext2 -j -b 4096 -L main-MDT0000  -i 4096 -I 512 -q
      -O dir_index -F /dev/vgmain/MDT0
     Writing CONFIGS/mountdata
    cfs21:~# mkfs.lustre --mgsnode=cfs21 --fsname=main --ost --index=0
    /dev/vgmain/OST0
        Permanent disk data:
     Target:     main-OST0000
     Index:      0
     Lustre FS:  main
     Mount type: ldiskfs
     Flags:      0x72
                   (OST first_time update )
     Persistent mount opts: errors=remount-ro,extents,mballoc
     Parameters: mgsnode=192.168.0.21@tcp
    checking for existing Lustre data
     device size = 200GB
     formatting backing filesystem ldiskfs on /dev/vgmain/OST0
             target name  main-OST0000
             4k blocks     0
             options        -I 256 -q -O dir_index -F
     mkfs_cmd = mkfs.ext2 -j -b 4096 -L lustre-OST0000 -J size=400 -I 256 
      -i 262144 -O extents,uninit_bg,dir_nlink,huge_file,flex_bg -G 256 
      -E resize=4290772992,lazy_journal_init, -F /dev/vgmain/OST0
     Writing CONFIGS/mountdata
    cfs21:~# mount -t lustre /dev/vgmain/MDT0 /mnt/mdt
    cfs21:~# mount -t lustre /dev/vgmain/OST0 /mnt/ost
    cfs21:~# mount -t lustre cfs21:/main /mnt/main
    

17.5.2.  Backing up New/Changed Files to the Backup File System

At periodic intervals (e.g., nightly), back up new and changed files to the LVM-based backup file system.

cfs21:~# cp /etc/passwd /mnt/main 
 
cfs21:~# cp /etc/fstab /mnt/main 
 
cfs21:~# ls /mnt/main 
fstab  passwd

17.5.3.  Creating Snapshot Volumes

Whenever you want to make a "checkpoint" of the main Lustre file system, create LVM snapshots of all target MDT and OSTs in the LVM-based backup file system. You must decide the maximum size of a snapshot ahead of time, although you can dynamically change this later. The size of a daily snapshot is dependent on the amount of data changed daily in the main Lustre file system. It is likely that a two-day old snapshot will be twice as big as a one-day old snapshot.

You can create as many snapshots as you have room for in the volume group. If necessary, you can dynamically add disks to the volume group.

The snapshots of the target MDT and OSTs should be taken at the same point in time. Make sure that the cronjob updating the backup file system is not running, since that is the only thing writing to the disks. Here is an example:

cfs21:~# modprobe dm-snapshot
cfs21:~# lvcreate -L50M -s -n MDT0.b1 /dev/vgmain/MDT0
   Rounding up size to full physical extent 52.00 MB
   Logical volume "MDT0.b1" created
cfs21:~# lvcreate -L50M -s -n OST0.b1 /dev/vgmain/OST0
   Rounding up size to full physical extent 52.00 MB
   Logical volume "OST0.b1" created

After the snapshots are taken, you can continue to back up new/changed files to "main". The snapshots will not contain the new files.

cfs21:~# cp /etc/termcap /mnt/main
cfs21:~# ls /mnt/main
fstab  passwd  termcap

17.5.4.  Restoring the File System From a Snapshot

Use this procedure to restore the file system from an LVM snapshot.

  1. Rename the LVM snapshot.

    Rename the file system snapshot from "main" to "back" so you can mount it without unmounting "main". This is recommended, but not required. Use the --reformat flag to tunefs.lustre to force the name change. For example:

    cfs21:~# tunefs.lustre --reformat --fsname=back --writeconf /dev/vgmain/MDT0.b1
     checking for existing Lustre data
     found Lustre data
     Reading CONFIGS/mountdata
    Read previous values:
     Target:     main-MDT0000
     Index:      0
     Lustre FS:  main
     Mount type: ldiskfs
     Flags:      0x5
                  (MDT MGS )
     Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
     Parameters:
    Permanent disk data:
     Target:     back-MDT0000
     Index:      0
     Lustre FS:  back
     Mount type: ldiskfs
     Flags:      0x105
                  (MDT MGS writeconf )
     Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
     Parameters:
    Writing CONFIGS/mountdata
    cfs21:~# tunefs.lustre --reformat --fsname=back --writeconf /dev/vgmain/OST0.b1
     checking for existing Lustre data
     found Lustre data
     Reading CONFIGS/mountdata
    Read previous values:
     Target:     main-OST0000
     Index:      0
     Lustre FS:  main
     Mount type: ldiskfs
     Flags:      0x2
                  (OST )
     Persistent mount opts: errors=remount-ro,extents,mballoc
     Parameters: mgsnode=192.168.0.21@tcp
    Permanent disk data:
     Target:     back-OST0000
     Index:      0
     Lustre FS:  back
     Mount type: ldiskfs
     Flags:      0x102
                  (OST writeconf )
     Persistent mount opts: errors=remount-ro,extents,mballoc
     Parameters: mgsnode=192.168.0.21@tcp
    Writing CONFIGS/mountdata
    

    When renaming a file system, you must also erase the last_rcvd file from the snapshots:

    cfs21:~# mount -t ldiskfs /dev/vgmain/MDT0.b1 /mnt/mdtback
    cfs21:~# rm /mnt/mdtback/last_rcvd
    cfs21:~# umount /mnt/mdtback
    cfs21:~# mount -t ldiskfs /dev/vgmain/OST0.b1 /mnt/ostback
    cfs21:~# rm /mnt/ostback/last_rcvd
    cfs21:~# umount /mnt/ostback
  2. Mount the file system from the LVM snapshot. For example:

    cfs21:~# mount -t lustre /dev/vgmain/MDT0.b1 /mnt/mdtback
    cfs21:~# mount -t lustre /dev/vgmain/OST0.b1 /mnt/ostback
    cfs21:~# mount -t lustre cfs21:/back /mnt/back
  3. Note the old directory contents, as of the snapshot time. For example:

    cfs21:~/cfs/b1_5/lustre/utils# ls /mnt/back
    fstab  passwd
    

17.5.5.  Deleting Old Snapshots

To reclaim disk space, you can erase old snapshots as your backup policy dictates. Run:

lvremove /dev/vgmain/MDT0.b1

17.5.6.  Changing Snapshot Volume Size

You can also extend or shrink snapshot volumes if you find your daily deltas are smaller or larger than expected. Run:

lvextend -L10G /dev/vgmain/MDT0.b1
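
Correspondingly, a snapshot volume can be shrunk; a sketch (before shrinking, make sure the snapshot's currently used space fits within the reduced size):

lvreduce -L5G /dev/vgmain/MDT0.b1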

Note

Extending snapshots seems to be broken in older LVM. It is working in LVM v2.02.01.

Chapter 18. Managing File Layout (Striping) and Free Space

This chapter describes file layout (striping) and I/O options, and includes the following sections:

18.1.  How Lustre File System Striping Works

In a Lustre file system, the MDS allocates objects to OSTs using either a round-robin algorithm or a weighted algorithm. When the amount of free space is well balanced (i.e., by default, when the free space across OSTs differs by less than 17%), the round-robin algorithm is used to select the next OST to which a stripe is to be written. Periodically, the MDS adjusts the striping layout to eliminate some degenerate cases in which applications that create very regular file layouts (striping patterns) preferentially use a particular OST in the sequence.

Normally the usage of OSTs is well balanced. However, if users create a small number of exceptionally large files or incorrectly specify striping parameters, imbalanced OST usage may result. When the free space across OSTs differs by more than a specific amount (17% by default), the MDS then uses weighted random allocations with a preference for allocating objects on OSTs with more free space. (This can reduce I/O performance until space usage is rebalanced again.) For a more detailed description of how striping is allocated, see Section 18.5, “Managing Free Space”.

Introduced in Lustre 2.2

Files can only be striped over a finite number of OSTs. Prior to Lustre software release 2.2, the maximum number of OSTs that a file could be striped across was limited to 160. As of Lustre software release 2.2, the maximum number of OSTs is 2000. For more information, see Section 18.6, “Lustre Striping Internals”.

18.2.  Lustre File Layout (Striping) Considerations

Whether you should set up file striping and what parameter values you select depends on your needs. A good rule of thumb is to stripe over as few objects as will meet those needs and no more.

Some reasons for using striping include:

  • Providing high-bandwidth access. Many applications require high-bandwidth access to a single file, which may be more bandwidth than can be provided by a single OSS. Examples are a scientific application that writes to a single file from hundreds of nodes, or a binary executable that is loaded by many nodes when an application starts.

    In cases like these, a file can be striped over as many OSSs as it takes to achieve the required peak aggregate bandwidth for that file. Striping across a larger number of OSSs should only be used when the file size is very large and/or is accessed by many nodes at a time. Currently, Lustre files can be striped across up to 2000 OSTs, the maximum stripe count for an ldiskfs file system.

  • Improving performance when OSS bandwidth is exceeded. Striping across many OSSs can improve performance if the aggregate client bandwidth exceeds the server bandwidth and the application reads and writes data fast enough to take advantage of the additional OSS bandwidth. The largest useful stripe count is bounded by the I/O rate of the clients/jobs divided by the performance per OSS.

  • Providing space for very large files. Striping is useful when a single OST does not have enough free space to hold the entire file.

Some reasons to minimize or avoid striping:

  • Increased overhead. Striping results in more locks and extra network operations during common operations such as stat and unlink. Even when these operations are performed in parallel, one network operation takes less time than 100 operations.

    Increased overhead also results from server contention. Consider a cluster with 100 clients and 100 OSSs, each with one OST. If each file has exactly one object and the load is distributed evenly, there is no contention and the disks on each server can manage sequential I/O. If each file has 100 objects, then the clients all compete with one another for the attention of the servers, and the disks on each node seek in 100 different directions resulting in needless contention.

  • Increased risk. When files are striped across all servers and one of the servers breaks down, a small part of each striped file is lost. By comparison, if each file has exactly one stripe, fewer files are lost, but they are lost in their entirety. Many users would rather lose some of their files entirely than lose all of their files partially.

18.2.1.  Choosing a Stripe Size

Choosing a stripe size is a balancing act, but reasonable defaults are described below. The stripe size has no effect on a single-stripe file.

  • The stripe size must be a multiple of the page size. Lustre software tools enforce a multiple of 64 KB (the maximum page size on ia64 and PPC64 nodes) so that users on platforms with smaller pages do not accidentally create files that might cause problems for ia64 clients.

  • The smallest recommended stripe size is 512 KB. Although you can create files with a stripe size of 64 KB, the smallest practical stripe size is 512 KB because the Lustre file system sends 1MB chunks over the network. Choosing a smaller stripe size may result in inefficient I/O to the disks and reduced performance.

  • A good stripe size for sequential I/O using high-speed networks is between 1 MB and 4 MB. In most situations, stripe sizes larger than 4 MB may result in longer lock hold times and contention during shared file access.

  • The maximum stripe size is 4 GB. Using a large stripe size can improve performance when accessing very large files. It allows each client to have exclusive access to its own part of a file. However, a large stripe size can be counterproductive in cases where it does not match your I/O pattern.

  • Choose a stripe pattern that takes into account the write patterns of your application. Writes that cross an object boundary are slightly less efficient than writes that go entirely to one server. If the file is written in a consistent and aligned way, make the stripe size a multiple of the write() size.
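
For example (a sketch building on the last point above; the file name is hypothetical), an application that writes aligned 4 MB records could create its output file with a 4 MB stripe size so that each write lands entirely on one OST:

client# lfs setstripe -s 4M /mnt/lustre/app_output_file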

18.3. Setting the File Layout/Striping Configuration (lfs setstripe)

Use the lfs setstripe command to create new files with a specific file layout (stripe pattern) configuration.

lfs setstripe [--size|-s stripe_size] [--count|-c stripe_count] \
[--index|-i start_ost] [--pool|-p pool_name] filename|dirname 

stripe_size

The stripe_size indicates how much data to write to one OST before moving to the next OST. The default stripe_size is 1 MB. Passing a stripe_size of 0 causes the default stripe size to be used. Otherwise, the stripe_size value must be a multiple of 64 KB.

stripe_count

The stripe_count indicates how many OSTs to use. The default stripe_count value is 1. Setting stripe_count to 0 causes the default stripe count to be used. Setting stripe_count to -1 means stripe over all available OSTs (full OSTs are skipped).

start_ost

The start OST is the first OST to which files are written. The default value for start_ost is -1, which allows the MDS to choose the starting index. This setting is strongly recommended, as it allows space and load balancing to be done by the MDS as needed. If the value of start_ost is set to a value other than -1, the file starts on the specified OST index. OST index numbering starts at 0.

Note

If the specified OST is inactive or in a degraded mode, the MDS will silently choose another target.

Note

If you pass a start_ost value of 0 and a stripe_count value of 1, all files are written to OST 0, until space is exhausted. This is probably not what you meant to do. If you only want to adjust the stripe count and keep the other parameters at their default settings, do not specify any of the other parameters:

client# lfs setstripe -c stripe_count filename

pool_name

The pool_name specifies the OST pool to which the file will be written. This allows limiting the OSTs used to a subset of all OSTs in the file system. For more details about using OST pools, see Creating and Managing OST Pools.

18.3.1. Specifying a File Layout (Striping Pattern) for a Single File

It is possible to specify the file layout when a new file is created using the command lfs setstripe. This allows users to override the file system default parameters to tune the file layout more optimally for their application. Execution of an lfs setstripe command fails if the file already exists.

18.3.1.1. Setting the Stripe Size

The command to create a new file with a specified stripe size is similar to:

[client]# lfs setstripe -s 4M /mnt/lustre/new_file

This example command creates the new file /mnt/lustre/new_file with a stripe size of 4 MB.

Now, when the file is created, the new stripe setting creates the file on a single OST with a stripe size of 4M:

[client]# lfs getstripe /mnt/lustre/new_file
/mnt/lustre/new_file
lmm_stripe_count:   1
lmm_stripe_size:    4194304
lmm_pattern:        1
lmm_layout_gen:     0
lmm_stripe_offset:  1
obdidx     objid        objid           group
1          690550       0xa8976         0 

In this example, the stripe size is 4 MB.

18.3.1.2.  Setting the Stripe Count

The command below creates a new file with a stripe count of -1 to specify striping over all available OSTs:

[client]# lfs setstripe -c -1 /mnt/lustre/full_stripe

The example below indicates that the file full_stripe is striped over all six active OSTs in the configuration:

[client]# lfs getstripe /mnt/lustre/full_stripe
/mnt/lustre/full_stripe
  obdidx   objid   objid   group
  0        8       0x8     0
  1        4       0x4     0
  2        5       0x5     0
  3        5       0x5     0
  4        4       0x4     0
  5        2       0x2     0

This is in contrast to the output in Section 18.3.1.1, “Setting the Stripe Size”, which shows only a single object for the file.

18.3.2. Setting the Striping Layout for a Directory

In a directory, the lfs setstripe command sets a default striping configuration for files created in the directory. The usage is the same as lfs setstripe for a regular file, except that the directory must exist prior to setting the default striping configuration. If a file is created in a directory with a default stripe configuration (without otherwise specifying striping), the Lustre file system uses those striping parameters instead of the file system default for the new file.

To change the striping pattern for a sub-directory, create a directory with desired file layout as described above. Sub-directories inherit the file layout of the root/parent directory.
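
For example, a hedged sketch of setting a default layout on an existing directory so that new files created in it are striped across four OSTs with a 1 MB stripe size (the directory name is hypothetical):

client# lfs setstripe -s 1M -c 4 /mnt/lustre/results
client# lfs getstripe /mnt/lustre/results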

18.3.3. Setting the Striping Layout for a File System

Setting the striping specification on the root directory determines the striping for all new files created in the file system unless an overriding striping specification takes precedence (such as a striping layout specified by the application, or set using lfs setstripe, or specified for the parent directory).

Note

The striping settings for a root directory are, by default, applied to any new child directories created in the root directory, unless striping settings have been specified for the child directory.

18.3.4. Creating a File on a Specific OST

You can use lfs setstripe to create a file on a specific OST. In the following example, the file file1 is created on the first OST (OST index is 0).

$ lfs setstripe --count 1 --index 0 file1
$ dd if=/dev/zero of=file1 count=1 bs=100M
1+0 records in
1+0 records out

$ lfs getstripe file1
/mnt/testfs/file1
lmm_stripe_count:   1
lmm_stripe_size:    1048576
lmm_pattern:        1
lmm_layout_gen:     0
lmm_stripe_offset:  0               
     obdidx    objid   objid    group                    
     0         37364   0x91f4   0

18.4. Retrieving File Layout/Striping Information (getstripe)

The lfs getstripe command is used to display information that shows over which OSTs a file is distributed. For each OST, the index and UUID is displayed, along with the OST index and object ID for each stripe in the file. For directories, the default settings for files created in that directory are displayed.

18.4.1. Displaying the Current Stripe Size

To see the current stripe size for a Lustre file or directory, use the lfs getstripe command. For example, to view information for a directory, enter a command similar to:

[client]# lfs getstripe /mnt/lustre 

This command produces output similar to:

/mnt/lustre 
(Default) stripe_count: 1 stripe_size: 1M stripe_offset: -1

In this example, the default stripe count is 1 (data blocks are striped over a single OST), the default stripe size is 1 MB, and the objects are created over all available OSTs.

To view information for a file, enter a command similar to:

$ lfs getstripe /mnt/lustre/foo
/mnt/lustre/foo
lmm_stripe_count:   1
lmm_stripe_size:    1048576
lmm_pattern:        1
lmm_layout_gen:     0
lmm_stripe_offset:  0
  obdidx   objid    objid      group
  2        835487   0xcbf9f    0 

In this example, the file is located on obdidx 2, which corresponds to the OST lustre-OST0002. To see which node is serving that OST, run:

$ lctl get_param osc.lustre-OST0002-osc.ost_conn_uuid
osc.lustre-OST0002-osc.ost_conn_uuid=192.168.20.1@tcp

18.4.2. Inspecting the File Tree

To inspect an entire tree of files, use the lfs find command:

lfs find [--recursive | -r] file|directory ...
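
For example, a hedged sketch that lists all regular files larger than 100 MB beneath a hypothetical project directory (lfs find descends the directory tree and also accepts filters such as -type and -size, as used elsewhere in this manual):

client$ lfs find /mnt/lustre/project -type f -size +100M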

18.4.3. Locating the MDT for a remote directory

Introduced in Lustre 2.4

Lustre software release 2.4 can be configured with multiple MDTs in the same file system. Each sub-directory can have a different MDT. To identify on which MDT a given subdirectory is located, pass the getstripe [--mdt-index|-M] parameters to lfs. An example of this command is provided in the section Section 14.8.1, “Removing a MDT from the File System”.

18.5. Managing Free Space

To optimize file system performance, the MDT assigns file stripes to OSTs based on two allocation algorithms. The round-robin allocator gives preference to location (spreading out stripes across OSSs to increase network bandwidth utilization) and the weighted allocator gives preference to available space (balancing loads across OSTs). Threshold and weighting factors for these two algorithms can be adjusted by the user. The MDT reserves 0.1 percent of total OST space and 32 inodes for each OST. The MDT stops object allocation for the OST if available space is less than reserved or the OST has fewer than 32 free inodes. The MDT starts object allocation when available space is twice as big as the reserved space and the OST has more than 64 free inodes. Note that clients can still append to existing files regardless of the object allocation state.

Introduced in Lustre 2.9

The reserved space for each OST can be adjusted by the user. Use the lctl set_param command; for example, the following command reserves 1 GB of space on every OST:

lctl set_param -P osp.*.reserved_mb_low=1024

This section describes how to check available free space on disks and how free space is allocated. It then describes how to set the threshold and weighting factors for the allocation algorithms.

18.5.1. Checking File System Free Space

Free space is an important consideration in assigning file stripes. The lfs df command can be used to show available disk space on the mounted Lustre file system and space consumption per OST. If multiple Lustre file systems are mounted, a path may be specified, but is not required. Options to the lfs df command are shown below.

Option

Description

-h

Displays sizes in human readable format (for example: 1K, 234M, 5G).

-i, --inodes

Lists inodes instead of block usage.

Note

The df -i and lfs df -i commands show the minimum number of inodes that can be created in the file system at the current time. If the total number of objects available across all of the OSTs is smaller than those available on the MDT(s), taking into account the default file striping, then df -i will also report a smaller number of inodes than could be created. Running lfs df -i will report the actual number of inodes that are free on each target.

For ZFS file systems, the number of inodes that can be created is dynamic and depends on the free space in the file system. The Free and Total inode counts reported for a ZFS file system are only an estimate based on the current usage for each target. The Used inode count is the actual number of inodes used by the file system.

Examples

[client1] $ lfs df
UUID                1K-blocks  Used      Available Use% Mounted on
mds-lustre-0_UUID   9174328    1020024   8154304   11%  /mnt/lustre[MDT:0]
ost-lustre-0_UUID   94181368   56330708  37850660  59%  /mnt/lustre[OST:0]
ost-lustre-1_UUID   94181368   56385748  37795620  59%  /mnt/lustre[OST:1]
ost-lustre-2_UUID   94181368   54352012  39829356  57%  /mnt/lustre[OST:2]
filesystem summary: 282544104  167068468 39829356  57%  /mnt/lustre
 
[client1] $ lfs df -h
UUID                bytes    Used    Available   Use%  Mounted on
mds-lustre-0_UUID   8.7G     996.1M  7.8G        11%   /mnt/lustre[MDT:0]
ost-lustre-0_UUID   89.8G    53.7G   36.1G       59%   /mnt/lustre[OST:0]
ost-lustre-1_UUID   89.8G    53.8G   36.0G       59%   /mnt/lustre[OST:1]
ost-lustre-2_UUID   89.8G    51.8G   38.0G       57%   /mnt/lustre[OST:2]
filesystem summary: 269.5G   159.3G  110.1G      59%   /mnt/lustre
 
[client1] $ lfs df -i 
UUID                Inodes  IUsed IFree   IUse% Mounted on
mds-lustre-0_UUID   2211572 41924 2169648 1%    /mnt/lustre[MDT:0]
ost-lustre-0_UUID   737280  12183 725097  1%    /mnt/lustre[OST:0]
ost-lustre-1_UUID   737280  12232 725048  1%    /mnt/lustre[OST:1]
ost-lustre-2_UUID   737280  12214 725066  1%    /mnt/lustre[OST:2]
filesystem summary: 2211572 41924 2169648 1%    /mnt/lustre

18.5.2.  Stripe Allocation Methods

Two stripe allocation methods are provided:

  • Round-robin allocator - When the OSTs have approximately the same amount of free space, the round-robin allocator alternates stripes between OSTs on different OSSs, so the OST used for stripe 0 of each file is evenly distributed among OSTs, regardless of the stripe count. In a simple example with eight OSTs numbered 0-7, objects would be allocated like this:

    File 1: OST1, OST2, OST3, OST4
    File 2: OST5, OST6, OST7
    File 3: OST0, OST1, OST2, OST3, OST4, OST5
    File 4: OST6, OST7, OST0

    Here are several more sample round-robin stripe orders (each letter represents a different OST on a single OSS):

    3: AAA

    One 3-OST OSS

    3x3: ABABAB

    Two 3-OST OSSs

    3x4: BBABABA

    One 3-OST OSS (A) and one 4-OST OSS (B)

    3x5: BBABBABA

    One 3-OST OSS (A) and one 5-OST OSS (B)

    3x3x3: ABCABCABC

    Three 3-OST OSSs

  • Weighted allocator - When the free space difference between the OSTs becomes significant, the weighting algorithm is used to influence OST ordering based on size (amount of free space available on each OST) and location (stripes evenly distributed across OSTs). The weighted allocator fills the emptier OSTs faster, but uses a weighted random algorithm, so the OST with the most free space is not necessarily chosen each time.

The allocation method is determined by the amount of free-space imbalance on the OSTs. When free space is relatively balanced across OSTs, the faster round-robin allocator is used, which maximizes network balancing. The weighted allocator is used when any two OSTs are out of balance by more than the specified threshold (17% by default). The threshold between the two allocation methods is defined in the file /proc/fs/lustre/lov/fsname-mdtlov/qos_threshold_rr.

To set the qos_threshold_rr to 25, enter this command on the MGS:

lctl set_param lov.fsname-mdtlov.qos_threshold_rr=25

18.5.3. Adjusting the Weighting Between Free Space and Location

The weighting priority used by the weighted allocator is set in the file /proc/fs/lustre/lov/fsname-mdtlov/qos_prio_free. Increasing the value of qos_prio_free puts more weighting on the amount of free space available on each OST and less on how stripes are distributed across OSTs. The default value is 91 (percent). When the free space priority is set to 100 (percent), weighting is based entirely on free space and location is no longer used by the striping algorithm.

To change the allocator weighting to 100, enter this command on the MGS:

lctl conf_param fsname-MDT0000.lov.qos_prio_free=100

Note

When qos_prio_free is set to 100, a weighted random algorithm is still used to assign stripes, so, for example, if OST2 has twice as much free space as OST1, OST2 is twice as likely to be used, but it is not guaranteed to be used.

18.6. Lustre Striping Internals

For Lustre releases prior to Lustre software release 2.2, files can be striped across a maximum of 160 OSTs. Lustre inodes use an extended attribute to record the location of each object (the object ID and the number of the OST on which it is stored). The size of the extended attribute limits the maximum stripe count to 160 objects.

Introduced in Lustre 2.2

In Lustre software release 2.2 and subsequent releases, the maximum number of OSTs over which files can be striped has been raised to 2000 by allocating a new block on which to store the extended attribute that holds the object information. This feature, known as "wide striping," only allocates the additional extended attribute data block if the file is striped with a stripe count greater than 160. The file layout (object ID, OST number) is stored on the new data block with a pointer to this block stored in the original Lustre inode for the file. For files smaller than 160 objects, the Lustre inode is used to store the file layout.

Chapter 19. Managing the File System and I/O

19.1.  Handling Full OSTs

Sometimes a Lustre file system becomes unbalanced, often due to incorrectly-specified stripe settings, or when very large files are created that are not striped over all of the OSTs. If an OST is full and an attempt is made to write more information to the file system, an error occurs. The procedures below describe how to handle a full OST.

The MDS will normally handle space balancing automatically at file creation time, and this procedure is normally not needed, but may be desirable in certain circumstances (e.g. when creating very large files that would consume more than the total free space of the full OSTs).

19.1.1.  Checking OST Space Usage

The example below shows an unbalanced file system:

client# lfs df -h
UUID                       bytes           Used            Available       \
Use%            Mounted on
testfs-MDT0000_UUID        4.4G            214.5M          3.9G            \
4%              /mnt/testfs[MDT:0]
testfs-OST0000_UUID        2.0G            751.3M          1.1G            \
37%             /mnt/testfs[OST:0]
testfs-OST0001_UUID        2.0G            755.3M          1.1G            \
37%             /mnt/testfs[OST:1]
testfs-OST0002_UUID        2.0G            1.7G            155.1M          \
86%             /mnt/testfs[OST:2] ****
testfs-OST0003_UUID        2.0G            751.3M          1.1G            \
37%             /mnt/testfs[OST:3]
testfs-OST0004_UUID        2.0G            747.3M          1.1G            \
37%             /mnt/testfs[OST:4]
testfs-OST0005_UUID        2.0G            743.3M          1.1G            \
36%             /mnt/testfs[OST:5]
 
filesystem summary:        11.8G           5.4G            5.8G            \
45%             /mnt/testfs

In this case, OST0002 is almost full and when an attempt is made to write additional information to the file system (even with uniform striping over all the OSTs), the write command fails as follows:

client# lfs setstripe -s 4M -i 0 -c -1 /mnt/testfs
client# dd if=/dev/zero of=/mnt/testfs/test_3 bs=10M count=100
dd: writing '/mnt/testfs/test_3': No space left on device
98+0 records in
97+0 records out
1017192448 bytes (1.0 GB) copied, 23.2411 seconds, 43.8 MB/s

19.1.2.  Taking a Full OST Offline

To avoid running out of space in the file system, if the OST usage is imbalanced and one or more OSTs are close to being full while there are others that have a lot of space, the full OSTs may optionally be deactivated at the MDS to prevent the MDS from allocating new objects there.

  1. Log into the MDS server:

    client# ssh root@192.168.0.10 
    root@192.168.0.10's password: 
    Last login: Wed Nov 26 13:35:12 2008 from 192.168.0.6
    
  2. Use the lctl dl command to show the status of all file system components:

    mds# lctl dl 
    0 UP mgs MGS MGS 9 
    1 UP mgc MGC192.168.0.10@tcp e384bb0e-680b-ce25-7bc9-81655dd1e813 5
    2 UP mdt MDS MDS_uuid 3
    3 UP lov testfs-mdtlov testfs-mdtlov_UUID 4
    4 UP mds testfs-MDT0000 testfs-MDT0000_UUID 5
    5 UP osc testfs-OST0000-osc testfs-mdtlov_UUID 5
    6 UP osc testfs-OST0001-osc testfs-mdtlov_UUID 5
    7 UP osc testfs-OST0002-osc testfs-mdtlov_UUID 5
    8 UP osc testfs-OST0003-osc testfs-mdtlov_UUID 5
    9 UP osc testfs-OST0004-osc testfs-mdtlov_UUID 5
    10 UP osc testfs-OST0005-osc testfs-mdtlov_UUID 5
    
  3. Use lctl deactivate to take the full OST offline:

    mds# lctl --device 7 deactivate
    
  4. Display the status of the file system components:

    mds# lctl dl 
    0 UP mgs MGS MGS 9
    1 UP mgc MGC192.168.0.10@tcp e384bb0e-680b-ce25-7bc9-81655dd1e813 5
    2 UP mdt MDS MDS_uuid 3
    3 UP lov testfs-mdtlov testfs-mdtlov_UUID 4
    4 UP mds testfs-MDT0000 testfs-MDT0000_UUID 5
    5 UP osc testfs-OST0000-osc testfs-mdtlov_UUID 5
    6 UP osc testfs-OST0001-osc testfs-mdtlov_UUID 5
    7 IN osc testfs-OST0002-osc testfs-mdtlov_UUID 5
    8 UP osc testfs-OST0003-osc testfs-mdtlov_UUID 5
    9 UP osc testfs-OST0004-osc testfs-mdtlov_UUID 5
    10 UP osc testfs-OST0005-osc testfs-mdtlov_UUID 5
    

The device list shows that OST0002 is now inactive. When new files are created in the file system, they will only use the remaining active OSTs. Either manual space rebalancing can be done by migrating data to other OSTs, as shown in the next section, or normal file deletion and creation can be allowed to passively rebalance the space usage.

19.1.3.  Migrating Data within a File System

Introduced in Lustre 2.8

Lustre software version 2.8 includes a feature to migrate metadata (directories and inodes therein) between MDTs. This migration can only be performed on whole directories. For example, to migrate the contents of the /testfs/testremote directory from the MDT it currently resides on to MDT0000, the sequence of commands is as follows:

$ cd /testfs
$ lfs getdirstripe -M ./testremote                                     # which MDT is the dir on?
1
$ for i in $(seq 3); do touch ./testremote/${i}.txt; done              # create test files
$ for i in $(seq 3); do lfs getstripe -M ./testremote/${i}.txt; done   # check files are on MDT 1
1
1
1
$ lfs migrate -m 0 ./testremote                                        # migrate testremote to MDT 0
$ lfs getdirstripe -M ./testremote                                     # which MDT is the dir on now?
0
$ for i in $(seq 3); do lfs getstripe -M ./testremote/${i}.txt; done   # check files are on MDT 0 too
0
0
0

For more information, see man lfs

Warning

Currently, only whole directories can be migrated between MDTs. During migration each file receives a new identifier (FID). As a consequence, the file receives a new inode number. Some system tools (for example, backup and archiving tools) may consider the migrated files to be new, even though the contents are unchanged.

If there is a need to migrate the file data from the current OST(s) to new OSTs, the data must be migrated (copied) to the new location. The simplest way to do this is to use the lfs_migrate command (see Section 34.2, “lfs_migrate”). However, the steps for migrating a file by hand are also shown here for reference.
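
For example, a minimal sketch (the file name is illustrative) that uses lfs_migrate to move a single file off its current OSTs:

client# lfs_migrate -y /mnt/testfs/test_2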

  1. Identify the file(s) to be moved.

    In the example below, the object information portion of the output from the lfs getstripe command shows that the test_2 file is located entirely on OST0002:

    client# lfs getstripe /mnt/testfs/test_2
    /mnt/testfs/test_2
    obdidx     objid   objid   group
         2      8     0x8       0
    
  2. To move the data, create a copy and remove the original:

    client# cp -a /mnt/testfs/test_2 /mnt/testfs/test_2.tmp
    client# mv /mnt/testfs/test_2.tmp /mnt/testfs/test_2
    
  3. If the space usage of the OSTs is severely imbalanced, large files can be found and migrated from their current location onto OSTs that have more space by running:

    client# lfs find --ost ost_name -size +1G | lfs_migrate -y
    
  4. Check the file system balance.

    The lfs df output in the example below shows a more balanced system compared to the lfs df output in the example in Section 19.1, “ Handling Full OSTs”.

    client# lfs df -h
    UUID                 bytes         Used            Available       Use%    \
            Mounted on
    testfs-MDT0000_UUID   4.4G         214.5M          3.9G            4%      \
            /mnt/testfs[MDT:0]
    testfs-OST0000_UUID   2.0G         1.3G            598.1M          65%     \
            /mnt/testfs[OST:0]
    testfs-OST0001_UUID   2.0G         1.3G            594.1M          65%     \
            /mnt/testfs[OST:1]
    testfs-OST0002_UUID   2.0G         913.4M          1000.0M         45%     \
            /mnt/testfs[OST:2]
    testfs-OST0003_UUID   2.0G         1.3G            602.1M          65%     \
            /mnt/testfs[OST:3]
    testfs-OST0004_UUID   2.0G         1.3G            606.1M          64%     \
            /mnt/testfs[OST:4]
    testfs-OST0005_UUID   2.0G         1.3G            610.1M          64%     \
            /mnt/testfs[OST:5]
     
    filesystem summary:  11.8G 7.3G            3.9G    61%                     \
    /mnt/testfs
    

19.1.4.  Returning an Inactive OST Back Online

Once the space usage on the deactivated OST(s) is no longer severely imbalanced, due to either active or passive data redistribution, they should be reactivated so that new files will again be allocated on them.

[mds]# lctl --device 7 activate
[mds]# lctl dl
  0 UP mgs MGS MGS 9
  1 UP mgc MGC192.168.0.10@tcp e384bb0e-680b-ce25-7bc9-816dd1e813 5
  2 UP mdt MDS MDS_uuid 3
  3 UP lov testfs-mdtlov testfs-mdtlov_UUID 4
  4 UP mds testfs-MDT0000 testfs-MDT0000_UUID 5
  5 UP osc testfs-OST0000-osc testfs-mdtlov_UUID 5
  6 UP osc testfs-OST0001-osc testfs-mdtlov_UUID 5
  7 UP osc testfs-OST0002-osc testfs-mdtlov_UUID 5
  8 UP osc testfs-OST0003-osc testfs-mdtlov_UUID 5
  9 UP osc testfs-OST0004-osc testfs-mdtlov_UUID 5
 10 UP osc testfs-OST0005-osc testfs-mdtlov_UUID 5

19.2.  Creating and Managing OST Pools

The OST pools feature enables users to group OSTs together to make object placement more flexible. A 'pool' is the name associated with an arbitrary subset of OSTs in a Lustre cluster.

OST pools follow these rules:

  • An OST can be a member of multiple pools.

  • No ordering of OSTs in a pool is defined or implied.

  • Stripe allocation within a pool follows the same rules as the normal stripe allocator.

  • OST membership in a pool is flexible, and can change over time.

When an OST pool is defined, it can be used to allocate files. When file or directory striping is set to a pool, only OSTs in the pool are candidates for striping. If a stripe_index is specified which refers to an OST that is not a member of the pool, an error is returned.

OST pools are used only at file creation. If the definition of a pool changes (an OST is added or removed or the pool is destroyed), already-created files are not affected.

Note

An error ( EINVAL) results if you create a file using an empty pool.

Note

If a directory has pool striping set and the pool is subsequently removed, the new files created in this directory have the (non-pool) default striping pattern for that directory applied and no error is returned.

19.2.1. Working with OST Pools

OST pools are defined in the configuration log on the MGS. Use the lctl command to:

  • Create/destroy a pool

  • Add/remove OSTs in a pool

  • List pools and OSTs in a specific pool

The lctl command MUST be run on the MGS. Another requirement for managing OST pools is to either have the MDT and MGS on the same node or have a Lustre client mounted on the MGS node, if it is separate from the MDS. This is needed to validate that the pool commands being run are correct.

Caution

Running the writeconf command on the MDS erases all pools information (as well as any other parameters set using lctl conf_param). We recommend that the pools definitions (and conf_param settings) be executed using a script, so they can be reproduced easily after a writeconf is performed.
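
A minimal sketch of such a script, assuming the illustrative file system name testfs, pool name pool1, and parameter settings (adjust to the actual pool definitions and conf_param settings in use):

#!/bin/sh
# Re-apply pool definitions and conf_param settings after a writeconf (illustrative)
lctl pool_new testfs.pool1
lctl pool_add testfs.pool1 OST[0-10/2]
lctl conf_param testfs.quota.ost=ug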

To create a new pool, run:

mgs# lctl pool_new fsname.poolname

Note

The pool name is an ASCII string up to 15 characters.

To add the named OST to a pool, run:

mgs# lctl pool_add fsname.poolname ost_list

Where:

  • ost_list is fsname-OSTindex_range

  • index_range is ost_index_start-ost_index_end[,index_range] or ost_index_start-ost_index_end/step

If the leading fsname and/or ending _UUID are missing, they are automatically added.

For example, to add even-numbered OSTs to pool1 on file system testfs, run a single command ( pool_add) to add many OSTs to the pool at one time:

lctl pool_add testfs.pool1 OST[0-10/2]

Note

Each time an OST is added to a pool, a new llog configuration record is created. For convenience, you can run a single command, as shown in the example above, to add many OSTs to the pool at one time.

To remove a named OST from a pool, run:

mgs# lctl pool_remove fsname.poolname ost_list

To destroy a pool, run:

mgs# lctl pool_destroy fsname.poolname

Note

All OSTs must be removed from a pool before it can be destroyed.

To list pools in the named file system, run:

mgs# lctl pool_list fsname|pathname

To list OSTs in a named pool, run:

lctl pool_list fsname.poolname
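
For example, with the illustrative file system name testfs and pool name pool1, the two forms would be:

mgs# lctl pool_list testfs
mgs# lctl pool_list testfs.pool1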

19.2.1.1. Using the lfs Command with OST Pools

Several lfs commands can be run with OST pools. Use the lfs setstripe command to associate a directory with an OST pool. This causes all new regular files and directories in the directory to be created in the pool. The lfs command can be used to list pools in a file system and OSTs in a named pool.

To associate a directory with a pool, so all new files and directories will be created in the pool, run:

client# lfs setstripe --pool|-p pool_name filename|dirname

To set striping patterns, run:

client# lfs setstripe [--size|-s stripe_size] [--offset|-o start_ost]
           [--count|-c stripe_count] [--pool|-p pool_name] dir|filename

Note

If you specify striping with an invalid pool name, because the pool does not exist or the pool name was mistyped, lfs setstripe returns an error. Run lfs pool_list to make sure the pool exists and the pool name is entered correctly.

Note

The --pool option for lfs setstripe is compatible with other modifiers. For example, you can set striping on a directory to use an explicit starting index.
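
For example, a sketch (the pool and directory names are illustrative) combining the --pool option with an explicit stripe count and starting OST index, using the option names from the synopsis above:

client# lfs setstripe --count 2 --offset 1 --pool pool1 /mnt/testfs/dir1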

19.2.2.  Tips for Using OST Pools

Here are several suggestions for using OST pools.

  • A directory or file can be given an extended attribute (EA) that restricts striping to a pool.

  • Pools can be used to group OSTs with the same technology or performance (slower or faster), or that are preferred for certain jobs. Examples are SATA OSTs versus SAS OSTs or remote OSTs versus local OSTs.

  • A file created in an OST pool tracks the pool by keeping the pool name in the file LOV EA, as shown in the sketch below.
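
A hedged sketch of checking this (the file path is illustrative, and the exact output format varies between releases):

client# lfs getstripe /mnt/testfs/dir1/file1 | grep -i pool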

19.3.  Adding an OST to a Lustre File System

To add an OST to an existing Lustre file system:

  1. Add a new OST by running the following commands on the new OSS:

    oss# mkfs.lustre --fsname=testfs --mgsnode=mds16@tcp0 --ost --index=12 /dev/sda
    oss# mkdir -p /mnt/testfs/ost12
    oss# mount -t lustre /dev/sda /mnt/testfs/ost12
    
  2. Migrate the data (possibly).

    The file system is quite unbalanced when new empty OSTs are added. New file creations are automatically balanced. If this is a scratch file system or files are pruned at a regular interval, then no further work may be needed. Files existing prior to the expansion can be rebalanced with an in-place copy, which can be done with a simple script.

    The basic method is to copy existing files to a temporary file, then move the temp file over the old one. This should not be attempted with files which are currently being written to by users or applications. This operation redistributes the stripes over the entire set of OSTs.
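
    A minimal sketch of this in-place copy approach, assuming the illustrative full OST testfs-OST0002_UUID and that file names contain no spaces or newlines:

    client# lfs find /mnt/testfs --ost testfs-OST0002_UUID -type f |
            while read f; do
                cp -a "$f" "$f.tmp" && mv "$f.tmp" "$f"
            done

    A more careful script would also skip files that are currently open for writing.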

    A very clever migration script would do the following:

    • Examine the current distribution of data.

    • Calculate how much data should move from each full OST to the empty ones.

    • Search for files on a given full OST (using lfs getstripe).

    • Force the new destination OST (using lfs setstripe).

    • Copy only enough files to address the imbalance.

If a Lustre file system administrator wants to explore this approach further, per-OST disk-usage statistics can be found under /proc/fs/lustre/osc/*/rpc_stats.

19.4.  Performing Direct I/O

The Lustre software supports the O_DIRECT flag to the open() system call.

Applications using the read() and write() calls must supply buffers aligned on a page boundary (usually 4 KB). If the alignment is not correct, the call returns -EINVAL. Direct I/O may help performance in cases where the client is doing a large amount of I/O and is CPU-bound (CPU utilization 100%).
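
For example, a minimal sketch using dd, whose oflag=direct option opens the output file with O_DIRECT (the file name is illustrative; the 1 MB block size is a multiple of the page size):

client# dd if=/dev/zero of=/mnt/testfs/dio_file bs=1M count=16 oflag=direct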

19.4.1. Making File System Objects Immutable

An immutable file or directory is one that cannot be modified, renamed or removed. To set this flag, run:

chattr +i file

To remove this flag, use chattr -i.

19.5. Other I/O Options

This section describes other I/O options, including checksums, and the ptlrpcd thread pool.

19.5.1. Lustre Checksums

To guard against network data corruption, a Lustre client can perform two types of data checksums: in-memory (for data in client memory) and wire (for data sent over the network). For each checksum type, a 32-bit checksum of the data read or written on both the client and server is computed, to ensure that the data has not been corrupted in transit over the network. The ldiskfs backing file system does NOT do any persistent checksumming, so it does not detect corruption of data in the OST file system.

The checksumming feature is enabled, by default, on individual client nodes. If the client or OST detects a checksum mismatch, then an error is logged in the syslog of the form:

LustreError: BAD WRITE CHECKSUM: changed in transit before arrival at OST: \
from 192.168.1.1@tcp inum 8991479/2386814769 object 1127239/0 extent [10240\
0-106495]

If this happens, the client will re-read or re-write the affected data up to five times to get a good copy of the data over the network. If a good copy still cannot be obtained, an I/O error is returned to the application.

To enable both types of checksums (in-memory and wire), run:

lctl set_param llite.*.checksum_pages=1

To disable both types of checksums (in-memory and wire), run:

lctl set_param llite.*.checksum_pages=0

To check the status of a wire checksum, run:

lctl get_param osc.*.checksums

19.5.1.1. Changing Checksum Algorithms

By default, the Lustre software uses the adler32 checksum algorithm, because it is robust and has a lower impact on performance than crc32. The Lustre file system administrator can change the checksum algorithm via lctl set_param, depending on what is supported in the kernel.

To check which checksum algorithm is being used by the Lustre software, run:

$ lctl get_param osc.*.checksum_type

To change the wire checksum algorithm, run:

$ lctl set_param osc.*.checksum_type=algorithm

Note

The in-memory checksum always uses the adler32 algorithm, if available, and only falls back to crc32 if adler32 cannot be used.

In the following example, the lctl get_param command is used to determine that the Lustre software is using the adler32 checksum algorithm. Then the lctl set_param command is used to change the checksum algorithm to crc32. A second lctl get_param command confirms that the crc32 checksum algorithm is now in use.

$ lctl get_param osc.*.checksum_type
osc.testfs-OST0000-osc-ffff81012b2c48e0.checksum_type=crc32 [adler]
$ lctl set_param osc.*.checksum_type=crc32
osc.testfs-OST0000-osc-ffff81012b2c48e0.checksum_type=crc32
$ lctl get_param osc.*.checksum_type
osc.testfs-OST0000-osc-ffff81012b2c48e0.checksum_type=[crc32] adler

19.5.2. Ptlrpc Thread Pool

Releases prior to Lustre software release 2.2 used two portal RPC daemons for each client/server pair. One daemon handled all synchronous IO requests, and the second daemon handled all asynchronous (non-IO) RPCs. The increasing use of large SMP nodes for Lustre servers exposed some scaling issues. The lack of threads for large SMP nodes resulted in cases where a single CPU would be 100% utilized and other CPUs would be relatively idle. This is especially noticeable when a single client traverses a large directory.

Lustre software release 2.2.x implements a ptlrpc thread pool, so that multiple threads can be created to serve asynchronous RPC requests. The number of threads spawned is controlled at module load time using module options. By default one thread is spawned per CPU, with a minimum of 2 threads spawned irrespective of module options.

One of the issues with thread operations is the cost of moving a thread context from one CPU to another with the resulting loss of CPU cache warmth. To reduce this cost, ptlrpc threads can be bound to a CPU. However, if the CPUs are busy, a bound thread may not be able to respond quickly, as the bound CPU may be busy with other tasks and the thread must wait to schedule.

Because of these considerations, the pool of ptlrpc threads can be a mixture of bound and unbound threads. The system operator can balance the thread mixture based on system size and workload.

19.5.2.1. ptlrpcd parameters

These parameters should be set in /etc/modprobe.conf or in the /etc/modprobe.d directory, as options for the ptlrpc module; an example configuration is shown at the end of this section.

options ptlrpcd max_ptlrpcds=XXX

Sets the number of ptlrpcd threads created at module load time. The default if not specified is one thread per CPU, including hyper-threaded CPUs. The lower bound is 2 (old ptlrpcd behaviour).

options ptlrpcd ptlrpcd_bind_policy=[1-4]

Controls the binding of threads to CPUs. There are four policy options.

  • PDB_POLICY_NONE(ptlrpcd_bind_policy=1) All threads are unbound.

  • PDB_POLICY_FULL(ptlrpcd_bind_policy=2) All threads attempt to bind to a CPU.

  • PDB_POLICY_PAIR(ptlrpcd_bind_policy=3) This is the default policy. Threads are allocated as a bound/unbound pair. Each thread (bound or free) has a partner thread. The partnering is used by the ptlrpcd load policy, which determines how threads are allocated to CPUs.

  • PDB_POLICY_NEIGHBOR(ptlrpcd_bind_policy=4) Threads are allocated as a bound/unbound pair. Each thread (bound or free) has two partner threads.
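
As an illustrative sketch (the thread count, file name, and policy value are examples only, not recommendations), the following module configuration spawns 16 ptlrpcd threads using the default bound/unbound pair policy:

# /etc/modprobe.d/lustre.conf (illustrative)
options ptlrpcd max_ptlrpcds=16 ptlrpcd_bind_policy=3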

Chapter 20. Lustre File System Failover and Multiple-Mount Protection

This chapter describes the multiple-mount protection (MMP) feature, which protects the file system from being mounted simultaneously to more than one node. It includes the following sections:

Note

For information about configuring a Lustre file system for failover, see Chapter 11, Configuring Failover in a Lustre File System

20.1.  Overview of Multiple-Mount Protection

The multiple-mount protection (MMP) feature protects the Lustre file system from being mounted simultaneously to more than one node. This feature is important in a shared storage environment (for example, when a failover pair of OSSs share a LUN).

The backend file system, ldiskfs, supports the MMP mechanism. A block in the file system is updated by a kmmpd daemon at one second intervals, and a sequence number is written in this block. If the file system is cleanly unmounted, then a special "clean" sequence is written to this block. When mounting the file system, ldiskfs checks if the MMP block has a clean sequence or not.

Even if the MMP block has a clean sequence, ldiskfs waits for some interval to guard against the following situations:

  • If I/O traffic is heavy, it may take longer for the MMP block to be updated.

  • If another node is trying to mount the same file system, a "race" condition may occur.

With MMP enabled, mounting a clean file system takes at least 10 seconds. If the file system was not cleanly unmounted, then the file system mount may require additional time.

Note

The MMP feature is only supported on Linux kernel versions newer than 2.6.9.

20.2. Working with Multiple-Mount Protection

On a new Lustre file system, MMP is automatically enabled by mkfs.lustre at format time if failover is being used and the kernel and e2fsprogs version support it. On an existing file system, a Lustre file system administrator can manually enable MMP when the file system is unmounted.

Use the following commands to determine whether MMP is running in the Lustre file system and to enable or disable the MMP feature.

To determine if MMP is enabled, run:

dumpe2fs -h /dev/block_device | grep mmp

Here is a sample command:

dumpe2fs -h /dev/sdc | grep mmp 
Filesystem features: has_journal ext_attr resize_inode dir_index 
filetype extent mmp sparse_super large_file uninit_bg

To manually disable MMP, run:

tune2fs -O ^mmp /dev/block_device

To manually enable MMP, run:

tune2fs -O mmp /dev/block_device

When MMP is enabled, if ldiskfs detects multiple mount attempts after the file system is mounted, it blocks these later mount attempts and reports the time when the MMP block was last updated, the node name, and the device name of the node where the file system is currently mounted.

Chapter 21. Configuring and Managing Quotas

21.1.  Working with Quotas

Quotas allow a system administrator to limit the amount of disk space a user or group can use. Quotas are set by root, and can be specified for individual users and/or groups. Before a file is written to a partition where quotas are set, the quota of the creator's group is checked. If a quota exists, then the file size counts towards the group's quota. If no quota exists, then the owner's user quota is checked before the file is written. Similarly, quotas can limit inode usage, so that the number of files a user or group creates can be controlled as well as the space consumed.

Lustre quota enforcement differs from standard Linux quota enforcement in several ways:

  • Quotas are administered via the lfs and lctl commands (post-mount).

  • The quota feature in Lustre software is distributed throughout the system (as the Lustre file system is a distributed file system). Because of this, quota setup and behavior on Lustre is different from local disk quotas in the following ways:

    • No single point of administration: some commands must be executed on the MGS, other commands on the MDSs and OSSs, and still other commands on the client.

    • Granularity: a local quota is typically specified with kilobyte resolution, while Lustre uses one megabyte as the smallest quota resolution.

    • Accuracy: quota information is distributed throughout the file system and can only be accurately calculated with a completely quiescent file system.

  • Quotas are allocated and consumed in a quantized fashion.

  • Clients do not need to set the usrquota or grpquota options to mount. As of Lustre software release 2.4, space accounting is always enabled by default and quota enforcement can be enabled/disabled on a per-file system basis with lctl conf_param. It is worth noting that both lfs quotaon and quota_type are deprecated as of Lustre software release 2.4.0.

Caution

Although a quota feature is available in the Lustre software, root quotas are NOT enforced.

lfs setquota -u root (limits are not enforced)

lfs quota -u root (usage includes internal Lustre data that is dynamic in size and does not accurately reflect mount point visible block and inode usage).

21.2.  Enabling Disk Quotas

The design of quotas on Lustre has management and enforcement separated from resource usage and accounting. Lustre software is responsible for management and enforcement. The back-end file system is responsible for resource usage and accounting. Because of this, it is necessary to begin by enabling quota support on the back-end file system. Because quota setup is dependent on the Lustre software version in use, you may first need to run lctl get_param version to identify which version you are currently using.

21.2.1. Enabling Disk Quotas (Lustre Software Prior to Release 2.4)

For Lustre software releases older than release 2.4, lfs quotacheck must first be run from a client node to create quota files on the Lustre targets (i.e. the MDT and OSTs). lfs quotacheck requires the file system to be quiescent (i.e. no modifying operations like write, truncate, create or delete should run concurrently). Failure to follow this caution may result in inaccurate user/group disk usage. Operations that do not change Lustre files (such as read or mount) are okay to run. lfs quotacheck performs a scan on all the Lustre targets to calculate the block/inode usage for each user/group. If the Lustre file system has many files, quotacheck may take a long time to complete. Several options can be passed to lfs quotacheck:

# lfs quotacheck -ug /mnt/testfs

  • u -- checks the user disk quota information

  • g -- checks the group disk quota information

By default, quota is turned on after quotacheck completes. However, this setting isn't persistent and quota will have to be enabled again (via lfs quotaon) if one of the Lustre targets is restarted. lfs quotaoff is used to turn off quota.

To enable quota permanently with a Lustre software release older than release 2.4, the quota_type parameter must be used. This requires setting mdd.quota_type and ost.quota_type, respectively, on the MDT and OSTs. quota_type can be set to the string u (user), g (group) or ug for both users and groups. This parameter can be specified at mkfs time ( mkfs.lustre --param mdd.quota_type=ug) or with tunefs.lustre. As an example:

tunefs.lustre --param ost.quota_type=ug $ost_dev

When using mkfs.lustre --param mdd.quota_type=ug or tunefs.lustre --param ost.quota_type=ug, be sure to run the command on all OSTs and the MDT. Otherwise, abnormal results may occur.

Warning

In Lustre software releases before 2.4, when new OSTs are added to the file system, quotas are not automatically propagated to the new OSTs. As a workaround, clear and then reset quotas for each user or group using the lfs setquota command. In the example below, quotas are cleared and reset for user bob on file system testfs:

$ lfs setquota -u bob -b 0 -B 0 -i 0 -I 0 /mnt/testfs
$ lfs setquota -u bob -b 307200 -B 309200 -i 10000 -I 11000 /mnt/testfs
Introduced in Lustre 2.4

21.2.2. Enabling Disk Quotas (Lustre Software Release 2.4 and later)

Quota setup is orchestrated by the MGS and all setup commands in this section must be run on the MGS. Once setup, verification of the quota state must be performed on the MDT. Although quota enforcement is managed by the Lustre software, each OSD implementation relies on the back-end file system to maintain per-user/group block and inode usage. Hence, differences exist when setting up quotas with ldiskfs or ZFS back-ends:

  • For ldiskfs backends, mkfs.lustre now creates empty quota files and enables the QUOTA feature flag in the superblock which turns quota accounting on at mount time automatically. e2fsck was also modified to fix the quota files when the QUOTA feature flag is present.

  • For ZFS backend, accounting ZAPs are created and maintained by the ZFS file system itself. While ZFS tracks per-user and group block usage, it does not handle inode accounting. The ZFS OSD implements its own support for inode tracking. Two options are available:

    1. The ZFS OSD can estimate the number of inodes in-use based on the number of blocks used by a given user or group. This mode can be enabled by running the following command on the server running the target: lctl set_param osd-zfs.${FSNAME}-${TARGETNAME}.quota_iused_estimate=1.

    2. Similarly to block accounting, dedicated ZAPs are also created by the ZFS OSD to maintain per-user and group inode usage. This is the default mode, which corresponds to quota_iused_estimate set to 0.

Note

Lustre file systems formatted with a Lustre release prior to 2.4.0 can still be safely upgraded to release 2.4.0, but will not have functional space usage reporting until tunefs.lustre --quota is run against all targets. This command sets the QUOTA feature flag in the superblock and runs e2fsck (as a result, the target must be offline) to build the per-UID/GID disk usage database. See Section 21.5, “ Quotas and Version Interoperability” for further important considerations.

Caution

Lustre software release 2.4 and later requires a version of e2fsprogs that supports quota (i.e. newer or equal to 1.42.3.wc1) to be installed on the server nodes using the ldiskfs backend (e2fsprogs is not needed with the ZFS backend). In general, we recommend using the latest e2fsprogs version available on http://downloads.hpdd.intel.com/public/e2fsprogs/.

The ldiskfs OSD relies on the standard Linux quota to maintain accounting information on disk. As a consequence, the Linux kernel running on the Lustre servers using ldiskfs backend must have CONFIG_QUOTA, CONFIG_QUOTACTL and CONFIG_QFMT_V2 enabled.

As of Lustre software release 2.4.0, quota enforcement is thus turned on/off independently of space accounting which is always enabled. lfs quota on|off as well as the per-target quota_type parameter are deprecated in favor of a single per-file system quota parameter controlling inode/block quota enforcement. Like all permanent parameters, this quota parameter can be set via lctl conf_param on the MGS via the following syntax:

lctl conf_param fsname.quota.ost|mdt=u|g|ug|none
  • ost -- to configure block quota managed by OSTs

  • mdt -- to configure inode quota managed by MDTs

  • u -- to enable quota enforcement for users only

  • g -- to enable quota enforcement for groups only

  • ug -- to enable quota enforcement for both users and groups

  • none -- to disable quota enforcement for both users and groups

Examples:

To turn on user and group quotas for block only on file system testfs1, on the MGS run:

$ lctl conf_param testfs1.quota.ost=ug

To turn on group quotas for inodes on file system testfs2, on the MGS run:

$ lctl conf_param testfs2.quota.mdt=g

To turn off user and group quotas for both inode and block on file system testfs3, on the MGS run:

$ lctl conf_param testfs3.quota.ost=none
$ lctl conf_param testfs3.quota.mdt=none

21.2.2.1.  Quota Verification

Once the quota parameters have been configured, all targets which are part of the file system will be automatically notified of the new quota settings and enable/disable quota enforcement as needed. The per-target enforcement status can still be verified by running the following command on the MDS(s):

$ lctl get_param osd-*.*.quota_slave.info
osd-zfs.testfs-MDT0000.quota_slave.info=
target name:    testfs-MDT0000
pool ID:        0
type:           md
quota enabled:  ug
conn to master: setup
user uptodate:  glb[1],slv[1],reint[0]
group uptodate: glb[1],slv[1],reint[0]

21.3.  Quota Administration

Once the file system is up and running, quota limits on blocks and inodes can be set for both user and group. This is controlled entirely from a client via three quota parameters:

Grace period-- The period of time (in seconds) within which users are allowed to exceed their soft limit. There are four types of grace periods:

  • user block soft limit

  • user inode soft limit

  • group block soft limit

  • group inode soft limit

The grace period applies to all users. For example, the user block soft limit grace period applies to all users who are using a block quota.

Soft limit -- The grace timer is started once the soft limit is exceeded. At this point, the user/group can still allocate blocks and inodes. When the grace time expires, if the user is still above the soft limit, the soft limit becomes a hard limit and the user/group can no longer allocate any new blocks or inodes. The user/group should then delete files to get back under the soft limit. The soft limit MUST be smaller than the hard limit. If the soft limit is not needed, it should be set to zero (0).

Hard limit -- Block or inode allocation will fail with EDQUOT(i.e. quota exceeded) when the hard limit is reached. The hard limit is the absolute limit. When a grace period is set, one can exceed the soft limit within the grace period if under the hard limit.

Due to the distributed nature of a Lustre file system and the need to maintain performance under load, those quota parameters may not be 100% accurate. The quota settings can be manipulated via the lfs command, executed on a client, which includes several options to work with quotas:

  • quota -- displays general quota information (disk usage and limits)

  • setquota -- specifies quota limits and tunes the grace period. By default, the grace period is one week.

Usage:

lfs quota [-q] [-v] [-h] [-o obd_uuid] [-u|-g uname|uid|gname|gid] /mount_point
lfs quota -t -u|-g /mount_point
lfs setquota -u|--user|-g|--group username|groupname [-b block-softlimit] \
             [-B block_hardlimit] [-i inode_softlimit] \
             [-I inode_hardlimit] /mount_point

To display general quota information (disk usage and limits) for the user running the command and his primary group, run:

$ lfs quota /mnt/testfs

To display general quota information for a specific user (" bob" in this example), run:

$ lfs quota -u bob /mnt/testfs

To display general quota information for a specific user (" bob" in this example) and detailed quota statistics for each MDT and OST, run:

$ lfs quota -u bob -v /mnt/testfs

To display general quota information for a specific group (" eng" in this example), run:

$ lfs quota -g eng /mnt/testfs

To display block and inode grace times for user quotas, run:

$ lfs quota -t -u /mnt/testfs

To set user or group quotas for a specific ID ("bob" in this example), run:

$ lfs setquota -u bob -b 307200 -B 309200 -i 10000 -I 11000 /mnt/testfs

In this example, the block soft limit for user "bob" is set to 300 MB (307200 KB), the block hard limit to 309200 KB, the inode soft limit to 10,000 files, and the inode hard limit to 11,000 files.

The quota command displays the quota allocated and consumed by each Lustre target. Using the previous setquota example, running this lfs quota command:

$ lfs quota -u bob -v /mnt/testfs

displays this command output:

Disk quotas for user bob (uid 6000):
Filesystem          kbytes quota limit grace files quota limit grace
/mnt/testfs         0      30720 30920 -     0     10000 11000 -
testfs-MDT0000_UUID 0      -      8192 -     0     -     2560  -
testfs-OST0000_UUID 0      -      8192 -     0     -     0     -
testfs-OST0001_UUID 0      -      8192 -     0     -     0     -
Total allocated inode limit: 2560, total allocated block limit: 24576

Global quota limits are stored in dedicated index files (there is one such index per quota type) on the quota master target (aka QMT). The QMT runs on MDT0000 and exports the global indexes via /proc. The global indexes can thus be dumped via the following command:

# lctl get_param qmt.testfs-QMT0000.*.glb-*

The format of global indexes depends on the OSD type. The ldiskfs OSD uses IAM files, while the ZFS OSD creates dedicated ZAPs.

Each slave also stores a copy of this global index locally. When the global index is modified on the master, a glimpse callback is issued on the global quota lock to notify all slaves that the global index has been modified. This glimpse callback includes information about the identifier subject to the change. If the global index on the QMT is modified while a slave is disconnected, the index version is used to determine whether the slave copy of the global index is out of date. If so, the slave fetches the whole index again and updates the local copy. The slave copy of the global index is also exported via /proc and can be accessed via the following command:

lctl get_param osd-*.*.quota_slave.limit*

Note

Prior to release 2.4, global quota limits used to be stored in administrative quota files using the on-disk format of the Linux quota file. When upgrading MDT0000 to release 2.4, those administrative quota files are converted into IAM indexes automatically, preserving existing quota limits previously set by the administrator.

21.4.  Quota Allocation

In a Lustre file system, quota must be properly allocated or users may experience unnecessary failures. The file system block quota is divided up among the OSTs within the file system. Each OST requests an allocation which is increased up to the quota limit. The quota allocation is then quantized to reduce the amount of quota-related request traffic.

The Lustre quota system distributes quotas from the Quota Master Target (aka QMT). Only one QMT instance is supported for now and it only runs on the same node as MDT0000. All OSTs and MDTs set up a Quota Slave Device (aka QSD) which connects to the QMT to allocate/release quota space. The QSD is set up directly from the OSD layer.

To reduce quota requests, quota space is initially allocated to QSDs in very large chunks. How much unused quota space can be held by a target is controlled by the qunit size. When quota space for a given ID is close to exhaustion on the QMT, the qunit size is reduced and QSDs are notified of the new qunit size value via a glimpse callback. Slaves are then responsible for releasing quota space above the new qunit value. The qunit size isn't shrunk indefinitely and there is a minimal value of 1MB for blocks and 1,024 for inodes. This means that the quota space rebalancing process will stop when this minimum value is reached. As a result, an out-of-quota error can be returned while many slaves still have 1MB or 1,024 inodes of spare quota space.

If we look at the setquota example again, running this lfs quota command:

# lfs quota -u bob -v /mnt/testfs

displays this command output:

Disk quotas for user bob (uid 500):
Filesystem          kbytes quota limit grace       files  quota limit grace
/mnt/testfs         30720* 30720 30920 6d23h56m44s 10101* 10000 11000
6d23h59m50s
testfs-MDT0000_UUID 0      -     0     -           10101  -     10240
testfs-OST0000_UUID 0      -     1024  -           -      -     -
testfs-OST0001_UUID 30720* -     29896 -           -      -     -
Total allocated inode limit: 10240, total allocated block limit: 30920

The total quota limit of 30,920 is allocated to user bob, which is further distributed to two OSTs.

Values appended with ' *' show that the quota limit has been exceeded, causing the following error when trying to write or create a file:

$ cp: writing `/mnt/testfs/foo`: Disk quota exceeded.

Note

It is very important to note that the block quota is consumed per OST and the inode quota per MDS. Therefore, when the quota is consumed on one OST (resp. MDT), the client may not be able to create files regardless of the quota available on other OSTs (resp. MDTs).

Setting the quota limit below the minimal qunit size may prevent the user/group from creating any files. It is thus recommended to use soft/hard limits which are a multiple of the number of OSTs times the minimal qunit size.

To determine the total number of inodes, use lfs df -i (and also lctl get_param *.*.filestotal). For more information on using the lfs df -i command and the command output, see Section 18.5.1, “Checking File System Free Space”.
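
For example, on a client (the mount point is illustrative):

$ lfs df -i /mnt/testfs
$ lctl get_param *.*.filestotal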

Unfortunately, the statfs interface does not report the free inode count directly, but instead reports the total inode and used inode counts. The free inode count is calculated for df from (total inodes - used inodes). It is not critical to know the total inode count for a file system. Instead, you should know (accurately), the free inode count and the used inode count for a file system. The Lustre software manipulates the total inode count in order to accurately report the other two values.

21.5.  Quotas and Version Interoperability

The new quota protocol introduced in Lustre software release 2.4.0 is not compatible with previous versions. As a consequence, all Lustre servers must be upgraded to release 2.4.0 for quota to be functional. Quota limits set on the Lustre file system prior to the upgrade will be automatically migrated to the new quota index format. As for accounting information with the ldiskfs backend, it will be regenerated by running tunefs.lustre --quota against all targets. It is worth noting that running tunefs.lustre --quota is mandatory for all targets formatted with a Lustre software release older than release 2.4.0, otherwise quota enforcement as well as accounting won't be functional.

In addition, the quota protocol in release 2.4 assumes that the Lustre client supports the OBD_CONNECT_EINPROGRESS connect flag. Clients supporting this flag will retry indefinitely when the server returns EINPROGRESS in a reply. Here is the list of Lustre client versions which are compatible with release 2.4:

  • Release 2.3-based clients and later

  • Release 1.8 clients newer or equal to release 1.8.9-wc1

  • Release 2.1 clients newer or equal to release 2.1.4

21.6.  Granted Cache and Quota Limits

In a Lustre file system, granted cache does not respect quota limits. In this situation, OSTs grant cache to a Lustre client to accelerate I/O. Granting cache causes writes to succeed on the OSTs even if they exceed the quota limits, overrunning those limits.

The sequence is:

  1. A user writes files to the Lustre file system.

  2. If the Lustre client has enough granted cache, then it returns 'success' to users and arranges the writes to the OSTs.

  3. Because Lustre clients have delivered success to users, the OSTs cannot fail these writes.

Because of granted cache, writes can overrun quota limits. For example, if you set a 400 GB quota on user A and use IOR to write for user A from a bundle of clients, you will write much more data than 400 GB, and cause an out-of-quota error (EDQUOT).

Note

The effect of granted cache on quota limits can be mitigated, but not eradicated. Reduce the maximum amount of dirty data on the clients (minimal value is 1MB):

  • lctl set_param osc.*.max_dirty_mb=8

21.7.  Lustre Quota Statistics

The Lustre software includes statistics that monitor quota activity, such as the kinds of quota RPCs sent during a specific period, the average time to complete the RPCs, etc. These statistics are useful to measure performance of a Lustre file system.

Each quota statistic consists of a quota event and min_time, max_time and sum_time values for the event.

The quota events are described below:

  • sync_acq_req -- Quota slaves send an acquiring_quota request and wait for its return.

  • sync_rel_req -- Quota slaves send a releasing_quota request and wait for its return.

  • async_acq_req -- Quota slaves send an acquiring_quota request and do not wait for its return.

  • async_rel_req -- Quota slaves send a releasing_quota request and do not wait for its return.

  • wait_for_blk_quota (lquota_chkquota) -- Before data is written to OSTs, the OSTs check if the remaining block quota is sufficient. This is done in the lquota_chkquota function.

  • wait_for_ino_quota (lquota_chkquota) -- Before files are created on the MDS, the MDS checks if the remaining inode quota is sufficient. This is done in the lquota_chkquota function.

  • wait_for_blk_quota (lquota_pending_commit) -- After blocks are written to OSTs, relative quota information is updated. This is done in the lquota_pending_commit function.

  • wait_for_ino_quota (lquota_pending_commit) -- After files are created, relative quota information is updated. This is done in the lquota_pending_commit function.

  • wait_for_pending_blk_quota_req (qctxt_wait_pending_dqacq) -- On the MDS or OSTs, there is one thread sending a quota request for a specific UID/GID for block quota at any time. At that time, if other threads need to do this too, they should wait. This is done in the qctxt_wait_pending_dqacq function.

  • wait_for_pending_ino_quota_req (qctxt_wait_pending_dqacq) -- On the MDS, there is one thread sending a quota request for a specific UID/GID for inode quota at any time. If other threads need to do this too, they should wait. This is done in the qctxt_wait_pending_dqacq function.

  • nowait_for_pending_blk_quota_req (qctxt_wait_pending_dqacq) -- On the MDS or OSTs, there is one thread sending a quota request for a specific UID/GID for block quota at any time. When threads enter qctxt_wait_pending_dqacq, they do not need to wait. This is done in the qctxt_wait_pending_dqacq function.

  • nowait_for_pending_ino_quota_req (qctxt_wait_pending_dqacq) -- On the MDS, there is one thread sending a quota request for a specific UID/GID for inode quota at any time. When threads enter qctxt_wait_pending_dqacq, they do not need to wait. This is done in the qctxt_wait_pending_dqacq function.

  • quota_ctl -- The quota_ctl statistic is generated when lfs setquota, lfs quota and so on, are issued.

  • adjust_qunit -- Each time qunit is adjusted, it is counted.

21.7.1. Interpreting Quota Statistics

Quota statistics are an important measure of the performance of a Lustre file system. Interpreting these statistics correctly can help you diagnose problems with quotas, and may indicate adjustments to improve system performance.

For example, if you run this command on the OSTs:

lctl get_param lquota.testfs-OST0000.stats

You will get a result similar to this:

snapshot_time                                1219908615.506895 secs.usecs
async_acq_req                              1 samples [us]  32 32 32
async_rel_req                              1 samples [us]  5 5 5
nowait_for_pending_blk_quota_req(qctxt_wait_pending_dqacq) 1 samples [us] 2\
 2 2
quota_ctl                          4 samples [us]  80 3470 4293
adjust_qunit                               1 samples [us]  70 70 70
....

In the first line, snapshot_time indicates when the statistics were taken. The remaining lines list the quota events and their associated data.

In the second line, the async_acq_req event occurs one time. The min_time, max_time and sum_time statistics for this event are 32, 32 and 32, respectively. The unit is microseconds (μs).

In the fifth line, the quota_ctl event occurs four times. The min_time, max_time and sum_time statistics for this event are 80, 3470 and 4293, respectively. The unit is microseconds (μs).

Introduced in Lustre 2.5

Chapter 22. Hierarchical Storage Management (HSM)

This chapter describes how to bind Lustre to a Hierarchical Storage Management (HSM) solution.

22.1.  Introduction

The Lustre file system can bind to a Hierarchical Storage Management (HSM) solution using a specific set of functions. These functions enable connecting a Lustre file system to one or more external storage systems, typically HSMs. With a Lustre file system bound to a HSM solution, the Lustre file system acts as a high speed cache in front of these slower HSM storage systems.

The Lustre file system integration with HSM provides a mechanism for files to simultaneously exist in a HSM solution and have a metadata entry in the Lustre file system that can be examined. Reading, writing or truncating the file will trigger the file data to be fetched from the HSM storage back into the Lustre file system.

The process of copying a file into the HSM storage is known as archive. Once the archive is complete, the Lustre file data can be deleted (known as release.) The process of returning data from the HSM storage to the Lustre file system is called restore. The archive and restore operations require a Lustre file system component called an Agent.

An Agent is a specially designed Lustre client node that mounts the Lustre file system in question. On an Agent, a user space program called a copytool is run to coordinate the archive and restore of files between the Lustre file system and the HSM solution.

Requests to restore a given file are registered and dispatched by a facet on the MDT called the Coordinator.

Figure 22.1. Overview of the Lustre file system HSM

Overview of the Lustre file system HSM


22.2.  Setup

22.2.1.  Requirements

To setup a Lustre/HSM configuration you need:

  • a standard Lustre file system (version 2.5.0 and above)

  • a minimum of 2 clients, 1 used for your chosen computation task that generates useful data, and 1 used as an agent.

Multiple agents can be employed. All the agents need to share access to their backend storage. For the POSIX copytool, a POSIX namespace like NFS or another Lustre file system is suitable.

22.2.2.  Coordinator

To bind a Lustre file system to a HSM system, a coordinator must be activated on each of your file system MDTs. This can be achieved with the command:

$ lctl set_param mdt.$FSNAME-MDT0000.hsm_control=enabled
mdt.lustre-MDT0000.hsm_control=enabled

To verify that the coordinator is running correctly, run:

$ lctl get_param mdt.$FSNAME-MDT0000.hsm_control
mdt.lustre-MDT0000.hsm_control=enabled

22.2.3.  Agents

Once a coordinator is started, launch the copytool on each agent node to connect to your HSM storage. If your HSM storage has POSIX access this command will be of the form:

lhsmtool_posix --daemon --hsm-root $HSMPATH --archive=1 $LUSTREPATH

The POSIX copytool must be stopped by sending it a TERM signal.

22.3.  Agents and copytool

Agents are Lustre file system clients running copytool. copytool is a userspace daemon that transfers data between Lustre and a HSM solution. Because different HSM solutions use different APIs, copytools can typically only work with a specific HSM. Only one copytool can be run by an agent node.

The following rule applies regarding copytool instances: a Lustre file system only supports a single copytool process, per ARCHIVE ID (see below), per client node. Due to a Lustre software limitation, this constraint is irrespective of the number of Lustre file systems mounted by the Agent.

Bundled with Lustre tools, the POSIX copytool can work with any HSM or external storage that exports a POSIX API.

22.3.1.  Archive ID, multiple backends

A Lustre file system can be bound to several different HSM solutions. Each bound HSM solution is identified by a number referred to as ARCHIVE ID. A unique value of ARCHIVE ID must be chosen for each bound HSM solution. ARCHIVE ID must be in the range 1 to 32.

A Lustre file system supports an unlimited number of copytool instances. You need, at least, one copytool per ARCHIVE ID. When using the POSIX copytool, this ID is defined using the --archive switch.

For example: if a single Lustre file system is bound to 2 different HSMs (A and B,) ARCHIVE ID “1” can be chosen for HSM A and ARCHIVE ID “2” for HSM B. If you start 3 copytool instances for ARCHIVE ID 1, all of them will use Archive ID “1”. The same rule applies for copytool instances dealing with the HSM B, using Archive ID “2”.

When issuing HSM requests, you can use the --archive switch to choose the backend you want to use. In this example, file foo will be archived into backend ARCHIVE ID “5”:

$ lfs hsm_archive --archive=5 /mnt/lustre/foo

A default ARCHIVE ID can be defined which will be used when the --archive switch is not specified:

$ lctl set_param -P mdt.lustre-MDT0000.hsm.default_archive_id=5

The ARCHIVE ID of archived files can be checked using lfs hsm_state command:

$ lfs hsm_state /mnt/lustre/foo
/mnt/lustre/foo: (0x00000009) exists archived, archive_id:5

22.3.2.  Registered agents

A Lustre file system allocates a unique UUID per client mount point, for each filesystem. Only one copytool can be registered for each Lustre mount point. As a consequence, the UUID uniquely identifies a copytool, per filesystem.

The currently registered copytool instances (agents UUID) can be retrieved by running the following command, per MDT, on MDS nodes:

$ lctl get_param -n mdt.$FSNAME-MDT0000.hsm.agents
uuid=a19b2416-0930-fc1f-8c58-c985ba5127ad archive_id=1 requests=[current:0 ok:0 errors:0]

The returned fields have the following meaning:

  • uuid the client mount used by the corresponding copytool.

  • archive_id comma-separated list of ARCHIVE IDs accessible by this copytool.

  • requests various statistics on the number of requests processed by this copytool.

22.3.3.  Timeout

One or more copytool instances may experience conditions that cause them to become unresponsive. To avoid blocking access to the related files, a timeout value is defined for request processing. A copytool must be able to fully complete a request within this time. The default is 3600 seconds.

$ lctl set_param mdt.lustre-MDT0000.hsm.active_request_timeout=3600

22.4.  Requests

Data management between a Lustre file system and HSM solutions is driven by requests. There are five types:

  • ARCHIVE Copy data from a Lustre file system file into the HSM solution.

  • RELEASE Remove file data from the Lustre file system.

  • RESTORE Copy back data from the HSM solution into the corresponding Lustre file system file.

  • REMOVE Delete the copy of the data from the HSM solution.

  • CANCEL Cancel an in-progress or pending request.

Only the RELEASE operation is performed synchronously and does not involve the coordinator. Other requests are handled by the coordinators; each MDT coordinator manages its requests resiliently.

22.4.1.  Commands

Requests are submitted using lfs command:

$ lfs hsm_archive [--archive=ID] FILE1 [FILE2...]
$ lfs hsm_release FILE1 [FILE2...]
$ lfs hsm_restore FILE1 [FILE2...]
$ lfs hsm_remove  FILE1 [FILE2...]

Requests are sent to the default ARCHIVE ID unless an ARCHIVE ID is specified with the --archive option (See Section 22.3.1, “ Archive ID, multiple backends ”).
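
For example, a hedged sketch (the file name and archive ID are illustrative) that archives a file to backend 1, releases its Lustre data, and then checks its HSM state; the release will only succeed once the archive operation has completed:

$ lfs hsm_archive --archive=1 /mnt/lustre/file1
$ lfs hsm_release /mnt/lustre/file1
$ lfs hsm_state /mnt/lustre/file1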

22.4.2.  Automatic restore

Released files are automatically restored when a process tries to read or modify them. The corresponding I/O will block waiting for the file to be restored. This is transparent to the process. For example, the following command automatically restores the file if released.

$ cat /mnt/lustre/released_file

22.4.3.  Request monitoring

The list of registered requests and their status can be monitored, per MDT, with the following command:

$ lctl get_param -n mdt.lustre-MDT0000.hsm.actions

The list of requests currently being processed by a copytool is available with:

$ lctl get_param -n mdt.lustre-MDT0000.hsm.active_requests

22.5.  File states

When files are archived or released, their state in the Lustre file system changes. This state can be read using the following lfs command:

$ lfs hsm_state FILE1 [FILE2...]

There is also a list of specific policy flags which can be set to apply a per-file policy:

  • NOARCHIVE This file will never be archived.

  • NORELEASE This file will never be released. This value cannot be set if the flag is currently set to RELEASED.

  • DIRTY This file has been modified since a copy of it was made in the HSM solution. DIRTY files should be archived again. The DIRTY flag can only be set if EXIST is set.

The following options can only be set by the root user.

  • LOST This file was previously archived, but the copy was lost in the HSM backend for some reason (for example, due to a corrupted tape) and cannot be restored. If the file is not in the RELEASED state, it needs to be archived again. If the file is in the RELEASED state, the file data is lost.

Some flags can be manually set or cleared using the following commands:

$ lfs hsm_set [FLAGS] FILE1 [FILE2...]
$ lfs hsm_clear [FLAGS] FILE1 [FILE2...]
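
For example, a hedged sketch (the --noarchive option name and the file path are illustrative of the FLAGS placeholder above) that prevents a file from ever being archived and then clears that flag:

$ lfs hsm_set --noarchive /mnt/lustre/file1
$ lfs hsm_clear --noarchive /mnt/lustre/file1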

22.6.  Tuning

22.6.1.  hsm_control

hsm_control controls coordinator activity and can also purge the action list.

$ lctl set_param mdt.$FSNAME-MDT0000.hsm_control=purge

Possible values are:

  • enabled Start coordinator thread. Requests are dispatched on available copytool instances.

  • disabled Pause coordinator activity. No new request will be scheduled. No timeout will be handled. New requests will be registered but will be handled only when the coordinator is enabled again.

  • shutdown Stop coordinator thread. No request can be submitted.

  • purge Clear all recorded requests. Do not change coordinator state.

22.6.2.  max_requests

max_requests is the maximum number of active requests at the same time. This is a per coordinator value, and independent of the number of agents.

For example, if 2 MDT and 4 agents are present, the agents will never have to handle more than 2 x max_requests.

$ lctl set_param mdt.$FSNAME-MDT0000.hsm.max_requests=10

22.6.3.  policy

Change system behavior. Values can be added or removed by prefixing them with '+' or '-'.

$ lctl set_param mdt.$FSNAME-MDT0000.hsm.policy=+NRA

Possible values are a combination of:

  • NRA No Retry Action. If a restore fails, do not reschedule it automatically.

  • NBR Non Blocking Restore. No automatic restore is triggered. Access to a released file returns ENODATA.

22.6.4.  grace_delay

grace_delay is the delay, expressed in seconds, before a successful or failed request is cleared from the whole request list.

$ lctl set_param mdt.$FSNAME-MDT0000.hsm.grace_delay=10

22.7.  change logs

A changelog record type “HSM” was added for Lustre file system logs that relate to HSM events.

16HSM   13:49:47.469433938 2013.10.01 0x280 t=[0x200000400:0x1:0x0]

Two items of information are available for each HSM record: the FID of the modified file and a bit mask. The bit mask codes the following information (lowest bits first):

  • Error code, if any (7 bits)

  • HSM event (3 bits)

    • HE_ARCHIVE = 0 File has been archived.

    • HE_RESTORE = 1 File has been restored.

    • HE_CANCEL = 2 A request for this file has been canceled.

    • HE_RELEASE = 3 File has been released.

    • HE_REMOVE = 4 A remove request has been executed automatically.

    • HE_STATE = 5 File flags have changed.

  • HSM flags (3 bits)

    • CLF_HSM_DIRTY=0x1

In the above example, 0x280 means the error code is 0 and the event is HE_STATE.

When using liblustreapi, there is a list of helper functions to easily extract the different values from this bit mask, such as hsm_get_cl_event(), hsm_get_cl_flags(), and hsm_get_cl_error().

22.8.  Policy engine

A Lustre file system does not have an internal component that automatically schedules archive and release requests under conditions such as low free space. Automatically scheduling archive operations is the role of the policy engine.

It is recommended that the policy engine run on a dedicated client, similar to an agent node, running Lustre version 2.5 or later.

A policy engine is a userspace program using the Lustre file system HSM specific API to monitor the file system and schedule requests.

Robinhood is the recommended policy engine.

22.8.1.  Robinhood

Robinhood is a policy engine and reporting tool for large file systems. It maintains a replica of file system metadata in a database that can be queried at will. Robinhood makes it possible to schedule mass actions on file system entries by defining attribute-based policies, provides enhanced clones of the find and du tools, and gives administrators an overall view of file system content through a web interface and command line tools.

Robinhood can be used for various configurations. Robinhood is an external project, and further information can be found on the project website: https://sourceforge.net/apps/trac/robinhood/wiki/Doc.

Introduced in Lustre 2.9

Chapter 23. Mapping UIDs and GIDs with Nodemap

This chapter describes how to map UID and GIDs across a Lustre file system using the nodemap feature, and includes the following sections:

23.1. Setting a Mapping

The nodemap feature supported in Lustre 2.9 was first introduced in Lustre 2.7 as a technology preview. It allows UIDs and GIDs from remote systems to be mapped to local sets of UIDs and GIDs while retaining POSIX ownership, permissions and quota information. As a result, multiple sites with conflicting user and group identifiers can operate on a single Lustre file system without creating collisions in UID or GID space.

23.1.1. Defining Terms

When the nodemap feature is enabled, client file system access to a Lustre system is filtered through the nodemap identity mapping policy engine. Lustre connectivity is governed by network identifiers, or NIDs, such as 192.168.7.121@tcp. When an operation is made from a NID, Lustre decides if that NID is part of a nodemap, a policy group consisting of one or more NID ranges. If no policy group exists for that NID, access is squashed to user nobody by default. Each policy group also has several properties, such as trusted and admin, which determine access conditions. A collection of identity maps or idmaps are kept for each policy group. These idmaps determine how UIDs and GIDs on the client are translated into the canonical user space of the local Lustre file system.

In order for nodemap to function properly, the MGS, MDS, and OSS systems must all have a version of Lustre which supports nodemap. Clients operate transparently and do not require special configuration or knowledge of the nodemap setup.

23.1.2. Deciding on NID Ranges

NIDs can be described as either a singleton address or a range of addresses. A single address is described in standard Lustre NID format, such as 10.10.6.120@tcp. A range is described using brackets, with a dash separating the lower and upper bounds, for example 192.168.20.[0-255]@tcp.

The range must be contiguous. The full LNet definition for a nidlist is as follows:

<nidlist>       :== <nidrange> [ ' ' <nidrange> ]
<nidrange>      :== <addrrange> '@' <net>
<addrrange>     :== '*' |
                        <ipaddr_range> |
                        <numaddr_range>
<ipaddr_range>  :==
        <numaddr_range>.<numaddr_range>.<numaddr_range>.<numaddr_range>
<numaddr_range> :== <number> |
                        <expr_list>
<expr_list>     :== '[' <range_expr> [ ',' <range_expr>] ']'
<range_expr>    :== <number> |
                        <number> '-' <number> |
                        <number> '-' <number> '/' <number>
<net>           :== <netname> | <netname><number>
<netname>       :== "lo" | "tcp" | "o2ib" | "gni"
<number>        :== <nonnegative decimal> | <hexadecimal>

23.1.3. Describing and Deploying a Sample Mapping

Deploy nodemap by first considering which users need to be mapped, and what sets of network addresses or ranges are involved. Issues of visibility between users must be examined as well.

Consider a deployment where researchers are working on data relating to birds. The researchers use a computing system which mounts Lustre from a single IPv4 address, 192.168.0.100. Name this policy group BirdResearchSite. The IP address forms the NID 192.168.0.100@tcp. Create the policy group and add the NID to that group on the MGS using the lctl command:

mgs# lctl nodemap_add BirdResearchSite
mgs# lctl nodemap_add_range --name BirdResearchSite --range 192.168.0.100@tcp

Note

A NID cannot be in more than one policy group. Assign a NID to a new policy group by first removing it from the existing group.
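
For example, to move a NID from one policy group to another (the old group name here is hypothetical), first delete the range from the old group and then add it to the new one:

mgs# lctl nodemap_del_range --name OldResearchSite --range 192.168.0.100@tcp
mgs# lctl nodemap_add_range --name BirdResearchSite --range 192.168.0.100@tcp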

The researchers use the following identifiers on their host system:

  • swan (UID 530) member of group wetlands (GID 600)

  • duck (UID 531) member of group wetlands (GID 600)

  • hawk (UID 532) member of group raptor (GID 601)

  • merlin (UID 533) member of group raptor (GID 601)

Assign a set of six idmaps to this policy group, with four for UIDs, and two for GIDs. Pick a starting point, e.g. UID 11000, with room for additional UIDs and GIDs to be added as the configuration grows. Use the lctl command to set up the idmaps:

mgs# lctl nodemap_add_idmap --name BirdResearchSite --idtype uid --idmap 530:11000
mgs# lctl nodemap_add_idmap --name BirdResearchSite --idtype uid --idmap 531:11001
mgs# lctl nodemap_add_idmap --name BirdResearchSite --idtype uid --idmap 532:11002
mgs# lctl nodemap_add_idmap --name BirdResearchSite --idtype uid --idmap 533:11003
mgs# lctl nodemap_add_idmap --name BirdResearchSite --idtype gid --idmap 600:11000
mgs# lctl nodemap_add_idmap --name BirdResearchSite --idtype gid --idmap 601:11001

The parameter 530:11000 assigns a client UID, for example UID 530, to a single canonical UID, such as UID 11000. Each assignment is made individually. There is no method to specify a range 530-533:11000-11003. UID and GID idmaps are assigned separately. There is no implied relationship between the two.

Files created on the Lustre file system from the 192.168.0.100@tcp NID using UID duck and GID wetlands are stored in the Lustre file system using the canonical identifiers, in this case UID 11001 and GID 11000. A different NID, if not part of the same policy group, sees its own view of the same file space.

Suppose a previously created project directory exists owned by UID 11002/GID 11001, with mode 770. When users hawk and merlin at 192.168.0.100 place files named hawk-file and merlin-file into the directory, the contents from the 192.168.0.100 client appear as:

[merlin@192.168.0.100 projectsite]$ ls -la
total 34520
drwxrwx--- 2 hawk   raptor     4096 Jul 23 09:06 .
drwxr-xr-x 3 nobody nobody     4096 Jul 23 09:02 ..
-rw-r--r-- 1 hawk   raptor 10240000 Jul 23 09:05 hawk-file
-rw-r--r-- 1 merlin raptor 25100288 Jul 23 09:06 merlin-file

From a privileged view, the canonical owners are displayed:

[root@trustedSite projectsite]# ls -la
total 34520
drwxrwx--- 2 11002 11001     4096 Jul 23 09:06 .
drwxr-xr-x 3 root root     4096 Jul 23 09:02 ..
-rw-r--r-- 1 11002 11001 10240000 Jul 23 09:05 hawk-file
-rw-r--r-- 1 11003 11001 25100288 Jul 23 09:06 merlin-file

If UID 11002 or GID 11001 do not exist on the Lustre MDS or MGS, create them in LDAP or other data sources, or trust clients by setting identity_upcall to NONE. For more information, see Section 35.1, “User/Group Upcall”.

Building a larger and more complex configuration is possible by iterating through the lctl commands above. In short (a consolidated sketch follows the list below):

  1. Create a name for the policy group.

  2. Create a set of NID ranges used by the group.

  3. Define which UID and GID translations need to occur for the group.
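
A minimal sketch that combines these three steps for a second, hypothetical site (all names, NID ranges, and identifiers are illustrative):

mgs# lctl nodemap_add FishResearchSite
mgs# lctl nodemap_add_range --name FishResearchSite --range 192.168.1.[2-10]@tcp
mgs# lctl nodemap_add_idmap --name FishResearchSite --idtype uid --idmap 1200:12000
mgs# lctl nodemap_add_idmap --name FishResearchSite --idtype gid --idmap 800:12000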

23.2. Altering Properties

Privileged users access mapped systems with rights dependent on certain properties, described below. By default, root access is squashed to user nobody, which interferes with most administrative actions.

23.2.1. Managing the Properties

Several properties exist, off by default, which change client behavior: admin, trusted, squash_uid, squash_gid, and deny_unknown.

  • The trusted property permits members of a policy group to see the file system's canonical identifiers. In the above example, UID 11002 and GID 11001 will be seen without translation. This can be utilized when local UID and GID sets already map directly to the specified users.

  • The property admin defines whether root is squashed on the policy group. By default, it is squashed, unless this property is enabled. Coupled with the trusted property, this will allow unmapped access for backup nodes, transfer points, or other administrative mount points.

  • The property deny_unknown denies all access to users not mapped in a particular nodemap. This is useful if a site is concerned about unmapped users accessing the file system in order to satisfy security requirements.

  • The properties squash_uid and squash_gid define the default UID and GID that users will be squashed to if unmapped, unless the deny_unknown flag is set, in which case access will still be denied.

Alter values to either true (1) or false (0) on the MGS:

mgs# lctl nodemap_modify --name BirdAdminSite --property trusted --value 1
mgs# lctl nodemap_modify --name BirdAdminSite --property admin --value 1
mgs# lctl nodemap_modify --name BirdAdminSite --property deny_unknown --value 1
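
The squash_uid and squash_gid values can be changed with the same command; the IDs shown below are illustrative:

mgs# lctl nodemap_modify --name BirdResearchSite --property squash_uid --value 99
mgs# lctl nodemap_modify --name BirdResearchSite --property squash_gid --value 99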

Change values during system downtime to minimize the chance of any ownership or permissions problems if the policy group is active. Although changes can be made live, client caching of data may interfere with modification as there are a few seconds of lead time before the change is distributed.

23.2.2. Mixing Properties

With both the admin and trusted properties set, the policy group has full access to the Lustre file system, as if nodemap were turned off. The administrative site for the Lustre file system needs at least one group with both properties in order to perform maintenance or other administrative tasks.

Warning

MDS systems must be in a policy group with both these properties set to 1. It is recommended to put the MDS in a policy group labeled “TrustedSystems” or some identifier that makes the association clear.

If a policy group has the admin property set, but does not have the property trusted set, root is mapped directly to root, any explicitly specified UID and GID idmaps are honored, and other access is squashed. If root alters ownership to UIDs or GIDs which are locally known from that host but not part of an idmap, root effectively changes ownership of those files to the default squashed UID and GID.

If trusted is set but admin is not, the policy group has full access to the canonical UID and GID sets of the Lustre file system, and root is squashed.

The deny_unknown property, once enabled, prevents unmapped users from accessing the file system. Root access also is denied, if the admin property is off, and root is not part of any mapping.

When nodemaps are modified, the change events are queued and distributed across the cluster. Under normal conditions, these changes can take around ten seconds to propagate. During this distribution window, file access could be made via the old or new nodemap settings. Therefore, it is recommended to save changes for a maintenance window or to deploy them while the mapped nodes are not actively writing to the file system.

23.3. Enabling the Feature

The nodemap feature is simple to enable:

mgs# lctl nodemap_activate 1

Passing the parameter 0 instead of 1 disables the feature again. After deploying the feature, validate that the mappings are intact before offering the file system to be mounted by clients.

Introduced in Lustre 2.8

So far, changes have been made on the MGS. Prior to Lustre 2.9, changes must also be made manually on the MDS systems, and must be deployed manually to OSS servers if quota is enforced, using lctl set_param rather than the lctl nodemap_* commands. Prior to Lustre 2.9, the configuration is also not persistent, so a script which regenerates the mapping must be saved and re-run after every Lustre restart. As an example, use this style to deploy settings on the OSS:

oss# lctl set_param nodemap.add_nodemap=SiteName
oss# lctl set_param nodemap.add_nodemap_range='SiteName 192.168.0.15@tcp'
oss# lctl set_param nodemap.add_nodemap_idmap='SiteName uid 510:1700'
oss# lctl set_param nodemap.add_nodemap_idmap='SiteName gid 612:1702'

In Lustre 2.9 and later, nodemap configuration is saved on the MGS and distributed automatically to MGS, MDS, and OSS nodes, a process which takes approximately ten seconds in normal circumstances.

23.4. Verifying Settings

Use lctl nodemap_info all to list the existing nodemap configuration for easy export. This command acts as a shortcut into the /proc interface for nodemap. Within /proc/fs/lustre/nodemap/ on the Lustre MGS, the file active contains a 1 if nodemap is active on the system. Each policy group creates a directory containing the following parameters:

  • admin and trusted each contain a ‘1’ if the values are set, and a ‘0’ otherwise.

  • idmap contains a list of the idmaps for the policy group, while ranges contains a list of NIDs for the group.

  • squash_uid and squash_gid determine what UID and GID users are squashed to if needed.

The expected outputs for the BirdResearchSite in the example above are:

mgs# lctl get_param nodemap.BirdResearchSite.idmap

 [
  { idtype: uid, client_id: 530, fs_id: 11000 },
  { idtype: uid, client_id: 531, fs_id: 11001 },
  { idtype: uid, client_id: 532, fs_id: 11002 },
  { idtype: uid, client_id: 533, fs_id: 11003 },
  { idtype: gid, client_id: 600, fs_id: 11000 },
  { idtype: gid, client_id: 601, fs_id: 11001 }
 ]

 mgs# lctl get_param nodemap.BirdResearchSite.ranges
 [
  { id: 11, start_nid: 192.168.0.100@tcp, end_nid: 192.168.0.100@tcp }
 ]
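
Whether the feature is active can also be checked with the nodemap.active parameter; the output shown is what would be expected when nodemap is enabled:

mgs# lctl get_param nodemap.active
nodemap.active=1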

23.5. Ensuring Consistency

Consistency issues may arise in a nodemap enabled configuration when Lustre clients mount from an unknown NID range, new UIDs and GIDs that were not part of a known map are added, or there are misconfigurations in the rules. Keep in mind the following when activating nodemap on a production system:

  • Creating new policy groups or idmaps on a production system is allowed, but reserve a maintenance window to alter the trusted property to avoid metadata problems.

  • To perform administrative tasks, access the Lustre file system via a policy group with trusted and admin properties set. This prevents the creation of orphaned and squashed files. Granting the admin property without the trusted property is dangerous. The root user on the client may know of UIDs and GIDs that are not present in any idmap. If root alters ownership to those identifiers, the ownership is squashed as a result. For example, tar file extracts may be flipped from an expected UID such as UID 500 to nobody, normally UID 99.

  • To map distinct UIDs at two or more sites onto a single UID or GID on the Lustre file system, create overlapping idmaps and place each site in its own policy group. Each distinct UID may have its own mapping onto the target UID or GID.

  • Introduced in Lustre 2.8

    In Lustre 2.8, changes must be manually kept in a script file to be re-applied after a Lustre reload, and changes must be made on each OSS, MDS, and MGS node, as there is no automatic synchronization between the nodes.

  • If deny_unknown is in effect, it is possible for unmapped users to see dentries which were viewed by a mapped user. This is a result of client caching, and unmapped users will not be able to view any file contents.

  • Nodemap activation status can be checked with lctl nodemap_info, but extra validation is possible. One way of ensuring valid deployment on a production system is to create a fingerprint of known files with specific UIDs and GIDs mapped to a test client. After bringing the Lustre system online after maintenance, the test client can validate the UIDs and GIDs map correctly before the system is mounted in user space.

Introduced in Lustre 2.9

Chapter 24. Configuring Shared-Secret Key (SSK) Security

This chapter describes how to configure Shared-Secret Key security and includes the following sections:

24.1. SSK Security Overview

The SSK feature ensures integrity and data protection for Lustre PtlRPC traffic. Key files containing a shared secret and session-specific attributes are distributed to Lustre hosts. This authorizes Lustre hosts to mount the file system and optionally enables secure data transport, depending on which security flavor is configured. The administrator handles the generation, distribution, and installation of SSK key files; see Section 24.3.1, “Key File Management”.

24.1.1. Key features

SSK provides the following key features:

  • Host-based authentication

  • Data Transport Privacy

    • Encrypts Lustre RPCs

    • Prevents eavesdropping

  • Data Transport Integrity - Keyed-Hashing Message Authentication Code (HMAC)

    • Prevents man-in-the-middle attacks

    • Ensures RPCs cannot be altered undetected

24.2. SSK Security Flavors

SSK is implemented as a Generic Security Services (GSS) mechanism through Lustre's support of the GSS Application Program Interface (GSSAPI). The SSK GSS mechanism supports five flavors that offer varying levels of protection.

Flavors provided:

  • skn - SSK Null (Authentication)

  • ska - SSK Authentication and Integrity for non-bulk RPCs

  • ski - SSK Authentication and Integrity

  • skpi - SSK Authentication, Privacy, and Integrity

  • gssnull - Provides no protection. Used for testing purposes only

The table below describes the security characteristics of each flavor:

Table 24.1. SSK Security Flavor Protections

Protection                       skn    ska    ski    skpi
Required to mount file system    Yes    Yes    Yes    Yes
Provides RPC Integrity           No     Yes    Yes    Yes
Provides RPC Privacy             No     No     No     Yes
Provides Bulk RPC Integrity      No     No     Yes    Yes
Provides Bulk RPC Privacy        No     No     No     Yes


Valid non-GSS flavors include:

null - Provides no protection. This is the default flavor.

plain - Plaintext with a hash on each RPC.

24.2.1. Secure RPC Rules

Secure RPC configuration rules are written to the Lustre log (llog) with the lctl command. Rules are processed with the llog and dictate the security flavor that is used for a particular Lustre network and direction.

Note

Rules take effect in a matter of seconds and impact both existing and new connections.

Rule format:

target.srpc.flavor.network[.direction]=flavor

  • target - This could be the file system name or a specific MDT/OST device name.

  • network - LNet network name of the RPC initiator. For example tcp1 or o2ib0. This can also be the keyword default that applies to all networks otherwise specified.

  • direction - Direction is optional. This could be one of mdt2mdt, mdt2ost, cli2mdt, or cli2ost.

Note

To secure the connection to the MGS use the mgssec=flavor mount option. This is required because security rules are unknown to the initiator until after the MGS connection has been established.

The examples below are for a test Lustre file system named testfs.

24.2.1.1. Defining Rules

Rules can be defined and deleted in any order. The rule with the greatest specificity for a given connection is applied. The fsname.srpc.flavor.default rule is the broadest rule as it applies to all non-MGS connections for the file system in the absence of a more specific rule. You may tailor SSK security to your needs by further specifying a specific target, network, and/or direction.
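
For example, in addition to the file-system-wide rules shown in the steps below, a rule can be scoped to a single target device and connection direction (the device and network names here are illustrative):

mgs# lctl conf_param testfs-MDT0000.srpc.flavor.tcp.cli2mdt=ski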

The following example illustrates an approach to configuring SSK security for an environment consisting of three LNet networks. The requirements for this example are:

  • All non-MGS connections must be authenticated.

  • PtlRPC traffic on LNet network tcp0 must be encrypted.

  • LNet networks tcp1 and o2ib0 are local physically secure networks that require high performance. Do not encrypt PtlRPC traffic on these networks.

  1. Ensure that all non-MGS connections are authenticated and encrypted by default.

    mgs# lctl conf_param testfs.srpc.flavor.default=skpi
  2. Override the file system default security flavor on LNet networks tcp1 and o2ib0 with ska. Security flavor ska provides authentication but without the performance impact of encryption and bulk RPC integrity.

    mgs# lctl conf_param testfs.srpc.flavor.tcp1=ska
    mgs# lctl conf_param testfs.srpc.flavor.o2ib0=ska

Note

Currently the "lctl set_param -P" format does not work with sptlrpc.

24.2.1.2. Listing Rules

To view the Secure RPC Config Rules, enter:

mgs# lctl get_param mgs.*.live.testfs
...
Secure RPC Config Rules:
testfs.srpc.flavor.tcp.cli2mdt=skpi
testfs.srpc.flavor.tcp.cli2ost=skpi
testfs.srpc.flavor.o2ib=ski
...

24.2.1.3. Deleting Rules

To delete a security flavor for an LNet network use the conf_param -d command to delete the flavor for that network:

For example, to delete the testfs.srpc.flavor.o2ib1=ski rule, enter:

mgs# lctl conf_param -d testfs.srpc.flavor.o2ib1

24.3. SSK Key Files

SSK key files are a collection of attributes formatted as fixed length values and stored in a file, which are distributed by the administrator to client and server nodes. Attributes include:

  • Version - Key file schema version number. Not user-defined.

  • Type - A mandatory attribute that denotes the Lustre role of the key file consumer. Valid key types are:

    • mgs - for MGS when the mgssec mount.lustre option is used.

    • server - for MDS and OSS servers

    • client - for clients as well as servers who communicate with other servers in a client context (e.g. MDS communication with OSTs).

  • HMAC algorithm - The Keyed-Hash Message Authentication Code algorithm used for integrity. Valid algorithms are (Default: SHA256):

    • SHA256

    • SHA512

  • Cryptographic algorithm - Cipher for encryption. Valid algorithms are (Default: AES-256-CTR).

    • AES-256-CTR

  • Session security context expiration - Seconds before session contexts generated from key expire and are regenerated (Default: 604800 seconds (7 days)).

  • Shared key length - Shared key length in bits (Default: 256).

  • Prime length - Length of prime (p) in bits used for the Diffie-Hellman Key Exchange (DHKE). (Default: 2048). This is generated only for client keys and can take a while to generate. This value also sets the minimum prime length that servers and MGS will accept from a client. Clients attempting to connect with a prime length less than the minimum will be rejected. In this way servers can guarantee the minimum encryption level that will be permitted.

  • File system name - Lustre File system name for key.

  • MGS NIDs - Comma-separated list of MGS NIDs. Only required when mgssec is used (Default: "").

  • Nodemap name - Nodemap name for key (Default: "default"). See Section 24.5, “Role of Nodemap in SSK”

  • Shared key - Shared secret used by all SSK flavors to provide authentication.

  • Prime (p) - Prime used for the DHKE. This is only used for keys with Type=client.

Note

Key files provide a means to authenticate Lustre connections; always store and transfer key files securely. Key files must not be world writable or they will fail to load.

24.3.1. Key File Management

The lgss_sk utility is used to write, modify, and read SSK key files. lgss_sk can also be used to load key files into the kernel keyring, one at a time. lgss_sk options include:

Table 24.2. lgss_sk Parameters

Parameter         Value      Description
-l|--load         filename   Install key from file into user's session keyring. Must be executed by root.
-m|--modify       filename   Modify a file's key attributes
-r|--read         filename   Show file's key attributes
-w|--write        filename   Generate key file
-c|--crypt        cipher     Cipher for encryption: AES-256-CTR (Default: AES-256-CTR)
-i|--hmac         hash       Hash algorithm for integrity: SHA256 or SHA512 (Default: SHA256)
-e|--expire       seconds    Seconds before contexts from key expire (Default: 604800 (7 days))
-f|--fsname       name       File system name for key
-g|--mgsnids      NID(s)     Comma-separated list of MGS NID(s). Only required when mgssec is used (Default: "")
-n|--nodemap      map        Nodemap name for key (Default: "default")
-p|--prime-bits   length     Prime length (p) for DHKE in bits (Default: 2048)
-t|--type         type       Key type (mgs, server, client)
-k|--key-bits     length     Shared key length in bits (Default: 256)
-d|--data         file       Shared key random data source (Default: /dev/random)
-v|--verbose                 Increase verbosity for errors


24.3.1.1. Writing Key Files

Key files are generated by the lgss_sk utility. Parameters are specified on the command line followed by the --write parameter and the filename to write to. The lgss_sk utility will not overwrite files so the filename must be unique. Mandatory parameters for generating key files are --type, either --fsname or --mgsnids, and --write; all other parameters are optional.

lgss_sk uses /dev/random as the default entropy data source; you may override this with the --data parameter. When no hardware random number generator is available on the system where lgss_sk is executing, you may need to press keys on the keyboard or move the mouse (if directly attached to the system) or cause disk IO (if system is remote), in order to generate entropy for the shared key. It is possible to use /dev/urandom for testing purposes but this may provide less security in some cases.

Example:

To create a server type key file for the testfs Lustre file system for clients in the biology nodemap, enter:

server# lgss_sk -t server -f testfs -n biology \
-w testfs.server.biology.key

24.3.1.2. Modifying Key Files

Key files are modified in the same way they are written: specify on the command line the parameters you want to change. Only the key file attributes associated with the parameters provided are changed; all other attributes remain unchanged.

To modify a key file's Type to client and populate the Prime (p) key attribute, if it is missing, enter:

client# lgss_sk -t client -m testfs.client.biology.key

To add MGS NIDs 192.168.1.101@tcp,10.10.0.101@o2ib to the server key file testfs.server.biology.key and the client key file testfs.client.biology.key, enter:

server# lgss_sk -g 192.168.1.101@tcp,10.10.0.101@o2ib \
-m testfs.server.biology.key

client# lgss_sk -g 192.168.1.101@tcp,10.10.0.101@o2ib \
-m testfs.client.biology.key

To modify the testfs.server.biology.key on the MGS to support MGS connections from biology clients, modify the key file's Type to include mgs in addition to server, enter:

mgs# lgss_sk -t mgs,server -m testfs.server.biology.key

24.3.1.3. Reading Key Files

Read key files with the lgss_sk utility and --read parameter. Read the keys modified in the previous examples:

mgs# lgss_sk -r testfs.server.biology.key
Version:        1
Type:           mgs server
HMAC alg:       SHA256
Crypt alg:      AES-256-CTR
Ctx Expiration: 604800 seconds
Shared keylen:  256 bits
Prime length:   2048 bits
File system:    testfs
MGS NIDs:       192.168.1.101@tcp 10.10.0.101@o2ib
Nodemap name:   biology
Shared key:
  0000: 84d2 561f 37b0 4a58 de62 8387 217d c30a  ..V.7.JX.b..!}..
  0010: 1caa d39c b89f ee6c 2885 92e7 0765 c917  .......l(....e..

client# lgss_sk -r testfs.client.biology.key
Version:        1
Type:           client
HMAC alg:       SHA256
Crypt alg:      AES-256-CTR
Ctx Expiration: 604800 seconds
Shared keylen:  256 bits
Prime length:   2048 bits
File system:    testfs
MGS NIDs:       192.168.1.101@tcp 10.10.0.101@o2ib
Nodemap name:   biology
Shared key:
  0000: 84d2 561f 37b0 4a58 de62 8387 217d c30a  ..V.7.JX.b..!}..
  0010: 1caa d39c b89f ee6c 2885 92e7 0765 c917  .......l(....e..
Prime (p) :
  0000: 8870 c3e3 09a5 7091 ae03 f877 f064 c7b5  .p....p....w.d..
  0010: 14d9 bc54 75f8 80d3 22f9 2640 0215 6404  ...Tu...".&@..d.
  0020: 1c53 ba84 1267 bea2 fb05 37a4 ed2d 5d90  .S...g....7..-].
  0030: 84e3 1a67 67f0 47c7 0c68 5635 f50e 9cf0  ...gg.G..hV5....
  0040: e622 6f53 2627 6af6 9598 eeed 6290 9b1e  ."oS&'j.....b...
  0050: 2ec5 df04 884a ea12 9f24 cadc e4b6 e91d  .....J...$......
  0060: 362f a239 0a6d 0141 b5e0 5c56 9145 6237  6/.9.m.A..\V.Eb7
  0070: 59ed 3463 90d7 1cbe 28d5 a15d 30f7 528b  Y.4c....(..]0.R.
  0080: 76a3 2557 e585 a1be c741 2a81 0af0 2181  v.%W.....A*...!.
  0090: 93cc a17a 7e27 6128 5ebd e0a4 3335 db63  ...z~'a(^...35.c
  00a0: c086 8d0d 89c1 c203 3298 2336 59d8 d7e7  ........2.#6Y...
  00b0: e52a b00c 088f 71c3 5109 ef14 3910 fcf6  .*....q.Q...9...
  00c0: 0fa0 7db7 4637 bb95 75f4 eb59 b0cd 4077  ..}.F7..u..Y..@w
  00d0: 8f6a 2ebd f815 a9eb 1b77 c197 5100 84c0  .j.......w..Q...
  00e0: 3dc0 d75d 40b3 6be5 a843 751a b09c 1b20  =..]@.k..Cu....
  00f0: 8126 4817 e657 b004 06b6 86fb 0e08 6a53  .&H..W........jS

24.3.1.4. Loading Key Files

Key files can be loaded into the kernel keyring with the lgss_sk utility or at mount time with the skpath mount option. The skpath method has the advantage that it accepts a directory path and loads all key files within the directory into the keyring. The lgss_sk utility loads a single key file into the keyring with each invocation. Key files must not be world writable or they will fail to load.

Third party tools can also load the keys if desired. The only caveats are that the key must be available when the request_key upcall to userspace is made, and that the correct key description is used so that the key can be found during the upcall (see Table 24.4, “Key Descriptions”).

Examples:

Load the testfs.server.biology.key key file using lgss_sk, enter:

server# lgss_sk -l testfs.server.biology.key

Use the skpath mount option to load all of the key files in the /secure_directory directory when mounting a storage target, enter:

server# mount -t lustre -o skpath=/secure_directory \
/storage/target /mount/point

Use the skpath mount option to load key files into the keyring on a client, enter:

client# mount -t lustre -o skpath=/secure_directory \
mgsnode:/testfs /mnt/testfs

24.4. Lustre GSS Keyring

The Lustre GSS Keyring binary lgss_keyring is used by SSK to handle the upcall from kernel space into user space via request-key. The purpose of lgss_keyring is to create a token that is passed as part of the security context initialization RPC (SEC_CTX_INIT).

24.4.1. Setup

The Lustre GSS keyring-based flavors use the Linux kernel keyring infrastructure to maintain keys as well as to perform the upcall from kernel space to userspace for key negotiation and establishment. The GSS keyring establishes a key type (see request-key(8)) named lgssc when the Lustre ptlrpc_gss kernel module is loaded. When a security context must be established, the module creates a key and uses the request-key binary in an upcall to establish the key. request-key looks for a configuration file in /etc/request-key.d named after the key type, keytype.conf; for Lustre this is lgssc.conf.

Each node participating in SSK Security must have a /etc/request-key.d/lgssc.conf file that contains the following single line:

create lgssc * * /usr/sbin/lgss_keyring %o %k %t %d %c %u %g %T %P %S

The request-key binary will call lgss_keyring with the arguments following it with their substituted values (see request-key.conf(5)).

24.4.2. Server Setup

Lustre servers do not use the Linux request-key mechanism as clients do. Instead, servers run a daemon that uses a pipefs with the kernel to trigger events based on reads and writes to a file descriptor. The server-side binary is lsvcgssd. It can be executed in the foreground or as a daemon. The parameters for the lsvcgssd binary are listed below; each security flavor (gssnull, krb5, sk) must be enabled explicitly, which ensures that only the required functionality is enabled.

Table 24.3. lsvcgssd Parameters

Parameter   Description
-f          Run in foreground
-n          Do not establish Kerberos credentials
-v          Verbosity
-m          Service MDS
-o          Service OSS
-g          Service MGS
-k          Enable Kerberos support
-s          Enable Shared Key support
-z          Enable gssnull support


A SysV style init script is installed for starting and stopping the lsvcgssd daemon. The init script checks the LSVCGSSDARGS variable in the /etc/sysconfig/lsvcgss configuration file for startup parameters.
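
As a sketch, a configuration that enables SSK support and serves the MDS and MGS could look like the following; the exact flags depend on the roles hosted on the node, as listed in the table above:

mds# cat /etc/sysconfig/lsvcgss
LSVCGSSDARGS="-s -m -g"

mds# systemctl start lsvcgss.service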

Keys during the upcall on the client and handling of an RPC on the server are found by using a specific key description for each key in the kernel keyring.

For each MGS NID there must be a separate key loaded. The format of the key description should be:

Table 24.4. Key Descriptions

Type              Key Description             Example
MGC               lustre:MGCNID               lustre:MGC192.168.1.10@tcp
MDC/OSC/OSP/LWP   lustre:fsname               lustre:testfs
MDT               lustre:fsname:NodemapName   lustre:testfs:biology
OST               lustre:fsname:NodemapName   lustre:testfs:biology
MGS               lustre:MGS                  lustre:MGS


All Lustre keys use the user key type and are attached to the user’s keyring. This is not configurable. Below is an example showing how to list the user’s keyring, load a key file, read the key, and clear the key from the kernel keyring.

client# keyctl show
Session Keyring
  17053352 --alswrv      0     0  keyring: _ses
 773000099 --alswrv      0 65534   \_ keyring: _uid.0

client# lgss_sk -l /secure_directory/testfs.client.key

client# keyctl show
Session Keyring
  17053352 --alswrv      0     0  keyring: _ses
 773000099 --alswrv      0 65534   \_ keyring: _uid.0
1028795127 --alswrv      0     0       \_ user: lustre:testfs

client# keyctl pipe 1028795127 | lgss_sk -r -
Version:        1
Type:           client
HMAC alg:       SHA256
Crypt alg:      AES-256-CTR
Ctx Expiration: 604800 seconds
Shared keylen:  256 bits
Prime length:   2048 bits
File system:    testfs
MGS NIDs:
Nodemap name:   default
Shared key:
  0000: faaf 85da 93d0 6ffc f38c a5c6 f3a6 0408  ......o.........
  0010: 1e94 9b69 cf82 d0b9 880b f173 c3ea 787a  ...i.......s..xz
Prime (p) :
  0000: 9c12 ed95 7b9d 275a 229e 8083 9280 94a0  ....{.'Z".......
  0010: 8593 16b2 a537 aa6f 8b16 5210 3dd5 4c0c  .....7.o..R.=.L.
  0020: 6fae 2729 fcea 4979 9435 f989 5b6e 1b8a  o.')..Iy.5..[n..
  0030: 5039 8db2 3a23 31f0 540c 33cb 3b8e 6136  P9..:#1.T.3.;.a6
  0040: ac18 1eba f79f c8dd 883d b4d2 056c 0501  .........=...l..
  0050: ac17 a4ab 9027 4930 1d19 7850 2401 7ac4  .....'I0..xP$.z.
  0060: 92b4 2151 8837 ba23 94cf 22af 72b3 e567  ..!Q.7.#..".r..g
  0070: 30eb 0cd4 3525 8128 b0ff 935d 0ba3 0fc0  0...5%.(...]....
  0080: 9afa 5da7 0329 3ce9 e636 8a7d c782 6203  ..]..)<..6.}..b.
  0090: bb88 012e 61e7 5594 4512 4e37 e01d bdfc  ....a.U.E.N7....
  00a0: cb1d 6bd2 6159 4c3a 1f4f 1167 0e26 9e5e  ..k.aYL:.O.g.&.^
  00b0: 3cdc 4a93 63f6 24b1 e0f1 ed77 930b 9490  <.J.c.$....w....
  00c0: 25ef 4718 bff5 033e 11ba e769 4969 8a73  %.G....>...iIi.s
  00d0: 9f5f b7bb 9fa0 7671 79a4 0d28 8a80 1ea1  ._....vqy..(....
  00e0: a4df 98d6 e20e fe10 8190 5680 0d95 7c83  ..........V...|.
  00f0: 6e21 abb3 a303 ff55 0aa8 ad89 b8bf 7723  n!.....U......w#

client# keyctl clear @u

client# keyctl show
Session Keyring
  17053352 --alswrv      0     0  keyring: _ses
 773000099 --alswrv      0 65534   \_ keyring: _uid.0

24.4.3. Debugging GSS Keyring

Lustre client and server support several debug levels, which can be seen below.

Debug levels:

  • 0 - Error

  • 1 - Warn

  • 2 - Info

  • 3 - Debug

  • 4 - Trace

To set the debug level on the client use the Lustre parameter:

sptlrpc.gss.lgss_keyring.debug_level

For example to set the debug level to trace, enter:

client# lctl set_param sptlrpc.gss.lgss_keyring.debug_level=4

Server-side verbosity is increased by adding additional verbose flags (-v) to the command line arguments for the daemon. The following command runs the lsvcgssd daemon in the foreground with debug verbosity, supporting gssnull and SSK:

server# lsvcgssd -f -vvv -z -s

lgss_keyring is called as part of the request-key upcall which has no standard output; therefore logging is done through syslog. The server-side logging with lsvcgssd is written to standard output when executing in the foreground and to syslog in daemon mode.

24.4.4. Revoking Keys

The keys loaded with lgss_sk or the skpath mount option, as discussed above, are not revoked once used; they are only used to create valid contexts for client connections. Instead of revoking them, they can be invalidated in one of two ways.

  • Unloading the key from the user keyring on the server will cause new client connections to fail (see the keyctl sketch after this list). If the key is no longer necessary, it can be deleted.

  • Changing the nodemap name for the clients on the servers. Since the nodemap is an integral part of the shared key context instantiation, renaming the nodemap that a group of NIDs belongs to will prevent any new contexts from being established.
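
A minimal sketch of the first approach, unloading a key by ID with keyctl; the key ID shown is taken from the earlier keyring listing and is illustrative:

server# keyctl show @u
server# keyctl unlink 1028795127 @u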

There is currently no mechanism to flush contexts from Lustre. Targets could be unmounted from the servers to purge contexts. Alternatively, a shorter context expiration could be used when the key is created, so that contexts must be refreshed more frequently than the default; depending on the use case, 3600 seconds may be reasonable, so that contexts have to be renegotiated every hour.

24.5. Role of Nodemap in SSK

SSK uses Nodemap (See Chapter 23, Mapping UIDs and GIDs with Nodemap) policy group names and their associated NID range(s) as a mechanism to prevent key file forgery, and to control the range of NIDs on which a given key file can be used.

Clients assume they are in the nodemap specified in the key file they use. When a client instantiates a security context, an upcall is triggered that carries information about that context. From this information, request-key calls lgss_keyring, which in turn looks up the key with description lustre:fsname, or lustre:target_name for the MGC. Using the key found in the user keyring matching the description, the nodemap name is read from the key, hashed with SHA256, and sent to the server.

Servers look up the client’s NID to determine which nodemap the NID is associated with and send the nodemap name to lsvcgssd. The lsvcgssd daemon verifies whether the HMAC equals the nodemap value sent by the client. This prevents forgery and invalidates the key when a client’s NID is not associated with the nodemap name defined on the servers.

It is not required to activate the Nodemap feature in order for SSK to perform client NID to nodemap name lookups.

24.6. SSK Examples

The examples in this section use 1 MGS/MDS (NID 172.16.0.1@tcp), 1 OSS (NID 172.16.0.3@tcp), and 2 clients. The Lustre file system name is testfs.

24.6.1. Securing Client to Server Communications

This example illustrates how to configure SSK to apply Privacy and Integrity protections to client-to-server PtlRPC traffic on the tcp network. Rules that specify a direction, specifically cli2mdt and cli2ost, are used. This permits server-to-server communications to continue using null which is the default flavor for all Lustre connections. This arrangement provides no server-to-server protections, see Section 24.6.3, “Securing Server to Server Communications”.

  1. Create secure directory for storing SSK key files.

    mds# mkdir /secure_directory
    mds# chmod 600 /secure_directory
    oss# mkdir /secure_directory
    oss# chmod 600 /secure_directory
    cli1# mkdir /secure_directory
    cli1# chmod 600 /secure_directory
    cli2# mkdir /secure_directory
    cli2# chmod 600 /secure_directory
  2. Generate a key file for the MDS and OSS servers. Run:

    mds# lgss_sk -t server -f testfs -w \
    /secure_directory/testfs.server.key
  3. Securely copy the /secure_directory/testfs.server.key key file to the OSS.

    mds# scp /secure_directory/testfs.server.key \
    oss:/secure_directory/
  4. Securely copy the /secure_directory/testfs.server.key key file to /secure_directory/testfs.client.key on client1.

    mds# scp /secure_directory/testfs.server.key \
    client1:/secure_directory/testfs.client.key
  5. Modify the key file type to client on client1. This operation also generates a prime number of Prime length to populate the Prime (p) attribute. Run:

    client1# lgss_sk -t client \
    -m /secure_directory/testfs.client.key
  6. Create a /etc/request-key.d/lgssc.conf file on all nodes that contains this line 'create lgssc * * /usr/sbin/lgss_keyring %o %k %t %d %c %u %g %T %P %S' without the single quotes. Run:

    mds# echo create lgssc \* \* /usr/sbin/lgss_keyring %o %k %t %d %c %u %g %T %P %S > /etc/request-key.d/lgssc.conf
    oss# echo create lgssc \* \* /usr/sbin/lgss_keyring %o %k %t %d %c %u %g %T %P %S > /etc/request-key.d/lgssc.conf
    client1# echo create lgssc \* \* /usr/sbin/lgss_keyring %o %k %t %d %c %u %g %T %P %S > /etc/request-key.d/lgssc.conf
    client2# echo create lgssc \* \* /usr/sbin/lgss_keyring %o %k %t %d %c %u %g %T %P %S > /etc/request-key.d/lgssc.conf
  7. Configure the lsvcgss daemon on the MDS and OSS. Set the LSVCGSSDARGS variable in /etc/sysconfig/lsvcgss on the MDS to ‘-s -m’. On the OSS, set the LSVCGSSDARGS variable in /etc/sysconfig/lsvcgss to ‘-s -o’

  8. Start the lsvcgssd daemon on the MDS and OSS. Run:

    mds# systemctl start lsvcgss.service
    oss# systemctl start lsvcgss.service
  9. Mount the MDT and OST with the -o skpath=/secure_directory mount option. The skpath option loads all SSK key files found in the directory into the kernel keyring.

  10. Set client to MDT and client to OST security flavor to SSK Privacy and Integrity, skpi:

    mds# lctl conf_param testfs.srpc.flavor.tcp.cli2mdt=skpi
    mds# lctl conf_param testfs.srpc.flavor.tcp.cli2ost=skpi
  11. Mount the testfs file system on client1 and client2:

    client1# mount -t lustre -o skpath=/secure_directory 172.16.0.1@tcp:/testfs /mnt/testfs
    client2# mount -t lustre -o skpath=/secure_directory 172.16.0.1@tcp:/testfs /mnt/testfs
    mount.lustre: mount 172.16.0.1@tcp:/testfs at /mnt/testfs failed: Connection refused
  12. client2 failed to authenticate because it does not have a valid key file. Repeat steps 4 and 5, substituting client2 for client1, then mount the testfs file system on client2:

    client2# mount -t lustre -o skpath=/secure_directory 172.16.0.1@tcp:/testfs /mnt/testfs
  13. Verify that the mdc and osc connections are using the SSK mechanism and that rpc and bulk security flavors are skpi. See Section 24.7, “Viewing Secure PtlRPC Contexts”.

    Notice the mgc connection to the MGS has no secure PtlRPC security context. This is because skpi security was only specified for client-to-MDT and client-to-OST connections in step 10. The following example details the steps necessary to secure the connection to the MGS.

24.6.2. Securing MGS Communications

This example builds on the previous example.

  1. Enable lsvcgss MGS service support on MGS. Edit /etc/sysconfig/lsvcgss on the MGS and add the (-g) parameter to the LSVCGSSDARGS variable. Restart the lsvcgss service.

  2. Add mgs key type and MGS NIDs to /secure_directory/testfs.server.key on MDS.

    mgs# lgss_sk -t mgs,server -g 172.16.0.1@tcp,172.16.0.2@tcp -m /secure_directory/testfs.server.key
  3. Load the modified key file on the MGS. Run:

    mgs# lgss_sk -l /secure_directory/testfs.server.key
  4. Add MGS NIDs to /secure_directory/testfs.client.key on client, client1.

    client1# lgss_sk -g 172.16.0.1@tcp,172.16.0.2@tcp -m /secure_directory/testfs.client.key
  5. Unmount the testfs file system on client1, then mount with the mgssec=skpi mount option:

    cli1# mount -t lustre -o mgssec=skpi,skpath=/secure_directory 172.16.0.1@tcp:/testfs /mnt/testfs
  6. Verify that client1’s MGC connection is using the SSK mechanism and skpi security flavor. See Section 24.7, “Viewing Secure PtlRPC Contexts”.

24.6.3. Securing Server to Server Communications

This example illustrates how to configure SSK to apply Integrity protection, ski flavor, to MDT-to-OST PtlRPC traffic on the tcp network.

This example builds on the previous example.

  1. Create a Nodemap policy group named LustreServers on the MGS for the Lustre Servers, enter:

    mgs# lctl nodemap_add LustreServers
  2. Add MDS and OSS NIDs to the LustreServers nodemap, enter:

    mgs# lctl nodemap_add_range --name LustreServers --range 172.16.0.[1-3]@tcp
  3. Create key file of type mgs,server for use with nodes in the LustreServers Nodemap range.

    mds# lgss_sk -t mgs,server -f testfs -g \
    172.16.0.1@tcp,172.16.0.2@tcp -n LustreServers -w \
    /secure_directory/testfs.LustreServers.key
  4. Securely copy the /secure_directory/testfs.LustreServers.key key file to the OSS.

    mds# scp /secure_directory/testfs.LustreServers.key oss:/secure_directory/
  5. On the MDS and OSS, copy /secure_directory/testfs.LustreServers.key to /secure_directory/testfs.LustreServers.client.key.

  6. On each server modify the key file type of /secure_directory/testfs.LustreServers.client.key to be of type client. This operation also generates a prime number of Prime length to populate the Prime (p) attribute. Run:

    mds# lgss_sk -t client -m \
    /secure_directory/testfs.LustreServers.client.key
    oss# lgss_sk -t client -m \
    /secure_directory/testfs.LustreServers.client.key
  7. Load the /secure_directory/testfs.LustreServers.key and /secure_directory/testfs.LustreServers.client.key key files into the keyring on the MDS and OSS, enter:

    mds# lgss_sk -l /secure_directory/testfs.LustreServers.key
    mds# lgss_sk -l /secure_directory/testfs.LustreServers.client.key
    oss# lgss_sk -l /secure_directory/testfs.LustreServers.key
    oss# lgss_sk -l /secure_directory/testfs.LustreServers.client.key
  8. Set MDT to OST security flavor to SSK Integrity, ski:

    mds# lctl conf_param testfs.srpc.flavor.tcp.mdt2ost=ski
  9. Verify that the osc and osp connections to the OST have a secure ski security context. See Section 24.7, “Viewing Secure PtlRPC Contexts”.

24.7. Viewing Secure PtlRPC Contexts

From the client (or servers which have mgc, osc, or mdc contexts) you can view information about all users’ contexts and the flavor in use for an import. For users’ contexts (srpc_contexts), SSK and gssnull only support a single root UID, so there should only be one context. The other file in the import (srpc_info) has additional sptlrpc details. The rpc and bulk flavors allow you to verify which security flavor is in use.

client1# lctl get_param *.*.srpc_*
mdc.testfs-MDT0000-mdc-ffff8800da9f0800.srpc_contexts=
ffff8800da9600c0: uid 0, ref 2, expire 1478531769(+604695), fl uptodate,cached,, seq 7, win 2048, key 27a24430(ref 1), hdl 0xf2020f47cbffa93d:0xc23f4df4bcfb7be7, mech: sk
mdc.testfs-MDT0000-mdc-ffff8800da9f0800.srpc_info=
rpc flavor:     skpi
bulk flavor:    skpi
flags:          rootonly,udesc,
id:             3
refcount:       3
nctx:   1
gc internal     3600
gc next 3505
mgc.MGC172.16.0.1@tcp.srpc_contexts=
ffff8800dbb09b40: uid 0, ref 2, expire 1478531769(+604695), fl uptodate,cached,, seq 18, win 2048, key 3e3f709f(ref 1), hdl 0xf2020f47cbffa93b:0xc23f4df4bcfb7be6, mech: sk
mgc.MGC172.16.0.1@tcp.srpc_info=
rpc flavor:     skpi
bulk flavor:    skpi
flags:          -,
id:             2
refcount:       3
nctx:   1
gc internal     3600
gc next 3505
osc.testfs-OST0000-osc-ffff8800da9f0800.srpc_contexts=
ffff8800db9e5600: uid 0, ref 2, expire 1478531770(+604696), fl uptodate,cached,, seq 3, win 2048, key 3f7c1d70(ref 1), hdl 0xf93e61c64b6b415d:0xc23f4df4bcfb7bea, mech: sk
osc.testfs-OST0000-osc-ffff8800da9f0800.srpc_info=
rpc flavor:     skpi
bulk flavor:    skpi
flags:          rootonly,bulk,
id:             6
refcount:       3
nctx:   1
gc internal     3600
gc next 3505

Chapter 25. Managing Security in a Lustre File System

This chapter describes security features of the Lustre file system and includes the following sections:

25.1. Using ACLs

An access control list (ACL) is a set of data that informs an operating system about permissions or access rights that each user or group has to specific system objects, such as directories or files. Each object has a unique security attribute that identifies users who have access to it. The ACL lists each object and user access privileges such as read, write or execute.

25.1.1. How ACLs Work

Implementing ACLs varies between operating systems. Systems that support the Portable Operating System Interface (POSIX) family of standards share a simple yet powerful file system permission model, which should be well-known to the Linux/UNIX administrator. ACLs add finer-grained permissions to this model, allowing for more complicated permission schemes. For a detailed explanation of ACLs on a Linux operating system, refer to the SUSE Labs article, Posix Access Control Lists on Linux:

http://www.suse.de/~agruen/acl/linux-acls/online/

We have implemented ACLs according to this model. The Lustre software works with the standard Linux ACL tools, setfacl, getfacl, and the historical chacl, normally installed with the ACL package.

Note

ACL support is a system-wide feature; either all clients have ACLs enabled or none do. You cannot specify which clients should enable ACLs.

25.1.2. Using ACLs with the Lustre Software

POSIX Access Control Lists (ACLs) can be used with the Lustre software. An ACL consists of file entries representing permissions based on standard POSIX file system object permissions that define three classes of user (owner, group and other). Each class is associated with a set of permissions [read (r), write (w) and execute (x)].

  • Owner class permissions define access privileges of the file owner.

  • Group class permissions define access privileges of the owning group.

  • Other class permissions define access privileges of all users not in the owner or group class.

The ls -l command displays the owner, group, and other class permissions in the first column of its output (for example, -rw-r----- for a regular file with read and write access for the owner class, read access for the group class, and no access for others).

Minimal ACLs have three entries. Extended ACLs have more than the three entries. Extended ACLs also contain a mask entry and may contain any number of named user and named group entries.

The MDS needs to be configured to enable ACLs. Use --mountfsoptions to enable ACLs when creating your configuration:

$ mkfs.lustre --fsname spfs --mountfsoptions=acl --mdt --mgs /dev/sda

Alternately, you can enable ACLs at mount time by using the -o acl option with mount:

$ mount -t lustre -o acl /dev/sda /mnt/mdt

To check ACLs on the MDS:

$ lctl get_param -n mdc.home-MDT0000-mdc-*.connect_flags | grep acl acl

To mount the client with no ACLs:

$ mount -t lustre -o noacl ibmds2@o2ib:/home /home

ACLs are enabled in a Lustre file system on a system-wide basis; either all clients enable ACLs or none do. Activating ACLs is controlled by MDS mount options acl / noacl (enable/disable ACLs). Client-side mount options acl/noacl are ignored. You do not need to change the client configuration, and the 'acl' string will not appear in the client /etc/mtab. The client acl mount option is no longer needed. If a client is mounted with that option, then this message appears in the MDS syslog:

...MDS requires ACL support but client does not

The message is harmless but indicates a configuration issue, which should be corrected.

If ACLs are not enabled on the MDS, then any attempts to reference an ACL on a client return an Operation not supported error.

25.1.3. Examples

These examples are taken directly from the POSIX paper referenced above. ACLs on a Lustre file system work exactly like ACLs on any Linux file system. They are manipulated with the standard tools in the standard manner. Below, we create a directory and allow a specific user access.

[root@client lustre]# umask 027
[root@client lustre]# mkdir rain
[root@client lustre]# ls -ld rain
drwxr-x---  2 root root 4096 Feb 20 06:50 rain
[root@client lustre]# getfacl rain
# file: rain
# owner: root
# group: root
user::rwx
group::r-x
other::---
 
[root@client lustre]# setfacl -m user:chirag:rwx rain
[root@client lustre]# ls -ld rain
drwxrwx---+ 2 root root 4096 Feb 20 06:50 rain
[root@client lustre]# getfacl --omit-header rain
user::rwx
user:chirag:rwx
group::r-x
mask::rwx
other::---

25.2. Using Root Squash

Root squash is a security feature which restricts super-user access rights to a Lustre file system. Without the root squash feature enabled, Lustre file system users on untrusted clients could access or modify files owned by root on the file system, including deleting them. Using the root squash feature restricts file access/modifications as the root user to only the specified clients. Note, however, that this does not prevent users on insecure clients from accessing files owned by other users.

The root squash feature works by re-mapping the user ID (UID) and group ID (GID) of the root user to a UID and GID specified by the system administrator, via the Lustre configuration management server (MGS). The root squash feature also enables the Lustre file system administrator to specify a set of clients for which UID/GID re-mapping does not apply.

25.2.1. Configuring Root Squash

Root squash functionality is managed by two configuration parameters, root_squash and nosquash_nids.

  • The root_squash parameter specifies the UID and GID with which the root user accesses the Lustre file system.

  • The nosquash_nids parameter specifies the set of clients to which root squash does not apply. LNet NID range syntax is used for this parameter (see the NID range syntax rules described in Section 25.2.2, “Enabling and Tuning Root Squash”). For example:

nosquash_nids=172.16.245.[0-255/2]@tcp

In this example, root squash does not apply to TCP clients on subnet 172.16.245.0 that have an even number as the last component of their IP address.

25.2.2. Enabling and Tuning Root Squash

The default value for nosquash_nids is NULL, which means that root squashing applies to all clients. Setting the root squash UID and GID to 0 turns root squash off.
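
For example, using the conf_param syntax shown further below, root squash could be turned off by setting the squash UID and GID to 0 (a sketch; the file system name follows the later examples):

mgs# lctl conf_param testfs.mdt.root_squash="0:0"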

Root squash parameters can be set when the MDT is created (mkfs.lustre --mdt). For example:

mds# mkfs.lustre --reformat --fsname=testfs --mdt --mgs \
       --param "mdt.root_squash=500:501" \
       --param "mdt.nosquash_nids='0@elan1 192.168.1.[10,11]'" /dev/sda1

Root squash parameters can also be changed on an unmounted device with tunefs.lustre. For example:

tunefs.lustre --param "mdt.root_squash=65534:65534"  \
--param "mdt.nosquash_nids=192.168.0.13@tcp0" /dev/sda1

Root squash parameters can also be changed with the lctl conf_param command. For example:

mgs# lctl conf_param testfs.mdt.root_squash="1000:101"
mgs# lctl conf_param testfs.mdt.nosquash_nids="*@tcp"

Note

When using the lctl conf_param command, keep in mind:

  • lctl conf_param must be run on a live MGS

  • lctl conf_param causes the parameter to change on all MDSs

  • lctl conf_param is to be used once per parameter

The nosquash_nids list can be cleared with:

mgs# lctl conf_param testfs.mdt.nosquash_nids="NONE"

- OR -

mgs# lctl conf_param testfs.mdt.nosquash_nids="clear"

If the nosquash_nids value consists of several NID ranges (e.g. 0@elan, 1@elan1), the list of NID ranges must be quoted with single (') or double (") quotation marks. List elements must be separated with a space. For example:

mds# mkfs.lustre ... --param "mdt.nosquash_nids='0@elan1 1@elan2'" /dev/sda1
lctl conf_param testfs.mdt.nosquash_nids="24@elan 15@elan1"

These are examples of incorrect syntax:

mds# mkfs.lustre ... --param "mdt.nosquash_nids=0@elan1 1@elan2" /dev/sda1
lctl conf_param testfs.mdt.nosquash_nids=24@elan 15@elan1

To check root squash parameters, use the lctl get_param command:

mds# lctl get_param mdt.testfs-MDT0000.root_squash
lctl get_param mdt.*.nosquash_nids

Note

An empty nosquash_nids list is reported as NONE.

25.2.3. Tips on Using Root Squash

Lustre configuration management limits root squash in several ways.

  • The lctl conf_param value overwrites the parameter's previous value. If the new value uses an incorrect syntax, then the system continues with the old parameters and the previously-correct value is lost on remount. That is, be careful doing root squash tuning.

  • mkfs.lustre and tunefs.lustre do not perform parameter syntax checking. If the root squash parameters are incorrect, they are ignored on mount and the default values are used instead.

  • Root squash parameters are parsed with rigorous syntax checking. The root_squash parameter should be specified as <decnum>:<decnum>. The nosquash_nids parameter should follow LNet NID range list syntax.

LNet NID range syntax:

<nidlist>       :== <nidrange> [ ' ' <nidrange> ]
<nidrange>      :== <addrrange> '@' <net>
<addrrange>     :== '*' |
                    <ipaddr_range> |
                    <numaddr_range>
<ipaddr_range>  :== <numaddr_range>.<numaddr_range>.<numaddr_range>.<numaddr_range>
<numaddr_range> :== <number> |
                    <expr_list>
<expr_list>     :== '[' <range_expr> [ ',' <range_expr> ] ']'
<range_expr>    :== <number> |
                    <number> '-' <number> |
                    <number> '-' <number> '/' <number>
<net>           :== <netname> | <netname><number>
<netname>       :== "lo" | "tcp" | "o2ib" | "cib" | "openib" | "iib" |
                    "vib" | "ra" | "elan" | "gm" | "mx" | "ptl"
<number>        :== <nonnegative decimal> | <hexadecimal>

Note

For networks using numeric addresses (e.g. elan), the address range must be specified in the <numaddr_range> syntax. For networks using IP addresses, the address range must be in the <ipaddr_range>. For example, if elan is using numeric addresses, 1.2.3.4@elan is incorrect.
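For example, under the grammar above, 2@elan and [1-16/2]@elan3 are valid numeric address ranges, while an IP-based network such as tcp uses the dotted form, e.g. 10.0.[2,3].[1-20]@tcp.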

Part IV. Tuning a Lustre File System for Performance

Part IV describes tools and procedures used to tune a Lustre file system for optimum performance. You will find information in this section about:

Table of Contents

26. Testing Lustre Network Performance (LNet Self-Test)
26.1. LNet Self-Test Overview
26.1.1. Prerequisites
26.2. Using LNet Self-Test
26.2.1. Creating a Session
26.2.2. Setting Up Groups
26.2.3. Defining and Running the Tests
26.2.4. Sample Script
26.3. LNet Self-Test Command Reference
26.3.1. Session Commands
26.3.2. Group Commands
26.3.3. Batch and Test Commands
26.3.4. Other Commands
27. Benchmarking Lustre File System Performance (Lustre I/O Kit)
27.1. Using Lustre I/O Kit Tools
27.1.1. Contents of the Lustre I/O Kit
27.1.2. Preparing to Use the Lustre I/O Kit
27.2. Testing I/O Performance of Raw Hardware (sgpdd-survey)
27.2.1. Tuning Linux Storage Devices
27.2.2. Running sgpdd-survey
27.3. Testing OST Performance (obdfilter-survey)
27.3.1. Testing Local Disk Performance
27.3.2. Testing Network Performance
27.3.3. Testing Remote Disk Performance
27.3.4. Output Files
27.4. Testing OST I/O Performance (ost-survey)
27.5. Testing MDS Performance (mds-survey)
27.5.1. Output Files
27.5.2. Script Output
27.6. Collecting Application Profiling Information (stats-collect)
27.6.1. Using stats-collect
28. Tuning a Lustre File System
28.1. Optimizing the Number of Service Threads
28.1.1. Specifying the OSS Service Thread Count
28.1.2. Specifying the MDS Service Thread Count
28.2. Binding MDS Service Thread to CPU Partitions (Lustre 2.3)
28.3. Tuning LNet Parameters
28.3.1. Transmit and Receive Buffer Size
28.3.2. Hardware Interrupts ( enable_irq_affinity)
28.3.3. Binding Network Interface Against CPU Partitions (Lustre 2.3)
28.3.4. Network Interface Credits
28.3.5. Router Buffers
28.3.6. Portal Round-Robin
28.3.7. LNet Peer Health
28.4. libcfs Tuning (Lustre 2.3)
28.4.1. CPU Partition String Patterns
28.5. LND Tuning
28.6. Network Request Scheduler (NRS) Tuning (Lustre 2.4)
28.6.1. First In, First Out (FIFO) policy
28.6.2. Client Round-Robin over NIDs (CRR-N) policy
28.6.3. Object-based Round-Robin (ORR) policy
28.6.4. Target-based Round-Robin (TRR) policy
28.6.5. Token Bucket Filter (TBF) policy (Lustre 2.6)
28.7. Lockless I/O Tunables
28.8. Server-Side Advice and Hinting (Lustre 2.9)
28.8.1. Overview
28.8.2. Examples
28.9. Large Bulk IO (16MB RPC) (Lustre 2.9)
28.9.1. Overview
28.9.2. Usage
28.10. Improving Lustre I/O Performance for Small Files
28.11. Understanding Why Write Performance is Better Than Read Performance

Chapter 26. Testing Lustre Network Performance (LNet Self-Test)

This chapter describes the LNet self-test, which is used by site administrators to confirm that Lustre Networking (LNet) has been properly installed and configured, and that underlying network software and hardware are performing according to expectations. The chapter includes:

26.1.  LNet Self-Test Overview

LNet self-test is a kernel module that runs over LNet and the Lustre network drivers (LNDs). It is designed to:

  • Test the connection ability of the Lustre network

  • Run regression tests of the Lustre network

  • Test performance of the Lustre network

After you have obtained performance results for your Lustre network, refer to Chapter 28, Tuning a Lustre File System for information about parameters that can be used to tune LNet for optimum performance.

Note

Apart from the performance impact, LNet self-test is invisible to the Lustre file system.

An LNet self-test cluster includes two types of nodes:

  • Console node - A node used to control and monitor an LNet self-test cluster. The console node serves as the user interface of the LNet self-test system and can be any node in the test cluster. All self-test commands are entered from the console node. From the console node, a user can control and monitor the status of the entire LNet self-test cluster (session). The console node is exclusive in that a user cannot control two different sessions from one console node.

  • Test nodes - The nodes on which the tests are run. Test nodes are controlled by the user from the console node; the user does not need to log into them directly.

LNet self-test has two user utilities:

  • lst - The user interface for the self-test console (run on the console node). It provides a list of commands to control the entire test system, including commands to create a session, create test groups, etc.

  • lstclient - The userspace LNet self-test program (run on a test node). The lstclient utility is linked with userspace LNDs and LNet. This utility is not needed if only kernel space LNet and LNDs are used.

Note

Test nodes can be in either kernel or userspace. A console node can invite a kernel test node to join the session by running lst add_group NID, but the console node cannot actively add a userspace test node to the session. A console node can passively accept a test node to the session while the test node is running lstclient to connect to the console node.

26.1.1. Prerequisites

To run LNet self-test, these modules must be loaded on both console nodes and test nodes:

  • libcfs

  • lnet

  • lnet_selftest

  • klnds: A kernel Lustre network driver (LND) (e.g., ksocklnd, ko2iblnd, ...) as needed by your network configuration.

To load the required modules, run:

modprobe lnet_selftest 

This command recursively loads the modules on which LNet self-test depends.

Note

While the console node and kernel test nodes require all the prerequisite modules to be loaded, userspace test nodes do not require these modules.

26.2. Using LNet Self-Test

This section describes how to create and run an LNet self-test. The examples shown are for a test that simulates the traffic pattern of a set of Lustre servers on a TCP network accessed by Lustre clients on an InfiniBand network connected via LNet routers. In this example, half the clients are reading and half the clients are writing.

26.2.1. Creating a Session

A session is a set of processes that run on a test node. Only one session can be run at a time on a test node to ensure that the session has exclusive use of the node. The console node is used to create, change or destroy a session (new_session, end_session, show_session). For more about session parameters, see Section 26.3.1, “Session Commands”.

Almost all operations should be performed within the context of a session. From the console node, a user can only operate nodes in his own session. If a session ends, the session context in all test nodes is stopped.

The following commands set the LST_SESSION environment variable to identify the session on the console node and create a session called read_write:

export LST_SESSION=$$
lst new_session read_write

26.2.2. Setting Up Groups

A group is a named collection of nodes. Any number of groups can exist in a single LNet self-test session. Group membership is not restricted in that a test node can be included in any number of groups.

Each node in a group has a rank, determined by the order in which it was added to the group. The rank is used to establish test traffic patterns.

A user can only control nodes in his/her session. To allocate nodes to the session, the user needs to add nodes to a group (of the session). All nodes in a group can be referenced by the group name. A node can be allocated to multiple groups of a session.

In the following example, three groups are established on a console node:

lst add_group servers 192.168.10.[8,10,12-16]@tcp
lst add_group readers 192.168.1.[1-253/2]@o2ib
lst add_group writers 192.168.1.[2-254/2]@o2ib

These three groups include:

  • Nodes that will function as 'servers' to be accessed by 'clients' during the LNet self-test session

  • Nodes that will function as 'clients' that will simulate reading data from the 'servers'

  • Nodes that will function as 'clients' that will simulate writing data to the 'servers'

Note

A console node can associate kernel space test nodes with the session by running lst add_group NIDs, but a userspace test node cannot be actively added to the session. A console node can passively "accept" a test node to associate with a test session while the test node running lstclient connects to the console node (i.e., lstclient --sesid CONSOLE_NID --group NAME).

26.2.3. Defining and Running the Tests

A test generates a network load between two groups of nodes, a source group identified using the --from parameter and a target group identified using the --to parameter. When a test is running, each node in the --from group simulates a client by sending requests to nodes in the --to group, which are simulating a set of servers, and then receives responses in return. This activity is designed to mimic Lustre file system RPC traffic.

A batch is a collection of tests that are started and stopped together and run in parallel. A test must always be run as part of a batch, even if it is just a single test. Users can only run or stop a test batch, not individual tests.

Tests in a batch are non-destructive to the file system, and can be run in a normal Lustre file system environment (provided the performance impact is acceptable).

A simple batch might contain a single test, for example, to determine whether the network bandwidth presents an I/O bottleneck. In this example, the --to group could be composed of Lustre OSSs and the --from group of compute nodes. A second test could be added to perform pings from a login node to the MDS to see how checkpointing affects the ls -l process.
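As a sketch of such a second test (the group names login and mds are hypothetical and would first need to be created with lst add_group), a ping test can be added to the same batch:

lst add_test --batch bulk_rw --from login --to mds ping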

Two types of tests are available:

  • ping - A ping generates a short request message, which results in a short response. Pings are useful to determine latency and small message overhead and to simulate Lustre metadata traffic.

  • brw - In a brw ('bulk read write') test, data is transferred from the target to the source (brwread) or data is transferred from the source to the target (brwwrite). The size of the bulk transfer is set using the size parameter. A brw test is useful to determine network bandwidth and to simulate Lustre I/O traffic.

In the example below, a batch is created called bulk_rw. Then two brw tests are added. In the first test, 1M of data is sent from the servers to the clients as a simulated read operation with a simple data validation check. In the second test, 4K of data is sent from the clients to the servers as a simulated write operation with a full data validation check.

lst add_batch bulk_rw
lst add_test --batch bulk_rw --from readers --to servers \
  brw read check=simple size=1M
lst add_test --batch bulk_rw --from writers --to servers \
  brw write check=full size=4K

The traffic pattern and test intensity is determined by several properties such as test type, distribution of test nodes, concurrency of test, and RDMA operation type. For more details, see Section 26.3.3, “Batch and Test Commands”.

26.2.4. Sample Script

This sample LNet self-test script simulates the traffic pattern of a set of Lustre servers on a TCP network, accessed by Lustre clients on an InfiniBand network (connected via LNet routers). In this example, half the clients are reading and half the clients are writing.

Run this script on the console node:

#!/bin/bash
export LST_SESSION=$$
lst new_session read/write
lst add_group servers 192.168.10.[8,10,12-16]@tcp
lst add_group readers 192.168.1.[1-253/2]@o2ib
lst add_group writers 192.168.1.[2-254/2]@o2ib
lst add_batch bulk_rw
lst add_test --batch bulk_rw --from readers --to servers \
brw read check=simple size=1M
lst add_test --batch bulk_rw --from writers --to servers \
brw write check=full size=4K
# start running
lst run bulk_rw
# display server stats for 30 seconds
lst stat servers & sleep 30; kill $!
# tear down
lst end_session

Note

This script can be easily adapted to pass the group NIDs by shell variables or command line arguments (making it good for general-purpose use).
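A minimal sketch of such an adaptation, assuming the three NID ranges are passed as command-line arguments (all other steps are unchanged from the script above):

#!/bin/bash
# Usage (hypothetical): ./selftest.sh SERVER_NIDS READER_NIDS WRITER_NIDS
export LST_SESSION=$$
lst new_session read_write
lst add_group servers "$1"     # e.g. 192.168.10.[8,10,12-16]@tcp
lst add_group readers "$2"     # e.g. 192.168.1.[1-253/2]@o2ib
lst add_group writers "$3"     # e.g. 192.168.1.[2-254/2]@o2ib
lst add_batch bulk_rw
lst add_test --batch bulk_rw --from readers --to servers \
  brw read check=simple size=1M
lst add_test --batch bulk_rw --from writers --to servers \
  brw write check=full size=4K
lst run bulk_rw
lst stat servers & sleep 30; kill $!
lst end_session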

26.3. LNet Self-Test Command Reference

The LNet self-test (lst) utility is used to issue LNet self-test commands. The lst utility takes a number of command line arguments. The first argument is the command name and subsequent arguments are command-specific.

26.3.1. Session Commands

This section describes lst session commands.

LST_FEATURES

The lst utility uses the LST_FEATURES environment variable to determine which optional features should be enabled. All features are disabled by default. The supported values for LST_FEATURES are:

  • 1 - Enable the Variable Page Size feature for LNet Selftest.

Example:

export LST_FEATURES=1

LST_SESSION

The lst utility uses the LST_SESSION environment variable to identify the session locally on the self-test console node. This should be a numeric value that uniquely identifies all session processes on the node. It is convenient to set this to the process ID of the shell, both for interactive use and in shell scripts. Almost all lst commands require LST_SESSION to be set.

Example:

export LST_SESSION=$$

new_session [--timeout SECONDS] [--force] SESSNAME

Creates a new session named SESSNAME.

Parameter

Description

--timeout seconds

Console timeout value of the session. The session ends automatically if it remains idle (i.e., no commands are issued) for this period.

--force

Ends conflicting sessions. This determines who 'wins' when one session conflicts with another. For example, if there is already an active session on this node, then the attempt to create a new session fails unless the --force flag is specified. If the --force flag is specified, then the active session is ended. Similarly, if a session attempts to add a node that is already 'owned' by another session, the --force flag allows this session to 'steal' the node.

SESSNAME

A human-readable string to print when listing sessions or reporting session conflicts.

Example:

$ lst new_session --force read_write

end_session

Stops all operations and tests in the current session and clears the session's status.

$ lst end_session

show_session

Shows the session information. This command prints information about the current session. It does not require LST_SESSION to be defined in the process environment.

$ lst show_session

26.3.2. Group Commands

This section describes lst group commands.

add_group name NIDs [NIDs...]

Creates the group and adds a list of test nodes to the group.

Parameter

Description

name

Name of the group.

NIDs

A string that may be expanded to include one or more LNet NIDs.

Example:

$ lst add_group servers 192.168.10.[35,40-45]@tcp
$ lst add_group clients 192.168.1.[10-100]@tcp 192.168.[2,4].\
  [10-20]@tcp

update_group name [--refresh] [--clean status] [--remove NIDs]

Updates the state of nodes in a group or adjusts a group's membership. This command is useful if some nodes have crashed and should be excluded from the group.

Parameter

Description

--refresh

Refreshes the state of all inactive nodes in the group.

--clean status

Removes nodes with a specified status from the group. Status may be:

active

The node is in the current session.

busy

The node is now owned by another session.

down

The node has been marked down.

unknown

The node's status has yet to be determined.

invalid

Any state but active.

--remove NIDs

Removes specified nodes from the group.

Example:

$ lst update_group clients --refresh
$ lst update_group clients --clean busy
$ lst update_group clients --clean invalid // \
  invalid == busy || down || unknown
$ lst update_group clients --remove 192.168.1.[10-20]@tcp

list_group [name] [--active] [--busy] [--down] [--unknown] [--all]

Prints information about a group or lists all groups in the current session if no group is specified.

Parameter

Description

name

The name of the group.

--active

Lists the active nodes.

--busy

Lists the busy nodes.

--down

Lists the down nodes.

--unknown

Lists unknown nodes.

--all

Lists all nodes.

Example:

$ lst list_group
1) clients
2) servers
Total 2 groups
$ lst list_group clients
ACTIVE BUSY DOWN UNKNOWN TOTAL
3 1 2 0 6
$ lst list_group clients --all
192.168.1.10@tcp Active
192.168.1.11@tcp Active
192.168.1.12@tcp Busy
192.168.1.13@tcp Active
192.168.1.14@tcp DOWN
192.168.1.15@tcp DOWN
Total 6 nodes
$ lst list_group clients --busy
192.168.1.12@tcp Busy
Total 1 node

del_group name

Removes a group from the session. If the group is referred to by any test, then the operation fails. If nodes in the group are referred to only by this group, then they are kicked out from the current session; otherwise, they are still in the current session.

$ lst del_group clients

lstclient --sesid NID --group name [--server_mode]

Use lstclient to run the userland self-test client. The lstclient command should be executed after creating a session on the console. There are only two mandatory options for lstclient:

Parameter

Description

--sesid NID

The first console's NID.

--group name

The test group to join.

--server_mode

When included, forces LNet to behave as a server, such as starting an acceptor if the underlying NID needs it or using privileged ports. Only root is allowed to use the --server_mode option.

Example:

Console $ lst new_session testsession
Client1 $ lstclient --sesid 192.168.1.52@tcp --group clients

Example:

Client1 $ lstclient --sesid 192.168.1.52@tcp --group clients --server_mode

26.3.3. Batch and Test Commands

This section describes lst batch and test commands.

add_batch name

A default batch test set named batch is created when the session is started. You can specify a batch name by using add_batch:

$ lst add_batch bulkperf

Creates a batch test called bulkperf.

add_test --batch batchname [--loop loop_count] [--concurrency active_count] [--distribute source_count:sink_count] \
         --from group --to group brw|ping test_options
        

Adds a test to a batch. The parameters are described below.

Parameter

Description

--batch batchname

Names a group of tests for later execution.

--loop loop_count

Number of times to run the test.

--concurrency active_count

The number of requests that are active at one time.

--distribute source_count:sink_count

Determines the ratio of client nodes to server nodes for the specified test. This allows you to specify a wide range of topologies, including one-to-one and all-to-all. Distribution divides the source group into subsets, which are paired with equivalent subsets from the target group so only nodes in matching subsets communicate.

--from group

The source group (test client).

--to group

The target group (test server).

ping

Sends a small request message, resulting in a small reply message. For more details, see Section 26.2.3, “Defining and Running the Tests”. ping does not have any additional options.

brw

Sends a small request message followed by a bulk data transfer, resulting in a small reply message. For more details, see Section 26.2.3, “Defining and Running the Tests”. Options are:

read | write

Read or write. The default is read.

  size=bytes[KM]

I/O size in bytes, kilobytes, or megabytes (e.g., size=1024, size=4K, size=1M). The default is 4 kilobytes.

  check=full|simple

A data validation check (checksum of data). The default is that no check is done.

Examples showing use of the distribute parameter:

Clients: (C1, C2, C3, C4, C5, C6)
Servers: (S1, S2, S3)
--distribute 1:1 (C1->S1), (C2->S2), (C3->S3), (C4->S1), (C5->S2), (C6->S3)  /* -> means test conversation */
--distribute 2:1 (C1,C2->S1), (C3,C4->S2), (C5,C6->S3)
--distribute 3:1 (C1,C2,C3->S1), (C4,C5,C6->S2), (NULL->S3)
--distribute 3:2 (C1,C2,C3->S1,S2), (C4,C5,C6->S3,S1)
--distribute 4:1 (C1,C2,C3,C4->S1), (C5,C6->S2), (NULL->S3)
--distribute 4:2 (C1,C2,C3,C4->S1,S2), (C5,C6->S3,S1)
--distribute 6:3 (C1,C2,C3,C4,C5,C6->S1,S2,S3)

The setting --distribute 1:1 is the default setting where each source node communicates with one target node.

When the setting --distribute 1: n (where n is the size of the target group) is used, each source node communicates with every node in the target group.

Note that if there are more source nodes than target nodes, some source nodes may share the same target nodes. Also, if there are more target nodes than source nodes, some higher-ranked target nodes will be idle.

Example showing a brw test:

$ lst add_group clients 192.168.1.[10-17]@tcp
$ lst add_group servers 192.168.10.[100-103]@tcp
$ lst add_batch bulkperf
$ lst add_test --batch bulkperf --loop 100 --concurrency 4 \
  --distribute 4:2 --from clients --to servers brw WRITE size=16K

In the example above, a batch test called bulkperf is created that will do a 16 KB bulk write request. In this test, the clients are split into two groups of four (sources), with each group writing to a pair of servers (targets) as shown below:

  • 192.168.1.[10-13] will write to 192.168.10.[100,101]

  • 192.168.1.[14-17] will write to 192.168.10.[102,103]

list_batch [name] [--test index] [--active] [--invalid] [--server|client]

Lists batches in the current session or lists client and server nodes in a batch or a test.

Parameter

Description

--test index

Lists tests in a batch. If no option is used, all tests in the batch are listed. If one of these options is used, only the specified tests in the batch are listed:

active

Lists only active batch tests.

invalid

Lists only invalid batch tests.

server | client

Lists client and server nodes in a batch test.

Example:

$ lst list_batch
bulkperf
$ lst list_batch bulkperf
Batch: bulkperf Tests: 1 State: Idle
ACTIVE BUSY DOWN UNKNOWN TOTAL
client 8 0 0 0 8
server 4 0 0 0 4
Test 1(brw) (loop: 100, concurrency: 4)
ACTIVE BUSY DOWN UNKNOWN TOTAL
client 8 0 0 0 8
server 4 0 0 0 4
$ lst list_batch bulkperf --server --active
192.168.10.100@tcp Active
192.168.10.101@tcp Active
192.168.10.102@tcp Active
192.168.10.103@tcp Active

run name

Runs the batch.

$ lst run bulkperf

stop name

Stops the batch.

$ lst stop bulkperf

query name [--test index] [--timeout seconds] [--loop loopcount] [--delay seconds] [--all]

Queries the batch status.

Parameter

Description

--test index

Only queries the specified test. The test index starts from 1.

--timeout seconds

The timeout value to wait for RPC. The default is 5 seconds.

--loop loopcount

The loop count of the query.

--delay seconds

The interval of each query. The default is 5 seconds.

--all

Lists the status of all nodes in a batch or a test.

Example:

$ lst run bulkperf
$ lst query bulkperf --loop 5 --delay 3
Batch is running
Batch is running
Batch is running
Batch is running
Batch is running
$ lst query bulkperf --all
192.168.1.10@tcp Running
192.168.1.11@tcp Running
192.168.1.12@tcp Running
192.168.1.13@tcp Running
192.168.1.14@tcp Running
192.168.1.15@tcp Running
192.168.1.16@tcp Running
192.168.1.17@tcp Running
$ lst stop bulkperf
$ lst query bulkperf
Batch is idle

26.3.4. Other Commands

This section describes other lst commands.

ping [--session] [--group name] [--nodes NIDs] [--batch name] [--server] [--timeout seconds]

Sends a 'hello' query to the nodes.

Parameter

Description

--session

Pings all nodes in the current session.

--group name

Pings all nodes in a specified group.

--nodes NIDs

Pings all specified nodes.

--batch name

Pings all client nodes in a batch.

--server

Sends RPC to all server nodes instead of client nodes. This option is only used with --batch name.

--timeout seconds

The RPC timeout value.

Example:

# lst ping 192.168.1.[15-20]@tcp
192.168.1.15@tcp Active [session: liang id: 192.168.1.3@tcp]
192.168.1.16@tcp Active [session: liang id: 192.168.1.3@tcp]
192.168.1.17@tcp Active [session: liang id: 192.168.1.3@tcp]
192.168.1.18@tcp Busy [session: Isaac id: 192.168.10.10@tcp]
192.168.1.19@tcp Down [session: <NULL> id: LNET_NID_ANY]
192.168.1.20@tcp Down [session: <NULL> id: LNET_NID_ANY]

stat [--bw] [--rate] [--read] [--write] [--max] [--min] [--avg] [--timeout seconds] [--delay seconds] group|NIDs [group|NIDs]

Collects performance and RPC statistics from one or more nodes.

Parameter

Description

--bw

Displays the bandwidth of the specified group/nodes.

--rate

Displays the rate of RPCs of the specified group/nodes.

--read

Displays the read statistics of the specified group/nodes.

--write

Displays the write statistics of the specified group/nodes.

--max

Displays the maximum value of the statistics.

--min

Displays the minimum value of the statistics.

--avg

Displays the average of the statistics.

--timeout seconds

The timeout of the statistics RPC. The default is 5 seconds.

--delay seconds

The interval of the statistics (in seconds).

Example:

$ lst run bulkperf
$ lst stat clients
[LNet Rates of clients]
[W] Avg: 1108 RPC/s Min: 1060 RPC/s Max: 1155 RPC/s
[R] Avg: 2215 RPC/s Min: 2121 RPC/s Max: 2310 RPC/s
[LNet Bandwidth of clients]
[W] Avg: 16.60 MB/s Min: 16.10 MB/s Max: 17.1 MB/s
[R] Avg: 40.49 MB/s Min: 40.30 MB/s Max: 40.68 MB/s

Specifying a group name (group) causes statistics to be gathered for all nodes in a test group. For example:

$ lst stat servers

where servers is the name of a test group created by lst add_group.

Specifying a NID range (NIDs) causes statistics to be gathered for selected nodes. For example:

$ lst stat 192.168.0.[1-100/2]@tcp

Only LNet performance statistics are available. By default, all statistics are displayed; the options listed above can be used to select specific statistics.

show_error [--session] [group|NIDs]...

Lists the number of failed RPCs on test nodes.

Parameter

Description

--session

Lists errors in the current test session. With this option, historical RPC errors are not listed.

Example:

$ lst show_error clients
clients
12345-192.168.1.15@tcp: [Session: 1 brw errors, 0 ping errors] \
  [RPC: 20 errors, 0 dropped,
12345-192.168.1.16@tcp: [Session: 0 brw errors, 0 ping errors] \
  [RPC: 1 errors, 0 dropped, Total 2 error nodes in clients
$ lst show_error --session clients
clients
12345-192.168.1.15@tcp: [Session: 1 brw errors, 0 ping errors]
Total 1 error nodes in clients

Chapter 27. Benchmarking Lustre File System Performance (Lustre I/O Kit)

This chapter describes the Lustre I/O kit, a collection of I/O benchmarking tools for a Lustre cluster. It includes:

27.1.  Using Lustre I/O Kit Tools

The tools in the Lustre I/O Kit are used to benchmark Lustre file system hardware and validate that it is working as expected before you install the Lustre software. They can also be used to validate the performance of the various hardware and software layers in the cluster and to find and troubleshoot I/O issues.

Typically, performance is measured starting with single raw devices and then proceeding to groups of devices. Once raw performance has been established, other software layers are then added incrementally and tested.

27.1.1. Contents of the Lustre I/O Kit

The I/O kit contains three tests, each of which tests a progressively higher layer in the Lustre software stack:

  • sgpdd-survey - Measure basic 'bare metal' performance of devices while bypassing the kernel block device layers, buffer cache, and file system.

  • obdfilter-survey - Measure the performance of one or more OSTs directly on the OSS node or alternately over the network from a Lustre client.

  • ost-survey - Performs I/O against OSTs individually to allow performance comparisons to detect if an OST is performing sub-optimally due to hardware issues.

Typically with these tests, a Lustre file system should deliver 85-90% of the raw device performance.

A utility stats-collect is also provided to collect application profiling information from Lustre clients and servers. See Section 27.6, “Collecting Application Profiling Information (stats-collect)” for more information.

27.1.2. Preparing to Use the Lustre I/O Kit

The following prerequisites must be met to use the tests in the Lustre I/O kit:

  • Password-free remote access to nodes in the system (provided by ssh or rsh).

  • LNet self-test completed to test that Lustre networking has been properly installed and configured. See Chapter 26, Testing Lustre Network Performance (LNet Self-Test).

  • Lustre file system software installed.

  • sg3_utils package providing the sgp_dd tool (sg3_utils is a separate RPM package available online using YUM).

Download the Lustre I/O kit (lustre-iokit) from:

http://downloads.hpdd.intel.com/

27.2. Testing I/O Performance of Raw Hardware (sgpdd-survey)

The sgpdd-survey tool is used to test bare metal I/O performance of the raw hardware, while bypassing as much of the kernel as possible. This survey may be used to characterize the performance of a SCSI device by simulating an OST serving multiple stripe files. The data gathered by this survey can help set expectations for the performance of a Lustre OST using this device.

The script uses sgp_dd to carry out raw sequential disk I/O. It runs with variable numbers of sgp_dd threads to show how performance varies with different request queue depths.

The script spawns variable numbers of sgp_dd instances, each reading or writing a separate area of the disk to demonstrate performance variance within a number of concurrent stripe files.

Several tips and insights for disk performance measurement are described below. Some of this information is specific to RAID arrays and/or the Linux RAID implementation.

  • Performance is limited by the slowest disk.

    Before creating a RAID array, benchmark all disks individually. We have frequently encountered situations where drive performance was not consistent for all devices in the array. Replace any disks that are significantly slower than the rest.

  • Disks and arrays are very sensitive to request size.

    To identify the optimal request size for a given disk, benchmark the disk with different record sizes ranging from 4 KB up to 1 or 2 MB.

Caution

The sgpdd-survey script overwrites the device being tested, which results in the LOSS OF ALL DATA on that device. Exercise caution when selecting the device to be tested.

Note

Array performance with all LUNs loaded does not always match the performance of a single LUN when tested in isolation.

Prerequisites:

  • sgp_dd tool in the sg3_utils package

  • Lustre software is NOT required

The device(s) being tested must meet one of these two requirements:

  • If the device is a SCSI device, it must appear in the output of sg_map (make sure the kernel module sg is loaded).

  • If the device is a raw device, it must appear in the output of raw -qa.

Raw and SCSI devices cannot be mixed in the test specification.

Note

If you need to create raw devices to use the sgpdd-survey tool, note that raw device 0 cannot be used due to a bug in certain versions of the "raw" utility (including the version shipped with Red Hat Enterprise Linux 4U4.)

27.2.1. Tuning Linux Storage Devices

To get large I/O transfers (1 MB) to disk, it may be necessary to tune several kernel parameters as specified:

/sys/block/sdN/queue/max_sectors_kb = 4096
/sys/block/sdN/queue/max_phys_segments = 256
/proc/scsi/sg/allow_dio = 1
/sys/module/ib_srp/parameters/srp_sg_tablesize = 255
/sys/block/sdN/queue/scheduler

Note

Recommended schedulers are deadline and noop. The scheduler is set by default to deadline, unless it has already been set to noop.
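A minimal sketch of applying these settings at run time (assuming the device under test is sdb and the sg and ib_srp modules are loaded; settings made with echo do not persist across reboots):

echo 4096 > /sys/block/sdb/queue/max_sectors_kb
echo 256 > /sys/block/sdb/queue/max_phys_segments
echo 1 > /proc/scsi/sg/allow_dio
echo 255 > /sys/module/ib_srp/parameters/srp_sg_tablesize
echo deadline > /sys/block/sdb/queue/scheduler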

27.2.2. Running sgpdd-survey

The sgpdd-survey script must be customized for the particular device being tested and for the location where the script saves its working and result files (by specifying the ${rslt} variable). Customization variables are described at the beginning of the script.
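A hedged example invocation (the scsidevs and rslt variable names are assumptions based on typical versions of the script's customization section and may differ in your copy of the I/O kit):

$ rslt=/tmp/sgpdd scsidevs="/dev/sg0 /dev/sg1" sh sgpdd-survey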

When the sgpdd-survey script runs, it creates a number of working files and a pair of result files. The names of all the files created start with the prefix defined in the variable ${rslt}. (The default value is /tmp.) The files include:

  • File containing standard output data (same as stdout)

    rslt_date_time.summary
  • Temporary (tmp) files

    rslt_date_time_*
    
  • Collected tmp files for post-mortem

    rslt_date_time.detail
    

The stdout and the .summary file will contain lines like this:

total_size 8388608K rsz 1024 thr 1 crg 1 180.45 MB/s 1 x 180.50 \
        = 180.50 MB/s

Each line corresponds to a run of the test. Each test run will have a different number of threads, record size, or number of regions.

  • total_size - Size of file being tested in KBs (8 GB in above example).

  • rsz - Record size in KBs (1 MB in above example).

  • thr - Number of threads generating I/O (1 thread in above example).

  • crg - Current regions, the number of disjoint areas on the disk to which I/O is being sent (1 region in above example, indicating that no seeking is done).

  • MB/s - Aggregate bandwidth measured by dividing the total amount of data by the elapsed time (180.45 MB/s in the above example).

  • 1 x 180.50 = 180.50 MB/s - The remaining numbers show the number of regions multiplied by the performance of the slowest disk, as a sanity check on the aggregate bandwidth.

If there are so many threads that the sgp_dd script is unlikely to be able to allocate I/O buffers, then ENOMEM is printed in place of the aggregate bandwidth result.

If one or more sgp_dd instances do not successfully report a bandwidth number, then FAILED is printed in place of the aggregate bandwidth result.

27.3. Testing OST Performance (obdfilter-survey)

The obdfilter-survey script generates sequential I/O from varying numbers of threads and objects (files) to simulate the I/O patterns of a Lustre client.

The obdfilter-survey script can be run directly on the OSS node to measure the OST storage performance without any intervening network, or it can be run remotely on a Lustre client to measure the OST performance including network overhead.

The obdfilter-survey is used to characterize the performance of the following:

  • Local file system - In this mode, the obdfilter-survey script exercises one or more instances of the obdfilter directly. The script may run on one or more OSS nodes, for example, when the OSSs are all attached to the same multi-ported disk subsystem.

    Run the script using the case=disk parameter to run the test against all the local OSTs. The script automatically detects all local OSTs and includes them in the survey.

    To run the test against only specific OSTs, run the script using the targets= parameter to list the OSTs to be tested explicitly. If some OSTs are on remote nodes, specify their hostnames in addition to the OST name (for example, oss2:lustre-OST0004).

    All obdfilter instances are driven directly. The script automatically loads the obdecho module (if required) and creates one instance of echo_client for each obdfilter instance in order to generate I/O requests directly to the OST.

    For more details, see Section 27.3.1, “Testing Local Disk Performance”.

  • Network - In this mode, the Lustre client generates I/O requests over the network but these requests are not sent to the OST file system. The OSS node runs the obdecho server to receive the requests but discards them before they are sent to the disk.

    Pass the parameters case=network and targets=hostname|IP_of_server to the script. For each network case, the script does the required setup.

    For more details, see Section 27.3.2, “Testing Network Performance”

  • Remote file system over the network - In this mode the obdfilter-survey script generates I/O from a Lustre client to a remote OSS to write the data to the file system.

    To run the test against all the local OSCs, pass the parameter case=netdisk to the script. Alternatively, you can pass the targets= parameter with one or more OSC devices (e.g., lustre-OST0000-osc-ffff88007754bc00) against which the tests are to be run.

    For more details, see Section 27.3.3, “Testing Remote Disk Performance”.

Caution

The obdfilter-survey script is potentially destructive and there is a small risk data may be lost. To reduce this risk, obdfilter-survey should not be run on devices that contain data that needs to be preserved. Thus, the best time to run obdfilter-survey is before the Lustre file system is put into production. The reason obdfilter-survey may be safe to run on a production file system is because it creates objects with object sequence 2. Normal file system objects are typically created with object sequence 0.

Note

If the obdfilter-survey test is terminated before it completes, a small amount of space is leaked. You can either ignore it or reformat the file system.

Note

The obdfilter-survey script is NOT scalable beyond tens of OSTs since it is only intended to measure the I/O performance of individual storage subsystems, not the scalability of the entire system.

Note

The obdfilter-survey script must be customized, depending on the components under test and where the script's working files should be kept. Customization variables are described at the beginning of the obdfilter-survey script. In particular, pay attention to the maximum values listed for each parameter in the script.

27.3.1. Testing Local Disk Performance

The obdfilter-survey script can be run automatically or manually against a local disk. This script profiles the overall throughput of storage hardware, including the file system and RAID layers managing the storage, by sending workloads to the OSTs that vary in thread count, object count, and I/O size.

When the obdfilter-survey script is run, it provides information about the performance abilities of the storage hardware and shows the saturation points.

The plot-obdfilter script generates a CSV file from the obdfilter-survey output, along with parameters for importing it into a spreadsheet or gnuplot to visualize the data.

To run the obdfilter-survey script, create a standard Lustre file system configuration; no special setup is needed.

To perform an automatic run:

  1. Start the Lustre OSTs.

    The Lustre OSTs should be mounted on the OSS node(s) to be tested. The Lustre client is not required to be mounted at this time.

  2. Verify that the obdecho module is loaded. Run:

    modprobe obdecho
  3. Run the obdfilter-survey script with the parameter case=disk.

    For example, to run a local test with up to two objects (nobjhi), up to two threads (thrhi), and 1024 MB transfer size (size):

    $ nobjhi=2 thrhi=2 size=1024 case=disk sh obdfilter-survey
  4. Performance measurements for write, rewrite, read, etc. are provided below:

    # example output
    Fri Sep 25 11:14:03 EDT 2015 Obdfilter-survey for case=disk from hds1fnb6123
    ost 10 sz 167772160K rsz 1024K obj   10 thr   10 write 10982.73 [ 601.97,2912.91] rewrite 15696.54 [1160.92,3450.85] read 12358.60 [ 938.96,2634.87] 
    ...

    The file ./lustre-iokit/obdfilter-survey/README.obdfilter-survey provides an explanation of the output as follows:

    ost 10          is the total number of OSTs under test.
    sz 167772160K   is the total amount of data read or written (in bytes).
    rsz 1024K       is the record size (size of each echo_client I/O, in bytes).
    obj    10       is the total number of objects over all OSTs
    thr    10       is the total number of threads over all OSTs and objects
    write           is the test name.  If more tests have been specified they
               all appear on the same line.
    10982.73        is the aggregate bandwidth over all OSTs measured by
               dividing the total number of MB by the elapsed time.
    [601.97,2912.91] are the minimum and maximum instantaneous bandwidths seen on
               any individual OST.
    Note that although the numbers of threads and objects are specified per-OST
    in the customization section of the script, results are reported aggregated
    over all OSTs.

To perform a manual run:

  1. Start the Lustre OSTs.

    The Lustre OSTs should be mounted on the OSS node(s) to be tested. The Lustre client is not required to be mounted at this time.

  2. Verify that the obdecho module is loaded. Run:

    modprobe obdecho
  3. Determine the OST names.

    On the OSS nodes to be tested, run the lctl dl command. The OST device names are listed in the fourth column of the output. For example:

    $ lctl dl |grep obdfilter
    0 UP obdfilter lustre-OST0001 lustre-OST0001_UUID 1159
    2 UP obdfilter lustre-OST0002 lustre-OST0002_UUID 1159
    ...
  4. List all OSTs you want to test.

    Use the targets= parameter to list the OSTs separated by spaces. List the individual OSTs by name using the format fsname-OSTnumber (for example, lustre-OST0001). You do not have to specify an MDS or LOV.

  5. Run the obdfilter-survey script with the targets= parameter.

    For example, to run a local test with up to two objects (nobjhi), up to two threads (thrhi), and 1024 MB transfer size (size):

    $ nobjhi=2 thrhi=2 size=1024 targets="lustre-OST0001 \
    	   lustre-OST0002" sh obdfilter-survey

27.3.2. Testing Network Performance

The obdfilter-survey script can only be run automatically against a network; no manual test is provided.

To run the network test, a specific Lustre file system setup is needed. Make sure that these configuration requirements have been met.

To perform an automatic run:

  1. Start the Lustre OSTs.

    The Lustre OSTs should be mounted on the OSS node(s) to be tested. The Lustre client is not required to be mounted at this time.

  2. Verify that the obdecho module is loaded. Run:

    modprobe obdecho
  3. Start lctl and check the device list, which must be empty. Run:

    lctl dl
  4. Run the obdfilter-survey script with the parameters case=network and targets=hostname|ip_of_server. For example:

    $ nobjhi=2 thrhi=2 size=1024 targets="oss0 oss1" \
    	   case=network sh obdfilter-survey
  5. On the server side, view the statistics at:

    /proc/fs/lustre/obdecho/echo_srv/stats

    where echo_srv is the obdecho server created by the script.

27.3.3. Testing Remote Disk Performance

The obdfilter-survey script can be run automatically or manually against a network disk. To run the network disk test, start with a standard Lustre configuration. No special setup is needed.

To perform an automatic run:

  1. Start the Lustre OSTs.

    The Lustre OSTs should be mounted on the OSS node(s) to be tested. The Lustre client is not required to be mounted at this time.

  2. Verify that the obdecho module is loaded. Run:

    modprobe obdecho
  3. Run the obdfilter-survey script with the parameter case=netdisk. For example:

    $ nobjhi=2 thrhi=2 size=1024 case=netdisk sh obdfilter-survey
    

To perform a manual run:

  1. Start the Lustre OSTs.

    The Lustre OSTs should be mounted on the OSS node(s) to be tested. The Lustre client is not required to be mounted at this time.

  2. Verify that the obdecho module is loaded. Run:

    modprobe obdecho

  3. Determine the OSC names.

    On the OSS nodes to be tested, run the lctl dl command. The OSC device names are listed in the fourth column of the output. For example:

    $ lctl dl | grep osc
    3 UP osc lustre-OST0000-osc-ffff88007754bc00 \
               54b91eab-0ea9-1516-b571-5e6df349592e 5
    4 UP osc lustre-OST0001-osc-ffff88007754bc00 \
               54b91eab-0ea9-1516-b571-5e6df349592e 5
    ...
    
  4. List all OSCs you want to test.

    Use the targets= parameter to list the OSCs separated by spaces. List the individual OSCs by name using the format fsname-OSTnumber-osc-instance (for example, lustre-OST0000-osc-ffff88007754bc00). You do not have to specify an MDS or LOV.

  5. Run the obdfilter-survey script with the targets= parameter listing the OSCs and with case=netdisk.

    An example of a test run with up to two objects (nobjhi), up to two threads (thrhi), and 1024 MB transfer size (size) is shown below:

    $ nobjhi=2 thrhi=2 size=1024 \
               targets="lustre-OST0000-osc-ffff88007754bc00 \
               lustre-OST0001-osc-ffff88007754bc00" sh obdfilter-survey
    

27.3.4. Output Files

When the obdfilter-survey script runs, it creates a number of working files and a pair of result files. All files start with the prefix defined in the variable ${rslt}.

File

Description

${rslt}.summary

Same as stdout

${rslt}.script_*

Per-host test script files

${rslt}.detail_tmp*

Per-OST result files

${rslt}.detail

Collected result files for post-mortem

The obdfilter-survey script iterates over the given number of threads and objects performing the specified tests and checks that all test processes have completed successfully.

Note

The obdfilter-survey script may not clean up properly if it is aborted or if it encounters an unrecoverable error. In this case, a manual cleanup may be required, possibly including killing any running instances of lctl (local or remote), removing echo_client instances created by the script and unloading obdecho.

27.3.4.1. Script Output

The .summary file and stdout of the obdfilter-survey script contain lines like:

ost 8 sz 67108864K rsz 1024 obj 8 thr 8 write 613.54 [ 64.00, 82.00]

Where:

Parameter and value

Description

ost 8

Total number of OSTs being tested.

sz 67108864K

Total amount of data read or written (in KB).

rsz 1024

Record size (size of each echo_client I/O, in KB).

obj 8

Total number of objects over all OSTs.

thr 8

Total number of threads over all OSTs and objects.

write

Test name. If more tests have been specified, they all appear on the same line.

613.54

Aggregate bandwidth over all OSTs (measured by dividing the total number of MB by the elapsed time).

[ 64.00, 82.00]

Minimum and maximum instantaneous bandwidths on an individual OST.

Note

Although the numbers of threads and objects are specified per-OST in the customization section of the script, the reported results are aggregated over all OSTs.

27.3.4.2. Visualizing Results

It is useful to import the obdfilter-survey script summary data (it is fixed width) into Excel (or any graphing package) and graph the bandwidth versus the number of threads for varying numbers of concurrent regions. This shows how the OSS performs for a given number of concurrently-accessed objects (files) with varying numbers of I/Os in flight.

It is also useful to monitor and record average disk I/O sizes during each test using the 'disk io size' histogram in the file /proc/fs/lustre/obdfilter/*/brw_stats (see Section 33.3.5, “Monitoring the OST Block I/O Stream” for details). These numbers help identify problems in the system when full-sized I/Os are not submitted to the underlying disk. This may be caused by problems in the device driver or Linux block layer.

The plot-obdfilter script included in the I/O toolkit is an example of processing output files to a .csv format and plotting a graph using gnuplot.
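A hedged example of invoking the bundled plotting script on a summary file (the exact invocation may differ between I/O kit versions, so treat this as an assumption to verify against your copy):

$ plot-obdfilter ${rslt}.summary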

27.4. Testing OST I/O Performance (ost-survey)

The ost-survey tool is a shell script that uses lfs setstripe to perform I/O against a single OST. The script writes a file (currently using dd) to each OST in the Lustre file system, and compares read and write speeds. The ost-survey tool is used to detect anomalies between otherwise identical disk subsystems.

Note

We have frequently discovered wide performance variations across all LUNs in a cluster. This may be caused by faulty disks, RAID parity reconstruction during the test, or faulty network hardware.

To run the ost-survey script, supply a file size (in KB) and the Lustre file system mount point. For example, run:

$ ./ost-survey.sh -s 10 /mnt/lustre

Typical output is:

Number of Active OST devices : 4
Worst  Read OST indx: 2 speed: 2835.272725
Best   Read OST indx: 3 speed: 2872.889668
Read Average: 2852.508999 +/- 16.444792 MB/s
Worst  Write OST indx: 3 speed: 17.705545
Best   Write OST indx: 2 speed: 128.172576
Write Average: 95.437735 +/- 45.518117 MB/s
Ost#  Read(MB/s)  Write(MB/s)  Read-time  Write-time
----------------------------------------------------
0     2837.440       126.918        0.035      0.788
1     2864.433       108.954        0.035      0.918
2     2835.273       128.173        0.035      0.780
3     2872.890       17.706        0.035      5.648

27.5. Testing MDS Performance (mds-survey)

mds-survey is available in Lustre software release 2.2 and beyond. The mds-survey script tests the local metadata performance using the echo_client to drive different layers of the MDS stack: mdd, mdt, osd (the Lustre software only supports mdd stack). It can be used with the following classes of operations:

  • Open-create/mkdir/create

  • Lookup/getattr/setxattr

  • Delete/destroy

  • Unlink/rmdir

These operations will be run by a variable number of concurrent threads and will test with the number of directories specified by the user. The run can be executed such that all threads operate in a single directory (dir_count=1) or in private/unique directories (dir_count=x thrlo=x thrhi=x).

The mdd instance is driven directly. The script automatically loads the obdecho module if required and creates an instance of echo_client.

This script can also create OST objects by providing a stripe_count greater than zero.

To perform a run:

  1. Start the Lustre MDT.

    The Lustre MDT should be mounted on the MDS node to be tested.

  2. Start the Lustre OSTs (optional, only required when testing with OST objects).

    The Lustre OSTs should be mounted on the OSS node(s).

  3. Run the mds-survey script as explained below.

    The script must be customized according to the components under test and where it should keep its working files. Customization variables are described as follows:

    • thrlo - number of threads to start testing with; skipped if less than dir_count

    • thrhi - maximum number of threads to test

    • targets - MDT instance

    • file_count - number of files per thread to test

    • dir_count - total number of directories to test. Must be less than or equal to thrhi

    • stripe_count - number of stripes on OST objects

    • tests_str - test operations. Must have at least "create" and "destroy"

    • start_number - base number for each thread to prevent name collisions

    • layer - MDS stack's layer to be tested

    Run without OST objects creation:

    Set up the Lustre MDS without any OSTs mounted. Then invoke the mds-survey script:

    $ thrhi=64 file_count=200000 sh mds-survey

    Run with OST objects creation:

    Set up the Lustre MDS with at least one OST mounted. Then invoke the mds-survey script with the stripe_count parameter:

    $ thrhi=64 file_count=200000 stripe_count=2 sh mds-survey

    Note: a specific MDT instance can be specified using the targets variable:

    $ targets=lustre-MDT0000 thrhi=64 file_count=200000 stripe_count=2 sh mds-survey

27.5.1. Output Files

When the mds-survey script runs, it creates a number of working files and a pair of result files. All files start with the prefix defined in the variable ${rslt}.

File

Description

${rslt}.summary

Same as stdout

${rslt}.script_*

Per-host test script files

${rslt}.detail_tmp*

Per-mdt result files

${rslt}.detail

Collected result files for post-mortem

The mds-survey script iterates over the given number of threads performing the specified tests and checks that all test processes have completed successfully.

Note

The mds-survey script may not clean up properly if it is aborted or if it encounters an unrecoverable error. In this case, a manual cleanup may be required, possibly including killing any running instances of lctl, removing echo_client instances created by the script and unloading obdecho.

27.5.2. Script Output

The .summary file and stdout of the mds-survey script contain lines like:

mdt 1 file 100000 dir 4 thr 4 create 5652.05 [ 999.01,46940.48] destroy 5797.79 [ 0.00,52951.55] 

Where:

Parameter and value

Description

mdt 1

Total number of MDTs under test

file 100000

Total number of files per thread to operate on

dir 4

Total number of directories to operate on

thr 4

Total number of threads operating over all directories

create, destroy

Test names. If more tests have been specified, they all appear on the same line.

5652.05

Aggregate operations per second over the MDT, measured by dividing the total number of operations by the elapsed time.

[999.01,46940.48]

Minimum and maximum instantaneous operation rates seen on any individual MDT

Note

If the script output contains "ERROR", this usually means there was an issue during the run, such as running out of space on the MDT and/or OST. More detailed debug information is available in the ${rslt}.detail file.

27.6. Collecting Application Profiling Information (stats-collect)

The stats-collect utility contains the following scripts used to collect application profiling information from Lustre clients and servers:

  • lstat.sh - Script for a single node that is run on each profile node.

  • gather_stats_everywhere.sh - Script that collects statistics.

  • config.sh - Script that contains customized configuration descriptions.

The stats-collect utility requires:

  • Lustre software to be installed and set up on your cluster

  • SSH and SCP access to these nodes without requiring a password

27.6.1. Using stats-collect

The stats-collect utility is configured by including profiling configuration variables in the config.sh script. Each configuration variable takes the following form, where 0 indicates statistics are to be collected only when the script starts and stops and n indicates the interval in seconds at which statistics are to be collected:

statistic_INTERVAL=0|n

Statistics that can be collected include:

  • VMSTAT - Memory and CPU usage and aggregate read/write operations

  • SERVICE - Lustre OST and MDT RPC service statistics

  • BRW - OST bulk read/write statistics (brw_stats)

  • SDIO - SCSI disk IO statistics (sd_iostats)

  • MBALLOC - ldiskfs block allocation statistics

  • IO - Lustre target operations statistics

  • JBD - ldiskfs journal statistics

  • CLIENT - Lustre OSC request statistics
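A minimal config.sh sketch following the statistic_INTERVAL pattern described above (the specific statistics chosen here are illustrative only):

# Collect vmstat data only at start and stop, and
# service and brw statistics every 2 seconds.
VMSTAT_INTERVAL=0
SERVICE_INTERVAL=2
BRW_INTERVAL=2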

To collect profile information:

Begin collecting statistics on each node specified in the config.sh script.

  1. Start the profile collection daemon on each node by entering:

    sh gather_stats_everywhere.sh config.sh start 
    
  2. Run the test.

  3. Stop collecting statistics on each node, clean up the temporary file, and create a profiling tarball.

    Enter:

    sh gather_stats_everywhere.sh config.sh stop log_name.tgz

    When log_name.tgz is specified, a profile tarball /tmp/log_name.tgz is created.

  4. Analyze the collected statistics and create a csv tarball for the specified profiling data.

    sh gather_stats_everywhere.sh config.sh analyse log_tarball.tgz csv
    

Chapter 28. Tuning a Lustre File System

This chapter contains information about tuning a Lustre file system for better performance.

Note

Many options in the Lustre software are set by means of kernel module parameters. These parameters are contained in the /etc/modprobe.d/lustre.conf file.
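For example, a line such as the following in /etc/modprobe.d/lustre.conf sets the LNet networks module parameter (the interface name eth0 is a placeholder for your actual interface):

options lnet networks=tcp0(eth0)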

28.1.  Optimizing the Number of Service Threads

An OSS can have a minimum of two service threads and a maximum of 512 service threads. The number of service threads is a function of how much RAM and how many CPUs are on each OSS node (1 thread / 128MB * num_cpus). If the load on the OSS node is high, new service threads will be started in order to process more requests concurrently, up to 4x the initial number of threads (subject to the maximum of 512). For a 2GB 2-CPU system, the default thread count is 32 and the maximum thread count is 128.
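Working through the formula for the quoted example: 2048 MB / 128 MB per thread x 2 CPUs = 32 default threads, and 4 x 32 = 128 maximum threads, which matches the figures above.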

Increasing the size of the thread pool may help when: