Lustre* Software Release 2.x

Operations Manual

Notwithstanding Intel's ownership of the copyright in the modifications to the original version of this Operations Manual, as between Intel and Oracle, Oracle and/or its affiliates retain sole ownership of the copyright in the unmodified portions of this Operations Manual.

Important Notice from Intel

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: https://www.intel.com/content/www/us/en/design/resource-design-center.html

Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. Lustre is a registered trademark of Oracle Corporation.

*Other names and brands may be claimed as the property of others.

THE ORIGINAL LUSTRE 2.x FILESYSTEM: OPERATIONS MANUAL HAS BEEN MODIFIED: THIS OPERATIONS MANUAL IS A MODIFIED VERSION OF, AND IS DERIVED FROM, THE LUSTRE 2.0 FILESYSTEM: OPERATIONS MANUAL PUBLISHED BY ORACLE AND AVAILABLE AT [http://www.lustre.org/]. MODIFICATIONS (collectively, the "Modifications") HAVE BEEN MADE BY INTEL CORPORATION ("Intel"). ORACLE AND ITS AFFILIATES HAVE NOT REVIEWED, APPROVED, SPONSORED, OR ENDORSED THIS MODIFIED OPERATIONS MANUAL, OR ENDORSED INTEL, AND ORACLE AND ITS AFFILIATES ARE NOT RESPONSIBLE OR LIABLE FOR ANY MODIFICATIONS THAT INTEL HAS MADE TO THE ORIGINAL OPERATIONS MANUAL.

NOTHING IN THIS MODIFIED OPERATIONS MANUAL IS INTENDED TO AFFECT THE NOTICE PROVIDED BY ORACLE BELOW IN RESPECT OF THE ORIGINAL OPERATIONS MANUAL AND SUCH ORACLE NOTICE CONTINUES TO APPLY TO THIS MODIFIED OPERATIONS MANUAL EXCEPT FOR THE MODIFICATIONS; THIS INTEL NOTICE SHALL APPLY ONLY TO MODIFICATIONS MADE BY INTEL. AS BETWEEN YOU AND ORACLE: (I) NOTHING IN THIS INTEL NOTICE IS INTENDED TO AFFECT THE TERMS OF THE ORACLE NOTICE BELOW; AND (II) IN THE EVENT OF ANY CONFLICT BETWEEN THE TERMS OF THIS INTEL NOTICE AND THE TERMS OF THE ORACLE NOTICE, THE ORACLE NOTICE SHALL PREVAIL.

Your use of any Intel software shall be governed by separate license terms containing restrictions on use and disclosure; such software is protected by intellectual property laws.

The information contained herein is subject to change without notice and is not warranted to be error-free. If you find any errors, please report them to us in writing.

This work is licensed under a Creative Commons Attribution-Share Alike 3.0 United States License. To view a copy of this license and obtain more information about Creative Commons licensing, visit Creative Commons Attribution-Share Alike 3.0 United States or send a letter to Creative Commons, 171 2nd Street, Suite 300, San Francisco, California 94105, USA.

Important Notice from Oracle

This software and related documentation are provided under a license agreement containing restrictions on use and disclosure and are protected by intellectual property laws. Except as expressly permitted in your license agreement or allowed by law, you may not use, copy, reproduce, translate, broadcast, modify, license, transmit, distribute, exhibit, perform, publish, or display any part, in any form, or by any means. Reverse engineering, disassembly, or decompilation of this software, unless required by law for interoperability, is prohibited.

The information contained herein is subject to change without notice and is not warranted to be error-free. If you find any errors, please report them to us in writing.

If this is software or related software documentation that is delivered to the U.S. Government or anyone licensing it on behalf of the U.S. Government, the following notice is applicable:

U.S. GOVERNMENT RIGHTS. Programs, software, databases, and related documentation and technical data delivered to U.S. Government customers are "commercial computer software" or "commercial technical data" pursuant to the applicable Federal Acquisition Regulation and agency-specific supplemental regulations. As such, the use, duplication, disclosure, modification, and adaptation shall be subject to the restrictions and license terms set forth in the applicable Government contract, and, to the extent applicable by the terms of the Government contract, the additional rights set forth in FAR 52.227-19, Commercial Computer Software License (December 2007). Oracle America, Inc., 500 Oracle Parkway, Redwood City, CA 94065.

This software or hardware is developed for general use in a variety of information management applications. It is not developed or intended for use in any inherently dangerous applications, including applications which may create a risk of personal injury. If you use this software or hardware in dangerous applications, then you shall be responsible to take all appropriate fail-safe, backup, redundancy, and other measures to ensure its safe use. Oracle Corporation and its affiliates disclaim any liability for any damages caused by use of this software or hardware in dangerous applications.

Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners.

AMD, Opteron, the AMD logo, and the AMD Opteron logo are trademarks or registered trademarks of Advanced Micro Devices. Intel and Intel Xeon are trademarks or registered trademarks of Intel Corporation. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. UNIX is a registered trademark licensed through X/Open Company, Ltd.

This software or hardware and documentation may provide access to or information on content, products, and services from third parties. Oracle Corporation and its affiliates are not responsible for and expressly disclaim all warranties of any kind with respect to third-party content, products, and services. Oracle Corporation and its affiliates will not be responsible for any loss, costs, or damages incurred due to your access to or use of third-party content, products, or services.

Copyright © 2011, Oracle et/ou ses affiliés. Tous droits réservés.

Ce logiciel et la documentation qui l'accompagne sont protégés par les lois sur la propriété intellectuelle. Ils sont concédés sous licence et soumis à des restrictions d'utilisation et de divulgation. Sauf disposition de votre contrat de licence ou de la loi, vous ne pouvez pas copier, reproduire, traduire, diffuser, modifier, breveter, transmettre, distribuer, exposer, exécuter, publier ou afficher le logiciel, même partiellement, sous quelque forme et par quelque procédé que ce soit. Par ailleurs, il est interdit de procéder à toute ingénierie inverse du logiciel, de le désassembler ou de le décompiler, excepté à des fins d'interopérabilité avec des logiciels tiers ou tel que prescrit par la loi.

Les informations fournies dans ce document sont susceptibles de modification sans préavis. Par ailleurs, Oracle Corporation ne garantit pas qu'elles soient exemptes d'erreurs et vous invite, le cas échéant, à lui en faire part par écrit.

Si ce logiciel, ou la documentation qui l'accompagne, est concédé sous licence au Gouvernement des Etats-Unis, ou à toute entité qui délivre la licence de ce logiciel ou l'utilise pour le compte du Gouvernement des Etats-Unis, la notice suivante s'applique :

U.S. GOVERNMENT RIGHTS. Programs, software, databases, and related documentation and technical data delivered to U.S. Government customers are "commercial computer software" or "commercial technical data" pursuant to the applicable Federal Acquisition Regulation and agency-specific supplemental regulations. As such, the use, duplication, disclosure, modification, and adaptation shall be subject to the restrictions and license terms set forth in the applicable Government contract, and, to the extent applicable by the terms of the Government contract, the additional rights set forth in FAR 52.227-19, Commercial Computer Software License (December 2007). Oracle America, Inc., 500 Oracle Parkway, Redwood City, CA 94065.

Ce logiciel ou matériel a été développé pour un usage général dans le cadre d'applications de gestion des informations. Ce logiciel ou matériel n'est pas conçu ni n'est destiné à être utilisé dans des applications à risque, notamment dans des applications pouvant causer des dommages corporels. Si vous utilisez ce logiciel ou matériel dans le cadre d'applications dangereuses, il est de votre responsabilité de prendre toutes les mesures de secours, de sauvegarde, de redondance et autres mesures nécessaires à son utilisation dans des conditions optimales de sécurité. Oracle Corporation et ses affiliés déclinent toute responsabilité quant aux dommages causés par l'utilisation de ce logiciel ou matériel pour ce type d'applications.

Oracle et Java sont des marques déposées d'Oracle Corporation et/ou de ses affiliés. Tout autre nom mentionné peut correspondre à des marques appartenant à d'autres propriétaires qu'Oracle.

AMD, Opteron, le logo AMD et le logo AMD Opteron sont des marques ou des marques déposées d'Advanced Micro Devices. Intel et Intel Xeon sont des marques ou des marques déposées d'Intel Corporation. Toutes les marques SPARC sont utilisées sous licence et sont des marques ou des marques déposées de SPARC International, Inc. UNIX est une marque déposée concédée sous licence par X/Open Company, Ltd.

Ce logiciel ou matériel et la documentation qui l'accompagne peuvent fournir des informations ou des liens donnant accès à des contenus, des produits et des services émanant de tiers. Oracle Corporation et ses affiliés déclinent toute responsabilité ou garantie expresse quant aux contenus, produits ou services émanant de tiers. En aucun cas, Oracle Corporation et ses affiliés ne sauraient être tenus pour responsables des pertes subies, des coûts occasionnés ou des dommages causés par l'accès à des contenus, produits ou services tiers, ou à leur utilisation.

This work is licensed under a Creative Commons Attribution-Share Alike 3.0 United States License. To view a copy of this license and obtain more information about Creative Commons licensing, visit Creative Commons Attribution-Share Alike 3.0 United States or send a letter to Creative Commons, 171 2nd Street, Suite 300, San Francisco, California 94105, USA.


Table of Contents

Preface
1. About this Document
1.1. UNIX* Commands
1.2. Shell Prompts
1.3. Related Documentation
1.4. Documentation and Support
2. Revisions
I. Introducing the Lustre* File System
1. Understanding Lustre Architecture
1.1. What a Lustre File System Is (and What It Isn't)
1.1.1. Lustre Features
1.2. Lustre Components
1.2.1. Management Server (MGS)
1.2.2. Lustre File System Components
1.2.3. Lustre Networking (LNet)
1.2.4. Lustre Cluster
1.3. Lustre File System Storage and I/O
1.3.1. Lustre File System and Striping
2. Understanding Lustre Networking (LNet)
2.1. Introducing LNet
2.2. Key Features of LNet
2.3. Lustre Networks
2.4. Supported Network Types
3. Understanding Failover in a Lustre File System
3.1. What is Failover?
3.1.1. Failover Capabilities
3.1.2. Types of Failover Configurations
3.2. Failover Functionality in a Lustre File System
3.2.1. MDT Failover Configuration (Active/Passive)
3.2.2. MDT Failover Configuration (Active/Active)
3.2.3. OST Failover Configuration (Active/Active)
II. Installing and Configuring Lustre
4. Installation Overview
4.1. Steps to Installing the Lustre Software
5. Determining Hardware Configuration Requirements and Formatting Options
5.1. Hardware Considerations
5.1.1. MGT and MDT Storage Hardware Considerations
5.1.2. OST Storage Hardware Considerations
5.2. Determining Space Requirements
5.2.1. Determining MGT Space Requirements
5.2.2. Determining MDT Space Requirements
5.2.3. Determining OST Space Requirements
5.3. Setting ldiskfs File System Formatting Options
5.3.1. Setting Formatting Options for an ldiskfs MDT
5.3.2. Setting Formatting Options for an ldiskfs OST
5.4. File and File System Limits
5.5. Determining Memory Requirements
5.5.1. Client Memory Requirements
5.5.2. MDS Memory Requirements
5.5.3. OSS Memory Requirements
5.6. Implementing Networks To Be Used by the Lustre File System
6. Configuring Storage on a Lustre File System
6.1. Selecting Storage for the MDT and OSTs
6.1.1. Metadata Target (MDT)
6.1.2. Object Storage Server (OST)
6.2. Reliability Best Practices
6.3. Performance Tradeoffs
6.4. Formatting Options for ldiskfs RAID Devices
6.4.1. Computing file system parameters for mkfs
6.4.2. Choosing Parameters for an External Journal
6.5. Connecting a SAN to a Lustre File System
7. Setting Up Network Interface Bonding
7.1. Network Interface Bonding Overview
7.2. Requirements
7.3. Bonding Module Parameters
7.4. Setting Up Bonding
7.4.1. Examples
7.5. Configuring a Lustre File System with Bonding
7.6. Bonding References
8. Installing the Lustre Software
8.1. Preparing to Install the Lustre Software
8.1.1. Software Requirements
8.1.2. Environmental Requirements
8.2. Lustre Software Installation Procedure
9. Configuring Lustre Networking (LNet)
9.1. Configuring LNet via lnetctl L 2.7
9.1.1. Configuring LNet
9.1.2. Displaying Global Settings
9.1.3. Adding, Deleting and Showing Networks
9.1.4. Manual Adding, Deleting and Showing Peers L 2.10
9.1.5. Dynamic Peer Discovery L 2.11
9.1.6. Adding, Deleting and Showing routes
9.1.7. Enabling and Disabling Routing
9.1.8. Showing routing information
9.1.9. Configuring Routing Buffers
9.1.10. Asymmetrical Routes L 2.13
9.1.11. Importing YAML Configuration File
9.1.12. Exporting Configuration in YAML format
9.1.13. Showing LNet Traffic Statistics
9.1.14. YAML Syntax
9.2. Overview of LNet Module Parameters
9.2.1. Using a Lustre Network Identifier (NID) to Identify a Node
9.3. Setting the LNet Module networks Parameter
9.3.1. Multihome Server Example
9.4. Setting the LNet Module ip2nets Parameter
9.5. Setting the LNet Module routes Parameter
9.5.1. Routing Example
9.6. Testing the LNet Configuration
9.7. Configuring the Router Checker
9.8. Best Practices for LNet Options
9.8.1. Escaping commas with quotes
9.8.2. Including comments
10. Configuring a Lustre File System
10.1. Configuring a Simple Lustre File System
10.1.1. Simple Lustre Configuration Example
10.2. Additional Configuration Options
10.2.1. Scaling the Lustre File System
10.2.2. Changing Striping Defaults
10.2.3. Using the Lustre Configuration Utilities
11. Configuring Failover in a Lustre File System
11.1. Setting Up a Failover Environment
11.1.1. Selecting Power Equipment
11.1.2. Selecting Power Management Software
11.1.3. Selecting High-Availability (HA) Software
11.2. Preparing a Lustre File System for Failover
11.3. Administering Failover in a Lustre File System
III. Administering Lustre
12. Monitoring a Lustre File System
12.1. Lustre Changelogs
12.1.1. Working with Changelogs
12.1.2. Changelog Examples
12.1.3. Audit with Changelogs L 2.11
12.2. Lustre Jobstats
12.2.1. How Jobstats Works
12.2.2. Enable/Disable Jobstats
12.2.3. Check Job Stats
12.2.4. Clear Job Stats
12.2.5. Configure Auto-cleanup Interval
12.2.6. Identifying Top Jobs L 2.14
12.3. Lustre Monitoring Tool (LMT)
12.4. CollectL
12.5. Other Monitoring Options
13. Lustre Operations
13.1. Mounting by Label
13.2. Starting Lustre
13.3. Mounting a Server
13.4. Stopping the Filesystem
13.5. Unmounting a Specific Target on a Server
13.6. Specifying Failout/Failover Mode for OSTs
13.7. Handling Degraded OST RAID Arrays
13.8. Running Multiple Lustre File Systems
13.9. Creating a sub-directory on a specific MDT
13.10. Creating a directory striped across multiple MDTs L 2.8
13.10.1. Directory creation by space/inode usage L 2.13
13.10.2. Filesystem-wide default directory striping L 2.14
13.11. Default Dir Stripe Policy
13.12. Setting and Retrieving Lustre Parameters
13.12.1. Setting Tunable Parameters with mkfs.lustre
13.12.2. Setting Parameters with tunefs.lustre
13.12.3. Setting Parameters with lctl
13.13. Specifying NIDs and Failover
13.14. Erasing a File System
13.15. Reclaiming Reserved Disk Space
13.16. Replacing an Existing OST or MDT
13.17. Identifying To Which Lustre File an OST Object Belongs
14. Lustre Maintenance
14.1. Working with Inactive OSTs
14.2. Finding Nodes in the Lustre File System
14.3. Mounting a Server Without Lustre Service
14.4. Regenerating Lustre Configuration Logs
14.5. Changing a Server NID
14.6. Clearing configuration L 2.11
14.7. Adding a New MDT to a Lustre File System
14.8. Adding a New OST to a Lustre File System
14.9. Removing and Restoring MDTs and OSTs
14.9.1. Removing an MDT from the File System
14.9.2. Working with Inactive MDTs
14.9.3. Removing an OST from the File System
14.9.4. Backing Up OST Configuration Files
14.9.5. Restoring OST Configuration Files
14.9.6. Returning a Deactivated OST to Service
14.10. Aborting Recovery
14.11. Determining Which Machine is Serving an OST
14.12. Changing the Address of a Failover Node
14.13. Separate a combined MGS/MDT
14.14. Set an MDT to read-only L 2.13
14.15. Tune Fallocate for ldiskfs L 2.14
15. Managing Lustre Networking (LNet)
15.1. Updating the Health Status of a Peer or Router
15.2. Starting and Stopping LNet
15.2.1. Starting LNet
15.2.2. Stopping LNet
15.3. Hardware Based Multi-Rail Configurations with LNet
15.4. Load Balancing with an InfiniBand* Network
15.4.1. Setting Up lustre.conf for Load Balancing
15.5. Dynamically Configuring LNet Routes
15.5.1. lustre_routes_config
15.5.2. lustre_routes_conversion
15.5.3. Route Configuration Examples
16. LNet Software Multi-Rail L 2.10
16.1. Multi-Rail Overview
16.2. Configuring Multi-Rail
16.2.1. Configure Multiple Interfaces on the Local Node
16.2.2. Deleting Network Interfaces
16.2.3. Adding Remote Peers that are Multi-Rail Capable
16.2.4. Deleting Remote Peers
16.3. Notes on routing with Multi-Rail
16.3.1. Multi-Rail Cluster Example
16.3.2. Utilizing Router Resiliency
16.3.3. Mixed Multi-Rail/Non-Multi-Rail Cluster
16.4. Multi-Rail Routing with LNet Health L 2.13
16.4.1. Configuration
16.4.2. Router Health
16.4.3. Discovery
16.4.4. Route Aliveness Criteria
16.5. LNet Health L 2.12
16.5.1. Health Value
16.5.2. Failure Types and Behavior
16.5.3. User Interface
16.5.4. Displaying Information
16.5.5. Initial Settings Recommendations
17. Upgrading a Lustre File System
17.1. Release Interoperability and Upgrade Requirements
17.2. Upgrading to Lustre Software Release 2.x (Major Release)
17.3. Upgrading to Lustre Software Release 2.x.y (Minor Release)
18. Backing Up and Restoring a File System
18.1. Backing up a File System
18.1.1. Lustre_rsync
18.2. Backing Up and Restoring an MDT or OST (ldiskfs Device Level)
18.3. Backing Up an OST or MDT (Backend File System Level)
18.3.1. Backing Up an OST or MDT (Backend File System Level) L 2.11
18.3.2. Backing Up an OST or MDT
18.4. Restoring a File-Level Backup
18.5. Using LVM Snapshots with the Lustre File System
18.5.1. Creating an LVM-based Backup File System
18.5.2. Backing up New/Changed Files to the Backup File System
18.5.3. Creating Snapshot Volumes
18.5.4. Restoring the File System From a Snapshot
18.5.5. Deleting Old Snapshots
18.5.6. Changing Snapshot Volume Size
18.6. Migration Between ZFS and ldiskfs Target Filesystems L 2.11
18.6.1. Migrate from a ZFS to an ldiskfs based filesystem
18.6.2. Migrate from an ldiskfs to a ZFS based filesystem
19. Managing File Layout (Striping) and Free Space
19.1. How Lustre File System Striping Works
19.2. Lustre File Layout (Striping) Considerations
19.2.1. Choosing a Stripe Size
19.3. Setting the File Layout/Striping Configuration (lfs setstripe)
19.3.1. Specifying a File Layout (Striping Pattern) for a Single File
19.3.2. Setting the Striping Layout for a Directory
19.3.3. Setting the Striping Layout for a File System
19.3.4. Per File System Stripe Count Limit
19.3.5. Creating a File on a Specific OST
19.4. Retrieving File Layout/Striping Information (getstripe)
19.4.1. Displaying the Current Stripe Size
19.4.2. Inspecting the File Tree
19.4.3. Locating the MDT for a remote directory
19.5. Progressive File Layout (PFL) L 2.10
19.5.1. lfs setstripe
19.5.2. lfs migrate
19.5.3. lfs getstripe
19.5.4. lfs find
19.6. Self-Extending Layout (SEL) L 2.13
19.6.1. lfs setstripe
19.6.2. lfs getstripe
19.6.3. lfs find
19.7. Foreign Layout L 2.13
19.7.1. lfs set[dir]stripe
19.7.2. lfs get[dir]stripe
19.7.3. lfs find
19.8. Managing Free Space
19.8.1. Checking File System Free Space
19.8.2. Stripe Allocation Methods
19.8.3. Adjusting the Weighting Between Free Space and Location
19.9. Lustre Striping Internals
20. Data on MDT (DoM) L 2.11
20.1. Introduction to Data on MDT (DoM)
20.2. User Commands
20.2.1. lfs setstripe for DoM files
20.2.2. Setting a default DoM layout to an existing directory
20.2.3. DoM Stripe Size Restrictions
20.2.4. lfs getstripe for DoM files
20.2.5. lfs find for DoM files
20.2.6. The dom_stripesize parameter
20.2.7. Disable DoM
21. Lazy Size on MDT (LSoM) L 2.12
21.1. Introduction to Lazy Size on MDT (LSoM)
21.2. Enable LSoM
21.3. User Commands
21.3.1. lfs getsom for LSoM data
21.3.2. Syncing LSoM data
22. File Level Redundancy (FLR) L 2.11
22.1. Introduction
22.2. Operations
22.2.1. Creating a Mirrored File or Directory
22.2.2. Extending a Mirrored File
22.2.3. Splitting a Mirrored File
22.2.4. Resynchronizing out-of-sync Mirrored File(s)
22.2.5. Verifying Mirrored File(s)
22.2.6. Finding Mirrored File(s)
22.3. Interoperability
23. Managing the File System and I/O
23.1. Handling Full OSTs
23.1.1. Checking OST Space Usage
23.1.2. Disabling creates on a Full OST
23.1.3. Migrating Data within a File System
23.1.4. Returning an Inactive OST Back Online
23.1.5. Migrating Metadata within a Filesystem
23.2. Creating and Managing OST Pools
23.2.1. Working with OST Pools
23.2.2. Tips for Using OST Pools
23.3. Adding an OST to a Lustre File System
23.4. Performing Direct I/O
23.4.1. Making File System Objects Immutable
23.5. Other I/O Options
23.5.1. Lustre Checksums
23.5.2. PtlRPC Client Thread Pool
24. Lustre File System Failover and Multiple-Mount Protection
24.1. Overview of Multiple-Mount Protection
24.2. Working with Multiple-Mount Protection
25. Configuring and Managing Quotas
25.1. Working with Quotas
25.2. Enabling Disk Quotas
25.2.1. Quota Verification
25.3. Quota Administration
25.4. Default Quota L 2.12
25.4.1. Usage
25.5. Quota Allocation
25.6. Quotas and Version Interoperability
25.7. Granted Cache and Quota Limits
25.8. Lustre Quota Statistics
25.8.1. Interpreting Quota Statistics
25.9. Pool Quotas L 2.14
25.9.1. DOM and MDT pools
25.9.2. Lfs quota/setquota options to setup quota pools
25.9.3. Quota pools interoperability
25.9.4. Pool Quotas Hard Limit setup example
25.9.5. Pool Quotas Soft Limit setup example
26. Hierarchical Storage Management (HSM) L 2.5
26.1. Introduction
26.2. Setup
26.2.1. Requirements
26.2.2. Coordinator
26.2.3. Agents
26.3. Agents and copytool
26.3.1. Archive ID, multiple backends
26.3.2. Registered agents
26.3.3. Timeout
26.4. Requests
26.4.1. Commands
26.4.2. Automatic restore
26.4.3. Request monitoring
26.5. File states
26.6. Tuning
26.6.1. hsm_controlpolicy
26.6.2. max_requests
26.6.3. policy
26.6.4. grace_delay
26.7. change logs
26.8. Policy engine
26.8.1. Robinhood
27. Persistent Client Cache (PCC) L 2.13
27.1. Introduction
27.2. Design
27.2.1. Lustre Read-Write PCC Caching
27.2.2. Rule-based Persistent Client Cache
27.3. PCC Command Line Tools
27.3.1. Add a PCC backend on a client
27.3.2. Delete a PCC backend from a client
27.3.3. Remove all PCC backends on a client
27.3.4. List all PCC backends on a client
27.3.5. Attach given files into PCC
27.3.6. Attach given files into PCC by FID(s)
27.3.7. Detach given files from PCC
27.3.8. Detach given files from PCC by FID(s)
27.3.9. Display the PCC state for given files
27.4. PCC Configuration Example
28. Mapping UIDs and GIDs with Nodemap L 2.9
28.1. Setting a Mapping
28.1.1. Defining Terms
28.1.2. Deciding on NID Ranges
28.1.3. Defining a Servers Specific Group
28.1.4. Describing and Deploying a Sample Mapping
28.1.5. Mapping Project IDs L 2.15
28.2. Removing Nodemaps
28.3. Altering Properties
28.3.1. Managing the Properties
28.3.2. Mixing Properties
28.4. Enabling the Feature
28.5. default Nodemap
28.6. Verifying Settings
28.7. Ensuring Consistency
29. Configuring Shared-Secret Key (SSK) Security L 2.9
29.1. SSK Security Overview
29.1.1. Key features
29.2. SSK Security Flavors
29.2.1. Secure RPC Rules
29.3. SSK Key Files
29.3.1. Key File Management
29.4. Lustre GSS Keyring
29.4.1. Setup
29.4.2. Server Setup
29.4.3. Debugging GSS Keyring
29.4.4. Revoking Keys
29.5. Role of Nodemap in SSK
29.6. SSK Examples
29.6.1. Securing Client to Server Communications
29.6.2. Securing MGS Communications
29.6.3. Securing Server to Server Communications
29.7. Viewing Secure PtlRPC Contexts
30. Managing Security in a Lustre File System
30.1. Using ACLs
30.1.1. How ACLs Work
30.1.2. Using ACLs with the Lustre Software
30.1.3. Examples
30.2. Using Root Squash
30.3. Isolating Clients to a Sub-directory Tree
30.3.1. Identifying Clients
30.3.2. Configuring Isolation
30.3.3. Making Isolation Permanent
30.4. Checking SELinux Policy Enforced by Lustre Clients L 2.13
30.4.1. Determining SELinux Policy Info
30.4.2. Enforcing SELinux Policy Check
30.4.3. Making SELinux Policy Check Permanent
30.4.4. Sending SELinux Status Info from Clients
30.5. Encrypting files and directories L 2.14
30.5.1. Client-side encryption access semantics
30.5.2. Client-side encryption key hierarchy
30.5.3. Client-side encryption modes and usage
30.5.4. Client-side encryption threat model
30.5.5. Manage encryption on directories
30.6. Configuring Kerberos (KRB) Security
30.6.1. What Is Kerberos?
30.6.2. Security Flavor
30.6.3. Kerberos Setup
30.6.4. Networking
30.6.5. Required packages
30.6.6. Build Lustre
30.6.7. Running
30.6.8. Secure MGS connection
31. Lustre ZFS Snapshots L 2.10
31.1. Introduction
31.1.1. Requirements
31.2. Configuration
31.3. Snapshot Operations
31.3.1. Creating a Snapshot
31.3.2. Delete a Snapshot
31.3.3. Mounting a Snapshot
31.3.4. Unmounting a Snapshot
31.3.5. List Snapshots
31.3.6. Modify Snapshot Attributes
31.4. Global Write Barriers
31.4.1. Impose Barrier
31.4.2. Remove Barrier
31.4.3. Query Barrier
31.4.4. Rescan Barrier
31.5. Snapshot Logs
31.6. Lustre Configuration Logs
IV. Tuning a Lustre File System for Performance
32. Testing Lustre Network Performance (LNet Self-Test)
32.1. LNet Self-Test Overview
32.1.1. Prerequisites
32.2. Using LNet Self-Test
32.2.1. Creating a Session
32.2.2. Setting Up Groups
32.2.3. Defining and Running the Tests
32.2.4. Sample Script
32.3. LNet Self-Test Command Reference
32.3.1. Session Commands
32.3.2. Group Commands
32.3.3. Batch and Test Commands
32.3.4. Other Commands
33. Benchmarking Lustre File System Performance (Lustre I/O Kit)
33.1. Using Lustre I/O Kit Tools
33.1.1. Contents of the Lustre I/O Kit
33.1.2. Preparing to Use the Lustre I/O Kit
33.2. Testing I/O Performance of Raw Hardware (sgpdd-survey)
33.2.1. Tuning Linux Storage Devices
33.2.2. Running sgpdd-survey
33.3. Testing OST Performance (obdfilter-survey)
33.3.1. Testing Local Disk Performance
33.3.2. Testing Network Performance
33.3.3. Testing Remote Disk Performance
33.3.4. Output Files
33.4. Testing OST I/O Performance (ost-survey)
33.5. Testing MDS Performance (mds-survey)
33.5.1. Output Files
33.5.2. Script Output
33.6. Collecting Application Profiling Information ( stats-collect)
33.6.1. Using stats-collect
34. Tuning a Lustre File System
34.1. Optimizing the Number of Service Threads
34.1.1. Specifying the OSS Service Thread Count
34.1.2. Specifying the MDS Service Thread Count
34.2. Binding MDS Service Thread to CPU Partitions
34.3. Tuning LNet Parameters
34.3.1. Transmit and Receive Buffer Size
34.3.2. Hardware Interrupts ( enable_irq_affinity)
34.3.3. Binding Network Interface Against CPU Partitions
34.3.4. Network Interface Credits
34.3.5. Router Buffers
34.3.6. Portal Round-Robin
34.3.7. LNet Peer Health
34.4. libcfs Tuning
34.4.1. CPU Partition String Patterns
34.5. LND Tuning
34.5.1. ko2iblnd Tuning
34.6. Network Request Scheduler (NRS) Tuning
34.6.1. First In, First Out (FIFO) policy
34.6.2. Client Round-Robin over NIDs (CRR-N) policy
34.6.3. Object-based Round-Robin (ORR) policy
34.6.4. Target-based Round-Robin (TRR) policy
34.6.5. Token Bucket Filter (TBF) policy L 2.6
34.6.6. Delay policy L 2.10
34.7. Lockless I/O Tunables
34.8. Server-Side Advice and Hinting L 2.9
34.8.1. Overview
34.8.2. Examples
34.9. Large Bulk IO (16MB RPC) L 2.9
34.9.1. Overview
34.9.2. Usage
34.10. Improving Lustre I/O Performance for Small Files
34.11. Understanding Why Write Performance is Better Than Read Performance
V. Troubleshooting a Lustre File System
35. Lustre File System Troubleshooting
35.1. Lustre Error Messages
35.1.1. Error Numbers
35.1.2. Viewing Error Messages
35.2. Reporting a Lustre File System Bug
35.2.1. Searching Jira* for Duplicate Tickets
35.3. Common Lustre File System Problems
35.3.1. OST Object is Missing or Damaged
35.3.2. OSTs Become Read-Only
35.3.3. Identifying a Missing OST
35.3.4. Fixing a Bad LAST_ID on an OST
35.3.5. Handling/Debugging "Bind: Address already in use" Error
35.3.6. Handling/Debugging Error "- 28"
35.3.7. Triggering Watchdog for PID NNN
35.3.8. Handling Timeouts on Initial Lustre File System Setup
35.3.9. Handling/Debugging "LustreError: xxx went back in time"
35.3.10. Lustre Error: "Slow Start_Page_Write"
35.3.11. Drawbacks in Doing Multi-client O_APPEND Writes
35.3.12. Slowdown Occurs During Lustre File System Startup
35.3.13. Log Message 'Out of Memory' on OST
35.3.14. Setting SCSI I/O Sizes
36. Troubleshooting Recovery
36.1. Recovering from Errors or Corruption on a Backing ldiskfs File System
36.2. Recovering from Corruption in the Lustre File System
36.2.1. Working with Orphaned Objects
36.3. Recovering from an Unavailable OST
36.4. Checking the file system with LFSCK
36.4.1. LFSCK switch interface
36.4.2. Check the LFSCK global status L 2.9
36.4.3. LFSCK status interface
36.4.4. LFSCK adjustment interface
37. Debugging a Lustre File System
37.1. Diagnostic and Debugging Tools
37.1.1. Lustre Debugging Tools
37.1.2. External Debugging Tools
37.2. Lustre Debugging Procedures
37.2.1. Understanding the Lustre Debug Messaging Format
37.2.2. Using the lctl Tool to View Debug Messages
37.2.3. Dumping the Buffer to a File (debug_daemon)
37.2.4. Controlling Information Written to the Kernel Debug Log
37.2.5. Troubleshooting with strace
37.2.6. Looking at Disk Content
37.2.7. Finding the Lustre UUID of an OST
37.2.8. Printing Debug Messages to the Console
37.2.9. Tracing Lock Traffic
37.2.10. Controlling Console Message Rate Limiting
37.3. Lustre Debugging for Developers
37.3.1. Adding Debugging to the Lustre Source Code
37.3.2. Accessing the ptlrpc Request History
37.3.3. Finding Memory Leaks Using leak_finder.pl
VI. Reference
38. Lustre File System Recovery
38.1. Recovery Overview
38.1.1. Client Failure
38.1.2. Client Eviction
38.1.3. MDS Failure (Failover)
38.1.4. OST Failure (Failover)
38.1.5. Network Partition
38.1.6. Failed Recovery
38.2. Metadata Replay
38.2.1. XID Numbers
38.2.2. Transaction Numbers
38.2.3. Replay and Resend
38.2.4. Client Replay List
38.2.5. Server Recovery
38.2.6. Request Replay
38.2.7. Gaps in the Replay Sequence
38.2.8. Lock Recovery
38.2.9. Request Resend
38.3. Reply Reconstruction
38.3.1. Required State
38.3.2. Reconstruction of Open Replies
38.3.3. Multiple Reply Data per Client L 2.8
38.4. Version-based Recovery
38.4.1. VBR Messages
38.4.2. Tips for Using VBR
38.5. Commit on Share
38.5.1. Working with Commit on Share
38.5.2. Tuning Commit On Share
38.6. Imperative Recovery
38.6.1. MGS role
38.6.2. Tuning Imperative Recovery
38.6.3. Configuration Suggestions for Imperative Recovery
38.7. Suppressing Pings
38.7.1. "suppress_pings" Kernel Module Parameter
38.7.2. Client Death Notification
39. Lustre Parameters
39.1. Introduction to Lustre Parameters
39.1.1. Identifying Lustre File Systems and Servers
39.2. Tuning Multi-Block Allocation (mballoc)
39.3. Monitoring Lustre File System I/O
39.3.1. Monitoring the Client RPC Stream
39.3.2. Monitoring Client Activity
39.3.3. Monitoring Client Read-Write Offset Statistics
39.3.4. Monitoring Client Read-Write Extent Statistics
39.3.5. Monitoring the OST Block I/O Stream
39.4. Tuning Lustre File System I/O
39.4.1. Tuning the Client I/O RPC Stream
39.4.2. Tuning File Readahead and Directory Statahead
39.4.3. Tuning Server Read Cache
39.4.4. Enabling OSS Asynchronous Journal Commit
39.4.5. Tuning the Client Metadata RPC Stream L 2.8
39.5. Configuring Timeouts in a Lustre File System
39.5.1. Configuring Adaptive Timeouts
39.5.2. Setting Static Timeouts
39.6. Monitoring LNet
39.7. Allocating Free Space on OSTs
39.8. Configuring Locking
39.9. Setting MDS and OSS Thread Counts
39.10. Enabling and Interpreting Debugging Logs
39.10.1. Interpreting OST Statistics
39.10.2. Interpreting MDT Statistics
40. User Utilities
40.1. lfs
40.1.1. Synopsis
40.1.2. Description
40.1.3. Options
40.1.4. Examples
40.1.5. See Also
40.2. lfs_migrate
40.2.1. Synopsis
40.2.2. Description
40.2.3. Options
40.2.4. Examples
40.2.5. See Also
40.3. filefrag
40.3.1. Synopsis
40.3.2. Description
40.3.3. Options
40.3.4. Examples
40.4. mount
40.5. Handling Timeouts
41. Programming Interfaces
41.1. User/Group Upcall
41.1.1. Synopsis
41.1.2. Description
41.1.3. Data Structures
42. Setting Lustre Properties in a C Program (llapi)
42.1. llapi_file_create
42.1.1. Synopsis
42.1.2. Description
42.1.3. Examples
42.2. llapi_file_get_stripe
42.2.1. Synopsis
42.2.2. Description
42.2.3. Return Values
42.2.4. Errors
42.2.5. Examples
42.3. llapi_file_open
42.3.1. Synopsis
42.3.2. Description
42.3.3. Return Values
42.3.4. Errors
42.3.5. Example
42.4. llapi_quotactl
42.4.1. Synopsis
42.4.2. Description
42.4.3. Return Values
42.4.4. Errors
42.5. llapi_path2fid
42.5.1. Synopsis
42.5.2. Description
42.5.3. Return Values
42.6. llapi_ladvise L 2.9
42.6.1. Synopsis
42.6.2. Description
42.6.3. Return Values
42.6.4. Errors
42.7. Example Using the llapi Library
42.7.1. See Also
43. Configuration Files and Module Parameters
43.1. Introduction
43.2. Module Options
43.2.1. LNet Options
43.2.2. SOCKLND Kernel TCP/IP LND
44. System Configuration Utilities
44.1. l_getidentity
44.1.1. Synopsis
44.1.2. Description
44.1.3. Options
44.1.4. Files
44.2. lctl
44.2.1. Synopsis
44.2.2. Description
44.2.3. Setting Parameters with lctl
44.2.4. Options
44.2.5. Examples
44.2.6. See Also
44.3. ll_decode_filter_fid
44.3.1. Synopsis
44.3.2. Description
44.3.3. Examples
44.4. llobdstat
44.4.1. Synopsis
44.4.2. Description
44.4.3. Example
44.4.4. Files
44.5. llog_reader
44.5.1. Synopsis
44.5.2. Description
44.5.3. See Also
44.6. llstat
44.6.1. Synopsis
44.6.2. Description
44.6.3. Options
44.6.4. Example
44.6.5. Files
44.7. llverdev
44.7.1. Synopsis
44.7.2. Description
44.7.3. Options
44.7.4. Examples
44.8. lshowmount
44.8.1. Synopsis
44.8.2. Description
44.8.3. Options
44.8.4. Files
44.9. lst
44.9.1. Synopsis
44.9.2. Description
44.9.3. Modules
44.9.4. Utilities
44.9.5. Example Script
44.10. lustre_rmmod.sh
44.11. lustre_rsync
44.11.1. Synopsis
44.11.2. Description
44.11.3. Options
44.11.4. Examples
44.11.5. See Also
44.12. mkfs.lustre
44.12.1. Synopsis
44.12.2. Description
44.12.3. Examples
44.12.4. See Also
44.13. mount.lustre
44.13.1. Synopsis
44.13.2. Description
44.13.3. Options
44.13.4. Examples
44.13.5. See Also
44.14. routerstat
44.14.1. Synopsis
44.14.2. Description
44.14.3. Output
44.14.4. Example
44.14.5. Files
44.15. tunefs.lustre
44.15.1. Synopsis
44.15.2. Description
44.15.3. Options
44.15.4. Examples
44.15.5. See Also
44.16. Additional System Configuration Utilities
44.16.1. More Statistics for Application Profiling
44.16.2. Testing / Debugging Utilities
44.16.3. Fileset Feature L 2.9
45. LNet Configuration C-API
45.1. General API Information
45.1.1. API Return Code
45.1.2. API Common Input Parameters
45.1.3. API Common Output Parameters
45.2. The LNet Configuration C-API
45.2.1. Configuring LNet
45.2.2. Enabling and Disabling Routing
45.2.3. Adding Routes
45.2.4. Deleting Routes
45.2.5. Showing Routes
45.2.6. Adding a Network Interface
45.2.7. Deleting a Network Interface
45.2.8. Showing Network Interfaces
45.2.9. Adjusting Router Buffer Pools
45.2.10. Showing Routing information
45.2.11. Showing LNet Traffic Statistics
45.2.12. Adding/Deleting/Showing Parameters through a YAML Block
45.2.13. Adding a route code example
Glossary
Index

List of Figures

1.1. Lustre file system components in a basic cluster
1.2. Lustre cluster at scale
1.3. Layout EA on MDT pointing to file data on OSTs
1.4. Lustre client requesting file data
1.5. File striping on a Lustre file system
3.1. Lustre failover configuration for an active/passive MDT
3.2. Lustre failover configuration for active/active MDTs
3.3. Lustre failover configuration for OSTs
16.1. Routing Configuration with Multi-Rail
19.1. PFL object mapping diagram
19.2. Example: create a composite file
19.3. Example: add a component to an existing composite file
19.4. Example: delete a component from an existing file
19.5. Example: migrate normal to composite
19.6. Example: migrate composite to composite
19.7. Example: migrate composite to normal
19.8. Example: create a SEL file
19.9. Example: an extension of a SEL file
19.10. Example: a spillover in a SEL file
19.11. Example: repeat a SEL component
19.12. Example: forced extension in a SEL file
19.13. LOV/LMV foreign format
19.14. Example: create a foreign file
20.1. Resulting file layout
22.1. FLR Delayed Write
26.1. Overview of the Lustre file system HSM
27.1. Overview of PCC-RW Architecture
34.1. One of Two Connections to o2ib0 Down
34.2. Both Connections to o2ib0 Down
34.3. Connection to o2ib1 Down
34.4. Connection to o2ib1 Never Came Up
34.5. The internal structure of TBF policy
44.1. Lustre fileset

List of Tables

1.1. Lustre File System Scalability and Performance
1.2. Storage and hardware requirements for Lustre file system components
5.1. Default Inode Ratios Used for Newly Formatted OSTs
5.2. File and file system limits
8.1. Packages Installed on Lustre Servers
8.2. Packages Installed on Lustre Clients
8.3. Network Types Supported by Lustre LNDs
10.1. Default stripe pattern
16.1. Configuring Module Parameters
29.1. SSK Security Flavor Protections
29.2. lgss_sk Parameters
29.3. lsvcgssd Parameters
29.4. Key Descriptions
31.1. Write Barrier Status

List of Examples

34.1. lustre.conf

Preface

The Lustre* Software Release 2.x Operations Manual provides detailed information and procedures to install, configure and tune a Lustre file system. The manual covers topics such as failover, quotas, striping, and bonding. This manual also contains troubleshooting information and tips to improve the operation and performance of a Lustre file system.

1. About this Document

This document is maintained by Whamcloud in DocBook format. The canonical version is available at https://wiki.whamcloud.com/display/PUB/Documentation.

1.1. UNIX* Commands

This document does not contain information about basic UNIX* operating system commands and procedures such as shutting down the system, booting the system, and configuring devices. Refer to the following for this information:

  • Software documentation that you received with your system

  • Red Hat* Enterprise Linux* documentation, which is at: https://docs.redhat.com/docs/en-US/index.html

    Note

    The Lustre client module is available for many different Linux* versions and distributions. The Red Hat Enterprise Linux distribution is the best supported and tested platform for Lustre servers.

1.2. Shell Prompts

The shell prompt used in the example text indicates whether a command can or should be executed by a regular user, or whether it requires superuser permission to run. Also, the machine type is often included in the prompt to indicate whether the command should be run on a client node, on an MDS node, an OSS node, or the MGS node.

Some examples are listed below, but other prompt combinations are also used as needed for the example.

Shell                        Prompt

Regular user                 machine$
Superuser (root)             machine#
Regular user on the client   client$
Superuser on the MDS         mds#
Superuser on the OSS         oss#
Superuser on the MGS         mgs#

1.3. Related Documentation

Application          Title                                           Format      Location

Latest information   Lustre Software Release 2.x Change Logs         Wiki page   Online at https://wiki.whamcloud.com/display/PUB/Documentation
Service              Lustre Software Release 2.x Operations Manual   PDF, HTML   Online at https://wiki.whamcloud.com/display/PUB/Documentation

1.4. Documentation and Support

These web sites provide additional resources:

2. Revisions

The Lustre* File System Release 2.x Operations Manual is a community maintained work. Versions of the manual are continually built as suggestions for changes and improvements arrive. Suggestions for improvements can be submitted through the ticketing system maintained at https://jira.whamcloud.com/browse/LUDOC. Instructions for providing a patch to the existing manual are available at: http://wiki.lustre.org/Lustre_Manual_Changes.

Introduced in Lustre 2.5

This manual covers a range of Lustre 2.x software releases, currently starting with the 2.5 release. Features specific to individual releases are identified within the table of contents using a shorthand notation (e.g. this paragraph is tagged as a Lustre 2.5 specific feature so that it will be updated when the 2.5-specific tagging is removed), and within the text using a distinct box.

Which version am I running?

The current version of Lustre that is in use on the node can be found using the command lctl get_param version on any Lustre client or server, for example:

$ lctl get_param version
version=2.10.5

Only the latest revision of this document is made readily available because changes are continually arriving. The current and latest revision of this manual is available from links maintained at: http://lustre.opensfs.org/documentation/.

Revision History
Revision 0, built on 03 December 2024 07:57:11Z
Continuous build of Manual.

Part I. Introducing the Lustre* File System

Chapter 1. Understanding Lustre Architecture

This chapter describes the Lustre architecture and features of the Lustre file system. It includes the following sections:

1.1.  What a Lustre File System Is (and What It Isn't)

The Lustre architecture is a storage architecture for clusters. The central component of the Lustre architecture is the Lustre file system, which is supported on the Linux operating system and provides a POSIX* standard-compliant UNIX file system interface.

The Lustre storage architecture is used for many different kinds of clusters. It is best known for powering many of the largest high-performance computing (HPC) clusters worldwide, with tens of thousands of client systems, petabytes (PiB) of storage and hundreds of gigabytes per second (GB/sec) of I/O throughput. Many HPC sites use a Lustre file system as a site-wide global file system, serving dozens of clusters.

The ability of a Lustre file system to scale capacity and performance for any need reduces the need to deploy many separate file systems, such as one for each compute cluster. Storage management is simplified by avoiding the need to copy data between compute clusters. In addition to aggregating storage capacity of many servers, the I/O throughput is also aggregated and scales with additional servers. Moreover, throughput and/or capacity can be easily increased by adding servers dynamically.

While a Lustre file system can function in many work environments, it is not necessarily the best choice for all applications. It is best suited for uses that exceed the capacity that a single server can provide, though in some use cases, a Lustre file system can perform better with a single server than other file systems due to its strong locking and data coherency.

A Lustre file system is currently not particularly well suited for "peer-to-peer" usage models where clients and servers are running on the same node, each sharing a small amount of storage, due to the lack of data replication at the Lustre software level. In such uses, if one client/server fails, then the data stored on that node will not be accessible until the node is restarted.

1.1.1.  Lustre Features

Lustre file systems run on a variety of vendors' kernels. For more details, see the Lustre Test Matrix in Section 8.1, “Preparing to Install the Lustre Software”.

A Lustre installation can be scaled up or down with respect to the number of client nodes, disk storage and bandwidth. Scalability and performance are dependent on available disk and network bandwidth and the processing power of the servers in the system. A Lustre file system can be deployed in a wide variety of configurations that can be scaled well beyond the size and performance observed in production systems to date.

Table 1.1, “Lustre File System Scalability and Performance” shows some of the scalability and performance characteristics of a Lustre file system. For a full list of Lustre file and filesystem limits see Table 5.2, “File and file system limits”.

Table 1.1. Lustre File System Scalability and Performance

Client Scalability
  Current Practical Range: 100-100000
  Known Production Usage:  50000+ clients, many in the 10000 to 20000 range

Client Performance
  Current Practical Range: Single client: I/O 90% of network bandwidth. Aggregate: 50 TB/sec I/O, 50M IOPS
  Known Production Usage:  Single client: 15 GB/sec I/O (HDR IB), 50000 IOPS. Aggregate: 10 TB/sec I/O, 10M IOPS

OSS Scalability
  Current Practical Range: Single OSS: 1-32 OSTs per OSS. Single OST: 500M objects, 1024TiB per OST. OSS count: 1000 OSSs, 4000 OSTs
  Known Production Usage:  Single OSS: 4 OSTs per OSS. Single OST: 1024TiB OSTs. OSS count: 450 OSSs with 900 750TiB HDD OSTs + 450 25TiB NVMe OSTs; 1024 OSSs with 1024 72TiB OSTs

OSS Performance
  Current Practical Range: Single OSS: 15 GB/sec, 1.5M IOPS. Aggregate: 50 TB/sec, 50M IOPS
  Known Production Usage:  Single OSS: 10 GB/sec, 1.5M IOPS. Aggregate: 20 TB/sec, 20M IOPS

MDS Scalability
  Current Practical Range: Single MDS: 1-4 MDTs per MDS. Single MDT: 4 billion files, 16TiB per MDT (ldiskfs); 64 billion files, 64TiB per MDT (ZFS). MDS count: 256 MDSs, up to 256 MDTs
  Known Production Usage:  Single MDS: 4 billion files. MDS count: 40 MDS with 40 4TiB MDTs in production; 256 MDS with 256 64GiB MDTs in testing

MDS Performance
  Current Practical Range: 1M/s create operations, 2M/s stat operations
  Known Production Usage:  100k/s create operations, 200k/s metadata stat operations

File system Scalability
  Current Practical Range: Single file: 32 PiB max file size (ldiskfs); 2^63 bytes (ZFS). Aggregate: 512 PiB space, 1 trillion files
  Known Production Usage:  Single file: multi-TiB max file size. Aggregate: 700 PiB space, 25 billion files


Other Lustre software features are:

  • Performance-enhanced ext4 file system: The Lustre file system uses an improved version of the ext4 journaling file system to store data and metadata. This version, called ldiskfs, has been enhanced to improve performance and provide additional functionality needed by the Lustre file system.

  • It is also possible to use ZFS as the backing filesystem for Lustre for the MDT, OST, and MGS storage. This allows Lustre to leverage the scalability and data integrity features of ZFS for individual storage targets.

  • POSIX standard compliance: The full POSIX test suite passes in an identical manner to a local ext4 file system, with limited exceptions on Lustre clients. In a cluster, most operations are atomic so that clients never see stale data or metadata. The Lustre software supports mmap() file I/O.

  • High-performance heterogeneous networking: The Lustre software supports a variety of high performance, low latency networks and permits Remote Direct Memory Access (RDMA) for InfiniBand* (utilizing OpenFabrics Enterprise Distribution (OFED*)), Intel OmniPath®, and other advanced networks for fast and efficient network transport. Multiple RDMA networks can be bridged using Lustre routing for maximum performance. The Lustre software also includes integrated network diagnostics.

  • High-availability: The Lustre file system supports active/active failover using shared storage partitions for OSS targets (OSTs), and for MDS targets (MDTs). The Lustre file system can work with a variety of high availability (HA) managers to allow automated failover and has no single point of failure (NSPF). This allows application-transparent recovery. Multiple mount protection (MMP) provides integrated protection from errors in highly-available systems that would otherwise cause file system corruption.

  • Security: By default TCP connections are only allowed from privileged ports. UNIX group membership is verified on the MDS.

  • Access control list (ACL), extended attributes: The Lustre security model follows that of a UNIX file system, enhanced with POSIX ACLs. Noteworthy additional features include root squash.

  • Interoperability: The Lustre file system runs on a variety of CPU architectures and mixed-endian clusters and is interoperable between successive major Lustre software releases.

  • Object-based architecture: Clients are isolated from the on-disk file structure, enabling upgrading of the storage architecture without affecting the client.

  • Byte-granular file and fine-grained metadata locking: Many clients can read and modify the same file or directory concurrently. The Lustre distributed lock manager (LDLM) ensures that files are coherent between all clients and servers in the file system. The MDT LDLM manages locks on inode permissions and pathnames. Each OST has its own LDLM for locks on file stripes stored thereon, which scales the locking performance as the file system grows.

  • Quotas: User and group quotas are available for a Lustre file system.

  • Capacity growth: The size of a Lustre file system and aggregate cluster bandwidth can be increased without interruption by adding new OSTs and MDTs to the cluster.

  • Controlled file layout: The layout of files across OSTs can be configured on a per file, per directory, or per file system basis. This allows file I/O to be tuned to specific application requirements within a single file system. The Lustre file system uses RAID-0 striping and balances space usage across OSTs. (A brief example follows this list.)

  • Network data integrity protection: A checksum of all data sent from the client to the OSS protects against corruption during data transfer.

  • MPI I/O: The Lustre architecture has a dedicated MPI ADIO layer that optimizes parallel I/O to match the underlying file system architecture.

  • NFS and CIFS export: Lustre files can be re-exported using NFS (via Linux knfsd or Ganesha) or CIFS (via Samba), enabling them to be shared with non-Linux clients such as Microsoft* Windows*, Apple* Mac OS X*, and others.

  • Disaster recovery tool: The Lustre file system provides an online distributed file system check (LFSCK) that can restore consistency between storage components in case of a major file system error. A Lustre file system can operate even in the presence of file system inconsistencies, and LFSCK can run while the filesystem is in use, so LFSCK is not required to complete before returning the file system to production.

  • Performance monitoring: The Lustre file system offers a variety of mechanisms to examine performance and tuning.

  • Open source: The Lustre software is licensed under the GPL 2.0 license for use with the Linux operating system.
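
The following minimal sketch illustrates the controlled file layout feature referenced in the list above, using the lfs utility described later in this manual; the mount point, directory name, stripe count, and stripe size are placeholder values chosen only for illustration:

client$ lfs setstripe -c 4 -S 4M /mnt/lustre/results
client$ lfs getstripe /mnt/lustre/results

Files subsequently created in the directory inherit this default layout unless a different layout is specified for them explicitly.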

1.2.  Lustre Components

An installation of the Lustre software includes a management server (MGS) and one or more Lustre file systems interconnected with Lustre networking (LNet).

A basic configuration of Lustre file system components is shown in Figure 1.1, “Lustre file system components in a basic cluster”.

Figure 1.1. Lustre file system components in a basic cluster

Lustre file system components in a basic cluster

1.2.1.  Management Server (MGS)

The MGS stores configuration information for all the Lustre file systems in a cluster and provides this information to other Lustre components. Each Lustre target contacts the MGS to provide information, and Lustre clients contact the MGS to retrieve information.

It is preferable that the MGS have its own storage space so that it can be managed independently. However, the MGS can be co-located and share storage space with an MDS as shown in Figure 1.1, “Lustre file system components in a basic cluster”.
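
For example, mkfs.lustre (described in the installation and configuration chapters) can format either a combined MGS/MDT target or a standalone MGS target; the file system name and device paths shown below are placeholders only:

mds# mkfs.lustre --fsname=testfs --mgs --mdt --index=0 /dev/sda

mgs# mkfs.lustre --mgs /dev/sdb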

1.2.2. Lustre File System Components

Each Lustre file system consists of the following components:

  • Metadata Servers (MDS)- The MDS makes metadata stored in one or more MDTs available to Lustre clients. Each MDS manages the names and directories in the Lustre file system(s) and provides network request handling for one or more local MDTs.

  • Metadata Targets (MDT) - Each file system has at least one MDT, which holds the root directory. The MDT stores metadata (such as filenames, directories, permissions and file layout) on storage attached to an MDS. An MDT on a shared storage target can be available to multiple MDSs, although only one can access it at a time. If an active MDS fails, a second MDS node can serve the MDT and make it available to clients. This is referred to as MDS failover.

    Multiple MDTs are supported with the Distributed Namespace Environment (DNE). In addition to the primary MDT that holds the filesystem root, it is possible to add additional MDS nodes, each with their own MDTs, to hold sub-directory trees of the filesystem.

    Introduced in Lustre 2.8

    Since Lustre software release 2.8, DNE also allows the filesystem to distribute files of a single directory over multiple MDT nodes. A directory which is distributed across multiple MDTs is known as a Striped Directory.

  • Object Storage Servers (OSS): The OSS provides file I/O service and network request handling for one or more local OSTs. Typically, an OSS serves between two and eight OSTs, up to 16 TiB each. A typical configuration is an MDT on a dedicated node, two or more OSTs on each OSS node, and a client on each of a large number of compute nodes.

  • Object Storage Target (OST): User file data is stored in one or more objects, each object on a separate OST in a Lustre file system. The number of objects per file is configurable by the user and can be tuned to optimize performance for a given workload.

  • Lustre clients: Lustre clients are computational, visualization or desktop nodes that are running Lustre client software, allowing them to mount the Lustre file system.

The Lustre client software provides an interface between the Linux virtual file system and the Lustre servers. The client software includes a management client (MGC), a metadata client (MDC), and multiple object storage clients (OSCs), one corresponding to each OST in the file system.

A logical object volume (LOV) aggregates the OSCs to provide transparent access across all the OSTs. Thus, a client with the Lustre file system mounted sees a single, coherent, synchronized namespace. Several clients can write to different parts of the same file simultaneously, while, at the same time, other clients can read from the file.

A logical metadata volume (LMV) aggregates the MDCs to provide transparent access across all the MDTs in a similar manner as the LOV does for file access. This allows the client to see the directory tree on multiple MDTs as a single coherent namespace, and striped directories are merged on the clients to form a single visible directory to users and applications.

Table 1.2, “Storage and hardware requirements for Lustre file system components” provides the requirements for attached storage for each Lustre file system component and describes desirable characteristics of the hardware used.

Table 1.2.  Storage and hardware requirements for Lustre file system components

MDSs
  Required attached storage: 1-2% of file system capacity
  Desirable hardware characteristics: Adequate CPU power, plenty of memory, fast disk storage.

OSSs
  Required attached storage: 1-128 TiB per OST, 1-8 OSTs per OSS
  Desirable hardware characteristics: Good bus bandwidth. It is recommended that storage be balanced evenly across OSSs and matched to network bandwidth.

Clients
  Required attached storage: No local storage needed
  Desirable hardware characteristics: Low latency, high bandwidth network.


For additional hardware requirements and considerations, see Chapter 5, Determining Hardware Configuration Requirements and Formatting Options.

1.2.3.  Lustre Networking (LNet)

Lustre Networking (LNet) is a custom networking API that provides the communication infrastructure that handles metadata and file I/O data for the Lustre file system servers and clients. For more information about LNet, see Chapter 2, Understanding Lustre Networking (LNet).

1.2.4.  Lustre Cluster

At scale, a Lustre file system cluster can include hundreds of OSSs and thousands of clients (see Figure 1.2, “ Lustre cluster at scale”). More than one type of network can be used in a Lustre cluster. Shared storage between OSSs enables failover capability. For more details about OSS failover, see Chapter 3, Understanding Failover in a Lustre File System.

Figure 1.2.  Lustre cluster at scale

Lustre file system cluster at scale

1.3.  Lustre File System Storage and I/O

Lustre File IDentifiers (FIDs) are used internally for identifying files or objects, similar to inode numbers in local filesystems. A FID is a 128-bit identifier, which contains a unique 64-bit sequence number (SEQ), a 32-bit object ID (OID), and a 32-bit version number. The sequence number is unique across all Lustre targets in a file system (OSTs and MDTs). This allows multiple MDTs and OSTs to uniquely identify objects without depending on identifiers in the underlying filesystem (e.g. inode numbers) that are likely to be duplicated between targets. The FID SEQ number also allows mapping a FID to a particular MDT or OST.
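
For illustration, a file's FID can be displayed from a client, and mapped back to a pathname, using the lfs path2fid and lfs fid2path commands (the path and FID value shown here are examples only):

client# lfs path2fid /mnt/testfs/somefile
client# lfs fid2path /mnt/testfs [0x200000402:0x1:0x0]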

The LFSCK file system consistency checking tool provides functionality that enables FID-in-dirent for existing files. It includes the following functionality:

  • Verifies the FID stored with each directory entry and regenerates it from the inode if it is invalid or missing.

  • Verifies the linkEA entry for each inode and regenerates it if invalid or missing. The linkEA stores the file name and parent FID. It is stored as an extended attribute in each inode. Thus, the linkEA can be used to reconstruct the full path name of a file from only the FID.

Information about where file data is located on the OST(s) is stored as an extended attribute called layout EA in an MDT object identified by the FID for the file (see Figure 1.3, “Layout EA on MDT pointing to file data on OSTs”). If the file is a regular file (not a directory or symbolic link), the MDT object points to 1-to-N OST object(s) on the OST(s) that contain the file data. If the MDT file points to one object, all the file data is stored in that object. If the MDT file points to more than one object, the file data is striped across the objects using RAID 0, and each object is stored on a different OST. (For more information about how striping is implemented in a Lustre file system, see Section 1.3.1, “Lustre File System and Striping”.)

Figure 1.3. Layout EA on MDT pointing to file data on OSTs

Layout EA on MDT pointing to file data on OSTs

When a client wants to read from or write to a file, it first fetches the layout EA from the MDT object for the file. The client then uses this information to perform I/O on the file, directly interacting with the OSS nodes where the objects are stored. This process is illustrated in Figure 1.4, “Lustre client requesting file data” .

Figure 1.4. Lustre client requesting file data

Lustre client requesting file data

The available bandwidth of a Lustre file system is determined as follows:

  • The network bandwidth equals the aggregated bandwidth of the OSSs to the targets.

  • The disk bandwidth equals the sum of the disk bandwidths of the storage targets (OSTs) up to the limit of the network bandwidth.

  • The aggregate bandwidth equals the minimum of the disk bandwidth and the network bandwidth.

  • The available file system space equals the sum of the available space of all the OSTs.

1.3.1.  Lustre File System and Striping

One of the main factors leading to the high performance of Lustre file systems is the ability to stripe data across multiple OSTs in a round-robin fashion. Users can optionally configure for each file the number of stripes, stripe size, and OSTs that are used.

Striping can be used to improve performance when the aggregate bandwidth to a single file exceeds the bandwidth of a single OST. The ability to stripe is also useful when a single OST does not have enough free space to hold an entire file. For more information about benefits and drawbacks of file striping, see Section 19.2, “ Lustre File Layout (Striping) Considerations”.

Striping allows segments or 'chunks' of data in a file to be stored on different OSTs, as shown in Figure 1.5, “File striping on a Lustre file system”. In the Lustre file system, a RAID 0 pattern is used in which data is "striped" across a certain number of objects. The number of objects in a single file is called the stripe_count.

Each object contains a chunk of data from the file. When the chunk of data being written to a particular object exceeds the stripe_size, the next chunk of data in the file is stored on the next object.

Default values for stripe_count and stripe_size are set for the file system. The default value for stripe_count is 1 stripe per file and the default value for stripe_size is 1 MB. The user may change these values on a per-directory or per-file basis. For more details, see Section 19.3, “Setting the File Layout/Striping Configuration (lfs setstripe)”.
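
For example, the following commands are a minimal sketch (the directory name is illustrative) of setting a layout of three stripes with a 4 MiB stripe size on a directory, so that new files created in it inherit that layout, and then displaying the resulting layout:

client# lfs setstripe -c 3 -S 4M /mnt/testfs/dir3stripe
client# lfs getstripe /mnt/testfs/dir3stripe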

As shown in Figure 1.5, “File striping on a Lustre file system”, the stripe_size for File C is larger than the stripe_size for File A, allowing more data to be stored in a single stripe for File C. The stripe_count for File A is 3, resulting in data striped across three objects, while the stripe_count for File B and File C is 1.

No space is reserved on the OST for unwritten data. File A in Figure 1.5, “File striping on a Lustre file system”, is sparse and is missing chunk 6.

Figure 1.5. File striping on a Lustre file system

File striping pattern across three OSTs for three different data files. The file is sparse and missing chunk 6.

The maximum file size is not limited by the size of a single target. In a Lustre file system, files can be striped across multiple objects (up to 2000), and each object can be up to 16 TiB in size with ldiskfs, or up to 256PiB with ZFS. This leads to a maximum file size of 31.25 PiB for ldiskfs or 8EiB with ZFS. Note that a Lustre file system can support files up to 2^63 bytes (8EiB), limited only by the space available on the OSTs.

Note

ldiskfs filesystems without the ea_inode feature limit the maximum stripe count for a single file to 160 OSTs.

Although a single file can only be striped over 2000 objects, Lustre file systems can have thousands of OSTs. The I/O bandwidth to access a single file is the aggregated I/O bandwidth to the objects in a file, which can be as much as the aggregate bandwidth of up to 2000 servers. On systems with more than 2000 OSTs, clients can do I/O using multiple files to utilize the full file system bandwidth.

For more information about striping, see Chapter 19, Managing File Layout (Striping) and Free Space.

Extended Attributes (xattrs)

Lustre uses the lov_user_md_v1/lov_user_md_v3 data structures to maintain its file striping information in xattrs. Extended attributes are created when files and directories are created. Lustre uses trusted extended attributes, which are accessible only by root, to store its parameters. The parameters are:

  • trusted.lov: Holds layout for a regular file, or default file layout stored on a directory (also accessible as lustre.lov for non-root users).

  • trusted.lma: Holds FID and extra state flags for current file

  • trusted.lmv: Holds layout for a striped directory (DNE 2), not present otherwise

  • trusted.link: Holds parent directory FID + filename for each link to a file (for lfs fid2path)

The xattrs stored for a file can be displayed using:

# getfattr -d -m - /mnt/testfs/file
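
The striping-related xattrs can also be displayed in a decoded form using the lfs getstripe command (the path is illustrative):

client# lfs getstripe -v /mnt/testfs/file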

Chapter 2. Understanding Lustre Networking (LNet)

This chapter introduces Lustre networking (LNet). It includes the following sections:

2.1.  Introducing LNet

In a cluster using one or more Lustre file systems, the network communication infrastructure required by the Lustre file system is implemented using the Lustre networking (LNet) feature.

LNet supports many commonly-used network types, such as InfiniBand and IP networks, and allows simultaneous availability across multiple network types with routing between them. Remote direct memory access (RDMA) is permitted when supported by underlying networks using the appropriate Lustre network driver (LND). High availability and recovery features enable transparent recovery in conjunction with failover servers.

An LND is a pluggable driver that provides support for a particular network type, for example ksocklnd is the driver which implements the TCP Socket LND that supports TCP networks. LNDs are loaded into the driver stack, with one LND for each network type in use.

For information about configuring LNet, see Chapter 9, Configuring Lustre Networking (LNet).

For information about administering LNet, see Part III, “Administering Lustre”.

2.2. Key Features of LNet

Key features of LNet include:

  • RDMA, when supported by underlying networks

  • Support for many commonly-used network types

  • High availability and recovery

  • Support of multiple network types simultaneously

  • Routing among disparate networks

LNet permits end-to-end read/write throughput at or near peak bandwidth rates on a variety of network interconnects.

2.3. Lustre Networks

A Lustre network is composed of clients and servers running the Lustre software. It need not be confined to one LNet subnet but can span several networks provided routing is possible between the networks. In a similar manner, a single network can have multiple LNet subnets.

The Lustre networking stack is composed of two layers: the LNet code module and the LND. The LNet layer operates above the LND layer in a manner similar to the way the network layer operates above the data link layer. The LNet layer is connectionless and asynchronous and does not verify that data has been transmitted, while the LND layer is connection-oriented and typically does verify data transmission.

LNets are uniquely identified by a label composed of a string corresponding to an LND and a number, such as tcp0, o2ib0, or o2ib1, that uniquely identifies each LNet. Each node on an LNet has at least one network identifier (NID). A NID is a combination of the address of the network interface and the LNet label, in the form: address@LNet_label.

Examples:

192.168.1.2@tcp0
10.13.24.90@o2ib1
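
These NIDs would correspond to nodes with interfaces configured on the tcp0 and o2ib1 networks. As a minimal sketch (the interface names are illustrative; see Chapter 9 for the full syntax), such networks can be declared with an LNet module parameter such as:

options lnet networks="tcp0(eth0),o2ib1(ib0)"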

In certain circumstances it might be desirable for Lustre file system traffic to pass between multiple LNets. This is possible using LNet routing. It is important to realize that LNet routing is not the same as network routing. For more details about LNet routing, see Chapter 9, Configuring Lustre Networking (LNet).

2.4. Supported Network Types

The LNet code module includes LNDs to support many network types including:

  • InfiniBand: OpenFabrics OFED (o2ib)

  • TCP (any network carrying TCP traffic, including GigE, 10GigE, and IPoIB)

  • RapidArray: ra

  • Quadrics: Elan

Chapter 3. Understanding Failover in a Lustre File System

This chapter describes failover in a Lustre file system. It includes:

3.1.  What is Failover?

In a high-availability (HA) system, unscheduled downtime is minimized by using redundant hardware and software components, together with software that automates recovery when a failure occurs. If a failure condition occurs, such as the loss of a server or storage device or a network or software fault, the system's services continue with minimal interruption. Generally, availability is specified as the percentage of time the system is required to be available.

Availability is accomplished by replicating hardware and/or software so that when a primary server fails or is unavailable, a standby server can be switched into its place to run applications and associated resources. This process, called failover, is automatic in an HA system and, in most cases, completely application-transparent.

A failover hardware setup requires a pair of servers with a shared resource (typically a physical storage device, which may be based on SAN, NAS, hardware RAID, SCSI or Fibre Channel (FC) technology). The method of sharing storage should be essentially transparent at the device level; the same physical logical unit number (LUN) should be visible from both servers. To ensure high availability at the physical storage level, we encourage the use of RAID arrays to protect against drive-level failures.

Note

The Lustre software does not provide redundancy for data; it depends exclusively on redundancy of backing storage devices. The backing OST storage should be RAID 5 or, preferably, RAID 6 storage. MDT storage should be RAID 1 or RAID 10.

3.1.1.  Failover Capabilities

To establish a highly-available Lustre file system, power management software or hardware and high availability (HA) software are used to provide the following failover capabilities:

  • Resource fencing- Protects physical storage from simultaneous access by two nodes.

  • Resource management- Starts and stops the Lustre resources as a part of failover, maintains the cluster state, and carries out other resource management tasks.

  • Health monitoring- Verifies the availability of hardware and network resources and responds to health indications provided by the Lustre software.

These capabilities can be provided by a variety of software and/or hardware solutions. For more information about using power management software or hardware and high availability (HA) software with a Lustre file system, see Chapter 11, Configuring Failover in a Lustre File System.

HA software is responsible for detecting failure of the primary Lustre server node and controlling the failover. The Lustre software works with any HA software that includes resource (I/O) fencing. For proper resource fencing, the HA software must be able to completely power off the failed server or disconnect it from the shared storage device. If two active nodes have access to the same storage device, data may be severely corrupted.

3.1.2.  Types of Failover Configurations

Nodes in a cluster can be configured for failover in several ways. They are often configured in pairs (for example, two OSSs attached to a shared storage device), but other failover configurations are also possible. Failover configurations include:

  • Active/passive pair - In this configuration, the active node provides resources and serves data, while the passive node is usually standing by idle. If the active node fails, the passive node takes over and becomes active.

  • Active/active pair - In this configuration, both nodes are active, each providing a subset of resources. In case of a failure, the second node takes over resources from the failed node.

If there is a single MDT in a filesystem, two MDSs can be configured as an active/passive pair, while pairs of OSSs can be deployed in an active/active configuration that improves OST availability without extra overhead. Often the standby MDS is the active MDS for another Lustre file system or the MGS, so no nodes are idle in the cluster. If there are multiple MDTs in a filesystem, active/active failover configurations are available for MDSs that serve MDTs on shared storage.

3.2.  Failover Functionality in a Lustre File System

The failover functionality provided by the Lustre software can be used for the following failover scenario. When a client attempts to do I/O to a failed Lustre target, it continues to try until it receives an answer from any of the configured failover nodes for the Lustre target. A user-space application does not detect anything unusual, except that the I/O may take longer to complete.

Failover in a Lustre file system requires that two nodes be configured as a failover pair, which must share one or more storage devices. A Lustre file system can be configured to provide MDT or OST failover.

  • For MDT failover, two MDSs can be configured to serve the same MDT. Only one MDS node can serve any MDT at one time. By placing two or more MDT devices on storage shared by two MDSs, one MDS can fail and the remaining MDS can begin serving the unserved MDT. This is described as an active/active failover pair.

  • For OST failover, multiple OSS nodes can be configured to be able to serve the same OST. However, only one OSS node can serve the OST at a time. An OST can be moved between OSS nodes that have access to the same storage device using umount/mount commands.
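
    As a sketch of such a manual move (the device and mount point names are illustrative), the OST is unmounted on the first OSS and then mounted on a second OSS that shares the same storage:

    oss1# umount /mnt/lustre/ost0
    oss2# mount -t lustre /dev/ost0_disk /mnt/lustre/ost0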

The --servicenode option is used to set up nodes in a Lustre file system for failover at creation time (using mkfs.lustre) or later when the Lustre file system is active (using tunefs.lustre). For explanations of these utilities, see Section 44.12, “mkfs.lustre” and Section 44.15, “tunefs.lustre”.
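
The following example is a sketch only (the file system name, NIDs, and device are illustrative) of formatting an OST with two failover service nodes:

[oss#] mkfs.lustre --fsname=testfs --ost --index=0 --mgsnode=10.2.0.1@tcp0 \
       --servicenode=10.2.0.11@tcp0 --servicenode=10.2.0.12@tcp0 /dev/sdb

The same --servicenode option can also be applied to an existing target with tunefs.lustre.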

Failover capability in a Lustre file system can be used to upgrade the Lustre software between successive minor versions without cluster downtime. For more information, see Chapter 17, Upgrading a Lustre File System.

For information about configuring failover, see Chapter 11, Configuring Failover in a Lustre File System.

Note

The Lustre software provides failover functionality only at the file system level. In a complete failover solution, failover functionality for system-level components, such as node failure detection or power control, must be provided by a third-party tool.

Caution

OST failover functionality does not protect against corruption caused by a disk failure. If the storage media (i.e., physical disk) used for an OST fails, it cannot be recovered by functionality provided in the Lustre software. We strongly recommend that some form of RAID be used for OSTs. Lustre functionality assumes that the storage is reliable, so it adds no extra reliability features.

3.2.1.  MDT Failover Configuration (Active/Passive)

Two MDSs are typically configured as an active/passive failover pair as shown in Figure 3.1, “Lustre failover configuration for an active/passive MDT”. Note that both nodes must have access to shared storage for the MDT(s) and the MGS. The primary (active) MDS manages the Lustre file system metadata resources. If the primary MDS fails, the secondary (passive) MDS takes over these resources and serves the MDTs and the MGS.

Note

In an environment with multiple file systems, the MDSs can be configured in a quasi active/active configuration, with each MDS managing metadata for a subset of the Lustre file system.

Figure 3.1. Lustre failover configuration for an active/passive MDT

Lustre failover configuration for an MDT

3.2.2.  MDT Failover Configuration (Active/Active)

MDTs can be configured in an active/active failover configuration. A failover cluster is built from two MDSs, as shown in Figure 3.2, “Lustre failover configuration for active/active MDTs”.

Figure 3.2. Lustre failover configuration for active/active MDTs

Lustre failover configuration for two MDTs

3.2.3.  OST Failover Configuration (Active/Active)

OSTs are usually configured in a load-balanced, active/active failover configuration. A failover cluster is built from two OSSs, as shown in Figure 3.3, “Lustre failover configuration for OSTs”.

Note

OSSs configured as a failover pair must have shared disks/RAID.

Figure 3.3. Lustre failover configuration for OSTs

Lustre failover configuration for OSTs

In an active/active configuration, 50% of the available OSTs are assigned to one OSS and the remaining OSTs are assigned to the other OSS. Each OSS serves as the primary node for half the OSTs and as a failover node for the remaining OSTs.

In this mode, if one OSS fails, the other OSS takes over all of the failed OSTs. The clients attempt to connect to each OSS serving the OST, until one of them responds. Data on the OST is written synchronously, and the clients replay transactions that were in progress and uncommitted to disk before the OST failure.

For more information about configuring failover, see Chapter 11, Configuring Failover in a Lustre File System.

Part II. Installing and Configuring Lustre

Part II describes how to install and configure a Lustre file system. You will find information in this section about:

Table of Contents

4. Installation Overview
4.1. Steps to Installing the Lustre Software
5. Determining Hardware Configuration Requirements and Formatting Options
5.1. Hardware Considerations
5.1.1. MGT and MDT Storage Hardware Considerations
5.1.2. OST Storage Hardware Considerations
5.2. Determining Space Requirements
5.2.1. Determining MGT Space Requirements
5.2.2. Determining MDT Space Requirements
5.2.3. Determining OST Space Requirements
5.3. Setting ldiskfs File System Formatting Options
5.3.1. Setting Formatting Options for an ldiskfs MDT
5.3.2. Setting Formatting Options for an ldiskfs OST
5.4. File and File System Limits
5.5. Determining Memory Requirements
5.5.1. Client Memory Requirements
5.5.2. MDS Memory Requirements
5.5.3. OSS Memory Requirements
5.6. Implementing Networks To Be Used by the Lustre File System
6. Configuring Storage on a Lustre File System
6.1. Selecting Storage for the MDT and OSTs
6.1.1. Metadata Target (MDT)
6.1.2. Object Storage Server (OST)
6.2. Reliability Best Practices
6.3. Performance Tradeoffs
6.4. Formatting Options for ldiskfs RAID Devices
6.4.1. Computing file system parameters for mkfs
6.4.2. Choosing Parameters for an External Journal
6.5. Connecting a SAN to a Lustre File System
7. Setting Up Network Interface Bonding
7.1. Network Interface Bonding Overview
7.2. Requirements
7.3. Bonding Module Parameters
7.4. Setting Up Bonding
7.4.1. Examples
7.5. Configuring a Lustre File System with Bonding
7.6. Bonding References
8. Installing the Lustre Software
8.1. Preparing to Install the Lustre Software
8.1.1. Software Requirements
8.1.2. Environmental Requirements
8.2. Lustre Software Installation Procedure
9. Configuring Lustre Networking (LNet)
9.1. Configuring LNet via lnetctl (Introduced in Lustre 2.7)
9.1.1. Configuring LNet
9.1.2. Displaying Global Settings
9.1.3. Adding, Deleting and Showing Networks
9.1.4. Manual Adding, Deleting and Showing Peers (Introduced in Lustre 2.10)
9.1.5. Dynamic Peer Discovery (Introduced in Lustre 2.11)
9.1.6. Adding, Deleting and Showing routes
9.1.7. Enabling and Disabling Routing
9.1.8. Showing routing information
9.1.9. Configuring Routing Buffers
9.1.10. Asymmetrical Routes (Introduced in Lustre 2.13)
9.1.11. Importing YAML Configuration File
9.1.12. Exporting Configuration in YAML format
9.1.13. Showing LNet Traffic Statistics
9.1.14. YAML Syntax
9.2. Overview of LNet Module Parameters
9.2.1. Using a Lustre Network Identifier (NID) to Identify a Node
9.3. Setting the LNet Module networks Parameter
9.3.1. Multihome Server Example
9.4. Setting the LNet Module ip2nets Parameter
9.5. Setting the LNet Module routes Parameter
9.5.1. Routing Example
9.6. Testing the LNet Configuration
9.7. Configuring the Router Checker
9.8. Best Practices for LNet Options
9.8.1. Escaping commas with quotes
9.8.2. Including comments
10. Configuring a Lustre File System
10.1. Configuring a Simple Lustre File System
10.1.1. Simple Lustre Configuration Example
10.2. Additional Configuration Options
10.2.1. Scaling the Lustre File System
10.2.2. Changing Striping Defaults
10.2.3. Using the Lustre Configuration Utilities
11. Configuring Failover in a Lustre File System
11.1. Setting Up a Failover Environment
11.1.1. Selecting Power Equipment
11.1.2. Selecting Power Management Software
11.1.3. Selecting High-Availability (HA) Software
11.2. Preparing a Lustre File System for Failover
11.3. Administering Failover in a Lustre File System

Chapter 4. Installation Overview

This chapter provides an overview of the procedures required to set up, install and configure a Lustre file system.

Note

If the Lustre file system is new to you, you may find it helpful to refer to Part I, “Introducing the Lustre* File System” for a description of the Lustre architecture, file system components and terminology before proceeding with the installation procedure.

4.1.  Steps to Installing the Lustre Software

To set up Lustre file system hardware and install and configure the Lustre software, refer to the chapters below in the order listed:

  1. (Required) Set up your Lustre file system hardware.

    See Chapter 5, Determining Hardware Configuration Requirements and Formatting Options - Provides guidelines for configuring hardware for a Lustre file system including storage, memory, and networking requirements.

  2. (Optional - Highly Recommended) Configure storage on Lustre storage devices.

    See Chapter 6, Configuring Storage on a Lustre File System - Provides instructions for setting up hardware RAID on Lustre storage devices.

  3. (Optional) Set up network interface bonding.

    See Chapter 7, Setting Up Network Interface Bonding - Describes setting up network interface bonding to allow multiple network interfaces to be used in parallel to increase bandwidth or redundancy.

  4. (Required) Install Lustre software.

    See Chapter 8, Installing the Lustre Software - Describes preparation steps and a procedure for installing the Lustre software.

  5. (Optional) Configure Lustre Networking (LNet).

    See Chapter 9, Configuring Lustre Networking (LNet) - Describes how to configure LNet if the default configuration is not sufficient. By default, LNet will use the first TCP/IP interface it discovers on a system. LNet configuration is required if you are using InfiniBand or multiple Ethernet interfaces.

  6. (Required) Configure the Lustre file system.

    See Chapter 10, Configuring a Lustre File System - Provides an example of a simple Lustre configuration procedure and points to tools for completing more complex configurations.

  7. (Optional) Configure Lustre failover.

    See Chapter 11, Configuring Failover in a Lustre File System - Describes how to configure Lustre failover.

Chapter 5. Determining Hardware Configuration Requirements and Formatting Options

This chapter describes hardware configuration requirements for a Lustre file system including:

5.1.  Hardware Considerations

A Lustre file system can utilize any kind of block storage device such as single disks, software RAID, hardware RAID, or a logical volume manager. In contrast to some networked file systems, the block devices are only attached to the MDS and OSS nodes in a Lustre file system and are not accessed by the clients directly.

Since the block devices are accessed by only one or two server nodes, a storage area network (SAN) that is accessible from all the servers is not required. Expensive switches are not needed because point-to-point connections between the servers and the storage arrays normally provide the simplest and best attachments. (If failover capability is desired, the storage must be attached to multiple servers.)

For a production environment, it is preferable that the MGS have separate storage to allow future expansion to multiple file systems. However, it is possible to run the MDS and MGS on the same machine and have them share the same storage device.

For best performance in a production environment, dedicated clients are required. For a non-production Lustre environment or for testing, a Lustre client and server can run on the same machine. However, for production use, dedicated clients are the only supported configuration.

Warning

Performance and recovery issues can occur if you put a client on an MDS or OSS:

  • Running the OSS and a client on the same machine can cause issues with low memory and memory pressure. If the client consumes all the memory and then tries to write data to the file system, the OSS will need to allocate pages to receive data from the client but will not be able to perform this operation due to low memory. This can cause the client to hang.

  • Running the MDS and a client on the same machine can cause recovery and deadlock issues and impact the performance of other Lustre clients.

Only servers running on 64-bit CPUs are tested and supported. 64-bit CPU clients are typically used for testing to match expected customer usage and avoid limitations due to the 4 GB limit for RAM size, 1 GB low-memory limitation, and 16 TB file size limit of 32-bit CPUs. Also, due to kernel API limitations, performing backups of Lustre filesystems on 32-bit clients may cause backup tools to confuse files that report the same 32-bit inode number, if the backup tools depend on the inode number for correct operation.

The storage attached to the servers typically uses RAID to provide fault tolerance and can optionally be organized with logical volume management (LVM), which is then formatted as a Lustre file system. Lustre OSS and MDS servers read, write and modify data in the format imposed by the file system.

The Lustre file system uses journaling file system technology on both the MDTs and OSTs. For an MDT, as much as a 20 percent performance gain can be obtained by placing the journal on a separate device.

The MDS can effectively utilize a lot of CPU cycles. A minimum of four processor cores is recommended. More are advisable for file systems with many clients.

Note

Running Lustre clients on CPU architectures that differ from the servers is supported. One limitation is that the PAGE_SIZE kernel macro on the client must be at least as large as the PAGE_SIZE on the server. In particular, ARM or PPC clients with large pages (up to 64 kB pages) can run with x86 servers (4 kB pages).

5.1.1.  MGT and MDT Storage Hardware Considerations

MGT storage requirements are small (less than 100 MB even in the largest Lustre file systems), and the data on an MGT is only accessed on a server/client mount, so disk performance is not a consideration. However, this data is vital for file system access, so the MGT should be reliable storage, preferably mirrored RAID1.

MDS storage is accessed in a database-like access pattern with many seeks and reads and writes of small amounts of data. Storage types that provide much lower seek times, such as SSD or NVMe, are strongly preferred for the MDT; high-RPM SAS is acceptable.

For maximum performance, the MDT should be configured as RAID1 with an internal journal and two disks from different controllers.

If you need a larger MDT, create multiple RAID1 devices from pairs of disks, and then make a RAID0 array of the RAID1 devices. For ZFS, use mirror VDEVs for the MDT. This ensures maximum reliability because multiple disk failures only have a small chance of hitting both disks in the same RAID1 device.

Doing the opposite (RAID1 of a pair of RAID0 devices) has a 50% chance that even two disk failures can cause the loss of the whole MDT device. The first failure disables an entire half of the mirror and the second failure has a 50% chance of disabling the remaining mirror.
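
As an illustrative sketch only (the device names are examples), the recommended RAID 1+0 layout for an ldiskfs MDT could be built from four disks with Linux MD before formatting the device with mkfs.lustre:

[mds#] mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sda /dev/sdb /dev/sdc /dev/sdd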

If multiple MDTs are going to be present in the system, each MDT should be sized for the anticipated usage and load. For details on how to add additional MDTs to the filesystem, see Section 14.7, “Adding a New MDT to a Lustre File System”.

Warning

MDT0000 contains the root of the Lustre file system. If MDT0000 is unavailable for any reason, the file system cannot be used.

Note

Using the DNE feature it is possible to dedicate additional MDTs to sub-directories off the file system root directory stored on MDT0000, or arbitrarily for lower-level subdirectories, using the lfs mkdir -i mdt_index command. If an MDT serving a subdirectory becomes unavailable, any subdirectories on that MDT and all directories beneath it will also become inaccessible. This is typically useful for top-level directories to assign different users or projects to separate MDTs, or to distribute other large working sets of files to multiple MDTs.
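
For example (the MDT index and directory name are illustrative), a remote directory served by MDT0001 can be created from a client with:

client# lfs mkdir -i 1 /mnt/testfs/project1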

Introduced in Lustre 2.8

Note

Starting in the 2.8 release it is possible to spread a single large directory across multiple MDTs using the DNE striped directory feature by specifying multiple stripes (or shards) at creation time using the lfs mkdir -c stripe_count command, where stripe_count is often the number of MDTs in the filesystem. Striped directories should not be used for all directories in the filesystem, since this incurs extra overhead compared to unstriped directories. This is intended for specific applications where many output files are being created in one large directory (over 50k entries).
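
For example (the stripe count and directory name are illustrative), a directory striped across four MDTs can be created from a client with:

client# lfs mkdir -c 4 /mnt/testfs/shared_dir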

5.1.2. OST Storage Hardware Considerations

The data access pattern for the OSS storage is a streaming I/O pattern that is dependent on the access patterns of applications being used. Each OSS can manage multiple object storage targets (OSTs), one for each volume with I/O traffic load-balanced between servers and targets. An OSS should be configured to have a balance between the network bandwidth and the attached storage bandwidth to prevent bottlenecks in the I/O path. Depending on the server hardware, an OSS typically serves between 2 and 8 targets, with each target between 24-48TB, but may be up to 256 terabytes (TBs) in size.

Lustre file system capacity is the sum of the capacities provided by the targets. For example, 64 OSSs, each with two 8 TB OSTs, provide a file system with a capacity of nearly 1 PB. If each OST uses ten 1 TB SATA disks (8 data disks plus 2 parity disks in a RAID-6 configuration), it may be possible to get 50 MB/sec from each drive, providing up to 400 MB/sec of disk bandwidth per OST. If this system is used as storage backend with a system network, such as the InfiniBand network, that provides a similar bandwidth, then each OSS could provide 800 MB/sec of end-to-end I/O throughput. (Although the architectural constraints described here are simple, in practice it takes careful hardware selection, benchmarking and integration to obtain such results.)

5.2.  Determining Space Requirements

The desired performance characteristics of the backing file systems on the MDT and OSTs are independent of one another. The size of the MDT backing file system depends on the number of inodes needed in the total Lustre file system, while the aggregate OST space depends on the total amount of data stored on the file system. If MGS data is to be stored on the MDT device (co-located MGT and MDT), add 100 MB to the required size estimate for the MDT.

Each time a file is created on a Lustre file system, it consumes one inode on the MDT and one object on each OST over which the file is striped. Normally, each file's stripe count is based on the system-wide default stripe count. However, this can be changed for individual files using the lfs setstripe option. For more details, see Chapter 19, Managing File Layout (Striping) and Free Space.

In a Lustre ldiskfs file system, all the MDT inodes and OST objects are allocated when the file system is first formatted. When the file system is in use and a file is created, metadata associated with that file is stored in one of the pre-allocated inodes and does not consume any of the free space used to store file data. The total number of inodes on a formatted ldiskfs MDT or OST cannot be easily changed. Thus, the number of inodes created at format time should be generous enough to anticipate near term expected usage, with some room for growth without the effort of additional storage.

By default, the ldiskfs file system used by Lustre servers to store user-data objects and system data reserves 5% of space that cannot be used by the Lustre file system. Additionally, an ldiskfs Lustre file system reserves up to 400 MB on each OST, and up to 4GB on each MDT for journal use and a small amount of space outside the journal to store accounting data. This reserved space is unusable for general storage. Thus, at least this much space will be used per OST before any file object data is saved.

With a ZFS backing filesystem for the MDT or OST, the space allocation for inodes and file data is dynamic, and inodes are allocated as needed. A minimum of 4kB of usable space (before mirroring) is needed for each inode, exclusive of other overhead such as directories, internal log files, extended attributes, ACLs, etc. ZFS also reserves approximately 3% of the total storage space for internal and redundant metadata, which is not usable by Lustre. Since the size of extended attributes and ACLs is highly dependent on kernel versions and site-specific policies, it is best to over-estimate the amount of space needed for the desired number of inodes, and any excess space will be utilized to store more inodes.

5.2.1.  Determining MGT Space Requirements

Less than 100 MB of space is typically required for the MGT. The size is determined by the total number of servers in the Lustre file system cluster(s) that are managed by the MGS.

5.2.2.  Determining MDT Space Requirements

When calculating the MDT size, the important factor to consider is the number of files to be stored in the file system, since each file requires at least 2 KiB of usable space on the MDT for its inode. Since MDTs typically use RAID-1+0 mirroring, the total storage needed will be double this.

Please note that the actual used space per MDT depends on the number of files per directory, the number of stripes per file, whether files have ACLs or user xattrs, and the number of hard links per file. The storage required for Lustre file system metadata is typically 1-2 percent of the total file system capacity depending upon file size. If the Chapter 20, Data on MDT (DoM) feature is in use for Lustre 2.11 or later, MDT space should typically be 5 percent or more of the total space, depending on the distribution of small files within the filesystem and the lod.*.dom_stripesize limit on the MDT and file layout used.

For ZFS-based MDT filesystems, the number of inodes created on the MDT and OST is dynamic, so there is less need to determine the number of inodes in advance, though there still needs to be some thought given to the total MDT space compared to the total filesystem size.

For example, if the average file size is 5 MiB and you have 500 TiB of usable OST space, then you can calculate the minimum total number of inodes for MDTs and OSTs as follows:

(500 TiB * 1024 * 1024 MiB/TiB) / 5 MiB/inode ≈ 100 million inodes

It is recommended that the MDT(s) have at least twice the minimum number of inodes to allow for future expansion and allow for an average file size smaller than expected. Thus, the minimum space for ldiskfs MDT(s) should be approximately:

2 KiB/inode x 100 million inodes x 2 = 400 GiB ldiskfs MDT

For details about formatting options for ldiskfs MDT and OST file systems, see Section 5.3.1, “Setting Formatting Options for an ldiskfs MDT”.

Note

If the median file size is very small, 4 KB for example, the MDT would use as much space for each file as the space used on the OST, so the use of Data-on-MDT is strongly recommended in that case. The MDT space per inode should be increased correspondingly to account for the extra data space usage for each inode:

6 KiB/inode x 100 million inodes x 2 = 1200 GiB ldiskfs MDT

Note

If the MDT has too few inodes, this can cause the space on the OSTs to be inaccessible since no new files can be created. In this case, the lfs df -i and df -i commands will limit the number of available inodes reported for the filesystem to match the total number of available objects on the OSTs. Be sure to determine the appropriate MDT size needed to support the filesystem before formatting. It is possible to increase the number of inodes after the file system is formatted, depending on the storage. For ldiskfs MDT filesystems the resize2fs tool can be used if the underlying block device is on an LVM logical volume and the underlying logical volume size can be increased. For ZFS new (mirrored) VDEVs can be added to the MDT pool to increase the total space available for inode storage. Inodes will be added approximately in proportion to space added.
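
As an illustrative sketch only (the volume group, pool, and device names are examples, and the target should be quiesced as appropriate), the ldiskfs and ZFS cases described above might look like:

mds# lvextend -L +1T /dev/vg_mdt/mdt0 && resize2fs /dev/vg_mdt/mdt0      # ldiskfs MDT on LVM
mds# zpool add mdt0pool mirror /dev/sde /dev/sdf                         # ZFS MDT pool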

Note

Note that the number of total and free inodes reported by lfs df -i for ZFS MDTs and OSTs is estimated based on the current average space used per inode. When a ZFS filesystem is first formatted, this free inode estimate will be very conservative (low) due to the high ratio of directories to regular files created for internal Lustre metadata storage, but this estimate will improve as more files are created by regular users and the average file size will better reflect actual site usage.

Note

Using the DNE remote directory feature it is possible to increase the total number of inodes of a Lustre filesystem, as well as increasing the aggregate metadata performance, by configuring additional MDTs into the filesystem, see Section 14.7, “Adding a New MDT to a Lustre File System” for details.

5.2.3.  Determining OST Space Requirements

For the OST, the amount of space taken by each object depends on the usage pattern of the users/applications running on the system. The Lustre software defaults to a conservative estimate for the average object size (between 64 KiB per object for 10 GiB OSTs, and 1 MiB per object for 16 TiB and larger OSTs). If you are confident that the average file size for your applications will be different than this, you can specify a different average file size (number of total inodes for a given OST size) to reduce file system overhead and minimize file system check time. See Section 5.3.2, “Setting Formatting Options for an ldiskfs OST” for more details.

5.3.  Setting ldiskfs File System Formatting Options

By default, the mkfs.lustre utility applies these options to the Lustre backing file system used to store data and metadata in order to enhance Lustre file system performance and scalability. These options include:

  • flex_bg - When the flag is set to enable this flexible-block-groups feature, block and inode bitmaps for multiple groups are aggregated to minimize seeking when bitmaps are read or written and to reduce read/modify/write operations on typical RAID storage (with 1 MiB RAID stripe widths). This flag is enabled on both OST and MDT file systems. On MDT file systems the flex_bg factor is left at the default value of 16. On OSTs, the flex_bg factor is set to 256 to allow all of the block or inode bitmaps in a single flex_bg to be read or written in a single 1MiB I/O typical for RAID storage.

  • huge_file - Setting this flag allows files on OSTs to be larger than 2 TiB in size.

  • lazy_journal_init - This extended option is enabled to prevent a full overwrite to zero out the large journal that is allocated by default in a Lustre file system (up to 400 MiB for OSTs, up to 4GiB for MDTs), to reduce the formatting time.

To override the default formatting options, use arguments to mkfs.lustre to pass formatting options to the backing file system:

--mkfsoptions='backing fs options'

For other mkfs.lustre options, see the Linux man page for mke2fs(8).

5.3.1. Setting Formatting Options for an ldiskfs MDT

The number of inodes on the MDT is determined at format time based on the total size of the file system to be created. The default bytes-per-inode ratio ("inode ratio") for an ldiskfs MDT is optimized at one inode for every 2560 bytes of file system space.

This setting takes into account the space needed for additional ldiskfs filesystem-wide metadata, such as the journal (up to 4 GB), bitmaps, and directories, as well as files that Lustre uses internally to maintain cluster consistency. There is additional per-file metadata such as file layout for files with a large number of stripes, Access Control Lists (ACLs), and user extended attributes.

Introduced in Lustre 2.11

Starting in Lustre 2.11, the Chapter 20, Data on MDT (DoM) feature allows storing small files on the MDT to take advantage of high-performance flash storage, as well as reduce space and network overhead. If you are planning to use the DoM feature with an ldiskfs MDT, it is recommended to increase the bytes-per-inode ratio to have enough space on the MDT for small files, as described below.

It is possible to change the recommended default of 2560 bytes per inode for an ldiskfs MDT when it is first formatted by adding the --mkfsoptions="-i bytes-per-inode" option to mkfs.lustre. Decreasing the inode ratio tunable bytes-per-inode will create more inodes for a given MDT size, but will leave less space for extra per-file metadata and is not recommended. The inode ratio must always be strictly larger than the MDT inode size, which is 1024 bytes by default. It is recommended to use an inode ratio at least 1536 bytes larger than the inode size to ensure the MDT does not run out of space. For DoM, increasing the inode ratio to leave enough space for the most common small file size (e.g. 5632 or 66560 bytes if 4 KB or 64 KB files are widely used) is recommended.
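
For example, an illustrative sketch (the file system name and device are examples) of formatting a co-located MGS/MDT with a 5632-byte inode ratio to leave room for small DoM files is:

[mds#] mkfs.lustre --fsname=testfs --mgs --mdt --index=0 --mkfsoptions="-i 5632" /dev/mdt_disk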

The size of the inode may be changed at format time by adding the --stripe-count-hint=N option to have mkfs.lustre automatically calculate a reasonable inode size based on the default stripe count that will be used by the filesystem, or directly by specifying the --mkfsoptions="-I inode-size" option. Increasing the inode size will provide more space in the inode for a larger Lustre file layout, ACLs, user and system extended attributes, SELinux and other security labels, and other internal metadata and DoM data. However, if these features or other in-inode xattrs are not needed, a larger inode size may hurt metadata performance, as 2x, 4x, or 8x as much data would be read or written for each MDT inode access.

5.3.2. Setting Formatting Options for an ldiskfs OST

When formatting an OST file system, it can be beneficial to take local file system usage into account, for example by running df and df -i on a current filesystem to get the used bytes and used inodes respectively, then computing the average bytes-per-inode value. When deciding on the ratio for a new filesystem, try to avoid having too many inodes on each OST, while keeping enough margin to allow for future usage of smaller files. This helps reduce the format and e2fsck time and makes more space available for data.
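
For example (the mount point is illustrative), the average bytes-per-inode of an existing file system can be estimated by dividing the used bytes by the used inodes:

client# df -B1 /scratch     # used bytes on the existing file system
client# df -i /scratch      # used inodes on the same file system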

The table below shows the default bytes-per-inode ratio ("inode ratio") used for OSTs of various sizes when they are formatted.

Table 5.1. Default Inode Ratios Used for Newly Formatted OSTs

LUN/OST size      Default inode ratio     Total inodes
under 10GiB       1 inode/16KiB           640 - 655k
10GiB - 1TiB      1 inode/68KiB           153k - 15.7M
1TiB - 8TiB       1 inode/256KiB          4.2M - 33.6M
over 8TiB         1 inode/1MiB            8.4M - 268M


In environments with few small files, the default inode ratio may result in far too many inodes for the average file size. In this case, performance can be improved by increasing the number of bytes-per-inode. To set the inode ratio, use the --mkfsoptions="-i bytes-per-inode" argument to mkfs.lustre to specify the expected average (mean) size of OST objects. For example, to create an OST with an expected average object size of 8 MiB run:

[oss#] mkfs.lustre --ost --mkfsoptions="-i $((8192 * 1024))" ...

Note

OSTs formatted with ldiskfs should preferably have fewer than 320 million objects per OST, and no more than a maximum of 4 billion inodes. Specifying a very small bytes-per-inode ratio for a large OST that exceeds this limit can either cause premature out-of-space errors and prevent the full OST space from being used, or waste space and slow down e2fsck more than necessary. The default inode ratios are chosen to ensure that the total number of inodes remains below this limit.

Note

File system check time on OSTs is affected by a number of variables in addition to the number of inodes, including the size of the file system, the number of allocated blocks, the distribution of allocated blocks on the disk, disk speed, CPU speed, and the amount of RAM on the server. Reasonable file system check times for valid filesystems are 5-30 minutes per TiB, but may increase significantly if substantial errors are detected and need to be repaired.

For further details about optimizing MDT and OST file systems, see Section 6.4, “ Formatting Options for ldiskfs RAID Devices”.

5.4. File and File System Limits

Table 5.2, “File and file system limits” describes current known limits of Lustre. These limits may be imposed by either the Lustre architecture or the Linux virtual file system (VFS) and virtual memory subsystems. In a few cases, a limit is defined within the Lustre code based on tested values and could be changed by editing and re-compiling the Lustre software. In these cases, the indicated limit was used for testing of the Lustre software.

Table 5.2. File and file system limits

Each limit below is listed together with its value, followed by a description.

Maximum number of MDTs: 256

A single MDS can host one or more MDTs, either for separate filesystems, or aggregated into a single namespace. Each filesystem requires a separate MDT for the filesystem root directory. Up to 255 more MDTs can be added to the filesystem and are attached into the filesystem namespace with creation of DNE remote or striped directories.

Maximum number of OSTs: 8150

The maximum number of OSTs is a constant that can be changed at compile time. Lustre file systems with up to 4000 OSTs have been configured in the past. Multiple OST targets can be configured on a single OSS node.

Maximum OST size: 1024 TiB (ldiskfs), 1024 TiB (ZFS)

This is not a hard limit. Larger OSTs are possible, but most production systems do not typically go beyond the stated limit per OST because Lustre can add capacity and performance with additional OSTs, and having more OSTs improves aggregate I/O performance, minimizes contention, and allows parallel recovery (e2fsck for ldiskfs OSTs, scrub for ZFS OSTs).

With 32-bit kernels, due to page cache limits, 16TB is the maximum block device size, which in turn applies to the size of OST. It is strongly recommended to run Lustre clients and servers with 64-bit kernels.

Maximum number of clients: 131072

The maximum number of clients is a constant that can be changed at compile time. Up to 30000 clients have been used in production accessing a single filesystem.

Maximum size of a single file system: 2 EiB or larger

Each OST can have a file system up to the "Maximum OST size" limit, and the Maximum number of OSTs can be combined into a single filesystem.

Maximum stripe count: 2000

This limit is imposed by the size of the layout that needs to be stored on disk and sent in RPC requests, but is not a hard limit of the protocol. The number of OSTs in the filesystem can exceed the stripe count, but this is the maximum number of OSTs on which a single file can be striped.

Introduced in Lustre 2.13

Note

Before Lustre 2.13, the maximum stripe count for a single file on ldiskfs MDTs is limited to 160 OSTs by default. In order to increase the maximum file stripe count, use --mkfsoptions="-O ea_inode" when formatting the MDT, or use tune2fs -O ea_inode to enable it after the MDT has been formatted.

Maximum stripe size: < 4 GiB

The amount of data written to each object before moving on to the next object.

Minimum stripe size: 64 KiB

Due to the use of 64 KiB PAGE_SIZE on some CPU architectures such as ARM and POWER, the minimum stripe size is 64 KiB so that a single page is not split over multiple servers. This is also the minimum Data-on-MDT component size that can be specified.

Maximum single object size: 16 TiB (ldiskfs), 256 TiB (ZFS)

The amount of data that can be stored in a single object. An object corresponds to a stripe. The ldiskfs limit of 16 TB for a single object applies. For ZFS the limit is the size of the underlying OST. Files can consist of up to 2000 stripes, each stripe can be up to the maximum object size.

Maximum file size: 16 TiB on 32-bit systems; 31.25 PiB on 64-bit ldiskfs systems, 8 EiB on 64-bit ZFS systems

Individual files have a hard limit of nearly 16 TiB on 32-bit systems imposed by the kernel memory subsystem. On 64-bit systems this limit does not exist. Hence, files can be up to 2^63 bytes (8 EiB) in size if the backing filesystem can support large enough objects and/or the files are sparse.

A single file can have a maximum of 2000 stripes, which gives an upper single file data capacity of 31.25 PiB for 64-bit ldiskfs systems. The actual amount of data that can be stored in a file depends upon the amount of free space in each OST on which the file is striped.

Maximum number of files or subdirectories in a single directory: 600M-3.8B files (ldiskfs), 16T (ZFS)

The Lustre software uses the ldiskfs hashed directory code, which has a limit of at least 600 million files, depending on the length of the file name. The limit on subdirectories is the same as the limit on regular files.

Introduced in Lustre 2.8

Note

Starting in the 2.8 release it is possible to exceed this limit by striping a single directory over multiple MDTs with the lfs mkdir -c command, which increases the single directory limit by a factor of the number of directory stripes used.

Introduced in Lustre 2.12

Note

In the 2.12 release, the large_dir feature of ldiskfs was added to allow the use of directories with over 10M entries, but it was not enabled by default.

Introduced in Lustre 2.14

Note

Starting in the 2.14 release, the large_dir feature is enabled by default.

Maximum number of files in the file system: 4 billion (ldiskfs), 256 trillion (ZFS) per MDT

The ldiskfs filesystem imposes an upper limit of 4 billion inodes per filesystem. By default, the MDT filesystem is formatted with one inode per 2KB of space, meaning 512 million inodes per TiB of MDT space. This can be increased initially at the time of MDT filesystem creation. For more information, see Chapter 5, Determining Hardware Configuration Requirements and Formatting Options.

The ZFS filesystem dynamically allocates inodes and does not have a fixed ratio of inodes per unit of MDT space, but consumes approximately 4KiB of mirrored space per inode, depending on the configuration.

Each additional MDT can hold up to the above maximum number of additional files, depending on available space and the distribution of directories and files in the filesystem.

Maximum length of a filename: 255 bytes

This limit is 255 bytes for a single filename, the same as the limit in the underlying filesystems.

Maximum length of a pathname: 4096 bytes

The Linux VFS imposes a full pathname length of 4096 bytes.

Maximum number of open files for a Lustre file system: no limit

The Lustre software does not impose a maximum for the number of open files, but the practical limit depends on the amount of RAM on the MDS. No "tables" for open files exist on the MDS, as they are only linked in a list to a given client's export. Each client process has a limit of several thousands of open files which depends on its ulimit.


 

5.5. Determining Memory Requirements

This section describes the memory requirements for each Lustre file system component.

5.5.1.  Client Memory Requirements

A minimum of 2 GB RAM is recommended for clients.

5.5.2. MDS Memory Requirements

MDS memory requirements are determined by the following factors:

  • Number of clients

  • Size of the directories

  • Load placed on server

The amount of memory used by the MDS is a function of how many clients are on the system, and how many files they are using in their working set. This is driven, primarily, by the number of locks a client can hold at one time. The number of locks held by clients varies by load and memory availability on the server. Interactive clients can hold in excess of 10,000 locks at times. On the MDS, memory usage is approximately 2 KB per file, including the Lustre distributed lock manager (LDLM) lock and kernel data structures for the files currently in use. Having this file metadata in cache can improve metadata performance by a factor of 10x or more compared to reading it from storage.

MDS memory requirements include:

  • File system metadata: A reasonable amount of RAM needs to be available for file system metadata. While no hard limit can be placed on the amount of file system metadata, if more RAM is available, then disk I/O is needed less often to retrieve the metadata.

  • Network transport: If you are using TCP or other network transport that uses system memory for send/receive buffers, this memory requirement must also be taken into consideration.

  • Journal size: By default, the journal size is 4096 MB for each MDT ldiskfs file system. This can pin up to an equal amount of RAM on the MDS node per file system.

  • Failover configuration: If the MDS node will be used for failover from another node, then the RAM for each journal should be doubled, so the backup server can handle the additional load if the primary server fails.

5.5.2.1. Calculating MDS Memory Requirements

By default, 4096 MB are used for the ldiskfs filesystem journal. Additional RAM is used for caching file data for the larger working set, which is not actively in use by clients but should be kept "hot" for improved access times. Approximately 1.5 KB per file is needed to keep a file in cache without a lock.

For example, for a single MDT on an MDS with 1,024 compute nodes, 12 interactive login nodes, and a 20 million file working set (of which 9 million files are cached on the clients at one time):

Operating system overhead = 4096 MB (RHEL8)

File system journal = 4096 MB

1024 * 32-core clients * 256 files/core * 2KB = 16384 MB

12 interactive clients * 100,000 files * 2KB = 2400 MB

20 million file working set * 1.5KB/file = 30720 MB
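
Summing the figures listed above gives an illustrative total (this is only a check of the example arithmetic, not a sizing tool):

# 4096 + 4096 + 16384 + 2400 + 30720 MB
echo $((4096 + 4096 + 16384 + 2400 + 30720))   # prints 57696 (MB), which rounds up to the 60 GB figure below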

Thus, a reasonable MDS configuration for this workload is at least 60 GB of RAM. For active-active DNE MDT failover pairs, each MDS should have at least 96 GB of RAM. The additional memory can be used during normal operation to allow more metadata and locks to be cached and improve performance, depending on the workload.

For directories containing 1 million or more files, more memory can provide a significant benefit. For example, in an environment where clients randomly access a single directory with 10 million files, it can consume as much as 35GB of RAM on the MDS.

5.5.3. OSS Memory Requirements

When planning the hardware for an OSS node, consider the memory usage of several components in the Lustre file system (i.e., journal, service threads, file system metadata, etc.). Also, consider the effect of the OSS read cache feature, which consumes memory as it caches data on the OSS node.

In addition to the MDS memory requirements mentioned above, the OSS requirements also include:

  • Service threads: The service threads on the OSS node pre-allocate an RPC-sized I/O buffer for each ost_io service thread, so these large buffers do not need to be allocated and freed for each I/O request.

  • OSS read cache: OSS read cache provides read-only caching of data on an HDD-based OSS, using the regular Linux page cache to store the data. Just like caching from a regular file system in the Linux operating system, OSS read cache uses as much physical memory as is available.

The same calculation applies to files accessed from the OSS as for the MDS, but the load is typically distributed over more OSS nodes, so the amount of memory required for locks, inode cache, etc. listed for the MDS is spread out over the OSS nodes.

Because of these memory requirements, the following calculations should be taken as determining the minimum RAM required in an OSS node.

5.5.3.1. Calculating OSS Memory Requirements

The minimum recommended RAM size for an OSS with eight OSTs, handling objects for 1/4 of the active files for the MDS:

Linux kernel and userspace daemon memory = 4096 MB

Network send/receive buffers (16 MB * 512 threads) = 8192 MB

1024 MB ldiskfs journal size * 8 OST devices = 8192 MB

16 MB read/write buffer per OST IO thread * 512 threads = 8192 MB

2048 MB file system read cache * 8 OSTs = 16384 MB

1024 * 32-core clients * 64 objects/core * 2KB/object = 4096 MB

12 interactive clients * 25,000 objects * 2KB/object = 600 MB

5 million object working set * 1.5KB/object = 7500 MB

For a non-failover configuration, the minimum RAM would be about 60 GB for an OSS node with eight OSTs. Additional memory on the OSS will improve the performance of reading smaller, frequently-accessed files.

For a failover configuration, the minimum RAM would be about 90 GB, as some of the memory is per-node. When the OSS is not handling any failed-over OSTs the extra RAM will be used as a read cache.

As a reasonable rule of thumb, about 24 GB of base memory plus 4 GB per OST can be used. In failover configurations, about 8 GB per primary OST is needed.
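
The rule of thumb can be checked against the eight-OST example above with simple shell arithmetic (illustrative only, not a sizing tool):

# 24 GB base + 4 GB per OST (non-failover), or 8 GB per primary OST (failover)
osts=8
echo "non-failover: $((24 + 4 * osts)) GB"   # 56 GB, consistent with the ~60 GB figure above
echo "failover: $((24 + 8 * osts)) GB"       # 88 GB, consistent with the ~90 GB figure above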

5.6. Implementing Networks To Be Used by the Lustre File System

As a high performance file system, the Lustre file system places heavy loads on networks. Thus, a network interface in each Lustre server and client is commonly dedicated to Lustre file system traffic. This is often a dedicated TCP/IP subnet, although other network hardware can also be used.

A typical Lustre file system implementation may include the following:

  • A high-performance backend network for the Lustre servers, typically an InfiniBand (IB) network.

  • A larger client network.

  • Lustre routers to connect the two networks.

Lustre networks and routing are configured and managed by specifying parameters to the Lustre Networking (lnet) module in /etc/modprobe.d/lustre.conf.

To prepare to configure Lustre networking, complete the following steps:

  1. Identify all machines that will be running Lustre software and the network interfaces they will use to run Lustre file system traffic. These machines will form the Lustre network.

    A network is a group of nodes that communicate directly with one another. The Lustre software includes Lustre network drivers (LNDs) to support a variety of network types and hardware (see Chapter 2, Understanding Lustre Networking (LNet) for a complete list). The standard rules for specifying networks apply to Lustre networks. For example, two TCP networks on two different subnets (tcp0 and tcp1) are considered to be two different Lustre networks.

  2. If routing is needed, identify the nodes to be used to route traffic between networks.

    If you are using multiple network types, then you will need a router. Any node with appropriate interfaces can route Lustre networking (LNet) traffic between different network hardware types or topologies; the node may be a server, a client, or a standalone router. LNet can route messages between different network types (such as TCP-to-InfiniBand) or across different topologies (such as bridging two InfiniBand or TCP/IP networks). Routing will be configured in Chapter 9, Configuring Lustre Networking (LNet).

  3. Identify the network interfaces to include in or exclude from LNet.

    If not explicitly specified, LNet uses either the first available interface or a pre-defined default for a given network type. Interfaces that LNet should not use (such as an administrative network or IP-over-IB) can be excluded.

    Network interfaces to be used or excluded will be specified using the lnet kernel module parameters networks and ip2nets as described in Chapter 9, Configuring Lustre Networking (LNet).

  4. To ease the setup of networks with complex network configurations, determine a cluster-wide module configuration.

    For large clusters, you can configure the networking setup for all nodes by using a single, unified set of parameters in the lustre.conf file on each node. Cluster-wide configuration is described in Chapter 9, Configuring Lustre Networking (LNet).

Note

We recommend that you use 'dotted-quad' notation for IP addresses rather than host names to make it easier to read debug logs and debug configurations with multiple interfaces.

Chapter 6. Configuring Storage on a Lustre File System

This chapter describes best practices for storage selection and file system options to optimize performance on RAID, and includes the following sections:

Note

It is strongly recommended that storage used in a Lustre file system be configured with hardware RAID. The Lustre software does not support redundancy at the file system level and RAID is required to protect against disk failure.

6.1.  Selecting Storage for the MDT and OSTs

The Lustre architecture allows the use of any kind of block device as backend storage. The characteristics of such devices, particularly in the case of failures, vary significantly and have an impact on configuration choices.

This section describes issues and recommendations regarding backend storage.

6.1.1. Metadata Target (MDT)

I/O on the MDT is typically mostly reads and writes of small amounts of data. For this reason, we recommend that you use RAID 1 for MDT storage. If you require more capacity for an MDT than one disk provides, we recommend RAID 1 + 0 or RAID 10.

6.1.2. Object Storage Server (OST)

A quick calculation makes it clear that without further redundancy, RAID 6 is required for large clusters and RAID 5 is not acceptable:

For a 2 PB file system (2,000 disks of 1 TB capacity) assume the mean time to failure (MTTF) of a disk is about 1,000 days. This means that the expected failure rate is 2000/1000 = 2 disks per day. Repair time at 10% of disk bandwidth is 1000 GB at 10MB/sec = 100,000 sec, or about 1 day.

For a RAID 5 stripe that is 10 disks wide, during 1 day of rebuilding, the chance that a second disk in the same array will fail is about 9/1000 or about 1% per day. After 50 days, you have a 50% chance of a double failure in a RAID 5 array leading to data loss.

Therefore, RAID 6 or another double parity algorithm is needed to provide sufficient redundancy for OST storage.

For better performance, we recommend that you create RAID sets with 4 or 8 data disks plus one or two parity disks. Using larger RAID sets will negatively impact performance compared to having multiple independent RAID sets.

To maximize performance for small I/O request sizes, storage configured as RAID 1+0 can yield much better results but will increase cost or reduce capacity.

6.2. Reliability Best Practices

RAID monitoring software is recommended to quickly detect faulty disks and allow them to be replaced to avoid double failures and data loss. Hot spare disks are recommended so that rebuilds happen without delays.

Backups of the metadata file systems are recommended. For details, see Chapter 18, Backing Up and Restoring a File System.

6.3. Performance Tradeoffs

A writeback cache in a RAID storage controller can dramatically increase write performance on many types of RAID arrays if the writes are not done at full stripe width. Unfortunately, unless the RAID array has battery-backed cache (a feature only found in some higher-priced hardware RAID arrays), interrupting the power to the array may result in out-of-sequence or lost writes, and corruption of RAID parity and/or filesystem metadata, resulting in data loss.

Having a read or writeback cache onboard a PCI adapter card installed in an MDS or OSS is NOT SAFE in a high-availability (HA) failover configuration, as this will result in inconsistencies between nodes and immediate or eventual filesystem corruption. Such devices should not be used, or should have the onboard cache disabled.

If writeback cache is enabled, a file system check is required after the array loses power. Data may also be lost because of this.

Therefore, we recommend against the use of writeback cache when data integrity is critical. You should carefully consider whether the benefits of using writeback cache outweigh the risks.

6.4.  Formatting Options for ldiskfs RAID Devices

When formatting an ldiskfs file system on a RAID device, it can be beneficial to ensure that I/O requests are aligned with the underlying RAID geometry. This ensures that Lustre RPCs do not generate unnecessary disk operations which may reduce performance dramatically. Use the --mkfsoptions parameter to specify additional parameters when formatting the OST or MDT.

For RAID 5, RAID 6, or RAID 1+0 storage, specifying the following option to the --mkfsoptions parameter improves the layout of the file system metadata, ensuring that no single disk contains all of the allocation bitmaps:

-E stride=chunk_blocks

The chunk_blocks variable is in units of 4096-byte blocks and represents the amount of contiguous data written to a single disk before moving to the next disk. This is alternately referred to as the RAID stripe size. This is applicable to both MDT and OST file systems.

For more information on how to override the defaults while formatting MDT or OST file systems, see Section 5.3, “ Setting ldiskfs File System Formatting Options ”.

6.4.1. Computing file system parameters for mkfs

For best results, use RAID 5 with 5 or 9 disks or RAID 6 with 6 or 10 disks, each on a different controller. The stripe width is the optimal minimum I/O size. Ideally, the RAID configuration should allow 1 MB Lustre RPCs to fit evenly on a single RAID stripe without an expensive read-modify-write cycle. Use this formula to determine the stripe_width, where number_of_data_disks does not include the RAID parity disks (1 for RAID 5 and 2 for RAID 6):

stripe_width_blocks = chunk_blocks * number_of_data_disks = 1 MB 

If the RAID configuration does not allow chunk_blocks to fit evenly into 1 MB, select stripe_width_blocks such that it is close to 1 MB, but not larger.

The stripe_width_blocks value must equal chunk_blocks * number_of_data_disks. Specifying the stripe_width_blocks parameter is only relevant for RAID 5 or RAID 6, and is not needed for RAID 1 plus 0.

Run mkfs.lustre --reformat on the file system device (/dev/sdc), specifying the RAID geometry to the underlying ldiskfs file system, where:

--mkfsoptions "other_options -E stride=chunk_blocks, stripe_width=stripe_width_blocks"

A RAID 6 configuration with 6 disks has 4 data and 2 parity disks. The chunk_blocks <= 1024KB/4 = 256KB.

Because the number of data disks is a power of two, the stripe width is equal to 1 MB.

--mkfsoptions "other_options -E stride=chunk_blocks, stripe_width=stripe_width_blocks"...

6.4.2. Choosing Parameters for an External Journal

If you have configured a RAID array and use it directly as an OST, it contains both data and metadata. For better performance, we recommend putting the OST journal on a separate device, by creating a small RAID 1 array and using it as an external journal for the OST.

In a typical Lustre file system, the default OST journal size is up to 1GB, and the default MDT journal size is up to 4GB, in order to handle a high transaction rate without blocking on journal flushes. Additionally, a copy of the journal is kept in RAM. Therefore, make sure you have enough RAM on the servers to hold copies of all journals.

The file system journal options are specified to mkfs.lustre using the --mkfsoptions parameter. For example:

--mkfsoptions "other_options -j -J device=/dev/mdJ" 

To create an external journal, perform these steps for each OST on the OSS:

  1. Create a 400 MB (or larger) journal partition (RAID 1 is recommended).

    In this example, /dev/sdb is a RAID 1 device.

  2. Create a journal device on the partition. Run:

    oss# mke2fs -b 4096 -O journal_dev /dev/sdb journal_size

    The value of journal_size is specified in units of 4096-byte blocks. For example, 262144 for a 1 GB journal size.

  3. Create the OST.

    In this example, /dev/sdc is the RAID 6 device to be used as the OST, run:

    [oss#] mkfs.lustre --ost ... \
    --mkfsoptions="-J device=/dev/sdb1" /dev/sdc
  4. Mount the OST as usual.
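
Putting these steps together, the following is a hypothetical end-to-end example for a 1 GB external journal (262144 blocks of 4096 bytes) on the RAID 1 partition /dev/sdb1 and an OST on the RAID 6 device /dev/sdc. The --fsname, --mgsnode, and mount point values are placeholders for your own configuration:

oss# mke2fs -b 4096 -O journal_dev /dev/sdb1 262144
oss# mkfs.lustre --ost --fsname=testfs --mgsnode=10.2.0.1@tcp0 \
     --mkfsoptions="-J device=/dev/sdb1" /dev/sdc
oss# mkdir -p /mnt/ost0
oss# mount -t lustre /dev/sdc /mnt/ost0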

6.5. Connecting a SAN to a Lustre File System

Depending on your cluster size and workload, you may want to connect a SAN to a Lustre file system. Before making this connection, consider the following:

  • In many SAN file systems, clients allocate and lock blocks or inodes individually as they are updated. The design of the Lustre file system avoids the high contention that some of these blocks and inodes may have.

  • The Lustre file system is highly scalable and can have a very large number of clients. SAN switches do not scale to a large number of nodes, and the cost per port of a SAN is generally higher than other networking.

  • File systems that allow direct-to-SAN access from the clients have a security risk because clients can potentially read any data on the SAN disks, and misbehaving clients can corrupt the file system for many reasons, such as improperly functioning file system, network, or other kernel software, bad cabling, bad memory, and so on. The risk increases as the number of clients directly accessing the storage increases.

Chapter 7. Setting Up Network Interface Bonding

This chapter describes how to use multiple network interfaces in parallel to increase bandwidth and/or redundancy. Topics include:

Note

Using network interface bonding is optional.

7.1. Network Interface Bonding Overview

Bonding, also known as link aggregation, trunking and port trunking, is a method of aggregating multiple physical network links into a single logical link for increased bandwidth.

Several different types of bonding are available in the Linux distribution. All these types are referred to as 'modes', and use the bonding kernel module.

Modes 0 to 3 allow load balancing and fault tolerance by using multiple interfaces. Mode 4 aggregates a group of interfaces into a single virtual interface where all members of the group share the same speed and duplex settings. This mode is described under IEEE spec 802.3ad, and it is referred to as either 'mode 4' or '802.3ad.'

7.2. Requirements

The most basic requirement for successful bonding is that both endpoints of the connection must be capable of bonding. In a normal case, the non-server endpoint is a switch. (Two systems connected via crossover cables can also use bonding.) Any switch used must explicitly handle 802.3ad Dynamic Link Aggregation.

The kernel must also be configured with bonding. All supported Lustre kernels have bonding functionality. The network driver for the interfaces to be bonded must have the ethtool functionality to determine slave speed and duplex settings. All recent network drivers implement it.

To verify that your interface works with ethtool, run:

# which ethtool
/sbin/ethtool
 
# ethtool eth0
Settings for eth0:
           Supported ports: [ TP MII ]
           Supported link modes:   10baseT/Half 10baseT/Full
                                   100baseT/Half 100baseT/Full
           Supports auto-negotiation: Yes
           Advertised link modes:  10baseT/Half 10baseT/Full
                                   100baseT/Half 100baseT/Full
           Advertised auto-negotiation: Yes
           Speed: 100Mb/s
           Duplex: Full
           Port: MII
           PHYAD: 1
           Transceiver: internal
           Auto-negotiation: on
           Supports Wake-on: pumbg
           Wake-on: d
           Current message level: 0x00000001 (1)
           Link detected: yes
 
# ethtool eth1
 
Settings for eth1:
   Supported ports: [ TP MII ]
   Supported link modes:   10baseT/Half 10baseT/Full
                           100baseT/Half 100baseT/Full
   Supports auto-negotiation: Yes
   Advertised link modes:  10baseT/Half 10baseT/Full
   100baseT/Half 100baseT/Full
   Advertised auto-negotiation: Yes
   Speed: 100Mb/s
   Duplex: Full
   Port: MII
   PHYAD: 32
   Transceiver: internal
   Auto-negotiation: on
   Supports Wake-on: pumbg
   Wake-on: d
   Current message level: 0x00000007 (7)
   Link detected: yes

To quickly check whether your kernel supports bonding, run:

# grep ifenslave /sbin/ifup
# which ifenslave
/sbin/ifenslave

7.3. Bonding Module Parameters

Bonding module parameters control various aspects of bonding.

Outgoing traffic is mapped across the slave interfaces according to the transmit hash policy. We recommend that you set the xmit_hash_policy option to layer3+4 for bonding. This policy uses upper-layer protocol information, if available, to generate the hash. This allows traffic destined for a particular network peer to span multiple slaves, although a single connection does not span multiple slaves.

$ xmit_hash_policy=layer3+4

The miimon option enables users to monitor the link status. (The parameter is a time interval in milliseconds.) It makes an interface failure transparent to avoid serious network degradation during link failures. A reasonable default setting is 100 milliseconds:

$ miimon=100

For a busy network, increase the timeout.
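
For example, these parameters could be combined in the bonding configuration file described in the next section (the mode shown here is only one possibility; choose the mode appropriate for your switch and network):

alias bond0 bonding
options bond0 mode=802.3ad miimon=100 xmit_hash_policy=layer3+4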

7.4. Setting Up Bonding

To set up bonding:

  1. Create a virtual 'bond' interface by creating a configuration file:

    # vi /etc/sysconfig/network-scripts/ifcfg-bond0
  2. Append the following lines to the file.

    DEVICE=bond0
    IPADDR=192.168.10.79 # Use the free IP Address of your network
    NETWORK=192.168.10.0
    NETMASK=255.255.255.0
    USERCTL=no
    BOOTPROTO=none
    ONBOOT=yes
  3. Attach one or more slave interfaces to the bond interface. Modify the eth0 and eth1 configuration files (using a VI text editor).

    1. Use the VI text editor to open the eth0 configuration file.

      # vi /etc/sysconfig/network-scripts/ifcfg-eth0
    2. Modify/append the eth0 file as follows:

      DEVICE=eth0
      USERCTL=no
      ONBOOT=yes
      MASTER=bond0
      SLAVE=yes
      BOOTPROTO=none
    3. Use the VI text editor to open the eth1 configuration file.

      # vi /etc/sysconfig/network-scripts/ifcfg-eth1
    4. Modify/append the eth1 file as follows:

      DEVICE=eth1
      USERCTL=no
      ONBOOT=yes
      MASTER=bond0
      SLAVE=yes
      BOOTPROTO=none
      
  4. Set up the bond interface and its options in /etc/modprobe.d/bond.conf. Start the slave interfaces by your normal network method.

    # vi /etc/modprobe.d/bond.conf
    
    1. Append the following lines to the file.

      alias bond0 bonding
      options bond0 mode=balance-alb miimon=100
      
    2. Load the bonding module.

      # modprobe bonding
      # ifconfig bond0 up
      # ifenslave bond0 eth0 eth1
      
  5. Start/restart the slave interfaces (using your normal network method).

    Note

    You must modprobe the bonding module for each bonded interface. If you wish to create bond0 and bond1, two entries in the bond.conf file are required.

    The examples below are from systems running Red Hat Enterprise Linux, which use configuration files under /etc/sysconfig/network-scripts/ifcfg-*. The website referenced below includes detailed instructions for other configuration methods, instructions for using DHCP with bonding, and other setup details. We strongly recommend that you consult this website.

    http://www.linuxfoundation.org/networking/bonding

  6. Check /proc/net/bonding to determine the bonding status. There should be a file there for each bond interface.

    # cat /proc/net/bonding/bond0
    Ethernet Channel Bonding Driver: v3.0.3 (March 23, 2006)
     
    Bonding Mode: load balancing (round-robin)
    MII Status: up
    MII Polling Interval (ms): 0
    Up Delay (ms): 0
    Down Delay (ms): 0
     
    Slave Interface: eth0
    MII Status: up
    Link Failure Count: 0
    Permanent HW addr: 4c:00:10:ac:61:e0
     
    Slave Interface: eth1
    MII Status: up
    Link Failure Count: 0
    Permanent HW addr: 00:14:2a:7c:40:1d
    
  7. Use ethtool or ifconfig to check the interface state. ifconfig lists the first bonded interface as 'bond0.'

    ifconfig
    bond0      Link encap:Ethernet  HWaddr 4C:00:10:AC:61:E0
       inet addr:192.168.10.79  Bcast:192.168.10.255  Mask:255.255.255.0
       inet6 addr: fe80::4e00:10ff:feac:61e0/64 Scope:Link
       UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500 Metric:1
       RX packets:3091 errors:0 dropped:0 overruns:0 frame:0
       TX packets:880 errors:0 dropped:0 overruns:0 carrier:0
       collisions:0 txqueuelen:0
       RX bytes:314203 (306.8 KiB)  TX bytes:129834 (126.7 KiB)
     
    eth0       Link encap:Ethernet  HWaddr 4C:00:10:AC:61:E0
       inet6 addr: fe80::4e00:10ff:feac:61e0/64 Scope:Link
       UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500 Metric:1
       RX packets:1581 errors:0 dropped:0 overruns:0 frame:0
       TX packets:448 errors:0 dropped:0 overruns:0 carrier:0
       collisions:0 txqueuelen:1000
       RX bytes:162084 (158.2 KiB)  TX bytes:67245 (65.6 KiB)
       Interrupt:193 Base address:0x8c00
     
    eth1       Link encap:Ethernet  HWaddr 4C:00:10:AC:61:E0
       inet6 addr: fe80::4e00:10ff:feac:61e0/64 Scope:Link
       UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500 Metric:1
       RX packets:1513 errors:0 dropped:0 overruns:0 frame:0
       TX packets:444 errors:0 dropped:0 overruns:0 carrier:0
       collisions:0 txqueuelen:1000
       RX bytes:152299 (148.7 KiB)  TX bytes:64517 (63.0 KiB)
       Interrupt:185 Base address:0x6000
    

7.4.1. Examples

This is an example showing bond.conf entries for bonding Ethernet interfaces eth0 and eth1 to bond0:

# cat /etc/modprobe.d/bond.conf
alias eth0 8139too
alias eth1 via-rhine
alias bond0 bonding
options bond0 mode=balance-alb miimon=100
 
# cat /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
BOOTPROTO=none
NETMASK=255.255.255.0
IPADDR=192.168.10.79 # (Assign here the IP of the bonded interface.)
ONBOOT=yes
USERCTL=no
 
ifcfg-ethx 
# cat /etc/sysconfig/network-scripts/ifcfg-eth0
TYPE=Ethernet
DEVICE=eth0
HWADDR=4c:00:10:ac:61:e0
BOOTPROTO=none
ONBOOT=yes
USERCTL=no
IPV6INIT=no
PEERDNS=yes
MASTER=bond0
SLAVE=yes

In the following example, the bond0 interface is the master (MASTER) while eth0 and eth1 are slaves (SLAVE).

Note

All slaves of bond0 have the same MAC address (HWaddr) as bond0. All modes, except TLB and ALB, share this MAC address. TLB and ALB require a unique MAC address for each slave.

$ /sbin/ifconfig
 
bond0     Link encap:Ethernet  HWaddr 00:C0:F0:1F:37:B4
inet addr:XXX.XXX.XXX.YYY Bcast:XXX.XXX.XXX.255 Mask:255.255.252.0
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500  Metric:1
RX packets:7224794 errors:0 dropped:0 overruns:0 frame:0
TX packets:3286647 errors:1 dropped:0 overruns:1 carrier:0
collisions:0 txqueuelen:0
 
eth0      Link encap:Ethernet  HWaddr 00:C0:F0:1F:37:B4
inet addr:XXX.XXX.XXX.YYY Bcast:XXX.XXX.XXX.255 Mask:255.255.252.0
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500  Metric:1
RX packets:3573025 errors:0 dropped:0 overruns:0 frame:0
TX packets:1643167 errors:1 dropped:0 overruns:1 carrier:0
collisions:0 txqueuelen:100
Interrupt:10 Base address:0x1080
 
eth1      Link encap:Ethernet  HWaddr 00:C0:F0:1F:37:B4
inet addr:XXX.XXX.XXX.YYY Bcast:XXX.XXX.XXX.255 Mask:255.255.252.0
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500  Metric:1
RX packets:3651769 errors:0 dropped:0 overruns:0 frame:0
TX packets:1643480 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
Interrupt:9 Base address:0x1400

7.5. Configuring a Lustre File System with Bonding

The Lustre software uses the IP address of the bonded interfaces and requires no special configuration. The bonded interface is treated as a regular TCP/IP interface. If needed, specify bond0 using the Lustre networks parameter in /etc/modprobe.d/lustre.conf.

options lnet networks=tcp(bond0)

7.6. Bonding References

We recommend the following bonding references:

Chapter 8. Installing the Lustre Software

This chapter describes how to install the Lustre software from RPM packages. It includes:

For hardware and system requirements and hardware configuration information, see Chapter 5, Determining Hardware Configuration Requirements and Formatting Options.

8.1.  Preparing to Install the Lustre Software

You can install the Lustre software from downloaded packages (RPMs) or directly from the source code. This chapter describes how to install the Lustre RPM packages. Instructions to install from source code are beyond the scope of this document, and can be found elsewhere online.

The Lustre RPM packages are tested on current versions of Linux enterprise distributions at the time they are created. See the release notes for each version for specific details.

8.1.1. Software Requirements

To install the Lustre software from RPMs, the following are required:

  • Lustre server packages . The required packages for Lustre 2.9 EL7 servers are listed in the table below, where ver refers to the Lustre release and kernel version (e.g., 2.9.0-1.el7) and arch refers to the processor architecture (e.g., x86_64). These packages are available in the Lustre Releases repository, and may differ depending on your distro and version.

    Table 8.1. Packages Installed on Lustre Servers

    Package Name                          Description
    kernel-ver_lustre.arch Linux kernel with Lustre software patches (often referred to as "patched kernel")
    lustre-ver.arch Lustre software command line tools
    kmod-lustre-ver.arch Lustre-patched kernel modules
    kmod-lustre-osd-ldiskfs-ver.arch Lustre back-end file system tools for ldiskfs-based servers.
    lustre-osd-ldiskfs-mount-ver.arch Helper library for mount.lustre and mkfs.lustre for ldiskfs-based servers.
    kmod-lustre-osd-zfs-ver.arch Lustre back-end file system tools for ZFS. This is an alternative to lustre-osd-ldiskfs (kmod-spl and kmod-zfs available separately).
    lustre-osd-zfs-mount-ver.arch Helper library for mount.lustre and mkfs.lustre for ZFS-based servers (zfs utilities available separately).
    e2fsprogs Utilities to maintain Lustre ldiskfs back-end file system(s)
    lustre-tests-ver_lustre.arch Scripts and programs used for running regression tests for Lustre, but likely only of interest to Lustre developers or testers.


  • Lustre client packages . The required packages for Lustre 2.9 EL7 clients are listed in the table below, where ver refers to the Linux distribution (e.g., 3.6.18-348.1.1.el5). These packages are available in the Lustre Releases repository.

    Table 8.2. Packages Installed on Lustre Clients

    Package Name                          Description
    kmod-lustre-client-ver.arch Patchless kernel modules for client
    lustre-client-ver.arch Client command line tools
    lustre-client-dkms-ver.arch Alternate client RPM to kmod-lustre-client with Dynamic Kernel Module Support (DKMS) installation. This avoids the need to install a new RPM for each kernel update, but requires a full build environment on the client.


    Note

    The version of the kernel running on a Lustre client must be the same as the version of the kmod-lustre-client-ver package being installed, unless the DKMS package is installed. If the kernel running on the client is not compatible, a kernel that is compatible must be installed on the client before the Lustre file system software is used.

  • Lustre LNet network driver (LND) . The Lustre LNDs provided with the Lustre software are listed in the table below. For more information about Lustre LNet, see Chapter 2, Understanding Lustre Networking (LNet).

    Table 8.3. Network Types Supported by Lustre LNDs

    Supported Network Types               Notes
    TCP                                   Any network carrying TCP traffic, including GigE, 10GigE, and IPoIB
    InfiniBand network                    OpenFabrics OFED (o2ib)
    gni                                   Gemini (Cray)

Note

The InfiniBand and TCP Lustre LNDs are routinely tested during release cycles. The other LNDs are maintained by their respective owners.

  • High availability software . If needed, install third party high-availability software. For more information, see Section 11.2, “Preparing a Lustre File System for Failover”.

  • Optional packages. Optional packages provided in the Lustre Releases repository may include the following (depending on the operating system and platform):

    • kernel-debuginfo, kernel-debuginfo-common, lustre-debuginfo, lustre-osd-ldiskfs-debuginfo - Versions of required packages with debugging symbols and other debugging options enabled for use in troubleshooting.

    • kernel-devel - Portions of the kernel tree needed to compile third party modules, such as network drivers.

    • kernel-firmware - Standard Red Hat Enterprise Linux distribution that has been recompiled to work with the Lustre kernel.

    • kernel-headers - Header files installed under /usr/include and used when compiling user-space, kernel-related code.

    • lustre-source - Lustre software source code.

    • (Recommended) perf, perf-debuginfo, python-perf, python-perf-debuginfo - Linux performance analysis tools that have been compiled to match the Lustre kernel version.

8.1.2. Environmental Requirements

Before installing the Lustre software, make sure the following environmental requirements are met.

  • (Required) Use the same user IDs (UID) and group IDs (GID) on all clients. If use of supplemental groups is required, see Section 41.1, “User/Group Upcall” for information about supplementary user and group cache upcall (identity_upcall).

  • (Recommended) Provide remote shell access to clients. It is recommended that all cluster nodes have remote shell client access to facilitate the use of Lustre configuration and monitoring scripts. Parallel Distributed SHell (pdsh) is preferable, although Secure SHell (SSH) is acceptable.

  • (Recommended) Ensure client clocks are synchronized. The Lustre file system uses client clocks for timestamps. If clocks are out of sync between clients, files will appear with different time stamps when accessed by different clients. Drifting clocks can also cause problems by, for example, making it difficult to debug multi-node issues or correlate logs, which depend on timestamps. We recommend that you use Network Time Protocol (NTP) to keep client and server clocks in sync with each other. For more information about NTP, see: https://www.ntp.org.

  • (Recommended) Make sure security extensions (such as the Novell AppArmor* security system) and network packet filtering tools (such as iptables) do not interfere with the Lustre software.

8.2. Lustre Software Installation Procedure

Caution

Before installing the Lustre software, back up ALL data. The Lustre software contains kernel modifications that interact with storage devices and may introduce security issues and data loss if not installed, configured, or administered properly.

To install the Lustre software from RPMs, complete the steps below.

  1. Verify that all Lustre installation requirements have been met.

  2. Download the e2fsprogs RPMs for your platform from the Lustre Releases repository.

  3. Download the Lustre server RPMs for your platform from the Lustre Releases repository. See Table 8.1, “Packages Installed on Lustre Servers” for a list of required packages.

  4. Install the Lustre server and e2fsprogs packages on all Lustre servers (MGS, MDSs, and OSSs).

    1. Log onto a Lustre server as the root user.

    2. Use the yum command to install the packages:

      # yum --nogpgcheck install pkg1.rpm pkg2.rpm ...
      

    3. Verify the packages are installed correctly:

      # rpm -qa|egrep "lustre|kernel"|sort
      

    4. Reboot the server.

    5. Repeat these steps on each Lustre server.

  5. Download the Lustre client RPMs for your platform from the Lustre Releases repository. See Table 8.2, “Packages Installed on Lustre Clients” for a list of required packages.

  6. Install the Lustre client packages on all Lustre clients.

    Note

    The version of the kernel running on a Lustre client must be the same as the version of the kmod-lustre-client-ver package being installed. If not, a compatible kernel must be installed on the client before the Lustre client packages are installed.

    1. Log onto a Lustre client as the root user.

    2. Use the yum command to install the packages:

      # yum --nogpgcheck install pkg1.rpm pkg2.rpm ...
      

    3. Verify the packages were installed correctly:

      # rpm -qa|egrep "lustre|kernel"|sort
      

    4. Reboot the client.

    5. Repeat these steps on each Lustre client.

To configure LNet, go to Chapter 9, Configuring Lustre Networking (LNet). If default settings will be used for LNet, go to Chapter 10, Configuring a Lustre File System.

Chapter 9. Configuring Lustre Networking (LNet)

This chapter describes how to configure Lustre Networking (LNet). It includes the following sections:

Note

Configuring LNet is optional.

LNet will use the first TCP/IP interface it discovers on a system (eth0) if it is loaded using the lctl network up command. If this network configuration is sufficient, you do not need to configure LNet. LNet configuration is required if you are using InfiniBand or multiple Ethernet interfaces.

Introduced in Lustre 2.7

The lnetctl utility can be used to initialize LNet without bringing up any network interfaces. Network interfaces can be added after configuring LNet via lnetctl. lnetctl can also be used to manage an operational LNet. However, if LNet was not initialized by lnetctl, then lnetctl lnet configure must be invoked before lnetctl can be used to manage LNet.

Introduced in Lustre 2.7

DLC also introduces a C-API to enable configuring LNet programmatically. See Chapter 45, LNet Configuration C-API.

Introduced in Lustre 2.7

9.1. Configuring LNet via lnetctl

The lnetctl utility can be used to initialize and configure the LNet kernel module after it has been loaded via modprobe. In general the lnetctl format is as follows:

lnetctl cmd subcmd [options]

The following configuration items are managed by the tool:

  • Configuring/unconfiguring LNet

  • Adding/removing/showing Networks

  • Adding/removing/showing Routes

  • Enabling/Disabling routing

  • Configuring Router Buffer Pools

9.1.1. Configuring LNet

After LNet has been loaded via modprobe, the lnetctl utility can be used to configure LNet without bringing up the networks specified in the module parameters. It can also be used to configure network interfaces specified in the module parameters by providing the --all option.

lnetctl lnet configure [--all]
# --all: load NI configuration from module parameters

The lnetctl utility can also be used to unconfigure LNet.

lnetctl lnet unconfigure

9.1.2. Displaying Global Settings

The active LNet global settings can be displayed using the lnetctl command shown below:

lnetctl global show

For example:

# lnetctl global show
        global:
        numa_range: 0
        max_intf: 200
        discovery: 1
        drop_asym_route: 0

9.1.3. Adding, Deleting and Showing Networks

Networks can be added, deleted, or shown after the LNet kernel module is loaded.

The lnetctl net add command is used to add networks:

lnetctl net add: add a network
        --net: net name (ex tcp0)
        --if: physical interface (ex eth0)
        --peer_timeout: time to wait before declaring a peer dead
        --peer_credits: defines the max number of inflight messages
        --peer_buffer_credits: the number of buffer credits per peer
        --credits: Network Interface credits
        --cpts: CPU Partitions configured net uses
        --help: display this help text

Example:
lnetctl net add --net tcp2 --if eth0
                --peer_timeout 180 --peer_credits 8
Introduced in Lustre 2.10

Note

With the addition of Software based Multi-Rail in Lustre 2.10, the following should be noted:

  • --net: no longer needs to be unique since multiple interfaces can be added to the same network.

  • --if: The same interface per network can be added only once, however, more than one interface can now be specified (separated by a comma) for a node. For example: eth0,eth1,eth2.

For examples on adding multiple interfaces via lnetctl net add and/or YAML, please see Section 16.2, “Configuring Multi-Rail”

Networks can be deleted with the lnetctl net del command:

net del: delete a network
        --net: net name (ex tcp0)
        --if:  physical interface (e.g. eth0)

Example:
lnetctl net del --net tcp2
Introduced in Lustre 2.10

Note

In a Software Multi-Rail configuration, specifying only the --net argument will delete the entire network and all interfaces under it. The new --if switch should also be used in conjunction with --net to specify deletion of a specific interface.

All or a subset of the configured networks can be shown with the lnetctl net show command. The output can be non-verbose or verbose.

net show: show networks
        --net: net name (ex tcp0) to filter on
        --verbose: display detailed output per network

Examples:
lnetctl net show
lnetctl net show --verbose
lnetctl net show --net tcp2 --verbose

Below are examples of non-detailed and detailed network configuration show.

# non-detailed show
> lnetctl net show --net tcp2
net:
    - nid: 192.168.205.130@tcp2
      status: up
      interfaces:
          0: eth3

# detailed show
> lnetctl net show --net tcp2 --verbose
net:
    - nid: 192.168.205.130@tcp2
      status: up
      interfaces:
          0: eth3
      tunables:
          peer_timeout: 180
          peer_credits: 8
          peer_buffer_credits: 0
          credits: 256
Introduced in Lustre 2.10

9.1.4. Manual Adding, Deleting and Showing Peers

The lnetctl peer add command is used to manually add a remote peer to a software multi-rail configuration. For the dynamic peer discovery capability introduced in Lustre Release 2.11.0, please see Section 9.1.5, “Dynamic Peer Discovery”.

When configuring peers, use the --prim_nid option to specify the key or primary nid of the peer node. Then follow that with the --nid option to specify a set of comma separated NIDs.

peer add: add a peer
            --prim_nid: primary NID of the peer
            --nid: comma separated list of peer nids (e.g. 10.1.1.2@tcp0)
            --non_mr: if specified this interface is created as a non multi-rail
            capable peer. Only one NID can be specified in this case.

For example:

            lnetctl peer add --prim_nid 10.10.10.2@tcp --nid 10.10.3.3@tcp1,10.4.4.5@tcp2
        

The --prim_nid (primary nid for the peer node) can go unspecified. In this case, the first listed NID in the --nid option becomes the primary nid of the peer. For example:

            lnetctl peer add --nid 10.10.10.2@tcp,10.10.3.3@tcp1,10.4.4.5@tcp2

YAML can also be used to configure peers:

peer:
            - primary nid: <key or primary nid>
            Multi-Rail: True
            peer ni:
            - nid: <nid 1>
            - nid: <nid 2>
            - nid: <nid n>
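
For example, a concrete peer block using the placeholder NIDs from the command-line example above, which could be saved to a file and applied with lnetctl import:

peer:
    - primary nid: 10.10.10.2@tcp
      Multi-Rail: True
      peer ni:
        - nid: 10.10.3.3@tcp1
        - nid: 10.4.4.5@tcp2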

As with all other commands, the result of the lnetctl peer show command can be used to gather information to aid in configuring or deleting a peer:

lnetctl peer show -v

Example output from the lnetctl peer show command:

peer:
            - primary nid: 192.168.122.218@tcp
            Multi-Rail: True
            peer ni:
            - nid: 192.168.122.218@tcp
            state: NA
            max_ni_tx_credits: 8
            available_tx_credits: 8
            available_rtr_credits: 8
            min_rtr_credits: -1
            tx_q_num_of_buf: 0
            send_count: 6819
            recv_count: 6264
            drop_count: 0
            refcount: 1
            - nid: 192.168.122.78@tcp
            state: NA
            max_ni_tx_credits: 8
            available_tx_credits: 8
            available_rtr_credits: 8
            min_rtr_credits: -1
            tx_q_num_of_buf: 0
            send_count: 7061
            recv_count: 6273
            drop_count: 0
            refcount: 1
            - nid: 192.168.122.96@tcp
            state: NA
            max_ni_tx_credits: 8
            available_tx_credits: 8
            available_rtr_credits: 8
            min_rtr_credits: -1
            tx_q_num_of_buf: 0
            send_count: 6939
            recv_count: 6286
            drop_count: 0
            refcount: 1

Use the following lnetctl command to delete a peer:

peer del: delete a peer
            --prim_nid: Primary NID of the peer
            --nid: comma separated list of peer nids (e.g. 10.1.1.2@tcp0)

prim_nid should always be specified. The prim_nid identifies the peer. If the prim_nid is the only one specified, then the entire peer is deleted.

Example of deleting a single nid of a peer (10.10.10.3@tcp):

lnetctl peer del --prim_nid 10.10.10.2@tcp --nid 10.10.10.3@tcp

Example of deleting the entire peer:

lnetctl peer del --prim_nid 10.10.10.2@tcp
Introduced in Lustre 2.11

9.1.5. Dynamic Peer Discovery

9.1.5.1. Overview

Dynamic Discovery (DD) is a feature that allows nodes to dynamically discover a peer's interfaces without having to explicitly configure them. This is very useful for Multi-Rail (MR) configurations. In large clusters, there could be hundreds of nodes and having to configure MR peers on each node becomes error prone. Dynamic Discovery is enabled by default and uses a new protocol based on LNet pings to discover the interfaces of the remote peers on first message.

9.1.5.2. Protocol

When LNet on a node is requested to send a message to a peer it first attempts to ping the peer. The reply to the ping contains the peer's NIDs as well as a feature bit outlining what the peer supports. Dynamic Discovery adds a Multi-Rail feature bit. If the peer is Multi-Rail capable, it sets the MR bit in the ping reply. When the node receives the reply it checks the MR bit, and if it is set it then pushes its own list of NIDs to the peer using a new PUT message, referred to as a "push ping". After this brief protocol, both the peer and the node will have each other's list of interfaces. The MR algorithm can then proceed to use the list of interfaces of the corresponding peer.

If the peer is not MR capable, it will not set the MR feature bit in the ping reply. The node will understand that the peer is not MR capable and will only use the interface provided by upper layers for sending messages.

9.1.5.3. Dynamic Discovery and User-space Configuration

It is possible to configure the peer manually while Dynamic Discovery is running. Manual peer configuration always takes precedence over Dynamic Discovery. If there is a discrepancy between the manual configuration and the dynamically discovered information, a warning is printed.

9.1.5.4. Configuration

Dynamic Discovery is very light on the configuration side. It can only be turned on or turned off. To turn the feature on or off, the following command is used:

lnetctl set discovery [0 | 1]

To check the current discovery setting, the lnetctl global show command can be used as shown in Section 9.1.2, “Displaying Global Settings”.

9.1.5.5. Initiating Dynamic Discovery on Demand

It is possible to initiate the Dynamic Discovery protocol on demand without having to wait for a message to be sent to the peer. This can be done with the following command:

lnetctl discover <peer_nid> [<peer_nid> ...]
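
For example, to discover the interfaces of a single peer (the NID shown is a placeholder):

lnetctl discover 192.168.122.78@tcp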

9.1.6. Adding, Deleting and Showing routes

A set of routes can be added to identify how LNet messages are to be routed.

lnetctl route add: add a route
        --net: net name (ex tcp0) LNet message is destined to.
               This cannot be a local network.
        --gateway: gateway node nid (ex 10.1.1.2@tcp) to route
                   all LNet messages destined for the identified
                   network
        --hop: number of hops to final destination
               (1 <= hops <= 255) (optional)
        --priority: priority of route (0 - highest prio) (optional)

Example:
lnetctl route add --net tcp2 --gateway 192.168.205.130@tcp1 --hop 2 --prio 1

Routes can be deleted via the following lnetctl command.

lnetctl route del: delete a route
        --net: net name (ex tcp0)
        --gateway: gateway nid (ex 10.1.1.2@tcp)

Example:
lnetctl route del --net tcp2 --gateway 192.168.205.130@tcp1

Configured routes can be shown via the following lnetctl command.

lnetctl route show: show routes
        --net: net name (ex tcp0) to filter on
        --gateway: gateway nid (ex 10.1.1.2@tcp) to filter on
        --hop: number of hops to final destination
               (1 <= hops <= 255) to filter on (-1 default)
        --priority: priority of route (0 - highest prio)
                    to filter on (0 default)
        --verbose: display detailed output per route

Examples:
# non-detailed show
lnetctl route show

# detailed show
lnetctl route show --verbose

When showing routes the --verbose option outputs more detailed information. All show and error output are in YAML format. Below are examples of both non-detailed and detailed route show output.

#Non-detailed output
> lnetctl route show
route:
    - net: tcp2
      gateway: 192.168.205.130@tcp1

#detailed output
> lnetctl route show --verbose
route:
    - net: tcp2
      gateway: 192.168.205.130@tcp1
      hop: 2
      priority: 1
      state: down

9.1.7. Enabling and Disabling Routing

When an LNet node is configured as a router, it will route LNet messages not destined for itself. This feature can be enabled or disabled as follows.

lnetctl set routing [0 | 1]
# 0 - disable routing feature
# 1 - enable routing feature

9.1.8. Showing routing information

When routing is enabled on a node, the tiny, small and large routing buffers are allocated. See Section 34.3, “ Tuning LNet Parameters” for more details on router buffers. This information can be shown as follows:

lnetctl routing show: show routing information

Example:
lnetctl routing show

An example of the show output:

> lnetctl routing show
routing:
    - cpt[0]:
          tiny:
              npages: 0
              nbuffers: 2048
              credits: 2048
              mincredits: 2048
          small:
              npages: 1
              nbuffers: 16384
              credits: 16384
              mincredits: 16384
          large:
              npages: 256
              nbuffers: 1024
              credits: 1024
              mincredits: 1024
    - enable: 1

9.1.9. Configuring Routing Buffers

The configured routing buffer values specify the number of buffers in each of the tiny, small and large groups.

It is often desirable to configure the tiny, small and large routing buffers to values other than the default. These are global values; when set, they are used by all configured CPU partitions. If routing is enabled, then the values set take effect immediately. If a larger number of buffers is specified, then buffers are allocated to satisfy the configuration change. If fewer buffers are configured, then the excess buffers are freed as they become unused. If routing is not enabled, the values are not changed. The buffer values are reset to default if routing is turned off and on.

The lnetctl 'set' command can be used to set these buffer values. A VALUE greater than 0 will set the number of buffers accordingly. A VALUE of 0 will reset the number of buffers to system defaults.

set tiny_buffers:
      set tiny routing buffers
               VALUE must be greater than or equal to 0

set small_buffers: set small routing buffers
        VALUE must be greater than or equal to 0

set large_buffers: set large routing buffers
        VALUE must be greater than or equal to 0

Usage examples:

> lnetctl set tiny_buffers 4096
> lnetctl set small_buffers 8192
> lnetctl set large_buffers 2048

The buffers can be set back to the default values as follows:

> lnetctl set tiny_buffers 0
> lnetctl set small_buffers 0
> lnetctl set large_buffers 0
Introduced in Lustre 2.13

9.1.10. Asymmetrical Routes

9.1.10.1. Overview

An asymmetrical route exists when a message from a remote peer arrives through a router that this node does not itself use to reach that peer.

Asymmetrical routes can be an issue when debugging the network, and allowing them also opens the door to attacks in which hostile clients inject data into the servers.

It is therefore possible to activate a check in LNet that detects any asymmetrical route message and drops it.

9.1.10.2. Configuration

In order to switch asymmetric route detection on or off, the following command is used:

lnetctl set drop_asym_route [0 | 1]

This command works on a per-node basis. This means each node in a Lustre cluster can decide whether it accepts asymmetrical route messages.

To check the current drop_asym_route setting, the lnetctl global show command can be used as shown in Section 9.1.2, “Displaying Global Settings”.

By default, asymmetric route detection is off.

9.1.11. Importing YAML Configuration File

Configuration can be described in YAML format and can be fed into the lnetctl utility. The lnetctl utility parses the YAML file and performs the specified operation on all entities described therein. If no operation is defined in the command as shown below, the default operation is 'add'. The YAML syntax is described in a later section.

lnetctl import FILE.yaml
lnetctl import < FILE.yaml

The 'lnetctl import' command provides three optional parameters to define the operation to be performed on the configuration items described in the YAML file.

# if no options are given to the command the "add" command is assumed
              # by default.
lnetctl import --add FILE.yaml
lnetctl import --add < FILE.yaml

# to delete all items described in the YAML file
lnetctl import --del FILE.yaml
lnetctl import --del < FILE.yaml

# to show all items described in the YAML file
lnetctl import --show FILE.yaml
lnetctl import --show < FILE.yaml

9.1.12. Exporting Configuration in YAML format

The lnetctl utility provides the 'export' command to dump the current LNet configuration in YAML format:

lnetctl export FILE.yaml
lnetctl export > FILE.yaml

9.1.13. Showing LNet Traffic Statistics

The lnetctl utility can dump the LNet traffic statistics as follows:

lnetctl stats show

9.1.14. YAML Syntax

The lnetctl utility can take in a YAML file describing the configuration items that need to be operated on and perform one of the following operations: add, delete or show on the items described therein.

Net, routing and route YAML blocks are all defined as a YAML sequence, as shown in the following sections. The stats YAML block is a YAML object. Each sequence item can take a seq_no field. This seq_no field is returned in the error block. This allows the caller to associate the error with the item that caused the error. The lnetctl utility does a best effort at configuring items defined in the YAML file. It does not stop processing the file at the first error.

Below is the YAML syntax describing the various configuration elements which can be operated on via DLC. Not all YAML elements are required for all operations (add/delete/show). The system ignores elements which are not pertinent to the requested operation.

9.1.14.1. Network Configuration

net:
   - net: <network.  Ex: tcp or o2ib>
     interfaces:
         0: <physical interface>
     detail: <This is only applicable for show command.  1 - output detailed info.  0 - basic output>
     tunables:
        peer_timeout: <Integer. Timeout before consider a peer dead>
        peer_credits: <Integer. Transmit credits for a peer>
        peer_buffer_credits: <Integer. Credits available for receiving messages>
        credits: <Integer.  Network Interface credits>
        SMP: <An array of integers of the form: "[x,y,...]", where each integer represents the CPT to associate the network interface with>
     seq_no: <integer.  Optional.  User generated, and is passed back in the YAML error block>

Both seq_no and detail fields do not appear in the show output.
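
For example, a minimal net block based on the tcp2 network shown earlier in this chapter (the interface name and tunable values are illustrative only), which could be applied with lnetctl import --add:

net:
   - net: tcp2
     interfaces:
         0: eth3
     tunables:
        peer_timeout: 180
        peer_credits: 8
        credits: 256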

9.1.14.2. Enable Routing and Adjust Router Buffer Configuration

routing:
    - tiny: <Integer. Tiny buffers>
      small: <Integer. Small buffers>
      large: <Integer. Large buffers>
      enable: <0 - disable routing.  1 - enable routing>
      seq_no: <Integer.  Optional.  User generated, and is passed back in the YAML error block>

The seq_no field does not appear in the show output.

9.1.14.3. Show Statistics

statistics:
    seq_no: <Integer. Optional.  User generated, and is passed back in the YAML error block>

The seq_no field does not appear in the show output.

9.1.14.4. Route Configuration

route:
  - net: <network. Ex: tcp or o2ib>
    gateway: <nid of the gateway in the form <ip>@<net>: Ex: 192.168.29.1@tcp>
    hop: <an integer between 1 and 255. Optional>
    detail: <This is only applicable for show commands.  1 - output detailed info.  0 - basic output>
    seq_no: <integer. Optional. User generated, and is passed back in the YAML error block>

Both seq_no and detail fields do not appear in the show output.

9.2.  Overview of LNet Module Parameters

LNet kernel module (lnet) parameters specify how LNet is to be configured to work with Lustre, including which NICs will be used for Lustre file system traffic and how routing is to be performed between Lustre networks.

Parameters for LNet can be specified in the /etc/modprobe.d/lustre.conf file. In some cases the parameters may have been stored in /etc/modprobe.conf, but this has been deprecated since before RHEL5 and SLES10, and having a separate /etc/modprobe.d/lustre.conf file simplifies administration and distribution of the Lustre networking configuration. This file contains one or more entries with the syntax:

options lnet parameter=value

To specify the network interfaces that are to be used for Lustre, set either the networks parameter or the ip2nets parameter (only one of these parameters can be used at a time):

  • networks - Specifies the networks to be used.

  • ip2nets - Lists globally-available networks, each with a range of IP addresses. LNet then identifies locally-available networks through address list-matching lookup.

See Section 9.3, “Setting the LNet Module networks Parameter” and Section 9.4, “Setting the LNet Module ip2nets Parameter” for more details.

To set up routing between networks, use:

  • routes - Lists networks and the NIDs of routers that forward to them.

See Section 9.5, “Setting the LNet Module routes Parameter” for more details.

A router checker can be configured to enable Lustre nodes to detect router health status, avoid routers that appear dead, and reuse those that restore service after failures. See Section 9.7, “Configuring the Router Checker” for more details.

For a complete reference to the LNet module parameters, see Chapter 43, Configuration Files and Module Parameters (LNet Options).

Note

We recommend that you use 'dotted-quad' notation for IP addresses rather than host names to make it easier to read debug logs and debug configurations with multiple interfaces.

9.2.1. Using a Lustre Network Identifier (NID) to Identify a Node

A Lustre network identifier (NID) is used to uniquely identify a Lustre network endpoint by node ID and network type. The format of the NID is:

network_id@network_type

Examples are:

10.67.73.200@tcp0
10.67.75.100@o2ib

The first entry above identifies a TCP/IP node, while the second entry identifies an InfiniBand node.

When a mount command is run on a client, the client uses the NID of the MDS to retrieve configuration information. If an MDS has more than one NID, the client should use the appropriate NID for its local network.

To determine the appropriate NID to specify in the mount command, use the lctl command. To display MDS NIDs, run on the MDS:

lctl list_nids

To determine if a client can reach the MDS using a particular NID, run on the client:

lctl which_nid MDS_NID
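
For example, using the NIDs shown earlier in this section (the output is illustrative):

mds# lctl list_nids
10.67.73.200@tcp0
10.67.75.100@o2ib

client# lctl which_nid 10.67.73.200@tcp0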

9.3. Setting the LNet Module networks Parameter

If a node has more than one network interface, you'll typically want to dedicate a specific interface to Lustre. You can do this by including an entry in the lustre.conf file on the node that sets the LNet module networks parameter:

options lnet networks=comma-separated list of networks

This example specifies that a Lustre node will use a TCP/IP interface and an InfiniBand interface:

options lnet networks=tcp0(eth0),o2ib(ib0)

This example specifies that the Lustre node will use the TCP/IP interface eth1:

options lnet networks=tcp0(eth1)

Depending on the network design, it may be necessary to specify explicit interfaces. To explicitly specify that interface eth2 be used for network tcp0 and eth3 be used for tcp1, use this entry:

options lnet networks=tcp0(eth2),tcp1(eth3)

When more than one interface is available during the network setup, Lustre chooses the best route based on the hop count. Once the network connection is established, Lustre expects the network to stay connected. In a Lustre network, connections do not fail over to another interface, even if multiple interfaces are available on the same node.

Note

LNet lines in lustre.conf are only used by the local node to determine what to call its interfaces. They are not used for routing decisions.

9.3.1. Multihome Server Example

If a server with multiple IP addresses (multihome server) is connected to a Lustre network, certain configuration settings are required. An example illustrating these settings consists of a network with the following nodes:

  • Server svr1 with three TCP NICs (eth0, eth1, and eth2) and an InfiniBand NIC.

  • Server svr2 with three TCP NICs (eth0, eth1, and eth2) and an InfiniBand NIC. Interface eth2 will not be used for Lustre networking.

  • TCP clients, each with a single TCP interface.

  • InfiniBand clients, each with a single Infiniband interface and a TCP/IP interface for administration.

To set the networks option for this example:

  • On each server, svr1 and svr2, include the following line in the lustre.conf file:

options lnet networks=tcp0(eth0),tcp1(eth1),o2ib
  • For TCP-only clients, the first available non-loopback IP interface is used for tcp0. Thus, TCP clients with only one interface do not need to have options defined in the lustre.conf file.

  • On the InfiniBand clients, include the following line in the lustre.conf file:

options lnet networks=o2ib

Note

By default, Lustre ignores the loopback (lo0) interface. Lustre does not ignore IP addresses aliased to the loopback. If you alias IP addresses to the loopback interface, you must specify all Lustre networks using the LNet networks parameter.

Note

If the server has multiple interfaces on the same subnet, the Linux kernel will send all traffic using the first configured interface. This is a limitation of Linux, not Lustre. In this case, network interface bonding should be used. For more information about network interface bonding, see Chapter 7, Setting Up Network Interface Bonding.

9.4. Setting the LNet Module ip2nets Parameter

The ip2nets option is typically used when a single, universal lustre.conf file is run on all servers and clients. Each node identifies the locally available networks based on the listed IP address patterns that match the node's local IP addresses.

Note that the IP address patterns listed in the ip2nets option are only used to identify the networks that an individual node should instantiate. They are not used by LNet for any other communications purpose.

For the example below, the nodes in the network have these IP addresses:

  • Server svr1: eth0 IP address 192.168.0.2, IP over Infiniband (o2ib) address 132.6.1.2.

  • Server svr2: eth0 IP address 192.168.0.4, IP over Infiniband (o2ib) address 132.6.1.4.

  • TCP clients have IP addresses 192.168.0.5-255.

  • Infiniband clients have IP over Infiniband (o2ib) addresses 132.6.[2-3].2, .4, .6, .8.

The following entry is placed in the lustre.conf file on each server and client:

options lnet 'ip2nets="tcp0(eth0) 192.168.0.[2,4]; \
tcp0 192.168.0.*; o2ib0 132.6.[1-3].[2-8/2]"'

Each entry in ip2nets is referred to as a 'rule'.

The order of LNet entries is important when configuring servers. If a server node can be reached using more than one network, the first network specified in lustre.conf will be used.

Because svr1 and svr2 match the first rule, LNet uses eth0 for tcp0 on those machines. (Although svr1 and svr2 also match the second rule, the first matching rule for a particular network is used).

The [2-8/2] format indicates a range of 2-8 stepped by 2; that is 2,4,6,8. Thus, a client at 132.6.3.5 will not find a matching o2ib network.

Introduced in Lustre 2.10

Note

Multi-rail deprecates the kernel parsing of ip2nets. ip2nets patterns are matched in user space and translated into network interfaces to be added to the system.

The first interface that matches the IP pattern will be used when adding a network interface.

If an interface is explicitly specified as well as a pattern, the interface matched using the IP pattern will be sanitized against the explicitly-defined interface.

For example, given the rule tcp(eth0) 192.168.*.3, if the system has eth0 == 192.158.19.3 and eth1 == 192.168.3.3, then the configuration will fail, because the IP pattern matches eth1 while the rule explicitly specifies eth0.

A clear warning will be displayed if inconsistent configuration is encountered.

You could use the following command to configure ip2nets:

lnetctl import < ip2nets.yaml

For example:

ip2nets:
  - net-spec: tcp1
    interfaces:
         0: eth0
         1: eth1
    ip-range:
         0: 192.168.*.19
         1: 192.168.100.105
  - net-spec: tcp2
    interfaces:
         0: eth2
    ip-range:
         0: 192.168.*.*
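
After importing such a file, the network interfaces that were actually instantiated on the node can be reviewed with:

lnetctl net show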

9.5. Setting the LNet Module routes Parameter

The LNet module routes parameter is used to identify routers in a Lustre configuration. This parameter is set in the lustre.conf file on each Lustre node.

Routes are typically set to connect to segregated subnetworks or to cross-connect two different types of networks, such as tcp and o2ib.

The LNet routes parameter specifies a colon-separated list of router definitions. Each route is defined as a network number, followed by a list of routers:

routes=net_type router_NID(s)

This example specifies bi-directional routing in which TCP clients can reach Lustre resources on the IB networks and IB servers can access the TCP networks:

options lnet 'ip2nets="tcp0 192.168.0.*; \
  o2ib0(ib0) 132.6.1.[1-128]"' 'routes="tcp0   132.6.1.[1-8]@o2ib0; \
  o2ib0 192.168.0.[1-8]@tcp0"'

All LNet routers that bridge two networks are equivalent. They are not configured as primary or secondary, and the load is balanced across all available routers.

The number of LNet routers is not limited. Enough routers should be used to handle the required file serving bandwidth plus a 25 percent margin for headroom.

9.5.1. Routing Example

On the clients, place the following entry in the lustre.conf file:

options lnet networks="tcp" routes="o2ib0 192.168.0.[1-8]@tcp0"

On the router nodes, use:

options lnet networks="tcp o2ib" forwarding=enabled

On the MDS, use the reverse as shown below:

options lnet networks="o2ib0" routes="tcp0 132.6.1.[1-8]@o2ib0"

To start the routers, run:

modprobe lnet
lctl network configure
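
On releases that provide the lnetctl utility, the configuration that is actually in effect can then be checked; for example:

router# lnetctl net show
client# lnetctl route show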

9.6. Testing the LNet Configuration

After configuring Lustre Networking, it is highly recommended that you test your LNet configuration using the LNet Self-Test provided with the Lustre software. For more information about using LNet Self-Test, see Chapter 32, Testing Lustre Network Performance (LNet Self-Test).

9.7. Configuring the Router Checker

In a Lustre configuration in which different types of networks, such as a TCP/IP network and an Infiniband network, are connected by routers, a router checker can be run on the clients and servers in the routed configuration to monitor the status of the routers. In a multi-hop routing configuration, router checkers can be configured on routers to monitor the health of their next-hop routers.

A router checker is configured by setting LNet parameters in lustre.conf by including an entry in this form:

options lnet router_checker_parameter=value

The router checker parameters are described below; a combined example follows the list:

  • live_router_check_interval - Specifies a time interval in seconds after which the router checker will ping the live routers. The default value is 0, meaning no checking is done. To set the value to 60, enter:

    options lnet live_router_check_interval=60
  • dead_router_check_interval - Specifies a time interval in seconds after which the router checker will check for dead routers. The default value is 0, meaning no checking is done. To set the value to 60, enter:

    options lnet dead_router_check_interval=60
  • auto_down - Enables/disables (1/0) the automatic marking of router state as up or down. The default value is 1. To disable router marking, enter:

    options lnet auto_down=0
  • router_ping_timeout - Specifies a timeout for the router checker when it checks live or dead routers. The router checker sends a ping message to each dead or live router once every dead_router_check_interval or live_router_check_interval respectively. The default value is 50. To set the value to 60, enter:

    options lnet router_ping_timeout=60

    Note

    The router_ping_timeout is consistent with the default LND timeouts. You may have to increase it on very large clusters if the LND timeout is also increased. For larger clusters, we suggest increasing the check interval.

  • check_routers_before_use - Specifies that routers are to be checked before use. Set to off by default. If this parameter is set to on, the dead_router_check_interval parameter must be given a positive integer value.

    options lnet check_routers_before_use=on
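
These parameters can also be combined on a single options line. The following sketch enables the router checker with 60-second check intervals; the values shown are illustrative:

options lnet live_router_check_interval=60 dead_router_check_interval=60 router_ping_timeout=60 check_routers_before_use=on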

The router checker obtains the following information from each router:

  • Time the router was disabled

  • Elapsed disable time

If the router checker does not get a reply message from the router within router_ping_timeout seconds, it considers the router to be down.

If a router is marked 'up' and responds to a ping, the timeout is reset.

If 100 packets have been sent successfully through a router, the sent-packets counter for that router will have a value of 100.

9.8. Best Practices for LNet Options

For the networks, ip2nets, and routes options, follow these best practices to avoid configuration errors.

9.8.1. Escaping commas with quotes

Depending on the Linux distribution, commas may need to be escaped using single or double quotes. In the extreme case, the options entry would look like this:

options lnet 'networks="tcp0,elan0"' 'routes="tcp [2,10]@elan0"'

Added quotes may confuse some distributions. Messages such as the following may indicate an issue related to added quotes:

lnet: Unknown parameter 'networks'

A 'Refusing connection - no matching NID' message generally points to an error in the LNet module configuration.

9.8.2. Including comments

Place the semicolon terminating a comment immediately after the comment. LNet silently ignores everything between the # character at the beginning of the comment and the next semicolon.

In this incorrect example, LNet silently ignores pt11 192.168.0.[92,96], resulting in these nodes not being properly initialized. No error message is generated.

options lnet ip2nets="pt10 192.168.0.[89,93]; # comment with semicolon BEFORE comment \
pt11 192.168.0.[92,96];

This correct example shows the required syntax:

options lnet ip2nets="pt10 192.168.0.[89,93] \
# comment with semicolon AFTER comment; \
pt11 192.168.0.[92,96] # comment

Do not add an excessive number of comments. The Linux kernel limits the length of character strings used in module options (usually to 1KB, but this may differ between vendor kernels). If you exceed this limit, errors result and the specified configuration may not be processed correctly.

Chapter 10. Configuring a Lustre File System

This chapter shows how to configure a simple Lustre file system comprised of a combined MGS/MDT, an OST and a client. It includes:

10.1.  Configuring a Simple Lustre File System

A Lustre file system can be set up in a variety of configurations by using the administrative utilities provided with the Lustre software. The procedure below shows how to configure a simple Lustre file system consisting of a combined MGS/MDS, one OSS with two OSTs, and a client. For an overview of the entire Lustre installation procedure, see Chapter 4, Installation Overview.

This configuration procedure assumes you have completed the following:

The following optional steps should also be completed, if needed, before the Lustre software is configured:

  • Set up a hardware or software RAID on block devices to be used as OSTs or MDTs. For information about setting up RAID, see the documentation for your RAID controller or Chapter 6, Configuring Storage on a Lustre File System.

  • Set up network interface bonding on Ethernet interfaces. For information about setting up network interface bonding, see Chapter 7, Setting Up Network Interface Bonding.

  • Set lnet module parameters to specify how Lustre Networking (LNet) is to be configured to work with a Lustre file system and test the LNet configuration. LNet will, by default, use the first TCP/IP interface it discovers on a system. If this network configuration is sufficient, you do not need to configure LNet. LNet configuration is required if you are using InfiniBand or multiple Ethernet interfaces.

For information about configuring LNet, see Chapter 9, Configuring Lustre Networking (LNet). For information about testing LNet, see Chapter 32, Testing Lustre Network Performance (LNet Self-Test).

  • Run the benchmark script sgpdd-survey to determine baseline performance of your hardware. Benchmarking your hardware will simplify debugging performance issues that are unrelated to the Lustre software and ensure you are getting the best possible performance with your installation. For information about running sgpdd-survey, see Chapter 33, Benchmarking Lustre File System Performance (Lustre I/O Kit).

Note

The sgpdd-survey script overwrites the device being tested so it must be run before the OSTs are configured.

To configure a simple Lustre file system, complete these steps:

  1. Create a combined MGS/MDT file system on a block device. On the MDS node, run:

    mkfs.lustre --fsname=fsname --mgs --mdt --index=0 /dev/block_device

    The default file system name (fsname) is lustre.

    Note

    If you plan to create multiple file systems, the MGS should be created separately on its own dedicated block device, by running:

    mkfs.lustre --fsname=fsname --mgs /dev/block_device

    See Section 13.8, “Running Multiple Lustre File Systems” for more details.

  2. Optionally, add additional MDTs.

    mkfs.lustre --fsname=fsname --mgsnode=nid --mdt --index=1 /dev/block_device

    Note

    Up to 4095 additional MDTs can be added.

  3. Mount the combined MGS/MDT file system on the block device. On the MDS node, run:

    mount -t lustre /dev/block_device /mount_point

    Note

    If you have created an MGS and an MDT on separate block devices, mount them both.

  4. Create the OST. On the OSS node, run:

    mkfs.lustre --fsname=fsname --mgsnode=MGS_NID --ost --index=OST_index /dev/block_device

    When you create an OST, you are formatting an ldiskfs or ZFS file system on a block storage device, as you would with any local file system.

    You can have as many OSTs per OSS as the hardware or drivers allow. For more information about storage and memory requirements for a Lustre file system, see Chapter 5, Determining Hardware Configuration Requirements and Formatting Options.

    You can only configure one OST per block device. You should create an OST that uses the raw block device and does not use partitioning.

    Specify the OST index number at format time to simplify translating the OST number in error messages or file striping information to the correct OSS node and block device later on.

    If you are using block devices that are accessible from multiple OSS nodes, ensure that you mount the OSTs from only one OSS node at a time. It is strongly recommended that multiple-mount protection be enabled for such devices to prevent serious data corruption. For more information about multiple-mount protection, see Chapter 24, Lustre File System Failover and Multiple-Mount Protection.

    Note

    The Lustre software currently supports block devices up to 128 TB on Red Hat Enterprise Linux 5 and 6 (up to 8 TB on other distributions). If the device size is only slightly larger than 16 TB, it is recommended that you limit the file system size to 16 TB at format time. We recommend that you not place DOS partitions on top of RAID 5/6 block devices due to negative impacts on performance, but instead format the whole disk for the file system.

  5. Mount the OST. On the OSS node where the OST was created, run:

    mount -t lustre /dev/block_device /mount_point

    Note

    To create additional OSTs, repeat Step 4 and Step 5, specifying the next higher OST index number.

  6. Mount the Lustre file system on the client. On the client node, run:

    mount -t lustre MGS_node:/fsname /mount_point

    Note

    To mount the filesystem on additional clients, repeat Step 6.

    Note

    If you have a problem mounting the file system, check the syslogs on the client and all the servers for errors and also check the network settings. A common issue with newly-installed systems is that hosts.deny or firewall rules may prevent connections on port 988.

  7. Verify that the file system started and is working correctly. Do this by running lfs df, dd and ls commands on the client node.

  8. (Optional) Run benchmarking tools to validate the performance of hardware and software layers in the cluster. Available tools include:

10.1.1.  Simple Lustre Configuration Example

To see the steps to complete for a simple Lustre file system configuration, follow this example in which a combined MGS/MDT and two OSTs are created to form a file system called temp. Three block devices are used, one for the combined MGS/MDS node and one for each OSS node. Common parameters used in the example are listed below, along with individual node parameters.

Common Parameters

   Parameter       Value           Description
   MGS node        10.2.0.1@tcp0   Node for the combined MGS/MDS
   file system     temp            Name of the Lustre file system
   network type    TCP/IP          Network type used for Lustre file system temp

Node Parameters

   MGS/MDS node
      MGS/MDS node   mdt0        MDS in Lustre file system temp
      block device   /dev/sdb    Block device for the combined MGS/MDS node
      mount point    /mnt/mdt    Mount point for the mdt0 block device (/dev/sdb) on the MGS/MDS node

   First OSS node
      OSS node       oss0        First OSS node in Lustre file system temp
      OST            ost0        First OST in Lustre file system temp
      block device   /dev/sdc    Block device for the first OSS node (oss0)
      mount point    /mnt/ost0   Mount point for the ost0 block device (/dev/sdc) on the oss0 node

   Second OSS node
      OSS node       oss1        Second OSS node in Lustre file system temp
      OST            ost1        Second OST in Lustre file system temp
      block device   /dev/sdd    Block device for the second OSS node (oss1)
      mount point    /mnt/ost1   Mount point for the ost1 block device (/dev/sdd) on the oss1 node

   Client node
      client node    client1     Client in Lustre file system temp
      mount point    /lustre     Mount point for Lustre file system temp on the client1 node

Note

We recommend that you use 'dotted-quad' notation for IP addresses rather than host names to make it easier to read debug logs and debug configurations with multiple interfaces.

For this example, complete the steps below:

  1. Create a combined MGS/MDT file system on the block device. On the MDS node, run:

    [root@mds /]# mkfs.lustre --fsname=temp --mgs --mdt --index=0 /dev/sdb
    

    This command generates this output:

        Permanent disk data:
    Target:            temp-MDT0000
    Index:             0
    Lustre FS: temp
    Mount type:        ldiskfs
    Flags:             0x75
       (MDT MGS first_time update )
    Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
    Parameters: mdt.identity_upcall=/usr/sbin/l_getidentity
     
    checking for existing Lustre data: not found
    device size = 16MB
    2 6 18
    formatting backing filesystem ldiskfs on /dev/sdb
       target name             temp-MDTffff
       4k blocks               0
       options                 -i 4096 -I 512 -q -O dir_index,uninit_groups -F
    mkfs_cmd = mkfs.ext2 -j -b 4096 -L temp-MDTffff  -i 4096 -I 512 -q -O 
    dir_index,uninit_groups -F /dev/sdb
    Writing CONFIGS/mountdata 
    
  2. Mount the combined MGS/MDT file system on the block device. On the MDS node, run:

    [root@mds /]# mount -t lustre /dev/sdb /mnt/mdt
    

    This command generates this output:

    Lustre: temp-MDT0000: new disk, initializing 
    Lustre: 3009:0:(lproc_mds.c:262:lprocfs_wr_identity_upcall()) temp-MDT0000:
    group upcall set to /usr/sbin/l_getidentity
    Lustre: temp-MDT0000.mdt: set parameter identity_upcall=/usr/sbin/l_getidentity
    Lustre: Server temp-MDT0000 on device /dev/sdb has started 
    
  3. Create and mount ost0.

    In this example, the OSTs ( ost0 and ost1) are being created on different OSS nodes ( oss0 and oss1 respectively).

    1. Create ost0. On oss0 node, run:

      [root@oss0 /]# mkfs.lustre --fsname=temp --mgsnode=10.2.0.1@tcp0 --ost \
      --index=0 /dev/sdc
      

      The command generates this output:

          Permanent disk data:
      Target:            temp-OST0000
      Index:             0
      Lustre FS: temp
      Mount type:        ldiskfs
      Flags:             0x72
      (OST first_time update)
      Persistent mount opts: errors=remount-ro,extents,mballoc
      Parameters: mgsnode=10.2.0.1@tcp
       
      checking for existing Lustre data: not found
      device size = 16MB
      2 6 18
      formatting backing filesystem ldiskfs on /dev/sdc
         target name             temp-OST0000
         4k blocks               0
         options                 -I 256 -q -O dir_index,uninit_groups -F
      mkfs_cmd = mkfs.ext2 -j -b 4096 -L temp-OST0000  -I 256 -q -O
      dir_index,uninit_groups -F /dev/sdc
      Writing CONFIGS/mountdata 
      
    2. Mount ost0 on the OSS on which it was created. On oss0 node, run:

      [root@oss0 /]# mount -t lustre /dev/sdc /mnt/ost0
      

      The command generates this output:

      LDISKFS-fs: file extents enabled 
      LDISKFS-fs: mballoc enabled
      Lustre: temp-OST0000: new disk, initializing
      Lustre: Server temp-OST0000 on device /dev/sdc has started
      

      Shortly afterwards, this output appears:

      Lustre: temp-OST0000: received MDS connection from 10.2.0.1@tcp0
      Lustre: MDS temp-MDT0000: temp-OST0000_UUID now active, resetting orphans 
      
  4. Create and mount ost1.

    1. Create ost1. On oss1 node, run:

      [root@oss1 /]# mkfs.lustre --fsname=temp --mgsnode=10.2.0.1@tcp0 \
                 --ost --index=1 /dev/sdd
      

      The command generates this output:

          Permanent disk data:
      Target:            temp-OST0001
      Index:             1
      Lustre FS: temp
      Mount type:        ldiskfs
      Flags:             0x72
      (OST first_time update)
      Persistent mount opts: errors=remount-ro,extents,mballoc
      Parameters: mgsnode=10.2.0.1@tcp
       
      checking for existing Lustre data: not found
      device size = 16MB
      2 6 18
      formatting backing filesystem ldiskfs on /dev/sdd
         target name             temp-OST0001
         4k blocks               0
         options                 -I 256 -q -O dir_index,uninit_groups -F
      mkfs_cmd = mkfs.ext2 -j -b 4096 -L temp-OST0001  -I 256 -q -O
      dir_index,uninit_groups -F /dev/sdd
      Writing CONFIGS/mountdata 
      
    2. Mount ost1 on the OSS on which it was created. On oss1 node, run:

      [root@oss1 /]# mount -t lustre /dev/sdd /mnt/ost1
      

      The command generates this output:

      LDISKFS-fs: file extents enabled 
      LDISKFS-fs: mballoc enabled
      Lustre: temp-OST0001: new disk, initializing
      Lustre: Server temp-OST0001 on device /dev/sdd has started
      

      Shortly afterwards, this output appears:

      Lustre: temp-OST0001: received MDS connection from 10.2.0.1@tcp0
      Lustre: MDS temp-MDT0000: temp-OST0001_UUID now active, resetting orphans 
      
  5. Mount the Lustre file system on the client. On the client node, run:

    [root@client1 /]# mount -t lustre 10.2.0.1@tcp0:/temp /lustre
    

    This command generates this output:

    Lustre: Client temp-client has started
    
  6. Verify that the file system started and is working by running the lfs df, dd and ls commands on the client node.

    1. Run the lfs df -h command:

      [root@client1 /] lfs df -h 
      

      The lfs df -h command lists space usage per OST and the MDT in human-readable format. This command generates output similar to this:

      UUID               bytes      Used      Available   Use%    Mounted on
      temp-MDT0000_UUID  8.0G      400.0M       7.6G        0%      /lustre[MDT:0]
      temp-OST0000_UUID  800.0G    400.0M     799.6G        0%      /lustre[OST:0]
      temp-OST0001_UUID  800.0G    400.0M     799.6G        0%      /lustre[OST:1]
      filesystem summary:  1.6T    800.0M       1.6T        0%      /lustre
      
    2. Run the lfs df -ih command.

      [root@client1 /] lfs df -ih
      

      The lfs df -ih command lists inode usage per OST and the MDT. This command generates output similar to this:

      UUID              Inodes      IUsed       IFree   IUse%     Mounted on
      temp-MDT0000_UUID   2.5M        32         2.5M      0%       /lustre[MDT:0]
      temp-OST0000_UUID   5.5M        54         5.5M      0%       /lustre[OST:0]
      temp-OST0001_UUID   5.5M        54         5.5M      0%       /lustre[OST:1]
      filesystem summary: 2.5M        32         2.5M      0%       /lustre
      
    3. Run the dd command:

      [root@client1 /] cd /lustre
      [root@client1 /lustre] dd if=/dev/zero of=/lustre/zero.dat bs=4M count=2
      

      The dd command verifies write functionality by creating a file containing all zeros (0s). In this command, an 8 MB file is created. This command generates output similar to this:

      2+0 records in
      2+0 records out
      8388608 bytes (8.4 MB) copied, 0.159628 seconds, 52.6 MB/s
      
    4. Run the ls command:

      [root@client1 /lustre] ls -lsah
      

      The ls -lsah command lists files and directories in the current working directory. This command generates output similar to this:

      total 8.0M
      4.0K drwxr-xr-x  2 root root 4.0K Oct 16 15:27 .
      8.0K drwxr-xr-x 25 root root 4.0K Oct 16 15:27 ..
      8.0M -rw-r--r--  1 root root 8.0M Oct 16 15:27 zero.dat 
       
      

Once the Lustre file system is configured, it is ready for use.

10.2.  Additional Configuration Options

This section describes how to scale the Lustre file system or make configuration changes using the Lustre configuration utilities.

10.2.1.  Scaling the Lustre File System

A Lustre file system can be scaled by adding OSTs or clients. For instructions on creating additional OSTs, repeat Step 4 and Step 5 above. For mounting additional clients, repeat Step 6 for each client.

10.2.2.  Changing Striping Defaults

The default settings for the file layout stripe pattern are shown in Table 10.1, “Default stripe pattern”.

Table 10.1. Default stripe pattern

File Layout Parameter   Default   Description

stripe_size             1 MB      Amount of data to write to one OST before moving to the next OST.

stripe_count            1         The number of OSTs to use for a single file.

start_ost               -1        The first OST where objects are created for each file. The default -1 allows
                                  the MDS to choose the starting index based on available space and load
                                  balancing. It is strongly recommended not to change the default for this
                                  parameter to a value other than -1.


Use the lfs setstripe command described in Chapter 19, Managing File Layout (Striping) and Free Space to change the file layout configuration.
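
For example, to make new files created in a given directory use a 4 MB stripe size across two OSTs, and then confirm the layout (the directory path is illustrative):

client# lfs setstripe -S 4M -c 2 /mnt/lustre/dir
client# lfs getstripe /mnt/lustre/dir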

10.2.3.  Using the Lustre Configuration Utilities

If additional configuration is necessary, several configuration utilities are available:

  • mkfs.lustre - Use to format a disk for a Lustre service.

  • tunefs.lustre - Use to modify configuration information on a Lustre target disk.

  • lctl - Use to directly control Lustre features via an ioctl interface, allowing various configuration, maintenance and debugging features to be accessed.

  • mount.lustre - Use to start a Lustre client or target service.

For examples using these utilities, see Chapter 44, System Configuration Utilities.

The lfs utility is useful for configuring and querying a variety of options related to files. For more information, see Chapter 40, User Utilities.

Note

Some sample scripts are included in the directory where the Lustre software is installed. If you have installed the Lustre source code, the scripts are located in the lustre/tests sub-directory. These scripts enable quick setup of some simple standard Lustre configurations.

Chapter 11. Configuring Failover in a Lustre File System

This chapter describes how to configure failover in a Lustre file system. It includes:

For an overview of failover functionality in a Lustre file system, see Chapter 3, Understanding Failover in a Lustre File System.

11.1. Setting Up a Failover Environment

The Lustre software provides failover mechanisms only at the layer of the Lustre file system. No failover functionality is provided for system-level components such as failing hardware or applications, or even for the entire failure of a node, as would typically be provided in a complete failover solution. Failover functionality such as node monitoring, failure detection, and resource fencing must be provided by external HA software, such as PowerMan or the open source Corosync and Pacemaker packages provided by Linux operating system vendors. Corosync provides support for detecting failures, and Pacemaker provides the actions to take once a failure has been detected.

11.1.1. Selecting Power Equipment

Failover in a Lustre file system requires the use of a remote power control (RPC) mechanism, which comes in different configurations. For example, Lustre server nodes may be equipped with IPMI/BMC devices that allow remote power control. For recommended devices, refer to the list of supported RPC devices on the website for the PowerMan cluster power management utility:

https://github.com/chaos/powerman/tree/master/etc/devices

11.1.2. Selecting Power Management Software

Lustre failover requires RPC and management capability to verify that a failed node is off before I/O is directed to the failover node. This avoids double-mounting the two nodes and the risk of unrecoverable data corruption. A variety of power management tools will work. Two packages that have been commonly used with the Lustre software are PowerMan and Pacemaker.

The PowerMan cluster power management utility is used to control RPC devices from a central location. PowerMan provides native support for several RPC varieties and Expect-like configuration simplifies the addition of new devices. The latest versions of PowerMan are available at:

https://github.com/chaos/powerman

STONITH, or "Shoot The Other Node In The Head" is used in conjunction with High Availability node management. This is implemented by Pacemaker to ensure that a peer node that may be importing a shared storage device has been powered off and will not corrupt the shared storage if it continues running.

11.1.3. Selecting High-Availability (HA) Software

The Lustre file system must be set up with high-availability (HA) software to enable a complete Lustre failover solution. Except for PowerMan, the HA software packages mentioned above provide both power management and cluster management. For information about setting up failover with Pacemaker, see:

11.2. Preparing a Lustre File System for Failover

To prepare a Lustre file system to be configured and managed as an HA system by a third-party HA application, each storage target (MGT, MDT, OST) must be associated with a second node to create a failover pair. This configuration information is then communicated by the MGS to a client when the client mounts the file system.

The per-target configuration is relayed to the MGS at mount time. Some rules related to this are:

  • When a target is initially mounted, the MGS reads the configuration information from the target (such as mgt vs. ost, failnode, fsname) to configure the target into a Lustre file system. If the MGS is reading the initial mount configuration, the mounting node becomes that target's "primary" node.

  • When a target is subsequently mounted, the MGS reads the current configuration from the target and, as needed, will reconfigure the MGS database target information.

When the target is formatted using the mkfs.lustre command, the failover service node(s) for the target are designated using the --servicenode option. In the example below, an OST with index 0 in the file system testfs is formatted with two service nodes designated to serve as a failover pair:

mkfs.lustre --reformat --ost --fsname testfs --mgsnode=192.168.10.1@o2ib \
              --index=0 --servicenode=192.168.10.7@o2ib \
              --servicenode=192.168.10.8@o2ib \  
              /dev/sdb

More than two potential service nodes can be designated for a target. The target can then be mounted on any of the designated service nodes.

When HA is configured on a storage target, the Lustre software enables multi-mount protection (MMP) on that storage target. MMP prevents multiple nodes from simultaneously mounting and thus corrupting the data on the target. For more about MMP, see Chapter 24, Lustre File System Failover and Multiple-Mount Protection.

If the MGT has been formatted with multiple service nodes designated, this information must be conveyed to the Lustre client in the mount command used to mount the file system. In the example below, NIDs for two MGSs that have been designated as service nodes for the MGT are specified in the mount command executed on the client:

mount -t lustre 10.10.120.1@tcp1:10.10.120.2@tcp1:/testfs /lustre/testfs

When a client mounts the file system, the MGS provides configuration information to the client for the MDT(s) and OST(s) in the file system along with the NIDs for all service nodes associated with each target and the service node on which the target is mounted. Later, when the client attempts to access data on a target, it will try the NID for each specified service node until it connects to the target.

11.3. Administering Failover in a Lustre File System

For additional information about administering failover features in a Lustre file system, see:

Part III. Administering Lustre

Part III provides information about tools and procedures to use to administer a Lustre file system. You will find information in this section about:

Tip

The starting point for administering a Lustre file system is to monitor all logs and console logs for system health:

- Monitor logs on all servers and all clients.

- Invest in tools that allow you to condense logs from multiple systems.

- Use the logging resources provided in the Linux distribution.

Table of Contents

12. Monitoring a Lustre File System
12.1. Lustre Changelogs
12.1.1. Working with Changelogs
12.1.2. Changelog Examples
12.1.3. Audit with ChangelogsL 2.11
12.2. Lustre Jobstats
12.2.1. How Jobstats Works
12.2.2. Enable/Disable Jobstats
12.2.3. Check Job Stats
12.2.4. Clear Job Stats
12.2.5. Configure Auto-cleanup Interval
12.2.6. Identifying Top JobsL 2.14
12.3. Lustre Monitoring Tool (LMT)
12.4. CollectL
12.5. Other Monitoring Options
13. Lustre Operations
13.1. Mounting by Label
13.2. Starting Lustre
13.3. Mounting a Server
13.4. Stopping the Filesystem
13.5. Unmounting a Specific Target on a Server
13.6. Specifying Failout/Failover Mode for OSTs
13.7. Handling Degraded OST RAID Arrays
13.8. Running Multiple Lustre File Systems
13.9. Creating a sub-directory on a specific MDT
13.10. Creating a directory striped across multiple MDTsL 2.8
13.10.1. Directory creation by space/inode usageL 2.13
13.10.2. Filesystem-wide default directory stripingL 2.14
13.11. Default Dir Stripe Policy
13.12. Setting and Retrieving Lustre Parameters
13.12.1. Setting Tunable Parameters with mkfs.lustre
13.12.2. Setting Parameters with tunefs.lustre
13.12.3. Setting Parameters with lctl
13.13. Specifying NIDs and Failover
13.14. Erasing a File System
13.15. Reclaiming Reserved Disk Space
13.16. Replacing an Existing OST or MDT
13.17. Identifying To Which Lustre File an OST Object Belongs
14. Lustre Maintenance
14.1. Working with Inactive OSTs
14.2. Finding Nodes in the Lustre File System
14.3. Mounting a Server Without Lustre Service
14.4. Regenerating Lustre Configuration Logs
14.5. Changing a Server NID
14.6. Clearing configurationL 2.11
14.7. Adding a New MDT to a Lustre File System
14.8. Adding a New OST to a Lustre File System
14.9. Removing and Restoring MDTs and OSTs
14.9.1. Removing an MDT from the File System
14.9.2. Working with Inactive MDTs
14.9.3. Removing an OST from the File System
14.9.4. Backing Up OST Configuration Files
14.9.5. Restoring OST Configuration Files
14.9.6. Returning a Deactivated OST to Service
14.10. Aborting Recovery
14.11. Determining Which Machine is Serving an OST
14.12. Changing the Address of a Failover Node
14.13. Separate a combined MGS/MDT
14.14. Set an MDT to read-onlyL 2.13
14.15. Tune Fallocate for ldiskfsL 2.14
15. Managing Lustre Networking (LNet)
15.1. Updating the Health Status of a Peer or Router
15.2. Starting and Stopping LNet
15.2.1. Starting LNet
15.2.2. Stopping LNet
15.3. Hardware Based Multi-Rail Configurations with LNet
15.4. Load Balancing with an InfiniBand* Network
15.4.1. Setting Up lustre.conf for Load Balancing
15.5. Dynamically Configuring LNet Routes
15.5.1. lustre_routes_config
15.5.2. lustre_routes_conversion
15.5.3. Route Configuration Examples
16. LNet Software Multi-RailL 2.10
16.1. Multi-Rail Overview
16.2. Configuring Multi-Rail
16.2.1. Configure Multiple Interfaces on the Local Node
16.2.2. Deleting Network Interfaces
16.2.3. Adding Remote Peers that are Multi-Rail Capable
16.2.4. Deleting Remote Peers
16.3. Notes on routing with Multi-Rail
16.3.1. Multi-Rail Cluster Example
16.3.2. Utilizing Router Resiliency
16.3.3. Mixed Multi-Rail/Non-Multi-Rail Cluster
16.4. Multi-Rail Routing with LNet HealthL 2.13
16.4.1. Configuration
16.4.2. Router Health
16.4.3. Discovery
16.4.4. Route Aliveness Criteria
16.5. LNet HealthL 2.12
16.5.1. Health Value
16.5.2. Failure Types and Behavior
16.5.3. User Interface
16.5.4. Displaying Information
16.5.5. Initial Settings Recommendations
17. Upgrading a Lustre File System
17.1. Release Interoperability and Upgrade Requirements
17.2. Upgrading to Lustre Software Release 2.x (Major Release)
17.3. Upgrading to Lustre Software Release 2.x.y (Minor Release)
18. Backing Up and Restoring a File System
18.1. Backing up a File System
18.1.1. Lustre_rsync
18.2. Backing Up and Restoring an MDT or OST (ldiskfs Device Level)
18.3. Backing Up an OST or MDT (Backend File System Level)
18.3.1. Backing Up an OST or MDT (Backend File System Level)L 2.11
18.3.2. Backing Up an OST or MDT
18.4. Restoring a File-Level Backup
18.5. Using LVM Snapshots with the Lustre File System
18.5.1. Creating an LVM-based Backup File System
18.5.2. Backing up New/Changed Files to the Backup File System
18.5.3. Creating Snapshot Volumes
18.5.4. Restoring the File System From a Snapshot
18.5.5. Deleting Old Snapshots
18.5.6. Changing Snapshot Volume Size
18.6. Migration Between ZFS and ldiskfs Target Filesystems L 2.11
18.6.1. Migrate from a ZFS to an ldiskfs based filesystem
18.6.2. Migrate from an ldiskfs to a ZFS based filesystem
19. Managing File Layout (Striping) and Free Space
19.1. How Lustre File System Striping Works
19.2. Lustre File Layout (Striping) Considerations
19.2.1. Choosing a Stripe Size
19.3. Setting the File Layout/Striping Configuration (lfs setstripe)
19.3.1. Specifying a File Layout (Striping Pattern) for a Single File
19.3.2. Setting the Striping Layout for a Directory
19.3.3. Setting the Striping Layout for a File System
19.3.4. Per File System Stripe Count Limit
19.3.5. Creating a File on a Specific OST
19.4. Retrieving File Layout/Striping Information (getstripe)
19.4.1. Displaying the Current Stripe Size
19.4.2. Inspecting the File Tree
19.4.3. Locating the MDT for a remote directory
19.5. Progressive File Layout(PFL)L 2.10
19.5.1. lfs setstripe
19.5.2. lfs migrate
19.5.3. lfs getstripe
19.5.4. lfs find
19.6. Self-Extending Layout (SEL)L 2.13
19.6.1. lfs setstripe
19.6.2. lfs getstripe
19.6.3. lfs find
19.7. Foreign LayoutL 2.13
19.7.1. lfs set[dir]stripe
19.7.2. lfs get[dir]stripe
19.7.3. lfs find
19.8. Managing Free Space
19.8.1. Checking File System Free Space
19.8.2. Stripe Allocation Methods
19.8.3. Adjusting the Weighting Between Free Space and Location
19.9. Lustre Striping Internals
20. Data on MDT (DoM)L 2.11
20.1. Introduction to Data on MDT (DoM)
20.2. User Commands
20.2.1. lfs setstripe for DoM files
20.2.2. Setting a default DoM layout to an existing directory
20.2.3. DoM Stripe Size Restrictions
20.2.4. lfs getstripe for DoM files
20.2.5. lfs find for DoM files
20.2.6. The dom_stripesize parameter
20.2.7. Disable DoM
21. Lazy Size on MDT (LSoM)L 2.12
21.1. Introduction to Lazy Size on MDT (LSoM)
21.2. Enable LSoM
21.3. User Commands
21.3.1. lfs getsom for LSoM data
21.3.2. Syncing LSoM data
22. File Level Redundancy (FLR)L 2.11
22.1. Introduction
22.2. Operations
22.2.1. Creating a Mirrored File or Directory
22.2.2. Extending a Mirrored File
22.2.3. Splitting a Mirrored File
22.2.4. Resynchronizing out-of-sync Mirrored File(s)
22.2.5. Verifying Mirrored File(s)
22.2.6. Finding Mirrored File(s)
22.3. Interoperability
23. Managing the File System and I/O
23.1. Handling Full OSTs
23.1.1. Checking OST Space Usage
23.1.2. Disabling creates on a Full OST
23.1.3. Migrating Data within a File System
23.1.4. Returning an Inactive OST Back Online
23.1.5. Migrating Metadata within a Filesystem
23.2. Creating and Managing OST Pools
23.2.1. Working with OST Pools
23.2.2. Tips for Using OST Pools
23.3. Adding an OST to a Lustre File System
23.4. Performing Direct I/O
23.4.1. Making File System Objects Immutable
23.5. Other I/O Options
23.5.1. Lustre Checksums
23.5.2. PtlRPC Client Thread Pool
24. Lustre File System Failover and Multiple-Mount Protection
24.1. Overview of Multiple-Mount Protection
24.2. Working with Multiple-Mount Protection
25. Configuring and Managing Quotas
25.1. Working with Quotas
25.2. Enabling Disk Quotas
25.2.1. Quota Verification
25.3. Quota Administration
25.4. Default QuotaL 2.12
25.4.1. Usage
25.5. Quota Allocation
25.6. Quotas and Version Interoperability
25.7. Granted Cache and Quota Limits
25.8. Lustre Quota Statistics
25.8.1. Interpreting Quota Statistics
25.9. Pool QuotasL 2.14
25.9.1. DOM and MDT pools
25.9.2. Lfs quota/setquota options to setup quota pools
25.9.3. Quota pools interoperability
25.9.4. Pool Quotas Hard Limit setup example
25.9.5. Pool Quotas Soft Limit setup example
26. Hierarchical Storage Management (HSM)L 2.5
26.1. Introduction
26.2. Setup
26.2.1. Requirements
26.2.2. Coordinator
26.2.3. Agents
26.3. Agents and copytool
26.3.1. Archive ID, multiple backends
26.3.2. Registered agents
26.3.3. Timeout
26.4. Requests
26.4.1. Commands
26.4.2. Automatic restore
26.4.3. Request monitoring
26.5. File states
26.6. Tuning
26.6.1. hsm_controlpolicy
26.6.2. max_requests
26.6.3. policy
26.6.4. grace_delay
26.7. change logs
26.8. Policy engine
26.8.1. Robinhood
27. Persistent Client Cache (PCC)L 2.13
27.1. Introduction
27.2. Design
27.2.1. Lustre Read-Write PCC Caching
27.2.2. Rule-based Persistent Client Cache
27.3. PCC Command Line Tools
27.3.1. Add a PCC backend on a client
27.3.2. Delete a PCC backend from a client
27.3.3. Remove all PCC backends on a client
27.3.4. List all PCC backends on a client
27.3.5. Attach given files into PCC
27.3.6. Attach given files into PCC by FID(s)
27.3.7. Detach given files from PCC
27.3.8. Detach given files from PCC by FID(s)
27.3.9. Display the PCC state for given files
27.4. PCC Configuration Example
28. Mapping UIDs and GIDs with NodemapL 2.9
28.1. Setting a Mapping
28.1.1. Defining Terms
28.1.2. Deciding on NID Ranges
28.1.3. Defining a Servers Specific Group
28.1.4. Describing and Deploying a Sample Mapping
28.1.5. Mapping Project IDsL 2.15
28.2. Removing Nodemaps
28.3. Altering Properties
28.3.1. Managing the Properties
28.3.2. Mixing Properties
28.4. Enabling the Feature
28.5. default Nodemap
28.6. Verifying Settings
28.7. Ensuring Consistency
29. Configuring Shared-Secret Key (SSK) SecurityL 2.9
29.1. SSK Security Overview
29.1.1. Key features
29.2. SSK Security Flavors
29.2.1. Secure RPC Rules
29.3. SSK Key Files
29.3.1. Key File Management
29.4. Lustre GSS Keyring
29.4.1. Setup
29.4.2. Server Setup
29.4.3. Debugging GSS Keyring
29.4.4. Revoking Keys
29.5. Role of Nodemap in SSK
29.6. SSK Examples
29.6.1. Securing Client to Server Communications
29.6.2. Securing MGS Communications
29.6.3. Securing Server to Server Communications
29.7. Viewing Secure PtlRPC Contexts
30. Managing Security in a Lustre File System
30.1. Using ACLs
30.1.1. How ACLs Work
30.1.2. Using ACLs with the Lustre Software
30.1.3. Examples
30.2. Using Root Squash
30.3. Isolating Clients to a Sub-directory Tree
30.3.1. Identifying Clients
30.3.2. Configuring Isolation
30.3.3. Making Isolation Permanent
30.4. Checking SELinux Policy Enforced by Lustre ClientsL 2.13
30.4.1. Determining SELinux Policy Info
30.4.2. Enforcing SELinux Policy Check
30.4.3. Making SELinux Policy Check Permanent
30.4.4. Sending SELinux Status Info from Clients
30.5. Encrypting files and directoriesL 2.14
30.5.1. Client-side encryption access semantics
30.5.2. Client-side encryption key hierarchy
30.5.3. Client-side encryption modes and usage
30.5.4. Client-side encryption threat model
30.5.5. Manage encryption on directories
30.6. Configuring Kerberos (KRB) Security
30.6.1. What Is Kerberos?
30.6.2. Security Flavor
30.6.3. Kerberos Setup
30.6.4. Networking
30.6.5. Required packages
30.6.6. Build Lustre
30.6.7. Running
30.6.8. Secure MGS connection
31. Lustre ZFS SnapshotsL 2.10
31.1. Introduction
31.1.1. Requirements
31.2. Configuration
31.3. Snapshot Operations
31.3.1. Creating a Snapshot
31.3.2. Delete a Snapshot
31.3.3. Mounting a Snapshot
31.3.4. Unmounting a Snapshot
31.3.5. List Snapshots
31.3.6. Modify Snapshot Attributes
31.4. Global Write Barriers
31.4.1. Impose Barrier
31.4.2. Remove Barrier
31.4.3. Query Barrier
31.4.4. Rescan Barrier
31.5. Snapshot Logs
31.6. Lustre Configuration Logs

Chapter 12. Monitoring a Lustre File System

This chapter provides information on monitoring a Lustre file system and includes the following sections:

12.1.  Lustre Changelogs

The changelogs feature records events that change the file system namespace or file metadata. Changes such as file creation, deletion, renaming, attribute changes, etc. are recorded with the target and parent file identifiers (FIDs), the name of the target, a timestamp, and user information. These records can be used for a variety of purposes:

  • Capture recent changes to feed into an archiving system.

  • Use changelog entries to exactly replicate changes in a file system mirror.

  • Set up "watch scripts" that take action on certain events or directories.

  • Audit activity on Lustre, thanks to the user information and timestamps associated with file and directory changes.

Changelog record types are:

Value     Description

MARK      Internal recordkeeping
CREAT     Regular file creation
MKDIR     Directory creation
HLINK     Hard link
SLINK     Soft link
MKNOD     Other file creation
UNLNK     Regular file removal
RMDIR     Directory removal
RENME     Rename, original
RNMTO     Rename, final
OPEN *    Open
CLOSE     Close
LYOUT     Layout change
TRUNC     Regular file truncated
SATTR     Attribute change
XATTR     Extended attribute change (setxattr)
HSM       HSM specific event
MTIME     MTIME change
CTIME     CTIME change
ATIME *   ATIME change
MIGRT     Migration event
FLRW      File Level Replication: file initially written
RESYNC    File Level Replication: file re-synced
GXATR *   Extended attribute access (getxattr)
NOPEN *   Denied open

Note

Event types marked with * are not recorded by default. Refer to Section 12.1.2.7, “Setting the Changelog Mask” for instructions on modifying the Changelogs mask.

FID-to-full-pathname and pathname-to-FID functions are also included to map target and parent FIDs into the file system namespace.
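
These mappings are exposed through the lfs path2fid and lfs fid2path commands. A brief sketch, with an illustrative path, FID, and output:

client# lfs path2fid /mnt/lustre/mydir/foo
[0x200000402:0x1:0x0]
client# lfs fid2path /mnt/lustre "[0x200000402:0x1:0x0]"
/mnt/lustre/mydir/foo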

12.1.1.  Working with Changelogs

Several commands are available to work with changelogs.

12.1.1.1.  lctl changelog_register

Because changelog records take up space on the MDT, the system administrator must register changelog users. As soon as a changelog user is registered, the Changelogs feature is enabled. The registrants specify which records they are "done with", and the system purges up to the greatest common record.

To register a new changelog user, run:

mds# lctl --device fsname-MDTnumber changelog_register

Changelog entries are not purged beyond a registered user's set point (see lfs changelog_clear).

12.1.1.2.  lfs changelog

To display the metadata changes on an MDT (the changelog records), run:

client# lfs changelog fsname-MDTnumber [startrec [endrec]]

Specifying the start and end records is optional.

These are sample changelog records:

1 02MKDIR 15:15:21.977666834 2018.01.09 0x0 t=[0x200000402:0x1:0x0] j=mkdir.500 ef=0xf \
u=500:500 nid=10.128.11.159@tcp p=[0x200000007:0x1:0x0] pics
2 01CREAT 15:15:36.687592024 2018.01.09 0x0 t=[0x200000402:0x2:0x0] j=cp.500 ef=0xf \
u=500:500 nid=10.128.11.159@tcp p=[0x200000402:0x1:0x0] chloe.jpg
3 06UNLNK 15:15:41.305116815 2018.01.09 0x1 t=[0x200000402:0x2:0x0] j=rm.500 ef=0xf \
u=500:500 nid=10.128.11.159@tcp p=[0x200000402:0x1:0x0] chloe.jpg
4 07RMDIR 15:15:46.468790091 2018.01.09 0x1 t=[0x200000402:0x1:0x0] j=rmdir.500 ef=0xf \
u=500:500 nid=10.128.11.159@tcp p=[0x200000007:0x1:0x0] pics

12.1.1.3.  lfs changelog_clear

To clear old changelog records for a specific user (records that the user no longer needs), run:

client# lfs changelog_clear mdt_name userid endrec

The changelog_clear command indicates that changelog records previous to endrec are no longer of interest to a particular user userid, potentially allowing the MDT to free up disk space. An endrec value of 0 indicates the current last record. To run changelog_clear, the changelog user must be registered on the MDT node using lctl.

When all changelog users are done with records < X, the records are deleted.

12.1.1.4.  lctl changelog_deregister

To deregister (unregister) a changelog user, run:

mds# lctl --device mdt_device changelog_deregister userid

lctl changelog_deregister cl1 effectively does an lfs changelog_clear cl1 0 as it deregisters.

12.1.2. Changelog Examples

This section provides examples of different changelog commands.

12.1.2.1. Registering a Changelog User

To register a new changelog user for a device (lustre-MDT0000):

mds# lctl --device lustre-MDT0000 changelog_register
lustre-MDT0000: Registered changelog userid 'cl1'

12.1.2.2. Displaying Changelog Records

To display changelog records for an MDT (e.g. lustre-MDT0000):

client# lfs changelog lustre-MDT0000
1 02MKDIR 15:15:21.977666834 2018.01.09 0x0 t=[0x200000402:0x1:0x0] ef=0xf \
u=500:500 nid=10.128.11.159@tcp p=[0x200000007:0x1:0x0] pics
2 01CREAT 15:15:36.687592024 2018.01.09 0x0 t=[0x200000402:0x2:0x0] ef=0xf \
u=500:500 nid=10.128.11.159@tcp p=[0x200000402:0x1:0x0] chloe.jpg
3 06UNLNK 15:15:41.305116815 2018.01.09 0x1 t=[0x200000402:0x2:0x0] ef=0xf \
u=500:500 nid=10.128.11.159@tcp p=[0x200000402:0x1:0x0] chloe.jpg
4 07RMDIR 15:15:46.468790091 2018.01.09 0x1 t=[0x200000402:0x1:0x0] ef=0xf \
u=500:500 nid=10.128.11.159@tcp p=[0x200000007:0x1:0x0] pics

Changelog records are displayed in this format:

rec# operation_type(numerical/text) timestamp datestamp flags t=target_FID \
ef=extended_flags u=uid:gid nid=client_NID p=parent_FID target_name

For example:

2 01CREAT 15:15:36.687592024 2018.01.09 0x0 t=[0x200000402:0x2:0x0] ef=0xf \
u=500:500 nid=10.128.11.159@tcp p=[0x200000402:0x1:0x0] chloe.jpg

12.1.2.3. Clearing Changelog Records

To notify a device that a specific user (cl1) no longer needs records (up to and including 3):

# lfs changelog_clear  lustre-MDT0000 cl1 3

To confirm that the changelog_clear operation was successful, run lfs changelog; only records after id-3 are listed:

# lfs changelog lustre-MDT0000
4 07RMDIR 15:15:46.468790091 2018.01.09 0x1 t=[0x200000402:0x1:0x0] ef=0xf \
u=500:500 nid=10.128.11.159@tcp p=[0x200000007:0x1:0x0] pics

12.1.2.4. Deregistering a Changelog User

To deregister a changelog user (cl1) for a specific device (lustre-MDT0000):

mds# lctl --device lustre-MDT0000 changelog_deregister cl1
lustre-MDT0000: Deregistered changelog user 'cl1'

The deregistration operation clears all changelog records for the specified user (cl1).

client# lfs changelog lustre-MDT0000
5 00MARK  15:56:39.603643887 2018.01.09 0x0 t=[0x20001:0x0:0x0] ef=0xf \
u=500:500 nid=0@<0:0> p=[0:0x50:0xb] mdd_obd-lustre-MDT0000-0

Note

MARK records typically indicate changelog recording status changes.

12.1.2.5. Displaying the Changelog Index and Registered Users

To display the current, maximum changelog index and registered changelog users for a specific device (lustre-MDT0000):

mds# lctl get_param  mdd.lustre-MDT0000.changelog_users
mdd.lustre-MDT0000.changelog_users=current index: 8
ID    index (idle seconds)
cl2   8 (180)

12.1.2.6. Displaying the Changelog Mask

To show the current changelog mask on a specific device (lustre-MDT0000):

mds# lctl get_param  mdd.lustre-MDT0000.changelog_mask

mdd.lustre-MDT0000.changelog_mask= 
MARK CREAT MKDIR HLINK SLINK MKNOD UNLNK RMDIR RENME RNMTO CLOSE LYOUT \
TRUNC SATTR XATTR HSM MTIME CTIME MIGRT

12.1.2.7. Setting the Changelog Mask

To set the current changelog mask on a specific device (lustre-MDT0000):

mds# lctl set_param mdd.lustre-MDT0000.changelog_mask=HLINK
mdd.lustre-MDT0000.changelog_mask=HLINK
$ lfs changelog_clear lustre-MDT0000 cl1 0
$ mkdir /mnt/lustre/mydir/foo
$ cp /etc/hosts /mnt/lustre/mydir/foo/file
$ ln /mnt/lustre/mydir/foo/file /mnt/lustre/mydir/myhardlink

Only item types that are in the mask show up in the changelog.

# lfs changelog lustre-MDT0000
9 03HLINK 16:06:35.291636498 2018.01.09 0x0 t=[0x200000402:0x4:0x0] ef=0xf \
u=500:500 nid=10.128.11.159@tcp p=[0x200000007:0x3:0x0] myhardlink

Introduced in Lustre 2.11

12.1.3.  Audit with Changelogs

A specific use case for Lustre Changelogs is audit. According to a definition found on Wikipedia, information technology audits are used to evaluate the organization's ability to protect its information assets and to properly dispense information to authorized parties. Basically, an audit consists of verifying that all data accesses were made according to the access control policy in place, and this is usually done by analyzing access logs.

Audit can be used as proof that appropriate security measures are in place, and it may also be required in order to comply with regulations.

Lustre Changelogs are a good mechanism for audit, because they are a centralized facility and are designed to be transactional. Changelog records contain all the information necessary for auditing purposes:

  • the object of the action can be identified, thanks to the file identifiers (FIDs) and names of the targets

  • the subject of the action can be identified, thanks to the UID/GID and NID information

  • the time of the action can be identified, thanks to the timestamp

12.1.3.1. Enabling Audit

To have a fully functional Changelogs-based audit facility, some additional Changelog record types must be enabled, in order to record events such as OPEN, ATIME, GETXATTR and DENIED OPEN. Please note that enabling these record types may have some performance impact. For instance, recording OPEN and GETXATTR events generates writes to the Changelog for what are, from a file system standpoint, read operations.

Being able to record events such as OPEN or DENIED OPEN is important from an audit perspective. For instance, if a Lustre file system is used to store medical records on a system dedicated to Life Sciences, data privacy is crucial. Administrators may need to know which doctors accessed, or tried to access, a given medical record and when. And conversely, they might need to know which medical records a given doctor accessed.

To enable all changelog entry types, do:

mds# lctl set_param mdd.lustre-MDT0000.changelog_mask=ALL
mdd.seb-MDT0000.changelog_mask=ALL

Once all required record types have been enabled, just register a Changelogs user and the audit facility is operational.

It is possible, however, to control which Lustre client nodes can trigger the recording of file system access events to the Changelogs, using the audit_mode flag on nodemap entries. The reason to disable audit on a per-nodemap basis is to prevent some nodes (e.g. backup or HSM agent nodes) from flooding the audit logs. When the audit_mode flag is set to 1 on a nodemap entry, clients belonging to that nodemap will record file system access events to the Changelogs, provided Changelogs are otherwise activated. When it is set to 0, events from those clients are not logged into the Changelogs, regardless of whether Changelogs are activated. By default, the audit_mode flag is set to 1 in newly created nodemap entries, as well as in the 'default' nodemap.

To prevent the nodes belonging to a nodemap from generating Changelog entries, run:

mgs# lctl nodemap_modify --name nm1 --property audit_mode --value 0
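
To verify the setting for that nodemap, the parameter can be read back on the MGS. This is a minimal check that assumes the property is exported as nodemap.<name>.audit_mode and reuses the nodemap entry nm1 from the example above:

mgs# lctl get_param nodemap.nm1.audit_mode
nodemap.nm1.audit_mode=0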

12.1.3.2. Audit examples

12.1.3.2.1.  OPEN

An OPEN changelog entry is in the form:

7 10OPEN  13:38:51.510728296 2017.07.25 0x242 t=[0x200000401:0x2:0x0] \
ef=0x7 u=500:500 nid=10.128.11.159@tcp m=-w-

It includes information about the open mode, in the form m=rwx.

OPEN entries are recorded only once per UID/GID for a given open mode, as long as the file is not closed by this UID/GID. This avoids flooding the Changelogs if, for instance, an MPI job opens the same file thousands of times from different threads. It reduces the Changelog load significantly without materially affecting the audit information. Similarly, only the last CLOSE per UID/GID is recorded.

12.1.3.2.2.  GETXATTR

A GETXATTR changelog entry is in the form:

8 23GXATR 09:22:55.886793012 2017.07.27 0x0 t=[0x200000402:0x1:0x0] \
ef=0xf u=500:500 nid=10.128.11.159@tcp x=user.name0

It includes information about the name of the extended attribute being accessed, in the form x=<xattr name>.

12.1.3.2.3.  SETXATTR

A SETXATTR changelog entry is in the form:

4 15XATTR 09:41:36.157333594 2018.01.10 0x0 t=[0x200000402:0x1:0x0] \
ef=0xf u=500:500 nid=10.128.11.159@tcp x=user.name0

It includes information about the name of the extended attribute being modified, in the form x=<xattr name>.

12.1.3.2.4.  DENIED OPEN

A DENIED OPEN changelog entry is in the form:

4 24NOPEN 15:45:44.947406626 2017.08.31 0x2 t=[0x200000402:0x1:0x0] \
ef=0xf u=500:500 nid=10.128.11.158@tcp m=-w-

It has the same information as a regular OPEN entry. In order to avoid flooding the Changelogs, DENIED OPEN entries are rate limited: no more than one entry per user per file per time interval is recorded, with the time interval (in seconds) being configurable via mdd.<mdtname>.changelog_deniednext (the default value is 60 seconds).

mds# lctl set_param mdd.lustre-MDT0000.changelog_deniednext=120
mdd.seb-MDT0000.changelog_deniednext=120
mds# lctl get_param mdd.lustre-MDT0000.changelog_deniednext
mdd.seb-MDT0000.changelog_deniednext=120

12.2.  Lustre Jobstats

The Lustre jobstats feature collects file system operation statistics for user processes running on Lustre clients, and exposes them on the servers using the unique Job Identifier (JobID) provided by the job scheduler for each job. Job schedulers known to work with jobstats include: SLURM, SGE, LSF, Loadleveler, PBS and Maui/MOAB.

Since jobstats is implemented in a scheduler-agnostic manner, it is likely to work with other schedulers as well, and also in environments that do not use a job scheduler, by storing custom format strings in jobid_name.

12.2.1.  How Jobstats Works

The Lustre jobstats code on the client extracts the unique JobID from an environment variable within the user process and sends this JobID to the server with all RPCs. This allows the server to track statistics for operations specific to each application/command running on the client, and can be useful for identifying the source of high I/O load.

A Lustre setting on the client, jobid_var, specifies an environment variable or other client-local source that holds a (relatively) unique JobID for the running application. Any environment variable can be specified. For example, SLURM sets the SLURM_JOB_ID environment variable with the unique JobID for all clients running a particular job launched on one or more nodes, and SLURM_JOB_ID will be inherited by all child processes started below that process.

There are several reserved values for jobid_var:

  • disable - disables sending a JobID from this client

  • procname_uid - uses the process name and UID, equivalent to setting jobid_name=%e.%u

  • nodelocal - use only the JobID format from jobid_name

  • session - extract the JobID from jobid_this_session

Lustre can also be configured to generate a synthetic JobID from the client's process name and numeric UID, by setting jobid_var=procname_uid. This will generate a uniform JobID when running the same binary across multiple client nodes, but cannot distinguish whether the binary is part of a single distributed process or multiple independent processes. This can be useful on login nodes where interactive commands are run.

Introduced in Lustre 2.8

In Lustre 2.8 and later it is possible to set jobid_var=nodelocal and then also set jobid_name=name, which all processes on that client node will use. This is useful if only a single job is run on a client at one time, but if multiple jobs are run on a client concurrently, the session JobID should be used.
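
For example, a node-local JobID could be configured as follows (the JobID string backup_2023 is illustrative; any short identifier can be used):

client# lctl set_param jobid_var=nodelocal
client# lctl set_param jobid_name=backup_2023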

Introduced in Lustre 2.12

In Lustre 2.12 and later, it is possible to specify more complex JobID values for jobid_name by using a string that contains format codes that are evaluated for each process, in order to generate a site- or node-specific JobID string.

  • %e print executable name

  • %g print group ID number

  • %h print fully-qualified hostname

  • %H print short hostname

  • %j print JobID from the source named by the jobid_var parameter

  • %p print numeric process ID

  • %u print user ID number
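
For example, the format codes can be combined into a composite JobID. The format string below is purely illustrative and assumes SLURM_JOB_ID as the jobid_var source for the %j code:

client# lctl set_param jobid_var=SLURM_JOB_ID
client# lctl set_param jobid_name=%j.%e.%H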

Introduced in Lustre 2.13

In Lustre 2.13 and later, it is possible to set a per-session JobID via the jobid_this_session parameter instead of getting the JobID from an environment variable. This session ID will be inherited by all processes that are started in this login session, though there can be a different JobID for each login session. This is enabled by setting jobid_var=session instead of setting it to an environment variable. The session ID will be substituted for %j in jobid_name.
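
A minimal sketch of using a per-session JobID (the session name run42 is illustrative):

client# lctl set_param jobid_var=session
client# lctl set_param jobid_this_session=run42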

The setting of jobid_var need not be the same on all clients. For example, one could use SLURM_JOB_ID on all clients managed by SLURM, and use procname_uid on clients not managed by SLURM, such as interactive login nodes.

It is not possible to have different jobid_var settings on a single node, since it is unlikely that multiple job schedulers are active on one client. However, the actual JobID value is local to each process environment and it is possible for multiple jobs with different JobIDs to be active on a single client at one time.

12.2.2.  Enable/Disable Jobstats

Jobstats are disabled by default. The current state of jobstats can be verified by checking lctl get_param jobid_var on a client:

client# lctl get_param jobid_var
jobid_var=disable

To enable jobstats on all clients for SLURM:

mgs# lctl set_param -P jobid_var=SLURM_JOB_ID

The lctl set_param command to enable or disable jobstats should be run on the MGS as root. The change is persistent, and will be propagated to the MDS, OSS, and client nodes automatically when it is set on the MGS and for each new client mount.

To temporarily enable jobstats on a client, or to use a different jobid_var on a subset of nodes, such as nodes in a remote cluster that use a different job scheduler, or interactive login nodes that do not use a job scheduler at all, run the lctl set_param command directly on the client node(s) after the filesystem is mounted. For example, to enable the procname_uid synthetic JobID locally on a login node run:

client# lctl set_param jobid_var=procname_uid

The lctl set_param setting is not persistent, and will be reset if the global jobid_var is set on the MGS or if the filesystem is unmounted.

The following table shows the environment variables which are set by various job schedulers. Set jobid_var to the value for your job scheduler to collect statistics on a per job basis.

Job Scheduler                                           Environment Variable
Simple Linux Utility for Resource Management (SLURM)    SLURM_JOB_ID
Sun Grid Engine (SGE)                                   JOB_ID
Load Sharing Facility (LSF)                             LSB_JOBID
Loadleveler                                             LOADL_STEP_ID
Portable Batch Scheduler (PBS)/MAUI                     PBS_JOBID
Cray Application Level Placement Scheduler (ALPS)       ALPS_APP_ID

To disable jobstats on all clients, set jobid_var back to disable:

mgs# lctl set_param -P jobid_var=disable

To track job stats per process name and user ID (for debugging, or if no job scheduler is in use on some nodes such as login nodes), specify jobid_var as procname_uid:

client# lctl set_param jobid_var=procname_uid

12.2.3.  Check Job Stats

Metadata operation statistics are collected on MDTs. These statistics can be accessed for all file systems and all jobs on the MDT via the lctl get_param mdt.*.job_stats command. For example, with clients running jobid_var=procname_uid:

mds# lctl get_param mdt.*.job_stats
job_stats:
- job_id:          bash.0
  snapshot_time:   1352084992
  open:            { samples:     2, unit:  reqs }
  close:           { samples:     2, unit:  reqs }
  getattr:         { samples:     3, unit:  reqs }
- job_id:          mythbackend.0
  snapshot_time:   1352084996
  open:            { samples:    72, unit:  reqs }
  close:           { samples:    73, unit:  reqs }
  unlink:          { samples:    22, unit:  reqs }
  getattr:         { samples:   778, unit:  reqs }
  setattr:         { samples:    22, unit:  reqs }
  statfs:          { samples: 19840, unit:  reqs }
  sync:            { samples: 33190, unit:  reqs }

Data operation statistics are collected on OSTs. Data operations statistics can be accessed via lctl get_param obdfilter.*.job_stats, for example:

oss# lctl get_param obdfilter.*.job_stats
obdfilter.myth-OST0000.job_stats=
job_stats:
- job_id:          mythcommflag.0
  snapshot_time:   1429714922
  read:    { samples: 974, unit: bytes, min: 4096, max: 1048576, sum: 91530035 }
  write:   { samples:   0, unit: bytes, min:    0, max:       0, sum:        0 }
obdfilter.myth-OST0001.job_stats=
job_stats:
- job_id:          mythbackend.0
  snapshot_time:   1429715270
  read:    { samples:   0, unit: bytes, min:     0, max:      0, sum:        0 }
  write:   { samples:   1, unit: bytes, min: 96899, max:  96899, sum:    96899 }
  punch:   { samples:   1, unit:  reqs }
obdfilter.myth-OST0002.job_stats=job_stats:
obdfilter.myth-OST0003.job_stats=job_stats:
obdfilter.myth-OST0004.job_stats=
job_stats:
- job_id:          mythfrontend.500
  snapshot_time:   1429692083
  read:    { samples:   9, unit: bytes, min: 16384, max: 1048576, sum: 4444160 }
  write:   { samples:   0, unit: bytes, min:     0, max:       0, sum:       0 }
- job_id:          mythbackend.500
  snapshot_time:   1429692129
  read:    { samples:   0, unit: bytes, min:     0, max:       0, sum:       0 }
  write:   { samples:   1, unit: bytes, min: 56231, max:   56231, sum:   56231 }
  punch:   { samples:   1, unit:  reqs }

12.2.4.  Clear Job Stats

Accumulated job statistics can be reset by writing to the job_stats proc file.

Clear statistics for all jobs on the local node:

oss# lctl set_param obdfilter.*.job_stats=clear

Clear statistics only for job 'bash.0' on lustre-MDT0000:

mds# lctl set_param mdt.lustre-MDT0000.job_stats=bash.0

12.2.5.  Configure Auto-cleanup Interval

By default, if a job is inactive for 600 seconds (10 minutes) statistics for this job will be dropped. This expiration value can be changed temporarily via:

mds# lctl set_param *.*.job_cleanup_interval={max_age}

It can also be changed permanently, for example to 700 seconds via:

mgs# lctl set_param -P mdt.testfs-*.job_cleanup_interval=700

The job_cleanup_interval can be set to 0 to disable auto-cleanup. Note that if auto-cleanup of jobstats is disabled, then all statistics will be kept in memory forever, which may eventually consume all memory on the servers. In this case, any monitoring tool should explicitly clear individual job statistics as they are processed, as shown above.
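
For example, a monitoring workflow might disable auto-cleanup and then clear the statistics explicitly after collecting them. This is only a sketch; the snapshot file path is illustrative:

mds# lctl set_param mdt.*.job_cleanup_interval=0
mds# lctl get_param mdt.*.job_stats > /tmp/job_stats.snapshot
mds# lctl set_param mdt.*.job_stats=clear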

Introduced in Lustre 2.14

12.2.6.  Identifying Top Jobs

Since Lustre 2.15, the lljobstat utility can be used to monitor and identify the top JobIDs generating load on a particular server. This allows the administrator to quickly see which applications/users/clients (depending on how the JobID is configured) are generating the most filesystem RPCs and take appropriate action if needed.

mds# lljobstat -c 10
---
    timestamp: 1665984678
    top_jobs:
    - ls.500:          {ops: 64, ga: 64}
    - touch.500:       {ops: 6, op: 1, cl: 1, mn: 1, ga: 1, sa: 2}
    - bash.0:          {ops: 3, ga: 3}
    ...

It is possible to specify the number of top jobs to monitor as well as the refresh interval, among other options.
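
For example, assuming the -c and -i options select the number of jobs displayed and the refresh interval in seconds (the available options may vary by release; check lljobstat --help):

mds# lljobstat -c 5 -i 10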

12.3.  Lustre Monitoring Tool (LMT)

The Lustre Monitoring Tool (LMT) is a Python-based, distributed system that provides a top-like display of activity on server-side nodes (MDS, OSS and portals routers) on one or more Lustre file systems. It does not provide support for monitoring clients. For more information on LMT, including the setup procedure, see:

https://github.com/chaos/lmt/wiki

12.4.  CollectL

CollectL is another tool that can be used to monitor a Lustre file system. You can run CollectL on a Lustre system that has any combination of MDSs, OSTs and clients. The collected data can be written to a file for continuous logging and played back at a later time. It can also be converted to a format suitable for plotting.

For more information about CollectL, see:

http://collectl.sourceforge.net

Lustre-specific documentation is also available. See:

http://collectl.sourceforge.net/Tutorial-Lustre.html

12.5.  Other Monitoring Options

A variety of standard monitoring tools are also available publicly.

Another option is to script a simple monitoring solution that looks at various reports from ifconfig, as well as the procfs files generated by the Lustre software.

Chapter 13. Lustre Operations

Once you have the Lustre file system up and running, you can use the procedures in this section to perform these basic Lustre administration tasks.

13.1.  Mounting by Label

The file system name is limited to 8 characters. We have encoded the file system and target information in the disk label, so you can mount by label. This allows system administrators to move disks around without worrying about issues such as SCSI disk reordering or getting the /dev/device wrong for a shared target. Soon, file system naming will be made as fail-safe as possible. Currently, Linux disk labels are limited to 16 characters. To identify the target within the file system, 8 characters are reserved, leaving 8 characters for the file system name:

fsname-MDT0000 or
fsname-OST0a19

To mount by label, use this command:

mount -t lustre -L file_system_label /mount_point

This is an example of mount-by-label:

mds# mount -t lustre -L testfs-MDT0000 /mnt/mdt

Caution

Mount-by-label should NOT be used in a multi-path environment or when snapshots are being created of the device, since multiple block devices will have the same label.

Although the file system name is internally limited to 8 characters, you can mount the clients at any mount point, so file system users are not subjected to short names. Here is an example:

client# mount -t lustre mds0@tcp0:/short /dev/long_mountpoint_name

13.2.  Starting Lustre

On the first start of a Lustre file system, the components must be started in the following order:

  1. Mount the MGT.

    Note

    If a combined MGT/MDT is present, Lustre will correctly mount the MGT and MDT automatically.

  2. Mount the MDT.

    Note

    Mount all MDTs if multiple MDTs are present.

  3. Mount the OST(s).

  4. Mount the client(s).

13.3.  Mounting a Server

Starting a Lustre server is straightforward and only involves the mount command. Lustre servers can be added to /etc/fstab:

mount -t lustre

The mount command generates output similar to this:

/dev/sda1 on /mnt/test/mdt type lustre (rw)
/dev/sda2 on /mnt/test/ost0 type lustre (rw)
192.168.0.21@tcp:/testfs on /mnt/testfs type lustre (rw)

In this example, the MDT, an OST (ost0) and file system (testfs) are mounted.

The corresponding /etc/fstab entries for these Lustre targets could look like this:

LABEL=testfs-MDT0000 /mnt/test/mdt lustre defaults,_netdev,noauto 0 0
LABEL=testfs-OST0000 /mnt/test/ost0 lustre defaults,_netdev,noauto 0 0

In general, it is wise to specify noauto and let your high-availability (HA) package manage when to mount the device. If you are not using failover, make sure that networking has been started before mounting a Lustre server. If you are running Red Hat Enterprise Linux, SUSE Linux Enterprise Server, the Debian operating system (and perhaps others), use the _netdev flag to ensure that these disks are mounted after the network is up, unless you are using systemd 232 or greater, which recognizes lustre as a network filesystem. If you are using lnet.service, use x-systemd.requires=lnet.service regardless of the systemd version.
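
For example, a hypothetical /etc/fstab entry for an OST that combines these options could look like this:

LABEL=testfs-OST0000 /mnt/test/ost0 lustre defaults,_netdev,noauto,x-systemd.requires=lnet.service 0 0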

We are mounting by disk label here. The label of a device can be read with e2label. The label of a newly-formatted Lustre server may end in FFFF if the --index option is not specified to mkfs.lustre, meaning that it has yet to be assigned. The assignment takes place when the server is first started, and the disk label is updated. It is recommended that the --index option always be used, which will also ensure that the label is set at format time.

Caution

Do not do this when the client and OSS are on the same node, as memory pressure between the client and OSS can lead to deadlocks.

Caution

Mount-by-label should NOT be used in a multi-path environment.

13.4.  Stopping the Filesystem

A complete Lustre filesystem shutdown occurs by unmounting all clients and servers in the order shown below. Please note that unmounting a block device causes the Lustre software to be shut down on that node.

Note

Please note that the -a -t lustre in the commands below is not the name of a filesystem, but rather specifies that all entries of type lustre in /etc/mtab should be unmounted.

  1. Unmount the clients

    On each client node, unmount the filesystem on that client using the umount command:

    umount -a -t lustre

    The example below shows the unmount of the testfs filesystem on a client node:

    [root@client1 ~]# mount -t lustre
    XXX.XXX.0.11@tcp:/testfs on /mnt/testfs type lustre (rw,lazystatfs)
    
    [root@client1 ~]# umount -a -t lustre
    [154523.177714] Lustre: Unmounted testfs-client
    

  2. Unmount the MDT and MGT

    On the MGS and MDS node(s), run the umount command:

    umount -a -t lustre

    The example below shows the unmount of the MDT and MGT for the testfs filesystem on a combined MGS/MDS:

    [root@mds1 ~]# mount -t lustre
    /dev/sda on /mnt/mgt type lustre (ro)
    /dev/sdb on /mnt/mdt type lustre (ro)
    
    [root@mds1 ~]# umount -a -t lustre
    [155263.566230] Lustre: Failing over testfs-MDT0000
    [155263.775355] Lustre: server umount testfs-MDT0000 complete
    [155269.843862] Lustre: server umount MGS complete
    

    For a separate MGS and MDS, the same command is used, first on the MDS and then on the MGS.

  3. Unmount all the OSTs

    On each OSS node, use the umount command:

    umount -a -t lustre

    The example below shows the unmount of all OSTs for the testfs filesystem on server OSS1:

    [root@oss1 ~]# mount |grep lustre
    /dev/sda on /mnt/ost0 type lustre (ro)
    /dev/sdb on /mnt/ost1 type lustre (ro)
    /dev/sdc on /mnt/ost2 type lustre (ro)
    
    [root@oss1 ~]# umount -a -t lustre
    Lustre: Failing over testfs-OST0002
    Lustre: server umount testfs-OST0002 complete
    

For unmount command syntax for a single OST, MDT, or MGT target please refer to Section 13.5, “ Unmounting a Specific Target on a Server”

13.5.  Unmounting a Specific Target on a Server

To stop a Lustre OST, MDT, or MGT, use the umount /mount_point command.

The example below stops an OST, ost0, on mount point /mnt/ost0 for the testfs filesystem:

[root@oss1 ~]# umount /mnt/ost0
Lustre: Failing over testfs-OST0000
Lustre: server umount testfs-OST0000 complete

Gracefully stopping a server with the umount command preserves the state of the connected clients. The next time the server is started, it waits for clients to reconnect, and then goes through the recovery procedure.

If the force ( -f) flag is used, then the server evicts all clients and stops WITHOUT recovery. Upon restart, the server does not wait for recovery. Any currently connected clients receive I/O errors until they reconnect.
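
For example, to force-stop the same OST without waiting for client recovery:

[root@oss1 ~]# umount -f /mnt/ost0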

Note

If you are using loopback devices, use the -d flag. This flag cleans up loop devices and can always be safely specified.

13.6.  Specifying Failout/Failover Mode for OSTs

In a Lustre file system, an OST that has become unreachable because it fails, is taken off the network, or is unmounted can be handled in one of two ways:

  • In failout mode, Lustre clients immediately receive errors (EIOs) after a timeout, instead of waiting for the OST to recover.

  • In failover mode, Lustre clients wait for the OST to recover.

By default, the Lustre file system uses failover mode for OSTs. To specify failout mode instead, use the --param="failover.mode=failout" option as shown below (entered on one line):

oss# mkfs.lustre --fsname=fsname --mgsnode=mgs_NID \
        --param=failover.mode=failout --ost --index=ost_index /dev/ost_block_device

In the example below, failout mode is specified for the OSTs on the MGS mds0 in the file system testfs (entered on one line).

oss# mkfs.lustre --fsname=testfs --mgsnode=mds0 --param=failover.mode=failout \
      --ost --index=3 /dev/sdb

Caution

Before running this command, unmount all OSTs that will be affected by a change in failover/failout mode.

Note

After initial file system configuration, use the tunefs.lustre utility to change the mode. For example, to set the failout mode, run:

# tunefs.lustre --param failover.mode=failout /dev/ost_device

13.7.  Handling Degraded OST RAID Arrays

Lustre includes functionality by which it can be notified when an external RAID array has degraded performance (resulting in reduced overall file system performance), either because a disk has failed and not been replaced, or because a disk was replaced and is undergoing a rebuild. To avoid a global performance slowdown due to a degraded OST, the MDS can avoid the OST for new object allocation if it is notified of the degraded state.

A parameter for each OST, called degraded, specifies whether the OST is running in degraded mode or not.

To mark the OST as degraded, use:

oss# lctl set_param obdfilter.{OST_name}.degraded=1

To mark that the OST is back in normal operation, use:

oss# lctl set_param obdfilter.{OST_name}.degraded=0

To determine if OSTs are currently in degraded mode, use:

oss# lctl get_param obdfilter.*.degraded

If the OST is remounted due to a reboot or other condition, the flag resets to 0.

It is recommended that this be implemented by an automated script that monitors the status of individual RAID devices, such as MD-RAID's mdadm(8) command with the --monitor option to mark an affected device degraded or restored.
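
A minimal sketch of such automation is shown below. The helper script path, the OST name lustre-OST0000, and the handled event names are illustrative assumptions; mdadm --monitor passes the event name and the affected device as arguments to the program it invokes:

oss# mdadm --monitor --scan --daemonise --program=/usr/local/sbin/lustre-raid-alert.sh

where the helper script could look like:

#!/bin/bash
# /usr/local/sbin/lustre-raid-alert.sh (sketch)
# $1 = mdadm event name, $2 = md device, $3 = component device (if any)
case "$1" in
    Fail|DegradedArray|RebuildStarted)
        lctl set_param obdfilter.lustre-OST0000.degraded=1 ;;
    SpareActive|RebuildFinished)
        lctl set_param obdfilter.lustre-OST0000.degraded=0 ;;
esac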

13.8.  Running Multiple Lustre File Systems

Lustre supports multiple file systems provided the combination of NID:fsname is unique. Each file system must be allocated a unique name during creation with the --fsname parameter. Unique names for file systems are enforced if a single MGS is present. If multiple MGSs are present (for example if you have an MGS on every MDS) the administrator is responsible for ensuring file system names are unique. A single MGS and unique file system names provides a single point of administration and allows commands to be issued against the file system even if it is not mounted.

Lustre supports multiple file systems on a single MGS. With a single MGS fsnames are guaranteed to be unique. Lustre also allows multiple MGSs to co-exist. For example, multiple MGSs will be necessary if multiple file systems on different Lustre software versions are to be concurrently available. With multiple MGSs additional care must be taken to ensure file system names are unique. Each file system should have a unique fsname among all systems that may interoperate in the future.

By default, the mkfs.lustre command creates a file system named lustre. To specify a different file system name (limited to 8 characters) at format time, use the --fsname option:

oss# mkfs.lustre --fsname=file_system_name

Note

The MDT, OSTs and clients in the new file system must use the same file system name (prepended to the device name). For example, for a new file system named foo, the MDT and two OSTs would be named foo-MDT0000, foo-OST0000, and foo-OST0001.

To mount a client on the file system, run:

client# mount -t lustre mgsnode:/new_fsname /mount_point

For example, to mount a client on file system foo at mount point /mnt/foo, run:

client# mount -t lustre mgsnode:/foo /mnt/foo

Note

If a client(s) will be mounted on several file systems, add the following line to /etc/xattr.conf file to avoid problems when files are moved between the file systems: lustre.* skip

Note

To ensure that a new MDT is added to an existing MGS create the MDT by specifying: --mdt --mgsnode=mgs_NID.

A Lustre installation with two file systems ( foo and bar) could look like this, where the MGS node is mgsnode@tcp0 and the mount points are /mnt/foo and /mnt/bar.

mgsnode# mkfs.lustre --mgs /dev/sda
mdtfoonode# mkfs.lustre --fsname=foo --mgsnode=mgsnode@tcp0 --mdt --index=0
/dev/sdb
ossfoonode# mkfs.lustre --fsname=foo --mgsnode=mgsnode@tcp0 --ost --index=0
/dev/sda
ossfoonode# mkfs.lustre --fsname=foo --mgsnode=mgsnode@tcp0 --ost --index=1
/dev/sdb
mdtbarnode# mkfs.lustre --fsname=bar --mgsnode=mgsnode@tcp0 --mdt --index=0
/dev/sda
ossbarnode# mkfs.lustre --fsname=bar --mgsnode=mgsnode@tcp0 --ost --index=0
/dev/sdc
ossbarnode# mkfs.lustre --fsname=bar --mgsnode=mgsnode@tcp0 --ost --index=1
/dev/sdd

To mount a client on file system foo at mount point /mnt/foo, run:

client# mount -t lustre mgsnode@tcp0:/foo /mnt/foo

To mount a client on file system bar at mount point /mnt/bar, run:

client# mount -t lustre mgsnode@tcp0:/bar /mnt/bar

13.9.  Creating a sub-directory on a specific MDT

It is possible to create individual directories, along with their files and sub-directories, that are stored on specific MDTs. To create a sub-directory on a given MDT, use the command:

client$ lfs mkdir -i mdt_index /mount_point/remote_dir

This command will allocate the sub-directory remote_dir onto the MDT with index mdt_index. For more information on adding additional MDTs and mdt_index, see Section 14.7, “Adding a New MDT to a Lustre File System”.

Warning

An administrator can allocate remote sub-directories to separate MDTs. Creating remote sub-directories in parent directories not hosted on MDT0000 is not recommended. This is because the failure of the parent MDT will leave the namespace below it inaccessible. For this reason, by default it is only possible to create remote sub-directories off MDT0000. To relax this restriction and enable remote sub-directories off any MDT, an administrator must issue the following command on the MGS:

mgs# lctl set_param -P mdt.fsname-MDT*.enable_remote_dir=1

For Lustre filesystem 'scratch', the command executed is:

mgs# lctl set_param -P mdt.scratch-*.enable_remote_dir=1

To verify the configuration setting execute the following command on any MDS:

mds# lctl get_param mdt.*.enable_remote_dir

Introduced in Lustre 2.8

With Lustre software version 2.8, a new tunable is available to allow users with a specific group ID to create and delete remote and striped directories. This tunable is enable_remote_dir_gid. For example, setting this parameter to the 'wheel' or 'admin' group ID allows users with that GID to create and delete remote and striped directories. Setting this parameter to -1 on MDT0000 permanently allows any non-root user to create and delete remote and striped directories. On the MGS, execute the following command:

mgs# lctl set_param -P mdt.fsname-*.enable_remote_dir_gid=-1

For the Lustre filesystem 'scratch', the command expands to:

mgs# lctl set_param -P mdt.scratch-*.enable_remote_dir_gid=-1

The change can be verified by executing the following command on every MDS:

mds# lctl get_param mdt.*.enable_remote_dir_gid

Introduced in Lustre 2.8

13.10.  Creating a directory striped across multiple MDTs

The Lustre 2.8 DNE feature enables files in a single large directory to be distributed across multiple MDTs (a striped directory), if there are multiple MDTs added to the filesystem; see Section 14.7, “Adding a New MDT to a Lustre File System”. The result is that metadata requests for files in a single large striped directory are serviced by multiple MDTs, and the metadata service load is distributed over all the MDTs that service a given directory. By distributing metadata service load over multiple MDTs, the performance of very large directories can be improved beyond the limit of one MDT. Normally, all files in a directory must be created on a single MDT.

This command to stripe a directory over mdt_count MDTs is:

client$ lfs mkdir -c mdt_count /mount_point/new_directory

The striped directory feature is most useful for distributing a single large directory (50k entries or more) across multiple MDTs. This should be used with discretion since creating and removing striped directories incurs more overhead than non-striped directories.

Introduced in Lustre 2.13

13.10.1. Directory creation by space/inode usage

If the starting MDT is not specified when creating a new directory, the directory and its stripes will be distributed over the MDTs by space usage. For example, the following will create a new directory on an MDT, preferring one that has lower space usage:

client$ lfs mkdir -c 1 -i -1 dir1

Alternatively, if a default directory stripe is set on a directory, the subsequent use of mkdir for subdirectories in dir1 will have the same effect:

client$ lfs setdirstripe -D -c 1 -i -1 dir1

The policy is:

  • If the free inodes/blocks on all MDTs are almost the same, i.e. max_inodes_avail * 84% < min_inodes_avail and max_blocks_avail * 84% < min_blocks_avail, then choose an MDT in round-robin order.

  • Otherwise, create more subdirectories on MDTs with more free inodes/blocks.

Sometimes there are many MDTs, but it is not always desirable to stripe a directory across all of them, even if the directory default stripe_count=-1 (unlimited). In this case, the per-filesystem tunable parameter lod.*.max_mdt_stripecount can be used to limit the actual stripe count of a directory to fewer than the full MDT count. If lod.*.max_mdt_stripecount is not 0 and the directory stripe_count=-1, the real directory stripe count will be the minimum of the number of MDTs and max_mdt_stripecount. If lod.*.max_mdt_stripecount=0, or an explicit stripe count is given for the directory, max_mdt_stripecount is ignored.

To set max_mdt_stripecount, on all MDSes of file system, run:

mgs# lctl set_param -P lod.$fsname-MDTxxxx-mdtlov.max_mdt_stripecount=<N>

To check max_mdt_stripecount, run:

mds# lctl get_param lod.$fsname-MDTxxxx-mdtlov.max_mdt_stripecount

To reset max_mdt_stripecount, run:

mgs# lctl set_param -P -d lod.$fsname-MDTxxxx-mdtlov.max_mdt_stripecount

Introduced in Lustre 2.14

13.10.2. Filesystem-wide default directory striping

Similar to file object allocation, directory objects are allocated on MDTs by either a round-robin algorithm or a weighted algorithm. For the top three levels of directories from the root of the filesystem, if the amount of free inodes and blocks is well balanced (i.e., by default, when the free inodes and blocks across MDTs differ by less than 5%), the round-robin algorithm is used to select the next MDT on which a directory is to be created.

If the directory is more than three levels below the root directory, or MDTs are not balanced, then the weighted algorithm is used to randomly select an MDT with more free inodes and blocks.

To avoid creating unnecessary remote directories, if the MDT holding the parent directory is not too full (its free inodes and blocks are no more than 5% below the average of all MDTs), the new directory will be created on the parent's MDT.

If the administrator wants to change this default filesystem-wide directory striping, run the following command to limit the striping to the top level below the root directory:

client$ lfs setdirstripe -D -i -1 -c 1 --max-inherit 0 <mountpoint>

To revert to the pre-2.15 behavior of all directories being created only on MDT0000 by default (deleting this striping won't work because it will be recreated if missing):

client$ lfs setdirstripe -D -i 0 -c 1 --max-inherit 0 <mountpoint>

13.11.  Default Dir Stripe Policy

If a default directory stripe policy is set on a directory, it will be applied to sub-directories created later. For example:

$ mkdir testdir1
$ lfs setdirstripe testdir1 -D -c 2
$ lfs getdirstripe testdir1 -D
lmv_stripe_count: 2 lmv_stripe_offset: -1 lmv_hash_type: none lmv_max_inherit: 3 lmv_max_inherit_rr: 0
$ mkdir testdir1/subdir1
$ lfs getdirstripe testdir1/subdir1
lmv_stripe_count: 2 lmv_stripe_offset: 0 lmv_hash_type: crush
mdtidx       FID[seq:oid:ver]
     0       [0x200000400:0x2:0x0]
     1       [0x240000401:0x2:0x0]

A default directory stripe policy can be inherited by sub-directories. This behavior is controlled by the lmv_max_inherit parameter. If lmv_max_inherit is 0 or 1, a sub-directory stops inheriting the default directory stripe policy. Otherwise, a sub-directory decrements its parent's lmv_max_inherit and uses the result as its own lmv_max_inherit. A value of -1 is special because it means the inheritance depth is unlimited. For example:

$ lfs getdirstripe testdir1/subdir1 -D
lmv_stripe_count: 2 lmv_stripe_offset: -1 lmv_hash_type: none lmv_max_inherit: 2 lmv_max_inherit_rr: 0

lmv_max_inherit can be set explicitly with --max-inherit option in lfs setdirstripe -D command. If the max-inherit value is not specified, the default value is -1 when stripe_count is 0 or 1. For other values of stripe_count, the default value is 3.
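
For example, to set a default stripe policy on a new directory that is inherited by up to four levels of sub-directories (the directory name testdir2 is illustrative):

$ mkdir testdir2
$ lfs setdirstripe -D -c 2 --max-inherit 4 testdir2
$ lfs getdirstripe -D testdir2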

13.12.  Setting and Retrieving Lustre Parameters

Several options are available for setting parameters in Lustre:

13.12.1. Setting Tunable Parameters with mkfs.lustre

When the file system is first formatted, parameters can simply be added as a --param option to the mkfs.lustre command. For example:

mds# mkfs.lustre --mdt --param="sys.timeout=50" /dev/sda

For more details about creating a file system,see Chapter 10, Configuring a Lustre File System. For more details about mkfs.lustre, see Chapter 44, System Configuration Utilities.

13.12.2. Setting Parameters with tunefs.lustre

If a server (OSS or MDS) is stopped, parameters can be added to an existing file system using the --param option to the tunefs.lustre command. For example:

oss# tunefs.lustre --param=failover.node=192.168.0.13@tcp0 /dev/sda

With tunefs.lustre, parameters are additive: new parameters are specified in addition to old parameters, they do not replace them. To erase all old tunefs.lustre parameters and use only newly-specified parameters, run:

mds# tunefs.lustre --erase-params --param=new_parameters

The tunefs.lustre command can be used to set any parameter settable via lctl conf_param that has its own OBD device, so it can be specified as obdname|fsname.obdtype.proc_file_name=value. For example:

mds# tunefs.lustre --param mdt.identity_upcall=NONE /dev/sda1

For more details about tunefs.lustre, see Chapter 44, System Configuration Utilities.

13.12.3. Setting Parameters with lctl

When the file system is running, the lctl command can be used to set parameters (temporary or permanent) and report current parameter values. Temporary parameters are active as long as the server or client is not shut down. Permanent parameters live through server and client reboots.

Note

The lctl list_param command enables users to list all parameters that can be set. See Section 13.12.3.5, “Listing All Tunable Parameters”.

For more details about the lctl command, see the examples in the sections below and Chapter 44, System Configuration Utilities.

13.12.3.1. Setting Temporary Parameters

Use lctl set_param to set temporary parameters on the node where it is run. These parameters internally map to corresponding items in the kernel /proc/{fs,sys}/{lnet,lustre} and /sys/{fs,kernel/debug}/lustre virtual filesystems. However, since the mapping between a particular parameter name and the underlying virtual pathname may change, it is not recommended to access the virtual pathname directly. The lctl set_param command uses this syntax:

# lctl set_param [-n] [-P] obdtype.obdname.proc_file_name=value

For example:

# lctl set_param osc.*.max_dirty_mb=1024
osc.myth-OST0000-osc.max_dirty_mb=32
osc.myth-OST0001-osc.max_dirty_mb=32
osc.myth-OST0002-osc.max_dirty_mb=32
osc.myth-OST0003-osc.max_dirty_mb=32
osc.myth-OST0004-osc.max_dirty_mb=32

13.12.3.2. Setting Permanent Parameters

Use the lctl set_param -P or lctl conf_param command to set permanent parameters. In general, the set_param -P command is preferred for new parameters, as this isolates the parameter settings from the MDT and OST device configuration and is consistent with the common lctl get_param and lctl set_param commands. The lctl conf_param command was previously used to specify settable parameters, with the following syntax (the same as the mkfs.lustre and tunefs.lustre commands):

obdname|fsname.obdtype.proc_file_name=value

Note

The lctl conf_param and lctl set_param syntax is not the same.

Here are a few examples of lctl conf_param commands:

mgs# lctl conf_param testfs-MDT0000.sys.timeout=40
mgs# lctl conf_param testfs-MDT0000.mdt.identity_upcall=NONE
mgs# lctl conf_param testfs.llite.max_read_ahead_mb=16
mgs# lctl conf_param testfs-OST0000.osc.max_dirty_mb=29.15
mgs# lctl conf_param testfs-OST0000.ost.client_cache_seconds=15
mgs# lctl conf_param testfs.sys.timeout=40

Caution

Parameters specified with the lctl conf_param command are set permanently in the file system's configuration file on the MGS.

Introduced in Lustre 2.5

13.12.3.3. Setting Permanent Parameters with lctl set_param -P

The lctl set_param -P command can also set parameters permanently using the same syntax as lctl set_param and lctl get_param commands. Permanent parameter settings must be issued on the MGS. The given parameter is set on every host using lctl upcall. The lctl set_param command uses the following syntax:

lctl set_param -P obdtype.obdname.proc_file_name=value

For example:

mgs# lctl set_param -P timeout=40
mgs# lctl set_param -P mdt.testfs-MDT*.identity_upcall=NONE
mgs# lctl set_param -P llite.testfs-*.max_read_ahead_mb=16
mgs# lctl set_param -P osc.testfs-OST*.max_dirty_mb=29.15
mgs# lctl set_param -P ost.testfs-OST*.client_cache_seconds=15

Use the -P -d option to delete permanent parameters. Syntax:

lctl set_param -P -d obdtype.obdname.parameter_name

For example:

mgs# lctl set_param -P -d osc.*.max_dirty_mb
Introduced in Lustre 2.12

Note

Starting in Lustre 2.12, the lctl get_param command can provide tab completion when using an interactive shell with bash-completion installed. This simplifies the use of get_param significantly by providing an interactive list of available parameters.

13.12.3.4. Listing Persistent Parameters

To list tunable parameters stored in the params log file by lctl set_param -P and applied to nodes at mount, run the lctl --device MGS llog_print params command on the MGS. For example:

mgs# lctl --device MGS llog_print params
- { index: 2, event: set_param, device: general, parameter: osc.*.max_dirty_mb, value: 1024 }

13.12.3.5. Listing All Tunable Parameters

To list Lustre or LNet parameters that are available to set, use the lctl list_param command. For example:

lctl list_param [-FR] obdtype.obdname

The following arguments are available for the lctl list_param command.

-F Add '/', '@' or '=' for directories, symlinks and writeable files, respectively

-R Recursively lists all parameters under the specified path

For example:

oss# lctl list_param obdfilter.lustre-OST0000

13.12.3.6. Reporting Current Parameter Values

To report current Lustre parameter values, use the lctl get_param command with this syntax:

lctl get_param [-n] obdtype.obdname.proc_file_name
Introduced in Lustre 2.12

Note

Starting in Lustre 2.12, the lctl get_param command can provide tab completion when using an interactive shell with bash-completion installed. This simplifies the use of get_param significantly by providing an interactive list of available parameters.

This example reports data on RPC service times.

oss# lctl get_param -n ost.*.ost_io.timeouts
service : cur 1 worst 30 (at 1257150393, 85d23h58m54s ago) 1 1 1 1

This example reports the amount of space this client has reserved for writeback cache with each OST:

client# lctl get_param osc.*.cur_grant_bytes
osc.myth-OST0000-osc-ffff8800376bdc00.cur_grant_bytes=2097152
osc.myth-OST0001-osc-ffff8800376bdc00.cur_grant_bytes=33890304
osc.myth-OST0002-osc-ffff8800376bdc00.cur_grant_bytes=35418112
osc.myth-OST0003-osc-ffff8800376bdc00.cur_grant_bytes=2097152
osc.myth-OST0004-osc-ffff8800376bdc00.cur_grant_bytes=33808384

13.13.  Specifying NIDs and Failover

If a node has multiple network interfaces, it may have multiple NIDs, which must all be identified so other nodes can choose the NID that is appropriate for their network interfaces. Typically, NIDs are specified in a list delimited by commas (,). However, when failover nodes are specified, the NIDs are delimited by a colon (:) or by repeating a keyword such as --mgsnode= or --servicenode=.

To display the NIDs of all servers in networks configured to work with the Lustre file system, run (while LNet is running):

# lctl list_nids

In the example below, mds0 and mds1 are configured as a combined MGS/MDT failover pair and oss0 and oss1 are configured as an OST failover pair. The Ethernet address for mds0 is 192.168.10.1, and for mds1 is 192.168.10.2. The Ethernet addresses for oss0 and oss1 are 192.168.10.20 and 192.168.10.21 respectively.

mds0# mkfs.lustre --fsname=testfs --mdt --mgs \
        --servicenode=192.168.10.2@tcp0 \
        --servicenode=192.168.10.1@tcp0 /dev/sda1
mds0# mount -t lustre /dev/sda1 /mnt/test/mdt
oss0# mkfs.lustre --fsname=testfs --servicenode=192.168.10.20@tcp0 \
        --servicenode=192.168.10.21 --ost --index=0 \
        --mgsnode=192.168.10.1@tcp0 --mgsnode=192.168.10.2@tcp0 \
        /dev/sdb
oss0# mount -t lustre /dev/sdb /mnt/test/ost0
client# mount -t lustre 192.168.10.1@tcp0:192.168.10.2@tcp0:/testfs \
        /mnt/testfs
mds0# umount /mnt/test/mdt
mds1# mount -t lustre /dev/sda1 /mnt/test/mdt
mds1# lctl get_param mdt.testfs-MDT0000.recovery_status

Where multiple NIDs are specified separated by commas (for example, 10.67.73.200@tcp,192.168.10.1@tcp), the two NIDs refer to the same host, and the Lustre software chooses the best one for communication. When a pair of NIDs is separated by a colon (for example, 10.67.73.200@tcp:10.67.73.201@tcp), the two NIDs refer to two different hosts and are treated as a failover pair (the Lustre software tries the first one, and if that fails, it tries the second one.)

Two options to mkfs.lustre can be used to specify failover nodes. The --servicenode option is used to specify all service NIDs, including those for primary nodes and failover nodes. When the --servicenode option is used, the first service node to load the target device becomes the primary service node, while nodes corresponding to the other specified NIDs become failover locations for the target device. An older option, --failnode, specifies just the NIDs of failover nodes. For more information about the --servicenode and --failnode options, see Chapter 11, Configuring Failover in a Lustre File System.

13.14.  Erasing a File System

If you want to erase a file system and permanently delete all the data in the file system, run this command on your targets:

# mkfs.lustre --reformat

If you are using a separate MGS and want to keep other file systems defined on that MGS, then set the writeconf flag on the MDT for that file system. The writeconf flag causes the configuration logs to be erased; they are regenerated the next time the servers start.

To set the writeconf flag on the MDT:

  1. Unmount all clients/servers using this file system, run:

    client# umount /mnt/lustre
    
  2. Permanently erase the file system and, presumably, replace it with another file system, run:

    mgs# mkfs.lustre --reformat --fsname spfs --mgs --mdt --index=0 /dev/mdsdev
    
  3. If you have a separate MGS (that you do not want to reformat), then add the --writeconf flag to mkfs.lustre on the MDT, run:

    mgs# mkfs.lustre --reformat --writeconf --fsname spfs --mgsnode=mgs_nid \
           --mdt --index=0 /dev/mds_device
    

Note

If you have a combined MGS/MDT, reformatting the MDT reformats the MGS as well, causing all configuration information to be lost; you can start building your new file system. Nothing needs to be done with old disks that will not be part of the new file system, just do not mount them.

13.15.  Reclaiming Reserved Disk Space

All current Lustre installations run the ldiskfs file system internally on service nodes. By default, ldiskfs reserves 5% of the disk space to avoid file system fragmentation. In order to reclaim this space, run the following command on your OSS for each OST in the file system:

# tune2fs [-m reserved_blocks_percent] /dev/ostdev
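
For example, to reduce the reserved space to 2% on a hypothetical OST device /dev/sdb (see the warning below before reducing this value):

oss# tune2fs -m 2 /dev/sdb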

You do not need to shut down Lustre before running this command or restart it afterwards.

Warning

Reducing the space reservation can cause severe performance degradation as the OST file system becomes more than 95% full, due to difficulty in locating large areas of contiguous free space. This performance degradation may persist even if the space usage drops below 95% again. It is recommended NOT to reduce the reserved disk space below 5%.

13.16.  Replacing an Existing OST or MDT

To copy the contents of an existing OST to a new OST (or an old MDT to a new MDT), follow the process for either OST/MDT backups in Section 18.2, “ Backing Up and Restoring an MDT or OST (ldiskfs Device Level)”or Section 18.3, “ Backing Up an OST or MDT (Backend File System Level)”. For more information on removing a MDT, see Section 14.9.1, “Removing an MDT from the File System”.

13.17.  Identifying To Which Lustre File an OST Object Belongs

Use this procedure to identify the file containing a given object on a given OST.

  1. On the OST (as root), run debugfs to display the file identifier ( FID) of the file associated with the object.

    For example, if the object is 34976 on /dev/lustre/ost_test2, the debug command is:

    # debugfs -c -R "stat /O/0/d$((34976 % 32))/34976" /dev/lustre/ost_test2
    

    The command output is:

    debugfs 1.45.6.wc1 (20-Mar-2020)
    /dev/lustre/ost_test2: catastrophic mode - not reading inode or group bitmaps
    Inode: 352365   Type: regular    Mode:  0666   Flags: 0x80000
    Generation: 2393149953    Version: 0x0000002a:00005f81
    User:  1000   Group:  1000   Size: 260096
    File ACL: 0    Directory ACL: 0
    Links: 1   Blockcount: 512
    Fragment:  Address: 0    Number: 0    Size: 0
    ctime: 0x4a216b48:00000000 -- Sat May 30 13:22:16 2009
    atime: 0x4a216b48:00000000 -- Sat May 30 13:22:16 2009
    mtime: 0x4a216b48:00000000 -- Sat May 30 13:22:16 2009
    crtime: 0x4a216b3c:975870dc -- Sat May 30 13:22:04 2009
    Size of extra inode fields: 24
    Extended attributes stored in inode body:
      fid = "b9 da 24 00 00 00 00 00 6a fa 0d 3f 01 00 00 00 eb 5b 0b 00 00 00 0000
    00 00 00 00 00 00 00 00 " (32)
      fid: objid=34976 seq=0 parent=[0x200000400:0x122:0x0] stripe=1
    EXTENTS:
    (0-64):4620544-4620607
    

  2. The parent FID will be of the form [0x200000400:0x122:0x0] and can be resolved directly using the command lfs fid2path [0x200000400:0x122:0x0] /mnt/lustre on any Lustre client, and the process is complete.

  3. In cases of an upgraded 1.x inode (if the first part of the FID is below 0x200000400), the MDT inode number is 0x24dab9 and generation 0x3f0dfa6a and the pathname can also be resolved using debugfs.

  4. On the MDS (as root), use debugfs to find the file associated with the inode:

    # debugfs -c -R "ncheck 0x24dab9" /dev/lustre/mdt_test
    debugfs 1.42.3.wc3 (15-Aug-2012)
    /dev/lustre/mdt_test: catastrophic mode - not reading inode or group bitmaps
    Inode      Pathname
    2415289    /ROOT/brian-laptop-guest/clients/client11/~dmtmp/PWRPNT/ZD16.BMP
    

The command lists the inode and pathname associated with the object.

Note

The debugfs ncheck command is a brute-force search that may take a long time to complete.

Note

To find the Lustre file from a disk LBA, follow the steps listed in the document at this URL: https://www.smartmontools.org/wiki/BadBlockHowto. Then, follow the steps above to resolve the Lustre filename.

Chapter 14. Lustre Maintenance

Once you have the Lustre file system up and running, you can use the procedures in this section to perform these basic Lustre maintenance tasks:

14.1.  Working with Inactive OSTs

To mount a client or an MDT with one or more inactive OSTs, run commands similar to this:

client# mount -o exclude=testfs-OST0000 -t lustre \
           uml1:/testfs /mnt/testfs

client# lctl get_param lov.testfs-clilov-*.target_obd

To activate an inactive OST on a live client or MDT, use the lctl activate command on the OSC device. For example:

lctl --device 7 activate

Note

A colon-separated list can also be specified. For example, exclude=testfs-OST0000:testfs-OST0001.

14.2.  Finding Nodes in the Lustre File System

There may be situations in which you need to find all nodes in your Lustre file system or get the names of all OSTs.

To get a list of all Lustre nodes, run this command on the MGS:

# lctl get_param mgs.MGS.live.*

Note

This command must be run on the MGS.

In this example, file system testfs has three nodes, testfs-MDT0000, testfs-OST0000, and testfs-OST0001.

mgs:/root# lctl get_param mgs.MGS.live.* 
                fsname: testfs 
                flags: 0x0     gen: 26 
                testfs-MDT0000 
                testfs-OST0000 
                testfs-OST0001 

To get the names of all OSTs, run this command on the MDS:

mds:/root# lctl get_param lov.*-mdtlov.target_obd 

Note

This command must be run on the MDS.

In this example, there are two OSTs, testfs-OST0000 and testfs-OST0001, which are both active.

mgs:/root# lctl get_param lov.testfs-mdtlov.target_obd 
0: testfs-OST0000_UUID ACTIVE 
1: testfs-OST0001_UUID ACTIVE 

14.3.  Mounting a Server Without Lustre Service

If you are using a combined MGS/MDT, but you only want to start the MGS and not the MDT, run this command:

mount -t lustre /dev/mdt_partition -o nosvc /mount_point

The mdt_partition variable is the combined MGS/MDT block device.

In this example, the combined MGS/MDT is testfs-MDT0000 and the mount point is /mnt/test/mdt.

$ mount -t lustre -L testfs-MDT0000 -o nosvc /mnt/test/mdt

14.4.  Regenerating Lustre Configuration Logs

If the Lustre file system configuration logs are in a state where the file system cannot be started, use the tunefs.lustre --writeconf command to regenerate them. After the writeconf command is run and the servers restart, the configuration logs are re-generated and stored on the MGS (as with a new file system).

You should only use the writeconf command if:

  • The configuration logs are in a state where the file system cannot start

  • A server NID is being changed

The writeconf command is destructive to some configuration items (e.g. OST pools information and tunables set via conf_param), and should be used with caution.

Caution

The OST pools feature enables a group of OSTs to be named for file striping purposes. If you use OST pools, be aware that running the writeconf command erases all pools information (as well as any other parameters set via lctl conf_param). We recommend that the pools definitions (and conf_param settings) be executed via a script, so they can be regenerated easily after writeconf is performed. However, tunables saved with lctl set_param -P are not erased in this case.

Note

If the MGS still holds any configuration logs, it may be possible to dump these logs to save any parameters stored with lctl conf_param by dumping the config logs on the MGS and saving the output (once for each MDT and OST device):

mgs# lctl --device MGS llog_print fsname-client
mgs# lctl --device MGS llog_print fsname-MDT0000
mgs# lctl --device MGS llog_print fsname-OST0000

To regenerate Lustre file system configuration logs:

  1. Stop the file system services in the following order before running the tunefs.lustre --writeconf command:

    1. Unmount the clients.

    2. Unmount the MDT(s).

    3. Unmount the OST(s).

    4. If the MGS is separate from the MDT it can remain mounted during this process.

  2. Make sure the MDT and OST devices are available.

  3. Run the tunefs.lustre --writeconf command on all target devices.

    Run writeconf on the MDT(s) first, and then the OST(s).

    1. On each MDS, for each MDT run:

      mds# tunefs.lustre --writeconf /dev/mdt_device
    2. On each OSS, for each OST run:

      oss# tunefs.lustre --writeconf /dev/ost_device

  4. Restart the file system in the following order:

    1. Mount the separate MGT, if it is not already mounted.

    2. Mount the MDT(s) in order, starting with MDT0000.

    3. Mount the OSTs in order, starting with OST0000.

    4. Mount the clients.

After the tunefs.lustre --writeconf command is run, the configuration logs are re-generated as servers connect to the MGS.

14.5.  Changing a Server NID

To completely rewrite the Lustre configuration, the tunefs.lustre --writeconf command is used to regenerate all of the configuration files.

If you need to change only the NID of the MDT or OST, the replace_nids command can simplify this process. The replace_nids command differs from tunefs.lustre --writeconf in that it does not erase the entire configuration log, precluding the need to execute the writeconf command on all servers and to re-specify all permanent parameter settings. However, the writeconf command can still be used if desired.

Change a server NID in these situations:

  • New server hardware is added to the file system, and the MDS or an OSS is being moved to the new machine.

  • New network card is installed in the server.

  • You want to reassign IP addresses.

To change a server NID:

  1. Update the LNet configuration in the /etc/modprobe.conf file so the list of server NIDs is correct. Use lctl list_nids to view the list of server NIDS.

    The lctl list_nids command indicates which network(s) are configured to work with the Lustre file system.

  2. Shut down the file system in this order:

    1. Unmount the clients.

    2. Unmount the MDT.

    3. Unmount all OSTs.

  3. If the MGS and MDS share a partition, start the MGS only:

    mount -t lustre /dev/mdt_partition -o nosvc /mount_point
  4. Run the replace_nids command on the MGS:

    lctl replace_nids devicename nid1[,nid2,nid3 ...]

    where devicename is the Lustre target name, e.g. testfs-OST0013

  5. If the MGS and MDS share a partition, stop the MGS:

    umount mount_point

Note

The replace_nids command also cleans all old, invalidated records out of the configuration log, while preserving all other current settings.

Note

The previous configuration log is backed up on the MGS disk with the suffix '.bak'.
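
For example, a hedged walk-through of the procedure above for a combined MGS/MDT, assuming the testfs-MDT0000 target is being moved to a node with the illustrative NID 192.168.10.34@tcp:

mds# mount -t lustre /dev/mdt_partition -o nosvc /mnt/test/mdt
mds# lctl replace_nids testfs-MDT0000 192.168.10.34@tcp
mds# umount /mnt/test/mdt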

Introduced in Lustre 2.11

14.6.  Clearing configuration

This command is run on the MGS node with the MGS device mounted using -o nosvc. It cleans up the configuration logs stored in the CONFIGS/ directory by removing any records marked SKIP. If a device name is given (e.g. testfs-MDT0000), only the configuration log for that target is processed. Otherwise, if a filesystem name is given, all configuration logs for that filesystem are cleared. The previous configuration log is backed up on the MGS disk with the suffix 'config.timestamp.bak', for example Lustre-MDT0000-1476454535.bak.

To clear a configuration:

  1. Shut down the file system in this order:

    1. Unmount the clients.

    2. Unmount the MDT.

    3. Unmount all OSTs.

  2. If the MGS and MDS share a partition, start the MGS only using "nosvc" option.

    mount -t lustre /dev/mdt_partition -o nosvc /mount_point
  3. Run the clear_conf command on the MGS:

    lctl clear_conf config

    Example: To clear the configuration for MDT0000 on a filesystem named testfs

    mgs# lctl clear_conf testfs-MDT0000

14.7. Adding a New MDT to a Lustre File System

Additional MDTs can be added using the DNE feature to serve one or more remote sub-directories within a filesystem, in order to increase the total number of files that can be created in the filesystem, to increase aggregate metadata performance, or to isolate user or application workloads from other users of the filesystem. It is possible to have multiple remote sub-directories reference the same MDT. However, the root directory will always be located on MDT0000. To add a new MDT into the file system:

  1. Discover the maximum MDT index. Each MDT must have unique index.

    client$ lctl dl | grep mdc
    36 UP mdc testfs-MDT0000-mdc-ffff88004edf3c00 4c8be054-144f-9359-b063-8477566eb84e 5
    37 UP mdc testfs-MDT0001-mdc-ffff88004edf3c00 4c8be054-144f-9359-b063-8477566eb84e 5
    38 UP mdc testfs-MDT0002-mdc-ffff88004edf3c00 4c8be054-144f-9359-b063-8477566eb84e 5
    39 UP mdc testfs-MDT0003-mdc-ffff88004edf3c00 4c8be054-144f-9359-b063-8477566eb84e 5
    
  2. Add the new block device as a new MDT at the next available index. In this example, the next available index is 4.

    mds# mkfs.lustre --reformat --fsname=testfs --mdt --mgsnode=mgsnode --index 4 /dev/mdt4_device
    
  3. Mount the MDTs.

    mds# mount -t lustre /dev/mdt4_blockdevice /mnt/mdt4
    
  4. In order to start creating new files and directories on the new MDT(s) they need to be attached into the namespace at one or more subdirectories using the lfs mkdir command. All files and directories below those created with lfs mkdir will also be created on the same MDT unless otherwise specified.

    client# lfs mkdir -i 3 /mnt/testfs/new_dir_on_mdt3
    client# lfs mkdir -i 4 /mnt/testfs/new_dir_on_mdt4
    client# lfs mkdir -c 4 /mnt/testfs/project/new_large_dir_striped_over_4_mdts
    

14.8.  Adding a New OST to a Lustre File System

A new OST can be added to an existing Lustre file system on either an existing OSS node or on a new OSS node. In order to keep the client IO load balanced across OSS nodes for maximum aggregate performance, it is not recommended to configure different numbers of OSTs on different OSS nodes.

  1. Add a new OST by using mkfs.lustre as when the file system was first formatted; see 4 for details. Each new OST must have a unique index number; use lctl dl to see a list of all OSTs. For example, to add a new OST at index 12 to the testfs filesystem, run the following commands on the OSS:

    oss# mkfs.lustre --fsname=testfs --mgsnode=mds16@tcp0 --ost --index=12 /dev/sda
    oss# mkdir -p /mnt/testfs/ost12
    oss# mount -t lustre /dev/sda /mnt/testfs/ost12
  2. Balance OST space usage (possibly).

    The file system can be quite unbalanced when new empty OSTs are added to a relatively full filesystem. New file creations are automatically balanced to favour the new OSTs. If this is a scratch file system or files are pruned at regular intervals, then no further work may be needed to balance the OST space usage as new files being created will preferentially be placed on the less full OST(s). As old files are deleted, they will release space on the old OST(s).

    Files existing prior to the expansion can optionally be rebalanced using the lfs_migrate utility. This redistributes file data over the entire set of OSTs.

    For example, to rebalance all files within the directory /mnt/lustre/dir, enter:

    client# lfs_migrate /mnt/lustre/dir

    To migrate files within the /test file system on OST0004 that are larger than 4GB in size to other OSTs, enter:

    client# lfs find /test --ost test-OST0004 -size +4G | lfs_migrate -y

    See Section 40.2, “lfs_migrate” for details.

14.9.  Removing and Restoring MDTs and OSTs

OSTs and DNE MDTs can be removed from and restored to a Lustre filesystem. Deactivating an OST means that it is temporarily or permanently marked unavailable. Deactivating an OST on the MDS means it will not try to allocate new objects there or perform OST recovery, while deactivating an OST on the client means it will not wait for OST recovery if it cannot contact the OST, and will instead return an IO error to the application immediately if files on the OST are accessed. An OST may be permanently deactivated from the file system, depending on the situation and commands used.

Note

A permanently deactivated MDT or OST still appears in the filesystem configuration until the configuration is regenerated with writeconf or it is replaced with a new MDT or OST at the same index and permanently reactivated. A deactivated OST will not be listed by lfs df.

You may want to temporarily deactivate an OST on the MDS to prevent new files from being written to it in several situations:

  • A hard drive has failed and a RAID resync/rebuild is underway. Alternatively, the OST can be marked degraded by the RAID system to avoid allocating new files on the slow OST, which could reduce performance; see Section 13.7, “ Handling Degraded OST RAID Arrays” for more details.

  • OST is nearing its space capacity, though the MDS will already try to avoid allocating new files on overly-full OSTs if possible, see Section 39.7, “Allocating Free Space on OSTs” for details.

  • MDT/OST storage or MDS/OSS node has failed, and will not be available for some time (or forever), but there is still a desire to continue using the filesystem before it is repaired.

14.9.1. Removing an MDT from the File System

If the MDT is permanently inaccessible, lfs rm_entry {directory} can be used to delete the directory entry for the unavailable MDT. Using rmdir would otherwise report an IO error due to the remote MDT being inactive. Please note that if the MDT is available, standard rm -r should be used to delete the remote directory. After the remote directory has been removed, the administrator should mark the MDT as permanently inactive with:

lctl conf_param {MDT name}.mdc.active=0
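
For example, a hedged illustration assuming the remote directory /mnt/testfs/remote_dir1 resides on a permanently inaccessible MDT0001 in the testfs filesystem (names are illustrative):

client# lfs rm_entry /mnt/testfs/remote_dir1
mgs# lctl conf_param testfs-MDT0001.mdc.active=0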

A user can identify which MDT holds a remote sub-directory using the lfs utility. For example:

client$ lfs getstripe --mdt-index /mnt/lustre/remote_dir1
1
client$ mkdir /mnt/lustre/local_dir0
client$ lfs getstripe --mdt-index /mnt/lustre/local_dir0
0

The lfs getstripe --mdt-index command returns the index of the MDT that is serving the given directory.

14.9.2.  Working with Inactive MDTs

Files located on or below an inactive MDT are inaccessible until the MDT is activated again. Clients accessing an inactive MDT will receive an EIO error.

14.9.3. Removing an OST from the File System

When deactivating an OST, note that the client and MDS each have an OSC device that handles communication with the corresponding OST. To remove an OST from the file system:

  1. If the OST is functional and there are files located on the OST that need to be migrated off of it, file creation on that OST should be temporarily deactivated on the MDS (on each MDS, if running with multiple MDS nodes in DNE mode).

    1. Introduced in Lustre 2.9

      With Lustre 2.9 and later, the MDS should be set to only disable file creation on that OST by setting max_create_count to zero:

      mds# lctl set_param osp.osc_name.max_create_count=0

      This ensures that files deleted or migrated off of the OST will have their corresponding OST objects destroyed, and the space will be freed. For example, to disable OST0000 in the filesystem testfs, run:

      mds# lctl set_param osp.testfs-OST0000-osc-MDT*.max_create_count=0

      on each MDS in the testfs filesystem.

    2. With older versions of Lustre, to deactivate the OSC on the MDS node(s) use:

      mds# lctl set_param osp.osc_name.active=0

      This will prevent the MDS from attempting any communication with that OST, including destroying objects located thereon. This is fine if the OST will be removed permanently, if the OST is not stable in operation, or if it is in a read-only state. Otherwise, the free space and objects on the OST will not decrease when files are deleted, and object destruction will be deferred until the MDS reconnects to the OST.

      For example, to deactivate OST0000 in the filesystem testfs, run:

      mds# lctl set_param osp.testfs-OST0000-osc-MDT*.active=0

      Deactivating the OST on the MDS does not prevent use of existing objects for read/write by a client.

      Note

      If migrating files from a working OST, do not deactivate the OST on clients. This causes IO errors when accessing files located there, and migrating files on the OST would fail.

      Caution

      Do not use lctl set_param -P or lctl conf_param to deactivate the OST if it is still working, as this immediately and permanently deactivates it in the file system configuration on both the MDS and all clients.

  2. Discover all files that have objects residing on the deactivated OST. Depending on whether the deactivated OST is available or not, the data from that OST may be migrated to other OSTs, or may need to be restored from backup.

    1. If the OST is still online and available, find all files with objects on the deactivated OST, and migrate them to other OSTs in the file system:

      client# lfs find --ost ost_name /mount/point | lfs_migrate -y

      Note that if multiple OSTs are being deactivated at one time, the lfs find command can take multiple --ost arguments, and will return files that are located on any of the specified OSTs.

    2. If the OST is no longer available, delete the files on that OST and restore them from backup:

      client# lfs find --ost ost_uuid -print0 /mount/point |
              tee /tmp/files_to_restore | xargs -0 -n 1 unlink

      The list of files that need to be restored from backup is stored in /tmp/files_to_restore. Restoring these files is beyond the scope of this document.

  3. Deactivate the OST.

    1. If there is expected to be a replacement OST in some short time (a few days), the OST can temporarily be deactivated on the clients using:

      client# lctl set_param osc.fsname-OSTnumber-*.active=0

      Note

      This setting is only temporary and will be reset if the clients are remounted or rebooted. It needs to be run on all clients.

    2. If there is not expected to be a replacement for this OST in the near future, permanently deactivate it on all clients and the MDS by running the following command on the MGS:

      mgs# lctl conf_param ost_name.osc.active=0

      Note

      A deactivated OST still appears in the file system configuration, though a replacement OST can be created that re-uses the same OST index with the mkfs.lustre --replace option, see Section 14.9.5, “ Restoring OST Configuration Files”.

      Introduced in Lustre 2.16

      In Lustre 2.16 and later, it is possible to run the command "lctl del_ost --target fsname-OSTxxxx" on the MGS to totally remove an OST from the MGS configuration logs. This cancels the configuration records for that OST in the client and MDT configuration logs for the named filesystem. This permanently removes the configuration records for that OST from the filesystem, so that it will not be visible on later client and MDT mounts, and should only be run after the earlier steps to migrate files off the OST.

      If the del_ost command is not available, the OST configuration records should be found in the startup logs by running the command "lctl --device MGS llog_print fsname-client" on the MGS (and also "... $fsname-MDTxxxx" for all the MDTs) to list all attach, setup, add_osc, add_pool, and other records related to the removed OST(s). Once the index value is known for each configuration record, the command "lctl --device MGS llog_cancel llog_name -i index" will drop that record from the configuration log llog_name. This is needed for each of the fsname-client and fsname-MDTxxxx configuration logs so that new mounts will no longer process it. If a whole OSS is being removed, the add_uuid records for the OSS should similarly be canceled.

      mgs# lctl --device MGS llog_print testfs-client | egrep "192.168.10.99@tcp|OST0003"
      - { index: 135, event: add_uuid, nid: 192.168.10.99@tcp(0x20000c0a80a63), node: 192.168.10.99@tcp }
      - { index: 136, event: attach, device: testfs-OST0003-osc, type: osc, UUID: testfs-clilov_UUID }
      - { index: 137, event: setup, device: testfs-OST0003-osc, UUID: testfs-OST0003_UUID, node: 192.168.10.99@tcp }
      - { index: 138, event: add_osc, device: testfs-clilov, ost: testfs-OST0003_UUID, index: 3, gen: 1 }
      mgs# lctl --device MGS llog_cancel testfs-client -i 138
      mgs# lctl --device MGS llog_cancel testfs-client -i 137
      mgs# lctl --device MGS llog_cancel testfs-client -i 136
                      

14.9.4.  Backing Up OST Configuration Files

If the OST device is still accessible, then the Lustre configuration files on the OST should be backed up and saved for future use in order to avoid difficulties when a replacement OST is returned to service. These files rarely change, so they can and should be backed up while the OST is functional and accessible. If the deactivated OST is still available to mount (i.e. it has not permanently failed and is not unmountable due to severe corruption), an effort should be made to preserve these files.

  1. Mount the OST file system.

    oss# mkdir -p /mnt/ost
    oss# mount -t ldiskfs /dev/ost_device /mnt/ost

  2. Back up the OST configuration files.

    oss# tar cvf ost_name.tar -C /mnt/ost last_rcvd \
               CONFIGS/ O/0/LAST_ID

  3. Unmount the OST file system.

    oss# umount /mnt/ost

14.9.5.  Restoring OST Configuration Files

If the original OST is still available, it is best to follow the OST backup and restore procedure given in either Section 18.2, “ Backing Up and Restoring an MDT or OST (ldiskfs Device Level)”, or Section 18.3, “ Backing Up an OST or MDT (Backend File System Level)” and Section 18.4, “ Restoring a File-Level Backup”.

To replace an OST that was removed from service due to corruption or hardware failure, the replacement OST needs to be formatted using mkfs.lustre, and the Lustre file system configuration should be restored, if available. Any objects stored on the OST will be permanently lost, and files using the OST should be deleted and/or restored from backup.

Introduced in Lustre 2.5

With Lustre 2.5 and later, it is possible to replace an OST at the same index without restoring the configuration files, using the --replace option at format time.

oss# mkfs.lustre --ost --reformat --replace --index=old_ost_index \
        other_options /dev/new_ost_dev

The MDS and OSS will negotiate the LAST_ID value for the replacement OST.

If the OST configuration files were not backed up, due to the OST file system being completely inaccessible, it is still possible to replace the failed OST with a new one at the same OST index.

  1. For older versions, format the OST file system without the --replace option and restore the saved configuration:

    oss# mkfs.lustre --ost --reformat --index=old_ost_index \
               other_options /dev/new_ost_dev

  2. Mount the OST file system.

    oss# mkdir /mnt/ost
    oss# mount -t ldiskfs /dev/new_ost_dev /mnt/ost

  3. Restore the OST configuration files, if available.

    oss# tar xvf ost_name.tar -C /mnt/ost
  4. Recreate the OST configuration files, if unavailable.

    Follow the procedure in Section 35.3.4, “Fixing a Bad LAST_ID on an OST” to recreate the LAST_ID file for this OST index. The last_rcvd file will be recreated when the OST is first mounted using the default parameters, which are normally correct for all file systems. The CONFIGS/mountdata file is created by mkfs.lustre at format time, but has flags set that request it to register itself with the MGS. It is possible to copy the flags from another working OST (which should be the same):

    oss1# debugfs -c -R "dump CONFIGS/mountdata /tmp" /dev/other_osdev
    oss1# scp /tmp/mountdata oss0:/tmp/mountdata
    oss0# dd if=/tmp/mountdata of=/mnt/ost/CONFIGS/mountdata bs=4 count=1 seek=5 skip=5 conv=notrunc
  5. Unmount the OST file system.

    oss# umount /mnt/ost

14.9.6. Returning a Deactivated OST to Service

If the OST was permanently deactivated, it needs to be reactivated in the MGS configuration.

mgs# lctl conf_param ost_name.osc.active=1

If the OST was temporarily deactivated, it needs to be reactivated on the MDS and clients.

mds# lctl set_param osp.fsname-OSTnumber-*.active=1
client# lctl set_param osc.fsname-OSTnumber-*.active=1
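
For example, a hedged illustration for OST0000 in the testfs filesystem (substitute your own filesystem name and OST index):

mgs# lctl conf_param testfs-OST0000.osc.active=1
mds# lctl set_param osp.testfs-OST0000-*.active=1
client# lctl set_param osc.testfs-OST0000-*.active=1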

14.10.  Aborting Recovery

You can abort recovery with either the lctl utility or by mounting the target with the abort_recov option (mount -o abort_recov). When starting a target, run:

mds# mount -t lustre -L mdt_name -o abort_recov /mount_point
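
Alternatively, a hedged sketch of using the lctl utility on an already-mounted target (the device name is illustrative):

mds# lctl --device testfs-MDT0000 abort_recovery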

Note

The recovery process is blocked until all OSTs are available.

14.11.  Determining Which Machine is Serving an OST

In the course of administering a Lustre file system, you may need to determine which machine is serving a specific OST. This is not as simple as identifying the machine's IP address: IP is only one of several networking protocols that the Lustre software supports, so LNet uses NIDs rather than IP addresses as node identifiers. To identify the NID that is serving a specific OST, run one of the following commands on a client (you do not need to be a root user):

client$ lctl get_param osc.fsname-OSTnumber*.ost_conn_uuid

For example:

client$ lctl get_param osc.*-OST0000*.ost_conn_uuid 
osc.testfs-OST0000-osc-f1579000.ost_conn_uuid=192.168.20.1@tcp

- OR -

client$ lctl get_param osc.*.ost_conn_uuid 
osc.testfs-OST0000-osc-f1579000.ost_conn_uuid=192.168.20.1@tcp
osc.testfs-OST0001-osc-f1579000.ost_conn_uuid=192.168.20.1@tcp
osc.testfs-OST0002-osc-f1579000.ost_conn_uuid=192.168.20.1@tcp
osc.testfs-OST0003-osc-f1579000.ost_conn_uuid=192.168.20.1@tcp
osc.testfs-OST0004-osc-f1579000.ost_conn_uuid=192.168.20.1@tcp

14.12.  Changing the Address of a Failover Node

To change the address of a failover node (e.g., to use node X instead of node Y), run this command on the OSS node against the OST partition (depending on which option was used to originally identify the NID):

oss# tunefs.lustre --erase-params --servicenode=NID /dev/ost_device

or

oss# tunefs.lustre --erase-params --failnode=NID /dev/ost_device

For more information about the --servicenode and --failnode options, see Chapter 11, Configuring Failover in a Lustre File System.

14.13.  Separate a combined MGS/MDT

These instructions assume the MGS node will be the same as the MDS node. For instructions on how to move MGS to a different node, see Section 14.5, “ Changing a Server NID”.

These instructions are for doing the split without shutting down other servers and clients.

  1. Stop the MDS.

    Unmount the MDT

    mds# umount -f /dev/mdt_device
                
  2. Create the MGT filesystem.

    mds# mkfs.lustre --mgs /dev/mgt_device
                
  3. Copy the configuration data from MDT disk to the new MGT disk.

    mds# mount -t ldiskfs -o ro /dev/mdt_device /mdt_mount_point
    mds# mount -t ldiskfs -o rw /dev/mgt_device /mgt_mount_point
    mds# cp -av /mdt_mount_point/CONFIGS/filesystem_name-* /mgt_mount_point/CONFIGS/
    mds# cp -av /mdt_mount_point/CONFIGS/{params,nodemap,sptlrpc}* /mgt_mount_point/CONFIGS/
    mds# umount /mgt_mount_point
    mds# umount /mdt_mount_point
                  

    See Section 14.4, “ Regenerating Lustre Configuration Logs” for alternative method.

  4. Start the MGS.

    mgs# mount -t lustre /dev/mgt_device /mgt_mount_point
                

    Check to make sure it knows about all of your file systems:

    mgs:/root# lctl get_param mgs.MGS.filesystems
  5. Remove the MGS option from the MDT, and set the new MGS nid.

    mds# tunefs.lustre --nomgs --mgsnode=new_mgs_nid /dev/mdt-device
                
  6. Start the MDT.

    mds# mount -t lustre /dev/mdt_device /mdt_mount_point
                  

    Check to make sure the MGS configuration looks right:

    mgs# lctl get_param mgs.MGS.live.filesystem_name
Introduced in Lustre 2.13

14.14.  Set an MDT to read-only

It is sometimes desirable to be able to mark the filesystem read-only directly on the server, rather than remounting the clients and setting the option there. This can be useful if there is a rogue client that is deleting files, or when decommissioning a system to prevent already-mounted clients from modifying it anymore.

Set the mdt.*.readonly parameter to 1 to immediately set the MDT to read-only. All future MDT access will immediately return a "Read-only file system" error (EROFS) until the parameter is set to 0 again.

Example of setting the readonly parameter to 1, verifying the current setting, accessing from a client, and setting the parameter back to 0:

mds# lctl set_param mdt.fs-MDT0000.readonly=1
mdt.fs-MDT0000.readonly=1

mds# lctl get_param mdt.fs-MDT0000.readonly
mdt.fs-MDT0000.readonly=1

client$ touch test_file
touch: cannot touch 'test_file': Read-only file system

mds# lctl set_param mdt.fs-MDT0000.readonly=0
mdt.fs-MDT0000.readonly=0
Introduced in Lustre 2.14

14.15.  Tune Fallocate for ldiskfs

This section shows how to tune/enable/disable fallocate for ldiskfs OSTs.

The default mode=0 is the standard "allocate unwritten extents" behavior used by ext4. This is by far the fastest for space allocation, but requires the unwritten extents to be split and/or zeroed when they are overwritten.

The OST fallocate mode=1 can also be set to use "zeroed extents", which may be handled by "WRITE SAME", "TRIM zeroes data", or other low-level functionality in the underlying block device.

mode=-1 completely disables fallocate.

Example: To completely disable fallocate

lctl set_param osd-ldiskfs.*.fallocate_zero_blocks=-1

Example: To enable fallocate to use 'zeroed extents'

lctl set_param osd-ldiskfs.*.fallocate_zero_blocks=1
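
To verify the current setting, the parameter can be read back with lctl get_param; a hedged sketch of checking it on the OSS, and of making the setting persistent with lctl set_param -P from the MGS, is shown below:

oss# lctl get_param osd-ldiskfs.*.fallocate_zero_blocks
mgs# lctl set_param -P osd-ldiskfs.*.fallocate_zero_blocks=1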

Chapter 15. Managing Lustre Networking (LNet)

This chapter describes some tools for managing Lustre networking (LNet) and includes the following sections:

15.1.  Updating the Health Status of a Peer or Router

There are two mechanisms to update the health status of a peer or a router:

  • LNet can actively check the health status of all routers and mark them as dead or alive automatically. By default, this is off. To enable it, set auto_down and, if desired, check_routers_before_use (see the example following this list). This initial check may cause a pause equal to router_ping_timeout at system startup, if there are dead routers in the system.

  • When there is a communication error, all LNDs notify LNet that the peer (not necessarily a router) is down. This mechanism is always on, and there is no parameter to turn it off. However, if you set the LNet module parameter auto_down to 0, LNet ignores all such peer-down notifications.
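
A hedged example of enabling the router checks via the LNet module options (placed in a modprobe configuration file such as /etc/modprobe.d/lustre.conf; the values are illustrative, not recommendations):

options lnet auto_down=1 check_routers_before_use=1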

Several key differences in both mechanisms:

  • The router pinger only checks routers for their health, while LNDs notice all dead peers, regardless of whether they are a router or not.

  • The router pinger actively checks the router health by sending pings, but LNDs only notice a dead peer when there is network traffic going on.

  • The router pinger can bring a router from alive to dead or vice versa, but LNDs can only bring a peer down.

15.2. Starting and Stopping LNet

The Lustre software automatically starts and stops LNet, but it can also be manually started in a standalone manner. This is particularly useful to verify that your networking setup is working correctly before you attempt to start the Lustre file system.

15.2.1. Starting LNet

To start LNet, run:

$ modprobe lnet
$ lctl network up

To see the list of local NIDs, run:

$ lctl list_nids

This command tells you the network(s) configured to work with the Lustre file system.

If the networks are not correctly set up, check the modules.conf "networks=" line and make sure the network layer modules are correctly installed and configured.

To get the best remote NID, run:

$ lctl which_nid NIDs

where NIDs is the list of available NIDs.

This command takes the "best" NID from a list of the NIDs of a remote host. The "best" NID is the one that the local node uses when trying to communicate with the remote node.

15.2.1.1. Starting Clients

To start a TCP client, run:

mount -t lustre mdsnode:/mdsA/client /mnt/lustre/

To start an Elan client, run:

mount -t lustre 2@elan0:/mdsA/client /mnt/lustre

15.2.2. Stopping LNet

Before the LNet modules can be removed, LNet references must be removed. In general, these references are removed automatically when the Lustre file system is shut down, but for standalone routers, an explicit step is needed to stop LNet. Run:

lctl network unconfigure

Note

Attempting to remove Lustre modules prior to stopping the network may result in a crash or an LNet hang. If this occurs, the node must be rebooted (in most cases). Make sure that the Lustre network and Lustre file system are stopped prior to unloading the modules. Be extremely careful using rmmod -f.

To unconfigure the LNet network, run:

modprobe -r lnd_and_lnet_modules

Note

To remove all Lustre modules, run:

$ lustre_rmmod

15.3. Hardware Based Multi-Rail Configurations with LNet

To aggregate bandwidth across both rails of a dual-rail IB cluster (o2iblnd) [1] using LNet, consider these points:

  • LNet can work with multiple rails, however, it does not load balance across them. The actual rail used for any communication is determined by the peer NID.

  • Hardware multi-rail LNet configurations do not provide an additional level of network fault tolerance. The configurations described below are for bandwidth aggregation only.

  • A Lustre node always uses the same local NID to communicate with a given peer NID. The criteria used to determine the local NID are:

    • Introduced in Lustre 2.5

      Lowest route priority number (lower number, higher priority).

    • Fewest hops (to minimize routing), and

    • Appears first in the "networks" or "ip2nets" LNet configuration strings

15.4. Load Balancing with an InfiniBand* Network

A Lustre file system contains OSSs with two InfiniBand HCAs. Lustre clients have only one InfiniBand HCA using OFED-based InfiniBand "o2ib" drivers. Load balancing between the HCAs on the OSS is accomplished through LNet.

15.4.1. Setting Up lustre.conf for Load Balancing

To configure LNet for load balancing on clients and servers:

  1. Set the lustre.conf options.

    Depending on your configuration, set lustre.conf options as follows:

    • Dual HCA OSS server

    options lnet networks="o2ib0(ib0),o2ib1(ib1)"
    • Client with the odd IP address

    options lnet ip2nets="o2ib0(ib0) 192.168.10.[103-253/2]"
    • Client with the even IP address

    options lnet ip2nets="o2ib1(ib0) 192.168.10.[102-254/2]"
  2. Run the modprobe lnet command and create a combined MGS/MDT file system.

    The following commands create an MGS/MDT or OST file system and mount the targets on the servers.

    modprobe lnet
    # mkfs.lustre --fsname lustre --mgs --mdt /dev/mdt_device
    # mkdir -p /mount_point
    # mount -t lustre /dev/mdt_device /mount_point

    For example:

    modprobe lnet
    mds# mkfs.lustre --fsname lustre --mdt --mgs /dev/sda
    mds# mkdir -p /mnt/test/mdt
    mds# mount -t lustre /dev/sda /mnt/test/mdt   
    mds# mount -t lustre mgs@o2ib0:/lustre /mnt/mdt
    oss# mkfs.lustre --fsname lustre --mgsnode=mds@o2ib0 --ost --index=0 /dev/sda
    oss# mkdir -p /mnt/test/ost
    oss# mount -t lustre /dev/sda /mnt/test/ost   
    oss# mount -t lustre mgs@o2ib0:/lustre /mnt/ost0
  3. Mount the clients.

    client# mount -t lustre mgs_node:/fsname /mount_point

    This example shows an IB client being mounted.

    client# mount -t lustre
    192.168.10.101@o2ib0,192.168.10.102@o2ib1:/mds/client /mnt/lustre

As an example, consider a two-rail IB cluster running the OFED stack with these IPoIB address assignments.

             ib0                             ib1
Servers            192.168.0.*                     192.168.1.*
Clients            192.168.[2-127].*               192.168.[128-253].*

You could create these configurations:

  • A cluster with more clients than servers. The fact that an individual client cannot get two rails of bandwidth is unimportant because the servers are typically the actual bottleneck.

ip2nets="o2ib0(ib0),    o2ib1(ib1)      192.168.[0-1].*                     \
                                            #all servers;\
                   o2ib0(ib0)      192.168.[2-253].[0-252/2]       #even cl\
ients;\
                   o2ib1(ib1)      192.168.[2-253].[1-253/2]       #odd cli\
ents"

This configuration gives every server two NIDs, one on each network, and statically load-balances clients between the rails.

  • A single client that must get two rails of bandwidth, and it does not matter if the maximum aggregate bandwidth is only (# servers) * (1 rail).

ip2nets="       o2ib0(ib0)                      192.168.[0-1].[0-252/2]     \
                                            #even servers;\
           o2ib1(ib1)                      192.168.[0-1].[1-253/2]         \
                                        #odd servers;\
           o2ib0(ib0),o2ib1(ib1)           192.168.[2-253].*               \
                                        #clients"

This configuration gives every server a single NID on one rail or the other. Clients have a NID on both rails.

  • All clients and all servers must get two rails of bandwidth.

ip2nets="o2ib0(ib0),o2ib2(ib1)           192.168.[0-1].[0-252/2]       \
  #even servers;\
           o2ib1(ib0),o2ib3(ib1)           192.168.[0-1].[1-253/2]         \
#odd servers;\
           o2ib0(ib0),o2ib3(ib1)           192.168.[2-253].[0-252/2)       \
#even clients;\
           o2ib1(ib0),o2ib2(ib1)           192.168.[2-253].[1-253/2)       \
#odd clients"

This configuration includes two additional proxy o2ib networks to work around the simplistic NID selection algorithm in the Lustre software. It connects "even" clients to "even" servers with o2ib0 on rail0, and "odd" servers with o2ib3 on rail1. Similarly, it connects "odd" clients to "odd" servers with o2ib1 on rail0, and "even" servers with o2ib2 on rail1.

15.5. Dynamically Configuring LNet Routes

Two scripts are provided: lustre/scripts/lustre_routes_config and lustre/scripts/lustre_routes_conversion.

lustre_routes_config sets or cleans up LNet routes from the specified config file. The /etc/sysconfig/lnet_routes.conf file can be used to automatically configure routes on LNet startup.

lustre_routes_conversion converts a legacy routes configuration file to the new syntax, which is parsed by lustre_routes_config.

15.5.1.  lustre_routes_config

lustre_routes_config usage is as follows

lustre_routes_config [--setup|--cleanup|--dry-run|--verbose] config_file
         --setup: configure routes listed in config_file
         --cleanup: unconfigure routes listed in config_file
         --dry-run: echo commands to be run, but do not execute them
         --verbose: echo commands before they are executed 

The format of the file which is passed into the script is as follows:

network: { gateway: gateway@exit_network [hop: hop] [priority: priority] }

An LNet router is identified when its local NID appears within the list of routes. However, this cannot be achieved by this script, since the script only adds extra routes after the router is identified. To ensure that a router is identified correctly, make sure to add its local NID to the routes parameter in the modprobe lustre configuration file. See Section 43.1, “ Introduction”.
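
A hedged example of a routes configuration file (saved, for example, as /etc/sysconfig/lnet_routes.conf) and how it might be applied; the networks and gateway NIDs are illustrative:

o2ib1: { gateway: 10.10.1.1@o2ib0 priority: 1 }
tcp2: { gateway: 192.168.2.1@tcp0 hop: 2 }

lustre_routes_config --setup /etc/sysconfig/lnet_routes.conf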

15.5.2. lustre_routes_conversion

lustre_routes_conversion usage is as follows:

lustre_routes_conversion legacy_file new_file

lustre_routes_conversion takes as a first parameter a file with routes configured as follows:

network [hop] gateway@exit_network[:priority];

The script then converts each routes entry in the provided file to:

network: { gateway: gateway@exit_network [hop: hop] [priority: priority] }

and appends each converted entry to the output file passed in as the second parameter to the script.

15.5.3. Route Configuration Examples

Below is an example of a legacy LNet route configuration. A legacy configuration file can have multiple entries.

tcp1 10.1.1.2@tcp0:1;
tcp2 10.1.1.3@tcp0:2;
tcp3 10.1.1.4@tcp0;

Below is an example of the converted LNet route configuration. The following would be the result of the lustre_routes_conversion script, when run on the above legacy entries.

tcp1: { gateway: 10.1.1.2@tcp0 priority: 1 }
tcp2: { gateway: 10.1.1.3@tcp0 priority: 2 }
tcp3: { gateway: 10.1.1.4@tcp0 }


[1] Hardware multi-rail configurations are only supported by o2iblnd; other IB LNDs do not support multiple interfaces.

Introduced in Lustre 2.10

Chapter 16. LNet Software Multi-Rail

This chapter describes LNet Software Multi-Rail configuration and administration.

16.1. Multi-Rail Overview

In computer networking, multi-rail is an arrangement in which two or more network interfaces to a single network on a computer node are employed, to achieve increased throughput. Multi-rail can also refer to a node having one or more interfaces to multiple, even different kinds of, networks, such as Ethernet, InfiniBand, and Intel® Omni-Path. For Lustre clients, multi-rail generally presents the combined network capabilities as a single LNet network. Peer nodes that are multi-rail capable are established during configuration, as are user-defined interface selection policies.

The following link contains a detailed high-level design for the feature: Multi-Rail High-Level Design

16.2. Configuring Multi-Rail

Every node using multi-rail networking needs to be properly configured. Multi-rail uses lnetctl and the LNet Configuration Library for configuration. Configuring multi-rail for a given node involves two tasks:

  1. Configuring multiple network interfaces present on the local node.

  2. Adding remote peers that are multi-rail capable (are connected to one or more common networks with at least two interfaces).

This section is a supplement to Section 9.1.3, “Adding, Deleting and Showing Networks” and contains further examples for Multi-Rail configurations.

For information on the dynamic peer discovery feature added in Lustre Release 2.11.0, see Section 9.1.5, “Dynamic Peer Discovery”.

16.2.1. Configure Multiple Interfaces on the Local Node

Example lnetctl add command with multiple interfaces in a Multi-Rail configuration:

lnetctl net add --net tcp --if eth0,eth1

Example of YAML net show:

lnetctl net show -v
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
          statistics:
              send_count: 0
              recv_count: 0
              drop_count: 0
          tunables:
              peer_timeout: 0
              peer_credits: 0
              peer_buffer_credits: 0
              credits: 0
          lnd tunables:
          tcp bonding: 0
          dev cpt: 0
          CPT: "[0]"
    - net type: tcp
      local NI(s):
        - nid: 192.168.122.10@tcp
          status: up
          interfaces:
              0: eth0
          statistics:
              send_count: 0
              recv_count: 0
              drop_count: 0
          tunables:
              peer_timeout: 180
              peer_credits: 8
              peer_buffer_credits: 0
              credits: 256
          lnd tunables:
          tcp bonding: 0
          dev cpt: -1
          CPT: "[0]"
        - nid: 192.168.122.11@tcp
          status: up
          interfaces:
              0: eth1
          statistics:
              send_count: 0
              recv_count: 0
              drop_count: 0
          tunables:
              peer_timeout: 180
              peer_credits: 8
              peer_buffer_credits: 0
              credits: 256
          lnd tunables:
          tcp bonding: 0
          dev cpt: -1
          CPT: "[0]"

16.2.2. Deleting Network Interfaces

Example delete with lnetctl net del:

Assuming the network configuration is as shown above with the lnetctl net show -v in the previous section, we can delete a net with the following command:

lnetctl net del --net tcp --if eth0

The resultant net information would look like:

lnetctl net show -v
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
          statistics:
              send_count: 0
              recv_count: 0
              drop_count: 0
          tunables:
              peer_timeout: 0
              peer_credits: 0
              peer_buffer_credits: 0
              credits: 0
          lnd tunables:
          tcp bonding: 0
          dev cpt: 0
          CPT: "[0,1,2,3]"

The syntax of a YAML file to perform a delete would be:

- net type: tcp
   local NI(s):
     - nid: 192.168.122.10@tcp
       interfaces:
           0: eth0
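
Assuming the YAML above is saved in a file such as delNet.yaml (the file name is illustrative), it could be applied with:

lnetctl import --del < delNet.yaml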

16.2.3. Adding Remote Peers that are Multi-Rail Capable

The following example lnetctl peer add command adds a peer with 2 nids, with 192.168.122.30@tcp being the primary nid:

lnetctl peer add --prim_nid 192.168.122.30@tcp --nid 192.168.122.30@tcp,192.168.122.31@tcp
      

The resulting lnetctl peer show would be:

lnetctl peer show -v
peer:
    - primary nid: 192.168.122.30@tcp
      Multi-Rail: True
      peer ni:
        - nid: 192.168.122.30@tcp
          state: NA
          max_ni_tx_credits: 8
          available_tx_credits: 8
          min_tx_credits: 7
          tx_q_num_of_buf: 0
          available_rtr_credits: 8
          min_rtr_credits: 8
          refcount: 1
          statistics:
              send_count: 2
              recv_count: 2
              drop_count: 0
        - nid: 192.168.122.31@tcp
          state: NA
          max_ni_tx_credits: 8
          available_tx_credits: 8
          min_tx_credits: 7
          tx_q_num_of_buf: 0
          available_rtr_credits: 8
          min_rtr_credits: 8
          refcount: 1
          statistics:
              send_count: 1
              recv_count: 1
              drop_count: 0

The following is an example YAML file for adding a peer:

addPeer.yaml
peer:
    - primary nid: 192.168.122.30@tcp
      Multi-Rail: True
      peer ni:
        - nid: 192.168.122.31@tcp
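
The YAML file could then be applied with lnetctl import, which adds the configuration by default (the file name above is illustrative):

lnetctl import < addPeer.yaml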

16.2.4. Deleting Remote Peers

Example of deleting a single nid of a peer (192.168.122.31@tcp):

lnetctl peer del --prim_nid 192.168.122.30@tcp --nid 192.168.122.31@tcp

Example of deleting the entire peer:

lnetctl peer del --prim_nid 192.168.122.30@tcp

Example of deleting a peer via YAML:

Assuming the following peer configuration:
peer:
    - primary nid: 192.168.122.30@tcp
      Multi-Rail: True
      peer ni:
        - nid: 192.168.122.30@tcp
          state: NA
        - nid: 192.168.122.31@tcp
          state: NA
        - nid: 192.168.122.32@tcp
          state: NA

You can delete 192.168.122.32@tcp as follows:

delPeer.yaml
peer:
    - primary nid: 192.168.122.30@tcp
      Multi-Rail: True
      peer ni:
        - nid: 192.168.122.32@tcp
    
% lnetctl import --del < delPeer.yaml

16.3. Notes on routing with Multi-Rail

This section details how to configure Multi-Rail with the routing feature before the Section 16.4, “Multi-Rail Routing with LNet Health” feature landed in Lustre 2.13. Routing code has always monitored the state of each route, in order to avoid using unavailable ones.

This section describes how you can configure multiple interfaces on the same gateway node but as different routes. This uses the existing route monitoring algorithm to guard against interfaces going down. With the Section 16.4, “Multi-Rail Routing with LNet Health” feature introduced in Lustre 2.13, the new algorithm uses the Section 16.5, “LNet Health” feature to monitor the different interfaces of the gateway and always ensures that the healthiest interface is used. Therefore, the configuration described in this section applies to releases prior to Lustre 2.13. It will still work in Lustre 2.13 as well; however, it is not required, for the reason mentioned above.

16.3.1. Multi-Rail Cluster Example

The below example outlines a simple system where all the Lustre nodes are MR capable. Each node in the cluster has two interfaces.

Figure 16.1. Routing Configuration with Multi-Rail

Routing Configuration with Multi-Rail

The routers can aggregate the interfaces on each side of the network by configuring them on the appropriate network.

An example configuration:

Routers
lnetctl net add --net o2ib0 --if ib0,ib1
lnetctl net add --net o2ib1 --if ib2,ib3
lnetctl peer add --nid <peer1-nidA>@o2ib,<peer1-nidB>@o2ib,...
lnetctl peer add --nid <peer2-nidA>@o2ib1,<peer2-nidB>@o2ib1,...
lnetctl set routing 1

Clients
lnetctl net add --net o2ib0 --if ib0,ib1
lnetctl route add --net o2ib1 --gateway <rtrX-nidA>@o2ib
lnetctl peer add --nid <rtrX-nidA>@o2ib,<rtrX-nidB>@o2ib
        
Servers
lnetctl net add --net o2ib1 --if ib0,ib1
lnetctl route add --net o2ib0 --gateway <rtrX-nidA>@o2ib1
lnetctl peer add --nid <rtrX-nidA>@o2ib1,<rtrX-nidB>@o2ib1

In the above configuration the clients and the servers are configured with only one route entry per router. This works because the routers are MR capable. By adding the routers as peers with multiple interfaces on the clients and the servers, the MR algorithm ensures that both interfaces of the routers are used when sending to the router.

However, as of the Lustre 2.10 release, LNet Resiliency is still under development, and a single interface failure will still cause the entire router to go down.

16.3.2. Utilizing Router Resiliency

Currently, LNet provides a mechanism to monitor each route entry. LNet pings each gateway identified in the route entry at a regular, configurable interval to ensure that it is alive. If sending over a specific route fails, or if the router pinger determines that the gateway is down, then the route is marked as down and is not used. It is subsequently pinged at regular, configurable intervals to determine when it becomes alive again.

This mechanism can be combined with the MR feature in Lustre 2.10 to add this router resiliency feature to the configuration.

Routers
lnetctl net add --net o2ib0 --if ib0,ib1
lnetctl net add --net o2ib1 --if ib2,ib3
lnetctl peer add --nid <peer1-nidA>@o2ib,<peer1-nidB>@o2ib,...
lnetctl peer add --nid <peer2-nidA>@o2ib1,<peer2-nidB>@o2ib1,...
lnetctl set routing 1

Clients
lnetctl net add --net o2ib0 --if ib0,ib1
lnetctl route add --net o2ib1 --gateway <rtrX-nidA>@o2ib
lnetctl route add --net o2ib1 --gateway <rtrX-nidB>@o2ib
        
Servers
lnetctl net add --net o2ib1 --if ib0,ib1
lnetctl route add --net o2ib0 --gateway <rtrX-nidA>@o2ib1
lnetctl route add --net o2ib0 --gateway <rtrX-nidB>@o2ib1

There are a few things to note in the above configuration:

  1. The clients and the servers are now configured with two routes, each route's gateway being one of the interfaces of the router. The clients and servers will view each interface of the same router as a separate gateway and will monitor them as described above.

  2. The clients and the servers are not configured to view the routers as MR capable. This is important because we want to deal with each interface as a separate peer and not as different interfaces of the same peer.

  3. The routers are configured to view the peers as MR capable. This is an oddity in the configuration, but it is currently required in order to allow the routers to load balance the traffic evenly across their interfaces.

16.3.3. Mixed Multi-Rail/Non-Multi-Rail Cluster

The above principles can be applied to a mixed MR/non-MR cluster. For example, the same configuration shown above can be applied if the clients and the servers are non-MR while the routers are MR capable. This appears to be a common cluster upgrade scenario.

Introduced in Lustre 2.13

16.4. Multi-Rail Routing with LNet Health

This section details how routing and pertinent module parameters can be configured beginning with Lustre 2.13.

Multi-Rail with Dynamic Discovery allows LNet to discover and use all configured interfaces of a node. It references a node via its primary NID. Multi-Rail routing carries forward this concept to the routing infrastructure. The following changes are brought in with the Lustre 2.13 release:

  1. Configuring a different route per gateway interface is no longer needed. One route per gateway should be configured. Gateway interfaces are used according to the Multi-Rail selection criteria.

  2. Routing now relies on Section 16.5, “LNet Health” to keep track of the route aliveness.

  3. Router interfaces are monitored via LNet Health. If an interface fails other interfaces will be used.

  4. Routing uses LNet discovery to discover gateways on regular intervals.

  5. A gateway pushes its list of interfaces upon the discovery of any changes in its interfaces' state.

16.4.1. Configuration

16.4.1.1. Configuring Routes

A gateway can have multiple interfaces on the same or different networks. The peers using the gateway can reach it on one or more of its interfaces. Multi-Rail routing takes care of managing which interface to use.

lnetctl route add --net <remote network>
      --gateway <NID for the gateway>
      --hop <number of hops> --priority <route priority>
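
For example, a hedged single-hop route to a remote o2ib1 network through a gateway with the illustrative NID 10.10.0.1@o2ib:

lnetctl route add --net o2ib1 --gateway 10.10.0.1@o2ib --hop 1 --priority 0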
      

16.4.1.2. Configuring Module Parameters

Table 16.1. Configuring Module Parameters

Module Parameter

Usage

check_routers_before_use

Defaults to 0. If set to 1 all routers must be up before the system can proceed.

avoid_asym_router_failure

Defaults to 1. If set to 1 single-hop routes have an additional requirement to be considered up. The requirement is that the gateway of the route must have at least one healthy network interface connected directly to the remote net of the route. In this context single-hop routes are routes that are given hop=1 explicitly when created, or routes for which lnet can infer that they have only one hop. Otherwise the route is not single-hop and this parameter has no effect.

alive_router_check_interval

Defaults to 60 seconds. The gateways will be discovered every alive_router_check_interval seconds. If the gateway can be reached on multiple networks, the interval per network is alive_router_check_interval / number of networks.

router_ping_timeout

Defaults to 50 seconds. A gateway sets its interface down if it has not received any traffic for router_ping_timeout + alive_router_check_interval seconds.

router_sensitivity_percentage

Defaults to 100. This parameter defines how sensitive a gateway interface is to failure. If set to 100 then any gateway interface failure will contribute to all routes using it going down. The lower the value the more tolerant to failures the system becomes.
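
A hedged example of setting some of these module parameters via a modprobe configuration file (the values are illustrative, not recommendations):

options lnet check_routers_before_use=1 alive_router_check_interval=60 router_sensitivity_percentage=65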


16.4.2. Router Health

The routing infrastructure now relies on LNet Health to keep track of interface health. Each gateway interface has a health value associated with it. If a send fails to one of these interfaces, then the interface's health value is decremented and the interface is placed on a recovery queue. The unhealthy interface is then pinged every lnet_recovery_interval. This value defaults to 1 second.

If the peer receives a message from the gateway, then it immediately assumes that the gateway's interface is up and resets its health value to maximum. This is needed to ensure we start using the gateways immediately instead of holding off until the interface is back to full health.

16.4.3. Discovery

LNet Discovery is used in place of pinging the peers. This serves two purposes:

  1. The discovery communication infrastructure does not need to be duplicated for the routing feature.

  2. It allows propagation of the gateway's interface state changes to the peers using the gateway.

For (2), if an interface changes state from UP to DOWN or vice versa, then a discovery PUSH is sent to all the peers which can be reached. This allows peers to adapt to changes quicker.

Discovery is designed to be backwards compatible. The discovery protocol is composed of a GET and a PUT. The GET requests interface information from the peer; this is a basic LNet ping. The peer responds with its interface information and a feature bit. If the peer is multi-rail capable and discovery is turned on, then the node will PUSH its interface information. As a result, both peers will be aware of each other's interfaces.

This information is then used by the peers to decide, based on the interface state provided by the gateway, whether the route is alive or not.

16.4.4. Route Aliveness Criteria

A route is considered alive if the following conditions hold:

  1. The gateway can be reached on the local net via at least one path.

  2. For a single-hop route, if avoid_asym_router_failure is enabled then the remote network defined in the route must have at least one healthy interface on the gateway.

Introduced in Lustre 2.12

16.5. LNet Health

LNet Multi-Rail has implemented the ability for multiple interfaces to be used on the same LNet network or across multiple LNet networks. The LNet Health feature adds the ability to maintain a health value for each local and remote interface. This allows the Multi-Rail algorithm to consider the health of the interface before selecting it for sending. The feature also adds the ability to resend messages across different interfaces when interface or network failures are detected. This allows LNet to mitigate communication failures before passing the failures to upper layers for further error handling. To accomplish this, LNet Health monitors the status of the send and receive operations and uses this status to increment the interface's health value in case of success and decrement it in case of failure.

16.5.1. Health Value

The initial health value of a local or remote interface is set to LNET_MAX_HEALTH_VALUE, currently set to be 1000. The value itself is arbitrary and is meant to allow for health granularity, as opposed to having a simple boolean state. The granularity allows the Multi-Rail algorithm to select the interface that has the highest likelihood of sending or receiving a message.

16.5.2. Failure Types and Behavior

LNet health behavior depends on the type of failure detected:

Failure Type

Behavior

local resend

A local failure has occurred, such as no route found or an address resolution error. These failures could be temporary, therefore LNet will attempt to resend the message. LNet will decrement the health value of the local interface and will select it less often if there are multiple available interfaces.

local no-resend

A local non-recoverable error occurred in the system, such as out of memory error. In these cases LNet will not attempt to resend the message. LNet will decrement the health value of the local interface and will select it less often if there are multiple available interfaces.

remote no-resend

If LNet successfully sends a message, but the message does not complete or an expected reply is not received, then it is classified as a remote error. LNet will not attempt to resend the message to avoid duplicate messages on the remote end. LNet will decrement the health value of the remote interface and will select it less often if there are multiple available interfaces.

remote resend

There are a set of failures where we can be reasonably sure that the message was dropped before getting to the remote end. In this case, LNet will attempt to resend the message. LNet will decrement the health value of the remote interface and will select it less often if there are multiple available interfaces.

16.5.3. User Interface

LNet Health is turned on by default. There are multiple module parameters available to control the LNet Health feature.

All the module parameters are implemented in sysfs and are located in /sys/module/lnet/parameters/. They can be set directly by echoing a value into them as well as from lnetctl.
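
For example, a hedged illustration of both methods for lnet_health_sensitivity (the value shown is the documented default):

#> echo 100 > /sys/module/lnet/parameters/lnet_health_sensitivity
#> lnetctl set health_sensitivity 100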

Parameter

Description

lnet_health_sensitivity

When LNet detects a failure on a particular interface it will decrement its Health Value by lnet_health_sensitivity. The greater the value, the longer it takes for that interface to become healthy again. The default value of lnet_health_sensitivity is set to 100. To disable LNet health, the value can be set to 0.

An lnet_health_sensitivity of 100 means that 10 consecutive message failures or a steady-state failure rate over 1% would degrade the interface Health Value until it is disabled, while a lower failure rate would steer traffic away from the interface but it would continue to be available. When a failure occurs on an interface then its Health Value is decremented and the interface is flagged for recovery.

lnetctl set health_sensitivity: sensitivity to failure
      0 - turn off health evaluation
      >0 - sensitivity value not more than 1000

lnet_recovery_interval

When LNet detects a failure on a local or remote interface it will place that interface on a recovery queue. There is a recovery queue for local interfaces and another for remote interfaces. The interfaces on the recovery queues will be LNet PINGed every lnet_recovery_interval. This value defaults to 1 second. On every successful PING the health value of the interface pinged will be incremented by 1.

Having this value configurable allows system administrators to control the amount of control traffic on the network.

lnetctl set recovery_interval: interval to ping unhealthy interfaces
      >0 - timeout in seconds

lnet_transaction_timeout

This timeout is somewhat of an overloaded value. It carries the following functionality:

  • A message is abandoned if it is not sent successfully when the lnet_transaction_timeout expires and the retry_count is not reached.

  • A GET or a PUT which expects an ACK expires if a REPLY or an ACK, respectively, is not received within the lnet_transaction_timeout.

This value defaults to 30 seconds.

lnetctl set transaction_timeout: Message/Response timeout
      >0 - timeout in seconds

Note

The LND timeout will now be a fraction of the lnet_transaction_timeout as described in the next section.

This means that in networks where very large delays are expected then it will be necessary to increase this value accordingly.

lnet_retry_count

When LNet detects a failure that it deems appropriate for resending, it checks whether the message has already been resent the maximum retry_count number of times. If the retry count has been reached and the message still has not been sent successfully, a failure event is passed up to the layer that initiated the message. The default value is 2.

Since the message retry interval (lnet_lnd_timeout) is computed from lnet_transaction_timeout / lnet_retry_count, the lnet_retry_count should be kept low enough that the retry interval is not shorter than the round-trip message delay in the network. A lnet_retry_count of 5 is reasonable for the default lnet_transaction_timeout of 50 seconds.

lnetctl set retry_count: number of retries
      0 - turn off retries
      >0 - number of retries, cannot be more than lnet_transaction_timeout

lnet_lnd_timeout

This is not a directly configurable parameter, but it is derived from two configurable parameters: lnet_transaction_timeout and retry_count.

lnet_lnd_timeout = (lnet_transaction_timeout-1) / (retry_count+1)
              

As such there is a restriction that lnet_transaction_timeout >= retry_count
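As an illustration (using integer arithmetic, and the sample values shown in Section 16.5.4.1 below: a transaction timeout of 10 seconds and a retry count of 3):

lnet_lnd_timeout = (10-1) / (3+1) = 2 seconds

so each send attempt is given roughly 2 seconds at the LND level before LNet considers resending the message.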

The core assumption here is that in a healthy network, sending and receiving LNet messages should not have large delays. There could be large delays with RPC messages and their responses, but that's handled at the PtlRPC layer.

16.5.4. Displaying Information

16.5.4.1. Showing LNet Health Configuration Settings

lnetctl can be used to show all the LNet health configuration settings using the lnetctl global show command.

#> lnetctl global show
      global:
      numa_range: 0
      max_intf: 200
      discovery: 1
      retry_count: 3
      transaction_timeout: 10
      health_sensitivity: 100
      recovery_interval: 1

16.5.4.2. Showing LNet Health Statistics

LNet Health statistics are shown at a higher verbosity setting. To show the local interface health statistics:

lnetctl net show -v 3

To show the remote interface health statistics:

lnetctl peer show -v 3

Sample output:

#> lnetctl net show -v 3
      net:
      - net type: tcp
        local NI(s):
           - nid: 192.168.122.108@tcp
             status: up
             interfaces:
                 0: eth2
             statistics:
                 send_count: 304
                 recv_count: 284
                 drop_count: 0
             sent_stats:
                 put: 176
                 get: 138
                 reply: 0
                 ack: 0
                 hello: 0
             received_stats:
                 put: 145
                 get: 137
                 reply: 0
                 ack: 2
                 hello: 0
             dropped_stats:
                 put: 10
                 get: 0
                 reply: 0
                 ack: 0
                 hello: 0
             health stats:
                 health value: 1000
                 interrupts: 0
                 dropped: 10
                 aborted: 0
                 no route: 0
                 timeouts: 0
                 error: 0
             tunables:
                 peer_timeout: 180
                 peer_credits: 8
                 peer_buffer_credits: 0
                 credits: 256
             dev cpt: -1
             tcp bonding: 0
             CPT: "[0]"
      CPT: "[0]"

There is a new YAML block, health stats, which displays the health statistics for each local or remote network interface.

The global statistics output also includes the global health statistics, as shown below:

#> lnetctl stats show
        statistics:
            msgs_alloc: 0
            msgs_max: 33
            rst_alloc: 0
            errors: 0
            send_count: 901
            resend_count: 4
            response_timeout_count: 0
            local_interrupt_count: 0
            local_dropped_count: 10
            local_aborted_count: 0
            local_no_route_count: 0
            local_timeout_count: 0
            local_error_count: 0
            remote_dropped_count: 0
            remote_error_count: 0
            remote_timeout_count: 0
            network_timeout_count: 0
            recv_count: 851
            route_count: 0
            drop_count: 10
            send_length: 425791628
            recv_length: 69852
            route_length: 0
            drop_length: 0

16.5.5. Initial Settings Recommendations

LNet Health is on by default, which means that lnet_health_sensitivity and lnet_retry_count are set to non-zero values.

Setting lnet_health_sensitivity to 0 turns the LNet Health feature off: the health of an interface is not decremented on failure, the interface selection behavior does not change, and failed interfaces are not placed on the recovery queues.

The LNet Health settings will need to be tuned for each cluster. However, the base configuration would be as follows:

#> lnetctl global show
    global:
        numa_range: 0
        max_intf: 200
        discovery: 1
        retry_count: 3
        transaction_timeout: 10
        health_sensitivity: 100
        recovery_interval: 1

This configuration allows a failed message to be retried up to retry_count (3) times within the 10 second transaction timeout.

If a failure occurs on an interface, its health value is decremented by the health_sensitivity value (100) and the interface is LNet PINGed every recovery_interval (1 second) until it recovers.
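A sketch of applying these baseline values with lnetctl is shown below; the values simply mirror the sample configuration above and should be adjusted for each cluster:

#> lnetctl set health_sensitivity 100
#> lnetctl set recovery_interval 1
#> lnetctl set transaction_timeout 10
#> lnetctl set retry_count 3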

Chapter 17. Upgrading a Lustre File System

This chapter describes interoperability between Lustre software releases. It also provides procedures for upgrading from an older Lustre 2.x release to a more recent 2.y release (a major release upgrade), and from a Lustre software release 2.x.y to a more recent Lustre software release 2.x.z (a minor release upgrade). It includes the following sections:

17.1.  Release Interoperability and Upgrade Requirements

Lustre software release 2.x (major) upgrade:

  • All servers must be upgraded at the same time, while some or all clients may be upgraded independently of the servers.

  • All servers must be upgraded to a Linux kernel supported by the Lustre software. See the Lustre Release Notes for your Lustre version for a list of tested Linux distributions.

  • Clients to be upgraded must be running a compatible Linux distribution as described in the Release Notes.

Lustre software release 2.x.y release (minor) upgrade:

  • All servers must be upgraded at the same time, while some or all clients may be upgraded.

  • Rolling upgrades are supported for minor releases allowing individual servers and clients to be upgraded without stopping the Lustre file system.

17.2.  Upgrading to Lustre Software Release 2.x (Major Release)

The procedure for upgrading from a Lustre software release 2.x to a more recent 2.y major release of the Lustre software is described in this section. To upgrade an existing 2.x installation to a more recent major release, complete the following steps:

  1. Create a complete, restorable file system backup.

    Caution

    Before installing the Lustre software, back up ALL data. The Lustre software contains kernel modifications that interact with storage devices and may introduce security issues and data loss if not installed, configured, or administered properly. If a full backup of the file system is not practical, a device-level backup of the MDT file system is recommended. See Chapter 18, Backing Up and Restoring a File System for a procedure.

  2. Shut down the entire filesystem by following Section 13.4, “ Stopping the Filesystem”.

  3. Upgrade the Linux operating system on all servers to a compatible (tested) Linux distribution and reboot.

  4. Upgrade the Linux operating system on all clients to a compatible (tested) distribution and reboot.

  5. Download the Lustre server RPMs for your platform from the Lustre Releases repository. See Table 8.1, “Packages Installed on Lustre Servers” for a list of required packages.

  6. Install the Lustre server packages on all Lustre servers (MGS, MDSs, and OSSs).

    1. Log onto a Lustre server as the root user

    2. Use the yum command to install the packages:

      # yum --nogpgcheck install pkg1.rpm pkg2.rpm ... 

    3. Verify the packages are installed correctly:

      rpm -qa|egrep "lustre|kernel"

    4. Repeat these steps on each Lustre server.

  7. Download the Lustre client RPMs for your platform from the Lustre Releases repository. See Table 8.2, “Packages Installed on Lustre Clients” for a list of required packages.

    Note

    The version of the kernel running on a Lustre client must be the same as the version of the lustre-client-modules-ver package being installed. If not, a compatible kernel must be installed on the client before the Lustre client packages are installed.

  8. Install the Lustre client packages on each of the Lustre clients to be upgraded.

    1. Log onto a Lustre client as the root user.

    2. Use the yum command to install the packages:

      # yum --nogpgcheck install pkg1.rpm pkg2.rpm ... 

    3. Verify the packages were installed correctly:

      # rpm -qa|egrep "lustre|kernel"

    4. Repeat these steps on each Lustre client.

  9. The DNE feature allows using multiple MDTs within a single filesystem namespace, and each MDT can serve one or more remote sub-directories in the file system. The root directory is always located on MDT0.

    Note that clients running a release prior to the Lustre software release 2.4 can only see the namespace hosted by MDT0 and will return an IO error if an attempt is made to access a directory on another MDT.

    (Optional) To format an additional MDT, complete these steps:

    1. Determine the index used for the first MDT (each MDT must have unique index). Enter:

      client$ lctl dl | grep mdc
      36 UP mdc lustre-MDT0000-mdc-ffff88004edf3c00 
            4c8be054-144f-9359-b063-8477566eb84e 5

      In this example, the next available index is 1.

    2. Format the new block device as a new MDT at the next available MDT index by entering (on one line):

      mds# mkfs.lustre --reformat --fsname=filesystem_name --mdt \
          --mgsnode=mgsnode --index new_mdt_index 
      /dev/mdt1_device
  10. (Optional) If you are upgrading from a release before Lustre 2.10, to enable the project quota feature, enter the following on every ldiskfs backend target while it is unmounted:

    tune2fs -O project /dev/dev

    Note

    Enabling the project feature will prevent the filesystem from being used by older versions of ldiskfs, so it should only be enabled if the project quota feature is required and/or after it is known that the upgraded release does not need to be downgraded.

  11. When setting up the file system, enter:

    lctl conf_param $FSNAME.quota.mdt=$QUOTA_TYPE
    lctl conf_param $FSNAME.quota.ost=$QUOTA_TYPE
  12. Introduced in Lustre 2.13

    (Optional) If upgrading an ldiskfs MDT formatted prior to Lustre 2.13, note that the "wide striping" feature, which allows files to have more than 160 stripes and allows storing other large xattrs, was not enabled by default. This feature can be enabled on existing MDTs by running the following command on all MDT devices:

    mds# tune2fs -O ea_inode /dev/mdtdev

    For more information about wide striping, see Section 19.9, “Lustre Striping Internals”.

  13. Start the Lustre file system by starting the components in the order shown in the following steps:

    1. Mount the MGT. On the MGS, run

      mgs# mount -a -t lustre
    2. Mount the MDT(s). On each MDT, run:

      mds# mount -a -t lustre
    3. Mount all the OSTs. On each OSS node, run:

      oss# mount -a -t lustre

      Note

      This command assumes that all the OSTs are listed in the /etc/fstab file. OSTs that are not listed in the /etc/fstab file must be mounted individually by running the mount command:

      mount -t lustre /dev/block_device /mount_point
    4. Mount the file system on the clients. On each client node, run:

      client# mount -a -t lustre
  14. Introduced in Lustre 2.7

    (Optional) If you are upgrading from a release before Lustre 2.7, to enable OST FIDs to also store the OST index (to improve the reliability of LFSCK and debug messages), run the following once on each OSS after the OSTs are mounted:

    oss# lctl set_param osd-ldiskfs.*.osd_index_in_idif=1

    Note

    Enabling the index_in_idif feature will prevent the OST from being used by older versions of Lustre, so it should only be enabled once it is known there is no need for the OST to be downgraded to an earlier release.

  15. If a new MDT was added to the filesystem, the new MDT must be attached into the namespace by creating one or more new DNE subdirectories with the lfs mkdir command that use the new MDT:

    client# lfs mkdir -i new_mdt_index /testfs/new_dir
    

    Introduced in Lustre 2.8

    In Lustre 2.8 and later, it is possible to split a new directory across multiple MDTs by creating it with multiple stripes:

    client# lfs mkdir -c 2 /testfs/new_striped_dir
    

    Introduced in Lustre 2.13

    In Lustre 2.13 and later, it is possible to set the default directory layout on existing directories so new remote subdirectories are created on less-full MDTs:

    client# lfs setdirstripe -D -c 1 -i -1 /testfs/some_dir
    

    See Section 13.10.1, “Directory creation by space/inode usage” for details.

    Introduced in Lustre 2.15

    In Lustre 2.15 and later, if no default directory layout is set on the root directory, the MDS will automatically set a default directory layout on the root directory to distribute the top-level directories round-robin across all MDTs, see Section 13.10.2, “Filesystem-wide default directory striping”.

Note

The mounting order described in the steps above must be followed for the initial mount and registration of a Lustre file system after an upgrade. For a normal start of a Lustre file system, the mounting order is MGT, OSTs, MDT(s), clients.

If you have a problem upgrading a Lustre file system, see Section 35.2, “Reporting a Lustre File System Bug” for ways to get help.

17.3.  Upgrading to Lustre Software Release 2.x.y (Minor Release)

Rolling upgrades are supported for upgrading from any Lustre software release 2.x.y to a more recent Lustre software release 2.x.z. This allows the Lustre file system to continue to run while individual servers (or their failover partners) and clients are upgraded one at a time. The procedure for upgrading a Lustre software release 2.x.y to a more recent minor release is described in this section.

To upgrade Lustre software release 2.x.y to a more recent minor release, complete these steps:

  1. Create a complete, restorable file system backup.

    Caution

    Before installing the Lustre software, back up ALL data. The Lustre software contains kernel modifications that interact with storage devices and may introduce security issues and data loss if not installed, configured, or administered properly. If a full backup of the file system is not practical, a device-level backup of the MDT file system is recommended. See Chapter 18, Backing Up and Restoring a File System for a procedure.

  2. Download the Lustre server RPMs for your platform from the Lustre Releases repository. See Table 8.1, “Packages Installed on Lustre Servers” for a list of required packages.

  3. For a rolling upgrade, complete any procedures required to keep the Lustre file system running while the server to be upgraded is offline, such as failing over a primary server to its secondary partner.

  4. Unmount the Lustre server to be upgraded (MGS, MDS, or OSS).

  5. Install the Lustre server packages on the Lustre server.

    1. Log onto the Lustre server as the root user

    2. Use the yum command to install the packages:

      # yum --nogpgcheck install pkg1.rpm pkg2.rpm ... 

    3. Verify the packages are installed correctly:

      rpm -qa|egrep "lustre|kernel"

    4. Mount the Lustre server to restart the Lustre software on the server:

      server# mount -a -t lustre
    5. Repeat these steps on each Lustre server.

  6. Download the Lustre client RPMs for your platform from the Lustre Releases repository. See Table 8.2, “Packages Installed on Lustre Clients” for a list of required packages.

  7. Install the Lustre client packages on each of the Lustre clients to be upgraded.

    1. Log onto a Lustre client as the root user.

    2. Use the yum command to install the packages:

      # yum --nogpgcheck install pkg1.rpm pkg2.rpm ... 

    3. Verify the packages were installed correctly:

      # rpm -qa|egrep "lustre|kernel"

    4. Mount the Lustre client to restart the Lustre software on the client:

      client# mount -a -t lustre
    5. Repeat these steps on each Lustre client.

If you have a problem upgrading a Lustre file system, see Section 35.2, “Reporting a Lustre File System Bug” for suggestions on how to get help.

Chapter 18. Backing Up and Restoring a File System

This chapter describes how to back up and restore at the file system level, device level, and file level in a Lustre file system. Each backup approach is described in the following sections:

It is strongly recommended that sites perform periodic device-level backup of the MDT(s) (Section 18.2, “ Backing Up and Restoring an MDT or OST (ldiskfs Device Level)”), for example twice a week with alternate backups going to a separate device, even if there is not enough capacity to do a full backup of all of the filesystem data. Even if there are separate file-level backups of some or all files in the filesystem, having a device-level backup of the MDT can be very useful in case of MDT failure or corruption. Being able to restore a device-level MDT backup can avoid the significantly longer process of restoring the entire filesystem from backup. Since the MDT is required for access to all files, its loss would otherwise force full restore of the filesystem (if that is even possible) even if the OSTs are still OK.

Performing a periodic device-level MDT backup can be done relatively inexpensively because the storage need only be connected to the primary MDS (it can be manually connected to the backup MDS in the rare case it is needed), and it only needs good linear read/write performance. While a device-level MDT backup is not useful for restoring individual files, it is the most efficient way to handle MDT failure or corruption.

18.1.  Backing up a File System

Backing up a complete file system gives you full control over the files to back up, and allows restoration of individual files as needed. File system-level backups are also the easiest to integrate into existing backup solutions.

File system backups are performed from a Lustre client (or many clients working in parallel in different directories) rather than on individual server nodes; this is no different than backing up any other file system.

However, due to the large size of most Lustre file systems, it is not always possible to get a complete backup. We recommend that you back up subsets of a file system. This includes subdirectories of the entire file system, filesets for a single user, files incremented by date, and so on, so that restores can be done more efficiently.
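As an illustration (the directory and backup paths here are hypothetical), a per-project subset backup run from a Lustre client looks the same as for any other file system:

client# tar czf /backup/project1-$(date +%Y%m%d).tgz -C /mnt/lustre/projects project1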

Note

Lustre internally uses a 128-bit file identifier (FID) for all files. To interface with user applications, 64-bit inode numbers are returned by the stat(), fstat(), and readdir() system calls to 64-bit applications, and 32-bit inode numbers to 32-bit applications.

Some 32-bit applications accessing Lustre file systems (on both 32-bit and 64-bit CPUs) may experience problems with the stat(), fstat() or readdir() system calls under certain circumstances, though the Lustre client should return 32-bit inode numbers to these applications.

In particular, if the Lustre file system is exported from a 64-bit client via NFS to a 32-bit client, the Linux NFS server will export 64-bit inode numbers to applications running on the NFS client. If the 32-bit applications are not compiled with Large File Support (LFS), they return EOVERFLOW errors when accessing the Lustre files. To avoid this problem, Linux NFS clients can use the kernel command-line option "nfs.enable_ino64=0" to force the NFS client to report 32-bit inode numbers to applications.

Workaround: We very strongly recommend that backups using tar(1) and other utilities that depend on the inode number to uniquely identify an inode be run on 64-bit clients. The 128-bit Lustre file identifiers cannot be uniquely mapped to a 32-bit inode number, and as a result these utilities may operate incorrectly on 32-bit clients. While there is still a small chance of inode number collisions with 64-bit inodes, the FID allocation pattern is designed to avoid collisions for long periods of usage.

18.1.1.  Lustre_rsync

The lustre_rsync feature keeps the entire file system in sync on a backup by replicating the file system's changes to a second file system (the second file system need not be a Lustre file system, but it must be sufficiently large). lustre_rsync uses Lustre changelogs to efficiently synchronize the file systems without having to scan (directory walk) the Lustre file system. This efficiency is critically important for large file systems, and distinguishes the Lustre lustre_rsync feature from other replication/backup solutions.

18.1.1.1.  Using Lustre_rsync

The lustre_rsync feature works by periodically running lustre_rsync, a userspace program used to synchronize changes in the Lustre file system onto the target file system. The lustre_rsync utility keeps a status file, which enables it to be safely interrupted and restarted without losing synchronization between the file systems.

The first time that lustre_rsync is run, the user must specify a set of parameters for the program to use. These parameters are described in the following table and in Section 44.11, “ lustre_rsync”. On subsequent runs, these parameters are stored in the status file, and only the name of the status file needs to be passed to lustre_rsync.

Before using lustre_rsync:

  • Register the changelog user. For details, see the changelog_register parameter in Chapter 44, System Configuration Utilities ( lctl).

- AND -

  • Verify that the Lustre file system (source) and the replica file system (target) are identical before registering the changelog user. If the file systems are discrepant, use a utility, e.g. regular rsync (not lustre_rsync), to make them identical.

The lustre_rsync utility uses the following parameters:

Parameter

Description

--source=src

The path to the root of the Lustre file system (source) which will be synchronized. This is a mandatory option if a valid status log created during a previous synchronization operation ( --statuslog) is not specified.

--target=tgt

The path to the root where the source file system will be synchronized (target). This is a mandatory option if the status log created during a previous synchronization operation ( --statuslog) is not specified. This option can be repeated if multiple synchronization targets are desired.

--mdt=mdt

The metadata device to be synchronized. A changelog user must be registered for this device. This is a mandatory option if a valid status log created during a previous synchronization operation ( --statuslog) is not specified.

--user=userid

The changelog user ID for the specified MDT. To use lustre_rsync, the changelog user must be registered. For details, see the changelog_register parameter in Chapter 44, System Configuration Utilities( lctl). This is a mandatory option if a valid status log created during a previous synchronization operation ( --statuslog) is not specified.

--statuslog=log

A log file to which synchronization status is saved. When the lustre_rsync utility starts, if a status log from a previous synchronization operation is specified, the state is read from the log and the otherwise mandatory --source, --target and --mdt options can be skipped. Specifying the --source, --target and/or --mdt options, in addition to the --statuslog option, causes the specified parameters in the status log to be overridden. Command line options take precedence over options in the status log.

--xattr yes|no

Specifies whether extended attributes ( xattrs) are synchronized or not. The default is to synchronize extended attributes.

Note

Disabling xattrs causes Lustre striping information not to be synchronized.

--verbose

Produces verbose output.

--dry-run

Shows the output of lustre_rsync commands ( copy, mkdir, etc.) on the target file system without actually executing them.

--abort-on-err

Stops processing the lustre_rsync operation if an error occurs. The default is to continue the operation.

18.1.1.2.  lustre_rsync Examples

Sample lustre_rsync commands are listed below.

Register a changelog user for an MDT (e.g. testfs-MDT0000).

# lctl --device testfs-MDT0000 changelog_register testfs-MDT0000
Registered changelog userid 'cl1'

Synchronize a Lustre file system ( /mnt/lustre) to a target file system ( /mnt/target).

$ lustre_rsync --source=/mnt/lustre --target=/mnt/target \
           --mdt=testfs-MDT0000 --user=cl1 --statuslog sync.log  --verbose 
Lustre filesystem: testfs 
MDT device: testfs-MDT0000 
Source: /mnt/lustre 
Target: /mnt/target 
Statuslog: sync.log 
Changelog registration: cl1 
Starting changelog record: 0 
Errors: 0 
lustre_rsync took 1 seconds 
Changelog records consumed: 22

After the file system undergoes changes, synchronize the changes onto the target file system. Only the statuslog name needs to be specified, as it has all the parameters passed earlier.

$ lustre_rsync --statuslog sync.log --verbose 
Replicating Lustre filesystem: testfs 
MDT device: testfs-MDT0000 
Source: /mnt/lustre 
Target: /mnt/target 
Statuslog: sync.log 
Changelog registration: cl1 
Starting changelog record: 22 
Errors: 0 
lustre_rsync took 2 seconds 
Changelog records consumed: 42

To synchronize a Lustre file system ( /mnt/lustre) to two target file systems ( /mnt/target1 and /mnt/target2):

$ lustre_rsync --source=/mnt/lustre --target=/mnt/target1 \
           --target=/mnt/target2 --mdt=testfs-MDT0000 --user=cl1  \
           --statuslog sync.log

18.2.  Backing Up and Restoring an MDT or OST (ldiskfs Device Level)

In some cases, it is useful to do a full device-level backup of an individual device (MDT or OST), before replacing hardware, performing maintenance, etc. Doing full device-level backups ensures that all of the data and configuration files are preserved in the original state and is the easiest method of doing a backup. For the MDT file system, it may also be the fastest way to perform the backup and restore, since it can do large streaming read and write operations at the maximum bandwidth of the underlying devices.

Note

Keeping an updated full backup of the MDT is especially important because permanent failure or corruption of the MDT file system renders the much larger amount of data in all the OSTs largely inaccessible and unusable. The storage needed for one or two full MDT device backups is much smaller than doing a full filesystem backup, and can use less expensive storage than the actual MDT device(s) since it only needs to have good streaming read/write speed instead of high random IOPS.

If hardware replacement is the reason for the backup or if a spare storage device is available, it is possible to do a raw copy of the MDT or OST from one block device to the other, as long as the new device is at least as large as the original device. To do this, run:

dd if=/dev/{original} of=/dev/{newdev} bs=4M

If hardware errors cause read problems on the original device, use the command below to allow as much data as possible to be read from the original device while skipping sections of the disk with errors:

dd if=/dev/{original} of=/dev/{newdev} bs=4k conv=sync,noerror \
      count={original size in 4kB blocks}

Even in the face of hardware errors, the ldiskfs file system is very robust and it may be possible to recover the file system data after running e2fsck -fy /dev/{newdev} on the new device.

With Lustre software version 2.6 and later, LFSCK scanning will automatically move objects from lost+found back into their correct locations on the OST after directory corruption.

In order to ensure that the backup is fully consistent, the MDT or OST must be unmounted, so that there are no changes being made to the device while the data is being transferred. If the reason for the backup is preventative (i.e. an MDT backup on a running MDS in case of future failures), then it is possible to perform a consistent backup from an LVM snapshot. If an LVM snapshot is not available, and taking the MDS offline for a backup is unacceptable, it is also possible to perform a backup from the raw MDT block device. While a backup from the raw device will not be fully consistent due to ongoing changes, the vast majority of ldiskfs metadata is statically allocated, and inconsistencies in the backup can be fixed by running e2fsck on the backup device; this is still much better than not having any backup at all.
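If no spare block device is available, the device image can instead be written, and optionally compressed, to a file on separate storage. The sketch below uses illustrative device and backup paths:

mds# dd if=/dev/{mdtdev} bs=4M | gzip -c > /backup/{fsname}-MDT0000-$(date +%Y%m%d).img.gz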

18.3.  Backing Up an OST or MDT (Backend File System Level)

This procedure provides an alternative way to back up or migrate the data of an OST or MDT at the file level. At the file level, unused space is omitted from the backup and the process may be completed more quickly, with a smaller total backup size. Backing up a single OST device is not necessarily the best way to perform backups of the Lustre file system, since the files stored in the backup are not usable without metadata stored on the MDT and additional file stripes that may be on other OSTs. However, it is the preferred method for migration of OST devices, especially when it is desirable to reformat the underlying file system with different configuration options or to reduce fragmentation.

Note

Since Lustre stores internal metadata that maps FIDs to local inode numbers in the Object Index (OI) files, the OI files must be rebuilt when a file-level backup is restored. The OI Scrub rebuilds them automatically at first mount after a restore is detected, which may affect MDT performance after mount until the rebuild is completed. Progress can be monitored via lctl get_param osd-*.*.oi_scrub on the MDS or OSS node where the target filesystem was restored.

Introduced in Lustre 2.11

18.3.1.  Backing Up an OST or MDT (Backend File System Level)

Prior to Lustre software release 2.11.0, the backend file system level backup and restore process was only possible for ldiskfs-based systems. The ability to perform a ZFS-based MDT/OST file system level backup and restore was introduced in Lustre software release 2.11.0. Unlike with an ldiskfs-based system, index objects must be backed up before the target (MDT or OST) is unmounted in order to be able to restore the file system successfully. To enable index backup on the target, execute the following command on the target server:

# lctl set_param osd-*.${fsname}-${target}.index_backup=1

${target} is composed of the target type (MDT or OST) plus the target index, such as MDT0000, OST0001, and so on.

Note

The index_backup parameter is also valid for ldiskfs-based systems and is used when migrating data between ldiskfs-based and ZFS-based systems as described in Section 18.6, “ Migration Between ZFS and ldiskfs Target Filesystems ”.

18.3.2.  Backing Up an OST or MDT

The examples below show backing up an OST filesystem. When backing up an MDT, substitute mdt for ost in the instructions below.

  1. Unmount the target.

  2. Make a mountpoint for the file system.

    [oss]# mkdir -p /mnt/ost
  3. Mount the file system.

    For ldiskfs-based systems:

    [oss]# mount -t ldiskfs /dev/{ostdev} /mnt/ost

    For zfs-based systems:

    1. Import the pool for the target if it is exported. For example:

      [oss]# zpool import lustre-ost [-d ${ostdev_dir}]

    2. Enable the canmount property on the target filesystem. For example:

      [oss]# zfs set canmount=on ${fsname}-ost/ost

      You also can specify the mountpoint property. By default, it will be: /${fsname}-ost/ost

    3. Mount the target as 'zfs'. For example:

      [oss]# zfs mount ${fsname}-ost/ost

  4. Change to the mountpoint being backed up.

    [oss]# cd /mnt/ost
  5. Back up the extended attributes.

    [oss]# getfattr -R -d -m '.*' -e hex -P . > ea-$(date +%Y%m%d).bak

    Note

    If the tar(1) command supports the --xattr option (see below), the getfattr step may be unnecessary as long as tar correctly backs up the trusted.* attributes. However, completing this step is not harmful and can serve as an added safety measure.

    Note

    In most distributions, the getfattr command is part of the attr package. If the getfattr command returns errors like Operation not supported, then the kernel does not correctly support EAs. Stop and use a different backup method.

  6. Verify that the ea-$date.bak file has properly backed up the EA data on the OST.

    Without this attribute data, the MDT restore process will fail and result in an unusable filesystem. The OST restore process may be missing extra data that can be very useful in case of later file system corruption. Look at this file with more or a text editor. Each object file should have a corresponding item similar to this:

    # file: O/0/d0/100992
    trusted.fid= \
    0x0d822200000000004a8a73e500000000808a0100000000000000000000000000
  7. Back up all file system data.

    [oss]# tar czvf {backup file}.tgz [--xattrs] [--xattrs-include="trusted.*"] [--acls] --sparse .

    Note

    The tar --sparse option is vital for backing up an MDT. Very old versions of tar may not support the --sparse option correctly, which may cause the MDT backup to take a long time. Known-working versions include the tar from Red Hat Enterprise Linux distribution (RHEL version 6.3 or newer) or GNU tar version 1.25 and newer.

    Warning

    The tar --xattrs option is only available in GNU tar version 1.27 or later or in RHEL 6.3 or newer. The --xattrs-include="trusted.*" option is required for correct restoration of the xattrs when using GNU tar 1.27 or RHEL 7 and newer.

    The tar --acls option is recommended for MDT backup of POSIX ACLs. Or, getfacl -n -R and setfacl --restore can be used instead.

  8. Change directory out of the file system.

    [oss]# cd -
  9. Unmount the file system.

    [oss]# umount /mnt/ost

    Note

    When restoring an OST backup on a different node as part of an OST migration, you also have to change server NIDs and use the --writeconf command to re-generate the configuration logs. See Chapter 14, Lustre Maintenance (Changing a Server NID).

18.4.  Restoring a File-Level Backup

To restore data from a file-level backup, you need to format the device, restore the file data and then restore the EA data.

  1. Format the new device.

    [oss]# mkfs.lustre --ost --index {OST index} \
    --replace --fstype=${fstype} {other options} /dev/{newdev}
  2. Set the file system label (ldiskfs-based systems only).

    [oss]# e2label /dev/{newdev} {fsname}-OST{index in hex}
  3. Mount the file system.

    For ldiskfs-based systems:

    [oss]# mount -t ldiskfs /dev/{newdev} /mnt/ost

    For zfs-based systems:

    1. Import the pool for the target if it is exported. For example:

      [oss]# zpool import lustre-ost [-d ${ostdev_dir}]
    2. Enable the canmount property on the target filesystem. For example:

      [oss]# zfs set canmount=on ${fsname}-ost/ost

      You also can specify the mountpoint property. By default, it will be: /${fsname}-ost/ost

    3. Mount the target as 'zfs'. For example:

      [oss]# zfs mount ${fsname}-ost/ost
  4. Change to the new file system mount point.

    [oss]# cd /mnt/ost
  5. Restore the file system backup.

    [oss]# tar xzvpf {backup file} [--xattrs] [--xattrs-include="trusted.*"] [--acls] [-P] --sparse

    Warning

    The tar --xattrs option is only available in GNU tar version 1.27 or later or in RHEL 6.3 or newer. The --xattrs-include="trusted.*" option is required for correct restoration of the MDT xattrs when using GNU tar 1.27 or RHEL 7 and newer. Otherwise, the setfattr step below should be used.

    The tar --acls option is needed for correct restoration of POSIX ACLs on MDTs. Alternatively, getfacl -n -R and setfacl --restore can be used instead.

    The tar -P (or --absolute-names) option can be used to speed up extraction of a trusted MDT backup archive.

  6. If not using a version of tar that supports direct xattr backups, restore the file system extended attributes.

    [oss]# setfattr --restore=ea-${date}.bak

    Note

    If --xattrs option is supported by tar and specified in the step above, this step is redundant.

  7. Verify that the extended attributes were restored.

    [oss]# getfattr -d -m ".*" -e hex O/0/d0/100992
    trusted.fid= \
    0x0d822200000000004a8a73e500000000808a0100000000000000000000000000
  8. Remove old OI and LFSCK files.

    [oss]# rm -rf oi.16* lfsck_* LFSCK
  9. Remove old CATALOGS.

    [oss]# rm -f CATALOGS

    Note

    This step applies to the MDT only and is optional. The CATALOGS file records the llog file handles that are used for recovering cross-server updates. If that recovery runs before the background OI scrub has rebuilt the OI mappings for the llog files, the recovery will fail, which in turn causes the whole mount to fail. Since OI scrub is an online tool, a mount failure also stops the OI scrub. Removing the old CATALOGS avoids this potential problem. The side effect of removing the old CATALOGS is that recovery of the related cross-server updates is aborted; however, this can be handled by LFSCK after the file system is mounted.

  10. Change directory out of the file system.

    [oss]# cd -
  11. Unmount the new file system.

    [oss]# umount /mnt/ost

    Note

    If the restored system has a different NID from the backup system, change the NID. For details, see Section 14.5, “ Changing a Server NID”. For example:

    [oss]# mount -t lustre -o nosvc ${fsname}-ost/ost /mnt/ost
    [oss]# lctl replace_nids ${fsname}-OSTxxxx $new_nids
    [oss]# umount /mnt/ost
  12. Mount the target as lustre.

    Usually, we will use the -o abort_recov option to skip unnecessary recovery. For example:

    [oss]# mount -t lustre -o abort_recov ${fsname}-ost/ost /mnt/ost

    Lustre can detect the restore automatically when mounting the target, and then trigger OI scrub to rebuild the OIs and index objects asynchronously in the background. You can check the OI scrub status with the following command:

    [oss]# lctl get_param -n osd-${fstype}.${fsname}-${target}.oi_scrub

If the file system was used between the time the backup was made and when it was restored, then the online LFSCK tool will automatically be run to ensure the filesystem is coherent. If all of the device filesystems were backed up at the same time after the entire Lustre file system was stopped, this step is unnecessary. In either case, the filesystem will be usable immediately, although there may be I/O errors reading from files that are present on the MDT but not the OSTs, and files that were created after the MDT backup will not be accessible or visible. See Section 36.4, “ Checking the file system with LFSCK” for details on using LFSCK.

18.5.  Using LVM Snapshots with the Lustre File System

If you want to perform disk-based backups (because, for example, access to the backup system needs to be as fast as to the primary Lustre file system), you can use the Linux LVM snapshot tool to maintain multiple, incremental file system backups.

Because LVM snapshots cost CPU cycles as new files are written, taking snapshots of the main Lustre file system will probably result in unacceptable performance losses. You should create a new, backup Lustre file system and periodically (e.g., nightly) back up new/changed files to it. Periodic snapshots can be taken of this backup file system to create a series of "full" backups.

Note

Creating an LVM snapshot is not as reliable as making a separate backup, because the LVM snapshot shares the same disks as the primary MDT device, and depends on the primary MDT device for much of its data. If the primary MDT device becomes corrupted, this may result in the snapshot being corrupted.

18.5.1.  Creating an LVM-based Backup File System

Use this procedure to create a backup Lustre file system for use with the LVM snapshot mechanism.

  1. Create LVM volumes for the MDT and OSTs.

    Create LVM devices for your MDT and OST targets. Make sure not to use the entire disk for the targets; save some room for the snapshots. The snapshots start out as 0 size, but grow as you make changes to the current file system. If you expect to change 20% of the file system between backups, the most recent snapshot will be 20% of the target size, the next older one will be 40%, etc. Here is an example:

    cfs21:~# pvcreate /dev/sda1
       Physical volume "/dev/sda1" successfully created
    cfs21:~# vgcreate vgmain /dev/sda1
       Volume group "vgmain" successfully created
    cfs21:~# lvcreate -L200G -nMDT0 vgmain
       Logical volume "MDT0" created
    cfs21:~# lvcreate -L200G -nOST0 vgmain
       Logical volume "OST0" created
    cfs21:~# lvscan
       ACTIVE                  '/dev/vgmain/MDT0' [200.00 GB] inherit
       ACTIVE                  '/dev/vgmain/OST0' [200.00 GB] inherit
  2. Format the LVM volumes as Lustre targets.

    In this example, the backup file system is called main and designates the current, most up-to-date backup.

    cfs21:~# mkfs.lustre --fsname=main --mdt --index=0 /dev/vgmain/MDT0
     No management node specified, adding MGS to this MDT.
        Permanent disk data:
     Target:     main-MDT0000
     Index:      0
     Lustre FS:  main
     Mount type: ldiskfs
     Flags:      0x75
                   (MDT MGS first_time update )
     Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
     Parameters:
    checking for existing Lustre data
     device size = 200GB
     formatting backing filesystem ldiskfs on /dev/vgmain/MDT0
             target name  main-MDT0000
             4k blocks     0
             options        -i 4096 -I 512 -q -O dir_index -F
     mkfs_cmd = mkfs.ext2 -j -b 4096 -L main-MDT0000  -i 4096 -I 512 -q
      -O dir_index -F /dev/vgmain/MDT0
     Writing CONFIGS/mountdata
    cfs21:~# mkfs.lustre --mgsnode=cfs21 --fsname=main --ost --index=0
    /dev/vgmain/OST0
        Permanent disk data:
     Target:     main-OST0000
     Index:      0
     Lustre FS:  main
     Mount type: ldiskfs
     Flags:      0x72
                   (OST first_time update )
     Persistent mount opts: errors=remount-ro,extents,mballoc
     Parameters: mgsnode=192.168.0.21@tcp
    checking for existing Lustre data
     device size = 200GB
     formatting backing filesystem ldiskfs on /dev/vgmain/OST0
             target name  main-OST0000
             4k blocks     0
             options        -I 256 -q -O dir_index -F
     mkfs_cmd = mkfs.ext2 -j -b 4096 -L lustre-OST0000 -J size=400 -I 256 
      -i 262144 -O extents,uninit_bg,dir_nlink,huge_file,flex_bg -G 256 
      -E resize=4290772992,lazy_journal_init, -F /dev/vgmain/OST0
     Writing CONFIGS/mountdata
    cfs21:~# mount -t lustre /dev/vgmain/MDT0 /mnt/mdt
    cfs21:~# mount -t lustre /dev/vgmain/OST0 /mnt/ost
    cfs21:~# mount -t lustre cfs21:/main /mnt/main
    

18.5.2.  Backing up New/Changed Files to the Backup File System

At periodic intervals e.g., nightly, back up new and changed files to the LVM-based backup file system.

cfs21:~# cp /etc/passwd /mnt/main 
 
cfs21:~# cp /etc/fstab /mnt/main 
 
cfs21:~# ls /mnt/main 
fstab  passwd

18.5.3.  Creating Snapshot Volumes

Whenever you want to make a "checkpoint" of the main Lustre file system, create LVM snapshots of all target MDT and OSTs in the LVM-based backup file system. You must decide the maximum size of a snapshot ahead of time, although you can dynamically change this later. The size of a daily snapshot is dependent on the amount of data changed daily in the main Lustre file system. It is likely that a two-day old snapshot will be twice as big as a one-day old snapshot.

You can create as many snapshots as you have room for in the volume group. If necessary, you can dynamically add disks to the volume group.

The snapshots of the target MDT and OSTs should be taken at the same point in time. Make sure that the cronjob updating the backup file system is not running, since that is the only thing writing to the disks. Here is an example:

cfs21:~# modprobe dm-snapshot
cfs21:~# lvcreate -L50M -s -n MDT0.b1 /dev/vgmain/MDT0
   Rounding up size to full physical extent 52.00 MB
   Logical volume "MDT0.b1" created
cfs21:~# lvcreate -L50M -s -n OST0.b1 /dev/vgmain/OST0
   Rounding up size to full physical extent 52.00 MB
   Logical volume "OST0.b1" created

After the snapshots are taken, you can continue to back up new/changed files to "main". The snapshots will not contain the new files.

cfs21:~# cp /etc/termcap /mnt/main
cfs21:~# ls /mnt/main
fstab  passwd  termcap

18.5.4.  Restoring the File System From a Snapshot

Use this procedure to restore the file system from an LVM snapshot.

  1. Rename the LVM snapshot.

    Rename the file system snapshot from "main" to "back" so you can mount it without unmounting "main". This is recommended, but not required. Use the --reformat flag to tunefs.lustre to force the name change. For example:

    cfs21:~# tunefs.lustre --reformat --fsname=back --writeconf /dev/vgmain/MDT0.b1
     checking for existing Lustre data
     found Lustre data
     Reading CONFIGS/mountdata
    Read previous values:
     Target:     main-MDT0000
     Index:      0
     Lustre FS:  main
     Mount type: ldiskfs
     Flags:      0x5
                  (MDT MGS )
     Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
     Parameters:
    Permanent disk data:
     Target:     back-MDT0000
     Index:      0
     Lustre FS:  back
     Mount type: ldiskfs
     Flags:      0x105
                  (MDT MGS writeconf )
     Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
     Parameters:
    Writing CONFIGS/mountdata
    cfs21:~# tunefs.lustre --reformat --fsname=back --writeconf /dev/vgmain/OST0.b1
     checking for existing Lustre data
     found Lustre data
     Reading CONFIGS/mountdata
    Read previous values:
     Target:     main-OST0000
     Index:      0
     Lustre FS:  main
     Mount type: ldiskfs
     Flags:      0x2
                  (OST )
     Persistent mount opts: errors=remount-ro,extents,mballoc
     Parameters: mgsnode=192.168.0.21@tcp
    Permanent disk data:
     Target:     back-OST0000
     Index:      0
     Lustre FS:  back
     Mount type: ldiskfs
     Flags:      0x102
                  (OST writeconf )
     Persistent mount opts: errors=remount-ro,extents,mballoc
     Parameters: mgsnode=192.168.0.21@tcp
    Writing CONFIGS/mountdata
    

    When renaming a file system, we must also erase the last_rcvd file from the snapshots:

    cfs21:~# mount -t ldiskfs /dev/vgmain/MDT0.b1 /mnt/mdtback
    cfs21:~# rm /mnt/mdtback/last_rcvd
    cfs21:~# umount /mnt/mdtback
    cfs21:~# mount -t ldiskfs /dev/vgmain/OST0.b1 /mnt/ostback
    cfs21:~# rm /mnt/ostback/last_rcvd
    cfs21:~# umount /mnt/ostback
  2. Mount the file system from the LVM snapshot. For example:

    cfs21:~# mount -t lustre /dev/vgmain/MDT0.b1 /mnt/mdtback
    cfs21:~# mount -t lustre /dev/vgmain/OST0.b1 /mnt/ostback
    cfs21:~# mount -t lustre cfs21:/back /mnt/back
  3. Note the old directory contents, as of the snapshot time. For example:

    cfs21:~/cfs/b1_5/lustre/utils# ls /mnt/back
    fstab  passwd
    

18.5.5.  Deleting Old Snapshots

To reclaim disk space, you can erase old snapshots as your backup policy dictates. Run:

lvremove /dev/vgmain/MDT0.b1

18.5.6.  Changing Snapshot Volume Size

You can also extend or shrink snapshot volumes if you find your daily deltas are smaller or larger than expected. Run:

lvextend -L10G /dev/vgmain/MDT0.b1

Note

Extending snapshots seems to be broken in older LVM. It is working in LVM v2.02.01.

Introduced in Lustre 2.11

18.6.  Migration Between ZFS and ldiskfs Target Filesystems

Beginning with Lustre 2.11.0, it is possible to migrate between ZFS and ldiskfs backends. For migrating OSTs, it is best to use lfs find/lfs_migrate to empty out an OST while the filesystem is in use and then reformat it with the new fstype. For instructions on removing the OST, please see Section 14.9.3, “Removing an OST from the File System”.

18.6.1.  Migrate from a ZFS to an ldiskfs based filesystem

The first step of the process is to make a ZFS backend backup using tar as described in Section 18.3, “ Backing Up an OST or MDT (Backend File System Level)”.

Next, restore the backup to an ldiskfs-based system as described in Section 18.4, “ Restoring a File-Level Backup”.

18.6.2.  Migrate from an ldiskfs to a ZFS based filesystem

The first step of the process is to make an ldiskfs backend backup using tar as described in Section 18.3, “ Backing Up an OST or MDT (Backend File System Level)”.

Caution: For a migration from ldiskfs to ZFS, index_backup must be enabled before the target is unmounted. This is an additional step compared to a regular ldiskfs-based backup/restore and is easy to miss.

Next, restore the backup to a ZFS-based system as described in Section 18.4, “ Restoring a File-Level Backup”.

Chapter 19. Managing File Layout (Striping) and Free Space

This chapter describes file layout (striping) and I/O options, and includes the following sections:

19.1.  How Lustre File System Striping Works

In a Lustre file system, the MDS allocates objects to OSTs using either a round-robin algorithm or a weighted algorithm. When the amount of free space is well balanced (i.e., by default, when the free space across OSTs differs by less than 17%), the round-robin algorithm is used to select the next OST to which a stripe is to be written. Periodically, the MDS adjusts the striping layout to eliminate some degenerated cases in which applications that create very regular file layouts (striping patterns) preferentially use a particular OST in the sequence.

Normally the usage of OSTs is well balanced. However, if users create a small number of exceptionally large files or incorrectly specify striping parameters, imbalanced OST usage may result. When the free space across OSTs differs by more than a specific amount (17% by default), the MDS then uses weighted random allocations with a preference for allocating objects on OSTs with more free space. (This can reduce I/O performance until space usage is rebalanced again.) For a more detailed description of how striping is allocated, see Section 19.8, “Managing Free Space”.
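The imbalance threshold is tunable. As a quick check (a sketch only; depending on the release the parameter may live under lov or lod, see Section 19.8, “Managing Free Space”), the current threshold can be read on the MDS with:

mds# lctl get_param lov.*.qos_threshold_rr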

Files can only be striped over a finite number of OSTs, based on the maximum size of the attributes that can be stored on the MDT. If the MDT is ldiskfs-based without the ea_inode feature, a file can be striped across at most 160 OSTs. With a ZFS-based MDT, or if the ea_inode feature is enabled for an ldiskfs-based MDT (the default since Lustre 2.13.0), a file can be striped across up to 2000 OSTs. For more information, see Section 19.9, “Lustre Striping Internals”.

19.2.  Lustre File Layout (Striping) Considerations

Whether you should set up file striping and what parameter values you select depends on your needs. A good rule of thumb is to stripe over as few objects as will meet those needs and no more.

Some reasons for using striping include:

  • Providing high-bandwidth access. Many applications require high-bandwidth access to a single file, which may be more bandwidth than can be provided by a single OSS. Examples are a scientific application that writes to a single file from hundreds of nodes, or a binary executable that is loaded by many nodes when an application starts.

    In cases like these, a file can be striped over as many OSSs as it takes to achieve the required peak aggregate bandwidth for that file. Striping across a larger number of OSSs should only be used when the file size is very large and/or is accessed by many nodes at a time. Currently, Lustre files can be striped across up to 2000 OSTs.

  • Improving performance when OSS bandwidth is exceeded. Striping across many OSSs can improve performance if the aggregate client bandwidth exceeds the server bandwidth and the application reads and writes data fast enough to take advantage of the additional OSS bandwidth. The largest useful stripe count is bounded by the I/O rate of the clients/jobs divided by the performance per OSS.

  • Introduced in Lustre 2.13

    Matching stripes to I/O pattern. When writing to a single file from multiple nodes, having more than one client writing to a stripe can lead to issues with lock exchange, where clients contend over writing to that stripe, even if their I/Os do not overlap. This can be avoided if the I/O can be stripe aligned so that each stripe is accessed by only one client. Since Lustre 2.13, the 'overstriping' feature is available, allowing more than one stripe per OST. This is particularly helpful for the case where the thread count exceeds the OST count, making it possible to match the stripe count to the thread count even in this case (see the sketch after this list).

  • Providing space for very large files. Striping is useful when a single OST does not have enough free space to hold the entire file.
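A minimal sketch of overstriping, using the --overstripe-count (-C) option described in Section 19.3 (the file name is hypothetical), creating 64 stripes even if the file system has fewer than 64 OSTs:

client# lfs setstripe -C 64 /mnt/lustre/shared_output_file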

Some reasons to minimize or avoid striping:

  • Increased overhead. Striping results in more locks and extra network operations during common operations such as stat and unlink. Even when these operations are performed in parallel, one network operation takes less time than 100 operations.

    Increased overhead also results from server contention. Consider a cluster with 100 clients and 100 OSSs, each with one OST. If each file has exactly one object and the load is distributed evenly, there is no contention and the disks on each server can manage sequential I/O. If each file has 100 objects, then the clients all compete with one another for the attention of the servers, and the disks on each node seek in 100 different directions resulting in needless contention.

  • Increased risk. When files are striped across all servers and one of the servers breaks down, a small part of each striped file is lost. By comparison, if each file has exactly one stripe, fewer files are lost, but they are lost in their entirety. Many users would prefer to lose some of their files entirely than all of their files partially.

  • Small files. Small files do not benefit from striping because they can be efficiently stored and accessed as a single OST object or even with Data on MDT.

  • O_APPEND mode. When files are opened for append, they instantiate all uninitialized components expressed in the layout. Typically, log files are opened for append, and complex layouts can be inefficient.

    Note

    The mdd.*.append_stripe_count and mdd.*.append_pool options can be used to specify special default striping for files created with O_APPEND.

19.2.1.  Choosing a Stripe Size

Choosing a stripe size is a balancing act, but reasonable defaults are described below. The stripe size has no effect on a single-stripe file.

  • The stripe size must be a multiple of the page size. Lustre software tools enforce a multiple of 64 KB (the maximum page size on ia64 and PPC64 nodes) so that users on platforms with smaller pages do not accidentally create files that might cause problems for ia64 clients.

  • The smallest recommended stripe size is 512 KB. Although you can create files with a stripe size of 64 KB, the smallest practical stripe size is 512 KB because the Lustre file system sends 1MB chunks over the network. Choosing a smaller stripe size may result in inefficient I/O to the disks and reduced performance.

  • A good stripe size for sequential I/O using high-speed networks is between 1 MB and 4 MB. In most situations, stripe sizes larger than 4 MB may result in longer lock hold times and contention during shared file access.

  • The maximum stripe size is 4 GB. Using a large stripe size can improve performance when accessing very large files. It allows each client to have exclusive access to its own part of a file. However, a large stripe size can be counterproductive in cases where it does not match your I/O pattern.

  • Choose a stripe pattern that takes into account the write patterns of your application. Writes that cross an object boundary are slightly less efficient than writes that go entirely to one server. If the file is written in a consistent and aligned way, make the stripe size a multiple of the write() size.
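As an illustration (the values and directory name are hypothetical), an application that writes in aligned 8 MB chunks could have its output directory configured so that each write lands entirely within one stripe:

client# lfs setstripe -s 8M -c 4 /mnt/lustre/output_dir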

19.3. Setting the File Layout/Striping Configuration (lfs setstripe)

Use the lfs setstripe command to create new files with a specific file layout (stripe pattern) configuration.

lfs setstripe [--size|-s stripe_size] [--stripe-count|-c stripe_count] [--overstripe-count|-C stripe_count] \
[--index|-i start_ost] [--pool|-p pool_name] filename|dirname 

stripe_size

The stripe_size indicates how much data to write to one OST before moving to the next OST. The default stripe_size is 1 MB. Passing a stripe_size of 0 causes the default stripe size to be used. Otherwise, the stripe_size value must be a multiple of 64 KB.

stripe_count (--stripe-count, --overstripe-count)

The stripe_count indicates how many stripes to use. The default stripe_count value is 1. Setting stripe_count to 0 causes the default stripe count to be used. Setting stripe_count to -1 means stripe over all available OSTs (full OSTs are skipped). When --overstripe-count is used, more than one stripe may be placed on each OST if necessary to reach the requested stripe count.
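
For example, overstriping allows a file to have more stripes than there are OSTs, which can help saturate very fast OSTs (a sketch; the file name is illustrative):

client# lfs setstripe -C 16 /mnt/lustre/overstriped_file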

start_ost

The start OST is the first OST to which files are written. The default value for start_ost is -1, which allows the MDS to choose the starting index. This setting is strongly recommended, as it allows space and load balancing to be done by the MDS as needed. If the value of start_ost is set to a value other than -1, the file starts on the specified OST index. OST index numbering starts at 0.

Note

If the specified OST is inactive or in a degraded mode, the MDS will silently choose another target.

Note

If you pass a start_ost value of 0 and a stripe_count value of 1, all files are written to OST 0, until space is exhausted. This is probably not what you meant to do. If you only want to adjust the stripe count and keep the other parameters at their default settings, do not specify any of the other parameters:

client# lfs setstripe -c stripe_count filename

pool_name

The pool_name specifies the OST pool to which the file will be written. This allows limiting the OSTs used to a subset of all OSTs in the file system. For more details about using OST pools, see Section 23.2, “ Creating and Managing OST Pools”.
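
For example, the following command (a sketch; the pool name mypool is illustrative and must already exist) creates a file striped over two OSTs selected from that pool:

client# lfs setstripe -c 2 -p mypool /mnt/lustre/pooled_file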

19.3.1. Specifying a File Layout (Striping Pattern) for a Single File

It is possible to specify the file layout when a new file is created using the command lfs setstripe. This allows users to override the file system default parameters to tune the file layout more optimally for their application. Execution of an lfs setstripe command fails if the file already exists.

19.3.1.1. Setting the Stripe Size

The command to create a new file with a specified stripe size is similar to:

[client]# lfs setstripe -s 4M /mnt/lustre/new_file

This example command creates the new file /mnt/lustre/new_file with a stripe size of 4 MB.

The new file is created on a single OST with a stripe size of 4 MB, as shown by lfs getstripe:

[client]# lfs getstripe /mnt/lustre/new_file
/mnt/lustre/new_file
lmm_stripe_count:   1
lmm_stripe_size:    4194304
lmm_pattern:        1
lmm_layout_gen:     0
lmm_stripe_offset:  1
obdidx     objid        objid           group
1          690550       0xa8976         0 

In this example, the stripe size is 4 MB.

19.3.1.2.  Setting the Stripe Count

The command below creates a new file with a stripe count of -1 to specify striping over all available OSTs:

[client]# lfs setstripe -c -1 /mnt/lustre/full_stripe

The example below indicates that the file full_stripe is striped over all six active OSTs in the configuration:

[client]# lfs getstripe /mnt/lustre/full_stripe
/mnt/lustre/full_stripe
  obdidx   objid   objid   group
  0        8       0x8     0
  1        4       0x4     0
  2        5       0x5     0
  3        5       0x5     0
  4        4       0x4     0
  5        2       0x2     0

This is in contrast to the output in Section 19.3.1.1, “Setting the Stripe Size”, which shows only a single object for the file.

19.3.2. Setting the Striping Layout for a Directory

In a directory, the lfs setstripe command sets a default striping configuration for files created in the directory. The usage is the same as lfs setstripe for a regular file, except that the directory must exist prior to setting the default striping configuration. If a file is created in a directory with a default stripe configuration (without otherwise specifying striping), the Lustre file system uses those striping parameters instead of the file system default for the new file.

To change the striping pattern for a sub-directory, create a directory with the desired file layout as described above. Sub-directories inherit the file layout of the root/parent directory.
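
For example, the following commands (a sketch; the directory name is illustrative) set a default layout of two stripes with a 2 MB stripe size on a new sub-directory:

client# mkdir /mnt/lustre/results
client# lfs setstripe -s 2M -c 2 /mnt/lustre/results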

Note

Special default striping can be used for files created with O_APPEND. Files with uninitialized layouts opened with O_APPEND will override a directory's default striping configuration and abide by the mdd.*.append_pool and mdd.*.append_stripe_count options (if they are specified).

19.3.3. Setting the Striping Layout for a File System

Setting the striping specification on the root directory determines the striping for all new files created in the file system unless an overriding striping specification takes precedence (such as a striping layout specified by the application, or set using lfs setstripe, or specified for the parent directory).
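
For example, the following command (a sketch; the mount point is illustrative) sets a file system-wide default of two stripes per file by setting it on the root directory:

client# lfs setstripe -c 2 /mnt/lustre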

Note

The striping settings for a root directory are, by default, applied to any new child directories created in the root directory, unless striping settings have been specified for the child directory.

Note

Special default striping can be used for files created with O_APPEND. Files with uninitialized layouts opened with O_APPEND will override a file system's default striping configuration and abide by the mdd.*.append_pool and mdd.*.append_stripe_count options (if they are specified).

19.3.4. Per File System Stripe Count Limit

Sometimes there are many OSTs in a filesystem, but it is not always desirable to stripe a file across all of them, even if stripe_count=-1 (unlimited) is given. In this case, the per-filesystem tunable parameter lod.*.max_stripecount can be used to limit the actual stripe count of a file to a number lower than the OST count. If lod.*.max_stripecount is not 0 and the file has stripe_count=-1, the actual stripe count is the minimum of the OST count and max_stripecount. If lod.*.max_stripecount=0, or an explicit stripe count is given for the file, the limit is not applied.

To set max_stripecount for all MDTs of a file system, run on the MGS:

mgs# lctl set_param -P lod.$fsname-MDTxxxx-mdtlov.max_stripecount=<N>
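
For example, on a file system named testfs with a single MDT (illustrative values), the following command limits files created with stripe_count=-1 to at most 16 stripes:

mgs# lctl set_param -P lod.testfs-MDT0000-mdtlov.max_stripecount=16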
        

To check max_stripecount, run:

mds# lctl get_param lod.$fsname-MDTxxxx-mdtlov.max_stripecount
        

To reset max_stripecount, run:

mgs# lctl set_param -P -d lod.$fsname-MDTxxxx-mdtlov.max_stripecount
        

19.3.5. Creating a File on a Specific OST

You can use lfs setstripe to create a file on a specific OST. In the following example, the file file1 is created on the first OST (OST index is 0).

$ lfs setstripe --stripe-count 1 --index 0 file1
$ dd if=/dev/zero of=file1 count=1 bs=100M
1+0 records in
1+0 records out

$ lfs getstripe file1
/mnt/testfs/file1
lmm_stripe_count:   1
lmm_stripe_size:    1048576
lmm_pattern:        1
lmm_layout_gen:     0
lmm_stripe_offset:  0
     obdidx    objid   objid    group
     0         37364   0x91f4   0

19.4. Retrieving File Layout/Striping Information (getstripe)

The lfs getstripe command is used to display information that shows over which OSTs a file is distributed. For each OST, the index and UUID are displayed, along with the object ID for each stripe in the file. For directories, the default settings for files created in that directory are displayed.

19.4.1. Displaying the Current Stripe Size

To see the current stripe size for a Lustre file or directory, use the lfs getstripe command. For example, to view information for a directory, enter a command similar to:

[client]# lfs getstripe /mnt/lustre 

This command produces output similar to:

/mnt/lustre
(Default) stripe_count: 1 stripe_size: 1M stripe_offset: -1

In this example, the default stripe count is 1 (data blocks are striped over a single OST), the default stripe size is 1 MB, and the objects are created over all available OSTs.

To view information for a file, enter a command similar to:

$ lfs getstripe /mnt/lustre/foo
/mnt/lustre/foo
lmm_stripe_count:   1
lmm_stripe_size:    1048576
lmm_pattern:        1
lmm_layout_gen:     0
lmm_stripe_offset:  0
  obdidx   objid    objid      group
  2        835487   0xcbf9f    0 

In this example, the file is located on obdidx 2, which corresponds to the OST lustre-OST0002. To see which node is serving that OST, run:

$ lctl get_param osc.lustre-OST0002-osc.ost_conn_uuid
osc.lustre-OST0002-osc.ost_conn_uuid=192.168.20.1@tcp

19.4.2. Inspecting the File Tree

To inspect an entire tree of files, use the lfs find command:

lfs find [--recursive | -r] file|directory ...
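
For example, the following pipeline (a sketch; the path is illustrative) recursively lists every file and directory in a subtree and prints the layout of each entry:

client# lfs find -r /mnt/lustre/results | xargs lfs getstripe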

19.4.3. Locating the MDT for a remote directory

Lustre can be configured with multiple MDTs in the same file system. Each directory and file could be located on a different MDT. To identify the MDT on which a given subdirectory is located, pass the getstripe [--mdt-index|-M] parameter to lfs. An example of this command is provided in Section 14.9.1, “Removing an MDT from the File System”.
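
For example, a command similar to the following prints the index of the MDT on which a directory resides (the path is illustrative; the output assumes the directory is on MDT0000):

client# lfs getstripe --mdt-index /mnt/lustre/remote_dir
0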

Introduced in Lustre 2.10

19.5. Progressive File Layout (PFL)

The Lustre Progressive File Layout (PFL) feature simplifies the use of Lustre so that users can expect reasonable performance for a variety of normal file IO patterns without the need to explicitly understand their IO model or Lustre usage details in advance. In particular, users do not necessarily need to know the size or concurrency of output files in advance of their creation, or to explicitly specify an optimal layout for each file, in order to achieve good performance for both highly concurrent shared-single-large-file IO and parallel IO to many smaller per-process files.

The layout of a PFL file is stored on disk as a composite layout. A PFL file is essentially an array of sub-layout components, with each sub-layout component being a plain layout covering a different, non-overlapping extent of the file. Because the file layout is composed of a series of components, it is possible that some file extents are not described by any component.

An example of how data blocks of PFL files are mapped to OST objects of components is shown in the following PFL object mapping diagram:

Figure 19.1. PFL object mapping diagram

PFL object mapping diagram

The PFL file in Figure 19.1, “PFL object mapping diagram” has 3 components and shows the mapping for the blocks of a 2055MB file. The stripe size for the first two components is 1MB, while the stripe size for the third component is 4MB. The stripe count increases for each successive component. The first component only has two 1MB blocks and the single object has a size of 2MB. The second component holds the next 254MB of the file spread over 4 separate OST objects in RAID-0, each of which has a size of 256MB / 4 objects = 64MB per object. Note that the first two objects, obj 2,0 and obj 2,1, have a 1MB hole at the start where the data is stored in the first component. The final component holds the next 1800MB spread over 32 OST objects. There is a 256MB / 32 = 8MB hole at the start of each one for the data stored in the first two components. Each object will be 2048MB / 32 objects = 64MB per object, except obj 3,0, which holds an extra 4MB chunk, and obj 3,1, which holds an extra 3MB chunk. If more data were written to the file, only the objects in component 3 would increase in size.

When a file range with a defined but not instantiated component is accessed, the client sends a Layout Intent RPC to the MDT, and the MDT instantiates the objects of the components covering that range.

Next, the commands used to operate on PFL files are introduced, and some examples of possible composite layouts are illustrated. Lustre provides the commands lfs setstripe and lfs migrate for operating on PFL files. lfs setstripe commands are used to create PFL files and to add or delete components to or from an existing composite file; lfs migrate commands are used to re-layout the data in existing files using new layout parameters by copying the data from the existing OST(s) to the new OST(s). Also, as introduced in the previous sections, lfs getstripe commands can be used to list the striping/component information for a given PFL file, and lfs find commands can be used to search the directory tree rooted at the given directory or file name for files that match the given PFL component parameters.

Note

Using PFL files requires both the client and server to understand the PFL file layout, which is not available in Lustre 2.9 and earlier. This does not prevent older clients from accessing non-PFL files in the filesystem.

19.5.1. lfs setstripe

lfs setstripe commands are used to create PFL files, and to add or delete components to or from an existing composite file. (The following examples assume 8 OSTs and a default stripe size of 1MB.)

19.5.1.1. Create a PFL file

Command

lfs setstripe
[--component-end|-E end1] [STRIPE_OPTIONS]
[--component-end|-E end2] [STRIPE_OPTIONS] ... filename

The -E option is used to specify the end offset (in bytes, or using a suffix kMGTP, e.g. 256M) of each component, and it also indicates that the following STRIPE_OPTIONS apply to this component. Each component defines the stripe pattern of the file in the range [start, end). The first component must start at offset 0 and all components must be adjacent to each other; no holes are allowed, so each extent starts at the end of the previous extent. A -1 end offset or eof indicates that this is the last component, extending to the end of file.

Example

$ lfs setstripe -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -i 4 \
/mnt/testfs/create_comp

This command creates a file with the composite layout illustrated in the following figure. The first component has 1 stripe and covers [0, 4M), the second component has 4 stripes and covers [4M, 64M), and the last component starts at OST4, stripes over all available OSTs, and covers [64M, EOF).

Figure 19.2. Example: create a composite file

Example: create a composite file

The composite layout can be output by the following command:

$ lfs getstripe /mnt/testfs/create_comp
/mnt/testfs/create_comp
  lcm_layout_gen:  3
  lcm_entry_count: 3
    lcme_id:             1
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   4194304
      lmm_stripe_count:  1
      lmm_stripe_size:   1048576
      lmm_pattern:       1
      lmm_layout_gen:    0
      lmm_stripe_offset: 0
      lmm_objects:
      - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x2:0x0] }

    lcme_id:             2
    lcme_flags:          0
    lcme_extent.e_start: 4194304
    lcme_extent.e_end:   67108864
      lmm_stripe_count:  4
      lmm_stripe_size:   1048576
      lmm_pattern:       1
      lmm_layout_gen:    0
      lmm_stripe_offset: -1
    lcme_id:             3
    lcme_flags:          0
    lcme_extent.e_start: 67108864
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  -1
      lmm_stripe_size:   1048576
      lmm_pattern:       1
      lmm_layout_gen:    0
      lmm_stripe_offset: 4

Note

Only the OST objects of the first component of the PFL file are instantiated when the layout is set. Instantiation of the other components is delayed until later write/truncate operations.

If we write 128M data to this PFL file, the second and third components will be instantiated:

$ dd if=/dev/zero of=/mnt/testfs/create_comp bs=1M count=128
$ lfs getstripe /mnt/testfs/create_comp
/mnt/testfs/create_comp
  lcm_layout_gen:  5
  lcm_entry_count: 3
    lcme_id:             1
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   4194304
      lmm_stripe_count:  1
      lmm_stripe_size:   1048576
      lmm_pattern:       1
      lmm_layout_gen:    0
      lmm_stripe_offset: 0
      lmm_objects:
      - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x2:0x0] }

    lcme_id:             2
    lcme_flags:          init
    lcme_extent.e_start: 4194304
    lcme_extent.e_end:   67108864
      lmm_stripe_count:  4
      lmm_stripe_size:   1048576
      lmm_pattern:       1
      lmm_layout_gen:    0
      lmm_stripe_offset: 1
      lmm_objects:
      - 0: { l_ost_idx: 1, l_fid: [0x100010000:0x2:0x0] }
      - 1: { l_ost_idx: 2, l_fid: [0x100020000:0x2:0x0] }
      - 2: { l_ost_idx: 3, l_fid: [0x100030000:0x2:0x0] }
      - 3: { l_ost_idx: 4, l_fid: [0x100040000:0x2:0x0] }

    lcme_id:             3
    lcme_flags:          init
    lcme_extent.e_start: 67108864
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  8
      lmm_stripe_size:   1048576
      lmm_pattern:       1
      lmm_layout_gen:    0
      lmm_stripe_offset: 4
      lmm_objects:
      - 0: { l_ost_idx: 4, l_fid: [0x100040000:0x3:0x0] }
      - 1: { l_ost_idx: 5, l_fid: [0x100050000:0x2:0x0] }
      - 2: { l_ost_idx: 6, l_fid: [0x100060000:0x2:0x0] }
      - 3: { l_ost_idx: 7, l_fid: [0x100070000:0x2:0x0] }
      - 4: { l_ost_idx: 0, l_fid: [0x100000000:0x3:0x0] }
      - 5: { l_ost_idx: 1, l_fid: [0x100010000:0x3:0x0] }
      - 6: { l_ost_idx: 2, l_fid: [0x100020000:0x3:0x0] }
      - 7: { l_ost_idx: 3, l_fid: [0x100030000:0x3:0x0] }

19.5.1.2. Add component(s) to an existing composite file

Command

lfs setstripe --component-add
[--component-end|-E end1] [STRIPE_OPTIONS]
[--component-end|-E end2] [STRIPE_OPTIONS] ... filename

The option --component-add is used to add components to an existing composite file. The extent start of the first component to be added must equal the extent end of the last component in the existing file, and all components to be added must be adjacent to each other.

Note

If the last existing component is specified by -E -1 or -E eof, which covers to the end of the file, it must be deleted before a new one is added.

Example

$ lfs setstripe -E 4M -c 1 -E 64M -c 4 /mnt/testfs/add_comp
$ lfs setstripe --component-add -E -1 -c 4 -o 6-7,0,5 \
/mnt/testfs/add_comp

This command adds a new component which starts from the end of the last existing component and extends to the end of file. The layout of this example is illustrated in Figure 19.3, “Example: add a component to an existing composite file”. The last component stripes across 4 OSTs in the sequence OST6, OST7, OST0 and OST5, and covers [64M, EOF).

Figure 19.3. Example: add a component to an existing composite file

Example: add a component to an existing composite file

The layout can be printed out by the following command:

$ lfs getstripe /mnt/testfs/add_comp
/mnt/testfs/add_comp
  lcm_layout_gen:  5
  lcm_entry_count: 3
    lcme_id:             1
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   4194304
      lmm_stripe_count:  1
      lmm_stripe_size:   1048576
      lmm_pattern:       1
      lmm_layout_gen:    0
      lmm_stripe_offset: 0
      lmm_objects:
      - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x2:0x0] }

    lcme_id:             2
    lcme_flags:          init
    lcme_extent.e_start: 4194304
    lcme_extent.e_end:   67108864
      lmm_stripe_count:  4
      lmm_stripe_size:   1048576
      lmm_pattern:       1
      lmm_layout_gen:    0
      lmm_stripe_offset: 1
      lmm_objects:
      - 0: { l_ost_idx: 1, l_fid: [0x100010000:0x2:0x0] }
      - 1: { l_ost_idx: 2, l_fid: [0x100020000:0x2:0x0] }
      - 2: { l_ost_idx: 3, l_fid: [0x100030000:0x2:0x0] }
      - 3: { l_ost_idx: 4, l_fid: [0x100040000:0x2:0x0] }

    lcme_id:             5
    lcme_flags:          0
    lcme_extent.e_start: 67108864
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  4
      lmm_stripe_size:   1048576
      lmm_pattern:       1
      lmm_layout_gen:    0
      lmm_stripe_offset: -1

The component ID "lcme_id" changes as layout generation changes. It is not necessarily sequential and does not imply ordering of individual components.

Note

Similar to specifying a full-file composite layout at file creation time, --component-add does not instantiate OST objects; instantiation is delayed until later write/truncate operations. For example, after writing beyond the 64MB start of the file's last component, the new component has had objects allocated:

$ lfs getstripe -I5 /mnt/testfs/add_comp
/mnt/testfs/add_comp
  lcm_layout_gen:  6
  lcm_entry_count: 3
    lcme_id:             5
    lcme_flags:          init
    lcme_extent.e_start: 67108864
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  4
      lmm_stripe_size:   1048576
      lmm_pattern:       1
      lmm_layout_gen:    0
      lmm_stripe_offset: 6
      lmm_objects:
      - 0: { l_ost_idx: 6, l_fid: [0x100060000:0x4:0x0] }
      - 1: { l_ost_idx: 7, l_fid: [0x100070000:0x4:0x0] }
      - 2: { l_ost_idx: 0, l_fid: [0x100000000:0x5:0x0] }
      - 3: { l_ost_idx: 5, l_fid: [0x100050000:0x4:0x0] }

19.5.1.3. Delete component(s) from an existing file

Command

lfs setstripe --component-del
[--component-id|-I comp_id | --component-flags comp_flags]
filename

The option --component-del is used to remove the component(s) specified by component ID or flags from an existing file. Any data stored in the deleted component will be lost after this operation.

The ID specified by the -I option is the unique numerical ID of the component, which can be obtained with the lfs getstripe -I command. The flag specified by the --component-flags option selects a certain type of component, and the flags can be obtained with the lfs getstripe --component-flags command. For now, there are only two flags, init and ^init, for instantiated and un-instantiated components respectively.

Note

Deletion must start with the last component because creation of a hole in the middle of a file layout is not allowed.

Example

$ lfs getstripe -I /mnt/testfs/del_comp
1
2
5
$ lfs setstripe --component-del -I 5 /mnt/testfs/del_comp

This example deletes the component with ID 5 from file /mnt/testfs/del_comp. If we still use the last example, the final result is illustrated in Figure 19.4, “Example: delete a component from an existing file”.

Figure 19.4. Example: delete a component from an existing file

Example: delete a component from an existing file

If you try to delete a non-last component, you will see the following error:

$ lfs setstripe --component-del -I 2 /mnt/testfs/del_comp
Delete component 0x2 from /mnt/testfs/del_comp failed. Invalid argument
error: setstripe: delete component of file '/mnt/testfs/del_comp' failed: Invalid argument

19.5.1.4. Set default PFL layout to an existing directory

Similar to creating a PFL file, you can set a default PFL layout on an existing directory. After that, all files created in that directory will inherit this layout by default.

Command

lfs setstripe
[--component-end|-E end1] [STRIPE_OPTIONS]
[--component-end|-E end2] [STRIPE_OPTIONS] ... dirname

Example

$ mkdir /mnt/testfs/pfldir
$ lfs setstripe -E 256M -c 1 -E 16G -c 4 -E -1 -S 4M -c -1 /mnt/testfs/pfldir

When you run lfs getstripe, you will see:

$ lfs getstripe /mnt/testfs/pfldir
/mnt/testfs/pfldir
  lcm_layout_gen:  0
  lcm_entry_count: 3
    lcme_id:             N/A
    lcme_flags:          0
    lcme_extent.e_start: 0
    lcme_extent.e_end:   268435456
      stripe_count:  1       stripe_size:   1048576       stripe_offset: -1
    lcme_id:             N/A
    lcme_flags:          0
    lcme_extent.e_start: 268435456
    lcme_extent.e_end:   17179869184
      stripe_count:  4       stripe_size:   1048576       stripe_offset: -1
    lcme_id:             N/A
    lcme_flags:          0
    lcme_extent.e_start: 17179869184
    lcme_extent.e_end:   EOF
      stripe_count:  -1       stripe_size:   4194304       stripe_offset: -1

If you create a file under /mnt/testfs/pfldir, the layout of that file will inherit the layout from its parent directory:

$ touch /mnt/testfs/pfldir/pflfile
$ lfs getstripe /mnt/testfs/pfldir/pflfile
/mnt/testfs/pfldir/pflfile
  lcm_layout_gen:  2
  lcm_entry_count: 3
    lcme_id:             1
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   268435456
      lmm_stripe_count:  1
      lmm_stripe_size:   1048576
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: 1
      lmm_objects:
      - 0: { l_ost_idx: 1, l_fid: [0x100010000:0xa:0x0] }

    lcme_id:             2
    lcme_flags:          0
    lcme_extent.e_start: 268435456
    lcme_extent.e_end:   17179869184
      lmm_stripe_count:  4
      lmm_stripe_size:   1048576
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: -1

    lcme_id:             3
    lcme_flags:          0
    lcme_extent.e_start: 17179869184
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  -1
      lmm_stripe_size:   4194304
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: -1

Note

lfs setstripe --component-add/del can't be run on a directory, because the default layout of a directory is like a configuration template, which can be arbitrarily changed by lfs setstripe, while the layout of a file may have data (OST objects) attached. If you want to delete the default layout of a directory, run lfs setstripe -d dirname to return the directory to the filesystem-wide defaults, like:

$ lfs setstripe -d /mnt/testfs/pfldir
$ lfs getstripe -d /mnt/testfs/pfldir
/mnt/testfs/pfldir
stripe_count:  1 stripe_size:   1048576 stripe_offset: -1
/mnt/testfs/pfldir/commonfile
lmm_stripe_count:  1
lmm_stripe_size:   1048576
lmm_pattern:       1
lmm_layout_gen:    0
lmm_stripe_offset: 0
	obdidx		 objid		 objid		 group
	     2	             9	          0x9	             0

19.5.2. lfs migrate

lfs migrate commands are used to re-layout the data in existing files using new layout parameters by copying the data from the existing OST(s) to the new OST(s).

Command

lfs migrate [--component-end|-E comp_end] [STRIPE_OPTIONS] ...
filename

The difference between migrate and setstripe is that migrate re-lays out the data in existing files, while setstripe creates new files with the specified layout.

Example

Case 1. Migrate a file with a normal layout to a composite layout

$ lfs setstripe -c 1 -S 128K /mnt/testfs/norm_to_2comp
$ dd if=/dev/urandom of=/mnt/testfs/norm_to_2comp bs=1M count=5
$ lfs getstripe /mnt/testfs/norm_to_2comp --yaml
/mnt/testfs/norm_to_2comp
lmm_stripe_count:  1
lmm_stripe_size:   131072
lmm_pattern:       1
lmm_layout_gen:    0
lmm_stripe_offset: 7
lmm_objects:
      - l_ost_idx: 7
        l_fid:     0x100070000:0x2:0x0
$ lfs migrate -E 1M -S 512K -c 1 -E -1 -S 1M -c 2 \
/mnt/testfs/norm_to_2comp

In this example, a 5MB size file with 1 stripe and 128K stripe size is migrated to a composite layout file with 2 components, illustrated in Figure 19.5, “Example: migrate normal to composite”.

Figure 19.5. Example: migrate normal to composite

Example: migrate normal to composite

The stripe information after migration is like:

$ lfs getstripe /mnt/testfs/norm_to_2comp
/mnt/testfs/norm_to_2comp
  lcm_layout_gen:  4
  lcm_entry_count: 2
    lcme_id:             1
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   1048576
      lmm_stripe_count:  1
      lmm_stripe_size:   524288
      lmm_pattern:       1
      lmm_layout_gen:    0
      lmm_stripe_offset: 0
      lmm_objects:
      - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x2:0x0] }

    lcme_id:             2
    lcme_flags:          init
    lcme_extent.e_start: 1048576
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  2
      lmm_stripe_size:   1048576
      lmm_pattern:       1
      lmm_layout_gen:    0
      lmm_stripe_offset: 2
      lmm_objects:
      - 0: { l_ost_idx: 2, l_fid: [0x100020000:0x2:0x0] }
      - 1: { l_ost_idx: 3, l_fid: [0x100030000:0x2:0x0] }

Case 2. Migrate a composite layout to another composite layout

$ lfs setstripe -E 1M -S 512K -c 1 -E -1 -S 1M -c 2 \
/mnt/testfs/2comp_to_3comp
$ dd if=/dev/urandom of=/mnt/testfs/2comp_to_3comp bs=1M count=5
$ lfs migrate -E 1M -S 1M -c 2 -E 4M -S 1M -c 2 -E -1 -S 3M -c 3 \
/mnt/testfs/2comp_to_3comp

In this example, a composite layout file with 2 components is migrated to a composite layout file with 3 components. If we still use the example in Case 1, the migration process is illustrated in Figure 19.6, “Example: migrate composite to composite”.

Figure 19.6. Example: migrate composite to composite

Example: migrate composite to composite

The stripe information is like:

$ lfs getstripe /mnt/testfs/2comp_to_3comp
/mnt/testfs/2comp_to_3comp
  lcm_layout_gen:  6
  lcm_entry_count: 3
    lcme_id:             1
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   1048576
      lmm_stripe_count:  2
      lmm_stripe_size:   1048576
      lmm_pattern:       1
      lmm_layout_gen:    0
      lmm_stripe_offset: 4
      lmm_objects:
      - 0: { l_ost_idx: 4, l_fid: [0x100040000:0x2:0x0] }
      - 1: { l_ost_idx: 5, l_fid: [0x100050000:0x2:0x0] }

    lcme_id:             2
    lcme_flags:          init
    lcme_extent.e_start: 1048576
    lcme_extent.e_end:   4194304
      lmm_stripe_count:  2
      lmm_stripe_size:   1048576
      lmm_pattern:       1
      lmm_layout_gen:    0
      lmm_stripe_offset: 6
      lmm_objects:
      - 0: { l_ost_idx: 6, l_fid: [0x100060000:0x2:0x0] }
      - 1: { l_ost_idx: 7, l_fid: [0x100070000:0x3:0x0] }

    lcme_id:             3
    lcme_flags:          init
    lcme_extent.e_start: 4194304
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  3
      lmm_stripe_size:   3145728
      lmm_pattern:       1
      lmm_layout_gen:    0
      lmm_stripe_offset: 0
      lmm_objects:
      - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x3:0x0] }
      - 1: { l_ost_idx: 1, l_fid: [0x100010000:0x2:0x0] }
      - 2: { l_ost_idx: 2, l_fid: [0x100020000:0x3:0x0] }

Case 3. Migrate a composite layout to a normal one

$ lfs setstripe -E 1M -S 1M -c 2 -E 4M -S 1M -c 2 -E -1 -S 3M -c 3 \
/mnt/testfs/3comp_to_norm
$ dd if=/dev/urandom of=/mnt/testfs/3comp_to_norm bs=1M count=5
$ lfs migrate -c 2 -S 2M /mnt/testfs/3comp_to_norm

In this example, a composite file with 3 components is migrated to a normal file with 2 stripes and a 2M stripe size. If we still use the example in Case 2, the migration process is illustrated in Figure 19.7, “Example: migrate composite to normal”.

Figure 19.7. Example: migrate composite to normal

Example: migrate composite to normal

The stripe information is like:

$ lfs getstripe /mnt/testfs/3comp_to_norm --yaml
/mnt/testfs/3comp_to_norm
lmm_stripe_count:  2
lmm_stripe_size:   2097152
lmm_pattern:       1
lmm_layout_gen:    7
lmm_stripe_offset: 4
lmm_objects:
      - l_ost_idx: 4
        l_fid:     0x100040000:0x3:0x0
      - l_ost_idx: 5
        l_fid:     0x100050000:0x3:0x0

19.5.3. lfs getstripe

lfs getstripe commands can be used to list the striping/component information for a given PFL file. Here, only those parameters new for PFL files are shown.

Command

lfs getstripe
[--component-id|-I [comp_id]]
[--component-flags [comp_flags]]
[--component-count]
[--component-start [+-][N][kMGTPE]]
[--component-end|-E [+-][N][kMGTPE]]
dirname|filename

Example

Suppose we already have a composite file /mnt/testfs/3comp, created by the following command:

$ lfs setstripe -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -i 4 \
/mnt/testfs/3comp

Then write some data to it:

$ dd if=/dev/zero of=/mnt/testfs/3comp bs=1M count=5

Case 1. List component ID and its related information

  • List all the components ID

    $ lfs getstripe -I /mnt/testfs/3comp
    1
    2
    3
  • List the detailed striping information of component ID=2

    $ lfs getstripe -I2 /mnt/testfs/3comp
    /mnt/testfs/3comp
      lcm_layout_gen:  4
      lcm_entry_count: 3
        lcme_id:             2
        lcme_flags:          init
        lcme_extent.e_start: 4194304
        lcme_extent.e_end:   67108864
          lmm_stripe_count:  4
          lmm_stripe_size:   1048576
          lmm_pattern:       1
          lmm_layout_gen:    0
          lmm_stripe_offset: 5
          lmm_objects:
          - 0: { l_ost_idx: 5, l_fid: [0x100050000:0x2:0x0] }
          - 1: { l_ost_idx: 6, l_fid: [0x100060000:0x2:0x0] }
          - 2: { l_ost_idx: 7, l_fid: [0x100070000:0x2:0x0] }
          - 3: { l_ost_idx: 0, l_fid: [0x100000000:0x2:0x0] }
  • List the stripe offset and stripe count of component ID=2

    $ lfs getstripe -I2 -i -c /mnt/testfs/3comp
          lmm_stripe_count:  4
          lmm_stripe_offset: 5

Case 2. List the component which contains the specified flag

  • List the flag of each component

    $ lfs getstripe --component-flags -I /mnt/testfs/3comp
        lcme_id:             1
        lcme_flags:          init
        lcme_id:             2
        lcme_flags:          init
        lcme_id:             3
        lcme_flags:          0
  • List component(s) which are not instantiated

    $ lfs getstripe --component-flags=^init /mnt/testfs/3comp
    /mnt/testfs/3comp
      lcm_layout_gen:  4
      lcm_entry_count: 3
        lcme_id:             3
        lcme_flags:          0
        lcme_extent.e_start: 67108864
        lcme_extent.e_end:   EOF
          lmm_stripe_count:  -1
          lmm_stripe_size:   1048576
          lmm_pattern:       1
          lmm_layout_gen:    4
          lmm_stripe_offset: 4

Case 3. List the total number of all the component(s)

  • List the total number of all the components

    $ lfs getstripe --component-count /mnt/testfs/3comp
    3

Case 4. List the component with the specified extent start or end positions

  • List the start position in bytes of each component

    $ lfs getstripe --component-start /mnt/testfs/3comp
    0
    4194304
    67108864
  • List the start position in bytes of component ID=3

    $ lfs getstripe --component-start -I3 /mnt/testfs/3comp
    67108864
  • List the component with start = 64M

    $ lfs getstripe --component-start=64M /mnt/testfs/3comp
    /mnt/testfs/3comp
      lcm_layout_gen:  4
      lcm_entry_count: 3
        lcme_id:             3
        lcme_flags:          0
        lcme_extent.e_start: 67108864
        lcme_extent.e_end:   EOF
          lmm_stripe_count:  -1
          lmm_stripe_size:   1048576
          lmm_pattern:       1
          lmm_layout_gen:    4
          lmm_stripe_offset: 4
  • List the component(s) with start > 5M

    $ lfs getstripe --component-start=+5M /mnt/testfs/3comp
    /mnt/testfs/3comp
      lcm_layout_gen:  4
      lcm_entry_count: 3
        lcme_id:             3
        lcme_flags:          0
        lcme_extent.e_start: 67108864
        lcme_extent.e_end:   EOF
          lmm_stripe_count:  -1
          lmm_stripe_size:   1048576
          lmm_pattern:       1
          lmm_layout_gen:    4
          lmm_stripe_offset: 4
  • List the component(s) with start < 5M

    $ lfs getstripe --component-start=-5M /mnt/testfs/3comp
    /mnt/testfs/3comp
      lcm_layout_gen:  4
      lcm_entry_count: 3
        lcme_id:             1
        lcme_flags:          init
        lcme_extent.e_start: 0
        lcme_extent.e_end:   4194304
          lmm_stripe_count:  1
          lmm_stripe_size:   1048576
          lmm_pattern:       1
          lmm_layout_gen:    0
          lmm_stripe_offset: 4
          lmm_objects:
          - 0: { l_ost_idx: 4, l_fid: [0x100040000:0x2:0x0] }
    
        lcme_id:             2
        lcme_flags:          init
        lcme_extent.e_start: 4194304
        lcme_extent.e_end:   67108864
          lmm_stripe_count:  4
          lmm_stripe_size:   1048576
          lmm_pattern:       1
          lmm_layout_gen:    0
          lmm_stripe_offset: 5
          lmm_objects:
          - 0: { l_ost_idx: 5, l_fid: [0x100050000:0x2:0x0] }
          - 1: { l_ost_idx: 6, l_fid: [0x100060000:0x2:0x0] }
          - 2: { l_ost_idx: 7, l_fid: [0x100070000:0x2:0x0] }
          - 3: { l_ost_idx: 0, l_fid: [0x100000000:0x2:0x0] }
  • List the component(s) with start > 3M and end < 70M

    $ lfs getstripe --component-start=+3M --component-end=-70M \
    /mnt/testfs/3comp
    /mnt/testfs/3comp
      lcm_layout_gen:  4
      lcm_entry_count: 3
        lcme_id:             2
        lcme_flags:          init
        lcme_extent.e_start: 4194304
        lcme_extent.e_end:   67108864
          lmm_stripe_count:  4
          lmm_stripe_size:   1048576
          lmm_pattern:       1
          lmm_layout_gen:    0
          lmm_stripe_offset: 5
          lmm_objects:
          - 0: { l_ost_idx: 5, l_fid: [0x100050000:0x2:0x0] }
          - 1: { l_ost_idx: 6, l_fid: [0x100060000:0x2:0x0] }
          - 2: { l_ost_idx: 7, l_fid: [0x100070000:0x2:0x0] }
          - 3: { l_ost_idx: 0, l_fid: [0x100000000:0x2:0x0] }

19.5.4. lfs find

lfs find commands can be used to search the directory tree rooted at the given directory or file name for files that match the given PFL component parameters. Here, only those parameters new for PFL files are shown. Their usage is similar to that of the lfs getstripe commands.

Command

lfs find directory|filename
[[!] --component-count [+-=]comp_cnt]
[[!] --component-start [+-=]N[kMGTPE]]
[[!] --component-end|-E [+-=]N[kMGTPE]]
[[!] --component-flags=comp_flags]

Note

If you use --component-xxx options, only the composite files will be searched; but if you use ! --component-xxx options, all the files will be searched.

Example

We use the following directory and composite files to show how lfs find works.

$ mkdir /mnt/testfs/testdir
$ lfs setstripe -E 1M -E 10M -E eof /mnt/testfs/testdir/3comp
$ lfs setstripe -E 4M -E 20M -E 30M -E eof /mnt/testfs/testdir/4comp
$ mkdir -p /mnt/testfs/testdir/dir_3comp
$ lfs setstripe -E 6M -E 30M -E eof /mnt/testfs/testdir/dir_3comp
$ lfs setstripe -E 8M -E eof /mnt/testfs/testdir/dir_3comp/2comp
$ lfs setstripe -c 1 /mnt/testfs/testdir/dir_3comp/commonfile

Case 1. Find the files that match the specified component count condition

Find the files under directory /mnt/testfs/testdir whose number of components is not equal to 3.

$ lfs find /mnt/testfs/testdir ! --component-count=3
/mnt/testfs/testdir
/mnt/testfs/testdir/4comp
/mnt/testfs/testdir/dir_3comp/2comp
/mnt/testfs/testdir/dir_3comp/commonfile

Case 2. Find the files/dirs that match the specified component start/end condition

Find the file(s) under directory /mnt/testfs/testdir with component start = 4M and end < 30M

$ lfs find /mnt/testfs/testdir --component-start=4M -E -30M
/mnt/testfs/testdir/4comp

Case 3. Find the files/dirs that match the specified component flag condition

Find the file(s) under directory /mnt/testfs/testdir whose component flags contain init

$ lfs find /mnt/testfs/testdir --component-flags=init
/mnt/testfs/testdir/3comp
/mnt/testfs/testdir/4comp
/mnt/testfs/testdir/dir_3comp/2comp

Note

Since lfs find uses "!" for negative searches, the ^init flag is not supported here.

Introduced in Lustre 2.13

19.6.  Self-Extending Layout (SEL)

The Lustre Self-Extending Layout (SEL) feature is an extension of the Section 19.5, “Progressive File Layout (PFL)” feature, which allows the MDS to change the defined PFL layout dynamically. With this feature, the MDS monitors the used space on OSTs and swaps the OSTs for the current file when they are low on space. This avoids ENOSPC problems for SEL files when applications are writing to them.

Whereas PFL delays the instantiation of some components until an IO operation occurs on this region, SEL allows splitting such non-instantiated components in two parts: an extendable component and an extension component. The extendable component is a regular PFL component, initially covering just a small part of the region. The extension (or SEL) component is a new component type which is always non-instantiated and unassigned, covering the other part of the region. When a write reaches this unassigned space and the client calls the MDS to have it instantiated, the MDS decides whether to grant additional space to the extendable component. The granted region moves from the head of the extension component to the tail of the extendable component, thus the extendable component grows and the SEL one is shortened. Therefore, it allows the file to continue on the same OSTs, or, in the case where space is low on one of the current OSTs, to modify the layout to switch to a new component on new OSTs. In particular, it lets IO automatically spill over to a large HDD OST pool once a small SSD OST pool is getting low on space.

The default extension policy modifies the layout in the following ways:

  1. Extension: continue on the same OST objects when not low on space on any of the OSTs of the current component; a particular extent is granted to the extendable component.

  2. Spill over: switch to next component OSTs only when not the last component and at least one of the current OSTs is low on space; the whole region of the SEL component moves to the next component and the SEL component is removed in its turn.

  3. Repeating: create a new component with the same layout but on free OSTs, used only for the last component when at least one of the current OSTs is low on space; a new component has the same layout but instantiated on different OSTs (from the same pool) which have enough space.

  4. Forced extension: when the OSTs of the last component are low on space but a repeating attempt detects that the other OSTs are also low on space, spillover is impossible and repeating makes no sense, so the file continues to extend on the current component OSTs.

  5. Each spill event increments the spill_hit counter, which can be accessed with: lctl get_param lod.*.POOLNAME.spill_hit

Note

The SEL feature does not require clients to understand the SEL format of already created files; only MDS support is needed, which was introduced in Lustre 2.13. However, old clients will have some limitations, as their Lustre tools do not support the SEL format.

19.6.1. lfs setstripe

The lfs setstripe command is used to create files with composite layouts, as well as add or delete components to or from an existing file. It is extended to support SEL components.

19.6.1.1. Create a SEL file

Command

lfs setstripe
[--component-end|-E end1] [STRIPE_OPTIONS] ... FILENAME

STRIPE OPTIONS:
--extension-size, --ext-size, -z <ext_size>

The -z option is added to specify the size of the region which is granted to the extendable component on each iteration. When declaring a component, this option turns the declared component into a pair of components: an extendable one and an extension one.

Example

The following command creates 2 pairs of extendable and extension components:

# lfs setstripe -E 1G -z 64M -E -1 -z 256M /mnt/lustre/file

Figure 19.8. Example: create a SEL file

Example: create a SEL file


Note

As usual, only the first PFL component is instantiated at the creation time, thus it is immediately extended to the extension size (64M for the first component), whereas the third component is left zero-length.

# lfs getstripe /mnt/lustre/file
/mnt/lustre/file
  lcm_layout_gen: 4
  lcm_mirror_count: 1
  lcm_entry_count: 4
    lcme_id: 1
    lcme_mirror_id: 0
    lcme_flags: init
    lcme_extent.e_start: 0
    lcme_extent.e_end: 67108864
      lmm_stripe_count: 1
      lmm_stripe_size: 1048576
      lmm_pattern: raid0
      lmm_layout_gen: 0
      lmm_stripe_offset: 0
      lmm_objects:
      - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x5:0x0] }

    lcme_id: 2
    lcme_mirror_id: 0
    lcme_flags: extension
    lcme_extent.e_start: 67108864
    lcme_extent.e_end: 1073741824
      lmm_stripe_count: 0
      lmm_extension_size: 67108864
      lmm_pattern: raid0
      lmm_layout_gen: 0
      lmm_stripe_offset: -1

    lcme_id: 3
    lcme_mirror_id: 0
    lcme_flags: 0
    lcme_extent.e_start: 1073741824
    lcme_extent.e_end: 1073741824
      lmm_stripe_count: 1
      lmm_stripe_size: 1048576
      lmm_pattern: raid0
      lmm_layout_gen: 0
      lmm_stripe_offset: -1

    lcme_id: 4
    lcme_mirror_id: 0
    lcme_flags: extension
    lcme_extent.e_start: 1073741824
    lcme_extent.e_end: EOF
      lmm_stripe_count: 0
      lmm_extension_size: 268435456
      lmm_pattern: raid0
      lmm_layout_gen: 0
      lmm_stripe_offset: -1

19.6.1.2. Create a SEL layout template

Similar to PFL, it is possible to set a SEL layout template on a directory. After that, all files created under it will inherit this layout by default.

# lfs setstripe -E 1G -z 64M -E -1 -z 256M /mnt/lustre/dir
# lfs getstripe /mnt/lustre/dir
/mnt/lustre/dir
  lcm_layout_gen:    0
  lcm_mirror_count:  1
  lcm_entry_count:   4
    lcme_id:             N/A
    lcme_mirror_id:      N/A
    lcme_flags:          0
    lcme_extent.e_start: 0
    lcme_extent.e_end:   67108864
      stripe_count:  1       stripe_size:   1048576       pattern:       raid0       stripe_offset: -1

    lcme_id:             N/A
    lcme_mirror_id:      N/A
    lcme_flags:          extension
    lcme_extent.e_start: 67108864
    lcme_extent.e_end:   1073741824
      stripe_count:  1       extension_size: 67108864       pattern:       raid0       stripe_offset: -1

    lcme_id:             N/A
    lcme_mirror_id:      N/A
    lcme_flags:          0
    lcme_extent.e_start: 1073741824
    lcme_extent.e_end:   1073741824
      stripe_count:  1       stripe_size:   1048576       pattern:       raid0       stripe_offset: -1

    lcme_id:             N/A
    lcme_mirror_id:      N/A
    lcme_flags:          extension
    lcme_extent.e_start: 1073741824
    lcme_extent.e_end:   EOF
      stripe_count:  1       extension_size: 268435456       pattern:       raid0       stripe_offset: -1
	

19.6.2. lfs getstripe

lfs getstripe commands can be used to list the striping/component information for a given SEL file. Here, only those parameters new for SEL files are shown.

Command

lfs getstripe
[--extension-size|--ext-size|-z] filename

The -z option is added to print the extension size in bytes. For composite files this is the extension size of the first extension component. If a particular component is identified by other options (--component-id, --component-start, etc.), the extension size of that component is printed.

Example 1: List a SEL component information

Suppose we already have a composite file /mnt/lustre/file, created by the following command:

# lfs setstripe -E 1G -z 64M -E -1 -z 256M /mnt/lustre/file

The 2nd component could be listed with the following command:

# lfs getstripe -I2 /mnt/lustre/file
/mnt/lustre/file
  lcm_layout_gen: 4
  lcm_mirror_count: 1
  lcm_entry_count: 4
    lcme_id: 2
    lcme_mirror_id: 0
    lcme_flags: extension
    lcme_extent.e_start: 67108864
    lcme_extent.e_end: 1073741824
      lmm_stripe_count: 0
      lmm_extension_size: 67108864
      lmm_pattern: raid0
      lmm_layout_gen: 0
      lmm_stripe_offset: -1
      

Note

As you can see, the SEL components are marked with the extension flag, and the lmm_extension_size field holds the specified extension size.

Example 2: List the extension size

Having the same file as in the above example, the extension size of the second component could be listed with:

# lfs getstripe -z -I2 /mnt/lustre/file
67108864

Example 3: Extension

Having the same file as in the above example, suppose there is a write which crosses the end of the first component (64M), and then another write which crosses the new end of the first component (128M) again; the layout changes as follows:

Figure 19.9. Example: an extension of a SEL file

Example: an extension of a SEL file

The layout can be printed out by the following command:

# lfs getstripe /mnt/lustre/file
/mnt/lustre/file
  lcm_layout_gen: 6
  lcm_mirror_count: 1
  lcm_entry_count: 4
    lcme_id: 1
    lcme_mirror_id: 0
    lcme_flags: init
    lcme_extent.e_start: 0
    lcme_extent.e_end: 201326592
      lmm_stripe_count: 1
      lmm_stripe_size: 1048576
      lmm_pattern: raid0
      lmm_layout_gen: 0
      lmm_stripe_offset: 0
      lmm_objects:
      - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x5:0x0] }

    lcme_id: 2
    lcme_mirror_id: 0
    lcme_flags: extension
    lcme_extent.e_start: 201326592
    lcme_extent.e_end: 1073741824
      lmm_stripe_count: 0
      lmm_extension_size: 67108864
      lmm_pattern: raid0
      lmm_layout_gen: 0
      lmm_stripe_offset: -1

    lcme_id: 3
    lcme_mirror_id: 0
    lcme_flags: 0
    lcme_extent.e_start: 1073741824
    lcme_extent.e_end: 1073741824
      lmm_stripe_count: 1
      lmm_stripe_size: 1048576
      lmm_pattern: raid0
      lmm_layout_gen: 0
      lmm_stripe_offset: -1

    lcme_id: 4
    lcme_mirror_id: 0
    lcme_flags: extension
    lcme_extent.e_start: 1073741824
    lcme_extent.e_end: EOF
      lmm_stripe_count: 0
      lmm_extension_size: 268435456
      lmm_pattern: raid0
      lmm_layout_gen: 0
      lmm_stripe_offset: -1

Example 4: Spillover

In the case where OST0 is low on space and an IO happens to a SEL component, a spillover happens: the full region of the SEL component is added to the next component. In the example above, the next layout modification will look like:

Figure 19.10. Example: a spillover in a SEL file

Example: a spillover in a SEL file

Note

Although the third component was originally [1G, 1G], because it is not instantiated it is not extended backward; instead, it is moved back to the start of the previous SEL component (192M) and extended by its extension size (256M) from that position, thus becoming [192M, 448M].

# lfs getstripe /mnt/lustre/file
/mnt/lustre/file
  lcm_layout_gen: 7
  lcm_mirror_count: 1
  lcm_entry_count: 3
    lcme_id: 1
    lcme_mirror_id: 0
    lcme_flags: init
    lcme_extent.e_start: 0
    lcme_extent.e_end: 201326592
      lmm_stripe_count: 1
      lmm_stripe_size: 1048576
      lmm_pattern: raid0
      lmm_layout_gen: 0
      lmm_stripe_offset: 0
      lmm_objects:
      - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x5:0x0] }

    lcme_id: 3
    lcme_mirror_id: 0
    lcme_flags: init
    lcme_extent.e_start: 201326592
    lcme_extent.e_end: 469762048
      lmm_stripe_count: 1
      lmm_stripe_size: 1048576
      lmm_pattern: raid0
      lmm_layout_gen: 0
      lmm_stripe_offset: 1
      lmm_objects:
      - 0: { l_ost_idx: 1, l_fid: [0x100010000:0x8:0x0] }

    lcme_id: 4
    lcme_mirror_id: 0
    lcme_flags: extension
    lcme_extent.e_start: 469762048
    lcme_extent.e_end: EOF
      lmm_stripe_count: 0
      lmm_extension_size: 268435456
      lmm_pattern: raid0
      lmm_layout_gen: 0
      lmm_stripe_offset: -1

Example 5: Repeating

Suppose that in the example above OST0 has regained enough free space but OST1 is low on space. The following write to the last SEL component then leads to a new component being allocated before the SEL component; it repeats the previous component layout but is instantiated on free OSTs:

Figure 19.11. Example: repeat a SEL component

Example: repeat a SEL component

# lfs getstripe /mnt/lustre/file
/mnt/lustre/file
  lcm_layout_gen: 9
  lcm_mirror_count: 1
  lcm_entry_count: 4
    lcme_id: 1
    lcme_mirror_id: 0
    lcme_flags: init
    lcme_extent.e_start: 0
    lcme_extent.e_end: 201326592
      lmm_stripe_count: 1
      lmm_stripe_size: 1048576
      lmm_pattern: raid0
      lmm_layout_gen: 0
      lmm_stripe_offset: 0
      lmm_objects:
      - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x5:0x0] }

    lcme_id: 3
    lcme_mirror_id: 0
    lcme_flags: init
    lcme_extent.e_start: 201326592
    lcme_extent.e_end: 469762048
      lmm_stripe_count: 1
      lmm_stripe_size: 1048576
      lmm_pattern: raid0
      lmm_layout_gen: 0
      lmm_stripe_offset: 1
      lmm_objects:
      - 0: { l_ost_idx: 1, l_fid: [0x100010000:0x8:0x0] }

    lcme_id: 8
    lcme_mirror_id: 0
    lcme_flags: init
    lcme_extent.e_start: 469762048
    lcme_extent.e_end: 738197504
      lmm_stripe_count: 1
      lmm_stripe_size: 1048576
      lmm_pattern: raid0
      lmm_layout_gen: 65535
      lmm_stripe_offset: 0
      lmm_objects:
      - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x6:0x0] }

    lcme_id: 4
    lcme_mirror_id: 0
    lcme_flags: extension
    lcme_extent.e_start: 738197504
    lcme_extent.e_end: EOF
      lmm_stripe_count: 0
      lmm_extension_size: 268435456
      lmm_pattern: raid0
      lmm_layout_gen: 0
      lmm_stripe_offset: -1

Example 6: Forced extension

Suppose that in the example above both OST0 and OST1 are low on space. The following write to the last SEL component then behaves as an extension, since there is no sense in repeating.

Figure 19.12. Example: forced extension in a SEL file

Example: forced extension in a SEL file.

# lfs getstripe /mnt/lustre/file
/mnt/lustre/file
  lcm_layout_gen: 11
  lcm_mirror_count: 1
  lcm_entry_count: 4
    lcme_id: 1
    lcme_mirror_id: 0
    lcme_flags: init
    lcme_extent.e_start: 0
    lcme_extent.e_end: 201326592
      lmm_stripe_count: 1
      lmm_stripe_size: 1048576
      lmm_pattern: raid0
      lmm_layout_gen: 0
      lmm_stripe_offset: 0
      lmm_objects:
      - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x5:0x0] }

    lcme_id: 3
    lcme_mirror_id: 0
    lcme_flags: init
    lcme_extent.e_start: 201326592
    lcme_extent.e_end: 469762048
      lmm_stripe_count: 1
      lmm_stripe_size: 1048576
      lmm_pattern: raid0
      lmm_layout_gen: 0
      lmm_stripe_offset: 1
      lmm_objects:
      - 0: { l_ost_idx: 1, l_fid: [0x100010000:0x8:0x0] }

    lcme_id: 8
    lcme_mirror_id: 0
    lcme_flags: init
    lcme_extent.e_start: 469762048
    lcme_extent.e_end: 1006632960
      lmm_stripe_count: 1
      lmm_stripe_size: 1048576
      lmm_pattern: raid0
      lmm_layout_gen: 65535
      lmm_stripe_offset: 0
      lmm_objects:
      - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x6:0x0] }

    lcme_id: 4
    lcme_mirror_id: 0
    lcme_flags: extension
    lcme_extent.e_start: 1006632960
    lcme_extent.e_end: EOF
      lmm_stripe_count: 0
      lmm_extension_size: 268435456
      lmm_pattern: raid0
      lmm_layout_gen: 0
      lmm_stripe_offset: -1

19.6.3. lfs find

lfs find commands can be used to search for files that match the given SEL component parameters. Here, only those parameters new for SEL files are shown.

lfs find
[[!] --extension-size|--ext-size|-z [+-]ext-size[KMG]]
[[!] --component-flags=extension]

The -z option is added to specify the extension size to search for. Files which have any component with an extension size matching the given criteria are printed. As usual, the + and - signs can be used to match extension sizes greater than or less than the given size, respectively.

A new extension component flag is added. Only files which have at least one SEL component are printed.

Note

The negative search for flags searches the files which have a non-SEL component (not files which do not have any SEL component).

Example

# lfs setstripe --extension-size 64M -c 1 -E -1 /mnt/lustre/file

# lfs find --comp-flags extension /mnt/lustre/*
/mnt/lustre/file

# lfs find ! --comp-flags extension /mnt/lustre/*
/mnt/lustre/file

# lfs find -z 64M /mnt/lustre/*
/mnt/lustre/file

# lfs find -z +64M /mnt/lustre/*

# lfs find -z -64M /mnt/lustre/*

# lfs find -z +63M /mnt/lustre/*
/mnt/lustre/file

# lfs find -z -65M /mnt/lustre/*
/mnt/lustre/file

# lfs find -z 65M /mnt/lustre/*

# lfs find ! -z 64M /mnt/lustre/*

# lfs find ! -z +64M /mnt/lustre/*
/mnt/lustre/file

# lfs find ! -z -64M /mnt/lustre/*
/mnt/lustre/file

# lfs find ! -z +63M /mnt/lustre/*

# lfs find ! -z -65M /mnt/lustre/*

# lfs find ! -z 65M /mnt/lustre/*
/mnt/lustre/file

Introduced in Lustre 2.13

19.7.  Foreign Layout

The Lustre Foreign Layout feature is an extension of both the LOV and LMV formats which allows the creation of empty files and directories with the necessary specifications to point to corresponding objects outside of the Lustre namespace.

The new LOV/LMV foreign internal format can be represented as:

Figure 19.13. LOV/LMV foreign format

LOV/LMV foreign format

19.7.1. lfs set[dir]stripe

The lfs set[dir]stripe commands are used to create files or directories with foreign layouts, by calling the corresponding API, itself invoking the appropriate ioctl().

19.7.1.1. Create a Foreign file/dir

Command

lfs set[dir]stripe \
--foreign[=<foreign_type>] --xattr|-x <layout_string> \
[--flags <hex_bitmask>] [--mode <mode_bits>] \
{file,dir}name

Both the --foreign and --xattr|-x options are mandatory. The <foreign_type> argument (default "none", meaning no special behavior) and the --flags and --mode (default 0666) options are optional.

Example

The following command creates a foreign file of "none" type and with "foo@bar" LOV content and specific mode and flags:

# lfs setstripe --foreign=none --flags=0xda08 --mode=0640 \
--xattr=foo@bar /mnt/lustre/file

Figure 19.14. Example: create a foreign file

Example: create a foreign file


19.7.2. lfs get[dir]stripe

lfs get[dir]stripe commands can be used to retrieve foreign LOV/LMV information and content.

Command

lfs get[dir]stripe [-v] filename

List foreign layout information

Suppose we already have a foreign file /mnt/lustre/file, created by the following command:

# lfs setstripe --foreign=none --flags=0xda08 --mode=0640 \
--xattr=foo@bar /mnt/lustre/file

The full foreign layout information can be listed using the following command:

# lfs getstripe -v /mnt/lustre/file
/mnt/lustre/file
  lfm_magic: 0x0BD70BD0
  lfm_length: 7
  lfm_type: none
  lfm_flags: 0x0000DA08
  lfm_value: foo@bar
      

Note

As shown above, the lfm_length field value is the number of characters in the variable-length lfm_value field.

19.7.3. lfs find

lfs find commands can be used to search for all foreign files/directories, or only those that match the given selection parameters.

lfs find
[[!] --foreign[=<foreign_type>]]

The --foreign[=<foreign_type>] option has been added to retrieve all files and/or directories with a foreign layout (or, when negated with !, those without one), optionally restricted to (or, when negated, excluding) the given <foreign_type>.

Example

# lfs setstripe --foreign=none --xattr=foo@bar /mnt/lustre/file
# touch /mnt/lustre/file2

# lfs find --foreign /mnt/lustre/*
/mnt/lustre/file

# lfs find ! --foreign /mnt/lustre/*
/mnt/lustre/file2

# lfs find --foreign=none /mnt/lustre/*
/mnt/lustre/file

19.8. Managing Free Space

To optimize file system performance, the MDT assigns file stripes to OSTs based on two allocation algorithms. The round-robin allocator gives preference to location (spreading out stripes across OSSs to increase network bandwidth utilization) and the weighted allocator gives preference to available space (balancing loads across OSTs). Threshold and weighting factors for these two algorithms can be adjusted by the user. The MDT reserves 0.1 percent of total OST space and 32 inodes for each OST. The MDT stops object allocation for an OST if its available space is less than the reserved amount or the OST has fewer than 32 free inodes. The MDT resumes object allocation when the available space is twice the reserved space and the OST has more than 64 free inodes. Note that clients can still append to existing files regardless of the object allocation state.

Introduced in Lustre 2.9

The reserved space for each OST can be adjusted by the user with the lctl set_param command. For example, the following command reserves 1GB of space on all OSTs:

lctl set_param -P osp.*.reserved_mb_low=1024
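
To verify the setting, the same parameter can be read back on the MDS with lctl get_param (a minimal check):

mds# lctl get_param osp.*.reserved_mb_low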

This section describes how to check available free space on disks and how free space is allocated. It then describes how to set the threshold and weighting factors for the allocation algorithms.

19.8.1. Checking File System Free Space

Free space is an important consideration in assigning file stripes. The lfs df command can be used to show available disk space on the mounted Lustre file system and space consumption per OST. If multiple Lustre file systems are mounted, a path may be specified, but is not required. Options to the lfs df command are shown below.

Option

Description

-h, --human-readable

Displays sizes in human readable format (for example: 1K, 234M, 5G) using base-2 (binary) values (i.e. 1G = 1024M).

-H, --si

Like -h, this displays counts in human readable format, but using base-10 (decimal) values (i.e. 1G = 1000M).

-i, --inodes

Lists inodes instead of block usage.

-l, --lazy

Do not attempt to contact any OST or MDT not currently connected to the client. This avoids blocking the lfs df output if a target is offline or unreachable, and only returns the space on OSTs that can currently be accessed.

-p, --pool

Limit the report to OSTs that are in the specified pool. If multiple Lustre file systems are mounted, the OSTs in pool are listed for each file system; the display can be limited to a pool in a specific file system by giving fsname.pool. Specifying both fsname and pool is equivalent to providing a specific mountpoint.

-v, --verbose

Display verbose status of MDTs and OSTs. This may include one or more optional flags at the end of each line.

lfs df may also report additional target status as the last column in the display, if there are issues with that target. Target states include:

  • D: OST/MDT is Degraded. The target has a failed drive in the RAID device, or is undergoing RAID reconstruction. This state is marked on the server automatically for ZFS targets via zed, or a (user-supplied) script that monitors the target device and sets "lctl set_param obdfilter.target.degraded=1" on the OST. This target will be avoided for new allocations, but will still be used to read existing files located there or if there are not enough non-degraded OSTs to make up a widely-striped file.

  • R: OST/MDT is Read-only. The target filesystem is marked read-only due to filesystem corruption detected by ldiskfs or ZFS. No modifications are allowed on this OST, and it needs to be unmounted and e2fsck or zpool scrub run to repair the underlying filesystem.

  • N: OST/MDT is No-precreate. The target is configured to deny object precreation, either with the "lctl set_param obdfilter.target.no_precreate=1" parameter or the "-o no_precreate" mount option. This may be done to add an OST to the filesystem without allowing objects to be allocated on it yet, or for other reasons.

  • S: OST/MDT is out of Space. The target filesystem has less than the minimum required free space and will not be used for new object allocations until it has more free space.

  • I: OST/MDT is out of Inodes. The target filesystem has less than the minimum required free inodes and will not be used for new object allocations until it has more free inodes.

  • f: OST/MDT is on flash. The target filesystem is using a flash (non-rotational) storage device. This is normally detected from the underlying Linux block device, but can be set manually with "lctl set_param osd-*.*.nonrotational=1" on the respective OSTs. This lower-case status is only shown in conjunction with the -v option, since it is not an error condition.

Note

The df -i and lfs df -i commands show the minimum number of inodes that can be created in the file system at the current time. If the total number of objects available across all of the OSTs is smaller than those available on the MDT(s), taking into account the default file striping, then df -i will also report a smaller number of inodes than could be created. Running lfs df -i will report the actual number of inodes that are free on each target.

For ZFS file systems, the number of inodes that can be created is dynamic and depends on the free space in the file system. The Free and Total inode counts reported for a ZFS file system are only an estimate based on the current usage for each target. The Used inode count is the actual number of inodes used by the file system.

Examples

client$ lfs df
UUID                 1K-blocks       Used Available Use%  Mounted on
testfs-MDT0000_UUID    9174328    1020024   8154304  11%  /mnt/lustre[MDT:0]
testfs-OST0000_UUID   94181368   56330708  37850660  59%  /mnt/lustre[OST:0]
testfs-OST0001_UUID   94181368   56385748  37795620  59%  /mnt/lustre[OST:1]
testfs-OST0002_UUID   94181368   54352012  39829356  57%  /mnt/lustre[OST:2]
filesystem summary:  282544104  167068468 115475636  59%  /mnt/lustre

[client1] $ lfs df -hv
UUID                    bytes        Used Available Use%  Mounted on
testfs-MDT0000_UUID      8.7G      996.1M      7.8G  11%  /mnt/lustre[MDT:0]
testfs-OST0000_UUID     89.8G       53.7G     36.1G  59%  /mnt/lustre[OST:0] f
testfs-OST0001_UUID     89.8G       53.8G     36.0G  59%  /mnt/lustre[OST:1] f
testfs-OST0002_UUID     89.8G       51.8G     38.0G  57%  /mnt/lustre[OST:2] f
filesystem summary:    269.5G      159.3G    110.1G  59%  /mnt/lustre

[client1] $ lfs df -iH
UUID                   Inodes       IUsed    IFree IUse%  Mounted on
testfs-MDT0000_UUID     2.21M       41.9k     2.17M   1%  /mnt/lustre[MDT:0]
testfs-OST0000_UUID    737.3k       12.1k    725.1k   1%  /mnt/lustre[OST:0]
testfs-OST0001_UUID    737.3k       12.2k    725.0k   1%  /mnt/lustre[OST:1]
testfs-OST0002_UUID    737.3k       12.2k    725.0k   1%  /mnt/lustre[OST:2]
filesystem summary:     2.21M       41.9k     2.17M   1%  /mnt/lustre
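
If OST pools are configured, the free space report can also be limited to a single pool with the -p option described above. A minimal sketch, assuming a pool named flash has been created in the testfs file system:

client$ lfs df -h -p testfs.flash /mnt/lustre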

19.8.2.  Stripe Allocation Methods

Two stripe allocation methods are provided:

  • Round-robin allocator - When the OSTs have approximately the same amount of free space, the round-robin allocator alternates stripes between OSTs on different OSSs, so the OST used for stripe 0 of each file is evenly distributed among OSTs, regardless of the stripe count. In a simple example with eight OSTs numbered 0-7, objects would be allocated like this:

    File 1: OST1, OST2, OST3, OST4
    File 2: OST5, OST6, OST7
    File 3: OST0, OST1, OST2, OST3, OST4, OST5
    File 4: OST6, OST7, OST0

    Here are several more sample round-robin stripe orders (each letter represents a different OST on a single OSS):

    3: AAA            One 3-OST OSS
    3x3: ABABAB       Two 3-OST OSSs
    3x4: BBABABA      One 3-OST OSS (A) and one 4-OST OSS (B)
    3x5: BBABBABA     One 3-OST OSS (A) and one 5-OST OSS (B)
    3x3x3: ABCABCABC  Three 3-OST OSSs

  • Weighted allocator - When the free space difference between the OSTs becomes significant, the weighting algorithm is used to influence OST ordering based on size (amount of free space available on each OST) and location (stripes evenly distributed across OSTs). The weighted allocator fills the emptier OSTs faster, but uses a weighted random algorithm, so the OST with the most free space is not necessarily chosen each time.

The allocation method is determined by the amount of free-space imbalance on the OSTs. When free space is relatively balanced across OSTs, the faster round-robin allocator is used, which maximizes network balancing. The weighted allocator is used when any two OSTs are out of balance by more than the specified threshold (17% by default). The threshold between the two allocation methods is defined by the qos_threshold_rr parameter.

To temporarily set the qos_threshold_rr to 25, enter the following on each MDS:

mds# lctl set_param lod.fsname*.qos_threshold_rr=25
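
The current threshold can be checked on each MDS with lctl get_param, for example:

mds# lctl get_param lod.fsname*.qos_threshold_rr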

19.8.3. Adjusting the Weighting Between Free Space and Location

The weighting priority used by the weighted allocator is set by the qos_prio_free parameter. Increasing the value of qos_prio_free puts more weighting on the amount of free space available on each OST and less on how stripes are distributed across OSTs. The default value is 91 (percent). When the free space priority is set to 100 (percent), weighting is based entirely on free space and location is no longer used by the striping algorithm.

To permanently change the allocator weighting to 100, enter this command on the MGS:

lctl conf_param fsname-MDT0000-*.lod.qos_prio_free=100


Note

When qos_prio_free is set to 100, a weighted random algorithm is still used to assign stripes, so, for example, if OST2 has twice as much free space as OST1, OST2 is twice as likely to be used, but it is not guaranteed to be used.
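
The current weighting can likewise be read back on each MDS (a minimal check):

mds# lctl get_param lod.*.qos_prio_free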

19.9. Lustre Striping Internals

Individual files can only be striped over a finite number of OSTs, based on the maximum size of the attributes that can be stored on the MDT. If the MDT is ldiskfs-based without the ea_inode feature, a file can be striped across at most 160 OSTs. With ZFS-based MDTs, or if the ea_inode feature is enabled for an ldiskfs-based MDT, a file can be striped across up to 2000 OSTs.

Lustre inodes use an extended attribute to record on which OST each object is located, and the identifier of each object on that OST. The size of the extended attribute is a function of the number of stripes.

If using an ldiskfs-based MDT, the maximum number of OSTs over which files can be striped can be raised to 2000 by enabling the ea_inode feature on the MDT:

tune2fs -O ea_inode /dev/mdtdev
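
Whether the feature is already enabled on an existing MDT can be checked by dumping the ldiskfs superblock; a minimal sketch assuming the same example device path:

mds# dumpe2fs -h /dev/mdtdev | grep features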

Introduced in Lustre 2.13

Note

Since Lustre 2.13 the ea_inode feature is enabled by default on all newly formatted ldiskfs MDT filesystems.

Note

The maximum stripe count for a single file does not limit the maximum number of OSTs that are in the filesystem as a whole, only the maximum possible size and maximum aggregate bandwidth for the file.

Introduced in Lustre 2.11

Chapter 20. Data on MDT (DoM)

This chapter describes Data on MDT (DoM).

20.1.  Introduction to Data on MDT (DoM)

The Lustre Data on MDT (DoM) feature improves small file IO by placing small files directly on the MDT, and also improves large file IO by avoiding the OST being affected by small random IO that can cause device seeking and hurt the streaming IO performance. Therefore, users can expect more consistent performance for both small file IO and mixed IO patterns.

The layout of a DoM file is stored on disk as a composite layout and is a special case of Progressive File Layout (PFL). Please see Section 19.5, “Progressive File Layout(PFL)” for more information on PFL. For DoM files, the file layout is composed of a first component, which is placed on an MDT, and the remaining components, which are placed on OSTs, if needed. The first component is placed on the MDT in the MDT object data blocks. This component always has one stripe with a size equal to the component size. Such an MDT component can only be the first component in a composite layout. The remaining components are placed over OSTs as usual with a RAID0 layout. The OST components are not instantiated until a client writes or truncates the file beyond the size of the MDT component.

Note

When specifying a DoM layout, it might be assumed that the remaining layout will automatically go to the OSTs, but this is not the case. As with regular PFL layouts, if an EOF component is not present, then writes beyond the end of the last existing component will fail with the error ENODATA ("No data available"). For example, a DoM file created with a component end at 1 MiB will not be writable beyond 1 MiB:

$ lfs setstripe -E 1M -L mdt /mnt/testfs/domdir
$ dd if=/dev/zero of=/mnt/testfs/domdir/testfile bs=1M
dd: error writing '/mnt/testfs/domdir/testfile': No data available
2+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.00186441 s, 562 MB/s

To allow the file to grow beyond 1 MB, add one or more regular OST components with an EOF component at the end:

lfs setstripe -E 1M -L mdt -E 1G -c1 -E eof -c4 /mnt/testfs/domdir

20.2.  User Commands

Lustre provides the lfs setstripe command for users to create DoM files. Also, as usual, the lfs getstripe command can be used to list the striping/component information for a given file, while the lfs find command can be used to search the directory tree rooted at the given directory or file name for files that match the given DoM component parameters, e.g. the layout type.

20.2.1.  lfs setstripe for DoM files

The lfs setstripe command is used to create DoM files.

20.2.1.1. Command

lfs setstripe --component-end|-E end1 --layout|-L mdt \
        [--component-end|-E end2 [STRIPE_OPTIONS] ...] <filename>
              

The command above creates a file with a special composite layout, which defines the first component as an MDT component. The MDT component must start at offset 0 and end at end1. The end1 value is also the stripe size of this component, and is limited by the lod.*.dom_stripesize of the MDT the file is created on. No other options are required for this component. The remaining components use the normal syntax for composite file creation.

Note

If the next component doesn't specify striping, such as:

lfs setstripe -E 1M -L mdt -E EOF <filename>

Then that component gets its settings from the default filesystem striping.

20.2.1.2. Example

The command below creates a file with a DoM layout. The first component has an mdt layout and is placed on the MDT, covering [0, 1M). The second component covers [1M, EOF) and is striped over all available OSTs.

client$ lfs setstripe -E 1M -L mdt -E -1 -S 4M -c -1 \
          /mnt/lustre/domfile

The resulting layout is illustrated by Figure 20.1, “Resulting file layout”.

Figure 20.1. Resulting file layout

Resulting file layout

The resulting layout can also be checked with lfs getstripe as shown below:

client$ lfs getstripe /mnt/lustre/domfile
/mnt/lustre/domfile
  lcm_layout_gen:   2
  lcm_mirror_count: 1
  lcm_entry_count:  2
    lcme_id:             1
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   1048576
      lmm_stripe_count:  0
      lmm_stripe_size:   1048576
      lmm_pattern:       mdt
      lmm_layout_gen:    0
      lmm_stripe_offset: 0
      lmm_objects:
      
    lcme_id:             2
    lcme_flags:          0
    lcme_extent.e_start: 1048576
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  -1
      lmm_stripe_size:   4194304
      lmm_pattern:       raid0
      lmm_layout_gen:    65535
      lmm_stripe_offset: -1

The output above shows that the first component has a size of 1MB and the 'mdt' pattern. The second component is not yet instantiated, as indicated by lcme_flags: 0.

If more than 1MB of data is written to the file, then lfs getstripe output is changed accordingly:

client$ lfs getstripe /mnt/lustre/domfile
/mnt/lustre/domfile
  lcm_layout_gen:   3
  lcm_mirror_count: 1
  lcm_entry_count:  2
    lcme_id:             1
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   1048576
      lmm_stripe_count:  0
      lmm_stripe_size:   1048576
      lmm_pattern:       mdt
      lmm_layout_gen:    0
      lmm_stripe_offset: 2
      lmm_objects:
      
    lcme_id:             2
    lcme_flags:          init
    lcme_extent.e_start: 1048576
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  2
      lmm_stripe_size:   4194304
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: 0
      lmm_objects:
      - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x2:0x0] }
      - 1: { l_ost_idx: 1, l_fid: [0x100010000:0x2:0x0] }

The output above shows that the second component now has objects on OSTs with a 4MB stripe.

20.2.2. Setting a default DoM layout to an existing directory

A DoM layout can also be set on an existing directory. Once set, all files created in that directory afterwards will inherit this layout by default.

20.2.2.1. Command

lfs setstripe --component-end|-E end1 --layout|-L mdt \
[--component-end|-E end2 [STRIPE_OPTIONS] ...] <dirname>

20.2.2.2. Example

client$ mkdir /mnt/lustre/domdir
client$ touch /mnt/lustre/domdir/normfile
client$ lfs setstripe -E 1M -L mdt -E -1 /mnt/lustre/domdir/
client$ lfs getstripe -d /mnt/lustre/domdir
  lcm_layout_gen:   0
  lcm_mirror_count: 1
  lcm_entry_count:  2
    lcme_id:             N/A
    lcme_flags:          0
    lcme_extent.e_start: 0
    lcme_extent.e_end:   1048576
      stripe_count:  0    stripe_size:   1048576    \
      pattern:  mdt    stripe_offset:  -1
    
    lcme_id:             N/A
    lcme_flags:          0
    lcme_extent.e_start: 1048576
    lcme_extent.e_end:   EOF
      stripe_count:  1    stripe_size:   1048576    \
      pattern:  raid0    stripe_offset:  -1
              

In the output above, it can be seen that the directory has a default layout with a DoM component.

The following example will check layouts of files in that directory:

client$ touch /mnt/lustre/domdir/domfile
client$ lfs getstripe /mnt/lustre/domdir/normfile
/mnt/lustre/domdir/normfile
lmm_stripe_count:  2
lmm_stripe_size:   1048576
lmm_pattern:       raid0
lmm_layout_gen:    0
lmm_stripe_offset: 1
  obdidx   objid   objid   group
       1              3           0x3              0
       0              3           0x3              0

client$ lfs getstripe /mnt/lustre/domdir/domfile
/mnt/lustre/domdir/domfile
  lcm_layout_gen:   2
  lcm_mirror_count: 1
  lcm_entry_count:  2
    lcme_id:             1
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   1048576
      lmm_stripe_count:  0
      lmm_stripe_size:   1048576
      lmm_pattern:       mdt
      lmm_layout_gen:    0
      lmm_stripe_offset: 2
      lmm_objects:
      
    lcme_id:             2
    lcme_flags:          0
    lcme_extent.e_start: 1048576
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  1
      lmm_stripe_size:   1048576
      lmm_pattern:       raid0
      lmm_layout_gen:    65535
      lmm_stripe_offset: -1

We can see that the first file normfile in that directory has an ordinary layout, whereas the file domfile inherits the directory default layout and is a DoM file.

Note

The directory default layout setting will be inherited by new files even if the server DoM size limit is later set to a lower value.

20.2.3.  DoM Stripe Size Restrictions

The maximum size of a DoM component is restricted in several ways to protect the MDT from being eventually filled with large files.

20.2.3.1. LFS limits for DoM component size

lfs setstripe allows setting the component size for MDT layouts up to 1GB (this is a compile-time limit to avoid improper configuration); however, the size must also be a multiple of 64KB, the minimum stripe size in Lustre (see Table 5.2, “File and file system limits” Minimum stripe size). A limit can also be imposed on an individual file with lfs setstripe -E end, which may be smaller than the MDT-imposed limit if this is better for a particular usage.
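
As an illustration (the file path is hypothetical), the following component end is a multiple of 64KB and below both the lfs limit and the default 1MB server limit, so it is accepted:

client$ lfs setstripe -E 960K -L mdt -E eof -c -1 /mnt/lustre/dom960k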

20.2.3.2. MDT Server Limits

The lod.$fsname-MDTxxxx.dom_stripesize parameter is used to control the per-MDT maximum size for a DoM component. DoM components larger than this value will be truncated to the MDT-specified limit; the limit may be set differently on each MDT to balance DoM space usage across MDTs, if needed. It is 1MB by default and can be changed with the lctl tool. For more information on setting dom_stripesize please see Section 20.2.6, “ The dom_stripesize parameter”.

20.2.4.  lfs getstripe for DoM files

The lfs getstripe command is used to list the striping/component information for a given file. For DoM files, it can be used to check its layout and size.

20.2.4.1. Command

lfs getstripe [--component-id|-I [comp_id]] [--layout|-L] \
              [--stripe-size|-S] <dirname|filename>

20.2.4.2. Examples

client$ lfs getstripe -I1 /mnt/lustre/domfile
/mnt/lustre/domfile
  lcm_layout_gen:   3
  lcm_mirror_count: 1
  lcm_entry_count:  2
    lcme_id:             1
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   1048576
      lmm_stripe_count:  0
      lmm_stripe_size:   1048576
      lmm_pattern:       mdt
      lmm_layout_gen:    0
      lmm_stripe_offset: 2
      lmm_objects:

Short information about the layout and size of the DoM component can be obtained by using the -L option together with the -S or -E option:

client$ lfs getstripe -I1 -L -S /mnt/lustre/domfile
      lmm_stripe_size:   1048576
      lmm_pattern:       mdt
client$ lfs getstripe -I1 -L -E /mnt/lustre/domfile
    lcme_extent.e_end:   1048576
      lmm_pattern:       mdt

Both commands return the layout type and its size. For DoM files, the stripe size is equal to the component extent size, so either can be used to obtain the size of the data on the MDT.

20.2.5.  lfs find for DoM files

The lfs find command can be used to search the directory tree rooted at the given directory or file name for files that match the given parameters. The command below shows the new parameter for DoM files; its usage is similar to that of the lfs getstripe command.

20.2.5.1. Command

lfs find <directory|filename> [--layout|-L] [...]
              

20.2.5.2. Examples

Find all files with DoM layout under directory /mnt/lustre:

client$ lfs find -L mdt /mnt/lustre
/mnt/lustre/domfile
/mnt/lustre/domdir
/mnt/lustre/domdir/domfile
                          
client$ lfs find -L mdt -type f /mnt/lustre
/mnt/lustre/domfile
/mnt/lustre/domdir/domfile
                          
client$ lfs find -L mdt -type d /mnt/lustre
/mnt/lustre/domdir

By using this command you can find all DoM objects, only DoM files, or only directories with default DoM layout.

Find the DoM files/dirs with a particular stripe size:

client$ lfs find -L mdt -S -1200K -type f /mnt/lustre
/mnt/lustre/domfile
/mnt/lustre/domdir/domfile
                          
client$ lfs find -L mdt -S +200K -type f /mnt/lustre
/mnt/lustre/domfile
/mnt/lustre/domdir/domfile

The first command finds all DoM files with stripe size less than 1200KB. The second command above does the same for files with a stripe size greater than 200KB. In both cases, all DoM files are found because their DoM size is 1MB.

20.2.6.  The dom_stripesize parameter

The MDT controls the default maximum DoM size on the server via the dom_stripesize parameter of the LOD device. The dom_stripesize can be set differently for each MDT, if necessary. The default value of the parameter is 1MB and can be changed with the lctl tool.

20.2.6.1. Get Command

lctl get_param lod.*MDT<index>*.dom_stripesize
              

20.2.6.2. Get Examples

The commands below get the maximum allowed DoM size on the server. The final command is an attempt to create a file with a DoM component larger than the parameter setting, and correctly fails.

mds# lctl get_param lod.*MDT0000*.dom_stripesize
lod.lustre-MDT0000-mdtlov.dom_stripesize=1048576

mds# lctl get_param -n lod.*MDT0000*.dom_stripesize
1048576

client$ lfs setstripe -E 2M -L mdt /mnt/lustre/dom2mb
Create composite file /mnt/lustre/dom2mb failed. Invalid argument
error: setstripe: create composite file '/mnt/lustre/dom2mb' failed:
Invalid argument

20.2.6.3. Temporary Set Command

To temporarily set the value of the parameter, the lctl set_param command is used:

lctl set_param lod.*MDT<index>*.dom_stripesize=<value>
              

20.2.6.4. Temporary Set Examples

The example below changes the default DoM limit on the server to 64KB and then tries to create a file with a 1MB DoM component.

mds# lctl set_param -n lod.*MDT0000*.dom_stripesize=64K
mds# lctl get_param -n lod.*MDT0000*.dom_stripesize
65536

client$ lfs setstripe -E 1M -L mdt /mnt/lustre/dom
Create composite file /mnt/lustre/dom failed. Invalid argument
error: setstripe: create composite file '/mnt/lustre/dom' failed:
Invalid argument

20.2.6.5. Persistent Set Command

To persistently set the value of the parameter on a specific MDT, the lctl set_param -P command is used:

lctl set_param -P lod.fsname-MDTindex.dom_stripesize=value

A wildcard '*' can also be used for the index to apply the setting to all MDTs.

20.2.6.6. Persistent Set Examples

The new value of the parameter is saved in the MGS parameters log permanently:

mgs# lctl set_param -P lod.lustre-MDT0000.dom_stripesize=512K
mds# lctl get_param -n lod.*MDT0000*.dom_stripesize
524288

and is applied on the matching MDTs within a few seconds.

20.2.7.  Disable DoM

When lctl set_param (whether with -P or not) sets dom_stripesize to 0, DoM component creation will be disabled on the specified server(s), and any new layouts with a specified DoM component will have that component removed from the file layout. Existing files and layouts with DoM components on that MDT are not changed.
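
For example, the following sketch uses the persistent form shown above with a wildcard index to disable DoM component creation on all MDTs:

mgs# lctl set_param -P lod.*MDT*.dom_stripesize=0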

Note

DoM files can still be created in existing directories with a default DoM layout.

Introduced in Lustre 2.12

Chapter 21. Lazy Size on MDT (LSoM)

This chapter describes Lazy Size on MDT (LSoM).

21.1.  Introduction to Lazy Size on MDT (LSoM)

In the Lustre file system, MDSs store the ctime, mtime, owner, and other file attributes. The OSSs store the size and number of blocks used for each file. To obtain the correct file size, the client must contact each OST that the file is stored across, which means multiple RPCs to get the size and blocks for a file when a file is striped over multiple OSTs. The Lazy Size on MDT (LSoM) feature stores the file size on the MDS and avoids the need to fetch the file size from the OST(s) in cases where the application understands that the size may not be accurate. Lazy means there is no guarantee of the accuracy of the attributes stored on the MDS.

Since many Lustre installations use SSD for MDT storage, the motivation for the LSoM work is to speed up the time it takes to get the size of a file from the Lustre file system by storing that data on the MDTs. We expect this feature to be initially used by Lustre policy engines that scan the backend MDT storage, make decisions based on broad size categories, and do not depend on a totally accurate file size. Examples include Lester, Robinhood, Zester, and various vendor offerings. Future improvements will allow the LSoM data to be accessed by tools such as lfs find.

21.2. Enable LSoM

LSoM is always enabled and nothing needs to be done to enable the feature for fetching the LSoM data when scanning the MDT inodes with a policy engine. It is also possible to access the LSoM data on the client via the lfs getsom command. Because the LSoM data is currently accessed on the client via the xattr interface, the xattr_cache will cache the file size and block count on the client as long as the inode is cached. In most cases this is desirable, since it improves access to the LSoM data. However, it also means that the LSoM data may be stale if the file size is changed after the xattr is first accessed or if the xattr is accessed shortly after the file is first created.

If it is necessary to access up-to-date LSoM data that has gone stale, it is possible to flush the xattr cache from the client by cancelling the MDC locks via lctl set_param ldlm.namespaces.*mdc*.lru_size=clear. Otherwise, the file attributes will be dropped from the client cache if the file has not been accessed before the LDLM lock timeout. The timeout can be read via lctl get_param ldlm.namespaces.*mdc*.lru_max_age.

If repeated access to LSoM attributes is needed for files that are recently created or frequently modified from a specific client, such as an HSM agent node, it is possible to disable xattr caching on that client via: lctl set_param llite.*.xattr_cache=0. This may cause extra overhead when accessing files, and is not recommended for normal usage.
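
For example, to force a refresh of stale LSoM data on a client by clearing the cached MDC locks, and to check the current lock age limit (both commands are taken from the text above):

client$ lctl set_param ldlm.namespaces.*mdc*.lru_size=clear
client$ lctl get_param ldlm.namespaces.*mdc*.lru_max_age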

21.3. User Commands

Lustre provides the lfs getsom command to list file attributes that are stored on the MDT.

The llsom_sync command allows the user to sync the file attributes on the MDT with the valid/up-to-date data on the OSTs. llsom_sync is called on the client with the Lustre file system mount point. llsom_sync uses Lustre MDS changelogs and, thus, a changelog user must be registered to use this utility.

21.3.1. lfs getsom for LSoM data

The lfs getsom command lists file attributes that are stored on the MDT. lfs getsom is called with the full path and file name for a file on the Lustre file system. If no flags are used, then all file attributes stored on the MDS will be shown.

21.3.1.1. lfs getsom Command

lfs getsom [-s] [-b] [-f] <filename>

The various lfs getsom options are listed and described below.

Option

Description

-s

Only show the size value of the LSoM data for a given file. This is an optional flag.

-b

Only show the blocks value of the LSoM data for a given file. This is an optional flag.

-f

Only show the flag value of the LSoM data for a given file. This is an optional flag. Valid flags are:

SOM_FL_UNKNOWN = 0x0000 - Unknown or no SoM data; the size must be obtained from the OSTs.

SOM_FL_STRICT = 0x0001 - Known strictly correct; FLR file (SoM guaranteed).

SOM_FL_STALE = 0x0002 - Known stale - was correct at some point in the past, but is known (or likely) to be incorrect now (e.g. the file was opened for write).

SOM_FL_LAZY = 0x0004 - Approximate; may never have been strictly correct. The SoM data needs to be synced to achieve eventual consistency.
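
A minimal usage sketch (the file path is hypothetical); with no option all LSoM attributes stored on the MDT are shown, while -s restricts the output to the size value:

client$ lfs getsom /mnt/lustre/file1
client$ lfs getsom -s /mnt/lustre/file1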

21.3.2. Syncing LSoM data

The llsom_sync command allows the user to sync the file attributes on the MDT with the valid/up-to-date data on the OSTs. llsom_sync is called on the client with the client mount point for the Lustre file system. llsom_sync uses Lustre MDS changelogs and, thus, a changelog user must be registered to use this utility.

21.3.2.1. llsom_sync Command

llsom_sync --mdt|-m <mdt> --user|-u <user_id>
              [--daemonize|-d] [--verbose|-v] [--interval|-i] [--min-age|-a]
              [--max-cache|-c] [--sync|-s] <lustre_mount_point>

The various llsom_sync options are listed and described below.

Option

Description

--mdt | -m <mdt>

The metadata device for which the LSoM xattrs of files need to be synced. A changelog user must be registered for this device. Required flag.

--user | -u <user_id>

The changelog user id for the MDT device. Required flag.

--daemonize | -d

Optional flag to run the program in the background. In daemon mode, the utility will scan and process the changelog records and sync the LSoM xattr for files periodically.

--verbose | -v

Optional flag to produce verbose output.

--interval | -i

Optional flag specifying the time interval at which to scan the Lustre changelog and process the log records in daemon mode.

--min-age | -a

Optional flag specifying a minimum age: the llsom_sync tool will not try to sync the LSoM data for any file closed less than this many seconds ago. The default min-age value is 600s (10 minutes).

--max-cache | -c

Optional flag for the total memory used for the FID cache, which can be specified with a suffix [KkGgMm]. The default max-cache value is 256MB. If the parameter value is less than 100, it is taken as a percentage of the total memory to use for the FID cache rather than an absolute cache size.

--sync | -s

Optional flag to sync file data so that dirty data is flushed out of cache, ensuring that the blocks count is correct when the file LSoM xattr is updated. This option could hurt server performance significantly if thousands of fsync requests are sent.
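
A minimal usage sketch, assuming a changelog user (for example cl1) has already been registered for testfs-MDT0000 and the file system is mounted at /mnt/lustre on the client:

client# llsom_sync --mdt testfs-MDT0000 --user cl1 --verbose /mnt/lustre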

Introduced in Lustre 2.11

Chapter 22. File Level Redundancy (FLR)

This chapter describes File Level Redundancy (FLR).

22.1. Introduction

The Lustre file system was initially designed and implemented for HPC use. It has been working well on high-end storage that has internal redundancy and fault-tolerance. However, despite the expense and complexity of these storage systems, storage failures still occur, and before release 2.11, Lustre could not be more reliable than the individual storage and server components on which it was based. The Lustre file system had no mechanism to mitigate storage hardware failures and files would become inaccessible if a server was inaccessible or otherwise out of service.

With the File Level Redundancy (FLR) feature introduced in Lustre Release 2.11, any Lustre file can store the same data on multiple OSTs in order for the system to be robust in the event of storage failures or other outages. With the choice of multiple mirrors, the best suited mirror can be chosen to satisfy an individual request, which has a direct impact on IO availability. Furthermore, for files that are concurrently read by many clients (e.g. input decks, shared libraries, or executables) the aggregate parallel read performance of a single file can be improved by creating multiple mirrors of the file data.

The first phase of the FLR feature has been implemented with delayed write (Figure 22.1, “FLR Delayed Write”). While writing to a mirrored file, only one primary or preferred mirror will be updated directly during the write, while other mirrors will be simply marked as stale. The file can subsequently return to a mirrored state again by synchronizing among mirrors with command line tools (run by the user or administrator directly or via automated monitoring tools).

Figure 22.1. FLR Delayed Write

FLR Delayed Write Diagram

22.2. Operations

Lustre provides lfs mirror command line tools for users to operate on mirrored files or directories.

22.2.1. Creating a Mirrored File or Directory

Command:

lfs mirror create <--mirror-count|-N[mirror_count]
[setstripe_options|[--flags<=flags>]]> ... <filename|directory>

The above command will create a mirrored file or directory specified by filename or directory, respectively.

Option

Description
--mirror-count|-N[mirror_count]

Indicates the number of mirrors to be created with the following setstripe options. It can be repeated multiple times to separate mirrors that have different layouts.

The mirror_count argument is optional and defaults to 1 if it is not specified; if specified, it must follow the option without a space.

setstripe_options

Specifies a specific layout for the mirror. It can be a plain layout with a specific striping pattern or a composite layout, such as Section 19.5, “Progressive File Layout(PFL)”. The options are the same as those for the lfs setstripe command.

If setstripe_options are not specified, then the stripe options inherited from the previous component will be used. If there is no previous component, then the stripe_count and stripe_size options inherited from the filesystem-wide default values will be used, and the OST pool_name inherited from the parent directory will be used.

--flags<=flags>

Sets flags to the mirror to be created.

Only the prefer flag is supported at this time. This flag will be set on all components that belong to the corresponding mirror. The prefer flag gives a hint to Lustre about which mirrors should be used to serve I/O. When a mirrored file is being read, the component(s) with the prefer flag are likely to be picked to serve the read; and when a mirrored file is about to be written, the MDT will tend to choose the component with the prefer flag set and mark the other components with overlapping extents as stale. This flag is only a hint to Lustre, which means Lustre may still choose mirrors without this flag set, for instance, if all preferred mirrors are unavailable when the I/O occurs. This flag can be set on multiple components.

Note: This flag will be set on all components that belong to the corresponding mirror. The --comp-flags option also exists, which can set flags on individual components at mirror creation time.

Note: For redundancy and fault tolerance, users need to make sure that different mirrors are placed on different OSTs, or even on different OSSs and racks. An understanding of cluster topology is necessary to achieve this architecture. In the initial implementation, the existing OST pools mechanism allows separating OSTs by any arbitrary criteria, i.e. fault domain. In practice, users can take advantage of OST pools by grouping OSTs by topological information. Therefore, when creating a mirrored file, users can indicate which OST pools can be used by mirrors.

Examples:

The following command creates a mirrored file with 2 plain layout mirrors:

client# lfs mirror create -N -S 4M -c 2 -p flash \
                          -N -c -1 -p archive /mnt/testfs/file1

The following command displays the layout information of the mirrored file /mnt/testfs/file1:

client# lfs getstripe /mnt/testfs/file1
/mnt/testfs/file1
  lcm_layout_gen:    2
  lcm_mirror_count:  2
  lcm_entry_count:   2
    lcme_id:             65537
    lcme_mirror_id:      1
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  2
      lmm_stripe_size:   4194304
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: 1
      lmm_pool:          flash
      lmm_objects:
      - 0: { l_ost_idx: 1, l_fid: [0x100010000:0x2:0x0] }
      - 1: { l_ost_idx: 0, l_fid: [0x100000000:0x2:0x0] }

    lcme_id:             131074
    lcme_mirror_id:      2
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  6
      lmm_stripe_size:   4194304
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: 3
      lmm_pool:          archive
      lmm_objects:
      - 0: { l_ost_idx: 3, l_fid: [0x100030000:0x2:0x0] }
      - 1: { l_ost_idx: 4, l_fid: [0x100040000:0x2:0x0] }
      - 2: { l_ost_idx: 5, l_fid: [0x100050000:0x2:0x0] }
      - 3: { l_ost_idx: 6, l_fid: [0x100060000:0x2:0x0] }
      - 4: { l_ost_idx: 7, l_fid: [0x100070000:0x2:0x0] }
      - 5: { l_ost_idx: 2, l_fid: [0x100020000:0x2:0x0] }

The first mirror has 4MB stripe size and two stripes across OSTs in the flash OST pool. The second mirror has 4MB stripe size inherited from the first mirror, and stripes across all of the available OSTs in the archive OST pool.

As mentioned above, it is recommended to use the --pool|-p option (one of the lfs setstripe options) with OST pools configured with independent fault domains to ensure different mirrors will be placed on different OSTs, servers, and/or racks, thereby improving availability and performance. If the setstripe options are not specified, it is possible to create mirrors with objects on the same OST(s), which would remove most of the benefit of using replication.

In the layout information printed by lfs getstripe, lcme_mirror_id shows the mirror ID, which is the unique numerical identifier for a mirror, and lcme_flags shows the mirrored component flags. Valid flag names are:

  • init - indicates the mirrored component has been initialized (has allocated OST objects).

  • stale - indicates the mirrored component does not have up-to-date data. Stale components will not be used for read or write operations, and need to be resynchronized by running the lfs mirror resync command before they can be accessed again.

  • prefer - indicates the mirrored component is preferred for read or write. For example, the mirror is located on SSD-based OSTs, or is closer to the client on the network (fewer hops). This flag can be set by users at mirror creation time.

The following command creates a mirrored file with 3 PFL mirrors:

client# lfs mirror create -N -E 4M -p flash --flags=prefer -E eof -c 2 \
  -N -E 16M -S 8M -c 4 -p archive -E eof -c -1 \
  -N -E 32M -c 1 -p archive2 -E eof -c -1 /mnt/testfs/file2

The following command displays the layout information of the mirrored file /mnt/testfs/file2:

client# lfs getstripe /mnt/testfs/file2
/mnt/testfs/file2
  lcm_layout_gen:    6
  lcm_mirror_count:  3
  lcm_entry_count:   6
    lcme_id:             65537
    lcme_mirror_id:      1
    lcme_flags:          init,prefer
    lcme_extent.e_start: 0
    lcme_extent.e_end:   4194304
      lmm_stripe_count:  1
      lmm_stripe_size:   1048576
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: 1
      lmm_pool:          flash
      lmm_objects:
      - 0: { l_ost_idx: 1, l_fid: [0x100010000:0x3:0x0] }

    lcme_id:             65538
    lcme_mirror_id:      1
    lcme_flags:          prefer
    lcme_extent.e_start: 4194304
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  2
      lmm_stripe_size:   1048576
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: -1
      lmm_pool:          flash

    lcme_id:             131075
    lcme_mirror_id:      2
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   16777216
      lmm_stripe_count:  4
      lmm_stripe_size:   8388608
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: 4
      lmm_pool:          archive
      lmm_objects:
      - 0: { l_ost_idx: 4, l_fid: [0x100040000:0x3:0x0] }
      - 1: { l_ost_idx: 5, l_fid: [0x100050000:0x3:0x0] }
      - 2: { l_ost_idx: 6, l_fid: [0x100060000:0x3:0x0] }
      - 3: { l_ost_idx: 7, l_fid: [0x100070000:0x3:0x0] }

    lcme_id:             131076
    lcme_mirror_id:      2
    lcme_flags:          0
    lcme_extent.e_start: 16777216
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  6
      lmm_stripe_size:   8388608
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: -1
      lmm_pool:          archive

    lcme_id:             196613
    lcme_mirror_id:      3
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   33554432
      lmm_stripe_count:  1
      lmm_stripe_size:   8388608
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: 0
      lmm_pool:          archive2
      lmm_objects:
      - 0: { l_ost_idx: 8, l_fid: [0x3400000000:0x3:0x0] }

    lcme_id:             196614
    lcme_mirror_id:      3
    lcme_flags:          0
    lcme_extent.e_start: 33554432
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  -1
      lmm_stripe_size:   8388608
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: -1
      lmm_pool:          archive2
      

For the first mirror, the first component inherits the stripe count and stripe size from filesystem-wide default values. The second component inherits the stripe size and OST pool from the first component, and has two stripes. Both of the components are allocated from the flash OST pool. Also, the flag prefer is applied to all the components of the first mirror, which tells the client to read data from those components whenever they are available.

For the second mirror, the first component has an 8MB stripe size and 4 stripes across OSTs in the archive OST pool. The second component inherits the stripe size and OST pool from the first component, and stripes across all of the available OSTs in the archive OST pool.

For the third mirror, the first component inherits the stripe size of 8MB from the last component of the second mirror, and has one single stripe. The OST pool name is set to archive2. The second component inherits stripe size from the first component, and stripes across all of the available OSTs in that pool.

22.2.2. Extending a Mirrored File

Command:

lfs mirror extend [--no-verify] <--mirror-count|-N[mirror_count]
[setstripe_options|-f <victim_file>]> ... <filename>

The above command appends mirror(s) specified by setstripe options, or takes the layout from the existing file victim_file, and adds them to the file filename. The filename must be an existing file; however, it can be a mirrored or a regular non-mirrored file. If it is a non-mirrored file, the command will convert it to a mirrored file.

Option

Description
--mirror-count|-N[mirror_count]

Indicates the number of mirrors to be added with the following setstripe options. It can be repeated multiple times to separate mirrors that have different layouts.

The mirror_count argument is optional and defaults to 1 if it is not specified; if specified, it must follow the option without a space.

setstripe_options

Specifies a specific layout for the mirror. It can be a plain layout with specific striping pattern or a composite layout, such as Section 19.5, “Progressive File Layout(PFL)”. The options are the same as those for the lfs setstripe command.

If setstripe_options are not specified, then the stripe options inherited from the previous component will be used. If there is no previous component, then the stripe_count and stripe_size options inherited from filesystem-wide default values will be used, and the OST pool_name inherited from parent directory will be used.

-f <victim_file>

If victim_file exists, the command will split the layout from that file and use it as a mirror added to the mirrored file. After the command is finished, the victim_file will be removed.

Note: The setstripe_options cannot be specified together with the -f <victim_file> option on the same command line.

--no-verify

If victim_file is specified, the command verifies that the file contents of victim_file are the same as those of filename; otherwise, the command returns a failure. The --no-verify option can be used to skip this verification. It can save significant time on file comparison if the file size is large, but it should only be used when the file contents are known to be the same.

Note: The lfs mirror extend operation cannot be applied to a directory.

Examples:

The following commands create a non-mirrored file, convert it to a mirrored file, and extend it with a plain layout mirror:

# lfs setstripe -p flash /mnt/testfs/file1
# lfs getstripe /mnt/testfs/file1
/mnt/testfs/file1
lmm_stripe_count:  1
lmm_stripe_size:   1048576
lmm_pattern:       raid0
lmm_layout_gen:    0
lmm_stripe_offset: 0
lmm_pool:          flash
        obdidx           objid           objid           group
             0               4            0x4                0

# lfs mirror extend -N -S 8M -c -1 -p archive /mnt/testfs/file1
# lfs getstripe /mnt/testfs/file1
/mnt/testfs/file1
  lcm_layout_gen:    2
  lcm_mirror_count:  2
  lcm_entry_count:   2
    lcme_id:             65537
    lcme_mirror_id:      1
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  1
      lmm_stripe_size:   1048576
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: 0
      lmm_pool:          flash
      lmm_objects:
      - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x4:0x0] }

    lcme_id:             131073
    lcme_mirror_id:      2
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  6
      lmm_stripe_size:   8388608
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: 3
      lmm_pool:          archive
      lmm_objects:
      - 0: { l_ost_idx: 3, l_fid: [0x100030000:0x3:0x0] }
      - 1: { l_ost_idx: 4, l_fid: [0x100040000:0x4:0x0] }
      - 2: { l_ost_idx: 5, l_fid: [0x100050000:0x4:0x0] }
      - 3: { l_ost_idx: 6, l_fid: [0x100060000:0x4:0x0] }
      - 4: { l_ost_idx: 7, l_fid: [0x100070000:0x4:0x0] }
      - 5: { l_ost_idx: 2, l_fid: [0x100020000:0x3:0x0] }

The following commands split the PFL layout from a victim_file and use it as a mirror added to the mirrored file /mnt/testfs/file1 created in the above example without data verification:

# lfs setstripe -E 16M -c 2 -p none \
                -E eof -c -1 /mnt/testfs/victim_file
# lfs getstripe /mnt/testfs/victim_file
/mnt/testfs/victim_file
  lcm_layout_gen:    2
  lcm_mirror_count:  1
  lcm_entry_count:   2
    lcme_id:             1
    lcme_mirror_id:      0
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   16777216
      lmm_stripe_count:  2
      lmm_stripe_size:   1048576
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: 5
      lmm_objects:
      - 0: { l_ost_idx: 5, l_fid: [0x100050000:0x5:0x0] }
      - 1: { l_ost_idx: 6, l_fid: [0x100060000:0x5:0x0] }

    lcme_id:             2
    lcme_mirror_id:      0
    lcme_flags:          0
    lcme_extent.e_start: 16777216
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  -1
      lmm_stripe_size:   1048576
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: -1

# lfs mirror extend --no-verify -N -f /mnt/testfs/victim_file \
                    /mnt/testfs/file1
# lfs getstripe /mnt/testfs/file1
/mnt/testfs/file1
  lcm_layout_gen:    3
  lcm_mirror_count:  3
  lcm_entry_count:   4
    lcme_id:             65537
    lcme_mirror_id:      1
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  1
      lmm_stripe_size:   1048576
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: 0
      lmm_pool:          flash
      lmm_objects:
      - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x4:0x0] }

    lcme_id:             131073
    lcme_mirror_id:      2
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  6
      lmm_stripe_size:   8388608
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: 3
      lmm_pool:          archive
      lmm_objects:
      - 0: { l_ost_idx: 3, l_fid: [0x100030000:0x3:0x0] }
      - 1: { l_ost_idx: 4, l_fid: [0x100040000:0x4:0x0] }
      - 2: { l_ost_idx: 5, l_fid: [0x100050000:0x4:0x0] }
      - 3: { l_ost_idx: 6, l_fid: [0x100060000:0x4:0x0] }
      - 4: { l_ost_idx: 7, l_fid: [0x100070000:0x4:0x0] }
      - 5: { l_ost_idx: 2, l_fid: [0x100020000:0x3:0x0] }

    lcme_id:             196609
    lcme_mirror_id:      3
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   16777216
      lmm_stripe_count:  2
      lmm_stripe_size:   1048576
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: 5
      lmm_objects:
      - 0: { l_ost_idx: 5, l_fid: [0x100050000:0x5:0x0] }
      - 1: { l_ost_idx: 6, l_fid: [0x100060000:0x5:0x0] }

    lcme_id:             196610
    lcme_mirror_id:      3
    lcme_flags:          0
    lcme_extent.e_start: 16777216
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  -1
      lmm_stripe_size:   1048576
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: -1

After extending, the victim_file was removed:

# ls /mnt/testfs/victim_file
ls: cannot access /mnt/testfs/victim_file: No such file or directory

22.2.3. Splitting a Mirrored File

Command:

lfs mirror split <--mirror-id <mirror_id>>
[--destroy|-d] [-f <new_file>] <mirrored_file>

The above command will split a specified mirror with ID <mirror_id> out of an existing mirrored file specified by mirrored_file. By default, a new file named <mirrored_file>.mirror~<mirror_id> will be created with the layout of the split mirror. If the --destroy|-d option is specified, then the split mirror will be destroyed. If the -f <new_file> option is specified, then a file named new_file will be created with the layout of the split mirror. If mirrored_file has only one mirror existing after split, it will be converted to a regular non-mirrored file. If the original mirrored_file is not a mirrored file, then the command will return an error.

Option

Description

--mirror-id <mirror_id>

The unique numerical identifier for a mirror. The mirror ID is unique within a mirrored file and is automatically assigned at file creation or extension time. It can be fetched by the lfs getstripe command.

--destroy|-d

Indicates the split mirror will be destroyed.

-f <new_file>

Indicates a file named new_file will be created with the layout of the split mirror.

Examples:

The following commands create a mirrored file with 4 mirrors, then split 3 mirrors separately from the mirrored file.

Creating a mirrored file with 4 mirrors:

# lfs mirror create -N2 -E 4M -p flash -E eof -c -1 \
                    -N2 -S 8M -c 2 -p archive /mnt/testfs/file1
# lfs getstripe /mnt/testfs/file1
/mnt/testfs/file1
  lcm_layout_gen:    6
  lcm_mirror_count:  4
  lcm_entry_count:   6
    lcme_id:             65537
    lcme_mirror_id:      1
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   4194304
      lmm_stripe_count:  1
      lmm_stripe_size:   1048576
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: 1
      lmm_pool:          flash
      lmm_objects:
      - 0: { l_ost_idx: 1, l_fid: [0x100010000:0x4:0x0] }

    lcme_id:             65538
    lcme_mirror_id:      1
    lcme_flags:          0
    lcme_extent.e_start: 4194304
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  2
      lmm_stripe_size:   1048576
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: -1
      lmm_pool:          flash

    lcme_id:             131075
    lcme_mirror_id:      2
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   4194304
      lmm_stripe_count:  1
      lmm_stripe_size:   1048576
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: 0
      lmm_pool:          flash
      lmm_objects:
      - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x5:0x0] }

    lcme_id:             131076
    lcme_mirror_id:      2
    lcme_flags:          0
    lcme_extent.e_start: 4194304
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  2
      lmm_stripe_size:   1048576
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: -1
      lmm_pool:          flash

    lcme_id:             196613
    lcme_mirror_id:      3
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  2
      lmm_stripe_size:   8388608
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: 4
      lmm_pool:          archive
      lmm_objects:
      - 0: { l_ost_idx: 4, l_fid: [0x100040000:0x5:0x0] }
      - 1: { l_ost_idx: 5, l_fid: [0x100050000:0x6:0x0] }

    lcme_id:             262150
    lcme_mirror_id:      4
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  2
      lmm_stripe_size:   8388608
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: 7
      lmm_pool:          archive
      lmm_objects:
      - 0: { l_ost_idx: 7, l_fid: [0x100070000:0x5:0x0] }
      - 1: { l_ost_idx: 2, l_fid: [0x100020000:0x4:0x0] }

Splitting the mirror with ID 1 from /mnt/testfs/file1 and creating /mnt/testfs/file1.mirror~1 with the layout of the split mirror:

# lfs mirror split --mirror-id 1 /mnt/testfs/file1
# lfs getstripe /mnt/testfs/file1.mirror~1
/mnt/testfs/file1.mirror~1
  lcm_layout_gen:    1
  lcm_mirror_count:  1
  lcm_entry_count:   2
    lcme_id:             65537
    lcme_mirror_id:      1
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   4194304
      lmm_stripe_count:  1
      lmm_stripe_size:   1048576
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: 1
      lmm_pool:          flash
      lmm_objects:
      - 0: { l_ost_idx: 1, l_fid: [0x100010000:0x4:0x0] }

    lcme_id:             65538
    lcme_mirror_id:      1
    lcme_flags:          0
    lcme_extent.e_start: 4194304
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  2
      lmm_stripe_size:   1048576
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: -1
      lmm_pool:          flash

Splitting the mirror with ID 2 from /mnt/testfs/file1 and destroying it:

# lfs mirror split --mirror-id 2 -d /mnt/testfs/file1
# lfs getstripe /mnt/testfs/file1
/mnt/testfs/file1
  lcm_layout_gen:    8
  lcm_mirror_count:  2
  lcm_entry_count:   2
    lcme_id:             196613
    lcme_mirror_id:      3
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  2
      lmm_stripe_size:   8388608
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: 4
      lmm_pool:          archive
      lmm_objects:
      - 0: { l_ost_idx: 4, l_fid: [0x100040000:0x5:0x0] }
      - 1: { l_ost_idx: 5, l_fid: [0x100050000:0x6:0x0] }

    lcme_id:             262150
    lcme_mirror_id:      4
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  2
      lmm_stripe_size:   8388608
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: 7
      lmm_pool:          archive
      lmm_objects:
      - 0: { l_ost_idx: 7, l_fid: [0x100070000:0x5:0x0] }
      - 1: { l_ost_idx: 2, l_fid: [0x100020000:0x4:0x0] }

Splitting the mirror with ID 3 from /mnt/testfs/file1 and creating /mnt/testfs/file2 with the layout of the split mirror:

# lfs mirror split --mirror-id 3 -f /mnt/testfs/file2 \
                   /mnt/testfs/file1
# lfs getstripe /mnt/testfs/file2
/mnt/testfs/file2
  lcm_layout_gen:    1
  lcm_mirror_count:  1
  lcm_entry_count:   1
    lcme_id:             196613
    lcme_mirror_id:      3
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  2
      lmm_stripe_size:   8388608
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: 4
      lmm_pool:          archive
      lmm_objects:
      - 0: { l_ost_idx: 4, l_fid: [0x100040000:0x5:0x0] }
      - 1: { l_ost_idx: 5, l_fid: [0x100050000:0x6:0x0] }

# lfs getstripe /mnt/testfs/file1
/mnt/testfs/file1
  lcm_layout_gen:    9
  lcm_mirror_count:  1
  lcm_entry_count:   1
    lcme_id:             262150
    lcme_mirror_id:      4
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  2
      lmm_stripe_size:   8388608
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: 7
      lmm_pool:          archive
      lmm_objects:
      - 0: { l_ost_idx: 7, l_fid: [0x100070000:0x5:0x0] }
      - 1: { l_ost_idx: 2, l_fid: [0x100020000:0x4:0x0] }

The above layout information shows that the mirrors with IDs 1, 2, and 3 were all split from the mirrored file /mnt/testfs/file1.

22.2.4. Resynchronizing out-of-sync Mirrored File(s)

Command:

lfs mirror resync [--only <mirror_id[,...]>]
<mirrored_file> [<mirrored_file2>...]

The above command will resynchronize out-of-sync mirrored file(s) specified by mirrored_file. It supports specifying multiple mirrored files in one command line.

If there is no stale mirror for the specified mirrored file(s), then the command does nothing. Otherwise, it will copy data from a synced mirror to the stale mirror(s) and mark all successfully copied mirror(s) as SYNC. If the --only <mirror_id[,...]> option is specified, then the command will only resynchronize the mirror(s) specified by the mirror_id(s). This option cannot be used when multiple mirrored files are specified.

Options:

--only <mirror_id[,...]>

Indicates which mirror(s), specified by mirror_id(s), need to be resynchronized. The mirror_id is the unique numerical identifier for a mirror. Multiple mirror_ids are separated by commas. This option cannot be used when multiple mirrored files are specified.

Note: With delayed write implemented in FLR phase 1, after writing to a mirrored file, users need to run the lfs mirror resync command to get all mirrors synchronized.

Examples:

The following commands create a mirrored file with 3 mirrors, write some data into the file, and then resynchronize the stale mirrors.

Creating a mirrored file with 3 mirrors:

# lfs mirror create -N -E 4M -p flash -E eof \
                    -N2 -p archive /mnt/testfs/file1
# lfs getstripe /mnt/testfs/file1
/mnt/testfs/file1
  lcm_layout_gen:    4
  lcm_mirror_count:  3
  lcm_entry_count:   4
    lcme_id:             65537
    lcme_mirror_id:      1
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   4194304
      lmm_stripe_count:  1
      lmm_stripe_size:   1048576
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: 1
      lmm_pool:          flash
      lmm_objects:
      - 0: { l_ost_idx: 1, l_fid: [0x100010000:0x5:0x0] }

    lcme_id:             65538
    lcme_mirror_id:      1
    lcme_flags:          0
    lcme_extent.e_start: 4194304
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  1
      lmm_stripe_size:   1048576
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: -1
      lmm_pool:          flash

    lcme_id:             131075
    lcme_mirror_id:      2
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  1
      lmm_stripe_size:   1048576
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: 3
      lmm_pool:          archive
      lmm_objects:
      - 0: { l_ost_idx: 3, l_fid: [0x100030000:0x4:0x0] }

    lcme_id:             196612
    lcme_mirror_id:      3
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  1
      lmm_stripe_size:   1048576
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: 4
      lmm_pool:          archive
      lmm_objects:
      - 0: { l_ost_idx: 4, l_fid: [0x100040000:0x6:0x0] }

Writing some data into the mirrored file /mnt/testfs/file1:

# yes | dd of=/mnt/testfs/file1 bs=1M count=2
2+0 records in
2+0 records out
2097152 bytes (2.1 MB) copied, 0.0320613 s, 65.4 MB/s

# lfs getstripe /mnt/testfs/file1
/mnt/testfs/file1
  lcm_layout_gen:    5
  lcm_mirror_count:  3
  lcm_entry_count:   4
    lcme_id:             65537
    lcme_mirror_id:      1
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   4194304
    ......

    lcme_id:             65538
    lcme_mirror_id:      1
    lcme_flags:          0
    lcme_extent.e_start: 4194304
    lcme_extent.e_end:   EOF
    ......

    lcme_id:             131075
    lcme_mirror_id:      2
    lcme_flags:          init,stale
    lcme_extent.e_start: 0
    lcme_extent.e_end:   EOF
    ......

    lcme_id:             196612
    lcme_mirror_id:      3
    lcme_flags:          init,stale
    lcme_extent.e_start: 0
    lcme_extent.e_end:   EOF
    ......

The above layout information shows that data was written into the first component of the mirror with ID 1, and that the mirrors with IDs 2 and 3 were marked with the stale flag.

Resynchronizing the stale mirror with ID 2 for the mirrored file /mnt/testfs/file1:

# lfs mirror resync --only 2 /mnt/testfs/file1
# lfs getstripe /mnt/testfs/file1
/mnt/testfs/file1
  lcm_layout_gen:    7
  lcm_mirror_count:  3
  lcm_entry_count:   4
    lcme_id:             65537
    lcme_mirror_id:      1
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   4194304
    ......

    lcme_id:             65538
    lcme_mirror_id:      1
    lcme_flags:          0
    lcme_extent.e_start: 4194304
    lcme_extent.e_end:   EOF
    ......

    lcme_id:             131075
    lcme_mirror_id:      2
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   EOF
    ......

    lcme_id:             196612
    lcme_mirror_id:      3
    lcme_flags:          init,stale
    lcme_extent.e_start: 0
    lcme_extent.e_end:   EOF
    ......

The above layout information shows that, after resynchronizing, the stale flag was removed from the mirror with ID 2.

Resynchronizing all of the stale mirrors for the mirrored file /mnt/testfs/file1:

# lfs mirror resync /mnt/testfs/file1
# lfs getstripe /mnt/testfs/file1
/mnt/testfs/file1
  lcm_layout_gen:    9
  lcm_mirror_count:  3
  lcm_entry_count:   4
    lcme_id:             65537
    lcme_mirror_id:      1
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   4194304
    ......

    lcme_id:             65538
    lcme_mirror_id:      1
    lcme_flags:          0
    lcme_extent.e_start: 4194304
    lcme_extent.e_end:   EOF
    ......

    lcme_id:             131075
    lcme_mirror_id:      2
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   EOF
    ......

    lcme_id:             196612
    lcme_mirror_id:      3
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   EOF
    ......

The above layout information shows that, after resynchronizing, none of the mirrors are marked as stale.
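
Multiple mirrored files can also be resynchronized with a single command, as shown in the synopsis above. For example, assuming /mnt/testfs/file2 and /mnt/testfs/file3 are additional mirrored files with stale mirrors (these extra file names are illustrative only):

# lfs mirror resync /mnt/testfs/file1 /mnt/testfs/file2 /mnt/testfs/file3

Note that the --only option cannot be combined with this form, since it can only be used with a single mirrored file.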

22.2.5. Verifying Mirrored File(s)

Command:

lfs mirror verify [--only <mirror_id,mirror_id2[,...]>]
[--verbose|-v] <mirrored_file> [<mirrored_file2> ...]

The above command will verify that each SYNC mirror (i.e., each mirror that contains up-to-date data) of a mirrored file, specified by mirrored_file, contains exactly the same data. It supports specifying multiple mirrored files in one command line.

This is a scrub tool that should be run on a regular basis to make sure that mirrored files are not corrupted. The command will not repair a file if it turns out to be corrupted. Instead, an administrator should check the file content from each mirror, decide which one is correct, and then repair the file manually, for example with lfs mirror resync; a possible repair sequence is sketched after the examples below.

Options:

--only <mirror_id,mirror_id2[,...]>

Indicates which mirrors, specified by mirror_ids, need to be verified. The mirror_id is the unique numerical identifier for a mirror. Multiple mirror_ids are separated by commas.

Note: At least two mirror_ids are required. This option cannot be used when multiple mirrored files are specified.

--verbose|-v

Indicates that the command will print where the differences are if the data do not match. Otherwise, the command will just return an error. This option can be repeated multiple times to print more information.

Note: Mirror components that have the stale or offline flag will be skipped and not verified.

Examples:

The following command verifies that each mirror of a mirrored file contains exactly the same data:

# lfs mirror verify /mnt/testfs/file1

The following command has the -v option specified (repeated three times) to print where the differences are if the data do not match:

# lfs mirror verify -vvv /mnt/testfs/file2
Chunks to be verified in /mnt/testfs/file2:
[0, 0x200000)   [1, 2, 3, 4]    4
[0x200000, 0x400000)    [1, 2, 3, 4]    4
[0x400000, 0x600000)    [1, 2, 3, 4]    4
[0x600000, 0x800000)    [1, 2, 3, 4]    4
[0x800000, 0xa00000)    [1, 2, 3, 4]    4
[0xa00000, 0x1000000)   [1, 2, 3, 4]    4
[0x1000000, 0xffffffffffffffff) [1, 2, 3, 4]    4

Verifying chunk [0, 0x200000) on mirror: 1 2 3 4
CRC-32 checksum value for chunk [0, 0x200000):
Mirror 1:       0x207b02f1
Mirror 2:       0x207b02f1
Mirror 3:       0x207b02f1
Mirror 4:       0x207b02f1

Verifying chunk [0, 0x200000) on mirror: 1 2 3 4 PASS

Verifying chunk [0x200000, 0x400000) on mirror: 1 2 3 4
CRC-32 checksum value for chunk [0x200000, 0x400000):
Mirror 1:       0x207b02f1
Mirror 2:       0x207b02f1
Mirror 3:       0x207b02f1
Mirror 4:       0x207b02f1

Verifying chunk [0x200000, 0x400000) on mirror: 1 2 3 4 PASS

Verifying chunk [0x400000, 0x600000) on mirror: 1 2 3 4
CRC-32 checksum value for chunk [0x400000, 0x600000):
Mirror 1:       0x42571b66
Mirror 2:       0x42571b66
Mirror 3:       0x42571b66
Mirror 4:       0xabdaf92

lfs mirror verify: chunk [0x400000, 0x600000) has different
checksum value on mirror 1 and mirror 4.
Verifying chunk [0x600000, 0x800000) on mirror: 1 2 3 4
CRC-32 checksum value for chunk [0x600000, 0x800000):
Mirror 1:       0x1f8ad0d8
Mirror 2:       0x1f8ad0d8
Mirror 3:       0x1f8ad0d8
Mirror 4:       0x18975bf9

lfs mirror verify: chunk [0x600000, 0x800000) has different
checksum value on mirror 1 and mirror 4.
Verifying chunk [0x800000, 0xa00000) on mirror: 1 2 3 4
CRC-32 checksum value for chunk [0x800000, 0xa00000):
Mirror 1:       0x69c17478
Mirror 2:       0x69c17478
Mirror 3:       0x69c17478
Mirror 4:       0x69c17478

Verifying chunk [0x800000, 0xa00000) on mirror: 1 2 3 4 PASS

lfs mirror verify: '/mnt/testfs/file2' chunk [0xa00000, 0x1000000]
exceeds file size 0xa00000: skipped

The following command uses the --only option to only verify the specified mirrors:

# lfs mirror verify -v --only 1,4 /mnt/testfs/file2
CRC-32 checksum value for chunk [0, 0x200000):
Mirror 1:       0x207b02f1
Mirror 4:       0x207b02f1

CRC-32 checksum value for chunk [0x200000, 0x400000):
Mirror 1:       0x207b02f1
Mirror 4:       0x207b02f1

CRC-32 checksum value for chunk [0x400000, 0x600000):
Mirror 1:       0x42571b66
Mirror 4:       0xabdaf92

lfs mirror verify: chunk [0x400000, 0x600000) has different
checksum value on mirror 1 and mirror 4.
CRC-32 checksum value for chunk [0x600000, 0x800000):
Mirror 1:       0x1f8ad0d8
Mirror 4:       0x18975bf9

lfs mirror verify: chunk [0x600000, 0x800000) has different
checksum value on mirror 1 and mirror 4.
CRC-32 checksum value for chunk [0x800000, 0xa00000):
Mirror 1:       0x69c17478
Mirror 4:       0x69c17478

lfs mirror verify: '/mnt/testfs/file2' chunk [0xa00000, 0x1000000]
exceeds file size 0xa00000: skipped
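
The verbose output above indicates that mirror 4 of /mnt/testfs/file2 holds different data than mirrors 1, 2 and 3. As noted earlier, lfs mirror verify does not repair the file. The following commands are only a sketch of one possible manual repair, assuming the administrator has confirmed that mirrors 1, 2 and 3 contain the correct data and that the replacement mirror should be created in the archive pool (the mirror ID and pool name are illustrative and may differ on a real system):

# lfs mirror split --mirror-id 4 -d /mnt/testfs/file2
# lfs mirror extend -N -p archive /mnt/testfs/file2
# lfs mirror resync /mnt/testfs/file2

The first command destroys the corrupted mirror, the second adds a replacement mirror, and the final resync ensures that the new mirror is up to date.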

22.2.6. Finding Mirrored File(s)

The lfs find command is used to list files and directories with specific attributes. The following two attribute parameters are specific to a mirrored file or directory:

lfs find <directory|filename ...>
    [[!] --mirror-count|-N [+-]n]
    [[!] --mirror-state <[^]state>]

Options:

--mirror-count|-N [+-]n

Indicates the mirror count.

--mirror-state <[^]state>

Indicates the mirrored file state.

If ^state is used, print only files not matching state. Only one state can be specified.

Valid state names are:

ro - indicates the mirrored file is in a read-only state. All of the mirrors contain up-to-date data.

wp - indicates the mirrored file is in a state of being written.

sp - indicates the mirrored file is in a state of being resynchronized.

Note: Specifying ! before an option negates its meaning (files NOT matching the parameter). Using + before a numeric value means 'more than n', while - before a numeric value means 'less than n'. If neither is used, it means 'equal to n', within the bounds of the unit specified (if any).

Examples:

The following command recursively lists all mirrored files that have more than 2 mirrors under directory /mnt/testfs:

# lfs find --mirror-count +2 --type f /mnt/testfs

The following command recursively lists all out-of-sync mirrored files under directory /mnt/testfs:

# lfs find --mirror-state=^ro --type f /mnt/testfs
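
The following command recursively lists all mirrored files under directory /mnt/testfs that are in the read-only (fully synchronized) state, using the ro state name described above:

# lfs find --mirror-state=ro --type f /mnt/testfs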

22.3. Interoperability

Introduced in Lustre release 2.11.0, the FLR feature is based on the Progressive File Layout (PFL) feature (see Section 19.5, “Progressive File Layout (PFL)”) introduced in Lustre release 2.10.0.

Clients running Lustre release 2.9 and earlier do not understand the PFL layout, so they can neither access nor open mirrored files created in a Lustre 2.11 file system.

The following example shows the errors returned when accessing and opening a mirrored file (created in a Lustre 2.11 file system) on a Lustre 2.9 client:

# ls /mnt/testfs/mirrored_file
ls: cannot access /mnt/testfs/mirrored_file: Invalid argument

# cat /mnt/testfs/mirrored_file
cat: /mnt/testfs/mirrored_file: Operation not supported

Clients running Lustre release 2.10 understand the PFL layout but not mirrored layouts. They can access mirrored files created in a Lustre 2.11 file system, but they cannot open them. This is because Lustre 2.10 clients do not verify overlapping components, so they would read and write mirrored files just as if they were normal PFL files, which could leave mirrors that are nominally in sync containing different data.

The following example shows the results returned when accessing and opening a mirrored file (created in a Lustre 2.11 file system) on a Lustre 2.10 client:

# ls /mnt/testfs/mirrored_file
/mnt/testfs/mirrored_file

# cat /mnt/testfs/mirrored_file
cat: /mnt/testfs/mirrored_file: Operation not supported

Chapter 23. Managing the File System and I/O

23.1.  Handling Full OSTs

Sometimes a Lustre file system becomes unbalanced, often due to incorrectly specified stripe settings, or when very large files are created that are not striped over all of the OSTs. Lustre will automatically avoid allocating new files on OSTs that are full. If an OST is completely full and more data is written to files already located on that OST, an error occurs. The procedures below describe how to handle a full OST.

The MDS normally handles space balancing automatically at file creation time, so this procedure is usually not needed, but manual data migration may be desirable in some cases (e.g. when creating very large files that would consume more than the total free space of the full OSTs).

23.1.1.  Checking OST Space Usage

The example below shows an unbalanced file system:

client# lfs df -h
UUID                       bytes           Used            Available       \
Use%            Mounted on
testfs-MDT0000_UUID        4.4G            214.5M          3.9G            \
4%              /mnt/testfs[MDT:0]
testfs-OST0000_UUID        2.0G            751.3M          1.1G            \
37%             /mnt/testfs[OST:0]
testfs-OST0001_UUID        2.0G            755.3M          1.1G            \
37%             /mnt/testfs[OST:1]
testfs-OST0002_UUID        2.0G            1.7G            155.1M          \
86%             /mnt/testfs[OST:2] ****
testfs-OST0003_UUID        2.0G            751.3M          1.1G            \
37%             /mnt/testfs[OST:3]
testfs-OST0004_UUID        2.0G            747.3M          1.1G            \
37%             /mnt/testfs[OST:4]
testfs-OST0005_UUID        2.0G            743.3M          1.1G            \
36%             /mnt/testfs[OST:5]
 
filesystem summary:        11.8G           5.4G            5.8G            \
45%             /mnt/testfs

In this case, OST0002 is almost full, and when an attempt is made to write additional data to the file system (even with uniform striping over all of the OSTs), the write command fails as follows:

client# lfs setstripe -S 4M -i 0 -c -1 /mnt/testfs
client# dd if=/dev/zero of=/mnt/testfs/test_3 bs=10M count=100
dd: writing '/mnt/testfs/test_3': No space left on device
98+0 records in
97+0 records out
1017192448 bytes (1.0 GB) copied, 23.2411 seconds, 43.8 MB/s

23.1.2.  Disabling Creates on a Full OST

If the OST usage is imbalanced and one or more OSTs are close to being full while others have a lot of free space, the MDS will typically avoid file creation on the full OST(s) automatically, to avoid running out of space in the file system. The full OST(s) may optionally be deactivated manually on the MDS to ensure that the MDS will not allocate new objects there.

  1. Log into the MDS server and use the lctl command to stop new object creation on the full OST(s):

    mds# lctl set_param osp.fsname-OSTnnnn*.max_create_count=0
    

When new files are created in the file system, they will only use the remaining OSTs. Space usage can then be rebalanced either actively, by migrating data to other OSTs as shown in the next section, or passively, through normal file deletion and creation.
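
To confirm that object creation is disabled, the same parameter can be read back on the MDS with lctl get_param; the file system name and OST index below are illustrative:

mds# lctl get_param osp.testfs-OST0002*.max_create_count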

23.1.3.  Migrating Data within a File System

If there is a need to move file data from the current OST(s) to new OST(s), the data must be migrated (copied) to the new location. The simplest way to do this is to use the lfs_migrate command, as described in Section 14.8, “Adding a New OST to a Lustre File System”.
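
As an illustration only, files residing on the full OST from the earlier example could be located and migrated with a pipeline like the following; the size threshold is arbitrary, and the exact options should be checked against Section 14.8 and the lfs_migrate man page:

client# lfs find --ost testfs-OST0002_UUID -size +1G /mnt/testfs | lfs_migrate -y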

23.1.4.  Returning an Inactive OST Back Online

Once the full OST(s) are no longer severely imbalanced, due to either active or passive data redistribution, they should be reactivated so that new files will again be allocated on them:

mds# lctl set_param osp.testfs-OST0002*.max_create_count=20000

23.1.5. Migrating Metadata within a Filesystem

Introduced in Lustre 2.8

23.1.5.1. Whole Directory Migration

Lustre software version 2.8 includes a feature