Copyright © 2010, 2011 Oracle and/or its affiliates. (The original version of this Operations Manual without the Intel modifications.)
Copyright © 2011, 2017 Intel Corporation. (Intel modifications to the original version of this Operations Manual.)
Important Notice from Intel
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.
The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: https://www.intel.com/content/www/us/en/design/resource-design-center.html
Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. Lustre is a registered trademark of Oracle Corporation.
*Other names and brands may be claimed as the property of others.
THE ORIGINAL LUSTRE 2.x FILESYSTEM: OPERATIONS MANUAL HAS BEEN MODIFIED: THIS OPERATIONS MANUAL IS A MODIFIED VERSION OF, AND IS DERIVED FROM, THE LUSTRE 2.0 FILESYSTEM: OPERATIONS MANUAL PUBLISHED BY ORACLE AND AVAILABLE AT [http://www.lustre.org/]. MODIFICATIONS (collectively, the "Modifications") HAVE BEEN MADE BY INTEL CORPORATION ("Intel"). ORACLE AND ITS AFFILIATES HAVE NOT REVIEWED, APPROVED, SPONSORED, OR ENDORSED THIS MODIFIED OPERATIONS MANUAL, OR ENDORSED INTEL, AND ORACLE AND ITS AFFILIATES ARE NOT RESPONSIBLE OR LIABLE FOR ANY MODIFICATIONS THAT INTEL HAS MADE TO THE ORIGINAL OPERATIONS MANUAL.
NOTHING IN THIS MODIFIED OPERATIONS MANUAL IS INTENDED TO AFFECT THE NOTICE PROVIDED BY ORACLE BELOW IN RESPECT OF THE ORIGINAL OPERATIONS MANUAL AND SUCH ORACLE NOTICE CONTINUES TO APPLY TO THIS MODIFIED OPERATIONS MANUAL EXCEPT FOR THE MODIFICATIONS; THIS INTEL NOTICE SHALL APPLY ONLY TO MODIFICATIONS MADE BY INTEL. AS BETWEEN YOU AND ORACLE: (I) NOTHING IN THIS INTEL NOTICE IS INTENDED TO AFFECT THE TERMS OF THE ORACLE NOTICE BELOW; AND (II) IN THE EVENT OF ANY CONFLICT BETWEEN THE TERMS OF THIS INTEL NOTICE AND THE TERMS OF THE ORACLE NOTICE, THE ORACLE NOTICE SHALL PREVAIL.
Your use of any Intel software shall be governed by separate license terms containing restrictions on use and disclosure and are protected by intellectual property laws.
The information contained herein is subject to change without notice and is not warranted to be error-free. If you find any errors, please report them to us in writing.
This work is licensed under a Creative Commons Attribution-Share Alike 3.0 United States License. To view a copy of this license and obtain more information about Creative Commons licensing, visit Creative Commons Attribution-Share Alike 3.0 United States or send a letter to Creative Commons, 171 2nd Street, Suite 300, San Francisco, California 94105, USA.
Important Notice from Oracle
This software and related documentation are provided under a license agreement containing restrictions on use and disclosure and are protected by intellectual property laws. Except as expressly permitted in your license agreement or allowed by law, you may not use, copy, reproduce, translate, broadcast, modify, license, transmit, distribute, exhibit, perform, publish, or display any part, in any form, or by any means. Reverse engineering, disassembly, or decompilation of this software, unless required by law for interoperability, is prohibited.
The information contained herein is subject to change without notice and is not warranted to be error-free. If you find any errors, please report them to us in writing.
If this is software or related software documentation that is delivered to the U.S. Government or anyone licensing it on behalf of the U.S. Government, the following notice is applicable:
U.S. GOVERNMENT RIGHTS. Programs, software, databases, and related documentation and technical data delivered to U.S. Government customers are "commercial computer software" or "commercial technical data" pursuant to the applicable Federal Acquisition Regulation and agency-specific supplemental regulations. As such, the use, duplication, disclosure, modification, and adaptation shall be subject to the restrictions and license terms set forth in the applicable Government contract, and, to the extent applicable by the terms of the Government contract, the additional rights set forth in FAR 52.227-19, Commercial Computer Software License (December 2007). Oracle America, Inc., 500 Oracle Parkway, Redwood City, CA 94065.
This software or hardware is developed for general use in a variety of information management applications. It is not developed or intended for use in any inherently dangerous applications, including applications which may create a risk of personal injury. If you use this software or hardware in dangerous applications, then you shall be responsible to take all appropriate fail-safe, backup, redundancy, and other measures to ensure its safe use. Oracle Corporation and its affiliates disclaim any liability for any damages caused by use of this software or hardware in dangerous applications.
Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners.
AMD, Opteron, the AMD logo, and the AMD Opteron logo are trademarks or registered trademarks of Advanced Micro Devices. Intel and Intel Xeon are trademarks or registered trademarks of Intel Corporation. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. UNIX is a registered trademark licensed through X/Open Company, Ltd.
This software or hardware and documentation may provide access to or information on content, products, and services from third parties. Oracle Corporation and its affiliates are not responsible for and expressly disclaim all warranties of any kind with respect to third-party content, products, and services. Oracle Corporation and its affiliates will not be responsible for any loss, costs, or damages incurred due to your access to or use of third-party content, products, or services.
Copyright © 2011, Oracle et/ou ses affiliés. Tous droits réservés.
Ce logiciel et la documentation qui l'accompagne sont protégés par les lois sur la propriété intellectuelle. Ils sont concédés sous licence et soumis à des restrictions d'utilisation et de divulgation. Sauf disposition de votre contrat de licence ou de la loi, vous ne pouvez pas copier, reproduire, traduire, diffuser, modifier, breveter, transmettre, distribuer, exposer, exécuter, publier ou afficher le logiciel, même partiellement, sous quelque forme et par quelque procédé que ce soit. Par ailleurs, il est interdit de procéder à toute ingénierie inverse du logiciel, de le désassembler ou de le décompiler, excepté à des fins d'interopérabilité avec des logiciels tiers ou tel que prescrit par la loi.
Les informations fournies dans ce document sont susceptibles de modification sans préavis. Par ailleurs, Oracle Corporation ne garantit pas qu'elles soient exemptes d'erreurs et vous invite, le cas échéant, à lui en faire part par écrit.
Si ce logiciel, ou la documentation qui l'accompagne, est concédé sous licence au Gouvernement des Etats-Unis, ou à toute entité qui délivre la licence de ce logiciel ou l'utilise pour le compte du Gouvernement des Etats-Unis, la notice suivante s'applique :
U.S. GOVERNMENT RIGHTS. Programs, software, databases, and related documentation and technical data delivered to U.S. Government customers are "commercial computer software" or "commercial technical data" pursuant to the applicable Federal Acquisition Regulation and agency-specific supplemental regulations. As such, the use, duplication, disclosure, modification, and adaptation shall be subject to the restrictions and license terms set forth in the applicable Government contract, and, to the extent applicable by the terms of the Government contract, the additional rights set forth in FAR 52.227-19, Commercial Computer Software License (December 2007). Oracle America, Inc., 500 Oracle Parkway, Redwood City, CA 94065.
Ce logiciel ou matériel a été développé pour un usage général dans le cadre d'applications de gestion des informations. Ce logiciel ou matériel n'est pas conçu ni n'est destiné à être utilisé dans des applications à risque, notamment dans des applications pouvant causer des dommages corporels. Si vous utilisez ce logiciel ou matériel dans le cadre d'applications dangereuses, il est de votre responsabilité de prendre toutes les mesures de secours, de sauvegarde, de redondance et autres mesures nécessaires à son utilisation dans des conditions optimales de sécurité. Oracle Corporation et ses affiliés déclinent toute responsabilité quant aux dommages causés par l'utilisation de ce logiciel ou matériel pour ce type d'applications.
Oracle et Java sont des marques déposées d'Oracle Corporation et/ou de ses affiliés. Tout autre nom mentionné peut correspondre à des marques appartenant à d'autres propriétaires qu'Oracle.
AMD, Opteron, le logo AMD et le logo AMD Opteron sont des marques ou des marques déposées d'Advanced Micro Devices. Intel et Intel Xeon sont des marques ou des marques déposées d'Intel Corporation. Toutes les marques SPARC sont utilisées sous licence et sont des marques ou des marques déposées de SPARC International, Inc. UNIX est une marque déposée concédée sous licence par X/Open Company, Ltd.
Ce logiciel ou matériel et la documentation qui l'accompagne peuvent fournir des informations ou des liens donnant accès à des contenus, des produits et des services émanant de tiers. Oracle Corporation et ses affiliés déclinent toute responsabilité ou garantie expresse quant aux contenus, produits ou services émanant de tiers. En aucun cas, Oracle Corporation et ses affiliés ne sauraient être tenus pour responsables des pertes subies, des coûts occasionnés ou des dommages causés par l'accès à des contenus, produits ou services tiers, ou à leur utilisation.
This work is licensed under a Creative Commons Attribution-Share Alike 3.0 United States License. To view a copy of this license and obtain more information about Creative Commons licensing, visit Creative Commons Attribution-Share Alike 3.0 United States or send a letter to Creative Commons, 171 2nd Street, Suite 300, San Francisco, California 94105, USA.
Table of Contents
The Lustre* Software Release 2.x Operations Manual provides detailed information and procedures to install, configure and tune a Lustre file system. The manual covers topics such as failover, quotas, striping, and bonding. This manual also contains troubleshooting information and tips to improve the operation and performance of a Lustre file system.
This document is maintained by Whamcloud in Docbook format. The canonical version is available at https://wiki.whamcloud.com/display/PUB/Documentation .
This document does not contain information about basic UNIX* operating system commands and procedures such as shutting down the system, booting the system, and configuring devices. Refer to the following for this information:
Software documentation that you received with your system
Red Hat* Enterprise Linux* documentation, which is at: https://docs.redhat.com/docs/en-US/index.html
The Lustre client module is available for many different Linux* versions and distributions. The Red Hat Enterprise Linux distribution is the best supported and tested platform for Lustre servers.
The shell prompt used in the example text indicates whether a command can or should be executed by a regular user, or whether it requires superuser permission to run. Also, the machine type is often included in the prompt to indicate whether the command should be run on a client node, on an MDS node, an OSS node, or the MGS node.
Some examples are listed below, but other prompt combinations are also used as needed for the example.
| Shell | Prompt |
|---|---|
| Regular user | machine$ |
| Superuser (root) | machine# |
| Regular user on the client | client$ |
| Superuser on the MDS | mds# |
| Superuser on the OSS | oss# |
| Superuser on the MGS | mgs# |
| Application | Title | Format | Location |
|---|---|---|---|
| Latest information | Lustre Software Release 2.x Change Logs | Wiki page | Online at https://wiki.whamcloud.com/display/PUB/Documentation |
| Service | Lustre Software Release 2.x Operations Manual | HTML | Online at https://wiki.whamcloud.com/display/PUB/Documentation |
These web sites provide additional resources:
The Lustre* File System Release 2.x Operations Manual is a community maintained work. Versions of the manual are continually built as suggestions for changes and improvements arrive. Suggestions for improvements can be submitted through the ticketing system maintained at https://jira.whamcloud.com/browse/LUDOC. Instructions for providing a patch to the existing manual are available at: http://wiki.lustre.org/Lustre_Manual_Changes.
This manual covers a range of Lustre 2.x software releases, currently starting with the 2.5 release. Features specific to individual releases are identified within the table of contents using a shorthand notation (e.g. this paragraph is tagged as a Lustre 2.5 specific feature so that it will be updated when the 2.5-specific tagging is removed), and within the text using a distinct box.
The current version of Lustre in use on a node can be found using the command lctl get_param version on any Lustre client or server, for example:

$ lctl get_param version
version=2.10.5
Only the latest revision of this document is made readily available because changes are continually arriving. The current and latest revision of this manual is available from links maintained at: http://lustre.opensfs.org/documentation/.
Revision History: Revision 0, built on 03 December 2024 07:57:11Z. Continuous build of Manual.
Part I provides background information to help you understand the Lustre file system architecture and how the major components fit together. You will find information in this section about:
Table of Contents
This chapter describes the Lustre architecture and features of the Lustre file system. It includes the following sections:
The Lustre architecture is a storage architecture for clusters. The central component of the Lustre architecture is the Lustre file system, which is supported on the Linux operating system and provides a POSIX* standard-compliant UNIX file system interface.
The Lustre storage architecture is used for many different kinds of clusters. It is best known for powering many of the largest high-performance computing (HPC) clusters worldwide, with tens of thousands of client systems, petabytes (PB) of storage and hundreds of gigabytes per second (GB/sec) of I/O throughput. Many HPC sites use a Lustre file system as a site-wide global file system, serving dozens of clusters.
The ability of a Lustre file system to scale capacity and performance for any need reduces the need to deploy many separate file systems, such as one for each compute cluster. Storage management is simplified by avoiding the need to copy data between compute clusters. In addition to aggregating storage capacity of many servers, the I/O throughput is also aggregated and scales with additional servers. Moreover, throughput and/or capacity can be easily increased by adding servers dynamically.
While a Lustre file system can function in many work environments, it is not necessarily the best choice for all applications. It is best suited for uses that exceed the capacity that a single server can provide, though in some use cases, a Lustre file system can perform better with a single server than other file systems due to its strong locking and data coherency.
A Lustre file system is currently not particularly well suited for "peer-to-peer" usage models where clients and servers are running on the same node, each sharing a small amount of storage, due to the lack of data replication at the Lustre software level. In such uses, if one client/server fails, then the data stored on that node will not be accessible until the node is restarted.
Lustre file systems run on a variety of vendors' kernels. For more details, see the Lustre Test Matrix in Section 8.1, “Preparing to Install the Lustre Software”.
A Lustre installation can be scaled up or down with respect to the number of client nodes, disk storage and bandwidth. Scalability and performance are dependent on available disk and network bandwidth and the processing power of the servers in the system. A Lustre file system can be deployed in a wide variety of configurations that can be scaled well beyond the size and performance observed in production systems to date.
Table 1.1, “Lustre File System Scalability and Performance” shows some of the scalability and performance characteristics of a Lustre file system. For a full list of Lustre file and filesystem limits see Table 5.2, “File and file system limits”.
Table 1.1. Lustre File System Scalability and Performance
| Feature | Current Practical Range | Known Production Usage |
|---|---|---|
| Client Scalability | 100-100000 | 50000+ clients, many in the 10000 to 20000 range |
| Client Performance | Single client: I/O 90% of network bandwidth; Aggregate: 50 TB/sec I/O, 50M IOPS | Single client: 15 GB/sec I/O (HDR IB), 50000 IOPS; Aggregate: 10 TB/sec I/O, 10M IOPS |
| OSS Scalability | Single OSS: 1-32 OSTs per OSS; Single OST: 500M objects, 1024TiB per OST; OSS count: 1000 OSSs, 4000 OSTs | Single OSS: 4 OSTs per OSS; Single OST: 1024TiB OSTs; OSS count: 450 OSSs with 900 750TiB HDD OSTs + 450 25TiB NVMe OSTs; 1024 OSSs with 1024 72TiB OSTs |
| OSS Performance | Single OSS: 15 GB/sec, 1.5M IOPS; Aggregate: 50 TB/sec, 50M IOPS | Single OSS: 10 GB/sec, 1.5M IOPS; Aggregate: 20 TB/sec, 20M IOPS |
| MDS Scalability | Single MDS: 1-4 MDTs per MDS; Single MDT: 4 billion files, 16TiB per MDT (ldiskfs) or 64 billion files, 64TiB per MDT (ZFS); MDS count: 256 MDSs, up to 256 MDTs | Single MDS: 4 billion files; MDS count: 40 MDSs with 40 4TiB MDTs in production, 256 MDSs with 256 64GiB MDTs in testing |
| MDS Performance | 1M/s create operations, 2M/s stat operations | 100k/s create operations, 200k/s metadata stat operations |
| File system Scalability | Single File: 32 PiB max file size (ldiskfs) or 2^63 bytes (ZFS); Aggregate: 512 PiB space, 1 trillion files | Single File: multi-TiB max file size; Aggregate: 700 PiB space, 25 billion files |
Other Lustre software features are:
Performance-enhanced ext4 file system: The Lustre file system uses an improved version of the ext4 journaling file system to store data and metadata. This version, called ldiskfs, has been enhanced to improve performance and provide additional functionality needed by the Lustre file system.
It is also possible to use ZFS as the backing filesystem for Lustre for the MDT, OST, and MGS storage. This allows Lustre to leverage the scalability and data integrity features of ZFS for individual storage targets.
POSIX standard compliance: The full POSIX test suite passes in an identical manner to a local ext4 file system, with limited exceptions on Lustre clients. In a cluster, most operations are atomic so that clients never see stale data or metadata. The Lustre software supports mmap() file I/O.
High-performance heterogeneous networking: The Lustre software supports a variety of high performance, low latency networks and permits Remote Direct Memory Access (RDMA) for InfiniBand* (utilizing OpenFabrics Enterprise Distribution (OFED*)), Intel OmniPath®, and other advanced networks for fast and efficient network transport. Multiple RDMA networks can be bridged using Lustre routing for maximum performance. The Lustre software also includes integrated network diagnostics.
High-availability: The Lustre file system supports active/active failover using shared storage partitions for OSS targets (OSTs) and for MDS targets (MDTs). The Lustre file system can work with a variety of high availability (HA) managers to allow automated failover and has no single point of failure (NSPF). This allows application-transparent recovery. Multiple mount protection (MMP) provides integrated protection from errors in highly-available systems that would otherwise cause file system corruption.
Security: By default, TCP connections are only allowed from privileged ports. UNIX group membership is verified on the MDS.
Access control list (ACL), extended attributes: The Lustre security model follows that of a UNIX file system, enhanced with POSIX ACLs. Noteworthy additional features include root squash.
Interoperability: The Lustre file system runs on a variety of CPU architectures and mixed-endian clusters and is interoperable between successive major Lustre software releases.
Object-based architecture: Clients are isolated from the on-disk file structure, enabling upgrading of the storage architecture without affecting the client.
Byte-granular file and fine-grained metadata locking: Many clients can read and modify the same file or directory concurrently. The Lustre distributed lock manager (LDLM) ensures that files are coherent between all clients and servers in the file system. The MDT LDLM manages locks on inode permissions and pathnames. Each OST has its own LDLM for locks on file stripes stored thereon, which scales the locking performance as the file system grows.
Quotas: User and group quotas are available for a Lustre file system.
Capacity growth: The size of a Lustre file system and aggregate cluster bandwidth can be increased without interruption by adding new OSTs and MDTs to the cluster.
Controlled file layout: The layout of files across OSTs can be configured on a per file, per directory, or per file system basis. This allows file I/O to be tuned to specific application requirements within a single file system. The Lustre file system uses RAID-0 striping and balances space usage across OSTs.
Network data integrity protection: A checksum of all data sent from the client to the OSS protects against corruption during data transfer.
MPI I/O: The Lustre architecture has a dedicated MPI ADIO layer that optimizes parallel I/O to match the underlying file system architecture.
NFS and CIFS export: Lustre files can be re-exported using NFS (via Linux knfsd or Ganesha) or CIFS (via Samba), enabling them to be shared with non-Linux clients such as Microsoft* Windows*, Apple* Mac OS X*, and others.
Disaster recovery tool: The Lustre file system provides an online distributed file system check (LFSCK) that can restore consistency between storage components in case of a major file system error. A Lustre file system can operate even in the presence of file system inconsistencies, and LFSCK can run while the filesystem is in use, so LFSCK is not required to complete before returning the file system to production.
Performance monitoring: The Lustre file system offers a variety of mechanisms to examine performance and tuning.
Open source: The Lustre software is licensed under the GPL 2.0 license for use with the Linux operating system.
An installation of the Lustre software includes a management server (MGS) and one or more Lustre file systems interconnected with Lustre networking (LNet).
A basic configuration of Lustre file system components is shown in Figure 1.1, “Lustre file system components in a basic cluster”.
The MGS stores configuration information for all the Lustre file systems in a cluster and provides this information to other Lustre components. Each Lustre target contacts the MGS to provide information, and Lustre clients contact the MGS to retrieve information.
It is preferable that the MGS have its own storage space so that it can be managed independently. However, the MGS can be co-located and share storage space with an MDS as shown in Figure 1.1, “Lustre file system components in a basic cluster”.
Each Lustre file system consists of the following components:
Metadata Servers (MDS)- The MDS makes metadata stored in one or more MDTs available to Lustre clients. Each MDS manages the names and directories in the Lustre file system(s) and provides network request handling for one or more local MDTs.
Metadata Targets (MDT) - Each filesystem has at least one MDT, which holds the root directory. The MDT stores metadata (such as filenames, directories, permissions and file layout) on storage attached to an MDS. An MDT on a shared storage target can be available to multiple MDSs, although only one can access it at a time. If an active MDS fails, a second MDS node can serve the MDT and make it available to clients. This is referred to as MDS failover.
Multiple MDTs are supported with the Distributed Namespace Environment (DNE) feature. In addition to the primary MDT that holds the filesystem root, it is possible to add additional MDS nodes, each with their own MDTs, to hold sub-directory trees of the filesystem.
Since Lustre software release 2.8, DNE also allows the filesystem to distribute files of a single directory over multiple MDT nodes. A directory which is distributed across multiple MDTs is known as a Striped Directory.
Object Storage Servers (OSS): The OSS provides file I/O service and network request handling for one or more local OSTs. Typically, an OSS serves between two and eight OSTs, up to 16 TiB each. A typical configuration is an MDT on a dedicated node, two or more OSTs on each OSS node, and a client on each of a large number of compute nodes.
Object Storage Target (OST): User file data is stored in one or more objects, each object on a separate OST in a Lustre file system. The number of objects per file is configurable by the user and can be tuned to optimize performance for a given workload.
Lustre clients: Lustre clients are computational, visualization or desktop nodes that are running Lustre client software, allowing them to mount the Lustre file system.
The Lustre client software provides an interface between the Linux virtual file system and the Lustre servers. The client software includes a management client (MGC), a metadata client (MDC), and multiple object storage clients (OSCs), one corresponding to each OST in the file system.
A logical object volume (LOV) aggregates the OSCs to provide transparent access across all the OSTs. Thus, a client with the Lustre file system mounted sees a single, coherent, synchronized namespace. Several clients can write to different parts of the same file simultaneously, while, at the same time, other clients can read from the file.
A logical metadata volume (LMV) aggregates the MDCs to provide transparent access across all the MDTs in a similar manner as the LOV does for file access. This allows the client to see the directory tree on multiple MDTs as a single coherent namespace, and striped directories are merged on the clients to form a single visible directory to users and applications.
Table 1.2, “Storage and hardware requirements for Lustre file system components” provides the requirements for attached storage for each Lustre file system component and describes desirable characteristics of the hardware used.
Table 1.2. Storage and hardware requirements for Lustre file system components
| Component | Required attached storage | Desirable hardware characteristics |
|---|---|---|
| MDSs | 1-2% of file system capacity | Adequate CPU power, plenty of memory, fast disk storage. |
| OSSs | 1-128 TiB per OST, 1-8 OSTs per OSS | Good bus bandwidth. Recommended that storage be balanced evenly across OSSs and matched to network bandwidth. |
| Clients | No local storage needed | Low latency, high bandwidth network. |
For additional hardware requirements and considerations, see Chapter 5, Determining Hardware Configuration Requirements and Formatting Options.
Lustre Networking (LNet) is a custom networking API that provides the communication infrastructure that handles metadata and file I/O data for the Lustre file system servers and clients. For more information about LNet, see Chapter 2, Understanding Lustre Networking (LNet).
At scale, a Lustre file system cluster can include hundreds of OSSs and thousands of clients (see Figure 1.2, “ Lustre cluster at scale”). More than one type of network can be used in a Lustre cluster. Shared storage between OSSs enables failover capability. For more details about OSS failover, see Chapter 3, Understanding Failover in a Lustre File System.
Lustre File IDentifiers (FIDs) are used internally for identifying files or objects, similar to inode numbers in local filesystems. A FID is a 128-bit identifier, which contains a unique 64-bit sequence number (SEQ), a 32-bit object ID (OID), and a 32-bit version number. The sequence number is unique across all Lustre targets in a file system (OSTs and MDTs). This allows multiple MDTs and OSTs to uniquely identify objects without depending on identifiers in the underlying filesystem (e.g. inode numbers) that are likely to be duplicated between targets. The FID SEQ number also allows mapping a FID to a particular MDT or OST.
The LFSCK file system consistency checking tool provides functionality that enables FID-in-dirent for existing files. It includes the following functionality:
Verifies the FID stored with each directory entry and regenerates it from the inode if it is invalid or missing.
Verifies the linkEA entry for each inode and regenerates it if invalid or missing. The linkEA stores the file name and parent FID. It is stored as an extended attribute in each inode. Thus, the linkEA can be used to reconstruct the full path name of a file from only the FID.
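As an illustration of how FIDs are exposed to users and administrators, the lfs path2fid and lfs fid2path commands map between pathnames and FIDs; the mount point, file name, and FID shown below are hypothetical:

$ lfs path2fid /mnt/testfs/dir/file
[0x200000402:0x1:0x0]
$ lfs fid2path /mnt/testfs "[0x200000402:0x1:0x0]"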
Information about where file data is located on the OST(s) is stored as an extended attribute called layout EA in an MDT object identified by the FID for the file (see Figure 1.3, “Layout EA on MDT pointing to file data on OSTs”). If the file is a regular file (not a directory or symbolic link), the MDT object points to 1-to-N OST object(s) on the OST(s) that contain the file data. If the MDT file points to one object, all the file data is stored in that object. If the MDT file points to more than one object, the file data is striped across the objects using RAID 0, and each object is stored on a different OST. (For more information about how striping is implemented in a Lustre file system, see Section 1.3.1, “Lustre File System and Striping”.)
When a client wants to read from or write to a file, it first fetches the layout EA from the MDT object for the file. The client then uses this information to perform I/O on the file, directly interacting with the OSS nodes where the objects are stored. This process is illustrated in Figure 1.4, “Lustre client requesting file data” .
The available bandwidth of a Lustre file system is determined as follows:
The network bandwidth equals the aggregated bandwidth of the OSSs to the targets.
The disk bandwidth equals the sum of the disk bandwidths of the storage targets (OSTs) up to the limit of the network bandwidth.
The aggregate bandwidth equals the minimum of the disk bandwidth and the network bandwidth.
The available file system space equals the sum of the available space of all the OSTs.
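As a worked example with purely hypothetical numbers, consider 8 OSS nodes, each with a 10 GB/sec network link and 12 GB/sec of attached OST bandwidth:

network bandwidth   = 8 x 10 GB/sec = 80 GB/sec
disk bandwidth      = 8 x 12 GB/sec = 96 GB/sec
aggregate bandwidth = min(80, 96)   = 80 GB/sec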
One of the main factors leading to the high performance of Lustre file systems is the ability to stripe data across multiple OSTs in a round-robin fashion. Users can optionally configure for each file the number of stripes, stripe size, and OSTs that are used.
Striping can be used to improve performance when the aggregate bandwidth to a single file exceeds the bandwidth of a single OST. The ability to stripe is also useful when a single OST does not have enough free space to hold an entire file. For more information about benefits and drawbacks of file striping, see Section 19.2, “ Lustre File Layout (Striping) Considerations”.
Striping allows segments or 'chunks' of data in a file to be stored on different OSTs, as shown in Figure 1.5, “File striping on a Lustre file system”. In the Lustre file system, a RAID 0 pattern is used in which data is "striped" across a certain number of objects. The number of objects in a single file is called the stripe_count. Each object contains a chunk of data from the file. When the chunk of data being written to a particular object exceeds the stripe_size, the next chunk of data in the file is stored on the next object.
Default values for stripe_count and stripe_size are set for the file system. The default value for stripe_count is 1 stripe per file and the default value for stripe_size is 1 MB. The user may change these values on a per directory or per file basis. For more details, see Section 19.3, “Setting the File Layout/Striping Configuration (lfs setstripe)”.
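For example, the layout of a directory can be set and then inspected with lfs setstripe and lfs getstripe; the mount point, directory name, and values below are only illustrative:

$ lfs setstripe -c 4 -S 4M /mnt/testfs/dir
$ lfs getstripe /mnt/testfs/dir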
As shown in Figure 1.5, “File striping on a Lustre file system”, the stripe_size for File C is larger than the stripe_size for File A, allowing more data to be stored in a single stripe for File C. The stripe_count for File A is 3, resulting in data striped across three objects, while the stripe_count for File B and File C is 1.
No space is reserved on the OST for unwritten data, as shown by File A in Figure 1.5, “File striping on a Lustre file system”.
The maximum file size is not limited by the size of a single target. In a Lustre file system, files can be striped across multiple objects (up to 2000), and each object can be up to 16 TiB in size with ldiskfs, or up to 256PiB with ZFS. This leads to a maximum file size of 31.25 PiB for ldiskfs or 8EiB with ZFS. Note that a Lustre file system can support files up to 2^63 bytes (8EiB), limited only by the space available on the OSTs.
ldiskfs filesystems without the ea_inode feature limit the maximum stripe count for a single file to 160 OSTs.
Although a single file can only be striped over 2000 objects, Lustre file systems can have thousands of OSTs. The I/O bandwidth to access a single file is the aggregated I/O bandwidth to the objects in a file, which can be as much as a bandwidth of up to 2000 servers. On systems with more than 2000 OSTs, clients can do I/O using multiple files to utilize the full file system bandwidth.
For more information about striping, see Chapter 19, Managing File Layout (Striping) and Free Space.
Extended Attributes (xattrs)

Lustre uses lov_user_md_v1/lov_user_md_v3 data structures to maintain its file striping information under xattrs. Extended attributes are created when files and directories are created. Lustre uses trusted extended attributes to store its parameters, which are root-only accessible. The parameters are:

trusted.lov: Holds the layout for a regular file, or the default file layout stored on a directory (also accessible as lustre.lov for non-root users).

trusted.lma: Holds the FID and extra state flags for the current file.

trusted.lmv: Holds the layout for a striped directory (DNE 2); not present otherwise.

trusted.link: Holds the parent directory FID + filename for each link to a file (for lfs fid2path).

The xattrs stored on a file can be listed and verified using:

# getfattr -d -m - /mnt/testfs/file
Table of Contents
This chapter introduces Lustre networking (LNet). It includes the following sections:
In a cluster using one or more Lustre file systems, the network communication infrastructure required by the Lustre file system is implemented using the Lustre networking (LNet) feature.
LNet supports many commonly-used network types, such as InfiniBand and IP networks, and allows simultaneous availability across multiple network types with routing between them. Remote direct memory access (RDMA) is permitted when supported by underlying networks using the appropriate Lustre network driver (LND). High availability and recovery features enable transparent recovery in conjunction with failover servers.
An LND is a pluggable driver that provides support for a particular network type; for example, ksocklnd is the driver which implements the TCP Socket LND that supports TCP networks. LNDs are loaded into the driver stack, with one LND for each network type in use.
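As a brief sketch of how an LND-backed network is typically brought up with the lnetctl utility (the network name tcp0 and interface name eth0 are assumptions about the local setup):

# lnetctl lnet configure
# lnetctl net add --net tcp0 --if eth0
# lnetctl net show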
For information about configuring LNet, see Chapter 9, Configuring Lustre Networking (LNet).
For information about administering LNet, see Part III, “Administering Lustre”.
Key features of LNet include:
RDMA, when supported by underlying networks
Support for many commonly-used network types
High availability and recovery
Support of multiple network types simultaneously
Routing among disparate networks
LNet permits end-to-end read/write throughput at or near peak bandwidth rates on a variety of network interconnects.
A Lustre network is comprised of clients and servers running the Lustre software. It need not be confined to one LNet subnet but can span several networks provided routing is possible between the networks. In a similar manner, a single network can have multiple LNet subnets.
The Lustre networking stack is comprised of two layers, the LNet code module and the LND. The LNet layer operates above the LND layer in a manner similar to the way the network layer operates above the data link layer. The LNet layer is connectionless, asynchronous and does not verify that data has been transmitted, while the LND layer is connection-oriented and typically does verify data transmission.
LNets are uniquely identified by a label comprised of a string corresponding to an LND and a number, such as tcp0, o2ib0, or o2ib1, that uniquely identifies each LNet. Each node on an LNet has at least one network identifier (NID). A NID is a combination of the address of the network interface and the LNet label in the form: address@LNet_label

Examples:

192.168.1.2@tcp0
10.13.24.90@o2ib1
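The NIDs configured on the local node can be listed with lctl; the output shown here is only illustrative:

$ lctl list_nids
192.168.1.2@tcp0
10.13.24.90@o2ib1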
In certain circumstances it might be desirable for Lustre file system traffic to pass between multiple LNets. This is possible using LNet routing. It is important to realize that LNet routing is not the same as network routing. For more details about LNet routing, see Chapter 9, Configuring Lustre Networking (LNet).
Table of Contents
This chapter describes failover in a Lustre file system. It includes:
In a high-availability (HA) system, unscheduled downtime is minimized by using redundant hardware and software components and software components that automate recovery when a failure occurs. If a failure condition occurs, such as the loss of a server or storage device or a network or software fault, the system's services continue with minimal interruption. Generally, availability is specified as the percentage of time the system is required to be available.
Availability is accomplished by replicating hardware and/or software so that when a primary server fails or is unavailable, a standby server can be switched into its place to run applications and associated resources. This process, called failover, is automatic in an HA system and, in most cases, completely application-transparent.
A failover hardware setup requires a pair of servers with a shared resource (typically a physical storage device, which may be based on SAN, NAS, hardware RAID, SCSI or Fibre Channel (FC) technology). The method of sharing storage should be essentially transparent at the device level; the same physical logical unit number (LUN) should be visible from both servers. To ensure high availability at the physical storage level, we encourage the use of RAID arrays to protect against drive-level failures.
The Lustre software does not provide redundancy for data; it depends exclusively on redundancy of backing storage devices. The backing OST storage should be RAID 5 or, preferably, RAID 6 storage. MDT storage should be RAID 1 or RAID 10.
To establish a highly-available Lustre file system, power management software or hardware and high availability (HA) software are used to provide the following failover capabilities:
Resource fencing- Protects physical storage from simultaneous access by two nodes.
Resource management- Starts and stops the Lustre resources as a part of failover, maintains the cluster state, and carries out other resource management tasks.
Health monitoring- Verifies the availability of hardware and network resources and responds to health indications provided by the Lustre software.
These capabilities can be provided by a variety of software and/or hardware solutions. For more information about using power management software or hardware and high availability (HA) software with a Lustre file system, see Chapter 11, Configuring Failover in a Lustre File System.
HA software is responsible for detecting failure of the primary Lustre server node and controlling the failover. The Lustre software works with any HA software that includes resource (I/O) fencing. For proper resource fencing, the HA software must be able to completely power off the failed server or disconnect it from the shared storage device. If two active nodes have access to the same storage device, data may be severely corrupted.
Nodes in a cluster can be configured for failover in several ways. They are often configured in pairs (for example, two OSTs attached to a shared storage device), but other failover configurations are also possible. Failover configurations include:
Active/passive pair - In this configuration, the active node provides resources and serves data, while the passive node is usually standing by idle. If the active node fails, the passive node takes over and becomes active.
Active/active pair - In this configuration, both nodes are active, each providing a subset of resources. In case of a failure, the second node takes over resources from the failed node.
If there is a single MDT in a filesystem, two MDSs can be configured as an active/passive pair, while pairs of OSSes can be deployed in an active/active configuration that improves OST availability without extra overhead. Often the standby MDS is the active MDS for another Lustre file system or the MGS, so no nodes are idle in the cluster. If there are multiple MDTs in a filesystem, active-active failover configurations are available for MDSs that serve MDTs on shared storage.
The failover functionality provided by the Lustre software can be used for the following failover scenario. When a client attempts to do I/O to a failed Lustre target, it continues to try until it receives an answer from any of the configured failover nodes for the Lustre target. A user-space application does not detect anything unusual, except that the I/O may take longer to complete.
Failover in a Lustre file system requires that two nodes be configured as a failover pair, which must share one or more storage devices. A Lustre file system can be configured to provide MDT or OST failover.
For MDT failover, two MDSs can be configured to serve the same MDT. Only one MDS node can serve any MDT at one time. By placing two or more MDT devices on storage shared by two MDSs, one MDS can fail and the remaining MDS can begin serving the unserved MDT. This is described as an active/active failover pair.
For OST failover, multiple OSS nodes can be configured to be able to serve the same OST. However, only one OSS node can serve the OST at a time. An OST can be moved between OSS nodes that have access to the same storage device using umount/mount commands.
The --servicenode option is used to set up nodes in a Lustre file system for failover at creation time (using mkfs.lustre) or later when the Lustre file system is active (using tunefs.lustre). For explanations of these utilities, see Section 44.12, “mkfs.lustre” and Section 44.15, “tunefs.lustre”.
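A hedged sketch of formatting an OST with two failover service nodes, and of adding a service node to an existing target; the device path, fsname, and NIDs are placeholders, not a recommended configuration:

# mkfs.lustre --ost --fsname=testfs --index=0 --mgsnode=10.2.0.1@tcp0 \
    --servicenode=10.2.0.5@tcp0 --servicenode=10.2.0.6@tcp0 /dev/sdb
# tunefs.lustre --servicenode=10.2.0.6@tcp0 /dev/sdb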
Failover capability in a Lustre file system can be used to upgrade the Lustre software between successive minor versions without cluster downtime. For more information, see Chapter 17, Upgrading a Lustre File System.
For information about configuring failover, see Chapter 11, Configuring Failover in a Lustre File System.
The Lustre software provides failover functionality only at the file system level. In a complete failover solution, failover functionality for system-level components, such as node failure detection or power control, must be provided by a third-party tool.
OST failover functionality does not protect against corruption caused by a disk failure. If the storage media (i.e., physical disk) used for an OST fails, it cannot be recovered by functionality provided in the Lustre software. We strongly recommend that some form of RAID be used for OSTs. Lustre functionality assumes that the storage is reliable, so it adds no extra reliability features.
Two MDSs are typically configured as an active/passive failover pair as shown in Figure 3.1, “Lustre failover configuration for an active/passive MDT”. Note that both nodes must have access to shared storage for the MDT(s) and the MGS. The primary (active) MDS manages the Lustre system metadata resources. If the primary MDS fails, the secondary (passive) MDS takes over these resources and serves the MDTs and the MGS.
In an environment with multiple file systems, the MDSs can be configured in a quasi active/active configuration, with each MDS managing metadata for a subset of the Lustre file system.
MDTs can be configured as an active/active failover configuration. A failover cluster is built from two MDSs as shown in Figure 3.2, “Lustre failover configuration for active/active MDTs”.
OSTs are usually configured in a load-balanced, active/active failover configuration. A failover cluster is built from two OSSs as shown in Figure 3.3, “Lustre failover configuration for OSTs”.
OSSs configured as a failover pair must have shared disks/RAID.
In an active/active configuration, 50% of the available OSTs are assigned to one OSS and the remaining OSTs are assigned to the other OSS. Each OSS serves as the primary node for half the OSTs and as a failover node for the remaining OSTs.
In this mode, if one OSS fails, the other OSS takes over all of the failed OSTs. The clients attempt to connect to each OSS serving the OST, until one of them responds. Data on the OST is written synchronously, and the clients replay transactions that were in progress and uncommitted to disk before the OST failure.
For more information about configuring failover, see Chapter 11, Configuring Failover in a Lustre File System.
Part II describes how to install and configure a Lustre file system. You will find information in this section about:
Table of Contents
This chapter provides an overview of the procedures required to set up, install and configure a Lustre file system.
If the Lustre file system is new to you, you may find it helpful to refer to Part I, “Introducing the Lustre* File System” for a description of the Lustre architecture, file system components and terminology before proceeding with the installation procedure.
To set up Lustre file system hardware and install and configure the Lustre software, refer to the chapters below in the order listed:
(Required) Set up your Lustre file system hardware.
See Chapter 5, Determining Hardware Configuration Requirements and Formatting Options - Provides guidelines for configuring hardware for a Lustre file system including storage, memory, and networking requirements.
(Optional - Highly Recommended) Configure storage on Lustre storage devices.
See Chapter 6, Configuring Storage on a Lustre File System - Provides instructions for setting up hardware RAID on Lustre storage devices.
(Optional) Set up network interface bonding.
See Chapter 7, Setting Up Network Interface Bonding - Describes setting up network interface bonding to allow multiple network interfaces to be used in parallel to increase bandwidth or redundancy.
(Required) Install Lustre software.
See Chapter 8, Installing the Lustre Software - Describes preparation steps and a procedure for installing the Lustre software.
(Optional) Configure Lustre Networking (LNet).
See Chapter 9, Configuring Lustre Networking (LNet) - Describes how to configure LNet if the default configuration is not sufficient. By default, LNet will use the first TCP/IP interface it discovers on a system. LNet configuration is required if you are using InfiniBand or multiple Ethernet interfaces.
(Required) Configure the Lustre file system.
See Chapter 10, Configuring a Lustre File System - Provides an example of a simple Lustre configuration procedure and points to tools for completing more complex configurations.
(Optional) Configure Lustre failover.
See Chapter 11, Configuring Failover in a Lustre File System - Describes how to configure Lustre failover.
Table of Contents
This chapter describes hardware configuration requirements for a Lustre file system including:
A Lustre file system can utilize any kind of block storage device such as single disks, software RAID, hardware RAID, or a logical volume manager. In contrast to some networked file systems, the block devices are only attached to the MDS and OSS nodes in a Lustre file system and are not accessed by the clients directly.
Since the block devices are accessed by only one or two server nodes, a storage area network (SAN) that is accessible from all the servers is not required. Expensive switches are not needed because point-to-point connections between the servers and the storage arrays normally provide the simplest and best attachments. (If failover capability is desired, the storage must be attached to multiple servers.)
For a production environment, it is preferable that the MGS have separate storage to allow future expansion to multiple file systems. However, it is possible to run the MDS and MGS on the same machine and have them share the same storage device.
For best performance in a production environment, dedicated clients are required. For a non-production Lustre environment or for testing, a Lustre client and server can run on the same machine. However, dedicated clients are the only supported configuration.
Performance and recovery issues can occur if you put a client on an MDS or OSS:
Running the OSS and a client on the same machine can cause issues with low memory and memory pressure. If the client consumes all the memory and then tries to write data to the file system, the OSS will need to allocate pages to receive data from the client but will not be able to perform this operation due to low memory. This can cause the client to hang.
Running the MDS and a client on the same machine can cause recovery and deadlock issues and impact the performance of other Lustre clients.
Only servers running on 64-bit CPUs are tested and supported. 64-bit CPU clients are typically used for testing to match expected customer usage and avoid limitations due to the 4 GB limit for RAM size, 1 GB low-memory limitation, and 16 TB file size limit of 32-bit CPUs. Also, due to kernel API limitations, performing backups of Lustre filesystems on 32-bit clients may cause backup tools to confuse files that report the same 32-bit inode number, if the backup tools depend on the inode number for correct operation.
The storage attached to the servers typically uses RAID to provide fault tolerance and can optionally be organized with logical volume management (LVM), which is then formatted as a Lustre file system. Lustre OSS and MDS servers read, write and modify data in the format imposed by the file system.
The Lustre file system uses journaling file system technology on both the MDTs and OSTs. For an MDT, as much as a 20 percent performance gain can be obtained by placing the journal on a separate device.
The MDS can effectively utilize a lot of CPU cycles. A minimum of four processor cores are recommended. More are advisable for file systems with many clients.
Running Lustre clients on different CPU architectures is supported. One limitation is that the PAGE_SIZE kernel macro on the client must be as large as the PAGE_SIZE of the server. In particular, ARM or PPC clients with large pages (up to 64kB pages) can run with x86 servers (4kB pages).
MGT storage requirements are small (less than 100 MB even in the largest Lustre file systems), and the data on an MGT is only accessed on a server/client mount, so disk performance is not a consideration. However, this data is vital for file system access, so the MGT should be reliable storage, preferably mirrored RAID1.
MDS storage is accessed in a database-like access pattern with many seeks and read-and-writes of small amounts of data. Storage types that provide much lower seek times, such as SSD or NVMe, are strongly preferred for the MDT; high-RPM SAS is acceptable.
For maximum performance, the MDT should be configured as RAID1 with an internal journal and two disks from different controllers.
If you need a larger MDT, create multiple RAID1 devices from pairs of disks, and then make a RAID0 array of the RAID1 devices. For ZFS, use mirror VDEVs for the MDT. This ensures maximum reliability because multiple disk failures only have a small chance of hitting both disks in the same RAID1 device.
Doing the opposite (RAID1 of a pair of RAID0 devices) has a 50% chance that even two disk failures can cause the loss of the whole MDT device. The first failure disables an entire half of the mirror and the second failure has a 50% chance of disabling the remaining mirror.
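As a sketch of the ZFS layout described above, a pool for an MDT might be built from mirror VDEVs; the pool name and device names are hypothetical:

# zpool create mdt0pool mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd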
If multiple MDTs are going to be present in the system, each MDT should be specified for the anticipated usage and load. For details on how to add additional MDTs to the filesystem, see Section 14.7, “Adding a New MDT to a Lustre File System”.
MDT0000 contains the root of the Lustre file system. If MDT0000 is unavailable for any reason, the file system cannot be used.
Using the DNE feature it is possible to dedicate additional MDTs to sub-directories off the file system root directory stored on MDT0000, or arbitrarily for lower-level subdirectories, using the lfs mkdir -i mdt_index command. If an MDT serving a subdirectory becomes unavailable, any subdirectories on that MDT and all directories beneath it will also become inaccessible. This is typically useful for top-level directories to assign different users or projects to separate MDTs, or to distribute other large working sets of files to multiple MDTs.
Starting in the 2.8 release it is possible to spread a single large directory across multiple MDTs using the DNE striped directory feature by specifying multiple stripes (or shards) at creation time using the lfs mkdir -c stripe_count command, where stripe_count is often the number of MDTs in the filesystem. Striped directories should not be used for all directories in the filesystem, since this incurs extra overhead compared to unstriped directories. This is intended for specific applications where many output files are being created in one large directory (over 50k entries). An example of both commands follows.
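For example (the mount point, index, and stripe count are illustrative only), a remote directory can be placed on a specific MDT with lfs mkdir -i, and a striped directory can be created with lfs mkdir -c:

$ lfs mkdir -i 1 /mnt/testfs/project1
$ lfs mkdir -c 4 /mnt/testfs/shared_output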
The data access pattern for the OSS storage is a streaming I/O pattern that is dependent on the access patterns of applications being used. Each OSS can manage multiple object storage targets (OSTs), one for each volume with I/O traffic load-balanced between servers and targets. An OSS should be configured to have a balance between the network bandwidth and the attached storage bandwidth to prevent bottlenecks in the I/O path. Depending on the server hardware, an OSS typically serves between 2 and 8 targets, with each target between 24-48TB, but may be up to 256 terabytes (TBs) in size.
Lustre file system capacity is the sum of the capacities provided by the targets. For example, 64 OSSs, each with two 8 TB OSTs, provide a file system with a capacity of nearly 1 PB. If each OST uses ten 1 TB SATA disks (8 data disks plus 2 parity disks in a RAID-6 configuration), it may be possible to get 50 MB/sec from each drive, providing up to 400 MB/sec of disk bandwidth per OST. If this system is used as storage backend with a system network, such as the InfiniBand network, that provides a similar bandwidth, then each OSS could provide 800 MB/sec of end-to-end I/O throughput. (Although the architectural constraints described here are simple, in practice it takes careful hardware selection, benchmarking and integration to obtain such results.)
The desired performance characteristics of the backing file systems on the MDT and OSTs are independent of one another. The size of the MDT backing file system depends on the number of inodes needed in the total Lustre file system, while the aggregate OST space depends on the total amount of data stored on the file system. If MGS data is to be stored on the MDT device (co-located MGT and MDT), add 100 MB to the required size estimate for the MDT.
Each time a file is created on a Lustre file system, it consumes
one inode on the MDT and one OST object over which the file is striped.
Normally, each file's stripe count is based on the system-wide
default stripe count. However, this can be changed for individual files
using the lfs setstripe
option. For more details,
see Chapter 19, Managing File Layout (Striping) and Free
Space.
In a Lustre ldiskfs file system, all the MDT inodes and OST objects are allocated when the file system is first formatted. When the file system is in use and a file is created, metadata associated with that file is stored in one of the pre-allocated inodes and does not consume any of the free space used to store file data. The total number of inodes on a formatted ldiskfs MDT or OST cannot be easily changed. Thus, the number of inodes created at format time should be generous enough to anticipate near term expected usage, with some room for growth without the effort of additional storage.
By default, the ldiskfs file system used by Lustre servers to store user-data objects and system data reserves 5% of space that cannot be used by the Lustre file system. Additionally, an ldiskfs Lustre file system reserves up to 400 MB on each OST, and up to 4GB on each MDT for journal use and a small amount of space outside the journal to store accounting data. This reserved space is unusable for general storage. Thus, at least this much space will be used per OST before any file object data is saved.
With a ZFS backing filesystem for the MDT or OST, the space allocation for inodes and file data is dynamic, and inodes are allocated as needed. A minimum of 4kB of usable space (before mirroring) is needed for each inode, exclusive of other overhead such as directories, internal log files, extended attributes, ACLs, etc. ZFS also reserves approximately 3% of the total storage space for internal and redundant metadata, which is not usable by Lustre. Since the size of extended attributes and ACLs is highly dependent on kernel versions and site-specific policies, it is best to over-estimate the amount of space needed for the desired number of inodes, and any excess space will be utilized to store more inodes.
Less than 100 MB of space is typically required for the MGT. The size is determined by the total number of servers in the Lustre file system cluster(s) that are managed by the MGS.
When calculating the MDT size, the important factor to consider is the number of files to be stored in the file system, which requires at least 2 KiB of usable space on the MDT per inode. Since MDTs typically use RAID-1+0 mirroring, the total storage needed will be double this.
Please note that the actual used space per MDT depends on the number
of files per directory, the number of stripes per file, whether files
have ACLs or user xattrs, and the number of hard links per file. The
storage required for Lustre file system metadata is typically 1-2
percent of the total file system capacity depending upon file size.
If the Chapter 20, Data on MDT (DoM) feature is in use for Lustre
2.11 or later, MDT space should typically be 5 percent or more of the
total space, depending on the distribution of small files within the
filesystem and the lod.*.dom_stripesize
limit on
the MDT and file layout used.
For ZFS-based MDT filesystems, the number of inodes created on the MDT and OST is dynamic, so there is less need to determine the number of inodes in advance, though there still needs to be some thought given to the total MDT space compared to the total filesystem size.
For example, if the average file size is 5 MiB and you have 500 TB of usable OST space, then you can calculate the minimum total number of inodes for MDTs and OSTs as follows:
(500 TB * 1000000 MB/TB) / 5 MB/inode = 100M inodes
It is recommended that the MDT(s) have at least twice the minimum number of inodes to allow for future expansion and allow for an average file size smaller than expected. Thus, the minimum space for ldiskfs MDT(s) should be approximately:
2 KiB/inode x 100 million inodes x 2 = 400 GiB ldiskfs MDT
For details about formatting options for ldiskfs MDT and OST file systems, see Section 5.3.1, “Setting Formatting Options for an ldiskfs MDT”.
If the median file size is very small, 4 KB for example, the MDT would use as much space for each file as the space used on the OST, so the use of Data-on-MDT is strongly recommended in that case. The MDT space per inode should be increased correspondingly to account for the extra data space usage for each inode:
6 KiB/inode x 100 million inodes x 2 = 1200 GiB ldiskfs MDT
If the MDT has too few inodes, this can cause the space on the
OSTs to be inaccessible since no new files can be created. In this
case, the lfs df -i
and df -i
commands will limit the number of available inodes reported for the
filesystem to match the total number of available objects on the OSTs.
Be sure to determine the appropriate MDT size needed to support the
filesystem before formatting. It is possible to increase the
number of inodes after the file system is formatted, depending on the
storage. For ldiskfs MDT filesystems the resize2fs tool can be used if the underlying block device is on an LVM logical volume and the logical volume size can be increased. For ZFS, new (mirrored) VDEVs can be added to the MDT pool to increase the total space available for inode storage.
Inodes will be added approximately in proportion to space added.
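As an illustrative sketch only (the device, volume group, and pool names are hypothetical, and the ldiskfs MDT must be unmounted before resizing): the first two commands grow an ldiskfs MDT that sits on an LVM logical volume, and the third adds a mirrored VDEV to a ZFS MDT pool:
mds# lvextend -L +256G /dev/vg_mdt/mdt0
mds# resize2fs /dev/vg_mdt/mdt0
mds# zpool add mdt0pool mirror /dev/sdc /dev/sdd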
Note that the number of total and free inodes reported by
lfs df -i
for ZFS MDTs and OSTs is estimated based
on the current average space used per inode. When a ZFS filesystem is
first formatted, this free inode estimate will be very conservative
(low) due to the high ratio of directories to regular files created for
internal Lustre metadata storage, but this estimate will improve as
more files are created by regular users and the average file size will
better reflect actual site usage.
Using the DNE remote directory feature it is possible to increase the total number of inodes of a Lustre filesystem, as well as increasing the aggregate metadata performance, by configuring additional MDTs into the filesystem, see Section 14.7, “Adding a New MDT to a Lustre File System” for details.
For the OST, the amount of space taken by each object depends on the usage pattern of the users/applications running on the system. The Lustre software defaults to a conservative estimate for the average object size (between 64 KiB per object for 10 GiB OSTs, and 1 MiB per object for 16 TiB and larger OSTs). If you are confident that the average file size for your applications will be different than this, you can specify a different average file size (number of total inodes for a given OST size) to reduce file system overhead and minimize file system check time. See Section 5.3.2, “Setting Formatting Options for an ldiskfs OST” for more details.
By default, the mkfs.lustre
utility applies these
options to the Lustre backing file system used to store data and metadata
in order to enhance Lustre file system performance and scalability. These
options include:
flex_bg
- When the flag is set to enable
this flexible-block-groups feature, block and inode bitmaps for
multiple groups are aggregated to minimize seeking when bitmaps
are read or written and to reduce read/modify/write operations
on typical RAID storage (with 1 MiB RAID stripe widths). This flag
is enabled on both OST and MDT file systems. On MDT file systems
the flex_bg
factor is left at the default value
of 16. On OSTs, the flex_bg
factor is set
to 256 to allow all of the block or inode bitmaps in a single
flex_bg
to be read or written in a single
1MiB I/O typical for RAID storage.
huge_file
- Setting this flag allows
files on OSTs to be larger than 2 TiB in size.
lazy_journal_init
- This extended option
is enabled to prevent a full overwrite to zero out the large
journal that is allocated by default in a Lustre file system
(up to 400 MiB for OSTs, up to 4GiB for MDTs), to reduce the
formatting time.
To override the default formatting options, use arguments to
mkfs.lustre
to pass formatting options to the backing file system:
--mkfsoptions='backing fs options'
For other mkfs.lustre
options, see the Linux man page for
mke2fs(8)
.
The number of inodes on the MDT is determined at format time based on the total size of the file system to be created. The default bytes-per-inode ratio ("inode ratio") for an ldiskfs MDT is optimized at one inode for every 2560 bytes of file system space.
This setting takes into account the space needed for additional ldiskfs filesystem-wide metadata, such as the journal (up to 4 GB), bitmaps, and directories, as well as files that Lustre uses internally to maintain cluster consistency. There is additional per-file metadata such as file layout for files with a large number of stripes, Access Control Lists (ACLs), and user extended attributes.
Starting in Lustre 2.11, the Chapter 20, Data on MDT (DoM) (DoM) feature allows storing small files on the MDT to take advantage of high-performance flash storage, as well as reduce space and network overhead. If you are planning to use the DoM feature with an ldiskfs MDT, it is recommended to increase the bytes-per-inode ratio to have enough space on the MDT for small files, as described below.
It is possible to change the recommended default of 2560 bytes per inode for an ldiskfs MDT when it is first formatted by adding the --mkfsoptions="-i bytes-per-inode" option to mkfs.lustre. Decreasing the inode ratio tunable bytes-per-inode will create more inodes for a given MDT size, but will leave less space for extra per-file metadata and is not recommended. The inode ratio must always be strictly larger than the MDT inode size, which is 1024 bytes by default. It is recommended to use an inode ratio at least 1536 bytes larger than the inode size to ensure the MDT does not run out of space. For DoM, increasing the inode ratio to leave enough space for the most common file size (e.g. 5632 or 66560 bytes if 4KB or 64KB files are widely used) is recommended.
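For instance, a hedged example of formatting an MDT with a 5632-byte inode ratio to leave room for 4KB DoM files (the device name and elided options are placeholders):
mds# mkfs.lustre --mdt --mkfsoptions="-i 5632" ... /dev/mdt_device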
The size of the inode may be changed at format time by adding the --stripe-count-hint=N option to have
mkfs.lustre
automatically calculate a reasonable
inode size based on the default stripe count that will be used by the
filesystem, or directly by specifying the
--mkfsoptions="-I inode-size"
option. Increasing
the inode size will provide more space in the inode for a larger Lustre
file layout, ACLs, user and system extended attributes, SELinux and
other security labels, and other internal metadata and DoM data. However,
if these features or other in-inode xattrs are not needed, a larger inode
size may hurt metadata performance as 2x, 4x, or 8x as much data would be
read or written for each MDT inode access.
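As an illustration only (hypothetical values and device name), either form might look like:
mds# mkfs.lustre --mdt --stripe-count-hint=4 ... /dev/mdt_device
mds# mkfs.lustre --mdt --mkfsoptions="-I 1024" ... /dev/mdt_device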
When formatting an OST file system, it can be beneficial
to take local file system usage into account, for example by running
df
and df -i
on a current filesystem
to get the used bytes and used inodes respectively, then computing the
average bytes-per-inode value. When deciding on the ratio for a new
filesystem, try to avoid having too many inodes on each OST, while keeping
enough margin to allow for future usage of smaller files. This helps
reduce the format and e2fsck time and makes more space available for data.
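For example, a rough sketch of computing the current average bytes-per-inode on an existing local filesystem (the /scratch mount point is hypothetical, and df --output requires GNU coreutils):
oss# used_bytes=$(df -B1 --output=used /scratch | tail -1)
oss# used_inodes=$(df -i --output=iused /scratch | tail -1)
oss# echo $((used_bytes / used_inodes))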
The table below shows the default bytes-per-inode ratio ("inode ratio") used for OSTs of various sizes when they are formatted.
Table 5.1. Default Inode Ratios Used for Newly Formatted OSTs
LUN/OST size | Default Inode ratio | Total inodes
---|---|---
under 10GiB | 1 inode/16KiB | 640 - 655k
10GiB - 1TiB | 1 inode/68KiB | 153k - 15.7M
1TiB - 8TiB | 1 inode/256KiB | 4.2M - 33.6M
over 8TiB | 1 inode/1MiB | 8.4M - 268M
In environments with few small files, the default inode ratio may result in far too many inodes for the average file size. In this case, performance can be improved by increasing the number of bytes-per-inode. To set the inode ratio, use the --mkfsoptions="-i bytes-per-inode" argument to mkfs.lustre to specify the expected average (mean) size of OST objects. For example, to create an OST with an expected average object size of 8 MiB run:
[oss#] mkfs.lustre --ost --mkfsoptions="-i $((8192 * 1024))" ...
OSTs formatted with ldiskfs should preferably have fewer than 320 million objects per OST, and up to a maximum of 4 billion inodes. Specifying a very small bytes-per-inode ratio for a large OST that exceeds this limit can either cause premature out-of-space errors and prevent the full OST space from being used, or will waste space and slow down e2fsck more than necessary. The default inode ratios are chosen to ensure that the total number of inodes remains below this limit.
File system check time on OSTs is affected by a number of variables in addition to the number of inodes, including the size of the file system, the number of allocated blocks, the distribution of allocated blocks on the disk, disk speed, CPU speed, and the amount of RAM on the server. Reasonable file system check times for valid filesystems are 5-30 minutes per TiB, but may increase significantly if substantial errors are detected and need to be repaired.
For further details about optimizing MDT and OST file systems, see Section 6.4, “ Formatting Options for ldiskfs RAID Devices”.
Table 5.2, “File and file system limits” describes current known limits of Lustre. These limits may be imposed by either the Lustre architecture or the Linux virtual file system (VFS) and virtual memory subsystems. In a few cases, a limit is defined within the Lustre code based on tested values and could be changed by editing and re-compiling the Lustre software. In these cases, the indicated limit was used for testing of the Lustre software.
Table 5.2. File and file system limits
Limit | Value | Description
---|---|---
Maximum number of MDTs | 256 | A single MDS can host one or more MDTs, either for separate filesystems or aggregated into a single namespace. Each filesystem requires a separate MDT for the filesystem root directory. Up to 255 more MDTs can be added to the filesystem and are attached into the filesystem namespace with the creation of DNE remote or striped directories.
Maximum number of OSTs | 8150 | The maximum number of OSTs is a constant that can be changed at compile time. Lustre file systems with up to 4000 OSTs have been configured in the past. Multiple OST targets can be configured on a single OSS node.
Maximum OST size | 1024TiB (ldiskfs), 1024TiB (ZFS) | This is not a hard limit. Larger OSTs are possible, but most production systems do not typically go beyond the stated limit per OST because Lustre can add capacity and performance with additional OSTs, and having more OSTs improves aggregate I/O performance, minimizes contention, and allows parallel recovery (e2fsck for ldiskfs OSTs, scrub for ZFS OSTs). With 32-bit kernels, due to page cache limits, 16TB is the maximum block device size, which in turn limits the size of an OST. It is strongly recommended to run Lustre clients and servers with 64-bit kernels.
Maximum number of clients | 131072 | The maximum number of clients is a constant that can be changed at compile time. Up to 30000 clients have been used in production accessing a single filesystem.
Maximum size of a single file system | 2EiB or larger | Each OST can have a file system up to the "Maximum OST size" limit, and the maximum number of OSTs can be combined into a single filesystem.
Maximum stripe count | 2000 | This limit is imposed by the size of the layout that needs to be stored on disk and sent in RPC requests, but is not a hard limit of the protocol. The number of OSTs in the filesystem can exceed the stripe count, but this is the maximum number of OSTs on which a single file can be striped. Note: Before Lustre 2.13, the maximum stripe count for a single file on ldiskfs MDTs is limited to 160 OSTs by default; to increase the maximum file stripe count, the ea_inode feature must be enabled on the MDT.
Maximum stripe size | < 4 GiB | The amount of data written to each object before moving on to the next object.
Minimum stripe size | 64 KiB | Due to the use of 64 KiB PAGE_SIZE on some CPU architectures such as ARM and POWER, the minimum stripe size is 64 KiB so that a single page is not split over multiple servers. This is also the minimum Data-on-MDT component size that can be specified.
Maximum object size | 16TiB (ldiskfs), 256TiB (ZFS) | The amount of data that can be stored in a single object. An object corresponds to a stripe. The ldiskfs limit of 16 TiB for a single object applies. For ZFS the limit is the size of the underlying OST. Files can consist of up to 2000 stripes, and each stripe can be up to the maximum object size.
Maximum file size | 16 TiB on 32-bit systems; 31.25 PiB on 64-bit ldiskfs systems, 8EiB on 64-bit ZFS systems | Individual files have a hard limit of nearly 16 TiB on 32-bit systems imposed by the kernel memory subsystem. On 64-bit systems this limit does not exist. Hence, files can be 2^63 bits (8EiB) in size if the backing filesystem can support large enough objects and/or the files are sparse. A single file can have a maximum of 2000 stripes, which gives an upper single-file data capacity of 31.25 PiB for 64-bit ldiskfs systems. The actual amount of data that can be stored in a file depends upon the amount of free space in each OST on which the file is striped.
Maximum number of files or subdirectories in a single directory | 600M-3.8B files (ldiskfs), 16T (ZFS) | The Lustre software uses the ldiskfs hashed directory code, which has a limit of at least 600 million files, depending on the length of the file name. The limit on subdirectories is the same as the limit on regular files. Note: Starting in the 2.8 release it is possible to exceed this limit by striping a single directory over multiple MDTs with the DNE striped directory feature (lfs mkdir -c).
Maximum number of files in the file system | 4 billion (ldiskfs), 256 trillion (ZFS) per MDT | The ldiskfs filesystem imposes an upper limit of 4 billion inodes per filesystem. By default, the MDT filesystem is formatted with one inode per 2KB of space, meaning 512 million inodes per TiB of MDT space. This can be increased initially at the time of MDT filesystem creation. For more information, see Chapter 5, Determining Hardware Configuration Requirements and Formatting Options. The ZFS filesystem dynamically allocates inodes and does not have a fixed ratio of inodes per unit of MDT space, but consumes approximately 4KiB of mirrored space per inode, depending on the configuration. Each additional MDT can hold up to the above maximum number of additional files, depending on available space and the distribution of directories and files in the filesystem.
Maximum length of a filename | 255 bytes (filename) | This limit is 255 bytes for a single filename, the same as the limit in the underlying filesystems.
Maximum length of a pathname | 4096 bytes (pathname) | The Linux VFS imposes a full pathname length of 4096 bytes.
Maximum number of open files for a Lustre file system | No limit | The Lustre software does not impose a maximum for the number of open files, but the practical limit depends on the amount of RAM on the MDS. No "tables" for open files exist on the MDS, as they are only linked in a list to a given client's export. Each client process has a limit of several thousand open files which depends on its ulimit.
This section describes the memory requirements for each Lustre file system component.
MDS memory requirements are determined by the following factors:
Number of clients
Size of the directories
Load placed on server
The amount of memory used by the MDS is a function of how many clients are on the system, and how many files they are using in their working set. This is driven, primarily, by the number of locks a client can hold at one time. The number of locks held by clients varies by load and memory availability on the server. Interactive clients can hold in excess of 10,000 locks at times. On the MDS, memory usage is approximately 2 KB per file, including the Lustre distributed lock manager (LDLM) lock and kernel data structures for the files currently in use. Having file data in cache can improve metadata performance by a factor of 10x or more compared to reading it from storage.
MDS memory requirements include:
File system metadata: A reasonable amount of RAM needs to be available for file system metadata. While no hard limit can be placed on the amount of file system metadata, if more RAM is available, then the disk I/O is needed less often to retrieve the metadata.
Network transport: If you are using TCP or other network transport that uses system memory for send/receive buffers, this memory requirement must also be taken into consideration.
Journal size: By default, the journal size is 4096 MB for each MDT ldiskfs file system. This can pin up to an equal amount of RAM on the MDS node per file system.
Failover configuration: If the MDS node will be used for failover from another node, then the RAM for each journal should be doubled, so the backup server can handle the additional load if the primary server fails.
By default, 4096 MB are used for the ldiskfs filesystem journal. Additional RAM is used for caching file data for the larger working set, which is not actively in use by clients but should be kept "hot" for improved access times. Approximately 1.5 KB per file is needed to keep a file in cache without a lock.
For example, for a single MDT on an MDS with 1,024 compute nodes, 12 interactive login nodes, and a 20 million file working set (of which 9 million files are cached on the clients at one time):
Operating system overhead = 4096 MB (RHEL8)
File system journal = 4096 MB
1024 * 32-core clients * 256 files/core * 2KB = 16384 MB
12 interactive clients * 100,000 files * 2KB = 2400 MB
20 million file working set * 1.5KB/file = 30720 MB
Thus, a reasonable MDS configuration for this workload is at least 60 GB of RAM. For active-active DNE MDT failover pairs, each MDS should have at least 96 GB of RAM. The additional memory can be used during normal operation to allow more metadata and locks to be cached and improve performance, depending on the workload.
For directories containing 1 million or more files, more memory can provide a significant benefit. For example, in an environment where clients randomly access files in a single directory containing 10 million files, caching that directory can consume as much as 35GB of RAM on the MDS.
When planning the hardware for an OSS node, consider the memory usage of several components in the Lustre file system (i.e., journal, service threads, file system metadata, etc.). Also, consider the effect of the OSS read cache feature, which consumes memory as it caches data on the OSS node.
In addition to the MDS memory requirements mentioned above, the OSS requirements also include:
Service threads: The service threads on the OSS node pre-allocate an RPC-sized I/O buffer for each ost_io service thread, so these large buffers do not need to be allocated and freed for each I/O request.
OSS read cache: OSS read cache provides read-only caching of data on an HDD-based OSS, using the regular Linux page cache to store the data. Just like caching from a regular file system in the Linux operating system, OSS read cache uses as much physical memory as is available.
The same calculation applies to files accessed from the OSS as for the MDS, but the load is typically distributed over more OSS nodes, so the amount of memory required for locks, inode cache, etc. listed for the MDS is spread out over the OSS nodes.
Because of these memory requirements, the following calculations should be taken as determining the minimum RAM required in an OSS node.
The minimum recommended RAM size for an OSS with eight OSTs, handling objects for 1/4 of the active files for the MDS:
Linux kernel and userspace daemon memory = 4096 MB
Network send/receive buffers (16 MB * 512 threads) = 8192 MB
1024 MB ldiskfs journal size * 8 OST devices = 8192 MB
16 MB read/write buffer per OST IO thread * 512 threads = 8192 MB
2048 MB file system read cache * 8 OSTs = 16384 MB
1024 * 32-core clients * 64 objects/core * 2KB/object = 4096 MB
12 interactive clients * 25,000 objects * 2KB/object = 600 MB
5 million object working set * 1.5KB/object = 7500 MB
For a non-failover configuration, the minimum RAM would be about 60 GB for an OSS node with eight OSTs. Additional memory on the OSS will improve the performance of reading smaller, frequently-accessed files.
For a failover configuration, the minimum RAM would be about 90 GB, as some of the memory is per-node. When the OSS is not handling any failed-over OSTs the extra RAM will be used as a read cache.
As a reasonable rule of thumb, about 24 GB of base memory plus 4 GB per OST can be used. In failover configurations, about 8 GB per primary OST is needed.
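For example, under this rule of thumb an OSS serving eight primary OSTs in a failover configuration would need roughly 24 GB + (8 x 8 GB) = 88 GB, consistent with the 90 GB minimum given above.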
As a high performance file system, the Lustre file system places heavy loads on networks. Thus, a network interface in each Lustre server and client is commonly dedicated to Lustre file system traffic. This is often a dedicated TCP/IP subnet, although other network hardware can also be used.
A typical Lustre file system implementation may include the following:
A high-performance backend network for the Lustre servers, typically an InfiniBand (IB) network.
A larger client network.
Lustre routers to connect the two networks.
Lustre networks and routing are configured and managed by specifying parameters to the
Lustre Networking (lnet
) module in
/etc/modprobe.d/lustre.conf
.
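For example, a minimal sketch of such an entry (assuming a single TCP network on interface eth0; adjust the network name and interface for your site):
options lnet networks=tcp0(eth0)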
To prepare to configure Lustre networking, complete the following steps:
Identify all machines that will be running Lustre software and the network interfaces they will use to run Lustre file system traffic. These machines will form the Lustre network.
A network is a group of nodes that communicate directly with one another. The Lustre
software includes Lustre network drivers (LNDs) to support a variety of network types and
hardware (see Chapter 2, Understanding Lustre Networking (LNet) for a complete list). The
standard rules for specifying networks apply to Lustre networks. For example, two TCP
networks on two different subnets (tcp0
and tcp1
)
are considered to be two different Lustre networks.
If routing is needed, identify the nodes to be used to route traffic between networks.
If you are using multiple network types, then you will need a router. Any node with appropriate interfaces can route Lustre networking (LNet) traffic between different network hardware types or topologies; the node may be a server, a client, or a standalone router. LNet can route messages between different network types (such as TCP-to-InfiniBand) or across different topologies (such as bridging two InfiniBand or TCP/IP networks). Routing will be configured in Chapter 9, Configuring Lustre Networking (LNet).
Identify the network interfaces to include in or exclude from LNet.
If not explicitly specified, LNet uses either the first available interface or a pre-defined default for a given network type. Interfaces that LNet should not use (such as an administrative network or IP-over-IB), can be excluded.
Network interfaces to be used or excluded will be specified using
the lnet kernel module parameters networks
and
ip2nets
as described in
Chapter 9, Configuring Lustre Networking (LNet).
To ease the setup of networks with complex network configurations, determine a cluster-wide module configuration.
For large clusters, you can configure the networking setup for
all nodes by using a single, unified set of parameters in the
lustre.conf
file on each node. Cluster-wide
configuration is described in Chapter 9, Configuring Lustre Networking (LNet).
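As an illustrative sketch of such a unified configuration (the subnets and interface names are hypothetical), a single lustre.conf entry can map nodes to networks by IP address using ip2nets, so servers on the InfiniBand subnet and clients on the TCP subnet each select the correct network:
options lnet ip2nets="o2ib0(ib0) 10.10.0.*; tcp0(eth1) 192.168.1.*"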
We recommend that you use 'dotted-quad' notation for IP addresses rather than host names to make it easier to read debug logs and debug configurations with multiple interfaces.
This chapter describes best practices for storage selection and file system options to optimize performance on RAID, and includes the following sections:
It is strongly recommended that storage used in a Lustre file system be configured with hardware RAID. The Lustre software does not support redundancy at the file system level and RAID is required to protect against disk failure.
The Lustre architecture allows the use of any kind of block device as backend storage. The characteristics of such devices, particularly in the case of failures, vary significantly and have an impact on configuration choices.
This section describes issues and recommendations regarding backend storage.
I/O on the MDT is typically mostly reads and writes of small amounts of data. For this reason, we recommend that you use RAID 1 for MDT storage. If you require more capacity for an MDT than one disk provides, we recommend RAID 1 + 0 or RAID 10.
A quick calculation makes it clear that without further redundancy, RAID 6 is required for large clusters and RAID 5 is not acceptable:
For a 2 PB file system (2,000 disks of 1 TB capacity) assume the mean time to failure (MTTF) of a disk is about 1,000 days. This means that the expected failure rate is 2000/1000 = 2 disks per day. Repair time at 10% of disk bandwidth is 1000 GB at 10MB/sec = 100,000 sec, or about 1 day.
For a RAID 5 stripe that is 10 disks wide, during 1 day of rebuilding, the chance that a second disk in the same array will fail is about 9/1000 or about 1% per day. After 50 days, you have a 50% chance of a double failure in a RAID 5 array leading to data loss.
Therefore, RAID 6 or another double parity algorithm is needed to provide sufficient redundancy for OST storage.
For better performance, we recommend that you create RAID sets with 4 or 8 data disks plus one or two parity disks. Using larger RAID sets will negatively impact performance compared to having multiple independent RAID sets.
To maximize performance for small I/O request sizes, storage configured as RAID 1+0 can yield much better results but will increase cost or reduce capacity.
RAID monitoring software is recommended to quickly detect faulty disks and allow them to be replaced to avoid double failures and data loss. Hot spare disks are recommended so that rebuilds happen without delays.
Backups of the metadata file systems are recommended. For details, see Chapter 18, Backing Up and Restoring a File System.
A writeback cache in a RAID storage controller can dramatically increase write performance on many types of RAID arrays if the writes are not done at full stripe width. Unfortunately, unless the RAID array has battery-backed cache (a feature only found in some higher-priced hardware RAID arrays), interrupting the power to the array may result in out-of-sequence or lost writes, and corruption of RAID parity and/or filesystem metadata, resulting in data loss.
Having a read or writeback cache onboard a PCI adapter card installed in an MDS or OSS is NOT SAFE in a high-availability (HA) failover configuration, as this will result in inconsistencies between nodes and immediate or eventual filesystem corruption. Such devices should not be used, or should have the onboard cache disabled.
If writeback cache is enabled, a file system check is required after the array loses power. Data may also be lost because of this.
Therefore, we recommend against the use of writeback cache when data integrity is critical. You should carefully consider whether the benefits of using writeback cache outweigh the risks.
When formatting an ldiskfs file system on a RAID device, it can be
beneficial to ensure that I/O requests are aligned with the underlying
RAID geometry. This ensures that Lustre RPCs do not generate unnecessary
disk operations which may reduce performance dramatically. Use the
--mkfsoptions
parameter to specify additional parameters
when formatting the OST or MDT.
For RAID 5, RAID 6, or RAID 1+0 storage, specifying the following
option to the --mkfsoptions
parameter option improves
the layout of the file system metadata, ensuring that no single disk
contains all of the allocation bitmaps:
-E stride = chunk_blocks
The chunk_blocks variable is in units of 4096-byte blocks and represents the amount of contiguous data written to a single disk before moving to the next disk. This is alternately referred to as the RAID stripe size. This is applicable to both MDT and OST file systems.
For more information on how to override the defaults while formatting MDT or OST file systems, see Section 5.3, “ Setting ldiskfs File System Formatting Options ”.
For best results, use RAID 5 with 5 or 9 disks or RAID 6 with 6 or 10 disks, each on a different controller. The stripe width is the optimal minimum I/O size. Ideally, the RAID configuration should allow 1 MB Lustre RPCs to fit evenly on a single RAID stripe without an expensive read-modify-write cycle. Use this formula to determine the stripe_width, where number_of_data_disks does not include the RAID parity disks (1 for RAID 5 and 2 for RAID 6):
stripe_width_blocks = chunk_blocks * number_of_data_disks = 1 MB
If the RAID configuration does not allow chunk_blocks to fit evenly into 1 MB, select stripe_width_blocks such that it is close to 1 MB, but not larger.
The stripe_width_blocks value must equal chunk_blocks * number_of_data_disks.
Specifying the stripe_width_blocks parameter is only relevant for RAID 5 or RAID 6, and is not needed for RAID 1+0.
Run mkfs.lustre --reformat on the file system device (/dev/sdc), specifying the RAID geometry to the underlying ldiskfs file system, where:
--mkfsoptions "other_options -E stride=chunk_blocks, stripe_width=stripe_width_blocks"
A RAID 6 configuration with 6 disks has 4 data and 2 parity disks. The chunk_blocks <= 1024KB/4 = 256KB.
Because the number of data disks is a power of 2, the stripe width is equal to 1 MB.
--mkfsoptions "other_options
-E stride=chunk_blocks
, stripe_width=stripe_width_blocks
"...
If you have configured a RAID array and use it directly as an OST, it contains both data and metadata. For better performance, we recommend putting the OST journal on a separate device, by creating a small RAID 1 array and using it as an external journal for the OST.
In a typical Lustre file system, the default OST journal size is up to 1GB, and the default MDT journal size is up to 4GB, in order to handle a high transaction rate without blocking on journal flushes. Additionally, a copy of the journal is kept in RAM. Therefore, make sure you have enough RAM on the servers to hold copies of all journals.
The file system journal options are specified to mkfs.lustre
using
the --mkfsoptions
parameter. For example:
--mkfsoptions "other_options
-j -J device=/dev/mdJ"
To create an external journal, perform these steps for each OST on the OSS:
Create a 400 MB (or larger) journal partition (RAID 1 is recommended).
In this example, /dev/sdb
is a RAID 1 device.
Create a journal device on the partition. Run:
oss# mke2fs -b 4096 -O journal_dev /dev/sdb journal_size
The value of journal_size is specified in units of 4096-byte blocks. For example, 262144 for a 1 GB journal size.
Create the OST.
In this example, /dev/sdc
is the RAID 6 device to be used as the OST, run:
[oss#] mkfs.lustre --ost ... \ --mkfsoptions="-J device=/dev/sdb1" /dev/sdc
Mount the OST as usual.
Depending on your cluster size and workload, you may want to connect a SAN to a Lustre file system. Before making this connection, consider the following:
In many SAN file systems, clients allocate and lock blocks or inodes individually as they are updated. The design of the Lustre file system avoids the high contention that some of these blocks and inodes may have.
The Lustre file system is highly scalable and can have a very large number of clients. SAN switches do not scale to a large number of nodes, and the cost per port of a SAN is generally higher than other networking.
File systems that allow direct-to-SAN access from the clients have a security risk because clients can potentially read any data on the SAN disks, and misbehaving clients can corrupt the file system for many reasons like improper file system, network, or other kernel software, bad cabling, bad memory, and so on. The risk increases with increase in the number of clients directly accessing the storage.
This chapter describes how to use multiple network interfaces in parallel to increase bandwidth and/or redundancy. Topics include:
Using network interface bonding is optional.
Bonding, also known as link aggregation, trunking and port trunking, is a method of aggregating multiple physical network links into a single logical link for increased bandwidth.
Several different types of bonding are available in the Linux distribution. All these types are referred to as 'modes', and use the bonding kernel module.
Modes 0 to 3 allow load balancing and fault tolerance by using multiple interfaces. Mode 4 aggregates a group of interfaces into a single virtual interface where all members of the group share the same speed and duplex settings. This mode is described under IEEE spec 802.3ad, and it is referred to as either 'mode 4' or '802.3ad.'
The most basic requirement for successful bonding is that both endpoints of the connection must be capable of bonding. In a normal case, the non-server endpoint is a switch. (Two systems connected via crossover cables can also use bonding.) Any switch used must explicitly handle 802.3ad Dynamic Link Aggregation.
The kernel must also be configured with bonding. All supported Lustre kernels have bonding functionality. The network driver for the interfaces to be bonded must have the ethtool functionality to determine slave speed and duplex settings. All recent network drivers implement it.
To verify that your interface works with ethtool, run:
# which ethtool
/sbin/ethtool

# ethtool eth0
Settings for eth0:
   Supported ports: [ TP MII ]
   Supported link modes: 10baseT/Half 10baseT/Full
                         100baseT/Half 100baseT/Full
   Supports auto-negotiation: Yes
   Advertised link modes: 10baseT/Half 10baseT/Full
                          100baseT/Half 100baseT/Full
   Advertised auto-negotiation: Yes
   Speed: 100Mb/s
   Duplex: Full
   Port: MII
   PHYAD: 1
   Transceiver: internal
   Auto-negotiation: on
   Supports Wake-on: pumbg
   Wake-on: d
   Current message level: 0x00000001 (1)
   Link detected: yes

# ethtool eth1
Settings for eth1:
   Supported ports: [ TP MII ]
   Supported link modes: 10baseT/Half 10baseT/Full
                         100baseT/Half 100baseT/Full
   Supports auto-negotiation: Yes
   Advertised link modes: 10baseT/Half 10baseT/Full
                          100baseT/Half 100baseT/Full
   Advertised auto-negotiation: Yes
   Speed: 100Mb/s
   Duplex: Full
   Port: MII
   PHYAD: 32
   Transceiver: internal
   Auto-negotiation: on
   Supports Wake-on: pumbg
   Wake-on: d
   Current message level: 0x00000007 (7)
   Link detected: yes

To quickly check whether your kernel supports bonding, run:
# grep ifenslave /sbin/ifup
# which ifenslave
/sbin/ifenslave
Bonding module parameters control various aspects of bonding.
Outgoing traffic is mapped across the slave interfaces according to the transmit hash
policy. We recommend that you set the xmit_hash_policy
option to the
layer3+4 option for bonding. This policy uses upper layer protocol information if available to
generate the hash. This allows traffic to a particular network peer to span multiple slaves,
although a single connection does not span multiple slaves.
$ xmit_hash_policy=layer3+4
The miimon
option enables users to monitor the link status. (The
parameter is a time interval in milliseconds.) It makes an interface failure transparent to
avoid serious network degradation during link failures. A reasonable default setting is 100
milliseconds; run:
$ miimon=100
For a busy network, increase the timeout.
To set up bonding:
Create a virtual 'bond' interface by creating a configuration file:
# vi /etc/sysconfig/network-scripts/ifcfg-bond0
Append the following lines to the file.
DEVICE=bond0
IPADDR=192.168.10.79 # Use a free IP Address of your network
NETWORK=192.168.10.0
NETMASK=255.255.255.0
USERCTL=no
BOOTPROTO=none
ONBOOT=yes
Attach one or more slave interfaces to the bond interface. Modify the eth0 and eth1 configuration files (using a VI text editor).
Use the VI text editor to open the eth0 configuration file.
# vi /etc/sysconfig/network-scripts/ifcfg-eth0
Modify/append the eth0 file as follows:
DEVICE=eth0
USERCTL=no
ONBOOT=yes
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
Use the VI text editor to open the eth1 configuration file.
# vi /etc/sysconfig/network-scripts/ifcfg-eth1
Modify/append the eth1 file as follows:
DEVICE=eth1
USERCTL=no
ONBOOT=yes
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
Set up the bond interface and its options in /etc/modprobe.d/bond.conf
. Start the slave interfaces by your normal network method.
# vi /etc/modprobe.d/bond.conf
Append the following lines to the file.
alias bond0 bonding
options bond0 mode=balance-alb miimon=100
Load the bonding module.
# modprobe bonding
# ifconfig bond0 up
# ifenslave bond0 eth0 eth1
Start/restart the slave interfaces (using your normal network method).
You must modprobe
the bonding module for each bonded interface. If you wish to create bond0 and bond1, two entries in bond.conf
file are required.
The examples below are from systems running Red Hat Enterprise Linux. For setup use:
/etc/sysconfig/network-scripts/ifcfg-*
The website referenced
below includes detailed instructions for other configuration methods, instructions to use
DHCP with bonding, and other setup details. We strongly recommend you use this
website.
Check /proc/net/bonding to determine status on bonding. There should be a file there for each bond interface.
# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.0.3 (March 23, 2006)

Bonding Mode: load balancing (round-robin)
MII Status: up
MII Polling Interval (ms): 0
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth0
MII Status: up
Link Failure Count: 0
Permanent HW addr: 4c:00:10:ac:61:e0

Slave Interface: eth1
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:14:2a:7c:40:1d
Use ethtool or ifconfig to check the interface state. ifconfig lists the first bonded interface as 'bond0.'
ifconfig
bond0  Link encap:Ethernet  HWaddr 4C:00:10:AC:61:E0
       inet addr:192.168.10.79  Bcast:192.168.10.255  Mask:255.255.255.0
       inet6 addr: fe80::4e00:10ff:feac:61e0/64 Scope:Link
       UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
       RX packets:3091 errors:0 dropped:0 overruns:0 frame:0
       TX packets:880 errors:0 dropped:0 overruns:0 carrier:0
       collisions:0 txqueuelen:0
       RX bytes:314203 (306.8 KiB)  TX bytes:129834 (126.7 KiB)

eth0   Link encap:Ethernet  HWaddr 4C:00:10:AC:61:E0
       inet6 addr: fe80::4e00:10ff:feac:61e0/64 Scope:Link
       UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
       RX packets:1581 errors:0 dropped:0 overruns:0 frame:0
       TX packets:448 errors:0 dropped:0 overruns:0 carrier:0
       collisions:0 txqueuelen:1000
       RX bytes:162084 (158.2 KiB)  TX bytes:67245 (65.6 KiB)
       Interrupt:193 Base address:0x8c00

eth1   Link encap:Ethernet  HWaddr 4C:00:10:AC:61:E0
       inet6 addr: fe80::4e00:10ff:feac:61e0/64 Scope:Link
       UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
       RX packets:1513 errors:0 dropped:0 overruns:0 frame:0
       TX packets:444 errors:0 dropped:0 overruns:0 carrier:0
       collisions:0 txqueuelen:1000
       RX bytes:152299 (148.7 KiB)  TX bytes:64517 (63.0 KiB)
       Interrupt:185 Base address:0x6000
This is an example showing bond.conf entries for bonding Ethernet interfaces eth0 and eth1 to bond0:
# cat /etc/modprobe.d/bond.conf
alias eth0 8139too
alias eth1 via-rhine
alias bond0 bonding
options bond0 mode=balance-alb miimon=100

# cat /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
BOOTPROTO=none
NETMASK=255.255.255.0
IPADDR=192.168.10.79 # (Assign here the IP of the bonded interface.)
ONBOOT=yes
USERCTL=no

ifcfg-ethx
# cat /etc/sysconfig/network-scripts/ifcfg-eth0
TYPE=Ethernet
DEVICE=eth0
HWADDR=4c:00:10:ac:61:e0
BOOTPROTO=none
ONBOOT=yes
USERCTL=no
IPV6INIT=no
PEERDNS=yes
MASTER=bond0
SLAVE=yes
In the following example, the bond0
interface is the master (MASTER) while eth0
and eth1
are slaves (SLAVE).
All slaves of bond0
have the same MAC address (Hwaddr) - bond0
. All modes, except TLB and ALB, have this MAC address. TLB and ALB require a unique MAC address for each slave.
$ /sbin/ifconfig
bond0  Link encap:Ethernet  Hwaddr 00:C0:F0:1F:37:B4
       inet addr:XXX.XXX.XXX.YYY  Bcast:XXX.XXX.XXX.255  Mask:255.255.252.0
       UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
       RX packets:7224794 errors:0 dropped:0 overruns:0 frame:0
       TX packets:3286647 errors:1 dropped:0 overruns:1 carrier:0
       collisions:0 txqueuelen:0

eth0   Link encap:Ethernet  Hwaddr 00:C0:F0:1F:37:B4
       inet addr:XXX.XXX.XXX.YYY  Bcast:XXX.XXX.XXX.255  Mask:255.255.252.0
       UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
       RX packets:3573025 errors:0 dropped:0 overruns:0 frame:0
       TX packets:1643167 errors:1 dropped:0 overruns:1 carrier:0
       collisions:0 txqueuelen:100
       Interrupt:10 Base address:0x1080

eth1   Link encap:Ethernet  Hwaddr 00:C0:F0:1F:37:B4
       inet addr:XXX.XXX.XXX.YYY  Bcast:XXX.XXX.XXX.255  Mask:255.255.252.0
       UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
       RX packets:3651769 errors:0 dropped:0 overruns:0 frame:0
       TX packets:1643480 errors:0 dropped:0 overruns:0 carrier:0
       collisions:0 txqueuelen:100
       Interrupt:9 Base address:0x1400
The Lustre software uses the IP address of the bonded interfaces and requires no special
configuration. The bonded interface is treated as a regular TCP/IP interface. If needed,
specify bond0
using the Lustre networks
parameter in
/etc/modprobe.d/lustre.conf:
options lnet networks=tcp(bond0)
We recommend the following bonding references:
In the Linux kernel source tree, see
documentation/networking/bonding.txt
Linux Foundation bonding website: https://www.linuxfoundation.org/networking/bonding. This is the most extensive reference and we highly recommend it. This website includes explanations of more complicated setups, including the use of DHCP with bonding.
This chapter describes how to install the Lustre software from RPM packages. It includes:
For hardware and system requirements and hardware configuration information, see Chapter 5, Determining Hardware Configuration Requirements and Formatting Options.
You can install the Lustre software from downloaded packages (RPMs) or directly from the source code. This chapter describes how to install the Lustre RPM packages. Instructions to install from source code are beyond the scope of this document, and can be found elsewhere online.
The Lustre RPM packages are tested on current versions of Linux enterprise distributions at the time they are created. See the release notes for each version for specific details.
To install the Lustre software from RPMs, the following are required:
Lustre server packages
. The required packages for Lustre 2.9 EL7 servers are
listed in the table below, where
ver
refers to the Lustre release and
kernel version (e.g., 2.9.0-1.el7) and
arch
refers to the processor architecture
(e.g., x86_64). These packages are available in the
Lustre Releases repository, and may differ depending on
your distro and version.
Table 8.1. Packages Installed on Lustre Servers
Package Name | Description
---|---
kernel-ver | Linux kernel with Lustre software patches (often referred to as "patched kernel")
lustre-ver | Lustre software command line tools
kmod-lustre-ver | Lustre-patched kernel modules
kmod-lustre-osd-ldiskfs-ver | Lustre back-end file system tools for ldiskfs-based servers
lustre-osd-ldiskfs-mount-ver | Helper library for mount.lustre and mkfs.lustre for ldiskfs-based servers
kmod-lustre-osd-zfs-ver | Lustre back-end file system tools for ZFS. This is an alternative to lustre-osd-ldiskfs (kmod-spl and kmod-zfs available separately)
lustre-osd-zfs-mount-ver | Helper library for mount.lustre and mkfs.lustre for ZFS-based servers (zfs utilities available separately)
e2fsprogs | Utilities to maintain Lustre ldiskfs back-end file system(s)
lustre-tests-ver | Scripts and programs used for running regression tests for Lustre, but likely only of interest to Lustre developers or testers
Lustre client packages
. The required packages for Lustre 2.9 EL7 clients are
listed in the table below, where
ver
refers to the Linux distribution (e.g.,
3.6.18-348.1.1.el5). These packages are available in the
Lustre Releases repository.
Table 8.2. Packages Installed on Lustre Clients
Package Name | Description
---|---
kmod-lustre-client-ver | Patchless kernel modules for client
lustre-client-ver | Client command line tools
lustre-client-dkms-ver | Alternate client RPM to kmod-lustre-client with Dynamic Kernel Module Support (DKMS) installation. This avoids the need to install a new RPM for each kernel update, but requires a full build environment on the client.
The version of the kernel running on a Lustre client must be the same as the version of the kmod-lustre-client-ver package being installed, unless the DKMS package is installed. If the kernel running on the client is not compatible, a kernel that is compatible must be installed on the client before the Lustre file system software is used.
Lustre LNet network driver (LND) . The Lustre LNDs provided with the Lustre software are listed in the table below. For more information about Lustre LNet, see Chapter 2, Understanding Lustre Networking (LNet).
Table 8.3. Network Types Supported by Lustre LNDs
Supported Network Types | Notes
---|---
TCP | Any network carrying TCP traffic, including GigE, 10GigE, and IPoIB
InfiniBand network | OpenFabrics OFED (o2ib)
gni | Gemini (Cray)
The InfiniBand and TCP Lustre LNDs are routinely tested during release cycles. The other LNDs are maintained by their respective owners.
High availability software . If needed, install third party high-availability software. For more information, see Section 11.2, “Preparing a Lustre File System for Failover”.
Optional packages. Optional packages provided in the Lustre Releases repository may include the following (depending on the operating system and platform):
kernel-debuginfo
,
kernel-debuginfo-common
,
lustre-debuginfo
,
lustre-osd-ldiskfs-debuginfo
- Versions of required
packages with debugging symbols and other debugging options
enabled for use in troubleshooting.
kernel-devel - Portions of the kernel tree needed to compile third party modules, such as network drivers.
kernel-firmware
- Standard Red Hat Enterprise Linux
distribution that has been recompiled to work with the Lustre
kernel.
kernel-headers - Header files installed under /usr/include and used when compiling user-space, kernel-related code.
lustre-source
- Lustre software source code.
(Recommended)
perf
,
perf-debuginfo
,
python-perf
,
python-perf-debuginfo
- Linux performance analysis
tools that have been compiled to match the Lustre kernel
version.
Before installing the Lustre software, make sure the following environmental requirements are met.
(Required)
Use the same user IDs (UID) and group IDs
(GID) on all clients.
If use of supplemental groups is required, see
Section 41.1, “User/Group Upcall” for information about
supplementary user and group cache upcall (identity_upcall
).
(Recommended) Provide remote shell access to clients. It is recommended that all cluster nodes have remote shell client access to facilitate the use of Lustre configuration and monitoring scripts. Parallel Distributed SHell (pdsh) is preferable, although Secure SHell (SSH) is acceptable.
(Recommended) Ensure client clocks are synchronized. The Lustre file system uses client clocks for timestamps. If clocks are out of sync between clients, files will appear with different time stamps when accessed by different clients. Drifting clocks can also cause problems by, for example, making it difficult to debug multi-node issues or correlate logs, which depend on timestamps. We recommend that you use Network Time Protocol (NTP) to keep client and server clocks in sync with each other. For more information about NTP, see: https://www.ntp.org.
(Recommended) Make sure security extensions (such as the Novell AppArmor *security system) and network packet filtering tools (such as iptables) do not interfere with the Lustre software.
Before installing the Lustre software, back up ALL data. The Lustre software contains kernel modifications that interact with storage devices and may introduce security issues and data loss if not installed, configured, or administered properly.
To install the Lustre software from RPMs, complete the steps below.
Verify that all Lustre installation requirements have been met.
For hardware requirements, see Chapter 5, Determining Hardware Configuration Requirements and Formatting Options.
For software and environmental requirements, see Section 8.1, “Preparing to Install the Lustre Software” above.
Download the
e2fsprogs
RPMs for your platform from the
Lustre Releases repository.
Download the Lustre server RPMs for your platform from the Lustre Releases repository. See Table 8.1, “Packages Installed on Lustre Servers”for a list of required packages.
Install the Lustre server and
e2fsprogs
packages on all Lustre servers (MGS, MDSs,
and OSSs).
Log onto a Lustre server as the
root
user
Use the
yum
command to install the packages:
# yum --nogpgcheck install pkg1.rpm pkg2.rpm ...
Verify the packages are installed correctly:
# rpm -qa|egrep "lustre|kernel"|sort
Reboot the server.
Repeat these steps on each Lustre server.
Download the Lustre client RPMs for your platform from the Lustre Releases repository. See Table 8.2, “Packages Installed on Lustre Clients”for a list of required packages.
Install the Lustre client packages on all Lustre clients.
The version of the kernel running on a Lustre client must be
the same as the version of the
lustre-client-modules-
ver
package being installed. If not, a
compatible kernel must be installed on the client before the Lustre
client packages are installed.
Log onto a Lustre client as the root user.
Use the
yum
command to install the packages:
# yum --nogpgcheck install pkg1.rpm pkg2.rpm ...
Verify the packages were installed correctly:
# rpm -qa|egrep "lustre|kernel"|sort
Reboot the client.
Repeat these steps on each Lustre client.
To configure LNet, go to Chapter 9, Configuring Lustre Networking (LNet). If default settings will be used for LNet, go to Chapter 10, Configuring a Lustre File System.
This chapter describes how to configure Lustre Networking (LNet). It includes the following sections:
Configuring LNet is optional.
LNet will use the first TCP/IP interface it discovers on a system (eth0) if it is loaded using the lctl network up command. If this network configuration is sufficient, you do not need to configure LNet. LNet configuration is required if you are using InfiniBand or multiple Ethernet interfaces.
The lnetctl
utility can be used
to initialize LNet without bringing up any network interfaces. Network
interfaces can be added after configuring LNet via
lnetctl
. lnetctl
can also be used to
manage an operational LNet. However, if it wasn't initialized by
lnetctl
then lnetctl lnet configure
must be invoked before lnetctl
can be used to manage
LNet.
DLC also introduces a C-API to enable configuring LNet programmatically. See Chapter 45, LNet Configuration C-API.
The lnetctl
utility can be used to initialize
and configure the LNet kernel module after it has been loaded via
modprobe
. In general the lnetctl format is as
follows:
lnetctl cmd subcmd [options]
The following configuration items are managed by the tool:
Configuring/unconfiguring LNet
Adding/removing/showing Networks
Adding/removing/showing Routes
Enabling/Disabling routing
Configuring Router Buffer Pools
After LNet has been loaded via modprobe, the lnetctl utility can be used to configure LNet without bringing up the networks which are specified in the module parameters. It can also be used to configure the network interfaces specified in the module parameters by providing the --all option.
lnetctl lnet configure [--all] # --all: load NI configuration from module parameters
The lnetctl utility can also be used to unconfigure LNet.
lnetctl lnet unconfigure
The active LNet global settings can be displayed using the lnetctl command shown below:
lnetctl global show
For example:
# lnetctl global show
global:
    numa_range: 0
    max_intf: 200
    discovery: 1
    drop_asym_route: 0
Networks can be added, deleted, or shown after the LNet kernel module is loaded.
The lnetctl net add command is used to add networks:
lnetctl net add: add a network
        --net: net name (ex tcp0)
        --if: physical interface (ex eth0)
        --peer_timeout: time to wait before declaring a peer dead
        --peer_credits: defines the max number of inflight messages
        --peer_buffer_credits: the number of buffer credits per peer
        --credits: Network Interface credits
        --cpts: CPU Partitions configured net uses
        --help: display this help text
Example:
lnetctl net add --net tcp2 --if eth0 --peer_timeout 180 --peer_credits 8
With the addition of Software based Multi-Rail in Lustre 2.10, the following should be noted:
--net: no longer needs to be unique since multiple interfaces can be added to the same network.
--if: The same interface can be added only once per network; however, more than one interface can now be specified (separated by commas) for a node. For example: eth0,eth1,eth2 (see the sketch below).
For examples on adding multiple interfaces via lnetctl net add and/or YAML, please see Section 16.2, “Configuring Multi-Rail”.
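As a minimal sketch (the network name, interface names, and tunable value here are hypothetical), a Multi-Rail node could add two interfaces to the same network in a single command and then confirm the result:
lnetctl net add --net tcp1 --if eth0,eth1 --peer_credits 16
lnetctl net show --net tcp1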
Networks can be deleted with the lnetctl net del command:
net del: delete a network
        --net: net name (ex tcp0)
        --if: physical interface (e.g. eth0)
Example:
lnetctl net del --net tcp2
In a Software Multi-Rail configuration, specifying only the --net argument will delete the entire network and all interfaces under it. The new --if switch should also be used in conjunction with --net to specify deletion of a specific interface.
All or a subset of the configured networks can be shown with the lnetctl net show command. The output can be non-verbose or verbose.
net show: show networks
        --net: net name (ex tcp0) to filter on
        --verbose: display detailed output per network
Examples:
lnetctl net show
lnetctl net show --verbose
lnetctl net show --net tcp2 --verbose
Below are examples of non-detailed and detailed network configuration show.
# non-detailed show
> lnetctl net show --net tcp2
net:
    - nid: 192.168.205.130@tcp2
      status: up
      interfaces:
          0: eth3

# detailed show
> lnetctl net show --net tcp2 --verbose
net:
    - nid: 192.168.205.130@tcp2
      status: up
      interfaces:
          0: eth3
      tunables:
          peer_timeout: 180
          peer_credits: 8
          peer_buffer_credits: 0
          credits: 256
The lnetctl peer add command is used to manually add a remote peer to a software multi-rail configuration. For the dynamic peer discovery capability introduced in Lustre Release 2.11.0, please see Section 9.1.5, “Dynamic Peer Discovery”.
When configuring peers, use the --prim_nid option to specify the key or primary NID of the peer node. Then follow that with the --nid option to specify a set of comma-separated NIDs.
peer add: add a peer
        --prim_nid: primary NID of the peer
        --nid: comma separated list of peer nids (e.g. 10.1.1.2@tcp0)
        --non_mr: if specified this interface is created as a non multi-rail
                  capable peer. Only one NID can be specified in this case.
For example:
lnetctl peer add --prim_nid 10.10.10.2@tcp --nid 10.10.3.3@tcp1,10.4.4.5@tcp2
The --prim_nid (primary NID for the peer node) can go unspecified. In this case, the first listed NID in the --nid option becomes the primary NID of the peer.
For example:
lnetctl peer add --nid 10.10.10.2@tcp,10.10.3.3@tcp1,10.4.4.5@tcp2
YAML can also be used to configure peers:
peer:
    - primary nid: <key or primary nid>
      Multi-Rail: True
      peer ni:
        - nid: <nid 1>
        - nid: <nid 2>
        - nid: <nid n>
As with all other commands, the result of the lnetctl peer show command can be used to gather information to aid in configuring or deleting a peer:
lnetctl peer show -v
Example output from the lnetctl peer show command:
peer:
    - primary nid: 192.168.122.218@tcp
      Multi-Rail: True
      peer ni:
        - nid: 192.168.122.218@tcp
          state: NA
          max_ni_tx_credits: 8
          available_tx_credits: 8
          available_rtr_credits: 8
          min_rtr_credits: -1
          tx_q_num_of_buf: 0
          send_count: 6819
          recv_count: 6264
          drop_count: 0
          refcount: 1
        - nid: 192.168.122.78@tcp
          state: NA
          max_ni_tx_credits: 8
          available_tx_credits: 8
          available_rtr_credits: 8
          min_rtr_credits: -1
          tx_q_num_of_buf: 0
          send_count: 7061
          recv_count: 6273
          drop_count: 0
          refcount: 1
        - nid: 192.168.122.96@tcp
          state: NA
          max_ni_tx_credits: 8
          available_tx_credits: 8
          available_rtr_credits: 8
          min_rtr_credits: -1
          tx_q_num_of_buf: 0
          send_count: 6939
          recv_count: 6286
          drop_count: 0
          refcount: 1
Use the following lnetctl command to delete a peer:
peer del: delete a peer
        --prim_nid: Primary NID of the peer
        --nid: comma separated list of peer nids (e.g. 10.1.1.2@tcp0)
prim_nid should always be specified. The prim_nid identifies the peer. If the prim_nid is the only one specified, then the entire peer is deleted.
Example of deleting a single nid of a peer (10.10.10.3@tcp):
lnetctl peer del --prim_nid 10.10.10.2@tcp --nid 10.10.10.3@tcp
Example of deleting the entire peer:
lnetctl peer del --prim_nid 10.10.10.2@tcp
Dynamic Discovery (DD) is a feature that allows nodes to dynamically discover a peer's interfaces without having to explicitly configure them. This is very useful for Multi-Rail (MR) configurations. In large clusters, there could be hundreds of nodes and having to configure MR peers on each node becomes error prone. Dynamic Discovery is enabled by default and uses a new protocol based on LNet pings to discover the interfaces of the remote peers on first message.
When LNet on a node is requested to send a message to a peer it first attempts to ping the peer. The reply to the ping contains the peer's NIDs as well as a feature bit outlining what the peer supports. Dynamic Discovery adds a Multi-Rail feature bit. If the peer is Multi-Rail capable, it sets the MR bit in the ping reply. When the node receives the reply it checks the MR bit, and if it is set it then pushes its own list of NIDs to the peer using a new PUT message, referred to as a "push ping". After this brief protocol, both the peer and the node will have each other's list of interfaces. The MR algorithm can then proceed to use the list of interfaces of the corresponding peer.
If the peer is not MR capable, it will not set the MR feature bit in the ping reply. The node will understand that the peer is not MR capable and will only use the interface provided by upper layers for sending messages.
It is possible to configure the peer manually while Dynamic Discovery is running. Manual peer configuration always takes precedence over Dynamic Discovery. If there is a discrepancy between the manual configuration and the dynamically discovered information, a warning is printed.
Dynamic Discovery is very light on the configuration side. It can only be turned on or turned off. To turn the feature on or off, the following command is used:
lnetctl set discovery [0 | 1]
To check the current discovery setting, the lnetctl global show command can be used as shown in Section 9.1.2, “Displaying Global Settings”.
A set of routes can be added to identify how LNet messages are to be routed.
lnetctl route add: add a route
        --net: net name (ex tcp0) the LNet message is destined to.
               This cannot be a local network.
        --gateway: gateway node nid (ex 10.1.1.2@tcp) to route all LNet
                   messages destined for the identified network
        --hop: number of hops to final destination (1 <= hops <= 255) (optional)
        --priority: priority of route (0 - highest prio) (optional)
Example:
lnetctl route add --net tcp2 --gateway 192.168.205.130@tcp1 --hop 2 --prio 1
Routes can be deleted via the following lnetctl command.
lnetctl route del: delete a route
        --net: net name (ex tcp0)
        --gateway: gateway nid (ex 10.1.1.2@tcp)
Example:
lnetctl route del --net tcp2 --gateway 192.168.205.130@tcp1
Configured routes can be shown via the following lnetctl command.
lnetctl route show: show routes
        --net: net name (ex tcp0) to filter on
        --gateway: gateway nid (ex 10.1.1.2@tcp) to filter on
        --hop: number of hops to final destination (1 <= hops <= 255) to filter on (-1 default)
        --priority: priority of route (0 - highest prio) to filter on (0 default)
        --verbose: display detailed output per route
Examples:
# non-detailed show
lnetctl route show
# detailed show
lnetctl route show --verbose
When showing routes, the --verbose option outputs more detailed information. All show and error output is in YAML format. Below are examples of both non-detailed and detailed route show output.
#Non-detailed output
> lnetctl route show
route:
    - net: tcp2
      gateway: 192.168.205.130@tcp1

#detailed output
> lnetctl route show --verbose
route:
    - net: tcp2
      gateway: 192.168.205.130@tcp1
      hop: 2
      priority: 1
      state: down
When an LNet node is configured as a router it will route LNet messages not destined to itself. This feature can be enabled or disabled as follows.
lnetctl set routing [0 | 1]
# 0 - disable routing feature
# 1 - enable routing feature
When routing is enabled on a node, the tiny, small and large routing buffers are allocated. See Section 34.3, “ Tuning LNet Parameters” for more details on router buffers. This information can be shown as follows:
lnetctl routing show: show routing information
Example:
lnetctl routing show
An example of the show output:
> lnetctl routing show
routing:
    - cpt[0]:
          tiny:
              npages: 0
              nbuffers: 2048
              credits: 2048
              mincredits: 2048
          small:
              npages: 1
              nbuffers: 16384
              credits: 16384
              mincredits: 16384
          large:
              npages: 256
              nbuffers: 1024
              credits: 1024
              mincredits: 1024
    - enable: 1
The configured routing buffer values specify the number of buffers in each of the tiny, small and large groups.
It is often desirable to configure the tiny, small and large routing buffers to values other than the default. These are global values; when set, they are used by all configured CPU partitions. If routing is enabled, the values set take effect immediately. If a larger number of buffers is specified, then buffers are allocated to satisfy the configuration change. If fewer buffers are configured, then the excess buffers are freed as they become unused. If routing is not set, the values are not changed. The buffer values are reset to default if routing is turned off and on.
The lnetctl 'set' command can be used to set these buffer values. A VALUE greater than 0 will set the number of buffers accordingly. A VALUE of 0 will reset the number of buffers to system defaults.
set tiny_buffers: set tiny routing buffers
        VALUE must be greater than or equal to 0
set small_buffers: set small routing buffers
        VALUE must be greater than or equal to 0
set large_buffers: set large routing buffers
        VALUE must be greater than or equal to 0
Usage examples:
> lnetctl set tiny_buffers 4096
> lnetctl set small_buffers 8192
> lnetctl set large_buffers 2048
The buffers can be set back to the default values as follows:
> lnetctl set tiny_buffers 0
> lnetctl set small_buffers 0
> lnetctl set large_buffers 0
An asymmetrical route occurs when a message from a remote peer arrives through a router that this node does not know can be used to reach that remote peer. Asymmetrical routes can be an issue when debugging the network, and allowing them also opens the door to attacks in which hostile clients inject data to the servers. It is therefore possible to activate a check in LNet that detects any asymmetrical route message and drops it.
In order to switch asymmetric route detection on or off, the following command is used:
lnetctl set drop_asym_route [0 | 1]
This command works on a per-node basis. This means each node in a Lustre cluster can decide whether it accepts asymmetrical route messages.
To check the current drop_asym_route setting, the lnetctl global show command can be used as shown in Section 9.1.2, “Displaying Global Settings”.
By default, asymmetric route detection is off.
Configuration can be described in YAML format and fed into the lnetctl utility. The lnetctl utility parses the YAML file and performs the specified operation on all entities described therein. If no operation is defined in the command as shown below, the default operation is 'add'. The YAML syntax is described in a later section.
lnetctl import FILE.yaml
lnetctl import < FILE.yaml
The 'lnetctl import' command provides three optional parameters to define the operation to be performed on the configuration items described in the YAML file.
# if no options are given to the command the "add" command is assumed
# by default.
lnetctl import --add FILE.yaml
lnetctl import --add < FILE.yaml

# to delete all items described in the YAML file
lnetctl import --del FILE.yaml
lnetctl import --del < FILE.yaml

# to show all items described in the YAML file
lnetctl import --show FILE.yaml
lnetctl import --show < FILE.yaml
The lnetctl utility provides the 'export' command to dump the current LNet configuration in YAML format:
lnetctl export FILE.yaml
lnetctl export > FILE.yaml
The lnetctl utility can dump the LNet traffic statistics as follows:
lnetctl stats show
The lnetctl utility can take in a YAML file describing the configuration items that need to be operated on and perform one of the following operations: add, delete or show on the items described therein.
Net, routing and route YAML blocks are all defined as a YAML sequence, as shown in the following sections. The stats YAML block is a YAML object. Each sequence item can take a seq_no field. This seq_no field is returned in the error block, which allows the caller to associate the error with the item that caused it. The lnetctl utility does a best effort at configuring items defined in the YAML file. It does not stop processing the file at the first error.
Below is the YAML syntax describing the various configuration elements which can be operated on via DLC. Not all YAML elements are required for all operations (add/delete/show). The system ignores elements which are not pertinent to the requested operation.
net:
    - net: <network. Ex: tcp or o2ib>
      interfaces:
          0: <physical interface>
      detail: <This is only applicable for show command. 1 - output detailed info. 0 - basic output>
      tunables:
          peer_timeout: <Integer. Timeout before consider a peer dead>
          peer_credits: <Integer. Transmit credits for a peer>
          peer_buffer_credits: <Integer. Credits available for receiving messages>
          credits: <Integer. Network Interface credits>
          SMP: <An array of integers of the form: "[x,y,...]", where each integer represents the CPT to associate the network interface with>
      seq_no: <integer. Optional. User generated, and is passed back in the YAML error block>
Both seq_no and detail fields do not appear in the show output.
routing:
    - tiny: <Integer. Tiny buffers>
      small: <Integer. Small buffers>
      large: <Integer. Large buffers>
      enable: <0 - disable routing. 1 - enable routing>
      seq_no: <Integer. Optional. User generated, and is passed back in the YAML error block>
The seq_no field does not appear in the show output.
statistics:
    seq_no: <Integer. Optional. User generated, and is passed back in the YAML error block>
The seq_no field does not appear in the show output.
route:
    - net: <network. Ex: tcp or o2ib>
      gateway: <nid of the gateway in the form <ip>@<net>. Ex: 192.168.29.1@tcp>
      hop: <an integer between 1 and 255. Optional>
      detail: <This is only applicable for show commands. 1 - output detailed info. 0 - basic output>
      seq_no: <integer. Optional. User generated, and is passed back in the YAML error block>
Both seq_no and detail fields do not appear in the show output.
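For illustration, a small hand-written file combining a net block and a route block might look like the following (the network names, interface, and gateway address are hypothetical); it could then be applied with lnetctl import --add FILE.yaml:
net:
    - net: tcp1
      interfaces:
          0: eth0
      seq_no: 1
route:
    - net: o2ib0
      gateway: 192.168.29.1@tcp1
      hop: 1
      seq_no: 2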
LNet kernel module (lnet) parameters specify how LNet is to be configured to work with Lustre, including which NICs will be configured to work with Lustre and the routing to be used with Lustre.
Parameters for LNet can be specified in the /etc/modprobe.d/lustre.conf file. In some cases the parameters may have been stored in /etc/modprobe.conf, but this has been deprecated since before RHEL5 and SLES10, and having a separate /etc/modprobe.d/lustre.conf file simplifies administration and distribution of the Lustre networking configuration. This file contains one or more entries with the syntax:
options lnet parameter=value
To specify the network interfaces that are to be used for Lustre, set either the networks parameter or the ip2nets parameter (only one of these parameters can be used at a time):
networks - Specifies the networks to be used.
ip2nets - Lists globally-available networks, each with a range of IP addresses. LNet then identifies locally-available networks through address list-matching lookup.
See Section 9.3, “Setting the LNet Module networks Parameter” and Section 9.4, “Setting the LNet Module ip2nets Parameter” for more details.
To set up routing between networks, use:
routes - Lists networks and the NIDs of routers that forward to them.
See Section 9.5, “Setting the LNet Module routes Parameter” for more details.
A router checker can be configured to enable Lustre nodes to detect router health status, avoid routers that appear dead, and reuse those that restore service after failures. See Section 9.7, “Configuring the Router Checker” for more details.
For a complete reference to the LNet module parameters, see Chapter 43, Configuration Files and Module Parameters (LNet Options).
We recommend that you use 'dotted-quad' notation for IP addresses rather than host names to make it easier to read debug logs and debug configurations with multiple interfaces.
A Lustre network identifier (NID) is used to uniquely identify a Lustre network endpoint by node ID and network type. The format of the NID is:
network_id@network_type
Examples are:
10.67.73.200@tcp0
10.67.75.100@o2ib
The first entry above identifies a TCP/IP node, while the second entry identifies an InfiniBand node.
When a mount command is run on a client, the client uses the NID of the MDS to retrieve configuration information. If an MDS has more than one NID, the client should use the appropriate NID for its local network.
To determine the appropriate NID to specify in the mount command, use the lctl command. To display MDS NIDs, run on the MDS:
lctl list_nids
To determine if a client can reach the MDS using a particular NID, run on the client:
lctl which_nid MDS_NID
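For example (the NIDs shown are illustrative only), the exchange might look like this:
mds# lctl list_nids
192.168.0.10@tcp0
10.0.10.10@o2ib0
client# lctl which_nid 192.168.0.10@tcp0 10.0.10.10@o2ib0
192.168.0.10@tcp0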
If a node has more than one network interface, you'll typically want to dedicate a specific interface to Lustre. You can do this by including an entry in the lustre.conf file on the node that sets the LNet module networks parameter:
options lnet networks=comma-separated list of networks
This example specifies that a Lustre node will use a TCP/IP interface and an InfiniBand interface:
options lnet networks=tcp0(eth0),o2ib(ib0)
This example specifies that the Lustre node will use the TCP/IP interface eth1:
options lnet networks=tcp0(eth1)
Depending on the network design, it may be necessary to specify explicit interfaces. To explicitly specify that interface eth2 be used for network tcp0 and eth3 be used for tcp1, use this entry:
options lnet networks=tcp0(eth2),tcp1(eth3)
When more than one interface is available during the network setup, Lustre chooses the best route based on the hop count. Once the network connection is established, Lustre expects the network to stay connected. In a Lustre network, connections do not fail over to another interface, even if multiple interfaces are available on the same node.
LNet lines in lustre.conf are only used by the local node to determine what to call its interfaces. They are not used for routing decisions.
If a server with multiple IP addresses (multihome server) is connected to a Lustre network, certain configuration settings are required. An example illustrating these settings consists of a network with the following nodes:
Server svr1 with three TCP NICs (eth0, eth1, and eth2) and an InfiniBand NIC.
Server svr2 with three TCP NICs (eth0, eth1, and eth2) and an InfiniBand NIC. Interface eth2 will not be used for Lustre networking.
TCP clients, each with a single TCP interface.
InfiniBand clients, each with a single InfiniBand interface and a TCP/IP interface for administration.
To set the networks option for this example:
On each server, svr1 and svr2, include the following line in the lustre.conf file:
options lnet networks=tcp0(eth0),tcp1(eth1),o2ib
For TCP-only clients, the first available non-loopback IP interface is used for tcp0. Thus, TCP clients with only one interface do not need to have options defined in the lustre.conf file.
On the InfiniBand clients, include the following line in the lustre.conf file:
options lnet networks=o2ib
By default, Lustre ignores the loopback (lo0) interface. Lustre does not ignore IP addresses aliased to the loopback. If you alias IP addresses to the loopback interface, you must specify all Lustre networks using the LNet networks parameter.
If the server has multiple interfaces on the same subnet, the Linux kernel will send all traffic using the first configured interface. This is a limitation of Linux, not Lustre. In this case, network interface bonding should be used. For more information about network interface bonding, see Chapter 7, Setting Up Network Interface Bonding.
The ip2nets option is typically used when a single, universal lustre.conf file is run on all servers and clients. Each node identifies the locally available networks based on the listed IP address patterns that match the node's local IP addresses.
Note that the IP address patterns listed in the ip2nets option are only used to identify the networks that an individual node should instantiate. They are not used by LNet for any other communications purpose.
For the example below, the nodes in the network have these IP addresses:
Server svr1: eth0 IP address 192.168.0.2, IP over InfiniBand (o2ib) address 132.6.1.2.
Server svr2: eth0 IP address 192.168.0.4, IP over InfiniBand (o2ib) address 132.6.1.4.
TCP clients have IP addresses 192.168.0.5-255.
InfiniBand clients have IP over InfiniBand (o2ib) addresses 132.6.[2-3].2, .4, .6, .8.
The following entry is placed in the lustre.conf file on each server and client:
options lnet 'ip2nets="tcp0(eth0) 192.168.0.[2,4]; \ tcp0 192.168.0.*; o2ib0 132.6.[1-3].[2-8/2]"'
Each entry in ip2nets is referred to as a 'rule'.
The order of LNet entries is important when configuring servers. If a server node can be reached using more than one network, the first network specified in lustre.conf will be used.
Because svr1 and svr2 match the first rule, LNet uses eth0 for tcp0 on those machines. (Although svr1 and svr2 also match the second rule, the first matching rule for a particular network is used).
The [2-8/2] format indicates a range of 2-8 stepped by 2; that is 2,4,6,8. Thus, the clients at 132.6.3.5 will not find a matching o2ib network.
Multi-Rail deprecates the kernel parsing of ip2nets. ip2nets patterns are matched in user space and translated into network interfaces to be added into the system.
The first interface that matches the IP pattern will be used when adding a network interface.
If an interface is explicitly specified as well as a pattern, the interface matched using the IP pattern will be sanitized against the explicitly-defined interface.
For example, given the pattern tcp(eth0) 192.168.*.3 and a system where eth0 == 192.158.19.3 and eth1 == 192.168.3.3, the configuration will fail, because the pattern contradicts the interface specified.
A clear warning will be displayed if inconsistent configuration is encountered.
You could use the following command to configure ip2nets:
lnetctl import < ip2nets.yaml
For example:
ip2nets:
  - net-spec: tcp1
    interfaces:
         0: eth0
         1: eth1
    ip-range:
         0: 192.168.*.19
         1: 192.168.100.105
  - net-spec: tcp2
    interfaces:
         0: eth2
    ip-range:
         0: 192.168.*.*
The LNet module routes parameter is used to identify routers in a Lustre configuration. These parameters are set in modprobe.conf on each Lustre node.
Routes are typically set to connect to segregated subnetworks or to cross connect two different types of networks, such as tcp and o2ib.
The LNet routes parameter specifies a colon-separated list of router definitions. Each route is defined as a network number, followed by a list of routers:
routes=net_type router_NID(s)
This example specifies bi-directional routing in which TCP clients can reach Lustre resources on the IB networks and IB servers can access the TCP networks:
options lnet 'ip2nets="tcp0 192.168.0.*; \ o2ib0(ib0) 132.6.1.[1-128]"' 'routes="tcp0 132.6.1.[1-8]@o2ib0; \ o2ib0 192.16.8.0.[1-8]@tcp0"'
All LNet routers that bridge two networks are equivalent. They are not configured as primary or secondary, and the load is balanced across all available routers.
The number of LNet routers is not limited. Enough routers should be used to handle the required file serving bandwidth plus a 25 percent margin for headroom.
On the clients, place the following entry in the lustre.conf file:
lnet networks="tcp" routes="o2ib0 192.168.0.[1-8]@tcp0"
On the router nodes, use:
lnet networks="tcp o2ib" forwarding=enabled
On the MDS, use the reverse as shown below:
lnet networks="o2ib0" routes="tcp0 132.6.1.[1-8]@o2ib0"
To start the routers, run:
modprobe lnet
lctl network configure
After configuring Lustre Networking, it is highly recommended that you test your LNet configuration using the LNet Self-Test provided with the Lustre software. For more information about using LNet Self-Test, see Chapter 32, Testing Lustre Network Performance (LNet Self-Test).
In a Lustre configuration in which different types of networks, such as a TCP/IP network and an Infiniband network, are connected by routers, a router checker can be run on the clients and servers in the routed configuration to monitor the status of the routers. In a multi-hop routing configuration, router checkers can be configured on routers to monitor the health of their next-hop routers.
A router checker is configured by setting LNet parameters in lustre.conf by including an entry in this form:
options lnet router_checker_parameter=value
The router checker parameters are:
live_router_check_interval - Specifies a time interval in seconds after which the router checker will ping the live routers. The default value is 0, meaning no checking is done. To set the value to 60, enter:
options lnet live_router_check_interval=60
dead_router_check_interval - Specifies a time interval in seconds after which the router checker will check for dead routers. The default value is 0, meaning no checking is done. To set the value to 60, enter:
options lnet dead_router_check_interval=60
auto_down - Enables/disables (1/0) the automatic marking of router state as up or down. The default value is 1. To disable router marking, enter:
options lnet auto_down=0
router_ping_timeout - Specifies a timeout for the router checker when it checks live or dead routers. The router checker sends a ping message to each dead or live router once every dead_router_check_interval or live_router_check_interval respectively. The default value is 50. To set the value to 60, enter:
options lnet router_ping_timeout=60
The router_ping_timeout is consistent with the default LND timeouts. You may have to increase it on very large clusters if the LND timeout is also increased. For larger clusters, we suggest increasing the check interval.
check_routers_before_use - Specifies that routers are to be checked before use. Set to off by default. If this parameter is set to on, the dead_router_check_interval parameter must be given a positive integer value.
options lnet check_routers_before_use=on
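As a sketch, the router checker parameters above can be combined into a single entry in /etc/modprobe.d/lustre.conf; the values below are examples only, not recommendations:
options lnet live_router_check_interval=60 dead_router_check_interval=60 router_ping_timeout=60 check_routers_before_use=1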
The router checker obtains the following information from each router:
Time the router was disabled
Elapsed disable time
If the router checker does not get a reply message from the router within router_ping_timeout seconds, it considers the router to be down.
If a router is marked 'up' and responds to a ping, the timeout is reset.
If 100 packets have been sent successfully through a router, the sent-packets counter for that router will have a value of 100.
For the networks, ip2nets, and routes options, follow these best practices to avoid configuration errors.
Depending on the Linux distribution, commas may need to be escaped using single or double quotes. In the extreme case, the options entry would look like this:
options lnet 'networks="tcp0,elan0"' 'routes="tcp [2,10]@elan0"'
Added quotes may confuse some distributions. Messages such as the following may indicate an issue related to added quotes:
lnet: Unknown parameter 'networks'
A 'Refusing connection - no matching NID' message generally points to an error in the LNet module configuration.
Place the semicolon terminating a comment immediately after the comment. LNet silently ignores everything between the # character at the beginning of the comment and the next semicolon.
In this incorrect example, LNet silently ignores pt11 192.168.0.[92,96], resulting in these nodes not being properly initialized. No error message is generated.
options lnet ip2nets="pt10 192.168.0.[89,93]; # comment with semicolon BEFORE comment \
pt11 192.168.0.[92,96];
This correct example shows the required syntax:
options lnet ip2nets="pt10 192.168.0.[89,93] \
# comment with semicolon AFTER comment; \
pt11 192.168.0.[92,96] # comment
Do not add an excessive number of comments. The Linux kernel limits the length of character strings used in module options (usually to 1KB, but this may differ between vendor kernels). If you exceed this limit, errors result and the specified configuration may not be processed correctly.
This chapter shows how to configure a simple Lustre file system comprised of a combined MGS/MDT, an OST and a client. It includes:
A Lustre file system can be set up in a variety of configurations by using the administrative utilities provided with the Lustre software. The procedure below shows how to configure a simple Lustre file system consisting of a combined MGS/MDS, one OSS with two OSTs, and a client. For an overview of the entire Lustre installation procedure, see Chapter 4, Installation Overview.
This configuration procedure assumes you have completed the following:
Set up and configured your hardware. For more information about hardware requirements, see Chapter 5, Determining Hardware Configuration Requirements and Formatting Options.
Downloaded and installed the Lustre software. For more information about preparing for and installing the Lustre software, see Chapter 8, Installing the Lustre Software.
The following optional steps should also be completed, if needed, before the Lustre software is configured:
Set up a hardware or software RAID on block devices to be used as OSTs or MDTs. For information about setting up RAID, see the documentation for your RAID controller or Chapter 6, Configuring Storage on a Lustre File System.
Set up network interface bonding on Ethernet interfaces. For information about setting up network interface bonding, see Chapter 7, Setting Up Network Interface Bonding.
Set lnet module parameters to specify how Lustre Networking (LNet) is to be configured to work with a Lustre file system and test the LNet configuration. LNet will, by default, use the first TCP/IP interface it discovers on a system. If this network configuration is sufficient, you do not need to configure LNet. LNet configuration is required if you are using InfiniBand or multiple Ethernet interfaces.
For information about configuring LNet, see Chapter 9, Configuring Lustre Networking (LNet). For information about testing LNet, see Chapter 32, Testing Lustre Network Performance (LNet Self-Test).
Run the benchmark script sgpdd-survey to determine baseline performance of your hardware. Benchmarking your hardware will simplify debugging performance issues that are unrelated to the Lustre software and ensure you are getting the best possible performance with your installation. For information about running sgpdd-survey, see Chapter 33, Benchmarking Lustre File System Performance (Lustre I/O Kit).
The sgpdd-survey script overwrites the device being tested, so it must be run before the OSTs are configured.
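A hypothetical invocation, assuming the environment variables accepted by the copy of sgpdd-survey shipped with your Lustre I/O kit (the device name is illustrative and its contents will be destroyed):
size=8 crghi=16 thrhi=32 scsidevs="/dev/sdb" ./sgpdd-survey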
To configure a simple Lustre file system, complete these steps:
Create a combined MGS/MDT file system on a block device. On the MDS node, run:
mkfs.lustre --fsname=fsname --mgs --mdt --index=0 /dev/block_device
The default file system name (fsname) is lustre.
If you plan to create multiple file systems, the MGS should be created separately on its own dedicated block device, by running:
mkfs.lustre --fsname=fsname --mgs /dev/block_device
See Section 13.8, “Running Multiple Lustre File Systems” for more details.
Optionally add in additional MDTs.
mkfs.lustre --fsname=fsname --mgsnode=nid --mdt --index=1 /dev/block_device
Up to 4095 additional MDTs can be added.
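For instance, reusing the file system name and MGS NID from the example later in this chapter (the device and mount point here are hypothetical), a second MDT could be formatted and mounted as follows:
mds2# mkfs.lustre --fsname=temp --mgsnode=10.2.0.1@tcp0 --mdt --index=1 /dev/sdc
mds2# mount -t lustre /dev/sdc /mnt/mdt1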
Mount the combined MGS/MDT file system on the block device. On the MDS node, run:
mount -t lustre /dev/block_device /mount_point
If you have created an MGS and an MDT on separate block devices, mount them both.
Create the OST. On the OSS node, run:
mkfs.lustre --fsname=fsname --mgsnode=MGS_NID --ost --index=OST_index /dev/block_device
When you create an OST, you are formatting an ldiskfs or ZFS file system on a block storage device, as you would with any local file system.
You can have as many OSTs per OSS as the hardware or drivers allow. For more information about storage and memory requirements for a Lustre file system, see Chapter 5, Determining Hardware Configuration Requirements and Formatting Options.
You can only configure one OST per block device. You should create an OST that uses the raw block device and does not use partitioning.
You should specify the OST index number at format time in order to simplify translating the OST number in error messages or file striping to the OSS node and block device later on.
If you are using block devices that are accessible from multiple OSS nodes, ensure that you mount the OSTs from only one OSS node at a time. It is strongly recommended that multiple-mount protection be enabled for such devices to prevent serious data corruption. For more information about multiple-mount protection, see Chapter 24, Lustre File System Failover and Multiple-Mount Protection.
The Lustre software currently supports block devices up to 128 TB on Red Hat Enterprise Linux 5 and 6 (up to 8 TB on other distributions). If the device size is only slightly larger than 16 TB, it is recommended that you limit the file system size to 16 TB at format time. We recommend that you not place DOS partitions on top of RAID 5/6 block devices due to negative impacts on performance, but instead format the whole disk for the file system.
Mount the OST. On the OSS node where the OST was created, run:
mount -t lustre /dev/block_device /mount_point
Mount the Lustre file system on the client. On the client node, run:
mount -t lustreMGS_node
:/fsname
/mount_point
To mount the filesystem on additional clients, repeat Step 6.
If you have a problem mounting the file system, check the syslogs on the client and all the servers for errors and also check the network settings. A common issue with newly-installed systems is that hosts.deny or firewall rules may prevent connections on port 988.
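Two quick checks from the client can help narrow this down (the NID and IP address shown are illustrative): verify LNet connectivity to the server, and confirm that TCP port 988 is reachable.
client# lctl ping 10.2.0.1@tcp0
client# telnet 10.2.0.1 988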
Verify that the file system started and is working correctly. Do this by running lfs df, dd and ls commands on the client node.
(Optional) Run benchmarking tools to validate the performance of hardware and software layers in the cluster. Available tools include:
obdfilter-survey - Characterizes the storage performance of a Lustre file system. For details, see Section 33.3, “Testing OST Performance (obdfilter-survey)”.
ost-survey - Performs I/O against OSTs to detect anomalies between otherwise identical disk subsystems. For details, see Section 33.4, “Testing OST I/O Performance (ost-survey)”.
To see the steps to complete for a simple Lustre file system configuration, follow this example in which a combined MGS/MDT and two OSTs are created to form a file system called temp. Three block devices are used, one for the combined MGS/MDS node and one for each OSS node. Common parameters used in the example are listed below, along with individual node parameters.
Common Parameters | Value | Description
---|---|---
MGS node | 10.2.0.1@tcp0 | Node for the combined MGS/MDS
file system | temp | Name of the Lustre file system
network type | TCP/IP | Network type used for Lustre file system temp
Node Parameters | Value | Description
---|---|---
MGS/MDS node | |
MGS/MDS node | mds | MDS in Lustre file system temp
block device | /dev/sdb | Block device for the combined MGS/MDS node
mount point | /mnt/mdt | Mount point for the block device /dev/sdb on the MGS/MDS node
First OSS node | |
OSS node | oss0 | First OSS node in Lustre file system temp
OST | ost0 | First OST in Lustre file system temp
block device | /dev/sdc | Block device for the first OSS node (oss0)
mount point | /mnt/ost0 | Mount point for the block device /dev/sdc on the oss0 node
Second OSS node | |
OSS node | oss1 | Second OSS node in Lustre file system temp
OST | ost1 | Second OST in Lustre file system temp
block device | /dev/sdd | Block device for the second OSS node (oss1)
mount point | /mnt/ost1 | Mount point for the block device /dev/sdd on the oss1 node
Client node | |
client node | client1 | Client in Lustre file system temp
mount point | /lustre | Mount point for Lustre file system temp on the client1 node
We recommend that you use 'dotted-quad' notation for IP addresses rather than host names to make it easier to read debug logs and debug configurations with multiple interfaces.
For this example, complete the steps below:
Create a combined MGS/MDT file system on the block device. On the MDS node, run:
[root@mds /]# mkfs.lustre --fsname=temp --mgs --mdt --index=0 /dev/sdb
This command generates this output:
   Permanent disk data:
Target:            temp-MDT0000
Index:             0
Lustre FS:         temp
Mount type:        ldiskfs
Flags:             0x75
                   (MDT MGS first_time update )
Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
Parameters: mdt.identity_upcall=/usr/sbin/l_getidentity

checking for existing Lustre data: not found
device size = 16MB
2 6 18
formatting backing filesystem ldiskfs on /dev/sdb
        target name  temp-MDTffff
        4k blocks    0
        options      -i 4096 -I 512 -q -O dir_index,uninit_groups -F
mkfs_cmd = mkfs.ext2 -j -b 4096 -L temp-MDTffff -i 4096 -I 512 -q -O
dir_index,uninit_groups -F /dev/sdb
Writing CONFIGS/mountdata
Mount the combined MGS/MDT file system on the block device. On the MDS node, run:
[root@mds /]# mount -t lustre /dev/sdb /mnt/mdt
This command generates this output:
Lustre: temp-MDT0000: new disk, initializing
Lustre: 3009:0:(lproc_mds.c:262:lprocfs_wr_identity_upcall()) temp-MDT0000:
group upcall set to /usr/sbin/l_getidentity
Lustre: temp-MDT0000.mdt: set parameter identity_upcall=/usr/sbin/l_getidentity
Lustre: Server temp-MDT0000 on device /dev/sdb has started
In this example, the OSTs (ost0 and ost1) are being created on different OSS nodes (oss0 and oss1 respectively).
Create ost0. On oss0 node, run:
[root@oss0 /]# mkfs.lustre --fsname=temp --mgsnode=10.2.0.1@tcp0 --ost --index=0 /dev/sdc
The command generates this output:
   Permanent disk data:
Target:            temp-OST0000
Index:             0
Lustre FS:         temp
Mount type:        ldiskfs
Flags:             0x72
                   (OST first_time update)
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=10.2.0.1@tcp

checking for existing Lustre data: not found
device size = 16MB
2 6 18
formatting backing filesystem ldiskfs on /dev/sdc
        target name  temp-OST0000
        4k blocks    0
        options      -I 256 -q -O dir_index,uninit_groups -F
mkfs_cmd = mkfs.ext2 -j -b 4096 -L temp-OST0000 -I 256 -q -O
dir_index,uninit_groups -F /dev/sdc
Writing CONFIGS/mountdata
Mount ost0 on the OSS on which it was created. On oss0 node, run:
[root@oss0 /]# mount -t lustre /dev/sdc /mnt/ost0
The command generates this output:
LDISKFS-fs: file extents enabled
LDISKFS-fs: mballoc enabled
Lustre: temp-OST0000: new disk, initializing
Lustre: Server temp-OST0000 on device /dev/sdb has started
Shortly afterwards, this output appears:
Lustre: temp-OST0000: received MDS connection from 10.2.0.1@tcp0
Lustre: MDS temp-MDT0000: temp-OST0000_UUID now active, resetting orphans
Create and mount ost1.
Create ost1. On oss1 node, run:
[root@oss1 /]# mkfs.lustre --fsname=temp --mgsnode=10.2.0.1@tcp0 \
           --ost --index=1 /dev/sdd
The command generates this output:
   Permanent disk data:
Target:            temp-OST0001
Index:             1
Lustre FS:         temp
Mount type:        ldiskfs
Flags:             0x72
                   (OST first_time update)
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=10.2.0.1@tcp

checking for existing Lustre data: not found
device size = 16MB
2 6 18
formatting backing filesystem ldiskfs on /dev/sdd
        target name  temp-OST0001
        4k blocks    0
        options      -I 256 -q -O dir_index,uninit_groups -F
mkfs_cmd = mkfs.ext2 -j -b 4096 -L temp-OST0001 -I 256 -q -O
dir_index,uninit_groups -F /dev/sdc
Writing CONFIGS/mountdata
Mount ost1 on the OSS on which it was created. On oss1 node, run:
[root@oss1 /]# mount -t lustre /dev/sdd /mnt/ost1
The command generates this output:
LDISKFS-fs: file extents enabled
LDISKFS-fs: mballoc enabled
Lustre: temp-OST0001: new disk, initializing
Lustre: Server temp-OST0001 on device /dev/sdb has started
Shortly afterwards, this output appears:
Lustre: temp-OST0001: received MDS connection from 10.2.0.1@tcp0
Lustre: MDS temp-MDT0000: temp-OST0001_UUID now active, resetting orphans
Mount the Lustre file system on the client. On the client node, run:
[root@client1 /]# mount -t lustre 10.2.0.1@tcp0:/temp /lustre
This command generates this output:
Lustre: Client temp-client has started
Verify that the file system started and is working by running the df, dd and ls commands on the client node.
Run the lfs df -h command:
[root@client1 /] lfs df -h
The lfs df -h command lists space usage per OST and the MDT in human-readable format. This command generates output similar to this:
UUID                 bytes    Used     Available  Use%  Mounted on
temp-MDT0000_UUID    8.0G     400.0M   7.6G       0%    /lustre[MDT:0]
temp-OST0000_UUID    800.0G   400.0M   799.6G     0%    /lustre[OST:0]
temp-OST0001_UUID    800.0G   400.0M   799.6G     0%    /lustre[OST:1]
filesystem summary:  1.6T     800.0M   1.6T       0%    /lustre
Run the lfs df -ih command.
[root@client1 /] lfs df -ih
The lfs df -ih command lists inode usage per OST and the MDT. This command generates output similar to this:
UUID                 Inodes   IUsed   IFree   IUse%  Mounted on
temp-MDT0000_UUID    2.5M     32      2.5M    0%     /lustre[MDT:0]
temp-OST0000_UUID    5.5M     54      5.5M    0%     /lustre[OST:0]
temp-OST0001_UUID    5.5M     54      5.5M    0%     /lustre[OST:1]
filesystem summary:  2.5M     32      2.5M    0%     /lustre
Run the dd command:
[root@client1 /] cd /lustre
[root@client1 /lustre] dd if=/dev/zero of=/lustre/zero.dat bs=4M count=2
The dd command verifies write functionality by creating a file containing all zeros (0s). In this command, an 8 MB file is created. This command generates output similar to this:
2+0 records in
2+0 records out
8388608 bytes (8.4 MB) copied, 0.159628 seconds, 52.6 MB/s
Run the ls command:
[root@client1 /lustre] ls -lsah
The ls -lsah command lists files and directories in the current working directory. This command generates output similar to this:
total 8.0M
4.0K drwxr-xr-x  2 root root 4.0K Oct 16 15:27 .
8.0K drwxr-xr-x 25 root root 4.0K Oct 16 15:27 ..
8.0M -rw-r--r--  1 root root 8.0M Oct 16 15:27 zero.dat
Once the Lustre file system is configured, it is ready for use.
This section describes how to scale the Lustre file system or make configuration changes using the Lustre configuration utilities.
A Lustre file system can be scaled by adding OSTs or clients. For instructions on creating additional OSTs, repeat Step 3 and Step 5 above. For mounting additional clients, repeat Step 6 for each client.
The default settings for the file layout stripe pattern are shown in Table 10.1, “Default stripe pattern”.
Table 10.1. Default stripe pattern
File Layout Parameter | Default | Description
---|---|---
stripe_size | 1 MB | Amount of data to write to one OST before moving to the next OST.
stripe_count | 1 | The number of OSTs to use for a single file.
stripe_offset | -1 | The first OST where objects are created for each file. The default -1 allows the MDS to choose the starting index based on available space and load balancing. It's strongly recommended not to change the default for this parameter to a value other than -1.
Use the lfs setstripe command described in Chapter 19, Managing File Layout (Striping) and Free Space to change the file layout configuration.
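As a brief illustration (the directory path and values are hypothetical, not recommendations), a new default layout could be applied to a directory and then verified:
client# lfs setstripe -c 4 -S 4M /mnt/lustre/stripedir
client# lfs getstripe /mnt/lustre/stripedir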
If additional configuration is necessary, several configuration utilities are available:
mkfs.lustre - Use to format a disk for a Lustre service.
tunefs.lustre - Use to modify configuration information on a Lustre target disk.
lctl - Use to directly control Lustre features via an ioctl interface, allowing various configuration, maintenance and debugging features to be accessed.
mount.lustre - Use to start a Lustre client or target service.
For examples using these utilities, see Chapter 44, System Configuration Utilities.
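For example (the device name and NID are hypothetical), tunefs.lustre can display the configuration stored on a target without changing it, or add a failover service node to it:
oss# tunefs.lustre --dryrun /dev/sdc
oss# tunefs.lustre --servicenode=192.168.10.8@tcp /dev/sdc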
The lfs utility is useful for configuring and querying a variety of options related to files. For more information, see Chapter 40, User Utilities.
Some sample scripts are included in the directory where the Lustre software is installed. If you have installed the Lustre source code, the scripts are located in the lustre/tests sub-directory. These scripts enable quick setup of some simple standard Lustre configurations.
This chapter describes how to configure failover in a Lustre file system. It includes:
For an overview of failover functionality in a Lustre file system, see Chapter 3, Understanding Failover in a Lustre File System.
The Lustre software provides failover mechanisms only at the layer of the Lustre file system. No failover functionality is provided for system-level components such as failing hardware or applications, or even for the entire failure of a node, as would typically be provided in a complete failover solution. Failover functionality such as node monitoring, failure detection, and resource fencing must be provided by external HA software, such as PowerMan or the open source Corosync and Pacemaker packages provided by Linux operating system vendors. Corosync provides support for detecting failures, and Pacemaker provides the actions to take once a failure has been detected.
Failover in a Lustre file system requires the use of a remote power control (RPC) mechanism, which comes in different configurations. For example, Lustre server nodes may be equipped with IPMI/BMC devices that allow remote power control. For recommended devices, refer to the list of supported RPC devices on the website for the PowerMan cluster power management utility:
Lustre failover requires RPC and management capability to verify that a failed node is off before I/O is directed to the failover node. This avoids double-mounting the two nodes and the risk of unrecoverable data corruption. A variety of power management tools will work. Two packages that have been commonly used with the Lustre software are PowerMan and Pacemaker.
The PowerMan cluster power management utility is used to control RPC devices from a central location. PowerMan provides native support for several RPC varieties and Expect-like configuration simplifies the addition of new devices. The latest versions of PowerMan are available at:
https://github.com/chaos/powerman
STONITH, or "Shoot The Other Node In The Head" is used in conjunction with High Availability node management. This is implemented by Pacemaker to ensure that a peer node that may be importing a shared storage device has been powered off and will not corrupt the shared storage if it continues running.
The Lustre file system must be set up with high-availability (HA) software to enable a complete Lustre failover solution. Except for PowerMan, the HA software packages mentioned above provide both power management and cluster management. For information about setting up failover with Pacemaker, see:
Pacemaker Project website: https://clusterlabs.org/
Article Using Pacemaker with a Lustre File System : https://wiki.whamcloud.com/display/PUB/Using+Pacemaker+with+a+Lustre+File+System
To prepare a Lustre file system to be configured and managed as an HA system by a third-party HA application, each storage target (MGT, MDT, OST) must be associated with a second node to create a failover pair. This configuration information is then communicated by the MGS to a client when the client mounts the file system.
The per-target configuration is relayed to the MGS at mount time. Some rules related to this are:
When a target is initially mounted, the MGS reads the configuration information from the target (such as mgt vs. ost, failnode, fsname) to configure the target into a Lustre file system. If the MGS is reading the initial mount configuration, the mounting node becomes that target's "primary" node.
When a target is subsequently mounted, the MGS reads the current configuration from the target and, as needed, will reconfigure the MGS database target information.
When the target is formatted using the mkfs.lustre command, the failover service node(s) for the target are designated using the --servicenode option. In the example below, an OST with index 0 in the file system testfs is formatted with two service nodes designated to serve as a failover pair:
mkfs.lustre --reformat --ost --fsname testfs --mgsnode=192.168.10.1@o3ib \
            --index=0 --servicenode=192.168.10.7@o2ib \
            --servicenode=192.168.10.8@o2ib \
            /dev/sdb
More than two potential service nodes can be designated for a target. The target can then be mounted on any of the designated service nodes.
When HA is configured on a storage target, the Lustre software enables multi-mount protection (MMP) on that storage target. MMP prevents multiple nodes from simultaneously mounting and thus corrupting the data on the target. For more about MMP, see Chapter 24, Lustre File System Failover and Multiple-Mount Protection.
If the MGT has been formatted with multiple service nodes designated, this information must be conveyed to the Lustre client in the mount command used to mount the file system. In the example below, NIDs for two MGSs that have been designated as service nodes for the MGT are specified in the mount command executed on the client:
mount -t lustre 10.10.120.1@tcp1:10.10.120.2@tcp1:/testfs /lustre/testfs
When a client mounts the file system, the MGS provides configuration information to the client for the MDT(s) and OST(s) in the file system along with the NIDs for all service nodes associated with each target and the service node on which the target is mounted. Later, when the client attempts to access data on a target, it will try the NID for each specified service node until it connects to the target.
For additional information about administering failover features in a Lustre file system, see:
Part III provides information about tools and procedures to use to administer a Lustre file system. You will find information in this section about:
The starting point for administering a Lustre file system is to monitor all logs and console logs for system health:
- Monitor logs on all servers and all clients.
- Invest in tools that allow you to condense logs from multiple systems.
- Use the logging resources provided in the Linux distribution.
This chapter provides information on monitoring a Lustre file system and includes the following sections:
Section 12.1, “Lustre Changelogs”
Section 12.2, “Lustre Jobstats”
Section 12.3, “Lustre Monitoring Tool (LMT)”
Section 12.4, “CollectL”
Section 12.5, “Other Monitoring Options”
The changelogs feature records events that change the file system namespace or file metadata. Changes such as file creation, deletion, renaming, attribute changes, etc. are recorded with the target and parent file identifiers (FIDs), the name of the target, a timestamp, and user information. These records can be used for a variety of purposes:
Capture recent changes to feed into an archiving system.
Use changelog entries to exactly replicate changes in a file system mirror.
Set up "watch scripts" that take action on certain events or directories.
Audit activity on Lustre, thanks to user information associated with file/directory changes and timestamps.
Changelogs record types are:
Value | Description
---|---
MARK | Internal recordkeeping
CREAT | Regular file creation
MKDIR | Directory creation
HLINK | Hard link
SLINK | Soft link
MKNOD | Other file creation
UNLNK | Regular file removal
RMDIR | Directory removal
RENME | Rename, original
RNMTO | Rename, final
OPEN * | Open
CLOSE | Close
LYOUT | Layout change
TRUNC | Regular file truncated
SATTR | Attribute change
XATTR | Extended attribute change (setxattr)
HSM | HSM specific event
MTIME | MTIME change
CTIME | CTIME change
ATIME * | ATIME change
MIGRT | Migration event
FLRW | File Level Replication: file initially written
RESYNC | File Level Replication: file re-synced
GXATR * | Extended attribute access (getxattr)
NOPEN * | Denied open
Event types marked with * are not recorded by default. Refer to Section 12.1.2.7, “Setting the Changelog Mask” for instructions on modifying the Changelogs mask.
FID-to-full-pathname and pathname-to-FID functions are also included to map target and parent FIDs into the file system namespace.
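For example (the path and FID shown are illustrative), the lfs path2fid and lfs fid2path commands perform these mappings from a client:
client# lfs path2fid /mnt/lustre/mydir/foo/file
[0x200000402:0x4:0x0]
client# lfs fid2path /mnt/lustre '[0x200000402:0x4:0x0]'
/mnt/lustre/mydir/foo/file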
Several commands are available to work with changelogs.
Because changelog records take up space on the MDT, the system administrator must register changelog users. As soon as a changelog user is registered, the changelogs feature is enabled. The registrants specify which records they are "done with", and the system purges up to the greatest common record.
To register a new changelog user, run:
mds# lctl --device fsname-MDTnumber changelog_register
Changelog entries are not purged beyond a registered user's set point (see lfs changelog_clear).
To display the metadata changes on an MDT (the changelog records), run:
client# lfs changelog fsname-MDTnumber [startrec [endrec]]
Specifying the start and end records is optional.
These are sample changelog records:
1 02MKDIR 15:15:21.977666834 2018.01.09 0x0 t=[0x200000402:0x1:0x0] j=mkdir.500 ef=0xf \
u=500:500 nid=10.128.11.159@tcp p=[0x200000007:0x1:0x0] pics
2 01CREAT 15:15:36.687592024 2018.01.09 0x0 t=[0x200000402:0x2:0x0] j=cp.500 ef=0xf \
u=500:500 nid=10.128.11.159@tcp p=[0x200000402:0x1:0x0] chloe.jpg
3 06UNLNK 15:15:41.305116815 2018.01.09 0x1 t=[0x200000402:0x2:0x0] j=rm.500 ef=0xf \
u=500:500 nid=10.128.11.159@tcp p=[0x200000402:0x1:0x0] chloe.jpg
4 07RMDIR 15:15:46.468790091 2018.01.09 0x1 t=[0x200000402:0x1:0x0] j=rmdir.500 ef=0xf \
u=500:500 nid=10.128.11.159@tcp p=[0x200000007:0x1:0x0] pics
To clear old changelog records for a specific user (records that the user no longer needs), run:
client# lfs changelog_clear mdt_name userid endrec
The changelog_clear command indicates that changelog records previous to endrec are no longer of interest to a particular user userid, potentially allowing the MDT to free up disk space. An endrec value of 0 indicates the current last record. To run changelog_clear, the changelog user must be registered on the MDT node using lctl.
When all changelog users are done with records < X, the records are deleted.
This section provides examples of different changelog commands.
To register a new changelog user for a device (lustre-MDT0000):
mds# lctl --device lustre-MDT0000 changelog_register
lustre-MDT0000: Registered changelog userid 'cl1'
To display changelog records for an MDT (e.g. lustre-MDT0000):
client# lfs changelog lustre-MDT0000
1 02MKDIR 15:15:21.977666834 2018.01.09 0x0 t=[0x200000402:0x1:0x0] ef=0xf \
u=500:500 nid=10.128.11.159@tcp p=[0x200000007:0x1:0x0] pics
2 01CREAT 15:15:36.687592024 2018.01.09 0x0 t=[0x200000402:0x2:0x0] ef=0xf \
u=500:500 nid=10.128.11.159@tcp p=[0x200000402:0x1:0x0] chloe.jpg
3 06UNLNK 15:15:41.305116815 2018.01.09 0x1 t=[0x200000402:0x2:0x0] ef=0xf \
u=500:500 nid=10.128.11.159@tcp p=[0x200000402:0x1:0x0] chloe.jpg
4 07RMDIR 15:15:46.468790091 2018.01.09 0x1 t=[0x200000402:0x1:0x0] ef=0xf \
u=500:500 nid=10.128.11.159@tcp p=[0x200000007:0x1:0x0] pics
Changelog records include this information:
rec#
operation_type(numerical/text)
timestamp
datestamp
flags
t=target_FID
ef=extended_flags
u=uid:gid
nid=client_NID
p=parent_FID
target_name
Displayed in this format:
rec# operation_type(numerical/text) timestamp datestamp flags t=target_FID \
ef=extended_flags u=uid:gid nid=client_NID p=parent_FID target_name
For example:
2 01CREAT 15:15:36.687592024 2018.01.09 0x0 t=[0x200000402:0x2:0x0] ef=0xf \ u=500:500 nid=10.128.11.159@tcp p=[0x200000402:0x1:0x0] chloe.jpg
To notify a device that a specific user (cl1) no longer needs records (up to and including 3):
# lfs changelog_clear lustre-MDT0000 cl1 3
To confirm that the changelog_clear operation was successful, run lfs changelog; only records after id-3 are listed:
# lfs changelog lustre-MDT0000
4 07RMDIR 15:15:46.468790091 2018.01.09 0x1 t=[0x200000402:0x1:0x0] ef=0xf \
u=500:500 nid=10.128.11.159@tcp p=[0x200000007:0x1:0x0] pics
To deregister a changelog user (cl1) for a specific device (lustre-MDT0000):
mds# lctl --device lustre-MDT0000 changelog_deregister cl1 lustre-MDT0000: Deregistered changelog user 'cl1'
The deregistration operation clears all changelog records for the specified user (cl1).
client# lfs changelog lustre-MDT0000 5 00MARK 15:56:39.603643887 2018.01.09 0x0 t=[0x20001:0x0:0x0] ef=0xf \ u=500:500 nid=0@<0:0> p=[0:0x50:0xb] mdd_obd-lustre-MDT0000-0
MARK records typically indicate changelog recording status changes.
To display the current, maximum changelog index and registered changelog users for a specific device (lustre-MDT0000):
mds# lctl get_param mdd.lustre-MDT0000.changelog_users mdd.lustre-MDT0000.changelog_users=current index: 8 ID index (idle seconds) cl2 8 (180)
To show the current changelog mask on a specific device (lustre-MDT0000):
mds# lctl get_param mdd.lustre-MDT0000.changelog_mask mdd.lustre-MDT0000.changelog_mask= MARK CREAT MKDIR HLINK SLINK MKNOD UNLNK RMDIR RENME RNMTO CLOSE LYOUT \ TRUNC SATTR XATTR HSM MTIME CTIME MIGRT
To set the current changelog mask on a specific device (lustre-MDT0000):
mds# lctl set_param mdd.lustre-MDT0000.changelog_mask=HLINK mdd.lustre-MDT0000.changelog_mask=HLINK $ lfs changelog_clear lustre-MDT0000 cl1 0 $ mkdir /mnt/lustre/mydir/foo $ cp /etc/hosts /mnt/lustre/mydir/foo/file $ ln /mnt/lustre/mydir/foo/file /mnt/lustre/mydir/myhardlink
Only item types that are in the mask show up in the changelog.
# lfs changelog lustre-MDT0000 9 03HLINK 16:06:35.291636498 2018.01.09 0x0 t=[0x200000402:0x4:0x0] ef=0xf \ u=500:500 nid=10.128.11.159@tcp p=[0x200000007:0x3:0x0] myhardlink
A specific use case for Lustre Changelogs is audit. According to a definition found on Wikipedia, information technology audits are used to evaluate the organization's ability to protect its information assets and to properly dispense information to authorized parties. Basically, audit consists of verifying that all data accesses were made according to the access control policy in place, usually by analyzing access logs.
Audit can be used as proof that security measures are in place, and it can also be a requirement for regulatory compliance.
Lustre Changelogs are a good mechanism for audit, because this is a centralized facility, and it is designed to be transactional. Changelog records contain all information necessary for auditing purposes:
ability to identify object of action thanks to file identifiers (FIDs) and name of targets
ability to identify subject of action thanks to UID/GID and NID information
ability to identify time of action thanks to timestamp
To have a fully functional Changelogs-based audit facility, some additional Changelog record types must be enabled, to be able to record events such as OPEN, ATIME, GETXATTR and DENIED OPEN. Please note that enabling these record types may have some performance impact. For instance, recording OPEN and GETXATTR events generates writes to the Changelog for what is, from a file-system standpoint, a read operation.
Being able to record events such as OPEN or DENIED OPEN is important from an audit perspective. For instance, if a Lustre file system is used to store medical records on a system dedicated to Life Sciences, data privacy is crucial. Administrators may need to know which doctors accessed, or tried to access, a given medical record and when. And conversely, they might need to know which medical records a given doctor accessed.
To enable all changelog entry types, do:
mds# lctl set_param mdd.lustre-MDT0000.changelog_mask=ALL mdd.lustre-MDT0000.changelog_mask=ALL
Once all required record types have been enabled, just register a Changelogs user and the audit facility is operational.
Note, however, that it is possible to control which Lustre client nodes can trigger the recording of file system access events to the Changelogs, using the audit_mode flag on nodemap entries. The reason to disable audit on a per-nodemap basis is to prevent some nodes (e.g. backup or HSM agent nodes) from flooding the audit logs. When the audit_mode flag is set to 1 on a nodemap entry, a client belonging to this nodemap will be able to record file system access events to the Changelogs, if Changelogs are otherwise activated. When set to 0, events are not logged to the Changelogs, regardless of whether Changelogs are activated. By default, the audit_mode flag is set to 1 in newly created nodemap entries, and it is also set to 1 in the 'default' nodemap.
To prevent nodes belonging to a nodemap from generating Changelog entries, run:
mgs# lctl nodemap_modify --name nm1 --property audit_mode --value 0
An OPEN changelog entry is in the form:
7 10OPEN 13:38:51.510728296 2017.07.25 0x242 t=[0x200000401:0x2:0x0] \ ef=0x7 u=500:500 nid=10.128.11.159@tcp m=-w-
It includes information about the open mode, in the form m=rwx.
OPEN entries are recorded only once per UID/GID for a given open mode, as long as the file is not closed by that UID/GID. This avoids flooding the Changelogs when, for instance, an MPI job opens the same file thousands of times from different threads, and it reduces the Changelog load significantly without significantly affecting the audit information. Similarly, only the last CLOSE per UID/GID is recorded.
A GETXATTR changelog entry is in the form:
8 23GXATR 09:22:55.886793012 2017.07.27 0x0 t=[0x200000402:0x1:0x0] \ ef=0xf u=500:500 nid=10.128.11.159@tcp x=user.name0
It includes information about the name of the extended attribute being accessed, in the form x=<xattr name>.
A SETXATTR changelog entry is in the form:
4 15XATTR 09:41:36.157333594 2018.01.10 0x0 t=[0x200000402:0x1:0x0] \ ef=0xf u=500:500 nid=10.128.11.159@tcp x=user.name0
It includes information about the name of the extended attribute being modified, in the form x=<xattr name>.
A DENIED OPEN changelog entry is in the form:
4 24NOPEN 15:45:44.947406626 2017.08.31 0x2 t=[0x200000402:0x1:0x0] \ ef=0xf u=500:500 nid=10.128.11.158@tcp m=-w-
It has the same information as a regular OPEN entry. In order to
avoid flooding the Changelogs, DENIED OPEN entries are rate limited:
no more than one entry per user per file per time interval, this time
interval (in seconds) being configurable via
mdd.<mdtname>.changelog_deniednext
(default value is 60 seconds).
mds# lctl set_param mdd.lustre-MDT0000.changelog_deniednext=120 mdd.lustre-MDT0000.changelog_deniednext=120 mds# lctl get_param mdd.lustre-MDT0000.changelog_deniednext mdd.lustre-MDT0000.changelog_deniednext=120
The Lustre jobstats feature collects file system operation statistics for user processes running on Lustre clients, and exposes these statistics on the server using the unique Job Identifier (JobID) provided by the job scheduler for each job. Job schedulers known to work with jobstats include: SLURM, SGE, LSF, Loadleveler, PBS and Maui/MOAB.
Since jobstats is implemented in a scheduler-agnostic manner, it is likely to work with other schedulers as well, and also in environments that do not use a job scheduler, by storing custom format strings in jobid_name.
The Lustre jobstats code on the client extracts the unique JobID from an environment variable within the user process, and sends this JobID to the server with all RPCs. This allows the server to track statistics for operations specific to each application/command running on the client, and can be useful to identify the source of high I/O load.
A Lustre setting on the client, jobid_var, specifies an environment variable or other client-local source that holds a (relatively) unique JobID for the running application. Any environment variable can be specified. For example, SLURM sets the SLURM_JOB_ID environment variable with the unique JobID for all clients running a particular job launched on one or more nodes, and SLURM_JOB_ID will be inherited by all child processes started below that process.
There are several reserved values for jobid_var:
disable - disables sending a JobID from this client
procname_uid - uses the process name and UID, equivalent to setting jobid_name=%e.%u
nodelocal - use only the JobID format from jobid_name
session - extract the JobID from jobid_this_session
Lustre can also be configured to generate a synthetic JobID from
the client's process name and numeric UID, by setting
jobid_var=procname_uid
. This will generate a
uniform JobID when running the same binary across multiple client
nodes, but cannot distinguish whether the binary is part of a single
distributed process or multiple independent processes. This can be
useful on login nodes where interactive commands are run.
In Lustre 2.8 and later it is possible to set jobid_var=nodelocal and then also set jobid_name=name, which all processes on that client node will use. This is useful if only a single job is run on a client at one time, but if multiple jobs are run on a client concurrently, the session JobID should be used.
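For example (a minimal sketch; the JobID value backup_node_01 is purely illustrative), a dedicated node could be labeled as follows:
client# lctl set_param jobid_var=nodelocal
client# lctl set_param jobid_name=backup_node_01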
In Lustre 2.12 and later, it is possible to
specify more complex JobID values for jobid_name
by using a string that contains format codes that are evaluated for
each process, in order to generate a site- or node-specific JobID string.
%e print executable name
%g print group ID number
%h print fully-qualified hostname
%H print short hostname
%j print JobID from the source named by the jobid_var parameter
%p print numeric process ID
%u print user ID number
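As an illustrative combination (the format string shown is an example, not a requirement), the scheduler JobID could be extended with the executable name and UID so that individual commands within a job can be distinguished:
mgs# lctl set_param -P jobid_var=SLURM_JOB_ID
mgs# lctl set_param -P jobid_name=%j.%e.%u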
In Lustre 2.13 and later, it is possible to
set a per-session JobID via the jobid_this_session
parameter instead of getting the JobID from an
environment variable. This session ID will be
inherited by all processes that are started in this login session,
though there can be a different JobID for each login session. This
is enabled by setting jobid_var=session
instead
of setting it to an environment variable. The session ID will be
substituted for %j
in jobid_name
.
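A minimal sketch of using a per-session JobID follows; the session name analysis_run_42 is hypothetical:
client# lctl set_param jobid_var=session
client# lctl set_param jobid_this_session=analysis_run_42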
The setting of jobid_var
need not be the same
on all clients. For example, one could use
SLURM_JOB_ID
on all clients managed by SLURM, and
use procname_uid
on clients not managed by SLURM,
such as interactive login nodes.
It is not possible to have different
jobid_var
settings on a single node, since it is
unlikely that multiple job schedulers are active on one client.
However, the actual JobID value is local to each process environment
and it is possible for multiple jobs with different JobIDs to be
active on a single client at one time.
Jobstats are disabled by default. The current state of jobstats
can be verified by checking lctl get_param jobid_var
on a client:
client# lctl get_param jobid_var jobid_var=disable
To enable jobstats on all clients for SLURM:
mgs# lctl set_param -P jobid_var=SLURM_JOB_ID
The lctl set_param
command to enable or disable
jobstats should be run on the MGS as root. The change is persistent, and
will be propagated to the MDS, OSS, and client nodes automatically when
it is set on the MGS and for each new client mount.
To temporarily enable jobstats on a client, or to use a different
jobid_var on a subset of nodes, such as nodes in a remote cluster that
use a different job scheduler, or interactive login nodes that do not
use a job scheduler at all, run the lctl set_param
command directly on the client node(s) after the filesystem is mounted.
For example, to enable the procname_uid
synthetic
JobID locally on a login node run:
client# lctl set_param jobid_var=procname_uid
The lctl set_param
setting is not persistent, and will
be reset if the global jobid_var
is set on the MGS or
if the filesystem is unmounted.
The following table shows the environment variables which are set
by various job schedulers. Set jobid_var
to the value
for your job scheduler to collect statistics on a per job basis.
| Job Scheduler | Environment Variable |
|---|---|
| Simple Linux Utility for Resource Management (SLURM) | SLURM_JOB_ID |
| Sun Grid Engine (SGE) | JOB_ID |
| Load Sharing Facility (LSF) | LSB_JOBID |
| Loadleveler | LOADL_STEP_ID |
| Portable Batch Scheduler (PBS)/MAUI | PBS_JOBID |
| Cray Application Level Placement Scheduler (ALPS) | ALPS_APP_ID |
To disable jobstats, specify jobid_var as disable:
mgs# lctl set_param -P jobid_var=disable
To track job stats per process name and user ID (for debugging, or if no job scheduler is in use on some nodes such as login nodes), specify jobid_var as procname_uid:
client# lctl set_param jobid_var=procname_uid
Metadata operation statistics are collected on MDTs. These statistics
can be accessed for all file systems and all jobs on the MDT via the
lctl get_param mdt.*.job_stats command. For example, clients
running with jobid_var=procname_uid
:
mds# lctl get_param mdt.*.job_stats job_stats: - job_id: bash.0 snapshot_time: 1352084992 open: { samples: 2, unit: reqs } close: { samples: 2, unit: reqs } getattr: { samples: 3, unit: reqs } - job_id: mythbackend.0 snapshot_time: 1352084996 open: { samples: 72, unit: reqs } close: { samples: 73, unit: reqs } unlink: { samples: 22, unit: reqs } getattr: { samples: 778, unit: reqs } setattr: { samples: 22, unit: reqs } statfs: { samples: 19840, unit: reqs } sync: { samples: 33190, unit: reqs }
Data operation statistics are collected on OSTs. These statistics can be accessed via lctl get_param obdfilter.*.job_stats, for example:
oss# lctl get_param obdfilter.*.job_stats obdfilter.myth-OST0000.job_stats= job_stats: - job_id: mythcommflag.0 snapshot_time: 1429714922 read: { samples: 974, unit: bytes, min: 4096, max: 1048576, sum: 91530035 } write: { samples: 0, unit: bytes, min: 0, max: 0, sum: 0 } obdfilter.myth-OST0001.job_stats= job_stats: - job_id: mythbackend.0 snapshot_time: 1429715270 read: { samples: 0, unit: bytes, min: 0, max: 0, sum: 0 } write: { samples: 1, unit: bytes, min: 96899, max: 96899, sum: 96899 } punch: { samples: 1, unit: reqs } obdfilter.myth-OST0002.job_stats=job_stats: obdfilter.myth-OST0003.job_stats=job_stats: obdfilter.myth-OST0004.job_stats= job_stats: - job_id: mythfrontend.500 snapshot_time: 1429692083 read: { samples: 9, unit: bytes, min: 16384, max: 1048576, sum: 4444160 } write: { samples: 0, unit: bytes, min: 0, max: 0, sum: 0 } - job_id: mythbackend.500 snapshot_time: 1429692129 read: { samples: 0, unit: bytes, min: 0, max: 0, sum: 0 } write: { samples: 1, unit: bytes, min: 56231, max: 56231, sum: 56231 } punch: { samples: 1, unit: reqs }
Accumulated job statistics can be reset by writing to the job_stats proc file.
Clear statistics for all jobs on the local node:
oss# lctl set_param obdfilter.*.job_stats=clear
Clear statistics only for job 'bash.0' on lustre-MDT0000:
mds# lctl set_param mdt.lustre-MDT0000.job_stats=bash.0
By default, if a job is inactive for 600 seconds (10 minutes) statistics for this job will be dropped. This expiration value can be changed temporarily via:
mds# lctl set_param *.*.job_cleanup_interval={max_age}
It can also be changed permanently, for example to 700 seconds via:
mgs# lctl set_param -P mdt.testfs-*.job_cleanup_interval=700
The job_cleanup_interval can be set to 0 to disable auto-cleanup. Note that if auto-cleanup of
Jobstats is disabled, then all statistics will be kept in memory
forever, which may eventually consume all memory on the servers.
In this case, any monitoring tool should explicitly clear
individual job statistics as they are processed, as shown above.
Since Lustre 2.15 the lljobstat
utility can be used to monitor and identify the top JobIDs generating
load on a particular server. This allows the administrator to quickly
see which applications/users/clients (depending on how the JobID is
configured) are generating the most filesystem RPCs and take appropriate
action if needed.
mds# lljobstat -c 10 --- timestamp: 1665984678 top_jobs: - ls.500: {ops: 64, ga: 64} - touch.500: {ops: 6, op: 1, cl: 1, mn: 1, ga: 1, sa: 2} - bash.0: {ops: 3, ga: 3} ...
It is possible to specify the number of top jobs to monitor as well as the refresh interval, among other options.
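For example (a sketch; check lljobstat --help for the options available in your release), to report the top 5 jobs, refreshing every 10 seconds:
mds# lljobstat -c 5 -i 10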
The Lustre Monitoring Tool (LMT) is a Python-based, distributed
system that provides a top
-like display of activity
on server-side nodes (MDS, OSS and portals routers) on one or more
Lustre file systems. It does not provide support for monitoring
clients. For more information on LMT, including the setup procedure,
see:
CollectL
is another tool that can be used to monitor a Lustre file
system. You can run CollectL
on a Lustre system that has any combination of
MDSs, OSTs and clients. The collected data can be written to a file for continuous logging and
played back at a later time. It can also be converted to a format suitable for
plotting.
For more information about CollectL
, see:
http://collectl.sourceforge.net
Lustre-specific documentation is also available. See:
A variety of standard tools are available publicly including the following:
lltop
- Lustre load monitor with batch scheduler integration.
https://github.com/jhammond/lltop
tacc_stats
- A job-oriented system monitor, analysis, and
visualization tool that probes Lustre interfaces and collects statistics. https://github.com/jhammond/tacc_stats
xltop
- A continuous Lustre monitor with batch scheduler
integration. https://github.com/jhammond/xltop
Another option is to script a simple monitoring solution that looks at various reports from ifconfig, as well as the procfs files generated by the Lustre software.
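The sketch below illustrates one possible approach; the parameters sampled and the one-minute interval are only examples and should be adapted to the statistics of interest:
#!/bin/bash
# Hypothetical sampling loop: record network interface and Lustre I/O service
# statistics once a minute (intended for an OSS; adjust parameters for other node types).
while true; do
    date
    ifconfig -a
    lctl get_param -n ost.OSS.ost_io.stats 2>/dev/null
    sleep 60
done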
Once you have the Lustre file system up and running, you can use the procedures in this section to perform these basic Lustre administration tasks.
The file system name is limited to 8 characters. We have encoded the
file system and target information in the disk label, so you can mount by
label. This allows system administrators to move disks around without
worrying about issues such as SCSI disk reordering or getting the
/dev/device
wrong for a shared target. Soon, file system
naming will be made as fail-safe as possible. Currently, Linux disk labels
are limited to 16 characters. To identify the target within the file
system, 8 characters are reserved, leaving 8 characters for the file system
name:
fsname-MDT0000 or fsname-OST0a19
To mount by label, use this command:
mount -t lustre -L file_system_label /mount_point
This is an example of mount-by-label:
mds# mount -t lustre -L testfs-MDT0000 /mnt/mdt
Mount-by-label should NOT be used in a multi-path environment or when snapshots are being created of the device, since multiple block devices will have the same label.
Although the file system name is internally limited to 8 characters, you can mount the clients at any mount point, so file system users are not subjected to short names. Here is an example:
client# mount -t lustre mds0@tcp0:/short /dev/long_mountpoint_name
On the first start of a Lustre file system, the components must be started in the following order:
Mount the MGT.
If a combined MGT/MDT is present, Lustre will correctly mount the MGT and MDT automatically.
Mount the MDT.
Mount all MDTs if multiple MDTs are present.
Mount the OST(s).
Mount the client(s).
Starting a Lustre server is straightforward and only involves the
mount command. Lustre servers can be added to /etc/fstab:
mount -t lustre
The mount command generates output similar to this:
/dev/sda1 on /mnt/test/mdt type lustre (rw) /dev/sda2 on /mnt/test/ost0 type lustre (rw) 192.168.0.21@tcp:/testfs on /mnt/testfs type lustre (rw)
In this example, the MDT, an OST (ost0) and file system (testfs) are mounted.
LABEL=testfs-MDT0000 /mnt/test/mdt lustre defaults,_netdev,noauto 0 0 LABEL=testfs-OST0000 /mnt/test/ost0 lustre defaults,_netdev,noauto 0 0
In general, it is wise to specify noauto and let your
high-availability (HA) package manage when to mount the device. If you are
not using failover, make sure that networking has been started before
mounting a Lustre server. If you are running Red Hat Enterprise Linux, SUSE
Linux Enterprise Server, Debian operating system (and perhaps others), use
the _netdev
flag to ensure that these disks are mounted
after the network is up, unless you are using systemd 232 or greater, which
recognize lustre
as a network filesystem.
If you are using lnet.service
, use
x-systemd.requires=lnet.service
regardless of systemd
version.
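For example, an /etc/fstab entry combining these options might look like the following (the label and mount point are illustrative):
LABEL=testfs-OST0000 /mnt/test/ost0 lustre defaults,_netdev,noauto,x-systemd.requires=lnet.service 0 0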
We are mounting by disk label here. The label of a device can be read
with e2label
. The label of a newly-formatted Lustre
server may end in FFFF
if the
--index
option is not specified to
mkfs.lustre
, meaning that it has yet to be assigned. The
assignment takes place when the server is first started, and the disk label
is updated. It is recommended that the
--index
option always be used, which will also ensure
that the label is set at format time.
Do not do this when the client and OSS are on the same node, as memory pressure between the client and OSS can lead to deadlocks.
Mount-by-label should NOT be used in a multi-path environment.
A complete Lustre filesystem shutdown occurs by unmounting all clients and servers in the order shown below. Please note that unmounting a block device causes the Lustre software to be shut down on that node.
Please note that the -a -t lustre in the commands below is not the name of a filesystem, but rather specifies that all entries in /etc/mtab of type lustre should be unmounted.
Unmount the clients
On each client node, unmount the filesystem on that client
using the umount
command:
umount -a -t lustre
The example below shows the unmount of the
testfs
filesystem on a client node:
[root@client1 ~]# mount -t lustre XXX.XXX.0.11@tcp:/testfs on /mnt/testfs type lustre (rw,lazystatfs) [root@client1 ~]# umount -a -t lustre [154523.177714] Lustre: Unmounted testfs-client
Unmount the MDT and MGT
On the MGS and MDS node(s), run the
umount
command:
umount -a -t lustre
The example below shows the unmount of the MDT and MGT for
the testfs
filesystem on a combined MGS/MDS:
[root@mds1 ~]# mount -t lustre /dev/sda on /mnt/mgt type lustre (ro) /dev/sdb on /mnt/mdt type lustre (ro) [root@mds1 ~]# umount -a -t lustre [155263.566230] Lustre: Failing over testfs-MDT0000 [155263.775355] Lustre: server umount testfs-MDT0000 complete [155269.843862] Lustre: server umount MGS complete
For a separate MGS and MDS, the same command is used, first on the MDS and then followed by the MGS.
Unmount all the OSTs
On each OSS node, use the umount
command:
umount -a -t lustre
The example below shows the unmount of all OSTs for the
testfs
filesystem on server
OSS1
:
[root@oss1 ~]# mount |grep lustre /dev/sda on /mnt/ost0 type lustre (ro) /dev/sdb on /mnt/ost1 type lustre (ro) /dev/sdc on /mnt/ost2 type lustre (ro) [root@oss1 ~]# umount -a -t lustre Lustre: Failing over testfs-OST0002 Lustre: server umount testfs-OST0002 complete
For unmount command syntax for a single OST, MDT, or MGT target please refer to Section 13.5, “ Unmounting a Specific Target on a Server”
To stop a Lustre OST, MDT, or MGT, use the umount /mount_point command.
The example below stops an OST, ost0
, on mount
point /mnt/ost0
for the testfs
filesystem:
[root@oss1 ~]# umount /mnt/ost0 Lustre: Failing over testfs-OST0000 Lustre: server umount testfs-OST0000 complete
Gracefully stopping a server with the
umount
command preserves the state of the connected
clients. The next time the server is started, it waits for clients to
reconnect, and then goes through the recovery procedure.
If the force (
-f
) flag is used, then the server evicts all clients and
stops WITHOUT recovery. Upon restart, the server does not wait for
recovery. Any currently connected clients receive I/O errors until they
reconnect.
If you are using loopback devices, use the
-d
flag. This flag cleans up loop devices and can
always be safely specified.
In a Lustre file system, an OST that has become unreachable because it fails, is taken off the network, or is unmounted can be handled in one of two ways:
In failout
mode, Lustre clients immediately
receive errors (EIOs) after a timeout, instead of waiting for the OST
to recover.
In failover
mode, Lustre clients wait for the
OST to recover.
By default, the Lustre file system uses
failover
mode for OSTs. To specify
failout
mode instead, use the
--param="failover.mode=failout"
option as shown below
(entered on one line):
oss# mkfs.lustre --fsname=fsname --mgsnode=mgs_NID \ --param=failover.mode=failout --ost --index=ost_index /dev/ost_block_device
In the example below,
failout
mode is specified for the OSTs on the MGS
mds0
in the file system
testfs
(entered on one line).
oss# mkfs.lustre --fsname=testfs --mgsnode=mds0 --param=failover.mode=failout \ --ost --index=3 /dev/sdb
Before running this command, unmount all OSTs that will be affected
by a change in failover
/failout
mode.
After initial file system configuration, use the
tunefs.lustre
utility to change the mode. For example,
to set the failout
mode, run:
# tunefs.lustre --param failover.mode=failout /dev/ost_device
Lustre includes functionality that notifies Lustre if an external RAID array has degraded performance (resulting in reduced overall file system performance), either because a disk has failed and not been replaced, or because a disk was replaced and is undergoing a rebuild. To avoid a global performance slowdown due to a degraded OST, the MDS can avoid the OST for new object allocation if it is notified of the degraded state.
A parameter for each OST, called
degraded
, specifies whether the OST is running in
degraded mode or not.
To mark the OST as degraded, use:
oss# lctl set_param obdfilter.{OST_name}.degraded=1
To mark that the OST is back in normal operation, use:
oss# lctl set_param obdfilter.{OST_name}.degraded=0
To determine if OSTs are currently in degraded mode, use:
oss# lctl get_param obdfilter.*.degraded
If the OST is remounted due to a reboot or other condition, the flag
resets to
0
.
It is recommended that this be implemented by an automated script
that monitors the status of individual RAID devices, such as MD-RAID's
mdadm(8)
command with the --monitor
option to mark an affected device degraded or restored.
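A minimal sketch of such a script is shown below. It assumes mdadm --monitor is started with --program pointing at the script, and that the monitored MD array backs the OST testfs-OST0000; both the event-to-action mapping and the OST name are site-specific assumptions.
#!/bin/bash
# Hypothetical mdadm event handler: mdadm passes the event name as $1
# and the md device as $2 (unused here).
case "$1" in
    Fail|DegradedArray|RebuildStarted)
        lctl set_param obdfilter.testfs-OST0000.degraded=1 ;;
    RebuildFinished|SpareActive)
        lctl set_param obdfilter.testfs-OST0000.degraded=0 ;;
esac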
Lustre supports multiple file systems provided the combination of
NID:fsname
is unique. Each file system must be allocated
a unique name during creation with the
--fsname
parameter. Unique names for file systems are
enforced if a single MGS is present. If multiple MGSs are present (for
example if you have an MGS on every MDS) the administrator is responsible
for ensuring file system names are unique. A single MGS and unique file
system names provides a single point of administration and allows commands
to be issued against the file system even if it is not mounted.
Lustre supports multiple file systems on a single MGS. With a single MGS fsnames are guaranteed to be unique. Lustre also allows multiple MGSs to co-exist. For example, multiple MGSs will be necessary if multiple file systems on different Lustre software versions are to be concurrently available. With multiple MGSs additional care must be taken to ensure file system names are unique. Each file system should have a unique fsname among all systems that may interoperate in the future.
By default, the
mkfs.lustre
command creates a file system named
lustre
. To specify a different file system name (limited
to 8 characters) at format time, use the
--fsname
option:
oss# mkfs.lustre --fsname=file_system_name
The MDT, OSTs and clients in the new file system must use the same
file system name (prepended to the device name). For example, for a new
file system named foo
, the MDT and two OSTs would be
named foo-MDT0000
,
foo-OST0000
, and
foo-OST0001
.
To mount a client on the file system, run:
client# mount -t lustre mgsnode:/new_fsname /mount_point
For example, to mount a client on file system foo at mount point /mnt/foo, run:
client# mount -t lustre mgsnode:/foo /mnt/foo
If a client(s) will be mounted on several file systems, add the
following line to /etc/xattr.conf
file to avoid
problems when files are moved between the file systems:
lustre.* skip
To ensure that a new MDT is added to an existing MGS, create the MDT by specifying:
--mdt --mgsnode=mgs_NID
A Lustre installation with two file systems (
foo
and
bar
) could look like this, where the MGS node is
mgsnode@tcp0
and the mount points are
/mnt/foo
and
/mnt/bar
.
mgsnode# mkfs.lustre --mgs /dev/sda mdtfoonode# mkfs.lustre --fsname=foo --mgsnode=mgsnode@tcp0 --mdt --index=0 /dev/sdb ossfoonode# mkfs.lustre --fsname=foo --mgsnode=mgsnode@tcp0 --ost --index=0 /dev/sda ossfoonode# mkfs.lustre --fsname=foo --mgsnode=mgsnode@tcp0 --ost --index=1 /dev/sdb mdtbarnode# mkfs.lustre --fsname=bar --mgsnode=mgsnode@tcp0 --mdt --index=0 /dev/sda ossbarnode# mkfs.lustre --fsname=bar --mgsnode=mgsnode@tcp0 --ost --index=0 /dev/sdc ossbarnode# mkfs.lustre --fsname=bar --mgsnode=mgsnode@tcp0 --ost --index=1 /dev/sdd
To mount a client on file system foo at mount point
/mnt/foo
, run:
client# mount -t lustre mgsnode@tcp0:/foo /mnt/foo
To mount a client on file system bar at mount point
/mnt/bar
, run:
client# mount -t lustre mgsnode@tcp0:/bar /mnt/bar
It is possible to create individual directories, along with their files and sub-directories, to be stored on specific MDTs. To create a sub-directory on a given MDT use the command:
client$ lfs mkdir -i mdt_index /mount_point/remote_dir
This command will allocate the sub-directory remote_dir onto the MDT with index mdt_index. For more information on adding additional MDTs and mdt_index see Section 14.7, “Adding a New MDT to a Lustre File System”.
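For example, to create a directory served by the MDT at index 2 (the mount point and directory name are illustrative):
client$ lfs mkdir -i 2 /mnt/testfs/remote_dir_on_mdt2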
An administrator can allocate remote sub-directories to separate MDTs. Creating remote sub-directories in parent directories not hosted on MDT0000 is not recommended. This is because the failure of the parent MDT will leave the namespace below it inaccessible. For this reason, by default it is only possible to create remote sub-directories off MDT0000. To relax this restriction and enable remote sub-directories off any MDT, an administrator must issue the following command on the MGS:
mgs# lctl set_param -P mdt.fsname-MDT*.enable_remote_dir=1
For Lustre filesystem 'scratch', the command executed is:
mgs# lctl set_param -P mdt.scratch-*.enable_remote_dir=1
To verify the configuration setting execute the following command on any MDS:
mds# lctl get_param mdt.*.enable_remote_dir
With Lustre software version 2.8, a new
tunable is available to allow users with a specific group ID to create
and delete remote and striped directories. This tunable is
enable_remote_dir_gid
. For example, setting this
parameter to the 'wheel' or 'admin' group ID allows users with that GID
to create and delete remote and striped directories. Setting this
parameter to -1 on MDT0000 permanently allows any non-root user to create and delete remote and striped directories.
On the MGS execute the following command:
mgs# lctl set_param -P mdt.fsname-*.enable_remote_dir_gid=-1
For the Lustre filesystem 'scratch', the commands expands to:
mgs# lctl set_param -P mdt.scratch-*.enable_remote_dir_gid=-1
The change can be verified by executing the following command on every MDS:
mds# lctl get_param mdt.*.enable_remote_dir_gid
The Lustre 2.8 DNE feature enables files in a single large directory to be distributed across multiple MDTs (a striped directory), if there are multiple MDTs added to the filesystem, see Section 14.7, “Adding a New MDT to a Lustre File System”. The result is that metadata requests for files in a single large striped directory are serviced by multiple MDTs and metadata service load is distributed over all the MDTs that service a given directory. By distributing metadata service load over multiple MDTs, performance of very large directories can be improved beyond the limit of one MDT. Normally, all files in a directory must be created on a single MDT.
The command to stripe a directory over mdt_count MDTs is:
client$ lfs mkdir -c mdt_count /mount_point/new_directory
The striped directory feature is most useful for distributing a single large directory (50k entries or more) across multiple MDTs. This should be used with discretion since creating and removing striped directories incurs more overhead than non-striped directories.
If the starting MDT is not specified when creating a new directory, this directory and its stripes will be distributed on MDTs by space usage. For example the following will create a new directory on an MDT preferring one that has less space usage:
client$ lfs mkdir -c 1 -i -1 dir1
Alternatively, if a default directory stripe is set on a directory,
the subsequent use of mkdir
for subdirectories in
dir1
will have the same effect:
client$ lfs setdirstripe -D -c 1 -i -1 dir1
The policy is:
If free inodes/blocks on all MDTs are almost the same,
i.e. max_inodes_avail * 84% < min_inodes_avail
and
max_blocks_avail * 84% < min_blocks_avail
, then
choose MDT roundrobin.
Otherwise, create more subdirectories on MDTs with more free inodes/blocks.
Sometimes there are many MDTs, but it is not always desirable to stripe a directory across all of them, even if the directory default stripe_count=-1 (unlimited).
In this case, the per-filesystem tunable parameter
lod.*.max_mdt_stripecount
can be used to limit the
actual stripe count of directory to fewer than the full MDT count.
If lod.*.max_mdt_stripecount
is not 0, and the
directory stripe_count=-1
, the real directory
stripe count will be the minimum of the number of MDTs and
max_mdt_stripecount
.
If lod.*.max_mdt_stripecount=0, or an explicit stripe count is given for the directory, max_mdt_stripecount is ignored.
To set max_mdt_stripecount
, on all MDSes of
file system, run:
mgs# lctl set_param -P lod.$fsname-MDTxxxx-mdtlov.max_mdt_stripecount=<N>
To check max_mdt_stripecount
, run:
mds# lctl get_param lod.$fsname-MDTxxxx-mdtlov.max_mdt_stripecount
To reset max_mdt_stripecount
, run:
mgs# lctl set_param -P -d lod.$fsname-MDTxxxx-mdtlov.max_mdt_stripecount
Similar to file objects allocation, the directory objects are allocated on MDTs by a round-robin algorithm or a weighted algorithm. For the top three level of directories from the root of the filesystem, if the amount of free inodes and blocks is well balanced (i.e., by default, when the free inodes and blocks across MDTs differ by less than 5%), the round-robin algorithm is used to select the next MDT on which a directory is to be created.
If the directory is more than three levels below the root directory, or MDTs are not balanced, then the weighted algorithm is used to randomly select an MDT with more free inodes and blocks.
To avoid creating unnecessary remote directories, if the MDT where the parent directory is located is not too full (that is, the parent MDT is not more than 5% fuller, in free inodes and blocks, than the average of all MDTs), the new directory will be created on the parent MDT.
If the administrator wants to change this default filesystem-wide directory striping, run the following command to limit this striping to the top level below the root directory:
client$ lfs setdirstripe -D -i -1 -c 1 --max-inherit 0 <mountpoint>
To revert to the pre-2.15 behavior of all directories being created only on MDT0000 by default (deleting this striping won't work because it will be recreated if missing):
client$ lfs setdirstripe -D -i 0 -c 1 --max-inherit 0 <mountpoint>
If a default directory stripe policy is set on a directory, it will be applied to sub-directories created later. For example:
$ mkdir testdir1 $ lfs setdirstripe testdir1 -D -c 2 $ lfs getdirstripe testdir1 -D lmv_stripe_count: 2 lmv_stripe_offset: -1 lmv_hash_type: none lmv_max_inherit: 3 lmv_max_inherit_rr: 0 $ mkdir testdir1/subdir1 $ lfs getdirstripe testdir1/subdir1 lmv_stripe_count: 2 lmv_stripe_offset: 0 lmv_hash_type: crush mdtidx FID[seq:oid:ver] 0 [0x200000400:0x2:0x0] 1 [0x240000401:0x2:0x0]
The default directory stripe policy can be inherited by sub-directories. This behavior is controlled by the lmv_max_inherit parameter. If lmv_max_inherit is 0 or 1, sub-directories stop inheriting the default directory stripe policy. Otherwise, a sub-directory decrements its parent's lmv_max_inherit and uses the result as its own lmv_max_inherit.
-1 is special because it means unlimited. For example:
$ lfs getdirstripe testdir1/subdir1 -D lmv_stripe_count: 2 lmv_stripe_offset: -1 lmv_hash_type: none lmv_max_inherit: 2 lmv_max_inherit_rr: 0
lmv_max_inherit
can be set explicitly with
--max-inherit
option in
lfs setdirstripe -D
command.
If the max-inherit value is not specified, the default value is -1
when stripe_count
is 0 or 1.
For other values of stripe_count
, the default value
is 3.
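For example, to set a default stripe count of 2 that is inherited by up to 5 levels of newly created sub-directories (the values are illustrative):
client$ lfs setdirstripe -D -c 2 --max-inherit 5 testdir1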
Several options are available for setting parameters in Lustre:
When creating a file system, use mkfs.lustre. See Section 13.12.1, “Setting Tunable Parameters with mkfs.lustre” below.
When a server is stopped, use tunefs.lustre. See Section 13.12.2, “Setting Parameters with tunefs.lustre” below.
When the file system is running, use lctl to set or retrieve Lustre parameters. See Section 13.12.3, “Setting Parameters with lctl” and Section 13.12.3.6, “Reporting Current Parameter Values” below.
When the file system is first formatted, parameters can simply be
added as a --param
option to the
mkfs.lustre
command. For example:
mds# mkfs.lustre --mdt --param="sys.timeout=50" /dev/sda
For more details about creating a file system, see
Chapter 10, Configuring a Lustre File
System. For more details about
mkfs.lustre
, see
Chapter 44, System Configuration Utilities.
If a server (OSS or MDS) is stopped, parameters can be added to an
existing file system using the
--param
option to the
tunefs.lustre
command. For example:
oss# tunefs.lustre --param=failover.node=192.168.0.13@tcp0 /dev/sda
With tunefs.lustre
, parameters are additive: new parameters are specified in addition to old parameters; they do not replace them. To erase all old
tunefs.lustre
parameters and just use newly-specified
parameters, run:
mds# tunefs.lustre --erase-params --param=new_parameters
The tunefs.lustre command can be used to set any parameter settable via lctl conf_param that has its own OBD device, so it can be specified as obdname|fsname.obdtype.proc_file_name=value. For example:
mds# tunefs.lustre --param mdt.identity_upcall=NONE /dev/sda1
For more details about tunefs.lustre
, see
Chapter 44, System Configuration Utilities.
When the file system is running, the
lctl
command can be used to set parameters (temporary
or permanent) and report current parameter values. Temporary parameters
are active as long as the server or client is not shut down. Permanent
parameters live through server and client reboots.
The lctl list_param
command enables users to
list all parameters that can be set. See
Section 13.12.3.5, “Listing All Tunable Parameters”.
For more details about the
lctl
command, see the examples in the sections below
and
Chapter 44, System Configuration Utilities.
Use
lctl set_param
to set temporary parameters on the
node where it is run. These parameters internally map to corresponding
items in the kernel /proc/{fs,sys}/{lnet,lustre}
and
/sys/{fs,kernel/debug}/lustre
virtual filesystems.
However, since the mapping between a particular parameter name and the
underlying virtual pathname may change, it is not
recommended to access the virtual pathname directly. The
lctl set_param
command uses this syntax:
# lctl set_param [-n] [-P] obdtype.obdname.proc_file_name=value
For example:
# lctl set_param osc.*.max_dirty_mb=1024 osc.myth-OST0000-osc.max_dirty_mb=32 osc.myth-OST0001-osc.max_dirty_mb=32 osc.myth-OST0002-osc.max_dirty_mb=32 osc.myth-OST0003-osc.max_dirty_mb=32 osc.myth-OST0004-osc.max_dirty_mb=32
Use lctl set_param -P
or
lctl conf_param
command to set permanent parameters.
In general, the set_param -P
command is preferred
for new parameters, as this isolates the parameter settings from the
MDT and OST device configuration, and is consistent with the common
lctl get_param
and lctl set_param
commands. The lctl conf_param
command
was previously used to specify settable parameters, with the following
syntax (the same as the mkfs.lustre
and
tunefs.lustre
commands):
obdname|fsname.obdtype.proc_file_name=value
The lctl conf_param
and
lctl set_param
syntax is not
the same.
Here are a few examples of
lctl conf_param
commands:
mgs# lctl conf_param testfs-MDT0000.sys.timeout=40 mgs# lctl conf_param testfs-MDT0000.mdt.identity_upcall=NONE mgs# lctl conf_param testfs.llite.max_read_ahead_mb=16 mgs# lctl conf_param testfs-OST0000.osc.max_dirty_mb=29.15 mgs# lctl conf_param testfs-OST0000.ost.client_cache_seconds=15 mgs# lctl conf_param testfs.sys.timeout=40
Parameters specified with the
lctl conf_param
command are set permanently in the
file system's configuration file on the MGS.
The lctl set_param -P
command can also
set parameters permanently using the same syntax as
lctl set_param
and lctl
get_param
commands. Permanent parameter settings must be
issued on the MGS. The given parameter is set on every host using
lctl
upcall. The lctl set_param
command uses the following syntax:
lctl set_param -P obdtype.obdname.proc_file_name=value
For example:
mgs# lctl set_param -P timeout=40 mgs# lctl set_param -P mdt.testfs-MDT*.identity_upcall=NONE mgs# lctl set_param -P llite.testfs-*.max_read_ahead_mb=16 mgs# lctl set_param -P osc.testfs-OST*.max_dirty_mb=29.15 mgs# lctl set_param -P ost.testfs-OST*.client_cache_seconds=15
Use the -P -d
option to delete permanent
parameters. Syntax:
lctl set_param -P -d obdtype.obdname.parameter_name
For example:
mgs# lctl set_param -P -d osc.*.max_dirty_mb
Starting in Lustre 2.12, the lctl get_param command can provide tab completion when used in an interactive shell with bash-completion installed. This simplifies the use of get_param significantly, since it provides an interactive list of available parameters.
To list tunable parameters stored in the params
log file by lctl set_param -P
and applied to nodes at
mount, run the lctl --device MGS llog_print params
command on the MGS. For example:
mgs# lctl --device MGS llog_print params - { index: 2, event: set_param, device: general, parameter: osc.*.max_dirty_mb, value: 1024 }
To list Lustre or LNet parameters that are available to set, use
the lctl list_param
command. For example:
lctl list_param [-FR] obdtype.obdname
The following arguments are available for the
lctl list_param
command.
-F Add '/', '@' or '=' for directories, symlinks and writeable files, respectively.
-R Recursively lists all parameters under the specified path.
For example:
oss# lctl list_param obdfilter.lustre-OST0000
To report current Lustre parameter values, use the
lctl get_param
command with this syntax:
lctl get_param [-n] obdtype.obdname.proc_file_name
Starting in Lustre 2.12, the lctl get_param command can provide tab completion when used in an interactive shell with bash-completion installed. This simplifies the use of get_param significantly, since it provides an interactive list of available parameters.
This example reports data on RPC service times.
oss# lctl get_param -n ost.*.ost_io.timeouts service : cur 1 worst 30 (at 1257150393, 85d23h58m54s ago) 1 1 1 1
This example reports the amount of space this client has reserved for writeback cache with each OST:
client# lctl get_param osc.*.cur_grant_bytes osc.myth-OST0000-osc-ffff8800376bdc00.cur_grant_bytes=2097152 osc.myth-OST0001-osc-ffff8800376bdc00.cur_grant_bytes=33890304 osc.myth-OST0002-osc-ffff8800376bdc00.cur_grant_bytes=35418112 osc.myth-OST0003-osc-ffff8800376bdc00.cur_grant_bytes=2097152 osc.myth-OST0004-osc-ffff8800376bdc00.cur_grant_bytes=33808384
If a node has multiple network interfaces, it may have multiple NIDs,
which must all be identified so other nodes can choose the NID that is
appropriate for their network interfaces. Typically, NIDs are specified in
a list delimited by commas (,). However, when failover nodes are specified, the NIDs are delimited by a colon (:) or by repeating a keyword such as --mgsnode= or --servicenode=.
To display the NIDs of all servers in networks configured to work with the Lustre file system, run (while LNet is running):
# lctl list_nids
In the example below,
mds0
and
mds1
are configured as a combined MGS/MDT failover pair
and oss0
and
oss1
are configured as an OST failover pair. The Ethernet
address for
mds0
is 192.168.10.1, and for
mds1
is 192.168.10.2. The Ethernet addresses for
oss0
and
oss1
are 192.168.10.20 and 192.168.10.21
respectively.
mds0# mkfs.lustre --fsname=testfs --mdt --mgs \ --servicenode=192.168.10.2@tcp0 \ --servicenode=192.168.10.1@tcp0 /dev/sda1 mds0# mount -t lustre /dev/sda1 /mnt/test/mdt oss0# mkfs.lustre --fsname=testfs --servicenode=192.168.10.20@tcp0 \ --servicenode=192.168.10.21 --ost --index=0 \ --mgsnode=192.168.10.1@tcp0 --mgsnode=192.168.10.2@tcp0 \ /dev/sdb oss0# mount -t lustre /dev/sdb /mnt/test/ost0 client# mount -t lustre 192.168.10.1@tcp0:192.168.10.2@tcp0:/testfs \ /mnt/testfs mds0# umount /mnt/mdt mds1# mount -t lustre /dev/sda1 /mnt/test/mdt mds1# lctl get_param mdt.testfs-MDT0000.recovery_status
Where multiple NIDs are specified separated by commas (for example,
10.67.73.200@tcp,192.168.10.1@tcp
), the two NIDs refer
to the same host, and the Lustre software chooses the
best one for communication. When a pair of NIDs is
separated by a colon (for example,
10.67.73.200@tcp:10.67.73.201@tcp
), the two NIDs refer
to two different hosts and are treated as a failover pair (the Lustre
software tries the first one, and if that fails, it tries the second
one.)
Two options to
mkfs.lustre
can be used to specify failover nodes. The
--servicenode
option is used to specify all service NIDs,
including those for primary nodes and failover nodes. When the
--servicenode
option is used, the first service node to
load the target device becomes the primary service node, while nodes
corresponding to the other specified NIDs become failover locations for the
target device. An older option, --failnode
, specifies
just the NIDs of failover nodes. For more information about the
--servicenode
and
--failnode
options, see
Chapter 11, Configuring Failover in a Lustre
File System.
If you want to erase a file system and permanently delete all the data in the file system, run this command on your targets:
# mkfs.lustre --reformat
If you are using a separate MGS and want to keep other file systems
defined on that MGS, then set the
writeconf
flag on the MDT for that file system. The
writeconf
flag causes the configuration logs to be
erased; they are regenerated the next time the servers start.
To set the writeconf
flag on the MDT:
Unmount all clients/servers using this file system, run:
client# umount /mnt/lustre
Permanently erase the file system and, presumably, replace it with another file system, run:
mgs# mkfs.lustre --reformat --fsname spfs --mgs --mdt --index=0 /dev/mdsdev
If you have a separate MGS (that you do not want to reformat),
then add the --writeconf
flag to
mkfs.lustre
on the MDT, run:
mgs# mkfs.lustre --reformat --writeconf --fsname spfs --mgsnode=mgs_nid \ --mdt --index=0 /dev/mds_device
If you have a combined MGS/MDT, reformatting the MDT reformats the MGS as well, causing all configuration information to be lost; you can start building your new file system. Nothing needs to be done with old disks that will not be part of the new file system, just do not mount them.
All current Lustre installations run the ldiskfs file system internally on service nodes. By default, ldiskfs reserves 5% of the disk space to avoid file system fragmentation. In order to reclaim this space, run the following command on your OSS for each OST in the file system:
# tune2fs [-m reserved_blocks_percent] /dev/ostdev
You do not need to shut down Lustre before running this command or restart it afterwards.
Reducing the space reservation can cause severe performance degradation as the OST file system becomes more than 95% full, due to difficulty in locating large areas of contiguous free space. This performance degradation may persist even if the space usage drops below 95% again. It is recommended NOT to reduce the reserved disk space below 5%.
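For example, to restore the default 5% reservation on a hypothetical OST device /dev/sdb:
oss# tune2fs -m 5 /dev/sdb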
To copy the contents of an existing OST to a new OST (or an old MDT to a new MDT), follow the process for either OST/MDT backups in Section 18.2, “ Backing Up and Restoring an MDT or OST (ldiskfs Device Level)”or Section 18.3, “ Backing Up an OST or MDT (Backend File System Level)”. For more information on removing a MDT, see Section 14.9.1, “Removing an MDT from the File System”.
Use this procedure to identify the file containing a given object on a given OST.
On the OST (as root), run
debugfs
to display the file identifier (
FID
) of the file associated with the object.
For example, if the object is
34976
on
/dev/lustre/ost_test2
, the debug command is:
# debugfs -c -R "stat /O/0/d$((34976 % 32))/34976" /dev/lustre/ost_test2
The command output is:
debugfs 1.45.6.wc1 (20-Mar-2020) /dev/lustre/ost_test2: catastrophic mode - not reading inode or group bitmaps Inode: 352365 Type: regular Mode: 0666 Flags: 0x80000 Generation: 2393149953 Version: 0x0000002a:00005f81 User: 1000 Group: 1000 Size: 260096 File ACL: 0 Directory ACL: 0 Links: 1 Blockcount: 512 Fragment: Address: 0 Number: 0 Size: 0 ctime: 0x4a216b48:00000000 -- Sat May 30 13:22:16 2009 atime: 0x4a216b48:00000000 -- Sat May 30 13:22:16 2009 mtime: 0x4a216b48:00000000 -- Sat May 30 13:22:16 2009 crtime: 0x4a216b3c:975870dc -- Sat May 30 13:22:04 2009 Size of extra inode fields: 24 Extended attributes stored in inode body: fid = "b9 da 24 00 00 00 00 00 6a fa 0d 3f 01 00 00 00 eb 5b 0b 00 00 00 0000 00 00 00 00 00 00 00 00 " (32) fid: objid=34976 seq=0 parent=[0x200000400:0x122:0x0] stripe=1 EXTENTS: (0-64):4620544-4620607
The parent FID will be of the form [0x200000400:0x122:0x0] and can be resolved directly using the command lfs fid2path [0x200000400:0x122:0x0] /mnt/lustre on any Lustre client, and the process is complete.
In cases of an upgraded 1.x inode (if the first part of the
FID is below 0x200000400), the MDT inode number is
0x24dab9
and generation
0x3f0dfa6a
and the pathname can also be resolved
using debugfs
.
On the MDS (as root), use
debugfs
to find the file associated with the
inode:
# debugfs -c -R "ncheck 0x24dab9" /dev/lustre/mdt_test debugfs 1.42.3.wc3 (15-Aug-2012) /dev/lustre/mdt_test: catastrophic mode - not reading inode or group bitmaps Inode Pathname 2415289 /ROOT/brian-laptop-guest/clients/client11/~dmtmp/PWRPNT/ZD16.BMP
The command lists the inode and pathname associated with the object.
The debugfs ncheck command is a brute-force search that may take a long time to complete.
To find the Lustre file from a disk LBA, follow the steps listed in the document at this URL: https://www.smartmontools.org/wiki/BadBlockHowto. Then, follow the steps above to resolve the Lustre filename.
Once you have the Lustre file system up and running, you can use the procedures in this section to perform these basic Lustre maintenance tasks:
To mount a client or an MDT with one or more inactive OSTs, run commands similar to this:
client# mount -o exclude=testfs-OST0000 -t lustre \ uml1:/testfs /mnt/testfs client# lctl get_param lov.testfs-clilov-*.target_obd
To activate an inactive OST on a live client or MDT, use the
lctl activate
command on the OSC device. For example:
lctl --device 7 activate
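The OSC device number can be found by listing the local devices and locating the entry for the relevant OST, for example (the device numbers in the output will vary):
client# lctl dl | grep osc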
A colon-separated list can also be specified. For example,
exclude=testfs-OST0000:testfs-OST0001
.
There may be situations in which you need to find all nodes in your Lustre file system or get the names of all OSTs.
To get a list of all Lustre nodes, run this command on the MGS:
# lctl get_param mgs.MGS.live.*
This command must be run on the MGS.
In this example, file system testfs
has three
nodes, testfs-MDT0000
,
testfs-OST0000
, and
testfs-OST0001
.
mgs:/root# lctl get_param mgs.MGS.live.* fsname: testfs flags: 0x0 gen: 26 testfs-MDT0000 testfs-OST0000 testfs-OST0001
To get the names of all OSTs, run this command on the MDS:
mds:/root# lctl get_param lov.*-mdtlov.target_obd
This command must be run on the MDS.
In this example, there are two OSTs, testfs-OST0000 and testfs-OST0001, which are both active.
mds:/root# lctl get_param lov.testfs-mdtlov.target_obd 0: testfs-OST0000_UUID ACTIVE 1: testfs-OST0001_UUID ACTIVE
If you are using a combined MGS/MDT, but you only want to start the MGS and not the MDT, run this command:
mount -t lustre /dev/mdt_partition -o nosvc /mount_point
The mdt_partition variable is the combined MGS/MDT block device.
In this example, the combined MGS/MDT is testfs-MDT0000
and the mount point is /mnt/test/mdt
.
$ mount -t lustre -L testfs-MDT0000 -o nosvc /mnt/test/mdt
If the Lustre file system configuration logs are in a state where
the file system cannot be started, use the
tunefs.lustre --writeconf
command to regenerate them.
After the writeconf
command is run and the servers
restart, the configuration logs are re-generated and stored on the MGS
(as with a new file system).
You should only use the writeconf
command if:
The configuration logs are in a state where the file system cannot start
A server NID is being changed
The writeconf
command is destructive to some
configuration items (e.g. OST pools information and tunables set via
conf_param
), and should be used with caution.
The OST pools feature enables a group of OSTs to be named for
file striping purposes. If you use OST pools, be aware that running
the writeconf
command erases
all pools information (as well as
any other parameters set via lctl conf_param
).
We recommend that the pools definitions (and
conf_param
settings) be executed via a script,
so they can be regenerated easily after writeconf
is performed. However, tunables saved with lctl set_param
-P
are not erased in this case.
If the MGS still holds any configuration logs, it may be
possible to dump these logs to save any parameters stored with
lctl conf_param
by dumping the config logs on
the MGS and saving the output (once for each MDT and OST device):
mgs# lctl --device MGS llog_print fsname-client mgs# lctl --device MGS llog_print fsname-MDT0000 mgs# lctl --device MGS llog_print fsname-OST0000
To regenerate Lustre file system configuration logs:
Stop the file system services in the following order before
running the tunefs.lustre --writeconf
command:
Unmount the clients.
Unmount the MDT(s).
Unmount the OST(s).
If the MGS is separate from the MDT it can remain mounted during this process.
Make sure the MDT and OST devices are available.
Run the tunefs.lustre --writeconf
command
on all target devices.
Run writeconf on the MDT(s) first, and then the OST(s).
On each MDS, for each MDT run:
mds# tunefs.lustre --writeconf /dev/mdt_device
On each OSS, for each OST run:
oss# tunefs.lustre --writeconf /dev/ost_device
Restart the file system in the following order:
Mount the separate MGT, if it is not already mounted.
Mount the MDT(s) in order, starting with MDT0000.
Mount the OSTs in order, starting with OST0000.
Mount the clients.
After the tunefs.lustre --writeconf
command is
run, the configuration logs are re-generated as servers connect to the
MGS.
In order to totally rewrite the Lustre configuration, the
tunefs.lustre --writeconf
command is used to
rewrite all of the configuration files.
If you need to change only the NID of the MDT or OST, the
replace_nids
command can simplify this process.
The replace_nids
command differs from
tunefs.lustre --writeconf
in that it does not
erase the entire configuration log, precluding the need to
execute the writeconf
command on all servers and
re-specify all permanent parameter settings. However, the
writeconf
command can still be used if desired.
Change a server NID in these situations:
New server hardware is added to the file system, and the MDS or an OSS is being moved to the new machine.
New network card is installed in the server.
You want to reassign IP addresses.
To change a server NID:
Update the LNet configuration in the /etc/modprobe.conf
file so the list of server NIDs is correct. Use lctl list_nids
to view the list of server NIDS.
The lctl list_nids
command indicates which network(s) are
configured to work with the Lustre file system.
Shut down the file system in this order:
Unmount the clients.
Unmount the MDT.
Unmount all OSTs.
If the MGS and MDS share a partition, start the MGS only:
mount -t lustre MDT_partition -o nosvc mount_point
Run the replace_nids
command on the MGS:
lctl replace_nids devicename nid1[,nid2,nid3 ...]
where devicename
is the Lustre target name, e.g.
testfs-OST0013
If the MGS and MDS share a partition, stop the MGS:
umount mount_point
The replace_nids
command also cleans
all old, invalidated records out of the configuration log, while
preserving all other current settings.
The previous configuration log is backed up on the MGS
disk with the suffix '.bak'
.
This command runs on the MGS node, with the MGS device mounted using -o nosvc.
It cleans up configuration files
stored in the CONFIGS/ directory of any records marked SKIP.
If the device name is given, then the specific logs for that
filesystem (e.g. testfs-MDT0000) are processed. Otherwise, if a
filesystem name is given then all configuration files are cleared.
The previous configuration log is backed up on the MGS disk with
the suffix 'config.timestamp.bak'. Eg: Lustre-MDT0000-1476454535.bak.
To clear a configuration:
Shut down the file system in this order:
Unmount the clients.
Unmount the MDT.
Unmount all OSTs.
If the MGS and MDS share a partition, start the MGS only, using the "nosvc" option:
mount -t lustre MDT_partition -o nosvc mount_point
Run the clear_conf
command on the MGS:
lctl clear_conf config
Example: To clear the configuration for MDT0000 on a filesystem named testfs, run:
mgs# lctl clear_conf testfs-MDT0000
Additional MDTs can be added using the DNE feature to serve one or more remote sub-directories within a filesystem, in order to increase the total number of files that can be created in the filesystem, to increase aggregate metadata performance, or to isolate user or application workloads from other users of the filesystem. It is possible to have multiple remote sub-directories reference the same MDT. However, the root directory will always be located on MDT0000. To add a new MDT into the file system:
Discover the maximum MDT index. Each MDT must have unique index.
client$ lctl dl | grep mdc
36 UP mdc testfs-MDT0000-mdc-ffff88004edf3c00 4c8be054-144f-9359-b063-8477566eb84e 5
37 UP mdc testfs-MDT0001-mdc-ffff88004edf3c00 4c8be054-144f-9359-b063-8477566eb84e 5
38 UP mdc testfs-MDT0002-mdc-ffff88004edf3c00 4c8be054-144f-9359-b063-8477566eb84e 5
39 UP mdc testfs-MDT0003-mdc-ffff88004edf3c00 4c8be054-144f-9359-b063-8477566eb84e 5
Add the new block device as a new MDT at the next available index. In this example, the next available index is 4.
mds# mkfs.lustre --reformat --fsname=testfs --mdt \
     --mgsnode=mgsnode --index=4 /dev/mdt4_device
Mount the MDTs.
mds# mount -t lustre /dev/mdt4_blockdevice /mnt/mdt4
In order to start creating new files and directories on the
new MDT(s) they need to be attached into the namespace at one or
more subdirectories using the lfs mkdir
command.
All files and directories below those created with
lfs mkdir
will also be created on the same MDT
unless otherwise specified.
client# lfs mkdir -i 3 /mnt/testfs/new_dir_on_mdt3
client# lfs mkdir -i 4 /mnt/testfs/new_dir_on_mdt4
client# lfs mkdir -c 4 /mnt/testfs/project/new_large_dir_striped_over_4_mdts
A new OST can be added to existing Lustre file system on either an existing OSS node or on a new OSS node. In order to keep client IO load balanced across OSS nodes for maximum aggregate performance, it is not recommended to configure different numbers of OSTs to each OSS node.
Add a new OST by using mkfs.lustre as when the filesystem was first formatted, see 4 for details. Each new OST must have a unique index number; use lctl dl to see a list of all OSTs. For example, to add a new OST at index 12 to the testfs filesystem, run the following commands on the OSS:
oss# mkfs.lustre --fsname=testfs --mgsnode=mds16@tcp0 --ost --index=12 /dev/sda
oss# mkdir -p /mnt/testfs/ost12
oss# mount -t lustre /dev/sda /mnt/testfs/ost12
Balance OST space usage (if necessary).
The file system can be quite unbalanced when new empty OSTs are added to a relatively full filesystem. New file creations are automatically balanced to favour the new OSTs. If this is a scratch file system or files are pruned at regular intervals, then no further work may be needed to balance the OST space usage as new files being created will preferentially be placed on the less full OST(s). As old files are deleted, they will release space on the old OST(s).
Files existing prior to the expansion can optionally be
rebalanced using the lfs_migrate
utility.
This redistributes file data over the entire set of OSTs.
For example, to rebalance all files within the directory
/mnt/lustre/dir
, enter:
client# lfs_migrate /mnt/lustre/dir
To migrate files within the /test
file
system on OST0004
that are larger than 4GB in
size to other OSTs, enter:
client# lfs find /test --ost test-OST0004 -size +4G | lfs_migrate -y
See Section 40.2, “
lfs_migrate
” for details.
OSTs and DNE MDTs can be removed from and restored to a Lustre filesystem. Deactivating an OST means that it is temporarily or permanently marked unavailable. Deactivating an OST on the MDS means it will not try to allocate new objects there or perform OST recovery, while deactivating an OST on the client means it will not wait for OST recovery if it cannot contact the OST and will instead return an IO error to the application immediately if files on the OST are accessed. An OST may be permanently deactivated from the file system, depending on the situation and commands used.
A permanently deactivated MDT or OST still appears in the
filesystem configuration until the configuration is regenerated with
writeconf
or it is replaced with a new MDT or OST
at the same index and permanently reactivated. A deactivated OST
will not be listed by lfs df
.
You may want to temporarily deactivate an OST on the MDS to prevent new files from being written to it in several situations:
A hard drive has failed and a RAID resync/rebuild is underway, though the OST can also be marked degraded by the RAID system to avoid allocating new files on the slow OST which can reduce performance, see Section 13.7, “ Handling Degraded OST RAID Arrays” for more details.
OST is nearing its space capacity, though the MDS will already try to avoid allocating new files on overly-full OSTs if possible, see Section 39.7, “Allocating Free Space on OSTs” for details.
MDT/OST storage or MDS/OSS node has failed, and will not be available for some time (or forever), but there is still a desire to continue using the filesystem before it is repaired.
If the MDT is permanently inaccessible,
lfs rm_entry {directory}
can be used to delete the
directory entry for the unavailable MDT. Using rmdir
would otherwise report an IO error due to the remote MDT being inactive.
Please note that if the MDT is available, standard
rm -r
should be used to delete the remote directory.
After the remote directory has been removed, the administrator should
mark the MDT as permanently inactive with:
lctl conf_param {MDT name}.mdc.active=0
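For example, a minimal sketch run on the MGS, assuming a filesystem named testfs whose unavailable MDT has index 3 (both assumptions):
mgs# lctl conf_param testfs-MDT0003.mdc.active=0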
A user can identify which MDT holds a remote sub-directory using
the lfs
utility. For example:
client$ lfs getstripe --mdt-index /mnt/lustre/remote_dir1
1
client$ mkdir /mnt/lustre/local_dir0
client$ lfs getstripe --mdt-index /mnt/lustre/local_dir0
0
The lfs getstripe --mdt-index
command
returns the index of the MDT that is serving the given directory.
Files located on or below an inactive MDT are inaccessible until the MDT is activated again. Clients accessing an inactive MDT will receive an EIO error.
When deactivating an OST, note that the client and MDS each have an OSC device that handles communication with the corresponding OST. To remove an OST from the file system:
If the OST is functional, and there are files located on the OST that need to be migrated off of the OST, the file creation for that OST should be temporarily deactivated on the MDS (each MDS if running with multiple MDS nodes in DNE mode).
With Lustre 2.9 and later, the MDS should be
set to only disable file creation on that OST by setting
max_create_count
to zero:
mds# lctl set_param osp.osc_name.max_create_count=0
This ensures that files deleted or migrated off of the OST
will have their corresponding OST objects destroyed, and the space
will be freed. For example, to disable OST0000
in the filesystem testfs
, run:
mds# lctl set_param osp.testfs-OST0000-osc-MDT*.max_create_count=0
on each MDS in the testfs
filesystem.
With older versions of Lustre, to deactivate the OSC on the MDS node(s) use:
mds# lctl set_param osp.osc_name.active=0
This will prevent the MDS from attempting any communication with that OST, including destroying objects located thereon. This is fine if the OST will be removed permanently, if the OST is not stable in operation, or if it is in a read-only state. Otherwise, the free space and objects on the OST will not decrease when files are deleted, and object destruction will be deferred until the MDS reconnects to the OST.
For example, to deactivate OST0000
in
the filesystem testfs
, run:
mds# lctl set_param osp.testfs-OST0000-osc-MDT*.active=0
Deactivating the OST on the MDS does not prevent use of existing objects for read/write by a client.
If migrating files from a working OST, do not deactivate the OST on clients, as this causes IO errors when accessing files located there and will cause the migration of files off the OST to fail.
Do not use lctl set_param -P
or
lctl conf_param
to
deactivate the OST if it is still working, as this immediately
and permanently deactivates it in the file system configuration
on both the MDS and all clients.
Discover all files that have objects residing on the deactivated OST. Depending on whether the deactivated OST is available or not, the data from that OST may be migrated to other OSTs, or may need to be restored from backup.
If the OST is still online and available, find all files with objects on the deactivated OST, and copy them to other OSTs in the file system:
client# lfs find --ost ost_name /mount/point | lfs_migrate -y
Note that if multiple OSTs are being deactivated at one
time, the lfs find
command can take multiple
--ost
arguments, and will return files that
are located on any of the specified OSTs.
If the OST is no longer available, delete the files on that OST and restore them from backup:
client# lfs find --ost ost_uuid -print0 /mount/point | tee /tmp/files_to_restore | xargs -0 -n 1 unlink
The list of files that need to be restored from backup is
stored in /tmp/files_to_restore
. Restoring
these files is beyond the scope of this document.
Deactivate the OST.
If there is expected to be a replacement OST in some short time (a few days), the OST can temporarily be deactivated on the clients using:
client# lctl set_param osc.fsname-OSTnumber-*.active=0
This setting is only temporary and will be reset if the clients are remounted or rebooted. It needs to be run on all clients.
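For illustration, a minimal sketch assuming the filesystem is named testfs and the OST being deactivated has index 0x7 (both assumptions); this would be run on every client:
client# lctl set_param osc.testfs-OST0007-*.active=0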
If there is not expected to be a replacement for this OST in the near future, permanently deactivate it on all clients and the MDS by running the following command on the MGS:
mgs# lctl conf_param ost_name.osc.active=0
A deactivated OST still appears in the file system
configuration, though a replacement OST can be created that
re-uses the same OST index with the
mkfs.lustre --replace
option, see
Section 14.9.5, “
Restoring OST Configuration Files”.
In Lustre 2.16 and later, it is possible to run the command
"lctl del_ost --target fsname-OSTxxxx"
on the MGS to totally remove an OST from the MGS configuration
logs. This will cancel the configuration logs for that OST in
the client and MDT configuration logs for the named filesystem.
This permanently removes the configuration records for that OST
from the filesystem, so that it will not be visible on later
client and MDT mounts, and should only be run after earlier
steps to migrate files off the OST.
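For example, a hedged sketch assuming the filesystem is named testfs and the removed OST has index 3 (both assumptions):
mgs# lctl del_ost --target testfs-OST0003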
If the del_ost command is not available, the OST configuration
records should be found in the startup logs by running the command
"lctl --device MGS llog_print fsname-client"
on the MGS (and also "fsname-MDTxxxx" for all the MDTs) to list all
attach, setup, add_osc, add_pool, and other records related to the
removed OST(s). Once the index value is known for each configuration
record, the command
"lctl --device MGS llog_cancel llog_name -i index"
will drop that record from the configuration log llog_name. This is
needed for each of the fsname-MDTxxxx and fsname-client configuration
logs so that new mounts will no longer process it. If a whole OSS is
being removed, the add_uuid records for the OSS should similarly be
canceled.
mgs# lctl --device MGS llog_print testfs-client | egrep "192.168.10.99@tcp|OST0003"
- { index: 135, event: add_uuid, nid: 192.168.10.99@tcp(0x20000c0a80a63), node: 192.168.10.99@tcp }
- { index: 136, event: attach, device: testfs-OST0003-osc, type: osc, UUID: testfs-clilov_UUID }
- { index: 137, event: setup, device: testfs-OST0003-osc, UUID: testfs-OST0003_UUID, node: 192.168.10.99@tcp }
- { index: 138, event: add_osc, device: testfs-clilov, ost: testfs-OST0003_UUID, index: 3, gen: 1 }
mgs# lctl --device MGS llog_cancel testfs-client -i 138
mgs# lctl --device MGS llog_cancel testfs-client -i 137
mgs# lctl --device MGS llog_cancel testfs-client -i 136
If the OST device is still accessible, then the Lustre configuration files on the OST should be backed up and saved for future use in order to avoid difficulties when a replacement OST is returned to service. These files rarely change, so they can and should be backed up while the OST is functional and accessible. If the deactivated OST is still available to mount (i.e. has not permanently failed or is unmountable due to severe corruption), an effort should be made to preserve these files.
Mount the OST file system.
oss# mkdir -p /mnt/ost
oss# mount -t ldiskfs /dev/ost_device /mnt/ost
Back up the OST configuration files.
oss# tar cvf ost_name.tar -C /mnt/ost last_rcvd CONFIGS/ O/0/LAST_ID
Unmount the OST file system.
oss# umount /mnt/ost
If the original OST is still available, it is best to follow the OST backup and restore procedure given in either Section 18.2, “ Backing Up and Restoring an MDT or OST (ldiskfs Device Level)”, or Section 18.3, “ Backing Up an OST or MDT (Backend File System Level)” and Section 18.4, “ Restoring a File-Level Backup”.
To replace an OST that was removed from service due to corruption
or hardware failure, the replacement OST needs to be formatted using
mkfs.lustre
, and the Lustre file system configuration
should be restored, if available. Any objects stored on the OST will
be permanently lost, and files using the OST should be deleted and/or
restored from backup.
With Lustre 2.5 and later, it is possible to
replace an OST to the same index without restoring the configuration
files, using the --replace
option at format time.
oss# mkfs.lustre --ost --reformat --replace --index=old_ost_index \
     other_options /dev/new_ost_dev
The MDS and OSS will negotiate the LAST_ID
value
for the replacement OST.
If the OST configuration files were not backed up, due to the OST file system being completely inaccessible, it is still possible to replace the failed OST with a new one at the same OST index.
For older versions, format the OST file system without the
--replace
option and restore the saved
configuration:
oss# mkfs.lustre --ost --reformat --index=old_ost_index \
     other_options /dev/new_ost_dev
Mount the OST file system.
oss# mkdir /mnt/ost
oss# mount -t ldiskfs /dev/new_ost_dev /mnt/ost
Restore the OST configuration files, if available.
oss# tar xvf ost_name.tar -C /mnt/ost
Recreate the OST configuration files, if unavailable.
Follow the procedure in
Section 35.3.4, “Fixing a Bad LAST_ID on an OST” to recreate the LAST_ID
file for this OST index. The last_rcvd
file
will be recreated when the OST is first mounted using the default
parameters, which are normally correct for all file systems. The
CONFIGS/mountdata
file is created by
mkfs.lustre
at format time, but has flags set
that request it to register itself with the MGS. It is possible to
copy the flags from another working OST (which should be the same):
oss1# debugfs -c -R "dump CONFIGS/mountdata /tmp" /dev/other_osdev
oss1# scp /tmp/mountdata oss0:/tmp/mountdata
oss0# dd if=/tmp/mountdata of=/mnt/ost/CONFIGS/mountdata bs=4 count=1 seek=5 skip=5 conv=notrunc
Unmount the OST file system.
oss# umount /mnt/ost
If the OST was permanently deactivated, it needs to be reactivated in the MGS configuration.
mgs# lctl conf_param ost_name.osc.active=1
If the OST was temporarily deactivated, it needs to be reactivated on the MDS and clients.
mds# lctl set_param osp.fsname-OSTnumber-*.active=1
client# lctl set_param osc.fsname-OSTnumber-*.active=1
You can abort recovery with either the lctl
utility or by mounting the target with the abort_recov
option (mount -o abort_recov
). When starting a target, run:
mds# mount -t lustre -L mdt_name -o abort_recov /mount_point
The recovery process is blocked until all OSTs are available.
In the course of administering a Lustre file system, you may need to determine which machine is serving a specific OST. It is not as simple as identifying the machine's IP address, as IP is only one of several networking protocols that the Lustre software uses and, as such, LNet does not use IP addresses as node identifiers, but NIDs instead. To identify the NID that is serving a specific OST, run one of the following commands on a client (you do not need to be a root user):
client$ lctl get_param osc.fsname-OSTnumber*.ost_conn_uuid
For example:
client$ lctl get_param osc.*-OST0000*.ost_conn_uuid
osc.testfs-OST0000-osc-f1579000.ost_conn_uuid=192.168.20.1@tcp
- OR -
client$ lctl get_param osc.*.ost_conn_uuid
osc.testfs-OST0000-osc-f1579000.ost_conn_uuid=192.168.20.1@tcp
osc.testfs-OST0001-osc-f1579000.ost_conn_uuid=192.168.20.1@tcp
osc.testfs-OST0002-osc-f1579000.ost_conn_uuid=192.168.20.1@tcp
osc.testfs-OST0003-osc-f1579000.ost_conn_uuid=192.168.20.1@tcp
osc.testfs-OST0004-osc-f1579000.ost_conn_uuid=192.168.20.1@tcp
To change the address of a failover node (e.g, to use node X instead of node Y), run this command on the OSS/OST partition (depending on which option was used to originally identify the NID):
oss# tunefs.lustre --erase-params --servicenode=NID /dev/ost_device
or
oss# tunefs.lustre --erase-params --failnode=NID /dev/ost_device
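As a minimal sketch, assuming the new failover node's NID is 192.168.0.12@tcp and the OST device is /dev/sdb (both assumptions), the first form would be:
oss# tunefs.lustre --erase-params --servicenode=192.168.0.12@tcp /dev/sdb
Keep in mind that --erase-params removes previously stored parameters, so any other required permanent parameters (such as --mgsnode) may need to be re-specified on the same command line.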
For more information about the --servicenode
and
--failnode
options, see Chapter 11, Configuring Failover in a Lustre
File System.
These instructions assume the MGS node will be the same as the MDS node. For instructions on how to move MGS to a different node, see Section 14.5, “ Changing a Server NID”.
These instructions are for doing the split without shutting down other servers and clients.
Stop the MDS.
Unmount the MDT
mds# umount -f /dev/mdt_device
Create the MGT filesystem.
mds# mkfs.lustre --mgs /dev/mgt_device
Copy the configuration data from MDT disk to the new MGT disk.
mds# mount -t ldiskfs -o ro /dev/mdt_device /mdt_mount_point
mds# mount -t ldiskfs -o rw /dev/mgt_device /mgt_mount_point
mds# cp -av /mdt_mount_point/CONFIGS/filesystem_name-* /mgt_mount_point/CONFIGS/
mds# cp -av /mdt_mount_point/CONFIGS/{params,nodemap,sptlrpc}* /mgt_mount_point/CONFIGS/
mds# umount /mgt_mount_point
mds# umount /mdt_mount_point
See Section 14.4, “Regenerating Lustre Configuration Logs” for an alternative method.
Start the MGS.
mgs# mount -t lustre /dev/mgt_device /mgt_mount_point
Check to make sure it knows about all of your file systems:
mgs:/root# lctl get_param mgs.MGS.filesystems
Remove the MGS option from the MDT, and set the new MGS nid.
mds# tunefs.lustre --nomgs --mgsnode=new_mgs_nid /dev/mdt_device
Start the MDT.
mds# mount -t lustre /dev/mdt_device /mdt_mount_point
Check to make sure the MGS configuration looks right:
mgs# lctl get_param mgs.MGS.live.filesystem_name
It is sometimes desirable to be able to mark the filesystem read-only directly on the server, rather than remounting the clients and setting the option there. This can be useful if there is a rogue client that is deleting files, or when decommissioning a system to prevent already-mounted clients from modifying it anymore.
Set the mdt.*.readonly
parameter to
1
to immediately set the MDT to read-only. All future
MDT access will immediately return a "Read-only file system" error
(EROFS
) until the parameter is set to
0
again.
Example of setting the readonly
parameter to
1
, verifying the current setting, accessing from a
client, and setting the parameter back to 0
:
mds# lctl set_param mdt.fs-MDT0000.readonly=1
mdt.fs-MDT0000.readonly=1
mds# lctl get_param mdt.fs-MDT0000.readonly
mdt.fs-MDT0000.readonly=1
client$ touch test_file
touch: cannot touch 'test_file': Read-only file system
mds# lctl set_param mdt.fs-MDT0000.readonly=0
mdt.fs-MDT0000.readonly=0
This section shows how to tune/enable/disable fallocate for ldiskfs OSTs.
The default mode=0
is the standard
"allocate unwritten extents" behavior used by ext4. This is by far the
fastest for space allocation, but requires the unwritten extents to be
split and/or zeroed when they are overwritten.
The OST fallocate mode=1
can also be set to use
"zeroed extents", which may be handled by "WRITE SAME", "TRIM zeroes data",
or other low-level functionality in the underlying block device.
mode=-1
completely disables fallocate.
Example: To completely disable fallocate
lctl set_param osd-ldiskfs.*.fallocate_zero_blocks=-1
Example: To enable fallocate to use 'zeroed extents'
lctl set_param osd-ldiskfs.*.fallocate_zero_blocks=1
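To check the currently configured fallocate mode on the OSTs, a minimal sketch (the exact output format may vary by release):
oss# lctl get_param osd-ldiskfs.*.fallocate_zero_blocks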
This chapter describes some tools for managing Lustre networking (LNet) and includes the following sections:
There are two mechanisms to update the health status of a peer or a router:
LNet can actively check health status of all routers and mark them as dead or alive automatically. By default, this is off. To enable it set auto_down
and if desired check_routers_before_use
. This initial check may cause a pause equal to router_ping_timeout
at system startup, if there are dead routers in the system.
When there is a communication error, all LNDs notify LNet that the peer (not necessarily a router) is down. This mechanism is always on, and there is no parameter to turn it off. However, if you set the LNet module parameter auto_down
to 0
, LNet ignores all such peer-down notifications.
Several key differences in both mechanisms:
The router pinger only checks routers for their health, while LNDs notice all dead peers, regardless of whether they are a router or not.
The router pinger actively checks the router health by sending pings, but LNDs only notice a dead peer when there is network traffic going on.
The router pinger can bring a router from alive to dead or vice versa, but LNDs can only bring a peer down.
The Lustre software automatically starts and stops LNet, but it can also be manually started in a standalone manner. This is particularly useful to verify that your networking setup is working correctly before you attempt to start the Lustre file system.
To start LNet, run:
$ modprobe lnet $ lctl network up
To see the list of local NIDs, run:
$ lctl list_nids
This command tells you the network(s) configured to work with the Lustre file system.
If the networks are not correctly set up, see the modules.conf "networks=" line and make sure the network layer modules are correctly installed and configured.
To get the best remote NID, run:
$ lctl which_nid NIDs
where NIDs is the list of available NIDs.
This command takes the "best" NID from a list of the NIDs of a remote host. The "best" NID is the one that the local node uses when trying to communicate with the remote node.
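For example, a hedged sketch with two assumed NIDs; the command prints whichever of the listed NIDs the local node would use to reach the remote host:
$ lctl which_nid 192.168.20.1@tcp 10.10.10.1@o2ib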
Before the LNet modules can be removed, LNet references must be removed. In general, these references are removed automatically when the Lustre file system is shut down, but for standalone routers, an explicit step is needed to stop LNet. Run:
lctl network unconfigure
Attempting to remove Lustre modules prior to stopping the network may result in a
crash or an LNet hang. If this occurs, the node must be rebooted (in most cases). Make
sure that the Lustre network and Lustre file system are stopped prior to unloading the
modules. Be extremely careful using rmmod -f
.
To unconfigure the LNet network, run:
modprobe -r lnd_and_lnet_modules
To remove all Lustre modules, run:
$ lustre_rmmod
To aggregate bandwidth across both rails of a dual-rail IB cluster (o2iblnd) [1] using LNet, consider these points:
LNet can work with multiple rails, however, it does not load balance across them. The actual rail used for any communication is determined by the peer NID.
Hardware multi-rail LNet configurations do not provide an additional level of network fault tolerance. The configurations described below are for bandwidth aggregation only.
A Lustre node always uses the same local NID to communicate with a given peer NID. The criteria used to determine the local NID are:
Lowest route priority number (lower number, higher priority).
Fewest hops (to minimize routing), and
Appears first in the "networks
"
or "ip2nets
" LNet configuration strings
A Lustre file system contains OSSs with two InfiniBand HCAs. Lustre clients have only one InfiniBand HCA using OFED-based Infiniband ''o2ib'' drivers. Load balancing between the HCAs on the OSS is accomplished through LNet.
To configure LNet for load balancing on clients and servers:
Set the lustre.conf
options.
Depending on your configuration, set lustre.conf
options as follows:
Dual HCA OSS server
options lnet networks="o2ib0(ib0),o2ib1(ib1)"
Client with the odd IP address
options lnet ip2nets="o2ib0(ib0) 192.168.10.[103-253/2]"
Client with the even IP address
options lnet ip2nets="o2ib1(ib0) 192.168.10.[102-254/2]"
Run the modprobe lnet command and create a combined MGS/MDT file system.
The following commands create an MGS/MDT or OST file system and mount the targets on the servers.
# modprobe lnet
# mkfs.lustre --fsname lustre --mgs --mdt /dev/mdt_device
# mkdir -p /mount_point
# mount -t lustre /dev/mdt_device /mount_point
For example:
modprobe lnet
mds# mkfs.lustre --fsname lustre --mdt --mgs /dev/sda
mds# mkdir -p /mnt/test/mdt
mds# mount -t lustre /dev/sda /mnt/test/mdt
mds# mount -t lustre mgs@o2ib0:/lustre /mnt/mdt
oss# mkfs.lustre --fsname lustre --mgsnode=mds@o2ib0 --ost --index=0 /dev/sda
oss# mkdir -p /mnt/test/ost
oss# mount -t lustre /dev/sda /mnt/test/ost
oss# mount -t lustre mgs@o2ib0:/lustre /mnt/ost0
Mount the clients.
client# mount -t lustre mgs_node:/fsname /mount_point
This example shows an IB client being mounted.
client# mount -t lustre 192.168.10.101@o2ib0,192.168.10.102@o2ib1:/mds/client /mnt/lustre
As an example, consider a two-rail IB cluster running the OFED stack with these IPoIB address assignments.
        | ib0               | ib1
Servers | 192.168.0.*       | 192.168.1.*
Clients | 192.168.[2-127].* | 192.168.[128-253].*
You could create these configurations:
A cluster with more clients than servers. The fact that an individual client cannot get two rails of bandwidth is unimportant because the servers are typically the actual bottleneck.
ip2nets="o2ib0(ib0), o2ib1(ib1) 192.168.[0-1].* \ #all servers;\ o2ib0(ib0) 192.168.[2-253].[0-252/2] #even cl\ ients;\ o2ib1(ib1) 192.168.[2-253].[1-253/2] #odd cli\ ents"
This configuration gives every server two NIDs, one on each network, and statically load-balances clients between the rails.
A single client that must get two rails of bandwidth, and it does not matter if the maximum aggregate bandwidth is only (# servers) * (1 rail).
ip2nets=" o2ib0(ib0) 192.168.[0-1].[0-252/2] \ #even servers;\ o2ib1(ib1) 192.168.[0-1].[1-253/2] \ #odd servers;\ o2ib0(ib0),o2ib1(ib1) 192.168.[2-253].* \ #clients"
This configuration gives every server a single NID on one rail or the other. Clients have a NID on both rails.
All clients and all servers must get two rails of bandwidth.
ip2nets="o2ib0(ib0),o2ib2(ib1) 192.168.[0-1].[0-252/2] \ #even servers;\ o2ib1(ib0),o2ib3(ib1) 192.168.[0-1].[1-253/2] \ #odd servers;\ o2ib0(ib0),o2ib3(ib1) 192.168.[2-253].[0-252/2) \ #even clients;\ o2ib1(ib0),o2ib2(ib1) 192.168.[2-253].[1-253/2) \ #odd clients"
This configuration includes two additional proxy o2ib networks to work around the
simplistic NID selection algorithm in the Lustre software. It connects "even"
clients to "even" servers with o2ib0
on
rail0
, and "odd" servers with o2ib3
on
rail1
. Similarly, it connects "odd" clients to
"odd" servers with o2ib1
on rail0
, and
"even" servers with o2ib2
on rail1
.
Two scripts are provided:
lustre/scripts/lustre_routes_config
and
lustre/scripts/lustre_routes_conversion
.
lustre_routes_config
sets or cleans up LNet routes
from the specified config file. The
/etc/sysconfig/lnet_routes.conf
file can be used to
automatically configure routes on LNet startup.
lustre_routes_conversion
converts a legacy routes
configuration file to the new syntax, which is parsed by
lustre_routes_config
.
lustre_routes_config
usage is as follows
lustre_routes_config [--setup|--cleanup|--dry-run|--verbose] config_file
--setup: configure routes listed in config_file
--cleanup: unconfigure routes listed in config_file
--dry-run: echo commands to be run, but do not execute them
--verbose: echo commands before they are executed
The format of the file which is passed into the script is as follows:
network: { gateway: gateway@exit_network [hop: hop] [priority: priority] }
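For example, a minimal sketch of two entries in /etc/sysconfig/lnet_routes.conf; the network names and gateway NIDs are assumptions for illustration:
tcp4: { gateway: 10.3.3.4@tcp }
o2ib1: { gateway: 192.168.1.1@o2ib0 hop: 2 priority: 1 }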
An LNet router is identified when its local NID appears within the list of routes. However, this cannot be achieved by the use of this script, since the script only adds extra routes after the router is identified. To ensure that a router is identified correctly, make sure to add its local NID in the routes parameter in the modprobe lustre configuration file. See Section 43.1, “Introduction”.
lustre_routes_conversion
usage is as follows:
lustre_routes_conversion legacy_file new_file
lustre_routes_conversion
takes as a first parameter a file with routes configured as follows:
network [hop] gateway@exit_network[:priority];
The script then converts each routes entry in the provided file to:
network: { gateway: gateway@exit_network [hop: hop] [priority: priority] }
and appends each converted entry to the output file passed in as the second parameter to the script.
Below is an example of a legacy LNet route configuration. A legacy configuration file can have multiple entries.
tcp1 10.1.1.2@tcp0:1;
tcp2 10.1.1.3@tcp0:2;
tcp3 10.1.1.4@tcp0;
Below is an example of the converted LNet route configuration. The following would be the result of the lustre_routes_conversion
script, when run on the above legacy entries.
tcp1: { gateway: 10.1.1.2@tcp0 priority: 1 }
tcp2: { gateway: 10.1.1.3@tcp0 priority: 2 }
tcp3: { gateway: 10.1.1.4@tcp0 }
[1] Hardware multi-rail configurations are only supported by o2iblnd; other IB LNDs do not support multiple interfaces.
This chapter describes LNet Software Multi-Rail configuration and administration.
In computer networking, multi-rail is an arrangement in which two or more network interfaces to a single network on a computer node are employed, to achieve increased throughput. Multi-rail can also be where a node has one or more interfaces to multiple, even different kinds of networks, such as Ethernet, Infiniband, and Intel® Omni-Path. For Lustre clients, multi-rail generally presents the combined network capabilities as a single LNet network. Peer nodes that are multi-rail capable are established during configuration, as are user-defined interface selection policies.
The following link contains a detailed high-level design for the feature: Multi-Rail High-Level Design
Every node using multi-rail networking needs to be properly
configured. Multi-rail uses lnetctl
and the LNet
Configuration Library for configuration. Configuring multi-rail for a
given node involves two tasks:
Configuring multiple network interfaces present on the local node.
Adding remote peers that are multi-rail capable (are connected to one or more common networks with at least two interfaces).
This section is a supplement to Section 9.1.3, “Adding, Deleting and Showing Networks” and contains further examples for Multi-Rail configurations.
For information on the dynamic peer discovery feature added in Lustre Release 2.11.0, see Section 9.1.5, “Dynamic Peer Discovery”.
Example lnetctl add
command with multiple
interfaces in a Multi-Rail configuration:
lnetctl net add --net tcp --if eth0,eth1
Example of YAML net show:
lnetctl net show -v net: - net type: lo local NI(s): - nid: 0@lo status: up statistics: send_count: 0 recv_count: 0 drop_count: 0 tunables: peer_timeout: 0 peer_credits: 0 peer_buffer_credits: 0 credits: 0 lnd tunables: tcp bonding: 0 dev cpt: 0 CPT: "[0]" - net type: tcp local NI(s): - nid: 192.168.122.10@tcp status: up interfaces: 0: eth0 statistics: send_count: 0 recv_count: 0 drop_count: 0 tunables: peer_timeout: 180 peer_credits: 8 peer_buffer_credits: 0 credits: 256 lnd tunables: tcp bonding: 0 dev cpt: -1 CPT: "[0]" - nid: 192.168.122.11@tcp status: up interfaces: 0: eth1 statistics: send_count: 0 recv_count: 0 drop_count: 0 tunables: peer_timeout: 180 peer_credits: 8 peer_buffer_credits: 0 credits: 256 lnd tunables: tcp bonding: 0 dev cpt: -1 CPT: "[0]"
Example delete with lnetctl net del
:
Assuming the network configuration is as shown above with the
lnetctl net show -v
in the previous section, we can
delete a net with following command:
lnetctl net del --net tcp --if eth0
The resultant net information would look like:
lnetctl net show -v
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
          statistics:
              send_count: 0
              recv_count: 0
              drop_count: 0
          tunables:
              peer_timeout: 0
              peer_credits: 0
              peer_buffer_credits: 0
              credits: 0
          lnd tunables:
          tcp bonding: 0
          dev cpt: 0
          CPT: "[0,1,2,3]"
The syntax of a YAML file to perform a delete would be:
- net type: tcp
  local NI(s):
    - nid: 192.168.122.10@tcp
      interfaces:
          0: eth0
The following example lnetctl peer add
command adds a peer with 2 nids, with
192.168.122.30@tcp
being the primary nid:
lnetctl peer add --prim_nid 192.168.122.30@tcp --nid 192.168.122.30@tcp,192.168.122.31@tcp
The resulting lnetctl peer show
would be:
lnetctl peer show -v peer: - primary nid: 192.168.122.30@tcp Multi-Rail: True peer ni: - nid: 192.168.122.30@tcp state: NA max_ni_tx_credits: 8 available_tx_credits: 8 min_tx_credits: 7 tx_q_num_of_buf: 0 available_rtr_credits: 8 min_rtr_credits: 8 refcount: 1 statistics: send_count: 2 recv_count: 2 drop_count: 0 - nid: 192.168.122.31@tcp state: NA max_ni_tx_credits: 8 available_tx_credits: 8 min_tx_credits: 7 tx_q_num_of_buf: 0 available_rtr_credits: 8 min_rtr_credits: 8 refcount: 1 statistics: send_count: 1 recv_count: 1 drop_count: 0
The following is an example YAML file for adding a peer:
addPeer.yaml
peer:
    - primary nid: 192.168.122.30@tcp
      Multi-Rail: True
      peer ni:
        - nid: 192.168.122.31@tcp
Example of deleting a single nid of a peer (192.168.122.31@tcp):
lnetctl peer del --prim_nid 192.168.122.30@tcp --nid 192.168.122.31@tcp
Example of deleting the entire peer:
lnetctl peer del --prim_nid 192.168.122.30@tcp
Example of deleting a peer via YAML:
Assuming the following peer configuration:
peer:
    - primary nid: 192.168.122.30@tcp
      Multi-Rail: True
      peer ni:
        - nid: 192.168.122.30@tcp
          state: NA
        - nid: 192.168.122.31@tcp
          state: NA
        - nid: 192.168.122.32@tcp
          state: NA
You can delete 192.168.122.32@tcp as follows:
delPeer.yaml
peer:
    - primary nid: 192.168.122.30@tcp
      Multi-Rail: True
      peer ni:
        - nid: 192.168.122.32@tcp
% lnetctl import --del < delPeer.yaml
This section details how to configure Multi-Rail with the routing feature before the Section 16.4, “Multi-Rail Routing with LNet Health” feature landed in Lustre 2.13. Routing code has always monitored the state of the route, in order to avoid using unavailable ones.
This section describes how you can configure multiple interfaces on the same gateway node but as different routes. This uses the existing route monitoring algorithm to guard against interfaces going down. With the Section 16.4, “Multi-Rail Routing with LNet Health” feature introduced in Lustre 2.13, the new algorithm uses the Section 16.5, “LNet Health” feature to monitor the different interfaces of the gateway and always ensures that the healthiest interface is used. Therefore, the configuration described in this section applies to releases prior to Lustre 2.13. It will still work in 2.13 as well, however it is not required due to the reason mentioned above.
The below example outlines a simple system where all the Lustre nodes are MR capable. Each node in the cluster has two interfaces.
The routers can aggregate the interfaces on each side of the network by configuring them on the appropriate network.
An example configuration:
Routers:
lnetctl net add --net o2ib0 --if ib0,ib1
lnetctl net add --net o2ib1 --if ib2,ib3
lnetctl peer add --nid <peer1-nidA>@o2ib,<peer1-nidB>@o2ib,...
lnetctl peer add --nid <peer2-nidA>@o2ib1,<peer2-nidB>@o2ib1,...
lnetctl set routing 1
Clients:
lnetctl net add --net o2ib0 --if ib0,ib1
lnetctl route add --net o2ib1 --gateway <rtrX-nidA>@o2ib
lnetctl peer add --nid <rtrX-nidA>@o2ib,<rtrX-nidB>@o2ib
Servers:
lnetctl net add --net o2ib1 --if ib0,ib1
lnetctl route add --net o2ib0 --gateway <rtrX-nidA>@o2ib1
lnetctl peer add --nid <rtrX-nidA>@o2ib1,<rtrX-nidB>@o2ib1
In the above configuration the clients and the servers are configured with only one route entry per router. This works because the routers are MR capable. By adding the routers as peers with multiple interfaces to the clients and the servers, when sending to the router the MR algorithm will ensure that both interfaces of the routers are used.
However, as of the Lustre 2.10 release LNet Resiliency is still under development and single interface failure will still cause the entire router to go down.
Currently, LNet provides a mechanism to monitor each route entry. LNet pings each gateway identified in the route entry on regular, configurable interval to ensure that it is alive. If sending over a specific route fails or if the router pinger determines that the gateway is down, then the route is marked as down and is not used. It is subsequently pinged on regular, configurable intervals to determine when it becomes alive again.
This mechanism can be combined with the MR feature in Lustre 2.10 to add this router resiliency feature to the configuration.
Routers:
lnetctl net add --net o2ib0 --if ib0,ib1
lnetctl net add --net o2ib1 --if ib2,ib3
lnetctl peer add --nid <peer1-nidA>@o2ib,<peer1-nidB>@o2ib,...
lnetctl peer add --nid <peer2-nidA>@o2ib1,<peer2-nidB>@o2ib1,...
lnetctl set routing 1
Clients:
lnetctl net add --net o2ib0 --if ib0,ib1
lnetctl route add --net o2ib1 --gateway <rtrX-nidA>@o2ib
lnetctl route add --net o2ib1 --gateway <rtrX-nidB>@o2ib
Servers:
lnetctl net add --net o2ib1 --if ib0,ib1
lnetctl route add --net o2ib0 --gateway <rtrX-nidA>@o2ib1
lnetctl route add --net o2ib0 --gateway <rtrX-nidB>@o2ib1
There are a few things to note in the above configuration:
The clients and the servers are now configured with two routes, each route's gateway is one of the interfaces of the route. The clients and servers will view each interface of the same router as a separate gateway and will monitor them as described above.
The clients and the servers are not configured to view the routers as MR capable. This is important because we want to deal with each interface as a separate peer and not as different interfaces of the same peer.
The routers are configured to view the peers as MR capable. This is an oddity in the configuration, but it is currently required in order to allow the routers to load balance the traffic evenly across their interfaces.
The above principles can be applied to a mixed MR/non-MR cluster. For example, the same configuration shown above can be applied if the clients and the servers are non-MR while the routers are MR capable. This appears to be a common cluster upgrade scenario.
This section details how routing and pertinent module parameters can be configured beginning with Lustre 2.13.
Multi-Rail with Dynamic Discovery allows LNet to discover and use all configured interfaces of a node. It references a node via its primary NID. Multi-Rail routing carries forward this concept to the routing infrastructure. The following changes are brought in with the Lustre 2.13 release:
Configuring a different route per gateway interface is no longer needed. One route per gateway should be configured. Gateway interfaces are used according to the Multi-Rail selection criteria.
Routing now relies on Section 16.5, “LNet Health” to keep track of the route aliveness.
Router interfaces are monitored via LNet Health. If an interface fails other interfaces will be used.
Routing uses LNet discovery to discover gateways on regular intervals.
A gateway pushes its list of interfaces upon the discovery of any changes in its interfaces' state.
A gateway can have multiple interfaces on the same or different networks. The peers using the gateway can reach it on one or more of its interfaces. Multi-Rail routing takes care of managing which interface to use.
lnetctl route add --net <remote network> --gateway <NID for the gateway> --hop <number of hops> --priority <route priority>
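For example, a minimal sketch adding a route to a remote network o2ib1 through an assumed gateway NID (the network names, NID, hop count, and priority are assumptions for illustration):
lnetctl route add --net o2ib1 --gateway 10.10.0.1@o2ib0 --hop 1 --priority 0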
Table 16.1. Configuring Module Parameters
Module Parameter |
Usage |
---|---|
|
Defaults to |
|
Defaults to |
|
Defaults to |
|
Defaults to |
|
Defaults to |
The routing infrastructure now relies on LNet Health to keep track
of interface health. Each gateway interface has a health value
associated with it. If a send fails to one of these interfaces, then the
interface's health value is decremented and placed on a recovery queue.
The unhealthy interface is then pinged every
lnet_recovery_interval
. This value defaults to
1
second.
If the peer receives a message from the gateway, then it immediately assumes that the gateway's interface is up and resets its health value to maximum. This is needed to ensure we start using the gateways immediately instead of holding off until the interface is back to full health.
LNet Discovery is used in place of pinging the peers. This serves two purposes:
The discovery communication infrastructure does not need to be duplicated for the routing feature.
It allows propagation of the gateway's interface state changes to the peers using the gateway.
For (2), if an interface changes state from UP
to
DOWN
or vice versa, then a discovery
PUSH
is sent to all the peers which can be reached.
This allows peers to adapt to changes quicker.
Discovery is designed to be backwards compatible. The discovery
protocol is composed of a GET
and a
PUT
. The GET
requests interface
information from the peer, this is a basic lnet ping. The peer responds
with its interface information and a feature bit. If the peer is
multi-rail capable and discovery is turned on, then the node will
PUSH
its interface information. As a result both peers
will be aware of each other's interfaces.
This information is then used by the peers to decide, based on the interface state provided by the gateway, whether the route is alive or not.
A route is considered alive if the following conditions hold:
The gateway can be reached on the local net via at least one path.
For a single-hop route, if
avoid_asym_router_failure
is
enabled then the remote network defined in the route must have at least
one healthy interface on the gateway.
LNet Multi-Rail has implemented the ability for multiple interfaces to be used on the same LNet network or across multiple LNet networks. The LNet Health feature adds the ability to maintain a health value for each local and remote interface. This allows the Multi-Rail algorithm to consider the health of the interface before selecting it for sending. The feature also adds the ability to resend messages across different interfaces when interface or network failures are detected. This allows LNet to mitigate communication failures before passing the failures to upper layers for further error handling. To accomplish this, LNet Health monitors the status of the send and receive operations and uses this status to increment the interface's health value in case of success and decrement it in case of failure.
The initial health value of a local or remote interface is set to
LNET_MAX_HEALTH_VALUE
, currently set to be
1000
. The value itself is arbitrary and is meant to
allow for health granularity, as opposed to having a simple boolean state.
The granularity allows the Multi-Rail algorithm to select the interface
that has the highest likelihood of sending or receiving a message.
LNet health behavior depends on the type of failure detected:
Failure Type |
Behavior |
---|---|
|
A local failure has occurred, such as no route found or an address resolution error. These failures could be temporary, therefore LNet will attempt to resend the message. LNet will decrement the health value of the local interface and will select it less often if there are multiple available interfaces. |
|
A local non-recoverable error occurred in the system, such as out of memory error. In these cases LNet will not attempt to resend the message. LNet will decrement the health value of the local interface and will select it less often if there are multiple available interfaces. |
|
If LNet successfully sends a message, but the message does not complete or an expected reply is not received, then it is classified as a remote error. LNet will not attempt to resend the message to avoid duplicate messages on the remote end. LNet will decrement the health value of the remote interface and will select it less often if there are multiple available interfaces. |
|
There are a set of failures where we can be reasonably sure that the message was dropped before getting to the remote end. In this case, LNet will attempt to resend the message. LNet will decrement the health value of the remote interface and will select it less often if there are multiple available interfaces. |
LNet Health is turned on by default. There are multiple module parameters available to control the LNet Health feature.
All the module parameters are implemented in sysfs and are located in /sys/module/lnet/parameters/. They can be set directly by echoing a value into them as well as from lnetctl.
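For example, a minimal sketch of both methods for one of these parameters, lnet_health_sensitivity (the value 100 is only an illustrative assumption):
# set directly via sysfs
echo 100 > /sys/module/lnet/parameters/lnet_health_sensitivity
# equivalent lnetctl command
lnetctl set health_sensitivity 100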
Parameter |
Description |
---|---|
|
When LNet detects a failure on a particular interface it
will decrement its Health Value by
An lnetctl set health_sensitivity: sensitivity to failure 0 - turn off health evaluation >0 - sensitivity value not more than 1000 |
|
When LNet detects a failure on a local or remote interface
it will place that interface on a recovery queue. There is a
recovery queue for local interfaces and another for remote
interfaces. The interfaces on the recovery queues will be LNet
PINGed every Having this value configurable allows system administrators to control the amount of control traffic on the network. lnetctl set recovery_interval: interval to ping unhealthy interfaces >0 - timeout in seconds |
|
This timeout is somewhat of an overloaded value. It carries the following functionality:
This value defaults to 30 seconds. lnetctl set transaction_timeout: Message/Response timeout >0 - timeout in seconds NoteThe LND timeout will now be a fraction of the
This means that in networks where very large delays are expected then it will be necessary to increase this value accordingly. |
|
When LNet detects a failure which it deems appropriate for re-sending a message it will check if a message has passed the maximum retry_count specified. After which if a message wasn't sent successfully a failure event will be passed up to the layer which initiated message sending. The default value is 2. Since the message retry interval
( lnetctl set retry_count: number of retries
0 - turn off retries
>0 - number of retries, cannot be more than
|
|
This is not a configurable parameter. But it is derived from
two configurable parameters:
lnet_lnd_timeout = (lnet_transaction_timeout-1) / (retry_count+1) As such there is a restriction that
The core assumption here is that in a healthy network, sending and receiving LNet messages should not have large delays. There could be large delays with RPC messages and their responses, but that's handled at the PtlRPC layer. |
lnetctl
can be used to show all the LNet health
configuration settings using the lnetctl global show
command.
#> lnetctl global show
global:
    numa_range: 0
    max_intf: 200
    discovery: 1
    retry_count: 3
    transaction_timeout: 10
    health_sensitivity: 100
    recovery_interval: 1
LNet Health statistics are shown under a higher verbosity settings. To show the local interface health statistics:
lnetctl net show -v 3
To show the remote interface health statistics:
lnetctl peer show -v 3
Sample output:
#> lnetctl net show -v 3 net: - net type: tcp local NI(s): - nid: 192.168.122.108@tcp status: up interfaces: 0: eth2 statistics: send_count: 304 recv_count: 284 drop_count: 0 sent_stats: put: 176 get: 138 reply: 0 ack: 0 hello: 0 received_stats: put: 145 get: 137 reply: 0 ack: 2 hello: 0 dropped_stats: put: 10 get: 0 reply: 0 ack: 0 hello: 0 health stats: health value: 1000 interrupts: 0 dropped: 10 aborted: 0 no route: 0 timeouts: 0 error: 0 tunables: peer_timeout: 180 peer_credits: 8 peer_buffer_credits: 0 credits: 256 dev cpt: -1 tcp bonding: 0 CPT: "[0]" CPT: "[0]"
There is a new YAML block, health stats
, which
displays the health statistics for each local or remote network
interface.
Global statistics also dump the global health statistics as shown below:
#> lnetctl stats show
statistics:
    msgs_alloc: 0
    msgs_max: 33
    rst_alloc: 0
    errors: 0
    send_count: 901
    resend_count: 4
    response_timeout_count: 0
    local_interrupt_count: 0
    local_dropped_count: 10
    local_aborted_count: 0
    local_no_route_count: 0
    local_timeout_count: 0
    local_error_count: 0
    remote_dropped_count: 0
    remote_error_count: 0
    remote_timeout_count: 0
    network_timeout_count: 0
    recv_count: 851
    route_count: 0
    drop_count: 10
    send_length: 425791628
    recv_length: 69852
    route_length: 0
    drop_length: 0
LNet Health is off by default. This means that
lnet_health_sensitivity
and
lnet_retry_count
are set to 0
.
Setting lnet_health_sensitivity
to
0
will not decrement the health of the interface on
failure and will not change the interface selection behavior. Furthermore,
the failed interfaces will not be placed on the recovery queues. In
essence, turning off the LNet Health feature.
The LNet Health settings will need to be tuned for each cluster. However, the base configuration would be as follows:
#> lnetctl global show
global:
    numa_range: 0
    max_intf: 200
    discovery: 1
    retry_count: 3
    transaction_timeout: 10
    health_sensitivity: 100
    recovery_interval: 1
This configuration allows a maximum of three retries for failed messages within the 10 second transaction timeout.
If there is a failure on an interface, its health value will be decremented by the configured health_sensitivity (100 in this example) and the interface will be LNet PINGed every 1 second (the recovery_interval).
This chapter describes interoperability between Lustre software releases. It also provides procedures for upgrading from older Lustre 2.x software releases to a more recent 2.y Lustre release (a major release upgrade), and from a Lustre software release 2.x.y to a more recent Lustre software release 2.x.z (a minor release upgrade). It includes the following sections:
Lustre software release 2.x (major) upgrade:
All servers must be upgraded at the same time, while some or all clients may be upgraded independently of the servers.
All servers must be upgraded to a Linux kernel supported by the Lustre software. See the Lustre Release Notes for your Lustre version for a list of tested Linux distributions.
Clients to be upgraded must be running a compatible Linux distribution as described in the Release Notes.
Lustre software release 2.x.y release (minor) upgrade:
All servers must be upgraded at the same time, while some or all clients may be upgraded.
Rolling upgrades are supported for minor releases allowing individual servers and clients to be upgraded without stopping the Lustre file system.
The procedure for upgrading from a Lustre software release 2.x to a more recent 2.y major release of the Lustre software is described in this section. To upgrade an existing 2.x installation to a more recent major release, complete the following steps:
Create a complete, restorable file system backup.
Before installing the Lustre software, back up ALL data. The Lustre software contains kernel modifications that interact with storage devices and may introduce security issues and data loss if not installed, configured, or administered properly. If a full backup of the file system is not practical, a device-level backup of the MDT file system is recommended. See Chapter 18, Backing Up and Restoring a File System for a procedure.
Shut down the entire filesystem by following Section 13.4, “ Stopping the Filesystem”
Upgrade the Linux operating system on all servers to a compatible (tested) Linux distribution and reboot.
Upgrade the Linux operating system on all clients to a compatible (tested) distribution and reboot.
Download the Lustre server RPMs for your platform from the Lustre Releases repository. See Table 8.1, “Packages Installed on Lustre Servers”for a list of required packages.
Install the Lustre server packages on all Lustre servers (MGS, MDSs, and OSSs).
Log onto a Lustre server as the
root
user
Use the
yum
command to install the packages:
# yum --nogpgcheck install pkg1.rpm pkg2.rpm ...
Verify the packages are installed correctly:
rpm -qa|egrep "lustre|wc"
Repeat these steps on each Lustre server.
Download the Lustre client RPMs for your platform from the Lustre Releases repository. See Table 8.2, “Packages Installed on Lustre Clients” for a list of required packages.
The version of the kernel running on a Lustre client must be
the same as the version of the
lustre-client-modules-
ver
package being installed. If not, a
compatible kernel must be installed on the client before the Lustre
client packages are installed.
Install the Lustre client packages on each of the Lustre clients to be upgraded.
Log onto a Lustre client as the
root
user.
Use the
yum
command to install the packages:
# yum --nogpgcheck install pkg1.rpm pkg2.rpm ...
Verify the packages were installed correctly:
# rpm -qa|egrep "lustre|kernel"
Repeat these steps on each Lustre client.
The DNE feature allows using multiple MDTs within a single filesystem namespace, and each MDT can serve one or more remote sub-directories in the file system. The root directory is always located on MDT0.
Note that clients running a release prior to the Lustre software release 2.4 can only see the namespace hosted by MDT0 and will return an IO error if an attempt is made to access a directory on another MDT.
(Optional) To format an additional MDT, complete these steps:
Determine the index used for the first MDT (each MDT must have unique index). Enter:
client$ lctl dl | grep mdc
36 UP mdc lustre-MDT0000-mdc-ffff88004edf3c00 4c8be054-144f-9359-b063-8477566eb84e 5
In this example, the next available index is 1.
Format the new block device as a new MDT at the next available MDT index by entering (on one line):
mds# mkfs.lustre --reformat --fsname=filesystem_name --mdt \
     --mgsnode=mgsnode --index=new_mdt_index /dev/mdt1_device
(Optional) If you are upgrading from a release before Lustre 2.10, to enable the project quota feature enter the following on every ldiskfs backend target while unmounted:
tune2fs -O project /dev/dev
Enabling the project
feature will prevent
the filesystem from being used by older versions of ldiskfs, so it
should only be enabled if the project quota feature is required
and/or after it is known that the upgraded release does not need
to be downgraded.
When setting up the file system, enter:
conf_param $FSNAME.quota.mdt=$QUOTA_TYPE
conf_param $FSNAME.quota.ost=$QUOTA_TYPE
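For example, a minimal sketch run on the MGS via lctl, assuming a filesystem named testfs and user/group quota enforcement (both the filesystem name and the quota type "ug" are assumptions):
mgs# lctl conf_param testfs.quota.mdt=ug
mgs# lctl conf_param testfs.quota.ost=ug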
(Optional) If upgrading an ldiskfs MDT formatted prior to Lustre 2.13, the "wide striping" feature that allows files to have more than 160 stripes and store other large xattrs was not enabled by default. This feature can be enabled on existing MDTs by running the following command on all MDT devices:
mds# tune2fs -O ea_inode /dev/mdtdev
For more information about wide striping, see Section 19.9, “Lustre Striping Internals”.
Start the Lustre file system by starting the components in the order shown in the following steps:
Mount the MGT. On the MGS, run
mgs# mount -a -t lustre
Mount the MDT(s). On each MDT, run:
mds# mount -a -t lustre
Mount all the OSTs. On each OSS node, run:
oss# mount -a -t lustre
This command assumes that all the OSTs are listed in the
/etc/fstab
file. OSTs that are not listed in
the
/etc/fstab
file, must be mounted individually
by running the mount command:
mount -t lustre /dev/block_device /mount_point
Mount the file system on the clients. On each client node, run:
client# mount -a -t lustre
(Optional) If you are upgrading from a release before Lustre 2.7, to enable OST FIDs to also store the OST index (to improve reliability of LFSCK and debug messages), after the OSTs are mounted run once on each OSS:
oss# lctl set_param osd-ldiskfs.*.osd_index_in_idif=1
Enabling the index_in_idif
feature will
prevent the OST from being used by older versions of Lustre, so it
should only be enabled once it is known there is no need for the
OST to be downgraded to an earlier release.
If a new MDT was added to the filesystem, the new MDT must be
attached into the namespace by creating one or more
new DNE subdirectories with the
lfs mkdir
command that use the new MDT:
client# lfs mkdir -i new_mdt_index /testfs/new_dir
In Lustre 2.8 and later, it is possible to split a new directory across multiple MDTs by creating it with multiple stripes:
client# lfs mkdir -c 2 /testfs/new_striped_dir
In Lustre 2.13 and later, it is possible to set the default directory layout on existing directories so new remote subdirectories are created on less-full MDTs:
client# lfs setdirstripe -D -c 1 -i -1 /testfs/some_dir
See Section 13.10.1, “Directory creation by space/inode usage” for details.
In Lustre 2.15 and later, if no default directory layout is set on the root directory, the MDS will automatically set a default directory layout on the root directory to distribute the top-level directories round-robin across all MDTs; see Section 13.10.2, “Filesystem-wide default directory striping”.
The mounting order described in the steps above must be followed for the initial mount and registration of a Lustre file system after an upgrade. For a normal start of a Lustre file system, the mounting order is MGT, OSTs, MDT(s), clients.
If you have a problem upgrading a Lustre file system, see Section 35.2, “Reporting a Lustre File System Bug” for ways to get help.
Rolling upgrades are supported for upgrading from any Lustre software release 2.x.y to a more recent maintenance release 2.x.z. This allows the Lustre file system to continue to run while individual servers (or their failover partners) and clients are upgraded one at a time. The procedure for upgrading a Lustre software release 2.x.y to a more recent minor release is described in this section.
To upgrade Lustre software release 2.x.y to a more recent minor release, complete these steps:
Create a complete, restorable file system backup.
Before installing the Lustre software, back up ALL data. The Lustre software contains kernel modifications that interact with storage devices and may introduce security issues and data loss if not installed, configured, or administered properly. If a full backup of the file system is not practical, a device-level backup of the MDT file system is recommended. See Chapter 18, Backing Up and Restoring a File System for a procedure.
Download the Lustre server RPMs for your platform from the Lustre Releases repository. See Table 8.1, “Packages Installed on Lustre Servers” for a list of required packages.
For a rolling upgrade, complete any procedures required to keep the Lustre file system running while the server to be upgraded is offline, such as failing over a primary server to its secondary partner.
Unmount the Lustre server to be upgraded (MGS, MDS, or OSS).
Install the Lustre server packages on the Lustre server.
Log onto the Lustre server as the root user.
Use the yum command to install the packages:
# yum --nogpgcheck install pkg1.rpm pkg2.rpm ...
Verify the packages are installed correctly:
# rpm -qa|egrep "lustre|kernel"
Mount the Lustre server to restart the Lustre software on the server:
server# mount -a -t lustre
Repeat these steps on each Lustre server.
Download the Lustre client RPMs for your platform from the Lustre Releases repository. See Table 8.2, “Packages Installed on Lustre Clients” for a list of required packages.
Install the Lustre client packages on each of the Lustre clients to be upgraded.
Log onto a Lustre client as the root user.
Use the yum command to install the packages:
# yum --nogpgcheck install pkg1.rpm pkg2.rpm ...
Verify the packages were installed correctly:
# rpm -qa|egrep "lustre|kernel"
Mount the Lustre client to restart the Lustre software on the client:
client# mount -a -t lustre
Repeat these steps on each Lustre client.
If you have a problem upgrading a Lustre file system, see Section 35.2, “Reporting a Lustre File System Bug” for suggestions on how to get help.
This chapter describes how to back up and restore data at the file system level, device level, and file level in a Lustre file system. Each backup approach is described in the following sections:
It is strongly recommended that sites perform periodic device-level backup of the MDT(s) (Section 18.2, “ Backing Up and Restoring an MDT or OST (ldiskfs Device Level)”), for example twice a week with alternate backups going to a separate device, even if there is not enough capacity to do a full backup of all of the filesystem data. Even if there are separate file-level backups of some or all files in the filesystem, having a device-level backup of the MDT can be very useful in case of MDT failure or corruption. Being able to restore a device-level MDT backup can avoid the significantly longer process of restoring the entire filesystem from backup. Since the MDT is required for access to all files, its loss would otherwise force full restore of the filesystem (if that is even possible) even if the OSTs are still OK.
Performing a periodic device-level MDT backup can be done relatively inexpensively because the storage need only be connected to the primary MDS (it can be manually connected to the backup MDS in the rare case it is needed), and only needs good linear read/write performance. While a device-level MDT backup is not useful for restoring individual files, it is the most efficient way to handle the case of MDT failure or corruption.
Backing up a complete file system gives you full control over the files to back up, and allows restoration of individual files as needed. File system-level backups are also the easiest to integrate into existing backup solutions.
File system backups are performed from a Lustre client (or many clients working in parallel in different directories) rather than on individual server nodes; this is no different than backing up any other file system.
However, due to the large size of most Lustre file systems, it is not always possible to get a complete backup. We recommend that you back up subsets of a file system, such as subdirectories of the entire file system, filesets for a single user, or files incremented by date, so that restores can be done more efficiently.
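For example, a per-project subset backup run from a client might look like the following sketch (the directory and archive names are illustrative):
client$ cd /mnt/lustre
client$ tar czf /backup/project1-$(date +%Y%m%d).tgz project1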
Lustre internally uses a 128-bit file identifier (FID) for all files. To interface with user applications, 64-bit inode numbers are returned by the stat(), fstat(), and readdir() system calls to 64-bit applications, and 32-bit inode numbers to 32-bit applications.
Some 32-bit applications accessing Lustre file systems (on both 32-bit and 64-bit CPUs) may experience problems with the stat(), fstat() or readdir() system calls under certain circumstances, though the Lustre client should return 32-bit inode numbers to these applications.
In particular, if the Lustre file system is exported from a 64-bit client via NFS to a 32-bit client, the Linux NFS server will export 64-bit inode numbers to applications running on the NFS client. If the 32-bit applications are not compiled with Large File Support (LFS), then they return EOVERFLOW errors when accessing the Lustre files. To avoid this problem, Linux NFS clients can use the kernel command-line option "nfs.enable_ino64=0" in order to force the NFS client to present 32-bit inode numbers to applications.
Workaround: We very strongly recommend that backups using tar(1) and other utilities that depend on the inode number to uniquely identify an inode be run on 64-bit clients. The 128-bit Lustre file identifiers cannot be uniquely mapped to a 32-bit inode number, and as a result these utilities may operate incorrectly on 32-bit clients. While there is still a small chance of inode number collisions with 64-bit inodes, the FID allocation pattern is designed to avoid collisions for long periods of usage.
The lustre_rsync feature keeps the entire file system in sync on a backup by replicating the file system's changes to a second file system (the second file system need not be a Lustre file system, but it must be sufficiently large). lustre_rsync uses Lustre changelogs to efficiently synchronize the file systems without having to scan (directory walk) the Lustre file system. This efficiency is critically important for large file systems, and distinguishes the Lustre lustre_rsync feature from other replication/backup solutions.
The lustre_rsync feature works by periodically running lustre_rsync, a userspace program used to synchronize changes in the Lustre file system onto the target file system. The lustre_rsync utility keeps a status file, which enables it to be safely interrupted and restarted without losing synchronization between the file systems.
The first time that lustre_rsync is run, the user must specify a set of parameters for the program to use. These parameters are described in the following table and in Section 44.11, “lustre_rsync”. On subsequent runs, these parameters are stored in the status file, and only the name of the status file needs to be passed to lustre_rsync.
Before using lustre_rsync:
Register the changelog user. For details, see the changelog_register parameter in Chapter 44, System Configuration Utilities (lctl).
- AND -
Verify that the Lustre file system (source) and the replica file system (target) are identical before registering the changelog user. If the file systems are discrepant, use a utility, e.g. regular rsync (not lustre_rsync), to make them identical.
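For example, an initial synchronization with regular rsync might look like the following sketch (the mount points are illustrative):
client# rsync -a --delete /mnt/lustre/ /mnt/target/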
The lustre_rsync utility uses the following parameters:
Parameter | Description
---|---
--source=src | The path to the root of the Lustre file system (source) which will be synchronized. This is a mandatory option if a valid status log created during a previous synchronization operation (--statuslog) is not specified.
--target=tgt | The path to the root where the source file system will be synchronized (target). This is a mandatory option if the status log created during a previous synchronization operation (--statuslog) is not specified. This option can be repeated if multiple synchronization targets are desired.
--mdt=mdt | The metadata device to be synchronized. A changelog user must be registered for this device. This is a mandatory option if a valid status log created during a previous synchronization operation (--statuslog) is not specified.
--user=userid | The changelog user ID for the specified MDT. To use lustre_rsync, the changelog user must be registered. For details, see the changelog_register parameter in Chapter 44, System Configuration Utilities (lctl). This is a mandatory option if a valid status log created during a previous synchronization operation (--statuslog) is not specified.
--statuslog=log | A log file to which synchronization status is saved. When the lustre_rsync utility starts, if the status log from a previous synchronization operation is specified, then the state is read from the log and the otherwise mandatory --source, --target and --mdt options can be skipped. Specifying the --source, --target and/or --mdt options in addition to the --statuslog option causes the parameters in the status log to be overridden. Command line options take precedence over options in the status log.
--xattr yes|no | Specifies whether extended attributes (xattrs) are synchronized or not. The default is to synchronize extended attributes. Note: Disabling xattrs causes Lustre striping information not to be synchronized.
--verbose | Produces verbose output.
--dry-run | Shows the output of lustre_rsync commands (copy, mkdir, etc.) on the target file system without actually executing them.
--abort-on-err | Stops processing the lustre_rsync operation if an error occurs. The default is to continue the operation.
Sample lustre_rsync commands are listed below.
Register a changelog user for an MDT (e.g. testfs-MDT0000).
# lctl --device testfs-MDT0000 changelog_register
testfs-MDT0000 Registered changelog userid 'cl1'
Synchronize a Lustre file system (/mnt/lustre) to a target file system (/mnt/target).
$ lustre_rsync --source=/mnt/lustre --target=/mnt/target \
  --mdt=testfs-MDT0000 --user=cl1 --statuslog sync.log --verbose
Lustre filesystem: testfs
MDT device: testfs-MDT0000
Source: /mnt/lustre
Target: /mnt/target
Statuslog: sync.log
Changelog registration: cl1
Starting changelog record: 0
Errors: 0
lustre_rsync took 1 seconds
Changelog records consumed: 22
After the file system undergoes changes, synchronize the changes onto the target file system. Only the statuslog name needs to be specified, as it has all the parameters passed earlier.
$ lustre_rsync --statuslog sync.log --verbose
Replicating Lustre filesystem: testfs
MDT device: testfs-MDT0000
Source: /mnt/lustre
Target: /mnt/target
Statuslog: sync.log
Changelog registration: cl1
Starting changelog record: 22
Errors: 0
lustre_rsync took 2 seconds
Changelog records consumed: 42
To synchronize a Lustre file system (/mnt/lustre) to two target file systems (/mnt/target1 and /mnt/target2):
$ lustre_rsync --source=/mnt/lustre --target=/mnt/target1 \
  --target=/mnt/target2 --mdt=testfs-MDT0000 --user=cl1 \
  --statuslog sync.log
In some cases, it is useful to do a full device-level backup of an individual device (MDT or OST) before replacing hardware, performing maintenance, etc. Doing full device-level backups ensures that all of the data and configuration files are preserved in their original state and is the easiest method of doing a backup. For the MDT file system, it may also be the fastest way to perform the backup and restore, since it can do large streaming read and write operations at the maximum bandwidth of the underlying devices.
Keeping an updated full backup of the MDT is especially important because permanent failure or corruption of the MDT file system renders the much larger amount of data in all the OSTs largely inaccessible and unusable. The storage needed for one or two full MDT device backups is much smaller than doing a full filesystem backup, and can use less expensive storage than the actual MDT device(s) since it only needs to have good streaming read/write speed instead of high random IOPS.
If hardware replacement is the reason for the backup or if a spare storage device is available, it is possible to do a raw copy of the MDT or OST from one block device to the other, as long as the new device is at least as large as the original device. To do this, run:
dd if=/dev/{original} of=/dev/{newdev} bs=4M
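Before copying, it is worth confirming that the new device is at least as large as the original; one way to check this (a sketch, not part of the original procedure) is with blockdev:
# blockdev --getsize64 /dev/{original}
# blockdev --getsize64 /dev/{newdev}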
If hardware errors cause read problems on the original device, use the command below to allow as much data as possible to be read from the original device while skipping sections of the disk with errors:
dd if=/dev/{original} of=/dev/{newdev} bs=4k conv=sync,noerror \
   count={original size in 4kB blocks}
Even in the face of hardware errors, the ldiskfs file system is very robust and it may be possible to recover the file system data after running e2fsck -fy /dev/{newdev} on the new device.
With Lustre software version 2.6 and later, LFSCK scanning will automatically move objects from lost+found back into their correct locations on the OST after directory corruption.
In order to ensure that the backup is fully consistent, the MDT or OST must be unmounted, so that there are no changes being made to the device while the data is being transferred. If the reason for the backup is preventative (i.e. MDT backup on a running MDS in case of future failures), then it is possible to perform a consistent backup from an LVM snapshot. If an LVM snapshot is not available, and taking the MDS offline for a backup is unacceptable, it is also possible to perform a backup from the raw MDT block device. While the backup from the raw device will not be fully consistent due to ongoing changes, the vast majority of ldiskfs metadata is statically allocated, and inconsistencies in the backup can be fixed by running e2fsck on the backup device; this is still much better than not having any backup at all.
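As an illustration of the LVM snapshot approach, the following sketch creates a snapshot of the MDT logical volume, streams it to an image file, and removes the snapshot; the volume group, logical volume, and backup paths are hypothetical:
mds# lvcreate -L10G -s -n mdt_snap /dev/vgmds/mdt
mds# dd if=/dev/vgmds/mdt_snap of=/backup/mdt-$(date +%Y%m%d).img bs=4M
mds# lvremove -f /dev/vgmds/mdt_snap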
This procedure provides an alternative way to back up or migrate the data of an OST or MDT at the file level. At the file level, unused space is omitted from the backup and the process may be completed more quickly with a smaller total backup size. Backing up a single OST device is not necessarily the best way to perform backups of the Lustre file system, since the files stored in the backup are not usable without metadata stored on the MDT and additional file stripes that may be on other OSTs. However, it is the preferred method for migration of OST devices, especially when it is desirable to reformat the underlying file system with different configuration options or to reduce fragmentation.
Since Lustre stores internal metadata that maps FIDs to local inode numbers via the Object Index file, these mappings need to be rebuilt after a file-level backup is restored in order for file-level MDT backup and restore to be supported. The OI Scrub rebuilds them automatically at first mount after a restore is detected, which may affect MDT performance after mount until the rebuild is completed. Progress can be monitored via lctl get_param osd-*.*.oi_scrub on the MDS or OSS node where the target filesystem was restored.
Prior to Lustre software release 2.11.0, the backend file system level backup and restore process was only possible for ldiskfs-based systems. The ability to perform a zfs-based MDT/OST file system level backup and restore was introduced in Lustre software release 2.11.0. Differing from an ldiskfs-based system, index objects must be backed up before the unmount of the target (MDT or OST) in order to be able to restore the file system successfully. To enable index backup on the target, execute the following command on the target server:
# lctl set_param osd-*.${fsname}-${target}.index_backup=1
${target} is composed of the target type (MDT or OST) plus the target index, such as MDT0000, OST0001, and so on.
The index_backup parameter is also valid for ldiskfs-based systems and is used when migrating data between ldiskfs-based and zfs-based systems as described in Section 18.6, “Migration Between ZFS and ldiskfs Target Filesystems”.
The examples below show backing up an OST filesystem. When backing up an MDT, substitute mdt for ost in the instructions below.
Unmount the target.
Make a mountpoint for the file system.
[oss]# mkdir -p /mnt/ost
Mount the file system.
For ldiskfs-based systems:
[oss]# mount -t ldiskfs /dev/{ostdev} /mnt/ost
For zfs-based systems:
Import the pool for the target if it is exported. For example:
[oss]# zpool import lustre-ost [-d ${ostdev_dir}]
Enable the canmount property on the target filesystem. For example:
[oss]# zfs set canmount=on ${fsname}-ost/ost
You also can specify the mountpoint property. By default, it will be: /${fsname}-ost/ost
Mount the target as 'zfs'. For example:
[oss]# zfs mount ${fsname}-ost/ost
Change to the mountpoint being backed up.
[oss]# cd /mnt/ost
Back up the extended attributes.
[oss]# getfattr -R -d -m '.*' -e hex -P . > ea-$(date +%Y%m%d).bak
If the tar(1) command supports the --xattrs option (see below), the getfattr step may be unnecessary as long as tar correctly backs up the trusted.* attributes. However, completing this step is not harmful and can serve as an added safety measure.
In most distributions, the getfattr command is part of the attr package. If the getfattr command returns errors like Operation not supported, then the kernel does not correctly support EAs. Stop and use a different backup method.
Verify that the ea-$date.bak file has properly backed up the EA data on the OST. Without this attribute data, the MDT restore process will fail and result in an unusable filesystem. The OST restore process may be missing extra data that can be very useful in case of later file system corruption. Look at this file with more or a text editor. Each object file should have a corresponding item similar to this:
# file: O/0/d0/100992
trusted.fid= \
0x0d822200000000004a8a73e500000000808a0100000000000000000000000000
Back up all file system data.
[oss]# tar czvf {backup file}.tgz [--xattrs] [--xattrs-include="trusted.*"] [--acls] --sparse .
The tar --sparse option is vital for backing up an MDT. Very old versions of tar may not support the --sparse option correctly, which may cause the MDT backup to take a long time. Known-working versions include the tar from the Red Hat Enterprise Linux distribution (RHEL version 6.3 or newer) or GNU tar version 1.25 and newer.
The tar --xattrs option is only available in GNU tar version 1.27 or later or in RHEL 6.3 or newer. The --xattrs-include="trusted.*" option is required for correct restoration of the xattrs when using GNU tar 1.27 or RHEL 7 and newer.
The tar --acls option is recommended for MDT backup of POSIX ACLs. Alternatively, getfacl -n -R and setfacl --restore can be used instead.
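Putting these options together, a concrete backup command might look like the following sketch (the archive path and name are illustrative):
[oss]# tar czvf /backup/testfs-OST0001.tgz --xattrs \
      --xattrs-include="trusted.*" --acls --sparse .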
Change directory out of the file system.
[oss]# cd -
Unmount the file system.
[oss]# umount /mnt/ost
When restoring an OST backup on a different node as part of an OST migration, you also have to change server NIDs and use the --writeconf command to re-generate the configuration logs. See Chapter 14, Lustre Maintenance (Changing a Server NID).
To restore data from a file-level backup, you need to format the device, restore the file data and then restore the EA data.
Format the new device.
[oss]# mkfs.lustre --ost --index {OST index} --replace --fstype=${fstype} {other options} /dev/{newdev}
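For example, reformatting a replacement OST at index 1 for a filesystem named testfs with an MGS at 10.2.0.1@tcp might look like this sketch (the filesystem name, NID, and device are illustrative; the default ldiskfs backend is assumed):
[oss]# mkfs.lustre --ost --index=1 --replace --fsname=testfs \
      --mgsnode=10.2.0.1@tcp /dev/sdc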
Set the file system label (ldiskfs-based systems only).
[oss]# e2label /dev/{newdev} {fsname}-OST{index in hex}
Mount the file system.
For ldiskfs-based systems:
[oss]# mount -t ldiskfs /dev/{newdev} /mnt/ost
For zfs-based systems:
Import the pool for the target if it is exported. For example:
[oss]# zpool import lustre-ost [-d ${ostdev_dir}]
Enable the canmount property on the target filesystem. For example:
[oss]# zfs set canmount=on ${fsname}-ost/ost
You also can specify the mountpoint property. By default, it will be: /${fsname}-ost/ost
Mount the target as 'zfs'. For example:
[oss]# zfs mount ${fsname}-ost/ost
Change to the new file system mount point.
[oss]# cd /mnt/ost
Restore the file system backup.
[oss]# tar xzvpf {backup file} [--xattrs] [--xattrs-include="trusted.*"] [--acls] [-P] --sparse
The tar --xattrs option is only available in GNU tar version 1.27 or later or in RHEL 6.3 or newer. The --xattrs-include="trusted.*" option is required for correct restoration of the MDT xattrs when using GNU tar 1.27 or RHEL 7 and newer. Otherwise, the setfattr step below should be used.
The tar --acls option is needed for correct restoration of POSIX ACLs on MDTs. Alternatively, getfacl -n -R and setfacl --restore can be used instead.
The tar -P (or --absolute-names) option can be used to speed up extraction of a trusted MDT backup archive.
If not using a version of tar that supports direct xattr backups, restore the file system extended attributes.
[oss]# setfattr --restore=ea-${date}.bak
If the --xattrs option is supported by tar and was specified in the step above, this step is redundant.
Verify that the extended attributes were restored.
[oss]# getfattr -d -m ".*" -e hex O/0/d0/100992
trusted.fid= \
0x0d822200000000004a8a73e500000000808a0100000000000000000000000000
Remove old OI and LFSCK files.
[oss]# rm -rf oi.16* lfsck_* LFSCK
Remove old CATALOGS.
[oss]# rm -f CATALOGS
This is optional for the MDT side only. The CATALOGS record the llog file handlers that are used for recovering cross-server updates. Before OI scrub rebuilds the OI mappings for the llog files, the related recovery will get a failure if it runs faster than the background OI scrub. This will result in a failure of the whole mount process. OI scrub is an online tool, therefore, a mount failure means that the OI scrub will be stopped. Removing the old CATALOGS will avoid this potential trouble. The side-effect of removing old CATALOGS is that the recovery for related cross-server updates will be aborted. However, this can be handled by LFSCK after the system mount is up.
Change directory out of the file system.
[oss]# cd -
Unmount the new file system.
[oss]# umount /mnt/ost
If the restored system has a different NID from the backup system, change the NID. For details, refer to Section 14.5, “Changing a Server NID”. For example:
[oss]# mount -t lustre -o nosvc ${fsname}-ost/ost /mnt/ost
[oss]# lctl replace_nids ${fsname}-OSTxxxx $new_nids
[oss]# umount /mnt/ost
Mount the target as lustre. Usually, we will use the -o abort_recov option to skip unnecessary recovery. For example:
[oss]# mount -t lustre -o abort_recov ${fsname}-ost/ost /mnt/ost
Lustre can detect the restore automatically when mounting the target, and then trigger OI scrub to rebuild the OIs and index objects asynchronously in the background. You can check the OI scrub status with the following command:
[oss]# lctl get_param -n osd-${fstype}.${fsname}-${target}.oi_scrub
If the file system was used between the time the backup was made and when it was restored, then the online LFSCK tool will automatically be run to ensure the filesystem is coherent. If all of the device filesystems were backed up at the same time after Lustre was stopped, this step is unnecessary. In either case, the filesystem will be usable immediately, although there may be I/O errors reading from files that are present on the MDT but not the OSTs, and files that were created after the MDT backup will not be accessible or visible. See Section 36.4, “Checking the file system with LFSCK” for details on using LFSCK.
If you want to perform disk-based backups (because, for example, access to the backup system needs to be as fast as to the primary Lustre file system), you can use the Linux LVM snapshot tool to maintain multiple, incremental file system backups.
Because LVM snapshots cost CPU cycles as new files are written, taking snapshots of the main Lustre file system will probably result in unacceptable performance losses. You should create a new, backup Lustre file system and periodically (e.g., nightly) back up new/changed files to it. Periodic snapshots can be taken of this backup file system to create a series of "full" backups.
Creating an LVM snapshot is not as reliable as making a separate backup, because the LVM snapshot shares the same disks as the primary MDT device, and depends on the primary MDT device for much of its data. If the primary MDT device becomes corrupted, this may result in the snapshot being corrupted.
Use this procedure to create a backup Lustre file system for use with the LVM snapshot mechanism.
Create LVM volumes for the MDT and OSTs.
Create LVM devices for your MDT and OST targets. Make sure not to use the entire disk for the targets; save some room for the snapshots. The snapshots start out as 0 size, but grow as you make changes to the current file system. If you expect to change 20% of the file system between backups, the most recent snapshot will be 20% of the target size, the next older one will be 40%, etc. Here is an example:
cfs21:~# pvcreate /dev/sda1
   Physical volume "/dev/sda1" successfully created
cfs21:~# vgcreate vgmain /dev/sda1
   Volume group "vgmain" successfully created
cfs21:~# lvcreate -L200G -nMDT0 vgmain
   Logical volume "MDT0" created
cfs21:~# lvcreate -L200G -nOST0 vgmain
   Logical volume "OST0" created
cfs21:~# lvscan
   ACTIVE            '/dev/vgmain/MDT0' [200.00 GB] inherit
   ACTIVE            '/dev/vgmain/OST0' [200.00 GB] inherit
Format the LVM volumes as Lustre targets.
In this example, the backup file system is called
main
and designates the current, most up-to-date
backup.
cfs21:~# mkfs.lustre --fsname=main --mdt --index=0 /dev/vgmain/MDT0 No management node specified, adding MGS to this MDT. Permanent disk data: Target: main-MDT0000 Index: 0 Lustre FS: main Mount type: ldiskfs Flags: 0x75 (MDT MGS first_time update ) Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr Parameters: checking for existing Lustre data device size = 200GB formatting backing filesystem ldiskfs on /dev/vgmain/MDT0 target name main-MDT0000 4k blocks 0 options -i 4096 -I 512 -q -O dir_index -F mkfs_cmd = mkfs.ext2 -j -b 4096 -L main-MDT0000 -i 4096 -I 512 -q -O dir_index -F /dev/vgmain/MDT0 Writing CONFIGS/mountdata cfs21:~# mkfs.lustre --mgsnode=cfs21 --fsname=main --ost --index=0 /dev/vgmain/OST0 Permanent disk data: Target: main-OST0000 Index: 0 Lustre FS: main Mount type: ldiskfs Flags: 0x72 (OST first_time update ) Persistent mount opts: errors=remount-ro,extents,mballoc Parameters: mgsnode=192.168.0.21@tcp checking for existing Lustre data device size = 200GB formatting backing filesystem ldiskfs on /dev/vgmain/OST0 target name main-OST0000 4k blocks 0 options -I 256 -q -O dir_index -F mkfs_cmd = mkfs.ext2 -j -b 4096 -L lustre-OST0000 -J size=400 -I 256 -i 262144 -O extents,uninit_bg,dir_nlink,huge_file,flex_bg -G 256 -E resize=4290772992,lazy_journal_init, -F /dev/vgmain/OST0 Writing CONFIGS/mountdata cfs21:~# mount -t lustre /dev/vgmain/MDT0 /mnt/mdt cfs21:~# mount -t lustre /dev/vgmain/OST0 /mnt/ost cfs21:~# mount -t lustre cfs21:/main /mnt/main
At periodic intervals e.g., nightly, back up new and changed files to the LVM-based backup file system.
cfs21:~# cp /etc/passwd /mnt/main
cfs21:~# cp /etc/fstab /mnt/main
cfs21:~# ls /mnt/main
fstab  passwd
Whenever you want to make a "checkpoint" of the main Lustre file system, create LVM snapshots of all target MDT and OSTs in the LVM-based backup file system. You must decide the maximum size of a snapshot ahead of time, although you can dynamically change this later. The size of a daily snapshot is dependent on the amount of data changed daily in the main Lustre file system. It is likely that a two-day old snapshot will be twice as big as a one-day old snapshot.
You can create as many snapshots as you have room for in the volume group. If necessary, you can dynamically add disks to the volume group.
The snapshots of the target MDT and OSTs should be taken at the same point in time. Make sure that the cronjob updating the backup file system is not running, since that is the only thing writing to the disks. Here is an example:
cfs21:~# modprobe dm-snapshot
cfs21:~# lvcreate -L50M -s -n MDT0.b1 /dev/vgmain/MDT0
   Rounding up size to full physical extent 52.00 MB
   Logical volume "MDT0.b1" created
cfs21:~# lvcreate -L50M -s -n OST0.b1 /dev/vgmain/OST0
   Rounding up size to full physical extent 52.00 MB
   Logical volume "OST0.b1" created
After the snapshots are taken, you can continue to back up new/changed files to "main". The snapshots will not contain the new files.
cfs21:~# cp /etc/termcap /mnt/main
cfs21:~# ls /mnt/main
fstab  passwd  termcap
Use this procedure to restore the file system from an LVM snapshot.
Rename the LVM snapshot.
Rename the file system snapshot from "main" to "back" so you can mount it without unmounting "main". This is recommended, but not required. Use the --reformat flag to tunefs.lustre to force the name change. For example:
cfs21:~# tunefs.lustre --reformat --fsname=back --writeconf /dev/vgmain/MDT0.b1 checking for existing Lustre data found Lustre data Reading CONFIGS/mountdata Read previous values: Target: main-MDT0000 Index: 0 Lustre FS: main Mount type: ldiskfs Flags: 0x5 (MDT MGS ) Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr Parameters: Permanent disk data: Target: back-MDT0000 Index: 0 Lustre FS: back Mount type: ldiskfs Flags: 0x105 (MDT MGS writeconf ) Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr Parameters: Writing CONFIGS/mountdata cfs21:~# tunefs.lustre --reformat --fsname=back --writeconf /dev/vgmain/OST0.b1 checking for existing Lustre data found Lustre data Reading CONFIGS/mountdata Read previous values: Target: main-OST0000 Index: 0 Lustre FS: main Mount type: ldiskfs Flags: 0x2 (OST ) Persistent mount opts: errors=remount-ro,extents,mballoc Parameters: mgsnode=192.168.0.21@tcp Permanent disk data: Target: back-OST0000 Index: 0 Lustre FS: back Mount type: ldiskfs Flags: 0x102 (OST writeconf ) Persistent mount opts: errors=remount-ro,extents,mballoc Parameters: mgsnode=192.168.0.21@tcp Writing CONFIGS/mountdata
When renaming a file system, we must also erase the last_rcvd file from the snapshots:
cfs21:~# mount -t ldiskfs /dev/vgmain/MDT0.b1 /mnt/mdtback
cfs21:~# rm /mnt/mdtback/last_rcvd
cfs21:~# umount /mnt/mdtback
cfs21:~# mount -t ldiskfs /dev/vgmain/OST0.b1 /mnt/ostback
cfs21:~# rm /mnt/ostback/last_rcvd
cfs21:~# umount /mnt/ostback
Mount the file system from the LVM snapshot. For example:
cfs21:~# mount -t lustre /dev/vgmain/MDT0.b1 /mnt/mdtback
cfs21:~# mount -t lustre /dev/vgmain/OST0.b1 /mnt/ostback
cfs21:~# mount -t lustre cfs21:/back /mnt/back
Note the old directory contents, as of the snapshot time. For example:
cfs21:~/cfs/b1_5/lustre/utils# ls /mnt/back
fstab  passwd
To reclaim disk space, you can erase old snapshots as your backup policy dictates. Run:
lvremove /dev/vgmain/MDT0.b1
Beginning with Lustre 2.11.0, it is possible to migrate between ZFS and ldiskfs backends. For migrating OSTs, it is best to use lfs find/lfs_migrate to empty out an OST while the filesystem is in use and then reformat it with the new fstype. For instructions on removing the OST, please see Section 14.9.3, “Removing an OST from the File System”.
The first step of the process is to make a ZFS backend backup using tar as described in Section 18.3, “Backing Up an OST or MDT (Backend File System Level)”.
Next, restore the backup to an ldiskfs-based system as described in Section 18.4, “ Restoring a File-Level Backup”.
The first step of the process is to make an ldiskfs backend backup using tar as described in Section 18.3, “Backing Up an OST or MDT (Backend File System Level)”.
Caution: For a migration from ldiskfs to zfs, it is required to enable index_backup before the unmount of the target. This is an additional step compared to a regular ldiskfs-based backup/restore and is easy to miss.
Next, restore the backup to a zfs-based system as described in Section 18.4, “ Restoring a File-Level Backup”.
This chapter describes file layout (striping) and I/O options, and includes the following sections:
In a Lustre file system, the MDS allocates objects to OSTs using either a round-robin algorithm or a weighted algorithm. When the amount of free space is well balanced (i.e., by default, when the free space across OSTs differs by less than 17%), the round-robin algorithm is used to select the next OST to which a stripe is to be written. Periodically, the MDS adjusts the striping layout to eliminate some degenerate cases in which applications that create very regular file layouts (striping patterns) preferentially use a particular OST in the sequence.
Normally the usage of OSTs is well balanced. However, if users create a small number of exceptionally large files or incorrectly specify striping parameters, imbalanced OST usage may result. When the free space across OSTs differs by more than a specific amount (17% by default), the MDS then uses weighted random allocations with a preference for allocating objects on OSTs with more free space. (This can reduce I/O performance until space usage is rebalanced again.) For a more detailed description of how striping is allocated, see Section 19.8, “Managing Free Space”.
Files can only be striped over a finite number of OSTs, based on the maximum size of the attributes that can be stored on the MDT. If the MDT is ldiskfs-based without the ea_inode feature, a file can be striped across at most 160 OSTs. With a ZFS-based MDT, or if the ea_inode feature is enabled for an ldiskfs-based MDT (the default since Lustre 2.13.0), a file can be striped across up to 2000 OSTs. For more information, see Section 19.9, “Lustre Striping Internals”.
Whether you should set up file striping and what parameter values you select depends on your needs. A good rule of thumb is to stripe over as few objects as will meet those needs and no more.
Some reasons for using striping include:
Providing high-bandwidth access. Many applications require high-bandwidth access to a single file, which may be more bandwidth than can be provided by a single OSS. Examples are a scientific application that writes to a single file from hundreds of nodes, or a binary executable that is loaded by many nodes when an application starts.
In cases like these, a file can be striped over as many OSSs as it takes to achieve the required peak aggregate bandwidth for that file. Striping across a larger number of OSSs should only be used when the file size is very large and/or is accessed by many nodes at a time. Currently, Lustre files can be striped across up to 2000 OSTs.
Improving performance when OSS bandwidth is exceeded. Striping across many OSSs can improve performance if the aggregate client bandwidth exceeds the server bandwidth and the application reads and writes data fast enough to take advantage of the additional OSS bandwidth. The largest useful stripe count is bounded by the I/O rate of the clients/jobs divided by the performance per OSS.
Matching stripes to I/O pattern. When writing to a single file from multiple nodes, having more than one client writing to a stripe can lead to issues with lock exchange, where clients contend over writing to that stripe, even if their I/Os do not overlap. This can be avoided if the I/O can be stripe-aligned so that each stripe is accessed by only one client. Since Lustre 2.13, the 'overstriping' feature is available, allowing more than one stripe per OST. This is particularly helpful for the case where thread count exceeds OST count, making it possible to match the stripe count to the thread count even in this case.
Providing space for very large files. Striping is useful when a single OST does not have enough free space to hold the entire file.
Some reasons to minimize or avoid striping:
Increased overhead. Striping results in more locks and extra network operations during common operations such as stat and unlink. Even when these operations are performed in parallel, one network operation takes less time than 100 operations.
Increased overhead also results from server contention. Consider a cluster with 100 clients and 100 OSSs, each with one OST. If each file has exactly one object and the load is distributed evenly, there is no contention and the disks on each server can manage sequential I/O. If each file has 100 objects, then the clients all compete with one another for the attention of the servers, and the disks on each node seek in 100 different directions resulting in needless contention.
Increased risk. When files are striped across all servers and one of the servers breaks down, a small part of each striped file is lost. By comparison, if each file has exactly one stripe, fewer files are lost, but they are lost in their entirety. Many users would prefer to lose some of their files entirely than all of their files partially.
Small files. Small files do not benefit from striping because they can be efficiently stored and accessed as a single OST object or even with Data on MDT.
O_APPEND mode. When files are opened for append, they instantiate all uninitialized components expressed in the layout. Typically, log files are opened for append, and complex layouts can be inefficient. The mdd.*.append_stripe_count and mdd.*.append_pool options can be used to specify special default striping for files created with O_APPEND, as shown in the sketch after this list.
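For example, to make O_APPEND files single-striped and place them in a hypothetical OST pool named flash, the following sketch could be run (assuming the pool already exists; the parameter values are illustrative):
mgs# lctl set_param -P mdd.*.append_stripe_count=1
mgs# lctl set_param -P mdd.*.append_pool=flash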
Choosing a stripe size is a balancing act, but reasonable defaults are described below. The stripe size has no effect on a single-stripe file.
The stripe size must be a multiple of the page size. Lustre software tools enforce a multiple of 64 KB (the maximum page size on ia64 and PPC64 nodes) so that users on platforms with smaller pages do not accidentally create files that might cause problems for ia64 clients.
The smallest recommended stripe size is 512 KB. Although you can create files with a stripe size of 64 KB, the smallest practical stripe size is 512 KB because the Lustre file system sends 1MB chunks over the network. Choosing a smaller stripe size may result in inefficient I/O to the disks and reduced performance.
A good stripe size for sequential I/O using high-speed networks is between 1 MB and 4 MB. In most situations, stripe sizes larger than 4 MB may result in longer lock hold times and contention during shared file access.
The maximum stripe size is 4 GB. Using a large stripe size can improve performance when accessing very large files. It allows each client to have exclusive access to its own part of a file. However, a large stripe size can be counterproductive in cases where it does not match your I/O pattern.
Choose a stripe pattern that takes into account the write patterns of your application. Writes that cross an object boundary are slightly less efficient than writes that go entirely to one server. If the file is written in a consistent and aligned way, make the stripe size a multiple of the write() size.
Use the lfs setstripe
command to create new files with a specific file layout (stripe pattern) configuration.
lfs setstripe [--size|-s stripe_size] [--stripe-count|-c stripe_count] [--overstripe-count|-C stripe_count] \
[--index|-i start_ost] [--pool|-p pool_name] filename|dirname
stripe_size
The stripe_size indicates how much data to write to one OST before moving to the next OST. The default stripe_size is 1 MB. Passing a stripe_size of 0 causes the default stripe size to be used. Otherwise, the stripe_size value must be a multiple of 64 KB.
stripe_count (--stripe-count, --overstripe-count)
The stripe_count indicates how many stripes to use. The default stripe_count value is 1. Setting stripe_count to 0 causes the default stripe count to be used. Setting stripe_count to -1 means stripe over all available OSTs (full OSTs are skipped). When --overstripe-count is used, more than one stripe can be placed per OST if necessary (see the example below).
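For example, with 8 OSTs in the filesystem, the following sketch (the file name is illustrative) uses overstriping to create 16 stripes, placing two stripes on each OST:
client$ lfs setstripe -C 16 /mnt/testfs/overstriped_file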
start_ost
The start OST is the first OST to which files are written. The default value for start_ost is -1, which allows the MDS to choose the starting index. This setting is strongly recommended, as it allows space and load balancing to be done by the MDS as needed. If the value of start_ost is set to a value other than -1, the file starts on the specified OST index. OST index numbering starts at 0.
If the specified OST is inactive or in a degraded mode, the MDS will silently choose another target.
If you pass a start_ost value of 0 and a stripe_count value of 1, all files are written to OST 0, until space is exhausted. This is probably not what you meant to do. If you only want to adjust the stripe count and keep the other parameters at their default settings, do not specify any of the other parameters:
client# lfs setstripe -c stripe_count filename
pool_name
The pool_name specifies the OST pool to which the file will be written. This allows limiting the OSTs used to a subset of all OSTs in the file system. For more details about using OST pools, see Section 23.2, “Creating and Managing OST Pools”. An example is shown below.
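For example, assuming an OST pool named flash has already been created, a two-stripe file restricted to that pool could be created with the following sketch (the pool and file names are illustrative):
client$ lfs setstripe -p flash -c 2 /mnt/testfs/pool_file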
It is possible to specify the file layout when a new file is created using the command lfs setstripe. This allows users to override the file system default parameters to tune the file layout more optimally for their application. Execution of an lfs setstripe command fails if the file already exists.
The command to create a new file with a specified stripe size is similar to:
[client]# lfs setstripe -s 4M /mnt/lustre/new_file
This example command creates the new file /mnt/lustre/new_file with a stripe size of 4 MB.
Now, when the file is created, the new stripe setting creates the file on a single OST with a stripe size of 4M:
[client]# lfs getstripe /mnt/lustre/new_file
/mnt/lustre/4mb_file
lmm_stripe_count:   1
lmm_stripe_size:    4194304
lmm_pattern:        1
lmm_layout_gen:     0
lmm_stripe_offset:  1
     obdidx          objid          objid          group
          1         690550        0xa8976             0
In this example, the stripe size is 4 MB.
The command below creates a new file with a stripe count of -1 to specify striping over all available OSTs:
[client]# lfs setstripe -c -1 /mnt/lustre/full_stripe
The example below indicates that the file full_stripe is striped over all six active OSTs in the configuration:
[client]# lfs getstripe /mnt/lustre/full_stripe
/mnt/lustre/full_stripe
     obdidx          objid          objid          group
          0              8            0x8             0
          1              4            0x4             0
          2              5            0x5             0
          3              5            0x5             0
          4              4            0x4             0
          5              2            0x2             0
This is in contrast to the output in Section 19.3.1.1, “Setting the Stripe Size”, which shows only a single object for the file.
In a directory, the lfs setstripe command sets a default striping configuration for files created in the directory. The usage is the same as lfs setstripe for a regular file, except that the directory must exist prior to setting the default striping configuration. If a file is created in a directory with a default stripe configuration (without otherwise specifying striping), the Lustre file system uses those striping parameters instead of the file system default for the new file.
To change the striping pattern for a sub-directory, create a directory with desired file layout as described above. Sub-directories inherit the file layout of the root/parent directory.
Special default striping can be used for files created with O_APPEND. Files with uninitialized layouts opened with O_APPEND will override a directory's default striping configuration and abide by the mdd.*.append_pool and mdd.*.append_stripe_count options (if they are specified).
Setting the striping specification on the root directory determines the striping for all new files created in the file system unless an overriding striping specification takes precedence (such as a striping layout specified by the application, or set using lfs setstripe, or specified for the parent directory).
The striping settings for a root directory are, by default, applied to any new child directories created in the root directory, unless striping settings have been specified for the child directory.
Special default striping can be used for files created with O_APPEND. Files with uninitialized layouts opened with O_APPEND will override a file system's default striping configuration and abide by the mdd.*.append_pool and mdd.*.append_stripe_count options (if they are specified).
Sometimes there are many OSTs in a filesystem, but it is not always desirable to stripe files across all OSTs, even if stripe_count=-1 (unlimited) is given. In this case, the per-filesystem tunable parameter lod.*.max_stripecount can be used to limit the real stripe count of a file to a lower number than the OST count.
If lod.*.max_stripecount is not 0 and the file has stripe_count=-1, the real stripe count will be the minimum of the OST count and max_stripecount. If lod.*.max_stripecount=0, or an explicit stripe count is given for the file, the max_stripecount setting is ignored.
To set max_stripecount on all MDSes of the file system, run:
mgs# lctl set_param -P lod.$fsname-MDTxxxx-mdtlov.max_stripecount=<N>
To check max_stripecount, run:
mds# lctl get_param lod.$fsname-MDTxxxx-mdtlov.max_stripecount
To reset max_stripecount, run:
mgs# lctl set_param -P -d lod.$fsname-MDTxxxx-mdtlov.max_stripecount
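For example, to cap automatic wide striping at 32 stripes on a filesystem named testfs (a sketch; the filesystem name, MDT index, and value are illustrative), one might run:
mgs# lctl set_param -P lod.testfs-MDT0000-mdtlov.max_stripecount=32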
You can use lfs setstripe to create a file on a specific OST. In the following example, the file file1 is created on the first OST (OST index is 0).
$ lfs setstripe --stripe-count 1 --index 0 file1
$ dd if=/dev/zero of=file1 count=1 bs=100M
1+0 records in
1+0 records out
$ lfs getstripe file1
/mnt/testfs/file1
lmm_stripe_count:   1
lmm_stripe_size:    1048576
lmm_pattern:        1
lmm_layout_gen:     0
lmm_stripe_offset:  0
     obdidx          objid          objid          group
          0          37364         0x91f4             0
The lfs getstripe command is used to display information that shows over which OSTs a file is distributed. For each OST, the index and UUID are displayed, along with the OST index and object ID for each stripe in the file. For directories, the default settings for files created in that directory are displayed.
To see the current stripe size for a Lustre file or directory, use the lfs getstripe command. For example, to view information for a directory, enter a command similar to:
[client]# lfs getstripe /mnt/lustre
This command produces output similar to:
/mnt/lustre (Default) stripe_count: 1 stripe_size: 1M stripe_offset: -1
In this example, the default stripe count is 1 (data blocks are striped over a single OST), the default stripe size is 1 MB, and the objects are created over all available OSTs.
To view information for a file, enter a command similar to:
$ lfs getstripe /mnt/lustre/foo
/mnt/lustre/foo
lmm_stripe_count:   1
lmm_stripe_size:    1048576
lmm_pattern:        1
lmm_layout_gen:     0
lmm_stripe_offset:  0
     obdidx          objid          objid          group
          2         835487        0xcbf9f             0
In this example, the file is located on obdidx 2, which corresponds to the OST lustre-OST0002. To see which node is serving that OST, run:
$ lctl get_param osc.lustre-OST0002-osc.ost_conn_uuid
osc.lustre-OST0002-osc.ost_conn_uuid=192.168.20.1@tcp
To inspect an entire tree of files, use the lfs find command:
lfs find [--recursive | -r] file|directory
...
Lustre can be configured with multiple MDTs in the same file system. Each directory and file could be located on a different MDT. To identify which MDT a given subdirectory is located on, pass the getstripe [--mdt-index|-M] parameter to lfs. An example of this command is provided in the section Section 14.9.1, “Removing an MDT from the File System”.
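For example, the following sketch (the directory name is illustrative) prints the index of the MDT on which a remote directory resides:
client$ lfs getstripe --mdt-index /mnt/testfs/remote_dir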
The Lustre Progressive File Layout (PFL) feature simplifies the use of Lustre so that users can expect reasonable performance for a variety of normal file IO patterns without the need to explicitly understand their IO model or Lustre usage details in advance. In particular, users do not necessarily need to know the size or concurrency of output files in advance of their creation and explicitly specify an optimal layout for each file in order to achieve good performance for both highly concurrent shared-single-large-file IO or parallel IO to many smaller per-process files.
The layout of a PFL file is stored on disk as a composite layout. A PFL file is essentially an array of sub-layout components, with each sub-layout component being a plain layout covering different and non-overlapping extents of the file. Because the file layout is composed of a series of components, it is possible that some file extents are not described by any component.
An example of how data blocks of PFL files are mapped to OST objects of components is shown in the following PFL object mapping diagram:
The PFL file in Figure 19.1, “PFL object mapping diagram” has 3 components and shows the mapping for the blocks of a 2055MB file. The stripe size for the first two components is 1MB, while the stripe size for the third component is 4MB. The stripe count is increasing for each successive component. The first component only has two 1MB blocks and the single object has a size of 2MB. The second component holds the next 254MB of the file spread over 4 separate OST objects in RAID-0, so each one will have a size of 256MB / 4 objects = 64MB per object. Note the first two objects obj 2,0 and obj 2,1 have a 1MB hole at the start where the data is stored in the first component. The final component holds the next 1800MB spread over 32 OST objects. There is a 256MB / 32 = 8MB hole at the start of each one for the data stored in the first two components. Each object will be 2048MB / 32 objects = 64MB per object, except obj 3,0, which holds an extra 4MB chunk, and obj 3,1, which holds an extra 3MB chunk. If more data were written to the file, only the objects in component 3 would increase in size.
When a file range with a defined but not instantiated component is accessed, clients will send a Layout Intent RPC to the MDT, and the MDT will instantiate the objects of the components covering that range.
The following sections introduce the commands used to operate on PFL files and illustrate some possible composite layouts.
Lustre provides the commands lfs setstripe and lfs migrate for users to operate on PFL files. lfs setstripe commands are used to create PFL files, and to add or delete components to or from an existing composite file; lfs migrate commands are used to re-layout the data in existing files using the new layout parameters by copying the data from the existing OST(s) to the new OST(s). Also, as introduced in the previous sections, lfs getstripe commands can be used to list the striping/component information for a given PFL file, and lfs find commands can be used to search the directory tree rooted at the given directory or file name for the files that match the given PFL component parameters.
Using PFL files requires both the client and server to understand the PFL file layout, which is not available in Lustre 2.9 and earlier. However, this does not prevent older clients from accessing non-PFL files in the filesystem.
lfs setstripe commands are used to create PFL files, and to add or delete components to or from an existing composite file. (Suppose we have 8 OSTs in the following examples and the stripe size is 1MB by default.)
Command
lfs setstripe
[--component-end|-E end1] [STRIPE_OPTIONS]
[--component-end|-E end2] [STRIPE_OPTIONS] ... filename
The -E option is used to specify the end offset (in bytes or using a suffix kMGTP, e.g. 256M) of each component, and it also indicates that the following STRIPE_OPTIONS are for this component. Each component defines the stripe pattern of the file in the range of [start, end). The first component must start from offset 0 and all components must be adjacent with each other; no holes are allowed, so each extent will start at the end of the previous extent. A -1 end offset or eof indicates this is the last component extending to the end of file.
If no EOF
Example
$ lfs setstripe -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -i 4 \ /mnt/testfs/create_comp
This command creates a file with composite layout illustrated in the following figure. The first component has 1 stripe and covers [0, 4M), the second component has 4 stripes and covers [4M, 64M), and the last component stripes start at OST4, cross over all available OSTs and covers [64M, EOF).
The composite layout can be output by the following command:
$ lfs getstripe /mnt/testfs/create_comp /mnt/testfs/create_comp lcm_layout_gen: 3 lcm_entry_count: 3 lcme_id: 1 lcme_flags: init lcme_extent.e_start: 0 lcme_extent.e_end: 4194304 lmm_stripe_count: 1 lmm_stripe_size: 1048576 lmm_pattern: 1 lmm_layout_gen: 0 lmm_stripe_offset: 0 lmm_objects: - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x2:0x0] } lcme_id: 2 lcme_flags: 0 lcme_extent.e_start: 4194304 lcme_extent.e_end: 67108864 lmm_stripe_count: 4 lmm_stripe_size: 1048576 lmm_pattern: 1 lmm_layout_gen: 0 lmm_stripe_offset: -1 lcme_id: 3 lcme_flags: 0 lcme_extent.e_start: 67108864 lcme_extent.e_end: EOF lmm_stripe_count: -1 lmm_stripe_size: 1048576 lmm_pattern: 1 lmm_layout_gen: 0 lmm_stripe_offset: 4
Only the first component's OST objects of the PFL file are instantiated when the layout is being set. Other instantiation is delayed to later write/truncate operations.
If we write 128M data to this PFL file, the second and third components will be instantiated:
$ dd if=/dev/zero of=/mnt/testfs/create_comp bs=1M count=128 $ lfs getstripe /mnt/testfs/create_comp /mnt/testfs/create_comp lcm_layout_gen: 5 lcm_entry_count: 3 lcme_id: 1 lcme_flags: init lcme_extent.e_start: 0 lcme_extent.e_end: 4194304 lmm_stripe_count: 1 lmm_stripe_size: 1048576 lmm_pattern: 1 lmm_layout_gen: 0 lmm_stripe_offset: 0 lmm_objects: - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x2:0x0] } lcme_id: 2 lcme_flags: init lcme_extent.e_start: 4194304 lcme_extent.e_end: 67108864 lmm_stripe_count: 4 lmm_stripe_size: 1048576 lmm_pattern: 1 lmm_layout_gen: 0 lmm_stripe_offset: 1 lmm_objects: - 0: { l_ost_idx: 1, l_fid: [0x100010000:0x2:0x0] } - 1: { l_ost_idx: 2, l_fid: [0x100020000:0x2:0x0] } - 2: { l_ost_idx: 3, l_fid: [0x100030000:0x2:0x0] } - 3: { l_ost_idx: 4, l_fid: [0x100040000:0x2:0x0] } lcme_id: 3 lcme_flags: init lcme_extent.e_start: 67108864 lcme_extent.e_end: EOF lmm_stripe_count: 8 lmm_stripe_size: 1048576 lmm_pattern: 1 lmm_layout_gen: 0 lmm_stripe_offset: 4 lmm_objects: - 0: { l_ost_idx: 4, l_fid: [0x100040000:0x3:0x0] } - 1: { l_ost_idx: 5, l_fid: [0x100050000:0x2:0x0] } - 2: { l_ost_idx: 6, l_fid: [0x100060000:0x2:0x0] } - 3: { l_ost_idx: 7, l_fid: [0x100070000:0x2:0x0] } - 4: { l_ost_idx: 0, l_fid: [0x100000000:0x3:0x0] } - 5: { l_ost_idx: 1, l_fid: [0x100010000:0x3:0x0] } - 6: { l_ost_idx: 2, l_fid: [0x100020000:0x3:0x0] } - 7: { l_ost_idx: 3, l_fid: [0x100030000:0x3:0x0] }
Command
lfs setstripe --component-add
[--component-end|-E end1] [STRIPE_OPTIONS]
[--component-end|-E end2] [STRIPE_OPTIONS] ... filename
The option --component-add is used to add components to an existing composite file. The extent start of the first component to be added is equal to the extent end of the last component in the existing file, and all components to be added must be adjacent with each other.
If the last existing component is specified by -E -1 or -E eof, which covers to the end of the file, it must be deleted before a new one is added.
Example
$ lfs setstripe -E 4M -c 1 -E 64M -c 4 /mnt/testfs/add_comp $ lfs setstripe --component-add -E -1 -c 4 -o 6-7,0,5 \ /mnt/testfs/add_comp
This command adds a new component which starts from the end of the last existing component to the end of file. The layout of this example is illustrated in Figure 19.3, “Example: add a component to an existing composite file”. The last component stripes across 4 OSTs in sequence OST6, OST7, OST0 and OST5, covers [64M, EOF).
The layout can be printed out by the following command:
$ lfs getstripe /mnt/testfs/add_comp /mnt/testfs/add_comp lcm_layout_gen: 5 lcm_entry_count: 3 lcme_id: 1 lcme_flags: init lcme_extent.e_start: 0 lcme_extent.e_end: 4194304 lmm_stripe_count: 1 lmm_stripe_size: 1048576 lmm_pattern: 1 lmm_layout_gen: 0 lmm_stripe_offset: 0 lmm_objects: - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x2:0x0] } lcme_id: 2 lcme_flags: init lcme_extent.e_start: 4194304 lcme_extent.e_end: 67108864 lmm_stripe_count: 4 lmm_stripe_size: 1048576 lmm_pattern: 1 lmm_layout_gen: 0 lmm_stripe_offset: 1 lmm_objects: - 0: { l_ost_idx: 1, l_fid: [0x100010000:0x2:0x0] } - 1: { l_ost_idx: 2, l_fid: [0x100020000:0x2:0x0] } - 2: { l_ost_idx: 3, l_fid: [0x100030000:0x2:0x0] } - 3: { l_ost_idx: 4, l_fid: [0x100040000:0x2:0x0] } lcme_id: 5 lcme_flags: 0 lcme_extent.e_start: 67108864 lcme_extent.e_end: EOF lmm_stripe_count: 4 lmm_stripe_size: 1048576 lmm_pattern: 1 lmm_layout_gen: 0 lmm_stripe_offset: -1
The component ID "lcme_id" changes as layout generation changes. It is not necessarily sequential and does not imply ordering of individual components.
As when specifying a full-file composite layout at file creation time, --component-add does not instantiate OST objects; instantiation is delayed until a later write or truncate operation reaches the new component. For example, once a write goes beyond the 64MB start of the file's last component, objects are allocated for that component.
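A minimal sketch of such a write (the original example does not show the exact command that was used, so this dd invocation is only illustrative) is:
$ dd if=/dev/zero of=/mnt/testfs/add_comp bs=1M count=1 seek=64 conv=notrunc
After a write like this, lfs getstripe -I5 shows the objects allocated for the new component: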
$ lfs getstripe -I5 /mnt/testfs/add_comp /mnt/testfs/add_comp lcm_layout_gen: 6 lcm_entry_count: 3 lcme_id: 5 lcme_flags: init lcme_extent.e_start: 67108864 lcme_extent.e_end: EOF lmm_stripe_count: 4 lmm_stripe_size: 1048576 lmm_pattern: 1 lmm_layout_gen: 0 lmm_stripe_offset: 6 lmm_objects: - 0: { l_ost_idx: 6, l_fid: [0x100060000:0x4:0x0] } - 1: { l_ost_idx: 7, l_fid: [0x100070000:0x4:0x0] } - 2: { l_ost_idx: 0, l_fid: [0x100000000:0x5:0x0] } - 3: { l_ost_idx: 5, l_fid: [0x100050000:0x4:0x0] }
Command
lfs setstripe --component-del
[--component-id|-I comp_id | --component-flags comp_flags]
filename
The --component-del option removes the component(s) specified by component ID or flags from an existing file. Any data stored in a deleted component is lost after this operation.
The ID given to the -I option is the unique numerical ID of the component, which can be obtained with the lfs getstripe -I command. The flag given to the --component-flags option selects a certain type of component; the flags of a file's components can be listed with lfs getstripe --component-flags. Currently only two flags are defined: init and ^init, for instantiated and un-instantiated components respectively.
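Un-instantiated components can also be removed by flag rather than by ID; a sketch, assuming a hypothetical file /mnt/testfs/uninst_comp whose trailing component has not yet been instantiated:
$ lfs setstripe --component-del --component-flags=^init /mnt/testfs/uninst_comp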
Deletion must start with the last component because creation of a hole in the middle of a file layout is not allowed.
Example
$ lfs getstripe -I /mnt/testfs/del_comp
1
2
5
$ lfs setstripe --component-del -I 5 /mnt/testfs/del_comp
This example deletes the component with ID 5 from the file /mnt/testfs/del_comp. Continuing from the previous example, the final result is illustrated in Figure 19.4, “Example: delete a component from an existing file”.
If you try to delete a non-last component, you will see the following error:
$ lfs setstripe --component-del -I 2 /mnt/testfs/del_comp
Delete component 0x2 from /mnt/testfs/del_comp failed. Invalid argument
error: setstripe: delete component of file '/mnt/testfs/del_comp' failed: Invalid argument
As with creating a PFL file, you can set a default PFL layout on an existing directory. After that, all files created in it will inherit this layout by default.
Command
lfs setstripe
[--component-end|-E end1] [STRIPE_OPTIONS]
[--component-end|-E end2] [STRIPE_OPTIONS] ... dirname
Example
$ mkdir /mnt/testfs/pfldir
$ lfs setstripe -E 256M -c 1 -E 16G -c 4 -E -1 -S 4M -c -1 /mnt/testfs/pfldir
When you run lfs getstripe
, you will see:
$ lfs getstripe /mnt/testfs/pfldir /mnt/testfs/pfldir lcm_layout_gen: 0 lcm_entry_count: 3 lcme_id: N/A lcme_flags: 0 lcme_extent.e_start: 0 lcme_extent.e_end: 268435456 stripe_count: 1 stripe_size: 1048576 stripe_offset: -1 lcme_id: N/A lcme_flags: 0 lcme_extent.e_start: 268435456 lcme_extent.e_end: 17179869184 stripe_count: 4 stripe_size: 1048576 stripe_offset: -1 lcme_id: N/A lcme_flags: 0 lcme_extent.e_start: 17179869184 lcme_extent.e_end: EOF stripe_count: -1 stripe_size: 4194304 stripe_offset: -1
If you create a file under /mnt/testfs/pfldir
,
the layout of that file will inherit the layout from its parent
directory:
$ touch /mnt/testfs/pfldir/pflfile $ lfs getstripe /mnt/testfs/pfldir/pflfile /mnt/testfs/pfldir/pflfile lcm_layout_gen: 2 lcm_entry_count: 3 lcme_id: 1 lcme_flags: init lcme_extent.e_start: 0 lcme_extent.e_end: 268435456 lmm_stripe_count: 1 lmm_stripe_size: 1048576 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: 1 lmm_objects: - 0: { l_ost_idx: 1, l_fid: [0x100010000:0xa:0x0] } lcme_id: 2 lcme_flags: 0 lcme_extent.e_start: 268435456 lcme_extent.e_end: 17179869184 lmm_stripe_count: 4 lmm_stripe_size: 1048576 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: -1 lcme_id: 3 lcme_flags: 0 lcme_extent.e_start: 17179869184 lcme_extent.e_end: EOF lmm_stripe_count: -1 lmm_stripe_size: 4194304 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: -1
lfs setstripe --component-add/del cannot be run on a directory, because the default layout of a directory is only a template, which can be changed arbitrarily by lfs setstripe, while the layout of a file may already have data (OST objects) attached. To delete the default layout of a directory and return it to the filesystem-wide defaults, run lfs setstripe -d dirname, for example:
$ lfs setstripe -d /mnt/testfs/pfldir
$ lfs getstripe -d /mnt/testfs/pfldir
/mnt/testfs/pfldir
stripe_count:  1 stripe_size:   1048576 stripe_offset: -1
/mnt/testfs/pfldir/commonfile
lmm_stripe_count:  1
lmm_stripe_size:   1048576
lmm_pattern:       1
lmm_layout_gen:    0
lmm_stripe_offset: 0
	obdidx	 objid	 objid	 group
	     2	     9	   0x9	     0
lfs migrate commands are used to re-layout the data in existing files with new layout parameters by copying the data from the existing OST(s) to the new OST(s).
Command
lfs migrate [--component-end|-E comp_end] [STRIPE_OPTIONS] ...
filename
The difference between migrate and setstripe is that migrate re-lays out the data of existing files, while setstripe creates new files with the specified layout.
Example
Case1. Migrate a file with a normal layout to a composite layout
$ lfs setstripe -c 1 -S 128K /mnt/testfs/norm_to_2comp
$ dd if=/dev/urandom of=/mnt/testfs/norm_to_2comp bs=1M count=5
$ lfs getstripe /mnt/testfs/norm_to_2comp --yaml
/mnt/testfs/norm_to_2comp
lmm_stripe_count:  1
lmm_stripe_size:   131072
lmm_pattern:       1
lmm_layout_gen:    0
lmm_stripe_offset: 7
lmm_objects:
      - l_ost_idx: 7
        l_fid:     0x100070000:0x2:0x0
$ lfs migrate -E 1M -S 512K -c 1 -E -1 -S 1M -c 2 \
  /mnt/testfs/norm_to_2comp
In this example, a 5MB file with 1 stripe and a 128KB stripe size is migrated to a composite layout with 2 components, as illustrated in Figure 19.5, “Example: migrate normal to composite”.
The stripe information after migration is as follows:
$ lfs getstripe /mnt/testfs/norm_to_2comp /mnt/testfs/norm_to_2comp lcm_layout_gen: 4 lcm_entry_count: 2 lcme_id: 1 lcme_flags: init lcme_extent.e_start: 0 lcme_extent.e_end: 1048576 lmm_stripe_count: 1 lmm_stripe_size: 524288 lmm_pattern: 1 lmm_layout_gen: 0 lmm_stripe_offset: 0 lmm_objects: - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x2:0x0] } lcme_id: 2 lcme_flags: init lcme_extent.e_start: 1048576 lcme_extent.e_end: EOF lmm_stripe_count: 2 lmm_stripe_size: 1048576 lmm_pattern: 1 lmm_layout_gen: 0 lmm_stripe_offset: 2 lmm_objects: - 0: { l_ost_idx: 2, l_fid: [0x100020000:0x2:0x0] } - 1: { l_ost_idx: 3, l_fid: [0x100030000:0x2:0x0] }
Case2. Migrate a composite layout to another composite layout
$ lfs setstripe -E 1M -S 512K -c 1 -E -1 -S 1M -c 2 \
  /mnt/testfs/2comp_to_3comp
$ dd if=/dev/urandom of=/mnt/testfs/2comp_to_3comp bs=1M count=5
$ lfs migrate -E 1M -S 1M -c 2 -E 4M -S 1M -c 2 -E -1 -S 3M -c 3 \
  /mnt/testfs/2comp_to_3comp
In this example, a composite layout file with 2 components is migrated to a composite layout with 3 components. Continuing from Case1, the migration process is illustrated in Figure 19.6, “Example: migrate composite to composite”.
The stripe information after migration is as follows:
$ lfs getstripe /mnt/testfs/2comp_to_3comp /mnt/testfs/2comp_to_3comp lcm_layout_gen: 6 lcm_entry_count: 3 lcme_id: 1 lcme_flags: init lcme_extent.e_start: 0 lcme_extent.e_end: 1048576 lmm_stripe_count: 2 lmm_stripe_size: 1048576 lmm_pattern: 1 lmm_layout_gen: 0 lmm_stripe_offset: 4 lmm_objects: - 0: { l_ost_idx: 4, l_fid: [0x100040000:0x2:0x0] } - 1: { l_ost_idx: 5, l_fid: [0x100050000:0x2:0x0] } lcme_id: 2 lcme_flags: init lcme_extent.e_start: 1048576 lcme_extent.e_end: 4194304 lmm_stripe_count: 2 lmm_stripe_size: 1048576 lmm_pattern: 1 lmm_layout_gen: 0 lmm_stripe_offset: 6 lmm_objects: - 0: { l_ost_idx: 6, l_fid: [0x100060000:0x2:0x0] } - 1: { l_ost_idx: 7, l_fid: [0x100070000:0x3:0x0] } lcme_id: 3 lcme_flags: init lcme_extent.e_start: 4194304 lcme_extent.e_end: EOF lmm_stripe_count: 3 lmm_stripe_size: 3145728 lmm_pattern: 1 lmm_layout_gen: 0 lmm_stripe_offset: 0 lmm_objects: - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x3:0x0] } - 1: { l_ost_idx: 1, l_fid: [0x100010000:0x2:0x0] } - 2: { l_ost_idx: 2, l_fid: [0x100020000:0x3:0x0] }
Case3. Migrate a composite layout to a normal layout
$ lfs setstripe -E 1M -S 1M -c 2 -E 4M -S 1M -c 2 -E -1 -S 3M -c 3 \
  /mnt/testfs/3comp_to_norm
$ dd if=/dev/urandom of=/mnt/testfs/3comp_to_norm bs=1M count=5
$ lfs migrate -c 2 -S 2M /mnt/testfs/3comp_to_norm
In this example, a composite file with 3 components is migrated to a normal file with 2 stripes and a 2MB stripe size. Continuing from Case2, the migration process is illustrated in Figure 19.7, “Example: migrate composite to normal”.
The stripe information after migration is as follows:
$ lfs getstripe /mnt/testfs/3comp_to_norm --yaml
/mnt/testfs/3comp_to_norm
lmm_stripe_count:  2
lmm_stripe_size:   2097152
lmm_pattern:       1
lmm_layout_gen:    7
lmm_stripe_offset: 4
lmm_objects:
      - l_ost_idx: 4
        l_fid:     0x100040000:0x3:0x0
      - l_ost_idx: 5
        l_fid:     0x100050000:0x3:0x0
lfs getstripe
commands can be used to list the
striping/component information for a given PFL file. Here, only those
parameters new for PFL files are shown.
Command
lfs getstripe
[--component-id|-I [comp_id]]
[--component-flags [comp_flags]]
[--component-count]
[--component-start [+-][N][kMGTPE]]
[--component-end|-E [+-][N][kMGTPE]]
dirname|filename
Example
Suppose we already have a composite file
/mnt/testfs/3comp
, created by the following
command:
$ lfs setstripe -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -i 4 \ /mnt/testfs/3comp
And write some data
$ dd if=/dev/zero of=/mnt/testfs/3comp bs=1M count=5
Case1. List component IDs and related information
List all of the component IDs:
$ lfs getstripe -I /mnt/testfs/3comp
1
2
3
List the detailed striping information of component ID=2
$ lfs getstripe -I2 /mnt/testfs/3comp /mnt/testfs/3comp lcm_layout_gen: 4 lcm_entry_count: 3 lcme_id: 2 lcme_flags: init lcme_extent.e_start: 4194304 lcme_extent.e_end: 67108864 lmm_stripe_count: 4 lmm_stripe_size: 1048576 lmm_pattern: 1 lmm_layout_gen: 0 lmm_stripe_offset: 5 lmm_objects: - 0: { l_ost_idx: 5, l_fid: [0x100050000:0x2:0x0] } - 1: { l_ost_idx: 6, l_fid: [0x100060000:0x2:0x0] } - 2: { l_ost_idx: 7, l_fid: [0x100070000:0x2:0x0] } - 3: { l_ost_idx: 0, l_fid: [0x100000000:0x2:0x0] }
List the stripe offset and stripe count of component ID=2
$ lfs getstripe -I2 -i -c /mnt/testfs/3comp
lmm_stripe_count:  4
lmm_stripe_offset: 5
Case2. List the component which contains the specified flag
List the flag of each component
$ lfs getstripe --component-flags -I /mnt/testfs/3comp
lcme_id:             1
lcme_flags:          init
lcme_id:             2
lcme_flags:          init
lcme_id:             3
lcme_flags:          0
List the component(s) that are not instantiated:
$ lfs getstripe --component-flags=^init /mnt/testfs/3comp /mnt/testfs/3comp lcm_layout_gen: 4 lcm_entry_count: 3 lcme_id: 3 lcme_flags: 0 lcme_extent.e_start: 67108864 lcme_extent.e_end: EOF lmm_stripe_count: -1 lmm_stripe_size: 1048576 lmm_pattern: 1 lmm_layout_gen: 4 lmm_stripe_offset: 4
Case3. List the total number of all the component(s)
List the total number of all the components
$ lfs getstripe --component-count /mnt/testfs/3comp
3
Case4. List the component with the specified extent start or end positions
List the start position in bytes of each component
$ lfs getstripe --component-start /mnt/testfs/3comp
0
4194304
67108864
List the start position in bytes of component ID=3
$ lfs getstripe --component-start -I3 /mnt/testfs/3comp
67108864
List the component with start = 64M
$ lfs getstripe --component-start=64M /mnt/testfs/3comp /mnt/testfs/3comp lcm_layout_gen: 4 lcm_entry_count: 3 lcme_id: 3 lcme_flags: 0 lcme_extent.e_start: 67108864 lcme_extent.e_end: EOF lmm_stripe_count: -1 lmm_stripe_size: 1048576 lmm_pattern: 1 lmm_layout_gen: 4 lmm_stripe_offset: 4
List the component(s) with start > 5M
$ lfs getstripe --component-start=+5M /mnt/testfs/3comp /mnt/testfs/3comp lcm_layout_gen: 4 lcm_entry_count: 3 lcme_id: 3 lcme_flags: 0 lcme_extent.e_start: 67108864 lcme_extent.e_end: EOF lmm_stripe_count: -1 lmm_stripe_size: 1048576 lmm_pattern: 1 lmm_layout_gen: 4 lmm_stripe_offset: 4
List the component(s) with start < 5M
$ lfs getstripe --component-start=-5M /mnt/testfs/3comp /mnt/testfs/3comp lcm_layout_gen: 4 lcm_entry_count: 3 lcme_id: 1 lcme_flags: init lcme_extent.e_start: 0 lcme_extent.e_end: 4194304 lmm_stripe_count: 1 lmm_stripe_size: 1048576 lmm_pattern: 1 lmm_layout_gen: 0 lmm_stripe_offset: 4 lmm_objects: - 0: { l_ost_idx: 4, l_fid: [0x100040000:0x2:0x0] } lcme_id: 2 lcme_flags: init lcme_extent.e_start: 4194304 lcme_extent.e_end: 67108864 lmm_stripe_count: 4 lmm_stripe_size: 1048576 lmm_pattern: 1 lmm_layout_gen: 0 lmm_stripe_offset: 5 lmm_objects: - 0: { l_ost_idx: 5, l_fid: [0x100050000:0x2:0x0] } - 1: { l_ost_idx: 6, l_fid: [0x100060000:0x2:0x0] } - 2: { l_ost_idx: 7, l_fid: [0x100070000:0x2:0x0] } - 3: { l_ost_idx: 0, l_fid: [0x100000000:0x2:0x0] }
List the component(s) with start > 3M and end < 70M
$ lfs getstripe --component-start=+3M --component-end=-70M \ /mnt/testfs/3comp /mnt/testfs/3comp lcm_layout_gen: 4 lcm_entry_count: 3 lcme_id: 2 lcme_flags: init lcme_extent.e_start: 4194304 lcme_extent.e_end: 67108864 lmm_stripe_count: 4 lmm_stripe_size: 1048576 lmm_pattern: 1 lmm_layout_gen: 0 lmm_stripe_offset: 5 lmm_objects: - 0: { l_ost_idx: 5, l_fid: [0x100050000:0x2:0x0] } - 1: { l_ost_idx: 6, l_fid: [0x100060000:0x2:0x0] } - 2: { l_ost_idx: 7, l_fid: [0x100070000:0x2:0x0] } - 3: { l_ost_idx: 0, l_fid: [0x100000000:0x2:0x0] }
lfs find
commands can be used to search the
directory tree rooted at the given directory or file name for the files
that match the given PFL component parameters. Here, only those
parameters new for PFL files are shown. Their usages are similar to
lfs getstripe
commands.
Command
lfs find directory|filename
[[!] --component-count [+-=]comp_cnt]
[[!] --component-start [+-=]N[kMGTPE]]
[[!] --component-end|-E [+-=]N[kMGTPE]]
[[!] --component-flags=comp_flags]
If you use --component-xxx options, only composite files are searched; if you use ! --component-xxx options, all files are searched.
Example
We use the following directory and composite files to show how
lfs find
works.
$ mkdir /mnt/testfs/testdir
$ lfs setstripe -E 1M -E 10M -E eof /mnt/testfs/testdir/3comp
$ lfs setstripe -E 4M -E 20M -E 30M -E eof /mnt/testfs/testdir/4comp
$ mkdir -p /mnt/testfs/testdir/dir_3comp
$ lfs setstripe -E 6M -E 30M -E eof /mnt/testfs/testdir/dir_3comp
$ lfs setstripe -E 8M -E eof /mnt/testfs/testdir/dir_3comp/2comp
$ lfs setstripe -c 1 /mnt/testfs/testdir/dir_3comp/commonfile
Case1. Find the files that match the specified component count condition
Find the files under directory /mnt/testfs/testdir whose number of components is not equal to 3.
$ lfs find /mnt/testfs/testdir ! --component-count=3
/mnt/testfs/testdir
/mnt/testfs/testdir/4comp
/mnt/testfs/testdir/dir_3comp/2comp
/mnt/testfs/testdir/dir_3comp/commonfile
Case2. Find the files/dirs that match the specified component start/end condition
Find the file(s) under directory /mnt/testfs/testdir with component start = 4M and end < 30M
$ lfs find /mnt/testfs/testdir --component-start=4M -E -30M
/mnt/testfs/testdir/4comp
Case3. Find the files/dirs that match the specified component flag condition
Find the file(s) under directory /mnt/testfs/testdir whose component
flags contain init
$ lfs find /mnt/testfs/testdir --component-flags=init
/mnt/testfs/testdir/3comp
/mnt/testfs/testdir/4comp
/mnt/testfs/testdir/dir_3comp/2comp
Since lfs find uses "!" for negative searches, the ^init flag is not supported here.
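For example, files containing an un-instantiated component can be found with a negated search on the init flag instead; a sketch using the test files created above (the matching files depend on which components have already been written):
$ lfs find /mnt/testfs/testdir ! --component-flags=init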
The Lustre Self-Extending Layout (SEL) feature is an extension of the
Section 19.5, “Progressive File Layout(PFL)” feature, which allows the MDS to change the defined
PFL layout dynamically. With this feature, the MDS monitors the used space
on OSTs and swaps the OSTs for the current file when they are low on space.
This avoids ENOSPC
problems for SEL files when
applications are writing to them.
Whereas PFL delays the instantiation of some components until an IO operation occurs on this region, SEL allows splitting such non-instantiated components in two parts: an extendable component and an extension component. The extendable component is a regular PFL component, covering just a part of the region, which is small originally. The extension (or SEL) component is a new component type which is always non-instantiated and unassigned, covering the other part of the region. When a write reaches this unassigned space, and the client calls the MDS to have it instantiated, the MDS makes a decision as to whether to grant additional space to the extendable component. The granted region moves from the head of the extension component to the tail of the extendable component, thus the extendable component grows and the SEL one is shortened. Therefore, it allows the file to continue on the same OSTs, or in the case where space is low on one of the current OSTs, to modify the layout to switch to a new component on new OSTs. In particular, it lets IO automatically spill over to a large HDD OST pool once a small SSD OST pool is getting low on space.
The default extension policy modifies the layout in the following ways:
Extension: continue on the same OST objects when not low on space on any of the OSTs of the current component; a particular extent is granted to the extendable component.
Spill over: switch to next component OSTs only when not the last component and at least one of the current OSTs is low on space; the whole region of the SEL component moves to the next component and the SEL component is removed in its turn.
Repeating: create a new component with the same layout but on free OSTs, used only for the last component when at least one of the current OSTs is low on space; a new component has the same layout but instantiated on different OSTs (from the same pool) which have enough space.
Forced extension: continue on the current component OSTs even when one of them is low on space; this is used for the last component when a repeat attempt finds that other OSTs are also low on space, so spilling over or repeating would not help.
Each spill event increments the spill_hit counter, which can be read with:
lctl get_param lod.*.POOLNAME.spill_hit
The SEL feature does not require clients to understand the SEL format of already created files; only MDS support is needed, which was introduced in Lustre 2.13. However, old clients will have some limitations, as their Lustre tools will not support the new format.
The lfs setstripe
command is used to create files
with composite layouts, as well as add or delete components to or from an
existing file. It is extended to support SEL components.
Command
lfs setstripe
[--component-end|-E end1] [STRIPE_OPTIONS] ... FILENAME
STRIPE OPTIONS:
--extension-size, --ext-size, -z <ext_size>
The -z
option is added to specify the size of
the region which is granted to the extendable component on each
iteration. While declaring any component, this option turns the declared
component to a pair of components: extendable and extension ones.
Example
The following command creates 2 pairs of extendable and extension components:
# lfs setstripe -E 1G -z 64M -E -1 -z 256M /mnt/lustre/file
As usual, only the first PFL component is instantiated at the creation time, thus it is immediately extended to the extension size (64M for the first component), whereas the third component is left zero-length.
# lfs getstripe /mnt/lustre/file /mnt/lustre/file lcm_layout_gen: 4 lcm_mirror_count: 1 lcm_entry_count: 4 lcme_id: 1 lcme_mirror_id: 0 lcme_flags: init lcme_extent.e_start: 0 lcme_extent.e_end: 67108864 lmm_stripe_count: 1 lmm_stripe_size: 1048576 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: 0 lmm_objects: - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x5:0x0] } lcme_id: 2 lcme_mirror_id: 0 lcme_flags: extension lcme_extent.e_start: 67108864 lcme_extent.e_end: 1073741824 lmm_stripe_count: 0 lmm_extension_size: 67108864 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: -1 lcme_id: 3 lcme_mirror_id: 0 lcme_flags: 0 lcme_extent.e_start: 1073741824 lcme_extent.e_end: 1073741824 lmm_stripe_count: 1 lmm_stripe_size: 1048576 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: -1 lcme_id: 4 lcme_mirror_id: 0 lcme_flags: extension lcme_extent.e_start: 1073741824 lcme_extent.e_end: EOF lmm_stripe_count: 0 lmm_extension_size: 268435456 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: -1
Similar to PFL, it is possible to set a SEL layout template to a directory. After that, all the files created under it will inherit this layout by default.
# lfs setstripe -E 1G -z 64M -E -1 -z 256M /mnt/lustre/dir # ./lustre/utils/lfs getstripe /mnt/lustre/dir /mnt/lustre/dir lcm_layout_gen: 0 lcm_mirror_count: 1 lcm_entry_count: 4 lcme_id: N/A lcme_mirror_id: N/A lcme_flags: 0 lcme_extent.e_start: 0 lcme_extent.e_end: 67108864 stripe_count: 1 stripe_size: 1048576 pattern: raid0 stripe_offset: -1 lcme_id: N/A lcme_mirror_id: N/A lcme_flags: extension lcme_extent.e_start: 67108864 lcme_extent.e_end: 1073741824 stripe_count: 1 extension_size: 67108864 pattern: raid0 stripe_offset: -1 lcme_id: N/A lcme_mirror_id: N/A lcme_flags: 0 lcme_extent.e_start: 1073741824 lcme_extent.e_end: 1073741824 stripe_count: 1 stripe_size: 1048576 pattern: raid0 stripe_offset: -1 lcme_id: N/A lcme_mirror_id: N/A lcme_flags: extension lcme_extent.e_start: 1073741824 lcme_extent.e_end: EOF stripe_count: 1 extension_size: 268435456 pattern: raid0 stripe_offset: -1
lfs getstripe
commands can be used to list the
striping/component information for a given SEL file. Here, only those parameters
new for SEL files are shown.
Command
lfs getstripe
[--extension-size|--ext-size|-z] filename
The -z
option is added to print the extension
size in bytes. For composite files this is the extension size of the
first extension component. If a particular component is identified by
other options (--component-id, --component-start
,
etc...), this component extension size is printed.
Example 1: List a SEL component information
Suppose we already have a composite file
/mnt/lustre/file
, created by the following command:
# lfs setstripe -E 1G -z 64M -E -1 -z 256M /mnt/lustre/file
The 2nd component could be listed with the following command:
# lfs getstripe -I2 /mnt/lustre/file /mnt/lustre/file lcm_layout_gen: 4 lcm_mirror_count: 1 lcm_entry_count: 4 lcme_id: 2 lcme_mirror_id: 0 lcme_flags: extension lcme_extent.e_start: 67108864 lcme_extent.e_end: 1073741824 lmm_stripe_count: 0 lmm_extension_size: 67108864 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: -1
As shown above, SEL components are marked with the extension flag, and the lmm_extension_size field holds the specified extension size.
Example 2: List the extension size
Having the same file as in the above example, the extension size of the second component could be listed with:
# lfs getstripe -z -I2 /mnt/lustre/file 67108864
Example 3: Extension
Using the same file as in the example above, suppose there is a write which crosses the end of the first component (64M), followed by another write which crosses the new end of the first component (128M); the layout then changes as follows.
The layout can be printed out by the following command:
# lfs getstripe /mnt/lustre/file /mnt/lustre/file lcm_layout_gen: 6 lcm_mirror_count: 1 lcm_entry_count: 4 lcme_id: 1 lcme_mirror_id: 0 lcme_flags: init lcme_extent.e_start: 0 lcme_extent.e_end: 201326592 lmm_stripe_count: 1 lmm_stripe_size: 1048576 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: 0 lmm_objects: - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x5:0x0] } lcme_id: 2 lcme_mirror_id: 0 lcme_flags: extension lcme_extent.e_start: 201326592 lcme_extent.e_end: 1073741824 lmm_stripe_count: 0 lmm_extension_size: 67108864 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: -1 lcme_id: 3 lcme_mirror_id: 0 lcme_flags: 0 lcme_extent.e_start: 1073741824 lcme_extent.e_end: 1073741824 lmm_stripe_count: 1 lmm_stripe_size: 1048576 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: -1 lcme_id: 4 lcme_mirror_id: 0 lcme_flags: extension lcme_extent.e_start: 1073741824 lcme_extent.e_end: EOF lmm_stripe_count: 0 lmm_extension_size: 268435456 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: -1
Example 4: Spillover
In the case where OST0 is low on space and an IO happens to a SEL component, a spillover happens: the full region of the SEL component is added to the next component. For example, continuing from the layout above, the next layout modification will look like this:
Although the third component was originally [1G, 1G], because it is not yet instantiated it is not extended backward; instead, it is moved back to the start of the previous SEL component (192M) and extended by its extension size (256M) from that position, so it becomes [192M, 448M].
# lfs getstripe /mnt/lustre/file /mnt/lustre/file lcm_layout_gen: 7 lcm_mirror_count: 1 lcm_entry_count: 3 lcme_id: 1 lcme_mirror_id: 0 lcme_flags: init lcme_extent.e_start: 0 lcme_extent.e_end: 201326592 lmm_stripe_count: 1 lmm_stripe_size: 1048576 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: 0 lmm_objects: - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x5:0x0] } lcme_id: 3 lcme_mirror_id: 0 lcme_flags: init lcme_extent.e_start: 201326592 lcme_extent.e_end: 469762048 lmm_stripe_count: 1 lmm_stripe_size: 1048576 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: 1 lmm_objects: - 0: { l_ost_idx: 1, l_fid: [0x100010000:0x8:0x0] } lcme_id: 4 lcme_mirror_id: 0 lcme_flags: extension lcme_extent.e_start: 469762048 lcme_extent.e_end: EOF lmm_stripe_count: 0 lmm_extension_size: 268435456 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: -1
Example 5: Repeating
Suppose that, in the example above, OST0 has regained enough free space but OST1 is low on space. The next write to the last SEL component then leads to the allocation of a new component before the SEL component, which repeats the previous component layout but is instantiated on free OSTs:
# lfs getstripe /mnt/lustre/file /mnt/lustre/file lcm_layout_gen: 9 lcm_mirror_count: 1 lcm_entry_count: 4 lcme_id: 1 lcme_mirror_id: 0 lcme_flags: init lcme_extent.e_start: 0 lcme_extent.e_end: 201326592 lmm_stripe_count: 1 lmm_stripe_size: 1048576 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: 0 lmm_objects: - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x5:0x0] } lcme_id: 3 lcme_mirror_id: 0 lcme_flags: init lcme_extent.e_start: 201326592 lcme_extent.e_end: 469762048 lmm_stripe_count: 1 lmm_stripe_size: 1048576 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: 1 lmm_objects: - 0: { l_ost_idx: 1, l_fid: [0x100010000:0x8:0x0] } lcme_id: 8 lcme_mirror_id: 0 lcme_flags: init lcme_extent.e_start: 469762048 lcme_extent.e_end: 738197504 lmm_stripe_count: 1 lmm_stripe_size: 1048576 lmm_pattern: raid0 lmm_layout_gen: 65535 lmm_stripe_offset: 0 lmm_objects: - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x6:0x0] } lcme_id: 4 lcme_mirror_id: 0 lcme_flags: extension lcme_extent.e_start: 738197504 lcme_extent.e_end: EOF lmm_stripe_count: 0 lmm_extension_size: 268435456 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: -1
Example 6: Forced extension
Suppose that, in the example above, both OST0 and OST1 are low on space. The next write to the last SEL component then behaves as an extension, since there is no benefit in repeating:
# lfs getstripe /mnt/lustre/file /mnt/lustre/file lcm_layout_gen: 11 lcm_mirror_count: 1 lcm_entry_count: 4 lcme_id: 1 lcme_mirror_id: 0 lcme_flags: init lcme_extent.e_start: 0 lcme_extent.e_end: 201326592 lmm_stripe_count: 1 lmm_stripe_size: 1048576 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: 0 lmm_objects: - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x5:0x0] } lcme_id: 3 lcme_mirror_id: 0 lcme_flags: init lcme_extent.e_start: 201326592 lcme_extent.e_end: 469762048 lmm_stripe_count: 1 lmm_stripe_size: 1048576 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: 1 lmm_objects: - 0: { l_ost_idx: 1, l_fid: [0x100010000:0x8:0x0] } lcme_id: 8 lcme_mirror_id: 0 lcme_flags: init lcme_extent.e_start: 469762048 lcme_extent.e_end: 1006632960 lmm_stripe_count: 1 lmm_stripe_size: 1048576 lmm_pattern: raid0 lmm_layout_gen: 65535 lmm_stripe_offset: 0 lmm_objects: - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x6:0x0] } lcme_id: 4 lcme_mirror_id: 0 lcme_flags: extension lcme_extent.e_start: 1006632960 lcme_extent.e_end: EOF lmm_stripe_count: 0 lmm_extension_size: 268435456 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: -1
lfs find commands can be used to search for the files that match the given SEL component parameters. Here, only those parameters new for SEL files are shown.
lfs find [[!] --extension-size|--ext-size|-z [+-]ext-size[KMG]] [[!] --component-flags=extension]
The -z option is added to specify the extension size to search for. Files which have any component with an extension size matching the given criteria are printed. As usual, the + and - signs can be used to specify the minimum and maximum size, respectively.
A new extension
component flag is added. Only
files which have at least one SEL component are printed.
The negative search for flags searches the files which have a non-SEL component (not files which do not have any SEL component).
Example
# lfs setstripe --extension-size 64M -c 1 -E -1 /mnt/lustre/file
# lfs find --comp-flags extension /mnt/lustre/*
/mnt/lustre/file
# lfs find ! --comp-flags extension /mnt/lustre/*
/mnt/lustre/file
# lfs find -z 64M /mnt/lustre/*
/mnt/lustre/file
# lfs find -z +64M /mnt/lustre/*
# lfs find -z -64M /mnt/lustre/*
# lfs find -z +63M /mnt/lustre/*
/mnt/lustre/file
# lfs find -z -65M /mnt/lustre/*
/mnt/lustre/file
# lfs find -z 65M /mnt/lustre/*
# lfs find ! -z 64M /mnt/lustre/*
# lfs find ! -z +64M /mnt/lustre/*
/mnt/lustre/file
# lfs find ! -z -64M /mnt/lustre/*
/mnt/lustre/file
# lfs find ! -z +63M /mnt/lustre/*
# lfs find ! -z -65M /mnt/lustre/*
# lfs find ! -z 65M /mnt/lustre/*
/mnt/lustre/file
The Lustre Foreign Layout feature is an extension of both the LOV and LMV formats which allows the creation of empty files and directories with the necessary specifications to point to corresponding objects outside of the Lustre namespace.
The new LOV/LMV foreign internal format can be represented as:
The lfs set[dir]stripe
commands are used to
create files or directories with foreign layouts, by calling the
corresponding API, itself invoking the appropriate ioctl().
Command
lfs set[dir]stripe \
--foreign[=<foreign_type>] --xattr|-x <layout_string> \
[--flags <hex_bitmask>] [--mode <mode_bits>] \
{file,dir}name
Both the --foreign
and
--xattr|-x
options are mandatory.
The <foreign_type>
(default is "none", meaning
no special behavior), and both --flags
and
--mode
(default is 0666) options are optional.
Example
The following command creates a foreign file of "none" type and with "foo@bar" LOV content and specific mode and flags:
# lfs setstripe --foreign=none --flags=0xda08 --mode=0640 \ --xattr=foo@bar /mnt/lustre/file
lfs get[dir]stripe commands can be used to retrieve foreign LOV/LMV information and content.
Command
lfs get[dir]stripe [-v] filename
List foreign layout information
Suppose we already have a foreign file
/mnt/lustre/file
, created by the following command:
# lfs setstripe --foreign=none --flags=0xda08 --mode=0640 \ --xattr=foo@bar /mnt/lustre/file
The full foreign layout information can be listed using the following command:
# lfs getstripe -v /mnt/lustre/file
/mnt/lustre/file
lfm_magic:  0x0BD70BD0
lfm_length: 7
lfm_type:   none
lfm_flags:  0x0000DA08
lfm_value:  foo@bar
As shown above, the lfm_length field value is the number of characters in the variable-length lfm_value field.
lfs find commands can be used to search for all the foreign files/directories, or those that match the given selection parameters.
lfs find [[!] --foreign[=<foreign_type>]]
The --foreign[=<foreign_type>] option has been added to retrieve all files and/or directories with a foreign layout (or, with !, all those without one), optionally restricted to the given <foreign_type> (or, with !, excluding it).
Example
# lfs setstripe --foreign=none --xattr=foo@bar /mnt/lustre/file
# touch /mnt/lustre/file2
# lfs find --foreign /mnt/lustre/*
/mnt/lustre/file
# lfs find ! --foreign /mnt/lustre/*
/mnt/lustre/file2
# lfs find --foreign=none /mnt/lustre/*
/mnt/lustre/file
To optimize file system performance, the MDT assigns file stripes to OSTs based on two allocation algorithms. The round-robin allocator gives preference to location (spreading out stripes across OSSs to increase network bandwidth utilization) and the weighted allocator gives preference to available space (balancing loads across OSTs). Threshold and weighting factors for these two algorithms can be adjusted by the user. The MDT reserves 0.1 percent of total OST space and 32 inodes for each OST. The MDT stops object allocation for an OST if its available space is less than the reserved amount or the OST has fewer than 32 free inodes. The MDT resumes object allocation when the available space is twice the reserved space and the OST has more than 64 free inodes. Note that clients can still append to existing files regardless of the object allocation state.
The reserved space for each OST can be adjusted by the user. Use the lctl set_param command; for example, the following command reserves 1GB of space on all OSTs:
lctl set_param -P osp.*.reserved_mb_low=1024
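The current setting can be read back with lctl get_param; a minimal sketch, which also assumes the companion osp.*.reserved_mb_high parameter (the level at which allocation resumes) is available on the installed release:
lctl get_param osp.*.reserved_mb_low osp.*.reserved_mb_high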
This section describes how to check available free space on disks and how free space is allocated. It then describes how to set the threshold and weighting factors for the allocation algorithms.
Free space is an important consideration in assigning file stripes.
The lfs df
command can be used to show available
disk space on the mounted Lustre file system and space consumption per
OST. If multiple Lustre file systems are mounted, a path may be
specified, but is not required. Options to the lfs df
command are shown below.
Option | Description |
---|---|
-h | Displays sizes in human readable format (for example: 1K, 234M, 5G) using base-2 (binary) values (i.e. 1G = 1024M). |
-H | Like -h, but using base-10 (decimal) values (i.e. 1G = 1000M). |
-i | Lists inodes instead of block usage. |
-l, --lazy | Do not attempt to contact any OST or MDT not currently connected to the client. This avoids blocking the lfs df output when a target is offline, and only reports space for the targets that can currently be contacted. |
-p, --pool | Limit the usage to report only OSTs that are in the specified pool. |
-v, --verbose | Display verbose status of MDTs and OSTs. This may include one or more optional flags at the end of each line. |
lfs df
may also report additional target status
as the last column in the display, if there are issues with that target.
Target states include:
D: OST/MDT is Degraded. The target has a failed drive in the RAID device, or is undergoing RAID reconstruction. This state is marked on the server automatically for ZFS targets via zed, or by a (user-supplied) script that monitors the target device and sets "lctl set_param obdfilter.target.degraded=1" on the OST. This target will be avoided for new allocations, but will still be used to read existing files located there, or if there are not enough non-degraded OSTs to make up a widely-striped file.
R
: OST/MDT is Read-only
.
The target filesystem is marked read-only due to filesystem
corruption detected by ldiskfs or ZFS. No modifications
are allowed on this OST, and it needs to be unmounted and
e2fsck
or zpool scrub
run to repair the underlying filesystem.
N: OST/MDT is No-precreate. The target is configured to deny object precreation, set by the "lctl set_param obdfilter.target.no_precreate=1" parameter or the "-o no_precreate" mount option. This may be done to add an OST to the filesystem without allowing objects to be allocated on it yet, or for other reasons.
S
: OST/MDT is out of Space
.
The target filesystem has less than the minimum required
free space and will not be used for new object allocations
until it has more free space.
I
: OST/MDT is out of Inodes
.
The target filesystem has less than the minimum required
free inodes and will not be used for new object allocations
until it has more free inodes.
f: OST/MDT is on flash. The target filesystem is using a flash (non-rotational) storage device. This is normally detected from the underlying Linux block device, but can be set manually with "lctl set_param osd-*.*.nonrotational=1" on the respective OSTs. This lower-case status is only shown in conjunction with the -v option, since it is not an error condition.
The df -i
and lfs df -i
commands show the minimum number
of inodes that can be created in the file system at the current time.
If the total number of objects available across all of the OSTs is
smaller than those available on the MDT(s), taking into account the
default file striping, then df -i
will also
report a smaller number of inodes than could be created. Running
lfs df -i
will report the actual number of inodes
that are free on each target.
For ZFS file systems, the number of inodes that can be created is dynamic and depends on the free space in the file system. The Free and Total inode counts reported for a ZFS file system are only an estimate based on the current usage for each target. The Used inode count is the actual number of inodes used by the file system.
Examples
client$ lfs df
UUID                 1K-blocks    Used       Available  Use%  Mounted on
testfs-MDT0000_UUID  9174328      1020024    8154304    11%   /mnt/lustre[MDT:0]
testfs-OST0000_UUID  94181368     56330708   37850660   59%   /mnt/lustre[OST:0]
testfs-OST0001_UUID  94181368     56385748   37795620   59%   /mnt/lustre[OST:1]
testfs-OST0002_UUID  94181368     54352012   39829356   57%   /mnt/lustre[OST:2]
filesystem summary:  282544104    167068468  39829356   57%   /mnt/lustre
[client1] $ lfs df -hv
UUID                 bytes   Used    Available  Use%  Mounted on
testfs-MDT0000_UUID  8.7G    996.1M  7.8G       11%   /mnt/lustre[MDT:0]
testfs-OST0000_UUID  89.8G   53.7G   36.1G      59%   /mnt/lustre[OST:0] f
testfs-OST0001_UUID  89.8G   53.8G   36.0G      59%   /mnt/lustre[OST:1] f
testfs-OST0002_UUID  89.8G   51.8G   38.0G      57%   /mnt/lustre[OST:2] f
filesystem summary:  269.5G  159.3G  110.1G     59%   /mnt/lustre
[client1] $ lfs df -iH
UUID                 Inodes  IUsed   IFree   IUse%  Mounted on
testfs-MDT0000_UUID  2.21M   41.9k   2.17M   1%     /mnt/lustre[MDT:0]
testfs-OST0000_UUID  737.3k  12.1k   725.1k  1%     /mnt/lustre[OST:0]
testfs-OST0001_UUID  737.3k  12.2k   725.0k  1%     /mnt/lustre[OST:1]
testfs-OST0002_UUID  737.3k  12.2k   725.0k  1%     /mnt/lustre[OST:2]
filesystem summary:  2.21M   41.9k   2.17M   1%     /mnt/lustre
Two stripe allocation methods are provided:
Round-robin allocator - When the OSTs have approximately the same amount of free space, the round-robin allocator alternates stripes between OSTs on different OSSs, so the OST used for stripe 0 of each file is evenly distributed among OSTs, regardless of the stripe count. In a simple example with eight OSTs numbered 0-7, objects would be allocated like this:
File 1: OST1, OST2, OST3, OST4
File 2: OST5, OST6, OST7
File 3: OST0, OST1, OST2, OST3, OST4, OST5
File 4: OST6, OST7, OST0
Here are several more sample round-robin stripe orders (each letter represents a different OST on a single OSS):
3: AAA |
One 3-OST OSS |
3x3: ABABAB |
Two 3-OST OSSs |
3x4: BBABABA |
One 3-OST OSS (A) and one 4-OST OSS (B) |
3x5: BBABBABA |
One 3-OST OSS (A) and one 5-OST OSS (B) |
3x3x3: ABCABCABC |
Three 3-OST OSSs |
Weighted allocator - When the free space difference between the OSTs becomes significant, the weighting algorithm is used to influence OST ordering based on size (amount of free space available on each OST) and location (stripes evenly distributed across OSTs). The weighted allocator fills the emptier OSTs faster, but uses a weighted random algorithm, so the OST with the most free space is not necessarily chosen each time.
The allocation method is determined by the amount of free-space
imbalance on the OSTs. When free space is relatively balanced across
OSTs, the faster round-robin allocator is used, which maximizes network
balancing. The weighted allocator is used when any two OSTs are out of
balance by more than the specified threshold (17% by default). The
threshold between the two allocation methods is defined by the
qos_threshold_rr
parameter.
To temporarily set the qos_threshold_rr to 25, enter the following on each MDS:
mds# lctl set_param lod.fsname*.qos_threshold_rr=25
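To make the threshold change persistent across server restarts, the lctl set_param -P form shown earlier for reserved space could be used instead; a sketch (run on the MGS):
mgs# lctl set_param -P lod.*.qos_threshold_rr=25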
The weighting priority used by the weighted allocator is set by the qos_prio_free parameter.
Increasing the value of qos_prio_free
puts more
weighting on the amount of free space available on each OST and less
on how stripes are distributed across OSTs. The default value is
91
(percent). When the free space priority is set to
100
(percent), weighting is based entirely on free space and location
is no longer used by the striping algorithm.
To permanently change the allocator weighting to 100
, enter this command on the
MGS:
lctl conf_param fsname-MDT0000-*.lod.qos_prio_free=100
When qos_prio_free
is set to 100
, a weighted
random algorithm is still used to assign stripes, so, for example, if OST2 has twice as
much free space as OST1, OST2 is twice as likely to be used, but it is not guaranteed to
be used.
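To experiment with the weighting before making it permanent, the parameter can also be set temporarily on each MDS, in the same way as qos_threshold_rr above; a sketch:
mds# lctl set_param lod.fsname*.qos_prio_free=100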
Individual files can only be striped over a finite number of OSTs,
based on the maximum size of the attributes that can be stored on the MDT.
If the MDT is ldiskfs-based without the ea_inode
feature, a file can be striped across at most 160 OSTs. With ZFS-based
MDTs, or if the ea_inode
feature is enabled for an
ldiskfs-based MDT, a file can be striped across up to 2000 OSTs.
Lustre inodes use an extended attribute to record on which OST each object is located, and the identifier of each object on that OST. The size of the extended attribute is a function of the number of stripes.
If using an ldiskfs-based MDT, the maximum number of OSTs over which files can be striped can be raised to 2000 by enabling the ea_inode feature on the MDT:
tune2fs -O ea_inode /dev/mdtdev
Since Lustre 2.13 the
ea_inode
feature is enabled by default on all newly
formatted ldiskfs MDT filesystems.
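Whether ea_inode is already enabled on an ldiskfs MDT can be checked with the standard e2fsprogs tools; a sketch, assuming the same /dev/mdtdev device name as above:
tune2fs -l /dev/mdtdev | grep features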
The maximum stripe count for a single file does not limit the maximum number of OSTs that are in the filesystem as a whole, only the maximum possible size and maximum aggregate bandwidth for the file.
Table of Contents
This chapter describes Data on MDT (DoM).
The Lustre Data on MDT (DoM) feature improves small file IO by placing small files directly on the MDT, and also improves large file IO by avoiding the OST being affected by small random IO that can cause device seeking and hurt the streaming IO performance. Therefore, users can expect more consistent performance for both small file IO and mixed IO patterns.
The layout of a DoM file is stored on disk as a composite layout and is a special case of Progressive File Layout (PFL). Please see Section 19.5, “Progressive File Layout(PFL)” for more information on PFL. For DoM files, the layout is composed of a first component, which is placed on an MDT, and the remaining components, which are placed on OSTs if needed. The first component is placed on the MDT in the MDT object data blocks. This component always has one stripe, with size equal to the component size. Such a component with an MDT layout can only be the first component in a composite layout. The rest of the components are placed over OSTs as usual, with a RAID0 layout. The OST components are not instantiated until a client writes or truncates the file beyond the size of the MDT component.
When specifying a DoM layout, it might be assumed that the remaining layout will automatically go to the OSTs, but this is not the case. As with regular PFL layouts, if an EOF component is not present, then writes beyond the end of the last existing component will fail with the error ENODATA ("No data available"). For example, a DoM file created with a single component ending at 1MB will not be writable beyond 1MiB:
$ lfs setstripe -E 1M -L mdt /mnt/testfs/domdir
$ dd if=/dev/zero of=/mnt/testfs/domdir/testfile bs=1M
dd: error writing '/mnt/testfs/domdir/testfile': No data available
2+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.00186441 s, 562 MB/s
To allow the file to grow beyond 1 MB, add one or more regular
OST components with an EOF
component at the end:
lfs setstripe -E 1M -L mdt -E 1G -c1 -E eof -c4 /mnt/testfs/domdir
Lustre provides the lfs setstripe
command for
users to create DoM files. Also, as usual,
lfs getstripe
command can be used to list the
striping/component information for a given file, while
lfs find
command can be used to search the directory
tree rooted at the given directory or file name for the files that match
the given DoM component parameters, e.g. layout type.
The lfs setstripe
command is used to create
DoM files.
lfs setstripe --component-end|-E end1 --layout|-L mdt \ [--component-end|-E end2 [STRIPE_OPTIONS] ...] <filename>
The command above creates a file with a special composite layout, which defines the first component as an MDT component. The MDT component must start at offset 0 and end at end1. The end1 value is also the stripe size of this component, and is limited by the lod.*.dom_stripesize parameter of the MDT the file is created on. No other options are required for this component.
The rest of the components use the normal syntax for composite file creation.
If the next component doesn't specify striping, such as:
lfs setstripe -E 1M -L mdt -E EOF <filename>
then that component gets its settings from the default filesystem striping.
The command below creates a file with a DoM layout. The first
component has an mdt
layout and is placed on the
MDT, covering [0, 1M). The second component covers [1M, EOF) and is
striped over all available OSTs.
client$ lfs setstripe -E 1M -L mdt -E -1 -S 4M -c -1 \ /mnt/lustre/domfile
The resulting layout is illustrated by Figure 20.1, “Resulting file layout”.
The resulting layout can also be checked with lfs getstripe as shown below:
client$ lfs getstripe /mnt/lustre/domfile /mnt/lustre/domfile lcm_layout_gen: 2 lcm_mirror_count: 1 lcm_entry_count: 2 lcme_id: 1 lcme_flags: init lcme_extent.e_start: 0 lcme_extent.e_end: 1048576 lmm_stripe_count: 0 lmm_stripe_size: 1048576 lmm_pattern: mdt lmm_layout_gen: 0 lmm_stripe_offset: 0 lmm_objects: lcme_id: 2 lcme_flags: 0 lcme_extent.e_start: 1048576 lcme_extent.e_end: EOF lmm_stripe_count: -1 lmm_stripe_size: 4194304 lmm_pattern: raid0 lmm_layout_gen: 65535 lmm_stripe_offset: -1
The output above shows that the first component has size 1MB and
pattern is 'mdt'. The second component is not instantiated yet, which
is seen by lcme_flags: 0
.
If more than 1MB of data is written to the file, then
lfs getstripe
output is changed accordingly:
client$ lfs getstripe /mnt/lustre/domfile /mnt/lustre/domfile lcm_layout_gen: 3 lcm_mirror_count: 1 lcm_entry_count: 2 lcme_id: 1 lcme_flags: init lcme_extent.e_start: 0 lcme_extent.e_end: 1048576 lmm_stripe_count: 0 lmm_stripe_size: 1048576 lmm_pattern: mdt lmm_layout_gen: 0 lmm_stripe_offset: 2 lmm_objects: lcme_id: 2 lcme_flags: init lcme_extent.e_start: 1048576 lcme_extent.e_end: EOF lmm_stripe_count: 2 lmm_stripe_size: 4194304 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: 0 lmm_objects: - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x2:0x0] } - 1: { l_ost_idx: 1, l_fid: [0x100010000:0x2:0x0] }
The output above shows that the second component now has objects on OSTs with a 4MB stripe.
A DoM layout can be set on an existing directory as well. When set, all the files created after that will inherit this layout by default.
lfs setstripe --component-end|-E end1 --layout|-L mdt \ [--component-end|-E end2 [STRIPE_OPTIONS] ...] <dirname>
client$ mkdir /mnt/lustre/domdir client$ touch /mnt/lustre/domdir/normfile client$ lfs setstripe -E 1M -L mdt -E -1 /mnt/lustre/domdir/ client$ lfs getstripe -d /mnt/lustre/domdir lcm_layout_gen: 0 lcm_mirror_count: 1 lcm_entry_count: 2 lcme_id: N/A lcme_flags: 0 lcme_extent.e_start: 0 lcme_extent.e_end: 1048576 stripe_count: 0 stripe_size: 1048576 \ pattern: mdt stripe_offset: -1 lcme_id: N/A lcme_flags: 0 lcme_extent.e_start: 1048576 lcme_extent.e_end: EOF stripe_count: 1 stripe_size: 1048576 \ pattern: raid0 stripe_offset: -1
In the output above, it can be seen that the directory has a default layout with a DoM component.
The following example will check layouts of files in that directory:
client$ touch /mnt/lustre/domdir/domfile client$ lfs getstripe /mnt/lustre/domdir/normfile /mnt/lustre/domdir/normfile lmm_stripe_count: 2 lmm_stripe_size: 1048576 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: 1 obdidx objid objid group 1 3 0x3 0 0 3 0x3 0 client$ lfs getstripe /mnt/lustre/domdir/domfile /mnt/lustre/domdir/domfile lcm_layout_gen: 2 lcm_mirror_count: 1 lcm_entry_count: 2 lcme_id: 1 lcme_flags: init lcme_extent.e_start: 0 lcme_extent.e_end: 1048576 lmm_stripe_count: 0 lmm_stripe_size: 1048576 lmm_pattern: mdt lmm_layout_gen: 0 lmm_stripe_offset: 2 lmm_objects: lcme_id: 2 lcme_flags: 0 lcme_extent.e_start: 1048576 lcme_extent.e_end: EOF lmm_stripe_count: 1 lmm_stripe_size: 1048576 lmm_pattern: raid0 lmm_layout_gen: 65535 lmm_stripe_offset: -1
We can see that first file normfile in that directory has an ordinary layout, whereas the file domfile inherits the directory default layout and is a DoM file.
The directory default layout setting will be inherited by new files even if the server DoM size limit is later set to a lower value.
The maximum size of a DoM component is restricted in several ways to protect the MDT from being eventually filled with large files.
lfs setstripe
allows for setting the
component size for MDT layouts up to 1GB (this is a compile-time
limit to avoid improper configuration), however, the size must
also be aligned by 64KB due to the minimum stripe size in Lustre
(see Table 5.2, “File and file system limits”
Minimum stripe size
). There is also a limit
imposed on each file by lfs setstripe -E end
that may be smaller than the MDT-imposed limit if this is better
for a particular usage.
The lod.$fsname-MDTxxxx.dom_stripesize
is used to control the per-MDT maximum size for a DoM component.
Larger DoM components specified by the user will be truncated to
the MDT-specified limit, and as such may be different on each
MDT to balance DoM space usage on each MDT separately, if needed.
It is 1MB by default and can be changed with the
lctl
tool. For more information on setting
dom_stripesize
please see
Section 20.2.6, “
The dom_stripesize parameter”.
The lfs getstripe
command is used to list
the striping/component information for a given file. For DoM files, it
can be used to check its layout and size.
lfs getstripe [--component-id|-I [comp_id]] [--layout|-L] \ [--stripe-size|-S] <dirname|filename>
client$ lfs getstripe -I1 /mnt/lustre/domfile /mnt/lustre/domfile lcm_layout_gen: 3 lcm_mirror_count: 1 lcm_entry_count: 2 lcme_id: 1 lcme_flags: init lcme_extent.e_start: 0 lcme_extent.e_end: 1048576 lmm_stripe_count: 0 lmm_stripe_size: 1048576 lmm_pattern: mdt lmm_layout_gen: 0 lmm_stripe_offset: 2 lmm_objects:
Brief information about the layout and size of the DoM component can be obtained by using the -L option along with the -S or -E options:
client$ lfs getstripe -I1 -L -S /mnt/lustre/domfile
lmm_stripe_size:   1048576
lmm_pattern:       mdt
client$ lfs getstripe -I1 -L -E /mnt/lustre/domfile
lcme_extent.e_end: 1048576
lmm_pattern:       mdt
Both commands return the layout type and its size. For DoM files the stripe size is equal to the extent size of the component, so either can be used to get the size of the component on the MDT.
The lfs find
command can be used to search
the directory tree rooted at the given directory or file name for the
files that match the given parameters. The command below shows the new
parameters for DoM files and their usages are similar to the
lfs getstripe
command.
Find all files with DoM layout under directory
/mnt/lustre
:
client$ lfs find -L mdt /mnt/lustre
/mnt/lustre/domfile
/mnt/lustre/domdir
/mnt/lustre/domdir/domfile
client$ lfs find -L mdt -type f /mnt/lustre
/mnt/lustre/domfile
/mnt/lustre/domdir/domfile
client$ lfs find -L mdt -type d /mnt/lustre
/mnt/lustre/domdir
By using this command you can find all DoM objects, only DoM files, or only directories with default DoM layout.
Find the DoM files/dirs with a particular stripe size:
client$ lfs find -L mdt -S -1200K -type f /mnt/lustre
/mnt/lustre/domfile
/mnt/lustre/domdir/domfile
client$ lfs find -L mdt -S +200K -type f /mnt/lustre
/mnt/lustre/domfile
/mnt/lustre/domdir/domfile
The first command finds all DoM files with stripe size less than 1200KB. The second command above does the same for files with a stripe size greater than 200KB. In both cases, all DoM files are found because their DoM size is 1MB.
The MDT controls the default maximum DoM size on the server via
the parameter dom_stripesize
in the LOD device.
The dom_stripesize
can be set differently for each
MDT, if necessary. The default value of the parameter is 1MB and can
be changed with lctl
tool.
The commands below get the maximum allowed DoM size on the server. The final command is an attempt to create a file with a larger size than the parameter setting and correctly fails.
mds# lctl get_param lod.*MDT0000*.dom_stripesize
lod.lustre-MDT0000-mdtlov.dom_stripesize=1048576
mds# lctl get_param -n lod.*MDT0000*.dom_stripesize
1048576
client$ lfs setstripe -E 2M -L mdt /mnt/lustre/dom2mb
Create composite file /mnt/lustre/dom2mb failed. Invalid argument
error: setstripe: create composite file '/mnt/lustre/dom2mb' failed: Invalid argument
To temporarily set the value of the parameter, the
lctl set_param
is used:
lctl set_param lod.*MDT<index>*.dom_stripesize=<value>
The example below changes the default DoM limit on the server to 64KB and then tries to create a file with a 1MB DoM component, which correctly fails:
mds# lctl set_param -n lod.*MDT0000*.dom_stripesize=64K
mds# lctl get_param -n lod.*MDT0000*.dom_stripesize
65536
client$ lfs setstripe -E 1M -L mdt /mnt/lustre/dom
Create composite file /mnt/lustre/dom failed. Invalid argument
error: setstripe: create composite file '/mnt/lustre/dom' failed: Invalid argument
To persistently set the value of the parameter on a
specific MDT, the
lctl set_param -P
command is used:
lctl set_param -P lod.fsname-MDTindex.dom_stripesize=value
This can also use a wildcard '*
' for the
index
to apply to all MDTs.
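For example, a sketch that persistently sets a 512KB DoM limit on a single MDT (the filesystem name testfs and the MDT index are assumptions for illustration):
lctl set_param -P lod.testfs-MDT0000.dom_stripesize=512K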
When lctl set_param
(whether with
-P
or not) sets
dom_stripesize
to 0
, DoM
component creation will be disabled on the specified server(s), and
any new layouts with a specified DoM component
will have that component removed from the file layout. Existing
files and layouts with DoM components on that MDT are not changed.
DoM files can still be created in existing directories with a default DoM layout.
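As a sketch, DoM component creation could be disabled on all MDTs of a filesystem named testfs (an assumed name) by combining the wildcard form above with a value of 0:
lctl set_param -P lod.testfs-MDT*.dom_stripesize=0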
Table of Contents
This chapter describes Lazy Size on MDT (LSoM).
In the Lustre file system, MDSs store the ctime, mtime, owner, and other file attributes. The OSSs store the size and number of blocks used for each file. To obtain the correct file size, the client must contact each OST that the file is stored across, which means multiple RPCs to get the size and blocks for a file when a file is striped over multiple OSTs. The Lazy Size on MDT (LSoM) feature stores the file size on the MDS and avoids the need to fetch the file size from the OST(s) in cases where the application understands that the size may not be accurate. Lazy means there is no guarantee of the accuracy of the attributes stored on the MDS.
Since many Lustre installations use SSD for MDT storage, the
motivation for the LSoM work is to speed up the time it takes to get
the size of a file from the Lustre file system by storing that data on
the MDTs. We expect this feature to be initially used by Lustre policy
engines that scan the backend MDT storage, make decisions based on broad
size categories, and do not depend on a totally accurate file size.
Examples include Lester, Robinhood, Zester, and various vendor offerings.
Future improvements will allow the LSoM data to be accessed by tools such
as lfs find
.
LSoM is always enabled and nothing needs to be done to enable the
feature for fetching the LSoM data when scanning the MDT inodes with a
policy engine. It is also possible to access the LSoM data on the client
via the lfs getsom
command. Because the LSoM data is
currently accessed on the client via the xattr interface, the
xattr_cache
will cache the file size and block count on
the client as long as the inode is cached. In most cases this is
desirable, since it improves access to the LSoM data. However, it also
means that the LSoM data may be stale if the file size is changed after the
xattr is first accessed or if the xattr is accessed shortly after the file
is first created.
If it is necessary to access up-to-date LSoM data that has gone
stale, it is possible to flush the xattr cache from the client by
cancelling the MDC locks via
lctl set_param ldlm.namespaces.*mdc*.lru_size=clear
.
Otherwise, the file attributes will be dropped from the client cache if the file has not been accessed before the LDLM lock timeout. The timeout can be read with lctl get_param ldlm.namespaces.*mdc*.lru_max_age.
If LSoM attributes are repeatedly accessed for files that are recently created or frequently modified from a specific client, such as an HSM agent node, it is possible to disable xattr caching on that client via lctl set_param llite.*.xattr_cache=0. This may cause extra overhead when accessing files, and is not recommended for normal usage.
Lustre provides the lfs getsom
command to list
file attributes that are stored on the MDT.
The llsom_sync
command allows the user to sync
the file attributes on the MDT with the valid/up-to-date data on the
OSTs. llsom_sync
is called on the client with the
Lustre file system mount point. llsom_sync
uses Lustre
MDS changelogs and, thus, a changelog user must be registered to use this
utility.
The lfs getsom
command lists file attributes
that are stored on the MDT. lfs getsom
is called
with the full path and file name for a file on the Lustre file
system. If no flags are used, then all file attributes stored on the
MDS will be shown.
lfs getsom [-s] [-b] [-f] <filename>
The various lfs getsom
options are listed and
described below.
Option |
Description |
---|---|
-s |
Only show the size value of the LSoM data for a given file. This is an optional flag. |
-b |
Only show the blocks value of the LSoM data for a given file. This is an optional flag. |
-f |
Only show the flag value of the LSoM data for a given file. This is an optional flag. Valid flags are: SOM_FL_UNKNOWN = 0x0000 - Unknown or no SoM data, must get size from OSTs. SOM_FL_STRICT = 0x0001 - Known strictly correct, FLR file (SoM guaranteed). SOM_FL_STALE = 0x0002 - Known stale - was correct at some point in the past, but it is known (or likely) to be incorrect now (e.g. opened for write). SOM_FL_LAZY = 0x0004 - Approximate, may never have been strictly correct, need to sync SoM data to achieve eventual consistency. |
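For example, assuming a file /mnt/testfs/file1 exists (the path is illustrative), the full LSoM record, or only its size field, can be queried with:
client$ lfs getsom /mnt/testfs/file1
client$ lfs getsom -s /mnt/testfs/file1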
The llsom_sync
command allows the user to sync
the file attributes on the MDT with the valid/up-to-date data on the
OSTs. llsom_sync
is called on the client with the
client mount point for the Lustre file system.
llsom_sync
uses Lustre MDS changelogs and, thus, a
changelog user must be registered to use this utility.
llsom_sync --mdt|-m <mdt> --user|-u <user_id> [--daemonize|-d] [--verbose|-v] [--interval|-i] [--min-age|-a] [--max-cache|-c] [--sync|-s] <lustre_mount_point>
The various llsom_sync
options are
listed and described below.
Option |
Description |
---|---|
--mdt|-m <mdt> |
The metadata device (MDT) for which the LSoM xattrs of files need to be synced. A changelog user must be registered for this device. Required flag. |
--user|-u <user_id> |
The changelog user id for the MDT device. Required flag. |
--daemonize|-d |
Optional flag to run the program in the background. In daemon mode, the utility will scan and process the changelog records and sync the LSoM xattrs for files periodically. |
--verbose|-v |
Optional flag to produce verbose output. |
--interval|-i |
Optional flag for the time interval at which to scan the Lustre changelog and process the log records in daemon mode. |
--min-age|-a |
Optional flag for the minimum time that must have passed since a file was last modified before its LSoM xattr is updated. |
--max-cache|-c |
Optional flag for the total memory used for the FID cache, which can be specified with a suffix [KkMmGg]. The default max-cache value is 256MB. If the parameter value is less than 100, it is taken as a percentage of total memory to use for the FID cache instead of an absolute cache size. |
--sync|-s |
Optional flag to sync the file data so that dirty data is flushed from cache, ensuring the blocks count is correct when updating the file's LSoM xattr. This option could hurt server performance significantly if thousands of fsync requests are sent. |
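As a sketch of a typical workflow (the filesystem name testfs, mount point /mnt/testfs, and returned changelog user id cl1 are assumptions), a changelog user is first registered on the MDS, and llsom_sync is then run on a client in daemon mode:
mds# lctl --device testfs-MDT0000 changelog_register
client# llsom_sync --mdt=testfs-MDT0000 --user=cl1 --daemonize /mnt/testfs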
This chapter describes File Level Redundancy (FLR).
The Lustre file system was initially designed and implemented for HPC use. It has been working well on high-end storage that has internal redundancy and fault-tolerance. However, despite the expense and complexity of these storage systems, storage failures still occur, and before release 2.11, Lustre could not be more reliable than the individual storage and server components on which it was based. The Lustre file system had no mechanism to mitigate storage hardware failures and files would become inaccessible if a server was inaccessible or otherwise out of service.
With the File Level Redundancy (FLR) feature introduced in Lustre Release 2.11, any Lustre file can store the same data on multiple OSTs in order for the system to be robust in the event of storage failures or other outages. With the choice of multiple mirrors, the best suited mirror can be chosen to satisfy an individual request, which has a direct impact on IO availability. Furthermore, for files that are concurrently read by many clients (e.g. input decks, shared libraries, or executables) the aggregate parallel read performance of a single file can be improved by creating multiple mirrors of the file data.
The first phase of the FLR feature has been implemented with delayed write (Figure 22.1, “FLR Delayed Write”). While writing to a mirrored file, only one primary or preferred mirror will be updated directly during the write, while other mirrors will be simply marked as stale. The file can subsequently return to a mirrored state again by synchronizing among mirrors with command line tools (run by the user or administrator directly or via automated monitoring tools).
Lustre provides lfs mirror
command line tools for
users to operate on mirrored files or directories.
Command:
lfs mirror create <--mirror-count|-N[mirror_count] [setstripe_options|[--flags<=flags>]]> ... <filename|directory>
The above command will create a mirrored file or directory specified
by filename
or
directory
, respectively.
Option | Description |
---|---|
--mirror-count|-N[mirror_count] |
Indicates the number of mirrors to be created with the following setstripe options. It can be repeated multiple times to separate mirrors that have different layouts. The mirror_count argument is optional and defaults to 1 if it is not specified; if specified, it must follow the option without a space. |
setstripe_options |
Specifies a specific layout for the mirror. It can be a
plain layout with a specific striping pattern or a composite
layout, such as Section 19.5, “Progressive File Layout(PFL)”. The options are
the same as those for the lfs setstripe command. If setstripe_options are not specified, the stripe options inherited from the previous component are used. |
--flags<=flags> |
Sets flags on the mirror to be created. Only the prefer flag is supported at this time. Note: This flag will
be set on all components that belong to the corresponding
mirror. |
Note: For redundancy and fault-tolerance, users need to make sure that different mirrors reside on different OSTs, or even on different OSSs and racks. An understanding of cluster topology is necessary to achieve this architecture. In the initial implementation, the existing OST pools mechanism allows separating OSTs by any arbitrary criteria, i.e. fault domain. In practice, users can take advantage of OST pools by grouping OSTs by topological information. Then, when creating a mirrored file, users can indicate which OST pools can be used by each mirror.
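As a hedged sketch of how such fault domains might be defined (the pool names and OST indices are assumptions), OST pools can be created on the MGS with lctl and then referenced from lfs mirror create via the -p option, as in the examples below:
mgs# lctl pool_new testfs.flash
mgs# lctl pool_add testfs.flash testfs-OST0000 testfs-OST0001
mgs# lctl pool_new testfs.archive
mgs# lctl pool_add testfs.archive testfs-OST0002 testfs-OST0003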
Examples:
The following command creates a mirrored file with 2 plain layout mirrors:
client# lfs mirror create -N -S 4M -c 2 -p flash \
        -N -c -1 -p archive /mnt/testfs/file1
The following command displays the layout information of the
mirrored file /mnt/testfs/file1
:
client# lfs getstripe /mnt/testfs/file1 /mnt/testfs/file1 lcm_layout_gen: 2 lcm_mirror_count: 2 lcm_entry_count: 2 lcme_id: 65537 lcme_mirror_id: 1 lcme_flags: init lcme_extent.e_start: 0 lcme_extent.e_end: EOF lmm_stripe_count: 2 lmm_stripe_size: 4194304 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: 1 lmm_pool: flash lmm_objects: - 0: { l_ost_idx: 1, l_fid: [0x100010000:0x2:0x0] } - 1: { l_ost_idx: 0, l_fid: [0x100000000:0x2:0x0] } lcme_id: 131074 lcme_mirror_id: 2 lcme_flags: init lcme_extent.e_start: 0 lcme_extent.e_end: EOF lmm_stripe_count: 6 lmm_stripe_size: 4194304 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: 3 lmm_pool: archive lmm_objects: - 0: { l_ost_idx: 3, l_fid: [0x100030000:0x2:0x0] } - 1: { l_ost_idx: 4, l_fid: [0x100040000:0x2:0x0] } - 2: { l_ost_idx: 5, l_fid: [0x100050000:0x2:0x0] } - 3: { l_ost_idx: 6, l_fid: [0x100060000:0x2:0x0] } - 4: { l_ost_idx: 7, l_fid: [0x100070000:0x2:0x0] } - 5: { l_ost_idx: 2, l_fid: [0x100020000:0x2:0x0] }
The first mirror has 4MB stripe size and two stripes across OSTs in
the flash
OST pool.
The second mirror has 4MB stripe size inherited from the first mirror,
and stripes across all of the available OSTs in the
archive
OST pool.
As mentioned above, it is recommended to use the
--pool|-p
option (one of the
lfs setstripe
options) with OST pools configured with
independent fault domains to ensure different mirrors will be placed on
different OSTs, servers, and/or racks, thereby improving availability
and performance. If the setstripe options are not specified, it is
possible to create mirrors with objects on the same OST(s), which would
remove most of the benefit of using replication.
In the layout information printed by lfs getstripe
,
lcme_mirror_id
shows the mirror ID, which is the unique
numerical identifier for a mirror, and lcme_flags
shows the mirrored component flags. Valid flag names are:
init
- indicates mirrored component has been
initialized (has allocated OST objects).
stale
- indicates mirrored component does not
have up-to-date data. Stale components will not be used for read or
write operations, and need to be resynchronized by running
lfs mirror resync
command before they can be
accessed again.
prefer
- indicates mirrored component is
preferred for read or write. For example, the mirror is located on
SSD-based OSTs, or is closer to the client on the network (fewer hops).
This flag can be set by users at mirror creation time.
The following command creates a mirrored file with 3 PFL mirrors:
client# lfs mirror create -N -E 4M -p flash --flags=prefer -E eof -c 2 \
        -N -E 16M -S 8M -c 4 -p archive -E eof -c -1 \
        -N -E 32M -c 1 -p archive2 -E eof -c -1 /mnt/testfs/file2
The following command displays the layout information of the
mirrored file /mnt/testfs/file2
:
client# lfs getstripe /mnt/testfs/file2 /mnt/testfs/file2 lcm_layout_gen: 6 lcm_mirror_count: 3 lcm_entry_count: 6 lcme_id: 65537 lcme_mirror_id: 1 lcme_flags: init,prefer lcme_extent.e_start: 0 lcme_extent.e_end: 4194304 lmm_stripe_count: 1 lmm_stripe_size: 1048576 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: 1 lmm_pool: flash lmm_objects: - 0: { l_ost_idx: 1, l_fid: [0x100010000:0x3:0x0] } lcme_id: 65538 lcme_mirror_id: 1 lcme_flags: prefer lcme_extent.e_start: 4194304 lcme_extent.e_end: EOF lmm_stripe_count: 2 lmm_stripe_size: 1048576 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: -1 lmm_pool: flash lcme_id: 131075 lcme_mirror_id: 2 lcme_flags: init lcme_extent.e_start: 0 lcme_extent.e_end: 16777216 lmm_stripe_count: 4 lmm_stripe_size: 8388608 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: 4 lmm_pool: archive lmm_objects: - 0: { l_ost_idx: 4, l_fid: [0x100040000:0x3:0x0] } - 1: { l_ost_idx: 5, l_fid: [0x100050000:0x3:0x0] } - 2: { l_ost_idx: 6, l_fid: [0x100060000:0x3:0x0] } - 3: { l_ost_idx: 7, l_fid: [0x100070000:0x3:0x0] } lcme_id: 131076 lcme_mirror_id: 2 lcme_flags: 0 lcme_extent.e_start: 16777216 lcme_extent.e_end: EOF lmm_stripe_count: 6 lmm_stripe_size: 8388608 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: -1 lmm_pool: archive lcme_id: 196613 lcme_mirror_id: 3 lcme_flags: init lcme_extent.e_start: 0 lcme_extent.e_end: 33554432 lmm_stripe_count: 1 lmm_stripe_size: 8388608 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: 0 lmm_pool: archive2 lmm_objects: - 0: { l_ost_idx: 8, l_fid: [0x3400000000:0x3:0x0] } lcme_id: 196614 lcme_mirror_id: 3 lcme_flags: 0 lcme_extent.e_start: 33554432 lcme_extent.e_end: EOF lmm_stripe_count: -1 lmm_stripe_size: 8388608 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: -1 lmm_pool: archive2
For the first mirror, the first component inherits the stripe count
and stripe size from filesystem-wide default values. The second
component inherits the stripe size and OST pool from the first
component, and has two stripes. Both of the components are allocated
from the flash
OST pool.
Also, the flag prefer
is
applied to all the components of the first mirror, which tells the
client to read data from those components whenever they are available.
For the second mirror, the first component has an 8MB stripe size
and 4 stripes across OSTs in the archive
OST pool.
The second component inherits the stripe size and OST pool from the
first component, and stripes across all of the available OSTs in the
archive
OST pool.
For the third mirror, the first component inherits the stripe size
of 8MB from the last component of the second mirror, and has one single
stripe. The OST pool name is set to archive2
.
The second component inherits stripe size from the first component,
and stripes across all of the available OSTs in that pool.
Command:
lfs mirror extend [--no-verify] <--mirror-count|-N[mirror_count] [setstripe_options|-f <victim_file>]> ... <filename>
The above command will append mirror(s) indicated by
setstripe options
or just take the layout from
existing file victim_file
into the file
filename
. The
filename
must be an existing file, however,
it can be a mirrored or regular non-mirrored file. If it is a
non-mirrored file, the command will convert it to a mirrored file.
Option | Description |
---|---|
--mirror-count|-N[mirror_count] |
Indicates the number of mirrors to be added with the
following setstripe_options or -f <victim_file> option. The mirror_count argument is optional and defaults to 1 if it is not specified; if specified, it must follow the option without a space. |
setstripe_options |
Specifies a specific layout for the mirror. It can be a
plain layout with a specific striping pattern or a composite
layout, such as Section 19.5, “Progressive File Layout(PFL)”. The options are the
same as those for the lfs setstripe command. If setstripe_options are not specified, the stripe options inherited from the previous component are used. |
-f <victim_file> |
If -f <victim_file> is specified, the command takes the layout from the existing victim_file and merges it into filename as a new mirror. Note: The
victim_file is removed after the merge is complete. |
--no-verify | If victim_file is specified, the
command will verify that the file contents from
victim_file are the same as
filename . Otherwise, the command
will return a failure. However, the option
--no-verify can be used to override this
verification. This option can save significant time on file
comparison if the file size is large, but use it only when the
file contents are known to be the same. |
Note: The
lfs mirror extend
operation cannot be applied to a directory.
Examples:
The following commands create a non-mirrored file, convert it to a mirrored file, and extend it with a plain layout mirror:
# lfs setstripe -p flash /mnt/testfs/file1 # lfs getstripe /mnt/testfs/file1 /mnt/testfs/file1 lmm_stripe_count: 1 lmm_stripe_size: 1048576 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: 0 lmm_pool: flash obdidx objid objid group 0 4 0x4 0 # lfs mirror extend -N -S 8M -c -1 -p archive /mnt/testfs/file1 # lfs getstripe /mnt/testfs/file1 /mnt/testfs/file1 lcm_layout_gen: 2 lcm_mirror_count: 2 lcm_entry_count: 2 lcme_id: 65537 lcme_mirror_id: 1 lcme_flags: init lcme_extent.e_start: 0 lcme_extent.e_end: EOF lmm_stripe_count: 1 lmm_stripe_size: 1048576 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: 0 lmm_pool: flash lmm_objects: - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x4:0x0] } lcme_id: 131073 lcme_mirror_id: 2 lcme_flags: init lcme_extent.e_start: 0 lcme_extent.e_end: EOF lmm_stripe_count: 6 lmm_stripe_size: 8388608 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: 3 lmm_pool: archive lmm_objects: - 0: { l_ost_idx: 3, l_fid: [0x100030000:0x3:0x0] } - 1: { l_ost_idx: 4, l_fid: [0x100040000:0x4:0x0] } - 2: { l_ost_idx: 5, l_fid: [0x100050000:0x4:0x0] } - 3: { l_ost_idx: 6, l_fid: [0x100060000:0x4:0x0] } - 4: { l_ost_idx: 7, l_fid: [0x100070000:0x4:0x0] } - 5: { l_ost_idx: 2, l_fid: [0x100020000:0x3:0x0] }
The following commands create a separate
victim_file
with a PFL layout and then merge it as a mirror into
the mirrored file /mnt/testfs/file1
created in the
above example, without data verification:
# lfs setstripe -E 16M -c 2 -p none \ -E eof -c -1 /mnt/testfs/victim_file # lfs getstripe /mnt/testfs/victim_file /mnt/testfs/victim_file lcm_layout_gen: 2 lcm_mirror_count: 1 lcm_entry_count: 2 lcme_id: 1 lcme_mirror_id: 0 lcme_flags: init lcme_extent.e_start: 0 lcme_extent.e_end: 16777216 lmm_stripe_count: 2 lmm_stripe_size: 1048576 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: 5 lmm_objects: - 0: { l_ost_idx: 5, l_fid: [0x100050000:0x5:0x0] } - 1: { l_ost_idx: 6, l_fid: [0x100060000:0x5:0x0] } lcme_id: 2 lcme_mirror_id: 0 lcme_flags: 0 lcme_extent.e_start: 16777216 lcme_extent.e_end: EOF lmm_stripe_count: -1 lmm_stripe_size: 1048576 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: -1 # lfs mirror extend --no-verify -N -f /mnt/testfs/victim_file \ /mnt/testfs/file1 # lfs getstripe /mnt/testfs/file1 /mnt/testfs/file1 lcm_layout_gen: 3 lcm_mirror_count: 3 lcm_entry_count: 4 lcme_id: 65537 lcme_mirror_id: 1 lcme_flags: init lcme_extent.e_start: 0 lcme_extent.e_end: EOF lmm_stripe_count: 1 lmm_stripe_size: 1048576 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: 0 lmm_pool: flash lmm_objects: - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x4:0x0] } lcme_id: 131073 lcme_mirror_id: 2 lcme_flags: init lcme_extent.e_start: 0 lcme_extent.e_end: EOF lmm_stripe_count: 6 lmm_stripe_size: 8388608 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: 3 lmm_pool: archive lmm_objects: - 0: { l_ost_idx: 3, l_fid: [0x100030000:0x3:0x0] } - 1: { l_ost_idx: 4, l_fid: [0x100040000:0x4:0x0] } - 2: { l_ost_idx: 5, l_fid: [0x100050000:0x4:0x0] } - 3: { l_ost_idx: 6, l_fid: [0x100060000:0x4:0x0] } - 4: { l_ost_idx: 7, l_fid: [0x100070000:0x4:0x0] } - 5: { l_ost_idx: 2, l_fid: [0x100020000:0x3:0x0] } lcme_id: 196609 lcme_mirror_id: 3 lcme_flags: init lcme_extent.e_start: 0 lcme_extent.e_end: 16777216 lmm_stripe_count: 2 lmm_stripe_size: 1048576 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: 5 lmm_objects: - 0: { l_ost_idx: 5, l_fid: [0x100050000:0x5:0x0] } - 1: { l_ost_idx: 6, l_fid: [0x100060000:0x5:0x0] } lcme_id: 196610 lcme_mirror_id: 3 lcme_flags: 0 lcme_extent.e_start: 16777216 lcme_extent.e_end: EOF lmm_stripe_count: -1 lmm_stripe_size: 1048576 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: -1
After extending, the victim_file
was
removed:
# ls /mnt/testfs/victim_file ls: cannot access /mnt/testfs/victim_file: No such file or directory
Command:
lfs mirror split <--mirror-id <mirror_id>> [--destroy|-d] [-f <new_file>] <mirrored_file>
The above command will split a specified mirror with ID
<mirror_id>
out of an existing mirrored
file specified by
mirrored_file
. By default, a new file named
<mirrored_file>.mirror~<mirror_id>
will
be created with the layout of the split mirror. If the
--destroy|-d
option is specified, then the split
mirror will be destroyed. If the -f <new_file>
option is specified, then a file named
new_file
will be created with the layout of
the split mirror. If mirrored_file
has only
one mirror existing after split, it will be converted to a regular
non-mirrored file. If the original
mirrored_file
is not a mirrored file, then
the command will return an error.
Option | Description |
---|---|
--mirror-id <mirror_id> | The unique numerical identifier for a mirror. The mirror
ID is unique within a mirrored file and is automatically
assigned at file creation or extension time. It can be fetched
by the lfs getstripe command.
|
--destroy|-d | Indicates the split mirror will be destroyed. |
-f <new_file> | Indicates a file named new_file
will be created with the layout of the split mirror. |
Examples:
The following commands create a mirrored file with 4 mirrors, then split 3 mirrors separately from the mirrored file.
Creating a mirrored file with 4 mirrors:
# lfs mirror create -N2 -E 4M -p flash -E eof -c -1 \ -N2 -S 8M -c 2 -p archive /mnt/testfs/file1 # lfs getstripe /mnt/testfs/file1 /mnt/testfs/file1 lcm_layout_gen: 6 lcm_mirror_count: 4 lcm_entry_count: 6 lcme_id: 65537 lcme_mirror_id: 1 lcme_flags: init lcme_extent.e_start: 0 lcme_extent.e_end: 4194304 lmm_stripe_count: 1 lmm_stripe_size: 1048576 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: 1 lmm_pool: flash lmm_objects: - 0: { l_ost_idx: 1, l_fid: [0x100010000:0x4:0x0] } lcme_id: 65538 lcme_mirror_id: 1 lcme_flags: 0 lcme_extent.e_start: 4194304 lcme_extent.e_end: EOF lmm_stripe_count: 2 lmm_stripe_size: 1048576 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: -1 lmm_pool: flash lcme_id: 131075 lcme_mirror_id: 2 lcme_flags: init lcme_extent.e_start: 0 lcme_extent.e_end: 4194304 lmm_stripe_count: 1 lmm_stripe_size: 1048576 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: 0 lmm_pool: flash lmm_objects: - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x5:0x0] } lcme_id: 131076 lcme_mirror_id: 2 lcme_flags: 0 lcme_extent.e_start: 4194304 lcme_extent.e_end: EOF lmm_stripe_count: 2 lmm_stripe_size: 1048576 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: -1 lmm_pool: flash lcme_id: 196613 lcme_mirror_id: 3 lcme_flags: init lcme_extent.e_start: 0 lcme_extent.e_end: EOF lmm_stripe_count: 2 lmm_stripe_size: 8388608 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: 4 lmm_pool: archive lmm_objects: - 0: { l_ost_idx: 4, l_fid: [0x100040000:0x5:0x0] } - 1: { l_ost_idx: 5, l_fid: [0x100050000:0x6:0x0] } lcme_id: 262150 lcme_mirror_id: 4 lcme_flags: init lcme_extent.e_start: 0 lcme_extent.e_end: EOF lmm_stripe_count: 2 lmm_stripe_size: 8388608 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: 7 lmm_pool: archive lmm_objects: - 0: { l_ost_idx: 7, l_fid: [0x100070000:0x5:0x0] } - 1: { l_ost_idx: 2, l_fid: [0x100020000:0x4:0x0] }
Splitting the mirror with ID 1
from
/mnt/testfs/file1
and creating
/mnt/testfs/file1.mirror~1
with the layout of the
split mirror:
# lfs mirror split --mirror-id 1 /mnt/testfs/file1 # lfs getstripe /mnt/testfs/file1.mirror~1 /mnt/testfs/file1.mirror~1 lcm_layout_gen: 1 lcm_mirror_count: 1 lcm_entry_count: 2 lcme_id: 65537 lcme_mirror_id: 1 lcme_flags: init lcme_extent.e_start: 0 lcme_extent.e_end: 4194304 lmm_stripe_count: 1 lmm_stripe_size: 1048576 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: 1 lmm_pool: flash lmm_objects: - 0: { l_ost_idx: 1, l_fid: [0x100010000:0x4:0x0] } lcme_id: 65538 lcme_mirror_id: 1 lcme_flags: 0 lcme_extent.e_start: 4194304 lcme_extent.e_end: EOF lmm_stripe_count: 2 lmm_stripe_size: 1048576 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: -1 lmm_pool: flash
Splitting the mirror with ID 2
from
/mnt/testfs/file1
and destroying it:
# lfs mirror split --mirror-id 2 -d /mnt/testfs/file1 # lfs getstripe /mnt/testfs/file1 /mnt/testfs/file1 lcm_layout_gen: 8 lcm_mirror_count: 2 lcm_entry_count: 2 lcme_id: 196613 lcme_mirror_id: 3 lcme_flags: init lcme_extent.e_start: 0 lcme_extent.e_end: EOF lmm_stripe_count: 2 lmm_stripe_size: 8388608 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: 4 lmm_pool: archive lmm_objects: - 0: { l_ost_idx: 4, l_fid: [0x100040000:0x5:0x0] } - 1: { l_ost_idx: 5, l_fid: [0x100050000:0x6:0x0] } lcme_id: 262150 lcme_mirror_id: 4 lcme_flags: init lcme_extent.e_start: 0 lcme_extent.e_end: EOF lmm_stripe_count: 2 lmm_stripe_size: 8388608 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: 7 lmm_pool: archive lmm_objects: - 0: { l_ost_idx: 7, l_fid: [0x100070000:0x5:0x0] } - 1: { l_ost_idx: 2, l_fid: [0x100020000:0x4:0x0] }
Splitting the mirror with ID 3
from
/mnt/testfs/file1
and creating
/mnt/testfs/file2
with the layout of the split
mirror:
# lfs mirror split --mirror-id 3 -f /mnt/testfs/file2 \ /mnt/testfs/file1 # lfs getstripe /mnt/testfs/file2 /mnt/testfs/file2 lcm_layout_gen: 1 lcm_mirror_count: 1 lcm_entry_count: 1 lcme_id: 196613 lcme_mirror_id: 3 lcme_flags: init lcme_extent.e_start: 0 lcme_extent.e_end: EOF lmm_stripe_count: 2 lmm_stripe_size: 8388608 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: 4 lmm_pool: archive lmm_objects: - 0: { l_ost_idx: 4, l_fid: [0x100040000:0x5:0x0] } - 1: { l_ost_idx: 5, l_fid: [0x100050000:0x6:0x0] } # lfs getstripe /mnt/testfs/file1 /mnt/testfs/file1 lcm_layout_gen: 9 lcm_mirror_count: 1 lcm_entry_count: 1 lcme_id: 262150 lcme_mirror_id: 4 lcme_flags: init lcme_extent.e_start: 0 lcme_extent.e_end: EOF lmm_stripe_count: 2 lmm_stripe_size: 8388608 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: 7 lmm_pool: archive lmm_objects: - 0: { l_ost_idx: 7, l_fid: [0x100070000:0x5:0x0] } - 1: { l_ost_idx: 2, l_fid: [0x100020000:0x4:0x0] }
The above layout information showed that mirrors with ID
1, 2, and 3
were all split from the mirrored file
/mnt/testfs/file1
.
Command:
lfs mirror resync [--only <mirror_id[,...]>] <mirrored_file> [<mirrored_file2>...]
The above command will resynchronize out-of-sync mirrored file(s)
specified by mirrored_file
. It
supports specifying multiple mirrored files in one command line.
If there is no stale mirror for the specified mirrored file(s), then
the command does nothing. Otherwise, it will copy data from a synced
mirror to the stale mirror(s), and mark all successfully copied
mirror(s) as SYNC. If the
--only <mirror_id[,...]>
option is specified,
then the command will only resynchronize the mirror(s) specified by the
mirror_id(s)
. This option cannot be used when
multiple mirrored files are specified.
Option | Description |
---|---|
--only <mirror_id[,...]> | Indicates which mirror(s) specified by
mirror_id(s) needs to be
resynchronized. The mirror_id is the
unique numerical identifier for a mirror. Multiple
mirror_ids are separated by comma.
This option cannot be used when multiple mirrored files are
specified. |
Note: With delayed write
implemented in FLR phase 1, after writing to a mirrored file, users
need to run lfs mirror resync
command to get all
mirrors synchronized.
Examples:
The following commands create a mirrored file with 3 mirrors, then write some data into the file and resynchronizes stale mirrors.
Creating a mirrored file with 3 mirrors:
# lfs mirror create -N -E 4M -p flash -E eof \ -N2 -p archive /mnt/testfs/file1 # lfs getstripe /mnt/testfs/file1 /mnt/testfs/file1 lcm_layout_gen: 4 lcm_mirror_count: 3 lcm_entry_count: 4 lcme_id: 65537 lcme_mirror_id: 1 lcme_flags: init lcme_extent.e_start: 0 lcme_extent.e_end: 4194304 lmm_stripe_count: 1 lmm_stripe_size: 1048576 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: 1 lmm_pool: flash lmm_objects: - 0: { l_ost_idx: 1, l_fid: [0x100010000:0x5:0x0] } lcme_id: 65538 lcme_mirror_id: 1 lcme_flags: 0 lcme_extent.e_start: 4194304 lcme_extent.e_end: EOF lmm_stripe_count: 1 lmm_stripe_size: 1048576 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: -1 lmm_pool: flash lcme_id: 131075 lcme_mirror_id: 2 lcme_flags: init lcme_extent.e_start: 0 lcme_extent.e_end: EOF lmm_stripe_count: 1 lmm_stripe_size: 1048576 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: 3 lmm_pool: archive lmm_objects: - 0: { l_ost_idx: 3, l_fid: [0x100030000:0x4:0x0] } lcme_id: 196612 lcme_mirror_id: 3 lcme_flags: init lcme_extent.e_start: 0 lcme_extent.e_end: EOF lmm_stripe_count: 1 lmm_stripe_size: 1048576 lmm_pattern: raid0 lmm_layout_gen: 0 lmm_stripe_offset: 4 lmm_pool: archive lmm_objects: - 0: { l_ost_idx: 4, l_fid: [0x100040000:0x6:0x0] }
Writing some data into the mirrored file
/mnt/testfs/file1
:
# yes | dd of=/mnt/testfs/file1 bs=1M count=2 2+0 records in 2+0 records out 2097152 bytes (2.1 MB) copied, 0.0320613 s, 65.4 MB/s # lfs getstripe /mnt/testfs/file1 /mnt/testfs/file1 lcm_layout_gen: 5 lcm_mirror_count: 3 lcm_entry_count: 4 lcme_id: 65537 lcme_mirror_id: 1 lcme_flags: init lcme_extent.e_start: 0 lcme_extent.e_end: 4194304 ...... lcme_id: 65538 lcme_mirror_id: 1 lcme_flags: 0 lcme_extent.e_start: 4194304 lcme_extent.e_end: EOF ...... lcme_id: 131075 lcme_mirror_id: 2 lcme_flags: init,stale lcme_extent.e_start: 0 lcme_extent.e_end: EOF ...... lcme_id: 196612 lcme_mirror_id: 3 lcme_flags: init,stale lcme_extent.e_start: 0 lcme_extent.e_end: EOF ......
The above layout information showed that data were written into the
first component of mirror with ID 1
, and mirrors with
ID 2
and 3
were marked with
stale
flag.
Resynchronizing the stale mirror with ID 2
for
the mirrored file /mnt/testfs/file1
:
# lfs mirror resync --only 2 /mnt/testfs/file1 # lfs getstripe /mnt/testfs/file1 /mnt/testfs/file1 lcm_layout_gen: 7 lcm_mirror_count: 3 lcm_entry_count: 4 lcme_id: 65537 lcme_mirror_id: 1 lcme_flags: init lcme_extent.e_start: 0 lcme_extent.e_end: 4194304 ...... lcme_id: 65538 lcme_mirror_id: 1 lcme_flags: 0 lcme_extent.e_start: 4194304 lcme_extent.e_end: EOF ...... lcme_id: 131075 lcme_mirror_id: 2 lcme_flags: init lcme_extent.e_start: 0 lcme_extent.e_end: EOF ...... lcme_id: 196612 lcme_mirror_id: 3 lcme_flags: init,stale lcme_extent.e_start: 0 lcme_extent.e_end: EOF ......
The above layout information showed that after resynchronizing, the
stale
flag was removed from mirror with ID
2
.
Resynchronizing all of the stale mirrors for the mirrored file
/mnt/testfs/file1
:
# lfs mirror resync /mnt/testfs/file1 # lfs getstripe /mnt/testfs/file1 /mnt/testfs/file1 lcm_layout_gen: 9 lcm_mirror_count: 3 lcm_entry_count: 4 lcme_id: 65537 lcme_mirror_id: 1 lcme_flags: init lcme_extent.e_start: 0 lcme_extent.e_end: 4194304 ...... lcme_id: 65538 lcme_mirror_id: 1 lcme_flags: 0 lcme_extent.e_start: 4194304 lcme_extent.e_end: EOF ...... lcme_id: 131075 lcme_mirror_id: 2 lcme_flags: init lcme_extent.e_start: 0 lcme_extent.e_end: EOF ...... lcme_id: 196612 lcme_mirror_id: 3 lcme_flags: init lcme_extent.e_start: 0 lcme_extent.e_end: EOF ......
The above layout information showed that after resynchronizing, none of the mirrors were marked as stale.
Command:
lfs mirror verify [--only <mirror_id,mirror_id2[,...]>] [--verbose|-v] <mirrored_file> [<mirrored_file2> ...]
The above command will verify that each SYNC mirror (i.e. each mirror
containing up-to-date data) of a mirrored file, specified by
mirrored_file
, has exactly the same data. It
supports specifying multiple mirrored files in one command line.
This is a scrub tool that should be run on a regular basis to make
sure that mirrored files are not corrupted. The command won't repair the
file if it turns out to be corrupted. Usually, an administrator should
check the file content from each mirror and decide which one is correct
and then invoke lfs mirror resync
to repair it
manually.
Option | Description |
---|---|
--only <mirror_id,mirror_id2[,...]> | Indicates which mirrors, specified by
mirror_ids, are to be verified. Multiple mirror_ids are separated by comma. Note: At least two mirror_ids must be specified with this option. |
--verbose|-v | Indicates the command will print where the differences are if the data do not match. Otherwise, the command will just return an error in that case. This option can be repeated multiple times to print more information. |
Note:
Mirror components that have stale
or
offline
flags will be skipped and not verified.
Examples:
The following command verifies that each mirror of a mirrored file contains exactly the same data:
# lfs mirror verify /mnt/testfs/file1
The following command has the -v
option specified
to print where the differences are if the data does not match:
# lfs mirror verify -vvv /mnt/testfs/file2 Chunks to be verified in /mnt/testfs/file2: [0, 0x200000) [1, 2, 3, 4] 4 [0x200000, 0x400000) [1, 2, 3, 4] 4 [0x400000, 0x600000) [1, 2, 3, 4] 4 [0x600000, 0x800000) [1, 2, 3, 4] 4 [0x800000, 0xa00000) [1, 2, 3, 4] 4 [0xa00000, 0x1000000) [1, 2, 3, 4] 4 [0x1000000, 0xffffffffffffffff) [1, 2, 3, 4] 4 Verifying chunk [0, 0x200000) on mirror: 1 2 3 4 CRC-32 checksum value for chunk [0, 0x200000): Mirror 1: 0x207b02f1 Mirror 2: 0x207b02f1 Mirror 3: 0x207b02f1 Mirror 4: 0x207b02f1 Verifying chunk [0, 0x200000) on mirror: 1 2 3 4 PASS Verifying chunk [0x200000, 0x400000) on mirror: 1 2 3 4 CRC-32 checksum value for chunk [0x200000, 0x400000): Mirror 1: 0x207b02f1 Mirror 2: 0x207b02f1 Mirror 3: 0x207b02f1 Mirror 4: 0x207b02f1 Verifying chunk [0x200000, 0x400000) on mirror: 1 2 3 4 PASS Verifying chunk [0x400000, 0x600000) on mirror: 1 2 3 4 CRC-32 checksum value for chunk [0x400000, 0x600000): Mirror 1: 0x42571b66 Mirror 2: 0x42571b66 Mirror 3: 0x42571b66 Mirror 4: 0xabdaf92 lfs mirror verify: chunk [0x400000, 0x600000) has different checksum value on mirror 1 and mirror 4. Verifying chunk [0x600000, 0x800000) on mirror: 1 2 3 4 CRC-32 checksum value for chunk [0x600000, 0x800000): Mirror 1: 0x1f8ad0d8 Mirror 2: 0x1f8ad0d8 Mirror 3: 0x1f8ad0d8 Mirror 4: 0x18975bf9 lfs mirror verify: chunk [0x600000, 0x800000) has different checksum value on mirror 1 and mirror 4. Verifying chunk [0x800000, 0xa00000) on mirror: 1 2 3 4 CRC-32 checksum value for chunk [0x800000, 0xa00000): Mirror 1: 0x69c17478 Mirror 2: 0x69c17478 Mirror 3: 0x69c17478 Mirror 4: 0x69c17478 Verifying chunk [0x800000, 0xa00000) on mirror: 1 2 3 4 PASS lfs mirror verify: '/mnt/testfs/file2' chunk [0xa00000, 0x1000000] exceeds file size 0xa00000: skipped
The following command uses the --only
option to
only verify the specified mirrors:
# lfs mirror verify -v --only 1,4 /mnt/testfs/file2 CRC-32 checksum value for chunk [0, 0x200000): Mirror 1: 0x207b02f1 Mirror 4: 0x207b02f1 CRC-32 checksum value for chunk [0x200000, 0x400000): Mirror 1: 0x207b02f1 Mirror 4: 0x207b02f1 CRC-32 checksum value for chunk [0x400000, 0x600000): Mirror 1: 0x42571b66 Mirror 4: 0xabdaf92 lfs mirror verify: chunk [0x400000, 0x600000) has different checksum value on mirror 1 and mirror 4. CRC-32 checksum value for chunk [0x600000, 0x800000): Mirror 1: 0x1f8ad0d8 Mirror 4: 0x18975bf9 lfs mirror verify: chunk [0x600000, 0x800000) has different checksum value on mirror 1 and mirror 4. CRC-32 checksum value for chunk [0x800000, 0xa00000): Mirror 1: 0x69c17478 Mirror 4: 0x69c17478 lfs mirror verify: '/mnt/testfs/file2' chunk [0xa00000, 0x1000000] exceeds file size 0xa00000: skipped
The lfs find
command is used to list files and
directories with specific attributes. The following two attribute
parameters are specific to a mirrored file or directory:
lfs find <directory|filename ...> [[!] --mirror-count|-N [+-]n] [[!] --mirror-state <[^]state>]
Option | Description |
---|---|
--mirror-count|-N [+-]n | Indicates mirror count. |
--mirror-state <[^]state> |
Indicates the mirrored file state. If ^state is used, it matches files NOT in that state. Valid state names are: ro (read-only; all mirrors contain up-to-date data), wp (a write is pending and some mirrors are stale), and sp (a resync is pending). |
Note:
Specifying !
before an option negates its meaning
(files NOT matching the parameter). Using +
before a
numeric value means 'more than n', while -
before a
numeric value means 'less than n'. If neither is used, it means
'equal to n', within the bounds of the unit specified (if any).
Examples:
The following command recursively lists all mirrored files that have
more than 2 mirrors under directory /mnt/testfs
:
# lfs find --mirror-count +2 --type f /mnt/testfs
The following command recursively lists all out-of-sync mirrored
files under directory /mnt/testfs
:
# lfs find --mirror-state=^ro --type f /mnt/testfs
Introduced in Lustre release 2.11.0, the FLR feature is based on the Section 19.5, “Progressive File Layout(PFL)” feature introduced in Lustre 2.10.0.
Clients running Lustre release 2.9 and earlier do not understand the PFL layout, so they cannot access or open mirrored files created in a Lustre 2.11 filesystem.
The following example shows the errors returned by accessing and opening a mirrored file (created in Lustre 2.11 filesystem) on a Lustre 2.9 client:
# ls /mnt/testfs/mirrored_file ls: cannot access /mnt/testfs/mirrored_file: Invalid argument # cat /mnt/testfs/mirrored_file cat: /mnt/testfs/mirrored_file: Operation not supported
Clients running Lustre release 2.10 understand the PFL layout but not the mirrored layout, so they can access mirrored files created in a Lustre 2.11 filesystem but cannot open them. This is because Lustre 2.10 clients do not verify overlapping components, so they would read and write mirrored files just as if they were normal PFL files, which would cause a problem where mirrors that are marked in-sync actually contain different data.
The following example shows the results returned by accessing and opening a mirrored file (created in Lustre 2.11 filesystem) on a Lustre 2.10 client:
# ls /mnt/testfs/mirrored_file /mnt/testfs/mirrored_file # cat /mnt/testfs/mirrored_file cat: /mnt/testfs/mirrored_file: Operation not supported
Sometimes a Lustre file system becomes unbalanced, often due to incorrectly-specified stripe settings, or when very large files are created that are not striped over all of the OSTs. Lustre will automatically avoid allocating new files on OSTs that are full. If an OST is completely full and more data is written to files already located on that OST, an error occurs. The procedures below describe how to handle a full OST.
The MDS will normally handle space balancing automatically at file creation time, and this procedure is normally not needed, but manual data migration may be desirable in some cases (e.g. creating very large files that would consume more than the total free space of the full OSTs).
The example below shows an unbalanced file system:
client# lfs df -h UUID bytes Used Available \ Use% Mounted on testfs-MDT0000_UUID 4.4G 214.5M 3.9G \ 4% /mnt/testfs[MDT:0] testfs-OST0000_UUID 2.0G 751.3M 1.1G \ 37% /mnt/testfs[OST:0] testfs-OST0001_UUID 2.0G 755.3M 1.1G \ 37% /mnt/testfs[OST:1] testfs-OST0002_UUID 2.0G 1.7G 155.1M \ 86% /mnt/testfs[OST:2] **** testfs-OST0003_UUID 2.0G 751.3M 1.1G \ 37% /mnt/testfs[OST:3] testfs-OST0004_UUID 2.0G 747.3M 1.1G \ 37% /mnt/testfs[OST:4] testfs-OST0005_UUID 2.0G 743.3M 1.1G \ 36% /mnt/testfs[OST:5] filesystem summary: 11.8G 5.4G 5.8G \ 45% /mnt/testfs
In this case, OST0002 is almost full and when an attempt is made to write additional information to the file system (even with uniform striping over all the OSTs), the write command fails as follows:
client# lfs setstripe -S 4M -i 0 -c -1 /mnt/testfs
client# dd if=/dev/zero of=/mnt/testfs/test_3 bs=10M count=100
dd: writing '/mnt/testfs/test_3': No space left on device
98+0 records in
97+0 records out
1017192448 bytes (1.0 GB) copied, 23.2411 seconds, 43.8 MB/s
To avoid running out of space in the file system, if the OST usage is imbalanced and one or more OSTs are close to being full while there are others that have a lot of space, the MDS will typically avoid file creation on the full OST(s) automatically. The full OSTs may optionally be deactivated manually on the MDS to ensure the MDS will not allocate new objects there.
Log into the MDS server and use the lctl
command to stop new object creation on the full OST(s):
mds# lctl set_param osp.fsname-OSTnnnn*.max_create_count=0
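For example, to stop new object allocation on the full OST shown in the unbalanced filesystem above (OST0002 of the testfs filesystem):
mds# lctl set_param osp.testfs-OST0002*.max_create_count=0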
When new files are created in the file system, they will only use the remaining OSTs. Either manual space rebalancing can be done by migrating data to other OSTs, as shown in the next section, or normal file deletion and creation can passively rebalance the space usage.
If there is a need to move the file data from the current
OST(s) to new OST(s), the data must be migrated (copied)
to the new location. The simplest way to do this is to use the
lfs_migrate
command, as described in
Section 14.8, “Adding a New OST to a Lustre File System”.
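As a hedged sketch (the mount point and OST name are taken from the example above), files with objects on the full OST can be located with lfs find and passed to lfs_migrate to copy them to other OSTs:
client# lfs find /mnt/testfs -type f --ost testfs-OST0002_UUID | lfs_migrate -y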
Once the full OST(s) no longer are severely imbalanced, due to either active or passive data redistribution, they should be reactivated so they will again have new files allocated on them.
mds# lctl set_param osp.testfs-OST0002.max_create_count=20000