http://www.linux-mag.com/index.php?option=com_content&task=view&id=1167&Itemid=2045
| Journaling File Systems | |
| Feature Story | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Written by Steve Best | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Tuesday, 15 October 2002 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| The file system is one of the most important parts of an operating system. The file system stores and manages user data on disk drives, and ensures that what's read from storage is identical to what was originally written. In addition to storing user data in files, the file system also creates and manages information about files and about itself. Besides guaranteeing the integrity of all that data, file systems are also expected to be extremely reliable and have very good performance. For the past several years, Ext2 has been the de facto file system for most Linux machines. It's robust, reliable, and suitable for most deployments. However, as Linux displaces Unix and other operating systems in more and more large server and computing environments, Ext2 is being pushed to its limits. In fact, many now common requirements -- large hard-disk partitions, quick recovery from crashes, high-performance I/O, and the need to store thousands and thousands of files representing terabytes of data -- exceed the abilities of Ext2. Fortunately, a number of other Linux file systems take up where Ext2 leaves off. Indeed, Linux now offers four alternatives to Ext2: Ext3, ReiserFS, XFS, and JFS. In addition to meeting some or all of the requirements listed above, each of these alternative file systems also supports journaling, a feature certainly demanded by enterprises, but beneficial to anyone running Linux. A journaling file system can simplify restarts, reduce fragmentation, and accelerate I/O. Better yet, journaling file systems make fscks a thing of the past. If you maintain a system of fair complexity or require high-availability, you should seriously consider a journaling file system. Let's find out how journaling file systems work, look at the four journaling file systems available for Linux, and walk through the steps of installing one of the newer systems, JFS. Switching to a journaling file system is easier than you might think, and once you switch -- well, you'll be glad you did. Fun with File Systems To better appreciate the benefits of journaling file systems, let's start by looking at how files are saved in a non-journaled file system like Ext2. To do that, it's helpful to speak the vernacular of file systems.
Figure Two illustrates blocks, inodes (with a number of meta-data attributes), directories, and their relationships.
When Good File Systems Go Bad With those concepts in mind, here's what happens when a three-block file is modified and grows to be a five-block file:
As you can see, while writing data to a file appears to be a single atomic operation, the actual process involves a number of steps (even more steps than shown here if you consider all of the accounting required to remove free blocks from a list of free blocka, among other possible metadata changes). If all the steps to write a file are completed perfectly (and this happens most of the time), the file is saved successfully. However, if the process is interrupted at any time (perhaps due to power failure or other systemic failure), a non-journaled file system can end up in an inconsistent state. Corruption occurs because the logical operation of writing (or updating) a file is actually a sequence of I/O, and the entire operation may not be totally reflected on the media at any given point in time. If the meta-data or the file data is left in an inconsistent state, the file system will no longer function properly. Non-journaled file systems rely on fsck to examine all of the file system's metadata and detect and repair structural integrity problems before restarting. If Linux shuts down smoothly, fsck will typically return a clean bill of health. However, after a power failure or crash, fsck is likely to find some kind of error in meta-data. A file system has a lot of meta-data, and fsck can be very time consuming. After all, fsck has to scan a file system's entire repository of meta-data to ensure consistency and error-free operation. As you may have experienced, the speed of fsck on a disk partition is proportional to the size of the partition, the number of directories, and the number of files in each directory. For large file systems, journaling becomes crucial. A journaling file system provides improved structural consistency, better recovery, and faster restart times than non-journaled file systems. In most cases, a journaled file system can restart in less than a second. Dear Journal... The magic of journaling file systems lies in transactions. Just like a database transaction, a journaling file system transaction treats a sequence of changes as a single, atomic operation -- but instead of tracking updates to tables, the journaling file system tracks changes to file system meta-data and/or user data. The transaction guarantees that either all or none of the file system updates are done. For example, the process of creating a new file modifies several meta-data structures (inodes, free lists, directory entries, etc.). Before the file system makes those changes, it creates a transaction that describes what it's about to do. Once the transaction has been recorded (on disk), the file system goes ahead and modifies the meta-data. The journal in a journaling file system is simply a list of transactions. In the event of a system failure, the file system is restored to a consistent state by replaying the journal. Rather than examine all meta-data (the fsck way), the file system inspects only those portions of the meta-data that have recently changed. Recovery is much faster, usually only a matter of seconds. Better yet, recovery time is not dependent on the size of the partition. In addition to faster restart times, most journaling file systems also address another significant problem: scalability. If you combine even a few large-capacity disks, you can assemble some massive (certainly by early-90s' standards) file systems. Features of modern file systems include:
More advanced file systems also manage sparse files, internal fragmentation, and the allocation of inodes better than Ext2. A Wealth of Options While advanced file systems are tailored primarily for the high throughput and high uptime requirements of servers (from single processor systems to clusters), these file systems can also benefit client machines where performance and reliability are wanted or needed. As mentioned in the introduction, recent releases of Linux include not one, but four journaling file systems. JFS from IBM, XFS from SGI, and ReiserFS from Namesys have all been "open sourced" and subsequently included in the Linux kernel. In addition, Ext3 was developed as a journaling add-on to Ext2. Figure Three shows where the file systems fit in Linux. You'll note that JFS, XFS, ReiserFS, and Ext3 are independent "peers." It's possible for a single Linux machine to use all of those file systems at the same time. A system administrator could configure a system to use XFS on one partition, and ReiserFS on another.
What are the features and benefits of each system? Let's take a quick look at Ext3, ReiserFS, and XFS, and then an in-depth look at JFS. EXT3 As mentioned above, Ext2 is the de facto file system for Linux. While it lacks some of the advanced features (extremely large files, extent-mapped files, etc.) of XFS and ReiserFS and others, it's reliable, stable, and still the default "out of the box" file system for all Linux distributions. Ext2's real weakness is fsck: the bigger the Ext2 file system, the longer it takes to fsck. Longer fsck times means longer down times. The Ext3 file system was designed to provide higher availability without impacting the robustness (at least the simplicity and reliability) of Ext2. Ext3 is a minimal extension to Ext2 to add support for journaling. Ext3 uses the same disk layout and data structures as Ext2, and it's forward- and backward-compatible with Ext2. Migration from Ext2 to Ext3 (and vice versa) is quite easy, and can even be done in-place in the same partition. The other three journaling file systems required the partition to be formatted with their mkfs utility. If you want to adopt a journaling file system, but don't have free partitions on your system, Ext3 could be the journaling file system to use. See "Switching to Ext3" for information on how to switch to Ext3 on your Linux machine.
The downside of Ext3? It's an add-on to Ext2, so it still has the same limitations that Ext2 has. The fixed internal structures of Ext2 are simply too small (too few bits) to capture large file sizes, extremely large partition sizes, and enormous numbers of files in a single directory. Moreover, the bookkeeping techniques of Ext2, such as its linked-list directory implementation, do not scale well to large file systems (there is an upper limit of 32,768 subdirectories in a single directory, and a "soft" upper limit of 10,000-15,000 files in a single directory.) To make radical improvements to Ext2, you'd have to make radical changes. Radical change was not the intent of Ext3. However, newer file systems do not have to be backward-compatible with Ext2. ReiserFS, XFS, and JFS offer scalability, high-performance, very large file systems, and of course, journaling. "Why Four Journaling File Systems is a Good Thing" presents an overview of the capabilities of the four journaling file systems.
REISERFS ReiserFS is designed and developed by Hans Reiser and his team of developers at Namesys. Like the other journaling file systems, it's open source, is available in most Linux distributions, and supports meta-data journaling. One of the unique advantages of ReiserFS is support for small files -- lots and lots of small files. Reiser's philosophy is simple: small files encourage coding simplicity. Rather than use a database or create your own file caching scheme, use the filesystem to handle lots of small pieces of information. ReiserFS is about eight to fifteen times faster than Ext2 at handling files smaller than 1K. Even more impressive, (when properly configured) ReiserFS can actually store about 6% more data that Ext2 on the same physical file system. Rather than allocate space in fixed 4K blocks, ReiserFS can allocate the exact space that's needed. A B* tree manages all file system meta-data, and stores and compresses tails, portions of files smaller than a block. Of course, ReiserFS also has excellent performance for large files, but it's especially adept at managing small files. For a more in-depth discussion of ReiserFS and instructions on how to install it, see "Journaling File Systems" in the August 2000 issue, available online at http://www.linux-mag.com/2000-08/journaling_01.html. JFS JFS for Linux is based on IBM's successful JFS file system for OS/2 Warp. Donated to open source in early 2000 and ported to Linux soon after, JFS is well-suited to enterprise environments. JFS uses many advanced techniques to boost performance, provide for very large file systems, and of course, journal changes to the file system. SGI's XFS (described next) has many similar features. Some of the features of JFS include:
There are other advanced features in JFS such as allocation groups (which speeds file access times by maximizing locality), and various block sizes ranging from 512-bytes to 4096-bytes (which can be tuned to avoid internal and external fragmentation). You can read about all of them at the JFS Web site at http://www-124.ibm.com/developerworks/oss/jfs. XFS A little more than a year ago, SGI released a version of its high-end XFS file system for Linux. Based on SGI's Irix XFS file system technology, XFS supports meta-data journaling, and extremely large disk farms. How large? A single XFS file system can be 18,000 petabytes (that's 1015 bytes) and a single file can be 9,000 petabytes. XFS is also capable of delivering excellent I/O performance. In addition to truly amazing scale and speed, XFS uses many of the same techniques found in JFS. Installing JFS For the rest of the article, let's look at how to install and use IBM's JFS system. If you have the latest release of Turbolinux, Mandrake, SuSE, Red Hat, or Slackware, you can probably skip ahead to the section "Creating a JFS Partition." If you want to include the latest JFS source code drop into your kernel, the next few sections show you what to do. THE LATEST AND GREATEST JFS has been incorporated into the 2.5.6 Linux kernel, and is also included in Alan Cox's 2.4.X-ac kernels beginning with 2.4.18-pre9-ac4, which was released on February 14, 2002. Alan's patches for 2.4.x series are available from http://www.kernel.org. You can also download a 2.4 kernel source tree and add the JFS patches to this tree. JFS comes as a patch for several of the 2.4.x kernel, so first of all, get the latest kernel from http://www.kernel.org. At the time of writing, the latest kernel was 2.4.18 and the latest release of JFS was 1.0.20. We'll be using those in the instructions below. The JFS patch is available from the JFS web site. You also need both the utilities (jfsutils-1.0.20.tar.gz), the kernel patch (jfs-2.4.18-patch), and the file system source (jfs-2.4-1.0.20.tar.gz). If you're using any of the latest distros, you probably won't have to patch the kernel for the JFS code. Instead, you'll only need to compile the kernel to update to the latest release of JFS (you can build JFS either as built-in or as a module). (To determine what version of JFS was shipped in the distribution you're running, you can edit the JFS file super.c and look for a printk() that has the JFS development version number string.) PATCHING THE KERNEL TO SUPPORT JFS In the example below, we'll use the 2.4.18 kernel source tree as an example on how to patch JFS into the kernel source tree. First, you need to download the Linux kernel: linux-2.4.18 .tar.gz. If you have a linux subdirectory, move it to linux-org, so it won't replaced by the linux-2.4.18 source tree. When you download the kernel archive, save it under /usr/src and expand the kernel source tree by using: % mv linux linux-org This operation will create a directory named /usr/src/linux. The next step is to get the JFS utilities and the appropriate patch for kernel 2.4.18. Before you do that, you need to create a directory for JFS source, /usr/src/jfs1020, and download (to that directory) the JFS kernel patch and the JFS file system source files. Once you have those files, you have everything you need to patch the kernel. Next, change to the directory of the kernel 2.4.18 source tree and apply the JFS kernel patch: % cd /usr/src/linux Now, you need to configure the kernel and enable JFS by going to the File systems section of the configuration menu and enabling JFS file system support (CONFIG_JFS_FS=y). You also have the option to configure JFS as a module, in which case you only need to recompile and reinstall kernel modules by typing: % make modules && make install_modules Otherwise, if you configured the JFS option as a kernel built-in, you need to: 1. Recompile the kernel (in /usr/src/linux). Run the command % make dep && make clean && make bzImage 2. Recompile and install modules (only if you added other options as modules) % make modules && make modules_install 3. Install the kernel. # cp arch/i386/boot/bzImage /boot/jfs-bzImage Next, update /etc/lilo.conf with the new kernel. Add an entry like the one that follows and a jfs1020 entry should appear at the lilo boot prompt: image=/boot/jfs-bzImage Be sure to specify the correct root partition. Then run # lilo to make the system aware of the new kernel. Reboot and select the jfs1020 kernel to boot from the new image. After you compile and install the kernel, you should compile and install the JFS utilities. Save the jfsutils-1.0.20.tar.gz file into the /usr/src/jfs1020 directory, expand it, run configure, and the install the utilities. % tar zxvf jfsutils-1.0.20.tar.gz Creating a JFS partition Having built and installed the JFS utilities, the next step is to create a JFS partition. In this exact example, we'll demonstrate the process using a spare partition. (If there's unpartitioned space on your disk, you can create a partition using fdisk. After you create the partition, reboot the system to make sure that the new partition is available to create a JFS file system on it. In our test system, we had /dev/hdb3 as a spare partition.) To create the JFS file system with the log inside the JFS partition, apply the following command: # mkfs.jfs /dev/hdb3 After the file system has been created, you need to mount it. You will need a mount point. Create a new empty directory such as /jfs to mount the file system with the following command: # mount -t jfs /dev/hdb3 /jfs After the file system is mounted, you are ready to try out JFS. To unmount the JFS file system, you simply use the umount command with the same mount point as the argument: # umount /jfs
Go Faster with An External Log An external log improves performance since the log updates are saved to a different partition than its corresponding file system. To create the JFS file system with the log on an external device, your system will need to have 2 unused partitions. Our test system had /dev/hda6 and /dev/hdb1 as spare partitions. # mkfs.jfs -j /dev/hdb1 /dev/hda6 To mount the file system use the following mount command: # mount -t jfs /dev/hda6 /jfs So you don't have to mount this file system every time you boot, you can add it to /etc/fstab. Make a backup of /etc/fstab and edit it with you favorite editor. Add the /dev/hda6 device. For example, add: /dev/hda6 /jfs jfs defaults 1 2 Not Just for Reboots Anymore Some people have the impression that journaling file systems only provide fast restart times. As you've seen, this isn't true. Considerable coding efforts have made journaling file systems scalable, reliable, and fast. Whether you're running an enterprise server, a cluster supercomputer, or a small Web site, XFS, JFS, and ReiserFS add credibility and oomph to Linux. Need a better reason to switch to a journaling file system? Just imagine yourself in a world without fsck. What will you do with all that extra time?
Steve Best works in the Linux Technology Center of IBM in Austin, Texas. He is currently working on the Journaled File System (JFS) for Linux project. Steve has done extensive work in operating system development with a focus in the areas of file systems, internationalization, and security. He can be reached at sbest@us.ibm. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
No comments:
Post a Comment