Microsoft Officially Presents the Technical Specifications of the Resilient File System (ReFS) for Windows Server 8

Entry posted by Jordan_Tsafaridis · January 18, 2012

Dear colleagues of the community, a few hours ago Microsoft revealed the technical specifications of its new file system, named ReFS (Resilient File System), which will be an integral feature of Windows Server 8.

For more details, I quote below the original English text as published by Microsoft's engineering team, specifically by Surendra Verma. As I believe you will agree, the technical characteristics of the new file system are simply stunning.

We wanted to continue our dialog about data storage by talking about the next-generation file system being introduced in Windows 8. Today, NTFS is the most widely used, advanced, and feature-rich file system in broad use. But when you’re reimagining Windows, as we are for Windows 8, we don’t rest on past successes, and so with Windows 8 we are also introducing a newly engineered file system. ReFS (which stands for Resilient File System) is built on the foundations of NTFS, so it maintains crucial compatibility while at the same time it has been architected and engineered for a new generation of storage technologies and scenarios. In Windows 8, ReFS will be introduced only as part of Windows Server 8, which is the same approach we have used for each and every file system introduction. Of course, at the application level, data stored on ReFS will be accessible from clients just as NTFS data would be. As you read this, let’s not forget that NTFS is by far the industry’s leading technology for file systems on PCs.

This file system, which we call ReFS, has been designed from the ground up to meet a broad set of customer requirements, both today’s and tomorrow’s, for all the different ways that Windows is deployed.

The key goals of ReFS are:

Maintain a high degree of compatibility with a subset of NTFS features that are widely adopted while deprecating others that provide limited value at the cost of system complexity and footprint.
Verify and auto-correct data. Data can get corrupted due to a number of reasons and therefore must be verified and, when possible, corrected automatically. Metadata must not be written in place to avoid the possibility of “torn writes,” which we will talk about in more detail below.
Optimize for extreme scale. Use scalable structures for everything. Don’t assume that disk-checking algorithms, in particular, can scale to the size of the entire file system.
Never take the file system offline. Assume that in the event of corruptions, it is advantageous to isolate the fault while allowing access to the rest of the volume. This is done while salvaging the maximum amount of data possible, all done live.
Provide a full end-to-end resiliency architecture when used in conjunction with the Storage Spaces feature, which was co-designed and built in conjunction with ReFS.

The key features of ReFS are as follows (note that some of these features are provided in conjunction with Storage Spaces).

Metadata integrity with checksums
Integrity streams providing optional user data integrity
Allocate on write transactional model for robust disk updates (also known as copy on write)
Large volume, file and directory sizes
Storage pooling and virtualization makes file system creation and management easy
Data striping for performance (bandwidth can be managed) and redundancy for fault tolerance
Disk scrubbing for protection against latent disk errors
Resiliency to corruptions with "salvage" for maximum volume availability in all cases
Shared storage pools across machines for additional failure tolerance and load balancing

In addition, ReFS inherits the features and semantics from NTFS including BitLocker encryption, access-control lists for security, USN journal, change notifications, symbolic links, junction points, mount points, reparse points, volume snapshots, file IDs, and oplocks.

And of course, data stored on ReFS is accessible through the same file access APIs on clients that are used on any operating system that can access today’s NTFS volumes.

Key design attributes and features

Our design attributes are closely related to our goals. As we go through these attributes, keep in mind the history of producing file systems used by hundreds of millions of devices scaling from the smallest footprint machines to the largest data centers, from the smallest storage format to the largest multi-spindle format, from solid state storage to the largest drives and storage systems available. Yet at the same time, Windows file systems are accessed by the widest array of application and system software anywhere. ReFS takes that learning and builds on it. We didn’t start from scratch, but reimagined it where it made sense and built on the right parts of NTFS where that made sense. Above all, we are delivering this in a pragmatic manner consistent with the delivery of a major file system, something only Microsoft has done at this scale.

Code reuse and compatibility

When we look at the file system API, this is the area where compatibility is the most critical and, technically, the most challenging. Rewriting the code that implements file system semantics would not lead to the right level of compatibility, and the issues introduced would be highly dependent on application code, call timing, and hardware. Therefore in building ReFS, we reused the code responsible for implementing the Windows file system semantics. This code implements the file system interface (read, write, open, close, change notification, etc.), maintains in-memory file and volume state, enforces security, and maintains memory caching and synchronization for file data. This reuse ensures a high degree of compatibility with the features of NTFS that we’re carrying forward.

Underneath this reused portion, the NTFS version of the code-base uses an engine that implements on-disk structures such as the Master File Table (MFT) to represent files and directories. ReFS combines this reused code with a brand-new engine, where a significant portion of the innovation behind ReFS lies. Graphically, it looks like this:

NTFS.SYS = NTFS upper layer API/semantics engine / NTFS on-disk store engine; ReFS.SYS = Upper layer engine inherited from NTFS / New on-disk store engine

Reliable and scalable on-disk structures

On-disk structures and their manipulation are handled by the on-disk storage engine. This exposes a generic key-value interface, which the layer above leverages to implement files, directories, etc. For its own implementation, the storage engine uses B+ trees exclusively. In fact, we utilize B+ trees as the single common on-disk structure to represent all information on the disk. Trees can be embedded within other trees (a child tree’s root is stored within the row of a parent tree). On the disk, trees can be very large and multi-level or really compact with just a few keys and embedded in another structure. This ensures extreme scalability up and down for all aspects of the file system. Having a single structure significantly simplifies the system and reduces code. The new engine interface includes the notion of “tables” that are enumerable sets of key-value pairs. Most tables have a unique ID (called the object ID) by which they can be referenced. A special object table indexes all such tables in the system.
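
To make the table abstraction more concrete, here is a minimal Python sketch. It is not the ReFS on-disk format: the `Table` class stands in for a B+ tree by keeping its keys sorted in memory, and every name in it (`Table`, `object_table`, `lookup_extent`, the object ID `0x600`, the file name, and the extent values) is a hypothetical chosen for illustration. It only shows how tables can be embedded within rows of other tables and reached through an object table.

```python
from bisect import bisect_right, insort


class Table:
    """An enumerable, key-ordered set of key-value pairs (stand-in for a B+ tree)."""

    def __init__(self):
        self._keys = []      # kept sorted, as the leaf level of a B+ tree would be
        self._values = {}

    def put(self, key, value):
        if key not in self._values:
            insort(self._keys, key)
        self._values[key] = value

    def get(self, key):
        return self._values[key]

    def keys(self):
        # Enumerate keys in sorted order.
        return list(self._keys)


# Hypothetical layout: the object table maps object IDs to directory tables.
object_table = Table()
root_dir = Table()
object_table.put(0x600, root_dir)     # 0x600 is an arbitrary illustrative ID

# A file is a table embedded within a row of its parent directory.
file_metadata = Table()
file_metadata.put("size", 8192)
stream = Table()                      # embedded stream table: file offset -> extent
stream.put(0, {"cluster": 1000, "length": 4096})
stream.put(4096, {"cluster": 5000, "length": 4096})
file_metadata.put("stream", stream)
root_dir.put("report.docx", file_metadata)


def lookup_extent(object_id, name, offset):
    """Walk object table -> directory -> file metadata -> stream table."""
    directory = object_table.get(object_id)
    stream_table = directory.get(name).get("stream")
    keys = stream_table.keys()
    # Pick the last extent that starts at or before the requested offset.
    return stream_table.get(keys[bisect_right(keys, offset) - 1])


print(lookup_extent(0x600, "report.docx", 5000))
```

Because every level is the same structure, the lookup code is the same kind of step repeated, which is the simplification the single-structure design is aiming for.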

Now, let’s look at how the common file system abstractions are constructed using tables.

File structures

As shown in the diagram above, directories are represented as tables. Because we implement tables using B+ trees, directories can scale efficiently, becoming very large. Files are implemented as tables embedded within a row of the parent directory, itself a table (represented as File Metadata in the diagram above). The rows within the File Metadata table represent the various file attributes. The file data extent locations are represented by an embedded stream table, which is a table of offset mappings (and, optionally, checksums). This means that the files and directories can be very large without a performance impact, eclipsing the limitations found in NTFS.

As expected, other global structures within the file system such as ACLs (Access Control Lists) are represented as tables rooted within the object table.

All disk space allocation is managed by a hierarchical allocator, which represents free space by tables of free space ranges. For scalability, there are three such tables: the large, medium, and small allocators. These differ in the granularity of space they manage: for example, a medium allocator manages medium-sized chunks allocated from the large allocator. This makes disk allocation algorithms scale very well, and allows us the benefit of naturally collocating related metadata for better performance. The roots of these allocators as well as that of the object table are reachable from a well-known location on the disk. Some tables have allocators that are private to them, reducing contention and encouraging better allocation locality.
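
The three-tier idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the actual ReFS allocator: the class name `RangeAllocator`, the chunk sizes (1024, 64, and 1 clusters), and the refill policy are all hypothetical. Each tier tracks free space as (start, length) ranges and, when empty, carves a new chunk out of the tier above it.

```python
class RangeAllocator:
    """Tracks free space as (start, length) ranges; refills from a parent tier."""

    def __init__(self, chunk_size, parent=None, free_ranges=None):
        self.chunk_size = chunk_size          # granularity this tier hands out
        self.parent = parent
        self.free = list(free_ranges or [])

    def allocate(self):
        """Return the start of one chunk of `chunk_size` clusters."""
        if not self.free:
            if self.parent is None:
                raise RuntimeError("volume is full")
            # Refill: take one parent-sized chunk and manage it at our granularity.
            start = self.parent.allocate()
            self.free.append((start, self.parent.chunk_size))
        start, length = self.free.pop(0)
        if length > self.chunk_size:
            # Return the unused tail of the range to our free list.
            self.free.insert(0, (start + self.chunk_size, length - self.chunk_size))
        return start


# Hypothetical sizes, in clusters.
large = RangeAllocator(chunk_size=1024, free_ranges=[(0, 4096)])
medium = RangeAllocator(chunk_size=64, parent=large)
small = RangeAllocator(chunk_size=1, parent=medium)

a = small.allocate()
b = small.allocate()
c = medium.allocate()
print(a, b, c)  # 0 1 64
```

The two small allocations land next to each other because they come from the same medium chunk, which is one way to read the point about naturally collocating related metadata.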

Apart from global system metadata tables, the entries in the object table refer to directories, since files are embedded within directories.

Robust disk update strategy

Updating the disk reliably and efficiently is one of the most important and challenging aspects of a file system design. We spent a lot of time evaluating various approaches. One of the approaches we considered and rejected was to implement a log structured file system. This approach is unsuitable for the type of general-purpose file system required by Windows. NTFS relies on a journal of transactions to ensure consistency on the disk. That approach updates metadata in-place on the disk and uses a journal on the side to keep track of changes that can be rolled back on errors and during recovery from a power loss. One of the benefits of this approach is that it maintains the metadata layout in place, which can be advantageous for read performance. The main disadvantages of a journaling system are that writes can get randomized and, more importantly, the act of updating the disk can corrupt previously written metadata if power is lost at the time of the write, a problem commonly known as a torn write.

To maximize reliability and eliminate torn writes, we chose an allocate-on-write approach that never updates metadata in-place, but rather writes it to a different location in an atomic fashion. In some ways this borrows from a very old notion of “shadow paging” that is used to reliably update structures on the disk. Transactions are built on top of this allocate-on-write approach. Since the upper layer of ReFS is derived from NTFS, the new transaction model seamlessly leverages failure recovery logic already present, which has been tested and stabilized over many releases.
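
As a rough illustration of allocate-on-write, the following Python sketch models a disk as a dictionary of blocks plus a single root pointer. It is a toy under stated assumptions: the `Disk` class, `update_metadata`, and the `crash_before_commit` flag are hypothetical names, and a real implementation must also make the root pointer switch itself atomic on the device. The point it shows is that the old copy is never overwritten, so a crash before the pointer switch leaves the previous version intact.

```python
class Disk:
    """A toy disk: block number -> bytes, plus one root pointer."""

    def __init__(self):
        self.blocks = {}
        self.root = None           # the only location that is ever switched
        self._next_free = 0

    def write_new(self, data):
        """Write data to a freshly allocated block; never overwrite in place."""
        block = self._next_free
        self._next_free += 1
        self.blocks[block] = data
        return block


def update_metadata(disk, new_data, crash_before_commit=False):
    new_block = disk.write_new(new_data)   # step 1: write the new copy elsewhere
    if crash_before_commit:
        return                             # simulated power loss: root unchanged
    disk.root = new_block                  # step 2: switch the root to the new copy


disk = Disk()
update_metadata(disk, b"version 1")
update_metadata(disk, b"version 2", crash_before_commit=True)
after_crash = disk.blocks[disk.root]       # still the last committed version
update_metadata(disk, b"version 3")
print(after_crash, disk.blocks[disk.root])
```

Contrast this with an in-place update, where a power loss halfway through the write could leave a block that is neither the old version nor the new one.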

ReFS allocates metadata in a way that allows writes to be combined for related parts (for example, stream allocation, file attributes, file names, and directory pages) in fewer, larger I/Os, which is great for both spinning media and flash. At the same time a measure of read contiguity is maintained. The hierarchical allocation scheme is leveraged heavily here.

We perform significant testing where power is withdrawn from the system while the system is under extreme stress, and once the system is back up, all structures are examined for correctness. This testing is the ultimate measure of our success. We have achieved an unprecedented level of robustness in this test for Microsoft file systems. We believe this is industry-leading and fulfills our key design goals.

Resiliency to disk corruptions

As mentioned previously, one of our design goals was to detect and correct corruption. This not only ensures data integrity, but also improves system availability and online operation. Thus, all ReFS metadata is check-summed at the level of a B+ tree page, and the checksum is stored independently from the page itself. This allows us to detect all forms of disk corruption, including lost and misdirected writes and bit rot (degradation of data on the media). In addition, we have added an option where the contents of a file are check-summed as well. When this option, known as “integrity streams,” is enabled, ReFS always writes the file changes to a location different from the original one. This allocate-on-write technique ensures that pre-existing data is not lost due to the new write. The checksum update is done atomically with the data write, so that if power is lost during the write, we always have a consistently verifiable version of the file available whereby corruptions can be detected authoritatively.
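
The value of storing the checksum apart from the page can be shown with a small Python sketch. Several things here are assumptions: the text does not name the checksum algorithm, so a 64-bit BLAKE2b digest is used as a stand-in, and `PageRef`, `is_corrupt`, and the block numbers are hypothetical. The sketch keeps the expected checksum with the parent's pointer to the page, then simulates a write that was meant for one block but landed on another.

```python
import hashlib


def checksum64(data: bytes) -> bytes:
    # Stand-in 64-bit checksum; the source does not name the real algorithm.
    return hashlib.blake2b(data, digest_size=8).digest()


class PageRef:
    """A parent's pointer to a child page: block number plus expected checksum."""

    def __init__(self, block: int, data: bytes):
        self.block = block
        self.checksum = checksum64(data)


def is_corrupt(disk: dict, ref: PageRef) -> bool:
    """True if the block's contents do not match the checksum held by the parent."""
    return checksum64(disk[ref.block]) != ref.checksum


disk = {7: b"page A", 8: b"page B"}
ref_a = PageRef(7, b"page A")

# The file system updates page B and records the new checksum in the parent...
ref_b = PageRef(8, b"page B v2")
# ...but the device misdirects the write to block 7, so block 8 keeps stale data.
disk[7] = b"page B v2"

print(is_corrupt(disk, ref_a))  # True: block 7 was overwritten by the stray write
print(is_corrupt(disk, ref_b))  # True: block 8 silently missed its update
```

If each page carried its own checksum internally, both blocks would still look self-consistent, since the stale page and the misplaced page are each valid on their own. Keeping the checksum with the reference is what exposes the mismatch.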

We blogged about Storage Spaces a couple of weeks ago. We designed ReFS and Storage Spaces to complement each other, as two components of a complete storage system. We are making Storage Spaces available for NTFS (and client PCs) because there is great utility in that; the architectural layering supports this client-side approach while we adapt ReFS for usage on clients so that ultimately you’ll be able to use ReFS across both clients and servers.

In addition to improved performance, Storage Spaces protects data from partial and complete disk failures by maintaining copies on multiple disks. On read failures, Storage Spaces is able to read alternate copies, and on write failures (as well as complete media loss on read/write) it is able to reallocate data transparently. Many failures don’t involve media failure, but happen due to data corruptions, or lost and misdirected writes.

These are exactly the failures that ReFS can detect using checksums. Once ReFS detects such a failure, it interfaces with Storage Spaces to read all available copies of data and chooses the correct one based on checksum validation. It then tells Storage Spaces to fix the bad copies based on the good copies. All of this happens transparently from the point of view of the application. If ReFS is not running on top of a mirrored Storage Space, then it has no means to automatically repair the corruption. In that case it will simply log an event indicating that corruption was detected and fail the read if it is for file data. I’ll talk more about the impact of this on metadata later.
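
The detect-then-repair flow described above can be sketched as follows. This is not the ReFS or Storage Spaces interface: `read_with_repair` and `checksum64` are hypothetical helpers, the mirror is modeled as a plain Python list with one entry per copy, and a 64-bit BLAKE2b digest again stands in for the unspecified checksum.

```python
import hashlib


def checksum64(data: bytes) -> bytes:
    # Stand-in 64-bit checksum; the real algorithm is not given in the text.
    return hashlib.blake2b(data, digest_size=8).digest()


def read_with_repair(copies: list, expected: bytes):
    """Return (good_data, repaired_indexes); raise if no copy validates.

    `copies` holds one bytes object per mirror and is repaired in place.
    """
    good = next((c for c in copies if checksum64(c) == expected), None)
    if good is None:
        # No valid copy to repair from: report the corruption and fail the read.
        raise OSError("corruption detected; no valid copy available")
    repaired = []
    for i, copy in enumerate(copies):
        if checksum64(copy) != expected:
            copies[i] = good           # fix the bad copy from the good one
            repaired.append(i)
    return good, repaired


expected = checksum64(b"payroll records")
mirrors = [b"payroll recXrds", b"payroll records"]   # first copy has a bad byte
data, repaired = read_with_repair(mirrors, expected)
print(data, repaired)
```

With a single unmirrored copy the same function can only raise, which mirrors the behavior described for a volume that is not hosted on a mirrored space.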

Checksums (64-bit) are always turned on for ReFS metadata, and assuming that the volume is hosted on a mirrored Storage Space, automatic correction is also always turned on. All integrity streams (see below) are protected in the same way. This creates an end-to-end high integrity solution for the customer, where relatively unreliable storage can be made highly reliable.

Integrity streams

Integrity streams protect file content against all forms of data corruption. Although this feature is valuable for many scenarios, it is not appropriate for some. For example, some applications prefer to manage their file storage carefully and rely on a particular file layout on the disk. Since integrity streams reallocate blocks every time file content is changed, the file layout is too unpredictable for these applications. Database systems are excellent examples of this. Such applications also typically maintain their own checksums of file content and are able to verify and correct data by direct interaction with Storage Spaces APIs.

For those cases where a particular file layout is required, we provide mechanisms and APIs to control this setting at various levels of granularity.

At the most basic level, integrity is an attribute of a file (FILE_ATTRIBUTE_INTEGRITY_STREAM). It is also an attribute of a directory. When present in a directory, it is inherited by all files and directories created inside the directory. For convenience, you can use the “format” command to specify this for the root directory of a volume at format time. Setting it on the root ensures that it propagates by default to every file and directory on the volume. For example:

D:\>format /fs:refs /q /i:enable <volume>

D:\>format /fs:refs /q /i:disable <volume>

By default, when the /i switch is not specified, the behavior that the system chooses depends on whether the volume resides on a mirrored space. On a mirrored space, integrity is enabled because we expect the benefits to significantly outweigh the costs. Applications can always override this programmatically for individual files.
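
The inheritance rule can be modeled in a short Python sketch. It does not call any Windows API: the `Node` class and the file names are hypothetical, and only the attribute value 0x8000 is taken from the documented Windows constant for FILE_ATTRIBUTE_INTEGRITY_STREAM. The sketch shows a new file or directory picking up the integrity bit from its parent at creation time, and one file opting out afterwards.

```python
FILE_ATTRIBUTE_INTEGRITY_STREAM = 0x8000   # documented Windows attribute value


class Node:
    """A file or directory in a toy namespace (hypothetical model)."""

    def __init__(self, name, parent=None, attributes=None):
        self.name = name
        if attributes is None:
            # At creation time, inherit the integrity bit from the parent directory.
            inherited = parent.attributes if parent is not None else 0
            attributes = inherited & FILE_ATTRIBUTE_INTEGRITY_STREAM
        self.attributes = attributes

    def has_integrity(self):
        return bool(self.attributes & FILE_ATTRIBUTE_INTEGRITY_STREAM)


# Root created with integrity enabled, analogous to formatting with /i:enable.
root = Node("D:\\", attributes=FILE_ATTRIBUTE_INTEGRITY_STREAM)
docs = Node("docs", parent=root)
database = Node("orders.db", parent=docs)

# An application that manages its own layout opts out for this one file.
database.attributes &= ~FILE_ATTRIBUTE_INTEGRITY_STREAM

report = Node("report.docx", parent=docs)
print(docs.has_integrity(), database.has_integrity(), report.has_integrity())
```

Clearing the bit on one file does not affect its siblings, because inheritance only happens when an item is created.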

Battling “bit rot”

As we described earlier, the combination of ReFS and Storage Spaces provides a high degree of data resiliency in the presence of disk corruptions and storage failures. A form of data loss that is harder to detect and deal with happens due to “bit rot,” where parts of the disk develop corruptions over time that go largely undetected since those parts are not read frequently. By the time they are read and detected, the alternate copies may have also been corrupted or lost due to other failures.

In order to deal with bit rot, we have added a system task that periodically scrubs all metadata and Integrity Stream data on a ReFS volume residing on a mirrored Storage Space. Scrubbing involves reading all the redundant copies and validating their correctness using the ReFS checksums. If checksums mismatch, bad copies are fixed using good ones. The file attribute FILE_ATTRIBUTE_NO_SCRUB_DATA indicates that the scrubber should skip the file. This attribute is useful for those applications that maintain their own integrity information, when the application developer wants tighter control over when and how those files are scrubbed.
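
A scrubber's skip logic might look roughly like the Python sketch below. The real scrubber is a system task whose interface is not described here, so `scrub`, the `verify_and_repair` callback, and the file list are all hypothetical. Only the value 0x20000 is taken from the documented Windows constant for FILE_ATTRIBUTE_NO_SCRUB_DATA.

```python
FILE_ATTRIBUTE_NO_SCRUB_DATA = 0x20000   # documented Windows attribute value


def scrub(files, verify_and_repair):
    """Validate every file except those flagged to be skipped.

    `files` is an iterable of (name, attributes) pairs; `verify_and_repair`
    is a callback that checks all copies of one file against its checksums.
    """
    scrubbed = []
    for name, attributes in files:
        if attributes & FILE_ATTRIBUTE_NO_SCRUB_DATA:
            continue                      # the application manages this file itself
        verify_and_repair(name)
        scrubbed.append(name)
    return scrubbed


volume = [
    ("report.docx", 0),
    ("orders.db", FILE_ATTRIBUTE_NO_SCRUB_DATA),
    ("photo.jpg", 0),
]
checked = []
result = scrub(volume, checked.append)
print(result)  # ['report.docx', 'photo.jpg']
```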

The Integrity.exe command line tool is a powerful way to manage the integrity and scrubbing policies.

When all else fails…continued volume availability

We expect many customers to use ReFS in conjunction with mirrored Storage Spaces, in which case corruptions will be automatically and transparently fixed. But there are cases, admittedly rare, when even a volume on a mirrored space can get corrupted. For example, faulty system memory can corrupt data, which can then find its way to the disk and corrupt all redundant copies. In addition, some customers may choose not to use a mirrored storage space underneath ReFS.

For these cases where the volume gets corrupted, ReFS implements “salvage,” a feature that removes the corrupt data from the namespace on a live volume. The intention behind this feature is to ensure that non-repairable corruption does not adversely affect the availability of good data. If, for example, a single file in a directory were to become corrupt and could not be automatically repaired, ReFS will remove that file from the file system namespace while salvaging the rest of the volume. This operation can typically be completed in under a second.

Normally, the file system cannot open or delete a corrupt file, making it impossible for an administrator to respond. But because ReFS can still salvage the corrupt data, the administrator is able to recover that file from a backup or have the application re-create it without taking the file system offline. This key innovation ensures that we do not need to run an expensive offline disk checking and correcting tool, and allows for very large data volumes to be deployed without risking large offline periods due to corruption.

A clean fit into the Windows storage stack

We knew we had to design for maximum flexibility and compatibility. We designed ReFS to plug into the storage stack just like another file system, to maximize compatibility with the other layers around it. For example, it can seamlessly leverage BitLocker encryption, Access Control Lists for security, USN journal, change notifications, symbolic links, junction points, mount points, reparse points, volume snapshots, file IDs, and oplocks. We expect most file system filters to work seamlessly with ReFS with little or no modification. Our testing bore this out; for example, we were able to validate the functionality of the existing Forefront antivirus solution.

Some filters that depend on the NTFS physical format will need greater modification. We run an extensive compatibility program where we test our file systems with third-party antivirus, backup, and other such software. We are doing the same with ReFS and will work with our key partners to address any incompatibilities that we discover. This is something we have done before and is not unique to ReFS.

An aspect of flexibility worth noting is that although ReFS and Storage Spaces work well together, they are designed to run independently of each other. This provides maximum deployment flexibility for both components without unnecessarily limiting each other. Or said another way, there are reliability and performance tradeoffs that can be made in choosing a complete storage solution, including deploying ReFS with underlying storage from our partners.

With Storage Spaces, a storage pool can be shared by multiple machines and the virtual disks can seamlessly transition between them, providing additional resiliency to failures. Because of the way we have architected the system, ReFS can seamlessly take advantage of this.

Usage

We have tested ReFS using a sophisticated and vast set of tens of thousands of tests that have been developed over two decades for NTFS. These tests simulate and exceed the requirements of the deployments we expect in terms of stress on the system, failures such as power loss, scalability, and performance. Therefore, ReFS is ready to be deployment-tested in a managed environment. Because this is the first version of a major file system, we do suggest just a bit of caution. We do not characterize ReFS in Windows 8 as a “beta” feature. It will be a production-ready release when Windows 8 comes out of beta, with the caveat that nothing is more important than the reliability of data. So, unlike any other aspect of a system, this is one where a conservative approach to initial deployment and testing is mandatory.

With this in mind, we will implement ReFS in a staged evolution of the feature: first as a storage system for Windows Server, then as storage for clients, and then ultimately as a boot volume. This is the same approach we have used with new file systems in the past.

Initially, our primary test focus will be running ReFS as a file server. We expect customers to benefit from using it as a file server, especially on a mirrored Storage Space. We also plan to work with our storage partners to integrate it with their storage solutions.

Conclusion

Along with Storage Spaces, ReFS forms the foundation of storage on Windows for the next decade or more. We believe this significantly advances our state of the art for storage. Together, Storage Spaces and ReFS have been architected with headroom to innovate further, and we expect that we will see ReFS as the next massively deployed file system.

FAQ:

Q) Why is it named ReFS?

ReFS stands for Resilient File System. Although it is designed to be better in many dimensions, resiliency stands out as one of its most prominent features.

Q) What are the capacity limits of ReFS?

The table below shows the capacity limits of the on-disk format. Other concerns may determine some practical limits, such as the system configuration (for example, the amount of memory), limits set by various system components, as well as time taken to populate data sets, backup times, etc.

Attribute	Limit based on the on-disk format
Maximum size of a single file	2^64-1 bytes
Maximum size of a single volume	Format supports 2^78 bytes with 16KB cluster size (2^64 * 16 * 2^10). Windows stack addressing allows 2^64 bytes
Maximum number of files in a directory	2^64
Maximum number of directories in a volume	2^64
Maximum file name length	32K unicode characters
Maximum path length	32K
Maximum size of any storage pool	4 PB
Maximum number of storage pools in a system	No limit
Maximum number of spaces in a storage pool	No limit
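
The volume-size arithmetic in the table can be checked with a few lines of Python. The only assumption is the reading of the expression 2^64 * 16 * 2^10 as 2^64 addressable clusters of 16 KB each; the variable names are illustrative.

```python
KIB = 2**10
PIB = 2**50

cluster_size = 16 * KIB            # 16 KB clusters, as in the table
addressable_clusters = 2**64       # from the table's expression 2^64 * 16 * 2^10

format_limit = addressable_clusters * cluster_size
stack_limit = 2**64                # byte-addressing limit of the Windows stack

print(format_limit == 2**78)        # True
print(format_limit // stack_limit)  # 16384: the format limit is 2^14 times larger
print(stack_limit // PIB)           # 16384: the stack limit is 16384 PiB
```

In other words, 16 KB is 2^14 bytes, so 2^64 clusters give 2^78 bytes, while the 2^64-byte stack limit is the tighter of the two bounds.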

Q) Can I convert data between NTFS and ReFS?

In Windows 8 there is no way to convert data in place. Data can be copied. This was an intentional design decision given the size of data sets that we see today and how impractical it would be to do this conversion in place, in addition to the likely change in architected approach before and after conversion.

Q) Can I boot from ReFS in Windows Server 8?

No, this is not implemented or supported.

Q) Can ReFS be used on removable media or drives?

No, this is not implemented or supported.

Q) What semantics or features of NTFS are no longer supported on ReFS?

The NTFS features we have chosen to not support in ReFS are: named streams, object IDs, short names, compression, file level encryption (EFS), user data transactions, sparse files, hard links, extended attributes, and quotas.

Q) What about parity spaces and ReFS?

ReFS is supported on the fault resiliency options provided by Storage Spaces. In Windows Server 8, automatic data correction is implemented for mirrored spaces only.

Q) Is clustering supported?

Failover clustering is supported, whereby individual volumes can fail over across machines. In addition, shared storage pools in a cluster are supported.

Q) What about RAID? How do I use ReFS capabilities of striping, mirroring, or other forms of RAID? Does ReFS deliver the read performance needed for video, for example?

ReFS leverages the data redundancy capabilities of Storage Spaces, which include striped mirrors and parity. The read performance of ReFS is expected to be similar to that of NTFS, with which it shares a lot of the relevant code. It will be great at streaming data.

Q) How come ReFS does not have deduplication, second level caching between DRAM & storage, and writable snapshots?

ReFS does not itself offer deduplication. One side effect of its familiar, pluggable, file system architecture is that other deduplication products will be able to plug into ReFS the same way they do with NTFS.

ReFS does not explicitly implement a second-level cache, but customers can use third-party solutions for this.

ReFS and VSS work together to provide snapshots in a manner consistent with NTFS in Windows environments. For now, they don’t support writable snapshots or snapshots larger than 64TB.
