Dear colleagues of the community, a few hours ago Microsoft revealed the technical specifications of its new file system, named ReFS (Resilient File System), which will be an integral feature of Windows Server 8.
For more details, I quote below the original English text as published by Microsoft's engineering team, specifically by Surendra Verma. As I believe you will agree, the technical characteristics of the new file system are simply stunning.
We wanted to continue our dialog about data storage by talking about
the next generation file system being introduced in Windows 8. Today,
NTFS is the most widely used, advanced, and feature rich file system in
broad use. But when you’re reimagining Windows, as we are for Windows 8,
we don’t rest on past successes, and so with Windows 8 we are also
introducing a newly engineered file system. ReFS (which stands for
Resilient File System) is built on the foundations of NTFS, so it
maintains crucial compatibility while at the same time it has been
architected and engineered for a new generation of storage technologies
and scenarios. In Windows 8, ReFS will be introduced only as part of
Windows Server 8, which is the same approach we have used for each and
every file system introduction. Of course at the application level, ReFS
stored data will be accessible from clients just as NTFS data would be.
As you read this, let’s not forget that NTFS is by far the industry’s
leading technology for file systems on PCs.
This file system, which we call ReFS, has been designed from the
ground up to meet a broad set of customer requirements, both today’s and
tomorrow’s, for all the different ways that Windows is deployed.
The key goals of ReFS are:
- Maintain a high degree of compatibility with a subset of NTFS features that are widely adopted while deprecating others that provide limited value at the cost of system complexity and footprint.
- Verify and auto-correct data. Data can get corrupted due to a number of reasons and therefore must be verified and, when possible, corrected automatically. Metadata must not be written in place to avoid the possibility of “torn writes,” which we will talk about in more detail below.
- Optimize for extreme scale. Use scalable structures for everything. Don’t assume that disk-checking algorithms, in particular, can scale to the size of the entire file system.
- Never take the file system offline. Assume that in the event of corruptions, it is advantageous to isolate the fault while allowing access to the rest of the volume. This is done while salvaging the maximum amount of data possible, all done live.
- Provide a full end-to-end resiliency architecture when used in conjunction with the Storage Spaces feature, which was co-designed and built in conjunction with ReFS.
The key features of ReFS are as follows (note that some of these features are provided in conjunction with Storage Spaces).
- Metadata integrity with checksums
- Integrity streams providing optional user data integrity
- Allocate on write transactional model for robust disk updates (also known as copy on write)
- Large volume, file and directory sizes
- Storage pooling and virtualization makes file system creation and management easy
- Data striping for performance (bandwidth can be managed) and redundancy for fault tolerance
- Disk scrubbing for protection against latent disk errors
- Resiliency to corruptions with "salvage" for maximum volume availability in all cases
- Shared storage pools across machines for additional failure tolerance and load balancing
In addition, ReFS inherits the features and semantics from NTFS including
BitLocker encryption, access-control lists for security, USN journal,
change notifications, symbolic links, junction points, mount points,
reparse points, volume snapshots, file IDs, and oplocks.
Of course, data stored on ReFS is accessible through the same file access
APIs on clients that are used on any operating system that can access
today’s NTFS volumes.
Key design attributes and features
Our design attributes are closely related to our goals. As we go through
these attributes, keep in mind the history of producing file systems
used by hundreds of millions of devices scaling from the smallest
footprint machines to the largest data centers, from the smallest
storage format to the largest multi-spindle format, from solid state
storage to the largest drives and storage systems available. Yet at the
same time, Windows file systems are accessed by the widest array of
application and system software anywhere. ReFS takes that learning and
builds on it. We didn’t start from scratch, but reimagined it where it
made sense and built on the right parts of NTFS where that made sense.
Above all, we are delivering this in a pragmatic manner consistent with
the delivery of a major file system, something only Microsoft has done at this scale.
Code reuse and compatibility
When we look
at the file system API, this is the area where compatibility is the
most critical and technically, the most challenging. Rewriting the code
that implements file system semantics would not lead to the right level
of compatibility and the issues introduced would be highly dependent on
application code, call timing, and hardware. Therefore in building ReFS,
we reused the code responsible for implementing the Windows file system
semantics. This code implements the file system interface (read, write,
open, close, change notification, etc.), maintains in-memory file and
volume state, enforces security, and maintains memory caching and
synchronization for file data. This reuse ensures a high degree of
compatibility with the features of NTFS that we’re carrying forward.
Underlying this reused portion, the NTFS version of the code base uses an
existing engine that implements on-disk structures such as the Master
File Table (MFT) to represent files and directories. ReFS combines this
reused code with a brand-new engine, where a significant portion of the
innovation behind ReFS lies. Graphically, it looks like this:
Reliable and scalable on-disk structures
On-disk structures and their manipulation are handled by the on-disk storage
engine. This exposes a generic key-value interface, which the layer
above leverages to implement files, directories, etc. For its own
implementation, the storage engine uses B+ trees
exclusively. In fact, we utilize B+ trees as the single common on-disk
structure to represent all information on the disk. Trees can be
embedded within other trees (a child tree’s root is stored within the
row of a parent tree). On the disk, trees can be very large and
multi-level or really compact with just a few keys and embedded in
another structure. This ensures extreme scalability up and down for all
aspects of the file system. Having a single structure significantly
simplifies the system and reduces code. The new engine interface
includes the notion of “tables” that are enumerable sets of key-value
pairs. Most tables have a unique ID (called the object ID) by which they
can be referenced. A special object table indexes all such tables in the system.
Now, let’s look at how the common file system abstractions are constructed using tables.
As shown in the diagram above, directories are represented as tables.
Because we implement tables using B+ trees, directories can scale
efficiently, becoming very large. Files are implemented as tables
embedded within a row of the parent directory, itself a table
(represented as File Metadata in the diagram above). The rows within the
File Metadata table represent the various file attributes. The file
data extent locations are represented by an embedded stream table, which
is a table of offset mappings (and, optionally, checksums). This means
that the files and directories can be very large without a performance
impact, eclipsing the limitations found in NTFS.
Other global structures within the file system, such as ACLs (Access Control
Lists) are represented as tables rooted within the object table.
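As a rough illustration of the table model described above, the following Python sketch uses sorted dictionaries in place of B+ trees. The class and field names (`Table`, `object_table`, `lcn`, and so on) and the sample values are hypothetical; this is not the actual ReFS on-disk format, only a way to see how directories, files, and stream tables nest.

```python
from bisect import insort


class Table:
    """Enumerable set of key-value pairs; a stand-in for a B+ tree."""

    def __init__(self):
        self._keys = []   # kept sorted, like keys in a B+ tree
        self._rows = {}

    def put(self, key, value):
        if key not in self._rows:
            insort(self._keys, key)
        self._rows[key] = value

    def get(self, key):
        return self._rows[key]

    def items(self):
        # Enumerate rows in key order.
        return [(k, self._rows[k]) for k in self._keys]


# Hypothetical object table: object ID -> table (directories, ACLs, ...).
object_table = Table()

# A directory is a table referenced by its object ID.
docs = Table()
object_table.put(0x701, docs)

# A file is a table embedded in a row of its parent directory.
report = Table()
report.put("attributes", {"size": 8192})
extents = Table()                                   # embedded stream table
extents.put(0, {"lcn": 5000, "checksum": None})     # file offset -> extent
extents.put(4096, {"lcn": 9120, "checksum": None})
report.put("stream", extents)
docs.put("report.txt", report)

# Resolve object ID -> directory -> file -> extent for offset 4096.
extent = object_table.get(0x701).get("report.txt").get("stream").get(4096)
print(extent["lcn"])
```

Because every level is the same kind of structure, a lookup is just a chain of `get` calls, which is the simplification the single-structure design is aiming for.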
Disk space allocation is managed by a hierarchical allocator, which
represents free space by tables of free space ranges. For scalability,
there are three such tables – the large, medium and small allocators.
These differ in the granularity of space they manage: for example, a
medium allocator manages medium-sized chunks allocated from the large
allocator. This makes disk allocation algorithms scale very well, and
allows us the benefit of naturally collocating related metadata for
better performance. The roots of these allocators as well as that of the
object table are reachable from a well-known location on the disk. Some
tables have allocators that are private to them, reducing contention
and encouraging better allocation locality.
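The hierarchical allocator can be pictured with a small Python sketch. It assumes three levels whose chunk sizes (1024, 64, and 1 cluster) are invented for illustration, and it assumes each level's chunk is no larger than its parent's; the class name `RangeAllocator` is hypothetical and free-range coalescing is left out.

```python
class RangeAllocator:
    """Tracks free space as (start, length) ranges; refills from a parent."""

    def __init__(self, chunk, parent=None):
        self.chunk = chunk      # granularity this level hands out
        self.parent = parent    # coarser allocator to refill from
        self.free = []          # list of (start, length) free ranges

    def add_free(self, start, length):
        self.free.append((start, length))

    def allocate(self):
        """Return the start of one chunk-sized range."""
        for i, (start, length) in enumerate(self.free):
            if length >= self.chunk:
                rest = length - self.chunk
                if rest:
                    self.free[i] = (start + self.chunk, rest)
                else:
                    del self.free[i]
                return start
        if self.parent is None:
            raise MemoryError("volume is full")
        # Refill: carve one chunk out of the parent, then retry.
        self.add_free(self.parent.allocate(), self.parent.chunk)
        return self.allocate()


# Sizes are in clusters and purely illustrative.
large = RangeAllocator(1024)
large.add_free(0, 4096)
medium = RangeAllocator(64, parent=large)
small = RangeAllocator(1, parent=medium)

a = small.allocate()    # pulls 64 clusters from medium, which pulls 1024 from large
b = small.allocate()    # served from the small allocator's own free range
c = medium.allocate()   # next medium chunk, adjacent to the small one
print(a, b, c)
```

Small allocations made close together in time land close together on disk, which is one way to read the collocation benefit mentioned above.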
Apart from global
system metadata tables, the entries in the object table refer to
directories, since files are embedded within directories.
Robust disk update strategy
Updating the disk reliably and efficiently is one of the most important and
challenging aspects of a file system design. We spent a lot of time
evaluating various approaches. One of the approaches we considered and
rejected was to implement a log structured file system. This approach is
unsuitable for the type of general-purpose file system required by
Windows. NTFS relies on a journal of transactions to ensure consistency
on the disk. That approach updates metadata in-place on the disk and
uses a journal on the side to keep track of changes that can be rolled
back on errors and during recovery from a power loss. One of the
benefits of this approach is that it maintains the metadata layout in
place, which can be advantageous for read performance. The main
disadvantages of a journaling system are that writes can get randomized
and, more importantly, the act of updating the disk can corrupt
previously written metadata if power is lost at the time of the write, a
problem commonly known as torn write.
To maximize reliability and eliminate torn writes, we chose an allocate-on-write
approach that never updates metadata in-place, but rather writes it to a
different location in an atomic fashion. In some ways this borrows from
a very old notion of “shadow paging”
that is used to reliably update structures on the disk. Transactions
are built on top of this allocate-on-write approach. Since the upper
layer of ReFS is derived from NTFS, the new transaction model seamlessly
leverages failure recovery logic already present, which has been tested
and stabilized over many releases.
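A minimal sketch of the allocate-on-write idea follows. It assumes a toy disk with a single root pointer that can be switched atomically; the `Disk` class, the `update` function, and the `crash_before_commit` flag are hypothetical names for illustration, and the source does not describe how ReFS actually commits a new root.

```python
class Disk:
    """Toy disk: numbered blocks plus one root pointer updated atomically."""

    def __init__(self):
        self.blocks = {}
        self.next_free = 0
        self.root = None    # the only location ever updated in place

    def write_new(self, data):
        # Allocate-on-write: always write to a fresh block.
        addr = self.next_free
        self.next_free += 1
        self.blocks[addr] = data
        return addr


def update(disk, key, value, crash_before_commit=False):
    """Write a new version of the metadata, then switch the root to it."""
    current = dict(disk.blocks[disk.root]) if disk.root is not None else {}
    current[key] = value
    new_addr = disk.write_new(current)
    if crash_before_commit:
        return              # simulated power loss: root still names the old copy
    disk.root = new_addr    # single pointer switch commits the change


disk = Disk()
update(disk, "name", "a.txt")
update(disk, "name", "b.txt", crash_before_commit=True)
print(disk.blocks[disk.root])
```

The interrupted update leaves a stray block behind, but the previously committed metadata is untouched, so there is nothing to tear.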
ReFS allocates metadata in a
way that allows writes to be combined for related parts (for example,
stream allocation, file attributes, file names, and directory pages) in
fewer, larger I/Os, which is great for both spinning media and flash. At
the same time a measure of read contiguity is maintained. The
hierarchical allocation scheme is leveraged heavily here.
We perform significant testing where power is withdrawn from the system
while the system is under extreme stress, and once the system is back
up, all structures are examined for correctness. This testing is the
ultimate measure of our success. We have achieved an unprecedented level
of robustness in this test for Microsoft file systems. We believe this
is industry-leading and fulfills our key design goals.
Resiliency to disk corruptions
As mentioned previously, one of our design goals was to detect and correct
corruption. This not only ensures data integrity, but also improves
system availability and online operation. Thus, all ReFS metadata is
check-summed at the level of a B+ tree page, and the checksum is stored
independently from the page itself. This allows us to detect all forms
of disk corruption, including lost and misdirected writes and bit rot
(degradation of data on the media). In addition, we have added an
option where the contents of a file are check-summed as well. When this
option, known as “integrity streams,” is enabled, ReFS always writes the
file changes to a location different from the original one. This
allocate-on-write technique ensures that pre-existing data is not lost
due to the new write. The checksum update is done atomically with the
data write, so that if power is lost during the write, we always have a
consistently verifiable version of the file available whereby
corruptions can be detected authoritatively.
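To see why storing the checksum apart from the page matters, consider this Python sketch. CRC-32 is only a stand-in, since the text does not name the checksum algorithm, and the `Volume` class with its `refs` map is a hypothetical model of "the reference holds the expected checksum".

```python
import zlib


def checksum(page: bytes) -> int:
    # CRC-32 is a stand-in; the source does not name the algorithm.
    return zlib.crc32(page)


class Volume:
    def __init__(self):
        self.pages = {}   # address -> page contents
        self.refs = {}    # name -> (address, expected checksum), kept apart from the page

    def write(self, name, addr, page):
        self.pages[addr] = page
        self.refs[name] = (addr, checksum(page))

    def read(self, name):
        addr, expected = self.refs[name]
        page = self.pages.get(addr, b"")
        if checksum(page) != expected:
            raise IOError(f"corruption detected in page at {addr}")
        return page


vol = Volume()
vol.write("dir", 10, b"version 1")

# Simulate a lost write: the reference now expects version 2,
# but the disk still holds the old, internally intact page.
vol.refs["dir"] = (10, checksum(b"version 2"))

try:
    vol.read("dir")
    detected = False
except IOError:
    detected = True
print(detected)
```

A checksum embedded in the page itself would still match the stale page, so a lost or misdirected write would go unnoticed; a checksum held by the referrer catches it.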
We blogged about Storage Spaces
a couple of weeks ago. We designed ReFS and Storage Spaces to
complement each other, as two components of a complete storage system.
We are making Storage Spaces available for NTFS (and client PCs) because
there is great utility in that; the architectural layering supports
this client-side approach while we adapt ReFS for usage on clients so
that ultimately you’ll be able to use ReFS across both clients and servers.
In addition to improved performance, Storage Spaces
protects data from partial and complete disk failures by maintaining
copies on multiple disks. On read failures, Storage Spaces is able to
read alternate copies, and on write failures (as well as complete media
loss on read/write) it is able to reallocate data transparently. Many
failures don’t involve media failure, but happen due to data
corruptions, or lost and misdirected writes.
These are exactly
the failures that ReFS can detect using checksums. Once ReFS detects
such a failure, it interfaces with Storage Spaces to read all available
copies of data and chooses the correct one based on checksum validation.
It then tells Storage Spaces to fix the bad copies based on the good
copies. All of this happens transparently from the point of view of the
application. If ReFS is not running on top of a mirrored Storage Space,
then it has no means to automatically repair the corruption. In that
case it will simply log an event indicating that corruption was detected
and fail the read if it is for file data. I’ll talk more about the
impact of this on metadata later.
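The detect-and-repair flow described above can be sketched in a few lines of Python. The function name `read_with_repair` and the use of CRC-32 are assumptions made for illustration; the real interface between ReFS and Storage Spaces is not described in this text.

```python
import zlib


def read_with_repair(copies, expected_crc):
    """Return valid data from mirrored copies and overwrite bad copies.

    `copies` is a list of bytes objects, one per mirror.
    """
    good = next((c for c in copies if zlib.crc32(c) == expected_crc), None)
    if good is None:
        raise IOError("no valid copy: corruption cannot be repaired")
    repaired = 0
    for i, c in enumerate(copies):
        if zlib.crc32(c) != expected_crc:
            copies[i] = good          # fix the bad copy from the good one
            repaired += 1
    return good, repaired


data = b"payroll records"
crc = zlib.crc32(data)

mirrors = [b"payroll rec0rds", data]   # first copy has silently rotted
result, fixed = read_with_repair(mirrors, crc)
print(result == data, fixed)

# Without a mirror there is nothing to repair from, so the read fails.
try:
    read_with_repair([b"payroll rec0rds"], crc)
    failed = False
except IOError:
    failed = True
print(failed)
```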
Checksums (64-bit) are always
turned on for ReFS metadata, and assuming that the volume is hosted on a
mirrored Storage Space, automatic correction is also always turned on.
All integrity streams (see below) are protected in the same way. This
creates an end-to-end high integrity solution for the customer, where
relatively unreliable storage can be made highly reliable.
Integrity streams protect file content against all forms of data corruption.
Although this feature is valuable for many scenarios, it is not
appropriate for some. For example, some applications prefer to manage
their file storage carefully and rely on a particular file layout on the
disk. Since integrity streams reallocate blocks every time file content
is changed, the file layout is too unpredictable for these
applications. Database systems are excellent examples of this. Such
applications also typically maintain their own checksums of file content
and are able to verify and correct data by direct interaction with
Storage Spaces APIs.
For those cases where a particular file
layout is required, we provide mechanisms and APIs to control this
setting at various levels of granularity.
At the most basic
level, integrity is an attribute of a file
(FILE_ATTRIBUTE_INTEGRITY_STREAM). It is also an attribute of a
directory. When present in a directory, it is inherited by all files and
directories created inside the directory. For convenience, you can use
the “format” command to specify this for the root directory of a volume
at format time. Setting it on the root ensures that it propagates by
default to every file and directory on the volume. For example:
D:\>format /fs:refs /q /i:enable <volume>
By default, when the /i switch is not specified, the behavior that
the system chooses depends on whether the volume resides on a mirrored
space. On a mirrored space, integrity is enabled because we expect the
benefits to significantly outweigh the costs. Applications can always
override this programmatically for individual files.
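The inheritance rule can be modeled with a toy Python sketch. Only the attribute constant is real (Python's `stat` module exposes `FILE_ATTRIBUTE_INTEGRITY_STREAM`); the `Node` class and the file names are hypothetical, and nothing here calls a Windows API.

```python
import stat

INTEGRITY = stat.FILE_ATTRIBUTE_INTEGRITY_STREAM  # 0x8000


class Node:
    """Toy file or directory that inherits the integrity bit from its parent."""

    def __init__(self, name, parent=None, attributes=0):
        self.name = name
        if parent is not None:
            # New children pick up the parent's integrity setting.
            attributes |= parent.attributes & INTEGRITY
        self.attributes = attributes

    def has_integrity(self):
        return bool(self.attributes & INTEGRITY)


root = Node("D:\\", attributes=INTEGRITY)   # as if formatted with /i:enable
logs = Node("logs", parent=root)
db = Node("data.mdf", parent=logs)
db.attributes &= ~INTEGRITY                 # an application opts out for one file

print(logs.has_integrity(), db.has_integrity())
```

Setting the bit once at the root covers the whole volume by default, while a database-style application can still clear it on the specific files whose layout it wants to control.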
Battling “bit rot”
As we described earlier, the combination of ReFS and Storage Spaces
provides a high degree of data resiliency in the presence of disk
corruptions and storage failures. A form of data loss that is harder to
detect and deal with happens due to “bit rot,” where parts of the
disk develop corruptions over time that go largely undetected since
those parts are not read frequently. By the time they are read and
detected, the alternate copies may have also been corrupted or lost due
to other failures.
In order to deal with bit rot, we have added a system task that
periodically scrubs all metadata and Integrity Stream data on a ReFS
volume residing on a mirrored Storage Space. Scrubbing involves reading
all the redundant copies and validating their correctness using the ReFS
checksums. If checksums mismatch, bad copies are fixed using good ones.
The file attribute FILE_ATTRIBUTE_NO_SCRUB_DATA indicates that the
scrubber should skip the file. This attribute is useful for those
applications that maintain their own integrity information, when the
application developer wants tighter control over when and how those
files are scrubbed.
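A scrubbing pass along these lines might look like the sketch below. The `scrub` function, the dictionary layout, and the CRC-32 checksum are assumptions for illustration; only the `FILE_ATTRIBUTE_NO_SCRUB_DATA` constant comes from a real API (Python's `stat` module).

```python
import stat
import zlib

NO_SCRUB = stat.FILE_ATTRIBUTE_NO_SCRUB_DATA  # 0x20000


def scrub(files):
    """Validate every mirrored copy of each file; fix bad copies from good ones.

    `files` maps name -> {"attrs": int, "crc": int, "copies": [bytes, ...]}.
    Returns the names of files that had at least one copy repaired.
    """
    repaired = []
    for name, f in files.items():
        if f["attrs"] & NO_SCRUB:
            continue                  # application manages its own integrity
        good = [c for c in f["copies"] if zlib.crc32(c) == f["crc"]]
        if good and len(good) < len(f["copies"]):
            f["copies"] = [good[0]] * len(f["copies"])
            repaired.append(name)
    return repaired


files = {
    "a.txt": {"attrs": 0, "crc": zlib.crc32(b"alpha"),
              "copies": [b"alpha", b"alphX"]},
    "db.dat": {"attrs": NO_SCRUB, "crc": zlib.crc32(b"beta"),
               "copies": [b"beta", b"bXta"]},
}
fixed = scrub(files)
print(fixed)
```

The point of running this periodically is that rarely read data gets checked while a good copy still exists, instead of on the day it is finally needed.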
The Integrity.exe command line tool is a powerful way to manage the integrity and scrubbing policies.
When all else fails…continued volume availability
We expect many customers to use ReFS in conjunction with mirrored
Storage Spaces, in which case corruptions will be automatically and
transparently fixed. But there are cases, admittedly rare, when even a
volume on a mirrored space can get corrupted – for example faulty system
memory can corrupt data, which can then find its way to the disk and
corrupt all redundant copies. In addition, some customers may not choose
to use a mirrored storage space underneath ReFS.
For these cases where the volume gets corrupted, ReFS implements
“salvage,” a feature that removes the corrupt data from the namespace on
a live volume. The intention behind this feature is to ensure that
non-repairable corruption does not adversely affect the availability of
good data. If, for example, a single file in a directory were to become
corrupt and could not be automatically repaired, ReFS will remove that
file from the file system namespace while salvaging the rest of the
volume. This operation can typically be completed in under a second.
Normally, the file system cannot open or delete a corrupt file,
making it impossible for an administrator to respond. But because ReFS
can still salvage the corrupt data, the administrator is able to recover
that file from a backup or have the application re-create it without
taking the file system offline. This key innovation ensures that we do
not need to run an expensive offline disk checking and correcting tool,
and allows for very large data volumes to be deployed without risking
large offline periods due to corruption.
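In very simplified terms, salvage behaves like the following Python sketch: the unrepairable entry is dropped from the namespace and everything else stays reachable. The `salvage` function and the `is_corrupt` predicate are hypothetical; the real mechanism operates on on-disk structures rather than a dictionary.

```python
def salvage(directory, is_corrupt):
    """Remove unrepairable entries from a live namespace; keep the rest.

    Returns the removed names so an administrator can restore them.
    """
    removed = [name for name, entry in directory.items() if is_corrupt(entry)]
    for name in removed:
        del directory[name]
    return removed


# None marks an entry whose data could not be repaired (illustrative only).
directory = {"a.txt": b"ok", "b.txt": None, "c.txt": b"ok"}
removed = salvage(directory, lambda entry: entry is None)
remaining = sorted(directory)
print(removed, remaining)

# The volume stays online; the administrator restores the file afterwards.
directory["b.txt"] = b"restored from backup"
```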
A clean fit into the Windows storage stack
We knew we had to design for maximum flexibility and compatibility.
We designed ReFS to plug into the storage stack just like another file
system, to maximize compatibility with the other layers around it. For
example, it can seamlessly leverage BitLocker encryption, Access Control
Lists for security, USN journal, change notifications, symbolic links,
junction points, mount points, reparse points, volume snapshots, file
IDs, and oplocks. We expect most file system filters to work seamlessly
with ReFS with little or no modification. Our testing bore this out; for
example, we were able to validate the functionality of the existing
Forefront antivirus solution.
Some filters that depend on the NTFS physical format will need
greater modification. We run an extensive compatibility program where we
test our file systems with third-party antivirus, backup, and other
such software. We are doing the same with ReFS and will work with our
key partners to address any incompatibilities that we discover. This is
something we have done before and is not unique to ReFS.
An aspect of flexibility worth noting is that although ReFS and
Storage Spaces work well together, they are designed to run
independently of each other. This provides maximum deployment
flexibility for both components without unnecessarily limiting each
other. Or said another way, there are reliability and performance
tradeoffs that can be made in choosing a complete storage solution,
including deploying ReFS with underlying storage from our partners.
With Storage Spaces, a storage pool can be shared by multiple
machines and the virtual disks can seamlessly transition between them,
providing additional resiliency to failures. Because of the way we have
architected the system, ReFS can seamlessly take advantage of this.
We have tested ReFS using a sophisticated and vast set of tens of
thousands of tests that have been developed over two decades for NTFS.
These tests simulate and exceed the requirements of the deployments we
expect in terms of stress on the system, failures such as power loss,
scalability, and performance. Therefore, ReFS is ready to be
deployment-tested in a managed environment. Being the first version of a
major file system, we do suggest just a bit of caution. We do not
characterize ReFS in Windows 8 as a “beta” feature. It will be a
production-ready release when Windows 8 comes out of beta, with the
caveat that nothing is more important than the reliability of data. So,
unlike any other aspect of a system, this is one where a conservative
approach to initial deployment and testing is mandatory.
With this in mind, we will implement ReFS in a staged evolution of
the feature: first as a storage system for Windows Server, then as
storage for clients, and then ultimately as a boot volume. This is the
same approach we have used with new file systems in the past.
Initially, our primary test focus will be running ReFS as a file
server. We expect customers to benefit from using it as a file server,
especially on a mirrored Storage Space. We also plan to work with our
storage partners to integrate it with their storage solutions.
Along with Storage Spaces, ReFS forms the foundation of storage on
Windows for the next decade or more. We believe this significantly
advances our state of the art for storage. Together, Storage Spaces and
ReFS have been architected with headroom to innovate further, and we
expect that we will see ReFS as the next massively deployed file system.
Q) Why is it named ReFS?
ReFS stands for Resilient File System. Although it is designed to be
better in many dimensions, resiliency stands out as one of its most prominent features.
Q) What are the capacity limits of ReFS?
The table below shows the capacity limits of the on-disk format.
Other concerns may determine some practical limits, such as the system
configuration (for example, the amount of memory), limits set by various
system components, as well as time taken to populate data sets, backup times, etc.
Limit based on the on-disk format
Maximum size of a single file
Maximum size of a single volume: Format supports 2^78 bytes with 16KB cluster size (2^64 * 16 * 2^10). Windows stack addressing allows 2^64 bytes
Maximum number of files in a directory
Maximum number of directories in a volume
Maximum file name length: 32K unicode characters
Maximum path length
Maximum size of any storage pool
Maximum number of storage pools in a system
Maximum number of spaces in a storage pool
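The volume-size figure in the table can be checked with a few lines of Python. The variable names are illustrative, and reading the 2^64 factor as a cluster count is an interpretation of the formula given in the table.

```python
cluster_size = 16 * 2**10      # 16KB clusters, in bytes
clusters = 2**64               # the 2^64 factor from the table's formula
format_limit = clusters * cluster_size

print(format_limit == 2**78)   # True
print(format_limit // 2**64)   # 16384
```

In other words, the on-disk format allows volumes 2^14 (16,384) times larger than the 2^64 bytes that Windows stack addressing can currently reach.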
Q) Can I convert data between NTFS and ReFS?
In Windows 8 there is no way to convert data in place. Data can be
copied. This was an intentional design decision given the size of data
sets that we see today and how impractical it would be to do this
conversion in place, in addition to the likely change in architected
approach before and after conversion.
Q) Can I boot from ReFS in Windows Server 8?
No, this is not implemented or supported.
Q) Can ReFS be used on removable media or drives?
No, this is not implemented or supported.
Q) What semantics or features of NTFS are no longer supported on ReFS?
The NTFS features we have chosen to not support in ReFS are: named
streams, object IDs, short names, compression, file level encryption
(EFS), user data transactions, sparse, hard-links, extended attributes, and quotas.
Q) What about parity spaces and ReFS?
ReFS is supported on the fault resiliency options provided by Storage
Spaces. In Windows Server 8, automatic data correction is implemented
for mirrored spaces only.
Q) Is clustering supported?
Failover clustering is supported, whereby individual volumes can
failover across machines. In addition, shared storage pools in a cluster are supported as well.
Q) What about RAID? How do I use ReFS capabilities of striping,
mirroring, or other forms of RAID? Does ReFS deliver the read
performance needed for video, for example?
ReFS leverages the data redundancy capabilities of Storage Spaces,
which include striped mirrors and parity. The read performance of ReFS
is expected to be similar to that of NTFS, with which it shares a lot of
the relevant code. It will be great at streaming data.
Q) How come ReFS does not have deduplication, second level caching between DRAM & storage, and writable snapshots?
ReFS does not itself offer deduplication. One side effect of its
familiar, pluggable, file system architecture is that other
deduplication products will be able to plug into ReFS the same way they
do with NTFS.
ReFS does not explicitly implement a second-level cache, but customers can use third-party solutions for this.
ReFS and VSS work together to provide snapshots in a manner
consistent with NTFS in Windows environments. For now, they don’t
support writable snapshots or snapshots larger than 64TB.