Dist Nfs
Sun’s Network File System (NFS)
[Figure: Clients 0 through 3 connected over a network to a server.]
[...] system calls to the client-side file system (such as open(),
read(), write(), close(), mkdir(), etc.) in order to access files
which are stored on the server. Thus, to client applications, the
file system does not appear to be any different from a local
(disk-based) file system, except perhaps for performance. In this
way, distributed file systems provide transparent access to files,
an obvious goal; after all, who would want to use a file system that
required a different set of APIs or otherwise was a pain to use?
The role of the client-side file system is to execute the actions
needed to service those system calls. For example, if the client
issues a read() request, the client-side file system may send a
message to the server-side file system (or, more commonly, the file
server) to read a particular block; the file server will then read
the block from disk (or its own in-memory cache), and send a message
back to the client with the requested data. The client-side file
system will then copy the data into the user buffer supplied to the
read() system call, and thus the request will complete. Note that a
subsequent read() of the same block on the client may be satisfied
from a copy cached in client memory or even on the client’s disk; in
the best such case, no network traffic need be generated.
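
The read path just described can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the names FileServer, ClientFS, and read_block are made up for illustration and are not part of any NFS implementation, and a method call stands in for a network message.

```python
# Toy model of the client-side read path; all names here are hypothetical.
class FileServer:
    def __init__(self, blocks):
        self.blocks = blocks          # block number -> bytes (the server's "disk")
        self.requests = 0             # number of messages the server has handled

    def read_block(self, block_no):
        self.requests += 1            # each call stands for one network round trip
        return self.blocks[block_no]


class ClientFS:
    def __init__(self, server):
        self.server = server
        self.cache = {}               # block number -> bytes held in client memory

    def read(self, block_no):
        if block_no not in self.cache:            # miss: ask the file server
            self.cache[block_no] = self.server.read_block(block_no)
        return self.cache[block_no]               # hit: no network traffic at all


server = FileServer({0: b"hello", 1: b"world"})
client = ClientFS(server)
first = client.read(0)    # goes to the server
second = client.read(0)   # served from the client cache
```

After both reads, the server has seen only one request; the second read never left the client.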
OPERATING SYSTEMS, ARPACI-DUSSEAU
From this simple overview, you should get a sense that there are
two important pieces of software in a client/server distributed file
system: the client-side file system and the file server. Together their
behavior determines the behavior of the distributed file system.
47.2 On To NFS
One of the earliest and most successful such systems was developed
by Sun Microsystems, and is known as the Sun Network File System
(or NFS) [S86]. In defining NFS, Sun took an unusual approach:
instead of building a proprietary and closed system, Sun developed
an open protocol which simply specified the exact message formats
that clients and servers would use to communicate. Different groups
could develop their own NFS servers and thus compete in an NFS
marketplace while preserving interoperability. It worked: today
there are many companies that sell NFS servers (including Sun,
NetApp [HLM94], EMC, IBM, and others), and the widespread success of
NFS is likely attributable to this “open market” approach.
FOUR EASY PIECES (V0.4), ARPACI-DUSSEAU
Now imagine that the client-side file system opens the file by
sending a protocol message to the server saying “open the file ’foo’
and give me back a descriptor”. The file server then opens the file
locally on its side and sends the descriptor back to the client. On
subsequent reads, the client application uses that descriptor to call
the read() system call; the client-side file system then passes the de-
scriptor in a message to the file server, saying “read some bytes from
the file that is referred to by the descriptor I am passing you here”.
In this example, the file descriptor is a piece of shared state be-
tween the client and the server (Ousterhout calls this distributed
state [O91]). Shared state, as we hinted above, complicates crash
recovery. Imagine the server crashes after the first read completes,
but before the client has issued the second one. After the server is
up and running again, the client then issues the second read. Un-
fortunately, the server has no idea to which file fd is referring; that
information was ephemeral (i.e., in memory) and thus lost when the
server crashed. To handle this situation, the client and server would
have to engage in some kind of recovery protocol, where the client
would make sure to keep enough information around in its memory
to be able to tell the server what it needs to know (in this case, that
file descriptor fd refers to file foo).
It gets even worse when you consider the fact that a stateful server
has to deal with client crashes. Imagine, for example, a client that
opens a file and then crashes. The open() uses up a file descriptor
on the server; how can the server know it is OK to close a given
file? In normal operation, a client would eventually call close() and
thus inform the server that the file should be closed. However, when
a client crashes, the server never receives a close(), and thus has to
notice the client has crashed in order to close the file.
For these reasons, the designers of NFS decided to pursue a state-
less approach: each client operation contains all the information needed
to complete the request. No fancy crash recovery is needed; the
server just starts running again, and a client, at worst, might have
to retry a request.
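
To make the contrast concrete, here is a minimal sketch of the two designs. It is an illustration under assumptions, not the NFS protocol: the class names and the crash_and_restart() helper are hypothetical, and a file name stands in for whatever identifier a real protocol would put in each request.

```python
class StatefulServer:
    def __init__(self, files):
        self.files = files            # name -> bytes; survives a crash ("disk")
        self.open_table = {}          # fd -> [name, offset]; lost on a crash
        self.next_fd = 0

    def open(self, name):
        fd = self.next_fd
        self.next_fd += 1
        self.open_table[fd] = [name, 0]
        return fd

    def read(self, fd, count):
        name, offset = self.open_table[fd]        # raises KeyError after a crash
        self.open_table[fd][1] = offset + count
        return self.files[name][offset:offset + count]

    def crash_and_restart(self):
        self.open_table = {}          # in-memory state is gone
        self.next_fd = 0


class StatelessServer:
    def __init__(self, files):
        self.files = files

    def read(self, name, offset, count):          # the request describes itself
        return self.files[name][offset:offset + count]

    def crash_and_restart(self):
        pass                          # nothing in memory to lose


files = {"foo": b"abcdefgh"}

stateful = StatefulServer(files)
fd = stateful.open("foo")
stateful.read(fd, 4)
stateful.crash_and_restart()
try:
    stateful.read(fd, 4)
    stateful_survived = True
except KeyError:
    stateful_survived = False         # the server no longer knows what fd means

stateless = StatelessServer(files)
stateless.read("foo", 0, 4)
stateless.crash_and_restart()
second_read = stateless.read("foo", 4, 4)   # works: the client supplied everything
```

The stateless server needs no recovery protocol because the second request carries the file identity, offset, and count itself.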
NFSPROC_GETATTR
  expects: file handle
  returns: attributes
NFSPROC_SETATTR
  expects: file handle, attributes
  returns: nothing
NFSPROC_LOOKUP
  expects: directory file handle, name of file/directory to look up
  returns: file handle, attributes
NFSPROC_READ
  expects: file handle, offset, count
  returns: data, attributes
NFSPROC_WRITE
  expects: file handle, offset, count, data
  returns: attributes
NFSPROC_CREATE
  expects: directory file handle, name of file to be created, attributes
  returns: nothing
NFSPROC_REMOVE
  expects: directory file handle, name of file to be removed
  returns: nothing
NFSPROC_MKDIR
  expects: directory file handle, name of directory to be created, attributes
  returns: file handle
NFSPROC_RMDIR
  expects: directory file handle, name of directory to be removed
  returns: nothing
NFSPROC_READDIR
  expects: directory file handle, count of bytes to read, cookie
  returns: directory entries, cookie (which can be used to get more entries)
For example, assume the client already has a directory file handle
for the root directory of a file system (/). (Indeed, this would be
obtained through the NFS mount protocol, which is how clients and
servers are first connected together; we do not discuss the mount
protocol here for the sake of brevity.) If an application running on
the client tries to open the file /foo.txt, the client-side file
system will send a LOOKUP request to the server, passing it the root
directory’s file handle and the name foo.txt; if successful, the
file handle for foo.txt will be returned, along with its attributes.
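
The lookup step can be sketched as follows. This is a toy model, not the wire protocol: FileHandle, ToyServer, and resolve() are hypothetical names, the handle contents (a volume and an inode number) are a simplification, and the generalization to one LOOKUP per path component is an assumption about how a client would walk a longer path.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileHandle:
    volume: int
    inode: int


class ToyServer:
    def __init__(self):
        self.root = FileHandle(1, 2)
        docs, foo, notes = FileHandle(1, 10), FileHandle(1, 11), FileHandle(1, 12)
        # directory handle -> {name: child handle}
        self.dirs = {self.root: {"docs": docs, "foo.txt": foo},
                     docs: {"notes.txt": notes}}
        self.attrs = {docs: {"size": 0}, foo: {"size": 42}, notes: {"size": 7}}
        self.lookups = 0

    def lookup(self, dir_handle, name):
        """Model of LOOKUP: directory handle + name -> child handle, attributes."""
        self.lookups += 1
        child = self.dirs[dir_handle].get(name)
        if child is None:
            raise FileNotFoundError(name)
        return child, self.attrs[child]


def resolve(server, root_handle, path):
    """Walk a non-root path, issuing one LOOKUP per component."""
    handle, attrs = root_handle, None
    for name in path.strip("/").split("/"):
        handle, attrs = server.lookup(handle, name)
    return handle, attrs


server = ToyServer()
foo_handle, foo_attrs = resolve(server, server.root, "/foo.txt")
```

Opening /foo.txt costs a single LOOKUP against the root handle; a deeper path such as /docs/notes.txt would cost one per component.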
In case you are wondering, attributes are just the metadata that
the file system tracks about each file, including fields such as file cre-
ation time, last modification time, size, ownership and permissions
information, and so forth, i.e., the same type of information that you
would get back if you called stat() on a file.
Once a file handle is available, the client can issue READ and WRITE
protocol messages on a file to read or write the file, respectively.
The READ protocol message requires the client to pass along the file
handle of the file, together with the offset within the file and the
number of bytes to read. The server will then be able to issue the
read (after all, the handle tells the server which volume and which
inode to read from, and the offset and count tell it which bytes of
the file to read) and return the data to the client (or an error if
there was a failure). WRITE is handled similarly, except that the
data is passed from the client to the server, and just a success
code is returned.
One last interesting protocol message is the GETATTR request;
given a file handle, it simply fetches the attributes for that file, in-
cluding the last modified time of the file. We will see why this pro-
tocol request is quite important in NFSv2 below when we discuss
caching (see if you can guess why).
DESIGN TIP: IDEMPOTENCY
Idempotency is a useful property when building reliable systems.
When an operation can be issued more than once with the same effect
as issuing it a single time, it is much easier to handle failure of
the operation; you can just retry it. If an operation is not
idempotent, life becomes more difficult.
In this way, the client can handle all timeouts in a unified way. If
a WRITE request was simply lost (Case 1 above), the client will retry
it, the server will perform the write, and all will be well. The same
will happen if the server was down when the first request was sent
but is back up and running when the second request is sent; again,
all works as desired (Case 2). Finally, the server may in fact
receive the WRITE request, issue the write to its disk, and send a
reply. This reply may get lost (Case 3), again causing the client to re-
send the request. When the server receives the request again, it will
simply do the exact same thing: write the data to disk and reply that
it has done so. If the client this time receives the reply, all is again
well, and thus the client has handled both message loss and server
failure in a uniform manner. Neat!
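
The three cases can be replayed in a small simulation. This is a sketch under assumptions: the Timeout exception, the send() helper, and the scripted list of "fates" are all invented for illustration, standing in for a real network and a real timer.

```python
class Timeout(Exception):
    pass


class Server:
    def __init__(self):
        self.disk = bytearray(8)
        self.writes_applied = 0

    def write(self, offset, data):
        self.disk[offset:offset + len(data)] = data   # same bytes, same place
        self.writes_applied += 1
        return "ok"


def send(server, fate, offset, data):
    """Deliver one WRITE; `fate` scripts what the network or server does."""
    if fate in ("request_lost", "server_down"):
        raise Timeout()               # Cases 1 and 2: the write never happens
    reply = server.write(offset, data)
    if fate == "reply_lost":
        raise Timeout()               # Case 3: the write happened, reply vanished
    return reply


def write_with_retry(server, fates, offset, data):
    for attempt, fate in enumerate(fates, start=1):
        try:
            send(server, fate, offset, data)
            return attempt
        except Timeout:
            continue                  # one rule for every kind of failure: resend
    raise Timeout()


server = Server()
attempts = write_with_retry(
    server, ["request_lost", "server_down", "reply_lost", "delivered"], 2, b"ab")
```

The server ends up applying the write twice (once before the lost reply, once on the final attempt), yet the disk contents are exactly what a single write would have produced, which is why the client can treat every timeout the same way.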
A small aside: some operations are hard to make idempotent. For
example, when you try to make a directory that already exists, you
are informed that the mkdir request has failed. Thus, in NFS, if the
file server receives a MKDIR protocol message and executes it suc-
cessfully but the reply is lost, the client may repeat it and encounter
that failure when in fact the operation at first succeeded and then
only failed on the retry. Thus, life is not perfect.
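
A tiny sketch shows the problem; DirServer is a hypothetical stand-in for a file server, and the lost reply is simply imagined between the two calls.

```python
class DirServer:
    def __init__(self):
        self.dirs = set()

    def mkdir(self, name):
        if name in self.dirs:
            raise FileExistsError(name)   # second attempt is rejected
        self.dirs.add(name)
        return "ok"


server = DirServer()
server.mkdir("photos")            # succeeds, but imagine the reply is lost
try:
    server.mkdir("photos")        # the client's retry of the same request
    retry_result = "ok"
except FileExistsError:
    retry_result = "exists"       # client sees failure for an operation that worked
```

The directory exists on the server, but the only answer the client ever receives says the request failed.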
[Figure: the cache consistency problem. Client C1 caches F[v1], client C2
caches F[v2], and client C3’s cache is empty; server S’s disk holds F[v1]
at first and F[v2] eventually.]
client C2 may buffer its writes in its cache for a time before propagat-
ing them to the server; in this case, while F[v2] sits in C2’s memory,
any access of F from another client (say C3) will fetch the old ver-
sion of the file (F[v1]). Thus, by buffering writes at the client, other
clients may get stale versions of the file, which may be undesirable;
indeed, imagine the case where you log into machine C2, update F,
and then log into C3 and try to read the file, only to get the old copy!
Certainly this could be frustrating. Thus, let us call this aspect of the
cache consistency problem update visibility; when do updates from
one client become visible at other clients?
The second subproblem of cache consistency is a stale cache; in
this case, C2 has finally flushed its writes to the file server, and thus
the server has the latest version (F[v2]). However, C1 still has F[v1]
in its cache; if a program running on C1 reads file F, it will get a stale
version (F[v1]) and not the most recent copy (F[v2]). Again, this may
result in undesirable behavior.
NFSv2 implementations solve these cache consistency problems
in two ways. First, to address update visibility, clients implement
what is sometimes called flush-on-close consistency semantics; specif-
ically, when a file is written to and subsequently closed by a client ap-
plication, the client flushes all updates (i.e., dirty pages in the cache)
to the server. With flush-on-close consistency, NFS tries to ensure
that an open from another node will see the latest file version.
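
Flush-on-close can be sketched as below. The Server and Client classes are hypothetical, the file is reduced to a version string, and read caching is deliberately left out so that only the write-buffering behavior is visible.

```python
class Server:
    def __init__(self):
        self.files = {"F": "v1"}


class Client:
    def __init__(self, server):
        self.server = server
        self.dirty = {}               # name -> contents written but not yet sent

    def write(self, name, contents):
        self.dirty[name] = contents   # buffered in client memory only

    def close(self, name):
        if name in self.dirty:        # flush-on-close: push dirty data now
            self.server.files[name] = self.dirty.pop(name)

    def open_and_read(self, name):
        return self.server.files[name]   # no read cache in this sketch


server = Server()
c2, c3 = Client(server), Client(server)
c2.write("F", "v2")
before_close = c3.open_and_read("F")   # C2 has not closed yet
c2.close("F")
after_close = c3.open_and_read("F")    # C2's update is now on the server
```

C3 sees the old version while the update sits in C2's buffer, and the new version once C2 has closed the file.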
Second, to address the stale-cache problem, NFSv2 clients first
check to see whether a file has changed before using its cached con-
tents. Specifically, when opening a file, the client-side file system will
issue a GETATTR request to the server to fetch the file’s attributes.
The attributes, importantly, include information as to when the file
was last modified on the server; if the time-of-modification is more
recent than the time that the file was fetched into the client cache, the
client invalidates the file, thus removing it from the client cache and
ensuring that subsequent reads will go to the server and retrieve the
latest version of the file. If, on the other hand, the client sees that it
has the latest version of the file, it will go ahead and use the cached
contents, thus increasing performance.
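
The validation step might look like the following sketch. It follows the rule as stated in the text (compare the server's modification time with the time the file was fetched) and therefore assumes client and server share a clock; the class names and the integer "ticks" are illustrative assumptions only.

```python
class Server:
    def __init__(self):
        self.data = "v1"
        self.mtime = 100              # last-modified time, in arbitrary ticks
        self.getattr_calls = 0
        self.read_calls = 0

    def getattr(self):
        self.getattr_calls += 1
        return {"mtime": self.mtime}

    def read(self):
        self.read_calls += 1
        return self.data

    def write(self, data, now):
        self.data, self.mtime = data, now


class Client:
    def __init__(self, server):
        self.server = server
        self.cached = None            # (data, time fetched) or None

    def open_and_read(self, now):
        attrs = self.server.getattr()             # GETATTR on every open
        if self.cached is not None and attrs["mtime"] > self.cached[1]:
            self.cached = None        # changed after we fetched it: invalidate
        if self.cached is None:
            self.cached = (self.server.read(), now)
        return self.cached[0]


server = Server()
c1 = Client(server)
r1 = c1.open_and_read(now=110)       # miss: fetch v1 from the server
r2 = c1.open_and_read(now=120)       # validated by GETATTR, served from cache
server.write("v2", now=130)          # another client's flush reaches the server
r3 = c1.open_and_read(now=140)       # mtime 130 > fetch time 110: refetch
```

Note that every open still costs one GETATTR, even when the cached copy turns out to be valid.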
When the original team at Sun implemented this solution to the
stale-cache problem, they discovered a new problem: suddenly, the NFS
server was flooded with GETATTR requests. A good engineering
principle to follow is to design for the common case, and to make
it work well; here, although the common case was that a file was
accessed only from a single client (perhaps repeatedly), the client
always had to send GETATTR requests to the server to make sure no
one else had changed the file. A client thus bombarded the server,
constantly asking “has anyone changed this file?”, when most of the
time no one had.
To remedy this situation (somewhat), an attribute cache was added
to each client. A client would still validate a file before
accessing it, but most often would just look in the attribute cache
to fetch the attributes. The attributes for a particular file were
placed in the cache when the file was first accessed, and then would
time out after a certain amount of time (say, three seconds). Thus,
during those three seconds, all file accesses would determine that
it was OK to use the cached file and thus do so with no network
communication with the server.
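
A minimal sketch of such an attribute cache follows. The three-second timeout is the example value from the text; the class names, the explicit `now` parameter (used instead of a real clock so the behavior is reproducible), and the single-file simplification are assumptions.

```python
ATTR_TIMEOUT = 3                      # seconds; the example value from the text


class Server:
    def __init__(self):
        self.mtime = 100
        self.getattr_calls = 0

    def getattr(self):
        self.getattr_calls += 1
        return {"mtime": self.mtime}


class AttrCachingClient:
    def __init__(self, server):
        self.server = server
        self.attrs = None             # cached attributes for one file
        self.fetched_at = None        # when they were placed in the cache

    def get_attrs(self, now):
        expired = self.fetched_at is None or now - self.fetched_at >= ATTR_TIMEOUT
        if expired:
            self.attrs = self.server.getattr()    # the only network traffic
            self.fetched_at = now
        return self.attrs


server = Server()
client = AttrCachingClient(server)
for t in (0, 1, 2):                   # three accesses inside the window
    client.get_attrs(now=t)
calls_in_window = server.getattr_calls
server.mtime = 200                    # the file changes on the server
stale = client.get_attrs(now=2.9)["mtime"]   # still the old answer
fresh = client.get_attrs(now=3)["mtime"]     # timeout expired: ask again
```

The cache cuts three GETATTR requests down to one, at the price of a window during which the client may act on out-of-date attributes.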
we might expect the final result after these writes to be like this:
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
The x’s, y’s, and z’s would be overwritten with a’s, b’s, and c’s,
respectively.
Now let’s assume for the sake of the example that these three client
writes were issued to the server as three distinct WRITE protocol
messages. Assume the first WRITE message is received by the server
and issued to the disk, and the client is informed of its success.
Now assume the second write is just buffered in memory, and the
server also reports its success to the client before forcing it to
disk; unfortunately, the server crashes before writing it to disk.
The server quickly restarts and receives the third write request,
which also succeeds.
Thus, to the client, all the requests succeeded, but we are sur-
prised that the file contents look like this:
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy <--- oops
cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
Yikes! Because the server told the client that the second write was
successful before committing it to disk, an old chunk is left in the file,
which, depending on the application, might result in a completely
useless file.
To avoid this problem, NFS servers must commit each write to stable
(persistent) storage before informing the client of success. A
client that receives no reply (because the server failed during the
write) can then detect the failure and retry until the write finally
succeeds, which ensures we will never end up with file contents
intermingled as in the above example.
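
The scenario above can be replayed in miniature. In this sketch each block is a single character, and the Server class, its sync_writes flag, and the crash_and_restart() and flush() helpers are hypothetical names chosen for illustration.

```python
class Server:
    def __init__(self, sync_writes):
        self.sync_writes = sync_writes
        self.disk = list("xyz")       # three blocks of old data; survives a crash
        self.buffer = {}              # block index -> data; lost on a crash

    def write(self, block, data):
        if self.sync_writes:
            self.disk[block] = data   # force to stable storage before replying
        else:
            self.buffer[block] = data # reply now, write to disk "later"
        return "success"

    def crash_and_restart(self):
        self.buffer = {}              # buffered writes vanish

    def flush(self):
        for block, data in self.buffer.items():
            self.disk[block] = data
        self.buffer = {}


def run(server):
    server.write(0, "a")
    server.flush()                    # the first write reaches the disk
    server.write(1, "b")              # acknowledged to the client...
    server.crash_and_restart()        # ...and then the server crashes
    server.write(2, "c")
    server.flush()
    return "".join(server.disk)


unsafe = run(Server(sync_writes=False))
safe = run(Server(sync_writes=True))
```

With buffering, the file ends up as "ayc" even though all three writes reported success; with synchronous commits it is "abc".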
The problem that this requirement gives rise to in NFS server
implementation is that write performance, without great care, can be
the major performance bottleneck. Indeed, some companies (e.g.,
Network Appliance) came into existence with the simple objective of
building an NFS server that can perform writes quickly. One trick
they use is to first put writes in battery-backed memory, which lets
the server reply quickly to WRITE requests without fear of losing
the data and without the cost of having to write to disk right away;
the second trick is to use a file system design specifically
designed to write to disk quickly when one finally needs to do so
[HLM94, RO91].
47.12 Summary
We have now introduced the NFS distributed file system.
NFS is centered around the idea of simple and fast recovery in the
face of server failure, and achieves this end through careful protocol
design. Idempotency of operations is essential; because a client can
safely replay a failed operation, it is OK to do so whether or not the
server has executed the request.
We also have seen how the introduction of caching into a multiple-
client, single-server system can complicate things. In particular, the
system must resolve the cache consistency problem in order to be-
have reasonably; however, NFS does so in a slightly ad hoc fashion
which can occasionally result in observably weird behavior. Finally,
References
[S86] “The Sun Network File System: Design, Implementation and Experience”
Russel Sandberg
USENIX Summer 1986
The original NFS paper. Frankly, it is pretty poorly written and makes some of the behaviors of
NFS hard to understand.