I was looking into a Java performance problem, and thought the problem might be connected to the performance of the Unix files in the ZFS file system. I found it hard to find useful information on how to investigate the problem. ZFS can produce a lot of information, but I found it hard to know which reports to look at, and what the key fields were.
This blog post gives the overall concept of a cached file system; it is based on my experience of other cached “file” systems. I have no special knowledge about ZFS. I hope it explains the concepts; it may not reflect reality.
It reminds me of a lecture at university, where they explained that matter was atoms with electrons whizzing around a small, solid nucleus. This was a good picture but entirely inaccurate. We then learned that the nucleus was composed of protons and neutrons. This was also a good picture, and entirely inaccurate. We then learned that protons and neutrons are composed of quarks, a good picture, but inaccurate. We then got into string theory and got knotted. Whichever picture you used, it helped with the understanding but was not accurate.
General background of cache systems.
A cached file system is common in IT. DB2 has buffer pools, MQ has buffer pools, and ZFS has a cache. The concepts are very similar. Over the years the technology has improved and it is now efficient. For example, all of the above systems use data in 4KB pages, and the I/O to external media has been optimised.
I like to think of the technology in different layers.
- The application layer, where the application does an fread(), MQGET or SQL query.
- The interface layer, which knows which records to get, which MQ message to get, or which tables, rows and columns to get. This layer has a logical view of the data, and will request the next level down: “Please get me the data for this 4KB page on this data set at this position.”
- The buffer manager layer. The aim of the buffer manager is to keep the optimum amount of data in cache, and minimise I/O.
- If the requested 4KB page is in the cache then return it. This counts as a cache hit.
- If it is not in the cache then call the data layer, and say please read this page from disk at this location, into this buffer. This counts as a cache miss.
- The buffer manager may have logic which can detect if a file is being read sequentially and perform read ahead. Logic like
- Read page 19 of the data set, wait for the I/O to complete, return
- Read page 20 of the data set, wait for the I/O to complete, return
- Read page 21 of the data set, wait for the I/O to complete, return
- Read page 22 … Hmm – this looks like a sequential read. Get pages 22 to 30 from the data set, wait for the I/O to complete, return page 22
- Read page 23 – get it from the buffer and return, no I/O
- Read page 24 – get it from the buffer and return, no I/O
- When a page has been updated, usually it is not written directly to the disk. It is more efficient to write multiple pages in one I/O, and it means the application does not have to wait for the I/O. This is often called “write behind” or “lazy write”. When the application has to be sure the write to disk has worked, for example after an fsync() request or a transactional commit, the requester has to wait until the I/O has completed. The write to the disk is a collection of pages, possibly from different applications, and is totally separate from the applications writing the records.
- If the cache fills up, the buffer manager is responsible for making space. This might be by reusing the space of pages which have not been used for a long time, or, if there are a lot of updated pages, by writing these out; or by doing both.
- If the same file is often used, then the pages should be in the buffer. If a file is used for the first time, it will need to be read in – some pages synchronously by the application, then pages read in by the read ahead processing.
- The data layer. This does the I/O to the disk or other external media.
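The buffer manager behaviour described above can be sketched in a few lines of Python. This is a hypothetical illustration (the class, method names, and the "three sequential requests" threshold are mine, not ZFS's), not how any real file system is written:

```python
# Hypothetical sketch of a buffer manager with read-ahead detection.
# A real file system tracks far more state; this just shows the idea.

class BufferManager:
    READ_AHEAD_WINDOW = 8  # pages to prefetch once a sequential pattern is seen

    def __init__(self, data_layer):
        self.data_layer = data_layer      # the data layer: does the real disk I/O
        self.cache = {}                   # page number -> 4KB page contents
        self.last_page = None             # last page requested
        self.sequential_run = 0           # consecutive "next page" requests seen
        self.hits = 0
        self.misses = 0

    def get_page(self, page):
        # Detect sequential access: each request is one past the previous one.
        if self.last_page is not None and page == self.last_page + 1:
            self.sequential_run += 1
        else:
            self.sequential_run = 0
        self.last_page = page

        if page in self.cache:
            self.hits += 1                # cache hit: no I/O
        else:
            self.misses += 1              # cache miss: one synchronous disk I/O
            self.cache[page] = self.data_layer.read(page)

        # After a few sequential requests, prefetch the next pages so that
        # later get_page calls are cache hits.
        if self.sequential_run >= 3:
            for p in range(page + 1, page + 1 + self.READ_AHEAD_WINDOW):
                if p not in self.cache:
                    self.cache[p] = self.data_layer.read(p)
        return self.cache[page]
```

Reading pages 19, 20, 21, 22 … sequentially, only the first few requests miss; once read ahead kicks in at page 22, the rest are served from the cache with no synchronous I/O, which is the pattern in the numbered example above.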
What statistics make sense?
The application can time the request and provide a true duration of the request.
At the interface level, one file read request may have resulted in many calls to the buffer manager. The first few “get page” requests may be slow because the buffer manager had to do I/O to the disk. After read ahead became active, the reads from the buffer were very fast. “The average get page time” may have little value.
It may be possible to record the number of synchronous disk writes an application did (the fsync() request), but if the write was a lazy write it will not be recorded against the file. One I/O might write 10 pages: four pages for this application, six pages for that application. Recording the duration of the lazy write for each application has no value.
You can tell how many read and write requests there were to a data set (file system), and how long these requests took. You can also record how many bytes were read or written to a data set.
Overall there may be many statistics that tell you what each level is doing, and how it is performing, but they may not be very helpful when looking from an application viewpoint.
Simple file access example
- fopen file name
fopen – Under the covers
Conceptually, the fopen may have logic like
zfs_open. This looks up to see if the file has been used before. It looks for the path name in the meta data cache. The meta data cache has information about the file, for example the file owner, the permissions for the owner, the last time the file was read, a pointer to the file system it is on, and its location on the file system.
If the path name is not in the meta data cache then go to the file system and get the information. To get the information for file /u/colin/doc/myhelp.txt it may have to get a list of the files under /u/colin, then find where the ‘doc’ directory is. Then get information on the files under /u/colin/doc; this has a record for myhelp.txt, which has information on where this file is on disk. Set “next page” to where the file is on disk. Each of these steps may need one or more pages to be read from disk.
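That lookup can be sketched as walking the path one component at a time, checking the meta data cache before going to disk. All names here are hypothetical, and a real meta data record holds much more than a path:

```python
# Hypothetical sketch of resolving a path name via a meta data cache.
# Each component that misses the cache costs one or more disk reads.

def resolve(path, meta_cache, read_directory_from_disk):
    """Return the metadata record for `path`, filling the cache as we go."""
    if path in meta_cache:
        return meta_cache[path]       # already known: owner, permissions, location...

    current = ""
    entry = None
    for component in path.strip("/").split("/"):
        current += "/" + component
        if current in meta_cache:
            entry = meta_cache[current]
        else:
            # Cache miss: read the directory from disk to find this
            # component, then remember the result for next time.
            entry = read_directory_from_disk(current)
            meta_cache[current] = entry
    return entry
```

Resolving /u/colin/doc/myhelp.txt for the first time touches the disk once per component (u, colin, doc, myhelp.txt); resolving it a second time is answered entirely from the meta data cache.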
fread – under the covers
The fread may have logic like
zfs_reads. Within this it may have logic like
- Get the next page value. Does this page exist in the cache? If so, return the contents; else read it from the file system, store it in the cache, update the next page pointer, and then return the contents.
- Loop until enough data has been read. As the pages are in 4KB units, reading a 10KB message will need 3 pages.
There are smarts: the code has read ahead support. If the system detects there has been a sequence of get next page requests, instead of “loop until enough data has been read” it can do
- Loop until enough data has been read, and start reading the next N pages, ready for the next request from the application.
By the time the application issues the next fread request, the data it needs may already have been read from disk. To the application it looks like there was no file I/O.
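The page arithmetic behind "a 10KB message will need 3 pages" is just a ceiling division:

```python
# Data is held in 4KB pages, so a read of N bytes touches
# ceil(N / 4096) pages.

PAGE_SIZE = 4096

def pages_needed(nbytes):
    # Integer ceiling division: round up to the next whole page.
    return (nbytes + PAGE_SIZE - 1) // PAGE_SIZE
```

For example, a 10KB (10240 byte) read needs 3 pages, because 10240 / 4096 = 2.5, rounded up.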
There may also be calls to zfs_getattrs, zfs_lookups.
fwrite – under the covers
The fwrite does not write directly to the disk. It writes some log data, and writes the data to the cache. This is known as “dirty data” because it has been changed. There is an internal process that writes the data out to the file system. Writing many pages to the file system in one I/O is more efficient than writing one page at a time in many I/Os.
Applications can use the fsync() request to force the writes to the disk.
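Write behind can be sketched like this. It is a hypothetical illustration (names are mine): fwrite just dirties pages in the cache and returns immediately, while a flush, forced synchronously by fsync(), writes all the dirty pages in one combined I/O:

```python
# Hypothetical sketch of write behind ("lazy write").
# write_page() only dirties the cache; flush() does one I/O for many pages.

class WriteBehindCache:
    def __init__(self, disk):
        self.disk = disk
        self.cache = {}        # page number -> contents
        self.dirty = set()     # pages changed but not yet on disk
        self.io_count = 0      # number of disk I/Os issued

    def write_page(self, page, data):
        self.cache[page] = data
        self.dirty.add(page)   # no disk I/O yet: the application returns at once

    def flush(self):
        if self.dirty:
            # One I/O writes every dirty page: cheaper than one I/O per page.
            self.disk.write_many({p: self.cache[p] for p in self.dirty})
            self.io_count += 1
            self.dirty.clear()

    def fsync(self):
        # The application waits here until the data really is on disk.
        self.flush()
```

An application can write ten pages with no disk I/O at all; the single I/O happens at the flush, which is why per-application lazy-write timings (discussed earlier) have so little value.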
Characteristics of the cached file system
It changes over time
The behaviour of the ZFS cache will change over time.
At start up, as files are used, the files will be read from disk into the cache.
Once the system has “warmed up”, the frequently used files will be in the cache, and should not need to be read from disk.
You could IPL the system at 0600, and for the first hour it warms up, and the cache settles down to a steady state for the rest of the day. In the evening, you may start other applications for the overnight processing, and these new applications will have a warm up period, and the cache will reach another steady state.
Data in the cache
Data in the file cache can be
- Read only, for example a Java program uses .jar files to execute. Some .jar files may be used by many applications and be accessed frequently.
- Read only application data. For example a list of names.
- Write application data – for example an output list of names, a trace or log file. For some writes this may be an update and so the previous contents need to be read in.
Read only jar files
The cache needs to be big enough to hold the files. Once the files have been read in, there may be no reads from the file system. Any files that had not been used before will need I/O to the file system. If the cache is not big enough then some of the data can be thrown away, and reloaded next time it is needed.
Read only application data
This data may only be used once a day. Typically it will be read in as needed, and once it has been used the cache storage could be stolen and reused for other applications.
Write application data
If the write updates existing pages, for example writing to the end of a file, or updating a record within the file, then the pages will need to be in the cache. This may require disk I/O, or the page may be in the cache from a previous operation.
If the data is written to an empty page, then the page need not be in the cache before it is written to. Once the page has been updated, it will be written asynchronously, as it is more efficient to write multiple pages in one I/O than multiple I/Os with just one page.
File system activity
A program product file system
This will typically be used read only (even though it may be mounted read/write), so you can expect pages read from the data set, but no write I/O.
A user will typically both read and write files, so there will be read I/O and write I/O.
Subsystems like the Liberty Web Server (and so products like z/OSConnect, z/OSMF, ZOWE) will have read and write activity, as configuration information is used, and data is written to logs.
What happens as the cache is used?
Writing to a file
When data is written to a file, the cache gets updated. Modified pages get queued to be written to disk when
- A segment of 64KB of data for a file is filled up.
- The application does a fsync() request to say write the file out to disk.
- The file is closed
- The interval set by zfsadm config -sync_interval n has expired.
- The cache is very full of updated (dirty) pages: the so-called Page Reclaim Writes.
Reading a file
When a page has been used, it gets put on a queue, with the most recently used pages at the front. A hot page (with frequent use) will always be near the front of the queue. If all the pages have data and the buffer pool needs a buffer page, then the oldest page on this queue is stolen (or reclaimed) for the new request.
Ideally the buffer pool needs to be big enough so there is always unused space.
If you have a cache of size 100 pages and read a 50 page file, it will occupy 50 pages in the cache. The first time the file is used data will have to be read from disk. The second time the file is used, all the pages are in memory and there is no disk I/O.
If the cache is only 40 pages, then the first 40 pages of data will fill the cache. When page 41 is read it will replace the buffer with page 1 in it (the oldest page). When page 42 is read, it will replace the buffer with page 2 in it.
If you now read the file a second time – page 1 is no longer in the cache, so it will need to be read from disk, and will replace a buffer. All 40 pages will be read from disk.
Will making the cache bigger help? If you make the cache 45 pages it will have the same problem. If you make it 50 pages the file will just fit, and may still have a problem. If the cache is bigger than 50 pages the file should fit in, but other applications may be using other files, so you need to make the cache big enough for the 50 page file and any other files being used. There is nothing to tell you how big to make it. The solution seems to be to make the cache bigger until the I/O stops (or reduces), and you have 5-10% free pages. If you make the cache very large it might cause paging, so you have a balancing act. It is more important to have no paging, as paging makes it difficult for the buffer manager to do its job. (For example, it wants to write out a dirty page; it may first need to page the data in, then write it out!)
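The example above can be checked with a small simulation of a least recently used page cache. This is a simplification (real ZFS steals 64KB segments, not single pages, and has more policies), but it reproduces the thrashing:

```python
from collections import OrderedDict

# Simulate an LRU page cache of `cache_pages` buffers reading a
# `file_pages`-page file twice, and count the disk reads (misses).

def misses_for_two_passes(cache_pages, file_pages):
    cache = OrderedDict()    # oldest (least recently used) page first
    misses = 0
    for _ in range(2):                         # read the file twice
        for page in range(file_pages):
            if page in cache:
                cache.move_to_end(page)        # hit: page becomes most recent
            else:
                misses += 1                    # miss: one disk read
                if len(cache) >= cache_pages:
                    cache.popitem(last=False)  # steal the oldest page
                cache[page] = True
    return misses
```

With a 100 page cache and a 50 page file there are 50 misses in total (all on the first pass). With a 40 page cache every page of the second pass also misses, because page 1 has always just been stolen by the time it is wanted again: 100 misses in total. At exactly 50 pages the file just fits and the second pass is all hits.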
A page cannot be stolen (reused) if it needs to be written to disk. Once the contents have been written to disk the page can be stolen.
In reality it looks like blocks of 64 KB segments are used, not pages.
There is a VM statistic called Steal Invocations. This is a count of the number of 64KB blocks which were reused.
Overall performance objective
The cache needs to be big enough to keep frequently used files in the cache. If the cache is not big enough then it has to do more work, for example discard files, to make space, and reading files in from disk.
The system provides statistics on
- How big the cache is
- How many free pages there are
- How many segments have been stolen (should be zero)
- How many read requests were met from data in the cache (Cache hit), and so by calculation the number of requests that were not in the cache (cache miss), and required disk I/O.
Typically you will not achieve a cache hit ratio of 100%, because some of the application data may not be hot.
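From the read counters above, the cache hit ratio is a simple calculation:

```python
# Cache hit ratio from the counters the system reports:
# requests met from the cache (hits) versus requests needing disk I/O (misses).

def hit_ratio_percent(hits, misses):
    total = hits + misses
    return 100.0 * hits / total if total else 0.0
```

For example, 900 hits and 100 misses is a 90% hit ratio; watching this figure (and the free page count) as you grow the cache is the tuning loop described earlier.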
A little whoops
I had a little whoops. I wrote to a file, and filled the file system. When I deleted the file, and tried again, it reported there was no space on the device. When I waited for a few seconds, and repeated the command, it worked! This shows there are background tasks running asynchronously which clean up after a file has been used.
Just to make it more complex
- ZFS uses 8KB as its “page” which is 2 * 4K pages on disk.
- Small files live in the meta data, and not in the user cache!
- There is also a Directory Backing Cache, also known as the Metadata Backing Cache. This seems to be a cache for the meta data which doesn’t have the same locking. It is described in a Share presentation from 2012, zFS Diagnosis I: Performance Monitoring and Tuning Guidelines. From the more recent documentation it looks as if this has been rolled into the meta data cache.
The sysplex support makes it just a little more complex.
The ZFS support behaves like a client server.
One LPAR has the file system mounted read/write, acting as the server. Other systems act as clients.
If SYSA has the file system mounted Read Write, and SYSB wants to access a file, it sends a request through XCF. The access is managed by use of Tokens, and a Token Cache.
If you display KNPFS (Kernel Nodes Physical File System?) you get operations such as zfs_open
- On Owner. On my single LPAR sysplex, I get values here
- On Client. These are all zeros for me.
SMF 92-51 provides statistics on the zfs verbs such as zfs_open
- Count of calls made to file systems owned locally or R/O file systems
- Count of calls that required a transmit to another sysplex member to complete for locally-owned file systems.
- Count of calls made to file systems owned remotely from this member.
- Count of calls that required a transmit to another sysplex member to complete for remotely-owned file systems.
- Average number of microseconds per call for locally-owned file systems.
- Average number of microseconds per call for remotely-owned file systems
The ZFS configuration is driven from the SYS1.PARMLIB(BPXPRM00) member, with
FILESYSTYPE TYPE(ZFS) ENTRYPOINT(IOEFSCM)
This can have PRM=(aa, bb, …, zz) for SYS1.PARMLIB(IOEPRMaa)… It defaults to parmlib member IOEPRM00. See here for the contents.