Grafting in VINO

1. Introduction

VINO is an extensible operating system that allows applications to dynamically replace or augment kernel functionality on a per-application basis. That is, applications can invoke specialized versions of particular operating system modules to enhance their performance or functionality. For example, it has been shown that database applications can enjoy improved performance by using application specific prefetching and caching [1]. In VINO, such a process can provide its own page replacement and read-ahead functions to facilitate this application-specific optimization. Similarly, client server applications can often benefit from a cooperative scheduling algorithm [5]. In VINO, such applications can download a scheduler that provides group scheduling among the participating applications.

While conventional systems often provide the ability to dynamically download kernel modules (e.g., device drivers in Solaris or entire kernel modules in AIX), such run-time extensions are global (i.e., they apply to the entire system) and run with all the permissions and access of the rest of the kernel. In VINO, since extensions are per-process, we must limit their sphere of influence to the process(es) for which they have been downloaded, and we must protect the rest of the kernel from errant or misbehaved extensions. VINO's process of downloading extensions in a safe fashion is called Grafting requires two mechanisms: software fault isolation [4] and lightweight transactions [2]. Software fault isolation prevents grafts from accessing memory or executing functions to which the graft does not have access. The lightweight transaction mechanism provides the ability to recover after an errant or malicious graft has changed kernel state.

There are two different types of grafts: function grafts and event grafts. A function graft replaces an existing function in the kernel with an application specific one (e.g., a different page replacement policy). An event graft provides new functionality (e.g., placing a web server or NFS server entirely in the kernel). Event graft functionality is not supported in the current VINO release.

The rest of this document is organized as follows: Section 2 describes how to write grafts. Section 3 provides a detailed discussion about the mechanisms that support safe grafting. Section 4provides some trouble-shooting tips for the graft writer. The Appendices list the currently supported graft points.

2. Writing Grafts

There are 4 basic steps required to use grafting in VINO:

  1. Identify the kernel function (graft point) that you wish to replace.
  2. Write the new function with which you want to replace the existing kernel function.
  3. Compile the graft using the regular VINO compiler, MiSFIT [3], and the regular VINO assembler (this produces a .o file).
  4. Use the VINO grafting system call interface (list the two functions here) to dynamically load the graft.

This document will address items 2-4. We are in the process of implementing a self-monitoring system that will assist application writers in pinpointing the appropriate kernel functions to replace, but that discussion is beyond the scope of this document.

For the rest of this section, we will use the following example to illustrate the different steps to grafting.

Assume that we have a database application that maintains a large user-level cache of shared pages. The default VM page eviction policy in VINO is a two-handed clock. However, instrumentation has indicated that sometimes the user-level buffers get paged out, which leads to double-buffering and bad performance. We would like to implement a graft that tries not to evict pages from the user-level cache. We must first identify the appropriate graft point. A quick perusal of Appendix 1 points out the vas_o::pick_victim graft point and identifies it as the place where we can make this replacement.

The simplest way to write a graft is to begin with the existing code and modify it. In this case, we are looking for the pick_victim method in the vas_o class, which is implemented in the file vas.C. All graft points bear VINO specific decoration in the code. To locate the existing implementation of pick_victim, one looks for:

	status_t
	DEFAULT_GRAFT(vas_o, pick_victim,
	    (list_o<coremap_entry_o *=""> *opt_pages,
	    coremap_entry_o *proposed_victim, coremap_entry_o **my_choice))

or more generally:

	status_t
	DEFAULT_GRAFT(class_name, function, (args))

where "class_name" and "function" are the key strings for which you are looking.

The DEFAULT_GRAFT annotation is a macro that expands to:

		class_name ## ::_default_ ## function args

or in the case of pick_victim:

		status_t
		vas_o::_default_pick_victim (list_o<coremap_entry_o *="">
		    *opt_pages, coremap_entry_o *proposed_victim,
		    coremap_entry_o **my_choice)

This function is called with the list of all pages available for eviction (opt_pages) and a proposed victim (proposed_victim), based on the kernel's default page eviction policy (LRU). The function returns its designated victim in the my_choice parameter. The default function simply returns the proposed victim:

	{
		*mychoice = proposed_victim;
		return (SUCCESS);
	}

while a user-supplied function might perform a more complicated operation.

First, all grafts must include the file "graft-support.h". Now, let us assume that the beginning and ending virtual addresses of the user's buffer region is stored (this is actually done via global variables misfit_src_seg and misfit_dest_seg). Then, rather than simply returning the victim, the application's pick_victim function would verify that the proposed victim is not in the buffer region. If it were, then the function must traverse the list of pages, and select a suitable alternative. Below is an example of such a function:

#include "graft-support.H"
status_t vas_o::pick_victim(list_o<coremap_entry_o *=""> *opt_pages,
		coremap_entry_o *proposed_victim,
		coremap_entry_o **my_choice))
{
        coremap_entry_o * option;
	link_o<coremap_entry_o *=""> *alt_link;

	// Check if the proposed victim is not one of the user's 
	// database buffers.
	if (opt_pages->find (proposed_victim, alt_link) != SUCCESS) {
		*mychoice = proposed_victim;
		return (SUCCESS);
	}

	// Look through the list for an alternative.
	for (option = opt_pages; opt_pages != NULL;
		opt_pages = opt_pages->next) {
		if (option->get_pagestate() == PAGE_ACTIVE) {
			*mychoice = option;
			return (SUCCESS);
	}

	return (ENOTFOUND);
}

It is important to note that all graft functions must return a status_t return value.

2.1 Compiling Grafts

Once the application's graft is written, it must be properly compiled for grafting into VINO. As MiSFIT operates on assembly code, it should be called between the compilation and assembly phases. The source file (e.g., victim.C) is compiled using the standard VINO compiler with the -S option (rather than -c), producing victim.s. Next, run MiSFIT on the assembly output.

The command line syntax for MiSFIT is:

misfit [-crws] [-i input_file] [-o output_file]

where:

-c insert indirect-call protection code
-r insert read protection code
-w insert write protection code
-s insert stack protection code

For our example graft, a proper MiSFIT command might be:

misfit -wc -i victim.s -o victim.misfit

Finally, using the standard VINO assembler, assemble the transformed assembly code into an object file:

as victim.misfit -o victim.o

2.2 Running a Graft

VINO supports two grafting system calls:

vino_graft_lookup
returns a handle for a requested function in the graft name space.
vino_graft_load
installs a piece of code at the designated graft point.

The application program that wishes to use the graft must be modified to locate the graft point and actually load the graft. For this example, let us assume that the victim.o file, created above, has been placed on the vino file system in /home/grafts/victim.o. An appropriate function to graft in the new function is shown below.

	status_t
	do_graft()
	{
		graft_o *gp;
		status_t ret;

		ret = vino_graft_lookup("vas_o::pick_victim", &gp);
		if (ret != SUCCESS) {
			printf("Could not locate graft point %s\n",
			    "vas_o::pick_victim");
			return (ret);
		}
		ret = vino_graft_load(gp, "/home/grafts/victim.o");
		if (ret != SUCCESS) {
			printf("Unable to load graft %s\n", 
			    "vas_o::pick_victim");
			return (ret);
		}
		return (SUCCESS);
	}

An application may unload a graft using the vino_graft_unload system call.

The following summarizes the steps required to install a graft into VINO:

  1. Identify the graft point on which to install a graft.
  2. Copy the default implementation of the function being replaced.
  3. Modify the default function to implement the desired algorithm. Be sure to add the include: #include "graft-support.H".
  4. Compile the graft to assembly.
  5. Run misfit on the assembly.
  6. Assemble the graft into a .o.
  7. Move the .o onto the VINO target system (where the application executable also resides).
  8. Run the application, which should include calls to vino_graft_lookup and vino_graft_load.

3. Graft Support Mechanisms

3.1 Graft Safety

Since grafts are inherently untrusted (i.e., we have no guarantees that they are written in a correct manner with respect to the rest of the kernel), the interface between grafts and the main kernel must be restricted. In particular, we must perform rigorous checking of the parameters a graft passes to other kernel functions and the values that it returns to its caller. This rigorous interface is analogous to, but different from, the system call interface.

In VINO, all functions that may be called directly from a graft have been decorated with the GRAFT_CALLABLE annotation. GRAFT_CALLABLE simply indicates that the function has been written in such a way that it may safely be called from a graft and it will perform all necessary argument checking so that a malicious graft can neither crash the system nor violate the security of the system by simply passing bad parameters. For example, our pick_victim graft above calls the coremap_entry_o::get_pagestate method. If get_pagestate did not perform rigorous argument checking, a malicious graft might ask for the page state of a page that does not belong to its calling process and pass information between two untrusted processes.

All such graft callable functions are listed in Appendix 2. The annotations shown below appear in the appropriate include file.

in "coremap_entry.H" in the class definition of a coremap_entry_o:
	GRAFT_CALLABLE  page_state_t get_pagestate(void) {

		return (page_state);

	};

At this point, we've discussed the interface issues for graft safety; now we have to discuss the recoverability issues.

In addition to performing argument checking, any function either directly or indirectly callable from a graft must take whatever actions are necessary to recover in the case of graft failure or abortion. Since grafts are untrusted code, it must be possible for the system to kill a graft at any point in time and return the kernel to a consistent, pre-graft-execution state. In VINO, we use transactions to provide this recoverability. At graft invocation time, VINO begins a transaction before calling the graft. When the graft returns, the transaction is committed. In order to provide for the potential that the system may abort the transaction, every function called in the context of the graft's transaction must take precautions to insure that it can "undo" whatever changes it has made in the face of an abort. Functions that take these precautions are called graft safe and are annotated with the macro GRAFT_SAFE. By definition, all GRAFT_CALLABLE functions must also beGRAFT_SAFE.

3.2 Making Functions GRAFT_SAFE

In implementing graft callable and graft safe functions in VINO, two different logs are used: undo and commit. The first is used as the log in case the transaction aborts and the second in case it commits. At any time when a system parameter or variable is changed, the kernel needs to log the change to either the undo log or the commit log. The undo log is used when a system variable can easily be changed. For example, changing a parameter from one value to another can be undone. The commit log is used when a system change can't easily be undone. For example, deleting an object can not be undone. In this case, the commit log delays the actual deletion of the element until the transaction completes. If the kernel were to delete the object, if the transaction aborts, it would be impossible to get the deleted object back. By postponing the operation until the transaction commits, the kernel's work is simplified. However, from the graft's point of view the element is actually deleted.

The following is an example of a GRAFT_SAFE function that alters the status of a page in memory. A page's status in memory may be free, active, inactive or wired.

	page_state_t 
	coremap_entry_o::set_pagestate(page_state_t new_page)
        {
	        undo_log_1(undo_pageset, get_pagestate());        
		page_state = (page_state_t) new_page;
        }

Before changing the state of the page, the function pushes a record onto the undo stack that will call the function undo_pageset() with the old value of the state. If the transaction aborts,undo_pageset() will be called; if the transaction commits, the new state value will be left alone.

The specific undo log function depends on the number of system parameters to be logged. For example, to place an undo log of one system parameter, use undo_log_1. This macro expands to:

	#define undo_log_1(fn, a)                               
                {                                       
                        undo_log_head();                
                        undo_log_arg(a);                
                        undo_log_tail(fn,               
                                 log_round(sizeof(a)))  
                }

If one were using commit logs, the commit_log_1() macro would be used instead.

3.3 Sandboxing

A graft only accesses variables and code within its own allotted portion of the address space which is called a sandbox [4]. The boundaries are defined by a set of global variables. The graft code is prevented from writing or accessing any address outside its sandbox. This ensures that a graft is not permitted to read or alter system behavior to benefit itself. A malicious graft, for example, may wish to change the scheduler to use for itself 100 percent of the CPU utilization. Placing boundaries for grafts is similar to the management of processes in any operating system.

3.4 Graft header definitions

The method declaration for grafts (usually defined in header files) has a different format than normal. The pick_victim method described in section 2 has the following declaration.

        GRAFTABLE(status_t, vas_o, pick_victim, 
            (IN list_o<coremap_entry_o *=""> *dummy1, 
 	       IN coremap_entry_o *dummy2,
	       OUT PTR coremap_entry_o **my_choice));

The IN and OUT PTR in the parameter list above are examples of attributes. The attributes tell the stub compiler how to copy the arguments to the graft.

The possible attributes are:

IN
Copy the argument into the graft, but don't copy it back
OUT
Copy the result back to the caller, but don't copy it in (only makes sense with PTR and ARRAY).
INOUT
Copy the value in and out again (only makes sense with PTR and ARRAY)
ARRAY(n)
The argument is an array of n elements (n can be a variable)
PTR
The argument is a ptr to one element (shorthand for ARRAY(1)).
STRING
The argument is a NULL terminated string (shorthand for ARRAY(1+strlen(s)))

The OUT and INOUT attributes make sense for pointer, array and string arguments. This is because C passes arrays by reference and non-arrays by value.

4. Troubleshooting

a) If the linker complains that it can't find a graft like pick_victim, it probably means that the stub generator did not correctly create the stub. This may be because the stub generator parser is picky about syntax. For example, the syntax in the header file function declarations must be precise. The syntax requires that all arguments in the prototype have names. For example, the following won't be parsed correctly and thus won't be used to generate a stub:

         GRAFTABLE(coremap_entry_o *, vas_o, pick_victim, 
             (IN list_o<coremap_entry_o *=""> *, IN coremap_entry_o *));

This can be solved by placing argument names as such:

         GRAFTABLE(coremap_entry_o *, vas_o, pick_victim, 
             (IN list_o<coremap_entry_o *=""> * dummy1, 
              IN coremap_entry_o * dummy2));

b) If the compiler has problems with the return type of the stub, it may be that the graft was defined to return something other than a status_t. A grafted function must return status_t.

c) If you write a graft that calls a non-GRAFT_CALLABLE method, the kernel will complain when loading the graft into the graft namespace. See Appendix 2 for a list of all GRAFT_CALLABLE methods.

d) If while compiling the graft, the system complains that it does not understand the GRAFTABLE macro, make sure you have included the graft-support.H header file.

Bibliography

[1] Pei Cao, et al, Application-controlled File Caching Policies, Proceedings of the 1994 Summer USENIX Technical Conference, Boston, MA, June 1994, 171-182.

[2] Margo Seltzer, Yasuhiro Endo, Christopher Small, Keith Smith, Dealing with Disaster: Surviving Misbehaved Kernel Extensions, Proceedings of the 1996 Symposium on Operating System Design and Implementation (OSDI II).

[3] Christopher Small, A Tool For Constructing Safe Extensible C++, Systems COOTS '97.

[4] R. Wahbe et al., Efficient Software-Based Fault Isolation, Proceedings of the 14th SOSP, 1993.

[5] Carl Waldspurger and William E. Weihl, Lottery Scheduling: Flexible Proportional-Share Resource Management, OSDI94 USENIX Association 1-11.

Appendix 1: GRAFTABLE methods

Virtual Memory:

  • pm_mgr::stop_pageout
  • pm_mgr::awake_pagedaemon
  • pm_mgr::swap_pageout_order
  • vas_o::pick_victim

File System:

  • open_file_o::compute_ra
  • active_file_cache_o::find (not yet implemented)
  • as_mgr_o::alloc_pages (not yet implemented)
  • as_mgr_o::free_pages (not yet implemented)
  • ffs_active_file_o::balloc (not yet implemented)
  • ffs_active_file_o::bmap (not yet implemented)
  • ffs_active_file_o::free (not yet implemented)
  • ffs_active_file_o::~ffs_active_file_o (not yet implemented)
  • ffs_dir_o::component_lookup (not yet implemented)
  • ffs_dir_o::ufs_direnter (not yet implemented)
  • ffs_dir_o::ufs_dirremove (not yet implemented)
  • ffs_fs_o::inode_alloc (not yet implemented)
  • ffs_fs_o::inode_free (not yet implemented)
  • ffs_storage_mgr_o::alloc (not yet implemented)
  • ffs_storage_mgr_o::ffs_alloccg (not yet implemented)
  • ffs_storage_mgr_o::ffs_alloccgblk (not yet implemented)
  • ffs_storage_mgr_o::ffs_hashalloc (not yet implemented)
  • ffs_storage_mgr_o::free (not yet implemented)
  • ffsactive_file_o::write_intent (not yet implemented)
  • file_mem_o::read (not yet implemented)
  • file_mem_o::write (not yet implemented)
  • indir_mem_o::check_alloc (not yet implemented)
  • indir_mem_o::free (not yet implemented)
  • name_cache_o (not yet implemented)
  • open_file_o::read (not yet implemented)
  • open_file_o::seek (not yet implemented)
  • open_file_o::write (not yet implemented)

Scheduler (not yet implemented):

  • thread_o::get_pick

Collections (not yet implemented):

  • allocator_o::allocate
  • allocator_o::deallocate
  • allocator_o::new_allocate
  • allocator_o::relative_allocate
  • array_iter_o::cur
  • array_iter_o::next
  • array_iter_o::prev
  • array_iter_o::rewind
  • array_o::add
  • array_o::contract
  • array_o::find
  • array_o::get
  • array_o::get_first
  • array_o::get_last
  • array_o::iter
  • array_o::member
  • array_o::nth
  • array_o::remove
  • array_o::size
  • bag_iter_o::cur
  • bag_iter_o::next
  • bag_iter_o::prev
  • bag_iter_o::rewind
  • bag_o::add
  • bag_o::do_intersect
  • bag_o::do_union
  • bag_o::iter
  • bag_o::member
  • bag_o::nth
  • bag_o::remove
  • bag_o::size
  • bitmap_o::allocate
  • bitmap_o::deallocate
  • bitmap_o::new_allocate
  • bitmap_o::relative_allocate
  • del_o::del_o
  • hash_iter_o::cur
  • hash_iter_o::curkey
  • hash_iter_o::hash_iter_o
  • hash_iter_o::next
  • hash_iter_o::prev
  • hash_iter_o::rewind
  • interval_o::interval_o
  • kp_o::kp_o
  • link_o::get_data
  • link_o::get_next
  • link_o::get_prev
  • list_iter_o::cur
  • list_iter_o::next
  • list_iter_o::prev
  • list_iter_o::rewind
  • lock_o::free
  • lock_o::free_unsafe
  • lock_o::free_unsync
  • lock_o::free_unsync_unsafe
  • lock_o::get_holder
  • lock_o::get_mode
  • lock_o::get_status
  • lock_o::set_status
  • pair_o::pair_o
  • ref_o::txnabort
  • ref_o::txncommit
  • txn_o::abort
  • txn_o::commit
  • txn_o::get_id
  • txn_o::get_parent
  • txn_o::lock
  • txn_o::prepare
  • vino_o::get_name
  • vino_o::set_name
  • vm_btree_o::add
  • vm_btree_o::find
  • vm_btree_o::find_all
  • vm_btree_o::find_next
  • vm_btree_o::find_nth
  • vm_btree_o::get
  • vm_btree_o::insert_nonfull
  • vm_btree_o::member
  • vm_btree_o::nth
  • vm_btree_o::size
  • vm_btree_o::split_child
  • vm_btree_o::split_root

Appendix 2: GRAFT_CALLABLE methods

Virtual Memory:

  • as_o::deref
  • as_o::deref_all
  • coremap_entry_o::get_owner_vas
  • coremap_entry_o::get_pa
  • coremap_entry_o::get_pagestate
  • coremap_entry_o::set_pagestate
  • coremap_entry_o::get_statelink
  • coremap_entry_o::get_owner_vas
  • coremap_entry_o::get_pa
  • vm_offset_t::bsd_pmap_extract
  • vm_offset_t::bsd_pmap_is_referenced
  • vm_offset_t::bsd_pmap_is_modified
  • vm_offset_t::bsd_pmap_rev_lookup

File System:

  • open_file_o::access
  • open_file_o::get_access_control
  • open_file_o::read
  • open_file_o::write
  • open_file_o::seek
  • open_file_o::sync
  • open_file_o::free
  • open_file_o::clone
  • open_file_o::get_attr
  • open_file_o::set_attr
  • open_file_o::get_memobj
  • fs_active_file_o::bmap
  • ffs_active_file_o::get_attr
  • ffs_active_file_o::set_attr
  • ffs_active_file_o::update
  • active_file_cache_o::resize
  • active_file_o::alloc (not yet implemented)
  • active_file_o::bmap (not yet implemented)
  • active_file_o::free (not yet implemented)
  • active_file_o::get_fid (not yet implemented)
  • active_file_o::get_fs (not yet implemented)
  • active_file_o::get_map_addr (not yet implemented)
  • active_file_o::get_map_addr (not yet implemented)
  • active_file_o::get_memobj (not yet implemented)
  • active_file_o::get_size (not yet implemented)
  • active_file_o::inc_ref_count (not yet implemented)
  • active_file_o::lock_owner (not yet implemented)
  • active_file_o::read_intent (not yet implemented)
  • active_file_o::ref_count (not yet implemented)
  • active_file_o::update (not yet implemented)
  • active_file_o::write_intent (not yet implemented)
  • component_name_o::flags (not yet implemented)
  • component_name_o::nameiop (not yet implemented)
  • component_name_o::nameptr (not yet implemented
  • ffs_extmap_o::deallocate (not yet implemented)
  • ffs_extmap_o::new_allocate (not yet implemented)
  • ffs_fs_o::inode_alloc (not yet implemented)
  • file_ext_o::set (not yet implemented)
  • file_mem_o::enqueue_ra (not yet implemented)
  • file_mem_o::prefetch (not yet implemented)
  • fs_as_mgr::alloc_bytes (not yet implemented)
  • fs_as_mgr::free_bytes (not yet implemented)
  • fs_o::create_active_file (not yet implemented)
  • hash_o::add (not yet implemented)
  • hash_o::get (not yet implemented)
  • indir_mop::write_intent (not yet implemented)
  • kern_as::map (not yet implemented)
  • kern_as::unmap (not yet implemented)
  • list_o::member (not yet implemented)
  • list_o::pop_head (not yet implemented)
  • list_o::remove (not yet implemented)
  • memobj::get_size (not yet implemented)
  • memobj::grow (not yet implemented)
  • mutex::free (not yet implemented)
  • mutex::get (not yet implemented)
  • storage_mgr_o::alloc (not yet implemented)
  • storage_mgr_o::get_blksize (not yet implemented)
  • storage_mgr_o::get_cginfo (not yet implemented)
  • storage_mgr_o::get_superblock (not yet implemented)
  • storage_mgr_o::read (not yet implemented)
  • storage_mgr_o::write (not yet implemented)