

   
5.2 Release Notes for Condor NT Preview 6.1.8

Released mid-October 1999, this is the first public release of Condor NT.

5.2.0.1 What is missing from Condor NT Preview 6.1.8?

In general, this preview release on NT works the same as the full-blown release of Condor for Unix.

However, the following items are still being worked on and are not supported in this preview:

5.2.0.2 What is included in Condor NT Preview 6.1.8?

Except for the functionality listed above, practically everything else works the same way in Condor NT Preview as it does in the full-blown release. This Preview release is based on the Condor 6.1.8 source tree, and thus the feature set is the same as 6.1.8. For instance, all of the following work in Condor NT:

  
5.2.1 Condor File Transfer Mechanism

Condor remote system calls and the ability to access network shares are not yet supported on NT -- they will be in the near future. For now, Condor NT users must use the Condor File Transfer mechanism.

When Condor finds a machine willing to execute your job, it will create a temporary subdirectory for your job on the execute machine. The Condor File Transfer mechanism will then send via TCP the job executable(s) and input files from the submitting machine into this temporary directory on the execute machine. After the input files have been transferred, the execute machine will start running the job with the temporary directory as the job's current working directory. When the job completes or is kicked off, Condor File Transfer will automatically send back to the submit machine any output files created by the job. After the files have been sent back successfully, the temporary working directory on the execute machine is deleted.

Condor's File Transfer mechanism has several features to ensure data integrity in a non-dedicated environment. For instance, transfers of multiple files are performed atomically.

5.2.1.1 File Transfer Submit-Description Parameters

Condor File Transfer behavior is specified at job submit time via the submit-description file and condor_submit. Along with all the other job submit-description parameters (see section 8), use the following new commands in the submit-description file:

transfer_input_files = < file1, file2, file... >
Use this parameter to list all the files which should be transferred into the working directory for the job before the job is started. Separate multiple filenames with a comma. By default, the file specified via the Executable parameter and any file specified via the Input parameter (i.e. stdin) are transferred.

transfer_output_files = < file1, file2, file... >
Use this parameter to explicitly list which output files to transfer back from the temporary working directory on the execute machine to the submit machine. Most of the time, however, there is no need to use this parameter. If transfer_output_files is not specified, Condor will automatically transfer back all files in the job's temporary working directory which have been modified or created by the job. This is usually the desired behavior. Explicitly listing output files is typically only done when the job creates many files and the user only wants to keep a subset of them. WARNING: Do not specify transfer_output_files in your submit-description file unless you really have a good reason -- it is almost always best to let Condor figure things out by itself based upon what the job actually wrote.

transfer_files = <ONEXIT | ALWAYS>
Setting transfer_files equal to ONEXIT will cause Condor to transfer the job's output files back to the submitting machine only when the job completes (exits). If not specified, ONEXIT is used as the default. Specifying ALWAYS tells Condor to transfer back the output files when the job completes or whenever Condor kicks off (preempts) the job from a machine prior to job completion (if, for example, activity is detected on the keyboard). The ALWAYS option is specifically intended for fault-tolerant jobs which periodically write out their state to disk and can restart where they left off. Any output files transferred back to the submit machine when Condor kicks off a job will automatically be sent back out again as input files when the job restarts.
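
As a rough sketch of how these commands fit together, a submit-description file for a fault-tolerant job might look like the following. The file names (foo.exe, my-input-data, my-other-input-data) are placeholders chosen for illustration, and the job is assumed to save its own state to disk periodically.

        # Illustrative only: all file names here are hypothetical.
        executable = foo.exe
        # Extra input files to copy into the job's temporary working
        # directory; the executable itself is transferred by default.
        transfer_input_files = my-input-data, my-other-input-data
        # Send output files back on exit and also whenever the job is
        # kicked off, so a restarted job can pick up its saved state.
        transfer_files = ALWAYS
        # transfer_output_files is deliberately left unset, so Condor
        # returns every file the job created or modified.
        queue

For a job that cannot resume from saved state, leaving transfer_files at its default of ONEXIT avoids shipping partial results back each time the job is preempted.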

5.2.1.2 Ensuring File Transfer has enough disk space

It is highly recommended that you specify a Requirements expression in your submit-description file that checks the size of the Disk attribute when using File Transfer! Doing so can ensure that Condor picks a machine with enough local disk space for your job. Here is a sample submit-description file:

        # Condor submit file for program "foo.exe".
        #
        # foo reads from files "my-input-data" and "my-other-input-data".
        # foo then writes out results into several files.
        # The total disk space foo uses for all input and output files
        # is never more than 10 megabytes.
        #
        executable = foo.exe
        # Now set Requirements saying that the machine which runs our job
        # must have more than 10megs of free disk space.  Note that "Disk"
        # is expressed in kilobytes; 10meg is 10000 kbytes.
        requirements = Disk > 10000
        # 
        queue

If you do not specify a requirement on Disk (a bad idea!), condor_submit will append to the job ad Requirements that Disk >= DiskUsage. The DiskUsage attribute is in the job ad and represents the maximum amount of total disk space required by the job, in kilobytes. While your job runs, Condor will automatically update DiskUsage approximately every 20 minutes with the amount of space being used by the job on the execute machine.

5.2.1.3 Current Limitations of File Transfer

Itemized below are some current limitations of the File Transfer mechanism. We anticipate improvement on these areas in upcoming releases.

5.2.2 Some details on how Condor NT starts/stops a job

This section provides some details on how Condor NT starts and stops jobs. This discussion is geared for the Condor administrator or advanced user who is already familiar with the material in Chapter 2 (the Administrators' Manual) and wishes to know detailed information on what Condor NT does when starting/stopping jobs.

When Condor NT is about to start a job, the condor_startd on the execute machine spawns a condor_starter process. The condor_starter then creates:

1.
a new temporary run account on the machine with a login name of ``condor-run-dir_XXX'', where XXX is the process ID of the condor_starter. This account is added to group Users and group Everyone.

2.
a new temporary working subdirectory for the job on the execute machine. This subdirectory is named ``dir_XXX'', where XXX is the process ID of the condor_starter. The subdirectory is created in the $(EXECUTE) subdirectory as specified in Condor's configuration file. Condor then grants write permission on this subdirectory to the user account it just created for the job.

3.
a new, non-visible Window Station and Desktop for the job. Permissions are set so that only the user account just created has access rights to this Desktop. Any windows created by this job are not seen by anyone; the job is run ``in the background''.

Next, the condor_starter (henceforth called the starter) contacts the condor_shadow (henceforth called the shadow) process which is running on the submitting machine and pulls over the job's executable and input files. These files are placed into the temporary working subdirectory for the job. After all files have been received, the starter spawns the user's executable as user ``condor-run-dir_XXX'' with its current working directory set to the temporary working subdirectory (i.e. $(EXECUTE)/dir_XXX).

While the job is running, the starter closely monitors the CPU usage and image size of all processes started by the job. Every 20 minutes it sends this information, along with the total size of all files contained in the job's working subdirectory, to the shadow. The shadow then inserts this information into the job's ClassAd so policy/scheduling expressions can make use of this dynamic information.

If the job exits of its own accord (i.e. the job completes), the starter first terminates any processes started by the job which could still be lying around if the job did not clean up after itself. The starter then examines the job's temporary working subdirectory for any files which have been created or modified and sends these files back to the shadow running on the submit machine. The shadow places these files into the initialdir specified in the submit-description file; if no initialdir was specified, the files go into the directory where the user ran condor_submit. Once all the output files are safely transferred back, the job is removed from the queue. If, however, the condor_startd forcibly kills the job before all output files could be transferred, the job is not removed from the queue but instead switches back to Idle.
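
To illustrate where output files end up, here is a minimal sketch of a submit-description file that sets initialdir. The directory and file names are hypothetical and stand in for whatever the submitting user actually has.

        # Illustrative only: paths and file names are hypothetical.
        executable = c:\jobs\foo.exe
        # Files created or modified by the job are placed here by the
        # shadow once the job completes.
        initialdir = c:\jobs\run1
        queue

If the initialdir line were omitted, the same files would instead go into the directory from which condor_submit was run.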

If the condor_startd decides to vacate a job prematurely (perhaps because the startd policy says to kick off jobs whenever activity on the keyboard is detected), the starter sends a WM_CLOSE message to the job. If the job spawned multiple child processes, the WM_CLOSE message is only sent to the parent process (i.e. the one started by the starter). The WM_CLOSE message is the preferred way to terminate a process on Windows NT, since this method allows the job to clean up and free any resources it may have allocated. When the job exits, the starter cleans up any processes left behind. At this point, if transfer_files was set to ONEXIT (the default) in this job's submit file, the job simply switches from state Running to state Idle and no files are transferred back. But if transfer_files is set to ALWAYS, then any files in the job's temporary working directory which were created or modified are first sent back to the shadow. This time, however, the shadow places these so-called intermediate files into a subdirectory created in the $(SPOOL) directory on the submitting machine ($(SPOOL) is specified in Condor's configuration file). The job is then switched back to the Idle state until Condor finds a different machine for it to run on. When the job is started again, Condor will place into the job's temporary working directory the executable and input files as before, plus any files stored in the submit machine's $(SPOOL) directory for that job.

NOTE: A Windows console process can intercept a WM_CLOSE message via the Win32 SetConsoleCtrlHandler() function if it needs to do special cleanup work at vacate time; a WM_CLOSE message generates a CTRL_CLOSE_EVENT. See SetConsoleCtrlHandler() in the Win32 documentation for more info.

NOTE: The default handler in Windows NT for a WM_CLOSE message is for the process to exit. Of course, the job could be coded to ignore it and not exit, but eventually the condor_startd will get impatient and hard-kill the job (if that is the policy desired by the administrator).

Finally, after the job has left and any files have been transferred back, the condor_starter will delete the temporary working directory, the temporary run account, the WindowStation and the desktop before exiting itself. If the starter should terminate abnormally for some reason, the condor_startd will take it upon itself to clean up the directory, the account, etc. If for some reason the condor_startd should disappear as well (e.g. if the entire machine was power-cycled hard), the condor_startd will clean up the temporary directory(s) and/or account(s) left behind when Condor is restarted at reboot time.

5.2.3 Security considerations in Condor NT Preview

On the execute machine, the user job is run using the access token of an account dynamically created by Condor which has bare-bones access rights and privileges. For instance, if your machines are configured so that only Administrators have write access to C:\WINNT, then certainly no Condor job run on that machine would be able to write anything there. The only files the job should be able to access on the execute machine are files accessible by group Everyone and files in the job's temporary working directory.

On the submit machine, Condor permits the File Transfer mechanism to only read files which the submitting user has access to read, and only write files to which the submitting user has access to write. For example, say only Administrators can write to C:\WINNT on the submit machine, and a user gives the following to condor_submit:

         executable = mytrojan.exe
         initialdir = c:\winnt
         output = explorer.exe
         queue
Unless that user is in group Administrators, Condor will not permit explorer.exe to be overwritten.

If for some reason the submitting user's account disappears between the time condor_submit was run and when the job runs, Condor is not able to check and see if the now-defunct submitting user has read/write access to a given file. In this case, Condor will ensure that group ``Everyone'' has read or write access to any file the job subsequently tries to read or write. This is in consideration for some network setups, where the user account only exists for as long as the user is logged in.

Condor also provides protection to the job queue. It would be bad if the integrity of the job queue were compromised, because a malicious user could remove other users' jobs or even change what executable a user's job will run. To guard against this, in Condor's default configuration all connections to the condor_schedd (the process which manages the job queue on a given machine) are authenticated using Windows NT's SSPI security layer. The user is then authenticated using the same challenge-response protocol that NT uses to authenticate users to Windows NT file servers. Once authenticated, the only users allowed to edit a job entry in the queue are:

1.
the user who originally submitted that job (i.e. Condor allows users to remove or edit their own jobs)
2.
users listed in the condor_config file parameter QUEUE_SUPER_USERS. In the default configuration, only the ``SYSTEM'' (LocalSystem) account is listed here.
WARNING: Do not remove ``SYSTEM'' from QUEUE_SUPER_USERS, or Condor itself will not be able to access the job queue when needed. If the LocalSystem account on your machine is compromised, you have all sorts of problems!

To protect the actual job queue files themselves, the Condor NT installation program will automatically set permissions on the entire Condor release directory so that only Administrators have write access.

Finally, Condor NT Preview has all the IP/Host-based security mechanisms present in the full-blown version of Condor. See section 3.8 for complete information on how to allow/deny access to Condor based upon machine hostname or IP address.

5.2.4 Interoperability between Condor for Unix and Condor NT

Unix machines and Windows NT machines running Condor can happily co-exist in the same Condor pool without any problems. For now, the only restriction is that jobs submitted on Windows NT must run on Windows NT, and jobs submitted on Unix must run on Unix. You will get this behavior by default, since condor_submit will automatically set a Requirements expression in the job ClassAd stating that the execute machine must have the same architecture and operating system as the submit machine.
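
The automatically added expression can be pictured roughly as follows for a job submitted from an Intel Windows NT machine. The attribute names Arch and OpSys and the values shown are assumptions made for illustration; the machine ClassAds in your own pool show the actual names and values in use.

        # Assumed attribute names and values, for illustration only.
        requirements = (Arch == "INTEL") && (OpSys == "WINNT40")

There is normally no need to write this by hand, since condor_submit supplies the equivalent restriction by default.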

There is absolutely no need to run more than one Condor central manager, even if you have both Unix and NT machines. The Condor central manager itself can run on either Unix or NT; there is no advantage to choosing one over the other. Here at the University of Wisconsin-Madison, for instance, we have hundreds of Unix (Solaris, Linux, Irix, etc.) and Windows NT machines in our Computer Science Department Condor pool. Our central manager is running on Windows NT. All is happy.

5.2.5 Some differences between Condor for Unix -vs- Condor NT

 
condor-admin@cs.wisc.edu