GRAM Audit V2
Overview
This wiki page will be used to document the latest details of the GRAM Audit v2 implementation as it progresses.
Version 1 of GRAM Audit is currently implemented in GT 4.0.5 and later. It is anticipated that version 2 of GRAM Audit will be included in a future GT 4.2 point release (e.g. 4.2.1, 4.2.2, etc.) but the release date of GRAM Audit V2 is currently undetermined.
The GRAM Audit V2 campaign is here
The Gridshib security audit wiki is here
The GRAM Audit V1 database schema is useful, but some deficiencies have been identified, and there have been a significant number of requests for additional information to be included in the audit records.
Date Time field datatype
Looking at the V1 schema, there is an easy and significant improvement to be had by making all the “time/date” fields actually use ‘datetime’ rather than varchar(x). This will make time/date-based queries much easier to perform.
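As a rough illustration of why a real date/time type helps, the Python sketch below compares two timestamps first as text and then as parsed values. The `M/D/YYYY HH:MM` text layout is an assumption made up for this example; it is not taken from the V1 schema.

```python
from datetime import datetime

# Hypothetical text layout for a varchar time column (an assumption for
# illustration only, not the actual V1 format).
FMT = "%m/%d/%Y %H:%M"

earlier = "9/5/2008 10:00"   # 5 September 2008
later = "10/1/2008 09:00"    # 1 October 2008

# Compared as text, the later date sorts first because "1" < "9".
text_order_ok = earlier < later

# Compared as real date/time values, the order is correct.
typed_order_ok = datetime.strptime(earlier, FMT) < datetime.strptime(later, FMT)

print(text_order_ok, typed_order_ok)
```

With a native date/time column the database performs the second kind of comparison itself, so range queries do not depend on how the text happens to be formatted.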
Additional job lifecycle fields
Related to this are requests for new “time” variables to better understand the lifecycle of audited jobs.
- active_time
- Date when the job was started/running in the local resource manager (as measured/observed by the gram service)
- lrm_job_terminated_time*
- Date when the job terminated in the local resource manager (as measured/observed by the gram service)
- job_all_done_time
- Date when the job was fully processed by the GRAM service. This includes staging, execution, cleanup, etc.
Security / DB Access Concerns
Because of security concerns, a complete audit record will consist of multiple audit sub-records, which are submitted as the various stages of the job are completed and the information becomes available. Thus the “update” privilege is not required, and a compromised GRAM wouldn’t be able to modify prior records.
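The insert-only idea can be sketched in Python with an in-memory SQLite database. This is only a stand-in: in a real deployment the restriction would presumably come from the database's own privilege system (granting insert but not update), whereas here an SQLite authorizer callback plays that role. The `insert_only` function, the sample values, and the trimmed-down sub-record table are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Trimmed-down sub-record table, created before the restriction is applied.
conn.execute(
    "create table gram_audit_started_jobs ("
    " job_grid_id varchar(256) primary key,"
    " started_time_utc timestamp)"
)

def insert_only(action, arg1, arg2, db_name, source):
    # Hypothetical policy: reject anything that changes or removes a row.
    if action in (sqlite3.SQLITE_UPDATE, sqlite3.SQLITE_DELETE):
        return sqlite3.SQLITE_DENY
    return sqlite3.SQLITE_OK

conn.set_authorizer(insert_only)

# Adding a new sub-record is allowed.
conn.execute(
    "insert into gram_audit_started_jobs values (?, ?)",
    ("job-1", "2008-03-01 12:00:00"),
)

# Rewriting an existing sub-record is refused.
try:
    conn.execute(
        "update gram_audit_started_jobs set started_time_utc = ?",
        ("2001-01-01 00:00:00",),
    )
    update_rejected = False
except sqlite3.DatabaseError:
    update_rejected = True

stored = conn.execute(
    "select started_time_utc from gram_audit_started_jobs"
).fetchone()[0]
print(update_rejected, stored)
```

Because each stage of the job writes its own row, the writer never needs to touch a row that already exists.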
Additional feature requests
Some additional information is requested to improve the usefulness of the audit record. Some of the requested information allows for easier tracking of information into other (existing or future) databases to gather more details. Some is anticipated to be broadly useful when an audit record is used.
- request_id
- This is the unique ID for each client interaction with the GT container. The request ID is needed in order to join records across multiple GT auditing tables, for example core audit records and security audit records. There are some plans to have a security audit table for the GT gridshib component.
- job_resource_key
- This is the unique ID (UUID) generated by the service and is included in the job's EPR.
- client_hostname
- This is the hostname of the client that sent the job to the gram service.
- executing_hostname
- This is the FQDN (hostname) of the actual worker node the job runs on.
- resource_usage fields
- information as reported by the UNIX time command
- elapsedtime (in seconds): time between invocation and termination
- usertime (in seconds): user CPU usage (the sum of the tms_utime and tms_cutime values in a struct tms as returned by times(2))
- systime (in seconds): system CPU usage (the sum of the tms_stime and tms_cstime values in a struct tms as returned by times(2))
The resource usage fields are mandatory for the “fork” resource manager and optional for all other managers. If no information is available they should be NULL.
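For reference, the three resource usage values can be derived from two samples of the process times, much as the UNIX time command does. The sketch below uses Python's `os.times()`, which exposes the same counters as times(2); the `resource_usage` helper and the stand-in workload are assumptions for illustration and are not part of GRAM.

```python
import os

def resource_usage(before, after):
    """Return elapsed, user and system time (seconds) between two os.times() samples."""
    return {
        # Wall-clock time between the two samples.
        "elapsedtime": after.elapsed - before.elapsed,
        # tms_utime + tms_cutime: own user CPU time plus that of waited-for children.
        "usertime": (after.user + after.children_user)
                    - (before.user + before.children_user),
        # tms_stime + tms_cstime: own system CPU time plus that of waited-for children.
        "systime": (after.system + after.children_system)
                   - (before.system + before.children_system),
    }

before = os.times()
sum(i * i for i in range(100_000))  # stand-in workload
after = os.times()

usage = resource_usage(before, after)
print(usage)
```

A resource manager that cannot supply these numbers would simply leave the corresponding columns NULL, as described above.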
V2 Schemas
For V2 of gram-auditing we need six tables instead of one so that the “update” privilege is not required. Five stages of a job’s life are assumed: initialization (initjob), queuing (queuejob), activation (runjob), completion (endjob) and cancellation/termination (canceljob). Additionally, an accounting table (acctjob) is defined for information that can be supplied from the local resource manager’s accounting information.
- It was suggested to go with datetime for V2. However, datetime does not exist in Derby, which does have timestamp. MySQL and PostgreSQL both have timestamp as well, so we plan on going with timestamp.
- What timestamp value should be used? e.g. local time or UTC?
- local time might be easier for an admin to look at a single gram audit DB, "return all records from 3:05pm"
- With UTC time, the admin wanting to query his local audit DB would have to convert the local time to UTC and then do the query. UTC adds an additional step for the admin, but it could be easily automated in a script.
- But the target users for gram audit are grids that can have many gram audit DBs in many timezones.
- When a grid admin wants to troubleshoot a problem that occurred around 3:05pm Central Standard Time, that time can be converted to UTC and then any of the gram audit DBs can be easily queried to find the relevant records.
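The conversion step described in the list above could be scripted roughly as follows. The incident date, the ten-minute search window, and the query text are assumptions chosen for illustration; Central Standard Time is modelled as a fixed UTC-6 offset, so daylight saving time is deliberately ignored.

```python
from datetime import datetime, timedelta, timezone

# Central Standard Time as a fixed UTC-6 offset (no daylight saving handling).
CST = timezone(timedelta(hours=-6), "CST")

# Hypothetical incident: 3:05pm CST on an arbitrary date.
local_incident = datetime(2008, 1, 15, 15, 5, tzinfo=CST)
utc_incident = local_incident.astimezone(timezone.utc)

# Search a window around the incident; the width is an arbitrary choice.
window = timedelta(minutes=10)
fmt = "%Y-%m-%d %H:%M:%S"
params = (
    (utc_incident - window).strftime(fmt),
    (utc_incident + window).strftime(fmt),
)

# The same query text and parameters can be sent to every gram audit DB.
query = (
    "select job_grid_id from gram_audit_initialized_jobs "
    "where creation_time_utc between ? and ?"
)

print(utc_incident.strftime(fmt), params)
```

Since every database stores UTC, the admin converts the time once and reuses the same parameters everywhere, regardless of the timezone each GRAM host is in.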
initialized_jobs
These records are added when the job is submitted to GRAM to start processing.
- Questions
- where does client_host_name come from? Core? Is it there?
- request_id should be available because we use it in cepds. verify.
create table gram_audit_initialized_jobs (
job_grid_id varchar(256),
request_id varchar(128),
client_host_name varchar(128),
user_name varchar(16) not null,
client_submission_id varchar(128),
creation_time_utc timestamp not null,
resource_manager_type varchar(16) not null,
globus_toolkit_version varchar(16) not null,
job_description text not null,
PRIMARY KEY(job_grid_id(256)));
queued_jobs
These records are added just after GRAM submits the job to the LRM.
create table gram_audit_queued_jobs (
job_grid_id varchar(256),
stage_in_grid_id varchar(256),
local_job_id varchar(512),
queued_time_utc timestamp,
PRIMARY KEY(job_grid_id(256)));
started_jobs
These records are added when GRAM detects that the job has started running in the LRM.
create table gram_audit_started_jobs (
job_grid_id varchar(256),
started_time_utc timestamp,
executing_host_name varchar(128),
PRIMARY KEY(job_grid_id(256)));
cancelled_jobs
These records are added at the time the cancellation is received by the GRAM service, NOT when the cancellation has completed. The time the cancellation has completed will be recorded in the finished_jobs table.
create table gram_audit_cancelled_jobs (
job_grid_id varchar(256),
request_id varchar(128),
reason varchar(16) not null, -- one of: lifetime expired, user canceled, gram canceled
cancelled_time_utc timestamp,
PRIMARY KEY(job_grid_id(256)));
finished_jobs
These records are for jobs that have been fully processed by the GRAM service.
create table gram_audit_finished_jobs (
job_grid_id varchar(256),
lrm_job_finished_time_utc timestamp,
gram_job_finished_time_utc timestamp,
stage_out_grid_id varchar(256),
clean_up_grid_id varchar(256),
elapsed_time double,
final_job_state varchar(20),
final_job_exit_code int,
PRIMARY KEY(job_grid_id(256)));
accounting_jobs
These records contain information that comes directly from the LRM accounting information.
create table gram_audit_accounting_jobs (
ID <DB generated unique ID>,
local_job_id varchar(512),
queued_time_utc timestamp,
started_time_utc timestamp,
finished_time_utc timestamp,
elapsed_time double,
user_cpu double,
sys_cpu double,
PRIMARY KEY(ID));
Notes
The idempotence_id was renamed to client_submission_id, which is more representative of its meaning. The PRIMARY KEY currently used is job_grid_id, which ties the five job-stage tables together (the accounting table has its own generated ID). This may not be the best way to do this. Two new variables were added to the finished_jobs record: final_job_state and final_job_exit_code. The job state is envisioned to hold values like “SUCCESSFUL_COMPLETION”, “FAIL_USER_TERMINATED”, “FAIL_RESOURCE_LIMIT”, “FAIL_LRM_FAILURE”, etc. The final_job_exit_code can provide the exit code of the user’s application.
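To show how job_grid_id ties the sub-records together, the sketch below rebuilds a combined view of two jobs with left joins, using Python and an in-memory SQLite database. The tables are cut down to a few columns, and the job IDs, user names, and host name are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table gram_audit_initialized_jobs (
    job_grid_id varchar(256) primary key,
    user_name varchar(16) not null);
create table gram_audit_started_jobs (
    job_grid_id varchar(256) primary key,
    executing_host_name varchar(128));
create table gram_audit_finished_jobs (
    job_grid_id varchar(256) primary key,
    final_job_state varchar(20),
    final_job_exit_code int);

-- job-a ran to completion; job-b has only been initialized so far.
insert into gram_audit_initialized_jobs values ('job-a', 'alice'), ('job-b', 'bob');
insert into gram_audit_started_jobs values ('job-a', 'node01.example.org');
insert into gram_audit_finished_jobs values ('job-a', 'SUCCESSFUL_COMPLETION', 0);
""")

# Left joins keep jobs whose later sub-records have not been written yet.
rows = conn.execute("""
select i.job_grid_id, i.user_name, s.executing_host_name,
       f.final_job_state, f.final_job_exit_code
from gram_audit_initialized_jobs i
left join gram_audit_started_jobs s on s.job_grid_id = i.job_grid_id
left join gram_audit_finished_jobs f on f.job_grid_id = i.job_grid_id
order by i.job_grid_id
""").fetchall()

for row in rows:
    print(row)
```

Stages that have not happened yet simply show up as NULL (None in Python) in the combined row.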
In order to get the job_grid_id for an accounting_jobs record, join to the queued_jobs table with a match on the local_job_id and some reasonable time range (4 hours?) between the queued times. This is needed because not all LRMs provide a unique local job id, so the timestamp for when the job was queued is necessary to ensure a match.
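A possible form of that join is sketched below in Python with an in-memory SQLite database. The tables are reduced to the columns involved, the sample rows are invented, and the time difference is computed with SQLite's julianday() function; other databases would need their own date arithmetic. The four-hour limit is the tentative value mentioned above, not a settled one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table gram_audit_queued_jobs (
    job_grid_id varchar(256) primary key,
    local_job_id varchar(512),
    queued_time_utc timestamp);
create table gram_audit_accounting_jobs (
    id integer primary key,
    local_job_id varchar(512),
    queued_time_utc timestamp);

-- The LRM reused local job id 1234 for two different jobs.
insert into gram_audit_queued_jobs values
    ('job-old', '1234', '2008-02-01 09:00:00'),
    ('job-new', '1234', '2008-03-01 12:00:00');
insert into gram_audit_accounting_jobs (local_job_id, queued_time_utc)
    values ('1234', '2008-03-01 12:02:30');
""")

MAX_SKEW_HOURS = 4  # tentative tolerance between the two queued times

# julianday() returns days, so multiply the difference by 24 to get hours.
rows = conn.execute("""
select a.id, q.job_grid_id
from gram_audit_accounting_jobs a
join gram_audit_queued_jobs q
  on q.local_job_id = a.local_job_id
 and abs(julianday(a.queued_time_utc) - julianday(q.queued_time_utc)) * 24.0 <= ?
""", (MAX_SKEW_HOURS,)).fetchall()

print(rows)
```

Only the queued job whose timestamp lies within the tolerance is matched, even though both queued jobs share the same local_job_id.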

