GRAM Audit V2

Overview

This wiki page will be used to document the latest details of the GRAM Audit v2 implementation as it progresses.

Version 1 of GRAM Audit is currently implemented in GT 4.0.5 and later. It is anticipated that version 2 of GRAM Audit will be included in a future GT 4.2 point release (e.g., 4.2.1 or 4.2.2), but the release date of GRAM Audit V2 is currently undetermined.

The GRAM Audit V2 campaign is here

The GRAM Audit V1 database schema is useful, but some deficiencies have been identified, and there have been a significant number of requests for additional information to be included in the audit records.

Date Time field datatype

In the V1 schema, different SQL types are used for the various datetime columns (VARCHAR for MySQL and TIMESTAMP for PostgreSQL). It is a goal of GRAM Audit V2 to fix this problem. This will make it much easier to perform datetime-based queries.

Since grids can have multiple GRAM audit databases in different timezones, we want to include timezone values in datetime columns. When a grid admin wants to troubleshoot a problem that occurred at 3:05pm Central Standard Time, that datetime value can be converted to UTC and then any GRAM audit database can be queried to find the relevant records.
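The conversion step described above can be sketched in a few lines of Python. This is only an illustration: the date is hypothetical, the helper name to_utc_string is an assumption, and a fixed UTC-6 offset stands in for Central Standard Time (a real script would more likely use a named zone so that daylight saving time is handled).

```python
from datetime import datetime, timedelta, timezone

# Fixed-offset stand-in for Central Standard Time (UTC-6); assumption for
# this sketch only, daylight saving time is deliberately ignored.
CST = timezone(timedelta(hours=-6), "CST")


def to_utc_string(local_dt):
    """Convert a timezone-aware datetime to a UTC string usable in a query."""
    if local_dt.tzinfo is None:
        raise ValueError("datetime must carry a timezone")
    return local_dt.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")


# 3:05pm CST on a hypothetical date
incident = datetime(2008, 1, 15, 15, 5, tzinfo=CST)
utc_value = to_utc_string(incident)
print(utc_value)  # 2008-01-15 21:05:00
```

The resulting UTC value could then be used against any GRAM audit database, regardless of the timezone in which that database is hosted.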

Issues:

  • Database implementations are all over the map with respect to the "standard" TIMESTAMP type. It's going to be difficult (if not impossible) to come up with a single schema that works across implementations.
  • PostgreSQL and Derby implement the "standard" TIMESTAMP type. MySQL implements DATETIME instead of TIMESTAMP, however. The TIMESTAMP type in MySQL is an altogether different type.
  • What timezone value should be used, local time or UTC? Local time might make it easier for an admin to query a single GRAM audit DB. With UTC, an admin wanting to query a local audit DB might have to convert local time to UTC before doing the query, so UTC potentially adds a step for admins. On the other hand, timezone conversion could be easily automated in a script.
  • PostgreSQL has full timezone support while Derby has none. As far as I can tell, MySQL has no timezone support either.
  • All three database implementations (PostgreSQL, MySQL, and Derby) support fractional seconds up to 6 digits. However, in MySQL, these fractional digits are ignored (i.e., not stored in the database). In PostgreSQL, the effective precision is something less than 6 fractional digits since timestamps are stored internally as floating point values.
  • In summary, the SQL TIMESTAMP type is not evenly supported across database implementations. Some implementations (such as MySQL and MSSQL) support the DATETIME type instead. Moreover, there is uneven support for the "standard" TIMESTAMP WITH TIME ZONE type.
  • Two possible alternatives to TIMESTAMP are date strings (i.e., VARCHAR) or seconds (or milliseconds) past the epoch. Since milliseconds require a long integer type (which itself is nonstandard over database implementations), the only real alternative is to store datetime values as date strings.
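One property that makes date strings workable is worth illustrating: if every value is written in UTC using a fixed-width, most-significant-field-first layout, plain string comparison agrees with chronological order, so range queries on a VARCHAR column still behave as expected. The sketch below assumes the format "YYYY-MM-DD HH:MM:SS"; neither the format nor the helper name is taken from the GRAM Audit design.

```python
from datetime import datetime, timezone

# Assumed fixed-width UTC layout; not specified by the GRAM Audit V2 design.
FMT = "%Y-%m-%d %H:%M:%S"


def to_db_string(dt):
    """Render a timezone-aware datetime as a fixed-width UTC date string."""
    return dt.astimezone(timezone.utc).strftime(FMT)


times = [
    datetime(2008, 3, 9, 23, 59, 59, tzinfo=timezone.utc),
    datetime(2008, 3, 10, 0, 0, 0, tzinfo=timezone.utc),
    datetime(2008, 12, 1, 8, 30, 0, tzinfo=timezone.utc),
]
strings = [to_db_string(t) for t in times]

# Lexicographic order of the strings matches chronological order.
in_order = sorted(strings) == strings

# A range filter using only string comparison, as a VARCHAR column would.
window = [s for s in strings
          if "2008-03-10 00:00:00" <= s < "2009-01-01 00:00:00"]
```

This only holds when all writers use the same format and the same timezone, which is one more argument for settling on UTC.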

Additional job lifecycle fields

Related to this are requests for new “time” variables to better understand the lifecycle of audited jobs.

  • active_time
    • Date when the job started running in the local resource manager (as observed by the GRAM service)
  • lrm_job_terminated_time*
    • Date when the job terminated in the local resource manager (as observed by the GRAM service)
  • job_all_done_time
    • Date when the job was fully processed by the GRAM service. This includes staging, execution, cleanup, etc.

Security / DB Access Concerns

Because of security concerns, a complete audit record will consist of multiple audit sub-records, which are submitted as the various stages of the job complete and the information becomes available. Thus “update” privilege is not required, and a compromised GRAM would not be able to modify prior records.
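The insert-only approach can be sketched with Python's built-in sqlite3 module as a stand-in database. The table names follow the V2 schemas on this page, but the column subset and the sample values are illustrative assumptions. SQLite has no privilege system, so the sketch shows only the data flow (one INSERT per stage, a join to assemble the complete record), not the enforcement of privileges.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced, illustrative versions of three of the V2 tables.
conn.executescript("""
    create table gram_audit_initialized_jobs (
        job_grid_id varchar(256) primary key,
        user_name varchar(16) not null,
        creation_time_utc timestamp not null);
    create table gram_audit_started_jobs (
        job_grid_id varchar(256) primary key,
        started_time_utc timestamp);
    create table gram_audit_finished_jobs (
        job_grid_id varchar(256) primary key,
        gram_job_finished_time_utc timestamp,
        final_job_state varchar(20));
""")

# Each stage writes a new sub-record; nothing is ever updated.
conn.execute("insert into gram_audit_initialized_jobs values (?, ?, ?)",
             ("job-1", "alice", "2008-01-15 21:05:00"))
conn.execute("insert into gram_audit_started_jobs values (?, ?)",
             ("job-1", "2008-01-15 21:07:30"))
# job-2 has been initialized but has not started yet.
conn.execute("insert into gram_audit_initialized_jobs values (?, ?, ?)",
             ("job-2", "bob", "2008-01-15 22:00:00"))

# The complete record is assembled at query time; stages that have not
# happened yet simply appear as NULL (None in Python).
rows = conn.execute("""
    select i.job_grid_id, i.user_name, s.started_time_utc, f.final_job_state
    from gram_audit_initialized_jobs i
    left join gram_audit_started_jobs s on s.job_grid_id = i.job_grid_id
    left join gram_audit_finished_jobs f on f.job_grid_id = i.job_grid_id
    order by i.job_grid_id
""").fetchall()
```

In a database with access control, the GRAM account would be granted only the insert privilege on these tables, which is what prevents a compromised service from rewriting earlier sub-records.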

Security Table

In conjunction with the GRAM Audit V2 work, a security table will be added to JWS Core.

Audit Attributes

Users will want to insert some (or all) of the data from the GRAM audit records into the security table. Processes such as the TeraGrid Extension, which will be rewritten to accommodate GRAM Audit V2, will select attributes from the security table. For example, the TeraGrid Extension for GRAM Audit V1 requires job_grid_id, subject_name, and gateway_user, which are pushed directly from GRAM to the TGCDB. With GRAM Audit V2, these attributes will be pushed from the security table instead.

For this reason, every column in the GRAM Audit V2 schema requires a globally unique name. This name will become the attribute name of the name-value pair of the attribute inserted into the security table. Since SAML attribute names are URIs, we recommend that all attributes in the security table have names that are URIs.

As an example, suppose we assign names to the columns of the gram_audit_table in GRAM Audit V1 as follows:

Column Name     Attribute Name
job_grid_id     http://globus.org/names/attribute/gram/audit_v1/job_grid_id
local_job_id    http://globus.org/names/attribute/gram/audit_v1/local_job_id
subject_name    http://globus.org/names/attribute/gram/audit_v1/subject_name
...             ...

A similar approach may be used in GRAM Audit V2.
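The naming convention in the V1 example is regular enough to generate mechanically. The following sketch derives attribute names from column names using the base URI shown above; the function name and the idea of passing the schema version as a parameter are assumptions, and no naming for V2 has been decided here.

```python
# Base URI taken from the V1 example on this page.
BASE = "http://globus.org/names/attribute/gram/"


def attribute_name(column, schema_version="audit_v1"):
    """Build the URI attribute name for a column (hypothetical helper)."""
    return f"{BASE}{schema_version}/{column}"


columns = ["job_grid_id", "local_job_id", "subject_name"]
names = {column: attribute_name(column) for column in columns}
```

Generating the names from the column list keeps the schema and the security-table attribute names from drifting apart.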

Additional feature requests

Some additional information has been requested to improve the usefulness of the audit record. Some of it makes it easier to correlate an audit record with other (existing or future) databases in order to gather more details. The rest is expected to be broadly useful whenever an audit record is consulted.

  • request_id
    • This is the unique ID for each client interaction with the GT container. The request ID is needed in order to join records across multiple GT auditing tables, for example core audit records and security audit records. There are some plans to have a security audit table for the GT gridshib component.
  • job_resource_key
    • This is the unique ID (UUID) generated by the service and is included in the job's EPR.
  • client_hostname
    • This is the hostname of the client that sent the job to the GRAM service.
  • executing_hostname
    • This is the FQDN (hostname) of the actual worker node the job runs on.
  • resource_usage fields
    • Information as reported by the UNIX time command:
      • elapsedtime (in seconds): time between invocation and termination
      • usertime (in seconds): user CPU usage (the sum of the tms_utime and tms_cutime values in a struct tms as returned by times(2))
      • systime (in seconds): system CPU usage (the sum of the tms_stime and tms_cstime values in a struct tms as returned by times(2))

The resource usage fields are mandatory for the “fork” resource manager and optional for all other managers. If no information is available they should be NULL.
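As a rough illustration of where these three values could come from for a fork-style job, the sketch below runs a child process and reads Python's os.times(), which wraps times(2). The function name measure and the sample command are assumptions. Because the measuring process here is a long-lived parent rather than the time command itself, the sketch uses only the change in the children fields (the tms_cutime and tms_cstime counterparts), which attributes CPU time to the waited-for child.

```python
import os
import subprocess
import sys


def measure(cmd):
    """Run cmd and return elapsed, user, and system time in seconds."""
    before = os.times()
    completed = subprocess.run(cmd)  # waits for the child to terminate
    after = os.times()
    return {
        # wall-clock time between invocation and termination
        "elapsedtime": after.elapsed - before.elapsed,
        # CPU time charged to waited-for children during the run
        "usertime": after.children_user - before.children_user,
        "systime": after.children_system - before.children_system,
        "exit_code": completed.returncode,
    }


# Hypothetical job: a short Python one-liner
usage = measure([sys.executable, "-c", "sum(range(100000))"])
```

For resource managers other than fork, the same fields would have to come from the LRM, or be left NULL as stated above.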

V2 Schemas

For V2 of GRAM auditing we need six tables instead of one so that “update” privilege is not required. Five stages of a job's life are assumed: initialization (initjob), queuing (queuejob), activation (runjob), completion (endjob), and cancellation/termination (canceljob). Additionally, an accounting table (acctjob) is defined for information that can be supplied from the local resource manager’s accounting information.

initialized_jobs

These records are added when the job is submitted to GRAM to start processing.

  • Questions
    • Where does client_host_name come from? Core? Is it available there?
    • request_id should be available because we use it in cepds. Verify.
    • MySQL's TIMESTAMP is not the standard SQL TIMESTAMP type (see "Date Time field datatype" above); DATETIME may be needed there instead.
 create table gram_audit_initialized_jobs (
     job_grid_id varchar(256),
     request_id varchar(128),
     client_host_name varchar(128),
     user_name varchar(16) not null,
     client_submission_id varchar(128),
     creation_time_utc timestamp not null,
     resource_manager_type varchar(16) not null,
     globus_toolkit_version varchar(16) not null,
     job_description text not null,
     PRIMARY KEY(job_grid_id(256)));

queued_jobs

These records are added just after GRAM submits the job to the LRM.

  • Questions
    • MySQL's TIMESTAMP is not the standard SQL TIMESTAMP type (see "Date Time field datatype" above); DATETIME may be needed there instead.
 create table gram_audit_queued_jobs (
     job_grid_id varchar(256),
     stage_in_grid_id varchar(256),
     local_job_id varchar(512),
     queued_time_utc timestamp,
     PRIMARY KEY(job_grid_id(256)));

started_jobs

These records are added when GRAM detects that the job has started running in the LRM.

  • Questions
    • MySQL's TIMESTAMP is not the standard SQL TIMESTAMP type (see "Date Time field datatype" above); DATETIME may be needed there instead.
 create table gram_audit_started_jobs (    
     job_grid_id varchar(256),  
     started_time_utc timestamp, 
     executing_host_name varchar(128),
     PRIMARY KEY(job_grid_id(256)));

cancelled_jobs

These records are added when the cancellation request is received by the GRAM service, NOT when the cancellation has completed. The time at which the cancellation completes will be recorded in the finished_jobs table.

  • Questions
    • MySQL's TIMESTAMP is not the standard SQL TIMESTAMP type (see "Date Time field datatype" above); DATETIME may be needed there instead.
 create table gram_audit_cancelled_jobs (
     job_grid_id varchar(256),  
     request_id varchar(128),
     reason varchar(16) not null, -- one of: lifetime expired, user canceled, gram canceled
     cancelled_time_utc timestamp,
     PRIMARY KEY(job_grid_id(256)));

finished_jobs

These records are for jobs that have been fully processed by the GRAM service.

  • Questions
    • MySQL's TIMESTAMP is not the standard SQL TIMESTAMP type (see "Date Time field datatype" above); DATETIME may be needed there instead.
 create table gram_audit_finished_jobs (
     job_grid_id varchar(256),
     lrm_job_finished_time_utc timestamp,
     gram_job_finished_time_utc timestamp,
     stage_out_grid_id varchar(256),
     clean_up_grid_id varchar(256),
     elapsed_time  double,
     final_job_state varchar(20),
     final_job_exit_code int,
     PRIMARY KEY(job_grid_id(256)));

accounting_jobs

These records contain information that comes directly from the LRM accounting information.

  • Questions
    • MySQL's TIMESTAMP is not the standard SQL TIMESTAMP type (see "Date Time field datatype" above); DATETIME may be needed there instead.
 create table gram_audit_accounting_jobs (
     ID <DB generated unique ID>,
     local_job_id varchar(512),
     queued_time_utc timestamp,
     started_time_utc timestamp,
     finished_time_utc timestamp,
     elapsed_time  double,
     user_cpu double,
     sys_cpu double,
     PRIMARY KEY(ID));

Notes

The idempotence_id was renamed to client_submission_id, which is more representative of its meaning. The PRIMARY KEY currently used is job_grid_id, which ties the five job-stage tables together (the accounting table is keyed by its own generated ID). This may not be the best way to do this. Two new variables were added for the end job record: final_job_state and final_job_exit_code. The job state is envisioned to hold values like “SUCCESSFUL_COMPLETION”, “FAIL_USER_TERMINATED”, “FAIL_RESOURCE_LIMIT”, “FAIL_LRM_FAILURE”, etc. The final_job_exit_code can provide the exit code of the user’s application.

To get the job_grid_id for an acctjob record, join to the queuejob table with a match on local_job_id and some reasonable time range (4 hours?) between the queued times. This is needed because not all LRMs provide a unique local job ID, so the timestamp for when the job was queued is necessary to ensure a match.
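The join described above can be sketched as follows, again using Python's sqlite3 module as a stand-in database. The tables are reduced to the columns needed for the match, the sample rows are invented, and the use of SQLite's julianday() to compute the time difference is specific to this sketch; other databases would need their own date arithmetic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced, illustrative versions of the queued and accounting tables.
conn.executescript("""
    create table gram_audit_queued_jobs (
        job_grid_id varchar(256) primary key,
        local_job_id varchar(512),
        queued_time_utc timestamp);
    create table gram_audit_accounting_jobs (
        id integer primary key,
        local_job_id varchar(512),
        queued_time_utc timestamp,
        elapsed_time double);
""")

conn.executemany(
    "insert into gram_audit_queued_jobs values (?, ?, ?)",
    [
        ("grid-A", "1001", "2008-01-15 21:05:00"),
        ("grid-B", "1001", "2008-02-20 09:00:00"),  # the LRM reused id 1001
    ],
)
conn.execute(
    "insert into gram_audit_accounting_jobs values (?, ?, ?, ?)",
    (1, "1001", "2008-02-20 09:00:40", 312.5),
)

WINDOW_HOURS = 4  # the "reasonable time range" is still an open question

# Match on local_job_id, then keep only pairs whose queued times are close.
rows = conn.execute("""
    select q.job_grid_id, a.id
    from gram_audit_accounting_jobs a
    join gram_audit_queued_jobs q on q.local_job_id = a.local_job_id
    where abs(julianday(a.queued_time_utc) - julianday(q.queued_time_utc)) * 24.0 <= ?
""", (WINDOW_HOURS,)).fetchall()
```

Here the accounting record matches only grid-B even though both queued jobs share the local job ID 1001, because the other job was queued more than a month earlier.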

Implementation
