UoM Technical Issues with the UNICORE/EUROGRID/GRIP Broker

Note that some of these are already resolved.

Fujitsu NJS Interface

Allow the use of timeout on jobs submitted via consignJob

Part done. Broker subjobs have termination time. Need to check whether this is sufficient for all interesting failure modes.
consignJob to be more tolerant of transient UPL errors, such as Gateway connection not available - should retry.

Not done.
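
One way the retry behaviour asked for above might look is a small wrapper around the submission call. This is only a sketch: `TransientUplException`, `RetryingConsigner` and `withRetry` are hypothetical names invented for illustration, not part of the NJS or UPL code, and the doubling delay is an assumed policy.

```java
import java.util.concurrent.Callable;

// Hypothetical marker for errors worth retrying,
// e.g. "Gateway connection not available".
class TransientUplException extends Exception {
    TransientUplException(String message) { super(message); }
}

final class RetryingConsigner {
    private RetryingConsigner() {}

    // Runs the action up to maxAttempts times. Only TransientUplException
    // triggers a retry; any other exception propagates immediately.
    // The delay between attempts doubles each time (an assumed policy).
    static <T> T withRetry(Callable<T> action, int maxAttempts, long initialDelayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return action.call();
            } catch (TransientUplException e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up: report the last transient failure
                }
                Thread.sleep(delay);
                delay *= 2;
            }
        }
    }
}
```

The real consignJob call would be passed in as the `action`; which UPL errors count as transient is left open here.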
Add call to the interface for when the NJS gets an ExecuteTask with ResourceCheck and/or QoSCheck resource which contains a ticket (broker must validate ticket and decide whether or not it should and can honour the ticket). There should be some notion of accepting a Ticket, or rejecting it (where the consign should fail, I think). In either case, some action via the TSI may be necessary, e.g. to claim an advance reservation on success; or on failure, to cancel a reservation.

Done, though the exact interface might need further adjustment in the future.
The plugin also needs to be able to store and retrieve created Tickets in the NJS's state (which gets saved). Interface as in TicketManager or something similar.
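
The accept/reject behaviour described above can be sketched roughly as follows. None of these names come from the actual NJS interface: `ReservationActions`, `TicketGate` and `admit` are hypothetical, tickets are reduced to string identifiers, and the broker's decision is modelled as a plain predicate.

```java
import java.util.function.Predicate;

// Hypothetical stand-in for whatever TSI actions manage advance reservations.
interface ReservationActions {
    void claim(String ticketId);   // on success: claim the reservation
    void cancel(String ticketId);  // on failure: cancel the reservation
}

final class TicketGate {
    private TicketGate() {}

    // Asks the broker whether it should and can honour the ticket.
    // Returns true if the consign may proceed, false if it should fail.
    // Either way, the matching TSI-side action is triggered.
    static boolean admit(String ticketId,
                         Predicate<String> brokerHonours,
                         ReservationActions tsi) {
        if (brokerHonours.test(ticketId)) {
            tsi.claim(ticketId);
            return true;
        }
        tsi.cancel(ticketId);
        return false;
    }
}
```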

Done.
Get access to the locations of Storage areas for a user. Which directories and/or filesystems are they on? It would be nice to get the locations for a particular incarnated user, so that any translations of the IDB contents are done for me, e.g. expanding things like "$HOME". As the incarnated user won't always be the one passed in to the incarnateCheckResources, the expansion would need to be available via a method, say:
```
ResourceChecker.NJS.getStorageAreaMap(IncarnatedUser user)
```
Recent NJSes support the following likely-looking method:
```
ResourceChecker.NJS.getStorageLocation(Storage storage)
```
However, it will need to be handed off to the TSI for resolution into an actual location.

Need to integrate with Broker.
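
To illustrate the kind of per-user expansion meant here, the following sketch substitutes variables such as "$HOME" into storage area templates. It assumes the incarnated user's variables are already known as a simple map, which sidesteps the real difficulty noted above (resolution has to go via the TSI). `StorageAreaExpander` and its methods are hypothetical names, not part of the NJS.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StorageAreaExpander {
    // Matches shell-style variable references such as $HOME or $USER.
    private static final Pattern VARIABLE =
            Pattern.compile("\\$([A-Za-z_][A-Za-z0-9_]*)");

    private StorageAreaExpander() {}

    // Replaces each known $NAME with the user's value; unknown variables
    // are left untouched so that they can still be resolved elsewhere.
    static String expand(String template, Map<String, String> userEnvironment) {
        Matcher m = VARIABLE.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = userEnvironment.get(m.group(1));
            String replacement = (value != null) ? value : m.group(0);
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Expands every storage area template (area name -> path template).
    static Map<String, String> expandAll(Map<String, String> idbAreas,
                                         Map<String, String> userEnvironment) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : idbAreas.entrySet()) {
            result.put(e.getKey(), expand(e.getValue(), userEnvironment));
        }
        return result;
    }
}
```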
To do proper QoS, we need to know the default numbers of Nodes, Processors, CPU Time (or FloatingPoint) - this last one is also needed if we're to check CPU Usage. Finally, to do the storage checking, we also need the default amounts requested on each of the Storage Areas.

Not done. In fact, what does this mean?!
Allow the Broker to do some sort of ListVsites at a Gateway. Done?
Allow a Broker to look up the actual execution time and cost for a particular Ticket. Or provide some mechanism for upstream "registration of interest", so that the NJS pushes this information once the Ticket has been spent. Lookup is probably better; the broker could try hourly to retrieve Ticket info.

Lookup doesn't scale, alas. Feedback requires new inter-broker/inter-NJS messages, plus some policy work to control when the information is allowed to be handed back from the execution NJS plus some code in all TSIs to actually provide resource usage information back to the NJS.

Not done.
Performance of broker is poor. Too many messages flying around. Too many gateway threads being used.

Turns out the client was hammering the broker. Might be other problems too though; I suspect the NJS isn't closing its connections to other NJSes very quickly, leading to Gateway overload. Need to check.

Gateway overload too. Yuck. Ongoing.
Tickets as contracts (i.e. signed by both parties - the issuing broker/NJS and the client). This will require alterations through the UNICORE Forum.

Need to do some more conceptual work first though. Not done.
Do something with the control method on the ResourceBroker. Currently it's ignored. Not done.

Pallas Client Integration

Allow submission of an AJO to a Vsite of the Broker's choosing. Client must do things like marking it as a Pallas Client job in the usual way, but not build a new AJO, or set resource information in any way. Any Vsite not in the resource manager must be added.

Issue here is related to what to do about Usites/Vsites that are not already listed in the Resource Manager.

Part done.
Allow a checkAJO ignoring any inadequate resources for the following JobGroups, or for the following Vsite.

Alternatively expressed: Suppress warnings about unsatisfiable resources at early-stage brokering; after all, that's why we're brokering! Easiest way would be if the pre-submit error checks had a computer-readable field to say whether it was a resource-missing failure or something else...

This is really strange. Things seem to work fine when brokering at the default sites but not at all when brokering at a specific site. This seems to depend on where the job is pointing, not where it is being submitted to. Actually suppressing the warnings is non-trivial as they are not distinguished in any way from warnings relating to serious stuff, and hence you'd need support from each and every plugin that generates a task. Horrible! Punt to Pallas.
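
For what it's worth, the "computer-readable field" idea could be as small as the sketch below. It is purely hypothetical: `CheckKind`, `CheckMessage` and `BrokeringCheckFilter` are invented names, and the real client would still need every task-generating plugin to set the field.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical classification attached to each pre-submit check message.
enum CheckKind { RESOURCE_MISSING, OTHER }

final class CheckMessage {
    final CheckKind kind;
    final String text;

    CheckMessage(CheckKind kind, String text) {
        this.kind = kind;
        this.text = text;
    }
}

final class BrokeringCheckFilter {
    private BrokeringCheckFilter() {}

    // When a job is about to be brokered, resource-missing messages are
    // expected and dropped; everything else is still reported.
    static List<CheckMessage> relevantWhenBrokering(List<CheckMessage> all) {
        List<CheckMessage> kept = new ArrayList<>();
        for (CheckMessage message : all) {
            if (message.kind != CheckKind.RESOURCE_MISSING) {
                kept.add(message);
            }
        }
        return kept;
    }
}
```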
Allow a BrokerAtPreset sites to be put in the top toolbar. Done.

Also have a way of marking groups for brokering, or perhaps brokering those that don't have a specific site already picked. Not done.

Possible now(?) but would require doing things in the client that would bind us to a particular layout. We really need Pallas to rethink their strategy for plugins in this respect...
BROKER-only NJSs appear in the main Vsite list, but should not. Pallas bug/issue.

DKF - One possibility is to set things up so that people set which site does their brokering as part of the configuration. If they haven't assigned a site, they'll get the broker (and we should make sure that this works!) Then, we (i.e. I) put in some interface to allow people to choose their broker from the sites offering brokering services as part of a general brokering configuration panel. Done.

University of Manchester to do

Major items

Facade Brokering. This one is very deep.

There's a Vsite type MUTABLE that looks interesting.

A different way is to have a job that submits the real job and sends back where to look for it to the client...

Dave Snelling has a different idea: enhance the security architecture with another basic role, that of User (defaults to whichever of Consigner/Endorser is not generally set in normal NJS operation now.) Future Work.
Allow the brokering of multiple job groups (tabbed dialog or something)

Would prefer to give preferences to the broker agent and keep the GUI out of the way of the user for this.

Brokering of job groups containing multiple tasks works now. But not where there are multiple job groups.
Check sensible action regarding groups, etc. At least "Do N" should assume N executions. Otherwise, assuming 1 execution is OK for repeat-until, and the true branch for IF/THEN/ELSE.

Really tough. An alternative would be to refuse to broker any groups with such conditionals in...
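
The counting rule suggested above ("Do N" runs N times, repeat-until runs once, IF/THEN/ELSE takes the true branch) can be sketched as a small recursive estimate. The class names and the use of CPU seconds as the unit are assumptions made for illustration; they do not correspond to the real AJO classes.

```java
import java.util.List;

// Hypothetical simplified view of a job group's control structure.
interface Action {
    long estimatedCpuSeconds();
}

final class ExecuteTask implements Action {
    private final long cpuSeconds;
    ExecuteTask(long cpuSeconds) { this.cpuSeconds = cpuSeconds; }
    public long estimatedCpuSeconds() { return cpuSeconds; }
}

final class Sequence implements Action {
    private final List<Action> steps;
    Sequence(List<Action> steps) { this.steps = steps; }
    public long estimatedCpuSeconds() {
        long total = 0;
        for (Action step : steps) total += step.estimatedCpuSeconds();
        return total;
    }
}

final class DoN implements Action {
    private final int count;
    private final Action body;
    DoN(int count, Action body) { this.count = count; this.body = body; }
    // "Do N" assumes N executions of the body.
    public long estimatedCpuSeconds() { return count * body.estimatedCpuSeconds(); }
}

final class RepeatUntil implements Action {
    private final Action body;
    RepeatUntil(Action body) { this.body = body; }
    // Repeat-until assumes a single execution.
    public long estimatedCpuSeconds() { return body.estimatedCpuSeconds(); }
}

final class IfThenElse implements Action {
    private final Action thenBranch;
    private final Action elseBranch;
    IfThenElse(Action thenBranch, Action elseBranch) {
        this.thenBranch = thenBranch;
        this.elseBranch = elseBranch;
    }
    // IF/THEN/ELSE assumes the true branch; the else branch is ignored.
    public long estimatedCpuSeconds() { return thenBranch.estimatedCpuSeconds(); }
}
```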
Implement performance estimation training thing.

Or at least provide an interface to allow someone else to do performance estimation training.
Resource reservation, but only where queue supports it.
Brokering of things other than execution tasks. (Network transfers, access to distributed data, etc.)

Client plugin

Make work with UNICORE V4. This means pushing Pallas on point 1 above.

Done, but if the job submission process ever changes we're in trouble. The pointing of the visible version of the job at the Vsite that accepted it is particularly hairy...
Implement inclusion of Ticket in AJO, and present this at the Vsite. Must prompt the user with the new resource sets. User should also be able to view any invalid tickets (i.e. adverts).

Done.
Hand off an offer selection policy to the broker engine and let it do the choices for the client.

Not Done.
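
As a rough idea of what a hand-off policy could look like, the sketch below picks the cheapest offer that finishes within a deadline. Everything here is assumed for illustration: `Offer`, its fields, and `OfferPolicies.cheapestWithin` are hypothetical and say nothing about how the broker engine actually represents offers.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical flattened view of one offer returned by the broker.
final class Offer {
    final String vsite;
    final double price;
    final long estimatedSeconds;

    Offer(String vsite, double price, long estimatedSeconds) {
        this.vsite = vsite;
        this.price = price;
        this.estimatedSeconds = estimatedSeconds;
    }
}

final class OfferPolicies {
    private OfferPolicies() {}

    // Cheapest offer whose estimate fits the deadline;
    // ties on price are broken in favour of the faster offer.
    static Optional<Offer> cheapestWithin(List<Offer> offers, long deadlineSeconds) {
        return offers.stream()
                .filter(o -> o.estimatedSeconds <= deadlineSeconds)
                .min(Comparator.comparingDouble((Offer o) -> o.price)
                        .thenComparingLong(o -> o.estimatedSeconds));
    }
}
```

An empty result means no offer meets the deadline, at which point the client would presumably fall back to asking the user.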
Ticket message as HTML. Banner ads!

Not Done. The Examine dialog needs more work anyway.
Issue an initial blank ticket from the client to help work around future delegation problems.

Not Done. I don't like this, or at least I don't like to think of doing this before we have stronger binding of tickets to resources than we currently have.

NJS-side

Add code to check/validate tickets (see NJS Interface point 2). Need to keep tickets as NJS state.

This is the TicketManager thing all over again.

Done, but could do with more information about the tickets being claimed, so that they can be checked as having been applied to the correct abstract action, etc.
Implement mechanism for feedback of actual results to broker. Probably have a pull mechanism.

Not Done.
Multiple execute tasks in single TaskResourceDAG - do sensible QoS checking in one go, passing dependence information. Implement Broker QoS script to handle multiple entries and dependences.

Part Done. Works, but assumes all execute tasks in a job are independent of each other.
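
Taking dependences into account might, at its simplest, mean estimating completion time along the longest dependency chain rather than treating every task as independent. The sketch below does that for a map of per-task estimates; `TaskDagEstimator`, the string task names and the seconds unit are all assumptions, and it is not the Broker QoS script.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class TaskDagEstimator {
    private TaskDagEstimator() {}

    // durations: task -> estimated seconds.
    // dependsOn: task -> tasks that must finish before it can start.
    // Returns the earliest time by which every task can have finished,
    // assuming unlimited parallelism between independent tasks.
    static long earliestCompletion(Map<String, Long> durations,
                                   Map<String, List<String>> dependsOn) {
        Map<String, Long> finish = new HashMap<>();
        Set<String> visiting = new HashSet<>();
        long latest = 0;
        for (String task : durations.keySet()) {
            latest = Math.max(latest,
                    finishTime(task, durations, dependsOn, finish, visiting));
        }
        return latest;
    }

    // Memoised depth-first walk; rejects unknown tasks and cycles.
    private static long finishTime(String task,
                                   Map<String, Long> durations,
                                   Map<String, List<String>> dependsOn,
                                   Map<String, Long> finish,
                                   Set<String> visiting) {
        Long known = finish.get(task);
        if (known != null) return known;
        Long own = durations.get(task);
        if (own == null) {
            throw new IllegalArgumentException("unknown task: " + task);
        }
        if (!visiting.add(task)) {
            throw new IllegalArgumentException("dependency cycle at: " + task);
        }
        long start = 0;
        for (String dep : dependsOn.getOrDefault(task, Collections.<String>emptyList())) {
            start = Math.max(start,
                    finishTime(dep, durations, dependsOn, finish, visiting));
        }
        visiting.remove(task);
        long done = start + own;
        finish.put(task, done);
        return done;
    }
}
```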
Do something with CPU Quota and disk quota scripts.

Not Done. See the getStorageLocation method outlined above.
Allow configuration to offer different priority/prices things. Will change QoS checking script.

Not Done.
Not filling in VsiteTaskIDStatusMap at the moment - just the VsiteEstimateMap.

Part Done. The data is filled in, but not on a per-user basis. Need per-user resource checks so as to perform quota checks especially.
Use GRIP work to broker for Globus resources.

Part Done. Only works with MDS2-based resource systems, so information exceedingly poor.

Issues from GRIP Deliverable 1.1c (some duplicates)

Extended Expert Brokering
Resource Consumption Feedback
Self-Configuring Brokers
Brokering of Jobs Containing Coupled Tasks
Automated Offer Selection
Brokering and Service Level Agreements
Brokering with Distributed Data and Network Transfers
Brokering in a Wider Grid-Services Context
Using Ontology Work to Drive Forward Resource Descriptions

These lists of things were originally written by Jon MacLaren, but are currently maintained by Donal K. Fellows.