Google Summer of Code 2011 Ideas
Globus has been accepted as a Google Summer of Code 2011 mentoring organization. This page lists our proposed GSoC project ideas. The project ideas are grouped according to the Globus projects that proposed them, but this is by no means an exclusive list of ideas; if you have a cool idea for a Globus-related project, please contact one of the GSoC mentors. There are also additional pages where you may be able to find inspiration for interesting summer projects.
Before submitting an application to GSoC with Globus as your mentoring organization, make sure you read our GSoC FAQ, which provides some pointers on how to write a successful application.
Once you are ready to submit an application, remember that you must do so before April 8th through the GSoC webapp.
Globus Toolkit projects
Connection Management in GridFTP
Globus project: Globus Toolkit
Mentor: Raj Kettimuthu
Programming Language/s: C
Level of Expertise: Intermediate
Description: The GridFTP protocol is a backward-compatible extension of the legacy RFC 959 FTP protocol. The Globus implementation of GridFTP is widely used for data movement in the Grid community. To access the server, a user must be authenticated, have appropriate read and write permissions, and respect the total connection limit, but beyond that there is no management or control. A user can hold a connection open indefinitely and move an unlimited number of files (barring disk space or system quota constraints). We need more flexible management to limit the length of time a user can hold a connection, to address prioritization and responses to overburdened services, and to prevent starvation. Globus Fork (GFork) is a user-configurable super-server daemon, very similar to xinetd, that enables sharing of state across client connections for a service through user-defined master programs that coordinate resource sharing. Each GFork instance has an associated master process: when GFork starts, it runs a user-defined master program and opens bidirectional pipes to it. The master program runs for the lifetime of the GFork daemon and, being user-defined, is free to do whatever it wants. For example, it can monitor system resources and implement an algorithm to manage connection requests to the GridFTP server and/or the memory usage of GridFTP server processes. The goal of this project is to develop a GFork plugin for connection management in the GridFTP server, for example one that limits a low-priority user to one concurrent connection but allows a high-priority user 10 connections.
Requirements:
- A command of UNIX environments
- C programming
- Good operating systems background
- A basic understanding of FTP
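As an illustration of the kind of policy a GFork master program could enforce, here is a minimal sketch in Python. The actual plugin would be written in C against the GFork interfaces; the class, the priority names, and the user names below are all hypothetical, and only the limits of 1 and 10 connections come from the description above.

```python
from collections import defaultdict

# Hypothetical per-priority limits on concurrent connections
# (the values 1 and 10 follow the example in the project description).
LIMITS = {"low": 1, "high": 10}


class ConnectionPolicy:
    """Tracks open connections per user and applies per-priority limits."""

    def __init__(self, limits, priorities):
        self.limits = limits            # priority class -> max concurrent connections
        self.priorities = priorities    # user name -> priority class
        self.active = defaultdict(int)  # user name -> currently open connections

    def try_open(self, user):
        """Return True and record the connection if the user is under the limit."""
        priority = self.priorities.get(user, "low")  # unknown users count as low priority
        if self.active[user] >= self.limits[priority]:
            return False
        self.active[user] += 1
        return True

    def close(self, user):
        """Record that one of the user's connections has ended."""
        if self.active[user] > 0:
            self.active[user] -= 1


policy = ConnectionPolicy(LIMITS, {"alice": "high", "bob": "low"})
first = policy.try_open("bob")    # accepted
second = policy.try_open("bob")   # refused until bob's first connection closes
```

A time limit on connections or a starvation guard would need additional state (for example, connection start times), which this sketch leaves out.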
GridFTP Server Windows Port
Globus project: Globus Toolkit
Mentor: Mike Link
Programming Language/s: C
Level of Expertise: Intermediate
Description: Port the GridFTP server code to Windows 7. This should be done natively (i.e., not under Cygwin). The majority of the server code is already portable. User and file accounting and process management will need to be written using native Windows APIs. The port should include an installer as well as service management support. The modified code should compile with MinGW or cross-compile with gcc on Linux.
Requirements:
- Familiarity with the Windows C APIs
- General UNIX/POSIX knowledge
DemoGrid Web Console
Globus project: DemoGrid
Mentor: Borja Sotomayor
Programming Language/s: Python
Level of Expertise: Intermediate
Description: Globus DemoGrid is a tool that will build an instructional grid environment that you can then deploy, using virtual machines, on a cloud (such as Amazon EC2) or on your own physical resources. The current interface is console-based, but a web-based interface would make DemoGrid even easier to use by beginners. This web console could be either used locally (by a single user), or installed on a site for several users to access (e.g., if DemoGrid uses local resources instead of Amazon EC2).
Requirements:
- DemoGrid is written in Python, although the Web Console can be written in your language of choice. So, some knowledge of Python is required.
- Should already be familiar with building web applications, AJAX, and all that jazz.
Add more Chef recipes to DemoGrid
Globus project: DemoGrid
Mentor: Borja Sotomayor
Programming Language/s: Python
Level of Expertise: Advanced
Description: Globus DemoGrid is a tool that will build an instructional grid environment that you can then deploy, using virtual machines, on a cloud (such as Amazon EC2) or on your own physical resources. DemoGrid currently supports a limited number of Grid technologies (see "Supported Grid technologies" in http://confluence.globus.org/display/DEMOGRID/Introduction+to+DemoGrid). In this project, the student will write Chef recipes that will allow DemoGrid to automatically deploy new Grid software.
Requirements:
- Familiar with Chef (http://www.opscode.com/chef/)
- DemoGrid is written in Python and, although most of the work will involve writing Chef recipes and various scripts, some familiarity with Python will come in handy.
- This project is meant for students who are already familiar with a particular Grid technology, and would like to add support for it in DemoGrid.
Globus Online projects
Adaptive Tuning in Globus Online
Globus project: Globus Online
Mentor: Tanu Malik
Programming Language/s: C
Level of Expertise: Advanced
Description: The goal of this project is to design an automatic tuning framework within Globus Online. GridFTP is the primary protocol used within Globus Online for performing bulk data transfers. In GridFTP, the efficiency of a data transfer operation depends upon correctly predicting the number of streams to use for the transfer. Parallel streams often achieve high throughput by reducing bandwidth delay product. However, efficiency drops when all streams experience packet loss. Thus, correctly predicting the value of this parameter is often a challenge and depends on a variety of factors such as bandwidth, RTT, and packet loss rate. While GridFTP employs models to predict the thread level, it also discourages changing the thread level arbitrarily, as doing so hurts overall performance. We will implement a "decision framework" that will determine for GridFTP whether it is profitable to change its current thread level. The framework will consist of (a) predicting the value of this parameter under a given number of threads, (b) estimating the overheads within GridFTP of changing its current number of streams, and (c) adaptively deciding whether to change its current stream level. The project will require an intimate knowledge of GridFTP and an understanding of throughput prediction models as described in the following papers [1,2,3]. The deliverables of the project will be:
- Implement the decision framework,
- Test it using real workloads from past GridFTP usage, and
- Measure performance with and without the decision framework.
References:
- [1] W. Liu, B. Tieman, R. Kettimuthu, I. Foster, "A Data Transfer Framework for Large-Scale Science Experiments", in International Workshop on Data Intensive Distributed Computing (DIDC 2010), 2010.
- [2] J. Lee, D. Gunter, B. Tierney, B. Allcock, J. Bester, J. Bresnahan, and S. Tuecke, "Applied techniques for high bandwidth data transfers across wide area networks", in Proc. International Conference on Computing in High Energy and Nuclear Physics (CHEP01), 2001.
- [3] E. Yildirim, D. Yin, T. Kosar, "Prediction of Optimal Parallelism Level in Wide Area Data Transfers", in IEEE Transactions on Parallel and Distributed Systems, 2010.
Requirements:
- A basic understanding of FTP
- C programming
- Understanding of network fundamentals
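To make steps (a) to (c) of the decision framework concrete, the following Python sketch compares the estimated completion time with and without a change in stream count. It is only an illustration of the decision rule: the function names, the toy throughput model, and all numbers are assumptions, and a real implementation would be in C and would use the prediction models from the cited papers.

```python
def toy_predictor(streams):
    """Hypothetical throughput model: 10 MB/s per stream, capped at 80 MB/s."""
    return min(streams * 10e6, 80e6)


def should_change_streams(current_streams, candidate_streams, predict_throughput,
                          remaining_bytes, switch_overhead_s):
    """Return True if switching stream count is expected to finish sooner.

    predict_throughput: callable mapping a stream count to bytes/second (step a).
    switch_overhead_s:  estimated cost in seconds of changing streams (step b).
    """
    current_rate = predict_throughput(current_streams)
    candidate_rate = predict_throughput(candidate_streams)
    if current_rate <= 0 or candidate_rate <= 0:
        return False  # no usable prediction, so keep the current setting
    time_if_stay = remaining_bytes / current_rate
    time_if_switch = switch_overhead_s + remaining_bytes / candidate_rate
    return time_if_switch < time_if_stay  # step c: change only when it pays off


# With 10 GB left, moving from 2 to 8 streams outweighs a 5 s switching cost.
large_transfer = should_change_streams(2, 8, toy_predictor, 10e9, 5.0)
# With 50 MB left, the same switch costs more than it saves.
small_transfer = should_change_streams(2, 8, toy_predictor, 50e6, 5.0)
```

The same comparison explains why arbitrary changes hurt: whenever the predicted gain is smaller than the switching overhead, the rule keeps the current stream level.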
JGlobus GridFTP/GlobusOnline bridge
Globus project: Globus Online
Mentor: Vijay Anand
Programming Language/s: Java
Level of Expertise: Advanced
Description: The current JGlobus client for GridFTP is used extensively to connect to GridFTP servers from Java. As the features supported by Globus Online expand, many JGlobus users may want to connect to Globus Online instead of using GridFTP directly. To support this in a seamless way, this project proposes to implement the JGlobus API for the GridFTP client using the REST interfaces provided by Globus Online. This will allow existing users of the JGlobus client to migrate to Globus Online with few (or, ideally, no) changes to their existing code. The project will require the following steps:
- Conduct a gap analysis between the RESTful API for Globus Online and what JGlobus provides. This should yield suggestions for changes to the Globus Online APIs.
- Implement all appropriate JGlobus GridFTP client methods using the Globus Online RESTful API
- Document the usage of the client.
Requirements:
- A basic understanding of REST principles
- Java programming
- Basic understanding of JGlobus
Graph based Cassandra Partitioner
Globus project: Globus Online
Mentor: Tom Howe
Programming Language/s: Python
Level of Expertise: Advanced
Description: Globus Online uses the Cassandra[1] datastore for persisting its information. To facilitate using Cassandra, we developed a graph database called Agamemnon[2], which runs on top of Cassandra. This allows us to use graph semantics for modeling our data. When data is stored in Cassandra, it is distributed via a Partitioner, which determines which data should go to which backend datastore. In order to optimize the storage and retrieval of data, this project would implement a new Partitioner that uses a standard graph visualization algorithm to distribute the data. One such algorithm is the Fruchterman-Reingold algorithm[3]. This way, items of information that are closely related can be stored near each other. This project will require the following steps:
- Identify the best graph visualization algorithms for partitioning the data
- Implement the partitioner
- Fully test and document the code.
References:
- [1] http://cassandra.apache.org
- [2] https://github.com/turtlebender/agamemnon
- [3] Fruchterman, Thomas M. J.; Reingold, Edward M. (1991). "Graph Drawing by Force-Directed Placement". Software – Practice & Experience (Wiley) 21 (11): 1129–1164.
Requirements:
- An understanding of graph algorithms
- Python Programming
- A familiarity with Cassandra and how Cassandra distributes data
iOS application for Globus Online
Globus project: Globus Online
Mentor: Tom Howe
Programming Language/s: Objective C
Level of Expertise: Advanced
Description: To support users who have mobile devices produced by Apple, we propose a project to develop an iPhone/iPad application that will give access to the core services provided by Globus Online. This will include credential management and file transfer capabilities. The app should replicate most of the functionality of the current site and provide a secure experience for users who want to interact with Globus Online. The project will require the following steps:
- Wireframe the application
- Build the application according to the specifications
- Document the usage of the application.
Requirements:
- An understanding of how to develop applications in iOS
- Objective C programming (or C)
- Basic understanding of Globus Online
Android App for Globus Online
Globus project: Globus Online
Mentor: Bryce Allen
Programming Language/s: Java
Level of Expertise: Advanced
Description: We propose a project to develop an Android application that will give access to the core services provided by Globus Online. This will include credential management and file transfer capabilities. The app should replicate most of the functionality of the current site and provide a secure experience for users who want to interact with Globus Online.
The Globus Online File Transfer (REST) API documentation is here: http://transfer.api.globusonline.org
The project will require the following steps:
- Wireframe the application
- Build the application according to the specifications
- Document the usage of the application.
Requirements:
- An understanding of how to develop Android applications
- Experience using Linux systems
Nimbus projects
Automatic User Management
Globus project: Nimbus
Mentor: John Bresnahan
Programming Language/s: Python and Java
Level of Expertise: Intermediate
Description: Currently in Nimbus all of our user management tools are run by the Nimbus administrator. While often appropriate, this is inconvenient for (among other things) tutorials. We are interested in providing a solution that will allow an easier process for automatically creating users.
The service would allow a user to request access to a Nimbus cloud via a REST API. The user would provide an email address with their request. This request would trigger some configurable logic that would either automatically approve the request, reject the request, or put the request in a queue awaiting the admin's approval. The decision would be made based on the provided email address and a set of configurable rules (e.g., accept all .edu, all uchicago.edu, etc.).
Once the account is approved (either automatically or via the sysadmin), the user will be sent an email with a link to a web page. The web page is only available for a short period of time. When the user visits that web page they can download their credentials and will then have access to the Nimbus cloud.
If there is time in the project we will also modify the cloud client to take the approval URL (described above) as a parameter and use it to automatically download the needed credentials from the REST service and install them into the user's cloud-client installation.
We currently have a web application that delivers credentials to users, but it does not solve the entire problem: it is meant to handle the delivery of credentials, not the request. A user still needs to contact a cloud admin and request an account. That cloud admin then needs to run the nimbus-new-user program and send the new user a URL served by the current web app. At that point the user can retrieve their credentials.
We need to add a means of requesting accounts. A web page or, preferably, a web API will be created that allows a user to request an account given their email address (and some other set of information). The user will receive an email at the given address indicating whether the request was accepted and, if so, a URL like the one provided by the current web app.
Requirements:
- Python
- Java
- General UNIX skills
- Experience with Django a plus
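The configurable approval rules described for this project could look something like the sketch below. It is a hypothetical illustration, not part of Nimbus: the rule format, the function name, and the example.org entry are assumptions, while the .edu and uchicago.edu rules mirror the examples in the description.

```python
# Hypothetical rule table: (domain or ".suffix", decision). The first match wins.
RULES = [
    ("uchicago.edu", "approve"),
    (".edu", "approve"),
    ("example.org", "reject"),
]


def decide(email, rules=RULES, default="queue"):
    """Return 'approve', 'reject', or 'queue' for an account request."""
    _, sep, domain = email.strip().lower().rpartition("@")
    if not sep or not domain:
        return "reject"  # not a usable email address
    for pattern, decision in rules:
        if pattern.startswith("."):
            matched = domain.endswith(pattern)  # suffix rule such as ".edu"
        else:
            # exact domain or any of its subdomains
            matched = domain == pattern or domain.endswith("." + pattern)
        if matched:
            return decision
    return default  # no rule applies: leave the request for the admin


decision = decide("student@cs.uchicago.edu")  # "approve"
```

Requests that fall through to the default stay in the queue, which corresponds to the "awaiting the admin's approval" case.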
Multipart Uploads in Cumulus (S3)
Globus project: Nimbus
Mentor: John Bresnahan
Programming Language/s: Python
Level of Expertise: Intermediate
Description: Cumulus is an S3 look-alike service (http://www.nimbusproject.org/docs/2.7/faq.html#cumulus). Since the release of Cumulus, Amazon has added support for multi-part uploads to their protocol. More information about this can be found here: http://docs.amazonwebservices.com/AmazonS3/latest/dev/index.html?uploadobjusingmpu.html.
In this project we will add support for multi-part upload to the Cumulus service.
Requirements:
- Python
- General UNIX skills
- Understanding of networks and data transfer
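The multi-part upload flow has three phases: initiate an upload, send numbered parts (possibly out of order), and complete the upload by assembling the parts in part-number order. The Python sketch below models that flow in memory. It is not Cumulus code; the class and method names are hypothetical, and the MD5 digest is only a stand-in for the per-part checksum a real service returns.

```python
import hashlib
import uuid


class MultipartStore:
    """In-memory model of the initiate / upload-part / complete sequence."""

    def __init__(self):
        self.uploads = {}  # upload id -> {"key": str, "parts": {number: bytes}}
        self.objects = {}  # object key -> assembled bytes

    def initiate(self, key):
        """Start an upload for the given object key and return its upload id."""
        upload_id = uuid.uuid4().hex
        self.uploads[upload_id] = {"key": key, "parts": {}}
        return upload_id

    def upload_part(self, upload_id, part_number, data):
        """Store one part; re-sending a part number replaces the earlier data."""
        self.uploads[upload_id]["parts"][part_number] = data
        return hashlib.md5(data).hexdigest()  # stand-in for the part checksum

    def complete(self, upload_id, part_numbers):
        """Assemble the listed parts in ascending part-number order."""
        upload = self.uploads[upload_id]
        missing = [n for n in part_numbers if n not in upload["parts"]]
        if missing:
            raise KeyError("missing parts: %s" % missing)
        data = b"".join(upload["parts"][n] for n in sorted(part_numbers))
        self.objects[upload["key"]] = data
        del self.uploads[upload_id]
        return len(data)


store = MultipartStore()
upload_id = store.initiate("image.gz")
store.upload_part(upload_id, 2, b"world")   # parts may arrive out of order
store.upload_part(upload_id, 1, b"hello ")
size = store.complete(upload_id, [1, 2])    # object is b"hello world"
```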
LANTorrent Compression
Globus project: Nimbus
Mentor: John Bresnahan
Programming Language/s: Python and Java
Level of Expertise: Intermediate
Description: LANTorrent is a multicast network protocol in the Nimbus toolkit that is used to propagate VM images to backend nodes for execution. VM images often have 'blank space' in them. Because of this they are sparsely populated files, and they compress very well.
In this project we will modify LANTorrent in three ways. First, we will make it compress files as it sends them and uncompress them as they arrive. Second, we will associate file extensions with compression algorithms, allowing LANTorrent to send already-compressed files to a receiver that will unzip them as they flow in. Third, we will add support for sparse files so that the compression is not needed.
Requirements:
- Python
- General understanding of UNIX
- Understanding of sparse files
- Understanding of basic network architecture
- Understanding of network protocols and TCP communication
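The first modification, compressing data as it is sent and uncompressing it as it arrives, can be sketched with Python's standard zlib streaming objects. This is a minimal illustration and not LANTorrent code: the function names and the chunked "image" are assumptions, and the network transport between sender and receiver is omitted.

```python
import zlib


def compress_stream(chunks, level=6):
    """Sender side: yield compressed pieces for an iterable of byte chunks."""
    compressor = zlib.compressobj(level)
    for chunk in chunks:
        piece = compressor.compress(chunk)
        if piece:  # the compressor may buffer input and emit nothing yet
            yield piece
    yield compressor.flush()  # emit whatever is still buffered


def decompress_stream(pieces):
    """Receiver side: yield original bytes as compressed pieces flow in."""
    decompressor = zlib.decompressobj()
    for piece in pieces:
        data = decompressor.decompress(piece)
        if data:
            yield data
    tail = decompressor.flush()
    if tail:
        yield tail


# A mostly empty "image": 1 MiB of zero bytes followed by a little data.
image_chunks = [b"\0" * 65536] * 16 + [b"data"]
sent = list(compress_stream(image_chunks))
received = b"".join(decompress_stream(sent))
```

Because the blank regions compress so well, the bytes on the wire are a small fraction of the image size. Sparse-file support would avoid even that work by skipping the blank regions entirely.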
Dead VM Reaper
Globus project: Nimbus
Mentor: Dave LaBissoniere
Programming Language/s: Python and Java
Level of Expertise: Beginner
Description: In the Nimbus IaaS platform, virtual machines are remotely executed and managed via libvirt. When a machine is launched, its existence is recorded in a database associated with the frontend service. When it is killed, the record of that instance is removed. A problem comes up when a VM unexpectedly dies. The frontend will not know it has died until the time allocation for that VM expires, or until a user manually kills it. This leaves the frontend thinking it has resources in use that are not and that could actually be freed up for use by other VMs.
In this project we will add functionality to Nimbus to periodically check the backend nodes for running VMs. When we find that a VM no longer exists this new functionality will remove it from the frontend's database.
Requirements:
- Python
- Java
- libvirt a plus
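At its core, the periodic check is a reconciliation between what the frontend database records and what each backend node reports. The Python sketch below shows that comparison only; the data shapes and names are hypothetical, and querying libvirt on each node and updating the Nimbus database are left out.

```python
def find_dead_vms(db_records, running_by_node):
    """Return ids of VMs that are recorded in the database but no longer running.

    db_records:      dict mapping VM id -> node the frontend believes it runs on.
    running_by_node: dict mapping node -> set of VM ids the node reported.
    Nodes missing from the report are skipped, because an unreachable node
    tells us nothing about whether its VMs are alive.
    """
    dead = []
    for vm_id, node in sorted(db_records.items()):
        reported = running_by_node.get(node)
        if reported is None:
            continue  # node did not answer, so make no decision about this VM
        if vm_id not in reported:
            dead.append(vm_id)
    return dead


records = {"vm-1": "node-a", "vm-2": "node-a", "vm-3": "node-b", "vm-4": "node-c"}
report = {"node-a": {"vm-1"}, "node-b": {"vm-3"}}  # node-c did not respond
dead = find_dead_vms(records, report)              # ["vm-2"]
```

Skipping unreachable nodes is a deliberate design choice in this sketch: removing records for VMs on a node that is merely slow to respond would be worse than leaving a stale record for one more cycle.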
EPU REST Interfaces
Globus project: Nimbus
Mentor: Dave LaBissoniere
Programming Language/s: Python and Java
Level of Expertise: Beginner
Description: Currently the Elastic Processing Unit is using AMQP to communicate messages between components. The system is distributed and relies heavily on messaging for incoming sensor data as well as control channel messages.
The project is to convert this messaging to REST or HTTP RPC based mechanisms and to ensure that there is a retry mechanism in place that fits with the reliability design. After some experimentation, the first half of the project will be replacing the marshalling layers and configuration files. The second half of the project will be security work and polishing.
Requirements:
- Python
- General UNIX skills
- Strong security mindset
- Experience with Twisted a plus
- Experience with REST/HTTP in Python a plus
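One common shape for the retry mechanism mentioned above is a bounded number of attempts with exponential backoff. The sketch below is a generic Python illustration under that assumption; it is not the EPU design, the function and parameter names are hypothetical, and it presumes the retried request is safe to repeat.

```python
import time


def call_with_retry(operation, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call operation() until it succeeds or the attempts are used up.

    Waits base_delay, then 2 * base_delay, then 4 * base_delay, and so on
    between attempts. The sleep function is injectable to ease testing.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the failure
            sleep(base_delay * (2 ** attempt))


# Example: an operation that fails twice before succeeding.
calls = {"count": 0}
delays = []


def flaky_request():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("service unavailable")
    return "ok"


result = call_with_retry(flaky_request, sleep=delays.append)  # records delays instead of sleeping
```

With AMQP, the broker typically holds messages for a consumer that is temporarily down; with plain HTTP, the sender has to take on that responsibility, which is why the retry policy needs to match the reliability design.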
virtio Support in Generated libvirt XML
Globus project: Nimbus
Mentor: Dave LaBissoniere
Programming Language/s: Python and Java
Level of Expertise: Intermediate
Description: Right now the XML file created for launching VMs with libvirt is generated in Python code. We would like to separate this out into a template, with the initial goal of supporting virtio, but also of providing a more configurable service for cloud experimentation. More information can be found here: https://github.com/nimbusproject/nimbus/issues#issue/33
Requirements:
- Python
- libvirt a plus
- KVM or Xen a plus
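As a rough idea of what moving the XML into a template could mean, the Python sketch below renders only the disk element of a libvirt domain definition and switches between virtio and IDE. It is an assumption-laden illustration rather than the Nimbus implementation: the template, the function name, and the image path are hypothetical, and a real template would cover the whole domain XML.

```python
from string import Template
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

# Hypothetical template for one <disk> element of a libvirt domain definition.
DISK_TEMPLATE = Template("""\
<disk type='file' device='disk'>
  <driver name='qemu' type='$image_format'/>
  <source file='$image_path'/>
  <target dev='$target_dev' bus='$bus'/>
</disk>""")


def render_disk(image_path, image_format="raw", use_virtio=True):
    """Fill in the disk template, choosing the bus and device name to match."""
    return DISK_TEMPLATE.substitute(
        image_format=image_format,
        image_path=escape(image_path, {"'": "&apos;"}),  # keep the attribute well-formed
        bus="virtio" if use_virtio else "ide",
        target_dev="vda" if use_virtio else "hda",
    )


# The path below is a made-up example.
xml_text = render_disk("/var/lib/images/vm.qcow2", image_format="qcow2")
disk = ET.fromstring(xml_text)  # parsing confirms the output is well-formed XML
```

Keeping the template as data rather than code means an administrator could change drivers or add devices without modifying the Python that fills it in.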
Make KVM First Class Functionality
Globus project: Nimbus
Mentor: Dave LaBissoniere
Programming Language/s: Python and Java
Level of Expertise: Advanced
Description: Currently the Nimbus toolkit is geared towards Xen. While it works with KVM, it requires some modifications and an alternative set of instructions. Further, we do not fully take advantage of some KVM features. The main task here will be adding support for KVM image formats (qcow2, QED, raw...) and device drivers (ide, scsi, virtio).
Requirements:
- Python
- General UNIX knowledge
- KVM knowledge
Improved Admin Tools
Globus project: Nimbus
Mentor: Patrick Armstrong
Programming Language/s: Python and Java
Level of Expertise: Beginner
Description: Provide API and command line tools for administrator operations on Nimbus clouds. For example, terminating a running VM, listing VMs by user, node, etc. The task is to provide a clean API and implementation which takes requests and makes necessary calls in the service, then provide polished and well-documented command line tools which communicate with these APIs.
Requirements:
- Python
- General UNIX knowledge
VM Console Output
Globus project: Nimbus
Mentor: John Bresnahan
Programming Language/s: Python and Java
Level of Expertise: Advanced
Description: When a system is booted, be it a VM or a real machine, its console output is very useful to read, both for diagnostic information and for debugging. The Nimbus IaaS platform currently does not provide a way for users to see this output, so when a VM image fails to boot on Nimbus it can be hard to figure out why. Often the cloud admin who has access to the node it was started on needs to get involved. This makes the creation of new VM images all the harder for users.
EC2 and other IaaS platforms allow the user to fetch the console output for inspection and debugging. The task is to add support for fetching the console output of a booted VM in Nimbus. This task touches most layers of the stack: cloud client, web services, service<->node communication, libvirt integration.
The deliverables for this project would be:
- A modified cloud client that remotely calls the fetch console method
- An additional operation on the service to fetch the console output
- Modification to VM management code to capture the console output and store it for fetching
Requirements:
- Python
- Java
- Experience with Virtual Machines
- Experience with an IaaS platform (EC2 is fine) a plus
Replace ssh communication to cluster VMMs with AMQP
Globus project: Nimbus
Mentor: John Bresnahan
Programming Language/s: Python and Java
Level of Expertise: Advanced
Description: Nimbus uses ssh to communicate with the VMM in the Nimbus cluster. This can cause some scalability issues in the case of highly loaded Nimbus clusters. In order to increase the scalability of Nimbus deployments the development project would involve replacement of ssh messaging with AMQP, probably using RabbitMQ. This would be a good project to learn about AMQP and scalability of services.
The deliverables for this project would be:
- A messaging framework using AMQP that the Nimbus IaaS service can use.
Requirements:
- Python
- Java
- Experience with RabbitMQ
- Familiarity with AMQP
Context Broker Improvements
Globus project: Nimbus
Mentor: Dave LaBissoniere
Programming Language/s: Python and Java
Level of Expertise: Intermediate
Description: The Nimbus Context Broker provides a facility for coordinating the launch of many VMs and securely exchanging information. A common use is to set up a virtual compute cluster (PBS for example) composed of VMs. The broker is the central service which is contacted by each node. Currently it lacks persistence, so any service failure or restart causes failure of incomplete contexts. It also is too precise about its requirements for success: for example if launching a cluster with 1000 workers and only 999 check in, currently this means failure for the whole run. It should be possible to allow it to proceed in these situations.
The deliverables for this project would be:
- An implementation of persistence to a database for the state of the Context Broker
- A design and implementation of a scheme allowing "partial success" of launches, either by some timeout or a manual operation saying to proceed with contextualization
Requirements:
- Java
- Python (potentially)
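The "partial success" deliverable could be driven by a small decision rule like the one sketched below in Python. The thresholds, state names, and function signature are assumptions made for illustration; the actual scheme is part of the design work for this project, and the broker itself is written in Java.

```python
def context_status(expected, checked_in, elapsed_s, timeout_s,
                   min_fraction=1.0, force=False):
    """Decide how a context launch should proceed.

    Returns one of:
      'complete' - every expected node has checked in
      'proceed'  - continue contextualization with the nodes that checked in
      'waiting'  - still inside the timeout window
      'failed'   - timeout passed with too few nodes
    """
    if checked_in >= expected:
        return "complete"
    if force:
        return "proceed"  # manual operation telling the broker to go ahead
    if elapsed_s < timeout_s:
        return "waiting"
    if checked_in >= min_fraction * expected:
        return "proceed"  # enough nodes arrived before the timeout
    return "failed"


# 999 of 1000 workers checked in and the timeout has passed.
status = context_status(1000, 999, elapsed_s=600, timeout_s=300, min_fraction=0.99)
```

Setting min_fraction to 1.0 reproduces the current all-or-nothing behavior, so the relaxed rule can be opt-in.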
Swift projects
Develop data management provider (driver) for Globus Online
Globus project: Swift
Mentor: Michael Wilde
Programming Language/s: Java
Level of Expertise: Intermediate
Description: Enable Swift to move large datasets using Globus Online by developing a "data provider" interface for the Java "CoG Kit" layer that Swift uses to move data and execute remote tasks.
Requirements:
- Java
- REST
Integrate Swift into Globus Online to provide an application execution and scripting service
Globus project: Swift
Mentor: Michael Wilde
Programming Language/s: shell, Python
Level of Expertise: Intermediate
Description: Embed the Swift scripting engine into Globus Online to enable GO users to define and execute parallel application scripts as a cloud-hosted service.
Requirements:
- Scripting via shell and/or Python
- REST
SwiftScript functional iteration constructs
Globus project: Swift
Description: SwiftScript, the language of Swift, has a feel quite like a functional programming language. Some difficulties arise where it does not (for example in its foreach and iterate control constructs). I would be interested to see constructs that are more like map, fold and scan as found in (for example) Haskell or other functional languages.
I have a fairly good idea what this should look like, so there is not a huge amount of design work involved. Initially we'd want to develop constructs which can replicate the present iteration constructs, but more nicely expressed - and then attempt to see how easy they are for people to use compared to the existing ones.
Mentor: Mike Wilde
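For readers less familiar with the three constructs named above, here is what map, fold, and scan do, shown in Python rather than SwiftScript or Haskell. This only illustrates the semantics of the operations; it says nothing about how the SwiftScript syntax should look.

```python
from functools import reduce
from itertools import accumulate
import operator

sizes = [3, 1, 4, 1, 5]

# map: apply a function to every element independently (naturally parallel).
doubled = list(map(lambda x: 2 * x, sizes))       # [6, 2, 8, 2, 10]

# fold: combine all elements into a single result.
total = reduce(operator.add, sizes, 0)            # 14

# scan: like fold, but keep every intermediate result.
running = list(accumulate(sizes, operator.add))   # [3, 4, 8, 9, 14]
```

A foreach loop corresponds most closely to map, while loops that carry state from one iteration to the next correspond to fold or scan.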
Enhance Swift app() functions with run-time specifications
Globus project: Swift
Description: Add the ability to specify things like MPI width, RAM and walltime needs, disk space needs, etc. calculated by Swift expressions from variable info available in the script. This would feed into the existing rich Swift "profile" framework that is currently only statically settable at workflow start time. More details TBD.
Mentor: Mike Wilde
Implementing efficient Map-Reduce models using the Swift parallel scripting language
Globus project: Swift
Description: The Swift parallel scripting language ( www.ci.uchicago.edu/swift ) enables you to run many (tens of thousands and more) application programs in parallel on clusters, grids, clouds, and supercomputers, as well as on multicore workstations. While Swift, as a functional-inspired programming language, makes it easy to specify a map-reduce style of scripting, as a general functional language it operates on typed data objects and does not intrinsically use the key-value data model of Google and Hadoop MapReduce. Hence it lacks the ability to automatically sort, partition, merge, and reduce results. This project will focus on creating a map-reduce library tailored for Swift execution. It will enable key-value-style problems to be expressed in Swift, as well as efficiently applying map-reduce to problems where keys and values can be implicitly derived from the data types being processed. The result will be a pleasing and highly productive style of specifying map-reduce problems in a simple functional notation. You could support MapReduce-style applications without explicit support from Swift. I think all that is needed are some applications that Swift would call. For example, if you wanted to do a sort via MapReduce, all you need is a Map to break up a file into buckets by key, and a Merge that takes some buckets and puts them back into a single file. The data flow of this application can be expressed simply by two subsequent for loops. Depending on the level at which you were thinking of addressing MapReduce applications, I might be interested in co-mentoring students on this project as well.
Mentors: Michael Wilde
Level: Advanced
Requirements:
- Familiarity with map-reduce and other parallel programming concepts
- Skills in scripting in Python or similar languages
- Moderate Java skills to write and run Hadoop programs
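The sort example in the description (a Map that splits records into buckets by key, then a Merge that puts the buckets back together) can be sketched as follows. This is plain Python on in-memory lists, chosen only to show the data flow; in Swift the two steps would be external applications invoked from two successive loops, and the function names and bucket boundaries here are hypothetical.

```python
import bisect


def map_to_buckets(records, boundaries):
    """Map step: place each (key, value) record in the bucket for its key range.

    boundaries must be sorted. Bucket i receives the keys that are at least
    boundaries[i-1] and below boundaries[i].
    """
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for key, value in records:
        index = bisect.bisect_right(boundaries, key)
        buckets[index].append((key, value))
    return buckets


def merge_buckets(buckets):
    """Merge step: sort each bucket, then concatenate the buckets in order."""
    merged = []
    for bucket in buckets:
        merged.extend(sorted(bucket, key=lambda record: record[0]))
    return merged


records = [(5, "e"), (1, "a"), (9, "i"), (3, "c"), (7, "g")]
buckets = map_to_buckets(records, boundaries=[4, 8])
result = merge_buckets(buckets)
```

Because the buckets cover disjoint, ordered key ranges, each one can be sorted independently and in parallel, and the concatenation is already globally sorted.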
Integrating Swift parallel scripting semantics into Python, R, Octave, and MATLAB
Globus project: Swift
Description: The power of the Swift parallel distributed scripting model could be more widely leveraged if its execution semantics could be expressed in other popular scripting languages. Preliminary work in embedding and calling Swift from the R data analysis language indicates that this approach may yield new and highly productive methods for easily writing parallel and distributed scripts. (See http://www.ci.uchicago.edu/wiki/bin/view/SWFT/SwiftR)
Python is the preferred implementation we'd like to focus on next. R is pretty well underway. Octave/MATLAB would be of interest.
There are many alternative models to explore for this, including "co-routine" styles of programming; methods for passing arguments and results between language interpreters executing in parallel; expressing the Swift task graph in an efficient, scalable, but easy-to-program-with data structure using the native language; conventions for mapping datasets and wrapping external application programs; and conventions for embedding simple snippets of Swift code into scripts in other languages. This project will entail selecting one language to experiment with (Python is preferred), implementing an experimental library interface to Swift, and evaluating its usability, expressiveness, and performance on simple test cases and on a set of scientific applications.
Mentor: Michael Wilde
Requirements:
- Familiarity with parallel and distributed computing models
- Advanced scripting skills in the chosen scripting language (Python preferred)
Integrating service access capabilities in the Swift parallel scripting language
Globus project: Swift
Description: Swift provides a uniform, general, and flexible model for expressing the inputs and outputs of application programs, including how large and possibly complex structured datasets are passed to the application. This project will extend that model to applications expressed as REST, JSON, and SOAP services, so that standard Linux/POSIX applications can be freely intermixed with service-based applications in Swift scripts to compose powerful scientific workflows. Case studies will be drawn from applications in biology and astronomy where such service-oriented applications are abundant and the need for this integration is high. As an extension, this project could explore
Mentor: Michael Wilde
Level: Intermediate
Requirements:
- Familiarity with SOAP, REST and/or JSON
- Scripting skills in Python or similar language
- Familiarity with XML
Integrating in-RAM function calls and data passing
Globus project: Swift
Description: Swift's "leaf" or lowest-level functions are currently application programs wrapped in a Swift function interface. This project will explore applying the Swift implicitly parallel data flow programming model to calling in-memory functions such as long-running math library routines, as well as Java methods. For calling C and Fortran, we expect to model this capability on MPI data-passing conventions.
Mentor: Michael Wilde
Level: Intermediate
Requirements:
- Advanced Java programming skills
- Knowledge of MPI helpful
Adding associative array operators to Swift
Globus project: Swift
Description: Swift provides a nicely powerful dynamic array construct, but at the moment these arrays permit only integer indexing. But the underlying array mechanism is in fact implemented as a hash table. This project will add syntax and semantics to the language to permit indexing arrays by using string keys and perhaps arbitrary objects as keys.
Mentor: Michael Wilde
Level: Intermediate
Requirements:
- Advanced Java programming skills
- Familiarity with compiler techniques
Measuring and enhancing multi-site parallel scheduling in Swift
Globus project: Swift
Description: Swift provides a powerful task execution site selector and scheduler that can sense the responsiveness and load level of computing sites and can throttle the volume of work that Swift sends to each site up and down. This algorithm needs to be extensively tested, measured, and enhanced, and its performance and capabilities should be documented in academic papers. We want to make its many parameters as self-tuning as possible, and we want to determine which controls need to be provided to the Swift end user and make these as simple and usable as possible. This project will involve extensive experimentation with the scheduler on large-scale distributed resources such as Open Science Grid and TeraGrid, and possibly the Amazon EC2 cloud computing service and petascale supercomputers at DOE and TeraGrid sites as well.
Mentor: Michael Wilde
Level: Advanced
Requirements:
- Familiarity with parallel and distributed computing models
- Performance measurement and evaluation skills
- Plotting and data analysis skills
- Scripting skills in shell, Python, or similar languages
Enabling distributed, interactive debugging and status monitoring/reporting of Swift parallel and distributed scripts
Globus project: Swift
Description: In this project we propose to enhance both the display and control of running parallel workflows, adding capabilities to probe down to individual remote worker nodes to gain detailed insight into application performance, and to generate an informative set of dynamic plots through which users can fully understand all aspects of the behavior and performance of their scripts and applications.
This will be a good project for students with a strong interest in distributed computing, performance analysis and modeling, and graphical display of information.
There are 3 possible sub-projects here:
- Developing the ability to connect to remote compute nodes and probe the performance of a running application using a shell. A rudimentary prototype of this exists and needs many enhancements.
- Display the overall status of a Swift script in progress. Similarly, a curses-based version of this exists, but many enhancements and improvements are needed
- Create reports and plots of Swift script performance
Additional details on the third sub-project are:
We have an older version of a plotting capability that suffers from terrible performance (in parsing the log and organizing its data) and bugs, and which produces many informative but rather confusing logs that are hard to select, control, and interpret.
This is a fascinating project if you enjoy (or want to learn) performance analysis of distributed and parallel systems.
The project will involve:
- normalizing the Swift log records (log4j) from its Java execution engine so that the events of interest are formatted in a unified fashion
- adding more log records and information fields
- adding easier control by the user to enable logging by function and/or logging level rather than by log4j class names.
- creating a few useful event streams out of the logging info
- creating multiple reporting and display options including summaries, tables, and plots that show various dimensions of a Swift application's activities
- basic plots to be produced by an integrated Java plot library such as JFreeChart (http://www.jfree.org/jfreechart/samples.html)
- creating convenient R datasets and plot routines to enable the user to explore performance data ("slice and dice") and compare and plot one or more Swift runs
- creating documentation to enable the user to generate and interpret activity plots and use them to understand and tune the performance of their application scripts and identify performance problems and bottlenecks.
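The first step in the list above, normalizing log records into a unified form, could be sketched as follows in Python. The line layout assumed here (date, time, level, logger name, event name, then `key=value` fields) is a hypothetical example and not the actual Swift log format; the `normalize` function is likewise only illustrative.

```python
import re

# Assumed layout: "<date> <time> <LEVEL> <logger> <EVENT> key=value ..."
LINE = re.compile(
    r"^(?P<date>\S+) (?P<time>\S+)\s+(?P<level>[A-Z]+)\s+(?P<logger>\S+)\s+"
    r"(?P<event>[A-Z_]+)(?P<rest>.*)$"
)


def normalize(line):
    """Return a flat dict for one log line, or None if it does not match."""
    match = LINE.match(line.strip())
    if match is None:
        return None
    record = {
        "timestamp": match["date"] + " " + match["time"],
        "level": match["level"],
        "logger": match["logger"],
        "event": match["event"],
    }
    # Remaining tokens of the form key=value become additional fields.
    for token in match["rest"].split():
        key, sep, value = token.partition("=")
        if sep:
            record[key] = value
    return record


sample = "2011-03-01 12:00:00,123 INFO  Execute JOB_START jobid=cat-1 host=node7"
record = normalize(sample)
```

Records in this flat form are straightforward to group into event streams or to export as tables for plotting in R or a Java plotting library.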
Mentor: Michael Wilde
Level: Intermediate
Requirements:
- Familiarity with parallel and distributed computing models
- Performance measurement and evaluation skills
- Plotting and data analysis skills
- Advanced scripting skills in shell, Python, and/or Perl
- Intermediate Java programming skills
Multi-level workflow programming models
Globus project: Swift
Description: This project will involve experimentation with the Swift parallel scripting programming model and its implementation, to create execution mechanisms that can partition a Swift program's execution graph across multiple, distributed, asynchronous executors. MORE TBD.... Once the execution graph is partitioned, mechanisms need to be in place to coordinate the execution graph, and to ensure that performance (e.g. throughput of tasks/sec) is not impacted significantly. Experiments are expected to be carried out to test the new upper bound on scalability of Swift, using this partitioned approach, and to test throughput of the Swift system to ensure comparable performance.
Mentor: Michael Wilde
Level: Advanced
Flexible cloud and volunteer resource management for the Swift parallel scripting language
Globus project: Swift
Description: Preliminary work to date indicates that Swift runs well on cloud and volunteer computing resources. In this project, we propose to experiment with and adapt Swift's resource management and scheduling mechanisms to make it easy to run Swift scripts in Amazon EC2, DOE Magellan, and BOINC volunteer clouds, with both automated and user-driven control over resource levels, types, and costs. The project will deal with issues of usability, evaluating scientific workloads in cloud contexts, making it easy to grow and shrink resource clouds, and managing long-running workloads.
Mentor: Michael Wilde
Level: Intermediate
Requirements:
- Java
- Scripting in shell and/or Python or similar
- Prior experience with EC2 or BOINC desirable
Reworking the Swift parallel scripting dataset mapper model and toolset
Globus project: Swift
Description: The Swift parallel scripting model implements a dataset typing model in which directory structures of scientific data can be described as structure and array objects, and can be mapped from their on-disk structure to an in-memory representation of that structure. This enables complex directory trees of data to be processed using simple scripts. This project will involve refining the mapping model, and designing a new set of mappers and mapping conventions, based on user experience with the current mapping model, to make this style of script-writing even easier, more natural, and more robust and reusable. This project will deal with many file naming conventions and data access services, and will provide mechanisms to maintain and access large-scale collaboration-wide data catalogs. The project will involve the use of the Globus Replica Location Service (RLS) and other highly scalable file and dataset catalogs.
Mentor: Michael Wilde
Level: Intermediate
Requirements:
- Scripting in shell and/or Python or similar
- Solid experience using Linux systems
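The idea of mapping an on-disk layout to an in-memory structure can be pictured with the short Python sketch below. It is a generic illustration, not one of Swift's mappers: the `map_paths` function and the sample file names are hypothetical, and it works on path strings rather than a real file system.

```python
def map_paths(paths):
    """Build a nested dict mirroring the directory layout of relative paths."""
    root = {}
    for path in paths:
        parts = path.split("/")
        node = root
        for directory in parts[:-1]:
            # Descend into (or create) the dict for each directory level.
            node = node.setdefault(directory, {})
        # Leaf entries keep the full path so a script can open the file later.
        node[parts[-1]] = path
    return root


dataset = map_paths([
    "subject1/run1/volume.img",
    "subject1/run2/volume.img",
    "subject2/run1/volume.img",
])
```

A script can then address `dataset["subject1"]["run2"]["volume.img"]` as a structured value instead of assembling file names by hand, which is the convenience a mapper is meant to provide.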
Enabling large-scale distributed application build and deployment under Swift
Globus project: Swift
Description: Using distributed grid resources on the Open Science Grid ( www.opensciencegrid.org ) and TeraGrid ( www.teragrid.org ) requires that application programs used by a user workload be installed on each computing site that will be used on a given grid. This project will build on one of the many systems that attempt to automate this process, and will refine, improve and test this mechanism, focusing on reliability and usability. The result will have great value for numerous scientific collaborations that wrestle with this difficult problem.
Mentor: Michael Wilde
Level: Intermediate
Requirements:
- Software building skills
- Familiarity with Make and Configure
Making the Swift parallel scripting system easy to install, evaluate and learn on readily available computing resources
Globus project: Swift
Description: While the Swift parallel scripting system is meant primarily for users on large-scale clusters, grids, and supercomputers, it's also quite usable on multi-core workstations, and it can readily federate small networks of such workstations or servers into a valuable parallel computing resource. At the same time, the ability to do this makes such an approach an attractive vehicle by which new and prospective Swift users can evaluate and learn the system. This project entails creating an attractive Swift "starter kit" with tested and documented demos, on a set of interesting but easy-to-install application problems. This is an ideal project for junior programmers who seek an introduction to parallel computing and to scientific applications.
Mentor: Michael Wilde
Level: Introductory
Requirements:
- Familiarity with ssh and scp
- Knowledge of Linux systems and shell scripting
- Familiarity with Python or similar scripting languages
- Ability to experiment with simple scientific applications and graphical tools
Enhancing Swift accessibility and usability on the Open Science Grid and TeraGrid
Globus project: Swift
Description: The Swift parallel scripting system can be an ideal tool to enable new users of TeraGrid and Open Science Grid to transition rapidly and scale up from a local server or cluster environment to a more powerful Grid environment with far greater computing resources. This project will help to build a starter kit for such users that pre-configures Swift for their use on these large-scale Grid infrastructures, and which automates and hides many of the complexities in executing at such a large and highly distributed scale.
Mentor: Michael Wilde
Level: Introductory
Requirements:
- Familiarity with ssh and scp
- Knowledge of Linux systems and shell scripting
- Familiarity with Python or similar scripting languages
- Ability to experiment with simple scientific applications and graphical tools
Enhancing a Gadget-based portal interface for the execution of scientific workflows
Globus project: Swift
Description: Our new scientific workflow portal mechanisms enable users to create customized "workspace" environments in which they can execute parallel, distributed scientific workflow scripts on a variety of grid and supercomputer resources. The portal provides the means to integrate distributed data management, workflow launch and status monitoring, result tracking, analysis and visualization. This project will involve working on a set of innovative enhancements to the portal in the areas of tagging and metadata management, workflow status display, interactive distributed debugging, and visualization and analysis of scientific results.
Mentor: Michael Wilde
Level: Intermediate
Requirements:
- Web 2.0 programming with DHTML, CSS, Google Gadgets
- Data analysis and visualization in MATLAB, Octave, or R
- User interface usability analysis and enhancement skills
Enhancing the Swift parallel scripting Library
Globus project: Swift
Description: As a relatively young language, Swift does not yet have a rich set of library functions to perform many of the common tasks needed in typical scientific scripts. This project will involve building such initial libraries for string and text manipulation, network access, file data management, and elementary mathematical functions. Requirements will be gathered from discussion with Swift users, analysis of existing Swift scripts, and from a set of exploratory Swift scripts to be written by the student. The project will involve enhancing the initial, primitive library "import" capability of Swift, and will explore how to add "namespaces" and a library search mechanism to the language.
Mentor: Michael Wilde
Level: Introductory to intermediate
Requirements:
- Scripting in shell and Python or similar languages
- Elementary Java programming skills
Scientific Services
Semantics-Oriented Behavior-Empowered Scientific Service Search Engine
Globus project: None
Mentor: Jia Zhang
Programming Language/s: Java
Level of Expertise: TBD
Description: The Internet, the Grid and the newly emerging cloud environment have provided a community platform for scientists to share various kinds of resources (e.g., experimental data and analytical applications) in the form of services. However, our recent analysis revealed that the reusability of scientific services is very low. How to effectively and efficiently help scientists find suitable services and help them construct new workflows (experimental processes) with existing services remains a big challenge. This project proposes to tackle this issue by intelligently extracting information from the shared computing environment, leveraging the power of social network analysis and complex network theories. Such a heuristic approach may complement the existing syntax- and semantics-oriented service discovery research, and provide guidance for the construction of the next generation of service search engines. The driving factor of the research is our hypothesis that there is much useful information implicit in the past use of scientific services. This project aims to answer two fundamental questions: What implicit information may be extracted to help scientists better understand existing artifacts? How can such implicit information be used to facilitate service-based artifact reuse? To this end, the project will build models and techniques to study the past behaviors of scientific services in the context of scientific experimental processes. As a proof of concept, this project will build a prototype search engine as a plugin to Taverna, a well-known scientific workflow management workbench for the life sciences.
The deliverables for this project would be:
- TBD
References:
- Wei Tan, Jia Zhang, and Ian Foster, "Network Analysis of Scientific Workflows: a Gateway to Reuse", IEEE Computer, Sep., 2010, 43: pp. 54-61.
- Jia Zhang, Daniel Kuc, and Shiyong Lu, "Confucius: A Scientific Collaboration System Using Collaborative Scientific Workflows", in Proceedings of IEEE International Conference on Web Services (ICWS), 2010, Miami, FL, USA, Jul. 5-10, pp. 567-575.
- T. Oinn, M. Greenwood, M. Addis, M.N. Alpdemir, J. Ferris, K. Glover, C. Goble, A. Goderis, D. Hull, D. Marvin, P. Li, P. Lord, M.R. Pocock, M. Senger, R. Stevens, A. Wipat, and C. Wroe, "Taverna: Lessons in Creating a Workflow Environment for the Life Sciences", Concurrency and Computation: Practice & Experience, 2006, 18(10): pp. 1067–1100.
Requirements:
- Experience with Java programming
- Basic knowledge of XML, Web services and scientific workflows (highly preferred)
"Blue sky" ideas
The following are "blue sky" project ideas. They are not as detailed as the above project proposals, and some of them might not even be feasible during a single summer. However, they could end up being the seed from which a really cool project springs.
- No "Blue sky" ideas yet.
Other sources of project ideas
The above list of project ideas is by no means exclusive. You may find inspiration for other cool ideas in the following places:
- Our mentors. Feel free to contact any mentor whose field of interest matches your own. If you are unsure of who to contact, or no mentor seems like a good match, please contact Borja Sotomayor (our GSoC org admin), and he will put you in touch with the right person.
- 2010 project ideas, 2009 project ideas and 2008 project ideas. Some of these ideas were already picked up and implemented by students, or may be outdated at this point. However, some mentors might be interested in reviving an old idea if you have an interesting proposal in mind.
- 2009 projects and 2008 projects. Our students in 2008 and 2009 developed some really cool projects, and some of them might be able to use an extra summer of work to add new features, take the project in a new direction, etc. If one of our past projects looks interesting, you can try contacting the student and his/her mentor to see if they'd be willing to mentor a similar project.
Mentors
Our GSoC mentors (and their areas of expertise) are:
- Bryce Allen: Globus Online
- Vijay Anand
- Patrick Armstrong: Nimbus
- John Bresnahan: Nimbus, GridFTP, Globus XIO
- Tim Freeman: Nimbus
- Tom Howe: Globus Online
- Raj Kettimuthu: GridFTP
- David LaBissoniere: Nimbus
- Mike Link: GridFTP
- Ravi Madduri
- Tanu Malik
- Stuart Martin: Globus Online
- Borja Sotomayor: Globus DemoGrid
- Michael Wilde: Swift, scientific workflows
- Jia Zhang: Scientific workflows
If you have an idea for a project, but none of the above mentors seem like a good match, please contact Borja Sotomayor (our GSoC org admin) and he will try to match you to an adequate mentor.
Project idea guidelines
This section is intended only for mentors who want to propose new project ideas for students.
We have created a Mediawiki Template to add new ideas to the list. Please use it when adding new ideas:
{{GSoCproject|
idea_title=Example Idea: Increase Awesomeness of Globus
|
globus_project=Globus Toolkit at-large
|
globus_project_url=http://www.globus.org/
|
mentor_name=John Q. Globus
|
mentor_email=john.q.globus@example.org
|
programming_languages=Java and C
|
expertise=Intermediate
|
description=We already know Globus is awesome, but there's no upper bound on awesomeness.
In this project, we propose that you make Globus more awesome.
The deliverables for this project would be:
* Develop a metric of awesomeness.
* Measure current awesomeness of Globus.
* Improve Globus to make it more awesome, according to the provided metric.
* Measure new awesomeness of Globus.
|
requirements=
* Must already be familiar with the Globus Toolkit.
* Must be an awesome student.
}}
This is the information you should include in each idea:
- idea_title: The title of your idea
- globus_project and globus_project_url: What project (in the dev.globus sense of the word) does this idea relate to? (include the name and its URL)
- mentor_name and mentor_email: Each project must have a mentor. The mentor is in charge of supervising students, tracking their progress, answering questions about the project, etc. If you would like to be the mentor for this project, please include your name and e-mail address here. If not, please leave this field blank, and we will assign a mentor from the mentor pool.
- programming_languages: What programming languages will be used in the project?
- expertise: What level of expertise do you expect from the student? You should specify just one of the following three words. If you need to elaborate on the level of expertise or the prerequisites of your project, you should do so in the requirements field.
- Beginner: You don't assume any prerequisite knowledge about Globus or Grid Computing. In other words, the project can be done by any student who is somewhat fluent in the programming languages listed in the project idea. For example, adequate for sophomore, or even freshmen, students in Computer Science or Engineering.
- Intermediate: You assume some advanced knowledge in Computer Science, but not specifically on Globus (e.g., the student may have to know about networks and distributed systems). Should be adequate for juniors or seniors majoring in Computer Science or Engineering who have taken upper-level courses.
- Advanced: You assume the student is already familiar with Globus.
- description: Include a 1-2 paragraph description of what has to be accomplished in this project. You do not need to completely specify the project, just give prospective students a good idea of what work is required (is it mainly development? will it involve a lot of independent research? is it easy or hard? etc.). Also, note that ideas don't necessarily have to be concrete tasks ("Add support for protocol FOO in component BAR") but can also be "blue-sky" ideas (e.g., "GridFTP is not currently capable of dealing with the latencies involved in transferring large files to Mars. Solve this."). In fact, Google encourages that we include a couple of these since they usually lead to the most interesting projects. If possible, include websites or papers related to this project. For example, if you want a student to implement an idea you proposed in a paper, include a link to that paper.
- requirements: What specific skills are required to do this project? (Languages, knowledge of protocols, whether students should already be familiar with GT4/GT5 or whether on-the-job training is OK, etc.)
Right now, the template is rendered like an idea from last year's list of ideas. We want to change this to make the idea list more navigable, so please make sure you use the above template so it will be easier to switch to a new format. For now, the above would render like this:
Increase Awesomeness of Globus
Globus project: Globus Toolkit at-large
Mentor: John Q. Globus
Programming Language/s: Java and C
Level of Expertise: Intermediate
Description: We already know Globus is awesome, but there's no upper bound on awesomeness. In this project, we propose that you make Globus more awesome.
The deliverables for this project would be:
- Develop a metric of awesomeness.
- Measure current awesomeness of Globus.
- Improve Globus to make it more awesome, according to the provided metric.
- Measure new awesomeness of Globus.
Requirements:
- Must already be familiar with the Globus Toolkit.
- Must be an awesome student.
If you need additional inspiration on how to write up your idea, take a look at last year's project ideas, 2009's list of ideas, or 2008's list of ideas.


