Protein Structure Initiative Update

May 2007

The Protein Structure Initiative (PSI) was established in 2000 by the National Institute of General Medical Sciences (NIGMS) after a two-year study by the Institute staff and Council. Following the success of the genome sequencing projects and manifold technical advances in structural biology, a number of scientists proposed establishment of a large-scale structural biology project to significantly extend structural coverage of sequenced genes. They pointed out that the number and complexity of structures in the Protein Data Bank (PDB) had grown impressively over the past decades and that these structures had led to many scientific successes and to a greater understanding of structure-function relationships, but that growth of new and unique structures had not kept pace with the overall growth.

Following several national and international workshops, a U.S. “structural genomics” project was proposed. Structural genomics involves high-throughput experimental determination of a large number of representative structures, with the goal of achieving systematic sampling of sequence families. Utilization of computational modeling of sequence family homologs then extends the structural information to a much larger fraction of sequenced genes.

While considering this project, the NIGMS held three workshops to examine the feasibility, goals, scale, and target selection strategy for a structural genomics effort. Meeting summaries can be found on the NIGMS website. Following these workshops and discussions with advisors, the NIGMS Council concluded that the Institute should undertake this effort and asked the NIGMS staff to organize a “pilot” phase of the PSI as a 5-year project. PSI-1 consisted of a centers program and an investigator-initiated grants program for methodology and technology development. With the guidance of Council, the Institute published a Request for Applications (RFA) for PSI pilot centers.

PSI Pilot Phase (PSI-1)

In response to this announcement, nine pilot research centers were established, seven in 2000 and two in 2001, to test strategies for high-throughput structural determination. Two of these pilot centers were co-funded by the NIH National Institute of Allergy and Infectious Diseases (NIAID). The goals of PSI-1 were to:

  1. Develop methodology and technology to increase success rates and lower costs of structure determination,
  2. Construct and automate the protein production and structure determination pipeline, and
  3. Determine unique protein structures. In this context, the term “unique” was defined to mean structures for proteins that were less than 30% identical in sequence to proteins for which structures had already been determined.
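
The 30% criterion in goal 3 can be illustrated with a minimal sketch. The report does not say how sequence identity was computed, so the choices below are assumptions made only for illustration: the sequences are taken to be already aligned, gaps are written as "-", and identity is counted over columns where neither sequence has a gap. The function names and toy sequences are hypothetical.

```python
from typing import Iterable, Tuple

# Threshold taken from the PSI-1 definition of "unique" (less than 30% identical).
UNIQUE_THRESHOLD_PERCENT = 30.0


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap.

    Assumption: the inputs are pre-aligned, equal-length strings with '-' for gaps.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def is_unique(alignments: Iterable[Tuple[str, str]]) -> bool:
    """True if the target is below the threshold against every known structure.

    Each item is (target_aligned, known_aligned) for one protein of known structure.
    """
    return all(
        percent_identity(target, known) < UNIQUE_THRESHOLD_PERCENT
        for target, known in alignments
    )


# Toy, made-up sequences: 1 of 4 aligned positions match, i.e. 25% identity.
print(is_unique([("AAAA", "ABBB")]))
# 2 of 4 positions match, i.e. 50% identity, so the target is not "unique".
print(is_unique([("AAAA", "AABB")]))
```

In practice the result depends on the alignment method and on how the denominator is chosen, so any real pipeline would need to state both.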

During the first year, the Institute appointed the Protein Structure Initiative Advisory Committee (PSIAC), a working group of the NIGMS Council composed of independent scientists (i.e., not connected to the PSI) to provide strategic advice to the NIGMS Council and staff on the management and planning of the project. One important product of their first meeting was a project mission statement: “to make the three-dimensional atomic level structures of most proteins easily available from knowledge of their corresponding DNA sequences.”

PSI Pilot Phase Results

The nine pilot centers produced a variety of results, including numerous important new methods, automated and parallel procedures, robotic instruments, and salvage (or rescue) procedures for targets that fail in the structure determination pipeline. These new methods and tools were rapidly incorporated into the pilot centers’ structural genomics pipelines, and many components were subsequently adopted by structural biology labs throughout the world.

During PSI-1, target selection was left to each center and was not centralized, but all centers were required to aim for unique protein structures and to list their targets on the PSI centralized database in order to minimize overlap and duplication of effort. The pilot centers were required to disseminate their results, including rapid deposition and release of atomic coordinates and the data used for structure determination. NIGMS also supported technology development for high-throughput structural biology data collection, including both the enhancement and new construction of synchrotron beamlines. As the centers ramped up and the two new centers were begun, the aggregate budget for all the pilot centers increased from $31 million total costs in the first year to $71 million total costs in the final year of the pilot phase.

Over the five years of PSI-1, the nine pilot centers determined about 1,300 structures, of which about 65% were unique. Structures contributed by PSI are comparable in quality and size to structures deposited into the PDB from other structural biology laboratories. Since these centers took several years to reach high-throughput operation, it was not surprising that 40% of the PSI-1 structures were determined in a single year, the fifth and final year of the project. By the fifth year of PSI-1, the cost per structure had fallen more than two-fold, to $138,000. (This estimated cost per structure includes funds for ongoing technology development.)
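
As a rough consistency check on these figures, the arithmetic can be sketched as follows. The report does not state how the cost per structure was calculated; the sketch assumes it is simply the final-year budget for all pilot centers ($71 million) divided by the number of structures determined that year. All inputs are the rounded values quoted in this report, so the result is only approximate.

```python
# Rounded figures quoted in the report for PSI-1.
total_structures = 1300           # "about 1,300 structures"
unique_fraction = 0.65            # "about 65% unique"
final_year_fraction = 0.40        # "40% ... determined in one year"
final_year_budget = 71_000_000    # total costs, final year of the pilot phase

unique_structures = round(total_structures * unique_fraction)
final_year_structures = round(total_structures * final_year_fraction)

# Assumption: cost per structure = final-year budget / final-year structures.
cost_per_structure = final_year_budget / final_year_structures

print(unique_structures)          # 845
print(final_year_structures)      # 520
print(round(cost_per_structure))  # 136538
```

Under that assumption the estimate comes out near $137,000, close to the reported $138,000. The small difference is what one would expect from the rounded inputs ("about 1,300", "40%").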

Lessons Learned

From this first phase of the PSI, NIGMS staff and the PSIAC concluded that several lessons had been learned:

  • Structural genomics pipelines can be constructed and scaled-up,
  • High-throughput operation works for many proteins,
  • NMR can make a significant contribution to structural genomics pipelines,
  • Bottlenecks remain for some proteins, especially integral membrane proteins,
  • A coordinated, 5-year target selection policy is critical for future PSI efforts,
  • Centralized archiving of materials is essential,
  • Homology modeling methods need improvement, and
  • Outreach to and involvement of the broad scientific community must be fostered.

PSI Production Phase (PSI-2)

Following consideration of PSI-1 progress and PSIAC recommendations, the NIGMS Council recommended that the Institute staff prepare announcements for the second phase of the PSI, PSI-2, to begin in July 2005. Building on the experience and progress of the first phase, the PSI-2 Network undertook several goals:

  1. Structural coverage of sequence families, including those of known high biological importance;
  2. Continued methodology and technology development, especially for challenging classes of proteins such as integral membrane proteins; and
  3. Increased promotion of the use of structures by the broader biological community.

To achieve these goals, the PSI-2 Network included five separate components:

  1. Four large-scale high-throughput research centers focused on production of a large number of unique protein structures that, with application of computational modeling methods, broaden structural coverage of protein sequences,
  2. Six specialized centers focused on technical problems associated with pipeline bottlenecks and challenging proteins,
  3. Two homology modeling centers and a research grants program focused on improving the accuracy of comparative protein structure modeling,
  4. A materials repository to store and distribute expression clones, and
  5. A knowledgebase to serve as an information analysis and dissemination center.

The individual center websites provide a great deal of information on the accomplishments and productivity of these centers. The first four components of the PSI-2 Network are already operational. The large-scale centers and specialized centers were funded in 2005. The homology modeling centers and materials repository were funded in 2006. In addition, an investigator-initiated research grants program was added in 2007 to enhance homology modeling methods and increase the chance of producing breakthroughs. The knowledgebase will be funded in mid-2007. A supplemental grants program for the study of PSI structures of unassigned function was initiated in 2003 and is continuing. This activity provides funds to enable investigators interested in protein function to undertake short-term research projects that capitalize on the information and reagents produced by the PSI. The budget for all 14 PSI-2 centers and the two small grants programs is about $66 million total costs per year. One of the specialized centers is co-funded by the NIH National Center for Research Resources (NCRR).

PSI-2 Target Selection

The overall PSI-2 goal of providing broad structural coverage and the determination of unique protein structures from large protein families was built into the PSI-2 project, but implementation of target selection is intended to be worked out by the PSI-2 researchers. This task is undertaken by the directors and bioinformatics staff of the large-scale centers. Targets for the large-scale center joint activity are chosen in order to maximize structural coverage, enhance biological impact, and make the structures useful to the broad scientific community (perhaps the most important aspect of PSI-2). The large-scale centers are required to spend 70% of their effort on the joint PSI-2 Network activity of structural coverage. In addition, these centers must devote 15% of their effort to community-nominated targets and collaborations and another 15% to their own individual biomedical theme project.

At many levels, the issue of target selection has received intense scrutiny and countless rounds of bioinformatics analysis. Two groups have borne most of the responsibility for target selection and coordination of this project. The Operations and Management Group (OMG) consists of the four large-scale center directors and the NIH PSI Network director. The Bioinformatics Group (BIG) is composed of the four informatics directors of these centers. These two groups, separately and together, have communicated weekly to forge a common plan for target selection and operation. Following extensive communications, the large-scale centers agreed on a total of 3,000 structures as a 5-year goal for PSI-2 and worked out agreements on the rules of operation and target selection. Several thousand target families have already been chosen and allotments made to each large-scale center by a “match” process.

In summary, the target selection strategy includes the following goals:

  • Coarse sampling of large families (initially Pfam, with other large families added) with no structural representatives in PDB to achieve broad structural coverage (joint activity);
  • Moderate sampling of very large families with limited structural representatives in PDB (joint activity) for:
    • Increased structural coverage to explore evolution of structure and function and to aid in computational modeling, and
    • Structural coverage of selected families with high biomedical relevance;
  • Exploration of single organisms, metagenomes, and microbiomes (joint activity);
  • Community targets nominated by non-PSI investigators and centers (joint/individual center activity); and
  • Biomedical theme targets (individual center activity).

Methodology and Technology Development

In PSI-2, methodology and technology development is centered in the specialized centers and also continues in the large-scale centers and research grants program. The smaller specialized centers are focused on specific bottlenecks in production and structure determination, especially for proteins from more difficult classes. Two specialized centers are focused on membrane proteins and another on eukaryotic proteins. The other three specialized centers are developing methods and instruments for improving protein production, crystallization, and structural determination. Short informal reports on technical developments and problems are exchanged quarterly among all fourteen centers. The specialized centers are expected to determine structures and contribute to the Network goal of structural coverage, but at much lower rates than those from the large-scale centers.

PSI Knowledgebase

The knowledgebase will provide a platform for scientific community involvement in target selection and functional annotation and will play an important role in increasing the impact of protein structures on biological and biomedical research. Plans for the PSI knowledgebase include support for:

  1. A homology modeling portal that will provide the scientific community with facile access to computational models of proteins;
  2. A functional annotation module to facilitate community participation in assigning function to structures;
  3. A metrics module for the analysis of PSI progress;
  4. A database module for tracking PSI targets (TargetDB) and PSI experimental methods (PepcDB);
  5. Integration with other data resources, such as the PDB, NCBI, and model organism databases; and
  6. Integration with the materials repository.

PSI Production Phase Results

During the first year of PSI-2 (July 2005 - June 2006), the four large-scale centers developed additional new methods and jointly devised a target selection process to maximize structural coverage and the biomedical relevance of the structures. They have determined about 425 protein structures. Over 70% of these are unique, and the cost per structure has been reduced to $94,000. (Again, ongoing technology development activities make this figure an over-estimate of the current cost per structure.) These structures represent about 40% of the unique structures deposited into the PDB from all sources, worldwide, during this period. These large-scale centers are well on their way to an even larger number of structures in the second year (July 2006 - June 2007).
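
The first-year figures can be combined into a few derived quantities, sketched below. None of the derived values is reported in this document; they are inferences from the rounded numbers quoted above. "Over 70%" is treated as a lower bound of 70%, and the comparison uses the $138,000 cost per structure reported for the final year of PSI-1.

```python
# Rounded figures quoted in the report for the first year of PSI-2.
structures = 425                  # "about 425 protein structures"
cost_per_structure = 94_000       # dollars
unique_fraction_min = 0.70        # "over 70%" treated as a lower bound
psi_share_of_unique = 0.40        # "about 40% of the unique structures" worldwide
psi1_cost_per_structure = 138_000 # final year of PSI-1

# Implied spending on these structures (an inference, not a reported budget line).
implied_spend = structures * cost_per_structure

# Lower bound on the number of unique PSI structures in the period.
min_unique = structures * unique_fraction_min

# Implied number of unique structures deposited worldwide in the same period.
implied_worldwide_unique = min_unique / psi_share_of_unique

# Fractional reduction in cost per structure relative to the end of PSI-1.
cost_reduction = 1 - cost_per_structure / psi1_cost_per_structure

print(f"implied spend: ${implied_spend:,}")
print(f"unique PSI structures (at least): {min_unique:.0f}")
print(f"implied unique structures worldwide: about {implied_worldwide_unique:.0f}")
print(f"cost reduction vs. PSI-1 final year: {cost_reduction:.0%}")
```

Read this way, the figures imply roughly $40 million associated with these structures, on the order of 300 unique PSI structures, and a cost per structure about one-third lower than at the end of PSI-1.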

PSI Policies

As a public resource, PSI-2 has special regulations and policies. From its inception, the PSI has required rapid release of all results, including the deposition and release of coordinates and related information into the PDB. The PSI-2 centers are not funded by the usual research grants mechanism, but via cooperative agreements. Each Principal Investigator is responsible for directing his or her center, but all centers are required to work together, and NIH staff and outside advisors share important roles in determining program goals and actions. The PSI-2 centers are also responsible for outreach to the scientific community and to minority scientists and students, and for research training. As a network with joint activities and goals, the centers are continually discussing and fine-tuning issues such as target selection, management, operation, and cooperation.

PSI-2 Organization

The PSI Steering Committee (PSISC) is the internal governing body of PSI-2, providing direction and revising goals and plans within the framework established by the PSI-2 RFAs, the PSIAC, and the NIGMS Advisory Council. The PSISC is also responsible for the implementation of plans and overall project operation. It is composed of the PSI center directors, four NIH staff, and five outside scientists. The PSISC chair interacts on a regular basis with the OMG and BIG and oversees work of four subcommittees: Goals and Milestones, Target Selection, Center Interactions, and Communication with the Scientific Community. Each subcommittee has produced a report that is available on the PSI website. The Goals and Milestones report was developed with input from a large group of center and Institute staff. It enumerates the expected deliverables for PSI-2. A preliminary analysis of PSI-2 progress on structural coverage and other goals is due in August 2007. This documentation will also include a list of publications, workshops, etc. In the future, the PSI knowledgebase will be responsible for providing periodic evaluations of progress and goals.

In addition to regular communication, the PSI center directors, the PSIAC, the PSISC, and NIH staff attend the PSI annual meeting to discuss progress, plans, and strategies. At the annual PSI-sponsored “Bottlenecks” meeting, scientists from PSI centers and other laboratories discuss technical hurdles in protein production, crystallization, and structure determination. This exchange has led to significant enhancements in methods and techniques incorporated by structural genomics pipelines and used by the structural biology community as a whole. In addition, various specialized workshops are convened by the individual centers.

Future Directions

Because the PSI is a major NIH project intended to serve as a scientific resource, questions of policy and goals are discussed regularly by those involved. These deliberations focus on target selection, goals and milestones, operation, and management. Longer-range planning is also being addressed. Over the past two years, the NIGMS/PSI Network has sought input, especially on the following issues:

  • What is the appropriate role of the specialized centers within the PSI Network?
  • How can the PSI interact with other structural genomics efforts and with structural biology projects?
  • How can the PSI knowledgebase increase the impact and value of protein structures for biological and biomedical research?
  • How can PSI further involve the scientific community in target selection and structural annotation?
  • How should the PSI coordinate its activities, and target selection in particular, with international structural genomics projects?

The PSI and its network will continue to consult the scientific community. More information on the history and background of the initiative, along with summaries of PSI workshops and its program announcements, goals, requirements, and progress, is available on the NIGMS PSI website.