FoldMiner and LOCK 2 Manual: Installation and Usage Instructions

FoldMiner: Structural Motif Discovery Using an Improved Superposition Algorithm
Jessica Ebert and Douglas Brutlag, Stanford University

Contact for Installation Support, Suggestions, and Bug Reports: Jessica Ebert

Copyright 1997 The Board of Trustees of The Leland Stanford Junior University. All Rights Reserved

Table of Contents:
  1. Package Contents
  2. Installation Instructions
  3. Usage Instructions for the LOCK 2 Command Line Interface
  4. Usage Instructions for the FoldMiner Command Line Interface
  5. References

I. Package Contents

This package includes the following three programs:
FoldMiner:
Runs structural similarity searches (aligns a query structure to a database of target structures to find those that are structurally similar to the query) using LOCK 2 to perform pairwise alignments. In the process of running the structural similarity search, FoldMiner identifies the structural motif that is the basis of the structural similarity shared among the query structure and high scoring targets. The motif discovery process is entirely unsupervised and is described in more detail in a 2004 Protein Science paper. Results are reported for the motif and for each statistically significant LOCK 2 alignment (see below for details).

LOCK 2:
LOCK 2 is an improved version of the LOCK program, and is used by FoldMiner to perform pairwise structural alignments. LOCK 2 is more capable of detecting distant structural similarities than is its predecessor. Statistical significance scores are provided.

Web interface:
Provides access to FoldMiner and LOCK 2 using a web interface.

II. Installation Instructions

Installation for FoldMiner, LOCK 2, and web interface:

Note: All packages (FoldMiner, LOCK 2, and the web interface) require the completion of steps 1-3 for installation. Step 4 describes options for installing LOCK 2 and/or FoldMiner without the web interface.

  1. Examine your system configuration and download files.
    1. Executables and web pages:

      The FoldMiner and LOCK 2 software and web pages require approximately 40MB of disk space. The vast majority of this space (38MB) consists of example FoldMiner structural similarity searches. We recommend that you do not delete these files until you are familiar with FoldMiner.

    2. tmp directory location:

      The installation process assumes the existence of a tmp directory on your system. It must be located in "/". If your tmp directory is in a different location, make a symbolic link "/tmp" that points to the actual location (you must do this manually):

      ln -s <tmp dir location on your system> /tmp


    3. SCOP version 1.63 PDB files:

      For most applications, you will want to align your query structure to a database of targets representing most or all known folds. We recommend SCOP as a source of target databases. (See the ASTRAL website for a description of the methods used to select sequence-dissimilar subsets of SCOP and target databases consisting of representatives from various levels of the SCOP hierarchy.) The web page interface assumes that only these SCOP target databases will be used, though you may, of course, use any target database at the command line.

      If you choose to use SCOP target databases, you will need to download SCOP PDB files from the FoldMiner distribution website or from the ASTRAL web site. You will need approximately 600MB of disk space. You may decrease the amount of time required for LOCK 2 and FoldMiner runs somewhat by gunzipping these directories, which will then require approximately 2.5GB of disk space. The computational time saved is that required to gunzip the PDB files. The location of the directory containing the SCOP PDB files must be set in the site.defs file (see the SCOP_PDB_DB_LOC variable and explanation), which is described below.

      Because secondary structure information is obtained using the DSSP algorithm if it is not included in PDB file headers, you may wish to create a database of DSSP files for the SCOP PDB files. This will require an additional 85MB of disk space if the files are gzipped, or 245MB otherwise. You may choose to create this database during the installation process. (See the MAKE_SCOP_DSSP_DB and SCOP_PDB_DB_LOC variables and explanations in site.defs.) DSSP files will be gzipped by default; type:

      gunzip -r <dssp files directory>

      to gunzip them.

      Alternatively, you may create this directory later by typing:

      make MAKE_SCOP_DSSP_DB

      from within the installation package directory at any time after running the "./configure" command described in the distribution installation section below, regardless of how you set the MAKE_SCOP_DSSP_DB variable in the site.defs file.

    4. Local copy of the PDB database:

      You may also wish to create a local copy of the PDB on your system. If so, its location must be specified in site.defs (see the PDB_DB_LOC variable and explanation). If you will use PDB structures frequently in your searches, you may also wish to create a corresponding database of DSSP files. You may choose to create this database during the installation process; DSSP files will be gzipped by default. This will require approximately 420MB of disk space, and will take some time to create. (See the PDB_DSSP_DB_LOC and MAKE_PDB_DSSP_DB variables and explanations in site.defs.)

      Alternatively, you may create this directory later by typing:

      make MAKE_PDB_DSSP_DB

      from within the installation package directory at any time after running the "./configure" command described in the distribution installation section below, regardless of how you set the MAKE_PDB_DSSP_DB variable in the site.defs file.

    5. Perl modules:

      Two perl modules (Scop.pm and Expectation.pm) were included with this distribution. We will provide updated modules on the FoldMiner distribution website as new SCOP releases become available. The modules in this distribution correspond to SCOP release 1.63. (See the SCOP_MODULES_LOC variable and explanation in site.defs to choose the installation directory for these modules.)



  2. Downloading and unpacking the distribution:

    Download the file http://fold.stanford.edu/distributions/FoldMiner/FoldMinerDistribution.tar.gz and unpack it:

    gunzip -c FoldMinerDistribution.tar.gz | tar xvf -


  3. Preparing the distribution for installation:

    Edit the file site.defs in the FoldMinerDistribution directory to define various directories and default parameters. Each variable in the file has associated instructions. If you will not be installing the web interface, you may safely ignore variables relating to web directories.

    Do NOT include any extra spaces around the equals signs.
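
Stray spaces around an equals sign are easy to miss, so you may wish to check site.defs before running "./configure". The following is only a minimal sketch. It assumes that site.defs consists of VARIABLE=value lines and that comment lines begin with "#"; check both assumptions against your copy of the file. The function name check_site_defs is hypothetical and is not part of the distribution.

```shell
# check_site_defs: hypothetical helper, not part of the FoldMiner distribution.
# Prints "<line number>:<line>" for every non-comment line that has a space or
# tab directly before or after an equals sign. Returns 1 if any are found.
check_site_defs() {
    file=$1
    # Assumption: comment lines start with optional whitespace followed by "#".
    if grep -n -E '[[:space:]]=|=[[:space:]]' "$file" \
        | grep -v '^[0-9]*:[[:space:]]*#'; then
        return 1
    fi
    return 0
}
```

For example, "check_site_defs site.defs" prints nothing and returns 0 when no offending lines are found. A value that legitimately contains " = " inside quotes would also be reported, so treat the output as a list of lines to inspect rather than as definite errors.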

  4. Distribution installation:

    Program installation may require root privileges, depending on the installation directories you entered in the site.defs file described in step 3 above. Log in as root (if necessary) and cd to the directory containing the software distribution (FoldMinerDistribution by default).

    There are three major installation options:


III. Instructions for the LOCK 2 Command Line Interface:

    Contents:
  1. Aligning a query structure to a single target structure
  2. Aligning a query structure to a database of target structures
  3. LOCK 2 output files
  4. Calculating statistical significance values (p values)
  5. Viewing Alignments

  1. Aligning a query structure to a single target structure:

    Run LOCK 2 as follows:

    lock2 -q <queryfile> -t <targetfile>

    The arguments <queryfile> and <targetfile> may be complete paths to PDB files, PDB accession codes (e.g. 1mbd or 1seb-A, where 1seb-A specifies chain A of 1seb), or SCOP identifiers (e.g. d1dlwa_). PDB accession codes will work only if you have correctly entered the location of a local copy of the PDB in the site.defs file, and SCOP identifiers will work only if you have correctly entered the location of a local copy of SCOP PDB files in the site.defs file (see section II, step 3 above).
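
When preparing many queries or targets, it can help to check which of the three forms an argument takes before passing it to lock2. The sketch below is illustrative only and is not how LOCK 2 itself interprets its arguments. It assumes that a complete path always contains a "/", that a PDB accession code is a digit followed by three alphanumeric characters, and that a SCOP identifier is "d" followed by the PDB code, a chain character, and a domain character. The function name classify_structure_arg is hypothetical.

```shell
# classify_structure_arg: hypothetical helper, not part of the distribution.
# Prints one of: path, scop, pdb, pdb-chain, unknown.
# The patterns are assumptions about identifier syntax, not LOCK 2's own rules.
classify_structure_arg() {
    case $1 in
        # A complete path contains at least one slash.
        */*)
            printf 'path\n' ;;
        # SCOP identifier, e.g. d1dlwa_
        d[0-9][0-9a-zA-Z][0-9a-zA-Z][0-9a-zA-Z][0-9a-zA-Z_.][0-9a-zA-Z_])
            printf 'scop\n' ;;
        # PDB accession code, e.g. 1mbd
        [0-9][0-9a-zA-Z][0-9a-zA-Z][0-9a-zA-Z])
            printf 'pdb\n' ;;
        # PDB accession code with a chain, e.g. 1seb-A
        [0-9][0-9a-zA-Z][0-9a-zA-Z][0-9a-zA-Z]-[0-9a-zA-Z])
            printf 'pdb-chain\n' ;;
        *)
            printf 'unknown\n' ;;
    esac
}
```

An argument reported as "unknown" is not necessarily invalid; it only means that it does not match the patterns assumed here.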

    To see all available command line options, type:

    lock2 -h

  2. Aligning a query structure to a database of target structures:

    To align a query structure to several target structures, create a file containing the list of target structures (as either full path names or accession codes). The first line of this file must read "LOCK_LIST". Then use the file name as the <targetfile> argument. For example:

    lock2 -q 1mbd -t mytargets.list

    where mytargets.list contains the required first line and PDB or SCOP identifiers on each successive line. For example:

    LOCK_LIST
    2gdm
    1mbc
    3sdh-A
    1eca
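
A target list in this format can also be generated from the command line. The following minimal sketch writes the required "LOCK_LIST" first line followed by one identifier per line. The function name make_lock_list is hypothetical and is not part of the distribution.

```shell
# make_lock_list: hypothetical helper, not part of the LOCK 2 distribution.
# Usage: make_lock_list <output file> <identifier or full path> ...
# Writes the required "LOCK_LIST" first line, then one target per line.
make_lock_list() {
    out=$1
    shift
    {
        printf 'LOCK_LIST\n'
        for id in "$@"; do
            printf '%s\n' "$id"
        done
    } > "$out"
}

# Example: reproduce the list shown above.
#   make_lock_list mytargets.list 2gdm 1mbc 3sdh-A 1eca
```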


  3. LOCK 2 Output Files:

    LOCK 2 creates two output files:
    1. <query>_<target>.out: The alignment results are placed in a file that will be called "query_target.out" if the query and target are specified with complete path names. If PDB or SCOP accession codes are used for either the query or target, the output file name will contain the accession codes instead (e.g. 1mbd_2gdm.out). This output file can be used to create a PDB file containing both aligned structures by using the script makePDBfile.pl as follows: makePDBfile.pl query_target.out.
      The format of the output file is generally self explanatory, with one significant exception. A reference to a given residue includes not just the residue number, but the insertion code (or a space, if none exists) and a chain (or a dash, if none exists). Thus, residue number 122 of chain A would be listed as "122 A" if the PDB file lists no insertion code.

    2. search.out: The second output file is called "search.out" and contains summary statistics for each alignment of a query structure to a target in the following order: target name, p value, alignment score, number of secondary structure elements aligned, number of residues aligned, rmsd, and a PDB header. The rest of the line contains alignment scores for individual secondary structure elements; most users will not find this information to be useful. This file is useful when a single query structure is aligned to several target structures.
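
Because search.out lists the target name first and the p value second, a short filter can pull out the targets below a chosen p value cutoff. This is a sketch only. It assumes that the fields are separated by whitespace, which this manual does not state, so inspect your own search.out before relying on it. The function name significant_targets is hypothetical.

```shell
# significant_targets: hypothetical helper, not part of the LOCK 2 distribution.
# Usage: significant_targets <p value cutoff> <search.out file>
# Assumption: fields are whitespace-separated, with the target name in
# column 1 and the p value in column 2 (the field order described above).
# Prints "target p-value" for each line whose p value is at or below the cutoff.
significant_targets() {
    cutoff=$1
    file=$2
    awk -v cutoff="$cutoff" '
        # Skip lines whose second field is not a plain number (such as a
        # header line), then compare numerically against the cutoff.
        $2 ~ /^[0-9.]+([eE][-+]?[0-9]+)?$/ && ($2 + 0) <= (cutoff + 0) {
            print $1, $2
        }
    ' "$file"
}

# Example:
#   significant_targets 0.01 search.out
```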


  4. Calculating Statistical Significance Values (p values):

    The p values given in the search.out file are based on a background score distribution that encompasses all SCOP folds. We have found that the significance of an alignment is more accurately assessed by considering the query structure's fold. If your query structure is a SCOP domain, you can obtain more accurate p values using the script "calculate_pvalues.pl". This script replaces the p values in search.out in place; no new files are created. The usage is as follows:

    ./calculate_pvalues.pl <query's SCOP identifier or SCOP fold> <full path to search.out>

    The script looks for the LOCK 2 output file specified in the second command line argument; if you have renamed search.out, this script will still function properly.

    If your query structure is not a SCOP domain but you wish to calculate p values for a specific SCOP fold (e.g. if you know your query is a globin), you may do so by providing the SCOP fold as an argument in place of the query's SCOP identifier. For example:

    ./calculate_pvalues.pl a.1 myresults/search.out
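
Since calculate_pvalues.pl overwrites the p values in the file it is given, you may wish to keep a copy of the original results first. The following is a minimal sketch of one way to do so. The function name backup_search_out and the ".orig" suffix are hypothetical choices, not part of the distribution.

```shell
# backup_search_out: hypothetical helper, not part of the distribution.
# Usage: backup_search_out <path to search.out>
# Copies the file to "<file>.orig" unless that backup already exists, so the
# first (unmodified) set of p values is preserved. Prints the backup name.
backup_search_out() {
    src=$1
    if [ ! -e "$src.orig" ]; then
        cp -p "$src" "$src.orig"
    fi
    printf '%s\n' "$src.orig"
}

# Example, followed by the script invocation documented above:
#   backup_search_out myresults/search.out
#   ./calculate_pvalues.pl a.1 myresults/search.out
```

The existing backup is deliberately never overwritten, so repeated runs with different folds leave the original p values intact.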

  5. Viewing Alignments:

    The script "makePDBfile.pl" can be used to create a PDB file containing the query structure as chain A and the target structure as chain B. To run, type:

    ./makePDBfile.pl <query_target.out>

    where "query_target.out" is the alignment file produced by LOCK 2 for the alignment you wish to view (see item 1 under "LOCK 2 Output Files" in section III). This will produce a file of the same name with the ".out" extension replaced with a ".pdb" extension.
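
If you have aligned a query to many targets, you may want to convert several alignment files at once. The sketch below lists the alignment files in a directory (skipping the summary file search.out) and shows the PDB file name each one corresponds to under the naming rule above. Both function names are hypothetical, and the sketch assumes that your alignment files sit together in one directory and that search.out has not been renamed.

```shell
# pdb_name_for: hypothetical helper, not part of the distribution.
# Prints the PDB file name for a LOCK 2 alignment file by replacing the
# ".out" extension with ".pdb".
pdb_name_for() {
    printf '%s\n' "${1%.out}.pdb"
}

# list_alignment_files: hypothetical helper, not part of the distribution.
# Prints every *.out file in the given directory except search.out.
list_alignment_files() {
    for f in "$1"/*.out; do
        [ -e "$f" ] || continue                              # no .out files
        [ "$(basename "$f")" = "search.out" ] && continue    # summary file
        printf '%s\n' "$f"
    done
}

# Example (script invocation as documented above):
#   list_alignment_files myresults | while read -r f; do
#       ./makePDBfile.pl "$f"
#   done
```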

    Load the PDB file in any viewer. We recommend displaying the alignment as a cartoon diagram. To do this in Rasmol, enter the following commands or choose the equivalent options from the menus:

    wireframe off
    cartoons
    color chain


IV. Instructions for the FoldMiner command line interface:

    Contents:
  1. Introduction
  2. Recommended Target Databases
  3. Usage and Arguments
  4. Output Files

  1. Introduction

    FoldMiner runs a structural similarity search (using LOCK 2 to perform pairwise structural superpositions) and automatically finds a structural motif that is the basis of the similarity between the query structure and high scoring targets. Algorithmic details can be found in the reference cited at the end of this document.

    If you have not already performed LOCK 2 alignments, FoldMiner will do so for you.

    If you have already aligned a query protein structure to a database of targets using LOCK 2, FoldMiner will not redo the structural alignments. Note that you must have a search.out file (potentially renamed) containing results for all pairwise alignments to avoid repeating the LOCK 2 alignments. This file is automatically produced each time LOCK 2 is run.

  2. Recommended Target Databases

    Files containing subsets of SCOP domain identifiers are installed in the "targetdb" subdirectory of the directory in which you have chosen to install the FoldMiner command line interface (specified by the variable FOLDMINERDIR in the site.defs file described in section II, step 3). See astral.stanford.edu for documentation on these subsets.

    We find the file astral-scopdom-seqres-gd-sel-gs-bib-25-1.63.id, which contains a set of SCOP domains such that no two have greater than 25% sequence identity, to be particularly useful.

  3. Usage and Arguments

    (Note: This information is also included at the top of the file FoldMiner.pl)

    To run FoldMiner, type:

    ./FoldMiner.pl -q <query PDB or SCOP id> -t <full path to target database file> [-r <full path to search.out file> -a <alignments directory> -x -e -exclude <exclude string> -lpg]

    Explanations:



    To reanalyze results with different parameters, set the options as desired and supply the -a argument and, if necessary, the -r argument. FoldMiner will reanalyze the data without repeating the alignments. Note that results files may be overwritten.

  4. FoldMiner Output Files:

    1. Alignment Files:

      When running alignments, FoldMiner produces the same output files as described in the "LOCK 2 Output Files" section above. The "search.out" file described in that section is placed in the location specified in FoldMiner's second command line argument. If this directory does not exist, it will be created if its parent directory exists.

    2. Other Output Files:

      There are several FoldMiner output files that are placed in the same directory as the search.out file:

      1. significant-results.txt and significant-results.html: Summaries of alignment statistics for statistically significant target structures in both text and html format.

      2. all-results.txt and all-results.html: Alignment statistics for all targets for which an alignment was produced, in both text and html format. (Some targets may produce no alignment if they are very different from the query.)

      3. query_motif.spt: A rasmol script that can be used to color the query structure's secondary structure elements by the conservation values calculated by FoldMiner. Strongly conserved secondary structure elements (those whose positions remain relatively fixed among the query and its structural homologs) will be brightly colored. To use this script, load the query structure into Rasmol and type "source query_motif.spt" in the command window:

        rasmol <full path for query PDB file>
        source query_motif.spt


      4. <query structure name>-SSE-conservations:

        The numerical conservation values calculated by FoldMiner for each secondary structure element. Values range from 0 to 1, where higher values indicate a greater degree of conservation of the secondary structure element's position within the query and its structural homologs. The start and end residue numbers for each secondary structure element are also given.

        You may wish to run FoldMiner again on the same set of alignments by excluding certain secondary structure elements using the -exclude option. Weakly conserved secondary structure elements can be excluded altogether to attempt to improve the specificity and sensitivity of the structural similarity search (i.e. to exclude false positives and recruit additional true positives). Alternatively, if a large number of secondary structure elements are weakly conserved (and therefore unlikely to be part of a conserved structural motif), you may wish to exclude the strongly conserved secondary structure elements in order to attempt to identify a second structural motif among the remaining ones. This process is described in more detail in the reference cited at the bottom of this document.


V. References

To cite this work, use the following references:

Shapiro, J. and Brutlag, D. (2004). FoldMiner: Structural Motif Discovery Using an Improved Superposition Algorithm. Protein Science 13(1):278-294.

Shapiro, J. and Brutlag, D. (2004). FoldMiner and LOCK 2: Protein Structure Comparison and Motif Discovery on the Web. Nucleic Acids Research 32:W536-541.

Please send questions and comments to Jessica Ebert.