Frequently Asked Questions

  1. What is NHR-scan?
  2. What can NHR-scan do?
  3. What is the difference to using TRANSFAC, rVISTA or ConSite?
  4. What is the difference to NUBIscan?
  5. What is a Hidden Markov Model (HMM)?
  6. How is scoring done (the Viterbi and Forward algorithms)?
  7. What does the model architecture look like?
  8. What is a FASTA-formatted sequence?
  9. What is 'probability for entering match state'?
  10. I have a different idea on what the model parameters should look like
  11. What about single half-site prediction?
  12. Can I download the program?
  13. Can I download the model?
  14. How do I cite you?
  15. My question is not listed here
  16. I found a bug: what do I do?

 


What is NHR-scan?

In brief, NHR-scan is a computational predictor of nuclear hormone receptor binding sites (NHRBS), similar in functionality to MATINSPECTOR or ConSite. However, the underlying algorithms are different, since the biological problem is more complicated. The model is an amalgam of a large number of sites from different NHR proteins - a 'supermodel' describing the entire class.


What can NHR-scan do?

NHR-scan can predict potential nuclear hormone receptor binding sites in genomic sequences. Like all transcription factor binding site tools, it suffers from making too many 'false' predictions - that is, predictions of non-functional sites. As always in bioinformatics, predictions are predictions and cannot replace experimental investigation. Instead, the goal is to highlight potential sites and so speed up experimental investigation. For advanced users, NHR-scan allows parameter modifications, which makes it possible to predict binding sites for custom NHR proteins.


What is the difference to using TRANSFAC, rVISTA or ConSite?

The main difference lies in the model framework. TRANSFAC, rVISTA and ConSite use a simple 'profile' model for describing binding preferences. In most cases this kind of model is fully adequate - however, it is unsuitable for modelling variable spacings and alternative behaviours (like the different site configurations of nuclear receptors). NHR-scan uses a Hidden Markov Model framework for describing the binding behaviour. Additionally, rVISTA and ConSite support cross-species comparison, which is not yet implemented in NHR-scan.


What is the difference to NUBIscan?

NUBIscan has the same goal as NHR-scan. Three major differences are evident:

  1. The model framework is different. NHR-scan uses HMMs and associated scoring algorithms, while NUBIscan uses single half-site models and assesses possible pairings and their respective probabilities. As a consequence, NHR-scan can model different half-site strengths and different half-site models for different configurations, which is not possible with NUBIscan.
  2. Related to the model framework, scoring is different. By default, NUBIscan reports the portion of the predictions whose scores are 6-7 standard deviations above the mean score, in a distribution built from all putative sites in the input sequence. NHR-scan uses mature HMM scoring algorithms, reporting both the most probable path (the Viterbi algorithm) and the total probability of the model producing the sequence (the Forward algorithm).
  3. The underlying data is different. NUBIscan half-sites are based on few sites (11), while the number of functionally validated sites used for training NHR-scan is almost ten times larger (108).

What is a Hidden Markov Model (HMM)?

A Hidden Markov Model (HMM) is essentially an extension of the profile model, in which the occurrence of each nucleotide at each position of a binding site is tabulated in a matrix. For reviews of profile approaches, see this paper. Briefly, an HMM consists of a set of states linked together in a chain-like structure, with a defined start and end state. In most cases there are many different possible paths between the start and the end state. The model includes transition probabilities: the chance of moving from one state to another. A transition is usually drawn as an arrow between states. Each state can emit a nucleotide according to a distribution that is unique to the state: for instance 90% A, 5% C, 4% G and 1% T. Whenever we reach a state, it will emit a nucleotide. Therefore, a given path through the states producing a given sequence of nucleotides has a certain probability: the product of all transitions (probabilities for moving between states) multiplied by the product of all emissions (probabilities for generating each nucleotide). With this framework, it is possible to calculate the most probable path ('the best path') that generates a certain sequence, or the total probability that the chain of states produced a certain sequence. A comprehensive guide to HMMs is given in Durbin et al., 'Biological Sequence Analysis', Cambridge University Press.
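
As a minimal illustration of the path probability described above, the following Python sketch multiplies the transitions and emissions along one path. The state names and all numbers (apart from the 90/5/4/1 emission example) are invented for illustration and are not NHR-scan parameters.

```python
# Toy HMM: state names and probabilities are illustrative assumptions,
# not NHR-scan parameters.
transitions = {
    ("start", "s1"): 1.0,
    ("s1", "s1"): 0.3,
    ("s1", "s2"): 0.7,
    ("s2", "s2"): 0.4,
    ("s2", "end"): 0.6,
}
emissions = {
    "s1": {"A": 0.90, "C": 0.05, "G": 0.04, "T": 0.01},  # example from the text
    "s2": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}


def path_probability(path, sequence):
    """Probability that `path` (emitting states only) generates `sequence`."""
    if len(path) != len(sequence):
        raise ValueError("one emitting state is needed per nucleotide")
    prob = 1.0
    previous = "start"
    for state, base in zip(path, sequence):
        prob *= transitions.get((previous, state), 0.0)  # move to the state
        prob *= emissions[state][base]                   # emit the nucleotide
        previous = state
    # Finish the path by moving to the end state.
    return prob * transitions.get((previous, "end"), 0.0)


p = path_probability(["s1", "s2"], "AG")
```

Transitions that are absent from the table are treated as impossible (probability 0), so a path that cannot reach the end state gets probability 0.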

 


How is scoring done (the Viterbi and Forward algorithms)?

The Viterbi algorithm calculates the most probable path through the states that produces the sequence - it is often used for labelling parts of a sequence as sites, transmembrane regions or parts of a domain, depending on the model. For a more algorithmic discussion, see Durbin et al., 'Biological Sequence Analysis', Cambridge University Press.
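
To make the idea concrete, here is a small Python sketch of Viterbi decoding on a toy two-state model. The states ('background' and 'site') and all probabilities are assumptions made up for this example; the real NHR-scan model is far larger, and its decoder is not written in Python.

```python
import math

# Toy two-state model; all names and numbers are illustrative assumptions.
states = ["background", "site"]
start = {"background": 0.9, "site": 0.1}
trans = {
    "background": {"background": 0.9, "site": 0.1},
    "site": {"background": 0.2, "site": 0.8},
}
emit = {
    "background": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "site": {"A": 0.40, "C": 0.10, "G": 0.40, "T": 0.10},
}


def viterbi(sequence):
    """Return (most probable state path, its log probability).

    `sequence` must be a non-empty string over A, C, G, T.
    """
    # score[s] = best log probability of any path ending in state s
    score = {s: math.log(start[s]) + math.log(emit[s][sequence[0]])
             for s in states}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for base in sequence[1:]:
        new_score, pointers = {}, {}
        for s in states:
            best_prev = max(states,
                            key=lambda r: score[r] + math.log(trans[r][s]))
            pointers[s] = best_prev
            new_score[s] = (score[best_prev] + math.log(trans[best_prev][s])
                            + math.log(emit[s][base]))
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]


path, log_prob = viterbi("CTAGGAGACT")
```

Log probabilities are used so that long sequences do not underflow. The returned path labels each nucleotide with one state, which is how the labelling mentioned above is obtained.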

The Forward algorithm calculates the total probability that the model produced the sequence, essentially by summing the probabilities of all possible paths producing the sequence (note that this probability is not 1). For a more algorithmic discussion, see Durbin et al., 'Biological Sequence Analysis', Cambridge University Press.
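
The summation over all paths can be sketched in a few lines of Python. The toy two-state model below (its state names and all probabilities) is an assumption made up for illustration and is unrelated to the actual NHR-scan parameters.

```python
# Toy two-state model; all names and numbers are illustrative assumptions.
states = ["background", "site"]
start = {"background": 0.9, "site": 0.1}
trans = {
    "background": {"background": 0.9, "site": 0.1},
    "site": {"background": 0.2, "site": 0.8},
}
emit = {
    "background": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "site": {"A": 0.40, "C": 0.10, "G": 0.40, "T": 0.10},
}


def forward(sequence):
    """Total probability that the model generates the non-empty `sequence`."""
    # f[s] = probability of the prefix read so far, summed over all paths
    # that end in state s
    f = {s: start[s] * emit[s][sequence[0]] for s in states}
    for base in sequence[1:]:
        f = {s: emit[s][base] * sum(f[r] * trans[r][s] for r in states)
             for s in states}
    return sum(f.values())


total = forward("CTAGGAGACT")
```

The value for a single sequence is well below 1; in this toy model, which has no explicit end state, it is the values for all sequences of the same length that together sum to 1. Plain probabilities underflow on long sequences, so practical implementations work with scaled values or logarithms.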


What does the model architecture look like?

Essentially, the model consists of three 'match state chains', one for each type of site configuration (direct, inverted and everted repeats), and one 'background state', corresponding to no prediction. Each match state chain is composed of two half-site models and a spacer model separating them. The half-site models can be described as ordinary profile models, while the spacer model describes the probabilities of the different spacings (technically a set of insertion states).
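
The layout described above can be pictured with a small Python sketch that lists the states one match state chain passes through. The half-site length of 6, the state naming scheme and the strand orientations assigned to each repeat type are assumptions chosen for illustration; they are not taken from the NHR-scan model files.

```python
from dataclasses import dataclass

HALF_SITE_LENGTH = 6  # assumption for illustration (hexamer half-sites)


@dataclass(frozen=True)
class MatchChain:
    name: str
    first_strand: str   # "+" or "-": orientation of the first half-site
    second_strand: str  # orientation of the second half-site

    def states(self, spacer):
        """Names of the emitting states visited for a given spacer length."""
        first = [f"{self.name}.half1{self.first_strand}[{i}]"
                 for i in range(HALF_SITE_LENGTH)]
        gap = [f"{self.name}.spacer[{i}]" for i in range(spacer)]
        second = [f"{self.name}.half2{self.second_strand}[{i}]"
                  for i in range(HALF_SITE_LENGTH)]
        return first + gap + second


# Three match state chains plus one background state ("no prediction").
CHAINS = [
    MatchChain("direct", "+", "+"),
    MatchChain("inverted", "+", "-"),
    MatchChain("everted", "-", "+"),
]
BACKGROUND = "background"

# States visited by a direct repeat with a 4-bp spacer.
dr4 = CHAINS[0].states(spacer=4)
```

Each spacer position is represented here as one insertion-like state, so a longer spacer simply means passing through more of them before the second half-site begins.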


What is a FASTA-formatted sequence?

FASTA is one of the most widely used sequence formats: it has one line with a name or identifier, followed by one or more lines of sequence, like so:

>my sequence
CAGTCAGTGCGCCGGCGATCGTAGCTAGCTAGCTAGCTAG
CGATCGATGCATGCGACGCGCGGACGTAGCTAGCTAGCTAG
CATCGATGCTAGC

Note that the first line HAS TO start with a ">" sign. All the sequence lines following it are concatenated by the computer. They don't have to be the same length.
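
The concatenation rule can be shown with a short Python sketch of a FASTA reader. This is a generic, minimal parser written for illustration (the function name parse_fasta is our own); it is not the code used by the NHR-scan server.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping each name to its full sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            name = line[1:].strip()  # header line: start a new record
            records[name] = []
        elif name is None:
            raise ValueError("FASTA input must start with a '>' line")
        else:
            records[name].append(line)  # sequence line of any length
    # Join the sequence lines of each record into one string.
    return {n: "".join(parts) for n, parts in records.items()}


example = """>my sequence
CAGTCAGTGCGCCGGCGATCGTAGCTAGCTAGCTAGCTAG
CGATCGATGCATGCGACGCGCGGACGTAGCTAGCTAGCTAG
CATCGATGCTAGC
"""
sequences = parse_fasta(example)
```

Here sequences["my sequence"] holds the three sequence lines joined into a single string, regardless of their individual lengths.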

What is 'probability for entering match state'?

This value corresponds to the transition probability from the background state to the different match state chains described above. Higher probabilities will result in more predictions. In 'advanced query', individual probabilities can be set for the different match states.


I have a different idea on what the model parameters should look like

In 'advanced query', set custom parameter modifications to 'yes' to get access to half-site model and transition states parameters.

On this page, changeable parameters are prefilled with the respective model parameters. It is important that you have a good grasp of both the biological and the mathematical problem before you proceed. You are on your own here in determining what is sensible.

Half-site models are in themselves identical to profile (matrix) models, describing counts of bases in each position. For a more comprehensive guide to profile-based approaches, see these references (Wasserman 2003, Stormo 2000). Two half-site models are supplied for each repeat type, and each half-site model will be incorporated in the model in both directions. In other words, only one direction of each half-site needs to be supplied. (NB: it is important that both half-sites are on the same strand, otherwise the configuration scheme will be useless.)
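
For readers who prefer code, the sketch below shows in Python how a count matrix can be turned into per-position probabilities and how the opposite direction follows from the supplied one. The counts are invented, and the pseudocount of 1 is a common convention assumed here; neither is taken from NHR-scan.

```python
# Hypothetical count matrix for a 6-bp half-site
# (rows: bases, columns: positions). The counts are invented for
# illustration and are not the NHR-scan model.
counts = {
    "A": [8, 0, 0, 1, 0, 9],
    "C": [0, 0, 0, 0, 9, 0],
    "G": [2, 10, 9, 0, 1, 0],
    "T": [0, 0, 1, 9, 0, 1],
}


def to_probabilities(counts, pseudocount=1.0):
    """Turn per-position base counts into per-position probabilities.

    The pseudocount (an assumption of this sketch) avoids zero probabilities.
    """
    length = len(counts["A"])
    probs = {base: [] for base in "ACGT"}
    for i in range(length):
        total = sum(counts[b][i] + pseudocount for b in "ACGT")
        for b in "ACGT":
            probs[b].append((counts[b][i] + pseudocount) / total)
    return probs


def reverse_complement(matrix):
    """The same half-site as read on the opposite strand."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return {complement[b]: row[::-1] for b, row in matrix.items()}


forward_site = to_probabilities(counts)
reverse_site = reverse_complement(forward_site)
```

Because the reverse direction is derived mechanically from the forward one, supplying a single direction per half-site is enough.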

The transition parameters that can be set correspond to the different spacers between half-sites and to the probability of entering match states.

The first class of parameters consists of a row of probabilities for the different types of spacers, for each of the three repeat types. The probabilities must sum to 1 (as no other alternatives exist).
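
Before submitting custom spacer probabilities, it may help to check them with a few lines of Python such as the sketch below. It assumes that each repeat type has its own row that must sum to 1; the spacer lengths covered and the values are invented and are not the NHR-scan defaults.

```python
import math

# Hypothetical spacer rows: index n is the probability of an n-bp spacer.
# Values are invented for illustration; they are not the NHR-scan defaults.
spacer_probs = {
    "direct":   [0.05, 0.20, 0.15, 0.20, 0.30, 0.10],
    "inverted": [0.30, 0.10, 0.10, 0.40, 0.05, 0.05],
    "everted":  [0.10, 0.10, 0.10, 0.10, 0.10, 0.50],
}


def check_spacer_rows(rows, tolerance=1e-9):
    """Raise ValueError if a row has a negative entry or does not sum to 1."""
    for repeat, row in rows.items():
        if any(p < 0 for p in row):
            raise ValueError(f"{repeat}: probabilities must be non-negative")
        if not math.isclose(sum(row), 1.0, abs_tol=tolerance):
            raise ValueError(
                f"{repeat}: probabilities sum to {sum(row)}, expected 1")


def normalise(row):
    """Rescale a row of non-negative weights so that it sums to 1."""
    total = sum(row)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return [p / total for p in row]


check_spacer_rows(spacer_probs)  # passes silently for the rows above
```

The normalise helper is a convenience for turning rough weights, such as counts of observed spacer lengths, into a valid row.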

The probability of entering match states can be viewed as a cutoff that trades sensitivity against selectivity. Higher values will result in more predictions: more real sites will be detected, but more false sites will also be reported. For a comparison of different settings, see the article. Note that different probabilities can be set for the different repeats (for instance, if only direct repeats are of interest, the probability of moving to the ER or IR match states could be set to zero).


What about single half-site prediction?

Single half-site prediction is not supported in NHR-scan, since it would render the model framework useless. The TRANSFAC and JASPAR profile databases have several half-site models for this purpose. The ConSite tool enables custom profile usage.


Can I download the program?

The actual Viterbi/Forward decoder is a very simple C++ program that uses the intricate GHMM C/C++ package available at . We recommend installing those modules. There is an additional layer of Perl modules that is responsible for the graphical interface between the user and the decoder. You can use our code (if you are from an academic or non-profit institution) or make your own.


Can I download the model?

Yes, if you are an academic researcher. Go to Supplementary data.


How do I cite you?

If you use the web service, model framework or underlying data, please cite:

Prediction of Nuclear Hormone Receptor Response Elements

Albin Sandelin and Wyeth W. Wasserman

Mol Endocrinol. 2004 Nov 24; [Epub ahead of print] PMID: 15563547


My question is not listed here

Send us a mail: albin.sandelin AT cgb.ki.se with the question formulated as succinctly as possible.


I found a bug: what do I do?

Send us a mail: albin.sandelin AT cgb.ki.se with the following information:

What you tried to do (in as much detail as possible)

What went wrong (in as much detail as possible)