Bottom line

This is the start of the documentation; it is not (and may never be) finished. It is a discourse on the history, meaning and future of maxH.

It seems to have been David Eisenberg (1984) who first suggested that every integral membrane protein should have at least one highly hydrophobic transmembrane helix. Klein et al. (1985) cited his paper, but not in reference to this idea. They proposed that it should be possible to identify integral membrane proteins on the basis of a single criterion: the hydropathy of their most hydrophobic segment. They applied this idea in a Bayesian analysis of known transmembrane proteins, and the analysis suggested that the idea was correct. Using known integral and non-integral membrane proteins, they derived maximum hydropathy (maxH) values that should optimally discriminate between the two classes of proteins for several hydropathy scales and window lengths. We applied their method to the proteins of the E. coli genome when it became available. We found the Klein et al. discriminator values to be hopelessly inaccurate for E. coli, but could immediately see that Eisenberg's idea was correct.
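
The single criterion described above can be sketched in a few lines. This is only an illustration, not the code used in our analysis. It assumes the Kyte-Doolittle scale (one common choice; the text above does not single out a scale), a 19-residue window, and that maxH is the mean hydropathy of the most hydrophobic window. The names max_h and is_integral are made up for this sketch, and the discriminator is deliberately left as a parameter because its value depends on the scale and window.

```python
# Kyte-Doolittle hydropathy scale. Assumption: the text does not name the
# scale it settled on; this is just one widely used choice.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def max_h(sequence, window=19, scale=KYTE_DOOLITTLE):
    """Mean hydropathy of the most hydrophobic `window`-residue segment."""
    values = [scale[residue] for residue in sequence.upper()]
    if len(values) < window:
        raise ValueError("sequence is shorter than the window")
    # Slide the window one residue at a time and keep the highest mean.
    return max(
        sum(values[start:start + window]) / window
        for start in range(len(values) - window + 1)
    )


def is_integral(sequence, discriminator, window=19, scale=KYTE_DOOLITTLE):
    """Single-criterion call: integral if maxH reaches the discriminator."""
    return max_h(sequence, window, scale) >= discriminator
```

Residues outside the twenty standard amino acids raise a KeyError in this sketch; a real pipeline would have to decide how to score them.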

We repeated the Bayesian analysis using known E. coli proteins as the basis for new discriminator values. We compared several scales and window lengths to find the optimal combination, as Klein et al. had done. We did find an optimal combination, but in E. coli essentially all reasonable hydropathy scales work well at window lengths between half the length of a typical transmembrane helix and its full length (11 to 21 residues). Our discriminator values were much higher than those of Klein et al. Their discriminators misclassified many well-characterized E. coli proteins as integral membrane proteins. Ours misclassified a few E. coli proteins according to the information in Swiss-Prot, but when the published evidence was examined we could see that no well-characterized proteins were misclassified.
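
To make the notion of a discriminator value concrete, here is a deliberately simplified stand-in. It is not the Bayesian analysis described above: it just picks the cutoff that misclassifies the fewest proteins in a labelled training set. The function name best_discriminator and its inputs (two lists of maxH values, one per class) are assumptions for illustration.

```python
def best_discriminator(integral, non_integral):
    """Return (cutoff, errors) minimising misclassifications.

    `integral` and `non_integral` are lists of maxH values for proteins
    whose class is already known. A protein is called integral when its
    maxH is at or above the cutoff.
    """
    points = sorted(set(integral) | set(non_integral))
    if len(points) < 2:
        raise ValueError("need at least two distinct maxH values")
    # Candidate cutoffs: midpoints between neighbouring observed values.
    candidates = [(low + high) / 2 for low, high in zip(points, points[1:])]

    def errors(cutoff):
        missed = sum(value < cutoff for value in integral)
        false_calls = sum(value >= cutoff for value in non_integral)
        return missed + false_calls

    cutoff = min(candidates, key=errors)
    return cutoff, errors(cutoff)
```

Repeating this for each scale and window length, and comparing the error counts, is one simple way to look for a best combination. Ties are resolved in favour of the lowest cutoff.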

Our big surprise came when we plotted a genomic histogram. On the X-axis we put the maxH value ranges, and on the Y-axis the number of proteins in each range. The results for E. coli, a member of the archaea, and the subset of human proteins available at the time are shown in the figure at the top of the page. (I think they are in that order, front to back, but I don't know for sure which is which because they are so similar.) Two things are striking:

  1. The curves are bimodal. Each is well approximated by the sum of two simple Gaussian curves. This must mean something.
  2. The valley between the two peaks falls exactly at the discriminator value which best separates integral and non-integral membrane proteins in E. coli. This is shown as a tick mark below the X-axis and a line through the X-axis at the discriminator value (which can be misleading as it continues up the back wall). The discriminator value, remember, was derived from Bayesian analysis of known E. coli proteins, so it has nothing directly to do with the distribution of maxH values in the genome. The blue proteins are integral; the green ones are not, at least in E. coli. Isn't this amazing?
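
The two observations above can be explored with a short sketch: bin maxH values into a histogram, and locate the valley of a curve that is the sum of two Gaussians. The function names and the (weight, mean, sd) tuple layout are assumptions for illustration. The component parameters have to come from a fit to real data; none are supplied here.

```python
import math


def histogram(values, bin_width):
    """Count values per bin; bin k covers [k * bin_width, (k + 1) * bin_width)."""
    counts = {}
    for value in values:
        k = math.floor(value / bin_width)
        counts[k] = counts.get(k, 0) + 1
    return dict(sorted(counts.items()))


def gaussian(x, weight, mean, sd):
    """Weighted normal density at x."""
    z = (x - mean) / sd
    return weight * math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def two_gaussian_sum(x, comp_a, comp_b):
    """Sum of two Gaussian components, each a (weight, mean, sd) tuple."""
    return gaussian(x, *comp_a) + gaussian(x, *comp_b)


def valley(comp_a, comp_b, steps=10000):
    """Grid-search the lowest point of the summed curve between the two means."""
    low, high = sorted((comp_a[1], comp_b[1]))
    grid = [low + (high - low) * i / steps for i in range(steps + 1)]
    return min(grid, key=lambda x: two_gaussian_sum(x, comp_a, comp_b))
```

If the two components overlap so much that the summed curve has a single peak, there is no valley and the search simply returns one end of the interval, so the result is only meaningful for a clearly bimodal curve.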

In other organisms the situation may not be the same as in E. coli. Bacillus subtilis has many secreted proteins with cleaved signal sequences that are classified as integral by our E. coli algorithm. Bacillus has many signal peptidases, some of which, we suggest, may cleave signals with long hydrophobic cores. In eukaryotes there are many integral membrane proteins that have very low maxH values. The most obvious class of these is the nuclear-encoded mitochondrial inner membrane proteins. These have a (not well characterized) mitochondrial localization signal that efficiently targets them to the mitochondrion and, we suggest, a requirement for a low maxH value so that they are not efficiently targeted for insertion into the membrane in the ER.

The single-criterion maxH analysis efficiently segregates integral membrane proteins from all other classes in E. coli. A similar Bayesian analysis has not been done in other organisms, but it seems likely that the same simple discriminator approach will not work for all classes of proteins in Gram-positive organisms or in eukaryotes. However, it is clear that some very basic biological principle is reflected by the E. coli result, and that the same principle also operates in Bacillus, eukaryotes and archaea, although in some cases it is overridden by specific mechanisms.

Our idea is that there is a consistent difference in maxH values between most integral and non-integral proteins in all organisms; hence the bimodal histogram. Exceptional proteins may be exceptional for specific reasons. Perhaps the E. coli signal peptidase recognizes only proteins with short hydrophobic cores, while the Bacillus signal peptidases recognize cores of different lengths, some longer. Perhaps eukaryotes have a statistical requirement for low maxH values for proteins that are destined for insertion into membranes other than that of the ER.