% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/build.cid.lca.R
\name{build.cid.lca}
\alias{build.cid.lca}
\title{build.cid.lca}
\usage{
build.cid.lca(
  pc.directory = NULL,
  tax.sources = "LOTUS - the natural products occurrence database",
  use.pathways = TRUE,
  use.conserved.pathways = FALSE,
  threads = 8,
  cid.taxid.object = NULL,
  taxid.hierarchy.object = NULL,
  cid.pwid.object = NULL,
  min.taxid.table.length = 3,
  output.directory = NULL
)
}
\arguments{
\item{pc.directory}{directory from which to load pubchem .Rdata files. alternatively provide cid.taxid.object, taxid.hierarchy.object, and cid.pwid.object as data.table R objects.}

\item{tax.sources}{character vector. which taxonomy sources should be used?  defaults to "LOTUS - the natural products occurrence database".  several sources can be supplied as a vector, e.g. c("LOTUS - the natural products occurrence database", "The Natural Products Atlas", "KNApSAcK Species-Metabolite Database", "Natural Product Activity and Species Source (NPASS)").}

\item{use.pathways}{logical.  default = TRUE, should pathway data be used in building lowest common ancestor, when taxonomy is associated with a pathway?}

\item{use.conserved.pathways}{logical. default = FALSE, should 'conserved' pathways be used?  when false, only pathways with an assigned taxonomy are used.}

\item{threads}{integer.  number of threads to use when finding lowest common ancestor.  parallel processing via the doParallel and foreach packages.}

\item{cid.taxid.object}{R data.table, generally produced by get.pubchem.ftp; preferably, define pc.directory}

\item{taxid.hierarchy.object}{R data.table, generally produced by get.pubchem.ftp; preferably, define pc.directory}

\item{cid.pwid.object}{R data.table, generally produced by get.pubchem.ftp; preferably, define pc.directory}

\item{min.taxid.table.length}{integer.  when there are few taxa reported to synthesize a particular compound, and those few taxa are spread widely across biology, the LCA concept breaks down.  This value controls the decision as to whether to determine LCA within taxonomic ranks, rather than within the full taxonomy hierarchy.  see details.}

\item{output.directory}{directory to which the pubchem.bio database is saved.  If NULL, will try to save in pc.directory (if provided). If both directories are NULL, the result is not saved and is only returned in memory.}
}
\value{
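the table relating each pubchem CID to its lowest common ancestor NCBI taxid.  If output.directory (or, failing that, pc.directory) is provided, the table is also saved there as an .Rdata file.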
}
\description{
utilizes downloaded and properly formatted local pubchem data created by 'get.pubchem.ftp' as input to generate a relationship between pubchem CID and the lowest common ancestor NCBI taxid
}
\details{
utilizes downloaded and properly formatted local pubchem data created by 'get.pubchem.ftp' function

Some metabolism is highly conserved - all species perform those reactions.
Other metabolism is highly specific - only one species is known to produce
that metabolite. Sometimes, it is in between.  The lowest common ancestor
(LCA) approach allows us to analyze these patterns and put them to use to
generalize metabolites for metabolomics across species.

Biology is more complex than that though.  Natural products are often
reported as being synthesized by an organism which is in symbiosis with
a second organism.  The taxonomic assignment is sometimes both organisms,
even if neither would create that product in isolation, or if only one is
actually capable of producing that metabolite.  In these situations, the
LCA approach can break down.  For example, if a bacterium is in symbiosis
with an alga, and each is listed as producing the metabolite, the LCA will
be assigned as '1' - the root of all biology, since we have to go back to
the base of the taxonomic tree to find the common taxonomic ancestor of
prokaryotes and eukaryotes.  In this example, there are two unique species,
genera, families, orders, etc listed in the full taxonomic
hierarchy for this metabolite.

The 'min.taxid.table.length' argument controls
sensitivity to this phenomenon in assigning LCA.  The number of unique taxa
which are mapped to each metabolite varies by taxonomic level.  It may map
to two species, but only one genus.  In that case, the genus is assigned as
the LCA.  However, if the metabolite maps to two unique species,
two unique genera, two unique families, two unique kingdoms, and one unique
domain, we should ask ourselves whether this sparse pattern supports
marking this metabolite as 'conserved' or 'primary.'  It makes
more intuitive sense to conclude that there may be extenuating
circumstances which have resulted from unique biology.  For example,
Ceratodictyol B is reported from \emph{Haliclona cymaeformis} and \emph{Ceratodictyon}
\emph{spongiosum}, one of which is a red algal symbiont of the other. At each
taxonomic level, there are either 0, 1, or 2 unique taxonomy IDs. A count
of 0 is uninteresting - it just reflects that there is no taxonomy
assigned for those lineages at that level.

What is more interesting is how many distinct values the count of unique
taxonomy IDs takes across taxonomic levels.  In the case of Ceratodictyol B,
the only value other than 0 or 1 is '2'.  There are 2 unique taxonomy IDs at
each of the levels species, genus, order, class, and phylum.  So there are
five taxonomic levels that have exactly 2 unique taxonomy IDs, and there are
no taxonomic levels which have more than 2 unique taxids.  We call the number
of such distinct values the taxid.ct.table length, where the taxid.ct.table
is the table of frequencies of the number of unique taxids at each taxonomic
level.  The length is the number of unique values when IGNORING '0' or '1'.
When the taxid.ct.table length is less than or equal to
min.taxid.table.length, the LCA is calculated within the lowest taxonomic
level that has the most frequent unique taxonomy ID count.

For the Ceratodictyol B example, this would mean that we would find that '2'
was the most common number of unique taxids reported, so we find that the
lowest taxonomic level which reports two unique taxids is 'species'.  An LCA
is then assigned within each of those two species.  If, however, there were
two \emph{Ceratodictyon} spp. reported, then the species level would have 3
unique taxids, and there would be 4 levels (rather than five) which have 2
unique taxids.  The lowest taxonomic level with 2 unique taxids, the most
frequent count observed, would now be 'genus', so an LCA would be assigned
within each genus.  This would mean that the first LCA would be assigned to
the \emph{Ceratodictyon} genus, since there are multiple \emph{Ceratodictyon}
species reported, and then a second LCA would be assigned to the
\emph{Haliclona cymaeformis} species.  Sorry it is so complicated.  Life is
complicated.
}
\examples{
data('cid.taxid', package = "pubchem.bio")
data('taxid.hierarchy', package = "pubchem.bio")
data('cid.pwid', package = "pubchem.bio")
cid.lca.out <- build.cid.lca(
  tax.sources = "LOTUS - the natural products occurrence database",
  use.pathways = FALSE,
  threads = 1,
  cid.taxid.object = cid.taxid,
  taxid.hierarchy.object = taxid.hierarchy,
  cid.pwid.object = cid.pwid
)
head(cid.lca.out)
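
# The taxid.ct.table length described in the details section can be
# illustrated with a small toy table.  This is a minimal sketch in base R
# and is NOT the internal code of build.cid.lca.  The object names
# (toy.lineage, n.unique, toy.ct.table, ...) and all taxid values are
# hypothetical, chosen only to mimic the Ceratodictyol B pattern:
# two reported organisms, columns ordered from lowest to highest rank,
# NA where no taxid is assigned at that rank.
toy.lineage <- data.frame(
  species = c(101, 102),
  genus   = c(201, 202),
  family  = c(NA, NA),
  order   = c(401, 402),
  class   = c(501, 502),
  phylum  = c(601, 602),
  kingdom = c(NA, NA),
  domain  = c(801, 801)
)

# number of unique, non-missing taxids at each taxonomic level
n.unique <- vapply(
  toy.lineage,
  function(x) length(unique(x[!is.na(x)])),
  integer(1)
)

# frequency table of those counts, ignoring counts of 0 or 1
toy.ct.table <- table(n.unique[n.unique > 1])

# the taxid.ct.table length is the number of distinct remaining counts
toy.ct.table.length <- length(toy.ct.table)

# most frequent count, and the lowest level at which that count occurs
most.frequent.ct <- as.integer(names(which.max(toy.ct.table)))
lowest.level <- names(n.unique)[n.unique == most.frequent.ct][1]

# with the default min.taxid.table.length = 3, a length this small would
# trigger the within-rank LCA calculation at the level reported here
toy.ct.table.length
lowest.level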
}
\author{
Corey Broeckling
}
