% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/corp_or_dtm.R
\name{corp_or_dtm}
\alias{corp_or_dtm}
\title{Generate Corpus or Document Term Matrix with 1 Line}
\usage{
corp_or_dtm(..., from = "dir", type = "corpus", enc = "auto",
  mycutter = DEFAULT_cutter, stop_word = NULL, stop_pattern = NULL,
  control = DEFAULT_control1, myfun1 = NULL, myfun2 = NULL,
  special = "")
}
\arguments{
\item{...}{names of folders, files, or a mixture of the two. It can also be a character 
vector of texts to be processed when \code{from} is set to "v"; see below.}

\item{from}{should be "dir" or "v". If your inputs are filenames, it should be "dir" (default). 
If the input is a character vector of texts, it should be "v". However, if it is set to "v", 
make sure that no element is identical to a filename in your working
directory; if one is, the function raises an error. This check is needed 
because \code{segment} would otherwise take such an element as a file to read.}

\item{type}{the kind of result you want. It is case insensitive. Values 
such as "c", "cor", "corp" and "corpus" give a corpus 
result; "dtm" gives a document term matrix, 
and "tdm" gives a term document matrix. See Details. Any input other than the above gives 
a corpus result. The default value is "corpus".}

\item{enc}{a length 1 character vector specifying the encoding used when reading files. If your files 
may have different encodings, or you do not know their encodings, 
set it to "auto" (default) 
to let the function auto-detect encoding for each file.}

\item{mycutter}{the jiebar cutter to segment text. A default cutter is used. See Details.}

\item{stop_word}{a character vector to specify stop words that should be removed. 
If it is \code{NULL}, nothing is removed. If it is "jiebar", the stop words used by 
\pkg{jiebaR} are used, see \code{\link{make_stoplist}}.
Please note the default value is \code{NULL}. Texts are transformed to lower case before 
removing stop words, so your stop words only need to contain lower case characters.}

\item{stop_pattern}{a vector of regular expressions. These patterns work like stop words: 
terms that match any of the patterns are removed.}

\item{control}{a named list to be passed to \code{DocumentTermMatrix} 
or \code{TermDocumentMatrix} to create dtm or tdm. Most of the time you do not need to 
set this value because a default value is used. When you set the argument to \code{NULL}, 
it still points to this default value. See Details.}

\item{myfun1}{a function used to modify each text after being read by \code{scancn} 
and before being segmented.}

\item{myfun2}{a function used to modify each text after it is segmented.}

\item{special}{a length 1 character or regular expression to be passed to \code{dir_or_file} 
to specify what pattern should be met by filenames. The default is to read all files.
See \code{\link{dir_or_file}}.}
}
\value{
a corpus, or document term matrix, or term document matrix.
}
\description{
This function takes a character vector of texts, or a mixture of files and folders. It 
automatically detects file encodings, segments Chinese texts, 
applies the modifications you specify, 
removes stop words, and then generates a corpus or dtm (tdm). Since \pkg{tm} 
does not support Chinese well, this function works around some of its problems. See Details.
}
\details{
Package \pkg{tm} has two problems in creating a Chinese document term matrix. First, 
it tries to segment an already segmented Chinese corpus and puts together terms that should 
not be put together. Second, if a term appears both in the middle and at the end of a text, 
it is very occasionally taken as two different terms. This function deals with both problems.
It calls \code{\link{scancn}} to read files and auto-detect file encoding, 
and calls \code{\link[jiebaR]{segment}} to segment Chinese text, and finally 
calls \code{\link[tm]{Corpus}} to generate corpus, 
or \code{\link[tm]{DocumentTermMatrix}}, 
or \code{\link[tm]{TermDocumentMatrix}} to create dtm or tdm.

Users can provide their own jiebar cutter through \code{mycutter}. Otherwise, the function 
uses \code{DEFAULT_cutter} which is created when the package is loaded. 
The \code{DEFAULT_cutter} is simply \code{worker(write = FALSE)}.
See \code{\link[jiebaR]{worker}}.

As long as 
you have not manually created another variable called "DEFAULT_cutter", 
you can directly use \code{jiebaR::new_user_word(DEFAULT_cutter...)} 
to add new words. Whether or not you manually create an object 
called "DEFAULT_cutter", the originally loaded \code{DEFAULT_cutter}, which 
functions in this package use by default, cannot be removed by you.
So, whenever you want to use this default value, either leave 
\code{mycutter} unset or set \code{mycutter = chinese.misc::DEFAULT_cutter}.

By default, the argument \code{control} is set 
to \code{DEFAULT_control1}, which is created 
when the package is loaded. It allows words with length 1 to 25 to be placed in dtm or tdm.
Alternatively, \code{DEFAULT_control2}, which is also created 
when the package is loaded, sets 
word length to 2 to 25. When \code{control} is \code{NULL}, the function still points 
to the original value of \code{DEFAULT_control1}.

You may create, modify or remove an object called \code{DEFAULT_control1} 
in your own workspace. However, the original one loaded with the package can 
be neither removed nor modified. 
So, whenever you want to use the original value, just do not set
\code{control}, or set it to \code{control = chinese.misc::DEFAULT_control1}.
The same applies to \code{DEFAULT_control2}.

Whatever control list is assigned to \code{control}, the function makes sure
that an already segmented Chinese text is never re-segmented.
}
\examples{
require(tm)
x <- c(
  "Hello, what do you want to drink?", 
  "drink a bottle of milk", 
  "drink a cup of coffee", 
  "drink some water")
# The simplest argument setting
dtm <- corp_or_dtm(x, from = "v", type = "dtm")
# Modify argument control to see what happens
dtm <- corp_or_dtm(x, from = "v", type = "dtm", control = list(wordLengths = c(3, 20)))
# Remove the specified stop words
dtm <- corp_or_dtm(x, from = "v", type = "dtm", stop_word = c("you", "to", "a", "of"))
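# The lines below are an illustrative sketch, not part of the original examples.
# They assume the arguments behave as described in Arguments and Details;
# the object names (tdm, dtm2, dtm3, dtm4, mycut) are made up for illustration.
# Ask for a term document matrix instead of a document term matrix
tdm <- corp_or_dtm(x, from = "v", type = "tdm")
# Use DEFAULT_control2, which Details describes as keeping words of length 2 to 25
dtm2 <- corp_or_dtm(x, from = "v", type = "dtm", control = DEFAULT_control2)
# Remove terms that match a regular expression; "^dr" is only an example pattern
dtm3 <- corp_or_dtm(x, from = "v", type = "dtm", stop_pattern = "^dr")
# Supply your own cutter; worker(write = FALSE) mirrors how DEFAULT_cutter is described
mycut <- jiebaR::worker(write = FALSE)
dtm4 <- corp_or_dtm(x, from = "v", type = "dtm", mycutter = mycut)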
}
