\name{ngram_frequencies}
\alias{ngram_frequencies}
\title{Frequency Analysis of Cuneiform Sign Combinations (N-grams)}
\description{
Analyzes a Sumerian text for frequently occurring cuneiform sign combinations
(n-grams). The input can be either cuneiform text or transliterated text
(which is automatically converted to cuneiform via \code{\link{as.cuneiform}}).
The analysis starts with the longest combinations and works down to single
signs, masking already-counted occurrences to avoid reporting subsequences
that are only frequent because they are part of a longer frequent combination.
N-grams are searched within lines only (not across line boundaries).
}
\usage{
ngram_frequencies(x, min_freq = c(6, 4, 2), mapping = NULL)
}
\arguments{
  \item{x}{Character vector whose elements are the lines of a Sumerian text.
    The input can be either cuneiform characters or transliterated text. If no
    cuneiform characters (U+12000 to U+1254F) are detected, the input is
    automatically converted using \code{\link{as.cuneiform}}.
    Lines starting with \code{#} are treated as comments and ignored.
    Optional line numbers at the beginning of a line (e.g., \code{"42)\t"})
    are automatically removed. Spaces are removed before tokenization.}

  \item{min_freq}{Integer vector specifying minimum frequencies (default:
    \code{c(6, 4, 2)}). The i-th value specifies the minimum frequency for
    combinations of length i. For lengths beyond the vector's length, the last
    value is used.

    The default \code{c(6, 4, 2)} means: single signs must occur at least 6
    times, pairs at least 4 times, and all longer combinations at least 2
    times.}

  \item{mapping}{A data frame containing the sign mapping table with columns
    \code{syllables}, \code{name}, and \code{cuneiform}. If \code{NULL} (the
    default), the package's internal mapping file \file{etcsl_mapping.txt} is
    loaded.}
}
\details{
A \dQuote{sign} is defined as either a single cuneiform Unicode character
(U+12000 to U+1254F) or a character sequence enclosed in mathematical
angle brackets (U+27E8 ... U+27E9), which is treated as a single token.
All other characters (spaces, X, numbers, punctuation, etc.) are skipped
during tokenization.
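
The tokenization rule above can be illustrated with a minimal sketch. The
helper name \code{tokenize_line} is hypothetical and not part of the package,
and the sketch assumes that a bracketed token keeps its enclosing brackets and
that an opening bracket without a closing partner is skipped like any other
character; the package's internal implementation may differ.

\preformatted{
# Hypothetical helper, for illustration only (not the package's code).
tokenize_line <- function(line) {
  cp <- utf8ToInt(line)          # Unicode code points of the line
  tokens <- character(0)
  i <- 1L
  while (i <= length(cp)) {
    if (cp[i] == 0x27E8) {
      # Opening angle bracket: look for the next closing bracket (U+27E9)
      closing <- which(cp == 0x27E9 & seq_along(cp) > i)
      if (length(closing) > 0L) {
        # Assumption: the whole bracketed sequence forms one token
        tokens <- c(tokens, intToUtf8(cp[i:closing[1L]]))
        i <- closing[1L] + 1L
        next
      }
    } else if (cp[i] >= 0x12000 && cp[i] <= 0x1254F) {
      # A single cuneiform character is one token
      tokens <- c(tokens, intToUtf8(cp[i]))
    }
    # Everything else (spaces, digits, punctuation, ...) is skipped
    i <- i + 1L
  }
  tokens
}
}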

The maximum n-gram length is automatically determined as the length of the
longest tokenized line in the input.

The analysis proceeds from the longest combinations down to single signs.
When a combination is identified as frequent (i.e., meets the minimum
frequency threshold), all occurrences except the first are masked before
continuing with shorter combinations. This prevents subsequences from being
reported as frequent when their frequency is solely due to a longer frequent
combination.
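
The longest-first procedure with masking can be sketched as follows. This is
an illustrative approximation under stated assumptions, not the package's
actual code: the function name \code{count_masked_ngrams} is hypothetical, the
input is assumed to be a list of already tokenized lines, masked tokens are
represented by \code{NA}, and masking is applied only after all combinations
of a given length have been counted.

\preformatted{
# Hypothetical sketch, for illustration only (not the package's code).
# lines: list of character vectors, one vector of tokens per line.
count_masked_ngrams <- function(lines, min_freq = c(6, 4, 2)) {
  max_n <- max(lengths(lines), 0L)   # longest tokenized line
  result <- data.frame(frequency = integer(0), length = integer(0),
                       combination = character(0))
  for (n in rev(seq_len(max_n))) {
    # Reuse the last threshold for lengths beyond the vector
    threshold <- min_freq[min(n, length(min_freq))]
    keys <- character(0); line_id <- integer(0); start <- integer(0)
    for (k in seq_along(lines)) {
      tok <- lines[[k]]
      if (length(tok) < n) next
      for (s in seq_len(length(tok) - n + 1L)) {
        gram <- tok[s:(s + n - 1L)]
        if (anyNA(gram)) next        # skip windows touching masked tokens
        keys <- c(keys, paste(gram, collapse = ""))
        line_id <- c(line_id, k)
        start <- c(start, s)
      }
    }
    counts <- table(keys)
    for (key in names(counts)[counts >= threshold]) {
      result <- rbind(result,
                      data.frame(frequency = as.integer(counts[key]),
                                 length = n, combination = key))
      # Mask every occurrence except the first
      for (h in which(keys == key)[-1L]) {
        idx <- start[h]:(start[h] + n - 1L)
        lines[[line_id[h]]][idx] <- NA
      }
    }
  }
  # Descending length, then descending frequency
  result[order(-result$length, -result$frequency), ]
}
}

In this sketch, a pair that occurs only inside repeated occurrences of a
longer frequent combination is counted once (from the unmasked first
occurrence) and therefore usually falls below its threshold.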
}
\value{
A data frame with three columns, sorted by descending length, then descending
frequency:
  \item{frequency}{Integer. The number of occurrences of the combination.}
  \item{length}{Integer. The number of signs in the combination.}
  \item{combination}{Character. The cuneiform sign combination
    (e.g., \code{"\U0001202D\U00012097\U000120A0"}).}
}
\examples{
# Read the text "Enki and the World Order"

path <- system.file("extdata", "enki_and_the_world_order.txt", package = "sumer")
text <- readLines(path, encoding = "UTF-8")

cat(text[1:10], sep = "\n")

# Find combinations that appear at least 6 times in the text
freq <- ngram_frequencies(text, min_freq = 6)

# Show the first 10 rows (head() also works if fewer rows are returned)
head(freq, 10)
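
# Illustrative sketch (not the package's internal code): how a min_freq
# vector is assumed to translate into a threshold for each n-gram length,
# following the documented rule that the last value is reused for lengths
# beyond the vector's length. The variable names here are hypothetical.
mf <- c(6, 4, 2)
ngram_lengths <- 1:5
thresholds <- mf[pmin(ngram_lengths, length(mf))]
data.frame(length = ngram_lengths, min_frequency = thresholds)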
}
\seealso{
\code{\link{as.sign_name}} for converting cuneiform to sign names,
\code{\link{as.cuneiform}} for converting transliterations to cuneiform,
\code{\link{split_sumerian}} for tokenizing transliterated text.
}
\keyword{utilities}
\keyword{univar}
