eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Dagstuhl Seminar Proceedings
1862-4405
2006-11-07
6201
1
15
10.4230/DagSemProc.06201.1
article
06201 Abstracts Collection – Combinatorial and Algorithmic Foundations of Pattern and Association Discovery
Ahlswede, Rudolf
Apostolico, Alberto
Levenshtein, Vladimir I.
From 15.05.06 to 20.05.06, the Dagstuhl Seminar 06201 ``Combinatorial and Algorithmic Foundations of Pattern and Association Discovery'' was held
in the International Conference and Research Center (IBFI), Schloss Dagstuhl.
During the seminar, several participants presented their current
research, and ongoing work and open problems were discussed. Abstracts of
the presentations given during the seminar as well as abstracts of
seminar results and ideas are put together in this paper. The first section
describes the seminar topics and goals in general.
Links to extended abstracts or full papers are provided, if available.
https://drops.dagstuhl.de/storage/16dagstuhl-seminar-proceedings/dsp-vol06201/DagSemProc.06201.1/DagSemProc.06201.1.pdf
Data compression
pattern matching
pattern discovery
search
sorting
molecular biology
reconstruction
genome rearrangements
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Dagstuhl Seminar Proceedings
1862-4405
2006-11-07
6201
1
2
10.4230/DagSemProc.06201.2
article
06201 Executive Summary – Combinatorial and Algorithmic Foundations of Pattern and Association Discovery
Ahlswede, Rudolf
Apostolico, Alberto
Levenshtein, Vladimir I.
The goals of this seminar have been (1) to identify and match
recently developed methods to specific tasks and data sets in a core of
application areas; next, based on feedback from the specific applied domain,
(2) to fine tune and personalize those applications, and finally (3) to
identify and tackle novel combinatorial and algorithmic problems, in some
cases all the way to the development of novel software tools.
https://drops.dagstuhl.de/storage/16dagstuhl-seminar-proceedings/dsp-vol06201/DagSemProc.06201.2/DagSemProc.06201.2.pdf
Data compression
pattern matching
pattern discovery
search
sorting
molecular biology
reconstruction
genome rearrangements
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Dagstuhl Seminar Proceedings
1862-4405
2006-11-07
6201
1
12
10.4230/DagSemProc.06201.3
article
Local Minimax Learning of Approximately Polynomial Functions
Jones, Lee
Rybnikov, Konstantin
Suppose we have a number of noisy measurements of an unknown real-valued function $f$ near
a point of interest $\mathbf{x}_0 \in \mathbb{R}^d$. Suppose also that nothing can be assumed
about the noise distribution, except for zero mean and bounded covariance matrix. We want
to estimate $f$ at $\mathbf{x}=\mathbf{x}_0$ using a general linear parametric family
$f(\mathbf{x};\mathbf{a}) = a_0 h_0 (\mathbf{x}) + \dots + a_q h_q (\mathbf{x})$, where
$\mathbf{a} \in \mathbb{R}^q$ and the $h_i$'s are bounded functions on a neighborhood $B$ of
$\mathbf{x}_0$ which contains all points of measurement. Typically, $B$ is a Euclidean ball
or cube in $\mathbb{R}^d$ (more generally, a ball in an $l_p$-norm). In the case when the
$h_i$'s are polynomial functions in $x_1,\ldots,x_d$ the model is called
locally-polynomial. In particular, if the $h_i$'s form a basis of the linear space of
polynomials of degree at most two, the model is called locally-quadratic (if the degree is
at most three, the model is locally-cubic, etc.). Often, there is information, which is
called context, about the function $f$ (restricted to $B$ ) available, such as that it
takes values in a known interval, or that it satisfies a Lipschitz condition. The theory of
local minimax estimation with context for locally-polynomial models and approximately
locally polynomial models has been recently initiated by Jones. In the case of local
linearity and a bound on the change of $f$ on $B$, where $B$ is a ball, the solution for
squared error loss is in the form of ridge regression, where the ridge parameter is
identified; hence, minimax justification for ridge regression is given together with
explicit best error bounds. The analysis of polynomial models of degree above 1 leads to
interesting and difficult questions in real algebraic geometry and non-linear optimization.
We show that in the case when $f$ is a probability function, the optimal (in the minimax
sense) estimator is effectively computable (with any given precision), thanks to Tarski's
elimination principle.
https://drops.dagstuhl.de/storage/16dagstuhl-seminar-proceedings/dsp-vol06201/DagSemProc.06201.3/DagSemProc.06201.3.pdf
Local learning
statistical learning
estimator
minimax
convex optimization
quantifier elimination
semialgebraic
ridge regression
polynomial
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Dagstuhl Seminar Proceedings
1862-4405
2006-11-07
6201
1
4
10.4230/DagSemProc.06201.4
article
Non--binary error correcting codes with noiseless feedback, localized errors, or both
Ahlswede, Rudolf
Deppe, Christian
Lebedev, Vladimir
We investigate non-binary error correcting codes with noiseless feedback, localized errors, or both. It turns out that the Hamming bound is a central concept. For block codes with feedback we present a coding scheme based on an idea of erasures, which we call the {\bf rubber method}. It gives an optimal rate for large error correcting fractions $\tau$ ($>{1\over q}$) and infinitely many points on the Hamming bound for small $\tau$.
We also consider variable length codes in which all lengths are bounded from above by $n$ and the end of a word carries the symbol $\Box$ and is thus recognizable by the decoder. For both the $\Box$-model with feedback and the $\Box$-model with localized errors, the Hamming bound is the exact capacity curve for $\tau <1/2$. Somewhat surprisingly, whereas with feedback the capacity curve coincides with the Hamming bound also for $1/2\leq \tau \leq 1$, in this range the capacity curve for localized errors equals 0.
We also give constructions for the models with both feedback and localized errors.
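For orientation, the asymptotic Hamming (sphere-packing) bound can be evaluated numerically. The sketch below assumes the standard form $R \leq 1 - H_q(\tau)$, where $H_q$ is the $q$-ary entropy function and $\tau$ is the fraction of errors; the function names are illustrative and not taken from the paper, and the normalization used by the authors may differ.

```python
from math import log


def q_ary_entropy(q, x):
    """q-ary entropy H_q(x) for 0 <= x <= 1 (logarithms to base q)."""
    if x == 0:
        return 0.0
    if x == 1:
        return log(q - 1, q)
    return x * log(q - 1, q) - x * log(x, q) - (1 - x) * log(1 - x, q)


def hamming_bound_rate(q, tau):
    """Assumed asymptotic Hamming bound 1 - H_q(tau) on the code rate.

    Intended for error fractions 0 <= tau <= (q - 1) / q.
    """
    return 1.0 - q_ary_entropy(q, tau)


if __name__ == "__main__":
    # Tabulate the bound for a ternary alphabet at a few error fractions.
    for tau in (0.0, 0.1, 0.2, 1 / 3, 0.5):
        print(f"q=3 tau={tau:.3f} rate<={hamming_bound_rate(3, tau):.4f}")
```

Under this assumed form the bound equals 1 at $\tau = 0$ and drops to 0 at $\tau = (q-1)/q$.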
https://drops.dagstuhl.de/storage/16dagstuhl-seminar-proceedings/dsp-vol06201/DagSemProc.06201.4/DagSemProc.06201.4.pdf
Error-correcting codes
localized errors
feedback
variable length codes
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Dagstuhl Seminar Proceedings
1862-4405
2006-11-07
6201
1
9
10.4230/DagSemProc.06201.5
article
On the Monotonicity of the String Correction Factor for Words with Mismatches
Apostolico, Alberto
Pizzi, Cinzia
The string correction factor is the term by
which the probability of a word $w$ needs to be multiplied in order
to account for character changes or ``errors'' occurring in at most
$k$ arbitrary positions in that word. The behavior of this factor,
as a function of $k$ and of the word length, has implications on the
number of candidates that need to be considered and weighted when
looking for subwords of a sequence that present unusually recurrent
replicas within some bounded number of mismatches. Specifically, it
is seen that over intervals of mono- or bi-tonicity for the
correction factor, only some of the candidates need be considered.
This mitigates the computation and leads to tables of
over-represented words that are more compact to represent and
inspect. In recent work, expectation and score monotonicity have been
established for a number of cases of interest, under {\em i.i.d.}
probabilistic assumptions. The present paper reviews the cases of
bi-tonic behavior for the correction factor, concentrating on the
instance in which the question is still open.
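One way to make the definition concrete is sketched below. It assumes an i.i.d. symbol model and reads the correction factor as the probability of all words within Hamming distance $k$ of $w$, divided by the probability of $w$ itself; this reading and the helper names are assumptions for illustration and need not match the exact formulation in the paper. Under this reading the factor is a sum of elementary symmetric polynomials in the ratios $(1-p(w_i))/p(w_i)$, which a short dynamic program computes.

```python
from itertools import product
from math import prod


def correction_factor(word, probs, k):
    """Factor f with P(some word within k mismatches of `word`) = f * P(word)."""
    ratios = [(1.0 - probs[c]) / probs[c] for c in word]
    # elem[j] accumulates the elementary symmetric polynomial of degree j
    # in the ratios, i.e. the contribution of exactly j mismatched positions.
    elem = [1.0] + [0.0] * k
    for r in ratios:
        for j in range(k, 0, -1):
            elem[j] += elem[j - 1] * r
    return sum(elem)


def brute_force_factor(word, probs, k):
    """Same quantity by enumerating every word of the same length."""
    p_word = prod(probs[c] for c in word)
    total = 0.0
    for cand in product(probs, repeat=len(word)):
        if sum(a != b for a, b in zip(cand, word)) <= k:
            total += prod(probs[c] for c in cand)
    return total / p_word


if __name__ == "__main__":
    # Hypothetical symbol probabilities, chosen only for the demonstration.
    probs = {"A": 0.5, "C": 0.2, "G": 0.2, "T": 0.1}
    for k in range(5):
        print(k, correction_factor("ACGT", probs, k))
```

The dynamic program takes time proportional to the word length times $k$, whereas the brute-force check is exponential in the word length and serves only as a cross-check on short words.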
https://drops.dagstuhl.de/storage/16dagstuhl-seminar-proceedings/dsp-vol06201/DagSemProc.06201.5/DagSemProc.06201.5.pdf
Pattern discovery
Motif
Over-represented word
Monotone score
Correction Factor
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Dagstuhl Seminar Proceedings
1862-4405
2006-11-07
6201
1
12
10.4230/DagSemProc.06201.6
article
Sequence prediction for non-stationary processes
Ryabko, Daniil
Hutter, Marcus
We address the problem of sequence prediction for nonstationary stochastic processes. In particular, given two measures on the set of one-way infinite sequences over a finite alphabet, consider the question whether one of the measures predicts the other. We find some conditions on local absolute continuity under which prediction is possible.
https://drops.dagstuhl.de/storage/16dagstuhl-seminar-proceedings/dsp-vol06201/DagSemProc.06201.6/DagSemProc.06201.6.pdf
Sequence prediction
probability forecasting
local absolute continuity
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Dagstuhl Seminar Proceedings
1862-4405
2006-11-07
6201
1
10
10.4230/DagSemProc.06201.7
article
Solving Classical String Problems on Compressed Texts
Lifshits, Yury
How can string problems be solved if, instead of the input
string, we are given only a program generating it? Is it possible to
solve problems faster than just "generate text + apply classical
algorithm"?
In this paper we consider strings generated by straight-line programs
(SLPs). These are programs using only the assignment operator. We show
new algorithms for equivalence, pattern matching, finding periods and
covers, and computing the fingerprint table on SLP-generated strings.
On the other hand, computing the Hamming distance is NP-hard.
The main corollary is an $O(n^2 m)$ algorithm for pattern matching in
LZ-compressed texts.
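A minimal sketch of what working on an SLP without expanding it can look like is given below. It does not reproduce any of the algorithms from the paper; it only illustrates the representation, computing the length of the generated string and reading a single character by descending through the rules. The rule encoding (a dict listed in dependency order, with a string for a terminal and a pair of names for a concatenation) is an assumption made for this example.

```python
def slp_lengths(rules):
    """Length of the string generated by every rule, without expanding it."""
    lengths = {}
    for name, body in rules.items():  # rules must be listed in dependency order
        if isinstance(body, str):
            lengths[name] = len(body)
        else:
            left, right = body
            lengths[name] = lengths[left] + lengths[right]
    return lengths


def char_at(rules, lengths, name, i):
    """Character at position i of the string generated by rule `name`."""
    while True:
        body = rules[name]
        if isinstance(body, str):
            return body[i]
        left, right = body
        if i < lengths[left]:
            name = left
        else:
            i -= lengths[left]
            name = right


def expand(rules, name):
    """Full expansion, exponential in the worst case; used only for checking."""
    body = rules[name]
    if isinstance(body, str):
        return body
    return expand(rules, body[0]) + expand(rules, body[1])


if __name__ == "__main__":
    # Fibonacci words: each rule concatenates the two previous ones.
    rules = {"X1": "b", "X2": "a", "X3": ("X2", "X1"),
             "X4": ("X3", "X2"), "X5": ("X4", "X3"), "X6": ("X5", "X4")}
    lengths = slp_lengths(rules)
    print(lengths["X6"], "".join(char_at(rules, lengths, "X6", i)
                                 for i in range(lengths["X6"])))
```

With Fibonacci-style rules the generated string grows exponentially in the number of rules, which is why algorithms that avoid expansion are of interest.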
https://drops.dagstuhl.de/storage/16dagstuhl-seminar-proceedings/dsp-vol06201/DagSemProc.06201.7/DagSemProc.06201.7.pdf
Pattern matching
Compressed text
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Dagstuhl Seminar Proceedings
1862-4405
2006-11-07
6201
1
4
10.4230/DagSemProc.06201.8
article
Some Results for Identification for Sources and its Extension to Liar Models
Varbanov, Zlatko
Let (${\cal U}, P$) be a source, where ${\cal U} =
\{1,2,\dots,N\}$, $P = \{P_1, P_2, \dots, P_N\}$, and let ${\cal C}
= \{c_1,c_2,\dots,c_N\}$ be a binary prefix code (PC) for this
source with $||c_u||$ as length of $c_u$. Introduce the random
variable $U$ with Prob($U=u$) = $P_u$ for $u = 1,2,\dots,N$ and
the random variable $C$ with $C = c_u =
(c_{u1},c_{u2},\dots,c_{u||c_u||})$ if $U=u$. We use the PC for
noiseless identification, that is, user $u$ wants to know whether
the source output equals $u$, that is, whether $C$ equals $c_u$ or
not. The user iteratively checks whether $C$ coincides with $c_u$
in the first, second, etc. letter and stops when the first
different letter occurs or when $C = c_u$. What is the expected
number $L_{\cal C}(P,u)$ of checkings?
In order to calculate this quantity we introduce for the binary
tree $T_{\cal C}$, whose leaves are the codewords
$c_1,c_2,\dots,c_N$, the sets of leaves ${\cal C}_{ik}$ $(1 \leq i
\leq N; 0 \leq k)$, where ${\cal C}_{ik} = \{c \in {\cal C}: c$
coincides with $c_i$ exactly until the $k$'th letter of $c_i\}$.
If $C$ takes a value in ${\cal C}_{uk}, 0 \leq k \leq ||c_u||-1$,
the answers are $k$ times "Yes" and 1 time "No". For $C = c_u$ the
answers are $||c_u||$ times "Yes". Hence
$$
L_{\cal C}(P,u) = \sum_{k=0}^{||c_u||-1}P(C \in {\cal
C}_{uk})(k+1) + ||c_u||P_u.
$$
For code ${\cal C}$,~ $L_{\cal C}(P) = \max_{1 \leq u \leq N} L_{\cal C}(P,u)$
is the expected number of checkings in the worst
case and $L(P) = \min_{\cal C} L_{\cal C}(P)$ is this number for the best
code ${\cal C}$.
Let $P = P^N = \{\frac{1}{N}, \dots, \frac{1}{N}\}$. We construct
a prefix code ${\cal C}$ in the following way. In each node
(starting at the root) we split the number of remaining codewords
in proportion as close as possible to $(\frac{1}{2},\frac{1}{2})$.
It is known that
$$
\lim_{N \rightarrow \infty} L_{\cal C}(P^N) = 2
$$
(Ahlswede, Balkenhol, Kleinewächter, 2003)
We know that $L(P) \leq 3$ for all $P$ (Ahlswede, Balkenhol, Kleinewächter, 2003). Also, the problem to estimate a universal
constant $A = \sup L(P)$ for general $P = (P_1,\dots, P_N)$ was stated (Ahlswede, 2004). We
compute this constant for the uniform distribution and this code
${\cal C}$.
$$
\sup_N L_{\cal C}(P^N) = 2+\frac{\log_2(N-1)-2}{N}
$$
Also, we consider the average number of checkings, if code ${\cal
C}$ is used: $L_{\cal C}(P,P) = \sum_{u \in {\cal U}} P_u L_{\cal C}(P,u)$.
We calculate the exact values of $L_{\cal
C}(P^N)$ and $L_{\cal C}(P^N,P^N)$ for some $N$.
Another problem is the extension of identification for sources to
liar models. We obtain an upper bound for the expected number of
checkings $L_{\cal C}(P^N;e)$, where $e$ is the maximum number of
lies.
$$
L_{\cal C}(P^N;e) \leq M_{\cal C}(P^N;e) = (e+1)L_{\cal C}(P^N) +
e; ~~ \lim_{N \rightarrow \infty} M_{\cal C}(P^N;e) = 3e+2
$$
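The quantities above can be explored numerically for small $N$. The sketch below builds a prefix code by splitting the remaining codewords as evenly as possible at each node (the larger half going to the left branch is an arbitrary choice made here) and evaluates $L_{\cal C}(P,u)$ and $L_{\cal C}(P)$ by simulating the letter-by-letter checking directly. The function names are illustrative, and the liar model is not covered.

```python
from fractions import Fraction


def balanced_code(n, prefix=""):
    """Prefix code with n codewords, split as evenly as possible at each node."""
    if n == 1:
        return [prefix]
    left = (n + 1) // 2
    return balanced_code(left, prefix + "0") + balanced_code(n - left, prefix + "1")


def checkings(target, output):
    """Number of letters user `target` compares before stopping on `output`."""
    if output == target:
        return len(target)
    k = 0
    # Distinct codewords of a prefix code differ before either one ends.
    while target[k] == output[k]:
        k += 1
    return k + 1


def expected_checkings(code, probs, u):
    """L_C(P, u): expected number of checkings for user u (0-based index)."""
    return sum(p * checkings(code[u], c) for p, c in zip(probs, code))


def worst_case(code, probs):
    """L_C(P): maximum of L_C(P, u) over all users u."""
    return max(expected_checkings(code, probs, u) for u in range(len(code)))


if __name__ == "__main__":
    for n in (2, 3, 4, 8, 16, 64):
        uniform = [Fraction(1, n)] * n
        print(n, worst_case(balanced_code(n), uniform))
```

For $N = 2^m$ this computation gives $2 - 2^{1-m}$, which is in line with the limit value 2 stated above.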
https://drops.dagstuhl.de/storage/16dagstuhl-seminar-proceedings/dsp-vol06201/DagSemProc.06201.8/DagSemProc.06201.8.pdf
Identification for sources
lies
prefix code
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Dagstuhl Seminar Proceedings
1862-4405
2006-11-07
6201
1
2
10.4230/DagSemProc.06201.9
article
Suballowable sequences of permutations
Asinowski, Andrei
We discuss a notion of "allowable sequence of permutations" and show a few combinatorial results and geometric applications.
https://drops.dagstuhl.de/storage/16dagstuhl-seminar-proceedings/dsp-vol06201/DagSemProc.06201.9/DagSemProc.06201.9.pdf
Sequences of permutations
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Dagstuhl Seminar Proceedings
1862-4405
2006-11-07
6201
1
8
10.4230/DagSemProc.06201.10
article
Subwords in reverse-complement order
Erdös, Péter L.
Ligeti, Péter
Sziklai, Péter
Torney, David C.
We examine finite words over an alphabet $\Gamma=\{a,\bar{a};b,\bar{b}\}$ of pairs of letters, where each word $w_1w_2...w_t$ is identical with its {\it reverse complement} $\bar{w_t}...\bar{w_2}\bar{w_1}$ (where $\bar{\bar{a}}=a,\bar{\bar{b}}=b$). We seek the smallest $k$ such that every word of length $n,$ composed from $\Gamma$, is uniquely determined by the set of its subwords of length up to $k$. Our almost sharp result ($k\sim 2n/3$) is an analogue of a classical result for ``normal'' words.
This classical problem was originally posed by M.P. Sch\"utzenberger and I. Simon, and attracted considerable interest from several researchers, foremost Vladimir Levenshtein.
Our problem has its roots in bioinformatics.
https://drops.dagstuhl.de/storage/16dagstuhl-seminar-proceedings/dsp-vol06201/DagSemProc.06201.10/DagSemProc.06201.10.pdf
Reverse complement order
Reconstruction of words
Microarray experiments
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Dagstuhl Seminar Proceedings
1862-4405
2006-11-07
6201
1
20
10.4230/DagSemProc.06201.11
article
Vertex reconstruction in Cayley graphs
Konstantinova, Elena
In this report we collect recent results on vertex
reconstruction in Cayley graphs $Cay(G,S)$. The problem is stated as
the problem of reconstructing a vertex from the minimum number of
its $r$-neighbors that are vertices at distance at most $r$ from the
unknown vertex. The combinatorial properties of Cayley graphs on the
symmetric group $S_n$ and the signed permutation group $B_n$ with
respect to this problem are presented. The sets of generators $S$
are specified by applications in coding theory, computer science,
molecular biology and physics.
https://drops.dagstuhl.de/storage/16dagstuhl-seminar-proceedings/dsp-vol06201/DagSemProc.06201.11/DagSemProc.06201.11.pdf
Reconstruction problems
Cayley graphs
the symmetric group
the signed permutation group
sorting by reversals
pancake problem