public final class DictionaryLookup extends Object implements IStemmer, Iterable<WordData>
Important: finite state automata in Jan Daciuk's implementation use bytes, not Unicode characters. Objects of this class must therefore always be constructed with an encoding that converts Java strings to byte arrays and back. UTF-8 is a safe choice, as it should not conflict with any control sequences or separator characters.
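The byte-versus-character distinction above can be seen with the JDK alone. The following is a minimal sketch, not part of this library: the class `EncodingRoundTrip`, its helper methods, and the `+` separator are assumptions chosen for illustration. It shows that a word's byte length can differ from its character length, that the UTF-8 round trip is lossless, and that an ASCII separator byte does not appear inside the encoded non-ASCII characters.

```java
import java.nio.charset.StandardCharsets;

public class EncodingRoundTrip {
    // Assumed separator, for illustration only; a real dictionary defines its own.
    static final byte SEPARATOR = (byte) '+';

    /** Encodes a word into UTF-8 bytes, the form a byte-based automaton operates on. */
    public static byte[] toBytes(CharSequence word) {
        return word.toString().getBytes(StandardCharsets.UTF_8);
    }

    /** Decodes UTF-8 bytes back into a Java string. */
    public static String fromBytes(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Returns true if the separator byte occurs anywhere in the encoded form. */
    public static boolean containsSeparator(byte[] bytes) {
        for (byte b : bytes) {
            if (b == SEPARATOR) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Four characters, three of them outside ASCII (written as escapes to
        // avoid depending on the source file encoding).
        String word = "\u017c\u00f3\u0142w";
        byte[] encoded = toBytes(word);

        System.out.println(word.length() + " chars, " + encoded.length + " bytes"); // 4 chars, 7 bytes
        System.out.println(fromBytes(encoded).equals(word));                         // true
        System.out.println(containsSeparator(encoded));                              // false
    }
}
```

UTF-8 encodes every non-ASCII character using only bytes with the high bit set, so a single-byte ASCII separator cannot occur in the middle of an encoded character.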
Constructor and Description |
---|
DictionaryLookup(Dictionary dictionary) Creates a new object of this class using the given FSA for word lookups and encoding for converting characters to bytes. |
Modifier and Type | Method and Description |
---|---|
static ByteBuffer | decodeBaseForm(ByteBuffer output, byte[] encoded, int encodedLen, ByteBuffer inflectedForm, DictionaryMetadata metadata) Decode the base form of an inflected word and save its decoded form into a byte buffer. |
Dictionary | getDictionary() |
char | getSeparatorChar() |
Iterator<WordData> | iterator() Return an iterator over all WordData entries available in the embedded Dictionary. |
List<WordData> | lookup(CharSequence word) Searches the automaton for a symbol sequence equal to word, followed by a separator. |
Methods inherited from class Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface Iterable: forEach, spliterator
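The detail entry for decodeBaseForm states that the output buffer is not flipped upon return and that a new buffer may be allocated when the supplied one is too small. A minimal JDK-only sketch of what that means for a caller follows. The class `UnflippedBufferDemo` and its method `writeUnflipped` are hypothetical stand-ins written for this illustration; they are not this library's implementation.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class UnflippedBufferDemo {
    /**
     * Hypothetical stand-in for a method that writes into a caller-supplied
     * buffer, allocates a larger one if needed, and returns it WITHOUT flipping.
     */
    public static ByteBuffer writeUnflipped(ByteBuffer output, byte[] data) {
        if (output.capacity() < data.length) {
            // The supplied buffer is too small: allocate a replacement.
            output = ByteBuffer.allocate(data.length);
        }
        output.clear();
        output.put(data);
        // Position is now data.length; limit is still the capacity.
        return output;
    }

    /** Flips the buffer so it can be read, then decodes its content as UTF-8. */
    public static String readAsString(ByteBuffer buffer) {
        buffer.flip(); // limit = bytes written, position = 0
        return StandardCharsets.UTF_8.decode(buffer).toString();
    }

    public static void main(String[] args) {
        ByteBuffer small = ByteBuffer.allocate(2);
        byte[] stem = "stem".getBytes(StandardCharsets.UTF_8);

        // Always continue with the returned buffer; it may not be the one passed in.
        ByteBuffer result = writeUnflipped(small, stem);

        System.out.println(result == small);      // false
        System.out.println(result.position());    // 4
        System.out.println(readAsString(result)); // stem
    }
}
```

Reading the returned buffer without the flip would start at the write position and yield none of the decoded bytes, which is why the caller flips first and always uses the returned reference rather than the buffer it passed in.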
public DictionaryLookup(Dictionary dictionary) throws IllegalArgumentException
Creates a new object of this class using the given FSA for word lookups and encoding for converting characters to bytes.
Throws:
IllegalArgumentException - if the FSA's root node cannot be acquired (dictionary is empty).

public List<WordData> lookup(CharSequence word)
Searches the automaton for a symbol sequence equal to word, followed by a separator. The result is a stem (decompressed according to the dictionary's specification) and optional tag data.

public static ByteBuffer decodeBaseForm(ByteBuffer output, byte[] encoded, int encodedLen, ByteBuffer inflectedForm, DictionaryMetadata metadata)
Decode the base form of an inflected word and save its decoded form into a byte buffer.
Parameters:
output - The byte buffer to save the result to. A new buffer may be allocated if the capacity of output is not large enough to store the result. The buffer is not flipped upon return.
inflectedForm - Inflected form's bytes (decoded properly).
encoded - Bytes of the encoded base form, starting at index 0.
encodedLen - Length of the encoded base form.
Returns:
output, or a new buffer whose capacity is large enough to store the decoded data.

public Iterator<WordData> iterator()
Return an iterator over all WordData entries available in the embedded Dictionary.

public Dictionary getDictionary()
Returns:
The Dictionary used by this object.

public char getSeparatorChar()
Returns:
The separator character, derived from DictionaryMetadata.separator; it may not be valid in the target encoding (although this is highly unlikely).

Copyright © 2016. All rights reserved.