embfile.core.loaders

Classes

RandomAccessLoader(words, word2vec[, …])

A loader for files that can randomly access word vectors.

SequentialLoader(file, words[, missing_ok, …])

A loader that scans the file from beginning to end and yields a word vector pair whenever it meets a requested word.

VectorsLoader(words[, missing_ok])

(Abstract class) Iterator that, given some input words, looks for the corresponding vectors in the file and yields a word vector pair for each vector found; once the iteration stops, the attribute missing_words contains the set of words not found.

Word2Vector()

Maps a word to a vector

Reference

class embfile.core.loaders.VectorsLoader(words, missing_ok=True)[source]

Bases: abc.ABC, Iterator[WordVector]

(Abstract class) Iterator that, given some input words, looks for the corresponding vectors in the file and yields a word vector pair for each vector found; once the iteration stops, the attribute missing_words contains the set of words not found.

Subclasses can load the word vectors in any order.

Parameters
  • words (Iterable[str]) – the words to load

  • missing_ok (bool) – if False, a KeyError is raised when any input word is not in the file

abstractmethod missing_words

The words still to be found; once the iteration stops, this is the set of input words that are not in the file.

close()[source]

Closes any open resources (e.g. a reader).
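The contract described above (yield a pair for each word found, track what is still missing, optionally raise KeyError) can be illustrated with a small stand-alone sketch. It does not import embfile: DictLoaderSketch, its in-memory table, and the use of plain lists in place of ndarray are all hypothetical, and raising the KeyError at the end of the iteration is an assumption about timing, not something this page specifies.

```python
from typing import Dict, Iterable, Iterator, List, Set, Tuple

Vector = List[float]  # stand-in for numpy.ndarray in this sketch


class DictLoaderSketch(Iterator[Tuple[str, Vector]]):
    """Hypothetical loader over an in-memory table; not part of embfile."""

    def __init__(self, table: Dict[str, Vector], words: Iterable[str],
                 missing_ok: bool = True) -> None:
        self._table = table
        self._missing: Set[str] = set(words)
        self._missing_ok = missing_ok
        # Loaders may yield in any order; this one follows the table's order.
        self._candidates = iter(list(table))

    @property
    def missing_words(self) -> Set[str]:
        # Words not found so far; final once the iteration has stopped.
        return self._missing

    def __next__(self) -> Tuple[str, Vector]:
        for word in self._candidates:
            if word in self._missing:
                self._missing.remove(word)
                return word, self._table[word]
        # The "file" is exhausted: report missing words if the caller asked to.
        if self._missing and not self._missing_ok:
            raise KeyError(f"words not found: {sorted(self._missing)}")
        raise StopIteration

    def close(self) -> None:
        # Nothing to release for an in-memory table.
        pass
```

A caller would typically iterate the loader to completion, read missing_words, and then call close().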

class embfile.core.loaders.SequentialLoader(file, words, missing_ok=True, verbose=False)[source]

Bases: abc.ABC, Iterator[WordVector]

A loader that scans the file from beginning to end and yields a word vector pair whenever it meets a requested word. Used by txt and bin files. It cannot tell whether a word is in the file until it has read the entire file.

The progress bar shows the percentage of the file that has been examined, not the number of word vectors yielded, so the iteration may stop before the bar reaches 100% (this happens when all the input words are found before the end of the file).

Parameters
  • words (Iterable[str]) – the words to load

  • missing_ok (bool) – if False, a KeyError is raised when any input word is not in the file

missing_words

The words still to be found; once the iteration stops, this is the set of input words that are not in the file.

close()[source]

Closes any open resources (e.g. a reader).
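The scanning strategy can be sketched as a plain generator over the lines of a text file, assuming one "word v1 v2 ..." entry per line. The function scan_for_words and the caller-supplied remaining set are hypothetical names used only for illustration; this is not the embfile implementation.

```python
from typing import Iterable, Iterator, List, Set, Tuple


def scan_for_words(lines: Iterable[str],
                   remaining: Set[str]) -> Iterator[Tuple[str, List[float]]]:
    """Scan lines in order, yielding (word, vector) for each requested word.

    `remaining` is mutated in place: words are removed as they are found,
    so whatever is left after the scan is the set of missing words.
    """
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        word = parts[0]
        if word in remaining:
            remaining.discard(word)
            yield word, [float(x) for x in parts[1:]]
            if not remaining:
                return  # every requested word was found; stop reading early
```

Because the generator returns as soon as remaining is empty, a scan that finds everything early never touches the rest of the file, which is why a progress bar based on file position may stop short of 100%.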

class embfile.core.loaders.RandomAccessLoader(words, word2vec, word2index=None, missing_ok=True, verbose=False, close_hook=None)[source]

Bases: abc.ABC, Iterator[WordVector]

A loader for files that support random access to word vectors. If word2index is provided, the words are sorted by their position in the file and the corresponding vectors are loaded in that order; I observed that this significantly improves performance with VVMEmbFile, presumably due to buffering.

Parameters
  • words (Iterable[str]) – the words to load

  • word2vec (Word2Vector) – object that implements word2vec[word] and word in word2vec

  • word2index (Optional[Callable[[str], int]]) – function that returns the index (position) of a word inside the file; this enables an optimization for formats like VVM that store vectors sequentially in the same file.

  • missing_ok (bool) – if False, a KeyError is raised when any input word is not in the file

  • verbose (bool) –

  • close_hook (Optional[Callable]) – function to call when closing this loader

missing_words

The words still to be found; once the iteration stops, this is the set of input words that are not in the file.

close()[source]

Closes any open resources (e.g. a reader).
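The position-sorting optimization can be sketched in a few lines. This is a minimal illustration under assumptions, not the embfile code: load_in_file_order is a hypothetical helper, and any object supporting word2vec[word] and word in word2vec (here a plain dict) stands in for a Word2Vector.

```python
from typing import Callable, Iterable, List, Optional, Set, Tuple


def load_in_file_order(
    words: Iterable[str],
    word2vec,  # anything supporting word2vec[word] and `word in word2vec`
    word2index: Optional[Callable[[str], int]] = None,
) -> Tuple[List[Tuple[str, object]], Set[str]]:
    """Return (pairs, missing) for the requested words."""
    requested = list(dict.fromkeys(words))  # de-duplicate, keep input order
    present = [w for w in requested if w in word2vec]
    missing = {w for w in requested if w not in word2vec}
    if word2index is not None:
        # Visit vectors in increasing file position, so that reads move
        # forward through the file instead of jumping around.
        present.sort(key=word2index)
    return [(w, word2vec[w]) for w in present], missing
```

Without word2index the words are simply looked up in input order; with it, lookups follow the on-disk layout.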

class embfile.core.loaders.Word2Vector[source]

Bases: abc.ABC

Maps a word to a vector

abstractmethod __getitem__(word)[source]
Return type

ndarray

abstractmethod __contains__(word)[source]
Return type

bool
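As an illustration of the two-method interface, the sketch below defines a stand-alone abstract base with the same two abstract methods and a dict-backed subclass. Word2VectorSketch and DictWord2Vector are hypothetical names introduced here, and plain lists stand in for ndarray; a real subclass would derive from embfile.core.loaders.Word2Vector instead.

```python
import abc
from typing import Dict, List


class Word2VectorSketch(abc.ABC):
    """Stand-in mirroring the two abstract methods documented above."""

    @abc.abstractmethod
    def __getitem__(self, word: str) -> List[float]:
        ...

    @abc.abstractmethod
    def __contains__(self, word: str) -> bool:
        ...


class DictWord2Vector(Word2VectorSketch):
    """Hypothetical implementation backed by an in-memory dict."""

    def __init__(self, table: Dict[str, List[float]]) -> None:
        self._table = dict(table)

    def __getitem__(self, word: str) -> List[float]:
        return self._table[word]  # raises KeyError for unknown words

    def __contains__(self, word: str) -> bool:
        return word in self._table
```

An object like this is what a random-access loader needs: a membership test to decide whether a word is missing, and item access to fetch its vector.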