embfile.core

Substructure

Classes

AbstractEmbFileReader(out_dtype)

(Abstract class) Facilitates the implementation of an EmbFileReader, especially for a file that stores each word next to its vector (txt and bin formats), though it can be used for other kinds of formats as well if convenient.

EmbFile(path[, out_dtype, verbose])

(Abstract class) The base class of all the embedding files.

EmbFileReader(out_dtype)

(Abstract class) Iterator that yields a word at each step and reads the corresponding vector only if the lazy property current_vector is accessed.

RandomAccessLoader(words, word2vec[, …])

A loader for files that can randomly access word vectors.

SequentialLoader(file, words[, missing_ok, …])

A loader that scans the file from beginning to end and yields a word vector pair whenever it meets a requested word.

VectorsLoader(words[, missing_ok])

(Abstract class) Iterator that, given some input words, looks for the corresponding vectors in the file and yields a word vector pair for each vector found; once the iteration stops, the attribute missing_words contains the set of words not found.

WordVector(word, vector)

A (word, vector) NamedTuple

class embfile.core.EmbFile(path, out_dtype=None, verbose=True)[source]

Bases: abc.ABC

(Abstract class) The base class of all the embedding files.

Sub-classes must:

  1. ensure they set attributes vocab_size and vector_size when a file instance is created

  2. implement an EmbFileReader for the format and implement the abstract method _reader()

  3. implement the abstract method _close()

  4. (optionally) implement a VectorsLoader (if they can improve upon the default loader) and override loader()

  5. (optionally) implement an EmbFileCreator for the format and set the class constant Creator

Parameters
  • path (Path) – path of the embedding file (possibly compressed)

  • out_dtype (numpy.dtype) – all the vectors will be converted to this data type. The sub-class is responsible for setting a suitable default value.

  • verbose (bool) – whether to show a progress bar by default in all time-consuming operations

Variables
  • path (Path) – path of the embedding file

  • vocab_size (int or None) – number of words in the file (can be None for some TextEmbFile)

  • vector_size (int) – length of the vectors

  • verbose (bool) – whether to show a progress bar by default in all time-consuming operations

  • closed (bool) – True if the file was closed

abstractmethod _reader()[source]

(Abstract method) Returns a new reader for the file, which allows efficient iteration over the word vectors inside it. Called by reader().

Return type

EmbFileReader

abstractmethod _close()[source]

(Abstract method) Releases any resources used by the EmbFile.

Return type

None

DEFAULT_EXTENSION: str
vocab_size: Optional[int]
close()[source]

Releases all the open resources linked to this file, including any open readers.

Return type

None

reader()[source]

Creates and returns a new file reader. When the file is closed, all readers that are still open are closed automatically.

Return type

EmbFileReader

loader(words, missing_ok=True, verbose=None)[source]

Returns a VectorsLoader, an iterator that looks for the provided words in the file and yields available (word, vector) pairs one by one. If missing_ok=True (default), provides the set of missing words in the property missing_words (once the iteration ends).

See embfile.core.VectorsLoader for more info.

Example

You should use a loader when you need to load many vectors into some custom data structure and you don’t want to waste memory (e.g. build_matrix uses it to load the vectors directly into the matrix):

data_structure = MyCustomStructure()
with file.loader(many_words) as loader:
    for word, vector in loader:
        data_structure[word] = vector
print('Number of missing words:', len(loader.missing_words))

See also

load() find()

Return type

VectorsLoader

for ... in words()[source]

Returns an iterable for all the words in the file.

Return type

Iterable[str]

for ... in vectors()[source]

Returns an iterable for all the vectors in the file.

Return type

Iterable[ndarray]

for ... in word_vectors()[source]

Returns an iterable for all the (word, vector) pairs in the file.

Return type

Iterable[WordVector]

to_dict(verbose=None)[source]

Returns the entire file content in a dictionary word -> vector.

Return type

Dict[str, ndarray]

to_list(verbose=None)[source]

Returns the entire file content as a list of WordVector tuples.

Return type

List[WordVector]

load(words, verbose=None)[source]

Loads the vectors for the input words in a {word: vec} dict, raising KeyError if any word is missing.

Return type

Dict[str, ndarray]

Returns

(Dict[str, VectorType]) – a dictionary {word: vector}

See also

find() - it returns the set of all missing words, instead of raising KeyError.

find(words, verbose=None)[source]

Looks for the input words in the file and returns: 1) a dict {word: vec} containing the available words and 2) a set containing the words not found.

Parameters
  • words (Iterable[str]) – the words to look for

  • verbose (Optional[bool]) – if None, self.verbose is used

Return type

_FindOutput

Returns

namedtuple – a namedtuple with the following fields:

  • word2vec (Dict[str, VectorType]): dictionary {word: vector}

  • missing_words (Set[str]): set of words not found in the file

See also

load() - which raises KeyError if any word is not found in the file.
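
The difference between find() and load() can be sketched without the library. The snippet below is only an illustration under stated assumptions: FindOutput stands in for the private _FindOutput type, and the two functions take a plain list of (word, vector) pairs instead of an EmbFile; neither is the library's actual implementation.

```python
from collections import namedtuple

# Hypothetical stand-in for the library's private _FindOutput namedtuple.
FindOutput = namedtuple("FindOutput", ["word2vec", "missing_words"])


def find(pairs, words):
    """Scan (word, vector) pairs; return the found vectors and the missing words."""
    missing = set(words)
    word2vec = {}
    for word, vector in pairs:
        if word in missing:
            word2vec[word] = vector
            missing.remove(word)
            if not missing:  # everything was found, no need to scan further
                break
    return FindOutput(word2vec, missing)


def load(pairs, words):
    """Like find(), but raise KeyError if any requested word is missing."""
    word2vec, missing = find(pairs, words)
    if missing:
        raise KeyError("missing words: " + ", ".join(sorted(missing)))
    return word2vec


# Toy data standing in for the content of an embedding file.
pairs = [("cat", [0.1, 0.2]), ("dog", [0.3, 0.4])]

result = find(pairs, ["cat", "yeti"])  # reports "yeti" as missing

try:
    load(pairs, ["cat", "yeti"])  # raises because "yeti" is missing
    raised = False
except KeyError:
    raised = True
```

In short, prefer find() when missing words are expected and you want to inspect them, and load() when a missing word should be treated as an error.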

for ... in filter(condition, verbose=None)[source]

Returns a generator that yields a word vector pair for each word in the file that satisfies a given condition. For example, to get all the words starting with “z”:

list(file.filter(lambda word: word.startswith('z')))
Parameters
  • condition (Callable[[str], bool]) – a function that, given a word in input, outputs True if the word should be taken

  • verbose (Optional[bool]) – if True, a progress bar is shown (the bar is updated each time a word is read, not each time a word vector pair is yielded).

Return type

Iterator[Tuple[str, ndarray]]

save_vocab(path=None, encoding='utf-8', overwrite=False, verbose=None)[source]

Saves the vocabulary of the embedding file to a text file. By default the file is saved in the same directory as the embedding file, e.g.:

/path/to/filename.txt.gz  ==> /path/to/filename_vocab.txt
Parameters
  • path (Union[str, Path, None]) – where to save the file

  • encoding (str) – text encoding

  • overwrite (bool) – if True, overwrite the file if it already exists

  • verbose (Optional[bool]) – if None, self.verbose is used

Return type

Path

Returns

(Path) – the path to the vocabulary file
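
The default naming rule shown above can be sketched as follows. This is a minimal illustration, not the library's code: the helper name default_vocab_path and the two sets of recognized extensions are assumptions made for this example.

```python
from pathlib import Path

# Assumed extension sets, for illustration only.
COMPRESSION_EXTS = {".gz", ".bz2", ".xz", ".zip"}
FORMAT_EXTS = {".txt", ".bin", ".vvm"}


def default_vocab_path(emb_path):
    """Derive '<name>_vocab.txt' next to the embedding file (hypothetical helper)."""
    path = Path(emb_path)
    name = path.name
    # Drop a trailing compression extension, if any (e.g. ".gz").
    if Path(name).suffix in COMPRESSION_EXTS:
        name = Path(name).stem
    # Drop the format extension, if any (e.g. ".txt").
    if Path(name).suffix in FORMAT_EXTS:
        name = Path(name).stem
    return path.with_name(name + "_vocab.txt")


vocab_path = default_vocab_path("/path/to/filename.txt.gz")
```

Only the last one or two extensions are stripped, so dots inside the base name (as in glove.6B.300d.txt) are preserved.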

classmethod create(out_path, word_vectors, vocab_size=None, compression=None, verbose=True, overwrite=False, **format_kwargs)[source]

Creates a file on disk containing the provided word vectors.

Parameters
  • out_path (Union[str, Path]) – path to the created file

  • word_vectors (Dict[str, VectorType] or Iterable[Tuple[str, VectorType]]) – it can be an iterable of word vector tuples or a dictionary word -> vector; the word vectors are written in the order determined by the iterable object.

  • vocab_size (Optional[int]) – it must be provided if word_vectors has no __len__ and the specific-format creator needs to know a priori the vocabulary size; in any case, the creator should check at the end that the provided vocab_size matches the actual length of word_vectors

  • compression (Optional[str]) – valid values are: "bz2"|"bz", "gzip"|"gz", "xz"|"lzma", "zip"

  • verbose (bool) – if True, show progress bars and information

  • overwrite (bool) – overwrite the file if it already exists

  • format_kwargs – format-specific arguments

Return type

None
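
The handling of word_vectors and vocab_size described above can be sketched as a small generator. It is an illustration only: iter_word_vectors is a hypothetical helper, and raising ValueError on a size mismatch is an assumption, since the documentation says only that the creator should check the size at the end.

```python
def iter_word_vectors(word_vectors, vocab_size=None):
    """Yield (word, vector) pairs from a dict or an iterable, then check the count."""
    # A dictionary is iterated as (word, vector) items, in its own order.
    if isinstance(word_vectors, dict):
        pairs = word_vectors.items()
    else:
        pairs = word_vectors
    # If the input is sized, vocab_size can be inferred instead of being provided.
    if vocab_size is None and hasattr(word_vectors, "__len__"):
        vocab_size = len(word_vectors)
    count = 0
    for word, vector in pairs:
        count += 1
        yield word, vector
    # Final check: the declared size must match what was actually written.
    if vocab_size is not None and count != vocab_size:
        raise ValueError(f"expected {vocab_size} word vectors, got {count}")


pairs_from_dict = list(iter_word_vectors({"cat": [0.1], "dog": [0.2]}))


def stream():
    # A generator has no __len__, so vocab_size must be given explicitly.
    yield "cat", [0.1]


try:
    list(iter_word_vectors(stream(), vocab_size=2))
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```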

classmethod create_from_file(source_file, out_dir=None, out_filename=None, vocab_size=None, compression=None, verbose=True, overwrite=False, **format_kwargs)[source]

Creates a new file on disk with the same content as another file.

Parameters
  • source_file (EmbFile) – the file to take data from

  • out_dir (Union[str, Path, None]) – directory where the file will be stored; by default, it’s the parent directory of the source file

  • out_filename (Optional[str]) – filename of the produced file (inside out_dir); by default, it is obtained by replacing the extension of the source file with the proper one and appending the compression extension if compression is not None. Note: if you pass this argument, the compression extension is not automatically appended.

  • vocab_size (Optional[int]) – if the source EmbFile has attribute vocab_size == None, then: if the specific creator requires it (bin and txt formats do), it must be provided; otherwise it can be provided to show an ETA in progress bars.

  • compression (Optional[str]) – valid values are: "bz2"|"bz", "gzip"|"gz", "xz"|"lzma", "zip"

  • verbose (bool) – print info and progress bar

  • overwrite (bool) – overwrite a file with the same name if it already exists

  • format_kwargs – format-specific arguments (see above)

Return type

Path
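
The default out_filename rule (replace the source extension, then append the compression extension) might look like the sketch below. The helper name default_out_filename and the mapping from compression names to file extensions are assumptions for illustration; only the accepted compression values come from the parameter list above.

```python
from pathlib import Path

# Accepted compression values (from the docs) mapped to assumed file extensions.
COMPRESSION_EXT = {
    "bz2": ".bz2", "bz": ".bz2",
    "gzip": ".gz", "gz": ".gz",
    "xz": ".xz", "lzma": ".xz",
    "zip": ".zip",
}


def default_out_filename(source_path, new_ext, compression=None):
    """Build the output filename from the source filename (hypothetical helper)."""
    name = Path(source_path).name
    # Remove the compression extension of the source file, if present.
    if Path(name).suffix in set(COMPRESSION_EXT.values()):
        name = Path(name).stem
    # Replace the format extension with the one of the target format.
    name = Path(name).stem + new_ext
    # Append the compression extension only when compression is requested.
    if compression is not None:
        name += COMPRESSION_EXT[compression]
    return name


compressed_name = default_out_filename("/data/glove.txt.gz", ".bin", "lzma")
plain_name = default_out_filename("/data/glove.txt", ".vvm")
```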

class embfile.core.EmbFileReader(out_dtype)[source]

Bases: abc.ABC

(Abstract class) Iterator that yields a word at each step and reads the corresponding vector only if the lazy property current_vector is accessed.

Iteration model. The iteration model is not the most obvious: each iteration step doesn’t return a word vector pair. Instead, for performance reasons, at each step a reader returns the next word. To read the vector for the current word, you must access the (lazy) property current_vector():

with emb_file.reader() as reader:
    for word in reader:
        if word in my_vocab:
            word2vec[word] = reader.current_vector

When you access current_vector() for the first time, the vector data is read/parsed and a vector is created; the vector remains accessible until a new word is read.

Creation. Creating a reader usually implies the creation of a file object. That’s why EmbFileReader implements the ContextManager interface, so that you can use it inside a with clause. Nonetheless, an EmbFile keeps track of all its open readers and closes them automatically when it is closed.

Parameters

out_dtype (Union[str, dtype]) – all the vectors will be converted to this dtype before being returned

Variables

out_dtype (numpy.dtype) – all the vectors will be converted to this data type before being returned

close()[source]

Closes the reader

Return type

None

abstractmethod reset()[source]

(Abstract method) Brings the reader back to the first word vector pair

Return type

None

abstractmethod next_word()[source]

(Abstract method) Reads and returns the next word in the file.

Return type

str

abstractmethod current_vector()[source]

(Abstract method) The vector for the current word (i.e. the last word read). If accessed before any word has been read, it raises IllegalOperation. The dtype of the returned vector is cls.out_dtype.

Return type

ndarray

class embfile.core.AbstractEmbFileReader(out_dtype)[source]

Bases: embfile.core.reader.EmbFileReader, abc.ABC

(Abstract class) Facilitates the implementation of an EmbFileReader, especially for a file that stores each word next to its vector (txt and bin formats), though it can be used for other kinds of formats as well if convenient. It:

  • keeps track of whether the reader is pointing to a word or a vector and skips the vector when it is not requested during an iteration

  • caches the current vector once it is read

Sub-classes must implement:

_read_word()

(Abstract method) Reads a word assuming the next thing to read in the file is a word.

_read_vector()

(Abstract method) Reads the vector for the last word read.

_skip_vector()

(Abstract method) Called when we want to read the next word without loading the vector for the current word.

_close()

(Abstract method) Closes the reader

abstractmethod _read_word()[source]

(Abstract method) Reads a word assuming the next thing to read in the file is a word. It must raise StopIteration if there’s not another word to read.

Return type

str

abstractmethod _read_vector()[source]

(Abstract method) Reads the vector for the last word read. This method is never called if no word has been read or at the end of file. It is called at most once per word.

Return type

ndarray

abstractmethod _skip_vector()[source]

(Abstract method) Called when we want to read the next word without loading the vector for the current word. For some formats, it may be empty.

Return type

None

abstractmethod _reset()[source]

(Abstract method) Resets the reader

Return type

None

abstractmethod _close()

(Abstract method) Closes the reader

Return type

None

reset()[source]

Brings the reader back to the beginning of the file

Return type

None

next_word()[source]

Reads and returns the next word in the file.

Return type

str

current_vector

The vector associated with the current word (i.e. the last word read). If accessed before any word has been read, it raises IllegalOperation.

Return type

ndarray
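
The division of work between the base class and a sub-class can be sketched with stand-ins. Everything below is hypothetical and written only for illustration: ToyAbstractReader mimics the bookkeeping described above (tracking whether the reader points to a vector, skipping unread vectors, caching the current one), ToyTxtReader implements the three reading hooks for a "word v1 v2 ..." text layout, RuntimeError replaces the library's IllegalOperation, and _reset() and _close() are omitted for brevity.

```python
import io


class ToyAbstractReader:
    """Hypothetical stand-in for the base class bookkeeping."""

    def __init__(self):
        self._pointing_to_vector = False  # True right after a word is read
        self._current_vector = None       # cache for the current vector

    def next_word(self):
        if self._pointing_to_vector:
            self._skip_vector()  # the previous vector was never requested
            self._pointing_to_vector = False
        self._current_vector = None
        word = self._read_word()  # raises StopIteration at end of file
        self._pointing_to_vector = True
        return word

    @property
    def current_vector(self):
        if self._current_vector is None:
            if not self._pointing_to_vector:
                raise RuntimeError("no current word")  # stand-in for IllegalOperation
            self._current_vector = self._read_vector()  # at most once per word
            self._pointing_to_vector = False
        return self._current_vector

    def __iter__(self):
        return self

    def __next__(self):
        return self.next_word()


class ToyTxtReader(ToyAbstractReader):
    """Reads lines of the form 'word v1 v2 ...' from an in-memory text."""

    def __init__(self, text):
        super().__init__()
        self._stream = io.StringIO(text)
        self.vectors_parsed = 0  # counts how many vectors were actually parsed

    def _read_word(self):
        chars = []
        while True:
            ch = self._stream.read(1)
            if ch == "":
                raise StopIteration  # no other word to read
            if ch == " ":
                return "".join(chars)
            chars.append(ch)

    def _read_vector(self):
        self.vectors_parsed += 1
        return [float(x) for x in self._stream.readline().split()]

    def _skip_vector(self):
        self._stream.readline()  # discard the rest of the line without parsing


reader = ToyTxtReader("cat 0.1 0.2\ndog 0.3 0.4\nemu 0.5 0.6\n")
found = {}
for word in reader:
    if word == "dog":
        found[word] = reader.current_vector
```

Only the vector of "dog" is parsed; the other two lines are skipped without converting their fields to floats, which is the saving the lazy current_vector property is meant to provide.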

class embfile.core.VectorsLoader(words, missing_ok=True)[source]

Bases: abc.ABC, Iterator[WordVector]

(Abstract class) Iterator that, given some input words, looks for the corresponding vectors in the file and yields a word vector pair for each vector found; once the iteration stops, the attribute missing_words contains the set of words not found.

Subclasses can load the word vectors in any order.

Parameters
  • words (Iterable[str]) – the words to load

  • missing_ok (bool) – If False, raises a KeyError if any input word is not in the file

abstractmethod missing_words

The words that have yet to be found; once the iteration stops, it’s the set of the words that are in the input words but not in the file.

close()[source]

Closes any open resources (e.g. a reader).

class embfile.core.SequentialLoader(file, words, missing_ok=True, verbose=False)[source]

Bases: abc.ABC, Iterator[WordVector]

A loader that scans the file from beginning to end and yields a word vector pair whenever it meets a requested word. Used by txt and bin files. It cannot tell whether a word is in the file before having read the entire file.

The progress bar shows the percentage of the file that has been examined, not the number of yielded word vectors, so the iteration may stop before the bar reaches 100% (in the case that all the input words are in the file).

Parameters
  • words (Iterable[str]) – the words to load

  • missing_ok (bool) – If False, raises a KeyError if any input word is not in the file

missing_words

The words that have yet to be found; once the iteration stops, it’s the set of the words that are in the input words but not in the file.

close()[source]

Closes any open resources (e.g. a reader).
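
The sequential strategy can be sketched as a small iterator. This is a toy under stated assumptions, not the library's class: ToySequentialLoader scans a list of (word, vector) pairs rather than a file, and the scanned counter exists only to make the early stop visible.

```python
class ToySequentialLoader:
    """Hypothetical sketch: scan pairs in file order, yield the requested ones."""

    def __init__(self, pairs, words, missing_ok=True):
        self._pairs = iter(pairs)
        self._missing = set(words)
        self._missing_ok = missing_ok
        self.scanned = 0  # number of entries examined so far

    @property
    def missing_words(self):
        # Words not found yet; final only once the iteration has stopped.
        return self._missing

    def __iter__(self):
        return self

    def __next__(self):
        if not self._missing:
            raise StopIteration  # all words found: stop before the end of the file
        for word, vector in self._pairs:
            self.scanned += 1
            if word in self._missing:
                self._missing.remove(word)
                return word, vector
        # The whole file was read and some words are still missing.
        if not self._missing_ok:
            raise KeyError("missing words: " + ", ".join(sorted(self._missing)))
        raise StopIteration


file_pairs = [("cat", [0.1]), ("dog", [0.2]), ("emu", [0.3]), ("fox", [0.4])]

early = ToySequentialLoader(file_pairs, ["dog", "cat"])
early_found = dict(early)  # stops after "dog", without reading "emu" and "fox"

full = ToySequentialLoader(file_pairs, ["emu", "yeti"])
full_found = dict(full)    # must read the whole file to learn "yeti" is missing
```

The second loader shows why a sequential loader cannot report a missing word early: "yeti" is known to be absent only after the last entry has been examined.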

class embfile.core.RandomAccessLoader(words, word2vec, word2index=None, missing_ok=True, verbose=False, close_hook=None)[source]

Bases: abc.ABC, Iterator[WordVector]

A loader for files that can randomly access word vectors. If word2index is provided, the words are sorted by their position and the corresponding vectors are loaded in that order; this was observed to significantly improve performance with VVMEmbFile (presumably due to buffering).

Parameters
  • words (Iterable[str]) –

  • word2vec (Word2Vector) – object that implements word2vec[word] and word in word2vec

  • word2index (Optional[Callable[[str], int]]) – function that returns the index (position) of a word inside the file; this enables an optimization for formats like VVM that store vectors sequentially in the same file.

  • missing_ok (bool) –

  • verbose (bool) –

  • close_hook (Optional[Callable]) – function to call when closing this loader

missing_words

The words that have yet to be found; once the iteration stops, it’s the set of the words that are in the input words but not in the file.

close()[source]

Closes any open resources (e.g. a reader).
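
The word2index optimization amounts to sorting the requested words by file position before looking them up. The function below is a sketch under assumptions: load_in_file_order is a hypothetical name, a plain dict plays the role of the word2vec object, and the result is returned as a list instead of being yielded lazily.

```python
def load_in_file_order(words, word2vec, word2index=None):
    """Return ([(word, vector), ...], missing_words), sorted by file position."""
    requested = set(words)
    # word2vec only needs to support `word in word2vec` and `word2vec[word]`.
    available = [word for word in requested if word in word2vec]
    missing = requested - set(available)
    if word2index is not None:
        # Accessing vectors in increasing file position favors buffered reads.
        available.sort(key=word2index)
    pairs = [(word, word2vec[word]) for word in available]
    return pairs, missing


# Toy data: the position of each word inside a hypothetical file.
positions = {"cat": 0, "dog": 1, "emu": 2}
vectors = {"cat": [0.1], "dog": [0.2], "emu": [0.3]}

pairs, missing = load_in_file_order(
    ["emu", "yeti", "cat"], vectors, word2index=positions.__getitem__
)
```

Without word2index the pairs come back in an unspecified order, which is consistent with the note that loaders can yield word vectors in any order.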

class embfile.core.WordVector(word: str, vector: numpy.ndarray)[source]

Bases: tuple

A (word, vector) NamedTuple

Create new instance of WordVector(word, vector)

word: str

Alias for field number 0

vector: numpy.ndarray

Alias for field number 1

staticmethod format_vector(arr)[source]

Used by __repr__ to convert a numpy vector to string. Feel free to monkey-patch it.