embfile.core¶

Substructure

Classes

`AbstractEmbFileReader`(out_dtype)	(Abstract class) Facilitates the implementation of an `EmbFileReader`, especially for a file that stores each word next to its vector (txt and bin formats), though it can be used for other kinds of formats as well where convenient.
`EmbFile`(path[, out_dtype, verbose])	(Abstract class) The base class of all the embedding files.
`EmbFileReader`(out_dtype)	(Abstract class) Iterator that yields a word at each step and reads the corresponding vector only if the lazy property `current_vector` is accessed.
`RandomAccessLoader`(words, word2vec[, …])	A loader for files that can randomly access word vectors.
`SequentialLoader`(file, words[, missing_ok, …])	A Loader that just scans the file from beginning to the end and yields a word vector pair when it meets a requested word.
`VectorsLoader`(words[, missing_ok])	(Abstract class) Iterator that, given some input words, looks for the corresponding vectors in the file and yields a word vector pair for each vector found; once the iteration stops, the attribute `missing_words` contains the set of words not found.
`WordVector`(word, vector)	A (word, vector) NamedTuple

class embfile.core.EmbFile(path, out_dtype=None, verbose=True)[source]¶

Bases: abc.ABC

(Abstract class) The base class of all the embedding files.

Sub-classes must:

ensure they set attributes vocab_size and vector_size when a file instance is created
implement an EmbFileReader for the format and implement the abstract method _reader()
implement the abstract method _close()
(optionally) implement a VectorsLoader (if they can improve upon the default loader) and override loader()
(optionally) implement an EmbFileCreator for the format and set the class constant Creator

Parameters

path (Path) – path of the embedding file (possibly compressed)
out_dtype (numpy.dtype) – all the vectors will be converted to this data type. The sub-class is responsible for setting a suitable default value.
verbose (bool) – whether to show a progress bar by default in all time-consuming operations

Variables

path (Path) – path of the embedding file
vocab_size (int or None) – number of words in the file (can be None for some TextEmbFile)
vector_size (int) – length of the vectors
verbose (bool) – whether to show a progress bar by default in all time-consuming operations
closed (bool) – True if the file was closed

abstractmethod _reader()[source]¶

(Abstract method) Returns a new reader for the file that allows efficient iteration over the word vectors inside it. Called by reader().

Return type: EmbFileReader

abstractmethod _close()[source]¶

(Abstract method) Releases any resources used by the EmbFile.

Return type: None

DEFAULT_EXTENSION: str¶

vocab_size: Optional[int]¶

close()[source]¶

Releases all the open resources linked to this file, including any open readers.

Return type: None

reader()[source]¶

Creates and returns a new file reader. When the file is closed, any readers that are still open are closed automatically.

Return type: EmbFileReader
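The relationship between reader(), _reader() and close() can be illustrated with a small stand-in. This is only a sketch under assumptions, not embfile's implementation: SketchFile, SketchReader and InMemoryFile are hypothetical names, and the list used to track readers is an assumed detail.

```python
from abc import ABC, abstractmethod


class SketchReader:
    """Stand-in for a reader; it only records whether it was closed."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class SketchFile(ABC):
    """Hypothetical base class mirroring the reader()/close() contract."""

    def __init__(self):
        self.closed = False
        self._readers = []  # every reader handed out by reader() (assumed detail)

    @abstractmethod
    def _reader(self):
        """Format-specific factory, supplied by the sub-class."""

    def reader(self):
        # Public entry point: create a reader and remember it.
        new_reader = self._reader()
        self._readers.append(new_reader)
        return new_reader

    def close(self):
        # Close the readers that are still open, then mark the file closed.
        for tracked in self._readers:
            if not tracked.closed:
                tracked.close()
        self._readers.clear()
        self.closed = True


class InMemoryFile(SketchFile):
    def _reader(self):
        return SketchReader()


emb_file = InMemoryFile()
first, second = emb_file.reader(), emb_file.reader()
first.close()      # closed explicitly by the caller
emb_file.close()   # closes `second` as well
```

Keeping the public reader() in the base class and delegating only the construction to _reader() lets the base class own the bookkeeping, so a format-specific sub-class cannot forget it.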

loader(words, missing_ok=True, verbose=None)[source]¶

Returns a VectorsLoader, an iterator that looks for the provided words in the file and yields available (word, vector) pairs one by one. If missing_ok=True (default), provides the set of missing words in the property missing_words (once the iteration ends).

See embfile.core.VectorsLoader for more info.

Example

You should use a loader when you need to load many vectors in some custom data structure and you don’t want to waste memory (e.g. build_matrix uses it to load the vectors directly into the matrix):

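data_structure = MyCustomStructure()
with file.loader(many_words) as loader:
    # only the (word, vector) pairs found in the file are yielded
    for word, vector in loader:
        data_structure[word] = vector
# missing_words is complete once the iteration has ended
print('Number of missing words:', len(loader.missing_words))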

See also

load() - which raises KeyError if any word is not found in the file.

for ... in filter(condition, verbose=None)[source]¶

Returns a generator that yields a word vector pair for each word in the file that satisfies a given condition. For example, to get all the words starting with “z”:

list(file.filter(lambda word: word.startswith('z')))

Parameters

condition (Callable[[str], bool]) – a function that, given a word in input, outputs True if the word should be taken
verbose (Optional[bool]) – if True, a progress bar is shown (the bar is updated each time a word is read, not each time a word vector pair is yielded).

Return type

Iterator[Tuple[str, ndarray]]

save_vocab(path=None, encoding='utf-8', overwrite=False, verbose=None)[source]¶

Saves the vocabulary of the embedding file to a text file. By default the file is saved in the same directory as the embedding file, e.g.:

/path/to/filename.txt.gz  ==> /path/to/filename_vocab.txt

Parameters

path (Union[str, Path, None]) – where to save the file
encoding (str) – text encoding
overwrite (bool) – if True and the file already exists, it is overwritten
verbose (Optional[bool]) – if None, self.verbose is used

Return type

Path

Returns

(Path) – the path to the vocabulary file
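The default naming rule of save_vocab can be sketched with pathlib. The helper default_vocab_path is hypothetical, and the rule it applies (drop every extension, then append "_vocab.txt") is an assumption inferred from the single example above, not a statement about how embfile computes the path.

```python
from pathlib import Path


def default_vocab_path(emb_path: Path) -> Path:
    """Hypothetical helper: derive the vocabulary path from the embedding path."""
    # Assumption: all extensions are dropped (".txt.gz" included), which is
    # consistent with the one example given in the text.
    base_name = emb_path.name.split(".")[0]
    return emb_path.with_name(base_name + "_vocab.txt")


vocab_path = default_vocab_path(Path("/path/to/filename.txt.gz"))
```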

classmethod create(out_path, word_vectors, vocab_size=None, compression=None, verbose=True, overwrite=False, **format_kwargs)[source]¶

Creates a file on disk containing the provided word vectors.

Parameters

out_path (Union[str, Path]) – path to the created file
word_vectors (Dict[str, VectorType] or Iterable[Tuple[str, VectorType]]) – it can be an iterable of word vector tuples or a dictionary word -> vector; the word vectors are written in the order determined by the iterable object.
vocab_size (Optional[int]) – it must be provided if word_vectors has no __len__ and the specific-format creator needs to know a priori the vocabulary size; in any case, the creator should check at the end that the provided vocab_size matches the actual length of word_vectors
compression (Optional[str]) – valid values are: "bz2"|"bz", "gzip"|"gz", "xz"|"lzma", "zip"
verbose (bool) – if True, show progress bars and information
overwrite (bool) – overwrite the file if it already exists
format_kwargs – format-specific arguments

Return type

None
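The two accepted shapes of word_vectors and the final vocab_size check can be sketched as follows. The helpers iter_pairs, known_size and checked_pairs are hypothetical, vectors are plain lists rather than NumPy arrays, and ValueError is an assumed choice of exception; this is not the creator code used by embfile.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple, Union

Vector = List[float]
Pair = Tuple[str, Vector]
WordVectors = Union[Dict[str, Vector], Iterable[Pair]]


def iter_pairs(word_vectors: WordVectors) -> Iterator[Pair]:
    """Yield (word, vector) pairs in the order given by the input object."""
    if isinstance(word_vectors, dict):
        return iter(word_vectors.items())
    return iter(word_vectors)


def known_size(word_vectors: WordVectors, vocab_size: Optional[int] = None) -> Optional[int]:
    """Size available before writing: len() if possible, else the caller's value."""
    if vocab_size is None and hasattr(word_vectors, "__len__"):
        return len(word_vectors)
    return vocab_size


def checked_pairs(word_vectors: WordVectors, vocab_size: Optional[int] = None) -> Iterator[Pair]:
    """Pass the pairs through and verify the count once the input is exhausted."""
    count = 0
    for word, vector in iter_pairs(word_vectors):
        count += 1
        yield word, vector
    if vocab_size is not None and count != vocab_size:
        raise ValueError(f"expected {vocab_size} word vectors, got {count}")


as_dict = {"the": [0.1, 0.2], "cat": [0.3, 0.4]}
as_stream = ((w, v) for w, v in as_dict.items())   # a generator has no __len__
from_dict = list(checked_pairs(as_dict))
from_stream = list(checked_pairs(as_stream, vocab_size=2))
```

Checking the count at the end rather than up front is what allows a generator to be written in a single pass.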

classmethod create_from_file(source_file, out_dir=None, out_filename=None, vocab_size=None, compression=None, verbose=True, overwrite=False, **format_kwargs)[source]¶

Creates a new file on disk with the same content as another file.

Parameters

source_file (EmbFile) – the file to take data from
out_dir (Union[str, Path, None]) – directory where the file will be stored; by default, it’s the parent directory of the source file
out_filename (Optional[str]) – filename of the produced file (inside out_dir); by default, it is obtained by replacing the extension of the source file with the proper one and appending the compression extension if compression is not None. Note: if you pass this argument, the compression extension is not automatically appended.
vocab_size (Optional[int]) – if the source EmbFile has vocab_size == None, then it must be provided if the specific creator requires it (bin and txt formats do); otherwise it can be provided so that progress bars can show an ETA.
compression (Optional[str]) – valid values are: "bz2"|"bz", "gzip"|"gz", "xz"|"lzma", "zip"
verbose (bool) – print info and progress bar
overwrite (bool) – overwrite a file with the same name if it already exists
format_kwargs – format-specific arguments (see above)

Return type

Path

class embfile.core.EmbFileReader(out_dtype)[source]¶

Bases: abc.ABC

(Abstract class) Iterator that yields a word at each step and reads the corresponding vector only if the lazy property current_vector is accessed.

Iteration model. The iteration model is not the most obvious: each iteration step doesn’t return a word vector pair. Instead, for performance reasons, at each step a reader returns the next word. To read the vector for the current word, you must access the (lazy) property current_vector():

with emb_file.reader() as reader:
    for word in reader:
        if word in my_vocab:
            word2vec[word] = reader.current_vector

When you access current_vector() for the first time, the vector data is read/parsed and a vector is created; the vector remains accessible until a new word is read.

Creation. Creating a reader usually implies the creation of a file object. That’s why EmbFileReader implements the ContextManager interface so that you can use it inside a with clause. Nonetheless, an EmbFile keeps track of all its open readers and closes them automatically when it is closed.

Parameters: out_dtype (Union[str, dtype]) – all the vectors will be converted to this dtype before being returned
Variables: out_dtype (numpy.dtype) – all the vectors will be converted to this data type before being returned

close()[source]¶

Closes the reader

Return type: None

abstractmethod reset()[source]¶

(Abstract method) Brings the reader back to the first word vector pair

Return type: None

abstractmethod next_word()[source]¶

(Abstract method) Reads and returns the next word in the file.

Return type: str

abstractmethod current_vector()[source]¶

(Abstract method) The vector for the current word (i.e. the last word read). If accessed before any word has been read, it raises IllegalOperation. The dtype of the returned vector is cls.out_dtype.

Return type: ndarray

class embfile.core.AbstractEmbFileReader(out_dtype)[source]¶

Bases: embfile.core.reader.EmbFileReader, abc.ABC

(Abstract class) Facilitates the implementation of an EmbFileReader, especially for a file that stores each word next to its vector (txt and bin formats), though it can be used for other kinds of formats as well where convenient. It:

keeps track of whether the reader is pointing to a word or a vector and skips the vector when it is not requested during an iteration
caches the current vector once it is read

Sub-classes must implement:

`_read_word`()	(Abstract method) Reads a word assuming the next thing to read in the file is a word.
`_read_vector`()	(Abstract method) Reads the vector for the last word read.
`_skip_vector`()	(Abstract method) Called when we want to read the next word without loading the vector for the current word.
`_close`()	(Abstract method) Closes the reader
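The bookkeeping described above (position tracking, skipping unread vectors, caching) could look roughly like the following. It is a stdlib-only sketch, not embfile's implementation: SketchAbstractReader and LinesReader are hypothetical names, vectors are plain lists instead of NumPy arrays, RuntimeError stands in for IllegalOperation, and the reset/close hooks are omitted for brevity.

```python
from abc import ABC, abstractmethod
from typing import Iterable, List, Optional


class SketchAbstractReader(ABC):
    """Hypothetical base class: tracks the position and caches the vector."""

    def __init__(self) -> None:
        self._pending_vector = False  # True while the current vector is unread
        self._current_vector: Optional[List[float]] = None  # cache

    @abstractmethod
    def _read_word(self) -> str:
        """Read the next word; raise StopIteration at end of input."""

    @abstractmethod
    def _read_vector(self) -> List[float]:
        """Read the vector of the last word read."""

    @abstractmethod
    def _skip_vector(self) -> None:
        """Move past the vector of the last word without parsing it."""

    def next_word(self) -> str:
        if self._pending_vector:      # the vector was never requested
            self._skip_vector()
            self._pending_vector = False
        self._current_vector = None   # invalidate the cache
        word = self._read_word()
        self._pending_vector = True
        return word

    @property
    def current_vector(self) -> List[float]:
        if self._current_vector is None:
            if not self._pending_vector:
                raise RuntimeError("no current word")
            self._current_vector = self._read_vector()
            self._pending_vector = False
        return self._current_vector

    def __iter__(self):
        return self

    def __next__(self) -> str:
        return self.next_word()


class LinesReader(SketchAbstractReader):
    """Toy text format: one 'word v1 v2 ...' entry per line."""

    def __init__(self, lines: Iterable[str]) -> None:
        super().__init__()
        self._lines = iter(lines)
        self._rest = ""
        self.parsed = 0  # number of vectors actually parsed

    def _read_word(self) -> str:
        line = next(self._lines)  # StopIteration propagates at end of input
        word, _, self._rest = line.partition(" ")
        return word

    def _read_vector(self) -> List[float]:
        self.parsed += 1
        return [float(x) for x in self._rest.split()]

    def _skip_vector(self) -> None:
        pass  # the whole line was already consumed; nothing to skip


reader = LinesReader(["the 0.1 0.2", "cat 0.3 0.4", "sat 0.5 0.6"])
wanted = {"cat"}
word2vec = {word: reader.current_vector for word in reader if word in wanted}
```

In this toy format _skip_vector has nothing to do, which matches the remark that the method may be empty for some formats; a binary format would instead seek past the vector bytes.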

abstractmethod _read_word()[source]¶

(Abstract method) Reads a word assuming the next thing to read in the file is a word. It must raise StopIteration if there’s not another word to read.

Return type: str

abstractmethod _read_vector()[source]¶

(Abstract method) Reads the vector for the last word read. This method is never called if no word has been read or at the end of file. It is called at most once per word.

Return type: ndarray

abstractmethod _skip_vector()[source]¶

(Abstract method) Called when we want to read the next word without loading the vector for the current word. For some formats, it may be empty.

Return type: None

abstractmethod _reset()[source]¶

(Abstract method) Resets the reader

Return type: None

abstractmethod _close()¶

(Abstract method) Closes the reader

Return type: None

reset()[source]¶

Brings the reader back to the beginning of the file

Return type: None

next_word()[source]¶

Reads and returns the next word in the file.

Return type: str

current_vector¶

The vector associated with the current word (i.e. the last word read). If accessed before any word has been read, it raises IllegalOperation.

Return type: ndarray

class embfile.core.VectorsLoader(words, missing_ok=True)[source]¶

Bases: abc.ABC, Iterator[WordVector]

(Abstract class) Iterator that, given some input words, looks for the corresponding vectors in the file and yields a word vector pair for each vector found; once the iteration stops, the attribute missing_words contains the set of words not found.

Subclasses can load the word vectors in any order.

Parameters

words (Iterable[str]) – the words to load
missing_ok (bool) – If False, raises a KeyError if any input word is not in the file

abstractmethod missing_words¶: The words still to be found; once the iteration stops, it is the set of input words that are not in the file.

close()[source]¶: Closes any open resources (e.g. a reader).

class embfile.core.SequentialLoader(file, words, missing_ok=True, verbose=False)[source]¶

Bases: abc.ABC, Iterator[WordVector]

A Loader that just scans the file from beginning to the end and yields a word vector pair when it meets a requested word. Used by txt and bin files. It’s unable to tell if a word is in the file or not before having read the entire file.

The progress bar shows the percentage of the file that has been examined, not the number of yielded word vectors, so the iteration may stop before the bar reaches 100% (when all the input words are in the file).

Parameters

words (Iterable[str]) – the words to load
missing_ok (bool) – If False, raises a KeyError if any input word is not in the file

missing_words¶: The words still to be found; once the iteration stops, it is the set of input words that are not in the file.

close()[source]¶: Closes any open resources (e.g. a reader).
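The scanning strategy of a sequential loader can be sketched over an in-memory list of pairs. SequentialScan is a hypothetical stand-in written for illustration (no progress bar, plain lists as vectors); it shows the early stop once every requested word has been found and the role of missing_words, and it is not the code of SequentialLoader.

```python
from typing import Iterable, Iterator, List, Set, Tuple

Pair = Tuple[str, List[float]]


class SequentialScan:
    """Hypothetical stand-in: scans pairs in file order, yields requested ones."""

    def __init__(self, pairs: Iterable[Pair], words: Iterable[str],
                 missing_ok: bool = True) -> None:
        self._pairs = iter(pairs)
        self.missing_words: Set[str] = set(words)
        self.missing_ok = missing_ok

    def __iter__(self) -> Iterator[Pair]:
        return self

    def __next__(self) -> Pair:
        # Stop early once every requested word has been found.
        if not self.missing_words:
            raise StopIteration
        for word, vector in self._pairs:
            if word in self.missing_words:
                self.missing_words.remove(word)
                return word, vector
        # End of input reached while some words are still missing.
        if not self.missing_ok:
            raise KeyError(f"words not found: {sorted(self.missing_words)}")
        raise StopIteration


file_content = [("the", [0.1, 0.2]), ("cat", [0.3, 0.4]), ("sat", [0.5, 0.6])]
scan = SequentialScan(file_content, ["cat", "dog"])
found = dict(scan)            # consumes the iterator
still_missing = scan.missing_words
```

Because "dog" never appears, the scan has to reach the end of the input before it can report the word as missing, which is the limitation noted above.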

class embfile.core.RandomAccessLoader(words, word2vec, word2index=None, missing_ok=True, verbose=False, close_hook=None)[source]¶

Bases: abc.ABC, Iterator[WordVector]

A loader for files that can randomly access word vectors. If word2index is provided, the words are sorted by their position and the corresponding vectors are loaded in this order; I observed that this significantly improves the performance (with VVMEmbFile), presumably due to buffering.

Parameters

words (Iterable[str]) – the words to load
word2vec (Word2Vector) – object that implements word2vec[word] and word in word2vec
word2index (Optional[Callable[[str], int]]) – function that returns the index (position) of a word inside the file; this enables an optimization for formats like VVM that store vectors sequentially in the same file.
missing_ok (bool) – If False, raises a KeyError if any input word is not in the file
verbose (bool) – whether to show a progress bar
close_hook (Optional[Callable]) – function to call when closing this loader

missing_words¶: The words still to be found; once the iteration stops, it is the set of input words that are not in the file.

close()[source]¶: Closes any open resources (e.g. a reader).
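The word2index optimization amounts to visiting the requested words in file order. The function below is a hypothetical, eager sketch of that idea (the real loader is an iterator); a plain dict plays the role of the word2vec object and vectors are plain lists.

```python
from typing import Callable, Iterable, List, Mapping, Optional, Set, Tuple

Pair = Tuple[str, List[float]]


def load_in_file_order(
    words: Iterable[str],
    word2vec: Mapping[str, List[float]],
    word2index: Optional[Callable[[str], int]] = None,
) -> Tuple[List[Pair], Set[str]]:
    """Return the pairs that were found and the set of missing words."""
    requested = set(words)
    available = [w for w in requested if w in word2vec]
    missing = requested - set(available)
    if word2index is not None:
        # Access the vectors in the order in which they are stored.
        available.sort(key=word2index)
    return [(w, word2vec[w]) for w in available], missing


vectors = {"the": [0.1, 0.2], "cat": [0.3, 0.4], "sat": [0.5, 0.6]}
positions = {"the": 0, "cat": 1, "sat": 2}
pairs, missing = load_in_file_order(
    ["sat", "dog", "the"], vectors, word2index=positions.__getitem__
)
```

Only the words known to be in the file are passed to word2index, so the position lookup never sees a missing word.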

class embfile.core.WordVector(word: str, vector: numpy.ndarray)[source]¶

Bases: tuple

A (word, vector) NamedTuple

Create new instance of WordVector(word, vector)

word: str¶: Alias for field number 0

vector: numpy.ndarray¶: Alias for field number 1

staticmethod format_vector(arr)[source]¶: Used by __repr__ to convert a numpy vector to string. Feel free to monkey-patch it.
