embfile API

Substructure

Classes

`BinaryEmbFile`(path[, encoding, dtype, …])	Format used by the Google word2vec tool.
`BuildMatrixOutput`(matrix, word2index, …)	NamedTuple returned by `build_matrix()`
`EmbFile`(path[, out_dtype, verbose])	(Abstract class) The base class of all the embedding files.
`TextEmbFile`(path[, encoding, out_dtype, …])	The format used by Glove and FastText files. Each word vector pair is stored as a line of text made of space-separated fields.
`VVMEmbFile`(path[, out_dtype, verbose])	(Custom format) A tar file storing vocabulary, vectors and metadata in 3 separate files.

Functions

`associate_extension`(ext, format_id[, overwrite])	Associates a file extension to a registered embedding file format.
`build_matrix`(f, words[, start_index, dtype, …])	Creates an embedding matrix for the provided words.
`extract`(src_path[, member, dest_dir, …])	Extracts a file compressed with gzip, bz2 or lzma or a member file inside a zip/tar archive.
`extract_if_missing`(src_path[, member, …])	Extracts a file unless it already exists and returns its path.
`open`(path[, format_id])	Opens an embedding file inferring the file format from the file extension (if not explicitly provided in `format_id`).
`register_format`(format_id, extensions[, …])	Class decorator that associates a new `EmbFile` sub-class with a format_id and one or multiple extensions.

Data

FORMATS

Maps each EmbFile subclass to a format_id and one or multiple file extensions.

embfile.open(path, format_id=None, **format_kwargs)

Opens an embedding file inferring the file format from the file extension (if not explicitly provided in format_id). Note that you can always open a file using the specific EmbFile subclass; it can be more convenient since you get auto-completion and quick doc for format-specific arguments.

Example:

with embfile.open('path/to/embfile.txt') as f:
    # do something with f

Supported formats:

Class	format_id	Extensions	Description
`TextEmbFile`	txt	.txt, .vec	Glove/fastText format
`BinaryEmbFile`	bin	.bin	Google word2vec format
`VVMEmbFile`	vvm	.vvm	A tarball containing three files: vocab.txt, vectors.bin, meta.json

You can register new formats or extensions using the functions embfile.register_format() and embfile.associate_extension().

Parameters

path (Union[str, Path]) – path to the file
format_id (Optional[str]) – string ID of the embedding file format. If not provided, it is inferred from the file name. Valid choices are: ‘txt’, ‘bin’, ‘vvm’.
format_kwargs – additional format-specific arguments (see doc for specific file formats)

Return type

EmbFile

Returns

An instance of a concrete subclass of EmbFile.

See also

embfile.register_format(): registers your custom EmbFile implementation so it is recognized by this function
embfile.associate_extension(): associates an extension to a registered format
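
The extension-based lookup described above can be pictured with a small standalone sketch. It does not use embfile itself: `EXTENSION_TO_FORMAT`, `COMPRESSION_EXTENSIONS` and `infer_format_id` are hypothetical names, and skipping compression suffixes is an assumption about how a compressed file name such as `vectors.vec.gz` might be handled.

```python
from pathlib import Path

# Hypothetical lookup table mirroring the "Supported formats" table above.
EXTENSION_TO_FORMAT = {".txt": "txt", ".vec": "txt", ".bin": "bin", ".vvm": "vvm"}
# Assumed set of compression suffixes to skip while looking for the format extension.
COMPRESSION_EXTENSIONS = {".gz", ".bz2", ".xz", ".lzma", ".zip"}


def infer_format_id(path, format_id=None):
    """Return format_id if given, otherwise look it up from the file extension."""
    if format_id is not None:
        return format_id
    suffixes = [suffix.lower() for suffix in Path(path).suffixes]
    # Walk the suffixes from right to left, skipping compression extensions.
    for suffix in reversed(suffixes):
        if suffix in COMPRESSION_EXTENSIONS:
            continue
        if suffix in EXTENSION_TO_FORMAT:
            return EXTENSION_TO_FORMAT[suffix]
        break
    raise ValueError(f"cannot infer the format of {path!r}; pass format_id explicitly")


print(infer_format_id("path/to/embfile.txt"))
print(infer_format_id("vectors.vec.gz"))
```

An explicit `format_id` always wins over the extension, which matches the behaviour documented for the `format_id` parameter.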

embfile.build_matrix(f, words, start_index=0, dtype=None, oov_initializer=<embfile.initializers.NormalInitializer object>, verbose=None)

Creates an embedding matrix for the provided words. words can be:

an iterable of strings – in this case, the words in the iterable are mapped to consecutive rows of the matrix starting from the row of index start_index (by default, 0); the rows with index i < start_index are left as zeros.
a dictionary {word -> index} that maps each word to a row – in this case, the matrix has shape:
```
[max_index + 1, vector_size]
```
where max_index = max(word_to_index.values()). The rows that are not associated with any word are left as zeros. If multiple words are mapped to the same row, the function raises ValueError.
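
The two ways of assigning rows can be sketched in plain Python. `assign_rows` is a hypothetical helper written for illustration, not part of embfile; dropping repeated words in the iterable case is an assumption, since the text above does not say how repeats are treated.

```python
def assign_rows(words, start_index=0):
    """Return (word2index, n_rows) for the two cases described above."""
    if isinstance(words, dict):
        word2index = dict(words)
        indexes = list(word2index.values())
        if len(set(indexes)) != len(indexes):
            raise ValueError("multiple words are mapped to the same row")
        # The matrix needs max_index + 1 rows; start_index plays no role here.
        n_rows = max(indexes) + 1 if indexes else 0
    else:
        # Assumption: repeated words are dropped, keeping first-seen order.
        unique_words = list(dict.fromkeys(words))
        word2index = {word: start_index + i for i, word in enumerate(unique_words)}
        n_rows = start_index + len(unique_words)
    return word2index, n_rows


print(assign_rows(["a", "b", "c"], start_index=2))
print(assign_rows({"a": 0, "b": 4}))
```

In the first call rows 0 and 1 are not assigned to any word; in the second call rows 1 to 3 are not assigned.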

In both cases, all the word vectors that are not found in the file are initialized using oov_initializer, which can be:

None – leave missing vectors as zeros

a function that takes the shape of the array to generate (a tuple) as first argument:

oov_initializer=lambda shape: numpy.random.normal(scale=0.01, size=shape)
oov_initializer=numpy.ones  # don't use this for word vectors :|

an instance of Initializer, which is a “fittable” initializer; in this case, the initializer is fit on the found vectors (the vectors that are both in words and in the file).

By default, oov_initializer is an instance of NormalInitializer, which generates vectors using a normal distribution with the same mean and standard deviation as the vectors found.
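
As a rough illustration of a "fittable" initializer, here is a standalone sketch that uses only the standard library and nested lists in place of numpy arrays. `NormalInitializerSketch` is a hypothetical class; whether the real NormalInitializer computes one global mean and standard deviation or one per dimension is not stated above, so the global version here is an assumption.

```python
import random
import statistics


class NormalInitializerSketch:
    """Toy fittable initializer: call fit() on the found vectors, then call it with a shape."""

    def __init__(self, seed=None):
        self.mean = 0.0
        self.std = 1.0
        self._rng = random.Random(seed)

    def fit(self, found_vectors):
        # Assumption: a single mean/std is computed over all vector components.
        values = [x for vector in found_vectors for x in vector]
        self.mean = statistics.fmean(values)
        self.std = statistics.pstdev(values)
        return self

    def __call__(self, shape):
        n_rows, n_cols = shape
        return [
            [self._rng.gauss(self.mean, self.std) for _ in range(n_cols)]
            for _ in range(n_rows)
        ]


initializer = NormalInitializerSketch(seed=0).fit([[1.0, 3.0], [1.0, 3.0]])
oov_vectors = initializer((3, 2))  # three out-of-vocabulary vectors of size 2
```

The instance is callable with a shape tuple, so it has the same calling convention as the plain-function initializers listed above.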

Parameters

f (EmbFile) – the file containing the word vectors
words (Iterable[str] or Dict[str, int]) – iterable of words or dictionary that maps each word to a row index
start_index (int) – ignored if words is a dict; if words is an iterable, determines the index associated with the first word (and so, the number of rows left as zeros at the beginning of the matrix)
dtype (optional, DType) – matrix data type; if None, f.out_dtype is used
oov_initializer (optional, Callable or Initializer) – initializer for out-of-(file)-vocabulary word vectors. See the description above for more information.
verbose (bool) – if None, f.verbose is used

Return type

BuildMatrixOutput

class embfile.BuildMatrixOutput(matrix: numpy.ndarray, word2index: Dict[str, int], missing_words: Set[str])

Bases: tuple

NamedTuple returned by build_matrix()

Create new instance of BuildMatrixOutput(matrix, word2index, missing_words)

matrix: numpy.ndarray – Alias for field number 0

word2index: Dict[str, int] – Alias for field number 1

missing_words: Set[str] – Alias for field number 2

found_words()

word_indexes(words)

Return type: List[int]

vector(word)

pretty(precision=3, threshold=5): Pretty string method for documentation purposes.

class embfile.EmbFile(path, out_dtype=None, verbose=True)

Bases: abc.ABC

(Abstract class) The base class of all the embedding files.

Sub-classes must:

ensure they set attributes vocab_size and vector_size when a file instance is created
implement an EmbFileReader for the format and implement the abstract method _reader()
implement the abstract method _close()
(optionally) implement a VectorsLoader (if they can improve upon the default loader) and override loader()
(optionally) implement an EmbFileCreator for the format and set the class constant Creator
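
The required part of this contract can be illustrated with a toy base class. This is a simplified sketch, not embfile's code: `MiniEmbFile`, `ListReader` and `InMemoryEmbFile` are hypothetical names, and the way open readers are tracked is an assumption made for the example.

```python
from abc import ABC, abstractmethod


class MiniEmbFile(ABC):
    """Toy base class mirroring the required parts of the contract above."""

    def __init__(self, path):
        self.path = path
        self.closed = False
        self._readers = []

    @abstractmethod
    def _reader(self):
        """Return a new reader object exposing close()."""

    @abstractmethod
    def _close(self):
        """Release resources owned by the file itself."""

    def reader(self):
        reader = self._reader()
        self._readers.append(reader)  # remembered so close() can close it
        return reader

    def close(self):
        for reader in self._readers:
            reader.close()
        self._readers.clear()
        self._close()
        self.closed = True


class ListReader:
    """Reader over an in-memory list of (word, vector) pairs."""

    def __init__(self, pairs):
        self._pairs = iter(pairs)
        self.closed = False

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._pairs)

    def close(self):
        self.closed = True


class InMemoryEmbFile(MiniEmbFile):
    def __init__(self, pairs):
        super().__init__(path=None)
        self._pairs = list(pairs)
        # Sub-classes set vocab_size and vector_size at construction time.
        self.vocab_size = len(self._pairs)
        self.vector_size = len(self._pairs[0][1]) if self._pairs else 0

    def _reader(self):
        return ListReader(self._pairs)

    def _close(self):
        pass  # nothing to release for an in-memory file


file = InMemoryEmbFile([("a", [1.0, 2.0]), ("b", [3.0, 4.0])])
reader = file.reader()
file.close()  # also closes the reader created above
```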

Parameters

path (Path) – path of the embedding file (possibly compressed)
out_dtype (numpy.dtype) – all the vectors will be converted to this data type. The sub-class is responsible for setting a suitable default value.
verbose (bool) – whether to show a progress bar by default in all time-consuming operations

Variables

path (Path) – path of the embedding file
vocab_size (int or None) – number of words in the file (can be None for some TextEmbFile)
vector_size (int) – length of the vectors
verbose (bool) – whether to show a progress bar by default in all time-consuming operations
closed (bool) – True if the file was closed

abstractmethod _reader()

(Abstract method) Returns a new reader for the file, which allows efficient iteration over the word vectors in it. Called by reader().

Return type: EmbFileReader

abstractmethod _close()

(Abstract method) Releases any resources used by the EmbFile.

Return type: None

DEFAULT_EXTENSION: str

close()

Releases all the open resources linked to this file, including the opened readers.

Return type: None

reader()

Creates and returns a new file reader. When the file is closed, all readers that are still open are closed automatically.

Return type: EmbFileReader

loader(words, missing_ok=True, verbose=None)

Returns a VectorsLoader, an iterator that looks for the provided words in the file and yields available (word, vector) pairs one by one. If missing_ok=True (default), provides the set of missing words in the property missing_words (once the iteration ends).

See embfile.core.VectorsLoader for more info.

Example

You should use a loader when you need to load many vectors in some custom data structure and you don’t want to waste memory (e.g. build_matrix uses it to load the vectors directly into the matrix):

data_structure = MyCustomStructure()
with file.loader(many_words) as loader:
    for word, vector in loader:
        data_structure[word] = vector
print('Number of missing words:', len(loader.missing_words))

See also

load(), find()

Return type: VectorsLoader

for ... in words()

Returns an iterable for all the words in the file.

Return type: Iterable[str]

for ... in vectors()

Returns an iterable for all the vectors in the file.

Return type: Iterable[ndarray]

for ... in word_vectors()

Returns an iterable for all the (word, vector) pairs in the file.

Return type: Iterable[WordVector]

to_dict(verbose=None)

Returns the entire file content in a dictionary word -> vector.

Return type: Dict[str, ndarray]

to_list(verbose=None)

Returns the entire file content as a list of WordVector objects.

Return type: List[WordVector]

load(words, verbose=None)

Loads the vectors for the input words in a {word: vec} dict, raising KeyError if any word is missing.

Parameters

words (Iterable[str]) – the words to get
verbose (Optional[bool]) – if None, self.verbose is used

Return type

Dict[str, ndarray]

Returns

(Dict[str, VectorType]) – a dictionary {word: vector}

See also

find() - it returns the set of all missing words, instead of raising KeyError.

find(words, verbose=None)

Looks for the input words in the file and returns: 1) a dict {word: vec} containing the available words and 2) a set containing the words not found.

Parameters

words (Iterable[str]) – the words to look for
verbose (Optional[bool]) – if None, self.verbose is used

Return type

_FindOutput

Returns

namedtuple – a namedtuple with the following fields:

word2vec (Dict[str, VectorType]): dictionary {word: vector}
missing_words (Set[str]): set of words not found in the file

See also

load() - which raises KeyError if any word is not found in the file.
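
The difference between the two lookups can be sketched over a plain iterable of (word, vector) pairs. `find_sketch` and `load_sketch` are hypothetical standalone functions that only mirror the documented behaviour (a set of missing words versus a KeyError); stopping the scan early once every word is found is an assumption.

```python
def find_sketch(word_vectors, words):
    """Single pass over (word, vector) pairs; returns (word2vec, missing_words)."""
    missing = set(words)
    word2vec = {}
    for word, vector in word_vectors:
        if word in missing:
            word2vec[word] = vector
            missing.remove(word)
            if not missing:
                break  # every requested word was found; stop scanning
    return word2vec, missing


def load_sketch(word_vectors, words):
    """Like find_sketch, but raises KeyError if any word is missing."""
    word2vec, missing = find_sketch(word_vectors, words)
    if missing:
        raise KeyError(f"words not found: {sorted(missing)}")
    return word2vec


pairs = [("hello", [0.1, 0.2]), ("world", [0.3, 0.4])]
word2vec, missing = find_sketch(pairs, ["hello", "unknown"])
print(sorted(word2vec), sorted(missing))
```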

for ... in filter(condition, verbose=None)

Returns a generator that yields a word vector pair for each word in the file that satisfies a given condition. For example, to get all the words starting with “z”:

list(file.filter(lambda word: word.startswith('z')))

Parameters

condition (Callable[[str], bool]) – a function that, given a word in input, outputs True if the word should be taken
verbose (Optional[bool]) – if True, a progress bar is shown (the bar is updated each time a word is read, not each time a word vector pair is yielded).

Return type

Iterator[Tuple[str, ndarray]]
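
Conceptually, the generator described above behaves like the following standalone sketch, where `filter_word_vectors` is a hypothetical function operating on an in-memory list of pairs rather than on a file.

```python
def filter_word_vectors(word_vectors, condition):
    """Lazily yield the (word, vector) pairs whose word satisfies condition."""
    for word, vector in word_vectors:
        if condition(word):
            yield word, vector


pairs = [("apple", [1.0]), ("zebra", [2.0]), ("zoo", [3.0])]
z_words = list(filter_word_vectors(pairs, lambda word: word.startswith("z")))
```

Because the result is a generator, nothing is read until it is iterated, so wrapping it in `list(...)` is what actually triggers the scan.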

save_vocab(path=None, encoding='utf-8', overwrite=False, verbose=None)

Saves the vocabulary of the embedding file to a text file. By default, the file is saved in the same directory as the embedding file, e.g.:

/path/to/filename.txt.gz  ==> /path/to/filename_vocab.txt
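
One way to derive such a default path is sketched below with pathlib. `default_vocab_path` is a hypothetical helper; the only behaviour taken from the documentation is the single example above, so the list of compression suffixes and the handling of other file names are assumptions.

```python
from pathlib import Path

# Assumed list of compression suffixes; not taken from embfile.
COMPRESSION_EXTENSIONS = {".gz", ".bz2", ".xz", ".lzma", ".zip"}


def default_vocab_path(embedding_path):
    """Map e.g. /path/to/filename.txt.gz to /path/to/filename_vocab.txt."""
    path = Path(embedding_path)
    name = path.name
    # Drop a trailing compression extension, if any (e.g. ".gz").
    if Path(name).suffix.lower() in COMPRESSION_EXTENSIONS:
        name = Path(name).stem
    # Drop the format extension (e.g. ".txt") and append the vocabulary suffix.
    stem = Path(name).stem
    return path.with_name(stem + "_vocab.txt")


print(default_vocab_path("/path/to/filename.txt.gz"))
```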

Parameters

path (Union[str, Path, None]) – where to save the file
encoding (str) – text encoding
overwrite (bool) – if True, overwrite the file if it already exists
verbose (Optional[bool]) – if None, self.verbose is used

Return type

Path

Returns

(Path) – the path to the vocabulary file

classmethod create(out_path, word_vectors, vocab_size=None, compression=None, verbose=True, overwrite=False, **format_kwargs)

Creates a file on disk containing the provided word vectors.

Parameters

out_path (Union[str, Path]) – path to the created file
word_vectors (Dict[str, VectorType] or Iterable[Tuple[str, VectorType]]) – it can be an iterable of word vector tuples or a dictionary word -> vector; the word vectors are written in the order determined by the iterable object.
vocab_size (Optional[int]) – it must be provided if word_vectors has no __len__ and the specific-format creator needs to know a priori the vocabulary size; in any case, the creator should check at the end that the provided vocab_size matches the actual length of word_vectors
compression (Optional[str]) – valid values are: "bz2"|"bz", "gzip"|"gz", "xz"|"lzma", "zip"
verbose (bool) – if True, show progress bars and information
overwrite (bool) – overwrite the file if it already exists
format_kwargs – format-specific arguments

Return type

None

classmethod create_from_file(source_file, out_dir=None, out_filename=None, vocab_size=None, compression=None, verbose=True, overwrite=False, **format_kwargs)

Creates a new file on disk with the same content as another file.

Parameters

source_file (EmbFile) – the file to take data from
out_dir (Union[str, Path, None]) – directory where the file will be stored; by default, it’s the parent directory of the source file
out_filename (Optional[str]) – filename of the produced file (inside out_dir); by default, it is obtained by replacing the extension of the source file with the proper one and appending the compression extension if compression is not None. Note: if you pass this argument, the compression extension is not automatically appended.
vocab_size (Optional[int]) – if the source EmbFile has attribute vocab_size == None, then: if the specific creator requires it (bin and txt formats do), it must be provided; otherwise it can be provided to get an ETA in progress bars.
compression (Optional[str]) – valid values are: "bz2"|"bz", "gzip"|"gz", "xz"|"lzma", "zip"
verbose (bool) – print info and progress bar
overwrite (bool) – overwrite a file with the same name if it already exists
format_kwargs – format-specific arguments (see above)

Return type

Path

class embfile.BinaryEmbFile(path, encoding='utf-8', dtype=dtype('float32'), out_dtype=None, verbose=True)

Bases: embfile.core._file.EmbFile

Format used by the Google word2vec tool. You can use it to read the file GoogleNews-vectors-negative300.bin.

It begins with a text header line of space-separated fields:

<vocab_size> <vector_size>

Each word vector pair is encoded as follows:

encoded word + space
followed by the binary representation of the vector.
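
The layout described above can be reproduced with the standard library alone. The following is a minimal sketch, not BinaryEmbFile's implementation: `write_binary` and `read_binary` are hypothetical helpers, the vectors are assumed to be little-endian 4-byte floats, and the byte-by-byte word scan assumes an encoding such as UTF-8 in which the space byte never occurs inside another character.

```python
import io
import struct


def write_binary(stream, word_vectors, encoding="utf-8"):
    """Write (word, vector) pairs: a text header line, then word + space + raw floats."""
    word_vectors = list(word_vectors)
    vector_size = len(word_vectors[0][1])
    stream.write(f"{len(word_vectors)} {vector_size}\n".encode(encoding))
    for word, vector in word_vectors:
        stream.write(word.encode(encoding) + b" ")
        stream.write(struct.pack(f"<{vector_size}f", *vector))


def read_binary(stream, encoding="utf-8"):
    """Yield (word, vector) pairs from a stream written by write_binary."""
    header = stream.readline().decode(encoding)
    vocab_size, vector_size = (int(field) for field in header.split())
    for _ in range(vocab_size):
        word_bytes = bytearray()
        while True:
            char = stream.read(1)
            if char == b" ":
                break  # the space marks the end of the word
            if not char:
                raise EOFError("file ended inside a word")
            word_bytes += char
        data = stream.read(4 * vector_size)  # 4 bytes per float32 component
        vector = list(struct.unpack(f"<{vector_size}f", data))
        yield word_bytes.decode(encoding), vector


buffer = io.BytesIO()
write_binary(buffer, [("hello", [0.5, -1.25]), ("world", [2.0, 0.0])])
buffer.seek(0)
pairs = list(read_binary(buffer))
```

Since the vector length is known from the header, the reader consumes a fixed number of bytes per vector and never has to search the binary data for a delimiter.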

Variables

path –
encoding –
dtype –
out_dtype –
verbose –

Parameters

path (Union[str, Path]) – path to the (possibly compressed) file
encoding (str) – text encoding; note: if you provide a UTF encoding (e.g. utf-16) that uses a BOM (Byte Order Mark) without specifying the byte-endianness (e.g. utf-16-le or utf-16-be), the little-endian version is used (utf-16-le).
dtype (Union[str, dtype]) – a valid numpy data type (or whatever you can pass to numpy.dtype()) (default: ‘<f4’; little-endian float, 4 bytes)
out_dtype (Union[str, dtype, None]) – all the vectors returned will be converted (if needed) to this data type; by default, it is equal to the original data type of the vectors in the file, i.e. no conversion takes place.

DEFAULT_EXTENSION: str = '.bin'

vocab_size: Optional[int]

classmethod create(out_path, word_vectors, vocab_size=None, compression=None, verbose=True, overwrite=False, encoding='utf-8', dtype=None)

Format-specific arguments are encoding and dtype.

Note: all the text is encoded without BOM (Byte Order Mark). If you pass “utf-16” or “utf-32”, the little-endian version is used (e.g. “utf-16-le”).

See create() for more.

Return type: None

classmethod create_from_file(source_file, out_dir=None, out_filename=None, vocab_size=None, compression=None, verbose=True, overwrite=False, encoding='utf-8', dtype=None)

Format-specific arguments are encoding and dtype.

Note: all the text is encoded without BOM (Byte Order Mark). If you pass “utf-16” or “utf-32”, the little-endian version is used (e.g. “utf-16-le”).

See create_from_file() for more.

Return type: Path

class embfile.TextEmbFile(path, encoding='utf-8', out_dtype='float32', vocab_size=None, verbose=True)

Bases: embfile.core._file.EmbFile

The format used by Glove and FastText files. Each vector pair is stored as a line of text made of space-separated fields:

word vec[0] vec[1] ... vec[vector_size-1]

It may or may not have an (automatically detected) “header”, containing vocab_size and vector_size (in this order).

If the file doesn’t have a header, vector_size is set to the length of the first vector. If you know vocab_size (even an approximate value), you may want to provide it to have ETA in progress bars.

If the file has a header and you provide vocab_size, the provided value is ignored.
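
The header handling described above can be sketched as follows. `parse_text_embeddings` is a hypothetical function, and the detection rule (a first line made of exactly two integer fields) is an assumed heuristic, not necessarily the one TextEmbFile uses.

```python
def parse_text_embeddings(lines):
    """Return (vocab_size, vector_size, pairs) from an iterable of text lines."""
    lines = iter(lines)
    first = next(lines).split()
    vocab_size = None
    pairs = []
    # Assumed heuristic: a header is a line with exactly two integer fields.
    if len(first) == 2 and all(field.isdigit() for field in first):
        vocab_size, vector_size = int(first[0]), int(first[1])
    else:
        # No header: the first line is already a word vector pair.
        pairs.append((first[0], [float(x) for x in first[1:]]))
        vector_size = len(first) - 1
    for line in lines:
        fields = line.split()
        if fields:
            pairs.append((fields[0], [float(x) for x in fields[1:]]))
    return vocab_size, vector_size, pairs


with_header = parse_text_embeddings(["2 3\n", "a 1 2 3\n", "b 4 5 6\n"])
without_header = parse_text_embeddings(["a 1 2 3\n", "b 4 5 6\n"])
```

Without a header, `vocab_size` stays None until the whole file has been read, which is why a known (even approximate) vocabulary size is useful for progress reporting.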

Compressed files are decompressed as you read them. Note that each file reader will decompress the file independently, so if you need to read the file multiple times, it is better to decompress the entire file first and then open it.

Variables

path –
encoding –
out_dtype –
verbose –

Parameters

path (Union[str, Path]) – path to the embedding file
encoding (str) – encoding of the text file; default is utf-8
out_dtype (Union[str, dtype]) – the dtype of the vectors that will be returned; default is single-precision float
vocab_size (Optional[int]) – useful when the file has no header but you know vocab_size; if the file has a header, this argument is ignored.
verbose (int) – default level of verbosity for all methods

DEFAULT_EXTENSION: str = '.txt'

vocab_size: Optional[int]

classmethod create(out_path, word_vectors, vocab_size=None, compression=None, verbose=True, overwrite=False, encoding='utf-8', precision=5)

Creates a file on disk containing the provided word vectors.

Parameters

out_path (Union[str, Path]) – path to the created file
word_vectors (Dict[str, VectorType] or Iterable[Tuple[str, VectorType]]) – it can be an iterable of word vector tuples or a dictionary word -> vector; the word vectors are written in the order determined by the iterable object.
vocab_size (Optional[int]) – it must be provided if word_vectors has no __len__ and the specific-format creator needs to know a priori the vocabulary size; in any case, the creator should check at the end that the provided vocab_size matches the actual length of word_vectors
compression (Optional[str]) – valid values are: "bz2"|"bz", "gzip"|"gz", "xz"|"lzma", "zip"
verbose (bool) – if True, show progress bars and information
overwrite (bool) – overwrite the file if it already exists
format_kwargs – format-specific arguments

Return type

None

classmethod create_from_file(source_file, out_dir=None, out_filename=None, vocab_size=None, compression=None, verbose=True, overwrite=False, encoding='utf-8', precision=5)

Creates a new file on disk with the same content as another file.

Parameters

source_file (EmbFile) – the file to take data from
out_dir (Union[str, Path, None]) – directory where the file will be stored; by default, it’s the parent directory of the source file
out_filename (Optional[str]) – filename of the produced file (inside out_dir); by default, it is obtained by replacing the extension of the source file with the proper one and appending the compression extension if compression is not None. Note: if you pass this argument, the compression extension is not automatically appended.
vocab_size (Optional[int]) – if the source EmbFile has attribute vocab_size == None, then: if the specific creator requires it (bin and txt formats do), it must be provided; otherwise it can be provided to get an ETA in progress bars.
compression (Optional[str]) – valid values are: "bz2"|"bz", "gzip"|"gz", "xz"|"lzma", "zip"
verbose (bool) – print info and progress bar
overwrite (bool) – overwrite a file with the same name if it already exists
format_kwargs – format-specific arguments (see above)

Return type

Path

class embfile.VVMEmbFile(path, out_dtype=None, verbose=True)

Bases: embfile.core._file.EmbFile, embfile.core.loaders.Word2Vector

(Custom format) A tar file storing vocabulary, vectors and metadata in 3 separate files.

Features:

the vocabulary can be loaded very quickly (with no need for an external vocab file) and it is loaded in memory when the file is opened;
direct access to vectors
- by word using __getitem__() (e.g. file['hello'])
- by index using vector_at()
implements __contains__() (e.g. 'hello' in file)
all the information needed to open the file is stored in the file itself

Specifics. The files contained in a VVM file are:

vocab.txt: contains each word on a separate line
vectors.bin: contains the vectors in binary format (concatenated)
meta.json: must contain (at least) the following fields:
- vocab_size: number of word vectors in the file
- vector_size: length of a word vector
- encoding: text encoding used for vocab.txt
- dtype: vector data type string (notation used by numpy)
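
To make the layout concrete, the sketch below builds such a tarball in memory and reads one vector back by index. It uses only the standard library and is not VVMEmbFile's implementation: `add_member`, `create_vvm` and `vector_at` are hypothetical helpers, and the vectors are assumed to be stored as `<f4` (little-endian 4-byte floats).

```python
import io
import json
import struct
import tarfile


def add_member(tar, name, data):
    """Add an in-memory bytes object to the tar archive under the given name."""
    info = tarfile.TarInfo(name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))


def create_vvm(stream, word_vectors, encoding="utf-8"):
    """Write vocab.txt, vectors.bin and meta.json into a tar archive."""
    word_vectors = list(word_vectors)
    vector_size = len(word_vectors[0][1])
    vocab = "\n".join(word for word, _ in word_vectors).encode(encoding)
    vectors = b"".join(
        struct.pack(f"<{vector_size}f", *vector) for _, vector in word_vectors
    )
    meta = {
        "vocab_size": len(word_vectors),
        "vector_size": vector_size,
        "encoding": encoding,
        "dtype": "<f4",
    }
    with tarfile.open(fileobj=stream, mode="w") as tar:
        add_member(tar, "vocab.txt", vocab)
        add_member(tar, "vectors.bin", vectors)
        add_member(tar, "meta.json", json.dumps(meta).encode("utf-8"))


def vector_at(stream, index):
    """Read the vector stored at the given index by seeking inside vectors.bin."""
    stream.seek(0)
    with tarfile.open(fileobj=stream, mode="r") as tar:
        meta = json.load(tar.extractfile("meta.json"))
        vector_size = meta["vector_size"]
        item_size = 4 * vector_size  # "<f4": 4 bytes per component
        vectors = tar.extractfile("vectors.bin")
        vectors.seek(index * item_size)
        return list(struct.unpack(f"<{vector_size}f", vectors.read(item_size)))


archive = io.BytesIO()
create_vvm(archive, [("hello", [0.5, 1.5]), ("world", [-2.0, 4.0])])
second_vector = vector_at(archive, 1)
```

Because every vector has the same byte length, the position of a vector follows directly from its index; combined with a word-to-index map built from vocab.txt, this is what makes lookup by word possible without scanning the file.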

Variables

path –
encoding –
dtype –
out_dtype –
verbose –
vocab (OrderedDict[str, int]) – maps each word to its index in the file

Parameters

path (Union[str, Path]) –
out_dtype (Union[str, dtype, None]) –
verbose (int) –

DEFAULT_EXTENSION: str = '.vvm'

vocab_size: Optional[int]

words()

Returns an iterable for all the words in the file.

Return type: Iterable[str]

__contains__(word)

Returns True if the file contains a vector for word

Return type: bool

vector_at(index)

Returns a vector by its index in the file (random access).

Return type: ndarray

__getitem__(word)

Returns the vector associated to a word (random access to file).

Return type: ndarray

classmethod create(out_path, word_vectors, vocab_size=None, compression=None, verbose=True, overwrite=False, encoding='utf-8', dtype=None)

Format-specific arguments are encoding and dtype.

Since VVM is a tar file, you should use a compression supported by the tarfile package (avoid zip): gz, bz2 or xz.

See create() for more doc.

Return type: None

classmethod create_from_file(source_file, out_dir=None, out_filename=None, vocab_size=None, compression=None, verbose=True, overwrite=False, encoding='utf-8', dtype=None)

Format-specific arguments are encoding and dtype. Since VVM is a tar file, you should use a compression supported by the tarfile package (avoid zip): gz, bz2 or xz.

See create_from_file() for more doc.

Return type: Path

embfile.register_format(format_id, extensions, overwrite=False): Class decorator that associates a new EmbFile sub-class with a format_id and one or multiple extensions. Once you register a format, you can use open() to open files of that format.

embfile.associate_extension(ext, format_id, overwrite=False): Associates a file extension to a registered embedding file format.
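
The registration pattern behind these two functions can be sketched with a pair of dictionaries and a class decorator. Everything below is hypothetical and standalone: the registries, the `_sketch` function names, the `CsvEmbFileSketch` class and the exceptions raised on conflicts are assumptions made for illustration, not embfile's actual behaviour.

```python
# Hypothetical registries for illustration only.
FORMATS_BY_ID = {}
FORMAT_ID_BY_EXTENSION = {}


def associate_extension_sketch(ext, format_id, overwrite=False):
    """Map a file extension to an already registered format_id."""
    if format_id not in FORMATS_BY_ID:
        raise KeyError(f"unknown format_id: {format_id!r}")
    if ext in FORMAT_ID_BY_EXTENSION and not overwrite:
        # Assumption: an existing association is an error unless overwrite=True.
        raise ValueError(f"extension {ext!r} is already associated with a format")
    FORMAT_ID_BY_EXTENSION[ext] = format_id


def register_format_sketch(format_id, extensions, overwrite=False):
    """Class decorator that records the class under format_id and its extensions."""
    def decorator(cls):
        if format_id in FORMATS_BY_ID and not overwrite:
            raise ValueError(f"format_id {format_id!r} is already registered")
        FORMATS_BY_ID[format_id] = cls
        for ext in extensions:
            associate_extension_sketch(ext, format_id, overwrite=overwrite)
        return cls  # the class itself is returned unchanged
    return decorator


@register_format_sketch("csv", extensions=[".csv"])
class CsvEmbFileSketch:
    """Placeholder for a custom format class."""


associate_extension_sketch(".tsv", "csv")
```

With the registries filled in, a generic open function only needs to look up the extension, get the format_id, and instantiate the class stored under it.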

embfile.extract(src_path, member=None, dest_dir='.', dest_filename=None, overwrite=False)

Extracts a file compressed with gzip, bz2 or lzma or a member file inside a zip/tar archive. The compression format is inferred from the extension or from the magic number of the file (in the case of zip and tar).

The file is first extracted to a .part file that is renamed when the extraction is completed.
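
The ".part then rename" technique can be sketched for the gzip case with the standard library. `extract_gzip` is a hypothetical helper, not embfile's code; raising FileExistsError when the destination exists and overwrite is False is an assumption.

```python
import gzip
import os
import shutil
from pathlib import Path


def extract_gzip(src_path, dest_dir=".", dest_filename=None, overwrite=False):
    """Decompress a .gz file into dest_dir through a temporary .part file."""
    src_path = Path(src_path)
    # By default, drop the ".gz" suffix to get the destination file name.
    dest_path = Path(dest_dir) / (dest_filename or src_path.stem)
    if dest_path.exists() and not overwrite:
        # Assumption: refusing to overwrite is signalled with FileExistsError.
        raise FileExistsError(dest_path)
    part_path = dest_path.with_name(dest_path.name + ".part")
    with gzip.open(src_path, "rb") as src, open(part_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    # Rename only after the extraction is complete, so dest_path is never partial.
    os.replace(part_path, dest_path)
    return dest_path
```

If the process is interrupted during decompression, only the `.part` file is left behind, so code that checks for the final file name never picks up a truncated file.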

Parameters

src_path (Union[str, Path]) – source file path
member (Optional[str]) – must be provided if src_path points to an archive that contains multiple files
dest_dir (Union[str, Path]) – destination directory; by default, it’s the current working directory
dest_filename (Optional[str]) – destination filename; by default, it’s equal to member (if provided)
overwrite (bool) – overwrite existing file at dest_path if it already exists

Return type

Path

Returns

Path – the path to the extracted file

embfile.extract_if_missing(src_path, member=None, dest_dir='.', dest_filename=None)

Extracts a file unless it already exists and returns its path.

Note: during extraction, a .part file is used, so there’s no risk of using a partially extracted file.

Parameters

src_path (Union[str, Path]) –
member (Optional[str]) –
dest_dir (Union[str, Path]) –
dest_filename (Optional[str]) –

Return type

Path

Returns

The path of the decompressed file is returned.
