embfile API

Substructure

Classes

BinaryEmbFile(path[, encoding, dtype, …])

Format used by the Google word2vec tool.

BuildMatrixOutput(matrix, word2index, …)

NamedTuple returned by build_matrix()

EmbFile(path[, out_dtype, verbose])

(Abstract class) The base class of all the embedding files.

TextEmbFile(path[, encoding, out_dtype, …])

The format used by Glove and FastText files. Each word-vector pair is stored as a line of text made of space-separated fields.

VVMEmbFile(path[, out_dtype, verbose])

(Custom format) A tar file storing vocabulary, vectors and metadata in 3 separate files.

Functions

associate_extension(ext, format_id[, overwrite])

Associates a file extension to a registered embedding file format.

build_matrix(f, words[, start_index, dtype, …])

Creates an embedding matrix for the provided words.

extract(src_path[, member, dest_dir, …])

Extracts a file compressed with gzip, bz2 or lzma or a member file inside a zip/tar archive.

extract_if_missing(src_path[, member, …])

Extracts a file unless it already exists and returns its path.

open(path[, format_id])

Opens an embedding file inferring the file format from the file extension (if not explicitly provided in format_id).

register_format(format_id, extensions[, …])

Class decorator that associates a new EmbFile sub-class with a format_id and one or multiple extensions.

Data

FORMATS

Maps each EmbFile subclass to a format_id and one or multiple file extensions.

embfile.open(path, format_id=None, **format_kwargs)

Opens an embedding file inferring the file format from the file extension (if not explicitly provided in format_id). Note that you can always open a file using the specific EmbFile subclass; it can be more convenient since you get auto-completion and quick doc for format-specific arguments.

Example:

with embfile.open('path/to/embfile.txt') as f:
    # do something with f

Supported formats:

Class          format_id  Extensions  Description
TextEmbFile    txt        .txt, .vec  Glove/fastText format
BinaryEmbFile  bin        .bin        Google word2vec format
VVMEmbFile     vvm        .vvm        A tarball containing three files: vocab.txt, vectors.bin, meta.json

You can register new formats or extensions using the functions embfile.register_format() and embfile.associate_extension().

Parameters
  • path (Union[str, Path]) – path to the file

  • format_id (Optional[str]) – string ID of the embedding file format. If not provided, it is inferred from the file name. Valid choices are: ‘txt’, ‘bin’, ‘vvm’.

  • format_kwargs – additional format-specific arguments (see doc for specific file formats)

Return type

EmbFile

Returns

An instance of a concrete subclass of EmbFile.

See also

embfile.register_format():

registers your custom EmbFile implementation so it is recognized by this function

embfile.associate_extension():

associates an extension to a registered format
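
The extension-based inference described above can be pictured with a small stand-alone sketch. It does not call embfile: the table EXTENSION_TO_FORMAT and the helper infer_format_id are hypothetical names, and skipping a trailing compression suffix is an assumption made for illustration.

```python
from pathlib import Path

# Hypothetical table mirroring the "Supported formats" list above.
EXTENSION_TO_FORMAT = {".txt": "txt", ".vec": "txt", ".bin": "bin", ".vvm": "vvm"}

# Assumption: a trailing compression suffix is skipped before the lookup.
COMPRESSION_SUFFIXES = {".gz", ".bz2", ".xz", ".zip"}


def infer_format_id(path, format_id=None):
    """Return format_id if given, otherwise infer it from the file extension."""
    if format_id is not None:
        return format_id
    suffixes = [s.lower() for s in Path(path).suffixes]
    if suffixes and suffixes[-1] in COMPRESSION_SUFFIXES:
        suffixes = suffixes[:-1]
    if not suffixes or suffixes[-1] not in EXTENSION_TO_FORMAT:
        raise ValueError(f"cannot infer the format of {path!r}")
    return EXTENSION_TO_FORMAT[suffixes[-1]]
```

An explicit format_id always wins, which matches the note that the extension is only used when format_id is not provided.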

embfile.build_matrix(f, words, start_index=0, dtype=None, oov_initializer=<embfile.initializers.NormalInitializer object>, verbose=None)

Creates an embedding matrix for the provided words. words can be:

  1. an iterable of strings – in this case, the words in the iterable are mapped to consecutive rows of the matrix starting from the row of index start_index (by default, 0); the rows with index i < start_index are left as zeros.

  2. a dictionary {word -> index} that maps each word to a row – in this case, the matrix has shape:

    [max_index + 1, vector_size]
    

    where max_index = max(words.values()). The rows that are not associated with any word are left as zeros. If multiple words are mapped to the same row, the function raises ValueError.
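
The two accepted forms of words can be compared with a short sketch of the row layout only (no vectors involved). The helper row_layout is a hypothetical name, not part of embfile, and dropping duplicate words from an iterable is an assumption.

```python
def row_layout(words, start_index=0):
    """Return (word2index, n_rows) for an iterable of words or a {word: index} dict."""
    if isinstance(words, dict):
        word2index = dict(words)
        indexes = list(word2index.values())
        if len(set(indexes)) != len(indexes):
            raise ValueError("multiple words are mapped to the same row")
        # start_index is ignored here; the matrix has max_index + 1 rows
        n_rows = max(indexes) + 1 if indexes else 0
    else:
        # Assumption: duplicates are dropped, keeping the first occurrence.
        unique = list(dict.fromkeys(words))
        word2index = {w: start_index + i for i, w in enumerate(unique)}
        n_rows = start_index + len(unique)
    return word2index, n_rows
```

With start_index=2, two words occupy rows 2 and 3 and rows 0 and 1 stay at zero; with a dict such as {"a": 0, "b": 5}, the matrix needs six rows and rows 1 to 4 stay at zero.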

In both cases, all the word vectors that are not found in the file are initialized using oov_initializer, which can be:

  1. None – leave missing vectors as zeros

  2. a function that takes the shape of the array to generate (a tuple) as first argument:

    oov_initializer=lambda shape: numpy.random.normal(scale=0.01, size=shape)
    oov_initializer=numpy.ones  # don't use this for word vectors :|
    
  3. an instance of Initializer, which is a “fittable” initializer; in this case, the initializer is fit on the found vectors (the vectors of the words that are both in words and in the file).

By default, oov_initializer is an instance of NormalInitializer which generates vectors using a normal distribution with the same mean and standard deviation of the vectors found.
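
The idea of a “fittable” initializer can be sketched without numpy, using plain lists. The class FittedNormal and its fit/__call__ interface are hypothetical stand-ins; whether the library pools all components or works per dimension is not stated here, so pooling is an assumption.

```python
import random
import statistics


class FittedNormal:
    """Hypothetical stand-in for a fittable initializer (plain lists, no numpy)."""

    def __init__(self, seed=None):
        self.mean, self.std = 0.0, 1.0
        self._rng = random.Random(seed)

    def fit(self, vectors):
        # Assumption: pool every component of the found vectors.
        values = [x for vec in vectors for x in vec]
        self.mean = statistics.fmean(values)
        self.std = statistics.pstdev(values)

    def __call__(self, shape):
        # Generate a rows x cols nested list drawn from the fitted distribution.
        rows, cols = shape
        return [[self._rng.gauss(self.mean, self.std) for _ in range(cols)]
                for _ in range(rows)]


found = [[0.0, 2.0], [4.0, 6.0]]      # vectors found in the file
init = FittedNormal(seed=0)
init.fit(found)
oov_vectors = init((3, 2))            # three missing words, vector_size == 2
```

A plain callable such as the lambda shown earlier skips the fit step; only the fittable form adapts to the statistics of the vectors that were found.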

Parameters
  • f (EmbFile) – the file containing the word vectors

  • words (Iterable[str] or Dict[str, int]) – iterable of words or dictionary that maps each word to a row index

  • start_index (int) – ignored if words is a dict; if words is an iterable, determines the index associated with the first word (and so the number of rows left as zeros at the beginning of the matrix)

  • dtype (optional, DType) – matrix data type; if None, f.out_dtype is used

  • oov_initializer (optional, Callable or Initializer) – initializer for out-of-(file)-vocabulary word vectors. See the description above for more information.

  • verbose (bool) – if None, f.verbose is used

Return type

BuildMatrixOutput

class embfile.BuildMatrixOutput(matrix: numpy.ndarray, word2index: Dict[str, int], missing_words: Set[str])

Bases: tuple

NamedTuple returned by build_matrix()

Create new instance of BuildMatrixOutput(matrix, word2index, missing_words)

matrix: numpy.ndarray

Alias for field number 0

word2index: Dict[str, int]

Alias for field number 1

missing_words: Set[str]

Alias for field number 2

found_words()
word_indexes(words)
Return type

List[int]

vector(word)
pretty(precision=3, threshold=5)

Pretty string method for documentation purposes.

class embfile.EmbFile(path, out_dtype=None, verbose=True)

Bases: abc.ABC

(Abstract class) The base class of all the embedding files.

Sub-classes must:

  1. ensure they set attributes vocab_size and vector_size when a file instance is created

  2. implement an EmbFileReader for the format and implement the abstract method _reader()

  3. implement the abstract method _close()

  4. (optionally) implement a VectorsLoader (if they can improve upon the default loader) and override loader()

  5. (optionally) implement an EmbFileCreator for the format and set the class constant Creator
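
The first three requirements follow the usual abstract-base-class pattern. The sketch below illustrates that contract with stand-in classes (BaseFile and InMemoryFile are hypothetical and do not reproduce embfile's internals); the optional loader and creator are left out.

```python
from abc import ABC, abstractmethod


class BaseFile(ABC):
    """Hypothetical stand-in for the abstract base class (not embfile's code)."""

    def __init__(self, path):
        self.path = path
        self.closed = False
        self._readers = []

    @abstractmethod
    def _reader(self):
        """Return a fresh iterator over (word, vector) pairs."""

    @abstractmethod
    def _close(self):
        """Release format-specific resources."""

    def reader(self):
        r = self._reader()
        self._readers.append(r)   # tracked so close() can drop them
        return r

    def close(self):
        self._readers.clear()
        self._close()
        self.closed = True


class InMemoryFile(BaseFile):
    def __init__(self, path, pairs):
        super().__init__(path)
        self._pairs = list(pairs)
        # Requirement 1: set vocab_size and vector_size at creation time.
        self.vocab_size = len(self._pairs)
        self.vector_size = len(self._pairs[0][1]) if self._pairs else 0

    def _reader(self):            # requirement 2
        return iter(self._pairs)

    def _close(self):             # requirement 3
        self._pairs = []
```

Because both methods are abstract, instantiating the base class directly raises TypeError, which is how Python enforces the contract.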

Parameters
  • path (Path) – path of the embedding file (possibly compressed)

  • out_dtype (numpy.dtype) – all the vectors will be converted to this data type. The sub-class is responsible for setting a suitable default value.

  • verbose (bool) – whether to show a progress bar by default in all time-consuming operations

Variables
  • path (Path) – path of the embedding file

  • vocab_size (int or None) – number of words in the file (can be None for some TextEmbFile)

  • vector_size (int) – length of the vectors

  • verbose (bool) – whether to show a progress bar by default in all time-consuming operations

  • closed (bool) – True if the file was closed

abstractmethod _reader()

(Abstract method) Returns a new reader for the file that allows efficient iteration over the word vectors in it. Called by reader().

Return type

EmbFileReader

abstractmethod _close()

(Abstract method) Releases any resources used by the EmbFile.

Return type

None

DEFAULT_EXTENSION: str
close()

Releases all the open resources linked to this file, including the opened readers.

Return type

None

reader()

Creates and returns a new file reader. When the file is closed, all the readers that are still open are closed automatically.

Return type

EmbFileReader

loader(words, missing_ok=True, verbose=None)

Returns a VectorsLoader, an iterator that looks for the provided words in the file and yields available (word, vector) pairs one by one. If missing_ok=True (default), provides the set of missing words in the property missing_words (once the iteration ends).

See embfile.core.VectorsLoader for more info.

Example

You should use a loader when you need to load many vectors in some custom data structure and you don’t want to waste memory (e.g. build_matrix uses it to load the vectors directly into the matrix):

data_structure = MyCustomStructure()  # any mapping-like container you define
with file.loader(many_words) as loader:
    for word, vector in loader:
        data_structure[word] = vector
print('Number of missing words:', len(loader.missing_words))

See also

load(), find()

Return type

VectorsLoader
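
The behaviour described for a loader (yield the available pairs one by one, expose the missing words once the iteration ends) can be sketched over any iterable of pairs. SimpleLoader is a hypothetical class, not the library's VectorsLoader, and stopping early once every word has been found is an assumption.

```python
class SimpleLoader:
    """Hypothetical loader: yields requested pairs and records missing words."""

    def __init__(self, word_vectors, words):
        self._source = iter(word_vectors)
        self._wanted = set(words)
        self.missing_words = None   # only known once the iteration ends

    def __iter__(self):
        remaining = set(self._wanted)
        for word, vector in self._source:
            if word in remaining:
                remaining.discard(word)
                yield word, vector
                if not remaining:   # everything was found: stop reading
                    break
        self.missing_words = remaining


file_content = [("a", [1.0]), ("b", [2.0]), ("c", [3.0])]
loader = SimpleLoader(file_content, ["c", "a", "zzz"])
loaded = dict(loader)               # consumes the loader pair by pair
```

Pairs come out in file order rather than in the order of the requested words, and missing_words stays None until the loop has run to completion.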

for ... in words()

Returns an iterable for all the words in the file.

Return type

Iterable[str]

for ... in vectors()

Returns an iterable for all the vectors in the file.

Return type

Iterable[ndarray]

for ... in word_vectors()

Returns an iterable for all the (word, vector) pairs in the file.

Return type

Iterable[WordVector]

to_dict(verbose=None)

Returns the entire file content in a dictionary word -> vector.

Return type

Dict[str, ndarray]

to_list(verbose=None)

Returns the entire file content in a list of WordVector objects.

Return type

List[WordVector]

load(words, verbose=None)

Loads the vectors for the input words in a {word: vec} dict, raising KeyError if any word is missing.

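Parameters
  • words (Iterable[str]) – the words to load

  • verbose (Optional[bool]) – if None, self.verbose is used
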
Return type

Dict[str, ndarray]

Returns

(Dict[str, VectorType]) – a dictionary {word: vector}

See also

find() - it returns the set of all missing words, instead of raising KeyError.

find(words, verbose=None)

Looks for the input words in the file and returns: 1) a dict {word: vec} containing the available words and 2) a set containing the words not found.

Parameters
  • words (Iterable[str]) – the words to look for

  • verbose (Optional[bool]) – if None, self.verbose is used

Return type

_FindOutput

Returns

namedtuple – a namedtuple with the following fields:

  • word2vec (Dict[str, VectorType]): dictionary {word: vector}

  • missing_words (Set[str]): set of words not found in the file

See also

load() - which raises KeyError if any word is not found in the file.
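
The difference between load() and find() comes down to how missing words are reported. A minimal sketch over an in-memory list of pairs follows; the functions and the FindOutput tuple are hypothetical stand-ins that take the pairs as an explicit argument, unlike the methods documented here.

```python
from collections import namedtuple

# Hypothetical result type with the two fields described above.
FindOutput = namedtuple("FindOutput", ["word2vec", "missing_words"])


def find(word_vectors, words):
    """Return the vectors that are available plus the set of words not found."""
    wanted = set(words)
    word2vec = {w: v for w, v in word_vectors if w in wanted}
    return FindOutput(word2vec, wanted.difference(word2vec))


def load(word_vectors, words):
    """Same lookup, but any missing word is an error."""
    word2vec, missing = find(word_vectors, words)
    if missing:
        raise KeyError(f"words not found: {sorted(missing)}")
    return word2vec


data = [("a", [1.0]), ("b", [2.0])]
result = find(data, ["a", "x"])
```

Use the find-style call when missing words are expected and the load-style call when they indicate a bug.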

for ... in filter(condition, verbose=None)

Returns a generator that yields a word vector pair for each word in the file that satisfies a given condition. For example, to get all the words starting with “z”:

list(file.filter(lambda word: word.startswith('z')))
Parameters
  • condition (Callable[[str], bool]) – a function that, given a word in input, outputs True if the word should be taken

  • verbose (Optional[bool]) – if True, a progress bar is shown (the bar is updated each time a word is read, not each time a word vector pair is yielded).

Return type

Iterator[Tuple[str, ndarray]]

save_vocab(path=None, encoding='utf-8', overwrite=False, verbose=None)

Saves the vocabulary of the embedding file to a text file. By default, the file is saved in the same directory as the embedding file, e.g.:

/path/to/filename.txt.gz  ==> /path/to/filename_vocab.txt
Parameters
  • path (Union[str, Path, None]) – where to save the file

  • encoding (str) – text encoding

  • overwrite (bool) – if True, overwrite the file if it already exists

  • verbose (Optional[bool]) – if None, self.verbose is used

Return type

Path

Returns

(Path) – the path to the vocabulary file
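
The default destination shown above (/path/to/filename.txt.gz becoming /path/to/filename_vocab.txt) can be reproduced with pathlib. The helper default_vocab_path and the list of compression suffixes are assumptions; only the single documented example is known to match.

```python
from pathlib import Path

COMPRESSION_SUFFIXES = {".gz", ".bz2", ".xz", ".zip"}   # assumption


def default_vocab_path(emb_path):
    """Sketch of the default vocabulary path for an embedding file."""
    path = Path(emb_path)
    name = path.name
    if path.suffix.lower() in COMPRESSION_SUFFIXES:
        name = Path(name).stem      # drop ".gz" and the like
    name = Path(name).stem          # drop the format extension
    return path.with_name(name + "_vocab.txt")
```

For example, default_vocab_path("/path/to/filename.txt.gz") yields the path given in the description above.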

classmethod create(out_path, word_vectors, vocab_size=None, compression=None, verbose=True, overwrite=False, **format_kwargs)

Creates a file on disk containing the provided word vectors.

Parameters
  • out_path (Union[str, Path]) – path to the created file

  • word_vectors (Dict[str, VectorType] or Iterable[Tuple[str, VectorType]]) – it can be an iterable of word vector tuples or a dictionary word -> vector; the word vectors are written in the order determined by the iterable object.

  • vocab_size (Optional[int]) – it must be provided if word_vectors has no __len__ and the specific-format creator needs to know a priori the vocabulary size; in any case, the creator should check at the end that the provided vocab_size matches the actual length of word_vectors

  • compression (Optional[str]) – valid values are: "bz2"|"bz", "gzip"|"gz", "xz"|"lzma", "zip"

  • verbose (bool) – if True, show progress bars and information

  • overwrite (bool) – overwrite the file if it already exists

  • format_kwargs – format-specific arguments

Return type

None

classmethod create_from_file(source_file, out_dir=None, out_filename=None, vocab_size=None, compression=None, verbose=True, overwrite=False, **format_kwargs)

Creates a new file on disk with the same content as another file.

Parameters
  • source_file (EmbFile) – the file to take data from

  • out_dir (Union[str, Path, None]) – directory where the file will be stored; by default, it’s the parent directory of the source file

  • out_filename (Optional[str]) – name of the produced file (inside out_dir); by default, it is obtained by replacing the extension of the source file with the proper one and appending the compression extension if compression is not None. Note: if you pass this argument, the compression extension is not automatically appended.

  • vocab_size (Optional[int]) – if the source EmbFile has vocab_size == None, it must be provided when the specific creator requires it (the bin and txt formats do); otherwise, it can be provided to get an ETA in progress bars.

  • compression (Optional[str]) – valid values are: "bz2"|"bz", "gzip"|"gz", "xz"|"lzma", "zip"

  • verbose (bool) – print info and progress bar

  • overwrite (bool) – overwrite a file with the same name if it already exists

  • format_kwargs – format-specific arguments (see above)

Return type

Path

class embfile.BinaryEmbFile(path, encoding='utf-8', dtype=dtype('float32'), out_dtype=None, verbose=True)

Bases: embfile.core._file.EmbFile

Format used by the Google word2vec tool. You can use it to read the file GoogleNews-vectors-negative300.bin.

It begins with a text header line of space-separated fields:

<vocab_size> <vector_size>

Each word vector pair is encoded as follows:

  • encoded word + space

  • followed by the binary representation of the vector.
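
The layout described above (a text header, then each word followed by a space and the raw vector bytes) can be exercised with a small round trip using the standard struct module. write_binary and read_binary are hypothetical helpers, not embfile functions. The sketch assumes little-endian float32 values and UTF-8 text, and it does not handle the newline that some writers add after each vector.

```python
import io
import struct


def write_binary(stream, word_vectors, vector_size, encoding="utf-8"):
    """Write the header line, then each word, a space and the packed vector."""
    pairs = list(word_vectors)
    stream.write(f"{len(pairs)} {vector_size}\n".encode(encoding))
    for word, vector in pairs:
        stream.write(word.encode(encoding) + b" ")
        stream.write(struct.pack(f"<{vector_size}f", *vector))


def read_binary(stream, encoding="utf-8"):
    """Yield (word, vector) pairs from a stream written by write_binary."""
    vocab_size, vector_size = map(int, stream.readline().split())
    for _ in range(vocab_size):
        word = bytearray()
        while True:
            byte = stream.read(1)   # byte-wise scan: valid for UTF-8 only
            if not byte:
                raise EOFError("unexpected end of file while reading a word")
            if byte == b" ":
                break
            word += byte
        raw = stream.read(4 * vector_size)          # 4 bytes per float32
        yield word.decode(encoding), list(struct.unpack(f"<{vector_size}f", raw))


buf = io.BytesIO()
write_binary(buf, [("hello", [0.5, -1.25]), ("world", [2.0, 0.0])], vector_size=2)
buf.seek(0)
pairs = list(read_binary(buf))
```

The values in the example are exactly representable in float32, so the round trip returns them unchanged.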

Variables
  • path

  • encoding

  • dtype

  • out_dtype

  • verbose

Parameters
  • path (Union[str, Path]) – path to the (possibly compressed) file

  • encoding (str) – text encoding; note: if you provide a UTF encoding (e.g. utf-16) that uses a BOM (Byte Order Mark) without specifying the byte order (e.g. utf-16-le or utf-16-be), the little-endian version is used (utf-16-le).

  • dtype (Union[str, dtype]) – a valid numpy data type (or whatever you can pass to numpy.dtype()) (default: ‘<f4’; little-endian float, 4 bytes)

  • out_dtype (Union[str, dtype, None]) – all the vectors returned will be converted to this data type if necessary; by default, it is equal to the original data type of the vectors in the file, i.e. no conversion takes place.

DEFAULT_EXTENSION: str = '.bin'
vocab_size: Optional[int]
classmethod create(out_path, word_vectors, vocab_size=None, compression=None, verbose=True, overwrite=False, encoding='utf-8', dtype=None)

Format-specific arguments are encoding and dtype.

Note: all the text is encoded without BOM (Byte Order Mark). If you pass “utf-16” or “utf-32”, the little-endian version is used (e.g. “utf-16-le”).

See EmbFile.create() for more.

Return type

None

classmethod create_from_file(source_file, out_dir=None, out_filename=None, vocab_size=None, compression=None, verbose=True, overwrite=False, encoding='utf-8', dtype=None)

Format-specific arguments are encoding and dtype.

Note: all the text is encoded without BOM (Byte Order Mark). If you pass “utf-16” or “utf-32”, the little-endian version is used (e.g. “utf-16-le”).

See EmbFile.create_from_file() for more.

Return type

Path

class embfile.TextEmbFile(path, encoding='utf-8', out_dtype='float32', vocab_size=None, verbose=True)

Bases: embfile.core._file.EmbFile

The format used by Glove and FastText files. Each word-vector pair is stored as a line of text made of space-separated fields:

word vec[0] vec[1] ... vec[vector_size-1]

It may or may not have a header (detected automatically) containing vocab_size and vector_size, in this order.

If the file doesn’t have a header, vector_size is set to the length of the first vector. If you know vocab_size (even an approximate value), you may want to provide it to get an ETA in progress bars.

If the file has a header and you provide vocab_size, the provided value is ignored.
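
The header rules above can be illustrated with a small parser over lines of text. parse_text is a hypothetical helper; the detection heuristic (a first line made of exactly two integers) is an assumption, since the library's actual detection logic is not described here.

```python
def parse_text(lines):
    """Return (vocab_size or None, vector_size, pairs) from an iterable of lines."""
    lines = [ln.rstrip("\n") for ln in lines if ln.strip()]
    vocab_size = None
    first = lines[0].split(" ")
    # Assumed heuristic: a header is a first line made of exactly two integers.
    if len(first) == 2 and all(field.isdigit() for field in first):
        vocab_size, vector_size = int(first[0]), int(first[1])
        lines = lines[1:]
    else:
        # No header: vector_size is the length of the first vector.
        vector_size = len(first) - 1
    pairs = []
    for line in lines:
        word, *fields = line.split(" ")
        pairs.append((word, [float(x) for x in fields]))
    return vocab_size, vector_size, pairs


with_header = ["2 3\n", "cat 0.1 0.2 0.3\n", "dog 1 2 3\n"]
no_header = ["cat 0.1 0.2 0.3", "dog 1 2 3"]
```

Without a header, vocab_size stays None until the whole file has been read, which is why a user-supplied estimate helps progress reporting.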

Compressed files are decompressed as you read them. Note that each file reader decompresses the file independently, so if you need to read the file multiple times, it is better to decompress the entire file first and then open it.

Variables
  • path

  • encoding

  • out_dtype

  • verbose

Parameters
  • path (Union[str, Path]) – path to the embedding file

  • encoding (str) – encoding of the text file; default is utf-8

  • out_dtype (Union[str, dtype]) – the dtype of the vectors that will be returned; default is single-precision float

  • vocab_size (Optional[int]) – useful when the file has no header but you know vocab_size; if the file has a header, this argument is ignored.

  • verbose (int) – default level of verbosity for all methods

DEFAULT_EXTENSION: str = '.txt'
vocab_size: Optional[int]
classmethod create(out_path, word_vectors, vocab_size=None, compression=None, verbose=True, overwrite=False, encoding='utf-8', precision=5)

Creates a file on disk containing the provided word vectors.

Parameters
  • out_path (Union[str, Path]) – path to the created file

  • word_vectors (Dict[str, VectorType] or Iterable[Tuple[str, VectorType]]) – it can be an iterable of word vector tuples or a dictionary word -> vector; the word vectors are written in the order determined by the iterable object.

  • vocab_size (Optional[int]) – it must be provided if word_vectors has no __len__ and the specific-format creator needs to know a priori the vocabulary size; in any case, the creator should check at the end that the provided vocab_size matches the actual length of word_vectors

  • compression (Optional[str]) – valid values are: "bz2"|"bz", "gzip"|"gz", "xz"|"lzma", "zip"

  • verbose (bool) – if True, show progress bars and information

  • overwrite (bool) – overwrite the file if it already exists

  • format_kwargs – format-specific arguments

Return type

None

classmethod create_from_file(source_file, out_dir=None, out_filename=None, vocab_size=None, compression=None, verbose=True, overwrite=False, encoding='utf-8', precision=5)

Creates a new file on disk with the same content as another file.

Parameters
  • source_file (EmbFile) – the file to take data from

  • out_dir (Union[str, Path, None]) – directory where the file will be stored; by default, it’s the parent directory of the source file

  • out_filename (Optional[str]) – name of the produced file (inside out_dir); by default, it is obtained by replacing the extension of the source file with the proper one and appending the compression extension if compression is not None. Note: if you pass this argument, the compression extension is not automatically appended.

  • vocab_size (Optional[int]) – if the source EmbFile has vocab_size == None, it must be provided when the specific creator requires it (the bin and txt formats do); otherwise, it can be provided to get an ETA in progress bars.

  • compression (Optional[str]) – valid values are: "bz2"|"bz", "gzip"|"gz", "xz"|"lzma", "zip"

  • verbose (bool) – print info and progress bar

  • overwrite (bool) – overwrite a file with the same name if it already exists

  • format_kwargs – format-specific arguments (see above)

Return type

Path

class embfile.VVMEmbFile(path, out_dtype=None, verbose=True)

Bases: embfile.core._file.EmbFile, embfile.core.loaders.Word2Vector

(Custom format) A tar file storing vocabulary, vectors and metadata in 3 separate files.

Features:

  1. the vocabulary can be loaded very quickly (with no need for an external vocab file) and it is loaded in memory when the file is opened;

  2. direct access to vectors

  3. implements __contains__() (e.g. 'hello' in file)

  4. all the information needed to open the file is stored in the file itself

Specifics. The files contained in a VVM file are:

  • vocab.txt: contains each word on a separate line

  • vectors.bin: contains the vectors in binary format (concatenated)

  • meta.json: must contain (at least) the following fields:

    • vocab_size: number of word vectors in the file

    • vector_size: length of a word vector

    • encoding: text encoding used for vocab.txt

    • dtype: vector data type string (notation used by numpy)
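
The three-member layout lends itself to direct access: the position of a word in vocab.txt gives the byte offset of its vector in vectors.bin. The sketch below builds and reads such an archive in memory with the standard tarfile module. build_vvm and vector_of are hypothetical helpers, and the choice of '<f4' with one word per line is an assumption consistent with the description above, not a statement about how embfile writes its files.

```python
import io
import json
import struct
import tarfile


def build_vvm(word_vectors, encoding="utf-8"):
    """Return the bytes of a tar archive laid out like the description above."""
    words = [w for w, _ in word_vectors]
    vector_size = len(word_vectors[0][1])
    members = {
        "vocab.txt": "\n".join(words).encode(encoding),
        "vectors.bin": b"".join(
            struct.pack(f"<{vector_size}f", *vec) for _, vec in word_vectors
        ),
        "meta.json": json.dumps({
            "vocab_size": len(words), "vector_size": vector_size,
            "encoding": encoding, "dtype": "<f4",
        }).encode("utf-8"),
    }
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, data in members.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def vector_of(vvm_bytes, word):
    """Random access: find the word's index, then seek into vectors.bin."""
    with tarfile.open(fileobj=io.BytesIO(vvm_bytes), mode="r") as tar:
        meta = json.load(tar.extractfile("meta.json"))
        vocab = tar.extractfile("vocab.txt").read().decode(meta["encoding"])
        index = vocab.split("\n").index(word)
        vectors = tar.extractfile("vectors.bin")
        n_bytes = 4 * meta["vector_size"]       # "<f4" means 4 bytes per value
        vectors.seek(index * n_bytes)
        raw = vectors.read(n_bytes)
        return list(struct.unpack(f"<{meta['vector_size']}f", raw))


data = [("a", [1.0, 2.0]), ("b", [3.0, 4.0])]
blob = build_vvm(data)
```

Seeking works here because the archive is an uncompressed tar; a compressed stream would generally have to be decompressed up to the requested offset.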

Variables
  • path

  • encoding

  • dtype

  • out_dtype

  • verbose

  • vocab (OrderedDict[str, int]) – maps each word to its index in the file

DEFAULT_EXTENSION: str = '.vvm'
vocab_size: Optional[int]
words()

Returns an iterable for all the words in the file.

Return type

Iterable[str]

__contains__(word)

Returns True if the file contains a vector for word

Return type

bool

vector_at(index)

Returns a vector by its index in the file (random access).

Return type

ndarray

__getitem__(word)

Returns the vector associated to a word (random access to file).

Return type

ndarray

classmethod create(out_path, word_vectors, vocab_size=None, compression=None, verbose=True, overwrite=False, encoding='utf-8', dtype=None)

Format-specific arguments are encoding and dtype.

Since VVM is a tar file, you should use a compression supported by the tarfile package (avoid zip): gz, bz2 or xz.

See EmbFile.create() for more documentation.

Return type

None

classmethod create_from_file(source_file, out_dir=None, out_filename=None, vocab_size=None, compression=None, verbose=True, overwrite=False, encoding='utf-8', dtype=None)

Format-specific arguments are encoding and dtype. Since VVM is a tar file, you should use a compression supported by the tarfile package (avoid zip): gz, bz2 or xz.

See EmbFile.create_from_file() for more documentation.

Return type

Path

embfile.register_format(format_id, extensions, overwrite=False)

Class decorator that associates a new EmbFile sub-class with a format_id and one or multiple extensions. Once you register a format, you can use open() to open files of that format.

embfile.associate_extension(ext, format_id, overwrite=False)

Associates a file extension to a registered embedding file format.
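
A class decorator that fills a registry is a common way to implement this kind of registration. The sketch below shows the pattern with local stand-ins that reuse the documented function names but are not the library's code; the two dictionaries, the error type and the exact effect of overwrite are assumptions.

```python
FORMAT_CLASSES = {}      # hypothetical registry: format_id -> class
EXTENSION_FORMATS = {}   # hypothetical registry: extension -> format_id


def associate_extension(ext, format_id, overwrite=False):
    """Map an extension to a format_id that has already been registered."""
    if format_id not in FORMAT_CLASSES:
        raise ValueError(f"unknown format_id: {format_id!r}")
    if ext in EXTENSION_FORMATS and not overwrite:
        raise ValueError(f"{ext!r} is already associated with a format")
    EXTENSION_FORMATS[ext] = format_id


def register_format(format_id, extensions, overwrite=False):
    """Class decorator: record the class, then associate its extensions."""
    def decorator(cls):
        if format_id in FORMAT_CLASSES and not overwrite:
            raise ValueError(f"format_id {format_id!r} is already registered")
        FORMAT_CLASSES[format_id] = cls
        for ext in extensions:
            associate_extension(ext, format_id, overwrite)
        return cls
    return decorator


@register_format("csv", extensions=[".csv"])
class CsvFile:           # stands in for an EmbFile subclass
    pass


associate_extension(".tsv", "csv")
```

Returning cls unchanged from the decorator keeps the class usable directly, which matches the earlier note that a file can always be opened through its specific subclass.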

embfile.extract(src_path, member=None, dest_dir='.', dest_filename=None, overwrite=False)

Extracts a file compressed with gzip, bz2 or lzma or a member file inside a zip/tar archive. The compression format is inferred from the extension or from the magic number of the file (in the case of zip and tar).

The file is first extracted to a .part file that is renamed when the extraction is completed.

Parameters
  • src_path (Union[str, Path]) – source file path

  • member (Optional[str]) – must be provided if src_path points to an archive that contains multiple files

  • dest_dir (Union[str, Path]) – destination directory; by default, it’s the current working directory

  • dest_filename (Optional[str]) – destination filename; by default, it’s equal to member (if provided)

  • overwrite (bool) – overwrite the destination file if it already exists

Return type

Path

Returns

Path – the path to the extracted file

embfile.extract_if_missing(src_path, member=None, dest_dir='.', dest_filename=None)

Extracts a file unless it already exists and returns its path.

Note: during extraction, a .part file is used, so there’s no risk of using a partially extracted file.

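Parameters
  • src_path (Union[str, Path]) – source file path

  • member (Optional[str]) – must be provided if src_path points to an archive that contains multiple files

  • dest_dir (Union[str, Path]) – destination directory; by default, it’s the current working directory

  • dest_filename (Optional[str]) – destination filename; by default, it’s equal to member (if provided)
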
Return type

Path

Returns

The path of the decompressed file is returned.
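
The .part mechanism mentioned for both extraction functions can be sketched for the gzip case with the standard library. extract_gzip and extract_gzip_if_missing are hypothetical helpers that cover only single-file gzip sources; raising FileExistsError when overwrite is False is an assumption about the error behaviour.

```python
import gzip
import os
import shutil
from pathlib import Path


def extract_gzip(src_path, dest_dir=".", overwrite=False):
    """Decompress src_path into dest_dir through a temporary .part file."""
    src_path = Path(src_path)
    dest_path = Path(dest_dir) / src_path.stem       # "emb.txt.gz" -> "emb.txt"
    if dest_path.exists() and not overwrite:
        raise FileExistsError(dest_path)
    part_path = dest_path.with_name(dest_path.name + ".part")
    with gzip.open(src_path, "rb") as src, open(part_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.replace(part_path, dest_path)                 # rename only when complete
    return dest_path


def extract_gzip_if_missing(src_path, dest_dir="."):
    """Return the extracted path, extracting only if it does not exist yet."""
    dest_path = Path(dest_dir) / Path(src_path).stem
    return dest_path if dest_path.exists() else extract_gzip(src_path, dest_dir)
```

Because the final name only appears after the rename, a file found at the destination path can be treated as fully extracted, which is what makes the "if missing" check safe.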