Usage

Opening a file

The core class of the package is the abstract class EmbFile. Three subclasses are implemented, one per supported format. Each format is associated with a format_id (string) and one or more file extensions:

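Class          format_id    Extensions    Description
-------------  -----------  ------------  ------------------------------------------------------------------------------------
TextEmbFile    txt          .txt, .vec    Glove/fastText format
BinaryEmbFile  bin          .bin          Google word2vec format
VVMEmbFile     vvm          .vvm          Custom format storing vocabulary, vectors and metadata in separate files inside a TAR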

You can open an embedding file either:

  • using the constructor of any of the subclasses above:

    from embfile import BinaryEmbFile
    
    with BinaryEmbFile('GoogleNews-vectors-negative300.bin') as file:
        ...
    
  • or using embfile.open(), which by default infers the file format from the file extension:

    import embfile
    
    with embfile.open('GoogleNews-vectors-negative300.bin') as file:
        print(file)
    
    """ Will print:
    BinaryEmbFile (
      path = GoogleNews-vectors-negative300.bin,
      vocab_size = 3000000,
      vector_size = 300
    )
    """
    

    You can force a particular format by passing the format_id argument.

All the path arguments can either be of type string or pathlib.Path. Object attributes storing paths are always pathlib.Path, not strings.
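
To make the inference step concrete, here is a minimal sketch of an extension lookup with an explicit override. The names EXTENSION_TO_FORMAT and infer_format_id are hypothetical and are not part of embfile; the mapping only mirrors the table above, and the real lookup may differ (for example in how it treats compressed files).

```python
from pathlib import Path
from typing import Optional, Union

# Hypothetical mapping that mirrors the table above; not embfile's internal registry.
EXTENSION_TO_FORMAT = {
    '.txt': 'txt',
    '.vec': 'txt',
    '.bin': 'bin',
    '.vvm': 'vvm',
}


def infer_format_id(path: Union[str, Path], format_id: Optional[str] = None) -> str:
    """Return format_id if given, otherwise look up the file extension."""
    if format_id is not None:
        return format_id  # an explicit format always wins over the extension
    extension = Path(path).suffix.lower()  # Path() accepts both str and Path
    try:
        return EXTENSION_TO_FORMAT[extension]
    except KeyError:
        raise ValueError(f'cannot infer the format from extension {extension!r}') from None


print(infer_format_id('GoogleNews-vectors-negative300.bin'))
print(infer_format_id(Path('vectors.w2v'), format_id='bin'))
```

Wrapping the argument in Path() is also a simple way to accept both strings and pathlib.Path objects while always storing a Path.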

Shared arguments

All the EmbFile subclasses support two optional arguments (that you can safely pass to embfile.open as well):

  • out_dtype (numpy.dtype) – if provided, all vectors read from the file are converted to this data type (if needed) before being returned;

  • verbose (bool) – sets the default value of the verbose argument exposed by all time-consuming EmbFile methods; when verbose is True, progress bars are displayed by default; you can always pass verbose=False to a method to disable console output.

Format-specific arguments

For format-specific arguments, check out the specific class documentation:

BinaryEmbFile(path[, encoding, dtype, …])

Format used by the Google word2vec tool.

TextEmbFile(path[, encoding, out_dtype, …])

The format used by Glove and FastText files. Each word vector pair is stored as a line of text made of space-separated fields.

VVMEmbFile(path[, out_dtype, verbose])

(Custom format) A tar file storing vocabulary, vectors and metadata in 3 separate files.

You can pass format-specific arguments to embfile.open too.

Compressed files

How to handle compression is left to the EmbFile subclasses. As a general rule, a concrete EmbFile requires non-compressed files unless its docstring says otherwise. In most cases, you will want to work on non-compressed files anyway, because reading them is much faster.

embfile provides utilities to work with compression in the submodule compression; the following functions can be used (or imported) directly from the root module:

embfile.extract(src_path[, member, …])

Extracts a file compressed with gzip, bz2 or lzma or a member file inside a zip/tar archive.

embfile.extract_if_missing(src_path[, …])

Extracts a file unless it already exists and returns its path.
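
The "extract unless it already exists" idea can be sketched with the standard library alone. The helper gunzip_if_missing below is hypothetical: it handles gzip only and does not reproduce the signature or behavior of embfile.extract_if_missing.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def gunzip_if_missing(src_path: Path) -> Path:
    """Decompress a .gz file next to itself unless the output already exists."""
    dest_path = src_path.with_suffix('')  # 'vectors.txt.gz' -> 'vectors.txt'
    if not dest_path.exists():
        with gzip.open(src_path, 'rb') as src, open(dest_path, 'wb') as dest:
            shutil.copyfileobj(src, dest)  # stream in chunks, never load the whole file
    return dest_path


# Build a tiny compressed file to try the helper on.
workdir = Path(tempfile.mkdtemp())
compressed = workdir / 'vectors.txt.gz'
with gzip.open(compressed, 'wt', encoding='utf-8') as f:
    f.write('hello 1 2 3\n')

extracted = gunzip_if_missing(compressed)  # extracts the file
same_path = gunzip_if_missing(compressed)  # finds the output and skips the extraction
print(extracted.name, extracted.read_text(encoding='utf-8').strip())
```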

Lazy (on-the-fly) decompression

Currently, TextEmbFile is the only format that allows you to open a compressed file directly and to decompress it “lazily” while reading it. Lazy decompression works for all compression formats except zip. For uniformity of behavior, you can still open zipped files directly but, under the hood, the file is fully extracted to a temporary file before reading starts.

Lazy decompression makes sense only if you want to perform a single pass through the file (e.g. you are converting the file): every new operation that needs to create a new file reader has to (lazily) decompress the file again.
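
As an illustration of the trade-off, the sketch below reads a gzip-compressed text file line by line with the standard library. It assumes the space-separated "word v1 v2 ..." layout described for TextEmbFile; iter_word_vectors is a hypothetical helper, not embfile code.

```python
import gzip
import tempfile
from pathlib import Path
from typing import Iterator, List, Tuple


def iter_word_vectors(path: Path) -> Iterator[Tuple[str, List[float]]]:
    """Yield (word, vector) pairs, decompressing the gzip stream as lines are consumed."""
    with gzip.open(path, 'rt', encoding='utf-8') as lines:
        for line in lines:
            word, *fields = line.rstrip('\n').split(' ')
            yield word, [float(x) for x in fields]


path = Path(tempfile.mkdtemp()) / 'vectors.txt.gz'
with gzip.open(path, 'wt', encoding='utf-8') as f:
    f.write('hello 1 2 3\nworld 4 5 6\n')

# First pass: decompression happens on the fly, line by line.
first_pass = list(iter_word_vectors(path))
# A second pass has to reopen the file and decompress it from the start again.
second_pass = list(iter_word_vectors(path))
print(first_pass)
```

Each call pays the full decompression cost, which is why extracting the file once is usually the better choice when several passes are needed.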

Registering new formats or file extensions

The format ID and file extensions of each registered file format are stored in the global object embfile.FORMATS. To associate a file extension with a registered format, you can use associate_extension():

>>> import embfile
>>> embfile.associate_extension(ext='.w2v', format_id='bin')
>>> print(embfile.FORMATS)
Class          Format ID    Extensions
-------------  -----------  ------------
BinaryEmbFile  bin          .bin, .w2v
TextEmbFile    txt          .txt, .vec
VVMEmbFile     vvm          .vvm

To register a new format (see Implementing a new format), you can use the class decorator register_format():

import embfile
from embfile.core import EmbFile

@embfile.register_format(format_id='hdf5', extensions=['.h5', '.hdf5'])
class HDF5EmbFile(EmbFile):
    ...  # implementation omitted
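
For readers curious about the mechanics, a registry driven by a class decorator can be sketched as follows. FORMAT_CLASSES, EXTENSION_TO_FORMAT and the two functions are hypothetical stand-ins written for this example; embfile.FORMATS may be organized differently.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for embfile.FORMATS; the real object may be organized differently.
FORMAT_CLASSES: Dict[str, type] = {}
EXTENSION_TO_FORMAT: Dict[str, str] = {}


def associate_extension(ext: str, format_id: str) -> None:
    """Map a file extension to an already registered format."""
    if format_id not in FORMAT_CLASSES:
        raise KeyError(f'unknown format_id: {format_id!r}')
    EXTENSION_TO_FORMAT[ext] = format_id


def register_format(format_id: str, extensions: List[str]) -> Callable[[type], type]:
    """Class decorator that records the class and its extensions."""
    def decorator(cls: type) -> type:
        FORMAT_CLASSES[format_id] = cls
        for ext in extensions:
            associate_extension(ext, format_id)
        return cls  # the class itself is returned unchanged
    return decorator


@register_format(format_id='hdf5', extensions=['.h5', '.hdf5'])
class HDF5File:
    """Placeholder class; a real format would extend EmbFile."""


associate_extension(ext='.hdf', format_id='hdf5')
print(FORMAT_CLASSES['hdf5'].__name__, sorted(EXTENSION_TO_FORMAT))
```

Because the decorator returns the class unchanged, registration happens as a side effect of defining (importing) the class.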

Loading word vectors

Loading specific word-vectors

load(words[, verbose])

Loads the vectors for the input words in a {word: vec} dict, raising KeyError if any word is missing.

find(words[, verbose])

Looks for the input words in the file and returns: 1) a dict {word: vec} containing the available words and 2) a set containing the words not found.

loader(words[, missing_ok, verbose])

Returns a VectorsLoader, an iterator that looks for the provided words in the file and yields available (word, vector) pairs one by one.

word2vec = f.load(['hello', 'world'])  # raises KeyError if any word is missing

word2vec, missing_words = f.find(['hello', 'world', 'missingWord'])

You should prefer loader to find when you want to store the vectors directly into some custom data structure without wasting time and memory on building an intermediate dictionary. For example, build_matrix() uses loader to load the vectors directly into a numpy array.

Here’s how you use a loader:

data_structure = MyCustomStructure()
for word, vector in file.loader(many_words):
    data_structure[word] = vector

If you’re interested in missing_words:

data_structure = MyCustomStructure()
loader = file.loader(many_words)
for word, vector in loader:
    data_structure[word] = vector
print('Missing words:', loader.missing_words)
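
To show why a loader needs no intermediate dictionary, here is a rough sketch of a single-pass loader over an in-memory list of pairs. SketchLoader is hypothetical and performs a plain sequential scan; embfile's VectorsLoader implementations may work differently (for instance, with format-specific lookups).

```python
from typing import Iterable, Iterator, List, Set, Tuple

Vector = List[float]


class SketchLoader:
    """Single-pass iterator over the requested words found in `entries` (hypothetical)."""

    def __init__(self, entries: Iterable[Tuple[str, Vector]], words: Iterable[str]):
        self._entries = iter(entries)
        # Words still to be found; whatever is left at the end is missing.
        self.missing_words: Set[str] = set(words)

    def __iter__(self) -> Iterator[Tuple[str, Vector]]:
        return self

    def __next__(self) -> Tuple[str, Vector]:
        if not self.missing_words:
            raise StopIteration  # every requested word was found: stop early
        for word, vector in self._entries:
            if word in self.missing_words:
                self.missing_words.remove(word)
                return word, vector
        raise StopIteration  # end of the entries reached


file_entries = [('hello', [1.0, 2.0, 3.0]), ('world', [4.0, 5.0, 6.0]), ('!', [7.0, 8.0, 9.0])]
loader = SketchLoader(file_entries, ['world', 'ciao'])
found = []
for word, vector in loader:
    found.append((word, vector))  # store each pair straight into the target structure
print(found, loader.missing_words)
```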

Loading the entire file in memory

to_dict([verbose])

Returns the entire file content in a dictionary word -> vector.

to_list([verbose])

Returns the entire file content in a list of WordVector objects.

Building a matrix

The docstring of embfile.build_matrix() contains everything you need to know to use it. Here, we’ll give some examples through an IPython session.

First, we’ll generate a dummy file with only three vectors:

In [1]: import tempfile

In [2]: from pathlib import Path

In [3]: import numpy as np

In [4]: import embfile

In [5]: from embfile import VVMEmbFile

In [6]: word_vectors = [
   ...:     ('hello', np.array([1, 2, 3])),
   ...:     ('world', np.array([4, 5, 6])),
   ...:     ('!',     np.array([7, 8, 9]))
   ...: ]
   ...: 

In [7]: path = Path(tempfile.gettempdir(), 'file.vvm')

In [8]: VVMEmbFile.create(path, word_vectors, overwrite=True, verbose=False)

Let’s build a matrix out of a list of words. We’ll use the default oov_initializer for initializing the vectors for out-of-file-vocabulary words:

In [9]: words = ['hello', 'ciao', 'world', 'mondo']

In [10]: with embfile.open(path, verbose=False) as f:
   ....:     result = embfile.build_matrix(
   ....:         f, words,
   ....:         start_index=1,   # map the first word to the row 1 (default is 0)
   ....:     )
   ....: 

# result belongs to a class that extends NamedTuple
In [11]: print(result.pretty())
[ 0.000  0.000  0.000]  # 0: 
[ 1.000  2.000  3.000]  # 1: hello
[ 1.839  3.846  4.377]  # 2: ciao [out of file vocabulary]
[ 4.000  5.000  6.000]  # 3: world
[ 5.016  6.407  5.299]  # 4: mondo [out of file vocabulary]

In [12]: result.matrix
Out[12]: 
array([[0.        , 0.        , 0.        ],
       [1.        , 2.        , 3.        ],
       [1.83933484, 3.84599979, 4.37652519],
       [4.        , 5.        , 6.        ],
       [5.01626252, 6.40740252, 5.2986341 ]])

In [13]: result.word2index
Out[13]: {'hello': 1, 'ciao': 2, 'world': 3, 'mondo': 4}

In [14]: result.missing_words
Out[14]: {'ciao', 'mondo'}

Now, we’ll build a matrix from a dictionary {word: index}. We’ll use a custom oov_initializer this time.

In [15]: word2index = {
   ....:     'hello': 1,
   ....:     'ciao': 3,
   ....:     'world': 4,
   ....:     'mondo': 5
   ....: }
   ....: 

In [16]: with embfile.open(path, verbose=False) as f:
   ....:     def custom_initializer(shape):
   ....:         scale = 1 / np.sqrt(f.vector_size)
   ....:         return np.random.normal(loc=0, scale=scale, size=shape)
   ....:     result = embfile.build_matrix(f, word2index, oov_initializer=custom_initializer)
   ....: 

In [17]: print(result.pretty())
[ 0.000  0.000  0.000]  # 0: 
[ 1.000  2.000  3.000]  # 1: hello
[ 0.000  0.000  0.000]  # 2: 
[-0.619 -0.339 -0.273]  # 3: ciao [out of file vocabulary]
[ 4.000  5.000  6.000]  # 4: world
[-0.369  0.942  1.401]  # 5: mondo [out of file vocabulary]

In [18]: result.matrix
Out[18]: 
array([[ 0.        ,  0.        ,  0.        ],
       [ 1.        ,  2.        ,  3.        ],
       [ 0.        ,  0.        ,  0.        ],
       [-0.61940755, -0.33910621, -0.27301005],
       [ 4.        ,  5.        ,  6.        ],
       [-0.36894112,  0.9424371 ,  1.40109962]])

In [19]: result.word2index
Out[19]: {'hello': 1, 'ciao': 3, 'world': 4, 'mondo': 5}

In [20]: result.missing_words
Out[20]: {'ciao', 'mondo'}

See embfile.initializers for the available initializers.
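
The indexing logic shown in the session can be summarized with a dependency-free sketch. build_matrix_sketch is a hypothetical function that uses plain lists instead of numpy arrays, and its initializer receives a vector length rather than a shape; it is not the implementation of embfile.build_matrix.

```python
import random
from typing import Callable, Dict, List, Sequence, Set, Tuple, Union

Vector = List[float]


def build_matrix_sketch(
    file_vectors: Dict[str, Vector],
    words: Union[Sequence[str], Dict[str, int]],
    vector_size: int,
    oov_initializer: Callable[[int], Vector],
    start_index: int = 0,
) -> Tuple[List[Vector], Dict[str, int], Set[str]]:
    """Return (matrix, word2index, missing_words) for the requested words."""
    if isinstance(words, dict):
        word2index = dict(words)  # explicit indexes: gaps are allowed
    else:
        word2index = {w: i for i, w in enumerate(words, start=start_index)}

    n_rows = max(word2index.values(), default=-1) + 1
    matrix = [[0.0] * vector_size for _ in range(n_rows)]  # unused rows stay zero
    missing_words: Set[str] = set()
    for word, index in word2index.items():
        if word in file_vectors:
            matrix[index] = list(file_vectors[word])
        else:
            missing_words.add(word)
            matrix[index] = oov_initializer(vector_size)  # out-of-file-vocabulary word
    return matrix, word2index, missing_words


rng = random.Random(0)  # seeded only to make the sketch reproducible
file_vectors = {'hello': [1.0, 2.0, 3.0], 'world': [4.0, 5.0, 6.0], '!': [7.0, 8.0, 9.0]}
matrix, word2index, missing = build_matrix_sketch(
    file_vectors,
    ['hello', 'ciao', 'world', 'mondo'],
    vector_size=3,
    oov_initializer=lambda size: [rng.gauss(0.0, 1.0) for _ in range(size)],
    start_index=1,
)
print(word2index)
print(sorted(missing))
```

With start_index=1, row 0 is left as zeros, which matches the first row of the matrices printed in the session.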

Iteration

File readers

Efficient iteration of the file is implemented by format-specific readers.

EmbFileReader(out_dtype)

(Abstract class) Iterator that yields a word at each step and reads the corresponding vector only if the lazy property current_vector is accessed.

A new reader for a file can be created using the method reader(). Every method that needs to iterate over the file entries sequentially uses this method to create a new reader.

You usually won’t need to use a reader directly because EmbFile defines quicker-to-use methods that use a reader for you. If you are interested, the docstring is pretty detailed.
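
The lazy-vector idea can be illustrated with a small text reader. LazyTextReader is a hypothetical class written for this page; it only mimics the behavior described for EmbFileReader (a word at each step, the vector parsed on demand) and does not follow the real class interface.

```python
from typing import Iterable, Iterator, List, Optional


class LazyTextReader:
    """Yields words; a vector is parsed only when current_vector is accessed."""

    def __init__(self, lines: Iterable[str]):
        self._lines: Iterator[str] = iter(lines)
        self._raw_fields: str = ''
        self._vector: Optional[List[float]] = None
        self.parsed_vectors = 0  # counts how many vectors were actually parsed

    def __iter__(self) -> 'LazyTextReader':
        return self

    def __next__(self) -> str:
        line = next(self._lines)  # propagates StopIteration at the end of the input
        word, _, self._raw_fields = line.rstrip('\n').partition(' ')
        self._vector = None  # invalidate the vector cached for the previous word
        return word

    @property
    def current_vector(self) -> List[float]:
        if self._vector is None:
            self._vector = [float(x) for x in self._raw_fields.split(' ')]
            self.parsed_vectors += 1
        return self._vector


reader = LazyTextReader(['hello 1 2 3', 'world 4 5 6', '! 7 8 9'])
vectors = {}
for word in reader:
    if word != 'world':
        vectors[word] = reader.current_vector  # 'world' is skipped without parsing
print(vectors, reader.parsed_vectors)
```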

Dict-like methods

The following methods are wrappers of reader(). Keep in mind that every time you use these methods, you are creating a new file reader and items are read from disk (the vocabulary may be loaded in memory though, as in VVM files).

words()

Returns an iterable for all the words in the file.

vectors()

Returns an iterable for all the vectors in the file.

word_vectors()

Returns an iterable for all the (word, vector) pairs in the file.

filter(condition[, verbose])

Returns a generator that yields a (word, vector) pair for each word in the file that satisfies a given condition, for example all the words starting with “z”.

Don’t use word_vectors() if you want to filter the vectors based on a condition on words: it reads the vector of every word it visits, even those that don’t meet the condition. Use filter instead.
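
The difference in cost can be made visible by counting how many vectors get parsed. The two generators below are hypothetical simplifications over a list of text lines, not the embfile methods of the same purpose.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

stats: Dict[str, int] = {'parsed': 0}  # number of vectors parsed so far


def parse_vector(fields: str) -> List[float]:
    stats['parsed'] += 1
    return [float(x) for x in fields.split(' ')]


def all_word_vectors(lines: Iterable[str]) -> Iterator[Tuple[str, List[float]]]:
    """Parse the vector of every line, whether the caller needs it or not."""
    for line in lines:
        word, _, fields = line.partition(' ')
        yield word, parse_vector(fields)


def filtered_word_vectors(
    lines: Iterable[str], condition: Callable[[str], bool]
) -> Iterator[Tuple[str, List[float]]]:
    """Check the word first and parse the vector only when the condition holds."""
    for line in lines:
        word, _, fields = line.partition(' ')
        if condition(word):
            yield word, parse_vector(fields)


lines = ['zebra 1 2 3', 'apple 4 5 6', 'zoo 7 8 9']

eager = [(w, v) for w, v in all_word_vectors(lines) if w.startswith('z')]
eager_calls = stats['parsed']

stats['parsed'] = 0
lazy = list(filtered_word_vectors(lines, lambda w: w.startswith('z')))
lazy_calls = stats['parsed']
print(eager == lazy, eager_calls, lazy_calls)
```

Both approaches return the same pairs, but the filtering generator parses two vectors instead of three; on a real file with millions of entries the gap grows accordingly.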

Creating/converting a file

Each subclass of EmbFile implements the following two class methods:

create(out_path, word_vectors[, vocab_size, …])

Creates a file on disk containing the provided word vectors.

create_from_file(source_file[, out_dir, …])

Creates a new file on disk with the same content as another file.

Examples of file creation

You can create a new file either from:

  • a dictionary {word: vector}

  • an iterable of (word, vector) tuples; the iterable can also be an iterator/generator.

For example:

import numpy as np
from embfile import VVMEmbFile

word_vectors = {
    "hello": np.array([0.1, 0.2, 0.3]),
    "world": np.array([0.4, 0.5, 0.6])
    # ... a lot more word vectors
}

VVMEmbFile.create(
    '/tmp/dummy.vvm.gz',
    word_vectors,
    dtype='<f2',      # store numbers as little-endian 2-byte floats
    compression='gz'  # compress with gzip
)
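
Accepting both a dictionary and an arbitrary iterable usually comes down to one normalization step. The sketch below writes a plain text file in the space-separated layout; write_text_embeddings is a hypothetical helper that ignores dtype and compression and is not how create() is implemented.

```python
import tempfile
from pathlib import Path
from typing import Dict, Iterable, Sequence, Tuple, Union

WordVectors = Union[Dict[str, Sequence[float]], Iterable[Tuple[str, Sequence[float]]]]


def write_text_embeddings(out_path: Path, word_vectors: WordVectors) -> int:
    """Write one 'word v1 v2 ...' line per pair and return the number of lines written."""
    pairs = word_vectors.items() if isinstance(word_vectors, dict) else word_vectors
    count = 0
    with open(out_path, 'w', encoding='utf-8') as out:
        for word, vector in pairs:  # consumed one pair at a time, so generators work too
            out.write(word + ' ' + ' '.join(repr(float(x)) for x in vector) + '\n')
            count += 1
    return count


out_dir = Path(tempfile.mkdtemp())
from_dict = write_text_embeddings(
    out_dir / 'a.txt',
    {'hello': [0.1, 0.2, 0.3], 'world': [0.4, 0.5, 0.6]},
)
from_generator = write_text_embeddings(
    out_dir / 'b.txt',
    ((word, [1.0, 2.0]) for word in ['x', 'y', 'z']),
)
print(from_dict, from_generator)
```

Since a generator has no length, a writer that needs the vocabulary size in advance must either receive it as an argument or count the pairs while writing, as done here.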

Example of file conversions

Let’s convert a textual file to a vvm file. The following will generate a compressed vvm file in the same folder as the textual file (and with a proper file extension):

import embfile
from embfile import VVMEmbFile

with embfile.open('path/to/source/file.txt') as src_file:
    dest_path = VVMEmbFile.create_from_file(src_file, compression='gz')

# dest_path == Path('path/to/source/file.vvm.gz')
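
The way the destination path relates to the source path can be expressed with pathlib. converted_path is a hypothetical helper that assumes an uncompressed source with a single extension; it only illustrates the naming shown above and is not taken from embfile.

```python
from pathlib import Path
from typing import Optional


def converted_path(source_path: Path, new_extension: str, compression: Optional[str] = None) -> Path:
    """Swap the source extension for new_extension and append the compression suffix, if any."""
    dest = source_path.with_suffix(new_extension)  # same folder, new extension
    if compression:
        dest = dest.with_name(dest.name + '.' + compression)
    return dest


print(converted_path(Path('path/to/source/file.txt'), '.vvm', compression='gz'))
```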

Implementing a new format

If you ever feel the need to implement a new format, it’s fairly easy to integrate your custom format into this library and to test it. My suggestion is:

  1. grab the template below

  2. read EmbFile docstring

  3. look at existing implementations in the embfile.formats subpackage

  4. for testing, see how they are tested in tests/test_files.py

Using an IDE is highly recommended.

from pathlib import Path
from typing import Iterable, Optional, Tuple

import embfile
from embfile.types import DType, PathType, VectorType
from embfile.core import EmbFile, EmbFileReader



# TODO: implement a reader
# Note: you could also extend AbstractEmbFileReader if it's convenient for you
class CustomEmbFileReader(EmbFileReader):
    """ Implements file sequential reading """
    def __init__(self, out_dtype: DType):  # TODO: add the needed arguments
        super().__init__(out_dtype)

    def _close(self) -> None:
        pass

    def reset(self) -> None:
        pass

    def next_word(self) -> str:
        pass

    def current_vector(self) -> VectorType:
        pass


@embfile.register_format('custom', extensions=['.cst', '.cust'])
class CustomEmbFile(EmbFile):

    def __init__(self, path: PathType, out_dtype: DType = None, verbose: int = 1):
        super().__init__(path, out_dtype, verbose)  # this is not optional
        # self.vocab_size = ??
        # self.vector_size = ??

    def _close(self) -> None:
        pass

    def _reader(self) -> EmbFileReader:
        return CustomEmbFileReader()  # TODO: pass the needed arguments

    # Optional:
    def _loader(self, words: Iterable[str], missing_ok: bool = True, verbose: Optional[int] = None) -> 'VectorsLoader':
        """ By default, a SequentialLoader is returned. """
        return super()._loader(words, missing_ok, verbose)

    @classmethod
    def _create(cls, out_path: Path, word_vectors: Iterable[Tuple[str, VectorType]],
                vector_size: int, vocab_size: Optional[int], compression: Optional[str] = None,
                verbose: bool = True, **format_kwargs) -> None:
        pass


if __name__ == '__main__':
    print(embfile.FORMATS)

This will print:

"""
Class          Format ID    Extensions
-------------  -----------  ------------
BinaryEmbFile  bin          .bin
TextEmbFile    txt          .txt, .vec
VVMEmbFile     vvm          .vvm
CustomEmbFile  custom       .cst, .cust
"""