embfile.formats.vvm

Classes

VVMEmbFile(path[, out_dtype, verbose])

(Custom format) A tar file storing vocabulary, vectors and metadata in 3 separate files.

VVMEmbFileReader(file, vectors_file)

EmbFileReader for the vvm format.

Reference

class embfile.formats.vvm.VVMEmbFile(path, out_dtype=None, verbose=True)[source]

Bases: embfile.core._file.EmbFile, embfile.core.loaders.Word2Vector

(Custom format) A tar file storing vocabulary, vectors and metadata in 3 separate files.

Features:

  1. the vocabulary can be loaded very quickly (with no need for an external vocab file) and is loaded into memory when the file is opened;

  2. direct access to vectors;

  3. implements __contains__() (e.g. 'hello' in file);

  4. all the information needed to open the file is stored in the file itself.

Specifics. The files contained in a VVM file are:

  • vocab.txt: contains each word on a separate line

  • vectors.bin: contains the vectors in binary format (concatenated)

  • meta.json: must contain (at least) the following fields:

    • vocab_size: number of word vectors in the file

    • vector_size: length of a word vector

    • encoding: text encoding used for vocab.txt

    • dtype: vector data type string (notation used by numpy)
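The layout above can be sketched with the standard library alone. This is a minimal illustration, not the embfile implementation: the helper names build_vvm_like_archive and read_meta_and_vocab are hypothetical, and the member order, the little-endian float32 row layout and the one-word-per-line join are assumptions made for the example.

```python
import io
import json
import struct
import tarfile


def build_vvm_like_archive(words, vectors, encoding="utf-8"):
    """Pack words and float32 vectors into an in-memory tar with three members."""
    vector_size = len(vectors[0])
    vocab_bytes = "\n".join(words).encode(encoding)
    # Assumption: vectors are stored one after another as little-endian float32.
    vectors_bytes = b"".join(
        struct.pack(f"<{vector_size}f", *vector) for vector in vectors
    )
    meta = {
        "vocab_size": len(words),
        "vector_size": vector_size,
        "encoding": encoding,
        "dtype": "<f4",  # numpy-style string for little-endian float32
    }
    meta_bytes = json.dumps(meta).encode("utf-8")

    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, payload in [
            ("vocab.txt", vocab_bytes),
            ("vectors.bin", vectors_bytes),
            ("meta.json", meta_bytes),
        ]:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def read_meta_and_vocab(archive_bytes):
    """Read the metadata first, then decode the vocabulary with its encoding."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r") as tar:
        meta = json.loads(tar.extractfile("meta.json").read().decode("utf-8"))
        vocab_text = tar.extractfile("vocab.txt").read().decode(meta["encoding"])
    return meta, vocab_text.split("\n")


archive = build_vvm_like_archive(
    ["hello", "world"], [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
)
meta, words = read_meta_and_vocab(archive)
```

Because meta.json carries the encoding and dtype, a reader needs nothing but the archive itself to interpret the other two members, which is the point of feature 4 above.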

Variables
  • path

  • encoding

  • dtype

  • out_dtype

  • verbose

  • vocab (OrderedDict[str, int]) – maps each word to its index in the file

DEFAULT_EXTENSION: str = '.vvm'
vocab_size: Optional[int]
words()[source]

Returns an iterable for all the words in the file.

Return type

Iterable[str]

__contains__(word)[source]

Returns True if the file contains a vector for word.

Return type

bool

vector_at(index)[source]

Returns a vector by its index in the file (random access).

Return type

ndarray
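Random access by index is possible when every vector occupies the same number of bytes, because the position of a vector can then be computed directly. The following is a rough sketch under that assumption, using a hypothetical helper read_vector_at and little-endian float32 values; it returns a plain list rather than an ndarray and is not the library's own code.

```python
import io
import struct


def read_vector_at(vectors_file, index, vector_size, itemsize=4):
    """Read one vector from a seekable binary file of concatenated float32 vectors."""
    if index < 0:
        raise IndexError(f"vector index out of range: {index}")
    nbytes = vector_size * itemsize
    # The vector at position `index` starts exactly `index` full vectors into the file.
    vectors_file.seek(index * nbytes)
    raw = vectors_file.read(nbytes)
    if len(raw) != nbytes:
        raise IndexError(f"vector index out of range: {index}")
    return list(struct.unpack(f"<{vector_size}f", raw))


# Three 2-dimensional float32 vectors stored back to back.
data = struct.pack("<6f", 0.5, 1.5, 2.5, 3.5, 4.5, 5.5)
second = read_vector_at(io.BytesIO(data), index=1, vector_size=2)
```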

__getitem__(word)[source]

Returns the vector associated with a word (random access to the file).

Return type

ndarray
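Both __contains__() and __getitem__() can be understood in terms of the vocab mapping: a word is in the file if it is a key of vocab, and its vector is the one stored at the mapped index. The class below is a hypothetical in-memory stand-in (TinyWordVectors is not part of embfile) that illustrates this relationship with plain lists instead of a binary file.

```python
from collections import OrderedDict


class TinyWordVectors:
    """Hypothetical stand-in for a word-indexed vector file (not the embfile class)."""

    def __init__(self, vocab_text, vectors):
        lines = vocab_text.split("\n")
        if lines and lines[-1] == "":
            lines.pop()  # ignore the empty string after a trailing newline
        # Each word maps to its line number, which is also its vector index.
        self.vocab = OrderedDict((word, index) for index, word in enumerate(lines))
        self._vectors = vectors

    def __contains__(self, word):
        return word in self.vocab

    def __getitem__(self, word):
        # Raises KeyError for unknown words, like a dict lookup.
        return self._vectors[self.vocab[word]]


store = TinyWordVectors("the\ncat\nsat\n", [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
```

With this stand-in, 'cat' in store is a dictionary membership test and store['sat'] is a dictionary lookup followed by an index access, so neither needs to scan the vocabulary.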

classmethod create(out_path, word_vectors, vocab_size=None, compression=None, verbose=True, overwrite=False, encoding='utf-8', dtype=None)[source]

Format-specific arguments are encoding and dtype.

Since a VVM file is a tar file, use a compression format supported by the tarfile package (gz, bz2 or xz) and avoid zip.

See EmbFile.create() for more documentation.

Return type

None
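The compression names mentioned above correspond to the write modes of Python's tarfile module. The sketch below shows that mapping with a hypothetical helper write_tar; how embfile itself translates the compression argument is not documented here, so treat the mapping table as an assumption.

```python
import io
import tarfile

# Assumed mapping from a compression name to a tarfile write mode.
TAR_WRITE_MODES = {None: "w", "gz": "w:gz", "bz2": "w:bz2", "xz": "w:xz"}


def write_tar(members, compression=None):
    """Write a dict of {member name: bytes} to an in-memory tar archive."""
    if compression not in TAR_WRITE_MODES:
        raise ValueError(f"unsupported compression for a tar file: {compression!r}")
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode=TAR_WRITE_MODES[compression]) as tar:
        for name, payload in members.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def list_members(archive_bytes):
    """Open an archive with transparent decompression and list its member names."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:*") as tar:
        return tar.getnames()


names_by_compression = {
    compression: list_members(write_tar({"vocab.txt": b"hello\nworld\n"}, compression))
    for compression in (None, "gz", "bz2", "xz")
}
```

Zip is a different container format rather than a tar compression, which is why it has no entry in the mapping and is rejected.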

classmethod create_from_file(source_file, out_dir=None, out_filename=None, vocab_size=None, compression=None, verbose=True, overwrite=False, encoding='utf-8', dtype=None)[source]

Format-specific arguments are encoding and dtype. Since a VVM file is a tar file, use a compression format supported by the tarfile package (gz, bz2 or xz) and avoid zip.

See EmbFile.create_from_file() for more documentation.

Return type

Path

class embfile.formats.vvm.VVMEmbFileReader(file, vectors_file)[source]

Bases: embfile.core.reader.AbstractEmbFileReader

EmbFileReader for the vvm format.