Classes
VVMEmbFile: (Custom format) A tar file storing vocabulary, vectors and metadata in 3 separate files.
Reference
embfile.formats.vvm.VVMEmbFile(path, out_dtype=None, verbose=True)
Bases: embfile.core._file.EmbFile, embfile.core.loaders.Word2Vector
(Custom format) A tar file storing vocabulary, vectors and metadata in 3 separate files.
Features:
the vocabulary can be loaded very quickly (with no need for an external vocab file) and it is loaded in memory when the file is opened;
direct access to vectors:
    by word using __getitem__() (e.g. file['hello'])
    by index using vector_at()
implements __contains__() (e.g. 'hello' in file)
all the information needed to open the file is stored in the file itself
Specifics. The files contained in a VVM file are:
vocab.txt: contains each word on a separate line
vectors.bin: contains the vectors in binary format (concatenated)
meta.json: must contain (at least) the following fields:
    vocab_size: number of word vectors in the file
    vector_size: length of a word vector
    encoding: text encoding used for vocab.txt
    dtype: vector data type string (notation used by numpy)
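The layout described above can be illustrated with a minimal sketch that uses only the standard library. It is not the embfile implementation: the helper names build_vvm_bytes and read_vector, the STRUCT_CODES mapping and the choice of '<f4' as the dtype string are assumptions made for illustration. It packs the three members into an uncompressed in-memory tar, then reads one vector back by seeking to index * vector_size * item size inside vectors.bin.

```python
import io
import json
import struct
import tarfile

# Assumption: a tiny mapping from numpy-style dtype strings to struct codes.
STRUCT_CODES = {"<f4": "f", "<f8": "d"}


def build_vvm_bytes(pairs, dtype="<f4", encoding="utf-8"):
    """Pack (word, vector) pairs into an in-memory tar laid out like a VVM file."""
    vector_size = len(pairs[0][1])
    meta = {
        "vocab_size": len(pairs),
        "vector_size": vector_size,
        "encoding": encoding,
        "dtype": dtype,
    }
    fmt = "<" + STRUCT_CODES[dtype] * vector_size
    members = {
        "vocab.txt": "\n".join(word for word, _ in pairs).encode(encoding),
        "vectors.bin": b"".join(struct.pack(fmt, *vec) for _, vec in pairs),
        "meta.json": json.dumps(meta).encode("utf-8"),
    }
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, data in members.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def read_vector(vvm_bytes, word):
    """Look up one vector by word, seeking straight to its offset in vectors.bin."""
    with tarfile.open(fileobj=io.BytesIO(vvm_bytes), mode="r") as tar:
        meta = json.load(tar.extractfile("meta.json"))
        vocab_bytes = tar.extractfile("vocab.txt").read()
        vocab = vocab_bytes.decode(meta["encoding"]).split("\n")
        index = vocab.index(word)  # raises ValueError if the word is missing
        fmt = "<" + STRUCT_CODES[meta["dtype"]] * meta["vector_size"]
        record_size = struct.calcsize(fmt)
        vectors = tar.extractfile("vectors.bin")
        vectors.seek(index * record_size)  # random access: skip earlier vectors
        return list(struct.unpack(fmt, vectors.read(record_size)))


archive = build_vvm_bytes([("hello", [1.0, 2.0]), ("world", [3.0, 4.0])])
print(read_vector(archive, "world"))
```

Because every vector has the same byte length, the position of a word in vocab.txt is enough to locate its vector, which is why the vocabulary alone supports both lookup by word and lookup by index.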
__getitem__(word)
Returns the vector associated with a word (random access to the file).
create(out_path, word_vectors, vocab_size=None, compression=None, verbose=True, overwrite=False, encoding='utf-8', dtype=None)
Format-specific arguments are encoding and dtype.
Since a VVM file is a tar file, use a compression format supported by the tarfile package (avoid zip): gz, bz2 or xz.
See EmbFile.create() for more documentation.
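The three compression names mentioned above are the ones Python's tarfile module accepts as mode suffixes. The following sketch shows only that standard-library behavior, not how create() applies compression internally; the helper pack and the sample payload are hypothetical.

```python
import io
import tarfile


def pack(data, compression=None):
    """Write one member into an in-memory tar, optionally compressed."""
    mode = "w" if compression is None else "w:" + compression
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode=mode) as tar:
        info = tarfile.TarInfo("vocab.txt")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


payload = b"hello\nworld\n" * 1000

# Archive size in bytes for each option named in the text.
sizes = {name: len(pack(payload, name)) for name in (None, "gz", "bz2", "xz")}

# "r:*" lets tarfile detect the compression transparently when reading.
with tarfile.open(fileobj=io.BytesIO(pack(payload, "xz")), mode="r:*") as tar:
    restored = tar.extractfile("vocab.txt").read()
```

A zip container is a different archive format, which tarfile cannot read or write; that is consistent with the advice to avoid it here.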
create_from_file(source_file, out_dir=None, out_filename=None, vocab_size=None, compression=None, verbose=True, overwrite=False, encoding='utf-8', dtype=None)
Format-specific arguments are encoding and dtype.
Since a VVM file is a tar file, use a compression format supported by the tarfile package (avoid zip): gz, bz2 or xz.
See EmbFile.create_from_file() for more documentation.
embfile.formats.vvm.VVMEmbFileReader(file, vectors_file)
Bases: embfile.core.reader.AbstractEmbFileReader
EmbFileReader for the vvm format.