Classes
BinaryEmbFile – Format used by the Google word2vec tool.
Reference
embfile.formats.bin.BinaryEmbFile(path, encoding='utf-8', dtype=dtype('float32'), out_dtype=None, verbose=True)
Bases: embfile.core._file.EmbFile
Format used by the Google word2vec tool. You can use it to read the file GoogleNews-vectors-negative300.bin.
It begins with a text header line of space-separated fields:
<vocab_size> <vector_size>
Each word-vector pair is encoded as follows:
encoded word + space
followed by the binary representation of the vector.
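The layout described above can be illustrated with a minimal reader sketch that uses only the standard library. It is not the code used by BinaryEmbFile; the function name read_word2vec_bin is hypothetical, and the sketch assumes UTF-8 text, little-endian 4-byte floats, and that a writer may put a newline after each vector (which is stripped if present).

```python
import io
import struct


def read_word2vec_bin(stream):
    """Yield (word, vector) pairs from a binary stream (hypothetical helper)."""
    # Header line: "<vocab_size> <vector_size>" terminated by a newline.
    header = stream.readline().decode("utf-8")
    vocab_size, vector_size = (int(field) for field in header.split())

    # Assumption: vectors are stored as little-endian 4-byte floats ('<f4').
    vector_format = "<%df" % vector_size
    vector_bytes = struct.calcsize(vector_format)

    for _ in range(vocab_size):
        # Collect the encoded word byte by byte until the separating space.
        encoded_word = bytearray()
        while True:
            byte = stream.read(1)
            if not byte:
                raise EOFError("stream ended in the middle of a word")
            if byte == b" ":
                break
            encoded_word += byte
        # Assumption: some writers emit a newline after each vector; drop it.
        word = encoded_word.decode("utf-8").lstrip("\n")

        payload = stream.read(vector_bytes)
        if len(payload) != vector_bytes:
            raise EOFError("stream ended in the middle of a vector")
        yield word, struct.unpack(vector_format, payload)


# Build a tiny in-memory file with two 3-dimensional vectors and read it back.
sample = (
    b"2 3\n"
    + b"cat " + struct.pack("<3f", 1.0, 2.0, 3.0)
    + b"dog " + struct.pack("<3f", 0.5, -1.0, 0.25)
)
pairs = list(read_word2vec_bin(io.BytesIO(sample)))
print(pairs)
```

Because the vector length is known from the header, the payload is read as a fixed-size block, so bytes inside a vector that happen to equal a space are never mistaken for a separator.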
path (Union[str, Path]) – path to the (possibly compressed) file
encoding (str) – text encoding. Note: if you provide a UTF encoding that uses a
BOM (Byte Order Mark), e.g. utf-16, without specifying the byte order (e.g. utf-16-le or
utf-16-be), the little-endian version is used (utf-16-le).
dtype (Union[str, dtype]) – a valid numpy data type (or anything you can pass to numpy.dtype())
(default: '<f4'; little-endian float, 4 bytes)
out_dtype (Union[str, dtype, None]) – all returned vectors are converted to this data type if necessary;
by default, it is equal to the original data type of the vectors in the file,
i.e. no conversion takes place.
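The encoding, dtype and out_dtype parameters can be illustrated with plain Python and NumPy, independently of embfile. The snippet below is only a sketch: it shows why a BOM-carrying encoding name is ambiguous, what '<f4' means, and the kind of conversion an out_dtype implies. How the library performs that conversion internally is an assumption here.

```python
import codecs

import numpy as np

# The generic "utf-16" codec prepends a BOM when encoding;
# the explicit little-endian variant does not.
with_bom = "cat".encode("utf-16")
without_bom = "cat".encode("utf-16-le")
print(with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))  # True
print(without_bom)  # b'c\x00a\x00t\x00'

# '<f4' is a little-endian float that occupies 4 bytes.
file_dtype = np.dtype("<f4")
print(file_dtype.itemsize)  # 4

# Decode raw bytes with the file dtype, then convert to another dtype.
# Assumption: this mirrors what requesting out_dtype='float64' amounts to.
raw = np.array([1.0, 2.0, 3.0], dtype=file_dtype).tobytes()
vector = np.frombuffer(raw, dtype=file_dtype)
converted = vector.astype(np.float64)
print(converted.dtype)  # float64
```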
create(out_path, word_vectors, vocab_size=None, compression=None, verbose=True, overwrite=False, encoding='utf-8', dtype=None)
Format-specific arguments are encoding and dtype.
Note: all text is encoded without a BOM (Byte Order Mark). If you pass “utf-16” or “utf-32”, the little-endian version is used (e.g. “utf-16-le”).
See create() in the base class EmbFile for more.
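The BOM note can be made concrete with a small writer sketch. This is not the implementation of create(); write_word2vec_bin and BOMLESS are hypothetical names, and the sketch assumes that the header uses the same text encoding as the words and that vectors are written as little-endian 4-byte floats.

```python
import io
import struct

# Hypothetical mapping: generic BOM-carrying names to explicit little-endian ones.
BOMLESS = {"utf-16": "utf-16-le", "utf-32": "utf-32-le"}


def write_word2vec_bin(stream, word_vectors, encoding="utf-8"):
    """Write (word, vector) pairs in the binary layout (hypothetical helper)."""
    word_vectors = list(word_vectors)
    vector_size = len(word_vectors[0][1]) if word_vectors else 0
    # Encoding each string with plain "utf-16" would insert a BOM every time,
    # so the explicit little-endian codec is used instead.
    codec = BOMLESS.get(encoding.lower(), encoding)

    # Header line: "<vocab_size> <vector_size>".
    stream.write(("%d %d\n" % (len(word_vectors), vector_size)).encode(codec))
    for word, vector in word_vectors:
        stream.write((word + " ").encode(codec))
        # Assumption: little-endian 4-byte floats, matching the '<f4' default.
        stream.write(struct.pack("<%df" % vector_size, *vector))


utf8_buffer = io.BytesIO()
write_word2vec_bin(utf8_buffer, [("cat", [1.0, 2.0])])

utf16_buffer = io.BytesIO()
write_word2vec_bin(utf16_buffer, [("cat", [1.0, 2.0])], encoding="utf-16")
print(utf16_buffer.getvalue()[:2])  # b'1\x00', i.e. no BOM at the start
```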
create_from_file(source_file, out_dir=None, out_filename=None, vocab_size=None, compression=None, verbose=True, overwrite=False, encoding='utf-8', dtype=None)
Format-specific arguments are encoding and dtype.
Note: all text is encoded without a BOM (Byte Order Mark). If you pass “utf-16” or “utf-32”, the little-endian version is used (e.g. “utf-16-le”).
See create_from_file() in the base class EmbFile for more.
embfile.formats.bin.BinaryEmbFileReader(file_obj, encoding='utf-8', dtype=dtype('float32'), out_dtype=None)
Bases: embfile.core.reader.AbstractEmbFileReader
EmbFileReader for the binary format.