embfile.formats.bin

Classes

`BinaryEmbFile`(path[, encoding, dtype, …])	Format used by the Google word2vec tool.
`BinaryEmbFileReader`(file_obj[, encoding, …])	`EmbFileReader` for the binary format.

Reference

class embfile.formats.bin.BinaryEmbFile(path, encoding='utf-8', dtype=dtype('float32'), out_dtype=None, verbose=True)

Bases: embfile.core._file.EmbFile

Format used by the Google word2vec tool. You can use it to read the file GoogleNews-vectors-negative300.bin.

It begins with a text header line of space-separated fields:

<vocab_size> <vector_size>

Each word-vector pair is encoded as follows:

encoded word + space
followed by the binary representation of the vector.
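
The layout described above can be sketched in plain Python and NumPy. This is a minimal illustration, not embfile's implementation: `write_pairs` and `read_pairs` are hypothetical helper names, the byte-wise search for the space assumes an ASCII-compatible encoding such as utf-8, and the vectors are assumed to be little-endian 4-byte floats.

```python
import io

import numpy as np


def write_pairs(file_obj, pairs, encoding="utf-8", dtype="<f4"):
    """Write (word, vector) pairs using the layout described above (hypothetical helper)."""
    pairs = list(pairs)
    vector_size = len(pairs[0][1])
    # Text header: "<vocab_size> <vector_size>" terminated by a newline.
    file_obj.write(f"{len(pairs)} {vector_size}\n".encode(encoding))
    for word, vector in pairs:
        # Encoded word + space, then the raw bytes of the vector.
        file_obj.write(word.encode(encoding) + b" ")
        file_obj.write(np.asarray(vector, dtype=dtype).tobytes())


def read_pairs(file_obj, encoding="utf-8", dtype="<f4"):
    """Yield (word, vector) pairs from a binary file object (hypothetical helper)."""
    dtype = np.dtype(dtype)
    vocab_size, vector_size = (int(field) for field in file_obj.readline().split())
    n_bytes = vector_size * dtype.itemsize
    for _ in range(vocab_size):
        word_bytes = bytearray()
        while True:
            char = file_obj.read(1)
            if not char:
                raise EOFError("file ended in the middle of a word")
            if char == b" ":
                break
            word_bytes += char
        # Some writers put a newline after each vector; drop it if present.
        word = bytes(word_bytes).lstrip(b"\n").decode(encoding)
        vector = np.frombuffer(file_obj.read(n_bytes), dtype=dtype)
        yield word, vector


# Round trip through an in-memory buffer.
buffer = io.BytesIO()
write_pairs(buffer, [("hello", [1.0, 2.0, 3.0]), ("world", [0.5, -0.25, 4.0])])
buffer.seek(0)
for word, vector in read_pairs(buffer):
    print(word, vector)
```

Because words have variable length while vectors have a fixed size, the reader has to scan for the space byte by byte but can read each vector in a single call.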

Variables

path –
encoding –
dtype –
out_dtype –
verbose –

Parameters

path (Union[str, Path]) – path to the (possibly compressed) file
encoding (str) – text encoding. Note: if you provide a UTF encoding that uses a BOM (Byte Order Mark), such as utf-16, without specifying the byte order (utf-16-le or utf-16-be), the little-endian version (utf-16-le) is used.
dtype (Union[str, dtype]) – a valid NumPy data type, or anything you can pass to numpy.dtype() (default: '<f4', i.e. little-endian 4-byte float)
out_dtype (Union[str, dtype, None]) – all returned vectors are converted to this data type if necessary; by default it is equal to the data type of the vectors in the file, i.e. no conversion takes place.
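
To illustrate what the dtype and out_dtype parameters mean, here is a small NumPy-only sketch. It does not call embfile; it only shows how raw bytes are interpreted with a given dtype and then converted, which is presumably what happens to each vector. The variable names are illustrative.

```python
import numpy as np

# Raw bytes of three little-endian 4-byte floats, as they would sit in the file.
raw = np.array([1.0, 2.0, 3.0], dtype="<f4").tobytes()

# dtype: how the bytes in the file are interpreted.
file_dtype = np.dtype("<f4")
stored = np.frombuffer(raw, dtype=file_dtype)

# out_dtype: the type the caller wants back; the conversion makes a copy.
converted = stored.astype(np.float64)

# Reading the same bytes with the wrong byte order gives different numbers.
misread = np.frombuffer(raw, dtype=">f4")

print(stored, converted.dtype, misread)
```

Each vector occupies `vector_size * file_dtype.itemsize` bytes, so a wrong dtype shifts every subsequent word boundary as well as the values.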

DEFAULT_EXTENSION: str = '.bin'

vocab_size: Optional[int]

classmethod create(out_path, word_vectors, vocab_size=None, compression=None, verbose=True, overwrite=False, encoding='utf-8', dtype=None)

Format-specific arguments are encoding and dtype.

Note: all text is encoded without a BOM (Byte Order Mark). If you pass "utf-16" or "utf-32", the little-endian version is used (e.g. "utf-16-le").
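
The difference between the generic and the endian-specific codecs can be checked with the Python standard library alone. The sketch below is independent of embfile and only shows why an explicit byte order matters: the generic "utf-16" codec prepends a BOM to every encoded string, which would otherwise end up in front of each word in the file.

```python
import codecs

# The generic codec prepends a 2-byte BOM in the platform's native byte order.
with_bom = "cat".encode("utf-16")

# The endian-specific codec writes the code units only.
without_bom = "cat".encode("utf-16-le")

has_bom = with_bom.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
print(has_bom, len(with_bom), len(without_bom))
```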

See EmbFile.create() for more.

Return type: None

classmethod create_from_file(source_file, out_dir=None, out_filename=None, vocab_size=None, compression=None, verbose=True, overwrite=False, encoding='utf-8', dtype=None)

Format-specific arguments are encoding and dtype.

Note: all text is encoded without a BOM (Byte Order Mark). If you pass "utf-16" or "utf-32", the little-endian version is used (e.g. "utf-16-le").

See EmbFile.create_from_file() for more.

Return type: Path

class embfile.formats.bin.BinaryEmbFileReader(file_obj, encoding='utf-8', dtype=dtype('float32'), out_dtype=None)

Bases: embfile.core.reader.AbstractEmbFileReader

EmbFileReader for the binary format.

classmethod from_path(path, encoding='utf-8', dtype=dtype('float32'), out_dtype=None)
