embfile.formats.bin

Classes

BinaryEmbFile(path[, encoding, dtype, …])

Format used by the Google word2vec tool.

BinaryEmbFileReader(file_obj[, encoding, …])

EmbFileReader for the binary format.

Reference

class embfile.formats.bin.BinaryEmbFile(path, encoding='utf-8', dtype=dtype('float32'), out_dtype=None, verbose=True)[source]

Bases: embfile.core._file.EmbFile

Format used by the Google word2vec tool. You can use it to read the file GoogleNews-vectors-negative300.bin.

It begins with a text header line of space-separated fields:

<vocab_size> <vector_size>

Each word-vector pair is encoded as follows:

  • encoded word + space

  • followed by the binary representation of the vector.
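The layout described above can be sketched with a small stand-alone parser. This is only an illustration of the file format, not the library's own reader: the function name `read_word2vec_bin` is hypothetical, and the sketch assumes an uncompressed stream and an encoding such as UTF-8 in which the space separator is a single byte.

```python
import io
import numpy as np


def read_word2vec_bin(file_obj, encoding="utf-8", dtype="<f4"):
    """Yield (word, vector) pairs from a word2vec-style binary stream.

    Hypothetical helper for illustration; assumes a single-byte space
    separator (true for UTF-8, not for UTF-16/UTF-32).
    """
    dtype = np.dtype(dtype)
    # Header line: "<vocab_size> <vector_size>" followed by a newline.
    header = file_obj.readline().decode(encoding)
    vocab_size, vector_size = (int(field) for field in header.split())
    n_bytes = vector_size * dtype.itemsize
    for _ in range(vocab_size):
        # Read the encoded word one byte at a time, up to the space.
        word_bytes = bytearray()
        while True:
            byte = file_obj.read(1)
            if not byte:
                raise EOFError("unexpected end of file while reading a word")
            if byte == b" ":
                break
            word_bytes += byte
        # Tolerate a newline left over from the previous record
        # (the original word2vec tool writes one after each vector).
        word = word_bytes.decode(encoding).lstrip("\n")
        # The vector follows as vector_size values of the given dtype.
        vector = np.frombuffer(file_obj.read(n_bytes), dtype=dtype)
        yield word, vector


# Build a tiny in-memory file with two 3-dimensional vectors and parse it.
sample = io.BytesIO()
sample.write(b"2 3\n")
sample.write(b"hello " + np.array([1, 2, 3], dtype="<f4").tobytes())
sample.write(b"world " + np.array([4, 5, 6], dtype="<f4").tobytes())
sample.seek(0)
pairs = list(read_word2vec_bin(sample))
```

Because each vector has a fixed byte length, the parser never searches for a delimiter inside the vector bytes; only the word needs to be scanned up to the space.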

Variables
  • path

  • encoding

  • dtype

  • out_dtype

  • verbose

Parameters
  • path (Union[str, Path]) – path to the (possibly compressed) file

  • encoding (str) – text encoding. Note: if you provide a UTF encoding that uses a BOM (Byte Order Mark), e.g. utf-16, without specifying the byte order (as in utf-16-le or utf-16-be), the little-endian version is used (utf-16-le).

  • dtype (Union[str, dtype]) – a valid NumPy data type, or anything you can pass to numpy.dtype() (default: '<f4', i.e. little-endian 4-byte float)

  • out_dtype (Union[str, dtype, None]) – all returned vectors are converted to this data type if necessary; by default, it is equal to the data type of the vectors in the file, i.e. no conversion takes place.
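In NumPy terms, the difference between dtype and out_dtype can be illustrated as follows. This is a minimal sketch using only NumPy, not the library's internal code; the variable names are chosen for the example.

```python
import numpy as np

# dtype describes how the vector bytes are stored in the file:
# '<f4' is a little-endian float occupying 4 bytes.
file_dtype = np.dtype("<f4")

# Simulate the raw bytes of one stored vector.
raw = np.array([0.5, -1.25, 3.0], dtype=file_dtype).tobytes()

# Interpreting the bytes with the file's dtype recovers the stored values.
stored = np.frombuffer(raw, dtype=file_dtype)

# out_dtype is the type handed back to the caller; converting is
# equivalent to an astype() call on the decoded vector.
out_dtype = np.dtype("float64")
converted = stored.astype(out_dtype)
```

When out_dtype is left at None, no such conversion step is needed and the vector keeps the file's data type.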

DEFAULT_EXTENSION: str = '.bin'
vocab_size: Optional[int]
classmethod create(out_path, word_vectors, vocab_size=None, compression=None, verbose=True, overwrite=False, encoding='utf-8', dtype=None)[source]

Format-specific arguments are encoding and dtype.

Note: all the text is encoded without a BOM (Byte Order Mark). If you pass “utf-16” or “utf-32”, the little-endian version is used (e.g. “utf-16-le”).
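The effect of this note can be seen with Python's own codecs: the generic "utf-16" codec prepends a BOM, while "utf-16-le" does not. The helper `bom_free_encoding` below is hypothetical and only mimics the behavior described in the note; it is not part of the library.

```python
import codecs

# The generic codec writes a BOM before the text...
with_bom = "cat".encode("utf-16")
# ...while the explicit little-endian codec writes only the characters.
without_bom = "cat".encode("utf-16-le")

has_bom = with_bom.startswith(codecs.BOM_UTF16)


def bom_free_encoding(encoding):
    """Map a BOM-writing UTF codec name to its little-endian variant.

    Hypothetical helper illustrating the note above.
    """
    name = codecs.lookup(encoding).name  # normalized name, e.g. 'utf-16'
    if name in ("utf-16", "utf-32"):
        return name + "-le"
    return name
```

Writing a BOM before every word would corrupt the word field, which is presumably why an endianness-specific codec is used for the whole file.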

See EmbFile.create() for the remaining arguments.

Return type

None

classmethod create_from_file(source_file, out_dir=None, out_filename=None, vocab_size=None, compression=None, verbose=True, overwrite=False, encoding='utf-8', dtype=None)[source]

Format-specific arguments are encoding and dtype.

Note: all the text is encoded without a BOM (Byte Order Mark). If you pass “utf-16” or “utf-32”, the little-endian version is used (e.g. “utf-16-le”).

See EmbFile.create_from_file() for the remaining arguments.

Return type

Path

class embfile.formats.bin.BinaryEmbFileReader(file_obj, encoding='utf-8', dtype=dtype('float32'), out_dtype=None)[source]

Bases: embfile.core.reader.AbstractEmbFileReader

EmbFileReader for the binary format.

classmethod from_path(path, encoding='utf-8', dtype=dtype('float32'), out_dtype=None)[source]