embfile.formats.txt

Classes

TextEmbFile(path[, encoding, out_dtype, …])

The format used by GloVe and FastText files. Each (word, vector) pair is stored as a line of text made of space-separated fields.

TextEmbFileReader(file_obj[, out_dtype, …])

EmbFileReader for the textual format.

Reference

class embfile.formats.txt.TextEmbFile(path, encoding='utf-8', out_dtype='float32', vocab_size=None, verbose=True)[source]

Bases: embfile.core._file.EmbFile

The format used by GloVe and FastText files. Each (word, vector) pair is stored as a line of text made of space-separated fields:

word vec[0] vec[1] ... vec[vector_size-1]

The file may or may not start with an (automatically detected) “header” containing vocab_size and vector_size (in this order).

If the file doesn’t have a header, vector_size is set to the length of the first vector. If you know vocab_size (even approximately), you may want to provide it so that progress bars can show an ETA.

If the file has a header and you provide vocab_size, the provided value is ignored.
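
The line layout and the optional header described above can be illustrated with a minimal, standard-library-only sketch. This is not embfile's implementation: the helper names looks_like_header, parse_line and read_text_vectors are hypothetical, and treating "a first line made of exactly two integers" as a header is an assumption about how detection might work.

```python
import io
from typing import Iterator, List, Optional, TextIO, Tuple


def looks_like_header(line: str) -> Optional[Tuple[int, int]]:
    """Return (vocab_size, vector_size) if the line is two integers, else None."""
    fields = line.split()
    if len(fields) != 2:
        return None
    try:
        return int(fields[0]), int(fields[1])
    except ValueError:
        return None


def parse_line(line: str) -> Tuple[str, List[float]]:
    """Split 'word v0 v1 ...' into the word and its vector."""
    word, *values = line.rstrip().split(" ")
    return word, [float(v) for v in values]


def read_text_vectors(file_obj: TextIO) -> Iterator[Tuple[str, List[float]]]:
    """Yield (word, vector) pairs, skipping the header line if there is one."""
    first = file_obj.readline()
    if not first:
        return
    if looks_like_header(first) is None:
        # No header: the first line is already a word vector.
        yield parse_line(first)
    for line in file_obj:
        if line.strip():
            yield parse_line(line)


# Hypothetical sample data: a header followed by two 3-dimensional vectors.
sample = io.StringIO("2 3\nhello 0.1 0.2 0.3\nworld 1 2 3\n")
pairs = list(read_text_vectors(sample))
```

Note that this simple heuristic would misread a headerless file whose first entry is an integer token with a one-dimensional vector; a real reader may need a stricter check.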

Compressed files are decompressed as you read them. Note that each file reader decompresses the file independently, so if you need to read the file multiple times, it is better to decompress the entire file first and then open it.

Variables
  • path

  • encoding

  • out_dtype

  • verbose

Parameters
  • path (Union[str, Path]) – path to the embedding file

  • encoding (str) – encoding of the text file; default is utf-8

  • out_dtype (Union[str, dtype]) – the dtype of the vectors that will be returned; default is single-precision float

  • vocab_size (Optional[int]) – useful when the file has no header but you know vocab_size; if the file has a header, this argument is ignored.

  • verbose (int) – default level of verbosity for all methods

DEFAULT_EXTENSION: str = '.txt'
vocab_size: Optional[int]
classmethod create(out_path, word_vectors, vocab_size=None, compression=None, verbose=True, overwrite=False, encoding='utf-8', precision=5)[source]

Creates a file on disk containing the provided word vectors.

Parameters
  • out_path (Union[str, Path]) – path to the created file

  • word_vectors (Dict[str, VectorType] or Iterable[Tuple[str, VectorType]]) – it can be an iterable of word vector tuples or a dictionary word -> vector; the word vectors are written in the order determined by the iterable object.

  • vocab_size (Optional[int]) – it must be provided if word_vectors has no __len__ and the specific-format creator needs to know a priori the vocabulary size; in any case, the creator should check at the end that the provided vocab_size matches the actual length of word_vectors

  • compression (Optional[str]) – valid values are: "bz2"|"bz", "gzip"|"gz", "xz"|"lzma", "zip"

  • verbose (bool) – if positive, show progress bars and information

  • overwrite (bool) – overwrite the file if it already exists

  • format_kwargs – format-specific arguments

Return type

None
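
To make the role of word_vectors and vocab_size more concrete, here is a minimal sketch of a writer for the textual format. It is not the code behind create: the function write_text_vectors is hypothetical, it assumes the header is written as "vocab_size vector_size" on the first line, and it assumes precision means the number of digits after the decimal point.

```python
import io
from typing import Dict, Iterable, Optional, Sequence, TextIO, Tuple, Union

WordVectors = Union[
    Dict[str, Sequence[float]],
    Iterable[Tuple[str, Sequence[float]]],
]


def write_text_vectors(out: TextIO, word_vectors: WordVectors,
                       vocab_size: Optional[int] = None,
                       precision: int = 5) -> None:
    """Write a header line followed by one 'word v0 v1 ...' line per pair."""
    if vocab_size is None:
        # The header needs the size before anything is written.
        if not hasattr(word_vectors, "__len__"):
            raise ValueError("vocab_size is required when word_vectors has no __len__")
        vocab_size = len(word_vectors)
    pairs = word_vectors.items() if isinstance(word_vectors, dict) else word_vectors
    written = 0
    for word, vector in pairs:
        if written == 0:
            out.write(f"{vocab_size} {len(vector)}\n")
        values = " ".join(f"{x:.{precision}f}" for x in vector)
        out.write(f"{word} {values}\n")
        written += 1
    # Check at the end that the declared size matches what was written.
    if written != vocab_size:
        raise ValueError(f"expected {vocab_size} vectors, wrote {written}")


# Hypothetical data: a dictionary has __len__, so vocab_size can be omitted.
buffer = io.StringIO()
write_text_vectors(buffer, {"hello": [0.1, 0.25], "world": [1, 2]}, precision=3)
```

A generator of pairs has no __len__, which is why a writer that emits the header first needs vocab_size to be passed explicitly in that case.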

classmethod create_from_file(source_file, out_dir=None, out_filename=None, vocab_size=None, compression=None, verbose=True, overwrite=False, encoding='utf-8', precision=5)[source]

Creates a new file on disk with the same content as another file.

Parameters
  • source_file (EmbFile) – the file to take data from

  • out_dir (Union[str, Path, None]) – directory where the file will be stored; by default, it’s the parent directory of the source file

  • out_filename (Optional[str]) – filename of the produced file (inside out_dir); by default, it is obtained by replacing the extension of the source file with the proper one and appending the compression extension if compression is not None. Note: if you pass this argument, the compression extension is not automatically appended.

  • vocab_size (Optional[int]) – if the source EmbFile has attribute vocab_size == None, then: if the specific creator requires it (bin and txt formats do), it must be provided; otherwise it can be provided so that progress bars can show an ETA.

  • compression (Optional[str]) – valid values are: "bz2"|"bz", "gzip"|"gz", "xz"|"lzma", "zip"

  • verbose (bool) – print info and progress bar

  • overwrite (bool) – overwrite a file with the same name if it already exists

  • format_kwargs – format-specific arguments (see above)

Return type

Path

class embfile.formats.txt.TextEmbFileReader(file_obj, out_dtype=dtype('float32'), vocab_size=None)[source]

Bases: embfile.core.reader.AbstractEmbFileReader

EmbFileReader for the textual format.

classmethod from_path(path, encoding='utf-8', out_dtype=dtype('float32'), vocab_size=None)[source]

Returns a TextEmbFileReader from the path of a (possibly compressed) text file.

Return type

TextEmbFileReader

classmethod parse_header(line)[source]

Return type

Dict[str, Any]