Classes
The format used by Glove and FastText files. Each vector pair is stored as a line of text made of space-separated fields.
Reference
embfile.formats.txt.TextEmbFile(path, encoding='utf-8', out_dtype='float32', vocab_size=None, verbose=True)
Bases: embfile.core._file.EmbFile
The format used by Glove and FastText files. Each vector pair is stored as a line of text made of space-separated fields:
word vec[0] vec[1] ... vec[vector_size-1]
It may or may not have an (automatically detected) “header” containing vocab_size and vector_size (in this order).
If the file doesn’t have a header, vector_size is set to the length of the first vector. If you know vocab_size (even an approximate value), you may want to provide it so that progress bars can show an ETA.
If the file has a header and you provide vocab_size, the provided value is ignored.
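The layout and the header rule described above can be illustrated with a minimal, standalone sketch. It does not use embfile and is not the library's implementation: `parse_text_embeddings` and `_is_header` are hypothetical names, and the "exactly two integer fields" heuristic is an assumption about how a header could be detected.

```python
import io
from typing import Iterator, List, Optional, TextIO, Tuple


def _is_header(fields: List[str]) -> bool:
    # Assumed heuristic: a header line holds exactly two integers,
    # vocab_size and vector_size, in this order.
    return len(fields) == 2 and all(f.isdigit() for f in fields)


def parse_text_embeddings(
    file_obj: TextIO,
) -> Tuple[Optional[int], int, Iterator[Tuple[str, List[float]]]]:
    """Return (vocab_size, vector_size, iterator of (word, vector) pairs)."""
    first = file_obj.readline().rstrip().split(" ")
    if _is_header(first):
        vocab_size: Optional[int] = int(first[0])
        vector_size = int(first[1])
        pending = None
    else:
        # No header: vector_size is the length of the first vector.
        vocab_size = None
        vector_size = len(first) - 1
        pending = first

    def pairs() -> Iterator[Tuple[str, List[float]]]:
        if pending is not None:
            yield pending[0], [float(x) for x in pending[1:]]
        for line in file_obj:
            fields = line.rstrip().split(" ")
            yield fields[0], [float(x) for x in fields[1:]]

    return vocab_size, vector_size, pairs()


# Usage on an in-memory file that starts with a header line.
sample = io.StringIO("2 3\nhello 0.1 0.2 0.3\nworld 1 2 3\n")
vocab_size, vector_size, word_vectors = parse_text_embeddings(sample)
```

Note that this heuristic would misread a headerless file of one-dimensional vectors whose first word is itself a number; a real reader may need a stricter check.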
Compressed files are decompressed on the fly as you read them. Note that each file reader decompresses the file independently, so if you need to read the file multiple times, it is better to decompress the entire file first and then open it.
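Decompressing once up front can be done with the standard library alone. The sketch below handles only gzip and uses a hypothetical helper name, `decompress_gzip`; it is one possible way to do it, not something embfile provides.

```python
import gzip
import shutil
from pathlib import Path
from typing import Optional, Union


def decompress_gzip(
    src: Union[str, Path], dst: Optional[Union[str, Path]] = None
) -> Path:
    """Decompress a .gz file to disk and return the path of the result."""
    src = Path(src)
    # By default drop the final ".gz" suffix: vectors.txt.gz -> vectors.txt
    dst = Path(dst) if dst is not None else src.with_suffix("")
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        # Stream in chunks so the whole file never has to fit in memory.
        shutil.copyfileobj(f_in, f_out)
    return dst
```

The returned path can then be passed as the path argument of TextEmbFile, so that repeated reads no longer pay the decompression cost each time.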
path –
encoding (str) – encoding of the text file; default is utf-8
out_dtype (Union[str, dtype]) – the dtype of the vectors that will be returned; default is single-precision float
vocab_size (Optional[int]) – useful when the file has no header but you know vocab_size; if the file has a header, this argument is ignored.
verbose (int) – default level of verbosity for all methods
create(out_path, word_vectors, vocab_size=None, compression=None, verbose=True, overwrite=False, encoding='utf-8', precision=5)
Creates a file on disk containing the provided word vectors.
word_vectors (Dict[str, VectorType] or Iterable[Tuple[str, VectorType]]) – it can be an iterable of word vector tuples or a dictionary word -> vector; the word vectors are written in the order determined by the iterable object.
vocab_size (Optional[int]) – it must be provided if word_vectors has no __len__ and the format-specific creator needs to know the vocabulary size in advance; in any case, the creator should check at the end that the provided vocab_size matches the actual length of word_vectors
compression (Optional[str]) – valid values are: "bz2"|"bz", "gzip"|"gz", "xz"|"lzma", "zip"
verbose (bool) – if positive, show progress bars and information
overwrite (bool) – overwrite the file if it already exists
format_kwargs – format-specific arguments
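To make the interplay between word_vectors and vocab_size concrete, here is a standalone sketch of a writer for the textual format. It is not embfile's implementation: `write_text_embeddings` is a hypothetical name, and two details are assumptions, namely that a header line is written and that precision means the number of decimal digits.

```python
import itertools
from typing import Dict, Iterable, List, Optional, Tuple, Union

WordVectors = Union[Dict[str, List[float]], Iterable[Tuple[str, List[float]]]]


def write_text_embeddings(
    out_path: str,
    word_vectors: WordVectors,
    vocab_size: Optional[int] = None,
    encoding: str = "utf-8",
    precision: int = 5,
) -> None:
    pairs = word_vectors.items() if isinstance(word_vectors, dict) else word_vectors
    if vocab_size is None:
        # The header needs vocab_size before any vector is written.
        if not hasattr(word_vectors, "__len__"):
            raise ValueError("vocab_size is required when word_vectors has no __len__")
        vocab_size = len(word_vectors)

    iterator = iter(pairs)
    first = next(iterator, None)
    if first is None:
        raise ValueError("word_vectors is empty")
    vector_size = len(first[1])

    written = 0
    with open(out_path, "w", encoding=encoding) as f:
        # Assumed header layout: vocab_size and vector_size, in this order.
        f.write(f"{vocab_size} {vector_size}\n")
        for word, vec in itertools.chain([first], iterator):
            values = " ".join(f"{x:.{precision}f}" for x in vec)
            f.write(f"{word} {values}\n")
            written += 1

    # Final check that the declared size matches what was actually written.
    if written != vocab_size:
        raise ValueError(f"vocab_size={vocab_size} but {written} vectors were written")
```

In this sketch a size mismatch is only detected after the file has been written, so the caller is left to delete the invalid file.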
create_from_file(source_file, out_dir=None, out_filename=None, vocab_size=None, compression=None, verbose=True, overwrite=False, encoding='utf-8', precision=5)
Creates a new file on disk with the same content as another file.
source_file (EmbFile) – the file to take data from
out_dir (Union[str, Path, None]) – directory where the file will be stored; by default, it’s the parent directory of the source file
out_filename (Optional[str]) – filename of the produced file (inside out_dir); by default, it is obtained by replacing the extension of the source file with the proper one and appending the compression extension if compression is not None. Note: if you pass this argument, the compression extension is not automatically appended.
vocab_size (Optional[int]) – if the source EmbFile has attribute vocab_size == None, then: if the specific creator requires it (bin and txt formats do), it must be provided; otherwise it can be provided so that progress bars can show an ETA.
compression (Optional[str]) – valid values are: "bz2"|"bz", "gzip"|"gz", "xz"|"lzma", "zip"
verbose (bool) – print info and progress bar
overwrite (bool) – overwrite a file with the same name if it already exists
format_kwargs – format-specific arguments (see above)
embfile.formats.txt.TextEmbFileReader(file_obj, out_dtype=dtype('float32'), vocab_size=None)
Bases: embfile.core.reader.AbstractEmbFileReader
EmbFileReader for the textual format.