Substructure
Classes
BinaryEmbFile | Format used by the Google word2vec tool. |
BuildMatrixOutput | NamedTuple returned by build_matrix(). |
EmbFile | (Abstract class) The base class of all the embedding files. |
TextEmbFile | The format used by Glove and FastText files. Each vector pair is stored as a line of text made of space-separated fields. |
VVMEmbFile | (Custom format) A tar file storing vocabulary, vectors and metadata in 3 separate files. |
Functions
associate_extension | Associates a file extension to a registered embedding file format. |
build_matrix | Creates an embedding matrix for the provided words. |
extract | Extracts a file compressed with gzip, bz2 or lzma or a member file inside a zip/tar archive. |
| Extracts a file unless it already exists and returns its path. |
open | Opens an embedding file inferring the file format from the file extension (if not explicitly provided in format_id). |
register_format | Class decorator that associates a new EmbFile sub-class with a format_id and one or multiple extensions. |
Data
|
Maps each |
embfile.open(path, format_id=None, **format_kwargs)
Opens an embedding file inferring the file format from the file extension (if not explicitly provided in format_id). Note that you can always open a file using the specific EmbFile subclass; it can be more convenient since you get auto-completion and quick doc for format-specific arguments.
Example:
import embfile

with embfile.open('path/to/embfile.txt') as f:
    ...  # do something with f
Supported formats:
Class | format_id | Extensions | Description |
---|---|---|---|
TextEmbFile | txt | .txt, .vec | Glove/fastText format |
BinaryEmbFile | bin | .bin | Google word2vec format |
VVMEmbFile | vvm | .vvm | A tarball containing three files: vocab.txt, vectors.bin, meta.json |
You can register new formats or extensions using the functions embfile.register_format() and embfile.associate_extension().
EmbFile – an instance of a concrete subclass of EmbFile.
See also
embfile.register_format(): registers your custom EmbFile implementation so it is recognized by this function
embfile.associate_extension(): associates an extension to a registered format
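The extension-based lookup described above can be pictured with a small registry. The following is only a sketch of the idea and not embfile's actual implementation: the names FORMAT_CLASSES, EXTENSION_TO_FORMAT, register_format_sketch, associate_extension_sketch and infer_format_id are hypothetical, and the set of compression extensions that are skipped is an assumption.

```python
from pathlib import Path

# Hypothetical registries (embfile keeps its own internal tables).
FORMAT_CLASSES = {}        # format_id -> class
EXTENSION_TO_FORMAT = {}   # extension -> format_id
COMPRESSION_EXTENSIONS = {".gz", ".bz2", ".xz", ".zip"}  # assumed set


def associate_extension_sketch(ext, format_id, overwrite=False):
    """Bind an extension to a format that has already been registered."""
    if format_id not in FORMAT_CLASSES:
        raise ValueError(f"unknown format_id: {format_id!r}")
    if ext in EXTENSION_TO_FORMAT and not overwrite:
        raise ValueError(f"extension {ext!r} is already associated")
    EXTENSION_TO_FORMAT[ext] = format_id


def register_format_sketch(format_id, extensions, overwrite=False):
    """Class decorator: bind a class to a format_id and to its extensions."""
    def decorator(cls):
        if format_id in FORMAT_CLASSES and not overwrite:
            raise ValueError(f"format_id {format_id!r} is already registered")
        FORMAT_CLASSES[format_id] = cls
        for ext in extensions:
            associate_extension_sketch(ext, format_id, overwrite=overwrite)
        return cls
    return decorator


def infer_format_id(path):
    """Return the format_id for a path, ignoring trailing compression extensions."""
    suffixes = [suffix.lower() for suffix in Path(path).suffixes]
    while suffixes and suffixes[-1] in COMPRESSION_EXTENSIONS:
        suffixes.pop()
    if not suffixes or suffixes[-1] not in EXTENSION_TO_FORMAT:
        raise ValueError(f"cannot infer the format of {path!r}")
    return EXTENSION_TO_FORMAT[suffixes[-1]]


@register_format_sketch("txt", [".txt", ".vec"])
class TextFormatSketch:
    pass


@register_format_sketch("bin", [".bin"])
class BinaryFormatSketch:
    pass


inferred = infer_format_id("vectors/glove.6B.50d.txt.gz")
```

With a table like this, an open-style function only needs to look up the class for the inferred (or explicitly given) format_id and call it with the path and the format-specific keyword arguments.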
embfile.build_matrix(f, words, start_index=0, dtype=None, oov_initializer=<embfile.initializers.NormalInitializer object>, verbose=None)
Creates an embedding matrix for the provided words. words can be:
an iterable of strings – in this case, the words in the iterable are mapped to consecutive rows of the matrix starting from the row of index start_index (by default, 0); the rows with index i < start_index are left as zeros.
a dictionary {word -> index} that maps each word to a row – in this case, the matrix has shape [max_index + 1, vector_size], where max_index = max(word_to_index.values()). The rows that are not associated with any word are left as zeros. If multiple words are mapped to the same row, the function raises ValueError.
In both cases, all the word vectors that are not found in the file are initialized using oov_initializer, which can be:
None – leave missing vectors as zeros
a function that takes the shape of the array to generate (a tuple) as first argument:
oov_initializer=lambda shape: numpy.random.normal(scale=0.01, size=shape)
oov_initializer=numpy.ones # don't use this for word vectors :|
an instance of Initializer, which is a “fittable” initializer; in this case, the initializer is fit on the found vectors (the vectors that are both in words and in the file).
By default, oov_initializer is an instance of NormalInitializer, which generates vectors using a normal distribution with the same mean and standard deviation as the vectors found.
f (EmbFile) – the file containing the word vectors
words (Iterable[str] or Dict[str, int]) – iterable of words or dictionary that maps each word to a row index
start_index (int) – ignored if words is a dict; if words is an iterable, determines the index associated to the first word (and so, the number of rows left as zeros at the beginning of the matrix)
dtype (optional, DType) – matrix data type; if None, f.out_dtype is used
oov_initializer (optional, Callable or Initializer) – initializer for out-of-(file)-vocabulary word vectors. See the description above for more information.
verbose (bool) – if None, f.verbose is used
BuildMatrixOutput
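The row-assignment rules above can be illustrated without the library. This is a pure-Python sketch under stated assumptions, not the real function: build_matrix_sketch is a hypothetical name, a plain dict stands in for the embedding file, the matrix is a list of lists instead of a numpy array, a repeated word in an iterable keeps its first row, and the OOV initializer is called once per missing row with the shape (vector_size,).

```python
def build_matrix_sketch(word2vec, words, start_index=0, oov_initializer=None):
    """Return (matrix, word2index, missing_words) following the documented rules.

    word2vec is a dict {word: vector} standing in for the embedding file.
    """
    vector_size = len(next(iter(word2vec.values())))

    if isinstance(words, dict):
        word2index = dict(words)
        # Two words on the same row would overwrite each other.
        if len(set(word2index.values())) != len(word2index):
            raise ValueError("multiple words are mapped to the same row")
        num_rows = max(word2index.values(), default=-1) + 1
    else:
        word2index = {}
        for word in words:
            # Assumption: a repeated word keeps the row it was first given.
            word2index.setdefault(word, start_index + len(word2index))
        num_rows = start_index + len(word2index)

    # Rows not associated with any word stay as zeros.
    matrix = [[0.0] * vector_size for _ in range(num_rows)]
    missing_words = set()
    for word, index in word2index.items():
        if word in word2vec:
            matrix[index] = list(word2vec[word])
        else:
            missing_words.add(word)
            if oov_initializer is not None:
                matrix[index] = list(oov_initializer((vector_size,)))
    return matrix, word2index, missing_words


file_content = {"cat": [1.0, 2.0], "dog": [3.0, 4.0]}

# Iterable input: rows are consecutive, starting at start_index.
matrix_a, index_a, missing_a = build_matrix_sketch(
    file_content, ["cat", "unicorn", "dog"], start_index=1
)

# Dict input: the caller chooses the rows; the shape is [max_index + 1, vector_size].
matrix_b, index_b, missing_b = build_matrix_sketch(file_content, {"cat": 3, "dog": 0})
```

In the first call row 0 is reserved (for example for a padding token) and the row of the missing word stays at zero because no initializer is given.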
embfile.BuildMatrixOutput(matrix: numpy.ndarray, word2index: Dict[str, int], missing_words: Set[str])
Bases: tuple
NamedTuple returned by build_matrix()
Create new instance of BuildMatrixOutput(matrix, word2index, missing_words)
matrix: numpy.ndarray
Alias for field number 0
embfile.EmbFile(path, out_dtype=None, verbose=True)
Bases: abc.ABC
(Abstract class) The base class of all the embedding files.
Sub-classes must:
ensure they set attributes vocab_size and vector_size when a file instance is created
implement an EmbFileReader for the format and implement the abstract method _reader()
implement the abstract method _close()
(optionally) implement a VectorsLoader (if they can improve upon the default loader) and override loader()
(optionally) implement an EmbFileCreator for the format and set the class constant Creator
path (Path) – path of the embedding file (possibly compressed)
out_dtype (numpy.dtype) – all the vectors will be converted to this data type. The sub-class is responsible for setting a suitable default value.
verbose (bool) – whether to show a progress bar by default in all time-consuming operations
path (Path) – path of the embedding file
vocab_size (int or None) – number of words in the file (can be None for some TextEmbFile)
vector_size (int) – length of the vectors
verbose (bool) – whether to show a progress bar by default in all time-consuming operations
closed (bool) – True if the file was closed
_reader()
(Abstract method) Returns a new reader for the file, which allows iterating efficiently over the word vectors inside it. Called by reader().
_close()
(Abstract method) Releases any resources used by the EmbFile.
close()
Releases all the open resources linked to this file, including the opened readers.
reader()
Creates and returns a new file reader. When the file is closed, all the readers that are still open are closed automatically.
loader(words, missing_ok=True, verbose=None)
Returns a VectorsLoader, an iterator that looks for the provided words in the file and yields available (word, vector) pairs one by one. If missing_ok=True (default), provides the set of missing words in the property missing_words (once the iteration ends).
See embfile.core.VectorsLoader for more info.
Example
You should use a loader when you need to load many vectors in some custom data structure and you don’t want to waste memory (e.g. build_matrix uses it to load the vectors directly into the matrix):
data_structure = MyCustomStructure()
with file.loader(many_words) as loader:
    for word, vector in loader:
        data_structure[word] = vector
    print('Number of missing words:', len(loader.missing_words))
word_vectors()
Returns an iterable for all the (word, vector) pairs in the file.
to_list(verbose=None)
Returns the entire file content in a list of WordVector’s.
load(words, verbose=None)
Loads the vectors for the input words in a {word: vec} dict, raising KeyError if any word is missing.
(Dict[str, VectorType]) – a dictionary {word: vector}
See also
find() - it returns the set of all missing words, instead of raising KeyError.
find(words, verbose=None)
Looks for the input words in the file and returns: 1) a dict {word: vec} containing the available words and 2) a set containing the words not found.
_FindOutput namedtuple – a namedtuple with the following fields:
word2vec (Dict[str, VectorType]): dictionary {word: vector}
missing_words (Set[str]): set of words not found in the file
See also
load() - which raises KeyError if any word is not found in the file.
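The difference between find() and load() comes down to how missing words are reported. The sketch below reproduces that contract over a plain list of (word, vector) pairs; find_sketch, load_sketch and FindOutputSketch are hypothetical names, and the single-pass scan with early exit is an assumption about a reasonable strategy, not a description of the library's internals.

```python
from collections import namedtuple

# Hypothetical stand-in for the namedtuple returned by find().
FindOutputSketch = namedtuple("FindOutputSketch", ["word2vec", "missing_words"])


def find_sketch(word_vectors, words):
    """Scan the pairs once and split the requested words into found and missing."""
    remaining = set(words)
    found = {}
    for word, vector in word_vectors:
        if not remaining:
            break  # every requested word was found: stop reading early
        if word in remaining:
            found[word] = vector
            remaining.remove(word)
    return FindOutputSketch(word2vec=found, missing_words=remaining)


def load_sketch(word_vectors, words):
    """Like find_sketch, but a missing word is an error."""
    output = find_sketch(word_vectors, words)
    if output.missing_words:
        raise KeyError(f"words not found: {sorted(output.missing_words)}")
    return output.word2vec


pairs = [("the", [0.1]), ("cat", [0.2]), ("sat", [0.3])]
result = find_sketch(pairs, ["cat", "dog"])
```

Here result.word2vec holds the vector for "cat" and result.missing_words contains "dog", while load_sketch(pairs, ["cat", "dog"]) would raise KeyError.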
filter(condition, verbose=None)
Returns a generator that yields a word vector pair for each word in the file that satisfies a given condition. For example, to get all the words starting with “z”:
list(file.filter(lambda word: word.startswith('z')))
save_vocab(path=None, encoding='utf-8', overwrite=False, verbose=None)
Saves the vocabulary of the embedding file to a text file. By default the file is saved in the same directory as the embedding file, e.g.:
/path/to/filename.txt.gz ==> /path/to/filename_vocab.txt
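The default naming rule shown in the example can be sketched with pathlib. This is an illustration only: default_vocab_path is a hypothetical helper, and the two sets of extensions it strips are assumptions chosen to reproduce the documented example.

```python
from pathlib import Path

# Assumed extension sets; the library may recognize others.
COMPRESSION_EXTENSIONS = {".gz", ".bz2", ".xz", ".zip"}
FORMAT_EXTENSIONS = {".txt", ".vec", ".bin", ".vvm"}


def default_vocab_path(embedding_path):
    """Strip the compression and format extensions, then append '_vocab.txt'."""
    path = Path(embedding_path)
    name = path.name
    # First drop a trailing compression extension, then a format extension.
    for known_extensions in (COMPRESSION_EXTENSIONS, FORMAT_EXTENSIONS):
        suffix = Path(name).suffix
        if suffix.lower() in known_extensions:
            name = name[: -len(suffix)]
    return path.with_name(name + "_vocab.txt")


vocab_path = default_vocab_path("/path/to/filename.txt.gz")
```

Other dots in the filename are left alone, so a name such as glove.6B.50d.txt would map to glove.6B.50d_vocab.txt under these assumptions.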
create(out_path, word_vectors, vocab_size=None, compression=None, verbose=True, overwrite=False, **format_kwargs)
Creates a file on disk containing the provided word vectors.
word_vectors (Dict[str, VectorType] or Iterable[Tuple[str, VectorType]]) – it can be an iterable of word vector tuples or a dictionary word -> vector; the word vectors are written in the order determined by the iterable object.
vocab_size (Optional[int]) – it must be provided if word_vectors has no __len__ and the specific-format creator needs to know a priori the vocabulary size; in any case, the creator should check at the end that the provided vocab_size matches the actual length of word_vectors
compression (Optional[str]) – valid values are: "bz2"|"bz", "gzip"|"gz", "xz"|"lzma", "zip"
verbose (bool) – if True, show progress bars and information
overwrite (bool) – overwrite the file if it already exists
format_kwargs – format-specific arguments
create_from_file(source_file, out_dir=None, out_filename=None, vocab_size=None, compression=None, verbose=True, overwrite=False, **format_kwargs)
Creates a new file on disk with the same content as another file.
source_file (EmbFile) – the file to take data from
out_dir (Union[str, Path, None]) – directory where the file will be stored; by default, it’s the parent directory of the source file
out_filename (Optional[str]) – filename of the produced file (inside out_dir); by default, it is obtained by replacing the extension of the source file with the proper one and appending the compression extension if compression is not None. Note: if you pass this argument, the compression extension is not automatically appended.
vocab_size (Optional[int]) – if the source EmbFile has attribute vocab_size == None, then: if the specific creator requires it (bin and txt formats do), it must be provided; otherwise it can be provided for having ETA in progress bars.
compression (Optional[str]) – valid values are: "bz2"|"bz", "gzip"|"gz", "xz"|"lzma", "zip"
verbose (bool) – print info and progress bar
overwrite (bool) – overwrite a file with the same name if it already exists
format_kwargs – format-specific arguments (see above)
embfile.BinaryEmbFile(path, encoding='utf-8', dtype=dtype('float32'), out_dtype=None, verbose=True)
Bases: embfile.core._file.EmbFile
Format used by the Google word2vec tool. You can use it to read the file GoogleNews-vectors-negative300.bin.
It begins with a text header line of space-separated fields:
<vocab_size> <vector_size>
Each word vector pair is encoded as follows:
encoded word + space
followed by the binary representation of the vector.
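To make the layout concrete, here is a minimal writer and reader for the structure described above, built only on the standard library. It is a sketch under assumptions, not the library's code: write_word2vec_binary and read_word2vec_binary are hypothetical names, vectors are assumed to be little-endian float32 ('<f4'), and the byte-by-byte word scan only works for encodings such as UTF-8 where a space is always the single byte 0x20. The reader also tolerates a newline before each word, since files produced by the original word2vec tool end each record with one.

```python
import io
import struct


def write_word2vec_binary(stream, word_vectors, encoding="utf-8"):
    """Write a header line, then 'word ' + packed float32 vector for each pair."""
    word_vectors = list(word_vectors)
    vector_size = len(word_vectors[0][1])
    stream.write(f"{len(word_vectors)} {vector_size}\n".encode(encoding))
    for word, vector in word_vectors:
        stream.write(word.encode(encoding) + b" ")
        stream.write(struct.pack(f"<{vector_size}f", *vector))


def read_word2vec_binary(stream, encoding="utf-8"):
    """Yield (word, vector) pairs from a stream written in the layout above."""
    vocab_size, vector_size = map(int, stream.readline().decode(encoding).split())
    vector_bytes = 4 * vector_size  # 4 bytes per float32 component
    for _ in range(vocab_size):
        word_bytes = bytearray()
        while True:
            char = stream.read(1)
            if char == b"":
                raise EOFError("unexpected end of file while reading a word")
            if char == b" ":
                break
            word_bytes += char
        # Skip the optional newline left by the previous record.
        word = bytes(word_bytes).lstrip(b"\n").decode(encoding)
        vector = struct.unpack(f"<{vector_size}f", stream.read(vector_bytes))
        yield word, list(vector)


buffer = io.BytesIO()
write_word2vec_binary(buffer, [("hello", [0.5, -1.25]), ("world", [2.0, 0.0])])
buffer.seek(0)
round_trip = list(read_word2vec_binary(buffer))
```

The example values are exactly representable in float32, so the round trip returns them unchanged; arbitrary Python floats would come back rounded to single precision.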
path (Union[str, Path]) – path to the (possibly compressed) file
encoding (str) – text encoding; note: if you provide a UTF encoding (e.g. utf-16) that uses a BOM (Byte Order Mark) without specifying the byte-endianness (e.g. utf-16-le or utf-16-be), the little-endian version is used (utf-16-le).
dtype (Union[str, dtype]) – a valid numpy data type (or whatever you can pass to numpy.dtype()) (default: ‘<f4’; little-endian float, 4 bytes)
out_dtype (Union[str, dtype, None]) – all the vectors returned will be converted (if necessary) to this data type; by default, it is equal to the original data type of the vectors in the file, i.e. no conversion takes place.
create(out_path, word_vectors, vocab_size=None, compression=None, verbose=True, overwrite=False, encoding='utf-8', dtype=None)
Format-specific arguments are encoding and dtype.
Note: all the text is encoded without BOM (Byte Order Mark). If you pass “utf-16” or “utf-32”, the little-endian version is used (e.g. “utf-16-le”).
See create() for more.
create_from_file(source_file, out_dir=None, out_filename=None, vocab_size=None, compression=None, verbose=True, overwrite=False, encoding='utf-8', dtype=None)
Format-specific arguments are encoding and dtype.
Note: all the text is encoded without BOM (Byte Order Mark). If you pass “utf-16” or “utf-32”, the little-endian version is used (e.g. “utf-16-le”).
See create_from_file() for more.
embfile.TextEmbFile(path, encoding='utf-8', out_dtype='float32', vocab_size=None, verbose=True)
Bases: embfile.core._file.EmbFile
The format used by Glove and FastText files. Each vector pair is stored as a line of text made of space-separated fields:
word vec[0] vec[1] ... vec[vector_size-1]
It may or may not have an (automatically detected) “header” containing vocab_size and vector_size (in this order).
If the file doesn’t have a header, vector_size is set to the length of the first vector. If you know vocab_size (even an approximate value), you may want to provide it to have ETA in progress bars.
If the file has a header and you provide vocab_size, the provided value is ignored.
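One way to picture the header handling is the parser below. It is a sketch and not the library's detection logic: parse_text_embeddings is a hypothetical function, and the heuristic it uses (a first line made of exactly two integers is a header) is an assumption that would misfire on a file of 1-dimensional vectors whose first word is a number.

```python
def parse_text_embeddings(lines):
    """Return (vocab_size, vector_size, pairs) from an iterable of text lines.

    vocab_size is None when the file has no header.
    """
    lines = iter(lines)
    first_fields = next(lines).rstrip().split(" ")
    pairs = []

    # Assumed heuristic: exactly two integer fields means "<vocab_size> <vector_size>".
    if len(first_fields) == 2 and all(field.isdigit() for field in first_fields):
        vocab_size, vector_size = int(first_fields[0]), int(first_fields[1])
    else:
        vocab_size = None
        vector_size = len(first_fields) - 1  # length of the first vector
        pairs.append((first_fields[0], [float(x) for x in first_fields[1:]]))

    for line in lines:
        fields = line.rstrip().split(" ")
        pairs.append((fields[0], [float(x) for x in fields[1:]]))
    return vocab_size, vector_size, pairs


with_header = parse_text_embeddings(["2 3\n", "a 1 2 3\n", "b 4 5 6\n"])
without_header = parse_text_embeddings(["a 1 2 3\n", "b 4 5 6\n"])
```

Both calls produce the same pairs and the same vector_size; only the first one knows vocab_size before reading the whole input, which is what makes an ETA possible.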
Compressed files are decompressed on the fly as you read them. Note that each file reader decompresses the file independently, so if you need to read the file multiple times it’s better to decompress the entire file first and then open it.
encoding (str) – encoding of the text file; default is utf-8
out_dtype (Union[str, dtype]) – the dtype of the vectors that will be returned; default is single-precision float
vocab_size (Optional[int]) – useful when the file has no header but you know vocab_size; if the file has a header, this argument is ignored.
verbose (int) – default level of verbosity for all methods
create(out_path, word_vectors, vocab_size=None, compression=None, verbose=True, overwrite=False, encoding='utf-8', precision=5)
Creates a file on disk containing the provided word vectors.
word_vectors (Dict[str, VectorType] or Iterable[Tuple[str, VectorType]]) – it can be an iterable of word vector tuples or a dictionary word -> vector; the word vectors are written in the order determined by the iterable object.
vocab_size (Optional[int]) – it must be provided if word_vectors has no __len__ and the specific-format creator needs to know a priori the vocabulary size; in any case, the creator should check at the end that the provided vocab_size matches the actual length of word_vectors
compression (Optional[str]) – valid values are: "bz2"|"bz", "gzip"|"gz", "xz"|"lzma", "zip"
verbose (bool) – if True, show progress bars and information
overwrite (bool) – overwrite the file if it already exists
format_kwargs – format-specific arguments
create_from_file(source_file, out_dir=None, out_filename=None, vocab_size=None, compression=None, verbose=True, overwrite=False, encoding='utf-8', precision=5)
Creates a new file on disk with the same content as another file.
source_file (EmbFile) – the file to take data from
out_dir (Union[str, Path, None]) – directory where the file will be stored; by default, it’s the parent directory of the source file
out_filename (Optional[str]) – filename of the produced file (inside out_dir); by default, it is obtained by replacing the extension of the source file with the proper one and appending the compression extension if compression is not None. Note: if you pass this argument, the compression extension is not automatically appended.
vocab_size (Optional[int]) – if the source EmbFile has attribute vocab_size == None, then: if the specific creator requires it (bin and txt formats do), it must be provided; otherwise it can be provided for having ETA in progress bars.
compression (Optional[str]) – valid values are: "bz2"|"bz", "gzip"|"gz", "xz"|"lzma", "zip"
verbose (bool) – print info and progress bar
overwrite (bool) – overwrite a file with the same name if it already exists
format_kwargs – format-specific arguments (see above)
embfile.VVMEmbFile(path, out_dtype=None, verbose=True)
Bases: embfile.core._file.EmbFile, embfile.core.loaders.Word2Vector
(Custom format) A tar file storing vocabulary, vectors and metadata in 3 separate files.
Features:
the vocabulary can be loaded very quickly (with no need for an external vocab file) and it is loaded in memory when the file is opened;
direct access to vectors:
    by word using __getitem__() (e.g. file['hello'])
    by index using vector_at()
implements __contains__() (e.g. 'hello' in file)
all the information needed to open the file is stored in the file itself
Specifics. The files contained in a VVM file are:
vocab.txt: contains each word on a separate line
vectors.bin: contains the vectors in binary format (concatenated)
meta.json: must contain (at least) the following fields:
    vocab_size: number of word vectors in the file
    vector_size: length of a word vector
    encoding: text encoding used for vocab.txt
    dtype: vector data type string (notation used by numpy)
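The three-member layout can be reproduced with the standard tarfile, json and struct modules. What follows is a sketch of the idea rather than the library's implementation: create_vvm_sketch and VVMReaderSketch are hypothetical names, the tar is left uncompressed so that vectors.bin can be seeked, and only the '<f4' dtype is handled.

```python
import io
import json
import struct
import tarfile


def create_vvm_sketch(fileobj, word_vectors, encoding="utf-8"):
    """Write vocab.txt, vectors.bin and meta.json into an uncompressed tar."""
    word_vectors = list(word_vectors)
    vector_size = len(word_vectors[0][1])
    members = {
        "vocab.txt": "".join(word + "\n" for word, _ in word_vectors).encode(encoding),
        "vectors.bin": b"".join(
            struct.pack(f"<{vector_size}f", *vector) for _, vector in word_vectors
        ),
        "meta.json": json.dumps({
            "vocab_size": len(word_vectors),
            "vector_size": vector_size,
            "encoding": encoding,
            "dtype": "<f4",
        }).encode("utf-8"),
    }
    with tarfile.open(fileobj=fileobj, mode="w") as tar:
        for name, data in members.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))


class VVMReaderSketch:
    """Loads the vocabulary eagerly and reads vectors on demand by offset."""

    def __init__(self, fileobj):
        self._tar = tarfile.open(fileobj=fileobj, mode="r")
        meta = json.loads(self._tar.extractfile("meta.json").read())
        if meta["dtype"] != "<f4":
            raise ValueError("this sketch only supports little-endian float32")
        self.vector_size = meta["vector_size"]
        vocab_text = self._tar.extractfile("vocab.txt").read().decode(meta["encoding"])
        words = vocab_text.rstrip("\n").split("\n")
        self.word2index = {word: index for index, word in enumerate(words)}
        self._vectors = self._tar.extractfile("vectors.bin")

    def vector_at(self, index):
        row_bytes = 4 * self.vector_size
        self._vectors.seek(index * row_bytes)  # vectors are stored back to back
        return list(struct.unpack(f"<{self.vector_size}f", self._vectors.read(row_bytes)))

    def __getitem__(self, word):
        return self.vector_at(self.word2index[word])

    def __contains__(self, word):
        return word in self.word2index


archive = io.BytesIO()
create_vvm_sketch(archive, [("hello", [0.5, 1.5]), ("world", [-2.0, 4.0])])
archive.seek(0)
vvm = VVMReaderSketch(archive)
world_vector = vvm["world"]
```

Because every vector has the same byte length, the position of a word in vocab.txt is enough to compute where its vector starts in vectors.bin, which is what makes access by word or by index cheap.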
__getitem__(word)
Returns the vector associated to a word (random access to file).
create(out_path, word_vectors, vocab_size=None, compression=None, verbose=True, overwrite=False, encoding='utf-8', dtype=None)
Format-specific arguments are encoding and dtype.
Since VVM is a tar file, you should use a compression supported by the tarfile package (avoid zip): gz, bz2 or xz.
See create() for more doc.
create_from_file(source_file, out_dir=None, out_filename=None, vocab_size=None, compression=None, verbose=True, overwrite=False, encoding='utf-8', dtype=None)
Format-specific arguments are encoding and dtype. Since VVM is a tar file, you should use a compression supported by the tarfile package (avoid zip): gz, bz2 or xz.
See create_from_file() for more doc.
embfile.register_format(format_id, extensions, overwrite=False)
Class decorator that associates a new EmbFile sub-class with a format_id and one or multiple extensions. Once you register a format, you can use open() to open files of that format.
embfile.associate_extension(ext, format_id, overwrite=False)
Associates a file extension to a registered embedding file format.
embfile.extract(src_path, member=None, dest_dir='.', dest_filename=None, overwrite=False)
Extracts a file compressed with gzip, bz2 or lzma or a member file inside a zip/tar archive. The compression format is inferred from the extension or from the magic number of the file (in the case of zip and tar).
The file is first extracted to a .part file that is renamed when the extraction is completed.
member (Optional[str]) – must be provided if src_path points to an archive that contains multiple files
dest_dir (Union[str, Path]) – destination directory; by default, it’s the current working directory
dest_filename (Optional[str]) – destination filename; by default, it’s equal to member (if provided)
overwrite (bool) – overwrite the file at dest_path if it already exists
Path – the path to the extracted file
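The ".part then rename" step keeps a half-written file from ever appearing under the final name. A minimal sketch of that pattern for the gzip case only is shown below; extract_gzip_sketch is a hypothetical function, and the default destination name (the source name without its last extension) is an assumption.

```python
import gzip
import os
import shutil
import tempfile
from pathlib import Path


def extract_gzip_sketch(src_path, dest_dir=".", dest_filename=None, overwrite=False):
    """Decompress a .gz file to <dest>.part, then rename it to <dest>."""
    src_path = Path(src_path)
    # Assumed default: 'vectors.txt.gz' is extracted as 'vectors.txt'.
    dest_path = Path(dest_dir) / (dest_filename or src_path.stem)
    if dest_path.exists() and not overwrite:
        raise FileExistsError(dest_path)

    part_path = dest_path.with_name(dest_path.name + ".part")
    with gzip.open(src_path, "rb") as source, open(part_path, "wb") as target:
        shutil.copyfileobj(source, target)
    # The final name appears only once the whole content has been written.
    os.replace(part_path, dest_path)
    return dest_path


with tempfile.TemporaryDirectory() as tmp_dir:
    archive_path = Path(tmp_dir) / "vectors.txt.gz"
    with gzip.open(archive_path, "wb") as archive_file:
        archive_file.write(b"hello 0.1 0.2\n")

    extracted_path = extract_gzip_sketch(archive_path, dest_dir=tmp_dir)
    extracted_name = extracted_path.name
    extracted_content = extracted_path.read_bytes()
    part_file_left = (Path(tmp_dir) / "vectors.txt.part").exists()
```

If the process is interrupted during decompression, only the .part file is left behind, so a later run can tell an incomplete extraction apart from a finished one.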