Substructure
Classes
AbstractEmbFileReader – (Abstract class) Facilitates the implementation of an EmbFileReader.
EmbFile – (Abstract class) The base class of all the embedding files.
EmbFileReader – (Abstract class) Iterator that yields a word at each step and reads the corresponding vector only if the lazy property current_vector is accessed.
RandomAccessLoader – A loader for files that can randomly access word vectors.
SequentialLoader – A loader that just scans the file from beginning to end and yields a word vector pair when it meets a requested word.
VectorsLoader – (Abstract class) Iterator that, given some input words, looks for the corresponding vectors in the file and yields a word vector pair for each vector found; once the iteration stops, the attribute missing_words contains the set of words not found.
WordVector – A (word, vector) NamedTuple.
embfile.core.EmbFile(path, out_dtype=None, verbose=True)
Bases: abc.ABC
(Abstract class) The base class of all the embedding files.
Sub-classes must:
ensure they set the attributes vocab_size and vector_size when a file instance is created
implement an EmbFileReader for the format and implement the abstract method _reader()
implement the abstract method _close()
(optionally) implement a VectorsLoader (if they can improve upon the default loader) and override loader()
(optionally) implement an EmbFileCreator for the format and set the class constant Creator
Parameters
path (Path) – path of the embedding file (possibly compressed)
out_dtype (numpy.dtype) – all the vectors will be converted to this data type. The sub-class is responsible for setting a suitable default value.
verbose (bool) – whether to show a progress bar by default in all time-consuming operations
Attributes
path (Path) – path of the embedding file
vocab_size (int or None) – number of words in the file (can be None for some TextEmbFile)
vector_size (int) – length of the vectors
verbose (bool) – whether to show a progress bar by default in all time-consuming operations
closed (bool) – True if the file was closed
_reader()
(Abstract method) Returns a new reader for the file, which allows iterating efficiently over the word vectors inside it. Called by reader().
_close()
(Abstract method) Releases any resources used by the EmbFile.
close()
Releases all the open resources linked to this file, including the open readers.
reader()
Creates and returns a new file reader. When the file is closed, all the readers that are still open are closed automatically.
loader(words, missing_ok=True, verbose=None)
Returns a VectorsLoader, an iterator that looks for the provided words in the file and yields the available (word, vector) pairs one by one. If missing_ok=True (default), provides the set of missing words in the property missing_words (once the iteration ends).
See embfile.core.VectorsLoader for more info.
Example
You should use a loader when you need to load many vectors in some custom data structure and you don’t want to waste memory (e.g. build_matrix uses it to load the vectors directly into the matrix):
data_structure = MyCustomStructure()
with file.loader(many_words) as loader:
    for word, vector in loader:
        data_structure[word] = vector
    # missing_words is complete only once the iteration has ended
    print('Number of missing words:', len(loader.missing_words))
word_vectors()
Returns an iterable for all the (word, vector) pairs in the file.
to_list(verbose=None)
Returns the entire file content as a list of WordVector objects.
load(words, verbose=None)
Loads the vectors for the input words into a {word: vec} dict, raising KeyError if any word is missing.
Returns
(Dict[str, VectorType]) – a dictionary {word: vector}
See also
find() - it returns the set of all missing words, instead of raising KeyError.
find(words, verbose=None)
Looks for the input words in the file and returns: 1) a dict {word: vec} containing the available words and 2) a set containing the words not found.
Returns
_FindOutput (namedtuple) – a namedtuple with the following fields:
word2vec (Dict[str, VectorType]): dictionary {word: vector}
missing_words (Set[str]): set of words not found in the file
See also
load() - which raises KeyError if any word is not found in the file.
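The relationship between find() and load() can be sketched without the library. This is a minimal illustration, not the library's implementation: a plain dict stands in for the embedding file, lists of floats stand in for numpy vectors, and FindOutput is a hypothetical mirror of the documented _FindOutput fields.

```python
from typing import Dict, Iterable, List, NamedTuple, Set

Vector = List[float]  # stand-in for a numpy array


class FindOutput(NamedTuple):
    """Hypothetical mirror of the documented _FindOutput fields."""
    word2vec: Dict[str, Vector]
    missing_words: Set[str]


def find(stored: Dict[str, Vector], words: Iterable[str]) -> FindOutput:
    """Collect the available vectors and the set of words not found."""
    word2vec: Dict[str, Vector] = {}
    missing: Set[str] = set()
    for word in words:
        if word in stored:
            word2vec[word] = stored[word]
        else:
            missing.add(word)
    return FindOutput(word2vec, missing)


def load(stored: Dict[str, Vector], words: Iterable[str]) -> Dict[str, Vector]:
    """Like find(), but raise KeyError if any word is missing."""
    word2vec, missing = find(stored, words)
    if missing:
        raise KeyError(f"words not found: {sorted(missing)}")
    return word2vec


# "stored" plays the role of the file content (an assumption of this sketch)
stored = {"cat": [0.1, 0.2], "dog": [0.3, 0.4]}
result = find(stored, ["cat", "zebra"])
```

Here result.word2vec holds only the vector for "cat" and result.missing_words holds "zebra", while load(stored, ["zebra"]) would raise KeyError.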
filter(condition, verbose=None)
Returns a generator that yields a word vector pair for each word in the file that satisfies a given condition. For example, to get all the words starting with “z”:
list(file.filter(lambda word: word.startswith('z')))
save_vocab(path=None, encoding='utf-8', overwrite=False, verbose=None)
Saves the vocabulary of the embedding file to a text file. By default, the file is saved in the same directory as the embedding file, e.g.:
/path/to/filename.txt.gz ==> /path/to/filename_vocab.txt
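One way the default path could be derived is sketched below. The helper default_vocab_path and the set of compression suffixes are assumptions made for illustration; the library may compute the path differently.

```python
from pathlib import PurePosixPath

# Assumption: the compression suffixes to strip; the real list may differ.
COMPRESSION_SUFFIXES = {".gz", ".bz2", ".xz", ".zip"}


def default_vocab_path(emb_path: PurePosixPath) -> PurePosixPath:
    """Hypothetical helper: /path/to/filename.txt.gz -> /path/to/filename_vocab.txt"""
    name = emb_path.name
    # Drop a trailing compression suffix, if there is one.
    if PurePosixPath(name).suffix in COMPRESSION_SUFFIXES:
        name = PurePosixPath(name).stem
    # Drop the format extension (e.g. ".txt") and append "_vocab.txt".
    return emb_path.with_name(PurePosixPath(name).stem + "_vocab.txt")


vocab_path = default_vocab_path(PurePosixPath("/path/to/filename.txt.gz"))
```

PurePosixPath is used only so that the sketch behaves the same on every platform.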
create(out_path, word_vectors, vocab_size=None, compression=None, verbose=True, overwrite=False, **format_kwargs)
Creates a file on disk containing the provided word vectors.
Parameters
word_vectors (Dict[str, VectorType] or Iterable[Tuple[str, VectorType]]) – it can be an iterable of word vector tuples or a dictionary word -> vector; the word vectors are written in the order determined by the iterable object.
vocab_size (Optional[int]) – it must be provided if word_vectors has no __len__ and the format-specific creator needs to know the vocabulary size a priori; in any case, the creator should check at the end that the provided vocab_size matches the actual length of word_vectors
compression (Optional[str]) – valid values are: "bz2"|"bz", "gzip"|"gz", "xz"|"lzma", "zip"
verbose (bool) – if True, show progress bars and information
overwrite (bool) – overwrite the file if it already exists
format_kwargs – format-specific arguments
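The vocab_size rule described above can be sketched as follows. Both helpers (resolve_vocab_size and checked_pairs) are hypothetical names introduced for illustration, and lists of floats stand in for numpy vectors; a real creator may organize this check differently.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple, Union

Vector = List[float]  # stand-in for a numpy array
Pair = Tuple[str, Vector]


def resolve_vocab_size(word_vectors: Union[Dict[str, Vector], Iterable[Pair]],
                       vocab_size: Optional[int]) -> Optional[int]:
    """Use len(word_vectors) when available, else fall back to the caller's value."""
    if hasattr(word_vectors, "__len__"):
        return len(word_vectors)
    return vocab_size


def checked_pairs(word_vectors: Union[Dict[str, Vector], Iterable[Pair]],
                  expected: Optional[int]) -> Iterator[Pair]:
    """Yield (word, vector) pairs and verify the count once the input is exhausted."""
    pairs = word_vectors.items() if isinstance(word_vectors, dict) else word_vectors
    count = 0
    for pair in pairs:
        count += 1
        yield pair
    if expected is not None and count != expected:
        raise ValueError(f"vocab_size={expected}, but {count} word vectors were written")


def gen() -> Iterator[Pair]:
    """A generator has no __len__, so the caller must pass vocab_size."""
    yield "cat", [0.1, 0.2]
    yield "dog", [0.3, 0.4]


expected = resolve_vocab_size(gen(), 2)
written = list(checked_pairs(gen(), expected))
```

Passing a wrong size, e.g. list(checked_pairs(gen(), 3)), raises ValueError only after the last pair has been consumed, which matches the "check at the end" wording.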
create_from_file(source_file, out_dir=None, out_filename=None, vocab_size=None, compression=None, verbose=True, overwrite=False, **format_kwargs)
Creates a new file on disk with the same content as another file.
Parameters
source_file (EmbFile) – the file to take data from
out_dir (Union[str, Path, None]) – directory where the file will be stored; by default, it’s the parent directory of the source file
out_filename (Optional[str]) – filename of the produced file (inside out_dir); by default, it is obtained by replacing the extension of the source file with the proper one and appending the compression extension if compression is not None. Note: if you pass this argument, the compression extension is not automatically appended.
vocab_size (Optional[int]) – if the source EmbFile has attribute vocab_size == None, then: if the specific creator requires it (bin and txt formats do), it must be provided; otherwise it can be provided to get an ETA in progress bars.
compression (Optional[str]) – valid values are: "bz2"|"bz", "gzip"|"gz", "xz"|"lzma", "zip"
verbose (bool) – print info and progress bar
overwrite (bool) – overwrite a file with the same name if it already exists
format_kwargs – format-specific arguments (see above)
embfile.core.EmbFileReader(out_dtype)
Bases: abc.ABC
(Abstract class) Iterator that yields a word at each step and reads the corresponding vector only if the lazy property current_vector is accessed.
Iteration model. The iteration model is not the most obvious: an iteration step doesn’t return a word vector pair. Instead, for performance reasons, at each step a reader returns the next word. To read the vector for the current word, you must access the (lazy) property current_vector:
with emb_file.reader() as reader:
    for word in reader:
        if word in my_vocab:
            word2vec[word] = reader.current_vector
When you access current_vector for the first time, the vector data is read/parsed and a vector is created; the vector remains accessible until a new word is read.
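The iteration model can be illustrated with a toy reader that does not depend on the library. ToyTextReader and its parse_count counter are hypothetical; the sketch assumes one "word v1 v2 ..." record per line and uses lists of floats in place of numpy vectors.

```python
from typing import Dict, Iterator, List, Optional


class ToyTextReader:
    """Hypothetical stand-in for an EmbFileReader over 'word v1 v2 ...' lines."""

    def __init__(self, lines: List[str]) -> None:
        self._lines = lines
        self._pos = 0
        self._raw_vector = ""                       # unparsed text after the current word
        self._vector: Optional[List[float]] = None  # cache for the parsed vector
        self.parse_count = 0                        # number of vectors actually parsed

    def __iter__(self) -> Iterator[str]:
        return self

    def __next__(self) -> str:
        if self._pos >= len(self._lines):
            raise StopIteration
        word, _, self._raw_vector = self._lines[self._pos].partition(" ")
        self._pos += 1
        self._vector = None  # the previous vector is no longer accessible
        return word

    @property
    def current_vector(self) -> List[float]:
        # Parse on first access only; later accesses reuse the cached vector.
        if self._vector is None:
            self._vector = [float(x) for x in self._raw_vector.split()]
            self.parse_count += 1
        return self._vector


reader = ToyTextReader(["cat 0.1 0.2", "dog 0.3 0.4", "emu 0.5 0.6"])
my_vocab = {"dog"}
word2vec: Dict[str, List[float]] = {}
for word in reader:
    if word in my_vocab:
        word2vec[word] = reader.current_vector
```

Only the vector for "dog" is parsed; the other two lines are split at the first space and their numeric part is never converted.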
Creation. Creating a reader usually implies the creation of a file object. That’s why EmbFileReader implements the ContextManager interface, so that you can use it inside a with clause. Nonetheless, an EmbFile keeps track of all its open readers and closes them automatically when it is closed.
Parameters
out_dtype (Union[str, dtype]) – all the vectors will be converted to this dtype before being returned
Attributes
out_dtype (numpy.dtype) – all the vectors will be converted to this data type before being returned
reset()
(Abstract method) Brings the reader back to the first word vector pair.
embfile.core.AbstractEmbFileReader(out_dtype)
Bases: embfile.core.reader.EmbFileReader, abc.ABC
(Abstract class) Facilitates the implementation of an EmbFileReader, especially for a file that stores a word and its vector nearby in the file (txt and bin formats), though it can be used for other kinds of formats as well if it looks convenient. It:
keeps track of whether the reader is pointing to a word or a vector and skips the vector when it is not requested during an iteration
caches the current vector once it is read
Sub-classes must implement:
_read_word() – (Abstract method) Reads a word assuming the next thing to read in the file is a word.
_read_vector() – (Abstract method) Reads the vector for the last word read.
(Abstract method) Called when we want to read the next word without loading the vector for the current word.
(Abstract method) Closes the reader.
_read_word()
(Abstract method) Reads a word assuming the next thing to read in the file is a word. It must raise StopIteration if there’s not another word to read.
_read_vector()
(Abstract method) Reads the vector for the last word read. This method is never called if no word has been read or at the end of file. It is called at most once per word.
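The bookkeeping described for this class (tracking whether the reader points to a word or a vector, skipping unrequested vectors, caching the current one) might look roughly like the sketch below. It is not the library's code: ToyAbstractReader, ListReader, the _skip_vector name and the two counters are assumptions made for illustration; only _read_word and _read_vector are names taken from this page.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional, Tuple


class ToyAbstractReader(ABC):
    """Hypothetical sketch of the word/vector bookkeeping."""

    def __init__(self) -> None:
        self._vector: Optional[List[float]] = None
        self._pointing_to_vector = False  # True after a word is read, until its vector is consumed

    def __iter__(self) -> "ToyAbstractReader":
        return self

    def __next__(self) -> str:
        if self._pointing_to_vector:
            self._skip_vector()           # the vector was never requested: skip it
            self._pointing_to_vector = False
        self._vector = None
        word = self._read_word()          # raises StopIteration at end of file
        self._pointing_to_vector = True
        return word

    @property
    def current_vector(self) -> List[float]:
        if self._vector is None:
            if not self._pointing_to_vector:
                raise RuntimeError("no current word")
            self._vector = self._read_vector()  # called at most once per word
            self._pointing_to_vector = False
        return self._vector

    @abstractmethod
    def _read_word(self) -> str: ...

    @abstractmethod
    def _read_vector(self) -> List[float]: ...

    @abstractmethod
    def _skip_vector(self) -> None: ...


class ListReader(ToyAbstractReader):
    """Concrete toy reader over an in-memory list of (word, vector) records."""

    def __init__(self, records: List[Tuple[str, List[float]]]) -> None:
        super().__init__()
        self._records = records
        self._index = -1
        self.vectors_read = 0
        self.vectors_skipped = 0

    def _read_word(self) -> str:
        if self._index + 1 >= len(self._records):
            raise StopIteration
        self._index += 1
        return self._records[self._index][0]

    def _read_vector(self) -> List[float]:
        self.vectors_read += 1
        return list(self._records[self._index][1])

    def _skip_vector(self) -> None:
        self.vectors_skipped += 1


reader = ListReader([("cat", [0.1]), ("dog", [0.2]), ("emu", [0.3])])
wanted: Dict[str, List[float]] = {}
for word in reader:
    if word == "dog":
        wanted[word] = reader.current_vector
        wanted[word] = reader.current_vector  # second access hits the cache
```

In this run one vector is read (for "dog") and two are skipped (for "cat" and "emu").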
embfile.core.VectorsLoader(words, missing_ok=True)
Bases: abc.ABC, Iterator[WordVector]
(Abstract class) Iterator that, given some input words, looks for the corresponding vectors in the file and yields a word vector pair for each vector found; once the iteration stops, the attribute missing_words contains the set of words not found.
Subclasses can load the word vectors in any order.
missing_words
The words that still have to be found; once the iteration stops, it’s the set of the words that are in the input words but not in the file.
embfile.core.SequentialLoader(file, words, missing_ok=True, verbose=False)
Bases: abc.ABC, Iterator[WordVector]
A loader that just scans the file from beginning to end and yields a word vector pair when it meets a requested word. Used by txt and bin files. It’s unable to tell whether a word is in the file before having read the entire file.
The progress bar shows the percentage of the file that has been examined, not the number of yielded word vectors, so the iteration may stop before the bar reaches 100% (in the case that all the input words are in the file).
missing_words
The words that still have to be found; once the iteration stops, it’s the set of the words that are in the input words but not in the file.
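A sequential scan with early stopping can be sketched as follows. ToySequentialLoader is a hypothetical stand-in that reads from an in-memory iterable of (word, vector) records instead of a file; the library's loader additionally handles missing_ok and progress bars, which are left out here.

```python
from typing import Iterable, Iterator, List, Set, Tuple

Vector = List[float]  # stand-in for a numpy array
Pair = Tuple[str, Vector]


class ToySequentialLoader:
    """Hypothetical sketch: scan (word, vector) records in file order."""

    def __init__(self, records: Iterable[Pair], words: Iterable[str]) -> None:
        self._records = iter(records)
        self.missing_words: Set[str] = set(words)  # words still to be found

    def __iter__(self) -> "ToySequentialLoader":
        return self

    def __next__(self) -> Pair:
        if not self.missing_words:
            raise StopIteration  # everything was found: stop before the end of the file
        for word, vector in self._records:
            if word in self.missing_words:
                self.missing_words.remove(word)
                return word, vector
        raise StopIteration  # end of file: missing_words holds the words not found


scanned: List[str] = []


def records() -> Iterator[Pair]:
    """Yield records one at a time, remembering which ones were examined."""
    for pair in [("a", [1.0]), ("b", [2.0]), ("c", [3.0]), ("d", [4.0])]:
        scanned.append(pair[0])
        yield pair


loader = ToySequentialLoader(records(), ["b", "a"])
found = list(loader)

partial = ToySequentialLoader([("a", [1.0]), ("b", [2.0])], ["a", "zzz"])
partial_found = list(partial)
```

The first loader stops after examining two records because both requested words have been found; the second one has to reach the end of the input before it can report "zzz" as missing.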
embfile.core.RandomAccessLoader(words, word2vec, word2index=None, missing_ok=True, verbose=False, close_hook=None)
Bases: abc.ABC, Iterator[WordVector]
A loader for files that can randomly access word vectors. If word2index is provided, the words are sorted by their position and the corresponding vectors are loaded in this order; I observed that this significantly improves performance with VVMEmbFile (presumably due to buffering).
Parameters
word2vec (Word2Vector) – object that implements word2vec[word] and word in word2vec
word2index (Optional[Callable[[str], int]]) – function that returns the index (position) of a word inside the file; this enables an optimization for formats like VVM that store vectors sequentially in the same file.
missing_ok (bool)
verbose (bool)
close_hook (Optional[Callable]) – function to call when closing this loader
missing_words
The words that still have to be found; once the iteration stops, it’s the set of the words that are in the input words but not in the file.
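The position-sorting optimization can be sketched with plain Python objects. ToyRandomAccessLoader is hypothetical: a dict plays the role of word2vec, a second dict supplies the positions, and missing_words is computed up front rather than updated during iteration, which is a simplification of this sketch.

```python
from typing import Callable, Dict, Iterable, List, Optional, Set, Tuple

Vector = List[float]  # stand-in for a numpy array


class ToyRandomAccessLoader:
    """Hypothetical sketch: look words up directly, in file-position order."""

    def __init__(self,
                 words: Iterable[str],
                 word2vec: Dict[str, Vector],
                 word2index: Optional[Callable[[str], int]] = None) -> None:
        requested = set(words)
        available = [word for word in requested if word in word2vec]
        # Simplification: the missing set is known before iterating.
        self.missing_words: Set[str] = requested - set(available)
        if word2index is not None:
            # Visit the vectors in the order in which they are stored.
            available.sort(key=word2index)
        self._word2vec = word2vec
        self._order = iter(available)

    def __iter__(self) -> "ToyRandomAccessLoader":
        return self

    def __next__(self) -> Tuple[str, Vector]:
        word = next(self._order)  # raises StopIteration when done
        return word, self._word2vec[word]


word2vec = {"cat": [0.1], "dog": [0.2], "emu": [0.3]}
positions = {"cat": 0, "dog": 1, "emu": 2}
loader = ToyRandomAccessLoader(["emu", "yak", "cat"], word2vec, positions.__getitem__)
loaded = list(loader)
```

Although "emu" is requested first, "cat" is loaded first because it comes earlier in the (simulated) file.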
embfile.core.WordVector(word: str, vector: numpy.ndarray)
Bases: tuple
A (word, vector) NamedTuple.
Create new instance of WordVector(word, vector)
vector: numpy.ndarray
Alias for field number 1