Reference
embfile.core.reader.EmbFileReader(out_dtype)
Bases: abc.ABC
(Abstract class) Iterator that yields a word at each step and reads the corresponding vector
only if the lazy property current_vector is accessed.
Iteration model. The iteration model is not the most obvious one: an iteration step doesn’t
return a (word, vector) pair. Instead, for performance reasons, each step returns only
the next word. To read the vector for the current word, you must access the (lazy) property
current_vector:
with emb_file.reader() as reader:
    for word in reader:
        if word in my_vocab:
            word2vec[word] = reader.current_vector
When you access current_vector for the first time after a word is read,
the vector data is read and parsed and a vector is created; the vector remains
accessible until a new word is read.
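The lazy-access pattern described above can be illustrated with a small stand-alone sketch. It does not use embfile: ToyReader, its in-memory "word v1 v2 ..." lines and the parsed counter are all hypothetical names, introduced only to show that a vector is parsed when current_vector is accessed and not before.

```python
from typing import Iterator, List, Optional


class ToyReader:
    """Hypothetical stand-in for a reader; not part of embfile."""

    def __init__(self, lines: List[str]) -> None:
        self._lines = lines
        self._pos = -1                       # index of the current line
        self._vector: Optional[List[float]] = None
        self.parsed = 0                      # number of vectors actually parsed

    def __iter__(self) -> Iterator[str]:
        return self

    def __next__(self) -> str:
        if self._pos + 1 >= len(self._lines):
            raise StopIteration
        self._pos += 1
        self._vector = None                  # drop the previous word's vector
        # Only the word is extracted here; the numbers stay unparsed.
        return self._lines[self._pos].split(" ", 1)[0]

    @property
    def current_vector(self) -> List[float]:
        if self._pos < 0:
            raise RuntimeError("no word has been read yet")
        if self._vector is None:             # parse on first access, then reuse
            fields = self._lines[self._pos].split(" ")[1:]
            self._vector = [float(x) for x in fields]
            self.parsed += 1
        return self._vector


reader = ToyReader(["cat 0.1 0.2", "dog 0.3 0.4", "emu 0.5 0.6"])
my_vocab = {"dog"}
word2vec = {}
for word in reader:
    if word in my_vocab:
        word2vec[word] = reader.current_vector
```

In this sketch three words are iterated, but only the vector for "dog" is ever parsed, which is where the performance benefit of the iteration model would come from.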
Creation. Creating a reader usually implies creating a file object. That’s why
EmbFileReader implements the ContextManager interface, so that you can use it inside
a with clause. Nonetheless, an EmbFile keeps track of all its open readers and closes them
automatically when it is closed.
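One way such bookkeeping could work is sketched below. This is not embfile's implementation: ToyFile, ToyTrackedReader and the _readers set are hypothetical names, and the sketch only assumes that the file hands out readers through a reader() method, as in the example above.

```python
class ToyTrackedReader:
    """Hypothetical reader that unregisters itself from its owner on close."""

    def __init__(self, owner):
        self._owner = owner
        self.closed = False

    def close(self):
        if not self.closed:
            self.closed = True
            self._owner._readers.discard(self)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False                      # do not swallow exceptions


class ToyFile:
    """Hypothetical file object that remembers the readers it created."""

    def __init__(self):
        self._readers = set()

    def reader(self):
        new_reader = ToyTrackedReader(self)
        self._readers.add(new_reader)
        return new_reader

    def close(self):
        # Iterate over a copy, because close() removes readers from the set.
        for open_reader in list(self._readers):
            open_reader.close()


emb_file = ToyFile()
with emb_file.reader() as scoped:         # closed when the with block exits
    pass
leaked = emb_file.reader()                # never closed explicitly
emb_file.close()                          # closes the leaked reader as well
```

The with clause remains the preferable way to release a reader promptly; closing the file acts as a safety net for readers that were left open.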
out_dtype (Union[str, dtype]) – all the vectors will be converted to this dtype before being returned
out_dtype (numpy.dtype) – all the vectors will be converted to this data type before being returned
reset()
(Abstract method) Brings the reader back to the first (word, vector) pair.
embfile.core.reader.AbstractEmbFileReader(out_dtype)
Bases: embfile.core.reader.EmbFileReader, abc.ABC
(Abstract class) Facilitates the implementation of an EmbFileReader, especially for a
file that stores each word next to its vector (txt and bin formats), though it can
be used for other kinds of formats as well if that looks convenient. It:
- keeps track of whether the reader is pointing to a word or a vector and skips the vector when it is not requested during an iteration
- caches the current vector once it is read
Sub-classes must implement:
- _read_word(): (Abstract method) Reads a word assuming the next thing to read in the file is a word.
- _read_vector(): (Abstract method) Reads the vector for the last word read.
- (Abstract method) Called when we want to read the next word without loading the vector for the current word.
- (Abstract method) Closes the reader.
_read_word()
(Abstract method) Reads a word assuming the next thing to read in the file is a word. It must raise StopIteration if there is no other word to read.
_read_vector()
(Abstract method) Reads the vector for the last word read. This method is never called if no word has been read or at the end of the file. It is called at most once per word.
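To make the division of labour concrete, here is a self-contained sketch of how a base class might drive these hooks. It does not subclass the real AbstractEmbFileReader: SketchReader, TextSketchReader, the _vector_pending flag, the vector_reads counter and the _skip_vector hook name are assumptions made for illustration. Only _read_word and _read_vector, and their contracts, are taken from the reference above.

```python
import io
from abc import ABC, abstractmethod


class SketchReader(ABC):
    """Hypothetical base class mimicking the behaviour described above."""

    def __init__(self):
        self._vector_pending = False   # True while an unread vector is next in the file
        self._cached = None

    def __iter__(self):
        return self

    def __next__(self):
        if self._vector_pending:
            self._skip_vector()        # the vector was never requested: skip it
        self._vector_pending = False
        self._cached = None
        word = self._read_word()       # raises StopIteration at end of file
        self._vector_pending = True
        return word

    @property
    def current_vector(self):
        if self._cached is None:
            if not self._vector_pending:
                raise RuntimeError("no current word")
            self._cached = self._read_vector()   # at most once per word
            self._vector_pending = False
        return self._cached

    @abstractmethod
    def _read_word(self):
        ...

    @abstractmethod
    def _read_vector(self):
        ...

    @abstractmethod
    def _skip_vector(self):            # hypothetical name for the skip hook
        ...


class TextSketchReader(SketchReader):
    """Reads lines of the form 'word v1 v2 ...' from a text stream."""

    def __init__(self, stream):
        super().__init__()
        self._stream = stream
        self.vector_reads = 0

    def _read_word(self):
        chars = []
        while True:
            ch = self._stream.read(1)
            if ch == "":
                raise StopIteration    # no more words
            if ch == " ":
                return "".join(chars)
            chars.append(ch)

    def _read_vector(self):
        self.vector_reads += 1
        return [float(x) for x in self._stream.readline().split()]

    def _skip_vector(self):
        self._stream.readline()        # discard the rest of the line unparsed


stream = io.StringIO("cat 0.1 0.2\ndog 0.3 0.4\nemu 0.5 0.6\n")
reader = TextSketchReader(stream)
words, wanted = [], {}
for word in reader:
    words.append(word)
    if word == "dog":
        wanted[word] = reader.current_vector
```

With this arrangement a sub-class only has to know how to read or skip the next item in the file, while the word/vector state tracking and the caching live in one place.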