embfile.core.reader

Reference

class embfile.core.reader.EmbFileReader(out_dtype)

Bases: abc.ABC

(Abstract class) Iterator that yields a word at each step and reads the corresponding vector only if the lazy property current_vector is accessed.

Iteration model. The iteration model is not the most obvious one: an iteration step doesn’t return a word-vector pair. Instead, for performance reasons, each step returns only the next word. To read the vector for the current word, you must access the (lazy) property current_vector:

word2vec = {}
with emb_file.reader() as reader:
    for word in reader:
        if word in my_vocab:
            # The vector is read and parsed only on this access
            word2vec[word] = reader.current_vector

The first time you access current_vector for a word, the vector data is read and parsed and a vector is created; the vector remains accessible until the next word is read.
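
The following self-contained sketch illustrates the iteration model without depending on embfile. ToyReader is a hypothetical stand-in, not the library's implementation: it reads from an in-memory list of text lines, raises RuntimeError where the real reader would raise IllegalOperation, and counts how many vectors it actually parses.

```python
class ToyReader:
    """Hypothetical stand-in for a reader; not part of embfile."""

    def __init__(self, lines):
        self._lines = lines
        self._pos = -1        # index of the current line; -1 means no word read yet
        self._vector = None   # cache for the current vector
        self.parsed = 0       # number of vectors actually parsed

    def __iter__(self):
        return self

    def __next__(self):
        if self._pos + 1 >= len(self._lines):
            raise StopIteration
        self._pos += 1
        self._vector = None   # a new word invalidates the cached vector
        # Split off the word only; the numbers are left unparsed
        return self._lines[self._pos].split(" ", 1)[0]

    @property
    def current_vector(self):
        if self._pos < 0:
            raise RuntimeError("no word has been read yet")
        if self._vector is None:
            numbers = self._lines[self._pos].split(" ")[1:]
            self._vector = [float(x) for x in numbers]
            self.parsed += 1
        return self._vector


lines = ["cat 0.1 0.2", "dog 0.3 0.4", "fish 0.5 0.6"]
my_vocab = {"dog"}
word2vec = {}
reader = ToyReader(lines)
for word in reader:
    if word in my_vocab:
        word2vec[word] = reader.current_vector
```

After the loop, reader.parsed is 1 even though three words were visited: only the vector for "dog" was parsed.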

Creation. Creating a reader usually implies the creation of a file object. That’s why EmbFileReader implements the ContextManager interface, so that you can use it inside a with clause. Nonetheless, an EmbFile keeps track of all its open readers and closes them automatically when it is closed.
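
A minimal sketch of this ownership pattern, assuming nothing about embfile's internals: ToyFile and ToyManagedReader are hypothetical names used only to show how a file object might remember its readers and close the ones left open.

```python
class ToyManagedReader:
    """Hypothetical reader that supports the with statement."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
        return False  # do not suppress exceptions


class ToyFile:
    """Hypothetical file that remembers the readers it created."""

    def __init__(self):
        self._readers = []

    def reader(self):
        new_reader = ToyManagedReader()
        self._readers.append(new_reader)
        return new_reader

    def close(self):
        # Close every reader created by this file, including forgotten ones
        for open_reader in self._readers:
            open_reader.close()
        self._readers.clear()


toy_file = ToyFile()
with toy_file.reader() as managed:
    pass                           # closed on leaving the with block
forgotten = toy_file.reader()      # never closed explicitly
toy_file.close()                   # closes it on the caller's behalf
```

Both managed.closed and forgotten.closed are True at the end: the first through the with clause, the second through the file's own cleanup.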

Parameters

out_dtype (Union[str, dtype]) – all the vectors will be converted to this dtype before being returned

Variables

out_dtype (numpy.dtype) – all the vectors will be converted to this data type before being returned

close()

Closes the reader

Return type

None

abstractmethod reset()

(Abstract method) Brings the reader back to the first word-vector pair

Return type

None

abstractmethod next_word()

(Abstract method) Reads and returns the next word in the file.

Return type

str

abstractmethod current_vector()

(Abstract method) The vector for the current word (i.e. the last word read). If accessed before any word has been read, it raises IllegalOperation. The dtype of the returned vector is cls.out_dtype.

Return type

ndarray

class embfile.core.reader.AbstractEmbFileReader(out_dtype)

Bases: embfile.core.reader.EmbFileReader, abc.ABC

(Abstract class) Facilitates the implementation of an EmbFileReader, especially for a file that stores each word next to its vector (txt and bin formats), though it can be used for other kinds of formats as well if it looks convenient. It:

  • keeps track of whether the reader is pointing to a word or a vector and skips the vector when it is not requested during an iteration

  • caches the current vector once it is read

Sub-classes must implement:

_read_word()

(Abstract method) Reads a word assuming the next thing to read in the file is a word.

_read_vector()

(Abstract method) Reads the vector for the last word read.

_skip_vector()

(Abstract method) Called when we want to read the next word without loading the vector for the current word.

_reset()

(Abstract method) Resets the reader

_close()

(Abstract method) Closes the reader

abstractmethod _read_word()

(Abstract method) Reads a word assuming the next thing to read in the file is a word. It must raise StopIteration if there’s not another word to read.

Return type

str

abstractmethod _read_vector()

(Abstract method) Reads the vector for the last word read. This method is never called if no word has been read or at the end of file. It is called at most once per word.

Return type

ndarray

abstractmethod _skip_vector()

(Abstract method) Called when we want to read the next word without loading the vector for the current word. For some formats, it may be empty.

Return type

None
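
To illustrate what skipping can mean in practice, here is a sketch for a hypothetical binary layout, assumed for illustration only and not taken from embfile: each word is stored as UTF-8 bytes terminated by a space, followed by DIM little-endian float32 values. Under that assumption, skipping a vector is a single seek and no floats are decoded. The function names are illustrative, not library API.

```python
import io
import struct

DIM = 3
VECTOR_BYTES = DIM * 4  # each float32 takes 4 bytes


def read_word(stream):
    """Read bytes up to the next space; raise StopIteration at end of file."""
    chars = bytearray()
    while True:
        b = stream.read(1)
        if not b:
            if chars:
                raise ValueError("truncated file")
            raise StopIteration
        if b == b" ":
            return chars.decode("utf-8")
        chars += b


def skip_vector(stream):
    """Jump over the vector bytes without decoding them."""
    stream.seek(VECTOR_BYTES, io.SEEK_CUR)


def read_vector(stream):
    """Decode the DIM float32 values that follow the current word."""
    return struct.unpack(f"<{DIM}f", stream.read(VECTOR_BYTES))


# Build a tiny in-memory file with two word-vector pairs
data = b"".join(
    word.encode("utf-8") + b" " + struct.pack(f"<{DIM}f", *vec)
    for word, vec in [("cat", (1.0, 2.0, 3.0)), ("dog", (4.0, 5.0, 6.0))]
)
stream = io.BytesIO(data)
first = read_word(stream)
skip_vector(stream)              # the vector for "cat" is not needed
second = read_word(stream)
second_vector = read_vector(stream)
```

In a line-oriented text format the equivalent step would typically be discarding the rest of the current line.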

abstractmethod _reset()

(Abstract method) Resets the reader

Return type

None

abstractmethod _close()

(Abstract method) Closes the reader

Return type

None

reset()

Brings the reader back to the beginning of the file

Return type

None

next_word()

Reads and returns the next word in the file.

Return type

str

current_vector

The vector associated with the current word (i.e. the last word read). If accessed before any word has been read, it raises IllegalOperation.

Return type

ndarray
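
The division of labour between the public methods and the underscore-prefixed hooks can be sketched as follows. This is an illustrative re-implementation under assumptions, not embfile's actual code: ToyAbstractReader and ToyTextReader are hypothetical names, the hook-calling logic is inferred from the descriptions above, vectors are plain lists instead of ndarrays, and RuntimeError stands in for IllegalOperation.

```python
import io
from abc import ABC, abstractmethod


class ToyAbstractReader(ABC):
    """Hypothetical base class; the hook-calling logic is an assumption."""

    def __init__(self):
        self._has_word = False        # True while a current word exists
        self._vector_pending = False  # True if the current vector is still unread
        self._vector = None           # cache for the current vector

    def __iter__(self):
        return self

    def __next__(self):
        return self.next_word()

    def next_word(self):
        if self._vector_pending:
            self._skip_vector()       # the vector was never requested
            self._vector_pending = False
        self._vector = None
        self._has_word = False
        word = self._read_word()      # raises StopIteration at end of file
        self._has_word = True
        self._vector_pending = True
        return word

    @property
    def current_vector(self):
        if not self._has_word:
            raise RuntimeError("no current word")
        if self._vector is None:
            self._vector = self._read_vector()  # called at most once per word
            self._vector_pending = False
        return self._vector

    def reset(self):
        self._reset()
        self._has_word = False
        self._vector_pending = False
        self._vector = None

    def close(self):
        self._close()

    @abstractmethod
    def _read_word(self): ...

    @abstractmethod
    def _read_vector(self): ...

    @abstractmethod
    def _skip_vector(self): ...

    @abstractmethod
    def _reset(self): ...

    @abstractmethod
    def _close(self): ...


class ToyTextReader(ToyAbstractReader):
    """Reads 'word x1 x2 ...' lines from a text stream."""

    def __init__(self, stream):
        super().__init__()
        self._stream = stream
        self.vectors_read = 0

    def _read_word(self):
        chars = []
        while True:
            c = self._stream.read(1)
            if c == "":
                raise StopIteration
            if c == " ":
                return "".join(chars)
            chars.append(c)

    def _read_vector(self):
        self.vectors_read += 1
        return [float(x) for x in self._stream.readline().split()]

    def _skip_vector(self):
        self._stream.readline()       # discard the rest of the line unparsed

    def _reset(self):
        self._stream.seek(0)

    def _close(self):
        self._stream.close()


reader = ToyTextReader(io.StringIO("cat 0.1 0.2\ndog 0.3 0.4\n"))
wanted = {}
for word in reader:
    if word == "dog":
        wanted[word] = reader.current_vector
        same = reader.current_vector is wanted[word]  # second access hits the cache
reader.reset()
first_again = reader.next_word()
```

The subclass only knows how to read, skip, and rewind; deciding when to skip and when to reuse a cached vector stays in the base class.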