The core class of the package is the abstract class EmbFile. Three subclasses are implemented, one per supported format. Each format is associated with a format_id (string) and one or more file extensions:
Class | format_id | Extensions | Description
---|---|---|---
TextEmbFile | txt | .txt, .vec | GloVe/fastText format
BinaryEmbFile | bin | .bin | Google word2vec format
VVMEmbFile | vvm | .vvm | Custom format storing vocabulary, vectors and metadata in separate files inside a TAR
You can open an embedding file either:

- using the constructor of any of the subclasses above:

      from embfile import BinaryEmbFile

      with BinaryEmbFile('GoogleNews-vectors-negative300.bin') as file:
          ...

- or using embfile.open(), which by default infers the file format from the file extension:

      import embfile

      with embfile.open('GoogleNews-vectors-negative300.bin') as file:
          print(file)
      """ Will print:
      BinaryEmbFile (
          path = GoogleNews-vectors-negative300.bin,
          vocab_size = 3000000,
          vector_size = 300
      )
      """
You can force a particular format by passing the format_id argument.

All path arguments can be either strings or pathlib.Path objects. Object attributes storing paths are always pathlib.Path, never strings.
For format-specific arguments, check out the documentation of the specific class:

- BinaryEmbFile — the format used by the Google word2vec tool.
- TextEmbFile — the format used by GloVe and fastText files; each word vector pair is stored as a line of text made of space-separated fields.
- VVMEmbFile — (custom format) a TAR file storing vocabulary, vectors and metadata in 3 separate files.
You can pass format-specific arguments to embfile.open too.

How to handle compression is left to EmbFile subclasses. As a general rule, a concrete EmbFile requires non-compressed files unless the opposite is specified in its docstring. In most cases you'll want to work on non-compressed files anyway, since it's much faster.

embfile provides utilities for working with compression in the submodule compression; the following functions can also be used (or imported) directly from the root module:
- Extracts a file compressed with gzip, bz2 or lzma, or a member file inside a zip/tar archive.
- Extracts a file unless it already exists and returns its path.
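The extraction helpers above can be approximated with the standard library. This is an illustrative sketch, not embfile's actual API: the function name `extract` and the suffix-to-opener map are my own.

```python
import bz2
import gzip
import lzma
from pathlib import Path

# Hypothetical helper mirroring what a decompression utility might do;
# the real embfile.compression functions may have different names/signatures.
_OPENERS = {'.gz': gzip.open, '.bz2': bz2.open, '.xz': lzma.open}

def extract(src_path, dest_path=None) -> Path:
    """Decompress src_path; by default, the output path is the input path
    without its compression extension (file.txt.gz -> file.txt)."""
    src = Path(src_path)
    opener = _OPENERS[src.suffix]          # pick the right module by extension
    dest = Path(dest_path) if dest_path else src.with_suffix('')
    with opener(src, 'rb') as fin, open(dest, 'wb') as fout:
        fout.write(fin.read())
    return dest
```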
Currently, TextEmbFile is the only format that allows you to open a compressed file directly and decompress it "lazily" while reading. Lazy decompression works for all compression formats but zip. For uniformity of behavior, you can still open zipped files directly, but under the hood the file is fully extracted to a temporary file before reading starts.

Lazy decompression makes sense only if you want to perform a single pass through the file (e.g. you are converting it): every new operation that requires a new file reader must (lazily) decompress the file again.
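"Lazy" decompression here means streaming: bytes are decompressed as they are consumed, without writing the decompressed file to disk. A minimal sketch with the standard library (illustrative, not embfile code) — note that each call re-decompresses the file from the start, which is why lazy decompression only pays off for single-pass workloads:

```python
import gzip

def iter_lines(path):
    """Stream lines from a gzip-compressed text file, decompressing on the
    fly; nothing is extracted to disk."""
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        for line in f:
            yield line.rstrip('\n')
```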
The format ID and file extensions of each registered file format are stored in the global object embfile.FORMATS. To associate a new file extension with a registered format, you can use associate_extension():
>>> import embfile
>>> embfile.associate_extension(ext='.w2v', format_id='bin')
>>> print(embfile.FORMATS)
Class Format ID Extensions
------------- ----------- ------------
BinaryEmbFile bin .bin, .w2v
TextEmbFile txt .txt, .vec
VVMEmbFile vvm .vvm
To register a new format (see Implementing a new format), you can use the class decorator register_format():
@embfile.register_format(format_id='hdf5', extensions=['.h5', '.hdf5'])
class HDF5EmbFile(EmbFile):
    # ...
- load() — loads the vectors for the input words into a dict; raises KeyError if any word is missing.
- find() — looks for the input words in the file and returns: 1) a dict with the vectors of the words found, and 2) the set of missing words.
- loader() — returns a VectorsLoader, an iterator that yields a (word, vector) pair for each word found and keeps track of the missing words.

For example:

word2vec = f.load(['hello', 'world'])  # raises KeyError if any word is missing
word2vec, missing_words = f.find(['hello', 'world', 'missingWord'])
You should prefer loader to find when you want to store the vectors directly into some custom data structure, without wasting time and memory building an intermediate dictionary. For example, build_matrix() uses loader to load the vectors directly into a numpy array.
Here’s how you use a loader:
data_structure = MyCustomStructure()
for word, vector in file.loader(many_words):
    data_structure[word] = vector
If you're interested in missing_words:
data_structure = MyCustomStructure()
loader = file.loader(many_words)
for word, vector in loader:
    data_structure[word] = vector
print('Missing words:', loader.missing_words)
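To make the loader contract concrete, here is a toy, dict-backed stand-in (not embfile code): it yields (word, vector) pairs for the words it finds and collects the rest into missing_words as iteration proceeds.

```python
class DictLoader:
    """Toy stand-in for a file-backed VectorsLoader: yields (word, vector)
    pairs for found words; missing words end up in self.missing_words."""

    def __init__(self, word2vec, words):
        self._word2vec = word2vec
        self._words = list(words)
        self.missing_words = set()

    def __iter__(self):
        for word in self._words:
            if word in self._word2vec:
                yield word, self._word2vec[word]
            else:
                self.missing_words.add(word)

vectors = {'hello': [1, 2, 3], 'world': [4, 5, 6]}
loader = DictLoader(vectors, ['hello', 'world', 'missingWord'])
found = dict(loader)                   # consumes the loader
# found == {'hello': [1, 2, 3], 'world': [4, 5, 6]}
# loader.missing_words == {'missingWord'}
```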
- Returns the entire file content in a dictionary word -> vector.
- Returns the entire file content in a list of (word, vector) pairs.
The docstring of embfile.build_matrix() contains everything you need to know to use it. Here, we'll give some examples through an IPython session. First, we'll generate a dummy file with only three vectors:
In [1]: import tempfile
In [2]: from pathlib import Path
In [3]: import numpy as np
In [4]: import embfile
In [5]: from embfile import VVMEmbFile
In [6]: word_vectors = [
...: ('hello', np.array([1, 2, 3])),
...: ('world', np.array([4, 5, 6])),
...: ('!', np.array([7, 8, 9]))
...: ]
...:
In [7]: path = Path(tempfile.gettempdir(), 'file.vvm')
In [8]: VVMEmbFile.create(path, word_vectors, overwrite=True, verbose=False)
Let's build a matrix out of a list of words. We'll use the default oov_initializer to initialize the vectors of out-of-file-vocabulary words:
In [9]: words = ['hello', 'ciao', 'world', 'mondo']
In [10]: with embfile.open(path, verbose=False) as f:
....: result = embfile.build_matrix(
....: f, words,
....: start_index=1, # map the first word to the row 1 (default is 0)
....: )
....:
# result belongs to a class that extends NamedTuple
In [11]: print(result.pretty())
[ 0.000 0.000 0.000] # 0:
[ 1.000 2.000 3.000] # 1: hello
[ 1.839 3.846 4.377] # 2: ciao [out of file vocabulary]
[ 4.000 5.000 6.000] # 3: world
[ 5.016 6.407 5.299] # 4: mondo [out of file vocabulary]
In [12]: result.matrix
Out[12]:
array([[0. , 0. , 0. ],
[1. , 2. , 3. ],
[1.83933484, 3.84599979, 4.37652519],
[4. , 5. , 6. ],
[5.01626252, 6.40740252, 5.2986341 ]])
In [13]: result.word2index
Out[13]: {'hello': 1, 'ciao': 2, 'world': 3, 'mondo': 4}
In [14]: result.missing_words
Out[14]: {'ciao', 'mondo'}
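To clarify what build_matrix computes, here is a deliberately simplified, dict-backed sketch (the real function works through a loader, accepts more options, and its signature differs): rows before start_index stay zero, in-vocabulary words get their stored vectors, and out-of-vocabulary rows are filled by oov_initializer.

```python
import numpy as np

def build_matrix_sketch(word2vec, words, start_index=1,
                        oov_initializer=np.random.normal):
    """Illustrative sketch of build_matrix semantics, not embfile's
    implementation: word2vec is a plain dict standing in for the file."""
    vector_size = len(next(iter(word2vec.values())))
    matrix = np.zeros((start_index + len(words), vector_size))
    word2index, missing_words = {}, set()
    for i, word in enumerate(words, start=start_index):
        word2index[word] = i
        if word in word2vec:
            matrix[i] = word2vec[word]
        else:
            missing_words.add(word)
            matrix[i] = oov_initializer(size=vector_size)  # random row
    return matrix, word2index, missing_words
```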
Now, we'll build a matrix from a dictionary {word: index}. This time, we'll use a custom oov_initializer.
In [15]: word2index = {
....: 'hello': 1,
....: 'ciao': 3,
....: 'world': 4,
....: 'mondo': 5
....: }
....:
In [16]: with embfile.open(path, verbose=False) as f:
....: def custom_initializer(shape):
....: scale = 1 / np.sqrt(f.vector_size)
....: return np.random.normal(loc=0, scale=scale, size=shape)
....: result = embfile.build_matrix(f, word2index, oov_initializer=custom_initializer)
....:
In [17]: print(result.pretty())
[ 0.000 0.000 0.000] # 0:
[ 1.000 2.000 3.000] # 1: hello
[ 0.000 0.000 0.000] # 2:
[-0.619 -0.339 -0.273] # 3: ciao [out of file vocabulary]
[ 4.000 5.000 6.000] # 4: world
[-0.369 0.942 1.401] # 5: mondo [out of file vocabulary]
In [18]: result.matrix
Out[18]:
array([[ 0. , 0. , 0. ],
[ 1. , 2. , 3. ],
[ 0. , 0. , 0. ],
[-0.61940755, -0.33910621, -0.27301005],
[ 4. , 5. , 6. ],
[-0.36894112, 0.9424371 , 1.40109962]])
In [19]: result.word2index
Out[19]: {'hello': 1, 'ciao': 3, 'world': 4, 'mondo': 5}
In [20]: result.missing_words
Out[20]: {'ciao', 'mondo'}
See embfile.initializers to check out the available initializers.
Efficient iteration of the file is implemented by format-specific readers.

- EmbFileReader — (abstract class) iterator that yields a word at each step and reads the corresponding vector only on request.
A new reader for a file can be created using the method reader(). Every method that needs to iterate the file entries sequentially uses this method to create a new reader.

You usually won't need to use a reader directly, because EmbFile defines quicker-to-use methods that handle the reader for you. If you are interested, the docstring is pretty detailed.
The following methods are wrappers around reader(). Keep in mind that every time you use one of them, you are creating a new file reader, and items are read from disk (though the vocabulary may be loaded in memory, as in VVM files).
- Returns an iterable over all the words in the file.
- Returns an iterable over all the vectors in the file.
- word_vectors() — returns an iterable over all the (word, vector) pairs in the file.
- filter() — returns a generator that yields a (word, vector) pair for each word in the file that satisfies a given condition, e.g. all the words starting with "z".
Don't use word_vectors() if you want to filter the vectors based on a condition on words: it would read the vectors of all the words, even those that don't meet the condition. Use filter() instead.
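The reason filter() is cheaper can be sketched as follows: test the condition on the word first, and materialize the vector only on a match. The callable-wrapped vectors below are a toy stand-in for a reader's lazy vector reads, not embfile's actual mechanism.

```python
def filter_pairs(reader_pairs, condition):
    """Sketch of filter() semantics: the vector is only 'read' (here, the
    callable is only invoked) for words that pass the condition."""
    for word, read_vector in reader_pairs:
        if condition(word):
            yield word, read_vector()

# Toy reader: vectors wrapped in callables to mimic lazy disk reads
pairs = [('zebra', lambda: [1, 2]),
         ('apple', lambda: [3, 4]),
         ('zero',  lambda: [5, 6])]

z_words = dict(filter_pairs(pairs, lambda w: w.startswith('z')))
# z_words == {'zebra': [1, 2], 'zero': [5, 6]}
```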
Each subclass of EmbFile implements the following two class methods:

- create() — creates a file on disk containing the provided word vectors.
- create_from_file() — creates a new file on disk with the same content of another file.
You can create a new file either from:

- a dictionary {word: vector}, or
- an iterable of (word, vector) tuples; the iterable can also be an iterator/generator.
For example:
import numpy as np
from embfile import VVMEmbFile
word_vectors = {
    "hello": np.array([0.1, 0.2, 0.3]),
    "world": np.array([0.4, 0.5, 0.6]),
    # ... a lot more word vectors
}
VVMEmbFile.create(
    '/tmp/dummy.vvm.gz',
    word_vectors,
    dtype='<f2',      # store numbers as little-endian 2-byte floats
    compression='gz'  # compress with gzip
)
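In numpy dtype notation, a little-endian 2-byte float is spelled '<f2' ('<' means little-endian, 'f2' a 2-byte float). A quick check of what such a dtype implies for storage:

```python
import numpy as np

half = np.dtype('<f2')          # little-endian half-precision float
# itemsize confirms each stored number takes 2 bytes
size = half.itemsize

vectors = np.array([0.1, 0.2, 0.3])  # default float64: 8 bytes per number
stored = vectors.astype(half)        # values are rounded to half precision
```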
Let's convert a textual file to a VVM file. The following will generate a compressed VVM file in the same folder as the textual file (and with the proper file extension):
import embfile
from embfile import VVMEmbFile

with embfile.open('path/to/source/file.txt') as src_file:
    dest_path = VVMEmbFile.create_from_file(src_file, compression='gz')
    # dest_path == Path('path/to/source/file.vvm.gz')
If you ever feel the need to implement a new format, it's fairly easy to integrate your custom format into this library and to test it. My suggestions:

- grab the template below;
- read the EmbFile docstring;
- look at the existing implementations in the embfile.formats subpackage;
- for testing, see how the built-in formats are tested in tests/test_files.py.

Using an IDE is, of course, highly recommended.
from pathlib import Path
from typing import Iterable, Optional, Tuple

import embfile
from embfile.types import DType, PathType, VectorType
from embfile.core import EmbFile, EmbFileReader


# TODO: implement a reader
# Note: you could also extend AbstractEmbFileReader if it's convenient for you
class CustomEmbFileReader(EmbFileReader):
    """ Implements file sequential reading """

    def __init__(self, out_dtype: DType):  # TODO: add the needed arguments
        super().__init__(out_dtype)

    def _close(self) -> None:
        pass

    def reset(self) -> None:
        pass

    def next_word(self) -> str:
        pass

    def current_vector(self) -> VectorType:
        pass


@embfile.register_format('custom', extensions=['.cst', '.cust'])
class CustomEmbFile(EmbFile):

    def __init__(self, path: PathType, out_dtype: DType = None, verbose: int = 1):
        super().__init__(path, out_dtype, verbose)  # this call is not optional
        # self.vocab_size = ...
        # self.vector_size = ...

    def _close(self) -> None:
        pass

    def _reader(self) -> EmbFileReader:
        return CustomEmbFileReader()  # TODO: pass the needed arguments

    # Optional:
    def _loader(self, words: Iterable[str], missing_ok: bool = True,
                verbose: Optional[int] = None) -> 'VectorsLoader':
        """ By default, a SequentialLoader is returned. """
        return super()._loader(words, missing_ok, verbose)

    @classmethod
    def _create(cls, out_path: Path, word_vectors: Iterable[Tuple[str, VectorType]],
                vector_size: int, vocab_size: Optional[int], compression: Optional[str] = None,
                verbose: bool = True, **format_kwargs) -> None:
        pass


if __name__ == '__main__':
    print(embfile.FORMATS)
This’ll print:
"""
Class Format ID Extensions
------------- ----------- ------------
BinaryEmbFile bin .bin
TextEmbFile txt .txt, .vec
VVMEmbFile vvm .vvm
CustomEmbFile custom .cst, .cust
"""