A package for working with files containing word embeddings (aka word vectors). Written for:
providing a common interface for different file formats;
providing a flexible function for building “embedding matrices” that you can use for initializing the Embedding layer of your deep learning model;
using as little RAM as possible: no need to load 3M vectors as with gensim.load_word2vec_format when you only need 20K;
satisfying my (inexplicable) urge to write a Python package.
Supports the textual and Google’s binary format plus a convenient custom format (.vvm) that supports constant-time access to word vectors by word.
Makes it easy to implement, test and integrate new file formats.
Supports virtually any text encoding and vector data type (though you should probably just use UTF-8).
Well-documented and type-annotated (meaning great IDE support).
Extensively tested.
Progress bars (by default) for every time-consuming operation.
pip install embfile
import embfile
with embfile.open("path/to/file.bin") as f: # infer file format from file extension
print(f.vocab_size, f.vector_size)
# Load some word vectors into a dictionary (raises KeyError if any word is missing)
word2vec = f.load(['ciao', 'hello'])
# Like f.load() but allows missing words (and returns them in a Set)
word2vec, missing_words = f.find(['ciao', 'hello', 'someMissingWord'])
# Build a matrix for initializing the Embedding layer, either from an
# iterable of words or a dictionary {word: index}. Missing word vectors
# are initialized according to the argument "oov_initializer"
matrix, word2index, missing_words = embfile.build_matrix(f, words)
The core class of the package is the abstract class EmbFile. Three subclasses are implemented, one per supported format. Each format is associated with a format_id (string) and one or more file extensions:
Class | format_id | Extensions | Description
---|---|---|---
TextEmbFile | txt | .txt, .vec | Glove/fastText format
BinaryEmbFile | bin | .bin | Google word2vec format
VVMEmbFile | vvm | .vvm | Custom format storing vocabulary, vectors and metadata in separate files inside a TAR
You can open an embedding file either:
using the constructor of any of the subclasses above:
from embfile import BinaryEmbFile
with BinaryEmbFile('GoogleNews-vectors-negative300.bin') as file:
...
or using embfile.open(), which by default infers the file format from the file extension:
import embfile
with embfile.open('GoogleNews-vectors-negative300.bin') as file:
print(file)
""" Will print:
BinaryEmbFile (
path = GoogleNews-vectors-negative300.bin,
vocab_size = 3000000,
vector_size = 300
)
"""
You can force a particular format by passing the format_id argument.
All the path arguments can be either strings or pathlib.Path objects. Object attributes storing paths are always pathlib.Path, not strings.
For format-specific arguments, check out the documentation of the specific class:

Class | Description
---|---
BinaryEmbFile | Format used by the Google word2vec tool.
TextEmbFile | The format used by Glove and FastText files. Each word vector pair is stored as a line of text made of space-separated fields.
VVMEmbFile | (Custom format) A tar file storing vocabulary, vectors and metadata in 3 separate files.
You can pass format-specific arguments to embfile.open() too.
How to handle compression is left to EmbFile subclasses. As a general rule, a concrete EmbFile requires uncompressed files unless stated otherwise in its docstring. In most cases you want to work on uncompressed files anyway, because it’s much faster.
embfile provides utilities for working with compression in the submodule embfile.compression; the following functions can also be used (or imported) directly from the root module:

Function | Description
---|---
extract_file | Extracts a file compressed with gzip, bz2 or lzma, or a member file inside a zip/tar archive.
extract_if_missing | Extracts a file unless it already exists and returns its path.
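For example, here is a minimal sketch (the paths are hypothetical) that decompresses a file once and reuses the extracted copy on later runs:
import embfile
from embfile import extract_if_missing
# Extract the archive next to itself (no-op if it was already extracted)
txt_path = extract_if_missing('path/to/embeddings.txt.gz', dest_dir='path/to')
with embfile.open(txt_path) as f:
    print(f.vocab_size, f.vector_size)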
Currently, TextEmbFile is the only format that allows you to open a compressed file directly and to decompress it “lazily” while reading it. Lazy decompression works for all compression formats but zip. For uniformity of behavior, you can still open zipped files directly but, under the hood, the file will be fully extracted to a temporary file before reading starts.
Lazy decompression makes sense only if you want to perform a single pass through the file (e.g. you are converting it): every operation that creates a new file reader has to (lazily) decompress the file again.
Format ID and file extensions of each registered file format are stored in the global object embfile.FORMATS. To associate a file extension with a registered format, you can use associate_extension():
>>> import embfile
>>> embfile.associate_extension(ext='.w2v', format_id='bin')
>>> print(embfile.FORMATS)
Class Format ID Extensions
------------- ----------- ------------
BinaryEmbFile bin .bin, .w2v
TextEmbFile txt .txt, .vec
VVMEmbFile vvm .vvm
To register a new format (see Implementing a new format), you can use the class decorator register_format():
@embfile.register_format(format_id='hdf5', extensions=['.h5', '.hdf5'])
class HDF5EmbFile(EmbFile):
# ...
The main methods for loading word vectors are:

Method | Description
---|---
load | Loads the vectors for the input words in a {word: vec} dict, raising KeyError if any word is missing.
find | Looks for the input words in the file, returning a dict with the found vectors and a set with the missing words.
loader | Returns a VectorsLoader, an iterator that yields the (word, vector) pairs found in the file.
word2vec = f.load(['hello', 'world']) # raises KeyError if any word is missing
word2vec, missing_words = f.find(['hello', 'world', 'missingWord'])
You should prefer loader to find when you want to store the vectors directly into some custom data structure without wasting time and memory on building an intermediate dictionary. For example, build_matrix() uses loader to load the vectors directly into a numpy array.
Here’s how you use a loader:
data_structure = MyCustomStructure()
for word, vector in file.loader(many_words):
data_structure[word] = vector
If you’re interested in missing_words:
data_structure = MyCustomStructure()
loader = file.loader(many_words)
for word, vector in loader:
data_structure[word] = vector
print('Missing words:', loader.missing_words)
Two more methods return the entire file content: one as a dictionary word -> vector, the other as a list of WordVector tuples.
The docstring of embfile.build_matrix() contains everything you need to know to use it. Here, we’ll give some examples through an IPython session.
First, we’ll generate a dummy file with only three vectors:
In [1]: import tempfile
In [2]: from pathlib import Path
In [3]: import numpy as np
In [4]: import embfile
In [5]: from embfile import VVMEmbFile
In [6]: word_vectors = [
...: ('hello', np.array([1, 2, 3])),
...: ('world', np.array([4, 5, 6])),
...: ('!', np.array([7, 8, 9]))
...: ]
...:
In [7]: path = Path(tempfile.gettempdir(), 'file.vvm')
In [8]: VVMEmbFile.create(path, word_vectors, overwrite=True, verbose=False)
Let’s build a matrix out of a list of words. We’ll use the default oov_initializer for initializing the vectors for out-of-file-vocabulary words:
In [9]: words = ['hello', 'ciao', 'world', 'mondo']
In [10]: with embfile.open(path, verbose=False) as f:
....: result = embfile.build_matrix(
....: f, words,
....: start_index=1, # map the first word to the row 1 (default is 0)
....: )
....:
# result belongs to a class that extends NamedTuple
In [11]: print(result.pretty())
[ 0.000 0.000 0.000] # 0:
[ 1.000 2.000 3.000] # 1: hello
[ 1.802 2.867 5.709] # 2: ciao [out of file vocabulary]
[ 4.000 5.000 6.000] # 3: world
[ 4.375 4.634 6.280] # 4: mondo [out of file vocabulary]
In [12]: result.matrix
Out[12]:
array([[0. , 0. , 0. ],
[1. , 2. , 3. ],
[1.80158469, 2.86658918, 5.70920508],
[4. , 5. , 6. ],
[4.37541454, 4.63356608, 6.28049425]])
In [13]: result.word2index
Out[13]: {'hello': 1, 'ciao': 2, 'world': 3, 'mondo': 4}
In [14]: result.missing_words
Out[14]: {'ciao', 'mondo'}
Now, we’ll build a matrix from a dictionary {word: index}. We’ll use a custom oov_initializer this time.
In [15]: word2index = {
....: 'hello': 1,
....: 'ciao': 3,
....: 'world': 4,
....: 'mondo': 5
....: }
....:
In [16]: with embfile.open(path, verbose=False) as f:
....: def custom_initializer(shape):
....: scale = 1 / np.sqrt(f.vector_size)
....: return np.random.normal(loc=0, scale=scale, size=shape)
....: result = embfile.build_matrix(f, word2index, oov_initializer=custom_initializer)
....:
In [17]: print(result.pretty())
[ 0.000 0.000 0.000] # 0:
[ 1.000 2.000 3.000] # 1: hello
[ 0.000 0.000 0.000] # 2:
[ 0.275 0.044 0.387] # 3: ciao [out of file vocabulary]
[ 4.000 5.000 6.000] # 4: world
[ 0.452 0.464 -0.488] # 5: mondo [out of file vocabulary]
In [18]: result.matrix
Out[18]:
array([[ 0. , 0. , 0. ],
[ 1. , 2. , 3. ],
[ 0. , 0. , 0. ],
[ 0.27523144, 0.04421023, 0.38716581],
[ 4. , 5. , 6. ],
[ 0.45208219, 0.46350951, -0.48754331]])
In [19]: result.word2index
Out[19]: {'hello': 1, 'ciao': 3, 'world': 4, 'mondo': 5}
In [20]: result.missing_words
Out[20]: {'ciao', 'mondo'}
See embfile.initializers for the available initializers.
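For instance, here is a minimal sketch (reusing path and words from the session above) that passes an Initializer instance explicitly; this should be equivalent to the default behavior:
from embfile.initializers import NormalInitializer
with embfile.open(path, verbose=False) as f:
    result = embfile.build_matrix(f, words, oov_initializer=NormalInitializer())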
Efficient iteration of the file is implemented by format-specific readers.
Class | Description
---|---
EmbFileReader | (Abstract class) Iterator that yields a word at each step and reads the corresponding vector only if the lazy property current_vector is accessed.
A new reader for a file can be created using the method reader(). Every method that needs to iterate the file entries sequentially uses this method to create a new reader.
You usually won’t need to use a reader directly because EmbFile defines quicker-to-use methods that use a reader for you. If you are interested, the docstring is pretty detailed.
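If you do decide to use one, the basic pattern (taken from the EmbFileReader docstring; emb_file, my_vocab and word2vec are assumed to be already defined) looks like this:
with emb_file.reader() as reader:
    for word in reader:
        if word in my_vocab:
            word2vec[word] = reader.current_vector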
The following methods are wrappers of reader(). Keep in mind that every time you use these methods, you are creating a new file reader and items are read from disk (the vocabulary may be kept in memory though, as in VVM files).
Method | Description
---|---
words | Returns an iterable for all the words in the file.
vectors | Returns an iterable for all the vectors in the file.
word_vectors | Returns an iterable for all the (word, vector) pairs in the file.
filter | Returns a generator that yields a word vector pair for each word in the file that satisfies a given condition.
Don’t use word_vectors() if you want to filter the vectors based on a condition on the words: it’ll read the vector of every word, even those that don’t meet the condition. Use filter() instead.
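For example, to collect all the word vectors for words starting with “z” (the same example given in the filter() docstring):
z_word_vectors = list(file.filter(lambda word: word.startswith('z')))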
Each subclass of EmbFile implements the following two class methods:

Method | Description
---|---
create | Creates a file on disk containing the provided word vectors.
create_from_file | Creates a new file on disk with the same content of another file.
You can create a new file either from:
a dictionary {word: vector}
an iterable of (word, vector) tuples; the iterable can also be an iterator/generator.
For example:
import numpy as np
from embfile import VVMEmbFile
word_vectors = {
"hello": np.array([0.1, 0.2, 0.3]),
"world": np.array([0.4, 0.5, 0.6])
# ... a lot more word vectors
}
VVMEmbFile.create(
'/tmp/dummy.vvm.gz',
word_vectors,
dtype='<f2', # store numbers as little-endian 2-byte floats
compression='gz' # compress with gzip
)
Let’s convert a textual file to a VVM file. The following will generate a compressed VVM file in the same folder as the textual file (and with the proper file extension):
import embfile
from embfile import VVMEmbFile
with embfile.open('path/to/source/file.txt') as src_file:
dest_path = VVMEmbFile.create_from_file(src_file, compression='gz')
# dest_path == Path('path/to/source/file.vvm.gz')
If you ever feel the need to implement a new format, it’s fairly easy to integrate your custom format into this library and to test it. My suggestion is:
grab the template below
read the EmbFile docstring
look at the existing implementations in the embfile.formats subpackage
for testing, see how the existing formats are tested in tests/test_files.py
Using an IDE is, of course, highly recommended.
from pathlib import Path
from typing import Iterable, Optional, Tuple
import embfile
from embfile.types import DType, PathType, VectorType
from embfile.core import EmbFile, EmbFileReader
# TODO: implement a reader
# Note: you could also extend AbstractEmbFileReader if it's convenient for you
class CustomEmbFileReader(EmbFileReader):
""" Implements file sequential reading """
def __init__(self, out_dtype: DType): # TODO: add the needed arguments
super().__init__(out_dtype)
def _close(self) -> None:
pass
def reset(self) -> None:
pass
def next_word(self) -> str:
pass
def current_vector(self) -> VectorType:
pass
@embfile.register_format('custom', extensions=['.cst', '.cust'])
class CustomEmbFile(EmbFile):
def __init__(self, path: PathType, out_dtype: DType = None, verbose: int = 1):
super().__init__(path, out_dtype, verbose) # this is not optional
# self.vocab_size = ...
# self.vector_size = ...
def _close(self) -> None:
pass
def _reader(self) -> EmbFileReader:
return CustomEmbFileReader() # TODO: pass the needed arguments
# Optional:
def _loader(self, words: Iterable[str], missing_ok: bool = True, verbose: Optional[int] = None) -> 'VectorsLoader':
""" By default, a SequentialLoader is returned. """
return super()._loader(words, missing_ok, verbose)
@classmethod
def _create(cls, out_path: Path, word_vectors: Iterable[Tuple[str, VectorType]],
vector_size: int, vocab_size: Optional[int], compression: Optional[str] = None,
verbose: bool = True, **format_kwargs) -> None:
pass
if __name__ == '__main__':
print(embfile.FORMATS)
This’ll print:
"""
Class Format ID Extensions
------------- ----------- ------------
BinaryEmbFile bin .bin
TextEmbFile txt .txt, .vec
VVMEmbFile vvm .vvm
CustomEmbFile custom .cst, .cust
"""
This section is about a benchmark I did out of curiosity for comparing the performance of the formats supported by this library. The snippet under test is the following:
with ConcreteEmbFile(path, verbose=0) as f:
f.find(query)
The benchmark was performed on generated files for increasing input sizes (number of words to load).
For each input size, the test was repeated 5 times with the exact same input.
The script used for running these tests is in the benchmark folder of the repository.
The inputs were obtained as follows:
first, a list of max(input_sizes) words was (uniformly) sampled from the file vocabulary
the input for size k was obtained by taking:
the first k words of the sampled list
an additional out-of-file-vocabulary word
So, the input for the i-th size is a superset of the previous ones.
The additional out-of-file-vocabulary word forces txt and bin file objects to read the entire file. The number of missing words isn’t an interesting parameter to consider, since missing words are simply added to a set in all the cases.
The input sizes reported below don’t include the additional word: the actual input size is reported_size + 1, but that’s practically irrelevant.
The measured times (for each single try) include the time for opening the file; VVM files can take several seconds to open since the vocabulary is entirely read at the start; thus the actual time taken by find() alone in VVM files is lower than reported below.
Tests were performed on an old desktop computer upgraded with a SSD:
CPU: Intel® Core™2 Quad Q6600
RAM: 8 GB DDR2 (4 x 2 GB, 800 MHz)
SSD: Samsung 850 EVO 256GB
OS: Windows 10
Expect much better times on newer computers.
1K | 50K | 150K | 300K
---|---|---|---
6.8 | 7.6 | 8.7 | 10.4
6.2 | 11.3 | 21.4 | 36.3
1.7 | 3.4 | 5.4 | 8.1
1K | 50K | 150K | 300K
---|---|---|---
8.1 | 8.8 | 10.1 | 12.0
12.2 | 25.5 | 51.8 | 91.6
1.8 | 4.0 | 7.4 | 11.1
1K | 50K | 150K | 300K
---|---|---|---
21.1 | 21.9 | 23.1 | 24.9
18.0 | 23.6 | 34.1 | 49.8
5.8 | 7.8 | 10.9 | 14.3
1K | 50K | 150K | 300K
---|---|---|---
25.4 | 27.0 | 28.1 | 30.0
36.2 | 49.4 | 75.8 | 116.5
5.7 | 8.3 | 12.7 | 18.2
Substructure
Classes
Class | Description
---|---
RandomAccessLoader | A loader for files that can randomly access word vectors.
SequentialLoader | A Loader that just scans the file from beginning to the end and yields a word vector pair when it meets a requested word.
VectorsLoader | (Abstract class) Iterator that, given some input words, looks for the corresponding vectors in the file and yields a word vector pair for each vector found; once the iteration stops, the attribute missing_words contains the set of words not found.
Reference
embfile.core.loaders.
VectorsLoader
(words, missing_ok=True)[source]¶Bases: abc.ABC
, Iterator
[WordVector
]
(Abstract class) Iterator that, given some input words, looks for the corresponding
vectors into the file and yields a word vector pair for each vector found; once the
iteration stops, the attribute missing_words
contains the set of words not found.
Subclasses can load the word vectors in any order.
missing_words
¶The words that have still to be found; once the iteration stops, it’s the set of
the words that are in the input words
but not in the file.
embfile.core.loaders.
SequentialLoader
(file, words, missing_ok=True, verbose=False)[source]¶Bases: abc.ABC
, Iterator
[WordVector
]
A Loader that just scans the file from beginning to the end and yields a word vector pair when it meets a requested word. Used by txt and bin files. It’s unable to tell if a word is in the file or not before having read the entire file.
The progress bar shows the percentage of file that has been examined, not the number of yielded word vectors, so the iteration may stop before the bar reaches its 100% (in the case that all the input words are in the file).
missing_words
¶The words that have still to be found; once the iteration stops, it’s the set of
the words that are in the input words
but not in the file.
embfile.core.loaders.
RandomAccessLoader
(words, word2vec, word2index=None, missing_ok=True, verbose=False, close_hook=None)[source]¶Bases: abc.ABC
, Iterator
[WordVector
]
A loader for files that can randomly access word vectors. If word2index is provided, the words are sorted by their position and the corresponding vectors are loaded in this order; I observed that this significantly improves performance (with VVMEmbFile), presumably due to buffering.
word2vec (Word2Vector
) – object that implements word2vec[word]
and word in word2vec
word2index (Optional
[Callable
[[str
], int
]]) – function that returns the index (position) of a word inside the file;
this enables an optimization for formats like VVM that store vectors
sequentially in the same file.
missing_ok (bool
) –
verbose (bool
) –
close_hook (Optional
[Callable
]) – function to call when closing this loader
missing_words
¶The words that have still to be found; once the iteration stops, it’s the set of
the words that are in the input words
but not in the file.
Reference
embfile.core.reader.
EmbFileReader
(out_dtype)[source]¶Bases: abc.ABC
(Abstract class) Iterator that yields a word at each step and reads the corresponding vector
only if the lazy property current_vector
is accessed.
Iteration model. The iteration model is not the most obvious: each iteration step doesn’t
return a word vector pair. Instead, for performance reasons, at each step a reader returns
the next word. To read the vector for the current word, you must access the (lazy) property
current_vector()
:
with emb_file.reader() as reader:
for word in reader:
if word in my_vocab:
word2vec[word] = reader.current_vector
When you access current_vector()
for the first time,
the vector data is read/parsed and a vector is created; the vector remains
accessible until a new word is read.
Creation. Creating a reader usually implies the creation of a file object. That’s why
EmbFileReader
implements the ContextManager
interface so that you can use it inside
a with
clause. Nonetheless, an EmbFile
keeps track of all its open readers and closes them
automatically when it is closed.
out_dtype (Union
[str
, dtype
]) – all the vectors will be converted to this dtype before being returned
out_dtype (numpy.dtype) – all the vectors will be converted to this data type before being returned
reset
()[source]¶(Abstract method) Brings back the reader to the first word vector pair
embfile.core.reader.
AbstractEmbFileReader
(out_dtype)[source]¶Bases: embfile.core.reader.EmbFileReader
, abc.ABC
(Abstract class) Facilitates the implementation of an EmbFileReader
, especially for a
file that stores a word and its vector nearby in the file (txt and bin formats), though it can
be used for other kinds of formats as well if convenient. It:
keeps track of whether the reader is pointing to a word or a vector and skips the vector when it is not requested during an iteration
caches the current vector once it is read
Sub-classes must implement:

Method | Description
---|---
_read_word | (Abstract method) Reads a word, assuming the next thing to read in the file is a word.
_read_vector | (Abstract method) Reads the vector for the last word read.
 | (Abstract method) Called when we want to read the next word without loading the vector for the current word.
 | (Abstract method) Closes the reader.
_read_word
()[source]¶(Abstract method) Reads a word assuming the next thing to read in the file is a word. It must raise StopIteration if there’s not another word to read.
_read_vector
()[source]¶(Abstract method) Reads the vector for the last word read. This method is never called if no word has been read or at the end of file. It is called at most once per word.
Classes
Class | Description
---|---
AbstractEmbFileReader | (Abstract class) Facilitates the implementation of an EmbFileReader.
EmbFile | (Abstract class) The base class of all the embedding files.
EmbFileReader | (Abstract class) Iterator that yields a word at each step and reads the corresponding vector only if the lazy property current_vector is accessed.
RandomAccessLoader | A loader for files that can randomly access word vectors.
SequentialLoader | A Loader that just scans the file from beginning to the end and yields a word vector pair when it meets a requested word.
VectorsLoader | (Abstract class) Iterator that, given some input words, looks for the corresponding vectors in the file and yields a word vector pair for each vector found; once the iteration stops, the attribute missing_words contains the set of words not found.
WordVector | A (word, vector) NamedTuple
embfile.core.
EmbFile
(path, out_dtype=None, verbose=True)[source]¶Bases: abc.ABC
(Abstract class) The base class of all the embedding files.
Sub-classes must:
ensure they set attributes vocab_size
and vector_size
when a file
instance is created
implement an EmbFileReader
for the format and implement
the abstract method _reader()
implement the abstract method _close()
(optionally) implement a VectorsLoader
(if they can improve
upon the default loader) and override loader()
(optionally) implement a EmbFileCreator
for the format and set
the class constant Creator
path (Path) – path of the embedding file (possibly compressed)
out_dtype (numpy.dtype) – all the vectors will be converted to this data type. The sub-class is responsible to set a suitable default value.
verbose (bool) – whether to show a progress bar by default in all time-consuming operations
path (Path) – path of the embedding file
vocab_size (int or None
) – number of words in the file (can be None
for some TextEmbFile
)
vector_size (int) – length of the vectors
verbose (bool) – whether to show a progress bar by default in all time-consuming operations
closed (bool) – True if the file was closed
_reader
()[source]¶(Abstract method) Returns a new reader for the file, which allows iterating
efficiently over the word vectors inside it. Called by reader()
.
_close
()[source]¶(Abstract method) Releases any resources used by the EmbFile.
close
()[source]¶Releases all the open resources linked to this file, including the opened readers.
reader
()[source]¶Creates and returns a new file reader. When the file is closed, all the still opened readers are closed automatically.
loader
(words, missing_ok=True, verbose=None)[source]¶Returns a VectorsLoader
, an iterator that looks for the
provided words in the file and yields available (word, vector) pairs one by one.
If missing_ok=True
(default), provides the set of missing words in the
property missing_words
(once the iteration ends).
See embfile.core.VectorsLoader
for more info.
Example
You should use a loader when you need to load many vectors in some custom data structure and you don’t want to waste memory (e.g. build_matrix uses it to load the vectors directly into the matrix):
data_structure = MyCustomStructure()
with file.loader(many_words) as loader:
for word, vector in loader:
data_structure[word] = vector
print('Number of missing words:', len(loader.missing_words))
word_vectors
()[source]¶Returns an iterable for all the (word, vector) word_vectors in the file.
to_list
(verbose=None)[source]¶Returns the entire file content in a list of WordVector
’s.
load
(words, verbose=None)[source]¶Loads the vectors for the input words in a {word: vec}
dict, raising
KeyError
if any word is missing.
(Dict[str, VectorType]) – a dictionary {word: vector}
See also
find()
- it returns the set of all missing words, instead of raising
KeyError
.
find
(words, verbose=None)[source]¶Looks for the input words in the file and returns: 1) a dict {word: vec}
containing the available words and 2) a set containing the words not found.
_FindOutput
namedtuple – a namedtuple with the following fields:
word2vec (Dict[str, VectorType]): dictionary {word: vector}
missing_words (Set[str]): set of words not found in the file
See also
load()
- which raises KeyError if any word is not found in the file.
filter
(condition, verbose=None)[source]¶Returns a generator that yields a word vector pair for each word in the file that satisfies a given condition. For example, to get all the words starting with “z”:
list(file.filter(lambda word: word.startswith('z')))
save_vocab
(path=None, encoding='utf-8', overwrite=False, verbose=None)[source]¶Saves the vocabulary of the embedding file to a text file. By default, the file is saved in the same directory as the embedding file, e.g.:
/path/to/filename.txt.gz ==> /path/to/filename_vocab.txt
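A minimal usage sketch (the path is hypothetical):
with embfile.open('/path/to/filename.txt.gz') as f:
    f.save_vocab()   # writes /path/to/filename_vocab.txt by default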
create
(out_path, word_vectors, vocab_size=None, compression=None, verbose=True, overwrite=False, **format_kwargs)[source]¶Creates a file on disk containing the provided word vectors.
word_vectors (Dict[str, VectorType] or Iterable[Tuple[str, VectorType]]) – it can be an iterable of word vector tuples or a dictionary word -> vector
;
the word vectors are written in the order determined by the iterable object.
vocab_size (Optional
[int
]) – it must be provided if word_vectors
has no __len__
and the specific-format
creator needs to know a priori the vocabulary size; in any case, the creator
should check at the end that the provided vocab_size
matches the actual length
of word_vectors
compression (Optional
[str
]) – valid values are: "bz2"|"bz", "gzip"|"gz", "xz"|"lzma", "zip"
verbose (bool
) – if positive, show progress bars and information
overwrite (bool
) – overwrite the file if it already exists
format_kwargs – format-specific arguments
create_from_file
(source_file, out_dir=None, out_filename=None, vocab_size=None, compression=None, verbose=True, overwrite=False, **format_kwargs)[source]¶Creates a new file on disk with the same content of another file.
source_file (EmbFile
) – the file to take data from
out_dir (Union
[str
, Path
, None
]) – directory where the file will be stored; by default, it’s the parent directory
of the source file
out_filename (Optional
[str
]) – filename of the produced file (inside out_dir
); by default, it is obtained by
replacing the extension of the source file with the proper one and appending the
compression extension if compression is not None
.
Note: if you pass this argument, the compression extension is not automatically
appended.
vocab_size (Optional
[int
]) – if the source EmbFile has attribute vocab_size == None
, then: if the specific
creator requires it (bin and txt formats do), it must be provided; otherwise it
can be provided for having ETA in progress bars.
compression (Optional
[str
]) – valid values are: "bz2"|"bz", "gzip"|"gz", "xz"|"lzma", "zip"
verbose (bool
) – print info and progress bar
overwrite (bool
) – overwrite a file with the same name if it already exists
format_kwargs – format-specific arguments (see above)
embfile.core.
EmbFileReader
(out_dtype)[source]¶Bases: abc.ABC
(Abstract class) Iterator that yields a word at each step and reads the corresponding vector
only if the lazy property current_vector
is accessed.
Iteration model. The iteration model is not the most obvious: each iteration step doesn’t
return a word vector pair. Instead, for performance reasons, at each step a reader returns
the next word. To read the vector for the current word, you must access the (lazy) property
current_vector()
:
with emb_file.reader() as reader:
for word in reader:
if word in my_vocab:
word2vec[word] = reader.current_vector
When you access current_vector()
for the first time,
the vector data is read/parsed and a vector is created; the vector remains
accessible until a new word is read.
Creation. Creating a reader usually implies the creation of a file object. That’s why
EmbFileReader
implements the ContextManager
interface so that you can use it inside
a with
clause. Nonetheless, an EmbFile
keeps track of all its open readers and closes them
automatically when it is closed.
out_dtype (Union
[str
, dtype
]) – all the vectors will be converted to this dtype before being returned
out_dtype (numpy.dtype) – all the vectors will be converted to this data type before being returned
reset
()[source]¶(Abstract method) Brings back the reader to the first word vector pair
embfile.core.
AbstractEmbFileReader
(out_dtype)[source]¶Bases: embfile.core.reader.EmbFileReader
, abc.ABC
(Abstract class) Facilitates the implementation of an EmbFileReader
, especially for a
file that stores a word and its vector nearby in the file (txt and bin formats), though it can
be used for other kinds of formats as well if convenient. It:
keeps track of whether the reader is pointing to a word or a vector and skips the vector when it is not requested during an iteration
caches the current vector once it is read
Sub-classes must implement:

Method | Description
---|---
_read_word | (Abstract method) Reads a word, assuming the next thing to read in the file is a word.
_read_vector | (Abstract method) Reads the vector for the last word read.
 | (Abstract method) Called when we want to read the next word without loading the vector for the current word.
 | (Abstract method) Closes the reader.
_read_word
()[source]¶(Abstract method) Reads a word assuming the next thing to read in the file is a word. It must raise StopIteration if there’s not another word to read.
_read_vector
()[source]¶(Abstract method) Reads the vector for the last word read. This method is never called if no word has been read or at the end of file. It is called at most once per word.
embfile.core.
VectorsLoader
(words, missing_ok=True)[source]¶Bases: abc.ABC
, Iterator
[WordVector
]
(Abstract class) Iterator that, given some input words, looks for the corresponding
vectors into the file and yields a word vector pair for each vector found; once the
iteration stops, the attribute missing_words
contains the set of words not found.
Subclasses can load the word vectors in any order.
missing_words
¶The words that have still to be found; once the iteration stops, it’s the set of
the words that are in the input words
but not in the file.
embfile.core.
SequentialLoader
(file, words, missing_ok=True, verbose=False)[source]¶Bases: abc.ABC
, Iterator
[WordVector
]
A Loader that just scans the file from beginning to the end and yields a word vector pair when it meets a requested word. Used by txt and bin files. It’s unable to tell if a word is in the file or not before having read the entire file.
The progress bar shows the percentage of file that has been examined, not the number of yielded word vectors, so the iteration may stop before the bar reaches its 100% (in the case that all the input words are in the file).
missing_words
¶The words that have still to be found; once the iteration stops, it’s the set of
the words that are in the input words
but not in the file.
embfile.core.
RandomAccessLoader
(words, word2vec, word2index=None, missing_ok=True, verbose=False, close_hook=None)[source]¶Bases: abc.ABC
, Iterator
[WordVector
]
A loader for files that can randomly access word vectors. If word2index is provided, the words are sorted by their position and the corresponding vectors are loaded in this order; I observed that this significantly improves performance (with VVMEmbFile), presumably due to buffering.
word2vec (Word2Vector
) – object that implements word2vec[word]
and word in word2vec
word2index (Optional
[Callable
[[str
], int
]]) – function that returns the index (position) of a word inside the file;
this enables an optimization for formats like VVM that store vectors
sequentially in the same file.
missing_ok (bool
) –
verbose (bool
) –
close_hook (Optional
[Callable
]) – function to call when closing this loader
missing_words
¶The words that have still to be found; once the iteration stops, it’s the set of
the words that are in the input words
but not in the file.
Substructure
Classes

Class | Description
---|---
BinaryEmbFile | Format used by the Google word2vec tool.
BinaryEmbFileReader | EmbFileReader for the binary format.
Reference
embfile.formats.bin.
BinaryEmbFile
(path, encoding='utf-8', dtype=dtype('float32'), out_dtype=None, verbose=True)[source]¶Bases: embfile.core._file.EmbFile
Format used by the Google word2vec tool. You can use it to read the file GoogleNews-vectors-negative300.bin.
It begins with a text header line of space-separated fields:
<vocab_size> <vector_size>
Each word vector pair is encoded as follows (a parsing sketch is given after this description):
encoded word + space
followed by the binary representation of the vector.
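For illustration only, here is a rough sketch of how a single record could be parsed by hand (the library's BinaryEmbFileReader does this for you; newline handling and error checking are simplified):
import numpy as np
def read_record(fh, vector_size, dtype=np.dtype('<f4'), encoding='utf-8'):
    # Read the word: bytes up to the separating space
    chars = []
    while True:
        b = fh.read(1)
        if not b or b == b' ':
            break
        chars.append(b)
    word = b''.join(chars).decode(encoding).lstrip('\n')
    # Read the binary vector that follows the space
    vector = np.frombuffer(fh.read(vector_size * dtype.itemsize), dtype=dtype)
    return word, vector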
path (Union
[str
, Path
]) – path to the (eventually compressed) file
encoding (str
) – text encoding; note: if you provide an utf encoding (e.g. utf-16) that uses a
BOM (Byte Order Mark) without specifying the byte-endianness (e.g. utf-16-le or
utf-16-be), the little-endian version is used (utf-16-le).
dtype (Union
[str
, dtype
]) – a valid numpy data type (or whatever you can pass to numpy.dtype())
(default: ‘<f4’; little-endian float, 4 bytes)
out_dtype (Union
[str
, dtype
, None
]) – all the vectors returned will be converted (if necessary) to this data type;
by default, it is equal to the original data type of the vectors in the file,
i.e. no conversion takes place.
create
(out_path, word_vectors, vocab_size=None, compression=None, verbose=True, overwrite=False, encoding='utf-8', dtype=None)[source]¶Format-specific arguments are encoding
and dtype
.
Note: all the text is encoded without BOM (Byte Order Mark). If you pass “utf-16” or “utf-32”, the little-endian version is used (e.g. “utf-16-le”).
See create()
for more.
create_from_file
(source_file, out_dir=None, out_filename=None, vocab_size=None, compression=None, verbose=True, overwrite=False, encoding='utf-8', dtype=None)[source]¶Format-specific arguments are encoding
and dtype
.
Note: all the text is encoded without BOM (Byte Order Mark). If you pass “utf-16” or “utf-32”, the little-endian version is used (e.g. “utf-16-le”).
See create_from_file()
for more.
embfile.formats.bin.
BinaryEmbFileReader
(file_obj, encoding='utf-8', dtype=dtype('float32'), out_dtype=None)[source]¶Bases: embfile.core.reader.AbstractEmbFileReader
EmbFileReader
for the binary format.
Classes

Class | Description
---|---
TextEmbFile | The format used by Glove and FastText files. Each word vector pair is stored as a line of text made of space-separated fields.
TextEmbFileReader | EmbFileReader for the textual format.
Reference
embfile.formats.txt.
TextEmbFile
(path, encoding='utf-8', out_dtype='float32', vocab_size=None, verbose=True)[source]¶Bases: embfile.core._file.EmbFile
The format used by Glove and FastText files. Each vector pair is stored as a line of text made of space-separated fields:
word vec[0] vec[1] ... vec[vector_size-1]
It may or may not have an (automatically detected) header containing vocab_size and vector_size (in this order).
If the file doesn’t have a header, vector_size
is set to the length of the first vector.
If you know vocab_size
(even an approximate value), you may want to provide it to have ETA
in progress bars.
If the file has a header and you provide vocab_size
, the provided value is ignored.
Compressed files are decompressed as you proceed reading. Note that each file reader decompresses the file independently, so if you need to read the file multiple times it’s better to decompress the entire file first and then open it.
encoding (str
) – encoding of the text file; default is utf-8
out_dtype (Union
[str
, dtype
]) – the dtype of the vectors that will be returned; default is single-precision float
vocab_size (Optional
[int
]) – useful when the file has no header but you know vocab_size;
if the file has a header, this argument is ignored.
verbose (int
) – default level of verbosity for all methods
create
(out_path, word_vectors, vocab_size=None, compression=None, verbose=True, overwrite=False, encoding='utf-8', precision=5)[source]¶Creates a file on disk containing the provided word vectors.
word_vectors (Dict[str, VectorType] or Iterable[Tuple[str, VectorType]]) – it can be an iterable of word vector tuples or a dictionary word -> vector
;
the word vectors are written in the order determined by the iterable object.
vocab_size (Optional
[int
]) – it must be provided if word_vectors
has no __len__
and the specific-format
creator needs to know a priori the vocabulary size; in any case, the creator
should check at the end that the provided vocab_size
matches the actual length
of word_vectors
compression (Optional
[str
]) – valid values are: "bz2"|"bz", "gzip"|"gz", "xz"|"lzma", "zip"
verbose (bool
) – if positive, show progress bars and information
overwrite (bool
) – overwrite the file if it already exists
format_kwargs – format-specific arguments
create_from_file
(source_file, out_dir=None, out_filename=None, vocab_size=None, compression=None, verbose=True, overwrite=False, encoding='utf-8', precision=5)[source]¶Creates a new file on disk with the same content of another file.
source_file (EmbFile
) – the file to take data from
out_dir (Union
[str
, Path
, None
]) – directory where the file will be stored; by default, it’s the parent directory
of the source file
out_filename (Optional
[str
]) – filename of the produced file (inside out_dir
); by default, it is obtained by
replacing the extension of the source file with the proper one and appending the
compression extension if compression is not None
.
Note: if you pass this argument, the compression extension is not automatically
appended.
vocab_size (Optional
[int
]) – if the source EmbFile has attribute vocab_size == None
, then: if the specific
creator requires it (bin and txt formats do), it must be provided; otherwise it
can be provided for having ETA in progress bars.
compression (Optional
[str
]) – valid values are: "bz2"|"bz", "gzip"|"gz", "xz"|"lzma", "zip"
verbose (bool
) – print info and progress bar
overwrite (bool
) – overwrite a file with the same name if it already exists
format_kwargs – format-specific arguments (see above)
embfile.formats.txt.
TextEmbFileReader
(file_obj, out_dtype=dtype('float32'), vocab_size=None)[source]¶Bases: embfile.core.reader.AbstractEmbFileReader
EmbFileReader
for the textual format.
Classes

Class | Description
---|---
VVMEmbFile | (Custom format) A tar file storing vocabulary, vectors and metadata in 3 separate files.
VVMEmbFileReader | EmbFileReader for the vvm format.
Reference
embfile.formats.vvm.
VVMEmbFile
(path, out_dtype=None, verbose=True)[source]¶Bases: embfile.core._file.EmbFile
, embfile.core.loaders.Word2Vector
(Custom format) A tar file storing vocabulary, vectors and metadata in 3 separate files.
Features:
the vocabulary can be loaded very quickly (with no need for an external vocab file) and it is loaded in memory when the file is opened;
direct access to vectors (see the example after this list):
by word, using __getitem__() (e.g. file['hello'])
by index, using vector_at()
implements __contains__() (e.g. 'hello' in file)
all the information needed to open the file is stored in the file itself
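A minimal usage sketch (the path is hypothetical; vector_at is assumed to take a positional index):
with embfile.open('path/to/file.vvm') as f:
    if 'hello' in f:            # __contains__
        vec = f['hello']        # __getitem__: direct access by word
    first_vec = f.vector_at(0)  # direct access by index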
Specifics. The files contained in a VVM file are:
vocab.txt: contains each word on a separate line
vectors.bin: contains the vectors in binary format (concatenated)
meta.json: must contain (at least) the following fields (see the example after this list):
vocab_size: number of word vectors in the file
vector_size: length of a word vector
encoding: text encoding used for vocab.txt
dtype: vector data type string (notation used by numpy)
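For instance, the meta.json of a hypothetical 300-dimensional file with 3 million words could look like this:
{"vocab_size": 3000000, "vector_size": 300, "encoding": "utf-8", "dtype": "<f4"}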
__getitem__
(word)[source]¶Returns the vector associated with a word (random access to the file).
create
(out_path, word_vectors, vocab_size=None, compression=None, verbose=True, overwrite=False, encoding='utf-8', dtype=None)[source]¶Format-specific arguments are encoding and dtype.
Since a VVM file is a tar file, you should use a compression format supported by the tarfile package (avoid zip): gz, bz2 or xz.
See create()
for more details.
create_from_file
(source_file, out_dir=None, out_filename=None, vocab_size=None, compression=None, verbose=True, overwrite=False, encoding='utf-8', dtype=None)[source]¶Format-specific arguments are encoding and dtype. Since a VVM file is a tar file, you should use a compression format supported by the tarfile package (avoid zip): gz, bz2 or xz.
See create_from_file()
for more details.
embfile.formats.vvm.
VVMEmbFileReader
(file, vectors_file)[source]¶Bases: embfile.core.reader.AbstractEmbFileReader
EmbFileReader
for the vvm format.
Functions

Function | Description
---|---
extract_file | Extracts a file compressed with gzip, bz2 or lzma, or a member file inside a zip/tar archive.
extract_if_missing | Extracts a file unless it already exists and returns its path.
open_file | Opens a file, optionally with (de)compression.
Data

Name | Description
---|---
COMPRESSION_TO_EXTENSIONS | Maps each compression format to its associated extensions.
EXTENSION_TO_COMPRESSION | Maps a compression extension to the corresponding compression format name.
Reference
embfile.compression.
open_file
(path, mode='rt', encoding=None, compression=None)[source]¶Opens a file, optionally with (de)compression.
If compression is not given, it is inferred from the file extension. If the file does not have the extension of a supported compression format, it is opened without (de)compression, unless the argument compression is given.
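A minimal usage sketch (the path is hypothetical); the compression format is inferred from the .gz extension:
from embfile.compression import open_file
with open_file('path/to/vocab.txt.gz', mode='rt', encoding='utf-8') as fh:
    first_line = fh.readline()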
embfile.compression.
extract_file
(src_path, member=None, dest_dir='.', dest_filename=None, overwrite=False)[source]¶Extracts a file compressed with gzip, bz2 or lzma or a member file inside a zip/tar archive. The compression format is inferred from the extension or from the magic number of the file (in the case of zip and tar).
The file is first extracted to a .part file that is renamed when the extraction is completed.
member (Optional
[str
]) – must be provided if src_path points to an archive that contains multiple files;
dest_dir (Union
[str
, Path
]) – destination directory; by default, it’s the current working directory
dest_filename (Optional
[str
]) – destination filename; by default, it’s equal to member
(if provided)
overwrite (bool
) – overwrite existing file at dest_path
if it already exists
Path – the path to the extracted file
embfile.compression.
extract_if_missing
(src_path, member=None, dest_dir='.', dest_filename=None)[source]¶Extracts a file unless it already exists and returns its path.
Note: during extraction, a .part file is used, so there’s no risk of using a partially extracted file.
embfile.compression.
EXTENSION_TO_COMPRESSION
= {'.bz2': 'bz2', '.gz': 'gz', '.gzip': 'gz', '.lzma': 'xz', '.xz': 'xz', '.zip': 'zip'}¶Maps a compression extension to the corresponding compression format name
embfile.compression.
COMPRESSION_TO_EXTENSIONS
= {'bz2': ['.bz2'], 'gz': ['.gz', '.gzip'], 'xz': ['.xz', '.lzma'], 'zip': ['.zip']}¶Maps each compression format to its associated extensions
Reference
embfile.errors.
IllegalOperation
[source]¶Bases: embfile.errors.Error
Raised when the user attempts to perform an operation that is illegal in the current state (e.g. using a closed file)
embfile.errors.
BadEmbFile
[source]¶Bases: embfile.errors.Error
Raised when the file is malformed.
Embedding initializers.
Classes

Class | Description
---|---
Initializer | A random number generator meant to be used with build_matrix().
NormalInitializer | Generates vectors using a normal distribution with the same mean and standard deviation of the set of vectors passed to the fit method.

Functions

Function | Description
---|---
 | Returns a normal sampler.
Reference
embfile.initializers.
Initializer
[source]¶Bases: abc.ABC
A random number generator meant to be used with build_matrix()
.
It can be fit to a sequence of other vectors in order to compute stats to be used for
generation. When passed to build_matrix
, the initializer is fit to the found vectors.
embfile.initializers.
NormalInitializer
[source]¶Bases: embfile.initializers.Initializer
Generates vectors using a normal distribution with the same mean and standard deviation
of the set of vectors passed to the fit method. When used with
build_matrix()
, it initializes out-of-file-vocabulary vectors so
that they have the same mean and deviation of the vectors found in the file.
If it has not been fit before generating vectors, it raises IllegalOperation.
Reference
embfile.registry.
FormatsRegistry
[source]¶Bases: object
Maps each EmbFile
subclass to a format_id and one or multiple file extensions.
id_to_class –
extension_to_id –
id_to_extensions –
register_format
(embfile_class, format_id, extensions, overwrite=False)[source]¶Registers a new embedding file format with a given id and associates the provided file extensions to it.
Type aliases used in the library
Classes
VectorType | alias of numpy.ndarray
embfile.types.
VectorType
¶alias of numpy.ndarray
Classes
WordVector | A (word, vector) NamedTuple
Reference
Classes

Class | Description
---|---
BinaryEmbFile | Format used by the Google word2vec tool.
BuildMatrixOutput | NamedTuple returned by build_matrix().
EmbFile | (Abstract class) The base class of all the embedding files.
TextEmbFile | The format used by Glove and FastText files. Each word vector pair is stored as a line of text made of space-separated fields.
VVMEmbFile | (Custom format) A tar file storing vocabulary, vectors and metadata in 3 separate files.
Functions

Function | Description
---|---
associate_extension | Associates a file extension to a registered embedding file format.
build_matrix | Creates an embedding matrix for the provided words.
extract_file | Extracts a file compressed with gzip, bz2 or lzma, or a member file inside a zip/tar archive.
extract_if_missing | Extracts a file unless it already exists and returns its path.
open | Opens an embedding file, inferring the file format from the file extension (if not explicitly provided in format_id).
register_format | Class decorator that associates a new EmbFile subclass with a format_id and one or more file extensions.

Data

Name | Description
---|---
FORMATS | Maps each EmbFile subclass to a format_id and one or multiple file extensions.
embfile.
open
(path, format_id=None, **format_kwargs)[source]¶Opens an embedding file inferring the file format from the file extension (if not explicitly
provided in format_id
). Note that you can always open a file using the specific EmbFile
subclass; it can be more convenient since you get auto-completion and quick doc for
format-specific arguments.
Example:
with embfile.open('path/to/embfile.txt') as f:
# do something with f
Supported formats:
Class | format_id | Extensions | Description
---|---|---|---
TextEmbFile | txt | .txt, .vec | Glove/fastText format
BinaryEmbFile | bin | .bin | Google word2vec format
VVMEmbFile | vvm | .vvm | A tarball containing three files: vocab.txt, vectors.bin, meta.json
You can register new formats or extensions using the functions
embfile.register_format()
and embfile.associate_extension()
.
Returns: an instance of a concrete subclass of EmbFile.
See also
embfile.register_format()
:registers your custom EmbFile implementation so it is recognized by this function
embfile.associate_extension()
:associates an extension to a registered format
embfile.
build_matrix
(f, words, start_index=0, dtype=None, oov_initializer=<embfile.initializers.NormalInitializer object>, verbose=None)[source]¶Creates an embedding matrix for the provided words. words
can be:
an iterable of strings – in this case, the words in the iterable are mapped
to consecutive rows of the matrix starting from the row of index start_index
(by default, 0
); the rows with index i < start_index
are left to zeros.
a dictionary {word -> index}
that maps each word to a row –
in this case, the matrix has shape:
[max_index + 1, vector_size]
where max_index = max(word_to_index.values())
. The rows that are not associated
with any word are left to zeros. If multiple words are mapped to the same row, the
function raises ValueError
.
In both cases, all the word vectors that are not found in the file are initialized using
oov_initializer
, which can be:
None
– leave missing vectors to zeros
a function that takes the shape of the array to generate (a tuple) as first argument:
oov_initializer=lambda shape: numpy.random.normal(scale=0.01, size=shape)
oov_initializer=numpy.ones # don't use this for word vectors :|
an instance of Initializer
, which is a “fittable”
initializer; in this case, the initializer is fit on the found vectors (the vectors that
are both in vocab
and in the file).
By default, oov_initializer is an instance of
NormalInitializer
which generates vectors using a normal distribution with the same mean and standard
deviation of the vectors found.
f (EmbFile
) – the file containing the word vectors
words (Iterable[str] or Dict[str, int]) – iterable of words or dictionary that maps each word to a row index
start_index (int) – ignored if words
is a dict; if words
is a collection, determines the index
associated to the first word (and so, the number of rows left to zeros at the
beginning of the matrix)
dtype (optional, DType) – matrix data type; if None
, cls.out_dtype
is used
oov_initializer (optional, Callable or Initializer
) – initializer for out-of-(file)-vocabulary word vectors. See the class docstring for
more information.
verbose (bool) – if None, f.verbose is used
BuildMatrixOutput
embfile.
BuildMatrixOutput
(matrix: numpy.ndarray, word2index: Dict[str, int], missing_words: Set[str])[source]¶Bases: tuple
NamedTuple returned by build_matrix()
Create new instance of BuildMatrixOutput(matrix, word2index, missing_words)
matrix
¶Alias for field number 0
word2index
¶Alias for field number 1
missing_words
¶Alias for field number 2
embfile.
EmbFile
(path, out_dtype=None, verbose=True)[source]¶Bases: abc.ABC
(Abstract class) The base class of all the embedding files.
Sub-classes must:
ensure they set attributes vocab_size
and vector_size
when a file
instance is created
implement an EmbFileReader
for the format and implement
the abstract method _reader()
implement the abstract method _close()
(optionally) implement a VectorsLoader
(if they can improve
upon the default loader) and override loader()
(optionally) implement a EmbFileCreator
for the format and set
the class constant Creator
path (Path) – path of the embedding file (possibly compressed)
out_dtype (numpy.dtype) – all the vectors will be converted to this data type. The sub-class is responsible to set a suitable default value.
verbose (bool) – whether to show a progress bar by default in all time-consuming operations
path (Path) – path of the embedding file
vocab_size (int or None
) – number of words in the file (can be None
for some TextEmbFile
)
vector_size (int) – length of the vectors
verbose (bool) – whether to show a progress bar by default in all time-consuming operations
closed (bool) – True if the file was closed
_reader
()[source]¶(Abstract method) Returns a new reader for the file, which allows iterating
efficiently over the word vectors inside it. Called by reader()
.
_close
()[source]¶(Abstract method) Releases any resources used by the EmbFile.
close
()[source]¶Releases all the open resources linked to this file, including the opened readers.
reader
()[source]¶Creates and returns a new file reader. When the file is closed, all the still opened readers are closed automatically.
loader
(words, missing_ok=True, verbose=None)[source]¶Returns a VectorsLoader
, an iterator that looks for the
provided words in the file and yields available (word, vector) pairs one by one.
If missing_ok=True
(default), provides the set of missing words in the
property missing_words
(once the iteration ends).
See embfile.core.VectorsLoader
for more info.
Example
You should use a loader when you need to load many vectors in some custom data structure and you don’t want to waste memory (e.g. build_matrix uses it to load the vectors directly into the matrix):
data_structure = MyCustomStructure()
with file.loader(many_words) as loader:
for word, vector in loader:
data_structure[word] = vector
print('Number of missing words:', len(loader.missing_words))
word_vectors
()[source]¶Returns an iterable for all the (word, vector) word_vectors in the file.
to_list
(verbose=None)[source]¶Returns the entire file content in a list of WordVector
’s.
load
(words, verbose=None)[source]¶Loads the vectors for the input words in a {word: vec}
dict, raising
KeyError
if any word is missing.
(Dict[str, VectorType]) – a dictionary {word: vector}
See also
find()
- it returns the set of all missing words, instead of raising
KeyError
.
find
(words, verbose=None)[source]¶Looks for the input words in the file and returns: 1) a dict {word: vec}
containing the available words and 2) a set containing the words not found.
_FindOutput
namedtuple – a namedtuple with the following fields:
word2vec (Dict[str, VectorType]): dictionary {word: vector}
missing_words (Set[str]): set of words not found in the file
See also
load()
- which raises KeyError if any word is not found in the file.
filter
(condition, verbose=None)[source]¶Returns a generator that yields a word vector pair for each word in the file that satisfies a given condition. For example, to get all the words starting with “z”:
list(file.filter(lambda word: word.startswith('z')))
save_vocab
(path=None, encoding='utf-8', overwrite=False, verbose=None)[source]¶Saves the vocabulary of the embedding file to a text file. By default, the file is saved in the same directory as the embedding file, e.g.:
/path/to/filename.txt.gz ==> /path/to/filename_vocab.txt
create
(out_path, word_vectors, vocab_size=None, compression=None, verbose=True, overwrite=False, **format_kwargs)[source]¶Creates a file on disk containing the provided word vectors.
word_vectors (Dict[str, VectorType] or Iterable[Tuple[str, VectorType]]) – it can be an iterable of word vector tuples or a dictionary word -> vector
;
the word vectors are written in the order determined by the iterable object.
vocab_size (Optional
[int
]) – it must be provided if word_vectors
has no __len__
and the specific-format
creator needs to know a priori the vocabulary size; in any case, the creator
should check at the end that the provided vocab_size
matches the actual length
of word_vectors
compression (Optional
[str
]) – valid values are: "bz2"|"bz", "gzip"|"gz", "xz"|"lzma", "zip"
verbose (bool
) – if positive, show progress bars and information
overwrite (bool
) – overwrite the file if it already exists
format_kwargs – format-specific arguments
create_from_file
(source_file, out_dir=None, out_filename=None, vocab_size=None, compression=None, verbose=True, overwrite=False, **format_kwargs)[source]¶Creates a new file on disk with the same content of another file.
source_file (EmbFile
) – the file to take data from
out_dir (Union
[str
, Path
, None
]) – directory where the file will be stored; by default, it’s the parent directory
of the source file
out_filename (Optional
[str
]) – filename of the produced file (inside out_dir
); by default, it is obtained by
replacing the extension of the source file with the proper one and appending the
compression extension if compression is not None
.
Note: if you pass this argument, the compression extension is not automatically
appended.
vocab_size (Optional
[int
]) – if the source EmbFile has attribute vocab_size == None
, then: if the specific
creator requires it (bin and txt formats do), it must be provided; otherwise it
can be provided for having ETA in progress bars.
compression (Optional
[str
]) – valid values are: "bz2"|"bz", "gzip"|"gz", "xz"|"lzma", "zip"
verbose (bool
) – print info and progress bar
overwrite (bool
) – overwrite a file with the same name if it already exists
format_kwargs – format-specific arguments (see above)
embfile.
BinaryEmbFile
(path, encoding='utf-8', dtype=dtype('float32'), out_dtype=None, verbose=True)[source]¶Bases: embfile.core._file.EmbFile
Format used by the Google word2vec tool. You can use it to read the file GoogleNews-vectors-negative300.bin.
It begins with a text header line of space-separated fields:
<vocab_size> <vector_size>
Each word vector pair is encoded as follows:
encoded word + space
followed by the binary representation of the vector.
path (Union[str, Path]) – path to the (possibly compressed) file
encoding (str) – text encoding; note: if you provide a UTF encoding (e.g. utf-16) that uses a BOM (Byte Order Mark) without specifying the byte-endianness (e.g. utf-16-le or utf-16-be), the little-endian version is used (utf-16-le).
dtype (Union[str, dtype]) – a valid numpy data type (or whatever you can pass to numpy.dtype()); default: '<f4' (little-endian float, 4 bytes)
out_dtype (Union[str, dtype, None]) – all the returned vectors will be converted (if needed) to this data type; by default, it is equal to the original data type of the vectors in the file, i.e. no conversion takes place.
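For instance (a minimal sketch; the words are arbitrary), you can ask for vectors in a different dtype than the one stored on disk:

from embfile import BinaryEmbFile

with BinaryEmbFile('GoogleNews-vectors-negative300.bin', out_dtype='float64') as f:
    word2vec = f.load(['hello'])   # vectors are converted to float64 on the fly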
create(out_path, word_vectors, vocab_size=None, compression=None, verbose=True, overwrite=False, encoding='utf-8', dtype=None)[source]
Format-specific arguments are encoding and dtype.
Note: all the text is encoded without BOM (Byte Order Mark). If you pass “utf-16” or “utf-32”, the little-endian version is used (e.g. “utf-16-le”).
See create() for more.
create_from_file(source_file, out_dir=None, out_filename=None, vocab_size=None, compression=None, verbose=True, overwrite=False, encoding='utf-8', dtype=None)[source]
Format-specific arguments are encoding and dtype.
Note: all the text is encoded without BOM (Byte Order Mark). If you pass “utf-16” or “utf-32”, the little-endian version is used (e.g. “utf-16-le”).
See create_from_file() for more.
embfile.TextEmbFile(path, encoding='utf-8', out_dtype='float32', vocab_size=None, verbose=True)[source]
Bases: embfile.core._file.EmbFile
The format used by Glove and FastText files. Each word vector pair is stored as a line of text made of space-separated fields:
word vec[0] vec[1] ... vec[vector_size-1]
The file may or may not have an (automatically detected) “header” containing vocab_size and vector_size (in this order).
If the file doesn’t have a header, vector_size is set to the length of the first vector.
If you know vocab_size (even an approximate value), you may want to provide it to get an ETA in progress bars.
If the file has a header and you provide vocab_size, the provided value is ignored.
Compressed files are decompressed as you read them. Note that each file reader decompresses the file independently, so if you need to read the file multiple times, it’s better to decompress the entire file first and then open it.
encoding (str) – encoding of the text file; default is utf-8
out_dtype (Union[str, dtype]) – the dtype of the vectors that will be returned; default is single-precision float
vocab_size (Optional[int]) – useful when the file has no header but you know vocab_size; if the file has a header, this argument is ignored.
verbose (int) – default level of verbosity for all methods
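For example (a sketch; the file name and the vocabulary size are hypothetical):

from embfile import TextEmbFile

# For a headerless file, an (even approximate) vocab_size enables ETAs in progress bars.
with TextEmbFile('vectors.vec', vocab_size=400_000) as f:
    word2vec, missing = f.find(['hello', 'world'])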
create(out_path, word_vectors, vocab_size=None, compression=None, verbose=True, overwrite=False, encoding='utf-8', precision=5)[source]
Creates a file on disk containing the provided word vectors.
word_vectors (Dict[str, VectorType] or Iterable[Tuple[str, VectorType]]) – either a dictionary {word: vector} or an iterable of (word, vector) tuples; the word vectors are written in the order determined by the iterable object.
vocab_size (Optional[int]) – must be provided if word_vectors has no __len__ and the format-specific creator needs to know the vocabulary size a priori; in any case, the creator should check at the end that the provided vocab_size matches the actual length of word_vectors.
compression (Optional[str]) – valid values are: "bz2"|"bz", "gzip"|"gz", "xz"|"lzma", "zip"
verbose (bool) – if true, show progress bars and information
overwrite (bool) – overwrite the file if it already exists
format_kwargs – format-specific arguments
create_from_file(source_file, out_dir=None, out_filename=None, vocab_size=None, compression=None, verbose=True, overwrite=False, encoding='utf-8', precision=5)[source]
Creates a new file on disk with the same content as another file.
source_file (EmbFile) – the file to take data from
out_dir (Union[str, Path, None]) – directory where the file will be stored; by default, it’s the parent directory of the source file
out_filename (Optional[str]) – filename of the produced file (inside out_dir); by default, it is obtained by replacing the extension of the source file with the proper one and appending the compression extension if compression is not None. Note: if you pass this argument, the compression extension is not automatically appended.
vocab_size (Optional[int]) – if the source EmbFile has attribute vocab_size == None, then: if the format-specific creator requires it (the bin and txt formats do), it must be provided; otherwise, it can be provided to get an ETA in progress bars.
compression (Optional[str]) – valid values are: "bz2"|"bz", "gzip"|"gz", "xz"|"lzma", "zip"
verbose (bool) – print info and progress bar
overwrite (bool) – overwrite a file with the same name if it already exists
format_kwargs – format-specific arguments (see above)
embfile.VVMEmbFile(path, out_dtype=None, verbose=True)[source]
Bases: embfile.core._file.EmbFile, embfile.core.loaders.Word2Vector
(Custom format) A tar file storing vocabulary, vectors and metadata in 3 separate files.
Features:
the vocabulary can be loaded very quickly (no need for an external vocab file) and it is loaded into memory when the file is opened;
direct access to vectors, by word using __getitem__() (e.g. file['hello']) or by index using vector_at();
implements __contains__() (e.g. 'hello' in file);
all the information needed to open the file is stored in the file itself.
Specifics. The files contained in a VVM file are:
vocab.txt: contains each word on a separate line
vectors.bin: contains the vectors in binary format (concatenated)
meta.json: must contain (at least) the following fields:
vocab_size: number of word vectors in the file
vector_size: length of a word vector
encoding: text encoding used for vocab.txt
dtype: vector data type string (notation used by numpy)
__getitem__(word)[source]
Returns the vector associated with a word (random access to the file).
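For instance (a minimal sketch, assuming you already have a .vvm file at the given path):

import embfile

with embfile.open('embeddings.vvm') as f:
    if 'hello' in f:          # __contains__: fast vocabulary lookup
        vec = f['hello']      # __getitem__: constant-time access by word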
create(out_path, word_vectors, vocab_size=None, compression=None, verbose=True, overwrite=False, encoding='utf-8', dtype=None)[source]
Format-specific arguments are encoding and dtype.
Since a VVM file is a tar archive, you should use a compression format supported by the tarfile package: gz, bz2 or xz (avoid zip).
See create() for more.
create_from_file(source_file, out_dir=None, out_filename=None, vocab_size=None, compression=None, verbose=True, overwrite=False, encoding='utf-8', dtype=None)[source]
Format-specific arguments are encoding and dtype. Since a VVM file is a tar archive, you should use a compression format supported by the tarfile package: gz, bz2 or xz (avoid zip).
See create_from_file() for more.
embfile.register_format(format_id, extensions, overwrite=False)[source]
Class decorator that associates a new EmbFile subclass with a format_id and one or multiple extensions. Once you register a format, you can use open() to open files of that format.
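A hedged sketch of what this could look like (the class name, format_id, extension and the top-level import of EmbFile are assumptions made for illustration):

import embfile
from embfile import EmbFile   # assumption: the abstract base class is exported at the top level

@embfile.register_format('csv', extensions=['.csv'])
class CsvEmbFile(EmbFile):
    ...   # implement the abstract methods of EmbFile (reading, creation, etc.)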
embfile.associate_extension(ext, format_id, overwrite=False)[source]
Associates a file extension with a registered embedding file format.
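For example (a sketch; the extension and its exact spelling are assumptions):

import embfile

embfile.associate_extension('.wv', 'txt')   # .wv files would then be opened as text embedding files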
embfile.extract(src_path, member=None, dest_dir='.', dest_filename=None, overwrite=False)
Extracts a file compressed with gzip, bz2 or lzma, or a member file inside a zip/tar archive. The compression format is inferred from the file extension or from the magic number of the file (in the case of zip and tar).
The file is first extracted to a .part file, which is renamed once the extraction is complete.
member (Optional[str]) – must be provided if src_path points to an archive that contains multiple files
dest_dir (Union[str, Path]) – destination directory; by default, it’s the current working directory
dest_filename (Optional[str]) – destination filename; by default, it’s equal to member (if provided)
overwrite (bool) – overwrite the file at dest_path if it already exists
Returns: Path – the path to the extracted file
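For instance (a sketch; the archive and member names are hypothetical):

import embfile

txt_path = embfile.extract('glove.6B.zip', member='glove.6B.50d.txt')
print(txt_path)   # path to glove.6B.50d.txt in the current working directory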
Contributions are welcome, and they are greatly appreciated! Every little bit helps, and credit will always be given.
When reporting a bug please include:
Your operating system name and version.
Any details about your local setup that might be helpful in troubleshooting.
Detailed steps to reproduce the bug.
The best way to send feedback is to file an issue at https://github.com/janLuke/embfile/issues.
If you are proposing a feature:
Explain in detail how it would work.
Keep the scope as narrow as possible, to make it easier to implement.
Remember that this is a volunteer-driven project, and that code contributions are welcome :)
To set up embfile for local development:
Fork embfile (look for the “Fork” button).
Clone your fork locally:
git clone git@github.com:your_name_here/embfile.git
Create a branch for local development:
git checkout -b name-of-your-bugfix-or-feature
Now you can make your changes locally.
When you’re done making changes, run all the checks, the doc builder and the spell checker with one command:
tox
Commit your changes and push your branch to GitHub:
git add .
git commit -m "Your detailed description of your changes."
git push origin name-of-your-bugfix-or-feature
Submit a pull request through the GitHub website.
If you need some code review or feedback while you’re developing the code, just open the pull request.
For merging, you should:
Include passing tests (run tox) [1].
Update the documentation when there’s new API, functionality, etc.
Add a note to CHANGELOG.rst about the changes.
Add yourself to AUTHORS.rst.
[1] If you don’t have all the necessary Python versions available locally, you can rely on Travis: it will run the tests for each change you add to the pull request. It will be slower, though.
To run all the tests, run:
tox
To run a subset of tests:
tox -e envname -- pytest -k test_myfeature
To run all the test environments in parallel (you need to pip install detox):
detox
Note: to combine the coverage data from all the tox environments, run:

Platform | Command
---|---
Windows | set PYTEST_ADDOPTS=--cov-append (then run tox)
Other | PYTEST_ADDOPTS=--cov-append tox
Gianluca Gippetto - gianluca.gippetto@gmail.com