Wednesday, March 9, 2011

Making numpy ndarray's Hashable

I am doing some research work using numpy and scipy, and I have been amazed by how fast they run; the other day, though, I stumbled on what seemed a roadblock. I wanted to use some fairly large arrays (length 1024) as keys for storing data in Python dictionaries, but numpy's ndarray objects are not hashable, and converting them to tuples would take forever, besides eating up an unbelievably large amount of memory. Searching the web provided no satisfactory answers.
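To give an idea of what "forever" means here, a quick sketch of the conversion cost (the numbers will vary by machine, but the shape of the problem won't):

# Rough, machine-dependent measurement of the tuple-conversion cost:
# each key becomes 1024 boxed Python objects instead of one flat buffer.
from timeit import timeit
from numpy import arange

a = arange(0, 1024)
print(timeit(lambda: tuple(a), number=10000))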

After a while, though, I realized that it would be simple to implement a hashable wrapper for my array objects. So here you have it:
from hashlib import sha1

from numpy import all, array, uint8


class hashable(object):
    r'''Hashable wrapper for ndarray objects.

        Instances of ndarray are not hashable, meaning they cannot be added to
        sets, nor used as keys in dictionaries. This is by design - ndarray
        objects are mutable, and therefore cannot reliably implement the
        __hash__() method.

        The hashable class allows a way around this limitation. It implements
        the required methods for hashable objects in terms of an encapsulated
        ndarray object. This can be either a copied instance (which is safer)
        or the original object (which requires the user to be careful enough
        not to modify it).
    '''
    def __init__(self, wrapped, tight=False):
        r'''Creates a new hashable object encapsulating an ndarray.

            wrapped
                The wrapped ndarray.

            tight
                Optional. If True, a copy of the input ndarray is created.
                Defaults to False.
        '''
        self.__tight = tight
        self.__wrapped = array(wrapped) if tight else wrapped
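        # sha1 hashes the array's raw bytes; view(uint8) reinterprets the
        # buffer without copying. The hash is computed once, up front.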
        self.__hash = int(sha1(wrapped.view(uint8)).hexdigest(), 16)

    def __eq__(self, other):
        return all(self.__wrapped == other.__wrapped)

    def __hash__(self):
        return self.__hash

    def unwrap(self):
        r'''Returns the encapsulated ndarray.

            If the wrapper is "tight", a copy of the encapsulated ndarray is
            returned. Otherwise, the encapsulated ndarray itself is returned.
        '''
        if self.__tight:
            return array(self.__wrapped)

        return self.__wrapped
Using the wrapper class is simple enough:
>>> from numpy import arange
>>> a = arange(0, 1024)
>>> d = {}
>>> d[a] = 'foo'
Traceback (most recent call last):
  ...
TypeError: unhashable type: 'numpy.ndarray'
>>> b = hashable(a)
>>> d[b] = 'bar'
>>> d[b]
'bar'
In my profiling sessions, adding the wrapped-up 1024-long arrays as keys to a dictionary amounted to no more overhead than adding the naked arrays to a list.
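For reference, a sketch of the kind of measurement I mean (not my exact harness; the figures will depend on your machine):

# Compare inserting wrapped arrays into a dict against appending the
# raw arrays to a list; uses the hashable class defined above.
from timeit import timeit
from numpy import arange

arrays = [arange(i, i + 1024) for i in range(1000)]

def fill_dict():
    d = {}
    for a in arrays:
        d[hashable(a)] = None
    return d

def fill_list():
    l = []
    for a in arrays:
        l.append(a)
    return l

print(timeit(fill_dict, number=10))
print(timeit(fill_list, number=10))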

8 comments:

  1. This is a nice idea, I was looking for such a solution, thanks.

    I'm really not sure that it would work in your context (plus I'm pretty sure that it's not secure), but based on this:

    http://www.scipy.org/Subclasses

    your code inspired me to write this variant, using a subclass instead of a wrapper:

    from numpy import *
    from hashlib import sha1

    class HashableArray(ndarray):

        def __new__(cls, data):
            return array(data).view(cls)

        def __hash__(self):
            return int(sha1(self).hexdigest(), 16)

        def __eq__(self, other):
            return all(array(self) == array(other))

    It seems to work in my context, though I'm not sure it's very efficient.

  2. Oddly enough, just the other day I was experimenting with subclassing as a way to more compactly add "hashability" to numpy arrays. My version looked like this:

    class hashable_array(ndarray):
        def __new__(cls, values):
            this = ndarray.__new__(cls, shape=values.shape, dtype=values.dtype)
            this[...] = values
            return this

        def __init__(self, values):
            self.__hash = int(sha1(self).hexdigest(), 16)

        def __eq__(self, other):
            return all(ndarray.__eq__(self, other))

        def __hash__(self):
            return self.__hash

        def __setitem__(self, key, value):
            raise Exception('hashable arrays are read-only')

    My version added some amenities (such as caching the hash value and overriding __setitem__ to disable array writes), but otherwise it was pretty much the same as yours.

  3. It really makes more sense to cache the hash value, indeed.

    One thing I wasn't sure about was the correct way to define __eq__. I finally settled on:

    def __eq__(self, other):
        return allclose(self, other)

    because it simply didn't occur to me to use the method version of equality testing, as you do. I was stuck with the operator version, which yields an infinite recursion:

    def __eq__(self, other):
        return all(self == other)

    or this:

    def __eq__(self, other):
        return all(array(self) == other)

    which must be inefficient (or maybe not, but anyway it's ugly). But I think your solution is the most elegant.

    And finally, it also makes great sense to block the setter... all in all, I adopt your version!

  4. Sorry, it's me again... I just noticed a subtle problem with having the _hash precomputed in __init__: it seems it doesn't play well with operator overloading. If you have for instance:

    a = array((1,2,3))
    b = HashableArray((2,3,4))
    c = a - b
    print c.__class__ # ok, still a HashableArray
    print c._hash # doesn't exist

    I'm not sure exactly why and how, but it has something to do with the overloading of the basic arithmetic operators (__sub__, __add__, etc), which causes the object to be created without invoking its __init__ method. Having the __hash__ method compute the value on the fly solves the problem (at least in my context), at the cost of a slight performance decrease (I assume).

  5. Indeed this is related to the special arithmetic methods. __add__, __sub__, etc. all return new object instances; in this case, as they are implemented in the ndarray class, naturally they will return ndarray (instead of HashableArray) instances.

    Unfortunately there is no automatic way out of it; the best you could do is to override the arithmetic methods on the HashableArray class:

    def __add__(self, other):
        return HashableArray(ndarray.__add__(self, other))

    def __sub__(self, other):
        return HashableArray(ndarray.__sub__(self, other))

    ...and so on.
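
    Another thought, untested in your context: the scipy.org page linked above describes the __array_finalize__ hook, which numpy invokes for every construction path (views, ufunc results, and so on), unlike __init__. Hooking it and computing the hash lazily would avoid overriding each operator; a sketch:

    from hashlib import sha1
    from numpy import ndarray, array

    class HashableArray(ndarray):
        def __new__(cls, data):
            return array(data).view(cls)

        def __array_finalize__(self, obj):
            # Runs for every new instance, including arithmetic results;
            # just invalidate any cached hash here.
            self._hash = None

        def __hash__(self):
            # Computed lazily, so instances that bypassed __init__ still work.
            if self._hash is None:
                self._hash = int(sha1(self).hexdigest(), 16)
            return self._hash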

  6. How about this:


    In [10]: a = array([3,4])

    In [11]: a.flags.writeable = False

    In [12]: b = set()

    In [13]: b.add(a.data)

    In [14]: b
    Out[14]: set([<read-only buffer for 0x..., size 16, offset 0 at 0x...>])

    In [15]: print a.data

    In [16]: print a
    [3 4]

    In [17]: print b
    set([<read-only buffer for 0x..., size 16, offset 0 at 0x...>])

    In [18]: x = b.pop()

    In [19]: print x

    In [20]: y = frombuffer(x, dtype = 'int')

    In [21]: y
    Out[21]: array([3, 4])

    Nicky

    Replies
    1. I can see this technique applied to create a "numpy_set" that moves the logic from the array to the container; but then you'd be customizing a class just the same, only at the opposite side of the problem (i.e. changing the container, not the array). Either way could work better depending on the context of application.
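
      A minimal sketch of what I mean, with a hypothetical numpy_set class that stores the raw bytes plus dtype and shape (enough to rebuild the array on the way out):

      from numpy import frombuffer

      class numpy_set(object):
          def __init__(self):
              self.__data = {}

          def add(self, a):
              # The raw bytes make a perfectly hashable dict key.
              self.__data[a.tostring()] = (a.dtype, a.shape)

          def __contains__(self, a):
              return a.tostring() in self.__data

          def pop(self):
              data, (dtype, shape) = self.__data.popitem()
              return frombuffer(data, dtype=dtype).reshape(shape)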

  7. Check out this totally badass library for memoizing numpy arrays: https://pythonhosted.org/joblib/index.html
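
    For instance, a minimal sketch using joblib's Memory class (the cache directory is just whatever scratch location you like):

    from joblib import Memory
    from numpy import arange

    memory = Memory(cachedir='/tmp/joblib_cache', verbose=0)

    @memory.cache
    def expensive(a):
        # joblib hashes the array argument itself, so no wrapper is needed.
        return (a ** 2).sum()

    a = arange(0, 1024)
    expensive(a)  # computed and written to the on-disk cache
    expensive(a)  # served straight from the cache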
