After a while, though, I realized that it would be simple to implement a hashable wrapper for my array objects. So here you have it:

from hashlib import sha1
from numpy import all, array, uint8


class hashable(object):
    r'''Hashable wrapper for ndarray objects.

    Instances of ndarray are not hashable, meaning they cannot be added
    to sets, nor used as keys in dictionaries. This is by design - ndarray
    objects are mutable, and therefore cannot reliably implement the
    __hash__() method.

    The hashable class allows a way around this limitation. It implements
    the required methods for hashable objects in terms of an encapsulated
    ndarray object. This can be either a copied instance (which is safer)
    or the original object (which requires the user to be careful enough
    not to modify it).
    '''

    def __init__(self, wrapped, tight=False):
        r'''Creates a new hashable object encapsulating an ndarray.

        wrapped
            The wrapped ndarray.

        tight
            Optional. If True, a copy of the input ndarray is created.
            Defaults to False.
        '''
        self.__tight = tight
        self.__wrapped = array(wrapped) if tight else wrapped
        self.__hash = int(sha1(wrapped.view(uint8)).hexdigest(), 16)

    def __eq__(self, other):
        return all(self.__wrapped == other.__wrapped)

    def __hash__(self):
        return self.__hash

    def unwrap(self):
        r'''Returns the encapsulated ndarray.

        If the wrapper is "tight", a copy of the encapsulated ndarray is
        returned. Otherwise, the encapsulated ndarray itself is returned.
        '''
        if self.__tight:
            return array(self.__wrapped)
        return self.__wrapped

Using the wrapper class is simple enough:

>>> from numpy import arange
>>> a = arange(0, 1024)
>>> d = {}
>>> d[a] = 'foo'
TypeError: unhashable type: 'numpy.ndarray'
>>> b = hashable(a)
>>> d[b] = 'bar'
>>> d[b]
'bar'

In my profiling sessions, adding the wrapped-up 1024-long arrays as keys to a dictionary amounted to no more overhead than adding the naked arrays to a list.
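To make the tight/loose trade-off concrete, here is a small runnable sketch (the wrapper class is restated compactly, with a `bool()` cast on `__eq__` as a minor adjustment) showing how mutating a loosely wrapped array silently invalidates the cached hash:

```python
from hashlib import sha1
import numpy as np

class hashable(object):
    """Compact restatement of the wrapper above, for a runnable demo."""
    def __init__(self, wrapped, tight=False):
        self.__tight = tight
        self.__wrapped = np.array(wrapped) if tight else wrapped
        self.__hash = int(sha1(wrapped.view(np.uint8)).hexdigest(), 16)
    def __eq__(self, other):
        return bool(np.all(self.__wrapped == other.__wrapped))
    def __hash__(self):
        return self.__hash

a = np.arange(3)
k = hashable(a)       # tight=False: the wrapper shares the live array
d = {k: 'value'}
a[0] = 99             # mutating a changes k's contents, not its cached hash
probe = hashable(a)   # a fresh wrapper over the mutated data hashes differently
found = probe in d    # False: equal contents, but the hashes now disagree
```

With `tight=True` the wrapper hashes a private copy, so later mutations of the original array cannot cause this inconsistency.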

This is a nice idea, I was looking for such a solution, thanks.

I'm really not sure that it would work in your context (plus I'm pretty sure that it's not secure), but based on this:

http://www.scipy.org/Subclasses

your code inspired this variant of mine, using a subclass instead of a wrapper:

from numpy import *
from hashlib import sha1


class HashableArray(ndarray):

    def __new__(cls, data):
        return array(data).view(cls)

    def __hash__(self):
        return int(sha1(self).hexdigest(), 16)

    def __eq__(self, other):
        return all(array(self) == array(other))

It seems to work in my context, though I'm not sure it's very efficient.
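For what it's worth, a runnable sketch of this subclass approach (restated with `asarray` in place of `array` to avoid an extra copy, `ascontiguousarray` so hashing works on non-contiguous views, and a `bool()` cast so `__eq__` returns a plain Python bool) behaves as expected as a dictionary key:

```python
from hashlib import sha1
import numpy as np

class HashableArray(np.ndarray):
    """Compact restatement of the subclass variant above."""
    def __new__(cls, data):
        return np.asarray(data).view(cls)
    def __hash__(self):
        return int(sha1(np.ascontiguousarray(self)).hexdigest(), 16)
    def __eq__(self, other):
        return bool(np.all(np.asarray(self) == np.asarray(other)))

d = {HashableArray([1, 2, 3]): 'first'}
# a second instance with the same contents hashes and compares equal,
# so it finds the same entry
value = d[HashableArray([1, 2, 3])]
```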

Oddly enough, just the other day I was experimenting with subclassing as a way to more compactly add "hashability" to numpy arrays. My version looked like this:

class hashable_array(ndarray):

    def __new__(cls, values):
        this = ndarray.__new__(cls, shape=values.shape, dtype=values.dtype)
        # fill through the base class setter: the overridden __setitem__
        # below would otherwise raise during construction
        ndarray.__setitem__(this, ..., values)
        return this

    def __init__(self, values):
        self.__hash = int(sha1(self).hexdigest(), 16)

    def __eq__(self, other):
        return all(ndarray.__eq__(self, other))

    def __hash__(self):
        return self.__hash

    def __setitem__(self, key, value):
        raise Exception('hashable arrays are read-only')

My version added some amenities (such as caching the hash value and overriding __setitem__ to disable array writes), but otherwise it was pretty much the same as yours.
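A runnable sketch of that read-only variant (restated compactly, with the base-class setter used during construction so the blocked `__setitem__` doesn't fire while the array is being filled, and `TypeError` in place of a bare `Exception`):

```python
from hashlib import sha1
import numpy as np

class hashable_array(np.ndarray):
    """Compact restatement of the read-only, hash-caching variant."""
    def __new__(cls, values):
        this = np.ndarray.__new__(cls, shape=values.shape, dtype=values.dtype)
        np.ndarray.__setitem__(this, ..., values)  # bypass the blocked setter
        return this
    def __init__(self, values):
        # cache the hash once, at construction time
        self.__hash = int(sha1(np.ascontiguousarray(self)).hexdigest(), 16)
    def __eq__(self, other):
        return bool(np.all(np.ndarray.__eq__(self, other)))
    def __hash__(self):
        return self.__hash
    def __setitem__(self, key, value):
        raise TypeError('hashable arrays are read-only')

a = hashable_array(np.arange(3))
try:
    a[0] = 7        # a write would silently invalidate the cached hash
    writable = True
except TypeError:
    writable = False
```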

It really makes more sense to cache the hash value, indeed.

One thing I wasn't sure about was the correct way to define __eq__. I finally settled for:

    def __eq__(self, other):
        return allclose(self, other)

because it simply didn't occur to me to use the method version of equality testing, as you do. I was stuck with the operator version, which yields an infinite recursion:

    def __eq__(self, other):
        return all(self == other)

or this:

    def __eq__(self, other):
        return all(array(self) == other)

which must be inefficient (or maybe not, but anyway it's ugly). But I think your solution is the most elegant.

And finally, it also makes great sense to block the setter. All in all, I'll adopt your version!

Sorry, it's me again. I just noticed a subtle problem with having the _hash precomputed in __init__: it seems it doesn't play well with operator overloading. For instance, if you have:

a = array((1, 2, 3))
b = HashableArray((2, 3, 4))
c = a - b
print c.__class__   # ok, still a HashableArray
print c._hash       # doesn't exist

I'm not sure exactly why and how, but it has something to do with the overloading of the basic arithmetic operators (__sub__, __add__, etc), which causes the object to be created without invoking its __init__ method. Having the __hash__ method compute the value on the fly solves the problem (at least in my context), at the cost of a slight performance decrease (I assume).

Indeed this is related to the special arithmetic methods. __add__, __sub__, etc. all return new object instances; since they are implemented in the ndarray class, they create the result directly as a view cast to the subclass, without ever going through HashableArray's __new__ and __init__, so any attribute precomputed there never gets set.

Unfortunately there is no way out of it from __init__ alone; the simplest fix is to override the arithmetic methods on the HashableArray class:

    def __add__(self, other):
        return HashableArray(ndarray.__add__(self, other))

    def __sub__(self, other):
        return HashableArray(ndarray.__sub__(self, other))

...and so on.
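For completeness: numpy's subclassing machinery also offers a hook that covers these cases without touching each operator. `__array_finalize__` is called for every new instance of the subclass, including the views produced by arithmetic, so the hash cache can be reset there and filled lazily on first use. A sketch along those lines (names as in the comments above; this trades the precomputed hash for a compute-on-first-use one):

```python
from hashlib import sha1
import numpy as np

class HashableArray(np.ndarray):
    def __new__(cls, data):
        return np.asarray(data).view(cls)
    def __array_finalize__(self, obj):
        # runs for explicit construction, view casting and arithmetic
        # results alike, so every instance starts with an empty cache
        self._hash = None
    def __hash__(self):
        if self._hash is None:  # compute once, on first use
            self._hash = int(sha1(np.ascontiguousarray(self)).hexdigest(), 16)
        return self._hash
    def __eq__(self, other):
        return bool(np.all(np.asarray(self) == np.asarray(other)))

a = np.array((1, 2, 3))
b = HashableArray((2, 3, 4))
c = a - b        # still a HashableArray...
h = hash(c)      # ...and hashing it now works, no operator overrides needed
```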

How about this:

a = array([3, 4])

In [11]: a.flags.writeable = False

In [12]: b = set()

In [13]: b.add(a.data)

In [14]: b
Out[14]: set([<read-only buffer ...>])

In [15]: print a.data

In [16]: print a
[3 4]

In [17]: print b
set([<read-only buffer ...>])

In [18]: x = b.pop()

In [19]: print x

In [20]: y = frombuffer(x, dtype='int')

In [21]: y
Out[21]: array([3, 4])

Nicky

I can see this technique applied to create a "numpy_set" that moves the logic from the array to the container; but then you'd be customizing a class just the same, only on the opposite side of the problem (i.e. changing the container, not the array). Either way could work better depending on the context of application.
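A container-side sketch along those lines (`NumpySet` is a hypothetical name; the key combines the raw bytes with the dtype and shape, since bytes alone would conflate, say, a flat array with its reshaped counterpart):

```python
import numpy as np

class NumpySet:
    """Hypothetical set for numpy arrays: the container, not the array,
    takes care of hashing, by keying on bytes plus dtype and shape."""
    def __init__(self):
        self._items = {}
    @staticmethod
    def _key(a):
        a = np.ascontiguousarray(a)  # tobytes() needs a contiguous buffer
        return (a.tobytes(), a.dtype.str, a.shape)
    def add(self, a):
        self._items[self._key(a)] = np.array(a)  # keep a defensive copy
    def __contains__(self, a):
        return self._key(a) in self._items
    def __len__(self):
        return len(self._items)

s = NumpySet()
s.add(np.array([3, 4]))
present = np.array([3, 4]) in s      # same bytes, dtype and shape
absent = np.array([3.0, 4.0]) in s   # equal values, but a different dtype
```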

Check out this totally badass library for memoizing numpy arrays: https://pythonhosted.org/joblib/index.html
