I just had a fun conversation with Ronny about how to set up an object
network where children are stored in hashtables keyed on the child's
unique id so that when the child's unique id is changed the parent's
hashtable is easily updateable.
the funnest part is where we were speculating about the quality of
hashing on string. turns out that the default string GetHashCode is
based on sequentially XOR-ing each character of the string to produce a
32-bit value.
plus, this guy
ran into a problem where hashing 350,000 words from an english
dictionary ran into 64 collisions. The example he gave was that "blair"
and "brainlessness" both give you a hash code of 175803953 [try it:
Console.WriteLine( "{0} = {1}", "blair".GetHashCode(),
"brainlessness".GetHashCode() );]
what i'm proud of is that i understood why:
blair
brainlessness
subtract the letters b, l, a, i and r from brainlessness, and you
end up with "nessness". xor is commutative, so the two "blairs" will
come up with the same hashcode; the difference will be the "nessness".
notice that "nessness" repeats "ness" twice. what happens when we xor a
number with itself? all the bits are turned off - we get zero.
therefore, "ness" xor "ness" will be zero. what happens when we xor
"blair" with zero? no bits are changed. therefore, f("blair") =
f("brainlessness"), which is a nice little bit of political
commentary...
to stop blair colliding with brainlessness, you could change
GetHashCode to add 1 to the hashcode every time it steps one character
forward in the string; this would mean that "an" would not collide with
"na", for example, which seems like a nice feature for a hashing
algorithm to have.