How can I generate a reliable digest from a hash in Ruby 2.4?

1.4k views Asked by At

I have a very large table containing 2 billion rows of 50 attributes. Not all are filled out and it's a sparse matrix.

I dislike having to build query off of all of the values, and the indexes are much too large now. I've lost performance.

For my new approach, I want to add a digest column that contains a digest of all of the attributes in a particular row.

There is no security requirement for this hash, so even MD5 would be fine.

Am I better off building a simple string containing representations of all the keys and values together? Or is there a better way?

For example, given hash:

attr_hash = { attribute1: "Please",
              attribute2: nil,
              attribute3: "don't",
              attribute4: nil,
              attribute5: nil,
              attribute6: nil,
              attribute7: "immediately",
              attribute8: "",
              attribute9: "downvote",
              attribute10: "my",
              attribute11: nil,
              attribute12: "question" }

would this be preferable (and I'm sure you'll agree this is beautiful):

attr_str = attr_hash.select{|k,v| v!="" && !v.nil?}.keys.sort.map{|k| "#{k}=#{attr_hash[k]}" }.join("^^")
digest = Digest::MD5.hexdigest(attr_str)

which gives a nice-looking string:

790470349a791b9897afd52a336ab2bb

I can index that column and get very, very fast response times from the database. And I'm unlikely to get many if any collisions from that. And if there's a collision one in 5 or 10 million times, it's fine.

I deeply appreciate any insights.

2

There are 2 answers

3
tadman On BEST ANSWER

Lazy way:

Digest::SHA2.hexdigest(attr_hash.inspect)

Where that pre-supposes your items have identical ordering. If you need to sort the items first:

Digest::SHA2.hexdigest(attr_hash.to_a.sort_by { |k, _v| k }.inspect)

I'd use JSON.dump(x) instead of x.inspect if I wanted something more portable, like to non-Ruby code-bases.

I also wouldn't bother stripping out empty values. The hash function doesn't care.

0
yuceel On

This may be useful for somebody who is dealing with nested json object

def predigest_json(json)
  if json.class == Hash
    arr = []
    json.each { |key, value| arr << "#{predigest_json key}=>#{predigest_json value}" }
    json = arr
  end
  if json.class == Array
    str = ''
    json.map! { |value| predigest_json value }.sort!.each { |value| str << value }
  end
  if json.class != String
    json = json.to_s << json.class.to_s
  end
  json
end


Digest::SHA2.hexdigest(predigest_json(json))