What is a hash and when do we use it?


Cryptographic hash functions are mathematical functions that take data of any size as input and convert it into a fixed-length value, usually displayed as a hexadecimal string. This value is called a message digest, hash value or digital fingerprint. Hashing has specific properties that make it very interesting for security:

  • Irreversibility: The function is one-way, meaning that we can’t take the hash, apply an inverse function and obtain the input. The input cannot feasibly be determined from the output, even if you know the hash function used to create it.
  • Fixed-length output: Hash functions always produce an output of the same length regardless of the input size. For instance, if we use MD5 we’ll get a 128-bit output whether we hash a whole book or just a short sentence. So, unlike encryption, where the length of the encrypted output grows with the size of the input, there’s no relationship between the size of the input and the length of its hash.
    In summary, hashing and encryption are not the same. They have different purposes, but they can be used together.
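
The fixed-length property is easy to check with Python’s standard hashlib module. This is only a small sketch; the helper name digest_bits and the sample inputs are made up for the illustration.

```python
import hashlib


def digest_bits(algorithm: str, data: bytes) -> int:
    """Return the length in bits of the digest produced for `data`."""
    return hashlib.new(algorithm, data).digest_size * 8


short_message = b"just a short sentence"
long_message = b"a whole book " * 100_000  # a bit over 1 MB of input

for label, data in (("short", short_message), ("long", long_message)):
    # The digest length depends on the algorithm, never on the input size.
    print(label, len(data), "bytes ->",
          digest_bits("md5", data), "bit MD5,",
          digest_bits("sha256", data), "bit SHA-256")
```

Both inputs give a 128-bit MD5 digest and a 256-bit SHA-256 digest.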

Why Hashing?

Why would you want to convert a message into an illegible alphanumeric string if there’s no way to reverse it? What’s the point of sending an encoded message to someone if it cannot be decoded? These are fair questions, and even though it might seem strange, irreversibility is exactly the feature we can benefit from in security.
Cryptographic hashes are deterministic and collision resistant. This means:
  • The same input always produces the same hash, no matter when or where it is computed.
  • It must be computationally infeasible to find two different inputs that produce the same hash. Because the output has a fixed length, such pairs exist in theory, but for a strong hash function nobody should be able to find one in practice.
  • This is why a hash can be used as a kind of fingerprint or checksum for files. For example, many Linux distributions publish a checksum next to each download (historically MD5, nowadays usually SHA-256) so that the integrity of the file can be verified. If just one bit of the file has been changed, the checksum of the downloaded file will be different.
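
As a rough sketch of the checksum idea, the snippet below hashes a file in chunks and compares the result with a published value. The function names, the file name download.bin and the sample contents are assumptions for the example, and SHA-256 is used instead of MD5.

```python
import hashlib
import os
import tempfile


def file_sha256(path: str, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path: str, published_hex: str) -> bool:
    """Compare the file's hash with the hex checksum listed by the publisher."""
    return file_sha256(path) == published_hex.strip().lower()


original = b"release-1.0 contents"
tampered = bytes([original[0] ^ 0x01]) + original[1:]  # flip a single bit

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "download.bin")
    with open(path, "wb") as handle:
        handle.write(original)
    published = file_sha256(path)  # what the publisher would list next to the file

    with open(path, "wb") as handle:
        handle.write(tampered)
    print(matches_published_checksum(path, published))  # prints False
```

A checksum only protects against tampering if it is obtained from a trusted place; an attacker who can replace the file and the published checksum together defeats the check.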

Use Cases for Hashing

With all the above in mind, we can start thinking of use cases where we can apply it:
  • Digital signatures: In cryptography, a digital signature is a mathematical scheme for verifying the authenticity of digital messages or documents. In a few words, the signer hashes the message and then signs that hash with their private key; anyone with the matching public key can check the result. A valid digital signature gives the recipient confidence that the message came from a sender known to them and was not modified.
  • Hashes are used to verify the integrity of documents.
  • File management: Some companies use hashes for file management. Using hashes they can index data, identify files and delete duplicates. If a system has thousands or millions of files, using hashes can save a lot of time on these tasks. For example, two files with the same hash can be treated as duplicates and stored only once, which reduces the storage required. The hashes can also be used as keys in a data structure called a hash table, which allows faster lookups.
  • Hashes are used to check the integrity of messages before and after communication. Hashes help us prove the data has not been altered during the communication.
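  • Protecting sensitive values at rest: If we store only the hash of a value we never need to read back (a password, for example), an attacker who gets access to the stored hashes cannot read the original values from them. Files and documents that must be read again later need encryption instead, since a hash cannot be reversed.
  • Generating unique IDs: Some systems, like Git, derive the ID of each object (file contents, commits) by hashing its content.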
  • Storing passwords in a database: If we store passwords as clear text in a database, there’s always the risk of a security breach in which someone accesses the database and steals them. Even if we store the passwords encrypted, there’s always the risk that they can be decrypted. However, if we store only the hash of each password, the stored value is readable but cannot be reversed to obtain the real password. In practice a random salt and a deliberately slow hash function are used, so that guessing common passwords against stolen hashes is expensive. The way we can benefit from this is the following:
    • When users create their password, we use that password as the input of our hash function and store the resulting hash, not the password, in our database.
    • When users provide their password to log in, we apply the same hash function to the password provided and compare the result with the value stored in our database. If the hashes are the same, the password is the same.
  • Proof-of-work algorithms: Most proof-of-work algorithms require a hash value that is smaller than a certain target value (which is set by the mining difficulty). Because hash outputs are unpredictable, the only way to find such a hash is trial and error: miners change a counter (the nonce) in the block and compute enormous numbers of hashes until one falls below the target. This is the basis of how blocks of transactions are validated in cryptocurrencies like Bitcoin.
  • Security in a blockchain: In cryptocurrencies, hashes are used to ensure that the data contained in the blocks of a blockchain is not altered. Each block includes the hash of the previous block, and these hashes are checked by the network participants, so if anyone modifies any data in the blockchain, the hashes no longer match and everyone in the network can easily detect it. This helps prevent fraudulent transactions and double-spending.
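
The password use case above can be sketched in a few lines. Note that this sketch swaps the single plain hash for salted PBKDF2-HMAC-SHA256 from Python’s hashlib, because a fast unsalted hash makes stolen password hashes cheap to guess. The function names and the ITERATIONS value are assumptions for the example, not recommendations from this article.

```python
import hashlib
import hmac
import secrets

ITERATIONS = 200_000  # assumed work factor for this sketch; tune for your hardware


def hash_password(password: str) -> tuple[bytes, bytes]:
    """Return (salt, digest); both are stored, the password itself never is."""
    salt = secrets.token_bytes(16)  # random per-user salt
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, stored_digest: bytes) -> bool:
    """Re-hash the supplied password with the stored salt and compare."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    # compare_digest runs in constant time, which avoids leaking timing information.
    return hmac.compare_digest(candidate, stored_digest)


salt, stored = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, stored))  # True
print(verify_password("wrong guess", salt, stored))                   # False
```

The database keeps only the salt and the digest, so a leak exposes neither the passwords nor anything that can be reversed into them.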

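Bitcoin-style proof of work accepts a hash only when, read as a number, it falls below a target, and each block commits to the hash of the previous one. The toy sketch below shows both ideas under simplifying assumptions: the names mine, block_hash, build_chain and GENESIS are invented for the example, and real Bitcoin hashes a binary block header with double SHA-256 and encodes the target differently.

```python
import hashlib

GENESIS = "0" * 64  # placeholder "previous hash" for the first block


def mine(block_data: bytes, difficulty_bits: int) -> tuple[int, str]:
    """Try nonces until the SHA-256 hash, read as an integer, is below the target."""
    target = 1 << (256 - difficulty_bits)  # more difficulty bits -> smaller target
    nonce = 0
    while True:
        digest = hashlib.sha256(block_data + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce, digest.hex()
        nonce += 1


def block_hash(previous_hash: str, payload: str) -> str:
    """Hash a block's payload together with the hash of the previous block."""
    return hashlib.sha256((previous_hash + payload).encode("utf-8")).hexdigest()


def build_chain(payloads: list[str]) -> list[str]:
    """Return the hash of every block in a chain built from the payloads."""
    hashes = []
    previous = GENESIS
    for payload in payloads:
        previous = block_hash(previous, payload)
        hashes.append(previous)
    return hashes


nonce, found = mine(b"block with some transactions", difficulty_bits=16)
print(nonce, found)  # the hash starts with four zero hex digits

honest = build_chain(["alice pays bob 1", "bob pays carol 2", "carol pays dave 3"])
forged = build_chain(["alice pays bob 9", "bob pays carol 2", "carol pays dave 3"])
print([a == b for a, b in zip(honest, forged)])  # [False, False, False]
```

Changing the first block changes every later hash, which is why an edit anywhere in the chain is visible to everyone who re-checks the hashes.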

Hashing is not perfect

There are mainly two concerns when it comes to cryptographic hashing:
  • Collision resistance: Ideally, every input would map to its own hash, but because the output has a fixed length, different inputs that produce the same output must exist. When two different inputs produce the same hash we’ve got a collision. Collision resistance measures how difficult it is to create two different messages that produce the same hash, and it is one of the main indicators of the strength of a hash function. For example, MD5 and SHA-1 are no longer collision resistant: practical MD5 collisions were published in 2004, and in 2017 researchers from Google and CWI Amsterdam demonstrated a SHA-1 collision.
  • Preimage attack: In a preimage attack, the attacker tries to construct a message that hashes to a given value. If the attacker is able to work backwards and create inputs that produce specific hashes, then our system is compromised.
To evaluate the strengths and weaknesses of hash functions we define:
  • Preimage resistance: Given a hash h, it must be difficult to find a message m that produces that hash.
  • Weak collision resistance (also called second-preimage resistance): Given a message m1, it must be difficult to find another message m2 that produces the same hash.
  • Strong collision resistance: It must be difficult to find any two different messages m1 and m2 that produce the same hash.
  • The term difficult refers to the computational power required to do that.
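
To get a feeling for what "difficult" means, we can deliberately weaken a hash by keeping only a few of its bits and then search for a collision. The sketch below truncates SHA-256; the function names and the choice of 24 bits are assumptions for the experiment, not a real attack on SHA-256.

```python
import hashlib
from itertools import count


def truncated_hash(data: bytes, bits: int) -> int:
    """Keep only the top `bits` bits of SHA-256 to simulate a short, weak digest."""
    full = int.from_bytes(hashlib.sha256(data).digest(), "big")
    return full >> (256 - bits)


def find_collision(bits: int) -> tuple[bytes, bytes, int]:
    """Hash distinct inputs until two of them share the same truncated digest."""
    seen: dict[int, bytes] = {}
    for attempts in count(1):
        message = str(attempts).encode("ascii")
        value = truncated_hash(message, bits)
        if value in seen:
            return seen[value], message, attempts
        seen[value] = message


m1, m2, attempts = find_collision(24)
print(m1, m2, attempts)
```

With a 24-bit digest a collision typically appears after a few thousand attempts, roughly 2^12, because the expected cost of a collision search is about 2^(n/2) for an n-bit digest. For a full 256-bit digest the same search would need on the order of 2^128 attempts, which is what makes it infeasible.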

How strong can a hash function be?

The strength of a hash function is related to the length of its output. A 128-bit hash function is considered weaker than a 256-bit one because the longer output has 2^128 times more possible values to test when conducting a brute-force attack.
  • With 128 bits there are about 3.4 x 10^38 possible hashes, which is a huge space to search in a brute-force attack.
  • SHA-1, with 160 bits, has an output space about 4 billion times larger than MD5’s 128-bit space. And yet both of them are deprecated, because attacks much faster than brute force have been found against them.
  • Some of the more popular hashes are:
    • MD4 - 128 bits, considered obsolete
    • MD5 - 128 bits. It’s also deprecated, although we still see it
    • RIPEMD-160 (160 bits), Tiger, Whirlpool, GOST - not widely used
    • SHA-1 - 160 bits. Deprecated. Still present but not recommended for use with SSL/TLS
    • SHA-2 and SHA-3 are the recommended ones
      • SHA-2 is a suite of hashing algorithms. The suite includes SHA-256, SHA-384 and SHA-512, which produce longer digests than SHA-1 and are more secure. These are the recommended hashes currently.
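
The numbers above can be reproduced with a few lines of Python. The variable names are made up for the example, and the list of algorithms is just a selection of those mentioned in this section that hashlib is guaranteed to provide.

```python
import hashlib

md5_space = 2 ** 128    # number of possible 128-bit digests
sha1_space = 2 ** 160   # number of possible 160-bit digests

print(f"{md5_space:.1e}")        # 3.4e+38
print(sha1_space // md5_space)   # 4294967296, i.e. 2**32, roughly 4 billion

# Digest lengths reported by hashlib for some of the algorithms listed above.
digest_bits = {
    name: hashlib.new(name).digest_size * 8
    for name in ("md5", "sha1", "sha256", "sha384", "sha512", "sha3_256")
}
print(digest_bits)
```

Every extra bit of output doubles the number of possible hashes, which is why moving from 128 to 160 bits multiplies the space by 2^32.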