From the Archive: Ease Me Into Cryptography, Part 1: Buzzwords and Hash Functions

Here's a post from the archives, a series I originally wrote in 2017 that was published on Ethical Hacker Network. Even back then I was passionate about making sure technical concepts weren't alienating to anyone. This goal has proven to still be critical to the technical leaders I get to see succeeding every day. Please enjoy.

Note: these examples were written using a previous version of Python and may require adjustments.

---

You know what it’s like being in security and someone asks you what you do. Now imagine the responses when I tell people I do cryptography. And it’s not just outsiders. Even within a techie crowd, common responses range from “Ooof, that sounds complicated” to “I wouldn’t touch that with a ten-foot stick”. I usually laugh and assure people that, although it can be complex, the complexity is surmountable. Even my reassuring comments are met with disbelief and the persistence of a feeling of intimidation by the topic of cryptography. I would love nothing more than for my words to be met with intrigue rather than hesitation. So I’m here to prove to you that crypto is tackle-able, and you can be the one to tackle it.

Cryptography is no longer a convenient addition. It is becoming more and more of a necessity for security and privacy. Organizations and consumers are demanding it. So, if you must learn it eventually, why not start now and why not learn the easy way. I fully admit that cryptography sounds intimidating, especially when it comes to adding it into your code. However, I firmly believe that the intimidation is solely because it is in an unfamiliar context. If the concepts can be broken down into bite-sized pieces, then our brains can more easily consume the crypto elephant. “Ease Me Into Cryptography”, a series of introductory articles for InfoSec professionals, will do just that.

Taking the first bite

One of my favorite subreddits is called “Explain Like I’m Five” (ELI5). The subreddit isn’t intended to be condescending but rather offers a place for people to say, “I don’t get this, but I want to. Can you break it down so I can make sense of it?” I think this strategy works because when things are broken down simply, we naturally draw parallels to concepts we already know. That helps us to soak in the new knowledge rather than just letting the information go in one ear and out the other. It gives us the ability to actually process it. So, if cryptography is something that seems unapproachable when left as a complex, out-of-context blob, let’s be inspired by ELI5 and break it down into digestible chunks!

The first thing we need to get out of the way is understanding some of the words that come up a lot in crypto. First is the word “cryptography” itself. It comes from a combination of the Ancient Greek words kryptós for "hidden or secret" and graphein which is "to write". Makes sense, since most of us understand that the basic idea is to be able to send a private message. But the key (pun somewhat intended) is that you want your message to be able to be read, albeit only by the intended recipient. Therefore, in most cases cryptography is a 2-way street in hiding AND un-hiding a communicated message.

As with anything complex, there is jargon that makes the topic seem inaccessible. The most important terms to grasp as we ease in are:

  • Plaintext is the content to be communicated and is presented in an understandable form.
  • Ciphertext is the content after it has been hidden (encrypted) and is no longer understandable in this form.
  • Encryption is the first direction in our 2-way street that turns the understandable blob (plaintext) into gibberish (ciphertext).
  • Decryption is the opposite of encryption, taking the gibberish (ciphertext) and turning it back into understandable content (plaintext).
  • A Cryptographic Algorithm is a fancy term for the gibberish-making instructions. It’s just what we call the rule-set that is used to take data from one state to another and (most of the time) back. For example, AES is a cryptographic algorithm that defines a way to encrypt and decrypt information.

There are more specific terms that will come up, but no need to be overloaded right now. We will get there!

A great place to jump in to understanding cryptographic fundamentals is to start with hash functions. What is a hash function? How is it useful? Are there weaknesses? Let’s take a look.

What is a hash function?

A hash function is a non-reversible process for taking any length of input and spitting out a fixed-length value. In other words, it will scramble anything you give it in such a way that you cannot easily retrieve the un-scrambled version. The output is called a digest, but you’ll commonly see it referred to simply as the hash.

Note: If it helps you to think about a hash function as a one-way cryptographic algorithm where we can encrypt but not decrypt, you can do that! Just keep in mind, we don’t usually refer to hashing this way, because we can’t get the plaintext back algorithmically.

Here comes the code.

Now don’t get scared. The whole point of this series is to not only learn crypto but also prove you get it by doing it yourself! There’s no better way than getting your hands dirty. We’ll still ease our way in, so don’t panic. We’ll go line by line, so trust me and just dive in.

The hash function I see used the most is SHA256, a version of the Secure Hash Algorithm that gives a 256-bit (32-byte) digest. Remember that it will give an output that is 32 bytes no matter the length of the input we give it. To see an example of this in Python, all you need is the Python Cryptography Toolkit (pycrypto) or another similar Python cryptography library that provides the same Crypto.Hash module, such as the maintained fork pycryptodome. Here is an example script:

from Crypto.Hash import SHA256

# initialize a new SHA256 hash object
# (named hasher so it doesn't shadow Python's built-in hash function)
hasher = SHA256.new()

# tell the hash object what string I want it to hash
# (in Python 3 the hash object takes bytes, so encode the string first)
string_to_hash = "Ellie"
hasher.update(string_to_hash.encode("utf-8"))

# hash it!
digest = hasher.hexdigest()

# make the output pretty
print("The SHA256 digest of " + string_to_hash + " is: " + digest)
print(str(hasher.digest_size) + " bytes long (" + str(len(digest)) + " characters)")

From the top, this script imports SHA256 from the crypto library, then instantiates a new SHA256 hash object. It then tells the object that the input is going to be my name, “Ellie”, and performs the hashing. Remember that the output of a hash is called a digest, so here we are telling it to give a hexadecimal digest when it hashes the input. Then finally we print out the digest and the digest length.

I highly encourage you to try this out on your own. It’s also recommended that you type everything above instead of copy and paste. The more you type the code, the more familiar it becomes to you.

When you run it as is, you’ll probably catch something that doesn’t seem quite right. When we output the “hexdigest”, it is 64 characters long. However, when we output the “digest_size” property, it shows “32”. Hmm… Why is this? Remember that the hexdigest writes each byte as two hexadecimal characters, i.e. the crypto library sees 0x2a as one byte even though we print it as 2a, which is two characters. So 32 bytes become 64 characters. Hashes are generally shown as this hexadecimal string, the thing that looks like alphanumeric nonsense to us.
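To see the byte-versus-character difference for yourself, here is a minimal sketch. It assumes Python 3 and swaps in the standard library's hashlib module for pycrypto, purely so the snippet runs without installing anything. The variable names are just for illustration.

```python
import hashlib

# hashlib is a stand-in here for the pycrypto import used in the script above
hasher = hashlib.sha256("Ellie".encode("utf-8"))

raw_digest = hasher.digest()      # the digest as raw bytes
hex_digest = hasher.hexdigest()   # the same digest written out in hexadecimal

print(len(raw_digest))   # 32 (bytes)
print(len(hex_digest))   # 64 (characters)

# every byte is written as exactly two hexadecimal characters
first_byte = raw_digest[0]
print(format(first_byte, "02x"), "is the same as", hex_digest[:2])
```

Same digest, two ways of looking at it: 32 bytes to the library, 64 characters to our eyes.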

Now it’s time to play a little on your own. Try changing the input (the “string_to_hash” variable) to your name and running the script again. Change the input as many times as you want and notice that the digests are unique. Not only are they unique, but they don’t really look related. We can’t see any patterns or similarities between the digests as we change the inputs. For example, look at how different the digests look when I input “Ellie” vs when I input “ellie”. The inputs are really similar, but the outputs look nothing alike!
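If you would like to measure "nothing alike" rather than eyeball it, the rough sketch below counts how many of the 256 output bits differ between the two digests. It assumes Python 3 and the standard library's hashlib rather than pycrypto; sha256_hex and differing_bits are helper names made up for this example.

```python
import hashlib

def sha256_hex(text: str) -> str:
    """Return the SHA256 hex digest of a string (encoded as UTF-8)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def differing_bits(hex_a: str, hex_b: str) -> int:
    """Count the bit positions where two hex digests disagree."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")

upper = sha256_hex("Ellie")
lower = sha256_hex("ellie")

print(upper)
print(lower)
print(differing_bits(upper, lower), "of 256 bits differ")
```

With a good hash function you would expect roughly half of the bits to flip, even though only one letter of the input changed.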

On top of producing digests that are, for all practical purposes, unique, there isn’t a way to undo a hash function. We don’t have a function that can take a digest as an input and give us the original text as output. Awesome! So why in the world would anyone want to use a function that scrambles everything but doesn’t give us a way to unscramble?

How is hashing useful?

Why would we want something that only scrambles data? That would mean that we can’t get the original data back… sounds useless. Well, hashes are typically used for verification. You may have seen the term checksum, which is commonly used as a synonym for digest. A checksum is what we call a digest when we are using it for integrity verification, for example when verifying the integrity of digital downloads. You may have gone to a website to download something and seen, next to the file, a note that reads, “the checksum should be…”. If you download the file and compute the checksum (i.e. hash it) and the output matches the checksum provided, you can feel reasonably certain that the data you downloaded is the same data that was intended for you to download. In other words, the data has not been tampered with or corrupted.

To see an example of this, osboxes.org is a resource for downloading virtual machine images to use in VMware or VirtualBox. If you go to https://www.osboxes.org/centos/ you will see a few CentOS images for download. Right under the download it says “SHA256: …” If you download one of these images, you can compute the SHA256 checksum by going to your terminal and using the following command on Mac and Linux terminals (for Windows, try the PowerShell Get-FileHash cmdlet):

$ shasum -a 256 CentOS-7-1804-VB-32bit.7z

If your checksum matches the one posted on the website, then you downloaded the file as osboxes intended and it wasn’t corrupted or tampered with! I downloaded the CentOS 7 32bit virtual image, and my checksum matched the one on the website. Phew!
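The same check can be done from Python. The sketch below is one possible way to do it, assuming Python 3 and the standard library's hashlib; sha256_of_file and checksum_matches are hypothetical helper names, and the demo hashes a small throwaway file rather than a real image so that it runs anywhere.

```python
import hashlib
import os
import tempfile

def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Hash a file piece by piece so large images don't need to fit in memory."""
    hasher = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()

def checksum_matches(path: str, published_checksum: str) -> bool:
    """Compare the computed digest with the one copied from the website."""
    return sha256_of_file(path) == published_checksum.strip().lower()

# demo: a throwaway file stands in for the downloaded image
payload = b"pretend this is a downloaded image"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    demo_path = tmp.name

published = hashlib.sha256(payload).hexdigest()  # stands in for the website's value
good_match = checksum_matches(demo_path, published)
bad_match = checksum_matches(demo_path, "0" * 64)
os.remove(demo_path)

print(good_match)  # True
print(bad_match)   # False
```

For a real download you would pass the path of the file you downloaded and paste in the checksum shown on the site.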

Another common use of hash functions in the real world is in password hashing. Every site you log in to has to store your username and password somewhere, so that it remembers you and can verify “yes that is Ellie’s password, let her log in”. However, with a list of usernames and passwords sitting around somewhere, anyone who got hold of that list could access all of those user accounts. Not good. So, it is better for companies to store the digest of the password instead of the password in plaintext. If we think about this, it makes sense. The company cannot reverse the password digests, but it doesn’t need to. All it needs is to be able to verify that the password being used is correct. If the password is stored as a digest, then when you try to log in to something, the digest of the password you typed is computed and compared with the digest of your correct password that the company has stored. This way, your login information can still be verified but your password itself is not stored anywhere!
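Here is a deliberately simplified sketch of that store-the-digest, compare-the-digest flow. It assumes Python 3 and the standard library; the stored_digests dictionary and the register and login functions are made up for illustration and are not how any particular site does it. Real systems usually also add a per-user salt and a deliberately slow password hashing function, which this sketch leaves out.

```python
import hashlib
import hmac

# hypothetical "user table": username -> digest of the password (never the password)
stored_digests = {}

def password_digest(password: str) -> str:
    """Plain SHA256 of the password, as described in the text (simplified)."""
    return hashlib.sha256(password.encode("utf-8")).hexdigest()

def register(username: str, password: str) -> None:
    """Store only the digest of the password."""
    stored_digests[username] = password_digest(password)

def login(username: str, password: str) -> bool:
    """Hash what was typed and compare it with the stored digest."""
    stored = stored_digests.get(username)
    if stored is None:
        return False
    return hmac.compare_digest(stored, password_digest(password))

register("ellie", "correct horse battery staple")
print(login("ellie", "correct horse battery staple"))  # True
print(login("ellie", "wrong guess"))                   # False
```

Notice that the plaintext password only exists for the moment it is being hashed; the table never holds it.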

So even though you cannot reverse a hash, there are still real-world applications for these cryptographic algorithms.

Are there weaknesses?

Now that we understand what a hash function is and how it can be used, are there any weaknesses? We know that a hash function outputs a fixed size digest no matter what length input we give it. If we think about this, there are way more potential input values than output values if the set of outputs is fixed. In other words, there is a finite number of outputs that are 32 bytes long (2^256 of them for SHA256), whereas, given that there are effectively no length constraints on inputs, the number of inputs is effectively unlimited.

If there are an unlimited number of inputs and a limited number of outputs, there must be multiple inputs that give the same output. For a cryptographic hash function, we call this a collision: when two DIFFERENT inputs generate the SAME output. In a good hash function, collisions are extremely hard to find. In fact, this is an important property of a hash function, and we call it collision resistance.
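To watch a collision actually happen, we can shrink the output space on purpose. The sketch below assumes Python 3 and the standard library's hashlib, and it keeps only the first 2 bytes of each SHA256 digest. That truncation is an assumption made purely for illustration (tiny_hash and find_collision are made-up names) and says nothing about the strength of the full 32-byte SHA256.

```python
import hashlib

def tiny_hash(data: bytes, size: int = 2) -> bytes:
    """A toy hash: SHA256 cut down to its first `size` bytes."""
    return hashlib.sha256(data).digest()[:size]

def find_collision(size: int = 2):
    """Hash "0", "1", "2", ... until two different inputs share a toy digest."""
    seen = {}  # toy digest -> the input that produced it
    counter = 0
    while True:
        message = str(counter).encode("utf-8")
        digest = tiny_hash(message, size)
        if digest in seen:
            return seen[digest], message, digest
        seen[digest] = message
        counter += 1

first, second, shared = find_collision()
print(first, "and", second, "both give the toy digest", shared.hex())
```

With only 2 bytes there are just 65,536 possible outputs, so a repeat is guaranteed after at most 65,537 different inputs. With all 32 bytes, the same brute-force search is hopelessly out of reach.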

Suggested Reading: If this intrigues you, read about SHAttered at shattered.io where researchers found a way (not a trivial way, but a way) to manufacture collisions in SHA1. This caused some acceleration in the industry in moving away from SHA1 to something more collision resistant like SHA256 or SHA512.

Why is a collision such a bad thing? We use hash digests primarily for integrity verification. With this use case, we are looking for assurance that the input was exactly the correct data and absolutely nothing else. If there are two different inputs that produce the same output, then we cannot say for certain which input was given. Again, when we use good hash functions, collisions are highly, highly unlikely. Therefore, if the probability of a collision is low enough, we can feel confident in saying that the input was almost certainly (however not 100%) the correct one.

Nothing to it

Cryptography can sound intimidating and tedious. I get it! And if I hadn’t had some patient friends and teachers along the way, I wouldn’t be working with it and would probably still feel the same way. But as ELI5 reminds us, the intimidation factor starts to diminish if we can attack a concept in approachable chunks. Hash functions are an integral, fundamental concept in cryptography. And we’ve just tackled them! Not only can we answer what a hash function is, but we can also point to real-world uses and even demonstrate how to calculate a digest in Python. If someone were to ask if hash functions had any risks or concerns, we could explain that, too! That wasn’t so bad, was it?

Next time we’ll step up our game a little and continue doing so over the next several articles. As we continue, we will build on prior knowledge to tackle cryptographic fundamentals including symmetric and asymmetric ciphers (the 2-way streets), signatures, and protocols. When we’re done, you’ll be the one explaining to all of your friends, family and colleagues how to ease me into cryptography.