You’ve been pranked. If you're lucky, everyone got a good laugh out of it. If not, there may be more serious consequences, depending on the situation. This sort of everyday scenario raises an obvious question. Short of opening the file and manually perusing each slide in the presentation, how could you be sure that it had not been modified by any of the pranksters you may share your computer or network with? More seriously, how can we verify the integrity of a file that may or may not have been modified by a malicious individual seeking to infect our computer or network with a dangerous piece of malware?
In this article, we’ll consider these questions and discuss the pros and cons of one simple means by which we can verify a file’s integrity to ensure that it has not been tampered with, namely, by verifying its hash value. We’ll conclude with a quick tutorial on how to verify a file’s hash value on Mac, Linux and Windows systems, and provide some links to a few lectures on cryptographic hash functions culled from the series of courses listed in our collection of free online computer science courses. Our primary sources along the way will be Everyday Cryptography by Keith M. Martin, and Applied Cryptography by Bruce Schneier.
Malware comes in many different guises. As the Electronic Frontier Foundation writes in their Surveillance Self-Defense Project, malware is frequently spread by "trick[ing] the computer user into running a software program that does something the user wouldn't have wanted." Let's say you decide to download a file from a website you know and trust, and from which you have safely downloaded files in the past. How do you know, for example, that the file you have downloaded onto your computer is in fact the one intended by the trusted website? How do you know it was not altered in transit? How do you know it was not swapped for another file by a malicious attacker? And how can you determine this without running the file first?
One simple way to verify a file's integrity is by confirming its hash value. In Everyday Cryptography, Martin writes: “Hash functions can be used to provide checks against accidental changes to data and, in certain cases, deliberate manipulation of data . . . As such they are sometimes referred to as modification detection codes or manipulation detection codes” (emphasis in original, Martin, p. 188). In our opening example, a suitable hash function would have allowed you to detect that your presentation had been modified in some way without ever opening it.
So, what is a hash function? The primary practical property of a hash function is that it compresses arbitrarily long inputs into a fixed length output (Martin, p. 189; Schneier, section 2.4). Furthermore, slight differences in the input data result in large differences in the output data. “A single bit change in the pre-image [i.e. the file you’re hashing] changes, on the average, half of the bits in the hash value,” (Schneier, section 2.4). Two of the most commonly used cryptographic hash functions are known as MD5 and SHA1. Schneier quotes NIST’s description of the SHA hash function as found in the Federal Register:
“The SHA is called secure because it is designed to be computationally infeasible to recover a message corresponding to a given message digest, or to find two different messages which produce the same message digest. Any change to a message in transit will, with a very high probability, result in a different message digest.” (Schneier, section 18.7)
Here’s a simple example. I have created a plain text file named hello.txt on my Desktop. The file contains a single line that reads: “Hello there.” Applying the well-known sha1 hash function to the file produces the following hash value:
4177876fcf6806ef65c4c1a1abf464087bfbf337.
If I edit the file and remove the period from the end of the line so that it reads “Hello there”, the hash function now returns an entirely different value: 33ab5639bfd8e7b95eb1d8d0b87781d4ffea4d5d.
If I then return the file to its original state by adding the period back in to the end of the sentence, the hash value of the newly edited file will be the same as the original hash. And we would have seen much the same result (though it would have taken a good bit longer to compute!) if my original file had been a copy of the complete works of Shakespeare from which I then removed a period.
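For readers who would like to reproduce this experiment without creating a file, here is a minimal sketch in Python using the standard hashlib module. The helper names sha1_hex and differing_bits are our own, chosen for illustration. Note the assumption that the text is hashed as raw UTF-8 bytes with no trailing newline: a file saved from a text editor may end with a newline character, so the values printed here will not necessarily match the ones shown above.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Return the SHA-1 digest of `data` as a 40-character hex string."""
    return hashlib.sha1(data).hexdigest()


def differing_bits(hex_a: str, hex_b: str) -> int:
    """Count the bit positions at which two equal-length hex digests differ."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


# Assumption: the text is encoded as UTF-8 with no trailing newline.
# A file saved by a text editor may end with a newline, which changes the digest.
with_period = sha1_hex(b"Hello there.")
without_period = sha1_hex(b"Hello there")

print(with_period)
print(without_period)
print(differing_bits(with_period, without_period), "of 160 bits differ")
```

Whatever the input length, the digest is always 40 hexadecimal characters (160 bits), and the count printed on the last line gives a rough feel for Schneier's remark that a small change flips about half of the output bits on average.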
Let’s consider a more practical example. The Electronic Frontier Foundation provides a number of recommendations on how to reduce your risk of malware infection in its Surveillance Self-Defense Project. At the top of their list, we read: “Currently, running a minority operating system [their examples are Linux and MacOS -ed.] significantly diminishes the risk of infection because fewer malware applications have been targeted at these platforms. (The overwhelming majority of existing malware targets only a single particular operating system.)” This is more security through obscurity than anything else, but it’s still fun to try out new things, so after a bit of reading you decide to download a copy of the latest version of Ubuntu from an online repository.
How can you check to make sure that the file you’ve downloaded is the official one intended by Ubuntu’s developers and has not been manipulated or corrupted in transit? One way is to confirm that the file’s hash value is equivalent to the one provided by the developers. So you go to the page that lists the download’s hash value and make a note of it. Next, you run the hash function on the file you downloaded. If the resulting value is equivalent to the expected one, you have successfully verified the file’s hash.
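The comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names file_digest_hex and matches_published_hash are hypothetical, the default of SHA-256 is our choice rather than anything the download page is known to use, and the published value is assumed to be a plain hexadecimal string copied from the developers' site.

```python
import hashlib


def file_digest_hex(path: str, algorithm: str = "sha256", chunk_size: int = 65536) -> str:
    """Hash a file in fixed-size chunks so large downloads need not fit in memory."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def matches_published_hash(path: str, published_hex: str, algorithm: str = "sha256") -> bool:
    """Compare the file's digest with the value copied from the download page.

    Surrounding whitespace and letter case are ignored, since hex digests
    are often published in either upper or lower case.
    """
    expected = published_hex.strip().lower()
    return file_digest_hex(path, algorithm) == expected
```

A call such as matches_published_hash("ubuntu.iso", "<hash copied from the site>") would then return True only if the two values agree; the file name here is a placeholder.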
However, it is critical to note here that verifying a file’s hash value by itself can only establish a relatively weak form of data integrity, in comparison with more robust mechanisms such as digital signature schemes which can provide a stronger form of integrity verification and even authentication. (Martin, pp. 186-189.) This is because a hash value such as we are discussing here cannot tell us anything about the origin of a digital file. For example, assume that unbeknownst to you, the site you’ve downloaded your file from has itself been compromised, and the attacker has: 1) replaced the download file with a piece of malware, and 2) also replaced the corresponding hash value that you use to check the file’s integrity with the hash value of the malware.
If you then verify the hash value of your downloaded file, you have done nothing more than verify the integrity of the malware! And you’re none the wiser because the site itself was compromised! At the same time, however, if you found out through another source that the site and file were compromised, you could then identify the malicious file and distinguish it from the legitimate source file. In a digital signature scheme, as mentioned above, the developer could digitally sign the legitimate hash value with a trusted key. In this way, the question of trust is then displaced to the question of signature authentication.
A second concern regarding this method of determining data integrity is the security of the hash functions themselves. There are known practical and theoretical vulnerabilities in two hash functions that are among the most common in use for these exact purposes on the web today: MD5 and SHA1. A discussion of these vulnerabilities is beyond the scope of the present article, but more information can be easily found online.
Still, as Bruce Schneier states, “we cannot use [one-way hash functions] to determine with certainty that the two strings are equal, but we can use them to get a reasonable assurance of accuracy.” (Schneier, section 2.4). In other words, hash functions can help us establish a basic level of data integrity. In our opening example, simply making a note of the hash and then checking it the next day would have sufficed to establish that the file had been tampered with. But, of course, if the file had been secured or encrypted to begin with, it never would have even been an issue in the first place.
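The "make a note of the hash and check it the next day" routine from our opening example might look something like the following Python sketch. The names record_baseline and changed_files, and the use of a JSON file as the notebook, are our own assumptions for illustration. The note itself must be kept somewhere the pranksters cannot reach, for the same reason that a hash published on a compromised site proves nothing.

```python
import hashlib
import json
from pathlib import Path


def sha1_of_file(path) -> str:
    """Return the SHA-1 hex digest of a file (read in one go; fine for small files)."""
    return hashlib.sha1(Path(path).read_bytes()).hexdigest()


def record_baseline(paths, manifest_path) -> None:
    """Write the current hash of every file in `paths` to a JSON manifest."""
    manifest = {str(p): sha1_of_file(p) for p in paths}
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))


def changed_files(manifest_path) -> list:
    """Return the recorded paths whose hash differs now, or which have gone missing."""
    manifest = json.loads(Path(manifest_path).read_text())
    changed = []
    for path, recorded in manifest.items():
        try:
            current = sha1_of_file(path)
        except FileNotFoundError:
            current = None  # a deleted file also counts as changed
        if current != recorded:
            changed.append(path)
    return changed
```

Running record_baseline before leaving for the day and changed_files the next morning would flag the modified presentation without ever opening it.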
Finally, how does one compute the hash value of a file? It is rather simple, but the specifics depend on your choice of operating system. MacOS and Linux systems come bundled with basic functionality to check any file’s hash value, while the options on Microsoft Windows systems depend on the version you are running. Two of the most common functions used to verify file hashes are known as MD5 and SHA1. We’ll consider each operating system in turn.
MacOS
1) Open up a command line Terminal.
2) Type “openssl md5 </path/to/file>” into the terminal and press enter.
2A) As an alternative to #2, you can also type “openssl md5 ” into the terminal, then drag and drop the target file into the Terminal window, and press enter.
3) The terminal will then return the MD5 hash value of the given file.
To compute the hash value of the file using a different hash function, type the name of that function into the terminal command in place of “md5”. For example, to compute the sha1 hash of a file, you would type: “openssl sha1 ” followed by the file path. To see a list of all the message digest commands available on your machine, type “openssl --help” into the command line terminal.
Linux (Debian-based)
1) Open up a command line Terminal.
2) Type: “md5sum </path/to/file>”. Then press enter.
3) The terminal will return the MD5 hash value of the given file.
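To compute the hash value of the file using a different hash function, type the appropriate command into the terminal in front of the path to the target file. For example, “sha1sum </path/to/file>” will compute the file’s sha1 hash value. Related commands such as “sha256sum” and “sha512sum” follow the same pattern. If OpenSSL is installed, typing “man dgst” into the terminal (or “man openssl-dgst” on some newer systems) brings up the manual for its digest command, which supports further hash functions.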
Windows
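Older Windows systems apparently do not come bundled with a dedicated utility to check hash values. For these, Microsoft Support lists the File Checksum Integrity Verifier, but warns that this is not supported by Microsoft and is only of use on Windows 2000, Windows XP and Windows Server 2003. More recent versions of Windows include two built-in options: typing “certutil -hashfile <path\to\file> MD5” into the Command Prompt, or “Get-FileHash <path\to\file> -Algorithm MD5” into PowerShell (version 4.0 and later). In either case, replace “MD5” with “SHA1” to compute the sha1 hash instead. There are also a number of different pieces of software you can download to accomplish the task, and this discussion at superuser provides a number of different extant options.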
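If you would rather not remember a different command for each operating system, a small Python script behaves the same way on all three. The following is a hedged sketch, not a replacement for the tools above: the function names hash_file and available_algorithms are hypothetical, and it assumes a Python 3 interpreter is installed. The variable-length "shake" functions are left out of the listing because they require an explicit output length.

```python
import hashlib


def available_algorithms() -> list:
    """List the fixed-length hash functions that hashlib guarantees on every platform."""
    return sorted(
        name for name in hashlib.algorithms_guaranteed
        if not name.startswith("shake_")
    )


def hash_file(path: str, algorithm: str = "md5") -> str:
    """Return the hex digest of the file at `path` under the named hash function."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as f:
        while True:
            block = f.read(1 << 20)  # read 1 MiB at a time
            if not block:
                break
            hasher.update(block)
    return hasher.hexdigest()
```

For example, hash_file("/path/to/file", "sha1") plays the role of “openssl sha1” or “sha1sum”, and available_algorithms() plays the role of asking the terminal which digest commands exist.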
Video Lectures on Hash Functions
- Hashing with Chaining (MIT)
- Table Doubling, Karp-Rabin (MIT)
- Open Addressing, Cryptographic Hashing (MIT)
- Hash Functions (Ruhr University, Bochum)
- SHA1 Hash Function (Ruhr University, Bochum)