Consider the following situation. You have been working for days on a
PowerPoint presentation for work or school, and have been keeping the
file on a shared computer, a network drive or even a personal flash
drive. You put the final touches on your presentation the night before
it’s due, save the file and get ready for a good night's sleep. The next
day, you confidently begin your presentation. But imagine your surprise
when you and your audience see the following image on your third slide:
You’ve been pranked. If you're lucky, everyone got a good laugh out of it. If
not, there may be more serious consequences, depending on the situation.
This sort of everyday scenario raises an obvious question. Short of
opening the file and manually perusing each slide in the presentation,
how could you be sure that it had not been modified by any of the
pranksters you may share your computer or network with? More seriously,
how can we verify the integrity of a file that may or may not have been
modified by a malicious individual seeking to infect our computer or
network with a dangerous piece of malware?
In this article, we’ll consider these questions and discuss the pros and cons of
one simple means by which we can verify a file’s integrity to ensure
that it has not been tampered with, namely, by verifying its hash value.
We’ll conclude with a quick tutorial on how to verify a file’s hash
value on Mac, Linux and Windows systems, and provide some links to a few
lectures on cryptographic hash functions culled from the series of
courses listed in our collection of
free online computer science courses. Our primary sources along the way will be
Everyday Cryptography by Keith M. Martin, and
Applied Cryptography by Bruce Schneier.
Malware comes in many different guises. As the Electronic Frontier Foundation writes in their
Surveillance Self-Defense Project,
malware is frequently spread by "trick[ing] the computer user into
running a software program that does something the user wouldn't have
wanted." Let's say you decide to download a file from a website you know
and trust, and from which you have safely downloaded files in the past.
How do you know, for example, that the file you have downloaded onto
your computer is in fact the one intended by the trusted website? How do
you know it was not altered in transit? How do you know it was not
swapped for another file by a malicious attacker? And how can you
determine this without running the file first?
One simple way to verify a file's integrity is by confirming its hash value. In
Everyday Cryptography,
Martin writes: “Hash functions can be used to provide checks against
accidental changes to data and, in certain cases, deliberate
manipulation of data . . . As such they are sometimes referred to as
modification detection codes or
manipulation detection codes”
(emphasis in original, Martin, p. 188). In our opening example, a
suitable hash function would have allowed you to detect that your
presentation had been modified in some way without ever opening it.
So, what is a hash function? The primary practical property of a hash
function is that it compresses arbitrarily long inputs into a
fixed-length output (Martin, p. 189; Schneier, section 2.4). Furthermore,
slight differences in the input data result in large differences in the
output data. “A single bit change in the pre-image [i.e. the file you’re
hashing] changes, on the average, half of the bits in the hash value,”
(Schneier, section 2.4). Two of the most commonly used cryptographic
hash functions are known as
MD5 and
SHA1. Schneier quotes NIST’s description of the SHA hash function as found in the
Federal Register:
The SHA is called secure because it is designed to be computationally
infeasible to recover a message corresponding to a given message digest,
or to find two different messages which produce the same message
digest. Any change to a message in transit will, with a very high
probability, result in a different message digest. (Schneier, section
18.7.)
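To make these two properties concrete, here is a minimal Python sketch using the standard hashlib module. The helper names sha1_bits and differing_bits are my own, invented for this illustration. It checks that the SHA1 digest has the same length for a short and a very long input, and counts how many digest bits change when a single bit of the input is flipped. Per Schneier's remark we would expect roughly half of the 160 bits to change, though the exact count depends on the input.

```python
import hashlib


def sha1_bits(data: bytes) -> str:
    """Return the SHA-1 digest of data as a 160-character string of 0s and 1s."""
    digest = hashlib.sha1(data).digest()
    return "".join(f"{byte:08b}" for byte in digest)


def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions at which the SHA-1 digests of a and b differ."""
    return sum(x != y for x, y in zip(sha1_bits(a), sha1_bits(b)))


short_input = b"Hello there."
long_input = b"Hello there." * 100_000

# The digest length does not depend on the size of the input.
print(len(hashlib.sha1(short_input).hexdigest()), "hex characters")
print(len(hashlib.sha1(long_input).hexdigest()), "hex characters")

# Flip the lowest bit of the first byte: 'H' (0x48) becomes 'I' (0x49).
flipped = bytes([short_input[0] ^ 0x01]) + short_input[1:]
print(differing_bits(short_input, flipped), "of 160 digest bits changed")
```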
Here’s a simple example. I have created a
plain text file named hello.txt on my Desktop. The file contains a
single line that reads: “Hello there.” Applying the well-known sha1 hash
function to the file produces the following hash value:
4177876fcf6806ef65c4c1a1abf464087bfbf337.
If I edit the file and remove the period from the end of the line so that
it reads “Hello there”, the hash function now returns an entirely
different value: 33ab5639bfd8e7b95eb1d8d0b87781d4ffea4d5d.
If I then return the file to its original state by adding the period back
in to the end of the sentence, the hash value of the newly edited file
will be the same as the original hash. And we would have seen much the
same result (though it would have taken a good bit longer to compute!)
if my original file had been a copy of the complete works of Shakespeare
from which I then removed a period.
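The same experiment can be reproduced in a few lines of Python. This is only a sketch: the helper sha1_of_file is a name I made up, and I am assuming the file ends with a newline character, which may or may not be true of the file used above. Because every byte counts, the digests printed here will not necessarily match the values quoted above. What matters is the pattern: the edited file hashes differently, and the restored file hashes the same as the original.

```python
import hashlib
import tempfile
from pathlib import Path


def sha1_of_file(path: Path) -> str:
    """Return the SHA-1 hex digest of a small file, read into memory at once."""
    return hashlib.sha1(path.read_bytes()).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    hello = Path(tmp) / "hello.txt"

    hello.write_text("Hello there.\n")   # original contents (trailing newline assumed)
    original = sha1_of_file(hello)

    hello.write_text("Hello there\n")    # remove the period
    edited = sha1_of_file(hello)

    hello.write_text("Hello there.\n")   # put the period back
    restored = sha1_of_file(hello)

print("original:", original)
print("edited:  ", edited)
print("restored:", restored)
print("edit detected:", original != edited)
print("restored file matches original:", original == restored)
```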
Let’s consider a more practical example. The Electronic Frontier Foundation
provides a number of recommendations on how to reduce your risk of
malware infection in its
Surveillance Self-Defense Project.
At the top of their list, we read: “Currently, running a minority
operating system [their examples are Linux and MacOS -ed.]
significantly diminishes the risk of infection because fewer malware
applications have been targeted at these platforms. (The overwhelming
majority of existing malware targets only a single particular operating
system.)” This is more
security through obscurity
than anything else, but it’s still fun to try out new things, so after a
bit of reading you decide to download a copy of the latest version of
Ubuntu from an online repository.
How can you check to
make sure that the file you’ve downloaded is the official one intended
by Ubuntu’s developers and has not been manipulated or corrupted in
transit? One way is to confirm that the file’s hash value matches
the one published by the developers. So you go to the page that lists
the download’s hash value and make a note of it. Next, you run the same
hash function on the file you downloaded. If the resulting value is
identical to the published one, you have successfully verified the
file’s hash.
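The check just described might be sketched in Python as follows. The function names hash_file and matches_published_hash are hypothetical, and the "download" here is a small stand-in file created on the spot rather than a real disk image. The file is read in chunks so that a multi-gigabyte download does not have to fit in memory, and the algorithm is passed in by name so that you can use whichever hash function the publisher lists.

```python
import hashlib
import tempfile
from pathlib import Path


def hash_file(path, algorithm, chunk_size=1024 * 1024):
    """Hash a file in fixed-size chunks and return the hex digest."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            hasher.update(chunk)
    return hasher.hexdigest()


def matches_published_hash(path, published_hex, algorithm):
    """Return True if the file's digest equals the published hex string."""
    return hash_file(path, algorithm) == published_hex.strip().lower()


# Stand-in for a real download and for the hash listed by the publisher.
contents = b"pretend this is a large disk image"
with tempfile.TemporaryDirectory() as tmp:
    download = Path(tmp) / "download.iso"
    download.write_bytes(contents)

    published = hashlib.sha1(contents).hexdigest()
    ok = matches_published_hash(download, published, "sha1")
    mismatch = matches_published_hash(download, "0" * 40, "sha1")

print("matches published hash:", ok)
print("matches a wrong hash:  ", mismatch)
```

Stripping whitespace and lowercasing the published value is a convenience for values pasted from a web page. Anything short of an exact match after that should be treated as a failed check.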
However, it is critical to note here that
verifying a file’s hash value by itself can only establish a relatively
weak form of data integrity, in comparison with more robust mechanisms
such as digital signature schemes, which can provide a stronger form of
integrity verification and even authentication (Martin, pp. 186-189).
This is because a hash
value such as we are discussing here cannot tell us anything about the
origin of a digital file. For example, assume that unbeknownst to you,
the site you’ve downloaded your file from has itself been compromised,
and the attacker has: 1) replaced the download file with a piece of
malware, and 2) also replaced the corresponding hash value that you use
to check the file’s integrity with the hash value of the malware.
If you then verify the hash value of your downloaded file, you have done
nothing more than verify the integrity of the malware! And you’re none
the wiser because the site itself was compromised! At the same time,
however, if you found out through another source that the site and file
were compromised, you could then identify the malicious file and
distinguish it from the legitimate source file. In a digital signature
scheme, as mentioned above, the developer could digitally sign the
legitimate hash value with a trusted key. In this way, the question of
trust is then displaced to the question of signature authentication.
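The compromised-site scenario can be simulated in a few lines of Python. Everything in this sketch is invented for illustration: the byte strings standing in for the files, the dictionary standing in for the website, and the helper passes_check. It shows that the check succeeds when the attacker has replaced both the file and the listed hash, and fails only when the download is compared against a hash obtained from an independent source.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Return the SHA-1 hex digest of data."""
    return hashlib.sha1(data).hexdigest()


def passes_check(download: bytes, listed_hash: str) -> bool:
    """Return True if the download's hash equals the hash we were given."""
    return sha1_hex(download) == listed_hash


legitimate_file = b"the real installer"
malware = b"something else entirely"

# The attacker has replaced BOTH the file and the hash listed next to it.
compromised_site = {"file": malware, "hash": sha1_hex(malware)}

# A hash for the legitimate file, learned through some other trusted channel.
independent_hash = sha1_hex(legitimate_file)

fooled = passes_check(compromised_site["file"], compromised_site["hash"])
caught = not passes_check(compromised_site["file"], independent_hash)

print("check against the site's own hash passes:", fooled)
print("check against an independent hash fails: ", caught)
```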
A second concern regarding this method of determining data integrity is
the security of the hash functions themselves. There are known practical
and theoretical vulnerabilities in two hash functions that are among
the most common in use for these exact purposes on the web today: MD5
and SHA1. A discussion of these vulnerabilities is beyond the scope of
the present article, but
more information can be easily found online.
Still, as Bruce Schneier states, “we cannot use [one-way hash functions] to
determine with certainty that the two strings are equal, but we can use
them to get a reasonable assurance of accuracy” (Schneier, section
2.4). In other words, hash functions can help us establish a basic level
of data integrity. In our opening example, simply making a note of the
hash and then checking it the next day would have sufficed to establish
that the file had been tampered with. But, of course, if the file had
been secured or encrypted to begin with, it never would have even been
an issue in the first place.
Finally, how does one actually compute the hash value of a file? It is rather simple,
but the specifics depend on your choice of operating system. MacOS and
Linux systems come bundled with basic functionality to check any file’s
hash value, while older Microsoft Windows systems require you to download a
piece of software to accomplish the task. As noted above, two of the most common
functions used to verify file hashes are MD5 and SHA1. We’ll
consider each operating system in turn.
MacOS
1) Open up a command line Terminal.
2) Type “openssl md5 </path/to/file>” into the terminal and press enter.
2A) As an alternative to #2, you can also type “openssl md5 ” into the
terminal, then drag and drop the target file into the Terminal window,
and press enter.
3) The terminal will then return the MD5 hash value of the given file.
To compute the hash value of the file using a different hash function,
type the name of that function into the terminal command in place of
“md5”. For example, to compute the sha1 hash of a file, you would type:
“openssl sha1 ” followed by the file path. To see a list of all the
message digest commands available on your machine, type “openssl help”
into the command line terminal.
Linux (Debian-based)
1) Open up a command line Terminal.
2) Type: “md5sum </path/to/file>”. Then press enter.
3) The terminal will return the MD5 hash value of the given file.
To compute the hash value of the file using a different hash function,
type the appropriate command into the terminal in front of the path to
the target file. For example, “sha1sum </path/to/file>” will
compute the file’s sha1 hash value. To see what other hash functions are
available on your system, type “man dgst” into the terminal.
Windows
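Older Windows systems apparently do not come bundled with a built-in utility to check
hash values, and there are a number of different pieces of
software you can download to accomplish the task. More recent versions of
Windows do include command line options: “certutil -hashfile </path/to/file> SHA1”
at the command prompt, or “Get-FileHash </path/to/file> -Algorithm SHA1” in PowerShell.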
Microsoft Support
lists the File Checksum Integrity Verifier, but warns that this is not
supported by Microsoft and is only of use on Windows 2000, Windows XP
and Windows Server 2003.
This discussion at superuser provides a number of different extant options.
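If Python 3 happens to be installed, its standard hashlib module offers a route that works the same way on Mac, Linux and Windows. The following is a minimal sketch, not a polished tool: the function name file_digests is my own, and the demo hashes a small temporary file rather than a real download. It computes several digests in a single pass over the file, which is convenient when a site lists both an MD5 and a SHA1 value.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digests(path, algorithms=("md5", "sha1"), chunk_size=65536):
    """Read the file once and return {algorithm name: hex digest}."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


# Demo on a small temporary file standing in for the file to be checked.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "hello.txt"
    target.write_bytes(b"Hello there.\n")
    results = file_digests(target)

for name, digest in results.items():
    print(f"{name}: {digest}")

# The algorithms that every Python installation is guaranteed to provide.
print(sorted(hashlib.algorithms_guaranteed))
```

To check a real file, replace the temporary file with the path to your download and compare the printed values against the published ones.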
Video Lectures on Hash Functions
As always, comments, questions, suggestions and angry tirades are welcome below.