A hash value is a result of a calculation (hash algorithm) that can be performed on a string of text, electronic file or entire hard drives contents. The result is also referred to as a checksum, hash code or hashes. Hash values are used to identify and filter duplicate files (i.e. email, attachments, and loose files) from an ESI collection or verify that a forensic image or clone was captured successfully.
Each hashing algorithm uses a specific number of bytes to store a “ thumbprint” of the contents. The following is a list of hash values for the same text file. Regardless of the amount of data feed into a specific hash algorithm or checksum it will return the same number of characters. For example, an MD5 hash uses 32 characters for the thumbprint whether it’s a single character in a text file or an entire hard drive.
As you can see there are also various length hashes within a family (SHA-1, SHA-256 et.) The most common hash values are MD5, SHA-1 and SHA-256. The longer hash values require more time to calculate and are designed to reduce the probability of a collision.
A few other ways that hash values are used:
– Verify a downloaded file was created by the publisher (oppose to a virus-infected version)
– Identify and filter files on the NSRL/NIST list (“deNISTing”)
– Locate known contraband (illegal images and videos)
Here are a few reasons why hash values are so widely used as a means to validate and compare content:
1) Privileged Data – There would be obvious issues storing and providing multiple copies of the contents of a company’s files or entire hard drives data in a database to perform a byte comparison. Not to mention illegal images and videos (child pornography) would have to be stored and used in each system scan. These scenarios are unacceptable.
2) Speed – Comparing an indexed hash value versus what could be billions or trillions of bytes or source data is much quicker. Optimized hash engines (Pinpoint Harvester) can compare thousands of hash values in a second.
3) Security – Hashing data is a one-way trip. The original data can’t be recreated or reverse engineered from the hash value. This provides additional security that a person can’t determine the source data from the hash.
The argument that data sources could be different and have the same hash value has raised a lot of concern. There are countless threads related to this issue on the litigation support and computer forensic forums. The bottom line is the only way to do an exact comparison of the original data is to store it everywhere you need to deduplicate or verify the information, however, as mentioned about this isn’t a practical alternative.
More complex hashing functions have been introduced (SHA-256, SHA-512, etc.) which will further reduce the likely hood of a collision. It is also worth noting that even in those cases where scientists have created collisions it was a result of exploiting the weaknesses in a specific hash algorithm. The same alterations would not create a collision in a different hashing algorithm.
So, if you still aren’t satisfied with the incredibly remote possibility a collision could happen using a single hash value then the easiest way to implement an extra precaution is to take the time to have your processes calculate hash values from two separate algorithms (i.e. MD5/SHA256) for each item. Unfortunately, most EED applications and forensic imaging tools don’t support this option, especially in a single pass.
What to Remember
Hash values are a reliable, fast, and secure way to compare the contents of individual files and media. Whether it’s a single text file containing a phone number or five terabytes of data on a server, calculating hash values are an invaluable process for Deduplication and evidence verification in electronic discovery and computer forensics.