Hashing to deduplicate an investigation or review corpus is a necessary, standard practice in today’s ediscovery market, which is primarily concerned with reducing costs and maintaining proportionality. Do you and your ediscovery provider understand all the implications of hashing emails with different tools? At least two of the most popular processing tools treat BCC as an opt-in option rather than a standard setting, so unless it is switched on, the BCC field is left out of the hash. But why does this matter?
Deduplication reduces the review corpus significantly, but consider a review or investigation where communication between parties is a key issue. Emails that contain blind-copy recipients are important in such investigations: reviewers are interested not just in the content but also in who received each email. When the MD5 hash calculation used for deduplication does not include BCC, copies of an email that differ only in their BCC field are treated as duplicates, and under-promotion or mistakes in responsiveness coding could occur.
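The effect of the BCC setting on the hash can be sketched in a few lines of Python. This is only an illustration, not how any particular processing tool works: the field list, the normalisation, the separator, and the names `email_hash` and `include_bcc` are all assumptions made for the example.

```python
import hashlib

def email_hash(email, include_bcc=False):
    """Return an MD5 hex digest over selected email fields (hypothetical field set)."""
    fields = ["from", "to", "cc", "subject", "body"]
    if include_bcc:
        fields.append("bcc")
    # Trim each value and join with a separator that is unlikely to occur in the data.
    parts = [f"{name}={email.get(name, '').strip()}" for name in fields]
    return hashlib.md5("\x1f".join(parts).encode("utf-8")).hexdigest()

# The sender's copy carries the BCC recipient; a recipient's copy does not.
sender_copy = {
    "from": "a@example.com",
    "to": "b@example.com",
    "cc": "",
    "bcc": "c@example.com",
    "subject": "Q3 figures",
    "body": "See attached.",
}
recipient_copy = dict(sender_copy, bcc="")

# BCC excluded: both copies hash identically and count as duplicates.
print(email_hash(sender_copy) == email_hash(recipient_copy))  # True

# BCC included: the copies hash differently and both remain unique.
print(email_hash(sender_copy, include_bcc=True)
      == email_hash(recipient_copy, include_bcc=True))  # False
```

With BCC left out, nothing in the hash distinguishes the two copies, so whichever one is suppressed takes its recipient information out of the review set.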
To illustrate this point, I have outlined a case study below, along with three ways the processing could be set up.
A criminal inquiry with civil litigation risks has been under way for six months. Outside counsel has spent several months reviewing a mountain of communication data collected from multiple sites and countries, with four priority custodians identified at the outset of the review. Searches have been run by date range and keyword, focusing on communications with, by, or between any of the four custodians.
In the first approach, priority custodians are ingested in alphabetical order, and MD5 hash values do not include BCC in the calculation.
Results: All items receive the same MD5 hash value. The copy from priority custodian 1, with BCC in the email header, survives as unique, and the duplicate emails with no BCC in the email header are suppressed from review.
Potential risk at review: HIGH. The risk is mitigated only when the surviving copy happens to belong to the custodian whose email header contains the blind-copy recipient, a factor that cannot be determined without close review.
In the second approach, priority custodians are ingested in a pre-determined order, and MD5 hash values do not include BCC in the calculation.
Results: All items receive the same MD5 hash value. The copy from priority custodian 1, without BCC in the email header, survives as unique, and the duplicate emails with BCC in the email header are suppressed from review.
Potential risk at review: HIGH. Without close attention to all custodian values, a reviewer may miss the fact that another priority custodian is on this email thread and aware of potential legal wrongdoing.
In the third approach, priority custodians are ingested in a pre-determined order, and MD5 hash values include BCC in the calculation.
Results: Only items with identical header values, BCC included, receive the same MD5 hash value. The copy from priority custodian 1, with BCC in the email header, survives as unique, and the copy from priority custodian 2, without BCC in the email header, also survives as unique.
Potential risk at review: LOW. Both copies of the email are promoted for review; the content is duplicative, but the header values differ.
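The three outcomes above can be reproduced with a small simulation. It is a minimal sketch under stated assumptions: the hashed field set, the first-copy-wins rule, and the names `email_hash`, `deduplicate`, "Custodian A" and "Custodian B" are hypothetical, and list order stands in for ingestion order.

```python
import hashlib

def email_hash(email, include_bcc):
    """MD5 over an assumed field set; BCC is added only when requested."""
    fields = ["from", "to", "cc", "subject", "body"]
    if include_bcc:
        fields.append("bcc")
    joined = "\x1f".join(f"{name}={email[name].strip()}" for name in fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

def deduplicate(copies, include_bcc):
    """Keep the first copy seen for each hash; later copies are suppressed."""
    seen = set()
    survivors = []
    for copy in copies:  # list order stands in for ingestion order
        digest = email_hash(copy, include_bcc)
        if digest not in seen:
            seen.add(digest)
            survivors.append(copy["custodian"])
    return survivors

base = {
    "from": "a@example.com",
    "to": "b@example.com",
    "cc": "",
    "subject": "Q3 figures",
    "body": "See attached.",
}
with_bcc = dict(base, custodian="Custodian A", bcc="d@example.com")
without_bcc = dict(base, custodian="Custodian B", bcc="")

# BCC excluded: ingestion order alone decides which copy survives.
print(deduplicate([with_bcc, without_bcc], include_bcc=False))  # ['Custodian A']
print(deduplicate([without_bcc, with_bcc], include_bcc=False))  # ['Custodian B']

# BCC included: both copies survive regardless of order.
print(deduplicate([without_bcc, with_bcc], include_bcc=True))   # ['Custodian B', 'Custodian A']
```

The first two calls differ only in ingestion order, which is why the surviving header values cannot be predicted from the hash settings alone when BCC is excluded.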
Although deduplication is an accepted standard, you’ve now seen that it can cause headaches during ediscovery reviews if not used correctly. It is best to discuss and agree on the deduplication settings with your vendor based on the requirements of the review. On a matter where blind-copied recipients are not material to the investigation, it may be more prudent to leave BCC out of the hash calculation. On a case where they are material, as illustrated above, BCC will need to be included in the MD5 hash calculation, and reviewers will need to pay closer attention to recipients.
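One way to inform that conversation is to check how often the setting would actually change the outcome for a given data set. The sketch below groups emails by a hash that excludes BCC and reports the groups whose members disagree on BCC. The field set and the names `hash_without_bcc` and `groups_with_differing_bcc` are assumptions for illustration, not features of any processing tool.

```python
import hashlib
from collections import defaultdict

def hash_without_bcc(email):
    """MD5 over an assumed field set that leaves BCC out."""
    fields = ["from", "to", "cc", "subject", "body"]
    joined = "\x1f".join(f"{name}={email[name].strip()}" for name in fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

def groups_with_differing_bcc(emails):
    """Return hashes of duplicate groups whose copies carry different BCC values."""
    bcc_values = defaultdict(set)
    for email in emails:
        bcc_values[hash_without_bcc(email)].add(email["bcc"].strip().lower())
    return [digest for digest, values in bcc_values.items() if len(values) > 1]

first = {"from": "a@example.com", "to": "b@example.com", "cc": "",
         "subject": "Q3 figures", "body": "See attached.", "bcc": "d@example.com"}
second = dict(first, bcc="")                      # same email, BCC missing
third = dict(first, subject="Lunch", bcc="")      # unrelated email

flagged = groups_with_differing_bcc([first, second, third])
print(len(flagged))  # 1
```

If a check like this flags few or no groups, ignoring BCC is unlikely to affect the review; a large count suggests BCC should be part of the hash.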
If you would like to discuss this topic further or have questions, please reach out to me at email@example.com.