By Karl Sobylak

Published on Tue, November 24, 2020


Big data sets are the “new normal” of discovery and bring with them six sinister large data set challenges, as recently detailed in my colleague Nick’s article. These challenges range from classics like overly broad privilege screens to newer risks in ensuring that sensitive information, such as personally identifiable information (PII) or proprietary information like source code, does not inadvertently make its way into the hands of opposing parties or government regulators. While these challenges may seem insurmountable due to ever-increasing data volumes (and tend to keep discovery program managers and counsel up at night), there are new solutions that can help mitigate these risks and optimize workflows.

[Image: Advanced Analytics – The Key to Mitigating Big Data Risks]

As I previously wrote, ediscovery is actually a big data challenge. Advances in AI and machine learning, when applied to ediscovery big data, can help mitigate and reduce these sinister risks by breaking down the silos of individual cases, learning from a wealth of prior case data, and then transferring these learnings to new cases. Having the capability to analyze and understand large data sets at scale combined with state-of-the-art methods provides a number of benefits, five of which I have outlined below.

  1. Pinpointing Sensitive Information - Advances in deep learning and natural language processing have now made pinpointing sensitive content achievable. A company’s most confidential content could be lying in plain sight within its electronic data and yet be completely undetected. Imagine a spreadsheet listing customers, dates of birth, and social security numbers attached to an email between sales reps. What if you are a technology company and two developers are emailing each other snippets of your company’s source code? Now that digital media are the dominant form of communication within workplaces, situations like this are becoming ever-present, and it is very challenging for review teams to effectively identify and triage this content. To solve this challenge, advanced analytics can learn from massive amounts of publicly available and computer-generated data and then be fine-tuned to specific data sets using a recent breakthrough in natural language processing (NLP) called “transfer learning.” In addition, at the core of big data is the capability to process text at scale. Combining these two techniques enables precise algorithms to evaluate massive amounts of discovery data, pinpoint sensitive data elements, and elevate them to review teams for a targeted review workflow.
  2. Prioritizing the Right Documents - Advanced analytics can learn both key trends and deep insights about your documents and review criteria. A normal search-term-based approach to identifying potentially responsive or privileged content provides a binary output: documents either hit on a search term or they do not. Document review workflows are predicated on this concept, often leading to suboptimal review workflows that both over-identify documents that are out of scope and miss documents that should be reviewed. Advanced analytics provide a range of outcomes that enable review teams to create targeted workflow streams tailored to the risk at hand. Descriptive analysis of the data can generate human-interpretable rules that help organize documents, such as “documents with more than X recipients are never privileged” or “99.9% of the time, documents coming from the following domains are not responsive.” Deep learning-based classifiers, again using transfer learning, can generalize language from open source content and then be fine-tuned to specific review data sets. Having a combination of analytics, both descriptive and predictive, provides a range of options and gives review teams the ability to prioritize the right content, rather than just the next random document. Review teams can now concentrate on the most important material while deprioritizing the less important content for a later effort.
  3. Achieving Work-Product Consistency - Big data and advanced analytics approaches can ensure the same document or similar documents are treated consistently across cases. Corporations regularly collect, process, and review the same data across cases over and over again, even when cases are not related. Keeping document treatment consistent across these matters is extremely important when dealing with privileged content, but it is also important when it comes to responsiveness across related cases, such as a multi-district litigation. With the standard approach, cases sit in silos without any connectivity between them to enable consistent approaches. A big data approach enables connectivity between cases using hub-and-spoke techniques to communicate and transmit learnings and work product between cases. Work product from other cases, such as coding calls, redactions, and even production information, can be utilized to inform workflows on the next case. For big data, activities like this are table stakes.
  4. Mitigating Risk - What do all of these approaches have in common? At their core, big data and analytics are an engine for mitigating risk. Having the ability to pinpoint sensitive data, prioritize what you look at, and ensure consistency across your cases is a no-brainer. This all may sound like a big change, but in reality, it’s pretty seamless to implement. Instead of simply batching out documents that hit on an outdated privilege screen for privilege review, review managers can use a combination of analytics and fine-tuned privilege screen hits. Review then proceeds largely as it does today, just with the right analytics to give reviewers the context needed to make the best decision.
  5. Reducing Cost - The other side of the coin is cost savings. Every case has a different cost and risk profile, and advanced analytics should provide a range of options to support your decision-making process on where to set the lever. Do you really need to review each of these categories in full, or would an alternative scenario based on sampling high-volume, low-risk documents be a more cost-effective and defensible approach? The point is that having a better and more holistic view of your data provides an opportunity to make these data-driven decisions to reduce costs.
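
To make the contrast in item 2 a bit more concrete, here is a minimal Python sketch of how a binary search-term hit differs from a tiered workflow built on a descriptive rule plus a classifier score. Everything in it is an assumption for illustration only: the `Document` fields, the search terms, the recipient cutoff standing in for "X", the score thresholds, and the `privilege_score` values (which would come from a trained model in practice) are hypothetical and do not describe any particular product.

```python
from dataclasses import dataclass

# Illustrative values only; real terms and cutoffs would be set per matter.
PRIVILEGE_TERMS = ("attorney", "counsel", "privileged")
MAX_RECIPIENTS = 50  # hypothetical stand-in for the "X" in the recipient rule


@dataclass
class Document:
    doc_id: str
    text: str
    recipient_count: int
    privilege_score: float  # hypothetical classifier output in [0, 1]


def hits_search_terms(doc: Document) -> bool:
    """Binary privilege screen: the document either hits a term or it does not."""
    text = doc.text.lower()
    return any(term in text for term in PRIVILEGE_TERMS)


def assign_tier(doc: Document, high: float = 0.8, low: float = 0.2) -> str:
    """Route a document to a review tier using a rule plus a score."""
    # Descriptive rule: very widely distributed documents are deprioritized.
    if doc.recipient_count > MAX_RECIPIENTS:
        return "deprioritized"
    # Predictive score: a range of outcomes instead of a yes/no hit.
    if doc.privilege_score >= high:
        return "priority review"
    if doc.privilege_score <= low:
        return "deprioritized"
    return "standard review"


def review_order(docs: list[Document]) -> list[Document]:
    """Sort documents so the highest-scoring ones are reviewed first."""
    return sorted(docs, key=lambda d: d.privilege_score, reverse=True)


if __name__ == "__main__":
    docs = [
        Document("A", "Please ask outside counsel before signing.", 3, 0.93),
        Document("B", "Newsletter: meet our new general counsel!", 400, 0.55),
        Document("C", "Can legal weigh in on the draft agreement?", 4, 0.61),
    ]
    for doc in review_order(docs):
        print(doc.doc_id, hits_search_terms(doc), assign_tier(doc))
```

In this toy data, the newsletter hits the term screen even though it was sent to hundreds of recipients, while the request for legal input misses the screen entirely. The tiered approach gives the review team a way to treat those two documents differently.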

One key tip to remember - you do not need to try to implement this all at once! Start by identifying a key area where you want to make improvements, determine how you can measure the current performance of the process, then apply some of these methods and measure the results. Innovation is about getting a win in order to perpetuate the next.
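
One common way to measure a review process before and after a change is with precision and recall against a human-validated sample. The sketch below assumes you have such a sample; the function name, document IDs, and numbers are made up for illustration.

```python
def precision_recall(flagged, truly_relevant):
    """Return (precision, recall) for a set of flagged document IDs.

    precision: share of flagged documents that were actually relevant
    recall:    share of relevant documents that were flagged
    """
    flagged, truly_relevant = set(flagged), set(truly_relevant)
    true_positives = len(flagged & truly_relevant)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(truly_relevant) if truly_relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical validated sample: documents 1-4 are truly privileged.
    validated = {1, 2, 3, 4}
    current_process = {1, 2, 5, 6, 7, 8}  # e.g. a broad privilege screen
    new_process = {1, 2, 3, 9}            # e.g. screen plus analytics

    print("current:", precision_recall(current_process, validated))
    print("new:    ", precision_recall(new_process, validated))
```

Running the same measurement on the same sample before and after the change keeps the comparison apples to apples.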

If you are interested in this topic or just love to talk about big data and analytics, feel free to reach out to me at

About the Author
Karl Sobylak

Senior Product Manager, Product Development

Karl is responsible for the innovation, development, and deployment of cutting-edge big data analytic based products that create better and more optimized legal outcomes for our clients, including the reduction of cost and risk. After graduating from SUNY Albany with a B.S. in Computer Science and Applied Mathematics in 2003, Karl joined a start-up ediscovery services company where he learned everything he could about the world of legal including operations, development, services, and strategy. With more than 16 years of expertise in the legal industry, creating data-centric solutions, and applying risk mitigation tactics, Karl possesses a strong background that has allowed him to help reduce legal costs, improve precision and recall rates, and gain favorable legal results.