By Tom Groom, Vice President, Senior Consultant, D4 LLC
I do love technology. I also love to cook. So, I’m going to have some fun and combine both of these passions into this one post. My goal is to simplify and disambiguate some of the three-letter combinations floating around in our predictive analytics alphabet soup. In the end, I just want to serve a good meal that satisfies and draws requests from the dining room for more.
For starters, we have more than enough acronyms floating around in our soup bowls. We do need to keep ESI (Electronically Stored Information), for those three letters bind our industry together and provide context. ECA (Early Case Assessment) – or, my preference, EDA (Early Data Analysis) – is another eDiscovery acronym for activities that will remain for the foreseeable future. Toss in a little TAR (Technology Assisted Review), CAR (Computer Aided Review), “X”AR (“first letter of the software name” Assisted Review) or PC (Predictive Coding), add your favorite spice, and then stir to get a better-tasting soup. But it doesn’t stop there.
If you’ve yet to read the recent Cormack-Grossman study Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery, it is worth the read (Disclaimer: You may want to digest it with your favorite adult beverage for it is pretty heavy on the spice). This well-written academic paper was presented last month at the annual conference for the Special Interest Group on Information Retrieval (SIGIR) – a part of the Association for Computing Machinery (ACM). The authors of the study set out to answer the question:
“…should training documents be selected at random, or should they be selected using one or more non-random methods, such as keyword search or active learning?”
The paper tested three methods of document selection for Predictive Analytics (which added three new letter combinations to our alphabet soup).
- SPL: Simple Passive Learning – random selection only
- SAL: Simple Active Learning – uses judgmental seeds to start, then computer-generated seeds to maximize the classifier’s understanding of the dividing line between relevant and not relevant
- CAL: Continuous Active Learning – uses judgmental seeds to start, but then trains primarily with highly relevant documents
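To make the difference between the three protocols concrete, here is a minimal sketch of how each might pick the next batch of training documents. This is illustrative only – the classifier scores are faked with random numbers, and no actual TAR product works exactly this way – but the selection logic for each protocol follows the study’s definitions: SPL ignores the model, SAL targets the decision boundary, and CAL targets the most likely relevant documents.

```python
import random

def spl_select(scores, k, rng):
    # SPL: simple passive learning -- pure random selection,
    # the model's scores are ignored entirely
    return rng.sample(range(len(scores)), k)

def sal_select(scores, k):
    # SAL: simple active learning -- uncertainty sampling; pick the
    # documents whose predicted probability of relevance is closest
    # to the 0.5 dividing line between relevant and not relevant
    return sorted(range(len(scores)), key=lambda i: abs(scores[i] - 0.5))[:k]

def cal_select(scores, k):
    # CAL: continuous active learning -- relevance feedback; pick the
    # documents the model currently scores as most likely relevant
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

# Stand-in for a trained classifier's probability-of-relevance scores
rng = random.Random(42)
scores = [rng.random() for _ in range(1000)]

print("SPL batch:", spl_select(scores, 5, rng))
print("SAL batch:", sal_select(scores, 5))
print("CAL batch:", cal_select(scores, 5))
```

In a real workflow the `scores` list would come from the classifier after each training round, and the chosen batch would go to human reviewers, whose coding decisions feed the next round.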
This paper challenged many of my fellow eDiscovery cooks to come to the kitchen and make their arguments as to why their approach is best and/or how their software compares to the results of the study – e.g., Ralph Losey, John Tredennick and Herbert L. Roitblat. All are interesting reads with salient points.
While I agree with most of what is being said by my fellow chefs listed above, in actual practice we have found the best Predictive Analytics results use a combination of all three methods described in the Cormack-Grossman study.
Each method discussed in the study has its benefits, and can be combined with other methods for a more delicious recipe:
- SPL: Without exception, Simple Passive Learning (a.k.a. random selection) is required for a statistically valid control set. Systems like Equivio Relevance, Relativity Assisted Review, ViewPoint Assisted Review and Clearwell use random sampling to establish a control set and to seed the early training cycles. They then use SAL and CAL to complete the training.
- SAL: Simple Active Learning (SAL) shortens the system training cycle – especially when compared with SPL, as documented in the study – by actively seeking boundary documents that divide “relevant” from “non-relevant” during training. SAL is fundamental for Support Vector Machine (SVM)-based systems (another acronym for our soup!). Examples of SVM-based Predictive Coding systems include Equivio Relevance and Clearwell’s Transparent Predictive Coding. The best training workflows incorporate feedback from seed documents found using a variety of methods.
- CAL: Continuous Active Learning is seen in most Assisted Review workflows using a conceptual engine, but also appears in SVM-based systems by sprinkling in highly relevant seed documents found during or outside of training to improve system precision. The machine learns from examples – better examples mean better results.
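Since a statistically valid control set is the one place where random sampling is non-negotiable, it is worth showing how its size is typically determined. The sketch below uses the standard sample-size formula with a finite population correction; the parameter choices (95% confidence, a ±2% margin of error, and worst-case prevalence of 50%) are my illustrative assumptions, not values from the study or from any particular product.

```python
import math

def control_set_size(population, z=1.96, margin=0.02, p=0.5):
    # Standard sample-size formula: n0 = z^2 * p * (1 - p) / margin^2,
    # using p = 0.5 as the worst case, then applying the finite
    # population correction for the actual collection size.
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    return math.ceil(n0 / (1 + (n0 - 1) / population))

# For a 100,000-document collection at 95% confidence, +/-2% margin:
print(control_set_size(100_000))  # -> 2345
```

The practical point: the required control set grows with the desired confidence and precision, not (much) with collection size, which is why random control sets remain affordable even on very large matters.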
The takeaway is this: Food is better if it is cooked. Whether you bake, broil, fry, microwave, or grill, you will have more success as a chef if you cook the food before serving it to your guests. Likewise, Predictive Analytics (TAR, CAR, “X”AR or PC) will yield better results than manual human review. Yes, one approach may get you slightly better results than another, but any of these methods will yield much better results than the traditional review approach. You just have to be willing to try the new recipe.
A combination of the Predictive Analytics methods will have a greater probability of maximizing the benefits for you and your clients. That said, Predictive Analytics technology by itself isn’t sufficient. It takes chefs who understand the ingredients and how to best put them together to cook the tastiest Predictive Analytics soup.