E-Discovery Forum  

Posted 02-08-2010, 03:35 PM
Clinching the Concept of Concept Search - Electronic Discovery

As a frequent speaker, I live for the "aha" moment that lights the eyes of an audience.  It's that magical turning point when you've made a daunting technical topic accessible.  You can almost hear them say, "Thank you, thank you, thank you, for finally making clear something I've long wondered about but never fully grasped."

Yesterday, at an e-discovery conference in Austin, I watched Ed Fiducia of Inventus earn his "aha" moment describing concept search.  It's a challenging topic: it shoves a host of different approaches under one broad rubric and involves more math than the average lawyer wants to recall.  Worse, explanations are often laced with--or should I say lacerated by?--marketing-speak.  But Ed hit the bull's-eye.

Ed wisely defused rampant technofear by tying his explanation to the immensely popular CSI television series (Las Vegas and New York, as Ed's not fond of David Caruso's trite, trademark take-off-the-sunglasses move). 

Rather than dwell on the specifics of the various approaches to concept search, Ed tackled the concept of concept search, particularly document clustering and near de-duplication.  He began by reminding us that when the CSI team runs a fingerprint through the Automated Fingerprint Identification System (AFIS), the system doesn't check every aspect of the print but only the spatial relationships among distinctive features of its loops, whorls and arches.  That is, the computer compares a digitally recorded geometric analysis of the ridges, taken at the points where they end or fork, to a database of geometric characteristics of other fingerprints.  The computer then assembles a list of likely matches and estimates the likelihood of each as a percentage.  On television, this is often accompanied by a fanciful "100% match" along with a mug shot and rap sheet.

Ed's point was that we don't need to consider every nuance of a fingerprint to drastically reduce the universe of potential matches.  Instead, we can calculate a limited number of geometric values and plot those values to identify candidate matches.  Then, we look carefully at the candidates to gauge true matches.  This doesn't eliminate the need for human judgment, but it allows human review to be deployed efficiently.

Applying this technique to documents, we plot words instead of whorls.  To lay the groundwork, Ed posited a world where all documents were composed of combinations of only three words, say "run," "home" and "cat."  Were we to count the instances of each word in each document and plot the three counts on X, Y and Z axes, each document would become a point in space, and how close two points lie would give us a crude measure of similarity.  If we factor in the spatial/geometric relationship of the words, we'd have a much more exact measure of similarity.  Plus, patterns would emerge, and we'd start to see similar documents cluster in geometric space.
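
For readers who want to see the three-word world in action, here is a minimal sketch in Python.  It turns each document into a (run, home, cat) point and compares two points.  Ed didn't name a particular similarity measure, so the use of cosine similarity here is my assumption, and the sample documents and function names are invented purely for illustration.

```python
import math
import re
from collections import Counter

# The three-word vocabulary from the thought experiment; each word is one axis.
VOCABULARY = ("run", "home", "cat")


def count_vector(text, vocabulary=VOCABULARY):
    """Return the (X, Y, Z) point for a document: one word count per axis."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return tuple(counts[word] for word in vocabulary)


def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors (0.0 if either is all zeros).

    Assumption: cosine similarity is one common choice of measure; the talk
    did not specify which measure real products use.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample documents, made up for this example.
doc_a = "run home cat run home"
doc_b = "run run home home cat cat"
doc_c = "cat cat cat cat"

vec_a = count_vector(doc_a)  # (2, 2, 1)
vec_b = count_vector(doc_b)  # (2, 2, 2)
vec_c = count_vector(doc_c)  # (0, 0, 4)

sim_ab = cosine_similarity(vec_a, vec_b)
sim_ac = cosine_similarity(vec_a, vec_c)
print(f"a vs b: {sim_ab:.3f}")
print(f"a vs c: {sim_ac:.3f}")
```

Documents a and b use the three words in nearly the same proportions, so they score much closer to 1.0 than a and c do.  This is only the crude, counts-only measure; it ignores where the words fall in relation to one another.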

Focusing on clusters of similar documents makes review for responsiveness and privilege more efficient, in the same way that focusing on geometrically similar fingerprint candidates makes crime scene investigation more efficient.  And therein lies a leading concept behind concept search.
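
To make the clustering idea concrete, here is a rough sketch in Python that groups documents whose word-count points lie close together.  It is a deliberately simple, greedy grouping of my own choosing, not the algorithm any particular product uses.  The document names, the (run, home, cat) counts and the 0.9 threshold are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_by_similarity(vectors, threshold=0.9):
    """Greedy single-pass grouping of documents.

    A document joins the first cluster whose seed (its first member) it
    resembles at or above `threshold`; otherwise it starts a new cluster.
    Assumption: this is a toy stand-in for real clustering methods.
    """
    clusters = []  # each cluster is a list of document ids, seed first
    for doc_id, vec in vectors.items():
        for cluster in clusters:
            seed_vec = vectors[cluster[0]]
            if cosine_similarity(vec, seed_vec) >= threshold:
                cluster.append(doc_id)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append([doc_id])
    return clusters


# Hypothetical (run, home, cat) counts for five made-up documents.
documents = {
    "memo_1": (5, 4, 0),
    "memo_2": (4, 5, 1),
    "email_1": (0, 1, 6),
    "memo_3": (6, 5, 0),
    "email_2": (1, 0, 5),
}

clusters = cluster_by_similarity(documents, threshold=0.9)
for number, members in enumerate(clusters, start=1):
    print(f"cluster {number}: {members}")
```

With these made-up counts, the three memos land in one group and the two emails in another, so a single reviewer could take each batch in one sitting.  Raising the threshold toward 1.0 narrows each group to documents that are nearly identical in their counts, which is roughly the intuition behind near de-duplication.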

Enabling a single reviewer to rapidly muster similar documents not only reduces the risk of inconsistent characterization and redaction, but also reveals similarities that might have been overlooked.  It's like shopping in a neighborhood where all the stores sell the same things--think the Diamond District in New York or the Goldfish Market in Hong Kong.  Having all the permutations at hand fosters smarter choices.

Nice work, Ed!



Visit the publisher of this e-discovery news article: original source.
__________________
DISCLAIMER: This news is syndicated from e-discovery websites and blogs that make their feed available via RSS. To have your RSS feed added or removed, contact the forum administrator.