The main issue in “Approximate String Matching in databases” or “Fuzzy Matching” is PERFORMANCE

Architecture of Microsoft SQL S

Image via Wikipedia

Ramy Ghaly January 28, 2011

In other words, I’m trying to find who has this technology and what claims do they have about the performance of their systems?

My question is technical more than commercial. However, the application is for commercial use in “fuzzy search matching” technology and performance.

For example, a vendor that is specialized in name and address fuzzy matching and has the below performance:

matchIT SQL performance (based on Windows XP, SQL Server 2005, Intel Core 2 Quad CPU, 2.40GHz, 4 GB RAM):

· Key generation = 10mil records/hour
· Find matches (one table, standard low-volume match keys):
o 1 million records = 14mins
· Find overlap (two tables, standard high-volume match keys):
o 40mil+5mil = 2.5hours

 

What Key Professionals Are Saying?


Cohan Sujay Carlos

Follow Cohan Sujay

Cohan Sujay Carlos PostgreSQL has a module for fuzzy matching of names:http://www.postgresql.org/docs/current/static/fuzzystrmatch.html

Solr/Lucene uses Levenshtein distance:http://lucene.apache.org/java/2_3_2/queryparsersyntax.html#Fuzzy%20Searches

Pentaho provides a choice of algorithms in its ETL engine.

1 day ago

Anurag Gupta

Follow Anurag

Anurag Gupta FAST ESP (now FSIS) provides choices in fuzzy matching, photenic matching and even full wildcards. Performance here is a function of hardware resources, starting with a few million documents being matched in sub-seconds on a single server install. Not to mention, there are other factors like query load, doc size, query size etc which all will need to be considered for an appropriate sizing. Hope this helps.

1 day ago

Vineet Yadav sphnix (http://sphinxsearch.com/) is an open source full text search engine server and it can be used with both sql database and nosql datastore. It supports real time indexing.
You can find performance comparison between sphnix and other open source search engine http://zooie.wordpress.com/2009/07/06/a-comparison-of-open-source-search-engines-and-indexing-twitter/. Also postgres supports fuzzy string matching algo (http://www.viget.com/extend/alternatives-to-sphinx-fuzzy-string-matching-in-postgresql/). Agrep (http://en.wikipedia.org/wiki/Agrep) is used for fast fuzzy string matching and it is based on wu-manber algorithm (http://webglimpse.net/pubs/TR94-17.pdf). I suggest you to look at modified version of wu-manber algorithm.

1 day ago

Hans Hjelm

Follow Hans

Hans Hjelm


This article springs to mind:

http://cs.anu.edu.au/~Peter.Christen/publications/ausdm2008christen.pdf

As for commercial systems, that’s a different story.

Steve Harris

Follow Steve

Steve Harris There’s some fuzzy word matching inside 4store, but we designed it for fuzzy matching on names, and document titles, so it might not be efficient enough on large blocks of text. It’s based on double metaphones, and supports a number of languages.

http://4store.org/trac/wiki/TextIndexing

2 days ago

http://static02.linkedin.com/scds/common/u/img/icon/icon_no_photo_80x80.png

Follow Moty

Moty Mondiano The problem with fuzzy is that every permutation of the word has to be calculated and match in real time (unless the dictionary is very small creating pre- calculating and indexing all possible permutation is out of the question). i.e. fuzzy/distance is inherently costly.
An alternative would be to equate the “canonical” form of the words (create a single canonical form for the word) than match toCanonic ( input) with toCanonic(candidate). This way performance is the same as regular comparison.
Consider using “syd” or “soundex” for implementing “toCanonic”. it all depends if for example soundex (using the phonetic signature of the word) is a good enough match for your needs.

2 days ago

Jeremy Branham

Follow Jeremy

Jeremy Branham Maybe you could use a stemming algorithm to create the canonical form of the word. Like the Porter Stemmer algorithm…
http://tartarus.org/~martin/PorterStemmer/

2 days ago

Adi Eyal

Follow Adi

Adi Eyal Depends on what exactly it is that you want to fuzzy search. Canonical forms like stemming, soundex, nysiis etc don’t work very well in practice and don’t cover important semantic classes. If you want to find Chuck when searching for Charles or Prescription Shoppe for Drugstore, there’s not much to do but to either index Chuck along with Charles or to convert the search string to Charles or Chuck in the background.

The trade-off of course is computational time vs storage. If you pre-index then you’re increasing storage requirements for your database while pre-processing the search string while increase the computational burden at search time.

I don’t think there’s a silver bullet to solve this problem and the difficulty depends on what you’re trying to achieve.

1 day ago

Mark Bishop

Follow Mark

Mark Bishop For best-fit problems you may be interested in the application of a Swarm Intelligence meta-heuristic called Stochastic Diffusion Search.

For time complexity issues see: <http://www.doc.gold.ac.uk/~mas02mb/Selected%20Papers/1998%20Time%20complexity.pdf>.

2 days ago

Neil Smith

Web Developer at Basekit

see all my answers

Best Answers in: Web Development (16) see more

Compare it to Sphinx fulltext search engine, particularly when running in clustered mode and indexing of large volume document sets.

Typically quoted : 10-15MB/second indexing speed PER CORE, 500qps fulltext search, 50 mil queries per day (craigslist)

It’s widely used on sites such as Craigslist and other large scale sites – routinely with > 2TB datasets.

 

Links:

· http://sphinxsearch.com/about/sphinx/

· http://notes.jschutz.net/wp-content/uploads/2009/04/sphinx-performance.pdf

· http://sphinxsearch.com/info/powered/

Posted 2 days ago |

Edwin Stauthamer

Consultant Search Solutions at Emid Consult B.V. Specialist on Google, Autonomy and Exalead technologies.

See all my answers

I would like to ask you a question instead of answering it: ‘Why are you using a database for this?”
And now the answer: “Specifically for this kind of text retrieval purposes Search Engines are invented”.

There are plenty of engines out there that can solves this performance issue, amongst them Autonomy, Solr and Exalead.

Databases are commonly optimized for transactional performance and not retrieval purposes.

Posted 2 days ago

Charlie Hull

Follow Charlie

Charlie Hull In my experience this isn’t likely to be publically available information, although vendors will tell partners and potential customers if asked.

2 days ago

Sam Mefford

Follow Sam

Sam Mefford We created a large-scale fuzzy search for a government organization searching across 20 million names. With tuning and distribution, we were able to get most queries to run sub-second. We used Autonomy IDOL.

22 hours ago

Charlie Hull

Follow Charlie

Charlie Hull

We’ve implemented fuzzy search for a media monitoring application – there’s more at http://www.flax.co.uk/blog/2010/12/13/next-generation-media-monitoring-with-open-source-search/ – made some great improvements in accuracy for our client. We used Python and Xapian.

39 minutes ago

Vadim Berman

Follow Vadim

Vadim Berman
Yep, it always is. The thing is, it’s more of a database related issue, if I understand it correctly. Different vendors resolve it in different ways with different strengths and weaknesses, so you might want to benchmark different databases for your particular application.

You might need to tweak things on your side as well. Generally, if you have a fixed prefix (e.g. “blah%” and not “%blah%”), it is easier to resolve for most systems, and does not require full-text indexing.

Otherwise, there are still ways to speed it up, but it’s not as straightforward. Of course, full-text index helps.

I don’t want to start “religious wars” which one is better, but from what I know, PostgreSQL is more “tweakable” than MySQL and scales better for clustering and such. I never tried it though. Try these:

http://www.wikivs.com/wiki/MySQL_vs_PostgreSQL#Advanced_Indexinghttp://www.postgresonline.com/journal/archives/51-Cross-Compare-of-SQL-Server,-MySQL,-and-PostgreSQL.html

Michael Belanger

Michael Belanger Look into Kenneth Baclawski’s work (Northeastern University)

2 days ago

Simon Spero

Follow Simon

Simon Spero A lot depends on the size of the data set vs. the query rate. A lot also depends whether there is domain specific knowledge that can be brought to bear. For example, there has been a lot of work on approximate name matching, and a lot of systems have support for coding scheme like Soundex. See e.g:

Cohen, W., P. Ravikumar, and S. Fienberg (2003). “A comparison of string distance metrics for name-matching tasks”. In: the Proceedings of the IJCAI. American Association for Artificial Intelligence.http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.15.178&rep=rep1&type=pdf

A good introduction to large scale approximate matching can be found in:

Scheuren, Fritz, William E Winkler, and Thomas Herzog (2007). Data quality and record linkage techniques. New York: Springer. ISBN: 9780387695051.

Winkler, William E. (2006). Overview of Record Linkage and Current Research Directions. Research Report Series, Statistics #2006-2. Statistical Research Division U.S. Census Bureau. URL: http://www.census.gov/srd/papers/pdf/rrs2006-02.pdf

Some systems can maintain n-gram indexes, which can dramatically improve query times at the expense of slower updates. Postgresql has had support for trigram indexes for some time. Postgresql 9 greatly improved the update speed for these indexes.

See: http://www.postgresql.org/docs/9.0/static/pgtrgm.html

Postgresql also has support for fuzzy matching; some of the methods can be used to build indexes; however, levenshtein distances must be computed on the fly.

See: http://www.postgresql.org/docs/9.0/static/fuzzystrmatch.html

2 days ago

Charles Patridge

Follow Charles

Charles Patridge

I have been doing “Fuzzy Search Matching” since 1980’s, primarily using the SAS Software. Yes, Performance is a big issue, and it chews up a lot of CPUs.

SAS is pretty good at this but it is expensive / slow to get the results you desire. In SAS, I have to use a number of SAS functions such as INDEX, INDEXC, SOUNDEX, RXPARSE, and other such functions; most of which are CPU bound.

It seems “Regular Expressions” & PERL have more power to FUZZY Search Matching but I can not give specifics as to how much faster it is over SAS but I suspect it is faster based upon what I have seen.

I do not know of any other tools that do this kind of searching nor any vendors that promote this capability.

If you happen across a vendor that says it is good at FUZZY Search, I would like hear about them.

2 days ago

Marie Risov

Follow Marie

Marie Risov SQL Server Integration Services (SSIS) has fuzzy grouping and fuzzy matching transformations. They are also resource intensive.

2 days ago

Note: With this question posted in 28 targeted groups on LinkedIn with more than 15-20 thousand potential views, the question/topic made a huge impact in these groups to a point that this question was voted “Top Influential Discussion Of The Week in 80 to 90 percent of them. It would be nice to keep this conversation going with some replies to these mentioned responses.

For more information link to the question here: LinkedIn or you can also link to Q&A .

Thank you all for your support and cooperation.