Tag Archives: Apache Solr

[repost ]Apache Solr vs ElasticSearch



Feature Solr 4.7.0 ElasticSearch 1.0
Binary API   SolrJ  TransportClient, Thrift (through a plugin)
JMX support  ES specific stats are exposed through the REST API
Client libraries  PHP, Ruby, Perl, Scala, Python, .NET, Javascript PHP, Ruby, Perl, Scala, Python, .NET, Javascript, Erlang, Clojure
3rd-party product integration (open-source) Drupal, Magento, Django, ColdFusion, WordPress, OpenCMS, Plone, Typo3, ez Publish, Symfony2, Riak (via Yokozuna) Drupal, Django, Symfony2, WordPress, CouchBase
3rd-party product integration (commercial) DataStax Enterprise Search, Cloudera Search, Hortonworks Data Platform, MapR SearchBlox, Hortonworks Data Platform, MapR
Output JSON, XML, PHP, Python, Ruby, CSV, Velocity, XSLT, native Java JSON, XML/HTML (via plugin)



Feature Solr 4.7.0 ElasticSearch 1.0
Data Import DataImportHandler – JDBC, CSV, XML, Tika, URL, Flat File Rivers modules – ActiveMQ, Amazon SQS, CouchDB, Dropbox, DynamoDB, FileSystem, Git, GitHub, Hazelcast, JDBC, JMS, Kafka, LDAP, MongoDB, neo4j, OAI, RabbitMQ, Redis, RSS, Sofa, Solr, St9, Subversion, Twitter, Wikipedia
ID field for updates and deduplication
Partial Doc Updates   with stored fields  with _source field
Custom Analyzers and Tokenizers 
Per-field analyzer chain 
Per-doc/query analyzer chain 
Synonyms   Supports Solr and Wordnet synonym format
Multiple indexes 
Near-Realtime Search/Indexing 
Complex documents   Flat document structure. No native support for nesting documents
Schemaless   4.4+
Multiple document types per schema   One set of fields per schema, one schema per core
Online schema changes   Schema change requires restart. Workaround possible using MultiCore.  Only backward-compatible changes.
Apache Tika integration 
Dynamic fields 
Field copying   via multi-fields
Hash-based deduplication 



Feature Solr 4.7.0 ElasticSearch 1.0
Lucene Query parsing 
Structured Query DSL   Need to programmatically create queries if going beyond Lucene query syntax.
Span queries   via SOLR-2703
Spatial search 
Multi-point spatial search 
Faceting   The way top N facets work now is by getting the top N from each shard, and merging the results. This can give incorrect counts when num shards > 1.
Advanced Faceting   blog post
Pivot Facets 
More Like This
Boosting by functions 
Boosting using scripting languages 
Push Queries  JIRA issue  Percolation. Distributed percolation supported in 1.0
Field collapsing/Results grouping   possibly 1.0+ link
Spellcheck  Suggest API
Autocomplete  Added in 0.90.3 here
Query elevation  workaround
Joins   It’s not supported in distributed search. See LUCENE-3759.  via has_children and top_children queries
Resultset Scrolling   New to 4.7.0  via scan search type
Filter queries   also supports filtering by native scripts
Filter execution order   local params and cache property  _cache and _cache_key property
Alternative QueryParsers   DisMax, eDisMax  query_string, dis_max, match, multi_match etc
Negative boosting   but awkward. Involves positively boosting the inverse set of negatively-boosted documents.
Search across multiple indexes  it can search across multiple compatible collections
Result highlighting
Custom Similarity 
Searcher warming on index reload   Warmers API



Feature Solr 4.7.0 ElasticSearch 1.0
Pluggable API endpoints 
Pluggable search workflow   via SearchComponents
Pluggable update workflow 
Pluggable Analyzers/Tokenizers
Pluggable Field Types
Pluggable Function queries
Pluggable scoring scripts
Pluggable hashing 
Pluggable webapps   site plugin
Automated plugin installation   Installable from GitHub, maven, sonatype or elasticsearch.org



Feature Solr 4.7.0 ElasticSearch 1.0
Self-contained cluster   Depends on separate ZooKeeper server  Only ElasticSearch nodes
Automatic node discovery  ZooKeeper  internal Zen Discovery or ZooKeeper
Partition tolerance  The partition without a ZooKeeper quorum will stop accepting indexing requests or cluster state changes, while the partition with a quorum continues to function.  Partitioned clusters can diverge unless discovery.zen.minimum_master_nodes set to at least N/2+1, where N is the size of the cluster. If configured correctly, the partition without a quorum will stop operating, while the other continues to work. See this
Automatic failover  If all nodes storing a shard and its replicas fail, client requests will fail, unless requests are made with the shards.tolerant=true parameter, in which case partial results are returned from the available shards.
Automatic leader election
Shard replication
Automatic shard rebalancing   it can be machine, rack, availability zone, and/or data center aware. Arbitrary tags can be assigned to nodes and it can be configured to not assign the same shard and its replicas to nodes with the same tags.
Change # of shards  Shards can be added (when using implicit routing) or split (when using compositeId). Cannot be lowered. Replicas can be increased anytime.  each index has 5 shards by default. Number of primary shards cannot be changed once the index is created. Replicas can be increased anytime.
Relocate shards and replicas   can be done by creating a shard replicate on the desired node and then removing the shard from the source node  can move shards and replicas to any node in the cluster on demand
Control shard routing   shards or _route_ parameter  routing parameter
Consistency Indexing requests are synchronous with replication. An indexing request won’t return until all replicas respond. No check for downed replicas; they will catch up when they recover. When new replicas are added, they won’t start accepting and responding to requests until they have finished replicating the index. Replication between nodes is synchronous by default, thus ES is consistent by default, but it can be set to asynchronous on a per-document indexing basis. Index writes can be configured to fail if there are not sufficient active shard replicas. The default is quorum, but all or one are also available.



Feature Solr 4.7.0 ElasticSearch 1.0
Web Admin interface  bundled with Solr  via site plugins: elasticsearch-head, bigdesk, kopf, elasticsearch-HQ, Hammer
Hosting providers  WebSolr, Searchify, Hosted-Solr, IndexDepot, OpenSolr, gotosolr  bonsai.io, Indexisto, qbox.io, IndexDepot



As a number of folks point out in the discussion below, feature comparisons are inherently shallow and only go so far. I think they serve a purpose, but shouldn’t be taken to be the last word on these 2 fantastic search products.

If you’re running a smallish site and need search features without fancy bells-and-whistles, I think you’ll be very happy with either Solr or ElasticSearch.

I’ve found ElasticSearch to be friendlier to teams which are used to REST APIs, JSON etc and don’t have a Java background. If you’re planning a large installation that requires running distributed search instances, I suspect you’re also going to be happier with ElasticSearch.

As Matt Weber points out below, ElasticSearch was built to be distributed from the ground up, not tacked on as an ‘afterthought’ like it was with Solr. This is totally evident when examining the design and architecture of the 2 products, and also when browsing the source code.





If you see any mistakes, or would like to append to the information on this webpage, you can clone the GitHub repo for this site with:

git clone https://github.com/superkelvint/solr-vs-elasticsearch

and submit a pull request.


[project ]ManifoldCF


Welcome to ManifoldCF!

What Is ManifoldCF?

ManifoldCF is an effort to provide an open source framework for connecting source content repositories like Microsoft SharePoint and EMC Documentum to target repositories or indexes, such as Apache Solr, OpenSearchServer or ElasticSearch. ManifoldCF also defines a security model for target repositories that permits them to enforce source-repository security policies.

Currently included connectors support FileNet P8 (IBM), Documentum (EMC), LiveLink (OpenText), Meridio (Autonomy), Windows shares (Microsoft), and SharePoint (Microsoft). Also included are a general CMIS connector, a generic file system connector, a general JDBC connector, an RSS feed connector, a Wiki connector, and a general web connector. Currently supported targets include Apache Solr, QBase (formerly MetaCarta) GTS , OpenSearchServer and ElasticSearch. The complete repository compatibility list can be found in the release documentation here.

The original ManifoldCF code base was granted by MetaCarta, Inc., to the Apache Software Foundation in December 2009. The MetaCarta effort represented more than five years of successful development and testing in multiple, challenging enterprise environments.

Project status

Apache ManifoldCF 1.0.1 is available! Download it here.

A philosophical note about third-party repositories

Many of the connectors included with ManifoldCF currently require third-party libraries, packages, or other licensed materials in order to be built. The code for these connectors can be distributed under Apache guidelines because they are conditionally compiled; that is, you as a developer must supply the necessary third-party components before the connector will build. While we have every intention of trying to reduce the number of affected connectors by means of clean-room development under the auspices of ASF, realistically the current situation will not change in any dramatic way very quickly. The focus of this group remains primarily on providing solutions.

The required libraries, tools, and procedures for building each of these connectors are well documented in the Wiki pages for this project.

[repost ]Apache Solr crash course



Apache Solr crash course – Presentation Transcript

  1. Apache Solr crash course Tommaso Teofili
  2. Agenda• IR• Solr• Tips&tricks• Case study• Extras
  3. Information Retrieval• “Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers)” – P. Nayak, Stanford University
  4. Inverted index• Each document has an id and a list of terms• For each term t we must store a list of all documents that contain t• Identify each document by its id
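The structure described in this slide can be sketched in a few lines of Java. This is a minimal illustration with whitespace-only tokenization; a real engine like Lucene does far more analysis. All names here are made up for the example.

```java
import java.util.*;

// Minimal inverted index: each term maps to the sorted set of ids of
// the documents that contain it.
class InvertedIndex {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Identify each document by its id; store a posting for every term.
    void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    // All document ids containing the term (empty set if none).
    SortedSet<Integer> docsFor(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptySortedSet());
    }
}
```

For example, after indexing doc 1 = "Apache Solr" and doc 2 = "Apache Lucene", docsFor("apache") yields both ids while docsFor("solr") yields only doc 1.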
  5. IR Metrics• How good is an IR system?• Precision: Fraction of retrieved docs that are relevant to user’s information need• Recall: Fraction of relevant docs in collection that are retrieved
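The two metrics from this slide translate directly into set arithmetic; a small sketch (class and method names are illustrative):

```java
import java.util.*;

// Precision and recall computed from the set of retrieved document ids
// and the set of relevant ones.
class IrMetrics {
    // Precision: fraction of retrieved docs that are relevant.
    static double precision(Set<Integer> retrieved, Set<Integer> relevant) {
        if (retrieved.isEmpty()) return 0.0;
        Set<Integer> hits = new HashSet<>(retrieved);
        hits.retainAll(relevant);                       // retrieved ∩ relevant
        return (double) hits.size() / retrieved.size();
    }

    // Recall: fraction of relevant docs that were retrieved.
    static double recall(Set<Integer> retrieved, Set<Integer> relevant) {
        if (relevant.isEmpty()) return 0.0;
        Set<Integer> hits = new HashSet<>(relevant);
        hits.retainAll(retrieved);
        return (double) hits.size() / relevant.size();
    }
}
```

E.g. retrieving docs {1,2,3,4} when {2,4,5} are relevant gives precision 2/4 and recall 2/3.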
  6. Apache Lucene• Information Retrieval library• Inverted index of documents• Vector space model• Advanced search options (synonyms, stopwords, similarity, proximity)
  7. Lucene API – indexing• Lucene indexes are built on a Directory• Directory can be accessed by IndexReaders and IndexWriters• IndexSearchers are built on top of Directories and IndexReaders• IndexWriters can write Documents inside the index• Documents are made of Fields• Fields have values• Directory > IndexReader/Writer > Document > Field
  8. Lucene API – searching• Open an IndexSearcher on top of an IndexReader over a Directory• Many query types: TermQuery, MultiTermQuery, BooleanQuery, WildcardQuery, PhraseQuery, PrefixQuery, MultiPhraseQuery, FuzzyQuery, NumericRangeQuery, …• Get results from a TopDocs object
  9. Apache Solr• Ready to use enterprise search server• REST (and programmatic) API• Results in XML, JSON, PHP, Ruby, etc…• Exploit Lucene power• Scaling capabilities (replication, distributed search, …)• Administration interface• Customizable via plugins
  10. Apache Solr 3.1.0
  11. Apache Solr – admin UI
  12. Solr – project status• Solr 3.1.0 version released in March 2011• Lucene/Solr is now a single project• Huge community• Backed by Lucid Imagination
  13. Solr basic configuration• schema.xml • contains types definitions for field analysis (field type+tokenizers+filters) • contains field definitions• solrconfig.xml • contains the Solr instance configuration
  14. Solr – schema.xml• Types (with index/query Analyzers)• Fields with name, type and options• Unique key• Dynamic fields• Copy fields
  15. Solr – content analysis• define documents’ model• each document consists of fields• each field • has attributes telling Solr how to handle its contents • contains free text, keywords, dates, numbers, etc.
  16. Solr – content analysis• Analyzer: creates tokens using a Tokenizer and, optionally, some filters (TokenFilters)• Each field can define an Analyzer at ‘query’ time and another at ‘index’ time, or the same in both cases• Each field can be indexed (searchable), stored (possibly fetched with results), multivalued, required, etc.
  17. Solr – content analysis• Commonly used tokenizers: • WhitespaceTokenizerFactory • StandardTokenizerFactory • KeywordTokenizerFactory • PatternTokenizerFactory • HTMLStripWhitespaceTokenizerFactory • HTMLStripStandardTokenizerFactory
  18. Solr – content analysis• Commonly used TokenFilters: • SnowballPorterFilterFactory • StopFilterFactory • LengthFilterFactory • LowerCaseFilterFactory • WordDelimiterFilterFactory • SynonymFilterFactory • PatternReplaceFilterFactory • ReverseWildcardFilterFactory • CharFilterFactories (Mapping,HtmlString)
  19. Debugging analysis
  20. Solr – solrconfig.xml• Data directory (where Solr will write the Lucene index)• Caches configuration: documents, query results, filters• Request handlers definition (search/update handlers)• Update request processor chains definition• Event listeners (newSearcher, firstSearcher)• Fine tuning parameters• …
  21. Solr – indexing• Update requests on the index are given as XML commands via HTTP POST• <add> to insert and update • <add> <doc boost="2.5"> • <field name="employeeId">05991</field> • </doc></add>• <delete> to remove by unique key or query • <delete><id>05991</id></delete> • <delete><query>office:Bridgewater</query></delete>• <commit/> reopens readers on the new index version• <optimize/> optimizes the index internal structure for faster access
  22. Solr – basic indexing• REST call – XML/JSON • curl 'http://localhost:8983/solr/update?commit=true' -H "Content-Type: text/xml" --data-binary '<add><doc><field name="id">testdoc</field></doc></add>' • curl 'http://localhost:8983/solr/update/json?commit=true' -H 'Content-type:application/json' -d '{ "add": {"doc": {"id" : "TestDoc1", "title" : "test1"} } }'
  23. Solr – binary files indexing • Many documents are produced in (proprietary) binary formats: PDF, RTF, XLS, etc. • Apache Tika is integrated in the Solr REST service for indexing such documents • curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" -F "myfile=@tutorial.html"
  24. Solr – index analysis• Luke is a tool for navigating Lucene indexes• For each field : top terms, distinct terms, terms histogram, etc.• LukeRequestHandler : • http://localhost:8983/solr/admin/luke? wt=xslt&tr=luke.xsl
  25. Solr – data import handler • DBMS • FileSystem • HTTP
  26. Solr – searching
  27. Solr – searching
  28. Solr – query syntax• query fields with fieldname:value• + – AND OR NOT operators• Range queries on date or numeric fields, ex: timestamp:[* TO NOW]• Boost terms, ex: people^4 profits• Fuzzy search, ex: roam~0.6• Proximity search, ex: “apache solr”~2• …
  29. Solr – basic search• parameters: • q: the query • start: offset of the first result • rows: max no. of results returned • fl: comma separated list of fields to return • defType: specify the query parser • debugQuery: enable query debugging • wt: result format (xml, json, php, ruby, javabin, etc)
  30. Solr – query parsers• Most used: • Default Lucene query parser • DisMax query parser • eDisMax query parser
  31. Solr – highlighting• can be done on fields with stored="true"• returns a snippet containing the highlighted terms for each doc• enabled with hl=true&hl.fl=fieldname1,fieldname2
  32. Solr – sorting results• Sorting can be done on the "score" of the document, or on any multiValued="false" indexed="true" field provided that field is either non-tokenized (ie: has no Analyzer) or uses an Analyzer that only produces a single term• add parameter &sort=score desc, inStock desc, price asc• can sort on function queries (see later)
  33. Solr – filter queries• get a subset of the index• place it in a cache• run queries for such a “filter” in memory• add parameter &fq=category:hardware• if multiple fq parameters the query will be run against the intersection of the specified filters
  34. Solr – facets• facet by: • field value • arbitrary queries • range• can facet on fields with indexed="true"
  35. Solr – function queries• allow deep customization of ranking : • http://localhost:8983/solr/select/? fl=score,id&q=DDR&sort=termfreq (text,memory)%20desc• functions : sum, sub, product , div ,pow, abs, log, sqrt, map, scale, termfreq, …
  36. Solr – query elevation• useful for “marketing”• configure the top results for a given query regardless of the normal Lucene scoring• http://localhost:8983/solr/elevate?q=best %20product&enableElevation=true
  37. Solr – spellchecking• collects suggestions about the input query• optionally corrects the user query with “suggested” terms
  38. Solr – spellchecking• build a spellcheck index dynamically• return suggested results• http://localhost:8983/solr/spell?q=hell ultrashar&spellcheck=true&spellcheck.collate=true&spellcheck.build=true• useful to create custom query converters • <queryConverter name="queryConverter"/>
  39. Solr – similarity• get documents “similar” to a given document or a set of documents• Vector Space Model• http://localhost:8983/solr/select? q=apache&mlt=true&mlt.fl=manu,cat&mlt. mindf=1&mlt.mintf=1&fl=id,score
  40. Solr – geospatial search• index location data• query by spatial concepts and sort by distance• find all documents with store position at no more than 5km than a specified point• http://localhost:8983/solr/select? &indent=true&fl=name,store&q=*:*&fq={!geofilt %20sfield=store}&pt=45.15,-93.85&d=5
  41. Solr – field collapsing• group resulting documents on per field basis • http://localhost:8983/solr/select? &indent=true&fl=id,name&q=solr +memory&group=true&group.field=man u_exact• useful for displaying results in a smart way• see SOLR-236
  42. Solr – join• new feature (SOLR-2272)• many users ask for it• quite of a paradigm change• http://localhost:8983/solr/select?q={!join +from=manu_id_s%20to=id} ipod&fl=id,manu_id&debugQuery=true
  43. Solr statistics
  44. Solr statistics
  45. Solr statistics
  46. Solr replication
  47. Solr Architectures• Simple• Multicore• Replication• Sharded
  48. Solr – MultiCore• Define multiple Solr cores inside one only Solr instance• Each cores maintain its own index• Unified administration interface• Runtime commands to create, reload, load, unload, delete, swap cores• Cores can be thought as ‘collections’• Allow no downtime while deploying new features/ bugfixes
  49. Solr – Replication• It’s useful in case of high traffic to replicate a Solr instance and split the search load (possibly with a VIP in front)• Master has the “original” index• Slave polls the master asking for the latest version of the index• If the slave has a different version of the index, it asks the master for the delta (rsync like)• In the meanwhile indexes remain available• No impact of indexing on search (almost)
  50. Solr – Shards• When an index is too large, in terms of space or memory required, it can be useful to define two or more shards• A shard is a Solr instance and can be searched or indexed independently• At the same time it’s possible to query all the shards having the result be merged from the sub-results of each shard• http://localhost:8983/solr/select?shards=localhost:8983/ solr,localhost:7574/solr&q=category:information• Note that the document distribution among indexes is up to the user (or who feeds the indexes)
  51. Solr – Architectures• When to use each?• KISS principle• High query load : replication• Huge index : shard
  52. Solr – Architectures• High queries w/ large indexes : shard + replication
  53. Solr – Architectures• Tips & Tricks: • Don’t share indexes between Master and Slaves on distributed file systems (locking) • Anyway get rid of distributed file systems (slow) • Lucene/Solr is I/O intensive thus behaves better with quick disks • Always use MultiCore – hot deploy of changes/ bugfixes • Replication is network intensive • Check replication poll time and indexing rates
  54. Tips&Tricks• Solr based SE development process• Plugins• Performance tuning• Deploy
  55. Process – t0 analysis• Analyze content• Analyze queries• Analyze collections• Pre-existing query/index load (if any)• Expected query/index load• Desired throughput/avg response time• First architecture
  56. Process – n-th iteration• index 10-15% content• search stress test (analyze peaks) – use SolrMeter• quality tests from stakeholders (accuracy, recall)• eventually add/reconfigure features• check http://wiki.apache.org/solr/FieldOptionsByUseCase and make sure fields used for faceting/sorting/highlighting/ etc. have proper options• need to change field types/analysis/options – rebuild the index
  57. Solr – Plugins• QParserPlugin• RequestHandler (Search/UpdateHandler)• UpdateRequestProcessor• ResponseWriter• Cache
  58. Performance tuning• A huge tuning is done in schema.xml• Configure Solr caches• Set auto commit where possible• Play with mergeFactor
  59. Performance tuning• The number of indexed fields greatly increases memory usage during indexing, segment merge time, optimization times, index size• Stored fields impact index size, search time, …• set omitNorms="true" where it makes sense (disabling length normalization and index-time boosting)• set omitTermFreqAndPositions="true" if no queries on this field use positions or if it should not influence score
  60. Performance tuning• FilterCache – unordered document ids for caching filter queries• QueryResultCache – ordered document ids for caching queries results (caching only the returned docs)• DocumentCache – stores stored fields (at least <max_results> * <max_concurrent_queries>• Setup autowarming – keep caches warm after commits
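The caches named in this slide live in solrconfig.xml. A rough sketch of the configuration: the cache classes are real Solr implementations, but the sizes and autowarm counts below are illustrative placeholders, not recommendations.

```xml
<!-- solrconfig.xml cache configuration (sizes are illustrative) -->
<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="32"/>
<documentCache class="solr.LRUCache" size="2048" initialSize="2048" autowarmCount="0"/>
```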
  61. Performance tuning• Choose correct cache implementation FastLRUCache vs LRUCache• FastLRUCache has faster gets and slower puts in single threaded operation and thus is generally faster than LRUCache when the hit ratio of the cache is high (> 75%)
  62. Performance tuning• Explicit warm sorted fields• Often check cache statistics• JVM options – don’t let the OS without memory!• mergeFactor – impacts on the number of index segments created on the disk • low mF : smaller number of index files, which speeds up searching but more segment merges slow down indexing • high mF : generally improves indexing speed but gets less frequent merges, resulting in a collection with more index files which may slow searching
  63. Performance tuning• set autocommit where possible; this will avoid closing and reopening IndexReaders every time a document is indexed – you can choose the max number of documents and/or the time to wait before automatically doing the commit• finally…you need to get your hands dirty!
  64. Deploy• SolrPackager by Simone Tripodi!• It’s a Maven archetype• Create standalone/multicore project• Each project will generate a master and a slave instance• Define environment dependent properties without having to manage N config files• ‘mvn -Pdev package’ // will create a Tomcat package for the development environment
  65. Case study
  66. Case study
  67. Case Study• Architecture analysis• Plugin development• Testing and support
  68. Challenges• Architecture• Schema design
  69. Challenge• Architecture • 4B docs of ~4k each • ~3 req/sec overall • 3 collections: • |archive| = 3B • |2010-2011| = 1M • |intranet| = 0.9B
  70. Challenge• Content analysis• get the example Solr schema.xml• optimize the schema in order to enable both stemmed and unstemmed versions of the fields: author, title, text, cat• add omitNorms="true" where possible• add a field ‘html_content’ which will contain HTML text but will be searched as clean text• all string fields should be lowercased
  71. Extras• Clustering (Solr-Carrot2)• Named entity extraction (Solr-UIMA)• SolrCloud (Solr-Zookeeper)• ManifoldCF• Stanbol EntityHub• Solandra (Solr-Cassandra)
  72. THANKS!

[repost ]Use cases of faceted search for Apache Solr


In this post I write about some use cases of facets for Apache Solr. Please submit your own ideas in the comments.
This post is split into the following parts:

  • What are facets?
  • How do you enable and use simple facets?
  • What are other use cases?
    1. Category navigation
    2. Autocompletion
    3. Trending keywords or links
    4. Rss feeds
  • Conclusion

What are facets?

In Apache Solr, elements for navigational purposes are called facets. Keep in mind that Solr provides filter queries (specified via the http parameter fq) which filter documents out of the search result. In contrast, facet queries only provide information (document counts) and do not change the result documents, i.e. they provide ‘filter queries for future queries’. So you define a facet query to see how many documents to expect if you applied the related filter query.

But a picture – from this great facet-introduction – is worth a thousand words:

What do you see?

  • You see different facets like Manufacturer, Resolution, …
  • Every facet has some constraints, with which the user can easily filter the search results
  • The breadcrumb shows all selected constraints and allows removing them

All these values can be extracted from Solr’s search results and can be defined at query time, which looks surprising if you come from FAST ESP. Nevertheless the fields on which you do faceting need to be indexed and untokenized, e.g. string or integer. So the type of the fields where you want to do faceting must not be the default ‘text’ type, which is tokenized.

In Solr you have normal facets, facet queries and date facets.

The normal facets can be useful if your documents have a manufacturer string field, e.g. a document can be within the ‘Sony’ or ‘Nikon’ bucket. In contrast you will need facet queries for integers like prices. For example if you specify a facet query from 0 to 10 EUR, Solr will calculate on the fly how many documents fall into that bucket. But facet queries become relatively unwieldy if you have several identical ranges like 0-10, 10-20, 20-30, … EUR. Then you can use range queries.
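Since the identical ranges are so repetitive, they are usually generated in a loop rather than written by hand. A small sketch (the field name "price" and the class are made up for illustration); each resulting string would be passed to addFacetQuery:

```java
import java.util.*;

// Generate the repetitive price-range facet queries (0-10, 10-20, ...)
// programmatically instead of writing each one out.
class PriceFacets {
    static List<String> priceRangeFacets(int from, int to, int step) {
        List<String> queries = new ArrayList<>();
        for (int lo = from; lo < to; lo += step) {
            queries.add("price:[" + lo + " TO " + (lo + step) + "]");
        }
        return queries;
    }
}
```

The range faceting mentioned above lets Solr build these buckets itself, but this shows what it replaces.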

Date facets are special range queries. As an example look into this screenshot from jetwick:

where the interval (which is called gap) for every bucket is one day.

For a nice introduction into facets have a look into this publication or use the solr wiki here.

How do you enable and use simple facets?

As stated before they can be enabled at query time. For the http API you add “&facet=true&facet.field=manu” to your normal query “http://localhost:8983/solr/select?q=*:*”. For SolrJ you do:

new SolrQuery("*:*").setFacet(true).addFacetField("manu");

In the Xml returned from the Solr server you will get something like this – again from this post:

<lst name="facet_fields">
   <lst name="manu">
      <int name="Canon USA">17</int>
      <int name="Olympus">12</int>
      <int name="Sony">12</int>
      <int name="Panasonic">9</int>
      <int name="Nikon">4</int>
   </lst>
</lst>

To retrieve this with SolrJ you don’t need to touch any Xml, of course. Just get the facet objects:

List<FacetField> facetFields = queryResponse.getFacetFields();

To append facet queries specify them with addFacetQuery:

solrQuery.addFacetQuery("quality:[* TO 10]").addFacetQuery("quality:[11 TO 100]");

And how would you query for documents that do not have a value for that field? This is easy: q=-field_name:[* TO *]

Now I’ll show you how I implemented date facets in jetwick:

q.setFacet(true).set("facet.date", "{!ex=dt}dt").
set("facet.date.start", "NOW/DAY-6DAYS").
set("facet.date.end", "NOW/DAY+1DAY").
set("facet.date.gap", "+1DAY");

With that query you get 7 day buckets which is visualized via:

It is important to note that you will have to use local parameters like {!ex=dt} to make sure that if a user applies a facet (uses the facet query as a filter query), the other facet queries won’t get a count of 0. In the picture the filter query was fq={!tag=dt}dt:[2010-12-04T00:00:00.000Z+TO+2010-12-05T00:00:00.000Z]. Again: the filter query needs to start with {!tag=dt} to make that work. Take a look into the DateFilter source code or this for more information.

Be aware that you will have to tune the filterCache in order to keep performance green. It is also important to use warming queries to avoid timeouts and to pre-fill the caches with heavily used data.

What are other use cases?

1. Category navigation

The problem: you have a tree of categories and your products are categorized in several of those categories.

There are two relatively similar solutions for this problem. I will describe one of them:

  • Create a multivalued string field called ‘category’. Use the category id (or name if you want to avoid DB queries).
  • You have a category tree. Make sure a document gets not only the leaf category, but all categories until the root node.
  • Now facet over the category field with -1 as the limit
  • But what if you want to display only the categories of one level? E.g. if you don’t want other levels at the same time, or if there are too many.
    Then index the category field as <level>_category. For that you will need the complete category tree in RAM while indexing. Then use facet.prefix=<level>_ to filter the category list for the level
  • Clicking on a category entry should result in a filter query like fq=category:”<level>_categoryId”
  • The little tricky part is that your UI or middle tier has to parse the level, e.g. 2, and then append 2+1=3 to the query: facet.prefix=3_
  • If you filter the level then one question remains:
    Q: how can you display the path from the selected category until the root category?
    A: Either get the category parents via DB, which is easy if you store the category ids in Solr – not the category names.
    Or get the parents from the parameter list which is a bit more complicated but doable. In this case you’ll need to store the category names in Solr.
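The level-prefixed indexing step above can be sketched as follows. This is an illustration, not production code: the 1-based level numbering and the names are assumptions for the example.

```java
import java.util.*;

// Expand a category path (root to leaf) into the level-prefixed values
// ("<level>_<category>") for the multivalued field, so that
// facet.prefix=2_ returns only level-2 categories.
class CategoryLevels {
    static List<String> levelValues(List<String> pathFromRootToLeaf) {
        List<String> values = new ArrayList<>();
        for (int i = 0; i < pathFromRootToLeaf.size(); i++) {
            values.add((i + 1) + "_" + pathFromRootToLeaf.get(i));
        }
        return values;
    }
}
```

So a product in electronics > cameras > dslr gets all three prefixed values indexed, not just the leaf.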

Please let me know if this explanation makes sense to you or if you want to see it in action – I don’t want to make advertisements for our customers here :-)

BTW: the second approach I have in mind is: instead of using facet.prefix you can use dynamic fields like category_<level>_s

Special hint: if there are too many facets you can even page through them!

2. Autocompletion

The problem: you want to show suggestions as the user types.

You’ll need a multivalued ‘tag’ field. For jetwick I’m using a heavy noise-word filter to get only terms ‘with information’ into the tag field, from the very noisy tweet text. If you are using a shingle filter you can even create phrase suggestions. But I will describe the “one more word” suggestion here, which will only suggest the next word (not a completely different phrase).

To do this, create the following query when the user types in some characters (see the getQueryChoices method of SolrTweetSearch):

  • Use the old query with all filter queries etc. to provide a context-dependent autocomplete (i.e. only give suggestions which will lead to results)
  • split the query into “completed” terms and one “to do” term. E.g. if you enter “michael jack”
    Then michael is complete (ends with space) and jack should be completed
  • set the query term of the old query to michael and add the facet.prefix=jack
  • set facet limit to 10
  • read the 10 suggestions from the facet field but exclude already completed terms.
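The split into "completed" terms and the "to do" term from the steps above can be sketched like this; it is a simplification, not the actual jetwick implementation.

```java
// Split the typed input into the completed terms (everything before the
// last space) and the trailing partial term that becomes facet.prefix.
class AutoCompleteSplit {
    // Returns { completedTerms, prefix }.
    static String[] split(String input) {
        int lastSpace = input.lastIndexOf(' ');
        String completed = lastSpace < 0 ? "" : input.substring(0, lastSpace).trim();
        String prefix = input.substring(lastSpace + 1);
        return new String[] { completed, prefix };
    }
}
```

For "michael jack" this yields query term "michael" and facet.prefix "jack", matching the example in the text.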

The implementation for jetwick which uses Apache Wicket is available in the SearchBox source file which uses MyAutoCompleteTextField and the getQueryChoices method of SolrTweetSearch. But before you implement autocomplete with facets take a look into this documentation. And if you don’t want to use wicket then there is a jquery autocomplete library especially for solr – no UI layer required.

3. Trending keywords or links

Similar to autocomplete, you will need a tag or link field in your index. Then use the facet counts as an indicator of how important a term is. If you now run a query, e.g. solr, you will get the trending keywords and links depending on the filters. E.g. you can select different days to see the changes:
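A minimal sketch of such a trend request, assuming the index has the ‘tag’ and ‘link’ fields mentioned above and a date field called dt (the date field name and filter syntax are assumptions):

```python
def trend_params(query, day_filter=None):
    """Facet on the hypothetical 'tag' and 'link' fields; the facet
    counts then serve as the importance indicator for each term."""
    params = {
        "q": query,
        "rows": 0,
        "facet": "true",
        "facet.field": ["tag", "link"],   # Solr accepts repeated facet.field
        "facet.limit": 20,
        "facet.mincount": 2,              # ignore one-off terms
    }
    if day_filter:
        # e.g. "dt:[NOW/DAY-1DAY TO NOW/DAY]" to compare different days
        params["fq"] = day_filter
    return params
```

Switching the fq between days is what lets you watch the trending terms change over time.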

The keyword panel is implemented in the TagCloudPanel and the link list is available as UrlTrendPanel.

Of course it would be nice if we could get the accumulated score of every link instead of a simple ‘count’, to prevent spammers from reaching this list. For that, look into this JIRA issue and into the StatsComponent. As I explained in the JIRA issue, this nice feature could be simulated with the result grouping feature.

4. RSS feeds

If you log in at jetwick.com you’ll see this idea implemented. Every user can have different saved searches. For example I have one search for ‘apache solr’ and one for ‘wikileaks’. Every search can contain additional filters, such as German language only, or sorting by retweets. Now the task is to transform that query into a facet query:

  • insert ANDs between the query and all the filter queries
  • remove all date filters
  • add one date filter with the date of the last processed search (‘last date’)
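The three steps above could be sketched like this; the date field name dt and the convention that date filters start with "dt:" are assumptions on my side:

```python
def new_tweet_count_query(query, filter_queries, last_date):
    """Collapse a saved search into one query string: AND the query
    with its non-date filters, then add a 'since last check' range."""
    parts = ["(" + query + ")"]
    for fq in filter_queries:
        if not fq.startswith("dt:"):          # drop existing date filters
            parts.append("(" + fq + ")")
    parts.append("dt:[" + last_date + " TO *]")  # only tweets since last check
    return " AND ".join(parts)

q = new_tweet_count_query("apache solr",
                          ["lang:de", "dt:[A TO B]"],
                          "2011-01-01T00:00:00Z")
```

Each resulting string can then be sent as a facet.query, so the new-tweet counts for all saved searches come back in a single Solr request.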

Then you will see how many new tweets are available for every saved search:

Update: no need to click refresh to see the counts. The count-update is done in the background via JavaScript.


There are a lot of applications for faceted search, and it is very convenient to use. Okay, the ‘local parameter hack’ is a bit daunting, but hey: it works :-)

It is nice that I can specify different facets for every query in Solr; with that feature you can generate personalized facets, as explained under “RSS feeds”.

One improvement to the facets implemented in Solr could be a feature which does not calculate the count, but instead sums up a fieldA for documents with the same value in fieldB, or even returns the score for a facet or a facet query. That would improve the “trending keywords or links” use case.

[repost]The Apache Projects – The Justice League Of Scalability

In this post I will define what I believe to be the most important projects within the Apache Projects for building scalable web sites and generally managing large volumes of data.

If you are not aware of the Apache Projects, then you should be. Do not mistake this for The Apache Web (HTTPD) Server, which is just one of the many projects in the Apache Software Foundation. Each project is its own open-source movement and, while Java is often the language of choice, it may have been created in any language.

I often see developers working on solutions that can be easily solved by using one of the tools in the Apache Projects toolbox. The goal of this post is to raise awareness of these great pieces of open-source software. If this is not new to you, then hopefully it will be a good recap of some of the key projects for building scalable solutions.

The Apache Software Foundation
provides support for the Apache community of open-source software projects.

You have probably heard of many of these projects, such as Cassandra or Hadoop, but maybe you did not realize that they all come under the same umbrella. This umbrella is known as the Apache Projects and is a great place to keep tabs on existing and new projects. It is also a gold seal of approval in the open-source world. I think of these projects, or tools, as the superheroes of building modern scalable websites. By day they are just a list of open-source projects listed on a rather dry and hard-to-navigate website at http://www.apache.org. But by night… they do battle with some of the world’s most gnarly datasets. Terabytes and even petabytes are nothing to these guys. They are the nemesis of high throughput, the Digg effect, and the dreaded queries-per-second upper limit. They laugh in the face of limited resources. Ok, maybe I am getting carried away here. Let’s move on…

Joining The Justice League

Prior to a project joining the Apache Projects it will first be accepted into the Apache Incubator. For instance, Deltacloud has recently been accepted into the Apache Incubator.

The Apache Incubator has two primary goals:

  • Ensure all donations are in accordance with the ASF legal standards
  • Develop new communities that adhere to our guiding principles

– Apache Incubator Website

You can find a list of all the projects currently in the Apache Incubator here.

Our Heroes

Here is a list, in no particular order, of tools you will find in the Apache Software Foundation Projects.


While each has its major benefits, none is a “silver bullet” or a “golden hammer”. When you design software that scales or does any job very well, you have to make certain choices. If you are not using the software for the task it was designed for, then you will quickly find its weaknesses. Understanding and balancing the strengths and weaknesses of the various solutions will enable you to better design your scalable architecture. Just because Facebook uses Cassandra to do battle with its petabytes of data does not necessarily mean it will be a good solution for your “simple” terabyte-wrestling architecture. What you do with the data is often more important than how much data you have. For instance, Facebook has decided that HBase is now a better solution for many of their needs. [This is the cue for everyone to run to the other side of the boat].

Kryptonite is one of the few things that can kill Superman.

The word kryptonite is also used in modern speech as a synonym for Achilles’ heel, the one weakness of an otherwise invulnerable hero.

– “Kryptonite” by Wikipedia

Now, let’s look in more detail at some of the projects that can fly faster than a speeding bullet and leap tall datasets in a single [multi-server, distributed, parallelized, fault-tolerant, load-balanced and adequately redundant] bound.

Apache Cassandra

Cassandra was built by Facebook to hold the ridiculous volumes of data within its email system. It’s much like a distributed key-value store, but with a hierarchy. The model is very similar to most NoSQL databases. The data-model consists of columns, column-families and super-columns. I will not go into detail here about the data-model, but there is a great intro (see “WTF is a SuperColumn? An Intro to the Cassandra Data Model“) that you can read.

Cassandra can handle fast writes and reads, but its Kryptonite is consistency. It takes time to make sure all the nodes serving the data have the same value. For this reason Facebook is now moving away from Cassandra for its new messaging system, to HBase. HBase is a NoSQL database built on-top of Hadoop. More on this below.

Apache Hadoop

Apache Hadoop, son of Apache Nutch and later raised under the watchful eye of Yahoo, has since become an Apache Project. Nutch is an open-source web search project built on Lucene. The component of Nutch that became Hadoop gave Nutch its “web-scale”.

Hadoop’s goal was to manage large volumes of data on commodity hardware, such as thousands of desktop hard drives. Hadoop takes much of its design from papers published by Google on the Google File System and MapReduce. It stores data on the Hadoop Distributed File-System (a.k.a. “HDFS”) and manages the running of distributed Map-Reduce jobs. In a previous post I gave an example of using Ruby with Hadoop to perform Map-Reduce jobs.

Map-Reduce is a way to crunch large datasets using two simple algorithms (“map” and “reduce”). You write these algorithms specific to the data you are processing. Although your map and reduce code can be extremely simple, it scales across the entire dataset using Hadoop. This applies even if you have petabytes of data across thousands of machines. Your resulting data can be found in a directory on your HDFS disk when the Map-Reduce job completes. Hadoop provides some great web-based tools for visualizing your cluster and monitoring the progress of any running jobs.
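As a toy illustration (not tied to any particular Hadoop API), here is the classic word count written as a single-process sketch of the two algorithms. Hadoop’s job is to run exactly this shape of computation, distributed across a cluster, with the grouping step (the “shuffle”) handled for you:

```python
from collections import defaultdict

def mapper(line):
    # emit a (word, 1) pair for every word, like a Hadoop map task
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    # sum all values emitted for one key, like a Hadoop reduce task
    return word, sum(counts)

def map_reduce(lines):
    # Hadoop's shuffle phase groups the map output by key in between
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(k, v) for k, v in groups.items())

print(map_reduce(["the cat", "the dog"]))  # {'the': 2, 'cat': 1, 'dog': 1}
```

The mapper and reducer never see the whole dataset, which is precisely why the same two functions keep working when the input grows to petabytes.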

Hadoop deals with very large chunks of data. You can tell Hadoop to Map-Reduce across everything it finds under a specific directory within HDFS and then output the results to another directory within HDFS. Hadoop likes [really really] large files (many gigabytes) made from large chunks (e.g. 128 MB), so it can stream through them quickly, without many disk-seeks, and manage distributing the chunks effectively.

Hadoop’s Kryptonite would be its brute force. It is designed for churning through large volumes of data, rather than being real-time. A common use-case is to spool up data for a period of time (say, an hour) and then run your map-reduce jobs on that data. By doing this you can very efficiently process vast amounts of data, but you will not have real-time results.

Recommended Reading
Hadoop: The Definitive Guide by Tom White

Apache HBase

Apache HBase is a NoSQL layer on top of Hadoop that adds structure to your data. HBase uses write-ahead logging to manage writes, which are then merged down to HDFS. A client request is responded to as soon as the update is written to the write-ahead log and the change is made in memory. This means that updates are very fast. The read side is also fast: because data is stored on disk in key order, scans across sequential keys require only a low number of disk seeks. Larger scans are not currently possible.
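To make the write path concrete, here is a toy sketch (emphatically not the real HBase API) of the two ideas above: append to a write-ahead log first, then apply the change in memory, and keep keys sorted so sequential scans need no random seeks:

```python
import bisect

class TinyStore:
    """Toy model of the HBase write path and key-ordered scans.
    All names and structures here are illustrative only."""
    def __init__(self):
        self.wal = []    # stands in for the on-disk write-ahead log
        self.keys = []   # kept sorted -> cheap sequential scans
        self.data = {}

    def put(self, key, value):
        self.wal.append((key, value))      # durable first...
        if key not in self.data:
            bisect.insort(self.keys, key)
        self.data[key] = value             # ...then visible in memory

    def scan(self, start, stop):
        # one binary search to find the start, then a straight walk
        i = bisect.bisect_left(self.keys, start)
        j = bisect.bisect_left(self.keys, stop)
        return [(k, self.data[k]) for k in self.keys[i:j]]
```

The point of the sketch is the scan: after one lookup, neighbouring keys are read in order, which is why range scans over sequential keys are cheap.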

HBase’s Kryptonite would be similar to that of most NoSQL databases out there. Many use-cases still benefit from using a relational database, and HBase is not a relational database.

Look Out For This Book Coming Soon
HBase: The Definitive Guide by Lars George (May 2011)
Lars has an excellent blog that covers Hadoop and HBase thoroughly.

Apache ZooKeeper

Apache ZooKeeper is the janitor of our Justice League. It is being used more and more in scalable applications such as Apache HBase, Apache Solr (see below) and Katta. It manages an application’s distributed needs, such as configuration, naming and synchronization. All these tasks are important when you have a large cluster with constantly failing disks, failing servers, replication and shifting roles between your nodes.

Apache Solr

Apache Solr is built on-top of one of my favorite Apache Projects, Apache Lucene [Java]. Lucene is a powerful search engine API written in Java. I have built large distributed search engines with Lucene and have been very happy with the results.

Solr packages up Lucene as a product that can be used stand-alone. It provides various ways to interface with the search engine, such as via XML or JSON requests. Therefore, Java knowledge is not a requirement for using it. It adds a layer to Lucene that makes it more easily scale across a cluster of machines.

Apache ActiveMQ

Apache ActiveMQ is a message queue, much like RabbitMQ or ZeroMQ. I mention ActiveMQ because I have used it, but it is definitely worth checking out the alternatives.

A message queue is a way to quickly collect data, funnel the data through your system and use the same information for multiple services. This provides separation, within your architecture, between collecting data and using it. Data can be entered into different queues (data streams). Different clients can subscribe to these queues and use the data as they wish.

ActiveMQ has two types of queue, “queue” and “topic”.

The queue type “queue” means that each piece of data on the queue can only be read once. If client “A” reads a piece of data off the queue then client “B” cannot read it, but can read the next item on the queue. This is a good way of dividing up data across a cluster. All the clients in the cluster will take a share of the data and process it, but the whole dataset will only be processed once. Faster clients will take a larger share of the data and slow clients will not hold up the queue.

A “topic” means that each client subscribed will see all the data, regardless of what the other clients do. This is useful if you have different services all requiring the same dataset. It can be collected and managed once by ActiveMQ, but utilized by multiple processors. Slow clients can cause this type of queue to back-up.
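The difference between the two modes is easy to show with a toy broker (this is an illustration of the semantics only, not the real ActiveMQ API):

```python
from collections import deque

class Broker:
    """Toy model of ActiveMQ's two delivery modes: a 'queue' hands
    each message to exactly one consumer, a 'topic' hands every
    message to every subscriber."""
    def __init__(self):
        self.queue = deque()
        self.subscribers = []

    # -- queue semantics: first consumer to poll wins the message --
    def send_queue(self, msg):
        self.queue.append(msg)

    def poll(self):
        return self.queue.popleft() if self.queue else None

    # -- topic semantics: broadcast into each subscriber's buffer --
    def subscribe(self):
        buf = deque()
        self.subscribers.append(buf)
        return buf

    def publish(self, msg):
        for buf in self.subscribers:
            buf.append(msg)
```

With a queue, a second poll for the same message returns nothing; with a topic, every subscriber’s buffer receives its own copy. This also shows why a slow topic subscriber can cause a backlog: its buffer keeps growing.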

If you are interested in message queues then I suggest checking out these Message Queue Evaluation Notes by Second Life, who are heavy users of messaging queues.

Apache Mahout

The son of Lucene and now a Hadoop side-kick, Apache Mahout was born to be an intelligence engine. From the Hindi word for “elephant driver” (Hadoop being the elephant), Mahout has grown into a top-level Apache Project in its own right, mastering the art of artificial intelligence on large datasets. While Hadoop can tackle the more heavy-weight datasets on its own, more cunning datasets require a little more algorithmic manipulation. Much of the focus of Mahout is on large datasets using Map-Reduce on top of Hadoop, but the code-base is optimized to run well on non-distributed datasets as well.

We appreciate your comments

If you found this blog post useful then please leave a comment below. I would like to hear which other Apache Projects you think deserve more attention and if you have ever been saved, like Lois Lane, by one of the above.