Alternative Indexing/Search Solutions#
alm.solrindex#
alm.solrindex
is another add-on for connecting Plone search to Solr.
It takes a different approach:
collective.solr
wraps the Zope catalog. Each item is indexed both in the ZCatalog and in Solr, typically including many indexes in both. When a search is performed, based on the indexes used, it decides to query either ZCatalog or solr but not both.alm.solrindex
operates as an index within the Zope catalog, replacing the standard SearchableText index. Solr only needs to index the fulltext, and the ZCatalog no longer needs to do so. When a search is performed that includes a SearchableText criterion, first alm.solrindex will query solr for results, then those results will be further filtered by other ZCatalog indexes.
Pros:
Solr is more efficient than ZCTextIndex at indexing and querying fulltext.
Avoids duplication of index storage.
Less data needs to be sent between Plone and solr when indexing.
Don't need to add new indexes to Solr and reindex.
Cons:
No admin UI in Plone control panel.
Customizations can require monkey patching.
Potential for missing some results. (see below)
Setup#
We set up Solr in our buildout in a similar way,
using the hexagonit.recipe.download and collective.recipe.solr
buildout recipes.
The solr-instance
buildout part looks a bit different.
[solr-instance]
recipe = collective.recipe.solrinstance
solr-location = ${solr-download:location}
host = ${settings:solr-host}
port = ${settings:solr-port}
basepath = /solr
max-num-results = 500
default-search-field = SearchableText
unique-key = docid
index =
name:docid type:integer stored:true required:true
name:SearchableText type:text stored:false
name:Title type:text stored:false
name:Description type:text stored:false
We set the
unique-key
identifying the record todocid
.alm.solrindex
will pass the ZCatalog's internal integer record id (rid
) in this field.We set the
default-search-field
to SearchableText, so that Solr queries which don't specify a field will use SearchableText.We configure fields for docid and each of the standard Plone fulltext indexes, but not any other fields.
We set
stored: false
on the indexes so that Solr will only store the docid.
We also need to reference the Solr URI in an environment variable for the Plone instance part,
so that alm.solrindex
knows where to connect
[instance]
environment-vars =
SOLR_URI http://${settings:solr-host}:${settings:solr-port}/solr
After running buildout,
we can start Plone and activate alm.solrindex
in the Add-ons control panel.
Note
The default installation profile removes the existing SearchableText, Title, and Description indexes, but does not automatically reindex existing content.
If you have existing content in the site, you'll need to do a full reindex of the ZCatalog to get them indexed in Solr.
Why Are Results Missing?#
There is a limitation to this approach.
Solr is configured with a maximum limit on the number of results it will return
(max-num-results
in the buildout configuration).
This is done because it hurts performance if there are thousands and thousands of results,
and Solr has to serialize all of them and Plone has to deserialize all of them.
For queries that only use indexes that are in Solr (i.e. the fulltext indexes), this is not a big problem.
Solr ranks the results so the limited set it returns should be the most relevant results, and most users are not going to navigate past more than a few pages of results anyway.
It can be a problem when the search term is very generic (so there are many results and its hard for Solr to determine the most relevant ones) and the results are also going to be filtered by other indexes (such as in a faceted search solution).
In this case the limited result set from Solr is fairly arbitrary, the other filters only get to operate on this limited set, and we might end up missing results that should be there.
Example: Consider a site where there are 10,000 items with the term 'pdf', including one in a folder "/annual-reports/2015". If a search is performed for 'pdf' within the path '/annual-reports/2015':
First Solr finds all documents matching 'pdf', and ranks them.
Next it returns the top 500 results to Plone.
Next Plone filters those results by path. There is a good chance that our target document was not included in the 500 that Solr returned, so this filters down to no results.
There are a couple workarounds for this problem, both of which have their own tradeoff:
Increase
max-num-results
above the total number of documents (but this will hurt performance for queries that return many results).Make sure that other indexes that are likely to narrow down the results a lot are also included in Solr (but this detracts from the main advantages of using
alm.solrindex
overcollective.solr
).
Customization#
Each type of field has its own handler which takes care of translating between ZCatalog and Solr queries. These can be overridden to handle advanced customization:
Example: monkey patch the TextFieldHandler
to use an edismax
query that allows boosting some fields
from Products.PluginIndexes.common.util import parseIndexRequest
from alm.solrindex.handlers import TextFieldHandler
from alm.solrindex.quotequery import quote_query
def parse_query(self, field, field_query):
name = field.name
request = {name: field_query}
record = parseIndexRequest(request, name, ('query',))
if not record.keys:
return None
query_str = ' '.join(record.keys)
if not query_str:
return None
if name == 'SearchableText':
q = quote_query(query_str)
else:
q = u'+%s:%s' % (name, quote_query(query_str))
return {
'q': q,
'defType': 'edismax',
'qf': 'Title^10 Description^2 SearchableText^0.2', # boost fields
'pf': 'Title~2^20 Description~5^5 SearchableText~10^2', # boost phrases
}
TextFieldHandler.parse_query = parse_query
Example: Add a path
index that works like Zope's ExtendedPathIndex
(i.e. it'll find anything whose path begins with the query value):
solr.cfg
[solr-instance]
...
index =
...
name:path type:descendent_path stored:false
handlers.py
from alm.solrindex.handlers import DefaultFieldHandler
class PathFieldHandler(DefaultFieldHandler):
def parse_query(self, field, field_query):
query = super(PathFieldHandler, self).parse_query(field, field_query)
if query == {'fq': 'path:""'}:
return {}
return query
def convert_one(self, value):
# avoid including the site path in the index data
if value.startswith('/Plone'):
value = value[6:]
return super(PathFieldHandler, self).convert_one(value)
ZCML:
<utility component=".handlers.PathFieldHandler"
provides="alm.solrindex.interfaces.ISolrFieldHandler"
name="path" />
DIY Solr#
If both collective.solr and alm.solrindex are too much for you or you have special needs, you can access Solr by custom code. This might be, if you:
need to access a Solr server with a newer version / multicore setup and you don't have access to the configuration of Solr
Only want a fulltext search page of a small site with no need for full realtime support
You can find a full-featured example of a full-fledged custom Solr integration at the Ploneintranet (advanced!):
collective.elasticsearch#
Another option for an advanced search integration is the younger project Elasticsearch. Like for Solr, the technical foundation is the Lucene index, written in Java.
Pros of Elasticsearch
It uses JSON instead of an XML schema for (field) configuration, which might be easier to configure.
Clustering and replication is built in from the beginning. It is easier to configure. Especially ad-hoc cluster which can (re)configure automatically.
The project and community is agile and active.
Cons of Elasticsearch
JSON is abused as Query DSL. It can lead to queries with up to 10 layers. This can be annoying especially if you write them programatically.
The integration of Elasticsearch with Plone is done with https://pypi.org/project/collective.elasticsearch/
Google Custom Search#
Google provides a couple related tools for using Google as a site-specific search engine embedded in your site: Google Custom Search (free, ad-supported) and Google Site Search (paid).
Note
don't confuse these solutions with Google Search Appliance, which was a rack-mounted device which has been discontinued.
Pros:
Better ranking of results compared to ZCTextIndex.
Fairly straightforward to integrate.
GUI control panel for basic configuration.
Don't have to run and maintain a separate Java service.
Can easily be configured to search multiple websites.
Cons:
Free version includes Google branding and ads in results.
Cannot index private items.
Changes are not indexed immediately (usually within a week).
Only returns top 100 results for a query.
Only useful for fulltext search, not searching specific fields.
Limited control over result ranking and formatting.
Google has a habit of discontinuing free services.