Reindexing a Solr Core

Relevance tuning consulting often involves clients sharing an existing index, either by handing over the index files or by setting up access to their test environment. In my experience, it is rare for a reindexing process to be in place, and even when there is one, it is often better to be able to reindex on your own to avoid long turnarounds. Elasticsearch has a nice reindex API, but by now you should be used to the fact that Solr does not have such nice-to-have, simple-to-use features. What are the alternatives?

You could write some indexing code if you have access to the data source, but that is rarely the case. If you are lucky, the fields you need are stored, so you can use the existing core as the data source. In that case, don't forget that you should use cursors (the cursorMark parameter, which requires a sort that includes the uniqueKey field) to load data page by page and index it in batches. You can do this with a simple shell script, a JMeter script, a JavaScript program, or Java with SolrJ.

Luckily, Solr ships with a component that does just that: SolrEntityProcessor. Using it means that you have to include the DIH (DataImportHandler) libs in your target Solr and set up a DIH handler. It is a simple processor, and this blog post is not a DIH or SolrEntityProcessor tutorial, so here is just a sample config:


  <dataConfig>
    <document>
      <entity name="recent" processor="SolrEntityProcessor"
            url="http://localhost:8983/solr/test_core"
            query="publish_date:[2015-07-01T00:00:00Z TO *]"
            fl="id,title,content,publish_date"/>
    </document>
  </dataConfig>
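
For completeness, the "include DIH libs and set up a DIH handler" step in the target core's solrconfig.xml could look roughly like the sketch below. The lib directory, the jar name pattern and the file name data-config.xml are assumptions that depend on your Solr version and install layout, so treat this as a starting point and not as a definitive setup.

```xml
<!-- solrconfig.xml of the target core (sketch; paths and file name are assumptions) -->

<!-- Load the DIH jars; the directory depends on how your Solr is installed -->
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar" />

<!-- Expose DIH at /dataimport and point it at the file holding the dataConfig shown above -->
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>
```

With a handler registered like this, the import is typically started by sending a request with command=full-import to /dataimport on the target core.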

To avoid changes in indexing code, it is better not to rename fields in the entity processor definition. Instead, use copyField in combination with an ignored field type and keep field name changes within Solr. Beyond that, Solr's UpdateRequestProcessors are a nice addition to your toolbox.
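
As an illustration of the copyField plus ignored type combination, here is a minimal schema sketch for the target core. It assumes the incoming title field should end up under a different name; the target name title_en and the text_general type are hypothetical and only there to show the pattern.

```xml
<!-- schema.xml of the target core (sketch; title_en and text_general are assumptions) -->

<!-- A type that accepts values but neither indexes nor stores them -->
<fieldType name="ignored" class="solr.StrField" indexed="false" stored="false"/>

<!-- Incoming name, exactly as sent by SolrEntityProcessor: accepted but not kept -->
<field name="title" type="ignored"/>

<!-- The name the target core actually indexes and searches on -->
<field name="title_en" type="text_general" indexed="true" stored="true"/>

<!-- copyField works on the raw incoming value, so the copy happens even though title is not kept -->
<copyField source="title" dest="title_en"/>
```

The entity definition and any other indexing code keep sending title, and the rename lives entirely in the schema.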

This is not a perfect replacement for a reindex API, but if you ignore the fact that it requires DIH to work, it can be good enough. I hope to find some time to try to extract the existing code into a simpler handler and avoid the additional dependencies. Stay tuned.
