Elasticsearch on VIP Go

Jetpack Search is a powerful replacement for WordPress’ built-in search, available to all sites hosted on the VIP Go platform. It provides higher quality results and better scalability, and runs in the WordPress.com cloud. It’s powered by Elasticsearch (ES), the industry-leading open source search engine.

Why use Jetpack Search?

With Jetpack Search, you can easily replace WordPress’ built-in search functionality with supercharged search that is more accurate and significantly faster (in some cases, we’ve seen queries that are 10-300x faster). You can also use powerful features such as Aggregations (to allow search filters/drill-downs), Geolocation-based queries, and other complex queries that often don’t scale well within WordPress (e.g. NOT IN, taxonomy, postmeta, etc).

You get a fully managed Elasticsearch service, where we handle all the heavy lifting around mapping, indexing, and making your site’s data available in an easily searchable format.

How Does It Work?

There are two prerequisites for using Jetpack Search on a VIP Go site:

  1. An active Jetpack connection on your site.
  2. An active Elasticsearch Index on WordPress.com.

The content from your site is synced to the WordPress.com cloud via the Jetpack Sync module and then indexed in Elasticsearch. Most content updates are synced and indexed within a minute.

Note that the index needs to be created and set up by the VIP team; please open a support ticket and we’d be happy to do that for you.

The easiest way to integrate Jetpack Search is to enable the module.

How to activate the Search module

We recommend enabling the Search module via code:

add_filter( 'option_jetpack_active_modules', 'x_enable_jetpack_search_module', 9999 );

function x_enable_jetpack_search_module( $modules ) {
    $modules = array_merge( $modules, [ 'search' ] );
    $modules = array_unique( $modules );

    return $modules;
}

Note that this needs to be loaded before the `plugins_loaded` hook is fired; we recommend doing so via `client-mu-plugins`.

If you’d prefer to use a UI, you can also enable it from the WordPress.com Dashboard under `Settings > Traffic`.

Jetpack Search provides powerful filtering capabilities to allow users to refine the results across a number of criteria, like post type, category, and date. These are also sometimes referred to as Aggregations (and in earlier versions as Facets).

To enable filters on Jetpack Search, you need to define which filters you want to enable and then integrate the filtering into your site.

To define the filters, use Jetpack_Search::set_filters():

add_action( 'init', 'x_init_jetpack_search_filters' );
function x_init_jetpack_search_filters() {
    Jetpack_Search::instance()->set_filters( [
        'Content Type' => [
            'type'     => 'post_type',
            'count'    => 10,
        ],
        'Categories' => [
            'type'     => 'taxonomy',
            'taxonomy' => 'category',
            'count'    => 10,
        ],
        'Tags' => [
            'type'     => 'taxonomy',
            'taxonomy' => 'post_tag',
            'count'    => 10,
        ],
        'Year' => [
            'type'     => 'date_histogram',
            'field'    => 'post_date',
            'interval' => 'year',
            'count'    => 10,
        ],
        'Month' => [
            'type'     => 'date_histogram',
            'field'    => 'post_date',
            'interval' => 'month',
            'count'    => 10,
        ],
    ] );
}

To integrate with your site, you can use the “Search Filters” widget included in Jetpack. Alternatively, for more control, you can integrate the filters yourself. See the code for the widget as a guideline on how to do so.
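
If you integrate the filters yourself, the rendering step might look roughly like the sketch below. It assumes each filter is an array with a `buckets` list whose entries carry `name`, `count`, and `url` keys. That shape and the `x_render_search_filters()` helper are illustrative assumptions, so check the widget code for the structure actually returned on your site.

```php
<?php
/**
 * Render search filters as headed lists of links.
 *
 * ASSUMPTION: $filters maps a label to
 * [ 'buckets' => [ [ 'name' => ..., 'count' => ..., 'url' => ... ], ... ] ].
 * Inside WordPress, prefer esc_html() and esc_url() over htmlspecialchars().
 */
function x_render_search_filters( array $filters ) {
    $html = '';

    foreach ( $filters as $label => $filter ) {
        if ( empty( $filter['buckets'] ) ) {
            continue; // Skip filters with no matching results.
        }

        $html .= '<h3>' . htmlspecialchars( $label, ENT_QUOTES ) . '</h3><ul>';

        foreach ( $filter['buckets'] as $bucket ) {
            $html .= sprintf(
                '<li><a href="%s">%s</a> (%d)</li>',
                htmlspecialchars( $bucket['url'], ENT_QUOTES ),
                htmlspecialchars( $bucket['name'], ENT_QUOTES ),
                (int) $bucket['count']
            );
        }

        $html .= '</ul>';
    }

    return $html;
}
```

Keeping the renderer as a plain function that takes the filter data as an argument makes it easy to test without a live index.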

Modifying Queries

You can modify the Elasticsearch queries generated by the plugin by hooking into the jetpack_search_es_query_args filter and adjusting the args as needed.

For example, here’s an easy way to limit search results to a specific date range:

add_filter( 'jetpack_search_es_query_args', function( $args, $query ) {
    $args['date_range'] = array(
        'field' => 'date',
        'gte' => '2016-01-01',
        'lte' => '2016-12-31',
    );

    return $args;
}, 10, 2 );

Searching Through And Indexing Post Meta

By default, only a small subset of post metadata is synced and indexed. To index and query additional metadata, you can whitelist the keys:

add_filter( 'jetpack_sync_post_meta_whitelist', function( $keys ) {
    // Any keys
    $keys[] = 'vip_post_subtitle';
    return $keys;
} );

After the whitelist addition, the site’s content will need to be re-synced and re-indexed. This can be done from the WordPress.com Dashboard under Settings > Manage your connection > Initiate a sync manually.

You can then do a term query against meta.[NAME].value where [NAME] is the meta key you want to search through.

add_filter( 'jetpack_search_es_wp_query_args', function( $args, $query ) {
	$args['query_fields'] = [
		'title',
		'content',
		'author',
		'tag',
		'category',
		'meta.vip_post_subtitle.value',

		// You can also search across taxonomy fields.
		// e.g. searching through terms in a "location" taxonomy
		// 'taxonomy.location.name',
	];

	return $args;
}, 10, 2 );

You can also filter based on metadata. Here’s an example that excludes sponsored posts from search results:

add_filter( 'jetpack_search_es_query_args', function( $es_args, $query ) {
	$filter = [
		'not' => [
			'term' => [
				'meta.is_sponsored_post.value' => 1,
			],
		],
	];

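	if ( ! isset( $es_args['filter'] ) ) {
		$es_args['filter'] = $filter;
	} elseif ( ! isset( $es_args['filter']['and'] ) ) {
		// Wrap the existing filter and the new one in a single "and" clause.
		$es_args['filter'] = [ 'and' => [ $es_args['filter'], $filter ] ];
	} else {
		$es_args['filter']['and'][] = $filter;
	}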

	return $es_args;
}, 10, 2 );

Complex Data Types

While it is possible to whitelist additional postmeta, there are two types of data that are not indexed:

  1. “private” postmeta (i.e. keys prefixed with _)
  2. non-primitive data types like arrays and objects

All content in the default index is meant to be public, so private metadata is ignored by default. Serialized versions of objects and arrays tend to include additional metadata which doesn’t make sense to index as-is.

To work around these restrictions, you can extract out searchable pieces of content into separate post meta entries on add/update (and remove on delete). The only use for these entries is for indexing and querying.

For example, let’s say we have a post meta entry (x_content_blocks) which stores an object with various content strings and other metadata used to build out our page. As-is, the content within this object will not be searchable. However, we can extract out all the searchable content any time this post meta entry is added or updated. The extracted content can be combined together into a single string and inserted into a new post meta entry called es_x_content_blocks. Similarly, we can remove the meta entry when all the content blocks are deleted.

We can then whitelist the meta key, proceed with a full re-sync and index, and then be able to search through this field.

Here’s some sample code that you can use as a starting point:

add_action( 'updated_post_meta', 'x_maybe_require_search_extraction', 10, 4 );
add_action( 'added_post_meta', 'x_maybe_require_search_extraction', 10, 4 );
add_action( 'deleted_post_meta', 'x_maybe_require_search_extraction', 10, 4 );

function x_maybe_require_search_extraction( $meta_id, $post_id, $meta_key, $meta_value ) {
    if ( ! in_array( $meta_key, [ /* list of content keys go here */ ] ) ) {
        return;
    }

    global $x_require_search_extraction;
    if ( ! isset( $x_require_search_extraction ) ) {
        $x_require_search_extraction = [];
    }

    if ( ! in_array( $post_id, $x_require_search_extraction ) ) {
        $x_require_search_extraction[] = $post_id;
    }
}

add_action( 'shutdown', 'x_extract_searchable_content' );
function x_extract_searchable_content() {
    global $x_require_search_extraction;

    if ( empty( $x_require_search_extraction ) ) {
        return;
    }

    foreach ( $x_require_search_extraction as $post_id ) {
        // grab all postmeta containing our content
        // extract into a string
        // update_post_meta( ... )
    }
}
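
The extraction step left as comments above could be sketched as follows. This is a minimal illustration: the helper names are hypothetical, and it assumes every string inside the stored object is searchable content. A real implementation would likely pick specific keys so that non-content metadata stays out of the index.

```php
<?php
/**
 * Return a flat list of the non-empty strings found in a nested
 * array or object, with HTML tags stripped.
 * (Hypothetical helper, not part of Jetpack.)
 */
function x_flatten_searchable_strings( $value ) {
    if ( is_string( $value ) ) {
        $text = trim( strip_tags( $value ) );
        return '' === $text ? [] : [ $text ];
    }

    if ( is_object( $value ) ) {
        $value = get_object_vars( $value );
    }

    if ( ! is_array( $value ) ) {
        return []; // Ignore numbers, booleans, and null.
    }

    $strings = [];
    foreach ( $value as $item ) {
        $strings = array_merge( $strings, x_flatten_searchable_strings( $item ) );
    }

    return $strings;
}

/**
 * Combine all extracted strings into the single string to be stored
 * in the es_x_content_blocks meta entry.
 */
function x_build_searchable_content( $blocks ) {
    return implode( ' ', x_flatten_searchable_strings( $blocks ) );
}
```

Inside the loop, the result for `get_post_meta( $post_id, 'x_content_blocks', true )` could then be saved with `update_post_meta( $post_id, 'es_x_content_blocks', ... )`, or removed with `delete_post_meta()` when the string comes back empty.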

Changing Sort Order

By default, search results will be returned ordered by relevancy with some weight given to recency. You can change this sort order by hooking into the jetpack_search_es_query_args filter and modifying the ES query.

This snippet orders the results by most recently modified posts:

add_filter( 'jetpack_search_es_query_args', function( $args, $query ) {
    // Re-sort the results based on their modified date
    $args['sort'] = [
        [
            'modified' => [
                'order' => 'desc'
            ],
        ],
    ];

    return $args;
}, 10, 2 );

You can modify the sort order in any way you choose, including weighting various fields depending on your requirements. For more details on sorting, check out the Elasticsearch documentation.

Testing in Local / Preprod Environments

You can use ngrok to connect your local development environments to Jetpack: https://gist.github.com/mjangda/7fcafffda4c6c5c1de18c4af647bd111

Some things worth noting:

  • ngrok is a 3rd party service, which does come with a small cost for custom/reserved domains (which you’ll need to ensure your Jetpack connection stays up).
  • Once connected, we will need to set up an Elasticsearch index for your development site. Please open a ticket with your reserved ngrok domain and we’d be happy to set things up.

Jetpack Search replaces the legacy WPCOM Elasticsearch plugin.

The key changes are:

  • Jetpack Search is directly integrated in Jetpack, so no additional plugin is required. However, the Search module needs to be activated.
  • The WPCOM_Elasticsearch class has been replaced with Jetpack_Search.
  • Jetpack Search runs off a newer version of Elasticsearch, which has replaced Facets with Aggregations.
  • If you are using the built-in widget, it may need to be re-added as the slug has been changed.
  • The filters from WPCOM Elasticsearch have been renamed:
    • wpcom_elasticsearch_wp_query_args => jetpack_search_es_wp_query_args
    • wpcom_elasticsearch_query_args => jetpack_search_es_query_args
    • wpcom_elasticsearch_found_posts => no replacement available
    • wpcom_elasticsearch_taxonomy_query_var => jetpack_search_taxonomy_query_var
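
While migrating, a temporary bridge can keep callbacks attached to the old hook names working. The sketch below covers only the two query-args filters, which the examples on this page register with two arguments. The `x_legacy_search_filter_map()` helper is hypothetical, and it assumes the legacy callbacks still produce arguments that are valid for the newer Elasticsearch version, which may not hold for anything that builds Facets.

```php
<?php
/**
 * Legacy hook name => Jetpack Search hook name.
 * Limited to the two query-args filters listed above.
 */
function x_legacy_search_filter_map() {
    return [
        'wpcom_elasticsearch_wp_query_args' => 'jetpack_search_es_wp_query_args',
        'wpcom_elasticsearch_query_args'    => 'jetpack_search_es_query_args',
    ];
}

// Only register the bridge when running inside WordPress.
if ( function_exists( 'add_filter' ) ) {
    foreach ( x_legacy_search_filter_map() as $old_hook => $new_hook ) {
        add_filter( $new_hook, function ( $args, $query ) use ( $old_hook ) {
            // Pass the value through any callbacks still attached to the old hook.
            return apply_filters( $old_hook, $args, $query );
        }, 10, 2 );
    }
}
```

Renaming the hooks in your own code remains the cleaner long-term fix; the bridge is only meant to ease the transition.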

Advanced Topics

Debugging Elasticsearch

The Debug Bar on VIP sites includes an Elasticsearch panel which provides details on all queries performed on the page, including the full JSON-based query sent to the index, timing data (how fast was the query, the network request, etc.), and so on. This can be incredibly useful when things aren’t working as expected.

Replacing WP_Query

We also support other advanced use cases with Elasticsearch beyond search, such as offloading expensive queries.

You can use the es-wp-query plugin from Alley Interactive to offload any WP_Query lookup to Elasticsearch, which can improve performance for a variety of workloads. Using the plugin is as simple as adding 'es' => true to your query params when building the WP_Query object.

While Elasticsearch does generally speed up most queries, there may be cases where Elasticsearch may actually be slower. This depends on various factors like the speed of the original MySQL query (using a primary key like ID or an index), complexity of the Elasticsearch query, cost of the network round-trip, and so on. Query times and complexity should be carefully measured before deciding to offload.
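
As a rough sketch of the above, the snippet below adds the 'es' flag to a set of query params and times the resulting lookup. The `x_offload_query_args()` helper and the example query values are hypothetical. The timing covers a single run only, so treat it as a starting point for measurement and not as a benchmark.

```php
<?php
/**
 * Return WP_Query params with the es-wp-query flag set.
 * (Hypothetical helper for illustration.)
 */
function x_offload_query_args( array $args ) {
    $args['es'] = true;
    return $args;
}

// Example params; the taxonomy and term are placeholders.
$args = x_offload_query_args( [
    'post_type'      => 'post',
    'posts_per_page' => 10,
    'tax_query'      => [
        [
            'taxonomy' => 'category',
            'field'    => 'slug',
            'terms'    => [ 'news' ],
        ],
    ],
] );

// Only run the query when WordPress is loaded.
if ( class_exists( 'WP_Query' ) ) {
    $start   = microtime( true );
    $query   = new WP_Query( $args );
    $elapsed = microtime( true ) - $start;

    error_log( sprintf( 'Offloaded query took %.4f seconds', $elapsed ) );
}
```

Running the same params with and without the flag, several times each, gives a fairer comparison against the original MySQL query.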

Search API

You can send fully qualified Elasticsearch queries directly to the API as well:

curl https://public-api.wordpress.com/rest/v1/sites/${SITEID_OR_URL}/search --data '{"query": { "term": { "title": "elasticsearch" } } }'

Where ${SITEID_OR_URL} should be replaced with your site’s WordPress.com ID or the site URL.
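
The same request could be made from PHP along these lines. This is a sketch under assumptions: `x_build_search_api_request()` is a hypothetical helper, `example.com` is a placeholder site, and the response is assumed to be JSON. Only the endpoint and the query body come from the curl example above.

```php
<?php
/**
 * Build the URL and JSON body for a Search API request.
 * (Hypothetical helper for illustration.)
 */
function x_build_search_api_request( $site, array $es_query ) {
    return [
        'url'  => sprintf(
            'https://public-api.wordpress.com/rest/v1/sites/%s/search',
            rawurlencode( (string) $site )
        ),
        'body' => json_encode( $es_query ),
    ];
}

$request = x_build_search_api_request( 'example.com', [
    'query' => [
        'term' => [ 'title' => 'elasticsearch' ],
    ],
] );

// Only send the request when WordPress' HTTP API is available.
if ( function_exists( 'wp_remote_post' ) ) {
    $response = wp_remote_post( $request['url'], [ 'body' => $request['body'] ] );
    $results  = json_decode( wp_remote_retrieve_body( $response ), true );
}
```

Separating request construction from sending keeps the query easy to inspect and log before it goes over the network.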

Querying multiple sites

Searching across sites is not currently supported via Jetpack Search but it’s something we’re interested in exploring. If this is functionality that you are interested in, please let us know.

Non-Public Content

To search for non-public content, you can make authenticated requests to the API. Documentation on this functionality to come.

We’re exploring ways to add Jetpack Search capabilities to the admin to allow for faster and more relevant searches for editorial teams. Stay tuned for more.

Mapping

To learn more about how content from your WordPress site is mapped to the Elasticsearch index, please see https://developer.wordpress.com/elasticsearch-post-doc-schema-v2/

The code that handles the mapping/translation can be found here: https://github.com/Automattic/wpes-lib

FAQs & Common Problems

Help, ES didn’t return the data I expected!

There are several cases where indexing operations can manifest themselves in “missing content” when queried:

  1. A delay in sync, e.g. a large volume of cron jobs may be backlogged on the site and take some time to work through, or you have just initiated a full sync, which can take some time on larger sites.
  2. A delay in indexing, e.g. a backlog in the processing of indexing jobs. The system design here is “eventual consistency”, so some delays are to be expected, but during normal operation, content queued for indexing should be processed in under 60 seconds at the outside and about 15 seconds on average.
  3. A failure in indexing. There is retry logic for failed indexing jobs, and during normal operation the failure rate is very low. When the failure rate spikes, there are usually other problems in dependent systems, e.g. the databases fail to return the content to index.
  4. The site or the specific post_type is not set to public.

If you are missing data that you definitely expect to be in the index, please reach out and we’d be happy to help investigate.

Where does Elasticsearch live?

WordPress.com VIP site indexes all reside on one ES cluster. However, that ES cluster is distributed across our datacenters to provide redundancy and optimize response times.
