By Jon Rangel, MongoDB Consulting Engineer

Introduction

分面搜索或称分面导航,是通过对集合中的项目的各种属性(构面)应用过滤器来浏览和搜索数据的方式。它越来越被视为用于许多搜索平台的UI的重要部分,并且现在在诸如电子商务网站等应用场景逐渐成为必需。

Faceted search makes it easy for users to navigate to the specific item or items they are interested in. 它通过促进探索和发现来辅助补全更多格式自由的关键字来进行搜索,因此当用户并不知道他们希望搜索的特定关键字的时候非常有用。

分面搜索功能应该提供的一些核心功能包括:

In this article, we’ll look at implementing the above faceted search functionality using a pure MongoDB solution. We’ll examine a number of approaches to solving this problem, and discuss their relative performance characteristics and any other pros/cons. We will also introduce some third party tools that, alternatively, can integrate with MongoDB to provide faceted search functionality.

Navigating a Book Store

Suppose we want to build faceted search functionality for a product catalog for a book store. A typical document representing a publication in the catalog might look something like the following:

  {
        _id : 123,
        title : "MongoDB: The Definitive Guide",
        authors : [ "Kristina Chodorow" ],
        publication_date : ISODate("2013-05-23"),
        pages : 432,
        edition : 2,
        isbn_10 : 1449344682,
        isbn_13 : 978-1449344689,
        language : "English",
        publisher : {
            name: "O’Reilly Media",
            ...
        },
        last_updated : ISODate("2013-05-16"),
        ...
    } 

First off, let’s state some reasonable assumptions about the facets for this (or indeed any other) catalog:

For this example, let’s say we have three facets on which we wish to search – Subject, Publisher and Language – and consider how to search efficiently, and how to generate the faceted navigation meta-data to present to the user. We will test on some pre-generated test data based on a real-world product catalog.

Searching

The first part of the problem to solve is how to efficiently search for items in the product catalog. A few schema and indexing approaches are presented below.

Solution #1

One way to define the facet tags for a publication would be to store all facet types and values in subdocuments in an array, as follows:

  {
        _id: 123,
        ...
        facets1 : [
            {
                type : "subject",
                val : "MongoDB"
            },
            {
                type : "subject",
                val : "Databases"
            },
            {
                type : "publisher",
                val : "O'Reilly Media"
            },
            {
                type : "language",
                val : "English"
            }
        ]
    } 

A single ‘generic’ compound index can then be created containing all the facets and facet values:

 > db.books.ensureIndex({"facets1.type" : 1, "facets1.val" : 1})
    > db.books.stats()
    {
        "ns" : "test.books",
        "count" : 105280,
        "size" : 109597152,
        "avgObjSize" : 1041.0063829787234,
        ...
        "totalIndexSize" : 29891456,
        "indexSizes" : {
            "_id_" : 3433920,
            "facets1.type_1_facets1.val_1" : 26457536
        },
        "ok" : 1
    }

See this blog post for a good treatment on these kinds of generic indexes.

Let’s see how this performs for some faceted searches, using explain(). We’ll look at queries on a single facet tag to start with.

Find all books about databases:

  > db.books.find(
    ...     { "facets1" : { $elemMatch : { "type" : "subject", "val" : "Databases" } } }
    ... ).explain()
    {
        "cursor" : "BtreeCursor facets1.type_1_facets1.val_1",
        "isMultiKey" : true,
        "n" : 7315,
        "nscannedObjects" : 7315,
        "nscanned" : 7315,
        "nscannedObjectsAllPlans" : 7315,
        "nscannedAllPlans" : 7315,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 27,
        "indexBounds" : {
            "facets1.type" : [
                [
                    "subject",
                    "subject"
                ]
            ],
            "facets1.val" : [
                [
                    "Databases",
                    "Databases"
                ]
            ]
        },
        "server" : "rangel.lan:27017"
    }

Find all books by a specific publisher:

  > db.books.find(
    ...     { "facets1" : { $elemMatch : { "type" : "publisher", "val" : "O'Reilly Media" } } }
    ... ).explain()
    {
        "cursor" : "BtreeCursor facets1.type_1_facets1.val_1",
        "isMultiKey" : true,
        "n" : 39960,
        "nscannedObjects" : 39960,
        "nscanned" : 39960,
        "nscannedObjectsAllPlans" : 39960,
        "nscannedAllPlans" : 39960,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 133,
        "indexBounds" : {
            "facets1.type" : [
                [
                    "publisher",
                    "publisher"
                ]
            ],
            "facets1.val" : [
                [
                    "O'Reilly Media",
                    "O'Reilly Media"
                ]
            ]
        },
        "server" : "rangel.lan:27017"
    }

Both of these queries use the index optimally as the number of documents returned is the same as the number of documents scanned (nscanned is the same as n).

How about queries for documents matching the union or intersection of multiple facet values? To do these “and”/“or” queries we use the $all/$in operators respectively.

Find all books about databases OR published by O'Reilly Media:

  > db.books.find(
    ...     { "facets1" :
    ...         { "$in" : [
    ...             { $elemMatch : { "type" : "publisher", "val" : "O'Reilly Media" } },
    ...             { $elemMatch : { "type" : "subject", "val" : "Databases" } }
    ...         ]}
    ...     }
    ... ).explain()
    Fri Aug 16 15:59:04.989 JavaScript execution failed: error: 
    { "$err" : "$elemMatch not allowed within $in", "code" : 15881 } at src/mongo/shell/query.js:L128

Oops! This type of search doesn’t work using $in to construct the query as we cannot use the $elemMatch operator within a $in clause. This query can instead be constructed using the $or operator:

  > db.books.find(
    ...     { "$or" : [
    ...             { "facets1" : { $elemMatch : { "type" : "publisher", "val" : "O'Reilly Media" } } },
    ...             { "facets1" : { $elemMatch : { "type" : "subject", "val" : "Databases" } } }
    ...         ]
    ...     }
    ... ).explain()
    {
        "clauses" : [
            {
                "cursor" : "BtreeCursor facets1.type_1_facets1.val_1",
                "isMultiKey" : true,
                "n" : 40019,
                "nscannedObjects" : 40019,
                "nscanned" : 40019,
                "nscannedObjectsAllPlans" : 40019,
                "nscannedAllPlans" : 40019,
                "scanAndOrder" : false,
                "indexOnly" : false,
                "nYields" : 0,
                "nChunkSkips" : 0,
                "millis" : 118,
                "indexBounds" : {
                    "facets1.type" : [
                        [
                            "publisher",
                            "publisher"
                        ]
                    ],
                    "facets1.val" : [
                        [
                            "O'Reilly Media",
                            "O'Reilly Media"
                        ]
                    ]
                }
            },
            {
                "cursor" : "BtreeCursor facets1.type_1_facets1.val_1",
                "isMultiKey" : true,
                "n" : 6640,
                "nscannedObjects" : 7374,
                "nscanned" : 7374,
                "nscannedObjectsAllPlans" : 7374,
                "nscannedAllPlans" : 7374,
                "scanAndOrder" : false,
                "indexOnly" : false,
                "nYields" : 1,
                "nChunkSkips" : 0,
                "millis" : 123,
                "indexBounds" : {
                    "facets1.type" : [
                        [
                            "subject",
                            "subject"
                        ]
                    ],
                    "facets1.val" : [
                        [
                            "Databases",
                            "Databases"
                        ]
                    ]
                }
            }
        ],
        "n" : 46659,
        "nscannedObjects" : 47393,
        "nscanned" : 47393,
        "nscannedObjectsAllPlans" : 47393,
        "nscannedAllPlans" : 47393,
        "millis" : 242,
        "server" : "rangel.lan:27017"
    }

This query is pretty optimal: the number of documents scanned is only slightly more than the number returned, and the index is used for both parts of the “or” statement.

Next, find all books about databases AND published by O'Reilly Media:

 > db.books.find(
    ...     { "facets1" :
    ...         { "$all" : [
    ...             { $elemMatch : { "type" : "publisher", "val" : "O'Reilly Media" } },
    ...             { $elemMatch : { "type" : "subject", "val" : "Databases" } }
    ...         ]}
    ...     }
    ... ).explain()
    {
        "cursor" : "BtreeCursor facets1.type_1_facets1.val_1",
        "isMultiKey" : true,
        "n" : 675,
        "nscannedObjects" : 39960,
        "nscanned" : 39960,
        "nscannedObjectsAllPlans" : 39960,
        "nscannedAllPlans" : 39960,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 118,
        "indexBounds" : {
            "facets1.type" : [
                [
                    "publisher",
                    "publisher"
                ]
            ],
            "facets1.val" : [
                [
                    "O'Reilly Media",
                    "O'Reilly Media"
                ]
            ]
        },
        "server" : "rangel.lan:27017"
    }

This query uses the index, but is not optimal as many more documents are scanned than returned. Note that the number of documents scanned is the same as the number of books by this publisher (as seen from the previous query) – this is because at present $all only uses the index for the first element in the query array.

The performance of these kinds of queries will improve significantly once MongoDB supports index intersection, which is a feature that is coming soon (see SERVER-3071). With single index intersection, queries like the above will not need to scan more documents than those returned. In the meantime, to optimize these kinds of queries put the most selective filter criterion as the first element of the $all array if possible to minimize scanning:

 > db.books.find(
    ...     { "facets1" :
    ...         { "$all" : [
    ...             { $elemMatch : { "type" : "subject", "val" : "Databases" } },
    ...             { $elemMatch : { "type" : "publisher", "val" : "O'Reilly Media" } }
    ...         ]}
    ...     }
    ... ).explain()
    {
        "cursor" : "BtreeCursor facets1.type_1_facets1.val_1",
        "isMultiKey" : true,
        "n" : 675,
        "nscannedObjects" : 7315,
        "nscanned" : 7315,
        "nscannedObjectsAllPlans" : 7315,
        "nscannedAllPlans" : 7315,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 20,
        "indexBounds" : {
            "facets1.type" : [
                [
                    "subject",
                    "subject"
                ]
            ],
            "facets1.val" : [
                [
                    "Databases",
                    "Databases"
                ]
            ]
        },
        "server" : "rangel.lan:27017"
    }
Solution #2

Store all facet types and values in in an array, but instead of each element of the array being a subdocument, concatenate the facet type name and value into a single string value:

 {
        _id: 123,
        ...
        facets2 : [
            "subject:MongoDB",
            "subject:Databases",
            "publisher:O'Reilly Media",
            "language:English"
        ]
    }

Create an index on the facets field:

  > db.books.ensureIndex({"facets2" : 1})
    > db.books.stats()
    {
        "ns" : "test.books",
        "count" : 105280,
        "size" : 109597152,
        "avgObjSize" : 1041.0063829787234,
        ...
        "totalIndexSize" : 55694912,
        "indexSizes" : {
            "_id_" : 3433920,
            "facets1.type_1_facets1.val_1" : 26457536,
            "facets2_1" : 25803456
        },
        "ok" : 1
    }

Now let’s try some of the same queries as before. First, a simple query on a single facet value (all books about databases):

   > db.books.find(
    ...     { "facets2" : "subject"+":"+"Databases" }
    ... ).explain()
    {
        "cursor" : "BtreeCursor facets2_1",
        "isMultiKey" : true,
        "n" : 7315,
        "nscannedObjects" : 7315,
        "nscanned" : 7315,
        "nscannedObjectsAllPlans" : 7315,
        "nscannedAllPlans" : 7315,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 28,
        "indexBounds" : {
            "facets2" : [
                [
                    "subject:Databases",
                    "subject:Databases"
                ]
            ]
        },
        "server" : "rangel.lan:27017"
    }

This works exactly as expected.

Now, lets try an “or” query (all books about databases OR published by O'Reilly Media):

    > db.books.find(
    ...     { "facets2" :
    ...         { "$in" : [
    ...             "publisher"+":"+"O'Reilly Media",
    ...             "subject"+":"+"Databases"
    ...         ]}
    ...     }
    ... ).explain()
    {
        "cursor" : "BtreeCursor facets2_1 multi",
        "isMultiKey" : true,
        "n" : 46600,
        "nscannedObjects" : 47275,
        "nscanned" : 47276,
        "nscannedObjectsAllPlans" : 47275,
        "nscannedAllPlans" : 47276,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 117,
        "indexBounds" : {
            "facets2" : [
                [
                    "publisher:O'Reilly Media",
                    "publisher:O'Reilly Media"
                ],
                [
                    "subject:Databases",
                    "subject:Databases"
                ]
            ]
        },
        "server" : "rangel.lan:27017"
    }

This query is pretty optimal: the number of documents scanned is only slightly more than the number returned, and the index bounds look sensible, showing that the index is used for both elements of the $in array. Note that $in may be used to construct this type of query since we don’t need to use the $elemMatch operator with this schema.

Finally, an “and” query (all books about databases that are published by O'Reilly Media):

  > db.books.find(
    ...     { "facets2" :
    ...         { "$all" : [
    ...             "subject"+":"+"Databases",
    ...             "publisher"+":"+"O'Reilly Media"
    ...         ]}
    ...     }
    ... ).explain()
    {
        "cursor" : "BtreeCursor facets2_1",
        "isMultiKey" : true,
        "n" : 675,
        "nscannedObjects" : 7315,
        "nscanned" : 7315,
        "nscannedObjectsAllPlans" : 7315,
        "nscannedAllPlans" : 7315,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 20,
        "indexBounds" : {
            "facets2" : [
                [
                    "subject:Databases",
                    "subject:Databases"
                ]
            ]
        },
        "server" : "rangel.lan:27017"
    }

If you’ve been following so far, you won’t be too surprised to see that, unfortunately, this performs exactly the same as in solution #1, for the same reasons described there. Index intersection is coming soon though!

Solution #3

Consider the following schema, where each facet is a field in a subdocument, associated with an array of the tags for that facet:

 {
        _id: 123,
        ...
        facets3 : {
            subject : [ "MongoDB", "Databases" ],
            publisher : [ "O'Reilly Media" ],
            language : [ "English" ]
        }
    }

Add an index on each facet individually:

  > db.books.ensureIndex({"facets3.subject" : 1})
    > db.books.ensureIndex({"facets3.publisher" : 1})
    > db.books.ensureIndex({"facets3.language" : 1})
    > db.books.stats()
    {
        "ns" : "test.books",
        "count" : 105280,
        "size" : 109597152,
        "avgObjSize" : 1041.0063829787234,
        ...
        "totalIndexSize" : 75464480,
        "indexSizes" : {
            "_id_" : 3433920,
            "facets1.type_1_facets1.val_1" : 26457536,
            "facets2_1" : 25803456,
            "facets3.subject_1" : 12084128,
            "facets3.publisher_1" : 2321984,
            "facets3.language_1" : 5363456
        },
        "ok" : 1
    }

This solution has the same performance characteristics as the first two solutions, with the additional benefit that the total size of the indexes required is significantly smaller. This is because we are not storing the facet names in the index for each indexed value.

Once index intersection using multiple indexes is supported (which is also coming under SERVER-3071), this approach will also perform well for “and” queries.

Generating the Faceted Navigation Information

The other part of the faceted search problem is how to most efficiently generate and return the faceted search meta-data. One way to do this would be to use the Aggregation Framework to calculate this information on-the-fly.

For example, to get all the facet values for the collection and the count of documents associated with each one, we could perform the following aggregation query (assuming schema #2 as above):

 > db.books.aggregate([{ "$unwind" : "$facets2" },
                          { "$group" : { "_id" : "$facets2", count : { "$sum" : 1 } } },
                          { "$sort" : { "_id" : 1 } }
                         ])
    {
        "result" : [
            ...
            {
                "_id" : "publisher:O'Reilly Media",
                "count" : 39960
            },
            ...
            {
                "_id" : "subject:Databases",
                "count" : 7315
            },
            ...
        ],
        "ok" : 1
    }

Then, as the user drills down using the facets, we need to add the filter predicates to the aggregation query. For instance, if the user clicks on the “Databases” subject facet, we can obtain the facet values and counts for documents matching this filter as follows:

 > db.books.aggregate([{ "$match" : { "facets2" : "subject"+":"+"Databases" } },
                          { "$unwind" : "$facets2" },
                          { "$group" : { "_id" : "$facets2", "count" : { "$sum" : 1 } } },
                          { "$sort" : { "_id" : 1 } }
                         ])
    {
        "result" : [
            ...
            {
                "_id" : "publisher:O'Reilly Media",
                "count" : 675
            },
            ...
            {
                "_id" : "subject:Databases",
                "count" : 7315
            },
            ...
        ],
        "ok" : 1
    }

The downside to this approach is that it incurs the overhead of an additional aggregation query each time the user queries the product catalog. Furthermore, for certain choices of schema (e.g. solution #3 above) we actually need to do one aggregation query per distinct facet.

It’s reasonable to assume that the product catalog will be updated much less frequently than it is queried, therefore it may well make sense to pre-compute the faceted navigation meta-data and store it in a separate collection. Consider the following schema for a collection of faceted navigation documents:

 {
        _id : "'facet_filter_string",
        value : {
            count : 12,
            facets : {
                facet1_name : {
                    facet1_val1 : 8,
                    facet1_val2 : 12,
                    ...
                },
                facet2_name : {
                    facet2_val1 : 5,
                    ...
                },
                ...
            }
        }
    }

where <facet_filter_string> is either the empty string (for the document representing the root of the faceted navigation) or one or more of “|<facet_name>:<facet_filter_val>|” concatenated together.

Then, to find the faceted navigation information pertaining to all books about databases, the following simple query on _id will do the job:

 > db.facetnav.find({_id:"|subject:Databases|"}).pretty()
    {
        "_id" : "|subject:Databases|",
        "value" : {
            "count" : 7315,
            "facets" : {
                "publisher" : {
                    "O'Reilly Media" : 675,
                    "Pub2" : 3605,
                    "Pub3" : 185,
                    "Pub4" : 305,
                    "Pub5" : 2505,
                    "Pub6" : 15,
                    "Pub7" : 25
                },
                "language" : {
                    "English" : 7250,
                    "French" : 1095,
                    "German" : 1290
                }
            }
        }
    }

Note that it’s not necessary to generate a document like the above for every single permutation of facet filters, only for each unique combination of filters according to some predetermined canonical ordering of facets (e.g. Subject, Publisher, Language). We can then ensure that the application always builds the _id string with which to query using this canonical ordering.

The faceted navigation meta-data collection can be generated quite easily using a Map-Reduce job. For some example code that does this, take a look at my GitHub repo. With the map and reduce functions defined there, the facetnav info for the entire product catalog can be generated as follows:

 > db.books.mapReduce(mapFn, reduceFn, { "out" : "facetnav" })
    {
        "result" : "facetnav",
        "timeMillis" : 117529,
        "counts" : {
            "input" : 105280,
            "emit" : 2423080,
            "reduce" : 63850,
            "output" : 1599
        },
        "ok" : 1,
    }

Subsequently, whenever the product catalog is updated, the facetnav collection can be quickly updated by specifying that the map-reduce job operate only on the recently updated items and fold those changes in to the existing facetnav collection. For example:

  > db.books.ensureIndex({"last_updated : 1"})
    > db.books.mapReduce(mapFn, reduceFn,
    ...                  { "query" : { "last_updated" : { "$gt" : new Date(2013,7,1) } },
    ...                    "out" : { "reduce" : "facetnav" } })
    {
        "result" : "facetnav",
        "timeMillis" : 724,
        "counts" : {
            "input" : 1000,
            "emit" : 13484,
            "reduce" : 198,
            "output" : 1599
        },
        "ok" : 1,
    }

Third-Party Tools

There are a number of search engine software packages that provide faceted search capabilities. These typically provide the core functionality we have described above, plus more advanced features such as more convenient searching on ranges of facet values (e.g. finding documents that fall within a certain date or price range) or auto-completion (i.e. displaying relevant suggestions, grouped by facet, as a user types in a search query).

The trade-offs with using an additional search engine are:

Two of the most popular search engines are Solr and ElasticSearch which, like MongoDB, are also free and open-source products.

Solr and ElasticSearch can be easily integrated with MongoDB using Mongo Connector, which comes bundled with plugins for interfacing with each of them. Using the appropriate plugin, Mongo Connector can integrate data from MongoDB into the desired target system and keep the two systems in sync.

Conclusion

Faceted search functionality can be implemented in MongoDB, without requiring the use of external search engines. When index intersection arrives, all the types of queries we have examined here will perform optimally. Integrating with an external search engine to provide faceted search is also a good option, and something to consider depending on the specific requirements of your application.