The Wonderful World of Materialized Views
(or, Why Indexing in Magento 1.13 Is So Much Faster)
When Magento Enterprise 1.13 came out it boasted a massive improvement in index operation performance. On the Magento blog introducing 1.13 (http://www.magentocommerce.com/blog/comments/magento-enterprise-edition-113/) a simple statement was made: “The introduction of incremental indexing reduces the need to perform a full re-index and most indexing operations are now automated”. While this statement is true, it provides a very simplified explanation of what has changed and why the improvements in performance are so great.
Materialized Views
Part of the problem with index performance was simply due to what the indexes in Magento are. They are not indexes like a B-tree or hash; they act more like pre-calculation tables. In SQL parlance this is often known as a Materialized View. It is similar to a regular view in that it contains the results of a query, but unlike a regular view, which reflects changes to the underlying rows as soon as they happen, a Materialized View does not.
MySQL does not support Materialized Views, so Magento implemented a sort of pseudo-Materialized View. Prior to 1.13 this functionality was like a Materialized View but did not track changes to products, so it had no way of knowing which items needed to be updated; a re-index was always a full re-index. In other words, it was a kind of pseudo-pseudo-Materialized View.
The effect of this can be seen even in very cursory, top-level comparisons. In the following example one product was loaded, its inventory changed, the product saved, the indexes loaded, and a re-index performed, all via the administrative UI.
One of the changes has been the inclusion of a new module called Enterprise_Index. If one were to start looking here for the answer they would only be partially right and really only if they looked in the setup script. The setup script defines a number of different events. But these events are different from the typical event/observer paradigm that most developers are familiar with.
The events in the Enterprise_Index setup are actually triggers in MySQL on individual tables that respond to various modifications to individual rows. These triggers are used to completely invalidate an entire index. As such there are only a few operations that actually trigger this, most of which revolve around store, attribute or certain configuration modifications. And automated index invalidation does not explain why Enterprise 1.13 made such large strides when it comes to re-indexing.
But it gives us a starting place to look.
Partial Indexing
A big part of the performance gain comes from partial re-indexing. Re-indexing a small set of changes is never much of a problem: just modify the items that changed, applying the same logic you would use for larger changes.
But how does one determine which items have changed? That can be a huge problem. One option would be to set an event observer for “model_save_after”. But that would result in a lot of overhead added to the execution flow. Plus it wouldn’t take into account a model class that overrides the _afterSave() method and forgets to call the parent. So reliably determining when a model has changed is something that cannot be done in application code; there are too many moving parts that need to be taken into account.
To get around these and other issues, triggers are used to note changes. But unlike the triggers set up by Enterprise_Index, which simply invalidate a full index, these triggers run on individual tables and simply watch for, and make note of, changes.
One can see how this is set up in the new Enterprise_Catalog module. The installation script makes use of a model from a new module called Mview. The Mview module is Magento’s implementation of a MySQL-based pseudo-Materialized View.
Configuring an individual Mview requires three distinct components: 1) the metadata, 2) a client, and 3) an action. The metadata and the client are managed by the Mview module; the action is implemented by whichever module is utilizing the Mview.
This is done by telling Mview to subscribe to actions in a given table, which are referred to as events. The events can be INSERT, UPDATE or DELETE, correlating to the related SQL actions. Mview will set up a trigger on a provided table. This trigger will take the provided column, most likely the primary key, and inject it into a changelog table that is unique for the table provided. The changelog table is a two column table, one with a primary key that is referred to as the version_id and another that holds the value of the key provided. Whenever a subscribed-to change occurs the provided value (again, usually the primary key) will be inserted into the changelog table and the version_id will be incremented via AUTO_INCREMENT.
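To make that mechanism a little more concrete, here is a minimal sketch of how changed keys could be read back out of a changelog. It is illustrative only, not Magento's internal implementation; the _cl suffix and the two-column layout follow the description above, while the url_id column and the $lastProcessedVersionId value are assumptions borrowed from the example we will build later in this article.

// Illustrative sketch only: read every key recorded in a changelog table since
// the last processed version. The changelog is named {$table}_cl and contains
// an auto-incrementing version_id plus the watched key column.
$logTable       = Mage::getModel('log/log')->getResource()->getTable('url_table');
$changelogTable = $logTable . '_cl';
$lastProcessedVersionId = 42; // normally read from enterprise_mview_metadata

$connection = Mage::getSingleton('core/resource')->getConnection('core_read');
$select = $connection->select()
    ->distinct()
    ->from($changelogTable, array('url_id'))     // assumption: url_id is the watched key column
    ->where('version_id > ?', $lastProcessedVersionId);

$changedIds = $connection->fetchCol($select);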
An action can be run manually, but most of the time actions are managed automatically by the new Enterprise_Index module, which adds a job that runs on every cron execution. Each time cron runs, the Enterprise_Index_Model_Observer::refreshIndex() method is called. This method retrieves an Mview client and iterates over the index configuration nodes found under global/index/indexer.
Under this configuration node will be several unique children. Each of those children contains several nodes, and the ones important to the Enterprise_Index observer live under the “action_model” node. Under it are the Magento URNs of the classes that will handle the actual re-index process. There are currently two nodes under this parent node: “changelog” and “all”.
<catalog_product_price>
    <model>enterprise_catalog/index_dummy_price</model>
    <action_model>
        <changelog>enterprise_catalog/index_action_product_price_refresh_changelog</changelog>
        <all>enterprise_catalog/index_action_product_price_refresh</all>
    </action_model>
    <index_table>catalog_product_index_price</index_table>
    <sort_order>40</sort_order>
</catalog_product_price>
Figure 2 – Example of Full and Partial Reindex Action Configurations
All re-indexing processes that are using the Materialized View should be initiated by the Mview client and should implement Enterprise_Mview_Model_Action_Interface. The Mview client expects that any view reindexing will be done by a class that implements that interface, often with several sub-classes in between to handle some of the mundane details of managing the partial re-indexing functionality.
Prior to executing a full re-index the Mview client will check the enterprise_mview_metadata table to see if the currently requested index is actually dirty. If it is not marked as dirty then the index will not be run. As noted earlier, only certain conditions will cause the table to be flagged, so the fact that a table row has changed does not necessarily mean that the index will be marked dirty and a full re-index triggered.
But while this partial indexing is often managed via cron it can also happen inline with a regular request. There are several examples of a partial re-index being implemented on saving an item; product inventory being an example. The product inventory indexer listens for changes in a product and will automatically do a partial re-index for that product when the product saves.
This, like most things in Magento, is a configurable option. Most of the partial indexers will check a configuration prior to re-indexing. In Enterprise 1.13 there is a new System Configuration section called “Index Management” which allows the administrator to control this behavior.
The effect of this is clear. The actual number of queries for a save operation is significantly lower.
Using Materialized Views For Your Own Profit
But while the Materialized View implementation has helped make the indexing functionality much more efficient, one of the real benefits of the new Mview module is that you can use it in a standardized way to manage your own indexing requirements. It will help you implement module functionality when you have COUNT(*), GROUP BY, or DISTINCT queries, or when you need to pre-calculate values. That’s not to say that you can’t manage this on your own, but by using the Mview module you can offload some of the responsibility for managing these types of calculations to Magento.
In order to do that you need three elements:
- A setup script that creates your table subscriber
- An index table
- Indexer action models that deal with either the changelog or a complete reindex
Before we move on we should note that #3 has some caveats. Generally, if you are going to be working with the changelog you will not be configuring an index to run with the shell indexer as you might expect. Instead, it will be run automatically through cron and managed by the Enterprise_Index_Model_Observer class, whose refreshIndex() method is called each time cron is executed, just as it is for the Enterprise_Catalog module. In other words, if you use this method you will generally have no direct control via the administration UI or the CLI.
For this example we are going to write an extension that will print a summary of the browsers that have viewed a given page in the footer of that page. Granted, this is an extension that would not go on a production front-facing Magento implementation but it will demonstrate how you can integrate with the Mview extension to do calculations that you would not want to do during a normal request. Additionally, for the purposes of clarity, we are going to omit things like Layout and Blocks and standard configuration options and focus on the Mview functionality. This should not be a problem for most Magento developers, though the full code is available on GitHub.
To implement this kind of functionality you might have considered creating a new table containing columns for the URL, the user agent, and a count. However, if you calculate the results on the fly you add at least one extra write during regular page loads and you increase the possibility of locks or write contention in the database. Those kinds of writes should generally be avoided. Not to mention that, in this scenario, the information we need is already being stored in the log_ tables, so we would be duplicating a write for no reason.
There is an additional reason we would want to do this separately. User agent strings are long strings of mostly useless information (for our purposes, at least), with many different permutations that may or may not follow a specific pattern. So we have two options: build the mechanism to normalize the user agents into something useful ourselves, or use what someone else has already done so we don’t have to maintain it. PHP’s get_browser() function is a good fit for this, but it is a little on the slow side if you want accurate results.
So we are going to process those results offline as part of a regularly scheduled cron job that is already set up via Enterprise_Index.
To get started we need to set up a subscriber on the log_url table. This is the table where each and every HTTP request is registered. The Mview module allows your module to watch a given table for changes. When a change occurs it is noted by the triggers that have been defined for that table, and the individual view’s internal version ID counter is incremented.
When creating a subscription you do it by telling the Mview client to watch a given column on a given table. There is something interesting to note here: because of how the system tracks changes it is beneficial to make sure that you are watching a column that has a unique value. Otherwise it is possible that your indexer may accidentally select data that has already been processed. A good example of this is watching a DATETIME column. Two events could quite possibly occur in the same second that the indexer runs, and they would end up being re-processed the next time the indexer is run. When retrieving the changed values from Mview the indexer looks up a version ID in the enterprise_mview_metadata table, which is matched against the actual values in the changelog table. If there is any possibility of duplication here then it is possible that your data will be skewed. Simply watching the primary key of the table you’re interested in is by far the easiest way to accomplish this.
Setting these relationships up is generally done from within a setup script for the module you are creating. The setup script we are creating here needs two tables: one for the summary report data and one for mapping individual user agents onto their parent browser values. For example, mapping “Mozilla/5.0 (X11; Linux x86_64; rv:10.0.12) Gecko/20100101 Firefox/10.0.12 Iceweasel/10.0.12 Snapper/1.0” onto “Firefox 10”. We will not document these tables here, for the sake of clarity, but you can see them in the GitHub repository for this module.
To start creating our materialized view and its subscription we’re going to begin with our setup script. First we get the name of the log_url table, which is the table we’re going to watch.
$logTable = Mage::getModel('log/log')->getResource()->getTable('url_table');
Why not just type in “log_url”? This goes to an important principle in software development; hardcode as little as you can. What if the table changes? Then you need to go and try and figure out all of the places where you hardcoded the value. This is why, in our next few lines of code, we are a little more verbose when determining the primary key.
$pkData = $this->_conn->getIndexList($logTable);
if (!isset($pkData['PRIMARY']['COLUMNS_LIST'][0])) {
    Mage::throwException('Unable to find log table primary key');
}
$logPrimaryKey = $pkData['PRIMARY']['COLUMNS_LIST'][0];
With these two variables defined we can now configure the Mview client and subscriber.
Mage::getModel('enterprise_mview/metadata')
    ->setTableName($logTable)
    ->setViewName($logTable)
    ->setKeyColumn($logPrimaryKey)
    ->setGroupCode('eschrade_useragent_report')
    ->setStatus(Enterprise_Mview_Model_Metadata::STATUS_INVALID)
    ->save();
This first step in the setup process involves retrieving an instance of the Mview metadata class which is responsible for managing the basic view information. The class itself basically functions as an interface to the database, as can be seen from the skeleton-like definition of its resource and collection methods. The metadata class, however, is used by the Mview client as a means of differentiating between various materialized views. The client loads the metadata information prior to executing an individual indexer.
setTableName() sets the actual name of the table that is the source of our change log; it is the table whose changes we are interested in, and the table that the triggers will be set up against when the subscription is created later on. Only one materialized view metadata entry can watch a given table, but multiple indexers can process it based on configuration settings (which we will look at later on).
The setViewName() method is used to determine the source of changes that you are interested in noting. While this value can be a different table from the one specified in setTableName() it will generally be the same and it must be unique.
setKeyColumn() is used to tell the Mview client which is the unique value that we want to store in our changelog. Generally, this will be the primary key.
setGroupCode() is largely unused, but the value must be unique due to table index requirements. It does, however, allow you to load the metadata by code, or to reference the group code in your index class.
Setting the status to “invalid” via setStatus() is how you force an index to be completely rebuilt. An index that is out of date is not invalid. What happens during normal cron runs is that each materialized view that is configured will be run, but whether or not that index is invalid will determine which index class is called. We will get into building those classes later on. By setting the index status to “invalid” you will be forcing a full refresh of the index during the next cron run.
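If you later need to force that full rebuild from your own code (say, in an upgrade script), a minimal sketch could look like the following. Loading the metadata record by its view_name field is an assumption on my part; the rest reuses the calls shown above.

// Hypothetical sketch: flag the materialized view as invalid so the next cron
// run performs a full rebuild instead of a changelog-based partial refresh.
$logTable = Mage::getModel('log/log')->getResource()->getTable('url_table');

$metadata = Mage::getModel('enterprise_mview/metadata')
    ->load($logTable, 'view_name');   // assumption: the metadata can be loaded by view_name

if ($metadata->getId()) {
    $metadata->setStatus(Enterprise_Mview_Model_Metadata::STATUS_INVALID)
        ->save();
}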
After the metadata has been created you will need to create an instance of the client associated with the view name to create the actual subscription.
$client = Mage::getModel('enterprise_mview/client');
/* @var $client Enterprise_Mview_Model_Client */
$client->init($logTable);
The value that is passed into the init() method is the name of the table that you are watching. This initializes the client and loads the metadata that was previously saved.
The next thing you need to do in your setup is create the changelog table. This is done by asking the client to do it for you by passing in a Magento URN to a class that will create the changelog. Generally speaking you can use the base class to do this, which is what we do here. The base class requires the table_name as a parameter. Generally the changelog will be named {$tableName}_cl.
$client->execute('enterprise_mview/action_changelog_create', array(
    'table_name' => $logTable
));
Now that the changelog has been created we need to create the subscription. The subscription class will create a trigger that watches for INSERT, UPDATE and DELETE operations and inserts the primary key value for the changed row into the changelog table. For each change made to the watched table a row will be inserted into the changelog that contains the previously defined watch column along with a version ID. This version ID corresponds to the version ID stored for this view in the enterprise_mview_metadata table. So when you are building an indexer later on, the way the indexer knows which rows to include is by selecting all of the rows in the changelog whose version ID is greater than or equal to the version ID found in the metadata table.
$client->execute('enterprise_mview/action_changelog_subscription_create', array(
    'target_table'  => $logTable,
    'target_column' => $logPrimaryKey
));
At this point we have Magento configured to watch for changes on a specified table and note row changes in the database. Now we need to do something with them.
In order to make this example useful we need to define at least two classes (Eschrade_UserAgent_Model_Index_Ua_Changed and Eschrade_UserAgent_Model_Index_Ua_All) to interact with the Mview components. Those two classes are there to either build the index partially or to build it fully. And while two classes generally should be defined, they can share most of their logic because the nature of the indexing operation is similar in both cases; the primary difference will be the size of the data set.
Most of the time the base indexer class can be built by extending Enterprise_Mview_Model_Action_Mview_Refresh_Changelog. While there are classes that can be used to define each class individually this might end up making more work than is necessary. In this case we will build a base class for refreshing the view, but provide some simple hooks that will allow the full re-indexer to easily do its job at the same time.
In the example here there are several methods that we are going to describe in words instead of code since they are either helper functions or are used to modify internal SQL queries to get what we’re looking for. If you would like to see the full code you can find it on GitHub at https://github.com/kschroeder/Magento-Enterprise-Materialized-Views-Example/blob/master/app/code/local/Eschrade/UserAgent/Model/Index/Ua/Changed.php.
class Eschrade_UserAgent_Model_Index_Ua_Changed
    extends Enterprise_Mview_Model_Action_Mview_Refresh_Changelog
{
    public function execute()
    {
        try {
            $this->_connection->beginTransaction();
            $this->preIndexHook();
            $this->populateUserAgentTable();
            $versionId = $this->_selectLastVersionId();
            $this->processIndex();
            $this->_metadata->setValidStatus()
                ->setVersionId($versionId)
                ->save();
            $this->_connection->commit();
        } catch (Exception $e) {
            $this->_connection->rollBack();
            $this->_metadata->setInvalidStatus()->save();
            throw $e;
        }
        return $this;
    }
}
This class will, by default, only process changed items. But when we look at the class we’ve defined for processing all of the user agent results, you will see that there is very little more that we need to do.
The first thing we do is start a transaction. This is not strictly required, but it is often a good idea to wrap this functionality in a transaction to handle cases where an error might corrupt the index or cause data to be processed multiple times.
The preIndexHook() method allows any subclasses to execute any logic prior to actually calculating the index. This is primarily used to allow the full re-index class to truncate the data.
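In the changelog class itself that hook can simply be empty; a minimal sketch of what it might look like (the full version is on GitHub) is:

// In Eschrade_UserAgent_Model_Index_Ua_Changed: intentionally empty so that a
// subclass, such as the full re-indexer, can override it to truncate the
// report table before a complete rebuild.
public function preIndexHook()
{
}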
Next up we call a method called populateUserAgentTable(). This is a helper method that does a LEFT JOIN on the log_visitor_info table and a table that is used to store the friendly names of the user agents. If there are a large number of user agents that have not been processed, this method can actually take several seconds to execute (full version on GitHub). This is done so that MySQL can do some processing for us later on rather than doing it in our code.
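As a rough idea of what such a helper could look like, here is a hedged sketch. The eschrade_useragent/agent mapping table, its user_agent and normalized_user_agent columns, and the reliance on get_browser() are assumptions for illustration rather than the exact code from the repository.

// Hypothetical sketch: map any user agents we have not seen before onto a
// friendly "Browser major-version" label so that later queries can JOIN on it.
protected function populateUserAgentTable()
{
    $resource         = Mage::getSingleton('core/resource');
    $visitorInfoTable = $resource->getTableName('log/visitor_info');
    $agentMapTable    = $resource->getTableName('eschrade_useragent/agent');

    // Find user agents in log_visitor_info that are not yet in the mapping table.
    $select = $this->_connection->select()
        ->distinct()
        ->from(array('info' => $visitorInfoTable), array('http_user_agent'))
        ->joinLeft(
            array('map' => $agentMapTable),
            'map.user_agent = info.http_user_agent',
            array()
        )
        ->where('map.user_agent IS NULL');

    foreach ($this->_connection->fetchCol($select) as $userAgent) {
        $browser = get_browser($userAgent, true);   // requires a browscap ini to be configured
        if (!$browser) {
            continue;                               // skip agents browscap cannot identify
        }

        $this->_connection->insertOnDuplicate($agentMapTable, array(
            'user_agent'            => $userAgent,
            'normalized_user_agent' => trim($browser['browser'] . ' ' . $browser['majorver']),
        ));
    }
}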
At this point we get the most current version ID from the changelog to use later when setting the last processed version ID. This could be done at the end of the process, but waiting until then makes it more likely that a version ID will be skipped. Perhaps a better option would be to look at the version IDs actually retrieved from the changelog and store the highest one. But for clarity, and because the elapsed time between this query and the retrieval of the unprocessed IDs will be very short, we do it here.
Next up is the processIndex() method. This is where the magic is implemented.
public function processIndex()
{
    $indexTable = Mage::getModel('eschrade_useragent/agentreport')->getResource()->getMainTable();
    $select = $this->_selectChangedRows();
    $this->joinSelectWithUrlAndVisitor($select);
    $stmt = $select->query();
    while (($row = $stmt->fetch()) !== false) {
        $this->_connection->insertOnDuplicate(
            $indexTable,
            array(
                'page'          => $row['url'],
                'agent'         => $row['normalized_user_agent'],
                'request_count' => $row['page_count']
            ),
            array(
                'request_count' => new Zend_Db_Expr('request_count + VALUES(request_count)')
            )
        );
    }
}
The first thing we need is the table that we are going to be inserting the values into. That would be the main agentreport table. Next up we retrieve the SELECT object that is pre-configured to retrieve only the primary keys of rows that have changed. This is defined in the Enterprise_Mview_Model_Action_Mview_Refresh_Changelog class. We make a small modification to the method to deal with some of the complex JOINs we do later, but it is essentially the same.
After we have retrieved the pre-configured SELECT object (the part that is directly pertinent to the Mview discussion) we modify it to suit our needs, which is specific to this example rather than to Mview. That method is custom to this example and does a number of JOINs on various tables to pull the data we need into one result set. This data is then inserted with an ON DUPLICATE clause that adds any new page/user agent counts to existing ones, or inserts them if no counts exist.
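For completeness, here is a rough sketch of what a JOIN-heavy method like that could look like. The table aliases (including the assumed cl alias for the changelog), the eschrade_useragent/agent mapping table, and the url, normalized_user_agent and page_count columns are all assumptions based on the description above, not the exact repository code.

// Hypothetical sketch: enrich the changed-rows SELECT so each row carries the
// page URL, the normalized user agent and a per page/agent hit count.
protected function joinSelectWithUrlAndVisitor(Varien_Db_Select $select)
{
    $resource = Mage::getSingleton('core/resource');

    $select
        ->join(
            array('log_url' => $resource->getTableName('log/url_table')),
            'log_url.url_id = cl.url_id',                        // "cl" assumed to be the changelog alias
            array()
        )
        ->join(
            array('url_info' => $resource->getTableName('log/url_info_table')),
            'url_info.url_id = log_url.url_id',
            array('url')
        )
        ->join(
            array('visitor_info' => $resource->getTableName('log/visitor_info')),
            'visitor_info.visitor_id = log_url.visitor_id',
            array()
        )
        ->join(
            array('agent_map' => $resource->getTableName('eschrade_useragent/agent')),
            'agent_map.user_agent = visitor_info.http_user_agent',
            array('normalized_user_agent')
        )
        ->group(array('url_info.url', 'agent_map.normalized_user_agent'))
        ->columns(array('page_count' => new Zend_Db_Expr('COUNT(*)')));
}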
At this point it is worth noting that using Mview forces us to do things in a somewhat less efficient manner. Generally, when you are building a Materialized View the goal is to do as much as possible with INSERT ... SELECT statements. This greatly cuts down on the possibility of hitting memory limits in PHP and also lets MySQL do the heavy lifting for you. A good rule of thumb is that you want to work as close to the data as possible.
But because we are applying partial changes, and because the table may already contain previously-calculated data for those rows, we are not able to do that here. This comes down to a limitation in the database adapter: when you define a field with a Zend_Db_Expr (which is how you would express the addition for duplicate entries), the expression you specify for a column is the same expression used in the ON DUPLICATE clause. Since we need an addition expression when updating existing rows, we cannot use the same value in the column list as in the ON DUPLICATE clause. Therefore, in this case, we need to process the rows manually.
This is why we first generate a result set and iterate over it, calling insertOnDuplicate(). insertOnDuplicate() allows us to define the columns and the ON DUPLICATE clause separately, which we can’t do via insertFromSelect().
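For contrast, if the additive expression were not needed (for example, when rebuilding into a freshly truncated table), something along these lines could keep all of the work inside MySQL. This is a hedged sketch of the alternative, not the approach used in this module:

// Hypothetical sketch: let MySQL copy the aggregated rows straight into the
// index table with a single INSERT ... SELECT, never pulling rows into PHP.
$sql = $this->_connection->insertFromSelect(
    $select,                                  // the same joined SELECT as above
    $indexTable,
    array('page', 'agent', 'request_count')
);
$this->_connection->query($sql);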
At this point when we run the index nothing will happen.
That is because we have not defined the class that will (re)process the entire view. Remember back to our setup script where we set the view as invalid? That will result in a full re-index.
So let’s take a look at that class in its entirety.
class Eschrade_UserAgent_Model_Index_Ua_All extends Eschrade_UserAgent_Model_Index_Ua_Changed
{
    public function preIndexHook()
    {
        $this->_connection->truncateTable(
            $this->getReportResource()->getTable('agentreport')
        );
    }

    protected function joinSelectWithUrlAndVisitor(Varien_Db_Select $select)
    {
        $select->reset(Varien_Db_Select::WHERE);
        parent::joinSelectWithUrlAndVisitor($select);
    }
}
Because of the way we built Eschrade_UserAgent_Model_Index_Ua_Changed we only need to do two things: 1) truncate the report table, and 2) remove the version_id condition from the WHERE clause.
And with that our classes are done.
But our indexers will still not run. That is because we have not defined our indexes in the configuration. Our indexers will be added under the /config/global node in our config.xml file.
<index>
    <indexer>
        <eschrade_useragent_report>
            <model>eschrade_useragent/index_ua_dummy</model>
            <action_model>
                <changelog>eschrade_useragent/index_ua_changed</changelog>
                <all>eschrade_useragent/index_ua_all</all>
            </action_model>
            <index_table>log_url</index_table>
        </eschrade_useragent_report>
    </indexer>
</index>
The first thing you might notice is that there is a dummy class defined under the <model> tag. This node is typically used by the index CLI script to process the individual indexes. However, because Mview-based indexers should be run via cron, we define a simple class that provides the name and the description for the admin UI but does nothing else. The main indexer, which is managed by the new Enterprise_Index module, reads the same configuration nodes as Mage_Index, but uses the <action_model> and <index_table> nodes instead.
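A minimal sketch of such a dummy model is below. I am assuming that extending Mage_Index_Model_Indexer_Abstract and stubbing out its event methods is enough to keep the grid happy; the Enterprise_Catalog dummy classes may extend a different base, and the eschrade_useragent helper is hypothetical.

// Hypothetical sketch: a do-nothing indexer whose only job is to supply a name
// and description to the Index Management grid. The actual re-indexing is
// handled by the Mview action models configured under <action_model>.
class Eschrade_UserAgent_Model_Index_Ua_Dummy extends Mage_Index_Model_Indexer_Abstract
{
    public function getName()
    {
        return Mage::helper('eschrade_useragent')->__('User Agent Report');
    }

    public function getDescription()
    {
        return Mage::helper('eschrade_useragent')->__('Summary of browsers per page; rebuilt via Mview and cron');
    }

    protected function _registerEvent(Mage_Index_Model_Event $event)
    {
        // Intentionally empty: changes are tracked by MySQL triggers, not Mage_Index events.
    }

    protected function _processEvent(Mage_Index_Model_Event $event)
    {
        // Intentionally empty for the same reason.
    }
}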
When the Mview index manager runs it will check the status column in the mview metadata table for indexes that are invalid. For each row whose status is “invalid” the table_name will be used to match up which indexer configuration should be used. The main indexer will check for a URN under <action_model><all> and run that indexer. This is the first part of the indexing process.
The second part of the indexing process is when Enterprise_Index iterates over all of the indexers defined in the merged configuration. An index is ignored if either the <action_model> or the <index_table> is missing. If a full re-index has already been run for a given indexer entry the changelog-based indexer will be omitted. Beyond that caveat every index configuration that has an <action_model><changelog> and <index_table> node will be run.
Conclusion
While there are several places in the new Magento Enterprise Edition 1.13 that make indexing more efficient, there is perhaps no mechanism more important than the implementation of Materialized Views. It allows for a standardized method of building summary/calculation tables in a predictable manner. And rather than being an implementation intended strictly for the new version of Magento Enterprise, the Mview module allows developers to utilize the same functionality that is largely responsible for some of the performance improvements seen in EE 1.13.