Search is a must-have feature for almost all enterprise and client-facing applications today. Users expect applications to perform search and other operations in near real time. With large volumes of data to process, it can be challenging for application owners and product managers to choose the architecture and technology that delivers this speed.
In our earlier blog, we saw ways to accelerate search with Elasticsearch in a .NET application. Continuing that discussion, we look at a real-life example of how search functionality can be accelerated through real-time sync of DynamoDB and Elasticsearch.
One of our clients in the real estate industry faced major challenges with search functionality. Their property listing application uses DynamoDB, and they required a solution that retained DynamoDB in the mix. The team decided on Elasticsearch, a distributed, open source, RESTful search engine commonly used in applications where a large volume of data needs to be searched.
This blog walks through the approach our team took to set up a real-time sync between DynamoDB and Elasticsearch to address their search performance issues.
Pushing historical data to Elasticsearch from DynamoDB
The first challenge was to push historical data from DynamoDB to Elasticsearch, which is typically a one-time activity. However, this activity needs to be carried out again if a new search parameter is added in the future (which is discussed later in this blog).
Let's first look at the process to load historical data into Elasticsearch. To execute this process, we developed a utility deployed as an AWS Lambda function with a manual trigger. The Lambda function copies the data from DynamoDB to an Elasticsearch index.
AWS Lambda is a serverless computing service that executes a piece of code in response to specific events while automatically managing the compute resources, which eliminates the need to monitor the scaling and performance of the function. This utility executes only when a new environment is provisioned. It can also be used to update Elasticsearch (ELS) whenever a large volume of data is changed in DynamoDB for any reason.
The utility walks through the following steps during execution:
- Data is fetched from DynamoDB with the help of AWS APIs
- Each field is mapped to an ELS object field
- An array of ELS objects is created so that the objects can be pushed into ELS through a single HTTP request, decreasing the total number of HTTP requests
- The ELS object array is pushed into ELS with the help of the ELS bulk APIs
- Batching can be used if the array is large or the bulk API returns a timeout error due to the size of the data. The array can be divided into a number of batches, and the bulk API is called once per batch to push its data into ELS.
- The object can be created or updated based on the primary key of the object in ELS.
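The mapping, bulk and batching steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the utility the team built: the index name `properties`, the `PropertyName` and `Bedrooms` attributes, and all helper names are hypothetical, and the items are assumed to arrive in the typed format returned by the low-level DynamoDB API. Fetching the items and sending each body (an HTTP POST to the `_bulk` endpoint) are left out.

```python
import json


def from_dynamodb(value):
    """Convert one typed DynamoDB attribute value, e.g. {"S": "abc"}, to plain Python."""
    (type_tag, raw), = value.items()
    if type_tag in ("S", "BOOL"):
        return raw
    if type_tag == "N":
        # DynamoDB transmits numbers as strings.
        return float(raw) if any(c in raw for c in ".eE") else int(raw)
    if type_tag == "NULL":
        return None
    if type_tag == "L":
        return [from_dynamodb(v) for v in raw]
    if type_tag == "M":
        return {k: from_dynamodb(v) for k, v in raw.items()}
    raise ValueError(f"unsupported DynamoDB type tag: {type_tag}")


def to_els_doc(item):
    """Map every field of a DynamoDB item to a field of an ELS object."""
    return {name: from_dynamodb(value) for name, value in item.items()}


def build_bulk_body(docs, index, id_field):
    """Build the newline-delimited JSON body expected by the bulk API.

    The "index" action creates the document, or replaces it if a document
    with the same _id already exists.
    """
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc[id_field]}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical items, shaped like the output of a DynamoDB scan.
items = [
    {"PropertyID": {"S": "p-1"}, "PropertyName": {"S": "Lake House"}, "Bedrooms": {"N": "3"}},
    {"PropertyID": {"S": "p-2"}, "PropertyName": {"S": "City Loft"}, "Bedrooms": {"N": "1"}},
    {"PropertyID": {"S": "p-3"}, "PropertyName": {"S": "Farm Cottage"}, "Bedrooms": {"N": "2"}},
]
docs = [to_els_doc(item) for item in items]
# One bulk request body per batch of two documents.
bodies = [build_bulk_body(batch, "properties", "PropertyID") for batch in batches(docs, 2)]
```

Each entry in `bodies` would be sent as one bulk request, so the batch size controls how much data a single request carries.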
The next challenge was handling updates made to DynamoDB, either manually or through an application, which should be reflected in Elasticsearch in real time so that the user gets the latest data in the search results. Let's see how this incremental data can be handled.
Pushing incremental data to Elasticsearch from DynamoDB
An AWS Lambda function with a DynamoDB trigger can be created using the AWS Lambda service. Whenever a record is added, updated or deleted in DynamoDB, a new event is generated, triggering the Lambda function to update Elasticsearch with the changed data. With this solution, Elasticsearch is always kept up to date with the latest data in DynamoDB. The Lambda service polls the DynamoDB stream for such events and invokes the function as soon as it finds one.
The flow of the Lambda function to update incremental data in ELS is as follows:
- Details captured for each event include the event type (record is added, updated or deleted), the old record (the record before it was updated), and the new record (with the updated data)
- The Lambda function is triggered by these events; once an event is generated, the execution starts.
- Based on the event type, the Lambda function calls the respective function of ELS to update the data
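The flow above might look like the following Python handler. It is a sketch under assumptions rather than the client's actual function: the index name `properties`, the key attribute `PropertyID` stored as a string, and the `plan_els_operations` helper are hypothetical, and the actual call to ELS is left as a comment. The event names `INSERT`, `MODIFY` and `REMOVE` are the ones DynamoDB Streams uses.

```python
def plan_els_operations(event, index="properties", key_name="PropertyID"):
    """Translate DynamoDB stream records into the ELS operations to perform."""
    operations = []
    for record in event.get("Records", []):
        change = record["dynamodb"]
        doc_id = change["Keys"][key_name]["S"]  # assumes a string primary key
        if record["eventName"] in ("INSERT", "MODIFY"):
            # NewImage is only present when the stream view type includes new images.
            operations.append(
                {"op": "index", "index": index, "id": doc_id, "image": change.get("NewImage")}
            )
        elif record["eventName"] == "REMOVE":
            operations.append({"op": "delete", "index": index, "id": doc_id})
    return operations


def lambda_handler(event, context):
    """Entry point invoked by Lambda with a batch of stream records."""
    operations = plan_els_operations(event)
    # A real function would send these operations to ELS here, e.g. via the bulk API.
    return {"operations": len(operations)}


# Hypothetical event with one record of each type.
example_event = {
    "Records": [
        {"eventName": "INSERT",
         "dynamodb": {"Keys": {"PropertyID": {"S": "p-1"}},
                      "NewImage": {"PropertyID": {"S": "p-1"}}}},
        {"eventName": "MODIFY",
         "dynamodb": {"Keys": {"PropertyID": {"S": "p-2"}},
                      "NewImage": {"PropertyID": {"S": "p-2"}}}},
        {"eventName": "REMOVE",
         "dynamodb": {"Keys": {"PropertyID": {"S": "p-3"}}}},
    ]
}
planned = plan_els_operations(example_event)
```

Keeping the planning step separate from the network call makes the event handling easy to test without an ELS cluster.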
The following screenshots provide a pictorial representation of this process, assuming that the DynamoDB has already been created:
- Start the Lambda function creation process by selecting a template from the available blueprints
- From the blueprints, select the dynamodb-process-stream blueprint.
- Once you select the blueprint, configure the DynamoDB trigger with the following:
DynamoDB Table: Select the DynamoDB table which you have already created for your application.
Batch Size: The maximum number of records read from the stream and passed to the function in one invocation. For example, if 200 records are modified and the batch size is 100, 2 batches are created to update ELS.
Starting Position: It can be LATEST or TRIM_HORIZON.
TRIM_HORIZON: Stream records are retained for 24 hours; records older than this limit are removed (trimmed) from the stream. Reading starts at the last untrimmed stream record, which is the oldest record in the shard.
LATEST: Reading starts just after the most recent stream record in the shard, so the function always reads the most recent data.
- The event is passed to the Lambda function as a JSON object. The structure of the JSON object is depicted below:
It shows the event name: INSERT, MODIFY or REMOVE (DynamoDB Streams reports a deletion as REMOVE). In the above case, the event is MODIFY and the changed data is the property name. The primary key is PropertyID (56817c5a-a419-42be-b9ca-daa4849e31a0). If the property ID is present in ELS, the record is updated; otherwise, it is created.
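For reference, a MODIFY event of the kind described above could be represented as the Python dictionary below, together with a small helper that works out which attributes changed. This is an illustrative reconstruction, not the original screenshot: the attribute name `PropertyName` and its values are hypothetical placeholders, only a subset of the fields in a real stream record is shown, and the `OldImage` and `NewImage` entries assume a stream configured with the `NEW_AND_OLD_IMAGES` view type.

```python
sample_event = {
    "Records": [
        {
            "eventName": "MODIFY",
            "eventSource": "aws:dynamodb",
            "dynamodb": {
                "Keys": {"PropertyID": {"S": "56817c5a-a419-42be-b9ca-daa4849e31a0"}},
                # Record before the update (placeholder values).
                "OldImage": {
                    "PropertyID": {"S": "56817c5a-a419-42be-b9ca-daa4849e31a0"},
                    "PropertyName": {"S": "Old property name"},
                },
                # Record after the update (placeholder values).
                "NewImage": {
                    "PropertyID": {"S": "56817c5a-a419-42be-b9ca-daa4849e31a0"},
                    "PropertyName": {"S": "New property name"},
                },
                "StreamViewType": "NEW_AND_OLD_IMAGES",
            },
        }
    ]
}


def changed_attributes(record):
    """Return the sorted names of attributes that differ between old and new images."""
    old = record["dynamodb"].get("OldImage", {})
    new = record["dynamodb"].get("NewImage", {})
    return sorted(name for name in old.keys() | new.keys() if old.get(name) != new.get(name))


record = sample_event["Records"][0]
event_name = record["eventName"]
property_id = record["dynamodb"]["Keys"]["PropertyID"]["S"]
changed = changed_attributes(record)
```

Here `changed` contains only `PropertyName`, which matches the description above: the event is a MODIFY on the property name, keyed by PropertyID.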
Summary
It is critical for application developers / owners to ensure their user experience includes rapid and accurate search functionality. This seemingly simple function gets even more complicated when data is both historical and incremental.
In such scenarios, Elasticsearch is a good choice for search performance. Triggers can be used with DynamoDB or other similar databases to sync the incremental data, so end users always receive the most up-to-date data quickly. This solution can be used irrespective of which industry you are in. If you are facing a similar challenge, Contact Us for a quick PoC.