Amazon DynamoDB is used for a wide range of applications, from gaming and mobile app backends to high-performance web-scale applications. In many of these scenarios, businesses need to store large objects, such as high-resolution images, videos, or extensive text documents. However, DynamoDB limits the size of a single item, which makes storing large objects a challenge. This blog post covers techniques for efficiently handling and storing large objects in DynamoDB.

Overview of DynamoDB Limitations

  1. Item Size Limitation in DynamoDB: DynamoDB imposes a limit on the size of a single item that you can store, which is currently 400 KB. This includes both the attribute names and values.

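  2. Attributes and Throughput Considerations: DynamoDB charges you based on the amount of data you read or write, which is measured in Read Capacity Units (RCUs) and Write Capacity Units (WCUs). One WCU covers one write per second of an item up to 1 KB, and one RCU covers one strongly consistent read per second (or two eventually consistent reads) of an item up to 4 KB. Larger objects therefore consume proportionally more RCUs and WCUs on every operation.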
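To make these limits concrete, here is a minimal sketch that estimates the capacity units a single item consumes. The function names are hypothetical, and the sketch assumes 1 KB = 1024 bytes and standard (non-transactional) reads and writes.

```python
import math

# Assumption: the 400 KB item limit is treated as 400 * 1024 bytes
MAX_ITEM_BYTES = 400 * 1024


def write_capacity_units(item_bytes):
    """A standard write consumes 1 WCU per 1 KB of item size, rounded up."""
    return max(1, math.ceil(item_bytes / 1024))


def read_capacity_units(item_bytes, strongly_consistent=True):
    """A read consumes 1 RCU per 4 KB (rounded up); eventually consistent reads cost half."""
    units = max(1, math.ceil(item_bytes / 4096))
    return units if strongly_consistent else units / 2


def fits_in_one_item(item_bytes):
    """Check whether an item of this size is within the single-item limit."""
    return item_bytes <= MAX_ITEM_BYTES


size = 100 * 1024  # a 100 KB item
print(write_capacity_units(size), read_capacity_units(size), fits_in_one_item(size))
```

The same helpers can be used to compare the cost of an item before and after applying any of the strategies below.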

Strategies for Handling Large Objects

Compress Large Objects

Compressing your large objects can significantly reduce their size. Common compression algorithms include GZIP and Brotli. Implementing compression before storing data can often be as simple as using libraries in your programming language of choice.
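As a quick illustration, the sketch below uses Python's standard gzip module to compare the serialized and compressed sizes of an object against the 400 KB limit. The sample document is deliberately repetitive, so it compresses far better than typical real-world data would; treat the ratio as illustrative only.

```python
import gzip
import json

# Assumption: the 400 KB item limit is treated as 400 * 1024 bytes
MAX_ITEM_BYTES = 400 * 1024


def raw_and_compressed_size(obj):
    """Return (serialized size, gzip-compressed size) in bytes."""
    raw = json.dumps(obj).encode("utf-8")
    return len(raw), len(gzip.compress(raw))


# Hypothetical sample document, larger than the item limit when serialized
document = {"body": "DynamoDB stores items up to 400 KB. " * 20000}
raw_bytes, packed_bytes = raw_and_compressed_size(document)

print(f"raw: {raw_bytes} bytes, fits: {raw_bytes <= MAX_ITEM_BYTES}")
print(f"compressed: {packed_bytes} bytes, fits: {packed_bytes <= MAX_ITEM_BYTES}")
```

Measuring the ratio on a sample of your own data is the most reliable way to decide whether compression alone is enough.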

Implementation Strategy

Below is an example of how you can create a Python wrapper using boto3 and zlib to handle compression and decompression of items stored in DynamoDB:

  
import boto3
import zlib
import json
import logging

class DynamoDBCompressionWrapper:
    def __init__(self, table_name):
        self.dynamodb = boto3.resource('dynamodb')
        self.table = self.dynamodb.Table(table_name)
        logging.basicConfig(level=logging.INFO)

    def compress_data(self, data):
        try:
            serialized_data = json.dumps(data).encode()
            return zlib.compress(serialized_data)
        except Exception as e:
            logging.error(f"Error compressing data: {e}")
            return None

    def decompress_data(self, compressed_data):
        try:
            decompressed_data = zlib.decompress(compressed_data)
            return json.loads(decompressed_data)
        except Exception as e:
            logging.error(f"Error decompressing data: {e}")
            return None

    def put_item(self, key, data):
        try:
            compressed_data = self.compress_data(data)
            if compressed_data:
                self.table.put_item(
                    Item={
                        'key': key,
                        'data': compressed_data
                    }
                )
                logging.info(f"Item with key {key} stored successfully.")
            else:
                logging.error("Failed to store item due to compression error.")
        except Exception as e:
            logging.error(f"Error storing item in DynamoDB: {e}")

    def get_item(self, key):
        try:
            response = self.table.get_item(
                Key={
                    'key': key
                }
            )
            if 'Item' in response:
                # boto3 returns binary attributes wrapped in a Binary object;
                # .value holds the raw bytes
                compressed_data = response['Item']['data'].value
                return self.decompress_data(compressed_data)
            else:
                logging.warning(f"No item found with key {key}.")
                return None
        except Exception as e:
            logging.error(f"Error retrieving item from DynamoDB: {e}")
            return None
 

Pros

  1. DynamoDB Compatible: By employing compression and continuing to use DynamoDB for storage, you retain access to all the inherent benefits of DynamoDB, such as low latency and fine-grained access control.
  2. Transaction Support: As the compressed data remains a DynamoDB item, it can seamlessly be integrated with DynamoDB’s transaction APIs.

Cons

  1. Processing Overhead: Compression and decompression require CPU processing, which can add overhead, especially for write-heavy or read-heavy applications.
  2. Complexity: Implementing compression adds complexity to your codebase.
  3. Limited Querying Capabilities: Once the data is compressed and stored as a binary blob, you can’t use DynamoDB’s filter expressions on the content of the compressed data. This limits your ability to query data based on its content.
  4. Considerations for GSI Keys: Compressing data makes the content unreadable, so you can’t create Global Secondary Index (GSI) keys based on the content of the compressed data. Therefore, careful planning is necessary to choose appropriate keys and attributes that are not compressed, to optimize query performance using indexes.
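One way to work around the last two points is to keep the attributes you need for keys, indexes, and filters as plain attributes and compress only the rest. The following is a minimal sketch of that idea; the function names and the `payload` attribute name are hypothetical.

```python
import json
import zlib


def build_item(record, plain_attributes):
    """Keep selected attributes uncompressed and compress everything else into 'payload'."""
    item = {name: record[name] for name in plain_attributes if name in record}
    remainder = {k: v for k, v in record.items() if k not in plain_attributes}
    item["payload"] = zlib.compress(json.dumps(remainder).encode("utf-8"))
    return item


def restore_record(item):
    """Rebuild the original record from the plain attributes and the compressed payload."""
    payload = item["payload"]
    # boto3's resource API wraps binary attributes in a Binary object with a .value field
    payload = getattr(payload, "value", payload)
    record = {k: v for k, v in item.items() if k != "payload"}
    record.update(json.loads(zlib.decompress(payload)))
    return record


record = {"key": "doc1", "status": "published", "body": "lorem ipsum " * 500}
item = build_item(record, ["key", "status"])
print(sorted(item))  # 'status' stays available for filters and GSI keys
```

The dictionary returned by `build_item` can be passed as `Item` to `put_item`, and `status` remains usable in filter expressions or as a GSI key.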

Vertical Sharding

Another approach to storing large objects in DynamoDB is a technique called vertical sharding. Vertical sharding involves splitting the attributes of an item across multiple items (or shards) in the table, rather than storing them in a single item.

This technique is useful for efficiently storing and retrieving large items, and is especially beneficial when different attributes of an item are accessed in distinct patterns or frequencies. For example, if certain attributes are accessed more frequently than others, they can be grouped together in a shard for more cost-effective and performant data retrieval.

Implementation Strategy

  1. Attribute Grouping: Determine how to group the attributes of your large object logically. Ideally, attributes that are often accessed together should be stored in the same shard.

  2. Shard Key Design: Design a shard key that combines the primary key of the original item with a shard identifier. This will help you to query all the shards of an item efficiently.

  3. Write Operations: When writing data to DynamoDB, divide the object into shards based on the attribute grouping. Write each shard as a separate item in the table using the designed shard key.

  4. Read Operations: When reading data from DynamoDB, perform a query operation using the primary key of the original item to retrieve all shards. Combine the data from these shards to reconstruct the original object.

  5. Handling Updates: When updating an attribute, identify which shard contains that attribute, and update only the relevant shard.
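Step 5 can be sketched as follows. This is only an illustration under stated assumptions: the table uses the original item's ID as partition key (called `item_id` here) and `shard_key` as sort key, each grouped attribute is stored as a top-level attribute of its shard item, and the attribute-to-shard mapping is hypothetical.

```python
# Hypothetical mapping from attribute name to the shard that stores it
ATTRIBUTE_TO_SHARD = {
    "title": "summary",
    "description": "summary",
    "comments": "comments",
}


def build_shard_update(item_id, attribute, value):
    """Build update_item arguments that touch only the shard holding the attribute."""
    shard = ATTRIBUTE_TO_SHARD[attribute]
    return {
        "Key": {"item_id": item_id, "shard_key": f"{item_id}#{shard}"},
        # A name placeholder avoids clashes with DynamoDB reserved words
        "UpdateExpression": "SET #attr = :val",
        "ExpressionAttributeNames": {"#attr": attribute},
        "ExpressionAttributeValues": {":val": value},
    }


kwargs = build_shard_update("item123", "title", "A new title")
print(kwargs["Key"])
```

With a boto3 Table resource, the result would be applied with `table.update_item(**kwargs)`, leaving the other shards untouched.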

Let’s consider an example where we have a large user profile object for a social media platform. The object includes basic information, an array of posts, a list of followers, and a list of multimedia files (such as photos and videos). The total size of this user profile object is around 800KB, which is beyond the maximum item size limit of 400KB in DynamoDB.

Before employing vertical sharding, the user profile object might look like this:

UserProfile: {
    user_id: "user123",
    name: "John Doe",
    email: "[email protected]",
    posts: [ ... ],              // array of posts, ~400KB
    followers: [ ... ],          // list of followers, ~200KB
    multimedia: [ ... ],         // photos and videos, ~150KB
    additional_info: { ... }     // other data, ~50KB
}

We’ll break this object into four shards: BasicInfoShard, PostsShard, FollowersShard, and MultimediaShard. Each shard will be stored as a separate item in DynamoDB, but they will all be associated with the same user_id.

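| user_id | shard_key | name | email | additional_info | posts | followers | multimedia | approximate_size |
|---|---|---|---|---|---|---|---|---|
| user123 | user123#basic_info | John Doe | [email protected] | { … } | | | | ~50KB |
| user123 | user123#posts | | | | [ … ] | | | ~400KB |
| user123 | user123#followers | | | | | [ … ] | | ~200KB |
| user123 | user123#multimedia | | | | | | [ … ] | ~150KB |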

Since both posts and multimedia can grow beyond 400KB, we can further shard them by introducing a sequence number. This helps in splitting the posts and multimedia content into multiple shards as needed.

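| shard_key | user_id | name | email | additional_info | posts | followers | multimedia | sequence_number |
|---|---|---|---|---|---|---|---|---|
| user123#posts#1 | user123 | | | | [post1, …] | | | 1 |
| user123#posts#2 | user123 | | | | [postN, …] | | | 2 |
| user123#multimedia#1 | user123 | | | | | | [photo1,..] | 1 |
| user123#multimedia#2 | user123 | | | | | | [video1,..] | 2 |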
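Splitting by a fixed number of entries only works if the entries are similar in size. An alternative, sketched below, is to pack entries into shards by their encoded size. The 350 KB budget is an assumption chosen to leave headroom for key attributes and attribute names, which also count toward the 400KB limit.

```python
import json

# Assumption: leave headroom below the 400KB item limit for keys and attribute names
MAX_SHARD_BYTES = 350 * 1024


def chunks_by_size(entries, max_bytes=MAX_SHARD_BYTES):
    """Yield lists of entries whose JSON encoding stays within max_bytes."""
    current = []
    current_size = 2  # the enclosing "[" and "]"
    for entry in entries:
        # Encoded size of this entry plus the ", " separator json.dumps adds
        entry_size = len(json.dumps(entry).encode("utf-8")) + 2
        if 2 + entry_size > max_bytes:
            raise ValueError("a single entry exceeds the shard budget")
        if current and current_size + entry_size > max_bytes:
            yield current
            current = []
            current_size = 2
        current.append(entry)
        current_size += entry_size
    if current:
        yield current


posts = [{"id": i, "text": "x" * 100} for i in range(50)]
shards = list(chunks_by_size(posts, max_bytes=1024))
print(len(shards), [len(s) for s in shards])
```

Each yielded list can then be written as one shard, with its position in the sequence used as the sequence number.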

When querying for a user’s profile, you can perform a query operation using the partition key user_id, and reconstruct the complete profile by combining data from all the shards.

  
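import json

import boto3

# Assumes a table with partition key user_id and sort key shard_key;
# the table name is a placeholder
table = boto3.resource('dynamodb').Table('your-dynamodb-table-name')

def put_user_profile(user_id, basic_info, posts, followers, multimedia):
    # Basic info shard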
    table.put_item(
        Item={
            'shard_key': f"{user_id}#basic_info",
            'user_id': user_id,
            'basic_info': json.dumps(basic_info),
        }
    )
    
    # Posts shards
    for index, post_chunk in enumerate(chunks(posts, 5), start=1):  # Example chunk size is 5 posts
        table.put_item(
            Item={
                'shard_key': f"{user_id}#posts#{index}",
                'user_id': user_id,
                'posts': json.dumps(post_chunk),
            }
        )
    
    # Followers shard
    table.put_item(
        Item={
            'shard_key': f"{user_id}#followers",
            'user_id': user_id,
            'followers': json.dumps(followers),
        }
    )
    
    # Multimedia shards
    for index, media_chunk in enumerate(chunks(multimedia, 10), start=1):  # Example chunk size is 10 files
        table.put_item(
            Item={
                'shard_key': f"{user_id}#multimedia#{index}",
                'user_id': user_id,
                'multimedia': json.dumps(media_chunk),
            }
        )

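def get_user_profile(user_id):
    # Query all shards for a user. A single Query call returns at most 1 MB
    # of data, so keep following LastEvaluatedKey until every page is read.
    query_kwargs = {
        'KeyConditionExpression': 'user_id = :user_id',
        'ExpressionAttributeValues': {':user_id': user_id},
    }
    items = []
    while True:
        response = table.query(**query_kwargs)
        items.extend(response['Items'])
        if 'LastEvaluatedKey' not in response:
            break
        query_kwargs['ExclusiveStartKey'] = response['LastEvaluatedKey']
    
    # Combine shards to reconstruct the original profile.
    # Note: shard keys sort as strings (posts#10 comes before posts#2), so
    # zero-pad the index when writing if the order of chunks matters.
    profile = {}
    for item in items: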
        shard_type = item['shard_key'].split('#')[1]
        if shard_type == 'basic_info':
            profile['basic_info'] = json.loads(item['basic_info'])
        elif shard_type == 'posts':
            profile.setdefault('posts', []).extend(json.loads(item['posts']))
        elif shard_type == 'followers':
            profile['followers'] = json.loads(item['followers'])
        elif shard_type == 'multimedia':
            profile.setdefault('multimedia', []).extend(json.loads(item['multimedia']))
    
    return profile

def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]
 

Pros

  1. Optimized Access Patterns: By grouping related attributes together in shards, you can optimize data retrieval for common access patterns, potentially reducing the amount of data read.

  2. Selective Updates: You can update specific attributes of a large object without having to write the entire object back to DynamoDB, which may save on write capacity.

  3. Granular Control: Offers granular control over which attributes are fetched, allowing for potentially more efficient read operations.

Cons

  1. Increased Complexity: Implementing and managing vertical sharding adds complexity to both the data model and the application logic, especially when it comes to reassembling the shards into a single object.

  2. Consistency Management: Keeping data consistent across shards can be challenging, especially in high-throughput environments.

  3. Query and Pagination Overhead: Querying a sharded item requires additional logic to handle pagination and reassembly of the item from its shards.

In addition to the standalone benefits of vertical sharding, you can further optimize the storage of large objects in DynamoDB by combining vertical sharding with compression. This combination can be especially beneficial when dealing with very large objects, or when you want to optimize the storage and retrieval efficiency.
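A minimal sketch of this combination is shown below: the object is serialized and compressed once, and the compressed bytes are then split into shards. The function names are hypothetical, and the 350 KB default shard size is an assumption chosen to leave room for keys and attribute names.

```python
import json
import zlib


def to_compressed_shards(obj, max_shard_bytes=350 * 1024):
    """Compress the serialized object, then split the bytes into ordered shards."""
    packed = zlib.compress(json.dumps(obj).encode("utf-8"))
    return [packed[i:i + max_shard_bytes] for i in range(0, len(packed), max_shard_bytes)]


def from_compressed_shards(shards):
    """Join the shards in their original order, then decompress and deserialize."""
    return json.loads(zlib.decompress(b"".join(shards)))


profile = {"posts": [f"post number {i}" for i in range(5000)]}
shards = to_compressed_shards(profile, max_shard_bytes=1024)
print(len(shards), from_compressed_shards(shards) == profile)
```

Each shard would be stored as a binary attribute together with a sequence number, because the shards must be concatenated in order before decompression. The trade-off is that no individual shard can be read or updated on its own.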

Using Amazon S3 for Storage

When dealing with very large objects that are not frequently accessed or don’t need the low latency retrieval that DynamoDB offers, an alternative approach is to store these objects in Amazon S3 and keep a reference or pointer to the S3 object in DynamoDB.

Implementation Strategy

  1. Store Object in S3: Store the large object in an S3 bucket.

  2. Store Reference in DynamoDB: Store a reference to the S3 object in a DynamoDB table. This reference can be the S3 object key (usually the file path).

  3. Data Retrieval: When you need to access the object, read the reference from DynamoDB, and then use it to retrieve the object from S3.

Here’s some example code in Python using boto3:

import boto3

# Initialize boto3 clients
dynamodb = boto3.resource('dynamodb')
s3 = boto3.client('s3')

# Specify the S3 bucket and DynamoDB table
bucket_name = 'your-s3-bucket-name'
table_name = 'your-dynamodb-table-name'

# Reference to the DynamoDB table
table = dynamodb.Table(table_name)

def store_large_object(key, file_path):
    # Upload the file to S3
    s3.upload_file(file_path, bucket_name, key)
    
    # Store the reference in DynamoDB, using the same key as the primary key
    # value so that retrieve_large_object(key) can find it
    table.put_item(
        Item={
            'your_primary_key': key,
            's3_key': key
        }
    )

def retrieve_large_object(your_primary_key):
    # Retrieve the S3 key from DynamoDB
    response = table.get_item(Key={'your_primary_key': your_primary_key})
    if 'Item' not in response:
        return None  # no reference stored for this key
    s3_key = response['Item']['s3_key']
    
    # Retrieve the object from S3
    s3_object = s3.get_object(Bucket=bucket_name, Key=s3_key)
    return s3_object['Body'].read()

Pros

  1. Scalability: S3 is highly scalable and can store very large objects, up to 5 TB.

  2. Cost-Efficient: Storing large objects in S3 can be more cost-effective than storing them in DynamoDB, especially for infrequently accessed data.

  3. Simplified Data Model in DynamoDB: Keeps the DynamoDB data model simple by only storing references.

  4. Offloads Heavy Lifting: S3 handles the heavy lifting of storing large objects, freeing up resources in DynamoDB for high-performance operations.

Cons

  1. Increased Latency: Retrieving large objects from S3 can have higher latency compared to fetching them directly from DynamoDB.

  2. Consistency Issues: Because you’re using two different data stores (DynamoDB and S3), you might encounter consistency issues if, for example, an object is deleted from S3 but its reference in DynamoDB is not updated.

  3. Increased Complexity: Managing objects across two different services adds complexity to the application logic.

  4. Multiple Points of Failure: Using two services means there are multiple points of failure that need to be accounted for in your application’s error-handling logic.
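One common way to reduce the consistency risk is to order the two writes so that a failure never leaves a DynamoDB pointer without an object behind it. The sketch below is only an illustration under assumptions: the function name is hypothetical, `s3_client` is expected to behave like `boto3.client('s3')`, and `table` like a boto3 DynamoDB Table resource.

```python
def store_with_cleanup(s3_client, table, bucket_name, primary_key, s3_key, body):
    """Upload to S3 first, then write the pointer; remove the upload if the pointer write fails."""
    s3_client.put_object(Bucket=bucket_name, Key=s3_key, Body=body)
    try:
        table.put_item(Item={"your_primary_key": primary_key, "s3_key": s3_key})
    except Exception:
        # Best-effort cleanup so that no unreferenced object is left behind in S3
        s3_client.delete_object(Bucket=bucket_name, Key=s3_key)
        raise
```

Deletes would follow the reverse order: remove the DynamoDB reference first, then the S3 object, so that readers never follow a pointer to a missing object. The cleanup itself can still fail, so a periodic job that looks for unreferenced objects may be needed as well.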

In conclusion, using S3 for storing large objects and keeping a pointer in DynamoDB is an effective strategy for certain use cases, especially when dealing with very large objects or when cost savings are a priority. However, it’s essential to consider the trade-offs regarding latency, complexity, and consistency.

Cost Analysis

| Object Size | S3 Storage Cost ($) | S3 PUT Cost ($) | S3 GET Cost ($) | DynamoDB WCU Cost ($) | DynamoDB RCU Cost ($) | Total Write Cost ($) | Total Read Cost ($) |
|---|---|---|---|---|---|---|---|
| 100 KB | 0.0000023 | 0.000005 | 0.0000004 | 0.00000125 | 0.00000025 | 0.00000855 | 0.00000065 |
| 400 KB | 0.0000092 | 0.000005 | 0.0000004 | 0.00000125 | 0.00000025 | 0.00001545 | 0.00000065 |
| 1 MB | 0.000023 | 0.000005 | 0.0000004 | 0.00000125 | 0.00000025 | 0.00002925 | 0.00000065 |
| 4 MB | 0.000092 | 0.000005 | 0.0000004 | 0.00000125 | 0.00000025 | 0.00009825 | 0.00000065 |

If we compare the cost of compressed storage in DynamoDB (assuming 66% compression) with the S3 storage strategy for a million 100KB objects, writing costs $55 with DynamoDB, whereas the S3 strategy costs $8.55. This shows that using S3 for storing large objects and keeping a pointer in DynamoDB can be more cost-effective than storing compressed objects directly in DynamoDB. Similarly, for eventually consistent reads, DynamoDB costs $1.5 compared to $0.65 with the S3 strategy.
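The S3 figures in the table can be reproduced with a few lines of arithmetic. In the sketch below, the per-request prices are the ones shown in the table, while the per-GB storage rate and the decimal unit convention (1 GB = 1,000,000 KB) are inferred from the table's numbers. Actual prices vary by region and change over time.

```python
# Unit prices in USD: per-request values come from the table,
# the storage rate is inferred from its storage column
S3_STORAGE_PER_GB_MONTH = 0.023
S3_PUT_PER_REQUEST = 0.000005
S3_GET_PER_REQUEST = 0.0000004
DDB_WRITE_PER_UNIT = 0.00000125  # one write of the small pointer item
DDB_READ_PER_UNIT = 0.00000025   # one read of the small pointer item
KB_PER_GB = 1_000_000            # decimal units, as the table appears to use


def s3_pointer_write_cost(object_kb):
    """One month of S3 storage plus one S3 PUT plus one DynamoDB pointer write."""
    storage = object_kb / KB_PER_GB * S3_STORAGE_PER_GB_MONTH
    return storage + S3_PUT_PER_REQUEST + DDB_WRITE_PER_UNIT


def s3_pointer_read_cost():
    """One S3 GET plus one DynamoDB pointer read."""
    return S3_GET_PER_REQUEST + DDB_READ_PER_UNIT


objects = 1_000_000
print(f"write, 1M x 100 KB: ${s3_pointer_write_cost(100) * objects:.2f}")
print(f"read, 1M objects:   ${s3_pointer_read_cost() * objects:.2f}")
```

Only the storage term grows with object size, which is why the total write cost rises slowly across the rows while the read cost stays flat.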

Real-life Scenarios and Use Cases

Handling Large Text Documents

  • For text documents, compression can be highly effective.
  • Store the compressed document in DynamoDB if it is small enough; otherwise, store it in S3.

Storing High-Resolution Images

  • High-resolution images are best stored in S3.
  • You can use DynamoDB to store metadata such as file names, sizes, and S3 URLs.

Managing Video Files

  • Video files are typically very large and are best handled by storing them in S3.
  • DynamoDB can be used to keep track of metadata.

Storing and Retrieving Complex JSON Objects

  • For large JSON objects where you need to perform efficient querying on different parts of the object, vertical sharding might be more appropriate.
  • For very large JSON objects where querying is less of a concern, and you prioritize storage cost-efficiency and simplicity, using S3 objects can be the most effective.

Conclusion

Handling large objects in DynamoDB is an art that requires a good understanding of the constraints and the various strategies available. Whether it’s through compression, vertical sharding, or using Amazon S3, each method has its own set of trade-offs. By following the best practices and staying vigilant about performance and security, you can master the art of handling large objects efficiently in DynamoDB.

Be sure to check out more such insightful blogs in my Master Dynamodb: Demystifying AWS's Powerful NoSQL Database Service, One Byte at a Time series, for a deeper dive into DynamoDB's best practices and advanced features. Stay tuned and keep learning!