SD - Search and Database
· 7 min read
Search Patterns
Query Type | Index Type | Tools |
---|---|---|
Text search ("Pizza Hut") | Full-text search by inverted index | Elasticsearch, Postgres + PG_TRGM extension |
Find nearby locations (restaurants within 2 miles) ![]() | Geospatial index (Quadtree, GeoHash) | ElasticSearch (GeoHash), Postgres + PostGIS |
Search within a polygon (e.g., restaurants within Santa Clara county) ![]() | Geospatial index with polygons | ElasticSearch (GeoHash), Postgres + PostGIS |
Category search (e.g., "restaurants") | B-Tree index (natively supported) | Elasticsearch, Postgres |
Elasticsearch and Postgres allow combining various types of indexes (e.g., full-text, geospatial, B-Tree) in a single search query.
Sample Query with 3 Indices
CREATE TABLE locations (
id,
name TEXT,
description TEXT, // Search using Inverted Index
category TEXT, // Search using B-tree
coordinates GEOGRAPHY(Point, 4326) // Search using GIS extension
);
CREATE INDEX category_idx ON locations (category); // B-tree Index
CREATE INDEX geo_idx ON locations USING gist (coordinates); // GIS index
CREATE INDEX description_idx ON locations USING gin (to_tsvector('english', description));
// Query using all three indexes
SELECT * FROM locations WHERE category = 'restaurant'
AND ST_DWithin(coordinates, ST_MakePoint(-122.4194, 37.7749)::geography, 2000)
AND to_tsvector('english', description) @@ to_tsquery('pizza');
Database Constraints
1. Problem: Handle Simultaneous Writes
Example - multiple users update the same record simultaneously
Approach | Description | Pros | Cons |
---|---|---|---|
Pessimistic Locking | Locks a record when a user accesses it, preventing others from modifying it until the lock is released. | - Simple to implement. | - Performance bottlenecks if locks are held too long. - Risk of deadlocks. |
Optimistic Locking (OCC) | Allows multiple users to access a record but checks for conflicts during updates using a version or timestamp field. | - Better performance for read-heavy systems. | - Updates can fail, requiring retry logic. - Not suitable for write-heavy systems. |
Status + Expiration Time | Assigns a status (e.g., "in-use") and an expiration time to a record. Expired records are unlocked automatically. | - Resolves abandoned operations naturally. | - Managing abandoned rows is difficult. |
Distributed Lock with TTL | Utilizes Redis or Zookeeper to create distributed locks with a Time-To-Live (TTL). | - TTL prevents stale locks. | - Adds external dependency. |
Optimistic Concurrency Control is same as Optimistic locking.
- A transaction reads data and remembers its state (e.g., timestamp or version).
- The transaction performs update operations.
- Before committing, it checks whether the version has been changed since the initial read.
- If the version is unchanged, the transaction commits; otherwise, it fails.
Implementation Samples
Pessimistic Locking
-- Application Layer: Starts the transaction
BEGIN;
-- Database Layer: Lock the row so no one else can modify it
SELECT * FROM users WHERE id = 1 FOR UPDATE;
-- Business Logic (e.g., validate balance > 100)
-- Database Layer: Deduct $100 from balance
UPDATE users SET balance = balance - 100 WHERE id = 1;
-- Application Layer: Commit the transaction
COMMIT;
Optimistic Locking
def get_order(order_id):
""" Fetches the current order status and version from DynamoDB """
response = table.get_item(Key={'order_id': order_id})
return response.get('Item', None)
def update_order(order_id, new_status):
""" Updates order status only if no other process modified it """
# Step 1: Fetch the current version of the order
order = get_order(order_id)
current_version = order['version'] # Get current version stored in DB
try:
# Step 2: Attempt to update the order with version check
response = table.update_item(
# Like a WHERE clause: Specifies which order to update
Key={'order_id': order_id},
# UpdateExpression: Defines what should be updated in the table.
# ExpressionAttributeValues: Provides actual values for placeholders inside the query.
# Update the order status and increment the version number
UpdateExpression="SET order_status = :new_status, version = version + 1",
# The update happens only if the ConditionExpression is true
# i.e. current version in the database matches expected_version.
ConditionExpression="version = :expected_version",
# Placeholder values for security and performance
ExpressionAttributeValues={
':new_status': new_status, # the new value being written into the database.
':expected_version': current_version
# represents the current version that was read from the database.
# ensures that no other process has modified the order before our update.
# Used inside ConditionExpression to prevent overwriting concurrent updates.
}
)
print("Update successful:", response)
# User A tries to mark order "Shipped"
update_order("123", "Shipped")
# User B tries to mark order "Cancelled" (but order version might have changed)
# User B needs to retry after fetching the latest version.
update_order("123", "Cancelled") # This fails if version changed
Status + Expiration Time
def acquire_document_lock(doc_id, user, lock_duration=300):
"""
Acquires a lock on a document with automatic expiration.
- The lock expires after `lock_duration` seconds (default: 5 minutes).
- Lock acquisition fails if another user is already holding it and it hasn't expired.
"""
now = int(time.time()) # Get current system time
ttl = now + lock_duration # Set lock expiration time
try:
response = table.update_item(
Key={'doc_id': doc_id},
# Set lock status, owner, and expiration time
UpdateExpression="SET status = :status, locked_by = :user, ttl = :ttl",
# Allow lock acquisition only if it’s unlocked or expired
ConditionExpression="attribute_not_exists(status) OR ttl < :now",
ExpressionAttributeValues={
':status': 'locked', # Represents locked state
':user': user, # The user acquiring the lock
':ttl': ttl, # Lock expiration timestamp
':now': now # Current system time (used for expiration check)
},
ReturnValues="ALL_NEW" # Returns updated lock info for debugging
)
print(f"Lock acquired for document {doc_id} by {user}. Expires at {ttl}.")
return True
# Example
if acquire_document_lock(doc_id, user):
try:
print("Document locked and being processed...")
finally:
release_document_lock(doc_id, user)
else:
print("Could not acquire lock")
Distributed Lock with TTL
def acquire_lock(key, ttl=10):
# Query Layer: Create a lock with a TTL
lock_acquired = redis_client.set(key, "locked", ex=ttl, nx=True)
return lock_acquired
def release_lock(key):
# Query Layer: Explicitly release the lock
redis_client.delete(key)
# Try acquiring lock
if acquire_lock("leaderboard_update", ttl=10):
try:
print("Processing leaderboard update...")
finally:
release_lock("leaderboard_update")
else:
print("Lock already held by another process.")
2. Problem: Enforcing uniqueness
Example allow only one post by a user on one business
Solution: Apply unique constraints
Multi-Part Upload Big Files
A presigned URL is a temporary, signed URL that allows a client to upload (or download) an object to/from a storage system (like Amazon S3) without requiring direct authentication.
Each presigned URL is specific to a resource (like a file or chunk)
Process flow for resumable upload of a large file and upload only the missing chunks using presigned URLs.
- Client calculates the file fingerprint
- Client requests file metadata:
- The client sends the fingerprint to the server.
- The server checks if the file exists and retrieves the metadata (e.g., uploaded chunks and missing chunks).
- Server provides upload instructions:
- For missing chunks, the server generates presigned URLs for each chunk.
- Each chunk typically maps to a part number in a multipart upload.
- Presigned URLs are chunk-specific because each upload must be directed to a specific part.
- The client uploads the missing chunks using the provided presigned URLs.
- As each chunk is uploaded, S3 send event notification to Lambda function which updates the status of chunk in FileMetaData
Why is each chunk tied to a presigned URL?
- Security: Each presigned URL is generated for a specific chunk and includes security parameters, such as its expiration time, headers, and permissions. This prevents unauthorized uploads or tampering.
- Concurrency: Using separate URLs for chunks allows clients to upload chunks in parallel, speeding up the upload process.
- Resumability: If the upload fails partway, the server knows exactly which chunks are missing and can generate new presigned URLs for only those chunks.