Skip to main content

SD - Search and Database

· 7 min read

Search Patterns

Query TypeIndex TypeTools
Text search ("Pizza Hut")Full-text search by inverted indexElasticsearch, Postgres + PG_TRGM extension
Find nearby locations (restaurants within 2 miles) Unbounded searchGeospatial index (Quadtree, GeoHash)ElasticSearch (GeoHash), Postgres + PostGIS
Search within a polygon (e.g., restaurants within Santa Clara county) BoundaryGeospatial index with polygonsElasticSearch (GeoHash), Postgres + PostGIS
Category search (e.g., "restaurants")B-Tree index (natively supported)Elasticsearch, Postgres

Elasticsearch and Postgres allow combining various types of indexes (e.g., full-text, geospatial, B-Tree) in a single search query.

Sample Query with 3 Indices
CREATE TABLE locations (
id,
name TEXT,
description TEXT, // Search using Inverted Index
category TEXT, // Search using B-tree
coordinates GEOGRAPHY(Point, 4326) // Search using GIS extension
);

CREATE INDEX category_idx ON locations (category); // B-tree Index
CREATE INDEX geo_idx ON locations USING gist (coordinates); // GIS index
CREATE INDEX description_idx ON locations USING gin (to_tsvector('english', description));

// Query using all three indexes
SELECT * FROM locations WHERE category = 'restaurant'
AND ST_DWithin(coordinates, ST_MakePoint(-122.4194, 37.7749)::geography, 2000)
AND to_tsvector('english', description) @@ to_tsquery('pizza');

Database Constraints

1. Problem: Handle Simultaneous Writes

Example - multiple users update the same record simultaneously

ApproachDescriptionProsCons
Pessimistic LockingLocks a record when a user accesses it, preventing others from modifying it until the lock is released.- Simple to implement.- Performance bottlenecks if locks are held too long.
- Risk of deadlocks.
Optimistic Locking (OCC)Allows multiple users to access a record but checks for conflicts during updates using a version or timestamp field.- Better performance for read-heavy systems.- Updates can fail, requiring retry logic.
- Not suitable for write-heavy systems.
Status + Expiration TimeAssigns a status (e.g., "in-use") and an expiration time to a record. Expired records are unlocked automatically.- Resolves abandoned operations naturally.- Managing abandoned rows is difficult.
Distributed Lock with TTLUtilizes Redis or Zookeeper to create distributed locks with a Time-To-Live (TTL).- TTL prevents stale locks.- Adds external dependency.

Optimistic Concurrency Control is same as Optimistic locking.

  1. A transaction reads data and remembers its state (e.g., timestamp or version).
  2. The transaction performs update operations.
  3. Before committing, it checks whether the version has been changed since the initial read.
  4. If the version is unchanged, the transaction commits; otherwise, it fails.
Implementation Samples

Pessimistic Locking

-- Application Layer: Starts the transaction
BEGIN;

-- Database Layer: Lock the row so no one else can modify it
SELECT * FROM users WHERE id = 1 FOR UPDATE;

-- Business Logic (e.g., validate balance > 100)

-- Database Layer: Deduct $100 from balance
UPDATE users SET balance = balance - 100 WHERE id = 1;

-- Application Layer: Commit the transaction
COMMIT;

Optimistic Locking

def get_order(order_id):
""" Fetches the current order status and version from DynamoDB """
response = table.get_item(Key={'order_id': order_id})
return response.get('Item', None)

def update_order(order_id, new_status):
""" Updates order status only if no other process modified it """

# Step 1: Fetch the current version of the order
order = get_order(order_id)

current_version = order['version'] # Get current version stored in DB

try:
# Step 2: Attempt to update the order with version check
response = table.update_item(
# Like a WHERE clause: Specifies which order to update
Key={'order_id': order_id},

# UpdateExpression: Defines what should be updated in the table.
# ExpressionAttributeValues: Provides actual values for placeholders inside the query.

# Update the order status and increment the version number
UpdateExpression="SET order_status = :new_status, version = version + 1",

# The update happens only if the ConditionExpression is true
# i.e. current version in the database matches expected_version.
ConditionExpression="version = :expected_version",

# Placeholder values for security and performance
ExpressionAttributeValues={
':new_status': new_status, # the new value being written into the database.

':expected_version': current_version

# represents the current version that was read from the database.
# ensures that no other process has modified the order before our update.
# Used inside ConditionExpression to prevent overwriting concurrent updates.
}
)
print("Update successful:", response)

# User A tries to mark order "Shipped"
update_order("123", "Shipped")

# User B tries to mark order "Cancelled" (but order version might have changed)
# User B needs to retry after fetching the latest version.
update_order("123", "Cancelled") # This fails if version changed

Status + Expiration Time

def acquire_document_lock(doc_id, user, lock_duration=300):
"""
Acquires a lock on a document with automatic expiration.

- The lock expires after `lock_duration` seconds (default: 5 minutes).
- Lock acquisition fails if another user is already holding it and it hasn't expired.
"""

now = int(time.time()) # Get current system time
ttl = now + lock_duration # Set lock expiration time

try:
response = table.update_item(
Key={'doc_id': doc_id},

# Set lock status, owner, and expiration time
UpdateExpression="SET status = :status, locked_by = :user, ttl = :ttl",

# Allow lock acquisition only if it’s unlocked or expired
ConditionExpression="attribute_not_exists(status) OR ttl < :now",

ExpressionAttributeValues={
':status': 'locked', # Represents locked state
':user': user, # The user acquiring the lock
':ttl': ttl, # Lock expiration timestamp
':now': now # Current system time (used for expiration check)
},

ReturnValues="ALL_NEW" # Returns updated lock info for debugging
)
print(f"Lock acquired for document {doc_id} by {user}. Expires at {ttl}.")
return True

# Example
if acquire_document_lock(doc_id, user):
try:
print("Document locked and being processed...")
finally:
release_document_lock(doc_id, user)
else:
print("Could not acquire lock")

Distributed Lock with TTL

def acquire_lock(key, ttl=10):
# Query Layer: Create a lock with a TTL
lock_acquired = redis_client.set(key, "locked", ex=ttl, nx=True)
return lock_acquired

def release_lock(key):
# Query Layer: Explicitly release the lock
redis_client.delete(key)

# Try acquiring lock
if acquire_lock("leaderboard_update", ttl=10):
try:
print("Processing leaderboard update...")
finally:
release_lock("leaderboard_update")
else:
print("Lock already held by another process.")

2. Problem: Enforcing uniqueness

Example allow only one post by a user on one business

Solution: Apply unique constraints

Multi-Part Upload Big Files

A presigned URL is a temporary, signed URL that allows a client to upload (or download) an object to/from a storage system (like Amazon S3) without requiring direct authentication.

Each presigned URL is specific to a resource (like a file or chunk)

Process flow for resumable upload of a large file and upload only the missing chunks using presigned URLs.

  1. Client calculates the file fingerprint
  2. Client requests file metadata:
    • The client sends the fingerprint to the server.
    • The server checks if the file exists and retrieves the metadata (e.g., uploaded chunks and missing chunks).
  3. Server provides upload instructions:
    • For missing chunks, the server generates presigned URLs for each chunk.
    • Each chunk typically maps to a part number in a multipart upload.
    • Presigned URLs are chunk-specific because each upload must be directed to a specific part.
  4. The client uploads the missing chunks using the provided presigned URLs.
  5. As each chunk is uploaded, S3 send event notification to Lambda function which updates the status of chunk in FileMetaData

Why is each chunk tied to a presigned URL?

  • Security: Each presigned URL is generated for a specific chunk and includes security parameters, such as its expiration time, headers, and permissions. This prevents unauthorized uploads or tampering.
  • Concurrency: Using separate URLs for chunks allows clients to upload chunks in parallel, speeding up the upload process.
  • Resumability: If the upload fails partway, the server knows exactly which chunks are missing and can generate new presigned URLs for only those chunks.