
AWS - OpenSearch


1. The Restaurant (OpenSearch Cluster)

In OpenSearch, a cluster is like a restaurant. It has many moving parts: kitchen staff, cooking stations, storage for ingredients, and a management structure. All of these work together to deliver "dishes" (search results and indexing operations) to your "customers" (applications).

  • Cluster: Everything in OpenSearch—indexes, nodes, shards, queries—lives inside this restaurant (cluster).
  • Dedicated Master Nodes: Think of these as the head chefs. Their primary duty is to oversee and coordinate the rest of the staff:
    • Ensuring all stations (nodes) are running efficiently
    • Assigning dishes (work) to the right line cooks (data nodes)
    • Adjusting the kitchen layout as needed (shard allocation)
    • Hiring or removing staff (adding/removing nodes)
    • They do not store data or execute queries—just like head chefs don't actually chop vegetables or sauté onions.
  • Data Nodes (Hot): These are the line cooks, actively preparing meals based on customer orders (queries) or adding new recipes (indexing new data).
    • Each data node is an EC2 instance running the OpenSearch software, with its own CPU, RAM, and network. Data nodes do the actual "work" of storing and retrieving data.
  • Hot Storage: Every kitchen needs quick access to fresh ingredients. Hot storage (SSD storage on EC2 instances) holds the most frequently used data, just like a busy restaurant keeps essentials within arm's reach.
  • Warm/Cold Storage (Optional): Sometimes the restaurant moves rarely used ingredients to other storage areas:
    • Warm Storage (Extra Shelves in the Kitchen): For ingredients you don't use every minute but still need occasionally.
    • Cold Storage (Off-Site Warehouse): For ingredients you almost never touch but must keep around; cheap to store, but they take time to bring back into the kitchen when needed.
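
To make the mapping concrete, here is a minimal sketch (not a production configuration) of how these roles appear when creating a managed OpenSearch domain with boto3. The domain name, engine version, instance types, and counts are illustrative assumptions.

```python
import boto3

# Hypothetical domain settings chosen purely to illustrate the roles above:
# dedicated master nodes ("head chefs"), hot data nodes ("line cooks") with
# SSD-backed EBS volumes, and optional UltraWarm nodes ("extra shelves").
client = boto3.client("opensearch", region_name="us-east-1")

response = client.create_domain(
    DomainName="restaurant-demo",            # illustrative name
    EngineVersion="OpenSearch_2.11",         # assumed version
    ClusterConfig={
        "DedicatedMasterEnabled": True,      # the head chefs
        "DedicatedMasterType": "m6g.large.search",
        "DedicatedMasterCount": 3,
        "InstanceType": "r6g.large.search",  # the line cooks (hot data nodes)
        "InstanceCount": 3,
        "WarmEnabled": True,                 # optional warm tier
        "WarmType": "ultrawarm1.medium.search",
        "WarmCount": 2,
    },
    EBSOptions={                             # hot storage attached to each data node
        "EBSEnabled": True,
        "VolumeType": "gp3",
        "VolumeSize": 100,                   # GiB per data node
    },
)
print(response["DomainStatus"]["DomainName"])
```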

2. Index, Shard & Alias: How Data is Stored & Retrieved

An index in OpenSearch is like the organized ingredient storage bins in the kitchen — where everything needed for cooking (documents) is stored in a structured way.

  • Every ingredient (document) is labeled and indexed for quick access
  • Each storage bin (index) is specialized for a particular type of data
  • Indexes are optimized for fast lookup, just like a well-organized kitchen keeps essential ingredients easy to find

For example:

  • A burger restaurant might have an index for:
    • Burger ingredients
    • Drink ingredients
    • Side dishes
  • Similarly, OpenSearch might have indexes for:
    • Customer records
    • Product catalog data
    • Logs and analytics
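
As a rough sketch of what "putting an ingredient into a bin" looks like in practice, the snippet below indexes one document into a product-catalog index through the REST API. The endpoint URL, credentials, and index name are illustrative assumptions.

```python
import requests

OPENSEARCH_URL = "https://my-domain.us-east-1.es.amazonaws.com"  # hypothetical endpoint
AUTH = ("admin", "admin")  # placeholder; use IAM/SigV4 or fine-grained access control in practice

# Put one "ingredient" (document) into the "product-catalog" bin (index).
# OpenSearch creates the index on the first write if it doesn't exist yet.
doc = {"name": "Classic Cheeseburger", "category": "burger", "price": 9.5}
resp = requests.put(f"{OPENSEARCH_URL}/product-catalog/_doc/1", json=doc, auth=AUTH)
resp.raise_for_status()

# Fetch it back by ID; a _search against the index would find it too.
print(requests.get(f"{OPENSEARCH_URL}/product-catalog/_doc/1", auth=AUTH).json()["_source"])
```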

Shard: A busy restaurant doesn't store all ingredients in a single massive storage bin—instead, it breaks them down into multiple smaller bins for better organization and access.

  • Each storage bin (index) is divided into multiple smaller compartments (shards)
  • This allows multiple line cooks to access different compartments of the same storage bin simultaneously
  • If one compartment is being used, other cooks can still access the ingredients in other compartments

Example:

  • Instead of one massive storage bin for all meats, you might have it divided into 10 smaller compartments
  • This way, multiple cooks can access different parts of the meat storage simultaneously
  • These smaller compartments (shards) make ingredient retrieval faster and more efficient
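
A minimal sketch of "splitting one bin into compartments": the primary shard count is fixed in the index settings at creation time. The index name and counts below are assumptions for illustration.

```python
import requests

OPENSEARCH_URL = "https://my-domain.us-east-1.es.amazonaws.com"  # hypothetical endpoint
AUTH = ("admin", "admin")  # placeholder credentials

# Create the "meat-storage" index split into 10 compartments (primary shards),
# each with one replica copy. The primary shard count can't be changed later
# without reindexing, so it's chosen up front.
settings = {
    "settings": {
        "index": {
            "number_of_shards": 10,
            "number_of_replicas": 1,
        }
    }
}
resp = requests.put(f"{OPENSEARCH_URL}/meat-storage", json=settings, auth=AUTH)
resp.raise_for_status()
print(resp.json())  # expect {"acknowledged": true, ...}
```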

Alias: A restaurant often renames storage areas for convenience:

  • Instead of saying: "Go to the new section labeled Meat Storage 2024"
  • The chef might simply say: "Go to the Meat Storage" (which always points to the latest section)

In OpenSearch:

  • An alias is a shortcut name for an index or group of indexes
  • It allows applications to refer to data without worrying about version changes

Example: You store your customer data in the customer_data_v1 index, then change the schema and migrate the data to a new customer_data_v2 index. Without an alias, your application would need to know the latest index name (e.g. customer_data_v2) to run queries. With an alias, the application always queries customer_data, and you simply repoint the alias to the new index when you update the schema (see the sketch below).
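
Here is a minimal sketch of that switchover, assuming a plain REST endpoint and placeholder credentials. Both alias actions go in a single _aliases call, so the change is atomic and applications querying customer_data never see a gap.

```python
import requests

OPENSEARCH_URL = "https://my-domain.us-east-1.es.amazonaws.com"  # hypothetical endpoint
AUTH = ("admin", "admin")  # placeholder credentials

# Atomically repoint the alias from the old index to the new one.
actions = {
    "actions": [
        {"remove": {"index": "customer_data_v1", "alias": "customer_data"}},
        {"add": {"index": "customer_data_v2", "alias": "customer_data"}},
    ]
}
resp = requests.post(f"{OPENSEARCH_URL}/_aliases", json=actions, auth=AUTH)
resp.raise_for_status()

# Applications keep querying the alias, unaware the underlying index changed.
hits = requests.get(f"{OPENSEARCH_URL}/customer_data/_search", auth=AUTH).json()
print(hits["hits"]["total"])
```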

2.1 What Are Shards?

  • A shard is like a cooking station that handles a slice of the entire menu (the index). The menu itself can be broken into multiple sections (shards). Each station focuses on its assigned portion.

2.2 Shard Strategy and Cluster Scaling

  • Goal: Distribute each index evenly among the data nodes. It's best if the number of stations (shards) is a multiple (or factor) of the number of cooks (data nodes), so every cook manages the same amount of work.

  • Shard Size: If a single station has to handle an enormous portion of the menu, it becomes overloaded. Conversely, if you have too many tiny stations, managing them all wastes time. A typical sweet spot might be:

    • 10–30 GiB per shard for search-intensive workloads.
    • 30–50 GiB per shard for logs-heavy workloads.
    • 50 GiB is usually the upper limit to ensure quick responses and easier management.
  • Shard Count: The number of shards per index should be balanced against your total data size. If you only have 5 GiB in total, a single shard suffices. But if you have hundreds of GiB or several TiB, multiple shards help the workload scale (the sketch below works through the arithmetic).
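
The arithmetic behind those guidelines is simple enough to sketch. The inputs below are illustrative: pick a target shard size for your workload type and divide the expected index size (plus some growth headroom) by it.

```python
import math

# Illustrative inputs; adjust for your own workload.
index_size_gib = 400      # expected size of the index on disk
growth_factor = 1.25      # leave headroom for growth
target_shard_gib = 30     # ~10-30 GiB for search-heavy, ~30-50 GiB for log-heavy workloads

primary_shards = math.ceil(index_size_gib * growth_factor / target_shard_gib)
print(f"{index_size_gib} GiB of data -> {primary_shards} primary shards of ~{target_shard_gib} GiB each")
# 400 GiB * 1.25 / 30 GiB -> 17 primary shards
```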

2.3 Shards per Data Node (How Many Stations Can One Cook Manage?)

  • Each cook (data node) has a certain attention limit (JVM heap memory). You don't want to overwhelm them with too many stations. A rule of thumb is no more than 25 shards per GiB of heap. For a node with 32 GiB heap, staying under 800 shards is wise. There's also a hard limit of around 1,000 shards per node in many production environments—going beyond that is risky.
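
As a quick sanity check of that rule of thumb (the heap size and shard count below are assumptions):

```python
# "No more than 25 shards per GiB of JVM heap" on a single data node.
heap_gib = 32
recommended_max_shards = 25 * heap_gib   # 800 for a 32 GiB heap
hard_limit = 1000                        # common per-node ceiling in production

shards_on_node = 640
assert shards_on_node <= recommended_max_shards <= hard_limit
print(f"{shards_on_node} shards on a {heap_gib} GiB-heap node is within the {recommended_max_shards}-shard guideline")
```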

2.4 Shard to CPU Ratio

  • Every time a shard is asked to do something—whether slicing veggies or cooking a dish—it needs a dedicated stove burner (vCPU). If your cook (node) has 8 burners (vCPUs), a recommended approach is to give them around 6 stations to handle comfortably, leaving a little headroom.
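
The same back-of-the-envelope check works from the vCPU side; the node size and shard counts here are illustrative.

```python
import math

# Budget roughly 6 actively worked shards per 8-vCPU node, leaving a couple of
# "burners" free for background tasks, as the guideline above suggests.
total_active_shards = 60     # shards that actually receive queries or writes
shards_per_node_budget = 6   # per 8-vCPU data node

data_nodes_needed = math.ceil(total_active_shards / shards_per_node_budget)
print(f"{total_active_shards} active shards at {shards_per_node_budget} per node -> at least {data_nodes_needed} data nodes")
# 60 / 6 -> 10 data nodes
```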

3. How It All Relates (Keeping the Restaurant Balanced)

  • Cluster = The entire restaurant.
  • Nodes = Your cooks (data nodes) plus any head chefs (master nodes).
  • Indexes = Specific menus or categories of dishes you serve.
  • Shards = Individual cooking stations that handle a slice of a menu.
  • Hot Storage = The main fridge and pantry holding regularly used ingredients (hot data).

By balancing the number of stations (shards) with the number of cooks (nodes), you keep the restaurant efficient. Having hot storage means all the frequently accessed ingredients are at arm's reach. If you had warm or cold storage, it'd be akin to storing less popular or older ingredients in a distant pantry.

4. Building a Buffer into Your "Ingest Pipeline" (Smoothing Out Rush Hours)

4.1 What is the Ingest Pipeline?

  • When new ingredients arrive at the restaurant or new "orders" (documents to index) come in, they have to be processed before they're fully available on the menu. This flow—from receiving the raw ingredient to storing it in the fridge—is the ingest pipeline.

4.2 Why Buffering Matters

  • If your restaurant suddenly gets a massive delivery of ingredients during peak dinner service, your cooks (data nodes) might get overwhelmed trying to store or prepare them all immediately. Two general strategies exist:
    • Scale Up: Add more cooks and bigger storage to handle these spikes at full speed. This is expensive and might be overkill if spikes are rare.
    • Build a Buffer: Accept deliveries in a "waiting room" (an S3 bucket or a queue) to handle them gradually. This buffer is like a short-term holding area where you offload the initial surge. Then, you feed ingredients into the main fridge at a manageable rate.

4.3 How to Implement a Buffer

  • Buffer Location: An S3 bucket or a queue system (like Amazon SQS or a custom queue). Think of it as a secondary storage room near your loading dock.
  • Flow Control: Automated systems (like Lambda or OpenSearch Ingestion) that pull ingredients from the buffer into the main fridge (your OpenSearch hot storage) at a steady pace. You can tune how fast or slow this "intake" goes.
  • Trade-Off: If data freshness is crucial and you need new ingredients instantly ready for the cooks, you might limit buffering. But if you can tolerate a small delay, buffering saves you money and prevents chaos in the kitchen.
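
Below is a minimal sketch of the buffer idea, assuming an SQS queue as the "waiting room" and a consumer (which could run in Lambda or on a schedule) that drains it into OpenSearch through the _bulk API at its own pace. The queue URL, endpoint, index name, and credentials are all illustrative assumptions; this is not the managed OpenSearch Ingestion service itself.

```python
import json

import boto3
import requests

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/ingest-buffer"  # hypothetical queue
OPENSEARCH_URL = "https://my-domain.us-east-1.es.amazonaws.com"               # hypothetical endpoint
AUTH = ("admin", "admin")                                                     # placeholder credentials


def drain_batch(max_messages: int = 10) -> None:
    """Pull a small batch from the buffer and bulk-index it, instead of writing on every arrival."""
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=max_messages, WaitTimeSeconds=5)
    messages = resp.get("Messages", [])
    if not messages:
        return

    # Build a newline-delimited _bulk payload: one action line plus one document
    # line per message. Each message body is assumed to already be a JSON document.
    lines = []
    for msg in messages:
        lines.append(json.dumps({"index": {"_index": "orders"}}))
        lines.append(msg["Body"])
    payload = "\n".join(lines) + "\n"

    r = requests.post(
        f"{OPENSEARCH_URL}/_bulk",
        data=payload,
        headers={"Content-Type": "application/x-ndjson"},
        auth=AUTH,
    )
    r.raise_for_status()

    # Only delete messages from the buffer once OpenSearch has accepted them.
    for msg in messages:
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```

How often drain_batch runs, and its max_messages batch size, are the "flow control" knobs described above: slower draining means a cheaper, calmer cluster at the cost of data freshness.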

5. Putting It All Together

  • Cooks (Data Nodes) = Handle day-to-day cooking (index and query requests).
  • Head Chef (Dedicated Master) = Coordinates big decisions but usually doesn't cook.
  • Cooking Stations (Shards) = Each station handles a slice of your menu (index).
  • Hot Storage (Main Fridge) = Fast, easily accessible storage for active data.
  • Shard Sizing & Distribution = Keep stations well-proportioned, so no single cook is overloaded.
  • Buffering = Smooths out deliveries (indexing spikes), so you don't drown in new orders during rush hour.