Thursday, May 25, 2023

AWS DynamoDB Local Secondary Index (LSI) - Demystified

Local Secondary Index -  AWS DynamoDB - All you need to know

In a previous post we saw Global Secondary Index (GSI) in detail. In this post let's look into the Local Secondary Index (LS) in detail

Local Secondary Index (LSI):

A Local Secondary Index (LSI) uses the same Partition Key as that of the base Table with different Sort Key. It contains some or all of the attributes of the base Table. 

Sample Table: The following Table captures Goals scored, Number of Matches played by different Countries in FIFA World Cup Soccer during different games (2022, 2018 and so on)
Partition Key: Country
Sort Key: Game



From this Table one can easily query Number of Goals scored by a Country, given the Country name and the Game (e.g. France, FIFA-2022). Without Sort Key in the Query one can find the Goals scored by "France" across all Games they played, however if we want to find out in which Game they scored the most number of Goals the application needs to process the Query result to find out that.

Create a LSI named GoalsIndex

Let's see how a LSI makes such a querying easier, consider the below LSI where the "Partition Key" remains the "Country" however the Sort key is "Goals"
Partition Key: Country
Sort Key: Goals
Additional attribute projected: Matches



Points to remember:

  • The data in LSI is organized by the same Partition Key of the base Table, however with different Sort Key
  • The Partition Key in the LSI must be same as that of the base Table (in example table 'Country')
  • Must specify one non key attribute of base Table as the Sort Key for a LSI, it must be Scalar attribute (String, Number or Binary)
  • The Sort Key, if any, of the base Table is automatically Projected in to the LSI as non key attribute (in example 'Game')
  • The Partition Key of the base Table is automatically projected
  • A maximum of 5 LSIs can be created on a Table
  • A Query can also retrieve attributes that are not Projected to the LSI, DynamoDB retrieves these from the base Table known as 'fetches', however at greater Latency and high provisioned throughput cost
  • One can store up to 10 GB of data per distinct Partition Key value, this includes all the Items in the base Table as well as across all the LSIs
  • Unlike the base Table, the Partition Key and Sort Key combination need not be unique in a LSI

Projecting attributes:

Projection is set of attributes copied from base Table to the Secondary Index. Primary Keys (Partition & Sort keys) are always projected to the Index. The following are the 3 possible attribute Projection options for a LSI
  1. KEYS_ONLY: Projects Primary Key (Partition and Sort keys) attributes from the base Table to the LSI in addition to the Primary Key attributes defined in the LSI. In the sample LSI the base Table's Sort Key "Game" is projected besides its own Sort Key "Goals". This is the smallest possible LSI. The smaller the Index, the less it costs to store and also less the write costs
  2. INCLUDE: Includes specific attributes besides the automatically projected base Table's Primary Key attributes. In the example Index above the 'Matches' non key attribute of base Table is projected
  3. ALL: The LSI includes all the attributes of the base Table. The LSI will have the attributes "Country","Goals", "Game","Matches" and "Venue"

Reading data from GSI:

  • Query and Scan operations are supported GetItem and BatchGetItem are not supported in a LSI. This is the same as GSI
  • If there are multiple Items for a given Partition/Sort Key combination then all the Items that matches given Key are returned but not in a particular order
  • Data can be read from a LSI either "Strongly consistent" or "Eventually consistent" fashion. This can be specified using 'ConsistentRead'Query parameter
  • If the "ScanIndexForward" parameter is set to "false" while querying, the results are returned in the descending order of Sort Key attribute, in the example Index above the first record will be the highest Goals

Data synchronization between base Table and GSI:

  • DynamoDB keeps all the LSIs synchronized against its respective base Tables
  • Applications can't write directly into a LSI
  • Every time a new Item is added to the base Table the data type of the attribute, which is used as Sort Key in a LSI, should be respected. The sample index uses "Goals" as Sort Key, its value should always be of type Number while inserting into base Table
  • The Items in the LSI is not a one-to-one relationship in base Table. If a LSIs Sort Key attribute doesn't have a value in base Table then that Item is not copied to LSI. This rule can be made use of effectively to copy specific sub set of Items to a specific LSI

Read/Write capacity units for LSI:

  • Strongly consistent read consumes one Read Capacity Unit (RCU) while eventually consistent read consumes half of it
  • When query reads only Projected attributes, provisioned RCU usage is calculated based on the size of the Item in the Index (Keys and Projected attributes) and not based on the Item size in the base Table
  • The number of RCU used is the sum of all projected attributes sized across all returned items, this is rounded to the next 4 KB boundary
  • When query reads attributes which are not Projected into a LSI, the RCU is calculated based on the size of the Item in the LSI and also the entire Item size in the base Table, not just the attributes fetched from the base Table. Fetching from base Table causes additional latency
  • The maximum size of the results returned by Query is 1 MB, this includes the sum of size of all items returned matched in LSI and also in the base Table, if any attribute which is not projected is queried
  • When an Item is added, updated or deleted from base Table the corresponding change is replicated in LSI which incur additional Write cost
  • The Provisioned throughput cost is the sum of Write Capacity Units (WCU) consumed for writing to base Table and as well as to LSIs
  • When new Item is written to base Table which has a value for Sort Key attribute of a LSI or an existing Item is updated to populate with value for the same attribute which is previously undefined then WCU is consumed for writing that Item to the LSI
  • When value of Sort key attribute of an Item is changed from X to Y in base Table, it result in one Delete and one Write in a LSI
  • When update on an Item in the base Table removes the Sort key attribute of a LSI a write is consumed to get the Item deleted from the LSI
  • When an Item is written to the base Table, the DynamoDB automatically copies the correct subset of attributes to LSIs. Storage cost is charged on this Item storage for both base Table and LSIs

Items Collections:

  • Item collection is group of Items that have the same Partition key value in base Table and any LSI. In the example Table it's 'Country'. 
  • Following operations can be performed on an Item collection that returns information about the Item collection. When 'ReturnItemCollectionMetrics' parameter is set to SIZE each of these operations return details of size of Items in Collection in Index
    • BatchWriteItem
    • DeleteItem
    • PutItem
    • UpdateItem
    • TransactWriteItems
  • The maximum size of any item collection for a table which has one or more local secondary indexes is 10 GB
  • If an item collection exceeds the 10 GB limit, DynamoDB returns an ItemCollectionSizeLimitExceededException, and you won't be able to add more items to the item collection or increase the sizes of items that are in the item collection
  • If the application expect the size of Item collection to exceed 10 GB then one should consider creating a GSI
  • Each Item collection is stored in a single partition whose size capability is 10 GB. One should choose Partition Key in such a way the data is evenly distributed across Partitions. For a Table with LSIs, applications should not create hot spots of read/write activity within a single Item collection which is in a single Partition

Monday, May 15, 2023

AWS DynamoDB Global Secondary Index (GSI) - Demystified

Global Secondary Index (GSI) - AWS DynamoDB - All you need to know

In this post let's see in detail what's GSI, how it's useful etc. Read through this completely before designing a GSI

GSI:

A Global Secondary Index (GSI), simply Index, is created on a Table to facilitate querying data using non key attributes of the Table which would in general result in full Scan. A GSI contains a selection of attributes from the base table, but they are organized by a Primary key (Partition Key, Sort Key) that is different from that of the base table. 

Sample Table: Consider this Table which captures Scores of Student's in different Subjects. From the main table it's easy to query all the Subject's score given a "Student_Id"

Partition Key: Student_Id

Sort Key: None

From the above Table if we want to query who scored the top in a particular Subject it's not possible without scanning the whole Table. We can create the following GSI to make the querying possible without full scan.

Name of GSI: TopScoreIndex

Partition Key: Subject

Sort Key: Score 


Important points to remember:

  • The GSI's key does not need to have any of the key attributes from the base table. It doesn't even need to have the same key schema as the table. In the example above the base Table's Partition key is "Student_Id" and there's no Sort Key while the GSI's Partition Key is "Subject" and it has a Sort Key "Score"
  • The base Tables's Primary key attributes are always projected in the GSI. In the above GSI the "Student_Id" is automatically projected. Other attributes can be projected as needed. Any attribute which is not projected can't be retrieved from the Index while querying, example the 'University' attribute
  • If the "ScanIndexForward" parameter is set to "false" while querying, the results are returned in the descending order, the highest score will be returned at the first place
  • The "Partition Key" is mandatory in the GSI, the Sort Key is optional which is the case for the base table as well
  • The base Table can have a simple Primary Key (Partition Key alone), the GSI can have a Composite Primary Key (both Partition Key and Sort Key) or vice versa
  • The Index Key attributes should be of any Top-level attributes such as 'String', 'Number' or 'Binary' from the base Table
  • In base table the Primary Key values must be unique, that's not the case in the Index. In the example Index above there are two items with the same "Subject" and "Score" which is "DS&A" Subject with the Score of "92"
  • While querying the Index all the items that matches the Key Attributes are returned, however there's no specific order within the returned Items
  • GSI tracks only the items where value exists for the GSIs Primary Ket attributes in the base Table. In the base Table if one of the GSI Primary Key attributes "Subject" or "Score" doesn't have a value then that Item is not populated in the GSI. That means the Item for "Student_Id" 200 doesn't have value for the "Score" attribute hence that Item will not appear in GSI. This can be exploited to create GSIs which has only subset of interested Items from the base Table.

Projecting attributes:

The following are the 3 possible attribute Projection options for a GSI
  1. KEYS_ONLY: Projects Primary Key attributes from the base Table to the GSI in addition to the Primary Key attributes defined in the GSI. In the sample GSI the base Table's Primary Key "Student_Id" is projected besides its own Primary Key "Subject" and "Score". This is the smallest possible GSI
  2. INCLUDE: Includes specific attributes besides the automatically projected base Table's Primary Key attributes. If needed we can include the "University" attributes in the GSI
  3. ALL: The GSI includes all the attributes of the base Table. This is the largest possible GSI. The GSI will have the attributes "Subject", "Score", "Student_Id", "University" and "Gold_Medal"

Note an projecting attributes:

  1. While considering attributes to project in a GSI one need to keep in mind the associated provisioned throughput and storage costs. Writing to GSI is additional cost besides writing to the base Table and the same applicable for the Storage
  2. Project only the necessary attributes to ensure the GSI is small so that the storage and write costs are the lowest

Reading data from GSI:

  • Query and Scan operations are supported GetItem and BatchGetItem are not supported in a GSI

Data synchronization between base Table and GSI:

  • When Write/Delete happens on the base Table the changes are asynchronously reflected in the GSI in an eventually consistent fashion. While the synchronization takes fraction of a second it's possible the data is not synchronized in an unlikely scenario the application should keep this in mind
  • No direct write on GSI
  • GSI's Key attributes are defined at the time of GSI creation. When new Items are written to the base Table the attributes data type should be the same otherwise 'ValidationException" is thrown. In the sample GSI above the data types of GSI primary key attributes are "String" and "Number" respectively for the attributes "Subject" and "Score". All write in the base table should conforms to this data type

Read/Write capacity units for GSI:

Every point in this section is so important to understand how Provisioned Throughput works with GSI
  • For a GSI created on a Provisioned throughput mode base Table, the Read/Write capacity units must be also specified. This throughput settings are separate from the base Table.
  • A Query on the GSI utilizes the Rad capacity unit of the GSI and not the base Table
  • When an Item is Written/Updated/Deleted on the base Table the changes are also propagated to the GSIs which consumes the Write Capacity of the GSI
  • GSIs support eventually consistent read which consume half of the read capacity unit. Per read capacity unit 8 KB of data can be retrieved (i.e. 2x4KB)
  • For GSI queries the read capacity unit consumption is calculated based on the Index size which depends on the projected attributes and not based on item size on the base Table
  •  The maximum size of results returned by Query is 1 MB
  • When Insert/Update/Delete on a Table affects the GSI the provisioned throughput cost consists of Write Capacity Unit (WCU) consumed for writing to the base Table and also to all the GSI
  • Write to base Table doesn't affect any GSI then no write capacity is consumed for GSI
  • Write to succeed there should have been enough write capacity provisioned in base Table and in all GSIs, otherwise the write will be throttled
  • When a new Item is written to the base Table that qualifies to be propagated to an Index or an existing Item is being updated in base Table (adding an attribute) that makes it to be replicated to the Index then write capacity is consumed for GSI
  • When GSI key attribute's value changed in the base Table it results in two writes in GSI, one for Delete and one for Insert
  • When an attribute that is projected in an Index is deleted in an Item in the base Table then a write is required in the GSI to Delete that Item