Global Secondary Index (GSI) - AWS DynamoDB - All you need to know
In this post, let's look in detail at what a GSI is and how it is useful. Read through it completely before designing a GSI.
GSI:
A Global Secondary Index (GSI), or simply an Index, is created on a Table to support queries on non-key attributes of the Table, which would otherwise require a full Scan. A GSI contains a selection of attributes from the base Table, organized by a Primary Key (Partition Key and optional Sort Key) that is different from that of the base Table.
Sample Table: Consider this Table, which captures the Scores of students in different Subjects. From the main Table it's easy to query the Subject Scores for a given "Student_Id".
Partition Key: Student_Id
Sort Key: None
From the above Table, if we want to find who scored the highest in a particular Subject, it's not possible without scanning the whole Table. We can create the following GSI to make that query possible without a full Scan.
Name of GSI: TopScoreIndex
Partition Key: Subject
Sort Key: Score
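The Table and Index described above could be expressed as table-creation parameters. The following is a minimal sketch that only builds the request as a plain dictionary, in the shape accepted by the `create_table` call of the boto3 DynamoDB client. The Table name `StudentScores`, the capacity values, the `KEYS_ONLY` projection and the Number type of "Student_Id" are assumptions made for illustration; they are not stated in this post.

```python
def build_create_table_params(table_name="StudentScores"):
    """Build create_table parameters for the sample Table and TopScoreIndex.

    The table name and the capacity values are illustrative assumptions.
    """
    return {
        "TableName": table_name,
        # Only key attributes (of the Table or of an Index) are declared here.
        "AttributeDefinitions": [
            # Assumed type: the post does not state the type of Student_Id.
            {"AttributeName": "Student_Id", "AttributeType": "N"},
            {"AttributeName": "Subject", "AttributeType": "S"},
            {"AttributeName": "Score", "AttributeType": "N"},
        ],
        # Base Table: Partition Key only, no Sort Key.
        "KeySchema": [{"AttributeName": "Student_Id", "KeyType": "HASH"}],
        "ProvisionedThroughput": {"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
        "GlobalSecondaryIndexes": [
            {
                "IndexName": "TopScoreIndex",
                # GSI: Partition Key "Subject", Sort Key "Score".
                "KeySchema": [
                    {"AttributeName": "Subject", "KeyType": "HASH"},
                    {"AttributeName": "Score", "KeyType": "RANGE"},
                ],
                "Projection": {"ProjectionType": "KEYS_ONLY"},
                # The GSI has its own throughput settings (assumed values).
                "ProvisionedThroughput": {"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
            }
        ],
    }


params = build_create_table_params()
print(params["GlobalSecondaryIndexes"][0]["IndexName"])
```

With AWS credentials configured, the dictionary could presumably be passed as `boto3.client("dynamodb").create_table(**params)`.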
Important points to remember:
- The GSI's key does not need to have any of the key attributes from the base table. It doesn't even need to have the same key schema as the table. In the example above the base Table's Partition key is "Student_Id" and there's no Sort Key while the GSI's Partition Key is "Subject" and it has a Sort Key "Score"
- The base Table's Primary Key attributes are always projected into the GSI. In the above GSI the "Student_Id" is automatically projected. Other attributes can be projected as needed. Any attribute which is not projected can't be retrieved from the Index while querying, for example the 'University' attribute
- If the "ScanIndexForward" parameter is set to "false" while querying, the results are returned in descending order of the Sort Key, so the highest Score is returned first
- The "Partition Key" is mandatory in the GSI and the Sort Key is optional, which is the case for the base Table as well
- The base Table can have a simple Primary Key (Partition Key alone), the GSI can have a Composite Primary Key (both Partition Key and Sort Key) or vice versa
- The Index Key attributes must be Top-level attributes of the base Table, of type 'String', 'Number' or 'Binary'
- In the base Table the Primary Key values must be unique; that's not the case in the Index. In the example Index above there are two Items with the same "Subject" and "Score": the "DS&A" Subject with the Score of "92"
- While querying the Index, all the Items that match the Key attributes are returned; however, there's no specific order among returned Items that have identical Key values
- A GSI tracks only the Items that have a value for the GSI's Primary Key attributes in the base Table. If one of the GSI Primary Key attributes, "Subject" or "Score", doesn't have a value in a base Table Item, then that Item is not populated in the GSI. For example, the Item for "Student_Id" 200 doesn't have a value for the "Score" attribute, hence that Item will not appear in the GSI. This can be exploited to create GSIs which contain only the subset of Items of interest from the base Table.
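The sparse-index and ordering behaviour described in the points above can be imitated locally. The sketch below is a plain-Python model, not DynamoDB itself. The sample Items are made up; they only preserve the facts mentioned in this post (two "DS&A" Items share the Score 92, and "Student_Id" 200 has no Score). The helper names `build_index` and `query_index` are hypothetical.

```python
# Made-up sample Items; University values and most Ids are assumptions.
BASE_TABLE = [
    {"Student_Id": 100, "Subject": "DS&A", "Score": 92, "University": "U1"},
    {"Student_Id": 101, "Subject": "DS&A", "Score": 92, "University": "U2"},
    {"Student_Id": 102, "Subject": "DS&A", "Score": 85, "University": "U1"},
    {"Student_Id": 200, "Subject": "Networks", "University": "U3"},  # no Score
    {"Student_Id": 201, "Subject": "Networks", "Score": 78, "University": "U2"},
]

GSI_KEYS = ("Subject", "Score")


def build_index(items, key_attrs=GSI_KEYS):
    """Keep only Items that have a value for every Index Key attribute."""
    return [item for item in items if all(key in item for key in key_attrs)]


def query_index(index, subject, scan_index_forward=True):
    """Return the Items for one Partition Key value, ordered by the Sort Key.

    Items with equal Scores keep their input order here; in DynamoDB their
    relative order is not specified.
    """
    matches = [item for item in index if item["Subject"] == subject]
    return sorted(matches, key=lambda item: item["Score"],
                  reverse=not scan_index_forward)


index = build_index(BASE_TABLE)
top_first = query_index(index, "DS&A", scan_index_forward=False)

print([item["Student_Id"] for item in index])   # [100, 101, 102, 201]
print([item["Score"] for item in top_first])    # [92, 92, 85]
```

Student 200 is absent from the modelled Index because it has no "Score", and the duplicate Key values ("DS&A", 92) are both kept.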
Projecting attributes:
The following are the 3 possible attribute Projection options for a GSI
- KEYS_ONLY: Projects Primary Key attributes from the base Table to the GSI in addition to the Primary Key attributes defined in the GSI. In the sample GSI the base Table's Primary Key "Student_Id" is projected besides its own Primary Key "Subject" and "Score". This is the smallest possible GSI
- INCLUDE: Includes specific attributes besides the automatically projected base Table's Primary Key attributes. If needed, we can include the "University" attribute in the GSI
- ALL: The GSI includes all the attributes of the base Table. This is the largest possible GSI. The GSI will have the attributes "Subject", "Score", "Student_Id", "University" and "Gold_Medal"
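To make the three options concrete, here is a small plain-Python sketch of which attributes of one Item would end up in the Index under each projection type. The function `project_item` and the sample attribute values are hypothetical; DynamoDB performs this projection internally.

```python
def project_item(item, projection_type, table_keys, index_keys,
                 non_key_attributes=()):
    """Return the attributes of `item` that the Index would contain."""
    if projection_type == "ALL":
        return dict(item)
    # Table keys and Index keys are always present in the Index.
    kept = set(table_keys) | set(index_keys)
    if projection_type == "INCLUDE":
        kept |= set(non_key_attributes)
    elif projection_type != "KEYS_ONLY":
        raise ValueError(f"unknown projection type: {projection_type}")
    return {name: value for name, value in item.items() if name in kept}


# Made-up Item using the attribute names from this post.
item = {"Student_Id": 100, "Subject": "DS&A", "Score": 92,
        "University": "U1", "Gold_Medal": True}
table_keys = ["Student_Id"]
index_keys = ["Subject", "Score"]

keys_only = project_item(item, "KEYS_ONLY", table_keys, index_keys)
include = project_item(item, "INCLUDE", table_keys, index_keys, ["University"])
everything = project_item(item, "ALL", table_keys, index_keys)

print(sorted(keys_only))   # ['Score', 'Student_Id', 'Subject']
print(sorted(include))     # ['Score', 'Student_Id', 'Subject', 'University']
print(sorted(everything))  # all five attributes
```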
Note on projecting attributes:
- While considering which attributes to project into a GSI, one needs to keep in mind the associated provisioned throughput and storage costs. Writing to the GSI is an additional cost besides writing to the base Table, and the same applies to storage
- Project only the necessary attributes to keep the GSI small, so that the storage and write costs are as low as possible
Reading data from GSI:
- Query and Scan operations are supported on a GSI; GetItem and BatchGetItem are not supported
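A Query against the sample Index for the top scorer of a Subject might look like the sketch below. It only builds the request parameters as a dictionary, in the shape accepted by the `query` call of the boto3 DynamoDB client. The Table name `StudentScores` and the helper name `build_top_score_query` are assumptions for illustration.

```python
def build_top_score_query(subject, table_name="StudentScores", limit=1):
    """Build Query parameters that return the highest Score(s) for a Subject."""
    return {
        "TableName": table_name,        # assumed Table name
        "IndexName": "TopScoreIndex",   # query the GSI, not the base Table
        "KeyConditionExpression": "#subject = :subject",
        "ExpressionAttributeNames": {"#subject": "Subject"},
        "ExpressionAttributeValues": {":subject": {"S": subject}},
        "ScanIndexForward": False,      # descending by the Sort Key "Score"
        "Limit": limit,                 # stop after the first `limit` Items
    }


query_params = build_top_score_query("DS&A")
print(query_params["IndexName"], query_params["ScanIndexForward"])
```

With credentials configured, this could presumably be sent as `boto3.client("dynamodb").query(**query_params)`. Note that with `Limit` set to 1, only one of several Items tied on the top Score would be returned.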
Data synchronization between base Table and GSI:
- When a Write/Delete happens on the base Table, the change is propagated to the GSI asynchronously, in an eventually consistent fashion. The synchronization normally takes a fraction of a second, but in unlikely scenarios a read from the GSI can return data that is not yet synchronized, and the application should keep this in mind
- Direct writes to a GSI are not possible; it is updated only through writes to the base Table
- A GSI's Key attributes and their data types are defined at the time of GSI creation. When Items are written to the base Table, the GSI Key attributes must have the same data types, otherwise a "ValidationException" is thrown. In the sample GSI above the data types of the GSI Primary Key attributes "Subject" and "Score" are "String" and "Number" respectively. All writes to the base Table should conform to these data types
Read/Write capacity units for GSI:
Every point in this section is important for understanding how Provisioned Throughput works with a GSI
- For a GSI created on a base Table in Provisioned throughput mode, the Read/Write capacity units must also be specified. These throughput settings are separate from those of the base Table.
- A Query on the GSI consumes the Read capacity units of the GSI and not of the base Table
- When an Item is Written/Updated/Deleted on the base Table the changes are also propagated to the GSIs which consumes the Write Capacity of the GSI
- GSIs support only eventually consistent reads, each of which consumes half of a read capacity unit per 4 KB read. One read capacity unit can therefore retrieve up to 8 KB of data (i.e. 2 x 4 KB)
- For GSI queries the read capacity unit consumption is calculated based on the Index size which depends on the projected attributes and not based on item size on the base Table
- The maximum size of results returned by Query is 1 MB
- When an Insert/Update/Delete on a Table affects a GSI, the provisioned throughput cost consists of the Write Capacity Units (WCU) consumed for writing to the base Table and also to every affected GSI
- If a write to the base Table doesn't affect any GSI, then no write capacity is consumed for the GSIs
- For a write to succeed, enough write capacity must be provisioned in the base Table and in all its GSIs, otherwise the write will be throttled
- When a new Item that qualifies for an Index is written to the base Table, or an existing Item is updated in the base Table (for example by adding an attribute) so that it now qualifies for the Index, write capacity is consumed for the GSI
- When the value of a GSI Key attribute is changed in the base Table, it results in two writes in the GSI, one for the Delete and one for the Insert
- When a GSI Key attribute is deleted from an Item in the base Table, a write is required in the GSI to Delete that Item from the Index. When only a projected non-key attribute changes, a single write is required to update the Item in the GSI
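The capacity rules above can be turned into a small back-of-the-envelope model. The sketch below is a simplified plain-Python illustration of the points in this section, not an exact billing calculator. The function names and sample Items are hypothetical, and the read estimate assumes a non-empty Query result.

```python
import math


def query_read_capacity_units(index_entry_sizes_bytes):
    """Estimate RCUs for one eventually consistent Query on a GSI.

    The sizes are those of the Index entries (projected attributes),
    summed and rounded up to the next 4 KB; each 4 KB costs half a unit.
    """
    total_bytes = sum(index_entry_sizes_bytes)
    return math.ceil(total_bytes / 4096) * 0.5


def in_index(item, index_keys):
    """An Item is in the GSI only if it has every Index Key attribute."""
    return item is not None and all(key in item for key in index_keys)


def gsi_write_count(old_item, new_item, index_keys, projected_attrs=()):
    """Count the GSI writes caused by one write to the base Table."""
    was_indexed = in_index(old_item, index_keys)
    is_indexed = in_index(new_item, index_keys)
    if not was_indexed and not is_indexed:
        return 0  # the write does not affect the GSI
    if was_indexed != is_indexed:
        return 1  # one Insert into, or one Delete from, the GSI
    if any(old_item[key] != new_item[key] for key in index_keys):
        return 2  # Key value changed: one Delete plus one Insert
    if any(old_item.get(a) != new_item.get(a) for a in projected_attrs):
        return 1  # only a projected attribute changed
    return 0      # change is invisible to the GSI


KEYS = ("Subject", "Score")
old = {"Student_Id": 100, "Subject": "DS&A", "Score": 92, "University": "U1"}
without_score = {k: v for k, v in old.items() if k != "Score"}

print(gsi_write_count(None, old, KEYS))                         # 1
print(gsi_write_count(old, {**old, "Score": 95}, KEYS))         # 2
print(gsi_write_count(old, without_score, KEYS))                # 1
print(gsi_write_count(old, {**old, "University": "U2"}, KEYS))  # 0
print(query_read_capacity_units([200] * 10))                    # 0.5
```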