Amazon Redshift. Distribution Styles.

Let’s take a look at what Amazon Redshift is:

Amazon Redshift is a fast, scalable data warehouse that makes it simple and cost-effective to analyze all your data across your data warehouse and data lake …

The most important aspect of this is that Amazon Redshift is designed to be a «petabyte-scale» data warehouse. In other words, it lets us run queries across petabytes of data efficiently.

In this article I’ll walk you through how Amazon Redshift distributes table data, so that you can set up efficient tables using the distribution styles available for Amazon Redshift tables.

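We’ve got three types of distribution styles that you can choose explicitly, according to the Amazon Redshift online documentation (newer Redshift releases also offer AUTO, which lets Redshift pick among them):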

Even distribution

This has traditionally been the default distribution style (recent Redshift releases default to AUTO instead). With this style the leader node distributes the rows across the slices in round-robin fashion, regardless of the values in any particular column, which means the rows of the table end up spread evenly across the slices.
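To make the round-robin idea concrete, here is a minimal Python sketch. It is only a toy model, not how Redshift is implemented: the function `distribute_even`, the sample rows and the slice count are all made up for illustration.

```python
def distribute_even(rows, num_slices):
    """Assign each row to a slice in round-robin order, ignoring its contents."""
    slices = [[] for _ in range(num_slices)]
    for position, row in enumerate(rows):
        # The target slice depends only on the row's arrival order.
        slices[position % num_slices].append(row)
    return slices


# Hypothetical table with 10 rows, spread over 4 slices.
rows = [{"order_id": n, "customer_id": n % 3} for n in range(10)]
even_slices = distribute_even(rows, num_slices=4)

slice_sizes = [len(s) for s in even_slices]
print(slice_sizes)  # [3, 3, 2, 2]
```

No slice ever holds more than one row above any other, whatever values the rows contain.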

Key distribution

Rows are distributed across the slices based on the values in one column: rows with the same value in that column land on the same slice. If you frequently join two tables on a pair of columns and distribute both tables on those joining columns, the leader node stores the matching rows of both tables on the same slice.

This way, joins on those columns do not need to move rows between nodes, which improves the overall performance of your queries.
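The collocation effect of key distribution can be sketched in a few lines of Python. This is a toy model under stated assumptions: `zlib.crc32` is only a deterministic stand-in for Redshift’s internal hash function, and the tables, column names and slice count are hypothetical.

```python
import zlib


def slice_for(key, num_slices):
    # crc32 is a stand-in; the hash Redshift really uses is internal to the service.
    return zlib.crc32(str(key).encode("utf-8")) % num_slices


def distribute_by_key(rows, dist_key, num_slices):
    """Place each row on the slice chosen by hashing its distribution key."""
    slices = [[] for _ in range(num_slices)]
    for row in rows:
        slices[slice_for(row[dist_key], num_slices)].append(row)
    return slices


NUM_SLICES = 4
customers = [{"customer_id": c, "name": f"customer-{c}"} for c in range(6)]
orders = [{"order_id": o, "customer_id": o % 6} for o in range(18)]

# Both tables use the joining column as their distribution key.
customer_slices = distribute_by_key(customers, "customer_id", NUM_SLICES)
order_slices = distribute_by_key(orders, "customer_id", NUM_SLICES)

# Join slice by slice: each order only looks at customers stored on its own slice.
local_matches = 0
for cust_slice, ord_slice in zip(customer_slices, order_slices):
    ids_here = {c["customer_id"] for c in cust_slice}
    local_matches += sum(1 for o in ord_slice if o["customer_id"] in ids_here)

print(local_matches == len(orders))  # True
```

Every order finds its customer without leaving its slice, because equal key values always hash to the same slice in both tables.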

ALL distribution

A copy of the entire table is distributed to every node.

Loading, inserting and updating become much slower, and the storage needed is multiplied by the number of nodes. The trade-off pays off when you have a small dimension table that you often join with other tables: the cost of copying it to every node is low (because the table is small), and avoiding the «join latency» can be valuable even after paying the ALL distribution penalty of size multiplication.

This is good for small dimension tables that you read often and rarely write (load, insert, update).
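The size and write penalty of ALL distribution can be illustrated with another small Python sketch. Again this is a toy model with made-up names (`distribute_all`, `insert_row`, the `dim_countries` table and the node count), not Redshift’s actual mechanics.

```python
def distribute_all(rows, num_nodes):
    """Place a full, independent copy of the table on every node."""
    return [list(rows) for _ in range(num_nodes)]


def insert_row(node_copies, row):
    """Apply an insert to every copy and return the number of physical writes."""
    for copy in node_copies:
        copy.append(row)
    return len(node_copies)


NUM_NODES = 4
dim_countries = [{"country_id": i, "name": f"country-{i}"} for i in range(5)]
node_copies = distribute_all(dim_countries, NUM_NODES)

# Storage is multiplied by the number of nodes: 5 logical rows, 20 stored rows.
stored_rows = sum(len(copy) for copy in node_copies)
print(stored_rows)  # 20

# One logical insert costs one write per node.
writes = insert_row(node_copies, {"country_id": 5, "name": "country-5"})
print(writes)  # 4
```

With a five-row table the multiplication is harmless; with a large fact table the same factor would apply to every load.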

Thank you for visiting my blog,
I hope to see you in upcoming articles!
