This is a transcript from my application for a Data Engineer mentor position at Udacity.
[Slide 1] Hi, everyone. Today's session is about Amazon Redshift architecture. So, let's get started!
[Slide 2] MPP is the core idea of Redshift's architecture, so if you understand MPP, you will surely understand Redshift. MPP stands for Massively Parallel Processing: using many nodes to compute results in parallel. I will give you some concrete examples of why Redshift's architecture is built on MPP.
[Slide 3] Let's start with the basic example here: we have only one node. This is the traditional data warehouse. Let's say it's just a MySQL data warehouse, and this one node handles everything, both writes and reads. Then one day your team says, "Hey! I want to make it faster." So you scale it up. And...
[Slide 4] Now it's a bigger machine! You buy a new CPU, more storage, more RAM. After a couple of days, you realize again that this solution is bad, since it will soon hit bottlenecks: maybe the limitations of the hardware, the software, or even the I/O of the disk storage.
[Slide 5] So in the big data community, scaling up is considered very bad practice. It is not a good solution. Instead, the community uses a better approach, which is called MPP.
[Slide 6] Now we have the leader node (sometimes called the master node) acting as the coordinator, talking to the client and to the compute nodes, and we have the compute nodes acting as workers for the leader. So let's say the client sends a query to the leader node. The leader node parses it, constructs the query plan, and sends it to each compute node. Each compute node in this diagram then processes the data in its own storage: compute node A processes rows 1, 2, 3; compute node B processes rows 4, 5, 6; compute node C processes rows 7, 8, 9. Each node processes its rows and sends the results back to the leader node. The leader collects all the partial results, merges, aggregates, and joins them, and sends the final answer back to the client. Hooray! Everyone's happy.
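The scatter-gather flow above can be sketched in a few lines of Python. This is a toy model, not Redshift's actual implementation: the leader fans the query out, each compute node works only on its own rows, and the leader merges the partial results.

```python
# Toy model of an MPP scatter-gather query: the leader node fans a query
# out to compute nodes, each node scans only its local rows, and the
# leader merges the partial results. (Illustrative only, not Redshift code.)

def compute_node(rows, predicate):
    """Each compute node filters only its own local rows."""
    return [r for r in rows if predicate(r)]

def leader_node(nodes, predicate):
    """The leader 'sends' the plan to every node and merges the results."""
    partials = [compute_node(rows, predicate) for rows in nodes.values()]
    merged = [r for partial in partials for r in partial]
    return sorted(merged)

# Rows 1-9 spread across three compute nodes, as in the slide.
nodes = {
    "A": [1, 2, 3],
    "B": [4, 5, 6],
    "C": [7, 8, 9],
}

print(leader_node(nodes, lambda r: r % 2 == 0))  # even rows from all nodes
```

The key point the sketch shows is that each call to `compute_node` touches only that node's local slice of the data, which is exactly why adding nodes scales the work.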
[Slide 7] Linear scaling by adding more nodes is now possible. If you want to query faster or process more data, just add more nodes. You can read success stories about this from Netflix, Airbnb, and Uber; some of them run thousands of nodes in their clusters, which is very fascinating to me.
[Slide 8 - 9] Now you understand the underlying idea behind Redshift. The next questions are: how do we distribute all the workload across the compute nodes, and how is the data stored on each compute node?
[Slide 10] Well, how does the leader node distribute the data? The distribution key and the sort key are the answer to how data is stored. The distribution key determines where data is stored in Redshift, while the sort key determines the order in which rows in a table are stored.
[Slide 11] So let's say we have brand-new data that has never been imported into your Redshift cluster before. We have an id column, a country column, a boolean column, and an integer column. We decide to use id as the distribution key and country as the sort key.
[Slide 12] The leader node uses a hash function to determine where the data should be stored. In this example, you can see that the distribution key value 111 goes into the hash function and is sent to compute node A. Then 115 goes into the hash function, and the hash function sends 115 to compute node C. Now the data is stored evenly, and we guarantee that every compute node has some work to do for the leader node.
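The idea can be sketched with a toy hash placement in Python. Redshift's real hash function is internal, so this just shows the mechanism: hash the distribution key, take it modulo the number of nodes, and that picks the node.

```python
# Toy sketch of key distribution: hash the distribution key modulo the
# number of compute nodes to pick where each row lives.
# (Redshift's actual hash function is internal; this only shows the idea.)

NODES = ["A", "B", "C"]

def assign_node(dist_key):
    return NODES[hash(dist_key) % len(NODES)]

rows = [{"id": i} for i in range(111, 120)]  # ids 111..119
placement = {}
for row in rows:
    placement.setdefault(assign_node(row["id"]), []).append(row["id"])

for node, ids in sorted(placement.items()):
    print(node, ids)
```

With sequential integer ids like these, the modulo spreads the rows evenly, three per node, which is the behavior a good distribution key gives you.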
[Slide 13] As for the sort key, it is basically just how the data is sorted inside the physical storage. So in this case, compute node A sorts alphabetically by country: Argentina comes first, Australia second, and Brazil last.
[Slide 14] But what's the reason for that? The answer is that query performance improves because sort keys let Redshift skip irrelevant chunks of data during a query. Let's say a business user sends a query saying, "Hey, leader node, I want the data where country equals Japan." The leader node parses that query and tells all compute nodes to fetch the Japan rows only: skip Argentina and Australia entirely and go directly to Japan. That means your query is faster because of reduced I/O and no full table scan.
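The block-skipping trick can be sketched like this. In the toy model below, rows are stored sorted by country and each block keeps min/max metadata (Redshift calls these zone maps), so a block that cannot contain 'Japan' is never read at all. The block size and country values are made up for illustration.

```python
# Toy sketch of sort-key block skipping: sorted rows + per-block min/max
# metadata ("zone maps") let the scan skip blocks without any disk I/O.

BLOCK_SIZE = 3

rows = sorted(["Argentina", "Australia", "Brazil", "Japan", "Japan",
               "Mexico", "Norway", "Peru", "Spain"])  # sorted by sort key
blocks = [rows[i:i + BLOCK_SIZE] for i in range(0, len(rows), BLOCK_SIZE)]
zone_maps = [(min(b), max(b)) for b in blocks]

def query(country):
    scanned, hits = 0, []
    for block, (lo, hi) in zip(blocks, zone_maps):
        if lo <= country <= hi:   # block might contain the country: read it
            scanned += 1
            hits += [c for c in block if c == country]
        # otherwise the block is skipped entirely
    return hits, scanned

hits, scanned = query("Japan")
print(hits, "blocks scanned:", scanned)  # only 1 of 3 blocks touched
```

Without the sort key, the same filter would have to scan all three blocks, because Japan rows could be anywhere.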
[Slide 15] A distribution key always comes with the hot-spot problem. In this diagram, you can obviously see that if you pick a very bad distribution key, only compute node A is doing work for you while compute nodes B and C sit idle, and this deteriorates performance. This is a very common scenario in distributed databases like this one.
[Slide 16] So always remember that we use compute nodes to help us process data quicker. Thus, pick a distribution key that spreads your data evenly.
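The hot-spot problem is easy to reproduce in the same toy model. Here a low-cardinality boolean-like column (a made-up `is_active` flag) is used as the distribution key, and nearly all rows pile onto one node while another gets nothing.

```python
# Toy sketch of the hot-spot problem: hashing a low-cardinality key
# (a boolean column) sends almost everything to one node. (Illustrative only.)

from collections import Counter

NODES = ["A", "B", "C"]

def assign_node(dist_key):
    return NODES[hash(dist_key) % len(NODES)]

# 90 active rows, 10 inactive: 'is_active' is a terrible distribution key.
rows = [True] * 90 + [False] * 10
load = Counter(assign_node(v) for v in rows)
print(dict(load))  # two values can only ever reach two nodes
```

A two-valued key can land on at most two of the three nodes, so no matter what the hash function does, at least one node sits idle.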
[Slide 17 - 19] We can also choose other distribution styles instead of a distribution key. But before going on to All and Even, I have to talk about slices first.
So a slice is basically a shard of the resources inside a node. Say node 1 has 8 CPUs, 16 GB of RAM, and a 160 GB HDD. Sharding those resources means slice 1 now has 4 CPUs, 8 GB of RAM, and 80 GB of HDD, and it works independently from the other slices in that node. The idea is to parallelize the work even within a node; slices are just virtual nodes inside each node.
Back to the All distribution style: all data goes on every node. You distribute a full copy of the data to every node, so in this case you send all the data to both node 1 and node 2. The advantage of this pattern is that if you join frequently, it is very nice: you don't have to join across other nodes, which is more expensive. Another distribution style is Even, or round-robin, distribution. It's very simple but effective: the first row goes to node 1, the second row goes to node 2, and so on. It guarantees that the data will be spread evenly. The right use case for Even is when you can't find a proper distribution key.
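The Even style is simple enough to sketch directly: row i goes to node i mod N, so the spread is perfectly even no matter what the column values look like. The node names below are made up for illustration.

```python
# Toy sketch of EVEN (round-robin) distribution: row i goes to node i mod N,
# so the data spreads evenly regardless of column values. (Illustrative only.)

NODES = ["node1", "node2", "node3"]

def round_robin(rows):
    placement = {n: [] for n in NODES}
    for i, row in enumerate(rows):
        placement[NODES[i % len(NODES)]].append(row)
    return placement

placement = round_robin(list(range(1, 10)))  # rows 1..9
print({n: len(rs) for n, rs in placement.items()})  # 3 rows per node
```

The trade-off, compared with a key distribution, is that rows with the same value can land on different nodes, so joins on that value may need to move data between nodes.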
[Slide 20] OK, the next topic is columnar storage versus row storage. I'm going to give you some examples.
[Slide 21] We have ID, name, age, and score. This is typical data.
[Slide 22 - 23] And you import all of that into a traditional SQL database like Postgres or MySQL. Now all the data is stored in blocks on the disk: the first block stores all the data of the first row, the second block the second row, the third block the third row. In contrast, Redshift stores data by column. That means all the first column's values go to the first block, all the second column's values go to the second block, and so on and so forth. Clearly, all the name values live in the same block, unlike in Postgres, which stores them separately.
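The two layouts for the same little table can be sketched side by side. The sample rows are made up; the point is only how the values group into blocks.

```python
# Toy sketch of row vs. columnar block layout for the same table.
# Row storage: one block per row. Columnar storage: one block per column,
# so all the 'name' values live together. (Illustrative only.)

table = [
    {"id": 1, "name": "Alice", "age": 25, "score": 90},
    {"id": 2, "name": "Bob",   "age": 22, "score": 85},
    {"id": 3, "name": "Carol", "age": 31, "score": 70},
]

# Postgres/MySQL style: block i holds every value of row i.
row_blocks = [list(row.values()) for row in table]

# Redshift style: block j holds every value of column j.
columns = list(table[0])
col_blocks = [[row[c] for row in table] for c in columns]

print("row blocks:   ", row_blocks)
print("column blocks:", col_blocks)
```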
[Slide 24] But why? There are two main benefits: 1. reduced disk seeks and I/O, and 2. compression encodings.
[Slide 25] Data warehouse queries commonly read or operate on only a few columns. Functions such as MIN(column), MAX(column), GROUP BY column, and COUNT() are found in almost every data analytics workload, and all of these functions are column-oriented.
[Slide 26] Let's see it in real action with the MAX function. In MySQL or Postgres, the engine has to scan all the blocks. In this case, MySQL goes to the first block and gets age 25, to the second block for age 22, and to the third block for age 31. It returns 31, but it had to seek three blocks on disk.
[Slide 27] In contrast, with the same function Redshift goes directly to the third block, skipping the first and second blocks, because it knows the age column is stored only in the third block. It returns 31 having sought only one block. This reduces scan I/O on your disk, which helps you a lot.
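The MAX(age) comparison can be sketched by counting block touches under each layout. The block contents match the three-row example table; the seek counter is the toy stand-in for disk I/O.

```python
# Toy sketch of MAX(age) under the two layouts, counting how many blocks
# each one has to touch. (Illustrative only.)

row_blocks = [[1, "Alice", 25, 90],   # one block per row
              [2, "Bob",   22, 85],
              [3, "Carol", 31, 70]]

col_blocks = {"id": [1, 2, 3],        # one block per column
              "name": ["Alice", "Bob", "Carol"],
              "age": [25, 22, 31],
              "score": [90, 85, 70]}

def max_age_row_store(blocks):
    seeks, best = 0, None
    for block in blocks:              # must visit every row block
        seeks += 1
        age = block[2]
        best = age if best is None else max(best, age)
    return best, seeks

def max_age_column_store(blocks):
    return max(blocks["age"]), 1      # jump straight to the 'age' block

print(max_age_row_store(row_blocks))     # (31, 3)
print(max_age_column_store(col_blocks))  # (31, 1)
```

Same answer, one third of the block reads, and the gap only widens as rows and columns grow.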
[Slide 28 - 29] And the second benefit is compression encodings. When each block stores data of the same type, the data is much easier to encode and compress.
Reducing size means improving query latency and saving storage costs. For example, before compression the data is 100 MB, but after compression it shrinks to 50 MB. That means your database has to fetch only 50 MB into memory, and you also pay for 50 MB of storage instead of 100 MB.
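One of the encodings Redshift offers, run-length encoding, is simple enough to sketch. The column values below are made up; the point is that a same-typed block full of repeats collapses to a handful of (value, count) runs.

```python
# Toy sketch of run-length encoding, one of the compression encodings
# Redshift offers: a column block with long runs of repeated values
# collapses to a few (value, count) pairs. (Illustrative only.)

def run_length_encode(values):
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([v, 1])       # start a new run
    return runs

country_block = ["Japan"] * 50 + ["Mexico"] * 30 + ["Norway"] * 20
encoded = run_length_encode(country_block)
print(encoded)                        # 3 runs instead of 100 values
print(len(country_block), "values ->", len(encoded), "runs")
```

This is also why the sort key and compression work well together: sorting by country creates exactly the long runs that run-length encoding loves.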
When you want to encode the data, you declare it in the CREATE TABLE statement, and there are various types of compression encodings. Feel free to check the Redshift documentation.
[Slide 30] Recently, Redshift also got a great extension feature called Spectrum. The idea behind it is: instead of loading the data into the cluster, why don't we just create an external table and point it to S3 storage instead? Because nowadays, S3 object storage is used everywhere.
But what is the benefit of doing so?
[Slide 31] So, before Spectrum, we might have to set up Airflow and schedule ETL jobs, which is sometimes too slow and quite cumbersome for organizations.
[Slide 32] After the introduction of Spectrum, you just point a table at S3 and you're done!
[Slide 33] The benefits of Spectrum: it is cost-effective; it separates compute and storage (S3 is cheaper); you pay only for the queries you run, against the data you actually scan; you can join with existing data in Redshift; and the data lake is accessed directly without ETL, which is great for ad-hoc queries (Athena is also great for that).
[Slide 34] That's it for today! Thank you very much, everyone, and see y'all next time.