Azure SQL Data Warehouse – Part II – Analyzing Created Databases

I started writing this post a long time ago but did not get a chance to complete it. My first post on this topic was Azure SQL Data Warehouse – Part I – Creating and Connecting, and here is the second part of it.
The first post discussed Azure SQL Data Warehouse and how to create a server and database using the portal. As the next step, let's talk a bit about its architecture and then see how Azure SQL Data Warehouse maintains and processes data.

Control Node

Azure SQL Data Warehouse is a distributed database, meaning that its data is spread across multiple locations. However, once the data warehouse is created, we connect to a single component called the Control Node. It is not exactly a SQL Server database, but connecting to it looks and feels like connecting to a SQL Server database. The Control Node handles all communication and coordinates computation. When we send a request to the data warehouse, the Control Node accepts it, determines how it should be distributed using a divide-and-conquer approach, gets it processed, and finally sends the result back to us.

Compute Node

The Control Node gets the data processed in parallel using Compute Nodes. They are SQL Server databases and store all our records. Based on the number of DWUs configured, the data warehouse is set up with one or more Compute Nodes.

Distribution

Data related to the data warehouse is stored in Azure Blob Storage and spread across multiple locations. Storage is independent of the Compute Nodes, hence the two can be operated and adjusted independently. These locations are called Distributions. The number of distributions for an Azure SQL Data Warehouse is fixed at sixty (60). These distributions are assigned dynamically to Compute Nodes, and when a query is executed, each distribution processes the data belonging to it. This is how parallel execution happens.
If you need more compute power, you can increase the number of Compute Nodes by increasing DWUs. When the number of Compute Nodes changes, the number of distributions per Compute Node changes as well.
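As a minimal sketch of scaling, the DWU level can be changed with T-SQL against the master database; the database name TestDW below is only a placeholder:

```sql
-- Scale the data warehouse to 400 DWU (run against master; 'TestDW' is a hypothetical name).
ALTER DATABASE TestDW MODIFY (SERVICE_OBJECTIVE = 'DW400');

-- Check the current service objective of each database.
SELECT db.name, slo.service_objective
FROM sys.database_service_objectives slo
JOIN sys.databases db ON slo.database_id = db.database_id;
```

The same change can of course be made from the portal's Scale blade; the T-SQL route is handy when scaling needs to be scripted or scheduled.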

Architecture

This image shows the architecture when you create an Azure SQL Data Warehouse with 100 DWU.
This image shows the architecture when you create an Azure SQL Data Warehouse with 400 DWU.
Let’s create two databases and clarify this configuration.
I have discussed all steps related to server creation and database creation in my first post, Azure SQL Data Warehouse – Part I – Creating and Connecting, hence I am not going to repeat them here. Note the image below; that is how I have created two databases for checking the configurations.
As you can see, the first data warehouse is created with 100 DWUs and the second one with 400 DWUs. Let's see how nodes have been created for these two databases. We can simply use the sys.dm_pdw_nodes DMV to get this information.
  SELECT * FROM sys.dm_pdw_nodes;

  SELECT type, COUNT(*) AS NofNodes
  FROM sys.dm_pdw_nodes
  GROUP BY type;
Result for the first data warehouse, created using 100 DWUs.
Result for the second data warehouse, created using 400 DWUs.
Note the second data warehouse. Since we used more DWUs, it has been created with four Compute Nodes, which gives better performance than the first one. Since this is a sample database and it has tables, we can check one of the tables and see how data is spread across the distributions.
The following code shows the distributions created for a data warehouse. As mentioned above, the number is always 60.
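A quick sketch of that check using the distribution DMV:

```sql
-- Lists the distributions of the data warehouse and the node each is assigned to.
SELECT * FROM sys.pdw_distributions;

-- Count them; this should always return 60.
SELECT COUNT(*) AS NofDistributions FROM sys.pdw_distributions;
```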
Here is the code for seeing how rows of a table are spread across distributions in the second data warehouse. Note how each distribution is assigned to a Compute Node.
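One simple way to see per-distribution row counts is DBCC PDW_SHOWSPACEUSED; the table name below (dbo.FactInternetSales, from the AdventureWorksDW sample) is only an example:

```sql
-- Shows rows and space used per distribution for the given table.
DBCC PDW_SHOWSPACEUSED('dbo.FactInternetSales');
```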
Records are distributed based on the design of the table. Azure SQL Data Warehouse uses two types of distribution: Round Robin and Hash. Let's talk about them in the next post.
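As a small preview, the distribution type is chosen when the table is created; the table and column names below are hypothetical:

```sql
-- Hash distribution: each row is placed based on the hash of the chosen column.
CREATE TABLE dbo.FactSales
(
    SalesKey INT NOT NULL,
    ProductKey INT NOT NULL,
    Amount MONEY NOT NULL
)
WITH (DISTRIBUTION = HASH(ProductKey), CLUSTERED COLUMNSTORE INDEX);

-- Round Robin distribution: rows are spread evenly across the 60 distributions.
CREATE TABLE dbo.StageSales
(
    SalesKey INT NOT NULL,
    Amount MONEY NOT NULL
)
WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX);
```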

Azure SQL Data Warehouse – Part I – Creating and Connecting

We have been implementing on-premises data warehouses for years, and everyone is well aware of the difficulty faced in the first phase of such a project: requesting hardware for it. In most cases, the project is stopped or postponed by management or the client, saying there is no budget. But now we have a solution that addresses this initial issue. Not only that, this solution gives you many benefits in terms of performance as well. It is Azure SQL Data Warehouse.

Azure SQL Data Warehouse is a cloud-based SQL data warehouse that can be easily scaled with cloud-based resources on a pay-as-it-grows basis. It is a fully managed service that covers maintenance, software patches and backups. It can store and process both relational and non-relational data, and many Azure features such as Azure Blob Storage and Azure Data Factory can be used for implementing it.

Let’s talk about some of the benefits in detail:

Scalability and Cost
This is one of the major benefits. You can scale Azure SQL Data Warehouse, increasing resources when required, and cost is calculated accordingly. Since there is no upfront cost involved, you can simply start with a small number of resources and adjust whenever you want. You will be charged for two resource usage types: Storage and Compute.

The storage configured for Azure SQL Data Warehouse is automatically re-scaled as you add and remove data, hence you pay for what you use.

The value set for Compute is the measure of execution performance in Azure SQL Data Warehouse. It is measured in Data Warehouse Units (DWU), and you can start anywhere from 100 DWU (USD 2 per hour) up to 6000 DWU (USD 123 per hour) as per your requirements. This value can be changed at any time, and the cost of usage is calculated based on the value set.

Read more on DWU at: https://azure.microsoft.com/en-gb/pricing/details/sql-data-warehouse/

Performance
Since this can be scaled out as you need, data and queries can be distributed, increasing performance. It uses a Massively Parallel Processing (MPP) architecture, hence the workload can be spread across many nodes using calculated statistics, increasing performance significantly.

This uses columnstore indexing, which can be up to 10 times faster than rowstore indexing. Since it supports up to 5 times more compression than standard rowstore indexes, reducing I/O, it improves performance further.
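A clustered columnstore index is the default table structure in Azure SQL Data Warehouse, but it can also be stated explicitly at creation time; dbo.Sales below is a hypothetical table:

```sql
-- Create a table with an explicit clustered columnstore index,
-- storing data column-wise and heavily compressed.
CREATE TABLE dbo.Sales
(
    SalesKey INT NOT NULL,
    Amount MONEY NOT NULL
)
WITH (CLUSTERED COLUMNSTORE INDEX);
```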

Hybrid Cloud
It is a hybrid cloud environment. It supports seamless integration with on-premises data, cloud data and, of course, unstructured data. With PolyBase, different types of sources can be queried using T-SQL as if the data were stored locally, making sure that everything required can be seen through a single database.
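As a rough sketch of how PolyBase exposes external data through T-SQL (the names, location and file layout below are placeholders, not working values; a private storage account would also need a database scoped credential):

```sql
-- External data source pointing at an Azure Blob Storage container (placeholder location).
CREATE EXTERNAL DATA SOURCE AzureBlob
WITH (TYPE = HADOOP, LOCATION = 'wasbs://container@account.blob.core.windows.net');

-- File format describing the files being read.
CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT, FORMAT_OPTIONS (FIELD_TERMINATOR = ','));

-- External table that can then be queried like any other table.
CREATE EXTERNAL TABLE dbo.ExternalSales
(
    SalesKey INT,
    Amount MONEY
)
WITH (DATA_SOURCE = AzureBlob, LOCATION = '/sales/', FILE_FORMAT = CsvFormat);

SELECT * FROM dbo.ExternalSales;
```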

Not only that, copying data from other sources can be easily done via Azure Data Factory.

Okay, now let’s see how we can create one and start using it. First of all, you need an Azure account. Once the account is created, you need to create a SQL Server, which is a logical server for holding your data warehouse. If you have already created a SQL Server, it can be used; otherwise follow the steps below for creating a logical server.

1. Login to the Azure Portal and open the SQL Servers blade.

2. Set the Server name, Server admin login, Password, Resource Group and Location. Note that you can add database-specific logins later.
Azure DW 02

3. Click the Create button to get it created. If you want to see the server in the Azure Dashboard, make sure Pin to dashboard is checked.

Now the server is ready. Let’s create a data warehouse.

  1. Open the SQL Data Warehouses blade.

  2. Enter the Database Name. For this test we will be using AdventureWorks (we will create different databases later), hence set the name as AdventureWorksDW.

  3. Select the appropriate Resource Group. Make sure you select the same one used for the server.

  4. Select Sample for Select source.

  5. Select AdventureWorksDW as the sample database.

  6. Select the server created in the steps above and enter the admin account and its password.

  7. Select the DWU value you need. Let’s initially set this to 200.

    Azure DW 01

  8. Select Pin to dashboard and click Create to get the data warehouse created.

The database is now created. Remember, you cannot connect to the server until your IP address is permitted to connect. Initially you can add your IP address to the server using the portal. Later, you can connect to the server using Management Studio and configure IP addresses for others using T-SQL, either at server level or database level. Here are the steps.

  1. Open the portal and open SQL Servers blade.
  2. Click on the server created (or the one used for adding the data warehouse) to get its properties.
  3. You can add your IP address by clicking the Add client IP button and then the Save button. Or else, you can enter it manually using the given grid.
    Azure DW 03
  4. Add it and save.

Now the firewall rule is configured for you. Let’s connect and see.

  1. Open Management Studio and connect with the server admin account as below.
    Azure DW 04
  2. Once connected, open a new Query Window and type the code below to see the firewall rules set.
    Azure DW 05
  3. If you need to add more rules for others, use sp_set_firewall_rule for server-level rules and sp_set_database_firewall_rule for database-level rules.
  4. Query your database now. You should be able to query your tables without any issue.
    Azure DW 06
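As an example of step 3 above, the stored procedures run against the master database (server level) or against the user database (database level); the rule name and IP range below are hypothetical:

```sql
-- Server-level rule (run in master): allow a range of client IPs.
EXEC sp_set_firewall_rule N'TeamRule', '131.107.0.1', '131.107.0.255';

-- Database-level rule (run in the user database).
EXEC sp_set_database_firewall_rule N'TeamDbRule', '131.107.0.1', '131.107.0.255';

-- See the server-level rules currently set (run in master).
SELECT * FROM sys.firewall_rules;
```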

In the next post, let’s see how we can use other tools like Visual Studio and Power BI to connect to the data warehouse we created.

Here is the link for Part II – http://dinesql.blogspot.com/2017/08/azure-sql-data-warehouse-part-ii.html