Run Hive Queries using Visual Studio

Once HDInsight cluster is configured, we generally use either the portal dashboard (Powered by Ambari) or a tool like PuTTY for executing queries against data loaded. Although they are not exactly a developer related tools, or in other words, not an IDE, we had to use because we did not have much options. However, now we can use the IDE we have been using for years for connecting with HDInsight and executing various types of queries such as Hive, Pig and USQL. It is Visual Studio.
Let’s see how we can use Visual Studio for accessing HDInsight.

Making Visual Studio read for HDInsight

In order to work with HDInsight using Visual Studio, you need to install few tools on Visual Studio. Here are the supported versions;
  • Visual Studio 2013 Community/Professional/Premium/Ultimate with Update 4
  • Visual Studio 2015 any edition
  • Visual Studio 2017 any edition
You need to make sure that you have installed Azure SDK on your Visual Studio. Click here for downloading the Web Platform Installer and make sure following are installed;
This installs Microsoft Azure Data Lake Tools for Visual Studio as well, make sure it is installed.
Now your Visual Studio is ready for accessing HDInsight.

Connecting with HDInsight

Good thing is, you can connect with your cluster even without creating a project. However, once the SDK is installed, you can see new Templates called Azure Data Lake – HIVE (HDInsight), Pig (HDInsight), Storm (HDInsight) and USQL (ADLA) and HIVE template can be used for creating a project.
Project creates one hql file for you and you can use it from executing your Hive Queries. In addition to that, You can open Server Explorer (View Menu -> Server Explorer), and expand Azure (or connect to your Azure account and then expand) for seeing all components related to Azure.
As you see, you can see all databases, internal and external tables, views and columns. Not only that, by right-clicking the cluster, you can open a windows for writing a query or viewing jobs. Here is the screen when I use the first option that is Write a Hive Query.
Did you notice Intelli-Sense? Yes, it supports with almost all metadata, hence it is really easy to write a query.

Executing Queries

If you need to see records in tables without limiting data with predicates or constructing the query with additional functions, you can simply right-click on the table in Server Explorer and select View top 100 Rows.
If you need to construct a query, then use the above method for opening a window and write the query. There are two ways of executing the code: Batch and Interactive. Batch mode does not give you the result immediately but you will be able to see or download once the job submitted is completed. If you use the Interactive, then it is similar to SSMS result.
If you use the Batch mode, you can see the way job is getting executed. Once the job is completed, you can click on Job Output for seeing or downloading the output.
As you see, there is no graphical interface to see the job execution. Visual Studio will show the job execution using a graphical interface only when the job is executed by Tez Engine. Remember, HDInsight will always use Tez Engine to execute Hive Queries but simpler queries will be executed using Map Reduce Engine.
See this query that has some computation;
Can we create table with this IDE?
Yes, it is possible. You can right-click on the your database in Azure Server Explorer and select Create table menu item.
Let’s talk about more on this with later posts.

Self-Service Business Intelligence with Power BI

Understanding Business Intelligence

There was a time that Business Intelligence (BI) had been marked as a Luxury Facility that was limited to the higher management of the organization. It was used for making Strategic Business Decisions and it did not involve or support on operational level decision making. However, along with significant improvement on data accessibility and demand for big data, data science and predictive analytics, BI has become very much part of business vocabulary. Since modern technology now makes previously-impossible-to-access data, ever-increasing data and previously-unknown data available, and business can use BI with every corner related to the business for reacting fast on customers’ demand, changes in the market and competing with competitors.
The usage of Business Intelligence may vary from organization to another. Organizations work with large number of transactions need to analyze continuous, never-ending transactions, almost near-real-time, for understanding the trends, patterns and habits for better competitiveness. Seeing what we like to buy, enticing us to buy something, offering high-demand items with lower-price as a bundled item in a supermarket or commercial web site are some of the examples for usage of BI. Not only that, the demand on data mashups, dashboards, analytical reports by smart business users over traditional production and formal reports is another scenario where we see the usage of BI. This indicates increasing adoption of BI and how it assists to run the operations and survive in the competitive market.
The implementation of traditional BI (or Corporate BI) is an art. It involves with multiple steps and stages and requires multiple skills and expertise. Although a BI project is considered and treated as a business solution than a technical/software solution, this is completely an IT-Driven solution that requires major implementations like ETLing, relational data warehouse and multi-dimensional data warehouse along with many other tiny components. This talks about an enormous cost. The complexity of the solution, time it takes and the finance support it needs, make the implementation tough and lead to failures but it is still required and demand is high. This need pushed us to another type of implementation called Self-Service Business Intelligence.

Self-Service Business Intelligence

It is all about empowering the business user with rich and fully-fledge equipment for satisfying their own data and analytical requirements. It is not something new, though the term were not used. Ever since the spreadsheet was introduced and business users manipulate data with it for various analysis, it was time that Self-Service BI appeared. Self-Service BI supports data extractions from sources, transformations based on business rules, creating presentations and performing analytics. Old-Fashioned Self-Service BI was limited and required some technical knowledge but availability of modern out-of-the-box solutions have enriched the facilities and functionalities, increasing the trend toward to Self-Service BI.
BI needs Data Models. Data Model can be considered as a consistence-view (or visual representations) of data elements along with their relationships to the business. However, data models created by developers are not exactly the model required for BI. The data model created by developer is a relational model that breaks entities to multiple parts but the model created to BI has real-world entities giving a meaning to data. This is called as Semantic Model. Once it is created, it can be used by users easily as it is a self-descriptive model, for performing data analytics, creating reports using multiple visualizations and creating dashboards that represent key information of the organization. Modern Self-Service BI tools support creating models, hence smart business users can create models as per requirements without a help from IT department.

Microsoft Power BI

Microsoft offers few products for supporting Self-Service BI. Microsoft Office Excel is the most widely-used spreadsheet software and with recent addition of four power tools: Power Pivot, Power Query, Power View and Power Map, it has become one of key software for performing Self-Service BI. In addition to that, Microsoft Reporting Services (specifically Report Builder) and SharePoint Services support Self-Service BI allowing users to perform some Self-Service BI operations.
Microsoft Power BI is the latest from Microsoft; a suite of business analytical tools. This addresses almost all Self-Service BI needs, starting from gathering data, preparation for analysis and presenting and analyzing data as required.
Power BI comes in two flavors: Power BI Desktop and Power BI Service. Power BI Desktop is a stand-alone tool that allows us to connect with hundreds of data sources, internal or external, structured or unstructured, model loaded data as per the requirements, apply transformations to adjust data and create reports with stunning visuals. Power BI Service can hold reports created by individuals as a global repository allowing us to share reports among others, create reports using shared/uploaded datasets and creating personalized dashboards.
Business Users who love Excel might see this as another tool that offers the same Excel functionalities. But it offers much more than Excel in terms of BI. Of course, you might see some functionalities that are available in Excel but not in Power BI. The biggest issue we see related to this is, entering data manually. Excel supports entering data, providing many facilities for data-entry but Power BI support on it is limited. Considering these facts, there can be certain scenario where Power BI cannot be used over Excel but in most cases, considering Self-Service BI, it is possible. The transition is not difficult, it is straight forward as Power BI easily corporates with Excel in many ways. This article gives a great introduction to using Power BI within Excel – Introduction to PowerPivot and Power BI.
In order to perform data analysis or create reports, an appropriate data model with necessary data should be created. With traditional BI, relational and OLAP data warehouses fulfil this requirement but creating the same with Self-Service BI is a challenge. Power BI makes this possible by allowing us to connect with multiple data sources using queries and bring them together in the data model. Regardless of the type of data sources, Power BI supports creating relationships between sources with their original structures. Not only that, transformations on loaded data, creating calculated tables, columns and hierarchies are possible with Power BI data models.
What about ETLing? How it is being handled with Self-Service BI or Power BI? This is the most complex and time consuming task with traditional BI and the major component of it is transformation. Although some transformations can be done in Power BI Data Model, advanced transformations have to be done in Query Editor that comprise more sophisticated tools for transformations.
Selecting the best visual for presenting data and configuring it appropriately is the key of visualization. Power BI has plenty of visuals that can be used for creating reports. In addition to the visuals given, there are many other visuals available free, created and published by community supporters. If a specific visual is required, it can be downloaded and added easily.
Visualization is done with Power BI Report. Once they are created, they can be published to Power BI Service. Shared reports can be viewed via Power BI Service portal and native mobile apps. Not only that, Power BI Service portal allows users to create dashboards, pinning visuals in multiple reports to a single dashboard, extending Self-Service BI.
In some occasions, model is already created either as a relational model or OLAP model by IT department. When the model is available, users do not need to create the model for creating reports, they can directly connect with the model and consume data for creating reports. This reduces the space need to consume as user does not need to import data and offer real-time BI. Power BI supports making direct connections to certain relational data sources using Direct Query method and OLAP data sources using Live Connection method.
Power BI is becoming the best Self-Service BI tool in the market. Microsoft is working hard on this, it can be clearly witnessed with monthly releases of Power BI Desktop and weekly upgrades of Power BI Service. If you have not started working with Power BI yet, register at and start using it. You will see Business Intelligence like never before.

SQL Server 2016 – CREATE/ALTER PROCEDURE must be the first statement in a query batch

Everyone knows that SQL Server started supporting CREATE OR ALTER statement with SQL Server 2016 and it can be used with few objects. However, with my new installation of SQL Server 2016, I noticed this error which was not supposed to see with 2016.
What could be the reason? I could not immediately realize the issue and was thinking that what went wrong. It should be working as this is SQL Server 2016.
After few minutes, I realized the issue;
It is all about the version I used. This feature was introduced with SQL Server 2016 – Service Pack I and I had not installed SP1 for my new instance. You may experience the same, hence made this post to note it :).

SQL Server TRY_CONVERT returns an error instead of NULL

Can this be happened? The functions TRY_CONVERT and TRY_CAST have been designed for handling errors and they should return NULL when there is an error.
But, it is not always true.
What is the reason? Although it returns NULL when an error occurred with the conversion, it does not happen if the conversion is not a permittable. The above code tries to convert an integer into an unique identifier that is not permitted, hence it fails with an error.

SQL Server – Merging Partitions – Does it physically move data from a data file to another?

Partitioning is a common practice when maintaining large number of records in tables regardless of the database type: whether the database is an OLTP database or a data warehouse based on dimensional modeling. Partitioning requires a Partition Function that describes boundary values and number of partitions required, and Partition Scheme that assigns File Groups to partitions.
Here are few posts I have made on partitioning;
Once a table is partitioned, based on the workloads, we add new boundary values for introducing new partitions to the table and we remove boundary values for combining partitions (or removing partitions). These two operations are done with two partition related functions: SPLIT and MERGE. Now the question is, when we remove a partition using the MERGE function, if partitions are distributed with different file groups (means different data files), does SQL Server moves data from one file to another? Do we really have to consider it?
Let’s make a database and do a simple test. Let’s create a database called AdventureWorksTest and add multiple file groups and data files.
As you see, we have five file groups now.
Let’s create a Partitioned Table and load it using AdventureWorks2014 Sales.SalesOrderHeader table. Note that, below code creates a Partition Function with 2014, 2015, 2016 and 2017 as boundary values. And it creates the Partition Scheme setting with multiple file groups.
Here is the result of the query.
Here is another query to see how data is distributed with each file group.
If I illustrate the same in different manner;
This is how the data is distributed with my system now. If I try to delete a data file, example, AdventureWorksTest_Data_03 that is belong to FileGroup3, SQL Server will not allow me to do as it holds data.
If I decide to combine 2015 and 2016, in other words, if I want to remove the 4th partition that holds 2016 data, what will happen. Let’s do and see. The below codes merge the partition;
Once delete, let me run the same query to see how data is partitioned now.
As you see, FileGroup3 is no longer used by the table and number of records in the FileGroup2 has been increased. The partition which was 5 is now 4. This means, data that was maintained with FileGroup3 (or AdventureWorksTest_Data_03) has been moved to FileGroup2 (or AdventureWorksTest_Data_02). This is what has happened;
This clearly shows that data is moved when partitions are merged. Okay, do we have to really consider about this? Yes, we have to, when this is a small table, we do not see any performance issue but if the table is large, moving data from one file to another will surely take time, hence, think twice before merging, specifically when they are distributed in multiple files.
** Now, if you need, you can delete the AdventureWorksTest_Data_03 file and SQL Server will allow you to delete as it is empty.

How to refer files in HDInsight – Azure Storage using different ways

If you have started working with Big Data, you surely need to check the Microsoft support on it via Azure platform – HDInsight service. HDInsight allows you to create a Hadoop environment within few minutes and it can be anytime scaled out or in based on your requirements. I have written few posts on this, you can have a look on them using following links;
In order to work with data loaded to HDInsight, or Hadoop, data files have to be refereed using supported syntax. There are multiple ways for referring files in the storage with HDFS. Here are the ways;

Fully qualified path with wasb(s) protocol

This is most accurate and correct way of referring files in the storage. Here is the pattern;
Here is an example using Putty, connecting with HDInsight and reading a file (processed with Hive) exist. My container name is dinesqlashdinsight and storage name is dinesqlasstorage. File path is data/cleanlog/000000_0 (this is a Hive table in fact).

Connecting with the default container

If your files are in the default container, you can skip the container name and storage name as follow;
Note the three slashes. It is required when you do not mentioned the container name.

Connecting using Hadoop/Linux/Unix native ways

Generally, when you work with Hadoop using Linux/Unix, you refer files without the protocol. Azure HDInsight supports the same and we can refer files using that syntax.

Do I need double quotes for my paths?

It is required when you have some odd characters like equal (=) sign with your path. See the example below. I try to read a data file exist in a the cluster and the path has equal signs, hence path is encased with double quotes.

SQL Server – Dropping Primary Key will drop the Clustered Index?

Everyone knows that SQL Server creates a Clustered Index when we add a Primary Key if there is no Clustered Index already exist in the table. It adds a Non-Clustered Key for the Primary Key if we have already added a Clustered Index. However they are two different objects; one is a Constraint and other is an Index. What if I drop one object? Will it drop the other as well?
Let’s make a test and see. If I create a table Customer like below, making Customer Key as the Primary Key;
As you see, it will create both Key and the Index.
Now if I drop either one, it will drop the other one as well. For an example, if I drop the Primary Key, it will drop the Index as well.
If I need to make sure that it does not happen, I can create them separately, first create the Clustered Index on CustomerKey and then make the CustomerKey as the Primary Key. However, that will add another index specifically for the Primary Key.
The reason for above behavior is, Primary Key needs an Index. It is always associated with an index therefor if one is getting dropped, the associate also getting dropped.