Run Hive Queries using Visual Studio

Once an HDInsight cluster is configured, we generally use either the portal dashboard (powered by Ambari) or a tool like PuTTY for executing queries against the loaded data. Although these are not exactly developer tools, or in other words, not an IDE, we had to use them because we did not have many options. However, now we can use the IDE we have been using for years to connect with HDInsight and execute various types of queries such as Hive, Pig and U-SQL: Visual Studio.
Let’s see how we can use Visual Studio for accessing HDInsight.

Making Visual Studio ready for HDInsight

In order to work with HDInsight using Visual Studio, you need to install a few tools on top of Visual Studio. Here are the supported versions:
  • Visual Studio 2013 Community/Professional/Premium/Ultimate with Update 4
  • Visual Studio 2015 any edition
  • Visual Studio 2017 any edition
You need to make sure that you have installed the Azure SDK with your Visual Studio. Click here to download the Web Platform Installer and make sure the following are installed:
This also installs Microsoft Azure Data Lake Tools for Visual Studio; make sure it is installed.
Now your Visual Studio is ready for accessing HDInsight.

Connecting with HDInsight

The good thing is, you can connect with your cluster even without creating a project. However, once the SDK is installed, you will see new templates under Azure Data Lake: HIVE (HDInsight), Pig (HDInsight), Storm (HDInsight) and USQL (ADLA), and the HIVE template can be used for creating a project.
The project creates one hql file for you, and you can use it for executing your Hive queries. In addition to that, you can open Server Explorer (View menu -> Server Explorer) and expand Azure (or connect to your Azure account and then expand it) to see all components related to Azure.
As you can see, it shows all databases, internal and external tables, views and columns. Not only that, by right-clicking the cluster, you can open a window for writing a query or viewing jobs. Here is the screen when I use the first option, Write a Hive Query.
Did you notice IntelliSense? Yes, it is supported with almost all metadata, hence it is really easy to write a query.
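For example, a simple query like the one below could go into the generated hql file or the query window. The database, table and column names here are just placeholders for illustration, not objects created by the project:

  -- switch to the database that holds the table
  USE salesdb;

  -- return a few matching rows to confirm everything works
  SELECT customerid, orderdate, ordervalue
  FROM orders
  WHERE ordervalue > 1000
  LIMIT 10;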

Executing Queries

If you need to see records in a table without limiting data with predicates or constructing a query with additional functions, you can simply right-click the table in Server Explorer and select View Top 100 Rows.
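This option is roughly equivalent to running a simple Hive query such as the following, where the table name is just a placeholder:

  -- return the first 100 rows of the table without any filtering
  SELECT * FROM orders LIMIT 100;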
If you need to construct a query, then use the method above to open a window and write the query. There are two ways of executing the code: Batch and Interactive. Batch mode does not give you the result immediately, but you will be able to see or download it once the submitted job is completed. If you use Interactive mode, the result is similar to what you see in SSMS.
If you use Batch mode, you can see the way the job is being executed. Once the job is completed, you can click on Job Output to see or download the output.
As you see, there is no graphical interface to see the job execution. Visual Studio shows the job execution with a graphical interface only when the job is executed by the Tez engine. Remember, HDInsight uses the Tez engine to execute Hive queries by default, but simpler queries may be executed using the MapReduce engine instead.
See this query that has some computation:
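A comparable query would be an aggregation along these lines; the table and column names are placeholders, and it is the computation (the grouping and summing) that makes Tez build an execution graph:

  -- an aggregation like this produces a Tez DAG, so the graphical job view appears
  SELECT customerid,
         SUM(ordervalue) AS totalvalue
  FROM orders
  GROUP BY customerid
  ORDER BY totalvalue DESC;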
Can we create a table with this IDE?
Yes, it is possible. You can right-click on your database under the Azure node in Server Explorer and select the Create Table menu item.
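If you prefer writing the statement yourself instead of using the menu item, a minimal Hive CREATE TABLE would look something like this (the table name, columns and delimiter are only examples):

  -- a simple managed table with a comma-delimited text layout
  CREATE TABLE IF NOT EXISTS orders
  (
      orderid     INT,
      customerid  INT,
      orderdate   DATE,
      ordervalue  DECIMAL(10,2)
  )
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ',';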
Let’s talk more about this in later posts.

How to Connect to HDInsight using PuTTY

I have written many articles on the HDInsight service, which is an Azure service that allows us to provision an Apache Hadoop cluster and use it for various types of big data related implementations. If you need to see all my articles written on HDInsight, please use this link: http://dinesql.blogspot.com/search?q=hdinsight.
Although I have mainly used the Azure portal (Ambari dashboard) to demonstrate HDInsight, many use a tool called PuTTY, which is a free implementation of SSH and Telnet for Windows and Unix platforms. It is a good tool to work with, especially if you need to connect to the cluster remotely and work with it, navigating nodes or doing some data work against the cluster with subprojects like Hive. Remember, if you have configured your HDInsight cluster as a Linux-based cluster, you need an SSH client to connect to it remotely. Let’s see how to use it.
First of all, you need to download PuTTY, as Windows machines do not have it as a built-in tool (Linux and Mac OS computers have an SSH client built in). Here is the link for downloading PuTTY.
There is no installation required; you can just use putty.exe. However, you need a host name to connect. If you go to the portal and open your cluster blade, you can get the host name for connecting with an SSH client by clicking the Secure Shell (SSH) button.
Note that there are two: one for Windows users and another for Linux, Unix and OS X users; make sure that you copy the Windows one.
When you open PuTTY, you should see an input box called Host Name for entering the copied host name. Enter the copied value and make sure that the Connection Type is set to SSH.
All you have to do now is click Open. This opens the standard window with the standard font and font size. If you need a larger font, you can go to Window – Appearance and change the settings as you want.
When you open this for the first time, you will get a security warning like the one below:
This says that we do not have the key related to the host cached locally, hence PuTTY cannot be sure that we are connecting to the right host. Since we know that we are connecting to the right host, we can simply click Yes and continue.
Once connected, you will be prompted for a user id and password. Note that you need to use the SSH user and the password you configured when the cluster was created. It does not accept the HTTP user.
Enter the SSH user and the password, and you will be connected to the default node.
Everything is ready now. As you can see, the first part of the last line is the connected user, and hn0 is one of the head nodes. The last part of it shows the cluster name, but we only see it partially as it is lengthy. Now I can navigate, make folders, move files or connect with Hive without any issue.
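As a quick check once connected, you can start the Hive shell and run a few basic statements. The statements below are just an example; hivesampletable is the sample table that typically ships with an HDInsight cluster, so replace it with one of your own tables if it is not there:

  -- list the databases and tables available on the cluster
  SHOW DATABASES;
  SHOW TABLES;

  -- peek at a few rows of the sample table
  SELECT * FROM hivesampletable LIMIT 10;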