THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Maria Zakourdaev

Query big data, stored in a csv or parquet files, from any stored procedure

Every cloud provider these days have a serverless interactive query service that uses standard SQL for data analysis.

As for the biggest cloud providers, we have Azure Data Lake analytics, Google BigQuery and Amazon Athena.

Due to the fact that my company is using only Amazon cloud, I have some experience with Athena service and I must say that it is awesome.

I will show you today how you can use Management Studio or any stored procedure to query the data, stored in a csv file, located on S3 storage. I am using CSV file format as an example here, columnar PARQUET gives much better performance.

I am going to:

1. Put a simple CSV file on S3 storage

2. Create External table in Athena service over the data file bucket

        3. Create linked server to Athena inside SQL Server  

        4. Use OPENQUERY to query the data.  

Athena service is built on the top of Presto, distributed SQL engine and also uses Apache Hive to create, alter and drop tables. You can run ANSI SQL statements in the Athena query editor, launching it from the AWS web services UI. You can use complex joins, window functions and many other great SQL language features. Using Athena eliminates need for ETL because it projects your schema on the data files at the time of the query.

Let’s create database in Athena query editor.

image

As a next step I will put this csv file on S3. By the way, Athena supports JSON format, tsv, csv, PARQUET and AVRO formats.

Uploading the below file to S3 bucket (don’t put columns header in file):

image

image

Going back to Athena to create external table over the S3 folder.  You can later add more files to the same folder and your queries will return the new data immidiately.

image

Now I can query the data:

image

image

As a next step I will set up a linked server from my SQL Server instance because I would like to offload the big data querying to Athena. Of course, I am using in this example tiny data file but in real life we are querying sometimes 300GB of data files in the single query and it takes a few seconds.

Athena has ODBC driver, I will install it on SQL Server machine ( AWS EC2 instance for this example).

Here is an installation link: https://s3.amazonaws.com/athena-downloads/drivers/ODBC/Windows/Simba+Athena+1.0+64-bit.msi

Setting up the ODBC connection. Important, click on the Authentication Option and fill in AccessKey and SecretKey that have permissions to access S3 bucket. S3 output location below will hold csv files with results from your queries. Remember to clean the output files from time to time.

image

What is left is to set up Linked Server inside Management Studio using OLEDB provider for ODBC.

EXEC master.dbo.sp_addlinkedserver @server = N'DWH_ATHENA', @srvproduct=N'', @provider=N'MSDASQL', @datasrc=N'DWH_ATHENA'

GO

EXEC master.dbo.sp_addlinkedsrvlogin @rmtsrvname=N'DWH_ATHENA',@useself=N'False',@locallogin=NULL,@rmtuser=N'*******',@rmtpassword='*********'

GO

Replace @rmtuser and @rmtpassword with AWS access key and secret key and now we can query the data files from any script or stored procedure.

There is one very important thing that you need to know. Regular SQL Server ODBC destinations querying behavior of is to send “select *” to linked server and do filtering inside SQL Server. This is very bad for us since we wanted to offload all work to Athena and we do not want to receive all data.  The way to overcome this is to use OPENQUERY.

Here is example of the query that is using a linked server. The remote query has omitted all filtering and receiving ALL columns from the remote table and filter is being applied later on, inside the “Filter” step.

image

The same query that is using OPENQUERY instead of linked server:

image

Result.

image

 

Isn’t it wonderful to be able to keep the row data in files and query them with minimal effort from the SQL Server side?

Yours,

Maria

Published Sunday, December 17, 2017 1:28 PM by Maria Zakourdaev

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Stan said:

Very interesting.  I will have to try this

December 17, 2017 1:31 PM
 

Michael Zilberstein said:

Isn't it too costly? You pay for SQL Server license, S3 and also 5$ per TB of scanned data in Athena. I mean, that's a nice capability but do you see any practical use for that?

December 18, 2017 10:07 AM
 

Maria Zakourdaev said:

Michael, if Athena queries will be defined correctly: only fields that are required and not 'select *' and query is SARGable (with proper filtering), Athena is not expensive at all.

Sql Server instance in such case does not need to be very expensive, it does not process the data, only receives the final result set.  

As for practical use, our DWH is on Athena/S3 and some data clients are on SQL Server. We could either build full ETL to load the data into SQL Server or work through OPENQUERY. Which is much more flexible and requires no development efforts.

December 19, 2017 3:46 AM
 

Vitaliy Krasner said:

Hi Maria,

thank you for very useful post.

I setup and configured linked server following your instruction and test connection returns "connection successful. OPENQUERY is working.

SELECT * FROM OPENQUERY(ATHENA01,

                        'SELECT *

                        FROM twk_cust_source

                        WHERE tw_pk = ''11175000411''')

In SQL Server Object Explorer I am able to see my tables expanding linked server AwsDataCatalog.

However, when I am trying to query the table using linked server

SELECT *

FROM  [ATHENA01].[AwsDataCatalog].[theory].[twk_cust_source]

WHERE [tw_pk] = '11175000411'

I am getting error:

Msg 7399, Level 16, State 1, Line 1

The OLE DB provider "MSDASQL" for linked server "ATHENA01" reported an error. The provider did not give any information about the error.

Msg 7312, Level 16, State 1, Line 1

Invalid use of schema or catalog for OLE DB provider "MSDASQL" for linked server "ATHENA01". A four-part name was supplied, but the provider does not expose the necessary interfaces to use a catalog or schema.

Again, OPENQUERY is working and returns result back to SSMS.

Do you have any clue where the problem could be?

WE are using SQL Server 2016 now

Appreciate you reply.

August 29, 2018 6:16 AM
 

Vitaliy Krasner said:

Found the root cause:

-------------------------

To resolve the problem we should go to: SQL Management Studio -> Server Objects -> Linked Servers -> Providers right click on MSDASQL and select Properties and uncheck the "Level zero only" property.

Hi Maria,

the  "S3 output location" could be setup in form of S3 sub-folder with expiration.

In this case S3 maintains the sub-folder with kind of garbage collector automatically deleting old files

August 29, 2018 12:41 PM

Leave a Comment

(required) 
(required) 
Submit

About Maria Zakourdaev

The shortest word in the English language that contains the letters: abcdef is… feedback! Let me know if I have touched something that is alive in the cosmos.
Privacy Statement