As part of my Spark job I am storing the output in a Hive table on HDInsight. I now want to expose the data to any COTS tools that can consume an OData feed, such as Tableau. I was wondering if anyone has any pointers on how this can be accomplished?
It's easy to do if the data is stored in Hive.
An HDInsight Spark cluster has a Thrift server set up that allows BI tools such as Tableau and Power BI to process the data via Spark.
See:
https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-use-bi-tools/
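As a rough sketch of the Spark side (the database, table, and column names below are placeholders), the job output just needs to land in the Hive metastore as a table; anything registered there is then queryable through the Thrift/ODBC endpoint that Tableau and Power BI connect to:

```python
# Minimal PySpark sketch: persist the Spark job's output as a Hive table so the
# cluster's Thrift server can expose it to BI tools. All names are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("publish-to-hive")
    .enableHiveSupport()   # register tables in the cluster's Hive metastore
    .getOrCreate()
)

# Stand-in for your real job output.
output_df = spark.createDataFrame(
    [("2016-01-01", 42.0), ("2016-01-02", 17.5)],
    ["event_date", "metric"],
)

spark.sql("CREATE DATABASE IF NOT EXISTS reporting")

# A managed Hive table is immediately visible to anything that connects
# over the Spark Thrift server (Tableau, Power BI, beeline, etc.).
output_df.write.mode("overwrite").saveAsTable("reporting.spark_output")
```

From Tableau you would then connect to the cluster with the Spark SQL connector, as described in the linked article.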
I am archiving rows that are older than a year into ADLS Gen2 as Delta tables. When there is a need to report on that data, I need to join the archived data with some tables in an on-premises database. Is there a way to do the join without rehydrating data from the cloud or hydrating data to the cloud?
Yes, you can accomplish this by using Azure Data Factory.
Azure Data Factory (ADF) is a fully managed, serverless data integration service. Visually integrate data sources with more than 90 built-in, maintenance-free connectors at no added cost. Easily construct ETL and ELT processes code-free in an intuitive environment or write your own code.
Firstly, you need to install the Self-hosted Integration Runtime on your local machine so that ADF can access the on-premises SQL Server. To accomplish this, refer to Connect to On-premises Data in Azure Data Factory with the Self-hosted Integration Runtime.
Since you have archived the data in ADLS, you need to change the access tier of that container from Cold to Hot in order to retrieve the data in ADF.
Next, create a Linked Service using the Self-hosted IR you created, and then create a Dataset using this Linked Service to access the on-premises database.
Similarly, create a Linked Service using the default Azure IR, and create a Dataset using it to access the data in ADLS.
You also need a destination database where the joined data will be stored. If you are storing it in the same on-premises database, you can reuse the existing Linked Service, but you need to create a new Dataset that points to the destination table.
Once all this configuration is done, create a pipeline with a Data Flow activity in ADF.
Mapping data flows are visually designed data transformations in Azure Data Factory. Data flows allow data engineers to develop data transformation logic without writing code. The resulting data flows are executed as activities within Azure Data Factory pipelines that use scaled-out Apache Spark clusters.
Learn more about Mapping data flow here.
Finally, in the data flow, your sources will be the on-premises dataset and the ADLS dataset you created above. Use the join transformation in the mapping data flow to combine data from the two sources. The output stream will include all columns from both sources, matched based on the join condition.
The sink transformation takes your destination dataset, where the joined data will be stored as the output.
I have questions about ways to automate the data transformation process.
What I normally do is transform data using Python or PostgreSQL and then export the processed data as CSV. After that, I connect the CSV file to Tableau.
I have done some research and found that ETL can help. However, I've watched some ETL tools' demo videos, and I'm not sure whether their transform features would meet my needs. For example, I have written 100+ lines of SQL for one of my data transformation tasks; it would be better if I could use PostgreSQL to run that query instead of an ETL tool.
The problem is that I don't know the proper way to automate the data transformation process and then push the data to Tableau. The CSV files will be updated on a daily basis, so I'll need to refresh the data.
Data transformation can be done in various ways; the right fit depends on the nature of your data.
If you have a huge volume of data and you are comfortable in Python or Java, you can automate your transformation logic using Spark, write the results to a Hive table, and then connect Tableau to read the data from Hive.
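For example, a minimal sketch of that approach (the file path, view/table names, and the SQL itself are placeholders) could load the daily CSV, run your existing SQL with Spark, and publish the result to Hive; a scheduler such as cron or Airflow can then trigger it daily:

```python
# Rough sketch: run an existing SQL transformation with Spark and write the
# result to a Hive table that Tableau reads. All names/paths are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("daily-transform")
    .enableHiveSupport()
    .getOrCreate()
)

# Load the daily CSV extract and expose it to Spark SQL.
raw_df = spark.read.csv("/data/incoming/sales.csv", header=True, inferSchema=True)
raw_df.createOrReplaceTempView("raw_sales")

# Your existing SQL can go here largely as-is, though Spark SQL's dialect
# differs from PostgreSQL in places.
transformed_df = spark.sql("""
    SELECT sale_date, SUM(amount) AS total_amount
    FROM raw_sales
    GROUP BY sale_date
""")

# Overwrite the reporting table each day; Tableau connects to Hive to read it.
transformed_df.write.mode("overwrite").saveAsTable("daily_sales")
```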
Most next-gen ETL tools such as Pentaho and Talend can be used, but that erodes the flexibility and portability that a framework like Spark or Beam can give you.
If you want to know how to achieve this using cloud provider services such as GCP or AWS, please let me know.
Prep is the Tableau tool for preparing data. It can be used for joining, appending, cleaning, pivoting, filtering and other data cleansing activities.
Tableau Prep is available:
for free if you have a Tableau Creator license
in desktop and Tableau Online/Server versions
Scheduling Prep flows is available in Tableau Online/Server. To schedule flows, you will need to acquire the Tableau Prep Conductor add-on.
I'm very new to Snowflake, so forgive me if the answer is obvious.
I am loading the data from on-prem into Azure using Data Factory, and then ingesting it into Snowflake using COPY INTO. However, I need to give other platforms access to some of the transformed data, meaning that if I perform the transformation in Snowflake, I'll need to create an external table in Azure (essentially pushing this data back to Azure so other platforms can access it).
As we don't particularly want to introduce a new tool, I have two options for our fairly basic transformation:
do the transformation in ADF
do the transformation in Snowflake in SQL scripts and then create an external table so other teams can access the data using other tools (these platforms don't integrate with Snowflake)
Are there any major drawbacks to option 2 apart from increased storage costs?
I'm trying to weigh up the following: maintenance effort (our team's skills lie in SQL, not ADF), cost, and performance.
Any advice would be appreciated.
As stated in the question, there are many possible answers for this scenario, with my favorite being the second one ("do the transformation in Snowflake in SQL scripts and then create an external table so other teams can access the data using other tools").
If you need to make the results of these transformations available on Azure storage, Azure Data Factory supports this natively:
The Copy activity can copy data from Snowflake, utilizing Snowflake's COPY INTO [location] command to achieve the best performance. See https://learn.microsoft.com/en-us/azure/data-factory/connector-snowflake#supported-capabilities
Or you could manage this inside Snowflake using the same COPY INTO that ADF uses.
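For illustration, here is a hedged sketch of that Snowflake-side approach using the snowflake-connector-python package; the account, credentials, warehouse, stage, container URL, and table names are all placeholders. It unloads the transformed table to an Azure container through an external stage, where other platforms (or an external table defined over the same files) can pick it up:

```python
# Hedged sketch: unload a transformed Snowflake table to Azure storage with
# COPY INTO <location>. Connection details and object names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",        # placeholder
    user="my_user",
    password="my_password",
    warehouse="transform_wh",
    database="analytics",
    schema="public",
)

try:
    cur = conn.cursor()

    # External stage pointing at the Azure container (SAS token is a placeholder).
    cur.execute("""
        CREATE STAGE IF NOT EXISTS transformed_stage
        URL = 'azure://myaccount.blob.core.windows.net/curated/orders/'
        CREDENTIALS = (AZURE_SAS_TOKEN = '<sas-token>')
        FILE_FORMAT = (TYPE = PARQUET)
    """)

    # The same COPY INTO <location> command that ADF's connector uses under the hood.
    cur.execute("""
        COPY INTO @transformed_stage
        FROM (SELECT * FROM analytics.public.transformed_orders)
        OVERWRITE = TRUE
    """)
finally:
    conn.close()
```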
For more background, see the Snowflake webinar "Data Warehouse or Data Lake? How You Can Have Both in a Single Platform":
https://resources.snowflake.com/webinars-thought-leadership/data-warehouse-or-data-lake-how-you-can-have-both-in-a-single-platform-3
I set up a VPN connection between my on-prem SQL Server and GCP. I need to load more than 10 million rows of data from SQL Server to BigQuery. Is there any way to achieve this? Can I use SSIS to load the data to BigQuery?
My team lead requested that we use Dataflow to load the data to BigQuery rather than SSIS.
Regarding using SSIS to load the data from the on-premises SQL Server to BigQuery, this might be what you're looking for. As for Cloud Dataflow, the official GCP documentation details how it can be done, although you might need to use Cloud Storage as an intermediate data sink.
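To make the Dataflow option more concrete, here is a rough Apache Beam (Python SDK) sketch. The JDBC URL, credentials, table names, project, bucket, and BigQuery table are placeholders; the JDBC read is a cross-language transform that needs the SQL Server JDBC driver available to its expansion service, and the target BigQuery table is assumed to already exist with a matching schema:

```python
# Rough sketch: read rows from SQL Server over JDBC and write them to BigQuery
# on Dataflow. All connection details and object names are placeholders.
import apache_beam as beam
from apache_beam.io.jdbc import ReadFromJdbc
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",                 # placeholder
    region="us-central1",
    temp_location="gs://my-bucket/tmp",       # Cloud Storage staging area
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromSqlServer" >> ReadFromJdbc(
            table_name="dbo.orders",
            driver_class_name="com.microsoft.sqlserver.jdbc.SQLServerDriver",
            jdbc_url="jdbc:sqlserver://10.0.0.5:1433;databaseName=sales",
            username="etl_user",
            password="etl_password",
        )
        # JDBC rows arrive as named tuples; BigQuery expects dicts.
        | "ToDict" >> beam.Map(lambda row: row._asdict())
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            table="my-gcp-project:analytics.orders",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            # Assumes the destination table was created beforehand.
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
        )
    )
```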
I would like to expose Hive as a web service so that my PHP programs can invoke the web service and show the output in a UI. I am not sure how to do this with Hive.
There is a JDBC driver for Hive. It supports only a subset of the JDBC API and SQL syntax; these limitations are defined by the capabilities of Hive itself. Hive is much better suited to batch commands, for example filtering a large subset of an enormous data set, possibly joining with other Hive tables. Nevertheless, using JDBC you can conceivably access Hive data via web services. You might consider creating asynchronous web services. If you are going to access Hive through synchronous web service calls, make sure your timeouts are large enough to accommodate the time needed to run Hive commands.
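As an illustration of the same idea from Python rather than Java (PyHive talks to HiveServer2 over Thrift, the same endpoint the JDBC driver uses), a minimal synchronous HTTP endpoint that a PHP front end could call might look like the following. The host, port, user, route, and query are placeholders, and for long-running queries you would move the work into an asynchronous job as suggested above:

```python
# Minimal sketch: wrap a Hive query behind an HTTP endpoint using Flask + PyHive.
# Connection details, route, and table/column names are placeholders.
from flask import Flask, jsonify
from pyhive import hive

app = Flask(__name__)

@app.route("/sales/daily")
def daily_sales():
    # Hive queries can take minutes; keep synchronous endpoints for small
    # result sets, or run the query in a background job and poll for results.
    conn = hive.Connection(host="hive-server.example.com", port=10000,
                           username="webapp")
    try:
        cursor = conn.cursor()
        cursor.execute(
            "SELECT sale_date, SUM(amount) FROM sales GROUP BY sale_date"
        )
        rows = cursor.fetchall()
    finally:
        conn.close()
    return jsonify(results=[{"sale_date": str(d), "total": t} for d, t in rows])

if __name__ == "__main__":
    app.run(port=8080)
```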