I am new to Spark and I am trying to access a Hive table from Spark.
1) Created a Spark context and a HiveContext:
val hc=new HiveContext(sc)
val hivetable= hc.sql("Select * from test_db.Table")
Now that I have the table in Spark, my questions are:
1) Why do we need to register the table?
2) We can perform SQL operations directly, so why do we need DataFrame functions
like join, select, filter, etc.?
What is the difference between a SQL query and the equivalent DataFrame operations?
3) What is Spark optimization? How does it work?
You don't need to register a temporary table if you are accessing a Hive table using Spark's HiveContext. Registering a DataFrame as a temporary table allows you to run SQL queries over its data. Suppose you are reading data from a file at some location and you want to run SQL queries over that data:
then you create a DataFrame from the Row RDD, register a temporary table over that DataFrame, and run the SQL operations against it. To perform SQL queries over that data, you use Spark's SQLContext in your code.
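For example, here is a minimal PySpark sketch of that scenario (the file path, column names, and table name are hypothetical):

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext(appName="temp-table-example")
sqlContext = SQLContext(sc)

# hypothetical text file with "name,age" lines
rdd = sc.textFile("/data/people.txt")
row_rdd = rdd.map(lambda line: line.split(",")).map(lambda p: Row(name=p[0], age=int(p[1])))

# create a DataFrame from the Row RDD and register a temporary table over it
df = sqlContext.createDataFrame(row_rdd)
df.registerTempTable("people")

# now plain SQL can be run over the file's data
adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()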
Both methods use exactly the same execution engine and internal data structures. At the end of the day it all boils down to the personal preference of the developer.
Arguably, DataFrame queries are much easier to construct programmatically and
provide minimal type safety.
Plain SQL queries can be significantly more concise and easier to understand.
They are also portable and can be used without any modification in every supported language. With HiveContext, they can also be used to expose some functionality that can be inaccessible in other ways (for example, UDFs without Spark wrappers).
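As an illustration, the two styles below express the same query (this sketch reuses the hypothetical people table/DataFrame from the example above):

from pyspark.sql import functions as F

# plain SQL over the registered temporary table
sql_result = sqlContext.sql("SELECT name FROM people WHERE age >= 18 ORDER BY name")

# the same query built with DataFrame functions
df_result = df.filter(F.col("age") >= 18).select("name").orderBy("name")

# both compile to the same execution plan and return the same rows
sql_result.show()
df_result.show()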
Reference: Spark sql queries vs dataframe functions
Here is a good read on the performance comparison between Spark RDDs vs DataFrames vs SparkSQL.
As for the third question, I don't have an answer for it and will leave it to you to do some research on the net and find a solution :)
Related
I want to use Delta Lake tables in my Hive Metastore on Azure Data Lake Gen2 as the basis for my company's lakehouse.
Previously, I used "regular" Hive catalog tables. I would load data from Parquet into a Spark DataFrame and create a temp table using df.CreateOrReplaceTempView("TableName"), so I could use Spark SQL or %%sql magic to do ETL. After doing this, I could use spark.sql or %%sql on the TableName. When I was done, I would write my tables to the Hive metastore.
However, what if I don't want to perform this saveAsTable operation and write to my Data Lake? What would be the best way to perform ETL with SQL?
I know I can persist Delta tables in the Hive Metastore in a number of ways, for instance by creating a managed catalog table through df.write.format("delta").saveAsTable("LakeHouseDB.TableName").
I also know that I can create a DeltaTable object through DeltaTable(spark, table_path_data_lake), but then I can only use the Python API and not SQL.
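For reference, here is a minimal sketch of that Python-only API, assuming the delta-spark package is installed and table_path_data_lake points at an existing Delta table (the delete condition is hypothetical):

from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, table_path_data_lake)

# read the Delta table back as a DataFrame
df = delta_table.toDF()

# Python-API operation with no SQL involved, e.g. a conditional delete
delta_table.delete("event_date < '2020-01-01'")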
Does there exist some equivalent of CreateOrReplaceTempView(), or is there a better way to achieve ETL with SQL without 'writing' to the data lake first?
However, what if I don't want to perform this saveAsTable operation and write to my Data Lake? What would be the best way to perform ETL with SQL?
That is not possible with Delta Lake, since it relies heavily on the transaction log (_delta_log) under the data directory of a Delta table.
We curate data in the "Dev" Azure SQL Database and then currently use RedGate's Data Compare tool to push it to up to 6 higher-environment Azure SQL Databases. I am trying to migrate that manual process to ADFv2 and would like to avoid copying/pasting the 10+ copy data activities for each database (x6), to keep it more maintainable for future changes. The static tables have some customization in the copy data activity, but the basic idea follows this post to perform an upsert.
How can the implementation described above be done in Azure Data Factory?
I was imagining something like the following:
Using one parameterized linked service that has the server name & database name configurable to generate a dynamic connection to Azure SQL Database.
Creating a pipeline for each table's copy data activity.
Creating a master pipeline to then nest each table's pipeline in.
Using variables to loop over the different connections and passing those to the sub-pipelines' parameters.
Not sure if that is the most efficient plan or even works yet. Other ideas/suggestions?
We cannot tell you if that's the most efficient plan, but I think so. Just make it work.
As you said in the comment:
we can use Dynamic Pipelines - Copy multiple tables in bulk with 'Lookup' & 'ForEach'. We can perform dynamic copies of your data table lists in bulk within a single pipeline. Lookup returns either the list of data or the first row of data. ForEach - #activity('Azure SQL Table lists').output.value ; #concat(item().TABLE_SCHEMA,'.',item().TABLE_NAME,'.csv'). This is efficient and cost optimized since we are using fewer activities and datasets.
Usually we would also choose the same solution as you: dynamic parameters/pipelines, with Lookup + ForEach activities to achieve the scenario. In a word, make the pipeline logically strong, simple, and efficient.
Adding the same info mentioned in the comment as an answer:
Yup, we can use Dynamic Pipelines - Copy multiple tables in bulk with 'Lookup' & 'ForEach'.
We can perform dynamic copies of your data table lists in bulk within a single pipeline. Lookup returns either the list of data or the first row of data.
ForEach - #activity('Azure SQL Table lists').output.value ;
#concat(item().TABLE_SCHEMA,'.',item().TABLE_NAME,'.csv')
This is efficient and cost optimized since we are using fewer activities and datasets.
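For reference, the TABLE_SCHEMA and TABLE_NAME values used in the expressions above could come from a Lookup query like the following against the source Azure SQL Database (this query is an assumption, not part of the original setup):

SELECT TABLE_SCHEMA, TABLE_NAME
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE = 'BASE TABLE';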
I need to transform a fairly big database table to CSV with AWS Glue. However, I only need the newest table rows from the past 24 hours. There is a column which specifies the creation date of each row. Is it possible to transform just these rows, without copying the whole table into the CSV file? I am using a Python script with Spark.
Thank you very much in advance!
There are some Built-in Transforms in AWS Glue which are used to process your data. These transforms can be called from ETL scripts.
Please refer the below link for the same :
https://docs.aws.amazon.com/glue/latest/dg/built-in-transforms.html
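For example, the Filter transform can keep only the last 24 hours of rows before writing out the CSV. This is a minimal sketch, assuming a DynamicFrame named dyf has already been loaded from the catalog table and that the creation-date column is a timestamp called created_at (a hypothetical name):

import datetime
from awsglue.transforms import Filter

cutoff = datetime.datetime.now() - datetime.timedelta(hours=24)

# keep only the rows created within the past 24 hours
recent = Filter.apply(frame=dyf, f=lambda row: row["created_at"] >= cutoff)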
You haven't mentioned the type of database that you are trying to connect to. Anyway, for JDBC connections Spark has the query option, in which you can issue a usual SQL query to get only the rows you need.
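For example, with a JDBC source the filter can be pushed down to the database so that only the recent rows are ever read. This is a minimal sketch with hypothetical connection details, a hypothetical created_at column, and PostgreSQL-flavored date syntax:

# the 'query' option is available for JDBC sources in Spark 2.4+
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://myhost:5432/mydb")
      .option("query", "SELECT * FROM my_table WHERE created_at >= NOW() - INTERVAL '24 hours'")
      .option("user", "myuser")
      .option("password", "mypassword")
      .load())

# write only the filtered rows to CSV
df.write.mode("overwrite").csv("s3://my-bucket/exports/my_table/")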
Is it possible to query an in-memory Arrow table using Presto, or is there some way to use a pandas DataFrame as a data source for the Presto query engine?
Actually, I have Parquet files which I want to convert to Arrow and then query through Presto. Is something like this possible?
As of now, this is not possible. However, there is a Jira open for this feature request. https://github.com/prestodb/presto/issues/12201
What is the command to execute DML statements like INSERT, UPDATE, and DELETE in Google BigQuery?
I tried using bq query "select query"
It works only for SELECT statements.
Note that BigQuery really excels at being a secondary database used for performing fast analytical queries on big data that is static, such as recorded data analysis, logs, and audit history.
If you instead require regular data updates, it is highly recommended to use a separate master database such as the Datastore to perform fast entity operations and updates. You would then persist your data from your master database to your secondary BigQuery database for further analysis.
Therefore, you must tell the bq command-line tool to use standard SQL (--use_legacy_sql=false) instead of BigQuery's default legacy SQL in order to access the Data Manipulation Language (DML) functionality.
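For example, with a hypothetical dataset and table, DML statements can then be run like this:

bq query --use_legacy_sql=false 'UPDATE mydataset.mytable SET status = "inactive" WHERE last_seen < "2020-01-01"'
bq query --use_legacy_sql=false 'DELETE FROM mydataset.mytable WHERE status = "inactive"'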