I am new to Spark. We use Spark-SQL to query Hive tables on AWS EMR.
I am running a complex query by building several temporary views in steps.
For example, the first temp view is created by joining a couple of tables in Step 1, that view is then used as a source in the next Step, and so on, until the final Step, whose result is persisted to disk. An example is given below:
create temporary view test1 as
select a.cust_id, b.prod_nm
from a
inner join b
on a.id = b.id;
create temporary view test2 as
select t1.*, t2.*
from test1 t1
inner join c t2
on t1.cust_id = t2.cust_id;
Please note that the resultant view (test1) from the first step is used in the second step as a source in the Join with another table C.
Now, due to Spark's lazy evaluation, even though the temp views get created at each step, the data is not pulled until the last step. As a result, we often run into performance issues in queries that implement complex transformations (e.g. joins on several tables).
Basically I have 2 questions:
How can I approximate the size of such a temp view (at any given step), so that I can choose the right join strategy when this view is joined to another table/view in the next Step?
What are the best practices for such a framework to improve performance?
Note: We use Spark 2.4. I do not have access to PySpark, only to Spark SQL (to query Hive tables).
Any help is appreciated. Thanks.
You cannot determine the size of a temp view that you create.
When using a distributed framework like Spark, your join strategy shouldn't be based on the size of the data but on how the data/join keys are distributed across partitions. If you are using the same temp view multiple times, it's better to cache it, so that the application doesn't read it from HDFS/S3 every time.
Code to cache in SQL:
CACHE TABLE cache_view AS
SELECT * FROM table;
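If you know (or can make) one side of a join small, Spark SQL 2.4 also accepts a broadcast hint, which sidesteps the shuffle entirely. A minimal sketch using the views from the question, on the assumption that test1 is small enough to fit in executor memory:
-- materialize the intermediate view once so later steps reuse it
CACHE TABLE test1;
-- ask Spark to broadcast the (assumed small) cached view to all executors
CREATE TEMPORARY VIEW test2 AS
SELECT /*+ BROADCAST(t1) */ t1.*, t2.*
FROM test1 t1
INNER JOIN c t2
ON t1.cust_id = t2.cust_id;
Whether the broadcast actually helps depends on the view really being small, so check the plan with EXPLAIN rather than trusting the hint blindly.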
Related
I have a base transaction table. Then I have around 15 intermediate steps, where I'm combining dimension tables, performing some aggregation and implementing business logic. The way I'm handling it currently is creating temporary tables for the intermediate stages and, after these 15 steps, populating the final result in a physical table. Is this a better approach, or is using materialized views instead of these intermediate temp tables better? If using materialized views for the intermediate steps is the better approach, can you kindly let me know why?
I have already tried scripting both approaches, with the 15 intermediate steps as global temporary tables as well as materialized views. I found a marginal improvement in performance with the MVs compared to the temp tables, but it comes at the cost of extra physical storage. Not sure which is the best practice and why.
Temporary tables write to disk, so there's I/O costs for both reading and writing. Also most sites don't manage their temporary tables properly and they end up on the default temporary tablespace, which is the same TEMP tablespace everybody uses for sorting, etc. So there's potential for resource contention there.
Materialized views are intended for materializing aspects of our data set which are commonly reused by many different queries. That's why the most common use case is for storing higher level aggregates of low level data. That doesn't sound like the use case you have here. And lo!
I'm doing a complete refresh of MVs and not an incremental refresh
So nope.
Then I have around 15 intermediate steps, where I'm combining dimension tables, performing some aggregation and implementing business logic.
This is a terribly procedural way of querying data. Sometimes there's no way of avoiding this, especially in certain data warehouse scenarios. However, it doesn't follow that we need to materialize the outputs of those queries. An alternative approach is to use WITH clauses. The output from one WITH subquery can feed into lower subqueries.
with sq1 as (
select whatever
, count(*) as t1_tot
from t1
group by whatever
) , sq2 as (
select sq1.whatever
, max(t2.blah) as max_blah
from sq1
join t2 on t2.whatever = sq1.whatever
group by sq1.whatever
) , sq3 as (
select sq2.whatever
,(t3.meh + t3.huh) as qty
from sq2
join t3 on t3.whatever = sq2.whatever
where t3.something >= sq2.max_blah
)
select sq1.whatever
,sq1.t1_tot
,sq2.max_blah
,sq3.qty
from sq1
join sq2 on sq2.whatever = sq1.whatever
join sq3 on sq3.whatever = sq1.whatever
Not saying it won't be a monstrous query, the terror of the department. But it will probably perform way better than your MViews or GTTs. (Oracle may choose to materialize those intermediate result sets, but we can use hints to affect that.)
You may even find from taking this approach that some of your steps are unnecessary and you can combine several steps into one query. Certainly in real life I would write my toy statement above as one query not a join of three subqueries.
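As an aside on those hints: Oracle's MATERIALIZE and INLINE hints (undocumented but widely used) let you nudge the optimizer per WITH subquery. A sketch reusing the toy names above; the optimizer is free to ignore them:
with sq1 as (
  select /*+ materialize */ whatever   -- ask Oracle to spool this result set to temp
       , count(*) as t1_tot
  from t1
  group by whatever
) , sq2 as (
  select /*+ inline */ whatever        -- ask Oracle to merge this one into the main query instead
       , max(blah) as max_blah
  from t2
  group by whatever
)
select sq1.whatever, sq1.t1_tot, sq2.max_blah
from sq1
join sq2 on sq2.whatever = sq1.whatever;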
From what you said, I'd say that using (global or private, depending on the database version you use) temporary tables is the better choice. Why? Because you are "calculating" something, storing the results of those calculations in some tables, and reusing them for additional processing. All of that - if it can't be done without temporary tables - is to be done with tables.
A materialized view is, as its name says, a view. It is the result of some query, but - as opposed to "normal" views - it actually takes up space. It can be refreshed (on demand, when the source data changes, or on a schedule). Yes, it has its advantages, though I can't see any in what you are currently doing.
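For reference, one intermediate step as a global temporary table might look roughly like this (a sketch with made-up table and column names; ON COMMIT PRESERVE ROWS keeps the rows for the rest of the session):
-- hypothetical intermediate step, materialized once per session
create global temporary table stage01_gtt
on commit preserve rows
as
select cust_id, sum(amount) as total_amount
from transactions
group by cust_id;
-- later steps read the GTT like any other table
select s.cust_id, s.total_amount, d.segment
from stage01_gtt s
join cust_dim d on d.cust_id = s.cust_id;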
I have two databases A and B. My application runs on Database A. Now I must retrieve some data from database B. Therefore I created a database link to B.
I am wondering what is faster:
Create a View with the corresponding select on database B and get the data via this view:
select * from myview@B
Select tables directly:
select * from table1@B, table2@B left outer join table3@B...
I think they would probably be just as fast, since the execution plan will be identical. But it would be easier on you to just do the second option.
About Views
A view is a logical representation of another table or combination of tables. A view derives its data from the tables on which it is based. These tables are called base tables. Base tables might in turn be actual tables or might be views themselves. All operations performed on a view actually affect the base table of the view.
You don't get any performance benefit from using a view instead of the tables. A view is simply a stored query: when you submit select * from myview@B, Oracle just retrieves the view definition from the data dictionary and rewrites your query using it.
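A sketch of both options side by side, with made-up column names but the link name from the question; the view is created on database B and queried from A over the link:
-- on database B: the view just wraps the join
create or replace view myview as
select t1.id, t1.name, t3.extra_col
from table1 t1
left outer join table3 t3 on t3.id = t1.id;
-- on database A: these two should end up with the same execution plan
select * from myview@B;
select t1.id, t1.name, t3.extra_col
from table1@B t1
left outer join table3@B t3 on t3.id = t1.id;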
I am using Hive version 0.7.1-cdh3u2
I have two big tables (let's say) A and B, both partitioned by day. I am running the following query
select col1,col2
from A join B on (A.day=B.day and A.key=B.key)
where A.day='2014-02-25'
When I look at the XML file of the map-reduce task, I find that mapred.input.dir includes A/2014-02-25 but all HDFS directories for all days of B, rather than only the directory for the specific day ('2014-02-25'). This takes a lot of time and a larger number of reduce tasks.
I also tried to use
select col1,col2
from A join B on (A.day=B.day and A.key=B.key and A.day='2014-02-25'
and B.day='2014-02-25')
This query performed much faster, with only the required HDFS directories in mapred.input.dir.
I have the following questions.
Shouldn't the Hive optimizer be smart enough to run both queries in exactly the same manner?
What is an optimized way to run a Hive query that joins such tables partitioned on multiple keys?
What is the difference in performance between putting the partition conditions in the JOIN ON clause and putting them in the WHERE clause?
You need to state the condition, i.e. the partition filter, explicitly in the JOIN clause or in the WHERE clause for each table. That way only the required partitions are processed, which in turn improves performance.
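For instance, the WHERE-clause form for the tables in the question would look like this (the partition predicate is repeated for both sides so Hive can prune B's partitions as well as A's):
select col1, col2
from A join B on (A.day = B.day and A.key = B.key)
where A.day = '2014-02-25'
  and B.day = '2014-02-25';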
You can refer to this link:
Apache Hive LanguageManual
I am trying to do this:
Concatenate many rows into a single text string?
And I want to join the query results against other tables, so I want the CSV queries to be an indexed view.
I tried the CTE and XML queries to get the CSV results and created views using those queries. But SQL Server prevented me from creating an index on these views, because CTEs and subqueries are not allowed in indexed views.
Are there any other good ways to join a large CSV result set against other tables and still get fast performance? Thanks
Another way is to do the materialization yourself. You create a table with the required structure and fill it with the content of your SELECT. After that you track changes manually and keep the data in your "cache" table current. You can do this with triggers on ALL the tables involved in the base SELECT (synchronous, but a LOT of pain in complex systems) or with asynchronous processing (jobs, a self-written service, analysis of CDC logs, etc.).
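A rough sketch of that idea with made-up table and column names (the CSV is precomputed into an indexed "cache" table, which then joins like any ordinary table; a trigger or a job has to rerun the refresh when the detail rows change):
-- the "cache" table: one precomputed CSV string per parent row
CREATE TABLE dbo.OrderTagsCsv (
    OrderId INT NOT NULL PRIMARY KEY,
    TagsCsv VARCHAR(MAX) NOT NULL
);
-- refresh the CSV column, e.g. from a trigger or a scheduled job
UPDATE c
SET TagsCsv = STUFF((SELECT ',' + t.TagName
                     FROM dbo.OrderTag t
                     WHERE t.OrderId = c.OrderId
                     FOR XML PATH('')), 1, 1, '')
FROM dbo.OrderTagsCsv c;
-- the CSV result now joins like any ordinary indexed table
SELECT o.OrderId, c.TagsCsv
FROM dbo.Orders o
JOIN dbo.OrderTagsCsv c ON c.OrderId = o.OrderId;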
I have two databases in SQL2k5: one that holds a large amount of static data (SQL Database 1) (never updated but frequently inserted into) and one that holds relational data (SQL Database 2) related to the static data. They're separated mainly because of corporate guidelines and business requirements: assume for the following problem that combining them is not practical.
There are places in SQLDB2 where PKs from SQLDB1 are referenced; triggers enforce the referential integrity, since cross-database relationships are troublesome in SQL Server. BUT, because of the large amount of data in SQLDB1, I'm getting eager spools on queries that join from the Id in SQLDB2 that references the data in SQLDB1. (With me so far? Maybe an example will help:)
SELECT t.Id, t.Name, t2.Company
FROM SQLDB1.table t INNER JOIN SQLDB2.table t2 ON t.Id = t2.FKId
This query results in an eager spool that's 84% of the cost of the query; the table in SQLDB1 has 35M rows, so it's completely choking this query. I can't create a view on the table in SQLDB1 and use that as my FK/index; SQL Server won't let me create a constraint based on a view.
Anyone have any idea how I can fix this huge bottleneck? (Short of putting the static data in the first db: believe me, I've argued that one until I'm blue in the face to no avail.)
Thanks!
valkyrie
Edit: also can't create an indexed view because you can't put schemabinding on a view that references a table outside the database where the view resides. Dang it.
Edit 2: adding in index hints made zero difference.
In case anyone else runs into this problem, I don't have a great solution. But what I ended up having to do was put some limited dupe data into the target database, in order to completely bypass the eager spool.
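In case the shape of that workaround is useful, here is a sketch with illustrative names (only the columns SQLDB2 actually needs are copied and indexed locally, and the copy is topped up periodically since SQLDB1 is insert-only):
-- narrow local copy of the static data, living in SQLDB2
CREATE TABLE SQLDB2.dbo.StaticLookup (
    Id   INT NOT NULL PRIMARY KEY,
    Name VARCHAR(100) NOT NULL
);
-- periodic refresh: pull over only the rows not copied yet
INSERT INTO SQLDB2.dbo.StaticLookup (Id, Name)
SELECT s.Id, s.Name
FROM SQLDB1.dbo.StaticTable s
WHERE NOT EXISTS (SELECT 1 FROM SQLDB2.dbo.StaticLookup l WHERE l.Id = s.Id);
-- the join now stays entirely inside SQLDB2, so the eager spool disappears
SELECT l.Id, l.Name, t2.Company
FROM SQLDB2.dbo.StaticLookup l
INNER JOIN SQLDB2.dbo.RelationalTable t2 ON l.Id = t2.FKId;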