Batch execution of hive scripts causing issue - hive

I will start off by saying, if you need more information, please let me know, I am trying to keep this as simple as possible, but I can provide full details if required.
So I have a base hql script that looks like this
INSERT INTO TABLE ${DB_NAME}.${TABLE_NAME} PARTITION (SNAPSHOT_YEAR_MONTH, SNAPSHOT_DAY)
SELECT
  `(snapshot_year_month|snapshot_day|row_num)?+.+`,
  '${SNAPSHOT_YEAR}' as snapshot_year_month,
  '${SNAPSHOT_DAY}' as snapshot_day
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY ${PK} ORDER BY time_s DESC, wanted_c DESC) AS row_num
  FROM (
    SELECT ${COL} FROM ${SOURCE_DB_NAME}.${SRC_TABLE_NAME}
    WHERE concat(snapshot_year_month, snapshot_day) = '${LOWER_LIMIT}${LOWER_LIMIT_DAY}'
    UNION ALL
    SELECT ${COL} FROM ${SOURCE_DB_NAME}.${SRC_TABLE_NAME}
    WHERE concat(snapshot_year_month, snapshot_day) > '${LOWER_LIMIT}${LOWER_LIMIT_DAY}'
      AND concat(snapshot_year_month, snapshot_day) <= '${UPPER_LIMIT}${UPPER_LIMIT_DAY}'
  ) B
) A
WHERE A.row_num = 1;
This base script generates several hundred Hive queries that need to be run. One important part to note is the WHERE A.row_num = 1, which prevents any combination of ${PK} from having more than one row. These scripts are successfully generated and stored in .hql files. I then have a batch script to run all of them:
for f in *.hql ; do beeline -u 'myserverinfo' -f "$f" ; done
Now, if only a couple of files are in the folder, everything works as intended. However, if I drop 500 HQL files in the folder, the duplicates (which are correctly removed when only a couple of files are present) are no longer removed.
So, for example, if I have a dataset that looks like this:
| my_pk1 | my_pk2 | time_s | wanted_c | other cols... |
+--------+--------+--------+----------+---------------+
|      1 |      1 |      1 |        1 | ...           |
|      1 |      1 |      2 |        5 | ...           |
|      1 |      1 |      6 |        6 | ...           |
If just a couple of HQL files are run with my batch script, it will correctly condense that to just:
| my_pk1 | my_pk2 | time_s | wanted_c | other cols... |
+--------+--------+--------+----------+---------------+
|      1 |      1 |      6 |        6 | ...           |
However, if I drop in 500+ HQL scripts to be run, the partitions are correctly inserted and all the scripts run and produce data, but they all contain duplicate primary keys. The same table, produced by the EXACT same HQL script, would output:
| my_pk1 | my_pk2 | time_s | wanted_c | other cols... |
+--------+--------+--------+----------+---------------+
|      1 |      1 |      1 |        1 | ...           |
|      1 |      1 |      2 |        5 | ...           |
|      1 |      1 |      6 |        6 | ...           |
This makes absolutely no sense to me. Any ideas on what could be happening? I'm sorry if this makes no sense.


Rebuild tables from joined table

I am facing an issue where a data supplier is generating a dump of his multi-tenant databases into a single table. Recreating the original tables is not impossible; the problem is that I am receiving millions of rows every day, and recreating everything, every day, is out of the question.
Until now, I was using SSIS to do so, with a lookup-intensive approach. In the past year, my virtual machine went from having 2 GB of RAM to 128 GB, and it is still growing.
Let me explain the disgrace:
Imagine a database where users have posts, and posts have comments. In my real scenario, I am talking about 7 distinct tables. Analyzing a few rows, I have the following:
+-----+------+------+--------+------+-----------+------+----------------+
| Id* | T_Id | U_Id | U_Name | P_Id | P_Content | C_Id | C_Content |
+-----+------+------+--------+------+-----------+------+----------------+
| 1 | 1 | 1 | john | 1 | hello | 1 | hello answer 1 |
| 2 | 1 | 2 | maria | 2 | cake | 2 | cake answer 1 |
| 3 | 2 | 1 | pablo | 1 | hello | 1 | hello answer 3 |
| 4 | 2 | 1 | pablo | 2 | hello | 2 | hello answer 2 |
| 5 | 1 | 1 | john | 3 | nosql | 3 | nosql answer 1 |
+-----+------+------+--------+------+-----------+------+----------------+
the Id is from my table
T_Id is the "tenant" Id, which identifies multiple databases
I have imagined the following possible solution:
I make a query that selects non-existent Ids for each table, such as:
SELECT DISTINCT n.t_id,
n.c_id,
n.c_content
FROM mytable n
WHERE n.id > 4
AND NOT EXISTS (SELECT 1
FROM mytable o
WHERE o.id <= 4
AND n.t_id = o.t_id
AND n.c_id = o.c_id)
This way, I am able to select only the new occurrences whenever a new Id of a table is found. Although it works, it may perform badly when working with 100s of millions of rows.
Could anyone share a suggestion? I am quite lost.
Thanks in advance.
EDIT > my question is vague
My final intent is to rebuild the tables from the dump, incrementally, avoiding lookups outside the database. Every now and then I am going to run a script that will select new tenants, users, posts and comments and add them to their corresponding tables.
My previous solution worked as follows:
Cache the whole database
For each new row, search for the columns inside the cache
If it doesn't exist, then insert it
I know it sounds dumb, but it made sense as a new developer working with ETLs
First, if you have a full flat DB dump, I'd suggest you work on the file before even importing it into your DB (low-level file processing is pretty cheap and nearly instantaneous).
Following Removing lines in one file that are present in another file using python, you can remove all the lines already parsed since your last run:
# Collect the lines already seen in the previous dump; a set gives O(1) lookups
with open('old.csv', 'r') as f:
    seen = set(f)

# Stream the new dump and keep only the lines that were not in the old one
with open('new.csv', 'r') as source, open('diff_add.csv', 'w') as destination:
    for line in source:
        if line not in seen:
            destination.write(line)
This takes less than five seconds to work on a 900 MB to 1.2 GB dump. With this you'll only work with the lines that actually change one of your new tables.
Now you can import this flat DB to a working table.
As you'll have to search for the needle in each line, some indexes on the ids may be a good idea (go for a composite index that uses your tenant id first).
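A minimal sketch of such an index, reusing the sample table's names (the index name is made up):
-- composite index with the tenant id first, so per-tenant lookups stay selective
CREATE INDEX ix_mytable_tid_cid ON mytable (t_id, c_id);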
For the last part, I don't know exactly what your data looks like: do you have any updates to apply?
The EXCEPT and INTERSECT operators can also help with this kind of problem.
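For instance, a minimal sketch of the EXCEPT idea on the sample table (note that EXCEPT compares every selected column, while the NOT EXISTS query above keys on t_id and c_id only):
-- rows arriving after the watermark id, minus rows already seen before it
SELECT t_id, c_id, c_content
FROM mytable
WHERE id > 4
EXCEPT
SELECT t_id, c_id, c_content
FROM mytable
WHERE id <= 4;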

SQL Performance on joining 2 scripts ORACLE

I have 2 scripts which are quite complicated; one was written by me personally, the other was done 10 years ago. The first script gets the necessary IDs and executes in around 30 seconds, for example:
| ID | some other info ...
+----+--------------------
| 1 | ...
| 2 | ...
| 3 | ...
| 4 | ...
The second script gets some more complicated data, which is calculated through many subqueries, and executes in around 30 seconds, for example:
| ID | Computed Info
+----+--------------------
| 1 | 111
| 2 | 222
| 3 | 333
| 4 | 444
Now my script1 needs to include some partial results from script2. Because script2 is very complicated, it is quite hard to break out just the necessary parts, which is why I tried to left join the results of script2 to script1 using the IDs:
SELECT TABLE1.*, TABLE2.COMPUTED_INFO
FROM SCRIPT1 TABLE1
LEFT JOIN SCRIPT2 TABLE2 ON TABLE2.ID = TABLE1.ID
The result I got and also the expected result is:
| ID | some other info ... | Computed Info
+----+---------------------+---------------
| 1 | ... | 111
| 2 | ... | 222
| 3 | ... | 333
| 4 | ... | 444
The problem is that after joining both of them the time of execution is now 20+ min.
I have also tried
with table1 as
(script1),
table2 as
(script2)
select t1.*, t2.computed_data
from table1 t1 left join table2 t2 on t2.id = t1.id
Which resulted in 10+ minutes.
I am wondering why this happens, when script1 and script2 each run in around 30 seconds separately but take 10+ minutes when run together.
Is there another way to accomplish this?
You can create temp tables before joining them.
First:
create table temp1 as select * from script1
create table temp2 as select * from script2
Then run your query:
SELECT t1.*, t2.COMPUTED_INFO FROM temp1 t1 LEFT JOIN temp2 t2 ON t2.ID = t1.ID
The last time I had this kind of issue, I resolved it with temporary tables: I created temporary tables from the SCRIPT1 and SCRIPT2 results, then added indexes on the ID columns.
After this, a query similar to yours should execute much faster.
This happened on a PostgreSQL server, but the root of the problem is the same: usually an RDBMS can't optimise a subquery/result set coming from a PROCEDURE/FUNCTION and cannot use indexes on its rows.
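A minimal sketch of that approach in Oracle, where script1 and script2 stand for the two big queries from the question (global temporary tables keep the intermediate rows private to your session):
-- materialise each script's result once
CREATE GLOBAL TEMPORARY TABLE temp1 ON COMMIT PRESERVE ROWS AS SELECT * FROM (script1);
CREATE GLOBAL TEMPORARY TABLE temp2 ON COMMIT PRESERVE ROWS AS SELECT * FROM (script2);
-- index the join column so the optimiser can use it
CREATE INDEX temp2_id_ix ON temp2 (id);
-- then join the two small materialised results
SELECT t1.*, t2.computed_info
FROM temp1 t1 LEFT JOIN temp2 t2 ON t2.id = t1.id;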

How to select using WITH RECURSIVE clause [closed]

I have googled and read through some articles like
this postgreSQL manual page
or this blog page
and tried making queries myself with moderate success (some of them hang, while others work well and fast),
but so far I cannot completely understand how this magic works.
Can anybody give a very clear explanation demonstrating such a query's semantics and execution process,
preferably based on typical samples like factorial calculation or full tree expansion from an (id, parent_id, name) table?
And what are the basic guidelines and typical mistakes one should know to write good recursive queries?
First of all, let us try to simplify and clarify the algorithm description given on the manual page. To simplify it, consider only union all in the with recursive clause for now (union comes later):
WITH RECURSIVE pseudo-entity-name(column-names) AS (
Initial-SELECT
UNION ALL
Recursive-SELECT using pseudo-entity-name
)
Outer-SELECT using pseudo-entity-name
To clarify it, let us describe the query execution process in pseudocode:
working-recordset = result of Initial-SELECT
append working-recordset to empty outer-recordset
while( working-recordset is not empty ) begin
new working-recordset = result of Recursive-SELECT
taking previous working-recordset as pseudo-entity-name
append working-recordset to outer-recordset
end
overall-result = result of Outer-SELECT
taking outer-recordset as pseudo-entity-name
Or even shorter: the database engine executes the initial select, taking its result rows as the working set. Then it repeatedly executes the recursive select on the working set, each time replacing the contents of the working set with the query result obtained. This process ends when the recursive select returns an empty set. All the result rows, given first by the initial select and then by the recursive select, are gathered and fed to the outer select, whose result becomes the overall query result.
This query calculates the factorial of 3:
WITH RECURSIVE factorial(F,n) AS (
SELECT 1 F, 3 n
UNION ALL
SELECT F*n F, n-1 n from factorial where n>1
)
SELECT F from factorial where n=1
The initial select SELECT 1 F, 3 n gives us the initial values: 3 for the argument and 1 for the function value.
The recursive select SELECT F*n F, n-1 n from factorial where n>1 states that every time we need to multiply the last function value by the last argument value and decrement the argument value.
The database engine executes it like this:
First of all it executes the initial select, which gives the initial state of the working recordset:
F | n
--+--
1 | 3
Then it transforms the working recordset with the recursive query and obtains its second state:
F | n
--+--
3 | 2
Then third state:
F | n
--+--
6 | 1
In the third state there is no row satisfying the n>1 condition of the recursive select, so the fourth working set is empty and the loop exits.
The outer recordset now holds all the rows returned by the initial and recursive selects:
F | n
--+--
1 | 3
3 | 2
6 | 1
The outer select filters all the intermediate results out of the outer recordset, showing only the final factorial value, which becomes the overall query result:
F
--
6
And now let us consider table forest(id,parent_id,name):
id | parent_id | name
---+-----------+-----------------
1 | | item 1
2 | 1 | subitem 1.1
3 | 1 | subitem 1.2
4 | 1 | subitem 1.3
5 | 3 | subsubitem 1.2.1
6 | | item 2
7 | 6 | subitem 2.1
8 | | item 3
'Expanding the full tree' here means sorting the tree items in human-readable depth-first order while calculating their levels and (optionally) paths. Neither task (correct sorting, nor calculating a level or path) is solvable in one (or even any constant number of) SELECTs without using the WITH RECURSIVE clause (or the Oracle CONNECT BY clause, which is not supported by PostgreSQL). But this recursive query does the job (well, almost does, see the note below):
WITH RECURSIVE fulltree(id,parent_id,level,name,path) AS (
SELECT id, parent_id, 1 as level, name, name||'' as path from forest where parent_id is null
UNION ALL
SELECT t.id, t.parent_id, ft.level+1 as level, t.name, ft.path||' / '||t.name as path
from forest t, fulltree ft where t.parent_id = ft.id
)
SELECT * from fulltree order by path
The database engine executes it like this:
First, it executes the initial select, which gives all highest-level items (roots) from the forest table:
id | parent_id | level | name | path
---+-----------+-------+------------------+----------------------------------------
1 | | 1 | item 1 | item 1
8 | | 1 | item 3 | item 3
6 | | 1 | item 2 | item 2
Then it executes the recursive select, which gives all 2nd-level items from the forest table:
id | parent_id | level | name | path
---+-----------+-------+------------------+----------------------------------------
2 | 1 | 2 | subitem 1.1 | item 1 / subitem 1.1
3 | 1 | 2 | subitem 1.2 | item 1 / subitem 1.2
4 | 1 | 2 | subitem 1.3 | item 1 / subitem 1.3
7 | 6 | 2 | subitem 2.1 | item 2 / subitem 2.1
Then it executes the recursive select again, retrieving the 3rd-level items:
id | parent_id | level | name | path
---+-----------+-------+------------------+----------------------------------------
5 | 3 | 3 | subsubitem 1.2.1 | item 1 / subitem 1.2 / subsubitem 1.2.1
And now it executes the recursive select again, trying to retrieve 4th-level items, but there are none, so the loop exits.
The outer SELECT sets the correct human-readable row order, sorting on the path column:
id | parent_id | level | name | path
---+-----------+-------+------------------+----------------------------------------
1 | | 1 | item 1 | item 1
2 | 1 | 2 | subitem 1.1 | item 1 / subitem 1.1
3 | 1 | 2 | subitem 1.2 | item 1 / subitem 1.2
5 | 3 | 3 | subsubitem 1.2.1 | item 1 / subitem 1.2 / subsubitem 1.2.1
4 | 1 | 2 | subitem 1.3 | item 1 / subitem 1.3
6 | | 1 | item 2 | item 2
7 | 6 | 2 | subitem 2.1 | item 2 / subitem 2.1
8 | | 1 | item 3 | item 3
NOTE: The resulting row order will remain correct only while there are no punctuation characters collating before / in the item names. If we rename Item 2 to Item 1 *, it will break the row order, as that name sorts between Item 1 and its descendants.
A more stable solution is to use the tab character (E'\t') as the path separator in the query (it can be substituted with a more readable separator later: in the outer select, before displaying to a human, etc.). Tab-separated paths retain correct order as long as there are no tabs or control characters in the item names, which can easily be checked and ruled out without loss of usability; a sketch follows.
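Here is that variant: only the separator changes, and the outer select swaps in a readable one for display (ordering still uses the raw tab-separated column):
WITH RECURSIVE fulltree(id,parent_id,level,name,path) AS (
  SELECT id, parent_id, 1 as level, name, name||'' as path from forest where parent_id is null
  UNION ALL
  SELECT t.id, t.parent_id, ft.level+1, t.name, ft.path||E'\t'||t.name
  from forest t, fulltree ft where t.parent_id = ft.id
)
SELECT id, parent_id, level, name,
       replace(path, E'\t', ' / ') as display_path   -- readable separator for humans
from fulltree order by path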
It is very simple to modify the last query to expand any arbitrary subtree: you need only substitute the condition parent_id is null with parent_id = 1 (for example). Note that this query variant will return all levels and paths relative to Item 1.
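For example, here is the subtree variant rooted under Item 1; only the initial select differs from the fulltree query above:
WITH RECURSIVE subtree(id,parent_id,level,name,path) AS (
  SELECT id, parent_id, 1 as level, name, name||'' as path from forest where parent_id = 1
  UNION ALL
  SELECT t.id, t.parent_id, st.level+1, t.name, st.path||' / '||t.name
  from forest t, subtree st where t.parent_id = st.id
)
SELECT * from subtree order by path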
And now about typical mistakes. The most notable mistake specific to recursive queries is defining a faulty stop condition in the recursive select, which results in infinite looping.
For example, if we omit the where n>1 condition in the factorial sample above, the recursive select will never return an empty set (because there is no condition left to filter out the single row) and the loop will continue infinitely.
That is the most probable reason why some of your queries hang (the other, non-specific but still possible reason is a very inefficient select, which executes in finite but very long time).
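For illustration, here is a sketch of that runaway factorial, kept finite only by an explicit limit in the outer select (PostgreSQL evaluates the recursion lazily, so the limit stops it; without the limit, this statement would never finish):
WITH RECURSIVE factorial(F,n) AS (
  SELECT 1 F, 3 n
  UNION ALL
  SELECT F*n F, n-1 n from factorial   -- the where n>1 stop condition is omitted
)
SELECT F, n from factorial limit 10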
There are not many RECURSIVE-specific querying guidelines to mention, as far as I know. But I would like to suggest a (rather obvious) step-by-step recursive query building procedure.
Separately build and debug your initial select.
Wrap it in the scaffolding WITH RECURSIVE construct and begin building and debugging your recursive select.
The recommended scaffolding construct is like this:
WITH RECURSIVE rec( <Your column names> ) AS (
<Your ready and working initial SELECT>
UNION ALL
<Recursive SELECT that you are debugging now>
)
SELECT * from rec limit 1000
This simplest outer select will output the whole outer recordset which, as we know, contains all output rows from the initial select and from every execution of the recursive select in the loop, in their original output order, just like in the samples above! The limit 1000 part will prevent hanging, replacing it with oversized output in which you will be able to spot the missed stop point.
After debugging the initial and recursive selects, build and debug your outer select.
And now the last thing to mention: the difference when using union instead of union all in the with recursive clause. It introduces a row uniqueness constraint, which results in two extra lines in our execution pseudocode:
working-recordset = result of Initial-SELECT
discard duplicate rows from working-recordset /*union-specific*/
append working-recordset to empty outer-recordset
while( working-recordset is not empty ) begin
new working-recordset = result of Recursive-SELECT
taking previous working-recordset as pseudo-entity-name
discard duplicate rows and rows that have duplicates in outer-recordset
from working-recordset /*union-specific*/
append working-recordset to outer-recordset
end
overall-result = result of Outer-SELECT
taking outer-recordset as pseudo-entity-name
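This duplicate elimination is what makes union useful for walking graphs that may contain cycles: a row that was already produced is discarded rather than fed back into the working set, so the loop still terminates. A minimal sketch, assuming a hypothetical edges(src,dst) table:
WITH RECURSIVE reachable(node) AS (
  SELECT 1                              -- the start node
  UNION                                 -- union, not union all: already-seen nodes are dropped
  SELECT e.dst from edges e, reachable r where e.src = r.node
)
SELECT node from reachable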

SQL optimisation - Word count in string - Postgresql

I am trying to update a large table (about 1M rows) with the count of words in a field on Postgresql.
This query works, and sets the token_count field counting the words (tokens) in longtext in table my_table:
UPDATE my_table mt SET token_count =
(select count(token) from
(select unnest(regexp_matches(t.longtext, E'\\w+','g')) as token
from my_table as t where mt.myid = t.myid)
as tokens);
myid is the primary key of the table.
\\w+ is necessary because I want to count words, ignoring special characters.
For example, A test . ; ) would return 5 with space-based count, while 2 is the right value.
The issue is that it's horribly slow, and 2 days are not enough to complete it on 1M rows.
What would you do to optimise it? Are there ways to avoid the join?
How can I split the batch into blocks, using for example limit and offset?
Thanks for any tips,
Mulone
UPDATE: I measured the performance of the array_split approach, and the update is going to be slow anyway. So maybe a solution would consist of parallelising it. If I run different queries from psql, only one query works and the others wait for it to finish. How can I parallelise an update?
Have you tried using array_length? (Note that regexp_split_to_array does not accept the 'g' flag; splitting is global by nature.)
UPDATE my_table mt
SET token_count = array_length(regexp_split_to_array(trim(longtext), E'\\W+'), 1)
http://www.postgresql.org/docs/current/static/functions-array.html
# select array_length(regexp_split_to_array(trim(' some long text '), E'\\W+'), 1);
array_length
--------------
3
(1 row)
UPDATE my_table
SET token_count = array_length(regexp_split_to_array(longtext, E'\\s+'), 1)
Or your original query without a correlation
UPDATE my_table
SET token_count = (
select count(*)
from (select unnest(regexp_matches(longtext, E'\\w+','g'))) s
);
Using tsvector and ts_stat to get the statistics of a tsvector column:
SELECT *
FROM ts_stat($$
SELECT to_tsvector(t.longtext)
FROM my_table AS t
$$);
No sample data to try it but it should work.
Sample Data
CREATE TEMP TABLE my_table
AS
SELECT $$A paragraph (from the Ancient Greek παράγραφος paragraphos, "to write beside" or "written beside") is a self-contained unit of a discourse in writing dealing with a particular point or idea. A paragraph consists of one or more sentences.$$::text AS longtext;
SELECT *
FROM ts_stat($$
SELECT to_tsvector(t.longtext)
FROM my_table AS t
$$);
word | ndoc | nentry
--------------+------+--------
παράγραφος | 1 | 1
written | 1 | 1
write | 1 | 2
unit | 1 | 1
sentenc | 1 | 1
self-contain | 1 | 1
self | 1 | 1
point | 1 | 1
particular | 1 | 1
paragrapho | 1 | 1
paragraph | 1 | 2
one | 1 | 1
idea | 1 | 1
greek | 1 | 1
discours | 1 | 1
deal | 1 | 1
contain | 1 | 1
consist | 1 | 1
besid | 1 | 2
ancient | 1 | 1
(20 rows)
Make sure myid is indexed, being the first field in the index.
Consider doing this outside the DB in the first place. Hard to say without benchmarking, but the counting may be more costly than the select+update, so it may be worth it:
Use the COPY command (the BCP equivalent for Postgres) to copy the table data out to a file efficiently in bulk.
Run a simple Perl script to count. One million rows should take from a couple of minutes to an hour for Perl, depending on how slow your IO is.
Use COPY to copy the data back into the DB (possibly into a temp table, then update from that temp table; or better yet, truncate the main table and COPY straight into it if you can afford the downtime).
For both your approach and the last step of my approach #2, update token_count in batches of 5000 rows: limit each UPDATE to 5000 rows and loop, adding WHERE token_count IS NULL to the query, as in the sketch below.
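A hypothetical sketch of one such batch (PostgreSQL; it assumes token_count starts out NULL and the statement is re-run until it updates zero rows):
-- process 5000 not-yet-counted rows per pass
UPDATE my_table mt
SET token_count = array_length(regexp_split_to_array(trim(mt.longtext), E'\\W+'), 1)
WHERE mt.myid IN (
  SELECT myid FROM my_table
  WHERE token_count IS NULL
  LIMIT 5000
);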

Selecting Recent Rows, Optimization (Oracle SQL)

I would appreciate some guidance on the following query. We have a list of experiments and their current progress states (for simplicity, I've reduced the statuses to 4 types, but we have 10 different statuses in our data). I need to eventually return a list of the current status of all non-finished experiments.
Given a table exp_status,
Experiment | ID | Status
----------------------------
A | 1 | Starting
A | 2 | Working On It
B | 3 | Starting
B | 4 | Working On It
B | 5 | Finished Type I
C | 6 | Starting
D | 7 | Starting
D | 8 | Working On It
D | 9 | Finished Type II
E | 10 | Starting
E | 11 | Working On It
F | 12 | Starting
G | 13 | Starting
H | 14 | Starting
H | 15 | Working On It
H | 16 | Finished Type II
Desired Result Set:
Experiment | ID | Status
----------------------------
A | 2 | Working On It
C | 6 | Starting
E | 11 | Working On It
F | 12 | Starting
G | 13 | Starting
The most recent ID number will correspond to the most recent status.
Now, the current code I have executes in 150 seconds.
SELECT *
FROM
(SELECT Experiment, ID, Status,
row_number () over (partition by Experiment
order by ID desc) as rn
FROM exp_status)
WHERE rn = 1
AND status NOT LIKE ('Finished%')
The thing is, this code wastes its time. The result set is 45 thousand rows pulled from a table of 3.9 million, because most experiments are in the finished status. The code goes through and orders all of them, then filters out the finished ones only at the end. About 95% of the experiments in the table are in the finished phase. I could not figure out how to make the query first pick out only the experiments that have no 'Finished' status. I tried the following, but it performed very slowly.
SELECT *
FROM exp_status
WHERE experiment NOT IN
(
SELECT experiment
FROM exp_status
WHERE status LIKE ('Finished%')
)
Any help would be appreciated!
Given your requirement, I think your current query with row_number() is one of the most efficient possible. This query takes time not because it has to sort the data, but because there is so much data to read in the first place (the extra CPU time is negligible compared to the fetch time). Furthermore, the first query does a FULL SCAN, which is really the best way to read lots of data.
You need to find a way to read a lot fewer rows if you want to improve performance. The second query doesn't go in the right direction:
the inner query will likely be a full scan, since the 'Finished' rows will be spread across the whole table and likely represent a big percentage of all rows.
the outer query will also likely be a full scan plus a nice ANTI-HASH JOIN, which should be quicker than 45k * (number of status changes per experiment) non-unique index scans.
So the second query seems to do at least twice the number of reads (plus a join).
If you want to really improve performance, I think you will need a change of design.
You could for instance build a table of active experiments and join to it. You would maintain this table either as a materialized view or with a modification to the code that inserts experiment statuses. You could go further and store the last status in this table. Maintaining this "last status" will likely be an extra burden, but that could be justified by the improved performance; a sketch follows.
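A hypothetical sketch of that last-status table (Oracle; all names here are assumptions, not your schema):
CREATE TABLE exp_current_status (
  experiment VARCHAR2(30) PRIMARY KEY,
  id         NUMBER       NOT NULL,
  status     VARCHAR2(30) NOT NULL
);
-- run alongside each insert into exp_status
MERGE INTO exp_current_status c
USING (SELECT :experiment experiment, :id id, :status status FROM dual) n
ON (c.experiment = n.experiment)
WHEN MATCHED THEN UPDATE SET c.id = n.id, c.status = n.status
WHEN NOT MATCHED THEN INSERT (experiment, id, status) VALUES (n.experiment, n.id, n.status);
-- the report then scans only this small table
SELECT experiment, id, status FROM exp_current_status WHERE status NOT LIKE 'Finished%';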
Consider partitioning your table by status
www.orafaq.com/wiki/Partitioning_FAQ
You could also create materialized views to avoid having to recalculate your aggregations if these types of queries are frequent.
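For instance, a sketch of such a materialized view built from the query in the question (complete refresh on demand; fast refresh is not available for queries with analytic functions, and the view name is an assumption):
CREATE MATERIALIZED VIEW mv_open_experiments
BUILD IMMEDIATE
REFRESH COMPLETE ON DEMAND
AS
SELECT experiment, id, status
FROM (SELECT experiment, id, status,
             row_number() over (partition by experiment order by id desc) rn
      FROM exp_status)
WHERE rn = 1
AND status NOT LIKE 'Finished%';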
Could you provide the execution plans of your queries? Without those it is difficult to know the exact reason it is taking so long.
You can improve your first query slightly by using this variant:
select experiment
, max(id) id
, max(status) keep (dense_rank last order by id) status
from exp_status
group by experiment
having max(status) keep (dense_rank last order by id) not like 'Finished%'
If you compare the plans, you'll notice one step fewer: the keep (dense_rank last order by id) aggregates pick each value from the row with the highest id per experiment, so the separate sort for the analytic function is avoided.
Regards,
Rob.