I have a table, stop_logs, in Hive. When I run an INSERT query for around 6,000 rows, it takes 300 seconds, whereas a plain SELECT query finishes in 6 seconds. Why does the insert take this much time?
CREATE TABLE stop_logs (event STRING, loadId STRING)
STORED AS SEQUENCEFILE;
The following takes 300 seconds:
INSERT INTO TABLE stop_logs
SELECT
i.event, i.loadId
FROM
event_logs i
WHERE
i.stopId IS NOT NULL;
The following query takes 6 seconds:
SELECT
i.event, i.loadId
FROM
event_logs i
WHERE
i.stopId IS NOT NULL;
First, you need to understand how Hive processes your query:
When you run "select * from <tablename>", Hive fetches the whole dataset from the file as a FetchTask rather than a MapReduce job; it just dumps the data as-is without doing anything to it, similar to "hadoop dfs -text". Because no MapReduce job is launched, it runs faster.
When you use "select a, b from <tablename>", Hive requires a MapReduce job, since it needs to extract each column from the rows by parsing the file it loads.
When you use an "insert into table stop_logs select a, b from event_logs" statement, the SELECT runs first, triggering a MapReduce job to parse the columns out of each row. Inserting into the other table (stop_logs) then launches another MapReduce task, which takes the values selected for columns a and b and maps them to columns a and b of stop_logs, respectively, for insertion as new rows. The insert therefore pays for both the scan and the write.
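You can see these stages yourself by prefixing the statement with EXPLAIN, which prints the plan (the MapReduce stage plus the final move/load stage) without running anything:

EXPLAIN
INSERT INTO TABLE stop_logs
SELECT i.event, i.loadId
FROM event_logs i
WHERE i.stopId IS NOT NULL;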
Another cause of slowness: check whether "hive.typecheck.on.insert" is set to true. If it is, values are validated, converted, and normalized to conform to their column types on insert (Hive 0.12.0 onward), which also makes the insert slower than the bare select.
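As a minimal sketch (assuming a Hive CLI or Beeline session, and that you are sure the incoming data already matches the target column types), you can inspect and override the setting per session:

-- Show the current value of the setting
SET hive.typecheck.on.insert;
-- Skip the per-row validation for this session only
SET hive.typecheck.on.insert=false;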
I have a range-partitioned table in my database; it is partitioned by a date column, transaction_date, with one partition per month.
Now my problem is:
When running a SQL statement to read data from the table, such as
select col1,col2 from mytable where ID=1
my table is very large, so it takes a long time for the SQL to finish.
However, there is another ETL job inserting (appending) data into the table at the same time, and the insert operation cannot start until the read SQL finishes.
Are there any suggestions for how I can avoid this issue while reading data? Also, are there any official IBM documents regarding this problem?
** EDIT 1:
$ db2level
DB21085I This instance or install (instance name, where applicable:
"db2inst1") uses "64" bits and Db2 code release "SQL11011" with level
identifier "0202010F".
Informational tokens are "DB2 v11.1.1.1", "s1610100100",
"DYN1610100100AMD64", and Fix Pack "1".
Product is installed at "/opt/ibm/db2/v11.1".
$ db2set -all
[i] DB2COMM=TCPIP
[i] DB2AUTOSTART=TRUE
[i] DB2OPTIONS=+c
[g] DB2FCMCOMM=TCPIP4
[g] DB2SYSTEM=<server hostname>
[g] DB2INSTDEF=db2inst1
** EDIT 2:
For the select and load SQL statement, I am not specifying any isolation level.
The ETL job is an IBM DataStage job; its insert is a bulk load append operation that adds data to a pre-existing range.
You can use the MON_LOCKWAITS administrative view to check what is happening during such a lock wait. Optionally, format the lock name with the MON_FORMAT_LOCK_NAME table function to get more details.
SELECT
W.*
--, F.NAME, F.VALUE
FROM SYSIBMADM.MON_LOCKWAITS W
--, TABLE(MON_FORMAT_LOCK_NAME(W.LOCK_NAME)) F
--WHERE W.REQ_APPLICATION_HANDLE = XXX -- if you know the holder's handle to reduce the amount of data returned
ORDER BY W.LOCK_NAME
;
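Separately, if the ETL step issues SQL INSERTs (note: the LOAD utility takes a super-exclusive table lock that conflicts even with UR readers) and dirty reads are acceptable for this particular query, one sketch of a workaround is to run the reader at the UR isolation level so it takes no row locks and does not block the concurrent append:

SELECT col1, col2
FROM mytable
WHERE ID = 1
WITH UR;  -- uncommitted read: the statement acquires no row locks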
I am supposed to do an incremental load and am using the structure below.
Do the statements execute in sequence, i.e., is TRUNCATE never executed before the first two statements, which read the data?
@newData = EXTRACT ... (FROM FILE STREAM)
@existingData = SELECT * FROM dbo.TableA; // this is an ADLA table
@allData = SELECT * FROM @newData UNION ALL SELECT * FROM @existingData;
TRUNCATE TABLE dbo.TableA;
INSERT INTO dbo.TableA SELECT * FROM @allData;
To be very clear: U-SQL scripts are not executed statement by statement. Instead, the DDL/DML/OUTPUT statements are grouped in order, and the query expressions are just subtrees of the inserts and outputs. First, however, the compiler binds names to their data, so your SELECT from TableA will be bound to the table's data (a kind of lightweight snapshot). Even if the truncate is executed before the select, you should still be able to read the data from TableA (note that permission changes may impact that).
Also, if your script fails during the execution phase, you get atomic execution: if your INSERT fails, the TRUNCATE is undone at the end.
Having said that, why don't you INSERT incrementally and run ALTER TABLE REBUILD periodically, instead of using the above pattern, which reads and rewrites the full table on every load?
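A minimal sketch of that incremental pattern, assuming a CSV source at the hypothetical path /input/daily.csv whose columns match TableA:

// Extract only the new rows (path and schema are placeholders)
@newData =
    EXTRACT Col1 string,
            Col2 int
    FROM "/input/daily.csv"
    USING Extractors.Csv();

// Append just the new rows instead of rewriting the whole table
INSERT INTO dbo.TableA
SELECT * FROM @newData;

Then, in a separate job run periodically, compact the small files that repeated inserts leave behind:

ALTER TABLE dbo.TableA REBUILD;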
I have a query, A, in my MS Access database that takes ~2 seconds to execute. A gives me six fields: Field1, Field2, ..., Field6.
I must append the results of A to a table, T.
I created a query, B, that selects columns from A and inserts them into table T. However, B takes more than 10 minutes to run. Why, and how do I speed up B?
Here is the code for B:
INSERT INTO TrialRuns (Field1, Field2, ..., Field6)
SELECT A.Field1, A.Field2, ..., A.Field6
FROM A;
Try something like this:
INSERT INTO TrialRuns
SELECT * FROM A;
Try:
SELECT A.Field1, A.Field2, ..., A.Field6
INTO TrialRuns
FROM A;
Note that this only works if the TrialRuns table doesn't exist to begin with, so do a DROP TABLE TrialRuns beforehand if it does exist. The SELECT ... INTO should take about as long to run as the initial SELECT statement.
I have created a Hive table and am loading it with data from another table. When I execute the query it starts, but it doesn't produce any results.
CREATE TABLE fact_orders1 (order_number String, created timestamp, last_upd timestamp)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS ORC;
OK Time taken: 0.188 seconds
INSERT OVERWRITE TABLE fact_orders1 SELECT * FROM fact_orders;
Query ID = hadoop_20151230051654_78edfb70-4d41-4fa7-9110-fa9a98d5405d
Total jobs = 1 Launching Job 1 out of 1 Number of reduce tasks is set
to 0 since there's no reduce operator Starting Job =
job_1451392201160_0007, Tracking URL =
http://localhost:8088/proxy/application_1451392201160_0007/ Kill
Command = /home/hadoop/hadoop-2.6.1/bin/hadoop job -kill
job_1451392201160_0007
You get no output from the query because there is no data stored in the source table. I assume you use the default metastore under /user/hive/warehouse, so what you need to do is:
LOAD DATA INPATH '/path/on/hdfs/to/data' OVERWRITE INTO TABLE fact_orders1;
That should work.
Also edit your table creation query, adding a LOCATION clause:
CREATE TABLE fact_orders1 (order_number STRING, created TIMESTAMP, last_upd TIMESTAMP)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS ORC
LOCATION '/user/hive/warehouse/fact_orders1';
If you want to keep the data outside the Hive warehouse, you need to use an external table.
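For example, a minimal sketch of an external table (the HDFS path /data/fact_orders1 is a placeholder; dropping an external table removes only the metadata and leaves the files in place):

CREATE EXTERNAL TABLE fact_orders1_ext (
  order_number STRING,
  created TIMESTAMP,
  last_upd TIMESTAMP
)
STORED AS ORC
LOCATION '/data/fact_orders1';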
I need to insert records into a new table from an existing table. I used the following query to do this:
Insert into Newtable
Select * from Oldtable where date1 = #date
This query works most of the time, but in one scenario there are 10 million records to insert for the given date1 value, and I get the following error message:
Error : The transaction log for database "tempDB" is full. To find out why space in the log cannot be reused, see the log_reuse_wait_desc column in sys.databases
Should I break the query into parts and insert them sequentially, or is there a way to do this with the current query?
This is, perhaps, a distasteful suggestion, but you can try exporting the data to a file and then loading it with BULK INSERT, with the database recovery model set to SIMPLE or BULK_LOGGED.
More information is at http://msdn.microsoft.com/en-us/library/ms190422.aspx.
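A minimal sketch of that approach, assuming the rows were first exported with bcp in native format to the hypothetical file C:\export\olddata.dat and that the database is named MyDb:

-- Switch to minimally logged operations for the duration of the load
ALTER DATABASE MyDb SET RECOVERY BULK_LOGGED;

-- TABLOCK is one of the prerequisites for minimal logging
BULK INSERT Newtable
FROM 'C:\export\olddata.dat'
WITH (DATAFILETYPE = 'native', TABLOCK);

-- Restore the original recovery model (and take a full backup afterwards,
-- since BULK_LOGGED breaks point-in-time recovery for the affected log)
ALTER DATABASE MyDb SET RECOVERY FULL;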