I insert data from an external table into my table, which is range partitioned and has two local indexes.
My case:
I have to insert the records of each flat file in under 60 seconds, because a new file keeps arriving.
A flat file consists of about 5 million records and is roughly 2 GB (in total, about 5 billion records daily). Additionally, I do some sort operations in the SELECT on the external table before the insert.
My environment is Oracle Exadata X5, database version 12.2.
There are many processes inserting into the same table simultaneously, so I cannot use the APPEND hint. I can use PARALLEL and NOLOGGING hints.
I have an .exe that manages this whole process. It gets flat files from the source, combines them if there is more than one flat file, moves the combined file to the proper directory for the external table, and calls a procedure to insert the data from the external table into my table. Lastly, it swaps in the next flat file.
There is one .exe for each different flat file.
Every SELECT from the external table takes 35-40 seconds, but the insert takes too long: 50-60 seconds.
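For reference, the statement has roughly this shape (table, column, and parallel-degree values below are placeholders, not my real names):

INSERT /*+ PARALLEL(t, 8) */ INTO my_part_table t (col1, col2, col3)
SELECT /*+ PARALLEL(ext, 8) */ col1, col2, col3
FROM   my_external_table ext
ORDER  BY col1, col2;   -- the sort I do before the insert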
Can you give me some useful advice?
Related
So, I'm working on a SAP HANA database that has 10 million records in one table, and there are 'n' tables in the DB. The constraints that I'm facing are:
I do not have write access to the db.
The maximum RAM in the system is 6 GB.
Now, I need to extract the data from this table and save it as a CSV, TXT, or Excel file. I tried a SELECT * query; the machine extracts ~700k records before throwing an out-of-memory exception.
I've tried using LIMIT and OFFSET in SAP HANA and it works perfectly, but it takes around 30 minutes for the machine to process ~500k records, so going this route will be very time consuming.
So, I wanted to know whether there is any way to automate the process of selecting 500k records at a time using LIMIT and OFFSET and saving each such sub-file of 500k records automatically as a CSV/TXT file on the system, so that I can run the query and leave the system overnight to extract the data.
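What I have so far is just the paged query itself (schema, table, and key column names below are placeholders); the part I want to automate is advancing the offset and writing each 500k-row result set to its own file:

SELECT *
FROM "MY_SCHEMA"."BIG_TABLE"
ORDER BY "ID"              -- a stable ordering so the pages do not overlap
LIMIT 500000 OFFSET 0;     -- subsequent runs: OFFSET 500000, 1000000, ...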
Let's say I created a Hive table in ORC format and inserted 1M records into it, which created a file with 17 stripes, the last of which is not full.
If I then insert another 100 records into this table, will the new 100 records be appended to the last stripe, or will a new stripe be created?
I tried to test it on an HDFS cluster, and it seems like every time we insert new records, a new file is created (and of course new stripes are created too). I was wondering why?
The reason is that HDFS does not support editing existing files.
So every time we insert data into a Hive table, new files are created.
If you want to merge these files, you can use CONCATENATE:
ALTER TABLE <table_name> CONCATENATE;
(or)
You can also INSERT OVERWRITE the same table that you select from, to merge all the small files into bigger ones:
insert overwrite table <db_table>.<table1> select * from <db_table>.<table1>;
You can also use SORT BY / DISTRIBUTE BY to control the number of files created in the HDFS directory.
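For example, a rough sketch of the overwrite-plus-DISTRIBUTE BY approach (placeholder names; the reducer count is only an illustration, since the number of output files roughly follows the number of reducers):

set mapreduce.job.reduces=4;
insert overwrite table <db_table>.<table1>
select * from <db_table>.<table1>
distribute by <some_column>;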
I am currently setting up a simple NiFi flow that reads from an RDBMS source and writes to a Hive sink. The flow works as expected until the PutHiveQL processor, which is running extremely slowly: it inserts roughly one record per minute.
It is currently set up as a standalone instance running on one node.
The logs show one insert approximately every minute:
(INSERT INTO customer (id, name, address) VALUES (x, x, x))
Any ideas about why this may be? Improvements to try?
Thanks in advance
Inserting one record at a time into Hive will result in extreme slowness.
Since you are doing regular inserts into the Hive table, change your flow to:
QueryDatabaseTable
PutHDFS
Then create a Hive Avro table on top of the HDFS directory where you have stored the data.
(or)
QueryDatabaseTable
ConvertAvroToORC //in case you need to store the data in ORC format
PutHDFS
Then create a Hive ORC table on top of the HDFS directory where you have stored the data.
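For example, a sketch of the DDL for the ORC variant (the HDFS path and columns are placeholders based on the customer example above):

CREATE EXTERNAL TABLE customer_orc (
  id      INT,
  name    STRING,
  address STRING
)
STORED AS ORC
LOCATION '/data/customer_orc';   -- the directory that PutHDFS writes to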
Are you pushing one record at a time? If so, you can use the MergeRecord processor to create batches before pushing them into PutHiveQL.
It is recommended to batch around 100 records:
See here: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-hive-nar/1.5.0/org.apache.nifi.processors.hive.PutHiveQL/
Batch Size | 100 | The preferred number of FlowFiles to put to the database in a single transaction
Use the MergeRecord processor and set the number of records and/or a timeout; it should speed things up considerably.
I'm learning Hadoop/big data technologies. I would like to ingest data in bulk into Hive. I started working with a simple CSV file, and when I tried to use the INSERT command to load it record by record, a single record insertion took around 1 minute. When I put the file into HDFS and then used the LOAD command, it was instantaneous, since it just copies the file into Hive's warehouse. I just want to know what trade-offs one has to face when opting for LOAD instead of INSERT.
Load - Hive does not do any transformation while loading data into tables. Load operations are currently pure copy/move operations that move data files into locations corresponding to Hive tables.
Insert - Query results can be inserted into tables by using the INSERT clause, which in turn runs MapReduce jobs, so it takes some time to execute.
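For illustration, with hypothetical table names the two paths look like this:

LOAD DATA INPATH '/user/hive/staging/customers.csv' INTO TABLE customers_csv;   -- pure file move, near-instant
INSERT INTO TABLE customers_orc SELECT * FROM customers_csv;                    -- launches a MapReduce job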
If you want to optimize/tune the INSERT statements, below are some techniques:
1. Set the execution engine in hive-site.xml to Tez (if it is already installed):
set hive.execution.engine=tez;
2. Use ORC files:
CREATE TABLE A_ORC (
customerID int, name string, age int, address string
) STORED AS ORC tblproperties ("orc.compress" = "SNAPPY");
INSERT INTO TABLE A_ORC SELECT * FROM A;
3. Concurrent job runs in Hive can save overall job running time. To achieve that, the config below needs to be changed in hive-default.xml:
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=<your value>;
For more info, you can visit http://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/
Hope this helps.
I have a .csv file that gets pivoted into 6 million rows during an SSIS package. I have a table in SQL Server 2005 with 25 million+ rows. The .csv file contains data that duplicates data in the table. Is it possible for rows to be updated if they already exist, or what would be the best method to achieve this efficiently?
Comparing 6m rows against 25m rows is not going to be very efficient with a lookup, or with a SQL command data flow component being called for each row to do an upsert. In such cases, it is often most efficient to load the rows quickly into a staging table and use a single set-based SQL command to do the upsert.
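For illustration, assuming a staging table loaded by the data flow and a hypothetical key column Id (MERGE is not available on SQL Server 2005, so the upsert is split into an UPDATE and an INSERT):

UPDATE t
SET    t.Name    = s.Name,
       t.Address = s.Address
FROM   dbo.TargetTable AS t
JOIN   dbo.StagingTable AS s ON s.Id = t.Id;

INSERT INTO dbo.TargetTable (Id, Name, Address)
SELECT s.Id, s.Name, s.Address
FROM   dbo.StagingTable AS s
WHERE  NOT EXISTS (SELECT 1 FROM dbo.TargetTable AS t WHERE t.Id = s.Id);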
Even if you do decide to do the lookup, split the flow into two streams: one which inserts directly, and another which inserts into a staging table for an update operation.
If you don't mind losing the old data (i.e. the latest file is all that matters, not what's in the table), you could delete all the records in the table and insert them again.
You could also load into a temporary table and determine what needs to be updated and what needs to be inserted from there.
You can use the Lookup task to identify any matching rows in the CSV and the table, then pass the output of this to another table or data flow and use a SQL task to perform the required Update.