We run a cron job every 15 minutes which gets some data from somewhere and inserts it into the main table (basically appends), which is then read in production.
To avoid performing reads and writes on the same table, we use two tables (temp and main).
First we fetch the data and insert it into the temp table. Then we rename the temp and main tables (using a third table as an intermediate), so temp becomes main and main becomes temp. Finally we make temp identical to main again (basically truncating temp and inserting everything from main), so in the end both tables hold identical data.
I am not sure if this is correct, but is there a better way of doing this?
We are not performing reads and writes on the same table because the cron can produce bulk writes, and our read performance could be affected while they run.
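For illustration, a minimal sketch of the current flow, assuming a MySQL-style RENAME TABLE (the table names main, main_temp and source_feed are placeholders, and the actual DBMS is not named in the question):

    -- 1. Load the fresh data into the temp table (which is not being read).
    INSERT INTO main_temp SELECT * FROM source_feed;

    -- 2. Swap the two tables in one atomic rename, using a third name as the intermediate.
    RENAME TABLE main TO main_swap, main_temp TO main, main_swap TO main_temp;

    -- 3. Bring the (old) temp table back in sync with the new main table.
    TRUNCATE TABLE main_temp;
    INSERT INTO main_temp SELECT * FROM main;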
I am using MonetDB (MDB) for OLAP queries. I am storing source data in PostgreSQL (PGSQL) and syncing it with MonetDB in batches written in Python.
In PGSQL there is a wide table with an ID (non-unique) and a few columns. Every few seconds a Python script takes a batch of 10k records that changed in PGSQL and uploads them to MDB.
The upload process to MDB is as follows:
Create a staging table in MDB.
Use the COPY command to upload the 10k records into the staging table.
DELETE from the destination table all IDs that are in the staging table.
INSERT into the destination table all rows from the staging table.
So it is basically a DELETE & INSERT. I cannot use a MERGE statement because I do not have a PK: one ID can have multiple rows in the destination. So I need to do a delete and a full insert for all IDs currently being synced.
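Roughly, the per-batch SQL looks something like this (MonetDB syntax; the table and column names and the file path are placeholders):

    -- 1. Fresh staging table for this batch.
    CREATE TABLE staging (id INT, col_a VARCHAR(100), col_b DOUBLE);

    -- 2. Bulk-load the 10k changed rows exported from PostgreSQL.
    COPY 10000 RECORDS INTO staging FROM '/tmp/batch.csv' USING DELIMITERS ',';

    -- 3. Drop every current version of the changed IDs, then re-insert all of them.
    DELETE FROM destination WHERE id IN (SELECT id FROM staging);
    INSERT INTO destination SELECT * FROM staging;

    DROP TABLE staging;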
Now to the problem: the DELETE is slow.
When I DELETE 10k records from a destination table of 25M rows, it takes 500 ms.
However! If I first run a simple SELECT * FROM destination WHERE id = 1 and THEN do the DELETE, it takes 2 ms.
I think it has something to do with the automatic creation of auxiliary indices. But this is where my knowledge ends.
I tried to solve this by "pre-heating", i.e. doing the lookup myself, and it works, but only for the first DELETE after the pre-heat.
Once I do a DELETE and INSERT, the next DELETE is slow again. And pre-heating before each DELETE does not make sense, because the pre-heat itself takes 500 ms.
Is there any way to sync data to MDB without invalidating the auxiliary indices already built? Or to make the DELETE faster without pre-heating? Or should I use a different technique to sync data into MDB without a PK (does MERGE have the same problem)?
Thanks!
I need to perform some calculations using a few columns from a table. This table gets updated every couple of hours and generates duplicates on a couple of columns every other day. There is no way to tell which row was inserted first, which affects my calculations.
Is there a way to copy these rows into a new table automatically as data gets added every couple of hours and perform the calculations on the fly? That way, whatever comes first will be captured in a new table for a dashboard and for other business use cases.
I thought of creating a stored procedure and using a job scheduler to perform this. But I do not have admin access and cannot schedule jobs. Is there another way of doing this efficiently? Much appreciated!
Edit: My request for admin access is being approved.
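For what it's worth, the kind of "keep whichever row arrived first" copy I have in mind would look something like this (generic SQL; capture_table, source_table and the key columns are placeholders):

    -- Copy only rows whose key combination has not been captured yet,
    -- so whichever version arrives first is the one that is kept.
    INSERT INTO capture_table (key_col1, key_col2, value_col)
    SELECT s.key_col1, s.key_col2, s.value_col
    FROM source_table s
    WHERE NOT EXISTS (
        SELECT 1
        FROM capture_table c
        WHERE c.key_col1 = s.key_col1
          AND c.key_col2 = s.key_col2
    );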
Another way, in addition to what is stated in the other answers, is:
Make a temp table.
Make a prod table.
Use a stored procedure to copy everything from the temp table into the prod table after each load is done.
Use the same stored procedure to clean the temp table once the copy is done.
I don't know if this will work for you, but this is in general how we deal with a huge amount of load on a daily basis.
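I don't know your exact schema, but a rough sketch of such a stored procedure (written here in T-SQL, which is an assumption; all object names and columns are placeholders) could look like this:

    CREATE PROCEDURE dbo.MoveTempToProd
    AS
    BEGIN
        SET NOCOUNT ON;

        BEGIN TRANSACTION;

            -- Copy everything the latest load put into the temp table.
            INSERT INTO dbo.ProdTable (id, col_a, col_b, loaded_at)
            SELECT id, col_a, col_b, loaded_at
            FROM dbo.TempTable;

            -- Clean the temp table so the next load starts empty.
            TRUNCATE TABLE dbo.TempTable;

        COMMIT TRANSACTION;
    END;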
I need to alter the size of a column on a large table (millions of rows). It will be changed to nvarchar(n) rather than nvarchar(max), so from what I understand it should not be a long-running change. But since I will be doing this on production, I wanted to understand the ramifications in case it does take long.
Should I just hit F5 from SSMS like I execute normal queries? What happens if my machine crashes? Or goes to sleep? What's the general best practice for doing long-running updates? Should it be scheduled as a job on the server, maybe?
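For context, the change in question would be a single statement along these lines (table, column and length are placeholders):

    -- Shrinking the column from nvarchar(max) to a fixed length.
    ALTER TABLE dbo.BigTable ALTER COLUMN comments nvarchar(500) NULL;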
Thanks
Please DO NOT just hit F5. I did this once and lost all the data in the table. Depending on the change, the update script that is generated for you actually stores the data in memory, drops the table, creates the new one with the change you want, and repopulates the data from memory. However, in my case one of the changes I made was adding a unique constraint, so the repopulation failed, and once the statement finished, the data in memory was dropped. This left me with the new, empty table.
I would create the table you are changing, with the change(s) you want, as a new table. Then SELECT * INTO the new table, then rename the tables in a single statement. If there is potential for data to be entered into the table while this is running and that is an issue, you may want to lock the table.
Depending on the size of the table and the duration of the statement, you may want to save the locking and renaming for later: after the initial population of the new table, do a differential load of the new data and then rename the tables.
Sorry for the long post.
Edit:
Also, if the connection times out due to the duration, run the insert statement locally on the DB server. You could also create a job and run that; however, it is essentially the same thing.
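A condensed sketch of the copy-and-rename approach I described (T-SQL; table and column names are placeholders):

    -- 1. Build the new table with the narrower column and copy the data across.
    SELECT id, CAST(comments AS nvarchar(500)) AS comments, created_at
    INTO dbo.BigTable_new
    FROM dbo.BigTable;

    -- 2. Swap the names in one transaction so readers never see the table missing.
    BEGIN TRANSACTION;
        EXEC sp_rename 'dbo.BigTable', 'BigTable_old';
        EXEC sp_rename 'dbo.BigTable_new', 'BigTable';
    COMMIT TRANSACTION;

Note that SELECT ... INTO does not copy indexes, constraints or triggers, so those would need to be recreated on the new table before the swap.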
I would like to know what will happen if a Hive SELECT and an INSERT OVERWRITE run at the same time. Please help me understand what the Hive query will return in the scenarios below.
Run the SELECT first; while it is running, INSERT OVERWRITE the same table.
Run the INSERT OVERWRITE first; while it is overwriting, pull data from the same table with a SELECT.
Are we going to get the old data, new data, mixed data, nothing, or unpredictable data?
I am using MapR 4.0.1, Hive 0.13.
Best regards,
Ryan
Read Hive Locking:
For a non-partitioned table, the lock modes are pretty intuitive. When the table is being read, a S lock is acquired, whereas an X lock is acquired for all other operations (insert into the table, alter table of any kind etc.)
So SELECT and INSERT OVERWRITE acquire incompatible locks (S and X), so they can never run in parallel. One will acquire its lock first and the other will wait.
For partitioned tables things are a bit more complex, as the locks acquired are hierarchical (S on the table, S/X on the partition). Read the link.
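If you want to see this for yourself, you can inspect the locks held on the table while both statements are running (this assumes the lock manager is enabled, i.e. hive.support.concurrency=true; the table name is a placeholder):

    SHOW LOCKS my_table;           -- lists the shared/exclusive locks held or requested
    SHOW LOCKS my_table EXTENDED;  -- also shows which query/user holds each lock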
I needed to load 100,000 rows of data from an Excel file into a temporary table that I created with "on commit preserve rows". But somehow the most efficient methods did not seem to populate the temporary table, apparently due to session issues.
I used Toad's Import Table Data and it showed that x records were imported. But when I selected from the temp table, it was empty. Then I generated a bunch of insert scripts, saved them in a notepad.sql, called it from the Toad editor using #/script/location/notepad.sql and hit F5. It ran and showed how many records were inserted. Again, the temp table was somehow still empty. So I decided to run a random insert script manually in the editor, and that row showed up in the temp table. I believe the methods that didn't work are not running in the same session?
I haven't tried SQLLDR, but I am assuming it will not work, judging from the methods I tried. Can someone confirm? I can't access SQLLDR, so I won't know.
Is there any way to get this to work? I can't run the insert scripts manually; that would be time-consuming, and Toad can't take that many scripts at the same time.
Oracle temp tables created with ON COMMIT PRESERVE ROWS are session-specific, so the data put into them is only visible within a single session, and for the duration of that session. Toad may be creating a separate session for each window and thus data which is populated from one window/session isn't visible from another window/session. The fact that you can run an insert script and then select the data back suggests this may be the case if both operations were done from the same window. I expect you'd see the same behavior if you used SQL*Loader to load the tables because the load would run in one session and the data would be discarded when the session terminated. Best of luck.
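A quick way to convince yourself of this session behaviour (the table and column names are placeholders):

    -- Session 1
    CREATE GLOBAL TEMPORARY TABLE my_gtt (id NUMBER, val VARCHAR2(50))
        ON COMMIT PRESERVE ROWS;

    INSERT INTO my_gtt VALUES (1, 'loaded here');
    COMMIT;

    SELECT COUNT(*) FROM my_gtt;   -- returns 1 in this session

    -- Session 2 (e.g. a second Toad window / connection)
    SELECT COUNT(*) FROM my_gtt;   -- returns 0: the rows are private to session 1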