Parallelizing insert queries that operate on a large volume of data in PostgreSQL

I have the following query that I would like to cut into smaller queries
and execute in parallel:
insert into users (
    SELECT "user", company, date
    FROM (
        SELECT
            json_each(json -> 'users') AS "user",
            json ->> 'company' AS company,
            date(datetime) AS date
        FROM companies
        WHERE date(datetime) = '2015-05-18'
    ) AS s
);
I could try to do it manually: launch workers that would connect to PostgreSQL, and each worker would take 1000 companies, extract the users, and insert them into the other table. But is it possible to do it in plain SQL?
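For reference, the manual split can be expressed as the same statement run by each worker over a disjoint slice of companies. A minimal sketch, keeping the question's inner query as-is and assuming companies has a numeric primary key id (an assumption, not shown above):
-- Worker k of 4 adds "AND id % 4 = k", so the four inserts cover
-- disjoint sets of companies and can run concurrently.
insert into users (
    SELECT "user", company, date
    FROM (
        SELECT
            json_each(json -> 'users') AS "user",
            json ->> 'company' AS company,
            date(datetime) AS date
        FROM companies
        WHERE date(datetime) = '2015-05-18'
          AND id % 4 = 0   -- worker 0; the other workers use 1, 2 and 3
    ) AS s
);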

Related

Massive Delete statement - How to improve query execution time?

I have a Spring batch that will run every day to:
1. Read CSV files and import them into our database
2. Aggregate this data and save the aggregated data into another table.
We have a table BATCH_LIST that contains information about all the batches that were already executed.
BATCH_LIST has the following columns:
1. BATCH_ID
2. EXECUTION_DATE
3. STATUS
Among the CSV files that are imported, we have one CSV file to feed an APP_USERS table, and another one to feed the ACCOUNTS table.
APP_USERS has the following columns:
1. USER_ID
2. BATCH_ID
-- more columns
ACCOUNTS has the following columns:
1. ACCOUNT_ID
2. BATCH_ID
-- more columns
In step 2, we aggregate data from ACCOUNTS and APP_USERS to insert rows into a USER_ACCOUNT_RELATION table. This table has exactly two columns: ACCOUNT_ID (referring to ACCOUNTS.ACCOUNT_ID) and USER_ID (referring to APP_USERS.USER_ID).
Now we want to add another step to our Spring batch: delete all the data from the USER_ACCOUNT_RELATION table, but also from APP_USERS and ACCOUNTS, that is no longer relevant (i.e. data that was imported before sysdate - 2).
What has been done so far:
Get all the BATCH_ID that we want to remove from the database
SELECT BATCH_ID FROM BATCH_LIST WHERE trunc(EXECUTION_DATE) < sysdate - 2
For each BATCH_ID, we call the following methods:
public void deleteAppUsersByBatchId(Connection connection, long batchId) throws SQLException {
    // prepared statements to delete the user-account relations and the users
}
And here are the two prepared statements:
DELETE FROM USER_ACCOUNT_RELATION
WHERE USER_ID IN (
SELECT USER_ID FROM APP_USERS WHERE BATCH_ID = ?
);
DELETE FROM APP_USERS WHERE BATCH_ID = ?
My issue is that it takes too long to delete data for one BATCH_ID (more than 1 hour).
Note: I only mentioned the APP_USERS, ACCOUNTS, and USER_ACCOUNT_RELATION tables, but I actually have around 25 tables to delete from.
How can I improve the query time?
(I've just tried to change the WHERE USER_ID IN (...) into an EXISTS. It is better, but still way too long.)
If this will be your regular process, i.e. you want to keep only the last 2 days, you don't need indexes, since every time you will delete 1/3 of all rows.
It's better to use just 3 deletes instead of 3*7 separate deletes:
DELETE FROM USER_ACCOUNT_RELATION
WHERE USER_ID IN
(
    SELECT u.ID
    FROM {USER} u
    JOIN {FILE} f
      ON u.FILE_ID = f.FILE_ID
    WHERE trunc(f.IMPORT_DATE) < (sysdate - 2)
);
DELETE FROM {USER}
WHERE FILE_ID in (select FILE_ID from {file} where trunc(IMPORT_DATE) < (sysdate - 2));
DELETE FROM {ACCOUNT}
WHERE FILE_ID in (select FILE_ID from {file} where trunc(IMPORT_DATE) < (sysdate - 2));
Just replace {USER}, {FILE}, {ACCOUNT} with your real table names.
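Mapped onto the tables actually named in the question, that would look roughly like this (a hedged sketch; it assumes BATCH_LIST.EXECUTION_DATE drives the cutoff, as in the SELECT shown above):
DELETE FROM USER_ACCOUNT_RELATION
WHERE USER_ID IN (
    SELECT u.USER_ID
    FROM APP_USERS u
    JOIN BATCH_LIST b ON u.BATCH_ID = b.BATCH_ID
    WHERE trunc(b.EXECUTION_DATE) < sysdate - 2
);

DELETE FROM APP_USERS
WHERE BATCH_ID IN (SELECT BATCH_ID FROM BATCH_LIST WHERE trunc(EXECUTION_DATE) < sysdate - 2);

DELETE FROM ACCOUNTS
WHERE BATCH_ID IN (SELECT BATCH_ID FROM BATCH_LIST WHERE trunc(EXECUTION_DATE) < sysdate - 2);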
Obviously, with the Partitioning option it would be much easier: with daily interval partitioning you could simply drop old partitions.
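For reference, a hedged sketch of what daily interval partitioning could look like (Oracle 11g+; the table and column names here are illustrative, not taken from the question):
CREATE TABLE accounts_part (
    account_id  NUMBER,
    batch_id    NUMBER,
    import_date DATE
)
PARTITION BY RANGE (import_date)
INTERVAL (NUMTODSINTERVAL(1, 'DAY'))
(
    PARTITION p_initial VALUES LESS THAN (DATE '2020-01-01')
);

-- Removing a day of data then becomes a metadata operation, e.g.:
-- ALTER TABLE accounts_part DROP PARTITION FOR (DATE '2020-01-05');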
But even in your case there is another, more difficult but really fast, solution: "partition views". For example, for ACCOUNT you can create 3 different tables ACCOUNT_1, ACCOUNT_2 and ACCOUNT_3, then create a partition view:
create view ACCOUNT as
select 1 table_id, a1.* from ACCOUNT_1 a1
union all
select 2 table_id, a2.* from ACCOUNT_2 a2
union all
select 3 table_id, a3.* from ACCOUNT_3 a3;
Then you can use an INSTEAD OF trigger on this view to insert each day's data into its own table: the first day into ACCOUNT_1, the second into ACCOUNT_2, etc., and truncate the oldest table each midnight. You can easily get the current table name using:
select 'ACCOUNT_' || (mod(to_number(to_char(sysdate, 'j')), 3) + 1) tab_name from dual;
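And a hedged sketch of the INSTEAD OF trigger itself (Oracle PL/SQL; the column list is illustrative and only shows the ACCOUNT_ID and BATCH_ID columns mentioned in the question):
CREATE OR REPLACE TRIGGER account_ins_trg
INSTEAD OF INSERT ON account
FOR EACH ROW
DECLARE
    -- same day-number expression as above: 1, 2 or 3
    v_tab PLS_INTEGER := MOD(TO_NUMBER(TO_CHAR(SYSDATE, 'j')), 3) + 1;
BEGIN
    IF v_tab = 1 THEN
        INSERT INTO account_1 (account_id, batch_id) VALUES (:NEW.account_id, :NEW.batch_id);
    ELSIF v_tab = 2 THEN
        INSERT INTO account_2 (account_id, batch_id) VALUES (:NEW.account_id, :NEW.batch_id);
    ELSE
        INSERT INTO account_3 (account_id, batch_id) VALUES (:NEW.account_id, :NEW.batch_id);
    END IF;
END;
/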

Inserting multiple records in database table using PK from another table

I have a DB2 table "organization" which holds organization data, including the following columns:
organization_id (PK), name, description
Some organizations were deleted, so a lot of "organization_id" values (i.e. rows) no longer exist; the IDs are not continuous like 1, 2, 3, 4, 5... but more like 1, 2, 5, 7, 11, 12, 21...
Then there is another table "title" with some other data, and it has organization_id from the organization table in it as a FK.
Now there is some data which I have to insert for all organizations: a title that is going to be shown for all of them in the web app.
In total there are approximately 3000 records to be added.
If I did it one by one, it would look like this:
INSERT INTO title
(
name,
organization_id,
datetime_added,
added_by,
special_fl,
title_type_id
)
VALUES
(
'This is new title',
XXXX,
CURRENT TIMESTAMP,
1,
1,
1
);
where XXXX represents the "organization_id" which I should get from the "organization" table, so that the insert is done only for existing organization_id values.
So only "organization_id" changes, matching the "organization_id" from the "organization" table.
What would be best way to do it?
I checked several similar questions but none of them seems to match this one:
SQL Server 2008 Insert with WHILE LOOP
The while-loop answer iterates over continuous IDs, and the other answer also assumes that the ID is auto-incremented.
Same here:
How to use a SQL for loop to insert rows into database?
Not sure about this one (as the question itself is not quite clear):
Inserting a multiple records in a table with while loop
Any advice on this? How should I do it?
If you seriously want a row in title for every organization record, with the exact same data, something like this should work:
INSERT INTO title
(
name,
organization_id,
datetime_added,
added_by,
special_fl,
title_type_id
)
SELECT
'This is new title' as name,
o.organization_id,
CURRENT TIMESTAMP as datetime_added,
1 as added_by,
1 as special_fl,
1 as title_type_id
FROM
organization o
;
You shouldn't need the column aliases in the SELECT, but I am including them for readability and good measure.
https://www.ibm.com/support/knowledgecenter/ssw_i5_54/sqlp/rbafymultrow.htm
And for good measure, in case your process errors out or whatever... you can also do something like this to only insert a record in title if that organization_id and title combination does not already exist:
INSERT INTO title
(
name,
organization_id,
datetime_added,
added_by,
special_fl,
title_type_id
)
SELECT
'This is new title' as name,
o.organization_id,
CURRENT TIMESTAMP as datetime_added,
1 as added_by,
1 as special_fl,
1 as title_type_id
FROM
organization o
LEFT JOIN Title t
ON o.organization_id = t.organization_id
AND t.name = 'This is new title'
WHERE
t.organization_id IS NULL
;
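As an aside, if your DB2 version supports MERGE, the same "insert only if missing" logic can be written as a single statement. A hedged sketch, assuming the same table and column names as above:
MERGE INTO title t
USING (SELECT organization_id FROM organization) AS o
    ON t.organization_id = o.organization_id
   AND t.name = 'This is new title'
WHEN NOT MATCHED THEN
    INSERT (name, organization_id, datetime_added, added_by, special_fl, title_type_id)
    VALUES ('This is new title', o.organization_id, CURRENT TIMESTAMP, 1, 1, 1);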

Hive query reading a list of constants from a text file?

I want to extract data for a list of userids I am interested in. If the list is short, I can type the query directly:
SELECT * FROM mytable WHERE userid IN (100, 101, 102);
(This is an example; the query might be more complex.) But the list of userids might be long and available as a text file:
100
101
102
How can I run the same query with Hive reading from userids.txt directly?
One way is to put the data in another table and INNER JOIN to it, so that there has to be a match for the record to go through:
Create the table: CREATE TABLE users (userid INT);
Load the data file: LOAD DATA LOCAL INPATH 'userids.txt' INTO TABLE users;
Filter through the inner join: SELECT mytable.* FROM mytable INNER JOIN users ON mytable.userid = users.userid;
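If you only need the filter and no columns from users, Hive also supports LEFT SEMI JOIN, which behaves like an IN/EXISTS check and returns each mytable row at most once even if userids.txt contains duplicates. A hedged sketch using the same users table:
SELECT mytable.*
FROM mytable
LEFT SEMI JOIN users ON (mytable.userid = users.userid);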

Rails SQL - create buckets, and get counts of records in each bucket

I have a Jobs table with a column for salary.
How can I bucket the salaries into groups of $10,000, and then get a count of how many jobs are in each bucket?
An answer that uses Rails active record is preferable, but given the difficulty I'll accept raw SQL answers as well.
Starting Data
Jobs
id salary (integer)
-----------------
1 93,530
2 72,400
3 120,403
4 193,001
...
Result Data
bucket job_count
----------------------------
$0 - $9,999 0
$10,000 - $19,999 0
$20,000 - $29,999 3
$30,000 - $39,999 5
$40,000 - $49,999 12
Here is another SQL-based solution.
Obtain bucket for each salary like this:
select
FLOOR(salary/10000) as bucket
from jobs
Use GROUP BY to do the counting:
select
bucket,
count(*)
from (
select FLOOR(salary/10000) as bucket
from jobs
) as t1
GROUP BY bucket
Finally, show the ranges instead of the bucket number:
select
CONCAT('$', FORMAT(bucket*10000,0), ' - $', FORMAT((bucket+1)*10000-1,0)) as salary_range,
job_count
from (
select bucket, count(*) as job_count
from (
select FLOOR(salary/10000) as bucket
from jobs
) as t1
GROUP BY bucket
) as t2
Note that the functions used are for MySQL. YMMV.
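One caveat: a plain GROUP BY only returns buckets that contain at least one job, while the expected output above also lists empty ranges with a count of 0. A hedged sketch for MySQL 8+ (which supports recursive CTEs; the upper bound of 20 buckets is an assumption) that generates the bucket numbers first and LEFT JOINs the jobs onto them:
WITH RECURSIVE buckets (bucket) AS (
    SELECT 0
    UNION ALL
    SELECT bucket + 1 FROM buckets WHERE bucket < 19   -- buckets 0..19 cover $0 - $199,999
)
SELECT
    CONCAT('$', FORMAT(b.bucket * 10000, 0), ' - $', FORMAT((b.bucket + 1) * 10000 - 1, 0)) AS salary_range,
    COUNT(j.id) AS job_count
FROM buckets b
LEFT JOIN jobs j ON FLOOR(j.salary / 10000) = b.bucket
GROUP BY b.bucket
ORDER BY b.bucket;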
From an SQL perspective there are several approaches. Here is one.
-- First, create a report table (temporarily here, but could be permanent):
create table salary_report (
bucket integer,
lower_limit integer,
upper_limit integer,
job_count integer
);
-- populate the table
-- note this could (and probably should) be automated, not hardcoded
insert into salary_report values (1,00000,09999,0);
insert into salary_report values (2,10000,19999,0);
insert into salary_report values (3,20000,29999,0);
insert into salary_report values (4,30000,39999,0);
insert into salary_report values (5,40000,49999,0);
-- set correct counts
update salary_report as sr
set job_count = (
select count(*)
from jobs as j
where j.salary between sr.lower_limit and sr.upper_limit
);
-- finally, access the data (through activerecord?)
-- note: not formatted as dollar amounts
select concat( sr.lower_limit,' - ',sr.upper_limit) as range, job_count, bucket
from salary_report
order by bucket;
-- drop table if required
drop table salary_report;
I have tried to keep the SQL generic, but exact syntax may vary depending upon your RDBMS.
No SQL Fiddle provided because it seems to be broken today.

Normalizing a table, from one to the other

I'm trying to normalize a MySQL database...
I currently have a table that contains 11 columns for "categories". The first column is a user_id and the other 10 are category_id_1 - category_id_10. Some rows may only contain a category_id in category_id_1, and the rest might be NULL.
I then have a table that has 2 columns, user_id and category_id...
What is the best way to transfer all of the data into separate rows in table 2 without adding a row for columns that are NULL in table 1?
thanks!
You can create a single query to do all the work; it just takes a bit of copying and pasting and adjusting the column names:
INSERT INTO table2
SELECT * FROM (
SELECT user_id, category_id_1 AS category_id FROM table1
UNION ALL
SELECT user_id, category_id_2 FROM table1
UNION ALL
SELECT user_id, category_id_3 FROM table1
) AS T
WHERE category_id IS NOT NULL;
Since you only have to do this 10 times, and you can throw the code away when you are finished, I would think that this is the easiest way.
One table for users:
users(id, name, username, etc)
One for categories:
categories(id, category_name)
One to link the two, including any extra information you might want on that join.
categories_users(user_id, category_id)
-- or with extra information --
categories_users(user_id, category_id, date_created, notes)
To transfer the data across to the link table would be a case of writing a series of SQL INSERT statements. There's probably some awesome way to do it in one go, but since there are only 10 category columns, just copy-and-paste IMO:
INSERT INTO categories_users
SELECT user_id, category_id_1
FROM old_categories
WHERE category_id_1 IS NOT NULL;