Can Hive realize batch select? - hive

I have a list of ids and a Hive table. I want to select all the records whose ids are in the list. Currently, I select the records by iterating over the id list and running one query per id, which is very slow. Does Hive support selecting the records for a whole batch of ids in a single query?

You can construct a single query and use IN:
select t.*
from t
where id in (id1, id2, . . .);
Or, load the ids into a table and use a join:
select t.*
from t join
ids
on t.id = ids.id;
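A minimal sketch of both approaches, using Python's built-in sqlite3 module as a stand-in for Hive (an assumption made only so the example runs anywhere). The table t, the ids table, the sample rows, and the helper names are all hypothetical; the SQL has the same shape as the two queries above.

```python
import sqlite3

def select_with_in(conn, ids):
    """Fetch all rows of t whose id is in `ids` with a single IN query."""
    if not ids:
        return []
    placeholders = ", ".join("?" for _ in ids)
    sql = f"SELECT t.* FROM t WHERE id IN ({placeholders}) ORDER BY id"
    return conn.execute(sql, list(ids)).fetchall()

def select_with_join(conn, ids):
    """Load the ids into a (temporary) table, then fetch matching rows with a join."""
    conn.execute("CREATE TEMP TABLE IF NOT EXISTS ids (id INTEGER PRIMARY KEY)")
    conn.execute("DELETE FROM ids")
    conn.executemany("INSERT OR IGNORE INTO ids (id) VALUES (?)", [(i,) for i in ids])
    sql = "SELECT t.* FROM t JOIN ids ON t.id = ids.id ORDER BY t.id"
    return conn.execute(sql).fetchall()

# Hypothetical sample data
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c"), (4, "d")])

wanted = [2, 4, 99]  # 99 has no row in t
rows_in = select_with_in(conn, wanted)
rows_join = select_with_join(conn, wanted)
print(rows_in)
print(rows_join)
```

Either way the table is queried once instead of once per id. For a very long id list, the join form avoids building a huge query string.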

Related

postgres jsonb_object_keys distinct or group by extremely slow

Database version is PostgreSQL 11.16
My table has 424868 records with a JSON field. When I do:
SELECT jsonb_object_keys(raw_json) FROM table;
It returns a result within a second. I need to remove duplicate keys, but when I do:
SELECT DISTINCT jsonb_object_keys(raw_json) FROM table;
My database CPU goes up to 100% and it takes 15 minutes to get a result. I tried a solution with GROUP BY:
select array_agg(json_keys),id from (
select jsonb_object_keys(raw_json) as json_keys, id from table) a group by a.id
Same result.
For debugging I did this:
select count(*) from (SELECT jsonb_object_keys(raw_json) as k from table) test
and it returns me 41633935 keys

Redshift Upsert where staging has duplicate items

I have a Redshift database set up that stores posts. Posts are defined as unique by their post_id, and all other fields can be variable.
I am using a staging table to do an equivalent UPSERT using the following query:
BEGIN;
CREATE TABLE posts_staging (LIKE posts);
COPY posts_staging (post_id,user_id,timestamp,votes,comments) FROM 's3://posts' CREDENTIALS 'aws_access_key_id=xxxx;aws_secret_access_key=yyyy' CSV;
DELETE FROM posts USING posts_staging WHERE posts.post_id = posts_staging.post_id;
INSERT INTO posts SELECT DISTINCT * FROM posts_staging;
DROP TABLE posts_staging;
END;
Most of the time this works correctly, however I am noticing some duplicate values going into the table. I believe what is happening is that there is a possibility that the CSV uploaded has duplicate post_ids, but with different other fields (for example, differing numbers of likes), meaning the DISTINCT is inserting multiple of the same post_id. Is there a way to modify this query to only INSERT unique post_ids?
Redshift, alas, does not support DISTINCT ON. But you can use ROW_NUMBER():
INSERT INTO posts
SELECT . . .
FROM (SELECT ps.*, ROW_NUMBER() OVER (PARTITION BY post_id ORDER BY post_id) as seqnum
FROM posts_staging ps
) ps
WHERE seqnum = 1;
You will need to list out the columns being inserted.
The problem is with the following query. DISTINCT * only removes rows that are identical in every column, so the same post_id can still come through more than once:
INSERT INTO posts SELECT DISTINCT * FROM posts_staging;
You should remove the duplicates from posts_staging before the upsert.
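A small sketch of the ROW_NUMBER() deduplication, run against SQLite through Python's sqlite3 module as a stand-in for Redshift (an assumption; window functions need SQLite 3.25 or newer). The reduced column list, the sample rows, and the choice to keep the row with the most votes are hypothetical, and SQLite has no DELETE ... USING, so the delete is written with IN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (post_id INTEGER, user_id INTEGER, votes INTEGER);
    CREATE TABLE posts_staging (post_id INTEGER, user_id INTEGER, votes INTEGER);
    INSERT INTO posts VALUES (1, 10, 5), (2, 20, 7);
    -- post_id 1 appears twice with different vote counts
    INSERT INTO posts_staging VALUES (1, 10, 6), (1, 10, 8), (3, 30, 1);
""")

# DISTINCT * keeps both staging rows for post_id 1 because their votes differ
distinct_rows = conn.execute("SELECT DISTINCT * FROM posts_staging").fetchall()

with conn:  # one transaction: delete the old versions, insert one row per post_id
    conn.execute("DELETE FROM posts WHERE post_id IN (SELECT post_id FROM posts_staging)")
    conn.execute("""
        INSERT INTO posts (post_id, user_id, votes)
        SELECT post_id, user_id, votes
        FROM (SELECT ps.*,
                     ROW_NUMBER() OVER (PARTITION BY post_id ORDER BY votes DESC) AS seqnum
              FROM posts_staging ps) ranked
        WHERE seqnum = 1
    """)

posts = conn.execute("SELECT post_id, user_id, votes FROM posts ORDER BY post_id").fetchall()
print(len(distinct_rows))
print(posts)
```

The ORDER BY inside the window decides which duplicate survives. Ordering by a column such as votes or a timestamp makes that choice deterministic, whereas ordering by post_id itself picks an arbitrary row from each group.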

Oracle Query on 24 tables with same columns

I have 24 tables (table1, table2, table3, ...) with the same columns, used to keep track of customer data on an hourly basis, plus one rate table that records which rate applies to a specific hour. rateId is a foreign key in all 24 tables. I need a dynamic query to fetch data from those tables by date and time. Can anyone provide an example or guide me toward that query?
You should not store the same data in 24 different tables. Partitioning (mentioned in a comment) is a very good solution when you have lots and lots of data and want to split it for performance reasons.
In any case, one way to structure your query is:
select t.*
from ((select * from table1) union all
(select * from table2) union all
. . .
(select * from table24)
) t
where <whatever you want>
You can then join this to whatever other tables you like (using rateId, for instance), filter on the fields, or whatever.
If you need to know the table where something came from, then you can get this as well:
select t.*
from ((select t.*, 1 as which from table1 t) union all
(select t.*, 2 as which from table2 t) union all
. . .
(select t.*, 24 as which from table24 t)
) t
where <whatever you want>
Note: I am using * here because the OP explicitly states that the tables have the same format. Even so, it is probably a good idea to list all the columns in each subquery.
EDIT:
As Bill suggests in the comment, you might want to turn this into a view. That way, you can write lots of queries on the tables, without worrying about the detailed tables. (And, better yet, you can fix the data structure by combining the tables, then change the view, and existing queries will work).
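A rough sketch of the UNION ALL view idea, using Python's sqlite3 module as a stand-in for Oracle (an assumption). Only three tables are created instead of 24, and the column names (customer_id, usage), the rate table's price column, the view name all_hours, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Three hypothetical hourly tables stand in for table1 ... table24
for n in (1, 2, 3):
    conn.execute(f"CREATE TABLE table{n} (customer_id INTEGER, usage REAL, rateId INTEGER)")
conn.execute("CREATE TABLE rate (rateId INTEGER PRIMARY KEY, price REAL)")
conn.executemany("INSERT INTO rate VALUES (?, ?)", [(1, 0.5), (2, 2.0)])
conn.execute("INSERT INTO table1 VALUES (100, 10.0, 1)")
conn.execute("INSERT INTO table2 VALUES (100, 4.0, 2)")
conn.execute("INSERT INTO table3 VALUES (200, 6.0, 1)")

# Build the UNION ALL once, tagging each branch with its source table,
# and hide it behind a view so later queries never mention the detail tables
branches = " UNION ALL ".join(
    f"SELECT customer_id, usage, rateId, {n} AS which FROM table{n}" for n in (1, 2, 3)
)
conn.execute(f"CREATE VIEW all_hours AS {branches}")

# Queries now filter and join against the view as if it were one table
rows = conn.execute("""
    SELECT a.customer_id, a.which, a.usage * r.price AS cost
    FROM all_hours a JOIN rate r ON r.rateId = a.rateId
    WHERE a.customer_id = ?
    ORDER BY a.which
""", (100,)).fetchall()
print(rows)
```

If the 24 tables are later merged into one, only the view definition has to change.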

SQL: how do you look for missing ids?

Suppose I have a table with lots of rows identified by a unique ID. Now I have a (rather large) user-input list of ids (not a table) that I want to check are already in the database.
So I want to output the ids that are in my list, but not in the table. How do I do that with SQL?
EDIT: I know I can do that with a temporary table, but I'd really like to avoid that if possible.
EDIT: Same comment for using an external programming language.
Try with this:
SELECT t1.id FROM your_list t1
LEFT JOIN your_table t2
ON t1.id = t2.id
WHERE t2.id IS NULL
It is hardly possible to write a single, pure, general SQL query for this task, because it requires working with a list, which is not a relational concept, and the standard set of list operations is too limited. Some DBMSs do let you write a single SQL query, but it will rely on that DBMS's SQL dialect and will not be portable.
You haven't mentioned:
which RDBMS will be used;
what is the source of the IDs.
So I will assume PostgreSQL is used and that the IDs to be checked are loaded into a (temporary) table.
Consider the following:
CREATE TABLE test (id integer, value char(1));
INSERT INTO test VALUES (1,'1'), (2,'2'), (3,'3');
CREATE TABLE temp_table (id integer);
INSERT INTO temp_table VALUES (1),(5),(10);
You can get your results like this:
SELECT * FROM temp_table WHERE NOT EXISTS (
SELECT id FROM test WHERE id = temp_table.id);
or
SELECT * FROM temp_table WHERE id NOT IN (SELECT id FROM test);
or
SELECT * FROM temp_table LEFT JOIN test USING (id) WHERE test.id IS NULL;
You can pick any option; depending on your data volumes, they may perform differently.
Just a note: some RDBMSs limit the number of expressions listed literally inside an IN() construct, so keep this in mind (I hit this several times with Oracle).
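As a quick check, the three variants can be run side by side on the sample data above. This is only a sketch using Python's sqlite3 module in place of PostgreSQL (an assumption); the dictionary keys are hypothetical labels. It also shows one behavior worth knowing: if the subquery of NOT IN returns a NULL, NOT IN yields no rows at all, while the other two forms are unaffected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE test (id INTEGER, value TEXT);
    INSERT INTO test VALUES (1, '1'), (2, '2'), (3, '3');
    CREATE TABLE temp_table (id INTEGER);
    INSERT INTO temp_table VALUES (1), (5), (10);
""")

queries = {
    "not_exists": """SELECT id FROM temp_table
                     WHERE NOT EXISTS (SELECT 1 FROM test WHERE test.id = temp_table.id)
                     ORDER BY id""",
    "not_in": "SELECT id FROM temp_table WHERE id NOT IN (SELECT id FROM test) ORDER BY id",
    "left_join": """SELECT temp_table.id FROM temp_table
                    LEFT JOIN test ON test.id = temp_table.id
                    WHERE test.id IS NULL
                    ORDER BY temp_table.id""",
}

def run_all():
    """Run every variant and collect the missing ids it reports."""
    return {name: [row[0] for row in conn.execute(sql)] for name, sql in queries.items()}

results = run_all()
print(results)

# A single NULL id in test makes every NOT IN comparison unknown,
# so that variant stops returning rows
conn.execute("INSERT INTO test VALUES (NULL, 'x')")
results_with_null = run_all()
print(results_with_null)
```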
EDIT: To meet the constraints of no temp tables and no external languages, you can use the following construct:
SELECT DISTINCT b.id
FROM test a RIGHT JOIN (
SELECT 1 id UNION ALL
SELECT 5 UNION ALL
SELECT 10) b ON a.id=b.id
WHERE a.id IS NULL;
Unfortunately, you'll have to generate lots of SELECT x UNION ALL entries to build a single-column, many-row table here. I use UNION ALL to avoid an unnecessary sorting step.

SQL select from data in query where this data is not already in the database?

I want to check my database for records that I already have recorded before making a web service call.
Here is what I imagine the query to look like, I just can't seem to figure out the syntax.
SELECT *
FROM (1,2,3,4) as temp_table
WHERE temp_table.id
LEFT JOIN table ON id IS NULL
Is there a way to do this? What is a query like this called?
I want to pass a list of ids to MySQL and have it return the ids that are not already in the database.
Use:
SELECT x.id
FROM (SELECT @param_1 AS id
FROM DUAL
UNION ALL
SELECT @param_2
FROM DUAL
UNION ALL
SELECT @param_3
FROM DUAL
UNION ALL
SELECT @param_4
FROM DUAL) x
LEFT JOIN TABLE t ON t.id = x.id
WHERE t.id IS NULL -- keep only the ids that found no match in the table
If you need to support a varying number of parameters, you can either use:
a temporary table to populate & join to
MySQL's Prepared Statements to dynamically construct the UNION ALL statement
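One possible way to build the UNION ALL for a varying number of ids while still binding the values as parameters is sketched below. It uses Python's sqlite3 module as a stand-in for MySQL (an assumption): the `?` placeholder is sqlite3's style and MySQL drivers typically use a different one, and the items table and the missing_ids helper are hypothetical.

```python
import sqlite3

def missing_ids(conn, ids):
    """Return the ids from `ids` that have no row in the (hypothetical) items table."""
    if not ids:
        return []
    # One "SELECT ?" per id, glued together with UNION ALL.
    # Only placeholders go into the SQL text; the values themselves are bound.
    # Note: SQLite caps the number of UNION ALL terms (500 by default).
    derived = " UNION ALL ".join(["SELECT ? AS id"] + ["SELECT ?"] * (len(ids) - 1))
    sql = f"""
        SELECT x.id
        FROM ({derived}) x
        LEFT JOIN items t ON t.id = x.id
        WHERE t.id IS NULL
        ORDER BY x.id
    """
    return [row[0] for row in conn.execute(sql, list(ids))]

# Hypothetical sample data
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO items VALUES (?)", [(1,), (2,), (3,)])

print(missing_ids(conn, [1, 2, 5, 7]))
```

Because the ids never become part of the SQL string, a user-supplied list cannot change the structure of the query.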
To confirm I've understood correctly, you want to pass in a list of numbers and see which of those numbers isn't present in the existing table? In effect:
SELECT Item
FROM IDList I
LEFT JOIN TABLE T ON I.Item=T.ID
WHERE T.ID IS NULL
You look like you're OK with building this query on the fly, in which case you can do this with a numbers / tally table by changing the above into
SELECT Number
FROM (SELECT Number FROM Numbers WHERE Number IN (1,2,3,4)) I
LEFT JOIN TABLE T ON I.Number=T.ID
WHERE T.ID IS NULL
This is relatively prone to SQL Injection attacks though because of the way the query is being built. It'd be better if you could pass in '1,2,3,4' as a string and split it into sections to generate your numbers list to join against in a safer way - for an example of how to do that, see http://www.sqlteam.com/article/parsing-csv-values-into-multiple-rows
All of this presumes you've got a numbers / tally table in your database, but they're sufficiently useful in general that I'd strongly recommend you do.
SELECT * FROM table where id NOT IN (1,2,3,4)
I would probably just do:
SELECT id
FROM table
WHERE id IN (1,2,3,4);
And then process the list of results, removing any returned by the query from your list of "records to submit".
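That approach might look like the sketch below: one IN query fetches the ids that already exist, and the difference is computed in application code. Python's sqlite3 module stands in for MySQL here (an assumption), and the records table and the ids_to_submit helper are hypothetical.

```python
import sqlite3

def ids_to_submit(conn, candidate_ids):
    """Return the candidate ids that are not stored yet, preserving input order."""
    if not candidate_ids:
        return []
    placeholders = ", ".join("?" for _ in candidate_ids)
    sql = f"SELECT id FROM records WHERE id IN ({placeholders})"
    # Ids the database already knows about
    existing = {row[0] for row in conn.execute(sql, list(candidate_ids))}
    # Everything else still has to be fetched from the web service
    return [i for i in candidate_ids if i not in existing]

# Hypothetical sample data
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO records VALUES (?)", [(2,), (3,)])

print(ids_to_submit(conn, [1, 2, 3, 4]))
```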
How about a nested query? This may work. If not, it may get you in the right direction.
SELECT * FROM table WHERE id NOT IN (
SELECT id FROM table WHERE 1
);