BigQuery - Append missing records from one table to another - google-bigquery

I have two tables - 'todays_data' and 'full_data' - with the same schema (Id string, Name string, Age string). Records in 'todays_data' may or may not already exist in 'full_data'. I need to identify the new records (new Id values) in 'todays_data' and append them to 'full_data' (Id is the reference key). How can I achieve this using 1) a Web UI SQL statement and 2) the bq command?

Below is a query you should run with the full_data table set as the Destination Table and "Append to table" as the Write Preference:
SELECT id, name, age
FROM todays_data
WHERE NOT id IN (
SELECT id
FROM full_data
GROUP BY id
)
See Storing results in a permanent table for more on how to do this in the Web UI and from the command line.
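For the bq command-line part, a minimal sketch of the same append (the dataset name mydataset is an assumption; --destination_table and --append_table are standard bq query flags):
bq query \
--destination_table=mydataset.full_data \
--append_table \
--use_legacy_sql=false \
'SELECT id, name, age
FROM mydataset.todays_data
WHERE id NOT IN (SELECT id FROM mydataset.full_data)'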

Related

Create virtual table with rowid only of another table

Suppose I have a table in sqlite as follows:
`name` `age`
"bob" 20 (rowid=1)
"tom" 30 (rowid=2)
"alice" 19 (rowid=3)
And I want to store the result of the following table using minimal storage space:
SELECT * FROM mytable WHERE name < 'm' ORDER BY age
How can I store a virtual table from this result set that will just give me the ordered result set? In other words, how can I store the rowids in order (in the above it would be 3, 1) without saving all the data into a separate table?
For example, if I stored this information with just the rowid in a sorted order:
CREATE TABLE vtable AS
SELECT rowid from mytable WHERE name < 'm' ORDER BY age;
Then I believe every time I would need to query the vtable I would have to join it back to the original table using the rowid. Is there a way to do this so that the vtable "knows" the content that it has based on the external table (I believe this is referred to as external-content when creating an fts index -- https://sqlite.org/fts5.html#external_content_tables).
"I believe this is referred to as external-content when creating an fts index."
No, a virtual table is created using CREATE VIRTUAL TABLE ... USING module_name(module_parameters).
Virtual tables are tables that can call a module, thus the USING module_name(module_parameters) is mandatory.
For FTS (Full Text Search) you would have to read the documentation, but it could be something like:
CREATE VIRTUAL TABLE IF NOT EXISTS bible_fts USING FTS3(book, chapter INTEGER, verse INTEGER, content TEXT)
You very likely don't need/want a VIRTUAL table.
CREATE TABLE vtable AS SELECT rowid from mytable WHERE name < 'm' ORDER BY age;
would create a normal, persistent table (if it didn't already exist). If you wanted to use it, it would probably only be of use by joining it back to mytable. Effectively it gives you a snapshot, but at a cost of at least 4k for every snapshot.
I'd suggest a single table for all snapshots with two columns: a snapshot identifier and the rowid of the snapshotted row. This would probably consume far less space.
Basic Example
Consider :-
CREATE TABLE IF NOT EXISTS mytable (
id INTEGER PRIMARY KEY, /* NOTE not using an alias of the rowid may present issues as the id's can change */
name TEXT,
age INTEGER
);
CREATE TABLE IF NOT EXISTS snapshot (id TEXT DEFAULT CURRENT_TIMESTAMP, mytable_map);
INSERT INTO mytable (name,age) VALUES('Mary',21),('George',22);
INSERT INTO snapshot (mytable_map) SELECT id FROM mytable;
SELECT snapshot.id,name,age FROM snapshot JOIN mytable ON mytable.id = snapshot.mytable_map;
If the above is run 3 times with a reasonable interval (at least a second, so that the snapshot ids, i.e. the timestamps, are distinct),
then you would get 3 snapshots, each with the same value in the id column for all of its rows: the first with 2 rows, the 2nd with 4 and the last with 6 (since each run adds 2 rows to mytable).
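To read back a single snapshot, the same join just needs a filter on the snapshot id; the timestamp value below is a placeholder:
SELECT name, age
FROM snapshot JOIN mytable ON mytable.id = snapshot.mytable_map
WHERE snapshot.id = '2021-01-01 12:00:00';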

Merge update records in a final table

I have a user table in Hive of the form:
User:
Id String,
Name String,
Col1 String,
UpdateTimestamp Timestamp
I'm inserting data in this table from a file which has the following format:
I/U,Timestamp when record was written to file, Id, Name, Col1, UpdateTimestamp
e.g. for inserting a user with Id 1:
I,2019-08-21 14:18:41.002947,1,Bob,stuff,123456
and updating col1 for the same user with Id 1:
U,2019-08-21 14:18:45.000000,1,,updatedstuff,123457
The columns which are not updated are returned as null.
Now, simple insertion is easy in Hive using LOAD DATA INPATH into a staging table and then ignoring the first two fields from the staging table.
However, how would I go about the update statements? So that my final row in hive looks like below:
1,Bob,updatedstuff,123457
I was thinking to insert all rows in a staging table and then perform some sort of merge query. Any ideas?
Typically with a merge statement your "file" would still be unique on ID and the merge statement would determine whether it needs to insert this as a new record, or update values from that record.
However, if the file is non-negotiable and will always have the I/U format, you could break the process up into two steps, the insert, then the updates, as you suggested.
In order to perform updates in Hive, you will need the users table to be stored as ORC and have ACID enabled on your cluster. For my example, I would create the users table with a cluster key, and the transactional table property:
create table test.orc_acid_example_users
(
id int
,name string
,col1 string
,updatetimestamp timestamp
)
clustered by (id) into 5 buckets
stored as ORC
tblproperties('transactional'='true');
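Enabling ACID on the cluster (mentioned above) usually comes down to settings along these lines; the exact properties vary by Hive version, so treat this as a sketch rather than a complete list:
set hive.support.concurrency=true;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
-- compaction is configured on the metastore side
set hive.compactor.initiator.on=true;
set hive.compactor.worker.threads=1;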
After your insert statements, your Bob record would show "stuff" in col1.
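For reference, that insert step might look roughly like this (a sketch; the staging table's type column and field names are assumed from the merge statement below):
insert into test.orc_acid_example_users
select s.id, s.name, s.col1, s.updatetimestamp
from test.orc_acid_example_staging s
where s.type = 'I';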
As far as the updates - you could tackle these with an update or merge statement. I think the key here is the null values: it's important to keep the original name, or col1, or whatever, if the staging table from the file has a null value. Here's a merge example which coalesces the staging table's fields. Basically, if there is a value in the staging table, take that; otherwise fall back to the original value.
merge into test.orc_acid_example_users as t
using test.orc_acid_example_staging as s
on t.id = s.id
and s.type = 'U'
when matched
then update set name = coalesce(s.name,t.name), col1 = coalesce(s.col1, t.col1)
Now Bob will show "updatedstuff"
Quick disclaimer - if you have more than one update for Bob in the staging table, things will get messy. You will need to have a pre-processing step to get the latest non-null values of all the updates prior to doing the update/merge. Hive isn't really a complete transactional DB - it would be preferred for the source to send full user records any time there's an update, instead of just the changed fields only.
You can reconstruct each record in the table using last_value() with the ignore-nulls option:
select h.id,
coalesce(h.name, last_value(h.name, true) over (partition by h.id order by h.timestamp)) as name,
coalesce(h.col1, last_value(h.col1, true) over (partition by h.id order by h.timestamp)) as col1,
update_timestamp
from history h;
You can use row_number() and a subquery if you want the most recent record.
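A sketch of that variant, reusing the reconstruction query above (table and column names as assumed there):
select id, name, col1, update_timestamp
from (
select h.id,
coalesce(h.name, last_value(h.name, true) over (partition by h.id order by h.timestamp)) as name,
coalesce(h.col1, last_value(h.col1, true) over (partition by h.id order by h.timestamp)) as col1,
h.update_timestamp as update_timestamp,
row_number() over (partition by h.id order by h.timestamp desc) as rn
from history h
) t
where t.rn = 1;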

How to retrieve ambiguous columns from a hive table using subquery?

Main table:
CREATE EXTERNAL TABLE user(language STRING,snapshot_time STRING,products STRUCT<id:STRING,name:STRING>,item STRUCT<quantity:ARRAY<STRUCT<name:STRING>>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/user/input/sample';
I'm trying to insert the ambiguous column names "products.name" and "A.name" into the user_prod_info table. Since the column names are the same, I'm getting an "Ambiguous column reference" error in q.
Insert command:
INSERT OVERWRITE TABLE user_prod_info
SELECT q.* FROM (
SELECT row_number() OVER (PARTITION BY products.id ORDER BY snapshot_time DESC) AS temp_row_num,
language,
snapshot_time,
products.id,
products.name,
A.name
FROM user as raw
LATERAL VIEW EXPLODE(item.quantity) quantity as A
) q WHERE temp_row_num == 1;
This command is unable to retrieve the field from the specific struct because we have two "name" fields: one is in "products" and the other is in "A".
I tried creating an alias, "A.name as name1". I'm able to insert the data without errors, but one record is being stored as 3 rows with some nulls in it.
I'm stuck here. Can anyone please help me out with this?
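For reference, the aliased version described above might look roughly like this; the alias names and the explicit column list are assumptions, and this sketch does not by itself address the duplicate rows produced by the explode:
INSERT OVERWRITE TABLE user_prod_info
SELECT q.language, q.snapshot_time, q.id, q.prod_name, q.item_name
FROM (
SELECT row_number() OVER (PARTITION BY products.id ORDER BY snapshot_time DESC) AS temp_row_num,
language,
snapshot_time,
products.id AS id,
products.name AS prod_name,
A.name AS item_name
FROM user
LATERAL VIEW EXPLODE(item.quantity) quantity AS A
) q WHERE q.temp_row_num = 1;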

Hive - Create Table statement with 'select query' and 'fields terminated by' commands

I want to create a table in Hive using a select statement which takes a subset of a data from another table. I used the following query to do so :
create table sample_db.out_table as
select * from sample_db.in_table where country = 'Canada';
When I looked into the HDFS location of this table, there are no field separators.
But I need to create a table with filtered data from another table along with a field separator. For example I am trying to do something like :
create table sample_db.out_table as
select * from sample_db.in_table where country = 'Canada'
ROW FORMAT SERDE
FIELDS TERMINATED BY '|';
This is not working, though. I know the alternative is to create a table structure with field names and the FIELDS TERMINATED BY '|' clause and then load the data.
But is there any other way to combine the two into a single query, so that I can create a table with filtered data from another table and with a field separator?
Put ROW FORMAT DELIMITED ... in front of the AS SELECT.
Do it like this (change the query to match yours):
hive> CREATE TABLE ttt row format delimited fields terminated by '|' AS select *,count(1) from t1 group by id ,name ;
Query ID = root_20180702153737_37802c0e-525a-4b00-b8ec-9fac4a6d895b
here is the result
[root@hadoop1 ~]# hadoop fs -cat /user/hive/warehouse/ttt/**
2|\N|1
3|\N|1
4|\N|1
As you can see in the documentation, when using the CTAS (Create Table As Select) statement, the ROW FORMAT statement (in fact, all the settings related to the new table) goes before the SELECT statement.
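Applied to the tables from the question, a sketch would be:
create table sample_db.out_table
row format delimited fields terminated by '|'
stored as textfile
as
select * from sample_db.in_table where country = 'Canada';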

How to insert generated id into a results table

I have the following query
SELECT q.pol_id
FROM quot q
,fgn_clm_hist fch
WHERE q.quot_id = fch.quot_id
UNION
SELECT q.pol_id
FROM tdb2wccu.quot q
WHERE q.nr_prr_ls_yr_cov IS NOT NULL
For every row in that result set, I want to create a new row in another table (call it table1) and update pol_id in the quot table (from the above result set) with the generated primary key from the inserted row in table1.
table1 has two columns. id and timestamp.
I'm using db2 10.1.
I've tried numerous things and have been unsuccessful for quite a while. Thanks!
Simple solution: create a new table for the result set of your query, which has an identity column in it. Then, after running your query, update the pol_id field with the newly generated ID in your result table.
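A sketch of that approach in DB2 (the table name query_result and the pol_id type are placeholders; identity values are generated automatically when the column is omitted from the insert list):
CREATE TABLE query_result (
new_id BIGINT GENERATED ALWAYS AS IDENTITY,
pol_id INTEGER,
ts TIMESTAMP NOT NULL WITH DEFAULT CURRENT TIMESTAMP
);
INSERT INTO query_result (pol_id)
<the UNION query from the question>;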
Alternatively, you can do it more manually by using the ROW_NUMBER() OLAP function, which I have often found convenient for creating IDs. For this it is convenient to use a stored procedure which does the following:
get the maximum old id from Table1 and write it into a variable old_max_id;
after generating the result set, write the row numbers into Table1, maybe by something like:
INSERT INTO TABLE1
SELECT ROW_NUMBER() OVER (PARTITION BY <primary-key> ORDER BY <whatever-you-want>)
+ OLD_MAX_ID
, CURRENT TIMESTAMP
FROM (<here comes your SQL query>)
Either write the result set into a table or return a cursor to it. Here you should either use the same ROW_NUMBER statement as above or directly use the ID from Table1.