I'm trying to understand how to update a column in a Hive table, based on an id match with a different table.
That is, I have a table 'users', with columns 'UID' (string), 'isVerified' (boolean) and a lot more columns. Then I have a second table 'users_verified', with just 1 column 'UID' (string). I'm trying to do something to the effect of
UPDATE users SET isVerified = 1
WHERE UID in (SELECT UID from users_verified);
However, neither this nor UPDATE ON JOIN queries seem to be supported by Hive, and it seems I need to use an INSERT OVERWRITE statement instead. Can anyone give me an example of how that might work?
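A minimal sketch of the usual INSERT OVERWRITE pattern, under some assumptions: every row of users is rewritten, and isVerified is computed from a LEFT JOIN against users_verified. It is run through Python's sqlite3 module only so that it can be executed anywhere. SQLite has no INSERT OVERWRITE, so the result goes into a hypothetical users_rewritten table, and users is cut down to two columns for brevity. In Hive, the same SELECT (listing every column of users in table order) should be usable after INSERT OVERWRITE TABLE users.

```python
import sqlite3

# Core of the rewrite. In Hive this SELECT would be prefixed with
#   INSERT OVERWRITE TABLE users
# and would list every column of users in table order
# (assumption: only two columns are shown here).
REWRITE_SELECT = """
SELECT u.UID,
       CASE WHEN v.UID IS NOT NULL THEN 1 ELSE u.isVerified END AS isVerified
FROM users u
LEFT JOIN (SELECT DISTINCT UID FROM users_verified) v
       ON u.UID = v.UID
"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (UID TEXT, isVerified INTEGER);
CREATE TABLE users_verified (UID TEXT);
INSERT INTO users VALUES ('a', 0), ('b', 0), ('c', 1);
INSERT INTO users_verified VALUES ('a'), ('a');  -- duplicate on purpose
""")

# SQLite stand-in for the overwrite: materialize the rewritten rows
# in a hypothetical users_rewritten table.
conn.execute("CREATE TABLE users_rewritten AS " + REWRITE_SELECT)
rewritten = conn.execute(
    "SELECT UID, isVerified FROM users_rewritten ORDER BY UID"
).fetchall()
print(rewritten)
```

The DISTINCT in the joined subquery keeps duplicate UIDs in users_verified from multiplying rows in the rewritten table, and the ELSE branch leaves already-verified users untouched.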
This might be simple, but I am having trouble resolving this issue.
I have a table with following columns and data :
There are multiple entries for an id with different updateTimestamp values. I want to identify the max timestamp and repeat that value for all the duplicate ids in the table. Also, the table is large and I do not want to query it multiple times (the process is complex).
Here is what I am expecting the output to look like
You need to find the maximum update timestamp for each ID and then update the table by matching on the ID column. Something like:
update <table_name> tgt
set tgt.max_updatetimestamp = src.maxstamp
from (select id, max(updatetimestamp) as maxstamp from <table_name> group by id) src
where tgt.id = src.id;
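If the goal is only to show the max timestamp next to every row, a window function reads the table once and avoids the self-join. Below is a rough sketch run through Python's sqlite3 module (window functions need SQLite 3.25 or newer). The table name events and the sample rows are made up for illustration, since the original data is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical table and sample rows, for illustration only.
CREATE TABLE events (id INTEGER, updatetimestamp TEXT);
INSERT INTO events VALUES
  (1, '2020-01-01 10:00:00'),
  (1, '2020-01-03 09:30:00'),
  (2, '2020-01-02 08:00:00');
""")

# One pass over the table: every row carries the max timestamp of its id.
rows = conn.execute("""
SELECT id,
       updatetimestamp,
       MAX(updatetimestamp) OVER (PARTITION BY id) AS max_updatetimestamp
FROM events
ORDER BY id, updatetimestamp
""").fetchall()

for row in rows:
    print(row)
```

The same MAX(...) OVER (PARTITION BY id) expression is standard SQL, so it should carry over to most engines that support window functions.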
I have a user table in Hive of the form:
User:
Id String,
Name String,
Col1 String,
UpdateTimestamp Timestamp
I'm inserting data in this table from a file which has the following format:
I/U,Timestamp when record was written to file, Id, Name, Col1, UpdateTimestamp
e.g. for inserting a user with Id 1:
I,2019-08-21 14:18:41.002947,1,Bob,stuff,123456
and updating col1 for the same user with Id 1:
U,2019-08-21 14:18:45.000000,1,,updatedstuff,123457
The columns which are not updated are returned as null.
Simple insertion is easy in Hive: use LOAD DATA INPATH to load the file into a staging table, then ignore the first two fields when selecting from the staging table.
However, how would I go about the update statements? So that my final row in hive looks like below:
1,Bob,updatedstuff,123457
I was thinking to insert all rows in a staging table and then perform some sort of merge query. Any ideas?
Typically with a merge statement your "file" would still be unique on ID and the merge statement would determine whether it needs to insert this as a new record, or update values from that record.
However, if the file is non-negotiable and will always have the I/U format, you could break the process up into two steps, the insert, then the updates, as you suggested.
In order to perform updates in Hive, you will need the users table to be stored as ORC and have ACID enabled on your cluster. For my example, I would create the users table with a cluster key, and the transactional table property:
create table test.orc_acid_example_users
(
id int
,name string
,col1 string
,updatetimestamp timestamp
)
clustered by (id) into 5 buckets
stored as ORC
tblproperties('transactional'='true');
After your insert statements, your Bob record would say "stuff" in col1:
As for the updates, you could tackle these with an update or merge statement. I think the key here is the null values. It's important to keep the original name, or col1, or whatever, if the staging table from the file has a null value. Here's a merge example which coalesces the staging table's fields. Basically, if there is a value in the staging table, take that, or else fall back to the original value.
merge into test.orc_acid_example_users as t
using test.orc_acid_example_staging as s
on t.id = s.id
and s.type = 'U'
when matched
then update set
name = coalesce(s.name, t.name),
col1 = coalesce(s.col1, t.col1),
updatetimestamp = coalesce(s.updatetimestamp, t.updatetimestamp);
Now Bob will show "updatedstuff"
Quick disclaimer - if you have more than one update for Bob in the staging table, things will get messy. You will need to have a pre-processing step to get the latest non-null values of all the updates prior to doing the update/merge. Hive isn't really a complete transactional DB - it would be preferred for the source to send full user records any time there's an update, instead of just the changed fields only.
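One way to picture that pre-processing step is a small fold over the change records before they reach Hive. This is only a sketch, assuming the file has already been parsed into dictionaries; the function name collapse_changes and the key names are hypothetical. It applies the same rule as the coalesce in the merge: a non-null incoming value wins, a null keeps what was there.

```python
def collapse_changes(records):
    """Fold I/U records into one row per id, keeping the latest non-null values."""
    fields = ("name", "col1", "updatetimestamp")
    latest = {}
    # Apply records in the order they were written to the file.
    for rec in sorted(records, key=lambda r: r["written_at"]):
        row = latest.setdefault(rec["id"], dict.fromkeys(fields))
        for field in fields:
            # Same rule as coalesce(s.field, t.field): null means "unchanged".
            if rec[field] is not None:
                row[field] = rec[field]
    return latest


# The two example lines from the question, parsed by hand.
records = [
    {"type": "I", "written_at": "2019-08-21 14:18:41.002947", "id": "1",
     "name": "Bob", "col1": "stuff", "updatetimestamp": 123456},
    {"type": "U", "written_at": "2019-08-21 14:18:45.000000", "id": "1",
     "name": None, "col1": "updatedstuff", "updatetimestamp": 123457},
]

collapsed = collapse_changes(records)
print(collapsed["1"])
```

With one collapsed row per id, the merge never sees two competing updates for the same user.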
You can reconstruct each record in the table using last_value() with the ignore-nulls option:
select h.id,
coalesce(h.name, last_value(h.name, true) over (partition by h.id order by h.timestamp)) as name,
coalesce(h.col1, last_value(h.col1, true) over (partition by h.id order by h.timestamp)) as col1,
h.update_timestamp
from history h;
You can use row_number() and a subquery if you want the most recent record.
I am trying to update the column "efficiency" in the table "SUS_WK" with the data from the column "custom_number_8" from the table "SC_PROD". But I only want it to update if certain requirements are met, such as the "ID" from table "SUS_WK" matches the "ID" from the table "SC_PROD".
How can I do this?
I have tried to do this:
UPDATE SUS_WK
SET efficiency = SC_PROD.custom_number_8
FROM SUS_WK t
JOIN SC_PROD p
ON t.id = p.id
When I tried the code above, I get the following error:
The multi-part identifier "SC_PROD_PLAN_PLND.custom_number_8" could not be bound.
But I expect the result of that code to update the column "efficiency" in the "SUS_WK" with the data from column "custom_number_8" in the table "SC_PROD".
You are on the right track. Just use the table alias rather than the table name:
UPDATE t
SET efficiency = p.custom_number_8
FROM SUS_WK t JOIN
SC_PROD p
ON t.id = p.id;
I strongly recommend using the table alias for the UPDATE as well. SQL Server will resolve the table name to be the same as t, but depending on that makes the query rather hard to decipher (because references to the same table have different aliases).
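For engines that do not accept UPDATE ... FROM, the same update can usually be written with a correlated subquery. The sketch below is not SQL Server syntax being tested; it runs on Python's sqlite3 module with made-up sample rows, and it assumes id is unique in SC_PROD (otherwise the subquery could match more than one row).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Sample rows are hypothetical; column types are assumed.
CREATE TABLE SUS_WK (id INTEGER, efficiency REAL);
CREATE TABLE SC_PROD (id INTEGER, custom_number_8 REAL);
INSERT INTO SUS_WK VALUES (1, NULL), (2, 0.5);
INSERT INTO SC_PROD VALUES (1, 0.9);
""")

# Portable form: the subquery supplies the value, and EXISTS restricts the
# update to rows that actually have a match in SC_PROD.
conn.execute("""
UPDATE SUS_WK
SET efficiency = (SELECT p.custom_number_8
                  FROM SC_PROD p
                  WHERE p.id = SUS_WK.id)
WHERE EXISTS (SELECT 1 FROM SC_PROD p WHERE p.id = SUS_WK.id)
""")

result = conn.execute("SELECT id, efficiency FROM SUS_WK ORDER BY id").fetchall()
print(result)
```

Without the WHERE EXISTS clause, rows with no match in SC_PROD would have efficiency set to NULL, which is why the row with id 2 keeps its value here.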
I have a table with a 'user_id' column. Within that same table I have another data field labeled 'GMID'. Within that GMID column there are some fields that are null, the ones that aren't null have values that match the user_id data field within that row. Is there a way to create a script or query that will update all null fields in the GMID column to match the corresponding data values in the user_id row within that row? Are there any best practices I should follow, different approaches? Thanks in advance for anyone that can help.
Of course there is:
UPDATE your_table
SET GMID = user_id
WHERE GMID IS NULL;
You don't even need the WHERE clause if GMID should always be the same as user_id.
By the way, why do you need two columns with the same data in one table?
Another approach would be to use the 'coalesce' function, which returns its first non-null argument. This approach does not change any data in your table. In a query you can write 'select coalesce(GMID, user_id) as GMID ...' and it will return GMID when it is not null, and user_id otherwise.
Documentation for oracle DB:
http://docs.oracle.com/cd/B28359_01/server.111/b28286/functions023.htm
Update: I just reversed the name of the columns inside the coalesce function...
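To compare the two approaches side by side, here is a small sketch using Python's sqlite3 module with made-up rows (the integer column types are an assumption). The coalesce query leaves the table untouched, while the UPDATE changes only the rows where GMID is null.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical sample rows: two of the three have a null GMID.
CREATE TABLE your_table (user_id INTEGER, GMID INTEGER);
INSERT INTO your_table VALUES (10, 10), (11, NULL), (12, NULL);
""")

# Approach 1: read-only, fill the gap at query time.
via_coalesce = conn.execute("""
SELECT user_id, COALESCE(GMID, user_id) AS GMID
FROM your_table
ORDER BY user_id
""").fetchall()

# Approach 2: fix the stored data, touching only the null rows.
cursor = conn.execute("UPDATE your_table SET GMID = user_id WHERE GMID IS NULL")
updated_rows = cursor.rowcount

via_update = conn.execute(
    "SELECT user_id, GMID FROM your_table ORDER BY user_id"
).fetchall()

print(via_coalesce, updated_rows, via_update)
```

Both approaches return the same rows; which one fits depends on whether the stored data itself needs to change.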
I have a sql table that I am trying to add a column from another table to. Only when I execute the alter table query it does not pull the values out of the table to match the column where I am trying to make the connection.
For example I have column A from table 1 and column A from table 2, they are supposed to coincide. ColumnATable1 being an identification number and ColumnATable2 being the description.
I tried this but got an error...
alter table dbo.CommittedTbl
add V_VendorName nvarchar(200)
where v_venkey = v_vendorno
It tells me that I have incorrect syntax... Anyone know how to accomplish this?
alter table dbo.CommittedTbl
add V_VendorName nvarchar(200);
go
update c
set c.V_VendorName = a.V_VendorName
from CommittedTbl c
join TableA a
on c.v_venkey = a.v_vendorno;
go
I'm just guessing at your structure here.
alter table table2 add column_A <some_type>;
update table2 set column_A = (select t1.column_A from table1 t1 where table2.v_venkey = t1.v_vendorno);
Your names for tables and columns are a bit confusing but I think that should do it.
There is no WHERE clause for an ALTER TABLE statement. You will need to add the column (your first two lines), and then update the rows based upon a relationship you define between the two tables.
ALTER TABLE syntax:
http://msdn.microsoft.com/en-us/library/ms190273%28v=sql.90%29.aspx
There are several languages within SQL:
DDL: Data Definition Language. This defines the schema (the structure of tables, columns, data types). Adding a column to a table changes the table definition, so all rows get the new column (not just the rows that match some criterion).
DML: Data Manipulation Language. This affects the data within a table. Inserting, updating and other data changes fall into this category, and you can change only some of the data according to criteria (this is where a WHERE clause comes in).
ALTER is a DDL statement, while INSERT and UPDATE are DML statements.
The two cannot really be mixed as you are doing.
You should ALTER your table to add the column, then INSERT or UPDATE the column to include appropriate data.
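The two steps can be sketched as follows, using Python's sqlite3 module so it runs anywhere. TableA and the sample rows are guesses at the structure, as in the answer above, and SQLite's ADD COLUMN and TEXT stand in for the SQL Server ADD and nvarchar(200).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical structure and sample rows.
CREATE TABLE CommittedTbl (v_venkey INTEGER);
CREATE TABLE TableA (v_vendorno INTEGER, V_VendorName TEXT);
INSERT INTO CommittedTbl VALUES (1), (2);
INSERT INTO TableA VALUES (1, 'Acme');
""")

# Step 1 (DDL): add the column. Every existing row gets it, filled with NULL.
conn.execute("ALTER TABLE CommittedTbl ADD COLUMN V_VendorName TEXT")
after_alter = conn.execute(
    "SELECT v_venkey, V_VendorName FROM CommittedTbl ORDER BY v_venkey"
).fetchall()

# Step 2 (DML): fill the column from the other table using the key relationship.
conn.execute("""
UPDATE CommittedTbl
SET V_VendorName = (SELECT a.V_VendorName
                    FROM TableA a
                    WHERE a.v_vendorno = CommittedTbl.v_venkey)
""")
after_update = conn.execute(
    "SELECT v_venkey, V_VendorName FROM CommittedTbl ORDER BY v_venkey"
).fetchall()

print(after_alter)
print(after_update)
```

Rows with no matching vendor simply stay NULL after the second step.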
Is it possible that you want a JOIN query instead? If you want to join two tables or parts of two tables you should use JOIN.
have a look at this for a start if you need to know more LINK
hope that helps!