Joining streaming data with table data and updating the table as the stream is received, is it possible? - apache-spark-sql

I am using spark-sql 2.4.1, spark-cassandra-connector_2.11-2.4.1.jar and Java 8.
I have a scenario where I need to join streaming data with C*/Cassandra table data.
If a matching record is found, I need to copy the existing C* table record to another table (table_bkp) and update the actual C* table record with the latest data.
I need to perform this as the streaming data comes in.
Can this be done using spark-sql streaming?
If so, how do I do it? Any caveats to take care of?
For each batch, how do I fetch the C* table data freshly?
What am I doing wrong here?
I have two tables as below, "master_table" & "backup_table":
create table kspace.master_table(
statement_id int,
statement_flag text,
statement_date date,
x_val double,
y_val double,
z_val double,
PRIMARY KEY (( statement_id ), statement_date)
) WITH CLUSTERING ORDER BY ( statement_date DESC );
create table kspace.backup_table(
statement_id int,
statement_flag text,
statement_date date,
x_val double,
y_val double,
z_val double,
backup_timestamp timestamp,
PRIMARY KEY ((statement_id ), statement_date, backup_timestamp )
) WITH CLUSTERING ORDER BY ( statement_date DESC, backup_timestamp DESC);
Each streaming record has a "statement_flag", which might be "I" or "U".
If a record with "I" comes in, we directly insert it into "master_table".
If a record with "U" comes in, we need to check whether there is already a record for the given (statement_id), statement_date in "master_table".
If there is a record in "master_table", copy it to "backup_table" with the current timestamp, i.e. backup_timestamp, and
update the record in "master_table" with the latest record.
To achieve the above I am doing a PoC/code like below:
Dataset<Row> baseDs = //streaming data from topic
Dataset<Row> i_records = baseDs.filter(col("statement_flag").equalTo("I"));
Dataset<Row> u_records = baseDs.filter(col("statement_flag").equalTo("U"));
String keyspace="kspace";
String master_table = "master_table";
String backup_table = "backup_table";
Dataset<Row> cassandraMasterTableDs = getCassandraTableData(sparkSession, keyspace , master_table);
writeDfToCassandra( baseDs.toDF(), keyspace, master_table);
u_records.createOrReplaceTempView("u_records");
cassandraMasterTableDs.createOrReplaceTempView("persisted_records");
Dataset<Row> joinUpdatedRecordsDs = sparkSession.sql(
" select p.statement_id, p.statement_flag, p.statement_date,"
+ "p.x_val,p.y_val,p.z_val "
+ " from persisted_records as p "
+ "join u_records as u "
+ "on p.statement_id = u.statement_id and p.statement_date = u.statement_date");
Dataset<Row> updated_records = joinUpdatedRecordsDs
.withColumn("backup_timestamp",current_timestamp());
updated_records.show(); //Showing correct results
writeDfToCassandra( updated_records.toDF(), keyspace, backup_table); // But here backup_table ends up with the latest "master_table" records
Sample data
For the first record with the "I" flag: (screenshots of master_table and backup_table omitted)
For the second record with the "U" flag, i.e. the same as earlier except for the "y_val" column: (screenshots of master_table and backup_table omitted)
Expected vs. actual table data: (screenshots omitted)
Question:
Up to the show() call, the dataframe (updated_records) shows the correct data.
But when I insert the same dataframe (updated_records) into the table, the C* backup_table data is exactly the same as the latest master_table record, whereas it is supposed to contain the earlier master_table record.
updated_records.show(); //Showing correct results
writeDfToCassandra( updated_records.toDF(), keyspace, backup_table); // But here backup_table ends up with the latest "master_table" records
So what am I doing wrong in the above code?

There are several ways to do this, with various levels of performance depending on how much data you need to check.
For example, if you are only looking up data by partition key, the most efficient approach is to use joinWithCassandraTable on the DStream. For every batch this will extract only the records matching the incoming partition keys. In Structured Streaming this would happen automatically with a correctly written SQL join and DSE; if DSE is not in use, the table will be fully scanned on each batch.
If instead you require the whole table for each batch, joining the DStream batch with a CassandraRDD will cause the RDD to be re-read completely on every batch. This is much more expensive if the entire table is not being re-written.
If you are only updating records without checking their previous values, it is sufficient to just write the incoming data directly to the C* table. C* uses upserts and last-write-wins behavior, and will simply overwrite the previous values if they existed.
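To illustrate the last point, a minimal CQL sketch (reusing the master_table definition from the question) of the upsert / last-write-wins behavior; no read-before-write is needed:
-- two inserts with the same primary key (statement_id, statement_date)
INSERT INTO kspace.master_table (statement_id, statement_flag, statement_date, x_val, y_val, z_val)
VALUES (1, 'I', '2019-09-09', 1.0, 2.0, 3.0);
INSERT INTO kspace.master_table (statement_id, statement_flag, statement_date, x_val, y_val, z_val)
VALUES (1, 'U', '2019-09-09', 1.0, 9.0, 3.0);
-- only one row remains, holding the values of the second (latest) write
SELECT * FROM kspace.master_table WHERE statement_id = 1;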

Related

How to specify a non-temporal column as the “primary key” in a partitioned table?

In DolphinDB, it seems I cannot specify a non-temporal column for the sortColumns of createPartitionedTable with the TSDB storage engine. Is there any other way to specify a non-temporal column as the "primary key" in a partitioned table?
My table has 4 columns:
a temporal column "DateTime"
2 columns of IDs: "id_key" and "id_partition", where "id_partition"
is the partitioning column
a column "factor" holding the factor values
(A screenshot shows records from the partition with id_partition=1.) This partition has 20,000 records, and each record has a unique id_key. I want to make id_key the "primary key" of the table.
In other words, when I write a new record into the table, insert it if the value of its id_key does not already exist; otherwise update the existing record. Then order by the DateTime column.
For the TSDB storage engine, the use case you described is not supported, for the following reason:
When creating a partitioned table with TSDB, you set the parameter sortColumns to specify the column(s) used to sort the table. The last sort column must be a temporal column; the remaining sort columns are combined into a "sort key", which serves as the index for sorting. As sortColumns must contain a temporal column, this does not match your requirement.
It is recommended that you use the OLAP storage engine instead and write data to the table through upsert!:
dbName = "dfs://test_1123"
tbName = "test_1123"
if(existsDatabase(dbName)){
dropDatabase(dbName)
}
//use OLAP storage engine, partition table on id_partition. no sortColumns
db = database(dbName, VALUE, `client01`client02)
colNames = `DateTime`id_key`id_partition`factor
colTypes = [DATETIME, LONG, SYMBOL, DOUBLE]
schemaTable = table(1:0, colNames, colTypes)
db.createPartitionedTable(table=schemaTable, tableName=tbName, partitionColumns=`id_partition)
//prepare test data for one partition with id_partition=1. id_key ranges from 1-20000. Insert data into the partitioned table.
data = table(2022.11.18T00:00:00 + 1..20000 as DateTime, take(1..20000, 20000) as id_key, take(`1, 20000) as id_partition, 10.5 - round(rand(1.0, 20000), 2) as factor)
pt = loadTable(dbName, tbName)
pt.upsert!(newData=data, ignoreNull=false, keyColNames=`id_key, sortColumns=`id_partition)
//use upsert! to insert a new record: id_partition=1, id_key=1, DateTime is later than the existing records.
inputOne = table(2022.11.18T00:00:00 + 30000 as DateTime, 1 as id_key, `1 as id_partition, 10.0 as factor)
pt.upsert!(newData=inputOne, ignoreNull=false, keyColNames=`id_key, sortColumns=`DateTime)
Result: (screenshot omitted; the existing row with id_key=1 is updated.)
Note:
upsert! inserts rows into a table if the values of the primary key do not already exist, or updates them if they do. If you insert a batch of data into a table with upsert! and the batch contains multiple records with duplicate keys, they will all be inserted without deduplication. For example:
inputOneDuplicated = table(2022.11.18T00:00:00 + 30000..30001 as DateTime, [20001, 20001] as id_key, `1`1 as id_partition, [10.0, 10.1] as factor)
pt.upsert!(newData=inputOneDuplicated, ignoreNull=false, keyColNames=`id_key, sortColumns=`DateTime)
Result: (screenshot omitted; both rows with id_key=20001 are inserted.)
Therefore, before you call upsert!, make sure the primary key values in the batch you’re inserting are unique.
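One way to avoid this is to deduplicate the batch first, keeping only the latest record per key. A rough DolphinDB sketch, assuming the context by / csort / limit clauses behave this way in your version (please verify before use):
//keep only the last record per id_key after sorting each group by DateTime
dedupedInput = select * from inputOneDuplicated context by id_key csort DateTime asc limit -1
pt.upsert!(newData=dedupedInput, ignoreNull=false, keyColNames=`id_key, sortColumns=`DateTime)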

Merge update records in a final table

I have a user table in Hive of the form:
User:
Id String,
Name String,
Col1 String,
UpdateTimestamp Timestamp
I'm inserting data in this table from a file which has the following format:
I/U,Timestamp when record was written to file, Id, Name, Col1, UpdateTimestamp
e.g. for inserting a user with Id 1:
I,2019-08-21 14:18:41.002947,1,Bob,stuff,123456
and updating col1 for the same user with Id 1:
U,2019-08-21 14:18:45.000000,1,,updatedstuff,123457
The columns which are not updated are returned as null.
Now, simple insertion is easy in Hive using LOAD DATA INPATH into a staging table and then ignoring the first two fields from the staging table.
However, how would I go about the update statements, so that my final row in Hive looks like the below?
1,Bob,updatedstuff,123457
I was thinking to insert all rows in a staging table and then perform some sort of merge query. Any ideas?
Typically with a merge statement your "file" would still be unique on ID and the merge statement would determine whether it needs to insert this as a new record, or update values from that record.
However, if the file is non-negotiable and will always have the I/U format, you could break the process up into two steps, the insert, then the updates, as you suggested.
In order to perform updates in Hive, you will need the users table to be stored as ORC and have ACID enabled on your cluster. For my example, I would create the users table with a cluster key, and the transactional table property:
create table test.orc_acid_example_users
(
id int
,name string
,col1 string
,updatetimestamp timestamp
)
clustered by (id) into 5 buckets
stored as ORC
tblproperties('transactional'='true');
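For the insert step, a rough sketch, assuming the file has already been loaded into a staging table named test.orc_acid_example_staging (the name used in the merge below) with columns (type, file_ts, id, name, col1, updatetimestamp) matching the file layout from the question; file_ts is an assumed name for the second field:
insert into test.orc_acid_example_users
select s.id, s.name, s.col1, s.updatetimestamp
from test.orc_acid_example_staging s
where s.type = 'I';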
After your insert statements, your Bob record would say "stuff" in col1 (result screenshot omitted).
As far as the updates go, you could tackle these with an update or merge statement. I think the key here is the null values: it is important to keep the original name, col1, or whatever, if the staging table from the file has a null value. Here is a merge example which coalesces the staging table's fields; basically, if there is a value in the staging table, take it, otherwise fall back to the original value.
merge into test.orc_acid_example_users as t
using test.orc_acid_example_staging as s
on t.id = s.id
and s.type = 'U'
when matched
then update set name = coalesce(s.name,t.name), col1 = coalesce(s.col1, t.col1)
Now Bob will show "updatedstuff" (result screenshot omitted).
Quick disclaimer: if you have more than one update for Bob in the staging table, things will get messy. You will need a pre-processing step to get the latest non-null values of all the updates prior to doing the update/merge. Hive isn't really a complete transactional DB; it would be preferable for the source to send full user records any time there's an update, instead of just the changed fields.
You can reconstruct each record in the table using last_value() with the ignore-nulls option:
select h.id,
coalesce(h.name, last_value(h.name, true) over (partition by h.id order by h.updatetimestamp)) as name,
coalesce(h.col1, last_value(h.col1, true) over (partition by h.id order by h.updatetimestamp)) as col1,
h.updatetimestamp
from history h;
You can use row_number() and a subquery if you want the most recent record.
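If you only want the most recent reconstructed record per id, a sketch along those lines (the history table and column names are assumed from the question and the query above):
select t.id, t.name, t.col1, t.updatetimestamp
from (
select h.id,
coalesce(h.name, last_value(h.name, true) over (partition by h.id order by h.updatetimestamp)) as name,
coalesce(h.col1, last_value(h.col1, true) over (partition by h.id order by h.updatetimestamp)) as col1,
h.updatetimestamp,
row_number() over (partition by h.id order by h.updatetimestamp desc) as rn
from history h
) t
where t.rn = 1;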

Bulk updating existing rows in Redshift

This seems like it should be easy, but isn't. I'm migrating a query from MySQL to Redshift of the form:
INSERT INTO table
(...)
VALUES
(...)
ON DUPLICATE KEY UPDATE
value = MIN(value, VALUES(value))
Primary keys that we're inserting that aren't already in the table are just inserted. For primary keys that are already in the table, we update the row's values based on a condition that depends on both the existing and new values in the row.
http://docs.aws.amazon.com/redshift/latest/dg/merge-replacing-existing-rows.html does not work, because filter_expression in my case depends on the current entries in the table. I'm currently creating a staging table, inserting into it with a COPY statement, and am trying to figure out the best way to merge the staging and real tables.
I'm having to do exactly this for a project right now. The method I'm using involves 3 steps:
1.
Run an update that addresses changed fields (I'm updating whether or not the fields have changed, but you can certainly qualify that):
update table1
set col1 = s.col1, col2 = s.col2, ...
from stagetable s
where s.primkey = table1.primkey;
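To reproduce the MIN(value, VALUES(value)) behaviour from the original MySQL query, the update can be qualified on both the existing and incoming values, for example (the value column name is taken from the question):
update table1
set value = s.value
from stagetable s
where s.primkey = table1.primkey
and s.value < table1.value;   -- only overwrite when the incoming value is smaller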
2.
Run an insert that addresses new records:
insert into table1
select s.*
from stagetable s
left outer join table1 t on s.primkey=t.primkey
where t.primkey is null;
3.
Mark rows no longer in the source as inactive (our reporting tool uses views that filter inactive records):
update table1
set is_active_flag = 'N', last_updated = sysdate
where not exists (
    select 1 from stagetable s where s.primkey = table1.primkey
);
It is possible to create a temp table. In Redshift it is better to delete and re-insert the record.
Check this doc
http://docs.aws.amazon.com/redshift/latest/dg/merge-replacing-existing-rows.html
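A sketch of that documented delete-and-insert pattern, assuming a staging table named stagetable keyed on primkey as in the other answer:
begin transaction;
-- remove the rows that are about to be replaced
delete from table1
using stagetable s
where table1.primkey = s.primkey;
-- re-insert the merged rows from staging
insert into table1
select * from stagetable;
end transaction;
Note that any conditional logic (such as keeping the smaller value) has to be applied while populating the staging table, before the delete/insert.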
Here is the fully working approach for Redshift.
Assumptions:
A. Data is available in S3 in gzip format with '|'-separated columns, and may have some garbage data (see maxerror).
B. A Sales fact table with two dimension tables, to keep it simple (TIME and SKU; SKU may have many groups and categories).
C. You have a Sales table like this:
CREATE TABLE sales (
sku_id int encode zstd,
date_id int encode zstd,
quantity numeric(10,2) encode delta32k
);
1) Create a staging table that should resemble the online table used by your app(s).
CREATE TABLE stg_sales_onetime (
sku_number varchar(255) encode zstd,
time varchar(255) encode zstd,
qty_str varchar(20) encode zstd,
quantity numeric(10,2) encode delta32k,
sku_id int encode zstd,
date_id int encode zstd
);
2) Copy data from S3 (this could also be done using SSH).
copy stg_sales_onetime (sku_number,time,qty_str) from
's3://<bucket_name>/<full_file_path>' CREDENTIALS 'aws_access_key_id=<your_key>;aws_secret_access_key=<your_secret>' delimiter '|' ignoreheader 1 maxerror as 1000 gzip;
3) This step is optional. If you don't have well-formatted data, this is your transformation step (e.g. converting the string "12.555654" quantity to the number 12.56).
update stg_sales_onetime set quantity=convert(decimal(10,2),qty_str);
4) Populate the correct IDs from the dimension tables.
update stg_sales_onetime set sku_id=<your_sku_dimension_table>.sku_id from <your_sku_dimension_table> where stg_sales_onetime.sku_number=<your_sku_dimension_table>.sku_number;
update stg_sales_onetime set date_id=<your_time_dimension_table>.date_id from <your_time_dimension_table> where stg_sales_onetime.time=<your_time_dimension_table>.time;
5) Finally, your data is good to go from the staging table to the online Sales table.
insert into sales(sku_id,date_id,quantity) select sku_id,date_id,quantity from stg_sales_onetime;

change ID number to smooth out duplicates in a table

I have run into this problem that I'm trying to solve: Every day I import new records into a table that have an ID number.
Most of them are new (have never been seen in the system before) but some are coming in again. What I need to do is append a letter to the end of the ID number if the number is found in the archive, but only if the data in the row is different from the data in the archive, and this needs to be done sequentially; i.e., if 12345 is seen a second time with different data, I change it to 12345A, and if 12345 is seen again and is again different, I change it to 12345B, etc.
Originally I tried using a WHILE loop that would put all the 'seen again' records in a temp table, then assign A the first time, delete those, assign B to what's left, delete those, etc., till the temp table was empty, but that hasn't worked out.
Alternately, I've been thinking of trying subqueries as in:
update table
set IDNO = (select max(IDNO) from archive) + 1
Any suggestions?
How about this as an idea? Mind you, this is basically pseudocode so adjust as you see fit.
With "src" as the table that all the data will ultimately be inserted into, and "TMP" as your temporary table.. and this is presuming that the ID column in TMP is a double.
do
update tmp set id = id + 0.01 where id in (select id from src);
until no_rows_changed;
alter table TMP change id into id varchar(255);
update TMP set id = concat(int(id), chr((id - int(id)) * 100 + 64));
insert into SRC select * from tmp;
What happens when you get to 12345Z?
Anyway, change the table structure slightly, here's the recipe:
Drop any indices on ID.
Split ID (apparently varchar) into ID_Num (long int) and ID_Alpha (varchar, not null). Make the default value for ID_Alpha an empty string ('').
So, 12345B (varchar) becomes 12345 (long int) and 'B' (varchar), etc.
Create a unique, ideally clustered, index on columns ID_Num and ID_Alpha.
Make this the primary key. Or, if you must, use an auto-incrementing integer as a pseudo primary key.
Now, when adding new data, finding duplicate ID numbers is trivial, and the last ID_Alpha can be obtained with a simple max() operation.
Resolving duplicate ID's should now be an easier task, using either a while loop or a cursor (if you must).
But it should also be possible to avoid "row by agonizing row" (RBAR) processing and use a set-based approach. A few days of reading Jeff Moden's articles should give you ideas in that regard.
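A rough T-SQL-style sketch of the max() lookup described above, assuming the split columns ID_Num / ID_Alpha and the archive/import table names used in the next answer:
select t.ID_Num,
       isnull(max(a.ID_Alpha), '') as last_alpha
from tempimporttable t
left join archivetable a on a.ID_Num = t.ID_Num
group by t.ID_Num;
The next suffix can then be derived from last_alpha (for example char(ascii(last_alpha) + 1) when it is non-empty), which is essentially what the asker's final solution below does.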
Here is my final solution:
update a
set IDnum=b.IDnum
from tempimporttable A inner join
(select * from archivetable
where IDnum in
(select max(IDnum) from archivetable
where IDnum in
(select IDnum from tempimporttable)
group by left(IDnum,7)
)
) b
on b.IDnum like a.IDnum + '%'
WHERE
*row from tempimport table = row from archive table*
to set incoming rows to the same IDnum as old rows, and then
update a
set patient_account_number = case
when len((select max(IDnum) from archive where left(IDnum,7) = left(a.IDnum,7)))= 7 then a.IDnum + 'A'
else left(a.IDnum,7) + char(ascii(right((select max(IDnum) from archive where left(IDnum,7) = left(a.IDnum,7)),1))+1)
end
from tempimporttable a
where not exists ( *select rows from archive table* )
I don't know if anyone wants to delve too far into this, but I appreciate constructive criticism...

Create a unique primary key (hash) from database columns

I have this table which doesn't have a primary key.
I'm going to insert some records in a new table to analyze them and I'm thinking in creating a new primary key with the values from all the available columns.
If this were a programming language like Java I would:
int hash = column1 * 31 + column2 * 31 + column3*31
Or something like that. But this is SQL.
How can I create a primary key from the values of the available columns? It won't work for me to simply mark all the columns as the PK, because what I need to do is compare them with data from another DB table.
My table has 3 numbers and a date.
EDIT What my problem is
I think a bit more of background is needed. I'm sorry for not providing it before.
I have a database (dm) that is updated every day from another db (the original source). It has records from the past two years.
Last month (July) the update process broke, and for a month no data was loaded into dm.
I manually created a table with the same structure in my Oracle XE, and I copied the records from the original source into my db (myxe). I copied only records from July, to create a report needed by the end of the month.
Finally, on Aug 8 the update process was fixed, and the records which had been waiting to be migrated by this automatic process were copied into the database (from originalsource to dm).
This process cleans the data out of the original source once it has been copied (into dm).
Everything looked fine, but we have just realized that a portion of the records got lost (about 25% of July).
So, what I want to do is use my backup (myxe) and insert into the database (dm) all those missing records.
The problems here are:
They don't have a well-defined PK.
They are in separate databases.
So I thought that if I could create a unique PK from both tables which gave the same number, I could tell which records were missing and insert them.
EDIT 2
So I did the following in my local environment:
select a.* from the_table@PRODUCTION a, the_table b where
a.idle = b.idle and
a.activity = b.activity and
a.finishdate = b.finishdate
This returns all the rows that are present in both databases (the... intersection?). I've got 2,000 records.
What I'm going to do next is delete them all from the target db and then just insert them all from my db into the target table.
I hope I don't get into something worse :-S :-S
The danger of creating a hash value by combining the 3 numbers and the date is that it might not be unique and hence cannot be used safely as a primary key.
Instead I'd recommend using an autoincrementing ID for your primary key.
Just create a surrogate key:
ALTER TABLE mytable ADD pk_col INT
UPDATE mytable
SET pk_col = rownum
ALTER TABLE mytable MODIFY pk_col INT NOT NULL
ALTER TABLE mytable ADD CONSTRAINT pk_mytable_pk_col PRIMARY KEY (pk_col)
or this:
ALTER TABLE mytable ADD pk_col RAW(16)
UPDATE mytable
SET pk_col = SYS_GUID()
ALTER TABLE mytable MODIFY pk_col RAW(16) NOT NULL
ALTER TABLE mytable ADD CONSTRAINT pk_mytable_pk_col PRIMARY KEY (pk_col)
The latter uses GUIDs, which are unique across databases, but they consume more space and are much slower to generate (your INSERTs will be slow).
Update:
If you need to create same PRIMARY KEYs on two tables with identical data, use this:
MERGE
INTO mytable v
USING (
SELECT rid, rownum AS rn
FROM (
SELECT rowid AS rid
FROM mytable
ORDER BY
col1, col2, col3
)
) o
ON (v.rowid = o.rid)
WHEN MATCHED THEN
UPDATE
SET pk_col = o.rn
Note that tables should be identical up to a single row (i. e. have same number of rows with same data in them).
Update 2:
For your very problem, you don't need a PK at all.
If you just want to select the records missing in dm, use this one (on the dm side):
SELECT *
FROM mytable@myxe
MINUS
SELECT *
FROM mytable
This will return all records that exist in mytable@myxe but not in mytable on dm.
Note that it will shrink all duplicates if any.
Assuming that you have ensured uniqueness...you can do almost the same thing in SQL. The only problem will be the conversion of the date to a numeric value so that you can hash it.
Select Table2.SomeFields
FROM Table1 LEFT OUTER JOIN Table2 ON
(Table1.col1 * 31) + (Table1.col2 * 31) + (Table1.col3 * 31) +
((DatePart(year,Table1.date) + DatePart(month,Table1.date) + DatePart(day,Table1.date) )* 31) = Table2.hashedPk
The above query would work for SQL Server, the only difference for Oracle would be in terms of how you handle the date conversion. Moreover, there are other functions for converting dates in SQL Server as well, so this is by no means the only solution.
And, you can combine this with Quassnoi's SET statement to populate the new field as well. Just use the left side of the Join condition logic for the value.
If you're loading your new table with values from the old table, and you then need to join the two tables, you can only "properly" do this if you can uniquely identify each row in the original table. Quassnoi's solution will allow you to do this, IF you can first alter the old table by adding a new column.
If you cannot alter the original table, generating some form of hash code based on the columns of the old table would work -- but, again, only if the hash codes uniquely identify each row. (Oracle has checksum functions, right? If so, use them.)
If hash code uniqueness cannot be guaranteed, you may have to settle for a primary key composed of as many columns as are required to ensure uniqueness (e.g. the natural key). If there is no natural key, well, I heard once that Oracle provides a rownum for each row of data; could you use that?
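For reference, a hedged Oracle sketch of the checksum idea mentioned above, using ORA_HASH over the three numbers and the date (the names the_table, col1, col2, col3 and the_date are placeholders; a hash can still collide, so verify uniqueness before relying on it as a key):
SELECT t.*,
       ORA_HASH(t.col1 || '|' || t.col2 || '|' || t.col3 || '|' || TO_CHAR(t.the_date, 'YYYYMMDDHH24MISS')) AS hash_key
FROM the_table t;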