How to specify a non-temporal column as the "primary key" in a partitioned table?

In DolphinDB, it seems I cannot specify a non-temporal column in the sortColumns of createPartitionedTable with the TSDB storage engine. Is there any other way to specify a non-temporal column as the "primary key" of a partitioned table?
My table has 4 columns:
a temporal column "DateTime"
2 ID columns, "id_key" and "id_partition", where "id_partition" is the partitioning column
a column "factor" holding the factor values
Consider the partition with id_partition=1: it has 20,000 records, and each record has a unique id_key. I want to make id_key the "primary key" of the table.
In other words, when I write a new record into the table, it should be inserted if its id_key value does not already exist, and it should overwrite the existing record otherwise. The table should then be ordered by the DateTime column.

For the TSDB storage engine, the use case you described is not supported. Reason:
When creating a partitioned table with the TSDB engine, the parameter sortColumns specifies the column(s) by which data within each partition is sorted. The last sort column must be a temporal column; the remaining sort columns are combined into a "sort key", which serves as the index. Since sortColumns must contain a temporal column, it cannot be used to make a non-temporal column such as id_key the "primary key".
It is recommended that you use the OLAP storage engine instead and write data to the table through upsert!:
dbName = "dfs://test_1123"
tbName = "test_1123"
if(existsDatabase(dbName)){
dropDatabase(dbName)
}
//use OLAP storage engine, partition table on id_partition. no sortColumns
db = database(dbName, VALUE, `1`2)  //value partitions cover the id_partition values used below
colNames = `DateTime`id_key`id_partition`factor
colTypes = [DATETIME, LONG, SYMBOL, DOUBLE]
schemaTable = table(1:0, colNames, colTypes)
db.createPartitionedTable(table=schemaTable, tableName=tbName, partitionColumns=`id_partition)
//prepare test data for one partition with id_partition=1. id_key ranges from 1 to 20000. Insert the data into the partitioned table.
data = table(2022.11.18T00:00:00 + 1..20000 as DateTime, take(1..20000, 20000) as id_key, take(`1, 20000) as id_partition, 10.5 - round(rand(1.0, 20000), 2) as factor)
pt = loadTable(dbName, tbName)
pt.upsert!(newData=data, ignoreNull=false, keyColNames=`id_key, sortColumns=`DateTime)
//use upsert! to write a record whose id_key=1 already exists in partition id_partition=1; its DateTime is later than the existing record's.
inputOne = table(2022.11.18T00:00:00 + 30000 as DateTime, 1 as id_key, `1 as id_partition, 10.0 as factor)
pt.upsert!(newData=inputOne, ignoreNull=false, keyColNames=`id_key, sortColumns=`DateTime)
Result: the existing record with id_key=1 is updated with the new DateTime and factor value.
Note:
upsert! inserts rows into a table if the values of the primary key do not already exist, or updates them if they do. If you insert a batch of data into a table with upsert! and the batch contains multiple records with duplicate keys, they will all be inserted without deduplication. For example:
inputOneDuplicated = table(2022.11.18T00:00:00 + 30000..30001 as DateTime, [20001, 20001] as id_key, `1`1 as id_partition, [10.0, 10.1] as factor)
pt.upsert!(newData=inputOneDuplicated, ignoreNull=false, keyColNames=`id_key, sortColumns=`DateTime)
Result: both records with id_key=20001 are inserted, without deduplication.
Therefore, before you call upsert!, make sure the primary key values in the batch you’re inserting are unique.


UPDATE two columns with new values in a large table

We have a table like:
mytable (pid, string_value, int_value)
This table has more than 20M rows in total. We now have a feature that needs to mark all the rows in this table as invalid, so we need to update the columns to string_value = NULL and int_value = 0 to indicate an invalid row (we still want to keep the pid, as it is important to us).
What is the best way to do this?
I use the following SQL:
UPDATE Mytable
SET string_value = NULL,
int_value = 0;
but this query takes more than 4 minutes in my test environment. Is there a better way to do this?
Updating all the rows can be quite expensive. Often, it is faster to empty the table and reload it.
In generic SQL this looks like:
create table mytable_temp as
select pid
from mytable;
truncate table mytable; -- back it up first!
insert into mytable (pid, string_value, int_value)
select pid, null, 0
from mytable_temp;
The creation of the temporary table may use different syntax, depending on your database.
Updates can take time to complete. Another way of achieving this is to follow these steps (a rough SQL sketch of them follows the list):
Add new columns with the values you need set as the default value
Drop the original columns
Rename the new columns with the names of the original columns.
You can then drop the default values on the new columns.
This needs to be tested, as different DBMSs allow different levels of table alteration (i.e. not all DBMSs allow dropping a default or dropping a column).
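A minimal generic-SQL sketch of those steps for the mytable columns above; the column types are assumptions, and the exact ALTER TABLE syntax (and whether existing rows pick up the default when a column is added) varies by DBMS:
-- 1. add new columns with the required values as their defaults
alter table mytable add column string_value_new varchar(255);        -- nullable, so existing rows read as NULL
alter table mytable add column int_value_new int not null default 0; -- existing rows read as 0
-- 2. drop the original columns
alter table mytable drop column string_value;
alter table mytable drop column int_value;
-- 3. rename the new columns to the original names
alter table mytable rename column string_value_new to string_value;
alter table mytable rename column int_value_new to int_value;
-- 4. optionally drop the default afterwards
alter table mytable alter column int_value drop default;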

Joining streaming data with table data and updating the table as the stream is received, is it possible?

I am using spark-sql 2.4.1, spark-cassandra-connector_2.11-2.4.1.jar and Java 8.
I have a scenario where I need to join streaming data with C*/Cassandra table data.
If a matching record is found in the join, I need to copy the existing C* table record to another table, table_bkp, and update the actual C* table record with the latest data.
I need to perform this as the streaming data comes in.
Can this be done using Spark SQL streaming?
If so, how do I do it? Are there any caveats to take care of?
How do I get fresh C* table data for each batch?
What am I doing wrong here?
I have two tables, "master_table" and "backup_table", as shown below:
CREATE TABLE kspace.master_table(
statement_id int,
statement_flag text,
statement_date date,
x_val double,
y_val double,
z_val double,
PRIMARY KEY (( statement_id ), statement_date)
) WITH CLUSTERING ORDER BY ( statement_date DESC );
CREATE TABLE kspace.backup_table(
statement_id int,
statement_flag text,
statement_date date,
x_val double,
y_val double,
z_val double,
backup_timestamp timestamp,
PRIMARY KEY ((statement_id ), statement_date, backup_timestamp )
) WITH CLUSTERING ORDER BY ( statement_date DESC, backup_timestamp DESC);
Each streaming record has a "statement_flag", which can be "I" or "U".
If a record with "I" comes in, we insert it directly into "master_table".
If a record with "U" comes in, we need to check whether there is already a record for the given (statement_id), statement_date in "master_table".
If there is such a record in "master_table", we copy it to "backup_table" with the current timestamp as backup_timestamp, and then update the record in "master_table" with the latest data.
To achieve this, I wrote PoC code like the following:
Dataset<Row> baseDs = //streaming data from topic
Dataset<Row> i_records = baseDs.filter(col("statement_flag").equalTo("I"));
Dataset<Row> u_records = baseDs.filter(col("statement_flag").equalTo("U"));
String keyspace="kspace";
String master_table = "master_table";
String backup_table = "backup_table";
Dataset<Row> cassandraMasterTableDs = getCassandraTableData(sparkSession, keyspace , master_table);
writeDfToCassandra( baseDs.toDF(), keyspace, master_table);
u_records.createOrReplaceTempView("u_records");
cassandraMasterTableDs.createOrReplaceTempView("persisted_records");
Dataset<Row> joinUpdatedRecordsDs = sparkSession.sql(
" select p.statement_id, p.statement_flag, p.statement_date,"
+ "p.x_val,p.y_val,p.z_val "
+ " from persisted_records as p "
+ "join u_records as u "
+ "on p.statement_id = u.statement_id and p.statement_date = u.statement_date");
Dataset<Row> updated_records = joinUpdatedRecordsDs
.withColumn("backup_timestamp",current_timestamp());
updated_records.show(); //Showing correct results
writeDfToCassandra( updated_records.toDF(), keyspace, backup_table); // but here backup_table ends up with the latest "master_table" records instead of the earlier ones
Sample data (table screenshots omitted): first a record with the "I" flag, then a second record with the "U" flag that is identical except for the y_val column. After the "U" record, the expected contents of master_table and backup_table differ from the actual contents produced by the code.
Question:
Up to the show(), the dataframe (updated_records) contains the correct data.
But when I insert that same dataframe (updated_records) into the table, the C* backup_table ends up exactly the same as the latest record of master_table, whereas it is supposed to hold the earlier record of master_table.
So what am I doing wrong in the above program code?
There are several ways to do this, with various levels of performance depending on how much data you need to check.
For example, if you are only looking up data by partition key, the most efficient approach is to use joinWithCassandraTable on the DStream. For every batch this extracts only the records matching the incoming partition keys. In Structured Streaming this would happen automatically with a correctly written SQL join and DSE; if DSE is not in use, the table would be fully scanned with each batch.
If instead you require the whole table for each batch, joining the DStream batch with a CassandraRDD will cause the RDD to be re-read completely on every batch. This is much more expensive if the entire table is not being re-written.
If you are only updating records without checking their previous values, it is sufficient to write the incoming data directly to the C* table. C* uses upserts and last-write-wins behavior, and will simply overwrite the previous values if they existed.

Merge update records in a final table

I have a user table in Hive of the form:
User:
Id String,
Name String,
Col1 String,
UpdateTimestamp Timestamp
I'm inserting data in this table from a file which has the following format:
I/U,Timestamp when record was written to file, Id, Name, Col1, UpdateTimestamp
e.g. for inserting a user with Id 1:
I,2019-08-21 14:18:41.002947,1,Bob,stuff,123456
and updating col1 for the same user with Id 1:
U,2019-08-21 14:18:45.000000,1,,updatedstuff,123457
The columns which are not updated are returned as null.
Now, a simple insert is easy in Hive using LOAD DATA INPATH into a staging table and then ignoring the first two fields from the staging table.
However, how would I go about the update statements, so that my final row in Hive looks like the one below?
1,Bob,updatedstuff,123457
I was thinking of inserting all rows into a staging table and then performing some sort of merge query. Any ideas?
Typically with a merge statement your "file" would still be unique on ID and the merge statement would determine whether it needs to insert this as a new record, or update values from that record.
However, if the file is non-negotiable and will always have the I/U format, you could break the process up into two steps, the insert, then the updates, as you suggested.
In order to perform updates in Hive, you will need the users table to be stored as ORC and have ACID enabled on your cluster. For my example, I would create the users table with a cluster key, and the transactional table property:
create table test.orc_acid_example_users
(
id int
,name string
,col1 string
,updatetimestamp timestamp
)
clustered by (id) into 5 buckets
stored as ORC
tblproperties('transactional'='true');
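For the insert step, a minimal sketch might look like the following, assuming the file has been loaded into a staging table test.orc_acid_example_staging whose columns include type, id, name, col1 and updatetimestamp (the same staging table used in the merge below; the updatetimestamp column name is an assumption):
-- insert brand-new users straight from the staging table
insert into table test.orc_acid_example_users
select s.id, s.name, s.col1, s.updatetimestamp
from test.orc_acid_example_staging s
where s.type = 'I';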
After your insert statements, your Bob record would say "stuff" in col1.
As for the updates, you could tackle these with an update or merge statement. I think the key here is the null values. It's important to keep the original name, or col1, or whatever, if the staging table from the file has a null value. Here's a merge example which coalesces the staging table's fields. Basically, if there is a value in the staging table, take that, or else fall back to the original value.
merge into test.orc_acid_example_users as t
using test.orc_acid_example_staging as s
on t.id = s.id
and s.type = 'U'
when matched
then update set name = coalesce(s.name,t.name), col1 = coalesce(s.col1, t.col1)
Now Bob will show "updatedstuff" in col1.
Quick disclaimer - if you have more than one update for Bob in the staging table, things will get messy. You will need to have a pre-processing step to get the latest non-null values of all the updates prior to doing the update/merge. Hive isn't really a complete transactional DB - it would be preferred for the source to send full user records any time there's an update, instead of just the changed fields only.
You can reconstruct each record in the table using last_value() with the ignore-nulls option:
select h.id,
       coalesce(h.name, last_value(h.name, true) over (partition by h.id order by h.timestamp)) as name,
       coalesce(h.col1, last_value(h.col1, true) over (partition by h.id order by h.timestamp)) as col1,
       update_timestamp
from history h;
You can use row_number() and a subquery if you want the most recent record.
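A rough sketch of that, reusing the same hypothetical history table and the reconstruction query above:
select id, name, col1, update_timestamp
from (
    select h.id,
           coalesce(h.name, last_value(h.name, true) over (partition by h.id order by h.timestamp)) as name,
           coalesce(h.col1, last_value(h.col1, true) over (partition by h.id order by h.timestamp)) as col1,
           h.update_timestamp,
           row_number() over (partition by h.id order by h.timestamp desc) as rn
    from history h
) t
where rn = 1;   -- keep only the most recent, fully reconstructed record per id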

Split Hive table into subtables by field value

I have a Hive table foo. There are several fields in this table. One of them is some_id. The number of unique values in this field is in the range 5,000-10,000. For each value (10385 in the example) I need to perform CTAS queries like
CREATE TABLE bar_10385 AS
SELECT * FROM foo WHERE some_id=10385 AND other_id=10385;
What is the best way to perform this bunch of queries?
You can store all these tables in a single partitioned table. This approach will allow you to load all the data in a single query, and query performance will not be compromised.
Create table T (
... --columns here
)
partitioned by (id int); --new calculated partition key
Load the data using one query; it will read the source table only once:
insert overwrite table T partition(id)
select ..., --columns
case when some_id=10385 AND other_id=10385 then 10385
when some_id=10386 AND other_id=10386 then 10386
...
--and so on
else 0 --default partition for records not attributed
end as id --partition column
from foo
where some_id in (10385,10386) AND other_id in (10385,10386) --filter
Then you can use this table in queries specifying the partition:
select * from T where id = 10385; --you can create a view named bar_10385; it will act the same as your table. Partition pruning works fast
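If you want the bar_10385 name from the question, a thin view over the partitioned table is enough (a minimal sketch):
create view bar_10385 as
select *
from T
where id = 10385;   -- partition pruning still applies when the view is queried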

Insert strategy for tables with one-to-one relationships in Teradata

In our data model, which is derived from the Teradata industry models, we observe a common pattern, where the superclass and subclass relationships in the logical data model are transformed into one-to-one relationships between the parent and the child table.
I know you can roll up or roll down the attributes to end up with a single table, but we are not using this option overall. In the end we have a model in which City Id references a Geographical Area Id.
I am struggling with a good strategy to load the records in these tables.
Option 1: I could select the max(Geographical Area Id) and calculate the next Ids for a batch insert and reuse them for the City Table.
Option 2: I could use an Identity column in the Geographical Area Table and retrieve it after I insert every record in order to use it for the City table.
Any other options?
I need to assess the solution in terms of performance, reliability and maintenance.
Any comment will be appreciated.
Kind regards,
Paul
When you say "load the records into these tables", are you talking about a one-time data migration or a function that creates records for new Geographical Area/City?
If you are looking for a surrogate key and are OK with gaps in your ID values, then use an IDENTITY column and specify the NO CYCLE clause, so it doesn't repeat any numbers. Then just pass NULL for the value and let TD handle it.
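A minimal sketch of such an identity column in Teradata DDL (table and column names are illustrative; with GENERATED ALWAYS the identity column is simply left out of the insert list):
CREATE TABLE Geographical_Area (
    Geographical_Area_Id INTEGER GENERATED ALWAYS AS IDENTITY
        (START WITH 1 INCREMENT BY 1 NO CYCLE),
    Area_Name VARCHAR(100)
)
UNIQUE PRIMARY INDEX (Geographical_Area_Id);

-- the identity value is generated by Teradata
INSERT INTO Geographical_Area (Area_Name) VALUES ('Some Area');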
If you do need sequential IDs, then you can just maintain a separate "NextId" table and use that to generate ID values. This is the most flexible way and would make it easier for you to manage your BATCH operations. It requires more code/maintenance on your part, but is more efficient than doing a MAX() + 1 on your data table to get your next ID value. Here's the basic idea:
BEGIN TRANSACTION
Get the "next" ID from a lookup table
Use that value to generate new ID values for your next record(s)
Create your new records
Update the "next" ID value in the lookup table and increment it by the number of rows newly inserted (you can capture this by reading the ACTIVITY_COUNT status variable directly after executing your INSERT/MERGE statement)
Make sure to LOCK the lookup table at the beginning of your transaction so it can't be modified until your transaction completes
END TRANSACTION
Here is an example in Postgres that you can adapt to TD:
CREATE TABLE NextId (
IDType VARCHAR(50) NOT NULL,
NextValue INTEGER NOT NULL,
PRIMARY KEY (IDType)
);
INSERT INTO Users(UserId, UserType)
SELECT
COALESCE(
src.UserId, -- Use UserId if provided (i.e. update existing user)
ROW_NUMBER() OVER(ORDER BY CASE WHEN src.UserId IS NULL THEN 0 ELSE 1 END ASC) +
(id.NextValue - 1) -- Use newly generated UserId (i.e. create new user)
)
AS UserIdFinal,
src.UserType
FROM (
-- Bulk Upsert (get source rows from JSON parameter)
SELECT src.FirstName, src.UserId, src.UserType
FROM JSONB_TO_RECORDSET(pUserDataJSON->'users') AS src(FirstName VARCHAR(100), UserId INTEGER, UserType CHAR(1))
) src
CROSS JOIN (
-- Get next ID value to use
SELECT NextValue
FROM NextId
WHERE IdType = 'User'
FOR UPDATE -- Use "Update" row-lock so it is not read by any other queries also using "Update" row-lock
) id
ON CONFLICT(UserId) DO UPDATE SET
UserType = EXCLUDED.UserType;
-- Increment UserId value
UPDATE NextId
SET NextValue = NextValue + COALESCE(NewUserCount,0)
WHERE IdType = 'User'
;
Just change the locking statement to Teradata syntax (LOCK TABLE NextId FOR WRITE) and add an ACTIVITY_COUNT variable after your INSERT/MERGE to capture the # rows affected. This assumes you're doing all this inside a stored procedure.
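A rough sketch of those two adaptations, assuming they sit inside a Teradata stored procedure with DECLAREd lv_next_value and lv_row_count variables (names and exact syntax are illustrative):
-- hold a write lock on the lookup table while reading the counter
LOCKING TABLE NextId FOR WRITE
SELECT NextValue INTO :lv_next_value
FROM NextId
WHERE IDType = 'User';

/* ... INSERT/MERGE into the target table using lv_next_value ... */

SET lv_row_count = ACTIVITY_COUNT;   -- rows affected by the preceding INSERT/MERGE

UPDATE NextId
SET NextValue = NextValue + :lv_row_count
WHERE IDType = 'User';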
Let me know how it goes...