Scala Spark Cassandra update or insert rows on primary key match - sql

I am migrating data from CSV dumps of SQL tables (one per table) to a Cassandra database that uses a pre-determined, standardized format. As a result, I am doing transformations, joins, etc. on the SQL data to get it to match this format before writing it to Cassandra. My issue is that this migration happens in batches (not all at once), and I cannot ensure that the information from all sides of a table join will be present when an entry is written to Cassandra.
For example:
Table 1 and Table 2 both have the partitioning and clustering keys (allowing the join, since their combination is unique) and are joined with a full outer join. With the way we are being given data, however, there is a chance that we get a record from Table 1 but not from Table 2 in a given "batch" of data. When I perform the full outer join, there are no problems: the extra columns from the other table are added and simply filled with nulls. On the next interval that I get data, I then receive the Table 2 portion that should previously have been joined to the Table 1 record.
How do I get those entries combined?
I have looked for an update-or-insert type method in Spark that branches on whether that set of partitioning and clustering keys already exists, but have not turned up anything. Is that even the most efficient approach? Will I just have to check each entry with a spark.sql query and then update or write it accordingly?
Note: using UUIDs to avoid the primary key conflict will not solve the issue; I do not want two partial entries. All data with a particular primary key needs to end up in the same row.
Thanks for any help that you can provide!

I think you should be able to just directly write the data to cassandra and not have to worry about it, assuming all primary keys are the same.
Cassandra's inserts are really "insert or update", so I believe that when you insert one side of the join, it will just leave the missing columns empty. Then, when you insert the other side of the join, it will update that row with the new columns.
Take this with a grain of salt, as I don't have a Spark+Cassandra cluster available to test and make sure.
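To make that concrete, here is what the upsert behaviour looks like in plain CQL (the keyspace, table, and column names below are just placeholders for your schema):
-- hypothetical table keyed by the same partitioning and clustering columns as the join
CREATE TABLE IF NOT EXISTS ks.merged (
    part_key    int,
    clust_key   int,
    col_from_t1 text,
    col_from_t2 text,
    PRIMARY KEY ((part_key), clust_key)
);
-- first batch: only the Table 1 side arrives, so col_from_t2 stays null
INSERT INTO ks.merged (part_key, clust_key, col_from_t1) VALUES (1, 1, 'from table 1');
-- later batch: same primary key, so this "insert" simply updates the existing row
INSERT INTO ks.merged (part_key, clust_key, col_from_t2) VALUES (1, 1, 'from table 2');
-- result: one row with both columns populated
SELECT * FROM ks.merged WHERE part_key = 1 AND clust_key = 1;
So as long as each batch is appended to the table rather than overwriting it, the partial rows from different batches should end up merged into one row per primary key.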

Related

how to uniquely identify rows in two table copies

I have essentially two tables that are copies of each other. One is dynamic, with DML statements happening against it quite constantly, so it serves as a stage table; the other is used to synchronize the changes from this stage table. The tables can therefore have different data at different times, and I use a MERGE statement to sync them. Something along these lines:
MERGE INTO source s
USING (
    SELECT *
    FROM   stage st
) se
ON ( s.eim_product_id = se.eim_product_id )
...
The problem is that eim_product_id is neither a primary key, nor unique. So my merge statement essentially throws this error:
Error report -
ORA-30926: unable to get a stable set of rows in the source tables
The only pseudo-columns I can think of using are something like an identity column id_seq INTEGER GENERATED ALWAYS AS IDENTITY or a rowid. However, the problem is that this approach will not identify the same row consistently across both tables, right? I believe I need some kind of hash that does the job, but I am unsure what the best and simplest approach would be in this case.
The rowid pseudo-column won't match between the tables, and isn't necessarily constant. Creating a hash could get expensive in terms of CPU; an updated row in the first table wouldn't have a matching hash in the second table for the merge to find. If you only generate the hash at insert and never update then it's just a more expensive, complicated sequence.
Your best bet is an identity column with a unique constraint on the first table, copied to the second table by the merge: it is unique, is calculated very efficiently just once at insert, will always identify the same row in both tables, and never needs to change.
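For what it's worth, a minimal sketch of that setup, reusing the id_seq name from your question (the constraint name and the eim_product_id column stand in for your real column list, and adding an identity column like this assumes Oracle 12c or later):
-- give the stage table a generated, unique row identifier
ALTER TABLE stage ADD id_seq INTEGER GENERATED ALWAYS AS IDENTITY;
ALTER TABLE stage ADD CONSTRAINT stage_id_seq_uk UNIQUE (id_seq);
-- the synchronized table gets a plain copy of that column
ALTER TABLE source ADD id_seq INTEGER;
-- the merge can then join on a value that is stable and unique in both tables
MERGE INTO source s
USING stage st
ON (s.id_seq = st.id_seq)
WHEN MATCHED THEN
    UPDATE SET s.eim_product_id = st.eim_product_id   -- plus the other columns you sync
WHEN NOT MATCHED THEN
    INSERT (s.id_seq, s.eim_product_id)
    VALUES (st.id_seq, st.eim_product_id);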

SQL Server hash function on table that represents the complete data on that table?

I need to get a hash for a particular table, containing about 1 million records, that is running on DB server A, so that I can compare it with a hash of the same table running on DB server B and check whether the data in that table is the same on both servers.
I don't know whether comparing the data in the same tables on different DB servers like this is even the right way to do it; please suggest an alternative if I'm headed in the wrong direction.
You cannot get a hash for a whole table, but you can get one at the row level. You can take the primary key and check whether the row hash matches between the two tables.
There are many functions to support this:
CHECKSUM(*)
BINARY_CHECKSUM(*)
They are accurate, but in extreme cases they can be wrong; read this SO post.
But for comparing these two tables, you can also create a linked server and use the EXCEPT clause to see if the data are different.
There are many ways to compare table data. Refer to this post: https://www.mssqltips.com/sqlservertip/2779/ways-to-compare-and-find-differences-for-sql-server-tables-and-data/
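Roughly, assuming a table dbo.MyTable with primary key Id and a linked server named ServerB pointing at the other instance (all of these names are hypothetical):
-- per-row hash keyed by the primary key; run on each server and compare the results
SELECT Id, BINARY_CHECKSUM(*) AS row_hash
FROM dbo.MyTable;
-- or, with a linked server, let EXCEPT report the rows that differ
SELECT * FROM dbo.MyTable
EXCEPT
SELECT * FROM [ServerB].[MyDatabase].dbo.MyTable;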

oracle join depth while updating table

I have a question regarding Oracle.
I know that Oracle only supports the use of aliases down to the first subquery level. This poses a problem when I want to group more than once while updating a table.
Example: I have some server groups and a database containing information about them. There is one table that contains information about the groups and one table where I store, with a timestamp (to be exact, I actually used DATE), the workload of specific servers within the groups.
For performance reasons I now have a denormalized field in the server table containing the highest workload the group had within one day.
What I would like to do is something like
update server_group
set last_day_workload = avg(workload1)
from (select max(workload) workload1
      from server_performance
      where server_performance.server_group_ID_fk = server_group.ID
        and time > sysdate - 1
      group by server_performance.server_group_ID_fk)
Here ID is the primary key of server_group and server_group_ID_fk is a foreign key reference from the server_performance table. The solution I am using so far is to write the first join into a temporary table and then update from that temporary table in the next statement. Is there a better way to do this?
In this case it isn't such a problem yet, but as the amount of data increases, using a temporary table costs not only some time but also a notable amount of RAM.
Thank you for your answers!
If I were you, I would work out the results that I wanted in a select statement, and then use a MERGE statement to do the necessary update.
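Something along these lines, built from the tables and columns in your question (treat it as a sketch, not tested code; the inner query already returns one maximum per group, so it can be used directly):
MERGE INTO server_group sg
USING (
    SELECT server_group_ID_fk, MAX(workload) AS max_workload
    FROM   server_performance
    WHERE  time > SYSDATE - 1
    GROUP  BY server_group_ID_fk
) sp
ON (sg.ID = sp.server_group_ID_fk)
WHEN MATCHED THEN
    UPDATE SET sg.last_day_workload = sp.max_workload;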

Do duplicate values in an index take duplicate space?

I want to optimize the storage of a big table by moving the values of varchar columns out to an external lookup table (there are many duplicated values).
The process of doing this is very technical in its nature (creating a lookup table and referencing it instead of the actual value), and it sounds like it should be part of the infrastructure (SQL Server in this case, or any RDBMS).
Then I thought it should be an option of an index: do not store duplicate values,
only a reference to the duplicated value.
Can an index be optimized in such a manner, not holding duplicated values but just a reference?
It should make the size of the table and index much smaller when there are many duplicated values.
SQL Server cannot do deduplication of column values. An index stores one row for each row of the base table. They are just sorted differently.
If you want to deduplicate you can keep a separate table that holds all possible (or actually occurring) values with a much shorter ID. You can then refer to the values by only storing their ID.
You can maintain that deduplication table in the application code or using triggers.
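A rough sketch of that layout, with made-up object names:
-- lookup table: each distinct string is stored once, behind a short surrogate key
CREATE TABLE dbo.ValueLookup (
    ValueId   int IDENTITY(1,1) PRIMARY KEY,
    ValueText varchar(400) NOT NULL UNIQUE
);
-- the big table stores only the small ID instead of the repeated varchar
CREATE TABLE dbo.BigTable (
    RowId   bigint IDENTITY(1,1) PRIMARY KEY,
    ValueId int NOT NULL REFERENCES dbo.ValueLookup (ValueId)
);
-- application code (or a trigger) resolves the text to an ID, adding it if it is new
INSERT INTO dbo.ValueLookup (ValueText)
SELECT 'some repeated value'
WHERE NOT EXISTS (SELECT 1 FROM dbo.ValueLookup WHERE ValueText = 'some repeated value');
Whether the saved varchar storage outweighs the cost of the extra join is something you would have to measure for your data.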

Skipping primary key conflicts with SQL copy

I have a large collection of raw data (around 300 million rows) with about 10% replicated data. I need to get the data into a database. For the sake of performance I'm trying to use SQL copy. The problem is that when I commit the data, primary key exceptions prevent any of the data from being processed. Can I change the behavior of primary keys such that conflicting data is simply ignored, or replaced? I don't really care either way - I just need one unique copy of each row.
I think your best bet would be to drop the constraint, load the data, then clean it up and reapply the constraint.
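Roughly, as an Oracle-flavored sketch (the table, key, and constraint names are made up; adjust for your actual schema and database):
ALTER TABLE raw_data DROP CONSTRAINT raw_data_pk;
-- ... bulk load everything here ...
-- then keep one arbitrary copy of each duplicated key
DELETE FROM raw_data
WHERE rowid NOT IN (
    SELECT MIN(rowid)
    FROM   raw_data
    GROUP  BY key_col1, key_col2
);
ALTER TABLE raw_data ADD CONSTRAINT raw_data_pk PRIMARY KEY (key_col1, key_col2);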
That's what I was considering doing, but I was worried about the performance of getting rid of 30 million randomly placed rows in a 300-million-row database. The duplicate data also has a spatial relationship, which is why I wanted to try to fix the problem while loading the data rather than after it is all loaded.
Use a select statement to select exactly the data you want to insert, without the duplicates.
Use that as the basis of a CREATE TABLE XYZ AS SELECT * FROM (query-just-non-dupes).
You might check out AskTom's ideas on how to select the non-duplicate rows.
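For example, something along these lines (table and key column names are made up, and the ORDER BY inside ROW_NUMBER just picks an arbitrary survivor per key):
CREATE TABLE clean_data AS
SELECT key_col1, key_col2, payload_col
FROM (
    SELECT rd.*,
           ROW_NUMBER() OVER (PARTITION BY key_col1, key_col2
                              ORDER BY key_col1) AS rn
    FROM raw_data rd
)
WHERE rn = 1;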