Global index field querying is not working as expected after migrating ScyllaDB data from one cloud service provider to another using the Spark Migrator - scylla

We have migrated ScyllaDB data from one cloud service provider (source) to another cloud service provider (target) using the Spark Migrator. Details of the source and target clusters are below:
Source cluster:
Number of nodes: 3
RF: 2
Scylla version: 5.0.5-0.20221009.5a97a1060
Target cluster:
Number of nodes: 3
RF: 2
Scylla version: 5.0.5-0.20221009.5a97a1060
After the migration, when we query one of the tables using a global index field, we get only a few rows even though more rows match the criteria. When we query by partition key, we can see a row that is NOT included in the global-index search result, which indicates the data is present in the target cluster but is not returned by the query that uses the global index. We use a read consistency of LOCAL_QUORUM. Any hints on why the data is not returned when searched using the global index?
P.S.: This issue happens only for data that was copied from the source cluster. For all new data saved directly into the target cluster, the global-index-based search works fine and returns all matching rows. The issue is seen for queries made using either of the global index fields mentioned below.
Table structure (same in both source and target) is shown below; only relevant columns are listed to keep it brief:
CREATE TABLE maps.route (
    routeid bigint PRIMARY KEY,
    ...
    ...
    latlngkey text,
    mainrouteid bigint,
    ...
    ...
);
CREATE INDEX route_mainrouteid_idx ON maps.route (mainrouteid);
CREATE INDEX route_latlngkey_idx ON maps.route (latlngkey);
CREATE MATERIALIZED VIEW maps.route_mainrouteid_idx_index AS
    SELECT mainrouteid, idx_token, routeid
    FROM maps.route
    WHERE mainrouteid IS NOT NULL
    PRIMARY KEY (mainrouteid, idx_token, routeid)
    WITH CLUSTERING ORDER BY (idx_token ASC, routeid ASC)
    ...
    ...;
CREATE MATERIALIZED VIEW maps.route_latlngkey_idx_index AS
    SELECT latlngkey, idx_token, routeid
    FROM maps.route
    WHERE latlngkey IS NOT NULL
    PRIMARY KEY (latlngkey, idx_token, routeid)
    WITH CLUSTERING ORDER BY (idx_token ASC, routeid ASC)
    ...
    ...;
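For illustration, the two query shapes described above look roughly like this (the literal values are made up):
-- Lookup via a global secondary index column (only some of the matching rows come back):
SELECT * FROM maps.route WHERE mainrouteid = 1001;
SELECT * FROM maps.route WHERE latlngkey = 'some-key';
-- Lookup via the partition key (the "missing" row is returned here):
SELECT * FROM maps.route WHERE routeid = 42;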

Related

Conditional Upserting into a delta sink with Azure Data Flow in Azure Data Factory

I have a delta sink in an Azure Data Flow module, and the dataframe I'm using to update it has a hash key over the business keys and a hash key over the contents of all columns.
I want to insert rows whose business hash key is new to the sink, and update an already existing business hash key only if its content hash key is different (essentially, only update when the content hash changed for an existing business key).
Do you think I can somehow do this using "Alter Row Policies"?
I'm mostly looking for a solution that resembles the "Merge" option in pyspark where I can have different policies for when the business key matches or not (link).
Also, I'm hoping to avoid joins before writing out to the sink, because I want to avoid having to deal with there being no data in the data lake the first time the pipeline runs. I'm writing a template that's reusable for different schemas, so unless I can create an empty dataframe with a schema matching the other side of the join when the sink delta table doesn't exist, I don't think I can use the join solution.
// If the target Delta table doesn't exist yet, create it from the source data;
// otherwise merge the incoming updates into it.
if (!spark.catalog.tableExists("default", table_name)) {
  spark.sql(s"create table $table_name using delta as select * from source_table_$table_name")
} else {
  spark.sql(
    s"""
       |MERGE INTO $targetTableName
       |USING $updatesTableName
       |ON $targetTableName.id = $updatesTableName.id
       |WHEN MATCHED THEN
       |  UPDATE SET $targetTableName.ts = $updatesTableName.ts
       |WHEN NOT MATCHED THEN
       |  INSERT (id, par, ts) VALUES ($updatesTableName.id, $updatesTableName.par, $updatesTableName.ts)
       |""".stripMargin)
}
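A rough sketch of how that MERGE could express the "only update when the content hash changed" rule, assuming hypothetical business_key_hash and content_hash columns on both tables (Delta Lake supports an extra condition on WHEN MATCHED as well as UPDATE SET * / INSERT *):
spark.sql(
  s"""
     |-- business_key_hash and content_hash are assumed/illustrative column names
     |MERGE INTO $targetTableName AS t
     |USING $updatesTableName AS u
     |ON t.business_key_hash = u.business_key_hash
     |WHEN MATCHED AND t.content_hash <> u.content_hash THEN
     |  UPDATE SET *
     |WHEN NOT MATCHED THEN
     |  INSERT *
     |""".stripMargin)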

Storing logs in postgres database as text vs json type

Let's say we want to create a table to store logs of user activity in a database. I can think of 2 ways of doing this:
A table having a single row for each log entry that contains a log id, a foreign key to the user, and the log content. This way we will have a separate row for each activity that happens.
A table having a single row for the activity of each unique user (foreign key to the user) and a log id. We can have a JSON-type column to store the logs associated with each user. Each time an activity occurs, we fetch the associated log entry and update its JSON column by appending the new activity to it.
Approach 1 provides a clean way of adding new log entries without the need to update old ones, but querying such a table for the activity of a single user means searching the entire table (unless it is indexed).
Approach 2 adds complexity to adding a new user activity since we would have to fetch and update the JSON object but querying would just return a single row.
I need help to understand if one approach can be clearly advantageous over the other.
Databases are optimized to store and retrieve small rows from a big table. So go for the first solution. Indexes make joins like that fast.
Lumping all data for a user into a single JSON object won't make you happy: each update would have to read, modify and write the whole JSON, which is not efficient at all.
If your logs change a lot in terms of properties, I would create a table with:
log_id, user_id (FK), and the log in JSON format, with each row being one activity.
It won't be a performance problem if you index your table; in PostgreSQL you can even index fields inside a JSON column.
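For example, a minimal sketch of such indexes, assuming a table user_log with a JSON column named log (table and column names are illustrative):
-- Expression index on one property inside the JSON column:
CREATE INDEX user_log_action_idx ON user_log ((log ->> 'action'));
-- If the column is jsonb rather than json, a GIN index supports arbitrary containment queries:
CREATE INDEX user_log_log_gin_idx ON user_log USING gin (log);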
Approach 2 will become slower with every update as the column grows. Querying will also be more complex.
Also consider a logging framework that can parse semi-structured data into database columns, such as Serilog.
Otherwise I would also recommend your option 1: a single row per log entry with an index on user_id, but I suggest adding a timestamp column so the query engine can sort on the order of events without having to parse the JSON itself for a timestamp:
CREATE TABLE user_log
(
    log_id      bigint,                           -- PRIMARY KEY
    log_ts      timestamp NOT NULL DEFAULT now(),
    user_id     int NOT NULL,                     -- REFERENCES users(user_id)
    log_content json
);
CREATE INDEX ON user_log(user_id);

SELECT user_id, log_ts, log_content ->> 'action' AS user_action
FROM user_log
WHERE user_id = ?
ORDER BY log_ts;

Create materialised view without primary key column

Is it possible to create a materialised view without using one of the primary key columns?
Use case:
I have a table with two primary key columns, user_id and app_id, and I want to create a view to fetch data on the basis of app_id regardless of user_id. I am trying to create a materialised view, but Cassandra does not allow it if I keep only one of the primary key columns.
I know I can use ALLOW FILTERING, but that will not give 100% accurate data.
In Cassandra, a materialized view must always include all of the base table's primary key columns, but they can appear in a different order. So in this case you can create an MV with a primary key of app_id, user_id, although this may lead to big partitions if you have a very popular application.
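For example, a minimal sketch, assuming a hypothetical base table user_data with PRIMARY KEY (user_id, app_id):
-- user_data and data_by_app are illustrative names
CREATE MATERIALIZED VIEW data_by_app AS
    SELECT * FROM user_data
    WHERE app_id IS NOT NULL AND user_id IS NOT NULL
    PRIMARY KEY (app_id, user_id);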
But I suggest simply creating a second table with the necessary primary key and populating it from your application - it can even be more performant than a materialized view, because a materialized view has to read data from disk every time you insert/update/delete a record in the main table. Also take into account that materialized views in Cassandra are an experimental feature and have quite a lot of problems.
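The application-maintained table from that suggestion could look like this (again with illustrative names), written by the application alongside the main table:
CREATE TABLE user_data_by_app (
    app_id  int,
    user_id int,
    data    text,   -- whatever payload columns you need when reading by app_id
    PRIMARY KEY (app_id, user_id)
);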

Can I use one Azure Search Indexer to index multiple entities with the same Key?

Basically I have two data sources (Cosmos DB, Azure SQL), one index, and two indexers.
Both indexers share the same primary key, which allows me to join the data from both sources into one index. The issue is that the Cosmos DB contains multiple entries with the same key that the indexers use as the primary key, which by default (I assume) flattens all entries with that key so that only the latest one found is indexed. It runs without errors, but obviously entries are missing, as only the last one found is indexed.
The only solution so far is to index the Cosmos DB with another indexer using its unique key. I kinda wanted to avoid having multiple search queries, but it seems this is the only solution, unless anyone's got a better idea. Thank you!
No, you cannot use the same key for multiple docs; the key is the unique ID of each doc and is used for look-ups. If you add multiple docs with the same key to your index, the system performs multiple update operations on the doc with that key, so you only see the last record.
Maybe my case is similar to yours and will be helpful. In my index, itemid is the key, and in my Cosmos DB every document has the same itemid value, 1.
In my case, I use the _rid value to replace the itemid value when creating the data source, with the SQL query below:
SELECT u._rid AS itemid, u.FirstName, u.LastName, u.Email, u._ts FROM user u WHERE u._ts >= @HighWaterMark ORDER BY u._ts
After importing with this query, the issue is solved: the data lands in the original index without the same-key problem.
If there is any misunderstanding or anything unclear, please feel free to let me know.

Error while dropping column from a table with secondary index (Scylladb)

While dropping a column from a table that has a secondary index, I get the following error. I am using ScyllaDB version 3.0.4.
[Invalid query] message="Cannot drop column name on base table warehouse.myuser with materialized views"
Below are the example commands:
create table myuser (id int primary key, name text, email text);
create index on myuser(email);
alter table myuser drop name;
I can successfully run the above statements in Apache Cassandra.
Default secondary indexes in Scylla are global and implemented on top of materialized views (as opposed to Apache Cassandra's local indexing implementation), which gives them new possibilities, but also adds certain restrictions. Dropping a column from a table with materialized views is a complex operation, especially if the target column is selected by one of the views or its liveness can affect view row liveness. In order to avoid these problems, dropping a column is unconditionally not possible when there are materialized views attached to a table. The error you see is a combination of that and the fact that Scylla's index uses a materialized view underneath to store corresponding base keys for each row.
The obvious workaround is to drop the index first, then drop the column and recreate the index, but that of course takes time and resources.
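A minimal sketch of that workaround against the example above (the index name below is the default one generated for "create index on myuser(email)"; verify it with DESCRIBE TABLE first):
DROP INDEX myuser_email_idx;     -- assumed default index name
ALTER TABLE myuser DROP name;
CREATE INDEX ON myuser (email);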
However, in some cases columns can be allowed to be dropped from the base table even if it has materialized views, especially if the column is not selected in the view and its liveness does not have any impact on view rows. For reference, I created an issue that requests implementing it in our bug tracker: https://github.com/scylladb/scylla/issues/4448