How to prevent data duplication in RedisGraph?

I wrote some code to store a graph in RedisGraph. The first run stores a single graph, but if I execute the same code a second time it stores the same graph again instead of replacing the previous one, so I now have the graph duplicated under a single key. I don't want any duplicate graph or duplicate nodes; if I execute the same code again, it should replace the previous graph. How can I do that?

If your code consists of a series of CREATE commands (whether through Cypher or one of the RedisGraph clients), running it twice will duplicate all of your data. This is not to say that the key stores two graphs; rather, it is one graph with every entity repeated.
If you would like to replace an existing graph, you should delete the existing graph first. You can delete a graph using a Redis command:
DEL [graph key]
Or a RedisGraph command:
GRAPH.DELETE [graph key]
The two are functionally identical.
Conversely, if you want to update an existing graph without introducing duplicates, you should use the MERGE clause as described in the RedisGraph documentation.
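As a minimal sketch of the replace-on-rerun approach, assuming the redis-py client and a hypothetical graph key named social, a load script can drop the old graph before rebuilding it:
import redis

r = redis.Redis(host="localhost", port=6379)
GRAPH_KEY = "social"  # hypothetical graph key

# Drop the previous graph so the rebuild starts from scratch.
# DEL is a no-op if the key does not exist yet; GRAPH.DELETE works too.
r.delete(GRAPH_KEY)

# ...then re-run the CREATE statements that build the graph, e.g.:
r.execute_command("GRAPH.QUERY", GRAPH_KEY,
                  "CREATE (:Person {name:'Alice'})-[:KNOWS]->(:Person {name:'Bob'})")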

You can use the MERGE clause to prevent inserting duplicate data.
Below is a query to remove duplicate nodes from existing data (it keeps one node per id value and deletes the rest):
MATCH (p:LabelName)
WITH p.id as id, collect(p) AS nodes
WHERE size(nodes) > 1
UNWIND nodes[1..] AS node
DELETE node

MERGE behaves like a find-or-create.
If the node, edge, or path you describe does not exist, it will be created; otherwise the existing entity is matched.
That's the recommended way to avoid duplicate entities when they are not permitted.
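As a rough illustration with the redisgraph-py client (graph key and properties are hypothetical), the following is idempotent: running it twice still leaves a single Alice node.
import redis
from redisgraph import Graph

r = redis.Redis(host="localhost", port=6379)
graph = Graph("social", r)  # hypothetical graph key

# MERGE matches the node if it already exists and creates it otherwise,
# so re-running this script does not produce duplicates.
graph.query("MERGE (p:Person {name: 'Alice'}) RETURN p")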

Related

Advice on changing the partition field for dynamic BigQuery tables

I am dealing with the following issue: I have a number of tables imported into BigQuery from an external source via AirByte, with _airbyte_emitted_at as the default partition field.
As this default choice of partition field is not very useful, the need to change the partition field naturally presents itself. I am aware of the method for changing the partitioning of an existing table by means of a CREATE TABLE ... AS SELECT * statement; however, the new tables thus created, essentially copies of the originals with modified partition fields, will be mere static snapshots and will no longer update dynamically, as the originals do each time new data is recorded in the external source.
Given such a context, what would the experienced members of this forum suggest as a solution to the problem?
Being that I am a relative beginner in such matters, I apologise in advance for any potential lack of clarity. I look forward to improving the clarity, should there be any suggestions to do so from interested readers & users of this forum.
I can think of 2 approaches to overcome this.
Approach 1:
You can use scheduled queries to copy the newly inserted rows to your second table. Write the query so that it always selects the latest rows from your main table, then use an INSERT INTO statement to append them to the second table.
Since scheduled queries run at specific times, the only drawback is that the second table will not be updated immediately whenever there is a new row in the main table; it will get the latest data whenever the scheduled query runs.
If you do not need the second table to always be up to date, this approach is the easier one to implement.
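As a rough sketch of the kind of statement such a scheduled query could run (project, dataset, table, and column names below are all hypothetical, and the incremental filter depends on your data):
from google.cloud import bigquery

client = bigquery.Client()

# Append only the rows that arrived since the last run of this query.
sql = """
INSERT INTO `my_project.my_dataset.orders_repartitioned`
SELECT *
FROM `my_project.my_dataset.orders`
WHERE _airbyte_emitted_at > (
  SELECT IFNULL(MAX(_airbyte_emitted_at), TIMESTAMP('1970-01-01'))
  FROM `my_project.my_dataset.orders_repartitioned`
)
"""

# Run once by hand to test; in practice you would save this statement as a
# BigQuery scheduled query so it keeps appending new rows on a schedule.
client.query(sql).result()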
Approach 2:
You can trigger Cloud Run (for example via Eventarc) on BigQuery events such as insert, delete, update, etc. Whenever a new row gets inserted into your main table, the Cloud Run service can insert that new data into your second table.
You can follow this article, where a detailed solution is given.
If you do need the second table to always have the latest data, this would be a good way to achieve it.
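A very rough sketch of such a service, assuming a Cloud Run container behind an Eventarc trigger for BigQuery insert events (the route, table names, and incremental filter are all hypothetical, and the same INSERT ... SELECT as in Approach 1 is reused):
from flask import Flask, request
from google.cloud import bigquery

app = Flask(__name__)
client = bigquery.Client()

# Hypothetical tables; same incremental statement as the scheduled query above.
SYNC_SQL = """
INSERT INTO `my_project.my_dataset.orders_repartitioned`
SELECT *
FROM `my_project.my_dataset.orders`
WHERE _airbyte_emitted_at > (
  SELECT IFNULL(MAX(_airbyte_emitted_at), TIMESTAMP('1970-01-01'))
  FROM `my_project.my_dataset.orders_repartitioned`
)
"""

@app.route("/", methods=["POST"])
def handle_event():
    # Eventarc delivers a Cloud Audit Log entry for the insert job;
    # this sketch ignores its contents and simply syncs the new rows.
    _event = request.get_json(silent=True)
    client.query(SYNC_SQL).result()
    return ("", 204)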

BigQuery table patch supported operations

Does anyone know the definitive list of operations supported by the tables.patch operation? I know it supports adding columns and does not support removing them, but I'm not sure what I can modify on a column, and if so, which properties can be changed.
The patch operation can do everything the update operation does. As you can see here, both have identical request and response bodies.
The difference is basically that the update method replaces the whole table resource with the representation you provide, while the patch method replaces only the fields you send. In other words, when you do an update you take your existing table definition, change some information, and persist the whole thing again; when you do a patch, only the fields you provide are merged into the existing table.
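For example, here is a small sketch with the google-cloud-bigquery Python client, whose update_table call sends a patch-style request limited to the listed fields (the table name, column name, and descriptions are hypothetical):
from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("my_project.my_dataset.my_table")  # hypothetical

# Edit one column's description, then patch only the schema and the
# table description; everything else on the table is left untouched.
new_schema = []
for field in table.schema:
    if field.name == "user_id":
        field = bigquery.SchemaField(
            field.name, field.field_type, mode=field.mode,
            description="Primary identifier of the user")
    new_schema.append(field)

table.schema = new_schema
table.description = "Orders table, cleaned up"
client.update_table(table, ["schema", "description"])
In terms of column-level changes, patching a schema lets you do things like add new columns, relax a column's mode from REQUIRED to NULLABLE, and edit column descriptions; renaming or retyping an existing column is generally not allowed this way.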

BigQuery: Best way to handle frequent schema changes?

Our BigQuery schema is heavily nested/repeated and constantly changes. For example, a new page, form, or user-info field on the website corresponds to new columns in BigQuery. Also, if we stop using a certain form, the corresponding deprecated columns will be there forever, because you can't delete columns in BigQuery.
So we're eventually going to end up with tables of hundreds of columns, many of which are deprecated, which doesn't seem like a good solution.
The primary alternative I'm looking into is to store everything as JSON (for example, each BigQuery table would just have two columns, one for the timestamp and another for the JSON data). Then the batch jobs we have running every 10 minutes would perform the joins/queries and write to aggregated tables. But with this method, I'm concerned about increasing query-job costs.
Some background info:
Our data comes in as protobuf, and we update our BigQuery schema based on the protobuf schema updates.
I know one obvious solution is to not use BigQuery and just use a document store instead, but we use BigQuery both as a data lake and as a data warehouse for BI and for building Tableau reports. So we have jobs that aggregate the raw data into tables that serve Tableau.
The top answer here doesn't work that well for us because the data we get can be heavily nested with repeats: BigQuery: Create column of JSON datatype
You are already well prepared; you lay out several options in your question.
You could go with the JSON table and, to keep costs low:
you can partition the table
you can cluster the table
So instead of having just the two timestamp + JSON columns, I would add one partitioning column and up to four clustering columns as well (four is the maximum BigQuery allows). You could even use yearly suffixed tables. This way you have several dimensions on which to restrict a rematerialization query to a limited number of rows.
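As a sketch of that layout with the google-cloud-bigquery client (the table name and the promoted partition/cluster fields are hypothetical; they would be whichever fields you pull out of the JSON):
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table: raw events stored as a JSON string plus a few
# promoted fields used only for partitioning and clustering.
schema = [
    bigquery.SchemaField("event_ts", "TIMESTAMP", mode="REQUIRED"),
    bigquery.SchemaField("event_type", "STRING"),
    bigquery.SchemaField("user_id", "STRING"),
    bigquery.SchemaField("payload", "STRING"),  # the raw JSON blob
]

table = bigquery.Table("my_project.my_dataset.raw_events", schema=schema)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY, field="event_ts")
table.clustering_fields = ["event_type", "user_id"]

client.create_table(table)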
The other option would be to change your model and add an event-processing middle layer. You could first wire all your events either to Dataflow or Pub/Sub, process them there, and write to BigQuery with the new schema. That pipeline would be able to create tables on the fly with the schema you code into your engine.
By the way, you can remove columns; that's rematerialization: you rewrite the same table with a query. You can rematerialize to remove duplicate rows as well.
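For instance, a rough sketch of such a rematerialization (table and column names are hypothetical); SELECT * EXCEPT drops the deprecated columns while rewriting the table in place, at the cost of scanning the whole table:
from google.cloud import bigquery

client = bigquery.Client()

# Rewrite the table without the deprecated columns; a similar rewrite with
# SELECT DISTINCT (where column types allow it) removes duplicate rows.
client.query("""
CREATE OR REPLACE TABLE `my_project.my_dataset.events`
PARTITION BY DATE(event_ts)
CLUSTER BY event_type, user_id
AS
SELECT * EXCEPT(old_form_field_1, old_form_field_2)
FROM `my_project.my_dataset.events`
""").result()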
I think this use case can be implemented using Dataflow (Apache Beam) with its dynamic destinations feature. The steps of the pipeline would be roughly:
read the event JSON from Pub/Sub
flatten the events and filter down to the columns you want to insert into the BQ table
with dynamic destinations, insert the data into the respective tables (if you have events of various types); the destination and its schema can be decided on the fly based on the fields in your JSON
get the failed insert records from the dynamic-destination write and, using some windowing that fits your use case (i.e. how frequently you observe such issues), write them to a file per event type
read that file, update the schema once, and load the file into that BQ table
I have implemented this logic in my use case and it is working perfectly fine.
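A minimal sketch of that shape, assuming the Beam Python SDK, a hypothetical Pub/Sub subscription, hypothetical per-event-type table naming, and a deliberately simplified fixed schema (the failed-rows/dead-letter handling described above is left out):
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

def parse_event(message_bytes):
    # Flatten/filter the raw JSON down to the columns we want to load.
    event = json.loads(message_bytes.decode("utf-8"))
    return {
        "event_type": event.get("type", "unknown"),
        "user_id": event.get("user", {}).get("id"),
        "event_ts": event.get("timestamp"),
    }

def pick_table(row):
    # Dynamic destination: route each row to a table named after its type.
    return "my_project:my_dataset.events_%s" % row["event_type"]

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (p
     | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
           subscription="projects/my_project/subscriptions/my_sub")
     | "ParseJson" >> beam.Map(parse_event)
     | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
           table=pick_table,  # callable gives a per-element destination
           schema="event_type:STRING,user_id:STRING,event_ts:TIMESTAMP",
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))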

Copy table with partitions using BigQuery

Any idea how I can duplicate a big table along with its partitions?
For example, I have a table x with 1,000,000 rows and 10 daily partitions, and I want to copy most of the data to a new table named y while keeping the partition metadata.
Is this possible?
As far as duplicating the table goes, yes, it should be possible, as stated in the docs.
If you are using the CLI you could just run a cp command like:
bq cp dataset.old_partitioned_table dataset.new_partitioned_table
Or, if you are using an API client, just make sure to set up the configuration.copy section properly when submitting the copy job.
You said in your question, though, that you want to "copy most of the data", so it looks like you want some sort of filter applied. One possibility could be duplicating the table and then running a query or DML statement afterwards to account for the changes you need.
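For reference, a small sketch with the google-cloud-bigquery client (dataset/table names and the date cutoff are hypothetical); the copy keeps the partitioning settings, and the follow-up DML applies the filter:
from google.cloud import bigquery

client = bigquery.Client()

# Copy the partitioned table; partitioning (and clustering) settings carry over.
client.copy_table("my_dataset.x", "my_dataset.y").result()

# Then drop whatever you did not want to keep, e.g. the oldest partitions.
# This assumes ingestion-time partitioning; with a partition column,
# filter on that column instead of _PARTITIONTIME.
client.query("""
DELETE FROM `my_dataset.y`
WHERE DATE(_PARTITIONTIME) < '2023-01-01'
""").result()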

SSIS Create Parent

New to SSIS (2k12).
Importing a csv file containing any new or changed PO Lines. My db has a PO master, and a POLine child, so if the PO is new, I need to insert a row into the master before loading up the child(ren). I may have half a dozen children in the POLineDetail import.
To create a master row, I have to match the ProjectNbr against the tblProjects table to get the ProjectID, and similarly with the vendor (VendorName to VendorID...). I can do this in T-SQL, but I'm not sure how best to do it using SSIS. What's the strategy?
You just need to use the Lookup transformation in the data flow task and route the unmatched records to the no-match output. The no-match output will contain the records that do not yet exist and need to be inserted, and you attach it to a destination component.
It sounds like the first step that's needed is to load the data into a staging table so that you can work with the data. From there you can use the Lookup Transformations in SSIS to do the matching to populate your master data based on your mentioned criteria. You could also use the same lookup transformation with the CSV as the source without going into the table, but I like to stage the data so that there is an opportunity to do any additional cleanup that's needed. Either way though, the lookup transformation would provide the functionality that you're looking for.