Neo4j Cypher - adding a property with LOAD CSV - properties

I have a set of nodes created from file_A, which contains a column with the 'id' of each node. They were created using this Cypher query (executed from Java):
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:file_A' AS line FIELDTERMINATOR '\t'
CREATE (c:Node {nodeId:line.id})
Now I have another file (file_B) which contains four columns: id, description, prop2 and prop3. I need to assign a description (property 'nodeDesc') to each of the nodes created before. These descriptions are read from the 'description' column of file_B. Moreover, for the value to be assigned to the 'nodeDesc' property of a node, both 'prop2' and 'prop3' must be equal to '1'. For this purpose I use this Cypher query:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:file_B' AS line FIELDTERMINATOR '\t'
MATCH (c:Node)
WHERE c.nodeId=line.id AND line.prop2='1' AND line.prop3='1'
SET c.nodeDesc = line.description
file_B contains some descriptions for each node, but only one of them has both 'prop2' and 'prop3' equal to '1'. And that is the one I want to assign to the property of the node.
The problem after executing the previous query is that some of the nodes end up without a description. After performing several tests, I have verified that the MATCH of 'nodeId' against the 'id' column of file_B fails for those nodes, even though the 'nodeId' does appear in that column and both 'prop2' and 'prop3' are equal to '1'.
Note: file_A has approximately 400,000 rows, and file_B has approximately 1,300,000 rows.
Thanks.

You may want to make sure you aren't comparing integers to strings. That can often be the source of mismatches like these.
And if both values are strings, then you may want to check to see if one string or the other has trailing (or preceding) spaces.
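For example, a whitespace- and type-tolerant variant of the query (a minimal sketch, assuming the values are meant to match as strings; trim() and toString() are built-in Cypher functions in recent Neo4j versions) would look like this:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:file_B' AS line FIELDTERMINATOR '\t'
MATCH (c:Node)
WHERE trim(toString(c.nodeId)) = trim(line.id)
  AND trim(line.prop2) = '1' AND trim(line.prop3) = '1'
SET c.nodeDesc = line.description
If this version matches the rows the original query missed, the culprit is stray whitespace or a type mismatch in the source data.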

Extract all elements from a string separated by underscores using regex

I am trying to capture all elements from the string below (a campaign name for an advertisement), separated by underscores, store each in a separate column, and then compare them with a master table holding the true values, to determine how accurately the data is being recorded.
Example input:
Expected output:
My first element extraction was: REGEXP_EXTRACT(campaign_name, r"[^_+]{3}") AS parsed_campaign_agency
I only extracted the first 3 letters because, according to the naming convention (truth table), the agency name is made of only 3 letters.
Caveat: some elements can have variable lengths too, e.g. the third element "CrossBMC" could be 3 letters in length or more.
I am new to regex, and the data lies in a SQL table (in BigQuery), so I thought it could be achieved via SQL's REGEXP_EXTRACT; what I am having trouble with is extracting all elements at once.
Any help is appreciated :)
If the number of underscores is constant and known, you can use SUBSTRING_INDEX like this:
SELECT
SUBSTRING_INDEX(campaign_name,'_',1) first,
SUBSTRING_INDEX(SUBSTRING_INDEX(campaign_name,'_',2),'_',-1) second,
SUBSTRING_INDEX(SUBSTRING_INDEX(campaign_name,'_',3),'_',-1) third
FROM your_table;
Here you can try an example: SQLize.online
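Note that SUBSTRING_INDEX is a MySQL function; since the question mentions BigQuery, a roughly equivalent sketch in BigQuery standard SQL (using SPLIT and SAFE_OFFSET, which returns NULL when an element is missing; your_table and campaign_name come from the question) would be:
SELECT
  SPLIT(campaign_name, '_')[SAFE_OFFSET(0)] AS first_element,
  SPLIT(campaign_name, '_')[SAFE_OFFSET(1)] AS second_element,
  SPLIT(campaign_name, '_')[SAFE_OFFSET(2)] AS third_element
FROM your_table;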

What should I use for a default value for a json column in SQL Server?

Background: I have been tasked with grafting some simple key/value data pairs to an existing database table in SQL Server (in Azure). The nature of the KvP data is simply some extended data that may or may not exist for all rows.
Further, the data is somewhat freeform, as not all rows will have the same key/value pairs. This is very much bolt-on data that (in my opinion) doesn't merit the complexity of a related table. Instead, I've decided to try using JSON to hold the data and so, to get my feet wet, I've tried the following:
First, I created a new column on my table as follows:
ALTER TABLE [TheTable]
ADD [ExtendedData] NVARCHAR(512) NOT NULL DEFAULT('')
Second, I picked a few records at random and added some additional JSON in the newly created column, for example:
{ "Color":"Red", "Size":"Big", "Shape":"Round" }
Finally, I expected to be able to query this extra data, by using the JSON_VALUE function in SQL, like this:
SELECT
Field1,
Field2,
JSON_VALUE(ExtendedData, '$.Color') AS Color,
JSON_VALUE(ExtendedData, '$.Size') AS Size
FROM
MyTable
I expected my output to be a result set with 4 columns (Field1, Field2, Color, Size) where some (most) of the Color and Size values were NULL (because the majority of rows simply do not have any JSON data), but instead I got an error complaining:
JSON text is not properly formatted
This led me to suspect that ALL of my ExtendedData should be properly formatted JSON for my new query to work, and so replacing my default column value of '' (an empty string) with '{}' seemingly fixes my problem.
But I am left wondering if this is the correct solution. Should I indeed default my new ExtendedData column to use an empty json object '{}', or is it safe to use an empty string '' and I am missing something syntactically in my query?
Without any evidence to the contrary, and working within the rules established for this database, I've decided to use a default value of '{}' for my JSON data.
If anyone else does this, be careful, as some APIs/parsers/IDEs might not like the string '{}' and require you to escape the sequence as '{{}}'.
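If you settle on '{}', one option worth considering (a sketch, assuming SQL Server 2016 or later, where ISJSON is available; the constraint names are illustrative) is to pair the default with a check constraint so that malformed JSON can never be stored in the column:
ALTER TABLE [TheTable]
ADD [ExtendedData] NVARCHAR(512) NOT NULL
    -- Default to an empty JSON object rather than an empty string
    CONSTRAINT DF_TheTable_ExtendedData DEFAULT ('{}'),
    -- Reject anything that is not well-formed JSON (SQL Server 2016+)
    CONSTRAINT CK_TheTable_ExtendedData_IsJson CHECK (ISJSON([ExtendedData]) = 1);
With the constraint in place, JSON_VALUE will never hit malformed text at query time.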

How to filter rows by length in Kettle

I'm using a row filter to filter out rows whose field values are longer than a given length. Under the filter conditions there is no condition for checking string length.
So the workaround is to use:
Field1 REGEXP [^.{0,80}$]
OR
Field1 IS NULL
Field2 REGEXP [^.{0,120}$]
OR
Field2 IS NULL
Length check is a very common requirement. Is there a function/simpler way to do this that I'm missing?
Use Data Validator step:
Create a new validation for every column you want to check and set "Max string length" for every validation created.
You can redirect erroneous rows using "Error handling of step" hop:
By default these rows have same structure and values as the input rows, but you can also include additional information, such as the name of the erroneous column or error description.
Alternatively, you can compute a string length before filtering using calculator step, but it may create a lot of additional columns if you have multiple columns to check.
And, of course, you can always perform such checks in User Defined Java Class or Modified Java Script Value.
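For instance, in a Modified Java Script Value step, a minimal sketch of the length check (the field names Field1/Field2 and the limits 80/120 come from the question; valid_row is a hypothetical Boolean output field you would declare in the step's Fields grid) could be:
// Convert to JavaScript strings first ('' + value), then compare lengths.
// Null values are treated as valid, mirroring the IS NULL branches above.
var s1 = (Field1 == null) ? null : '' + Field1;
var s2 = (Field2 == null) ? null : '' + Field2;
var valid_row = (s1 == null || s1.length <= 80) &&
                (s2 == null || s2.length <= 120);
A plain Filter Rows step on valid_row = true then does the rest.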
Assuming you are talking about strings, you can use a Calculator step with the somewhat hard to find calculation "Return the length of a string A". That will give you the values for your Filter Rows step.

Import from CSV fails if there are more than one records in the csv file

Import from CSV fails if there is more than one record in the CSV file. In this sample file, the data is delimited by a single space (ASCII value). The problem is that every record has a single space even after the last column value; when the system encounters this trailing space on each line, it treats it as another column value and does not move on to the next record (as it is unable to find the newline character).
Is there any way to specify that the single space after the last column value on each line should be ignored?
Is there any way to treat this last single space on each line as a newline character?
I have thousands of rows, so it is impractical to manually replace the last single space with some end-of-line character.
On another note, is there any good ETL tool that can easily move raw data into Cassandra and avoid this kind of problem?
Error message
$COPY sensors_data(samplenumber,magx,magy,magz,accx,accy,accz,gyror,gyrop,gyroy,lbutton,rbutton) FROM '/home/swift/cassandra/input-data/FallFromDesk1.csv' WITH DELIMITER=' ';
Record #0 (line 1) has the wrong number of fields (13 instead of 12).
Note
The above command works perfectly if there is only 1 row in the .csv file, or if we manually remove the single space after the last column value on each row.
Kindly help me out.
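One common workaround (a sketch, assuming a Unix-like shell with GNU sed; the file path is taken from the command above) is to strip the single trailing space from every line before importing:
sed -i 's/ $//' /home/swift/cassandra/input-data/FallFromDesk1.csv
After that, the original COPY command should see exactly 12 fields per line. On macOS/BSD sed the in-place flag needs an argument: sed -i '' 's/ $//' file.csv.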

How to import a CSV to postgresql that already has ID's assigned?

I have 3 CSV files that are about 150k rows each. They have already been given IDs in the CSV, and the associations between them are already in place. Is there a simple way to skip the auto-assignment of the id value and instead use what is already in the CSV?
A serial column only draws the next number from a sequence by default. If you write a value to it, the default does not kick in. You can just COPY to the table (see @Saravanan's answer below) and then update the sequence accordingly. One way to do this:
SELECT setval('tbl_tbl_id_seq', max(tbl_id)) FROM tbl;
tbl_id being the serial column of table tbl, drawing from the sequence tbl_tbl_id_seq (default name).
Best done in a single transaction in case of concurrent load.
Note, there is no off-by-1 error here. Per documentation:
The two-parameter form sets the sequence's last_value field to the
specified value and sets its is_called field to true, meaning that the
next nextval will advance the sequence before returning a value.
Bold emphasis mine.
You can directly copy the CSV records into the Postgres table.
COPY table_name FROM '/path/to/csv' DELIMITER ',' CSV;
By following the above method, we can avoid creating the records through ActiveRecord objects.
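Putting the two answers together, a minimal sketch of the full import (tbl, tbl_id and the file path are placeholders; pg_get_serial_sequence looks up the sequence name so it doesn't have to be hard-coded) could be:
BEGIN;
-- Load the rows, keeping the IDs already present in the CSV.
COPY tbl FROM '/path/to/csv' DELIMITER ',' CSV;
-- Advance the sequence past the highest imported ID so future inserts don't collide.
SELECT setval(pg_get_serial_sequence('tbl', 'tbl_id'), max(tbl_id)) FROM tbl;
COMMIT;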