I have a scenario where a particular column "workData" holds some JSON in string form. I have to find records where Text is either empty or the Text of both objects is the same. For the empty-string case I can apply a WHERE clause such as WHERE workData not like '%"Text":""%', but I am clueless as to how to do the second part. Assume the table name is "workforce".
Any help will be highly appreciated.
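A minimal sketch of what the combined filter might look like, assuming a database with JSON functions (SQL Server's JSON_VALUE is used here) and assuming workData holds an array of exactly two objects; both the function and the JSON paths are assumptions, since the exact structure isn't shown:

-- Sketch only: assumes SQL Server 2016+ and a payload shaped like
-- '[{"Text":"a"},{"Text":"b"}]'
SELECT *
FROM workforce
WHERE JSON_VALUE(workData, '$[0].Text') = ''                                 -- Text is empty
   OR JSON_VALUE(workData, '$[0].Text') = JSON_VALUE(workData, '$[1].Text'); -- both Texts match

On a database without JSON functions, string functions such as SUBSTRING and CHARINDEX would be the fallback.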
I have two questions about Pentaho Kettle, and I need some help please!
So, I have a CSV file with some data. In one of the columns, the file has some dates (years). The first problem is that some rows have "None" in that column while other rows have the date in the right format.
This image should help to "see" the problem:
[image: Problem One]
To resolve this problem, I changed the data type in the input file and in the database to String. That works, but I think that's not the correct way to do it. I also tried to use the "Filter Rows" step, but it didn't work. Some help please? :)
The second problem is about a null value in the date field. The database expects to receive a date value, but some of the values are null. Once again, this image should help to "see" the problem:
[image: Problem Two]
What can I do to resolve both problems? What is the right way to not only resolve them, but also keep good performance when querying the data later?
Thanks very much!
Best regards!
For the first problem, reading the column as a String in the input step is fine; after that, use a Select Values step, where you can change the String to a Date format.
For the second problem, use a Filter Rows step to separate the rows that contain "None", then replace "None" with null and link the result to your next step.
For the "None" String value in the Year column you can first read that column as String then you can use the Step called "Null if" and give "None" as the Value to turn to NULL. Later you can make this Year column as Integer type in the Select Values.
For the second problem, since your table design expects a non-null value for the date column, you could change the not-null constraint to nullable. Or, if you want a default value for such null values, you can use the step "If field value is null" and specify the default value there.
If you want to reuse the non-null date value from previous rows, you can set Repeat to Y in the Fields tab of the Text file input step.
Alternatively, for both cases, you can try a "Value Mapper" step to map "None" to something your database can accept.
So I'm trying to find a simple way to create a new column that displays the difference between two existing columns (each containing numbers), but I can't seem to find the proper GREL expression.
Specifically, I'm trying to find the number of items sold using a column named "Stock_before" and another named "Stock_after".
I click "Edit column" on the "Stock_before" column and then "Add column based on this column".
The GREL I have entered so far is:
value-cells["Stock_after"]
It returns no syntax error, but all of the preview cells still say "null". I have already transformed the values of both columns to numbers.
For Python I have tried:
substract(value,"Stock_after")
Same thing: no syntax error, but still everything is null.
This seems so ridiculously simple but I couldn't find an answer... You can guess I'm fairly new to all this :) Hope someone out there can help me!
Thanks for having the patience to read this, and thanks for your time if you answer!
I'd like something similar to this (3 columns):
Stock_before, Stock_after, dif
1,1,0
3,1,2
4,4,0
2,1,1
In GREL, the expression cells["Stock_after"] returns a Cell object representing the corresponding cell, not the actual value of that cell. To get the value, you need to use cells["Stock_after"].value.
So your final GREL expression should be value - cells["Stock_after"].value.
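If you would rather stay with the Python/Jython evaluator you tried, the same fix applies there. A quick sketch, based on OpenRefine's documented cells binding for Jython (note that Jython expressions need an explicit return):

return value - cells["Stock_after"]["value"]

With the sample data in the question, the second row would give 3 - 1 = 2, matching the expected dif column.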
You should also make sure your values are stored as numbers, not strings: they should appear in green in the table. If they do not, use a "To number" operation on both columns first.
You can find out more about GREL and Cell objects here:
https://github.com/OpenRefine/OpenRefine/wiki/Variables
I have a Spark DataFrame where one column has the type Set<text>.
This column contains a set of strings, for example ["eenie","meenie","mo"].
How do I filter the whole DataFrame so that I only get those rows that (for example) contain the value eenie in the set?
I'm looking for something similar to
dataframe.where($"list".contains("eenie"))
The example shown above is only valid when the content of the column list is a string, not a Set. What alternatives are there to fit my circumstances?
Edit: My question is not a duplicate. The user in that question has a set of values and wants to know which ones are located inside a specific column. I have a column that contains a set, and I want to know if a specific value is part of the set. My approach is the opposite of that.
Try:
import org.apache.spark.sql.functions.array_contains
dataframe.where(array_contains($"list", "eenie"))
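For context, a self-contained sketch of the same idea; the column names and values here are made up to mirror the question, and spark refers to an active SparkSession (provided automatically in spark-shell):

import org.apache.spark.sql.functions.array_contains
import spark.implicits._ // provides toDF and the $ column syntax

// Hypothetical data shaped like the question's Set column
val dataframe = Seq(
  (1, Seq("eenie", "meenie", "mo")),
  (2, Seq("meenie", "miney"))
).toDF("id", "list")

// Keep only rows whose array contains "eenie" (row 1 here)
dataframe.where(array_contains($"list", "eenie")).show()

Note that Spark SQL models such a column as an ArrayType, which is why array_contains is the right tool even though the source type was described as a Set.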
I need to create a query on the fly through VBA that gets some strings from a table based on a set of criteria selected by the user. From the results of that first query, I need to find the position of certain characters in the strings and select only those that have the characters at a given position. This needs to be done quickly, as the result is then displayed in a combo box for the user to pick from before running another, full query.
So my question is: what is the fastest/best way to do this?
Should I put the result of the first query in a temporary table, analyze the strings from that table, delete the records that don't meet the selection, and then run a query against that table to populate the combo box?
Thank you
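One way to avoid a temporary table is to push the position test into the query itself, since Access SQL can call VBA's Mid() function directly in a WHERE clause. A rough sketch; the table, field, and control names are all invented for illustration:

Dim strSQL As String
' Combine the user's criteria with the character-position test in one statement:
' keep rows whose 3rd character is "X" (adjust position and character as needed)
strSQL = "SELECT txtCode FROM tblCodes " & _
         "WHERE category = '" & Me.cboCategory & "' " & _
         "AND Mid(txtCode, 3, 1) = 'X'"
Me.cboResults.RowSource = strSQL  ' a combo box RowSource can be a SQL string

Access evaluates Mid() per row, so for modest result sets this is usually fast enough to fill a combo box without an intermediate table.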
I am using Pentaho BI Server with Data Integration to get a data set and merge data from two input tables. Now I need to replace the value of a cell in each row if two other columns in the same row match my criteria. How can I accomplish this with Kettle?
I need to match many values against the cell value in each row; the values to be matched are inside an Excel sheet.
I have tried using the "Replace in string" step, but it does not work :( Can you help me in this regard?
You can do compound tests with the Filter Rows step. For a result that passes, you can follow up with a Set Field Value step. It would look something like this: [screenshot: Filter Rows feeding into Set Field Value]
The way to get the "AND" condition in the Filter Rows step is to click the little icon at the far upper right of the condition box. Note also that you must supply a value for 'ReplaceVal' earlier in the transformation (I just hard-coded it in a Data Grid).
EDIT: Based on the wording of your question, your criterion is a simple null check. "IS NULL" is one of the conditions available in the Filter Rows step.