Filter data in Spark SQL in Databricks - apache-spark-sql

When I run the command below in Databricks, I get output with three columns called name, id, and age, but when I try to filter on name, I get an error saying the Name column does not exist. What am I doing wrong?
%sql
SELECT inline(environment.details) FROM TableA
This correctly gives me a table with 3 columns.
Now I filter on Name like this:
%sql
SELECT inline(environment.details) FROM TableA where `Name` == "XYZ"
and I get an error saying the Name column does not exist. What is wrong here? Also, can someone tell me how I can export the resultant output?
Thanks

Filtering happens before you expand your array of structs. You have two choices here:
Use common table expressions to explode first & then filter:
WITH exploded AS (
  SELECT inline(environment.details) FROM TableA
)
SELECT * FROM exploded WHERE name = 'XYZ'
Use the filter function to filter the data inside the array with something like this (not tested), though it may require doing the filtering twice:
SELECT inline(filter(environment.details, x -> x.Name = 'XYZ'))
FROM TableA
WHERE array_size(filter(environment.details, x -> x.Name = 'XYZ')) > 0
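As for exporting the resultant output: it depends on the target, but one simple route (a sketch, not tested; the table name filtered_details is made up) is to materialize the filtered rows with CREATE TABLE ... AS SELECT, after which they can be downloaded from the Databricks results UI or read by another tool:
%sql
-- filtered_details is a hypothetical name; CTAS materializes the result
CREATE OR REPLACE TABLE filtered_details AS
WITH exploded AS (
  SELECT inline(environment.details) FROM TableA
)
SELECT * FROM exploded WHERE name = 'XYZ';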

Related

add table_id to the result from multiple tables in BigQuery

Below is how I structured the data in a BigQuery database.
test
-> sales
-> monthly-2015
-> monthly-2016
-> ...
I want to combine the data of all tables whose names match monthly-*, and below is how I wrote the SQL from examples I found.
Running this SQL leads to the following error: Scalar subquery produced more than one element. How can I fix this error?
SELECT
  *,
  (
    SELECT table_id
    FROM `test.sales.__TABLES_SUMMARY__`
    WHERE table_id LIKE 'monthly-%'
  )
FROM
  `test.sales.monthly*`
I want to combine the data of all tables with the table name monthly-*
Try below
SELECT *, 'monthly_' || _TABLE_SUFFIX as table_name
FROM `test.sales.monthly_*`
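If you ever need only some of the monthly tables instead of all of them, the same wildcard supports filtering on _TABLE_SUFFIX in the WHERE clause; a small sketch (the year range is illustrative):
SELECT *, 'monthly_' || _TABLE_SUFFIX AS table_name
FROM `test.sales.monthly_*`
-- scans only the tables whose suffix falls in the range
WHERE _TABLE_SUFFIX BETWEEN '2015' AND '2016'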

I need to create a VIEW from an existing TABLE and MAP an additional COLUMN to that VIEW

I am fairly new to SQL. What I am trying to do is create a view from an existing table. I also need to add a new column to the view which maps to the values of an existing column in the table.
So within the view, if the value in a field for Col_1 = A, then the value in the corresponding row for New_Col = C, and so on.
Does this even make sense? Would I use the CASE clause? Is mapping in this way even possible?
Thanks
The best way to do this is to create a mapping or lookup table.
For example, consider the following LOOKUP table:
COL_A  NEW_VALUE
-----  ---------
A      C
B      D
Then you can have a query like this:
SELECT A.*, LOOK.NEW_VALUE
FROM TABLEA AS A
JOIN LOOKUP AS LOOK ON A.COL_A = LOOK.COL_A
This is what DimaSUN is doing in his query too, but in his case he is creating the table dynamically in the body of the query.
Also note, I'm using a JOIN (an inner join), so only rows with a match in the lookup table will be returned; this could filter the results. A LEFT JOIN would return all rows from A, but some values of the new column might be NULL.
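Since the question asks for a view, the lookup join can be wrapped in one; a minimal sketch (the view name is made up, and a LEFT JOIN is used so rows without a mapping are kept):
-- TableA_WithNewCol is a hypothetical view name
CREATE VIEW TableA_WithNewCol AS
SELECT A.*, LOOK.NEW_VALUE AS New_Col
FROM TABLEA AS A
LEFT JOIN LOOKUP AS LOOK ON A.COL_A = LOOK.COL_A;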
Generally, a view is a saved query over a table: it reflects the table's current data without copying it. So, as per your requirement, you can derive the extra column in a view by using CASE.
CREATE VIEW viewname AS
SELECT a.*,
  CASE
    WHEN a.Col_1 = 'A' THEN 'C'
    WHEN a.Col_1 = 'B' THEN 'D'
    ELSE NULL
  END AS New_Col
FROM (SELECT * FROM table) a
If you have a restricted list of replaced values, you may hardcode that list in the query:
SELECT T.*, map.New_Col
FROM ExistingTable T
LEFT JOIN (
  VALUES
    ('A', 'C'),
    ('B', 'D')
) map (Col_1, New_Col) ON map.Col_1 = T.Col_1
In this sample you hardcode 'A' -> 'C' and 'B' -> 'D'.
In the general case, you may be better off using an additional table (see Hogan's answer).
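If you go the lookup-table route, a persistent version can be created with ordinary DDL; a sketch with assumed column types:
-- types and names are illustrative
CREATE TABLE LOOKUP (
  COL_A     varchar(10) PRIMARY KEY,
  NEW_VALUE varchar(10)
);
INSERT INTO LOOKUP (COL_A, NEW_VALUE) VALUES ('A', 'C'), ('B', 'D');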

How to query a Postgres `RECORD` datatype

I have a query that will return a row as a RECORD data type from a subquery - see below for example:
select *
from (
  select row(st.*) table_rows
  from some_table st
) x
where table_rows[0] = 339787
I am trying to further qualify it in the WHERE clause and I need to do so by extracting one of the nodes in the returned RECORD data type.
When I do the above, I get an error saying:
ERROR: cannot subscript type record because it is not an array
Does anybody know of a way of implementing this?
Use (row).column_name. You can just refer to the table itself to create the record:
select *
from (
  select r
  from some_table r
) x
where (r).column_name = 339787
There is a small chance that a column is later created with the same name as the alias you chose, in which case the above query will fail, as select r would return the newly created column instead of the record. The first solution is to use the row constructor, as you did in your question:
select row(r.*) as r
The second solution is to use the schema qualified name of the table:
select my_schema.some_table as r
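Putting that together, the full query with the schema-qualified reference might look like this (assuming the table lives in a schema named my_schema):
select *
from (
  -- referencing the schema-qualified table name yields the whole-row record
  select my_schema.some_table as r
  from my_schema.some_table
) x
where (r).column_name = 339787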
Alternately, you can try this:
select *
from (
  select *
  from tbl
) x
where x.col_name = 339787

Compare comma separated list with individual row in table

I have to compare comma-separated values with a column in a table and find out which values are not in the database (a kind of master data validation). Please have a look at the sample data below:
table data in database:
id name
1 abc
2 def
3 ghi
SQL part:
Here I am getting a comma-separated list like ('abc','def','ghi','xyz').
Now xyz is an invalid value, so I want to take that value and return it as output saying "invalid value".
It is possible if I split those values, put them in a temp table, and loop through each value, comparing one by one.
But is there any other, more optimal way to do this?
I'm not sure if I got the question right; however, I would personally be trying to get to something like this:
SELECT
  D.id,
  CASE
    WHEN B.Name IS NULL THEN D.name
    ELSE 'invalid value'
  END AS Result
FROM data AS D
LEFT JOIN badNames B ON B.Name = D.Name
-- as SQL string comparison is case insensitive by default, the equals sign should work
There is one table with bad names, or invalid values if you prefer. This can be a temporary table as well, depending on usage (a blacklist of words should be a regular table; ad hoc invalid values provided by a service should be a temp table, etc.).
NOTE: The SELECT above can be nested in a view, so the data remains as it was, yet you gain the correctness information. Otherwise, I would create a cursor inside a function that goes through a SELECT like the one above and alters the original data, if that is the goal...
It sounds like you just need a NOT EXISTS / LEFT JOIN, as in:
SELECT tmp.InvalidValue
FROM dbo.HopeThisIsNotAWhileBasedSplit(@CSVlist) tmp
WHERE NOT EXISTS (
  SELECT *
  FROM dbo.Table tbl
  WHERE tbl.Field = tmp.InvalidValue
);
Of course, depending on the size of the CSV list coming in, the number of rows in the table you are checking, and the style of splitter you are using, it might be better to dump the CSV to a temp table first (as you mentioned doing in the question).
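If you are on SQL Server 2016 or later, the built-in STRING_SPLIT function saves you from writing a splitter at all; a sketch (dbo.YourTable and its name column are assumptions based on the question):
DECLARE @CSVlist varchar(max) = 'abc,def,ghi,xyz';

-- each CSV item with no match in the table is reported as invalid
SELECT s.value AS InvalidValue
FROM STRING_SPLIT(@CSVlist, ',') s
WHERE NOT EXISTS (
  SELECT 1
  FROM dbo.YourTable t
  WHERE t.name = s.value
);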
Try the following query:
SELECT SplitedValues.name,
  CASE WHEN YourTable.Id IS NULL THEN 'invalid value' ELSE NULL END AS Result
FROM SplitedValues
LEFT JOIN YourTable ON SplitedValues.name = YourTable.name

How to select items with all possible id-s or just a particular one using the same query?

Is there a variable in SQL that can be used to represent ALL the possible values of a field? Something like this pseudo-code
SELECT name FROM table WHERE id = *ALL_EXISTING_ID-s*
I want to return all rows in this case, but later, when I do a search and need only one item, I can simply replace that variable with the id I'm looking for, i.e.
SELECT name FROM table WHERE id = 1
The simplest way is to remove the WHERE clause. This will return all rows.
SELECT name FROM table
If you want some "magic" value you can use for the ID that you can use in your existing query and it will return all rows, I think you're out of luck.
Though you could use something like this:
SELECT name FROM table WHERE id = IFNULL(?, id)
If the value NULL is provided, all rows will be returned.
If you don't like NULL then try the following query, which will return all rows if the value -1 is provided:
SELECT name FROM table WHERE id = IFNULL(NULLIF(?, -1), id)
Another approach that achieves the same effect (but requires binding the id twice) is:
SELECT name FROM table WHERE (id = ? OR ? = -1)
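To see how the sentinel behaves, substitute literal values for the placeholder:
-- passing -1 makes the predicate id = id, so all rows come back
SELECT name FROM table WHERE id = IFNULL(NULLIF(-1, -1), id);
-- passing a real id, e.g. 1, filters to just that row
SELECT name FROM table WHERE id = IFNULL(NULLIF(1, -1), id);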