Executing spark sql in aws glue returns the column name in the queries rather than values - apache-spark-sql

running spark sql in aws glue returns the column name in the queries
data:
product,price,quantityinKG
mango,100,1
apple,200,3
peach,200,2
mango,200,2
My test query, e.g.:
select product,sum(price)
from myDataSource
group by product
The output of the query should be
product, sum(price)
mango, 300
but output is :
product, "sum(price)"
mango,
There is nothing in the sum(price) column; it only has the product name. Can you please help me with this behaviour of Glue?

First, create or replace a local temporary view from your DataFrame, and then run the SQL query:
data.createOrReplaceTempView('data_table')
spark.sql("select product, sum(price) as sum_price from data_table group by product").show()
If you are using a Glue DynamicFrame, first convert it to a Spark DataFrame with the toDF() function before creating the temp view.
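Spark isn't needed to see what the query should return: the same aggregation can be sketched with SQLite standing in for Spark SQL, using the sample data from the question.

```python
import sqlite3

# In-memory stand-in for the Spark temp view; rows mirror the question's data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data_table (product TEXT, price INTEGER, quantityinKG INTEGER)")
conn.executemany(
    "INSERT INTO data_table VALUES (?, ?, ?)",
    [("mango", 100, 1), ("apple", 200, 3), ("peach", 200, 2), ("mango", 200, 2)],
)

# The same aggregation the answer runs via spark.sql(...)
rows = conn.execute(
    "SELECT product, SUM(price) AS sum_price FROM data_table GROUP BY product ORDER BY product"
).fetchall()
print(rows)  # [('apple', 200), ('mango', 300), ('peach', 200)]
```

If the Glue job instead shows the literal column name in the results, it usually means the query ran against something other than the populated DataFrame, which is why registering the temp view first matters.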

Related

Azure Data Factory Data Flow source query support for FOR JSON AUTO

I am trying to use the query below as the source for my data flow, but I keep getting errors. Is this functionality not supported in Data Flow?
SELECT customer.customerid AS 'customerid',
customer.customer_fname AS 'fname',
customer.customer_lname AS 'lname',
customer.customer_phone AS 'Phone',
address.customer_addressid as 'addressid',
address.Address_type as 'addresstype',
address.street1 as 'street1'
FROM customer customer
INNER JOIN customer_address address
ON customer.customerid = address.customerid
order by customer.customerid
FOR JSON AUTO, ROOT('customer')
I get the following error:
Column name needs to be specified in the query, set an alias if using a SQL function
ADF V2, Data Flows, Source
The error is caused by the ORDER BY clause, which Data Flow queries don't support, not by the FOR JSON AUTO clause.
Please refer to the Data Flow Source transformation documentation:
Query: If you select Query in the input field, enter a SQL query for your source. This setting overrides any table that you've chosen in the dataset. Order By clauses aren't supported here, but you can set a full SELECT FROM statement. You can also use user-defined table functions. select * from udfGetData() is a UDF in SQL that returns a table. This query will produce a source table that you can use in your data flow. Using queries is also a great way to reduce rows for testing or for lookups.
SQL Example: Select * from MyTable where customerId > 1000 and customerId < 2000
The query works well in the Copy activity but fails in Data Flow. You need to change the query.
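For example, the same query with the ORDER BY clause removed should be accepted by the Data Flow source (a sketch; if sorting is still needed, it can be done downstream with a Sort transformation):

```sql
SELECT customer.customerid AS 'customerid',
       customer.customer_fname AS 'fname',
       customer.customer_lname AS 'lname',
       customer.customer_phone AS 'Phone',
       address.customer_addressid AS 'addressid',
       address.Address_type AS 'addresstype',
       address.street1 AS 'street1'
FROM customer customer
INNER JOIN customer_address address
    ON customer.customerid = address.customerid
FOR JSON AUTO, ROOT('customer')
```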

Azure Data Factory Exist Transformation

Is there a way, after comparing two tables, to use the case() function?
I am trying to add a new column based on the Exists transformation. In SQL I do it like this:
isnull((select 'YES' from sales where salesperson = t1.salesperson group by salesperson), 'NO') AS registeredSales
T1 is personal.
Or should I include the table into the stream of the joins and then use the case() function to compare the two columns?
If there's another way to compare these two streams, I would be pleased to hear it.
Thanks.
Flat files in a data lake can also be compared. We can use a Derived Column in the data flow to generate a new column.
I created a dataflow demo that contains two sources: CustomerSource (customer.csv stored in datalake2) and SalesSource (sales.csv, also stored in datalake2, which contains only one column), as follows:
Then I join the two sources with the column CustomerId
Then I use a Select transformation to give an alias to the CustomerId from SalesSource.
In the Derived Column transformation, I select Add column and enter the expression iifNull(SalesCustomerID, 'NO', 'YES') to generate a new column named 'registeredSales', as follows:
The last column of the result shows:
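The intent of that join-plus-derived-column pattern can be sketched outside ADF. This hypothetical Python snippet (stand-in data, not the real CSVs) mimics the left join and then the null check: 'NO' when the joined SalesCustomerID is null, 'YES' otherwise.

```python
# Hypothetical stand-in rows mirroring the two sources in the demo.
customers = [{"CustomerId": 1, "Name": "Ann"},
             {"CustomerId": 2, "Name": "Bob"},
             {"CustomerId": 3, "Name": "Cid"}]
sales = [{"SalesCustomerID": 1}, {"SalesCustomerID": 3}]  # single-column source

sales_ids = {s["SalesCustomerID"] for s in sales}

# Left join on CustomerId; a missing match plays the role of a null
# SalesCustomerID, which the derived column turns into 'NO'.
result = []
for c in customers:
    matched = c["CustomerId"] in sales_ids
    result.append({**c, "registeredSales": "YES" if matched else "NO"})

print([r["registeredSales"] for r in result])  # ['YES', 'NO', 'YES']
```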

How to convert SQL script to tableau filter?

I have SQL script as below:
select name, location, max(trans_date)
From dataset
group by name, location
I want to replicate this in Tableau using data source filters.
Create a calculated field with this formula:
[trans_date]={FIXED [name],[location]:MAX([trans_date])}
Set it to True in the data source filters
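What that FIXED LOD filter does can be sketched in plain Python (hypothetical rows, not the real dataset): compute MAX([trans_date]) per (name, location) group, then keep only the rows whose trans_date equals the group maximum.

```python
# Hypothetical rows mirroring the dataset's columns.
rows = [
    {"name": "a", "location": "x", "trans_date": "2021-01-01"},
    {"name": "a", "location": "x", "trans_date": "2021-03-01"},
    {"name": "b", "location": "y", "trans_date": "2021-02-01"},
]

# {FIXED [name],[location]: MAX([trans_date])} computes one max per group...
max_per_group = {}
for r in rows:
    key = (r["name"], r["location"])
    if key not in max_per_group or r["trans_date"] > max_per_group[key]:
        max_per_group[key] = r["trans_date"]

# ...and the data source filter keeps only rows matching that max.
kept = [r for r in rows if r["trans_date"] == max_per_group[(r["name"], r["location"])]]
print(kept)
```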

Optimize view that dynamically choose a table or another

So the problem is that I have three huge tables with the same structure, and I need to show the results of one of them depending on the result of another query.
So my order table looks like that:
code order
A 0
B 2
C 1
And I need to retrieve data from t_results
My approach (which is working) looks like this:
select *
from t_results_a
where 'A' in (
select code
from t_order
where order = 0
)
UNION ALL
select *
from t_results_b
where 'B' in (
select code
from t_order
where order = 0
)
UNION ALL
select *
from t_results_c
where 'C' in (
select code
from t_order
where order = 0
)
Is there any way to avoid scanning all three tables? I am working with Athena, so I can't write procedural code.
I presume that changing your database schema is not an option.
If it were, you could use one database table and add a CODE column whose value would be either A, B or C.
Basically the result of the SQL query on your ORDER table determines which other database table you need to query. For example, if CODE in table ORDER is A, then you have to query table T_RESULTS_A.
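Under that schema change, the three result tables collapse into one table with a code column, and the whole thing becomes a single join. A sketch using SQLite as a stand-in for Athena (the value column is hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "order" is a reserved word, so it is quoted here.
conn.execute('CREATE TABLE t_order (code TEXT, "order" INTEGER)')
conn.executemany("INSERT INTO t_order VALUES (?, ?)", [("A", 0), ("B", 2), ("C", 1)])

# One merged results table with a code column instead of t_results_a/_b/_c.
conn.execute("CREATE TABLE t_results (code TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO t_results VALUES (?, ?)",
    [("A", 10), ("A", 11), ("B", 20), ("C", 30)],
)

rows = conn.execute(
    'SELECT r.code, r.value FROM t_results r '
    'JOIN t_order o ON o.code = r.code WHERE o."order" = 0'
).fetchall()
print(sorted(rows))  # [('A', 10), ('A', 11)]
```

With the table partitioned by code, an engine like Athena would then only read the matching partition.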
You wrote in your question
I am working with Athena so I can't program
I see that there is both an ODBC driver and a JDBC driver for Athena, so you can program with either .NET or Java. So you could write code that queries the ORDER table and use the result of that query to build another query string to query just the relevant table.
Another thought I had was dynamic SQL. Oracle database supports it. I can create a string containing variables where one variable is the database table name and have Oracle interpret the string as SQL and execute it. I briefly searched the Internet to see whether Athena supports this (as I have no experience with Athena) but found nothing - which doesn't mean to say that it does not exist.
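The two-step approach suggested above (query the order table first, then build the second query string) can be sketched like this, with SQLite standing in for an Athena connection and table names taken from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE t_order (code TEXT, "order" INTEGER)')
conn.executemany("INSERT INTO t_order VALUES (?, ?)", [("A", 0), ("B", 2), ("C", 1)])
conn.execute("CREATE TABLE t_results_a (value INTEGER)")
conn.execute("INSERT INTO t_results_a VALUES (42)")

# Step 1: find which code is active.
code = conn.execute('SELECT code FROM t_order WHERE "order" = 0').fetchone()[0]

# Step 2: build the second query so only the relevant table is scanned.
# The allow-list guards against injecting an arbitrary table name.
assert code in ("A", "B", "C")
rows = conn.execute(f"SELECT * FROM t_results_{code.lower()}").fetchall()
print(rows)  # [(42,)]
```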

Athena partition locations

I can view all the partitions on my table using
show partitions my_table
and I can see the location of a partition by using
describe formatted my_table partition (partition_col='value')
but I have a lot of partitions, and don't want to have to parse the output of describe formatted if it can be avoided.
Is there a way to get all partitions and their locations, in a single query?
There's no built-in or consistent way to get this information.
Assuming you know your partition column(s), you can get this information with a query like
select distinct partition_col, "$path" from my_table
The cheapest way to get the locations of the partitions of a table is to use the GetPartitions call from the Glue API. It will list all partitions, their values and locations. You can try it out using the AWS CLI tool like this:
aws glue get-partitions --region us-somewhere-1 --database-name your_database --table-name the_table
Using SQL like SELECT DISTINCT partition_col, "$path" FROM the_table could be expensive since Athena unfortunately scans the whole table to produce the output (it could have just looked at the table metadata but that optimization does not seem to exist yet).
Using boto3 (as of version 1.12.9), the following returns the complete list:
import boto3

glue_client = boto3.client("glue")
glue_paginator = glue_client.get_paginator("get_partitions")
pages_iter = glue_paginator.paginate(
    DatabaseName=db_name, TableName=table_name
)
res = []
for page in pages_iter:
    for partition in page["Partitions"]:
        res.append(
            {
                "Values": partition["Values"],
                "Location": partition["StorageDescriptor"]["Location"],
            }
        )
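The flattening loop itself can be exercised without AWS by stubbing the pages the get_partitions paginator would yield (the bucket paths below are hypothetical):

```python
# Hypothetical pages shaped like the Glue get_partitions responses.
pages_iter = [
    {"Partitions": [
        {"Values": ["2020-01-01"],
         "StorageDescriptor": {"Location": "s3://bucket/t/dt=2020-01-01/"}},
    ]},
    {"Partitions": [
        {"Values": ["2020-01-02"],
         "StorageDescriptor": {"Location": "s3://bucket/t/dt=2020-01-02/"}},
    ]},
]

# Same flattening as the boto3 snippet: one dict per partition.
res = []
for page in pages_iter:
    for partition in page["Partitions"]:
        res.append(
            {"Values": partition["Values"],
             "Location": partition["StorageDescriptor"]["Location"]}
        )

print(len(res))  # 2
```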