Converting a Spark SQL query to DataFrame transformations - apache-spark-sql

I am trying to rewrite a Spark SQL query as a DataFrame transformation using groupBy and aggregate. Below is the original Spark SQL query.
result = spark.sql(
"select date, Full_Subcategory, Budget_Type, SUM(measure_value) AS planned_sales_inputs FROM lookups GROUP BY date, Budget_Type, Full_Subcategory")
Below is the DataFrame transformation that I am trying to do.
df_lookups.groupBy('Full_Subcategory','Budget_Type','date').agg(col('measure_value'),sum('measure_value')).show()
But I keep getting the error below.
Py4JJavaError: An error occurred while calling o2475.agg.
: org.apache.spark.sql.AnalysisException: cannot resolve '`measure_value`' given input columns: [Full_Subcategory, Budget_Type, date];;
'Aggregate [Full_Subcategory#278, Budget_Type#279, date#413], [Full_Subcategory#278, Budget_Type#279, date#413, 'measure_value, sum('measure_value) AS sum(measure_value)#16168]
I am pretty sure this has something to do with the columns I am grouping by and those columns being present in the select clause.
Kindly help.

I think it's because you are calling col('measure_value') inside the agg function, which does not make sense to me, because you are not aggregating any value that way.
Just remove col('measure_value') from agg and you will get the right result.
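For reference, a minimal sketch of what the corrected transformation might look like, assuming df_lookups is the DataFrame from the question; importing pyspark.sql.functions.sum also avoids accidentally picking up Python's built-in sum:
from pyspark.sql import functions as F

# Group by the same three columns as the SQL query and aggregate only measure_value;
# the alias mirrors planned_sales_inputs from the original statement.
result = (df_lookups
          .groupBy('date', 'Budget_Type', 'Full_Subcategory')
          .agg(F.sum('measure_value').alias('planned_sales_inputs')))
result.show()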

Related

Error in SQL statement: AnalysisException: Table or view not found:

I've just started with Hive. I'm working on Databricks Community. I write in Python, but I wanted to write something in SQL, and there is an error I cannot understand. I cannot see anything wrong in my code. Please help me.
spark.sql("create table happiness_perm as select * from happiness_tmp");
%sql
select Country, count(*) from happiness_perm group by Country
I tried using my data frame df_happiness instead of happiness_perm and I still receive this:
Error in SQL statement: AnalysisException: Table or view not found: happiness_perm; line 1 pos 30;
'Aggregate ['Country], ['Country, unresolvedalias(count(1), None)]
+- 'UnresolvedRelation [happiness_perm], [], false
I would really appreciate your help!
Try this:
df = spark.sql("select * from happiness_tmp")
df.createOrReplaceTempView("happiness_perm")
First you get your data into a DataFrame, then you register that DataFrame as a temporary view.
You can then query the view with SQL.
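A short usage sketch, assuming the two lines above were run in the same Spark session:
# The temporary view is visible to SQL within the same SparkSession,
# so the query from the question now resolves.
spark.sql("select Country, count(*) from happiness_perm group by Country").show()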

Trying to explode an array with unnest() in Presto and failing due to extra column

I have data from a query that looks like this:
SELECT
model_features
FROM some_db
which returns:
{
"food1": 0.65892159938812,
"food2": 0.90786880254745,
"food3": 0.88357985019684,
"food4": 0.99999821186066,
"food5": 0.99237471818924,
"food6": 0.62127977609634
}
{
"food4": 0.9999965429306,
"text1": 0.82206630706787
}
...
etc.
What I am eventually trying to do is simply get a count of each of the "food1", "food2" features,
but to do so (I think) I need to trim out the unnecessary numeric data. I'm at a loss as to how to do this, as every time I try to simply unnest
SELECT
t.concepts
FROM some_db
CROSS JOIN UNNEST(model_features) AS t(concepts)
I get this error:
Column alias list has 1 entries but 't' has 2 columns available
Anyone mind pointing me in the right direction?
Solved this for myself: the issue was that I needed to avoid dropping the second column of information in order for the query to execute. This may not be the canonical best way to approach it, but it worked:
SELECT
t.concepts,
t.probabilities
FROM some_db
CROSS JOIN UNNEST(model_features) AS t(concepts,probabilities)

Cannot have map type columns in DataFrame which calls set operations

: org.apache.spark.sql.AnalysisException: Cannot have map type columns in DataFrame which calls set operations(intersect, except, etc.), but the type of column map_col is map
I have a Hive table with a column of type MAP<Float, Float>. I get the above error when I try to do an insert into this table in a Spark context. The insert works fine without the 'distinct'.
create table test_insert2(`test_col` string, `map_col` MAP<INT,INT>)
location 's3://mybucket/test_insert2';
insert into test_insert2
select distinct 'a' as test_col, map(0,0) as map_col
Try converting the DataFrame to an RDD with .rdd and then applying the .distinct function.
Example:
spark.sql("select 'a'test_col,map(0,0)map_col
union all
select 'a'test_col,map(0,0)map_col").rdd.distinct.collect
Result:
Array[org.apache.spark.sql.Row] = Array([a,Map(0 -> 0)])
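If you are working from PySpark rather than Scala, here is a rough sketch of the same idea, under the assumption that your session is the usual spark object. Note that map columns come back as Python dicts, which are unhashable, so a plain rdd.distinct() would fail; you have to derive a hashable key yourself:
df = spark.sql("select 'a' as test_col, map(0,0) as map_col "
               "union all "
               "select 'a' as test_col, map(0,0) as map_col")

# Deduplicate at the RDD level, since DataFrame set operations reject map columns.
# Build a hashable key from each row (dicts are unhashable), then keep one row per key.
deduped = (df.rdd
           .keyBy(lambda row: (row.test_col, tuple(sorted(row.map_col.items()))))
           .reduceByKey(lambda a, b: a)
           .values())
deduped.collect()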

SQLDF in R - Problems with date format

I am trying to apply some transformations to a data.frame using the sqldf function in R, but I am getting some weird output. (I tried to apply some SQL date transformations in the query, but had no success.)
First the data.frame has the following format (all columns are of character class):
But after filtering the data.frame with 'sqldf':
sqldf("SELECT BP_OR, N_orden_OR, Tipo_ordenOR, N_lineaOR, OLUSER, OLPID,
(select Fecha_aprobac from aprob_or
order by Fecha_aprobac desc limit 1) AS LastOfFecha_aprobac,
Estadp_sig, Estado_ultimo
FROM aprob_or
GROUP BY BP_OR, N_orden_OR, Tipo_ordenOR, N_lineaOR, OLUSER, OLPID, Estadp_sig, Estado_ultimo
HAVING (((OLPID)='X43008'))
")
I got the following format for the column LastOfFecha_aprobac:
Those 'P5' values are what I don't understand. I formatted the SQL code with some parameters to change the date format, but the problem persisted.
Do you have a better idea of how to figure this out?

Group by expression in Pig

Suppose I have a dataset with tuples (f1, f2). I want to get my data in two bags: one where f1 is null and the other where f1 is not null. I try:
raw = LOAD 'somedata' USING PigStorage() AS (f1:chararray, f2:chararray);
raw_group = GROUP raw BY f1 is null;
raw_count = FOREACH raw_group GENERATE group, COUNT_STAR(raw);
I expect to get two groups with keys true and false. When I run it in grunt I get the following:
2013-12-26 14:56:10,958 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1200: <line 1046, column 25> Syntax error, unexpected symbol at or near 'f1'
I can do a workaround:
raw_group = GROUP raw BY (f1 is null)?0:1;
, but I would really like to understand what's going on here, as I have just started to learn Pig. According to the Pig documentation I can use expressions as a grouping key. Am I missing something here, or are nulls treated differently in Pig?
The boolean datatype was introduced in Pig 0.10. The expression f1 is null evaluates to a boolean, so before Pig 0.10 it could not appear as a field in a relation, which it would have to do if it were the value of group. Prior to 0.10, booleans could only be used in FILTER statements or in the ternary operator, as you showed in your workaround.
While I haven't tried this out, presumably if you were to attempt the same thing in Pig 0.10 or later, your original attempt would succeed.