I am trying to use the alias given by 'as' in Spark SQL (Spark 2.0.0) in the WHERE clause, like so:
val ds = spark.createDataset[Int](List(1,2,3)) ;
ds.createOrReplaceTempView("VIEW")
ds.sparkSession.sql("SELECT count(*) as total FROM VIEW WHERE total > 1").show()
However, I am getting this exception:
cannot resolve '`total`' given input columns: [value]; line 1 pos 41
It seems that Spark does not respect the identifier I have given the aggregate column. Is this currently out of scope for Spark, or am I just doing something wrong?
Is total a defined column in VIEW, or are you trying to filter on the value of count(*)? If you want to count and then filter on that count, the syntax should be something like:
select <fieldtogroupon>, count(*) as total
from VIEW
group by <fieldtogroupon>
having count(*) > 1
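If there is no grouping column and you really do want to filter the single overall count, another option (a rough sketch, not tested on 2.0.0) is to compute the aggregate in a subquery so the alias exists by the time the outer WHERE is evaluated; a plain WHERE cannot see SELECT aliases because it runs before the projection:
SELECT total
FROM (SELECT count(*) AS total FROM VIEW) t
WHERE total > 1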
Related
I had a query that worked fine on Aurora SQL, but now I need to do the same in Redshift. When I run it there, it throws an error asking me to group by every column, which is obviously not what I want.
This is the query:
select
rut,
name,
id,
sum(cantidad_retornos) as cantidad_retornos,
sum(cantidad_aceptadas) as cantidad_aceptadas,
sum(cantidad_auto_accept) as cantidad_auto_accept,
sum(cantidad_rechazadas) as cantidad_rechazadas,
sum(cantidad_aceptadas) - sum(cantidad_auto_accept) as cantidad_aceptadas_manual,
coalesce((sum(cantidad_aceptadas) - sum(cantidad_auto_accept)) / nullif(sum(cantidad_aceptadas),0)) as per_aceptadas_manual,
coalesce(sum(cantidad_auto_accept) / nullif(sum(cantidad_aceptadas),0),0) as per_aceptadas_auto,
coalesce(sum(cantidad_rechazadas) / nullif(sum(cantidad_retornos),0),0) AS rechazo_per,
case
when coalesce(sum(cantidad_rechazadas) / nullif(sum(cantidad_retornos),0) ,0) < 0.1 or cantidad_retornos < 10 then 'Confiable'
when coalesce(sum(cantidad_rechazadas) / nullif(sum(cantidad_retornos),0),0) >= 0.1 and coalesce(sum(cantidad_rechazadas) / nullif(sum(cantidad_retornos),0),0) < 0.5 then 'Estándar'
when coalesce(sum(cantidad_rechazadas) / nullif(sum(cantidad_retornos),0),0) >= 0.5 then 'Poco confiable'
else 'Sin clasificar'
end as nivel_confianza
from table
where 1=1
group by id, name, rut
I tried to group by every column, but it doesn't give the result that I need.
The error that I get:
ERROR: column "reporte_sellers_date.cantidad_retornos" must appear in the GROUP BY clause or be used in an aggregate function
If I group by the third column, it throws the same error but for column number 4.
In the first branch of the CASE statement you have or cantidad_retornos without any aggregating function such as SUM(). This is why Redshift is saying it needs to be in a group by. You also alias this name to the sum of the column of the same name, so there is a choice the database needs to make about which one to use: the source column or the aggregate. It looks like Aurora is choosing the aggregate but Redshift is choosing the source column.
Using the same name for an aggregate as a source column is not a good idea as you are relying on the query compiler to make a choice for you. This means the query can break when the compiler is updated or if you port the query to a different database.
To fix this you can either add the SUM() aggregation to the use of cantidad_retornos in the CASE statement, or keep the aggregate from earlier in the query but give it a unique alias.
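For example, a sketch of the first option: only the predicate in the first branch of the CASE changes, with the bare column wrapped in sum() so every reference is an aggregate; the remaining branches and the rest of the query stay as they were.
when coalesce(sum(cantidad_rechazadas) / nullif(sum(cantidad_retornos),0),0) < 0.1 or sum(cantidad_retornos) < 10 then 'Confiable'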
I am having an issue locating the error in my code
I am practicing the WITH clause in BigQuery and I am trying to create two temporary tables to eventually join:
the first table would hold the total sales summed per store (grouping by storeid)
the second table would get the average of those per-store totals
the main query would find which stores' totals are greater than that average
here is what I was able to code:
WITH Total_sales as
(SELECT s.storeid,
sum(Unitprice)as sum_sale
FROM `g-mail-1234.SALES.sales_info` as s
GROUP BY storeid),
AVG_Sale (average_s_sales) as
(SELECT ROUND(avg(sum_sale),2) as average_s_sales
FROM total_sales)
SELECT * FROM total_sales as ts
JOIN avg_sale as av
ON ts.sum_sale > av.average_s_sale
but when I run the code I get a message:
Syntax error: Expected keyword AS but got "(" at [7:14]
what I would like to know is:
Where is the error?
For the future, in BigQuery, is the 'at [7:14]' trying to tell me the line the error is on? Because it is on neither line 7 nor line 14.
I don’t believe BQ CTE syntax allows you to list the columns that the CTE will return. So this line:
AVG_Sale (average_s_sales) as
should just be:
AVG_Sale as
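Putting that together, a rough sketch of the corrected query (note that the join condition also has to use the alias average_s_sales exactly as defined in the CTE; the original ON clause writes av.average_s_sale):
WITH total_sales AS (
  SELECT s.storeid,
         SUM(Unitprice) AS sum_sale
  FROM `g-mail-1234.SALES.sales_info` AS s
  GROUP BY storeid
),
avg_sale AS (
  SELECT ROUND(AVG(sum_sale), 2) AS average_s_sales
  FROM total_sales
)
SELECT *
FROM total_sales AS ts
JOIN avg_sale AS av
  ON ts.sum_sale > av.average_s_sales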
In my app backend, using Knex with PostgreSQL, I'm trying to get the count of rows that have the same ID.
The issue is that whatever I do, the count is always 1, when in reality I have 2 rows for the same ID.
In my table I have two rows with the same conversation_id, which is 1, and I need to count them.
The expected result should be count = 2.
What I tried with Knex:
tx(tableName).select(columns.conversationId)
.whereIn(columns.conversationId, conversationIds)
.groupBy(columns.conversationId, columns.createdAt, columns.id);
In the groupBy section, if I try to remove columns.createdAt and columns.id, it complains that those need to be included in the groupBy or in an aggregate function.
If I remove those extra groupBy columns from the following SQL, I get the right result, but Knex doesn't like it and I'm stuck on it.
SQL generated as follow:
select
conversation_id ,
COUNT(*)
from
message
group by
conversation_id,
created_at ,
id ;
The result of this SQL is not what I need either, and I'm not able to make it work correctly with Knex, which complains if I remove those columns from the groupBy.
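For reference, the SQL that returns the expected count, grouping only by conversation_id as described above, would be something like:
select conversation_id, count(*)
from message
group by conversation_id;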
Tinkering with some expressions in the QueryLab, I wonder if something like the following will work:
tx(tableName)
  .select(columns.conversationId)
  .count()
  .whereIn(columns.conversationId, conversationIds)
  // group only on conversation_id so count(*) is computed per conversation
  .groupBy(columns.conversationId)
Which would give something like (these values are placeholders, obviously):
select "columns"."conversationId", count(*) from "tableName" where "columns"."conversationId" in (1, 2, 3)
I'm trying to execute this with pyspark:
query = "SELECT *\
FROM transaction\
INNER JOIN factures\
ON transaction.t_num = factures.f_trx\
WHERE transaction.t_num != ''\
GROUP BY transaction.t_num"
result = sqlContext.sql(query)
Spark gives an error:
u"expression transaction.t_aut is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;
You forgot to add the list of columns to the group by statement, since you are selecting all columns in the select statement.
It's saying that there is a column named transaction.t_aut that you projected in your select statement (by using select *) but that is not in your group by.
The solution is either to replace select * with the columns that are in your group by (in your case transaction.t_num), or to add transaction.t_aut to your group by.
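A rough sketch of the first option; any other column you still need in the output would have to be wrapped in an aggregate (such as the first() the error message suggests) or added to the group by:
SELECT transaction.t_num
FROM transaction
INNER JOIN factures
    ON transaction.t_num = factures.f_trx
WHERE transaction.t_num != ''
GROUP BY transaction.t_num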
I want to write a query that could be used as a subquery. But I noticed that a query like this:
x = Deal.all_objects.filter(state='created').values('id').annotate(cnt=Count('id')).values('cnt')
produces
SELECT COUNT("deals_deal"."id") AS "cnt"
FROM "deals_deal"
WHERE "deals_deal"."state" = created
GROUP BY "deals_deal"."id"
I don't need the GROUP BY; I just want to count the offers that match the filter.
I don't want to use .count() because it would not let me write a query like this:
Deal.all_objects.filter(
Q(creator=OuterRef('pk')) | Q(taker=OuterRef('pk'), state='completed')
).annotate(cnt=Count('pk')).values('cnt')
How do I modify the above query so that it counts without GROUP BY?
What you are looking for is done with aggregate, not annotate.
x = Deal.objects.filter(state='created').values('id').aggregate(cnt=Count('id'))
# x is a dictionary {"cnt":9000}, you wouldn't even need the ".values('id')" now
This will result in a query something like this:
SELECT COUNT("deals_deal"."id") AS "cnt"
FROM "deals_deal"
WHERE "deals_deal"."state" = created
Further cool things you can do with Aggregate