Spark SQL query: org.apache.spark.sql.AnalysisException - apache-spark-sql

I am trying to write a query for a twitter json file to extract the most influential person by looking at retweetCount. I need to group my output by the user, their time zone and the number of retweets in descending order.
When I run the query below I keep getting the exception:
org.apache.spark.sql.AnalysisExceptionorg.apache.spark.sql.AnalysisException:
cannot resolve 'total_retweets' given input columns
t.retweeted_screen_name, t.tz, total_retweets, tweet_count;
sqlContext.sql("""
SELECT
t.retweeted_screen_name,
t.tz,
sum(retweets) AS total_retweets,
count(*) AS tweet_count
FROM (SELECT
actor.displayName as retweeted_screen_name,
body,
actor.twitterTimeZone as tz,
max(retweetCount) as retweets
FROM tweetTable WHERE body <> ''
GROUP BY actor.displayName, actor.twitterTimeZone,
body) t
GROUP BY t.retweeted_screen_name, t.tz
ORDER BY total_retweets DESC
LIMIT 10 """).collect.foreach(println)
When I try to simplify this query I run into errors like:
Column total_retweets is invalid in the select list because it is not
contained in either an aggregate function or the GROUP BY clause.
Will much appreciate any help.

When you run a SQL query, it does not calculate resolve the aliases for each query until after the WHERE, JOIN, GROUP BY and ORDER BY clauses have run (but it does do so before any HAVING clauses). You therefore can't ORDER BY total_retweets, you will need to ORDER BY sum(retweets)

Related

SQL ORDER BY clause causing GROUP BY/aggregate error

I get this error:
PG::GroupingError: ERROR: column "relationships.created_at" must appear in the GROUP BY clause or be used in an aggregate function
from this query:
last_check = #user.last_check.to_i
#new_relationships = User.select('*')
.from("(#{#rels_unordered.to_sql}) AS rels_unordered")
.joins("
INNER JOIN relationships
ON rels_unordered.id = relationships.character_id
WHERE EXTRACT(EPOCH FROM relationships.created_at) > #{last_check}
ORDER BY relationships.created_at DESC
")
Without the ORDER BY line, it works fine. I don't understand what the GROUP BY clause is. How do I get it working and still order by relationships.created_at?
EDIT
I understand you can GROUP BY relationships.created_at. But isn't grouping unnecessary? Is the problem that relationship.created_at is not included in the SELECT? How do you include it? If you've already done an INNER JOIN with relationships, why the hell isn't relationships.created_at included in the result??
I've just realised this is all happening because the logs show the query begins with SELECT COUNT(*) FROM..... So the COUNT is the aggregate function. But I never requested a COUNT! Why does the query start with that?
EDIT 2
Ok, this seems to be happening because of lazy querying. The first thing that happens to #new_relationships is #new_relationships.any? This affects the query and turns it into a count. So I suppose the question is, how do I force the query to run as originally intended? And also to check if #new_relationships is empty without affecting the sql query?
You just need to add group by along with your order by clause
last_check = #user.last_check.to_i
#new_relationships =
User.select('"rels_unordered".*')
.from("(#{#rels_unordered.to_sql}) AS rels_unordered")
.joins("INNER JOIN relationships
ON rels_unordered.id = relationships.character_id
WHERE EXTRACT(EPOCH FROM relationships.created_at) > #{last_check}
GROUP BY relationships.created_at
ORDER BY relationships.created_at DESC ")
It's asking for a GROUP BY after your FROM clause in your EXTRACT (essentially a subselect). There's ways around it, but I've found it's often easier to make a GROUP BY work. Try: ...FROM relationships.created_at GROUP BY id or whatever indexing column you are using from that table. It seems like your ORDER BY is conflicting with itself. By grouping the subselect data it should lose its conflict.

Nth(n,split()) in bigquery

I am running the following query and keep getting the error message:
SELECT NTH(2,split(Web_Address_,'.')) +'.'+NTH(3,split(Web_Address_,'.')) as D , Web_Address_
FROM [Domains.domain
limit 10
Error message: Error: (L1:110): (L1:119): SELECT clause has mix of
aggregations 'D' and fields 'Web_Address_' without GROUP BY
clause Job ID:
symmetric-aura-572:job_axsxEyfYpXbe2gpmlYzH6bKGdtI
I tried to use group by clause on field D and/or Web_address_, but still getting errors about group by.
Does anyone know why this is the case? I have had success with similar query before.
You probably want to use WITHIN RECORD aggregation here, not GROUP BY
select concat(p1, '.', p2), Web_Address_ FROM
(SELECT
NTH(2,split(Web_Ad`enter code here`dress_,'.')) WITHIN RECORD p1,
NTH(3,split(Web_Address_,'.')) WITHIN RECORD p2, Web_Address_
FROM (SELECT 'a.b.c' as Web_Address_))
P.S. If you just trying to cut off first part of web address, it will be easier to do with RIGHT and INSTR functions.
You can also consider using URL functions: HOST, DOMAIN and TLD

Finding most popular and most unique records using SQL

My mom wanted a baby name game for my brother's baby shower. Wanting to learn python, I volunteered to do it. I pretty much have the python bit, it's the SQL that is throwing me.
The way the game is supposed to work is everyone at the shower writes down names on paper, I manually enter them into Excel (normalizing spellings as much as possible) and export to MS Access. Then I run my python program to find the player with the most popular names and the player with the most unique names. The database, called "babynames", is just four columns.
ID | BabyFirstName | BabyMiddleName | PlayerName
---|---------------|----------------|-----------
My mom has changed things every so often, but as they stand right now, I have to figure out :
a) The most popular name (or names if there is a tie) out of all first and middle names
b) The most unique name (or names if there is a tie) out of all the first and middle names
c) The player that has the most number of popular names (wins a prize)
d) The player that has the most number of unique names (wins a prize)
I've been working on this for about a week now and can't even get a SQL query for a) and b) to work, much less c) and d). I'm more than just a bit frustrated.
BTW, I'm just looking at spellings of the names, not phonetics. As I manually enter names, I will change names like "Kris" to "Chris" and "Xtina" to "Christina" etc.
Editing to add a couple of the most recent queries I tried for a)
SELECT [BabyFirstName],
COUNT ([BabyFirstName]) AS 'FirstNameOccurrence'
FROM [babynames]
GROUP BY [BabyFirstName]
ORDER BY 'FirstNameOccurrence' DESC
LIMIT 1
and
SELECT [BabyFirstName]
FROM [babynames]
GROUP BY [BabyFirstName]
HAVING COUNT(*) =
(SELECT COUNT(*)
FROM [babynames]
GROUP BY [BabyFirstName]
ORDER BY COUNT(*) DESC
LIMIT 1)
These both lead to syntax errors.
pyodbc.ProgrammingError: ('42000', '[42000] [Microsoft][ODBC Microsoft Access Driver] Syntax error in ORDER BY clause. (-3508) (SQLExecDirectW)')
I've tried using [FirstNameOccurrence] and just FirstNameOccurrence as well with the same error. Not sure why it's not recognizing it by that column name to order by.
pyodbc.ProgrammingError: ('42000', "[42000] [Microsoft][ODBC Microsoft Access Driver] Syntax error. in query expression 'COUNT(*) = (SELECT COUNT(*) FROM [babynames] GROUP BY [BabyFirstName] ORDER BY COUNT(*) DESC LIMIT 1)'. (-3100) (SQLExecDirectW)")
I'll admit that I'm not really grokking all of the COUNT(*) commands here, but this was a solution for a similar issue here in stackoverflow that I figured I'd try when my other idea didn't pan out.
For A and B, use a group by clause in your SQL, and then count, and order by the count. Use descending order for A and ascending order for B, and just take the first result for each.
For C and D, essentially use the same strategy but now just add the PlayerName (e.g. group by babyname,playername) and then use the ascending order/descending order question.
Here's Microsoft's write-up for a group by clause in MS Access: https://office.microsoft.com/en-us/access-help/group-by-clause-HA001231482.aspx
Here's an even better write-up demonstrating how to do both group by and order by at the same time: http://rogersaccessblog.blogspot.com/2009/06/select-queries-part-3-sorting-and.html
For the first query you tried, change it to:
SELECT TOP 1 [BabyFirstName],
COUNT ([BabyFirstName]) AS 'FirstNameOccurrence'
FROM [babynames]
GROUP BY [BabyFirstName]
ORDER BY 'FirstNameOccurrence' DESC
For the second, change it to:
SELECT [BabyFirstName]
FROM [babynames]
GROUP BY [BabyFirstName]
HAVING COUNT(*) =
(SELECT TOP 1 COUNT(*)
FROM [babynames]
GROUP BY [BabyFirstName]
ORDER BY COUNT(*) DESC)
Limiting the number of records returned by a SQL Statement in Access is achieved by adding a TOP statement directly after SELECT, not with ORDER BY... LIMIT
Also, Access TOP statement will return all instances of the top n (or n percent) unique records, so if there are two or more identical records in the query output (before TOP), and TOP 1 is specified, you'll see them all.

Error in group by using hive

I am using the following code and getting the error below
select d.searchpack,d.context, d.day,d,txnid,d.config, c.sgtype from ds3resultstats d join
context_header c on (d.context=c.contextid) where (d.day>='2012-11-15' and d.day<='2012-11-25' and c.sgtype='Tickler' and d.config like
'%people%') GROUP BY d.context limit 10;
FAILED: Error in semantic analysis: line 1:7 Expression Not In Group By Key d
I am guessing I am using the group by incorrectly
when you use group by, you cannot select other additional field. You can only select group key with aggregate function.
See hive group by for more information.
Related questions.
Code example:
select d.context,count(*)
from ds3resultstats
...
group by d.context
or group by multiply fields.
select d.context, d.field2, count(*)
from ds3resultstats
...
group by d.context, d.field2
It is expecting all the columns to be added with group by.
Even I am facing the same issue however I managed to get a work around to these kind of issues.
you can use collect_set with the column name to get the output. For example
select d.searchpack,collect_set(d.context) from sampletable group by d.searchpack;

Why does this query require a parameter?

I have the following SQL query in MS Access. I'm trying to order the output by the results of an expression contained in the column labeled as 'Response'. My problem is that when I run the query Access prompts me to enter a parameter value for Response. I tried entering zero and one to see what would happen. The query runs in those cases but the sort order is wrong. Can someone please explain to me why this query would require a parameter? Am I doing something wrong?
SELECT Market,
Sum(Calls) AS SumOfCalls,
Sum([A25-54 IMPs] * 1000) AS Impressions,
Round(SumOfCalls/Impressions, 6) AS Response
FROM DRTV_CentralOnly
WHERE [Creative]<>'#N/A'
GROUP BY Market
ORDER BY Response Desc;
It is not requiring a parameter. The problem is that you are using columns aliases in one your Response column. You need to use the actual calculations instead.
SELECT Market
, Sum(Calls) AS SumOfCalls
, Sum([A25-54 IMPs] * 1000) AS Impressions
, Round(Sum(Calls)/Sum([A25-54 IMPs] * 1000), 6) AS Response
FROM DRTV_CentralOnly
WHERE [Creative]<>'#N/A'
GROUP BY Market
ORDER BY 4 Desc;
#bluefeet said:
The problem is that you are using columns aliases in one your Response column.
Actually, that is not the problem at all. MS Access does indeed allow AS clauses ("columns aliases") to be used in the way the OP has used them.
Rather, the problem is that MS Access does not allow AS clauses in the ORDER BY clause.
It is altering the ORDER BY clause to use an ordinal position that fixes the query. The changes to the SELECT clause are a red herring!
The following should work:
SELECT Market,
Sum(Calls) AS SumOfCalls,
Sum([A25-54 IMPs] * 1000) AS Impressions,
Round(SumOfCalls/Impressions, 6) AS Response
FROM DRTV_CentralOnly
WHERE [Creative]<>'#N/A'
GROUP BY Market
ORDER BY 4 Desc;