NTH(n, SPLIT()) in BigQuery

I am running the following query and keep getting the error message:
SELECT NTH(2,split(Web_Address_,'.')) +'.'+NTH(3,split(Web_Address_,'.')) as D , Web_Address_
FROM [Domains.domain]
limit 10
Error message:
Error: (L1:110): (L1:119): SELECT clause has mix of aggregations 'D' and fields 'Web_Address_' without GROUP BY clause
Job ID: symmetric-aura-572:job_axsxEyfYpXbe2gpmlYzH6bKGdtI
I tried to use a GROUP BY clause on field D and/or Web_Address_, but I still get errors about GROUP BY.
Does anyone know why this is the case? I have had success with similar query before.

You probably want to use WITHIN RECORD aggregation here, not GROUP BY:
select concat(p1, '.', p2), Web_Address_ FROM
(SELECT
NTH(2,split(Web_Address_,'.')) WITHIN RECORD p1,
NTH(3,split(Web_Address_,'.')) WITHIN RECORD p2, Web_Address_
FROM (SELECT 'a.b.c' as Web_Address_))
P.S. If you are just trying to cut off the first part of the web address, it will be easier to do with the RIGHT and INSTR functions.

You can also consider using URL functions: HOST, DOMAIN and TLD
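To make the intent of the query concrete, here is a minimal Python sketch of what it computes for a single address: split on dots, then join the second and third pieces. The function name is hypothetical, and returning None when there are fewer than three pieces is an assumption meant to mirror NTH yielding NULL for a missing element.

```python
from typing import Optional


def second_and_third_part(web_address: str) -> Optional[str]:
    """Join the 2nd and 3rd dot-separated pieces (1-based), like NTH(2, ...) and NTH(3, ...)."""
    parts = web_address.split(".")
    # Assumption: a missing piece makes the whole result None, like a NULL in SQL.
    if len(parts) < 3:
        return None
    return parts[1] + "." + parts[2]


print(second_and_third_part("a.b.c"))
```

For a typical host name this drops the first label, which is the same effect the RIGHT/INSTR suggestion aims for.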

Related

row_number error when trying to rank items

I'm trying to get back into writing SQL queries and am having a frustrating problem. I have two questions.
First, I'm trying to take all items in my dataset and rank them within partitions. I researched this and think it should look like this:
select g.ticker, g.sector, g.industry, g.countryname, g.exchange, c.carbon, c.year,
ROW_NUMBER() OVER (
PARTITION BY g.sector, g.industry, g.countryname, g.exchange
ORDER BY c.carbon DESC
) AS 'Rank'
from "General" g
INNER JOIN carbon c ON upper(c.ticker) =g.ticker ;
The output should be a rank within each partition (in this case sector, industry, country name and exchange), with the rows in each partition ranked by their carbon emissions.
I'm getting this error:
Error occurred during SQL script execution
Reason:
SQL Error [42601]: ERROR: syntax error at or near "'Rank'"
Position: 1305
If I remove the rank section, the data joins and returns results (obviously not ranked like I want, but I know the base query works). What am I doing wrong?
Second (related) question: I forgot how much I hated SQL error messages. The above error tells me there's a syntax error, so I went to the docs and couldn't see anything different in my code vs. their example. Assuming a lack of experience, is there a better way to get actionable error messages (e.g. in Python I get a stack trace that I can read to see what part of my code went wrong)?
Thank you!
Don't use single quotes for column aliases. Also, I would suggest avoiding names that are already part of standard SQL (which has a rank() function). I often use seqnum:
select g.ticker, g.sector, g.industry, g.countryname, g.exchange, c.carbon, c.year,
row_number() over (
partition by g.sector, g.industry, g.countryname, g.exchange
order by c.carbon desc
) as seqnum
from "General" g join
carbon c
on upper(c.ticker) = g.ticker ;
Note: You should only use single quotes for string and date constants. If you want to escape a column name, use double quotes (just as your query does for the table name General).
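The unquoted-alias form can be tried end to end on a toy dataset. This is a minimal sketch using Python's built-in sqlite3 module rather than the asker's database, and it assumes a SQLite build with window-function support (3.25 or later). The tables are cut down to a few invented columns and rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified stand-ins for the "General" and carbon tables.
conn.executescript("""
    CREATE TABLE general (ticker TEXT, sector TEXT);
    CREATE TABLE carbon (ticker TEXT, carbon REAL);
    INSERT INTO general VALUES ('AAA', 'Energy'), ('BBB', 'Energy'), ('CCC', 'Tech');
    INSERT INTO carbon VALUES ('aaa', 10.0), ('bbb', 30.0), ('ccc', 5.0);
""")

# The alias seqnum is a plain identifier: no single quotes around it.
ranked = conn.execute("""
    SELECT g.ticker, g.sector, c.carbon,
           ROW_NUMBER() OVER (
               PARTITION BY g.sector
               ORDER BY c.carbon DESC
           ) AS seqnum
    FROM general g
    JOIN carbon c ON upper(c.ticker) = g.ticker
    ORDER BY g.sector, seqnum
""").fetchall()

for row in ranked:
    print(row)
```

Numbering restarts at 1 in each sector, with the highest carbon value first.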

Spark SQL query: org.apache.spark.sql.AnalysisException

I am trying to write a query over a Twitter JSON file to find the most influential person by looking at retweetCount. I need to group my output by the user and their time zone, and order it by the number of retweets in descending order.
When I run the query below I keep getting the exception:
org.apache.spark.sql.AnalysisException:
cannot resolve 'total_retweets' given input columns
t.retweeted_screen_name, t.tz, total_retweets, tweet_count;
sqlContext.sql("""
SELECT
t.retweeted_screen_name,
t.tz,
sum(retweets) AS total_retweets,
count(*) AS tweet_count
FROM (SELECT
actor.displayName as retweeted_screen_name,
body,
actor.twitterTimeZone as tz,
max(retweetCount) as retweets
FROM tweetTable WHERE body <> ''
GROUP BY actor.displayName, actor.twitterTimeZone,
body) t
GROUP BY t.retweeted_screen_name, t.tz
ORDER BY total_retweets DESC
LIMIT 10 """).collect.foreach(println)
When I try to simplify this query I run into errors like:
Column total_retweets is invalid in the select list because it is not
contained in either an aggregate function or the GROUP BY clause.
I would much appreciate any help.
When Spark SQL runs this query, it does not resolve the column aliases until after the WHERE, JOIN, GROUP BY and ORDER BY clauses have run (but it does do so before any HAVING clauses). You therefore can't ORDER BY total_retweets; you will need to ORDER BY sum(retweets) instead.
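The rewritten query, ordering by the aggregate expression instead of its alias, can be checked on a small dataset. This is a minimal sketch using Python's built-in sqlite3 module rather than Spark. SQLite would accept the alias as well, so it only shows that the ORDER BY sum(...) form yields the intended ranking. The flat tweets table and its rows are invented stand-ins for the nested tweetTable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical flattened version of tweetTable.
conn.executescript("""
    CREATE TABLE tweets (display_name TEXT, tz TEXT, body TEXT, retweet_count INTEGER);
    INSERT INTO tweets VALUES
        ('alice', 'UTC', 'hello', 5),
        ('alice', 'UTC', 'hello', 8),
        ('alice', 'UTC', 'bye',   2),
        ('bob',   'EST', 'hi',    4),
        ('carol', 'UTC', '',    100);
""")

top_users = conn.execute("""
    SELECT t.retweeted_screen_name,
           t.tz,
           sum(t.retweets) AS total_retweets,
           count(*) AS tweet_count
    FROM (SELECT display_name AS retweeted_screen_name,
                 tz,
                 body,
                 max(retweet_count) AS retweets
          FROM tweets
          WHERE body <> ''
          GROUP BY display_name, tz, body) t
    GROUP BY t.retweeted_screen_name, t.tz
    ORDER BY sum(t.retweets) DESC  -- the aggregate expression, not the alias
    LIMIT 10
""").fetchall()

for row in top_users:
    print(row)
```

The inner query keeps the highest retweet count per distinct tweet body, and the outer query sums those per user and time zone.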

Oracle SQL - Comparing AVG functions in WHERE

I'm trying to write a few Oracle SQL scripts for an assignment. I've managed to get all of it to work, except for one part. To summarize, I have to display data from 2 tables if the average of 1 column in table A is greater than the average of another column in table B. I realize you cannot include AVG functions in a WHERE clause or HAVING clause since it seems unable to properly access the data (from what I've read). When I exclude this clause, the script executes properly, so I'm confident there are no other errors.
I've tried writing it as follows but the error I get is ORA-00936: missing expression and it is just before the > sign. I thought this may be due to improper bracket placing but none of my attempts resolved this. Here is my attempt:
SELECT l.l_category, SUM(r.r_sold), AVG(l.l_cost)
FROM promos l
INNER JOIN sales r
ON r.promo_id = l.promo_id
GROUP BY l.l_category
HAVING (SELECT AVG(l.l_cost) OVER (PARTITION BY l.l_cost)) >
(SELECT AVG(r.r_sold) OVER (PARTITION BY r.r_sold));
I tried doing this without the OVER (PARTITION BY ...) as well as putting it into a WHERE clause but it didn't resolve the error. I'm pretty sure I need to put it into a SELECT statement somehow but I'm at a loss.
You do not need to use the OVER clause when applying the aggregate functions in the HAVING clause. Just use the aggregate functions on their own.
SELECT l.l_category, SUM(r.r_sold), AVG(l.l_cost)
FROM promos l
INNER JOIN sales r
ON r.promo_id = l.promo_id
GROUP BY l.l_category
HAVING AVG(l.l_cost) > AVG(r.r_sold);
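To see plain aggregates compared in a HAVING clause, here is a minimal sketch using Python's built-in sqlite3 module rather than Oracle. The table and column names follow the question; the rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Invented sample rows; only the table and column names come from the question.
conn.executescript("""
    CREATE TABLE promos (promo_id INTEGER, l_category TEXT, l_cost REAL);
    CREATE TABLE sales (promo_id INTEGER, r_sold INTEGER);
    INSERT INTO promos VALUES (1, 'toys', 10.0), (2, 'toys', 20.0), (3, 'books', 2.0);
    INSERT INTO sales VALUES (1, 3), (2, 5), (3, 40), (3, 60);
""")

kept = conn.execute("""
    SELECT l.l_category, SUM(r.r_sold), AVG(l.l_cost)
    FROM promos l
    INNER JOIN sales r ON r.promo_id = l.promo_id
    GROUP BY l.l_category
    HAVING AVG(l.l_cost) > AVG(r.r_sold)
""").fetchall()

print(kept)
```

Here 'toys' has an average cost of 15 against an average of 4 units sold, so it is kept, while 'books' (average cost 2, average sold 50) is filtered out.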

Error in group by using hive

I am using the following code and getting the error below
select d.searchpack,d.context, d.day,d,txnid,d.config, c.sgtype from ds3resultstats d join
context_header c on (d.context=c.contextid) where (d.day>='2012-11-15' and d.day<='2012-11-25' and c.sgtype='Tickler' and d.config like
'%people%') GROUP BY d.context limit 10;
FAILED: Error in semantic analysis: line 1:7 Expression Not In Group By Key d
I am guessing I am using the group by incorrectly
When you use GROUP BY, you cannot select additional fields that are not part of the grouping. You can only select the group keys and aggregate functions.
See hive group by for more information.
Related questions.
Code example:
select d.context, count(*)
from ds3resultstats d
...
group by d.context
Or group by multiple fields:
select d.context, d.field2, count(*)
from ds3resultstats d
...
group by d.context, d.field2
Hive expects every selected column that is not aggregated to appear in the GROUP BY clause.
I faced the same issue, but I managed to find a workaround for this kind of problem.
You can use collect_set with the column name to get the output. For example:
select d.searchpack,collect_set(d.context) from sampletable group by d.searchpack;
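Both suggestions (select only group keys plus aggregates, or collapse the extra column into a set) can be sketched outside Hive. The example below uses Python's built-in sqlite3 module, with group_concat(DISTINCT ...) as a rough stand-in for collect_set. That substitution is an assumption for illustration, not Hive behaviour, and the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Invented rows for a cut-down version of the table in the answer.
conn.executescript("""
    CREATE TABLE sampletable (searchpack TEXT, context TEXT);
    INSERT INTO sampletable VALUES
        ('sp1', 'c1'), ('sp1', 'c2'), ('sp1', 'c1'), ('sp2', 'c3');
""")

rows = conn.execute("""
    SELECT d.searchpack,
           count(*),
           group_concat(DISTINCT d.context)  -- stand-in for collect_set(d.context)
    FROM sampletable d
    GROUP BY d.searchpack
    ORDER BY d.searchpack
""").fetchall()

# Turn the comma-joined string into a Python set; element order is not guaranteed.
summary = {pack: (n, set(contexts.split(","))) for pack, n, contexts in rows}
print(summary)
```

Every selected column is either the group key or wrapped in an aggregate, which is the rule the Hive error is enforcing.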

Group by SQL statement

So I got this statement, which works fine:
SELECT MAX(patient_history_date_bio) AS med_date, medication_name
FROM biological
WHERE (patient_id = 12)
GROUP BY medication_name
But, I would like to have the corresponding medication_dose also. So I type this up
SELECT MAX(patient_history_date_bio) AS med_date, medication_name, medication_dose
FROM biological
WHERE (patient_id = 12)
GROUP BY medication_name
But, it gives me an error saying:
"Column 'biological.medication_dose' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.".
So I try adding medication_dose to the GROUP BY clause, but then it gives me extra rows that I don't want.
I would like to get the latest row for each medication in my table. (The latest row is determined by the max function, getting the latest date).
How do I fix this problem?
Use:
SELECT b.medication_name,
b.patient_history_date_bio AS med_date,
b.medication_dose
FROM BIOLOGICAL b
JOIN (SELECT y.medication_name,
MAX(y.patient_history_date_bio) AS max_date
FROM BIOLOGICAL y
WHERE y.patient_id = ? -- same patient as the outer filter, so max_date is that patient's latest date
GROUP BY y.medication_name) x ON x.medication_name = b.medication_name
AND x.max_date = b.patient_history_date_bio
WHERE b.patient_id = ?
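The join-to-latest-date pattern can be tried on a toy table. This is a minimal sketch using Python's built-in sqlite3 module. The rows are invented, and the patient filter is applied inside the derived table as well as outside, on the assumption that "latest" should mean latest for that patient.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Invented rows; column names follow the question.
conn.executescript("""
    CREATE TABLE biological (
        patient_id INTEGER,
        medication_name TEXT,
        medication_dose INTEGER,
        patient_history_date_bio TEXT
    );
    INSERT INTO biological VALUES
        (12, 'aspirin',   100, '2010-01-01'),
        (12, 'aspirin',   200, '2010-03-01'),
        (12, 'ibuprofen', 400, '2010-02-01'),
        (99, 'aspirin',    50, '2010-05-01');
""")

latest_sql = """
    SELECT b.medication_name, b.patient_history_date_bio AS med_date, b.medication_dose
    FROM biological b
    JOIN (SELECT y.medication_name, MAX(y.patient_history_date_bio) AS max_date
          FROM biological y
          WHERE y.patient_id = :pid
          GROUP BY y.medication_name) x
      ON x.medication_name = b.medication_name
     AND x.max_date = b.patient_history_date_bio
    WHERE b.patient_id = :pid
    ORDER BY b.medication_name
"""
latest = conn.execute(latest_sql, {"pid": 12}).fetchall()

for row in latest:
    print(row)
```

Each medication appears once, carrying the dose from its most recent row rather than the largest dose, which is the difference from the MAX(medication_dose) workaround.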
If you really have to, as a quick workaround, you can apply an aggregate function to medication_dose, such as MAX(medication_dose).
However, note that this is normally an indication that you are either building the query incorrectly or that you need to refactor/normalize your database schema. In your case, it looks like you are tackling the query incorrectly. The correct approach should be the one suggested by OMG Ponies in another answer.
You may want to check out the following article, which describes the reasons behind this error:
But WHY Must That Column Be Contained in an Aggregate Function or the GROUP BY clause?
You need to put max(medication_dose) in your select. Group by returns a result set that contains distinct values for fields in your group by clause, so apparently you have multiple records that have the same medication_name, but different doses, so you are getting two results.
By putting in max(medication_dose) it will return the maximum dose value for each medication_name. You can use any aggregate function on dose (max, min, avg, sum, etc.)