I am trying to write a query for a twitter json file to extract the most influential person by looking at retweetCount. I need to group my output by the user, their time zone and the number of retweets in descending order.
When I run the query below I keep getting the exception:
org.apache.spark.sql.AnalysisExceptionorg.apache.spark.sql.AnalysisException:
cannot resolve 'total_retweets' given input columns
t.retweeted_screen_name, t.tz, total_retweets, tweet_count;
sqlContext.sql("""
SELECT
t.retweeted_screen_name,
t.tz,
sum(retweets) AS total_retweets,
count(*) AS tweet_count
FROM (SELECT
actor.displayName as retweeted_screen_name,
body,
actor.twitterTimeZone as tz,
max(retweetCount) as retweets
FROM tweetTable WHERE body <> ''
GROUP BY actor.displayName, actor.twitterTimeZone,
body) t
GROUP BY t.retweeted_screen_name, t.tz
ORDER BY total_retweets DESC
LIMIT 10 """).collect.foreach(println)
When I try to simplify this query I run into errors like:
Column total_retweets is invalid in the select list because it is not
contained in either an aggregate function or the GROUP BY clause.
Will much appreciate any help.
When you run a SQL query, it does not calculate resolve the aliases for each query until after the WHERE, JOIN, GROUP BY and ORDER BY clauses have run (but it does do so before any HAVING clauses). You therefore can't ORDER BY total_retweets, you will need to ORDER BY sum(retweets)
How would you overcome the above restriction?
I am trying to find flows based on sequences of 3 records using the LEAD and LAG window functions, and than calculate some aggregations (count, sum, etc,) of their attributes.
When i run my queries on a small sample of data, everything is fine and the group by runs OK. but when running on larger data set, i get: "Resources exceeded during query execution. The query contained a GROUP BY operator, consider using GROUP EACH BY instead."
In many other cases switching to GROUP EACH BY do the work...
However, as I use window functions, I cannot use EACH...
Any suggestions? Best practices?
here is a sample query based of wikipedia sample data. it shows the frequency of title editing by different contributors. the where condition is just to limit response size, if you remove the "B" we get results, if we add it we got the "use EACH" recomendation.
select title,count (case when contributor_id<>LeadContributor then 1 else null end) as different,
count (case when contributor_id=LeadContributor then 1 else null end) as same,
count(*) as total
from
(
SELECT title,contributor_id,lead(contributor_id)over(partition by title order by timestamp) as LeadContributor
FROM [publicdata:samples.wikipedia]
where regexp_match(title,r'^[A,B]')=true)
group by title
Thanks
I guess your particular use case is different to the sample query, but let me comment on what I'm able to see:
You found a way to make GROUP EACH and OVER possible: Surrounding the OVER() query with another one allows you to change the GROUP BY to GROUP EACH BY. However, this query's problem is not there.
Let's forget about GROUP and GROUP EACH. Let's look at the core query:
SELECT title, contributor_id, LEAD(contributor_id)
OVER(PARTITION BY title ORDER BY timestamp) AS LeadContributor
FROM [publicdata:samples.wikipedia]
WHERE REGEXP_MATCH(title, r'^[A,B]')
This query fails with r'^[A,B]' and works with r'^[A]', and it highlight an OVER() limitation: As GROUP BY and ORDER BY, it only works when data fits in one machine, as they are not parallelizable. As the answer to r'^[A]' reveals, that can be a lot of data - though sometimes not enough. That's why BigQuery offers the parallelizable GROUP EACH BY. However, there is no parallelizable OVER EACH BY we can use here.
The workaround I would apply here is exactly what you are doing: Do the OVER() with just a fraction of the data.
(btw, let me say I love the sample query... it's an interesting question with an interesting answer!)
My mom wanted a baby name game for my brother's baby shower. Wanting to learn python, I volunteered to do it. I pretty much have the python bit, it's the SQL that is throwing me.
The way the game is supposed to work is everyone at the shower writes down names on paper, I manually enter them into Excel (normalizing spellings as much as possible) and export to MS Access. Then I run my python program to find the player with the most popular names and the player with the most unique names. The database, called "babynames", is just four columns.
ID | BabyFirstName | BabyMiddleName | PlayerName
---|---------------|----------------|-----------
My mom has changed things every so often, but as they stand right now, I have to figure out :
a) The most popular name (or names if there is a tie) out of all first and middle names
b) The most unique name (or names if there is a tie) out of all the first and middle names
c) The player that has the most number of popular names (wins a prize)
d) The player that has the most number of unique names (wins a prize)
I've been working on this for about a week now and can't even get a SQL query for a) and b) to work, much less c) and d). I'm more than just a bit frustrated.
BTW, I'm just looking at spellings of the names, not phonetics. As I manually enter names, I will change names like "Kris" to "Chris" and "Xtina" to "Christina" etc.
Editing to add a couple of the most recent queries I tried for a)
SELECT [BabyFirstName],
COUNT ([BabyFirstName]) AS 'FirstNameOccurrence'
FROM [babynames]
GROUP BY [BabyFirstName]
ORDER BY 'FirstNameOccurrence' DESC
LIMIT 1
and
SELECT [BabyFirstName]
FROM [babynames]
GROUP BY [BabyFirstName]
HAVING COUNT(*) =
(SELECT COUNT(*)
FROM [babynames]
GROUP BY [BabyFirstName]
ORDER BY COUNT(*) DESC
LIMIT 1)
These both lead to syntax errors.
pyodbc.ProgrammingError: ('42000', '[42000] [Microsoft][ODBC Microsoft Access Driver] Syntax error in ORDER BY clause. (-3508) (SQLExecDirectW)')
I've tried using [FirstNameOccurrence] and just FirstNameOccurrence as well with the same error. Not sure why it's not recognizing it by that column name to order by.
pyodbc.ProgrammingError: ('42000', "[42000] [Microsoft][ODBC Microsoft Access Driver] Syntax error. in query expression 'COUNT(*) = (SELECT COUNT(*) FROM [babynames] GROUP BY [BabyFirstName] ORDER BY COUNT(*) DESC LIMIT 1)'. (-3100) (SQLExecDirectW)")
I'll admit that I'm not really grokking all of the COUNT(*) commands here, but this was a solution for a similar issue here in stackoverflow that I figured I'd try when my other idea didn't pan out.
For A and B, use a group by clause in your SQL, and then count, and order by the count. Use descending order for A and ascending order for B, and just take the first result for each.
For C and D, essentially use the same strategy but now just add the PlayerName (e.g. group by babyname,playername) and then use the ascending order/descending order question.
Here's Microsoft's write-up for a group by clause in MS Access: https://office.microsoft.com/en-us/access-help/group-by-clause-HA001231482.aspx
Here's an even better write-up demonstrating how to do both group by and order by at the same time: http://rogersaccessblog.blogspot.com/2009/06/select-queries-part-3-sorting-and.html
For the first query you tried, change it to:
SELECT TOP 1 [BabyFirstName],
COUNT ([BabyFirstName]) AS 'FirstNameOccurrence'
FROM [babynames]
GROUP BY [BabyFirstName]
ORDER BY 'FirstNameOccurrence' DESC
For the second, change it to:
SELECT [BabyFirstName]
FROM [babynames]
GROUP BY [BabyFirstName]
HAVING COUNT(*) =
(SELECT TOP 1 COUNT(*)
FROM [babynames]
GROUP BY [BabyFirstName]
ORDER BY COUNT(*) DESC)
Limiting the number of records returned by a SQL Statement in Access is achieved by adding a TOP statement directly after SELECT, not with ORDER BY... LIMIT
Also, Access TOP statement will return all instances of the top n (or n percent) unique records, so if there are two or more identical records in the query output (before TOP), and TOP 1 is specified, you'll see them all.
I have a table called "form" in ms access which have following fields:
formno, location, status.
I want to create a report which calculates:
the total no of forms(columns) at each location
the total no of forms(columns) with status= "pending" at each location.
I tried to do it with this query:
select count(formno) as totalforms
from form
group by location;
select count(formno) as pendingforms
from form
group by location
WHERE status = 'pending';
SELECT Location, COUNT(FormNo) AS TotalForms,
SUM(IIF(status = "pending", 1, 0)) AS Pending
FROM FORM GROUP BY Location
Does this help?
The second SQL Statment has the Where Clause in the wrong place. It should look like one of these:
SELECT Count(form.formno) AS CountOfformno, form.location
FROM form
GROUP BY form.location, form.status
HAVING (((form.status)='pending'));
SELECT Count(form.formno) AS CountOfformno, form.location
FROM form
WHERE (((form.status)='pending'))
GROUP BY form.location;
Basically if you wish to group by a column and set criteria on it use the Having Clause. The Where Clause comes before the group by clause.
To learn the SQL syntax you may find it easier to use the Query By Design Grid (in Design View) first then switch to SQL View after you get the results you are after in Datasheet View.
Form is a reserved word and must be enclosed in square brackets: http://support.microsoft.com/kb/286335. I would suggest renaming the table, because it will continue to cause problems.
select location,count(formno) as totalforms
from [form]
group by location
select location, count(formno) as pendingforms
from [form]
WHERE status = 'pending'
group by location
Note that count(*) can also be used if you wish to include all records in the table: In SQL is there a difference between count(*) and count(<fieldname>)
I have 2 tables - an Account table and a Users table. Each account can have multiple users. I have a scenario where I want to execute a single query/join against these two tables, but I want all the Account data (Account.*) and only the first set of user data (specifically their name).
Instead of doing a "min" or "max" on my aggregated group, I wanted to do a "first". But, apparently, there is no "First" aggregate function in TSQL.
Any suggestions on how to go about getting this query? Obviously, it is easy to get the cartesian product of Account x Users:
SELECT User.Name, Account.* FROM Account, User
WHERE Account.ID = User.Account_ID
But how might I got about only getting the first user from the product based on the order of their User.ID ?
Rather than grouping, go about it like this...
select
*
from account a
join (
select
account_id,
row_number() over (order by account_id, id) -
rank() over (order by account_id) as row_num from user
) first on first.account_id = a.id and first.row_num = 0
I know my answer is a bit late, but that might help others. There is a way to achieve a First() and Last() in SQL Server, and here it is :
Stuff(Min(Convert(Varchar, DATE_FIELD, 126) + Convert(Varchar, DESIRED_FIELD)), 1, 23, '')
Use Min() for First() and Max() for Last(). The DATE_FIELD should be the date that determines if it is the first or last record. The DESIRED_FIELD is the field you want the first or the last value. What it does is :
Add the date in ISO format at the start of the string (23 characters long)
Append the DESIRED_FIELD to that string
Get the MIN/MAX value for that field (since it start with the date, you will get the first or last record)
Stuff that concatened string to remove the first 23 characters (the date part)
Here you go!
EDIT: I got problems with the first formula : when the DATE_FIELD has .000 as milliseconds, SQL Server returns the date as string with NO milliseconds at all, thus removing the first 4 characters from the DESIRED_FIELD. I simply changed the format to "20" (without milliseconds) and it works all great. The only downside is if you have two fields that were created at the same seconds, the sort can possibly be messy... in which cas you can revert to "126" for the format.
Stuff(Max(Convert(Varchar, DATE_FIELD, 20) + Convert(Varchar, DESIRED_FIELD)), 1, 19, '')
EDIT 2 : My original intent was to return the last (or first) NON NULL row. I got asked how to return the last or first row, wether it be null or not. Simply add a ISNULL to the DESIRED_FIELD. When you concatenate two strings with a + operator, when one of them is NULL, the result is NULL. So use the following :
Stuff(Max(Convert(Varchar, DATE_FIELD, 20) + IsNull(Convert(Varchar, DESIRED_FIELD), '')), 1, 19, '')
Select *
From Accounts a
Left Join (
Select u.*,
row_number() over (Partition By u.AccountKey Order By u.UserKey) as Ranking
From Users u
) as UsersRanked
on UsersRanked.AccountKey = a.AccountKey and UsersRanked.Ranking = 1
This can be simplified by using the Partition By clause. In the above, if an account has three users, then the subquery numbers them 1,2, and 3, and for a different AccountKey, it will reset the numnbering. This means for each unique AccountKey, there will always be a 1, and potentially 2,3,4, etc.
Thus you filter on Ranking=1 to grab the first from each group.
This will give you one row per account, and if there is at least one user for that account, then it will give you the user with the lowest key(because I use a left join, you will always get an account listing even if no user exists). Replace Order By u.UserKey with another field if you prefer that the first user be chosen alphabetically or some other criteria.
I've benchmarked all the methods, the simpelest and fastest method to achieve this is by using outer/cross apply
SELECT u.Name, Account.* FROM Account
OUTER APPLY (SELECT TOP 1 * FROM User WHERE Account.ID = Account_ID ) as u
CROSS APPLY works just like INNER JOIN and fetches the rows where both tables are related, while OUTER APPLY works like LEFT OUTER JOIN and fetches all rows from the left table (Account here)
You can use OUTER APPLY, see documentation.
SELECT User1.Name, Account.* FROM Account
OUTER APPLY
(SELECT TOP 1 Name
FROM [User]
WHERE Account.ID = [User].Account_ID
ORDER BY Name ASC) User1
SELECT (SELECT TOP 1 Name
FROM User
WHERE Account_ID = a.AccountID
ORDER BY UserID) [Name],
a.*
FROM Account a
The STUFF response from Dominic Goulet is slick. But, if your DATE_FIELD is SMALLDATETIME (instead of DATETIME), then the ISO 8601 length will be 19 instead of 23 (because SMALLDATETIME has no milliseconds) - so adjust the STUFF parameter accordingly or the return value from the STUFF function will be incorrect (missing the first four characters).
First and Last do not exist in Sql Server 2005 or 2008, but in Sql Server 2012 there is a First_Value, Last_Value function. I tried to implement the aggregate First and Last for Sql Server 2005 and came to the obstacle that sql server does guarantee the calculation of the aggregate in a defined order. (See attribute SqlUserDefinedAggregateAttribute.IsInvariantToOrder Property, which is not implemented.) This might be because the query analyser tries to execute the calculation of the aggregate on multiple threads and combine the results, which speeds up the execution, but does not guarantee an order in which elements are aggregated.
Define "First". What you think of as first is a coincidence that normally has to do with clustered index order but should not be relied on (you can contrive examples that break it).
You are right not to use MAX() or MIN(). While tempting, consider the scenario where you the first name and last name are in separate fields. You might get names from different records.
Since it sounds like all your really care is that you get exactly one arbitrary record for each group, what you can do is just MIN or MAX an ID field for that record, and then join the table into the query on that ID.
There are a number of ways of doing this, here a a quick and dirty one.
Select (SELECT TOP 1 U.Name FROM Users U WHERE U.Account_ID = A.ID) AS "Name,
A.*
FROM Account A
(Slightly Off-Topic, but) I often run aggregate queries to list exception summaries, and then I want to know WHY a customer is in the results, so use MIN and MAX to give 2 semi-random samples that I can look at in details e.g.
SELECT Customer.Id, COUNT(*) AS ProblemCount
, MIN(Invoice.Id) AS MinInv, MAX(Invoice.Id) AS MaxInv
FROM Customer
INNER JOIN Invoice on Invoice.CustomerId = Customer.Id
WHERE Invoice.SomethingHasGoneWrong=1
GROUP BY Customer.Id
Create and join with a subselect 'FirstUser' that returns the first user for each account
SELECT User.Name, Account.*
FROM Account, User,
(select min(user.id) id,account_id from User group by user.account_id) as firstUser
WHERE Account.ID = User.Account_ID
and User.id = firstUser.id and Account.ID = firstUser.account_id