Resolving "Failed to parse input string" error in BigQuery - google-bigquery

I'm working on a cohort analysis where I count the number of users who signed up by week, then count the number of events each perform over subsequent weeks. (Pretty standard stuff!)
After spending a lot of time understanding BigQuery nesting and array data, I decided to create two views to flatten nested user_dim and event_dim columns using wildcard over the entire data set (tables generated of events on a daily basis). That's how I wound up with these views, on which my query is based:
USERS_VIEW schema
EVENT_VIEW schema
Query, and the Error Thrown
When I execute the following query joining those two views, I get the error "Failed to parse input string "20161111"":
SELECT
DATE_TRUNC(users.first_seen_date, WEEK) AS week,
COUNT(DISTINCT users.uid) AS signed_up_users,
COUNT(DISTINCT events.uid) AS logged_in_users,
CASE
WHEN COUNT(DISTINCT users.uid) > 0 THEN COUNT(DISTINCT events.uid) * 100 / COUNT(DISTINCT users.uid)
ELSE 0
END AS retention_pct
FROM
USERS_VIEW AS users
LEFT JOIN
EVENTS_VIEW AS events
ON
users.uid = events.uid
AND PARSE_DATE('%x', events.event_date) >= DATE_ADD(users.first_seen_date, INTERVAL 1 WEEK)
AND PARSE_DATE('%x', events.event_date) < DATE_ADD(users.first_seen_date, INTERVAL 2 WEEK)
GROUP BY
1
ORDER BY
1
I feel like this should be simple, but I can't figure out what formatting I'm missing to ensure the dates can be parsed. (And the UI doesn't tell me which line number is the offender.) I'm hoping it's a silly typo that someone else can see. Thanks in advance for any help!

I get the error "Failed to parse input string "20161111""
I think below will help to address that error
PARSE_DATE('%x', events.event_date)
should be
PARSE_DATE('%Y%m%d', events.event_date)
Also, optionally, you might want to change LEFT JOIN to just JOIN

Related

SQL filtering activity after certain event

I am struggling with a SQL query.
My query looks something like this:
Select
Count(user-id),
sum(distinct(date),
Sum(characters-posted)
From (
Select
Date,
User-Id,
Session-Id,
Characters—posted,
Variant-id
From database-name
Where date between ‘2022-09-01’ and ‘2022-09-31’)
This works ok. But, there is another field in the table “mailing-list”, which is just 0 or 1. I want to only get activity for members from the date when they join the mailing list onwards, even if they then leave the list, so can’t just do “where mailing-list=1”.
How can I do this?
It's not obvious what works fine for you as it seems to be uncommon to sum dates, given it is a regular date format. Are you trying to get number of active dates? as for the bottom question you might.
As for your buttom line quesiton, it seems that you might want to use a cte or subselect in a join.
your query...
from db_name dbn
inner join (select user_id, min(date) date from database_name
where mailing_list = 1 group by 1) start_date
on start_date.user_id = dbn.user_id
and start_date.date <= dbn.date
That way you're only getting activity starting from the first time your users join the mailing list.
But I still think you have an error in your final query, check it out.

Executing a Aggregate function within a case without Group by

I am trying to assign a specific code to a client based on the number of gifts that they have given in the past 6 months using a CASE. I am unable to use WITH (screenshot) due to the limitations of the software that I am creating the query in. It only allows for select functions. I am unsure how to get a distinct count from another table (transaction data) and use that as parameters in the CASE I have currently built (based on my client information table). Does anyone know of any workarounds for this? I am unable to GROUP BY clientID at the end of my query because not all of my columns are aggregate, and I only need to GROUP BY clientID for this particular WHEN statement in the CASE. I have looked into the OVER() clause, but I am needing my date range that I am evaluating to be dynamic (counting transactions over the last six months), and the amount of rows that I would be including is variable, as the transaction count month to month varies. Also, the software that I am building this in does not recognize the PARTITIONED BY parameter of the over clause.
Any help would be great!
EDIT:
it is not letting me attach an image... -____- I have added the two sections of code that I am looking for assistance with!
WITH "6MonthGIftCount" (
"ConstituentID"
,"GiftCount"
)
AS (
SELECT COUNT(DISTINCT "GiftView"."GiftID" FROM "GiftView" WHERE MONTHS_BETWEEN("GiftView"."GiftDate", getdate()) <= 6 GROUP BY "GiftView"."ConstituentID")
SELECT...CASE
WHEN "6MonthGiftCount"."GiftCount" >= 4
THEN 'A010'
)
Perform your grouping/COUNT(1) in a subquery to obtain the total # of donations by ConstituentID, then JOIN this total into your main query that uses this new column to perform its CASE statement.
select
hist.*,
case when timesDonated > 5 then 'gracious donor'
when timesDonated > 3 then 'repeated donor'
when timesDonated >= 1 then 'donor'
else null end as donorCode
from gifthistory hist
left join ( /* your grouping subquery here, pretending to be a new table */
select
personID,
count(1) as timesDonated
from gifthistory i
WHERE abs(months_between(giftDate, sysdate)) <= 6
group by personid ) grp on hist.personid = grp.personID
order by 1;
*Naturally, syntax changes will vary by DB; you didn't specify which it was based on, but you should be able to use this template with whichever you utilize. This works in both Oracle and SQL Server after tweaking the month calculation appropriately.

SQL INNER JOIN tables with different row names

Thank you for taking the time to read this, it is probably a very basic question. Most search queries I did seemed a bit more in depth to the INNER JOIN operator.
Basically my question is this: I have a shipping and receiving table with dates on when the item was either shipped or received. In the shipping table (tbl_shipping) the date row is labeled as trans_out_date and for the receiving table (tbl_receiving) the date row is labeled as trans_in_date.
I can view transactions set on either table from a user entered form but I want to populate a table with information pulled from both tables where the criteria meets. ie. If the receiving table has 10 transactions done in April and 5 in June and the shipping table has 15 transactions in April and 10 in June... when the user wants to see all transactions in June, it will populate the 15 transactions that occurred in June.
As of right now, I can pull only from 1 table with
SELECT *
FROM tbl_shipping
WHERE trans_out_date >= ‘from_date’
AND trans_out_date <= ‘to_date’
Would this be the appropriate syntax for what I am looking to achieve?
SELECT *
FROM tbl_shipping
INNER JOIN tbl_receiving ON tbl_shipping.trans_out_date = tbl_receiving.trans_in_date
WHERE
tbl_shipping.trans_out_date >= ‘from_date’
AND tbl_shipping.trans_out_date <= ‘to_date’
Thank you again in advance for reading this.
You appear to want union all rather than a join:
SELECT s.item, s.trans_out_date as dte, 'shipped' as which
FROM tbl_shipping S
WHERE s.trans_out_date >= ? AND
s.trans_out_date <= ?
UNION ALL
SELECT r.item, NULL, r.trans_in_date as dte, 'received'
FROM tbl_receiving r
WHERE r.trans_out_date >= ? AND
r.trans_out_date <= ?
ORDER BY dte;
Notes:
A JOIN can cause problems due to data that goes missing (because dates don't line up) or data that gets duplicated (because there are multiple dates).
The ? is for a parameter. If you are calling this from an application, use parameters!
You can include additional columns for more information in the result set.
This may not be the exact result format you want. If not, ask another question with sample data and desired results.

Group by: calculated field to return respective date in bigquery

I need to do an user level analysis. As the data has a lot of different rows per user (related to different events), I need to group by user and create some calculated fields that represent the different rows. One of the fields is a calculation of the number of days since the last purchase of the user (today - last purchase date). I already tried a lot of different codes and also did a lot of research, but could not find the solution.
The codes that for me makes more sense but did not work are below:
Using case when statement
SELECT CASE WHEN LAST(tr_orderid <> "") THEN
DATEDIFF(CURRENT_DATE(),event_date) ELSE NULL END AS recency_lastbooking
FROM df
GROUP BY domain_userid
Using IF statement
SELECT IF(LAST(tr_total > 0), DATEDIFF(CURRENT_DATE(),event_date), NULL)
AS recency_lastbooking
FROM df
GROUP BY domain_userid
The error that I get is: Expression 'event_date' is not present in the GROUP BY list
I think if I use LAST(event_date) the query will return the last date in all the lines of the specific user, instead of return the last day the user had a purchase event.
P.S: I can use tr_total (total transaction) > 0 or tr_orderid (transaction order id) <> ""
Thank you!
I think you just want a window function:
SELECT DATE_DIFF(CURRENT_DATE,
MAX(tr_orderid) OVER (PARTITION BY domain_userid),
day
) AS recency_lastbooking
FROM df;

SQL: Average value per day

I have a database called ‘tweets’. The database 'tweets' includes (amongst others) the rows 'tweet_id', 'created at' (dd/mm/yyyy hh/mm/ss), ‘classified’ and 'processed text'. Within the ‘processed text’ row there are certain strings such as {TICKER|IBM}', to which I will refer as ticker-strings.
My target is to get the average value of ‘classified’ per ticker-string per day. The row ‘classified’ includes the numerical values -1, 0 and 1.
At this moment, I have a working SQL query for the average value of ‘classified’ for one ticker-string per day. See the script below.
SELECT Date( `created_at` ) , AVG( `classified` ) AS Classified
FROM `tweets`
WHERE `processed_text` LIKE '%{TICKER|IBM}%'
GROUP BY Date( `created_at` )
There are however two problems with this script:
It does not include days on which there were zero ‘processed_text’s like {TICKER|IBM}. I would however like it to spit out the value zero in this case.
I have 100+ different ticker-strings and would thus like to have a script which can process multiple strings at the same time. I can also do them manually, one by one, but this would cost me a terrible lot of time.
When I had a similar question for counting the ‘tweet_id’s per ticker-string, somebody else suggested using the following:
SELECT d.date, coalesce(IBM, 0) as IBM, coalesce(GOOG, 0) as GOOG,
coalesce(BAC, 0) AS BAC
FROM dates d LEFT JOIN
(SELECT DATE(created_at) AS date,
COUNT(DISTINCT CASE WHEN processed_text LIKE '%{TICKER|IBM}%' then tweet_id
END) as IBM,
COUNT(DISTINCT CASE WHEN processed_text LIKE '%{TICKER|GOOG}%' then tweet_id
END) as GOOG,
COUNT(DISTINCT CASE WHEN processed_text LIKE '%{TICKER|BAC}%' then tweet_id
END) as BAC
FROM tweets
GROUP BY date
) t
ON d.date = t.date;
This script worked perfectly for counting the tweet_ids per ticker-string. As I however stated, I am not looking to find the average classified scores per ticker-string. My question is therefore: Could someone show me how to adjust this script in such a way that I can calculate the average classified scores per ticker-string per day?
SELECT d.date, t.ticker, COALESCE(COUNT(DISTINCT tweet_id), 0) AS tweets
FROM dates d
LEFT JOIN
(SELECT DATE(created_at) AS date,
SUBSTR(processed_text,
LOCATE('{TICKER|', processed_text) + 8,
LOCATE('}', processed_text, LOCATE('{TICKER|', processed_text))
- LOCATE('{TICKER|', processed_text) - 8)) t
ON d.date = t.date
GROUP BY d.date, t.ticker
This will put each ticker on its own row, not a column. If you want them moved to columns, you have to pivot the result. How you do this depends on the DBMS. Some have built-in features for creating pivot tables. Others (e.g. MySQL) do not and you have to write tricky code to do it; if you know all the possible values ahead of time, it's not too hard, but if they can change you have to write dynamic SQL in a stored procedure.
See MySQL pivot table for how to do it in MySQL.