I use
(ln(session_length) - avg(ln(session_length)) OVER (PARTITION BY device_platform)) / nullif(stddev(ln(session_length)) OVER (PARTITION BY device_platform), 0) AS ln_std
to remove outliers in SQL. I have used this expression with Redshift before without any error, but when I use it with Postgres I get
[2201E] ERROR: cannot take logarithm of zero
The error appears when I add a WHERE clause with ln_std <= 1.67; otherwise there is no error.
Can someone point out what I am missing?
My code is:
SELECT
user_id
, event_date
, device_platform
, marketing_user
, session_length
FROM
(
SELECT
user_id
, date(event_time) AS event_date
, device_platform
, marketing_user AS marketing_user
, session_length
--! Normalisation: Using a logarithmic scale (ln())
--! Create the Z score for removing the outliers
, (ln(session_length) - avg(ln(session_length)) OVER (PARTITION BY device_platform)) /
nullif(stddev(ln(session_length)) OVER (PARTITION BY device_platform),
0) AS ln_std
FROM
session_start
WHERE
date(install_time) >= '2020-01-01'
) filter
WHERE
ln_std <= 1.67
There is a value less than or equal to zero in your session_length column; the error describes it pretty well. Do some analysis on why those values are there and treat them accordingly.
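If the non-positive values turn out to be bad data, a quick way to confirm the diagnosis and then sidestep the error is something like this (a sketch reusing the table and column names from the question):
-- Count the rows that make ln() fail:
SELECT count(*) AS non_positive_sessions
FROM session_start
WHERE session_length <= 0;
-- Then exclude them in the inner query so ln() is never evaluated on them:
SELECT user_id, session_length
FROM session_start
WHERE date(install_time) >= '2020-01-01'
  AND session_length > 0;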
I'm trying to run the weighted moving average Silota query against similar data in a Presto database, but I'm encountering an error. The same query has no issues in Redshift; in Presto, however, I receive the following error:
Query failed (#20220505_230258_04927_5xpwi):
line 14:14: Column 't2.row_number' cannot be resolved io.prestosql.spi.PrestoException:
line 14:14: Column 't2.row_number' cannot be resolved.
The data is the same in both databases, so why does the query run in Redshift while Presto throws this error?
WITH t AS
(select date_trunc('month',mql_date) date, avg(mqls) mqls, row_number() over ()
from marketing.campaign
WHERE date_trunc('month',mql_date) > date('2021-12-31')
GROUP BY 1)
select t.date, avg(t.mqls),
sum(case
when t.row_number - t2.row_number = 0 then 0.4 * t2.mqls
when t.row_number - t2.row_number = 1 then 0.3 * t2.mqls
when t.row_number - t2.row_number = 2 then 0.2 * t2.mqls
when t.row_number - t2.row_number = 3 then 0.1 * t2.mqls
end) weighted_avg
from t
join t t2 on t2.row_number between t.row_number - 3 and t.row_number
group by 1
order by 1
I suspect it is because your SQL assumes that the result of the row_number() window function will be called "row_number". That is true in Redshift, but other databases may assign it a different name. You should alias it explicitly to a defined name such as "rn".
Also, you have no ORDER BY clause in your row_number() window, which makes the row numbers unpredictable and possibly different between invocations.
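Putting both fixes together, the query might look like this (a sketch; the alias rn and the ordering expression are illustrative choices, not names Presto requires):
WITH t AS
(select date_trunc('month',mql_date) date, avg(mqls) mqls,
        -- explicit alias and deterministic ordering for the row numbers
        row_number() over (order by date_trunc('month',mql_date)) rn
 from marketing.campaign
 WHERE date_trunc('month',mql_date) > date('2021-12-31')
 GROUP BY 1)
select t.date, avg(t.mqls),
sum(case
    when t.rn - t2.rn = 0 then 0.4 * t2.mqls
    when t.rn - t2.rn = 1 then 0.3 * t2.mqls
    when t.rn - t2.rn = 2 then 0.2 * t2.mqls
    when t.rn - t2.rn = 3 then 0.1 * t2.mqls
end) weighted_avg
from t
join t t2 on t2.rn between t.rn - 3 and t.rn
group by 1
order by 1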
SQL Code:
SELECT community_table.community_name,
community_table.id,
DATE(timestamp) as date,
ifnull(COUNT(distinct app_opened.user_id), 0) as num_opened_DAU,
lag(COUNT(distinct app_opened.user_id)) OVER
(ORDER BY community_table.community_name, community_table.id, DATE(timestamp)) as pre_Value
FROM *** app_opened
LEFT JOIN (
SELECT DISTINCT id, community_id_2, context_traits_first_name, context_traits_last_name
FROM (
SELECT *
FROM ***,
UNNEST (JSON_EXTRACT_ARRAY(context_traits_community_ids, "$")) as community_id_2
)
GROUP by community_id_2, id, context_traits_first_name, context_traits_last_name) as community_id_table
ON community_id_table.id = app_opened.user_id
LEFT JOIN (
SELECT DISTINCT id, name as community_name
FROM ***) as community_table
ON TO_JSON_STRING(community_table.id) = community_id_table.community_id_2
WHERE app_opened.user_id is not null AND
EXTRACT(DAYOFWEEK FROM DATE(timestamp)) = 2 AND
community_table.community_name is not null
GROUP BY community_table.community_name, community_table.id, DATE(timestamp)
Error Message:
I am quite confused about what could be going wrong here: the error says that timestamp is not grouped, even though I group by it at the bottom. I tried including just timestamp rather than DATE(timestamp), but that ruins the table I am trying to create, where I count the number of users on a single day. Does anyone have any other ideas? My goal is, for a single row, to get the previous row's data, but because I am grouping by specific metrics, I need to make sure the rows are ordered by them as well. Thank you so much!
I think you simply need to modify the OVER part as:
OVER (PARTITION BY community_table.community_name, community_table.id, DATE(timestamp)) as pre_Value
UPDATE: it seems the problem was caused by using the DATE() function within OVER, so it can be solved by computing DATE(timestamp) inside a subquery and passing the alias to OVER.
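In other words, something along these lines (a minimal sketch; joined_events is a hypothetical stand-in for the joined tables in the question):
SELECT community_name,
       id,
       date,
       COUNT(DISTINCT user_id) AS num_opened_DAU,
       LAG(COUNT(DISTINCT user_id)) OVER
         (ORDER BY community_name, id, date) AS pre_Value
FROM (
  -- DATE(timestamp) is computed once here, so OVER only ever sees the alias
  SELECT community_name, id, user_id, DATE(timestamp) AS date
  FROM joined_events
)
GROUP BY community_name, id, date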
When I run the following query, my Netezza NPS reboots. Would someone please let me know what is causing this behaviour?
select avg ( bse.WEEKS_BETWEEN_RESPONSES_HR ) as g_AVG
, sqlext.median( bse.WEEKS_BETWEEN_RESPONSES_HR ) as g_med
from (
select WEEKS_BETWEEN_RESPONSES_HR
FROM (
select distinct LOYALTY_ACCOUNT_CARD_ID
, BONUS_END_DATE
, LAG(BONUS_END_DATE,1) OVER (partition by LOYALTY_ACCOUNT_CARD_ID order by BONUS_END_DATE) as PRIOR_BONUS_END_DATE
,(( BONUS_END_DATE - PRIOR_BONUS_END_DATE)/7) as WEEKS_BETWEEN_RESPONSES_HR
from JO_ACT_PTD_STEP_1 bse
where upper ( bonus_desc ) like '%SPEND%'
and redemption = 1
) BSE
where WEEKS_BETWEEN_RESPONSES_HR is not null and WEEKS_BETWEEN_RESPONSES_HR > 0
) bse limit 500
You need to call the support people at IBM. There is probably a stack trace or a dump file somewhere that will tell them what happened.
If I were experiencing your problem, I would remove each of the function calls one by one, making the SQL simpler and simpler until the error disappeared.
But of course you will need to do that in the middle of the night, or at a time when nobody else is being bothered by the constant reboots.
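For example, since sqlext.median() comes from the SQL extension toolkit while avg() is built in, two probes like these would narrow things down quickly (a sketch; the three-row derived table is just a trivial input, and the LAG() is inlined rather than referenced via its alias):
-- Probe 1: is sqlext.median() itself stable on trivial input?
select sqlext.median(n) as med_test
from (select 1 as n union all select 2 union all select 3) t;
-- Probe 2: the same shape as the failing query, but with only the
-- built-in avg(); if this succeeds while the original crashes, the
-- median call is the prime suspect.
select avg(bse.WEEKS_BETWEEN_RESPONSES_HR) as g_AVG
from (
  select ((BONUS_END_DATE
           - LAG(BONUS_END_DATE,1) OVER (partition by LOYALTY_ACCOUNT_CARD_ID
                                         order by BONUS_END_DATE)) / 7)
           as WEEKS_BETWEEN_RESPONSES_HR
  from JO_ACT_PTD_STEP_1
  where upper(bonus_desc) like '%SPEND%'
    and redemption = 1
) bse
where bse.WEEKS_BETWEEN_RESPONSES_HR is not null
  and bse.WEEKS_BETWEEN_RESPONSES_HR > 0;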
BigQuery started giving me an "Error: not enough memory" when I ran this query this morning. The two tables involved contain no more than 5 GB of data. Also, I'm using table decorators; 1407249067530 is around 10:30 AM today (20140805). I wonder what the problem is.
Job ID: red-road-574:job_x8flLfo4QwA1gQ_FCrNWbKY-bZM
select * from
(
select t_connection.row_id AS debug_row_id,
t_connection.hardware_id AS hardware_id,
t_connection.debug_data AS debug_data,
t_connection.connection_status AS connection_status,
t_connection.date_time AS debug_date_time,
t_gps.hardware_id AS hardware_id2,
t_gps.latitude AS latitude,
t_gps.longitude AS longitude,
t_gps.date_time AS gps_date_time,
t_gps.zip_code AS zip_code,
ROW_NUMBER() OVER (PARTITION BY debug_row_id ORDER BY time_diff) row_num,
from(
select *,
ABS(t_gps.date_time-t_connection.date_time) AS time_diff
from ( select CONCAT(String(gg.hardware_id),String(gg.date_time)) as row_id,
gg.hardware_id as hardware_id,
gg.latitude as latitude,
gg.longitude as longitude,
gg.date_time as date_time,
gg.zip_code as zip_code
from [my data set.table1_20140805#1407249067530-] gg
) AS t_gps
INNER JOIN EACH
( select CONCAT(CONCAT(String(dd.debug_reason),String(dd.hardware_id)),String(dd.date_time)) as row_id,
dd.hardware_id as hardware_id,
dd.date_time as date_time,
dd.debug_data as debug_data,
case
when dd.debug_reason = 1 then 'Successful_Connection'
when dd.debug_reason = 2 then 'Dropped_Connection'
when dd.debug_reason = 3 then 'Failed_Connection'
end AS connection_status
from [my data set.table2_20140805#1407249067530-] dd
where dd.debug_reason in (50013, 50017, 50018)
) as t_connection
ON t_connection.hardware_id = t_gps.hardware_id
)
) WHERE row_num=1
You're hitting an odd corner case. When you use allowLargeResults with results that are nested or repeated and you don't set flattenResults=false, the query goes into a special mode. (When you use timestamps, you're really using a nested data structure, which was a design decision that spawned a thousand bugs and is hopefully changing soon.) This special query mode has some limitations, which are what you're hitting.
In general, we want this to be seamless, which is why it isn't documented. However, since you're running into a problem here, I'll explain a little about how to avoid it.
You have a couple of options to get around this:
If you're using nested or repeated results (it looks like you're not, which is good):
rename your result fields so they don't have dots in their names.
set the flattenResults field on the query to 'false'. This means that nested and repeated fields will actually be nested and repeated in the results.
If you're using timestamps in the results:
Convert your timestamps to strings or numeric values. Sorry.
If you don't really need large results:
unset the allowLargeResults flag.
I realize that all of these options are deeply unsatisfying. This is an area we're actively working to improve.
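For what it's worth, if you submit the query through the bq command-line tool, those options map to flags roughly like this (a sketch; the destination dataset and table names are placeholders):
bq query \
  --allow_large_results \
  --noflatten_results \
  --destination_table=mydataset.join_results \
  'SELECT hardware_id, date_time FROM [mydataset.debug_events]'
Note that allowLargeResults requires a destination table to be set, which is why the placeholder --destination_table appears above.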
Now, with allowLargeResults=true, flattenResults=false, and the timestamps converted to numeric values in the first step:
select * from
(
select row_id AS debug_row_id,
hardware_id AS hardware_id,
debug_data AS debug_data,
connection_status AS connection_status,
date_time AS debug_date_time,
hardware_id2 AS hardware_id2,
latitude AS latitude,
longitude AS longitude,
date_time2 AS gps_date_time,
zip_code AS zip_code,
ROW_NUMBER() OVER (PARTITION BY debug_row_id ORDER BY time_diff) row_num,
from(
select *,
ABS(t_gps.date_time2-t_connection.date_time) AS time_diff
from ( select CONCAT(String(gg.hardware_id),String(gg.date_time)) as row_id_gps,
gg.hardware_id as hardware_id2,
gg.latitude as latitude,
gg.longitude as longitude,
TIMESTAMP_TO_MSEC(gg.date_time) as date_time2,
gg.zip_code as zip_code
from [test.gps32_20140805#1407249067530-] gg
) AS t_gps
INNER JOIN EACH
( select CONCAT(CONCAT(String(dd.debug_reason),String(dd.hardware_id)),String(dd.date_time)) as row_id,
dd.hardware_id as hardware_id,
TIMESTAMP_TO_MSEC(dd.date_time) as date_time,
dd.debug_data as debug_data,
case
when dd.debug_reason = 1 then 'Successful_Connection'
when dd.debug_reason = 2 then 'Dropped_Connection'
when dd.debug_reason = 3 then 'Failed_Connection'
end AS connection_status
from [test.debug_data_developer_20140805#1407249067530-] dd
where dd.debug_reason in (50013, 50017, 50018)
) as t_connection
ON t_connection.hardware_id = t_gps.hardware_id2
)
) WHERE row_num=1
it gives me:
Query Failed
Error: Resources exceeded during query execution.
Job ID: red-road-574:job_ikWQvffmPEUP6DtTvJaYpXHFJ2M
This is the functioning SQL with allowLargeResults=true and flattenResults=true. I don't know what I did to make it work; maybe I only added a HAVING clause? But in the JOIN I changed one side to be a whole table instead of the decorated one above, so the amount of data involved actually increased. I'm not sure whether it will keep succeeding or whether it's just temporary luck.
I'm executing the following query.
SELECT properties.os, boundary, user, td,
SUM(boundary) OVER(ORDER BY rows) AS session
FROM
(
SELECT properties.os, ROW_NUMBER() OVER() AS rows, user, td,
CASE WHEN td > 1800 THEN 1 ELSE 0 END AS boundary
FROM (
SELECT properties.os, t1.properties.distinct_id AS user,
(t2.properties.time - t1.properties.time) AS td
FROM (
SELECT properties.os, properties.distinct_id, properties.time, srlno,
srlno-1 AS prev_srlno
FROM (
SELECT properties.os, properties.distinct_id, properties.time,
ROW_NUMBER()
OVER (PARTITION BY properties.distinct_id
ORDER BY properties.time) AS srlno
FROM [ziptrips.ziptrips_events]
WHERE properties.time > 1367916800
AND properties.time < 1380003200)) AS t1
JOIN (
SELECT properties.distinct_id, properties.time, srlno,
srlno-1 AS prev_srlno
FROM (
SELECT properties.distinct_id, properties.time,
ROW_NUMBER() OVER
(PARTITION BY properties.distinct_id ORDER BY properties.time) AS srlno
FROM [ziptrips.ziptrips_events]
WHERE
properties.time > 1367916800
AND properties.time < 1380003200 )) AS t2
ON t1.srlno = t2.prev_srlno
AND t1.properties.distinct_id = t2.properties.distinct_id
WHERE (t2.properties.time - t1.properties.time) > 0))
It fails the first time with the following error; however, on the second run it completes without any issue. I'd appreciate any pointers on what might be causing this.
The error message is:
Query Failed
Error: Field 'properties.os' not found in table '__R2'.
Job ID: job_VWunPesUJVLxWGZsMgpoti14BM4
Thanks,
Navneet
We (the BigQuery team) are in the process of rolling out a new version of the query engine that fixes a number of issues like this one. You likely hit an old version of the query engine and then when you retried, hit the new one. It may take us a day or so with a portion of traffic pointing at the updated version in order to verify there aren't any regressions. Please let us know if you hit this again after 24 hours or so.