Clickhouse Cross join workaround? - sql

I am trying to calculate the percentage of faulty transaction statuses per IP address in Clickhouse.
SELECT
c.source_ip,
COUNT(c.source_ip) AS total,
(COUNT(c.source_ip) / t.total_calls) * 100 AS percent_faulty
FROM sip_transaction_call AS c
CROSS JOIN
(
SELECT count(*) AS total_calls
FROM sip_transaction_call
) AS t
WHERE (status = 8 OR status = 9 or status = 13)
GROUP BY c.source_ip
Unfortunately Clickhouse rejects this with:
"Received exception from server (version 20.8.3):
Code: 47. DB::Exception: Received from 127.0.0.1:9000. DB::Exception: Unknown identifier: total_calls there are columns: source_ip, COUNT(source_ip)."
I tried various workarounds for the "invisible" alias, but failed. Any help would be greatly appreciated.

SELECT
source_ip,
countIf(status = 8 OR status = 9 or status = 13) AS failed,
failed / count() * 100 AS percent_faulty
FROM sip_transaction_call
GROUP BY source_ip

If you have a GROUP BY clause, the SELECT list can only reference the columns you are grouping by (i.e. c.source_ip); every other column needs an aggregate function.
Clickhouse is not too helpful here: almost any other engine would give you a more meaningful error. See https://learnsql.com/blog/not-a-group-by-expression-error/.
To fix it, change the grouping to GROUP BY c.source_ip, t.total_calls.
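To see how the two suggestions differ, here is a minimal sketch that runs both query shapes against an in-memory SQLite database through Python's sqlite3 module. SQLite is only a stand-in for ClickHouse, the sample rows are made up, and SUM(CASE ...) is used where ClickHouse would use countIf. The first query gives the share of each IP's own calls that are faulty; the second keeps the CROSS JOIN, adds t.total_calls to the GROUP BY, and gives each IP's faulty calls as a share of all calls.

```python
import sqlite3

# Hypothetical sample data: (source_ip, status). Statuses 8, 9 and 13 are faulty.
ROWS = [
    ("10.0.0.1", 8), ("10.0.0.1", 1), ("10.0.0.1", 9), ("10.0.0.1", 1),
    ("10.0.0.2", 13), ("10.0.0.2", 1), ("10.0.0.2", 1), ("10.0.0.2", 1),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sip_transaction_call (source_ip TEXT, status INTEGER)")
conn.executemany("INSERT INTO sip_transaction_call VALUES (?, ?)", ROWS)

# Conditional aggregation: faulty calls relative to the same IP's total, no join.
PER_IP = """
SELECT source_ip,
       SUM(CASE WHEN status IN (8, 9, 13) THEN 1 ELSE 0 END) AS failed,
       100.0 * SUM(CASE WHEN status IN (8, 9, 13) THEN 1 ELSE 0 END) / COUNT(*)
           AS percent_faulty
FROM sip_transaction_call
GROUP BY source_ip
ORDER BY source_ip
"""

# Cross join kept: the one-row total is listed in GROUP BY so it may appear
# in the SELECT list next to the aggregate.
OF_ALL = """
SELECT c.source_ip,
       COUNT(*) AS failed,
       100.0 * COUNT(*) / t.total_calls AS percent_of_all_calls
FROM sip_transaction_call AS c
CROSS JOIN (SELECT COUNT(*) AS total_calls FROM sip_transaction_call) AS t
WHERE c.status IN (8, 9, 13)
GROUP BY c.source_ip, t.total_calls
ORDER BY c.source_ip
"""

per_ip = conn.execute(PER_IP).fetchall()
of_all = conn.execute(OF_ALL).fetchall()
print(per_ip)
print(of_all)
```

Which of the two numbers is wanted depends on whether "percentage of faulty statuses per IP" is measured against that IP's calls or against the whole table.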

Related

SQL BigQuery - Error that variable is not grouped by even though it is

SQL Code:
SELECT community_table.community_name,
community_table.id,
DATE(timestamp) as date,
ifnull(COUNT(distinct app_opened.user_id), 0) as num_opened_DAU,
lag(COUNT(distinct app_opened.user_id)) OVER
(ORDER BY community_table.community_name, community_table.id, DATE(timestamp)) as pre_Value
FROM *** app_opened
LEFT JOIN (
SELECT DISTINCT id, community_id_2, context_traits_first_name, context_traits_last_name
FROM (
SELECT *
FROM ***,
UNNEST (JSON_EXTRACT_ARRAY(context_traits_community_ids, "$")) as community_id_2
)
GROUP by community_id_2, id, context_traits_first_name, context_traits_last_name) as community_id_table
ON community_id_table.id = app_opened.user_id
LEFT JOIN (
SELECT DISTINCT id, name as community_name
FROM ***) as community_table
ON TO_JSON_STRING(community_table.id) = community_id_table.community_id_2
WHERE app_opened.user_id is not null AND
EXTRACT(DAYOFWEEK FROM DATE(timestamp)) = 2 AND
community_table.community_name is not null
GROUP BY community_table.community_name, community_table.id, DATE(timestamp)
Error Message:
I am quite confused about what could be going wrong here: the error says that timestamp is not grouped, even though I have grouped by it at the bottom. I tried grouping by just timestamp rather than DATE(timestamp), but that ruins the table I am trying to create, which counts the number of users on a single day. Does anyone have any other ideas? My goal is, for each row, to get the previous row's value; because I am grouping by specific columns, I need to make sure the rows are ordered by them as well. Thank you so much!
I think you simply need to modify the OVER part as follows:
OVER (PARTITION BY community_table.community_name, community_table.id, DATE(timestamp)) as pre_Value
UPDATE: It seems the problem was caused by using the DATE() function within OVER, so it can be solved by computing DATE(timestamp) inside a subquery and passing its alias to OVER.
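The subquery approach from the update can be sketched as follows. This is only an illustration on an in-memory SQLite database (window functions need SQLite 3.25 or newer), not BigQuery itself; the flat app_opened table, its ts column and the sample rows are assumptions made for the example. The inner query does the grouping and exposes the date under the alias day; the outer query applies LAG over that alias.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, flattened stand-in for the joined tables in the question.
conn.execute("CREATE TABLE app_opened (community_name TEXT, ts TEXT, user_id INTEGER)")
conn.executemany(
    "INSERT INTO app_opened VALUES (?, ?, ?)",
    [
        ("chess", "2021-03-01 09:00:00", 1),
        ("chess", "2021-03-01 17:30:00", 2),
        ("chess", "2021-03-08 10:15:00", 1),
        ("chess", "2021-03-15 08:00:00", 1),
        ("chess", "2021-03-15 08:05:00", 2),
        ("chess", "2021-03-15 12:00:00", 3),
    ],
)

QUERY = """
SELECT community_name,
       day,
       num_opened,
       -- LAG only sees the alias produced by the subquery, not DATE(ts).
       LAG(num_opened) OVER (ORDER BY community_name, day) AS pre_value
FROM (
    SELECT community_name,
           DATE(ts) AS day,
           COUNT(DISTINCT user_id) AS num_opened
    FROM app_opened
    GROUP BY community_name, DATE(ts)
) AS daily
ORDER BY community_name, day
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

The first row has no predecessor, so its pre_value is NULL (None in Python).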

How to get "session duration" group by "operating system" in Firebase Bigquery SQL?

I am trying to get the "average session duration" grouped by "operating system" (device.operating_system) and "date" (event_date).
In the Firebase blog, they give this query to get the average session duration:
SELECT SUM(engagement_time) AS total_user_engagement
FROM (
SELECT user_pseudo_id,
(SELECT value.int_value FROM UNNEST(event_params) WHERE key =
"engagement_time_msec") AS engagement_time
FROM `FIREBASE_PROJECT`
)
WHERE engagement_time > 0
GROUP BY user_pseudo_id
This query gives me the total user engagement per user ID (each row is a different user):
row|total_user_engagement
---|------------------
1 |989646
2 |225655
3 |125489
4 | 58496
...|......
But I have no idea where to add the "operating system" and "event_date" variables to get this information by OS and date. I tried different queries with no result. For example, to get this result by operating system I tried the following:
SELECT SUM(engagement_time) AS total_user_engagement
FROM (
SELECT device.operating_system,
(SELECT value.int_value FROM UNNEST(event_params) WHERE key =
"engagement_time_msec") AS engagement_time
FROM `FIREBASE_PROJECT`
)
WHERE engagement_time > 0
GROUP BY device.operating_system
But it gives me an error message (Error: Unrecognized name: device at [9:10] ). In other queries device.operating_system is recognized.
For example, in this one:
SELECT
event_date,
device.operating_system as os_type,
device.operating_system_version as os_version,
device.mobile_brand_name as device_brand,
device.mobile_model_name as device_model,
count(distinct user_pseudo_id) as all_users
FROM `FIREBASE Project`
GROUP BY 1,2,3,4,5
What I would like to have as a result is something like this :
row|event_date|OS     |total_user_engagement
---|----------|-------|---------------------
1  |20191212  |ios    |989646
2  |20191212  |android|225655
3  |20191212  |ios    |125489
4  |20191212  |android| 58496
...
Thank you
The error is probably because you are referencing the variable device in the outer query, while this variable is only visible from the inner query (subquery). I believe the issue will be fixed by changing the last row of the query from GROUP BY device.operating_system
to
GROUP BY operating_system.
Hopefully this will make clearer what is happening here: the inner query is accessing the table FIREBASE_PROJECT and returning the field operating_system from the nested column device. The outer query accesses the results of the inner query, so it only sees the returned field operating_system, without information about its original context within the nested variable device. That is why trying to reference device at this level will fail.
In the other example you posted this issue does not appear, since there is only a simple query.
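The scoping rule described above can be reproduced in a small sketch. SQLite (via Python's sqlite3) has no nested columns, so a table alias d stands in for the nested variable device here; the events table and its rows are invented for the illustration. The outer query fails when it refers to d, which only exists inside the subquery, and works when it uses the column name the subquery returns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical flat table; the alias d below plays the role of the "device" record.
conn.execute("CREATE TABLE events (operating_system TEXT, engagement_time INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("ios", 100), ("android", 40), ("ios", 0), ("android", 60), ("ios", 25)],
)

# Inner query: d is visible only in here; it returns plain column names.
INNER = """
SELECT d.operating_system AS operating_system,
       d.engagement_time AS engagement_time
FROM events AS d
"""


def total_by(group_expr):
    """Sum engagement per OS, grouping the outer query by the given expression."""
    sql = f"""
        SELECT operating_system, SUM(engagement_time) AS total_user_engagement
        FROM ({INNER})
        WHERE engagement_time > 0
        GROUP BY {group_expr}
        ORDER BY operating_system
    """
    return conn.execute(sql).fetchall()


try:
    total_by("d.operating_system")  # d is out of scope in the outer query
    outer_sees_inner_alias = True
except sqlite3.OperationalError as exc:
    outer_sees_inner_alias = False
    print("outer query failed:", exc)

totals = total_by("operating_system")  # the name exposed by the subquery
print(totals)
```

The same idea extends to event_date: select it in the inner query and add it to the outer SELECT and GROUP BY.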

Where clause with dates in hive

The where clause in the below hive query is not working
select
e.num as badge
from dbo.events as e
where TO_DATE(e.event_time_utc) > TO_DATE(select event_date from DL_EDGE_LRF_facilities.card_swipes_lastpulldate)
Both event_time_utc and event_date are defined as strings. event_time_utc holds timestamp values like '2017-09-18 20:10:19.000000', and event_date holds a single date value like '2018-01-25'.
When I run the query I get an error like "cannot recognize input near 'select' 'event_date' 'from' in function specification". Please help.
@user86683: Hive does not recognize this syntax because it does not allow a subquery in an inequality condition (>). You could try this query and let me know the result.
select e.num as badge
from dbo.events as e, DL_EDGE_LRF_facilities.card_swipes_lastpulldate c
where TO_DATE(e.event_time_utc) > TO_DATE(c.event_date)
You will get a cross-product warning, but you can ignore it since the table holding event_date has only one record.
Warning: Map Join MAPJOIN[10][bigTable=e] in task 'Map 1' is a cross product
Query ID = xxx_20180201102128_aaabb2235-ee69275cbec1
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_09fdf345)
Hope this helps. Thanks.
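As a rough illustration of why the cross product is harmless here, the sketch below runs the same join shape on an in-memory SQLite database through Python's sqlite3 module. It is not Hive: the schema prefixes are dropped, the sample rows are invented, and substr(..., 1, 10) stands in for TO_DATE, relying on ISO-formatted date strings comparing correctly as text. Because the second table has exactly one row, the join yields one row per event.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (num INTEGER, event_time_utc TEXT)")
conn.execute("CREATE TABLE card_swipes_lastpulldate (event_date TEXT)")

# Hypothetical rows in the string formats described in the question.
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [
        (101, "2017-09-18 20:10:19.000000"),
        (102, "2018-01-25 23:59:59.000000"),
        (103, "2018-01-26 00:00:01.000000"),
    ],
)
conn.execute("INSERT INTO card_swipes_lastpulldate VALUES ('2018-01-25')")

QUERY = """
SELECT e.num AS badge
FROM events AS e
CROSS JOIN card_swipes_lastpulldate AS c   -- one-row table, so no row blow-up
WHERE substr(e.event_time_utc, 1, 10) > c.event_date
ORDER BY e.num
"""

badges = [row[0] for row in conn.execute(QUERY)]
print(badges)
```

If the pull-date table ever held more than one row, each event could match several times, so the single-row assumption is worth checking.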

IBM Informix-SQL syntax error, basic query from Microsoft BIDS to Cisco UCCX database

I'm running the below query against an IBM Informix database and getting an ERROR 42000: A syntax error has occurred. The FROM and WHERE clauses run fine in other queries, so I'm looking at the SELECT and GROUP BY portions. Any ideas what's wrong with the syntax?
SELECT COUNT(DISTINCT "informix".agentconnectiondetail.sessionid) AS calls_abandoned,
DAY("informix".agentconnectiondetail.startdatetime) AS Expr2
FROM "informix".agentconnectiondetail, "informix".contactqueuedetail, "informix".contactservicequeue
WHERE "informix".agentconnectiondetail.sessionid = "informix".contactqueuedetail.sessionid AND
"informix".contactqueuedetail.targetid = "informix".contactservicequeue.recordid AND "informix".contactqueuedetail.disposition = 1 AND
"informix".agentconnectiondetail.startdatetime BETWEEN '2016-10-1 00:00:00' AND CURRENT
GROUP BY DAY("informix".agentconnectiondetail.startdatetime)
The goal btw is to find the total number of unique calls (calls_abandoned) that occur on each day of the month (1-31).
Replace the
GROUP BY DAY("informix".agentconnectiondetail.startdatetime)
by
GROUP BY 2

Google BigQuery Internal Error

Edit: Tidied up the query a bit. Checked running on one day (versus the 27 I need) and the query runs. With 27 days of data it's trying to process 5.67TB. Could this be the issue?
Latest ID of error run:
Job ID: ee-corporate:bquijob_3f47d425_1530e03af64
I keep getting this error message when trying to run a query in BigQuery, both through the UI and Bigrquery.
Query Failed
Error: An internal error occurred and the request could not be completed.
Job ID: ee-corporate:bquijob_6b9bac2e_1530dba312e
Code below:
SELECT
CASE WHEN d.category_grouped IS NULL THEN 'N/A' ELSE d.category_grouped END AS category_grouped_cleaned,
COUNT(UNIQUE(msisdn_token)) AS users,
(SUM(up_link_data_bytes) + SUM(down_link_data_bytes))/1000000 AS tot_data_mb
FROM (
SELECT
request_domain, up_link_data_bytes, down_link_data_bytes, msisdn_token, timestamp
FROM (TABLE_DATE_RANGE([helpful-skyline-97216:WEBLOG_Staging.WEBLOG_], TIMESTAMP('20160101'), TIMESTAMP('20160127')))
WHERE SUBSTR(http_status_code,1,1) IN ('1',
'2',
'3')) a
LEFT JOIN EACH web_usage_201601.domain_to_cat_lookup_27JAN_with_groups d
ON
a.request_domain = d.request_domain
WHERE
DATE(timestamp) >= '2016-01-01'
AND DATE(timestamp) <= '2016-01-27'
GROUP EACH BY
1
Is there something I'm doing wrong?
The problem seems to come from UNIQUE(): it returns a repeated field with too many elements in it. The error message could be improved, but a workaround is to use an explicit GROUP BY and then run COUNT on top of it.
If you are okay with an approximation, you can also use
COUNT(DISTINCT msisdn_token) AS users
or a higher approximation parameter than the default 1000,
COUNT(DISTINCT msisdn_token, 5000) AS users
GROUP BY is the most general approach, but these can be faster if they do what you need.
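The GROUP BY workaround has this general shape: group by the category and the user token first, then count the resulting rows per category. The sketch below shows it on an in-memory SQLite database through Python's sqlite3 module; it is not BigQuery legacy SQL, and the flat weblog table with its rows is made up for the example (the join to the category lookup is left out).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, already-joined stand-in for the weblog and category lookup.
conn.execute("CREATE TABLE weblog (category_grouped TEXT, msisdn_token TEXT)")
conn.executemany(
    "INSERT INTO weblog VALUES (?, ?)",
    [
        ("news", "a"), ("news", "a"), ("news", "b"),
        ("video", "a"), ("video", "c"), ("video", "c"), ("video", "d"),
    ],
)

# Step 1 (inner): one row per (category, user). Step 2 (outer): count those rows.
EXACT_USERS = """
SELECT category_grouped, COUNT(*) AS users
FROM (
    SELECT category_grouped, msisdn_token
    FROM weblog
    GROUP BY category_grouped, msisdn_token
) AS per_user
GROUP BY category_grouped
ORDER BY category_grouped
"""

users = conn.execute(EXACT_USERS).fetchall()
print(users)

# Cross-check against SQLite's exact COUNT(DISTINCT ...).
check = conn.execute(
    "SELECT category_grouped, COUNT(DISTINCT msisdn_token) FROM weblog "
    "GROUP BY category_grouped ORDER BY category_grouped"
).fetchall()
print(check)
```

Summing the data volume would need its own aggregation over the raw rows, since the inner GROUP BY collapses repeated visits by the same user.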