I'm trying to get percentages from a table in a ClickHouse DB. I'm having difficulty writing a query that calculates the percentage of each type within each timestamp group.
SELECT
(intDiv(toUInt32(toDateTime(atime)), 120) * 120) * 1000 AS timestamp,
if(dateDiff('second', toDateTime(t1.atime), toDateTime(t2.unixdsn)) <= 5, 'sec5',
   if((dateDiff('second', toDateTime(t1.atime), toDateTime(t2.unixdsn)) > 5) AND (dateDiff('second', toDateTime(t1.atime), toDateTime(t2.unixdsn)) <= 30), 'sec30',
      if((dateDiff('second', toDateTime(t1.atime), toDateTime(t2.unixdsn)) > 30) AND (dateDiff('second', toDateTime(t1.atime), toDateTime(t2.unixdsn)) <= 60), 'sec60',
         'secgt60'))) AS type,
count() AS total_count,
(total_count * 100) /
(
SELECT count()
FROM sess_logs.logs_view
WHERE (status IN (0, 1)) AND (toDateTime(atime) >= toDateTime(1621410625)) AND (toDateTime(atime) <= toDateTime(1621421425))
) AS percentage_cnt
FROM sess_logs.logs_view AS t1
INNER JOIN
(
SELECT
trid,
atime,
unixdsn,
status
FROM sess_logs.logs_view
WHERE (status = 1) AND (toDate(date) >= toDate(1621410625)) AND if('all' = 'all', 1, userid =
(
SELECT userid
FROM sess_logs.user_details
WHERE (username != 'all') AND (username = 'all')
))
) AS t2 ON t1.trid = t2.trid
WHERE (t1.status = 0) AND (t2.status = 1) AND ((toDate(atime) >= toDate(1621410625)) AND (toDate(atime) <= toDate(1621421425))) AND (toDateTime(atime) >= toDateTime(1621410625)) AND (toDateTime(atime) <= toDateTime(1621421425)) AND if('all' = 'all', 1, userid =
(
SELECT userid
FROM sess_logs.user_details
WHERE (username != 'all') AND (username = 'all')
))
GROUP BY
timestamp,
type
ORDER BY timestamp ASC
Output
┌─────timestamp─┬─type────┬─total_count─┬─────────percentage_cnt─┐
│ 1621410600000 │ sec5 │ 15190 │ 0.9650982602181922 │
│ 1621410600000 │ sec30 │ 1525 │ 0.09689103665785011 │
│ 1621410600000 │ sec60 │ 33 │ 0.002096658498169871 │
│ 1621410600000 │ secgt60 │ 61 │ 0.0038756414663140043 │
│ 1621410720000 │ secgt60 │ 67 │ 0.004256852102344891 │
│ 1621410720000 │ sec30 │ 2082 │ 0.13228009070271735 │
│ 1621410720000 │ sec60 │ 65 │ 0.004129781890334595 │
│ 1621410720000 │ sec5 │ 20101 │ 1.2771191658094723 │
│ 1621410840000 │ sec30 │ 4598 │ 0.29213441741166873 │
│ 1621410840000 │ sec60 │ 36 │ 0.002287263816185314 │
│ 1621410840000 │ secgt60 │ 61 │ 0.0038756414663140043 │
│ 1621410840000 │ sec5 │ 17709 │ 1.1251431922451591 │
│ 1621410960000 │ sec60 │ 17 │ 0.0010800968020875095 │
│ 1621410960000 │ secgt60 │ 81 │ 0.005146343586416957 │
│ 1621410960000 │ sec30 │ 2057 │ 0.13069171305258864 │
│ 1621410960000 │ sec5 │ 18989 │ 1.206468127931748 │
│ 1621411080000 │ sec60 │ 9 │ 0.0005718159540463285 │
│ 1621411080000 │ sec30 │ 3292 │ 0.20915756896894594 │
│ 1621411080000 │ sec5 │ 15276 │ 0.9705622793346349 │
│ 1621411080000 │ secgt60 │ 78 │ 0.004955738268401514 │
└───────────────┴─────────┴─────────────┴────────────────────────┘
It returns a percentage for each row, but when I sum the percentage_cnt column the total does not come to 100%; it comes to about 80% instead.
Please help me correct my query. I know the query is huge; a simpler example for my use case would also be fine. Thanks.
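For what it's worth, the core of the problem is that the denominator subquery counts every row with status IN (0, 1), while the numerator only counts rows that survive the INNER JOIN, which is most likely why the column sums to roughly 80% rather than 100%. One way to get a percentage within each timestamp group is to divide each group's count by the total of its own group. A minimal sketch of that idea, assuming a hypothetical table t with columns ts and type (stand-ins for your timestamp and type expressions) and a ClickHouse version with window functions (21.3+):

SELECT
    ts,
    type,
    count() AS total_count,
    -- normalise against the total of the same ts group, not the whole table
    count() * 100 / sum(count()) OVER (PARTITION BY ts) AS percentage_cnt
FROM t
GROUP BY ts, type
ORDER BY ts, type

With this shape, percentage_cnt sums to 100 within every ts group.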
Let us say that I have a table with user_id of Int32 type and login_time as a DateTime in UTC. user_id is not unique, so SELECT user_id, login_time FROM some_table; gives the following result:
┌─user_id─┬──login_time─┐
│ 1 │ 2021-03-01 │
│ 1 │ 2021-03-01 │
│ 1 │ 2021-03-02 │
│ 2 │ 2021-03-02 │
│ 2 │ 2021-03-03 │
└─────────┴─────────────┘
If I run SELECT COUNT(*) as count, toDate(login_time) as l FROM some_table GROUP BY l, I get the following result:
┌─count───┬──login_time─┐
│ 2 │ 2021-03-01 │
│ 2 │ 2021-03-02 │
│ 1 │ 2021-03-03 │
└─────────┴─────────────┘
I would like to reformat the result to show COUNT on a weekly level, instead of every day, as I currently do.
My result for the above example could look something like this:
┌──count──┬──year─┬──month──┬─week ordinal┐
│ 5 │ 2021 │ 03 │ 1 │
│ 0 │ 2021 │ 03 │ 2 │
│ 0 │ 2021 │ 03 │ 3 │
│ 0 │ 2021 │ 03 │ 4 │
└─────────┴───────┴─────────┴─────────────┘
I have gone through the documentation, found some interesting functions, but did not manage to make them solve my problem.
I have never worked with ClickHouse before and am not very experienced with SQL, which is why I'm asking here for help.
Try this query:
select
    count() count,
    toYear(start_of_month) year,
    toMonth(start_of_month) month,
    toWeek(start_of_week) - toWeek(start_of_month) + 1 AS "week ordinal"
from (
    select
        *,
        toStartOfMonth(login_time) start_of_month,
        toStartOfWeek(login_time) start_of_week
    from (
        /* emulate test dataset */
        select data.1 user_id, toDate(data.2) login_time
        from (
            select arrayJoin([
                (1, '2021-02-27'),
                (1, '2021-02-28'),
                (1, '2021-03-01'),
                (1, '2021-03-01'),
                (1, '2021-03-02'),
                (2, '2021-03-02'),
                (2, '2021-03-03'),
                (2, '2021-03-08'),
                (2, '2021-03-16'),
                (2, '2021-04-01')]) data)
    )
)
group by start_of_month, start_of_week
order by start_of_month, start_of_week
/*
┌─count─┬─year─┬─month─┬─week ordinal─┐
│ 1 │ 2021 │ 2 │ 4 │
│ 1 │ 2021 │ 2 │ 5 │
│ 5 │ 2021 │ 3 │ 1 │
│ 1 │ 2021 │ 3 │ 2 │
│ 1 │ 2021 │ 3 │ 3 │
│ 1 │ 2021 │ 4 │ 1 │
└───────┴──────┴───────┴──────────────┘
*/
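Applied directly to the some_table / login_time layout from the question (rather than the emulated dataset), the same calculation would look roughly like the sketch below. Note that weeks with no logins still will not appear as zero-count rows; empty weeks would have to be filled in separately.

SELECT
    count() AS count,
    toYear(toStartOfMonth(login_time)) AS year,
    toMonth(toStartOfMonth(login_time)) AS month,
    toWeek(toStartOfWeek(login_time)) - toWeek(toStartOfMonth(login_time)) + 1 AS week_ordinal
FROM some_table
GROUP BY year, month, week_ordinal
ORDER BY year, month, week_ordinal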
How can I make sure that with this join I'll only receive the sum of results and not the product?
I have a project entity which contains two one-to-many relations: disposals and supplies. If I query one of them, say disposals, with the following query:
SELECT *
FROM projects
JOIN disposals disposal on projects.project_id = disposal.disposal_project_refer
WHERE (projects.project_name = 'Höngg')
I get the following result:
project_id,project_name,disposal_id,depository_refer,material_refer,disposal_date,disposal_measurement,disposal_project_refer
1,Test,1,1,1,2020-08-12 15:24:49.913248,123,1
1,Test,2,1,2,2020-08-12 15:24:49.913248,123,1
1,Test,7,2,1,2020-08-12 15:24:49.913248,123,1
1,Test,10,3,4,2020-08-12 15:24:49.913248,123,1
The same number of rows is returned by the same query for supplies.
type Project struct {
    ProjectID   uint       `gorm:"primary_key" json:"ProjectID"`
    ProjectName string     `json:"ProjectName"`
    Disposals   []Disposal `gorm:"ForeignKey:disposal_project_refer"`
    Supplies    []Supply   `gorm:"ForeignKey:supply_project_refer"`
}
If I query both tables, I would like to receive the sum of the two single queries (8 rows in this case). Currently I am receiving 16 results (4 supply rows multiplied by 4 disposal rows).
The combined query:
SELECT *
FROM projects
JOIN disposals disposal ON projects.project_id = disposal.disposal_project_refer
JOIN supplies supply ON projects.project_id = supply.supply_project_refer
WHERE (projects.project_name = 'Höngg');
I have tried achieving my goal with union queries, but I was not successful. What else should I try to achieve my goal?
This is your case (simplified):
# with a(x,y) as (values(1,1)), b(x,z) as (values(1,11),(1,22)), c(x,t) as (values(1,111),(1,222))
select * from a join b on (a.x=b.x) join c on (b.x=c.x);
┌───┬───┬───┬────┬───┬─────┐
│ x │ y │ x │ z │ x │ t │
├───┼───┼───┼────┼───┼─────┤
│ 1 │ 1 │ 1 │ 11 │ 1 │ 111 │
│ 1 │ 1 │ 1 │ 11 │ 1 │ 222 │
│ 1 │ 1 │ 1 │ 22 │ 1 │ 111 │
│ 1 │ 1 │ 1 │ 22 │ 1 │ 222 │
└───┴───┴───┴────┴───┴─────┘
It produces a Cartesian join because the join value is the same in all tables. You need some additional condition for joining your data. For example (tests for various cases):
# with a(x,y) as (values(1,1)), b(x,z) as (values(1,11),(1,22)), c(x,t) as (values(1,111),(1,222))
select *
from a
cross join lateral (
select *
from (select row_number() over() as rn, * from b where b.x=a.x) as b
full join (select row_number() over() as rn, * from c where c.x=a.x) as c on (b.rn=c.rn)
) as bc;
┌───┬───┬────┬───┬────┬────┬───┬─────┐
│ x │ y │ rn │ x │ z │ rn │ x │ t │
├───┼───┼────┼───┼────┼────┼───┼─────┤
│ 1 │ 1 │ 1 │ 1 │ 11 │ 1 │ 1 │ 111 │
│ 1 │ 1 │ 2 │ 1 │ 22 │ 2 │ 1 │ 222 │
└───┴───┴────┴───┴────┴────┴───┴─────┘
# with a(x,y) as (values(1,1)), b(x,z) as (values(1,11),(1,22),(1,33)), c(x,t) as (values(1,111),(1,222))
select *
from a
cross join lateral (
select *
from (select row_number() over() as rn, * from b where b.x=a.x) as b
full join (select row_number() over() as rn, * from c where c.x=a.x) as c on (b.rn=c.rn)
) as bc;
┌───┬───┬────┬───┬─────┬──────┬──────┬──────┐
│ x │ y │ rn │ x │ z │ rn │ x │ t │
├───┼───┼────┼───┼─────┼──────┼──────┼──────┤
│ 1 │ 1 │ 1 │ 1 │ 11 │ 1 │ 1 │ 111 │
│ 1 │ 1 │ 2 │ 1 │ 22 │ 2 │ 1 │ 222 │
│ 1 │ 1 │ 3 │ 1 │ 33 │ ░░░░ │ ░░░░ │ ░░░░ │
└───┴───┴────┴───┴─────┴──────┴──────┴──────┘
# with a(x,y) as (values(1,1)), b(x,z) as (values(1,11),(1,22)), c(x,t) as (values(1,111),(1,222),(1,333))
select *
from a
cross join lateral (
select *
from (select row_number() over() as rn, * from b where b.x=a.x) as b
full join (select row_number() over() as rn, * from c where c.x=a.x) as c on (b.rn=c.rn)
) as bc;
┌───┬───┬──────┬──────┬──────┬────┬───┬─────┐
│ x │ y │ rn │ x │ z │ rn │ x │ t │
├───┼───┼──────┼──────┼──────┼────┼───┼─────┤
│ 1 │ 1 │ 1 │ 1 │ 11 │ 1 │ 1 │ 111 │
│ 1 │ 1 │ 2 │ 1 │ 22 │ 2 │ 1 │ 222 │
│ 1 │ 1 │ ░░░░ │ ░░░░ │ ░░░░ │ 3 │ 1 │ 333 │
└───┴───┴──────┴──────┴──────┴────┴───┴─────┘
Note that there is no obvious relation between disposals and supplies (b and c in my example), so the pairing order could be arbitrary. In my opinion, a better solution for this task would be to aggregate the data from those tables using JSON, for example:
with a(x,y) as (values(1,1)), b(x,z) as (values(1,11),(1,22),(1,33)), c(x,t) as (values(1,111),(1,222))
select
*,
(select json_agg(to_json(b.*)) from b where a.x=b.x) as b,
(select json_agg(to_json(c.*)) from c where a.x=c.x) as c
from a;
┌───┬───┬──────────────────────────────────────────────────┬────────────────────────────────────┐
│ x │ y │ b │ c │
├───┼───┼──────────────────────────────────────────────────┼────────────────────────────────────┤
│ 1 │ 1 │ [{"x":1,"z":11}, {"x":1,"z":22}, {"x":1,"z":33}] │ [{"x":1,"t":111}, {"x":1,"t":222}] │
└───┴───┴──────────────────────────────────────────────────┴────────────────────────────────────┘
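Back on the Go side, those two JSON array columns could then be unmarshalled into the Disposals and Supplies slices of the Project struct. Alternatively, if raw SQL is not a hard requirement, GORM can populate both relations with its Preload mechanism, which issues one additional query per relation and therefore avoids the row multiplication entirely.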
I am trying to concatenate the contents of the rows of a DataFrame similar to this one:
DataFrame(a=["aa","ab","ac"], year=[2015,2016,2017])
a year
aa 2015
ab 2016
ac 2017
The desired output would be a concatenation of the contents of the row cells, converted to strings:
output
aa2015
ab2016
ac2017
I have found this code, which works in the right direction:
df[:c] = map((x,y) -> string(x, y), df[:a], df[:year])
However, my input is variable: I can have a different number of columns, and I want all of their contents to be concatenated row by row.
Any suggestions on how to achieve this? It doesn't matter if the column gets added to the original DataFrame, if that helps.
Thanks a lot
You can use eachrow to achieve this:
julia> df = DataFrame(rand('a':'z', 5,5))
5×5 DataFrame
│ Row │ x1 │ x2 │ x3 │ x4 │ x5 │
│ │ Char │ Char │ Char │ Char │ Char │
├─────┼──────┼──────┼──────┼──────┼──────┤
│ 1 │ 'l' │ 'p' │ 'y' │ 't' │ 'n' │
│ 2 │ 'p' │ 'y' │ 'y' │ 'r' │ 's' │
│ 3 │ 'y' │ 'a' │ 'o' │ 'c' │ 'a' │
│ 4 │ 'k' │ 't' │ 'q' │ 's' │ 'q' │
│ 5 │ 'a' │ 'c' │ 'w' │ 'f' │ 'v' │
julia> join.(eachrow(df))
5-element Array{String,1}:
"lpytn"
"pyyrs"
"yaoca"
"ktqsq"
"acwfv"
(here I just created a new vector - you can of course add it to a DataFrame if you want)
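(If you do want the result in the data frame, assigning it to a new column, e.g. df.output = join.(eachrow(df)), should work on reasonably recent DataFrames.jl versions.)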
I have a SQLite table (for messages). The table has two columns used for ordering: created and sent. I need the result sorted by the sent field (descending), but where sent is 0 it should fall back to the created field (also descending).
I'm using the SQL function COALESCE, but the order of the result is wrong.
Normal result (without COALESCE):
SELECT * FROM messages ORDER BY sent DESC
┌─────────────┬──────────┬────────────┬────────────┐
│ external_id │ body │ created │ sent │
├─────────────┼──────────┼────────────┼────────────┤
│ ... │ qw │ 1463793500 │ 1463793493 │ <-
├─────────────┼──────────┼────────────┼────────────┤
│ ... │ huyak │ 1463783516 │ 1463662248 │
├─────────────┼──────────┼────────────┼────────────┤
│ ... │ tete │ 1463783516 │ 1463662248 │
├─────────────┼──────────┼────────────┼────────────┤
│ ... │ Te │ 1463783516 │ 1463662248 │
└─────────────┴──────────┴────────────┴────────────┘
Wrong result (with COALESCE):
SELECT * FROM messages ORDER BY COALESCE(sent,created)=0 DESC
┌─────────────┬──────────┬────────────┬────────────┐
│ external_id │ body │ created │ sent │
├─────────────┼──────────┼────────────┼────────────┤
│ ... │ Te │ 1463783516 │ 1463662248 │
├─────────────┼──────────┼────────────┼────────────┤
│ ... │ huyak │ 1463783516 │ 1463662248 │
├─────────────┼──────────┼────────────┼────────────┤
│ ... │ tete │ 1463783516 │ 1463662248 │
├─────────────┼──────────┼────────────┼────────────┤
│ ... │ qw │ 1463793500 │ 1463793493 │ <-
└─────────────┴──────────┴────────────┴────────────┘
I tried removing the =0 expression, and then the order is correct, but that query doesn't work correctly if sent = 0:
SELECT * FROM messages ORDER BY COALESCE(sent,created) DESC
┌─────────────┬──────────┬────────────┬────────────┐
│ external_id │ body │ created │ sent │
├─────────────┼──────────┼────────────┼────────────┤
│ ... │ qw │ 1463793500 │ 1463793493 │ <-
├─────────────┼──────────┼────────────┼────────────┤
│ ... │ huyak │ 1463783516 │ 1463662248 │
├─────────────┼──────────┼────────────┼────────────┤
│ ... │ tete │ 1463783516 │ 1463662248 │
├─────────────┼──────────┼────────────┼────────────┤
│ ... │ Te │ 1463783516 │ 1463662248 │
└─────────────┴──────────┴────────────┴────────────┘
but if sent is 0 rather than NULL, that row sinks to the bottom instead of being ordered by its created value:
┌─────────────┬──────────┬────────────┬────────────┐
│ external_id │ body │ created │ sent │
├─────────────┼──────────┼────────────┼────────────┤
│ ... │ Te │ 1463783516 │ 1463662248 │
├─────────────┼──────────┼────────────┼────────────┤
│ ... │ huyak │ 1463783516 │ 1463662248 │
├─────────────┼──────────┼────────────┼────────────┤
│ ... │ tete │ 1463783516 │ 1463662248 │
├─────────────┼──────────┼────────────┼────────────┤
│ ... │ qw │ 1463793500 │ 0 │ <-
└─────────────┴──────────┴────────────┴────────────┘
Does anyone know why it's happening and how to fix it?
COALESCE handles NULLs, so it won't help you here: it will always return sent. If you compare its result to zero, you're only sorting on whether sent is zero or not. You'll have to use a CASE expression:
... ORDER BY CASE sent WHEN 0 THEN created ELSE sent END DESC;
If you had NULLs where there is no timestamp then you could use COALESCE without the comparison.
COALESCE handles NULLs, not zeros.
You can convert zero values to NULL with the nullif() function:
SELECT * FROM messages ORDER BY COALESCE(NULLIF(sent,0),created) DESC;
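A quick way to see why this works, using the timestamps from the example rows above: NULLIF turns the zero into NULL, so COALESCE falls back to created, while a non-zero sent passes through unchanged.

SELECT COALESCE(NULLIF(0, 0), 1463793500);          -- returns 1463793500 (falls back to created)
SELECT COALESCE(NULLIF(1463793493, 0), 1463793500); -- returns 1463793493 (sent wins)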