I am trying to find the number of days taken to close an account. I have a table like below:
OPER_DAY | CODE_FILIAL | SUM_IN | SALDO_OUT | ACC
-------------------------------------------------------------------------------
2020-11-02 | 00690 | 0 | 1578509367.58 | 001
2020-11-03 | 00690 | 1578509367.58 | 9116497.5 | 001
2020-11-04 | 00690 | 9116497.5 | 0 | 001
2020-11-02 | 00690 | 0 | 157430882.96 | 101
2020-11-03 | 00690 | 157430882.96 | 0 | 101
2020-11-09 | 00690 | 0 | 500000 | 101
2020-11-19 | 00690 | 500000 | 0 | 101
For a particular ACC, a period starts with a 0 in SUM_IN and ends with a 0 in SALDO_OUT. I need to find the number of days the filial took to close its accounts.
For example, for ACC 001 it took 2 days, from 2020-11-02 to 2020-11-04. For ACC 101 it took 11 days: from 2020-11-02 to 2020-11-03 is 1 day,
and from 2020-11-09 to 2020-11-19 is 10 days.
Overall: 13 days.
Result I want:
----------------------------
CODE_FILIAL | NUM_OF_DAYS
---------------------------
00690 | 13
This reads like a gaps-and-islands problem. An island starts with a value of 0 in sum_in and ends with a value of 0 in saldo_out.
Assuming there is always at most one end for each start, you can use window functions and aggregation as follows:
select code_filial, sum(end_dt - start_dt) as num_of_days
from (
    select code_filial, acc, grp,
        min(oper_day) as start_dt,
        max(case when saldo_out = 0 then oper_day end) as end_dt
    from (
        select t.*,
            sum(case when sum_in = 0 then 1 else 0 end) over(partition by code_filial, acc order by oper_day) as grp
        from mytable t
    ) t
    group by code_filial, acc, grp
) t1
group by code_filial
This works by building groups of records with a window sum that increments every time a value of 0 is met in column sum_in for a given (code_filial, acc) tuple. We can then use aggregation to compute the start and end date of each group. The final step is to aggregate by code_filial.
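To make the grouping concrete, here is what the inner window sum would produce for ACC 101 from the sample data (a sketch only; mytable is the assumed table name, and the expected result is shown as comments):

select t.*,
    sum(case when sum_in = 0 then 1 else 0 end) over(partition by code_filial, acc order by oper_day) as grp
from mytable t
where acc = '101';

-- OPER_DAY   | SUM_IN       | SALDO_OUT    | GRP
-- 2020-11-02 | 0            | 157430882.96 | 1    -- island 1 starts (sum_in = 0)
-- 2020-11-03 | 157430882.96 | 0            | 1    -- island 1 ends (saldo_out = 0): 1 day
-- 2020-11-09 | 0            | 500000       | 2    -- island 2 starts
-- 2020-11-19 | 500000       | 0            | 2    -- island 2 ends: 10 days, 11 in total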
After some manipulation, I ended up with a table in GBQ that lists all transactions made on the blockchain (around 280 million rows):
+-------+-------------------------+--------+-------+----------+
| Linha | timestamp | sender | value | receiver |
+-------+-------------------------+--------+-------+----------+
| 1 | 2018-06-28 01:31:00 UTC | User1 | 1.67 | User2 |
| 2 | 2017-04-06 00:47:29 UTC | User3 | 0.02 | User4 |
| 3 | 2013-11-27 13:22:05 UTC | User5 | 0.25 | User6 |
+-------+-------------------------+--------+-------+----------+
Since this table has all transactions, if I sum all the values for each user up to a given date I get his balance, and since I have close to 22 million users, I want to bin them by the amount of coin they have. I used this code to go through the whole dataset:
#standardSQL
SELECT
  COUNT(val) AS num,
  bin
FROM (
  SELECT
    val,
    CASE
      WHEN val > 0 AND val <= 1 THEN '0_to_1'
      WHEN val > 1 AND val <= 10 THEN '1_to_10'
      WHEN val > 10 AND val <= 100 THEN '10_to_100'
      WHEN val > 100 AND val <= 1000 THEN '100_to_1000'
      WHEN val > 1000 AND val <= 10000 THEN '1000_to_10000'
      WHEN val > 10000 THEN 'More_10000'
    END AS bin
  FROM (
    SELECT
      MAX(timestamp),
      receiver,
      SUM(value) AS val
    FROM
      `table.transactions`
    WHERE
      timestamp < '2011-02-12 00:00:00'
    GROUP BY
      receiver))
GROUP BY
  bin
Which gives me something like:
+-------+-------+---------------+
| Linha | num | bin |
+-------+-------+---------------+
| 1 | 11518 | 1_to_10 |
| 2 | 9503 | 100_to_1000 |
| 3 | 18070 | 10_to_100 |
| 4 | 20275 | 0_to_1 |
| 5 | 1781 | 1000_to_10000 |
| 6 | 158 | More_10000 |
+-------+-------+---------------+
Now I want to iterate through the rows of my transactions table, checking the number of users in each bin at the end of every day. The final table should be something like this:
+-------------------------+---------+-----------+-----------+-------------+---------------+------------+
| timestamp | 0_to_1 | 1_to_10 | 10_to_100 | 100_to_1000 | 1000_to_10000 | More_10000 |
+-------------------------+---------+-----------+-----------+-------------+---------------+------------+
| 2009-01-09 00:00:00 UTC | 1 | 1 | 0 | 0 | 0 | 0 |
| 2009-01-10 00:00:00 UTC | 0 | 2 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 2018-09-10 00:00:00 UTC | 2342823 | 124324325 | 43251315 | 234523555 | 2352355556 | 12124235231|
+-------------------------+---------+-----------+-----------+-------------+---------------+------------+
I can't order by timestamp to make my life easier because the dataset is too large, so I would appreciate some ideas. I wonder if there is some way to improve performance and save resources using pagination, for example. I've heard about it, but don't have a clue how to use it.
Thanks in advance!
UPDATE: after some work, I now have a transactions table ordered by timestamp.
The query below should give you the count of distinct receivers within each bin, per timestamp. Now, keep in mind that this query evaluates the value of a transaction at the row level.
SELECT
timestamp,
COUNT(DISTINCT(CASE
WHEN value > 0 AND value <= 1 THEN receiver
END)) AS _0_to_1,
COUNT(DISTINCT(CASE
WHEN value > 1 AND value <= 10 THEN receiver
END)) AS _1_to_10,
COUNT(DISTINCT(CASE
WHEN value > 10 AND value <= 100 THEN receiver
END)) AS _10_to_100,
COUNT(DISTINCT(CASE
WHEN value > 100 AND value <= 1000 THEN receiver
END)) AS _100_to_1000,
COUNT(DISTINCT(CASE
WHEN value > 1000 AND value <= 10000 THEN receiver
END)) AS _1000_to_10000,
COUNT(DISTINCT(CASE
WHEN value > 10000 THEN receiver
END)) AS More_10000
FROM `table.transactions`
WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
GROUP BY 1
Regarding your question about performance, one area you may want to explore (if possible) is creating a partitioned version of this big table. This will help you 1) improve performance, and 2) reduce the cost of querying the data for a specific date range. You can find more info here
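For example, a one-off rebuild into a day-partitioned copy could look like this (a sketch only; mydataset.transactions_partitioned is a placeholder name, and it assumes you are able to materialize a copy of the table):

#standardSQL
-- Date-bounded queries against the partitioned copy only scan the matching
-- daily partitions instead of all ~280 million rows.
CREATE TABLE `mydataset.transactions_partitioned`
PARTITION BY DATE(timestamp) AS
SELECT *
FROM `table.transactions`;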
EDIT
I added a WHERE clause to the query to filter for the previous day. I am assuming you will run your query, for example, today to get the data from the previous day. You may need to adjust CURRENT_TIMESTAMP() to your time zone by adding an additional TIMESTAMP_SUB(..., INTERVAL X HOUR) or TIMESTAMP_ADD(..., INTERVAL X HOUR), where X is the number of hours to subtract or add to match the time zone of the data you are analyzing.
Also, you may need to CAST(timestamp AS TIMESTAMP) depending on the type of your field.
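For instance, a purely hypothetical adjustment for data recorded three hours behind UTC could look like this (the 3-hour offset is an assumption for illustration only):

#standardSQL
SELECT COUNT(*) AS rows_from_previous_day
FROM `table.transactions`
-- shift the one-day cutoff by the assumed 3-hour offset
WHERE timestamp >= TIMESTAMP_ADD(
        TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY),
        INTERVAL 3 HOUR);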
I have this SQL query:
SELECT Date, Hours, Counts FROM TRANSACTION_DATE
Example Output:
Date | Hours | Counts
----------------------------------
01-Feb-2018 | 20 | 5
03-Feb-2018 | 25 | 3
04-Feb-2018 | 22 | 3
05-Feb-2018 | 21 | 2
07-Feb-2018 | 28 | 1
10-Feb-2018 | 23 | 1
As you can see, there are days missing because there is no data for them, but I want the missing days to be shown with a value of zero:
Date | Hours | Counts
----------------------------------
01-Feb-2018 | 20 | 5
02-Feb-2018 | 0 | 0
03-Feb-2018 | 25 | 3
04-Feb-2018 | 22 | 3
05-Feb-2018 | 21 | 2
06-Feb-2018 | 0 | 0
07-Feb-2018 | 28 | 1
08-Feb-2018 | 0 | 0
09-Feb-2018 | 0 | 0
10-Feb-2018 | 23 | 1
Thank you in advance.
You need to generate a sequence of dates. If there are not too many, a recursive CTE is an easy method:
with dates as (
select min(date) as dte, max(date) as last_date
from transaction_date td
union all
select dateadd(day, 1, dte), last_date
from dates
where dte < last_date
)
select d.dte as date, coalesce(td.hours, 0) as hours, coalesce(td.counts, 0) as counts
from dates d left join
transaction_date td
on d.dte = td.date;
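One caveat, assuming this is SQL Server: a recursive CTE stops after 100 recursion levels by default, so for ranges longer than roughly three months you would need to lift that limit with an OPTION hint. A minimal, self-contained illustration:

-- Generates a full year of dates; without the hint the default limit of 100
-- recursions would be exhausted part-way through. MAXRECURSION 0 removes it.
with dates as (
      select cast('2018-02-01' as date) as dte
      union all
      select dateadd(day, 1, dte)
      from dates
      where dte < '2019-01-31'
)
select dte
from dates
option (maxrecursion 0);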
I am trying to determine whether a user has visited a site in three time ranges:
last 30 days
between 31 and 60 days
between 61 and 90 days
I am using Netezza, which does NOT support correlated subqueries in the SELECT clause. See Rextester for a successful query that must be re-written to NOT use a correlated subquery: http://rextester.com/JGR62033
Sample Data:
| user_id | last_visit | num_days_since_2017117 |
|---------|------------|------------------------|
| 1234 | 2017-11-02 | 15.6 |
| 1234 | 2017-09-30 | 48.6 |
| 1234 | 2017-09-03 | 75.0 |
| 1234 | 2017-08-21 | 88.0 |
| 9876 | 2017-10-03 | 45.0 |
| 9876 | 2017-07-20 | 120.0 |
| 5545 | 2017-09-15 | 63.0 |
Desired Output:
| user_id | last_30 | btwn_31_60 | btwn_61_90 |
|---------|---------|------------|------------|
| 1234 | 1 | 1 | 1 |
| 5545 | 0 | 0 | 1 |
| 9876 | 0 | 1 | 0 |
Here is one way with conditional aggregation, Rextester:
select
user_id
,MAX(case when '2017-11-17'-visit_date <=30
then 1
else 0
end) as last_30
,MAX(case when '2017-11-17'-visit_date >=31
and '2017-11-17'-visit_date <=60
then 1
else 0
end) as between_31_60
,MAX(case when '2017-11-17'-visit_date >=61
and '2017-11-17'-visit_date <=90
then 1
else 0
end) as between_61_90
from
visits
group by user_id
order by user_id
I don't know the specific DBMS you're using, but if it supports CASE or an equivalent you don't need a correlated sub-query; you can do it with a combination of SUM() and CASE.
Untested in your DBMS, of course, but it should give you a starting point:
SELECT
user_id,
SUM(CASE WHEN num_days <= 30 then 1 else 0 end) as last_30,
SUM(CASE WHEN num_days > 30 AND num_days < 61 then 1 else 0 end) as btwn_31_60,
SUM(CASE WHEN num_days >= 61 then 1 else 0 end) as btwn_61_90
FROM
YourTableName -- You didn't provide a tablename
GROUP BY
user_id
Since your values are floating point and not integer, you may need to adjust the values used for the day ranges to work with your specific requirements.
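Combining the two answers, a sketch that reproduces the desired 0/1 flags exactly (visits and num_days are assumed names for your table and day-count column) would be:

SELECT
    user_id,
    -- MAX collapses each user's visits into a per-bin flag instead of a count
    MAX(CASE WHEN num_days <= 30                   THEN 1 ELSE 0 END) AS last_30,
    MAX(CASE WHEN num_days > 30 AND num_days <= 60 THEN 1 ELSE 0 END) AS btwn_31_60,
    MAX(CASE WHEN num_days > 60 AND num_days <= 90 THEN 1 ELSE 0 END) AS btwn_61_90
FROM visits
GROUP BY user_id
ORDER BY user_id;

The upper bound of 90 keeps older visits (such as the 120-day row in the sample) out of every bin, which matches the desired output.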
My sales data for the first two weeks of June (the Mondays being 1 Jun and 8 Jun) is below:
date | count
2015-06-01 03:25:53 | 1
2015-06-01 03:28:51 | 1
2015-06-01 03:49:16 | 1
2015-06-01 04:54:14 | 1
2015-06-01 08:46:15 | 1
2015-06-01 13:14:09 | 1
2015-06-01 16:20:13 | 5
2015-06-01 16:22:13 | 1
2015-06-01 16:27:07 | 1
2015-06-01 16:29:57 | 1
2015-06-01 19:16:45 | 1
2015-06-08 10:54:46 | 1
2015-06-08 15:12:10 | 1
2015-06-08 20:35:40 | 1
I need to find the weekly average of sales in a given range.
Complex Query:
(some_manipulation_part), ifact as
( select date, sales_count from final_result_set
) select date_part('h', date) as h,
         date_part('dow', date) as day_of_week,
         count(sales_count)
  from final_result_set
  group by 1, 2
Output:
h | day_of_week | count
3 | 1 | 3
4 | 1 | 1
8 | 1 | 1
10 | 1 | 1
13 | 1 | 1
15 | 1 | 1
16 | 1 | 8
19 | 1 | 1
20 | 1 | 1
If I try to apply avg on the above final result, it does not actually fetch the correct answer:
(some_manipulation_part), ifact as
( select date, sales_count from final_result_set
) select date_part('h', date) as h,
         date_part('dow', date) as day_of_week,
         avg(sales_count)
  from final_result_set
  group by 1, 2
h | day_of_week | count
3 | 1 | 1
4 | 1 | 1
8 | 1 | 1
10 | 1 | 1
13 | 1 | 1
15 | 1 | 1
16 | 1 | 1
19 | 1 | 1
20 | 1 | 1
So I have two Mondays in the given range, but it is not actually dividing by that. I am not even sure what is happening inside Redshift.
To get "weekly averages" use date_trunc():
SELECT date_trunc('week', my_date_column) as week
, avg(sales_count) AS avg_sales
FROM final_result_set
GROUP BY 1;
I hope you are not actually using date as the name of your date column. It's a reserved word in SQL and a basic type name; don't use it as an identifier.
If you group by the day of the week (DOW) you get averages per weekday, and Sunday is 0. (Use ISODOW to get 7 for Sunday.)
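If what you are after is the average per weekday and hour across the Mondays in the range, one possible sketch (reusing the column names from your query) is to total the counts per calendar day and hour first, and only then average those totals:

SELECT day_of_week,
       h,
       avg(hourly_sales) AS avg_sales
FROM (
    -- one total per calendar day and hour, so the outer avg() runs over
    -- per-day totals instead of over individual transaction rows
    SELECT date_trunc('day', date) AS d,
           date_part('dow', date)  AS day_of_week,
           date_part('h', date)    AS h,
           sum(sales_count)        AS hourly_sales
    FROM final_result_set
    GROUP BY 1, 2, 3
) per_day_hour
GROUP BY 1, 2
ORDER BY 1, 2;

Note that days with no sales in a given hour contribute nothing (not a zero) to the average here.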
I have a problem with my query.
My table stores data like this:
ContractID | Staff_ID | EffectDate | End Date | Salary | active
-------------------------------------------------------------------------
1 | 1 | 2013-01-01 | 2013-12-30 | 100 | 0
2 | 1 | 2014-01-01 | 2014-12-30 | 150 | 0
3 | 1 | 2015-01-01 | 2015-12-30 | 200 | 1
4 | 2 | 2014-05-01 | 2015-04-30 | 500 | 0
5 | 2 | 2015-05-01 | 2016-04-30 | 700 | 1
I would like to write a query that returns the result below:
ContractID | Staff_ID | EffectDate | End Date | Salary | Increase
-------------------------------------------------------------------------
1 | 1 | 2013-01-01 | 2013-12-30 | 100 | 0
2 | 1 | 2014-01-01 | 2014-12-30 | 150 | 50
3 | 1 | 2015-01-01 | 2015-12-30 | 200 | 50
4 | 2 | 2014-05-01 | 2015-04-30 | 500 | 0
5 | 2 | 2015-05-01 | 2016-04-30 | 700 | 200
-------------------------------------------------------------------------
The Increase column is calculated as the current contract's salary minus the previous contract's salary.
I use SQL Server 2008 R2.
Unfortunately 2008 R2 doesn't have access to LAG, but you can simulate the effect of obtaining the previous row (prev) in the scope of the current row (cur) with a ROW_NUMBER ranking and a self-join to the previous ranked row in the same Staff_ID partition:
With CTE AS
(
SELECT [ContractID], [Staff_ID], [EffectDate], [End Date], [Salary],[active],
ROW_NUMBER() OVER (Partition BY Staff_ID ORDER BY ContractID) AS Rnk
FROM Table1
)
SELECT cur.[ContractID], cur.[Staff_ID], cur.[EffectDate], cur.[End Date],
cur.[Salary], cur.Rnk,
CASE WHEN (cur.Rnk = 1) THEN 0 -- i.e. baseline salary
ELSE cur.Salary - prev.Salary END AS Increase
FROM CTE cur
LEFT OUTER JOIN CTE prev
ON cur.[Staff_ID] = prev.Staff_ID and cur.Rnk - 1 = prev.Rnk;
(If ContractID were always perfectly incrementing, we wouldn't need the ROW_NUMBER and could join on incrementing ContractIDs, but I didn't want to make this assumption.)
SqlFiddle here
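For completeness, the shortcut mentioned in the parenthetical above would look something like this (a sketch only; it silently breaks if a ContractID is ever skipped within a Staff_ID):

SELECT cur.[ContractID], cur.[Staff_ID], cur.[EffectDate], cur.[End Date], cur.[Salary],
       -- prev is NULL for the first contract, so COALESCE yields an increase of 0
       COALESCE(cur.[Salary] - prev.[Salary], 0) AS Increase
FROM Table1 cur
LEFT OUTER JOIN Table1 prev
  ON cur.[Staff_ID] = prev.[Staff_ID]
 AND cur.[ContractID] = prev.[ContractID] + 1;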
Edit
If you have Sql 2012 and later, the LEAD and LAG Analytic Functions make this kind of query much simpler:
SELECT [ContractID], [Staff_ID], [EffectDate], [End Date], [Salary],
Salary - LAG(Salary, 1, Salary) OVER (Partition BY Staff_ID ORDER BY ContractID) AS Incr
FROM Table1
Updated SqlFiddle
One trick here is that we are calculating delta increments in salary, so for the first employee contract we need to return the current salary so that Salary - Salary = 0 for the first increase.