How to find values between date ranges in Power BI? - sql

I need help understanding something.
I have a table that uses a slowly changing dimension, with start and end dates and an indicator of whether the row is active or not:
Type|start     |end       |value|active|
----+----------+----------+-----+------+
A   |0001-01-01|9999-12-31|   10|     1|
B   |2015-03-18|2016-06-25|    4|     0|
B   |2016-06-25|9999-12-31|    7|     1|
C   |2017-05-07|9999-12-31|    8|     1|
I need to connect this table to a report in Power BI and show the respective value in a line graph, by month.
Something like this:
I am using a report connected to SSAS through DirectQuery. I am able to create a view with the new structure to connect to my cube.
How can I get this result using a table with this structure?
Thanks for the help!
I thought about creating a table with a value for each month, but as you can see, I have dates ranging from 0001-01-01 to 9999-12-31. (To be honest, I don't really know how to do that either).

Based on the chart, assume that the date range is 2016-04 ~ 2017-12, so:
1. Generate a month dimension for the above date range.
2. Cross join the month dimension with the given slowly changing dimension slo_dim and get the value in the effective period.
3. Draw a line chart based on the result from step 2.
with cte_month (year_month, n) as (
    select cast('2016-04-01' as date), 1
    union all
    select dateadd(month, 1, year_month), n + 1
    from cte_month
    where n < 21)
select d.type,
       m.year_month,
       d.value
from cte_month m
cross join slo_dim d
where m.year_month between d.start_dt and d.end_dt
order by d.type, m.year_month;
Result:
type|year_month|value|
----+----------+-----+
A |2016-04-01| 10|
A |2016-05-01| 10|
A |2016-06-01| 10|
A |2016-07-01| 10|
A |2016-08-01| 10|
A |2016-09-01| 10|
A |2016-10-01| 10|
A |2016-11-01| 10|
A |2016-12-01| 10|
A |2017-01-01| 10|
A |2017-02-01| 10|
A |2017-03-01| 10|
A |2017-04-01| 10|
A |2017-05-01| 10|
A |2017-06-01| 10|
A |2017-07-01| 10|
A |2017-08-01| 10|
A |2017-09-01| 10|
A |2017-10-01| 10|
A |2017-11-01| 10|
A |2017-12-01| 10|
B |2016-04-01| 4|
B |2016-05-01| 4|
B |2016-06-01| 4|
B |2016-07-01| 7|
B |2016-08-01| 7|
B |2016-09-01| 7|
B |2016-10-01| 7|
B |2016-11-01| 7|
B |2016-12-01| 7|
B |2017-01-01| 7|
B |2017-02-01| 7|
B |2017-03-01| 7|
B |2017-04-01| 7|
B |2017-05-01| 7|
B |2017-06-01| 7|
B |2017-07-01| 7|
B |2017-08-01| 7|
B |2017-09-01| 7|
B |2017-10-01| 7|
B |2017-11-01| 7|
B |2017-12-01| 7|
C |2017-06-01| 8|
C |2017-07-01| 8|
C |2017-08-01| 8|
C |2017-09-01| 8|
C |2017-10-01| 8|
C |2017-11-01| 8|
C |2017-12-01| 8|
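Since the question mentions being able to create a view for the cube, the same expansion can be wrapped in one. This is only a sketch: it assumes the dimension table is named slo_dim with columns type, start_dt, end_dt and value as in the query above, and the month window is still hard-coded.
create view dbo.vw_slo_dim_by_month as
with cte_month (year_month, n) as (
    -- 21 month starts: 2016-04-01 through 2017-12-01
    select cast('2016-04-01' as date), 1
    union all
    select dateadd(month, 1, year_month), n + 1
    from cte_month
    where n < 21
)
select d.type,
       m.year_month,
       d.value
from cte_month m
cross join slo_dim d
where m.year_month between d.start_dt and d.end_dt;
Power BI can then put year_month on the axis, value on the values, and type on the legend of the line chart.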

Related

GROUP BY on multiple inner joins in Postgres

I have 3 tables. The first table "A" is the master table:
id_grp|group_name |created_on |status|
------+--------------+-----------------------+------+
17|Teller |2022-09-09 16:00:44.842| 1|
18|Combined Group|2022-09-09 10:16:42.473| 1|
16|admnistrator |2022-09-08 10:11:14.313| 1|
Then I have another table "B":
id_config|id_grp|id_utilis|
---------+------+---------+
159| 16| 1|
161| 16| 54|
164| 17| 55|
438| 17| 88|
166| 18| 39|
167| 18| 20|
439| 16| 89|
198| 18| 51|
Then I have the last table "C":
id_config|id_grp|id_pol|
---------+------+------+
46| 16| 7|
48| 17| 8|
51| 18| 8|
52| 18| 7|
84| 18| 9|
113| 17| 9|
But when I use GROUP BY with multiple joins as follows:
SELECT a.id_grp,
       a.group_name,
       a.created_on,
       a.status,
       count(b.id_utilis) AS users,
       count(c.id_pol) AS policy
FROM a
INNER JOIN b ON a.id_grp = b.id_grp
INNER JOIN c ON a.id_grp = c.id_grp
GROUP BY a.id_grp, a.group_name, a.created_on, a.status;
But I am getting the wrong result: the two joins fan out into a Cartesian product per group, so the two counts multiply each other:
id_grp|group_name |created_on |status|users|policy|
------+--------------+-----------------------+------+-----+------+
17|Teller |2022-09-09 16:00:44.842| 1| 10| 10|
16|admnistrator |2022-09-08 10:11:14.313| 1| 3| 3|
18|Combined Group|2022-09-09 10:16:42.473| 1| 18| 18|
Aggregate each detail table separately and join the pre-aggregated counts instead:
select *
from a
join (select id_grp, count(*) as users  from b group by id_grp) b using (id_grp)
join (select id_grp, count(*) as policy from c group by id_grp) c using (id_grp);
id_grp|group_name    |created_on         |status|users|policy|
------+--------------+-------------------+------+-----+------+
    17|Teller        |2022-09-09 16:00:44|     1|    2|     2|
    18|Combined Group|2022-09-09 10:16:42|     1|    3|     3|
    16|admnistrator  |2022-09-08 10:11:14|     1|    3|     1|
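A variant of the same idea (a sketch, not from the original answer): with LEFT JOINs and COALESCE, groups that have no rows in b or c still appear, with zero counts.
select a.id_grp,
       a.group_name,
       a.created_on,
       a.status,
       coalesce(b.users, 0)  as users,
       coalesce(c.policy, 0) as policy
from a
left join (select id_grp, count(*) as users  from b group by id_grp) b using (id_grp)
left join (select id_grp, count(*) as policy from c group by id_grp) c using (id_grp);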

SQL query to find an output table

I have three dimension tables and a fact table, and I need to write a query that joins the dimension columns with the fact table to find the top 10 ATMs with the most transactions in the 'inactive' state. I tried the query below with a Cartesian join, but I don't know if this is the right way to join the tables.
select a.atm_number, a.atm_manufacturer, b.location,
       count(c.trans_id) as total_transaction_count,
       count(c.atm_status) as inactive_count
from dimen_atm a, dimen_location b, fact_atm_trans c
where a.atm_id = c.atm_id and b.location = c.location
order by inactive_count desc limit 10;
dimen_card_type
+------------+---------+
|card_type_id|card_type|
+------------+---------+
| 1| CIRRUS|
| 2| Dankort|
dimen_atm
+------+----------+----------------+---------------+
|atm_id|atm_number|atm_manufacturer|atm_location_id|
+------+----------+----------------+---------------+
| 1| 1| NCR| 16|
| 2| 2| NCR| 64|
+------+----------+----------------+---------------+
dimen_location
+-----------+--------------------+----------------+-------------+-------+------+------+
|location_id| location| streetname|street_number|zipcode| lat| lon|
+-----------+--------------------+----------------+-------------+-------+------+------+
| 1|Intern København|Rådhuspladsen| 75| 1550|55.676|12.571|
| 2| København| Regnbuepladsen| 5| 1550|55.676|12.571|
+-----------+--------------------+----------------+-------------+-------+------+------+
fact_atm_trans
+--------+------+--------------+-------+------------+----------+--------+----------+------------------+------------+------------+-------+----------+----------+------------+-------------------+
|trans_id|atm_id|weather_loc_id|date_id|card_type_id|atm_status|currency| service|transaction_amount|message_code|message_text|rain_3h|clouds_all|weather_id|weather_main|weather_description|
+--------+------+--------------+-------+------------+----------+--------+----------+------------------+------------+------------+-------+----------+----------+------------+-------------------+
| 1| 1| 16| 5229| 3| Active| DKK|Withdrawal| 5980| null| null| 0.0| 80| 803| Clouds| broken cloudsr|
| 2| 1| 16| 4090| 10| Active| DKK|Withdrawal| 3992| null| null| 0.0| 32| 802| Clouds| scattered cloudsr|
+--------+------+--------------+-------+------------+----------+--------+----------+------------------+------------+-----------
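The query above will not run as written: there is no GROUP BY, count(c.atm_status) counts every non-null status rather than only the inactive ones, and the fact table shown has no location column to join on. A hedged sketch of what it might look like instead, assuming dimen_atm.atm_location_id references dimen_location.location_id and that the status value is literally 'Inactive':
select a.atm_number,
       a.atm_manufacturer,
       b.location,
       count(c.trans_id) as total_transaction_count,
       sum(case when c.atm_status = 'Inactive' then 1 else 0 end) as inactive_count
from fact_atm_trans c
join dimen_atm a      on a.atm_id = c.atm_id
join dimen_location b on b.location_id = a.atm_location_id
group by a.atm_number, a.atm_manufacturer, b.location
order by inactive_count desc
limit 10;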

Pyspark : how to code complicated dataframe calculation lead sum

I have a given dataframe that looks like this.
This dataframe is sorted by date, and col1 is just some random value.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

TEST_schema = StructType([StructField("date", StringType(), True),
                          StructField("col1", IntegerType(), True)])
TEST_data = [('2020-08-01', 3), ('2020-08-02', 1), ('2020-08-03', -1), ('2020-08-04', -1), ('2020-08-05', 3),
             ('2020-08-06', -1), ('2020-08-07', 6), ('2020-08-08', 4), ('2020-08-09', 5)]
rdd3 = sc.parallelize(TEST_data)  # not used below; createDataFrame takes the list directly
TEST_df = sqlContext.createDataFrame(TEST_data, TEST_schema)
TEST_df.show()
+----------+----+
| date|col1|
+----------+----+
|2020-08-01| 3|
|2020-08-02| 1|
|2020-08-03| -1|
|2020-08-04| -1|
|2020-08-05| 3|
|2020-08-06| -1|
|2020-08-07| 6|
|2020-08-08| 4|
|2020-08-09| 5|
+----------+----+
LOGIC: lead(col1) + 1; if col1 == -1, then from the previous value lead(col1) + 2, and so on.
The resulting dataframe will look like this (the WANT column is what I want as output):
+----------+----+----+
| date|col1|WANT|
+----------+----+----+
|2020-08-01| 3| 2|
|2020-08-02| 1| 6|
|2020-08-03| -1| 5|
|2020-08-04| -1| 4|
|2020-08-05| 3| 8|
|2020-08-06| -1| 7|
|2020-08-07| 6| 5|
|2020-08-08| 4| 6|
|2020-08-09| 5| -1|
+----------+----+----+
Let's look at the last row, where col1 == 5: that 5 is led into the previous row (2020-08-08) and +1 is added, which gives want == 6.
If we have col1 == -1, then we add +1 more; if col1 == -1 is repeated twice, then we add +2 more.
This is hard to explain in words. Lastly, since the lead creates a null in the last row instead of a value, it is replaced with -1. I have a diagram.
You can check if the following code and logic work for you:
1. Create a sub-group label g which takes the running sum of int(col1 != -1); we only care about rows with col1 == -1, and nullify all other rows.
2. The residual is 1, plus the running count on Window w2 if col1 == -1.
3. Take prev_col1 over w1, which is the last value that is not -1 (using nullif). The naming of prev_col1 might be confusing, since it is only filled in when col1 == -1, using the typical PySpark way to do a forward fill; otherwise it keeps the original value.
4. Set val = prev_col1 + residual, take the lag over w1, and set the resulting null to -1.
Code below:
from pyspark.sql.functions import when, col, expr, count, desc, lag, coalesce, lit
from pyspark.sql import Window

w1 = Window.orderBy(desc('date'))
w2 = Window.partitionBy('g').orderBy(desc('date'))

TEST_df.withColumn('g', when(col('col1') == -1, expr("sum(int(col1!=-1))").over(w1))) \
    .withColumn('residual', when(col('col1') == -1, count('*').over(w2) + 1).otherwise(1)) \
    .withColumn('prev_col1', expr("last(nullif(col1,-1),True)").over(w1)) \
    .withColumn('want', coalesce(lag(expr("prev_col1 + residual")).over(w1), lit(-1))) \
    .orderBy('date').show()
+----------+----+----+--------+---------+----+
| date|col1| g|residual|prev_col1|want|
+----------+----+----+--------+---------+----+
|2020-08-01| 3|null| 1| 3| 2|
|2020-08-02| 1|null| 1| 1| 6|
|2020-08-03| -1| 4| 3| 3| 5|
|2020-08-04| -1| 4| 2| 3| 4|
|2020-08-05| 3|null| 1| 3| 8|
|2020-08-06| -1| 3| 2| 6| 7|
|2020-08-07| 6|null| 1| 6| 5|
|2020-08-08| 4|null| 1| 4| 6|
|2020-08-09| 5|null| 1| 5| -1|
+----------+----+----+--------+---------+----+

how to create & sort by an ordered categorical variable in pyspark

I'm migrating some code from pandas to pyspark. My source dataframe looks like this:
a b c
0 1 insert 1
1 2 update 1
2 3 seed 1
3 4 insert 2
4 5 update 2
5 6 delete 2
6 7 snapshot 1
and the operation (in python / pandas) that I'm applying is:
df.b = pd.Categorical(df.b, ordered=True, categories=['insert', 'seed', 'update', 'snapshot', 'delete'])
df.sort_values(['c', 'b'])
resulting in the output dataframe:
a b c
0 1 insert 1
2 3 seed 1
1 2 update 1
6 7 snapshot 1
3 4 insert 2
4 5 update 2
5 6 delete 2
I'm unsure how best to set up ordered categoricals using pyspark, and my initial approach creates a new column using case-when and attempts to use that subsequently:
df = df.withColumn(
    "_precedence",
    when(col("b") == "insert", 1)
    .when(col("b") == "seed", 2)
    .when(col("b") == "update", 3)
    .when(col("b") == "snapshot", 4)
    .when(col("b") == "delete", 5)
)
You can use a map:
from pyspark.sql.functions import create_map, lit, col
categories=['insert', 'seed', 'update', 'snapshot', 'delete']
# per @HaleemurAli, adjusted the below list comprehension to create map
# (keys and values need to be literal Columns, hence lit())
map1 = create_map([val for (i, c) in enumerate(categories) for val in (lit(c), lit(i))])
#Column<b'map(insert, 0, seed, 1, update, 2, snapshot, 3, delete, 4)'>
df.orderBy('c', map1[col('b')]).show()
+---+---+--------+---+
| id| a| b| c|
+---+---+--------+---+
| 0| 1| insert| 1|
| 2| 3| seed| 1|
| 1| 2| update| 1|
| 6| 7|snapshot| 1|
| 3| 4| insert| 2|
| 4| 5| update| 2|
| 5| 6| delete| 2|
+---+---+--------+---+
To reverse the order on column b: df.orderBy('c', map1[col('b')].desc()).show()
You could also do this using coalesce with your when statements.
from pyspark.sql import functions as F

categories = ['insert', 'seed', 'update', 'snapshot', 'delete']
cols = [F.when(F.col("b") == x, F.lit(y)) for x, y in zip(categories, range(1, len(categories) + 1))]

df.orderBy("c", F.coalesce(*cols)).show()
#+---+--------+---+
#| a| b| c|
#+---+--------+---+
#| 1| insert| 1|
#| 3| seed| 1|
#| 2| update| 1|
#| 7|snapshot| 1|
#| 4| insert| 2|
#| 5| update| 2|
#| 6| delete| 2|
#+---+--------+---+
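The same ordering can also be expressed as a CASE expression in Spark SQL; a sketch, assuming the dataframe has been registered as a temporary view named df:
-- assumes df is available as a temp view
select a, b, c
from df
order by c,
         case b
             when 'insert'   then 1
             when 'seed'     then 2
             when 'update'   then 3
             when 'snapshot' then 4
             when 'delete'   then 5
         end;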

SQL Join returns duplicate entries

Just going to start out saying that I am new to SQL and what I've written is based off of tutorials (also, I am using SQL Server 2012). The issue I am having is that I am trying to take data from 4 different tables and put it into 1 table to be accessed by Access. However, I keep getting duplicate rows when a value differs between the tables.
The tables look like:
Cell1
|LotNum|SerialNum|PassFail|
| Lot11| 1234| 1|
| Lot11| 2345| 1|
| Lot11| 3456| 1|
| Lot11| 4567| 1|
Cell2
|LotNum|SerialNum|PassFail|
| Lot11| 1234| 1|
| Lot11| 2345| 1|
| Lot11| 3456| 1|
| Lot11| 4567| 1|
Cell3
|LotNum|SerialNum|PassFail|
| Lot11| 1234| 1|
| Lot11| 2345| 1|
| Lot11| 3456| 1|
| Lot11| 4567| 1|
Cell4
|LotNum|SerialNum|PassFail|
| Lot11| 1234| 1|
| Lot11| 2345| 1|
| Lot11| 3456| 1|
| Lot11| 4567| 0|
My code is:
Alter Procedure [dbo].[spSingleData] (
    @LotNum varchar(50)
)
AS
Truncate Table dbo.SingleSheet
Begin
    Insert INTO dbo.SingleSheet (SerialNum, Cell1PF, Cell2PF, Cell3PF, Cell4PF)
    Select Distinct Cell1.SerialNum, Cell1.PassFail, Cell2.PassFail, Cell3.PassFail, Cell4.PassFail
    From dbo.Cell1
    Left Join Cell2 On Cell1.LotNum = Cell2.LotNum
    Left Join Cell3 On Cell1.LotNum = Cell3.LotNum
    Left Join Cell4 On Cell1.LotNum = Cell4.LotNum
    Where Cell1.LotNum = @LotNum
    Order by SerialNum
End
PassFail can be 0, 1, or NULL. However, as in the example above, if one of the PassFail values is different from the rest, the resulting table returns:
|1234| 1| 1| 1| 0|
|1234| 1| 1| 1| 1|
|2345| 1| 1| 1| 0|
|2345| 1| 1| 1| 1|
|3456| 1| 1| 1| 0|
|3456| 1| 1| 1| 1|
|4567| 1| 1| 1| 0|
|4567| 1| 1| 1| 1|
Am I just using the wrong Join or should I be using something else?
Is this what you are trying to achieve? If so, then you are missing a JOIN predicate on SerialNum, and you do not need the DISTINCT.
Sample Data:
IF OBJECT_ID('tempdb..#Cell1') IS NOT NULL
DROP TABLE #Cell1
CREATE TABLE #Cell1 (LotNum varchar(10),SerialNum int,PassFail bit)
INSERT INTO #Cell1
VALUES
('Lot11',1234,1),
('Lot11',2345,1),
('Lot11',3456,1),
('Lot11',4567,1)
IF OBJECT_ID('tempdb..#Cell2') IS NOT NULL
DROP TABLE #Cell2
CREATE TABLE #Cell2 (LotNum varchar(10),SerialNum int,PassFail bit)
INSERT INTO #Cell2
VALUES
('Lot11',1234,1),
('Lot11',2345,1),
('Lot11',3456,1),
('Lot11',4567,1)
IF OBJECT_ID('tempdb..#Cell3') IS NOT NULL
DROP TABLE #Cell3
CREATE TABLE #Cell3 (LotNum varchar(10),SerialNum int,PassFail bit)
INSERT INTO #Cell3
VALUES
('Lot11',1234,1),
('Lot11',2345,1),
('Lot11',3456,1),
('Lot11',4567,1)
IF OBJECT_ID('tempdb..#Cell4') IS NOT NULL
DROP TABLE #Cell4
CREATE TABLE #Cell4 (LotNum varchar(10),SerialNum int,PassFail bit)
INSERT INTO #Cell4
VALUES
('Lot11',1234,1),
('Lot11',2345,1),
('Lot11',3456,1),
('Lot11',4567,0)
Query:
SELECT #Cell1.SerialNum,
#Cell1.PassFail,
#Cell2.PassFail,
#Cell3.PassFail,
#Cell4.PassFail
FROM #Cell1
LEFT JOIN #Cell2 ON #Cell1.LotNum = #Cell2.LotNum AND #Cell1.SerialNum = #Cell2.SerialNum
LEFT JOIN #Cell3 ON #Cell1.LotNum = #Cell3.LotNum AND #Cell1.SerialNum = #Cell3.SerialNum
LEFT JOIN #Cell4 ON #Cell1.LotNum = #Cell4.LotNum AND #Cell1.SerialNum = #Cell4.SerialNum
ORDER BY SerialNum;
Results:
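Applied back to the original procedure, the fix would look roughly like the sketch below (not from the answer above; it assumes the dbo.SingleSheet columns used in the question):
Alter Procedure [dbo].[spSingleData] (
    @LotNum varchar(50)
)
AS
Truncate Table dbo.SingleSheet
Begin
    -- same query as the question, with SerialNum added to every join predicate
    -- and DISTINCT removed
    Insert Into dbo.SingleSheet (SerialNum, Cell1PF, Cell2PF, Cell3PF, Cell4PF)
    Select Cell1.SerialNum, Cell1.PassFail, Cell2.PassFail, Cell3.PassFail, Cell4.PassFail
    From dbo.Cell1
    Left Join Cell2 On Cell1.LotNum = Cell2.LotNum And Cell1.SerialNum = Cell2.SerialNum
    Left Join Cell3 On Cell1.LotNum = Cell3.LotNum And Cell1.SerialNum = Cell3.SerialNum
    Left Join Cell4 On Cell1.LotNum = Cell4.LotNum And Cell1.SerialNum = Cell4.SerialNum
    Where Cell1.LotNum = @LotNum
    Order by Cell1.SerialNum
End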