T-SQL poor CTE join performance - sql

Long story short. I have a two SQL queries that are very similar and outputs value pairs of short and long names. Input parameter is for first query: short name and for second query is long name. The queries are constructed that outputs all rows that contained input parameters and this is not exact match (for example if I set tagname as ST2 this query outputs all object that contains ST2 and all other object that has ST2 at the beginning of its name.
All the queries are executed in same database. In below queries input parameters are set to "" this means that query outputs key value pairs of short and long names for all objects
declare #tagName as varchar(50) = ''
set #tagName = #tagName + '.'
-- this one query outputs ~700 rows
;with AnalogTag as
(
Select *
from [Runtime].[dbo].[AnalogTag]
where (substring(TagName, 0, charindex('.',#tagName))) in (substring(#tagName,0, len(#tagName)))
and (substring(TagName, charindex('.',TagName), 2)) not in ('.#')
),
-- this one query outputs ~7000 rows
HierarchicalName as
(
Select *
from [proba7].[dbo].[internal_list_objects_view]
where substring(tag_name, 0,len(#tagName)) = substring(#tagName,0, len(#tagName))
)
select HierarchicalName.tag_name as TagName
,HierarchicalName.hierarchical_name ilo_view_HierarchicalName
from AnalogTag
inner join HierarchicalName
on substring(AnalogTag.TagName, 0, CHARINDEX('.',AnalogTag.TagName)) = HierarchicalName.tag_name
Whole query above runs at approx 3 seconds. And outputs about 450 rows
I created a similar one query on same database:
declare #hierarchicalName as varchar(200) = ''
declare #Length as int
set #Length = LEN(#hierarchicalName)+1
-- this query outputs approx 700 rows and if runs separately it runs
--almost instantly
;with AnalogTag as
(
Select TagName
from [Runtime].[dbo].[AnalogTag]
where (substring(TagName, 0, CHARINDEX('.',TagName))) in
(
Select tag_name from [proba7].[dbo].[internal_list_objects_view]
where substring(hierarchical_name, 0, #Length) = #hierarchicalName
)
and (substring(TagName, CHARINDEX('.',TagName), 2)) not in ('.#')
),
-- this query outputs approx 7000 rows and if runs separately it runs
--almost instantly
HierarchicalName as
(
Select hierarchical_name, tag_name from [proba7].[dbo].[internal_list_objects_view]
where substring(hierarchical_name, 0, #Length) = #hierarchicalName
)
select HierarchicalName.tag_name as ilo_view_TagName
,HierarchicalName.hierarchical_name ilo_view_HierarchicalName
from AnalogTag
inner join HierarchicalName
on substring(AnalogTag.TagName, 0, CHARINDEX('.',AnalogTag.TagName)) = HierarchicalName.tag_name
And this time query runs in 28 seconds. Outputs similar amount of row as first query (because ouptut must be similar on these two queries). I noticed that if i change "inner join" to for example "full join", query runs instantly.
Analog tag example output:
TagName a b c d e f g h j j k l m
PomFPTemp.PV 1062 0 10 0 10 0 4 0 0 0 0 0 0
PomFPWilgWzgl.PV 1 0 10 0 10 0 4 0 0 0 0 0 0
SNST3P1BrakFWHMocD3f.PV 1063 0 10 0 10 0 4 0 0 0 0 0 0
SNST3P1BrakFWHMocP3f.PV 46 0 10 0 10 0 4 0 0 0 0 0 0
SNST3P1BrakFWHMocQ3f.PV 1063 0 10 0 10 0 4 0 0 0 0 0 0
SNST3P1BrakFWHMocQ3fIntExp.PV 1063 0 10 0 10 0 4 0 0 0 0 0 0
SNST3P1BrakFWHMocQ3fIntImp.PV 1063 0 10 0 10 0 4 0 0 0 0 0 0
SNST3P1BrakFWHMocQn3f.PV 1063 0 10 0 10 0 4 0 0 0 0 0 0
SNST3P1EnkvarhExp3f.PV 1060 0 10 0 10 0 4 0 0 0 0 0 0
SNST3P1EnkvarhImp3f.PV 1060 0 10 0 10 0 4 0 0 0 0 0 0
hierarchical name example output:
ID tag_name contained_name hierarchical_name a b c d e f g h i j k l m n o p q r s t u v w x y z aa bb cc dd ee ff gg hh
1 $Galaxy $Galaxy 1 0 0 0 0 1 NULL NULL 0 0 1 1 1 0 0 0 0 0 1 0 NULL 23 0 0 0 0 0 0 0 0 0 133123 50026 1 NULL
2 proba7 Galaxy_001 0 0 0 0 1 1 NULL NULL 0 0 848 1 0 1 0 0 0 0 1 0 NULL 23 0 0 0 0 0 0 0 0 0 4098 878020 1 NULL
3 $_AutoImport $_AutoImport 1 0 0 0 0 3 NULL NULL 0 0 4 2 0 0 0 0 0 0 1 0 NULL 10 0 0 0 0 0 0 0 0 0 131075 3699 1 NULL
4 $_DiCommon $_DiCommon 1 0 0 0 0 4 NULL NULL 0 0 5 3 0 0 0 0 0 0 1 0 NULL 10 0 0 0 0 0 0 0 0 0 131075 50023 1 NULL
5 $WinPlatform $WinPlatform 1 0 0 0 0 5 NULL 1 0 0 6 4 1 0 0 0 0 0 0 0 NULL 1 0 0 0 0 0 0 0 0 0 133121 419340 1 1
6 $AppEngine $AppEngine 1 0 0 0 0 6 NULL 1 0 0 7 5 1 0 0 0 0 0 0 0 NULL 3 0 0 0 0 0 0 0 0 0 133121 419341 1 1
7 $Area $Area 1 0 0 0 0 7 NULL 1 0 0 8 6 1 0 0 0 0 0 0 0 NULL 13 0 0 0 0 0 0 0 0 0 133121 3452998 1 1
8 $AnalogDevice $AnalogDevice 1 0 0 0 0 8 NULL 2 0 0 9 7 0 0 0 0 0 0 0 0 NULL 10 0 0 0 0 0 0 0 0 0 131073 419343 1 2
9 $DDESuiteLinkClient $DDESuiteLinkClient 1 0 0 0 0 9 NULL 3 0 0 10 8 1 0 0 0 0 0 0 0 NULL 11 0 0 0 0 0 0 0 0 0 133121 419344 1 3
10 $DiscreteDevice $DiscreteDevice 1 0 0 0 0 10 NULL 2 0 0 11 9 0 0 0 0 0 0 0 0 NULL 10 0 0 0 0 0 0 0 0 0 131073 419345 1 2
11 $InTouchProxy $InTouchProxy 1 0 0 0 0 11 NULL 3 0 0 12 10 0 0 0 0 0 0 0 0 NULL 11 0 0 0 0 0 0 0 0 0 131073 419346 1 3
Output table (example):
ilo_view_TagName ilo_view_HierarchicalName
ST4FP12Rozl ST4_FP1.Galaz_2.Rozladowywanie
ST4FP21Rozl ST4_FP2.Galaz_1.Rozladowywanie
ST4FP22Rozl ST4_FP2.Galaz_2.Rozladowywanie
ST4FP31Rozl ST4_FP3.Galaz_1.Rozladowywanie
ST4RS41AnWspKFL2 ST4_S1_RS4_1.Wsp.K_Factor.L2
ST4FP32Rozl ST4_FP3.Galaz_2.Rozladowywanie
ST4RS31AnWspKFL2 ST4_S2_RS3_1.Wsp.K_Factor.L2
ST4RS51AnWspKFL2 ST4_S3_RS5_1.Wsp.K_Factor.L2
ST4FP11U ST4_FP1.Galaz_1.Napiecie
Best regards and thanks in advance for any advices. I tried at my best to make this exapmple tables readable.

Related

Get rows with some value next to threshold

I'm processing data from some simulations and I need some support with SQL to get the data I need. Having a reference table with some times, I need to retrieve the state of the simulation at this particular time: this will be the row with nearest (but below) simulation time to the reference time.
Having this example
Simulation TimeS P1 P2 P2F P3 P3F P4 P4F P5 P5F P6
---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ----------
1 0.0 1 0 0 0 0 0 0 0 0 0
1 0.0 0 1 0 0 0 0 0 0 0 0
1 28.0 0 1 0 0 0 0 0 0 0 0
1 28.0 0 1 0 0 0 0 0 0 0 0
1 2190.0 0 1 0 0 0 0 0 0 0 0
1 2190.33333 0 1 0 0 0 0 0 0 0 0
2 0.0 1 0 0 0 0 0 0 0 0 0
2 0.0 0 1 0 0 0 0 0 0 0 0
2 28.0 0 1 0 0 0 0 0 0 0 0
2 28.0 0 1 0 0 0 0 0 0 0 0
2 817.859662 0 0 1 0 0 0 0 0 0 0
2 840.0 0 0 1 0 0 0 0 0 0 0
2 840.0 0 0 1 0 0 0 0 0 0 0
2 840.0 0 0 1 0 0 0 0 0 0 0
2 840.0 0 0 1 0 0 0 0 0 0 0
2 1018.65247 0 0 0 1 0 0 0 0 0 0
2 1036.0 0 0 0 1 0 0 0 0 0 0
2 1036.0 0 0 0 1 0 0 0 0 0 0
2 2190.0 0 0 0 1 0 0 0 0 0 0
2 2190.0 0 0 0 1 0 0 0 0 0 0
2 2190.0 0 0 0 1 0 0 0 0 0 0
2 2190.04166 0 0 0 1 0 0 0 0 0 0
2 2190.20833 0 0 0 1 0 0 0 0 0 0
2 2190.25 1 0 0 0 0 0 0 0 0 0
2 2190.25 0 1 0 0 0 0 0 0 0 0
2 2190.33333 0 1 0 0 0 0 0 0 0 0
2 2212.0 0 1 0 0 0 0 0 0 0 0
2 2212.0 0 1 0 0 0 0 0 0 0 0
2 2472.94974 0 0 1 0 0 0 0 0 0 0
2 2492.0 0 0 1 0 0 0 0 0 0 0
2 2492.0 0 0 1 0 0 0 0 0 0 0
2 2492.0 0 0 1 0 0 0 0 0 0 0
2 2492.0 0 0 1 0 0 0 0 0 0 0
and some reference times
value
----------
0
365
730
1095
1460
1825
2190
2555
2920
I need a new table with only the values from the reference table (in this time 9 rows only) with this output
Simulation TimeV P1 P2 P2F P3 P3F P4 P4F P5 P5F P6
---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ----------
1 0.0 0 1 0 0 0 0 0 0 0 0
1 365.0 0 1 0 0 0 0 0 0 0 0
1 730.0 0 1 0 0 0 0 0 0 0 0
1 1095.0 0 1 0 0 0 0 0 0 0 0
1 1460.0 0 1 0 0 0 0 0 0 0 0
1 1825.0 0 1 0 0 0 0 0 0 0 0
1 2190.0 0 1 0 0 0 0 0 0 0 0
2 0.0 0 1 0 0 0 0 0 0 0 0
2 365.0 0 0 1 0 0 0 0 0 0 0
2 730.0 0 0 1 0 0 0 0 0 0 0
...
I have tried this, I think the approach is near to the actual solution but is not working as I need. generate_series is a SQLite module I imported to my session to generate a table with numbers from 0 to 3000 with 365 step.
I have limited the output with the where clause.
select P.Simulation, S.value as TimeV, P.Time as TimeS,
"P.Track.Initial" as P1,
"P.Track.Good.U" as P2, "P.Track.Good.Finishing.U" as P2F ,"P.Track.Satisfactory.U" as P3, "P.Track.Satisfactory.Finishing.U" as P3F,
"P.Track.Poor.U" as P4, "P.Track.Poor.Finishing.U" as P4F, "P.Track.VeryPoor.U" as P5, "P.Track.VeryPoor.Finishing.U" as P5F, "P.Track.SuperRed.U" as P6
from generate_series(0, 3000, 365) S
inner join Places_1 P ON TimeV = ( select max(value) from generate_series(0,3000,365) where value <= TimeS )
where P.Simulation <= 2 and TimeS <= 3000
I expected to have only 9 rows per simulation, as I only need the state of the system at the reference times. But I'm getting repeated values for the reference time table...
Simulation TimeV TimeS P1 P2 P2F P3 P3F P4 P4F P5 P5F P6
---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ----------
1 0 0.0 1 0 0 0 0 0 0 0 0 0
1 0 0.0 0 1 0 0 0 0 0 0 0 0
1 0 28.0 0 1 0 0 0 0 0 0 0 0
1 0 28.0 0 1 0 0 0 0 0 0 0 0
1 2190 2190.0 0 1 0 0 0 0 0 0 0 0
1 2190 2190.33333 0 1 0 0 0 0 0 0 0 0
...
One solution (not the prettiest) is to use unions. This will take a row (or multiple rows) and attach them below one another. Here's how you would do it:
(SELECT [COLUMNS HERE] FROM [TABLE] WHERE TimeS <= [TimeV #1] AND ROWNUM = 1 ORDER BY TimeS DESC)
UNION
(SELECT [SAME COLUMNS] FROM [TABLE] WHERE TimeS <= [TimeV #2] AND ROWNUM = 1 ORDER BY TimeS DESC)
UNION
...
Repeat this for each of the values of TimeV you need.
I got the solution!
select P.Simulation, max(P.Time) as Time, value as RefDay,
"P.Track.Initial" as P1,
"P.Track.Good.U" as P2, "P.Track.Good.Finishing.U" as P2F ,"P.Track.Satisfactory.U" as P3, "P.Track.Satisfactory.Finishing.U" as P3F,
"P.Track.Poor.U" as P4, "P.Track.Poor.Finishing.U" as P4F, "P.Track.VeryPoor.U" as P5, "P.Track.VeryPoor.Finishing.U" as P5F, "P.Track.SuperRed.U" as P6
from Places_1 P, generate_series(0,10950,365) where P.Time <= value group by Simulation,value
With this query I get the data I need. The key is grouping by the reference time.

How to turn a list of event in to a matrix to display in Panda

I have a list of events and i want to display on a graph how many happens per hour each day of the week as shown below:
Example of the graph i want
(each line is a day, x axis is the time of the day, y axis is the number of events)
As i am new to Panda i am not sure what's the best way to do it but here is my way:
x = [(rts[k].getDay(), rts[k].getHour(), 1) for k in rts]
df = pd.DataFrame(x[:30]) # Subset of 30 events
dfGrouped = df.groupby([0, 1]).sum() # Group them by day and hour
#Format to display
pd.DataFrame(np.random.randn(24, 7), index=range(0,24), columns=['Mo', 'Tu', 'We', 'Th', 'Fr', 'Sa', 'Su'])
Question is, how can i go from my dataframe with data grouped to a matrix 24x7 as required to display ?
I tried as_matrix but that give me only a one dimensional array, while i want the index of my dataframe to be the index in my matrix.
print(df)
2
0 1
0 19 1
23 1
1 10 2
18 3
22 1
2 17 1
3 8 2
9 3
11 3
13 1
19 1
4 7 1
9 1
14 1
15 1
18 1
5 1 2
7 1
13 1
19 1
6 12 1
Thanks for your help :)
Antoine
I think you need unstack for reshape data, then rename columns names by dict and if necessary add missing hours to index by reindex_axis:
df1 = df.groupby([0, 1])[2].sum().unstack(0, fill_value=0)
#set columns names
df = pd.DataFrame(x[:30], columns = ['days','hours','val'])
d = {0: 'Mo', 1: 'Tu', 2: 'We', 3: 'Th', 4: 'Fr', 5: 'Sa', 6: 'Su'}
df1 = df.groupby(['days', 'hours'])['val'].sum().unstack(0, fill_value=0)
df1 = df1.rename(columns=d).reindex_axis(range(24), fill_value=0)
print (df1)
days Mo Tu We Th Fr Sa Su
hours
0 0 0 0 0 0 0 0
1 0 0 0 0 0 2 0
2 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0
7 0 0 0 0 1 1 0
8 0 0 0 2 0 0 0
9 0 0 0 3 1 0 0
10 0 2 0 0 0 0 0
11 0 0 0 3 0 0 0
12 0 0 0 0 0 0 1
13 0 0 0 1 0 1 0
14 0 0 0 0 1 0 0
15 0 0 0 0 1 0 0
16 0 0 0 0 0 0 0
17 0 0 1 0 0 0 0
18 0 3 0 0 1 0 0
19 1 0 0 1 0 1 0
20 0 0 0 0 0 0 0
21 0 0 0 0 0 0 0
22 0 1 0 0 0 0 0
23 1 0 0 0 0 0 0

How to create dummy variables on Ordinal columns in Python

I am new to Python. I have created dummy columns on categorical column using pandas get_dummies. How to create dummy columns on ordinal column (say column Rating has values 1,2,3...,10)
Consider the dataframe df
df = pd.DataFrame(dict(Cats=list('abcdcba'), Ords=[3, 2, 1, 0, 1, 2, 3]))
df
Cats Ords
0 a 3
1 b 2
2 c 1
3 d 0
4 c 1
5 b 2
6 a 3
pd.get_dummies
works the same on either column
with df.Cats
pd.get_dummies(df.Cats)
a b c d
0 1 0 0 0
1 0 1 0 0
2 0 0 1 0
3 0 0 0 1
4 0 0 1 0
5 0 1 0 0
6 1 0 0 0
with df.Ords
0 1 2 3
0 0 0 0 1
1 0 0 1 0
2 0 1 0 0
3 1 0 0 0
4 0 1 0 0
5 0 0 1 0
6 0 0 0 1
with both
pd.get_dummies(df)
Ords Cats_a Cats_b Cats_c Cats_d
0 3 1 0 0 0
1 2 0 1 0 0
2 1 0 0 1 0
3 0 0 0 0 1
4 1 0 0 1 0
5 2 0 1 0 0
6 3 1 0 0 0
Notice that it split out Cats but not Ords
Let's expand on this by adding another Cats2 column and calling pd.get_dummies
pd.get_dummies(df.assign(Cats2=df.Cats)))
Ords Cats_a Cats_b Cats_c Cats_d Cats2_a Cats2_b Cats2_c Cats2_d
0 3 1 0 0 0 1 0 0 0
1 2 0 1 0 0 0 1 0 0
2 1 0 0 1 0 0 0 1 0
3 0 0 0 0 1 0 0 0 1
4 1 0 0 1 0 0 0 1 0
5 2 0 1 0 0 0 1 0 0
6 3 1 0 0 0 1 0 0 0
Interesting, it splits both object columns but not the numeric one.

How to combine rows or data into one row

I am working on a project to see how many units of each category that customers order and this is my Select clause:
SELECT
d2.customer_id
, ( CASE WHEN d2.category = 100 THEN d2.units ELSE 0 END ) AS produce_units
, ( CASE WHEN d2.category = 200 THEN d2.units ELSE 0 END ) AS meat_units
, ( CASE WHEN d2.category = 300 THEN d2.units ELSE 0 END ) AS seafood_units
, SUM (d2.units) AS total_units
And my result looks like this while 62779 is customer id and the last column is total units.
62779 0 0 0 0 20 0 0 0 0 0 0 20
62779 0 0 0 0 0 0 0 0 52 0 0 52
62779 0 6 0 0 0 0 0 0 0 0 0 6
62779 0 0 0 0 0 0 0 0 0 22 0 22
62779 0 0 0 0 0 14 0 0 0 0 0 14
62779 0 0 0 0 0 0 0 20 0 0 0 20
62779 0 0 0 8 0 0 0 0 0 0 0 8
62779 64 0 0 0 0 0 0 0 0 0 0 64
However, I want my result to look like this:
62779 64 6 0 8 20 14 0 20 52 22 0 206
Please advice. Thanks :)

Matplotlib pcolor not plotting correctly

I am trying to create a heat map from a DataFrame (df) of IDs (rows) and Positions (columns) at which a motif is possible. If the motif is present the value of the table is 1 and 0 if it is not present. Such as:
ID Position 1 2 3 4 5 6 7 8 9 10 ...etc
A 0 1 0 0 0 1 0 0 0 1
B 1 0 1 0 1 0 0 1 0 0
C 0 0 0 1 0 0 1 0 1 0
D 1 0 1 0 0 0 1 0 1 0
I then multiply this matrix by itself to find the number of times the motifs present co-occur with motifs at other positions using the code:
df.T.dot(df)
To obtain the Data Frame:
POS 1 2 3 4 5 6 7 8 9 10 ...
1 2 0 2 0 1 0 1 1 1 0
2 0 1 0 0 0 1 0 0 0 1
3 2 0 2 0 1 0 1 1 1 0
4 0 0 0 1 0 0 1 0 1 0
5 1 0 1 0 1 0 0 1 0 0
6 0 1 0 0 0 1 0 0 0 1
7 1 0 1 1 0 0 2 0 2 0
8 1 0 1 0 1 0 0 1 0 0
9 1 0 1 1 0 0 2 0 2 0
10 0 1 0 0 0 1 0 0 0 1
...
Which is symmetrical with the diagonal, however when I try to create the Heat Map using
pylab.pcolor(df)
It gives me an asymmetrical map that does not seem to be representing the dotted matrix. I don't have enough reputation to post an image though.
Does anyone know why this might be occurring? Thanks