SQL Server table index column order

Is there any difference when I create a table index on multiple columns if I list the columns in a different order?
What exactly is the difference between the (ID, isValid, Created) and (ID, Created, isValid) indices?
And is there any difference between these two predicate orders in a query?
where ID = 123
and isValid = 1
and Created < getdate()
vs.
where ID = 123
and Created < getdate()
and isValid = 1
Column types: ID [int], isValid [bit], Created [datetime]

What exactly is the difference between the (ID, isValid, Created) and (ID, Created, isValid) indices?
If you always use all three columns in your WHERE clause, there's no difference.
(As Martin Smith points out in his comment: since one of the criteria is not an equality check, the sequence of the columns in the index does matter.)
However: an index can only ever be used if the n left-most columns (here: n between 1 and 3) are used in the query.
So if you have a query that uses only ID and isValid, the first index can be used, but the second one certainly cannot be.
And if you have queries that use only ID and Created in their WHERE clause, your second index might be used, but the first one never can be.
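A minimal sketch of the two alternatives (the table and index names here are made up for illustration):

CREATE TABLE Items (ID int, isValid bit, Created datetime);

CREATE INDEX IX_Items_ID_IsValid_Created ON Items (ID, isValid, Created);
CREATE INDEX IX_Items_ID_Created_IsValid ON Items (ID, Created, isValid);

-- Matches the left-most columns (ID, isValid) of the first index;
-- the second index can only seek on ID for this query:
SELECT ID, isValid, Created FROM Items WHERE ID = 123 AND isValid = 1;

-- Matches the left-most columns (ID, Created) of the second index;
-- the first index can only seek on ID for this query:
SELECT ID, isValid, Created FROM Items WHERE ID = 123 AND Created < GETDATE();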

AND is commutative, so the order of ANDed expressions in WHERE doesn't matter.
The order of columns in an index does matter; it should match your queries.
If ID is your table's clustered primary key and your queries look up a specific ID, don't bother creating an additional index at all. That would be like giving a book an index that says "page 123 is on page 123", and so on.

The order in the query makes no difference. The order in the index makes a difference. I'm not sure how good this will look in text, but here goes:
where ID = 123 and isValid = 1 and Created < Date 'Jan 3'
Here are a couple of possible indexes:
ID   IsValid  Created
===  =======  =======
122  0        Jan 4
122  0        Jan 3
...  ...      ...
123  0        Jan 4
123  0        Jan 3
123  0        Jan 2
123  0        Jan 1
123  1        Jan 4
123  1        Jan 3
123  1        Jan 2   <-- Your data is here...
123  1        Jan 1   <-- ... and here
...  ...      ...

ID   Created  IsValid
===  =======  =======
122  Jan 4    0
122  Jan 4    1
...  ...      ...
123  Jan 4    0
123  Jan 4    1
123  Jan 3    0
123  Jan 3    1
123  Jan 2    0
123  Jan 2    1   <-- Your data is here...
123  Jan 1    0
123  Jan 1    1   <-- ... and here
...  ...      ...
As you can probably tell, creating an index on (IsValid, Created, ID), or any other order, will separate your data even more. In general, you want to design the indexes to make your data as "clumpy" as possible for the queries executed most often.
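For this particular query (equality on ID and IsValid, a range on Created), the first layout above is the "clumpy" one, so the index you would actually create looks like this (the table name is made up):

CREATE INDEX IX_ID_IsValid_Created ON MyTable (ID, IsValid, Created);

With the equality columns first and the range column last, all the rows the query needs sit next to each other in the index.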

Related

How to count unique values across two columns with certain types in PostgreSQL

I've got data feeding into a database as below. I could run a single query many times to achieve this, but I'm looking for a way to query once and return all the results.
Columns: time, ID (the totally unique field), ReqType (can be 1 value out of 30), ReqName (can be 1 out of 20).
time  ID  ReqType  ReqName
====  ==  =======  ===========
1345  12  1        test
1346  13  2        test
1352  14  1        hello world
1352  15  3        fith
1354  16  1        hello world
1357  17  4        apple
Without constructing a query for every value of ReqType and ReqName, I would like to return counts within some time window (the last 30 minutes, for example, on the data above). (I think this will actually be two queries.)
count  ReqName
=====  ===========
2      test
2      hello world
1      fith
1      apple

Count  ReqType
=====  =======
3      1
1      2
1      3
1      4
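One way to get both result sets from a single pass is GROUPING SETS (available in PostgreSQL 9.5+). The following is only a sketch: the table name requests and the timestamp column ts are placeholders, since the question doesn't show the real schema:

SELECT ReqName, ReqType, COUNT(*) AS count
FROM requests
WHERE ts >= now() - interval '30 minutes'  -- the example time window
GROUP BY GROUPING SETS ((ReqName), (ReqType));

Each output row belongs to exactly one grouping: ReqType is NULL on the ReqName rows and vice versa, so both of the lists above can be read out of the one result set.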

Multiplication of returns by company increasing in time (BHARs)

I have the following DataFrame, organized as panel data. It contains daily returns of many companies on different days following the IPO date. day_diff is the number of days that have passed since the IPO, and return_1 is the company's individual return for that specific day, to which I have already added 1. Each company has its own company_tic and I have about 300 companies. My goal is to calculate the first component of the right-hand side of the BHAR equation: results for each day_diff and company_tic, always starting at day 0, until the last day of data (from day 0 to day 1, then from day 0 to day 2, from day 0 to day 3, and so on until my last day, which is day 730). I have tried df.groupby(['company_tic', 'day_diff'])['return_1'].expanding().prod() but it doesn't work. Any alternatives?
Index day_diff company_tic return_1
0 0 xyz 1.8914
1 1 xyz 1.0542
2 2 xyz 1.0016
3 0 abc 1.4398
4 1 abc 1.1023
5 2 abc 1.0233
... ... ... ...
[159236 rows x 3 columns]
Not sure I fully get what you want, but you might want to use cumprod instead of expanding().prod().
Here's what I tried:
df['return_1_prod'] = df.groupby('company_tic')['return_1'].cumprod()
Output:
day_diff company_tic return_1 return_1_prod
0 0 xyz 1.8914 1.891400
1 1 xyz 1.0542 1.993914
2 2 xyz 1.0016 1.997104
3 0 abc 1.4398 1.439800
4 1 abc 1.1023 1.587092
5 2 abc 1.0233 1.624071

Merge rows and convert a string value to a user-defined one when a condition related to other columns is matched

Assuming I'm dealing with this dataframe:
ID  Qualified  Year  Amount A  Amount B
==  =========  ====  ========  ========
1   No         2020  0         150
1   No         2019  0         100
1   Yes        2019  10        15
1   No         2018  0         100
1   Yes        2018  10        150
2   Yes        2020  0         200
2   No         2017  0         100
...
My desired output should be like this:
ID  Qualified  Year  Amount A  Amount B
==  =========  ====  ========  ========
1   No         2020  0         150
1   Partial    2019  10        115
1   Partial    2018  10        250
2   Yes        2020  0         200
2   No         2017  0         100
...
As you can see, the Qualified column gets a new merged value (Yes & No -> Partial, and Amount A and Amount B are summed) under one condition: a year within an ID includes both Yes and No in the Qualified column.
I don't know how to approach this. Could anyone suggest a method?
You can use the functions agg() and groupby() to perform this operation.
agg() allows you to use not only common aggregation functions (such as sum, mean, etc.) but also custom-defined functions.
I would do as follows:
def agg_qualify(x):
    # 'Partial' when an (ID, Year) group mixes Yes and No; otherwise keep the single value
    values = x.unique()
    if len(values) > 1:
        return 'Partial'
    return values[0]

df.groupby(['ID', 'Year']).agg({
    'Qualified': agg_qualify,
    'Amount A': 'sum',
    'Amount B': 'sum',
}).reset_index()
Output:
ID Year Qualified Amount A Amount B
0 1 2018 Partial 10 250.0
1 1 2019 Partial 10 115.0
2 1 2020 No 0 150.0
3 2 2020 Yes 0 200.0

Select maximum value where another column is used for the grouping

I'm trying to join several tables, where one of the tables acts as a key-value store, and then, after the joins, find the maximum value in one column that is less than or equal to the value in another column. As a simplified example, I have the following three tables:
Documents:
DocumentID  Filename      LatestRevision
==========  ============  ==============
1           D1001.SLDDRW  18
2           P5002.SLDPRT  10
Variables:
VariableID  VariableName
==========  ============
1           DateReleased
2           Change
3           Description
VariableValues:
DocumentID  VariableID  Revision  Value
==========  ==========  ========  =================
1           2           1         Created
1           3           1         Drawing
1           2           3         Changed Dimension
1           1           4         2021-02-01
1           2           11        Corrected typos
1           1           16        2021-02-25
2           3           1         Generic part
2           3           5         Screw
2           2           4         2021-02-24
I can use the LEFT JOIN/IS NULL thing to get the latest version of
variables relatively easily (see http://sqlfiddle.com/#!7/5982d/3/0).
What I want is the latest version of variables that are less than or equal
to a revision which has a DateReleased, for example:
DocumentID  Filename      Variable     Value              VariableRev  DateReleased  ReleasedRev
==========  ============  ===========  =================  ===========  ============  ===========
1           D1001.SLDDRW  Change       Changed Dimension  3            2021-02-01    4
1           D1001.SLDDRW  Description  Drawing            1            2021-02-01    4
1           D1001.SLDDRW  Description  Drawing            1            2021-02-25    16
1           D1001.SLDDRW  Change       Corrected Typos    11           2021-02-25    16
2           P5002.SLDPRT  Description  Generic Part       1            2021-02-24    4
How do I do this?
I figured this out. Add another JOIN at the start to bring in a second copy of the VariableValues table, selecting only the DateReleased variables, then make sure that every VariableValues revision selected is at or below that released revision. I think the LEFT JOIN has to be added after this table.
The example at http://sqlfiddle.com/#!9/bd6068/3/0 shows this better.
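Spelled out against the sample tables above, the approach looks roughly like this (a sketch of the idea, not a tested copy of the fiddle):

SELECT d.DocumentID, d.Filename, v.VariableName AS Variable,
       vv.Value, vv.Revision AS VariableRev,
       rel.Value AS DateReleased, rel.Revision AS ReleasedRev
FROM VariableValues rel                      -- only the DateReleased rows
JOIN Variables rv ON rv.VariableID = rel.VariableID
                 AND rv.VariableName = 'DateReleased'
JOIN VariableValues vv ON vv.DocumentID = rel.DocumentID
                      AND vv.Revision <= rel.Revision
JOIN Variables v ON v.VariableID = vv.VariableID
                AND v.VariableName <> 'DateReleased'
JOIN Documents d ON d.DocumentID = vv.DocumentID
-- the LEFT JOIN/IS NULL step: drop a row if the same variable has a newer
-- revision that is still at or below the released revision
LEFT JOIN VariableValues newer
       ON newer.DocumentID = vv.DocumentID
      AND newer.VariableID = vv.VariableID
      AND newer.Revision > vv.Revision
      AND newer.Revision <= rel.Revision
WHERE newer.DocumentID IS NULL;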

SQL Server row concatenation

I have a table (a table variable, in fact) that holds several thousand (approx. 50k) rows of the form:
group (int)  isok (bit)  x  y
20           0           1  1
20           1           2  1
20           1           3  1
20           0           1  2
20           0           2  1
21           1           1  1
21           0           2  1
21           1           3  1
21           0           1  2
21           1           2  2
And to pull this back to the client is a fairly hefty task (especially since isok is a bit). What I would like to do is transform this into the form:
group  mask
20     01100
21     10101
And maybe even go a step further by encoding this into a long, etc.
NOTE: The way in which the data is stored currently cannot be changed.
Is something like this possible in SQL Server 2005, and if possible even in 2000 (quite important)?
EDIT: I forgot to make it clear that the original table already has an implicit ordering that needs to be maintained. There isn't one column that acts as a linear sequence; rather, the ordering is based on two other (integer) columns, as above (x & y).
You can treat the bit as a string ('0', '1') and deploy one of the many string aggregate concatenation methods described here: http://www.simple-talk.com/sql/t-sql-programming/concatenating-row-values-in-transact-sql/
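On SQL Server 2005, the FOR XML PATH('') method from that article maps onto this directly (SQL Server 2000 would need one of the article's cursor- or function-based methods instead). A sketch, assuming the table variable is named @t and that the implicit ordering is ORDER BY x, y; adjust to the real ordering:

SELECT g.[group],
       (SELECT CAST(t.isok AS varchar(1))   -- bit rendered as '0' or '1'
        FROM @t t
        WHERE t.[group] = g.[group]
        ORDER BY t.x, t.y
        FOR XML PATH('')) AS mask
FROM (SELECT DISTINCT [group] FROM @t) g;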