SQL impala generate Standard deviation manually - sql

I have a table like this:
values  frequencies  grpng
2       1            cat1
3       2            cat1
4       1            cat1
2       2            cat2
1       1            cat2
5       2            cat2
I want to compute the population standard deviation per group (cat1, cat2),
not with a window function but by grouping on the grpng column.
I see two options:
Expand the values using the frequencies and then use the standard SQL standard deviation function.
Group directly and compute the standard deviation manually, if possible.
Can you suggest a solution? For the first option I have not been able to find a function in Impala that expands the rows.
My desired outcome is:
sddev             grpng
0.70710678118655  cat1
1.6733200530682   cat2
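The second option (group directly and compute the deviation by hand) can be sketched as follows. This is only an illustration, not Impala code: it runs the grouped query in SQLite through Python's sqlite3 module, and the table name t and the column names val and freq are assumptions chosen for the example (values is a reserved word in many SQL dialects). It relies on the identity that the population variance equals the weighted mean of the squares minus the square of the weighted mean.

```python
import math
import sqlite3

# Sample rows from the question: (value, frequency, group).
rows = [
    (2, 1, "cat1"), (3, 2, "cat1"), (4, 1, "cat1"),
    (2, 2, "cat2"), (1, 1, "cat2"), (5, 2, "cat2"),
]

conn = sqlite3.connect(":memory:")
# Hypothetical table and column names, used only for this sketch.
conn.execute("CREATE TABLE t (val REAL, freq REAL, grpng TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)

# Population variance = E[x^2] - (E[x])^2, with the frequencies as weights.
query = """
SELECT grpng,
       SUM(freq * val * val) / SUM(freq)
         - (SUM(freq * val) / SUM(freq)) * (SUM(freq * val) / SUM(freq)) AS variance
FROM t
GROUP BY grpng
ORDER BY grpng
"""

# The square root is taken in Python because not every SQLite build ships SQRT.
sddev = {grp: math.sqrt(variance) for grp, variance in conn.execute(query)}
print(sddev)
```

In Impala the variance expression could presumably be wrapped in SQRT(...) directly inside the same GROUP BY query, so no row expansion would be needed.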

Related

PostgreSQL how to find row IDs based on aggregated metrics

I have a table in a PostgreSQL database that looks more or less like this:
ID  Side  Amount  Price
1   BUY   8       107295.000000000
2   SELL  18      107300.000000000
3   SELL  21      107305.000000000
4   BUY   17      107310.000000000
And I have some aggregated metrics that look like this:
{'BUY': {'amount_sum': 6655, 'price_avg': 105961.497370398197}, 'SELL': {'amount_sum': 6655, 'price_avg': 106214.787377911345}}
And I need to find the row IDs that match these metrics. How would I go about doing this?
I've read a bit into PostgreSQL documentation and I've tried using GROUP BY on SIDE, then using the HAVING clause, but wasn't successful.
===================================================
To clarify, given this table and input:
ID  Side  Amount  Price
1   BUY   2       1
2   SELL  1       2
3   SELL  2       1
4   BUY   1       3
5   SELL  8       1
6   BUY   5       2
{'BUY': {'amount_sum': 3, 'price_avg': 2}, 'SELL': {'amount_sum': 10, 'price_avg': 1}}
I would expect the output to be:
BUY: ids[1,4] SELL: ids[3,5]
That is because for ids 1 and 4, which have side BUY, the sum of the amount column is 3 and the average of the price column is 2. For ids 3 and 5, which have side SELL, the sum of the amount column is 10 and the average of the price column is 1.
@Gabriel Tkacz, I do not have 50 reputation to comment, so I am asking my question as an "answer".
From the input table I would have expected:
{'BUY': {'Ids': [1,4,6], 'amount_sum': 8, 'price_avg': 2}}
{'SELL': {'Ids': [2,3,5], 'amount_sum': 11, 'price_avg': 1.3333333333333333}}
Why are id 6 on the BUY side and id 2 on the SELL side excluded in your explanation?
BUY: ids[1,4] SELL: ids[3,5]
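To make the clarified requirement concrete, here is a minimal brute-force sketch in Python rather than PostgreSQL. It tries every subset of rows on each side and keeps those whose amount sum and average price equal the targets. The function name matching_ids and the tolerance parameter are assumptions made for this illustration. The search is exponential in the number of rows per side, so it is only suitable for checking small examples like the one above, not a real table.

```python
from itertools import combinations

# Rows from the clarification: (id, side, amount, price).
rows = [
    (1, "BUY", 2, 1), (2, "SELL", 1, 2), (3, "SELL", 2, 1),
    (4, "BUY", 1, 3), (5, "SELL", 8, 1), (6, "BUY", 5, 2),
]
targets = {
    "BUY": {"amount_sum": 3, "price_avg": 2},
    "SELL": {"amount_sum": 10, "price_avg": 1},
}

def matching_ids(rows, targets, tol=1e-9):
    """Return, per side, every list of ids whose aggregates hit the targets."""
    result = {}
    for side, target in targets.items():
        side_rows = [r for r in rows if r[1] == side]
        result[side] = []
        # Try subsets of every size; this is exponential in len(side_rows).
        for size in range(1, len(side_rows) + 1):
            for combo in combinations(side_rows, size):
                amount_sum = sum(r[2] for r in combo)
                price_avg = sum(r[3] for r in combo) / size
                if (amount_sum == target["amount_sum"]
                        and abs(price_avg - target["price_avg"]) <= tol):
                    result[side].append([r[0] for r in combo])
    return result

matches = matching_ids(rows, targets)
print(matches)
```

On the six sample rows this finds exactly one subset per side, which shows that the targets describe a subset of the rows and not a full GROUP BY over each side.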

add column with fixed values for each value of another column Redshift

I have the following table and want to add a date range for each user.
How can I achieve this?
If this is possible with a query in Redshift, that would be useful.
If not, what is an efficient way to create this in Python pandas, as the data has 8 lakh (800,000) records?
Given this dataframe df:
   userid username
0       1        a
1       2        b
2       3        c
you can use numpy repeat and tile:
import numpy as np
import pandas as pd
dr = pd.date_range('2020-01-01','2020-01-03')
df = pd.DataFrame(np.repeat(df.to_numpy(), len(dr), 0), columns=df.columns).assign(date=np.tile(dr.to_numpy(), len(df)))
Result:
   userid username       date
0       1        a 2020-01-01
1       1        a 2020-01-02
2       1        a 2020-01-03
3       2        b 2020-01-01
4       2        b 2020-01-02
5       2        b 2020-01-03
6       3        c 2020-01-01
7       3        c 2020-01-02
8       3        c 2020-01-03
In SQL this is simple too: just cross join with the list of dates you want to add to each row (replicating the rows). You can see in your example that 3 rows and 3 dates result in 9 rows. (Untested explanatory code:)
select userid, username, "date" from <table> cross join
(select '2020-01-01'::date as "date" union all select '2020-02-01'::date union all select '2020-03-01'::date) as d;
Now the problem with this simple approach is that if you are dealing with large tables and long lists of dates, the multiplication will kill you. 10 billion rows by 5,000 dates is 50 trillion resulting rows; producing this will take a long time and storing it will take lots of disk space. For small tables and short lists of dates this works fine.
If you are on the "big" side of things you will likely need to rethink what you are trying to do. Since you are using Redshift, it is possible that you will need to do this.
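The cross join described above can be tried out on a small scale before running it on a real cluster. The sketch below is an illustration only: it uses SQLite through Python's sqlite3 module instead of Redshift, the table name users and the alias dt are assumptions made for the example, and the dates are stored as plain text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table standing in for the user table in the question.
conn.execute("CREATE TABLE users (userid INTEGER, username TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])

# Every user row is paired with every date row: 3 users x 3 dates = 9 rows.
query = """
SELECT u.userid, u.username, d.dt
FROM users AS u
CROSS JOIN (
    SELECT '2020-01-01' AS dt
    UNION ALL SELECT '2020-01-02'
    UNION ALL SELECT '2020-01-03'
) AS d
ORDER BY u.userid, d.dt
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

The row count of the result is always the product of the two inputs, which is why the approach stops being practical for very large tables or long date lists.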

Calculate percentage on SSRS Expressions

I have the following table on SQL:
Category | Requests
Cat1 | 150
Cat2 | 200
Cat3 | 550
Cat4 | 100
Cat5 | 50
SUM | 1050
How can I create an expression to calculate the percentage of Cat5 compared to the total? (About 4.76% in this case.)
Try this:
=Lookup("Cat5",Fields!Category.Value,Fields!Requests.Value,"DataSetName")/
Sum(Fields!Requests.Value,"DataSetName")
Replace "DataSetName" by the actual name of your dataset.
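As a quick sanity check of what that expression should return, the same arithmetic can be reproduced outside SSRS. This is a plain Python sketch, not an SSRS expression; the dictionary simply restates the table from the question.

```python
# Requests per category, copied from the table in the question.
requests = {"Cat1": 150, "Cat2": 200, "Cat3": 550, "Cat4": 100, "Cat5": 50}

total = sum(requests.values())      # plays the role of Sum(Fields!Requests.Value, ...)
share = requests["Cat5"] / total    # plays the role of Lookup("Cat5", ...) / Sum(...)

print(f"{share:.2%}")
```

The expression yields a fraction (roughly 0.0476), so the textbox still needs a percentage format to display it as a percent.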
Assuming you want 150 to represent 150% within the RDL, you can do the following:
First apply the following formula: =Fields!field.Value/100
where Fields!field.Value is the field you want to convert to a percentage. So if your field is called Requests, you will have =Fields!Requests.Value/100
Then change the number format of the textbox to Percentage in the Text Box Properties.
The textbox should then show the value as a percentage.

Microsoft Access 2007 Report with Conditional Columns

I am looking to make a very simple report to condense and show data side by side. All of the examples of reports I find are only row by row.
The query I will use will only have three columns: "Company, Model, Total"
The format I am trying to get to is
Company Model Total Company Model Total
A       123   2     B       123   4
A       222   3     B       333   3
A       444   7     B       444   7
The idea is to present the information in a way that lets multiple companies side by side compare inventory of the same model and find discrepancies. Ideally the report would eventually group all models that span every company at the top, but that's a next-generation problem.
I have attempted conditional formatting on multiple "Company" boxes, but the conditions do not seem to be applied properly, or for some reason every "Company" box is adopting the same conditions.
I think you want a crosstab query grouping by model (the rowHeader), company as the column header, and first(total) as the value.
The results should look like
model  A total  B total
123    2        4
222    3
333             3
444    7        7
then you can create another query based on the crosstab results to calculate the difference between company totals, if you want.
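The crosstab idea can be sketched with conditional aggregation, which is a common substitute for a crosstab query in databases that lack one. This is not Access SQL: the example runs in SQLite through Python's sqlite3 module, and the table name inventory and the output column names a_total and b_total are assumptions made for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table holding the Company / Model / Total rows from the question.
conn.execute("CREATE TABLE inventory (Company TEXT, Model INTEGER, Total INTEGER)")
conn.executemany("INSERT INTO inventory VALUES (?, ?, ?)", [
    ("A", 123, 2), ("A", 222, 3), ("A", 444, 7),
    ("B", 123, 4), ("B", 333, 3), ("B", 444, 7),
])

# One output row per model, one column per company; a missing pair stays NULL.
crosstab_sql = """
SELECT Model,
       MAX(CASE WHEN Company = 'A' THEN Total END) AS a_total,
       MAX(CASE WHEN Company = 'B' THEN Total END) AS b_total
FROM inventory
GROUP BY Model
ORDER BY Model
"""
crosstab = conn.execute(crosstab_sql).fetchall()
for row in crosstab:
    print(row)
```

Models that only one company holds keep an empty cell for the other company, so discrepancies and gaps are both visible in the same result.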
You have to do this in two steps:
Build a query that gives you:
Company Model Total
A 123 2
A 222 3
A 444 7
B 123 4
B 333 3
B 444 7
Let's call this query q.
Build a second query
SELECT q1.Company, q1.Model, q1.Total, q2.Company, q2.Model, q2.Total
FROM q AS q1 INNER JOIN q AS q2 ON q1.Model = q2.Model
WHERE q1.company < q2.company;
This will give you:
A 123 2 B 123 4
A 444 7 B 444 7
(There are no matching data for models 222 and 333)
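The two-step self-join can be checked on the sample data with a short script. This is a sketch under assumptions: it uses SQLite through Python's sqlite3 module rather than Access, q is created as a table instead of a saved query, and the second company column is taken from q2 so that each output row shows both companies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in for the first query "q"; here it is simply a table with the same rows.
conn.execute("CREATE TABLE q (Company TEXT, Model INTEGER, Total INTEGER)")
conn.executemany("INSERT INTO q VALUES (?, ?, ?)", [
    ("A", 123, 2), ("A", 222, 3), ("A", 444, 7),
    ("B", 123, 4), ("B", 333, 3), ("B", 444, 7),
])

# Pair each model held by one company with the same model held by a "later" company.
side_by_side_sql = """
SELECT q1.Company, q1.Model, q1.Total, q2.Company, q2.Model, q2.Total
FROM q AS q1 INNER JOIN q AS q2 ON q1.Model = q2.Model
WHERE q1.Company < q2.Company
ORDER BY q1.Model
"""
pairs = conn.execute(side_by_side_sql).fetchall()
for row in pairs:
    print(row)
```

Because this is an inner join, models that only one company holds (222 and 333 here) drop out of the result.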

How can I identify the text values that have the lowest row IDs across 4 columns?

I found a few articles that are close, but not the same as what I am trying to do. I have an Excel file that has 4 columns of duplicated data, each column is sorted based on a numeric value that came from a different worksheet.
I need to identify the 25(or so?) rows where the value of the four columns match, and the row ID is the lowest. There will be roughly 250 rows of data to sift through, so I only really need the top 10%.
I don't HAVE to approach it this way. I can dump this data into Access if this cannot be done in Excel. Or I can assign columns next to each text column (a way of assigning IDs to each field in column 1, 2, 3, and 4) and use those values. The approach is negotiable, as long as the outcome works.
Here's what my data looks like in Excel:
A B C D
abc bcd abc def
cde fgh def bcd
def def bcd abc
bcd hji xyz lmn
So in this case I would want to highlight (or somehow identify) the value "def" because it appears closest to the top of all 4 columns, hence it has the lowest row ID. The value "bcd" would be second on the list since it also is identified in all 4 and has a low row id.
Any suggestions would be appreciated. I know SQL fairly well, so if you think dumping it in a DB would be best and you can suggest a query that would be awesome. But ideally... keeping it in Excel would be the least amount of work for me. I'm open to formulas, conditional formatting, etc.
Thanks!!
I THINK I came up with a fairly cool solution...
So, supposing you have this data in columns A-D, beginning in cell A2, say.
Now, you know that you ONLY want values if they already exist in column A; otherwise they're not in all 4 columns.
So:
In E2, type in the formula =Row() - This basically says where A's value is located
In F2, type in =Match($A2,B:B,0) - This will find the first match for A2's value in column B
Drag that formula across to G2 & H2 (to find the first match for A2's value in C & D respectively).
In I2, type in the formula =Sum(E2:H2)
Now, drag E:I down for your entire dataset.
So, if I = #N/A, that means the value wasn't in all 4 columns.
And the lower the value in I, the closer to the top the value sits across the four columns (column A's text being the value you're matching for).
Now you could sort according to column I, etc., to suit your needs.
Hope this does the trick (and makes sense)!
Cool Q, BTW!!!
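To see what the helper columns produce on the sample data, the formulas can be mimicked in a few lines of Python. This is only a sketch of the logic, not an Excel feature: match_row is a hypothetical stand-in for MATCH(..., 0), and the data is assumed to start in worksheet row 2 as described above.

```python
# Columns A-D from the question; the first data row is worksheet row 2.
columns = {
    "A": ["abc", "cde", "def", "bcd"],
    "B": ["bcd", "fgh", "def", "hji"],
    "C": ["abc", "def", "bcd", "xyz"],
    "D": ["def", "bcd", "abc", "lmn"],
}
FIRST_ROW = 2

def match_row(value, cells):
    """Mimic MATCH(value, column, 0): worksheet row of the first hit, None for #N/A."""
    return cells.index(value) + FIRST_ROW if value in cells else None

scores = {}
for offset, value in enumerate(columns["A"]):
    # Row of the value in A, then its first match in B, C and D.
    rows = [offset + FIRST_ROW] + [match_row(value, columns[c]) for c in "BCD"]
    # A sum over cells containing #N/A is itself #N/A: no score unless all four match.
    scores[value] = None if None in rows else sum(rows)

ranked = sorted((score, value) for value, score in scores.items() if score is not None)
print(ranked)
```

On the sample data only def and bcd get a score, with def ranked first, which agrees with the outcome described in the question.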
Do you have, or can you create, a master list of all of the possible cell values? If so, then some simple VLOOKUPs on each of the 4 data columns could give, for each unique cell value, the row number in each column. Add up the 4 results and sort on the total.
If you don't have the master list of unique values, I'd tend to go to Access because it's a pretty easy set of queries to get what you want.
Clarification Needed
When I first came up with this answer I used the same approach that John used in his clever Excel answer, namely to use the sum of the minimum rows per column to produce the rank. That produces the sample result in the question, but consider the following modified test data:
F1 F2 F3 F4 RowNum
--- --- --- --- ------
XXX bar baz bat 1
foo XXX baz bat 2
YYY bar XXX bat 3
foo YYY baz bat 4
foo bar YYY bat 5
foo bar baz YYY 6
foo bar baz bat 7
foo bar baz bat 8
foo bar baz bat 9
foo bar baz XXX 10
XXX appears in rows 1, 2, 3, and 10, so the sum would be 16. YYY appears in rows 3, 4, 5, and 6 so the sum would be 18. Ranking by sum would declare XXX the winner, even though if you started scanning for XXX from row 1 you would have to go all the way to row 10 to reach the last XXX, whereas if you started scanning for YYY from row 1 you would only have to go down to row 6 to reach the last YYY.
In this case should YYY actually be the winner?
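The difference between the two candidate rankings can be made concrete with a small script. This is an illustrative sketch: first_rows is a hypothetical helper, and ranking by the maximum first-row number is just one way to express "how far down you have to scan"; it is not something the question has confirmed.

```python
# Modified test data from above, one list per worksheet row (columns F1..F4).
grid = [
    ["XXX", "bar", "baz", "bat"],
    ["foo", "XXX", "baz", "bat"],
    ["YYY", "bar", "XXX", "bat"],
    ["foo", "YYY", "baz", "bat"],
    ["foo", "bar", "YYY", "bat"],
    ["foo", "bar", "baz", "YYY"],
    ["foo", "bar", "baz", "bat"],
    ["foo", "bar", "baz", "bat"],
    ["foo", "bar", "baz", "bat"],
    ["foo", "bar", "baz", "XXX"],
]

def first_rows(grid, item):
    """1-based first row of item in each column, or None if a column lacks it."""
    found = []
    for col in range(len(grid[0])):
        hits = [r for r, row in enumerate(grid, start=1) if row[col] == item]
        if not hits:
            return None
        found.append(hits[0])
    return found

xxx = first_rows(grid, "XXX")
yyy = first_rows(grid, "YYY")
print("XXX", xxx, "sum:", sum(xxx), "max:", max(xxx))
print("YYY", yyy, "sum:", sum(yyy), "max:", max(yyy))
```

Ranking by sum puts XXX first, while ranking by the maximum puts YYY first, so the choice of metric decides the winner on this data.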
(original answer)
The following code will import the Excel data into Access and add a [RowNum] column
Sub ImportExcelData()
    On Error Resume Next  '' in case it doesn't already exist
    DoCmd.DeleteObject acTable, "ExcelData"
    On Error GoTo 0
    DoCmd.TransferSpreadsheet acImport, acSpreadsheetTypeExcel12Xml, "ExcelData", "C:\Users\Gord\Documents\ExcelData.xlsx", False
    CurrentDb.Execute "ALTER TABLE ExcelData ADD COLUMN RowNum AUTOINCREMENT(1,1)", dbFailOnError
End Sub
So now we have an [ExcelData] table in Access like this
F1 F2 F3 F4 RowNum
--- --- --- --- ------
abc bcd abc def 1
cde fgh def bcd 2
def def bcd abc 3
bcd hji xyz lmn 4
Let's create a saved query named ExcelItems in Access to string the entries out in a long "list"...
SELECT F1 AS Item, RowNum, 1 AS ColNum FROM ExcelData
UNION ALL
SELECT F2 AS Item, RowNum, 2 AS ColNum FROM ExcelData
UNION ALL
SELECT F3 AS Item, RowNum, 3 AS ColNum FROM ExcelData
UNION ALL
SELECT F4 AS Item, RowNum, 4 AS ColNum FROM ExcelData
...returning...
Item RowNum ColNum
---- ------ ------
abc 1 1
cde 2 1
def 3 1
bcd 4 1
bcd 1 2
fgh 2 2
def 3 2
hji 4 2
abc 1 3
def 2 3
bcd 3 3
xyz 4 3
def 1 4
bcd 2 4
abc 3 4
lmn 4 4
Now we can find the lowest RowNum where Item is found for each ColNum...
TRANSFORM Min(ExcelItems.[RowNum]) AS MinOfRowNum
SELECT ExcelItems.[Item]
FROM ExcelItems
GROUP BY ExcelItems.[Item]
PIVOT ExcelItems.[ColNum] In (1,2,3,4);
...returning...
Item  1  2  3  4
----  -  -  -  -
abc   1     1  3
bcd   4  1  3  2
cde   2
def   3  3  2  1
fgh      2
hji      4
lmn            4
xyz         4
If we save that query as ExcelItems_Crosstab then we can use it to rank the items that appear in all four columns:
SELECT Item, [1]+[2]+[3]+[4] AS Rank
FROM ExcelItems_Crosstab
WHERE ([1]+[2]+[3]+[4]) IS NOT NULL
ORDER BY 2
...returning...
Item Rank
---- ----
def 9
bcd 10
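For readers without Access at hand, the same three steps (unpivot, minimum row per column, sum for items present in all four columns) can be reproduced in one query. This is a sketch under assumptions: it runs in SQLite through Python's sqlite3 module, replaces the TRANSFORM/PIVOT crosstab with a GROUP BY plus HAVING COUNT(*) = 4, and uses the names FirstRows and RankSum, which are made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ExcelData (F1 TEXT, F2 TEXT, F3 TEXT, F4 TEXT, RowNum INTEGER)")
conn.executemany("INSERT INTO ExcelData VALUES (?, ?, ?, ?, ?)", [
    ("abc", "bcd", "abc", "def", 1),
    ("cde", "fgh", "def", "bcd", 2),
    ("def", "def", "bcd", "abc", 3),
    ("bcd", "hji", "xyz", "lmn", 4),
])

rank_sql = """
WITH ExcelItems AS (          -- string the four columns out into one long list
    SELECT F1 AS Item, RowNum, 1 AS ColNum FROM ExcelData
    UNION ALL SELECT F2, RowNum, 2 FROM ExcelData
    UNION ALL SELECT F3, RowNum, 3 FROM ExcelData
    UNION ALL SELECT F4, RowNum, 4 FROM ExcelData
),
FirstRows AS (                -- lowest row per item and column
    SELECT Item, ColNum, MIN(RowNum) AS MinRow
    FROM ExcelItems
    GROUP BY Item, ColNum
)
SELECT Item, SUM(MinRow) AS RankSum
FROM FirstRows
GROUP BY Item
HAVING COUNT(*) = 4           -- keep only items found in all four columns
ORDER BY RankSum
"""
ranking = conn.execute(rank_sql).fetchall()
print(ranking)
```

The HAVING clause plays the role of the IS NOT NULL filter on the crosstab: an item missing from any column has fewer than four rows in FirstRows and is dropped.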