Multiply rows in group with SQLite - sql

I have a query that returns the probability that a token has a certain classification.
token      class      probPaired
---------- ---------- ----------
potato     A          0.5
potato     B          0.5
potato     C          1.0
potato     D          0.5
time       A          0.5
time       B          1.0
time       C          0.5
I need to aggregate the probabilities of each class by multiplying them together.
-- Imaginary MUL operator
select class, MUL(probPaired) from myTable group by class;
class      probability
---------- ----------
A          0.25
B          0.5
C          0.5
D          0.5
How can I do this in SQLite? SQLite doesn't have features like LOG/EXP or variables, which the solutions mentioned in other questions rely on.

In general, if SQLite can't do it you can write a custom function instead. The details depend on what programming language you're using; here it is in Perl using DBD::SQLite. Note that functions created in this way are not stored procedures; they exist only for that connection and must be recreated each time you connect.
For an aggregate function, you have to create a class which handles the aggregation. MUL is pretty simple, just an object to store the product.
{
    package My::SQLite::MUL;

    sub new {
        my $class = shift;
        my $mul = 1;
        return bless \$mul, $class;
    }

    sub step {
        my $self = shift;
        my $num  = shift;
        $$self *= $num;
        return;
    }

    sub finalize {
        my $self = shift;
        return $$self;
    }
}
Then you'd install that as the aggregate function MUL which takes a single argument and uses that class.
my $dbh = ...doesn't matter how the connection is made...
$dbh->sqlite_create_aggregate("MUL", 1, "My::SQLite::MUL");
And now you can use MUL in queries.
my $rows = $dbh->selectall_arrayref(
    "select class, MUL(probPaired) from myTable group by class"
);
Again, the details will differ with your particular language, but the basic idea will be the same.
This is significantly faster than fetching each row and taking the aggregate product.

You can calculate row numbers and then use a recursive CTE for the multiplication. Then get the max rnum (calculated row number) value for each class, which holds the final result of the multiplication.
--Calculating row numbers
with rownums as (select t1.*,
(select count(*) from t t2 where t2.token<=t1.token and t1.class=t2.class) as rnum
from t t1)
--Getting the max rnum for each class
,max_rownums as (select class,max(rnum) as max_rnum from rownums group by class)
--Recursive cte starts here
,cte(class,rnum,probPaired,running_mul) as
(select class,rnum,probPaired,probPaired as running_mul from rownums where rnum=1
union all
select t.class,t.rnum,t.probPaired,c.running_mul*t.probPaired
from cte c
join rownums t on t.class=c.class and t.rnum=c.rnum+1)
--Final value selection
select c.class,c.running_mul
from cte c
join max_rownums m on m.max_rnum=c.rnum and m.class=c.class
SQL Fiddle
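For reference, here is a minimal setup matching the sample data in the question (a sketch only; it assumes the table is named t, as the queries above do, rather than myTable):

-- Assumed schema plus the sample rows from the question
CREATE TABLE t (token TEXT, class TEXT, probPaired REAL);
INSERT INTO t (token, class, probPaired) VALUES
    ('potato', 'A', 0.5),
    ('potato', 'B', 0.5),
    ('potato', 'C', 1.0),
    ('potato', 'D', 0.5),
    ('time',   'A', 0.5),
    ('time',   'B', 1.0),
    ('time',   'C', 0.5);

Against these rows the final SELECT returns A 0.25, B 0.5, C 0.5 and D 0.5, matching the expected output in the question.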

Related

Can't find a way to improve my PostgreSQL query

In my PostgreSQL database I have 6 tables named storeAPrices, storeBprices etc., holding the same columns and indexes as follows:
item_code (string, primary_key)
item_name (string, btree index)
is_whigthed (number: 0|1, btree index)
item_price (number)
My desire is to join each storePrices table to the others by item_code or item_name similarity, but the "OR" should act as in a programming language (evaluate the right side only if the left is false).
Currently, my query has low performance.
SELECT *
FROM "storeAprices" sap
LEFT JOIN LATERAL (
    SELECT * FROM "storeBPrices" sbp
    WHERE similarity(sap.item_name, sbp.item_name) >= 0.45
    ORDER BY similarity(sap.item_name, sbp.item_name) DESC
    LIMIT 1
) bp ON CASE WHEN sap.item_code = bp.item_code THEN true ELSE sap.item_name % bp.item_name END
LEFT JOIN LATERAL (
    SELECT * FROM "storeCPrices" scp
    WHERE similarity(sap.item_name, scp.item_name) >= 0.45
    ORDER BY similarity(sap.item_name, scp.item_name) DESC
    LIMIT 1
) rp ON CASE WHEN sap.item_code = rp.item_code THEN true ELSE sap.item_name % rp.item_name END
This is part of my query and it takes too much time to respond. My data is not that large (15k items per table).
I also have another indexed column, "is_whigthed", that I'm not sure how to use. (I don't want to set it as a variable because I want to get all "is_whigthed" results.)
Any suggestions?
OR should be faster than using case
bp ON sap.item_code = bp.item_code OR sap.item_name % bp.item_name
You can also create a trigram index on the item_name columns, as described in the pg_trgm module docs, since you are using its % operator for similarity.
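For example (a sketch using the question's table names; the index names are made up), the pg_trgm setup is a GIN index with the gin_trgm_ops operator class on each item_name column:

-- Enable the extension once per database, then index each prices table
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX storeaprices_item_name_trgm ON "storeAprices" USING gin (item_name gin_trgm_ops);
CREATE INDEX storebprices_item_name_trgm ON "storeBPrices" USING gin (item_name gin_trgm_ops);
CREATE INDEX storecprices_item_name_trgm ON "storeCPrices" USING gin (item_name gin_trgm_ops);

Only the trigram operators (such as %) can use these indexes; a bare similarity(a, b) >= 0.45 call in a WHERE clause cannot. To benefit inside the LATERAL subqueries you could set pg_trgm.similarity_threshold to 0.45 and filter with item_name % ... there instead, keeping the ORDER BY similarity(...) DESC for ranking.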

Chaining endless sql and performance

I am chaining SQL according to user filters that are not known in advance.
For instance, the user might first ask for certain dates:
def filterDates(**kwargs):
    q = ('''
        SELECT date_num, {subject_col}, {in_col} as {out_col}
        FROM {base}
        WHERE date_num BETWEEN {date1} AND {date2}
        ORDER BY date_num
    ''').format(subject_col=subject_col, **kwargs)
    return q
(base is the input query string from the previous step; see next)
and then the user wants to calculate another thing (or many more), so we pass the date-filter query string q as {base} to this query:
('''
    WITH BS AS (
        SELECT date_num, {subject_col}, {in_col}
        FROM {base}
    )
    SELECT t1.{subject_col}, t1.{in_col}, t2.{in_col} - t1.{in_col} as {out_col}
    FROM BS t1
    JOIN BS t2
        ON t1.{subject_col} = t2.{subject_col} AND t2.date_num = {date2}
    WHERE t1.date_num = {date1}
''').format(subject_col=subject_col, **kwargs)
Here {base} is going to be:
base = '(' + q + ') AS base'
Now we can chain queries as much as we want and it works.
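To make the mechanism concrete, one level of chaining expands to something like this (purely illustrative; the names measurements, id, price, price_diff and the two dates stand in for the format arguments, and the inner ORDER BY is left out because it does not affect the outer result and some engines reject it inside a derived table):

WITH BS AS (
    SELECT date_num, id, price
    FROM (
        SELECT date_num, id, price
        FROM measurements
        WHERE date_num BETWEEN 20200101 AND 20200131
    ) AS base
)
SELECT t1.id, t1.price, t2.price - t1.price AS price_diff
FROM BS t1
JOIN BS t2
    ON t1.id = t2.id AND t2.date_num = 20200131
WHERE t1.date_num = 20200101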
How would the engine handle this? Does it mean the efficiency is bad because the engine has to make two passes (instead of a single query with a normal WHERE on the dates)? How would it optimize this?
Is there a common, good-practice way to chain an unknown number of queries?

Query Optimization for MS Sql Server 2012

I have a table containing ~5,000,000 rows of scada data, described by the following:
create table data (o int, m money).
Where:
- o is the PK with a clustered index on it. Its fill factor is close to 100%. o represents the date of the meter reading and can be thought of as the X axis.
- m is a decimal value lying within the 1..500 region and is the actual meter reading; it can be thought of as the Y axis.
I need to find out about certain patterns i.e. when, how often and for how long they had been occurring.
Example: looking for all occurrences of m changing by a value in the 500 to 510 region within 5 units (well, from 1 to 5) of o, I run the following query:
select d0.o as tFrom, d1.o as tTo, d1.m - d0.m as dValue
from data d0
inner join data d1
on (d1.o = d0.o + 1 or d1.o = d0.o + 2 or d1.o = d0.o + 3 or d1.o = d0.o + 4)
and (d1.m - d0.m) between 500 and 510
the query takes 23 seconds to execute.
The previous version took 30 minutes (90 times slower); I managed to optimize it using a naive approach by replacing on (d1.o - d0.o) between 1 and 4 with on (d0.o = d1.o - 1 or d0.o = d1.o - 2 or d0.o = d1.o - 3 or d0.o = d1.o - 4).
It's clear to me why it's faster: on the one hand an indexed column scan should work fast enough, and on the other I can afford it because dates are discrete (and I always give 5 minutes' grace time to any o region, so for 120 minutes it's the 115..120 region). I can't use the same approach with m values though, as they aren't discrete.
Things I've tried so far:
Soft sharding by applying where o between @oRegionStart and @oRegionEnd at the bottom of my script and running it within a loop, fetching results into a temp table. Execution time: 25 seconds.
Hard sharding by splitting the data into a number of physical tables. The result is 2 minutes, never mind the maintenance hassle.
Using some precooked data structures, like:
create table data_change_matrix (o int, dM5Min money, dM5Max money, dM10Min money, dM10Max money ... dM{N}Min money, dM{N}Max money)
where N is the max depth for which I run the analysis. Having such a table I could easily write a query:
select * from data_change_matrix where dM5Min between 500 and 510
The result: it went nowhere due to the tremendous size requirements (5M x ~250) and the maintenance-related costs; I would need to keep that matrix current in close to real time.
SQL CLR - don't even ask me what went wrong it just didn't work out.
Right now I'm out of inspiration and looking for help.
All in all: is it possible to get a close-to-instant response time running this type of query on large volumes of data?
Everything runs on MS SQL Server 2012. I didn't try it on MS SQL Server 2014, but I'm happy to do so if it would make sense.
Update - execution plan: http://pastebin.com/PkSSGHvH.
Update 2 - While I really love the LAG function suggested by usr, I wonder if there's a LAGS function allowing for
select o, MIN(LAGS(o, 4)) over(...) - or what's its shortest implementation in T-SQL?
I tried something very similar using SQL CLR and got it working but the performance was awful.
I assume you meant to write "on (d1.o = ..." and not "on (d.o = ...". Anyway, I got pretty drastic improvements just by simplifying the statement (making it easy for the query optimizer to pick a better plan I guess):
select d0.o as tFrom, d1.o as tTo, d1.m - d0.m as dValue
from data d0
inner join data d1
on d1.o between d0.o + 1 and d0.o + 4
and (d1.m - d0.m) between 500 and 510
Good luck with your query!
You say you've already tried CLR but don't give any code.
It was fastest in my test for my sample data.
CREATE TABLE data
(
    o INT PRIMARY KEY,
    m MONEY
);

INSERT INTO data
SELECT TOP 5000000 ROW_NUMBER() OVER (ORDER BY @@SPID),
       1 + ABS(CAST(CRYPT_GEN_RANDOM(4) AS INT) % 500)
FROM   master..spt_values v1,
       master..spt_values v2
None of the versions actually returns any results (it is impossible for m to be a decimal value lying within 1..500 and simultaneously for two m values to have a difference > 500) but, disregarding this, the typical timings I got for the code submitted so far are:
+-----------------+--------------------+
| | Duration (seconds) |
+-----------------+--------------------+
| Lag/Lead | 39.656 |
| Original code | 40.478 |
| Between version | 21.037 |
| CLR | 13.728 |
+-----------------+--------------------+
The CLR code I used was based on that here
To call it use
EXEC [dbo].[WindowTest]
    @WindowSize = 5,
    @LowerBound = 500,
    @UpperBound = 510
Full code listing
using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
public partial class StoredProcedures
{
    public struct DataRow
    {
        public int o;
        public decimal m;
    }

    [Microsoft.SqlServer.Server.SqlProcedure]
    public static void WindowTest(SqlInt32 WindowSize, SqlInt32 LowerBound, SqlInt32 UpperBound)
    {
        int windowSize = (int)WindowSize;
        int lowerBound = (int)LowerBound;
        int upperBound = (int)UpperBound;
        DataRow[] window = new DataRow[windowSize];

        using (SqlConnection conn = new SqlConnection("context connection=true;"))
        {
            SqlCommand comm = new SqlCommand();
            comm.Connection = conn;
            comm.CommandText = @"
                SELECT o,m
                FROM data
                ORDER BY o";

            SqlMetaData[] columns = new SqlMetaData[3];
            columns[0] = new SqlMetaData("tFrom", SqlDbType.Int);
            columns[1] = new SqlMetaData("tTo", SqlDbType.Int);
            columns[2] = new SqlMetaData("dValue", SqlDbType.Money);

            SqlDataRecord record = new SqlDataRecord(columns);
            SqlContext.Pipe.SendResultsStart(record);

            conn.Open();
            SqlDataReader reader = comm.ExecuteReader();

            int counter = 0;
            while (reader.Read())
            {
                DataRow thisRow = new DataRow() { o = (int)reader[0], m = (decimal)reader[1] };

                int i = 0;
                while (i < windowSize && i < counter)
                {
                    DataRow previousRow = window[i];
                    var diff = thisRow.m - previousRow.m;

                    if (((thisRow.o - previousRow.o) <= WindowSize - 1) && (diff >= lowerBound) && (diff <= upperBound))
                    {
                        record.SetInt32(0, previousRow.o);
                        record.SetInt32(1, thisRow.o);
                        record.SetDecimal(2, diff);
                        SqlContext.Pipe.SendResultsRow(record);
                    }

                    i++;
                }

                window[counter % windowSize] = thisRow;
                counter++;
            }

            SqlContext.Pipe.SendResultsEnd();
        }
    }
}
This looks like a great case for windowed aggregate functions or LAG. Here is a version using LAG:
select *
from (
select o
, lag(m, 4) over (order by o) as m4
, lag(m, 3) over (order by o) as m3
, lag(m, 2) over (order by o) as m2
, lag(m, 1) over (order by o) as m1
, m as m0
from data
) x
where 0=1
or (m1 - m0) between 500 and 510
or (m2 - m0) between 500 and 510
or (m3 - m0) between 500 and 510
or (m4 - m0) between 500 and 510
Using a windowed aggregate function you should be able to remove the manual expansion of those LAG calls.
SQL Server implements these things using a special execution plan operator called Window Spool. That makes it quite efficient.
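For instance, a windowed MIN can serve as a first pass over the same window (a sketch only; it assumes the o values are consecutive, so that a frame of 4 preceding rows matches an o offset of 1..4):

-- Flag rows whose largest backward difference within the window reaches the lower bound.
-- This is only a prefilter: it says a qualifying pair may end at this row, not where it starts.
select tTo, maxBackDiff
from (
    select o as tTo,
           m - min(m) over (order by o rows between 4 preceding and 1 preceding) as maxBackDiff
    from data
) x
where maxBackDiff >= 500;

To recover the exact (tFrom, tTo) pairs you would still join back to data or keep the expanded LAG version above.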

Implementation apriori in SQL

I want to implement this pseudo code in SQL.
This is my code:
k = 1
C1 = generate counts from R1
repeat
    k = k + 1

    INSERT INTO R'k
    SELECT p.Id, p.Item1, …, p.Itemk-1, q.Item
    FROM Rk-1 AS p, TransactionTable as q
    WHERE q.Id = p.Id AND
          q.Item > p.Itemk-1

    INSERT INTO Ck
    SELECT p.Item1, …, p.Itemk, COUNT(*)
    FROM R'k AS p
    GROUP BY p.Item1, …, p.Itemk
    HAVING COUNT(*) >= 2

    INSERT INTO Rk
    SELECT p.Id, p.Item1, …, p.Itemk
    FROM R'k AS p, Ck AS q
    WHERE p.item1 = q.item1 AND
          .
          .
          p.itemk = q.itemk
until Rk = {}
How can I code this so that it changes columns using k as a variable?
For APRIORI to be reasonably fast, you need efficient data structures. I'm not convinced storing the data in SQL again will do the trick. But of course it depends a lot on your actual data set. Depending on your data set, APRIORI, FPGrowth or Eclat may each be the better choice sometimes.
Either way, using a table layout like Item1, Item2, Item3, ... is pretty much a no-go in SQL table design. You may end up on The Daily WTF...
Consider keeping your itemsets in main memory, and only scanning the database using an efficient iterator.
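If you do stay in SQL, the usual alternative to Item1..Itemk columns is a normalized layout with one row per (Id, Item) pair, which the TransactionTable in the pseudocode already resembles. A sketch of the k = 2 step against such a layout (table and column names assumed from the pseudocode):

-- Frequent 2-itemsets with minimum support 2, mirroring the pseudocode's join and HAVING
SELECT p.Item AS Item1, q.Item AS Item2, COUNT(*) AS support
FROM TransactionTable AS p
JOIN TransactionTable AS q
  ON q.Id = p.Id AND q.Item > p.Item
GROUP BY p.Item, q.Item
HAVING COUNT(*) >= 2;

Each further k still needs another column and another self-join, which is why the loop over k is hard to express without dynamic SQL.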

accessing an element like array in pig

I have data in the form:
id,val1,val2
example
1,0.2,0.1
1,0.1,0.7
1,0.2,0.3
2,0.7,0.9
2,0.2,0.3
2,0.4,0.5
So first I want to sort each id by val1 in decreasing order, so something like:
1,0.2,0.1
1,0.2,0.3
1,0.1,0.7
2,0.7,0.9
2,0.4,0.5
2,0.2,0.3
And then select the second element id,val2 combination for each id
So for example:
1,0.3
2,0.5
How do I approach this?
Thanks
Pig is a scripting language, not a relational one like SQL; it is well suited to working with groups using operators nested inside a FOREACH. Here is the solution:
A = LOAD 'input' USING PigStorage(',') AS (id:int, v1:float, v2:float);
B = GROUP A BY id;   -- isolate all rows for the same id
C = FOREACH B {      -- here comes the scripting bit
    elems = ORDER A BY v1 DESC;        -- sort rows belonging to the id
    two = LIMIT elems 2;               -- select top 2
    two_invers = ORDER two BY v1 ASC;  -- sort in opposite order to bubble second value to the top
    second = LIMIT two_invers 1;
    GENERATE FLATTEN(group) as id, FLATTEN(second.v2);
};
DUMP C;
In your example id 1 has two rows with v1 == 0.2 but different v2, so the second value for id 1 can be either 0.1 or 0.3.
A = LOAD 'input' USING PigStorage(',') AS (id:int, v1:float, v2:float);
B = ORDER A BY id ASC, v1 DESC;
C = FOREACH B GENERATE id, v2;
DUMP C;