How to optimize querying multiple unrelated tables in SQLite?

I have a scenario where I have to iterate through multiple tables in a quite big SQLite database. The tables store information about planet positions in the sky over the years, so e.g. for Mars I have tables Mars_2000, Mars_2001, and so on. The table structure is always the same:
|id:INTEGER|date:TEXT|longitude:REAL|
The thing is that for a certain task I need to iterate through these tables, which costs a lot of time (for more than 10 queries it's painful).
I suppose that if I merged all the per-year tables into one big table, performance might be better, as one query over one big table should beat 50 queries over smaller ones. I wanted to make sure this would work before committing, as the database is humongous (around 20 GB) and reshaping it would take a while.
Is the plan I just described viable? Is there any other solution for such a case?
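For illustration, the reshape I have in mind would be something like the following (a sketch; the merged table and index names are hypothetical, and I'm assuming the date column is the one my queries filter on):
-- one merged table per body instead of one table per body and year
CREATE TABLE Mars_all (
    id INTEGER PRIMARY KEY,
    date TEXT,
    longitude REAL
);

INSERT INTO Mars_all (date, longitude) SELECT date, longitude FROM Mars_2000;
INSERT INTO Mars_all (date, longitude) SELECT date, longitude FROM Mars_2001;
-- ... repeat for every year ...

-- an index on the date column so range filters don't scan the whole ~20 GB
CREATE INDEX idx_mars_all_date ON Mars_all(date);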
It might be helpful, so I attach the function that produces my SQL query, which is unique for each table:
pub fn transition_query(
    select_param: &str,              // usually an asterisk
    table_name: &str,                // table I'd like to query
    birth_degree: &f64,              // constant number
    wanted_degree: &f64,             // another constant number
    orb: &f64,                       // another constant number
    upper_date_limit: DateTime<Utc>, // cast to an SQL-like string
    lower_date_limit: DateTime<Utc>, // cast to an SQL-like string
) -> String {
    let parsed_upper_date_limit = CelestialBodyPosition::parse_date(upper_date_limit);
    let parsed_lower_date_limit = CelestialBodyPosition::parse_date(lower_date_limit);
    return format!("
        SELECT *,
               (SECOND_LAG > 60 OR SECOND_LAG IS NULL) AS TRANSIT_START,
               (SECOND_LEAD > 60 OR SECOND_LEAD IS NULL) AS TRANSIT_END,
               time
        FROM (
            SELECT *,
                   UNIX_TIME - LAG(UNIX_TIME, 1) OVER (ORDER BY time) AS SECOND_LAG,
                   LEAD(UNIX_TIME, 1) OVER (ORDER BY time) - UNIX_TIME AS SECOND_LEAD
            FROM (
                SELECT {select_param},
                       DATE(time) AS day_scoped_date,
                       CAST(strftime('%s', time) AS INT) AS UNIX_TIME,
                       longitude
                FROM {table_name}
                WHERE ((-{orb} <= abs(realModulo(longitude - {birth_degree} - {wanted_degree}, 360))
                        AND abs(realModulo(longitude - {birth_degree} - {wanted_degree}, 360)) <= {orb})
                    OR (-{orb} <= abs(realModulo(longitude - {birth_degree} + {wanted_degree}, 360))
                        AND abs(realModulo(longitude - {birth_degree} + {wanted_degree}, 360)) <= {orb}))
                  AND time < '{parsed_upper_date_limit}' AND time > '{parsed_lower_date_limit}'
            )
        ) WHERE (TRANSIT_START AND NOT TRANSIT_END) OR (TRANSIT_END AND NOT TRANSIT_START);
    ");
}

I solved the issue programmatically. The whole thing was done in Rust with the r2d2_sqlite library. I'm still doing a lot of queries, but now they run in threads, which reduced execution time from 25 s to around 3 s. Here's the code:
use std::sync::mpsc;
use std::thread;

use r2d2;
use r2d2_sqlite::SqliteConnectionManager;

let manager = SqliteConnectionManager::file("db_path");
let pool = r2d2::Pool::builder().build(manager).unwrap();
let mut result: Vec<CelestialBodyPosition> = vec![]; // vector of structs
let (tx, rx) = mpsc::channel(); // allows asynchronous communication
let mut children = vec![]; // vector of join handles (not sure if needed at all)
for query in queries {
    let pool = pool.clone(); // each iteration clones the connection pool
    let inner_tx = tx.clone(); // and the sender, as each thread should have a separate one
    children.push(thread::spawn(move || {
        let conn = pool.get().unwrap();
        add_real_modulo_function(&conn); // adds the custom SQLite function I needed
        let mut sql = conn.prepare(&query).unwrap();
        // run the query and map the result to my internal type
        let positions: Vec<CelestialBodyPosition> = sql
            .query_map(params![], |row| {
                Ok(CelestialBodyPosition::new(row.get(1)?, row.get(2)?))
            })
            .unwrap()
            .map(|position| position.unwrap())
            .collect();
        // send the partial result to the receiver
        inner_tx.send(positions).unwrap();
    }));
}
// the original sender has to be dropped, otherwise the loop below waits for its input forever
drop(tx);
for received in rx {
    result.extend(received); // combine all partial results
}
return result;
As you can see, no optimization happened on the SQLite side, which kind of makes me feel I'm doing something wrong, but for now it's all right. It might be good to exert some more control over the number of spawned threads; see the sketch below.
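A minimal sketch of one way to cap the thread count, reusing the same pool, channel, and helpers from the code above; the chunking approach and the MAX_THREADS value are my own additions, not part of the original solution:
const MAX_THREADS: usize = 8; // hypothetical cap; tune to your core count

for chunk in queries.chunks(MAX_THREADS) {
    let mut handles = vec![];
    for query in chunk {
        let pool = pool.clone();
        let inner_tx = tx.clone();
        let query = query.clone(); // each thread owns its query string
        handles.push(thread::spawn(move || {
            let conn = pool.get().unwrap();
            add_real_modulo_function(&conn);
            let mut sql = conn.prepare(&query).unwrap();
            let positions: Vec<CelestialBodyPosition> = sql
                .query_map(params![], |row| {
                    Ok(CelestialBodyPosition::new(row.get(1)?, row.get(2)?))
                })
                .unwrap()
                .map(|position| position.unwrap())
                .collect();
            inner_tx.send(positions).unwrap();
        }));
    }
    // join this batch before starting the next, so at most MAX_THREADS run at once
    for handle in handles {
        handle.join().unwrap();
    }
}
drop(tx); // as before, drop the original sender so the receive loop terminates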


Joiners with filtering perform very slowly

I have a constraint with some joiners, but the performance is very poor. Is there a way to improve it?
I need the count of WorkingDay (with ::hasPermission) within the previous four days of the current day being analyzed.
Here is my current constraint:
private Constraint fiveConsecutiveWorkingDaysMax(ConstraintFactory constraintFactory) {
    return constraintFactory
            .from(WorkingDay.class)
            .filter(WorkingDay::hasPermission)
            .join(WorkingDay.class,
                    Joiners.equal(WorkingDay::hasPermission),
                    Joiners.equal(WorkingDay::getAgent),
                    Joiners.filtering((wd1, wd2) -> {
                        LocalDate fourDaysBefore = wd1.getDayJava().minusDays(4);
                        Boolean wd2IsBeforeWd1 = wd2.getDayJava().isBefore(wd1.getDayJava());
                        Boolean wd2IsAfterFourDaysBeforeWd1 = wd2.getDayJava().compareTo(fourDaysBefore) >= 0;
                        return (wd2IsBeforeWd1 && wd2IsAfterFourDaysBeforeWd1);
                    }))
            .groupBy((wd1, wd2) -> wd2, ConstraintCollectors.countBi())
            .filter((wd2, count) -> count >= 4)
            .penalizeConfigurable(FIVE_CONSECUTIVE_WORKING_DAYS_MAX);
}
Thanks for your help.
There is potential for improvement here. First, we pre-filter the right-hand side of the join to reduce the size of the Cartesian product:
return constraintFactory
        .forEach(WorkingDay.class)
        .filter(WorkingDay::hasPermission)
        .join(constraintFactory.forEach(WorkingDay.class)
                        .filter(WorkingDay::hasPermission),
                Joiners.equal(WorkingDay::getAgent),
                Joiners.filtering((wd1, wd2) -> {
                    LocalDate fourDaysBefore = wd1.getDayJava().minusDays(4);
                    Boolean wd2IsBeforeWd1 = wd2.getDayJava().isBefore(wd1.getDayJava());
                    Boolean wd2IsAfterFourDaysBeforeWd1 = wd2.getDayJava().compareTo(fourDaysBefore) >= 0;
                    return (wd2IsBeforeWd1 && wd2IsAfterFourDaysBeforeWd1);
                }))
...
This has the added benefit of simplifying the index, as it removes one equals joiner. Next, part of the filter can be replaced by a joiner as well:
return constraintFactory
        .forEach(WorkingDay.class)
        .filter(WorkingDay::hasPermission)
        .join(constraintFactory.forEach(WorkingDay.class)
                        .filter(WorkingDay::hasPermission),
                Joiners.equal(WorkingDay::getAgent),
                Joiners.greaterThan(wd -> wd.getDayJava()),
                Joiners.filtering((wd1, wd2) -> {
                    LocalDate fourDaysBefore = wd1.getDayJava().minusDays(4);
                    Boolean wd2IsAfterFourDaysBeforeWd1 = wd2.getDayJava().compareTo(fourDaysBefore) >= 0;
                    return wd2IsAfterFourDaysBeforeWd1;
                }))
...
Finally, the method needlessly boxes boolean into Boolean, wasting CPU cycles and memory. This is a micro-optimization, but if the filter runs often enough, the benefit will be measurable.
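For the record, the fix is just to use the primitive type; a sketch of the same filtering lambda without boxing:
Joiners.filtering((wd1, wd2) -> {
    LocalDate fourDaysBefore = wd1.getDayJava().minusDays(4);
    // a primitive boolean avoids an autoboxed Boolean on every evaluation
    boolean wd2IsAfterFourDaysBeforeWd1 = wd2.getDayJava().compareTo(fourDaysBefore) >= 0;
    return wd2IsAfterFourDaysBeforeWd1;
})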
A constraint refactored like this should perform better. That said, large joins are still going to take considerable time and the only way to work around that is to figure out a way to make them smaller.
Also, as Geoffrey said, I'd consider penalizing by the actual count, as what you have here is a textbook example of a score trap.
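For illustration, penalizing by a match weight might look something like the following (a sketch of the tail of the constraint only; the exact weight function, here the number of days beyond the allowed four, is a design choice of mine, not something from the original post):
.groupBy((wd1, wd2) -> wd2, ConstraintCollectors.countBi())
.filter((wd2, count) -> count >= 4)
// weight the penalty by how far over the limit the count is,
// instead of a flat penalty that creates a score trap
.penalizeConfigurable(FIVE_CONSECUTIVE_WORKING_DAYS_MAX,
        (wd2, count) -> count - 3);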
I don't see why this should be slow, except maybe because the Cartesian product explodes for a long time window. How many days is your time window?
Do note that the nurse rostering example has a totally different approach to detecting consecutive working days, using a custom collector. You might want to look at that in optaplanner-examples.

How to modify a value in a column with TypeORM

I have 2 tables, contractPoint and contractPointHistory (screenshots of the ContractPointHistory and ContractPoint tables omitted).
I would like to get contractPoint with point adjusted by pointChange. For example: ContractPoint -> id: 3, point: 5 and ContractPointHistory has contractPointId: 3 and pointChange: -5, so after the manipulation, point in contractPoint should be 0.
I wrote this code, but it works only with getRawMany(), not with getMany():
const contractPoints = await getRepository(ContractPoint).createQueryBuilder('contractPoint')
    .addSelect('"contractPoint".point + COALESCE((SELECT SUM(cpHistory.point_change) FROM contract_point_history AS cpHistory WHERE cpHistory.contract_point_id = contractPoint.id), 0) AS points')
    .andWhere('EXTRACT(YEAR FROM contractPoint.validFrom) = :year', { year })
    .andWhere('contractPoint.contractId = :contractId', { contractId })
    .orderBy('contractPoint.grantedAt', OrderByDirection.Desc)
    .getMany();
The method getMany can be used to select all attributes of an entity. However, if one wants to select some specific attributes of an entity, then one needs to use getRawMany.
As per the documentation:
"There are two types of results you can get using select query builder: entities or raw results. Most of the time, you need to select real entities from your database, for example, users. For this purpose, you use getOne and getMany. But sometimes you need to select some specific data, let's say the sum of all user photos. This data is not an entity, it's called raw data. To get raw data, you use getRawOne and getRawMany."
From this, we can conclude that the query you want to generate cannot be made using the getMany method.
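A minimal sketch of the getRawMany variant, assuming the same query builder as above; note that raw rows come back with TypeORM's alias_property key naming, so the mapping onto plain objects below is my own illustration:
const rawContractPoints = await getRepository(ContractPoint).createQueryBuilder('contractPoint')
    .addSelect('"contractPoint".point + COALESCE((SELECT SUM(cpHistory.point_change) FROM contract_point_history AS cpHistory WHERE cpHistory.contract_point_id = contractPoint.id), 0)', 'points')
    .andWhere('EXTRACT(YEAR FROM contractPoint.validFrom) = :year', { year })
    .andWhere('contractPoint.contractId = :contractId', { contractId })
    .orderBy('contractPoint.grantedAt', OrderByDirection.Desc)
    .getRawMany();

// each raw row carries the entity columns as contractPoint_* keys plus the computed "points"
const adjusted = rawContractPoints.map(row => ({
    id: row.contractPoint_id,
    point: Number(row.points),
}));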

PIG: FILTER relation against the next row of the same relation

I've been searching for a long time now to solve my problem, but have found nearly nothing helpful.
Hopefully some of you can give me a tip.
I have a relation A with the following format: username, timestamp, ip
For example:
Harald 2014-02-18T16:14:49.503Z 123.123.123.123
Harald 2014-02-18T16:14:51.503Z 123.123.123.123
Harald 2014-02-18T16:14:55.503Z 321.321.321.321
And I want to find out who changed his IP address in less than 5 seconds, so the second and the third row should be interesting.
I want to group the relation by username and compare the timestamp of the current row with the next row. If the IP address isn't the same and the timestamp is less than 5 seconds bigger, this should be in the output.
Could someone help me with that issue?
Regards.
First I want to thank you for your time,
but I'm actually stuck at the Sessionize part.
This is my data coming in:
aoebcu 2014-02-19T14:23:17.503Z 220.61.65.25
aoebcu 2014-02-19T14:23:14.503Z 222.117.144.19
aoebcu 2014-02-19T14:23:14.503Z 222.117.144.19
jekgru 2014-02-19T14:23:14.503Z 213.56.157.109
zmembx 2014-02-19T14:23:12.503Z 199.188.198.91
qhixcg 2014-02-19T14:23:11.503Z 203.40.104.119
and my code so far looks like this:
hijack_Reduced = FOREACH finalLogs GENERATE ClientUserName, timestamp, OriginalClientIP;
hijack_Filtered = FILTER hijack_Reduced BY OriginalClientIP != '-';
hijack_Sessionized = FOREACH (GROUP hijack_Filtered BY ClientUserName) {
    views = ORDER hijack_Filtered BY timestamp;
    GENERATE FLATTEN(Sessionize(views)) AS (ClientUserName, timestamp, OriginalClientIP, session_id);
}
But when I run this script, I get the following error message:
15:36:22 ERROR - org.apache.pig.tools.pigstats.SimplePigStats.setBackendException(542)
| ERROR 0: Exception while executing [POUserFunc (Name: POUserFunc(datafu.pig.sessions.Sessionize)[bag] - scope-199 Operator Key: scope-199) children: null at []]:
java.lang.IllegalArgumentException: Invalid format: "aoebcu"
I already tried a lot, but nothing worked.
Do you have an idea?
Regards
While you could write a UDF for this, you can actually make use of the UDFs already available in Apache DataFu to solve this.
My solution involves applying sessionization to the data. Basically you look at consecutive events and assign each event a session ID. If the time elapsed between two events exceeds a specified amount of time, in your case 5 seconds, then the next event gets a new session ID. Otherwise consecutive events get the same session ID. Once each event is assigned its session ID the rest is easy. We group by session ID and look for sessions that have more than one distinct IP address.
I'll walk through my solution.
Suppose you have the following input data. Both Harold and Kumar change their IP addresses, but Harold does it within 5 seconds, while Kumar does not. So the output of our script should simply be "Harold".
Harold,2014-02-18T16:14:49.503Z,123.123.123.123
Harold,2014-02-18T16:14:51.503Z,123.123.123.123
Harold,2014-02-18T16:14:55.503Z,321.321.321.321
Kumar,2014-02-18T16:14:49.503Z,123.123.123.123
Kumar,2014-02-18T16:14:55.503Z,123.123.123.123
Kumar,2014-02-18T16:15:05.503Z,321.321.321.321
Load the data
data = LOAD 'input' USING PigStorage(',')
    AS (user:chararray, time:chararray, ip:chararray);
Now define a couple UDFs from DataFu. The Sessionize UDF performs sessionization as I described earlier. The DistinctBy UDF will be used to find the distinct IP addresses within each session.
define Sessionize datafu.pig.sessions.Sessionize('5s');
define DistinctBy datafu.pig.bags.DistinctBy('1');
Group the data by user, sort by time, and apply the Sessionize UDF. Note that the timestamp must be the first field, as this is what Sessionize expects. This UDF appends a session ID to each tuple.
data = FOREACH data GENERATE time, user, ip;
data_sessionized = FOREACH (GROUP data BY user) {
    views = ORDER data BY time;
    GENERATE FLATTEN(Sessionize(views)) AS (time, user, ip, session_id);
}
Now that the data is sessionized, we can group by the user and session. I group by user too because I want to spit this value back out. We pass the bag of events into the DistinctBy UDF. Check the documentation of this UDF for a more detailed description. But essentially we will get as many tuples as there are distinct IP addresses per session. Note that I have removed the time from the relation below. This is because 1) it isn't needed, and 2) the DistinctBy in 1.2.0 of DataFu has a bug when handling fields containing dashes, as the time field does.
data_sessionized = FOREACH data_sessionized GENERATE user,ip,session_id;
data_sessionized = FOREACH (GROUP data_sessionized BY (user, session_id)) GENERATE
    group.user AS user,
    SIZE(DistinctBy(data_sessionized)) AS distinctIpCount;
Now select all the sessions that had more than one distinct IP address and return the distinct users for these sessions.
data_sessionized = FILTER data_sessionized BY distinctIpCount > 1;
data_sessionized = FOREACH data_sessionized GENERATE user;
data_sessionized = DISTINCT data_sessionized;
This produces simply:
Harold
Here is the full source code, which you should be able to paste directly into the DataFu unit tests and run:
/**
define Sessionize datafu.pig.sessions.Sessionize('5s');
define DistinctBy datafu.pig.bags.DistinctBy('1'); -- distinct by ip
data = LOAD 'input' USING PigStorage(',') AS (user:chararray, time:chararray, ip:chararray);
data = FOREACH data GENERATE time, user, ip;
data_sessionized = FOREACH (GROUP data BY user) {
    views = ORDER data BY time;
    GENERATE FLATTEN(Sessionize(views)) AS (time, user, ip, session_id);
}
data_sessionized = FOREACH data_sessionized GENERATE user, ip, session_id;
data_sessionized = FOREACH (GROUP data_sessionized BY (user, session_id)) GENERATE
    group.user AS user,
    SIZE(DistinctBy(data_sessionized)) AS distinctIpCount;
data_sessionized = FILTER data_sessionized BY distinctIpCount > 1;
data_sessionized = FOREACH data_sessionized GENERATE user;
data_sessionized = DISTINCT data_sessionized;
STORE data_sessionized INTO 'output';
*/
@Multiline private String sessionizeUserIpTest;
private String[] sessionizeUserIpTestData = new String[] {
    "Harold,2014-02-18T16:14:49.503Z,123.123.123.123",
    "Harold,2014-02-18T16:14:51.503Z,123.123.123.123",
    "Harold,2014-02-18T16:14:55.503Z,321.321.321.321",
    "Kumar,2014-02-18T16:14:49.503Z,123.123.123.123",
    "Kumar,2014-02-18T16:14:55.503Z,123.123.123.123",
    "Kumar,2014-02-18T16:15:05.503Z,321.321.321.321"
};
@Test
public void sessionizeUserIpTest() throws Exception
{
    PigTest test = createPigTestFromString(sessionizeUserIpTest);

    this.writeLinesToFile("input", sessionizeUserIpTestData);

    List<Tuple> result = this.getLinesForAlias(test, "data_sessionized");

    assertEquals(result.size(), 1);
    assertEquals(result.get(0).get(0), "Harold");
}

Using LINQ to pull collection until aggregate condition met

At a high level, I need a query that can pull a subset of records based on the sum of a column, just like Linq: How to query items from a collection until the sum reaches a certain value.
However, the key difference is that he's already got his records in an object, and I don't and can't: my table can have millions of records. If I build my query the way he did, I get this error:
"A lambda expression with a statement body
cannot be converted to an expression tree"
This makes sense after researching it: LINQ can't turn the answer in the above-referenced question into valid SQL.
I'm going to make a hypothetical table that represents my situation.
Order Id | Cookie Name    | Qty
1        | Sugar          | 5
2        | Snickerdoodle  | 4
3        | Chocolate chip | 8
4        | Snickerdoodle  | 10
5        | Snickerdoodle  | 5
Given this sample, I need to write a query that grabs the first X Snickerdoodle orders until the summed Qty meets or exceeds an input parameter (i.e., if the user chooses 13, it would return records 2 & 4).
I'm using NHibernate.Linq because I'm more comfortable in LINQ, but I'm completely open to ICriteria if the need arises.
As a side note, I'm interested in this as a concept as well as a direct problem. Even though I need a Sum, there has to be a way to do something akin to a TakeWhile that executes until a condition is met.
A pragmatic approach:
int needed = ...;
int actual = 0;
int page = 0;
const int pagesize = 20; // set to some sensible value, e.g. the page size of the grid shown to the user

var results = new List<CookieOrder>();
while (actual < needed)
{
    var partialResults = session.Query<CookieOrder>()
        .Where(c => c.Name == "Snickerdoodle")
        .OrderBy(c => c.Id)
        .Skip(page * pagesize)
        .Take(pagesize)
        .ToList();

    if (partialResults.Count == 0)
        break; // no more orders: bail out instead of looping forever when the target can't be reached

    for (int i = 0; i < partialResults.Count && actual < needed; i++)
    {
        results.Add(partialResults[i]);
        actual += partialResults[i].Quantity; // running sum of the quantities seen so far
    }
    page++;
}
return results;
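As a footnote on the concept: if dropping to raw SQL is acceptable, a window function can express the take-until-sum directly, so the database does all the work. A sketch, assuming a hypothetical CookieOrders table matching the sample and a database that supports SUM() OVER (e.g. SQL Server 2012+):
SELECT OrderId, CookieName, Qty
FROM (
    SELECT OrderId, CookieName, Qty,
           SUM(Qty) OVER (ORDER BY OrderId) AS RunningQty
    FROM CookieOrders
    WHERE CookieName = 'Snickerdoodle'
) t
-- keep each row whose predecessors alone have not yet reached the target (13 here),
-- so the row that first meets or exceeds it is included and nothing after it
WHERE RunningQty - Qty < 13;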

How to simplify this LINQ to Entities query to make a less horrible SQL statement from it? (contains Distinct, GroupBy and Count)

I have this SQL expression:
SELECT Musclegroups.Name, COUNT(DISTINCT Workouts.WorkoutID) AS Expr1
FROM Workouts
INNER JOIN Series ON Workouts.WorkoutID = Series.WorkoutID
INNER JOIN Exercises ON Series.ExerciseID = Exercises.ExerciseID
INNER JOIN Musclegroups ON Musclegroups.MusclegroupID = Exercises.MusclegroupID
GROUP BY Musclegroups.Name
Since I'm working on a project which uses EF in a WCF RIA LinqToEntitiesDomainService, I have to query this with LINQ (if this isn't a must, please inform me).
I made this expression:
var WorkoutCountPerMusclegroup = (from s in ObjectContext.Series1
                                  join w in ObjectContext.Workouts on s.WorkoutID equals w.WorkoutID
                                  where w.UserID.Equals(userid) && w.Type.Equals("WeightLifting")
                                  group s by s.Exercise.Musclegroup into g
                                  select new StringKeyIntValuePair
                                  {
                                      TestID = g.Select(n => n.Exercise.MusclegroupID).FirstOrDefault(),
                                      Key = g.Select(n => n.Exercise.Musclegroup.Name).FirstOrDefault(),
                                      Value = g.Select(n => n.WorkoutID).Distinct().Count()
                                  });
The StringKeyIntValuePair is just a custom entity type I made so I can send the info down to the Silverlight client. This is also why I need to set a "TestID" for it: it is an entity, and it needs a key.
And the problem is that this LINQ query produces this horrible SQL statement:
http://pastebay.com/144532
I suppose there is a better way to query this information, maybe a better LINQ expression. Or is it possible to just query with plain SQL somehow?
EDIT:
I realized that the TestID is unnecessary, because the other property, "Key" (the one I'm grouping on), becomes the key of the group, so it serves as a key too. After this, my query looks like this:
var WorkoutCountPerMusclegroup = (from s in ObjectContext.Series1
                                  join w in ObjectContext.Workouts on s.WorkoutID equals w.WorkoutID
                                  where w.UserID.Equals(userid) && w.Type.Equals("WeightLifting")
                                  group w.WorkoutID by s.Exercise.Musclegroup.Name into g
                                  select new StringKeyIntValuePair
                                  {
                                      Key = g.Key,
                                      Value = g.Select(n => n).Distinct().Count()
                                  });
This produces the following SQL: http://pastebay.com/144545
This seems far better than the previous SQL statement of the half-baked LINQ query.
But is this good enough? Or is this the boundary of LINQ to Entities' capabilities, and if I want even cleaner SQL, should I make another DomainService which operates with LINQ to SQL or something else?
Or would the best way be using a stored procedure that returns rowsets? If so, is there a best practice for doing this asynchronously, like a simple WCF RIA DomainService query?
I would like to know best practices as well.
Compiling a LINQ lambda expression can take a lot of time (3–30 s), especially when using group by and then FirstOrDefault (for left outer joins, meaning only taking values from the first row in the group).
The generated SQL's execution might not be that bad, but the compilation is, and queries using DbContext cannot be precompiled with .NET 4.0.
As example 1, something like:
var q = from a in DbContext.A
        join b ... into bb
        from b in bb.DefaultIfEmpty()
        group new { a, b } by new { ... } into g
        select new
        {
            Name1 = g.Key.Name1,
            SalarySum = g.Sum(p => p.b.Salary),
            SomeDate = g.FirstOrDefault().b.SomeDate
        };
Each FirstOrDefault we added in one case cost an extra 2 s of compile time, which added up 3 times to 6 s, only to compile, not to load data (which takes less than 500 ms). This basically destroys your application's usability: the user will be waiting many times for no reason.
The only way we have found so far to speed up the compilation is to mix lambda (method) syntax with query syntax (which might not be the correct terminology).
Example 2: a refactoring of example 1.
var q = (from a in DbContext.A
         join b ... into bb
         from b in bb.DefaultIfEmpty()
         select new { a, b })
        .GroupBy(p => new { ... })
        .Select(g => new
        {
            Name1 = g.Key.Name1,
            SalarySum = g.Sum(p => p.b.Salary),
            SomeDate = g.FirstOrDefault().b.SomeDate
        });
The above example compiled a lot faster than example 1 in our case, but still not fast enough, so the only solution for us in response-critical areas is to revert to native SQL (run against the entity model) or to use views or stored procedures (in our case Oracle PL/SQL).
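For illustration, the native SQL route might look something like this with EF's ObjectContext (a sketch against the workout schema from the question above; it assumes ExecuteStoreQuery can map the [Key] and [Value] columns onto StringKeyIntValuePair's properties by name):
var workoutCountPerMusclegroup = ObjectContext.ExecuteStoreQuery<StringKeyIntValuePair>(
    @"SELECT Musclegroups.Name AS [Key], COUNT(DISTINCT Workouts.WorkoutID) AS [Value]
      FROM Workouts
      INNER JOIN Series ON Workouts.WorkoutID = Series.WorkoutID
      INNER JOIN Exercises ON Series.ExerciseID = Exercises.ExerciseID
      INNER JOIN Musclegroups ON Musclegroups.MusclegroupID = Exercises.MusclegroupID
      GROUP BY Musclegroups.Name").ToList();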
Once we have time, we are going to test whether precompilation works in .NET 4.5 and/or .NET 5.0 for DbContext.
Hope this helps, and maybe we can get other solutions.