Can I fetch the result of a SELECT query iteratively using LIMIT and OFFSET?

I am implementing a FUSE file system that uses an sqlite3 database as its backend. I do not plan to ever change the database backend, as my file system uses sqlite3 as its file format. One of the functions a file system must implement is the readdir function. This function allows a process to iteratively read the contents of a directory by repeatedly calling it and getting the next few directory entries (as many as the buffer can hold). The directory entries may be returned in any order. I want to implement this operation with the following query:
SELECT fileno, name FROM dirents WHERE dirno = ? LIMIT -1 OFFSET ?;
where dirno is the directory I'm reading from and OFFSET ? is the number of entries I've already returned. I want to read as many rows as I can fit into the buffer (I cannot predict the count as these are variable-length records depending on the length of the file name) and then reset the query.
Due to the stateless nature of FUSE, keeping open a query and returning the next few rows until the directory has ended is not an option as I cannot reliably detect if the process closes the directory prematurely.
The dirents table has the following schema:
CREATE TABLE dirents (
dirno INTEGER NOT NULL REFERENCES inodes(inum),
fileno INTEGER NOT NULL REFERENCES inodes(inum),
name TEXT NOT NULL,
PRIMARY KEY (dirno, name)
) WITHOUT ROWID;
Question
In theory, a SELECT statement yields rows in no defined order. In practice, can I assume that when I execute the same prepared SELECT statement multiple times with successively larger OFFSET values, I get the same results as if I read all the data in a single query, i.e. the row order is the same unspecified order each time? An assumption that currently holds is that the database is not modified between queries.
Can I assume that the row order stays reasonably similar when a different query modifies the dirents table in between? Some glitches (e.g. directory entries appearing twice) will of course be observable by the program, but for usability (the main user of readdir is the ls command) it is highly useful if the directory listing is usually mostly correct.
If I cannot make these assumptions, what is a better way to reach the desired result?
I know that I could throw in an ORDER BY clause to make the row order well-defined, but I fear that this might have a strong impact on performance, especially when reading small chunks of a large directory—the directory has to be sorted every time a chunk is read.

The right solution to this problem is to use order by. If you are concerned about the performance of the order by, then use an index on the column used for the order by.
The simplest approach, in my opinion, would be to remove the without rowid option on the table creation. Then you can just access the table as:
SELECT fileno, name
FROM dirents
WHERE dirno = ?
ORDER BY rowid
LIMIT -1 OFFSET ?;
I realize this adds additional bytes to each row, but it is for a good purpose -- making sure that your queries are correct.
Actually, the best index for this table would be on dirno, rowid, fileno, name. Given the where clause, you are doing a full table scan anyway, unless you have an index.
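As a rough sketch of that arrangement (untested; the index name is made up, and SQLite will not let you name rowid in a CREATE INDEX, it appends the rowid to every index entry implicitly anyway):
CREATE TABLE dirents (
    dirno  INTEGER NOT NULL REFERENCES inodes(inum),
    fileno INTEGER NOT NULL REFERENCES inodes(inum),
    name   TEXT NOT NULL,
    PRIMARY KEY (dirno, name)
);  -- no WITHOUT ROWID, so each row gets an implicit rowid
-- Every entry of this index ends with the implicit rowid, so scanning it
-- for a single dirno already visits rows in rowid order without a sort.
CREATE INDEX dirents_dirno ON dirents(dirno);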

If I add an ORDER BY name clause to the SELECT query, sqlite3 generates almost identical bytecode (except for a stray Noop), but the row order is guaranteed:
sqlite> EXPLAIN SELECT fileno, name FROM dirents WHERE dirno = ? LIMIT -1 OFFSET ?;
addr  opcode         p1    p2    p3    p4             p5  comment
----  -------------  ----  ----  ----  -------------  --  -------------
0     Init           0     21    0                    00
1     Integer        -1    1     0                    00
2     Variable       2     2     0                    00
3     MustBeInt      2     0     0                    00
4     SetIfNotPos    2     2     0                    00
5     Add            1     2     3                    00
6     SetIfNotPos    1     3     -1                   00
7     OpenRead       1     3     0     k(2,nil,nil)   02
8     Variable       1     4     0                    00
9     IsNull         4     19    0                    00
10    Affinity       4     1     0     D              00
11    SeekGE         1     19    4     1              00
12    IdxGT          1     19    4     1              00
13    IfPos          2     18    1                    00
14    Column         1     2     5                    00
15    Column         1     1     6                    00
16    ResultRow      5     2     0                    00
17    DecrJumpZero   1     19    0                    00
18    Next           1     12    0                    00
19    Close          1     0     0                    00
20    Halt           0     0     0                    00
21    Transaction    0     0     3     0              01
22    TableLock      0     3     0     dirents        00
23    Goto           0     1     0                    00
sqlite> EXPLAIN SELECT fileno, name FROM dirents WHERE dirno = ? ORDER BY name LIMIT -1 OFFSET ?;
addr  opcode         p1    p2    p3    p4             p5  comment
----  -------------  ----  ----  ----  -------------  --  -------------
0     Init           0     22    0                    00
1     Noop           0     0     0                    00
2     Integer        -1    1     0                    00
3     Variable       2     2     0                    00
4     MustBeInt      2     0     0                    00
5     SetIfNotPos    2     2     0                    00
6     Add            1     2     3                    00
7     SetIfNotPos    1     3     -1                   00
8     OpenRead       2     3     0     k(2,nil,nil)   02
9     Variable       1     4     0                    00
10    IsNull         4     20    0                    00
11    Affinity       4     1     0     D              00
12    SeekGE         2     20    4     1              00
13    IdxGT          2     20    4     1              00
14    IfPos          2     19    1                    00
15    Column         2     2     5                    00
16    Column         2     1     6                    00
17    ResultRow      5     2     0                    00
18    DecrJumpZero   1     20    0                    00
19    Next           2     13    0                    00
20    Close          2     0     0                    00
21    Halt           0     0     0                    00
22    Transaction    0     0     3     0              01
23    TableLock      0     3     0     dirents        00
24    Goto           0     1     0                    00
So I guess I'm going for that ORDER BY name clause.
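As a further untested sketch: since (dirno, name) is the primary key, the OFFSET could also be replaced entirely by a seek on the last name already returned (keyset pagination), which stays consistent even when the table is modified between readdir calls, assuming the last returned name can be recovered from the readdir state:
-- Resume strictly after the last name handed to the caller; the
-- (dirno, name) primary key turns this into a single B-tree seek.
SELECT fileno, name
FROM dirents
WHERE dirno = ? AND name > ?
ORDER BY name
LIMIT ?;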

Related

SQL: Subtracting certain rows with restrictions from a data table into a new table

I have a data table in PostgreSQL which has these columns and some rows like this:
st  epochnum  satnum  l1  l2  c1  p1  p2
--  --------  ------  --  --  --  --  --
1   1         1       10  11  12  13  14
1   1         2       15  16  17  18  19
1   2         1       20  21  22  23  24
1   2         2       25  26  27  28  29
20  1         1       30  41  52  63  74
20  1         2       75  76  87  88  null
20  2         1       ...
I want to get pairs of rows that have the same values for epochnum and satnum but different values in "st". I also have a list that specifies which "st" pairs should be subtracted; it's just another table that looks like this:
st1  st2
---  ---
1    20
The rows in the first table have to be subtracted in l1, l2, c1, p1 and p2 for matching epochnum and satnum according to this table, and then inserted into a new table like this:
epochnum  st1  st2  satnum  dl1  dl2  dc1  dp1  dp2
--------  ---  ---  ------  ---  ---  ---  ---  ---
1         1    20   1       20   30   40   50   60
1         1    20   2       65   65   75   75   null
...
The actual data has more than 400,000 rows sharing epochnums and satnums like this. I have tried Java programming in NetBeans, using loops that simply issue a query for each row to build the new table.
But I think this is inefficient and takes an unnecessarily long time due to the large number of queries that have to be issued from Java.
I wonder if there is a way this can be done using just a few queries, or by creating extra tables, and so on. I haven't come up with a good solution yet.
Are you looking for joins like this?
select t1.st, t1.epochnum, t1.satnum,
       (t2.l1 - t1.l1),
       (t2.l2 - t1.l2),
       (t2.c1 - t1.c1),
       (t2.p1 - t1.p1),
       (t2.p2 - t1.p2)
from t t1 join
     t t2
     on t1.epochnum = t2.epochnum and
        t1.satnum = t2.satnum join
     pairs p
     on t1.st = p.st1 and t2.st = p.st2
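To land the result in a new table with one set-based statement instead of per-row queries from Java, an untested sketch (the target name diffs and the source names t and pairs are assumed):
-- PostgreSQL: materialize all differences at once.
CREATE TABLE diffs AS
SELECT t1.epochnum, p.st1, p.st2, t1.satnum,
       t2.l1 - t1.l1 AS dl1,
       t2.l2 - t1.l2 AS dl2,
       t2.c1 - t1.c1 AS dc1,
       t2.p1 - t1.p1 AS dp1,
       t2.p2 - t1.p2 AS dp2
FROM pairs p
JOIN t t1 ON t1.st = p.st1
JOIN t t2 ON t2.st = p.st2
         AND t2.epochnum = t1.epochnum
         AND t2.satnum = t1.satnum;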

Assigning Score based on Order Sequence in pandas

Following are the dataframes I have
score_df
col1_id col2_id score
1 2 10
5 6 20
records_df
date col_id
D1 6
D2 4
D3 1
D4 2
D5 5
D6 7
I would like to compute a score based on the following criteria:
When 2 occurs after 1, the score should be 10; likewise, when 1 occurs after 2, the score should be 10.
I.e. when (1,2) gives a score of 10, (2,1) also gets the same score of 10.
Considering (1,2): when 1 occurs for the first time, we don't assign a score. We flag the row and wait for 2 to occur. When 2 occurs in the column, we give the score 10.
Considering (2,1): when 2 comes first, we assign the value 0 and wait for 1 to occur. When 1 occurs, we give the score 10.
So, on the first occurrence, don't assign the score; wait for the corresponding event to occur and then assign the score.
So, my result dataframe should look something like this
result
date col_id score
D1 6 0 -- even though 6 is in the score list, it occurred for the first time, so 0
D2 4 0 -- 4 is not even in the list
D3 1 0 -- 1 occurred for the first time, so 0
D4 2 10 -- 1 occurred previously and 2 occurs now, so we can assign 10
D5 5 20 -- 6 occurred previously, so we can assign 20
D6 7 0 -- 7 is not in the list
I have around 100k rows in both score_df and records_df. Looping and assigning the score is taking too long. Can someone help with logic that avoids looping over the entire dataframe?
From what I understand, you can try melt for unpivoting and then merge. Keeping the index from the melted df, we check where the index is duplicated, and then return the score from the merge, else 0.
# Note: this assumes both frames carry a shared 'uid' column (present in the
# answerer's data, though omitted from the question's samples).
m = score_df.reset_index().melt(['index', 'uid', 'score'],
                                var_name='col_name', value_name='col_id')
final = records_df.merge(m.drop(columns='col_name'),
                         on=['uid', 'col_id'], how='left')
# A pair scores only on its second occurrence: the melted row index repeats
# once both members of the pair have appeared.
c = final.duplicated(['index']) & final['index'].notna()
final = final.drop(columns='index').assign(score=lambda x: x['score'].where(c, 0))
print(final)
uid date col_id score
0 123 D1 6 0.0
1 123 D2 4 0.0
2 123 D3 1 0.0
3 123 D4 2 10.0
4 123 D5 5 20.0
5 123 D6 7 0.0
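For reference, a hypothetical reconstruction of the frames this answer assumes (both carry a shared uid column, which the question's samples omit), so the snippet above can be run as-is:
import pandas as pd

score_df = pd.DataFrame({'uid': [123, 123],
                         'col1_id': [1, 5],
                         'col2_id': [2, 6],
                         'score': [10, 20]})
records_df = pd.DataFrame({'uid': [123] * 6,
                           'date': ['D1', 'D2', 'D3', 'D4', 'D5', 'D6'],
                           'col_id': [6, 4, 1, 2, 5, 7]})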

Two Condition Where-clause SQL

I need to filter some rows when 2 conditions are met, without excluding the other rows.
Table:
idRow idMaster idList
1 10 45
2 10 46
3 10 47
4 11 10
5 11 98
6 14 56
7 16 28
8 20 55
Example:
When:
idMaster=10 and idList=45 (only show this combination for idMaster 10)
idMaster=11 and idList=98 (only show this combination for idMaster 11)
list all other rows as well.
Expected result:
idRow idMaster idList
1 10 45
5 11 98
6 14 56
7 16 28
8 20 55
Running SQL Server 2014
I tried combinations of CASE/IF, but all of them only return the rows with idMaster=10,11 and idList=45,98, excluding the other rows.
The following query logic will be applicable to any database:
SELECT *
FROM your_table
WHERE idMaster NOT IN (10, 11)
   OR (idMaster = 10 AND idList = 45)
   OR (idMaster = 11 AND idList = 98)
You can indeed do this with a (nested) case. Hopefully this helps you understand better.
case idMaster
when 10 then case idList when 45 then 1 end
when 11 then case idList when 98 then 1 end
else 1
end = 1
This might be the best though:
not (idMaster = 10 and idList <> 45 or idMaster = 11 and idList <> 98)
Overall it's usually beneficial to avoid repeating that list of values in multiple places. Both of these avoid the need to keep things in sync when changes come.
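As an untested sketch of that idea (table names made up), the special pairs can live in their own table so the literals appear only once:
-- keepOnly lists the only allowed idList for each constrained idMaster.
CREATE TABLE keepOnly (idMaster int, idList int);
INSERT INTO keepOnly VALUES (10, 45), (11, 98);

SELECT t.idRow, t.idMaster, t.idList
FROM your_table t
LEFT JOIN keepOnly k ON k.idMaster = t.idMaster
WHERE k.idMaster IS NULL    -- this idMaster is not constrained at all
   OR k.idList = t.idList;  -- constrained: keep only the listed pair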

SAS proc SQL: Conditionally summing and collapsing rows

In SAS, how do I conditionally collapse rows, substituting an explicitly named single row representing the sum of their values?
Specifically, I am looking to create a frequency table that displays the frequencies of the Subclass values, but only where that frequency is >9. In any other case (frequency <10), I want the frequency to count towards a sum of frequencies within the corresponding Class value. There are no missing or 0 values in the dataset.
Freq Class Subclass
---------------------
20 1 1a
20 1 1b
2 1 1c
2 1 1d
2 1 1e
1 1 1f
22 2 2a
6 2 2b
2 2 2c
1 2 2d
31 3 3a
17 3 3b
7 3 3c
3 3 3d
3 3 3e
My current approach has produced the first table using:
proc sql;
    create table freqs as  /* table name missing in the original; "freqs" assumed */
    select *, count(distinct subjectID) as count
    from DATASET1
    group by Subclass;
quit;
The desired result would look something like this:
Freq Class Subclass
---------------------
20 1 1a
20 1 1b
7 1 OTHER (1c, 1d, 1e, 1f)
22 2 2a
9 2 OTHER (2b, 2c, 2d)
31 3 3a
17 3 3b
13 3 OTHER (3c, 3d, 3e)
Preferably, I would additionally like to explicitly name the Subclass value that represents the summed measurement after the identifiers of the measurements it represents. In this example that would be the names of the subclasses that were summed up.
I have tried using the PROC MEANS procedure, which yields a new dataset of all Subclasses with frequency <10, rather than a sum value.
A data step is the way to go to get your preferred output, utilizing the first. and last. variables. This gives you the option to output values >9 as-is or to sum the other values within the same class.
The CALL CATX routine concatenates the subclass values, so you can see which ones make up the frequency.
data have;
input Freq Class Subclass $;
datalines;
20 1 1a
20 1 1b
2 1 1c
2 1 1d
2 1 1e
1 1 1f
22 2 2a
6 2 2b
2 2 2c
1 2 2d
31 3 3a
17 3 3b
7 3 3c
3 3 3d
3 3 3e
;
run;
data want;
    set have;
    by class;
    length subclass_groups $20 subclass_temp $20;
    retain subclass_temp;
    /* reset the running sum and the collected names for each class */
    if first.class then call missing(freq_temp, subclass_temp);
    if freq > 9 then do;
        subclass_groups = subclass;
        output;
    end;
    else do;
        freq_temp + freq;                        /* sum statement implicitly retains */
        call catx(',', subclass_temp, subclass); /* collect collapsed subclass names */
    end;
    /* emit one combined row per class with the accumulated sum and names */
    if last.class then do;
        freq = freq_temp;
        subclass_groups = subclass_temp;
        output;
    end;
    drop subclass subclass_temp freq_temp;
run;
The following code is untested, but it makes it easy to understand how to go about the issue using a UNION:
proc sql;
    create table want as  /* table name missing in the original; "want" assumed */
    select freq, class, subclass, count(subclass) as count
    from DATASET1
    where freq le 9
    group by Subclass
    union all
    select freq, class, subclass, count(class) as count
    from DATASET1
    where freq ge 10
    group by class;
quit;
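Along the same lines, an untested sketch that comes closer to the desired output (dataset and table names assumed; note that proc sql has no string aggregation, so the concatenated subclass names from the data step cannot be reproduced here):
proc sql;
    create table want2 as
    select freq, class, subclass
    from have
    where freq gt 9
    union all
    select sum(freq) as freq, class, 'OTHER' as subclass
    from have
    where freq le 9
    group by class
    order by class, freq desc;
quit;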

Why SELECT 0, ... instead of SELECT

Let's say I have a SQLite database that contains a table:
sqlite> create table person (id integer, firstname varchar, lastname varchar);
Now I want to get every entry which is in the table.
sqlite> select t0.id, t0.firstname, t0.lastname from person t0;
This works fine and this it what I would use. However I have worked with a framework from Apple (Core Data) that generates SQL. This framework generates a slightly different SQL query:
sqlite> select 0, t0.id, t0.firstname, t0.lastname from person t0;
Every SQL query generated by this framework begins with "select 0,". Why is that?
I tried to use the EXPLAIN command to see what's going on, but the output was inconclusive - at least to me.
sqlite> explain select t0.id, t0.firstname, t0.lastname from person t0;
addr  opcode        p1  p2  p3  p4      p5  comment
----  ------------  --  --  --  ------  --  -------
0     Trace         0   0   0           00  NULL
1     Goto          0   11  0           00  NULL
2     OpenRead      0   2   0   3       00  NULL
3     Rewind        0   9   0           00  NULL
4     Column        0   0   1           00  NULL
5     Column        0   1   2           00  NULL
6     Column        0   2   3           00  NULL
7     ResultRow     1   3   0           00  NULL
8     Next          0   4   0           01  NULL
9     Close         0   0   0           00  NULL
10    Halt          0   0   0           00  NULL
11    Transaction   0   0   0           00  NULL
12    VerifyCookie  0   1   0           00  NULL
13    TableLock     0   2   0   person  00  NULL
14    Goto          0   2   0           00  NULL
And the table for the second query looks like this:
sqlite> explain select 0, t0.id, t0.firstname, t0.lastname from person t0;
addr  opcode        p1  p2  p3  p4      p5  comment
----  ------------  --  --  --  ------  --  -------
0     Trace         0   0   0           00  NULL
1     Goto          0   12  0           00  NULL
2     OpenRead      0   2   0   3       00  NULL
3     Rewind        0   10  0           00  NULL
4     Integer       0   1   0           00  NULL
5     Column        0   0   2           00  NULL
6     Column        0   1   3           00  NULL
7     Column        0   2   4           00  NULL
8     ResultRow     1   4   0           00  NULL
9     Next          0   4   0           01  NULL
10    Close         0   0   0           00  NULL
11    Halt          0   0   0           00  NULL
12    Transaction   0   0   0           00  NULL
13    VerifyCookie  0   1   0           00  NULL
14    TableLock     0   2   0   person  00  NULL
15    Goto          0   2   0           00  NULL
Some frameworks do this in order to tell, without any doubt, whether a row from that table was returned.
Consider
A            B
+---+        +---+------+
| a |        | a | b    |
+---+        +---+------+
| 0 |        | 0 | 1    |
+---+        +---+------+
| 1 |        | 1 | NULL |
+---+        +---+------+
| 2 |
+---+
SELECT A.a, B.b
FROM A
LEFT JOIN B
ON B.a = A.a
Results
+---+------+
| a | b    |
+---+------+
| 0 | 1    |
+---+------+
| 1 | NULL |
+---+------+
| 2 | NULL |
+---+------+
In this result set, it is not possible to see that a = 1 exists in table B, but a = 2 does not. To get that information, you need to select a non-nullable expression from table b, and the simplest way to do that is to select a simple constant value.
SELECT A.a, B.x, B.b
FROM A
LEFT JOIN (SELECT 0 AS x, B.a, B.b FROM B) AS B
ON B.a = A.a
Results
+---+------+------+
| a | x    | b    |
+---+------+------+
| 0 | 0    | 1    |
+---+------+------+
| 1 | 0    | NULL |
+---+------+------+
| 2 | NULL | NULL |
+---+------+------+
There are a lot of situations where these constant values are not strictly required, for example when you have no joins, or when you could choose a non-nullable column from b instead, but they don't cause any harm either, so they can just be included unconditionally.
When I have code to dynamically generate a WHERE clause, I usually start the clause with a:
WHERE 1 = 1
Then the loop to add additional conditions always adds each condition in the same format:
AND x = y
without having to put conditional logic in place to check whether this is the first condition or not: "if this is the first condition then start with the WHERE keyword, else add the AND keyword."
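For example, a minimal sketch of such a builder (names hypothetical):
// Starting from "WHERE 1 = 1" lets every condition be appended uniformly.
StringBuilder sql = new StringBuilder("SELECT * FROM person WHERE 1 = 1");
for (String condition : conditions) {
    sql.append(" AND ").append(condition);
}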
So I can imagine a framework doing this for similar reasons. If you start the statement with a SELECT 0 then the code to add subsequent columns can be in a loop without any conditional statements. Just add , colx each time without any conditional checking along the lines of "if this is the first column, don't put a comma before the column name, otherwise do".
Example pseudo code:
String query = "SELECT 0";
for (Column col : columnList)
    query += ", " + col;
Only Apple knows … but I see two possibilities:
Inserting a dummy column ensures that the actual output columns are numbered beginning with 1, not 0. If some existing interface already assumed one-based numbering, doing it this way in the SQL backend might have been the easiest solution.
If you make a query for multiple objects using multiple subqueries, a value like this could be used to determine from which subquery a record originates:
SELECT 0, t0.firstname, ... FROM PERSON t0 WHERE t0.id = 123
UNION ALL
SELECT 1, t0.firstname, ... FROM PERSON t0 WHERE t0.id = 456
(I don't know if Core Data actually does this.)
Your EXPLAIN output shows that the only difference is (at address 4) that the second program sets the extra column to zero, so there is only a minimal performance difference.