Simple SELECT/WHERE query slow. Should I index a BIT field? [duplicate] - sql

This question already has answers here:
Should I index a bit field in SQL Server?
(18 answers)
Closed 8 years ago.
The following query takes 20-25 seconds to run
SELECT * from Books WHERE IsPaperback = 1
where IsPaperback is a BIT field. There are about 500k rows, and about 400k currently have this field set.
I also have a BIT field called IsBundle and only 900 records have this set. Again, execution time is about 20-25 seconds.
How can I speed up such a simple query?

Indexing a bit column splits the index into two parts, true and false. If the data is split 50/50, the gain will be modest. When it is 90/10 and you query the 10% side, yes, it will make a difference.
You should first narrow down your result set column-wise. Then, if you see you only need a few columns and you execute this query a lot, you could even include those few fields in the index. Then there is no need for a lookup in the table itself.
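For example, a covering index along these lines would let the engine answer the paperback query from the index alone (a sketch only; the included column names are placeholders, not from the original schema):
-- Hypothetical covering index: the key is the BIT flag, and INCLUDE keeps the
-- few columns the query needs in the index leaf, avoiding key lookups.
CREATE NONCLUSTERED INDEX IX_Books_IsPaperback
    ON dbo.Books (IsPaperback)
    INCLUDE (Title, Author, Price);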

First of all, I would explicitly call out the columns:
select
field1
, field2
, field3
from books
where IsPaperback = 1;
This seems like a small thing, but when you use the star (*) for column selection, the DB has to look up the column names before actually performing the call.
Do you have an index on IsPaperback? That would impact the above query more than having an index on IsBundle.
If you had a condition of IsBundle = 1, then I would think there would be a need for an index on that field.
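Since only about 900 of the ~500k rows have IsBundle set, a filtered index is one option worth trying on SQL Server 2008 and later (a sketch only; the table name dbo.Books is assumed):
-- Hypothetical filtered index: it stores only the rows where IsBundle = 1,
-- so a query with that predicate touches a tiny index instead of the whole table.
CREATE NONCLUSTERED INDEX IX_Books_IsBundle
    ON dbo.Books (IsBundle)
    WHERE IsBundle = 1;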

Add an Index for IsPaperback
Try making it an int, or tinyint. The latest processors actually process 32 bit words faster than bytes.
This query should take no more than a couple of milliseconds.
You should not have separate columns for IsPaperback and IsBundle. It should be a Type column where Paperback and Bundle are the values.
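If you go that route, a minimal sketch of the consolidated column might look like this (the column name, codes, and index name are illustrative, not from the original schema):
-- Hypothetical replacement for the separate BIT flags: one small type column.
ALTER TABLE dbo.Books ADD BookType tinyint NOT NULL DEFAULT 0; -- e.g. 1 = paperback, 2 = bundle
CREATE NONCLUSTERED INDEX IX_Books_BookType ON dbo.Books (BookType);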
Before the query, set profiling on:
SET profiling = 1
After the query, show the profiles:
SHOW PROFILES
It seems there are some out there who do not believe this query should take only a few milliseconds. For those who downvoted this answer without understanding it: what I said is true.
I found a table "cities" with 332,127 Records
In this table Russia has 929 cities
These benchmarks were performed on a GoDaddy server, IP 50.63.0.80
This is a GoDaddy Virtual Dedicated Server
On average I find sites hosted on GoDaddy to have the worst performance.
$time = microtime(true);
$results = mysql_query("SELECT * FROM `cities` WHERE `country` LIKE 'RS'");
echo "\n" . number_format(microtime(true)-$time,6)."\n";
$time = microtime(true);
while ($row = mysql_fetch_array($results, MYSQL_NUM)){$r[]=$row;}
echo "\n" . number_format(microtime(true)-$time,6);
Results:
With Index: 2.9 ms
0.002947 Seconds : $results = mysql_query("SELECT * FROM `cities` WHERE `country` LIKE 'RS'");
0.000081 Seconds : while ($row = mysql_fetch_array($results, MYSQL_NUM)){$r[]=$row;}
Without Index: 93 ms
0.093939 Seconds : $results = mysql_query("SELECT * FROM `cities` WHERE `country` LIKE 'RS'");
0.000073 Seconds : while ($row = mysql_fetch_array($results, MYSQL_NUM)){$r[]=$row;}
Then in phpMyAdmin Profiling:
SET PROFILING = ON;
SELECT * FROM `cities` WHERE `country` LIKE 'RS';
SHOW PROFILE;
Result:
Execution of the Query took 0.0000003 seconds
starting 0.000020
checking permissions 0.000004
Opening tables 0.000006
init 0.000007
optimizing 0.000003
executing 0.000003 ******
end 0.000004
query end 0.000003
closing tables 0.000003
freeing items 0.000010
logging slow query 0.000003
cleaning up 0.000003
Without Index
Execution of the Query took 0.0000012 seconds
starting 0.000046
checking permissions 0.000006
Opening tables 0.000010
init 0.000021
optimizing 0.000006
executing 0.000012 ******
end 0.000003
query end 0.000004
closing tables 0.000003
freeing items 0.000017
logging slow query 0.000004
cleaning up 0.000003
In phpMyAdmin, doing a Search with Profiling turned on:
GoDaddy Server Sending Data 92.6 ms
SELECT * FROM `cities` WHERE `country` LIKE 'RS' LIMIT 1000
Showing rows 0 - 928 (929 total, Query took 0.0907 sec)
Profiling Results:
Starting 52 µs
Checking Permissions 7 µs
Opening Tables 23 µs
System Lock 12 µs
Init 34 µs
optimizing 10 µs
Statistics 23 µs
Preparing 17 µs
Executing 4 µs
Sending Data 92.6 ms
End 18 µs
Query End 4 µs
Closing Tables 15 µs
Freeing Items 27 µs
Logging Slow Query 4 µs
Cleaning Up 5 µs
In phpMyAdmin, doing a Search with Profiling turned on:
On my server, Sending Data: 1.8 ms
SELECT * FROM `cities` WHERE `country` LIKE 'RS' LIMIT 1000
Showing rows 0 - 928 (929 total, Query took 0.0022 sec)
Starting 27 µs
Checking Permissions 5 µs
Opening Tables 11 µs
System Lock 7 µs
Init 14 µs
Optimizing 5 µs
Statistics 43 µs
Preparing 6 µs
Executing 2 µs
Sending Data 1.8 ms
End 5 µs
Query End 3 µs
Closing Tables 5 µs
Freeing Items 13 µs
Logging Slow Query 2 µs
Cleaning Up 2 µs
Just to show the importance of an index: over 400x improvement.
A table with 5,480,942 Records and a Query that Returns 899 Rows
$time = microtime(true);
$results = mysql_query("SELECT * FROM `ipLocations` WHERE `id` = 33644");
echo "\n" . number_format(microtime(true)-$time,6);
$time = microtime(true);
while ($row = mysql_fetch_array($results, MYSQL_NUM)){$r[]=$row;}
echo "\n" . number_format(microtime(true)-$time,6);
No index
0.402005
0.001264
With Index (426x Faster)
0.001716
0.001962
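For reference, the "With Index" timings above presumably rely on ordinary single-column indexes on the filtered columns, something along these lines (index names are illustrative):
-- Hypothetical index definitions behind the 'With Index' runs.
CREATE INDEX idx_cities_country ON cities (country);
CREATE INDEX idx_iplocations_id ON ipLocations (id);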

Related

Creating a Nested/Loop Calculation in Vertica (?)

So maybe I'm just way over-thinking things, but is there any way to replicate a nested/loop calculation in Vertica with just SQL syntax?
Explanation -
In column AP I have remaining values per month by an attribute key; in column CHANGE_1M I have an attribution value to apply.
The goal is, for future periods, to take the preceding row's AP within the partition, multiply it by the current row's CHANGE_1M, and add it back to the preceding AP, filling in the future AP values row by row.
For reference I have 15,000 Keys Per Period and 60 Periods Per Year in the full-data set.
Sample Calculation
Period 5 =
(Period4_AP * Period5_CHANGE_1M)+Period4_AP
Period 6 =
(((Period4_AP * Period5_CHANGE_1M)+Period4_AP)*Period6_CHANGE_1M)
+
((Period4_AP * Period5_CHANGE_1M)+Period4_AP)
etc.
Sample Data on Top
Expected Results below
Vertica does not (yet?) have the RECURSIVE WITH clause, which you would need for the recursive calculation you seem to need here.
The only possible workaround would be tedious: write (or generate, using Perl or Python, for example) as many nested queries as you need iterations.
I'll only detail this if you want to go down that path.
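To illustrate the idea before committing to it, a hand-unrolled sketch for two iterations might look like this (table and column names are assumptions based on the question; each additional period adds another level of nesting):
-- Hypothetical unrolled recursion: the inner query derives period 5 from period 4,
-- and the outer query derives period 6 from that result.
SELECT p5.attribute_key,
       (p5.ap * p6.change_1m) + p5.ap AS ap_period6
FROM (
    SELECT t4.attribute_key,
           (t4.ap * t5.change_1m) + t4.ap AS ap
    FROM periods t4
    JOIN periods t5
      ON t5.attribute_key = t4.attribute_key
    WHERE t4.period = 4
      AND t5.period = 5
) p5
JOIN periods p6
  ON p6.attribute_key = p5.attribute_key
WHERE p6.period = 6;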
Long time no see - I should have returned to answer this question earlier.
I got so stuck on thinking of a programmatic way to solve this issue that I forgot it is a math equation, and where you have math functions you have solutions.
Basically, this question revolves around doing a running multiplication down the table.
The solution is simply to use the LOG/LN functions to multiply, then convert back using EXP.
A snippet of the simple solution:
Hope this helps other lost souls, don't forget your math background and spiral into a whirlpool of self-defeat.
EXP(SUM(LN(DEGREDATION)) OVER (ORDER BY PERIOD_NUMBER ASC ROWS UNBOUNDED PRECEDING)) AS DEGREDATION_RATE
** Controlled by whatever factors/attributes you need the data stratified by, using a PARTITION BY clause.
Basically, instead of starting at the retention PX/P0, I back into it with the degradation P1/P0, P2/P1, etc.
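Put together, a hedged sketch of the full query might look like the following (the table name and partition key are assumptions; only the EXP(SUM(LN(...))) construct comes from the snippet above):
-- Running product via logarithms: summing LN() over the ordered window and
-- taking EXP() of that sum multiplies the DEGREDATION values down the table.
SELECT
    period_number,
    degredation,
    EXP(SUM(LN(degredation)) OVER (
        PARTITION BY attribute_key
        ORDER BY period_number ASC
        ROWS UNBOUNDED PRECEDING
    )) AS degredation_rate
FROM retention_data;
The table below shows the resulting rate per period.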
PERIOD_NUMBER   DEGRADATION   DEGREDATION_RATE   DEGREDATION_RATE x 100000
0               100.00%       100.00%            100000.00
1               57.72%        57.72%             57715.18
2               60.71%        35.04%             35036.59
3               70.84%        24.82%             24820.66
4               76.59%        19.01%             19009.17
5               79.29%        15.07%             15071.79
6               83.27%        12.55%             12550.59
7               82.08%        10.30%             10301.94
8               86.49%        8.91%              8910.59
9               89.60%        7.98%              7984.24
10              86.03%        6.87%              6868.79
11              86.00%        5.91%              5907.16
12              90.52%        5.35%              5347.00
13              91.89%        4.91%              4913.46
14              89.86%        4.41%              4414.99
15              91.96%        4.06%              4060.22
16              89.36%        3.63%              3628.28
17              90.63%        3.29%              3288.13
18              92.45%        3.04%              3039.97
19              94.95%        2.89%              2886.43
20              92.31%        2.66%              2664.40
21              92.11%        2.45%              2454.05
22              93.94%        2.31%              2305.32
23              89.66%        2.07%              2066.84
24              94.12%        1.95%              1945.26
25              95.83%        1.86%              1864.21
26              92.31%        1.72%              1720.81
27              96.97%        1.67%              1668.66
28              90.32%        1.51%              1507.18
29              90.00%        1.36%              1356.46
30              94.44%        1.28%              1281.10
31              94.12%        1.21%              1205.74
32              100.00%       1.21%              1205.74
33              90.91%        1.10%              1096.13
34              90.00%        0.99%              986.52
35              94.44%        0.93%              931.71
36              100.00%       0.93%              931.71

read multiple table and write with chunk data to postgres database in pandas

I am reading multiple tables from a Postgres database and writing them to another database through pandas, as shown below, but every time it starts from the first row of the chunk and writes that to the database. I want to write the complete first table, then the second, and so on.
code:
chunksize=30
offset=0
j=0
print('now Connected 1st')
conn1=psql.connect()
conn2=dvd.connect()
print('now Connected 2nd')
table=['all_m_splty','all_m_lgd_states']
for i in table:
    sql = "Select * from sha.%s Limit %d offset %d" % (i, chunksize, offset)
    print('now reading...')
    while True:
        for df in pd.read_sql_query(sql, conn1, chunksize=chunksize):
            print('now writting...')
            df.to_sql(name=i, con=conn2, if_exists='replace', index=False)
            offset += chunksize
            j += df.index[-1] + 1
            print('Total no. of Rows inserted {} in {}'.format(j, i))
            while df.index[-1] + 1 < chunksize:
                print('next')
                break
print("Main DisConnected!")
conn.close()
conn2.close()
output:
now Connected 1st
now Connected 2nd
hello! now reading start
now reading...
now writting...
Total no. of Rows inserted 30 in all_m_splty
now writting...
Total no. of Rows inserted 60 in all_m_splty
now writting...
Total no. of Rows inserted 90 in all_m_splty
now writting...
Total no. of Rows inserted 120 in all_m_splty

sqldf R Error create a table

I am doing some experiments with SQL in R using the sqldf package.
I am trying to test some commands to check the output, in particular I am trying to create tables.
Here the code:
sqldf("CREATE TABLE tbl1 AS
SELECT cut
FROM diamonds")
Very simple code, however I get this error
sqldf("CREATE TABLE tbl1 AS
+ SELECT cut
+ FROM diamonds")
data frame with 0 columns and 0 rows
Warning message:
In result_fetch(res@ptr, n = n) :
Don't need to call dbFetch() for statements, only for queries
Why is it saying that the table created has 0 columns and 0 rows?
Can someone help?
That is a warning, not an error. The warning is caused by a backward incompatibility in recent versions of RSQLite. You can ignore it since it works anyway.
The sqldf statement that is shown in the question
creates an empty database
uploads the diamonds data frame to a table of the same name in that database
runs the create statement which creates a second table tbl1 in the database
returns nothing (actually a 0 column 0 row data frame) since a create statement has no value
destroys the database
When using sqldf you don't need create statements. It automatically creates a table in the backend database for any data frame referenced in your sql statement so the following sqldf statement
sqldf("select * from diamonds")
will
create an empty database
upload diamonds to it
run the select statement
return the result of the select statement as a data frame
destroy the database
You can use the verbose=TRUE argument to see the individual calls to the lower level RSQLite (or other backend database if you specify a different backend):
sqldf("select * from diamonds limit 3", verbose = TRUE)
giving:
sqldf: library(RSQLite)
sqldf: m <- dbDriver("SQLite")
sqldf: connection <- dbConnect(m, dbname = ":memory:")
sqldf: initExtension(connection)
sqldf: dbWriteTable(connection, 'diamonds', diamonds, row.names = FALSE)
sqldf: dbGetQuery(connection, 'select * from diamonds limit 3')
sqldf: dbDisconnect(connection)
carat cut color clarity depth table price x y z
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
I suggest you thoroughly review help("sqldf") as well as the info on the sqldf GitHub home page.

Which statistics are calculated faster in SAS proc summary?

I need a theoretical answer.
Imagine that you have a table with 1.5 billion rows (the table is created as column-based with DB2-Blu).
You are using SAS, and you will compute some statistics with PROC SUMMARY, such as min/max/mean values, the standard deviation, and the 10th and 90th percentiles, across your peer-groups.
For instance, you have 30,000 peer-groups and 50,000 values in each peer-group (1.5 billion values in total).
In the other case you have 3 million peer-groups with 50 values in each peer-group, so again 1.5 billion values in total.
Would it go faster with fewer peer-groups but more values in each peer-group, or with more peer-groups but fewer values in each peer-group?
I could test the first case (30,000 peer-groups and 50,000 values per peer-group) and it took around 16 minutes. But I can't test the second case.
Can you give an approximate prognosis for the run time in the case where I have 3 million peer-groups with 50 values in each peer-group?
One more dimension for the question. Would it be faster to do those statistics if I use Proc SQL instead?
Example code is below:
proc summary data = table_blu missing chartype;
class var1 var2; /* var1 and var2 together form the peer-group */
var values;
output out = stattable(rename = (_type_ = type) drop = _freq_)
n=n min=min max=max mean=mean std=std q1=q1 q3=q3 p10=p10 p90=p90 p95=p95
;
run;
So there are a number of things to think about here.
The first point, and quite possibly the largest in terms of performance, is getting the data from DB2 into SAS. (I'm assuming this is not an in-database instance of SAS -- correct me if it is.) That's a big table, and moving it across the wire takes time. Because of that, if you can calculate all these statistics inside DB2 with an SQL statement, that will probably be your fastest option.
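As a rough illustration of that in-database option, the aggregation might look something like this sketch (PERCENTILE_CONT requires a fairly recent DB2 release, and value_col stands in for the values variable from the question's PROC SUMMARY code):
-- Hypothetical in-database equivalent of the PROC SUMMARY statistics per peer-group.
SELECT
    var1,
    var2,
    COUNT(*)          AS n,
    MIN(value_col)    AS min_val,
    MAX(value_col)    AS max_val,
    AVG(value_col)    AS mean_val,
    STDDEV(value_col) AS std_val,
    PERCENTILE_CONT(0.10) WITHIN GROUP (ORDER BY value_col) AS p10,
    PERCENTILE_CONT(0.90) WITHIN GROUP (ORDER BY value_col) AS p90
FROM table_blu
GROUP BY var1, var2;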
So assuming you've downloaded the table to the SAS server:
A table sorted by the CLASS variables will be MUCH faster to process than an unsorted table. If SAS knows the table is sorted, it doesn't have to scan the table for records to go into a group; it can do block reads instead of random IO.
If the table is not sorted, then the larger the number of groups, the more table scans have to occur.
The point is, the speed of getting data from the HD to the CPU will be paramount in an unsorted process.
From there, you get into a memory and CPU issue. PROC SUMMARY is multithreaded, and SAS will read N groups at a time. If the group size can fit into the memory allocated for that thread, you won't have an issue. If the group size is too large, then SAS will have to page.
I scaled down the problem to a 15M row example:
%let grps=3000;
%let pergrp=5000;
UNSORTED:
NOTE: There were 15000000 observations read from the data set
WORK.TEST.
NOTE: The data set WORK.SUMMARY has 3001 observations and 9
variables.
NOTE: PROCEDURE SUMMARY used (Total process time):
real time 20.88 seconds
cpu time 31.71 seconds
SORTED:
NOTE: There were 15000000 observations read from the data set
WORK.TEST.
NOTE: The data set WORK.SUMMARY has 3001 observations and 9
variables.
NOTE: PROCEDURE SUMMARY used (Total process time):
real time 5.44 seconds
cpu time 11.26 seconds
=============================
%let grps=300000;
%let pergrp=50;
UNSORTED:
NOTE: There were 15000000 observations read from the data set
WORK.TEST.
NOTE: The data set WORK.SUMMARY has 300001 observations and 9
variables.
NOTE: PROCEDURE SUMMARY used (Total process time):
real time 19.26 seconds
cpu time 41.35 seconds
SORTED:
NOTE: There were 15000000 observations read from the data set
WORK.TEST.
NOTE: The data set WORK.SUMMARY has 300001 observations and 9
variables.
NOTE: PROCEDURE SUMMARY used (Total process time):
real time 5.43 seconds
cpu time 10.09 seconds
I ran these a few times and the run times were similar. Sorted times are about equal and way faster.
The more-groups / fewer-per-group case was faster unsorted, but look at the total CPU usage: it is higher. My laptop has an extremely fast SSD, so IO was probably not the limiting factor -- the drive was able to keep up with the multi-core CPU's demands. On a system with a slower HD, the total run times could be different.
In the end, it depends too much on how the data is structured and the specifics of your server and DB.
Not a theoretical answer but still relevant IMO...
To speed up your proc summary on large tables add the / groupinternal option to your class statement. Of course, assuming you don't want the variables formatted prior to being grouped.
e.g:
class age / groupinternal;
This tells SAS that it doesn't need to apply a format to the value prior to calculating what class to group the value into. Every value will have a format applied to it even if you have not specified one explicitly. This doesn't make a large difference on small tables, but on large tables it can.
From this simple test, it reduces the time from 60 seconds on my machine to 40 seconds (YMMV):
data test;
set sashelp.class;
do i = 1 to 10000000;
output;
end;
run;
proc summary data=test noprint nway missing;
class age / groupinternal;
var height;
output out=smry mean=;
run;

SQL Server 2008, how much space does this occupy?

I am trying to calculate how much space (MB) this would occupy. In the database table there are 7 bit columns, 2 tinyint columns and 1 GUID.
I am trying to calculate the amount that 16,000 rows would occupy.
My line of thought was that the 7 bit columns consume 1 byte, the 2 tinyints consume 2 bytes, and a GUID consumes 16 bytes, for a total of 19 bytes per row. That would mean 304,000 bytes for 16,000 rows, or roughly 0.3 MB. Is that correct? Is there a metadata byte as well?
There are several estimators out there which take away the donkey work.
You have to take into account the NULL bitmap, which will be 3 bytes in this case, plus the number of rows per page, the row header, row versioning info, pointers, and all the stuff here:
Inside the Storage Engine: Anatomy of a record
Edit:
Your 19 bytes of actual data
has 11 bytes overhead
total 30 bytes per row
around 269 rows per page (8096 / 30)
requires 60 pages (16000 / 269)
around 490k space (60 x 8192)
a few KB for the index structure of the primary key
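To sanity-check an estimate like this against the actual allocation, SQL Server's built-in sp_spaceused report is handy (the table name is a placeholder):
-- Reports reserved, data, and index space for the table once the 16,000 rows are loaded.
EXEC sp_spaceused N'dbo.MyTable';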