Updating a column from values - SQL

I want to update a column col_123 in table TT from a list of values where some criteria are met.
The following is a piece of my code with two value rows; in my actual code there are thousands of value rows.
UPDATE TT
SET col_123 = T2.score
FROM
(values ('1007163',2016,3,80.09), ('1034758',2013,4,68.85)) T2(person_id_t2, id_yr_t2, id_qtr_t2, score)
WHERE person_id = T2.person_id_t2 AND id_yr = T2.id_yr_t2 AND id_qtr = T2.id_qtr_t2;
But even with these two rows, it takes forever to update the table. What am I doing wrong?
Here is the output with EXPLAIN ANALYZE:
Update (slice0; segments: 56) (rows=1 width=3903)
-> Hash Join (cost=0.06..750889.50 rows=1 width=3903)
Hash Cond: TT.person_id::text = "*VALUES*".column1 AND TT.id_yr = "*VALUES*".column2::numeric AND TT.id_qtr = "*VALUES*".column3
Rows out: Avg 1.0 rows x 2 workers. Max 1 rows (seg29) with 236406 ms to first row, 236407 ms to end, start offset by 370 ms.
Executor memory: 1K bytes avg, 1K bytes max (seg0).
Work_mem used: 1K bytes avg, 1K bytes max (seg0). Workfile: (0 spilling, 0 reused)
(seg29) Hash chain length 1.0 avg, 1 max, using 2 of 262151 buckets.
-> Seq Scan on seamless_health_index (cost=0.00..466843.92 rows=676299 width=3871)
Rows out: Avg 676405.3 rows x 56 workers. Max 678281 rows (seg27) with 0.524 ms to first row, 243299 ms to end, start offset by 369 ms.
-> Hash (cost=0.03..0.03 rows=1 width=72)
Rows in: Avg 2.0 rows x 56 workers. Max 2 rows (seg0) with 0.080 ms to end, start offset by 375 ms.
-> Values Scan on "*VALUES*" (cost=0.00..0.03 rows=1 width=72)
Rows out: Avg 2.0 rows x 56 workers. Max 2 rows (seg0) with 0.017 ms to first row, 0.020 ms to end, start offset by 375 ms.
Slice statistics:
(slice0) Executor memory: 5769K bytes avg x 56 workers, 5769K bytes max (seg0). Work_mem: 1K bytes max.
Statement statistics:
Memory used: 128000K bytes
Settings: from_collapse_limit=16; join_collapse_limit=16
Total runtime: 308388.391 ms
Thanks!
Note: the table TT has about 40,000,000 rows and 1,000 columns, but only two rows (and only col_123) should be updated.

Create an index on TT (person_id::text, id_yr, id_qtr).
Then a nested loop join can be used, which should find the matching rows faster.
You don't have to include all three columns in the index, only those for which the join condition is selective.

Related

access scalar in dataframe in each iterate loop

I have a dataframe of a cryptoCoin in the format of:
time open high low close volume TM
0 1618617600000 61342.7 61730.9 61268.7 61648.8 82.523952 5
1 1618618500000 61648.9 61695.3 61188.4 61333.2 72.375605 5
2 1618619400000 61333.1 61396.4 61144.2 61200.0 52.882392 5
3 1618620300000 61200.0 61509.4 61199.9 61446.2 48.429485 5
4 1618621200000 61446.2 61764.7 61446.2 61647.4 83.822974 5
... ... ... ... ... ... ... ..
19213 1635909300000 63006.2 63087.2 62935.0 63081.9 35.265568 26
19214 1635910200000 63081.9 63214.5 62950.1 63084.0 41.213263 30
19215 1635911100000 63084.0 63236.0 63027.6 63213.9 32.429295 21
19216 1635912000000 63213.8 63213.8 63021.5 63024.1 47.032509 19
19217 1635912900000 63024.1 63091.4 62852.1 62970.7 84.098123 16
I want to calculate the moving average of the close price with a varying time period; the time period comes from the TM column. I will use the talib/ta library. Efficiency is necessary, so I tried apply and np.where:
dataframe['DMA'] = dataframe.apply(lambda x: ta.MA(dataframe['close'], timeperiod=dataframe['TM']), axis=0)
and
dataframe['DMA'] = np.where(dataframe['TM'].values , ta.MA(dataframe['close'], timeperiod=dataframe['TM'].values), )
both return error:
TypeError: only size-1 arrays can be converted to Python scalars
which I believe comes from the timeperiod=dataframe['TM'].values part. If I use dataframe['TM'].values[0], only the first value, which is 5, is applied to every row. How can I access the scalar value of the TM cell for each row in a vectorized way, without iterating over the index or using a for loop?
My desired output:
The output dataframe has another column at the end, named DMA, and the last 3 rows should be like:
............... DMA
19215 ..... ta.MA(dataframe['close'], timeperiod = 21)
19216 ..... ta.MA(dataframe['close'], timeperiod = 19)
19217 ..... ta.MA(dataframe['close'], timeperiod = 16)
in index 19215 I want to calculate the moving average of the last 21 close prices
in index 19216 I want to calculate the moving average of the last 19 close prices
in index 19217 I want to calculate the moving average of the last 16 close prices
Appreciate your time.
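For illustration only (this is not part of the original question), here is one way the desired DMA column could be computed without calling ta.MA once per row, assuming the default talib matype (a simple moving average), so that the value at row i is simply the mean of the last TM[i] close prices. The dataframe and column names follow the example above; variable_period_sma is a hypothetical helper based on prefix sums:
import numpy as np

def variable_period_sma(close, periods):
    # Row i gets the mean of the last periods[i] close prices (NaN while
    # fewer than periods[i] values exist), which is what
    # ta.MA(close, timeperiod=periods[i]) yields at row i for matype=0.
    close = np.asarray(close, dtype=float)
    periods = np.asarray(periods, dtype=int)
    n = len(close)
    csum = np.concatenate(([0.0], np.cumsum(close)))  # prefix sums of close
    idx = np.arange(n)
    out = np.full(n, np.nan)
    valid = idx + 1 >= periods                        # enough history available
    out[valid] = (csum[idx[valid] + 1]
                  - csum[idx[valid] + 1 - periods[valid]]) / periods[valid]
    return out

dataframe['DMA'] = variable_period_sma(dataframe['close'].values, dataframe['TM'].values)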

Redis `SCAN`: how to maintain a balance between incoming keys that might match and ensuring an eventual result in a reasonable time?

I am not that familiar with Redis. At the moment I am designing a realtime service and I'd like to rely on it. I expect ~10,000-50,000 keys per minute to be SET with some reasonable EX, and to match over them using SCAN rarely enough not to worry about performance bottlenecks.
What I doubt is the "in/out rate": the keyspace could be flooded with keys that match some SCAN query, so that the scan never terminates (i.e. it always replies with the latest cursor position and forces you to continue; that could easily happen if one consumes x items per second while x + y items per second are coming in, with y > 0).
Obviously, I could set the SCAN COUNT large enough; but I wonder whether there is a better solution, or whether Redis itself guarantees that SCAN will grow the COUNT automatically in such a case.
First some context, solution at the end:
From SCAN command > Guarantee of termination
The SCAN algorithm is guaranteed to terminate only if the size of the
iterated collection remains bounded to a given maximum size, otherwise
iterating a collection that always grows may result into SCAN to never
terminate a full iteration.
This is easy to see intuitively: if the collection grows there is more
and more work to do in order to visit all the possible elements, and
the ability to terminate the iteration depends on the number of calls
to SCAN and its COUNT option value compared with the rate at which the
collection grows.
But in The COUNT option it says:
Important: there is no need to use the same COUNT value for every
iteration. The caller is free to change the count from one iteration
to the other as required, as long as the cursor passed in the next
call is the one obtained in the previous call to the command.
Important to keep in mind, from Scan guarantees:
A given element may be returned multiple times. It is up to the
application to handle the case of duplicated elements, for example
only using the returned elements in order to perform operations that
are safe when re-applied multiple times.
Elements that were not
constantly present in the collection during a full iteration, may be
returned or not: it is undefined.
The key to a solution is in the cursor itself. See Making sense of Redis’ SCAN cursor. It is possible to deduce the percentage of progress of your scan because the cursor is really the bit-reversed index into the hash table.
Using DBSIZE or INFO keyspace command you can get how many keys you have at any time:
> DBSIZE
(integer) 200032
> info keyspace
# Keyspace
db0:keys=200032,expires=0,avg_ttl=0
Another source of information is the undocumented DEBUG HTSTATS <db index> command, just to get a feeling:
> DEBUG htstats 0
[Dictionary HT]
Hash table 0 stats (main hash table):
table size: 262144
number of elements: 200032
different slots: 139805
max chain length: 8
avg chain length (counted): 1.43
avg chain length (computed): 1.43
Chain length distribution:
0: 122339 (46.67%)
1: 93163 (35.54%)
2: 35502 (13.54%)
3: 9071 (3.46%)
4: 1754 (0.67%)
5: 264 (0.10%)
6: 43 (0.02%)
7: 6 (0.00%)
8: 2 (0.00%)
[Expires HT]
No stats available for empty dictionaries
The table size is the power of 2 following your number of keys:
Keys: 200032 => Table size: 262144
The solution:
We will calculate a desired COUNT argument for every scan.
Say you will be calling SCAN with a frequency (F in Hz) of 10 Hz (every 100 ms) and you want it done in 5 seconds (T in s). So you want this finished in N = F*T calls, N = 50 in this example.
Before your first scan, you know your current progress is 0, so your remaining percent is RP = 1 (100%).
Before every SCAN call (or only every few calls, if you want to save the round-trip time (RTT) of the DBSIZE call), you call DBSIZE to get the number of keys K.
You will use COUNT = K*RP/N
For the first call, this is COUNT = 200032*1/50 = 4000.
For any other call, you need to calculate RP = 1 - ReversedCursor/NextPowerOfTwo(K).
For example, let's say you have done 20 calls already, so now N = 30 (remaining number of calls). You called DBSIZE and got K = 281569. This means NextPowerOfTwo(K) = 524288, which is 2^19.
Your next cursor is 14509 in decimal = 000011100010101101 in binary. As the table size is 2^19, we represent it with 18 bits.
You reverse the bits and get 101101010001110000 in binary = 185456 in decimal. This means we have covered 185456 out of 524288. And:
RP = 1 - ReversedCursor/NextPowerOfTwo(K) = 1 - 185456 / 524288 = 0.65 or 65%
So you have to adjust:
COUNT = K*RP/N = 281569 * 0.65 / 30 = 6100
So in your next SCAN call you use 6100. It makes sense that it increased, because:
The amount of keys has increased from 200032 to 281569.
Although we have only 60% of our initial estimate of calls remaining, progress is behind as 65% of the keyspace is pending to be scanned.
All this was assuming you are getting all keys. If you're pattern-matching, you need to use the past to estimate the remaining number of keys to be found. We add PM (percent of matches) as a factor to the COUNT calculation.
COUNT = PM * K*RP/N
PM = keysFound / ( K * ReversedCursor/NextPowerOfTwo(K))
If after 20 calls, you have found only keysFound = 2000 keys, then:
PM = 2000 / ( 281569 * 185456 / 524288) = 0.02
This means only 2% of the keys are matching our pattern so far, so
COUNT = PM * K*RP/N = 0.02 * 6100 = 122
This algorithm can probably be improved, but you get the idea.
Make sure to run some benchmarks on the COUNT number you'll start with, to measure how many milliseconds your SCAN takes, as you may need to moderate your expectations about how many calls (N) you need to do this in a reasonable time without blocking the server, and adjust your F and T accordingly.
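As a rough sketch (not from the original answer), the bookkeeping described above could look like this in Python. cursor_bits is the number of bits used to interpret the cursor for the current table size; the worked example above uses 18 bits for a 2^19 table, so the same figure is used here to reproduce its numbers. In a real loop, dbsize, cursor and keys_found would come from your own DBSIZE/SCAN calls.
def next_power_of_two(k):
    # Smallest power of two >= k (the hash table size Redis would be using)
    p = 1
    while p < k:
        p *= 2
    return p

def reversed_cursor(cursor, cursor_bits):
    # SCAN walks the table in bit-reversed order, so reversing the cursor's
    # bits gives the number of slots already covered.
    return int(format(cursor, '0{}b'.format(cursor_bits))[::-1], 2)

def next_count(dbsize, cursor, cursor_bits, remaining_calls, keys_found=None):
    table = next_power_of_two(dbsize)
    covered = reversed_cursor(cursor, cursor_bits)   # slots visited so far
    rp = 1.0 - float(covered) / table                # remaining percent (RP)
    count = dbsize * rp / remaining_calls            # COUNT = K*RP/N
    if keys_found is not None and covered > 0:       # pattern-matching case
        pm = keys_found / (dbsize * float(covered) / table)
        count *= pm                                  # COUNT = PM*K*RP/N
    return max(1, int(count))

# Reproduces the worked example: K=281569, cursor=14509, 18 bits, N=30
print(next_count(281569, 14509, 18, 30))                    # ~6065 (~6100 with RP rounded to 0.65)
print(next_count(281569, 14509, 18, 30, keys_found=2000))   # ~121 (~122 above)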

How to get list of datatypes of View Columns?

I have 32,000 columns, and some of the views contain up to a million rows, maybe more.
User ulrich from the Teradata forum provided an almost working solution. The main idea is to create a volatile table, then populate it with all the required info via dynamic SQL. Here is the full, slightly modified solution:
.run file = /yourpath/logon.txt ;
.set width 500;
.OS rm /yourpath/view_col_type_sql.txt;
.export report file=/yourpath/view_col_type_sql.txt
select 'insert into view_column_data_type Select distinct''' !! Trim(databasename) !! ''','''!!Trim(tablename) !! ''','''!!Trim(columnname)!!''',type('!!trim(databasename)!! '.'
!! trim(tablename)!! '.' !! trim(columnname) !!');'(title '')
from dbc.columns
where (databasename, tablename) in (select databasename, tablename from dbc.tables where tablekind = 'V')
;
.export reset;
create volatile table view_column_data_type
(
databasename varchar(30),
tablename varchar(30),
columnname varchar(30),
columntype varchar(30)
) primary index (databasename, tablename)
on commit preserve rows;
.run file /yourpath/view_col_type_sql.txt;
select *
from view_column_data_type
order by 1,2,3
;
.logoff;
However, I can't use that solution; I ran into a spool space problem. The problem is that the query select type(databasename.tablename.columnname) returns the type for the column n times, where n is the number of rows, whether I use DISTINCT or GROUP BY 1 (they behave the same way, because TD14 can choose between them on its own).
Has anything changed after 4 years in TD v14.1?
UPD1
explain insert into view_column_data_type Select distinct'db1','tb1','col1',type(db1.tb1.col1);
1) First, we lock db1.o in view tb1 for access,
we lock db1.a in view tb1 for access, we
lock db1.o in view tb1 for access, and we
lock db1.a in view tb1 for access.
2) Next, we execute the following steps in parallel.
1) We do an all-AMPs RETRIEVE step from db1.o in view
tb1 by way of an all-rows scan with no residual
conditions into Spool 11 (all_amps), which is redistributed
by the hash code of (db1.o.GUID) to all AMPs. The
size of Spool 11 is estimated with low confidence to be
74,480 rows (66,659,600 bytes). The estimated time for this
step is 0.13 seconds.
2) We do an all-AMPs RETRIEVE step from db1.a in view
tb1 by way of an all-rows scan with no residual
conditions into Spool 12 (all_amps), which is redistributed
by the hash code of (db1.a.GUID) to all AMPs. The
size of Spool 12 is estimated with low confidence to be 280
rows (256,200 bytes). The estimated time for this step is
0.13 seconds.
3) We do an all-AMPs JOIN step from Spool 11 (Last Use) by way of an
all-rows scan, which is joined to Spool 12 (Last Use) by way of an
all-rows scan. Spool 11 and Spool 12 are full outer joined using
a single partition hash join, with condition(s) used for
non-matching on right table ("NOT (GUID IS NULL)"), with a join
condition of ("GUID = GUID"). The result goes into Spool 10
(all_amps), which is built locally on the AMPs. The size of Spool
10 is estimated with low confidence to be 74,759 rows (
134,491,441 bytes). The estimated time for this step is 0.84
seconds.
4) We do an all-AMPs STAT FUNCTION step from Spool 10 (Last Use) by
way of an all-rows scan into Spool 17 (Last Use), which is assumed
to be redistributed by value to all AMPs. The result rows are put
into Spool 15 (all_amps), which is built locally on the AMPs. The
size is estimated with low confidence to be 74,759 rows (
72,890,025 bytes).
5) We do an all-AMPs STAT FUNCTION step from Spool 15 (Last Use) by
way of an all-rows scan into Spool 20 (Last Use), which is
redistributed by hash code to all AMPs. The result rows are put
into Spool 19 (all_amps), which is built locally on the AMPs. The
size is estimated with low confidence to be 74,759 rows (
71,693,881 bytes).
6) We execute the following steps in parallel.
1) We do an all-AMPs RETRIEVE step from Spool 19 (Last Use) by
way of an all-rows scan with a condition of ("(Field_20 <>
'D') OR (Field_21 = 1)") into Spool 9 (used to materialize
view, derived table, table function or table operator t3)
(all_amps), which is built locally on the AMPs. The size of
Spool 9 is estimated with low confidence to be 74,759 rows (
69,600,629 bytes). The estimated time for this step is 4.66
seconds.
2) We do an all-AMPs RETRIEVE step from db1.o in view
tb1 by way of an all-rows scan with no residual
conditions into Spool 24 (all_amps), which is redistributed
by the hash code of (db1.o.MK) to all AMPs. Then
we do a SORT to order Spool 24 by row hash. The size of
Spool 24 is estimated with low confidence to be 280 rows (
116,200 bytes).
7) We do an all-AMPs RETRIEVE step from Spool 24 by way of an
all-rows scan into Spool 25 (all_amps), which is duplicated on all
AMPs. The size of Spool 25 is estimated with low confidence to be
78,400 rows (32,536,000 bytes). The estimated time for this step
is 0.02 seconds.
8) We do an all-AMPs JOIN step from db1.a in view
tb1 by way of an all-rows scan with no residual
conditions, which is joined to Spool 25 (Last Use) by way of an
all-rows scan. db1.a and Spool 25 are left outer
joined using a product join, with condition(s) used for
non-matching on left table ("NOT (db1.a.GUID IS NULL)"),
with a join condition of ("GUID = db1.a.GUID"). The
result goes into Spool 26 (all_amps), which is redistributed by
the hash code of (db1.o.MK) to all AMPs. Then we do a
SORT to order Spool 26 by row hash. The size of Spool 26 is
estimated with low confidence to be 559 rows (245,401 bytes).
9) We do an all-AMPs JOIN step from Spool 26 (Last Use) by way of a
RowHash match scan, which is joined to Spool 24 (Last Use) by way
of a RowHash match scan. Spool 26 and Spool 24 are full outer
joined using a merge join, with a join condition of ("Field_1 =
Field_1"). The result goes into Spool 23 (all_amps), which is
built locally on the AMPs. The size of Spool 23 is estimated with
low confidence to be 559 rows (463,411 bytes). The estimated time
for this step is 0.03 seconds.
10) We do an all-AMPs STAT FUNCTION step from Spool 23 (Last Use) by
way of an all-rows scan into Spool 31 (Last Use), which is
redistributed by hash code to all AMPs. The result rows are put
into Spool 29 (all_amps), which is built locally on the AMPs. The
size is estimated with low confidence to be 559 rows (273,910
bytes).
11) We do an all-AMPs STAT FUNCTION step from Spool 29 (Last Use) by
way of an all-rows scan into Spool 34 (Last Use), which is
redistributed by hash code to all AMPs. The result rows are put
into Spool 33 (all_amps), which is built locally on the AMPs. The
size is estimated with low confidence to be 559 rows (264,966
bytes).
12) We execute the following steps in parallel.
1) We do an all-AMPs RETRIEVE step from Spool 33 (Last Use) by
way of an all-rows scan with a condition of ("(Field_12 <>
'D') OR (Field_13 = 1)") into Spool 8 (used to materialize
view, derived table, table function or table operator t2)
(all_amps), which is built locally on the AMPs. The size of
Spool 8 is estimated with low confidence to be 559 rows (
249,314 bytes). The estimated time for this step is 0.01
seconds.
2) We do an all-AMPs RETRIEVE step from db1.o in view
tb1 by way of an all-rows scan with no residual
conditions locking for access into Spool 51 (all_amps), which
is redistributed by the hash code of (db1.o.GUID)
to all AMPs. Then we do a SORT to order Spool 51 by row hash.
The size of Spool 51 is estimated with low confidence to be
74,480 rows (1,564,080 bytes). The estimated time for this
step is 0.06 seconds.
3) We do an all-AMPs RETRIEVE step from db1.a in view
tb1 by way of an all-rows scan with no residual
conditions locking for access into Spool 52 (all_amps), which
is redistributed by the hash code of (db1.a.GUID)
to all AMPs. Then we do a SORT to order Spool 52 by row hash.
The size of Spool 52 is estimated with low confidence to be
280 rows (9,240 bytes). The estimated time for this step is
0.06 seconds.
13) We do an all-AMPs JOIN step from Spool 51 (Last Use) by way of a
RowHash match scan, which is joined to Spool 52 (Last Use) by way
of a RowHash match scan. Spool 51 and Spool 52 are full outer
joined using a merge join, with condition(s) used for non-matching
on right table ("NOT (GUID IS NULL)"), with a join condition of (
"GUID = GUID"). The result goes into Spool 50 (all_amps), which
is built locally on the AMPs. The size of Spool 50 is estimated
with low confidence to be 74,759 rows (3,214,637 bytes). The
estimated time for this step is 0.07 seconds.
14) We do an all-AMPs STAT FUNCTION step from Spool 50 (Last Use) by
way of an all-rows scan into Spool 57 (Last Use), which is assumed
to be redistributed by value to all AMPs. The result rows are put
into Spool 55 (all_amps), which is built locally on the AMPs. The
size is estimated with low confidence to be 74,759 rows (
6,952,587 bytes).
15) We do an all-AMPs STAT FUNCTION step from Spool 55 (Last Use) by
way of an all-rows scan into Spool 60 (Last Use), which is
redistributed by hash code to all AMPs. The result rows are put
into Spool 5 (all_amps), which is redistributed by hash code to
all AMPs. The size is estimated with low confidence to be 74,759
rows (5,457,407 bytes).
16) We do an all-AMPs RETRIEVE step from Spool 8 by way of an all-rows
scan with a condition of ("(t2.RDM$END_DATE <= TIMESTAMP
'9999-12-31 00:00:00.000000') AND ((t2.col1 > TIMESTAMP
'1900-01-01 00:00:00.000000') AND (NOT (t2.MK IS NULL )))") into
Spool 90 (all_amps), which is duplicated on all AMPs. The size of
Spool 90 is estimated with low confidence to be 156,520 rows (
5,791,240 bytes). The estimated time for this step is 0.02
seconds.
17) We do an all-AMPs JOIN step from Spool 90 (Last Use) by way of an
all-rows scan, which is joined to Spool 9 by way of an all-rows
scan. Spool 90 and Spool 9 are joined using a dynamic hash join,
with a join condition of ("(LVL_TYPE_MK = MK) AND ((col1
,RDM$END_DATE) OVERLAPS (col1 ,RDM$END_DATE))"). The
result goes into Spool 5 (all_amps), which is redistributed by the
hash code of ((CASE WHEN ((RDM$OPC = 'D') OR
(db1.a.RDM$VALIDFROM IS NULL )) THEN (TIMESTAMP
'1900-01-01 00:00:00.000000') ELSE (db1.a.RDM$VALIDFROM)
END), TIMESTAMP '9999-12-31 00:00:00.000000', (CASE WHEN
(db1.a.GUID IS NULL) THEN (db1.o.GUID) ELSE
(db1.a.GUID) END)) to all AMPs. The size of Spool 5 is
estimated with no confidence to be 227,602 rows (16,614,946 bytes).
The estimated time for this step is 0.19 seconds.
18) We execute the following steps in parallel.
1) We do an all-AMPs RETRIEVE step from Spool 9 by way of an
all-rows scan with a condition of ("NOT (t1.MK_SUCCESSOR IS
NULL)") into Spool 117 (all_amps) fanned out into 7 hash join
partitions, which is built locally on the AMPs. The size of
Spool 117 is estimated with low confidence to be 74,759 rows (
3,364,155 bytes). The estimated time for this step is 0.30
seconds.
2) We do an all-AMPs RETRIEVE step from Spool 9 by way of an
all-rows scan with a condition of ("(t3.RDM$END_DATE <=
TIMESTAMP '9999-12-31 00:00:00.000000') AND
((t3.col1 > TIMESTAMP '1900-01-01 00:00:00.000000')
AND (NOT (t3.MK IS NULL )))") into Spool 118 (all_amps) fanned
out into 7 hash join partitions, which is duplicated on all
AMPs. The result spool file will not be cached in memory.
The size of Spool 118 is estimated with low confidence to be
20,932,520 rows (774,503,240 bytes). The estimated time for
this step is 0.42 seconds.
19) We do an all-AMPs JOIN step from Spool 117 (Last Use) by way of an
all-rows scan, which is joined to Spool 118 (Last Use) by way of
an all-rows scan. Spool 117 and Spool 118 are joined using a hash
join of 7 partitions, with a join condition of ("(MK_SUCCESSOR =
MK) AND ((col1 ,RDM$END_DATE) OVERLAPS (col1
,RDM$END_DATE))"). The result goes into Spool 5 (all_amps), which
is redistributed by the hash code of ((CASE WHEN ((RDM$OPC = 'D')
OR (db1.a.RDM$VALIDFROM IS NULL )) THEN (TIMESTAMP
'1900-01-01 00:00:00.000000') ELSE (db1.a.RDM$VALIDFROM)
END), TIMESTAMP '9999-12-31 00:00:00.000000', (CASE WHEN
(db1.a.GUID IS NULL) THEN (db1.o.GUID) ELSE
(db1.a.GUID) END)) to all AMPs. Then we do a SORT to
order Spool 5 by the sort key in spool field1 eliminating
duplicate rows. The size of Spool 5 is estimated with no
confidence to be 98,165 rows (7,166,045 bytes). The estimated
time for this step is 2.83 seconds.
20) We do an all-AMPs STAT FUNCTION step from Spool 5 (Last Use) by
way of an all-rows scan into Spool 122 (Last Use), which is
assumed to be redistributed by value to all AMPs. The result rows
are put into Spool 120 (all_amps), which is built locally on the
AMPs. The size is estimated with no confidence to be 98,165 rows
(6,577,055 bytes). The estimated time for this step is 0.01
seconds.
21) We execute the following steps in parallel.
1) We do an all-AMPs RETRIEVE step from Spool 120 (Last Use) by
way of an all-rows scan into Spool 6 (used to materialize
view, derived table, table function or table operator vv)
(all_amps), which is built locally on the AMPs. The size of
Spool 6 is estimated with no confidence to be 98,165 rows (
4,024,765 bytes). The estimated time for this step is 0.01
seconds.
2) We do an all-AMPs RETRIEVE step from Spool 9 (Last Use) by way
of an all-rows scan into Spool 126 (all_amps), which is
duplicated on all AMPs. The result spool file will not be
cached in memory. The size of Spool 126 is estimated with low
confidence to be 20,932,520 rows. The estimated time for this
step is 0.21 seconds.
22) We do an all-AMPs JOIN step from Spool 6 (Last Use) by way of an
all-rows scan, which is joined to Spool 126 by way of an all-rows
scan. Spool 6 and Spool 126 are joined using a product join, with
a join condition of ("(1=1)"). The result goes into Spool 128
(all_amps), which is built locally on the AMPs. The result spool
file will not be cached in memory. The size of Spool 128 is
estimated with no confidence to be 7,338,717,235 rows. The
estimated time for this step is 42.98 seconds.
23) We do an all-AMPs JOIN step from Spool 8 (Last Use) by way of an
all-rows scan, which is joined to Spool 126 (Last Use) by way of
an all-rows scan. Spool 8 and Spool 126 are joined using a
product join, with a join condition of ("(1=1)"). The result goes
into Spool 129 (all_amps), which is duplicated on all AMPs. The
result spool file will not be cached in memory. The size of Spool
129 is estimated with low confidence to be 11,701,278,680 rows.
The estimated time for this step is 57.75 seconds.
24) We do an all-AMPs JOIN step from Spool 128 (Last Use) by way of an
all-rows scan, which is joined to Spool 129 (Last Use) by way of
an all-rows scan. Spool 128 and Spool 129 are joined using a
product join, with a join condition of ("(1=1)"). The result goes
into Spool 125 (one-amp), which is redistributed by the hash code
of ('db1', 'tb1') to all AMPs. The result
spool file will not be cached in memory. The size of Spool 125 is
estimated with no confidence to be *** rows (*** bytes). The
estimated time for this step is 1,820,312 hours and 14 minutes.
25) We do a single-AMP SORT to order Spool 125 (one-amp) by eliminate
duplicate rows.
26) We do a single-AMP MERGE into
"admin".view_column_data_type from Spool 125 (Last Use).
The size is estimated with no confidence to be *** rows. The
estimated time for this step is 881,263,274 hours and 32 minutes.
27) We spoil the parser's dictionary cache for the table.
28) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> No rows are returned to the user as the result of statement 1.
There shouldn't be a spool problem, because this utilizes some old Tequel (= pre-SQL) syntax where the optimizer resolves the view source code down to the base tables without actually accessing them.
When you explain
insert into view_column_data_type
SELECT TYPE(DBC.TablesV.DatabaseName); -- no FROM!
it should look like this:
1) First, we do an INSERT into Spool 2.
2) Next, we do an all-AMPs RETRIEVE step from Spool 2 (Last Use) by
way of an all-rows scan into Spool 1 (one-amp), which is
redistributed by the hash code of ('DBC', 'ColumnsVX') to few AMPs.
Then we do a SORT to order Spool 1 by row hash. The size of Spool
1 is estimated with high confidence to be 1 row (61 bytes). The
estimated time for this step is 0.01 seconds.
3) We do a single-AMP MERGE into
xxx.view_column_data_type from Spool 1 (Last Use).
The size is estimated with high confidence to be 1 row. The
estimated time for this step is 1 second.
Of course step 2) is a bit stupid, but there's no access to dbc.tvfields, dbc.dbase, etc.
I can't imagine this changed in newer releases...
I'm not sure whether there is any possible solution using SQL alone. My final solution uses BTEQ (there is a nice guide on how to use it) to get the list of columns and tables, by first writing a dynamically generated SQL query into a file:
select 'select ' !! Trim(databasename) !! '.'!!Trim(tablename) !! '; ' !!
'help column ' !! Trim(databasename) !! '.'!!Trim(tablename) !! '.* ;'
from dbc.columnsV
where (databasename, tablename) in (select databasename, tablename from dbc.tablesV as tb where tb.tableKind = 'V'
and TRIM( tb.DatabaseName ) IN ( 'db1', 'db2' ))
;
The query above will generate the table name plus the HELP COLUMN result.
The generated CSV file can then be parsed in any language, for example in Python 2.7:
import pandas as pd

df = pd.read_csv('out.csv', sep=';')
df_logs = pd.DataFrame([])

i = 0
while i < len(df):                      # while loop so we can skip past each parsed block
    if i % 1000 == 0:
        print i
    # A row starting with the environment prefix marks the table-name line
    # that precedes a HELP COLUMN block in the exported file
    if df['Column'].iloc[i][:5] == 'sit50':
        full_name = df['Column'].iloc[i]
        j = 3                           # skip the header lines of the HELP COLUMN output
        # collect column rows until the next block (rows starting with 'db_template) begins
        while not df['Column'].iloc[i + j].startswith("'db_template"):
            if i + j == len(df) - 1:
                break
            df_logs = df_logs.append(
                [[full_name + ' ' + df['Column'].iloc[i + j], df['Name'].iloc[i + j]]],
                ignore_index=True)
            j = j + 1
        i = i + j                       # jump over the rows we just consumed
    else:
        i = i + 1

df_logs.to_csv("db_logs", sep='\t')
Hope that solution will help someone.

Simple SELECT/WHERE query slow. Should I index a BIT field? [duplicate]

This question already has answers here:
Should I index a bit field in SQL Server?
(18 answers)
Closed 8 years ago.
The following query takes 20-25 seconds to run
SELECT * from Books WHERE IsPaperback = 1
where IsPaperback is a BIT field. There are about 500k rows, and about 400k currently have this field set.
I also have a BIT field called IsBundle and only 900 records have this set. Again, execution time is about 20-25 seconds.
How can I speed up such a simple query?
Indexing a bit column will split the data into two parts, true and false. If the data is split 50/50, the gain will be modest. When it is 90/10 and you query the 10% part, yes, it will make a difference.
You should first narrow down your result set column-wise. Then, if you see you need only a few columns and you execute this query a lot, you could even include those few fields in the index, so there is no need for a lookup in the table itself.
First of all, I would explicitly call out the columns:
select
field1
, field2
, field3
from books
where IsPaperback = 1;
This seems like a small thing, but when you use the star (*) for column selection, the DB has to look up the column names before actually performing the call.
Do you have an index on IsPaperback? That would impact the above query more than having an index on IsBundle.
If you had a condition of IsBundle = 1, then I would think there would be a need for an index on that field.
Add an index for IsPaperback.
Try making it an int or tinyint; the latest processors actually process 32-bit words faster than bytes.
This query should take no more than a couple of milliseconds.
You should not have separate columns for IsPaperback and IsBundle. It should be a Type column where Paperback and Bundle are the values.
Before the query, turn profiling on:
SET profiling = 1
After the query, show the profiles:
SHOW PROFILES
It seems there are some out there who do not believe this query should take only a few milliseconds.
For those who downvoted this answer without understanding that what I said was true:
I found a table "cities" with 332,127 records.
In this table Russia has 929 cities.
These benchmarks were performed on a GoDaddy server, IP 50.63.0.80, a GoDaddy Virtual Dedicated Server.
On average I find sites hosted on GoDaddy to have the worst performance.
$time = microtime(true);
$results = mysql_query("SELECT * FROM `cities` WHERE `country` LIKE 'RS'");
echo "\n" . number_format(microtime(true)-$time,6)."\n";
$time = microtime(true);
while ($row = mysql_fetch_array($results, MYSQL_NUM)){$r[]=$row;}
echo "\n" . number_format(microtime(true)-$time,6);
Results:
With Index: 2.9mS
0.002947 Seconds : $results = mysql_query("SELECT * FROM `cities` WHERE `country` LIKE 'RS'");
0.000081 Seconds : while ($row = mysql_fetch_array($results, MYSQL_NUM)){$r[]=$row;}
Without Index 93mS
0.093939 Seconds : $results = mysql_query("SELECT * FROM `cities` WHERE `country` LIKE 'RS'");
0.000073 Seconds : while ($row = mysql_fetch_array($results, MYSQL_NUM)){$r[]=$row;}
Then in phpMyAdmin Profiling:
SET PROFILING = ON;
SELECT * FROM `cities` WHERE `country` LIKE 'RS';
SHOW PROFILE;
Result:
Execution of the Query took 0.0000003 seconds
starting 0.000020
checking permissions 0.000004
Opening tables 0.000006
init 0.000007
optimizing 0.000003
executing 0.000003 ******
end 0.000004
query end 0.000003
closing tables 0.000003
freeing items 0.000010
logging slow query 0.000003
cleaning up 0.000003
Without Index
Execution of the Query took 0.0000012 seconds
starting 0.000046
checking permissions 0.000006
Opening tables 0.000010
init 0.000021
optimizing 0.000006
executing 0.000012 ******
end 0.000003
query end 0.000004
closing tables 0.000003
freeing items 0.000017
logging slow query 0.000004
cleaning up 0.000003
In phpMyAdmin, doing a search with profiling turned on:
GoDaddy server: Sending Data 92.6 ms
SELECT * FROM `cities` WHERE `country` LIKE 'RS' LIMIT 1000
Showing rows 0 - 928 (929 total, Query took 0.0907 sec)
Profiling Results:
Starting 52 µs
Checking Permissions 7 µs
Opening Tables 23 µs
System Lock 12 µs
Init 34 µs
optimizing 10 µs
Statistics 23 µs
Preparing 17 µs
Executing 4 µs
Sending Data 92.6 ms
End 18 µs
Query End 4 µs
Closing Tables 15 µs
Freeing Items 27 µs
Logging Slow Query 4 µs
Cleaning Up 5 µs
In phpMyAdmin, doing a search with profiling turned on:
On my server: Sending Data 1.8 ms
SELECT * FROM `cities` WHERE `country` LIKE 'RS' LIMIT 1000
Showing rows 0 - 928 (929 total, Query took 0.0022 sec)
Starting 27 µs
Checking Permissions 5 µs
Opening Tables 11 µs
System Lock 7 µs
Init 14 µs
Optimizing 5 µs
Statistics 43 µs
Preparing 6 µs
Executing 2 µs
Sending Data 1.8 ms
End 5 µs
Query End 3 µs
Closing Tables 5 µs
Freeing Items 13 µs
Logging Slow Query 2 µs
Cleaning Up 2 µs
Just to show the importance of an index: over 400x improvement.
A table with 5,480,942 Records and a Query that Returns 899 Rows
$time = microtime(true);
$results = mysql_query("SELECT * FROM `ipLocations` WHERE `id` = 33644");
echo "\n" . number_format(microtime(true)-$time,6);
$time = microtime(true);
while ($row = mysql_fetch_array($results, MYSQL_NUM)){$r[]=$row;}
echo "\n" . number_format(microtime(true)-$time,6);
No index
0.402005
0.001264
With Index (426x Faster)
0.001716
0.001962

SQL Server 2008, how much space does this occupy?

I am trying to calculate how much space (MB) this would occupy. In the database table there are 7 bit columns, 2 tinyints and 1 GUID.
I'm trying to calculate how much space 16,000 rows would occupy.
My line of thought was that the 7 bit columns consume 1 byte, the 2 tinyints consume 2 bytes and the GUID consumes 16 bytes, a total of 19 bytes for one row in the table. That would mean 304,000 bytes for 16,000 rows, or ~0.3 MB. Is that correct? Is there a metadata byte as well?
There are several estimators out there which take away the donkey work.
You have to take into account the NULL bitmap, which will be 3 bytes in this case, plus the number of rows per page, the row header, row versioning info, pointers, and all the stuff described here:
Inside the Storage Engine: Anatomy of a record
Edit:
Your 19 bytes of actual data
has 11 bytes of overhead
total 30 bytes per row
around 269 rows per page (8096 / 30)
requires 60 pages (16000 / 269)
around 490k of space (60 x 8192)
plus a few KB for the index structure of the primary key
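A minimal sketch of that arithmetic in Python, taking the 11-byte overhead and the 8,096 usable bytes per 8 KB page from the figures above (an approximation, not an exact storage formula):
data_bytes = 1 + 2 + 16          # 7 bit columns share 1 byte, 2 tinyints, 1 GUID
row_overhead = 11                # header, NULL bitmap, etc. (figure from above)
rows = 16000

row_size = data_bytes + row_overhead      # 30 bytes per row
rows_per_page = 8096 // row_size          # ~269 rows fit in a page's usable space
pages = -(-rows // rows_per_page)         # 60 pages (ceiling division)
print(pages * 8192)                       # 491520 bytes, roughly 490 KB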