Why does select result fields double data scanned in BigQuery - google-bigquery

I have a table with 2 integer fields x,y and few millions of rows.
The fields are created with the following code:
Field.newBuilder("x", LegacySQLTypeName.INTEGER).setMode(Field.Mode.NULLABLE).build();
If I run the following from the web:
SELECT x,y FROM [myproject:Test.Test] where x=1 LIMIT 50
Query Editor: "Valid: This query will process 64.9 MB when run."
compared to:
SELECT x FROM [myproject:Test.Test] where x=1 LIMIT 50
Query Editor: " Valid: This query will process 32.4 MB when run."
It scans more than double of the original data scanned.
I would expect it will first find the relevant rows based on where clause and then bring the extra field without scanning the entire second field.
Any inputs on why it doubles the data scanned and how to avoid it will be appreciated.
In my application I have hundred of possible fields which I need to fetch for a very small number of rows (50) which answer the query.
Does this means I will need to processed all fields data?
* I'm aware how columnar database works, but wasn't aware for the huge price when you want to brings lots of fields based on a very specific where clause.
The following link provide very clear answer:
best-practices-performance-input

BigQuery does not have a concept of index or something like that. When you query a field column, BigQuery will scan through all the values of that column and then make the operations you want (for a deeper deep understanding they have some pretty cool posts about the inner workings of BQ).
That means that when you select x and y where x = 1, BQ will read through all values of x and y and then find where x = 1.
This ends up being an amazing feature of BQ, you just load your data there and it just works. It does force you to be aware on how much data you retrieve from each query. Queries of the type select * from table should be used only if you really need all columns.

Related

Redshift SQL result set 100s of rows wide efficiency (long to wide)

Scenario: Medical records reporting to state government which requires a pipe delimited text file as input.
Challenge: Select hundreds of values from a fact table and produce a wide result set to be (Redshift) UNLOADed to disk.
What I have tried so far is a SQL that I want to make into a VIEW.
;WITH
CTE_patient_record AS
(
SELECT
record_id
FROM fact_patient_record
WHERE update_date = <yesterday>
)
,CTE_patient_record_item AS
(
SELECT
record_id
,record_item_name
,record_item_value
FROM fact_patient_record_item fpri
INNER JOIN CTE_patient_record cpr ON fpri.record_id = cpr.record_id
)
Note that fact_patient_record has 87M rows and fact_patient_record_item has 97M rows.
The above code runs in 2 seconds for 2 test records and the CTE_patient_record_item CTE has about 200 rows per record for a total of about 400.
Now, produce the result set:
,CTE_result AS
(
SELECT
cpr.record_id
,cpri002.record_item_value AS diagnosis_1
,cpri003.record_item_value AS diagnosis_2
,cpri004.record_item_value AS medication_1
...
FROM CTE_patient_record cpr
INNER JOIN CTE_patient_record_item cpri002 ON cpr.cpr.record_id = cpri002.cpr.record_id
AND cpri002.record_item_name = 'diagnosis_1'
INNER JOIN CTE_patient_record_item cpri003 ON cpr.cpr.record_id = cpri003.cpr.record_id
AND cpri003.record_item_name = 'diagnosis_2'
INNER JOIN CTE_patient_record_item cpri004 ON cpr.cpr.record_id = cpri004.cpr.record_id
AND cpri003.record_item_name = 'mediation_1'
...
) SELECT * FROM CTE_result
Result set looks like this:
record_id diagnosis_1 diagnosis_2 medication_1 ...
100001 09 9B 88X ...
...and then I use the Reshift UNLOAD command to write to disk pipe delimited.
I am testing this on a full production sized environment but only for 2 test records.
Those 2 test records have about 200 items each.
Processing output is 2 rows 200 columns wide.
It takes 30 to 40 minutes to process just just the 2 records.
You might ask me why I am joining on the item name which is a string. Basically there is no item id, no integer, to join on. Long story.
I am looking for suggestions on how to improve performance. With only 2 records, 30 to 40 minutes is unacceptable. What will happen when I have 1000s of records?
I have also tried making the VIEW a MATERIALIZED VIEW however, it takes 30 to 40 minutes (not surprisingly) to compile the materialized view also.
I am not sure which route to take from here.
Stored procedure? I have experience with stored procs.
Create new tables so I can create integer id's to join on and indexes? However, my managers are "new table" averse.
?
I could just stop with the first two CTEs, pull the data down to python and process using pandas dataframe which I've done before successfully but it would be nice if I could have an efficient query, just use Redshift UNLOAD and be done with it.
Any help would be appreciated.
UPDATE: Many thanks to Paul Coulson and Bill Weiner for pointing me in the right direction! (Paul I am unable to upvote your answer as I am too new here).
Using (pseudo code):
MAX(CASE WHEN t1.name = 'somename' THEN t1.value END ) AS name
...
FROM table1 t1
reduced execution time from 30 minutes to 30 seconds.
EXPLAIN PLAN for the original solution is 2700 lines long, for the new solution using conditional aggregation is 40 lines long.
Thanks guys.
Without some more information it is impossible to know what is going on for sure but what you are doing is likely not ideal. An explanation plan and the execution time per step would help a bunch.
What I suspect is getting you is that you are reading a 97M row table 200 times. This will slow things down but shouldn't take 40 min. So I also suspect that record_item_name is not unique per value of record_id. This will lead to row replication and could be expanding the data set many fold. Also is record_id unique in fact_patient_record? If not then this will cause row replication. If all of this is large enough to cause significant spill and significant network broadcasting your 40 min execution time is very plausible.
There is no need to be joining when all the data is in a single copy of the table. #PhilCoulson is correct that some sort of conditional aggregation could be applied and the decode() syntax could save you space if you don't like case. Several of the above issues that might be affecting your joins would also make this aggregation complicated. What are you looking for if there are several values for record_item_value for each record_id and record_item_name pair? I expect you have some discovery of what your data holds in your future.

Unnesting a json in Redshift causing nested loop in the query plan

I have a column in my tables called 'data' with JSONs in it like below:
{"tt":"452.95","records":[{"r":"IN184366","t":"812812819910","s":"129.37","d":"982.7","c":"83"},{"r":"IN183714","t":"8028028029093","s":"33.9","d":"892","c":"38"}]}
I have written a code to unnest it into separate columns like tr,r,s.
Below is the code
with raw as (
SELECT json_extract_path_text(B.Data, 'records', true) as items
FROM tableB as B where B.date::timestamp between
to_timestamp('2019-01-01 00:00:00','YYYY-MM-DD HH24:MA:SS') AND
to_timestamp('2022-12-31 23:59:59','YYYY-MM-DD HH24:MA:SS')
UNION ALL
SELECT json_extract_path_text(C.Data, 'records', true) as items
FROM tableC as C where C.date-5 between
to_timestamp('2019-01-01 00:00:00','YYYY-MM-DD HH24:MA:SS') AND
to_timestamp('2022-12-31 23:59:59','YYYY-MM-DD HH24:MA:SS')
),
numbers as (
SELECT ROW_NUMBER() OVER (ORDER BY TRUE)::integer- 1 as ordinal
FROM <any_random_table> limit 1000
),
joined as (
select raw.*,
json_array_length(orders.items, true) as number_of_items,
json_extract_array_element_text(
raw.items,
numbers.ordinal::int,
true
) as item
from raw
cross join numbers
where numbers.ordinal <
json_array_length(raw.items, true)
),
parsed as (
SELECT J.*,
json_extract_path_text(J.item, 'tr',true) as tr,
json_extract_path_text(J.item, 'r',true) as r,
json_extract_path_text(J.item, 's',true)::float8 as s
from joined J
)
select * from parsed
The above code is working when there are small number of records but this taking more than a day to run and CPU utilization (in redshift) is reaching 100 % and even the disk space used also reaching 100% if I am putting date between last two years etc.. or if the number of records is large.
Can anyone please suggest any alternative way to unnnest JSON objects like above in redshift.
My query plan is saying:
Nested Loop Join in the query plan - review the join predicates to avoid Cartesian products
Goal: To Unnest without using any cross joins
Input: data column having JSON
"tt":"452.95","records":[{"r":"IN184366","t":"812812819910","s":"129.37","d":"982.7","c":"83"},{"r":"IN183714","t":"8028028029093","s":"33.9","d":"892","c":"38"}]}
Output should be for example
tr,r,s columns from the above json
You want to unnest json records of up to 1000 stored in a json array but nested loop join is taking too long.
The root issues is likely your data model. You have stored structured records (called "records"), inside a semi-structure text element (json), within a column of a structured columnar database. You want to perform some operation on these buried records that you haven't described but here's the problem. Columnar databases are optimized for performing read-centric analytic queries but you need to expand these json internal records into Redshift rows (records) which is fundamentally a write operation. This is working against the optimizations of the database.
The size of this expanding data is also large as compared to your disk storage on your cluster which is why the disks are filling up. You CPUs are likely spinning unpacking the jsons and managing overloaded disk and memory capacity. At the edge of filling up disks Redshift shifts to a mode that optimizes disk space utilization at the expense of execution speed. A larger cluster may give you a significantly faster execution if you can avoid this effect but that will cost money you may not have budgeted. Not an ideal solution.
One area that would improve speed of your query is not carrying all the data along. You keep raw.* and J.* all through the query but it is not clear you need these. Since part of the issue is data size during execution and that this execution includes loop joining, you are making the execution much harder that it needs to be by carrying all this data (including the original jsons).
The best way out of this situation is to change your data model and expand these json internal records into Redshift records on ingestion. Json data is fine for seldom used information or information that is only needed at the end of a query where the data is small. Needing the expanded json at the input end of the query for such a large amount of data is not good use case for json in Redshift. Each of these "records" inside of the json are records and need to be stored as such if you need to work across them as query input.
Now you want to know if there is some slick way to get around this issue in your case and the answer is "unlikely but maybe". Can you describe how you are using the final values in your query (t, r, and s)? If you are just using some aspect of this data (max value or sum or ...) then there may be a way to get to the answer without the large nested loop join. But if you need all the values then there is no other way to get these AFAIK. A description of what comes next in the data process could open up such an opportunity.

Access - Match Column Range / Complex Join

How do we match columns based on condition of closeness to value.
This requires Complex Query / Range Comparison / Multiple Joins conditions.
Getting Query size exceeded 2GB error.
Tables :
InvDetails1 / InvDetails2 / INVDL / ExpectedResult
Field Relation :
InvDetails1.F1 = InDetails2.F3
InvDetails2.F5 = INVDL.F1
INVDL.DLID = ExpectedResult.DLID
ExpectedResult.Total - 1 < InvDetails1.F6 < ExpectedResult.Total + 1
left(InvDetails1.F21,10) = '2013-03-07'
Return Results where Number of records from ExpectedResult is only 1.
Group by InvDetails1.F1 , count(ExpectedResult.DLID) works.
From this result.
Final Result :
InvDetails1.F1 , InvDetails1.F16 , ExpectedResult.DLID , ExpectedResult.NMR
ExpectedResult - has millions of rows.
InvDetails - few hundred thousands
If I was in that situation and was finding that my query was "hitting a wall" at 2GB then one thing I would try would be to create a separate saved Select Query to isolate the InvDetails1 records just for the specific date in question. Then I would use that query instead of the full InvDetails1 table when joining to the other tables.
My reasoning is that the query optimizer may not be able to use your left(InvDetails1.F21,10) = '2013-03-07' condition to exclude InvDetails1 records early in the execution plan, possibly causing the query to grow much larger than it really needs to (internally, while it is being processed). Forcing the date selection to the beginning of the process by putting it in a separate (prerequisite) query may keep the size of the "main" query down to a more feasible size.
Also, if I found myself in the situation where my queries were getting that big I would also keep a watchful eye on the size of my .accdb (or .mdb) file to ensure that it does not get too close to 2GB. I've never had it happen myself, but I've heard that database files that hit the 2GB barrier can result in some nasty errors and be rather "challenging" to recover.

postgres: get random entries from table - too slow

In my postgres database, I have the following relationships (simplified for the sake of this question):
Objects (currently has about 250,000 records)
-------
n_id
n_store_object_id (references store.n_id, 1-to-1 relationship, some objects don't have store records)
n_media_id (references media.n_id, 1-to-1 relationship, some objects don't have media records)
Store (currently has about 100,000 records)
-----
n_id
t_name,
t_description,
n_status,
t_tag
Media
-----
n_id
t_media_path
So far, so good. When I need to query the data, I run this (note the limit 2 at the end, as part of the requirement):
select
o.n_id,
s.t_name,
s.t_description,
me.t_media_path
from
objects o
join store s on (o.n_store_object_id = s.n_id and s.n_status > 0 and s.t_tag is not null)
join media me on o.n_media_id = me.n_id
limit
2
This works fine and gives me two entries back, as expected. The execution time on this is about 20 ms - just fine.
Now I need to get 2 random entries every time the query runs. I thought I'd add order by random(), like so:
select
o.n_id,
s.t_name,
s.t_description,
me.t_media_path
from
objects o
join store s on (o.n_store_object_id = s.n_id and s.n_status > 0 and s.t_tag is not null)
join media me on o.n_media_id = me.n_id
order by
random()
limit
2
While this gives the right results, the execution time is now about 2,500 ms (over 2 seconds). This is clearly not acceptable, as it's one of a number of queries to be run to get data for a page in a web app.
So, the question is: how can I get random entries, as above, but still keep the execution time within some reasonable amount of time (i.e. under 100 ms is acceptable for my purpose)?
Of course it needs to sort the whole thing according to random criteria before getting first rows. Maybe you can work around by using random() in offset instead?
Here's some previous work done on the topic which may prove helpful:
http://blog.rhodiumtoad.org.uk/2009/03/08/selecting-random-rows-from-a-table/
I'm thinking you'll be better off selecting random objects first, then performing the join to those objects after they're selected. I.e., query once to select random objects, then query again to join just those objects that were selected.
It seems like your problem is this: You have a table with 250,000 rows and need two random rows. Thus, you have to generate 250,000 random numbers and then sort the rows by their numbers. Two seconds to do this seems pretty fast to me.
The only real way to speed up the selection is not have to come up with 250,000 random numbers, but instead lookup rows through an index.
I think you'd have to change the table schema to optimize for this case. How about something like:
1) Create a new column with a sequence starting at 1.
2) Every row will then have a number.
3) Create an index on: number % 1000
4) Query for rows where number % 1000 is equal to a random number
between 0 and 999 (this should hit the index and load a random
portion of your database)
5) You can probably then add on RANDOM() to your ORDER BY clause and
it will then just sort that chunk of your database and be 1,000x
faster.
6) Then select the first two of those rows.
If this still isn't random enough (since rows will always be paired having the same "hash"), you could probably do a union of two random rows, or have an OR clause in the query and generate two random keys.
Hopefully something along these lines could be very fast and decently random.

Any idea why contains(...) querys so slow in SQL Server 2005

I've got a simple select query which executes in under 1 second normally, but when I add in a contains(column, 'text') into the where clause, suddenly it's running for 20 seconds up to a minute. The table it's selecting from has around 208k rows.
Any ideas what would cause this query to run so slow with just the addition of the contains clause?
Substring matching is a computationally expensive operation. Is the field indexed? If this is a major feature implementation, consider a search-caching table so you can simply lookup where the words exist.
Depending on the search keyword and the median length of characters in the column it is logical that it would take a long time.
Consider searching for 'cookie' in a column with median length 100 characters in a dataset of 200k rows.
Best case scenario with early outs, you would do 100 * 200k = 20m comparisons
Worst case scenario near missing on every compare, you would do (5 * 100) * 200k = 100m comparisons
Generally I would:
reorder your query to filter out as much as possible in advance prior to string matching
limit number of the results if you don't need all of them at once (TOP x)
reduce the number characters in your search term
reduce the number of search terms by filtering out terms that are likely to match a lot, or not at all (if applicable)
cache query results if possible (however cache invalidation can get pretty tricky if you want to do it right)
Try this:
SELECT *
FROM table
WHERE CONTAINS((column1, column2, column3), '"*keyword*"')
Instead of this:
SELECT *
FROM table
WHERE CONTAINS(column1, '"*keyword*"')
OR CONTAINS(column2, '"*keyword*"')
OR CONTAINS(column3y, '"*keyword*"')
The first one is a lot faster.
CONTAINS does a lot of extra work. There's a few things to note here:
NVarChar is always faster, so do CONTAINS(column, N'text')
If all you want to do is see if the word is in there, compare the performance to column LIKE '%' + text + '%'.
Compare query plans before and after, did it go to a table scan? If so, post more so we can figure out why.
In ultimo, you can break up the text's individual words into a separate table so they can be indexed.