PostgreSQL Notice: word is too long to be indexed error - sql

I am trying to do a full-text search with PostgreSQL.
Query:
SELECT * ,
count(*) OVER(order by id DESC) AS full_count,
ts_headline(
'english',
"content",
websearch_to_tsquery('test1 test2 test3 test4'),
'StartSel=<mark>, StopSel=</mark>, HighlightAll=true, MaxWords=35, MinWords=1, ShortWord=3'
) AS content_highlighted,
ts_rank_cd(to_tsvector('english', content), websearch_to_tsquery('test1 test2 test3 test4')) AS rank
FROM book_c_bookitem bi
WHERE to_tsvector('english', content) @@ websearch_to_tsquery('test1 test2 test3 test4')
order by id DESC, is_latest DESC , rank DESC
limit 30
offset 0
When I search for a single word with websearch_to_tsquery, everything is fine. But when I try to search for two or more words, I get this error:
canceling statement due to statement timeout
How can I solve this problem?
Update
Thanks to @LaurenzAlbe. After adding EXPLAIN (ANALYZE, BUFFERS) to the query, I got this notice:
Notice: word is too long to be indexed - set successfully executed
How can I fix this?
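The statement timeout usually means every row's content is being re-parsed with to_tsvector at query time. A precomputed full-text index normally resolves that; below is a minimal sketch, assuming the table and column names from the question (book_c_bookitem, content), with a made-up index name. The notice itself is harmless: words longer than 2047 characters (for example embedded base64 blobs or very long URLs) are simply skipped by to_tsvector.
-- A minimal sketch, assuming the table/column from the question;
-- the index name is made up for illustration.
CREATE INDEX book_c_bookitem_content_fts_idx
    ON book_c_bookitem
    USING GIN (to_tsvector('english', content));
-- The WHERE clause must use the exact same expression for the planner to use the index:
-- WHERE to_tsvector('english', content) @@ websearch_to_tsquery('test1 test2 test3 test4')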

Related

How to run SQL queries with multiple WITH clauses (sub-query refactoring)?

I have a code block that has 7 or 8 WITH clauses (sub-query refactoring) in its queries. I'm looking for how to run this query, as I'm getting 'SQL compilation errors' in Snowflake when running it. For example:
with valid_Cars_Stock as (
select car_id
from vw_standard.agile_car_issue_dime
where car_stock_expiration_ts is null
and car_type_name in ('hatchback')
and car_id = 1102423975
)
, car_sale_hist as (
select vw.issue_id, vw.delivery_effective_ts, bm.car_id,
lag(bm.sprint_id) over (partition by vw.issue_id order by vw.delivery_effective_ts) as previous_stock_id
from valid_Cars_Stock i
join vw_standard.agile_car_fact vw on vw.car_id = bm.car_id
left join vw_standard.agile_board_stock_bridge b on b.board_stock_bridge_dim_key = vw.issue_board_sprint_bridge_dim_key
order by vw.car_stock_expiration_ts desc
)
,
So how can I run these two queries separately or together? I'm new to SQL as well, so any help would be ideal.
So let's just reformat that code as it stands:
with valid_Cars_Stock as (
select
car_id
from vw_standard.agile_car_issue_dime
where car_stock_expiration_ts is null
and car_type_name in ('hatchback')
and car_id = 1102423975
), car_sale_hist as (
select
vw.issue_id,
vw.delivery_effective_ts,
bm.car_id,
lag(bm.sprint_id) over (partition by vw.issue_id order by vw.delivery_effective_ts) as previous_stock_id
from valid_Cars_Stock i
join vw_standard.agile_car_fact vw
on vw.car_id = bm.car_id
left join vw_standard.agile_board_stock_bridge b
on b.board_stock_bridge_dim_key = vw.issue_board_sprint_bridge_dim_key
order by vw.car_stock_expiration_ts desc
),
These are clearly part of a larger block of code.
As an aside on CTEs: you should 100% ignore anything anyone (including me) says about their performance. They are two things: syntactic sugar, and a way to avoid repetition, hence the name Common Table Expression. They CAN perform better than temp tables, AND they CAN perform worse than just repeating the same SQL many times in the same block. There is no one rule. Testing is the only way to find out what is "fastest" for your SQL, and that can and does change as updates/releases are made. So I am ignoring the performance comments.
If I am trying to run a chain like this to debug it, I alter the point where I would like to stop, normally like so:
with valid_Cars_Stock as (
select
car_id
from vw_standard.agile_car_issue_dime
where car_stock_expiration_ts is null
and car_type_name in ('hatchback')
and car_id = 1102423975
)--, car_sale_hist as (
select
vw.issue_id,
vw.delivery_effective_ts,
bm.car_id,
lag(bm.sprint_id) over (partition by vw.issue_id order by vw.delivery_effective_ts) as previous_stock_id
from valid_Cars_Stock i
join vw_standard.agile_car_fact vw
on vw.car_id = bm.car_id
left join vw_standard.agile_board_stock_bridge b
on b.board_stock_bridge_dim_key = vw.issue_board_sprint_bridge_dim_key
order by vw.car_stock_expiration_ts desc
;), NEXT_AWESOME_CTE_THAT_TOTALLY_MAKES_SENSE (
-- .....
and now the result of car_sale_hist will be returned, because we "completed" the CTE chain by not "starting another", and the ; stops everything after it from being part of this SQL block.
Then once you have that step working nicely, remove the semicolon and the end-of-line comments, and get on with the real value.
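Another way to test one CTE at a time, without commenting anything out, is to temporarily make a plain select from the CTE you want to inspect the final query of the chain; a sketch using the first CTE from the question:
with valid_Cars_Stock as (
    select car_id
    from vw_standard.agile_car_issue_dime
    where car_stock_expiration_ts is null
      and car_type_name in ('hatchback')
      and car_id = 1102423975
)
-- temporary final query, just for debugging; swap the rest of the chain back in once this looks right
select *
from valid_Cars_Stock;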

Why does this query give a different result when it's used as a sub-query?

I have this data:
"config_timeslice_id","config_id","created"
14326,1145,"2021-08-31 13:45:00"
14325,1145,"2021-08-22 13:34:51"
14321,1145,"2021-06-16 10:47:59"
2357,942,"2019-12-24 10:09:38"
When I run this query:
SELECT config_timeslice_id
FROM config_timeslice
WHERE config_id = 1145
AND created <= CURRENT_TIMESTAMP
ORDER BY created DESC
LIMIT 1
I get 14325, as I would expect, because today is 2021-08-23.
But when I run this query:
SELECT DISTINCT t.config_id,
(
SELECT config_timeslice_id
FROM config_timeslice
WHERE config_id = t.config_id
AND created <= CURRENT_TIMESTAMP
ORDER BY created DESC
LIMIT 1
) AS ts_id
FROM config_timeslice t
I get:
config_id,ts_id
942,2357
1145,14321
I can’t figure out why the second row doesn’t give 14325.
MariaDB 10.4.18 must have a bug. When I upgraded to 10.4.21, it works.
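If upgrading is not an option, rewriting the correlated LIMIT 1 subquery with a window function may sidestep the bug. A sketch, using the table and columns from the question (MariaDB 10.2+ supports ROW_NUMBER()):
SELECT config_id, config_timeslice_id AS ts_id
FROM (
    SELECT config_id,
           config_timeslice_id,
           ROW_NUMBER() OVER (PARTITION BY config_id
                              ORDER BY created DESC) AS rn
    FROM config_timeslice
    WHERE created <= CURRENT_TIMESTAMP
) latest
WHERE rn = 1;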

BigQuery MERGE unexpected row duplication

I am using a standard SQL MERGE to update a regular target table based on a source external table that is a set of TSV files in a bucket. Here is a simplified input file:
$ gsutil cat gs://dolphin-dev-raw/demo/input/demo_20191125_20200505050505.tsv
"id" "PortfolioCode" "ValuationDate" "load_checksum"
"1" "CIMDI000TT" "2020-03-28" "checksum1"
The MERGE statement is:
MERGE xx_producer_conformed.demo T
USING xx_producer_raw.demo_raw S
ON
S.id = T.id
WHEN NOT MATCHED THEN
INSERT (id, PortfolioCode, ValuationDate, load_checksum, insert_time, file_name, extract_timestamp, wf_id)
VALUES (id, PortfolioCode, ValuationDate, load_checksum, CURRENT_TIMESTAMP(), _FILE_NAME, REGEXP_EXTRACT(_FILE_NAME, '.*_[0-9]{8}_([0-9]{14}).tsv'),'scheduled__2020-08-19T16:24:00+00:00')
WHEN MATCHED AND S.load_checksum != T.load_checksum THEN UPDATE SET
T.id = S.id, T.PortfolioCode = S.PortfolioCode, T.ValuationDate = S.ValuationDate, T.load_checksum = S.load_checksum, T.file_name = S._FILE_NAME, T.extract_timestamp = REGEXP_EXTRACT(_FILE_NAME, '.*_[0-9]{8}_([0-9]{14}).tsv'), T.wf_id = 'scheduled__2020-08-19T16:24:00+00:00'
If I wipe the target table and rerun the MERGE, I get a row modified count of 1:
bq query --use_legacy_sql=false --location=asia-east2 "$(cat merge.sql | awk 'ORS=" "')"
Waiting on bqjob_r288f8d33_000001740b413532_1 ... (0s) Current status: DONE
Number of affected rows: 1
This successfully results in the target table updating:
$ bq query --format=csv --max_rows=10 --use_legacy_sql=false "select * from ta_producer_conformed.demo"
Waiting on bqjob_r7f6b6a46_000001740b5057a3_1 ... (0s) Current status: DONE
id,PortfolioCode,ValuationDate,load_checksum,insert_time,file_name,extract_timestamp,wf_id
1,CIMDI000TT,2020-03-28,checksum1,2020-08-20 09:44:20,gs://dolphin-dev-raw/demo/input/demo_20191125_20200505050505.tsv,20200505050505,scheduled__2020-08-19T16:24:00+00:00
If I rerun the MERGE, I get a row modified count of 0:
$ bq query --use_legacy_sql=false --location=asia-east2 "$(cat merge.sql | awk 'ORS=" "')"
Waiting on bqjob_r3de2f833_000001740b4161b3_1 ... (0s) Current status: DONE
Number of affected rows: 0
That results in no changes to the target table. So everything is working as expected.
The problem is that when I run the code on a more complex example with many input files to insert into an empty target table I end up with rows that have the same id where count(id) is not equal to count(distinct id):
$ bq query --use_legacy_sql=false --max_rows=999999 --location=asia-east2 "select count(id) as total_records from xx_producer_conformed.xxx; select count(distinct id) as unique_records from xx_producer_conformed.xxx; "
Waiting on bqjob_r5df5bec8_000001740b7dfa50_1 ... (1s) Current status: DONE
select count(id) as total_records from xx_producer_conformed.xxx; -- at [1:1]
+---------------+
| total_records |
+---------------+
| 11582 |
+---------------+
select count(distinct id) as unique_records from xx_producer_conformed.xxx; -- at [1:78]
+----------------+
| unique_records |
+----------------+
| 5722 |
+----------------+
This surprises me, as my expectation was that the underlying logic would step through each line in each underlying file, insert on the first occurrence of an id, and then update on any subsequent occurrence. So my expectation is that you cannot have more rows than unique ids in the input bucket.
If I then try to run the MERGE again it fails telling me that there is more than one row in the target table with the same id:
$ bq query --use_legacy_sql=false --location=asia-east2 "$(cat merge.sql | awk 'ORS=" "')"
Waiting on bqjob_r2fe783fc_000001740b8271aa_1 ... (0s) Current status: DONE
Error in query string: Error processing job 'xxxx-10843454-datamesh-
dev:bqjob_r2fe783fc_000001740b8271aa_1': UPDATE/MERGE must match at most one
source row for each target row
I was expecting that there would be no two rows with the same "id" when the MERGE statement does its inserts.
All the tables and queries used are generated from a file that lists the "business columns". So the simple demo example above is identical to the full-scale queries in terms of the logic and joins in the MERGE statement.
Why would the MERGE query above result in rows with duplicated "id" and how do I fix this?
The problem is very easily repeatable by wiping the target table and using a relatively large input file plus a clone of it as the input:
AAAA_20200805_20200814200000.tsv
AAAA_clone_20200805_20200814200000.tsv
I believe that what is at the heart of this is parallelism. A single large MERGE of many files can spawn many worker threads in parallel. It would be very slow for any two worker threads running in parallel, loading different files, to immediately "see" each other's inserts. Rather, I expect that they run independently and write into separate buffers without "seeing" each other's writes. When the buffers are finally combined, it leads to multiple inserts with the same id.
To fix this I am using some CTEs to pick the latest record for any id, based on extract_timestamp, using ROW_NUMBER() OVER (PARTITION BY id ORDER BY extract_timestamp DESC). We can then filter on row_num = 1 to pick the latest version of each record. The full query is:
MERGE xx_producer_conformed.demo T
USING (
WITH cteExtractTimestamp AS (
SELECT
id, PortfolioCode, ValuationDate, load_checksum
, _FILE_NAME
, REGEXP_EXTRACT(_FILE_NAME, '.*_[0-9]{8}_([0-9]{14}).tsv') AS extract_timestamp
FROM
xx_producer_raw.demo_raw
),
cteRanked AS (
SELECT
id, PortfolioCode, ValuationDate, load_checksum
, _FILE_NAME
, extract_timestamp
, ROW_NUMBER() OVER (PARTITION BY id ORDER BY extract_timestamp DESC) AS row_num
FROM
cteExtractTimestamp
)
SELECT
id, PortfolioCode, ValuationDate, load_checksum
, _FILE_NAME
, extract_timestamp
, row_num
, "{{ task_instance.xcom_pull(task_ids='get_run_id') }}" AS wf_id
FROM cteRanked
WHERE row_num = 1
) S
ON
S.id = T.id
WHEN NOT MATCHED THEN
INSERT (id, PortfolioCode, ValuationDate, load_checksum, insert_time, file_name, extract_timestamp, wf_id)
VALUES (id, PortfolioCode, ValuationDate, load_checksum, CURRENT_TIMESTAMP(), _FILE_NAME, extract_timestamp, wf_id)
WHEN MATCHED AND S.load_checksum != T.load_checksum THEN UPDATE SET
T.id = S.id, T.PortfolioCode = S.PortfolioCode, T.ValuationDate = S.ValuationDate, T.load_checksum = S.load_checksum, T.file_name = S._FILE_NAME, T.extract_timestamp = S.extract_timestamp, T.wf_id = S.wf_id
This means that cloning a file and not changing the extract_timestamp in the filename will pick one of the two rows at random. In normal running we would expect subsequent extracts that contain updated data to arrive as a source file with a new extract_timestamp. The above query will then pick the newest record to merge into the target table.
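If a deterministic result is wanted even for cloned files that share an extract_timestamp, one option (a sketch, not part of the original fix) is to add a further tie-breaker such as the file name to the window ordering, i.e. replace cteRanked above with:
cteRanked AS (
    SELECT
        id, PortfolioCode, ValuationDate, load_checksum
        , _FILE_NAME
        , extract_timestamp
        , ROW_NUMBER() OVER (PARTITION BY id
                             ORDER BY extract_timestamp DESC, _FILE_NAME DESC) AS row_num
    FROM
        cteExtractTimestamp
)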

Oracle SQL Update statement with value generated in subquery

I am trying to write an UPDATE statement that sets a value calculated in a subquery, and I'm having limited success.
The statement I've tried so far is:
update intuit.men_doc doc1
set doc1.doc_udf5 = (select
substr(doc.doc_dtyc, instr(doc.doc_dtyc, 'GAPP-', 2)+5 )||'_'||row_number() over(partition by
doc.doc_dtyc order by doc.doc_cret) docDeleteId
from
intuit.men_doc doc
where
doc.doc_dtyc != 'DM-GAPP-SFUL'
and doc.doc_dtyc like 'DM-GAPP%'
and doc.doc_cred >= '01/Oct/2017' and doc.doc_cred < '01/Oct/2018'
and doc1.doc_code = doc.doc_code
)
This gives me the following error message:
ERROR: Error 1427 was encountered whilst running the SQL command. (-3)
Error -3 running SQL : ORA-01427: single-row subquery returns more than one row
I don't have much experience with UPDATE statements, so any advice on how I can rewrite this so that I can update a few thousand records at once would be appreciated.
EDIT: Adding example data
Example data:
MEN_DOC
DOC_CODE   DOC_DTYC   DOC_UDF5   DOC_CRED
123456A    CV                    08/Nov/2017
456789B    CV                    11/Jan/2018
789123C    CV                    15/Feb/2018
123987B    TRAN                  01/Dec/2017
How I want the data to look once the script is run
MEN_DOC
DOC_CODE   DOC_DTYC   DOC_UDF5   DOC_CRED
123456A    CV         CV_1       08/Nov/2017
456789B    CV         CV_2       11/Jan/2018
789123C    CV         CV_3       15/Feb/2018
123987B    TRAN       TRAN_1     01/Dec/2017
Thanks
You are using row_number(), which suggests that you expect the subquery to return more than one row. The inequality on doc_dtyc supports this interpretation.
Just change the row_number() to count(*), so you have an aggregation which will always return one row and get the sequential count you want:
update intuit.men_doc doc1
set doc1.doc_udf5 = (select substr(doc.doc_dtyc, instr(doc.doc_dtyc, 'GAPP-', 2)+5 ) ||'_'|| count(*) docDeleteId
from intuit.men_doc doc
where doc.doc_dtyc <> 'DM-GAPP-SFUL' and
doc.doc_dtyc like 'DM-GAPP%' and
doc.doc_cred >= date '2017-10-01' and
doc.doc_cred < date '2018-10-01' and
doc1.doc_code = doc.doc_code
);
You can use your select as the source table in a MERGE, like here:
merge into men_doc tgt
using (select doc_code,
doc_dtyc||'_'||row_number() over (partition by doc_dtyc order by doc_cred) as calc
from men_doc) src
on (tgt.doc_code = src.doc_code)
when matched then update set tgt.doc_udf5 = src.calc;
dbfiddle
I assumed that doc_code is unique.
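A quick way to check that assumption before running the MERGE (a sketch, using the table from the question):
select doc_code, count(*) as cnt
from men_doc
group by doc_code
having count(*) > 1;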

BigQuery query fails the first time and successfully completes the 2nd time

I'm executing the following query.
SELECT properties.os, boundary, user, td,
SUM(boundary) OVER(ORDER BY rows) AS session
FROM
(
SELECT properties.os, ROW_NUMBER() OVER() AS rows, user, td,
CASE WHEN td > 1800 THEN 1 ELSE 0 END AS boundary
FROM (
SELECT properties.os, t1.properties.distinct_id AS user,
(t2.properties.time - t1.properties.time) AS td
FROM (
SELECT properties.os, properties.distinct_id, properties.time, srlno,
srlno-1 AS prev_srlno
FROM (
SELECT properties.os, properties.distinct_id, properties.time,
ROW_NUMBER()
OVER (PARTITION BY properties.distinct_id
ORDER BY properties.time) AS srlno
FROM [ziptrips.ziptrips_events]
WHERE properties.time > 1367916800
AND properties.time < 1380003200)) AS t1
JOIN (
SELECT properties.distinct_id, properties.time, srlno,
srlno-1 AS prev_srlno
FROM (
SELECT properties.distinct_id, properties.time,
ROW_NUMBER() OVER
(PARTITION BY properties.distinct_id ORDER BY properties.time) AS srlno
FROM [ziptrips.ziptrips_events]
WHERE
properties.time > 1367916800
AND properties.time < 1380003200 )) AS t2
ON t1.srlno = t2.prev_srlno
AND t1.properties.distinct_id = t2.properties.distinct_id
WHERE (t2.properties.time - t1.properties.time) > 0))
It fails the first time with the following error; however, on the second run it completes without any issue. I'd appreciate any pointers on what might be causing this.
The error message is:
Query Failed
Error: Field 'properties.os' not found in table '__R2'.
Job ID: job_VWunPesUJVLxWGZsMgpoti14BM4
Thanks,
Navneet
We (the BigQuery team) are in the process of rolling out a new version of the query engine that fixes a number of issues like this one. You likely hit an old version of the query engine and then when you retried, hit the new one. It may take us a day or so with a portion of traffic pointing at the updated version in order to verify there aren't any regressions. Please let us know if you hit this again after 24 hours or so.