Find preceding and following rows for a matching row in BigQuery? - google-bigquery

Is it possible to find rows preceding and following a matching rows in a BigQuery query? For example if I do:
select textPayload from logs.logs_20160709 where textPayload like "%something%"
and say that I get these results back:
something A
something B
How can I also show the 3 rows preceding and following the matching rows? Something like this:
some text 1
some text 2
some text 3
something A
some text 4
some text 5
some text 6
some text 90
some text 91
some text 92
something B
some text 93
some text 94
some text 95
Is this possible and if so how?

While on Zuma Beach - I was thinking of avoiding CROSS JOIN in my original answer.
Check below - should be much cheaper especially for big set
SELECT textPayload
FROM (
SELECT textPayload,
SUM(match) OVER(ORDER BY ts ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING) AS flag
FROM (
SELECT textPayload, ts, IF(textPayload CONTAINS 'something', 1, 0) AS match
FROM YourTable
)
)
WHERE flag > 0
Of course another way to avoid cross join is to use BigQuery Standard SQL. But still - above solution with no joins at all is better than my original answer

I think, one piece is missing in your example - extra field that will define the order, so I added ts field for this in my answer. This mean I assume your table has two fields involved : textPayload and ts
Try below. Should give you exactly what you need
SELECT
all.textPayload
FROM (
SELECT start, finish
FROM (
SELECT textPayload,
LAG(ts, 3) OVER(ORDER BY ts ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS start,
LEAD(ts, 3) OVER(ORDER BY ts ROWS BETWEEN CURRENT ROW AND 3 FOLLOWING) AS finish
FROM YourTable
)
WHERE textPayload CONTAINS 'something'
) AS matches
CROSS JOIN YourTable AS all
WHERE all.ts BETWEEN matches.start AND matches.finish
Please note: depends on type of your ts field - you might need to do some data casting in query for this field. hope not

Related

How to check max from range in cursor?

I have a problem with transferring an Excel formula to SQL. My excel formula is: =IF(P2<(MAX($P$2:P2));"Move";"").
The P column in excel is a sequence of numbers.
a | b
------
1
2
7
3 MOVE
4 MOVE
8
9
5 MOVE
10
You can find more example on this screenshot:
I created a cursor with a loop but I don't know how to check max from range.
For example when I iterate for fourth row, I have to check max from 1-4 row etc.
No need for a cursor and a loop. Assuming that you have a column that defines the ordering of the rows (say, id), you can use window functions:
select t.*,
case when a < max(a) over(order by id) then 'MOVE' end as b
from mytable t
One option would be using MAX() Analytic function . But in any case, you'd have an extra column such as id for ordering in order to determine the max value for the current row from the first row, since SQL statements represent unordered sets. If you have that id column with values ordered as in your sample data, then consider using
WITH t2 AS
(
SELECT MAX(a) OVER (ORDER BY id ROWS BETWEEN
UNBOUNDED PRECEDING
AND
CURRENT ROW) AS max_upto_this_row,
t.*
FROM t
)
SELECT a, CASE WHEN max_upto_this_row > a THEN 'Move' END AS b
FROM t2
ORDER BY id;
Demo

where clause with = sign matches multiple records while expected just one record

I have a simple inline view that contains 2 columns.
-----------------
rn | val
-----------------
0 | A
... | ...
25 | Z
I am trying to select a val by matching the rn randomly by using the dbms_random.value() method as in
with d (rn, val) as
(
select level-1, chr(64+level) from dual connect by level <= 26
)
select * from d
where rn = floor(dbms_random.value()*25)
;
My expectation is it should return one row only without failing.
But now and then I get multiple rows returned or no rows at all.
on the other hand,
>>select floor(dbms_random.value()*25) from dual connect by level <1000
returns a whole number for each row and I failed to see any abnormality.
What am I missing here?
The problem is that the random value is recalculated for each row. So, you might get two random values that match the value -- or go through all the values and never get a hit.
One way to get around this is:
select d.*
from (select d.*
from d
order by dbms_random.value()
) d
where rownum = 1;
There are more efficient ways to calculate a random number, but this is intended to be a simple modification to your existing query.
You also might want to ask another question. This question starts with a description of a table that is not used, and then the question is about a query that doesn't use the table. Ask another question, describing the table and the real problem you are having -- along with sample data and desired results.

SQL Combing the top 2 field values into 1 value

I have a very simple query that returns the Notes field. Since there can be multiple notes, I only want the top 2. No problem. However, I'm going to be using the sql within another query. I really don't want 2 lines in my results. I would like to combine the results into 1 field value so I only have 1 result line in the results. Is this possible?
For example, I currently get the following:
12345 1001 500.00 "Note 1"
12345 1001 500.00 "Note 2"
What I would like to see is this:
12345 1001 500.00 "Note 1 AND Note 2"
Following is the sql:
select top 2 rcai.field_value
from rnt_agrs ra
inner join rnt_agr_inv_notes rain on ra.rnt_agr_nbr=rain.rea_rnt_agr_nbr
inner join RNT_CUST_ADDNL_INFO rcai on rain.rea_rnt_agr_nbr=rcai.rea_rnt_agr_nbr and rain.bac_acc_id=rcai.bac_acct_id
where ra.rnt_agr_nbr=128260511
Thanks for your help. I appreciate this forum for help with these issues.....
Get the next row's value and filter all but the first row:
select ..., rcai.field_value || ' AND '
min(rcai.field_value) -- next row's value (same as LEAD in Standard SQL)
over (partition by ra.rnt_agr_nbr
order by rcai.field_value
rows between 1 following and 1 following) as next_field_value
from rnt_agrs ra
inner join rnt_agr_inv_notes rain on ra.rnt_agr_nbr=rain.rea_rnt_agr_nbr
inner join RNT_CUST_ADDNL_INFO rcai on rain.rea_rnt_agr_nbr=rcai.rea_rnt_agr_nbr and rain.bac_acc_id=rcai.bac_acct_id
where ra.rnt_agr_nbr=128260511
qualify
row_number() -- only the first row
over (partition by ra.rnt_agr_nbr
order by rcai.field_value) = 1
If there might be only a single row you need to add a COALESCE(min...,'') to get rid of the NULL.
Both OLAP functions specify the same PARTITION and ORDER, so this is a single working step.
select *,(SELECT top 2 rcai.field_value + ' AND ' AS [text()]
FROM RNT_CUST_ADDNL_INFO rcai
WHERE rcai.rea_rnt_agr_nbr = rain.rea_rnt_agr_nbr
AND rcai.bac_acct_id=rain.bac_acc_id
FOR XML PATH('')) AS Notes
from
rnt_agrs ra inner join rnt_agr_inv_notes rain
on ra.rnt_agr_nbr=rain.rea_rnt_agr_nbr
I had something like this, where there was a 1 to many, and I wanted a semicolon delimited set of values in a single column with the main record.
You could use PIVOT to transform the two note rows into two note columns based on row number, then concatenate them. Here's an example:
SELECT pvt.[1] + ' and ' + pvt.[2]
FROM
( --the selection of your table data, including a row-number column
SELECT Msg, ROW_NUMBER() OVER(ORDER BY Id)
--sample data shown here, but this would be your real table
FROM (VALUES(1, 'Note 1'), (2, 'Note 2'), (3, 'Note 3')) Note(Id, Msg)
) Data (Msg, Row)
PIVOT (MAX(Msg) FOR Row IN ([1], [2])) pvt
Note that MAX is used for the aggregate in the PIVOT since an aggregate is required, but since ROW_NUMBER is unique, you're only aggregating a single value.
This could also be easily extended to the first N rows - just include the row numbers you want in the pivot and combine them as desired in the select statement.

SQL Query to remove cyclic redundancy

I have a table that looks like this:
Column A | Column B | Counter
---------------------------------------------
A | B | 53
B | C | 23
A | D | 11
C | B | 22
I need to remove the last row because it's cyclic to the second row. Can't seem to figure out how to do it.
EDIT
There is an indexed date field. This is for Sankey diagram. The data in the sample table is actually the result of a query. The underlying table has:
date | source node | target node | path count
The query to build the table is:
SELECT source_node, target_node, COUNT(1)
FROM sankey_table
WHERE TO_CHAR(data_date, 'yyyy-mm-dd')='2013-08-19'
GROUP BY source_node, target_node
In the sample, the last row C to B is going backwards and I need to ignore it or the Sankey won't display. I need to only show forward path.
Removing all edges from your graph where the tuple (source_node, target_node) is not ordered alphabetically and the symmetric row exists should give you what you want:
DELETE
FROM sankey_table t1
WHERE source_node > target_node
AND EXISTS (
SELECT NULL from sankey_table t2
WHERE t2.source_node = t1.target_node
AND t2.target_node = t1.source_node)
If you don't want to DELETE them, just use this WHERE clause in your query for generating the input for the diagram.
If you can adjust how your table is populated, you can change the query you're using to only retrieve the values for the first direction (for that date) in the first place, with a little bit an analytic manipulation:
SELECT source_node, target_node, counter FROM (
SELECT source_node,
target_node,
COUNT(*) OVER (PARTITION BY source_node, target_node) AS counter,
RANK () OVER (PARTITION BY GREATEST(source_node, target_node),
LEAST(source_node, target_node), TRUNC(data_date)
ORDER BY data_date) AS rnk
FROM sankey_table
WHERE TO_CHAR(data_date, 'yyyy-mm-dd')='2013-08-19'
)
WHERE rnk = 1;
The inner query gets the same data you collect now but adds a ranking column, which will be 1 for the first row for any source/target pair in any order for a given day. The outer query then just ignores everything else.
This might be a candidate for a materialised view if you're truncating and repopulating it daily.
If you can't change your intermediate table but can still see the underlying table you could join back to it using the same kind of idea; assuming the table you're querying from is called sankey_agg_table:
SELECT sat.source_node, sat.target_node, sat.counter
FROM sankey_agg_table sat
JOIN (SELECT source_node, target_node,
RANK () OVER (PARTITION BY GREATEST(source_node, target_node),
LEAST(source_node, target_node), TRUNC(data_date)
ORDER BY data_date) AS rnk
FROM sankey_table) st
ON st.source_node = sat.source_node
AND st.target_node = sat.target_node
AND st.rnk = 1;
SQL Fiddle demos.
DELETE FROM yourTable
where [Column A]='C'
given that these are all your rows
EDIT
I would recommend that you clean up your source data if you can, i.e. delete the rows that you call backwards, if those rows are incorrect as you state in your comments.

SQL Select using distinct and Cast [duplicate]

This question already exists:
Closed 10 years ago.
Possible Duplicate:
SQL Select DISTINCT using CAST
Let me try this one more time... I'm not a sql guy so please bear with me as I try to explain this... I have a table called t_recordkeepingleg with three columns of data. Column1 is named LEGTRIPNUMBER that happens to be a string that starts with the letter Q followed by 4 numbers. I need to strip off the Q and convert the remaining 4 characters (numbers) to an integer. Everyone with me so far? Column2 of this table is named LEGDATE. Column3 is named LEGGROUP.
Here's the input scenario
LEGTRIPNUMBER LEGDATE LEGGROUP
Q1001 08/12/12 0001
Q1001 09/15/12 0002
Q1002 09/01/12 0001
Q1002 09/08/12 0003
Q1002 09/09/12 0002
As you can see the input table has rows where LEGTRIPNUMBER occurs more than once. I only want the first occurrence.
This is my current select statement - it works but returns all rows.
SELECT *,
CAST(
substring("t_RecordkeepingLeg"."LEGTRIPNUMBER",2,4) as INT
) as Num_Trip_Num
FROM "1669"."dbo"."t_RecordkeepingLeg" "t_RecordkeepingLeg"
Where left "t_RecordkeepingLeg"."LEGTRIPNUMBER",1) = 'Q'
I want to modify this so that it only selects ONE occurance of the Qnnnn. When the row gets selected I want to have LEGDATE and LEGGROUP available to me. How do I do this?
Thank you,
Can it be as simple as below? I've just added condiotion on leggroup being 0001
SELECT *,
CAST(substring("t_RecordkeepingLeg"."LEGTRIPNUMBER",2,4) as INT) as Num_Trip_Num
FROM "1669"."dbo"."t_RecordkeepingLeg" "t_RecordkeepingLeg"
Where left ("t_RecordkeepingLeg"."LEGTRIPNUMBER",1) = 'Q'
and "t_RecordkeepingLeg"."LEGGROUP"='0001'
If you have a unique primay key in your table you can do something like the below;
SELECT CAST(
substring("t_RecordkeepingLeg"."LEGTRIPNUMBER",2,4) as INT
) as Num_Trip_Num
FROM "1669"."dbo"."t_RecordkeepingLeg" "t_RecordkeepingLeg"
Where "t_RecordkeepingLeg"."ID" In(
Select Min("t_RecordkeepingLeg"."ID")
From "1669"."dbo"."t_RecordkeepingLeg" "t_RecordkeepingLeg"
Where left ("t_RecordkeepingLeg"."LEGTRIPNUMBER",1) = 'Q'
Group By "t_RecordkeepingLeg"."LEGTRIPNUMBER"
)
Which values of LEGDATE & LEGGROUP do you want for the distinct LEGTRIPNUMBER? there are multiple non-distinct possibilities and the concept of "first occurrence" is only valid with an explicit order.
To get the values where LEGDATE is the earliest for example;
select Num_Trip_Num, LEGDATE, LEGGROUP from (
select
cast(substring(t_RecordkeepingLeg.LEGTRIPNUMBER, 2, 4) as INT) as Num_Trip_Num,
row_number() over (partition by substring(t_RecordkeepingLeg.LEGTRIPNUMBER, 2, 4) order by t_RecordkeepingLeg.LEGDATE asc) as row,
t_RecordkeepingLeg.LEGDATE,
t_RecordkeepingLeg.LEGGROUP
from t_RecordkeepingLeg
where left (t_RecordkeepingLeg.LEGTRIPNUMBER, 1) = 'Q'
) T
where row = 1