Implicit Flattening in BigQuery - google-bigquery

When does BigQuery flatten an intermediate result set? I was under the impression that it was only when FLATTEN was invoked, but I've encountered an example where the result is flattened without a FLATTEN.
This is the case - this base query returns one record:
select count(*) from publicdata:samples.trigrams
where ngram = 'der Griindung im'
+-----+
| f0_ |
+-----+
| 1 |
+-----+
When queried, you can see that the record has a repeated field that is repeated twice.
select * from publicdata:samples.trigrams
where ngram = 'der Griindung im'
+------------------+-------+-----------+-------+--------+-------+------------+-------------------+----------------------+-----------------+------------------+----------------+------------------+-------------------+----------------------+---------------------+-----------------+
| ngram | first | second | third | fourth | fifth | cell_value | cell_volume_count | cell_volume_fraction | cell_page_count | cell_match_count | cell_sample_id | cell_sample_text | cell_sample_title | cell_sample_subtitle | cell_sample_authors | cell_sample_url |
+------------------+-------+-----------+-------+--------+-------+------------+-------------------+----------------------+-----------------+------------------+----------------+------------------+-------------------+----------------------+---------------------+-----------------+
| der Griindung im | der | Griindung | im | NULL | NULL | 2007 | 54 | 0.008746355685131196 | 54 | 54 | NULL | NULL | NULL | NULL | NULL | NULL |
| der Griindung im | der | Griindung | im | NULL | NULL | 2008 | 47 | 0.007612568837058633 | 47 | 47 | NULL | NULL | NULL | NULL | NULL | NULL |
+------------------+-------+-----------+-------+--------+-------+------------+-------------------+----------------------+-----------------+------------------+----------------+------------------+-------------------+----------------------+---------------------+-----------------+
When I add a filter on cell.value, I get two records instead of one - but I never flattened so I'm not sure about the behavior here. My expectation is that this would return the same output as the previous COUNT above. It doesn't:
select count(*) from publicdata:samples.trigrams
where ngram = 'der Griindung im' and cell.value in ('2007', '2008')
+-----+
| f0_ |
+-----+
| 2 |
+-----+
What this means is that while I expect select * from publicdata:samples.trigrams where ngram = 'der Griindung im' and select * from publicdata:samples.trigrams where ngram = 'der Griindung im' and cell.value in ('2007', '2008') to return the same output, they don't because one is implicitly flattened and the other is not. While this may not seem like a huge issue, this could matter significantly if it was part of a nested query that expected an intermediate result to be flattened or repeated.
Under what conditions does BigQuery flatten results without an explicit FLATTEN?

First, here is how to get the correct count in this case.
Instead of
SELECT COUNT(*)
FROM [publicdata:samples.trigrams]
WHERE ngram = 'der Griindung im'
AND cell.value IN ('2007', '2008')
with result of
+-----+
| f0_ |
+-----+
| 2 |
+-----+
you should do
SELECT COUNT(*)
FROM [publicdata:samples.trigrams]
WHERE ngram = 'der Griindung im'
OMIT RECORD IF EVERY(cell.value NOT IN ('2007', '2008'))
with result of
+-----+
| f0_ |
+-----+
| 1 |
+-----+
which I think is what you expected.
Secondly - Under what conditions does BigQuery flatten results without an explicit FLATTEN?
I think (this is just my guess, based on observing BQ behavior) that every time you explicitly reference a record's field within a clause like SELECT or WHERE, it gets automatically flattened for you. Using the FLATTEN operator helps "control" this process.
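The difference between the two counts can be sketched with a toy model in plain Python. This is only an illustration of row-level versus record-level filtering, not BigQuery's actual implementation; the `flatten` helper and the dictionary layout are hypothetical.

```python
# Toy model: one record whose repeated field "cell" holds two values.
records = [
    {"ngram": "der Griindung im", "cell": [{"value": "2007"}, {"value": "2008"}]},
]
wanted = {"2007", "2008"}


def flatten(recs):
    """Yield one row per element of the repeated field (hypothetical helper)."""
    for rec in recs:
        for cell in rec["cell"]:
            yield {"ngram": rec["ngram"], "cell": cell}


# Filtering on cell.value after flattening counts one row per matching element.
flattened_count = sum(1 for row in flatten(records) if row["cell"]["value"] in wanted)

# Record-level filtering (the idea behind OMIT RECORD IF EVERY(... NOT IN ...))
# keeps a record if any of its elements matches, so each record counts once.
record_count = sum(
    1 for rec in records if any(c["value"] in wanted for c in rec["cell"])
)

print(flattened_count, record_count)
```

In this model the first count follows the number of repeated elements and the second follows the number of records, which mirrors the 2 versus 1 seen in the question.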

Short story: use count(0) instead of count(*). (You get 1 instead of 2.)
count(*) behaves strangely with repeated fields. It looks like the results are flattened, but if that were really the case, this should also affect count(0). I've asked about this here, but I haven't so far received a full explanation.

Related

SQL - Given sequence of data, how do I query the origin?

Let's assume we have the following data.
| UUID | SEENTIME | LAST_SEENTIME |
------------------------------------------------------
| UUID1 | 2020-11-10T05:00:00 | |
| UUID2 | 2020-11-10T05:01:00 | 2020-11-10T05:00:00 |
| UUID3 | 2020-11-10T05:03:00 | 2020-11-10T05:01:00 |
| UUID4 | 2020-11-10T05:04:00 | 2020-11-10T05:03:00 |
| UUID5 | 2020-11-10T05:07:00 | 2020-11-10T05:04:00 |
| UUID6 | 2020-11-10T05:08:00 | 2020-11-10T05:07:00 |
Each row is connected to the previous one via LAST_SEENTIME.
In such case, is there a way to use SQL to identify these connected events as one? I want to be able to calculate start and end to calculate the duration of this event.
You can use a recursive CTE. The exact syntax varies by database, but something like this:
with recursive cte as (
select uuid as orig_uuid, uuid, seentime
from t
where last_seentime is null
union all
select cte.orig_uuid, t.uuid, t.seentime
from cte join
t
on cte.seentime = t.last_seentime
)
select orig_uuid,
max(seentime) - min(seentime) -- or whatever your database uses
from cte
group by orig_uuid;
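As a rough way to try the recursive CTE, here is a sketch using Python's built-in sqlite3 module. It assumes SQLite rather than whatever database the question targets, stores timestamps as ISO-8601 text, and adds a second, hypothetical chain root (UUID9) to show that each chain is grouped separately. The duration is computed in Python to sidestep database-specific date arithmetic.

```python
import sqlite3
from datetime import datetime

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (uuid TEXT, seentime TEXT, last_seentime TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        ("UUID1", "2020-11-10T05:00:00", None),
        ("UUID2", "2020-11-10T05:01:00", "2020-11-10T05:00:00"),
        ("UUID3", "2020-11-10T05:03:00", "2020-11-10T05:01:00"),
        ("UUID4", "2020-11-10T05:04:00", "2020-11-10T05:03:00"),
        ("UUID9", "2020-11-10T09:00:00", None),  # hypothetical second chain
    ],
)

chain_query = """
WITH RECURSIVE cte AS (
    -- anchor: rows with no predecessor start a chain
    SELECT uuid AS orig_uuid, uuid, seentime
    FROM t
    WHERE last_seentime IS NULL
    UNION ALL
    -- step: follow the link from a row to the row that names it as predecessor
    SELECT cte.orig_uuid, t.uuid, t.seentime
    FROM cte JOIN t ON cte.seentime = t.last_seentime
)
SELECT orig_uuid, MIN(seentime), MAX(seentime), COUNT(*)
FROM cte
GROUP BY orig_uuid
ORDER BY orig_uuid
"""
chains = conn.execute(chain_query).fetchall()

# Duration of the first chain, computed from its start and end timestamps.
_, start, end, _ = chains[0]
duration = datetime.fromisoformat(end) - datetime.fromisoformat(start)
print(chains, duration)
```

Note that the join relies on SEENTIME values being unique; if two rows shared a timestamp, the chains would merge.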

Padding to the result of a DISTINCT Sqlite query

I searched and figured out that I could use either substr with || or a printf statement with format specifiers in order to add padding to the results, but that doesn't seem to work if I had DISTINCT in the sqlite query.
I've a table called timeLapse that looks like so:
+----+-------+-----------+
| ID | Time | Status |
+----+-------+-----------+
| 1 | 0.001 | Initiated |
| 1 | 0.002 | Cranked |
| 3 | 0.002 | Initiated |
| 2 | 0.002 | Initiated |
| 2 | 0.003 | Cranked |
+----+-------+-----------+
I could query the distinct IDs with something like SELECT distinct(ID) FROM timeLapse as IDs, which returns this:
+-----+
| IDs |
+-----+
| 1 |
| 2 |
| 3 |
+-----+
However, I would like to pad the resultant distinct rows like so:
+----------+
| IDs |
+----------+
| Object-1 |
| Object-2 |
| Object-3 |
+----------+
My query SELECT substr('Object-' || DISTINCT(ID), 10, 10) as IDs FROM timeLapse results in an error:
"[17:22:47] Error while executing SQL query on database 'machining': near "distinct": syntax error"
Could someone please help me understand what I am doing wrong here? I am enormously thankful for your time and help.
Get the distinct values first, then apply the substr() function:
select substr('Object-' || t1.ID, 1, 10) as IDs
from (SELECT DISTINCT(ID) ID FROM timeLapse) t1
see sqlfiddle
All credits to the user named ϻᴇᴛᴀʟ, as I only understood from their answer that I should have a sub-query within this query where the DISTINCT should go into.
This resolves my problem:
select printf('Object-%s', t1.ID) as IDs
FROM (SELECT DISTINCT(id) ID FROM timeLapse) t1
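The subquery approach can be checked with Python's built-in sqlite3 module, as in the sketch below. It also shows a variant that is not in the answers above: DISTINCT is a keyword that follows SELECT and applies to the whole result row rather than a function, so it can be placed in front of the padded expression directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE timeLapse (ID INTEGER, Time REAL, Status TEXT)")
conn.executemany(
    "INSERT INTO timeLapse VALUES (?, ?, ?)",
    [
        (1, 0.001, "Initiated"),
        (1, 0.002, "Cranked"),
        (3, 0.002, "Initiated"),
        (2, 0.002, "Initiated"),
        (2, 0.003, "Cranked"),
    ],
)

# Approach from the answers: de-duplicate in a subquery, then pad.
via_subquery = [
    row[0]
    for row in conn.execute(
        "SELECT printf('Object-%s', t1.ID) AS IDs "
        "FROM (SELECT DISTINCT ID FROM timeLapse) t1 "
        "ORDER BY t1.ID"
    )
]

# Variant: DISTINCT applied to the padded expression itself.
via_keyword = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT printf('Object-%s', ID) AS IDs "
        "FROM timeLapse ORDER BY IDs"
    )
]

print(via_subquery, via_keyword)
```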

SELECTing Related Rows Based on a Single Row Match

I have the following table running on PostgreSQL 9.5:
+---+------------+-------------+
|ID | trans_id | message |
+---+------------+-------------+
| 1 | 1234567 | abc123-ef |
| 2 | 1234567 | def234-gh |
| 3 | 1234567 | ghi567-ij |
| 4 | 8902345 | ced123-ef |
| 5 | 8902345 | def234-bz |
| 6 | 8902345 | ghi567-ij |
| 7 | 6789012 | abc123-ab |
| 8 | 6789012 | def234-cd |
| 9 | 6789012 | ghi567-ef |
|10 | 4567890 | abc123-ab |
|11 | 4567890 | gex890-aj |
|12 | 4567890 | ghi567-ef |
+---+------------+-------------+
I am looking for the rows for each trans_id based on a LIKE query, like this:
SELECT * FROM table
WHERE message LIKE '%def234%'
This, of course, returns just three rows, the three that match my pattern in the message column. What I am looking for, instead, is all the rows matching that trans_id in groups of messages that match. That is, if a single row matches the pattern, get all the rows with the trans_id of that matching row.
That is, the results would be:
+---+------------+-------------+
|ID | trans_id | message |
+---+------------+-------------+
| 1 | 1234567 | abc123-ef |
| 2 | 1234567 | def234-gh |
| 3 | 1234567 | ghi567-ij |
| 4 | 8902345 | ced123-ef |
| 5 | 8902345 | def234-bz |
| 6 | 8902345 | ghi567-ij |
| 7 | 6789012 | abc123-ab |
| 8 | 6789012 | def234-cd |
| 9 | 6789012 | ghi567-ef |
+---+------------+-------------+
Notice that rows 10, 11, and 12 were not SELECTed because none of them matched the %def234% pattern.
I have tried (and failed) to write a sub-query to get the all the related rows when a single message matches a pattern:
SELECT sub.*
FROM (
SELECT DISTINCT trans_id FROM table WHERE message LIKE '%def234%'
) sub
WHERE table.trans_id = sub.trans_id
I could easily do this with two queries, but the list of matching trans_ids returned by the first query, which I would then include in a WHERE trans_id IN (<huge list of trans_ids>) clause, would be very large. That would not be a very efficient way of doing this, and I believe there is a way to do it with a single query.
Thank you!
I think this will do the job:
WITH sub AS (
-- DISTINCT avoids duplicated output rows when several messages in one trans_id match
SELECT DISTINCT trans_id
FROM table
WHERE message LIKE '%def234%'
)
SELECT *
FROM table JOIN sub USING (trans_id);
Hope this helps.
Try this:
SELECT ID, trans_id, message
FROM (
SELECT ID, trans_id, message,
COUNT(*) FILTER (WHERE message LIKE '%def234%')
OVER (PARTITION BY trans_id) AS pattern_cnt
FROM mytable) AS t
WHERE pattern_cnt >= 1
Using a FILTER clause in the windowed version of COUNT function we can get the number of records matching the predefined pattern within each trans_id slice. The outer query uses this count to filter out irrelevant slices.
Demo here
You can do this.
WITH trans
AS
(SELECT DISTINCT trans_id
FROM t1
WHERE message LIKE '%def234%')
SELECT t1.*
FROM t1,
trans
WHERE t1.trans_id = trans.trans_id;
I think this will perform better. If you have enough data, you can do an explain on both Sub query and CTE and compare the output.
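The CTE-plus-join idea from the answers can be tried on a small subset of the sample data with Python's built-in sqlite3 module. This is a sketch against SQLite rather than PostgreSQL, and the table name `messages` is a placeholder chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (id INTEGER PRIMARY KEY, trans_id INTEGER, message TEXT)"
)
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?)",
    [
        (1, 1234567, "abc123-ef"),
        (2, 1234567, "def234-gh"),
        (3, 1234567, "ghi567-ij"),
        (10, 4567890, "abc123-ab"),
        (11, 4567890, "gex890-aj"),
        (12, 4567890, "ghi567-ef"),
    ],
)

related_query = """
WITH trans AS (
    -- every trans_id that has at least one matching message
    SELECT DISTINCT trans_id
    FROM messages
    WHERE message LIKE '%def234%'
)
SELECT m.id, m.trans_id, m.message
FROM messages AS m
JOIN trans ON m.trans_id = trans.trans_id
ORDER BY m.id
"""
related_rows = conn.execute(related_query).fetchall()
print(related_rows)
```

All three rows of trans_id 1234567 come back, while the 4567890 group is left out because none of its messages match.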

Get row with max value in Hive/SQL?

I'm new to Hive/SQL, and I'm stuck on a fairly simple problem. My data looks like:
+------------+--------------------+-----------------------+
| carrier_iD | meandelay | meancanceled |
+------------+--------------------+-----------------------+
| EV | 13.795802119653473 | 0.028584251044292006 |
| VX | 0.450591016548463 | 2.364066193853424E-4 |
| F9 | 10.898001378359766 | 0.00206753962784287 |
| AS | 0.5071547420965062 | 0.0057404326123128135 |
| HA | 1.2031093279839498 | 5.015045135406214E-4 |
| 9E | 8.147899230704216 | 0.03876067292247866 |
| B6 | 9.45383857757506 | 0.003162096314343487 |
| UA | 8.101511665305816 | 0.005467725574605967 |
| FL | 0.7265068895709532 | 0.0041141513746490044 |
| WN | 7.156119279121648 | 0.0057419058192869415 |
| DL | 4.206288692245839 | 0.005123990066804269 |
| YV | 6.316802855264404 | 0.029304029304029346 |
| US | 3.2221527095063736 | 0.007984031936127766 |
| OO | 6.954715814690328 | 0.02596499362466706 |
| MQ | 9.74568222216328 | 0.025628100708354324 |
| AA | 8.720522654298968 | 0.019242775597574157 |
+------------+--------------------+-----------------------+
I want Hive to return the row with the meanDelay max value. I have:
SELECT CAST(MAX(meandelay) as FLOAT) FROM flightinfo;
which indeed returns the max (I use cast because my values are saved as STRING). So then:
SELECT * FROM flightinfo WHERE meandelay = (SELECT CAST(MAX(meandelay) AS FLOAT) FROM flightinfo);
I get the following error:
FAILED: ParseException line 1:44 cannot recognize input near 'select' 'cast' '(' in expression specification
Use the windowing and analytics functions
SELECT carrier_id, meandelay, meancanceled
FROM
(SELECT carrier_id, meandelay, meancanceled,
rank() over (order by cast(meandelay as float) desc) as r
FROM flightinfo) S
WHERE S.r = 1;
This also handles ties: if more than one row has the same max value, you'll get all of those rows in the result. If you want just a single row, change rank() to row_number() or add another term to the ORDER BY.
Use a join instead:
SELECT a.*
FROM flightinfo a
LEFT SEMI JOIN
(SELECT CAST(MAX(meandelay) AS FLOAT) maxdelay FROM flightinfo) b
ON (a.meandelay = b.maxdelay)
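Below is a sketch of the join-against-the-max idea using Python's built-in sqlite3 module, with meandelay stored as text as in the question. It runs on SQLite, not Hive, so the syntax differs (a plain JOIN, REAL instead of FLOAT). One thing it makes visible: in SQLite, MAX over a text column compares values as text, so the sketch casts inside the aggregate. Whether Hive behaves the same way is not verified here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flightinfo (carrier_id TEXT, meandelay TEXT, meancanceled TEXT)"
)
conn.executemany(
    "INSERT INTO flightinfo VALUES (?, ?, ?)",
    [
        ("EV", "13.795802119653473", "0.028584251044292006"),
        ("MQ", "9.74568222216328", "0.025628100708354324"),
        ("VX", "0.450591016548463", "2.364066193853424E-4"),
    ],
)

# Text comparison: '9...' sorts after '13...', so this is not the numeric max.
text_max = conn.execute("SELECT MAX(meandelay) FROM flightinfo").fetchone()[0]

# Cast before aggregating, then join back to fetch the whole row.
max_row_query = """
SELECT a.carrier_id, a.meandelay, a.meancanceled
FROM flightinfo AS a
JOIN (SELECT MAX(CAST(meandelay AS REAL)) AS maxdelay FROM flightinfo) AS b
  ON CAST(a.meandelay AS REAL) = b.maxdelay
"""
max_rows = conn.execute(max_row_query).fetchall()
print(text_max, max_rows)
```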
You can use the collect_max UDF from Brickhouse ( http://github.com/klout/brickhouse ) to solve this problem, passing in a value of 1, meaning that you only want the single max value.
select array_index( map_keys( collect_max( carrier_id, meandelay, 1) ), 0 ) from flightinfo;
Also, I've read somewhere that the Hive max UDF does allow you to access other fields on the row, but I think it's easier just to use collect_max.
I don't think your sub-query is allowed ...
A quick look here:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SubQueries
states:
As of Hive 0.13 some types of subqueries are supported in the WHERE
clause. Those are queries where the result of the query can be treated
as a constant for IN and NOT IN statements (called uncorrelated
subqueries because the subquery does not reference columns from the
parent query):

how to make postgresql result unique

This is somewhat hard to describe, but I have a PostgreSQL 9.1 table (planet_osm_roads).
My query is
SELECT
osm_id, name, highway, way, md5(astext(way)) AS md5
FROM planet_osm_roads
WHERE highway IS NOT NULL
AND md5(astext(way)) IN (
SELECT DISTINCT md5(astext(way))
FROM planet_osm_roads
WHERE highway IS NOT NULL
GROUP BY md5
HAVING count(osm_id) > 1
)
ORDER BY osm_id
The result is
osm_id | name | highway | ...way ... | md5
----------+------+---------------+-------...----...--+----------------------------------
-1641383 | | motorway | 010200...CA96...0 | 04b4336b997e7ea9d99208bd487bbe7d
-1641383 | | motorway | 010200...EC3E...0 | ae945148417ada285130c59277c48a25
-1641383 | | motorway | 010200...7BF6...0 | 5c5a1b8ae40c1b7f24e293a012ad2add
23133731 | | motorway_link | 010200...EC3E...0 | ae945148417ada285130c59277c48a25
31309105 | | motorway | 010200...7BF6...0 | 5c5a1b8ae40c1b7f24e293a012ad2add
49339926 | | motorway | 010200...CA96...0 | 04b4336b997e7ea9d99208bd487bbe7d
(6 rows)
I want a result that holds 3 rows (one for every md5 hash), each taken from any one of the corresponding rows.
So a valid row for "ae945148417ada285130c59277c48a25" may contain the osm_id-highway pair "-1641383" & "motorway" or "23133731" & "motorway_link". I don't mind and will consider both correct.
How can I solve this, and what is the required operation/technique called? That way I'll know what to call it and what to search for next time.
select
md5(astext(way)) as md5,
min(osm_id) osm_id,
min(name) name,
min(highway) highway,
min(way) way
from planet_osm_roads
where highway is not null
group by 1
having count(osm_id) > 1
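The query above groups by the hash and collapses each group with an aggregate. A sketch of the same idea using Python's built-in sqlite3 module follows. It assumes SQLite instead of PostgreSQL, uses a hypothetical `roads` table, and stores the md5 values from the question directly in a column instead of computing them from the geometry.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE roads (osm_id INTEGER, highway TEXT, md5 TEXT)")
conn.executemany(
    "INSERT INTO roads VALUES (?, ?, ?)",
    [
        (-1641383, "motorway", "04b4336b997e7ea9d99208bd487bbe7d"),
        (-1641383, "motorway", "ae945148417ada285130c59277c48a25"),
        (-1641383, "motorway", "5c5a1b8ae40c1b7f24e293a012ad2add"),
        (23133731, "motorway_link", "ae945148417ada285130c59277c48a25"),
        (31309105, "motorway", "5c5a1b8ae40c1b7f24e293a012ad2add"),
        (49339926, "motorway", "04b4336b997e7ea9d99208bd487bbe7d"),
    ],
)

dedup_query = """
SELECT md5, MIN(osm_id) AS osm_id, MIN(highway) AS highway
FROM roads
GROUP BY md5              -- one output row per hash
HAVING COUNT(osm_id) > 1  -- keep only hashes that occur more than once
ORDER BY md5
"""
one_per_hash = conn.execute(dedup_query).fetchall()
print(one_per_hash)
```

Each MIN is computed independently per column, so in general the values in one output row are not guaranteed to come from the same source row.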