Query specific tables in Bigtable from BigQuery

I have some data in Google Bigtable over which I have built a BigQuery external table (as per Querying Cloud Bigtable data), so that I can query the Bigtable table using conventional SQL (which I'm very familiar with).
When I issue a select * I get back rows containing a rowkey plus a nested, repeated attributes.column structure, where each column has a name and an array of cells, each with a timestamp and a value.
Now I would like to know the syntax for querying specific values in this nested data. For example, to get a list of accountIds I can do this:
SELECT ARRAY(SELECT timestamp FROM UNNEST(attributes.column[OFFSET(0)].cell)) AS timestamp,
       ARRAY(SELECT SAFE_CONVERT_BYTES_TO_STRING(value) FROM UNNEST(attributes.column[OFFSET(0)].cell)) AS values
FROM `table`
WHERE SAFE_CONVERT_BYTES_TO_STRING(rowkey) = 'XXXX'
which returns the timestamp and value arrays for that column, which is, well, kinda handy.
Similarly I can get car#le11mcr#policyStartDate by changing the OFFSET like so:
SELECT ARRAY(SELECT timestamp FROM UNNEST(attributes.column[OFFSET(6)].cell)) AS timestamp,
       ARRAY(SELECT SAFE_CONVERT_BYTES_TO_STRING(value) FROM UNNEST(attributes.column[OFFSET(6)].cell)) AS values
FROM `table`
WHERE SAFE_CONVERT_BYTES_TO_STRING(rowkey) = 'XXXX'
However, both of these queries require me to know what value to pass to OFFSET(), and that value appears to depend on the alphabetical order of the Bigtable columns. Hence, if another column whose name starts with (say) 'b' appears in the future, my queries would no longer return the same thing.
I need a better way of querying the table than using OFFSET(). Essentially I want to be able to say:
select the cell values and timestamp values for the cell whose name is accountId
or
select the cell values and timestamp values for the cell whose name is car#le11mcr#policyStartDate
Is there a way to do that? I'm not too familiar with BigQuery syntax for doing this.

OK, I've made a tiny bit of progress.
This:
SELECT
  (SELECT ARRAY(SELECT timestamp FROM UNNEST(cell))
   FROM UNNEST(attributes.column) WHERE name IN ('accountId')) AS accountIdTimestamp,
  (SELECT ARRAY(SELECT value FROM UNNEST(cell))
   FROM UNNEST(attributes.column) WHERE name IN ('accountId')) AS accountIdValue
FROM `table`
WHERE SAFE_CONVERT_BYTES_TO_STRING(rowkey) = 'XXXX'
LIMIT 3
returns the accountId timestamps and values, which is better, but notice it didn't return anything for the first two rows. That's because those two rows don't have a cell called accountId, a problem I can get around by introducing a WHERE clause:
SELECT
  (SELECT ARRAY(SELECT timestamp FROM UNNEST(cell))
   FROM UNNEST(attributes.column) WHERE name IN ('accountId')) AS accountIdTimestamp,
  (SELECT ARRAY(SELECT value FROM UNNEST(cell))
   FROM UNNEST(attributes.column) WHERE name IN ('accountId')) AS accountIdValue
FROM `table`
WHERE ARRAY_LENGTH(ARRAY(
  SELECT name FROM UNNEST(attributes.column) WHERE name IN ('accountId')
)) > 0
LIMIT 3
which returns only the rows that have an accountId.
That does what I want, I guess, but I'd like to think there's a better way of achieving this that doesn't require quite so much typing or such complicated logic (the WHERE clause in particular feels like a very complicated way of saying 'only give me rows if there's an accountId').
Any advice to make this more efficient or readable would be appreciated.
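For what it's worth, that filter can be stated more directly with EXISTS - a sketch against the same nested layout, not tested against the table above:
SELECT
  (SELECT ARRAY(SELECT timestamp FROM UNNEST(cell))
   FROM UNNEST(attributes.column) WHERE name = 'accountId') AS accountIdTimestamp,
  (SELECT ARRAY(SELECT value FROM UNNEST(cell))
   FROM UNNEST(attributes.column) WHERE name = 'accountId') AS accountIdValue
FROM `table`
WHERE EXISTS (SELECT 1 FROM UNNEST(attributes.column) WHERE name = 'accountId')
LIMIT 3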
My next challenge to solve is to return the accountIdValue for the max(accountIdTimestamp).
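For that follow-up, one option is a scalar subquery that flattens the cells of the named column and keeps only the newest one - again a sketch, assuming the same schema as the queries above:
SELECT
  (SELECT SAFE_CONVERT_BYTES_TO_STRING(c.value)
   FROM UNNEST(attributes.column) AS col, UNNEST(col.cell) AS c
   WHERE col.name = 'accountId'
   ORDER BY c.timestamp DESC   -- newest cell first
   LIMIT 1) AS latestAccountIdValue
FROM `table`
WHERE SAFE_CONVERT_BYTES_TO_STRING(rowkey) = 'XXXX'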

Related

How to ignore 00 (two leading zeros) in Select query?

I am not sure whether it is possible or not. I have one DB table which has a field refNumber; some of this field's values contain two leading zeros. The following is an example:

id       refNumber
10001    123
10002    00456
Now I am trying to write a query which can select from this table with or without the leading zeros (only two, not fewer or more than two). Here is an example: for refNumber=123 OR refNumber=00123 it should return 10001, and for refNumber=00456 OR refNumber=456 it should return 10002. I cannot use the LIKE operator because in that case other records might also be returned. Is it possible through the query? If not, what would be the right way to select such records? I am avoiding looping over all the rows in my application.
You need to apply the TRIM function to both the column and the value you want to filter by:
SELECT * FROM MyTable
WHERE TRIM(LEADING '0' FROM refNumber) = TRIM(LEADING '0' FROM '00123') -- here you put your desired ref number
Use trim():
Select * from table where trim(leading '0' from refnumber) IN ('123','456')
Or replace(), whichever is supported (note that replace() strips every zero, not just leading ones, so it can match unintended rows):
Select * from table where
replace(refnumber, '0','') IN
('123','456')
While the currently accepted answer would work, be aware that at best it would cause Db2 to do a full index scan and at worst could result in a full table scan.
Not a particularly efficient way to return 1 or 2 records out of perhaps millions. This happens anytime you use an expression over a table column in the WHERE clause.
If you know there are only ever going to be 5 digits or fewer, a better solution would be something like the following:
SELECT * FROM MyTable
WHERE refNumber in ('00123','123')
That assumes you can build the two possibilities outside the query.
If you really want to have the query deal with the two possibilities..
SELECT * FROM MyTable
WHERE refNumber in (LPAD(:value,5,'0'),LTRIM(:value, '0'))
If '00123' or '123' is passed in as :value, the above query will find records with '00123' or '123' in refNumber - and, assuming you have an index on refNumber, it will do so quickly and efficiently.
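For example, substituting '123' for :value (a worked illustration; the literals simply stand in for the host variable):
SELECT * FROM MyTable
WHERE refNumber IN (LPAD('123', 5, '0'),   -- yields '00123'
                    LTRIM('123', '0'))     -- yields '123'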
If there could be an unknown number of leading zeros, then you are stuck with:
SELECT * FROM MyTable
WHERE LTRIM(refNumber,'0') = LTRIM(:value, '0')
However, if your platform/version of Db2 supports indexes over an expression, you'd want to create one for efficiency's sake:
create index myidx
on MyTable (LTRIM(refNumber, '0'))

Sorting concatenated strings after grouping in Netezza

I'm using the code on this page to create a concatenated list of strings on a group-by aggregation basis.
https://dwgeek.com/netezza-group_concat-alternative-working-example.html/
I'm trying to get the concatenated string in sorted order, so that, for example, for DB1 I'd get data1,data2,data5,data9
I tried modifying the original code to select from a pre-sorted table, but it doesn't seem to make any difference.
select Col1
, count(*) as NUM_OF_ROWS
, trim(trailing ',' from SETNZ..replace(SETNZ..replace (SETNZ..XMLserialize(SETNZ..XMLagg(SETNZ..XMLElement('X',col2))), '<X>','' ),'</X>' ,',' )) AS NZ_CONCAT_STRING
from
(select * from tbl_concat_demo order by 1,2) AS A
group by Col1
order by 1;
Is there a way to sort the strings before they get aggregated?
BTW - I'm aware there is a GROUP_CONCAT UDF function for Netezza, but I won't have access to it.
This is notoriously difficult to accomplish in SQL, since sorting is usually done while returning the data, and you want it done in the ‘input’ set.
Try this:
1) Create a temp table with the rows pre-sorted:
Create temp table X as select * from tbl_concat_demo order by col2
Distribute on (col1)
2) In your original code above, select from X instead of tbl_concat_demo.
Let me know if it works?
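Putting the two steps together, a sketch using the same SETNZ.. functions and table as the linked example (column names assumed to match that walkthrough, and the SETNZ.. SQL-extension UDFs assumed to be installed):
-- step 1: materialize the rows pre-sorted by the value being concatenated
CREATE TEMP TABLE sorted_demo AS
SELECT * FROM tbl_concat_demo ORDER BY col1, col2
DISTRIBUTE ON (col1);

-- step 2: run the XMLAgg-based concatenation against the sorted copy
SELECT col1,
       count(*) AS NUM_OF_ROWS,
       trim(trailing ',' from SETNZ..replace(SETNZ..replace(
         SETNZ..XMLserialize(SETNZ..XMLagg(SETNZ..XMLElement('X', col2))),
         '<X>', ''), '</X>', ',')) AS NZ_CONCAT_STRING
FROM sorted_demo
GROUP BY col1
ORDER BY 1;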

How to extract record's table name when using Table wildcard functions [duplicate]

I have a set of day-sharded data where individual entries do not contain the day. I would like to use table wildcards to select all available data and get back data that is grouped by both the column I am interested in and the day that it was captured. Something, in other words, like this:
SELECT table_id, identifier, SUM(AppAnalytic) AS AppAnalyticCount
FROM (TABLE_QUERY(database_main, 'table_id CONTAINS "Title_" AND length(table_id) >= 4'))
GROUP BY identifier, table_id
ORDER BY AppAnalyticCount DESC
LIMIT 10
Of course, this does not actually work because table_id is not visible in the table aggregation resulting from the TABLE_QUERY function. Is there any way to accomplish this? Some sort of join on table metadata perhaps?
This functionality is available now in BigQuery through _TABLE_SUFFIX pseudocolumn. Full documentation is at https://cloud.google.com/bigquery/docs/querying-wildcard-tables.
Couple of things to note:
You will need to use Standard SQL to enable table wildcards
You will have to rename _TABLE_SUFFIX into something else in your SELECT list; the following example illustrates it:
SELECT _TABLE_SUFFIX as table_id, ... FROM `MyDataset.MyTablePrefix_*`
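Applied to the query from the question, it would look something like this (a sketch; `database_main.Title_*` assumes the dataset and shard prefix from the original TABLE_QUERY):
SELECT _TABLE_SUFFIX AS table_id, identifier, SUM(AppAnalytic) AS AppAnalyticCount
FROM `database_main.Title_*`
GROUP BY identifier, table_id
ORDER BY AppAnalyticCount DESC
LIMIT 10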
Not available today, but something I'd love to have too. The team takes feature requests seriously, so thanks for adding support for this one :).
In the meantime, a workaround is doing a manual union of a SELECT of each table, plus an additional column with the date data.
For example, instead of:
SELECT x, #TABLE_ID
FROM table201401, table201402, table201403
You could do:
SELECT x, month
FROM
(SELECT x, '201401' AS month FROM table201401),
(SELECT x, '201402' AS month FROM table201402),
(SELECT x, '201403' AS month FROM table201403)

Oracle: how to find occurrence of a specified value in a particular column of a table

I have a table where one of the columns is check, and this column has only YES, NO and NA as data.
When a query is executed: if the check column has at least one YES value then YES should be fetched; otherwise NO should be fetched; and if no data is found for this column then the fetched value should be NA.
Could someone help me out?
I am not sure I am answering the right question - but I hope this query example would help:
Select nvl(MAX(col),'NA')
from (select id, col from table
where id = 1
union all
select 1 id, null from dual)
group by id
That would work with YES and NO: since 'YES' > 'NO', priority will be given to 'YES' and then to 'NO', and in the case that there are no results the NVL will make sure you get NA. For the case where no result exists, the select 1 id, null from dual branch makes sure you'll always get a row back.
There are many other ways you can write this query (including analytic functions, or creating an aggregate function, which is what you really want) - but this is the simplest. Make sure the col values are constrained to only 'YES' and 'NO'; otherwise the correctness will be harmed.
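As a footnote to the above: an aggregate query without GROUP BY always returns exactly one row, so the UNION ALL against dual can be dropped entirely. A sketch, with hypothetical names my_table/check_col:
-- MAX prefers 'YES' over 'NO' over 'NA' (simple alphabetical order);
-- with no matching rows, MAX returns NULL and NVL substitutes 'NA'
SELECT NVL(MAX(check_col), 'NA') AS result
FROM my_table
WHERE id = 1;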
Now I am not sure if it answers your question, but if I understand your need correctly, the ideas shown here should be applicable to your requirement.
This should also be pretty efficient, assuming there is an index on the column you're querying.
Sample Solution:
WITH tbl_values AS (
  SELECT mt.check_value
  FROM my_table mt
  GROUP BY mt.check_value
  ORDER BY decode(mt.check_value, 'YES', 1, 'NO', 2, 3) ASC)
SELECT tv.check_value
FROM tbl_values tv
WHERE rownum = 1;
COMMENTS: The DECODE statement is the key enforcement of the rules stated in the OP.

Aggregate functions in WHERE clause in SQLite

Simply put, I have a table with, among other things, a column for timestamps. I want to get the row with the most recent (i.e. greatest value) timestamp. Currently I'm doing this:
SELECT * FROM table ORDER BY timestamp DESC LIMIT 1
But I'd much rather do something like this:
SELECT * FROM table WHERE timestamp=max(timestamp)
However, SQLite rejects this query:
SQL error: misuse of aggregate function max()
The documentation confirms this behavior (bottom of page):
Aggregate functions may only be used in a SELECT statement.
My question is: is it possible to write a query to get the row with the greatest timestamp without ordering the select and limiting the number of returned rows to 1? This seems like it should be possible, but I guess my SQL-fu isn't up to snuff.
SELECT * from foo where timestamp = (select max(timestamp) from foo)
or, if SQLite insists on treating subselects as sets,
SELECT * from foo where timestamp in (select max(timestamp) from foo)
There are many ways to skin a cat.
If you have an identity column with auto-increment functionality, a faster query would be to return the last record by ID, due to the indexing of the column - unless, of course, you wish to put an index on the timestamp column.
SELECT * FROM TABLE ORDER BY ID DESC LIMIT 1
I think I've answered this question 5 times in the past week now, but I'm too tired to find a link to one of those right now, so here it is again...
SELECT *
FROM table T1
LEFT OUTER JOIN table T2
  ON T2.timestamp > T1.timestamp
WHERE T2.timestamp IS NULL
You're basically looking for the row where no other row matches that is later than it.
NOTE: As pointed out in the comments, this method will not perform as well in this kind of situation. It will usually work better (for SQL Server at least) in situations where you want the last row for each customer (as an example).
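For the 'last row for each customer' case just mentioned, the same anti-join simply gains a group-matching condition (a sketch with a hypothetical orders table holding customer_id and timestamp columns):
SELECT O1.*
FROM orders O1
LEFT OUTER JOIN orders O2
  ON O2.customer_id = O1.customer_id   -- compare only within the same customer
 AND O2.timestamp > O1.timestamp       -- any later row disqualifies O1
WHERE O2.timestamp IS NULL;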
you can simply do
SELECT *, max(timestamp) FROM table
Edit:
As an aggregate function can't be used like this, it gives an error. I guess what SquareCog had suggested was the best thing to do:
SELECT * FROM table WHERE timestamp = (select max(timestamp) from table)