Unnesting a big quantity of columns in BigQuery and BigTable - sql

I have a table in BigTable, with a single column family, containing some lead data. I was following the Google Cloud guide to querying BigTable data from BigTable (https://cloud.google.com/bigquery/external-data-bigtable) and so far so good.
I've crated the table definition file, like the docs required:
{
"sourceFormat": "BIGTABLE",
"sourceUris": [
"https://googleapis.com/bigtable/projects/{project_id}/instances/{instance_id}/tables/{table_id}"
],
"bigtableOptions": {
"readRowkeyAsString": "true",
"columnFamilies": [
{
"familyId": "leads",
"columns": [
{
"qualifierString": "Id",
"type": "STRING"
},
{
"qualifierString": "IsDeleted",
"type": "STRING"
},
...
]
}
]
}
}
But then, things started to go south...
This is how the BigQuery "table" ended up looking:
Each row is a rowkey and inside each column there's a nested cell, where the only value I need is the value from leads.Id.cell (in this case)
After a bit of searching I found a solution to this:
https://stackoverflow.com/a/70728545/4183597
So in my case it would be something like this:
SELECT
ARRAY_TO_STRING(ARRAY(SELECT value FROM UNNEST(leads.Id.cell)), "") AS Id,
...
FROM xxx
The problem is that I'm dealing with a dataset with more than 600 columns per row. It is unfeasible (and impossible, given BigQuery's subquery limits) to repeat this process more than 600 times per row/query.
I couldn't think of a way to automate this query or even think about other methods to unnest this many cells (my SQL knowledge stops here).
Is there any way to do a unesting like this for 600+ columns, with an SQL/BigQuery query? Preferable in a more efficient way? If not, I'm thinking of doing a daily batch process, using a simple Python connector from BigTable to BigQuery, but I'm afraid of the costs this will incur.
Any documentation, reference or idea will be greatly appreciated.
Thank you.

In general, you're setting yourself up for a world of pain when you try to query a NoSQL database (like BigTable) using SQL. Unnesting data is a very expensive operation in SQL because you're effectively performing a cross join (which is many-to-many) every time UNNEST is called, so trying to do that 600+ times will give you either a query timeout or a huge bill.
The BigTable API will be way more efficient than SQL since it's designed to query NoSQL structures. A common pattern is to have a script that runs daily (such as a Python script in a Cloud Function) and uses the API to get that day's data, parse it, and then output that to a file in Cloud Storage. Then you can query those files via BigQuery as needed. A daily script that loops through all the columns of your data without requiring extensive data transforms is usually cheap and definitely less expensive than trying to force it through SQL.
That being said, if you're really set on using SQL, you might be able to use BigQuery's JSON functions to extract the nested data you need. It's hard to visualize what your data structure is without sample data, but you may be able to read the whole row in as a single column of JSON or a string. Then if you have a predictable path for the values you are looking to extract, you could use a function like JSON_EXTRACT_STRING_ARRAY to extract all of those values into an array. A Regex function could be used similarly as well. But if you need to do this kind of parsing on the whole table in order to query it, a batch job to transform the data first will still be much more efficient.

Related

Query Vs Scan operation in dynamodb

Background
I am currently trying to figure out the best way of calculating some stats in lambda function based on the db design I have. Let’s say I have records of users from China which has 23 provinces that I stored in an array, to which I want to determine the total number of both females and males, as well as the number of users in each province.
Given a GSI table with 200,000 items with a total size of 100bytes per item as seen below, with the province attribute being the partition key.
{
"createdAt": {
"S": "2020-08-05T19:21:07.532Z"
},
"gender": {
"S": "Male"
},
"updatedAt": {
"S": "2020-08-05T19:21:07.532Z"
},
"province": {
"S": "Heilongjiang"
}
}
I am considering using two methods for this calculation:
1.Query method
I plan on looping over the province array and providing a partition key on each loop to the query method which would end up making too many requests(23 to be precise, that’s if the each request returned doesn’t pass the limit of 1MB which might lead me to keep repeating until there is no more lastEvaluationKey for the current query).
2.Scan method
In this method, I would make requests iteratively to the database until there is no more lastEvaluationKey.
Having the knowledge of both scan and query methods being able to return only 1mb of data, which method would be the most appropriate to use in this particular use case?
I am considering going for the scan method seeing as I would need to read all the data in table in order to calculate the stats anyways; however, I am afraid of how slow the operation will become when the table grows.
PS: Suggestions for a different keySchema for better access would also be very appreciated.
Neither.
Use DDB Streams + Lambda to update your stats as records are created/updated/deleted in your DDB table.
See also
Using Global Secondary Indexes for Materialized Aggregation Queries
How to do basic aggregation with DynamoDB?

How to properly store a JSON object into a Table?

I am working on a scenario where I have invoices available in my Data Lake Store.
Invoice example (extremely simplified):
{
"business_guid":"b4f16300-8e78-4358-b3d2-b29436eaeba8",
"ingress_timestamp": 1523053808,
"client":{
"name":"Jake",
"age":55
},
"transactions":[
{
"name":"peanut",
"amount":100
},
{
"name":"avocado",
"amount":2
}
]
}
All invoices are stored in ADLS, and can be queried. But, It is my desire to provide access to the same data inside an ALD DB.
I am not an expert on unstructed data: I have RDBMS background. Taking that into consideration, I can only think of 2 possible scenarios:
2/3 tables - invoice, client (could be removed) and transaction. In this scenario, I would have to create an invoice ID to be able to build relationships between those tables
1 table - client info could be normalized into invoice data. But, transactions could (maybe) be defined as an SQL.ARRAY<SQL.MAP<string, object>>
I have mainly 3 questions:
What is the correct way of doing so? Solution 1 seems much better structured.
If I go with solution 1, how do I properly create an ID (probably GUID)? Is it acceptable to require ID creation when working with ADL?
Is there another solution I am missing here?
Thanks in advance!
This type of question is a bit like do you prefer your sauce on the pasta or next to the pasta :). The answer is: it depends.
To answer your 3 questions more seriously:
#1 has the benefit of being normalized that works well if you want to operate on the data separately (e.g., just clients, just invoices, just transactions) and want to the benefits of normalization, get the right indexing, and are not limited by the rowsize limits (e.g., your array of map needs to fit into a row). So I would recommend that approach unless your transaction data is always small and you always access the data together and mainly search on the column data.
U-SQL per se has no understanding of the hierarchy of the JSON document. Thus, you would have to write an extractor that turns your JSON into rows in a way that it either gives you the correlation of the parent to the child (normally done by stepwise downwards navigation with cross apply) and use the key value of the parent data item as the foreign key, or have the extractor generate the key (as int or guid).
There are some sample JSON extractors on the U-SQL GitHub site (start at http://usql.io) that can get you started with the JSON to rowset conversion. Note that you will probably want to optimize the extraction at some point to be JSON Reader based so you process larger docs without loading it into memory.

How to count rows of finished BigQuery job using node.js client library

I would like to get the row count of job that was run using:
bigquery.startQuery(options)
The naive way of doing this, would be to stream the result (e.g. using):
job.getQueryResultsStream()
And count one by one. This obviously isn't very efficient, especially for large results. Another way I thought of is using the metadata of the job:
job.on('complete', function(metadata) {...}
Where I could kind of "reverse engineer" the response, to get the query plan, and see the number of written rows in the last step. I could find that in:
statistics.query.queryPlan[statistics.query.queryPlan.length - 1].recordsWritten
While a sample of different queries convinced me that this might work, it feels like a "hack", and it's difficult to say how robust it will be. Seems like I might need to handle different cases (failed queries, etc.)
EDIT: Another option suggested below is "SELECT COUNT"ing the temp table created by the original query (available in the job metadata). While this absolutely is a straightforward way to get the result I'm looking for, it has the disadvantage of requiring another roundtrip to query the BigQuery service, which costs several seconds. It is a 0 "bytes billed" query (counting a full table uses table metadata only), but it seems redundant when the job "knows" how many rows it has written to the output.
Is there a straightforward and "correct" way to get this count from the job object, without a roundtrip to BQ service? Perhaps a field I missed / misinterpreted, or a function in the job object that returns this?
Any job has destination table - even when you do not explicitly set it - result is still saved in so-called anonymous table that you can in turn query to get the count of output rows. So below simple extra query will work (note - names are just as an example)
SELECT COUNT(1)
FROM `yourProject._0511743a77ca76c1b55482d7cb1f8e91ac5c7b36.anon17286defe54b5c07ba6810a71abfdba6388ac4e0`
The actual destination table to use - can be retrieved from configuration.query.destinationTable property of job
job.on('complete', function(metadata) {
console.log(metadata.statistics.query.numDmlAffectedRows)
}

Calculating a proxy bit size in a BigQuery table

How does one go about calculating the bit size of each record in BigQuery sharded tables across a range of time?
Objective: how much has it grown over time
Nuances: Of the 70 some fields, some records would have nulls for most, some records would have long string text grabbed directly from the raw logs, and some of them could be float/integer/date types.
Wondering if there's an easy way to do a proxy count of the bit size for one day and then I can expand that to a range of time.
Example from my experience:
One of my tables is daily sharded table with daily size of 4-5TB. Schema has around 780 fields. I wanted to understand cost of each data-point (bit-size) [it was used then for calculating ROI based on cost/usage]
So, let me give you an idea on how cost (bit-size) side of it was approached.
The main piece here is use of dryRun property of Jobs: Query API
Setting dryRun to true allows BigQuery (instead of actually running job) return statistics about the job such as how many bytes would be processed. And that’s exactly what is needed here!
So, for example, below Request is designed to get cost of trafficSource.referralPath in ga_session table for 2017-01-05
POST https://www.googleapis.com/bigquery/v2/projects/yourBillingProject/queries?key={YOUR_API_KEY}
{
"query": "SELECT trafficSource.referralPath FROM yourProject.yourDataset.ga_sessions_20170105`",
"dryRun": true,
"useLegacySql": false
}
You can get this value by parsing totalBytesProcessed out of Response. See example of such response below
{
"kind": "bigquery#queryResponse",
"jobReference": {
"projectId": "yourBillingProject"
},
"totalBytesProcessed": "371385",
"jobComplete": true,
"cacheHit": false
}
So, you can write relatively simple script in the client of your choice that:
reads schema of your table – you can use Tables: get API for this or if schema is known and readily available you can just simply hardcode it
organize loop through all (each and every) field in the schema
inside loop – call query api and extract size of respective filed (as it is outlined above)) and of course log it (or just collect it in memory)
As a result of above - you will have list of all fields with their respective size
If now, you need to analyze those sizes changes over the time – you can wrap above with yet another loop where you will iterate through as many days as you need and collect stats for each and every day
if you are not interested in day-by-day analysis - you just can make sure your query actually queries the range you are interested with. This can be done with use of a Wildcard Table
I consider this relatively easy way to go with
Me personally, I remember doing this with Go-lang, but it doesn't matter - you can use any client that you are most comfortable with
Hope this will help you!

What is a best way to organise the complex couchdb view (sql-like query)?

In my application I need a SQL-like query of the documents. The big picture is that there is a page with a paginated table showing the couchdb documents of a certain "type". I have about 15 searchable columns like timestamp, customer name, the us state, different numeric fields, etc. All of these columns are orderable, also there is a filter form allowing the user to filter by each of the fields.
For a more concrete below is a typical query which is a result by a customer setting some of the filter options and following to the second page. Its written in a pseodo-sql code, just to explain the problem:
timestamp > last_weeks_monday_epoch AND timestamp < this_weeks_monday_epoch AND marked_as_test = False AND dataspace="production" AND fico > 650
SORT BY timestamp DESC
LIMIT 15
SKIP 15
This would be a trivial problem if I were using any sql-like database, but couchdb is way more fun ;) To solve this I've created a view with the following structure of the emitted rows:
key: [field, value], id: doc._id, value: null
Now, to resolve the example query above I need to perform a bunch of queries:
{startkey: ["timestamp", last_weeks_monday_epoch], endkey: ["timestamp", this_weeks_monday_epoch]}, the *_epoch here are integers epoch timestamps,
{key: ["marked_as_test", False]},
{key: ["dataspace", "production"]},
{startkey: ["fico", 650], endkey: ["fico", {}]}
Once I have the results of the queries above I calculate intersection of the sets of document IDs and apply the sorting using the result of timestamp query. Than finally I can apply the slice resolving the document IDs of the rows 15-30 and download their content using bulk get operation.
Needless to say, its not the fastest operation. Currently the dataset I'm working with is roughly 10K documents big. I can already see that the part when I'm calculating the intersection of the sets can take like 4 seconds, obviously I need to optimize it further. I'm afraid to think, how slow its going to get in a few months when my dataset doubles, triples, etc.
Ok, so having explained the situation I'm at, let me ask the actual questions.
Is there a better, more natural way to reach my goal without loosing the flexibility of the tool?
Is the view structure I've used optimal ? At some point I was considering using a separate map() function generating the value of each field. This would result in a smaller b-trees but more work of the view server to generate the index. Can I benefit this way ?
The part of algorithm where I have to calculate intersections of the big sets just to later get the slice of the result bothers me. Its not a scalable approach. Does anyone know a better algorithm for this ?
Having map function:
function(doc){
if(doc.marked_as_test) return;
emit([doc.dataspace, doc.timestamp, doc.fico], null):
}
You can made similar request:
http://localhost:5984/db/_design/ddoc/_view/view?startkey=["production", :this_weeks_monday_epoch]&endkey=["production", :last_weeks_monday_epoch, 650]&descending=true&limit=15&skip=15
However, you should pass :this_weeks_monday_epoch and :last_weeks_monday_epoch values from the client side (I believe they are some calculable variables on database side, right?)
If you don't care about dataspace field (e.g. it's always constant), you may move it into the map function code instead of having it in query parameters.
I don't think CouchDB is a good fit for the general solution to your problem. However, there are two basic ways you can mitigate the ways CouchDB fits the problem.
Write/generate a bunch of map() functions that use each separate column as the key (for even better read/query performance, you can even do combinatoric approaches). That way you can do smart filtering and sorting, making use of a bunch of different indices over the data. On the other hand, this will cost extra disk space and index caching performance.
Try to find out which of the filters/sort orders your users actually use, and optimize for those. It seems unlikely that each combination of filters/sort orders is used equally, so you should be able to find some of the most-used patterns and write view functions that are optimal for those patterns.
I like the second option better, but it really depends on your use case. This is one of those things SQL engines have been pretty good at traditionally.