BigQuery rows get swapped - google-bigquery

I have been trying to create a table in BigQuery from a .csv file stored in my bucket. The table is created and the data is loaded with the correct number of rows and columns; however, the rows get swapped in BigQuery for some reason.
I tried using the R connector to push the data from my local machine to BigQuery, and the same problem occurs.
So when I do SELECT * FROM <table>,
it shows me the complete table inside BigQuery, but the rows are swapped (e.g. row 21 becomes row 1 and row 4000 becomes row 3).
I would appreciate your response.

As in most SQL databases, data stored in BigQuery has no natural order. When you store data in BigQuery it may be automatically reorganized in ways that optimize the execution time of queries.
If you need to preserve the original order, you might need to add an additional column noting the relative order, and then reference it with an ORDER BY in your queries.
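For example, a minimal sketch of that approach, with hypothetical table and column names (the row_num column would be added to the CSV before loading, recording each row's original position):
-- row_num preserves the original CSV row order
SELECT *
FROM `my-project.my_dataset.my_table`
ORDER BY row_num;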

Related

Get column timestamps for when they were added in BigQuery

I'm trying to find which new columns were added to a table. Is there any way to find this? I was thinking of getting all columns for a table with the timestamps when they were created or modified, so that I can filter out which columns are new.
With INFORMATION_SCHEMA.SCHEMATA I only get the table creation and modification dates, but nothing for columns.
With INFORMATION_SCHEMA.COLUMNS I am able to get all column names and their information, but no details about their creation or modification timestamps.
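For reference, this is the kind of query I am running against INFORMATION_SCHEMA.COLUMNS (dataset and table names are placeholders); it returns column names and types but nothing about when a column was added:
SELECT column_name, data_type, is_nullable
FROM `my-project.my_dataset.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'my_table';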
My table doesn't have a snapshot so I can't compare it with the previous version to get changes.
Is there any way to capture this?
According to the BigQuery columns documentation, this is not metadata that is currently captured by BigQuery.
A possible solution would be to go into the BigQuery logs to see when and how tables were updated. Source control over the schemas and scripts that create these tables could also give you insight into how and when columns may have been added.
As @RileyRunnoe mentioned, this kind of metadata is not captured by BQ, and a possible solution is to dig into the Audit Logs. Prior to doing this, you need to have created a BQ sink that points to the dataset. See creating a sink for more details.
When the sink is created, all executed operations will store data-access logs in the table cloudaudit_googleapis_com_data_access_YYYYMMDD and activity logs in the table cloudaudit_googleapis_com_activity_YYYYMMDD under the BigQuery dataset you selected in your sink. Keep in mind that you can only track usage starting from the date when you set up the log export tables.
The query below has a CTE that reads from cloudaudit_googleapis_com_data_access_*, since this is where data changes are logged, and keeps only completed jobs by filtering on jobservice.jobcompleted. The outer query then selects statements that contain "COLUMN" and excludes queries without a real destination table, such as the one we are about to run (its results go to an anonymous dataset whose name starts with an underscore).
WITH CTE AS (
  SELECT
    protopayload_auditlog.methodName,
    protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobConfiguration.query.query AS query,
    protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatus.state AS status,
    protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobConfiguration.query.destinationTable.datasetId AS dataset,
    protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobConfiguration.query.destinationTable.tableId AS table,
    timestamp
  FROM `my-project.dataset_name.cloudaudit_googleapis_com_data_access_*`
  WHERE protopayload_auditlog.methodName = 'jobservice.jobcompleted'
)
SELECT
  query,
  REGEXP_EXTRACT(query, r'ADD COLUMN (\w+) \w+') AS column,
  table,
  timestamp,
  status
FROM CTE
WHERE query LIKE '%COLUMN%'
  AND NOT REGEXP_CONTAINS(dataset, r'^_')
ORDER BY timestamp DESC
Result:

Querying BigQuery cached temp tables gives different results every time

I have a very large dataset (approx 20 TB) in BigQuery. The problem is that I have to get the data in an ordered way, and to do that I have to use the ORDER BY clause. When I use ORDER BY on its own, the data is returned, but as soon as I add LIMIT and OFFSET the query exceeds the resource limit. To work around that, I came up with the idea of using the temporary cache table that is created every time I run a query.
So essentially I run an ORDER BY query to sort the data, which gets saved in the temp table, and then I query that temp table to fetch a small slice of the data using LIMIT and OFFSET. But sometimes the temp table gives random results, and the rows are not ordered as they should be.
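A rough sketch of this pattern (project, dataset, and table names are placeholders; the second table name stands in for the anonymous cached-results table that BigQuery creates for the first job):
-- Step 1: run the ordering query; the result is saved to a cached temp table.
SELECT *
FROM `my-project.my_dataset.huge_table`
ORDER BY sort_key;

-- Step 2: page through the cached temp table (hypothetical anonymous table name).
SELECT *
FROM `my-project._cached_results.anon_result_table`
LIMIT 10000 OFFSET 0;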
I have read multiple times in the BigQuery docs that
using them as inputs for dependent jobs is strongly discouraged
but I never found the reason for it. Also, the randomization of the data does not happen in every case, only in some. Am I missing something here?
If there is a better alternative for getting the sorted data other than using the temp cache table, that is also welcome.
Note: I have tried creating a temp table of my own to store the ordered data, but that also throws a resources-exceeded error.

Understanding data scanned when querying ORC with Presto/Athena

I have a large amount of data in ORC files in AWS S3. The data in ORC files is sorted by uuid. I create an AWS Athena (Presto) table on top of them and run the following experiment.
First, I retrieve the first row to see how much data gets scanned:
select * from my_table limit 1
This query reports 18 MB of data being scanned.
I record the uuid from the row returned from the first query and run the following query:
select * from my_table where uuid=<FIRST_ROW_UUID> limit 1
This query reports 8.5 GB of data being scanned.
By design, both queries return the same result but the second query scans 500 times more data!
Any ideas why this is happening? Is this something inherent to ORC design or is it specific to how Presto interacts with S3?
[EDIT after ilya-kisil's response]
Let's change the last query to only select the uuid column:
select uuid from my_table where uuid=<FIRST_ROW_UUID> limit 1
For this query, the amount of data scanned drops to about 600 MB! This means that the bulk of the 8.5 GB scanned in the second query is attributed to gathering values from all columns for the record found and not to finding this record.
Given that all values in the record add up to no more than 1 MB, scanning almost 8 GB of data to put these values together seems extremely excessive. This seems like some idiosyncrasy of ORC or columnar formats in general and I am wondering if there are standard practices, e.g. ORC properties, that help reduce this overhead?
Well, this is fairly simple. The first query picks an effectively random record from your data; on top of that, it is not guaranteed to read the very first record, since ORC files are splittable and can be processed in parallel. The second query, on the other hand, looks for one specific record.
Here is an analogy. Let's assume you have 100 coins, each with a UUID and some other info imprinted on its back. All of them are lying face up on a table, so you can't see their UUIDs.
select * from my_table limit 1
This query is like flipping some random coin, looking at what is written on its back, and putting it back on the table face up. Then someone comes along and shuffles all of the coins.
select * from my_table where uuid=<FIRST_ROW_UUID> limit 1
This query is like wanting to look at the information written on the back of one specific coin. It is unlikely that you would flip the correct coin on your first try, so you would need to "scan" more coins (data).
One of the common ways to reduce the amount of scanned data is to partition your data, i.e. put it into separate "folders" (not files) in your S3 bucket. The "folder" names can then be used as virtual columns within your table definition, i.e. additional metadata for your table. Have a look at this post, which goes into more detail on how to optimise queries in Athena.
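A minimal sketch of that partitioning idea, using hypothetical bucket, table, and column names:
-- Table whose data lives under s3://my-bucket/my-data/event_date=YYYY-MM-DD/...
CREATE EXTERNAL TABLE my_table_partitioned (
  uuid    string,
  payload string
)
PARTITIONED BY (event_date string)   -- the "folder" name becomes a virtual column
STORED AS ORC
LOCATION 's3://my-bucket/my-data/';

-- Register the partitions that already exist under that location.
MSCK REPAIR TABLE my_table_partitioned;

-- Filtering on the partition column lets Athena skip whole "folders",
-- so far less data is scanned.
SELECT uuid
FROM my_table_partitioned
WHERE event_date = '2020-01-01'
  AND uuid = '<FIRST_ROW_UUID>';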

BigQuery "copy table" not working for small tables

I am trying to copy a BigQuery table using the API, from one table to another in the same dataset.
While copying big tables seems to work just fine, when copying small tables with a limited number of rows (1-10) I noticed that the destination table comes out empty (created, but with 0 rows).
I get the same results using the API and the BigQuery management console.
The issue is reproducible for any table in any dataset I have. It looks like either a bug or designed behavior.
I could not find any "minimum rows" requirement in the docs... am I missing something?
EDIT:
Screenshots
Original table: video_content_events with 2 rows
Copy table: copy111 with 0 rows
How are you populating the small tables? Are you perchance using streaming insert (bq insert from the command line tool, tabledata.insertAll method)? If so, per the documentation, data can take up to 90 minutes to be copyable/exportable:
https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataavailability
I won't get super detailed, but the reason is that our copy and export operations are optimized to work on materialized files. Data within our streaming buffers are stored in a completely different system, and thus aren't picked up until the buffers are flushed into the traditional storage mechanism. That said, we are working on removing the copy/export delay.
If you aren't using streaming insert to populate the table, then definitely contact support/file a bug here.
There is no minimum number of records required to copy a table within the same dataset or to a different dataset. This applies both to the API and the BigQuery UI. I just replicated your scenario by creating a new table with just 2 records, and I was able to successfully copy it to another table using the UI.
Attaching screenshot
I tried to copy to a timestamp-partitioned table. I had messed up the timestamps, ending up with 1000 × the current timestamp; I guess that is beyond BigQuery's maximum partition range. Despite the copy job reporting success, no data was actually loaded into the destination table.

Incremental extraction from DB2

What would be the most efficient way to select only rows from a DB2 table that were inserted or updated since the last select (or since some specified time)? There is no field in the table that would allow us to do this easily.
We are extracting data from the table for purposes of reporting, and now we have to extract the whole table every time, which is causing big performance issues.
I found an example of how to select only rows changed in the last day:
SELECT * FROM ORDERS
WHERE ROW CHANGE TIMESTAMP FOR ORDERS >
CURRENT TIMESTAMP - 24 HOURS;
But, I am not sure how efficient this would be, since the table is enormous.
Is there some other way to select only the rows that have changed that might be more efficient than this?
I also found a solution called ParStream. This seems like something that could speed up demanding queries on the data, but I was unable to find any useful documentation about it.
I propose these options:
You can use Change Data Capture (CDC), which will automatically replay the modifications to another data source.
Normally, a select statement does not guarantee the order of the rows. That means you cannot use a select without a time reference to retrieve only the most recent rows; you need a time column. You can keep track of the most recent extracted value in a global variable, and on the next run retrieve only the rows with a time greater than that variable. To improve performance further, you can put the table in append mode, so that new rows are stored physically together. An index on this time column can be expensive to maintain, but it will speed up the extract (no table scan) when you need to retrieve the changed rows. A rough sketch of this approach is given after the last option below.
If your server is DB2 for i, use database journaling. You can extract after images of inserted records by time period or journal entry number from the journal receiver(s). The data entries can then be copied to your target file.
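A minimal sketch of the time-column approach described above, assuming DB2 for LUW; the column, index, and host-variable names are hypothetical:
-- Add a column that DB2 maintains automatically with the row change timestamp.
ALTER TABLE ORDERS
  ADD COLUMN CHANGED_TS TIMESTAMP NOT NULL
  GENERATED ALWAYS FOR EACH ROW ON UPDATE AS ROW CHANGE TIMESTAMP;

-- Optional: store new rows physically together at the end of the table.
ALTER TABLE ORDERS APPEND ON;

-- Index the time column so the incremental extract avoids a full table scan.
CREATE INDEX ORDERS_CHANGED_TS_IX ON ORDERS (CHANGED_TS);

-- Extract only rows changed since the last run; :LAST_EXTRACT_TS is the value
-- you keep track of between runs (e.g. in a global variable or control table).
SELECT *
FROM ORDERS
WHERE CHANGED_TS > :LAST_EXTRACT_TS;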