Load order of entries in BigQuery tables - google-bigquery

I have some sample data that I've been loading into Google BigQuery. I have been importing the data in ndjson format. If I load the data all in one file, the rows show up in a different order in the table's preview tab than when I import them sequentially, one ndjson line at a time.
When importing sequentially I wait till I see the following output:
Waiting on bqjob_XXXX ... (2s) Current status: RUNNING
Waiting on bqjob_XXXX ... (2s) Current status: DONE
The order in which the rows show up seems to match the order in which I append them, as each import job seems to finish before I move on to the next. But when I load them all in one file, they show up in a different order than they have in my data file.
So why do the data entries show up in a different order when loading in bulk? How are the data entries queued to be loaded and also how are they indexed into the table?

BigQuery has no notion of indexes. Data in BigQuery tables has no particular order that you can rely on. If you need to get ordered data out of BigQuery, you will need to use an explicit ORDER BY in your query, which, by the way, is not recommended for large results, as it increases resource cost and can end in a Resources Exceeded error.
BigQuery's internal storage can "shuffle" your data rows internally for the best query performance. So again, there is no such thing as a physical order of data in BigQuery tables.
The official wording in the docs is: line ordering is not guaranteed for compressed or uncompressed files.
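If the original file order matters, one option is to store it in the data itself before loading, so that an ORDER BY can recover it later. The following is a minimal sketch of that idea, not something from the BigQuery docs. The field name line_number and the helper add_line_numbers are made up for illustration.

```python
import json


def add_line_numbers(ndjson_text, field="line_number"):
    """Return NDJSON text where each record carries its 1-based file position.

    `field` is a hypothetical column name; pick one that is not already
    present in your records.
    """
    out = []
    position = 0
    for raw in ndjson_text.splitlines():
        if not raw.strip():
            continue  # skip blank lines
        position += 1
        record = json.loads(raw)
        record[field] = position
        out.append(json.dumps(record))
    return "\n".join(out) + "\n"


sample = '{"name": "a"}\n{"name": "b"}\n{"name": "c"}\n'
numbered = add_line_numbers(sample)
print(numbered, end="")
```

After loading the numbered file, a query with ORDER BY line_number returns the rows in file order regardless of how they are stored.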

Related

Data disappears when moving from events_intraday_ to events_

I am using BigQuery to analyze Firebase Analytics events. I use events_intraday_ for real-time analysis and events_ for daily analysis. The data is automatically transferred from events_intraday_ to events_ after a certain time, but some data disappears at that point. The table exists, but the data is clearly reduced. About 2 days out of a week's data is lost here. Please tell me why this happens.
Thanks.
Data should not be lost when moved from events_intraday_ to events_.
A common problem that is easy to fix lies in the setup. Because intraday collects the data from “today” in real time, you first need to agree with Google BigQuery on what “today” refers to. BigQuery can’t guess what time zone you want to query, which is why the default UNIX timestamp format of the event_timestamp column in BigQuery is always in UTC. This post explains it clearly: Firebase BigQuery server offset time
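To see how the time zone changes which day an event belongs to, here is a small sketch. It assumes event_timestamp holds microseconds since the Unix epoch in UTC, as in the Firebase export. The helper event_local_date and the fixed +9 hour offset are only illustrative.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def event_local_date(event_timestamp_micros, utc_offset_hours):
    """Return the calendar date (ISO string) of an event in a fixed-offset zone."""
    utc_time = EPOCH + timedelta(microseconds=event_timestamp_micros)
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    return utc_time.astimezone(local_zone).date().isoformat()


# 2023-01-01 20:00:00 UTC, expressed in microseconds
ts = 1672603200000000
print(event_local_date(ts, 0))  # date as seen in UTC
print(event_local_date(ts, 9))  # date as seen at UTC+9
```

The same event falls on different days in the two zones, so counts grouped by a local "day" will not line up with counts grouped by the UTC date.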
Also, I am not sure your last statement is correct: "events_intraday_" and "events_" are not quite the same thing. An "events_intraday_" table contains raw, unsampled event data for the current day, while the "events_" table contains processed and aggregated event data.
This processing happens after the data is collected but before it is exported to BigQuery, which means you would expect some data to be lost. Generally, the affected fields are traffic sources and linked marketing products (AdWords, Campaign Manager, etc.). If these are the areas you are looking at, it's probably a GA4 processing issue.

What is the best approach for bulk cleaning a database table that has a large amount of duplicated data loaded every day (snowflake db)

Thanks in advance for reading this, I hope I explain my problem.
In one of our domains, we have multiple pipelines where data flows from S3 into a Snowflake staging table using Airflow. The data itself originates from a number of different applications, but the process is always the same. The data is extracted from the application by the support teams (multiple support teams across multiple countries, using different technologies), then moved into AWS S3 and then bulk loaded into Snowflake. Due to limitations on the data at the source, there often isn't any filter on the data itself, and effectively the staging table is loaded with the raw CSV every single day, with a file date column added to the data. The result is that we have tables that have been loaded with the same data every single day since 2009.
However the data does change, from day to day a column value will change and so the file date is very useful in tracking changed attributes and something that I want to exploit. Further if the data was cleansed we would need approximately 1% of the data.
These tables are huge. Some contain around 16 trillion rows, but they can be quite narrow.
I would like to loop through each day's worth of data efficiently and load only new data into the staging tables, as opposed to just loading everything each day.
I have tried the following
A query that windows over the entire set, compares the hashed value of each row (minus the file date) and returns the row only if it did not appear in the previous date's data set. This works, but not for the larger tables, as the warehouse starts to write to disk and then it takes hours.
A day-by-day loop that looks at each file date's data set, compares it to the previous day and loads only the difference. This takes too long on the initial clean of the tables, but it is what I am doing once the data has been cleaned, and it will form the initial load procedure.
The current solution is where I dynamically create multiple MINUS set statements, each taking one day minus the day before, and then batch these into blocks of 10-20 based on the average daily row size. As an example:
INSERT INTO TEMP TABLE
(SELECT * FROM TABLE A WHERE FILE_DATE = 040123
MINUS
SELECT * FROM TABLE A WHERE FILE_DATE = 030123)
UNION ALL
(SELECT * FROM TABLE A WHERE FILE_DATE = 030123
MINUS
SELECT * FROM TABLE A WHERE FILE_DATE = 020123)
etc...
This is not pretty, though it does work. However, it's taking me around 12 hours to process 70-odd tables.
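For reference, the dynamic generation described above could be sketched roughly as follows. The function name build_minus_batches, the table name TABLE_A and the target TEMP_TABLE are placeholders, and the file dates are assumed to be stored as quoted strings. The generated statements simply mirror the example as written.

```python
def build_minus_batches(table, file_dates, batch_size=10, target="TEMP_TABLE"):
    """Build INSERT statements, each combining up to `batch_size` day-over-day MINUS blocks.

    `file_dates` must be sorted from oldest to newest.
    """
    # Pair every day with the day before it: (day, previous_day)
    pairs = list(zip(file_dates[1:], file_dates[:-1]))
    statements = []
    for start in range(0, len(pairs), batch_size):
        blocks = [
            f"(SELECT * FROM {table} WHERE FILE_DATE = '{day}'\n"
            f" MINUS\n"
            f" SELECT * FROM {table} WHERE FILE_DATE = '{prev}')"
            for day, prev in pairs[start:start + batch_size]
        ]
        statements.append(f"INSERT INTO {target}\n" + "\nUNION ALL\n".join(blocks))
    return statements


dates = ["010123", "020123", "030123", "040123"]
batches = build_minus_batches("TABLE_A", dates, batch_size=2)
for sql in batches:
    print(sql)
    print("---")
```

With batch_size=2 and four file dates, this produces two statements: one with two MINUS blocks and one with the remaining block.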
I would like advice on whether there is another approach.
Please bear in mind that I am limited to using snowflake due to resourcing issues and politics.
Any guidance and ideas would be much appreciated.
Regards

Spark joins- save as dataframes or partitioned hive tables

I’m working on a project with test data of close to 1 million records per file, and 4 such files.
The task is to perform around 40 calculations joining the data from 4 different files, each close to 1 GB.
Currently, I save the data from each file into a Spark table using saveAsTable and perform the operations. For example, table1 joins with table2 and the results are saved to table3. Table3 (the result of 1 and 2) joins with table4, and so on. Finally, I’m saving these calculations in a different table and generating the reports.
The entire process takes around 20 minutes, and my concern is whether there will be performance issues when this code gets to production, with data probably 5 times larger than this.
Or is it better to save the data from each file in a partitioned way and then perform the joins to arrive at the final result set?
P.S. The objective is to get instant results, and there might be cases where the user updates a few rows in the file and expects an instant result. The data arrives on a monthly basis, basically once every month, with categories and sub-categories within.
What you are doing is just fine, but make sure to cache + count after every resource-intensive operation instead of writing all the joins and then saving at the last step.
If you do not cache in between, Spark will run the entire DAG from top to bottom at the last step. This may cause the JVM to overflow and spill to disk during operations, which may in turn affect the execution time.
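The cache + count pattern could look roughly like this in PySpark. This is a sketch under assumptions: the function names and the single join key are hypothetical, and the arguments are expected to be Spark DataFrames. It only uses the DataFrame methods join, cache and count.

```python
def join_and_materialize(left, right, key):
    """Join two Spark DataFrames, cache the result, and force evaluation.

    cache() only marks the result for caching; the count() action is what
    actually runs the join and fills the cache.
    """
    joined = left.join(right, on=key, how="inner").cache()
    joined.count()
    return joined


def run_pipeline(table1, table2, table4, key):
    """Chain the joins from the question, materializing after each one."""
    table3 = join_and_materialize(table1, table2, key)
    return join_and_materialize(table3, table4, key)
```

Each intermediate result is computed once and then reused by the next join, instead of the whole lineage being recomputed at the final save.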

Bigquery and Tableau

I connected Tableau to BigQuery and was working on the dashboards. The issue here is that BigQuery charges for the data a query scans every time.
My table is 200 GB of data. When someone queries the dashboard in Tableau, the query runs on the whole table. When any filter is used on the dashboard, it runs again on the whole table.
On 200 GB of data, if someone applies 5 filters in different analyses, BigQuery calculates 200*5 = 1 TB (nearly). For one day of testing the analysis, we were charged for 30 TB of analysis. But the table behind it is only 200 GB. Is there any way I can stop Tableau from running on the whole data set in BigQuery every time there is a change?
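The arithmetic above can be written out as a quick estimate. This is only a back-of-the-envelope sketch that assumes every dashboard interaction scans the full table and that 1 TB is counted as 1000 GB. The function names are made up.

```python
def scanned_gb(table_gb, full_scans):
    """Total data scanned when every query reads the whole table."""
    return table_gb * full_scans


def full_scans_for(billed_tb, table_gb, gb_per_tb=1000):
    """How many full-table scans add up to the billed volume."""
    return billed_tb * gb_per_tb / table_gb


print(scanned_gb(200, 5))       # GB scanned by 5 filter changes
print(full_scans_for(30, 200))  # full scans behind a 30 TB day
```

Under these assumptions, a 30 TB day corresponds to about 150 full scans of the 200 GB table, which is easy to reach with a few users clicking through filters.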
The extract in Tableau is indeed one valid strategy, but only when you are using a custom query. If you access the table directly it won't work, as that will download 200 GB to your machine.
Other options to limit the amount of data are:
Not calling any columns that you don't need. Do this by hiding unused fields in Tableau. Tableau will then not include those fields in the query it sends to BigQuery. Otherwise it's a SELECT * and you pay for the full 200 GB even if you don't use those fields.
Another option that we use a lot is partitioning our tables, for instance one partition per day of data if you have a date field. Using the TABLE_DATE_RANGE and TABLE_QUERY functions you can then smartly limit the number of partitions, and hence rows, that Tableau will query. I usually hide the complexity of these table wildcard functions away in a view, and then I use the view in Tableau. Another option is to use a parameter in Tableau to control the TABLE_DATE_RANGE.
1) Right now I am learning BQ + Tableau too, and I found that using "Extract" is a must for BQ in Tableau. With this option you can also save time when building dashboards. So my current pipeline is "Build query > Add it to Tableau > Make dashboard > Upload Dashboard to Tableau Online > Schedule update for Extract".
2) You can send a Custom Quota Request to Google and set up limits per project/per user.
3) If each of your queries touches 200 GB every time, consider optimizing these queries (don't use SELECT *, use only the dates you need, etc.).
The best approach I found was to partition the table in BQ based on a date (day) field which has no timestamp. BQ allows you to partition a table by a day level field. The important thing here is that even though the field is day/date with no timestamp it should be a TIMESTAMP datatype in the BQ table. i.e. you will end up with a column in BQ with data looking like this:
2018-01-01 00:00:00.000 UTC
The reason the field needs to be a TIMESTAMP datatype (even though there is no time in the data) is that when you create a viz in Tableau it generates SQL to run against BQ, and for the partitioned field to be utilised by the Tableau-generated SQL it needs to be a TIMESTAMP datatype.
In Tableau, you should always filter on your partitioned field and BQ will only scan the rows within the ranges of the filter.
I tried partitioning on a DATE datatype and looked up the logs in GCP and saw that the entire table was being scanned. Changing to TIMESTAMP fixed this.
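As an illustration of the kind of query that benefits from this, the sketch below builds a statement that names only the needed columns and filters on the partitioned TIMESTAMP column. It is not the SQL Tableau actually generates. The table name, column names and the helper build_filtered_query are all hypothetical.

```python
def build_filtered_query(table, columns, partition_column, start_date, end_date):
    """Build a query that selects explicit columns and filters on the partition column.

    Dates are 'YYYY-MM-DD' strings; the end date is exclusive.
    """
    column_list = ", ".join(columns)
    return (
        f"SELECT {column_list}\n"
        f"FROM `{table}`\n"
        f"WHERE {partition_column} >= TIMESTAMP('{start_date}')\n"
        f"  AND {partition_column} < TIMESTAMP('{end_date}')"
    )


query = build_filtered_query(
    table="my_project.my_dataset.events",  # hypothetical table
    columns=["event_day", "category", "amount"],
    partition_column="event_day",
    start_date="2018-01-01",
    end_date="2018-02-01",
)
print(query)
```

Checking the bytes processed for such a query in the GCP logs, as described above, is the way to confirm that only the filtered partitions are scanned.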
The thing about Tableau and BigQuery is that Tableau calculates the filter values using your query (a live query). What I have seen in my project's logging is that it creates the filters from your own query:
select 'Custom SQL Query'.filtered_column from ( your_actual_datasource_query ) as 'Custom SQL Query' group by 'Custom SQL Query'.filtered_column
Instead, try to create the Tableau data source with incremental extracts, and also try to have your query date partitioned (BigQuery only supports date partitioning) so that you can limit the data use.

Bigquery return nested results without flattening it without using a table

It is possible to return nested results (RECORD type) if the noflatten_results flag is specified, but is it possible to just view them on screen without writing them to a table first?
For example, here is a simple user table (my actual table is large: 400+ columns with multiple levels of nesting):
ID,
name: {first, last}
I want to view the record of a particular user and display it in my application, so my query is
SELECT * FROM dataset.user WHERE id=423421 limit 1
Is it possible to return the result directly?
You should write your output to a "temp" table with the noflatten_results option (with a suitable expiration set so the table is purged after it is used) and serve your client out of this temp table, all "on the fly".
Keep in mind that no matter how small the "temp" table is, if you query it (in the second step above) you will be billed for at least 10 MB, so you are better off using the Tabledata.list API in this step (https://cloud.google.com/bigquery/docs/reference/v2/tabledata/list), which is free!
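With the google-cloud-bigquery Python client, reading rows through tabledata.list corresponds to Client.list_rows. A minimal sketch of the second step, assuming the temp table already exists, might look like this. The helper name fetch_temp_rows and the table ID are hypothetical, and the client is passed in rather than created here.

```python
def fetch_temp_rows(client, table_id, max_rows=1):
    """Read rows straight from a table via list_rows instead of running a query.

    `client` is expected to be a google.cloud.bigquery.Client and `table_id`
    a string such as "project.dataset.temp_table" (hypothetical name).
    """
    rows = client.list_rows(table_id, max_results=max_rows)
    # Each row exposes items(); nested RECORD fields stay nested.
    return [dict(row.items()) for row in rows]
```

Typical use would be to create the client with bigquery.Client() and call fetch_temp_rows(client, "my_project.my_dataset.temp_user") once the query job that fills the temp table has finished.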
So if you try to get repeated records, it will fail in the interface/BQ console with the error:
Error: Cannot output multiple independently repeated fields at the same time.
The way to get past this error is to FLATTEN your output.