Compute the final dataframe ready for use - apache-spark-sql

I have a big job that joins several tables together and eventually produces an aggregated table for a final report. But every time I fetch the final summary table with some filters, the job takes a really long time to finish, and I believe this is because of Spark's lazy evaluation. Is there a way to evaluate the final summary table first, so that each later filter on the summary runs faster?
I know that writing the summary table to storage and reading it back would solve the problem, but if I don't want to write and read it back, is there any other way?
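A minimal Spark SQL sketch of one way to do this, assuming the joined result is registered as a view named joined_source and the report groups by hypothetical region and product columns: CACHE TABLE is eager by default, so the summary is computed once and later filters read from the cached result rather than re-running the joins.

-- Materialize the aggregated summary once in the cache (eager unless LAZY is specified).
CACHE TABLE summary_report AS
SELECT region, product, SUM(amount) AS total_amount
FROM joined_source
GROUP BY region, product;

-- Subsequent filtered reports hit the cached summary instead of recomputing the joins.
SELECT * FROM summary_report WHERE region = 'EMEA';

The cache only lives for the duration of the Spark session, so this avoids the write-and-read-back step only within a single job.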

Related

Bigquery: querying partitioned table with SELECT within WHERE doesn't work as expected

I am trying to optimise some DBT models by using incremental runs on partitioned data and ran into a problem - the suggested approach that I've found doesn't seem to work. By not working I mean that it doesn't decrease the processing load as I'd expect.
Below is the processing load of a simple select of the partitioned table:
unfiltered query processing load
Now I try to select only new rows added since the last incremental run:
filtered query
You can notice that the load is exactly the same.
However, the select inside the WHERE is rather lightweight:
selecting only the max date
And when I fill in the result of that select manually, the processing load is suddenly minimal, which is what I'd expect:
expected processing load
Finally, both tables (the one I am querying data from, and the one I am querying max(event_time)) are configured in exactly the same way, both being partitioned by DAY on field event_time:
config on tables
What am I doing wrong? How could I improve my code to actually make it work? I'd expect the processing load to be similar to the one using an explicit condition.
P.S. apologies for posting links instead of images. My reputation is too low, as this is my first question here.
Since the nature of the query is dynamic, i.e. the WHERE condition is not a constant, BigQuery cannot accurately estimate the amount of data to be processed before execution.
This is due to the fact that max(event_time) is not constant and might change, hence affecting the size of the data to be fetched by the outer query.
For estimation purposes, try one of these two approaches:
Replace the inner query with a constant value and check the estimated bytes to be processed (a sketch follows below).
Run the query once and check the processed data under Query results -> Job Information -> Bytes processed and Bytes billed.
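For the first approach, a sketch of the two variants (the table names are hypothetical; event_time is the DAY partitioning field from the question):

-- Dynamic filter: the dry-run estimator cannot prune partitions, because
-- MAX(event_time) is only known at execution time.
SELECT *
FROM `my_project.my_dataset.events`
WHERE event_time > (SELECT MAX(event_time) FROM `my_project.my_dataset.events_incremental`)

-- Constant filter: the estimate drops to just the matching partitions.
SELECT *
FROM `my_project.my_dataset.events`
WHERE event_time > TIMESTAMP('2023-06-01')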

Bigquery Tier 1 exceeded for partitioned table but not for by day tables

We have two tables in BigQuery: one is large (a couple billion rows) and on a 'table-per-day' basis; the other is date-partitioned, has the exact same schema, but holds only a small subset of the rows (~100 million rows).
I wanted to run a (standard SQL) query with a subselect in the form of a join (the same happens when the subselect is in the WHERE clause) on the small partitioned dataset. I was not able to run it because tier 1 was exceeded.
If I run the same query on the big dataset (that contains the data I need and a lot of other data) it runs fine.
I do not understand the reason for this.
Is it because:
1. Partitioned tables need more resources to query, or
2. BigQuery has some internal rule that the ratio of data processed to resources needed must meet a certain threshold, i.e. I was not paying enough when I queried the small dataset given the amount of resources I needed.
If 1. is true, we could simply put the small dataset on a 'table-per-day' basis as well in order to solve the issue. But before we do that, we would like to know whether it is really going to solve our problem.
Details on the queries:
Big dataset
queries 11 GB, runs 50 secs, Job Id remilon-study:bquijob_2adced71_15bf9a59b0a
Small dataset
Job Id remilon-study:bquijob_5b09f646_15bf9acd941
I'm an engineer on BigQuery and I just took a look at your jobs, but it looks like your second query has an additional filter with a nested clause that your first query does not have. It is likely that that extra processing is making your query exceed your tier. I would recommend running the queries in the BigQuery UI and looking at the Explanation tab to see how the queries differ in the query plan.
If you try running the exact same query (modifying only the partition syntax) for both tables and still get the same error I would recommend filing a bug.
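For reference, a sketch of what "the exact same query, modifying only the partition syntax" might look like in standard SQL (hypothetical project, dataset, table, and field names):

-- Table-per-day dataset, queried through a wildcard table:
SELECT field_a
FROM `my_project.big_dataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20170501' AND '20170507'

-- Date-partitioned table, filtered on the partition pseudo-column:
SELECT field_a
FROM `my_project.small_dataset.events`
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2017-05-01') AND TIMESTAMP('2017-05-07')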

BigQuery range decorator duplicate issue

We are facing issues with BigQuery range decorators on a streaming table. The range decorator queries return duplicate data.
My case:
My BQ table regularly receives data from customer events through streaming inserts. Another job periodically fetches time-bound data from the table using a range decorator and sends it to Dataflow jobs, like this:
The first time, fetching all the data from the table using
SELECT * FROM [project_id:alpha.user_action@1450287482158]
When I ran this query I got 91 records.
After 15 minutes, another query based on the last interval:
SELECT * FROM [alpha.user_action@1450287482159-1450291802380]
This also gave the same result, 91 records.
However, when I tried to run the same first query again to cross-check,
SELECT * FROM [project_id:alpha.user_action@1450287482158]
it gives empty data.
Any help on this?
First off, have you tried using streaming dataflow? That might be a better fit (though your logic is not expressible as a query). Streaming dataflow also supports Tee-ing your writes, so you can keep both raw data and aggregate results.
On to your question:
Unfortunately this is a collision of two concepts that were built concurrently and somewhat independently, thus resulting in ill-defined interactions.
Time range table decorators were designed/built in a world where only load jobs existed. As such, blocks of data are atomically committed to a table at a single point in time. Time range decorators work quite well with this, as there are clear boundaries of inclusion/exclusion, and the relationship is stable.
Streaming Ingestion + realtime query is somewhat counter to the "load job" world. BigQuery is buffering your data for some period of time, making it available for analysis, and then periodically flushing the buffers onto the table using the traditional storage means. While the data is buffered, we have "infinite" time granularity. However, once we flush the buffer onto the table, that infinite granularity is compressed into a single time, which is currently the flush time.
Thus, using time range decorators on streaming tables can unfortunately result in some unexpected behaviors, as the same data may appear in two non-overlapping time windows (once while it is buffered, and once when it is flushed).
Our recommendation if you're trying to achieve windowed queries over recent data is to do the following:
Include a system timestamp in your data.
For the table decorator timestamps, include some buffer around the actual window to account for clock skew between your clock and Google's, and late arrivals from retry. This buffer should be both before and after your target window (see the sketch after this list).
Modify your query to apply your actual time window.
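A sketch of this pattern in legacy SQL, with hypothetical field names and window values: the decorator range is padded with a buffer on both sides, and the actual window is applied on a system timestamp column recorded in the data itself.

-- Decorator range: target window plus a buffer before and after
-- (to absorb clock skew and late, retried inserts).
SELECT *
FROM [project_id:alpha.user_action@1450287182158-1450292102380]
-- Actual window, applied on the timestamp you record yourself:
WHERE event_timestamp >= TIMESTAMP('2015-12-16 17:38:02')
  AND event_timestamp < TIMESTAMP('2015-12-16 18:50:02')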
It should be noted that depending on your actual usage purpose, this may not address your problems. If you can give more details, there might be a way to achieve what you need.
Sorry for the inconvenience.

Need help designing a DB - for a non DBA

I'm using Google's Cloud Storage & BigQuery. I am not a DBA, I am a programmer. I hope this question is generic enough to help others too.
We've been collecting data from a lot of sources and will soon start collecting data real-time. Currently, each source goes to an independent table. As new data comes in we append it into the corresponding existing table.
Our data analysis requires each record to have a timestamp. However, our source data files are too big to edit before we add them to Cloud Storage (4+ GB of textual data per file). As far as I know there is no way to append a timestamp column to each row before bringing them into BigQuery, right?
We are thus toying with the idea of creating daily tables for each source, but we don't know how this will work when we have real-time data coming in.
Any tips/suggestions?
Currently, there is no way to automatically add timestamps to a table, although that is a feature that we're considering.
You say your source files are too big to edit before putting them in Cloud Storage... does that mean that the entire source file should have the same timestamp? If so, you could import to a new BigQuery table without a timestamp, then run a query that basically copies the table but adds a timestamp. For example, SELECT all, fields, CURRENT_TIMESTAMP() FROM my.temp_table (you will likely want to use allow_large_results and set a destination table for that query). If you want to get a little bit trickier, you could use the __DATASET__ pseudo-table to get the modified time of the table, and then add it as a column to your table either in a separate query or in a JOIN. Here is how you'd use the __DATASET__ pseudo-table to get the last modified time:
SELECT MSEC_TO_TIMESTAMP(last_modified_time) AS time
FROM [publicdata:samples.__DATASET__]
WHERE table_id = 'wikipedia'
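Spelled out, the simpler copy-with-timestamp query might look like this (hypothetical table and field names; run with allow_large_results and a destination table configured on the job):

SELECT
  field_a,
  field_b,
  CURRENT_TIMESTAMP() AS load_time
FROM [my_project:my_dataset.temp_table]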
Another alternative to consider is the BigQuery streaming API (more info here). This lets you insert single rows or groups of rows into a table just by posting them directly to BigQuery. This may save you a couple of steps.
Creating daily tables is a reasonable option, depending on how you plan to query the data and how many input sources you have. If this is going to make your queries span hundreds of tables, you're likely going to see poor performance. Note that if you need timestamps because you want to limit your queries to certain dates and those dates are within the last 7 days, you can use the time range decorators (documented here).
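For example, a relative range decorator in legacy SQL that reads only the data added in roughly the last 7 days (hypothetical table and field names):

-- Negative offsets are milliseconds relative to now; the trailing dash
-- means "from that point up to the present".
SELECT field_a
FROM [my_project:my_dataset.events@-604800000-]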

Select part of a table for later use

I'm currently trying to optimize my program. I have a large database consisting of timestamped data. The data I need to update is only the data for the current day, so I don't want to search the entire database more than once just to find today's entries. Is there a way to select something and then use it later in several different (MERGE INTO) commands?
I want to select all of today's data, then run a while loop (in Java) over every single entry for today, updating them all. Is this even possible, or do I have to traverse the entire database on each while-loop iteration?
If you are optimizing your program and your data is timestamped, the first thing you can do is create an index on the timestamp field. This will reduce your query execution time, because your filter criterion is on that timestamp field.
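A sketch of that, assuming a table named events with a timestamp column event_time (the exact date arithmetic varies by database):

-- Index the column used in the daily filter.
CREATE INDEX idx_events_event_time ON events (event_time);

-- Today's rows can then be located without scanning the whole table,
-- and the same filter can be reused inside each MERGE INTO statement.
SELECT *
FROM events
WHERE event_time >= CURRENT_DATE
  AND event_time < CURRENT_DATE + INTERVAL '1' DAY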
Use a proper data caching technology, like memcached, to minimize database hits for read-heavy, slowly changing data.