I have a very large dataset (approx. 20 TB) in BigQuery. I need to read the data in sorted order, so I have to use an ORDER BY clause. With ORDER BY alone the data is returned, but when I add LIMIT and OFFSET the query exceeds the resource limit. To work around that, I came up with the idea of using the temporary cache table that is created every time I run a query.
So essentially I run an ORDER BY query to sort the data, which is saved in the temp table, and then I query that temp table to get a small part of the data using LIMIT and OFFSET. But sometimes the temp table returns rows in a random order instead of sorted as they should be.
I read in the BigQuery docs multiple times that
using them as inputs for dependent jobs is strongly discouraged
but I never found the reason for that. Also, the data is not randomized in every case, only in some. Am I missing something here?
If there is a better way to get the sorted data than using the temp cache table, that would also be welcome.
Note: I have tried creating a temp table of my own to store the ordered data, but that also throws a resources exceeded error.
I am trying to optimise some dbt models by using incremental runs on partitioned data and ran into a problem: the suggested approach that I've found doesn't seem to work. By not working I mean that it doesn't decrease the processing load as I'd expect.
Below is the processing load of a simple select of the partitioned table:
unfiltered query processing load
Now I try to select only new rows added since the last incremental run:
filtered query
Notice that the load is exactly the same.
However, the select inside the WHERE is rather lightweight:
selecting only the max date
And when I fill in the result of that select manually, the processing load is suddenly minimal, which is what I'd expect:
expected processing load
Finally, both tables (the one I am querying data from, and the one I am querying max(event_time)) are configured in exactly the same way, both being partitioned by DAY on field event_time:
config on tables
What am I doing wrong? How could I improve my code to actually make it work? I'd expect the processing load to be similar to the one using an explicit condition.
P.S. apologies for posting links instead of images. My reputation is too low, as this is my first question here.
Since the query is dynamic, i.e. the WHERE condition is not a constant, BigQuery cannot accurately estimate the amount of data to be processed before execution.
This is due to the fact that max(event_time) is not constant and might change, which affects the amount of data fetched by the outer query.
For estimation purposes, try one of these two approaches:
1. Replace the inner query with a constant value and check the estimated bytes to be processed.
2. Run the query once and check the processed data under Query results -> Job Information -> Bytes processed and Bytes billed.
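The gap between the two estimates can be pictured with a toy model of partition pruning. This is only a sketch, not a description of BigQuery's internals: the `PARTITION_BYTES` sizes and the `estimate_bytes` helper are hypothetical, and the model simply assumes that partitions can be skipped at planning time only when the filter value is already known.

```python
from datetime import date
from typing import Optional

# Hypothetical partition sizes in bytes, keyed by the day of event_time.
PARTITION_BYTES = {
    date(2021, 1, 1): 500,
    date(2021, 1, 2): 700,
    date(2021, 1, 3): 300,
}

def estimate_bytes(min_event_date: Optional[date]) -> int:
    """Estimate the bytes scanned before the query runs.

    min_event_date is None when the filter value comes from a subquery:
    it is unknown at planning time, so no partition can be skipped.
    """
    if min_event_date is None:
        return sum(PARTITION_BYTES.values())
    # A literal filter lets the planner keep only the newer partitions.
    return sum(size for day, size in PARTITION_BYTES.items() if day > min_event_date)

dynamic_estimate = estimate_bytes(None)               # WHERE uses a subquery
constant_estimate = estimate_bytes(date(2021, 1, 2))  # WHERE uses a literal
print(dynamic_estimate, constant_estimate)
```

In this model the subquery form is estimated at the full table size, while the literal form is estimated at the size of the single newer partition, which mirrors the behaviour described in the question.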
I have a dataset that has data added almost every day, and it needs to be processed every day as part of a larger ETL.
When I select the partition directly, the query is really fast:
SELECT * FROM JSON.`s3://datalake/partitoned_table/?partition=2019-05-20`
Yet, the issue is that the event type does not generate data on some Sundays, resulting in a non-existing partition on that particular day. Because of this, I cannot use the previous statement to run my daily job.
In another attempt, I tried to have Spark find the latest partition of that dataset, to be sure the bigger query wouldn't fail:
SELECT * FROM JSON.`s3://datalake/partitoned_table/`
WHERE partition = (SELECT MAX(partition) FROM JSON.`s3://datalake/partitoned_table/`)
This works every time, but it is unbelievably slow.
I found numerous articles and references on how to build and maintain partitions, yet nothing about how to read them correctly.
Any idea how to do this properly?
(SELECT MAX(partition) FROM JSON.`s3://datalake/partitoned_table/`)
This query will be executed as a subquery in Spark.
Reasons for the slowness:
1. The subquery needs to be executed completely before the actual query execution starts.
2. The above query will list all the S3 files to retrieve the partition information. If the folder has a large number of files, this process will take a long time. The time taken for listing is directly proportional to the number of files.
We could create a table on top of s3://datalake/partitoned_table/ with the partitioning scheme; let's say the table is named tbl.
You could then run
ALTER TABLE tbl RECOVER PARTITIONS
which stores the partition information in the metastore. This also involves a listing, but it is a one-time operation, and Spark spawns multiple threads to perform the listing to make it faster.
Then we could run
SELECT * FROM tbl WHERE partition = (SELECT MAX(partition) FROM tbl)
which will get the partition information only from the metastore, without having to list the object store, which I believe is an expensive operation.
The cost incurred in this approach is that of recovering partitions.
After that, all queries will be faster (when data for a new partition arrives, we need to recover partitions again).
Workaround when you don't have Hive:
FileSystem.get(URI.create("s3://datalake/partitoned_table"), conf).listStatus(new Path("s3://datalake/partitoned_table/"))
The above code will give you a list of partition directories, for example: List(s3://datalake/partitoned_table/partition=2019-05-20, s3://datalake/partitoned_table/partition=2019-05-21....)
This is very efficient because it only fetches metadata from the S3 location.
Just take the latest partition from that list and use it in your SQL.
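Picking the latest partition from such a listing could look roughly like the following. It is a minimal sketch that assumes the listing has already been collected into a list of path strings and that partition values are ISO dates; `latest_partition` and `PARTITION_RE` are hypothetical names, and the final query simply mirrors the direct-partition form used in the question.

```python
import re
from typing import Iterable, Optional

# Matches the trailing "partition=<value>" segment of a partition directory.
PARTITION_RE = re.compile(r"partition=([^/]+)/?$")

def latest_partition(paths: Iterable[str]) -> Optional[str]:
    """Return the largest partition value found in the listed paths, or None."""
    values = []
    for path in paths:
        match = PARTITION_RE.search(path)
        if match:
            values.append(match.group(1))
    # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
    return max(values, default=None)

listed = [
    "s3://datalake/partitoned_table/partition=2019-05-20",
    "s3://datalake/partitoned_table/partition=2019-05-21",
    "s3://datalake/partitoned_table/_SUCCESS",  # non-partition entries are ignored
]
latest = latest_partition(listed)
query = f"SELECT * FROM JSON.`s3://datalake/partitoned_table/?partition={latest}`"
print(query)
```

Because a missing Sunday partition never appears in the listing, the query always targets a partition that exists.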
I have been trying to create a table in BigQuery from a .csv file stored in my bucket. The table is created and the data is loaded with the correct number of rows and columns; however, the rows get swapped in BigQuery for some reason.
I tried to use the R connector to push the data from my local machine to BigQuery and the same problem occurs.
So when I do SELECT * FROM ,
it shows me the complete table inside BigQuery but the rows are swapped (i.e. row 21 becomes row 1, row 4000 becomes row 3 for example).
I will appreciate your response.
As in most SQL-related databases, data stored in BigQuery has no natural order. When you store data in BigQuery it will be automatically sorted in ways that can optimize the execution time of queries.
If you need to preserve the original order, you might need to add an additional column noting the relative order - and then call it out with an ORDER BY on queries.
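One way to add such a column is to number the rows of the CSV before loading it. The sketch below assumes the file has a header row and fits in memory; the `add_row_order` helper and the `row_order` column name are hypothetical.

```python
import csv
import io

def add_row_order(csv_text: str, order_column: str = "row_order") -> str:
    """Prepend a 1-based position column to every data row of a CSV with a header."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    header = next(reader)
    writer.writerow([order_column] + header)
    for position, row in enumerate(reader, start=1):
        writer.writerow([position] + row)
    return out.getvalue()

sample = "name,score\nalice,10\nbob,7\n"
numbered = add_row_order(sample)
print(numbered)
```

After loading the numbered file, adding ORDER BY row_order to the query returns the rows in their original file order.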
I have a table with a non-clustered index on a varchar column 'A'.
When I use an ORDER BY A clause, I can see that it scans the index and gives me the result in a few seconds.
But when I use the Sort component of SSIS on column 'A', it takes minutes to sort the records.
So I understand that it does not recognize my non-clustered index.
Does anyone have an idea how to make use of indexes in SSIS with components rather than queries?
Order By A is run in the database.
When using a Sort component, the sort is done in the SSIS runtime. Note that the query you use to feed the sort does not have an ORDER BY in it (I assume).
It's done in the runtime because it is data source agnostic: your source could be Excel, a text file, an in-memory dataset, a multicast, a pivot, or anything else.
My advice is to use the database as much as possible.
The only reason to use a sort in an SSIS package is if your source doesn't support sorting (e.g. a flat file) and you want to do a merge join in your package to something else, which is a very rare and specific case.
From my research and recent work with SSIS, I found out that the only way to use indexes is to connect to the database. Once you fetch your data into the flow, all you have are records and data, with no indexes!
So for tasks like Merge Join, which need a Sort component in front of them, I tried to use a Lookup component with the full cache option instead, which caches the whole dataset, and then use ORDER BY in the Source component query.
31 Days of SSIS – What The Sorts:
Whether there are one hundred rows or ten million rows – all of the rows have to be consumed by the Sort Transformation before it can return the first row. This potentially places all of the data for the data flow path in memory. And the potentially bit is because if there is enough data it will spill over out of memory.
In the image to the right you can see that until the ten million rows are all received that data after that point in the Data Flow cannot be processed.
This behavior should be expected if you consider what the transformation needs to do. Before the first row can be sent along, the last row needs to be checked to make sure that it is not the first row.
For small and narrow datasets, this is not an issue. But if your datasets are large or wide you can find performance issues with packages that have sorts within them. All of the data loaded and sorted in memory can be a serious performance hog.
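The blocking behaviour described in the quote can be reproduced with plain generators. This is a sketch of the general idea, not of the SSIS runtime itself; `counting_source`, `blocking_sort` and `passthrough` are hypothetical names.

```python
from typing import Iterable, Iterator, List

def counting_source(rows: List[int], consumed: List[int]) -> Iterator[int]:
    """Yield rows one by one, recording each row as it is pulled from upstream."""
    for row in rows:
        consumed.append(row)
        yield row

def blocking_sort(rows: Iterable[int]) -> Iterator[int]:
    """Sort step: must buffer every input row before emitting the first one."""
    buffered = sorted(rows)  # drains the whole input into memory
    yield from buffered

def passthrough(rows: Iterable[int]) -> Iterator[int]:
    """Non-blocking step: emits each row as soon as it arrives."""
    for row in rows:
        yield row

consumed_by_sort: List[int] = []
first_sorted = next(blocking_sort(counting_source([3, 1, 2], consumed_by_sort)))

consumed_by_passthrough: List[int] = []
first_passed = next(passthrough(counting_source([3, 1, 2], consumed_by_passthrough)))

print(len(consumed_by_sort), len(consumed_by_passthrough))
```

To hand over its first row, the sort step has pulled all three input rows, whereas the pass-through step has pulled only one.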
I am going to be maintaining a local copy of a database on bigquery. I will be using the API and tabledata:list. This database is not my own, and is regularly updated by the maintainers by appending new data (say every hour).
First, can I assume that when this data is appended, it will definitely be added to the end of the database?
Now, let's assume that currently the database has 1,000,000 rows and I am now downloading all of these by paging through tabledata:list. Also, let's assume that the database is updated partway through (with 10,000 rows). By using the page tokens, can I be assured that I will only download the 1m rows present when I started in the order they are in in the database?
Finally, now let's say that I come to update my copy. If I initiate the tabledata:list with a startIndex of 1,000,000 and I use a maxResults of 1000, will I get 10 pages containing the updated data that I am expecting?
I suppose all these questions boil down to whether BigQuery respects the order the data is in, whether this order is used by tabledata:list, and whether appended data is guaranteed to follow previous data.
As there is a column whose values are unique, and I can perform a simple select count(1) from table to get the length of the table, I can of course check that my local copy is complete by comparing the length of my local db with that of the remote, however if the above weren't guaranteed and I ended up with holes in my data, it would be quite impractical to remedy as the primary key is not sequential (otherwise I could just fill in the missing rows) and the database is very large.
When you append data, we will append to the end of the table data list; however, BigQuery may periodically coalesce data, which does not respect ordering. We have been discussing being able to preserve the ordering, or at least have a way of accessing the most recent data, but this is not yet implemented or designed. If it is an important feature for you, let us know and we'll prioritize it accordingly.
If you use page tokens, you are assured of a stable listing. If the table gets updated in the middle of paging through the data, you'll still only see the data that was in the table when you created the page token. Note that because of this, page tokens are only valid for 24 hours.
This should work as long as no coalesce has occurred since you have updated the table.
You can get the number of rows in the table by calling tables.get, which is usually simpler and faster than running a query.
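The stable listing that page tokens provide can be illustrated with a toy in-memory model. This is a sketch of the idea only and does not call the BigQuery API: real page tokens are opaque strings, whereas the `ToyTable` class and its `(next_offset, snapshot_length)` tuple token are hypothetical, and the model ignores coalescing entirely.

```python
from typing import List, Optional, Tuple

Token = Tuple[int, int]  # (next_offset, snapshot_length), a made-up token format

class ToyTable:
    """In-memory stand-in for a table that only ever receives appends."""

    def __init__(self, rows: List[str]) -> None:
        self.rows = list(rows)

    def append(self, new_rows: List[str]) -> None:
        self.rows.extend(new_rows)

    def list_page(
        self, max_results: int, page_token: Optional[Token] = None
    ) -> Tuple[List[str], Optional[Token]]:
        """Return (rows, next_token); the snapshot size is fixed on the first call."""
        if page_token is None:
            offset, snapshot_len = 0, len(self.rows)
        else:
            offset, snapshot_len = page_token
        end = min(offset + max_results, snapshot_len)
        page = self.rows[offset:end]
        next_token = (end, snapshot_len) if end < snapshot_len else None
        return page, next_token

table = ToyTable(["r1", "r2", "r3", "r4"])
downloaded, token = table.list_page(max_results=2)
table.append(["r5", "r6"])  # an update arrives partway through paging
while token is not None:
    page, token = table.list_page(max_results=2, page_token=token)
    downloaded.extend(page)
print(downloaded)
```

Rows appended after the first page was requested are not part of the download, matching the behaviour described for page tokens above.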