When I click on "Details" to see a preview of the data in the table, the web UI locks up and I see the following errors. I've tried refreshing and restating my browser, but it doesn't help.
Yes, we (the BQ team) are aware of performance issues when viewing very repeated rows in the BigQuery web UI. The public genomics tables are known to tickle these performance issues since individual rows of their table are highly repeated.
We're considering a few methods of fixing this, but the simplest would probably be to default to the JSON display of rows for problematic tables, and allow switching to the tabular view with a "View it at your own risk!"-style warning message.
It took a little time for me too, but it eventually (1min 40sec) loaded up to UI.
I think it is because of how table data is presented in Native BQ UI for Preview mode.
As you could noticed - it is showed in sort of hierarchical way.
I noticed this slowness for heavy tables (row size and/or hierarchical structure wise) when this was intorduced. And btw. only one row is showed for this particular table because of this.
Of course this is just my guess - would be great to hear from Google Team!
Meantime - when I am using internal application that uses same APIs for preview table data - i dont see any slowness at all (10 rows in 3 sec), which supports my above guess.
Related
I need some help cleaning my data...
I have a BQ table where I receive new entries from my back-end, these data are recorded to my BQ and I'm using Google Data Studio to present these data.
My problem is, I a field named sessions that sometimes are duplicates, I can't solve that directly in my back-end because a user can send different data from the same session so I can't just stop recording duplicates.
I've managed my problem by creating a View that selects the newest duplicate record and I'm using this view as data-source for my report. The problem with this approach is that I lost the feature of "real-time report" and that is important in this case. And another problem is that I also lost Accelerated by BigQuery BI Engine and I would like to have these feature too.
Is this the best solution for my problem and I'll need to accept this outcome or there is another way?
Many thanks in advance, kind regards.
Using the view should work for BI Engine acceleration. Can you please share more details on BI Engine? It should show you the reason query wasn't accelerated, likely mentioning one of the limitations. If you hover over "not accelerated" sign it should give you more details on why your query wasn't supported. Feel free to share it here and I will be happy to help.
Another way you can clean up the data: Have scheduled job to preprocess the data. It will mean data may not be the most recent, but it will give you ability to clean up and aggregate data.
I am running a query in GCP console. The script is correct but it shows the below message while I run the query.
"Results are not displayed because the table has a large amount of cells and may cause the BigQuery console to become unresponsive. Consider modifying your query to improve browser performance."
What is the solution.
This is just a BigQuery's mechanism to avoid overwhelming your browser.
You still can see the results by clicking on View results
If you think this message is confuse or something should be changed, I encourage you to create a Public Issue sharing your thoughts. You can do that through this link
You may have potentially caused a humongous CROSS JOIN in your query. You need to take a look at your query first, and BigQuery will compute the results if you force it, but it will lead to the use of a lot of Slots and rack up a high bill for you
I noticed that BigQuery no longer cache the same query even I have chosen to use cache in the GUI (both Alpha and Classic). I didn't edit the query at all, just keep clicking run query button and every time GUI executed the query without using cache results.
It happens to my PHP script as well. Before, it was enable to use cache and came back with results very quick and now it executes the query every time even the same query has been executed minutes ago. I can confirm the behaviour in the logs.
I am wondering if there is anything changed in the last few weeks? Or some kind of account level settings control this? Because it was working fine for me.
As per official docs here cache is disable when:
...any of the tables referenced by the query have recently received
streaming inserts...
Even if you are streaming to one partition, and then querying to another, this will invalidate caching functionality for the whole table. There is this feature request opened where it is requested to be able to hit cache when doing streaming inserts to one partition but querying a different partition.
EDIT***:
After some investigation I've found out that some months ago there was an issue going on which was allowing to hit the cache even streaming inserts were being made. This was not expected behavior, and therefore it got solved in May. I guess this is the change you have experienced and what you are talking about.
Docs have not changed related to this, and they aren't/weren't incorrect. Just the previous behavior was the incorrect one.
I know there has already been a question regarding the table number limits, but it was vague...
In a dataset I want to create about 1-2 milion tables. This happens because I want to split my users activity table into smaller tables; for each user a table. And in time this number will keep on growing.
As I understand there will be no problem from BigQuery's perpective...but i'm concerned that I will not be able to access (list) those datasets from browser (https://bigquery.cloud.google.com/queries/appname); because the tables are not grouped by time (like in the case of tables with timerange) and they get all listed in an endless scroll (possibly blocking the browser)
Thank you for any suggestions
… the problem is that the browser will get blocked while listing all
tables in the dataset
You can use the "?minimal" parameter to limit the load operation to 30,000 tables per project, so browser will not be blocked. For example:
https://bigquery.cloud.google.com/queries/<your_project_name>?minimal.
see more about Display limits
I can't easily explore my dataset because of this (and query them)
If you are planning to have 2+ million tables in same dataset, even if Web UI were to show them to you without being blocked - I really doubt you would be able to somehow reasonably visually explore them. Just too many objects to “swallow”
Btw, this is not only human specific issue - even querying such "long" tables list programmatically can be problematic. See more about Using meta-tables
because the tables are not grouped by time (like in the case of tables with timerange) and they get all listed in an endless scroll (possibly blocking the browser)
That’s right, in BigQuery Web UI tables will be grouped only if they follow table_preffixYYYYMMDD pattern. Even if you would map your userID namespace to YYYYMMDD value – you would still be out of luck as your group still will consists of those millions tables.
Thank you for any suggestions
BigQuery supports Partitioned Tables which allows to have multiple partitions in the same table. Unfortunately, as of today, only Date-Partitioned tables are supported, but from what I heard BigQuery Team plans to add partitioning by arbitrary column.
This would probably fit to your desired design, unless there will be a limitation to column cardinality.
Meantime, if you want, you can experiment with applying your design using date-partitioned tables feature by mapping userid to YYYYMMDD (~9999*12*30 >> 3+ million users)
My recommendation:
Play/experiment with partitioned tables as I suggested in previous (above) section
Sharding (splitting) tables in BigQuery to millions of tables sound to me extremely impractical. You should revisit your design. What it is that you are trying to address by such sharding? Try to focus on this and if needed - post specific question here on SO!
As an alternative solution for this you can use Google cloud sdk client.
You can read the documentation for this bq Command-Line tool here.
eg: bq ls [project_id:][dataset_id] to list all tables.
NOTE: Maximum tables per query is limited to 1000. Refer
I'm considering MongoDB right now. Just so the goal is clear here is what needs to happen:
In my app, Finch (finchformac.com for details) I have thousands and thousands of entries per day for each user of what window they had open, the time they opened it, the time they closed it, and a tag if they choose one for it. I need this data to be backed up online so it can sync to their other Mac computers, etc.. I also need to be able to draw charts online from their data which means some complex queries hitting hundreds of thousands of records.
Right now I have tried using Ruby/Rails/Mongoid in with a JSON parser on the app side sending up data in increments of 10,000 records at a time, the data is processed to other collections with a background mapreduce job. But, this all seems to block and is ultimately too slow. What recommendations does (if anyone) have for how to go about this?
You've got a complex problem, which means you need to break it down into smaller, more easily solvable issues.
Problems (as I see it):
You've got an application which is collecting data. You just need to
store that data somewhere locally until it gets sync'd to the
server.
You've received the data on the server and now you need to shove it
into the database fast enough so that it doesn't slow down.
You've got to report on that data and this sounds hard and complex.
You probably want to write this as some sort of API, for simplicity (and since you've got loads of spare processing cycles on the clients) you'll want these chunks of data processed on the client side into JSON ready to import into the database. Once you've got JSON you don't need Mongoid (you just throw the JSON into the database directly). Also you probably don't need rails since you're just creating a simple API so stick with just Rack or Sinatra (possibly using something like Grape).
Now you need to solve the whole "this all seems to block and is ultimately too slow" issue. We've already removed Mongoid (so no need to convert from JSON -> Ruby Objects -> JSON) and Rails. Before we get onto doing a MapReduce on this data you need to ensure it's getting loaded into the database quickly enough. Chances are you should architect the whole thing so that your MapReduce supports your reporting functionality. For sync'ing of data you shouldn't need to do anything but pass the JSON around. If your data isn't writing into your DB fast enough you should consider Sharding your dataset. This will probably be done using some user-based key but you know your data schema better than I do. You need choose you sharding key so that when multiple users are sync'ing at the same time they will probably be using different servers.
Once you've solved Problems 1 and 2 you need to work on your Reporting. This is probably supported by your MapReduce functions inside Mongo. My first comment on this part, is to make sure you're running at least Mongo 2.0. In that release 10gen sped up MapReduce (my tests indicate that it is substantially faster than 1.8). Other than this you can can achieve further increases by Sharding and directing reads to the the Secondary servers in your Replica set (you are using a Replica set?). If this still isn't working consider structuring your schema to support your reporting functionality. This lets you use more cycles on your clients to do work rather than loading your servers. But this optimisation should be left until after you've proven that conventional approaches won't work.
I hope that wall of text helps somewhat. Good luck!