I am streaming data into three BigQuery tables from Cloud DataFlow sinks, but am seeing very slow query results on one of the target tables - a
"small" one, about 1.5 million rows. If I stop the DataFlow streaming job and come back to the table some time later, the same query runs
quickly. Here's the query in Standard SQL Dialect:
SELECT appname, start_time_to_minute, SUM(sum_client_bytes + sum_server_bytes) as `sum`
FROM `myproject.myset`.`mysmalltable`
GROUP BY 1, 2
appname:STRING
start_time_to_minute:TIMESTAMP
sum_client_bytes:INTEGER
sum_server_bytes:INTEGER
Job ID: bquijob_568af346_15c82f9e5ec - it takes 12s.
This table is growing by about 2000 rows per minute, via streaming. Another target table in the same project grows much more quickly via streaming,
maybe 200,000 rows per minute. If I run the query above on mysmalltable while streaming, it can take close to a minute. We experienced query times of several minutes on similar queries.
Job ID: bquijob_7b4ea8a1_15c830b7f12, it takes 42.8s
If I add a filter, things get worse e.g.
WHERE REGEXP_CONTAINS(`appname`, 'Oracle')
Job ID: bquijob_502b0a06_15c830d9ecf, it takes 57s
One yesterday took over 6 minutes:
Job ID: bquijob_49761f0d_15c80c6f415, it took 6m38s
I understand that in order to support querying "live" data, BigQuery has a much less efficient data provider that operates on top of the streaming
buffer. Is this implicated here? Is there a way we can make these queries run reliably in under 30s? For example, somehow avoiding the streaming
buffer and using >1 minute old data? If the streaming buffer is implicated, it still doesn't quite add up for me, since I would've thought most of the data being read out of mysmalltable would still be in native format.
I appreciate any guidance!
I've seen also this behaviour, the way that I workaround it (I'm not going to say solved, because that's mostly something from Google), was to use micro-batching instead of stream inserts. When the concurrency is low, the stream inserts work really well, but with real BigData (like in my case, hundreds of thousands) the best way is to use micro-batching. I'm using the FILE_LOADS option with a windowing of 3 minutes and works really well. Hopefully that can help you.
Related
I'm trying to update a modest dataset of 60k records with a value which takes a little time to compute. From a small trial run of 6k records in the production environment, it took 4 minutes to complete, so the full execution should take around 40 minutes.
However this trial run showed that there were SQL timeouts occurring on user requests when accessing data in related tables (but not necessarily on the actual rows which were being updated).
My question is, is there a way of running non-urgent queries as a background operation in the SQL server without causing timeouts or table locking for extensive periods of time? The data within the column which is being updated during this period is not essential to have the new value returned; aka if a request happened to come in for this row, returning the old value would be perfectly acceptable rather than locking the set until the update is complete (I'm not sure the ins and outs of how this works, obviously I do want to prevent data corruption; could be a way of queuing any additional changes in the background)
This is possibly a situation where the NOLOCK hint is appropriate. You can read about SQL Server isolation levels in the documentation. And Googling "SQL Server NOLOCK" will give you plenty of material on why you should not over-use the construct.
I might also investigate whether you need a SQL query to compute values. A single query that takes 4 minutes on 6k records . . . well, that is a long time. You might want to consider reading the data into an application (say, using Python, R, or whatever) and doing the data manipulation there. It may also be possible to speed up the query processing itself.
We have about 1.7 million products in our eshop, we want to keep record of how many views this products had for 1 year long period, we want to record the views every atleast 2 hours, the question is what structure to use for this task?
Right now we tried keeping stats for 30 days back in records that have 2 columns classified_id,stats where stats is like a stripped json with format date:views,date:views... for example a record would look like
345422,{051216:23212,051217:64233} where 051216,051217=mm/dd/yy and 23212,64233=number of views
This of course is kinda stupid if you want to go 1 year back since if you want to get the sum of views of say 1000 products you need to fetch like 30mb from the database and calculate it your self.
The other way we think of going right now is just to have a massive table with 3 columns classified_id,date,view and store its recording on its own row, this of course will result in a huge table with hundred of millions of rows , for example if we have 1.8 millions of classifieds and keep records 24/7 for one year every 2 hours we need
1800000*365*12=7.884.000.000(billions with a B) rows which while it is way inside the theoritical limit of postgres I imagine the queries on it(say for updating the views), even with the correct indices, will be taking some time.
Any suggestions? I can't even imagine how google analytics stores the stats...
This number is not as high as you think. In current work we store metrics data for websites and total amount of rows we have is much higher. And in previous job I worked with pg database which collected metrics from mobile network and it collected ~2 billions of records per day. So do not be afraid of billions in number of records.
You will definitely need to partition data - most probably by day. With this amount of data you can find indexes quite useless. Depends on planes you will see in EXPLAIN command output. For example that telco app did not use any indexes at all because they would just slow down whole engine.
Another question is how quick responses for queries you will need. And which steps in granularity (sums over hours/days/weeks etc) for queries you will allow for users. You may even need to make some aggregations for granularities like week or month or quarter.
Addition:
Those ~2billions of records per day in that telco app took ~290GB per day. And it meant inserts of ~23000 records per second using bulk inserts with COPY command. Every bulk was several thousands of records. Raw data were partitioned by minutes. To avoid disk waits db had 4 tablespaces on 4 different disks/ arrays and partitions were distributed over them. PostreSQL was able to handle it all without any problems. So you should think about proper HW configuration too.
Good idea also is to move pg_xlog directory to separate disk or array. No just different filesystem. It all must be separate HW. SSDs I can recommend only in arrays with proper error check. Lately we had problems with corrupted database on single SSD.
First, do not use the database for recording statistics. Or, at the very least, use a different database. The write overhead of the logs will degrade the responsiveness of your webapp. And your daily backups will take much longer because of big tables that do not need to be backed up so frequently.
The "do it yourself" solution of my choice would be to write asynchronously to log files and then process these files afterwards to construct the statistics in your analytics database. There is good code snippet of async write in this response. Or you can benchmark any of the many loggers available for Java.
Also note that there are products like Apache Kafka specifically designed to collect this kind of information.
Another possibility is to create a time series in column oriented database like HBase or Cassandra. In this case you'd have one row per product and as many columns as hits.
Last, if you are going to do it with the database, as #JosMac pointed, create partitions, avoid indexes as much as you can. Set fillfactor storage parameter to 100. You can also consider UNLOGGED tables. But read thoroughly PostgreSQL documentation before turning off the write-ahead log.
Just to raise another non-RDBMS option for you (so a little off topic), you could send text files (CSV, TSV, JSON, Parquet, ORC) to Amazon S3 and use AWS Athena to query it directly using SQL.
Since it will query free text files, you may be able to just send it unfiltered weblogs, and query them through JDBC.
We are facing issues with BigQuery range decorators on streaming table. The range decorator queries give duplicate data.
My case:
My BQ table is getting data regularly from customer events through streaming inserts. Another job is periodically fetching time bound data from the table using range decorator and sending it to dataflow jobs. like
First time fetching all the data from table using
SELECT * FROM [project_id:alpha.user_action#1450287482158]
when i ran this query got 91 records..
after 15 mins another query based on last interval
SELECT * FROM [alpha.user_action#1450287482159-1450291802380]
this also gave the same result with 91 records.
however i tried to run the same query again to cross check
SELECT * FROM [project_id:alpha.user_action#1450287482158]
Gives empty data.
any help on this?
First off, have you tried using streaming dataflow? That might be a better fit (though your logic is not expressible as a query). Streaming dataflow also supports Tee-ing your writes, so you can keep both raw data and aggregate results.
On to your question:
Unfortunately this is a collision of two concepts that were built concurrently and somewhat independently, thus resulting in ill-defined interactions.
Time range table decorators were designed/built in a world where only load jobs existed. As such, blocks of data are atomically committed to a table at a single point in time. Time range decorators work quite well with this, as there are clear boundaries of inclusion/exclusion, and the relationship is stable.
Streaming Ingestion + realtime query is somewhat counter to the "load job" world. BigQuery is buffering your data for some period of time, making it available for analysis, and then periodically flushing the buffers onto the table using the traditional storage means. While the data is buffered, we have "infinite" time granularity. However, once we flush the buffer onto the table, that infinite granularity is compressed into a single time, which is currently the flush time.
Thus, using time range decorators on streaming tables can unfortunately result in some unexpected behaviors, as the same data may appear in two non-overlapping time windows (once while it is buffered, and once when it is flushed).
Our recommendation if you're trying to achieve windowed queries over recent data is to do the following:
Include a system timestamp in your data.
For the table decorator timestamps, include some buffer around the actual window to account for clock skew between your clock and Google's, and late arrivals from retry. This buffer should be both before and after your target window.
Modify your query to apply your actual time window.
It should be noted that depending on your actual usage purpose, this may not address your problems. If you can give more details, there might be a way to achieve what you need.
Sorry for the inconvenience.
a simple count query on one of my tables takes a long time to complete (~18 secs), this table has around half a million rows, and making the same query in a bigger table (around 3 mil) takes less than 3 secs. The schema is exactly the same and the query is a simple SELECT count(*) FROM [dataset.table]
Any ideas why this is happening and what can I do to prevent it?
It looks like the issue with your table is that it was created in a lot of small chunks; this takes more work to query, since we spend a lot of time on filesystem operations (listing files and opening them).
Even so, a table the size of yours should not be so slow; BigQuery is currently experiencing high filesystem load that is causing high variability in latency. We're actively working on resolving this one. So that is the first problem.
The second problem is that we probably should do a better job of compacting the table. I've filed an internal bug that we should tweak our heuristics to be a bit more aggressive in compaction.
As a workaround, you can compact the table manually by copying the table in place. In other words, run a SELECT * from ... and writing the output to the same table, using writeDisposition:WRITE_TRUNCATE, destinationTable:<your table> and allowLargeResults:true and flattenSchema:false.
Again, this last step shouldn't be needed, but for now it should improve your situation.
Folks, I'm using BigQuery as a superfast database for my analytics queries, but I'm very disappointed with its performance.
Let me show you the numbers:
Just one Table at "from" clause
Select about 15 fields with group by each, about 5 fields with SUM()
Total table rows: 3.7 millions
Total rows returned: 830K
When I execute this query on BigQuery's console, it takes about 1 minute to process. Is this ok for you? I was expecting that it will return in about 2 seconds... If I execute this query on a columnar database, like Sybase IQ, it takes less than 2 seconds.
Big Query is a highly scalable database, before being a "super fast" database. It's designed to process HUGE amount of data distributing the processing among several different machines using a technique named Dremel. Because it's designed to use several machines and parallel processing, you should expect to have super-scalability with a good performance.
For example: analyzing all the wikipedia revisions in 5-10 seconds isn't bad, is it? But even a much smaller table would take about the same time.
Sybase IQ is often installed in a single database and it doesn't use Dremel. That said, it's going to be faster than Big Query in many scenarios...as designed.
Cheers!
Since you are returning 830k rows and BQ is always creating a temporary result table, the creation is more than a small result.
Have you turned on large results?
We are working in a shared environment and sometime loads ( table creation ) takes a while.
Certainly the performance differ from a dedicated environment. You get your dedicated environment for 20K$ a month.