Strangely large processing size in BigQuery query on partitioned table - google-bigquery

I have just noticed some strange processing size behavior while running BigQuery queries on a Timestamp-partitioned table. We have a table with about 70 devices streaming inserts once per minute, so about 70 inserts/minute or 4,200 inserts/hour. Showing a small example is the easiest way to describe the problem. In the BigQuery UI on GCP, if I query one day of data:
select * from dataset.table where DATE(time) = '2020-09-21';
it says
This query will process 3.2 GB when run.
However, if I query 7 days of data:
select * from dataset.table where DATE(time) >= '2020-09-15' and DATE(time) <= '2020-09-21';
it says
This query will process 3.3 GB when run.
Through some experimentation, I found that there is about a 14 MB increase in processing size for each additional day, which I expect is the true size of each partition.
What is even stranger is that this extra space seems to grow exponentially with the size of the table. As an example, I created a new table containing only 10 of the 70 devices and added a few months of data from these devices. When I performed these queries, I found that the query processed 1.4 MB for a 1-day query, and 8.9 MB for a 7-day query, which is an increase of about 1.07 MB/day, meaning this "extra size" is only about 33 KB, or about 2,300x smaller than the table with all devices.
So my question is: what is this large extra processing size, and why is it there? Is there something related to streaming inserts or caching that I'm missing?
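For reference, one way to see what each partition actually contains (including any rows still in the streaming buffer, which show up under the __UNPARTITIONED__ partition) is to query INFORMATION_SCHEMA.PARTITIONS. This is only a minimal sketch, assuming the table really is dataset.table:
select partition_id, total_rows, total_logical_bytes
from dataset.INFORMATION_SCHEMA.PARTITIONS
where table_name = 'table'
order by partition_id;
Comparing the summed total_logical_bytes for the queried days against the UI estimate should show where the extra bytes are coming from.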

Related

BigQuery. Long execution time on small datasets

I created a new Google Cloud project and set up a BigQuery database. I tried different queries, and they all take too long to execute. Currently we don't have a lot of data, so high performance was expected.
Below are some examples of queries and their execution time.
Query #1 (Job Id bquxjob_11022e81_172cd2d59ba):
select date(installtime) regtime
,count(distinct userclientid) users
,sum(fm.advcost) advspent
from DWH.DimUser du
join DWH.FactMarketingSpent fm on fm.date = date(du.installtime)
group by 1
The query failed after more than an hour with the error "Query exceeded resource limits. 14521.457814668494 CPU seconds were used, and this query must use less than 12800.0 CPU seconds."
Query execution plan: https://prnt.sc/t30bkz
Query #2 (Job Id bquxjob_41f963ae_172cd41083f):
select fd.date
,sum(fd.revenue) adrevenue
,sum(fm.advcost) advspent
from DWH.FactAdRevenue fd
join DWH.FactMarketingSpent fm on fm.date = fd.date
group by 1
Execution took 59.3 sec, 7.7 MB processed, which is too slow.
Query Execution plan: https://prnt.sc/t309t4
Query #3 (Job Id bquxjob_3b19482d_172cd31f629)
select date(installtime) regtime
,count(distinct userclientid) users
from DWH.DimUser du
group by 1
Execution time was 5.0 sec elapsed, 42.3 MB processed, which is not terrible but should be faster for such small volumes of data.
Tables used:
DimUser - Table size 870.71 MB, Number of rows 2,771,379
FactAdRevenue - Table size 6.98 MB, Number of rows 53,816
FactMarketingSpent - Table size 68.57 MB, Number of rows 453,600
The question is: what am I doing wrong that makes the query execution time so long? If everything is OK, I would be glad to hear any advice on how to reduce the execution time of such simple queries. If anyone from Google reads my question, I would appreciate it if the job IDs were checked.
Thank you!
P.S. I previously used BigQuery on other projects, and the performance and execution times were incredibly good for tables of 50+ TB.
Posting the same reply I've given in the GCP Slack workspace:
Both of your first two queries look like one particular worker is overloaded. You can see this because, in the compute section, the max time is very different from the avg time. This could be for a number of reasons, but I can see that you are joining a table of 700k+ rows (looking at the 2nd input) to a table of ~50k rows (looking at the first input). This is not good practice; you should switch the order so the larger table is the left-most table. See https://cloud.google.com/bigquery/docs/best-practices-performance-compute?hl=en_US#optimize_your_join_patterns
You may also have heavy skew in your join keys (e.g. 90% of rows are on 1/1/2020, or NULL). Check this.
For the third query, that time is expected; try an approximate count instead to speed it up. Also note that BQ starts to get better if you run the same query over and over, so this will get quicker.
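To make both suggestions concrete, here is a rough sketch (not a tested rewrite of the original jobs): Query #2 with the larger table (FactMarketingSpent) moved to the left of the join, and Query #3 using an approximate distinct count:
-- Query #2: larger table as the left-most input
select fm.date
,sum(fd.revenue) adrevenue
,sum(fm.advcost) advspent
from DWH.FactMarketingSpent fm
join DWH.FactAdRevenue fd on fd.date = fm.date
group by 1
-- Query #3: approximate distinct count
select date(installtime) regtime
,approx_count_distinct(userclientid) users
from DWH.DimUser du
group by 1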

Google DataPrep is extremely slow

In Google Dataflow, I have a job that basically looks like this:
Dataset: 100 rows, 1 column.
Recipe: 0 steps
Output: New Table.
But it takes between 6 and 8 minutes to run. What could be the issue?
For a Dataprep/Dataflow setup, times are usually measured in minutes, not seconds.
These tools are built for large datasets, and the duration stays roughly constant even if you have 10 times as much data.
Dataprep creates a Dataflow workflow for you and provisions a few VMs; that takes time, and that phase alone is usually in the minute range. Only later does it scale up to 50 or 1,000 machines.

Why does a standard SQL UDF increase the bytes billed so much for all the queries after it?

BQ support team,
We recently investigated standard SQL with UDFs in BQ, and it seems to work pretty well. But we noticed that it can be too costly to use, as the bytes billed can be hundreds of times larger than the original table. I think that makes sense, as the UDF may need memory to process. What I do not understand is that all the queries that use the table generated by the UDF SQL are still billed like the UDF SQL. Our original table is about 1.03 KB, and the UDF SQL run billed 10 MB. Below is the job info for a normal query:
select * from project.udf_sql_table_name;
Job ID *
Creation Time Apr 14, 2017, 2:57:29 PM
Start Time Apr 14, 2017, 2:57:29 PM
End Time Apr 14, 2017, 2:57:30 PM
Bytes Processed 1.05 KB
Bytes Billed 10.0 MB
Billing Tier 1
Destination Table *
Use Legacy SQL false
From the job info, we can see the UDF SQL generated a table of about 1.05 KB, saved as project.udf_sql_table_name. Yet even for a simple "SELECT", the "Bytes Billed" is still 10 MB, 1,000 times bigger than the processed table.
May I know whether this is correct when using UDFs?
Thanks
the "Bytes Billed" is still 10M, 1000 times bigger that the processed table. ... is this correct?
Yes, this is correct. See On-demand pricing:
Charges are rounded to the nearest MB, with a minimum 10 MB data
processed per table referenced by the query, and with a minimum 10 MB
data processed per query.
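In practice that minimum dominates for small tables: a query that reads only a few KB from each referenced table is still billed at least 10 MB per table. A hypothetical illustration (the table names are made up):
-- Both tables are only a few KB, but each referenced table is billed
-- at a 10 MB minimum, so bytes billed comes out to at least 20 MB.
select a.id
from project.dataset.tiny_a a
join project.dataset.tiny_b b on a.id = b.id;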

Neo4j Configuration for 4M Nodes 10M relationship

I am new to Neo4j and have run a few graph queries against 4M nodes and 10M relationships. So far I've been completely surprised by the performance of my queries.
SCHEMA
.......
(a:user{data:1})-[:follow]->(:user)-[:next*1..10]-(:activity)
Here the user with data:1 follows another 100,000 users. Each of those 100,000 users has 2-8 next nodes (let's say the user's activities) attached. Now I want to fetch the activities of users down to level 3 [:next*1..3]. Each activity has a relevance number property.
So now I have 100,000 * 3 nodes to traverse.
CYPHER
.......
match (u:user{data:1})-[:follow]-(:user)-[:next*1..3]-(a:activity)
return a order by a.relevance desc limit 50
This query takes about 72,000 ms almost every time. I am new to Neo4j and I am sure I haven't tuned the OS.
I am using the following parameters:
Initial Java Heap Size (in MB)
wrapper.java.initmemory=2000
Maximum Java Heap Size (in MB)
wrapper.java.maxmemory=2456
Default values for the low-level graph engine
neostore.nodestore.db.mapped_memory=25M
neostore.relationshipstore.db.mapped_memory=50M
neostore.propertystore.db.mapped_memory=90M
neostore.propertystore.db.strings.mapped_memory=130M
neostore.propertystore.db.arrays.mapped_memory=130M
Please tell me what I am doing wrong. I read all the documentation on the Neo4j website, but the query time didn't improve.
Please tell me how I can configure a high-performing cache. What should I do so that the whole graph loads into memory? When I look at my RAM usage, it is always around 1.8 GB out of 4 GB. I am using an enterprise license on Windows (Neo4j 2.0). Please help.
You are actually following not 100k * 3 paths but 100k * (2-10)^10, meaning on the order of 10^15 paths.
More memory in your machine would make a lot of sense, so try to get 8 or more GB.
Then you can increase the heap, e.g. to 6GB:
wrapper.java.initmemory=6000
wrapper.java.maxmemory=6000
neo4j.properties
neostore.nodestore.db.mapped_memory=100M
neostore.relationshipstore.db.mapped_memory=500M
neostore.propertystore.db.mapped_memory=200M
neostore.propertystore.db.strings.mapped_memory=200M
neostore.propertystore.db.arrays.mapped_memory=10M
If you want to pull your data through, you would most probably want to invert your query:
match (a:activity),(u:user{data:1})
with a,u
order by a.relevance desc
limit 100
match (followed:user)-[:next*1..3]-(a)
where (followed)-[:follow]-(u)
return a
order by a.relevance desc
limit 50

Cassandra secondary index get_indexed_slices timing out

I am using Cassandra 0.8 with 2 secondary indexes for columns like "DeviceID" and "DayOfYear". I have these two indexes in order to retrieve data for a device within a range of dates. Whenever I get a date filter, I convert it to DayOfYear and search using indexed slices via the .NET Thrift API. Currently I cannot upgrade the DB either.
My problem is that I usually have no issues retrieving rows with the get_indexed_slices query for the current date (using the current day of year). But whenever I query for yesterday's day of year (which is one of the indexed columns), I get a timeout the first time I make the query. Most of the time it returns when I query a second time, and it always returns by the third time.
Both of these columns are created as the double data type in the column family, and I generally get 1 record per minute. I have 3 nodes in the cluster, and the nodetool reports suggest that the nodes are up and running, though the load distribution report from nodetool looks like this:
Starting NodeTool
Address DC Rack Status State Load Owns
xxx.xx.xxx.xx datacenter1 rack1 Up Normal 7.59 GB 51.39%
xxx.xx.xxx.xx datacenter1 rack1 Up Normal 394.24 MB 3.81%
xxx.xx.xxx.xx datacenter1 rack1 Up Normal 4.42 GB 44.80%
and my YAML configuration is as below.
hinted_handoff_enabled: true
max_hint_window_in_ms: 3600000 # one hour
hinted_handoff_throttle_delay_in_ms: 50
partitioner: org.apache.cassandra.dht.RandomPartitioner
commitlog_sync: periodic
commitlog_sync_period_in_ms: 120000
flush_largest_memtables_at: 0.75
reduce_cache_sizes_at: 0.85
reduce_cache_capacity_to: 0.6
concurrent_reads: 32
concurrent_writes: 24
sliced_buffer_size_in_kb: 64
rpc_keepalive: true
rpc_server_type: sync
thrift_framed_transport_size_in_mb: 15
thrift_max_message_length_in_mb: 16
incremental_backups: true
snapshot_before_compaction: false
column_index_size_in_kb: 64
in_memory_compaction_limit_in_mb: 64
multithreaded_compaction: false
compaction_throughput_mb_per_sec: 16
compaction_preheat_key_cache: true
rpc_timeout_in_ms: 50000
index_interval: 128
Is there something I may be missing? Are there any problems in the config?
Duplicate your data in another column family where the row key is your search data. Row slices are much faster.
Personally, I never managed to use secondary indexes in production environments. Either I had problems with timeouts, or the rate at which data could be retrieved through the secondary index was lower than the rate at which data was inserted. I think this is related to reading data non-sequentially and HD seek time.
If you come from a relational model, playOrm is just as fast, and you can stay relational on a NoSQL store, BUT you need to partition your extremely large tables. If you do that, you can then use "scalable JQL" to do your stuff:
@NoSqlQuery(name="findJoinOnNullPartition", query="PARTITIONS t(:partId) select t FROM TABLE as t INNER JOIN t.security as s where s.securityType = :type and t.numShares = :shares")
It also has the @ManyToOne, @OneToMany, etc. annotations for a basic ORM, though some things work differently in NoSQL, but a lot is similar.
I finally solved my problem in a different way. In fact, I realized the problem was with my data model.
The problem arose because we come from an RDBMS background. I restructured the data model a little, and now I get responses faster.