Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
Our company is going to implement BigQuery.
We have seen several drawbacks in BigQuery, such as:
1. Only 1000 requests per day are allowed.
2. No updates or deletes are allowed.
and so on...
Can you highlight some more drawbacks and also discuss the two above?
Please share any issues that came up during and after implementing BigQuery.
Thanks in advance.
"Only 1000 requests per day allowed"
Not true, fortunately! There is a limit on how many batch loads you can do to a table per day (1000, so roughly one every 90 seconds), but this is about loading data, not querying it. If you need to load data more frequently, you can use the streaming API for up to 100,000 rows per second per table.
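As a quick sanity check of the "one every 90 seconds" figure, here is the arithmetic as a small sketch. The constant names are made up for illustration; the 1000-per-day limit is the number quoted in the answer and may have changed since.

```python
SECONDS_PER_DAY = 24 * 60 * 60
BATCH_LOADS_PER_DAY = 1000  # per-table limit quoted in the answer above

# Average spacing between batch loads if the daily quota is spread evenly.
seconds_between_loads = SECONDS_PER_DAY / BATCH_LOADS_PER_DAY
print(seconds_between_loads)  # 86.4
```

So the exact figure is 86.4 seconds, which the answer rounds to 90.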
"No update delete allowed"
BigQuery is an analytical database, and analytical databases are not optimized for updates and deletes of individual rows. The analytical databases that do support these operations usually do so with caveats and performance costs. You can achieve the equivalent of updates and deletes in BigQuery by re-materializing your tables in just a couple of minutes: https://stackoverflow.com/a/31663889/132438
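One way to picture "re-materializing" is to write a fresh copy of the table that already contains the changes, then swap it in. The sketch below uses Python's built-in `sqlite3` purely as a stand-in; it is not BigQuery syntax, and the `events` table and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (id INTEGER, status TEXT);
    INSERT INTO events VALUES (1, 'new'), (2, 'new'), (3, 'obsolete');
""")

# "Update" id 2 and "delete" id 3 by building a new copy of the table
# instead of modifying rows in place, then swapping the copy in.
conn.executescript("""
    CREATE TABLE events_rebuilt AS
    SELECT id,
           CASE WHEN id = 2 THEN 'done' ELSE status END AS status
    FROM events
    WHERE id <> 3;
    DROP TABLE events;
    ALTER TABLE events_rebuilt RENAME TO events;
""")

rows = conn.execute("SELECT id, status FROM events ORDER BY id").fetchall()
print(rows)  # [(1, 'new'), (2, 'done')]
```

In BigQuery the analogous step would be running a query and writing its result to a destination table, but the exact mechanics are described in the linked answer rather than here.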
Closed. This question is opinion-based. It is not currently accepting answers.
Closed 1 year ago.
Is it a valid approach, when building a Dataflow pipeline that aims to store the newest data per key in BigQuery, to:
1. stream-insert the events into a partitioned staging table
2. periodically merge (update/insert) into the target table, so that only the newest data for each key is stored in this table?
It is a requirement that the merge happens every 2-5 minutes and respects all rows in the staging table.
The idea of this approach is taken from the Google project https://github.com/GoogleCloudPlatform/DataflowTemplates, com.google.cloud.teleport.v2.templates.DataStreamToBigQuery
So far it works okay in our tests. The question arises from the fact that Google states in its documentation:
"Rows that were written to a table recently by using streaming (the tabledata.insertall method or the Storage Write API) cannot be modified with UPDATE, DELETE, or MERGE statements."
https://cloud.google.com/bigquery/docs/reference/standard-sql/data-manipulation-language#limitations
Has anyone gone down this road in a production Dataflow pipeline with stable, positive results?
After a few hours and some thinking, I think I can answer my own question: Since I only stream to the staging table and merge into the target table, the approach is perfectly fine.
I did this yesterday and the time lag is around 15-45 minutes. If you have an ingestion time column/field you can use that to restrict which rows you are UPDATEing.
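The "newest row per key" merge semantics from the question can be sketched in plain Python. This is only an illustration of the logic, not the Dataflow template's implementation and not BigQuery `MERGE` syntax; the `Event` type, its `event_time` ordering column, and `merge_newest` are hypothetical names.

```python
from typing import Dict, Iterable, NamedTuple


class Event(NamedTuple):
    key: str
    event_time: int  # hypothetical ordering column, e.g. epoch seconds
    payload: str


def merge_newest(target: Dict[str, Event], staging: Iterable[Event]) -> Dict[str, Event]:
    """Return a copy of target in which each key holds its newest event."""
    merged = dict(target)
    for event in staging:
        current = merged.get(event.key)
        # Insert unseen keys; overwrite only when the staged row is newer.
        if current is None or event.event_time > current.event_time:
            merged[event.key] = event
    return merged


target = {"a": Event("a", 10, "old")}
staging = [Event("a", 12, "newer"), Event("b", 5, "first"), Event("a", 11, "stale")]
result = merge_newest(target, staging)
print(result["a"].payload, result["b"].payload)  # newer first
```

Because the merge only reads from staging and writes to the target, the streaming restriction quoted from the documentation applies to the staging rows, which are never modified.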
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
According to this TechCrunch news, Gmail has 900 million users. When I log in to Gmail with my username and password, the lookup seems to happen at the speed of light. Do they use an RDBMS (relational) or NoSQL? Is it possible with an RDBMS?
I'm sure this isn't exactly how it's done, but one billion records at say 50 bytes per user name is only 50 gigabytes. They could keep it all in RAM in a sorted tree and just search the sorted tree.
A binary tree of that size is only thirty nodes deep, which would take microseconds to traverse, and I suspect they'd use something that branches more than a binary tree so it would be even flatter.
All in all, there are probably many more amazing things Google does; this part is relatively trivial.
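The back-of-the-envelope numbers in this answer can be checked with a short sketch. The helper `levels_needed` and the branching factor of 256 are assumptions chosen for illustration; the one billion keys and 50 bytes per name come from the answer.

```python
def levels_needed(n_keys: int, branching: int) -> int:
    """Smallest depth d such that branching**d >= n_keys."""
    depth, capacity = 0, 1
    while capacity < n_keys:
        capacity *= branching
        depth += 1
    return depth


N_USERS = 10**9
BYTES_PER_NAME = 50

print(N_USERS * BYTES_PER_NAME / 10**9)  # 50.0 (gigabytes)
print(levels_needed(N_USERS, 2))         # 30
print(levels_needed(N_USERS, 256))       # 4
```

A balanced binary tree needs about 30 levels, and a wider tree with a few hundred children per node needs only a handful.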
Closed. This question is opinion-based. It is not currently accepting answers.
Closed 7 years ago.
Optimization was never one of my strengths. I have a users table, and every user can have many followers. Now I'm wondering whether I should use a counter column, in case some user has a million followers. Instead of counting rows in the whole relations table, shouldn't I use a counter?
I'm working with SQL database.
Update 1
Right now I'm only planning how I should build my site. I haven't written the code yet. I don't know if I'll have slow performance; that's why I'm asking you.
You should certainly not introduce a counter right away. The counter is redundant data and it will complicate everything. You will have to master the additional complexity and it'll slow down the development process.
Better to start with a normalized model and see how it works. If you really run into performance problems, solve them then.
Remember: premature optimization is the root of all evil.
It's generally a good practice to avoid duplication of data, such as summarizing one data point in another data's table.
It depends on what this is for. If this is for reporting, speed is usually not an issue and you can use a join.
If it has to do with the application and you're running into performance issues with a join or a computed column, you may want to consider a summary table generated on a schedule.
If you're not seeing a performance issue, leave it alone.
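A minimal sketch of the normalized starting point, using Python's built-in `sqlite3` as a stand-in for whichever SQL database is used. The table and column names are hypothetical, and the index on `followee_id` is my own assumption about how to keep the count cheap; it is not something the answers prescribe.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE follows (
        follower_id INTEGER NOT NULL REFERENCES users(id),
        followee_id INTEGER NOT NULL REFERENCES users(id),
        PRIMARY KEY (follower_id, followee_id)
    );
    -- Assumed index so counting one user's followers need not scan the table.
    CREATE INDEX idx_follows_followee ON follows (followee_id);
""")
conn.executemany("INSERT INTO users (id, name) VALUES (?, ?)",
                 [(1, "alice"), (2, "bob"), (3, "carol")])
conn.executemany("INSERT INTO follows VALUES (?, ?)",
                 [(2, 1), (3, 1), (1, 2)])


def follower_count(user_id: int) -> int:
    """Count followers directly from the relation table, with no counter column."""
    row = conn.execute(
        "SELECT COUNT(*) FROM follows WHERE followee_id = ?", (user_id,)
    ).fetchone()
    return row[0]


print(follower_count(1))  # 2
print(follower_count(3))  # 0
```

If this count ever shows up as a measured bottleneck, a counter column or scheduled summary table can be added later without changing the `follows` table.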
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
I am going to develop some business logic. During the requirements phase, I need to provide an SLA.
The business always wants everything in less than a second. Sometimes it is very frustrating. They are not bothered about complexity.
My database is SQL Server 2012, and it is a transactional database.
Is there any formula that takes the number of tables, columns, etc. and provides an estimate?
No, you won't be able to get an execution time. Not only do the number of tables and joins factor in, but so do how much data is returned, network speed, load on the server, and many other factors. What you can get is a query plan. SQL Server generates a query plan for every query it executes, and the execution plan will give you a "cost" value that you can use as a VERY general guideline about the query's performance. Check out these two links for more information...
1) this is a good StackOverflow question/answer.
2) sys.dm_exec_query_plan
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
Apologies in advance if this is a stupid question. I've more or less just started learning how to use SQL.
I'm making a website that stores main accounts, each with many sub-accounts associated with it. Each sub-account has a few thousand records in various tables associated with it.
My question is to do with the conventional usage of databases. Is it better to use a database per main account with everything associated with it stored in the same place, store everything in one database, or an amalgamation of both?
Some insight would be much appreciated.
Will you need to access more than one of these databases at the same time? If so, put them all in one. You will not like the amount of effort and the cost of 'joining' them back together to do a query. On top of that, every database you have needs to be managed, and should you need to transfer data between them, that can get painful as well.
Segregating data by database is a last resort.
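To illustrate the single-database layout, here is a small sketch using Python's built-in `sqlite3` as a stand-in. All table names, column names, and sample rows are hypothetical; the point is only that a query spanning several main accounts is an ordinary join when everything lives in one database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE main_accounts (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sub_accounts (
        id INTEGER PRIMARY KEY,
        main_account_id INTEGER NOT NULL REFERENCES main_accounts(id),
        name TEXT
    );
    CREATE TABLE records (
        id INTEGER PRIMARY KEY,
        sub_account_id INTEGER NOT NULL REFERENCES sub_accounts(id),
        amount INTEGER
    );
    INSERT INTO main_accounts VALUES (1, 'acme'), (2, 'globex');
    INSERT INTO sub_accounts VALUES
        (10, 1, 'acme-eu'), (11, 1, 'acme-us'), (20, 2, 'globex-eu');
    INSERT INTO records VALUES (100, 10, 5), (101, 11, 7), (102, 20, 3);
""")

# One query covers every main account; no cross-database stitching is needed.
totals = conn.execute("""
    SELECT m.name, SUM(r.amount)
    FROM main_accounts m
    JOIN sub_accounts s ON s.main_account_id = m.id
    JOIN records r ON r.sub_account_id = s.id
    GROUP BY m.id, m.name
    ORDER BY m.name
""").fetchall()
print(totals)  # [('acme', 12), ('globex', 3)]
```

With one database per main account, the same report would require opening each database separately and combining the results in application code.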