Google says BigQuery can handle billions of rows.
For my application I estimate roughly 200,000,000 * 1,000 = 200 billion rows, well over a few billion.
I could partition the data into chunks of 200,000,000 rows each, but the only support for partitioning in BigQuery seems to be separate tables (please correct me if I am wrong).
The total data size will be around 2TB.
I saw in the examples some large data sizes, but the rows were all under a billion.
Can BigQuery support the number of rows I am dealing with in a single table?
If not, can I partition it in any way besides multiple tables?
The query below should answer your question.
I ran it against one of our datasets.
As you can see, the table sizes are close to 10 TB, with around 1.3-1.6 billion rows each.
SELECT
ROUND(size_bytes/1024/1024/1024/1024) as TB,
row_count as ROWS
FROM [mydataset.__TABLES__]
ORDER BY row_count DESC
LIMIT 10
I think the largest table we have dealt with so far held at least 5-6 billion rows, and everything worked as expected.
Row TB ROWS
1 10.0 1582903965
2 11.0 1552433513
3 10.0 1526783717
4 9.0 1415777124
5 10.0 1412000551
6 10.0 1410253780
7 11.0 1398147645
8 11.0 1382021285
9 11.0 1378284566
10 11.0 1369109770
Short answer: Yes, BigQuery will handle this just fine, even if you put all the data in a single table.
If you do want to partition your data, the only way to do it right now is to explicitly store your data in multiple tables. You might consider doing so to reduce your bill if you frequently query only a subset of your data. Many users partition their data by date and use table wildcard functions to write queries across a subset of those partitioned tables.
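For example, a minimal legacy-SQL sketch using TABLE_DATE_RANGE (the dataset, the table prefix events_, and the column field1 are hypothetical) that scans only the tables for Q1 2014:

SELECT field1
FROM TABLE_DATE_RANGE([mydataset.events_],
                      TIMESTAMP('2014-01-01'),
                      TIMESTAMP('2014-03-31'))

Only the daily tables whose YYYYMMDD suffix falls in that range are read, so you are billed for just that slice of the data.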
Related
Suppose I have a table of relationships, like the edges of a directed graph. For some pairs of ids both the 1->2 and 2->1 relations are present; for others they are not. Some nodes appear in only one column.
a b
1 2
2 1
1 3
4 1
5 2
Now I want to work with it as an undirected graph: grouping and filtering should treat both columns the same way. For example, filter out node 5 and count the neighbors of the remaining nodes:
node neighbor_count
1 3
2 1
3 1
4 1
Is it possible to compose queries so that column a is used first and then column b is used in the same manner?
I know it is achievable by doubling the table:
select a, count(distinct b)
from (select * from grap
      union all
      select b as a, a as b from grap)
where a not in (5, 6, 7) and b not in (5, 6, 7)
group by a;
However, the real tables are quite large (10^9 to 10^10 pairs). Would the union require additional disk usage? A single scan through the table is already quite slow for me. Are there better ways to do this?
(Currently database is sqlite, but the less platform specific the answer the better)
The union all is generated only for the duration of the query. Does it use more disk space? Not permanently.
If the processing of the query requires saving the data out to disk, then it will use more temporary storage for intermediate results.
I would suggest, though, that if you want an undirected graph with this representation, you add the reversed pairs that are not already in the table. This will use more disk space, but you won't have to play games with queries.
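A minimal sketch of that approach in SQLite, assuming the table is named grap(a, b) as in the question:

-- add the reversed edge wherever it is missing
INSERT INTO grap (a, b)
SELECT g1.b, g1.a
FROM grap g1
WHERE NOT EXISTS (SELECT 1 FROM grap g2
                  WHERE g2.a = g1.b AND g2.b = g1.a);

After this one-time step, every query can treat column a alone as the node and column b alone as the neighbor.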
Currently I have around 250 clients, each with five years of data, and the tables are split up by year. For example, for a client named XX:
T00_XX_2011, T00_XX_2012, T00_XX_2013, T00_XX_2014. Each table contains 220 columns and roughly 10 million records, of which 12 columns already have indexes.
The issue is that a single select query takes around 5 to 10 minutes. Can anyone help tweak the performance?
I have a table with millions of records which holds information about a user, his or her documents in a BLOB, and a column holding the file size per row.
While reporting I need to extract all these records along with their attachments and store them in a folder. However, the constraint is that the folder size should not exceed 4GB.
What I need is to fetch records only up to the record where the running sum of file sizes stays under 4 GB. I have hardly any experience with databases, and do not have any DB expert to refer to.
For example, say I need to fetch only the records while sum(fileSize) < 9:
Name fileSize
A 1
B 2
C 3
D 2
E 9
F 4
My query needs to return records A, B, C, and D.
Also, I need to store the rowID/uniqueID of the first and last record for another subsequent process.
The DB being used is IBM DB2.
Thanks!
Here is a trick for finding your file sizes; within a procedure you can then manage the data.
select length(file_data) from files
where length(file_data)<99999999;
LENGTH(FILE_DATA)
82944
82944
91136
3 rows selected.
select dbms_lob.getlength(file_data) from files
where length(file_data)<89999;
DBMS_LOB.GETLENGTH(FILE_DATA)
82944
82944
2 rows selected.
See also: dbms_lob.getlength() vs. length() to find BLOB size in Oracle.
Hope this helps.
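Note that the snippets above use Oracle's dbms_lob; the question itself is about DB2, where LENGTH() works on BLOB columns directly. For the cumulative cut-off, here is a sketch using DB2's OLAP windowing (the table name documents and the ordering column rowID are assumptions):

-- keep rows while the running total of file sizes stays under the limit
SELECT rowID, name, fileSize
FROM (SELECT rowID, name, fileSize,
             SUM(fileSize) OVER (ORDER BY rowID
                                 ROWS UNBOUNDED PRECEDING) AS running_total
      FROM documents) t
WHERE running_total < 9;

On the sample data this returns A, B, C, and D; MIN(rowID) and MAX(rowID) over the same result give the first and last record IDs for the subsequent process.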
I have a table with about five possible index columns, all of which are useful in different ways. Let's call them System, Source, Heat, Time, and Row. System and Row together form a unique key, and if the table is sorted by System-Row it is also sorted for any combination of the five index variables (in the order I listed them above).
My problem is that I use all combinations of these columns: sometimes I want to JOIN each System-Row to the next System-(Row+1), sometimes I want to GROUP or WHERE by System-Source-Heat, sometimes I want to look at all entries of System-Source WHERE Time is in a specific window, etc.
Basically, I want an index structure that functions similarly to every possible permutation of those five indexes (in the correct order, of course), without actually making every permutation (although I am willing to do so if necessary). I'm doing statistics / analytics, not traditional database work, so the size of the index and speed of creating / updating it is not a concern; I only care about speeding my improvised queries as I tend to think them up, run them, wait 5-10 minutes, and then never use them again. Thus my main concern is reducing the "wait 5-10 minutes" to something more like "wait 1-2 minutes."
My sorted data would look something like this:
Sys So H Ti R
1 1 0 .1 1
1 1 1 .2 2
1 1 1 .3 3
1 1 2 .3 4
1 2 0 .5 5
1 2 0 .6 6
1 2 1 .8 7
1 2 2 .8 8
EDIT: It may simplify things a bit that System virtually always needs to be the first column for any of the other four columns to be in sorted order.
If you are ONLY concerned with SELECT speed and don't care about INSERTs, then you can materialize ALL the combinations as indexed views. You only need 24 times the storage of the original table: one table plus 23 INDEXED VIEWs of 5 columns each.
e.g.
create table data (
id int identity primary key clustered,
sys int,
so int,
h float,
ti datetime,
r int);
GO
create view dbo.data_v1 with schemabinding as
select sys, so, h, ti, r
from dbo.data;
GO
create unique clustered index cix_data_v1 on data_v1(sys, h, ti, r, so)
GO
create view dbo.data_v2 with schemabinding as
select sys, so, h, ti, r
from dbo.data;
GO
create unique clustered index cix_data_v2 on data_v2(sys, ti, r, so, h)
GO
-- and so on and so forth, keeping "sys" anchored at the front
Do note, however:
Q. Why isn't my indexed view being picked up by the query optimizer for use in the query plan? (search for this within the linked article)
If space IS an issue, then the next best thing is to create individual two-column indexes, each leading with system, i.e. (sys,ti), (sys,r), etc. These can be used together when that helps the query; otherwise it will revert to a full table scan.
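A sketch of that alternative, reusing the data table defined above:

create index ix_data_sys_so on dbo.data (sys, so);
create index ix_data_sys_h  on dbo.data (sys, h);
create index ix_data_sys_ti on dbo.data (sys, ti);
create index ix_data_sys_r  on dbo.data (sys, r);

Each index is small compared to an indexed view because it carries only two keys (plus the clustering key), and the optimizer can intersect them when a query filters on several columns at once.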
Sorry for taking a while to get back to this, I had to work on something else for a few weeks. Anyway, after trying a bunch of things (including everything suggested here, even the brute-force "make an index for every permutation" method), I haven't found any indexing method that significantly improves performance.
However, I HAVE found an alternate, non-indexing solution: selecting only the rows and columns I'm interested in into intermediary tables, and then working with those instead of the complete table (so I use about 5 mil rows of 6 cols instead of 30 mil rows of 35 cols). The initial select and table creation is a bit slow, but the steps after that are so much faster I actually save time even if I only run it once (and considering how often I change things, it's usually much more than once).
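In T-SQL, as used in the answer above, that intermediate step might look like this (the filter values and column choice are made up for illustration):

-- pull only the rows and columns the analysis needs into a work table
SELECT sys, so, h, ti, r
INTO analysis_subset
FROM dbo.data
WHERE sys = 1
  AND ti >= '2012-01-01';

Subsequent queries then run against analysis_subset, which is a fraction of the size of the full table.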
I have a suspicion that the reason for this vast improvement will be obvious to most SQL users (probably something about pagefile size), and I apologize if so. My only excuse is that I'm a statistician trying to teach myself how to do this as I go, and while I'm pretty decent at getting what I want done to happen (eventually), my understanding of the mechanics of how it's being done is distressingly close to "it's a magic black box, don't worry about it."
I have a database in SQLite Administrator, with 3 tables, say A,B and C.
Table A has 3 columns p1,p2 and p3, with about 2 million rows.
Table B has 2 columns p1 and p4, with also about 2 million rows.
Table C has 1 column p4 with about 800,000 rows.
The query that I am trying to run is as following:
SELECT A.p1, B.p4, A.p2, A.p3
FROM A,B,C
WHERE A.p1=B.p1 AND B.p4=C.p4
The query has already been running for 3 days and still hasn't finished. I wonder whether I should abort it or wait for it to complete. If it will finish in the next 5-6 days I will probably wait, but if it takes longer than that, I will have to abort it.
Should I wait or not?
My PC specs are: Core 2 Duo, 1.86 GHz, 2 GB RAM.
I would say there's nothing strange about 3 days (if there are no indexes).
With no indexes on A, B, or C, your query would make a full scan of A x B x C.
The number of records in A x B x C is
SELECT COUNT(*)
FROM A,B,C
which is (2*10^6) * (2*10^6) * (0.8*10^6) = 3.2 * 10^18
Assuming you could apply the where condition to a billion records per second, you would still need 3.2 * 10^9 seconds, which is just over 101 years.
However, if you have indexes on p1 and p4, a decent RDBMS would be able to access the results directly and not scan the full Cartesian product (well, I think some DBs would choose to build temporary indexes, which would still be slow, but would make the query actually execute).
Do you have indexes on A.p1, B.p1, B.p4, C.p4 ?
If not, then you'd better stop it; it might run for several years.
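If they are missing, a minimal sketch of the indexes that would let SQLite drive the join instead of scanning the Cartesian product:

CREATE INDEX idx_a_p1 ON A(p1);
CREATE INDEX idx_b_p1 ON B(p1);
CREATE INDEX idx_b_p4 ON B(p4);
CREATE INDEX idx_c_p4 ON C(p4);

With these in place, SQLite can walk one table and look up the matching rows in the other two by index instead of comparing every combination.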
For this kind of operation you need something bigger; this is not Lite at all. Think about switching to another RDBMS.