I have some data that I would like to partition by date, and also partition by an internally-defined client id.
Currently, we store this data using the table-per-date model. It works well, but querying individual client ids is slow and expensive.
We have considered creating a table per client id, and using date partitioning within those tables. The only issue here is that it would force us to incur thousands of load jobs per day, and would also require the data to be partitioned by client id in advance.
Here is a potential solution I came up with:
-Stick with the table-per-date approach (eg log_20170110)
-Create a dummy date column which we use as the partition date, and set that date to <client id>-01-01 (eg for client id 1235, set _PARTITIONTIME to 1235-01-01)
This would allow us to load data per-day, as we do now, would give us partitioning by date, and would leverage the date partitioning functionality to partition by client id. Can you see anything wrong with this approach? Will BigQuery allow us to store data for the year 200, or the year 5000?
PS: We could also use a scheme that pushes the dates to post-zero-unixtime, eg add 2000 to the year, or push the last two digits to the month and day, eg 1235 => 2012-03-05.
Will BigQuery allow us to store data for the year 200, or the year 5000?
Yes, any date between 0001-01-01 and 9999-12-31.
So formally speaking this is an option (and btw depends on how many clients you plan / already have)
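A minimal sketch of the encoding under discussion, assuming client ids are positive integers. The helper name client_partition_date and its year_offset parameter are hypothetical (the offset mirrors the "add 2000 to the year" idea from the question's PS). Python's datetime.date happens to accept the same 0001 to 9999 year range, so it is a convenient way to check whether an id still fits:

```python
from datetime import date

def client_partition_date(client_id: int, year_offset: int = 0) -> date:
    """Map a client id to a fake partition date of the form <year>-01-01.

    Hypothetical helper: the year is client_id + year_offset, so the sum
    must stay within 1..9999, otherwise date() raises ValueError.
    """
    return date(client_id + year_offset, 1, 1)

print(client_partition_date(1235))        # 1235-01-01
print(client_partition_date(1235, 2000))  # 3235-01-01
```

With this scheme the id space is capped at 9999 clients (fewer once an offset is applied), which is why the number of clients matters.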
See more about same idea at https://stackoverflow.com/a/41091896/5221944
In the meantime, I would expect BigQuery to soon have the ability to partition by an arbitrary field. Maybe at NEXT 2017 - just guessing :o)
The suggested idea is likely to create some performance issues for queries (as the number of partitions increases). Generally speaking, date partitioning works well with up to a few thousand partitions.
client_ids are generally unrelated to each other and are ideal for hashing. While we work towards supporting richer partitioning flavors, one option is to hash your client_ids into N buckets (~100?), and have N partitioned tables. That way you can query across your N tables for a given date. Using, for example, 100 tables would drop the cost down to 1% of what it would be using 1 table with all the client_ids. It should also scan a small number of partitions, improving performance accordingly. Unfortunately, this approach doesn't address the concern of putting the client ids in the right table (that has to be managed by you).
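As a rough illustration of the bucketing idea, here is a Python sketch. The function names, the log_bucket_ table prefix and the choice of SHA-256 are all assumptions made for the example; any stable hash would do. Python's built-in hash() is avoided because it is randomized per process for strings, so it would not route the same client to the same table across runs.

```python
import hashlib

NUM_BUCKETS = 100  # assumption: the "~100" suggested above

def bucket_for_client(client_id: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Return a stable bucket index in [0, num_buckets) for a client id."""
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_buckets

def table_for_client(client_id: str, prefix: str = "log_bucket_") -> str:
    """Hypothetical naming scheme: log_bucket_000 .. log_bucket_099."""
    return f"{prefix}{bucket_for_client(client_id):03d}"

print(table_for_client("1235"))
```

The same function has to be used both when loading rows and when choosing which table to query, since the routing is managed by you.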
It looks like LIMIT would have no effect on the amount of processed/queried data (if you trust the UI).
SELECT
* --count(*)
FROM
`bigquery-public-data.github_repos.commits`
-- LIMIT 20
How can I limit the amount of queried data to a minimum (even though one whole partition would probably always be needed)
without using "preview" or similar
without knowing the partitioning / clustering of the data
How can I check the real approximate amount before executing a query?
The execution details state that only 163514 rows were read as input (not 244928379 rows).
If you want to limit the amount of data BQ uses for a query, you have these two options:
Table Partitioning
BigQuery can partition data using either a Date/Datetime/Timestamp column you provide or by insert date (which is good if you have regular updates on a table).
In order to do this, you must specify the partition strategy in the DDL:
CREATE TABLE mydataset.mytable (foo INT64, txdate DATE)
PARTITION BY txdate
Wildcard tables (like sharding: splitting the data into multiple tables)
This works when your data holds information about different domains (geographical, customer type, etc.) or sources.
Instead of having one big table, you can create 'subtables' or 'shards' like this with a similar schema (usually people use the same one). For instance, dataset.tablename.eur for European data and dataset.tablename.jap for data from Japan.
You can query one of those tables directly: select col1,col2... from dataset.tablename.customer_eur; or query all of the tables at once: select col1,col2 from `dataset.tablename.*`
Wildcard tables can also be partitioned by date.
You pay for the volume of data loaded into the workers. Of course, if your request does no processing and only asks for the first 20 results, the query stops early and not all of the data is processed, but it is still loaded. And you will pay for this!
Have a look at this. I ran a similar request.
Now, let's go to the logs
The total bytes billed is ~800 MB.
So you have to think differently when you work with BigQuery: it is an analytics database and is not designed for small requests (it is too slow to start; the latency is at least 500 ms due to worker warm-up).
My table contains 3M+ rows, and only 10% have been processed.
And you pay for the reservation and the load cost (moving data has a cost, and reserving slots also has a cost).
That's why there are a lot of tips for saving money on Google BigQuery. Some examples by a former BigQuery Dev Advocate
As of December 2021, I notice that a select * ... limit query does not scan the whole table, and you pay only for a small number of rows. Obviously, if you add an order by, it will scan everything.
We have the following scenario:
We have an existing table containing approx. 15 billion records. It was not explicitly partitioned on creation.
We are creating a copy of this table with partitions, hoping for faster read time on certain types of queries.
Our tables are on Databricks Cloud, and we use Databricks Delta.
We commonly filter by two columns, one of which is the ID of an entity (350k distinct values) and one of which is the date at which an event occurred (31 distinct values so far, but increasing every day!).
So, in creating our new table, we ran a query like this:
CREATE TABLE the_new_table
USING DELTA
PARTITIONED BY (entity_id, date)
AS SELECT
entity_id,
another_id,
from_unixtime(timestamp) AS timestamp,
CAST(from_unixtime(timestamp) AS DATE) AS date
FROM the_old_table
This query has run for 48 hours and counting. We know that it is making progress, because we have found around 250k prefixes corresponding to the first partition key in the relevant S3 prefix, and there are certainly some big files in the prefixes that exist.
However, we're having some difficulty monitoring exactly how much progress has been made, and how much longer we can expect this to take.
While we waited, we tried out a query like this:
CREATE TABLE a_test_table (
entity_id STRING,
another_id STRING,
timestamp TIMESTAMP,
date DATE
)
USING DELTA
PARTITIONED BY (date);
INSERT INTO a_test_table
SELECT
entity_id,
another_id,
from_unixtime(timestamp) AS timestamp,
CAST(from_unixtime(timestamp) AS DATE) AS date
FROM the_old_table
WHERE CAST(from_unixtime(timestamp) AS DATE) = '2018-12-01'
Notice the main difference in the new table's schema here is that we partitioned only on date, not on entity id. The date we chose contains almost exactly four percent of the old table's data, which I want to point out because it's much more than 1/31. Of course, since we are selecting by a single value that happens to be the same thing we partitioned on, we are in effect only writing one partition, vs. the probably hundred thousand or so.
The creation of this test table took 16 minutes using the same number of worker-nodes, so we would expect (based on this) that the creation of a table 25x larger would only take around 7 hours.
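Spelled out as a quick sanity check (a sketch of the naive linear estimate only; it assumes write time scales with row count and nothing else, and the variable names are just for illustration):

```python
test_minutes = 16     # observed time for the single-date test table
test_fraction = 0.04  # share of the old table's rows that fall on that date

scale = 1 / test_fraction                    # the full table is ~25x larger
estimated_hours = test_minutes * scale / 60  # naive linear extrapolation

print(f"{scale:.0f}x data -> about {estimated_hours:.1f} hours")
# 25x data -> about 6.7 hours
```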
This answer appears to partially acknowledge that using too many partitions can cause the problem, but the underlying causes appear to have greatly changed in the last couple of years, so we seek to understand what the current issues might be; the Databricks docs have not been especially illuminating.
Based on the posted request rate guidelines for S3, it seems like increasing the number of partitions (key prefixes) should improve performance. The partitions being detrimental seems counter-intuitive.
In summary: we are expecting to write many thousands of records into each of many thousands of partitions. It appears that reducing the number of partitions dramatically reduces the amount of time it takes to write the table data. Why would this be true? Are there any general guidelines on the number of partitions that should be created for data of a certain size?
You should partition your data by date because it sounds like you are continually adding data as time passes chronologically. This is the generally accepted approach to partitioning time series data. It means that you will be writing to one date partition each day, and your previous date partitions are not updated again (a good thing).
You can of course use a secondary partition key if your use case benefits from it (i.e. PARTITIONED BY (date, entity_id))
Partitioning by date will necessitate that your reading of this data will always be made by date as well, to get the best performance. If this is not your use case, then you would have to clarify your question.
How many partitions?
No one can give you an answer on how many partitions you should use, because every data set (and processing cluster) is different. What you do want to avoid is "data skew", where one worker has to process huge amounts of data while other workers are idle. In your case that would happen if one clientid was 20% of your data set, for example. Partitioning by date has to assume that each day has roughly the same amount of data, so each worker is kept equally busy.
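One simple way to look for skew before choosing a partition key is to measure how much of the data the most frequent key value owns. The sketch below does this in plain Python on made-up sample data; the function name largest_key_share is hypothetical, and on a real table you would compute the same ratio with a GROUP BY.

```python
from collections import Counter

def largest_key_share(keys):
    """Return (key, share) for the most frequent partition key value."""
    counts = Counter(keys)
    key, count = counts.most_common(1)[0]
    return key, count / sum(counts.values())

# Toy data (made up): key "a" owns 20 of 100 rows.
sample = ["a"] * 20 + ["b"] * 10 + [f"c{i}" for i in range(70)]
key, share = largest_key_share(sample)
print(key, f"{share:.0%}")  # a 20%
```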
I don't know specifically how Databricks writes to disk, but on Hadoop I would want to see each worker node writing its own file part, so that your write performance is parallelized at this level.
I am not a Databricks expert at all, but hopefully these bullets can help.
Number of partitions
The number of partitions and files created will impact the performance of your job no matter what, especially when using S3 as data storage. However, this number of files should be handled easily by a cluster of decent size.
Dynamic partition
There is a huge difference between partitioning dynamically by two keys and partitioning by one. Let me try to address this in more detail.
When you partition data dynamically, depending on the number of tasks and the size of the data, a large number of small files can be created per partition. This could (and probably will) hurt the performance of later jobs that need to read this data, especially if your data is stored in ORC, Parquet or any other columnar format. Note that writing the data this way requires only a map-only job.
This issue is addressed in different ways, the most common being file consolidation. For this, the data is repartitioned in order to create bigger files. As a result, shuffling of data is required.
Your queries
For your first query, the number of partitions will be 350k*31 (around 11MM!), which is really big considering the amount of shuffling and the number of tasks required to handle the job.
For your second query (which takes only 16 minutes), the number of tasks and the amount of shuffling required are much smaller.
The number of partitions (shuffling/sorting/task scheduling/etc.) and the execution time of your job do not have a linear relationship, which is why the math doesn't add up in this case.
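To put numbers on the two layouts, here is a back-of-the-envelope sketch. The partition counts come from the question; the task count is a made-up value, and the file estimate is only an upper bound under the assumption that every write task holds rows for every partition and emits one file per partition it touches.

```python
entity_ids = 350_000  # distinct entity_id values, from the question
dates = 31            # distinct dates so far, from the question

two_key_partitions = entity_ids * dates  # PARTITIONED BY (entity_id, date)
one_key_partitions = dates               # PARTITIONED BY (date)

# Assumption: hypothetical number of write tasks, not taken from the question.
write_tasks = 200
worst_case_files = two_key_partitions * write_tasks

print(two_key_partitions)  # 10850000
print(one_key_partitions)  # 31
print(worst_case_files)    # 2170000000
```

Real jobs will normally produce far fewer files than this bound, but it shows how quickly the two-key layout grows compared with 31 date partitions.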
Recommendation
I think you already got it: you should split your ETL job into 31 different queries (one per date), which will let you optimize the execution time.
My recommendations when using partition columns are:
Identify the cardinality of all the columns and select those whose cardinality stays bounded over time; this excludes identifiers and raw date columns.
Identify the main way the table is searched; perhaps it is by date or by some categorical field.
Generate sub-columns with a bounded cardinality in order to speed up the search. For example, dates can be decomposed into year, month, day, etc., and integer identifiers can be decomposed by taking the remainder of the ID modulo some N (ID % N).
As I mentioned earlier, partitioning on high-cardinality columns will cause poor performance by generating a lot of files, which is the worst case to work with.
It is advisable to work with files that do not exceed 1 GB. For this, it is recommended to use "coalesce (1)" when creating the Delta table.
If you need to perform updates or insertions, specify as many of the partition columns as possible to rule out unnecessary file reads, which is very effective at reducing run times.
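The sub-column idea can be sketched in a few lines of Python. This assumes integer ids; the function name, the column names and the bucket count of 16 are hypothetical choices for the example, not values from the question.

```python
from datetime import date

ID_BUCKETS = 16  # hypothetical bucket count; pick one that fits your data

def partition_columns(entity_id: int, event_date: date) -> dict:
    """Derive bounded-cardinality partition columns from an id and a date."""
    return {
        "year": event_date.year,
        "month": event_date.month,
        "day": event_date.day,
        "id_bucket": entity_id % ID_BUCKETS,  # remainder, so 0..ID_BUCKETS-1
    }

print(partition_columns(350_123, date(2018, 12, 1)))
# {'year': 2018, 'month': 12, 'day': 1, 'id_bucket': 11}
```

However many ids exist, id_bucket never takes more than ID_BUCKETS distinct values, which keeps the number of partitions bounded.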
Basically we are building a reporting dashboard for our software. We are giving the Clients the ability to view basic reporting information.
Example: (I've removed 99% of the complexity of our actual system out of this example, as this should still get across what I'm trying to do)
One example metric would be...the number of unique products viewed over a certain time period. AKA, if 5 products were each viewed by customers 100 times over the course of a month. If you run the report for that month, it should just say 5 for number of products viewed.
Are there any recommendations on how to go about storing data in such a way that it can be queried for any time range and return a unique count of products viewed? For the sake of this example, let's say there is a rule that the application cannot query the source tables directly, and we have to store summary data in a different database and query it from there.
As a side note, we have tons of other metrics we are storing, which we store aggregated by day. But this particular metric is different because of the uniqueness issue.
I personally don't think it's possible. And our current solution is that we offer 4 pre-computed time ranges where metrics affected by uniqueness are available. If you use a custom time range, then that metric is no longer available because we don't have the data pre-computed.
Your problem is that you're trying to change the grain of the fact table. This can't be done.
Your best option is what I think you are doing now - define aggregate fact tables at the grain of day, week and month to support your performance constraint.
You can address the custom time range simply by advising your users that this will be slower than the standard aggregations. For example, a user wanting to know the counts of unique products sold on Tuesdays can write a query like this, at the expense of some performance loss:
select distinct dim_prod.pcode
,count(*)
from fact_sale
join dim_prod on dim_prod.pkey = fact_sale.pkey
join dim_date on dim_date.dkey = fact_sale.dkey
where dim_date.day_name = 'Tuesday'
group by dim_prod.pcode
The query could also be written against a daily aggregate rather than a transactional fact, and as it would be scanning less data it would run faster, maybe even meeting your need.
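To illustrate why the daily aggregate has to keep the product key rather than just a count, here is a small Python sketch with invented data. The daily_products structure stands in for a hypothetical aggregate table at the grain of day and product; it is not taken from the schema above.

```python
from datetime import date

# Hypothetical daily aggregate: for each day, the set of product codes
# viewed that day (grain = day x product). The data is made up.
daily_products = {
    date(2024, 1, 1): {"p1", "p2"},
    date(2024, 1, 2): {"p2", "p3"},
    date(2024, 1, 3): {"p1"},
}

def unique_products(start: date, end: date) -> int:
    """Count distinct products viewed between start and end, inclusive."""
    seen = set()
    for day, products in daily_products.items():
        if start <= day <= end:
            seen |= products
    return len(seen)

# Summing per-day distinct counts over-counts repeats (2 + 2 + 1 = 5);
# taking the union over the same range gives the true answer.
print(unique_products(date(2024, 1, 1), date(2024, 1, 3)))  # 3
```

Storing only a per-day distinct count would lose the information needed for the union, which is why such counts cannot simply be added across a custom range.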
From the information that you have provided, I think you are trying to measure ' number of unique products viewed over a month (for example)'.
Not sure if you are using Kimball methodologies to design your fact tables. I believe in Kimball methodology, an Accumulating Snapshot Fact table will be recommended to meet such a requirement.
I might be preaching to the converted (apologies in that case), but if not, I would suggest going through the following link, where the experts have explained the concept in detail:
http://www.kimballgroup.com/2012/05/design-tip-145-time-stamping-accumulating-snapshot-fact-tables/
I have also included another link from Kimball, which explains different types of fact tables in detail:
http://www.kimballgroup.com/2014/06/design-tip-167-complementary-fact-table-types/
Hope that explains the concepts in detail. More than happy to answer any questions (to the best of my ability).
Cheers
Nithin
So I'm looking into data warehousing and partitioning and am very curious as to what scale makes the most sense for partitioning data on a key (for instance, SaleDate).
Tutorials often mention that you're trying to break it down into logical chunks so as to make updating the data less likely to cause service disruptions.
So let's say I'm a medium scale company working in a given US state. I do a lot of work in relation to SaleDate, often tens of thousands of transactions a day (with requisite transaction details, 4-50 each?), and have about 5 years of data. I would like to query and build trend information off of that; for instance:
On a yearly basis to know what items are becoming less popular over time.
On a monthly basis to see what items get popular at a certain time of year (ice in summer)
On a weekly basis to see how well my individual stores are doing
On a daily basis to observe theft trends or something
Now my business unit also wants to query that data, but I'd like to be able to keep it responsive.
How do I know that it would be best to partition on Year, Month, Week, Day, etc for this data set? Is it just whatever I actually observe as providing the best response time by testing out each scenario? Or is there some kind of scale that I can use to understand where my partitions would be the most efficient?
Edit: I, personally, am using Sql Server 2012. But I'm curious as to how others view this question in relation to the core concept rather than the implementation (Unless this isn't one of those cases where you can do so).
Things to consider:
What type of database are you using? Really important, different strategies for Oracle vs SQLServer vs IBM, etc.
Sample queries and run times. Partition usage depends on the conditions in your where clause; what are you filtering on?
Does it make sense to create/use aggregate tables? Seems like a monthly aggregate would save you some time.
Lots of options based on the hardware and storage options available to you, need more details to make a more specific recommendation.
Here is an Ms-SQL 2012 database with 7 million records a day, with an ambition to grow the database to 6 years of data for trend analyses.
The partitions are based on the YearWeek column, expressed as an integer (after 201453 comes 201501). So each partition holds one week of transaction data.
This makes for a maximum of 320 partitions, which is comfortably below the maximum of 1000 partitions within a scheme. The maximum size for one partition in one table is now approx. 10 GB, which makes it much easier to handle than the 3 TB size of the total.
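A YearWeek key of this kind can be derived as sketched below. This version uses ISO-8601 week numbering, which is an assumption: the database described here may number weeks differently (its 201453 example is not an ISO week, since ISO year 2014 has only 52 weeks). The function name year_week is hypothetical.

```python
from datetime import date

def year_week(d: date) -> int:
    """Encode a date as an integer YYYYWW using ISO-8601 week numbering."""
    iso_year, iso_week, _ = d.isocalendar()
    return iso_year * 100 + iso_week

print(year_week(date(2015, 12, 31)))  # 201553
print(year_week(date(2016, 1, 4)))    # 201601
print(6 * 53)                         # 318
```

Six years at up to 53 weeks each gives 318 partitions, in line with the maximum of about 320 mentioned above.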
A new file in the partition scheme is used for each new year. The 500 GB datafiles are suitable for backup and deletion.
When calculating data for one month the 4 processors are working in parallel to handle one partition each.
We are experimenting with BigQuery to analyze user data generated by our software application.
Our working table consists of hundreds of millions of rows, each representing a unique user "session". Each row contains a timestamp, a UUID, and other fields describing the user's interaction with our product during that session. We currently generate about 2GB of data (~10M rows) per day.
Every so often we may run queries against the entire dataset (about 2 months' worth right now, and growing). However, typical queries will span just a single day, week, or month. We're finding that as our table grows, our single-day query becomes more and more expensive (as we would expect given BigQuery's architecture).
What is the best way to query subsets of our data more efficiently? One approach I can think of is to "partition" the data into separate tables by day (or week, month, etc.) and then query them together in a union:
SELECT foo from
mytable_2012-09-01,
mytable_2012-09-02,
mytable_2012-09-03;
Is there a better way than this???
BigQuery now supports table partitions by date:
https://cloud.google.com/blog/big-data/2016/03/google-bigquery-cuts-historical-data-storage-cost-in-half-and-accelerates-many-queries-by-10x
Hi David: The best way to handle this is to shard your data across many tables and run queries as you suggest in your example.
To be more clear, BigQuery does not have a concept of indexes (by design), so sharding data into separate tables is a useful strategy for keeping queries as economically efficient as possible.
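With one table per day, the list of tables for a date range is usually generated rather than typed. The sketch below builds such a list in Python; the daily_shards helper and the mytable_YYYYMMDD naming pattern are assumptions for illustration, and the resulting query text simply follows the comma-separated style used in the question.

```python
from datetime import date, timedelta

def daily_shards(prefix: str, start: date, end: date) -> list:
    """List one table name per day in [start, end], e.g. mytable_20120901."""
    days = (end - start).days + 1
    return [f"{prefix}_{(start + timedelta(days=i)):%Y%m%d}" for i in range(days)]

tables = daily_shards("mytable", date(2012, 9, 1), date(2012, 9, 3))
# Join the shard names in the comma-separated style used in the question.
query = "SELECT foo FROM " + ", ".join(tables) + ";"
print(query)
# SELECT foo FROM mytable_20120901, mytable_20120902, mytable_20120903;
```

Only the tables named in the query are read, so a single-day query touches a single day's table.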
On the flip side, another useful feature for people worried about having too many tables is to set an expirationTime for tables, after which tables will be deleted and their storage reclaimed - otherwise they will persist indefinitely.