How to show a sample of the data in BigQuery? - sql

Let us suppose I have a 1TB dataset in BigQuery, and I want to be able to view the data in a columnar view, limiting to 1000 results. Here are a few of the queries I might use:
1. SELECT * FROM mytable LIMIT 1000
2. SELECT first_name, last_name FROM mytable LIMIT 1000
3. SELECT last_name, first_name FROM mytable LIMIT 1000
4. SELECT * FROM mytable ORDER BY first_name LIMIT 1000
If I ran these four queries I would be charged ~$20 ($5/TB; pretend * = first_name, last_name). This seems like a very high price to pay just to sample the data -- is there another way to query a limited view of this data, like the above?

If your data is dynamic, meaning it is updated daily or in some other regular way, you can use Table Decorators (legacy SQL only).
For example
SELECT * FROM [mytable@-3600000--1800000] LIMIT 1000
will query only data inserted within the last hour (this range decorator covers data added between one hour and 30 minutes ago), thus lowering the cost a lot!!
Another option is to use day-partitioned tables, so you can query only a specific day's worth of data.
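For instance, here is a minimal sketch (assuming a hypothetical ingestion-time partitioned table mydataset.mytable): a filter on the _PARTITIONTIME pseudo-column restricts the scan to a single day's partition.
-- Scans only the partition for the given day, not the whole table.
SELECT *
FROM mydataset.mytable
WHERE _PARTITIONTIME = TIMESTAMP '2016-01-01'
LIMIT 1000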
Is there a way to export a subset of the data instead of doing a query?
Yes. You can use the tabledata.list API to list data page-by-page from your original table and insert it into a new [sampled] table using whatever sampling logic you need. Note: this API is free, as it doesn't actually use the BigQuery query engine per se but rather reads from the underlying storage, so you can be reasonably wild :o)
Of course, you need to implement this in the client of your choice.

I assume you are accessing BQ through the online query interface (https://bigquery.cloud.google.com/table . . . ).
Click on the table in the data set. Go down to where it says "Table Details" in bold letters, beneath the "Run Query" icon.
In the second row below that is an option for "Preview". This will show you some data and it's free.

We have a sample table that's generated every day at work which I find extremely useful for many tasks. It's as simple as:
SELECT * FROM mytable WHERE RAND() < 0.01
The table is hierarchical, and this sampling is set to reproduce the whole structure; so queries can be tested/replicated in exactly the same form and then swapped over to the big table if needed. The 1% sample applies to the top level of the hierarchy (meaning you don't have to wonder whether you are getting valid results from branches).
For us, there is enough data that sums and ratios are generally very representative. The only kind of data that poses a significant problem is relatively rare events, which means counts of unique elements can't be relied on.
And of course, after the single daily charge for making this table, the billing goes from dollars to cents!
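A minimal sketch of how such a daily sample table could be materialized (the table names and the 1% threshold are placeholders; the daily job still scans the full table once, which is the single charge mentioned above):
-- One full scan per day; every later query hits only the small sample table.
CREATE OR REPLACE TABLE mydataset.mytable_sample AS
SELECT *
FROM mydataset.mytable
WHERE RAND() < 0.01;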

Related

BigQuery in Google Cloud, limiting search of views and cost

I'm a little new to BQ. I'm doing a very simple query against a view to get a quick look at the data, but when I put, say, LIMIT 100 to see just the first 100 rows, I don't get a reduction in the data required and hence the cost. If I simply want to do this, what is an inexpensive way to get the data?
For example:
select * from table
uses exactly the same projected data as
select * from table limit 100
Is there not any simplification under the hood? Is BQ scanning all rows and then taking the top 100?
BigQuery charging is based on the data queried, and unfortunately LIMIT does not reduce the volume of data queried.
The following can help:
using the table preview in the console (this is free, if I recall correctly), though it does not work on views or some types of attached tables
reducing the number of columns that are queried
if your data is partitioned, you can query a specific partition - https://cloud.google.com/bigquery/docs/querying-partitioned-tables
There is information from Google on this page https://cloud.google.com/bigquery/docs/best-practices-performance-input
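To illustrate the second point in the list above (the view and column names here are hypothetical): LIMIT does not change the bytes billed, but selecting fewer columns does, because BigQuery only reads the columns a query references.
-- Scans every column of the view's underlying table, even with LIMIT:
SELECT * FROM mydataset.myview LIMIT 100;
-- Scans only the two referenced columns, so the bytes billed drop accordingly:
SELECT col_a, col_b FROM mydataset.myview LIMIT 100;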

GCP BigQuery - LIMIT but full table read - How to limit queried data to a minimum

It looks like LIMIT would have no effect on the amount of processed/queried data (if you trust the UI).
SELECT
* --count(*)
FROM
`bigquery-public-data.github_repos.commits`
-- LIMIT 20
How can I limit the amount of queried data to a minimum (even though one whole partition would probably always be needed),
without using "preview" or similar,
and without knowing the partitioning/clustering of the data?
How can I check the real approximate amount before executing a query?
In the execution details it is stated that only 163,514 rows were queried as input (not 244,928,379 rows).
If you want to limit the amount of data BQ uses for a query, you have these two options:
Table Partitioning
BigQuery can partition data using either a DATE/DATETIME/TIMESTAMP column you provide or the insert date (which is good if you have regular updates on a table).
In order to do this, you must specify the partition strategy in the DDL:
CREATE TABLE mydataset.mytable (foo INT64, txdate DATE)
PARTITION BY txdate
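With that DDL in place, a filter on the partitioning column lets BigQuery prune partitions, so only the matching day's data is billed (the date value is just an example):
-- Reads only the 2021-01-01 partition of mydataset.mytable.
SELECT foo
FROM mydataset.mytable
WHERE txdate = DATE '2021-01-01';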
Wildcard tables (like sharding: splitting the data into multiple tables)
This works when your data holds information about different domains (geographical, customer type, etc.) or sources.
Instead of having one big table, you can create 'subtables' or 'shards' that share a common prefix and a similar schema (usually people use the same one). For instance, dataset.tablename_eur for European data and dataset.tablename_jap for data from Japan.
You can query one of those tables directly, select col1, col2 ... from dataset.tablename_eur; or query all of them at once, select col1, col2 from `dataset.tablename_*`.
Wildcard tables can also be partitioned by date.
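As a sketch of how a wildcard query can be narrowed further, a filter on the _TABLE_SUFFIX pseudo-column restricts which shards are actually read (shard names as in the example above):
-- Reads only the 'eur' and 'jap' shards of the wildcard set.
SELECT col1, col2
FROM `dataset.tablename_*`
WHERE _TABLE_SUFFIX IN ('eur', 'jap');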
You pay for the volume of data loaded into the workers. Even if you do nothing in your request and only ask for the first 20 results, so that the query stops early and not all the data is processed, the data is still loaded, and you will pay for this!
Have a look at this; I ran a similar request.
Now, let's go to the logs.
The total bytes billed is ~800MB.
So you have to think differently when you work with BigQuery: it's an analytics database, not designed to perform small requests (it's too slow to start; the latency is at least 500ms due to worker warm-up).
My table contains 3M+ rows, and only 10% of them were processed.
And you pay for the reservation and the load cost (moving data has a cost, and reserving slots also has a cost).
That's why there are a lot of tips to save money on Google BigQuery. Some examples come from a former BigQuery Dev Advocate.
As of December 2021, I notice that SELECT * ... LIMIT will not scan the whole table and you pay for only a small number of rows; obviously, if you add ORDER BY, it will scan everything.

My SQL table is too big: retrieving data via paging/segmenting the result?

This is a design/algorithm question.
Here's the outline of my scenario:
I have a large table (say, 5 mil. rows) of data which I'll call Cars
Then I have an application, which performs a SELECT * on this Cars table, taking all the data and packaging it into a single data file (which is then uploaded somewhere.)
This data file generated by my application represents a snapshot, what the table looked like at an instant in time.
The table Cars, however, is updated sporadically by another process, regardless of whether the application is currently generating a package from the table or not. (There currently is no synchronization.)
My problem:
This table Cars is becoming too big to do a single SELECT * against. When my application retrieves all this data at once, it quickly overwhelms the memory capacity of my machine (let's say, 2GB). Also, simply performing chained SELECTs with LIMIT or OFFSET fails the synchronization condition: the table is frequently updated and I can't have the data change between SELECT calls.
What I'm looking for:
A way to pull the entirety of this table into an application whose memory capacity is smaller than the data, assuming the data size could approach infinity. Particularly, how do I achieve a pagination/segmented effect for my SQL selects? i.e. Make recurring calls with a page number to retrieve the next segment of data. The ideal solution allows for scalability in data size.
(For the sake of simplifying my scenario, we can assume that when given a segment of data, the application can process/write it then free up the memory used before requesting the next segment.)
Any suggestions you may be able to provide would be most helpful. Thanks!
EDIT: By request, my implementation uses C#.NET 4.0 & MSSQL 2008.
EDIT #2: This is not a SQL command question. This is a design-pattern-related question: what is the strategy for performing paginated SELECTs against a large table? (Especially when said table receives constant updates.)
What database are you using? In MySQL, for example, the following would select 20 rows beginning from row 40, but this is a MySQL-only clause (edit: it seems Postgres also allows it):
select * from cars limit 20 offset 40
If you want a "snapshot" effect, you have to copy the data into a holding table where it will not get updated. You can accomplish some nice things with various types of change tracking, but that's not what you stated you wanted. If you need a snapshot of the exact table state, then take the snapshot, write it to a separate table, and use LIMIT and OFFSET (or whatever) to create pages.
And at 5 million rows, I think it is likely the design requirement that might need to be modified... if you have 2000 clients all taking 5-million-row snapshots, you are going to start having some size issues if you don't watch out.
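A minimal sketch of that holding-table approach in MySQL-style SQL (the id column is assumed; on the asker's MSSQL 2008 you would use SELECT INTO and ROW_NUMBER() instead of LIMIT/OFFSET):
-- 1. Take a one-off snapshot so pages stay consistent while Cars keeps changing.
CREATE TABLE cars_snapshot AS SELECT * FROM cars;
-- 2. Page through the frozen snapshot in fixed-size chunks.
SELECT * FROM cars_snapshot ORDER BY id LIMIT 1000 OFFSET 0;
SELECT * FROM cars_snapshot ORDER BY id LIMIT 1000 OFFSET 1000;
-- ...and so on until no rows come back.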
You should provide details of the format of the resultant data file. Depending on the format, this could be done directly in your database, with no app code involved, e.g. for MySQL:
SELECT * INTO OUTFILE "c:/mydata.csv"
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY "\n"
FROM my_table;
For Oracle there would be export, for SQL Server/Sybase it would be BCP, etc.
Alternatively, this is achievable by streaming the data without holding it all in memory; the details would vary depending on the app language.
In terms of paging, the easy option is to just use the LIMIT clause (if MySQL) or the equivalent in whatever RDBMS you are using, but this is a last resort:
select * from myTable order by ID LIMIT 0,1000
select * from myTable order by ID LIMIT 1000,1000
select * from myTable order by ID LIMIT 2000,1000
...
This selects the data in 1000 row chunks.
Look at this post on using limit and offset to create paginated results from your sql query.
http://www.petefreitag.com/item/451.cfm
You would have to first:
SELECT * from Cars Limit 10
and then
SELECT * from Cars limit 10 offset 10
And so on. You will have to figure out the best pagination for this.

Sql Server Paging issue

Friends,
I have already implemented paging in my SP -
with MyData As (
select ROW_NUMBER() over (order by somecolumn desc) AS [Row],
x,y,z,...
)
Select x,y,z,...
From MyData
Where [Row] between ((@currentPage - 1) * @pageSize + 1) and (@currentPage * @pageSize)
The problem here is that data is retrieved very fast if the WITH clause returns a small number of rows, but it takes a long time when there are millions of records. Sometimes it times out.
Is there any other alternative?
Thanks for sharing your valuable time.
SQL Server optimisation is a very broad subject, and it is pretty much impossible to work out the issue with the limited amount of information you have posted. However, since you're in a rush for a solution: first, I would suggest checking your actual execution plan, posting it here, and making sure that the index is actually being used. If it is not, consider using the FASTFIRSTROW table hint to force the index to be used - check here and here on how it can improve things, and here on how it can make things worse.
Next to consider is SQL parameter sniffing - it's unlikely from what you have said, but possible; check here for details.
For large-scale performance gains you may need to look at architectural changes. At the very least, ensure that your transaction logs are on a different disk to your data. The reason you separate the database files from the log files is that database access is random and log access is sequential; best practice dictates that you don't mix those two I/O types on the same disk.
Also, if you've got millions of rows then you really need to consider splitting the data across multiple disks.
Finally, I would strongly consider partitioning either the table or the index - see here for a start.
The reason your query is slow is that you sort the whole table on every request. To speed it up significantly you need to avoid sorting a big chunk of data, at the cost of CPU, HDD/memory, or limitations on the pagination logic.
As there is not much information about how your table is sorted and whether you insert in the middle / delete entries very often, I'll narrow down your question by making these assumptions:
I would imagine you have a table storing an archive of articles. New entries are mostly at the bottom of the table, entries from the middle of the table deleted rarely.
You sort always by the same column somecolumn and in the same order, e.g. descending.
You do not have any user entered filters (like article title or author).
This makes the table static in terms of the output: each article will stay in the same place unless a new one is inserted, and new ones come to the top of your output. Then you can store ROW_NUMBER() OVER (ORDER BY somecolumn DESC) as a column (a more convenient solution would be an IDENTITY column). It will speed things up if you create a clustered index on this column:
ALTER TABLE MyTable ADD [Record_Number] INT NULL
The new column is added as nullable so you can populate its values the first time; then you can make it NOT NULL.
On the other hand, you can get the last row number very quickly with
SELECT @Max_Row = MAX([Record_Number]) FROM MyTable
Now, when you have the total number of rows, the page size and the page number, you can select the rows you need in one statement without sorting the whole lot.
Select * From MyTable
Where [Record_Number] between
(@Max_Row - @Page * @Page_Size) + 1 AND
@Max_Row - (@Page - 1) * @Page_Size
If you do have a filter in your CTE, then give some more information about how your data is structured, so we can think of a way to limit the scope of CTE.

Appropriate query and indexes for a logging table in SQL

Assume a table named 'log'; there is a huge number of records in it.
The application usually retrieves data by simple SQL:
SELECT *
FROM log
WHERE logLevel=2 AND (creationData BETWEEN ? AND ?)
logLevel and creationData have indexes, but the number of records makes it take longer to retrieve data.
How do we fix this?
Look at your execution plan / "EXPLAIN PLAN" result. If you are retrieving large amounts of data then there is very little that you can do to improve performance. You could try changing your SELECT statement to only include the columns you are interested in; however, it won't change the number of logical reads that you are doing, so I suspect it will have only a negligible effect on performance.
If you are only retrieving small numbers of records then an index of LogLevel and an index on CreationDate should do the trick.
UPDATE: SQL Server is mostly geared around querying small subsets of massive databases (e.g. returning a single customer record out of a database of millions). It's not really geared up for returning truly large data sets. If the amount of data that you are returning is genuinely large then there is only a certain amount that you will be able to do, and so I'd have to ask:
What is it that you are actually trying to achieve?
If you are displaying log messages to a user, then they are only going to be interested in a small subset at a time, and so you might also want to look into efficient methods of paging SQL data - if you are only returning even say 500 or so records at a time it should still be very fast.
If you are trying to do some sort of statistical analysis then you might want to replicate your data into a data store more suited to statistical analysis. (Not sure what however, that isn't my area of expertise)
1: Never use Select *
2: make sure your indexes are correct, and your statistics are up-to-date
3: (Optional) If you find you're not looking at log data past a certain time (in my experience, if it happened more than a week ago, I'm probably not going to need the log for it), set up a job to archive that to some backup and then remove the unused records (see the sketch below). That will keep the table size down, reducing the amount of time it takes to search the table.
Depending on what kind of SQL database you're using, you might look into horizontal partitioning. Oftentimes, this can be done entirely on the database side of things, so you won't need to change your code.
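A hedged sketch of point 3 above in generic SQL (the log_archive table and the one-week cutoff are assumptions, and the exact date arithmetic varies by RDBMS):
-- Move log rows older than a week into an archive table, then purge them.
INSERT INTO log_archive
SELECT * FROM log WHERE creationData < CURRENT_DATE - INTERVAL '7' DAY;
DELETE FROM log WHERE creationData < CURRENT_DATE - INTERVAL '7' DAY;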
Do you need all columns? First step should be to select only those you actually need to retrieve.
Another aspect is what you do with the data after it arrives to your application (populate a data set/read it sequentially/?).
There can be some potential for improvement on the side of the processing application.
You should answer yourself these questions:
Do you need to hold all the returned data in memory at once? How much memory do you allocate per row on the retrieving side? How much memory do you need at once? Can you reuse some memory?
A couple of things
do you need all the columns? People usually do SELECT * because they are too lazy to list the 5 columns they need out of the 15 that the table has.
Get more RAM; the more RAM you have, the more data can live in cache, which is 1000 times faster than reading from disk.
For me there are two things that you can do,
Partition the table horizontally based on the date column
Use the concept of pre-aggregation.
Pre-aggregation:
In pre-aggregation you would have a "logs" table, a "logs_temp" table, a "logs_summary" table and a "logs_archive" table. The structure of the logs and logs_temp tables is identical. The flow of the application would be this: all logs are logged in the logs table, then every hour a cron job runs that does the following things:
a. Copy the data from the logs table to "logs_temp" table and empty the logs table. This can be done using the Shadow Table trick.
b. Aggregate the logs for that particular hour from the logs_temp table
c. Save the aggregated results in the summary table
d. Copy the records from the logs_temp table to the logs_archive table and then empty the logs_temp table.
This way results are pre-aggregated in the summary table.
Whenever you wish to select the result, you would select it from the summary table.
This way the selects are very fast, because the number of records is far smaller, as the data has been pre-aggregated per hour. You could even increase the threshold from an hour to a day. It all depends on your needs.
Now the inserts will be fast too, because the amount of data in the logs table is small, as it holds only the last hour's data, so index regeneration on inserts takes much less time than it would on a very large data set, hence making the inserts fast.
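A rough sketch of the hourly job's steps a-d in generic SQL (the summary columns and the hourly grouping are assumptions, and step a is simplified to INSERT/DELETE rather than the shadow-table rename mentioned above):
-- a. Move the last hour's rows out of the hot logs table.
INSERT INTO logs_temp SELECT * FROM logs;
DELETE FROM logs;
-- b./c. Aggregate that hour and store the result in the summary table.
INSERT INTO logs_summary (log_hour, logLevel, event_count)
SELECT DATE_TRUNC('hour', creationData), logLevel, COUNT(*)
FROM logs_temp
GROUP BY DATE_TRUNC('hour', creationData), logLevel;
-- d. Archive the raw rows and empty the staging table.
INSERT INTO logs_archive SELECT * FROM logs_temp;
DELETE FROM logs_temp;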
You can read more about Shadow Table trick here
I employed the pre-aggregation method on a news website built on WordPress. I had to develop a plugin for the news website that would show recently popular (popular during the last 3 days) news items; there are about 100K hits per day, and this pre-aggregation thing has really helped us a lot. The query time came down from more than 2 seconds to under a second. I intend on making the plugin publicly available soon.
As per other answers, do not use 'select *' unless you really need all the fields.
logLevel and creationData have indexes
You need a single index with both values, what order you put them in will affect performance, but assuming you have a small number of possible loglevel values (and the data is not skewed) you'll get better performance putting creationData first.
Note that optimally an index will reduce the cost of a query to log(N) i.e. it will still get slower as the number of records increases.
C.
I really hope that by creationData you mean creationDate.
First of all, it is not enough to have indexes on logLevel and creationData. If you have 2 separate indexes, Oracle will only be able to use 1.
What you need is a single index on both fields:
CREATE INDEX i_log_1 ON log (creationData, logLevel);
Note that I put creationData first. This way, if you only put that field in the WHERE clause, it will still be able to use the index. (Filtering on just date seems more likely scenario that on just log level).
Then, make sure the table is populated with data (as much data as you will use in production) and refresh the statistics on the table.
If the table is large (at least few hundred thousand rows), use the following code to refresh the statistics:
DECLARE
l_ownname VARCHAR2(255) := 'owner'; -- Owner (schema) of table to analyze
l_tabname VARCHAR2(255) := 'log'; -- Table to analyze
l_estimate_percent NUMBER(3) := 5; -- Percentage of rows to estimate (NULL means compute)
BEGIN
dbms_stats.gather_table_stats (
ownname => l_ownname ,
tabname => l_tabname,
estimate_percent => l_estimate_percent,
method_opt => 'FOR ALL INDEXED COLUMNS',
cascade => TRUE
);
END;
Otherwise, if the table is small, use
ANALYZE TABLE log COMPUTE STATISTICS FOR ALL INDEXED COLUMNS;
Additionally, if the table grows large, you should consider partitioning it by range on the creationDate column. See these links for the details:
Oracle Documentation: Range Partitioning
OraFAQ: Range partitions
How to Create and Manage Partition Tables in Oracle
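A minimal sketch of what such a range-partitioned log table could look like (the message column, the partition names and the boundary dates are all made up for illustration):
-- Monthly range partitions, so a BETWEEN filter on the date column touches only a few partitions.
CREATE TABLE log (
  logLevel     NUMBER,
  creationData DATE,
  message      VARCHAR2(4000)
)
PARTITION BY RANGE (creationData) (
  PARTITION p_2011_01 VALUES LESS THAN (DATE '2011-02-01'),
  PARTITION p_2011_02 VALUES LESS THAN (DATE '2011-03-01'),
  PARTITION p_future  VALUES LESS THAN (MAXVALUE)
);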