Find out the amount of space each field takes in Google Big Query - google-bigquery

I want to optimize the space of my Big Query and google storage tables. Is there a way to find out easily the cumulative space that each field in a table gets? This is not straightforward in my case, since I have a complicated hierarchy with many repeated records.

You can do this in Web UI by simply typing (and not running) below query changing to field of your interest
SELECT <column_name>
FROM YourTable
and looking into Validation Message that consists of respective size
Important - you do not need to run it – just check validation message for bytesProcessed and this will be a size of respective column
Validation is free and invokes so called dry-run
If you need to do such “columns profiling” for many tables or for table with many columns - you can code this with your preferred language using Tables.get API to get table schema ; then loop thru all fields and build respective SELECT statement and finally Dry Run it (within the loop for each column) and get totalBytesProcessed which as you already know is the size of respective column

I don't think this is exposed in any of the meta data.
However, you may be able to easily get good approximations based on your needs. The number of rows is provided, so for some of the data types, you can directly calculate the size:
https://cloud.google.com/bigquery/pricing
For types such as string, you could get the average length by querying e.g. the first 1000 fields, and use this for your storage calculations.

Related

Limitations in using all string columns in BigQuery

I have an input table in BigQuery that has all fields stored as strings. For example, the table looks like this:
name dob age info
"tom" "11/27/2000" "45" "['one', 'two']"
And in the query, I'm currently doing the following
WITH
table AS (
SELECT
"tom" AS name,
"11/27/2000" AS dob,
"45" AS age,
"['one', 'two']" AS info )
SELECT
EXTRACT( year from PARSE_DATE('%m/%d/%Y', dob)) birth_year,
ANY_value(PARSE_DATE('%m/%d/%Y', dob)) bod,
ANY_VALUE(name) example_name,
ANY_VALUE(SAFE_CAST(age AS INT64)) AS age
FROM
table
GROUP BY
EXTRACT( year from PARSE_DATE('%m/%d/%Y', dob))
Additionally, I tried doing a very basic group by operation casting an item to a string vs not, and I didn't see any performance degradation on a data set of ~1M rows (actually, in this particular case, casting to a string was faster):
Other than it being bad practice to "keep" this all-string table and not convert it into its proper type, what are some of the limitations (either functional or performance-wise) that I would encounter by keeping a table all-string instead of storing it as their proper type. I know there would be a slight increase in size due to storing strings instead of number/date/bool/etc., but what would be the major limitations or performance hits I'd run into if I kept it this way?
Off the top of my head, the only limitations I see are:
Queries would become more complex (though wouldn't really matter if using a query-builder).
A bit more difficult to extract non-string items from array fields.
Inserting data becomes a bit trickier (for example, need to keep track of what the date format is).
But these all seem like very small items that can be worked around. Are there are other, "bigger" reasons why using all string fields would be a huge limitation, either in limiting query-ability or having a huge performance hit in various cases?
First of all - I don't really see any bigger show-stoppers than those you already know and enlisted
Meantime,
though wouldn't really matter if using a query-builder ...
based on above excerpt - I wanted to touch upon some aspect of this approach (storing all as strings)
While we usually concerned about CASTing from string to native type to apply relevant functions and so on, I realized that building complex and generic query with some sort of query builder in some cases requires opposite - cast native type to string for applying function like STRING_AGG [just] as a quick example
So, my thoughts are:
When table is designed for direct user's access with trivial or even complex queries - having native types is beneficial and performance wise and being more friendly for user to understand, etc.
Meantime, if you are developing your own query builder and you design table such that it will be available to users for querying via that query builder with some generic logic being implemented - having all fields in string can be helpful in building the query builder itself.
So it is a balance - you can lose a little in performance but you can win in being able to better implement generic query builder. And such balance depend on nature of your business - both from data prospective and what kind of query you envision to support
Note: your question is quite broad and opinion based (which is btw not much respected on SO) so, obviously my answer - is totally my opinion but based on quite an experience with BigQuery
Are you OK to store string "33/02/2000" as a date in one row and "21st of December 2012" in another row and "22ое октября 2013" in another row?
Are you OK to store string "45" as age in one row and "young" in another row?
Are you OK when age "10" is less than age "9"?
Data types provide some basic data validation mechanism at the database level.
Does BigQuery databases have a notion of indexes?
If yes, then most likely these indexes become useless as soon as you start casting your strings to proper types, such as
SELECT
...
WHERE
age > 10 and age < 30
vs
SELECT
...
WHERE
ANY_VALUE(SAFE_CAST(age AS INT64)) > 10
and ANY_VALUE(SAFE_CAST(age AS INT64)) < 30
It is normal that with less columns/rows you don't feel the problems. You start to feel the problems when your data gets huge.
Major concerns:
Maintenance of the code: Think of future requirements that you may receive. Every conversion for data manipulation will add extra complexity to your code. For example, if your customer asks for retrieving teenagers in future, you'll need to convert string to date to get the age and then be able to do the manupulation.
Data size: The data size has broader impacts that can not be seen at the start. For example if you have N parallel test teams which require own test systems, you'll need to allocate more disk space.
Read Performance: When you have more bytes to read in huge tables it will cost you considerable time. For example typically telco operators have a couple of billions of rows data per month.
If your code complexity increase, you'll need to replicate conversions in multiple places.
Even single of above items should push one to distance from using strings for everything.
I would think the biggest issue with this would be if there are other users of this table/data, for instance if someone is trying to write reports with it and do calculations or charts or date ranges it could be a big headache having to always cast or convert the data with whatever tool they are using. You or someone would likely get a lot of complaints about it.
And if someone decided to build a layer between this data and the reporting tool which converted all of the data, then you may as well just do it one time to the table/data and be done with it.
From the solution below, you might face some storage and performance problems, you can find some guidance in the official documentation:
The main performance problem will come from the CAST operation, remember that the BigQuery Engine will have to deal with a CAST operation for each value per row.
In order to test the compute cost of this operations, I used the following query:
SELECT
street_number
FROM
`bigquery-public-data.austin_311.311_service_requests`
LIMIT
5000
Inspecting the stages executed in the execution details we are able to see the following:
READ
$1:street_number
FROM bigquery-public-data.austin_311.311_service_requests
LIMIT
5000
WRITE
$1
TO __stage00_output
Only the Read, Limit and Write operations are required. However if we execute the same query adding the the CAST operator.
SELECT
CAST(street_number AS int64)
FROM
`bigquery-public-data.austin_311.311_service_requests`
LIMIT
5000
We see that a compute operation is also required in order to perform the cast operation:
READ
$1:street_number
FROM bigquery-public-data.austin_311.311_service_requests
LIMIT
5000
COMPUTE
$10 := CAST($1 AS INT64)
WRITE
$10
TO __stage00_output
Those compute operations will consume some time, that might cause problems when escalating the operation size.
Also, remember that each time that you want to use the data type properties of each data type, you will have to cast your value, and deal with the compute operation time required.
Finally, referring to the storage performance, as you mentioned Strings do not have a fixed size, and that might cause a size increase.

SAP DBTech JDBC: [2048]: column store error: search table error: [2724] Olap temporary data size exceeded 31/32 bit limit"

I have a calculation view which is based on other calculation views and joins to bring material Accounts data from different vendors (all joins have 1-1 mapping with target). in the final view I have a calculated column as Formatted_MATERIAL (Material numbers without any leading zeros, used Ltrim() to remove leading zeros.)
Now, when I'm searching Formatted_MATERIAL equal to some specific number it's showing read error (heading). If I'm searching for some range of material it's giving results.
For example, if I search for material (500098), it's present in following query results
select "Formatted_MATERIAL"
FROM "_SYS_BIC"."CA_REPORTS_001_VK"
where "Formatted_MATERIAL" between 5000000 and 6000000
order by "Formatted_MATERIAL"
but no results for
select "Formatted_MATERIAL"
FROM "_SYS_BIC"."CA_REPORTS_001_VK"
where "Formatted_MATERIAL" = 5000098
The cause of the error is that during some processing step in one of the views you're using, the intermediate result set exceeds 2 billion records.
Based on my experience with typical HANA use cases (that would mostly be use cases in relation to SAP products) I am pretty sure that the way these underlying views have been modelled is not really right. Whenever you try and join or aggregate an intermediate result set of two billion records at once, chances are that important operations like filtering, projection and aggregation should have been done much earlier in the model.
Of course, without seeing the model(s) and the execution details (use PlanViz for this) and knowing with HANA version you're using, there is nothing we can say about how to solve this issue.

Large results set from Oracle SELECT

I have a simple, pasted below, statement called against an Oracle database. This result set contains names of businesses but it has 24,000 results and these are displayed in a drop down list.
I am looking for ideas on ways to reduce the result set to speed up the data returned to the user interface, maybe something like Google's search or a completely different idea. I am open to whatever thoughts and any direction is welcome.
SELECT BusinessName FROM MyTable ORDER BY BusinessName;
Idea:
SELECT BusinessName FROM MyTable WHERE BusinessName LIKE "A%;
I'm know all about how LIKE clauses are not wise to use but like I said this is a LARGE result set. Maybe something along the lines of a BINARY search?
The last query can perform horribly. String comparisons inside the database can be very slow, and depending on the number of "hits" it can be a huge drag on performance. If that doesn't concern you that's fine. This is especially true if the Company data isn't normalized into it's own db table.
As long as the user knows the company he's looking up, then I would identify an existing JavaScript component in some popular JavaScript library that provides a search text field with a dynamic dropdown that shows matching results would be an effective mechanism. But you might want to use '%A%', if they might look for part of a name. For example, If I'm looking for IBM Rational, LLC. do I want it to show up in results when I search for "Rational"?
Either way, watch your performance and if it makes sense cache that data in the company look up service that sits on the server in front of the DB. Also, make sure you don't respond to every keystroke, but have a timeout 500ms or so, to allow the user to type in multiple chars before going to the server and searching. Also, I would NOT recommend bringing all of the company names to the client. We're always looking to reduce the size and frequency of traversals to the server from the browser page. Waiting for 24k company names to come down to the client when the form loads (or even behind the scenes) when shorter quicker very specific queries will perform sufficiently well seems more efficient to me. Again, test it and identify the performance characteristics that fit your use case best.
These are techniques I've used on projects with large data, like searching for a user from a base of 100,000+ users. Our code was a custom Dojo widget (dijit), I 'm not seeing how to do it directly with the dijit code, but jQuery UI provides the autocomplete widget.
Also use limit on this query with a text field so that the drop down only provides a subset of all the matches, forcing the user to further refine the query.
SELECT BusinessName FROM MyTable ORDER BY BusinessName LIMIT 10

Suggestions/Opinions for implementing a fast and efficient way to search a list of items in a very large dataset

Please comment and critique the approach.
Scenario: I have a large dataset(200 million entries) in a flat file. Data is of the form - a 10 digit phone number followed by 5-6 binary fields.
Every week I will be getting a Delta files which will only contain changes to the data.
Problem : Given a list of items i need to figure out whether each item(which will be the 10 digit number) is present in the dataset.
The approach I have planned :
Will parse the dataset and put it a DB(To be done at the start of the
week) like MySQL or Postgres. The reason i want to have RDBMS in the
first step is I want to have full time series data.
Then generate some kind of Key Value store out of this database with
the latest valid data which supports operation to find out whether
each item is present in the dataset or not(Thinking some kind of a
NOSQL db, like Redis here optimised for search. Should have
persistence and be distributed). This datastructure will be read-only.
Query this key value store to find out whether each item is present
(if possible match a list of values all at once instead of matching
one item at a time). Want this to be blazing fast. Will be using this functionality as the back-end to a REST API
Sidenote: Language of my preference is Python.
A few considerations for the fast lookup:
If you want to check a set of numbers at a time, you could use the Redis SINTER which performs set intersection.
You might benefit from using a grid structure by distributing number ranges over some hash function such as the first digit of the phone number (there are probably better ones, you have to experiment), this would e.g. reduce the size per node, when using an optimal hash, to near 20 million entries when using 10 nodes.
If you expect duplicate requests, which is quite likely, you could cache the last n requested phone numbers in a smaller set and query that one first.

DATA_BUFFER_EXCEEDED error when calling RFC_READ_TABLE?

My java/groovy program receives table names and table fields from the user input, it queries the tables in SAP and returns its contents.
The user input may concern the tables CDPOS and CDHDR. After reading the SAP documentations and googling, I found these are tables storing change document logs. But I did not find any remote call functions that can be used in java to perform this kind of queries.
Then I used the deprecated RFC Function Module RFC_READ_TABLE and tried to build up customized queries only depending on this RFC. However, I found if the number of desired fields I passed to this RFC are more than 2, I always got the DATA_BUFFER_EXCEEDED error even if I limit the max rows.
I am not authorized to be an ABAP developer in the SAP system and can not add any FM to existing systems, so I can only write code to accomplish this requirement in JAVA.
Am I doing something wrong? Could you give me some hints on that issue?
DATA_BUFFER_EXCEEDED only happens if the total width of the fields you want to read exceeds the width of the DATA parameter, which may vary depending on the SAP release - 512 characters for current systems. It has nothing to do with the number of rows, but the size of a single dataset.
So the question is: What are the contents of the FIELDS parameter? If it's empty, this means "read all fields." CDHDR is 192 characters in width, so I'd assume that the problem is CDPOS which is 774 characters wide. The main issue would be the fields VALUE_OLD and VALUE_NEW, both 245 Characters.
Even if you don't get developer access, you should prod someone to get read-only dictionary access to be able to examine the structures in detail.
Shameless plug: RCER contains a wrapper class for RFC_READ_TABLE that takes care of field handling and ensures that the total width of the selected fields is below the limit imposed by the function module.
Also be aware that these tables can be HUGE in production environments - think billions of entries. You can easily bring your database to a grinding halt by performing excessive read operations on these tables.
PS: RFC_READ_TABLE is not released for customer use as per SAP note 382318, and the note 758278 recommends to create your own function module and provides a template with an improved logic.
Use BBP_RFC_READ_TABLE instead
There is a way around the DATA_BUFFER_EXCEED error. Although this function is not released for customer use as per SAP OSS note 382318, you can get around this issue with changes to the way you pass parameters to this function. Its not a single field that is causing your error, but if the row of data exceeds 512 bytes this error will be raised. CDPOS will have this issue for sure!
The work around if you know how to call the function using Jco and pass table parameters is to specify the exact fields you want returned. You then can keep your returned results under the 512 byte limit.
Using your example of table CDPOS, specify something like this and you should be good to go...(be careful, CDPOS can get massive! You should specify and pass a where clause!)
FIELDS = 'OBJECTCLAS'....
FIELDS = 'OBJECTID'
In Java it can be expressed as..
listParams.setValue(this.getpObjectclas(), "OBJECTCLAS");
By limiting the fields you are returning you can avoid this error.