How to Optimize Google BigQuery Bytes Billed - sql

I recently discovered Google BigQuery and its open datasets. When I run the following query on the 311_service_requests table in the new_york dataset, the cloud console reports 130 MB of bytes billed.
SQL Query:
SELECT unique_key FROM `bigquery-public-data.new_york.311_service_requests` LIMIT 10
Query Returns:
+------+-------------+
| Rows | unique_key |
+------+-------------+
| 1 | 37911459 |
| 2 | 38162601 |
| 3 | 32560181 |
| 4 | 38259076 |
| 5 | 36034528 |
| 6 | 36975822 |
| 7 | 38028455 |
| 8 | 37993135 |
| 9 | 37988664 |
| 10 | 35382611 |
+------+-------------+
For a query returning such a small amount of data, why is the bytes billed valued at 130 MB?
Is there a way to optimize this? Should the results of a query be stored in another database for later retrieval?

why is the bytes billed valued at 130 MB?
Query pricing refers to the cost of running your SQL commands and user-defined functions. BigQuery charges for queries by using one metric: the number of bytes processed (also referred to as bytes read). You are charged for the number of bytes processed whether the data is stored in BigQuery or in an external data source such as Cloud Storage, Google Drive, or Cloud Bigtable.
When you run a query, you're charged according to the total data processed in the columns you select, even if you set an explicit LIMIT on the results. The total bytes per column is calculated based on the types of data in the column. For more information about how we calculate your data size, see Data size calculation.
Query pricing is based on your usage pattern: a monthly flat rate for queries or pricing based on interactive queries. Enterprise customers generally prefer flat-rate pricing for queries because that model offers consistent month-to-month costs. On-demand (or interactive) pricing offers flexibility and is based solely on usage.
You can see more at https://cloud.google.com/bigquery/pricing
So, in your case, 130 MB is the size of the entire unique_key column: the whole column is scanned regardless of the LIMIT.
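The arithmetic behind that figure can be sketched as follows. This is only a rough estimate under stated assumptions: that unique_key is an INT64 column (8 bytes per value in BigQuery's data size calculation) and that the console's "MB" means 2**20 bytes. The function names are made up for illustration.

```python
# Logical size per value for a few fixed-width BigQuery types.
BYTES_PER_VALUE = {"INT64": 8, "FLOAT64": 8, "BOOL": 1, "TIMESTAMP": 8}


def estimate_column_bytes(row_count, column_type):
    """Bytes scanned when a query selects one fixed-width column in full."""
    return row_count * BYTES_PER_VALUE[column_type]


def rows_implied_by(bytes_billed, column_type):
    """Inverse estimate: how many rows a billed size corresponds to."""
    return bytes_billed // BYTES_PER_VALUE[column_type]


# Assumption: 130 MB as shown in the console, read as 130 * 2**20 bytes.
BYTES_BILLED = 130 * 1024 * 1024

# Assumption: unique_key is stored as INT64.
implied_rows = rows_implied_by(BYTES_BILLED, "INT64")
print(f"about {implied_rows:,} rows scanned")
```

Under those assumptions the billed size corresponds to roughly 17 million rows, which is why LIMIT 10 makes no difference: the cost follows the column, not the result.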
Should the results of a query be stored in another database for later retrieval?
Sure.
You can do so to keep costs down when you repeatedly process that small result set, without touching the original table again.
Keep in mind that this will incur storage charges; see the pricing link above for details.


In CloudWatch Insights, how do I build a histogram of an aggregation function (second level query)?

I'm not sure I'm asking this correctly which is probably why I can't find the solution. So I'll provide an example.
Suppose I have a log of employees hired by managers in a given time period. I can create a query that groups by manager and shows the number of employees hired
stats count() as numEmployees by managerId
| filter @message like /hired employee/
| sort numEmployees desc
Let's suppose that generates the following table
Mngr | numHires
Jack | 4
Judy | 3
May | 3
John | 2
Jake | 2
Mary | 1
Sam | 1
Alan | 1
I'd like to further refine my result so that I can produce another histogram of numHires and count like so
4 | 1
3 | 2
2 | 2
1 | 3
This table means there was 1 instance of 4 hires, 2 instances of 3 hires, 2 instances of 2 hires, and 3 instances of 1 hire.
Is there a way to do this?
ps - I know I can download the csv and do this in excel. However, there is a limit of 10000 results returned in cloudwatch
I needed to do the same type of aggregation and raised a support case with AWS to ask how this could be done. The AWS team responded that, unfortunately, this is not possible with Insights at the moment:
Insights does not have the capabilities for second level aggregations currently. So an alternate workaround is to use AWS Quick Sights or MS Excel to plot the required graphs.
In my case Excel is not an option because the resulting dataset for a day has millions of records. In the end, my solution was to sample just a few minutes of data, export it to Excel, and generate a pivot table to aggregate the data. This gave me a rough idea of how my system behaves.
I have not looked into AWS QuickSight.
There may be other third-party solutions besides AWS Insights, such as Datadog, that provide more powerful log analysis functionality. I have not used Datadog personally so cannot vouch for it but have read good things about it.
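For exported data, the second-level aggregation itself is small enough to do in a script instead of a pivot table. The following is a minimal sketch in Python, not something Insights provides; it assumes the exported log lines have already been reduced to one managerId per "hired employee" event, and the function name is hypothetical.

```python
from collections import Counter


def histogram_of_counts(manager_ids):
    """Count hires per manager, then count how many managers share each total."""
    hires_per_manager = Counter(manager_ids)      # first-level aggregation
    return Counter(hires_per_manager.values())    # second-level aggregation


# Sample data reproducing the table from the question.
hired_by = (
    ["Jack"] * 4 + ["Judy"] * 3 + ["May"] * 3
    + ["John"] * 2 + ["Jake"] * 2 + ["Mary", "Sam", "Alan"]
)

histogram = histogram_of_counts(hired_by)
for num_hires, num_managers in sorted(histogram.items(), reverse=True):
    print(f"{num_hires} | {num_managers}")
```

With the sample data this prints the 4 | 1, 3 | 2, 2 | 2, 1 | 3 table the question asks for. The 10000-row export limit still applies to the input, so this only helps once the raw events are available some other way, such as sampling.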
References:
[1] https://docs.aws.amazon.com/quicksight/latest/user/histogram-charts.html

Select rows in a table (postgis) from selected features QGIS

How do I select rows in a table based on a key (PK) from another table? I have selected multiple polygons that lie within a geographical region in one layer.
The attribute table of the selected layer looks like this:
| Bloknr | Column 1 | Column 2 | Column 3 |
| 111-08 | xqyz | xyzq | qxyz |
| 208-09 | abc | cba | bca |
Where the row in question (row 1) is selected.
I now want to select this row from a nongeographic layer (from a postgresql database) with a table that looks like this:
| BLOKNR | Column 1 | Column 2 | Column 3 |
| 111-08 | cab | bac | cab |
| 208-09 | abc | cba | bca |
| 111-08 | cba | bca | cab |
Where the first and third rows are to be selected.
There are about 20,000,000 rows in the postgres table, with multiple matches for each bloknr.
I work in QGIS 3.2 and PostgreSQL with pgAdmin 4.
Any help is most appreciated.
UPDATE to answer the comments
It would be simple if it were a matter of doing it within postgres, which is more or less made for that, but I cannot figure out how to run the query within QGIS. I would prefer not to have to export each table to PostgreSQL (I have a few, and for each I need multiple selection queries based on geography), partly because I would like to keep the workflow in QGIS, and partly because the export feature in the QGIS DB Manager gives me the error below, which I think means that I would have to create all the tables manually.
ERROR: function addgeometrycolumn(unknown, unknown, unknown, integer, unknown, integer) does not exist
LINE 1: SELECT AddGeometryColumn('public','Test',NULL,0,'MULTIPOLYGO...
HINT: No function matches the given name and argument types. You might need to add explicit type casts.
So again, any help is appreciated.
I have come up with an answer that should work in theory:
First, make the desired geographical selection and create a new layer from the selection.
Then export the layer to the PostGIS database you are connected to.
It is now possible to run queries in PostgreSQL and pgAdmin.
Note that this does not keep the workflow in QGIS. For further processing (statistics etc.) one will have to work on the integration between the new PostGIS layer and the selection within it. It also doesn't quite solve the geographical/map-based selection approach, although it will work.
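Once the selection sits in the database, the lookup is a plain join on bloknr. The sketch below only illustrates the shape of that query: it uses an in-memory SQLite database as a stand-in for PostgreSQL so that it runs anywhere, and the table names selected_polygons and big_table are hypothetical.

```python
import sqlite3

# Assumption: SQLite stands in for the PostgreSQL/PostGIS database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE selected_polygons (bloknr TEXT PRIMARY KEY)")
conn.execute(
    "CREATE TABLE big_table (bloknr TEXT, column_1 TEXT, column_2 TEXT, column_3 TEXT)"
)
# With ~20 million rows, an index on the join key matters.
conn.execute("CREATE INDEX idx_big_table_bloknr ON big_table (bloknr)")

# The exported selection: only block 111-08 was selected on the map.
conn.executemany("INSERT INTO selected_polygons VALUES (?)", [("111-08",)])
conn.executemany(
    "INSERT INTO big_table VALUES (?, ?, ?, ?)",
    [
        ("111-08", "cab", "bac", "cab"),
        ("208-09", "abc", "cba", "bca"),
        ("111-08", "cba", "bca", "cab"),
    ],
)

# Every row of the large table whose key appears in the selection.
matches = conn.execute(
    """
    SELECT b.bloknr, b.column_1, b.column_2, b.column_3
    FROM big_table AS b
    JOIN selected_polygons AS s ON s.bloknr = b.bloknr
    ORDER BY b.rowid
    """
).fetchall()

for row in matches:
    print(row)
```

With the sample rows from the question, the first and third rows of the large table are returned.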

How to group rows vertically in PowerBuilder?

I have this sample rows of plate nos with bay nos:
Plate no | Bay no
------------------
AAA111 | 1
AAA222 | 1
AAA333 | 2
BBB111 | 3
BBB222 | 3
CCC111 | 1
Is there a way to make it look like this in a datawindow in powerbuilder?
1      | 2      | 3
------------------------
AAA111 | AAA333 | BBB111
AAA222 |        | BBB222
CCC111 |        |
There isn't a simple answer, especially if you need the cells to be updatable.
Variable Column Count Strategy
If the number of columns across the top is unknown at development time, then you might get by with a "Crosstab" style datawindow, but it would be display-only. If you need updates, you'll need to do manual data manipulation and updates, as each cell would probably represent one row.
Fixed Column Count Strategy
If the number of columns is known (fixed) you could flatten the data at the database and use a standard tabular (or grid) datawindow control but you'll still need to get creative if updates are needed.
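The "flattening" step is independent of PowerBuilder and can be sketched in a few lines. The example below is in Python purely to show the transformation; it is not DataWindow code, and the function name is hypothetical. Each bay becomes a column, and shorter columns are padded with empty cells.

```python
from collections import defaultdict
from itertools import zip_longest

# Sample rows from the question: (plate_no, bay_no).
rows = [
    ("AAA111", 1), ("AAA222", 1), ("AAA333", 2),
    ("BBB111", 3), ("BBB222", 3), ("CCC111", 1),
]


def pivot_by_bay(rows):
    """Group plate numbers by bay, then lay the groups out side by side."""
    columns = defaultdict(list)
    for plate_no, bay_no in rows:
        columns[bay_no].append(plate_no)
    bays = sorted(columns)
    # zip_longest turns the per-bay lists into display rows, padding with "".
    grid = [list(r) for r in zip_longest(*(columns[b] for b in bays), fillvalue="")]
    return bays, grid


bays, grid = pivot_by_bay(rows)
print(" | ".join(str(b) for b in bays))
for display_row in grid:
    print(" | ".join(display_row))
```

Note that a display row here mixes data from several source rows, which is why updates through such a layout need the manual handling described above.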
If you use Oracle to obtain the data you can use the Pivot and Unpivot function to perform what you are looking for. Here is an example of how to do it:
http://www.oracle.com/technetwork/es/articles/sql/caracteristicas-database11g-2108415-esa.html

Dynamic creation of Table type

I have a column table with a single column.
I would like to create a table type whose column names are the values stored in that column, each with a fixed data type and size, and use it in a function.
Something similar to the approach in this question:
Dynamic creation of table in tsql
Any suggestions would be appreciated.
EDIT:
To finish a product, a machine has to perform different jobs on the material with different tools.
I have a list of jobs a machine can perform and a list of tools, with a specific tool for a specific job.
Each job needs a specific tool and a number of hours (so that the tool can be changed once it reaches its change time). A job can be performed many times on a product. (If a job is performed for 1 hour, the tool has been used for 1 hour.)
For each product, a set of tools is at work in sequence, so I need a report showing, for each product, the number of hours each tool has worked.
EDIT 2:
Product table
---------+-----+
ProductID|Jobs |
---------+-----+
1 | job1 |
1 | job2 |
1 | job3 |
1 | . |
1 | . |
1 |100th |
2 | job1 |
2 | . |
2 | . |
2 |200th |
Jobs table
-------+-------+-------
Jobs | tool | time
-------+-------+-------
job1 |tool 10| 2
job1 |tool 09| 1
job2 |tool 11| 4
job3 |tool 17| 0.5
required report (this table does not physically exist)
----------+------+------+------+------+------+-----
productID | job1 | job2 | job3 | job4 | job5 | . . .
----------+------+------+------+------+------+------
1 | 20 | 10 | 5 | . | . | .
----------+------+------+------+------+------+------
2 | 10 | 13 | 5 | . | . | .
----------+------+------+------+------+------+------
Based on the added information, there are two main requirements here:
1. You want to sum up the time spent producing each product, grouped by the jobs involved.
2. You want a cross-table report showing the times from step 1 against products and jobs.
For the first part, you could probably use a query like this:
SELECT
    p.product_id,
    j.jobs,
    SUM(j.time) AS sum_time
FROM
    products p
    INNER JOIN jobs j
        ON p.jobs = j.jobs
GROUP BY
    p.product_id,
    j.jobs;
For the second part: this is usually called a PIVOT report.
SAP HANA does not provide a dynamic SQL command for generating output in this form (other DBMS have that).
However, this dynamic transformation is usually relevant for the data presentation and not so much for the processing.
So, as you probably want to use some form of front end for this report (e.g. MS Excel, Crystal Reports, Business Objects X, Tableau, ...) I would recommend doing the transformation and formatting in the frontend report. Look for "PIVOT" or "CROSSTAB" options to do that.
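If no reporting front end is at hand, the same two steps (aggregate, then pivot) are easy to reproduce in a script that consumes the query result. The following Python sketch is only an illustration under assumptions: the sample rows are hypothetical stand-ins for the Product and Jobs tables, and the function names are invented.

```python
from collections import defaultdict

# Hypothetical sample data shaped like the Product and Jobs tables.
product_jobs = [(1, "job1"), (1, "job2"), (1, "job3"), (2, "job1")]
job_tools = [
    ("job1", "tool 10", 2.0),
    ("job1", "tool 09", 1.0),
    ("job2", "tool 11", 4.0),
    ("job3", "tool 17", 0.5),
]


def sum_time_by_product_and_job(product_jobs, job_tools):
    """Step 1: the join plus GROUP BY product, job with SUM(time)."""
    totals = defaultdict(float)
    for product_id, job in product_jobs:
        for job_name, _tool, hours in job_tools:
            if job_name == job:
                totals[(product_id, job)] += hours
    return dict(totals)


def pivot(totals):
    """Step 2: one row per product, one column per job (0.0 where absent)."""
    jobs = sorted({job for _, job in totals})
    products = sorted({product_id for product_id, _ in totals})
    header = ["productID"] + jobs
    body = [[p] + [totals.get((p, j), 0.0) for j in jobs] for p in products]
    return header, body


totals = sum_time_by_product_and_job(product_jobs, job_tools)
header, body = pivot(totals)
print(header)
for line in body:
    print(line)
```

The column list is derived from the data at run time, which is exactly the dynamic part that a fixed SQL table type cannot express.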

Storing large SQL datasets with variable numbers of columns

In America’s Cup yachting, we generate large datasets where at every time-stamp (e.g. 100Hz) we need to store maybe 100-1000 channels of sensor data (e.g. speed, loads, pressures). We store this in MS SQL Server and need to be able to retrieve subsets of channels of the data for analysis, and perform queries such as the maximum pressure on a particular sensor in a test, or over an entire season.
The set of channels to be stored stays the same for several thousand time-stamps, but day-to-day will change as new sensors are added, renamed, etc... and depending on testing, racing or simulating, the number of channels can vary greatly.
The textbook way to structure the SQL tables would probably be:
OPTION 1
ChannelNames
+-----------+-------------+
| ChannelID | ChannelName |
+-----------+-------------+
| 50 | Pressure |
| 51 | Speed |
| ... | ... |
+-----------+-------------+
Sessions
+-----------+---------------+-------+----------+
| SessionID | Location | Boat | Helmsman |
+-----------+---------------+-------+----------+
| 789 | San Francisco | BoatA | SailorA |
| 790 | San Francisco | BoatB | SailorB |
| ... | ... | ... | |
+-----------+---------------+-------+----------+
SessionTimestamps
+-------------+-------------+------------------------+
| SessionID | TimestampID | DateTime |
+-------------+-------------+------------------------+
| 789 | 12345 | 2013/08/17 10:30:00:00 |
| 789 | 12346 | 2013/08/17 10:30:00:01 |
| ... | ... | ... |
+-------------+-------------+------------------------+
ChannelData
+-------------+-----------+-----------+
| TimestampID | ChannelID | DataValue |
+-------------+-----------+-----------+
| 12345 | 50 | 1015.23 |
| 12345 | 51 | 12.23 |
| ... | ... | ... |
+-------------+-----------+-----------+
This structure is neat but inefficient. Each DataValue requires three storage fields, and at each time-stamp we need to INSERT 100-1000 rows.
If we always had the same channels, it would be more sensible to use one row per time-stamp and structure like this:
OPTION 2
+-----------+------------------------+----------+-------+----------+--------+-----+
| SessionID | DateTime | Pressure | Speed | LoadPt | LoadSb | ... |
+-----------+------------------------+----------+-------+----------+--------+-----+
| 789 | 2013/08/17 10:30:00:00 | 1015.23 | 12.23 | 101.12 | 98.23 | ... |
| 789 | 2013/08/17 10:30:00:01 | 1012.51 | 12.44 | 100.33 | 96.82 | ... |
| ... | ... | ... | | | | |
+-----------+------------------------+----------+-------+----------+--------+-----+
However, the channels change every day, and over the months the number of columns would grow and grow, with most cells ending up empty. We could create a new table for every new Session, but it doesn’t feel right to be using a table name as a variable, and would ultimately result in tens of thousands of tables – also, it becomes very difficult to query over a season, with data stored in multiple tables.
Another option would be:
OPTION 3
+-----------+------------------------+----------+----------+----------+----------+-----+
| SessionID | DateTime | Channel1 | Channel2 | Channel3 | Channel4 | ... |
+-----------+------------------------+----------+----------+----------+----------+-----+
| 789 | 2013/08/17 10:30:00:00 | 1015.23 | 12.23 | 101.12 | 98.23 | ... |
| 789 | 2013/08/17 10:30:00:01 | 1012.51 | 12.44 | 100.33 | 96.82 | ... |
| ... | ... | ... | | | | |
+-----------+------------------------+----------+----------+----------+----------+-----+
with a look-up from Channel column IDs to channel names – but this requires an EXEC or eval to execute a pre-constructed query to obtain the channel we want – because SQL isn’t designed to have column names as variables. On the plus side, we can re-use columns when channels change, but there will still be many empty cells because the table has to be as wide as the largest number of channels we ever encounter. Using a SPARSE table may help here, but I am uncomfortable with the EXEC/eval issue above.
What is the right solution to this problem, that achieves efficiency of storage, inserts and queries?
I would go with Option 1.
Data integrity is first, optimization (if needed) - second.
Other options would eventually have a lot of NULL values and other problems stemming from not being normalized. Managing the data and making efficient queries would be difficult.
Besides, there is a limit on the number of columns that a table can have - 1024, so if you have 1000 sensors/channels you are already dangerously close to the limit. Even if you make your table a wide table, which allows 30,000 columns, still there is a limitation on the size of the row in a table - 8,060 bytes per row. And there are certain performance considerations.
I would not use wide tables in this case, even if I was sure that the data for each row would never exceed 8060 bytes and growing number of channels would never exceed 30,000.
I don't see a problem with inserting 100 - 1000 rows in Option 1 vs 1 row in the other options. To do such an INSERT efficiently, don't issue 1000 individual INSERT statements; do it in bulk. In various places in my system I use the following two approaches:
1) Build one long INSERT statement
INSERT INTO ChannelData (TimestampID, ChannelID, DataValue) VALUES
(12345, 50, 1015.23),
(12345, 51, 12.23),
...
(), (), (), (), ........... ();
that contains 1000 rows and execute it as normal INSERT in one transaction, rather than 1000 transactions (check the syntax details).
2) Have a stored procedure that accepts a table-valued parameter. Call the procedure, passing the 1000 rows as a table.
CREATE TYPE [dbo].[ChannelDataTableType] AS TABLE(
    [TimestampID] [int] NOT NULL,
    [ChannelID] [int] NOT NULL,
    [DataValue] [float] NOT NULL
)
GO
CREATE PROCEDURE [dbo].[InsertChannelData]
    -- Add the parameters for the stored procedure here
    @ParamRows dbo.ChannelDataTableType READONLY
AS
BEGIN
    -- SET NOCOUNT ON added to prevent extra result sets from
    -- interfering with SELECT statements.
    SET NOCOUNT ON;
    BEGIN TRANSACTION;
    BEGIN TRY
        INSERT INTO [dbo].[ChannelData]
            ([TimestampID],
            [ChannelID],
            [DataValue])
        SELECT
            TT.[TimestampID]
            ,TT.[ChannelID]
            ,TT.[DataValue]
        FROM
            @ParamRows AS TT
        ;
        COMMIT TRANSACTION;
    END TRY
    BEGIN CATCH
        ROLLBACK TRANSACTION;
    END CATCH;
END
GO
If possible, accumulate data from several timestamps before inserting to make the batches larger. You should try with your system and find the optimal size of the batch. I have batches around 10K rows using the stored procedure.
If your data comes from sensors 100 times a second, then I would at first dump the incoming raw data into some very simple CSV file(s) and have a parallel background process insert it into the database in chunks. In other words, have some buffer for the incoming data, so that if the server can't cope with the incoming volume, you do not lose your data.
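The buffering-and-chunking idea can be sketched as follows. This is a minimal illustration, not the production setup described above: it uses Python with an in-memory SQLite database standing in for SQL Server, and the helper names batched and insert_in_batches are hypothetical. The batch size of 10,000 rows matches the figure mentioned earlier and should be tuned on the real system.

```python
import sqlite3
from itertools import islice


def batched(rows, batch_size):
    """Yield successive lists of at most batch_size rows."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def insert_in_batches(conn, rows, batch_size=10_000):
    """Insert rows one batch per transaction; return the number inserted."""
    inserted = 0
    for batch in batched(rows, batch_size):
        with conn:  # commits on success, rolls back if the batch fails
            conn.executemany(
                "INSERT INTO ChannelData (TimestampID, ChannelID, DataValue) "
                "VALUES (?, ?, ?)",
                batch,
            )
        inserted += len(batch)
    return inserted


# Assumption: SQLite stands in for SQL Server so the sketch is runnable.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ChannelData ("
    "TimestampID INTEGER NOT NULL, ChannelID INTEGER NOT NULL, DataValue REAL NOT NULL)"
)

# Simulated buffer: 25 timestamps with 1000 channels each.
buffered_rows = [(ts, ch, float(ch)) for ts in range(25) for ch in range(1000)]
total = insert_in_batches(conn, buffered_rows)
print(total, "rows inserted")
```

Because each batch is its own transaction, a failure costs at most one batch, and the rows still waiting in the buffer (the CSV files in the description above) can be retried later.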
Based on your comments, when you said that some channels are likely to be more interesting and queried several times, while others are less interesting, here is one optimization that I would consider. In addition to having one table ChannelData for all channels have another table InterestingChannelData. ChannelData would have the whole set of data, just in case. InterestingChannelData would have a subset only for the most interesting channels. It should be much smaller and it should take less time to query it. In any case, this is an optimization (denormalization/data duplication) built on top of properly normalized structure.
Is your process like this:
Generate data during the day
Analyse data afterwards
If these are separate activities, then you might want to consider using different 'insert' and 'select' schemas. You could create a schema that's fast for inserting on the boat, then afterwards batch upload this data into an analysis-optimised schema. This requires a transformation step (where, for example, you map generic column names into useful column names).
This is along the lines of data warehousing and data marts. In this kind of design, you batch load and optimise the schema for reporting. Does your current daily upload have much of a window?