How can I perform aggregate function logic across multiple fields? - sql

How can I get max() of three dimensions to come from the same record?
Description:
I have a large list of widgets, with multiple attributes, from multiple sources. Think manual data entry, where you have the same stuff being entered by different people, and then you need to consolidate the differences. Instead of auditing each difference, though, I just want to apply some logic to choose one value over another under certain criteria.
An analogous example: if source a says widget xyz weighs 3 pounds, and source b says it weighs 4 pounds, I just blindly take the 4, as it is greater; say I need to be overcautious for packing/shipping purposes. That is easy: I choose MAX().
Now, I have a group of attributes that are in separate fields but related. Think dimensions of a box. There are width/length/height fields. If one source says the 'dimensions' are 2x3x4, and another says they are 3x3x4, I need to take the larger, for the same reason as above. Also sounds like MAX(), except...
My sources disagree on which is the width, height, or length. A 2x3x4 box could be entered 4x3x2, or 2x4x3, depending on how the source was looking at it. If I took the MAX of 3 such sources, I would end up with 4x4x4, even though all 3 sources measured it correctly. This is undesirable.
How do I take the greatest 'measurement' value, but make sure all three values come from the same record?
If 'greatest' is impossible, we could settle for unique... except there is a fourth source, which has 0x0x0 for about 40% of the widgets. I can't leave a 0x0x0 if any of the other sources did in fact measure that widget.
Some sample data:
ID,widget_name,height,width,leng
(a1,widget3,2,3,4)
(b1,widget3,2,4,3)
(c1,widget3,4,3,2)
(d1,widget3,0,0,0)
output should be (widget3,4,3,2)

You could use ROW_NUMBER() instead of GROUP BY, like:
SELECT *
FROM (
    SELECT ID, widget_name, height, width, leng,
           -- rank each widget's source records, largest overall measurements first
           ROW_NUMBER() OVER (PARTITION BY widget_name
                              ORDER BY height + width + leng DESC) AS rowid
    FROM yourTable
) AS t
WHERE rowid = 1

Related

BigQuery SQL computing spatial statistics around multiple locations

Problem: I want to make a query that allows me to compute sociodemographic statistics within a given distance of multiple locations. I can only make a query that computes the statistics at a single location at a time.
My desired result would be a table where I can see the name of the library (title), P_60YMAS (some sociodemographic data near these libraries), and a geometry of the multipolygons within the buffer distance of this location as a GEOGRAPHY data type.
Context:
I have two tables:
'cis-sdhis.de.biblioteca', or library, which has points as a GEOGRAPHY data type;
'cis-sdhis.inegi.resageburb', in which I have many sociodemographic columns, including polygons as a GEOGRAPHY data type (column name: 'GEOMETRY').
I want to make a 1 km buffer around the libraries, build new multipolygons within these buffers, and get some sociodemographic data and geometry from those multipolygons.
My first approach was with this query:
SELECT
  SUM(P_60YMAS) AS age60_plus,
  ST_UNION_AGG(GEOMETRY) AS geo
FROM
  `cis-sdhis.inegi.resageburb`
WHERE
  ST_WITHIN(GEOMETRY,
    (SELECT ST_BUFFER(geography, 1000)
     FROM `cis-sdhis.de.biblioteca`
     WHERE id = 'bpm-mty3'))
As you can see, this query only gives me one library ('bpm-mty3'), and that's my problem: I want them all at once.
I thought that using OVER() would be one solution, but I don't really know where or how to use it.
OVER() could be a solution, but a more performant solution is a JOIN.
You first need to join the two tables on the condition that the distance between them is less than 1000 meters. That gives you pairs of rows, where each library is paired with all the relevant data rows; note we'll get multiple rows per library. The predicate to use is ST_DWITHIN(geo1, geo2, distance); an alternative form is ST_DISTANCE(geo1, geo2) < distance. In either case you don't need a buffer.
SELECT
  *
FROM
  `cis-sdhis.inegi.resageburb` data,
  `cis-sdhis.de.biblioteca` lib
WHERE
  -- pair each library with every data row within 1000 meters
  ST_DWITHIN(lib.geography, data.GEOMETRY, 1000)
Then we need to compute stats per library. Remember we have many rows per library, but we need a single row for each library, so we aggregate per library with a GROUP BY on its id. When we need information about the library itself, the cheapest way to get it is the ANY_VALUE aggregation function. So it will be something like
SELECT
  lib.id, ANY_VALUE(lib.title) AS title,
  SUM(P_60YMAS) AS age60_plus,
  -- if you need to show the circle around the library
  ST_BUFFER(ANY_VALUE(lib.geography), 1000) AS buffered_location
FROM
  `cis-sdhis.inegi.resageburb` data,
  `cis-sdhis.de.biblioteca` lib
WHERE
  ST_DWITHIN(lib.geography, data.GEOMETRY, 1000)
GROUP BY lib.id
One thing to note here is that ST_BUFFER(ANY_VALUE(...)) is much cheaper than ANY_VALUE(ST_BUFFER(...)): it only computes the buffer once per output row.

Improve performance of deducting values of same table in SQL

For a metering project I use a simple SQL table in the following format:
ID
Timestamp: dat_Time
Metervalue: int_Counts
Meterpoint: fk_MetPoint
While this works nicely in general, I have not found an efficient solution for one specific problem: there is one Meterpoint which is a submeter of another Meterpoint. I'm interested in the delta of those two Meterpoints, to get the remaining consumption. As the registration of counts is done by one device, I get datapoints for the various Meterpoints at the same Timestamp.
I found a solution using a subquery, but it appears to be not very efficient:
SELECT
    A.dat_Time,
    A.int_Counts - (SELECT B.int_Counts
                    FROM tbl_Metering AS B
                    WHERE B.fk_MetPoint = 2
                      AND B.dat_Time = A.dat_Time) AS Delta
FROM tbl_Metering AS A
WHERE fk_MetPoint = 1
How could I improve this query?
Thanks in advance
You can try using a window function instead:
SELECT m.dat_Time,
       (m.int_Counts - m.int_Counts_2) AS delta
FROM (SELECT m.*,
             -- counts of Meterpoint 2 recorded at the same timestamp
             MAX(CASE WHEN fk_MetPoint = 2 THEN int_Counts END)
                 OVER (PARTITION BY dat_Time) AS int_Counts_2
      FROM tbl_Metering m
     ) m
WHERE fk_MetPoint = 1
From a query point of view, you should at a minimum change to a set-based approach instead of an inline sub-query for each row, using a GROUP BY; but it is a good candidate for a windowing query, just as suggested by the "Great" Gordon Linoff.
However, if this is a metering project, then we should expect a high volume of records, if not now then certainly over time.
I would recommend you look into altering the input process so that the delta is stored as its own first-class column. This moves much of the performance hit to the write process, which presumably occurs only once for each record, whereas your SELECT will be executed many times.
This can be done using an INSTEAD OF trigger (a rough sketch follows at the end of this answer), or you could write it into the business logic. In a recent IoT project we computed and stored these additional properties with each inserted reading, which greatly simplified many types of aggregate and analysis queries:
Id of the Previous sequential reading
Timestamp of the Previous sequential reading
Value Delta
Time Delta
Number of readings between this and the previous reading
The last one sounds close to your scenario; we were deliberately batching multiple sequential readings into a single record.
You could also process the received data into a separate table that includes this level of aggregation information, so as not to pollute the raw feed and to allow you to re-process it on demand.
You could redirect your analysis queries to this second table, which is now effectively a data warehouse of sorts.
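For illustration, here is a minimal sketch of such an INSTEAD OF trigger, assuming SQL Server syntax; the extra columns fk_PrevReading, int_Delta and int_TimeDelta are hypothetical additions to tbl_Metering, so treat this as a starting point rather than a drop-in implementation:
CREATE TRIGGER trg_Metering_Insert
ON tbl_Metering
INSTEAD OF INSERT
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO tbl_Metering (dat_Time, int_Counts, fk_MetPoint,
                              fk_PrevReading, int_Delta, int_TimeDelta)
    SELECT i.dat_Time,
           i.int_Counts,
           i.fk_MetPoint,
           p.ID,                                    -- id of the previous sequential reading
           i.int_Counts - p.int_Counts,             -- value delta
           DATEDIFF(SECOND, p.dat_Time, i.dat_Time) -- time delta in seconds
    FROM inserted AS i
    OUTER APPLY (SELECT TOP (1) m.ID, m.int_Counts, m.dat_Time
                 FROM tbl_Metering AS m
                 WHERE m.fk_MetPoint = i.fk_MetPoint
                   AND m.dat_Time < i.dat_Time
                 ORDER BY m.dat_Time DESC) AS p;
END;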

Infinite scroll algorithm for random items with different weight (probability to show to the user)

I have a web/mobile application that should display an infinite scroll view (the continuation of the list of items is loaded periodically in a dynamic way) of items, where each item has a weight. The bigger an item's weight compared to the weights of the other items, the higher the chance/probability that the item is loaded and displayed in the list. The items should be loaded randomly; just the chances for the items to be in the list should differ.
I am searching for an efficient algorithm / solution or at least hints that would help me achieve that.
Some points worth mentioning:
the weight has these bounds: 0 <= w < infinity.
the weight is not a static value; it can change over time based on some item properties.
every item with a weight higher than 0 should have a chance to be displayed to the user, even if its weight is significantly lower than the weight of other items.
when the user scrolls and performs multiple requests to the API, they should not see duplicate items, or at least the chance of that should be low.
I use an SQL database (PostgreSQL) for storing items, so the solution should be efficient for this type of database (it doesn't have to be a purely SQL solution).
Hope I didn't miss anything important. Let me know if I did.
The following are some ideas to implement the solution:
The database table should have a column where each entry is a number generated as follows:
log(R) / W,
where:
W is the record's weight, greater than 0 (stored as its own column), and
R is a per-record uniform random number in (0, 1)
(see also Arratia, R., "On the amount of dependence in the prime factorization of a uniform random integer", 2002). Then take the records with the highest values of that column as the need arises.
However, note that SQL has no standard way to generate random numbers; DBMSs that implement SQL have their own ways to do so (such as RANDOM() for PostgreSQL), but how they work depends on the DBMS (for example, compare MySQL's RAND() with T-SQL's NEWID()).
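A minimal PostgreSQL sketch of this idea, assuming a hypothetical items table with id and weight columns, and a sampling_key column added for the purpose:
-- store one log(R)/W key per record; 1.0 - random() keeps the argument in (0, 1]
ALTER TABLE items ADD COLUMN sampling_key double precision;

UPDATE items
SET sampling_key = ln(1.0 - random()) / weight
WHERE weight > 0;

-- take the records with the highest keys as the feed is scrolled
SELECT id
FROM items
WHERE weight > 0
ORDER BY sampling_key DESC
LIMIT 20;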
Peter O had a good idea, but it had some issues. I would expand it a bit in favor of being able to shuffle a little better and be user-specific, at a higher database space cost:
Use a single column, but store multiple fields in it. I recommend the Postgres JSONB type (which stores the data as JSON that can be indexed and queried). Use several fields, each holding its own log(R) / W value; I would say roughly log(U) + log(P) of them, where U is the number of users and P is the number of items, with a minimum of probably 5 fields. Add an index over all the fields within the JSONB. Add more fields as the number of users/items gets high enough.
Have a background process that regularly rotates the numbers in #1. This can cause duplication, but if you are only rotating a small subset of the items at a time (such as O(sqrt(P)) of them), the odds of the user noticing are low, especially if you query for data backwards and forwards and stitch/dedup it together before displaying the next row(s). Careful use of manual pagination adjustments helps a lot here if it becomes an issue.
Before displaying items, randomly pick one of the index fields and sort the data on that. This means you have roughly a 1 in log(P) + log(U) chance of showing the user the same ordering. Ideally each user would pick a random subset of those index fields (to avoid seeing the same order twice) and use that as the order, but I can't think of a way to make that work and still be practical. Though a random shuffle of the index fields and sorting by that might be practical if the randomized weights are normalized, such that the sort order matters.
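A rough PostgreSQL sketch of this multi-key variant, again with hypothetical names (items, weight, sort_keys) and five keys for brevity:
ALTER TABLE items ADD COLUMN sort_keys jsonb;

-- populate (and later periodically rotate) several independent log(R)/W keys per item
UPDATE items
SET sort_keys = jsonb_build_object(
        'k1', ln(1.0 - random()) / weight,
        'k2', ln(1.0 - random()) / weight,
        'k3', ln(1.0 - random()) / weight,
        'k4', ln(1.0 - random()) / weight,
        'k5', ln(1.0 - random()) / weight)
WHERE weight > 0;

-- at query time, pick one key (here k3) and page through it
SELECT id
FROM items
WHERE weight > 0
ORDER BY (sort_keys ->> 'k3')::float8 DESC
LIMIT 20;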

How does NTILE() handle imbalanced data?

Long story short, I was grouping some data into various segments, and noticed migrations in one column that was split into deciles using NTILE(10) OVER (ORDER BY column_name DESC).
Roughly 50% of the values in this column are 0, which means that the first 5 deciles would all have the same value.
How does the NTILE() function handle cases like this?
I would naively assume that it sorts by value and just chunks it up into 10 even pieces, which means that it more or less randomly assigns the 0's to a decile, but I haven't been able to find documentation that explains this particular case.
Bonus question -- Does the behavior change if the values are NULL instead of 0?
NTILE() is defined to make the tiles as equal in size as possible. The sizes may differ by 1 row, but not by more than one.
As a result, rows with the same value of the order by keys can be in different tiles.
The documentation attempts to describe this:
Divides the rows for each window partition into n buckets ranging from 1 to at most n. Bucket values will differ by at most 1.
The second sentence really means that the bucket sizes differ by at most 1.
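A small illustration (PostgreSQL syntax): eight rows, five of which are 0, split into quartiles; the tied zeros end up in different buckets:
SELECT v,
       NTILE(4) OVER (ORDER BY v DESC) AS bucket
FROM (VALUES (30), (20), (10), (0), (0), (0), (0), (0)) AS t(v);

-- bucket 1: 30, 20   bucket 2: 10, 0   bucket 3: 0, 0   bucket 4: 0, 0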

Find out the amount of space each field takes in Google Big Query

I want to optimize the space of my BigQuery and Google Storage tables. Is there a way to easily find out the cumulative space that each field in a table takes? This is not straightforward in my case, since I have a complicated hierarchy with many repeated records.
You can do this in the Web UI by simply typing (and not running) the query below, changing <column_name> to the field of your interest:
SELECT <column_name>
FROM YourTable
and looking at the validation message, which shows the respective size.
Important: you do not need to run it; just check the validation message for bytesProcessed, and this will be the size of the respective column.
Validation is free and invokes a so-called dry run.
If you need to do such "column profiling" for many tables, or for a table with many columns, you can code this in your preferred language: use the Tables.get API to get the table schema, then loop through all the fields, build the respective SELECT statement for each, dry-run it (within the loop, for each column), and read totalBytesProcessed, which, as you already know, is the size of the respective column.
I don't think this is exposed in any of the metadata.
However, you may be able to easily get good approximations based on your needs. The number of rows is provided, so for some of the data types, you can directly calculate the size:
https://cloud.google.com/bigquery/pricing
For types such as STRING, you could get the average length by querying e.g. the first 1,000 rows, and use this for your storage calculations.
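For example, a rough BigQuery sketch of that sampling approach (table and column names are placeholders):
SELECT AVG(LENGTH(your_string_column)) AS avg_chars  -- use BYTE_LENGTH() instead for bytes rather than characters
FROM (
  SELECT your_string_column
  FROM YourTable
  LIMIT 1000
) AS sampled_rows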