Internal db logic/operation to group/compress result - cratedb

I have a CrateDB table storing various information for zipcodes. It contains around 30k zipcodes, and I need my query to return certain profiling information for all zipcodes at once. I understand that typically it wouldn't be feasible, but since I only need ballpark information and many zipcodes are consecutive, I think an optimization is possible.
For example, if I wanted to profile population, a grouped result such as this would work for me:
group 1 (0-1000): 00000-02000,02004-02010,02012
group 2 (1001-3000): ...
...
The populations and groups above are fake, but the idea should hold: group the profiled category into buckets, assign each zipcode to the correct bucket, and further reduce the size by using a range representation. I could settle for a predefined number of groups or have the buckets defined by the request/query itself. This would hopefully reduce the response from something too large for a single query to something manageable.
Is it possible to write a CrateDB function that does something like this, so I can avoid the bandwidth issues of shipping all the rows to a different service/container/VM just to do the grouping?

You could probably create groups on the fly, or as columns if you wish, with a regex; I have done this on a 23M row table and grouped by that.
In my example the regex grouping and AVG took around 30s, but that is very dependent on my hardware.
Something like this would probably work as a general pointer:
SELECT avg(--yourColumn--), regexp_matches(postcode, '--your regex--', 'i')[1]
FROM "doc"."--yourTable--"
GROUP BY regexp_matches(postcode, '--your regex--', 'i')[1]
ORDER BY regexp_matches(postcode, '--your regex--', 'i')[1]
You could also use an OVER window function, but CrateDB doesn't yet have the full SQL support for partitioning etc.
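If you would rather bucket on the profiled value itself (population, in your example) and have CrateDB return the members of each bucket directly, a rough sketch along these lines might work. The table and column names (doc.zipcodes, zipcode, population) and the bucket width of 2000 are assumptions, and array_agg needs a reasonably recent CrateDB version:
-- Sketch only: bucket by population in steps of 2000 and collect the zipcodes per bucket
SELECT floor(population / 2000) AS bucket,
       count(*) AS zip_count,
       array_agg(zipcode) AS zipcodes
FROM doc.zipcodes
GROUP BY floor(population / 2000)
ORDER BY floor(population / 2000)
Collapsing each bucket's zipcode array into consecutive ranges (00000-02000,02004-02010,...) would still need a small post-processing step, either client-side or in a user-defined function, but at that point you are only shipping one row per bucket instead of one per zipcode.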

Related

BigQuery SQL computing spatial statistics around multiple locations

Problem: I want to make a query that allows me to compute sociodemographic statistics within a given distance of multiple locations. I can only make a query that computes the statistics at a single location at a time.
My desired result would be a table where I can see the name of the library (title), P_60YMAS (some sociodemographic data near these libraries), and a geometry of the multipolygons within the buffer distance of this location as a GEOGRAPHY data type.
Context:
I have two tables:
'cis-sdhis.de.biblioteca', or library, which has points as a GEOGRAPHY data type;
'cis-sdhis.inegi.resageburb', which holds a lot of sociodemographic data, including polygons as a GEOGRAPHY data type (column name: 'GEOMETRY').
I want to make a 1 km buffer around the libraries, build new multipolygons within those buffers, and get the sociodemographic data and geometry from those multipolygons.
My first approach was with this query:
SELECT
  SUM(P_60YMAS) AS age60_plus,
  ST_UNION_AGG(GEOMETRY) AS geo
FROM
  `cis-sdhis.inegi.resageburb`
WHERE
  ST_WITHIN(GEOMETRY,
            (SELECT ST_BUFFER(geography, 1000)
             FROM `cis-sdhis.de.biblioteca`
             WHERE id = 'bpm-mty3'))
As you can see, this query only gives me one library ('bpm-mty3'), and that's my problem: I want them all at once.
I thought that using OVER() would be one solution, but I don't really know where or how to use it.
OVER() could be a solution, but a more performant solution is JOIN.
You first need to join the two tables on the condition that the distance between them is less than 1000 meters. That gives you pairs of rows, where libraries are paired with all relevant data; note we'll get multiple rows per library. The predicate to use is ST_DWITHIN(geo1, geo2, distance); an alternative form is ST_DISTANCE(geo1, geo2) < distance. In either case you don't need a buffer.
SELECT
  *
FROM
  `cis-sdhis.inegi.resageburb` data,
  `cis-sdhis.de.biblioteca` lib
WHERE
  ST_DWITHIN(lib.geography, data.GEOMETRY, 1000)
Then we need to compute stats per library. Remember we have many rows per library and need a single row for each, so we aggregate per library with GROUP BY on the id. When we need info about the library itself, the cheapest way to get it is the ANY_VALUE aggregation function. So it will be something like:
SELECT
  lib.id, ANY_VALUE(lib.title) AS title,
  SUM(P_60YMAS) AS age60_plus,
  -- if you need to show the circle around the library
  ST_BUFFER(ANY_VALUE(lib.geography), 1000) AS buffered_location
FROM
  `cis-sdhis.inegi.resageburb` data,
  `cis-sdhis.de.biblioteca` lib
WHERE
  ST_DWITHIN(lib.geography, data.GEOMETRY, 1000)
GROUP BY lib.id
One thing to note here is that ST_BUFFER(ANY_VALUE(...)) is much cheaper than ANY_VALUE(ST_BUFFER(...)) - it only computes buffer once per output row.

Improve performance of deducting values of same table in SQL

For a metering project I use a simple SQL table with the following columns:
ID
Timestamp: dat_Time
Metervalue: int_Counts
Meterpoint: fk_MetPoint
While this works nicely in general, I have not found an efficient solution for one specific problem: one Meterpoint is a submeter of another Meterpoint, and I'm interested in the delta of those two Meterpoints to get the remaining consumption. As the counts are registered by a single device, I get datapoints for the various Meterpoints at the same Timestamp.
I found a solution using a subquery, but it does not appear to be very efficient:
SELECT
    A.dat_Time,
    (A.int_Counts - (SELECT B.int_Counts
                     FROM tbl_Metering AS B
                     WHERE B.fk_MetPoint = 2
                       AND B.dat_Time = A.dat_Time)) AS Delta
FROM tbl_Metering AS A
WHERE fk_MetPoint = 1
How could I improve this query?
Thanks in advance
You can try using a window function instead:
SELECT m.dat_Time,
       (m.int_Counts - m.int_Counts_2) AS delta
FROM (SELECT m.*,
             MAX(CASE WHEN fk_MetPoint = 2 THEN int_Counts END) OVER (PARTITION BY dat_Time) AS int_Counts_2
      FROM tbl_Metering m
     ) m
WHERE fk_MetPoint = 1
From a query point of view, you should at a minimum change to a set-based approach instead of an inline sub-query for each row, using a GROUP BY; but it is a good candidate for a windowing query, just as suggested by the "Great" Gordon Linoff.
However if this is a metering project, then we are going to expect a high volume of records, if not now, certainly over time.
I would recommend you look into altering the input so that the delta is stored as its own first-class column. This moves much of the performance hit to the write process, which presumably only happens once for each record, whereas your SELECT will be executed many times.
This can be done with an INSTEAD OF trigger, or you could write it into the business logic (a rough sketch of the trigger variant follows at the end of this answer). In a recent IoT project we computed and stored these additional properties with each inserted reading to greatly simplify many types of aggregate and analysis queries:
Id of the Previous sequential reading
Timestamp of the Previous sequential reading
Value Delta
Time Delta
Number of readings between this and the previous reading
The last one sounds close to your scenario; we were deliberately batching multiple sequential readings into a single record.
You could also process the received data into a separate table that includes this level of aggregation information, so as not to pollute the raw feed and to allow you to re-process it on demand.
You could redirect your analysis queries to this second table, which is now effectively a data warehouse of sorts.
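A minimal sketch of the trigger variant, assuming SQL Server syntax (other engines offer INSTEAD OF or BEFORE triggers with different syntax) and two hypothetical extra columns int_Delta and prev_dat_Time on tbl_Metering:
-- Sketch only: fill the delta columns at write time so reads never have to self-join.
-- Assumes ID is auto-generated, the extra columns already exist, and (simplification)
-- at most one reading per Meterpoint arrives per INSERT statement.
CREATE TRIGGER trg_Metering_Insert
ON tbl_Metering
INSTEAD OF INSERT
AS
BEGIN
    INSERT INTO tbl_Metering (dat_Time, int_Counts, fk_MetPoint, int_Delta, prev_dat_Time)
    SELECT i.dat_Time,
           i.int_Counts,
           i.fk_MetPoint,
           i.int_Counts - p.int_Counts,  -- value delta vs. previous reading
           p.dat_Time                    -- timestamp of previous reading
    FROM inserted AS i
    OUTER APPLY (SELECT TOP (1) m.int_Counts, m.dat_Time
                 FROM tbl_Metering AS m
                 WHERE m.fk_MetPoint = i.fk_MetPoint
                   AND m.dat_Time < i.dat_Time
                 ORDER BY m.dat_Time DESC) AS p;
END;
This stores the delta against the previous reading of the same Meterpoint, matching the list above; the same pattern can instead look up the parent Meterpoint's reading at the same dat_Time if that is the delta you want to persist.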

Find out the amount of space each field takes in Google Big Query

I want to optimize the space of my BigQuery and Google Storage tables. Is there an easy way to find out the cumulative space that each field in a table takes? This is not straightforward in my case, since I have a complicated hierarchy with many repeated records.
You can do this in the Web UI by simply typing (but not running) the below query, changing <column_name> to the field of your interest:
SELECT <column_name>
FROM YourTable
and looking at the validation message, which shows the respective size.
Important: you do not need to run it – just check the validation message for bytesProcessed, and this will be the size of the respective column.
Validation is free and invokes a so-called dry run.
If you need to do such "column profiling" for many tables, or for a table with many columns, you can code it in your preferred language: use the Tables.get API to get the table schema, then loop through all fields, build the respective SELECT statement for each, dry-run it (within the loop, for each column), and read totalBytesProcessed, which, as noted above, is the size of the respective column.
I don't think this is exposed in any of the meta data.
However, you may be able to easily get good approximations based on your needs. The number of rows is provided, so for some of the data types, you can directly calculate the size:
https://cloud.google.com/bigquery/pricing
For types such as string, you could get the average length by querying e.g. the first 1000 rows, and use this for your storage calculations.
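As a rough sketch of that approximation for a single STRING column (the dataset, table, and column names are placeholders; it relies on the pricing page's rule that a STRING value is billed as 2 bytes plus its UTF-8 encoded length):
-- Sketch: average per-value size of a STRING column, estimated from a 1000-row sample
SELECT 2 + AVG(BYTE_LENGTH(my_string_col)) AS avg_bytes_per_value
FROM (
  SELECT my_string_col
  FROM `my_project.my_dataset.my_table`
  LIMIT 1000
)
Multiply the average by the table's row count to approximate the column's total size; keep in mind that even with LIMIT this query still scans the whole column, so the dry-run trick above remains the cheaper way to get the exact figure.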

SDK2 query for counting: which is more efficient?

I have an app that is displaying metrics about defects in a project.
I have the option of making one query that returns all the defects, and from that I can break out about four different metrics (How many defects escaped QA in 90 days, 180 days, and then the same metrics again but only counting sev1/sev2 defects).
I could make four queries and limit the results to one so that I just get a count for each. Or I could make one query that encompasses them all (all defects that escaped QA in 180 days) and then count up the differences myself.
I'm figuring worst case, the number of defects that escaped QA in the last six months will generally be less than 100, certainly less than 500 in the worst case.
Which would you do: four queries with one result each, or one single query that on average might return 50, and perhaps 500 in the worst case?
And I guess the key question is: where are the inflection points? Perhaps I have more metrics tomorrow (who knows, 8?) and different average defect counts. Is there a rule of thumb I could use to help choose an approach?
Well, I would probably make the series of four queries and use the result count. If you are expecting 500 defects, fetching them all would end up being three requests of up to 200 defects each anyway.
The solution where you do each individual query and use the total result count would be safe even with a very large number of defects. Plus, I usually find it a bad plan to assume I know the data sets an App will be dealing with. Most of my Apps end up living much longer and being used on larger datasets than I intended.
The max page size is 200, so it sounds like you'd be requesting between 1 and 3 pages to get all the data vs. 4 queries with a page size of 1 and using the TotalResultCount...
You'd definitely have less aggregation code to write if you use the multi query approach (letting the server do the counting for you based on your supplied filters).
I'd guess the 4 independent queries might be faster but it would be interesting to hear back your experimental results...

selecting "similar" groups - where to start with probabilities?

Let's say I have a table with 10,000 lines (representing 10,000 persons) and the following columns:
id qualification gender age income
When I select all persons having a certain qualification (say "plumber") I get 100 lines, having a certain gender, age and income distribution.
What I now want to do is select some kind of test group to check if the income is influenced by qualification or by the distribution of the other attributes.
That means (and now I come to my question) I want to get another set of 100 lines with the same gender and age distribution (but a different qualification value). These 100 lines should of course be chosen at random.
My primary problem is that I don't know how to write an SQL command that would take care of the distributions (which of course could and maybe should be seen as probabilities in this context) when I select random lines.
Thank you in advance!
You seem to be trying to solve something that is tightly related to this extremely thorny problem.
The wiki page depicts a number of approaches for detecting correlations in a database, complete with references to prior pg-hacker discussions (here's another), a variety of (rejected) patch proposals, and scientific papers that discusses the topic.
If it sounds too thorny, I'd second Catcall's pl/r suggestion. Or another applicable pl, for that matter.
As an aside, you might find pg-kmeans useful too:
http://pgxn.org/dist/kmeans/doc/kmeans.html
As well as PostStat (never tried it myself):
http://poststat.projects.postgresql.org/
Might be better on stats.stackexchange.com.
Selecting random rows is easy; matching the distribution is hard.
You could write a stored procedure that
repeatedly selects 100 rows at random,
calculates the statistics,
and returns when it finds 100 rows that fit.
But that seems a lot like kicking dead whales down the beach. And, depending on your data, it might never return.
Before you spend much time trying to do this in SQL, consider spending a little time to see how hard (or how easy) this is to do with statistical software, like R.
Later
Just discovered that there's a package called pl/R.
PL/R is a loadable procedural language that enables you to write
PostgreSQL functions and triggers in the R programming language. PL/R
offers most (if not all) of the capabilities a function writer has in
the R language.
Google postgresql +statistics +r +pl for additional links to papers and tutorials.
SELECT * from Table1 order by random() limit 100;
random() is valid for PostgreSQL. For MySQL you can use RAND() instead of random().
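If you want to stay in SQL and match the gender and age distribution exactly, rather than hoping a random sample happens to fit, stratified sampling is one option. A rough PostgreSQL sketch, assuming the table is called persons and that each (gender, age) combination has enough non-plumbers to draw from:
-- Sketch only: for each (gender, age) combination, count the plumbers in it,
-- then draw the same number of random non-plumbers from that combination.
WITH target AS (
    SELECT gender, age, count(*) AS n
    FROM persons
    WHERE qualification = 'plumber'
    GROUP BY gender, age
),
candidates AS (
    SELECT p.*,
           row_number() OVER (PARTITION BY gender, age ORDER BY random()) AS rn
    FROM persons p
    WHERE p.qualification <> 'plumber'
)
SELECT c.id, c.qualification, c.gender, c.age, c.income
FROM candidates c
JOIN target t ON t.gender = c.gender AND t.age = c.age
WHERE c.rn <= t.n;
Strata with too few non-plumbers will come back short, so the total can end up below 100; binning age into ranges instead of matching exact years makes that less likely.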