Is there a class to signify "grouped by and reduced"? - scalding

Consider the following code in Scalding.
Let's say I have the following tuples in a Scalding TypedPipe[(Int, Int)]:
(1, 2)
(1, 3)
(2, 1)
(2, 2)
On this pipe I can call groupBy(t => t._1) to generate a Grouped[Int, (Int, Int)], which still represents the same data, but grouped by the first element of the tuple.
Now, let's say I sum the resulting object, so the total flow looks like this:
def sumGroup(a: TypedPipe[(Int, Int)]): Grouped[Int, (Int, Int)] = {
  a.groupBy(t => t._1).sum
}
Running this on the initial example would produce the following tuples:
(1, (2, 5))
(2, (4, 3))
And now we know for sure that there is only one item per key (for the key "1", we only have one resulting tuple), because this is the behavior of sum. However, the type returned by sum is still Grouped[Int, (Int, Int)], which doesn't convey the fact that there can only be one item per key.
Is there a specific type like Grouped[K, V] that would convey the meaning that there is only one "V" value for a given "K" value? If not, why is that?
It seems it could be useful for optimizing joins when we can be sure that both sides have exactly one value per key.
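To make the idea concrete, something like the following wrapper is what I have in mind. This is purely a hypothetical sketch, not part of Scalding's API; the invariant would hold only by convention, since nothing prevents constructing it from an arbitrary pipe:
import com.twitter.scalding.TypedPipe
// Hypothetical wrapper expressing "at most one value per key".
// Scalding does not provide this; it only documents the invariant
// established by aggregations such as sum.
case class ReducedGrouped[K, V](toTypedPipe: TypedPipe[(K, V)])
def sumGroup(a: TypedPipe[(Int, Int)]): ReducedGrouped[Int, (Int, Int)] =
  ReducedGrouped(a.groupBy(t => t._1).sum.toTypedPipe)
A join between two such wrappers could then, in principle, be planned knowing that each side contributes at most one row per key, which is the optimization hinted at above.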

Related

How to select a value with SuiteQL, filtering on a Multiple Select Field

I have a simple table, ItemMapping, with 2 fields: one is a single-select item list field, SingleSelectField, with the value "A";
the other is a multi-select item list field, MultiSelectField, with the values ("B", "C", "D").
I want to get this mapping relationship by "B". I tried setting up a dataset and running some simple SuiteQL like the query below, but I always get empty results returned.
SELECT *
FROM ItemMapping
WHERE ItemMapping.MultiSelectField IN ('B')
Any tips would help me.
Thank you in advance.
As was pointed out, Marty Zigman's article describes how Boban D. located an undocumented "feature" of SuiteQL which can be used.
I will leave most of the explaining to the article, but to summarize: NetSuite automatically creates a relationship table named map_sourceTableId_fieldId which contains two columns, mapone and maptwo. mapone is the record id from the source table and maptwo is the record id of the joined table.
This method seems to work well and is maybe the most straightforward approach if you are accustomed to working in SQL.
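For illustration only, a query against such a relationship table might look roughly like the following. The map table name, the id column, and the internal id 2 are all assumptions for this sketch; the real names come from your record and field script IDs:
SELECT im.*
FROM ItemMapping AS im
JOIN map_itemmapping_multiselectfield AS msf
  ON msf.mapone = im.id
WHERE msf.maptwo = 2
Here mapone is matched against the ItemMapping record id, and maptwo is compared against the internal id of the list value "B" (assumed to be 2 for this sketch).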
As an alternative, I constructed a native SuiteScript Query object with a condition on a multiple select field. Then I used the toSuiteQL() method to convert it into SuiteQL to see how NetSuite natively deals with this. What I found was another undocumented "feature". The resulting query used a BUILTIN.MNFILTER function. So, for example, if you've got a custom transaction body field, custbody_link_types, that is a multiple select field, and you want to get transactions where one of the values in custbody_link_types is 4, then here is the generated SuiteQL:
SELECT T.tranid, T.custbody_link_types
FROM "transaction" as T
WHERE BUILTIN.MNFILTER(T.custbody_link_types , 'MN_INCLUDE', '', 'FALSE', NULL, 4) = 'T'
And if you want transactions where the custbody_link_types does not contain all of the following: 1, 2, 3 ...
SELECT T.tranid, T.custbody_link_types
FROM "transaction" as T
WHERE BUILTIN.MNFILTER(T.custbody_link_types , 'MN_EXCLUDE_ALL', '', 'FALSE', NULL, 1, 2, 3) = 'T'
OR T.custbody_link_types IS NULL
To wrap it up: the undocumented BUILTIN.MNFILTER function is used by NetSuite's query module to filter multiple select fields. It accepts the multiple select column, the internal string value of the query.Operator enum, some other arguments I don't know anything about, and finally one or more of the values to compare. It appears to return 'T' when the condition is met and 'F' otherwise.
Ultimately, I'm not sure whether this is a "better" way to address the need but I thought it was worth documenting.

Transform a column of type string to an array/record i.e. nesting a column

I am trying to calculate and retrieve some indicators from multiple tables I have in my dataset on BigQuery. I want to invoke nesting on sfam, which is a column of strings that could have values or be null, but I can't do that for now. So the goal is to transform that column into an array/record; that's the idea that came to mind, and I have no idea how to go about doing it.
The product and cart are grouped by key_web, dat_log, univ, suniv, fam and sfam.
The data is broken down into universes, referred to as univ, which are composed of sub-universes, referred to as suniv. Sub-universes contain families, referred to as fam, which may or may not have sub-families, referred to as sfam. I want to invoke nesting on prd.sfam to reduce the resulting columns.
The data is collected from Google Analytics for insight into website traffic and user activity.
I am trying to get information and indicators about each visitor: the amount of time he/she spent on particular pages, actions taken, and so on. The resulting table gives me the sum of time spent on those pages, the sum of the total number of visits for a single day, and a breakdown of which category it belongs to, hence the univ, suniv, fam and sfam columns, which are of type string (sfam could be null, since some sub-universes suniv only have families fam and don't go down to a sub-family level sfam).
dat_log: refers to the date
nrb_fp: number of views for a product page
tps_fp: total time spent on said page
I tried different methods that I found online but none worked, so I am posting my code and problem in the hope of finding guidance and a solution!
A simpler query would be:
select
prd.key_web
, dat_log
, prd.nrb_fp
, prd.tps_fp
, prd.univ
, prd.suniv
, prd.fam
, prd.sfam
from product as prd
left join cart as cart
on prd.key_web = cart.key_web
and prd.dat_log = cart.dat_log
and prd.univ = cart.univ
and prd.suniv = cart.suniv
and prd.fam = cart.fam
and prd.sfam = cart.sfam
A sample result of the query for the last 6 columns was shown as text and images in the original post (not reproduced here).
Again, I want to get an array column for sfam that holds all the string values of sfam, even nulls.
I limited the output to only the last 6 columns; the first 3 are the row, key_web and dat_log. Each fam is composed of several sfam or none (null); I want to be able to do nesting on either fam or sfam.
I want to get an array column for sfam that holds all the string values of sfam, even nulls.
This is not possible in BigQuery. As the documentation explains:
Currently, BigQuery has two following limitations with respect to NULLs and ARRAYs:
BigQuery raises an error if query result has ARRAYs which contain NULL elements, although such ARRAYs can be used inside the query.
That is, your result set cannot contain an array with NULL elements.
Obviously, in BigQuery you cannot output an array which holds NULLs, but if for some reason you need to preserve them somehow, the workaround is to create an array of structs as opposed to an array of single elements.
For example (BigQuery Standard SQL), if you try to execute the query below
SELECT ['a', 'b', NULL] arr1, ['x', NULL, NULL] arr2
you will get an error: Array cannot have a null element; error in writing field arr1
Whereas if you try the query below
SELECT ARRAY_AGG(STRUCT(val1, val2)) arr
FROM UNNEST(['a', 'b', NULL]) val1 WITH OFFSET
JOIN UNNEST(['x', NULL, NULL]) val2 WITH OFFSET
USING(OFFSET)
you get this result:
Row   arr.val1   arr.val2
1     a          x
      b          null
      null       null
As you can see, approaching it this way, you can even have both elements as NULL.
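Applied to the original query, that same array-of-structs workaround might look roughly like this. This is only a sketch: the grouping level (one output row per key_web, dat_log, univ, suniv and fam) and the choice of fields to nest are assumptions based on the question's description:
SELECT
  prd.key_web,
  prd.dat_log,
  prd.univ,
  prd.suniv,
  prd.fam,
  ARRAY_AGG(STRUCT(prd.sfam AS sfam, prd.nrb_fp AS nrb_fp, prd.tps_fp AS tps_fp)) AS sfams
FROM product AS prd
LEFT JOIN cart AS cart
  ON prd.key_web = cart.key_web
  AND prd.dat_log = cart.dat_log
  AND prd.univ = cart.univ
  AND prd.suniv = cart.suniv
  AND prd.fam = cart.fam
  AND prd.sfam = cart.sfam
GROUP BY 1, 2, 3, 4, 5
Each output row then carries a repeated sfams field with one struct per original row, and NULL sfam values are preserved inside the structs rather than as bare NULL array elements.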

What's the most efficient way to store sets in a database?

I want to store sets in a such a way that I can query for sets that are a superset of, subset of, or intersect with another set.
For example, if my database has the sets { 1, 2, 3 }, { 2, 3, 5 }, { 5, 10, 12} and I query it for:
Sets which are supersets of { 2, 3 } it should give me { 1, 2, 3 }, { 2, 3, 5 }
Sets which are subsets of { 1, 2, 3, 4 } it should give me { 1, 2, 3 }
Sets which intersect with { 1, 10, 20 } it should give me { 1, 2, 3 }, { 5, 10, 12}
Since some sets are unknown in advance (your comment suggests they come from the client as search criteria), you cannot "precook" the set relationships into the database. Even if you could, that would represent a redundancy and therefore an opportunity for inconsistencies.
Instead, I'd do something like this:
CREATE TABLE "SET" (
ELEMENT INT, -- Or whatever the element type is.
SET_ID INT,
PRIMARY KEY (ELEMENT, SET_ID)
)
Additional suggestions:
Note how the ELEMENT field is at the primary key's leading edge. This should serve the queries below better than PRIMARY KEY (SET_ID, ELEMENT). You can still add the latter if desired, but if you don't, then you should also...
Cluster the table (if your DBMS supports it), which means that the whole table is just a single B-Tree (and no table heap). That way, you maximize the performance of the queries below, minimize storage requirements, and improve cache effectiveness.
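To make the examples concrete, the three sets from the question could be loaded like this (a sketch that assumes SET_ID values 1 through 3 and a DBMS that accepts multi-row VALUES):
INSERT INTO "SET" (ELEMENT, SET_ID)
VALUES (1, 1), (2, 1), (3, 1),   -- { 1, 2, 3 }
       (2, 2), (3, 2), (5, 2),   -- { 2, 3, 5 }
       (5, 3), (10, 3), (12, 3); -- { 5, 10, 12 }
With this data, the superset query below returns SET_IDs 1 and 2, matching the expected result in the question.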
You can then find IDs of sets that are equal to or supersets of (for example) set {2, 3} like this:
SELECT SET_ID
FROM "SET"
WHERE ELEMENT IN (2, 3)
GROUP BY SET_ID
HAVING COUNT(*) = 2;
And sets that intersect {2, 3} like this:
SELECT SET_ID
FROM "SET"
WHERE ELEMENT IN (2, 3)
GROUP BY SET_ID;
And sets that are equal to or are subsets of {2, 3} like this:
SELECT SET_ID
FROM "SET"
WHERE SET_ID NOT IN (
SELECT SET_ID
FROM "SET" S2
WHERE S2.ELEMENT NOT IN (2, 3)
)
GROUP BY SET_ID;
"Efficient" can mean a lot of things, but the normalized way would be to have an Items table with all the possible elements and a Sets table with all the sets, and an ItemsSets lookup table. If you have sets A and B in your Sets table, queries like (doing this for clarity rather than optimization... also "Set" is a bad name for a table or field, given it is a keyword)
SELECT itemname FROM Items i
WHERE i.itemname IN
(SELECT itemname FROM ItemsSets isets WHERE isets.setname = 'A')
AND i.itemname IN
(SELECT itemname FROM ItemsSets isets WHERE isets.setname = 'B')
That, for instance, is the intersection of A and B (you can almost certainly speed this up as a JOIN; again, "efficient" can mean a lot of things, and you'll want an architecture that allows a query like that). Similar queries can be made to find out the difference, complement, test for equality, etc.
Now, I know you asked about efficiency, and this is a horribly slow way to query, but this is the only reliably scalable architecture for the tables to do this, and the query was just an easy one to show how the tables are built. You can do all sorts of crazy things to, say, cache intersections, or store multiple items of a set in one field and process that, or what have you. But don't. Cached info will eventually get stale; static limits on the number of items that fit in the field will be exceeded; ad-hoc members of new tuples will be misinterpreted.
Again, "efficient" can mean a lot of different things, but ultimately an information architecture you as a programmer can understand and reason about is going to be the most efficient.

Pig FILTER returns empty bag that I can't COUNT

I'm trying to count how many values in a data set match a filter condition, but I'm running into issues when the filter matches no entries.
There are a lot of columns in my data structure, but only three are of use for this example:
key: data key for the set (not unique)
value: float value as recorded
nominal_value: float representing the nominal value
Our use case right now is to find the number of values that are 10% or more below the nominal value.
I'm doing something like this:
filtered_data = FILTER data BY value <= (0.9 * nominal_value);
filtered_count = FOREACH (GROUP filtered_data BY key) GENERATE COUNT(filtered_data.value);
DUMP filtered_count;
In most cases, there are no values that fall outside of the nominal range, so filtered_data is empty (or null; I'm not sure how to tell which). This results in filtered_count also being empty/null, which is not desirable.
How can I construct a statement that will return a value of 0 when filtered_data is empty/null? I've tried a couple of options that I've found online:
-- Extra parens in COUNT required to avoid syntax error
filtered_count = FOREACH (GROUP filtered_data BY key) GENERATE COUNT((filtered_data.value is null ? {} : filtered_data.value));
which results in:
Two inputs of BinCond must have compatible schemas. left hand side: #1259:bag{} right hand side: #1261:bag{#1260:tuple(cf#1038:float)}
And:
filtered_count = FOREACH (GROUP filtered_data BY key) GENERATE (filtered_data.value is null ? 0 : COUNT(filtered_data.value));
which results in an empty/null result.
The way you have it set up right now, you will lose information about any keys for which the count of bad values is 0. Instead, I'd recommend preserving all keys, so that you can see positive confirmation that the count was 0, instead of inferring it by absence. To do that, just use an indicator and then SUM that:
data2 = FOREACH data GENERATE
    key,
    ((value <= 0.9 * nominal_value) ? 1 : 0) AS bad;
bad_count = FOREACH (GROUP data2 BY key) GENERATE group, SUM(data2.bad);

Find a series of data using non-exact measurements (fuzzy logic)

This is a more complex follow-up question to: Efficient way to look up sequential values
Each Product can have many Segment rows (thousands). Each segment has a position column that starts at 1 for each product (1, 2, 3, 4, 5, etc.) and a value column that can contain any values, such as (323.113, 5423.231, 873.42, 422.64, 763.1, etc.). The data is read-only.
It may help to think of the product as a song and the segments as a set of musical notes in the song.
Given a subset of contiguous segments, like a snippet of a song, I would like to identify potential matches for products. However, due to potential errors in measurements, the segments in the subset may not match the segments in the database exactly.
How can I identify product candidates by finding the segments of products which most closely match the subset of segments I have measured? Also, is a database the best medium for this type of data?
Here are just some thoughts for how I was about to approach this problem. Please don't take these as exact requirements. I am open to any kind of algorithms to make this work as best as possible. I was thinking there needs to be multiple threshold variables for determining closeness. One possibility might be to implement a proximity threshold and a match threshold.
For example, given these values:
Product A contains these segments: 11,21,13,13,15.
Measurement 1 has captured: 20,14,14,15.
Measurement 2 has captured: 11,21,78,13.
Measurement 3 has captured: 15,13,21,13,11.
If a proximity threshold allowed the measured segment to be 1 above or below the actual segment, then Measurement 1 may match Product A because, although many segments do not match exactly, they are within the proximity threshold relative to the actual values.
If a match threshold allowed for measurements with matches of 3 or more, Measurement 2 may return Product A because, although one of the segments (78) far exceeds the proximity threshold, it still matches 3 segments in the correct order and so is within the match threshold.
Measurement 3 would not match Product A because, although all measured segments exist in the actual segments, they are not within the proximity or match thresholds.
Update: One of the answers asked me to define what I mean by most closely match. I'm not exactly sure how to answer that, but I'll try to explain by continuing with the song analogy. Let's say the segments represent maximum frequencies of a recorded song. If I record that same song again it will be similar, but due to background noise and other limitations of recording equipment, some of the frequencies will match, some will be close, and a few will be way off. In this scenario, how would you define when one recording "matches" another? That's the same kind of matching logic I'm looking for to use in this problem.
From the information you posted, this can be solved with Edmonds's blossom V perfect matching algorithm. You can either minimize or maximize the function, and it will always find the best match. Maybe you can use a brute-force solution with 2 loops. The Wikipedia article about Edmonds's matching algorithm: http://en.wikipedia.org/wiki/Edmonds%27s_matching_algorithm
You need to come up with a definition for "most closely match". I don't know how anyone here can help you with that since no one here is going to know the business requirements or intricacies of the data. Your two methods both sound reasonable, but I have no idea if they actually are or not.
As for whether or not a database is the correct medium for this kind of data, I'd say that a database is probably the perfect medium for storing the data, but it is very likely not the correct medium for processing the data. Whether it's possible or not will depend on how your final solution defines "most closely match".
As a quick note, SSIS has some fuzzy match capabilities built into it for processing data. I've only played around with it though and that was a couple of years ago, so I don't know if it would work for what you're doing or not.
If you take your song example literally, one approach is to boil your input down to a bit-vector fingerprint, and then look that fingerprint up in a database as an exact match. You can increase the chances of finding a good match by extracting several fingerprints from your input and/or trying, for example, all bit-vectors that are only a few bit-errors away from your fingerprint.
If you have access to the ACM digital library, you can read a description of this sort of approach in "The Shazam Music Recognition service" at http://delivery.acm.org/10.1145/1150000/1145312/p44-wang.pdf?ip=94.195.253.182&acc=ACTIVE%20SERVICE&CFID=53180383&CFTOKEN=41480065&acm=1321038137_73cd62cf2b16cd73ca9070e7d5ea0744. There is also some information at http://www.music.mcgill.ca/~alastair/621/porter11fingerprint-summary.pdf.
The input format you describe suggests that you might be able to do something with the random projection method described in http://en.wikipedia.org/wiki/Locality_sensitive_hashing.
To answer your second question, depending on exactly what a position corresponds to, you might consider boiling down the numbers to hash fingerprints made up of bits or characters, and storing these in a text search database, such as Apache Lucene.
Could you take the approach of matching the measurements against each segment, position by position, and calculating the difference for each position? Then slide the measurements along by one position and calculate the difference again. Then find which slide position scored the lowest difference. Do this for every product, and then you know which product the measurements match most closely.
Test tables and data:
CREATE TABLE [dbo].[Segment]
(
[ProductId] INT,
[Position] INT,
[Value] INT
)
INSERT [dbo].[Segment]
VALUES (1, 1, 300),
(1, 2, 5000),
(1, 3, 900),
(1, 4, 400),
(1, 5, 800),
(2, 1, 400),
(2, 2, 6000),
(2, 3, 1000),
(2, 4, 500),
(2, 5, 900),
(3, 1, 400),
(3, 2, 5400),
(3, 3, 900),
(3, 4, 400),
(3, 5, 900)
CREATE TABLE #Measurement
(
[Position] INT,
[Value] INT
)
INSERT #Measurement
VALUES (1, 5400),
(2, 900),
(3, 400)
As you can see, the measurements match (a subset of) the third product exactly.
Some helpers:
CREATE TABLE #ProductSegmentCount
(
[ProductId] INT,
[SegmentCount] INT
)
INSERT #ProductSegmentCount
SELECT [ProductId], MAX([Position])
FROM [dbo].[Segment]
GROUP BY [ProductId]
DECLARE #MeasurementSegmentCount INT = (SELECT MAX([Position]) FROM #Measurement)
A recursive common table expression to show the products ordered by closest match:
;WITH [cteRecursive] AS
(
SELECT s.[ProductId],
0 AS [RecursionId],
m.[Position] AS [MeasurementPosition],
s.[Position] AS [SegmentPosition],
ABS(m.[Value] - s.[Value]) AS [Difference]
FROM #Measurement m
INNER JOIN [dbo].[Segment] s
ON m.[Position] = s.[Position]
UNION ALL
SELECT s.[ProductId],
[RecursionId] + 1 AS [RecursionId],
m.[Position],
s.[Position],
ABS(m.[Value] - s.[Value]) AS [Difference]
FROM [cteRecursive] r
INNER JOIN #Measurement m
ON m.[Position] = r.[MeasurementPosition]
INNER JOIN [dbo].[Segment] s
ON r.[ProductId] = s.[ProductId]
AND m.[Position] + (r.[RecursionId]) = s.[Position]
INNER JOIN #ProductSegmentCount psc
ON s.[ProductId] = psc.[ProductId]
WHERE [RecursionId] <= ABS(#MeasurementSegmentCount - psc.[SegmentCount])
)-- select * from [cteRecursive] where [ProductId] = 3 order by RecursionId, SegmentPosition
, [cteDifferences] AS
(
SELECT [ProductId], [RecursionId], SUM([Difference]) AS [Difference]
FROM [cteRecursive]
GROUP BY [ProductId], [RecursionId]
)-- select * from [cteDifferences]
SELECT [ProductId], MIN([Difference]) AS [Difference]
FROM [cteDifferences]
GROUP BY [ProductId]
ORDER BY MIN([Difference])
OPTION (MAXRECURSION 0)