Pulling previous cell value using conditional lag function - sql

I am trying to condense down a data table which has separate rows for a particular ID: one row has an intent string and the following rows have one or more log strings. There can be more than one set of intents/logs for each ID. I want to pull down the intent string cells in a separate column so they are listed on the same row/s as the associated log strings.
I've "tried" LAG(tobi_intent, 1,0) OVER (ORDER BY datevalue) as AssociatedIntent
but firstly, this isn't valid code, and secondly, wouldn't ensure that the associated intent and logs are for the same ID.
Can anyone advise on the correct sql code to get the output below?
expected table output:
ID log intent associated_intent
1 x
1 b x
1 a x
1 u
1 f u
2 x
2 f x
5 e
5 a e
5 s e

Related

PostgreSQL data transformation - Turn rows into columns

I have a table whose structure looks like the following:
k | i | p | v
Notice that the key (k) is not unique, there are no keys, nothing. Each key can have multiple attributes (i = 0, 1, 2, ...) which can be of different types (p) and have different values (v). One attribute type may also appear multiple times (p(i-1) = p(i)).
What I want to do is pick certain attribute types and their corresponding values and place them in the same row. For example I want to have:
k | attr_name1 | attr_name2
I have managed to make a query that does this and works for all keys (k) for which attr_name1 and attr_name2 appear in the column p of the initial table:
SELECT DISTINCT ON (key) fn.k AS key, fn.v AS attr_name1, a.v AS attr_name2
FROM Table fn
LEFT JOIN Table a ON fn.k = a.k
AND a.p = 'attr_name2'
WHERE fn.p = 'attr_name1'
I would like, however, to take into account the case where a certain key has no attribute named attr_name1 and insert a NULL value into the corresponding column of the new table. I am not sure how to achieve that. I have no issue using multiple queries or intermediate tables etc, but there are quite a lot of rows in the table and I need something that scales to millions of rows.
Any help would be appreciated.
Example:
k i p v
1 0 a 10
1 1 b 12
1 2 c 34
1 3 d 44
1 4 e 09
2 0 a 11
2 1 b 13
2 2 d 22
2 3 f 34
Would turn into (assuming I am only interested in columns a, b, c):
k a b c
1 10 12 34
2 11 13 NULL
I would use conditional aggregation. That is, an aggregate function around a CASE expression.
SELECT
k,
MAX(CASE WHEN p='a' THEN v END) AS a,
MAX(CASE WHEN p='b' THEN v END) AS b,
MAX(CASE WHEN p='c' THEN v END) AS c
FROM
your_table
GROUP BY
k
This presumes that (k, p) is unique. If there are duplicate keys, this will clearly find the one v with the highest value (for each (k,p))
As a general rule this kind of pivoting makes the data harder to process in SQL. This is often done for display purposes because humans find this easier to read. However, from a software engineering perspective, such formatting should not be done in the data layer; be careful that by doing this you don't actually make your future life harder.

Grouping rows so a column sums to no more than 10 per group

I have a table that looks like:
col1
------
2
2
3
4
5
6
7
with values sorted in ascending order.
I want to assign each row to groups with labels 0,1,...,n so that each group has a total of no more than 10. So in the above example it would look like this:
col1 |label
------------
2 0
2 0
3 0
4 1
5 1
6 2
7 3
I tried using this:
floor(sum(col1) OVER (partition by ORDER BY col1 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) /10))
But this doesn't work correctly because it is performing the operations
as:
floor(2/10) = 0
floor([2+2]/10) = 0
floor([2+2+3]/10) = 0
floor([2+2+3+4]/10) = 1
floor([2+2+3+4+5]/10 = 1
floor([2+2+3+4+5+6]/10 = 2
floor([2+2+3+4+5+6+7]/10) = 2
It's all coincidentally correct until the last calculation, because even though
[2+2+3+4+5+6+7] / 10 = 2.9
and
floor(2.9) = 2
what it should do is realise 6+7 is > 10 so the 5th row with value 7 needs be in its own group so iterate the group number + 1 and allocate this row into a new group.
What I really want it to do is when it encounters a sum > 10 then set group number = group number + 1, allocate the CURRENT ROW into this new group, and then finally set the new start row to be the CURRENT ROW.
This is too long for a comment.
Solving this problem requires scanning the table, row-by-row. In SQL, this would be through a recursive CTE (or hierarchical query). Hive supports neither of these.
The issue is that each time a group is defined, the difference between 10 and the sum is "forgotten". That is, when you are further down in the list, what happens earlier on is not a simple accumulation of the available data. You need to know how it was split into groups.
A related problem is solvable. The related problem would assign all rows to groups of size 10, splitting rows between two groups. Then you would know what group a later row is in based only on the cumulative sum of the previous rows.

Business Objects CountIf by cell reference

So I have a column with this data
1
1
1
2
3
4
5
5
5
how can I do a count if where the value at any given location in the above table is equal to a cell i select? i.e. doing Count([NUMBER]) Where([NUMBER] = Coordinates(0,0)) would return 3, because there are 3 rows where the value is one in the 0 position.
it's basically like in excel where you can do COUNTIF(A:A, 1) and it would give you the total number of rows where the value in A:A is 1. is this possible to do in business objects web intelligence?
Functions in WebI operate on rows, so you have to think about it a little differently.
If your intent is to create a cell outside of the report block and display the count of specific values, you can use Count() with Where():
=Count([NUMBER];All) Where ([NUMBER] = "1")
In a freestanding cell, the above will produce a value of "3" for your sample data.
If you want to put the result in the same block and have it count up the occurrences of values on that row, for example:
NUMBER NUMBER Total
1 3
1 3
1 3
2 1
3 1
4 1
5 3
5 3
5 3
it gets a little more complicated. You have to have at least one other dimension in the query to reference. It can be anything, but you have to be counting something in conjunction with the NUMBER dimension. So, the following would work, assuming there's another dimension in the query named [Duh]:
=Count([NUMBER];All) ForAll([Duh])

Working of Merge in SAS (with IN=)

I have two dataset data1 and data2
data data1;
input sn id $;
datalines;
1 a
2 a
3 a
;
run;
data data2;
input id $ sales x $;
datalines;
a 10 x
a 20 y
a 30 z
a 40 q
;
run;
I am merging them from below code:
data join;
merge data1(in=a) data2(in=b);
by id;
if a and b;
run;
Result: (I was expecting an Inner Join result which is not the case)
1 a 10 x
2 a 20 y
2 a 30 z
2 a 40 w
Result from proc sql inner join.
proc sql;
select data1.id,sn,sales,x from data2 inner join data1 on data1.hh_id;
quit;
Result: (As expected from an inner join)
a 1 10 x
a 1 20 y
a 1 30 z
a 1 40 w
a 2 10 x
a 2 20 y
a 2 30 z
a 2 40 w
b 3 10 x
b 3 20 y
b 3 30 z
b 3 40 w
I want to know the concept and STEP BY STEP working of merge statement in SAS with In= and proving the above result.
PS: I have read this, and it says
An obvious use for these variables is to control what kind of 'merge'
will occur, using if statements. For example, if
ThisRecordIsFromYourData and ThisRecordIsFromOtherData; will make SAS
only include rows that match on the by variables from both input data
sets (like an inner join).
which I guess, (like an Inner Join) is not always the case.
Basically, this is a result of the difference in how the SAS data step and SQL process their respective join/merges.
SQL creates a separate record for each possible combination of keys. This is a Cartesian Product (at the key level).
SAS data step, however, process merges very differently. MERGE is really nothing more than a special case of SET. It still processes rows iteratively, one at a time - it never goes back, and never has more than one row from any dataset in the PDV at once. Thus, it cannot create a Cartesian product in its normal process - that would require random access, which the SAS datastep doesn't do normally.
What it does:
For each unique BY value
Take the next record from the left side dataset, if one exists with that BY value
Take the next record from the right side dataset, if one exists with that BY value
Output a row
Continue until both datasets are exhausted for that BY value
With BY values that yield unique records per value on either side (or both), it is effectively identical to SQL. However, with BY values that yield duplicates on BOTH sides, you get what you have there: a side-by-side merge, and if one runs out before the other, the values from the last row of the shorter dataset (for that by value) are more-or-less copied down. (They're actually RETAINED, so if you overwrite them with changes, they will not reset on new records from the longer dataset).
So, if left has 3 records and right has 4 records for key value a, like in your example, then you get data from the following records (assuming you don't alter the data after):
left right
1 1
2 2
3 3
3 4

Excel: one column has duplicates of each value, I need to take averages of the corresponding two values from the other columns

Example:
column A column B
A 1
A 2
B 2
B 2
C 1
C 1
I would somehow like to get the following result:
column A column B
A 1.5
B 2
C 1
(which are averages of 1 and 2, 2 and 2 and 1 and 1)
How do I achieve that?
Thanks
If you're using Excel 2007 or above, you can also use the shorter AVERAGEIF function:
=AVERAGEIF($A$1:$A:$6,D1,$B$1:$B$6)
Less typing, easier to read..
In D1:D3, type A, B, C. Then in E1, put this formula
=SUMIF($A$1:$A$6,D1,$B$1:$B$6)/COUNTIF($A$1:$A$6,D1)
and fill down to E3. If you want to replace the existing data, copy E1:E3 and paste-special-values over itself. Then delete A:C.
Alternatively, you can add headers to your data, say "Letter" and "Number". Then create a Pivot Table from your data. Put Letter in the rows section and Number in the Data section. Change your Data section from SUM to AVERAGE and you'll get the same result.