Related
I have a data set in google sheets, for each week of data I have 3 rows. I wish to query the data in every second row to calculate the max value and the last value.
For instance:
ROW
DATA
1
800
2
Text
3
500
4
More text
5
600
6
Blah
7
700
8
Blah
For Max value I have the following which will return 800
MAX(FILTER(QUERY(A1:A,"Select * skipping 2"), QUERY(A1:A,"Select * skipping 2") <> 0))
How do I change it up to return the last value? Which should return 700
try:
=LOOKUP(2^99,FILTER(A:A,A:A<>0))
#rockinfreakshow answer will successfully find the last number.
To filter a range by n amount of rows, you can use:
=FILTER(A:A,MOD(ROW(A:A),n)=1)
Change n with your desired value, and 1 with the number of row you want to get. 1 for the first, 2 for the second, but 0 if you want the nth one. To find MAX, just wrap it in MAX()
To find the last one, even if it's a text or number, you can use SORTN and SEQUENCE:
=SORTN(FILTER(A:A,MOD(ROW(A:A),n)=1,A:A<>""),1,1, SEQUENCE(COUNTA(FILTER(A:A,MOD(ROW(A:A),n)=1,A:A<>""))),0)
It orders the elements in reverse order and only chooses the first one
Remember to change n with the number of rows and =1 with the number of row you want to choose
So I have a column with this data
1
1
1
2
3
4
5
5
5
how can I do a count if where the value at any given location in the above table is equal to a cell i select? i.e. doing Count([NUMBER]) Where([NUMBER] = Coordinates(0,0)) would return 3, because there are 3 rows where the value is one in the 0 position.
it's basically like in excel where you can do COUNTIF(A:A, 1) and it would give you the total number of rows where the value in A:A is 1. is this possible to do in business objects web intelligence?
Functions in WebI operate on rows, so you have to think about it a little differently.
If your intent is to create a cell outside of the report block and display the count of specific values, you can use Count() with Where():
=Count([NUMBER];All) Where ([NUMBER] = "1")
In a freestanding cell, the above will produce a value of "3" for your sample data.
If you want to put the result in the same block and have it count up the occurrences of values on that row, for example:
NUMBER NUMBER Total
1 3
1 3
1 3
2 1
3 1
4 1
5 3
5 3
5 3
it gets a little more complicated. You have to have at least one other dimension in the query to reference. It can be anything, but you have to be counting something in conjunction with the NUMBER dimension. So, the following would work, assuming there's another dimension in the query named [Duh]:
=Count([NUMBER];All) ForAll([Duh])
Let's say I have two columns.
3.5463 11
4.5592 12
1.6993 111
0.92521 112
1.7331 121
2.1407 122
1.4082 1111
2.0698 1112
2.3973 1121
2.4518 1122
1.1719 1211
1.153 1212
0.67139 1221
0.64744 1222
1.3705 11111
0.9557 11112
0.64868 11121
0.7325 11211
0.58874 11212
0.86673 11221
0.17075 11222
0.64026 12111
0.80229 12112
0.43422 12122
1.0405 12211
0.63376 12212
0.56491 12221
0.34626 12222
0.81631 111111
0.91837 111112
0.70013 111121
0.87384 111122
1.1474 111211
0.47411 111221
0.12249 111222
0.56728 112111
0.88169 112112
0.14509 112121
0.68655 112211
0.36274 112212
1.1652 121111
0.99314 121112
0.42024 121121
0.23937 121122
1.0346 122111
0.64642 122112
0.15632 122121
0.41725 122122
0.40793 122211
In the first column, there is a number. With every one of those numbers, in the second column, is an associated ID. Now, there are some blank rows that do not contain any numbers in them.
Define one of these numbers to be a "daughter" of another number if the ID of the first number is the same as the ID of the second, with an extra digit on the end. For example, both IDs 11211 and 11212 are daughters of 1121, because the ID of 1121 has an extra digit, either a 1 or a 2, added onto the end to form the ID of its daughters. Thus, 1121 is the parent of both 11211 and 11212.
Here is what I want the macro to do. It must output a third column which contains, for every row, a cumulative sum of the number of the first column in that row, plus the parent number of that number, and the parent number of the parent number, etc. all the way up until it reachers either 11 or 12. It will begin by simply outputting the numbers in column 1 for 11 and 12 in the third column. Then, in a loop beginning with 111, it will add up the cumulative sum of every row (the number in that row plus the third column output of the parent), only if that row has a number and an id, and only if the parent exists and has an output in column 3. So for example, the number in the 3rd column of the row with ID 11222 should be the number in column 1 of that row, plus that of 1122, plus that of 112, plus that of 11. So, 0.17075+2.4518+0.92521+3.5463, or 7.09406. However, if you try to do this for ID 111221, you will notice that the row where the parent 11122 should be is empty. Thus, the parent does not exist, and no value will be outputted in column 3 for 111221.
I would greatly appreciate it if someone has some time on their hands to code up this VBA macro for me in exchange for an accepted solution.
Thanks
I don't think a macro is needed, just some formulas. First, I put a header on my columns of data, such as "value," and "id." If you then highlight the column labels (i.e., A and B) and sort by B ("id") then A ("value"), you'll group your blank rows. You can then delete those rows. Now you have the data almost ready. When I did this, I converted the id column to text, as opposed to a number value, so if I sort the table by id, the pattern will be, "11, 111, 1111," and so on, instead of, "11, 12, 111, 112, 121." Then, I added columns to separate the separate characters or levels of the ids. This is to help with parents and children. You can use text-to-columns, or a MID formula, but what I did was have 6 more columns to the right. For each id row, each column would either have a "1," a "2," or a blank (null) value. Then I added another column, calling it "level." I used a formula like COUNTA across all my id splitting columns. So, for 11, my level value was 2. 111 would be 3, 11221 would be 5, and so on. This gives me the id level (parent, child, grandchild, etc). Then I added my final column to the right to compute my cumulative sum of the values. In concept I have one big nested IF statement, but in practice, I needed two. My formula says, if the row above me has a lower level number (i.e., it is some kind of parent), add the value of the current row to the value of the above row. Otherwise, keep going up a row till I do get a parent, and add the current row value to that number.
My final formula for all but the first 5 rows of data was (in the 6th row of data):
=if(K6
rest of answer is below
=if(K6<K7,L6+C7,if(K5<K7,L5+C7,if(K4<K7,L4+C7,if(K3<K7,L3+C7,if(K2<K7,L2+C7,C7)))))
The values were column C, the original id in column D, the id split columns were E through J, the level column was K, and my formula was in L. This formula can be copied down the table. For the first 4 rows, you just need 1 less IF statement each row you go up. The fifth row of data might take the above formula; it depends how it will deal with the column headers in row one. The formula on the 4 row of data might be:
=if(K4<K5,L4+C5,if(K3<K5,L3+C5,if(K2<K5,L2+C5,if(K1<K5,L1+C5,C5))))
I'm still learning how to format these comments, so I'll try to provide a sample of the layout I have...
C D E F G H I J K L
1 value id 1 2 3 4 5 6 lvl cumul_sum
2 3.546300 11 1 1 2 3.546300
3 1.699300 111 1 1 1 3 5.245600
4 1.408200 1111 1 1 1 1 4 6.653800
5 1.370500 11111 1 1 1 1 1 5 8.024300
6 0.816310 111111 1 1 1 1 1 1 6 8.840610
7 0.918370 111112 1 1 1 1 1 2 6 8.942670
8 0.955700 11112 1 1 1 1 2 5 7.609500
So for example, the number in the 3rd column of the row with ID 11222 should be the number in column 1 of that row, plus that of 1122, plus that of 112, plus that of 11. So, 0.17075+2.4518+0.92521+3.5463, or 7.09406.However, if you try to do this for ID 111221, you will notice that the row where the parent 11122 should be is empty. Thus, the parent does not exist, and no value will be outputted in column 3 for 111221.
As a native worksheet array formula¹ in D1,
=IF(LEN(B1), SUM(SUMIFS(A$1:INDEX(A:A, MATCH(1E+99, A:A)),
B$1:INDEX(B:B, MATCH(1E+99, A:A)), LEFT(B1, ROW(INDIRECT("2:"&LEN(B1)))))), TEXT(,))
The above does not compensate for missing parents (null string). It totals everything it can find and uses zero for missing parents.
As a VBA UDF² in E1,
Function conditionalCumulativeSum(nums As Range, _
ids As Range, sib As Range, _
Optional nullOnBlank As Boolean = True)
Dim i As Integer
'truncate any full column reference to the UsedRange
Set nums = Intersect(nums, nums.Parent.UsedRange)
'match the nums and ids ranges
Set ids = ids.Resize(nums.Rows.Count, nums.Columns.Count)
For i = Len(sib.Value2) To 2 Step -1
If nullOnBlank And IsError(Application.Match(--Left(sib, i), ids, 0)) Then
conditionalCumulativeSum = vbNullString
Exit For
End If
conditionalCumulativeSum = conditionalCumulativeSum + _
Application.SumIfs(nums, ids, Left(sib, i))
Next i
If i = 0 Then conditionalCumulativeSum = vbNullString
End Function
The above defaults to return a null string when it encounters any missing parent through the hereditary chain. This can be turned off by adding FALSE as the optional fourth parameter and then the UDF will behave identically to the native formula.
Results from sample data
¹ Array formulas need to be finalized with Ctrl+Shift+Enter↵. If entered correctly, Excel with wrap the formula in braces (e.g. { and }). You do not type the braces in yourself. Once entered into the first cell correctly, they can be filled or copied down or right just like any other formula. Try and reduce your full-column references to ranges more closely representing the extents of your actual data. Array formulas chew up calculation cycles logarithmically so it is good practise to narrow the referenced ranges to a minimum. See Guidelines and examples of array formulas for more information.
² A User Defined Function (aka UDF) is placed into a standard module code sheet. Tap Alt+F11 and when the VBE opens, immediately use the pull-down menus to Insert ► Module (Alt+I,M). Paste the function code into the new module code sheet titled something like Book1 - Module1 (Code). Tap Alt+Q to return to your worksheet(s).
I'm trying to make groups on a database with 10.000+ rows.
I need to be fast and efficient, so I'm doing binary variables for each cluster.
One, Two, Four, Five and Six is in Group1.
But 'Two' might also be in Group nr. 2, because of errors I cannot overcome because my dataset is from a webscrape. I try to sort everything in a unique way, but it's basically impossible not to do errors, if I wish to be efficient and fast.
ID Title Group1 Group2 Group3 Ungrouped
1 One 1 0 0 0
2 Two 1 1 0 0
3 Three 0 1 1 0
4 Four 1 0 1 0
5 Five 1 0 0 0
6 Six 1 1 1 0
7 Seven 0 0 0 1
My idea for a sollution:
Assign groups (one's) until everything is grouped one or more times.
Make a query for everything that has more than one group assigned (2, 3, 4, 6)
Manually decide which 1's to remove, until they only have one group assigned each.
It's actually a good idea to do the 3rd part manually, because it requires content analysis of the documents)
My question:
How do I specify, that I need to see everything with more than one group? Does it have something to do with constraints and unique values, or is there a more simple and obvious way that I'm not seeing?
If your clusters are stored as integers, you can just do:
select c.*
from clusters c
where (cluster1 + cluster2 + cluster3) > 1;
I don't know what a "binary variable" is in SQLite. Some databases do support binary flags, and you would need to convert the values to integers for the where clause.
I have a SQL Server database containing real-time stock quotes.
There is a Quotes table containing what you would expect-- a sequence number, ticker symbol, time, price, bid, bid size, ask, ask size, etc.
The sequence number corresponds to a message that was received containing data for a set of ticker symbols being tracked. A new message (with a new, incrementing sequence number) is received whenever anything changes for any of the symbols being tracked. The message contains data for all symbols (even for those where nothing changed).
When the data was put into the database, a record was inserted for every symbol in each message, even for symbols where nothing changed since the prior message. So a lot of records contain redundant information (only the sequence number changed) and I want to remove these redundant records.
This is not the same as removing all but one record from the entire database for a combination of identical columns (already answered). Rather, I want to compress each contiguous block of identical records (identical except for sequence number) into a single record. When finished, there may be duplicate records but with differing records between them.
My approach was to find contiguous ranges of records (for a ticker symbol) where everything is the same except the sequence number.
In the following sample data I simplify things by showing only Sequence, Symbol, and Price. The compound primary key would be Sequence+Symbol (each symbol appears only once in a message). I want to remove records where Price is the same as the prior record (for a given ticker symbol). For ticker X it means I want to remove the range [1, 6], and for ticker Y I want to remove the ranges [1, 2], [4, 5] and [7, 7]:
Before:
Sequence Symbol Price
0 X $10
0 Y $ 5
1 X $10
1 Y $ 5
2 X $10
2 Y $ 5
3 X $10
3 Y $ 6
4 X $10
4 Y $ 6
5 X $10
5 Y $ 6
6 X $10
6 Y $ 5
7 X $11
7 Y $ 5
After:
Sequence Symbol Price
0 X $10
0 Y $ 5
3 Y $ 6
6 Y $ 5
7 X $11
Note that (Y, $5) appears twice but with (Y, $6) between.
The following generates the ranges I need. The left outer join ensures I select the first group of records (where there is no earlier record that is different), and the BETWEEN is intended to reduce the number of records that need to be searched to find the next-earlier different record (the results are the same without the BETWEEN, but slower). I would need only to add something like "DELETE FROM Quotes WHERE Sequence BETWEEN StartOfRange AND EndOfRange".
SELECT
GroupsOfIdenticalRecords.Symbol,
MIN(GroupsOfIdenticalRecords.Sequence)+1 AS StartOfRange,
MAX(GroupsOfIdenticalRecords.Sequence) AS EndOfRange
FROM
(
SELECT
Q1.Symbol,
Q1.Sequence,
MAX(Q2.Sequence) AS ClosestEarlierDifferentRecord
FROM
Quotes AS Q1
LEFT OUTER JOIN
Quotes AS Q2
ON
Q2.Sequence BETWEEN Q1.Sequence-100 AND Q1.Sequence-1
AND Q2.Symbol=Q1.Symbol
AND Q2.Price<>Q1.Price
GROUP BY
Q1.Sequence,
Q1.Symbol
) AS GroupsOfIdenticalRecords
GROUP BY
GroupsOfIdenticalRecords.Symbol,
GroupsOfIdenticalRecords.ClosestEarlierDifferentRecord
The problem is that this is way too slow and runs out of memory (crashing SSMS- remarkably) for the 2+ million records in the database. Even if I change "-100" to "-2" it is still slow and runs out of memory. I expected the "ON" clause of the LEFT OUTER JOIN to limit the processing and memory usage (2 million iterations, processing about 100 records each, which should be tractable), but it seems like SQL Server may first be generating all combinations of the 2 instances of the table, Q1 and Q2 (about 4e12 combinations) before selecting based on the criteria specified in the ON clause.
If I run the query on a smaller subset of the data (for example, by using "(SELECT TOP 100000 FROM Quotes) AS Q1", and similar for Q2), it completes in a reasonable amount time. I was trying to figure out how to automatically run this 20 or so times using "WHERE Sequence BETWEEN 0 AND 99999", then "...BETWEEN 100000 AND 199999", etc. (actually I would use overlapping ranges such as [0,99999], [99900, 199999], etc. to remove ranges that span boundaries).
The following generates sets of ranges to split the data into 100000 record blocks ([0,99999], [100000, 199999], etc). But how do I apply the above query repeatedly (once for each range)? I keep getting stuck because you can't group these using "BETWEEN" without applying an aggregate function. So instead of selecting blocks of records, I only know how to get MIN(), MAX(), etc. (single values) which does not work with the above query (as Q1 and Q2). Is there a way to do this? Is there totally different (and better) approach to the problem?
SELECT
CONVERT(INTEGER, Sequence / 100000)*100000 AS BlockStart,
MIN(((1+CONVERT(INTEGER, Sequence / 100000))*100000)-1) AS BlockEnd
FROM
Quotes
GROUP BY
CONVERT(INTEGER, Sequence / 100000)*100000
You can do this with a nice little trick. The groups that you want can be defined as the difference between two sequences of numbers. One is assigned for each symbol in order by sequence. The other is assigned for each symbol and price. This is what is looks like for your data:
Sequence Symbol Price seq1 seq2 diff
0 X $10 1 1 0
0 Y $ 5 1 1 0
1 X $10 2 2 0
1 Y $ 5 2 2 0
2 X $10 3 3 0
2 Y $ 5 3 3 0
3 X $10 4 4 0
3 Y $ 6 4 1 3
4 X $10 5 5 0
4 Y $ 6 5 2 3
5 X $10 6 6 0
5 Y $ 6 6 3 3
6 X $10 7 7 0
6 Y $ 5 7 4 3
7 X $11 8 1 7
7 Y $ 5 8 5 3
You can stare at this and figure out that the combination of symbol, diff, and price define each group.
The following puts this into a SQL query to return the data you want:
select min(q.sequence) as sequence, symbol, price
from (select q.*,
(row_number() over (partition by symbol order by sequence) -
row_number() over (partition by symbol, price order by sequence)
) as grp
from quotes q
) q
group by symbol, grp, price;
If you want to replace the data in the original table, I would suggest that you store the results of the query in a temporary table, truncate the original table, and then re-insert the values from the temporary table.
Answering my own question. I want to add some additional comments to complement the excellent answer by Gordon Linoff.
You're right. It is a nice little trick. I had to stare at it for a while to understand how it works. Here's my thoughts for the benefit of others.
The numbering by Sequence/Symbol (seq1) always increases, whereas the numbering by Symbol/Price (seq2) only increases sometimes (within each group, only when a record for Symbol contains the group's Price). Therefore seq1 either remains in lock step with seq2 (i.e., diff remains constant, until either Symbol or Price changes), or seq1 "runs away" from seq2 (while it is busy "counting" other Prices and other Symbols-- which increases the difference between seq1 and seq2 for a given Symbol and Price). Once seq2 falls behind, it can never "catch up" to seq1, so a given value of diff is never seen again once diff moves to the next larger value (for a given Price). By taking the minimum value within each Symbol/Price group, you get the first record in each contiguous block, which is exactly what I needed.
I don't use SQL a lot, so I wasn't familiar with the OVER clause. I just took it on faith that the first clause generates seq1 and the second generates seq2. I can kind of see how it works, but that's not the interesting part.
My data contained more than just Price. It was a simple thing to add the other fields (Bid, Ask, etc.) to the second OVER clause and the final GROUP BY:
row_number() over (partition by Symbol, Price, Bid, BidSize, Ask, AskSize, Change, Volume, DayLow, DayHigh, Time order by Sequence)
group by Symbol, grp, price, Bid, BidSize, Ask, AskSize, Change, Volume, DayLow, DayHigh, Time
Also, I was able to use use >MIN(...) and <=MAX(...) to define ranges of records to delete.