Rebol select-like function returning more than just the next value?

Does this exist? If not, what's the best way to create it?

If you want to return all the values after the target value, you can use next find,
e.g.:
data: copy [1 2 3 4 5 6 7 8 9]
select data 5
== 6 ;; returns the next value only.
find data 5
== [5 6 7 8 9] ;; returns the series at that point, so ...
next find data 5
== [6 7 8 9] ;; ... returns the series after that point.
If you just want the next N items, add copy/part with a limit of N,
e.g. (next three items):
copy/part next find data 5 3
== [6 7 8]
I'll leave you to add the error handling for when the value is not found (find returns none here, so next will fail):
next find data 0

Use find/tail:
>> find/tail [a b c d e] 'c
== [d e]
>> find/tail [a b c d e] 'x
== none

Python: obtaining the first observation according to its date [duplicate]

I have a DataFrame with columns A, B, and C. For each value of A, I would like to select the row with the minimum value in column B.
That is, from this:
df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                   'B': [4, 5, 2, 7, 4, 6],
                   'C': [3, 4, 10, 2, 4, 6]})
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
I would like to get:
A B C
0 1 2 10
1 2 4 4
For the moment I am grouping by column A, then creating a value that indicates to me the rows I will keep:
a = df.groupby('A').min()
a['A'] = a.index
to_keep = [str(x[0]) + str(x[1]) for x in a[['A', 'B']].values]
df['id'] = df['A'].astype(str) + df['B'].astype(str)
df[df['id'].isin(to_keep)]
I am sure that there is a much more straightforward way to do this.
I have seen many answers here that use MultiIndex, which I would prefer to avoid.
Thank you for your help.
I feel like you're overthinking this. Just use groupby and idxmin:
df.loc[df.groupby('A').B.idxmin()]
A B C
2 1 2 10
4 2 4 4
df.loc[df.groupby('A').B.idxmin()].reset_index(drop=True)
A B C
0 1 2 10
1 2 4 4
I had a similar situation but with a more complex column heading (e.g. "B val"), in which case this is needed:
df.loc[df.groupby('A')['B val'].idxmin()]
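Putting the idxmin approach into a self-contained, runnable form (reusing the question's df; the variable name rows is just for illustration):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                   'B': [4, 5, 2, 7, 4, 6],
                   'C': [3, 4, 10, 2, 4, 6]})

# idxmin returns, for each A group, the index label of the row whose B
# is minimal; .loc then pulls those rows out of the original frame.
rows = df.loc[df.groupby('A')['B'].idxmin()].reset_index(drop=True)
```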
The accepted answer (suggesting idxmin) cannot be used with the pipe pattern. A pipe-friendly alternative is to first sort values and then use groupby with DataFrame.head:
data.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=1)
This is possible because by default groupby preserves the order of rows within each group, which is stable and documented behaviour (see pandas.DataFrame.groupby).
This approach has additional benefits:
it can easily be expanded to select the n rows with the smallest values in a specific column
it can break ties by providing another column (as a list) to .sort_values(), e.g.:
data.sort_values(['final_score', 'midterm_score']).groupby('year').apply(pd.DataFrame.head, n=1)
As with the other answers, .reset_index(drop=True) is needed to exactly match the result desired in the question, making the final snippet:
df.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=1).reset_index(drop=True)
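A note on the apply call: pandas also exposes head directly on the groupby object, which is equally pipe-friendly and avoids passing DataFrame.head through apply. A sketch using the question's data:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                   'B': [4, 5, 2, 7, 4, 6],
                   'C': [3, 4, 10, 2, 4, 6]})

# Sort by B so each group's minimum comes first, then keep one row per group.
# GroupBy.head(1) preserves the sorted order within groups.
result = (df.sort_values('B')
            .groupby('A')
            .head(1)
            .sort_values('A')
            .reset_index(drop=True))
```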
I found an answer that is a little more wordy, but a lot more efficient:
This is the example dataset:
data = pd.DataFrame({'A': [1,1,1,2,2,2], 'B':[4,5,2,7,4,6], 'C':[3,4,10,2,4,6]})
data
Out:
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
First we will get the min values as a Series from a groupby operation:
min_value = data.groupby('A').B.min()
min_value
Out:
A
1 2
2 4
Name: B, dtype: int64
Then we merge this Series result onto the original data frame:
data = data.merge(min_value, on='A',suffixes=('', '_min'))
data
Out:
A B C B_min
0 1 4 3 2
1 1 5 4 2
2 1 2 10 2
3 2 7 2 4
4 2 4 4 4
5 2 6 6 4
Finally, we keep only the rows where B equals B_min, and drop B_min since we don't need it anymore.
data = data[data.B==data.B_min].drop('B_min', axis=1)
data
Out:
A B C
2 1 2 10
4 2 4 4
I have tested it on very large datasets and this was the only way I could make it work in a reasonable time.
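A runnable version of this merge-based approach; resetting the index on the group minimums makes the merge key an ordinary column (the names min_b and B_min are just for illustration):

```python
import pandas as pd

data = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                     'B': [4, 5, 2, 7, 4, 6],
                     'C': [3, 4, 10, 2, 4, 6]})

# Per-group minimum of B, turned back into a regular DataFrame for merging.
min_b = data.groupby('A')['B'].min().rename('B_min').reset_index()

# Broadcast each group's minimum onto every row, then keep rows where B hits it.
merged = data.merge(min_b, on='A')
result = merged[merged['B'] == merged['B_min']].drop(columns='B_min')
```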
You can sort_values and drop_duplicates:
df.sort_values('B').drop_duplicates('A')
Output:
A B C
2 1 2 10
4 2 4 4
The solution, as written before, is:
df.loc[df.groupby('A')['B'].idxmin()]
But if you then get an error like:
"Passing list-likes to .loc or [] with any missing labels is no longer supported.
The following labels were missing: Float64Index([nan], dtype='float64').
See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"
it may be because there are NaN values in column B; that was my case. I used dropna() and then it worked:
df.loc[df.groupby('A')['B'].idxmin().dropna()]
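An alternative to cleaning up the idxmin result afterwards is to drop the NaN rows before grouping, so idxmin never produces a missing label. A sketch with made-up data containing a NaN in column B:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2],
                   'B': [3.0, 1.0, float('nan'), 2.0],
                   'C': [10, 20, 30, 40]})

# Remove rows whose B is NaN first; idxmin then only ever sees real values.
clean = df.dropna(subset=['B'])
result = clean.loc[clean.groupby('A')['B'].idxmin()]
```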
You can also use boolean indexing to select the rows where column B equals its group's minimum value:
out = df[df['B'] == df.groupby('A')['B'].transform('min')]
print(out)
A B C
2 1 2 10
4 2 4 4

When does passing lambda work and not work?

In certain cases you can pass a lambda x to chain functions on a dataframe, as in the examples below:
df.loc[lambda x: x['b'] > 1]
df.assign(c=lambda x: x['b'])
but in certain cases, it does not work. If you take the example below, how would you chain eq()?
df = pd.DataFrame({'a':[[1,2,3],[2,3,4],[3,4,5]],'b':[1,2,4]})
a b
0 [1, 2, 3] 1
1 [2, 3, 4] 2
2 [3, 4, 5] 4
If we were to for example call explode() on column a, it would return the below:
a b
0 1 1
0 2 1
0 3 1
1 2 2
1 3 2
1 4 2
2 3 4
2 4 4
2 5 4
But what if we wanted to see where column a equals column b? The following code does not work. How would you chain eq()? When can you pass lambda x: x and when can you not?
df.explode('a').eq(lambda x: x['b'], axis=0)
lambda will work wherever a function or callable is allowed as an input, for instance in the df.apply documentation:
DataFrame.apply(func, axis=0, raw=False, result_type=None, args=(), **kwds)
Parameters
func: function
Function to apply to each column or row.
vs the df.eq documentation:
DataFrame.eq(other, axis='columns', level=None)
Parameters
other: scalar, sequence, Series, or DataFrame
Any single or multiple element data structure, or list-like object.
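Since eq expects data for other rather than a callable, one way to keep the chain going is to route through a method that does accept a callable, such as .loc or .pipe. A sketch using the question's frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [[1, 2, 3], [2, 3, 4], [3, 4, 5]],
                   'b': [1, 2, 4]})

# .loc accepts a callable that receives the intermediate frame, so the
# comparison can refer to the exploded result without naming it.
matches = df.explode('a').loc[lambda x: x['a'] == x['b']]

# .pipe works the same way when you want the boolean Series itself.
flags = df.explode('a').pipe(lambda x: x['a'].eq(x['b']))
```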

How to rename a pandas dataframe column by checking the column's data

Example df would be:
a b c d e
0 SN123456 3 5 7 SN123456
1 SN456123 4 6 8 SN456123
I am wondering how I can rename column 'a' to 'Serial_Number' based on the data -- it starts with 'SN' and its length is fixed at 8.
(We may not know the name of 'a', as it is read from some csv file; its position is not known either.)
Also, how can I remove the duplicated column 'e'? It is a complete duplicate of column 'a'.
Any idea on a faster way?
Looping over each column's series to get its index and rename the column is not a great method.
Thanks!
Here's a rewrite in response to your comment. This will rename + drop in a vectorized fashion.
Given df:
>>> df
a b c d e f g
0 SN123456 3 5 7 SN123456 0 0
1 SN456123 4 6 8 SN456123 0 0
Create 3 boolean masks of the same length as the columns:
>>> mask1 = df.dtypes == 'object'
>>> mask2 = df.iloc[0].str.len() == 8
>>> mask3 = df.iloc[0].str.startswith('SN')
Use these to identify which columns look like serial numbers. The first will be renamed; the rest will be dropped.
>>> rename, *drop = df.columns[mask1 & mask2 & mask3]
Then rename + drop:
>>> rename
'a'
>>> drop
['e']
>>> df.rename(columns={rename: 'Serial_Number'}).drop(drop, axis=1)
Serial_Number b c d f g
0 SN123456 3 5 7 0 0
1 SN456123 4 6 8 0 0
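The steps above combined into one runnable sketch; the frame is rebuilt from the example, and astype(str) guards the .str methods against non-string cells (a small variation on the masks shown above):

```python
import pandas as pd

df = pd.DataFrame({'a': ['SN123456', 'SN456123'],
                   'b': [3, 4], 'c': [5, 6], 'd': [7, 8],
                   'e': ['SN123456', 'SN456123'],
                   'f': [0, 0], 'g': [0, 0]})

# Columns whose first value looks like a serial number: object dtype,
# exactly 8 characters, starting with 'SN'.
first_row = df.iloc[0].astype(str)
mask = ((df.dtypes == 'object')
        & (first_row.str.len() == 8)
        & first_row.str.startswith('SN'))

# Rename the first match, drop the rest as duplicates.
rename, *drop = df.columns[mask]
result = df.rename(columns={rename: 'Serial_Number'}).drop(columns=drop)
```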

What's the equivalent of Python's list[3:7] in REBOL or Red?

With Rebol's pick I can only get one element:
list: [1 2 3 4 5 6 7 8 9]
pick list 3
In Python one can get a whole sub-list with
list[3:7]
AT seeks a position in a list.
COPY will copy from a position to the end of the list, by default.
The /PART refinement of COPY lets you add a limit to copying.
Passing an integer to /PART says how many items you want to copy:
>> list: [1 2 3 4 5 6 7 8 9]
>> copy/part (at list 3) 5
== [3 4 5 6 7]
If you provide a series position to be the end, then it will copy up to that point, so you'd have to go one past it if your range is meant to be inclusive.
>> copy/part (at list 3) (next at list 7)
== [3 4 5 6 7]
There have been some proposals for range dialects; I can't find any offhand. Simple code to give the idea:
range: func [list [series!] spec [block!] /local start end] [
    if not parse spec [
        set start integer! '.. set end integer!
    ][
        do make error! "Bad range spec, expected e.g. [3 .. 7]"
    ]
    copy/part (at list start) (next at list end)
]
>> list: [1 2 3 4 5 6 7 8 9]
>> range list [3 .. 7]
== [3 4 5 6 7]
>> list: [1 2 3 4 5 6 7 8 9]
== [1 2 3 4 5 6 7 8 9]
>> copy/part skip list 2 5
== [3 4 5 6 7]
So, you can skip to the right location in the list, and then copy as many consecutive members as you need.
If you want an equivalent function, you can write your own.
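For readers coming from Python: note that Rebol series are 1-based and the ranges above are end-inclusive, while Python slices are 0-based with an exclusive end, so the Python counterpart of range list [3 .. 7] is:

```python
lst = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# 0-based start index 2 is the third element; end index 7 is exclusive.
sub = lst[2:7]
```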

intersect(A,B) returns the data with no repetitions

I was using "intersect" in my Matlab code where I want the following:
A = [ 4 1 1 2 3];
[B] = sort(A, 'ascend'); % so that B is sorting A in ascending order, so I got B = [1 1 2 3 4]
[same,a] = intersect(B,A);
I want same = [1 1 2 3 4], but the simulation gives me same = [1 2 3 4], omitting the repeated '1'.
I understand that using intersect will return data with no repetitions:
C = intersect(A,B) returns the data common to both A and B with no repetitions.
I want it to show the complete data, including the repetitions. What alternatives can I use rather than the function intersect?
For example:
A = [ 4 1 1 2 3];
[B] = sort(A, 'ascend'); % so that B is sorting A in ascending order, so I got B = [1 1 2 3 4]
[same,a] = intersect(B,A);
So now I want it to be like this: same = [1 1 2 3 4] and a = [2 3 4 5 1].
I need to access 'a', where 'a' shows the original index prior to sorting, so I can use it for further processing.
Thank you very much.
Why do you need the intersect of A and B, knowing that B contains the same values as A?
From what you said, I think you already have all the needed results in B.
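In MATLAB, the second output of sort already carries the original indices: [same, a] = sort(A) gives same = [1 1 2 3 4] and a = [2 3 4 5 1], with repetitions kept, so no intersect is needed. The same idea as a NumPy sketch (using a stable argsort):

```python
import numpy as np

A = np.array([4, 1, 1, 2, 3])

# argsort returns the original (0-based) positions of the sorted values.
order = np.argsort(A, kind='stable')
same = A[order]    # sorted values, repetitions kept
a = order + 1      # 1-based indices, MATLAB-style
```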