Filter dataframe index on multiple conditions - pandas

In pandas.DataFrame.filter, is there a way to use the like or regex parameters so they support an OR condition? For example:
df.filter(like='bbi', axis=1)
would filter on columns with 'bbi' in their name, but how would I filter on columns containing either 'bbi' OR 'abc'?
A few options that fail:
df.filter(like='bbi' or 'abc', axis=1)
df.filter(like=('bbi' or 'abc'), axis=1)

I would do the below:
Setup:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 20, 20).reshape(5, 4),
                  columns=['abcd', 'bcde', 'efgh', 'bbia'])
print(df)
   abcd  bcde  efgh  bbia
0    10    17     2     7
1     7    12    18     9
2    17     7    11    17
3    14     4     2     9
4    15    10    12    11
Solution:
Using df.filter:
df.filter(regex=r'(abc|bbi)')
   abcd  bbia
0    10     7
1     7     9
2    17    17
3    14     9
4    15    11
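The same idea extends to any number of substrings by joining them into a single alternation. A minimal sketch, assuming a hypothetical list of name fragments:
substrings = ['abc', 'bbi', 'efg']             # hypothetical fragments to match
df.filter(regex='|'.join(substrings), axis=1)  # keep columns containing any of them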

I'm not familiar with the filter command, but I think you could achieve what you want like this:
df[(df['column'].str.contains('bbi', case=False)) | (df['column'].str.contains('abc', case=False))]

Regex search is slower, so we keep regex=False in str.contains. Hope this helps.
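If you want to avoid regex while still filtering on the column names, one option (a sketch, assuming the same df as above) is to combine two plain substring checks on the column index:
mask = df.columns.str.contains('bbi', regex=False) | df.columns.str.contains('abc', regex=False)
df.loc[:, mask]  # columns whose name contains 'bbi' or 'abc'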


Print Pandas Unique Rows by Column Condition

I am trying to print the rows where a data condition is met in a pandas DF, based on the unique values in the DF. For example, I have data that looks like this:
DF:
site temp month day
A 15 7 18
A 11 6 12
A 22 9 3
B 9 4 23
B 3 2 11
B -1 5 18
I need the result to print the rows where the max in the 'temp' column occurs for each site, such as this for the final result:
A 22
B 9
I have tried this but it is not working correctly:
for i in DF['site'].unique():
    print(DF.temp.max())
I get the same answer of:
22
22
but the answer should be:
site temp month day
A 22 9 3
B 9 4 23
thank you!
A possible solution:
df.groupby('site', as_index=False)['temp'].max()
Output:
site temp
0 A 22
1 B 9
In case you want to use a for loop:
for i in df['site'].unique():
    print(df.loc[df['site'].eq(i), 'temp'].max())
Output:
22
9
df.groupby('site').max()
Output:
temp month day
site
A 22 9 18
B 9 5 23
Let's use sort_values + drop_duplicates:
df = df.sort_values('temp', ascending=False).drop_duplicates('site')
Output:
site temp month day
2 A 22 9 3
3 B 9 4 23
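If the goal is to keep the entire row where each site's maximum occurs (rather than column-wise maxima), idxmax also works; a small sketch on the same df:
df.loc[df.groupby('site')['temp'].idxmax()]
Output:
  site  temp  month  day
2    A    22      9    3
3    B     9      4   23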

Pick row with key GROUP_FILENAME and add a new column with column name

I have a data frame which looks like this
GROUP_FIELD_NAME:BKR_ID
GROUP_FIELD_VALUE:T80
GROUP_FIELD_NAME:GROUP_OFFSET
GROUP_FIELD_VALUE:0
GROUP_FIELD_NAME:GROUP_LENGTH
GROUP_FIELD_VALUE:0
GROUP_FIELD_NAME:FIRM_ID
GROUP_FIELD_VALUE:KIZEM
GROUP_FILENAME:000000018.pdf
GROUP_FIELD_NAME:BKR_ID
GROUP_FIELD_VALUE:T80
GROUP_FIELD_VALUE:P
GROUP_FIELD_NAME:FI_ID
GROUP_FIELD_VALUE:
GROUP_FIELD_NAME:RUN_DTE
GROUP_FIELD_VALUE:20220208
GROUP_FIELD_NAME:FIRM_ID
GROUP_FIELD_VALUE:KIZEM
GROUP_FILENAME:000000019.pdf
It has three keys: GROUP_FIELD_NAME, GROUP_FIELD_VALUE and GROUP_FILENAME. I am expecting a data frame with three columns: group_field_name, group_field_value and group_file_name.
You can use:
(df['col'].str.extract('GROUP_FILENAME:(.*)|([^:]+):(.*)')
   .set_axis(['GROUP_FILENAME', 'var', 'val'], axis=1)
   .assign(GROUP_FILENAME=lambda d: d['GROUP_FILENAME'].bfill(),
           n=lambda d: d.groupby(['GROUP_FILENAME', 'var']).cumcount()
           )
   .dropna(subset=['var'])
   .pivot(index=['GROUP_FILENAME', 'n'], columns='var', values='val')
   .droplevel(1).rename_axis(columns=None)
   .reset_index('GROUP_FILENAME')
)
Output:
GROUP_FILENAME GROUP_FIELD_NAME GROUP_FIELD_VALUE
0 000000018.pdf BKR_ID T80
1 000000018.pdf GROUP_OFFSET 0
2 000000018.pdf GROUP_LENGTH 0
3 000000018.pdf FIRM_ID KIZEM
4 000000019.pdf BKR_ID T80
5 000000019.pdf FI_ID P
6 000000019.pdf RUN_DTE
7 000000019.pdf FIRM_ID 20220208
8 000000019.pdf NaN KIZEM
Used input:
col
0 GROUP_FIELD_NAME:BKR_ID
1 GROUP_FIELD_VALUE:T80
2 GROUP_FIELD_NAME:GROUP_OFFSET
3 GROUP_FIELD_VALUE:0
4 GROUP_FIELD_NAME:GROUP_LENGTH
5 GROUP_FIELD_VALUE:0
6 GROUP_FIELD_NAME:FIRM_ID
7 GROUP_FIELD_VALUE:KIZEM
8 GROUP_FILENAME:000000018.pdf
9 GROUP_FIELD_NAME:BKR_ID
10 GROUP_FIELD_VALUE:T80
11 GROUP_FIELD_VALUE:P
12 GROUP_FIELD_NAME:FI_ID
13 GROUP_FIELD_VALUE:
14 GROUP_FIELD_NAME:RUN_DTE
15 GROUP_FIELD_VALUE:20220208
16 GROUP_FIELD_NAME:FIRM_ID
17 GROUP_FIELD_VALUE:KIZEM
18 GROUP_FILENAME:000000019.pdf
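To see what the first step produces on its own before the pivot, here is a minimal sketch of str.extract on a few of the input lines (the 0/1/2 column names are pandas defaults that set_axis later renames):
s = pd.Series(['GROUP_FIELD_NAME:BKR_ID',
               'GROUP_FIELD_VALUE:T80',
               'GROUP_FILENAME:000000018.pdf'])
s.str.extract('GROUP_FILENAME:(.*)|([^:]+):(.*)')
Output:
               0                  1       2
0            NaN   GROUP_FIELD_NAME  BKR_ID
1            NaN  GROUP_FIELD_VALUE     T80
2  000000018.pdf                NaN     NaN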

Group rows using the cumulative sum of a third column

I have a table with two columns:
sort_column = A column I use for sorting
value_column = My metric of interest (a positive integer)
Using SQL, I need to create contiguous groups of rows, ordered by sort_column, such that the sum of value_column within each group is as large as possible while staying below 100 (100 not included).
Find below an example of my desired result.
Thanks
sort_column  value_column  desired_result
1            53            1
2            25            1
3            33            2
4            25            2
5            10            2
6            46            3
7            9             3
8            49            4
9            48            4
10           53            5
11           33            5
12           52            6
13           29            6
14           16            6
15           66            7
16           1              7
17           62            8
18           57            9
19           47            10
20           12            10
OK, so after a few lengthy attempts, I came to the conclusion that the task is impossible with pure SQL: a given value of the desired column depends on previous values of that same column in a way that cannot be obtained from the first two columns alone. So the problem cannot be tackled without a recursive CTE, which BigQuery does not support.
I solved the issue by writing a javascript UDF for the task. It seems to be working fine and produces the expected results.
Many thanks everyone!
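The UDF itself is not shown here, but the greedy, order-dependent logic it has to implement can be sketched in a few lines of Python (for illustration only, not the actual BigQuery JavaScript):
def assign_groups(values, limit=100):
    # start a new group whenever adding the next value would reach the limit
    groups, running_sum, group_id = [], 0, 1
    for v in values:
        if running_sum + v >= limit:
            group_id += 1
            running_sum = 0
        running_sum += v
        groups.append(group_id)
    return groups

values = [53, 25, 33, 25, 10, 46, 9, 49, 48, 53,
          33, 52, 29, 16, 66, 1, 62, 57, 47, 12]
assign_groups(values)
Output:
[1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 6, 7, 7, 8, 9, 10, 10]
which matches the desired_result column above.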

Auctions System Logical Condition

I am trying to make an auctions system but cannot figure out the logical conditions for doing so.
Let's say that I have 10 credits:
$credit
I have already bet 5 credits on another auction, so I owe 5 of the 10: $owe
I thus have 5 available: $available = $credit - $owe (= 5)
I bet 3 from the available amount (on a different item)...
I wish to bet again with 4 (cancel the 3, update to 4), but the available credit is now $available - 3 (= 2).
I can't find a logical solution written in code.
What is the condition for setting a bet?
I made up a matrix with the dependencies between the variables:
bet available owe lastbet
1 10 10 0
2 9 11 1
3 7 13 2
4 4 16 3
5 0 20 4
6 -5 25 5
7 -11 31 6
8 -18 38 7
9 -26 46 8
10 -35 55 9
11 -45 65 10
This needs to be translated into a condition statement (the next row would not meet the conditions): the condition should fail on the 11th row.
Based on the matrix, I found that the condition is:
if ($bet <= (($owe + $available) / 2)) {}
Not very intuitive.
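A minimal Python sketch of that condition, checked against the matrix rows (the variable names mirror the $bet/$owe/$available placeholders; this only encodes the derived rule, not a full auction system):
def can_place_bet(bet, owe, available):
    # the rule derived from the matrix above
    return bet <= (owe + available) / 2

rows = [(1, 10, 10), (2, 9, 11), (3, 7, 13), (4, 4, 16), (5, 0, 20),
        (6, -5, 25), (7, -11, 31), (8, -18, 38), (9, -26, 46),
        (10, -35, 55), (11, -45, 65)]  # (bet, available, owe)

[can_place_bet(bet, owe, available) for bet, available, owe in rows]
Output:
[True, True, True, True, True, True, True, True, True, True, False]
so the check passes for the first ten rows and fails on the 11th, as required.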

How do I split a single column into multiple columns over multiple rows in SQL Server?

I need to create a stored procedure in SQL Server that accepts the following two parameters:
A select statement returning 1 column.
A number of columns.
The stored procedure would then run the select statement and return the result of the select statement with the values of the single column split into the given amount of columns per row.
Here are some examples:
exec stored_proc 'select id from table where id between 1 and 20', 5
The result of the select would be:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
The result of the stored procedure call would be:
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
Or the call could be:
exec stored_proc 'select id from table where id between 1 and 20', 10
Giving the result of:
1 2 3 4 5 6 7 8 9 10
11 12 13 14 15 16 17 18 19 20
Though I'm not sure you should be doing this in SQL, it can be done.
I think the way to do it would be to create a cursor and use its iterations to build a dynamic SQL statement.
During each iteration, add each piece of data as a new column (field), and when you reach the number of columns, add something like UNION SELECT to start the next row.
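The T-SQL itself isn't worked out above, but the reshaping the cursor loop has to perform is just chunking the single-column result into rows of n values. A Python sketch of that logic only (not the stored procedure):
def to_rows(values, n):
    # split a flat list into consecutive rows of n values each
    return [values[i:i + n] for i in range(0, len(values), n)]

to_rows(list(range(1, 21)), 5)
Output:
[[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15], [16, 17, 18, 19, 20]]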