My dataset -
A B C
abc 0 12
ert 0 45
ghj 14 0
kli 56 78
qas 0 0
I want to find the values of A for which B and C are both non-zero.
Expected output-
A B C
kli 56 78
I tried-
aggr(
sum({<[B]={"<>0"},[C]={"<>0"}>}A)
,[B],[C])
It depends on where you are doing this, in the data load script or through set analysis on the front end, but the expression below works in both the load editor and a chart.
if("B" <> 0 and "C" <> 0, 'Non-Zero Value', 'Zero Value')
Example of what I created: (screenshot omitted)
Existing df :
Id status value
A1 clear 23
A1 in-process 50
A1 done 20
B1 start 2
B1 end 30
Expected df :
Id status value
A1 clear 0
A1 in-process 50
A1 done 20
B1 start 0
B1 end 30
I'm looking to replace the first value of each group with 0.
Use Series.duplicated to flag repeated Id values, invert the mask with ~, and assign with DataFrame.loc:
# ~duplicated() is True only on the first occurrence of each Id
df.loc[~df['Id'].duplicated(), 'value'] = 0
print(df)
Id status value
0 A1 clear 0
1 A1 in-process 50
2 A1 done 20
3 B1 start 0
4 B1 end 30
One approach could be as follows:
Compare each row of df.Id with the previous row by combining Series.shift with Series.ne. This returns a boolean Series that is True on the first row of each new Id value.
Next, use df.loc to select the rows where the mask is True and assign 0 to column value.
df.loc[df.Id.ne(df.Id.shift()), 'value'] = 0
print(df)
Id status value
0 A1 clear 0
1 A1 in-process 50
2 A1 done 20
3 B1 start 0
4 B1 end 30
N.B. this approach assumes that the "groups" in Id are sorted (as they indeed seem to be). If this is not the case, you could use df.sort_values('Id', inplace=True) first, but if that is necessary, the answer by @jezrael will surely be faster.
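If the Id groups can be interleaved, here is a sketch of an order-independent variant (not from the original answers, and equivalent in spirit to the duplicated approach above, using GroupBy.cumcount):
# cumcount() == 0 marks the first row of each Id group, in any row order
df.loc[df.groupby('Id').cumcount() == 0, 'value'] = 0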
# mask only the value column; masking the whole frame would zero Id and status too
df1['value'] = df1['value'].mask(~df1.Id.duplicated(), 0)
I'm doing cross-tabulation between two columns in the dataframe. Here's a sample from the columns:
column_1 column_2
A -8
B 95
A -93
D 11
C -62
D -14
A -55
C 66
B 76
D -49
I'm looking for code that returns subtotals for A, B, C and D. For instance, for A the subtotal will be -156 (-8 - 93 - 55 = -156).
I tried to do that with the pandas.crosstab() function:
pandas.crosstab(df[column_1], df[column_2], margins=True, margins_name=column_1).Total
Here's a sample of the output:
-271 -263 -241 -223 -221 -212 -207 -201 ... sum_column
A 1 0 1 0 0 1 0 0 ... ##
B 0 0 0 1 0 0 0 0 ... ##
C 0 0 0 0 1 0 0 1 ... ##
D 0 1 0 0 0 0 1 0 ... ##
The sum column consists of the sums of the boolean values in each row, instead of the subtotals for each of the four letters. I saw once that a boolean table can be used for calculations, but I'm quite sure the desired output can be achieved by changing the pandas.crosstab() command.
I'd be happy to get some ideas and thoughts from you.
Thanks.
If you'd simply like the totals by the individual categories in column_1 (A, B, C, D), a groupby and summation could be helpful! Call groupby on the column with your categories, then call sum on the result, like this:
df.groupby('column_1')['column_2'].sum()
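A quick check against the sample above (a sketch; the frame is rebuilt here from the posted rows):
import pandas as pd

df = pd.DataFrame({
    'column_1': ['A', 'B', 'A', 'D', 'C', 'D', 'A', 'C', 'B', 'D'],
    'column_2': [-8, 95, -93, 11, -62, -14, -55, 66, 76, -49],
})
print(df.groupby('column_1')['column_2'].sum())
# column_1
# A   -156
# B    171
# C      4
# D    -52
# Name: column_2, dtype: int64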
I want to fill null values of a column based on values in another column.
A B
1 21
0 21
0 21
1 25
1 28
0 28
My B value increases only if the A value is 1.
So I have some null values in column A, like:
A B
1 21
0 21
NAN 21
1 25
1 28
0 28
I want to fill this null value with 0 because the corresponding value of B didn't increase.
df['A'] = np.where((df['A'].isnull()) & (df['B'] ==df['B'].shift()),0,df['A'])
This isn't giving the correct results. Where am I going wrong?
loc might work better here. Note that df['A'] == np.nan is always False; use isna() to test for missing values:
# shift() compares B with the previous row; equal B means it did not increase
df.loc[df['A'].isna() & (df['B'] == df['B'].shift()), 'A'] = 0
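A self-contained sketch with the sample data (the frame is rebuilt here from the posted rows):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 0, np.nan, 1, 1, 0],
                   'B': [21, 21, 21, 25, 28, 28]})

# fill A with 0 where it is missing and B equals the previous row's B
df.loc[df['A'].isna() & (df['B'] == df['B'].shift()), 'A'] = 0
print(df)
#      A   B
# 0  1.0  21
# 1  0.0  21
# 2  0.0  21
# 3  1.0  25
# 4  1.0  28
# 5  0.0  28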
I have a data set where a number of rows are nearly identical, meaning they have the same values for all fields except column C.
A B C D ..... Z
0 50 'Ohio' 'Rep' 3 45
1 50 'Ohio' 'Dem' 3 45
2 40 'Kansas' 'Dem' 34 1
3 30 'Kansas' 'Dem' 45 2
4 55 'Texas' 'Rep' 2 7
....
38 55 'Texas' 'Dem' 2 7
I would like to identify all rows that are identical except for column C, but within column C I only want to find combinations of 'Rep' and 'Dem'. So I don't want 2 identical rows where column C is, for instance, 'Rep' and 'Rep'.
A B C D ......Z
0 50 'Ohio' 'Rep' 3 45
1 50 'Ohio' 'Dem' 3 45
4 55 'Texas' 'Rep' 2 7
38 55 'Texas' 'Dem' 2 7
I have used the duplicated method on all columns (but C), and that provides all the rows that are identical. However, it does not ensure that each duplicated row with 'Rep' is paired with exactly one duplicated row with 'Dem'.
Get all columns except C as a list with Index.difference, then sort_values by column C and convert each group's C values to a tuple. Finally, join back to the original, compare against ('Dem', 'Rep'), and filter by boolean indexing:
cols = df.columns.difference(['C']).tolist()
# after sorting by C, each qualifying group's values collapse to ('Dem', 'Rep')
s = df.sort_values('C').groupby(cols)['C'].apply(tuple).rename('m') == ('Dem','Rep')
df = df[df.join(s, on=cols)['m']]
Another solution is to compare by sets. Because repeated values per group, such as Rep,Dem,Dem, are possible, chain the set condition with a size check:
g = df.groupby(cols)['C']
m1 = g.transform('size') == 2
m2 = g.transform(lambda x: set(x) == set(['Rep','Dem']))
df = df[m1 & m2]
print(df)
A B C D Z
0 50 'Ohio' Rep 3 45
1 50 'Ohio' Dem 3 45
4 55 'Texas' Rep 2 7
38 55 'Texas' Dem 2 7
You can use duplicated with the argument keep set to False to create a mask for rows that are duplicated once column C is dropped, and use isin to keep rows whose C value is one of ['Rep', 'Dem']:
mask = df.drop(['C'], axis=1).duplicated(keep=False)
df[mask & df['C'].isin(['Rep', 'Dem'])].drop_duplicates()
A B C D Z
0 50 'Ohio' 'Rep' 3 45
1 50 'Ohio' 'Dem' 3 45
4 55 'Texas' 'Rep' 2 7
5 55 'Texas' 'Dem' 2 7
I have example values in a column like this:
values
-------
89
65
56
78
74
73
45
23
5
654
643
543
345
255
233
109
43
23
2
The values rise, then fall back toward 0, then rise again.
I need to compute the differences between consecutive rows in a new column, and also the sum of these differences (a cumulative sum) over all values. The values 56 and 5 start new runs, so they are differences from zero.
The sum is 819.
Example from the bottom: (23-2) + (43-23) + (109-43) + ... + (654-643) + (5) + (23-5) + ...
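For reference, the expected total can be reproduced with a short pandas sketch (not part of the original SQL thread; it assumes the rows are listed newest first, with resets counted from zero):
import pandas as pd

# values as listed in the question, newest first
v = pd.Series([89, 65, 56, 78, 74, 73, 45, 23, 5,
               654, 643, 543, 345, 255, 233, 109, 43, 23, 2])

nxt = v.shift(-1)                   # chronologically previous value
diff = (v - nxt).where(v > nxt, v)  # at a reset, count the value itself
diff.iloc[-1] = 0                   # the oldest row has nothing to compare to
print(int(diff.sum()))              # 819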
Okay, here is my try. However, you need to add an identity field (which I called "AddSequence") that starts at 1 for the first value ("2") and increments by one for each subsequent value.
SELECT SUM(C.Diff) FROM
(
SELECT CASE WHEN (A.[Value] - (SELECT [Value] FROM [TestValue] AS B WHERE B.[AddSequence]= A.[AddSequence]-1)) > 0
THEN (A.[Value] - (SELECT [Value] FROM [TestValue] AS D WHERE D.[AddSequence]= A.[AddSequence]-1))
ELSE 0
END AS Diff
FROM [TestValue] AS A
) AS C
The first solution I had neglected the fact that we have to start over whenever the difference is negative.
I think you are looking for something like:
SELECT SUM(a - b) AS sum_of_differences
FROM ...
I think you want this for the differences; I've tested it in SQLite:
SELECT CASE WHEN (v.value - val) < 0 THEN 0 ELSE (v.value - val) END AS differences
FROM v,
(SELECT rowid, value AS val FROM v WHERE rowid > 1) as next_val
WHERE v.rowid = next_val.rowid - 1
As for the sum:
SELECT SUM(differences) FROM
(
SELECT CASE WHEN (v.value - val) < 0 THEN 0 ELSE (v.value - val) END AS differences
FROM v,
(SELECT rowid, value AS val FROM v WHERE rowid > 1) AS next_val
WHERE v.rowid = next_val.rowid - 1
)
EDITED - BASED OFF OF YOUR QUESTION EDIT (T-SQL)
I don't know how you can do this without adding an Id.
If you add an Id, this gives the exact output you posted before your edit. There's probably a better way, but this is quick and dirty for a one-time shot, using a SELF JOIN. differences was the name of your new column originally.
UPDATE A
SET differences = CASE WHEN A.[values] > B.[Values] THEN A.[values] - B.[Values]
ELSE A.[values] END
FROM SO_TTABLE A
JOIN SO_TTABLE B ON A.ID = (B.ID - 1)
OUTPUT
SELECT [values], differences FROM SO_TTABLE
[values] differences
------------------------
89 24
65 9
56 56
78 4
74 1
73 28
45 22
23 18
5 5
654 11
643 100
543 198
345 90
255 22
233 124
109 66
43 20
23 21
2 0