How to change the structure of a pandas DataFrame

I have this structure (image "Structure 1") and I want to transform it into this one (image "Structure 2").

Given df1,

import pandas as pd

df1 = pd.DataFrame({'result1': [True, True, False],
                    'result2': [False, True, False],
                    'result3': [False, True, True]}, index=[1, 2, 3])

If 'id' is in the index, you'll need to reset_index first:

df1 = df1.reset_index()

# melt to long format, then keep only the rows where value is True
df1.melt('index').query('value')
Output:
   index variable  value
0      1  result1   True
1      2  result1   True
4      2  result2   True
7      2  result3   True
8      3  result3   True
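
If your target layout wants an 'id' column rather than 'index', a small variant of the same idea works; the 'id' name is an assumption about Structure 2, and this starts again from the original df1 with the ids in the index:

# a minimal sketch, assuming Structure 2 labels the first column 'id'
out = (df1.reset_index()
          .rename(columns={'index': 'id'})
          .melt('id')
          .query('value'))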

Drop pandas column with constant alphanumeric values

I have a dataframe df that contains around 2 million records.
Some of the columns contain only alphanumeric values (e.g. "wer345", "gfer34", "123fdst").
Is there a pythonic way to drop those columns (e.g. using isalnum())?
Apply Series.str.isalnum column-wise to mask all the alphanumeric values of the DataFrame. Then use DataFrame.all to find the columns that only contain alphanumeric values. Invert the resulting boolean Series to select only the columns that contain at least one non-alphanumeric value.
# True for columns where every value is alphanumeric
is_alnum_col = df.apply(lambda col: col.str.isalnum()).all()
# keep only the columns with at least one non-alphanumeric value
res = df.loc[:, ~is_alnum_col]
Example
import pandas as pd

df = pd.DataFrame({
    'a': ['aas', 'sd12', '1232'],
    'b': ['sdds', 'nnm!!', 'ab-2'],
    'c': ['sdsd', 'asaas12', '12.34'],
})

is_alnum_col = df.apply(lambda col: col.str.isalnum()).all()
res = df.loc[:, ~is_alnum_col]
Output:
>>> df
      a      b        c
0   aas   sdds     sdsd
1  sd12  nnm!!  asaas12
2  1232   ab-2    12.34

>>> df.apply(lambda col: col.str.isalnum())
      a      b      c
0  True   True   True
1  True  False   True
2  True  False  False

>>> is_alnum_col
a     True
b    False
c    False
dtype: bool

>>> res
       b        c
0   sdds     sdsd
1  nnm!!  asaas12
2   ab-2    12.34
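
If the frame may also contain non-string columns, col.str.isalnum() will raise on them. A minimal guard (an assumption beyond the original answer) is to restrict the check to the object (string) columns:

# a minimal sketch, assuming some columns may be non-string dtypes:
# run the isalnum test only on object columns, then drop the
# fully-alphanumeric ones
str_cols = df.select_dtypes(include='object').columns
all_alnum = df[str_cols].apply(lambda col: col.str.isalnum()).all()
res = df.drop(columns=all_alnum[all_alnum].index)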

Apply "any" or "all" function row-wise to arbitrary number of Boolean columns in Julia DataFrames.jl

Suppose I have a dataframe with multiple boolean columns representing certain conditions:
df = DataFrame(
    id = ["A", "B", "C", "D"],
    cond1 = [true, false, false, false],
    cond2 = [false, false, false, false],
    cond3 = [true, false, true, false]
)

   id  cond1  cond2  cond3
1  A       1      0      1
2  B       0      0      0
3  C       0      0      1
4  D       0      0      0
Now suppose I want to identify rows where any of these conditions are true, i.e. rows "A" and "C". It is easy to do this explicitly:
df[:, :all] = df.cond1 .| df.cond2 .| df.cond3
But how can this be done when there are an arbitrary number of conditions, for example something like:
df[:, :all] = any.([ df[:, Symbol("cond$i")] for i in 1:3 ])
The above fails with DimensionMismatch("tried to assign 3 elements to 4 destinations") because the any function is being applied column-wise, rather than row-wise. So the real question is: how to apply any row-wise to multiple Boolean columns in a dataframe?
The ideal output should be:
   id  cond1  cond2  cond3  all
1  A       1      0      1    1
2  B       0      0      0    0
3  C       0      0      1    1
4  D       0      0      0    0
Here is one way to do it:
julia> df = DataFrame(
           id = ["A", "B", "C", "D", "E"],
           cond1 = [true, false, false, false, true],
           cond2 = [false, false, false, false, true],
           cond3 = [true, false, true, false, true]
       )
5×4 DataFrame
 Row │ id      cond1  cond2  cond3
     │ String  Bool   Bool   Bool
─────┼─────────────────────────────
   1 │ A        true  false   true
   2 │ B       false  false  false
   3 │ C       false  false   true
   4 │ D       false  false  false
   5 │ E        true   true   true
julia> transform(df, AsTable(r"cond") .=> ByRow.([maximum, minimum]) .=> [:any, :all])
5×6 DataFrame
 Row │ id      cond1  cond2  cond3  any    all
     │ String  Bool   Bool   Bool   Bool   Bool
─────┼───────────────────────────────────────────
   1 │ A        true  false   true   true  false
   2 │ B       false  false  false  false  false
   3 │ C       false  false   true   true  false
   4 │ D       false  false  false  false  false
   5 │ E        true   true   true   true   true
Note that it is quite fast even for very wide tables:
julia> df = DataFrame(rand(Bool, 10_000, 10_000), :auto);
julia> @time transform(df, AsTable(r"x") .=> ByRow.([maximum, minimum]) .=> [:any, :all]);
0.059275 seconds (135.41 k allocations: 103.038 MiB)
In the examples I have used a regex column selector, but of course you can use any column selector you like.

Consolidating columns by the number before the decimal point in the column name

I have the following dataframe (three example columns below):
import pandas as pd

array = {'25.2': [False, True, False], '25.4': [False, False, True], '27.78': [True, False, True]}
df = pd.DataFrame(array)

    25.2   25.4  27.78
0  False  False   True
1   True  False  False
2  False   True   True
I want to create a new dataframe with consolidated column names, i.e. merge 25.2 and 25.4 into a new 25 column. If one of the values in the separate columns is True, then the value in the new column is True.
Expected output:
      25     27
0  False   True
1   True  False
2   True   True
Any ideas?
Use rename() + groupby() + sum():

df = (df.rename(columns=lambda x: x.split('.')[0])
        .groupby(axis=1, level=0).sum().astype(bool))

Or in 2 steps:

df.columns = [x.split('.')[0] for x in df]
# OR
# df.columns = df.columns.str.replace(r'\.\d+', '', regex=True)
df = df.groupby(axis=1, level=0).sum().astype(bool)
output:
      25     27
0  False   True
1   True  False
2   True   True
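Note that groupby(axis=1) is deprecated in recent pandas versions; a transpose-based variant of the same idea (a sketch, not part of the original answer):

# group the transposed rows by the renamed labels, then transpose back
df = (df.rename(columns=lambda x: x.split('.')[0])
        .T.groupby(level=0).sum().astype(bool).T)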
Note: if the column labels are numeric rather than strings, you can truncate them with np.floor() (or an int cast) instead of split(), as in the next approach.
Another way:
>>> import numpy as np
>>> df.T.groupby(np.floor(df.columns.astype(float))).sum().astype(bool).T
    25.0   27.0
0  False   True
1   True  False
2   True   True
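
If the float 25.0/27.0 headers bother you, casting the floored labels to int restores the expected 25/27 names (a small follow-up sketch, reusing the np import above):

# same grouping, with integer column labels
df.T.groupby(np.floor(df.columns.astype(float)).astype(int)).sum().astype(bool).T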

How to get unique values counts of a column for other boolean columns?

I have a dataframe containing a set of columns, all of which are boolean. I would like to get a unique count of another column (an ID column) for each of these columns (and their respective boolean values).
ex:
import pandas as pd

data = [
    ["ABC", True, False],
    ["DEF", False, False],
    ["GHI", False, True],
]
df = pd.DataFrame(data, columns=["ID", "CONTAINS_Y", "CONTAINS_X"])
df.groupby("CONTAINS_Y")["ID"].nunique().reset_index()
yields:
   CONTAINS_Y  ID
0       False   2
1        True   1
I would like a command that yields:
   CONTAINS_Y  ID
0       False   2
1        True   1

   CONTAINS_X  ID
0       False   2
1        True   1
Is there a way to group all boolean columns to get unique counts? Or does it have to be done individually?
There are other similar questions (Pandas count(distinct) equivalent) but the solutions haven't worked for what I am trying to achieve.
First melt, then groupby:
df.melt(id_vars='ID').groupby(['variable', 'value']).nunique()
#                   ID
# variable   value
# CONTAINS_X False   2
#            True    1
# CONTAINS_Y False   2
#            True    1
melt reshapes your DataFrame from a wide to a long format so that it works with a single groupby:
df.melt(id_vars='ID')

#     ID    variable  value
# 0  ABC  CONTAINS_Y   True
# 1  DEF  CONTAINS_Y  False
# 2  GHI  CONTAINS_Y  False
# 3  ABC  CONTAINS_X  False
# 4  DEF  CONTAINS_X  False
# 5  GHI  CONTAINS_X   True
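
If you want the counts back as a flat table like the one in the question, a small follow-up sketch (selecting the ID column before reset_index; not part of the original answer):

counts = (df.melt(id_vars='ID')
            .groupby(['variable', 'value'])['ID']
            .nunique()
            .reset_index())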

Drawing bar charts from boolean fields

I have three boolean fields, whose counts are shown below.
I want to draw a bar chart that has:

Offline_RetentionByTime with 37528
Offline_RetentionByCount with 29640
Offline_RetentionByCapacity with 3362

How can I achieve that?
I think you can apply value_counts to create a new DataFrame df1 and then use DataFrame.plot.bar:
import pandas as pd

df = pd.DataFrame({'Offline_RetentionByTime': [True, False, True, False],
                   'Offline_RetentionByCount': [True, False, False, True],
                   'Offline_RetentionByCapacity': [True, True, True, False]})
print (df)

   Offline_RetentionByCapacity Offline_RetentionByCount Offline_RetentionByTime
0                         True                     True                    True
1                         True                    False                   False
2                         True                    False                    True
3                        False                     True                   False
df1 = df.apply(pd.value_counts)
print (df1)

       Offline_RetentionByCapacity  Offline_RetentionByCount  Offline_RetentionByTime
True                             3                         2                        2
False                            1                         2                        2
df1.plot.bar()
If you need to plot only the True values, select them with loc:
df1.loc[True].plot.bar()
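
Since booleans sum as 1 and 0, a shorter route to the True-only chart (a variant sketch, not part of the original answer) is to sum each column directly:

# True counts per column, without value_counts
df.sum().plot.bar()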