Extract information to work with using pandas

I have this dataframe:
 #   Column                  Non-Null Count  Dtype
 0   nombre                  74 non-null     object
 1   fabricante              74 non-null     object
 2   calorias                74 non-null     int64
 3   proteina                74 non-null     int64
 4   grasa                   74 non-null     int64
 5   sodio                   74 non-null     int64
 6   fibra dietaria          74 non-null     float64
 7   carbohidratos           74 non-null     float64
 8   azúcar                  74 non-null     int64
 9   potasio                 74 non-null     int64
 10  vitaminas y minerales   74 non-null     int64
I am trying to extract information like this:
cereal_df.loc[cereal_df['fabricante'] == 'Kelloggs', 'sodio']
The output is good; that is exactly what I want to extract in this case:
2 260
3 140
6 125
16 290
17 90
19 140
21 220
24 125
25 200
26 0
27 240
37 170
38 170
43 150
45 190
46 220
47 170
50 320
55 210
57 0
59 290
63 70
64 230
Name: sodio, dtype: int64
That is what I need so far, but when I try to write a function like this (in order to get the confidence interval):
def valor_medio_intervalo(fabricante, variable, confianza):
    subconjunto = cereal_df.loc[cereal_df['fabricante'] == fabricante, cereal_df[variable]]
    inicio, final = sm.stats.DescrStatsW(subconjunto[variable]).zconfint_mean(alpha = 1 - confianza)
    return inicio, final
Then I run the function:
valor_medio_intervalo('Kelloggs', 'azúcar', 0.95)
And the output is:
KeyError Traceback (most recent call last)
<ipython-input-57-11420ac4d15f> in <module>()
1 #TEST_CELL
----> 2 valor_medio_intervalo('Kelloggs', 'azúcar', 0.95)
/usr/local/lib/python3.7/dist-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis, raise_missing)
1296 if missing == len(indexer):
1297 axis_name = self.obj._get_axis_name(axis)
-> 1298 raise KeyError(f"None of [{key}] are in the [{axis_name}]")
1299
1300 # We (temporarily) allow for some missing keys with .loc, except in
KeyError: "None of [Int64Index([ 6, 8, 5, 0, 8, 10, 14, 8, 6, 5, 12, 1, 9, 7, 13, 3, 2,\n 12, 13, 7, 0, 3, 10, 5, 13, 11, 7, 12, 12, 15, 9, 5, 3, 4,\n 11, 10, 11, 6, 9, 3, 6, 12, 3, 13, 6, 9, 7, 2, 10, 14, 3,\n 0, 0, 6, -1, 12, 8, 6, 2, 3, 0, 0, 0, 15, 3, 5, 3, 14,\n 3, 3, 12, 3, 3, 8],\n dtype='int64')] are in the [columns]"
I do not understand what is going on.
I appreciate your help or any hint.
Thanks in advance

Just got the answer examining the code. The KeyError comes from the .loc line: cereal_df[variable] passes the column's values (6, 8, 5, 0, ... — exactly the numbers in the error message) as column labels, when .loc only needs the column name. With that fixed, subconjunto is already a Series, so it goes into DescrStatsW directly instead of subconjunto[variable]:
def valor_medio_intervalo(fabricante, variable, confianza):
    subconjunto = cereal_df.loc[cereal_df['fabricante'] == fabricante, variable]
    inicio, final = sm.stats.DescrStatsW(subconjunto).zconfint_mean(alpha = 1 - confianza)
    return inicio, final
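For reference, here is a self-contained sketch of the corrected function. The tiny dataframe is made-up stand-in data (not the original cereal dataset), and sm is assumed to be statsmodels.api:
import pandas as pd
import statsmodels.api as sm

# made-up stand-in for cereal_df, only so the sketch runs end to end
cereal_df = pd.DataFrame({'fabricante': ['Kelloggs', 'Kelloggs', 'Kelloggs', 'Nestle'],
                          'azúcar': [6, 8, 5, 10]})

def valor_medio_intervalo(fabricante, variable, confianza):
    # select the column by name; the result is already a Series
    subconjunto = cereal_df.loc[cereal_df['fabricante'] == fabricante, variable]
    return sm.stats.DescrStatsW(subconjunto).zconfint_mean(alpha=1 - confianza)

print(valor_medio_intervalo('Kelloggs', 'azúcar', 0.95))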

Related

Obtain corresponding column based on another column that matches another dataframe

I want to find matching values across two data frames and return a third value.
For example, where cpg_symbol["Gene_Symbol"] matches diff_meth_kirp.index, I want to use the corresponding cpg_symbol["Composite_Element_REF"] as the new index.
My code returned an empty dataframe:
diff_meth_kirp.index = diff_meth_kirp.merge(cpg_symbol, left_on=diff_meth_kirp.index, right_on="Gene_Symbol")[["Composite_Element_REF"]]
Example:
diff_meth_kirp
        Hello  My  name  is
First       0   1     2   3
Second      4   5     6   7
Third       8   9    10  11
Fourth     12  13    14  15
Fifth      16  17    18  19
Sixth      20  21    22  23
cpg_symbol
  Composite_Element_REF Gene_Symbol
0                   cg1       First
1                   cg2       Third
2                   cg3       Fifth
3                   cg4     Seventh
4                   cg5       Ninth
5                   cg6       First
Expected output:
     Hello  My  name  is
cg1      0   1     2   3
cg2      8   9    10  11
cg3     16  17    18  19
cg6      0   1     2   3
Your code works well for me but you can try this version:
out = (diff_meth_kirp.merge(cpg_symbol.set_index('Gene_Symbol'),
                            left_index=True, right_index=True)
                     .set_index('Composite_Element_REF')
                     .rename_axis(None)
                     .sort_index())
print(out)
# Output
     Hello  My  name  is
cg1      0   1     2   3
cg2      8   9    10  11
cg3     16  17    18  19
cg6      0   1     2   3
Input dataframes:
data1 = {'Hello': {'First': 0, 'Second': 4, 'Third': 8, 'Fourth': 12, 'Fifth': 16, 'Sixth': 20},
         'My': {'First': 1, 'Second': 5, 'Third': 9, 'Fourth': 13, 'Fifth': 17, 'Sixth': 21},
         'name': {'First': 2, 'Second': 6, 'Third': 10, 'Fourth': 14, 'Fifth': 18, 'Sixth': 22},
         'is': {'First': 3, 'Second': 7, 'Third': 11, 'Fourth': 15, 'Fifth': 19, 'Sixth': 23}}
diff_meth_kirp = pd.DataFrame(data1)

data2 = {'Composite_Element_REF': {0: 'cg1', 1: 'cg2', 2: 'cg3', 3: 'cg4', 4: 'cg5', 5: 'cg6'},
         'Gene_Symbol': {0: 'First', 1: 'Third', 2: 'Fifth', 3: 'Seventh', 4: 'Ninth', 5: 'First'}}
cpg_symbol = pd.DataFrame(data2)
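Since both frames are keyed on the same labels here, DataFrame.join gives an equivalent formulation; a sketch using the same inputs as above (duplicate 'First' entries in cpg_symbol correctly produce both the cg1 and cg6 rows):
out = (diff_meth_kirp
       .join(cpg_symbol.set_index('Gene_Symbol'), how='inner')  # align other frame's index to ours
       .set_index('Composite_Element_REF')
       .rename_axis(None)
       .sort_index())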

Create nested array for all unique indices in a pandas MultiIndex DataFrame

# generate dummy data
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame({'subject': ['A'] * 10 + ['B'] * 10,
                   'trial': list(range(5)) * 4,
                   'value1': np.random.randint(0, 100, 20),
                   'value2': np.random.randint(0, 100, 20)})
df = df.set_index(['subject', 'trial']).sort_index()
print(df)
               value1  value2
subject trial
A       0          51       1
        0          20      75
        1          92      63
        1          82      57
        2          14      59
        2          86      21
        3          71      20
        3          74      88
        4          60      32
        4          74      48
B       0          87      90
        0          52      79
        1          99      58
        1           1      14
        2          23      41
        2          87      61
        3           2      91
        3          29      61
        4          21      59
        4          37      46
Notice: each subject / trial combination has multiple rows.
I want to create an array with those rows as nested dimensions.
My (admittedly ugly) transformation via a list:
tmp = list()
for idx in df.index.unique():
    tmp.append(df.loc[idx].to_numpy())
goal = np.array(tmp)
print(goal)
[[[51  1]
  [20 75]]
 ...
 [[21 59]
  [37 46]]]
Can you show me a native pandas / numpy way to do it (without the list crutch)?
To be able to generate a non-ragged numpy array, every index value must occur the same number of times. Then there is no need to loop over the groups: just find that count and reshape (note the axis order: groups first, then rows per group, then columns):
n = len(df) / (~df.index.duplicated()).sum()   # rows per unique (subject, trial) pair
assert n.is_integer()                          # guard against ragged groups
out = df.to_numpy().reshape(-1, int(n), df.shape[1])
Output:
array([[[51,  1],
        [20, 75]],

       [[92, 63],
        [82, 57]],

       [[14, 59],
        [86, 21]],

       [[71, 20],
        [74, 88]],

       [[60, 32],
        [74, 48]],

       [[87, 90],
        [52, 79]],

       [[99, 58],
        [ 1, 14]],

       [[23, 41],
        [87, 61]],

       [[ 2, 91],
        [29, 61]],

       [[21, 59],
        [37, 46]]])
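As a quick sanity check, the reshaped array should match the loop-built goal from the question:
# both construction routes should produce identical arrays on this data
assert np.array_equal(out, goal)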
You can use stack:
df.stack().values
Output:
array([[ 0, 25],
       [16, 11],
       [49, 87],
       [38, 77],
       [67,  6],
       [27, 27],
       [40,  0],
       [22, 81],
       [83, 89],
       [36, 55],
       [41,  1],
       [13, 74],
       [88, 61],
       [85, 73],
       [55, 66],
       [44, 82],
       [20, 30],
       [82, 69],
       [37, 71],
       [30, 16],
       [81, 96],
       [ 0, 56],
       [ 5, 99],
       [73, 86]], dtype=int64)

I try to set multiple values using DataFrame.loc in pandas but "Must have equal len keys and value when setting with an ndarray" appears

When I run the following code:
df = pd.DataFrame({'team': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'],
                   'a': [18, 22, 19, 14, 14, 11, 20, 28, 22],
                   'b': [5, 7, 7, 9, 12, 9, 9, 4, 8],
                   'c': [11, 8, 10, 6, 6, 5, 9, 12, 9]})
df.loc[(df.b>8), ["one", "two"]] = df.c, df.a
df.loc[(df.b<=8), ["one", "two"]] = df.b*5, df.c*10
print(df)
I got ValueError: Must have equal len keys and value when setting with an ndarray
What is wrong?
If I do:
df.loc[(df.b>8), ["one", "two"]] = df.c
df.loc[(df.b<=8), ["one", "two"]] = df.b
it works
You can't, due to index alignment: the right-hand side carries its own column labels, which do not match the target columns "one" and "two".
You would need to rename the columns so they align:
df.loc[(df.b>8), ["one", "two"]] = df[['c', 'a']].set_axis(['one', 'two'], axis=1)
df.loc[(df.b<=8), ["one", "two"]] = df[['b', 'c']].mul([5, 10]).set_axis(['one', 'two'], axis=1)
Alternative with numpy.where:
df[['one', 'two']] = np.where(np.tile(df['b'].gt(8), (2, 1)).T,
                              df[['c', 'a']], df[['b', 'c']].mul([5, 10]))
output:
  team   a   b   c  one  two
0    A  18   5  11   25  110
1    B  22   7   8   35   80
2    C  19   7  10   35  100
3    D  14   9   6    6   14
4    E  14  12   6    6   14
5    F  11   9   5    5   11
6    G  20   9   9    9   20
7    H  28   4  12   20  120
8    I  22   8   9   40   90
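Another common workaround (a sketch on the same example) is to convert the right-hand side to a plain NumPy array first; without labels, pandas skips alignment entirely and only checks shapes:
mask = df.b > 8
# .to_numpy() drops the index and column labels, so only the (rows, 2) shape must match
df.loc[mask, ["one", "two"]] = df.loc[mask, ["c", "a"]].to_numpy()
df.loc[~mask, ["one", "two"]] = df.loc[~mask, ["b", "c"]].mul([5, 10]).to_numpy()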

How can I conditionally remove elements from level 1 of a nested list given the value of a level 2 element?

Platform: Mathematica
I have a table of x and y coordinates belonging to individual connected paths (trajectories):
{{Trajectory, Frame, x, y}, {1, 0, 158.22, 11.519}, {1, 1, 159.132, 11.637}, ... {6649, 1439, 148.35, 316.144}}
in table format it would look like this:
Trajectory  Frame    x        y
------------------------------------
1           0        158.22   11.519
1           1        159.13   11.637
1           2        158.507  11.68
1           3        157.971  11.436
1           4        158.435  11.366
1           5        158.626  11.576
2           0        141      12       <- remove this row, path too short!
2           1        143      15       <- remove this row, path too short!
2           2        144      16       <- remove this row, path too short!
2           3        147      18       <- remove this row, path too short!
3           0        120      400
3           1        121      401
3           2        121      396
3           3        122      394
3           4        121      392
3           5        120      390
3           6        124      388
3           7        125      379
...
I want to remove any elements/rows where the total length of the trajectory is less than "n" frames/rows/elements (5 frames for this example). The list is ~80k elements long, and I want to remove all the rows containing trajectories under the specified threshold.
For the given example, trajectory 2 exists across only 4 frames, so I want to delete all rows for Trajectory 2.
I am new to Mathematica and I don't even know where to begin. I thought about building a list of the trajectory numbers whose Count[] is below the threshold, then conditionally deleting the matching rows with something like DeleteCases[], but I wasn't able to get far with my limited knowledge of the syntax.
I appreciate your help and look forward to a solution!
table = {{"Trajectory", "Frame", "x", "y"},
{1, 0, 158.22, 11.519}, {1, 1, 159.13, 11.637},
{1, 2, 158.507, 11.68}, {1, 3, 157.971, 11.436},
{1, 4, 158.435, 11.366}, {1, 5, 158.626, 11.576},
{2, 0, 141, 12}, {2, 1, 143, 15}, {2, 2, 144, 16},
{2, 3, 147, 18}, {3, 0, 120, 400}, {3, 1, 121, 401},
{3, 2, 121, 396}, {3, 3, 122, 394}, {3, 4, 121, 392},
{3, 5, 120, 390}, {3, 6, 124, 388}, {3, 7, 125, 379}};
traj = First /@ Rest[table];
n = 5;
under = First /@ Select[Tally[traj], Last[#] < n &];
discard = Flatten[Position[table[[All, 1]], #] & /@ under];
newtable = Delete[table, List /@ discard]
or alternatively, for the last two lines, this could be faster
discard = Position[table[[All, 1]], _?(MemberQ[under, #] &)];
newtable = Delete[table, discard]
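For readers arriving from the pandas questions in this thread: the same length-based filtering is, for comparison, a short groupby in pandas (a sketch, assuming the table has been loaded into a hypothetical DataFrame df with a 'Trajectory' column):
import pandas as pd

# keep only trajectories that span at least n rows (df is assumed, see lead-in)
n = 5
filtered = df.groupby('Trajectory').filter(lambda g: len(g) >= n)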

Sorting pandas dataframe by groups

I would like to sort a dataframe by certain priority rules.
I've achieved this in the code below but I think this is a very hacky solution.
Is there a more proper Pandas way of doing this?
import pandas as pd
import numpy as np
df = pd.DataFrame({"Primary Metric": [80, 100, 90, 100, 80, 100, 80, 90, 90, 100, 90, 90, 80, 90, 90, 80, 80, 80, 90, 90, 100, 80, 80, 100, 80],
                   "Secondary Metric Flag": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
                   "Secondary Value": [15, 59, 70, 56, 73, 88, 83, 64, 12, 90, 64, 18, 100, 79, 7, 71, 83, 3, 26, 73, 44, 46, 99, 24, 20],
                   "Final Metric": [222, 883, 830, 907, 589, 93, 479, 498, 636, 761, 851, 349, 25, 405, 132, 491, 253, 318, 183, 635, 419, 885, 305, 258, 924]})
Primary_List = list(np.unique(df['Primary Metric']))
Primary_List.sort(reverse=True)
df_sorted = pd.DataFrame()
for p in Primary_List:
    lol = df[df["Primary Metric"] == p]
    lol.sort_values(["Secondary Metric Flag"], ascending=False)  # note: no effect, result is not assigned
    pt1 = lol[lol["Secondary Metric Flag"] == 1].sort_values(by=['Secondary Value', 'Final Metric'], ascending=[False, False])
    pt0 = lol[lol["Secondary Metric Flag"] == 0].sort_values(["Final Metric"], ascending=False)
    df_sorted = df_sorted.append(pt1)
    df_sorted = df_sorted.append(pt0)
df_sorted
The priority rules are:
1. First sort by 'Primary Metric', then by 'Secondary Metric Flag'.
2. If 'Secondary Metric Flag' == 1, sort by 'Secondary Value', then by 'Final Metric'.
3. If it == 0, go straight to 'Final Metric'.
Appreciate any feedback.
You do not need a for loop or groupby here; just split the frame, sort each part, and concatenate:
df1 = df.loc[df['Secondary Metric Flag'] == 1].sort_values(by=['Primary Metric', 'Secondary Value', 'Final Metric'], ascending=False)
df0 = df.loc[df['Secondary Metric Flag'] == 0].sort_values(['Primary Metric', 'Final Metric'], ascending=False)
# the stable sort keeps each flag==1 block ahead of its flag==0 block within a 'Primary Metric' group
df = pd.concat([df1, df0]).sort_values('Primary Metric', ascending=False, kind='stable')
sorted with loc
def k(t):
    p, s, v, f = df.loc[t]
    return (-p, -s, -s * v, -f)

df.loc[sorted(df.index, key=k)]
Primary Metric Secondary Metric Flag Secondary Value Final Metric
9 100 1 90 761
5 100 1 88 93
1 100 1 59 883
3 100 1 56 907
23 100 1 24 258
20 100 0 44 419
13 90 1 79 405
19 90 1 73 635
7 90 1 64 498
11 90 1 18 349
10 90 0 64 851
2 90 0 70 830
8 90 0 12 636
18 90 0 26 183
14 90 0 7 132
15 80 1 71 491
21 80 1 46 885
17 80 1 3 318
24 80 0 20 924
4 80 0 73 589
6 80 0 83 479
22 80 0 99 305
16 80 0 83 253
0 80 0 15 222
12 80 0 100 25
sorted with itertuples
def k(t):
    _, p, s, v, f = t
    return (-p, -s, -s * v, -f)

idx, *tups = zip(*sorted(df.itertuples(), key=k))
pd.DataFrame(dict(zip(df, tups)), idx)
lexsort
p = df['Primary Metric']
s = df['Secondary Metric Flag']
v = df['Secondary Value']
f = df['Final Metric']
a = np.lexsort([-p, -s, -s * v, -f][::-1])
df.iloc[a]
Construct New DataFrame
df.mul([-1, -1, 1, -1]).assign(
    **{'Secondary Value': lambda d: d['Secondary Metric Flag'] * d['Secondary Value']}
).pipe(
    lambda d: df.loc[d.sort_values([*d]).index]
)
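The same ordering can also be expressed as one sort_values over a temporary key column (a sketch; 'key' is a hypothetical helper column name, not part of the original data):
# flag == 0 rows get key 0, so among themselves they fall back to 'Final Metric';
# flag == 1 rows sort by their actual 'Secondary Value' first
out = (df.assign(key=df['Secondary Metric Flag'] * df['Secondary Value'])
         .sort_values(['Primary Metric', 'Secondary Metric Flag', 'key', 'Final Metric'],
                      ascending=False)
         .drop(columns='key'))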