check similarity of some values in data frame based on id column

I have a df like this:
ix y1 y2 id
ix1 X X AP10579
ix2 E E AP17998
ix3 C C AP283716
ix4 C C AP283716
ix5 E E AP17998
ix6 T T AP21187
ix7 X Z AP10579
ix8 T K AP21187
ix9 E E AP12457
ix10 C C Ap87930
In the id column, some ids occur twice (for example, ix1 and ix7 share an id, as do ix2 and ix5, and so on), while other ids are unique.
For each id that occurs twice, I want to check whether y1+y2 is the same in both rows.
If it is, one of the two rows should be moved to a new df.
Every row with a unique id should be moved as well.
So I should end up with a new df, df_new, like this:
ix y1 y2 id
ix2 E E AP17998
ix3 C C AP283716
ix9 E E AP12457
ix10 C C Ap87930
Any suggestions are appreciated.
df = {
    'ix': ['ix1','ix2','ix3','ix4','ix5','ix6','ix7','ix8','ix9','ix10'],
    'y1': ['X','E','C','C','E','T','X','T', 'E','C'],
    'y2': ['X','E','C','C','E','T','Z','K', 'E','C'],
    'id': ['AP10579','AP17998','AP283716','AP283716','AP17998','AP21187','AP10579','AP21187', 'AP12457', 'Ap87930']
}

This is a possible approach:
import pandas as pd
df = pd.DataFrame({
    'ix': ['ix1','ix2','ix3','ix4','ix5','ix6','ix7','ix8','ix9','ix10'],
    'y1': ['X','E','C','C','E','T','X','T', 'E','C'],
    'y2': ['X','E','C','C','E','T','Z','K', 'E','C'],
    'id': ['AP10579','AP17998','AP283716','AP283716','AP17998','AP21187','AP10579','AP21187', 'AP12457', 'Ap87930']
})
def filter_df(g):
    # keep one row when the id is unique, or when y1 and y2 each hold a single value in the group
    if len(g) == 1:
        return g.iloc[0]
    if g.y1.unique().size + g.y2.unique().size == 2:
        return g.iloc[0]
# groups that fail both checks return None, turn into NaN rows, and are removed by dropna()
df.groupby('id')[['ix', 'y1', 'y2']].apply(filter_df).dropna().reset_index()
output:
id ix y1 y2
0 AP12457 ix9 E E
1 AP17998 ix2 E E
2 AP283716 ix3 C C
3 Ap87930 ix10 C C
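
For comparison, here is a minimal sketch of the same check without a custom group function. It assumes, as in the question, that two rows "match" when their concatenated y1+y2 strings are equal; the names pair, n_distinct and df_new are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'ix': ['ix1', 'ix2', 'ix3', 'ix4', 'ix5', 'ix6', 'ix7', 'ix8', 'ix9', 'ix10'],
    'y1': ['X', 'E', 'C', 'C', 'E', 'T', 'X', 'T', 'E', 'C'],
    'y2': ['X', 'E', 'C', 'C', 'E', 'T', 'Z', 'K', 'E', 'C'],
    'id': ['AP10579', 'AP17998', 'AP283716', 'AP283716', 'AP17998',
           'AP21187', 'AP10579', 'AP21187', 'AP12457', 'Ap87930'],
})

# y1+y2 per row, as described in the question
pair = df['y1'] + df['y2']

# Number of distinct y1+y2 values within each id (always 1 for a unique id)
n_distinct = pair.groupby(df['id']).transform('nunique')

# Keep ids whose rows all agree, then keep only the first row of each id
df_new = df[n_distinct == 1].drop_duplicates('id').reset_index(drop=True)
print(df_new)
```

Unlike the groupby result above, this keeps the original row order and column order of df.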

Related

Create multiple-columns pandas dataframe from list

I can't figure out how to create a multi-column pandas DataFrame from a list. Some lines begin with the character ">", and I want those lines to become column headers. The number of lines after each header varies.
My list:
>header
a
b
>header2
c
d
e
f
>header3
g
h
i
Dataframe I want to create:
>header1 >header2 >header3
a c g
b d h
e i
f

Simply iterate through the lines and treat those that start with '>' as headers. The challenge, though, is creating a df from a dictionary of lists of unequal length.
import pandas as pd
# The given list
lines = [">header", "a", "b", ">header2", "c", "d", "e", "f", ">header3", "g", "h", "i"]
# Iterate through the lines and create a sublist for each header
data = {}
column = ''
for line in lines:
    if line.startswith('>'):
        column = line
        data[column] = []
        continue
    data[column].append(line)
# Create the DataFrame: one row per header, then transpose so headers become columns
df = pd.DataFrame.from_dict(data,orient='index').T
output:
>header >header2 >header3
0 a c g
1 b d h
2 None e i
3 None f None
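
Another way to handle the unequal lengths, sketched here as an assumption rather than as part of the answer above, is to wrap each sublist in a pd.Series. pandas then aligns the columns on their integer index and fills the gaps with NaN.

```python
import pandas as pd

lines = [">header", "a", "b", ">header2", "c", "d", "e", "f", ">header3", "g", "h", "i"]

# Collect the values that follow each header line
data = {}
column = None
for line in lines:
    if line.startswith('>'):
        column = line
        data[column] = []
    else:
        data[column].append(line)

# Each Series gets its own 0..n-1 index; shorter columns are padded with NaN
df = pd.DataFrame({name: pd.Series(values) for name, values in data.items()})
print(df)
```

This sketch assumes the first line is always a header; a value that appears before any header would raise a KeyError.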

I'm assuming you have this list as text. You can split it with str.splitlines() and then construct the DataFrame with the help of itertools.zip_longest:
from itertools import zip_longest
import pandas as pd
text = '''\
>header
a
b
>header2
c
d
e
f
>header3
g
h
i'''
current, data = None, {}
for line in text.splitlines():
    line = line.strip()
    if line.startswith('>'):
        current = line
    else:
        data.setdefault(current, []).append(line)
# zip_longest pads the shorter columns with '' so every row has the same length
df = pd.DataFrame(zip_longest(*data.values(), fillvalue=''), columns=list(data))
print(df)
Prints:
>header >header2 >header3
0 a c g
1 b d h
2 e i
3 f

get values of previous rows as list

How to get the values of the previous three rows in a new column?
data = { 'foo':['a','b','c','d','e','f','g']}
df = pd.DataFrame(data)
df = some_function(df)
print(df)
foo bar
1 a None
2 b None
3 c None
4 d ['a','b','c']
5 e ['b','c','d']
6 f ['c','d','e']
7 g ['d','e','f']
I could use the following method, which adds shifted columns and then merges them into a new one, but I wonder if there is a better way to do this:
def some_function_v1(df):
    df['foo1'] = df.foo.shift(1)
    df['foo2'] = df.foo.shift(2)
    df['foo3'] = df.foo.shift(3)
    # oldest value first, to match the order in the desired output
    df['bar'] = df.apply(lambda x: [x['foo3'], x['foo2'], x['foo1']], axis=1)
    df = df.drop(columns=['foo1', 'foo2', 'foo3'])
    return df

Try sliding_window_view on foo to create a new DataFrame with the grouped lists:
window = 3
bar_df = pd.DataFrame({
    'bar': np.lib.stride_tricks.sliding_window_view(df['foo'], window).tolist()
})
Offset the index:
bar_df.index += window
bar_df:
bar
3 [a, b, c]
4 [b, c, d]
5 [c, d, e]
6 [d, e, f]
7 [e, f, g]
Then join back to the original frame:
out = df.join(bar_df)
out:
foo bar
0 a NaN
1 b NaN
2 c NaN
3 d [a, b, c]
4 e [b, c, d]
5 f [c, d, e]
6 g [d, e, f]
Complete Working Example:
import numpy as np
import pandas as pd
data = {'foo': ['a', 'b', 'c', 'd', 'e', 'f', 'g']}
df = pd.DataFrame(data)
window = 3
bar_df = pd.DataFrame({
    'bar': np.lib.stride_tricks.sliding_window_view(df['foo'], window).tolist()
})
bar_df.index += window
out = df.join(bar_df)
print(out)

We can use a list comprehension to generate the sliding windows:
n, v = 3, df['foo'].to_numpy()
df['bar'] = [None] * n + [v[i: i + n] for i in range(len(v) - n)]
An alternative approach uses sliding_window_view:
n, v = 3, df['foo'].to_numpy()
df['bar'] = [None] * n + list(np.lib.stride_tricks.sliding_window_view(v[:-1], n))
foo bar
0 a None
1 b None
2 c None
3 d [a, b, c]
4 e [b, c, d]
5 f [c, d, e]
6 g [d, e, f]
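
The list-comprehension idea can also be wrapped in a small helper that works on plain Python lists, so the cells hold lists rather than NumPy arrays. This is only a sketch; previous_values is a hypothetical name, not a pandas function.

```python
import pandas as pd

def previous_values(s, n):
    """Return, for each position, the n preceding values as a list (None if fewer exist)."""
    vals = s.tolist()
    return [vals[i - n:i] if i >= n else None for i in range(len(vals))]

df = pd.DataFrame({'foo': ['a', 'b', 'c', 'd', 'e', 'f', 'g']})
df['bar'] = previous_values(df['foo'], 3)
print(df)
```

Because the window size is a parameter, the same helper covers the "previous three rows" case and any other window length.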

You can use shift with zip to shift and merge lists element-wise instead of creating new columns:
df['bar'] = pd.Series(zip(df.foo.shift(3), df.foo.shift(2), df.foo.shift(1))).apply(lambda x:None if np.nan in x else list(x))
Here's a function to make the shift dynamic:
n_shift = lambda s, n: pd.Series(zip(*[s.shift(x) for x in range(n,0,-1)])).apply(lambda x:None if np.nan in x else list(x))
df['bar'] = n_shift(df.foo, 3)
Output:
foo bar
1 a None
2 b None
3 c None
4 d ['a','b','c']
5 e ['b','c','d']
6 f ['c','d','e']
7 g ['d','e','f']

How to split dict in dataframe to many columns

I'm working with a DataFrame. How can I split a column that holds a list of dicts into separate columns?
I'm a junior data processor and have already tried several approaches without success.
import pandas as pd
l = [{'a':1,'b':2},{'a':3,'b':4}]
data = [{'key1':'x','key2':'y','value':l}]
df = pd.DataFrame(data)
data1 = {'key1':['x','x'],'key2':['y','y'],'a':[1,3],'b':[2,4]}
df1 = pd.DataFrame(data1)
df1 is what I need.

Comprehension
d1 = df.drop('value', axis=1)
co = d1.columns
d2 = df.value
pd.DataFrame([
    {**dict(zip(co, tup)), **d}
    for tup, D in zip(zip(*map(d1.get, d1)), d2)
    for d in D
])
a b key1 key2
0 1 2 x y
1 3 4 x y
Explode
See post on explode
This is a tad different but close:
import numpy as np
# repeat each row once per dict in its 'value' list
idx = df.index.repeat(df.value.str.len())
val = np.concatenate(df.value).tolist()
d0 = pd.DataFrame(val)
df.drop('value', axis=1).loc[idx].reset_index(drop=True).join(d0)
a b key1 key2
0 1 2 x y
1 3 4 x y
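
On pandas versions that provide DataFrame.explode with the ignore_index parameter (an assumption about the installed version), the same reshaping can be sketched in two steps: explode the list column into one dict per row, then expand those dicts into columns.

```python
import pandas as pd

l = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]
df = pd.DataFrame([{'key1': 'x', 'key2': 'y', 'value': l}])

# One row per dict; key1/key2 are repeated alongside each dict
exploded = df.explode('value', ignore_index=True)

# Turn the dicts into their own columns and attach them to the key columns
expanded = pd.DataFrame(exploded['value'].tolist())
out = exploded.drop(columns='value').join(expanded)
print(out)
```

The result has the columns key1, key2, a, b, which matches df1 from the question.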

Adding dataframes that share the same columns, and expand one more dimension

I want to sum two dataframes that share the same columns
df1=pd.DataFrame(np.random.randn(3,3),index=list("ABC"),columns=list("XYZ"))
df2=pd.DataFrame(np.random.randn(3,3),index=list("abc"),columns=list("XYZ"))
My desired result would be:
X Y Z
A a
A b
A c
....
C c
How can I achieve this?
I have tried the following but didn't get what I wanted.
df1.add(df2,axis="columns")

You can first build a MultiIndex from both indexes with MultiIndex.from_product, then reindex each DataFrame to it, matching df1 on level 0 and df2 on level 1:
import numpy as np
import pandas as pd
np.random.seed(45)
df1=pd.DataFrame(np.random.randn(3,3),index=list("ABC"),columns=list("XYZ"))
df2=pd.DataFrame(np.random.randn(3,3),index=list("abc"),columns=list("XYZ"))
mux = pd.MultiIndex.from_product([df1.index, df2.index])
df1 = df1.reindex(mux, level=0)
df2 = df2.reindex(mux, level=1)
print (df1)
X Y Z
A a 0.026375 0.260322 -0.395146
b 0.026375 0.260322 -0.395146
c 0.026375 0.260322 -0.395146
B a -0.204301 -1.271633 -2.596879
b -0.204301 -1.271633 -2.596879
c -0.204301 -1.271633 -2.596879
C a 0.289681 -0.873305 0.394073
b 0.289681 -0.873305 0.394073
c 0.289681 -0.873305 0.394073
print (df2)
X Y Z
A a 0.935106 -0.015685 0.259596
b -1.473314 0.801927 -1.750752
c -0.495052 -1.008601 0.025244
B a 0.935106 -0.015685 0.259596
b -1.473314 0.801927 -1.750752
c -0.495052 -1.008601 0.025244
C a 0.935106 -0.015685 0.259596
b -1.473314 0.801927 -1.750752
c -0.495052 -1.008601 0.025244
df3 = df1.add(df2,axis="columns")
print (df3)
X Y Z
A a 0.961480 0.244637 -0.135550
b -1.446939 1.062248 -2.145898
c -0.468677 -0.748279 -0.369901
B a 0.730805 -1.287317 -2.337283
b -1.677615 -0.469706 -4.347631
c -0.699353 -2.280233 -2.571634
C a 1.224786 -0.888989 0.653669
b -1.183633 -0.071378 -1.356680
c -0.205371 -1.881905 0.419317
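
The same pairwise sum can also be sketched with NumPy broadcasting instead of reindexing. This assumes both frames have identical columns in the same order; the random values are generated with a different generator than above, so the numbers will not match the printed tables.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(45)
df1 = pd.DataFrame(rng.standard_normal((3, 3)), index=list("ABC"), columns=list("XYZ"))
df2 = pd.DataFrame(rng.standard_normal((3, 3)), index=list("abc"), columns=list("XYZ"))

# Every (df1 row, df2 row) pair, in the same order as the reshaped values below
mux = pd.MultiIndex.from_product([df1.index, df2.index])

# Shape (3, 1, 3) + shape (1, 3, 3) broadcasts to (3, 3, 3): element [i, j] is df1 row i + df2 row j
values = df1.to_numpy()[:, None, :] + df2.to_numpy()[None, :, :]

out = pd.DataFrame(values.reshape(-1, df1.shape[1]), index=mux, columns=df1.columns)
print(out)
```

Each row (i, j) of out holds the sum of row i of df1 and row j of df2.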

IIUC, here's one way: merge on a temporary key k to get every index combination, then group the columns by name and sum.
In [192]: (df1.reset_index().assign(k='k').merge(df2.assign(k='k').reset_index(), on=['k'])
.set_index(['index_x', 'index_y'])
.groupby(lambda x:x.split('_')[0], axis=1)
.sum()
.drop('k', 1))
Out[192]:
X Y Z
index_x index_y
A a -2.281005 -1.606760 -0.853813
b -2.683788 -2.487876 2.471459
c -0.333471 -2.155734 1.688883
B a -0.790146 0.074629 -2.368680
b -1.192928 -0.806487 0.956592
c 1.157388 -0.474345 0.174017
C a -2.114412 0.100412 -2.352661
b -2.517195 -0.780704 0.972611
c -0.166878 -0.448562 0.190036
Details
In [193]: df1
Out[193]:
X Y Z
A -1.087129 -1.264522 1.147618
B 0.403731 0.416867 -0.367249
C -0.920536 0.442650 -0.351229
In [194]: df2
Out[194]:
X Y Z
a -1.193876 -0.342237 -2.001431
b -1.596659 -1.223354 1.323841
c 0.753658 -0.891211 0.541265
In [196]: (df1.reset_index().assign(k='k').merge(df2.assign(k='k').reset_index(), on=['k'])
.set_index(['index_x', 'index_y']))
Out[196]:
X_x Y_x Z_x k X_y Y_y Z_y
index_x index_y
A a -1.087129 -1.264522 1.147618 k -1.193876 -0.342237 -2.001431
b -1.087129 -1.264522 1.147618 k -1.596659 -1.223354 1.323841
c -1.087129 -1.264522 1.147618 k 0.753658 -0.891211 0.541265
B a 0.403731 0.416867 -0.367249 k -1.193876 -0.342237 -2.001431
b 0.403731 0.416867 -0.367249 k -1.596659 -1.223354 1.323841
c 0.403731 0.416867 -0.367249 k 0.753658 -0.891211 0.541265
C a -0.920536 0.442650 -0.351229 k -1.193876 -0.342237 -2.001431
b -0.920536 0.442650 -0.351229 k -1.596659 -1.223354 1.323841
c -0.920536 0.442650 -0.351229 k 0.753658 -0.891211 0.541265

Edit text in file(UTF16)

I want to replace one word in a text file (the file extension is not .txt).
The file encoding is UTF-16.
A sample of the text:
I D = " f f 0 3 4 a 9 2 - d d 9 f - 4 3 7 4 - a 8 a d - f 5 5 4 0 0 2 a 4 1 9 b " I S S U E _ D A T E = " 2 0 1 7 - 0 2 - 1 6 T 1 7 : 2 9 : 1 8 . 9 7 0 2 2 9 4 Z " S E Q U E N C E = " 0 " M A N A G I N G _ A P P L I C A T I O N _ T O K E N = " " > < L I C E N S E P U B L I C _ I D = " 3 A A - U J F - 8 K P " U S E R N A M E = " N d a G 6 Z T w u v I X Z B i t h 8 g o d d Q x E r x 0 + O g M c t 0 2 3 f X K O E w = " P A S S W O R D = " F 9 b n 6 b v w l f I 5 Z A 2 t h M h 9 d d s x Q L w = " T Y P E = " T R I A L " F L A G S = " 4 " D I S P L A Y _ N A M E =
I want to change T R I A L to another word.

It's not too hard to modify your text file. Use IO.File.ReadAllText to read the file into a string, then String.Replace(oldValue As String, newValue As String) to change the word, then IO.File.WriteAllText to save the string back to the file. This should work as long as the file isn't open in another program, regardless of its extension.
An example, to help you, could be something such as this:
' Assumption: the file is UTF-16 LE, so the spaces in the sample are display artifacts and the decoded text contains "TRIAL"
Dim utf16 As System.Text.Encoding = System.Text.Encoding.Unicode
Dim myFileContents As String = IO.File.ReadAllText("Path\To\My\File\File.extension", utf16)
myFileContents = myFileContents.Replace("TRIAL", "Some other word")
IO.File.WriteAllText("Path\To\My\File\File.extension", myFileContents, utf16)
Modify this to suit your situation; it is only a basic implementation. Note that String.Replace() changes every occurrence of the word, not just the first. Passing the encoding explicitly matters on the write, because WriteAllText without an encoding argument saves the file as UTF-8.
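
For readers working in Python rather than VB.NET, the same read, replace and write cycle might look like the sketch below. It assumes the file is UTF-16 with a byte order mark; the file name, sample text and replace_in_utf16_file are hypothetical and exist only to make the example runnable.

```python
import os
import tempfile

def replace_in_utf16_file(path, old, new):
    """Read a UTF-16 file, replace every occurrence of old with new, and write it back as UTF-16."""
    # newline='' keeps the original line endings untouched in both directions
    with open(path, encoding='utf-16', newline='') as fh:
        text = fh.read()
    with open(path, 'w', encoding='utf-16', newline='') as fh:
        fh.write(text.replace(old, new))

# Build a small UTF-16 sample file so the sketch runs on its own
path = os.path.join(tempfile.mkdtemp(), 'sample.dat')
with open(path, 'w', encoding='utf-16', newline='') as fh:
    fh.write('alpha TRIAL omega')

replace_in_utf16_file(path, 'TRIAL', 'OTHER')

with open(path, encoding='utf-16', newline='') as fh:
    result = fh.read()
print(result)
```

As with String.Replace(), str.replace changes every occurrence, so pick a search string that is specific enough.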
