Separating tags in dirty data in a pandas dataframe

I have a dataframe similar to the following:
Column1  Column2  Tags                 Column3
str1     str2     owner:u1,env:prod1   str3
str2     str4     env:prod2,env:prod2  str6
str1     str3                          str7
str3     str4     dwdws:qsded,ewe:22w  str8
I can't filter the data based on Tags and drop rows that are not in the proper tags format, since I need the whole data set. Please note:
In the third row, the Tags column is already an empty string.
There may be repetitions in the key:value pairs.
I need the tags I am interested in as separate columns, something like this:
Column1  Column2  Tags                 Column3  Owner  env
str1     str2     owner:u1,env:prod1   str3     u1     prod1
str2     str4     env:prod2,env:prod2  str6            prod2
str1     str3                          str7
str3     str4     dwdws:qsded,ewe:22w  str8
I tried something along the lines of:
Data['owner'] = Data['Tags'].str.slice(Data.Tags.str.find('owner:'), Data.Tags.str.find('owner:') + <length until comma after owner is reached>)
but I get all NaN values in the column. I am hoping there is a one- or two-liner that can do this.
Thanks in advance

A generic method would be to extract all the key:value pairs with extractall, then to pivot:
out = (df.join(df['Tags'].str.extractall('([^:,]+):([^:,]+)')
                         .droplevel('match').pivot(columns=0, values=1))
      )
Output:
  Column1 Column2                 Tags Column3  dwdws   env  ewe owner
0    str1    str2    owner:u1,env:prod    str3    NaN  prod  NaN    u1
1    str2    str4             env:prod    str6    NaN  prod  NaN   NaN
2    str1    str3                  NaN    str7    NaN   NaN  NaN   NaN
3    str3    str4  dwdws:qsded,ewe:22w    str8  qsded   NaN  22w   NaN
If you want to restrict the tags, adapt the first part of the regex:
out = (df.join(df['Tags'].str.extractall('(owner|env):([^:,]+)')
                         .droplevel('match').pivot(columns=0, values=1))
      )
Output:
  Column1 Column2                 Tags Column3   env owner
0    str1    str2    owner:u1,env:prod    str3  prod    u1
1    str2    str4             env:prod    str6  prod   NaN
2    str1    str3                  NaN    str7   NaN   NaN
3    str3    str4  dwdws:qsded,ewe:22w    str8   NaN   NaN
To handle duplicated keys (as in env:prod2,env:prod2), use pivot_table with an aggregation function instead of pivot:
out = (df.join(df['Tags'].str.extractall('(owner|env):([^:,]+)')
                         .droplevel('match').reset_index()
                         .pivot_table(index='index', columns=0, values=1, aggfunc='first')
              )
      )
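If only a couple of known tags are needed, a shorter alternative is to extract each tag directly with Series.str.extract (a minimal sketch, not part of the answer above, assuming the DataFrame is named df):
# Rows whose Tags string lacks the key simply get NaN in the new column
df['owner'] = df['Tags'].str.extract(r'owner:([^,]+)', expand=False)
df['env'] = df['Tags'].str.extract(r'env:([^,]+)', expand=False)
Note that this keeps only the first match per key, so duplicated keys are not an issue here.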

Related

pandas joining strings in a group, skipping na values

I'm using a combination of str.join (let's call the joined column col_str) and groupby (let's call the grouped column col_a) in order to summarize data row-wise.
col_str may contain NaN values. Unsurprisingly, and as seen in the str.join documentation, joining over NaN results in an empty string:
df = df.join(df['col_a'].map(df.groupby('col_a')['col_str'].unique().str.join(', ')))
To mitigate this, I tried to convert col_str to string (e.g. df['col_str'] = df['col_str'].astype(str)). But then the empty values literally hold the string 'nan', and are therefore considered non-empty.
Not only does str.join now include those 'nan' strings, but other calculations in the script that rely on the real NaNs are ruined as well.
To address that, I thought about converting just the non-empty values as follows:
df['col_str'] = np.where(pd.isnull(df['col_str']), df['col_str'],
                         df['col_str'].astype(str))
But now str.join returns empty values again :-(
So, I tried fillna('') and even dropna(). Neither provided the desired results.
You get the vicious cycle here, right?
astype(str) => nan strings in join and calculations ruined
Leaving as-is => str.join returns empty results.
Thanks for your assistance!
Edit:
Data is read from a csv. Sample:
Code to test -
df = pd.read_csv('/Users/goidelg/Downloads/sample_data.csv', low_memory=False)
print("---Original DF ---")
print(df)
print("---Joining NaNs as NaN---")
print(df.join(df['col_a'].map(df.groupby('col_a')['col_str'].unique().str.join(', ')).rename('strings_concat')))
print("---Convertin col to str---")
df['col_str'] = df['col_str'].astype(str)
print(df.join(df['col_a'].map(df.groupby('col_a')['col_str'].unique().str.join(', ')).rename('strings_concat')))
And results for the script:
First remove the missing values, either with DataFrame.dropna or by filtering with Series.notna in boolean indexing:
df = pd.DataFrame({'col_a': [1, 2, 3, 4, 1, 2, 3, 4, 1, 2],
                   'col_str': ['a', 'b', 'c', 'd', np.nan, np.nan, np.nan, np.nan, 'a', 's']})

df1 = (df.join(df['col_a'].map(df[df['col_str'].notna()]
                               .groupby('col_a')['col_str'].unique()
                               .str.join(', ')).rename('labels')))
print (df1)
   col_a col_str labels
0      1       a      a
1      2       b   b, s
2      3       c      c
3      4       d      d
4      1     NaN      a
5      2     NaN   b, s
6      3     NaN      c
7      4     NaN      d
8      1       a      a
9      2       s   b, s
df2 = (df.join(df['col_a'].map(df.dropna(subset=['col_str'])
                                 .groupby('col_a')['col_str']
                                 .unique().str.join(', ')).rename('labels')))
print (df2)
   col_a col_str labels
0      1       a      a
1      2       b   b, s
2      3       c      c
3      4       d      d
4      1     NaN      a
5      2     NaN   b, s
6      3     NaN      c
7      4     NaN      d
8      1       a      a
9      2       s   b, s
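Another option (a sketch, not from the original answer) is to drop the NaNs inside the aggregation itself, so the frame does not need to be pre-filtered:
# Join the unique non-NaN strings per group, then map the result back onto each row
labels = df.groupby('col_a')['col_str'].agg(lambda s: ', '.join(s.dropna().unique()))
df3 = df.join(df['col_a'].map(labels).rename('labels'))
One difference: a group whose values are all NaN ends up with an empty string rather than NaN.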

Assigning a value in a column based on unique values in another - Pandas

I have 2 columns: column A has many string values, some unique and some repeated several times, and column B has either 1 or 0.
Some values in column A have only zeros next to them in column B, some have only 1s, and for some the value in column B differs between rows.
I'd like to 'override' the zeroes: if a value in column A has a 1 in column B anywhere, find the rows where that same value has a zero and replace it with 1.
I have a variable with all the values that equal 1.
If possible, I'd like to avoid a for loop with the iterrows method, which would probably be the immediate suspect:
is_1 = data.query('is_1==1')
A_unique = is_1['A'].unique()
for index, row in data.iterrows():
    if row['is_1'] == 0:
        if row['A'] in A_unique:
            data.loc[data.A == row['A'], 'is_1'] = 1
One way I can think of is to sort and use the fillna method to forward fill the zeros -
df = pd.DataFrame({'A': list('ABBABCB'), 'B': list('0100011')})
# A B
#0 A 0
#1 B 1
#2 B 0
#3 A 0
#4 B 0
#5 C 1
#6 B 1
# First we replace all 0's with nan's
df.loc[df['B'] == '0', 'B'] = np.nan
# Then we sort and fillna
df = df.sort_values(['A', 'B']).fillna(method="ffill").fillna('0')
# A B
#0 A 0
#3 A 0
#1 B 1
#6 B 1
#2 B 1
#4 B 1
#5 C 1
This could be a solution as well, using a list comprehension:
df = pd.DataFrame({
    'a': ['str1', 'str2', 'str3', 'str1', 'str1', 'str1', 'str4', 'str4'],
    'b': [0, 1, 0, 1, 0, 1, 0, 1]})
# a b
#0 str1 0
#1 str2 1
#2 str3 0
#3 str1 1
#4 str1 0
#5 str1 1
#6 str4 0
#7 str4 1
# Collect the existing (value, flag) pairs once, then override with 1 where the value ever has a 1
pairs = set(zip(df['a'], df['b']))
tup_list = [(j, 1) if (j, 1) in pairs else (j, i) for (j, i) in zip(df['a'], df['b'])]
df = pd.DataFrame(tup_list, columns=['a', 'b'])
# a b
#0 str1 1
#1 str2 1
#2 str3 0
#3 str1 1
#4 str1 1
#5 str1 1
#6 str4 1
#7 str4 1
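For completeness, a shorter route (a sketch, not part of either answer, assuming column b is numeric as in the second example) is to broadcast the group maximum back onto every row with groupby and transform:
# Any group of 'a' containing a 1 gets 1 everywhere; all-zero groups stay 0
df['b'] = df.groupby('a')['b'].transform('max')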

Pandas str replace dropping entire value instead of replacing

There is a behaviour here that I don't understand.
# Make a dataframe with a column of floats
df = pd.DataFrame(columns=['col1'])
df.loc[0] = 10.0
df.loc[1] = 5.0
df.loc[2] = 6.0
df.loc[3] = 20.0
# Convert the column to string
df['col1'] = df['col1'].astype(str)
# Use .str.replace to replace the decimal points of .0 with nothing
df['col1'].str.replace('.0', '')
But this returns an empty string for the first and last value
0
1 5
2 6
3
Name: col1, dtype: object
However doing this:
# Apply replace in a lambda function
df['col1'].apply(lambda x: x.replace('.0', ''))
And this returns the expected results
0 10
1 5
2 6
3 20
Name: col1, dtype: object
Is it something to do with it confusing 0.0 with .0?
Any idea why this is happening?
Because . is a special regex character, it is necessary to escape it or pass regex=False:
df['col2'] = df['col1'].str.replace(r'\.0', '', regex=True)
df['col3'] = df['col1'].str.replace('.0', '', regex=False)
print (df)
   col1 col2 col3
0  10.0   10   10
1   5.0    5    5
2   6.0    6    6
3  20.0   20   20
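One caveat worth adding (not part of the original answer): the pattern \.0 removes every '.0' substring, so a value like '1.05' would become '15'. If only a trailing '.0' should be stripped, anchoring the pattern to the end of the string is safer:
# $ anchors the match to the end, so only a trailing '.0' is removed
df['col1'].str.replace(r'\.0$', '', regex=True)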

Why are there no rows in which all values are null in my dataframe?

I have found some rows in which all of the columns seem to be null.
The examples are below.
I want to remove these rows. But when I use the method from the link below,
the returned dataframe contains no rows at all, even though it should contain the all-null rows.
Python Pandas find all rows where all values are NaN
So I want to know what's wrong with my data frame. Does the NA matter?
What should I do to get the row numbers of the null rows?
Besides, I used
df_features.loc[df_features['sexo'].isnull() & (df_features['age']=='NA'),:]
But it returns no rows from my data frame.
I think you need boolean indexing with a mask created by notnull:
df_features[df_features['sexo'].notnull()]
It seems you need:
df_features[(df_features['sexo'].notnull()) & (df_features['age'] != 'NA')]
Sample:
df_features = pd.DataFrame({'sexo': [np.nan, 2, 3],
                            'age': ['10', '20', 'NA']})
print (df_features)
age sexo
0 10 NaN
1 20 2.0
2 NA 3.0
a = df_features[(df_features['sexo'].notnull()) & (df_features['age'] != 'NA')]
print (a)
age sexo
1 20 2.0
But it seems your columns with NA values are not numeric, but string.
If you need to convert some columns to numeric, try to_numeric; the parameter errors='coerce' means that all values which cannot be parsed as numbers are converted to NaN:
df_features = pd.DataFrame({'sexo': [np.nan, 2, 3],
                            'age': ['10', '20', 'NA']})
print (df_features)
age sexo
0 10 NaN
1 20 2.0
2 NA 3.0
df_features['age'] = pd.to_numeric(df_features['age'], errors='coerce')
print (df_features)
age sexo
0 10.0 NaN
1 20.0 2.0
2 NaN 3.0
a = df_features[(df_features['sexo'].notnull()) & (df_features['age'].notnull())]
print (a)
age sexo
1 20.0 2.0
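To directly answer the original question about finding the row numbers of the all-null rows, a short sketch building on the coercion above (assuming the DataFrame is df_features and the NA strings have already been converted to real NaN):
# Flag rows where every column is null, then look at their index labels
mask = df_features.isnull().all(axis=1)
print(df_features.index[mask])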

Matlab SQL update query

So I have some data as follows:
var1
(time value1)
2 1934
3 3221
4 1314
var2
(time value2)
2 836
3 5364
4 2143
and I want to add it to a new table in a database which I have created containing the following fields: time, value1, value2.
Using MATLAB's datainsert function I get the following (which is not what I want):
time value1 value2
2 1934
3 3221
4 1314
2 836
3 5364
4 2143
Now I am trying to use the update function instead so I hopefully get the following:
time value1 value2
2 1934 836
3 3221 5364
4 1314 2143
To get the time and value1 into the table I do the following:
datainsert(connection,'table',{'time','value1'},var1);
but what should I do to now insert the value2 data?
Thanks in advance!
Are the time values for var1 and var2 identical (i.e. the same values in the same order)?
In that case you can simply create a common array with three columns and insert this array into the database:
new_var=[var1 var2(:,2)];
datainsert(connection,'table',{'time','value1','value2'},new_var);
If the order of time values is not the same for var1 and var2, then you will need a more complex step to create the common data set. For example, you can use intersect:
[new_time, i1, i2] = intersect(var1(:,1), var2(:,1));
new_var = [new_time var1(i1,2) var2(i2,2)];