How to join two data frames using regexp_replace - dataframe

I want to join two dataframes and remove the matching records using the column cust_id. Some of the cust_id values have leading zeros, so I need to strip the zeros inside the 'ON' clause of the join. I tried the query below, but it gives an error in a Databricks notebook.
PS: I don't want to create another DF1 with zeros removed.
Query:
df1 = df1.join(df2,[regexp_replace("cust_id", r'^[0]*','')], "leftanti")
Py4j.Py4JException: Method and Class java.lang.string does not exist

The following works, but the output you provided will not be produced by a "leftanti" join: S15 matches S15 in the other table, so it is removed too. With the example data you provided, the "leftanti" join returns no rows.
from pyspark.sql import functions as F
df1 = spark.createDataFrame(
    [(1, 'S15', 'AAA'),
     (2, '00767', 'BBB'),
     (3, '03246', 'CCC')],
    ['ID', 'cust_id', 'Name'])
df2 = spark.createDataFrame(
    [(1, 'S15', 'AAA'),
     (2, '767', 'BBB'),
     (3, '3246', 'CCC')],
    ['ID', 'cust_id', 'Name'])
df = df1.join(df2, df2.cust_id == F.regexp_replace(df1.cust_id, r'^0*', ''), "leftanti")
df.show()
# +---+-------+----+
# | ID|cust_id|Name|
# +---+-------+----+
# +---+-------+----+
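To see why the anti-join comes back empty, the same condition can be run as an inner join instead — a quick sketch using the DataFrames defined above:
df1.join(
    df2,
    df2.cust_id == F.regexp_replace(df1.cust_id, r'^0*', ''),
    'inner'
).select(df1.ID, df1.cust_id, df2.cust_id.alias('cust_id_2')).show()
# all three rows of df1 find a match (S15 -> S15, 00767 -> 767, 03246 -> 3246),
# which is why the "leftanti" join removes them all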

No need for the square brackets [ ]. Also note that df2("cust_id") is Scala syntax; in PySpark you index with df1["cust_id"], and the join needs an explicit equality condition and join type:
df1.join(df2, regexp_replace(df1["cust_id"], r'^[0]*', '') == df2["cust_id"], "leftanti")
See the documentation for regexp_replace.

Related

find a value from df1 in df2 and replace other values of the matching rows

I have the following code with two dataframes (df1 and df2):
import pandas as pd

data = {'Name': ['Name1', 'Name2', 'Name3', 'Name4', 'Name5'],
        'Number': ['456', 'A977', '132a', '6783r', '868354']}
replace = {'NewName': ['NewName1', 'NewName3', 'NewName4', 'NewName5', 'NewName2'],
           'ID': ['I753', '25552', '6783r', '868354', 'A977']}
df1 = pd.DataFrame(data, columns=['Name', 'Number'])
df2 = pd.DataFrame(replace, columns=['NewName', 'ID'])
Now I would like to compare every item in the 'Number' column of df1 with the 'ID' column of df2. If there is a match, I would like to replace the 'Name' in df1 with the 'NewName' from df2; otherwise it should keep the 'Name' of df1.
First I tried the following code, but unfortunately it mixed up the names and numbers across rows.
df1.loc[df1['Number'].isin(df2['ID']), ['Name']] = df2.loc[df2['ID'].isin(df1['Number']),['NewName']].values
The next code I tried worked a bit better, but it replaced the 'Name' of df1 with the 'Number' of df1 when there was no match.
df1['Name'] = df1['Number'].replace(df2.set_index('ID')['NewName'])
How can I stop this behavior in my last code, or are there better ways in general to achieve what I would like to do?
You can use map instead of replace to substitute each value in the Number column of df1 with the corresponding value from the NewName column of df2, and then fill the NaN values (values that could not be mapped) with the original values from the Name column of df1:
df1['Name'] = df1['Number'].map(df2.set_index('ID')['NewName']).fillna(df1['Name'])
>>> df1
       Name  Number
0     Name1     456
1  NewName2    A977
2     Name3    132a
3  NewName4   6783r
4  NewName5  868354
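For reference, here is the intermediate result of the map step before fillna is applied (a small sketch using the df1 and df2 defined above); the NaN rows are the ones whose Number has no matching ID:
df1['Number'].map(df2.set_index('ID')['NewName'])
# 0         NaN
# 1    NewName2
# 2         NaN
# 3    NewName4
# 4    NewName5
# Name: Number, dtype: object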

Pandas read .csv separated by whitespace but columns with names that contain spaces

I have a .csv file that I have to read. It is separated by whitespace, but the column names also contain spaces. Something like this:
column1 another column final column
value ONE valueTWO valueTHREE
I have been trying to read it, but the parser gets confused by the spaces in the column names (which are not separators). I tried using read_fwf and read_csv, but neither worked:
df_mccf=pd.read_fwf(r'C:\Users\MatíasGuerreroIrarrá\OneDrive - BIWISER\Orizon\MCCF\inputs\valores-MCCF (3).csv',\
colspecs=[(0, 4), (5, 10), (11, 21), (22, 32), (33, 54), (55, 1000)])
and:
df_mccf=pd.read_fwf(r'C:\Users\MatíasGuerreroIrarrá\OneDrive - BIWISER\Orizon\MCCF\inputs\valores-MCCF (3).csv',\
sep=' ')
but the columns still did not line up correctly,
and with this line:
df_mccf=pd.read_csv(r'C:\Users\MatíasGuerreroIrarrá\OneDrive - BIWISER\Orizon\MCCF\inputs\valores-MCCF (3).csv',\
encoding='UTF-16', delim_whitespace=True)
but it still did not work as expected.
Any help would be really amazing.
I'd suggest you ignore the header altogether and instead pass the names argument. That way you can use the whitespace separator for the rest of the file:
import io
import pandas as pd
data = """column one column two column three
a 1 x
b 2 y
"""
with io.StringIO(data) as f:
    df = pd.read_csv(
        f,
        delim_whitespace=True,
        names=['one', 'two', 'three'],  # custom header names
        skiprows=1,  # skip the initial row (header)
    )
Result:
one two three
0 a 1 x
1 b 2 y
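Applied to the file from the question, the same idea would look roughly like this; the names passed here are my guesses based on the sample header (underscores instead of spaces), so adjust them and the encoding to match the real file:
import pandas as pd

df_mccf = pd.read_csv(
    r'C:\Users\MatíasGuerreroIrarrá\OneDrive - BIWISER\Orizon\MCCF\inputs\valores-MCCF (3).csv',
    encoding='UTF-16',                                     # as in the question
    delim_whitespace=True,
    names=['column1', 'another_column', 'final_column'],  # assumed column names
    skiprows=1,                                            # skip the original header row
)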

How to concatenate values from multiple rows using Pandas?

In the screenshot, the 'Ctrl' column contains a key value. I have two duplicate rows for OTC-07 which I need to consolidate. I would like to concatenate the rest of the column values for OTC-07, i.e. OTC-07 should have Type A,B and Assertion a,b,c,d after consolidation. Can anyone help me with this? :o
First, define a dataframe with the given structure:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'Ctrl': ['OTC-05', 'OTC-06', 'OTC-07', 'OTC-07', 'OTC-08'],
    'Type': ['A', 'A', 'A', 'B', np.NaN],
    'Assertion': ['a,b,c', 'c,b', 'a,c', 'b,c,d', 'a,b,c']
})
df
Output:
     Ctrl Type Assertion
0  OTC-05    A     a,b,c
1  OTC-06    A       c,b
2  OTC-07    A       a,c
3  OTC-07    B     b,c,d
4  OTC-08  NaN     a,b,c
Then replace NaN values with empty strings:
df = df.replace(np.NaN, '', regex=True)
Then group by the 'Ctrl' column and aggregate the 'Type' and 'Assertion' columns. Please note that the Assertion aggregation is a bit tricky, as you need not a simple concatenation but a sorted list of unique letters:
df.groupby(['Ctrl']).agg({
    'Type': ','.join,
    'Assertion': lambda x: ','.join(sorted(set(','.join(x).split(','))))
})
Output:
         Type Assertion
Ctrl
OTC-05      A     a,b,c
OTC-06      A       b,c
OTC-07    A,B   a,b,c,d
OTC-08            a,b,c
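If you prefer Ctrl as a regular column rather than the index, the same aggregation can also be run with as_index=False (a small variant of the code above):
df.groupby('Ctrl', as_index=False).agg({
    'Type': ','.join,
    'Assertion': lambda x: ','.join(sorted(set(','.join(x).split(','))))
})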

Get row and column index of value in Pandas df

Currently I'm trying to automate scheduling.
I'll get the requirement as a .csv file.
However, the number of days changes by month, and personnel also change occasionally, which means the number of columns and rows is not fixed.
So I want to put the value '*' as a marker meaning the end of a table. Unfortunately, I can't find a function or method that takes a value as a parameter and returns a (list of) index (the column and row names, or index numbers).
Is there any way I can find a (or a list of) index of a certain value (like a coordinate)?
For example, when the dataframe looks like the one below:
  | column_1 | column_2
--+----------+---------
1 |   'a'    |   'b'
2 |   'c'    |   'd'
how can I get 'column_2' and 2 from the value 'd'? It's something similar to the opposite of .loc or .iloc.
Interesting question. I also used a list comprehension, but with np.where. Still I'd be surprised if there isn't a less clunky way.
import numpy as np
import pandas as pd

df = pd.DataFrame({'column_1': ['a', 'c'], 'column_2': ['b', 'd']}, index=[1, 2])
[(i, np.where(df[i] == 'd')[0].tolist()) for i in list(df) if len(np.where(df[i] == 'd')[0]) > 0]
> [('column_2', [1])]
Note that it returns the numeric (0-based) position, not the custom (1-based) index you have. If you have a fixed offset you can just add a +1 (or whatever) to the output.
If I understand what you are looking for — find the (index value, column location) for a value in a dataframe — you can use a list comprehension in a loop. It probably won't be the fastest if your dataframe is large.
# assume this dataframe
df = pd.DataFrame({'col':['abc', 'def','wert','abc'], 'col2':['asdf', 'abc', 'sdfg', 'def']})
# list comprehension
[(df[col][df[col].eq('abc')].index[i], df.columns.get_loc(col)) for col in df.columns for i in range(len(df[col][df[col].eq('abc')].index))]
# [(0, 0), (3, 0), (1, 1)]
Change df.columns.get_loc(col) to col if you want the column name rather than its location:
[(df[col][df[col].eq('abc')].index[i], col) for col in df.columns for i in range(len(df[col][df[col].eq('abc')].index))]
# [(0, 'col'), (3, 'col'), (1, 'col2')]
I might be misunderstanding something, but np.where should get the job done.
df_tmp = pd.DataFrame({'column_1':['a','c'], 'column_2':['b','d']}, index=[1,2])
solution = np.where(df_tmp == 'd')
solution contains the row and column positions (a tuple of two arrays).
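If you need labels rather than positions, the np.where result can be mapped back through the index and columns — a small sketch building on df_tmp above:
import numpy as np
import pandas as pd

df_tmp = pd.DataFrame({'column_1': ['a', 'c'], 'column_2': ['b', 'd']}, index=[1, 2])
rows, cols = np.where(df_tmp == 'd')                  # positional indices
list(zip(df_tmp.index[rows], df_tmp.columns[cols]))   # -> [(2, 'column_2')]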
Hope this helps!
To search single value:
df = pd.DataFrame({'column_1':['a','c'], 'column_2':['b','d']}, index=[1,2])
df[df == 'd'].stack().index.tolist()
[Out]:
[(2, 'column_2')]
To search a list of values:
df = pd.DataFrame({'column_1':['a','c'], 'column_2':['b','d']}, index=[1,2])
df[df.isin(['a', 'd'])].stack().index.tolist()
[Out]:
[(1, 'column_1'), (2, 'column_2')]
Also works when value occurs at multiple places:
df = pd.DataFrame({'column_1':['test','test'], 'column_2':['test','test']}, index=[1,2])
df[df == 'test'].stack().index.tolist()
[Out]:
[(1, 'column_1'), (1, 'column_2'), (2, 'column_1'), (2, 'column_2')]
Explanation
Select cells where the condition matches:
df[df.isin(['a', 'b', 'd'])]
[Out]:
column_1 column_2
1 a b
2 NaN d
stack() reshapes the columns to index:
df[df.isin(['a', 'b', 'd'])].stack()
[Out]:
1 column_1 a
column_2 b
2 column_2 d
Now the stacked result has a MultiIndex:
df[df.isin(['a', 'b', 'd'])].stack().index
[Out]:
MultiIndex([(1, 'column_1'),
(1, 'column_2'),
(2, 'column_2')],
)
Convert this multi-index to list:
df[df.isin(['a', 'b', 'd'])].stack().index.tolist()
[Out]:
[(1, 'column_1'), (1, 'column_2'), (2, 'column_2')]
Note
If a list of values are searched, the returned result does not preserve the order of input values:
df[df.isin(['d', 'b', 'a'])].stack().index.tolist()
[Out]:
[(1, 'column_1'), (1, 'column_2'), (2, 'column_2')]
Had a similar need and this worked perfectly
# deals with case sensitivity concern
df = raw_df.applymap(lambda s: s.upper() if isinstance(s, str) else s)
# get the row index
value_row_location = df.isin(['VALUE']).any(axis=1).tolist().index(True)
# get the column index
value_column_location = df.isin(['VALUE']).any(axis=0).tolist().index(True)
# Do whatever you want, e.g. replace the value in the cell above
df.iloc[value_row_location - 1, value_column_location] = 'VALUE COLUMN'
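A quick way to sanity-check that pattern on a tiny frame (the data below is made up purely for illustration):
import pandas as pd

raw_df = pd.DataFrame({'c1': ['x', 'value'], 'c2': ['header', 'y']})
df = raw_df.applymap(lambda s: s.upper() if isinstance(s, str) else s)
df.isin(['VALUE']).any(axis=1).tolist().index(True)   # -> 1 (row position)
df.isin(['VALUE']).any(axis=0).tolist().index(True)   # -> 0 (column position)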

Filter negative values from a pyspark dataframe

I have a Spark dataframe with more than 40 columns with mixed values. How can I select only positive values from all columns at once and filter out the negative ones? I visited Python Pandas: DataFrame filter negative values, but none of the solutions work. I would like to fit Naive Bayes in PySpark, where one of the assumptions is that all the features have to be positive. How can I prepare the data by selecting only positive values from my features?
Suppose you have a dataframe like this:
data = [(0, -1, 3, 4, 5, 'a'), (0, -1, 3, -4, 5, 'b'), (5, 1, 3, 4, 5, 'c'),
        (10, 1, 13, 14, 5, 'a'), (7, 1, 3, 4, 2, 'b'), (0, 1, 23, 4, -5, 'c')]
df = sc.parallelize(data).toDF(['f1', 'f2', 'f3', 'f4', 'f5', 'class'])
Use a VectorAssembler to assemble all the feature columns into a vector:
from pyspark.ml.feature import VectorAssembler
transformer = VectorAssembler(inputCols =['f1','f2','f3','f4','f5'], outputCol='features')
df2 = transformer.transform(df)
Now, filter the dataframe using a UDF:
import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

foo = udf(lambda x: not np.any(np.array(x) < 0), BooleanType())
df2.drop('f1', 'f2', 'f3', 'f4', 'f5').filter(foo('features')).show()
Result:
+-----+--------------------+
|class| features|
+-----+--------------------+
| c|[5.0,1.0,3.0,4.0,...|
| a|[10.0,1.0,13.0,14...|
| b|[7.0,1.0,3.0,4.0,...|
+-----+--------------------+
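If you'd rather keep the individual columns and avoid a Python UDF, one possible alternative is to build the filter from per-column conditions (a sketch assuming the same f1–f5 feature columns as above):
from functools import reduce
from pyspark.sql import functions as F

feature_cols = ['f1', 'f2', 'f3', 'f4', 'f5']
# keep only rows where every feature column is non-negative
non_negative = reduce(lambda a, b: a & b, [F.col(c) >= 0 for c in feature_cols])
df.filter(non_negative).show()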