Compare columns from two different data frames and update one column's value - pandas

I have two different data frames, df1 and df2.
df1 has columns date1 and value1.
df2 has date2 and val (initially it contains 0).
The val column in df2 needs to be updated to 1 when a matching date is found in df1.
I achieved this by looping over both data frames with two for loops, but as the volume is very high, it takes a long time.
Is there a better way to do this?

You probably need something like this:
import numpy as np

# dates present in both frames, then flag the matching rows in df2
common = np.intersect1d(df1.date1.values, df2.date2.values)
df2.loc[df2.date2.isin(common), 'val'] = 1
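The intermediate intersect1d step is not strictly required; as a small simplification (my own, not part of the original answer), isin can take df1.date1 directly:

df2.loc[df2.date2.isin(df1.date1), 'val'] = 1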

How to apply a function to each column and row of a pandas dataframe

I have two dataframes.
df1 has an index made of strings like (row1, row2, ..., rown) and columns made of strings like (col1, col2, ..., colm), while df2 has k rows and 3 columns (char_1, char_2, value). char_1 contains strings matching df1's index labels and char_2 contains strings matching df1's column labels. I simply want to assign each df2 value to df1 in the right position. For example, if the first row of df2 reads ['row3', 'col1', 'value2'], I want to assign value2 to df1 at position [2, 0] (third row, first column).
I tried to use two functions to sweep over the rows and columns of df1:

from functools import reduce  # reduce is not a builtin on Python 3

def func1(val):
    # first I convert the series to a dataframe
    val = val.to_frame()
    val = val.reset_index()
    val = val.set_index('index')  # I set the index so that it's the right column

    def func2(val2):
        try:  # maybe the combination doesn't exist
            idx1 = list(df2.index[df2['char_2'] == val2.name])             # val2.name reads the column name of df1
            idx2 = list(df2.index[df2['char_1'] == val2.index.values[0]])  # val2.index.values[0] reads the index name of df1
            idx = list(reduce(set.intersection, map(set, [idx1, idx2])))
            idx = int(idx[0])  # final index of df2 from which I take the value to assign to df1
            check = 1
        except Exception:
            check = 0
        if check == 1:  # if the index exists
            val2[0] = df2['value'][idx]  # assign the value to df1
        return val2

    val = val.apply(func2, axis=1)  # apply the function over the columns
    val = val.squeeze()  # convert back to a series
    return val

df1 = df1.apply(func1, axis=1)  # apply the function over the rows
I made the conversion inside func1 because without that step I wasn't able to work with the series while keeping the index and column names, so I wasn't able to find the index idx in func2.
The problem is that it takes forever. df1 is 3,600 x 20,000 and df2 is 500 x 3, so it's not that much data. I really don't understand the problem. I ran the code on just the first row and column to check the result: it's fine and takes about a second, but the full run has now been going for hours and is still not finished.
Is there a way to optimize this? As I wrote in the title, I only need to run a function that keeps the column and index names while sweeping over the entire dataframe. Thanks in advance!
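A minimal sketch of a faster approach (my own, not from the post above): since df2 is small (about 500 rows), looping over df2 and writing each value into df1 with .at avoids applying a Python function to all 3,600 x 20,000 cells of df1. The column names char_1, char_2 and value are taken from the question.

for _, row in df2.iterrows():
    # skip combinations that do not exist in df1
    if row['char_1'] in df1.index and row['char_2'] in df1.columns:
        df1.at[row['char_1'], row['char_2']] = row['value']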

Pyspark big data question - how to add a column from another dataframe (no common join column) when sizes can be uneven

I am looking for a way to add a column from one pyspark dataframe, let's say this is DF1:
column1
123
234
345
to another pyspark dataframe, which will have any number of columns itself but not column1, DF2:
column2    column3    column4
000        data       some1
253774     etc        etc
1096       null       more
999        other      null
The caveat here is that I would like to avoid using Pandas, and I would like to avoid pulling all of the data into a single partition if possible. This will be up to terabytes of data on the DF2 side, and it will be running distributed on an EMR cluster.
DF1 will be a fixed set of numbers, which could be more or fewer than the row count of DF2.
If DF2 has more rows, DF1 values should be repeated (think cycle).
If DF1 has more rows, we don't exceed the rows in DF2; we just attach a value to each row (it doesn't matter whether we include all of the rows from DF1).
If these requirements seem strange, it is because the values themselves are important in DF1 and we need to use them in DF2, but it doesn't matter which value from DF1 is attached to each DF2 row (we just don't want to repeat the same value over and over, though some duplicates are fine).
What I've Tried:
I have tried adding a row_number to each to join the dataframes, but we run into an issue with that when DF2 is larger than DF1.
I tried duplicating DF1 x number of times to make it large enough to join to DF2 given a row_number, but this is running into java heap space issues on the EMR.
What I am hoping to find:
I am looking for a way to simply cycle over the values from DF1 and apply them to each row on DF2, but doing it with native Pyspark if possible.
In the end an example would look like this:
column1    column2    column3    column4
123        000        data       some1
234        253774     etc        etc
345        1096       null       more
123        999        other      null
The combination of the window functions row_number and ntile might be the answer:
1. Apply row_number on DF1 to enumerate all records in a new column id.
2. Get the count of records in DF1 and store it as df1_count.
3. Apply ntile(df1_count) on DF2 as a new column id; ntile will split the DF2 rows into n groups that are as equal in size as possible.
4. Join DF1 and DF2 on the newly generated column id to combine both dataframes.
Alternatively, instead of ntile(n), DF2 can also get a row_number() based column id, which can then be used to compute a modulo:
df.withColumn("id_mod", col("id") % lit(df1_count))
and that id_mod can then be joined with DF1's id.
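A rough sketch of the ntile steps above (my own illustration; DF1, DF2 and df1_count come from the answer, everything else is assumed). Note that a window with a global orderBy and no partitionBy is the simplest way to number the rows, but Spark will shuffle the data to a single partition for that step:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# global ordering for a stable numbering (single-partition shuffle for this step)
w = Window.orderBy(F.monotonically_increasing_id())

df1_count = DF1.count()

df1_id = DF1.withColumn("id", F.row_number().over(w))      # id runs 1 .. df1_count
df2_id = DF2.withColumn("id", F.ntile(df1_count).over(w))  # group number 1 .. df1_count

result = df2_id.join(df1_id, on="id", how="left").drop("id")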

Is there a known issue with pandas merging two data frames that each have an index of type datetime?

I am merging two data frames that each have an index of type datetime, and the result is a data frame with more rows than the two originals.
The two data frames have the same number of records and the same index values.
When taking a look, I see that there are duplicate records on the same index - is this a known issue?
the code:
df_merged = df1.merge(df2, left_index=True, right_index=True)
This is not specific to datetime indices. The merged dataframe has more rows than either of the originals because, when the merge key (here: the datetime index) values are not unique, the merge algorithm falls back on a cross-product style join for the repeated values, regardless of which join type you specify.
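A tiny made-up illustration of that cross-product effect (my own example, not from the original answer):

import pandas as pd

idx = pd.to_datetime(['2020-01-01', '2020-01-01', '2020-01-02'])
df1 = pd.DataFrame({'a': [1, 2, 3]}, index=idx)
df2 = pd.DataFrame({'b': [4, 5, 6]}, index=idx)

merged = df1.merge(df2, left_index=True, right_index=True)
print(len(merged))  # 5 rows, not 3: the two 2020-01-01 rows pair up 2 x 2 = 4 times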
However, if I understand you correctly, using merge is overkill here anyway, because you just want to concatenate the two dataframes:
pd.concat([df1, df2], axis=1)

Can we sort multiple data frames by comparing the values of a common column?

I have two CSV files with some data, and I would like to combine and sort the data based on one common column:
Here are the data1.csv and data2.csv files:
data3.csv is the output file where I need the data to be combined and sorted as below:
How can I achieve this?
Here's what I think you want to do here:
I created two dataframes with simple types; assume the first column is like your timestamp:

import pandas as pd

df1 = pd.DataFrame([[1, 1], [2, 2], [7, 10], [8, 15]], columns=['timestamp', 'A'])
df2 = pd.DataFrame([[1, 5], [4, 7], [6, 9], [7, 11]], columns=['timestamp', 'B'])

c = df1.merge(df2, how='outer', on='timestamp')
print(c)
The outer merge causes each contributing DataFrame to be fully present in the output even if not matched to the other DataFrame.
The result is that you end up with a DataFrame with a timestamp column and the dependent data from each of the source DataFrames.
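One small addition of my own (the question also asks for the result to be sorted): the merged frame can be ordered by the common column by chaining sort_values:

c = df1.merge(df2, how='outer', on='timestamp').sort_values('timestamp').reset_index(drop=True)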
Caveats:
You have repeating timestamps in your second sample, which I assume is because you do not show enough resolution. You would not want true duplicate records for this merge solution, as we assume timestamps are unique.
I have not repeated the timestamp column a second time here, but it is easy to add another timestamp column based on whether column A or B is notnull() if you really need two timestamp columns. Pandas merge() has an indicator option which would show you the source of the timestamp if you did not want to rely on columns A and B (see the small example after these caveats).
In the post you have two output columns named "timestamp". Generally you would not output two columns with the same name, since they are only distinguished by position (or color), which are not properties you should rely upon.
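A small example of that indicator option (my own illustration, reusing the df1/df2 defined above):

c = df1.merge(df2, how='outer', on='timestamp', indicator=True)
# the added _merge column reads 'left_only', 'right_only' or 'both' for each row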

Remove rows from multiple dataframes that contain bad data

Say I have n dataframes, df1, df2...dfn.
Finding the rows of a given dataframe that contain "bad" values is done with, e.g.,
index1 = df1[df1.isin([np.nan, np.inf, -np.inf])]
index2 = df2[df2.isin([np.nan, np.inf, -np.inf])]
Now, dropping these bad rows from the offending dataframe is done with:
df1 = df1.replace([np.inf, -np.inf], np.nan).dropna()
df2 = df2.replace([np.inf, -np.inf], np.nan).dropna()
The problem is that any function expecting the columns of the two (or n) dataframes to be of the same length may give an error if there is bad data in one df but not the other.
How do I drop not just the bad row from the offending dataframe, but the same row from a list of dataframes?
So in the two-dataframe case, if date index 2009-10-09 in df1 contains a "bad" value, that same row should also be dropped from df2.
[Possible "ugly"? solution?]
I suspect that one way to do it is to merge the two (n) dataframes on date and then apply the cleanup function; dropping the "bad" values is then automatic, since the entire row gets dropped. But what happens if a date is missing from one dataframe and not the other [and they still happen to be the same length]?
First do your replace:
df1 = df1.replace([np.inf, -np.inf], np.nan)
df2 = df2.replace([np.inf, -np.inf], np.nan)
Then concatenate, using an inner join here:
newdf = pd.concat([df1, df2], axis=1, keys=[1, 2], join='inner').dropna()
And split it back into two dfs; here we use combine_first with the dropna of the original df:
df1, df2 = [s[1].loc[:, s[0]].combine_first(x.dropna()) for x, s in zip([df1, df2], newdf.groupby(level=0, axis=1))]
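For comparison, a plainer sketch of my own (not part of the answer above): clean each frame, intersect the surviving date indexes, and keep only those rows everywhere. It assumes the frames are aligned by their date index.

import numpy as np

dfs = [df1, df2]  # extend this list to n dataframes
cleaned = [df.replace([np.inf, -np.inf], np.nan).dropna() for df in dfs]

# keep only the index labels (dates) that survived the cleanup in every frame
good_idx = cleaned[0].index
for c in cleaned[1:]:
    good_idx = good_idx.intersection(c.index)

df1, df2 = [c.loc[good_idx] for c in cleaned]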