pyspark: isin and "is not in" replacement with another df column - dataframe

I'm trying to filter a dataframe in PySpark using "isin".
I also tried another way of filtering,
but I am unable to get the correct result
and get an error about a Spark array literal.
Can anyone help?
One way:
df1.select("COL1").distinct().show()
df2.select(('col1').isin(df1.select("COL1").distinct()))
-------
Second way:
uniquelist=df1.select("COL1").distinct().collect()
df2.filter(F.col('col1').contains(uniquelist)).show()
Can anyone help me solve this error:
An error occurred while calling z:org.apache.spark.sql.functions.lit.
I also have to perform an "is not in":
data_array = np.array(df_list.select("f_col").collect())
df_filtered = df_2.filter(~df_2["colname"].isin([data_array]))

collect() returns a list of Row objects, so you need to extract the values from the rows before passing them to the isin column method:
unique_list = [r["COL1"] for r in df1.select("COL1").distinct().collect()]
df2.filter(F.col('col1').isin(unique_list)).show()
However, you should use a join for this, which avoids collecting the values to the driver.
Use left_semi to get the rows from df2 that have corresponding rows in df1:
df2.join(df1, df1["COL1"] == df2["col1"], "left_semi").show()
And left_anti to get rows from df2 that have no corresponding values in df1:
df2.join(df1, df1["COL1"] == df2["col1"], "left_anti").show()

Related

Aggregating multiple data types in pandas groupby

I have a data frame with rows that are mostly translations of other rows e.g. an English row and an Arabic row. They share an identifier (location_shelfLocator) and I'm trying to merge the rows together based on the identifier match. In some columns the Arabic doesn't contain a translation, but the same English value (e.g. for the language column both records might have ['ger'] which becomes ['ger', 'ger']) so I would like to get rid of these duplicate values. This is my code:
df_merged = df_filled.groupby("location_shelfLocator").agg(
    lambda x: np.unique(x.tolist())
)
It works when the values being aggregated are the same type (e.g. when they are both strings or when they are both arrays). When one is a string and the other is an array, it doesn't work. I get this warning:
FutureWarning: ['subject_name_namePart'] did not aggregate successfully. If any error is raised this will raise in a future version of pandas. Drop these columns/ops to avoid this warning.
df_merged = df_filled.groupby("location_shelfLocator").agg(lambda x: np.unique(x.tolist()))
and the offending column is removed from the final data frame. Any idea how I can combine these values and remove duplicates when they are both lists, both strings, or one of each?
Here is some sample data:
location_shelfLocator,language_languageTerm,subject_topic,accessCondition,subject_name_namePart
81055/vdc_100000000094.0x000093,ara,"['فلك، العرب', 'فلك، اليونان', 'فلك، العصور الوسطى', 'الكواكب']",المُلكية العامة,كلاوديوس بطلميوس (بطليمو)
81055/vdc_100000000094.0x000093,ara,"['Astronomy, Arab', 'Astronomy, Greek', 'Astronomy, Medieval', 'Constellations']",Public Domain,"['Claudius Ptolemaeus (Ptolemy)', ""'Abd al-Raḥmān ibn 'Umar Ṣūfī""]"
And expected output:
location_shelfLocator,language_languageTerm,subject_topic,accessCondition,subject_name_namePart
"[‘81055/vdc_100000000094.0x000093’] ",[‘ara’],"['فلك، العرب', 'فلك، اليونان', 'فلك، العصور الوسطى', ‘الكواكب’, 'Astronomy, Arab', 'Astronomy, Greek', 'Astronomy, Medieval', 'Constellations']","[‘المُلكية العامة’, ‘Public Domain’]","[‘كلاوديوس بطلميوس (بطليمو)’,’Claudius Ptolemaeus (Ptolemy)', ""'Abd al-Raḥmān ibn 'Umar Ṣūfī""]"
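If you have no control over the input values, you need to normalize them first.
Something like this. Here, I am converting the plain-string values in subject_name_namePart into single-element lists, so that every value in the column is a list of strings.
from ast import literal_eval
# Rows whose value is a plain string rather than a stringified list
mask = df.subject_name_namePart.str[0] != '['
# Wrap those strings so they parse as one-element lists (assumes they contain no single quotes)
df.loc[mask, 'subject_name_namePart'] = "['" + df.loc[mask, 'subject_name_namePart'] + "']"
# Parse every value into a real Python list
df['subject_name_namePart'] = df.subject_name_namePart.transform(literal_eval)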
Then you can explode the lists into one row per item and aggregate:
df = df.explode('subject_name_namePart')
df = df.groupby('location_shelfLocator').agg(lambda x: x.unique().tolist())
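As a variation on the same idea, the normalization and the aggregation could be folded into a single helper that is applied to every column. This is only a sketch: the helper names to_list and merge_unique and the small ASCII sample frame are made up for illustration, and it assumes that every stringified list starts with '[' and parses with literal_eval.

```python
from ast import literal_eval

import pandas as pd


def to_list(value):
    """Return value as a list: parse stringified lists, wrap anything else."""
    if isinstance(value, list):
        return value
    if isinstance(value, str) and value.startswith("["):
        return literal_eval(value)
    return [value]


def merge_unique(series):
    """Flatten all values of one group and drop duplicates, keeping first-seen order."""
    seen = {}
    for value in series:
        for item in to_list(value):
            seen.setdefault(item, None)
    return list(seen)


# Hypothetical sample: one column mixes a plain string with a stringified list.
df = pd.DataFrame({
    "location_shelfLocator": ["a1", "a1", "b2"],
    "language_languageTerm": ["ger", "ger", "eng"],
    "subject_name_namePart": ["Ptolemy", "['Ptolemy', 'Sufi']", "['Euclid']"],
})

df_merged = df.groupby("location_shelfLocator").agg(merge_unique)
```

Because the helper returns a plain Python list rather than a NumPy array, it works the same way whether a group holds strings, lists, or one of each.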

Assigning value to an iloc slice with condition on a separate column?

I would like to slice my dataframe using iloc (rather than loc) + some condition based on one of the dataframe's columns and assign a value to all the items in this slice (which is effectively a subset of the main dataframe).
My simplified attempt:
df.iloc[:, 1:21][df['column1'] == 'some_value'] = 1
This is meant to take a slice of the dataframe:
All rows;
Columns 2 to 20;
Then slice it again:
Only the rows where column1 = some_value.
The slicing works fine, but assigning 1 to the result doesn't work. Nothing changes in df and I get this warning:
A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead
I really need to use iloc rather than loc if possible. It feels like there should be a way of doing this?
You can search for this warning on SO. In short, the assignment has to go through one single loc/iloc call instead of two chained indexing steps:
df.loc[df['column1']=='some_value', df.columns[1:21]] = 1
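If iloc really is required, one possibility is to hand the boolean mask to iloc as a NumPy array, since iloc accepts positional boolean arrays but not a boolean Series. The sketch below uses a made-up four-column frame (columns a, b, c are hypothetical) and shows both variants side by side.

```python
import pandas as pd

# Hypothetical sample frame; only the name column1 comes from the question.
df = pd.DataFrame({
    "column1": ["some_value", "other", "some_value"],
    "a": [0, 0, 0],
    "b": [0, 0, 0],
    "c": [0, 0, 0],
})

mask = df["column1"] == "some_value"

# Label-based: boolean row mask plus column labels looked up by position.
df.loc[mask, df.columns[1:3]] = 1

# Position-based: the same mask as a NumPy boolean array, with a positional column slice.
df.iloc[mask.to_numpy(), 3:4] = 2
```

Both statements write into df directly, so no SettingWithCopyWarning is raised.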

Pandas Dataframe: How to get the cell instead of its value

I have a task to compare two dataframes with the same column names but different sizes; we can call them previous and current. I am trying to find the differences between previous and current in the Quantity and Booked columns and highlight them in yellow. The common key between the two dataframes is the 'SN' column.
I have coded out the following
for idx, rows in df_n.iterrows():
    if rows["Quantity"] == rows['Available'] + rows['Booked']:
        continue
    else:
        rows["Quantity"] = rows["Quantity"] - rows['Available'] - rows['Booked']
        df_n.loc[idx, 'Quantity'].style.applymap('background-color: yellow')
        # pdb.set_trace()
    if (df_o['Booked'][df_o['SN'] == rows["SN"]] != rows['Booked']).bool():
        df_n.loc[idx, 'Booked'].style.apply('background-color: yellow')
I realise I have a few problems here and need some help:
df_n.loc[idx, 'Quantity'] returns a value instead of a dataframe. How can I get a dataframe from one cell? Do I have to use pd.DataFrame(data=df_n.loc[idx, 'Quantity'], index=idx, columns='Quantity')? Will this create a copy, or will it update the original by reference?
How do I compare the SN of both dataframes? I am looking for a better way to compare. One thing I could think of is to set SN as the index of both dataframes and reset it when I have finished using them.
My dataframe:
Previous dataframe
Current Dataframe
df_n.loc[idx, 'Quantity'] returns value instead of a dataframe type.
How can I get a dataframe from one cell. Do I have to
pd.DataFrame(data=df_n.loc[idx, 'Quantity'], index=idx, columns
='Quantity'). Will this create a copy or will update the reference?
To create a DataFrame from one cell you can try: df_n.loc[idx, ['Quantity']].to_frame().T
How do I compare the SN of both dataframe, looking for a better way to
compare. One thing I could think of is to use set index for both
dataframe and when finished using them, reset them back?
You can use df_n.merge(df_o, on='S/N') to merge dataframes and 'compare' columns.
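A rough sketch of how the merge-based comparison might look is given below. The sample values, the suffixes and the *_changed column names are assumptions made for illustration, and the key column is taken to be 'SN' as in the question's code.

```python
import pandas as pd

# Hypothetical previous (df_o) and current (df_n) frames of different content.
df_o = pd.DataFrame({"SN": [1, 2, 3], "Quantity": [10, 5, 8], "Booked": [2, 1, 0]})
df_n = pd.DataFrame({"SN": [1, 2, 4], "Quantity": [10, 7, 3], "Booked": [3, 1, 0]})

# Selecting with lists on both axes keeps a single cell as a 1x1 DataFrame.
cell = df_n.loc[[0], ["Quantity"]]

# Left merge on the key keeps every current row; suffixes tell the two sides apart.
merged = df_n.merge(df_o, on="SN", how="left", suffixes=("_new", "_old"))

# Flag a value as changed only when the key exists in both frames and the values differ.
for col in ["Quantity", "Booked"]:
    merged[col + "_changed"] = merged[col + "_old"].notna() & (
        merged[col + "_new"] != merged[col + "_old"]
    )
```

The boolean *_changed columns mark exactly the cells that differ, so they could then drive the yellow highlighting without looping over rows.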

How to name columns?

I have a pandas DataFrame where some of the ids are repeated a few times. I've written this code:
df = df["id"].value_counts()
and got this output
What should I do to get something like in the following image?
Thanks
As Quang Hoang answered, value_counts sets the values of the column you count as the index. Therefore, to get the ids and the counts as columns, you need to do two things:
Turn the counts into a column: to_frame(name='B')
Reset the index so that the ids become another column, then rename it to the desired name: .reset_index().rename(columns={'index': 'A'})
So in one line it will be:
df = df["id"].value_counts().to_frame(name='B').reset_index().rename(columns={'index': 'A'})
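Another possible way, once the counts are in a two-column DataFrame (for example after reset_index()), is to assign the column names directly:
col = ["A", "B"]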
df.columns = col
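One caveat: in pandas 2.0 and later, reset_index() on the result of value_counts names the id column after the original column rather than 'index', so the rename above may silently do nothing. A sketch that should not depend on that default is to name the index before resetting it. The sample ids below are made up, and 'A' and 'B' are the placeholder names used in the answer.

```python
import pandas as pd

# Hypothetical sample data with repeated ids.
df = pd.DataFrame({"id": [7, 7, 9, 7, 9, 3]})

counts = (
    df["id"]
    .value_counts()          # ids become the index, counts the values
    .rename_axis("A")        # name the index (the ids) "A"
    .reset_index(name="B")   # move the ids into a column; the counts become "B"
)
```

The result is a regular DataFrame with the ids in column A and their counts in column B, sorted by count in descending order.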

Apply groupby on an empty dataframe that results from a filter

A sample dataframe as mentioned below:
df_A = pd.DataFrame({'field1':[1,2,3,4,5], 'field2':[11,12,13,14,15], 'field3':['c1','c2','c3','c4','c5'], 'field4':['m1','m2','m3','m4','m5'], 'field5':[21,22,23,24,25], 'field6':['f1','f2','f3','f4','f5'], 'field7':[31,32,33,34,35]})
I have a logic as mentioned below:
df_A['field7'] = df_A[(df_A['field4']== 'abc') & (df_A['field5']== 'def')].groupby(['field1', 'field2','field3'], as_index=False)[['field6']].transform('count')
but in some scenarios the filter might yield no values and I am getting the following error:
ValueError: No objects to concatenate
Although I partially understand what the error means, I am not able to get a column of null values, which is the result I expect. (A related example of applying groupby to an empty dataframe: "Keep columns after a groupby in an empty dataframe".)
Kindly let me know if I have misunderstood anything. Thanks in advance!
Edit: Added an example dataframe for the above mentioned case
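One way this might be approached, offered only as a sketch, is to start from a column of NaN and run the groupby only when the filter actually matches some rows. The helper name count_matching, the extra column field8 and the sample values are hypothetical. The sketch also selects field6 as a single column, which differs slightly from the [['field6']] selection in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample in which the filter matches the first two rows.
df_A = pd.DataFrame({
    "field1": [1, 1, 2],
    "field2": [11, 11, 12],
    "field3": ["c1", "c1", "c2"],
    "field4": ["abc", "abc", "xyz"],
    "field5": ["def", "def", "def"],
    "field6": ["f1", "f2", "f3"],
})


def count_matching(df, value4, value5):
    """Per-group count of field6 for rows passing the filter, NaN everywhere else."""
    mask = (df["field4"] == value4) & (df["field5"] == value5)
    result = pd.Series(np.nan, index=df.index)
    if mask.any():  # skip the groupby entirely when the filter matches nothing
        result.loc[mask] = (
            df[mask].groupby(["field1", "field2", "field3"])["field6"].transform("count")
        )
    return result


df_A["field7"] = count_matching(df_A, "abc", "def")   # filter matches two rows
df_A["field8"] = count_matching(df_A, "no", "match")  # filter matches nothing
```

With an empty filter the groupby is never called, so the column is simply left as all NaN instead of raising an error.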