Synapse Analytics Pyspark: TypeError: list object is not callable - dataframe

I need to create a new dataframe in Synapse Analytics using column names from another dataframe. The new dataframe will have just one column (column header:col_name and the columns names from the other dataframe are the cell values. Here's my code:
df1= df.columns
colName =[]
for e in df1:
list1 = [e]
colName.append(list1)
col=['col_name']
df2=spark.createDataFrame(colName,col)
display(df2)
The output table created look like below:
With the output dataframe, i can do the following count, display or withColumn command.
df2.count()
df2=df2.withColumn('index',lit(1))
But when i start doing the below filter command, i ended up with 'list' object not callable error message.
display(df2.filter(col('col_name')=='dob'))
I am just wondering if anyone know what I am missing and how I can solve this.At the end i'd like to add a conditional column based on the value in the col_name column.

The problem is that you have two objects called col.
You did this :
col=['col_name']
therefore, when you do this :
display(df2.filter(col('col_name')=='dob'))
you do not call pyspark.sql.functions.col anymore but ['col_name'], hence, TypeError: list object is not callable.
Simply replace here :
# display(df2.filter(col('col_name')=='dob'))
from pyspark.sql import functions as F
display(df2.filter(F.col('col_name')=='dob'))

Related

Pandas splitting a column with new line separator

I am extracting tables from pdf using Camelot. Two of the columns are getting merged together with a newline separator. Is there a way to separate them into two columns?
Suppose the column looks like this.
A\nB
1\n2
2\n3
3\n4
Desired output:
|A|B|
|-|-|
|1|2|
|2|3|
|3|4|
I have tried df['A\nB'].str.split('\n', 2, expand=True) and that splits it into two columns however I want the new column names to be A and B and not 0 and 1. Also I need to pass a generalized column label instead of actual column name since I need to implement this for several docs which may have different column names. I can determine such column name in my dataframe using
colNew = df.columns[df.columns.str.contains(pat = '\n')]
However when I pass colNew in split function, it throws an attribute error
df[colNew].str.split('\n', 2, expand=True)
AttributeError: DataFrame object has no attribute 'str'
You can take advantage of the Pandas split function.
import pandas as pd
# recreate your pandas series above.
df = pd.DataFrame({'A\nB':['1\n2','2\n3','3\n4']})
# first: Turn the col into str.
# second. split the col based on seperator \n
# third: make sure expand as True since you want the after split col become two new col
test = df['A\nB'].astype('str').str.split('\n',expand=True)
# some rename
test.columns = ['A','B']
I hope this is helpful.
I reproduced the error from my side... I guess the issue is that "df[colNew]" is still a dataframe as it contains the indexes.
But .str.split() only works on Series. So taking as example your code, I would convert the dataframe to series using iloc[:,0].
Then another line to split the column headers:
df2=df[colNew].iloc[:,0].str.split('\n', 2, expand=True)
df2.columns = 'A\nB'.split('\n')

pyspark: isIN and isNOT IN replcaement with another df column

i'm trying to filter dataframe in pyspark using "isin"
also tried another way of filtering.
unable to get the correct result.
getting error of Spark Array literal.
can anyone help
One way:
df1.select("COL1").distinct().show()
df2.select(('col1').isin(df1.select("COL1").distinct()))
-------
Second way :
uniquelist=df1.select("COL1").distinct().collect()
df2.filter(F.col('col1').contains(uniqueVIN)).show()
can anyone help me solve the error :
An error occurred while calling z:org.apache.spark.sql.functions.lit.
I also have to perform a "is not in"
data_array = np.array(df_list.select("f_col").collect())
df_filtered = df_2.filter(~df_2["colname"].isin([data_array]))
collect() returns a list of Row objects, you need to get the values from the rows before passing it to isin column method:
unique_list = [r["COL1"] for r in df1.select("COL1").distinct().collect()]
df2.filter(F.col('col1').isin(unique_list)).show()
However, you should use join for this:
Use left_semi to get row from df2 with corresponding rows from df1:
df2.join(df1, df1["COL1"] == df2["col1"], "left_semi").show()
And left_anti to get rows from df2 that have no corresponding values in df1:
df2.join(df1, df1["COL1"] == df2["col1"], "left_anti").show()

How do you split All columns in a large pandas data frame?

I have a very large data frame that I want to split ALL of the columns except first two based on a comma delimiter. So I need to logically reference column names in a loop or some other way to split all the columns in one swoop.
In my testing of the split method:
I have been able to explicitly refer to ( i.e. HARD CODE) a single column name (rs145629793) as one of the required parameters and the result was 2 new columns as I wanted.
See python code below
HARDCODED COLUMN NAME --
df[['rs1','rs2']] = df.rs145629793.str.split(",", expand = True)
The problem:
It is not feasible to refer to the actual column names and repeat code.
I then replaced the actual column name rs145629793 with columns[2] in the split method parameter list.
It results in an ERROR
'str has ni str attribute'
You can index columns by position rather than name using iloc. For example, to get the third column:
df.iloc[:, 2]
Thus you can easily loop over the columns you need.
I know what you are asking, but it's still helpful to provide some input data and expected output data. I have included random input data in my code below, so you can just copy and paste this to run, and try to apply it to your dataframe:
import pandas as pd
your_dataframe=pd.DataFrame({'a':['1,2,3', '9,8,7'],
'b':['4,5,6', '6,5,4'],
'c':['7,8,9', '3,2,1']})
import copy
def split_cols(df):
dict_of_df = {}
cols=df.columns.to_list()
for col in cols:
key_name = 'df'+str(col)
dict_of_df[key_name] = copy.deepcopy(df)
var=df[col].str.split(',', expand=True).add_prefix(col)
df=pd.merge(df, var, how='left', left_index=True, right_index=True).drop(col, axis=1)
return df
split_cols(your_dataframe)
Essentially, in this solution you create a list of the columns that you want to loop through. Then you loop through that list and create new dataframes for each column where you run the split() function. Then you merge everything back together on the index. I also:
included a prefix of the column name, so the column names did not have duplicate names and could be more easily identifiable
dropped the old column that we did the split on.
Just import copy and use the split_cols() function that I have created and pass the name of your dataframe.

Pandas - Appending data from one Dataframe to

I have a Dataframe (called df) that has list of tickets worked for a given date. I have a script that runs each day where this df gets generated and I would like to have a new master dataframe (lets say df_master) that appends values form df to a new Dataframe. So anytime I view df_master I should be able to see all the tickets worked across multiple days. Also would like to have a new column in df_master that shows date when the row was inserted.
Given below is how df looks like:
1001
1002
1003
1004
I tried to perform concat but it threw an error
TypeError: first argument must be an iterable of pandas objects, you passed an object of type "Series"
Update
df_ticket = tickets['ticket']
df_master = df_ticket
df_master['Date'] = pd.Timestamp('now').normalize()
L = [df_master,tickets]
master_df = pd.concat(L)
master_df.to_csv('file.csv', mode='a', header=False, index=False)
I think you need pass sequence to concat, obviously list is used:
objs : a sequence or mapping of Series, DataFrame, or Panel objects
If a dict is passed, the sorted keys will be used as the keys argument, unless it is passed, in which case the values will be selected (see below). Any None objects will be dropped silently unless they are all None in which case a ValueError will be raised
L = [s1,s2]
df = pd.concat(L)
And it seems you pass only Series, so raised error:
df = pd.concat(s)
For insert Date column is possible set pd.Timestamp('now').normalize(), for master df I suggest create one file and append each day DataFrame:
df_ticket = tickets[['ticket']]
df_ticket['Date'] = pd.Timestamp('now').normalize()
df_ticket.to_csv('file.csv', mode='a', header=False, index=False)
df_master = pd.read_csv('file.csv', header=None)

pandas merge produce duplicate columns

n1 = DataFrame({'zhanghui':[1,2,3,4] , 'wudi':[17,'gx',356,23] ,'sas'[234,51,354,123] })
n2 = DataFrame({'zhanghui_x':[1,2,3,5] , 'wudi':[17,23,'sd',23] ,'wudi_x':[17,23,'x356',23] ,'wudi_y':[17,23,'y356',23] ,'ddd':[234,51,354,123] })
code above defined two DataFrame objects. I wanna use 'zhanghui' field from n1 and 'zhanghui_x' field from n2 as "on" field merge n1 and n2,so my code like this:
n1.merge(n2,how = 'inner',left_on = 'zhanghui',right_on='zhanghui_x')
and then result columns given like this :
sas wudi_x zhanghui ddd wudi_y wudi_x wudi_y zhanghui_x
Some duplicate columns appeared,such as 'wudi_x' ,'wudi_y'.
So it's a pandas inner problems or I had a wrong usage about pd.merge ?
From pandas documentation, the merge() function has following properties;
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
left_index=False, right_index=False, sort=True,
suffixes=('_x', '_y'), copy=True, indicator=False,
validate=None)
where suffixes denote default suffix string to be attached to 'over-lapping' columns with defaults '_x' and '_y'.
I'm not sure if I understood your follow-up question correctly, but;
#case1
if the first dataFrame has column 'column_name_x' and the second dataFrame has column 'column_name' then there are no over-lapping columns and therefore no suffixes are attached.
#case2
if the first dataFrame has columns 'column_name', 'column_name_x' and the second dataFrame also has column 'column_name', the default suffixes attach to over-lapping columns and therefore the first frame's 'columnn_name' becomes 'column_name_x' and result in a duplicate of already existing column.
You can however, pass a None value to one(not all) of the suffixes to ensure that column names of certain dataFrame remain as-is.
Your approach is right, pandas automatically gives postscripts after merging the columns that are "duplicated" with the original headers given a postscript _x, _y, etc.
you can first select what columns to merge and proceed:
cols_to_use = n2.columns - n1.columns
n1.merge(n2[cols_to_use],how = 'inner',left_on = 'zhanghui',right_on='zhanghui_x')
result columns:
sas wudi zhanghui ddd wudi_x wudi_y zhanghui_x
When I tried to run cols_to_use = n2.columns - n1.columns,it gave me a TypeError like this:
cannot perform __sub__ with this index type: <class pandas.core.indexes.base.Index'>
then I tried to use code below:
cols_to_use = [i for i in list(n2.columns) if i not in list(n1.columns) ]
It worked fine,result columns given like this:
sas wudi zhanghui ddd wudi_x wudi_y zhanghui_x
So,#S Ringne's method really resolved my problems.
=============================================
Pandas just simply add suffix such as '_x' to resolve the duplicate-column-name problem when it comes to merging two Frame objects.
But what will it happen if the name form of 'a-column-name'+'_x' appears in either Frame object? I used to think that it will check if the name form of 'a-column-name'+'_x' appears, But actually pandas doesn't have this check?