How to get name of dataframe column in PySpark?

In pandas, this can be done by column.name.
But how do you do the same when it's a column of a Spark dataframe?
E.g. the calling program has a Spark dataframe: spark_df
>>> spark_df.columns
['admit', 'gre', 'gpa', 'rank']
This program calls my function: my_function(spark_df['rank'])
In my_function, I need the name of the column, i.e. 'rank'.
If it were a pandas dataframe, we could use this:
>>> pandas_df['rank'].name
'rank'

You can get the names from the schema by doing
spark_df.schema.names
Printing the schema can be useful to visualize it as well
spark_df.printSchema()

The only way is to go down a level to the underlying JVM.
df.col._jc.toString().encode('utf8')
This is also how it is converted to a str in the pyspark code itself.
From pyspark/sql/column.py:
def __repr__(self):
    return 'Column<%s>' % self._jc.toString().encode('utf8')
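For the question's use case, a minimal sketch of my_function built on the same call (note that _jc is a private attribute, so this may change between Spark versions):
def my_function(col):
    # toString() on the underlying Java Column yields the column's name
    # for an unaliased column
    name = col._jc.toString()
    print(name)  # prints 'rank' when called as my_function(spark_df['rank'])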

As @numeral correctly said, column._jc.toString() works fine in the case of unaliased columns.
In the case of aliased columns (i.e. column.alias("whatever")) the alias can be extracted, even without using regular expressions: str(column).split(" AS ")[1].split("`")[1].
I don't know the Scala syntax, but I'm sure it can be done the same way.
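A quick sketch of that extraction (the str() rendering of a Column varies across Spark versions, so the parsing positions are assumptions based on the older Column<b'...'> format):
from pyspark.sql import functions as F
col = F.col("rank").alias("whatever")
# on older versions, str(col) looks like: Column<b'rank AS `whatever`'>
alias = str(col).split(" AS ")[1].split("`")[1]
print(alias)  # 'whatever'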

If you want the column names of your dataframe, you can use the pyspark.sql.DataFrame.columns attribute. I'm not sure the API supports explicitly indexing a DF by column name. I received this traceback:
>>> df.columns['High']
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: list indices must be integers, not str
However, accessing the columns attribute on your dataframe, as you have done, will return a list of column names:
df.columns will return ['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close']
If you want the column datatypes, you can use the dtypes attribute:
df.dtypes will return [('Date', 'timestamp'), ('Open', 'double'), ('High', 'double'), ('Low', 'double'), ('Close', 'double'), ('Volume', 'int'), ('Adj Close', 'double')]
If you want a particular column, you'll need to access it by index:
df.columns[2] will return 'High'

I found the answer is very, very simple...
// It is in Java, but it should be the same in PySpark
Column col = ds.col("colName"); // the column object
String theNameOftheCol = col.toString();
The variable theNameOftheCol now holds "colName".
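A rough PySpark analogue of the Java snippet above (a sketch; plain str() on a PySpark Column wraps the name in Column<...>, so the JVM toString() is used instead):
col = df['colName']  # the Column object
name = col._jc.toString()  # 'colName' for an unaliased column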

I hope these options can serve as more universal ones. Cases covered:
column not having an alias
column having an alias
column having several consecutive aliases
column names surrounded with backticks
No regex:
str(col).replace("`", "").split("'")[-2].split(" AS ")[-1]
Using regex:
import re
re.search(r"'.*?`?(\w+)`?'", str(col)).group(1)
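A quick check of both variants (a sketch; as noted above, the exact str() rendering of a Column is version-dependent):
import re
from pyspark.sql import functions as F
col = F.col("rank").alias("first").alias("second")
raw = str(col)  # e.g. Column<b'rank AS `first` AS `second`'> on older versions
print(raw.replace("`", "").split("'")[-2].split(" AS ")[-1])  # 'second'
print(re.search(r"'.*?`?(\w+)`?'", raw).group(1))  # 'second'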

# table names as an example, if you have multiple tables
loc = '/mnt/'  # base location, e.g. for external tables or any folder
table_names = ['customer', 'department']
for name in table_names:
    print(name)  # print the existing table name
    df = spark.read.format('parquet').load(f"{loc}{name.lower()}/")  # create a dataframe from the table
    for col in df.dtypes:
        print(col[0])  # column name
        print(col[1])  # datatype of the respective column

Since none of the answers have been marked as the Answer -
I may be over-simplifying the OP's ask, but:
my_list = spark_df.schema.fields
for field in my_list:
    print(field.name)

Related

Why can't we use the argument expand of split() inside apply() (in pandas)?

# Split a single column into two columns using apply()
df[['First Name', 'Last Name']] = df["Student_details"].apply(lambda x: pd.Series(str(x).split("_")))
print(df)
1. Why, when I change the code to .apply(lambda x: str(x).split("_", expand=True)), do I get the error "expand is invalid argument to split function"?
2. Why do I have to use pd.Series(), although the default return value of str.split() is a Series?
3. How does pd.Series() return a Series, while it returns a DF here?
I tried to write expand and use it normally, but it didn't work.
here is the DF
import pandas as pd
import numpy as np

technologies = {
    'Student_details': ["Pramodh_Roy", "Leena_Singh", "James_William", "Addem_Smith"],
    'Courses': ["Spark", "PySpark", "Pandas", "Hadoop"],
    'Fee': [25000, 20000, 22000, 25000]
}
df = pd.DataFrame(technologies)
print(df)
df[['First Name', 'Last Name']] = df["Student_details"].str.split("_", expand=True)
I don't get what you want... is it about the solution above? Or do you really want to know why question 1 throws an error?
EDIT 1:
The expand parameter does not exist for the split function of the str type you are referring to, as that is Python's built-in str type. The expand parameter is implemented for the split function of a pandas Series.
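A minimal illustration of the difference:
import pandas as pd
s = pd.Series(["Pramodh_Roy"])
print(s.str.split("_", expand=True))  # pandas Series.str.split: accepts expand, returns a DataFrame
print("Pramodh_Roy".split("_"))  # built-in str.split: returns a list, has no expand parameter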
EDIT 2:
Re your third question: as you can see in my suggestion, I'm not even using the pd.Series function; however, df["Student_details"] is of course a Series. The key in my answer is that the expand parameter makes the split return a DF with as many columns as required for the split results. So if one of the names were "a_b_c_d", I would get a df with four columns in total.
EDIT 3:
pd.Series will convert the result to a Series. The result of the split function is a list, which in turn is cast to a Series. All of which would not have been necessary, in my eyes. The piece of code I wrote saves time, both in execution and in understanding (I hope).
FYI, this does work:
technologies = {
    'Student_details': ["Pramodh_Roy", "Leena_Singh", "James_William", "Addem_Smith"],
    'Courses': ["Spark", "PySpark", "Pandas", "Hadoop"],
    'Fee': [25000, 20000, 22000, 25000]
}
df = pd.DataFrame(technologies)
df[['First Name', 'Last Name']] = df["Student_details"].apply(lambda x: pd.Series(str(x).split("_")))
print(df)
Output:
  Student_details  Courses    Fee First Name Last Name
0     Pramodh_Roy    Spark  25000    Pramodh       Roy
1     Leena_Singh  PySpark  20000      Leena     Singh
2   James_William   Pandas  22000      James   William
3     Addem_Smith   Hadoop  25000      Addem     Smith

Pandas splitting a column with new line separator

I am extracting tables from pdf using Camelot. Two of the columns are getting merged together with a newline separator. Is there a way to separate them into two columns?
Suppose the column looks like this.
A\nB
1\n2
2\n3
3\n4
Desired output:
|A|B|
|-|-|
|1|2|
|2|3|
|3|4|
I have tried df['A\nB'].str.split('\n', 2, expand=True), and that splits it into two columns; however, I want the new column names to be A and B, not 0 and 1. Also, I need to pass a generalized column label instead of the actual column name, since I need to implement this for several docs which may have different column names. I can determine such a column name in my dataframe using
colNew = df.columns[df.columns.str.contains(pat = '\n')]
However when I pass colNew in split function, it throws an attribute error
df[colNew].str.split('\n', 2, expand=True)
AttributeError: 'DataFrame' object has no attribute 'str'
You can take advantage of the pandas split function.
import pandas as pd

# recreate your pandas dataframe from above
df = pd.DataFrame({'A\nB': ['1\n2', '2\n3', '3\n4']})
# first: make sure the column is of str type
# second: split the column on the separator \n
# third: set expand=True, since you want the split results to become two new columns
test = df['A\nB'].astype('str').str.split('\n', expand=True)
# some renaming
test.columns = ['A', 'B']
I hope this is helpful.
I reproduced the error on my side... I guess the issue is that df[colNew] is still a dataframe, since colNew is an Index rather than a single label.
But .str.split() only works on a Series. So, taking your code as an example, I would convert the dataframe to a Series using iloc[:,0].
Then another line to set the column headers:
df2=df[colNew].iloc[:,0].str.split('\n', 2, expand=True)
df2.columns = 'A\nB'.split('\n')
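Putting the two steps together, a generalized sketch (assuming exactly one column header contains a newline):
# find the merged column by its header, split the values,
# and reuse the header parts as the new column names
col = df.columns[df.columns.str.contains('\n')][0]  # e.g. 'A\nB'
df[col.split('\n')] = df[col].str.split('\n', expand=True)
df = df.drop(columns=[col])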

Trouble specifying column names when making dataframe from aggregate function

When I make a dataframe with:
freq = pd.DataFrame(combined.groupby(['Latitude', 'Longitude','from_station_name']).agg('count')['trip_id'])
It works just fine, but when I attempt:
freq = pd.DataFrame(combined.groupby(['Latitude', 'Longitude','from_station_name']).agg('count')['trip_id'], columns = ['lat','long','station','trips'])
I just see the headers when I look at the dataframe. I can make the dataframe and then use:
freq.columns = ['lat','long','station','trips']
But I was wondering how to do this in one step. I've tried specifying "data =" for the aggregate function, double-enclosing the brackets for the column names, and removing the brackets. Any advice is appreciated.
You don't need to pass your groupby object into a new dataframe constructor (as @Vaishali already mentioned); when the columns list doesn't match the data you pass in, the constructor selects nothing, which is why you only see the headers.
If you want to rename your columns after groupby, you can simply do something like:
(combined.groupby(['Latitude', 'Longitude', 'from_station_name'])
 .trip_id.agg('count')
 .reset_index()
 .rename(columns={'Latitude': 'lat', 'Longitude': 'long', 'from_station_name': 'station', 'trip_id': 'trips'}))
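A self-contained check of that pattern (with a tiny made-up frame, since combined isn't shown in the question):
import pandas as pd
combined = pd.DataFrame({
    'Latitude': [1.0, 1.0, 2.0],
    'Longitude': [3.0, 3.0, 4.0],
    'from_station_name': ['a', 'a', 'b'],
    'trip_id': [10, 11, 12],
})
freq = (combined.groupby(['Latitude', 'Longitude', 'from_station_name'])
        .trip_id.agg('count')
        .reset_index()
        .rename(columns={'Latitude': 'lat', 'Longitude': 'long',
                         'from_station_name': 'station', 'trip_id': 'trips'}))
print(freq)  # columns: lat, long, station, trips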

Data slicing a pandas frame - I'm having problems with unique

I am facing issues trying to select a subset of columns and running unique on it.
Source Data:
df_raw = pd.read_csv('data/master.csv', nrows=10000)
df_raw.shape
Produces:
(10000, 86)
Process Data:
df = df_raw[['A','B','C']]
df.shape
Produces:
(10000, 3)
Furthermore, doing:
df_raw.head()
df.head()
produces a correct list of rows and columns.
However,
print('RAW:',sorted(df_raw['A'].unique()))
works perfectly
Whilst:
print('PROCESSED:',sorted(df['A'].unique()))
produces:
AttributeError: 'DataFrame' object has no attribute 'unique'
What am I doing wrong? If the shape and head output are exactly what I want, I'm confused why my processed dataset is throwing errors. I did read Pandas 'DataFrame' object has no attribute 'unique' on SO, which correctly states that unique needs to be applied to columns, which is what I am doing.
This was a case of a duplicate column. Since this is proprietary data, I abstracted it as 'A', 'B', 'C' in this question, which masked the problem. (The real data set had 86 columns, and I had duplicated one of those columns in my subset, then tried to run unique on it.)
My problem was this:
df_raw = pd.read_csv('data/master.csv', nrows=10000)
df = df_raw[['A','B','C', 'A']] # <-- I did not notice until later that I had duplicated 'A'
This was causing problems when doing a unique on 'A'
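A minimal sketch reproducing the error with a duplicated column label:
import pandas as pd
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})
sub = df[['A', 'B', 'C', 'A']]  # 'A' selected twice
sub['A'].unique()  # sub['A'] is now a two-column DataFrame -> AttributeError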
To extract a subset of data from the entire dataframe based on a column ID, this works!!
df = df.drop_duplicates(subset=['Id'])  # where 'Id' is the column used to filter
print(df)

pandas merge produces duplicate columns

n1 = pd.DataFrame({'zhanghui': [1, 2, 3, 4], 'wudi': [17, 'gx', 356, 23], 'sas': [234, 51, 354, 123]})
n2 = pd.DataFrame({'zhanghui_x': [1, 2, 3, 5], 'wudi': [17, 23, 'sd', 23], 'wudi_x': [17, 23, 'x356', 23], 'wudi_y': [17, 23, 'y356', 23], 'ddd': [234, 51, 354, 123]})
The code above defines two DataFrame objects. I want to merge n1 and n2 using the 'zhanghui' field from n1 and the 'zhanghui_x' field from n2 as the "on" fields, so my code looks like this:
n1.merge(n2,how = 'inner',left_on = 'zhanghui',right_on='zhanghui_x')
and the result columns come out like this:
sas wudi_x zhanghui ddd wudi_y wudi_x wudi_y zhanghui_x
Some duplicate columns appeared, such as 'wudi_x' and 'wudi_y'.
So is this a pandas problem, or did I use pd.merge wrongly?
From the pandas documentation, the merge() function has the following signature:
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
left_index=False, right_index=False, sort=True,
suffixes=('_x', '_y'), copy=True, indicator=False,
validate=None)
where suffixes denotes the suffix strings to be attached to 'over-lapping' columns, with defaults '_x' and '_y'.
I'm not sure if I understood your follow-up question correctly, but:
#case1
If the first dataFrame has a column 'column_name_x' and the second dataFrame has a column 'column_name', then there are no over-lapping columns, and therefore no suffixes are attached.
#case2
If the first dataFrame has the columns 'column_name' and 'column_name_x', and the second dataFrame also has a column 'column_name', the default suffixes attach to the over-lapping columns, so the first frame's 'column_name' becomes 'column_name_x' and results in a duplicate of an already existing column.
You can, however, pass a None value to one (not both) of the suffixes to ensure that the column names of that dataFrame remain as-is.
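A small sketch of the suffix behaviour described above (illustrative names):
import pandas as pd
left = pd.DataFrame({'key': [1, 2], 'val': ['a', 'b']})
right = pd.DataFrame({'key': [1, 2], 'val': ['c', 'd']})
# default suffixes: the over-lapping 'val' becomes val_x / val_y
print(left.merge(right, on='key').columns.tolist())  # ['key', 'val_x', 'val_y']
# None keeps the left frame's column name as-is
print(left.merge(right, on='key', suffixes=(None, '_y')).columns.tolist())  # ['key', 'val', 'val_y']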
Your approach is right; pandas automatically adds suffixes such as _x and _y after merging to the columns that are "duplicated" relative to the original headers.
you can first select what columns to merge and proceed:
cols_to_use = n2.columns - n1.columns
n1.merge(n2[cols_to_use],how = 'inner',left_on = 'zhanghui',right_on='zhanghui_x')
result columns:
sas wudi zhanghui ddd wudi_x wudi_y zhanghui_x
When I tried to run cols_to_use = n2.columns - n1.columns, it gave me a TypeError like this:
cannot perform __sub__ with this index type: <class 'pandas.core.indexes.base.Index'>
Then I tried the code below:
cols_to_use = [i for i in list(n2.columns) if i not in list(n1.columns)]
It worked fine, and the result columns came out like this:
sas wudi zhanghui ddd wudi_x wudi_y zhanghui_x
So @S Ringne's method really resolved my problem.
=============================================
Pandas simply adds a suffix such as '_x' to resolve the duplicate-column-name problem when merging two Frame objects.
But what happens if a name of the form 'a-column-name' + '_x' already appears in one of the Frame objects? I used to think pandas would check whether the form 'a-column-name' + '_x' already appears, but apparently pandas doesn't have this check?
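A quick way to check this yourself (a sketch; the outcome is version-dependent, and recent pandas versions warn or raise when suffixing would produce duplicate result columns):
import pandas as pd
left = pd.DataFrame({'key': [1], 'val': ['a'], 'val_x': ['b']})
right = pd.DataFrame({'key': [1], 'val': ['c']})
merged = left.merge(right, on='key')
print(merged.columns.tolist())  # may contain 'val_x' twice on older versions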