Trouble specifying column names when making dataframe from aggregate function - pandas

When I make a dataframe with:
freq = pd.DataFrame(combined.groupby(['Latitude', 'Longitude','from_station_name']).agg('count')['trip_id'])
It works just fine, but when I attempt:
freq = pd.DataFrame(combined.groupby(['Latitude', 'Longitude','from_station_name']).agg('count')['trip_id'], columns = ['lat','long','station','trips'])
I just see the headers when I look at the dataframe. I can make the dataframe and then use:
freq.columns = ['lat','long','station','trips']
But was wondering how to do this in one step. I've tried specifying "data =" for the aggregate function. Tried double enclosing the brackets for the column names, removing the brackets for the column names. Any advice is appreciated.

You don't need to pass your groupby object into a new dataframe constructor (like #Vaishali mentioned already)
If you want to rename your columns after groupby, you can simply do something like:
combined.groupby(['Latitude', 'Longitude','from_station_name']).trip_id.agg('count').rename(columns={'Latitude': 'lat', 'Longitude': 'long', 'from_station_name':'station', 'count': 'trips})

Related

How to rename the columns starting with 'abcd' to starting with 'wxyz' in Spark Scala?

How can I rename the columns starting with abcd to starting with wxyz.
List of columns: abcd_name, abcd_id, abcd_loc, empId, empCode
I need to change the names of columns in a dataframe that starts with abcd
Required column list: wxyz_name, wxyz_id, wxyz_loc, empId, empCode
I tried getting all the columns' lists using the below code, but not sure how to implement it.
val df_cols_abcd = df.columns.filter(_.startsWith("abcd")).map(df(_))
You can do that with foldLeft:
val oldPrefix = "abcd"
val newPrefix = "wxyz"
val newDf = df.columns
.filter(_.startsWith(oldPrefix))
.foldLeft(df)((acc, oldName) =>
acc.withColumnRenamed(oldName, newPrefix + oldName.substring(oldPrefix.length))
)
Your first idea to filter columns with startWith is correct. The only think you miss the the part where you rename all the columns.
I recommend to do some research about foldLeft if you're not familiar with. The idea is the following:
I start with an initial dataframe (df in the first brackets).
I will apply a function to it with each of the columns I need to rename (the function is the one in the second brackets). This function takes as argument an accumulator (acc) that is an intermediate dataframe (because it will rename the columns one at a time), and another argument which is the current element of the list (here the list contains the name of the columns that need to be modified).

Pandas splitting a column with new line separator

I am extracting tables from pdf using Camelot. Two of the columns are getting merged together with a newline separator. Is there a way to separate them into two columns?
Suppose the column looks like this.
A\nB
1\n2
2\n3
3\n4
Desired output:
|A|B|
|-|-|
|1|2|
|2|3|
|3|4|
I have tried df['A\nB'].str.split('\n', 2, expand=True) and that splits it into two columns however I want the new column names to be A and B and not 0 and 1. Also I need to pass a generalized column label instead of actual column name since I need to implement this for several docs which may have different column names. I can determine such column name in my dataframe using
colNew = df.columns[df.columns.str.contains(pat = '\n')]
However when I pass colNew in split function, it throws an attribute error
df[colNew].str.split('\n', 2, expand=True)
AttributeError: DataFrame object has no attribute 'str'
You can take advantage of the Pandas split function.
import pandas as pd
# recreate your pandas series above.
df = pd.DataFrame({'A\nB':['1\n2','2\n3','3\n4']})
# first: Turn the col into str.
# second. split the col based on seperator \n
# third: make sure expand as True since you want the after split col become two new col
test = df['A\nB'].astype('str').str.split('\n',expand=True)
# some rename
test.columns = ['A','B']
I hope this is helpful.
I reproduced the error from my side... I guess the issue is that "df[colNew]" is still a dataframe as it contains the indexes.
But .str.split() only works on Series. So taking as example your code, I would convert the dataframe to series using iloc[:,0].
Then another line to split the column headers:
df2=df[colNew].iloc[:,0].str.split('\n', 2, expand=True)
df2.columns = 'A\nB'.split('\n')

How do I return this groupby calculated values back to the dataframe as a single column?

I am new to Pandas. Sorry for using images instead of tables here; I tried to follow the instructions for inserting a table, but I couldn't.
Pandas version: '1.3.2'
Given this dataframe with Close and Volume for stocks, I've managed to calculate OBV, using pandas, like this:
df.groupby('Ticker').apply(lambda x: (np.sign(x['Close'].diff().fillna(0)) * x['Volume']).cumsum())
The above gave me the correct values for OBV as
shown here.
However, I'm not able to assign the calculated values to a new column.
I would like to do something like this:
df['OBV'] = df.groupby('Ticker').apply(lambda x: (np.sign(x['Close'].diff().fillna(0)) * x['Volume']).cumsum())
But simply doing the expression above of course will throw us the error:
ValueError: Columns must be same length as key
What am I missing?
How can I insert the calculated values into the original dataframe as a single column, df['OBV'] ?
I've checked this thread so I'm sure I should use apply.
This discussion looked promising, but it is not for my case
Use Series.droplevel for remove first level of MultiIndex:
df['OBV'] = df.groupby('Ticker').apply(lambda x: (np.sign(x['Close'].diff().fillna(0)) * x['Volume']).cumsum()).droplevel(0)

How do you split All columns in a large pandas data frame?

I have a very large data frame that I want to split ALL of the columns except first two based on a comma delimiter. So I need to logically reference column names in a loop or some other way to split all the columns in one swoop.
In my testing of the split method:
I have been able to explicitly refer to ( i.e. HARD CODE) a single column name (rs145629793) as one of the required parameters and the result was 2 new columns as I wanted.
See python code below
HARDCODED COLUMN NAME --
df[['rs1','rs2']] = df.rs145629793.str.split(",", expand = True)
The problem:
It is not feasible to refer to the actual column names and repeat code.
I then replaced the actual column name rs145629793 with columns[2] in the split method parameter list.
It results in an ERROR
'str has ni str attribute'
You can index columns by position rather than name using iloc. For example, to get the third column:
df.iloc[:, 2]
Thus you can easily loop over the columns you need.
I know what you are asking, but it's still helpful to provide some input data and expected output data. I have included random input data in my code below, so you can just copy and paste this to run, and try to apply it to your dataframe:
import pandas as pd
your_dataframe=pd.DataFrame({'a':['1,2,3', '9,8,7'],
'b':['4,5,6', '6,5,4'],
'c':['7,8,9', '3,2,1']})
import copy
def split_cols(df):
dict_of_df = {}
cols=df.columns.to_list()
for col in cols:
key_name = 'df'+str(col)
dict_of_df[key_name] = copy.deepcopy(df)
var=df[col].str.split(',', expand=True).add_prefix(col)
df=pd.merge(df, var, how='left', left_index=True, right_index=True).drop(col, axis=1)
return df
split_cols(your_dataframe)
Essentially, in this solution you create a list of the columns that you want to loop through. Then you loop through that list and create new dataframes for each column where you run the split() function. Then you merge everything back together on the index. I also:
included a prefix of the column name, so the column names did not have duplicate names and could be more easily identifiable
dropped the old column that we did the split on.
Just import copy and use the split_cols() function that I have created and pass the name of your dataframe.

pandas merge produce duplicate columns

n1 = DataFrame({'zhanghui':[1,2,3,4] , 'wudi':[17,'gx',356,23] ,'sas'[234,51,354,123] })
n2 = DataFrame({'zhanghui_x':[1,2,3,5] , 'wudi':[17,23,'sd',23] ,'wudi_x':[17,23,'x356',23] ,'wudi_y':[17,23,'y356',23] ,'ddd':[234,51,354,123] })
code above defined two DataFrame objects. I wanna use 'zhanghui' field from n1 and 'zhanghui_x' field from n2 as "on" field merge n1 and n2,so my code like this:
n1.merge(n2,how = 'inner',left_on = 'zhanghui',right_on='zhanghui_x')
and then result columns given like this :
sas wudi_x zhanghui ddd wudi_y wudi_x wudi_y zhanghui_x
Some duplicate columns appeared,such as 'wudi_x' ,'wudi_y'.
So it's a pandas inner problems or I had a wrong usage about pd.merge ?
From pandas documentation, the merge() function has following properties;
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
left_index=False, right_index=False, sort=True,
suffixes=('_x', '_y'), copy=True, indicator=False,
validate=None)
where suffixes denote default suffix string to be attached to 'over-lapping' columns with defaults '_x' and '_y'.
I'm not sure if I understood your follow-up question correctly, but;
#case1
if the first dataFrame has column 'column_name_x' and the second dataFrame has column 'column_name' then there are no over-lapping columns and therefore no suffixes are attached.
#case2
if the first dataFrame has columns 'column_name', 'column_name_x' and the second dataFrame also has column 'column_name', the default suffixes attach to over-lapping columns and therefore the first frame's 'columnn_name' becomes 'column_name_x' and result in a duplicate of already existing column.
You can however, pass a None value to one(not all) of the suffixes to ensure that column names of certain dataFrame remain as-is.
Your approach is right, pandas automatically gives postscripts after merging the columns that are "duplicated" with the original headers given a postscript _x, _y, etc.
you can first select what columns to merge and proceed:
cols_to_use = n2.columns - n1.columns
n1.merge(n2[cols_to_use],how = 'inner',left_on = 'zhanghui',right_on='zhanghui_x')
result columns:
sas wudi zhanghui ddd wudi_x wudi_y zhanghui_x
When I tried to run cols_to_use = n2.columns - n1.columns,it gave me a TypeError like this:
cannot perform __sub__ with this index type: <class pandas.core.indexes.base.Index'>
then I tried to use code below:
cols_to_use = [i for i in list(n2.columns) if i not in list(n1.columns) ]
It worked fine,result columns given like this:
sas wudi zhanghui ddd wudi_x wudi_y zhanghui_x
So,#S Ringne's method really resolved my problems.
=============================================
Pandas just simply add suffix such as '_x' to resolve the duplicate-column-name problem when it comes to merging two Frame objects.
But what will it happen if the name form of 'a-column-name'+'_x' appears in either Frame object? I used to think that it will check if the name form of 'a-column-name'+'_x' appears, But actually pandas doesn't have this check?