Convert df column to a tuple - dataframe

I am having trouble converting a df column into a tuple that I can iterate through. I started with simple code that works, like this:
set = ('pare-10040137', 'pare-10034330', 'pare-00022936', 'pare-10025987', 'pare-10036617')
for i in set:
    ref_data = req_data[req_data['REQ_NUM'] == i]
This works fine, but now I want my set to come from a df. The df looks like this:
open_reqs
Out[233]:
REQ_NUM
4825 pare-00023728
4826 pare-00023773
.... ..............
I want all of those REQ_NUM values thrown into a tuple, so I tried open_reqs.apply(tuple, axis=1) and tuple(zip(open_reqs.columns, open_reqs.T.values.tolist())), but I'm not able to iterate through either of these.
My old set looks like this, so this is the format I need to match in order to iterate as before. I'm not sure whether the Unicode is also an issue (when I print the above I get (u'pare-10052173',)).
In[236]: set
Out[236]:
('pare-10040137',
'pare-10034330',
'pare-00022936',
'pare-10025987',
'pare-10036617')
So basically I need the magic code to get a nice simple set like that from the REQ_NUM column of my open_reqs table. Thank you!

The following statement makes a list out of the specified column and then converts it to a tuple (the inner list() call is actually redundant, since tuple() accepts the Series directly):
open_req_list = tuple(list(open_reqs['REQ_NUM']))

You can use the tolist() function to convert the column to a list, and then tuple() the whole list:
req_num = tuple(open_reqs['REQ_NUM'].tolist())
#type(req_num)
req_num
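As a minimal sketch (with made-up stand-ins for open_reqs and req_data), this yields a plain tuple that can be iterated exactly like the original hand-written one:

```python
import pandas as pd

# Hypothetical stand-ins for the asker's open_reqs and req_data frames
open_reqs = pd.DataFrame({'REQ_NUM': ['pare-00023728', 'pare-00023773']})
req_data = pd.DataFrame({'REQ_NUM': ['pare-00023728', 'pare-99999999'],
                         'OWNER': ['alice', 'bob']})

req_num = tuple(open_reqs['REQ_NUM'].tolist())
print(req_num)  # ('pare-00023728', 'pare-00023773')

# Iterate just like the original hard-coded tuple
for i in req_num:
    ref_data = req_data[req_data['REQ_NUM'] == i]
```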

df.columns has a dtype of object. To convert it into a tuple of all the column names, use this code:
df = pd.DataFrame(data)
columns_tuple = tuple(df.columns)
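For example, with a toy frame (column names invented here for illustration):

```python
import pandas as pd

# Toy data; the column names are invented for illustration
df = pd.DataFrame({'REQ_NUM': ['pare-00023728'], 'OWNER': ['alice']})
columns_tuple = tuple(df.columns)
print(columns_tuple)  # ('REQ_NUM', 'OWNER')
```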

Related

How do I return this groupby calculated values back to the dataframe as a single column?

I am new to Pandas. Sorry for using images instead of tables here; I tried to follow the instructions for inserting a table, but I couldn't.
Pandas version: '1.3.2'
Given this dataframe with Close and Volume for stocks, I've managed to calculate OBV, using pandas, like this:
df.groupby('Ticker').apply(lambda x: (np.sign(x['Close'].diff().fillna(0)) * x['Volume']).cumsum())
The above gave me the correct values for OBV as
shown here.
However, I'm not able to assign the calculated values to a new column.
I would like to do something like this:
df['OBV'] = df.groupby('Ticker').apply(lambda x: (np.sign(x['Close'].diff().fillna(0)) * x['Volume']).cumsum())
But simply running the expression above will, of course, throw the error:
ValueError: Columns must be same length as key
What am I missing?
How can I insert the calculated values into the original dataframe as a single column, df['OBV'] ?
I've checked this thread so I'm sure I should use apply.
This discussion looked promising, but it doesn't apply to my case.
Use Series.droplevel to remove the first level of the MultiIndex:
df['OBV'] = df.groupby('Ticker').apply(lambda x: (np.sign(x['Close'].diff().fillna(0)) * x['Volume']).cumsum()).droplevel(0)
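A minimal sketch of why droplevel(0) is needed, using invented tickers and prices: groupby().apply() here returns a Series keyed by a (Ticker, original row index) MultiIndex, and dropping the Ticker level lets the values align back onto df's plain index:

```python
import numpy as np
import pandas as pd

# Invented toy data with two tickers
df = pd.DataFrame({
    'Ticker': ['A', 'A', 'A', 'B', 'B'],
    'Close':  [10, 11, 10, 5, 6],
    'Volume': [100, 200, 150, 50, 80],
})

obv = df.groupby('Ticker').apply(
    lambda x: (np.sign(x['Close'].diff().fillna(0)) * x['Volume']).cumsum()
)
print(obv.index.nlevels)  # 2 -- (Ticker, original row index)

# Drop the Ticker level so the result aligns with df's index
df['OBV'] = obv.droplevel(0)
print(df['OBV'].tolist())  # [0.0, 200.0, 50.0, 0.0, 80.0]
```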

Check multiple columns for multiple values and return a dataframe

I have a list of strings, and my dataframe has several columns that I need to search (each of type object).
I need to return all rows where any of the selected columns contain any of the string items, in whole or in part.
How do I check whether 4 columns in my dataframe contain any one of the items in the list of strings? The string inside the column may contain part of a string from the list, but probably won't contain all of it.
I've tried list both as a tuple and as a Python list:
list = ("25110", "25910", "25990", "30110", "33110", "43999")
new_df = df.loc[(df['column1'].isin(list))
| (df['column2'].isin(list))
| (df['column3'].isin(list))
| (df['column4'].isin(list))]
When I run new_df.shape, I get (0, 12).
I'm new to pandas, have a mountain of analysis to do for an intense uni project, and can't get this to work. Do I need to convert each column to a string datatype first? (I've actually already tried that as well, but each dtype is still stubbornly 'object'.)
IIUC:
try:
lst = ["25110", "25910", "25990", "30110", "33110", "43999"]
cols=['column1','column2','column3','column4']
Finally:
m = df[cols].astype(str).agg(lambda x: x.str.contains('|'.join(lst)), axis=1).any(axis=1)
#you can also use apply() in place of agg()
df[m]
#OR
df.loc[m]
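As a minimal sketch with invented data, the point is that str.contains does substring matching, unlike isin() which requires exact equality, so codes embedded inside longer strings are still found:

```python
import pandas as pd

# Invented frame where the codes sit inside longer strings
df = pd.DataFrame({
    'column1': ['x 25110 y', 'nothing', 'abc'],
    'column2': ['foo', 'bar 25990', 'baz'],
    'column3': ['-', '-', '-'],
    'column4': ['-', '-', '-'],
})

lst = ["25110", "25910", "25990", "30110", "33110", "43999"]
cols = ['column1', 'column2', 'column3', 'column4']

# One regex alternation, applied row-wise, then collapsed with any()
m = df[cols].astype(str).agg(lambda x: x.str.contains('|'.join(lst)), axis=1).any(axis=1)
print(df.loc[m].shape)  # (2, 4)
```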

Pandas selecting dataframe columns using a specific string and array/list

I have a dataframe with hundreds of columns (stocks). My issue is that I need to always pull a specific column (date) followed by an array/list of others (dynamic).
Previously I was doing something like this:
df = stocks[['date', 'AAPL', 'AMZN']]
but now that I need to choose stocks dynamically based on a sector, I am not sure how to make these play nicely together. I am only able to pull the list, without date, like this:
print(rowData['symbol'])
3 [APA.OQ, BKR.N, COG.N, CVX.N, CXO.N, COP.N, DV...
Name: symbol, dtype: object
selection = rowData['symbol'].explode()
df = stocks[selection]
How do I also get the date values? Something like this doesn't work:
df = stocks[['date'][selection]]
Thanks
Let us try
df = stocks[['date'] + rowData['symbol'].iloc[0]]
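A sketch of how this works, with invented stand-ins for stocks and rowData; the key point is that rowData['symbol'].iloc[0] is already a Python list, so prepending ['date'] is plain list concatenation:

```python
import pandas as pd

# Invented stand-ins: stocks has a date column plus one column per ticker,
# and rowData holds a list of tickers in its 'symbol' cell
stocks = pd.DataFrame({'date': ['2021-01-04', '2021-01-05'],
                       'AAPL': [129.4, 131.0],
                       'AMZN': [3186.6, 3218.5],
                       'CVX.N': [86.0, 87.1]})
rowData = pd.DataFrame({'symbol': [['AAPL', 'CVX.N']]})

cols = ['date'] + rowData['symbol'].iloc[0]  # list concatenation keeps 'date' first
df = stocks[cols]
print(list(df.columns))  # ['date', 'AAPL', 'CVX.N']
```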

Selecting columns from a dataframe

I have a dataframe of monthly returns for 1,000 stocks with ids as column names.
monthly returns
I need to select only the columns that match the values in another dataframe which includes the ids I want.
permno list
I'm sure this is really quite simple, but I have been struggling for 2 days and if someone has an easy solution it would be so very much appreciated. Thank you.
You could convert the single-column permno list dataframe (osr_curr_permnos) into a list, and then use that list to select certain columns from your main dataframe (all_rets).
To convert the osr_curr_permnos column "0" into a list, you can use .to_list()
Then, you can use that list to slice all_rets and .copy() to make a fresh copy of it into a new dataframe.
The Python code might look something like this:
keep = osr_curr_permnos['0'].to_list()
selected_rets = all_rets[keep].copy()
"keep" would be a list, and "selected_rets" would be your new dataframe.
If there's a chance that osr_curr_permnos would have duplicates, you'll want to filter those out:
keep = osr_curr_permnos['0'].drop_duplicates().to_list()
selected_rets = all_rets[keep].copy()
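A minimal sketch with invented ids, showing the list-then-slice pattern including the duplicate filter:

```python
import pandas as pd

# Invented miniature versions of the two frames
all_rets = pd.DataFrame({'10001': [0.01, -0.02],
                         '10002': [0.03, 0.00],
                         '10003': [-0.01, 0.02]})
osr_curr_permnos = pd.DataFrame({'0': ['10001', '10003', '10003']})

keep = osr_curr_permnos['0'].drop_duplicates().to_list()
selected_rets = all_rets[keep].copy()
print(list(selected_rets.columns))  # ['10001', '10003']
```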
As I expected, the answer was simpler than I was making it. Basically, I needed to take the integer values in my permnos list and recast them as strings.
osr_curr_permnos['0'] = osr_curr_permnos['0'].apply(str)
keep = osr_curr_permnos['0'].values
Then I can use that to select columns from my returns dataframe which had string values as column headers.
all_rets[keep]
It was all just a mismatch of int vs. string.
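A sketch of that int-vs-string fix with invented values; apply(str) converts the integer permnos so they match the string column headers:

```python
import pandas as pd

# Invented data: string column headers, integer permno list
all_rets = pd.DataFrame({'10001': [0.01], '10002': [0.03]})
osr_curr_permnos = pd.DataFrame({'0': [10001, 10002]})  # ints, not strings

osr_curr_permnos['0'] = osr_curr_permnos['0'].apply(str)
keep = osr_curr_permnos['0'].values
print(list(all_rets[keep].columns))  # ['10001', '10002']
```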

pandas merge produce duplicate columns

from pandas import DataFrame

n1 = DataFrame({'zhanghui': [1, 2, 3, 4], 'wudi': [17, 'gx', 356, 23], 'sas': [234, 51, 354, 123]})
n2 = DataFrame({'zhanghui_x': [1, 2, 3, 5], 'wudi': [17, 23, 'sd', 23], 'wudi_x': [17, 23, 'x356', 23], 'wudi_y': [17, 23, 'y356', 23], 'ddd': [234, 51, 354, 123]})
The code above defines two DataFrame objects. I want to use the 'zhanghui' field from n1 and the 'zhanghui_x' field from n2 as the "on" fields to merge n1 and n2, so my code looks like this:
n1.merge(n2,how = 'inner',left_on = 'zhanghui',right_on='zhanghui_x')
and the result columns come out like this:
sas wudi_x zhanghui ddd wudi_y wudi_x wudi_y zhanghui_x
Some duplicate columns appeared, such as 'wudi_x' and 'wudi_y'.
So is this a pandas internal problem, or am I using pd.merge incorrectly?
From the pandas documentation, the merge() function has the following signature:
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
left_index=False, right_index=False, sort=True,
suffixes=('_x', '_y'), copy=True, indicator=False,
validate=None)
where suffixes denotes the suffix strings to attach to overlapping columns, with defaults '_x' and '_y'.
I'm not sure if I understood your follow-up question correctly, but;
#case1
If the first DataFrame has a column 'column_name_x' and the second DataFrame has a column 'column_name', then there are no overlapping columns, and therefore no suffixes are attached.
#case2
If the first DataFrame has columns 'column_name' and 'column_name_x', and the second DataFrame also has a column 'column_name', then the default suffixes attach to the overlapping column: the first frame's 'column_name' becomes 'column_name_x', duplicating the already-existing column.
You can, however, pass None as one (but not both) of the suffixes to ensure that the column names of that DataFrame remain as-is.
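A small demonstration of both behaviors with invented frames: the default suffixes rename both overlapping columns, while passing None for one side keeps that side's name untouched:

```python
import pandas as pd

left = pd.DataFrame({'key': [1, 2], 'wudi': ['a', 'b']})
right = pd.DataFrame({'key': [1, 2], 'wudi': ['c', 'd']})

# Default suffixes rename BOTH overlapping columns
both = left.merge(right, on='key')
print(list(both.columns))  # ['key', 'wudi_x', 'wudi_y']

# Keep the left-hand name as-is; only the right-hand copy gets a suffix
keep_left = left.merge(right, on='key', suffixes=(None, '_r'))
print(list(keep_left.columns))  # ['key', 'wudi', 'wudi_r']
```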
Your approach is right; after merging, pandas automatically appends a suffix (_x, _y, etc.) to columns that would otherwise duplicate the original headers.
you can first select what columns to merge and proceed:
cols_to_use = n2.columns - n1.columns
n1.merge(n2[cols_to_use],how = 'inner',left_on = 'zhanghui',right_on='zhanghui_x')
result columns:
sas wudi zhanghui ddd wudi_x wudi_y zhanghui_x
When I tried to run cols_to_use = n2.columns - n1.columns, it gave me a TypeError like this:
cannot perform __sub__ with this index type: <class 'pandas.core.indexes.base.Index'>
Then I tried the code below:
cols_to_use = [i for i in list(n2.columns) if i not in list(n1.columns)]
It worked fine; the result columns came out like this:
sas wudi zhanghui ddd wudi_x wudi_y zhanghui_x
So @S Ringne's method really resolved my problem.
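For completeness, pandas also exposes this set operation as Index.difference, which avoids the __sub__ TypeError; note that, unlike the list comprehension, it returns the names sorted rather than in n2's original order:

```python
import pandas as pd

# Trimmed-down versions of the two frames
n1 = pd.DataFrame({'zhanghui': [1, 2], 'wudi': [17, 'gx'], 'sas': [234, 51]})
n2 = pd.DataFrame({'zhanghui_x': [1, 2], 'wudi': [17, 23], 'ddd': [234, 51]})

# Supported replacement for the removed `n2.columns - n1.columns` idiom
cols_to_use = n2.columns.difference(n1.columns)
print(list(cols_to_use))  # ['ddd', 'zhanghui_x']
```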
=============================================
Pandas simply adds a suffix such as '_x' to resolve duplicate column names when merging two DataFrame objects.
But what happens if a name of the form 'a-column-name' + '_x' already appears in one of the frames? I used to think pandas would check whether that name already exists, but apparently it performs no such check?