Looping over only a certain range of object indices in python - pandas

I am trying to append a new column to a pandas dataframe based on two pre-existing columns in that dataframe. The issue I'm having is that the index of the dataframe is in object format, not integer format. To make things more complicated, I only want to fill a certain range of the new column, leaving the remaining cells as NaN. In order to operate over only a certain range of the dataframe, I will have to use a "for" loop.
Here is my question: How do I loop over a certain range of the dataframe when I have an object index?
My initial pandas dataframe is simply...
import pandas as pd
dates = ['2005Q4','2006Q1','2006Q2','2006Q3','2006Q4','2007Q1','2007Q2']
col1 = [5.9805, 6.2181, 6.3508, 6.7878, 6.6212, 6.4583, 6.4068]
col2 = ['NaN', -0.001054985938, -0.121731711952, 0.046275331889,
        -0.017517211963, -0.023422842422, 0.009072170884]
data = pd.DataFrame(
    {'col1': col1, 'col2': col2},
    columns=['col1', 'col2'],
    index=dates
)
All I'm trying to do is something like this...
data['col3'] = 'NaN'
for i in range('2006Q1','2006Q4',1):
    data['col3'][i] = data['col1'][i-1] + \
                      data['col2'][i]
Naively, I had hoped that Python would be able to map the label in the index to the integer position associated with that particular index entry. For example, with the index defined as given, Python would know that '2005Q4' is position 0, '2006Q1' is position 1, etc. That way, I could use the label strings in the range() function and it would still know which integers I'm referring to. However, this appears not to be the case.
I need to avoid converting the objects into date format as well. It is important that I keep the index in the format 'YearQuarter', and I have yet to find a simple way of using pd.to_datetime that is able to do this.
Does anyone have any suggestions on how to loop over only a certain range of object-based indices in python?

Using .index() on a list returns the position of the item you're looking for. Try this tweak to your for loop:
for i in range(dates.index('2006Q1'),dates.index('2006Q4'),1):
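If you'd rather not keep the separate dates list around, the DataFrame's own index offers the same label-to-position translation via Index.get_loc. A minimal sketch of the full loop (note the +1 to make the end label inclusive, and the use of a real np.nan rather than the string 'NaN'):
import numpy as np
data['col3'] = np.nan                      # real NaN, so the column stays numeric
start = data.index.get_loc('2006Q1')
stop = data.index.get_loc('2006Q4') + 1    # +1 so '2006Q4' itself is included
for i in range(start, stop):
    data.iloc[i, data.columns.get_loc('col3')] = (
        data['col1'].iloc[i - 1] + data['col2'].iloc[i]
    )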
There are, however, much more efficient ways to do this. .shift() shifts your entire column up or down by however many rows you want, which lets you write the whole computation as one vectorized expression:
data['col3'] = data['col1'].shift(1) + data['col2']   # assuming col2 is numeric
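If you only want the shifted sum over the question's label range, leaving everything else as NaN, label-based .loc slicing works directly on the string index. A sketch, which first coerces col2 to numeric because its first cell holds the string 'NaN':
import numpy as np
col2_num = pd.to_numeric(data['col2'], errors='coerce')   # turn the 'NaN' string into a real NaN
data['col3'] = np.nan
shifted = data['col1'].shift(1) + col2_num                # col1[i-1] + col2[i] for every row
data.loc['2006Q1':'2006Q4', 'col3'] = shifted.loc['2006Q1':'2006Q4']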

Related

pandas: split a pandas column of unequal-length lists into multiple columns

I have a dataframe with one column of unequal-length lists which I want to split into multiple columns (the item values will become the column names). An example is given below.
I have tried iterrows, iterating through the rows and examining the list in each row. It seems workable, as my dataframe has only a few rows, but I wonder if there is a cleaner method.
I have also tried additional_df = pd.DataFrame(venue_df.location.values.tolist())
However, the list breaks down as shown below.
Thanks for your help.
Can you try this code? It is built assuming venue_df.location contains the lists you have shown in the cells.
venue_df['school'] = venue_df.location.apply(lambda x: ('school' in x)+0)
venue_df['office'] = venue_df.location.apply(lambda x: ('office' in x)+0)
venue_df['home'] = venue_df.location.apply(lambda x: ('home' in x)+0)
venue_df['public_area'] = venue_df.location.apply(lambda x: ('public_area' in x)+0)
Hope this helps!
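The four nearly identical apply calls can also be collapsed into a loop over the labels; a minimal sketch, assuming venue_df.location holds lists of strings:
for label in ['school', 'office', 'home', 'public_area']:
    # int(...) turns the membership test into a 0/1 indicator
    venue_df[label] = venue_df['location'].apply(lambda x, l=label: int(l in x))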
First, let's explode your Location column, so we can get to your desired end result.
s = df['Location'].explode()
Then let's use crosstab on that series (note it needs both an index and a columns argument):
import pandas as pd
pd.crosstab(s.index, s)
I didn't test it out because I don't know your base df.
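For reference, a self-contained sketch of the explode-plus-crosstab route, with a made-up venue_df standing in for the real data:
import pandas as pd
# hypothetical stand-in for the dataframe of location lists
venue_df = pd.DataFrame({'location': [['school', 'home'],
                                      ['office'],
                                      ['home', 'public_area']]})
s = venue_df['location'].explode()       # one row per (row index, location) pair
indicators = pd.crosstab(s.index, s)     # wide 0/1 count table, indexed like venue_df
print(venue_df.join(indicators))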

Pandas splitting a column with new line separator

I am extracting tables from a PDF using Camelot. Two of the columns are getting merged together with a newline separator. Is there a way to separate them into two columns?
Suppose the column looks like this.
A\nB
1\n2
2\n3
3\n4
Desired output:
|A|B|
|-|-|
|1|2|
|2|3|
|3|4|
I have tried df['A\nB'].str.split('\n', 2, expand=True), and that splits it into two columns; however, I want the new column names to be A and B, not 0 and 1. I also need to pass a generalized column label instead of the actual column name, since I need to implement this for several docs which may have different column names. I can determine such a column name in my dataframe using
colNew = df.columns[df.columns.str.contains(pat = '\n')]
However, when I pass colNew to the split function, it throws an attribute error:
df[colNew].str.split('\n', 2, expand=True)
AttributeError: DataFrame object has no attribute 'str'
You can take advantage of the pandas split function.
import pandas as pd
# recreate your pandas series above
df = pd.DataFrame({'A\nB': ['1\n2', '2\n3', '3\n4']})
# first: make sure the column is of string type;
# second: split the column on the separator \n;
# third: pass expand=True so the split values become two new columns
test = df['A\nB'].astype('str').str.split('\n', expand=True)
# some renaming
test.columns = ['A', 'B']
I hope this is helpful.
I reproduced the error on my side... I guess the issue is that df[colNew] is still a DataFrame, because colNew is an Index that can hold several labels.
But .str.split() only works on a Series. So, taking your code as an example, I would convert the DataFrame to a Series using .iloc[:, 0].
Then another line to set the column headers, reusing the pieces of the old merged header so the code stays generalized:
df2 = df[colNew].iloc[:, 0].str.split('\n', 2, expand=True)
df2.columns = colNew[0].split('\n')
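Putting the pieces together, a minimal end-to-end sketch, assuming exactly one merged header per dataframe:
import pandas as pd
df = pd.DataFrame({'A\nB': ['1\n2', '2\n3', '3\n4']})
# find the merged column by the newline in its header
col = df.columns[df.columns.str.contains('\n')][0]
# split the values and reuse the pieces of the old header as the new names
split_cols = df[col].str.split('\n', expand=True)
split_cols.columns = col.split('\n')
df = df.drop(columns=col).join(split_cols)
print(df)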

How to compare one row in Pandas Dataframe to all other rows in the same Dataframe

I have a csv file in which I want to compare each row with all other rows. I want to do a linear regression and get the r^2 value for the regression line and put it into a new matrix. I'm having trouble finding a way to iterate over all the other rows (it's fine to compare the primary row to itself).
I've tried using .iterrows but I can't think of a way to define the other rows once I have my primary row using this function.
UPDATE: Here is a solution I came up with. Please let me know if there is a more efficient way of doing this.
from itertools import combinations
from sklearn.metrics import r2_score
def bad_pairs(df, limit):
    # every unordered pair of row labels
    list_fluor = list(combinations(df.index.values, 2))
    final = {}
    for fluor in list_fluor:
        final[fluor] = r2_score(df.xs(fluor[0]),
                                df.xs(fluor[1]))
    # keep only the pairs whose r^2 exceeds the limit
    bad_final = {}
    for i in final:
        if final[i] > limit:
            bad_final[i] = final[i]
    return bad_final
My data is a pandas DataFrame where the index is the name of the color and there is a number between 0 and 1 for each detector (220 columns).
I'm still working on a way to make a new pandas DataFrame from a dictionary with all the values (final in the code above), not just those over the limit.
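For that last step, the pair-keyed dictionary unstacks naturally into a matrix; a sketch, assuming final maps (row_a, row_b) tuples to r^2 scores as in the code above:
import pandas as pd
scores = pd.Series(final)
scores.index = pd.MultiIndex.from_tuples(scores.index, names=['row_a', 'row_b'])
# each unordered pair appears once, so the other triangle stays NaN
matrix = scores.unstack('row_b')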

Selecting columns from a dataframe

I have a dataframe of monthly returns for 1,000 stocks with ids as column names.
(screenshot: monthly returns dataframe)
I need to select only the columns that match the values in another dataframe which includes the ids I want.
(screenshot: permno list)
I'm sure this is really quite simple, but I have been struggling for 2 days and if someone has an easy solution it would be so very much appreciated. Thank you.
You could convert the single-column permno list dataframe (osr_curr_permnos) into a list, and then use that list to select certain columns from your main dataframe (all_rets).
To convert the osr_curr_permnos column "0" into a list, you can use .to_list()
Then, you can use that list to slice all_rets and .copy() to make a fresh copy of it into a new dataframe.
The python code might look something like:
keep = osr_curr_permnos['0'].to_list()
selected_rets = all_rets[keep].copy()
"keep" would be a list, and "selected_rets" would be your new dataframe.
If there's a chance that osr_curr_permnos would have duplicates, you'll want to filter those out:
keep = osr_curr_permnos['0'].drop_duplicates().to_list()
selected_rets = all_rets[keep].copy()
As I expected, the answer was simpler than I was making it. Basically, I needed to take the integer values in my permnos list and recast them as strings.
osr_curr_permnos['0'] = osr_curr_permnos['0'].apply(str)
keep = osr_curr_permnos['0'].values
Then I can use that to select columns from my returns dataframe which had string values as column headers.
all_rets[keep]
It was all just a mismatch of int vs. string.
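A compact sketch of the whole fix, with toy data standing in for the real returns and permnos:
import pandas as pd
# hypothetical stand-ins: returns keyed by string ids, permnos stored as ints
all_rets = pd.DataFrame({'10001': [0.01, 0.02], '10002': [0.03, -0.01]})
osr_curr_permnos = pd.DataFrame({'0': [10001]})
keep = osr_curr_permnos['0'].apply(str)   # recast the ints as strings to match the headers
print(all_rets[keep])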

Datetime column coerced to int when setting with .loc and slice

I have a column of datetimes and need to change several of these values to new datetimes. When I set the values using df.loc[indices, 'col'] = new_datetimes, the unaffected values are coerced to int while the newly set values remain datetimes. If I set the values one at a time, no type coercion occurs.
For illustration I created a sample df with just one column.
import datetime as dt
import numpy as np
import pandas as pd
df = pd.DataFrame([dt.datetime(2019,1,1)]*5)
df.loc[[1,3,4]] = [dt.datetime(2019,1,2)]*3
df
This produces the following:
(screenshot of output)
If I change indices 1,3,4 individually:
df = pd.DataFrame([dt.datetime(2019,1,1)]*5)
df.loc[1] = dt.datetime(2019,1,2)
df.loc[3] = dt.datetime(2019,1,2)
df.loc[4] = dt.datetime(2019,1,2)
df
I get the correct output:
(screenshot of output)
A suggestion was to turn the list into a numpy array before setting, which does resolve the issue. However, if you try to set multiple columns (some of which are not datetime) using a numpy array, the issue arises again.
In this example the dataframe has two columns and I try to set both columns.
df = pd.DataFrame({'dt':[dt.datetime(2019,1,1)]*5, 'value':[1,1,1,1,1]})
df.loc[[1,3,4]] = np.array([[dt.datetime(2019,1,2)]*3, [2,2,2]]).T
df
This gives the following output:
(screenshot of output)
Can someone please explain what is causing the coercion and how to prevent it? The code I wrote that uses this was written over a month ago and used to work just fine; could it be one of those warnings about future versions of pandas deprecating certain functionality?
An explanation of what is going on would be greatly appreciated, because I wrote other code that likely employs similar functionality and I want to make sure everything works as intended.
The solution proposed by w-m has one awkward detail: the result column also has a time part (it didn't have one before).
I would also remark that DataFrames are tables, not Series, so they have columns, each with its own name, and it is a bad habit to rely on default column names (consecutive numbers).
So I propose another solution, addressing both of the above issues:
To create the source DataFrame I executed:
df = pd.DataFrame([dt.datetime(2019, 1, 1)]*5, columns=['c1'])
Note that I provided a name for the only column.
Then I created another DataFrame:
df2 = pd.DataFrame([dt.datetime(2019,1,2)]*3, columns=['c1'], index=[1,3,4])
It contains your "new" dates, and the numbers which you used in loc I set as the index (again with the same column name).
Then, to update df, use (not surprisingly) df.update:
df.update(df2)
This function performs in-place update, so if you print(df), you will get:
c1
0 2019-01-01
1 2019-01-02
2 2019-01-01
3 2019-01-02
4 2019-01-02
As you can see, under indices 1, 3 and 4 you have the new dates, and there is no time part, just as before.
[dt.datetime(2019,1,2)]*3 is a Python list of objects. This particular list happens to contain only datetimes, but Pandas does not seem to recognize that, and treats it as what it is: a list of arbitrary objects.
If you convert it into a typed array, then Pandas will keep the original dtype of the column intact:
df.loc[[1,3,4]] = np.asarray([dt.datetime(2019,1,2)]*3)
I hope this workaround helps you, but you may still want to file a bug with Pandas. I don't have an explanation as to why the datetime objects are coerced to ints in the first output example.
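For the two-column case from the question, one workaround in the same spirit is to set each column separately with its own typed array, so each assignment carries a single dtype; a sketch under that assumption:
import datetime as dt
import numpy as np
import pandas as pd
df = pd.DataFrame({'dt': [dt.datetime(2019,1,1)]*5, 'value': [1]*5})
# one column at a time, so the datetime and int values never share one object array
df.loc[[1,3,4], 'dt'] = np.asarray([dt.datetime(2019,1,2)]*3)
df.loc[[1,3,4], 'value'] = np.asarray([2, 2, 2])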