Creating new columns with pandas .loc

My dataset looks like this:
import numpy as np
import pandas as pd

ex = pd.DataFrame.from_dict({'grp1': np.random.choice('A B'.split(), 20),
                             'grp2': np.random.choice([1, 2], 20),
                             'var1': np.random.rand(20),
                             'var2': np.random.randint(20, size=20)})
I want to create new columns with the next value within the groups, but the following code results in SettingWithCopyWarning:
ex[['next_var1', 'next_var2']] = ex.groupby(['grp1', 'grp2'])[['var1', 'var2']].shift(-1)
Therefore I tried to use .loc:
ex.loc[:, ['next_var1', 'next_var2']] = ex.groupby(['grp1', 'grp2'])[['var1', 'var2']].shift(-1)
However, it results in an error:
KeyError: "None of [Index(['next_var1', 'next_var2'], dtype='object')] are in the [columns]"
What's wrong with the .loc usage?

With .loc you can't create new columns; the column labels you assign to must already exist. But you could create them first and then fill them:
ex['next_var1'], ex['next_var2'] = None, None
ex.loc[:, ['next_var1', 'next_var2']] = ex.groupby(['grp1', 'grp2'])[['var1', 'var2']].shift(-1).values
However, you could also do:
ex[['next_var1', 'next_var2']] = ex.groupby(['grp1', 'grp2'])[['var1', 'var2']].shift(-1)
which is what you tried; it works fine with Python 3.7 and pandas 0.25.
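A more compact alternative is to build the shifted frame once and join it back with renamed columns; a minimal sketch (the next_ prefix is just a naming choice for illustration):

# shift(-1) keeps the original index, so a plain index join lines the rows up
shifted = ex.groupby(['grp1', 'grp2'])[['var1', 'var2']].shift(-1)
ex = ex.join(shifted.add_prefix('next_'))

If the SettingWithCopyWarning still appears, it usually means ex was itself carved out of another DataFrame as a view; working on an explicit ex.copy() is the common way to avoid it.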

Related

Dataframe index with isclose function

I have a dataframe with numerical values between 0 and 1. I am trying to create simple summary statistics (manually). When I use a boolean comparison I can get the index, but when I try to use math.isclose the function does not work and gives an error.
For example:
import pandas as pd
df1 = pd.DataFrame({'col1': [0, .05, 0.74, 0.76, 1],
                    'col2': [0, 0.05, 0.5, 0.75, 1],
                    'x1': [1, 2, 3, 4, 5],
                    'x2': [5, 6, 7, 8, 9]})
result75 = df1.index[round(df1['col2'],2) == 0.75].tolist()
value75 = df1['x2'][result75]
print(value75.mean())
This will give the correct result, but occasionally the result is NaN, so I tried:
result75 = df1.index[math.isclose(round(df1['col2'],2), 0.75, abs_tol = 0.011)].tolist()
value75 = df1['x2'][result75]
print(value75.mean())
This results in the following error message:
TypeError: cannot convert the series to <class 'float'>
Both are of type "bool", so I'm not sure what is going wrong here...
math.isclose expects plain floats, not a whole Series, which is why pandas fails when it tries to convert the Series to a single float. This works:
rows_meeting_condition = df1[(df1['col2'] > 0.74) & (df1['col2'] < 0.76)]
print(rows_meeting_condition['x2'].mean())
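If you want to keep the tolerance-based style of the original attempt, NumPy's element-wise counterpart does accept a Series; a small sketch using the same 0.011 tolerance:

import numpy as np

# np.isclose is vectorized, so it returns a boolean array usable as a mask
mask75 = np.isclose(df1['col2'], 0.75, atol=0.011)
print(df1.loc[mask75, 'x2'].mean())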

Transform list of pyspark rows into pandas data frame through a dictionary

I am trying to convert a list of sorted PySpark rows to a pandas data frame through a dictionary, but it only works when explicitly stating the key and value of the desired dictionary.
row_list = sorted(data, key=lambda row: row['date'])
future_df = {'key': int(key),
             'date': map(lambda row: row["date"], row_list),
             'col1': map(lambda row: row["col1"], row_list),
             'col2': map(lambda row: row["col2"], row_list)}
And then converting it to Pandas with:
pd.DataFrame(future_df)
This operation is to be found inside the class ForecastByKey invoked by:
rdd = df.select('*') \
    .rdd \
    .map(lambda row: ((row['key']), row)) \
    .groupByKey() \
    .map(lambda args: spark_ops.run(args[0], args[1]))
Up to this point, everything works fine, that is, when explicitly indicating the columns inside the dictionary future_df.
The problem arises when trying to convert the whole set of columns (700+) with something like:
future_df = {'key': int(key),
             'date': map(lambda row: row["date"], row_list)}

for col_ in columns:
    future_df[col_] = map(lambda row: row[col_], row_list)

pd.DataFrame(future_df)
Where columns contains the name of each column passed to the ForecastByKey class.
The result of this operation is a data frame with empty or close-to-zero columns.
I am using Python 3.6.10 and PySpark 2.4.5
How is this iteration to be done in order to get a data frame with the right information?
After some research, I realized this can be solved with:
import toolz

row_list = sorted(data, key=lambda row: row['date'])

def f(x):
    return map(lambda row: row[x], row_list)

pre_df = {col_: col_ for col_ in self.sdf_cols}
future_df = toolz.valmap(f, pre_df)
future_df['key'] = int(key)
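The original loop most likely fails because of how Python closures bind: every lambda captures the variable col_ itself, and since map is lazy in Python 3, the maps are only consumed after the loop has finished, when col_ holds its final value. A sketch of two ways to pin the column name per iteration (assuming the same columns and row_list as above):

# Bind the current column name as a default argument ...
for col_ in columns:
    future_df[col_] = map(lambda row, c=col_: row[c], row_list)

# ... or materialize the values eagerly with a list comprehension
for col_ in columns:
    future_df[col_] = [row[col_] for row in row_list]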

'numpy.int64' object is not iterable when iterate thru a dictionary

I have a dictionary like this
dd = {888202515573088257: tweepy.error.TweepError([{'code': 144,
          'message': 'No status found with that ID.'}]),
      873697596434513921: tweepy.error.TweepError([{'code': 144,
          'message': 'No status found with that ID.'}]),
      ...,
      680055455951884288: tweepy.error.TweepError([{'code': 144,
          'message': 'No status found with that ID.'}])}
I want to make a dataframe from this dictionary, like so
df = pd.DataFrame(columns=['twid', 'msg'])
for k, v in dd:
    df = df.append({'twid': k, 'msg': v}, ignore_index=True)
But I get TypeError: 'numpy.int64' object is not iterable. Can someone help me solve this please?
Thanks!
By default, iterating over a dictionary will iterate over the keys. If you want to unpack the (key, value) pairs, you can use dd.items().
In this case, it looks like you don't need the values, so the below should work.
df = pd.DataFrame(columns=['twid'])
for k in dd:
    df = df.append({'twid': k}, ignore_index=True)
Alternatively, you can just pass the keys in when creating the DataFrame.
df = pd.DataFrame(list(dd.keys()), columns=['twid'])
I did this and it works:
df=pd.DataFrame(list(dd.items()), columns=['twid', 'msg'])
df
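If the msg column should end up holding readable text rather than the TweepError objects themselves, one option is to stringify the values while unpacking the items; a small sketch (using plain str() is just one way to render the exceptions):

# Build both columns in one pass; str(v) renders each TweepError as text
df = pd.DataFrame([(k, str(v)) for k, v in dd.items()],
                  columns=['twid', 'msg'])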

pandas series multi indexing error

I'm trying to slice into a multi-indexed data frame. I'm confused about conditions that generate IndexingError: Too many indexers. I'm also skeptical because I've found some bug reports about what may be this issue.
Specifically, this generates the error:
import pandas as pd

idx1 = [str(elem) for elem in [5, 6, 7, 8]]
idx2 = [str(elem) for elem in [10, 20, 30]]
index = pd.MultiIndex.from_product([idx1, idx2], names=('idx1', 'idx2'))
columns = ['m1', 'm2', 'm3']
df = pd.DataFrame(index=index, columns= columns)
df['m1'].loc[:,10]
The code above is trying to index into an index whose labels are strings using an int, it seems to me. The error threw me off, as I don't understand why it says Too many indexers.
The below code works:
idx1 = [5, 6, 7, 8]
idx2 = [10, 20, 30]
index = pd.MultiIndex.from_product([idx1, idx2], names=('idx1', 'idx2'))
columns = ['m1', 'm2', 'm3']
df = pd.DataFrame(index=index, columns= columns)
df.loc[5,10] = [1,2,3]
df.loc[6,10] = [4,5,6]
df.loc[7,10] = [7,8,9]
type(df['m1'])
df['m1'].loc[:,10]
There are some references to the same error: https://github.com/pandas-dev/pandas/issues/13597 which is marked closed and https://github.com/pandas-dev/pandas/issues/14885 which is open.
Is it OK to slice a multi-indexed series as in the lines above, assuming I get the dtype right? See also "Too many indexers" with DataFrame.loc.
My pandas version is 0.20.3.
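For what it's worth, the first snippet works once the second-level key matches the index dtype, which suggests the dtype mismatch rather than the number of indexers is what trips up .loc here; a minimal sketch, assuming the string-labelled MultiIndex from the question:

# The second level is labelled with strings, so slice with the string '10'
idx1 = [str(elem) for elem in [5, 6, 7, 8]]
idx2 = [str(elem) for elem in [10, 20, 30]]
index = pd.MultiIndex.from_product([idx1, idx2], names=('idx1', 'idx2'))
df = pd.DataFrame(index=index, columns=['m1', 'm2', 'm3'])

print(df['m1'].loc[:, '10'])   # selects every idx1 with idx2 == '10'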

Concatenate DataFrames.DataFrame in Julia

I have a problem when I try to concatenate multiple DataFrames (a data structure from the DataFrames package!) with the same columns but different numbers of rows. Here's my code:
using(DataFrames)
DF = DataFrame()
DF[:x1] = 1:1000
DF[:x2] = rand(1000)
DF[:time] = append!( [0] , cumsum( diff(DF[:x1]).<0 ) ) + 1
DF1 = DF[DF[:time] .==1,:]
DF2 = DF[DF[:time] .==round(maximum(DF[:time])),:]
DF3 = DF[DF[:time] .==round(maximum(DF[:time])/4),:]
DF4 = DF[DF[:time] .==round(maximum(DF[:time])/2),:]
DF1[:T] = "initial"
DF2[:T] = "final"
DF3[:T] = "1/4"
DF4[:T] = "1/2"
DF = [DF1;DF2;DF3;DF4]
The last line gives me the error
MethodError: Cannot `convert` an object of type DataFrames.DataFrame to an object of type LastMain.LastMain.LastMain.DataFrames.AbstractDataFrame
This may have arisen from a call to the constructor LastMain.LastMain.LastMain.DataFrames.AbstractDataFrame(...),
since type constructors fall back to convert methods.
I don't understand this error message. Can you help me out? Thanks!
I just ran into this exact problem on Julia 0.5.0 x86_64-linux-gnu, DataFrames 0.8.5, with both hcat and vcat.
Neither clearing the workspace nor reloading DataFrames solved the problem, but restarting the REPL fixed it immediately. The nested LastMain prefixes in the error suggest that objects created before the workspace was cleared still reference an old copy of the DataFrames module, whose DataFrame type no longer matches the freshly loaded one, which would explain why only a full restart resolves it.