For loop to create pandas dataframes - varying dataframe names?

I would like to create 3 dataframes as follows:
import numpy as np
import pandas as pd

basket = ['Apple', 'Banana', 'Orange']
for fruit in basket:
    fruit = pd.DataFrame(np.random.rand(10, 3))
However, after running this, evaluating something like:
Apple
gives the error
NameError: name 'Apple' is not defined
But 'fruit' as a dataframe does work.
How is it possible to have each dataframe produced take a variable as its name?

This would work:
basket = ['Apple', 'Banana', 'Orange']
for fruit in basket:
    vars()[fruit] = pd.DataFrame(np.random.rand(10, 3))
However, it would be better practice to assign to a dictionary, e.g.:
var_dict={}
basket = ['Apple', 'Banana', 'Orange']
for fruit in basket:
    var_dict[fruit] = pd.DataFrame(np.random.rand(10, 3))
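Each frame is then retrieved by key instead of by a generated variable name; a quick usage sketch:
var_dict['Apple'].head()  # first rows of the 'Apple' frame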

Instead of creating variables, use a dict to store the dfs; it's not good practice to create variables in a loop, i.e.
basket = ['Apple', 'Banana', 'Orange']
d_o_dfs = {x: pd.DataFrame(np.random.rand(10,3)) for x in basket}
Not recommended, but in case you want to store each df in a variable, use globals(), i.e.
for i in basket:
    globals()[i] = pd.DataFrame(np.random.rand(10, 3))
Output for Banana and d_o_dfs['Banana']:
0 1 2
0 0.822190 0.115136 0.807569
1 0.698041 0.936516 0.438414
2 0.184000 0.772022 0.006315
3 0.684076 0.988414 0.991671
4 0.017289 0.560416 0.349688
5 0.379464 0.642631 0.373243
6 0.956938 0.485344 0.276470
7 0.910433 0.062117 0.670629
8 0.507549 0.393622 0.003585
9 0.878740 0.209498 0.498594

In dataframe, merge rows by matching multiple ids, where the condition differs for each id (full or partial match) [duplicate]

I want to merge several strings in a dataframe based on a groupby in Pandas.
This is my code so far:
import pandas as pd
from io import StringIO
data = StringIO("""
"name1","hej","2014-11-01"
"name1","du","2014-11-02"
"name1","aj","2014-12-01"
"name1","oj","2014-12-02"
"name2","fin","2014-11-01"
"name2","katt","2014-11-02"
"name2","mycket","2014-12-01"
"name2","lite","2014-12-01"
""")
# load string as stream into dataframe
df = pd.read_csv(data, header=0, names=["name", "text", "date"], parse_dates=[2])
# add column with month
df["month"] = df["date"].apply(lambda x: x.month)
I want the end result to have one row per name and month, with the strings in "text" concatenated. I don't get how I can use groupby and apply some sort of concatenation of the strings in the column "text". Any help appreciated!
You can groupby the 'name' and 'month' columns, then call transform, which returns data aligned to the original df, and apply a lambda where we join the text entries:
In [119]:
df['text'] = df[['name','text','month']].groupby(['name','month'])['text'].transform(lambda x: ','.join(x))
df[['name','text','month']].drop_duplicates()
Out[119]:
name text month
0 name1 hej,du 11
2 name1 aj,oj 12
4 name2 fin,katt 11
6 name2 mycket,lite 12
I subset the original df by passing a list of the columns of interest, df[['name','text','month']], and then call drop_duplicates.
EDIT: actually I can just call apply and then reset_index:
In [124]:
df.groupby(['name','month'])['text'].apply(lambda x: ','.join(x)).reset_index()
Out[124]:
name month text
0 name1 11 hej,du
1 name1 12 aj,oj
2 name2 11 fin,katt
3 name2 12 mycket,lite
Update: the lambda is unnecessary here:
In[38]:
df.groupby(['name','month'])['text'].apply(','.join).reset_index()
Out[38]:
name month text
0 name1 11 du
1 name1 12 aj,oj
2 name2 11 fin,katt
3 name2 12 mycket,lite
We can groupby the 'name' and 'month' columns, then call the agg() function of pandas DataFrame objects.
The aggregation functionality provided by agg() allows multiple statistics to be calculated per group in one calculation.
df.groupby(['name', 'month'], as_index = False).agg({'text': ' '.join})
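To illustrate several statistics per group in one call (my sketch reusing the question's df; 'date' is a column from the original data):
df.groupby(['name', 'month'], as_index=False).agg({'text': ' '.join, 'date': 'max'})  # joined text plus latest date per group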
The answer by EdChum provides you with a lot of flexibility, but if you just want to concatenate strings into a column of list objects you can also:
output_series = df.groupby(['name','month'])['text'].apply(list)
If you want to concatenate your "text" in a list:
df.groupby(['name', 'month'], as_index = False).agg({'text': list})
For me the above solutions were close but added some unwanted \n's and dtype: object, so here's a modified version:
df.groupby(['name', 'month'])['text'].apply(lambda text: ''.join(text.to_string(index=False))).str.replace('(\\n)', '').reset_index()
Please try this line of code:
df.groupby(['name','month'])['text'].apply(','.join).reset_index()
Although this is an old question, just in case: I used the code below and it seems to work like a charm.
text = ''.join(df[df['date'].dt.month==8]['text'])
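To apply the same filtering idea per group rather than one month at a time (my sketch, not part of the answer above):
for (name, month), grp in df.groupby(['name', 'month']):
    print(name, month, ','.join(grp['text']))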
Thanks to all the other answers, the following is probably the most concise and feels most natural. Using df.groupby("X")["A"].agg() aggregates over one or many selected columns:
df = pd.DataFrame({'A': ['a', 'a', 'b', 'c', 'c'],
                   'B': ['i', 'j', 'k', 'i', 'j'],
                   'X': [1, 2, 2, 1, 3]})
A B X
a i 1
a j 2
b k 2
c i 1
c j 3
df.groupby("X", as_index=False)["A"].agg(' '.join)
X A
1 a c
2 a b
3 c
df.groupby("X", as_index=False)[["A", "B"]].agg(' '.join)
X A B
1 a c i i
2 a b j k
3 c j
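One caveat worth noting (my addition, not from the answer): ' '.join only works on string columns, so a numeric column such as "X" must be converted first, e.g. via a helper column ('X_str' is an illustrative name):
df['X_str'] = df['X'].astype(str)  # convert numbers to strings so join can apply
df.groupby('A', as_index=False)['X_str'].agg(' '.join)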

Translate my SKUs using a dictionary with Pandas

I have a table which has internal SKUs in column 0 and then synonyms along that row. The number of synonyms is not constant (ranging from 0 to 7, but it will tend to grow).
I need an efficient function that takes SKUs from one column of a large table and translates them to the SKU0 value from my dictionary table.
This is my current function; it takes an array of SKUs from one table, searches for them in the other, and returns the first-column value of the row where it finds a synonym.
import sys

def new_array(dfarray1, array1, trans_dic):
    missing_values = set()
    new_array = []
    for value in array1:
        # boolean Series: rows of the dictionary table containing this value
        pos = trans_dic.eq(str(value)).any(axis=1)
        if len(pos[pos]) > 0:
            new_array.append(trans_dic['sku_0'][pos[pos].index[0]])
        else:
            missing_values.add(str(value))
    if len(missing_values) > 0:
        print("The following values are missing in dictionary. They are in DF called: " + dfarray1)
        print(missing_values)
        sys.exit()
    else:
        return new_array
I'm sure this is very badly written, because it takes my laptop about 3 minutes to go through only about 75K values. Can anyone help me make this faster?
Some questions asked previously:
What types are your function parameters? (can guess pandas, but no way to know for sure)
Yes. I am working on two pandas dataframes.
What does your table even look like?
Dictionary table:

SKU0    Synonym 0   Synonym 1   Synonym 2
foo     bar         bar1
foo1    baar1
foo2    baaar0                  baar2
Values table:

SKU     Value   Value1   value1
foo     3       1        7
baar1   4       5        7
baaar0  5       5        9
Desired table:

SKU     Value   Value1   value1
foo     3       1        7
foo1    4       5        7
foo2    5       5        9
What does the rest of your code that is calling this function look like?
df1.sku = new_array('df1', list(df1.sku), sku_dic)
Given the dictionary dataframe in the format
df_dict = pd.DataFrame({
    "SKU0": ["foo", "foo1", "foo2"],
    "Synonym 0": ["bar", "baar1", "baaar0"],
    "Synonym 1": ["bar1", np.nan, np.nan],
    "Synonym 2": [np.nan, np.nan, "baar2"]
})
and a values dataframe in the format
df_values = pd.DataFrame({
    "SKU": ["foo", "baar1", "baaar0"],
    "Value": [3, 4, 5],
    "Value1": [1, 5, 5],
    "value1": [7, 7, 9]
})
you can get the output you want by first using pd.melt to restructure your dictionary dataframe and then join it to your values dataframe. Then you can use some extra logic to check which column to take the final value from and to select the final columns needed.
(
    df_dict
    # convert dict df from wide to long format
    .melt(id_vars=["SKU0"])
    # filter rows where there is no synonym
    .loc[lambda x: x["value"].notna()]
    # join dictionary with values df
    .merge(df_values, how="right", left_on="value", right_on="SKU")
    # get final value by taking the value from column "SKU0" if available, else "SKU"
    .assign(SKU=lambda x: np.where(x["SKU0"].isna(), x["SKU"], x["SKU0"]))
    # select final columns needed in output
    [["SKU", "Value", "Value1", "value1"]]
)
# output
SKU Value Value1 value1
0 foo 3 1 7
1 foo1 4 5 7
2 foo2 5 5 9
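If you only need to translate the SKU column in place, another option (my own sketch, not part of the answer above) is to build a flat synonym-to-SKU0 lookup once and translate with Series.map, which avoids any row-wise searching:
lookup = (
    df_dict
    .melt(id_vars="SKU0", value_name="synonym")  # one row per (SKU0, synonym) pair
    .dropna(subset=["synonym"])                  # drop rows with no synonym
    .set_index("synonym")["SKU0"]                # Series mapping synonym -> SKU0
)
# SKUs without a synonym entry (e.g. "foo") keep their original value
df_values["SKU"] = df_values["SKU"].map(lookup).fillna(df_values["SKU"])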

How to pass function parameters if using one or more parameters

Thank you in advance for your assistance.
#Create df.
import pandas as pd
d = {'dep_var': pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd']),
     'one': pd.Series([9, 23, 37, 41], index=['a', 'b', 'c', 'd']),
     'two': pd.Series([1, 6, 5, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df)
dep_var one two
a 10 9 1
b 20 23 6
c 30 37 5
d 40 41 4
#Define function.
def df_two(dep_var, ind_var_1, ind_var_2):
    global two
    data = {
        dep_var: df[dep_var],
        ind_var_1: df[ind_var_1],
        ind_var_2: df[ind_var_2]
    }
    two = pd.DataFrame(data)
    return two
# Execute function.
df_two("dep_var", "one", "two")
dep_var one two
a 10 9 1
b 20 23 6
c 30 37 5
d 40 41 4
Works perfectly. I'm fairly new at this, and I'd like to be able to use a single function when passing, say, three or four parameters; of course, with the above code I get an error message on a third parameter.
So, rookie move, I define another function with three parameters.
def df_three(dep_var, ind_var_1, ind_var_2, ind_var_3):
    global three
    data = {
        dep_var: df[dep_var],
        ind_var_1: df[ind_var_1],
        ind_var_2: df[ind_var_2],
        ind_var_3: df[ind_var_3]
    }
    three = pd.DataFrame(data)
    return three
I've tried *args, **kwargs, mapping, and a host of things with no luck. My sense is I'm close but need a way to tell the function that sometimes there might be one, two, or three parameters, and then map those parameters to the created dataframe.
Use argument unpacking with *args:
def foo(dep_var, *args):
    global df
    data = {dep_var: df[dep_var]}
    for a in args:
        data[a] = df[a]
    return pd.DataFrame(data)
And then you can call
foo('dep_var', 'one')
foo('dep_var', 'one', 'two')
To eliminate the need for the global variable, I'd pass df to the function as well:
def foo(df, dep_var, *args):
    data = {dep_var: df[dep_var]}
    for a in args:
        data[a] = df[a]
    return pd.DataFrame(data)
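The call then looks like, for example:
foo(df, 'dep_var', 'one', 'two')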
More information on *args.
It sounds like you want to select only some columns from a data frame, in a certain order. You can just pass a list of the column names for that:
two[["dep_var", "one", "two"]]
If you want to, you can pack that into a function, using tuple unpacking to have a variable number of arguments.
def select(df, *columns):
    return df[list(columns)]
This should directly work with your use cases:
select(two, "dep_var", "one", "two")
select(three, "dep_var", "one", "two", "three")
Note that I also passed the data frame variable, so you don't need to rely on a global variable.
The call to list is needed, because tuple unpacking produces, well, a tuple. And using a tuple as an index to the data frame produces different results than using a list.
You might want to append a .copy() to the return line, depending on how you use the return value of this.
A variable number of arguments also includes zero, so you might want to add a check for that.
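A minimal sketch of that guard (my addition, assuming you want to fail fast on an empty selection):
def select(df, *columns):
    # a variable number of arguments also includes zero, so reject that case
    if not columns:
        raise ValueError('select() needs at least one column name')
    return df[list(columns)]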

Convert pandas into list

I have a pandas dataframe with two columns:
Stock1 Stock2
0 SPXS SPXU
1 IAU GLD
2 C JETS
I want to turn the columns into a list I can pass through other code like this:
pairs = (['SPXS', 'SPXU'], ['IAU', 'GLD'], ['C', 'JETS'])
So that I can call them in a for loop
for pair in pairs:
    stock1 = pair[-2]
    stock2 = pair[-1]
Looking for help as to the best way to do this.
Thanks!
Paul
You can use zip in a for loop:
pairs = []
for i, k in zip(df.Stock1, df.Stock2):
    pairs.append([i, k])
Now you can run on your new list:
for pair in pairs:
    stock1 = pair[-2]
    stock2 = pair[-1]
    print(stock1, '|', stock2)
This prints the following:
SPXS | SPXU
IAU | GLD
C | JETS
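As an aside (my sketch, not part of the answer above), the same list of pairs can be built in one line from the two columns:
pairs = df[['Stock1', 'Stock2']].values.tolist()  # [['SPXS', 'SPXU'], ['IAU', 'GLD'], ['C', 'JETS']]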

How to deal with the pandas "id variables need to uniquely identify each row" error?

I have a dataframe like:
df = pd.DataFrame({
    'name1': ['A', 'A', 'C', 'B', 'C', 'A', 'D'],
    'name2': ['D', 'B', 'A', 'D', 'B', 'C', 'A'],
    'text': ['cars', 'cars', 'flower', 'tea', 'ball', 'phone', 'ice'],
    'time': ['10/01', '10/01', '10/01', '10/01', '10/02', '10/02', '10/02'],
    'Flag1': [1, 1, 2, 0, 2, 1, 0],
    'Flag2': [0, 0, 1, 0, 0, 2, 1]})
Expected:
pd.DataFrame({
    'name': ['A', 'B', 'C', 'D', 'A', 'B', 'C'],
    'text': ['cars,flower', 'cars,tea', 'flower', 'cars,tea', 'phone,ice', 'ball', 'phone'],
    'time': ['10/01', '10/01', '10/01', '10/01', '10/02', '10/02', '10/02'],
    'Flag': [1, 0, 2, 0, 1, 0, 2]})
I want to combine information according to "time":
'name': 'name1' and 'name2' are merged into 'name';
'text': on each day, texts are merged whenever they show up in the same user's rows;
'time': the date the user shows up;
'Flag': 'Flag1' and 'Flag2' are merged into 'Flag'. Each user has a unique 'Flag' ('0', '1', '2') no matter what the date is.
But when I do:
(pd.wide_to_long(df, stubnames=["name", "Flag"], i=["text", "time"], j="ref", suffix="\d*")
   .reset_index()
   .groupby(["name", "time"], as_index=False)
   .agg({"text": ",".join, "Flag": "first"})
   .sort_values(["time", "name"]))
I get:
id variables need to uniquely identify each row
How to deal with that?
Let me know if this works for you. The original call failed because the ("text", "time") pairs do not uniquely identify rows (rows 0 and 1 are both ("cars", "10/01")). Try:
Reshape by pulling the index in to serve as the unique identifier i:
m = pd.wide_to_long(df.reset_index(), stubnames=["name", "Flag"], i="index", j="num")
Then munge to get the desired output, using groupby and some text manipulation:
(
    m.groupby(["name", "time"])
     .agg(Flag=("Flag", "first"), text=("text", lambda x: ",".join(set(x))))
     .reset_index()
     .sort_values("time")
)
name time Flag text
0 A 10/01 1 flower,cars
2 B 10/01 0 tea,cars
4 C 10/01 2 flower
6 D 10/01 0 tea,cars
1 A 10/02 1 phone,ice
3 B 10/02 0 ball
5 C 10/02 2 phone,ball
7 D 10/02 0 ice