Split a dataframe into multiple dataframes - pandas

I have a dataframe that I split using groupby. I understand this splits the dataframe into multiple dataframes. How can I get back those individual dataframes, based on the groups, and name them accordingly? So if I call df.groupby(['A','B']),
and A has the value A1 while B has values B1-B4, I want to get back those 4 dataframes called df_A1B1, df_A1B2, ..., df_A1B4.

This can be done with locals(), but it is not recommended:
variables = locals()
for i, j in df.groupby(['A', 'B']):
    variables["df_{0[0]}{0[1]}".format(i)] = j

df_01
Out[332]:
   A  B              C
0  0  1  a-1524112-124
Using a dict is the right way:
{"df_{0[0]}{0[1]}".format(i) : j for i,j in df.groupby(['A','B'])}

Offering an alternate solution, using pandas.DataFrame.xs and some exec magic -
df = pd.DataFrame({'A': ['a1', 'a2'] * 4,
                   'B': ['b1', 'b2', 'b3', 'b4'] * 2,
                   'val': list(range(8))})
df
# A B val
# 0 a1 b1 0
# 1 a2 b2 1
# 2 a1 b3 2
# 3 a2 b4 3
# 4 a1 b1 4
# 5 a2 b2 5
# 6 a1 b3 6
# 7 a2 b4 7
for i in df.set_index(['A', 'B']).index.unique().tolist():
    exec("df_{}{}".format(i[0], i[1]) + " = df.set_index(['A','B']).xs(i)")
df_a1b1
# val
# A B
# a1 b1 0
# b1 4
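If exec feels risky, the same slices can be fetched on demand with groupby.get_group; a sketch, assuming the df built above:
grouped = df.groupby(['A', 'B'])
df_a1b1 = grouped.get_group(('a1', 'b1'))  # rows where A == 'a1' and B == 'b1'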

Pivoting data without column

Starting from a df imported from Excel like this:
Code  Material  Text       QTY
A1    X222      Model3       1
A2    4027721   Gruoup1      1
A2    4647273   Gruoup1.1    4
A1    573828    Gruoup1.2    1
I want to create a new pivot table like this:
Code  Qty
A1      2
A2      5
I tried the following commands, but they do not work:
df.pivot(index='Code', columns='',values='Qty')
df_pivot = df ("Code").Qty([sum, max])
You don't need pivot but groupby:
out = df.groupby('Code', as_index=False)['QTY'].sum()
# Or
out = df.groupby('Code')['QTY'].agg(['sum', 'max']).reset_index()
Output:
>>> out
Code sum max
0 A1 2 1
1 A2 5 4
The equivalent code with pivot_table:
out = (df.pivot_table('QTY', 'Code', aggfunc=['sum', 'max'])
.droplevel(1, axis=1).reset_index())
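If you want explicit control over the output column names, named aggregation (available since pandas 0.25) is a close equivalent; a sketch, where the 'total' and 'largest' labels are assumptions:
out = df.groupby('Code').agg(total=('QTY', 'sum'), largest=('QTY', 'max')).reset_index()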

append lists of different length to dataframe pandas

Consider I have multiple lists
A = ['acc_num=1', 'A1', 'A2']
B = ['acc_num=2', 'B1', 'B2', 'B3','B4']
C = ['acc_num=3', 'C1']
How do I put them in a dataframe to export to Excel as:
acc_num _1 _2 _3 _4
_1 1 A1 A2
_2 2 B1 B2 B3 B4
_3 3 C1
Hi, here is a solution for you in 3 basic steps:
1. Create a DataFrame just by passing a list of your lists.
2. Strip the leading "acc_num=" string from the acc_num column; this is done with a vectorized string method on the column.
3. Rename the column header as you wish by passing a dictionary {} to df.rename.
The Code:
# Create a Dataframe from your lists
df = pd.DataFrame([A,B,C])
# Change Column 0 and remove initial string
df[0] = df[0].str.replace('acc_num=','')
# Change the name of Column 0
df.rename(columns={0:"acc_num"},inplace=True)
Final result:
Out[26]:
acc_num 1 2 3 4
0 1 A1 A2 None None
1 2 B1 B2 B3 B4
2 3 C1 None None None
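For the export-to-Excel step the question mentions, a minimal sketch (the file name is an assumption, and an engine such as openpyxl must be installed):
df.to_excel("accounts.xlsx", index=False)  # index=False drops the 0/1/2 row labels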

Base R or tidyverse function to expand rows along time series columns [duplicate]

This question already has answers here:
Reshaping data.frame from wide to long format
(8 answers)
Closed 2 years ago.
I have a time series data frame with two ID columns and about 1000 columns for day 1, day 2, etc.
I want to convert from my data frame being in the form
a b t1 t2 ... t1000
_____________________________
a1 b1 # # #
a2 b2 # # #
to being in the form
a b t value
____________________
a1 b1 't1' #
a1 b1 't2' #
a2 b2 't1' #
a2 b2 't2' #
Essentially, I want to do something like this:
dataframe %>%
  select(starts_with("t_")) %>%
  gather(key = "t", value = "value")
so that I have a dataframe looking like this:
t value
__________
't1' #
't2' #
...
't100' #
for each row in the original dataframe. Then once I have the time columns that I generated from one row in the original dataframe, I want to left append the "a" and "b" columns to each row, so that this:
t value
__________
't1' #
't2' #
...
't100' #
turns into this:
a b t value
____________________
a1 b1 't1' #
a1 b1 't2' #
...
a1 b1 't100' #
and then I repeat this process for each row, stacking the newly generated dataframe below (or above) the previously generated one. At the end I want the dataframe in the code block above generated for each row of the original dataframe, all stacked on top of one another. I could do this with a for loop, or maybe even some sapply magic, but I don't want to use a for loop in R, and I feel there is a better way than sapply-ing a function over each row of the original dataframe.
Can anyone help me? Thanks. Preferably using tidyverse.
We can use pivot_longer from tidyr.
library(tidyr)
data <- matrix(1:1000,nrow=2,ncol=1000)
colnames(data) <- paste0("t",1:1000)
data <- data.frame(a=c("a1","a2"),b=c("b1","b2"),data)
data[1:2,1:10]
# a b t1 t2 t3 t4 t5 t6 t7 t8
#1 a1 b1 1 3 5 7 9 11 13 15
#2 a2 b2 2 4 6 8 10 12 14 16
data %>%
  pivot_longer(cols = starts_with("t"), names_to = "t")
## A tibble: 2,000 x 4
# a b t value
# <fct> <fct> <chr> <int>
# 1 a1 b1 t1 1
# 2 a1 b1 t2 3
# 3 a1 b1 t3 5
# 4 a1 b1 t4 7
# 5 a1 b1 t5 9
# 6 a1 b1 t6 11
# 7 a1 b1 t7 13
# 8 a1 b1 t8 15
# 9 a1 b1 t9 17
#10 a1 b1 t10 19
Try this:
library(dplyr)
library(tidyr)  # gather() comes from tidyr, not dplyr
dataframe %>%
  gather(key = "t", value = "value", 3:ncol(.))

pandas dataframe group by and agg

I am new to IPython and I am trying to do something with dataframe grouping. I have a dataframe like below:
df_test = pd.DataFrame({"A": range(4), "B": ["B1", "B2", "B1", "B2"], "C": ["C1", "C1", np.nan, "C2"]})
df_test
A B C
0 0 B1 C1
1 1 B2 C1
2 2 B1 NaN
3 3 B2 C2
I would like to achieve the following:
1) group by B, but creating a multilevel column instead of grouping into rows, with the B1 and B2 entries being counts rather than index values
2) aggregate columns A and C with something like {'C': ['count'], 'A': ['sum']}
       B
   A  B1  B2  C
0  6   2   2  3
How can I do this? Thanks.
You are doing separate actions to each column. You can hack this by aggregating A and C, taking the value counts of B separately, and then combining the data back together.
ac = df_test.agg({'A':'sum', 'C':'count'})
b = df_test['B'].value_counts()
pd.concat([ac, b]).sort_index().to_frame().T
A B1 B2 C
0 6 2 2 3
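If you really want the multilevel column sketched in the question, one option is to concat with keys so B1/B2 sit under a 'B' level; a sketch, where the 'agg' label for the other level is an assumption:
pd.concat([ac, b], keys=['agg', 'B']).to_frame().T
#   agg        B
#     A  C    B1  B2
# 0   6  3     2   2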

Compare Pandas dataframes and add column

I have two dataframe as below
df1 df2
A A C
A1 A1 C1
A2 A2 C2
A3 A3 C3
A1 A4 C4
A2
A3
A4
Each value of column 'A' in df1 has a corresponding value in column 'C' of df2.
I want to add a new column 'B' to df1, taking its value from df2's column 'C'.
The final df1 should look like this
df1
A B
A1 C1
A2 C2
A3 C3
A1 C1
A2 C2
A3 C3
A4 C4
I can loop over df2 and add the values to df1, but it is time-consuming because the data is huge.
for index, row in df2.iterrows():
    df1.loc[df1.A.isin([row['A']]), 'B'] = row['C']
Can someone help me understand how I can solve this without looping over df2?
Thanks
You can use map with a Series:
df1['B'] = df1.A.map(df2.set_index('A')['C'])
print (df1)
A B
0 A1 C1
1 A2 C2
2 A3 C3
3 A1 C1
4 A2 C2
5 A3 C3
6 A4 C4
It is the same as mapping with a dict:
d = df2.set_index('A')['C'].to_dict()
print (d)
{'A4': 'C4', 'A3': 'C3', 'A2': 'C2', 'A1': 'C1'}
df1['B'] = df1.A.map(d)
print (df1)
A B
0 A1 C1
1 A2 C2
2 A3 C3
3 A1 C1
4 A2 C2
5 A3 C3
6 A4 C4
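One thing to watch with map: any value of df1.A that is absent from df2 becomes NaN; a sketch guarding against that, where the 'missing' fill value is an assumption:
df1['B'] = df1.A.map(df2.set_index('A')['C']).fillna('missing')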
Timings:
len(df1)=7:
In [161]: %timeit merged = df1.merge(df2, on='A', how='left').rename(columns={'C':'B'})
1000 loops, best of 3: 1.73 ms per loop
In [162]: %timeit df1['B'] = df1.A.map(df2.set_index('A')['C'])
The slowest run took 4.44 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 873 µs per loop
len(df1)=70k:
In [164]: %timeit merged = df1.merge(df2, on='A', how='left').rename(columns={'C':'B'})
100 loops, best of 3: 12.8 ms per loop
In [165]: %timeit df1['B'] = df1.A.map(df2.set_index('A')['C'])
100 loops, best of 3: 6.05 ms per loop
IIUC you can just merge and rename the column:
df1.merge(df2, on='A', how='left').rename(columns={'C':'B'})
In [103]:
df1 = pd.DataFrame({'A':['A1','A2','A3','A1','A2','A3','A4']})
df2 = pd.DataFrame({'A':['A1','A2','A3','A4'], 'C':['C1','C2','C3','C4']})
merged = df1.merge(df2, on='A', how='left').rename(columns={'C':'B'})
merged
Out[103]:
A B
0 A1 C1
1 A2 C2
2 A3 C3
3 A1 C1
4 A2 C2
5 A3 C3
6 A4 C4
Based on the searchsorted method, here are three approaches with different indexing schemes -
df1['B'] = df2.C[df2.A.searchsorted(df1.A)].values
df1['B'] = df2.C[df2.A.searchsorted(df1.A)].reset_index(drop=True)
df1['B'] = df2.C.values[df2.A.searchsorted(df1.A)]
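Note that searchsorted only returns correct positions when df2.A is sorted and every value of df1.A actually occurs in df2.A; a defensive sketch that sorts first:
df2s = df2.sort_values('A').reset_index(drop=True)
df1['B'] = df2s.C.values[df2s.A.searchsorted(df1.A)]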