Suppose I have multiple lists:
A = ['acc_num=1', 'A1', 'A2']
B = ['acc_num=2', 'B1', 'B2', 'B3','B4']
C = ['acc_num=3', 'C1']
How do I put them in a DataFrame to export to Excel as:
acc_num _1 _2 _3 _4
_1 1 A1 A2
_2 2 B1 B2 B3 B4
_3 3 C1
Here is a solution in three basic steps:
1. Create a DataFrame by passing a list of your lists.
2. Clean up the acc_num column by removing the leading string "acc_num=". This uses a vectorized string method on the column.
3. Rename the column headers as you wish by passing a dictionary to df.rename.
The code:
import pandas as pd
# Create a DataFrame from your lists; shorter lists are padded with None
df = pd.DataFrame([A, B, C])
# Strip the literal prefix 'acc_num=' from column 0
df[0] = df[0].str.replace('acc_num=', '', regex=False)
# Rename column 0
df.rename(columns={0: "acc_num"}, inplace=True)
Final result:
Out[26]:
acc_num 1 2 3 4
0 1 A1 A2 None None
1 2 B1 B2 B3 B4
2 3 C1 None None None
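If you also want the _1, _2, ... labels from the question, a minimal sketch might look like the following. It assumes the three lists from the question; the output file name is hypothetical, and writing to Excel needs an engine such as openpyxl installed, so that line is left commented out.

```python
import pandas as pd

A = ['acc_num=1', 'A1', 'A2']
B = ['acc_num=2', 'B1', 'B2', 'B3', 'B4']
C = ['acc_num=3', 'C1']

# Build the frame and strip the literal prefix from the first column
df = pd.DataFrame([A, B, C])
df[0] = df[0].str.replace('acc_num=', '', regex=False)
df = df.rename(columns={0: 'acc_num'})

# Prefix the remaining column labels and the row labels with an underscore
df = df.rename(columns={c: f'_{c}' for c in df.columns if c != 'acc_num'})
df.index = [f'_{i + 1}' for i in range(len(df))]

# Hypothetical file name; uncomment once an Excel engine is available
# df.to_excel('accounts.xlsx')
```

Missing cells stay empty in the exported sheet, since pandas writes missing values as blanks by default.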
Starting from a DataFrame imported from Excel like this:
Code  Material  Text       QTY
A1    X222      Model3     1
A2    4027721   Gruoup1    1
A2    4647273   Gruoup1.1  4
A1    573828    Gruoup1.2  1
I want to create a new pivot table like this:
Code  Qty
A1    2
A2    5
I tried the following commands, but they do not work:
df.pivot(index='Code', columns='',values='Qty')
df_pivot = df ("Code").Qty([sum, max])
You don't need pivot but groupby:
out = df.groupby('Code', as_index=False)['QTY'].sum()
# Or
out = df.groupby('Code')['QTY'].agg(['sum', 'max']).reset_index()
Output:
>>> out
Code sum max
0 A1 2 1
1 A2 5 4
The equivalent code with pivot_table:
out = (df.pivot_table('QTY', 'Code', aggfunc=['sum', 'max'])
.droplevel(1, axis=1).reset_index())
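For anyone who wants to reproduce this without the Excel file, here is a small self-contained sketch. The frame is typed in by hand from the table in the question, and renaming QTY to Qty at the end is only an assumption about the desired header.

```python
import pandas as pd

# Sample data copied from the question's table
df = pd.DataFrame({
    'Code': ['A1', 'A2', 'A2', 'A1'],
    'Material': ['X222', '4027721', '4647273', '573828'],
    'Text': ['Model3', 'Gruoup1', 'Gruoup1.1', 'Gruoup1.2'],
    'QTY': [1, 1, 4, 1],
})

# Sum QTY per Code, keeping Code as a regular column
out = df.groupby('Code', as_index=False)['QTY'].sum()

# Optional: match the header used in the desired output
out = out.rename(columns={'QTY': 'Qty'})
print(out)
```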
I have a time series data frame with two ID columns and about 1000 columns for day 1, day 2, etc.
I want to convert from my data frame being in the form
a b t1 t2 ... t1000
_____________________________
a1 b1 # # #
a2 b2 # # #
to being in the form
a b t value
____________________
a1 b1 't1' #
a1 b1 't2' #
a2 b2 't1' #
a2 b2 't2' #
Essentially, I want to do something like this:
dataframe %>%
select(starts_with("t_") ) %>%
gather(key = "t", value = "value")
so that I have a dataframe looking like this:
t value
__________
't1' #
't2' #
...
't100' #
for each row in the original dataframe. Then once I have the time columns that I generated from one row in the original dataframe, I want to left append the "a" and "b" columns to each row, so that this:
t value
__________
't1' #
't2' #
...
't100' #
turns into this:
a b t value
____________________
a1 b1 't1' #
a1 b1 't2' #
...
a1 b1 't100' #
and then I repeat this process for each row, stacking the new generated dataframe below (or above) the previously generated dataframe. At the end I want to have the dataframe in the code block above generated for each row of the original dataframe, and all stacked on top of one another. I could do this with a for loop, or maybe even some sapply magic, but I don't want to use a for loop in R, and I feel like there's a better way to do this than sapply-ing some function into each row of the original dataframe.
Can anyone help me? Thanks. Preferably using tidyverse.
We can use pivot_longer from tidyr.
library(tidyr)
data <- matrix(1:1000,nrow=2,ncol=1000)
colnames(data) <- paste0("t",1:1000)
data <- data.frame(a=c("a1","a2"),b=c("b1","b2"),data)
data[1:2,1:10]
# a b t1 t2 t3 t4 t5 t6 t7 t8
#1 a1 b1 1 3 5 7 9 11 13 15
#2 a2 b2 2 4 6 8 10 12 14 16
data %>%
pivot_longer(cols = starts_with("t"), names_to = "t")
## A tibble: 2,000 x 4
# a b t value
# <fct> <fct> <chr> <int>
# 1 a1 b1 t1 1
# 2 a1 b1 t2 3
# 3 a1 b1 t3 5
# 4 a1 b1 t4 7
# 5 a1 b1 t5 9
# 6 a1 b1 t6 11
# 7 a1 b1 t7 13
# 8 a1 b1 t8 15
# 9 a1 b1 t9 17
#10 a1 b1 t10 19
Try this:
library(dplyr)
library(tidyr)  # gather() is provided by tidyr, not dplyr
dataframe %>%
  gather(key = "t", value = "value", 3:ncol(.))
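For readers doing the same wide-to-long reshape in pandas rather than R, a rough equivalent is DataFrame.melt. This is only a sketch on a tiny made-up frame with three time columns; the names wide and long_df are placeholders.

```python
import pandas as pd

# Small stand-in for the wide frame: two ID columns plus time columns
wide = pd.DataFrame({
    'a': ['a1', 'a2'],
    'b': ['b1', 'b2'],
    't1': [1, 2],
    't2': [3, 4],
    't3': [5, 6],
})

# Keep a and b as identifiers; every other column becomes a (t, value) pair
long_df = wide.melt(id_vars=['a', 'b'], var_name='t', value_name='value')
print(long_df)
```

Note that melt stacks the result one time column at a time, so the row order differs from pivot_longer, which keeps all rows for a1/b1 together.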
I have a dataframe that I split using groupby. I understand this splits the dataframe into multiple dataframes. How can I get back those individual dataframes, based on the groups, and name them accordingly? So if I call df.groupby(['A','B'])
and A has the value A1 and B has values B1-B4, how do I get back those 4 dataframes, called df_A1B1, df_A1B2, ..., df_A1B4?
This can be done with locals(), but it is not recommended:
variables = locals()
for i, j in df.groupby(['A','B']):
    variables["df_{0[0]}{0[1]}".format(i)] = j
df_01
Out[332]:
A B C
0 0 1 a-1524112-124
Using a dict is the right way:
{"df_{0[0]}{0[1]}".format(i) : j for i,j in df.groupby(['A','B'])}
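To make the dict approach concrete, here is a minimal runnable sketch. The sample frame and the name frames are assumptions for illustration, not part of the question.

```python
import pandas as pd

# Hypothetical sample data with two grouping columns
df = pd.DataFrame({
    'A': ['a1', 'a2'] * 4,
    'B': ['b1', 'b2', 'b3', 'b4'] * 2,
    'val': range(8),
})

# One entry per (A, B) group, keyed by a readable name
frames = {f"df_{a}{b}": group for (a, b), group in df.groupby(['A', 'B'])}

# Look a group up by name instead of creating a variable for it
print(frames['df_a1b1'])
```

Keeping the groups in a dict means you can loop over them or check which ones exist, which is awkward with dynamically created variables.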
An alternative solution, using pandas.DataFrame.xs and some exec magic:
import pandas as pd
df = pd.DataFrame({'A': ['a1', 'a2']*4,
                   'B': ['b1', 'b2', 'b3', 'b4']*2,
                   'val': [i for i in range(8)]
                   })
df
# A B val
# 0 a1 b1 0
# 1 a2 b2 1
# 2 a1 b3 2
# 3 a2 b4 3
# 4 a1 b1 4
# 5 a2 b2 5
# 6 a1 b3 6
# 7 a2 b4 7
for i in df.set_index(['A', 'B']).index.unique().tolist():
    exec("df_{}{}".format(i[0], i[1]) + " = df.set_index(['A','B']).xs(i)")
df_a1b1
# val
# A B
# a1 b1 0
# b1 4
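If you only need one group at a time, a sketch that avoids exec altogether is to keep the groupby object around and ask it for a group by key. This swaps in GroupBy.get_group in place of xs; the sample frame is the same as above and the name grouped is a placeholder.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['a1', 'a2'] * 4,
    'B': ['b1', 'b2', 'b3', 'b4'] * 2,
    'val': range(8),
})

grouped = df.groupby(['A', 'B'])

# Fetch a single group by its (A, B) key; the original columns are kept
df_a1b1 = grouped.get_group(('a1', 'b1'))
print(df_a1b1)
```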
I am new to IPython and I am trying to do something with dataframe grouping. I have a dataframe like the one below:
df_test = pd.DataFrame({"A": range(4), "B": ["B1", "B2", "B1", "B2"], "C": ["C1", "C1", np.nan, "C2"]})
df_test
A B C
0 0 B1 C1
1 1 B2 C1
2 2 B1 NaN
3 3 B2 C2
I would like to achieve the following:
1) Group by B, but produce the groups as columns (under a multilevel column header) instead of as rows indexed by B1 and B2. The B1 and B2 values are simply counts.
2) Columns A and C are aggregated with something like {'C':['count'],'A':['sum']}
B
A B1 B2 C
0 6 2 2 3
How? Thanks.
You are applying a different operation to each column. You can work around this by aggregating A and C, taking the value counts of B separately, and then combining the results:
ac = df_test.agg({'A':'sum', 'C':'count'})
b = df_test['B'].value_counts()
pd.concat([ac, b]).sort_index().to_frame().T
A B1 B2 C
0 6 2 2 3
I want to mark duplicate values within an ID group. For example
ID A B
i1 a1 b1
i1 a1 b2
i1 a2 b2
i2 a1 b2
should become
ID A An B Bn
i1 a1 2 b1 1
i1 a1 2 b2 2
i1 a2 1 b2 2
i2 a1 1 b2 1
Basically, An and Bn count multiplicity within each ID group. How can I do this in pandas? I've found groupby, but it was quite messy to put everything together. I also tried individual groupbys for ID, A and ID, B. Maybe there is a way to pre-group by ID first and then handle all the other variables? (There are many variables and I have very many rows!)
Also I tried individual groupby for ID, A and ID, B
I think this is a straightforward way to tackle it. As you suggest, you can group by each pair separately and then compute the size of the groups. Using transform makes it easy to add the results to the original dataframe:
df['An'] = df.groupby(['ID','A'])['A'].transform('size')
df['Bn'] = df.groupby(['ID','B'])['B'].transform('size')
print(df)
ID A B An Bn
0 i1 a1 b1 2 1
1 i1 a1 b2 2 2
2 i1 a2 b2 1 2
3 i2 a1 b2 1 1
Of course, with lots of columns you could do:
for col in ['A','B']:
    df[col + 'n'] = df.groupby(['ID',col])[col].transform('size')
The duplicated method can also be used to give you something similar, but it will mark observations within a group after the first as duplicates:
for col in ['A','B']:
    df[col + 'n'] = df.duplicated(['ID',col])
print(df)
ID A B An Bn
0 i1 a1 b1 False False
1 i1 a1 b2 True False
2 i1 a2 b2 False True
3 i2 a1 b2 False False
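If you would rather flag every member of a repeated group, including the first occurrence, duplicated accepts keep=False. A small sketch on the sample data from the question (the frame is typed in by hand here):

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['i1', 'i1', 'i1', 'i2'],
    'A': ['a1', 'a1', 'a2', 'a1'],
    'B': ['b1', 'b2', 'b2', 'b2'],
})

for col in ['A', 'B']:
    # keep=False marks all rows whose (ID, col) pair occurs more than once
    df[col + 'n'] = df.duplicated(['ID', col], keep=False)

print(df)
```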
EDIT: improving performance for large data. I ran this on a large dataset (4 million rows), and it was significantly faster when I avoided transform and used something like the following (though it is much less elegant):
for col in ['A','B']:
    x = df.groupby(['ID',col]).size()
    df.set_index(['ID',col],inplace=True)
    df[col + 'n'] = x
    df.reset_index(inplace=True)
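A variant of the same idea that leaves the index and column order alone is to compute the group sizes once and join them back on the key columns. This is a sketch on the sample data, not something I have timed against the loop above; the name sizes is a placeholder.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['i1', 'i1', 'i1', 'i2'],
    'A': ['a1', 'a1', 'a2', 'a1'],
    'B': ['b1', 'b2', 'b2', 'b2'],
})

for col in ['A', 'B']:
    # Series of group sizes indexed by (ID, col), named after the new column
    sizes = df.groupby(['ID', col]).size().rename(col + 'n')
    # Align the sizes to each row via the key columns
    df = df.join(sizes, on=['ID', col])

print(df)
```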