I have a dataframe:
ID val
1 a
2 b
3 c
4 d
5 a
7 d
6 v
8 j
9 k
10 a
I have a dictionary as follows:
{aa:3, bb: 3,cc:4}
In the dictionary the numerical values indicates the number of records. The sum of numerical values is equal to the number of rows that I have in the data frame. In this example 3 + 3 + 4 = 10 and I have 10 rows in the data frame.
I am trying to split the data frame by rows that are equal to the number given in the dictionary and fill the key as column value into a new column. The desired output is as follows:
ID val. new_col
1 a. aa
2 b aa
3 c. aa
4 d. bb
5 a. bb
6 v. bb
7. d. cc
8 j. cc
9 k. cc
10 a. cc
The order of the fill is not important as long as the count of records match with the count given in the dict. I am trying to resolve this by iterating through the dict but I am not able to isolate specific number of records of the data frame with every new key value pair.
I have also tried using pd.cut by splitting the dict values to bins and keys as column values. However I am getting the error ValueError: bins must increase monotonically.
d = {'aa':3, 'bb': 3,'cc':4}
df['new_col'] = pd.Series([np.repeat(i, j) for i, j in d.items()]).explode().to_numpy()
df
Out[64]:
ID val new_col
0 1 a aa
1 2 b aa
2 3 c aa
3 4 d bb
4 5 a bb
5 7 d bb
6 6 v cc
7 8 j cc
8 9 k cc
9 10 a cc
Given table:
ID
LINE
SITE
DATE
UNITS
TOTAL
1
X
AAA
02-May-2017
12
30
2
X
AAA
03-May-2017
10
22
3
X
AAA
04-May-2017
22
40
4
Z
AAA
20-MAY-2017
15
44
5
Z
AAA
21-May-2017
8
30
6
Z
BBB
22-May-2017
10
32
7
Z
BBB
23-May-2017
25
52
8
K
CCC
02-Jun-2017
6
22
9
K
CCC
03-Jun-2017
4
33
10
K
CCC
12-Aug-2017
11
44
11
K
CCC
13-Aug-2017
19
40
12
K
CCC
14-Aug-2017
30
40
for each row if ID,LINE ,SITE equal to previous row (day) need to calculate as below (last day) and (last 3 days ) :
Note that is need to insure date are consecutive under "groupby" of ID,LINE ,SITE columns
ID
LINE
SITE
DATE
UNITS
TOTAL
Last day
Last 3 days
1
X
AAA
02-May-2017
12
30
0
0
2
X
AAA
03-May-2017
10
22
12/30
12/30
3
X
AAA
04-May-2017
22
40
10/22
(10+12)/(30+22)
4
Z
AAA
20-MAY-2017
15
44
0
0
5
Z
AAA
21-May-2017
8
30
15/44
15/44
6
Z
BBB
22-May-2017
10
32
0
0
7
Z
BBB
23-May-2017
25
52
10/32
10/32
8
K
CCC
02-Jun-2017
6
22
0
0
9
K
CCC
03-Jun-2017
4
33
6/22
6/22
10
K
CCC
12-Aug-2017
11
44
4/33
0
11
K
CCC
13-Aug-2017
19
40
11/44
(11/44)
12
K
CCC
14-Aug-2017
30
40
19/40
(11+19/44+40)
In this cases i usually do a for loop with groupby:
import pandas as pd
import numpy as np
#copied your table
table = pd.read_csv('/home/fm/Desktop/stackover.csv')
table.set_index('ID', inplace = True)
table[['Last day','Last 3 days']] = np.nan
for i,r in table.groupby(['LINE' ,'SITE']):
#First subset non sequential dates
limits_interval = pd.to_datetime(r['DATE']).diff() != '1 days'
#First element is a false positive, as its impossible to calculate past days from first day
limits_interval.iloc[0]=False
ids_subset = r.index[limits_interval].to_list()
ids_subset.append(r.index[-1]+1) #to consider all values
id_start = 0
for id_end in ids_subset:
r_sub = r.loc[id_start:id_end-1, :].copy()
id_start = id_end
#move all values one day off, if the database is as in your example (1 line per day) wont have problems
r_shifted = r_sub.shift(1)
r_sub['Last day']=r_shifted['UNITS']/r_shifted['TOTAL']
aux_units_cumsum = r_shifted['UNITS'].cumsum()
aux_total_cumsum = r_shifted['TOTAL'].cumsum()
r_sub['Last 3 days'] = aux_units_cumsum/aux_total_cumsum
r_sub.fillna(0, inplace = True)
table.loc[r_sub.index,:]=r_sub.copy()
You can make a function and apply in groupby, it would be cleaner: Apply function to pandas groupby. It would be more elegant.
Wish I could help you, good luck
I have the following dataframe:
source name cost other_c other_b
a a 7 dd 33
b a 6 gg 44
c c 3 ee 55
b a 2
d b 21 qw 21
e a 16 aq
c c 10 55
I am doing a sum of name and source with:
new_df = df.groupby(['source', 'name'], as_index=False)['cost'].sum()
but it is dropping the remaining 6 columns in my dataframe. Is there a way to keep the rest of the columns? I'm not looking to add new column, just carry over the columns from the original dataframe
This is my pandas df:
Id Protein A_Egg B_Meat C_Milk Category
A 10 10 20 0 egg
B 20 10 0 10 milk
C 20 10 10 10 meat
D 25 20 10 0 egg
I wish to merge protein column with other column based on "Category"
My output is
Id Protein_final
A 20
B 30
C 30
D 45
Ideally, I would like to show how I am approaching but, I am frankly clueless!!
EDIT: Also, How to handle is the category is blank or does meet one of the column (in that can final should be same as initial value in protein column)
Use DataFrame.lookup with some preprocessing with remove values in columns names before _ and lowercase, last add to column:
arr = df.rename(columns=lambda x: x.split('_')[-1].lower()).lookup(df.index, df['Category'])
df['Protein'] += arr
print (df)
Id Protein A_Egg B_Meat C_Milk Category
0 A 20 10 20 0 egg
1 B 30 10 0 10 milk
2 C 30 10 10 10 meat
3 D 45 20 10 0 egg
If need only 2 columns finally:
df = df[['Id','Protein']]
You can melt the dataframe, and filter for rows where category equals the variable column, and sum the final columns :
(
df
.melt(["Id", "Protein", "Category"])
.assign(variable=lambda x: x.variable.str[2:].str.lower(),
Protein_final=lambda x: x.Protein + x.value)
.query("Category == variable")
.filter(["Id", "Protein_final"])
)
Id Protein_final
0 A 20
3 D 45
6 C 30
9 B 30
I want to group my data using SQL or R so that I can get top or bottom 10 Subarea_codes for each Company and Area_code. In essence: the Subarea_codes within the Area_codes where each Company has its largest or smallest result.
data.csv
Area_code Subarea_code Company Result
10 101 A 15
10 101 P 10
10 101 C 4
10 102 A 10
10 102 P 8
10 102 C 5
11 111 A 15
11 111 P 20
11 111 C 5
11 112 A 10
11 112 P 5
11 112 C 10
result.csv should be like this
Company Area_code Largest_subarea_code Result Smallest_subarea_code Result
A 10 101 15 102 10
P 10 101 10 102 8
C 10 102 5 101 4
A 11 111 15 112 10
P 11 111 20 112 5
C 11 112 10 111 5
Within each Area_code there can be hundreds of Subarea_codes but I only want the top and bottom 10 for each Company.
Also this doesn't have to be resolved in one query, but can be divided into two queries, meaning smallest is presented in results_10_smallest and largest in result_10_largest. But I'm hoping I can accomplish this with one query for each result.
What I've tried:
SELECT Company, Area_code, Subarea_code MAX(Result)
AS Max_result
FROM data
GROUP BY Subarea_code
ORDER BY Company
;
This gives me all the Companies with the highest results within each Subarea_code. Which would mean: A, A, P, A-C for the data above.
Using sqldf package:
df <- read.table(text="Area_code Subarea_code Company Result
10 101 A 15
10 101 P 10
10 101 C 4
10 102 A 10
10 102 P 8
10 102 C 5
11 111 A 15
11 111 P 20
11 111 C 5
11 112 A 10
11 112 P 5
11 112 C 10", header=TRUE)
library(sqldf)
mymax <- sqldf("select Company,
Area_code,
max(Subarea_code) Largest_subarea_code
from df
group by Company,Area_code")
mymaxres <- sqldf("select d.Company,
d.Area_code,
m.Largest_subarea_code,
d.Result
from df d, mymax m
where d.Company=m.Company and
d.Subarea_code=m.Largest_subarea_code")
mymin <- sqldf("select Company,
Area_code,
min(Subarea_code) Smallest_subarea_code
from df
group by Company,Area_code")
myminres <- sqldf("select d.Company,
d.Area_code,
m.Smallest_subarea_code,
d.Result
from df d, mymin m
where d.Company=m.Company and
d.Subarea_code=m.Smallest_subarea_code")
result <- sqldf("select a.*, b.Smallest_subarea_code,b.Result
from mymaxres a, myminres b
where a.Company=b.Company and
a.Area_code=b.Area_code")
If you already doing it in R, why not use the much more efficient data.table instead of sqldf using SQL syntax? Assuming data is your data set, simply:
library(data.table)
setDT(data)[, list(Largest_subarea_code = Subarea_code[which.max(Result)],
Resultmax = max(Result),
Smallest_subarea_code = Subarea_code[which.min(Result)],
Resultmin = min(Result)), by = list(Company, Area_code)]
# Company Area_code Largest_subarea_code Resultmax Smallest_subarea_code Resultmin
# 1: A 10 101 15 102 10
# 2: P 10 101 10 102 8
# 3: C 10 102 5 101 4
# 4: A 11 111 15 112 10
# 5: P 11 111 20 112 5
# 6: C 11 112 10 111 5
There seems to be a discrepancy between the output shown and the description. The description asks for the top 10 and bottom 10 results for each Area code/Company but the sample output shows only the top 1 and the bottom 1. For example, for area code 10 and company A subarea 101 is top with a result of 15 and and subarea 102 is 2nd largest with a result of 10 so according to the description there should be two rows for that company/area code combination. (If there were more data there would be up to 10 rows for that company/area code combination.)
We give two answers. The first assumes the top 10 and bottom 10 are wanted for each company and area code as in the question's description and the second assumes only the top and bottom for each company and area code as in the question's sample output.
1) Top/Bottom 10
Here we assume that the top 10 and bottom 10 results for each Company/Area code are wanted. If its just the top and bottom one then see (2) later on (or replace 10 with 1 in the code here). Bottom10 is all rows for which there are 10 or fewer subareas for the same area code and company with equal or smaller results. Top10 is similar.
library(sqldf)
Bottom10 <- sqldf("select a.Company,
a.Area_code,
a.Subarea_code Bottom_Subarea,
a.Result Bottom_Result,
count(*) Bottom_Rank
from df a join df b
on a.Company = b.Company and
a.Area_code = B.Area_code and
b.Result <= a.Result
group by a.Company, a.Area_code, a.Subarea_code
having count(*) <= 10")
Top10 <- sqldf("select a.Company,
a.Area_code,
a.Subarea_code Top_Subarea,
a.Result Top_Result,
count(*) Top_Rank
from df a join df b
on a.Company = b.Company and
a.Area_code = B.Area_code and
b.Result >= a.Result
group by a.Company, a.Area_code, a.Subarea_code
having count(*) <= 10")
The description indicated you wanted the top 10 OR the bottom 10 for each company/area code in which case just use one of the results above. If you want to combine them we show a merge below. We have added a Rank column to indicate the smallest/largest (Rank is 1), second smallest/largest (Rank is 2), etc.
sqldf("select t.Area_code,
t.Company,
t.Top_Rank Rank,
t.Top_Subarea,
t.Top_Result,
b.Bottom_Subarea,
b.Bottom_Result
from Bottom10 b join Top10 t
on t.Area_code = b.Area_code and
t.Company = b.Company and
t.Top_Rank = b.Bottom_Rank
order by t.Area_code, t.Company, t.Top_Rank")
giving:
Area_code Company Rank Top_Subarea Top_Result Bottom_Subarea Bottom_Result
1 10 A 1 101 15 102 10
2 10 A 2 102 10 101 15
3 10 C 1 102 5 101 4
4 10 C 2 101 4 102 5
5 10 P 1 101 10 102 8
6 10 P 2 102 8 101 10
7 11 A 1 111 15 112 10
8 11 A 2 112 10 111 15
9 11 C 1 112 10 111 5
10 11 C 2 111 5 112 10
11 11 P 1 111 20 112 5
12 11 P 2 112 5 111 20
Note that this format makes less sense if there are ties and, in fact, could generate more than 10 rows for a Company/Area code so you might just want to use the individual Top10 and Bottom10 in that case. You could also consider jittering df$Result if this a problem:
df$Result <- jitter(df$Result)
# now perform SQL statements
2) Top/Bottom Only
Here we give only the top and bottom results and the corresponding subareas for each company/area code. Note that this uses an extension to SQL supported by sqlite and the SQL code is substantially simpler:
Bottom1 <- sqldf("select Company,
Area_code,
Subarea_code Bottom_Subarea,
min(Result) Bottom_Result
from df
group by Company, Area_code")
Top1 <- sqldf("select Company,
Area_code,
Subarea_code Top_Subarea,
max(Result) Top_Result
from df
group by Company, Area_code")
sqldf("select a.Company,
a.Area_code,
Top_Subarea,
Top_Result,
Bottom_Subarea
Bottom_Result
from Top1 a join Bottom1 b
on a.Company = b.Company and
a.Area_code = b.Area_code
order by a.Area_code, a.Company")
This gives:
Company Area_code Top_Subarea Top_Result Bottom_Result
1 A 10 101 15 102
2 C 10 102 5 101
3 P 10 101 10 102
4 A 11 111 15 112
5 C 11 112 10 111
6 P 11 111 20 112
Update Correction and added (2).
Above answers are fine to fetch max result.
This solves the top10 issue:
data.top <- data[ave(-data$Result, data$Company, data$Area_code, FUN = rank) <= 10, ]
In this script the user declares the company. The script then indicates the max top 10 results (idem for min values).
Result=NULL
A <- read.table(/your-file.txt",header=T,sep="\t",na.string="NA")
Company<-A$Company=="A" #can be A, C, P or other values
Subarea<-unique(A$Subarea)
for (i in 1:length(unique(A$Subarea)))
{Result[i]<-max(A$Result[Company & A$Subarea_code==Subarea[i]])}
Res1<-t((rbind(Subarea,Result)))
Res2<-Res1[order(-Res1[,2]),]
Res2[1:10,]