I draw a normal random variable with mean 12 to generate the hour of order_date for my mock data. But when I check whether 12 is the most frequent value, it turns out not to be. Any ideas why that is the case?
import calendar
import datetime as dt
import random

import numpy as np
import pandas as pd

def random_date_generator(month):
    day_range = calendar.monthrange(2020, month)[1]
    day = random.randint(1, day_range)
    first_hour = int(np.random.normal(12, 2))
    # second_hour = int(np.random.normal(18, 2))
    hour = random.choices([first_hour])[0]
    minute = random.randint(1, 59)
    date = dt.datetime(2020, month, day, hour, minute).strftime("%Y/%m/%d %H:%M")
    return date
columns = ['Order ID', 'Product', 'Quantity Ordered', 'Price Each', 'Order Date',
           'Purchase Address', "Month"]
df = pd.DataFrame(columns=columns)
order_id = 123
for month_int in range(1, 13):
    if month_int == 12:
        order_amount = int(np.random.normal(100, 30))
    if month_int == 11:
        order_amount = int(np.random.normal(90, 30))
    if month_int < 11:
        order_amount = int(np.random.normal(60, 10))
    for i in range(order_amount):
        products_list = [product for product in products]
        weights = [products[key][1] for key in products_list]
        product = random.choices(products_list, weights=weights)[0]
        price = products[product]
        date = random_date_generator(month_int)
        address = generate_random_addresses()
        month = calendar.month_name[month_int]
        df.loc[i] = [order_id, product, "NA", price, date, address, month_int]
        order_id += 1
    df.to_csv(f"{month}_data2.csv")
    print(f"{month}_data2.csv")
    break
january = pd.read_csv("January_data2.csv")
january["Hour"] = january["Order Date"].str[-5:-3]
january.groupby("Hour").count()
This is the output. As you can see, the most frequently generated hour is not 12 but 10.
      Unnamed: 0  Order ID  Product  Quantity Ordered  Price Each  Order Date  Purchase Address  Month
Hour
07             2         2        2                 0           2           2                 2      2
08             2         2        2                 0           2           2                 2      2
09             9         9        9                 0           9           9                 9      9
10            15        15       15                 0          15          15                15     15
11            11        11       11                 0          11          11                11     11
12            10        10       10                 0          10          10                10     10
13             9         9        9                 0           9           9                 9      9
14             7         7        7                 0           7           7                 7      7
15             3         3        3                 0           3           3                 3      3
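One likely explanation, offered as a sketch rather than a definitive diagnosis: int() truncates toward zero, so every draw in [12, 13) becomes 12 and every draw in [11, 12) becomes 11. Because the normal distribution is symmetric about 12, hours 11 and 12 are then equally likely, and with only about 60 orders in a month random noise can easily put 10 or 11 on top. Rounding instead of truncating centres the bins on 12. The seed, the sample size and the helper hour_counts below are illustrative choices, not part of the original code.

```python
import numpy as np

rng = np.random.default_rng(0)            # fixed seed so the sketch is repeatable
samples = rng.normal(12, 2, size=100_000)

truncated = samples.astype(int)           # truncates toward zero, like int(x)
rounded = np.rint(samples).astype(int)    # nearest whole hour


def hour_counts(hours):
    """Return {hour: count} for an integer array (hypothetical helper)."""
    values, counts = np.unique(hours, return_counts=True)
    return dict(zip(values.tolist(), counts.tolist()))


trunc_counts = hour_counts(truncated)
round_counts = hour_counts(rounded)

# With truncation, 11 and 12 occur about equally often.
print("truncated 11 vs 12:", trunc_counts[11], trunc_counts[12])
# With rounding, 12 is the single most frequent hour.
print("rounded mode:", max(round_counts, key=round_counts.get))
```

If this is the cause, replacing int(np.random.normal(12, 2)) with int(round(np.random.normal(12, 2))) in random_date_generator should make 12 the most common hour once the sample is large enough.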
I have a df of 15 x 4 and I'm trying to compute the maximum gradient in a North (N) minus South (S) direction for each row using a "S" and "N" value for each min or max in the rows below. I'm not sure that this is the best pythonic way to do this. My df "ms" looks like this:
minSlats minNlats maxSlats maxNlats
0 57839.4 54917.0 57962.6 56979.9
0 57763.2 55656.7 58120.0 57766.0
0 57905.2 54968.6 58014.3 57031.6
0 57796.0 54810.2 57969.0 56848.2
0 57820.5 55156.4 58019.5 57273.2
0 57542.7 54330.6 58057.6 56145.1
0 57829.8 54755.4 57978.8 56777.5
0 57796.0 54810.2 57969.0 56848.2
0 57639.4 54286.6 58087.6 56140.1
0 57653.3 56182.7 57996.5 57975.8
0 57665.1 56048.3 58069.7 58031.4
0 57559.9 57121.3 57890.8 58043.0
0 57689.7 55155.5 57959.4 56440.8
0 57649.4 56076.5 58043.0 58037.4
0 57603.9 56290.0 57959.8 57993.9
My loop structure looks like this:
J = len(ms)
grad = pd.DataFrame()
for i in range(J):
    if ms.maxSlats.iloc[i] > ms.maxNlats.iloc[i]:
        gr = ( ms.maxSlats.iloc[i] - ms.minNlats.iloc[i] ) * -1
        grad[gr] = [i+1, i]
    elif ms.maxNlats.iloc[i] > ms.maxSlats.iloc[i]:
        gr = ms.maxNlats.iloc[i] - ms.minSlats.iloc[i]
        grad[gr] = [i+1, i]
grad = grad.T  # need to transpose
print(grad)
I obtain the correct answer but I'm wondering if there is a cleaner way to do this to obtain the same answer below:
grad.T
Out[317]:
0 1
-3045.6 1 0
-2463.3 2 1
-3045.7 3 2
-3158.8 8 7
-2863.1 5 4
-3727.0 6 5
-3223.4 7 6
-3801.0 9 8
-1813.8 10 9
-2021.4 11 10
483.1 12 11
-2803.9 13 12
-1966.5 14 13
390.0 15 14
thank you,
Use np.where to compute the gradient, then keep only the last occurrence of each duplicated index value.
grad = np.where(ms.maxSlats > ms.maxNlats, (ms.maxSlats - ms.minNlats) * -1,
                ms.maxNlats - ms.minSlats)
df = pd.DataFrame({'A': pd.RangeIndex(1, len(ms)+1),
                   'B': pd.RangeIndex(0, len(ms))},
                  index=grad)
df = df[~df.index.duplicated(keep='last')]
>>> df
A B
-3045.6 1 0
-2463.3 2 1
-3045.7 3 2
-2863.1 5 4
-3727.0 6 5
-3223.4 7 6
-3158.8 8 7
-3801.0 9 8
-1813.8 10 9
-2021.4 11 10
483.1 12 11
-2803.9 13 12
-1966.5 14 13
390.0 15 14
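For anyone who wants to try this without the full data, here is a minimal self-contained sketch of the same np.where approach. It uses four rows copied from the question's table (two of them identical, so the de-duplication step has something to do); the name result and the use of np.arange instead of pd.RangeIndex are my own choices for the illustration.

```python
import numpy as np
import pandas as pd

# Four rows taken from the question's table; rows 1 and 2 are identical.
ms = pd.DataFrame(
    {
        "minSlats": [57839.4, 57796.0, 57796.0, 57559.9],
        "minNlats": [54917.0, 54810.2, 54810.2, 57121.3],
        "maxSlats": [57962.6, 57969.0, 57969.0, 57890.8],
        "maxNlats": [56979.9, 56848.2, 56848.2, 58043.0],
    }
)

# Vectorised version of the if/elif in the loop.
grad = np.where(
    ms.maxSlats > ms.maxNlats,
    -(ms.maxSlats - ms.minNlats),
    ms.maxNlats - ms.minSlats,
)

result = pd.DataFrame(
    {"A": np.arange(1, len(ms) + 1), "B": np.arange(len(ms))},
    index=grad,
)
# The loop overwrote earlier columns with the same key; keep='last' mimics that.
result = result[~result.index.duplicated(keep="last")]
print(result)
```

One difference to be aware of: the original loop skips rows where maxSlats equals maxNlats, whereas np.where sends those rows to the second branch.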
I have a list as primary = ['A' , 'B' , 'C' , 'D']
and a DataFrame as
df2 = pd.DataFrame(data=dateRange, columns = ['Date'])
which contains 1 date column starting from 01-July-2020 till 31-Dec-2020.
I created another column 'DayNum' which will contain the day number from the date like 01-July-2020 is Wednesday so the 'DayNum' column will have 2 and so on.
Now using the list I want to create another column 'primary' so that the DataFrame looks as follows:
In short, the elements on the list should repeat. You can say that this is a roster to show the name of the person on the roster on a weekly basis where Monday is the start (day 0) and Sunday is the end (day 6).
The output should be like this:
Date DayNum Primary
0 01-Jul-20 2 A
1 02-Jul-20 3 A
2 03-Jul-20 4 A
3 04-Jul-20 5 A
4 05-Jul-20 6 A
5 06-Jul-20 0 B
6 07-Jul-20 1 B
7 08-Jul-20 2 B
8 09-Jul-20 3 B
9 10-Jul-20 4 B
10 11-Jul-20 5 B
11 12-Jul-20 6 B
12 13-Jul-20 0 C
13 14-Jul-20 1 C
14 15-Jul-20 2 C
15 16-Jul-20 3 C
16 17-Jul-20 4 C
17 18-Jul-20 5 C
18 19-Jul-20 6 C
19 20-Jul-20 0 D
20 21-Jul-20 1 D
21 22-Jul-20 2 D
22 23-Jul-20 3 D
23 24-Jul-20 4 D
24 25-Jul-20 5 D
25 26-Jul-20 6 D
26 27-Jul-20 0 A
27 28-Jul-20 1 A
28 29-Jul-20 2 A
29 30-Jul-20 3 A
30 31-Jul-20 4 A
First, compare the DayNum column with 0 using Series.eq and take the cumulative sum with Series.cumsum to get a group number for each week. Then take that number modulo the length of the list with Series.mod, and finally map it to the list elements with Series.map, using a dictionary created with enumerate:
primary = ['A','B','C','D']
d = dict(enumerate(primary))
df['Primary'] = df['DayNum'].eq(0).cumsum().mod(len(primary)).map(d)
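A self-contained sketch of the same idea, assuming the Date column is built with pd.date_range and DayNum comes from dt.dayofweek (the question does not show how either was created, so both are assumptions):

```python
import pandas as pd

primary = ["A", "B", "C", "D"]

# Assumed construction of the question's frame: 01-Jul-2020 to 31-Dec-2020.
df = pd.DataFrame({"Date": pd.date_range("2020-07-01", "2020-12-31", freq="D")})
df["DayNum"] = df["Date"].dt.dayofweek        # Monday=0 ... Sunday=6

# Each Monday (DayNum == 0) starts a new week, so the running count of
# Mondays is a week number; modulo len(primary) makes the roster cycle.
week_no = df["DayNum"].eq(0).cumsum()
df["Primary"] = week_no.mod(len(primary)).map(dict(enumerate(primary)))

print(df.head(8))
```

The partial first week gets week number 0 and therefore the first name in the list, which matches the expected output in the question.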
I have a gritty industrial control problem I'm trying to solve with T-SQL.
The goal is to calculate an index position for each of two pallet-loading robots, each positioned in one of two ranges: 2 to 78 (robot 1) and 4 to 80 (robot 2).
Each robot indexes in steps of 4 so complete coverage of 80 spots on the pallet is achieved. The robots work side by side with a minimum spacing of 2 spots while they move along the pallet.
Two sized boxes can be placed on the pallet, one twice as long as the other. If two small boxes are placed side by side taking up 1 spot each, a single larger box can be placed on top, taking up 2 spots until a maximum height is reached. Thus the spot number for a small box is always odd and for a large box is always even and the robot index number is always even. e.g. (see diagram) from index position 14 spots 13 and 15 are loaded, and from index 20 spots 19 and 21 can be loaded.
[Diagram: Robot Index Positions]
I need a conversion formula that calculates the Index number for a given Spot and Robot.
The calculated Index column should look like the following:
Spot Robot Index
1 1 2
2 1 2
3 1 2
- - -
13 1 14
14 1 14
15 1 14
16 2 16
17 2 16
18 1 18
19 2 20
- - -
- - -
77 1 78
78 1 78
79 2 80
80 2 80
One way would be to update the Index column for every possible combination of Spot and Robot using a simple CASE WHEN selection, or maybe to do lookups on a reference table holding every possible combination. What I would like to explore (if any math wizards are inclined!) is a math formula that calculates the Index value.
So far I've come up with the following by converting a formula developed for use in Excel. The Robot 2 section is incomplete. The 95 to 99 values are for error checking.
UPDATE MovesTable SET [Index] =
CASE
    WHEN Robot = 1 THEN
        CASE
            WHEN Spot%4 = 0 THEN '99'
            WHEN Spot = 1 or Spot = 2 or Spot = 3 THEN '02'
            WHEN Spot = 5 or Spot = 6 or Spot = 7 THEN '06'
            WHEN Spot = 9 or Spot = 10 or Spot = 11 THEN '10'
            WHEN Spot%10 = 4 THEN CONCAT(Spot/10,'4')
            WHEN Spot < 57 AND (((Spot/10)%2 = 1 AND (Spot%10)%2 = 1) AND (Spot%10 = 3 OR Spot%10 = 5)) THEN CONCAT(Spot/10,'4')
            WHEN Spot%10 = 8 THEN CONCAT(Spot/10,'8')
            WHEN Spot < 57 AND (((Spot/10)%2 = 1 AND (Spot%10)%2 = 1) AND (Spot%10 = 7 OR Spot%10 = 9)) THEN CONCAT(Spot/10,'8')
            WHEN Spot%10 = 2 THEN CONCAT(Spot/10,'2')
            WHEN Spot < 57 AND (((Spot/10)%2 = 1 AND (Spot%10)%2 = 0) AND (Spot%10 = 1 OR Spot%10 = 3)) THEN CONCAT(Spot/10,'2')
            WHEN Spot%10 = 6 THEN CONCAT(Spot/10,'6')
            WHEN Spot < 57 AND (((Spot/10)%2 = 0 AND (Spot%10)%2 = 1) AND (Spot%10 = 5 OR Spot%10 = 7)) THEN CONCAT(Spot/10,'6')
            WHEN Spot%10 = 0 THEN CONCAT(Spot/10,'')
            WHEN Spot = 49 THEN '50'
            WHEN Spot < 57 AND (((Spot/10)%2 = 0 AND (Spot%10)%2 = 1) AND Spot%10 = 9) THEN '30'
            WHEN Spot < 57 AND (((Spot/10)%2 = 1 AND (Spot%10)%2 = 1) AND Spot%10 = 1) THEN CONCAT(Spot/10,'0')
            WHEN Spot > 56 AND (((Spot/10)%2 = 1 AND (Spot%10)%2 = 1) AND (Spot%10 = 7 OR Spot%10 = 9)) THEN CONCAT(Spot/10,'8')
            WHEN Spot > 56 AND (((Spot/10)%2 = 0 AND (Spot%10)%2 = 1) AND (Spot%10 = 1 OR Spot%10 = 3)) THEN CONCAT(Spot/10,'2')
            WHEN Spot > 56 AND (((Spot/10)%2 = 0 AND (Spot%10)%2 = 1) AND (Spot%10 = 5 OR Spot%10 = 7)) THEN CONCAT(Spot/10,'6')
            ELSE '98'
        END
    ELSE
        CASE
            WHEN Robot = 2 THEN
                CASE
                    WHEN (Spot%2 = 0 AND Spot%4 <> 0) OR (Spot = 1 OR Spot = 2) THEN '97'
                    WHEN Spot = 4 then '04'
                    WHEN Spot = 8 then '08'
                    WHEN Spot%4 = 0 THEN Spot
                    WHEN Spot = 2 OR Spot = 5 THEN '05'
                    WHEN Spot = 7 OR Spot = 9 THEN '08'
                    WHEN Spot = 19 THEN '20'
                    WHEN Spot = 39 THEN '40'
                    WHEN Spot = 59 THEN '60'
                    ELSE '96'
                END
            ELSE '95'
        END
END
I tried to solve this mathematically, rather than by analyzing cases, etc. It matches all of your sample results:
declare @t table (Spot int, Robot int, [Index] int)
insert into @t(Spot,Robot,[Index]) values
(1 ,1 , 2 ),
(2 ,1 , 2 ),
(3 ,1 , 2 ),
(13 ,1 ,14 ),
(14 ,1 ,14 ),
(15 ,1 ,14 ),
(16 ,2 ,16 ),
(17 ,2 ,16 ),
(18 ,1 ,18 ),
(19 ,2 ,20 ),
(77 ,1 ,78 ),
(78 ,1 ,78 ),
(79 ,2 ,80 ),
(80 ,2 ,80 )

select *,
    CONVERT(int,
        ROUND((Spot +
            CASE WHEN Robot = 1 THEN 2 ELSE 0 END
        )/4.0,0)* 4 -
        CASE WHEN Robot = 1 THEN 2 ELSE 0 END
    ) as Index2
from @t
The logic is "round to the nearest multiple of four" but we use a couple of expressions to offset Robot 1's results by 2.
Results:
Spot Robot Index Index2
----------- ----------- ----------- -----------
1 1 2 2
2 1 2 2
3 1 2 2
13 1 14 14
14 1 14 14
15 1 14 14
16 2 16 16
17 2 16 16
18 1 18 18
19 2 20 20
77 1 78 78
78 1 78 78
79 2 80 80
80 2 80 80
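The same arithmetic can be checked outside the database. The following is a small Python sketch of the formula, not a replacement for the T-SQL; the function name robot_index is made up for the illustration. It uses integer division rather than Python's round(), because round() rounds ties to the nearest even number whereas T-SQL's ROUND rounds them away from zero.

```python
def robot_index(spot: int, robot: int) -> int:
    """Round to the nearest multiple of 4, shifted by 2 for robot 1."""
    offset = 2 if robot == 1 else 0
    # (x + 2) // 4 equals floor(x / 4 + 0.5): nearest integer, ties upward,
    # which matches ROUND(x / 4.0, 0) for the positive values used here.
    return ((spot + offset + 2) // 4) * 4 - offset


# (Spot, Robot, expected Index) rows copied from the question's table.
samples = [
    (1, 1, 2), (2, 1, 2), (3, 1, 2),
    (13, 1, 14), (14, 1, 14), (15, 1, 14),
    (16, 2, 16), (17, 2, 16), (18, 1, 18), (19, 2, 20),
    (77, 1, 78), (78, 1, 78), (79, 2, 80), (80, 2, 80),
]

for spot, robot, expected in samples:
    print(spot, robot, expected, robot_index(spot, robot))
```

Ties only arise for spots that are multiples of 4 on robot 1, or even spots that are not multiples of 4 on robot 2. These are the combinations the question's CASE statement flags with error codes 99 and 97, so they would still need separate validation.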
In a SQL Server table, I must delete outliers from one column, separately for each group. Here are my columns:
select
customer,
sku,
stuff,
action,
acnumber,
year
from
mytable
Sample data:
customer sku year stuff action
-----------------------------------
1 1 2 2017 10 0
2 1 2 2017 20 1
3 1 3 2017 30 0
4 1 3 2017 40 1
5 2 4 2017 50 0
6 2 4 2017 60 1
7 2 5 2017 70 0
8 2 5 2017 80 1
9 1 2 2018 10 0
10 1 2 2018 20 1
11 1 3 2018 30 0
12 1 3 2018 40 1
13 2 4 2018 50 0
14 2 4 2018 60 1
15 2 5 2018 70 0
16 2 5 2018 80 1
I must delete outliers from the stuff variable, separately for each customer+sku+year group.
Everything below the 25th percentile or above the 75th percentile should be considered an outlier, and this rule must be applied to each group.
How can I clean the dataset for further processing?
Note that this dataset has a variable action (it takes the values 0 and 1). It is not a grouping variable, but outliers must be deleted only for rows where action is zero (0).
In R, this is solved as follows:
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}
new <- remove_outliers(vpg$stuff)
vpg=cbind(new,vpg)
Something like this, maybe. Window functions cannot appear directly in a WHERE clause, so the percent rank is computed in a CTE first and the delete goes through the CTE:
WITH ranked AS (
    SELECT PERCENT_RANK() OVER (PARTITION BY customer, sku, year ORDER BY stuff) AS pr
    FROM mytable
)
DELETE FROM ranked
WHERE pr < .25 OR pr > .75
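For comparison, here is a hedged pandas sketch of the rule implemented by the R function in the question (1.5 × IQR fences around the quartiles), applied per customer+sku+year group and removing only rows with action equal to 0. The function name drop_group_outliers and the demo data are invented for the illustration. Computing the quartiles over all rows of a group, regardless of action, is an assumption, since the question does not say which rows should define them.

```python
import pandas as pd


def drop_group_outliers(df, value="stuff", keys=("customer", "sku", "year")):
    """Drop action == 0 rows whose value lies outside the group's IQR fences."""
    grouped = df.groupby(list(keys))[value]
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    h = 1.5 * (q3 - q1)
    is_outlier = (df[value] < q1 - h) | (df[value] > q3 + h)
    # Rows with action == 1 are never removed, even if they are outliers.
    return df[~(is_outlier & df["action"].eq(0))]


# Invented demo data: each group has one extreme value (1000).
demo = pd.DataFrame(
    {
        "customer": [1] * 5 + [2] * 5,
        "sku": [2] * 5 + [4] * 5,
        "year": [2017] * 10,
        "stuff": [10, 11, 12, 13, 1000] * 2,
        "action": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    }
)

cleaned = drop_group_outliers(demo)
print(cleaned)
```

In the demo, the extreme value is dropped from the first group (action 0) and kept in the second (action 1).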
I need to create some checks to make sure that students are enrolled in the correct courses with the correct number of units. Here is my SQL at the moment.
SELECT StudentID
,AssessmentCode
,BoardCode
,BoardCategory
,BoardUnits
,sum(cast(boardunits as int)) over (partition by studentid,boardcategory) as UnitCount
,Count(boardcategory) over (partition by studentid) as SubjectCount
FROM uvNCStudentSubjectDetails
where fileyear = 2015
and filesemester = 1
and studentyearlevel = 11
and StudentIBFlag = 0
order by Studentnameinternal,BoardCategory
This gives me the following info...
StudentID AssessmentCode BoardCode BoardCategory BoardUnits UnitCount SubjectCount
61687 11TECDAT 11080 A 2 11 7
61687 11PRS1U 11350 A 1 11 7
61687 11MATGEN 11235 A 2 11 7
61687 11LANGRB 11870 A 2 11 7
61687 11ENGSTD 11130 A 2 11 7
61687 11GEOGEO 11190 A 2 11 7
64549 11TECIND 11200 A 2 10 7
64549 11SCIPHY 11310 A 2 10 7
64549 11SCIEAE 11100 A 2 10 7
64549 11MATGEN 11235 A 2 10 7
64549 11ENGSTD 11130 A 2 10 7
64549 11TECHOS 26501 B 2 2 7
64549 11MUSDRS 63212 C 1 1 7
45461 11ECOECO 11110 A 2 13 7
45461 11ENGADV 11140 A 2 13 7
45461 11HISMOD 11270 A 2 13 7
45461 11HISLST 11220 A 2 13 7
45461 11MATMAT 11240 A 2 13 7
45461 11PRS1U 11350 A 1 13 7
45461 11SCIBIO 11030 A 2 13 7
Note that for the first student I have a count of Category A subject units (11 in total); he is only doing Category A subjects. The second student has 10 units of Category A subjects, one Category B subject worth 2 units and one Category C subject worth 1 unit. The final student just has 13 Category A units.
Now what I would really like is something like this...!
StudentID  Sum A Units  Sum B Units  Sum C Units  Sum A Units + Sum B Units  Count of Subjects
61687      11           0            0            11                         7
64549      10           2            1            12                         7
45461      13           0            0            13                         7
So I would like some aggregated functions with a student grouped onto only 1 row and the sum of his different units as separate fields. I would also like a field which sums the Category A and B Units and also a field which gives a count of the total number of subjects they are doing. I could then use this data to set up some warning messages if a student is not doing the correct number of A or B Units etc
I have played around with common table expressions, subqueries etc but am not really sure what I am doing and am not sure which is the correct way about getting the data in the form I want.
Is anyone able to help?
SELECT
    STUDENTID,
    SUM(CASE BOARDCATEGORY WHEN 'A' THEN CAST(BOARDUNITS AS int) ELSE 0 END) AS SUM_A_UNITS,
    SUM(CASE BOARDCATEGORY WHEN 'B' THEN CAST(BOARDUNITS AS int) ELSE 0 END) AS SUM_B_UNITS,
    SUM(CASE BOARDCATEGORY WHEN 'C' THEN CAST(BOARDUNITS AS int) ELSE 0 END) AS SUM_C_UNITS,
    SUM(CASE WHEN BOARDCATEGORY IN ('A', 'B') THEN CAST(BOARDUNITS AS int) ELSE 0 END) AS SUM_A_B_UNITS,
    COUNT(BOARDCODE) AS COUNT_OF_SUBJECTS
FROM (
    SELECT StudentID
          ,AssessmentCode
          ,BoardCode
          ,BoardCategory
          ,BoardUnits
          ,sum(cast(boardunits as int)) over (partition by studentid,boardcategory) as UnitCount
          ,Count(boardcategory) over (partition by studentid) as SubjectCount
    FROM uvNCStudentSubjectDetails
    where fileyear = 2015
      and filesemester = 1
      and studentyearlevel = 11
      and StudentIBFlag = 0
) AS s
GROUP BY STUDENTID;
Wrapped your SQL statement in the solution, so that you can see what the solution does straight away. The inner ORDER BY is dropped because it has no effect inside a derived table (and SQL Server rejects it there), and the derived table is given an alias.
Use SUM and CASE (i.e. add BoardUnits to the sum only when a condition is met).
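To see what this kind of conditional aggregation produces, here is a small pandas sketch of the same idea. It uses the category and unit values of two students from the question's sample output; the output column names A_plus_B and SubjectCount are made up for the illustration.

```python
import pandas as pd

# (StudentID, BoardCategory, BoardUnits) rows from the question's sample output.
rows = [
    (64549, "A", 2), (64549, "A", 2), (64549, "A", 2), (64549, "A", 2),
    (64549, "A", 2), (64549, "B", 2), (64549, "C", 1),
    (45461, "A", 2), (45461, "A", 2), (45461, "A", 2), (45461, "A", 2),
    (45461, "A", 2), (45461, "A", 1), (45461, "A", 2),
]
df = pd.DataFrame(rows, columns=["StudentID", "BoardCategory", "BoardUnits"])

# One row per student, one column of summed units per category.
units = df.pivot_table(
    index="StudentID",
    columns="BoardCategory",
    values="BoardUnits",
    aggfunc="sum",
    fill_value=0,
).reindex(columns=["A", "B", "C"], fill_value=0)

units["A_plus_B"] = units["A"] + units["B"]
units["SubjectCount"] = df.groupby("StudentID").size()
print(units)
```

Each category column plays the role of one SUM(CASE ...) expression, and the group size corresponds to the COUNT of subjects.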