Google BigQuery Standard SQL: get a weighted, summarized result by group

Original data
structure(list(Year = c(1999, 1999, 1999, 2000, 2000, 2000),
Country = c("a", "b", "b", "a", "a", "b"), number = c(2,
3, 4, 5, 3, 6), result = c(2, 4, 5, 6, 2, 2)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -6L))
What I need is
Year Country weightresult
weightresult = result * (number / sum(number) per Year, Country group)
i.e. the result is weighted by number, and the sum of number is taken within each Year, Country group.
The intermediate result is
structure(list(Year = c(1999, 1999, 1999, 2000, 2000, 2000),
Country = c("a", "b", "b", "a", "a", "b"), number = c(2,
3, 4, 5, 3, 6), result = c(2, 4, 5, 6, 2, 2), weight = c(2,
7, 7, 8, 8, 6), wre = c(2, 1.71428571428571, 2.85714285714286,
3.75, 0.75, 2)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
What I finally need is
structure(list(Country = c("a", "a", "b", "b"), Year = c(1999,
2000, 1999, 2000), wre = c(2, 4.5, 4.57142857142857, 2)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -4L))
How do I get this final result in BigQuery Standard SQL? This is my attempt:
SELECT
Year,
Country,
(number/(SUM(number) OVER (PARTITION BY Year, Country))) * result AS wre,
Count(*),
FROM `table`
Where
Year<=2020
GROUP BY Year,Country
ORDER BY Year,Country
And the error is
SELECT list expression references column number which is neither grouped nor aggregated at ...

Use below
select
year,
country,
sum(number * result) / sum(number) as weighted_result
from your_table
where year <= 2020
group by year,country
order by year,country
with output
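year     country  weighted_result
1999     a        2.0
1999     b        4.571428571428571
2000     a        4.5
2000     b        2.0
(this matches the desired final result shown in the question)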

Related

Adding geom_vline for eventdate after filtering for ID adds vlines for every IDs eventdate

I have a large dataset with repeated measurements in long format for several IDs. It contains measurements of patients. Every measurement is recorded at a timepoint as well as a date, which is stored in a date variable. In addition, I record whether or not the ID experienced an "event"; the time of the event is stored in a date variable. I'm drawing a plot for every single ID using ggplot2, showing the measurements over time, and I want to add a vertical line for when the "event" happened. What I do is first filter the data for the ID I want to plot, then add the vline at the event date. However, when I add the vline, I get a line for every eventdate, even for the IDs that were filtered out.
Here is some sample data (in my real data there are a lot more IDs):
library(tidyverse)
sampledata <- structure(list(ID = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2), Measure1 = c(10, 20, 0, 30, 20, 10, 2, 0, 0, 0), timepoint = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5), time = structure(c(18628, 18748, 18840, 18932, 19024, 19205, 19297, 19024, 19113, 19205), class = "Date"), event = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1), eventdate = structure(c(18779, 18779, 18779, 18779, 18779, 19024, 19024, 19024, 19024, 19024), class = "Date")), row.names = c(NA, 10L), class = "data.frame")
Here is the graph for ID 1:
filter(sampledata, ID %in% 1&Measure1 !="NA") %>% ggplot(aes(x = time, y = Measure1)) +
geom_line(size=0.3,linetype="solid") +
geom_point(size=2, color="#0073C2FF") +
geom_vline(xintercept = as.numeric(as.Date(sampledata$eventdate)), linetype=1) +
theme_gray() + theme(text = element_text(size=12), axis.text=element_text(size=8), legend.position="none", axis.title.y = element_blank()) +
labs(y="ylab", x = "Follow up") +
scale_x_date(date_labels = "%Y-%m-%d", date_breaks = "2 months")
graph picture link
As you can see, I get a vertical line for ID 1's eventdate (2021-06-01), but I also get a line for ID 2's eventdate (2022-02-01).
I guess I'm doing something wrong when filtering. Any idea how I can get the graph with only the vline for the selected ID? (My next step is to loop over the IDs to produce the same graph for each of them, so I do not want to hard-code anything.)
Thank you!
The issue is that you passed the eventdate column from your unfiltered dataset sampledata to xintercept. Hence you get a vline for each eventdate in the unfiltered data.
To fix this use aesthetics, i.e. do aes(xintercept=eventdate). Additionally, even after doing so you are actually plotting multiple vlines as the events and event dates are duplicated. To fix this I use data = ~ distinct(.x, event, eventdate) to filter the data for unique events and event dates.
library(tidyverse)
sampledata <- structure(list(ID = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2), Measure1 = c(10, 20, 0, 30, 20, 10, 2, 0, 0, 0), timepoint = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5), time = structure(c(18628, 18748, 18840, 18932, 19024, 19205, 19297, 19024, 19113, 19205), class = "Date"), event = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1), eventdate = structure(c(18779, 18779, 18779, 18779, 18779, 19024, 19024, 19024, 19024, 19024), class = "Date")), row.names = c(NA, 10L), class = "data.frame")
filter(sampledata, ID %in% 1 & Measure1 != "NA") %>%
ggplot(aes(x = time, y = Measure1)) +
geom_line(size = 0.3, linetype = "solid") +
geom_point(size = 2, color = "#0073C2FF") +
geom_vline(data = ~ distinct(.x, event, eventdate), aes(xintercept = eventdate), linetype = 1) +
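# distinct() is applied to the already filtered plot data, so only this ID's unique event date is drawn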
theme_gray() +
theme(text = element_text(size = 12), axis.text = element_text(size = 8), legend.position = "none", axis.title.y = element_blank()) +
labs(y = "ylab", x = "Follow up") +
scale_x_date(date_labels = "%Y-%m-%d", date_breaks = "2 months")

How do I select rows from a pandas df without returning False values?

I have a df and I need to select rows based on some conditions in multiple columns.
Here is what I have
import pandas as pd
dat = [('p','q', 5), ('k','j', 2), ('p','-', 5), ('-','p', 4), ('q','pkjq', 3), ('pkjq','q', 2)]
df = pd.DataFrame(dat, columns = ['a', 'b', 'c'])
df_dat = df[(df[['a','b']].isin(['k','p','q','j']) & df['c'] > 3)] | df[(~df[['a','b']].isin(['k','p','q','j']) & df['c'] > 2 )]
Expected result = [('p','q', 5), ('p','-', 5), ('-','p', 4), ('q','pkjq', 3)]
The result I am getting is an all-False dataframe.
When you have a complicated condition, I recommend building the conditions outside the slice:
cond1 = df[['a','b']].isin(['k','p','q','j']).any(axis=1) & df['c'].gt(3)
cond2 = (~df[['a','b']].isin(['k','p','q','j'])).any(axis=1) & df['c'].gt(2)
out = df.loc[cond1 | cond2]
Out[305]:
a b c
0 p q 5
2 p - 5
3 - p 4
4 q pkjq 3
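For reference, a likely reason the original attempt returns all False is operator precedence and shape: & binds more tightly than >, so df['c'] > 3 needs its own parentheses, and df[['a','b']].isin(...) returns a two-column boolean DataFrame that has to be reduced row-wise with .any(axis=1) before it can be combined with a Series. A minimal, self-contained sketch of the approach above (the keys variable and names are mine):
import pandas as pd

dat = [('p','q', 5), ('k','j', 2), ('p','-', 5), ('-','p', 4), ('q','pkjq', 3), ('pkjq','q', 2)]
df = pd.DataFrame(dat, columns=['a', 'b', 'c'])

keys = ['k', 'p', 'q', 'j']
any_in = df[['a', 'b']].isin(keys).any(axis=1)          # at least one column is a key
any_not_in = (~df[['a', 'b']].isin(keys)).any(axis=1)   # at least one column is not a key

out = df.loc[(any_in & df['c'].gt(3)) | (any_not_in & df['c'].gt(2))]
print(out)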

Performing a mod function on time data column pandas python

Hello, I want to apply a mod function (hour % 24) to the hour of the time column.
I believe the time column is in string format.
I was wondering how I should go about performing the operation. Here is the data:
sales_id,date,time,shopping_cart,price,parcel_size,Customer_lat,Customer_long,isLoyaltyProgram,nearest_storehouse_id,nearest_storehouse,dist_to_nearest_storehouse,delivery_cost
ORD0056604,24/03/2021,45:13:45,"[('bed', 3), ('Chair', 1), ('wardrobe', 4), ('side_table', 2), ('Dining_table', 2), ('mattress', 1)]",3152.77,medium,-38.246,145.61984,1,4,Sunshine,78.43,5.8725000000000005
ORD0096594,13/12/2018,54:22:20,"[('Study_table', 4), ('wardrobe', 4), ('side_table', 1), ('Dining_table', 2), ('sofa', 4), ('Chair', 3), ('mattress', 1)]",3781.38,large,-38.15718,145.05072,1,4,Sunshine,40.09,5.8725000000000005
ORD0046310,16/02/2018,17:23:36,"[('mattress', 2), ('wardrobe', 1), ('side_table', 2), ('sofa', 1), ('Chair', 3), ('Study_table', 4)]",2219.09,medium,144.69623,-38.00731,0,2,Footscray,34.2,16.9875
ORD0031675,25/06/2018,17:38:48,"[('bed', 4), ('side_table', 1), ('Chair', 1), ('mattress', 3), ('Dining_table', 2), ('sofa', 2), ('wardrobe', 2)]",4542.1,large,144.65506,-38.40669,1,2,Footscray,72.72,18.274500000000003
ORD0019799,05/01/2021,18:37:16,"[('wardrobe', 1), ('Study_table', 3), ('sofa', 4), ('side_table', 2), ('Chair', 4), ('Dining_table', 4), ('bed', 1)]",3132.71,L,-37.66022,144.94286,1,0,Clayton,17.77,14.931
ORD0041462,25/12/2018,07:29:33,"[('Chair', 3), ('bed', 1), ('mattress', 3), ('side_table', 3), ('wardrobe', 3), ('sofa', 4)]",4416.42,medium,-38.39154,145.87448,0,6,Sunshine,105.91,6.151500000000001
ORD0047848,30/07/2021,34:18:01,"[('Chair', 3), ('bed', 3), ('wardrobe', 4)]",2541.04,small,-37.4654,144.45832,1,2,Footscray,60.85,18.4635
Convert the values to timedeltas with to_timedelta and then remove the days component by indexing, i.e. selecting the last 8 characters:
print (df)
sales_id date time
0 ORD0056604 24/03/2021 45:13:45
1 ORD0096594 13/12/2018 54:22:20
print (pd.to_timedelta(df['time']))
0 1 days 21:13:45
1 2 days 06:22:20
Name: time, dtype: timedelta64[ns]
df['time'] = pd.to_timedelta(df['time']).astype(str).str[-8:]
print (df)
sales_id date time
0 ORD0056604 24/03/2021 21:13:45
1 ORD0096594 13/12/2018 06:22:20
If you also need to add the overflow days to the date column, the solution is to add the timedeltas to the dates and then extract the values with Series.dt.strftime:
dates = pd.to_datetime(df['date'], dayfirst=True) + pd.to_timedelta(df['time'])
df['time'] = dates.dt.strftime('%H:%M:%S')
df['date'] = dates.dt.strftime('%d/%m/%Y')
print (df)
sales_id date time
0 ORD0056604 25/03/2021 21:13:45
1 ORD0096594 15/12/2018 06:22:20
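For completeness, a short end-to-end sketch combining both options, under the assumption that the sample rows above are saved in a file named sales.csv (the file name is mine):
import pandas as pd

df = pd.read_csv('sales.csv')  # assumed file name for the sample data shown above

# Option 1: keep only the wrapped-around clock time (equivalent to hour % 24)
df['time_only'] = pd.to_timedelta(df['time']).astype(str).str[-8:]

# Option 2: also carry the overflow days into the date column
stamps = pd.to_datetime(df['date'], dayfirst=True) + pd.to_timedelta(df['time'])
df['time'] = stamps.dt.strftime('%H:%M:%S')
df['date'] = stamps.dt.strftime('%d/%m/%Y')

print(df[['sales_id', 'date', 'time', 'time_only']].head())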

VB.NET - How to calculate the parity bit of a byte array

What is the most efficient way to calculate the parity bit (whether the number of active bits is odd or even) of a byte array? I have thought about iterating through all the bits and summing up the active ones, but that would be very impractical purely based on the number of iterations required for larger byte arrays/files.
For your convenience (and my curiosity), I have done some timing tests with a parity lookup table compared to the other two methods suggested so far:
Module Module1
Dim rand As New Random
Dim parityLookup(255) As Integer
Sub SetUpParityLookup()
' setBitsCount data from http://stackoverflow.com/questions/109023/how-to-count-the-number-of-set-bits-in-a-32-bit-integer
Dim setBitsCount = {
0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
}
For i = 0 To 255
parityLookup(i) = setBitsCount(i) And 1
Next
End Sub
' Method using lookup table
Function ParityOfArray(a() As Byte) As Integer
Dim parity As Integer = 0 ' use an Integer because they are faster
For i = 0 To a.Length - 1
parity = parity Xor parityLookup(a(i))
Next
Return parity
End Function
' Method by Alireza
Function ComputeParity(bytes() As Byte) As Byte
Dim parity As Boolean = False
For i As Integer = 0 To bytes.Length - 1
Dim b As Byte = bytes(i)
While b <> 0
parity = Not parity
b = CByte(b And (b - 1))
End While
Next
Return Convert.ToByte(parity)
End Function
' Method by dbasnett
Function CountBits(byteArray As Byte()) As Integer
Dim rv As Integer = 0
For Each b As Byte In byteArray
Dim count As Integer = b
count = ((count >> 1) And &H55) + (count And &H55)
count = ((count >> 2) And &H33) + (count And &H33)
count = ((count >> 4) And &HF) + (count And &HF)
rv += count
Next
Return rv
End Function
Sub FillWithRandomBytes(ByRef a() As Byte)
rand.NextBytes(a)
End Sub
Sub Main()
SetUpParityLookup()
Dim nBytes = 10000
Dim a(nBytes - 1) As Byte
FillWithRandomBytes(a)
Dim p As Integer
Dim sw As New Stopwatch
sw.Start()
p = ParityOfArray(a)
sw.Stop()
Console.WriteLine("ParityOfArray - Parity: {0} Time: {1}", p, sw.ElapsedTicks)
sw.Restart()
p = ComputeParity(a)
sw.Stop()
Console.WriteLine("ComputeParity - Parity: {0} Time: {1}", p, sw.ElapsedTicks)
sw.Restart()
p = CountBits(a)
sw.Stop()
' Note that the value returned from CountBits should be And-ed with 1.
Console.WriteLine("CountBits - Parity: {0} Time: {1}", p And 1, sw.ElapsedTicks)
Console.ReadLine()
End Sub
End Module
Typical output:
ParityOfArray - Parity: 0 Time: 386
ComputeParity - Parity: 0 Time: 1014
CountBits - Parity: 0 Time: 695
An efficient way to do this is to apply the x And (x - 1) operation in a loop until x becomes zero. That way you loop only as many times as there are bits set to 1.
In VB.NET for a byte array:
Function ComputeParity(bytes() As Byte) As Byte
Dim parity As Boolean = False
For i As Integer = 0 To bytes.Length - 1
Dim b As Byte = bytes(i)
While b <> 0
parity = Not parity
b = CByte(b And (b - 1))
End While
Next
Return Convert.ToByte(parity)
End Function
Here is a function that counts bits.
Private Function CountBits(byteArray As Byte()) As Integer
Dim rv As Integer = 0
For x As Integer = 0 To byteArray.Length - 1
Dim b As Byte = byteArray(x)
Dim count As Integer = b
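' The three masking steps below add neighbouring bit groups in parallel:
' first 1-bit pairs, then 2-bit groups, then 4-bit nibbles, leaving the total number of set bits in the byte.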
count = ((count >> 1) And &H55) + (count And &H55)
count = ((count >> 2) And &H33) + (count And &H33)
count = ((count >> 4) And &HF) + (count And &HF)
rv += count
Next
Return rv
End Function
Note: this code came from a collection of bit twiddling hacks I found some years ago. I converted it to VB.

Data handling for matplotlib histogram with error bars

I've got a data set which is a list of tuples in python like this:
dataSet = [(6.1248199999999997, 27), (6.4400500000000003, 4), (5.9150600000000004, 1), (5.5388400000000004, 38), (5.82559, 1), (7.6892199999999997, 2), (6.9047799999999997, 1), (6.3516300000000001, 76), (6.5168699999999999, 1), (7.4382099999999998, 1), (5.4493299999999998, 1), (5.6254099999999996, 1), (6.3227700000000002, 1), (5.3321899999999998, 11), (6.7402300000000004, 4), (7.6701499999999996, 1), (5.4589400000000001, 3), (6.3089700000000004, 1), (6.5926099999999996, 2), (6.0003000000000002, 5), (5.9845800000000002, 1), (6.4967499999999996, 2), (6.51227, 6), (7.0302600000000002, 1), (5.7271200000000002, 49), (7.5311300000000001, 7), (5.9495800000000001, 2), (5.1487299999999996, 18), (5.7637099999999997, 6), (5.5144500000000001, 44), (6.7988499999999998, 1), (5.2578399999999998, 1)]
The first element of each tuple is an energy and the second is a counter of how many sensors were affected.
I want to create a histogram to study the relation between the number of affected sensors and the energy. I'm pretty new to matplotlib (and python), but this is what I've done so far:
import math
import matplotlib.pyplot as plt
dataSet = [(6.1248199999999997, 27), (6.4400500000000003, 4), (5.9150600000000004, 1), (5.5388400000000004, 38), (5.82559, 1), (7.6892199999999997, 2), (6.9047799999999997, 1), (6.3516300000000001, 76), (6.5168699999999999, 1), (7.4382099999999998, 1), (5.4493299999999998, 1), (5.6254099999999996, 1), (6.3227700000000002, 1), (5.3321899999999998, 11), (6.7402300000000004, 4), (7.6701499999999996, 1), (5.4589400000000001, 3), (6.3089700000000004, 1), (6.5926099999999996, 2), (6.0003000000000002, 5), (5.9845800000000002, 1), (6.4967499999999996, 2), (6.51227, 6), (7.0302600000000002, 1), (5.7271200000000002, 49), (7.5311300000000001, 7), (5.9495800000000001, 2), (5.1487299999999996, 18), (5.7637099999999997, 6), (5.5144500000000001, 44), (6.7988499999999998, 1), (5.2578399999999998, 1)]
binWidth = .2
binnedDataSet = []
#create another list and append the "binning-value"
for item in dataSet:
    binnedDataSet.append((item[0], item[1], math.floor(item[0]/binWidth)*binWidth))
energies, sensorHits, binnedEnergy = [[q[i] for q in binnedDataSet] for i in (0,1,2)]
plt.plot(binnedEnergy, sensorHits, 'ro')
plt.show()
This works so far (although it doesn't even look like a histogram ;-) but OK), but now I want to calculate the mean value for each bin and append some error bars.
What's the way to do it? I looked at histogram examples for matplotlib, but they all use one-dimensional data which will be counted, so you get a frequency spectrum… That's not really what I want.
I am somewhat confused by exactly what you are trying to do, but I think this (to first order) will do what I think you want:
bin_width = .2
bottom = 5.0
top = 8.0
binned_data = [0.0] * int(math.ceil(((top - bottom) / bin_width)))
binned_count = [0] * int(math.ceil(((top - bottom) / bin_width)))
n_bins = len(binned_data)
for E, cnt in dataSet:
    if E < bottom or E > top:
        print('out of range')
        continue
    bin_id = int(math.floor(n_bins * (E - bottom) / (top - bottom)))
    binned_data[bin_id] += cnt
    binned_count[bin_id] += 1
binned_averaged_data = [C_sum / hits if hits > 0 else 0 for C_sum, hits in zip(binned_data, binned_count)]
bin_edges = [bottom + j * bin_width for j in range(len(binned_data))]
plt.bar(bin_edges, binned_averaged_data, width=bin_width)
I would also suggest looking into numpy, it would make this much simpler to write.
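Following up on the numpy suggestion, here is one way the binned means and error bars could be computed, reusing dataSet from the question; this is a sketch under the assumption that the sample standard deviation of the counts in each bin is an acceptable error estimate (the bin edges and variable names are mine):
import numpy as np
import matplotlib.pyplot as plt

energies = np.array([e for e, _ in dataSet])
hits = np.array([h for _, h in dataSet])

bin_width = 0.2
edges = np.arange(5.0, 8.0 + bin_width, bin_width)
bin_ids = np.digitize(energies, edges) - 1  # index of the bin each energy falls into

centers, means, errors = [], [], []
for i in range(len(edges) - 1):
    in_bin = hits[bin_ids == i]
    if in_bin.size:
        centers.append((edges[i] + edges[i + 1]) / 2)
        means.append(in_bin.mean())
        # sample standard deviation as a rough error bar (0 for single-entry bins)
        errors.append(in_bin.std(ddof=1) if in_bin.size > 1 else 0.0)

plt.bar(centers, means, width=bin_width, yerr=errors, align='center')
plt.xlabel('Energy')
plt.ylabel('Mean sensor hits per bin')
plt.show()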