I have created a time tree in Memgraph. When I run the queries individually, I get the correct results. But if I run them together as part of the same query, I get an empty set.
Why is that?
memgraph> MATCH (y:YEAR {year:2015})-[:HAS_MONTH]->(m:MONTH {month:1})-[:HAS_DAY]->(d:DAY {day:22})-[:HAS_HOUR]->(h:HOUR {hour:6})-[:HAS_MINUTE]->(min2: MINUTE {minute: 50})
-> RETURN min2;
+----------------------------------------------------+
| min2 |
+----------------------------------------------------+
| (:MINUTE {day: 22, hour: 6, minute: 50, month: 1}) |
+----------------------------------------------------+
1 row in set (0.000 sec)
memgraph> MATCH (y:YEAR {year:2015})-[:HAS_MONTH]->(m:MONTH {month:1})-[:HAS_DAY]->(d:DAY {day:22})-[:HAS_HOUR]->(h:HOUR {hour:7})-[:HAS_MINUTE]->(min2: MINUTE {minute: 0})
-> RETURN min2;
+---------------------------------------------------+
| min2 |
+---------------------------------------------------+
| (:MINUTE {day: 22, hour: 7, minute: 0, month: 1}) |
+---------------------------------------------------+
1 row in set (0.001 sec)
memgraph> MATCH (y:YEAR {year:2015})-[:HAS_MONTH]->(m:MONTH {month:1})-[:HAS_DAY]->(d:DAY {day:22})-[:HAS_HOUR]->(h:HOUR {hour:7})-[:HAS_MINUTE]->(min2: MINUTE {minute: 0})
-> MATCH (y:YEAR {year:2015})-[:HAS_MONTH]->(m:MONTH {month:1})-[:HAS_DAY]->(d:DAY {day:22})-[:HAS_HOUR]->(h:HOUR {hour:6})-[:HAS_MINUTE]->(min1: MINUTE {minute: 50})
-> RETURN min1, min2;
Empty set (0.000 sec)
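The combined query returns an empty set because both MATCH clauses use the same variables y, m, d and h. Once h is bound to the HOUR node with hour 7 in the first MATCH, the second MATCH requires that same node to have hour 6, which can never be true. If you want minutes from two different hours, give the second hour its own variable (for example h2).
If you want to get the same year, month, day and hour, and two different minutes, the query would be: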
memgraph> MATCH (y:YEAR {year:2015})-[:HAS_MONTH]->(m:MONTH {month:1})-[:HAS_DAY]->(d:DAY {day:22})-[:HAS_HOUR]->(h:HOUR {hour:7})-[:HAS_MINUTE]->(min:MINUTE)
-> WHERE min.minute = 0 OR min.minute = 50
-> RETURN min;
I would like to aggregate by ID and, for each ID, select the row with max(Day).
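| ID  | col1 | col2 | month | Day |
|-----|------|------|-------|-----|
| AI1 | 5    | 2    | janv  | 15  |
| AI2 | 6    | 0    | Dec   | 16  |
| AI1 | 1    | 7    | March | 16  |
| AI3 | 9    | 4    | Nov   | 18  |
| AI2 | 3    | 20   | Fev   | 20  |
| AI3 | 10   | 8    | June  | 06  |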
Desired result:
| ID  | col1 | col2 | month | Day |
|-----|------|------|-------|-----|
| AI1 | 1    | 7    | March | 16  |
| AI2 | 3    | 20   | Fev   | 20  |
| AI3 | 9    | 4    | Nov   | 18  |
The only solution that comes to my mind is to:
Get the highest day for each ID (using groupBy)
Append the value of the highest day to each line (with matching ID) using join
Then apply a simple filter that keeps the lines where the two values match
# select the max value for each of the ID
maxDayForIDs = df.groupBy("ID").max("day").withColumnRenamed("max(day)", "maxDay")
# now add the max value of the day for each line (with matching ID)
df = df.join(maxDayForIDs, "ID")
# keep only the lines where it matches "day" equals "maxDay"
df = df.filter(df.day == df.maxDay)
Usually this kind of operation is done using window functions like rank, dense_rank or row_number.
from pyspark.sql import functions as F, Window as W
df = spark.createDataFrame(
[('AI1', 5, 2, 'janv', '15'),
('AI2', 6, 0, 'Dec', '16'),
('AI1', 1, 7, 'March', '16'),
('AI3', 9, 4, 'Nov', '18'),
('AI2', 3, 20, 'Fev', '20'),
('AI3', 10, 8, 'June', '06')],
['ID', 'col1', 'col2', 'month', 'Day']
)
w = W.partitionBy('ID').orderBy(F.desc('Day'))
df = df.withColumn('_rn', F.row_number().over(w))
df = df.filter('_rn=1').drop('_rn')
df.show()
# +---+----+----+-----+---+
# | ID|col1|col2|month|Day|
# +---+----+----+-----+---+
# |AI1|   1|   7|March| 16|
# |AI2|   3|  20|  Fev| 20|
# |AI3|   9|   4|  Nov| 18|
# +---+----+----+-----+---+
Make it simple
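from pyspark.sql import Window
from pyspark.sql.functions import col, first
w = Window.partitionBy('ID').orderBy(col('Day').desc())
new = (df.withColumn('max', first('Day').over(w))  # order by Day descending; the first value in each group is the max
       .where(col('Day') == col('max'))  # keep rows where Day equals the max
       .drop('max'))  # drop the helper column
new.show()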
I have a postgres table with columns:
id: text
availabilities: integer[]
A certain ID can have multiple availabilities (different days, not necessarily continuous, in a range of up to a few years). Each availability is a Unix timestamp (in seconds) for a certain day.
Hours, minutes, seconds, ms are set to 0, i.e. a timestamp represents the start of a day.
Question:
How can I very quickly find all IDs which contain at least one availability in between a certain from-to range (also given as timestamps)?
I can also store them differently in the array, e.g. as "days since epoch", if needed (to get steps of 1 (day) instead of 86400 (seconds)).
However, if possible (and if the speed is roughly the same), I want to use an array and one row per entry.
Example:
Data (0 = day-1, 86400 = day-2, ...)
| id | availabilities             |
|----|----------------------------|
| 1  | [0, 86400, 172800, 259200] |
| 2  | [86400, 259200]            |
| 3  | [345600]                   |
| 4  | [172800]                   |
| 5  | [0]                        |
Now I want to get the list of IDs that contain at least one availability which:
is between 86400 AND 259200 --> ID 1, 2, 4
is between 172800 AND 172800 --> ID 1, 4
is between 259200 AND (max-int) --> ID 1, 2, 3
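In PostgreSQL, the unnest function converts array elements to rows, and it generally performs well. Once the elements are rows, you can filter them with an ordinary range condition (for example BETWEEN) and select the distinct ids. Sample query showing unnest: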
with mytable as (
select 1 as id, '{12,2500,6000,200}'::int[] as pint
union all
select 2 as id, '{0,200,3500,150}'::int[]
union all
select 4 as id, '{20,10,8500,1100,9000,25000}'::int[]
)
select id, unnest(pint) as pt from mytable;
-- Return
1 12
1 2500
1 6000
1 200
2 0
2 200
2 3500
2 150
4 20
4 10
4 8500
4 1100
4 9000
4 25000
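To make the intended result concrete, here is a plain-Python model of "unnest, then keep the ids that have a value inside the range", run on the example data from the question. It is only a sketch of the logic and not PostgreSQL code; the names availabilities and ids_in_range are made up for illustration.

```python
# Example data from the question: id -> list of day-start timestamps (seconds).
availabilities = {
    1: [0, 86400, 172800, 259200],
    2: [86400, 259200],
    3: [345600],
    4: [172800],
    5: [0],
}

MAX_INT = 2**31 - 1  # upper bound of a 4-byte PostgreSQL integer


def ids_in_range(data, start, end):
    """Return the sorted ids with at least one value in [start, end]."""
    # "Unnest": flatten each array into (id, value) rows.
    rows = [(row_id, value) for row_id, values in data.items() for value in values]
    # Filter rows on the inclusive range, then de-duplicate the ids.
    return sorted({row_id for row_id, value in rows if start <= value <= end})


print(ids_in_range(availabilities, 86400, 259200))    # [1, 2, 4]
print(ids_in_range(availabilities, 172800, 172800))   # [1, 4]
print(ids_in_range(availabilities, 259200, MAX_INT))  # [1, 2, 3]
```

The three calls correspond to the three ranges listed in the question and return the ids the question expects.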
I have a DataFrame like this:
Date amount
0 2021-06-18 14
1 2021-06-19 -8
2 2021-06-20 -8
3 2021-06-21 17
4 2021-07-02 -8
5 2021-07-05 77
6 2021-07-06 -10
7 2021-08-02 -78
8 2021-08-06 77
9 2021-07-08 10
I want the count of sign changes in amount, month by month, with one count per month, like this:
count = [{"June-2021": 2},{"July-2021" : 3},{"Aug-2021" : 1}]
Note: a sign change between the last date of one month and the first date of the next month should not be counted; each month is counted separately.
I want a function for this.
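You can use (x.mul(x.shift()) < 0).sum() to get the count of sign changes within each month-year group (the product of the current entry and the previous entry is negative exactly when the sign changes). This assumes df['Date'] has a datetime dtype; convert it with pd.to_datetime first if it does not. The code is as follows: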
count = (df.groupby(df['Date'].dt.strftime('%b-%Y'), sort=False)['amount']
.agg(lambda x: (x.mul(x.shift()) < 0).sum())
.to_dict()
)
Result:
print(count)
{'Jun-2021': 2, 'Jul-2021': 3, 'Aug-2021': 1}
Edit
If you want list of dict, you can use:
count = (df.groupby(df['Date'].dt.strftime('%b-%Y'), sort=False)['amount']
.agg(lambda x: (x.mul(x.shift()) < 0).sum())
.reset_index()
.apply(lambda x: {x['Date']: x['amount']}, axis=1)
.to_list()
)
Result:
print(count)
[{'Jun-2021': 2}, {'Jul-2021': 3}, {'Aug-2021': 1}]
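As a cross-check of the idea, independent of pandas, the sketch below counts sign changes per month in plain Python on the sample data from the question. The helper name sign_changes_by_month is made up for illustration, and the month labels assume the default C/English locale for %b.

```python
from datetime import date

# Sample rows from the question, in their original order.
rows = [
    (date(2021, 6, 18), 14), (date(2021, 6, 19), -8),
    (date(2021, 6, 20), -8), (date(2021, 6, 21), 17),
    (date(2021, 7, 2), -8), (date(2021, 7, 5), 77),
    (date(2021, 7, 6), -10), (date(2021, 8, 2), -78),
    (date(2021, 8, 6), 77), (date(2021, 7, 8), 10),
]


def sign_changes_by_month(rows):
    """Count sign changes between consecutive amounts within each month."""
    groups = {}  # month label -> amounts, in order of first appearance
    for day, amount in rows:
        groups.setdefault(day.strftime('%b-%Y'), []).append(amount)
    # A negative product of two neighbours means the sign flipped.
    return {
        label: sum(1 for prev, cur in zip(amounts, amounts[1:]) if prev * cur < 0)
        for label, amounts in groups.items()
    }


result = sign_changes_by_month(rows)
print(result)
```

Because the row dated 2021-07-08 is grouped with the other July rows, July ends up with three sign changes, matching the pandas result above.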
For CodeChef problem C_HOLIC2, I tried iterating over elements: 5, 10, 15, 20, 25,... and for each number checking the number of trailing zeros using the efficient technique as specified over here, but got TLE.
What is the fastest way to solve this using a formula?
Here is the Problem Link
As we know, the trick used for counting the number of trailing zeros in the factorial of a number (taking 500! as the example) is:
The number of multiples of 5 that are less than or equal to 500 is 500÷5=100
Then, the number of multiples of 25 is 500÷25=20
Then, the number of multiples of 125 is 500÷125=4
The next power of 5 is 625, which is greater than 500.
Therefore, the number of trailing zeros of 500! is 100+20+4=124
For detailed explanation check this page
Thus, this count can be represented as:
count(N) = floor(N/5) + floor(N/5^2) + floor(N/5^3) + ...
Using this trick, given a number N you can determine the number of trailing zeros in its factorial. Codechef Problem Link
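A minimal sketch of this counting trick in Python (the function name trailing_zeros is chosen here for illustration):

```python
def trailing_zeros(n):
    """Count trailing zeros of n! as n // 5 + n // 25 + n // 125 + ..."""
    count = 0
    power = 5
    while power <= n:
        count += n // power  # multiples of the current power of 5 up to n
        power *= 5
    return count


print(trailing_zeros(500))                               # 124
print([trailing_zeros(n) for n in (5, 10, 15, 20, 25)])  # [1, 2, 3, 4, 6]
```

The second line of output already shows the jump from 4 to 6 at N = 25 that the rest of this answer is about.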
Now, suppose we are given the number of trailing zeros, count, and we are asked for the smallest number N whose factorial has count trailing zeros. Codechef Problem Link
Here the question is how can we split count into this representation?
This is difficult because, as the following table shows, the count jumps even though N increases by the same amount each time.
The jumps occur at values of N that are multiples of higher powers of 5, e.g. 25, 50, ..., 125, ...
+-------+-----+
| count | N |
+-------+-----+
| 1 | 5 |
+-------+-----+
| 2 | 10 |
+-------+-----+
| 3 | 15 |
+-------+-----+
| 4 | 20 |
+-------+-----+
| 6 | 25 |
+-------+-----+
| 7 | 30 |
+-------+-----+
| 8 | 35 |
+-------+-----+
| 9 | 40 |
+-------+-----+
| 10 | 45 |
+-------+-----+
| 12 | 50 |
+-------+-----+
| ... | ... |
+-------+-----+
| 28 | 120 |
+-------+-----+
| 31 | 125 |
+-------+-----+
| 32 | 130 |
+-------+-----+
| ... | ... |
+-------+-----+
Any brute-force program for this task shows that these jumps occur at regular intervals, i.e. at count = 6, 12, 18, 24 for the values of N that are multiples of 25 (interval = 6 = 1×5+1).
From count = 31 (N = 125) onwards, the factorials also contain a factor of 125. The jumps corresponding to 25 still occur at the same interval, i.e. at 31, 37, 43, ...
The next jump corresponding to 125 is at 31+31, which is 62. Thus the jumps corresponding to 125 occur at 31, 62, 93, 124 (interval = 31 = 6×5+1).
The jump corresponding to 625 occurs at 31×5+1 = 155+1 = 156.
Thus you can see there exists a pattern. We need to find the formula for this pattern to proceed.
The series formed is 1, 6, 31, 156, ...
which is 1, 1+5, 1+5+5^2, 1+5+5^2+5^3, ...
Thus, the nth term t_n is the sum of n terms of a G.P. with a = 1, r = 5, i.e. t_n = (5^n - 1)/4
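The closed form for the nth term can be checked with a short sketch. The helper names t and zeros_of_power are made up for illustration; zeros_of_power(n) applies the trailing-zero formula to N = 5^n, which is where each new kind of jump first appears.

```python
def t(n):
    """Closed form for 1 + 5 + 5**2 + ... + 5**(n - 1)."""
    return (5 ** n - 1) // 4


def zeros_of_power(n):
    """Trailing zeros of (5**n)!: 5**n // 5 + 5**n // 25 + ... + 5**n // 5**n."""
    return sum(5 ** n // 5 ** k for k in range(1, n + 1))


series = [t(n) for n in range(1, 5)]
print(series)  # [1, 6, 31, 156]

for n in range(1, 5):
    assert t(n) == sum(5 ** k for k in range(n))  # matches the G.P. sum
    assert t(n) == zeros_of_power(n)              # matches the count at N = 5**n
```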
Thus, the count can be something like 31+31+6+1+1, etc.
We need to find the t_n which is less than or equal to count but closest to it. Since t_n ≤ count is the same as 5^n ≤ 4×count+1, this means n = floor(log5(4×count+1)).
Say the number is count=35, then using this we identify that t_n=31 is closest. For count=63 we again see that using this formula, we get t_n=31 to be the closest, but note that here 31 can be subtracted twice from count=63. Now we go on finding this n and keep on subtracting t_n from count till count becomes 0.
The algorithm used is:
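count = int(input())  # required number of trailing zeros
N = 0
while count != 0:
    # largest n with (5**n - 1) // 4 <= count, i.e. floor(log(4*count + 1, 5)),
    # computed with integers to avoid floating-point rounding
    n = 0
    while 5 ** (n + 1) <= 4 * count + 1:
        n += 1
    baseSum = (5 ** n - 1) // 4
    baseOffset = (5 ** n) * (count // baseSum)  # integer division
    count = count % baseSum
    N += baseOffset
print(N)
Here, 5**n is 5^n (5 raised to the power n)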
Let's try working this out for an example:
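Say count = 70,
Iteration 1: n = floor(log5(281)) = 3, baseSum = 31, baseOffset = 125 × (70 ÷ 31) = 125 × 2 = 250, count = 70 mod 31 = 8, N = 250
Iteration 2: n = floor(log5(33)) = 2, baseSum = 6, baseOffset = 25 × (8 ÷ 6) = 25 × 1 = 25, count = 8 mod 6 = 2, N = 275
Iteration 3: n = floor(log5(9)) = 1, baseSum = 1, baseOffset = 5 × (2 ÷ 1) = 5 × 2 = 10, count = 2 mod 1 = 0, N = 285
Take another example. Say count=124 which is the one discussed at the beginning of this page:
Iteration 1: n = floor(log5(497)) = 3, baseSum = 31, baseOffset = 125 × (124 ÷ 31) = 125 × 4 = 500, count = 124 mod 31 = 0, N = 500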
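The iterations for both examples can be reproduced with a small tracing version of the loop. This is a sketch under the assumption that count is a value some factorial actually reaches (for example, no factorial has exactly 5 trailing zeros); the function name smallest_n and the step tuple layout are made up for illustration.

```python
def smallest_n(count):
    """Greedy decomposition of count into terms t_n = (5**n - 1) // 4.

    Returns (N, steps); each step is (n, base_sum, base_offset, remaining).
    """
    total, steps = 0, []
    while count != 0:
        # Largest n with t_n <= count, i.e. 5**n <= 4 * count + 1.
        n = 0
        while 5 ** (n + 1) <= 4 * count + 1:
            n += 1
        base_sum = (5 ** n - 1) // 4
        base_offset = 5 ** n * (count // base_sum)
        count %= base_sum
        total += base_offset
        steps.append((n, base_sum, base_offset, count))
    return total, steps


for example in (70, 124):
    answer, steps = smallest_n(example)
    print(example, '->', answer)
    for step in steps:
        print('   n=%d baseSum=%d baseOffset=%d remaining=%d' % step)
```

For count = 70 this takes three iterations and yields N = 285; for count = 124 it takes one iteration and yields N = 500.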
PS: All the images are completely owned by me. I had to use images because StackOverflow doesn't allow MathJax to be embedded. #StackOverflowShouldAllowMathJax
I have a table which contains power values (kW) for devices. Values are read from each device once a minute and inserted into the table with a timestamp. What I need to do is calculate the power consumption (kWh) for a given time span and return the 10 most power-consuming devices. Right now I query the results for the given time span and do the calculation in the backend, looping over all records. This works fine with a small number of devices and a short time span, but in the real use case I could have thousands of devices and a long time span.
So my question is: how could I do all of this in PostgreSQL 9.4.4, so that my query returns only the 10 most power-consuming (device_id, power_consumption) pairs?
Example table:
CREATE TABLE measurements (
id serial primary key,
device_id integer,
power real,
created_at timestamp
);
Simple data example:
| id | device_id | power | created_at |
|----|-----------|-------|--------------------------|
| 1 | 1 | 10 | August, 26 2015 08:23:25 |
| 2 | 1 | 13 | August, 26 2015 08:24:25 |
| 3 | 1 | 12 | August, 26 2015 08:25:25 |
| 4 | 2 | 103 | August, 26 2015 08:23:25 |
| 5 | 2 | 134 | August, 26 2015 08:24:25 |
| 6 | 2 | 2 | August, 26 2015 08:25:25 |
| 7 | 3 | 10 | August, 26 2015 08:23:25 |
| 8 | 3 | 13 | August, 26 2015 08:24:25 |
| 9 | 3 | 20 | August, 26 2015 08:25:25 |
Wanted results for query:
| id | device_id | power_consumption |
|----|-----------|-------------------|
| 1 | 1 | 24.0 |
| 2 | 2 | 186.5 |
| 3 | 3 | 28.0 |
Simplified example (created_at in hours) how I calculate kWh value:
data = [
    [
        { 'id': 1, 'device_id': 1, 'power': 10.0, 'created_at': 0 },
        { 'id': 2, 'device_id': 1, 'power': 13.0, 'created_at': 1 },
        { 'id': 3, 'device_id': 1, 'power': 12.0, 'created_at': 2 }
    ],
    [
        { 'id': 4, 'device_id': 2, 'power': 103.0, 'created_at': 0 },
        { 'id': 5, 'device_id': 2, 'power': 134.0, 'created_at': 1 },
        { 'id': 6, 'device_id': 2, 'power': 2.0, 'created_at': 2 }
    ],
    [
        { 'id': 7, 'device_id': 3, 'power': 10.0, 'created_at': 0 },
        { 'id': 8, 'device_id': 3, 'power': 13.0, 'created_at': 1 },
        { 'id': 9, 'device_id': 3, 'power': 20.0, 'created_at': 2 }
    ]
]
# device_id: power_consumption
results = { 1: 0, 2: 0, 3: 0 }
for d in data:
    # Pair each record with the next one for the same device
    for i in range(len(d) - 1):
        # Area between two records gives us kWh
        # X-axis is time(h)
        # Y-axis is power(kW)
        x1 = d[i]['created_at']
        x2 = d[i+1]['created_at']
        y1 = d[i]['power']
        y2 = d[i+1]['power']
        # Trapezoid area: width times the average of the two heights
        results[d[i]['device_id']] += ((x2-x1)*(y2+y1))/2
print(results)
EDIT: Check this to see how I ended up solving this.
Some of the elements that you'll need in order to do this are:
Sum() aggregations, to calculate the total of a number of records
Lag()/Lead() functions, to calculate for a given record what the "previous" or "next" record's values were.
For a given row you already have the current created_at and power values; in SQL you would probably use the Lead() window function to get the created_at and power values of the row for the same device id that has the next-highest created_at.
Docs for Lead() are here: http://www.postgresql.org/docs/9.4/static/functions-window.html
When for each row you have calculated the power consumption by reference to the "next" record, you can use a Sum() to aggregate up all of the calculated powers for that one device.
When you have calculated the power per device, you can use ORDER BY and LIMIT to select the top n power-consuming devices.
Steps to follow, if you're not confident enough to plunge in and just write the final SQL -- after each step make sure you have SQL that you understand, and which returns just the data you need:
Start small, by selecting the data rows that you want.
Work out the Lead() function, defining the appropriate partition and order clauses to get the next row.
Add the calculation of power per row.
Define the Sum() function, and group by the device id.
Add the ORDER BY and LIMIT clauses.
If you have trouble with any one of these steps, they would each make a decent StackOverflow question.
If someone happens to wonder about the same thing, here is how I solved this.
I followed the instructions by David and made this:
SELECT
    t.device_id,
    sum(len_y*(extract(epoch from date_trunc('milliseconds', len_x)))/7200) AS total
FROM (
    SELECT
        m.id,
        m.device_id,
        m.power,
        m.created_at,
        m.power+lag(m.power) OVER (
            PARTITION BY device_id
            ORDER BY m.created_at
        ) AS len_y,
        m.created_at-lag(m.created_at) OVER (
            PARTITION BY device_id
            ORDER BY m.created_at
        ) AS len_x
    FROM
        mes AS m
    WHERE m.created_at BETWEEN '2015-08-26 13:39:57.834674'::timestamp
        AND '2015-08-26 13:43:57.834674'::timestamp
) AS t
GROUP BY t.device_id
ORDER BY total DESC
LIMIT 10;
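The division by 7200 combines the trapezoid rule (divide by 2) with the conversion from seconds to hours (divide by 3600). The plain-Python sketch below mirrors the same steps (previous row per device, trapezoid area, sum, order, limit) on the sample rows from the question, so the arithmetic can be checked without a database. The function name top_consumers is made up for illustration, and the rows use the one-minute timestamps from the sample table rather than the hour-based simplification.

```python
from collections import defaultdict
from datetime import datetime

# (device_id, power in kW, created_at) rows from the question's sample table.
rows = [
    (1, 10.0, datetime(2015, 8, 26, 8, 23, 25)),
    (1, 13.0, datetime(2015, 8, 26, 8, 24, 25)),
    (1, 12.0, datetime(2015, 8, 26, 8, 25, 25)),
    (2, 103.0, datetime(2015, 8, 26, 8, 23, 25)),
    (2, 134.0, datetime(2015, 8, 26, 8, 24, 25)),
    (2, 2.0, datetime(2015, 8, 26, 8, 25, 25)),
    (3, 10.0, datetime(2015, 8, 26, 8, 23, 25)),
    (3, 13.0, datetime(2015, 8, 26, 8, 24, 25)),
    (3, 20.0, datetime(2015, 8, 26, 8, 25, 25)),
]


def top_consumers(rows, limit=10):
    """Return up to `limit` (device_id, kWh) pairs, largest consumer first."""
    by_device = defaultdict(list)  # PARTITION BY device_id
    for device_id, power, created_at in rows:
        by_device[device_id].append((created_at, power))
    totals = {}
    for device_id, samples in by_device.items():
        samples.sort()  # ORDER BY created_at
        total = 0.0
        # Pair each sample with the previous one, like lag() in the query.
        for (t_prev, p_prev), (t_cur, p_cur) in zip(samples, samples[1:]):
            seconds = (t_cur - t_prev).total_seconds()
            # Trapezoid area in kWh: (p_prev + p_cur) / 2 * seconds / 3600.
            total += (p_prev + p_cur) * seconds / 7200
        totals[device_id] = total
    # ORDER BY total DESC LIMIT n
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:limit]


ranking = top_consumers(rows)
print(ranking)
```

With samples one minute apart, the totals are one sixtieth of the hour-based figures in the question (about 3.108 kWh for device 2, 0.467 kWh for device 3 and 0.4 kWh for device 1).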