From SQL to Neo4j: trying to group query results

I have the following query in SQL (Oracle DB 11g XE).
For context: for each month, this query finds the sensor with the highest power factor, keeping only maxima between 0.90 and 0.99.
with abc as (select extract(month from peak_time) as Month,
max(total_power_factor) as Max_Power_Factor
from sensors group by extract(month from peak_time) order by Month DESC)
select abc.Month, Max_Power_Factor, meter_id as "Made by"
from abc join sensors
on sensors.total_power_factor = abc.Max_Power_Factor
where Max_Power_Factor between 0.90 and 0.99
order by Max_Power_Factor;
SQL Developer shows me the correct result: only ONE row per month, without duplicates. For example:
Month Max_Power_Factor Scored by
6 0.981046427565 b492b271760a
1 0.945921825336 db71ffead179
3 0.943302142482 a9c471b03587
8 0.9383185638 410bd58c8396
7 0.930911694091 fe5954a46888
5 0.912872055549 ee3c8ec29155
My problem is replicating the same query in Neo4j (3.2.1 CE, on Windows 10): I don't know exactly how to group the data to get the same results. (As you can see, I'm using APOC to handle dates.)
match(a:Sensor) with a, a.peak_time as peak_time
where (a.total_power_factor > 0.90 and a.total_power_factor <0.99 )
RETURN distinct a.meterid, max(peak_time),apoc.date.format(peak_time,'s','MM') as month
order by month desc
These are my Cypher results and, as you can see, there are multiple rows for each month.
Month Max_Power_Factor Scored by
06 0.981046427565 b492b271760a
01 0.945921825336 db71ffead179
03 0.943302142482 a9c471b03587
08 0.9383185638 410bd58c8396
08 0.93451098613 dfd6b67cc6d6
07 0.930911694091 fe5954a46888
02 0.916440282713 649956b34e87
05 0.912872055549 ee3c8ec29155
08 0.907059974935 a3e8df8a0ba8
So my question is: how can I group the data to get the same output as Oracle DB? (If it's possible, of course.)
Thanks in advance for your help.

The fields in the output you show do not correspond to the query (for example, what exactly is "Scored By"?). That said, the trick to aggregating in Neo4j is understanding that the aggregation keys are implicit: every non-aggregated expression in the RETURN clause becomes a grouping key.
So if you have
RETURN distinct a.meterid, max(peak_time),apoc.date.format(peak_time,'s','MM') as month
you are grouping on meterid and month.
If you want to group on month only it should be
RETURN max(peak_time),apoc.date.format(peak_time,'s','MM') as month
Hope this helps!
Regards,
Tom
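To make the implicit grouping concrete, here is a minimal Python sketch (not Cypher) that mimics how the non-aggregated expressions in RETURN act as grouping keys. The sample rows are a few values from the question's output, truncated to four decimals, and max_by is a hypothetical helper introduced only for illustration.

```python
from collections import defaultdict

# Sample rows (meter_id, month, total_power_factor), truncated from the question.
readings = [
    ("410bd58c8396", "08", 0.9383),
    ("dfd6b67cc6d6", "08", 0.9345),
    ("a3e8df8a0ba8", "08", 0.9071),
    ("fe5954a46888", "07", 0.9309),
]


def max_by(rows, key):
    """Group rows by key(row) and keep the largest power factor per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row[2])
    return {k: max(values) for k, values in groups.items()}


# Keys (meter_id, month): one result row per sensor and month.
by_meter_and_month = max_by(readings, key=lambda r: (r[0], r[1]))

# Key month only: one result row per month.
by_month = max_by(readings, key=lambda r: r[1])

print(len(by_meter_and_month), by_month)
```

With the month as the only key, the three month-08 rows collapse into a single row, which is the one-row-per-month shape of the Oracle result.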

Related

Create a funnel in SQL with 30 days delay

I have a table like this, with hundreds of records:
month_signup  nb_signups  month_purchase  nb_purchases
01            100         01              10
02            200         02              20
03            150         03              10
Let's say I want to calculate the signup-to-purchase ratio month after month.
Normally I could just compute nb_purchases / nb_signups * 100, but not here.
I want to calculate the signup-to-purchase ratio with a 1-month (or 30-day) delay.
To give the signups time to purchase, I want to divide nb_purchases from month 2 by nb_signups from month 1: 20/100, for example, in my table.
I tried this, but I'm really not sure about it.
SELECT
month_signup
,SAFE_DIVIDE(CASE WHEN purchase_month BETWEEN signups_month AND DATE_ADD(signups_month, INTERVAL 30 DAY) THEN nb_purchases ELSE NULL END, nb_signups)*100 AS sign_up_to_purchase_ratio
FROM table
ORDER BY 1
You can use the LEAD() window function to read a value from the row that follows the current one. Here is the query in MySQL syntax:
with cte as
(select month_signup, nb_signups,
lead(nb_purchases) over (order by month_signup) as nextPr
from MyData
order by month_signup)
select cte.month_signup, (nextPr/cte.nb_signups)*100 as per from cte
where (nextPr/cte.nb_signups)*100 is not null;
You may replace (nextPr/cte.nb_signups) with SAFE_DIVIDE(nextPr, cte.nb_signups) if your database provides that function, as the query in the question suggests.
See the demo from db-fiddle.
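As a runnable check of the LEAD() approach, the sketch below loads the three sample rows from the question into an in-memory SQLite database through Python's sqlite3 module. It assumes the bundled SQLite is version 3.25 or newer (the first release with window functions). The alias next_purchases is illustrative, and the factor 100.0 is there because SQLite would otherwise perform integer division.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MyData (month_signup TEXT, nb_signups INTEGER, "
    "month_purchase TEXT, nb_purchases INTEGER)"
)
# Sample rows taken from the question.
conn.executemany(
    "INSERT INTO MyData VALUES (?, ?, ?, ?)",
    [("01", 100, "01", 10), ("02", 200, "02", 20), ("03", 150, "03", 10)],
)

query = """
WITH cte AS (
    SELECT month_signup,
           nb_signups,
           -- purchases recorded in the following month
           LEAD(nb_purchases) OVER (ORDER BY month_signup) AS next_purchases
    FROM MyData
)
SELECT month_signup,
       100.0 * next_purchases / nb_signups AS ratio
FROM cte
WHERE next_purchases IS NOT NULL   -- the last month has no successor
ORDER BY month_signup
"""
ratios = conn.execute(query).fetchall()
print(ratios)
```

Month 01 yields 20.0 (20 purchases in month 02 over 100 signups) and month 02 yields 5.0; month 03 is dropped because there is no following month.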

Can I query an aggregated query and a specific row's query when using subqueries?

I am new to SQL and I want to return a specific company's value alongside the average of similar companies. I have the average part working, but I'm not sure how to do the specific-value part.
For more context, I have a list of carbon emissions by company. I want the average for an industry, based on a given company's industry (working perfectly below), but I am not sure how to add the specific company's own figure.
Here's my query:
SELECT
year, AVG(carbon) AS AVG_carbon,
-- carbon as CompanyCarbon, <--my not working attempt
FROM
"company"."carbon" c
WHERE
LOWER(c.ticker) IN (SELECT LOWER(g4.ticker)
FROM "company"."General" g4
WHERE industry = (SELECT industry
FROM "company"."General" g3
WHERE LOWER(g3.ticker) = 'ibm.us'))
GROUP BY
c.year
ORDER BY
year ASC;
The current result is:
year avg_carbon
--------------------------------
1998 7909.0000000000000000
1999 19465.500000000000
2000 19478.000000000000
2001 182679.274509803922
2002 179821.156862745098
My desired output is:
year avg_carbon carbon
---------------------------------------
1998 7909.0000000000000000 343
1999 19465.500000000000 544
2000 19478.000000000000 653
2001 182679.274509803922 654
2002 179821.156862745098 644
(that is, adding a carbon column that holds IBM's own carbon value for each year).
Here's my Carbon table:
ticker year carbon
-----------------------
hurn.us 2016 6282
hurn.us 2015 6549
hurn.us 2014 5897
hurn.us 2013 5300
hurn.us 2012 5340
ibm.us 2019 1496520
ibm.us 2018 1438365
Based on my limited knowledge, I think my WHERE clause is causing the problem. Right now I look at a company, get the list of tickers/identifiers in the same industry, then compute an average for each year.
I tried to just call the carbon column but I think because it's processing the list of tickers, it's not outputting the result I want.
What can I do? Also if I'm making any other mistakes you see above please let me know.
The sample data and the output do not match, so I can't say for sure, but this might be the answer you are looking for.
select year, AVG(carbon) AS AVG_carbon,
max(case when lower(ticker) = 'ibm.us' then carbon else 0 end) as CompanyCarbon
from "company"."carbon" c
GROUP BY c.year
order by year ASC;
For each year, this returns as CompanyCarbon the MAX of carbon over the rows where lower(ticker) = 'ibm.us', or 0 if the year has no such row. The average is calculated as you did.
To select only rows having positive value in CompanyCarbon column:
select year, AVG_carbon, CompanyCarbon
from
(
select year, AVG(carbon) AS AVG_carbon,
max(case when lower(ticker) = 'ibm.us' then carbon else 0 end) as CompanyCarbon
from "company"."carbon" c
GROUP BY c.year
) t
where CompanyCarbon > 0
order by year ASC;
Similar to the answer that Kazi provided, you can use the FILTER clause on an aggregate, which IMO is a bit more readable than CASE/WHEN.
SELECT
year,
AVG(carbon) as avg_carbon,
MAX(carbon) FILTER (WHERE ticker = 'ibm.us') as company_carbon
FROM company_carbon
GROUP BY year
ORDER by year;
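The conditional aggregation used in the answers above can be tried out with Python's sqlite3 module. This is only a sketch under assumptions: the table layout follows the question's carbon table, but the rows are made-up numbers, and the industry filter from the question is left out for brevity. The CASE expression here has no ELSE branch, so a year without an ibm.us row yields NULL rather than 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE carbon (ticker TEXT, year INTEGER, carbon INTEGER)")
# Hypothetical rows, not the real emissions data.
conn.executemany(
    "INSERT INTO carbon VALUES (?, ?, ?)",
    [
        ("hurn.us", 2017, 40),
        ("ibm.us", 2018, 100),
        ("hurn.us", 2018, 50),
        ("ibm.us", 2019, 120),
        ("hurn.us", 2019, 60),
    ],
)

query = """
SELECT year,
       AVG(carbon) AS avg_carbon,
       -- only ibm.us rows contribute; other rows evaluate to NULL and are ignored
       MAX(CASE WHEN LOWER(ticker) = 'ibm.us' THEN carbon END) AS company_carbon
FROM carbon
GROUP BY year
ORDER BY year
"""
results = conn.execute(query).fetchall()
print(results)
```

The year 2017 comes back with company_carbon set to None, which makes "no data for this company" distinguishable from a real value of 0.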

sum last n days quantity using sql window function

I am trying to create the following logic in Alteryx; the data comes from an Exasol database.
The column “Sum_Qty_28_days” should sum the values of the “Qty” column for the same article over the last 28 days.
My sample data looks like:
and I want following output:
E.g. the “Sum_Qty_28_days” value for “article” = ‘A’ and date = ‘2019-10-08’ is 8, because it sums the “Qty” values for the dates that fall within the previous 28 days, which are:
2019-09-15
2019-10-05
2019-10-08
for “article” = ‘A’.
Is this possible using SQL window function?
I tried myself with following code:
SUM("Qty") OVER (PARTITION BY "article", date_trunc('month',"Date")
ORDER BY "Date")
But it is far from what I need: it sums the Qty for dates falling in the same month, whereas I need the sum of Qty for the last 28 days.
Thanks in advance.
Yes, this is possible in standard SQL with a RANGE window frame, which many (but not all) databases support:
select t.*,
sum(qty) over (partition by article
order by date
range between interval '27 day' preceding and current row
) as sum_qty_28_days
from t;
If your RDBMS does not support the range frame, an alternative solution is to use an inline subquery:
select
t.*,
(
select sum(t1.qty)
from mytable t1
where
t1.article = t.article
and t1.date between t.date - interval '27' day and t.date
) sum_qty_28_days
from mytable t
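The correlated-subquery variant can be checked with Python's sqlite3 module. The sketch below is an illustration under assumptions: the table name sales, the column d, and the quantities are made up (the question's sample data did not survive), chosen so that article A on 2019-10-08 sums to 8 as described. SQLite has no INTERVAL syntax, so date(d, '-27 days') is used instead, giving a 28-day window that includes the current day.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (article TEXT, d TEXT, qty INTEGER)")
# Hypothetical rows; dates are ISO strings so text comparison matches date order.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("A", "2019-09-01", 5),
        ("A", "2019-09-15", 3),
        ("A", "2019-10-05", 1),
        ("A", "2019-10-08", 4),
        ("B", "2019-10-08", 7),
    ],
)

query = """
SELECT t.article,
       t.d,
       (SELECT SUM(t1.qty)
        FROM sales t1
        WHERE t1.article = t.article
          -- current day plus the 27 days before it
          AND t1.d BETWEEN date(t.d, '-27 days') AND t.d) AS sum_qty_28_days
FROM sales t
ORDER BY t.article, t.d
"""
rolling = conn.execute(query).fetchall()
print(rolling)
```

For article A on 2019-10-08 the window starts on 2019-09-11, so the 2019-09-01 row is excluded and the sum is 3 + 1 + 4 = 8.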

SQL store results table with month name

I have several CSVs stored as tables to query against. Each CSV represents a month of data. I would like to count all the records in each CSV and save that count as a row in a results table. For instance, the table that represents May should produce a row like the one below, with June following. The data starts in Feb 2018 and continues to Feb 2019, so the year is needed as well.
Month Results
----------------
May 18 1170
June 18 1167
For efficiency, I want to run the same query against all the tables. I also want the query to keep working with future updates, e.g. when a March 19 table gets added.
So far, I have this query.
SELECT COUNT(*)
FROM `months_data.*`
I am querying in Google BigQuery using Standard SQL.
It sounds like you just want an aggregation that counts rows for each month:
SELECT
DATE_TRUNC(DATE(timestamp), MONTH) AS Month,
COUNT(*) AS Results
FROM `dataset.*`
GROUP BY month
ORDER BY month
You can use the FORMAT_DATE function if you want to control the formatting.
You seem to need union all:
select 2018 as yyyy, 2 as mm, count(*) as num
from feb2018
union all
select 2018 as yyyy, 3 as mm, count(*)
from mar2018
union all
. . .
Note that you have a poor data model. You should be storing all the data in a single table with a date column.
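The single-table model suggested above can be sketched with Python's sqlite3 module. Everything named here is an assumption for illustration: the table records, its column recorded_on, and the three rows. SQLite's strftime stands in for whatever date-formatting function the target database offers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One table for all months, with a date column instead of one table per month.
conn.execute("CREATE TABLE records (id INTEGER, recorded_on TEXT)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?)",
    [(1, "2018-05-03"), (2, "2018-05-17"), (3, "2018-06-01")],
)

query = """
SELECT strftime('%Y-%m', recorded_on) AS month,  -- year and month in one key
       COUNT(*) AS results
FROM records
GROUP BY month
ORDER BY month
"""
monthly_counts = conn.execute(query).fetchall()
print(monthly_counts)
```

Because the month is derived from the data, rows for a newly added month show up in the result without any change to the query.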

SQL: Can GROUP BY contain an expression as a field?

I want to group a set of dated records by year, where each date is stored with day precision. Something like:
SELECT venue, YEAR(date) AS yr, SUM(guests) AS yr_guests
FROM Events
...
GROUP BY venue, YEAR(date);
The above is giving me results instead of an error, but the results are not grouping by year and venue; they do not appear to be grouping at all.
My brute force solution would be a nested subquery: add the YEAR() AS yr as an extra column in the subquery, then do the grouping on yr in the outer query. I'm just trying to learn to do as much as possible without nesting, because nesting usually seems horribly inefficient.
I would tell you the exact SQL implementation I'm using, but I've had trouble discovering it. (I'm working through the problems on http://www.sql-ex.ru/ and if you can tell what they're using, I'd love to know.) Edited to add: Per test in comments, it is probably not SQL Server.
Edited to add the results I am getting (note the first two should be summed):
venue | yr | yr_guests
1 2012 15
1 2012 35
2 2012 12
1 2008 15
I expect those first two lines to instead be summed as
1 2012 50
Works Fine in SQL Server 2008.
See working Example here: http://sqlfiddle.com/#!3/3b0f9/6
Code pasted Below.
Create The Events Table
CREATE TABLE [Events]
( Venue INT NOT NULL,
[Date] DATETIME NOT NULL,
Guests INT NOT NULL
)
Insert the Rows.
INSERT INTO [Events] VALUES
(1,convert(datetime,'2012'),15),
(1,convert(datetime,'2012'),35),
(2,convert(datetime,'2012'),12),
(1,convert(datetime,'2008'),15);
GO
-- Testing, select newly inserted rows.
--SELECT * FROM [Events]
--GO
Run the GROUP BY Sql.
SELECT Venue, YEAR(date) AS yr, SUM(guests) AS yr_guests
FROM Events
GROUP BY venue, YEAR(date);
See the Output Results.
VENUE YR YR_GUESTS
1 2008 15
1 2012 50
2 2012 12
It depends on your database engine (or SQL dialect).
To be safe across different database systems and versions, compute the year in a subquery and group in the outer query:
SELECT venue, theyear, SUM(guests) AS yr_guests
FROM (
SELECT venue, YEAR(date) AS theyear, guests
FROM Events
) AS t
GROUP BY venue, theyear;
The subquery builds an intermediate table with the columns
venue, YEAR(date) as theyear, guests
aaaa, 2001, brother
aaaa, 2001, bbrother
bbbb, 2001, nobody
... and so on
and then the outer query groups these rows and aggregates them.
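Grouping directly on an expression, without a subquery, can be tried with Python's sqlite3 module. This sketch makes a few assumptions: the column is named event_date, the exact days are invented (the question only shows years), and SQLite's strftime('%Y', ...) stands in for YEAR(). The venues, years and guest counts match the question's four rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Events (venue INTEGER, event_date TEXT, guests INTEGER)")
conn.executemany(
    "INSERT INTO Events VALUES (?, ?, ?)",
    [
        (1, "2012-03-10", 15),  # days are hypothetical
        (1, "2012-11-02", 35),
        (2, "2012-06-21", 12),
        (1, "2008-01-15", 15),
    ],
)

query = """
SELECT venue,
       CAST(strftime('%Y', event_date) AS INTEGER) AS yr,
       SUM(guests) AS yr_guests
FROM Events
-- the GROUP BY repeats the expression rather than relying on the alias
GROUP BY venue, CAST(strftime('%Y', event_date) AS INTEGER)
ORDER BY venue, yr
"""
per_year = conn.execute(query).fetchall()
print(per_year)
```

The two 2012 rows for venue 1 are summed to 50, which is the result the question expected.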