Partition By Logic - SQL

I have a dataset that has roughly 1 million rows. Without hard-coding any claims, what would be the way to get the resulting output? From my research I determined that something like DENSE_RANK or ROW_NUMBER() with a partition expression should do the trick. Is there a way to use DENSE_RANK to say "go down the list of PATNO and if the PATNO is the same, then keep going, but if it changes, group the above PATNO rows and make them one claim"?
The dates do not matter in this case. Basically I just want to find a way to tell SQL to automatically recognize sets of claims based on PATNO. Sometimes there are 50 lines with the same PATNO that make up 1 claim, and sometimes there are only 1-2 lines with the same PATNO that make up a claim.

If you want the sum of charges for a patno, then you want a group by, I think:
select patno, sum(charges)
from t
group by patno;
I think you are overcomplicating the problem.
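That said, if what you actually need is a claim number rather than a per-claim total, a minimal sketch (assuming the same table t, with a charges column kept for context) could be:

select patno,
       charges,
       dense_rank() over (order by patno) as claim_number
from t;

dense_rank() hands out one consecutive number per distinct patno, so all rows sharing a patno fall into the same "claim" without hard-coding anything.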


Access SQL GROUP BY problem (e.g. tbl_Produktion.ID not part of the aggregation function)

I want to group by two columns; however, MS Access won't let me do it.
Here is the code I wrote:
SELECT
tbl_Produktion.Datum, tbl_Produktion.Schichtleiter,
tbl_Produktion.ProduktionsID, tbl_Produktion.Linie,
tbl_Produktion.Schicht, tbl_Produktion.Anzahl_Schichten_P,
tbl_Produktion.Schichtteam, tbl_Produktion.Von, tbl_Produktion.Bis,
tbl_Produktion.Pause, tbl_Produktion.Kunde, tbl_Produktion.TeileNr,
tbl_Produktion.FormNr, tbl_Produktion.LabyNr,
SUM(tbl_Produktion.Stueckzahl_Prod),
tbl_Produktion.Stueckzahl_Ausschuss, tbl_Produktion.Ausschussgrund,
tbl_Produktion.Kommentar, tbl_Produktion.StvSchichtleiter,
tbl_Produktion.Von2, tbl_Produktion.Bis2, tbl_Produktion.Pause2,
tbl_Produktion.Arbeiter3, tbl_Produktion.Von3, tbl_Produktion.Bis3,
tbl_Produktion.Pause3, tbl_Produktion.Arbeiter4,
tbl_Produktion.Von4, tbl_Produktion.Bis4, tbl_Produktion.Pause4,
tbl_Produktion.Leiharbeiter5, tbl_Produktion.Von5,
tbl_Produktion.Bis5, tbl_Produktion.Pause5,
tbl_Produktion.Leiharbeiter6, tbl_Produktion.Von6,
tbl_Produktion.Bis6, tbl_Produktion.Pause6, tbl_Produktion.Muster
FROM
tbl_Personal
INNER JOIN
tbl_Produktion ON tbl_Personal.PersID = tbl_Produktion.Schichtleiter
GROUP BY
tbl_Produktion.Datum, tbl_Produktion.Schichtleiter;
It works when I group by all the columns, but not like this.
The error message says that the rest of the columns aren't part of the aggregation function (translated from German to English as best as I could).
PS: I also need the sum of tbl_Produktion.Stueckzahl_Prod, which is why I added the SUM function (I haven't been able to test it yet).
Have you tried something along these lines?
SELECT
tbl_Produktion.Datum, tbl_Produktion.Schichtleiter,
MAX(tbl_Produktion.ProduktionsID), MAX(tbl_Produktion.Linie),
MAX(tbl_Produktion.Schicht), MAX(tbl_Produktion.Anzahl_Schichten_P),
MAX(tbl_Produktion.Schichtteam), MAX(tbl_Produktion.Von), MAX(tbl_Produktion.Bis),
SUM(tbl_Produktion.Stueckzahl_Prod)
FROM
tbl_Personal
INNER JOIN
tbl_Produktion ON tbl_Personal.PersID = tbl_Produktion.Schichtleiter
GROUP BY
tbl_Produktion.Datum, tbl_Produktion.Schichtleiter;
I have used the MAX function for all the data except the two items you specify in the GROUP BY and the one where you desire the SUM. I took the liberty of leaving out much of your data just to get started.
Using the MAX function turns out to be a convenient workaround when the data item is known to be unique within each group. We cannot know your data or your intent, so we cannot tell you whether MAX will yield the results you need.
If you use an aggregation function in the SELECT clause, you must group by every selected column that isn't aggregated. If you don't want to do that for some reason (perhaps it changes the output of the aggregation in a way you don't intend), you either have to pick an aggregate to use (Average? Max? Min?) or do two selects: one for the aggregate and one for the non-aggregates. But then you have to decide how to get the non-aggregated fields that make sense for the aggregate (or show them all in a table, I suppose) - one way is sketched below.
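For the "two selects" route, a minimal sketch of joining the aggregate back to the detail rows (column names taken from the question; the alias SummeStueckzahl is made up for illustration, and the code is left bare of comments since Access SQL has no comment syntax):

SELECT p.*, agg.SummeStueckzahl
FROM tbl_Produktion AS p
INNER JOIN
    (SELECT Datum, Schichtleiter, SUM(Stueckzahl_Prod) AS SummeStueckzahl
     FROM tbl_Produktion
     GROUP BY Datum, Schichtleiter) AS agg
    ON p.Datum = agg.Datum AND p.Schichtleiter = agg.Schichtleiter;

Every detail row then carries its group total alongside the non-aggregated fields, with nothing collapsed away.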

SQL: Reduce resultset to X rows?

I have the following MySQL table:
measuredata:
- ID (bigint)
- timestamp
- entityid
- value (double)
The table contains >1 billion entries. I want to be able to visualize any time window. The time window can range from "one day" to "many years". There are measurement values roughly every minute in the DB.
So the number of entries for a time window can vary widely, say from a few hundred to several thousand or even millions.
Those values are meant to be visualized in a graphical chart diagram on a webpage.
If the chart is - let's say - 800px wide, it does not make sense to fetch thousands of rows from the database if the time window is quite big. I cannot show more than 800 values on this chart anyhow.
So, is there a way to reduce the resultset directly on DB-side?
I know "average" and "sum" etc. as aggregate function. But how can I i.e. aggregate 100k rows from a big time-window to lets say 800 final rows?
Just getting those 100k rows and let the chart do the magic is not the preferred option. Transfer-size is one reason why this is not an option.
Isn't there something on DB side I can use?
Something like avg() to shrink X rows to Y averaged rows?
Or some simple magic to just skip every nth row to shrink X to Y?
update:
Although I'm using MySQL right now, I'm not tied to it. If PostgreSQL, for instance, provides a feature that could solve the issue, I'm willing to switch DBs.
update2:
I may have found a possible solution: https://mike.depalatis.net/blog/postgres-time-series-database.html
See section "Data aggregation".
The key is not to use a Unix timestamp but a date, "trunc" it, average the values, and group by the truncated date. Could work for me, but it would require a rework of my table structure. Hmm... maybe there's more... still researching...
update3:
Inspired by update 2, I came up with this query:
SELECT (`timestamp` - (`timestamp` % 86400)) AS aggtimestamp, `entity`, `value`
FROM `measuredata`
WHERE `entity` = 38 AND `timestamp` > UNIX_TIMESTAMP('2019-01-25')
GROUP BY aggtimestamp
It works, but my DB/index/structure doesn't seem really optimized for this: a query over the last year took ~75 sec (slow test machine) and in the end returned only one value per day. This can be combined with avg(value), but that further increases query time (~82 sec). I will see if it's possible to optimize this further. But I now have an idea of how "downsampling" data works, especially aggregation in combination with GROUP BY.
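Generalizing that idea: derive the bucket width from the window size so that any range collapses to roughly 800 rows. A sketch against the same table (86400 above is one day in seconds; 39420 is roughly one year divided by 800):

SELECT (`timestamp` - (`timestamp` % 39420)) AS bucket_start,
       AVG(`value`) AS avg_value
FROM `measuredata`
WHERE `entity` = 38 AND `timestamp` > UNIX_TIMESTAMP('2019-01-25')
GROUP BY bucket_start;

MySQL accepts the alias in GROUP BY, so each output row is one averaged bucket of roughly 11 hours - about 800 points per year.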
There is probably no efficient way to do this. But, if you want, you can break the rows into equal-sized groups and then fetch, say, the first row from each group. Here is one method:
select md.*
from (select md.*,
             row_number() over (partition by tile order by timestamp) as seqnum
      from (select md.*, ntile(800) over (order by timestamp) as tile
            from measuredata md
            where . . .  -- your filtering conditions here
           ) md
     ) md
where seqnum = 1;
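A variation on the same sketch: instead of keeping only the first row of each tile, aggregate within it, so each of the ~800 output rows summarizes its bucket (ntile() requires MySQL 8+ or PostgreSQL; the filter values are taken from the question's update and are illustrative):

select tile,
       min(timestamp) as bucket_start,  -- where this bucket begins
       avg(value) as avg_value          -- one averaged point per bucket
from (select md.*, ntile(800) over (order by timestamp) as tile
      from measuredata md
      where entity = 38 and timestamp > UNIX_TIMESTAMP('2019-01-25')
     ) md
group by tile
order by tile;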

PowerPivot Ranking Groups using DAX's Rankx - Ranking Using Sum of a Field

I am trying to rank groups by summing a field (not a calculated column) for each group, so I get a static answer for each row in my table.
For example, I may have a table with state, agent, and sales. Sales is a field, not a measure. There can be many agents within a state, so there are many rows for each individual state. I am trying to rank the states by total sales within each state.
I have tried many things, but the ones that make the most sense to me are:
rankx(CALCULATETABLE(Table,allexcept(Table,Table[AGENT]),sum([Sales]),,DESC)
and
=rankx(SUMMARIZE(State,Table[State],"Sales",sum(Table[Sales])),[Sales])
The first one creates a table where it sums sales without grouping by agent, and then tries to rank based on that. I get #ERROR on this one.
The second one creates a table using SUMMARIZE with only the sum of Sales grouped by state, then tries to take that table and rank the states based on Sales. For this one I get a rank of 1 for every row.
I think, but am not sure, that my problem comes from Sales being a static field and not a calculated measure. I can't figure out where to go from here. Any help?
Assuming your data looks something like this...
...have you tried this:
Ranking Measure = RANKX(ALL('Table'[STATE]),CALCULATE(SUM('Table'[Sales])))
The ALL('Table'[STATE]) says to rank all states. The CALCULATE(SUM('Table'[Sales])) says to rank by the sum of their sales. The CALCULATE wrapper is important; a plain SUM('Table'[Sales]) will be filtered to the current row context, resulting in every state being ranked #1. (Alternatively, you can spin off SUM('Table'[Sales]) into a separate Sales measure - which I'd recommend.)
Note: the ranks will change based on slicers/filters (e.g. a filter by agent will re-rank the states by that agent). If you're looking for a static rank of states by their total sales (i.e. not affected by filters on agent and always looking at the entire table), then try this:
Static Ranking Measure = CALCULATE([Ranking Measure], ALLEXCEPT('Table', 'Table'[State]))
This takes the same ranking measure, but removes all filters except the state filter (which you need to leave, as that's the column you're ranking by).
I did figure out a solution that's pretty simple, but it's messier than I'd like. If it's the only thing that works, though, that's okay.
I created a new table with each distinct state along with the sum of sales, then just do a basic RANKX on that table (a sketch follows).
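A minimal sketch of that approach in DAX (all names assumed; note that defining a table with DAX like this is a Power BI calculated-table feature, so in classic PowerPivot the summary table would have to come from the source data instead):

// One row per state with its total sales.
StateSales = SUMMARIZE('Table', 'Table'[State], "TotalSales", SUM('Table'[Sales]))

// Calculated column on StateSales: a static rank by total sales.
StateRank = RANKX(ALL(StateSales), StateSales[TotalSales])

Because the rank lives in a calculated column rather than a measure, it is evaluated once at refresh time and is not affected by slicers or filters.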

JOIN EACH and GROUP EACH BY clauses can't be used on the output of window functions

How would you overcome the above restriction?
I am trying to find flows based on sequences of 3 records using the LEAD and LAG window functions, and then calculate some aggregations (count, sum, etc.) of their attributes.
When I run my queries on a small sample of data, everything is fine and the GROUP BY runs OK. But when running on a larger data set, I get: "Resources exceeded during query execution. The query contained a GROUP BY operator, consider using GROUP EACH BY instead."
In many other cases switching to GROUP EACH BY does the trick...
However, as I use window functions, I cannot use EACH...
Any suggestions? Best practices?
Here is a sample query based on the Wikipedia sample data. It shows the frequency of title editing by different contributors. The WHERE condition is just there to limit the response size: if we keep only the "A" we get results; if we add the "B" we get the "use EACH" recommendation.
select title,
       count(case when contributor_id <> LeadContributor then 1 else null end) as different,
       count(case when contributor_id = LeadContributor then 1 else null end) as same,
       count(*) as total
from
(
  select title, contributor_id,
         lead(contributor_id) over (partition by title order by timestamp) as LeadContributor
  from [publicdata:samples.wikipedia]
  where regexp_match(title, r'^[A,B]') = true
)
group by title
Thanks
I guess your particular use case is different to the sample query, but let me comment on what I'm able to see:
You found a way to make GROUP EACH and OVER possible: Surrounding the OVER() query with another one allows you to change the GROUP BY to GROUP EACH BY. However, this query's problem is not there.
Let's forget about GROUP and GROUP EACH. Let's look at the core query:
SELECT title, contributor_id, LEAD(contributor_id)
OVER(PARTITION BY title ORDER BY timestamp) AS LeadContributor
FROM [publicdata:samples.wikipedia]
WHERE REGEXP_MATCH(title, r'^[A,B]')
This query fails with r'^[A,B]' and works with r'^[A]', and it highlights an OVER() limitation: like GROUP BY and ORDER BY, it only works when the data fits in one machine, as these operations are not parallelizable. As the result for r'^[A]' reveals, that can be a lot of data - though sometimes not enough. That's why BigQuery offers the parallelizable GROUP EACH BY. However, there is no parallelizable OVER EACH BY we could use here.
The workaround I would apply here is exactly what you are doing: do the OVER() with just a fraction of the data, as sketched below.
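Concretely, in legacy BigQuery SQL a comma in the FROM clause means UNION ALL, and since the window is partitioned by title, every partition lives entirely inside one leading-letter subset. So the OVER() can safely run per letter and the partials can be unioned before the GROUP EACH BY. A sketch:

SELECT title,
       COUNT(CASE WHEN contributor_id <> LeadContributor THEN 1 END) AS different,
       COUNT(CASE WHEN contributor_id = LeadContributor THEN 1 END) AS same,
       COUNT(*) AS total
FROM
  (SELECT title, contributor_id,
          LEAD(contributor_id) OVER (PARTITION BY title ORDER BY timestamp) AS LeadContributor
   FROM [publicdata:samples.wikipedia]
   WHERE REGEXP_MATCH(title, r'^A')),
  (SELECT title, contributor_id,
          LEAD(contributor_id) OVER (PARTITION BY title ORDER BY timestamp) AS LeadContributor
   FROM [publicdata:samples.wikipedia]
   WHERE REGEXP_MATCH(title, r'^B'))
GROUP EACH BY title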
(btw, let me say I love the sample query... it's an interesting question with an interesting answer!)

Group keywords by site

I am finding a lot of useful help here today, and I really appreciate it. This should be the last one for the day:
I have a list of the top 10 keywords per site, sorted by visits, by date. The records need to be sorted as follows (excuse the formatting):
                            2010-05     2010-04
site1.com  keyword1         apples      wine
           keyword1 visits  100         12
           keyword2         oranges     water
           keyword2 visits  99          10
site2.com  keyword1         blueberry   cornbread
           keyword1 visits  90          100
           keyword2         squares     biscuits
           keyword2 visits  80          99
Basically what I need to accomplish involves grouping, but I can't seem to figure it out. Am I heading down the right path, or is there another way to achieve this, or is it just impossible?
Edit:
The dataset is something like this (csv):
site_name,date,keyword,visits
site1.com,2010-04,apples,100
site1.com,2010-04,oranges,99
site1.com,2010-05,wine,12
site1.com,2010-05,water,10
site2.com,2010-04,cornbread,100
site2.com,2010-04,biscuits,99
site2.com,2010-05,blueberry,90
site2.com,2010-05,squares,80
Across the X-axis, we need to have the 'date' value
Across the Y-axis, we need to have the 'site_name' as the primary value, but grouped within that we need to have the 'keyword' followed by the respective 'visits'.
OK, I think you are going down the right track. It's a little tricky getting the groups right, but this should be solvable with grouping.
What it looks like you need is a matrix (the table where you can have dynamic rows and columns) and put the dates in a group across the top. Then group the rows by site name and then (I think) by keyword.
If grouping by keyword doesn't work, try grouping by the row number instead (within the scope of the site name group). If that doesn't work, try getting your database to produce an extra column with the rank in it first (there's a SQL sketch after the sample data below). Then you can definitely group by that. What I mean is:
site_name,date,keyword,visits,rank
site1.com,2010-04,apples,100,1
site1.com,2010-04,oranges,99,2
site1.com,2010-05,wine,12,1
site1.com,2010-05,water,10,2
site2.com,2010-04,cornbread,100,1
site2.com,2010-04,biscuits,99,2
site2.com,2010-05,blueberry,90,1
site2.com,2010-05,squares,80,2
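A minimal sketch of producing that rank column on the database side (window-function syntax; the table name keyword_stats is made up to match the CSV, and rank may need quoting or renaming if it is reserved in your engine):

SELECT site_name,
       date,
       keyword,
       visits,
       ROW_NUMBER() OVER (PARTITION BY site_name, date
                          ORDER BY visits DESC) AS rank  -- 1 = most-visited keyword
FROM keyword_stats;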
You should then be able to add two rows in that group to put the keyword and visits in. If you can't, you might have to resort to fancy rectangle work: in the detail cell, put a rectangle, then two textboxes, with the keyword in the top one and the number of visits in the bottom one.
Create a row grouping on "site", then a child/sub row grouping on "keyword".
You don't need to use a matrix, as you know how many columns you will have, so you can just do it in a table.
So the grouping would be something like
=Fields!site_name.Value
with the same value appearing in the textbox,
then for the next grouping down
=Fields!keyword.Value
ditto for the textbox.
You can just use SUM to figure out how many visits: =Sum(Fields!visits.Value)
in the group total.