Power Query List.Generate to loop paginated API calls

I am trying to append the results of multiple API calls in Power Query to build an outcome table.
The API limits each call to 1000 rows, but each result returns the current page number as well as the next call and the total rows so it's easy to know what to iterate on and how many times. I need to start at page 1 and end at total rows / 1000.
I just need help with the syntax if it's possible to append each subsequent result in a loop with the cumulative previous calls.
Something like a List.Generate inside a function might work, but I'm out of my depth here.

I have managed to answer my own question using the reference in the comment from @horseyride.
It involved creating two custom functions in Power Query in M code. I'm sharing the important parts of both functions below.
fnGetResults:
(page, outcome) =>
let
url = "https://myapi.com/outcomes/findings/?outcome="&Number.ToText(outcome)&"&page="&Number.ToText(page)&"&page_size=1000",
Source = Json.Document(Web.Contents(url, [Headers = [#"Authorization"="JWT "&get_token]])),
results = Source[results]
in
results
fnGetOutcomes:
(id) =>
let
outcome_list = List.Generate (
() => [page = 1, result = fnGetResults(1,id)],
each not List.IsEmpty([result]),
// [page] refers to the previous state, so fetch [page] + 1 to avoid requesting page 1 twice
each [page = [page] + 1, result = fnGetResults([page] + 1, id)],
each [result]
),
#"Converted to Table" = Table.FromList(outcome_list, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
#"Expanded Column1" = Table.ExpandListColumn(#"Converted to Table", "Column1"),
outcome = Table.ExpandRecordColumn(#"Expanded Column1", "Column1",
Record.FieldNames(#"Expanded Column1"{0}[Column1]),
Record.FieldNames(#"Expanded Column1"{0}[Column1]))
in
outcome
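For readers less familiar with M, the fetch-until-empty pattern that fnGetOutcomes relies on can be sketched in Python. This is only an illustration of the control flow: `fetch_all`, `fake_fetch_page`, and the 2,500-row data set are hypothetical stand-ins for the real API call, and the page size of 1000 is taken from the question.

```python
from typing import Callable, Dict, List

PAGE_SIZE = 1000  # page size taken from the question


def fetch_all(fetch_page: Callable[[int], List[Dict]]) -> List[Dict]:
    """Collect rows from page 1 onward until a page comes back empty."""
    rows: List[Dict] = []
    page = 1
    while True:
        batch = fetch_page(page)
        if not batch:        # same stop condition as List.IsEmpty([result])
            break
        rows.extend(batch)   # append this page to the cumulative result
        page += 1            # move on to the next page
    return rows


# Hypothetical stand-in for the real API: 2,500 rows served in pages of 1,000.
_FAKE_ROWS = [{"id": i} for i in range(2500)]


def fake_fetch_page(page: int) -> List[Dict]:
    start = (page - 1) * PAGE_SIZE
    return _FAKE_ROWS[start:start + PAGE_SIZE]


all_rows = fetch_all(fake_fetch_page)
print(len(all_rows))
```

Each page should be requested exactly once, so the number of distinct ids in the combined result should equal the total number of rows.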

Related

Complex row manipulation based on column value in SQL or Power Query

I have a call dataset. Looks like this
If a call about a certain member comes in within 30 days of an "original call", that call is considered a callback. I need some logic or Power Query magic to handle this dataset using this logic. So the end result should look like this
Right now, I have the table left joined to itself which gives me every possible combination. I thought I could do something with that but it's proven difficult and when I have over 2 million unique case keys, the duplicates kill run time and overload memory. Any suggestions? I'd prefer to do the manipulation in Power Query editor but can do it in SQL. Plz and thank you.
I think you can do this in Power Query, but I have no idea how it will run with two million records.
It may be able to be sped up with judicious use of the Table.Buffer function. But give it a try as written first.
The code should be reasonably self-documenting. In outline:
Group by Member ID.
For each Member ID, create a table from a list of records built using the stated logic.
Expand the tables.
Mark the rows to be deleted by shifting the Datediff column up by one and applying the appropriate logic to the Datediff and shifted columns.
The code assumes that the dates for each Member ID are in ascending order. If not, an extra sorting step would need to be added.
Try this M code. (Change the Source line to match your own data source.)
Edit:
Code edited to allow for multiple call backs from an initial call
let
//Change next line to be congruent with your actual data source
Source = Excel.CurrentWorkbook(){[Name="Table3"]}[Content],
#"Changed Type" = Table.TransformColumnTypes(Source,{
{"Case key", type text}, {"Member ID", Int64.Type}, {"Call Date", type date}}),
//Group by Member ID
// then create tables with call back date using the stated logic
#"Grouped Rows" = Table.Group(#"Changed Type", {"Member ID"}, {
{"Call Backs",(t)=>Table.FromRecords(
List.Generate(
()=>[ck=t[Case key]{0}, cd=t[Call Date]{0}, cb = null, df=null, idx=0],
each [idx] < Table.RowCount(t),
each [ck=if Duration.Days(t[Call Date]{[idx]+1} - [cd]) < 30
then [ck] else t[Case key]{[idx]+1},
cd=if Duration.Days(t[Call Date]{[idx]+1} - [cd]) < 30
then [cd] else t[Call Date]{[idx]+1},
cb = if Duration.Days(t[Call Date]{[idx]+1} - [cd]) < 30
then t[Call Date]{[idx]+1} else null,
df = if Duration.Days(t[Call Date]{[idx]+1} - [cd]) < 30
then Duration.Days(t[Call Date]{[idx]+1} - [cd]) else null,
idx = [idx]+1],
each Record.FromList({[ck],[cd],[cb],[df]},{"Case key","Call Date","Call Back Date", "Datediff"}))
)}
}),
#"Expanded Call Backs" = Table.ExpandTableColumn(#"Grouped Rows", "Call Backs",
{"Case key", "Call Date", "Call Back Date", "Datediff"},
{"Case key", "Call Date", "Call Back Date", "Datediff"}),
#"Shifted Datediff" = Table.FromColumns(
Table.ToColumns(#"Expanded Call Backs") & {
List.RemoveFirstN(#"Expanded Call Backs"[Datediff]) & {null}},
type table[Member ID=Int64.Type, Case key=text, Call Date=date, Call Back Date=date, Datediff=Int64.Type, shifted=Int64.Type ]),
#"Filter" = Table.SelectRows(#"Shifted Datediff", each [shifted]=null or [Datediff]<>null),
#"Removed Columns" = Table.RemoveColumns(Filter,{"shifted"})
in
#"Removed Columns"
Example with multiple callbacks
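As a rough cross-check of the logic, here is a small Python sketch of the same rule as I read it from the M code: the first call for a member is the original call, any later call less than 30 days after that original is a callback, and a call 30 or more days later starts a new original. The function name `mark_callbacks` and the sample rows are hypothetical, and an original call is emitted on its own only when it has no callbacks.

```python
from collections import defaultdict
from datetime import date

WINDOW_DAYS = 30  # callback window from the question


def mark_callbacks(calls):
    """calls: iterable of (case_key, member_id, call_date).

    Returns rows of (member_id, case_key, call_date, call_back_date, datediff).
    """
    by_member = defaultdict(list)
    for case_key, member_id, call_date in calls:
        by_member[member_id].append((call_date, case_key))

    rows = []
    for member_id, member_calls in by_member.items():
        member_calls.sort()  # ascending by call date
        orig_date, orig_key = member_calls[0]
        had_callback = False
        for call_date, case_key in member_calls[1:]:
            diff = (call_date - orig_date).days
            if diff < WINDOW_DAYS:
                # Callback: keep the original's key and date on the row.
                rows.append((member_id, orig_key, orig_date, call_date, diff))
                had_callback = True
            else:
                # Outside the window: close the old original, start a new one.
                if not had_callback:
                    rows.append((member_id, orig_key, orig_date, None, None))
                orig_date, orig_key = call_date, case_key
                had_callback = False
        if not had_callback:
            rows.append((member_id, orig_key, orig_date, None, None))
    return rows


# Hypothetical sample data: member 1 has two callbacks, then a new original.
calls = [
    ("A1", 1, date(2021, 1, 1)),
    ("A2", 1, date(2021, 1, 10)),
    ("A3", 1, date(2021, 1, 25)),
    ("A4", 1, date(2021, 3, 1)),
    ("B1", 2, date(2021, 2, 1)),
]
result = mark_callbacks(calls)
for row in result:
    print(row)
```

Because each member's calls are walked once, this avoids the self-join blow-up described in the question. Whether the M version scales the same way to two million rows would still need testing.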
I think you can do this with the LEAD function.
Here is the fiddle: https://dbfiddle.uk/?rdbms=oracle_11.2&fiddle=f7cabdbe4d1193e5f0da6bd6a4571b96
select
a.*,
LEAD(CallDate, 1) OVER (
Partition by memberId
ORDER BY
CallDate
) AS "CallbackDate",
LEAD(CallDate, 1) OVER (
Partition by memberId
ORDER BY
CallDate
) - a.calldate AS DateDiff
from
mytable a
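To make the behaviour of LEAD concrete, the following Python sketch pairs each call with the next call for the same member, which is what the query above computes. The function name `lead_call_dates` and the sample rows are hypothetical. Note that the difference is measured between consecutive calls, which is not necessarily the distance from the original call, so a 30-day filter would still have to be applied on top.

```python
from datetime import date
from itertools import groupby
from operator import itemgetter


def lead_call_dates(calls):
    """calls: list of (member_id, call_date).

    Returns (member_id, call_date, next_call_date, diff_days) per call,
    mimicking LEAD(CallDate, 1) OVER (PARTITION BY memberId ORDER BY CallDate).
    """
    out = []
    ordered = sorted(calls)  # by member, then by date
    for member_id, group in groupby(ordered, key=itemgetter(0)):
        dates = [d for _, d in group]
        for i, d in enumerate(dates):
            nxt = dates[i + 1] if i + 1 < len(dates) else None
            diff = (nxt - d).days if nxt is not None else None
            out.append((member_id, d, nxt, diff))
    return out


# Hypothetical sample data.
paired = lead_call_dates([
    (1, date(2021, 1, 1)),
    (1, date(2021, 1, 10)),
    (1, date(2021, 1, 25)),
    (2, date(2021, 2, 1)),
])
for row in paired:
    print(row)
```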

How to return distinct rows while keeping the ordering in a query (SQL Alchemy)

I've been stuck on this for a few days now. An event can have multiple dates, and I want the query to only return the date closest to today (the next date). I have considered querying for Events and then adding a hybrid property to Event that returns the next Event Date but I believe this won't work out (such as if I want to query EventDates in a certain range).
I'm having a problem with distinct() not working as I would expect. Keep in mind I'm not a SQL expert. Also, I'm using postgres.
My query starts like this:
distance_expression = func.ST_Distance(
cast(EventLocation.geo, Geography(srid=4326)),
cast("SRID=4326;POINT(%f %f)" % (lng, lat), Geography(srid=4326)),
)
query = (
db.session.query(EventDate)
.populate_existing()
.options(
with_expression(
EventDate.distance,
distance_expression,
)
)
.join(Event, EventDate.event_id == Event.id)
.join(EventLocation, EventDate.location_id == EventLocation.id)
)
And then I have multiple filters (just showing a few as an example)
query = query.filter(EventDate.start >= datetime.utcnow())
if kwargs.get("locality_id", None) is not None:
query = query.filter(EventLocation.locality_id == kwargs.pop("locality_id"))
if kwargs.get("region_id", None) is not None:
query = query.filter(EventLocation.region_id == kwargs.pop("region_id"))
if kwargs.get("country_id", None) is not None:
query = query.filter(EventLocation.country_id == kwargs.pop("country_id"))
Then I want to order by date and distance (using my query expression)
query = query.order_by(
EventDate.start.asc(),
distance_expression.asc(),
)
And finally I want to get distinct rows, and only return the next EventDate of an event, according to the ordering in the code block above.
query = query.distinct(Event.id)
The problem is that this doesn't work and I get a database error. This is what the generated SQL looks like:
SELECT DISTINCT ON (events.id) ST_Distance(CAST(event_locations.geo AS geography(GEOMETRY,4326)), CAST(ST_GeogFromText(%(param_1)s) AS geography(GEOMETRY,4326))) AS "ST_Distance_1", event_dates.id AS event_dates_id, event_dates.created_at AS event_dates_created_at, event_dates.event_id AS event_dates_event_id, event_dates.tz AS event_dates_tz, event_dates.start AS event_dates_start, event_dates."end" AS event_dates_end, event_dates.start_naive AS event_dates_start_naive, event_dates.end_naive AS event_dates_end_naive, event_dates.location_id AS event_dates_location_id, event_dates.description AS event_dates_description, event_dates.description_attribute AS event_dates_description_attribute, event_dates.url AS event_dates_url, event_dates.ticket_url AS event_dates_ticket_url, event_dates.cancelled AS event_dates_cancelled, event_dates.size AS event_dates_size
FROM event_dates JOIN events ON event_dates.event_id = events.id JOIN event_locations ON event_dates.location_id = event_locations.id
WHERE events.hidden = false AND event_dates.start >= %(start_1)s AND (event_locations.lat BETWEEN %(lat_1)s AND %(lat_2)s OR false) AND (event_locations.lng BETWEEN %(lng_1)s AND %(lng_2)s OR false) AND ST_DWithin(CAST(event_locations.geo AS geography(GEOMETRY,4326)), CAST(ST_GeogFromText(%(param_2)s) AS geography(GEOMETRY,4326)), %(ST_DWithin_1)s) ORDER BY event_dates.start ASC, ST_Distance(CAST(event_locations.geo AS geography(GEOMETRY,4326)), CAST(ST_GeogFromText(%(param_3)s) AS geography(GEOMETRY,4326))) ASC
I've tried a lot of different things and orderings but I can't work this out. I've also tried to create a subquery at the end using from_self() but it doesn't keep the ordering.
Any help would be much appreciated!
EDIT:
On further experimentation, it seems that order_by will only work if it's ordering by the same field that I'm using for distinct(). So
query = query.order_by(EventDate.event_id).distinct(EventDate.event_id)
will work, but
query.order_by(EventDate.start).distinct(EventDate.event_id)
will not :/
I solved this by adding a row_number column and then filtering on the first row number, as in this answer:
filter by row_number in sqlalchemy
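The underlying restriction is that PostgreSQL requires the DISTINCT ON expressions to match the leftmost ORDER BY expressions, which is why ordering by start while being distinct on the event id fails. The row-number approach sidesteps this. Below is a minimal sketch of the idea using Python's built-in sqlite3 module rather than SQLAlchemy and PostgreSQL, so that it runs without extra setup. The table contents are made up, the distance ordering is left out, and window functions need SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE event_dates (id INTEGER PRIMARY KEY, event_id INTEGER, start TEXT);
    INSERT INTO event_dates (id, event_id, start) VALUES
        (1, 10, '2024-05-01'),
        (2, 10, '2024-03-01'),
        (3, 20, '2024-04-01'),
        (4, 20, '2024-06-01'),
        (5, 30, '2024-02-01');
""")

# Number the dates of each event by start, keep only the first one,
# then order the surviving rows however the caller wants.
rows = conn.execute("""
    SELECT id, event_id, start
    FROM (
        SELECT id, event_id, start,
               ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY start) AS rn
        FROM event_dates
    ) AS ranked
    WHERE rn = 1
    ORDER BY start
""").fetchall()
print(rows)
```

The inner ORDER BY decides which date wins per event, while the outer ORDER BY is free to sort the result by any column, which is the separation DISTINCT ON does not allow.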

Get the item with the highest count

Can you please help me to get the item with the highest count using DAX?
Measure = FIRSTNONBLANK('Table1'[ItemName],CALCULATE(COUNT('Table2'[Instance])))
This shows the first ItemName in the table but doesn't get the ItemName with the highest count.
Thanks
Well, it's more complicated than I would have wanted, but here's what I came up with.
There are two things you are hoping to do that are not so straightforward in DAX. First, you want an aggregated aggregation ;) -- in this case, the Max of a Count. Second, you want to use a value from one column that you identify by what's in another column. That's row-based thinking, and DAX prefers column-based thinking.
So, to do the aggregate of aggregates, we just have to slog through it. SUMMARIZE gives us counts of items. Max and Rank functions could help us find the biggest count, but wouldn't be so useful for getting the item name. TOPN gives us the whole row where our count is the biggest.
But now we need to get our ItemName out of the row, so SELECTCOLUMNS lets us pick the field to work with. Finally, we really want a value, not a 1-column, 1-row table. So FIRSTNONBLANK finishes the job.
Hope it helps.
Here's my DAX
MostFrequentItem =
VAR SummaryTable = SUMMARIZE ( 'Table', 'Table'[ItemName], "CountsByItem", COUNT ( 'Table'[ItemName] ) )
VAR TopSummaryItemRow = TOPN(1, SummaryTable, [CountsByItem], DESC)
VAR TopItem = SELECTCOLUMNS (TopSummaryItemRow, "TopItemName", [ItemName])
RETURN FIRSTNONBLANK (TopItem, [TopItemName])
Here's the DAX without using variables (not tested, sorry. Should be close):
MostFrequentItem_2 =
FIRSTNONBLANK (
SELECTCOLUMNS (
TOPN (
1,
SUMMARIZE ( 'Table', 'Table'[ItemName], "Count", COUNT ( 'Table'[ItemName] ) ),
[Count], DESC
),
"ItemName", [ItemName]
),
[ItemName]
)
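For comparison, the same summarize, take-the-top-row, extract-the-name sequence looks like this in Python. It is only meant to show the shape of the computation. The item names are made up, and the way ties are resolved may differ from the DAX measure.

```python
from collections import Counter

# Hypothetical item names standing in for 'Table'[ItemName].
items = ["Apple", "Pear", "Apple", "Plum", "Pear", "Apple"]

counts = Counter(items)                          # like SUMMARIZE with COUNT
top_item, top_count = counts.most_common(1)[0]   # like TOPN(1, ..., DESC)
print(top_item, top_count)
```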
Here's the mock data:
let
Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WcipNSspJTS/NVYrVIZ/nnFmUnJOKznRJzSlJxMlyzi9PSs3JAbODElMyizNQmLEA", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type text) meta [Serialized.Text = true]) in type table [Stuff = _t]),
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Stuff", type text}}),
#"Renamed Columns" = Table.RenameColumns(#"Changed Type",{{"Stuff", "ItemName"}})
in
#"Renamed Columns"

How to sum consecutive rows in Power Query

I have in Power Query a Column "% sum of all". I need to create a custom column "Sum Consecutive" that each row has as value the "% sum of all" of the current row + the value of "Sum Consecutive" of the previous row.
Current row situation
New Custom Column Expectation
You can see two images that show the current situation and the next situation I need in the Power Query.
Can you please help me find a code/command to create this new column like that?
Although there are similar solved questions in DAX, I still need to keep editing the file after that, so it should be performed in M language in power query.
Thank you!
Not sure how performant my approaches are. I would think both should be reasonably efficient as they only loop over each row in the table once (and "remember" the work done in the previous rows). However, maybe the conversion to records/list and then back to table is slow for large tables (I don't know).
Approach 1: Isolate the input column as a list, transform the list by cumulatively adding, put the transformed list back in the table as a new column.
let
someTable = Table.FromColumns({List.Repeat({0.0093}, 7) & List.Repeat({0.0086}, 7) & {0.0068, 0.0068}}, {"% of sum of all"}),
listToLoopOver = someTable[#"% of sum of all"],
cumulativeSum = List.Accumulate(List.Positions(listToLoopOver), {}, (listState, currentIndex) =>
let
numberToAdd = listToLoopOver{currentIndex},
sum = try listState{currentIndex - 1} + numberToAdd otherwise numberToAdd,
append = listState & {sum}
in
append
),
backToTable = Table.FromColumns(Table.ToColumns(someTable) & {cumulativeSum}, Table.ColumnNames(someTable) & {"Cumulative sum"})
in
backToTable
Approach 2: Convert the table to a list of records, loop over each record and add a new field (representing the new column) to each record, then convert the transformed list of records back into a table.
let
someTable = Table.FromColumns({List.Repeat({0.0093}, 7) & List.Repeat({0.0086}, 7) & {0.0068, 0.0068}}, {"% of sum of all"}),
listToLoopOver = Table.ToRecords(someTable),
cumulativeSum = List.Accumulate(List.Positions(listToLoopOver), {}, (listState, currentIndex) =>
let
numberToAdd = Record.Field(listToLoopOver{currentIndex}, "% of sum of all"),
sum = try listState{currentIndex - 1}[Cumulative sum] + numberToAdd otherwise numberToAdd, // 'try' should only be necessary for first item
recordToAdd = listToLoopOver{currentIndex} & [Cumulative sum = sum],
append = listState & {recordToAdd}
in
append
),
backToTable = Table.FromRecords(cumulativeSum)
in
backToTable
I couldn't find a function in the reference for M/Power Query that sums a list cumulatively.
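As a sanity check on what both approaches should produce, here is the same running total computed in Python with `itertools.accumulate`, using the sample values from the M code above. This is only a reference for the expected numbers, not a replacement for the M code.

```python
from itertools import accumulate

# Same sample values as in the M code: 7 x 0.0093, 7 x 0.0086, 2 x 0.0068.
values = [0.0093] * 7 + [0.0086] * 7 + [0.0068, 0.0068]

# Each element is the current value plus the previous running total.
cumulative = list(accumulate(values))
print([round(x, 4) for x in cumulative])
```

The last running total should come out at roughly 0.1389, which is the sum of the whole column.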

SQL Query continues running for a very long time if search term not found

In my Azure hosted ASP.NET Core site I have a table of users and I implemented search as follows:
var inner = from user in db.Users
select new
{
Name = user.Name,
Verified = user.Verified,
PhotoURL = user.PhotoURL,
UserID = user.Id,
Subdomain = user.Subdomain,
Deleted=user.Deleted,
AppearInSearch = user.AppearInSearch
};
return await inner.Where(u=>u.Name.Contains(name)&& !u.Deleted && u.AppearInSearch)
.OrderByDescending(u => u.Verified)
.Skip(page * recordsInPage)
.Take(recordsInPage)
.Select(u => new UserSearchResult()
{
Name = u.Name,
Verified = u.Verified,
PhotoURL = u.PhotoURL,
UserID = u.UserID,
Subdomain = u.Subdomain
}).ToListAsync();
This translates to a SQL statement similar to the following:
SELECT [t].[Name], [t].[Verified],
[t].[PhotoURL], [t].[Id],
[t].[Subdomain], [t].[Deleted],
[t].[AppearInSearch]
FROM (
SELECT [user0].[Name], [user0].[Verified],
[user0].[PhotoURL], [user0].[Id],
[user0].[Subdomain], [user0].[Deleted],
[user0].[AppearInSearch]
FROM [AspNetUsers] AS [user0]
WHERE (((CHARINDEX('khaled', [user0].[Name]) > 0) OR ('khaled' = N''))
AND ([user0].[Deleted] = 0))
AND ([user0].[AppearInSearch] = 1)
ORDER BY [user0].[Verified] DESC
OFFSET 10 ROWS FETCH NEXT 10 ROWS ONLY ) AS [t]
If the search term is available in the database, the result is obtained in less than a second.
However, If it's not found the query runs for a very long time (I have seen it once reaching 48 seconds).
This greatly affects performance when we publish this feature to the internet.
Can you kindly suggest a way to solve this issue?
Thank you
Update: this issue is continued here: Empty Login Name When Showing sys.processes
You can already simplify your query like this:
int start=page * recordsInPage;
var inner = (from user in db.Users
where user.Name.Contains(name) && !user.Deleted && user.AppearInSearch
orderby user.Verified descending
select new
{
Name = user.Name,
Verified = user.Verified,
PhotoURL = user.PhotoURL,
UserID = user.Id,
Subdomain = user.Subdomain,
Deleted=user.Deleted,
AppearInSearch = user.AppearInSearch
}
).Skip(start).Take(recordsInPage);
return await inner.ToListAsync();
If you have a performance problem, try creating a stored procedure with your SQL and calling it from Entity Framework.
SQL Server has to use a scan to find rows matching the .Contains clause. There is no way around this.
However, if we reduce the amount of data that SQL server has to scan, we will speed up the query.
Covering filtered index
An index is "covering" if it contains all the data needed to be returned in a query.
CREATE INDEX IX_User_Name_filtered ON [AspNetUsers] ([Verified], [Name])
INCLUDE ( [PhotoURL], [Id], [Subdomain], [Deleted], [AppearInSearch] )
WHERE [AppearInSearch]=1 AND [Deleted]=0
This index is likely substantially smaller than the original table, so even if a scan is required, it will be quicker.
Depending on the plan that is generated, the following index may be a better choice. It doesn't include the extra columns and will be smaller still. Testing will be required to determine the best choice.
CREATE INDEX IX_User_Name_filtered ON [AspNetUsers] ([Verified], [Name])
WHERE [AppearInSearch]=1 AND [Deleted]=0