How to have if statements in iloc - pandas

I need to come up with a solution for the column "Loan_Term", which measures the duration of the loan applied for, in months.
Applicants asking for loans with loan terms of less than 120 months are given a rating of "Short" under loan tenure. Those with loan terms of at least 120 months, but less than 300 months, are rated as "Medium". Applicants asking for loans with loan terms of 300 or more months are rated as "Long".
Write a function that takes an applicant's numerical value of months of Loan_Term as an input parameter and returns the respective customer's rating. Create a new attribute "Loan_Tenure" for every applicant in df_loans.
Display df_loans with only the "Loan_Term" and "Loan_Tenure" attributes.
My code is as follows:
df_loans = df
df_loans.loc[(df_loans.Loan_Term < 120 return "Short" ) | (df_loans.Loan_Term > 120 & < 300 return "Medium") | (df_loans.Loan_Term > 300 return "Long")]
It is wrong, and I was wondering whether there is a way to display only this criterion in the table through loc, or must I use something else?

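For this you can use NumPy's select function (with numpy imported as np). Note that the boundaries follow the task: less than 120 is "Short", 120 up to but not including 300 is "Medium", and 300 or more is "Long".
df_loans['Loan_Tenure'] = np.select(
    [df_loans.Loan_Term < 120,
     (df_loans.Loan_Term >= 120) & (df_loans.Loan_Term < 300),
     df_loans.Loan_Term >= 300],
    ['Short', 'Medium', 'Long'],
    default='')  # a string default keeps the result dtype consistent with the labels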
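Since the task explicitly asks for a function, here is a minimal sketch of that variant. The function name loan_tenure and the sample Loan_Term values are made up for illustration; only the column names come from the question.

```python
import pandas as pd


def loan_tenure(months):
    """Map a loan term in months to the rating described in the task."""
    if months < 120:
        return "Short"
    if months < 300:
        return "Medium"
    return "Long"


# Hypothetical sample data; the real df_loans comes from the asker's dataset.
df_loans = pd.DataFrame({"Loan_Term": [60, 120, 299, 300, 360]})

# Apply the function to every applicant and store the result as a new column.
df_loans["Loan_Tenure"] = df_loans["Loan_Term"].apply(loan_tenure)

# Display only the two requested attributes.
print(df_loans[["Loan_Term", "Loan_Tenure"]])
```

For large frames the np.select version is usually faster, because apply calls the Python function once per row.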

Related

Count number of cases and add row to sum average cost

I have a large data set that I'm trying to export in a way I've never done before. There are dozens of columns with flags (0 or 1) to indicate whether a person has that trait. At the end each record has a total cost which sums up all money associated with that person. Sample below
ID Visit Stay Treatment Total Cost
1  0     1    1         $50
2  1     0    1         $100
I'm trying to get it into a format like so:
Visit Stay Treatment
1     1    2
$100  $50  $75
So that the number of flags is summed up per column and the average cost is below that. Hence, there are two treatments with an average cost of $75, and there's one stay with an average cost of $50.
I've tried GROUPING BY and a few other functions, but haven't been successful. Any help would be greatly appreciated!
We add up all we need and then use union all to unpivot.
select sum(visit) as visit
,sum(stay) as stay
,sum(treatment) as treatment
from t
union all
select sum(visit*total_cost)/sum(visit)
,sum(stay*total_cost)/sum(stay)
,sum(treatment*total_cost)/sum(treatment)
from t
visit stay treatment
1     1    2
100   50   75
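To check the arithmetic outside the database, the same two rows can be reproduced with a short pandas sketch. This is only an illustration on the sample data; it assumes the cost is stored as a plain number (without the "$" sign), and the names flags and summary are made up.

```python
import pandas as pd

# Sample data from the question, with Total Cost as a plain number.
df = pd.DataFrame({
    "ID": [1, 2],
    "Visit": [0, 1],
    "Stay": [1, 0],
    "Treatment": [1, 1],
    "Total Cost": [50, 100],
})

flags = ["Visit", "Stay", "Treatment"]

# Row 1: number of records carrying each flag.
counts = df[flags].sum()

# Row 2: total cost of flagged records divided by the flag count,
# the same sum(flag * total_cost) / sum(flag) idea as in the query.
avg_cost = df[flags].mul(df["Total Cost"], axis=0).sum() / counts

summary = pd.DataFrame([counts, avg_cost], index=["count", "avg_cost"])
print(summary)
```

A flag column that is 0 for every record would make the division yield NaN here; in SQL the equivalent division by zero would need guarding as well.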

Query smallest number of rows to match a given value threshold

I would like to create a query that operates similar to a cash register. Imagine a cash register full of coins of different sizes. I would like to retrieve a total value of coins in the fewest number of coins possible.
Given this table:
id value
1  100
2  100
3  500
4  500
5  1000
How would I query for a list of rows that:
has a total value of AT LEAST a given threshold
with the minimum excess value (value above the threshold)
in the fewest possible rows
For example, if my threshold is 1050, this would be the expected result:
id value
1  100
5  1000
I'm working with postgres and elixir/ecto. If it can be done in a single query great, if it requires a sequence of multiple queries no problem.
I had a go at this myself, using answers from previous questions:
Using ABS() to order by the closest value to the threshold
Select rows until a sum reduction of a single column reaches a threshold
Based on @TheImpaler's comment above, this prioritises minimum number of rows over minimum excess. It's not 100% what I was looking for, so I'm open to improvements if anyone has them, but if not I think this is going to be good enough:
-- outer query selects all rows underneath the threshold
-- inner subquery adds a running total column
-- window function orders by the difference between value and threshold
SELECT
    *
FROM (
    SELECT
        i.*,
        SUM(i.value) OVER (
            ORDER BY
                ABS(i.value - $THRESHOLD),
                i.id
        ) AS total
    FROM
        inputs i
) t
WHERE
    t.total - t.value < $THRESHOLD;
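For comparison, the original requirement (minimum excess first, then fewest rows) can be stated as an exhaustive search. The sketch below is not a translation of the query above and not something from the thread; it is a brute-force Python illustration that tries every combination, so it is only practical for small tables. The function name pick_rows is made up.

```python
from itertools import combinations


def pick_rows(rows, threshold):
    """Return the rows whose total is at least `threshold`, minimising
    the excess first and the number of rows second.

    `rows` is a list of (id, value) tuples. All 2**n subsets are tried.
    """
    best_key = None
    best_combo = ()
    for size in range(1, len(rows) + 1):
        for combo in combinations(rows, size):
            total = sum(value for _, value in combo)
            if total < threshold:
                continue
            key = (total - threshold, size)  # (excess, row count)
            if best_key is None or key < best_key:
                best_key, best_combo = key, combo
    return list(best_combo)


rows = [(1, 100), (2, 100), (3, 500), (4, 500), (5, 1000)]
print(pick_rows(rows, 1050))  # [(1, 100), (5, 1000)]
```

Swapping the two elements of key would prioritise the fewest rows over the smallest excess, which is the trade-off the query makes.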

Counting latest instance of multiple only based on filter context

I've got a large table of events that have occurred in an inventory of vehicles, which affect whether they are in service or out of service. I would like to create a measure that would be able to count the number of vehicles in the various inventories at any point in time, based on the events in this table.
This table is pulled from a SQL database into an Excel 2016 sheet, and I'm using PowerPivot to try to come up with the DAX measure.
Here is some example data event_list:
vehicle_id event_date event event_sequence inventory
100 2018-01-01 purchase 1 in-service
101 2018-01-01 purchase 1 in-service
102 2018-02-04 purchase 1 in-service
100 2018-02-07 maintenance 2 out-of-service
101 2018-02-14 damage 2 out-of-service
101 2018-02-18 repaired 3 in-service
100 2018-03-15 repaired 3 in-service
102 2018-05-01 damage 2 out-of-service
103 2018-06-03 purchase 1 in-service
I'd like to be able to create a pivot table in Excel (or use CUBE functions, etc) to get an output table like this:
date in-service out-of-service
2018-02-04 3 0
2018-02-14 1 2
2018-03-15 3 0
2018-06-03 3 1
Essentially, I want to be able to calculate the inventory based on any date in time. The example only has a few dates, but hopefully provides enough of a picture.
I've basically come up with this so far, but it counts more vehicles than desired - I can't figure out how to only take the latest event_sequence or event_date and use that to count the inventory.
cumulative_vehicles_at_date:=CALCULATE(
COUNTA([vehicle_id]),
IF(IF(HASONEVALUE (event_list[event_date]), VALUES (event_list[event_date]))>=event_list[event_date],event_list[event_date])
)
I tried using MAX() and EARLIER() functions, but they don't seem to work.
Edit: Added the PowerBI tag as I'm now using that software to attempt to solve this as well. See comments on Alexis Olson's answer.
I think I've found a much cleaner method than I gave previously.
Let's add two columns onto the event_list table. One which counts vehicles "in-service" on that date and one which counts vehicles "out-of-service" on that date.
InService =
VAR Summary =
    SUMMARIZE(
        FILTER(event_list,
            event_list[event_date] <= EARLIER(event_list[event_date])),
        event_list[vehicle_id],
        "MaxSeq", MAX(event_list[event_sequence]))
VAR Filtered =
    FILTER(event_list,
        event_list[event_sequence] =
            MAXX(
                FILTER(Summary,
                    event_list[vehicle_id] = EARLIER(event_list[vehicle_id])),
                [MaxSeq]))
RETURN
    SUMX(Filtered, 1 * (event_list[inventory] = "in-service"))
You can create an analogous calculated column for OutOfService or you can just take the total minus the InService count.
OutOfService =
CALCULATE(
    DISTINCTCOUNT(event_list[vehicle_id]),
    FILTER(event_list,
        event_list[event_date] <= EARLIER(event_list[event_date])))
    - event_list[InService]
Now all you have to do is put event_date on the matrix visual rows section and add the InService and OutOfService columns to the values section (use Maximum or Minimum for the aggregation option rather than Sum).
Here's the logic behind the calculated column InService:
We first create a Summary table which calculates the maximal event_sequence value for each vehicle. (We filter the event_date to only consider dates up to the current one we are working with.)
Now that we know what the last event_sequence value is for each vehicle, we use that to filter the entire table down to just the rows that correspond to those vehicles and sequence values. The filter goes through the table row by row and checks to see if the sequence value matches the one we calculated in the Summary table. Note that when we filter the Summary table to just the vehicle we are currently working with, we only get a single row. I'm just using MAXX to extract the [MaxSeq] value. (It's kind of like using LOOKUPVALUE, but you can't use that on a variable.)
Now that we've filtered the table just to the most recent events for each vehicle, all we need to do is count how many of them are "in-service". I used a SUMX here where the 1*(True/False) coerces the boolean value to return 1 or 0.
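The three steps above can also be sketched outside DAX. The following pandas snippet is only an illustration of the same logic on the sample data from the question (keep events up to a date, take each vehicle's latest event_sequence, count by inventory). The function name inventory_at is made up.

```python
import pandas as pd

# Sample event_list from the question.
events = pd.DataFrame({
    "vehicle_id": [100, 101, 102, 100, 101, 101, 100, 102, 103],
    "event_date": pd.to_datetime([
        "2018-01-01", "2018-01-01", "2018-02-04", "2018-02-07", "2018-02-14",
        "2018-02-18", "2018-03-15", "2018-05-01", "2018-06-03",
    ]),
    "event_sequence": [1, 1, 1, 2, 2, 3, 3, 2, 1],
    "inventory": [
        "in-service", "in-service", "in-service", "out-of-service",
        "out-of-service", "in-service", "in-service", "out-of-service",
        "in-service",
    ],
})


def inventory_at(events, as_of):
    """Count vehicles per inventory state as of the given date."""
    # Step 1: only consider events up to the date of interest.
    past = events[events["event_date"] <= pd.Timestamp(as_of)]
    # Step 2: keep the row with the highest event_sequence per vehicle.
    latest = past.sort_values("event_sequence").groupby("vehicle_id").tail(1)
    # Step 3: count the surviving rows by inventory state.
    counts = latest["inventory"].value_counts()
    return {
        "in-service": int(counts.get("in-service", 0)),
        "out-of-service": int(counts.get("out-of-service", 0)),
    }


for day in ["2018-02-04", "2018-02-14", "2018-03-15", "2018-06-03"]:
    print(day, inventory_at(events, day))
```

On the sample data this reproduces the four rows of the desired output table in the question.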
This is pretty difficult. I don't have a great answer, but here's something that kind of works.
You'll create a new calculated table where you'll calculate the status for each vehicle on each date. Start with the base cross join for each vehicle and each date:
= CROSSJOIN(VALUES(event_list[vehicle_id]), VALUES(event_list[event_date]))
Then add a calculated column to find the max sequence number for each vehicle on that date.
Sequence = MAXX(
    FILTER(event_list,
        event_list[event_date] <= Cross[event_date] &&
        event_list[vehicle_id] = Cross[vehicle_id]),
    event_list[event_sequence])
Now you can lookup the inventory value for each vehicle/sequence pair with another calculated column:
Inventory = LOOKUPVALUE(
    event_list[inventory],
    event_list[vehicle_id], Cross[vehicle_id],
    event_list[event_sequence], Cross[Sequence])
The result should look something like this:
Once you have this, you can create a matrix using this calculated table. Put the event_date on the rows and Inventory on the columns. Filter out blank inventory values in the visual level filter and put the vehicle_id in the values field, using a count or distinct count as the aggregation method (instead of the default sum).
It should look like this:

counting and numbering in a select statement in Access SQL

Could you please help me figuring out how to accomplish the following.
I have a table containing the number of products available between one date and another as per below:
TABLE MyProducts
DateProduct ProductId Quantity Price
26/02/2016 7 2 100
27/02/2016 7 3 100
28/02/2016 7 4 100
I have created a form where users need to select a date range and the number of products they are looking for (in my example, the number of products is going to be 1).
In this example, let's say that a user makes the following selection:
SELECT SUM(MyProducts.Price) As TotalPrice
FROM MyProducts WHERE MyProducts.DateProduct
Between #2/26/2016# And #2/29/2016#-1 AND MyProducts.Quantity>=1
Now the user can see the total amount that 1 product costs: 300
For this date range, however, I also want to let users select from a combobox the number of products that they can still buy: if you look at the Quantity for this date range, a user can buy a maximum of 2 products, because 2 is the lowest quantity available in common across all the dates listed in the query.
First question: how can I feed the combobox with a "1 to 2" list (in this case) considering that 2 is lowest quantity available in common for all the dates queried by this user?
Second question: how can I manage the products that a user has purchased.
Let's say that a user has purchased 1 product within this date range, and a second user has purchased the same quantity (1) for the very same date range, for a total of 2 products already purchased in this date range. How can I see that, for this date range and in this case, the number of products actually available is:
DateProduct ProductId Quantity Price
26/02/2016 7 0 100
27/02/2016 7 1 100
28/02/2016 7 2 100
Thank you in advance and please let me know should you need further information.
You could create a table with an integer field counting from 1 to whatever max qty you could expect. Then create a query that will only return rows from your new table up to the min() qty in the MyProducts table. Use that query as the row source of your combobox.
EDIT: You will actually need two queries. The first should be:
SELECT Min(MyProducts.Quantity) AS MinQty FROM MyProducts;
which I called "qryMinimumProductQty". I then created a table called "Numbering" with a single integer field called "Sequence". The second query:
SELECT Numbering.Sequence FROM Numbering, qryMinimumProductQty WHERE Numbering.Sequence<=qryMinimumProductQty.MinQty;
AFAIK there is no Access function/feature that will fill in a series of numbers in a combobox row source. You have to build the row source yourself. (Anyone with more VBA experience might have a solution to solve this, but I do not.)
It makes me ache thinking of an entire table with a single integer column only being used for a combobox though. A simpler approach to the combobox would just to show the qty available in a control on your form, give an unbound text box for the user to enter their order qty, and add a validation rule to stop the order and notify them if they have chosen a number greater than the qty on hand. (Just a thought)
As for your second question, I don't really understand what you're looking for either. It sounds like there may be another table of purchases? It should be a simple query to relate MyProducts to Purchases and take the difference between your MyProducts!qty and the Purchases!qty. If you don't have a table to store Purchases, it might be warranted based on my cursory understanding of your system.
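To make the arithmetic behind both questions concrete, here is a small language-neutral sketch in Python rather than Access SQL. It assumes purchases are recorded as simple quantities applying to the whole date range, which is a guess at the asker's design; all variable names are made up.

```python
# Quantities per date from the MyProducts sample in the question.
quantities = {"26/02/2016": 2, "27/02/2016": 3, "28/02/2016": 4}

# First question: the combobox should offer 1..min(Quantity) for the range.
max_purchasable = min(quantities.values())
combo_options = list(range(1, max_purchasable + 1))
print(combo_options)  # [1, 2]

# Second question: two users each bought 1 product for the whole range
# (assumed to be stored as one quantity per purchase).
purchases = [1, 1]
purchased_total = sum(purchases)

# Remaining stock per date is the original quantity minus what was sold.
remaining = {day: qty - purchased_total for day, qty in quantities.items()}
print(remaining)
```

In Access the same subtraction would come from joining MyProducts to a purchases table, as the answer suggests.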

Mean of variable at the selection (SAS)

For example, I have a table A with 2 variables: the first is a customer id, and the second is the customer's income, which ranges from 100 to 200 US dollars. The task is to create a table B containing customers whose mean income is 150 USD, with as many customers as possible. In other words, table B should contain the maximal number of customers from table A such that the mean income among them is exactly 150. Is there an elegant approach using SAS Enterprise Guide?
Sort the records by income, low to high. Then compute the mean of all records 1 - N. Find N where mean = 150.
data test;
    do id = 1 to 1000;
        income = 100 + round(ranuni(1)*100,1);
        output;
    end;
run;

proc sort data=test;
    by income;
run;

data want(where=(ave<=150));
    set test;
    retain sum 0;
    sum = sum + income;
    ave = sum / _n_;
    drop sum;
run;
You want as many low values as possible. This then lets you add large values to get the mean to 150. So sorting by income should give you what you want.
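The sorted running-mean idea can be sketched in Python as follows. This is an illustration rather than a translation of the SAS step; the function name and the sample incomes are made up. Like the SAS code, it keeps the longest sorted prefix whose mean does not exceed the target, so the final mean may fall slightly below 150 rather than hit it exactly.

```python
def longest_prefix_with_mean_at_most(incomes, target):
    """Sort ascending and keep records while the running mean <= target."""
    kept = []
    total = 0
    for income in sorted(incomes):
        # The running mean of a sorted list never decreases, so the
        # first time it exceeds the target we can stop.
        if (total + income) / (len(kept) + 1) > target:
            break
        total += income
        kept.append(income)
    return kept


sample = [200, 100, 180, 120, 150, 200]
selected = longest_prefix_with_mean_at_most(sample, 150)
print(selected, sum(selected) / len(selected))
```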
A greedy algorithm might do the job well-enough, depending on the structure of the data. This is definitely not guaranteed to be optimal, but it can be implemented relatively fast.
The idea is:
1. Calculate the average of all the records
2. If the average is $150 then stop
3. Remove the largest/smallest value to increase or decrease the average, as appropriate
4. If the average is $150 then stop
5. Repeat (1) until finished
This should work pretty well if the values cluster around $150. If they are widely dispersed, then you might not get any records in the final bins.
If the algorithm works on your data, then there may be faster ways of implementing it.
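A minimal Python sketch of this greedy idea, under the assumption that incomes are whole numbers so the mean can be compared exactly. The function name greedy_trim_to_mean is made up, and as noted above the result is not guaranteed to be the largest possible set.

```python
def greedy_trim_to_mean(incomes, target):
    """Drop the largest or smallest value until the mean equals `target`.

    Returns the remaining values, or an empty list if the target mean
    is never reached.
    """
    values = sorted(incomes)
    lo, hi = 0, len(values)  # the current selection is values[lo:hi]
    total = sum(values)
    while hi > lo:
        count = hi - lo
        # Compare total with target * count to avoid float rounding.
        if total == target * count:
            return values[lo:hi]
        if total > target * count:
            hi -= 1                # mean too high: drop the largest value
            total -= values[hi]
        else:
            total -= values[lo]    # mean too low: drop the smallest value
            lo += 1
    return []


print(greedy_trim_to_mean([200, 100, 180, 120, 150, 200], 150))
```

Because the list is sorted once and values are only removed from the two ends, each step is constant time after the initial sort.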