How to divide customers into income buckets using pyspark?
Assume I have two columns named Customer and Income, with Income ranging from 10K to 50K.
I want to divide the customers into income buckets (10K-20K, 20K-30K, etc.) using a PySpark DataFrame.
TIA!
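One way to approach this is to derive each bucket label from the income with floor division. The sketch below shows only that arithmetic in plain Python; the function name `income_bucket`, the 10K bucket width, and the sample customers are assumptions for illustration and are not part of the question. In PySpark the same expression could presumably be built from the Income column, for example with `floor` from `pyspark.sql.functions`.

```python
def income_bucket(income, width=10_000):
    """Return a label such as '10K-20K' for the bucket that contains income."""
    lower = (income // width) * width  # round down to the bucket's lower edge
    upper = lower + width
    return f"{lower // 1000}K-{upper // 1000}K"


# Hypothetical sample rows (Customer, Income).
customers = [("A", 12_500), ("B", 20_000), ("C", 37_999), ("D", 49_000)]
buckets = {name: income_bucket(income) for name, income in customers}
print(buckets)
```

With this rule a boundary value belongs to the upper bucket, so 20,000 is labelled 20K-30K, and an income of exactly 50K would be labelled 50K-60K unless it is clamped to the last bucket.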
Related
I am trying to find the average household income while using a weighted average. I have a data source of ZIP codes with the total population and the average household income. I want to be able to select multiple ZIP codes and still pull an accurate average household income.
Can I use SQL to pull a weighted average like this?
ZIP TOTAL_POP AVG_HH_INC
12345 130350 66000
54321 55750 78000
44668 17300 89000
If you want the overall average, then use arithmetic:
select sum(total_pop * avg_hh_inc) / sum(total_pop)
from t;
Note: If the values are stored as 4-byte integers, then this runs the risk of overflow. Just use a different numeric representation if that is an issue.
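To check the arithmetic, the query can be run against the three rows from the question. This is a minimal sketch using Python's built-in `sqlite3` module; the in-memory database and the lowercase column names are assumptions, and the `1.0 *` factor is added here so that SQLite does not truncate the result to an integer.

```python
import sqlite3

# In-memory table holding the three rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (zip TEXT, total_pop INTEGER, avg_hh_inc INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [("12345", 130350, 66000), ("54321", 55750, 78000), ("44668", 17300, 89000)],
)

# Multiplying by 1.0 forces floating-point arithmetic, so the division
# is not truncated to an integer.
(weighted_avg,) = conn.execute(
    "SELECT SUM(1.0 * total_pop * avg_hh_inc) / SUM(total_pop) FROM t"
).fetchone()
print(round(weighted_avg, 2))
conn.close()
```

A `WHERE zip IN (...)` clause restricts the same calculation to a selection of ZIP codes. The overflow warning applies to this data: the first product alone (130350 × 66000 = 8,603,100,000) already exceeds the 4-byte signed integer maximum of 2,147,483,647.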
Need this to average = 7.41
But this happens when I use AVG
I am trying to calculate the daily average of each employee based on the number of days worked.
I already used a pivot table to calculate the daily total hours of each employee per day but I cannot figure out how to get the pivot table to display the average work day. When I alter the field settings, it averages the source data which I do not need.
The employees worked different numbers of days, so I need the average to be calculated over the number of days each employee actually worked. When I highlight the first employee's data, Excel returns an average of 7.41.
Further down the list, though, there are employees with 0.00 hours for a date, and those zero entries are being included in the averages.
How do I get this pivot table to give me a true snapshot of the persons daily hours worked - without having to manually delete 0.00 hour instances in the source data?
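The calculation being asked for is total hours divided by the number of days with non-zero hours, per employee. The sketch below spells that out in Python; the employee names, dates, and hours are hypothetical sample data, not taken from the question. In Excel itself, a formula such as AVERAGEIF with a "<>0" criterion expresses the same idea outside the pivot table.

```python
from collections import defaultdict

# Hypothetical (employee, date, hours) rows; 0.0 marks a day not worked.
rows = [
    ("Ann", "2020-01-06", 8.0),
    ("Ann", "2020-01-07", 0.0),
    ("Ann", "2020-01-08", 7.0),
    ("Bob", "2020-01-06", 6.5),
    ("Bob", "2020-01-07", 7.5),
]

worked = defaultdict(list)
for employee, _date, hours in rows:
    if hours > 0:  # skip zero-hour days so they do not dilute the mean
        worked[employee].append(hours)

# Average over the days each employee actually worked.
daily_avg = {emp: sum(h) / len(h) for emp, h in worked.items()}
print(daily_avg)
```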
This question is similar to my previous one: Shifting elements of column based on index given condition on another column
I have a dataframe (df) with 2 columns and 1 index.
The index is a datetime index in the format 2001-01-30, ordered by date, with thousands of repeated dates (the dates are monthly). Column A is the company name, and Column B is the share price of that company on the date in the index.
Now there are multiple companies in Column A for each date, and companies do vary over time (so the data is not predictable fully).
I want to create a column C holding the 3-day rolling exponentially weighted average of the price for each company in column A, using the current date and the 2 dates before it for that company.
I have tried a few methods but have failed. Thanks.
Try:
df.groupby('ColumnA', as_index=False).apply(lambda g: g.ColumnB.ewm(3).mean())
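Note that in the one-liner above, `ewm(3)` passes 3 as the first positional argument, which pandas treats as `com` (center of mass), and the result is an expanding average over all earlier rows of the group, not a 3-row window. If a strict window of the current and two previous observations per company is wanted, the arithmetic could look like the plain-Python sketch below. The function name `rolling_ewm`, the smoothing factor `alpha=0.5`, and the sample prices are assumptions for illustration.

```python
from collections import defaultdict, deque


def rolling_ewm(records, window=3, alpha=0.5):
    """Return (date, company, price, ewm) rows, where ewm covers the last
    `window` prices of the same company, weighted by (1 - alpha) ** age."""
    history = defaultdict(lambda: deque(maxlen=window))
    out = []
    for date, company, price in records:  # records must be sorted by date
        history[company].append(price)
        prices = list(history[company])[::-1]  # newest first
        weights = [(1 - alpha) ** k for k in range(len(prices))]
        ewm = sum(w * p for w, p in zip(weights, prices)) / sum(weights)
        out.append((date, company, price, ewm))
    return out


# Hypothetical monthly prices for two companies.
records = [
    ("2001-01-30", "X", 10.0),
    ("2001-01-30", "Y", 5.0),
    ("2001-02-28", "X", 20.0),
    ("2001-02-28", "Y", 5.0),
    ("2001-03-30", "X", 40.0),
]
result = rolling_ewm(records)
for row in result:
    print(row)
```

Early rows of each company use however many observations are available, so the first value for a company is simply its first price.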
I am a graduate intern at a big company and I'm having some trouble with creating a measure in PowerPivot.
I'm quite new to PowerPivot and I need some help. I am the first person to use PowerPivot in this office, so I can't ask for help here.
I have a fact table that has basically all journal entries. See next table. All entries are done with a unique ID (serialnumber) for every product
ID DATE ACCOUNT# AMOUNT
110 2010-1-1 900 $1000
There is a dimension table that allocates every account to a specific country and to either expense or revenue.
ACCOUNT# Expense Country
900 Revenue Germany
And another dimension table to split the dates.
The third dimension table contains product information, but also contains a column with a certain expense (Expense X).
ID Expense X ProductName Productcolour
110 $50 Flower Green
I made sure I made the correct relations between the tables of course. And slicing works in general.
To calculate the margin I need to deduct this expense x from the revenue. I already made a measure that shows total Revenue, that one was easy.
Now I need a measure to show the total for Expense X, related to productID. So I can slice in a pivot table on date and product name etc.
The problem is that I can't use the RELATED function because the serial number is used multiple times in the fact table (journal entries can have the same serial number).
And if I use the SUM or CALCULATE function it won't slice properly.
So how can I calculate the total for expense X so it will slice properly?
Check the function RELATEDTABLE.
If you create a dummy dataset I can play around and send you a solution.
I am making a basic hospital management system in Access 2013. I have two tables named "Bed" and "Receipt".
Bed(BedID,AssignedDate,PatientID,DischargeDate,BedCharges)
Receipt(ReceiptID,PatientID,BedCharges)
I want to calculate "BedCharges" by calculating the number of days using "AssignedDate" and "DischargeDate" and then multiplying with a constant amount of charges per day.
The BedCharges calculated in the "Bed" table also needs to be in the "Receipt" table.
How can I count the number of days and then calculate the "BedCharges" in both the tables?
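The calculation itself is the day difference between the two dates multiplied by the daily rate. The Python sketch below shows that arithmetic only; the rate of 500, the function name `bed_charges`, and the sample dates are hypothetical. In Access, the day count would typically come from the built-in `DateDiff` function with the "d" interval.

```python
from datetime import date

DAILY_RATE = 500  # hypothetical constant charge per day


def bed_charges(assigned, discharged, daily_rate=DAILY_RATE):
    """Return the charge for a stay from `assigned` to `discharged`."""
    days = (discharged - assigned).days  # whole days between the two dates
    return days * daily_rate


charge = bed_charges(date(2015, 3, 1), date(2015, 3, 4))
print(charge)
```

A same-day discharge gives 0 days under this rule; whether the assigned day itself should be billed is a decision the sketch does not make.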