Unable to create new features in Machine learning - pandas

I have a dataset loaded into a pandas DataFrame named df.
The dataset has 50,000 rows; here are the first 5:
Name_Restaurant     cuisines_available           Average cost
Food Heart          Japnese, chinese             60$
Spice n Hungary     Indian, American, mexican    42$
kfc, Lukestreet     Thai, Japnese                29$
Brown bread shop    American                     11$
kfc, Hypert mall    Thai, Japnese                40$
I want to create a column that contains the number of cuisines available.
I tried this code:
df['no._of_cuisines_available']=df['cuisines_available'].str.len()
But instead of showing the number of cuisines, it shows the number of characters.
For example, for the first row the output should be 2, but it shows 17.
I also need a new column that contains the number of stores for each restaurant. For example,
here kfc has 2 stores: kfc, Lukestreet and kfc, Hypert mall. I have
no idea how to code this.

i)
df['cuisines_available'].str.split(',').apply(len)
ii)
df['Name_Restaurant'].str.split(',', expand=True).melt()['value'].str.strip().value_counts()
What ii) does: split each name at ',' and store the resulting strings in separate columns. Then use melt to stack them into one long column, strip surrounding whitespace, and count how often each entry occurs.
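A minimal sketch of both ideas on the five sample rows. Treating the text before the first comma as the restaurant brand is an assumption made here for illustration (it differs slightly from ii), which counts every comma-separated piece), and the new column names are hypothetical.

```python
import pandas as pd

# The five sample rows from the question.
df = pd.DataFrame({
    "Name_Restaurant": ["Food Heart", "Spice n Hungary", "kfc, Lukestreet",
                        "Brown bread shop", "kfc, Hypert mall"],
    "cuisines_available": ["Japnese, chinese", "Indian, American, mexican",
                           "Thai, Japnese", "American", "Thai, Japnese"],
})

# i) Split on commas and count the pieces in each list.
df["no_of_cuisines_available"] = df["cuisines_available"].str.split(",").str.len()

# ii) Assumption: the brand is whatever precedes the first comma.
brand = df["Name_Restaurant"].str.split(",").str[0].str.strip()

# Map each row's brand to how often that brand occurs in the column.
df["no_of_stores"] = brand.map(brand.value_counts())

print(df[["Name_Restaurant", "no_of_cuisines_available", "no_of_stores"]])
```

Mapping the counts back onto the rows keeps the result aligned with df, so it can be stored directly as a column.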

Related

How to sum data based on a boolean amount?

I created a dictionary called items, then joined the values under the 'Food' key into a variable food_list.
items={'Food':['Ice Cream','Salad'],'Computer':['Laptop','Notebook']}
food_list= '|'.join(items['Food'])
Description            Amount
Lenovo Laptop          300
Chicken Salad          40
Dell Notebook          250
Chocolate Ice Cream    3
I want to find the rows whose Description contains any of the strings stored under a dictionary key. For every row that matches, I take the associated amount and add up the amounts of all matching rows.
total_amount=df.loc[df['Description'].str.contains(food_list,na=False)
==df['Amount'].sum()]
When I run the code I get
Empty DataFrame
Columns: [Date, Description, Amount]
Index: []
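In the attempt above, the boolean mask appears to be compared with the scalar df['Amount'].sum(), which is False for every row, so .loc selects nothing. A possible alternative, sketched on the four sample rows (the Date column is omitted here), is to use the mask to select rows and then sum only the Amount column:

```python
import pandas as pd

items = {"Food": ["Ice Cream", "Salad"], "Computer": ["Laptop", "Notebook"]}

# Regex alternation: matches a description containing any food item.
food_list = "|".join(items["Food"])

df = pd.DataFrame({
    "Description": ["Lenovo Laptop", "Chicken Salad",
                    "Dell Notebook", "Chocolate Ice Cream"],
    "Amount": [300, 40, 250, 3],
})

# Boolean mask of rows whose description mentions a food item.
is_food = df["Description"].str.contains(food_list, na=False)

# Select the matching rows, keep only Amount, then add them up.
total_amount = df.loc[is_food, "Amount"].sum()
print(total_amount)
```

Because food_list is interpreted as a regular expression, items containing characters such as '+' or '(' would need escaping first, for example with re.escape.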

Parse data from Morningstar Direct to worksheet

I have to put together a report every quarter using data pulled from Morningstar Direct. I have to automate the whole process, or at least parts of it. We have put this report together for the last two quarters, and we use the same format each time. So we already have the general templates for the report; now I'm just looking for a way to pull the data from Morningstar and put it into the templates correctly.
Does anyone have any general idea where I should start?
A                 B             C         D        E      F
Group             Name          Weight    Gross    Net    Contribution
Equity                          25%       10%      8%     .25
                  IBM           5%        15%      12%
                  AAPL          7%        23%      18%
Fixed Income                    25%       5%       4%     .17
                  10 Yr Bond    10%       7%       5%
Emerging Mrkts
And it goes on breaking things into more groups, and there are many more holdings within each group.
What I want it to do is search until it finds "Equity", for example, and then go over one row, grab the name of the position, its weight, and its net return, and do that for each holding in Equity. Then I want it to do the same thing in Fixed Income, and so on, selecting the names, weights, and nets for each holding, and then copy and paste them into another workbook.
Is there any way that is possible?
It sounds like you need to parse your information. By using left(), right(), and mid() you can select the good data and ignore the superfluous. You could separate the data in one cell into multiple cells in the desired format.
A                 B
Name              Address
John Q. Public    123 My Street, City, State, Zip

E (First Name):     =LEFT(A2,FIND(" ",A2))
F (Middle Initial): =MID(A2,LEN(E2)+1,FIND(" ",MID(A2,LEN(E2)-1,99)))   (extra work to program missing data)
G (Last Name):      =MID(A2,(LEN(E2)+LEN(F2)+2),99)
H (City):           =MID(B2,LEN(H2)+2,FIND(",",MID(B2,LEN(H2)+2,99))-1)
I (State):          =MID(B2,(LEN(I2)+LEN(H2)+4),FIND(",",MID(B2,(LEN(I2)+LEN(H2)+4),99))-1)
J (Zip Code):       =MID(B2,(LEN(H2)+LEN(I2)+LEN(J2)+6),99)
This code will parse the name in the cell A2 and address in cell B2 into separate fields.
Similar cuts should allow you to get rid of the unwanted data.
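For comparison, the same kind of cut can be sketched outside the spreadsheet. This is an illustration in Python rather than the worksheet formulas above; parse_name and parse_address are hypothetical helper names, and the sketch assumes names are space-separated and address parts are comma-separated, as in the sample row.

```python
def parse_name(full_name):
    """Split a name into (first, middle, last); middle may be empty."""
    parts = full_name.split()
    first = parts[0]
    last = parts[-1] if len(parts) > 1 else ""
    # Everything between the first and last word counts as the middle part.
    middle = " ".join(parts[1:-1])
    return first, middle, last


def parse_address(address):
    """Split an address on commas and trim whitespace around each part."""
    return [part.strip() for part in address.split(",")]


print(parse_name("John Q. Public"))
print(parse_address("123 My Street, City, State, Zip"))
```

A name without a middle initial simply yields an empty middle field, which is the "missing data" case that needs extra work in the formula version.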
==================================================================
7/8/2015
Your data seems to be your desired output. If so, please provide sanitized input data for comparison. You probably need to loop through your input to find the groups. When the group changes, prepare the summary figures.
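The loop described here could look roughly like the following Python sketch. It assumes the layout shown in the question (group name in column A, holding name in column B, then weight, gross, net, contribution) and that the sheet has already been read into a list of tuples; holdings_by_group is a hypothetical name.

```python
# Rows as they might come out of the worksheet, one tuple per row
# (columns A to F); empty cells are empty strings.
rows = [
    ("Equity", "", "25%", "10%", "8%", ".25"),
    ("", "IBM", "5%", "15%", "12%", ""),
    ("", "AAPL", "7%", "23%", "18%", ""),
    ("Fixed Income", "", "25%", "5%", "4%", ".17"),
    ("", "10 Yr Bond", "10%", "7%", "5%", ""),
]


def holdings_by_group(rows):
    """Collect (name, weight, net) for each holding under its group."""
    result = {}
    current_group = None
    for group, name, weight, gross, net, contribution in rows:
        if group:
            # A value in column A starts a new group.
            current_group = group
            result[current_group] = []
        elif name and current_group is not None:
            # A value in column B is a holding of the current group.
            result[current_group].append((name, weight, net))
    return result


for group, holdings in holdings_by_group(rows).items():
    print(group, holdings)
```

Each group's list can then be written to the matching section of the report template.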

Repetition while copying data to SQL table from multiple sheets

I have to copy data from multiple Excel sheets into a single SQL table.
Excel inputs:
Sheet1 has columns fname (a, b) and lname (c, d): 2 rows.
Sheet2 has columns city (boston, austin) and state (ma, tx): 2 rows.
My output (tMSSqlOutput) has 4 rows instead of 2:
a c boston ma, a c austin tx, b d boston ma, b d austin tx.
Desired output: a c boston ma, b d austin tx (2 rows only).
How do I manage this?
As per the comments, you don't have a natural key to join the two data sets. Instead you could generate a sequence for each data set that would increment equally for both data sets and would equate to being your row number on each data set.
First of all, this should set alarm bells ringing about the state of your data and how you can be sure that row n in one data set definitely corresponds to row n in another data set. It smacks of something being badly normalised out without proper keys being added and it can be very dangerous to assume that the resulting data set from this is going to be accurate.
If you absolutely must do this, however, then you should assign a Numeric.sequence to each of your data sets. You can do this in a tMap that precedes your joining tMap:
Notice the "s1" parameter to the Numeric.sequence. If you reuse this elsewhere then it will increment this one rather than starting from 1 so typically you would want to choose a unique name for each sequence you have in your job (although there are obviously occasions where incrementing a previously defined sequence is what you desire).
Once you have defined a unique sequence with the same starting numbers (the second parameter) and the same increment numbers (the third parameter) then you should be able to create a join on these instances:
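The idea itself, independent of Talend, can be illustrated with a short Python sketch: give each data set a sequence with the same start and step, then join on that sequence. This is not Talend code; add_sequence and the "seq" key are hypothetical names chosen for the illustration.

```python
from itertools import count

sheet1 = [{"fname": "a", "lname": "c"}, {"fname": "b", "lname": "d"}]
sheet2 = [{"city": "boston", "state": "ma"}, {"city": "austin", "state": "tx"}]


def add_sequence(rows, start=1, step=1):
    """Attach a row number to each row; each call uses its own counter."""
    seq = count(start, step)
    return [{"seq": next(seq), **row} for row in rows]


left = add_sequence(sheet1)

# Index the second data set by its sequence number for the lookup.
right = {row["seq"]: row for row in add_sequence(sheet2)}

# Inner join on the sequence: row n of sheet1 meets row n of sheet2.
joined = [{**l, **right[l["seq"]]} for l in left if l["seq"] in right]

for row in joined:
    print(row)
```

Joining on the sequence yields one output row per input row instead of the 2 x 2 cross product, but it is only correct if both sheets really are in matching order.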

How to use SSRS to report using a custom, "matrix-ized" table

First up, my environment: SQL 2005 + MS DAX 2009.
We have made a table that gets used in a matrix-like fashion for entering in purchase orders via an AX form. So each row will have:
a column for item#
a column for color
columns 1-7 for size (size1, size2,...), quantity (qty1, qty2,...), and cost (cost1, cost2,...).
I am trying to create a report in SSRS that basically uses this data in a more list-like fashion for printing out a PO order form.
I have got it to show the sizing right, but the cost situation complicates it as the unit cost can, and does, differ depending on size (for instance 2XL is more than S-M-L).
For example in our table, item 10000 black has 3 for Small (this data would be qty1), 3 for Medium (qty2) and 4 2XL (qty5). The cost for qty1 and qty2 are the same at $2.50 (cost1 and cost2). The cost for qty5 (cost5) would be $4. I would like to have this broken out into 2 rows by the cost and associated size on the form. So one line would have 10000 black Small and medium info and the second row would have the same item and color, but only have 2XL and its cost data.
Is there a way to "match" fields or somehow cycle through them to get the correct cost without having to have an additional 7 cost columns? Or perhaps there is a more elegant solution that is escaping me?
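One possible way to think about it is to unpivot the seven size/qty/cost slots into individual entries and then group the entries by cost. The Python sketch below only illustrates that logic on the example row; it is not SSRS or AX code, the size labels are taken from the example, and lines_by_cost is a hypothetical name.

```python
from collections import defaultdict

# The example row: item 10000 black, with slots 1, 2 and 5 filled.
row = {
    "item": "10000", "color": "black",
    "size1": "S", "qty1": 3, "cost1": 2.50,
    "size2": "M", "qty2": 3, "cost2": 2.50,
    "size5": "2XL", "qty5": 4, "cost5": 4.00,
}


def lines_by_cost(row, slots=7):
    """Return one (item, color, cost, [(size, qty), ...]) line per cost."""
    grouped = defaultdict(list)
    for i in range(1, slots + 1):
        qty = row.get(f"qty{i}")
        if not qty:
            continue  # skip empty slots
        grouped[row[f"cost{i}"]].append((row[f"size{i}"], qty))
    return [(row["item"], row["color"], cost, sizes)
            for cost, sizes in grouped.items()]


for line in lines_by_cost(row):
    print(line)
```

The same shape (unpivot, then group on item, color and cost) could be produced in the dataset query so that the report itself only needs a plain list.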

vba loop through all the pivot fields of a pivot table and return specified values

I have a dataset whose entries have 5 different attributes and one value. For example, I have the heights of 5000 people. For each person I have their hair color, eye color, nationality, the city where they were born, and their mother's name (the 5 dimensions).
No/Eye Color/Hair Color/Nationality/Hometown/Mother's Name/Height
Blue Blond Swiss Zürich Nicole 184
Blue Brown English York Ruby 164
Brown Brown French Paris Sophie 154
etc..
So there are 5 dimensions. The data is set dynamically, so the number of categories in each dimension can vary. I want to compute the average height of people depending on which dimensions I choose to include (from 1 to 5). For example, I want to retrieve:
The average height of French, blue-eyed people. The next day, only the people born in London. And the week after, the Swiss, blue-eyed, red-haired people born in Geneva whose mother is called Nicole.
So I created a pivot table with Eye Color as Row Labels, Hair Color as Column Labels, the average height as the Data, and the last 3 dimensions as Report Filters. This allows me to see all the possible combinations of average height that my data implies.
Now my goal is:
I want to create a macro that goes through all the possible combinations of my dimensions (i.e. 2^5-1=31) and stores in a vector every average height that is above a certain value, e.g. 190. It could then print them on a worksheet.
I was thinking of using Boolean arrays and a For-Each-Next structure, but I must say that I fail to picture how to implement it.
Any ideas?
Thanks for the time and help!
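The enumeration described in the question can be sketched as follows. This is an illustration of the logic in Python, not a VBA macro: it walks all 31 non-empty subsets of the five dimensions, averages the height per category combination, and keeps the averages above a threshold. The field names and averages_above are hypothetical; the three rows are the sample rows from the question.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("eye", "hair", "nationality", "hometown", "mother")

people = [
    {"eye": "Blue", "hair": "Blond", "nationality": "Swiss",
     "hometown": "Zürich", "mother": "Nicole", "height": 184},
    {"eye": "Blue", "hair": "Brown", "nationality": "English",
     "hometown": "York", "mother": "Ruby", "height": 164},
    {"eye": "Brown", "hair": "Brown", "nationality": "French",
     "hometown": "Paris", "mother": "Sophie", "height": 154},
]


def averages_above(people, threshold):
    """Return (dimensions, categories, average) for averages > threshold."""
    hits = []
    for size in range(1, len(DIMENSIONS) + 1):
        # Every way of choosing `size` dimensions out of the five.
        for dims in combinations(DIMENSIONS, size):
            groups = defaultdict(list)
            for person in people:
                key = tuple(person[d] for d in dims)
                groups[key].append(person["height"])
            for key, heights in groups.items():
                average = sum(heights) / len(heights)
                if average > threshold:
                    hits.append((dims, key, average))
    return hits


for dims, key, average in averages_above(people, 170):
    print(dims, key, average)
```

In a macro, the outer two loops would correspond to switching pivot fields on and off, and the list of hits would be what gets written to the worksheet.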