Dividing values from 2 different datasets - sql

I am trying to divide two fields that come from two different datasets. The expression also uses a lookup, but for some reason it performs the lookup part and not the division part. Any ideas?
=IIF(Fields!PACKSHORT_DESC.Value = "EA",(LOOKUP(TRIM(Fields!PRODUCT_CODE.value), TRIM(Fields!item.value),Fields!tcost.value,"Cost")/Fields!NO_OF_EACHES.Value),(LOOKUP(TRIM(Fields!PRODUCT_CODE.value), TRIM(Fields!item.value),Fields!tcost.value,"Cost")))

Output the two numbers you are trying to divide on their own first, to check that they are pulling through correctly. Then assign them names and divide those instead.

Related

Encode all data in one column and assign the same code if data has a same value

I have a dataframe which has appr. 100 columns and 20000 rows. Now I want to encode one categorical column so that it will have numerical code. After checking its value counts, the result shows something like this:
df['name'].value_counts()
aaa 650
baa 350
cad 50
dae 10
ef3 1
....
There are about 3300 unique values in total, so the codes might range from 1 to 3300. I will normalize the numerical codes before training. As the dataset already has many columns, I would prefer not to use one-hot encoding. How can I do this? Thank you!
You can enumerate each group using ngroup(). It would look something like:
df.assign(num_code=lambda x: x.groupby(['name']).ngroup())
I don't know what kind of information the column contains, however I am not sure it makes sense to assign an incremental numerical code to a column that seems to be categorical for training models.
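As a minimal sketch of the ngroup() suggestion above, the snippet below encodes a small made-up name column (the sample values are illustrative, not the asker's data). It also shows pd.factorize, which is an alternative not mentioned in the answer: ngroup() numbers groups in sorted order, while factorize numbers them in order of first appearance.

```python
import pandas as pd

# Toy data standing in for the real 'name' column (values are made up).
df = pd.DataFrame({"name": ["baa", "aaa", "baa", "cad", "aaa"]})

# ngroup() numbers each group; with the default sort=True the codes
# follow the sorted order of the group keys (aaa=0, baa=1, cad=2).
coded = df.assign(num_code=lambda x: x.groupby("name").ngroup())

# Alternative: factorize() numbers values in order of first appearance
# (baa=0, aaa=1, cad=2).
codes, uniques = pd.factorize(df["name"])
```

Either way, rows with the same name share the same code. Add 1 to the result if the codes should start at 1 instead of 0.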

Subtract the mean of a group for a column away from a column value

I have a companies dataset with 35 columns. The companies can belong to one of 8 different groups. How do I create a new dataframe in which, for each group, the group's mean of a column is subtracted from the original value?
Here is an example of part of the dataset.
So for example for row 1 I want to subtract the mean of BANK_AND_DEP for Consumer Markets away from the value of 7204.400207. I need to do this for each column.
I assume this is some kind of combination of a transform and a lambda - but cannot hit the syntax.
Although it might seem counter-intuitive for this to involve a loop at all, looping through the columns themselves allows you to do this as a vectorized operation, which will be quicker than .apply(). For what to subtract by, you'll combine .groupby() and .transform() to get the value you need to subtract from a column. Then, just subtract it.
for column in df.columns.drop('Cluster'):  # skip the grouping column itself
    df['new_' + column] = df[column] - df.groupby('Cluster')[column].transform('mean')
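The same groupby/transform idea can also be applied to all value columns at once, without a loop. This is a rough sketch on invented data: the column names Cluster and BANK_AND_DEP come from the question, while OTHER_COL, the Energy group, and all the numbers are assumptions made for illustration.

```python
import pandas as pd

# Toy frame; only 'Cluster' and 'BANK_AND_DEP' are names from the question.
df = pd.DataFrame({
    "Cluster": ["Consumer Markets", "Consumer Markets", "Energy", "Energy"],
    "BANK_AND_DEP": [10.0, 20.0, 1.0, 5.0],
    "OTHER_COL": [2.0, 4.0, 6.0, 10.0],
})

# Every column except the grouping column is treated as numeric here.
value_cols = [col for col in df.columns if col != "Cluster"]

# transform('mean') returns a frame aligned with df, holding each row's
# group mean, so the subtraction is a single vectorized operation.
group_means = df.groupby("Cluster")[value_cols].transform("mean")
demeaned = df[value_cols] - group_means

# Keep the group label next to the demeaned values.
demeaned.insert(0, "Cluster", df["Cluster"])
```

If the real dataset has non-numeric columns besides the group label, they would need to be excluded from value_cols as well.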

Need explanation on how pandas.drop is working here

I have a data frame, let's say xyz. I have written code to find the percentage of null values in each column of the dataframe. My code is below:
round(100*(xyz.isnull().sum()/len(xyz.index)), 2)
Let's say I got the following results:
abc 26.63
def 36.58
ghi 78.46
I want to drop column ghi because it has more than 70% of null values.
I achieved it using the following code:
xyz = xyz.drop(xyz.loc[:,round(100*(xyz.isnull().sum()/len(xyz.index)), 2)>70].columns, 1)
But I do not understand how this code works. Can anyone please explain it?
the code is doing the following:
xyz.drop( [...], 1)
removes the specified labels along a given axis, either rows or columns. In this particular case, xyz.drop(..., 1) means you're dropping along axis 1, i.e., columns. (Since pandas 2.0 the axis has to be passed by keyword: xyz.drop(..., axis=1).)
xyz.loc[:, ... ].columns
will return the column names (as an Index) resulting from your slicing condition
round(100*(xyz.isnull().sum()/len(xyz.index)), 2)>70
this expression counts the nulls in each column and divides by the number of rows, which gives the percentage of NaN in each column. The result is then rounded to 2 decimal places, and finally the comparison returns True if the percentage of NaN is more than 70. Hence, you get a mapping between columns and a True/False array.
Putting everything together: you first produce a Boolean array that marks which columns have more than 70% NaN. Then, with .loc, you use Boolean indexing to select only the columns you want to drop (NaN % > 70%). Then, with .columns, you recover the names of those columns, which are finally passed to .drop.
Hopefully this clears things up!
If your code is hard to understand, you can use dropna with thresh instead, since pandas already covers this case.
df=df.dropna(axis=1,thresh=round(len(df)*0.3))
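To make the step-by-step explanation above concrete, here is a small sketch that builds each intermediate piece separately on a toy frame. The column names follow the question, but the data is invented so that ghi ends up 80% null. The keyword form drop(columns=...) is used instead of the positional axis argument.

```python
import numpy as np
import pandas as pd

# Toy frame: abc is 20% null, def is 40% null, ghi is 80% null.
xyz = pd.DataFrame({
    "abc": [1.0, np.nan, 3.0, 4.0, 5.0],
    "def": [np.nan, np.nan, 3.0, 4.0, 5.0],
    "ghi": [np.nan, np.nan, np.nan, np.nan, 5.0],
})

# Step 1: percentage of nulls per column (a Series indexed by column name).
null_pct = round(100 * (xyz.isnull().sum() / len(xyz.index)), 2)

# Step 2: Boolean mask, True for columns with more than 70% nulls.
too_sparse = null_pct > 70

# Step 3: Boolean indexing on the column axis, then take the names.
cols_to_drop = xyz.loc[:, too_sparse].columns

# Step 4: drop those columns by name.
result = xyz.drop(columns=cols_to_drop)
```

Printing null_pct, too_sparse, and cols_to_drop one at a time is an easy way to see what each part of the original one-liner contributes.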

Performing calculations on multiple columns in dataframe and create new columns

I'm trying to perform calculations based on the entries in a pandas dataframe. The dataframe looks something like this:
and it contains 1466 rows. I'll have to run similar calculations on other dfs with more rows later.
What I'm trying to do is calculate something like mag = (U-V)/(R-I) (but ignoring any values that are -999), put that in a new column, and then put z_pred = 10**((mag-c)/m) in another new column (mag, c and m are just hard-coded variables). I have other columns I need to add too, but I figure that'll just be an extension of the same method.
I started out by trying
for i in range(1):
    current = qso[:]
    mag = (U-V)/(R-I)
    name = current['NED']
    z_pred = 10**((mag - c)/m)
    z_meas = current['z']
but I got either a Series for z, which I couldn't operate on, or various type errors when I tried to print the values or write them to a file.
I found this question which gave me a start, but I can't see how to apply it to multiple calculations, as in my situation.
How can I achieve this?
Conditionally adding a calculated column row-wise is usually done with NumPy's np.where:
df['mag'] = np.where(~df[['U', 'V', 'R', 'I']].eq(-999).any(axis=1), (df.U - df.V) / (df.R - df.I), -999)
Note: this assumes that when any of the columns contains -999, the value is not calculated and -999 is returned instead.
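The same masking idea extends to the second column the question asks for. The sketch below is one possible way to chain the two calculations, not the answer's exact method: it marks invalid rows with NaN instead of -999 so that the missing value carries through into z_pred automatically. The sample magnitudes and the values of c and m are placeholders.

```python
import numpy as np
import pandas as pd

# Toy data; -999 marks a missing measurement, as in the question.
qso = pd.DataFrame({
    "U": [20.0, -999.0, 19.0],
    "V": [18.0, 17.0, 18.0],
    "R": [17.0, 16.0, 17.5],
    "I": [16.0, 15.0, 17.0],
})
c, m = 0.5, 2.0  # hypothetical hard-coded constants

bands = ["U", "V", "R", "I"]

# True for rows where none of the four bands is the -999 sentinel.
valid = ~qso[bands].eq(-999).any(axis=1)

# Compute the ratio for every row, then blank out the invalid rows (NaN).
ratio = (qso["U"] - qso["V"]) / (qso["R"] - qso["I"])
qso["mag"] = ratio.where(valid)

# NaN propagates, so rows without a valid mag get no z_pred either.
qso["z_pred"] = 10 ** ((qso["mag"] - c) / m)
```

If the -999 sentinel is needed in the output file, fillna(-999) can be applied to the new columns just before writing.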

recoding multiple variables in the same way

I am looking for the shortest way to recode many variables in the same way.
For example I have data frame where columns a,b,c are names of items of survey and rows are observations.
d <- data.frame(a=c(1,2,3), b=c(1,3,2), c=c(1,2,1))
I want to change the values of all observations in selected columns. For instance, value 1 in columns "a" and "c" should be replaced with the string "low", and values 2 and 3 in these columns should be replaced with "high".
I do it often with many columns so I am looking for function which can do it in very simple way, like this:
recode2(data=d, columns=a,c, "1=low, 2,3=high").
The recode function from the car package is almost what I need, but if I have 10 columns to recode I have to write the call 10 times, which is not as efficient as I would like.