Replace '-' but not negative numbers in pandas

In a DataFrame, I have negative numbers, and also missing values that are represented by a lone -. I want to replace the missing values with an empty cell, but this operation should NOT remove the - in front of the negative numbers.
It looks like:
45 45 45 45 45 45 45 45 45 45
45 45 15 31 43 45 45 45 45 45
44.24 121.55 1.80 0.00% - 97.63 -4.87 -6.02 -20.14 169.19
1 1 7 12 3 1 1 1 1 1
So the missing value cell with the - should be empty, but the -4.87 should stay intact.
Any help would be greatly appreciated.

The problem should have been addressed when loading the file into the DataFrame (by passing the na_values parameter to read_csv() or whichever reader you used).
At this point, use replace(): called with a plain value, it matches whole cell contents, not individual characters, so the - in -4.87 is left alone.
df = df.replace("-", np.nan)
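Both routes mentioned above can be sketched as follows. The CSV text and the column names a, b, c are made up for illustration, and the second route assumes the affected columns were loaded as text.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV text standing in for the original file
raw = "a,b,c\n44.24,-,-4.87\n1.80,97.63,-\n"

# Option 1: declare "-" as a missing-value marker while reading
df_read = pd.read_csv(io.StringIO(raw), na_values=["-"])

# Option 2: the data is already loaded as text; replace whole-cell "-" values
df_text = pd.read_csv(io.StringIO(raw), dtype=str, keep_default_na=False)
df_clean = df_text.replace("-", np.nan)

print(df_read)
print(df_clean)
```

In both cases the cell holding -4.87 keeps its sign, because only cells whose entire content is - are treated as missing. When the frame is written back out with to_csv(), missing values are written as empty fields by default.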


Gnuplot second from right or last column [duplicate]

I wish to plot the second, third, and fourth columns from the right using Gnuplot. In awk, we can use $(NF-1). But in Gnuplot, I am not sure how to designate a column counted from the right with 'using'.
Is it possible to use awk within Gnuplot to plot the 3rd column from the right against the 4th column from the right? Or is this something that needs a shell script?
I have a lot of long text files to plot, so I cannot rewrite each file with awk and then run Gnuplot on the result. That would be too time-consuming. I want Gnuplot itself to plot the 2nd, 3rd, and 4th columns from the right.
No need for awk. Run stats to get the number of columns; you can limit it to a single row with every ::0::0, so it should be pretty fast. Try the following complete example:
### plotting columns from right
reset session
$Data <<EOD
11 21 31 41 51 61 71
12 22 32 42 52 62 72
13 23 33 43 53 63 73
14 24 34 44 54 64 74
15 25 35 45 55 65 75
16 26 36 46 56 66 76
17 27 37 47 57 67 77
EOD
stats $Data u 0 every ::0::0 nooutput
ColMax = STATS_columns
ColFromRight(col) = column(ColMax-col+1)
plot $Data u (ColFromRight(3)):(ColFromRight(4)) w lp pt 7
### end of code
You can also run stats on the file first, read the number of columns from STATS_columns, and use it in your plot:
stats 'data.dat' nooutput
nf = int(STATS_columns)
plot 'data.dat' using 1:nf-4

Python Dataframe column operation using lambda function [duplicate]

I'm trying to multiply two existing columns in a pandas Dataframe (orders_df): Prices (stock close price) and Amount (stock quantities) and add the calculation to a new column called Value. For some reason when I run this code, all the rows under the Value column are positive numbers, while some of the rows should be negative. Under the Action column in the DataFrame there are seven rows with the 'Sell' string and seven with the 'Buy' string.
for i in orders_df.Action:
    if i == 'Sell':
        orders_df['Value'] = orders_df.Prices*orders_df.Amount
    elif i == 'Buy':
        orders_df['Value'] = -orders_df.Prices*orders_df.Amount
Please let me know what I'm doing wrong!
I think an elegant solution is to use the where method (also see the API docs):
In [37]: values = df.Prices * df.Amount
In [38]: df['Values'] = values.where(df.Action == 'Sell', other=-values)
In [39]: df
Prices Amount Action Values
0 3 57 Sell 171
1 89 42 Sell 3738
2 45 70 Buy -3150
3 6 43 Sell 258
4 60 47 Sell 2820
5 19 16 Buy -304
6 56 89 Sell 4984
7 3 28 Buy -84
8 56 69 Sell 3864
9 90 49 Buy -4410
Furthermore, this should be among the fastest solutions, since it is fully vectorized.
You can use the DataFrame apply method:
orders_df['Value'] = orders_df.apply(lambda row: (row['Prices']*row['Amount']
                                                  if row['Action']=='Sell'
                                                  else -row['Prices']*row['Amount']),
                                     axis=1)
It is usually faster to use these methods than to write explicit for loops.
If we're willing to sacrifice the succinctness of Hayden's solution, one could also do something like this:
In [22]: orders_df['C'] = orders_df.Action.apply(
             lambda x: (1 if x == 'Sell' else -1))
In [23]: orders_df # New column C represents the sign of the transaction
Prices Amount Action C
0 3 57 Sell 1
1 89 42 Sell 1
2 45 70 Buy -1
3 6 43 Sell 1
4 60 47 Sell 1
5 19 16 Buy -1
6 56 89 Sell 1
7 3 28 Buy -1
8 56 69 Sell 1
9 90 49 Buy -1
Now we have eliminated the need for the if statement. Using Series.apply(), we also do away with the explicit for loop. As Hayden noted, vectorized operations are generally faster than Python-level loops.
In [24]: orders_df['Value'] = orders_df.Prices * orders_df.Amount * orders_df.C
In [25]: orders_df # The resulting dataframe
Prices Amount Action C Value
0 3 57 Sell 1 171
1 89 42 Sell 1 3738
2 45 70 Buy -1 -3150
3 6 43 Sell 1 258
4 60 47 Sell 1 2820
5 19 16 Buy -1 -304
6 56 89 Sell 1 4984
7 3 28 Buy -1 -84
8 56 69 Sell 1 3864
9 90 49 Buy -1 -4410
This solution takes two lines of code instead of one, but is a bit easier to read. I suspect that the computational costs are similar as well.
Since this question came up again, I think a good clean approach is using assign.
The code is quite expressive and self-describing:
df = df.assign(Value=lambda x: x.Prices * x.Amount * x.Action.map({'Sell': 1, 'Buy': -1}))
To make things neat, I take Hayden's solution but make a small function out of it.
def create_value(row):
    if row['Action'] == 'Sell':
        return row['Prices'] * row['Amount']
    return -row['Prices'] * row['Amount']
so that when we want to apply the function to our dataframe, we can do..
df['Value'] = df.apply(lambda row: create_value(row), axis=1)
...and any modifications only need to occur in the small function itself.
Concise, Readable, and Neat!
Good solution from bmu. I think it's more readable to put the values inside the parentheses rather than outside.
df['Values'] = np.where(df.Action == 'Sell',
                        df.Prices * df.Amount,
                        -df.Prices * df.Amount)
Using some pandas built-in functions:
df['Values'] = np.where(df.Action.eq('Sell'),
                        df.Prices.mul(df.Amount),
                        -df.Prices.mul(df.Amount))
For me, this is the clearest and most intuitive:
values = []
index = []
for action in ['Sell', 'Buy']:
    rows = orders_df['Action'] == action
    amounts = orders_df['Amount'][rows].values
    if action == 'Sell':
        prices = orders_df['Prices'][rows].values
    else:
        prices = -1 * orders_df['Prices'][rows].values
    values += list(amounts * prices)
    index += list(orders_df.index[rows])
# Assign through the index so each value lands on its original row
orders_df['Values'] = pd.Series(values, index=index)
The .values attribute returns a numpy array, which lets you multiply element-wise; you can then build up a list cumulatively by 'adding' to it. Because the list ends up grouped by action rather than in row order, it is assigned back through the original index labels.
First, multiply the columns Prices and Amount. Afterwards use mask to negate the values if the condition is True:
Values=(df["Prices"] * df["Amount"]).mask(df["Action"] == "Buy", lambda x: -x)
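To see why the loop in the question gives every row the same sign, and to check that the vectorized answers agree with each other, here is a minimal sketch. The four-row frame is made up for illustration; only the column names come from the question.

```python
import numpy as np
import pandas as pd

# Small made-up frame with the column names used in the question
orders_df = pd.DataFrame({
    "Prices": [3, 89, 45, 6],
    "Amount": [57, 42, 70, 43],
    "Action": ["Sell", "Sell", "Buy", "Sell"],
})

# The original loop: each pass overwrites the whole column,
# so only the last Action value decides the sign of every row.
looped = orders_df.copy()
for action in looped.Action:
    if action == "Sell":
        looped["Value"] = looped.Prices * looped.Amount
    elif action == "Buy":
        looped["Value"] = -looped.Prices * looped.Amount

# Vectorized alternatives: the sign is chosen row by row.
gross = orders_df.Prices * orders_df.Amount
by_where = gross.where(orders_df.Action == "Sell", other=-gross)
by_np_where = np.where(orders_df.Action == "Sell", gross, -gross)
by_mask = gross.mask(orders_df.Action == "Buy", -gross)

print(looped["Value"].tolist())   # all positive, because the last row is a Sell
print(by_where.tolist())          # the Buy row is negative
```

The loop never looks at one row at a time: each assignment on the right-hand side is a whole-column expression, so the final iteration wins.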

How to Create a CDF out of a PDF in SQL

So I have a data table that looks something like the following. id represents an object, bin represents how I am segmenting the data, and percent is how much of the data falls into that bin.
id bin percent
2 8 0.20030698388
2 16 0.14504988488
2 24 0.12356101304
2 32 0.09976976208
2 40 0.09056024558
2 48 0.07137375287
2 56 0.04067536454
2 64 0.03914044512
2 72 0.02916346891
2 80 0.16039907904
3 8 0.36316695352
3 16 0.03958691910
3 24 0.11876075731
3 32 0.13253012048
3 40 0.03098106712
3 48 0.07228915662
3 56 0.07745266781
3 64 0.02581755593
3 72 0.02065404475
3 80 0.11876075731
I am looking for a function to turn this dataset into a cdf partitioning id. I have tried cume_dist and percent_rank, but they do not appear to work.
I am facing a similar problem and found this great tutorial for doing exactly that:
It rebuilds Excel's NORM.DIST function, which gives you the PDF if you set the cumulative flag to FALSE and the CDF if you set it to TRUE. I assumed that CUME_DIST would do the same thing in SQL. However, it turns out that CUME_DIST is computed from the number of rows at or below each value, whereas Excel works from the values themselves.
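Since percent already holds the per-bin probability, the CDF for each id is a running sum of percent in bin order. In SQL dialects that support window functions, this is usually written as SUM(percent) OVER (PARTITION BY id ORDER BY bin); whether that applies depends on the database, which the question does not name. The same computation can be sketched in pandas, using the first three bins of each id from the table above:

```python
import pandas as pd

# First three bins of each id from the table in the question
pdf = pd.DataFrame({
    "id": [2, 2, 2, 3, 3, 3],
    "bin": [8, 16, 24, 8, 16, 24],
    "percent": [0.20030698388, 0.14504988488, 0.12356101304,
                0.36316695352, 0.03958691910, 0.11876075731],
})

# Running sum of percent within each id, in bin order
pdf = pdf.sort_values(["id", "bin"])
pdf["cdf"] = pdf.groupby("id")["percent"].cumsum()

print(pdf)
```

With all ten bins per id included, the last cdf value in each group should come out at approximately 1.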

Pandas pivot_table -- index values extended through rows

I'm trying to tidy some data, specifically by taking two columns "measure" and "value" and making more columns for each unique value of measure.
So far I have some python (3) code that reads in data and pivots it to the form that I want--roughly. This code looks like so:
import pandas as pd
#Load the data
df = pd.read_csv(r"C:\Users\User\Documents\example data.csv")
#Pivot the dataframe
df_pivot = df.pivot_table(index=['Geography Type', 'Geography Name', 'Week Ending',
'Item Name'], columns='Measure', values='Value')
This outputs:
Measure                                                  X   Y   Z
Geography Type  Geography Name  Week Ending  Item Name
Type 1          Total US        1/1/2018     Item A     57  51  16
                                             Item B     95  37  17
                                1/8/2018     Item A     92   8  32
                                             Item B     36  49  54
Type 2          Region 1        1/1/2018     Item A     78  46  88
This is almost perfect, but I need to load this file into other software, and for that software to read the data correctly every row needs a value in each of the index columns. So I need the index values to be repeated on every row, like so:
Measure                                                  X   Y   Z
Geography Type  Geography Name  Week Ending  Item Name
Type 1          Total US        1/1/2018     Item A     57  51  16
Type 1          Total US        1/1/2018     Item B     95  37  17
Type 1          Total US        1/8/2018     Item A     92   8  32
Type 1          Total US        1/8/2018     Item B     36  49  54
Type 2          Region 1        1/1/2018     Item A     78  46  88
and so on.
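One possible approach, sketched here with made-up rows that reuse the column names from the question, is to call reset_index() on the pivoted frame. The blank cells in the first output are only how pandas prints a MultiIndex; reset_index() turns the index levels into ordinary columns, so every row carries its own values.

```python
import pandas as pd

# Made-up long-format rows with the column names from the question
df = pd.DataFrame({
    "Geography Type": ["Type 1"] * 4,
    "Geography Name": ["Total US"] * 4,
    "Week Ending": ["1/1/2018"] * 4,
    "Item Name": ["Item A", "Item A", "Item B", "Item B"],
    "Measure": ["X", "Y", "X", "Y"],
    "Value": [57, 51, 95, 37],
})

df_pivot = df.pivot_table(
    index=["Geography Type", "Geography Name", "Week Ending", "Item Name"],
    columns="Measure",
    values="Value",
)

# Turn the index levels back into ordinary columns so every row is fully populated
df_flat = df_pivot.reset_index()
df_flat.columns.name = None  # drop the leftover "Measure" label on the columns

print(df_flat)
```

Note that pivot_table aggregates with the mean by default, so the measure columns come back as floats.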

Hive returning no results on a simple select query

I have a table called processed. The last column is named as monthid. The data type for this column is bigint. When I fire a simple query like this, I get no results:
select * from processed where monthid = 5 ;
A few rows for the table have been shown below. Can someone suggest what's wrong here?
11741 Negative 11 69.55 1401172919 48 27 5
11741 Negative 11 102.0 1401172997 48 27 5
11741 Negative 11 145.78 1401173093 48 27 5
11741 Negative 11 70.54 1401173137 49 27 5
11741 Negative 11 85.2 1401173146 49 27 5
11741 Negative 11 67.47 1401173156 49 27 5
11741 Negative 11 92.76 1401173223 49 27 5
As can be seen from the above sample data, the last column has monthid = 5. However, my query returns me nothing.
I believe the problem here was that I had partitioned the above table on column #6. Because of either a permissions issue or something else going wrong with the partitions, the query was returning nothing. After I dropped the table and created it again without the partition, the above query worked fine. For more information on this, please refer to the related question "Hive - Queries on Partitions return nothing".