Converting Negative Numbers in String Format to Numeric when the Sign is at the End - pandas

I have a column in my dataframe containing negative numbers in a string format like this: "500.00-". I need to convert every negative number within the column to numeric format. I'm sure there's an easy way to do this, but I have struggled to find one specific to a pandas dataframe. Any help would be greatly appreciated.
I have tried the basic to_numeric function as shown below, but it doesn't read the values in correctly. Also, only some of the numbers within the column are negative, so I can't simply remove all the negative signs and multiply the column by -1.
Q1['Credit'] = pd.to_numeric(Q1['Credit'])

Sample data:
df:
num
0 50.00
1 60.00-
2 70.00+
3 -80.00
Use the Series str accessor to check the last character. If it is '-' or '+', move it to the front. Use df.mask to apply this only to rows that have a -/+ suffix. Finally, astype the column to float.
df.num.mask(df.num.str[-1].isin(['-','+']), df.num.str[-1].str.cat(df.num.str[:-1])).astype('float')
Output:
0 50.0
1 -60.0
2 70.0
3 -80.0
Name: num, dtype: float64
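For reference, a self-contained version of the same approach; the DataFrame is built from the sample data above and the intermediate name is just for illustration:
import pandas as pd

df = pd.DataFrame({'num': ['50.00', '60.00-', '70.00+', '-80.00']})

# rows whose last character is a trailing sign
has_suffix_sign = df.num.str[-1].isin(['-', '+'])
# move the trailing sign to the front, then cast to float
result = df.num.mask(has_suffix_sign,
                     df.num.str[-1].str.cat(df.num.str[:-1])).astype('float')
print(result.tolist())   # [50.0, -60.0, 70.0, -80.0]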

Possibly a bit explicit, but this would work:
# build a mask of negative numbers
m_neg = Q1["Credit"].str.endswith("-")
# remove - signs
Q1["Credit"] = Q1["Credit"].str.rstrip("-")
# convert to number
Q1["Credit"] = pd.to_numeric(Q1["Credit"])
# Apply the mask to create the negatives
Q1.loc[m_neg, "Credit"] *= -1

Let us consider the following example dataframe:
Q1 = pd.DataFrame({'Credit':['500.00-', '100.00', '300.00-']})
Credit
0 500.00-
1 100.00
2 300.00-
We can use str.endswith to create a mask which indicates the negative numbers. Then we use np.where to conditionally convert the numbers to negative:
import numpy as np

m1 = Q1['Credit'].str.endswith('-')
# strip only a trailing '-' so positive values keep their last digit
m2 = Q1['Credit'].str.rstrip('-').astype(float)
Q1['Credit'] = np.where(m1, -m2, m2)
Output
Credit
0 -500.0
1 100.0
2 -300.0

Related

Split a vector into parts separated by zeros and cumulatively sum the elements in each part

I want to split a vector into several parts separated by the numeric value 0. For each part, cumulatively calculate the sum of the elements encountered so far. Negative numbers do not participate in the calculation.
For example, with the input [0,1,1,1,0,1,-1,1,1], I expect the result to be [0,1,2,3,0,1,1,2,3].
How to implement this in DolphinDB?
Use the DolphinDB built-in function cumPositiveStreak(X), treating the negative elements in X as NULL values.
Script:
a = 1 2 -1 0 1 2 3
cumPositiveStreak(iif(a<0,NULL,a))
Execution result:
1 3 3 0 1 3 6
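This is not part of the DolphinDB answer, but since the rest of this page is pandas, the same split-at-zeros idea can be sketched in pandas for comparison (variable names are illustrative):
import pandas as pd

a = pd.Series([0, 1, 1, 1, 0, 1, -1, 1, 1])

groups = (a == 0).cumsum()                          # a new group starts at every zero
result = a.clip(lower=0).groupby(groups).cumsum()   # negatives contribute nothing
print(result.tolist())   # [0, 1, 2, 3, 0, 1, 1, 2, 3]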

Finding the mean of a column, but excluding a single value

Imagine I have a dataset that is like so:
ID birthyear weight
0 619040 1962 0.1231231
1 600161 1963 0.981742
2 25602033 1963 1.3123124
3 624870 1987 10,000
and I want to get the mean of the column weight, but the obvious 10,000 is hindering the actual mean. In this situation I cannot change the value but must work around it. This is what I've got so far, but obviously it's including that last value.
avg_num_items = df_cleaned['trans_quantity'].mean()
translist = df_cleaned['trans_quantity'].tolist()
My dataframe is df_cleaned and the column I'm actually working with is 'trans_quantity', so how do I go about the mean while working around that value?
Since you added SQL to your tags: in SQL you'd exclude it in the WHERE clause:
SELECT AVG(trans_quantity)
FROM your_data_base
WHERE trans_quantity <> 10000
In Pandas:
avg_num_items = df_cleaned[df_cleaned["trans_quantity"] != 10000]["trans_quantity"].mean()
You can also replace the value with NaN and skip it in the mean:
avg_num_items = df_cleaned["trans_quantity"].replace(10000, np.nan).mean(skipna=True)
With pandas, ensure you have numeric data (10,000 is a string), filter out the values above a threshold and take the mean:
(pd.to_numeric(df['weight'], errors='coerce')
.loc[lambda x: x<10000]
.mean()
)
Output: 0.8057258333333334
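A self-contained sketch of that last approach, assuming the weight column holds the strings shown in the question:
import pandas as pd

df = pd.DataFrame({
    'ID': [619040, 600161, 25602033, 624870],
    'birthyear': [1962, 1963, 1963, 1987],
    'weight': ['0.1231231', '0.981742', '1.3123124', '10,000'],
})

avg = (pd.to_numeric(df['weight'], errors='coerce')  # '10,000' fails to parse and becomes NaN
         .loc[lambda x: x < 10000]                   # NaN rows drop out of the comparison
         .mean())
print(avg)   # 0.8057258333333334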

Pandas how to get row number from datetime index and back again?

I'm having great difficulty. I have read a csv file and set the index to the "Timestamp" column like this:
df = pd.read_csv(csv_file, quotechar="'", decimal=".", delimiter=";", parse_dates=True, index_col="Timestamp")
df
XYZ PRICE position nrLots posText
Timestamp
2014-10-14 10:00:29 30 140 -1.0 -1.0 buy
2014-10-14 10:00:30 21 90 -1.0 -5.0 buy
2014-10-14 10:00:31 3 110 1.0 2.0 sell
2014-10-14 10:00:32 31 120 1.0 1.0 sell
2014-10-14 10:00:33 4 70 -1.0 -5.0 buy
So if I want to get the price in the 2nd row, I want to do it like this:
df.loc[2, "PRICE"]
But that does not work. If I want to use df.loc[] operator, I need to insert a Timestamp, like this:
df.loc["2014-10-14 10:00:31", "PRICE"]
If I want to use row numbers, I need to do this instead:
df["PRICE"].iloc[2]
which sucks. The syntax is ugly. However, it works. I can get the value, and I can set the value - which is what I want.
If I want to find the Timestamp for a row, I can do this:
df.index[row]
Question) Is there a more elegant syntax to get and set the value, when you always work with a row number? I always iterate over the row numbers, never iterate over Timestamps. I never use the Timestamp to access values, I always use row numbers.
Bonusquestion) If I have a Timestamp, how can I find the corresponding row number?
There is a way to do this.
First use df = df.reset_index().
"Timestamp" becomes a regular column of df, and you get a new integer index.
You can then access any row element with df.loc[] or df.iat[], and you can find any row by a specific element.
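A minimal sketch, assuming a frame indexed by Timestamp as in the question: iat gets and sets values by row/column number, and Index.get_loc answers the bonus question (Timestamp to row number) without resetting the index.
import pandas as pd

df = pd.DataFrame(
    {'PRICE': [140, 90, 110, 120, 70]},
    index=pd.to_datetime(['2014-10-14 10:00:29', '2014-10-14 10:00:30',
                          '2014-10-14 10:00:31', '2014-10-14 10:00:32',
                          '2014-10-14 10:00:33']))
df.index.name = 'Timestamp'

col = df.columns.get_loc('PRICE')   # column number from its label
price = df.iat[2, col]              # get by row number -> 110
df.iat[2, col] = 115                # set by row number

row = df.index.get_loc(pd.Timestamp('2014-10-14 10:00:31'))   # row number from a Timestamp -> 2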

I have a column in my pandas dataframe that contains an arithmetic operator (*)

The data type of the column in a pandas dataframe is object. It contains an arithmetic expression, for example: 24 * 365. I would like to get the result (24 * 365 = 8760) of the expression returned in place of the expression. Can anyone help in resolving this?
The Quantity column in the example holds the number of units multiplied by the quantity of each unit. I would like to get the total quantity by multiplying them.
Search for strings that contain ' x ' and split on that, convert the parts to numbers, and multiply the left side of the split by the right side.
import numpy as np
import pandas as pd

df = pd.DataFrame({'Quantity': ['1', '1000', '24 x 13.75', '60 x 40', '750']})
df['Quantity1'] = np.where(df['Quantity'].str.contains(' x '),
                           pd.to_numeric(df['Quantity'].str.split(' x ').str[0]) *
                           pd.to_numeric(df['Quantity'].str.split(' x ').str[1]),
                           df['Quantity']).astype(float)
df
If the above doesn't work, delete .astype(float). There may also be additional logic to consider, e.g. what if a cell is in the format 12.25 * 43 * 1, or 40 / 8, or uses a capital 'X', etc.
Output:
Quantity Quantity1
0 1 1.0
1 1000 1000.0
2 24 x 13.75 330.0
3 60 x 40 2400.0
4 750 750.0
inter_m = df["a1"].str.split("*", expand=True).astype('int32')
Split the column on the symbol *, then convert the parts to integers and store them in an intermediate dataframe.
df["a1"] = inter_m[0] * inter_m[1]
Multiply the two columns of the intermediate dataframe.
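If the cells really contain Python-style expressions such as 24 * 365 (the asker's example) rather than an ' x ' separator, another option is to evaluate each string with pd.eval. A minimal sketch with illustrative data; only do this on trusted input:
import pandas as pd

df = pd.DataFrame({'Quantity': ['24 * 365', '1000', '12.25 * 43 * 1']})
df['Total'] = df['Quantity'].map(pd.eval).astype(float)
# 0    8760.00
# 1    1000.00
# 2     526.75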

Need explanation on how pandas.drop is working here

I have a data frame, let's say xyz. I have written code to find out the % of null values each column possesses in the dataframe. My code is below:
round(100*(xyz.isnull().sum()/len(xyz.index)), 2)
Let's say I got the following results:
abc 26.63
def 36.58
ghi 78.46
I want to drop column ghi because it has more than 70% of null values.
I achieved it using the following code:
xyz = xyz.drop(xyz.loc[:,round(100*(xyz.isnull().sum()/len(xyz.index)), 2)>70].columns, 1)
but I did not understand how this code works. Can anyone please explain it?
The code is doing the following:
xyz.drop( [...], 1)
removes the specified elements for a given axis, either by row or by column. In this particular case, df.drop(..., 1) means you're dropping by axis 1, i.e., columns.
xyz.loc[:, ... ].columns
returns the column names resulting from your slicing condition.
round(100*(xyz.isnull().sum()/len(xyz.index)), 2)>70
this instruction counts the nulls, adds them up and normalizes by the number of rows, effectively computing the percentage of NaN in each column. The amount is then rounded to 2 decimal places, and finally it returns True if the percentage of NaN is more than 70%. Hence, you get a mapping between columns and a True/False array.
Putting everything together: you first produce a Boolean array that marks which columns have more than 70% NaN; then, using .loc, you use Boolean indexing to look only at the columns you want to drop (NaN % > 70%); then, using .columns, you recover the names of those columns, which are then used by the .drop instruction.
Hopefully this clears things up!
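A minimal sketch of those steps broken out, on a hypothetical frame where ghi is mostly null:
import numpy as np
import pandas as pd

xyz = pd.DataFrame({
    'abc': [1, 2, np.nan, 4, 5],                    # 20% null
    'ghi': [np.nan, np.nan, np.nan, np.nan, 1],     # 80% null
})

null_pct = round(100*(xyz.isnull().sum()/len(xyz.index)), 2)   # abc 20.0, ghi 80.0
cols_to_drop = xyz.loc[:, null_pct > 70].columns               # Index(['ghi'])
xyz = xyz.drop(columns=cols_to_drop)                           # same effect as .drop(..., 1)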
If that code is hard to understand, you can just check dropna with thresh, since pandas already covers this case. thresh is the minimum number of non-null values a column needs in order to be kept, so keeping columns with at least 30% non-null values drops those with more than 70% nulls.
df = df.dropna(axis=1, thresh=round(len(df)*0.3))