Using value_counts in pandas with conditions

I have a column with around 20k values. I've used the following function in pandas to display their counts:
weather_data["snowfall"].value_counts()
weather_data is the dataframe and snowfall is the column.
My results are:
0.0 12683
M 7224
T 311
0.2 32
0.1 31
0.5 20
0.3 18
1.0 14
0.4 13
etc.
Is there a way to:
Display the counts of only a single variable or number
Use an if condition to display the counts of only those values which satisfy the condition?

I'll be as clear as possible without a full example, which piRSquared suggested you provide.
value_counts returns a Series, so the values from your original Series end up in the result's index. Displaying the count of a single value is then just a matter of slicing that Series:
my_value_count = weather_data["snowfall"].value_counts()
my_value_count.loc['0.0']
output:
12683
If you want to display only for a list of variables:
my_value_count.loc[my_value_count.index.isin(['0.0','0.2','0.1'])]
output:
0.0 12683
0.2 32
0.1 31
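Since the counts' index is unique, a shorter equivalent sketch is to pass the list of labels straight to .loc (note this raises a KeyError in recent pandas if any label is missing):
my_value_count.loc[['0.0', '0.2', '0.1']]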
As you have M and T among your values, I suspect the other values are stored as strings rather than floats. Otherwise you could use:
my_value_count.loc[my_value_count.index < 0.4]
output:
0.0 12683
0.2 32
0.1 31
0.3 18
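If the values are indeed strings because of M and T, a small sketch of one way to still compare numerically is to coerce the index with pd.to_numeric ('M' and 'T' become NaN, and NaN < 0.4 is False, so they drop out):
num_idx = pd.to_numeric(my_value_count.index, errors='coerce')  # 'M' and 'T' -> NaN
my_value_count[num_idx < 0.4]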

Use an if condition to display the counts of only those values which satisfy the condition?
First create a new column based on the condition you want; then you can use groupby and sum.
For example, suppose you want to count the frequency only where a column is non-null. In my case, only where ACTUAL_COMPLETION_DATE has an actual non-null value:
dataset['Has_actual_completion_date'] = np.where(dataset['ACTUAL_COMPLETION_DATE'].isnull(), 0, 1)
dataset['Mitigation_Plans_in_progress'] = dataset['Has_actual_completion_date'].groupby(dataset['HAZARD_ID']).transform('sum')
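If instead the condition is on the counts themselves, a minimal sketch: value_counts returns a Series, so ordinary boolean indexing works directly on it (using the column from the original question):
vc = weather_data["snowfall"].value_counts()
vc[vc > 100]  # keep only values that occur more than 100 times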

Related

Changing column name and its values at the same time

Pandas help!
I have a specific column like this,
Mpg
0 18
1 17
2 19
3 21
4 16
5 15
Mpg is miles per gallon.
Now I need to rename that 'Mpg' column to 'litre per 100 km' and convert its values to litres per 100 km at the same time. Any help? Thanks beforehand.
-Tom
I managed to change the name of the column, but I could not do both simultaneously.
Use pop to return and delete the column at the same time and rdiv to perform the conversion (x mpg corresponds to 235.15/x litres per 100 km):
df['litre per 100 km'] = df.pop('Mpg').rdiv(235.15)
If you want to insert the column in the same position:
df.insert(df.columns.get_loc('Mpg'), 'litre per 100 km',
          df.pop('Mpg').rdiv(235.15))
Output:
litre per 100 km
0 13.063889
1 13.832353
2 12.376316
3 11.197619
4 14.696875
5 15.676667
An alternative to pop would be to store the result in another dataframe. This way you can perform the two steps at the same time. In my code below, I first reproduce your dataframe, then store the constant for conversion and perform it on all entries using the apply method.
df = pd.DataFrame({'Mpg':[18,17,19,21,16,15]})
cc = 235.214583 # constant for conversion from mpg to L/100km
df2 = pd.DataFrame()
df2['litre per 100 km'] = df['Mpg'].apply(lambda x: cc/x)
print(df2)
The output of this code is:
litre per 100 km
0 13.067477
1 13.836152
2 12.379715
3 11.200694
4 14.700911
5 15.680972
as expected.
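For reference, a minimal sketch of doing both steps in a single chained expression, assuming the same df as above; rdiv performs the conversion and rename swaps the label in one pass:
df = df['Mpg'].rdiv(235.15).rename('litre per 100 km').to_frame()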

Pandas how to get row number from datetime index and back again?

I have great difficulties. I have read a csv file and set the index to the "Timestamp" column like this:
df = pd.read_csv(csv_file, quotechar="'", decimal=".", delimiter=";", parse_dates=True, index_col="Timestamp")
df
XYZ PRICE position nrLots posText
Timestamp
2014-10-14 10:00:29 30 140 -1.0 -1.0 buy
2014-10-14 10:00:30 21 90 -1.0 -5.0 buy
2014-10-14 10:00:31 3 110 1.0 2.0 sell
2014-10-14 10:00:32 31 120 1.0 1.0 sell
2014-10-14 10:00:33 4 70 -1.0 -5.0 buy
So if I want to get the price in the 2nd row, I want to do it like this:
df.loc[2, "PRICE"]
But that does not work. If I want to use the df.loc[] operator, I need to pass a Timestamp, like this:
df.loc["2014-10-14 10:00:31", "PRICE"]
If I want to use row numbers, I need to do it like this instead:
df["PRICE"].iloc[2]
which sucks. The syntax is ugly. However, it works. I can get the value, and I can set the value - which is what I want.
If I want to find the Timestamp of a row, I can do it like this:
df.index[row]
Question) Is there a more elegant syntax to get and set the value, when you always work with a row number? I always iterate over the row numbers, never iterate over Timestamps. I never use the Timestamp to access values, I always use row numbers.
Bonus question) If I have a Timestamp, how can I find the corresponding row number?
There is a way to do this.
First use df = df.reset_index().
"Timestamp" becomes a new column added to df, and you get a fresh integer index.
You can then access any row element with df.loc[] or df.iat[], and look up any row by a specific element.
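For the bonus question, a short sketch: a DatetimeIndex supports get_loc, which maps a Timestamp back to its integer row number (and df.index[row] goes the other way, as you found). Column positions work the same way via df.columns.get_loc, so getting and setting by row number stays reasonably clean:
row = df.index.get_loc(pd.Timestamp("2014-10-14 10:00:31"))  # -> 2
df.iloc[row, df.columns.get_loc("PRICE")]  # get (or assign to) the value by row number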

Is there a way to use cumsum with a threshold to create bins?

Is there a way to use numpy to add up the numbers in a series until a threshold is reached, then restart the counter? The intention is to form a groupby based on the categories created.
amount price
0 27 22.372505
1 17 126.562276
2 33 101.061767
3 78 152.076373
4 15 103.482099
5 96 41.662766
6 108 98.460743
7 143 126.125865
8 82 87.749286
9 70 56.065133
The only solutions I found iterate with .loc, which is slow. I tried building a solution based on this answer https://stackoverflow.com/a/56904899:
sumvals = np.frompyfunc(lambda a, b: a + b if a <= 100 else b, 2, 1)
df['cumvals'] = sumvals.accumulate(df['amount'], dtype=object)
The use-case is to find the average price of every 75 sold amounts of the thing.
Solution #1. Interpreting "The use-case is to find the average price of every 75 sold amounts of the thing" one way leads to my solution below. If you are trying to do this calculation the "hard way" instead of with pd.cut, here is a solution that works well, but its speed and memory depend on the cumsum() of the amount column, which you can check with df['amount'].cumsum(). The output takes about 1 second per 10 million of the cumsum, as that is how many rows np.repeat creates. So this solution is not horrible if your cumsum is under ~10 million (1 second) or even 100 million (~10 seconds):
i = 75
df = np.repeat(df['price'], df['amount']).to_frame().reset_index(drop=True)
g = df.index // i
df = df.groupby(g)['price'].mean()
df.index = (df.index * i).astype(str) + '-' + (df.index * i + 75).astype(str)
df
Out[1]:
0-75 78.513748
75-150 150.715984
150-225 61.387540
225-300 67.411182
300-375 98.829611
375-450 126.125865
450-525 122.032363
525-600 87.326831
600-675 56.065133
Name: price, dtype: float64
Solution #2 (I believe this is wrong, but keeping it just in case)
I do not believe you are trying to do it this way, which was my initial solution, but I will keep it here just in case, as you haven't included the expected output. You can create a new series with cumsum, then use pd.cut and pass bins=np.arange(0, df['Group'].max(), 75) to create groups of cumulative 75. Then groupby the groups of cumulative 75 and take the mean. Finally, use pd.IntervalIndex to clean up the format and convert it to a string:
df['Group'] = df['amount'].cumsum()
s = pd.cut(df['Group'], bins=np.arange(0, df['Group'].max(), 75))
df = df.groupby(s)['price'].mean().reset_index()
df['Group'] = pd.IntervalIndex(df['Group']).left.astype(str) + '-' + pd.IntervalIndex(df['Group']).right.astype(str)
df
Out[1]:
Group price
0 0-75 74.467390
1 75-150 101.061767
2 150-225 127.779236
3 225-300 41.662766
4 300-375 98.460743
5 375-450 NaN
6 450-525 126.125865
7 525-600 87.749286
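For completeness, a hedged sketch of the literal ask (restart a running sum at the threshold) that avoids materializing np.repeat rows; the reset makes the scan inherently sequential, so it uses a plain loop over the numpy array to build group labels, then a single groupby:
import numpy as np

def threshold_groups(amounts, threshold=75):
    # start a new group once the running sum of amounts reaches the threshold
    groups = np.empty(len(amounts), dtype=int)
    total, g = 0, 0
    for i, a in enumerate(amounts):
        groups[i] = g
        total += a
        if total >= threshold:
            total, g = 0, g + 1
    return groups

df.groupby(threshold_groups(df['amount'].to_numpy()))['price'].mean()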

I have a column in my pandas dataframe that contains an arithmetic operator (*) as well

The data type of the column in a pandas dataframe is object. It contains an arithmetic expression, for example: 24 * 365. I would like to get the result (24 * 365 = 8760) of the expression returned in place of the expression. Can anyone help in resolving this?
The quantity column in the picture shown holds the number of units multiplied by the quantity of each unit. I would like to get the total quantity by multiplying them.
Search for strings that contain ' x ', split on that, convert both parts to numbers, and multiply the left side of the split by the right side.
df = pd.DataFrame({'Quantity' : ['1', '1000', '24 x 13.75', '60 x 40', '750']})
df['Quantity1'] = np.where(df['Quantity'].str.contains(' x '),
                           pd.to_numeric(df['Quantity'].str.split(' x ').str[0]) * pd.to_numeric(df['Quantity'].str.split(' x ').str[1]),
                           df['Quantity']).astype(float)
df
# If the above doesn't work, delete `.astype(float)`. There may also be additional logic to consider: what if a cell has the format 12.25 * 43 * 1, or 40 / 8, or a capital 'X', etc.
Output:
Quantity Quantity1
0 1 1.0
1 1000 1000.0
2 24 x 13.75 330.0
3 60 x 40 2400.0
4 750 750.0
inter_m=df["a1"].str.split("*",expand=True).astype('int32')
Split the column by the symbol * . Later convert the numbers to type integer. Store in intermediate dataframe.
df["a1"]=inter_m[0] * inter_m[1]
Multiply the two columns in the intermediate dataframe.
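If the cells really contain Python-style expressions such as 24 * 365, as in the original question, a minimal sketch using pd.eval per cell also works; this assumes every cell is either a plain number or a valid expression using * (it would fail on the ' x ' format above):
df = pd.DataFrame({'Quantity': ['24 * 365', '750']})
df['Quantity1'] = df['Quantity'].map(pd.eval)  # '24 * 365' -> 8760, '750' -> 750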

Converting Negative Number in String Format to Numeric when the Sign is at the End

I have a column in my dataframe where some numbers are stored as strings with the negative sign at the end, like this: "500.00-". I need to convert every negative number in the column to numeric format. I'm sure there's an easy way to do this, but I have struggled to find one specific to a pandas dataframe. Any help would be greatly appreciated.
I have tried the basic to_numeric function as shown below, but it doesn't read the values correctly. Also, only some of the numbers in the column are negative, so I can't simply strip all the trailing signs and multiply the whole column by -1.
Q1['Credit'] = pd.to_numeric(Q1['Credit'])
Sample data:
df:
num
0 50.00
1 60.00-
2 70.00+
3 -80.00
Use the Series str accessor to check the last character; if it is '-' or '+', move it to the front. Use Series.mask to apply this only to rows with a -/+ suffix. Finally, astype the column to float:
df.num.mask(df.num.str[-1].isin(['-', '+']),
            df.num.str[-1].str.cat(df.num.str[:-1])).astype('float')
Out[1941]:
0 50.0
1 -60.0
2 70.0
3 -80.0
Name: num, dtype: float64
Possibly a bit explicit, but it would work:
# build a mask of negative numbers
m_neg = Q1["Credit"].str.endswith("-")
# remove - signs
Q1["Credit"] = Q1["Credit"].str.rstrip("-")
# convert to number
Q1["Credit"] = pd.to_numeric(Q1["Credit"])
# Apply the mask to create the negatives
Q1.loc[m_neg, "Credit"] *= -1
Let us consider the following example dataframe:
Q1 = pd.DataFrame({'Credit':['500.00-', '100.00', '300.00-']})
Credit
0 500.00-
1 100.00
2 300.00-
We can use str.endswith to create a mask which indicates the negative numbers. Then we use np.where to conditionally convert the numbers to negative:
m1 = Q1['Credit'].str.endswith('-')
m2 = Q1['Credit'].str.rstrip('-').astype(float)  # strip only a trailing '-'; str[:-1] would cut the last character of every value
Q1['Credit'] = np.where(m1, -m2, m2)
Output
Credit
0 -500.0
1 100.0
2 -300.0
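A compact alternative, offered as a sketch: move any trailing sign to the front with a regex and let to_numeric handle the rest (this assumes the sign, when present, is always the final character):
Q1['Credit'] = pd.to_numeric(Q1['Credit'].str.replace(r'^(.*)([+-])$', r'\2\1', regex=True))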