Arithmetic operations on large dataframe - pandas

Apologies in advance if this isn't a good question; I'm a beginner with DataFrames...
I have a large dataframe (about a thousand rows and 5000+ columns).
The first 5000 columns contain numbers, and I need to perform operations on each of these numbers based on the values of other columns.
For instance, multiplying the first 5000 numbers in a row by the value of another column in the same row.
Index      1     2     3     4   ...  5000    a    b    c    d
0        0.1   0.4   0.8   0.6   ...   0.3    3    7    2    9
1        0.7   0.5   0.4   0.8   ...   0.1    4    6    1    3
...      ...   ...   ...   ...   ...   ...  ...  ...  ...  ...
1000     0.2   0.5   0.1   0.9   ...   0.6    6    8    5    4
This is an example of code that multiplies my numbers by the column "a", then multiplies by a constant, and finally takes the exponential of the result:
a_col = df.columns.get_loc("a")
df.iloc[:, :5000] = np.exp(df.iloc[:, :5000] * df.iloc[:, [a_col]].to_numpy() * np.sqrt(4))
While the results look fine, it feels slow, especially compared to the code I'm trying to replace, which performed these operations row by row in a loop.
Is this the proper way to do what I'm trying to achieve, or am I doing something wrong?
Thank you for your help !

Use the .values attribute to get the underlying NumPy arrays, use np.newaxis to turn df.a into a column vector, and multiply row-wise:
df.iloc[:, :5000] = np.exp(df.iloc[:, :5000].values * df.a.values[:, np.newaxis] * np.sqrt(4))
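To see what the np.newaxis broadcasting does, here is a minimal sketch on a toy frame built from the first rows of the example table (a hypothetical 3-column version, not the real 5000-column data):
import numpy as np
import pandas as pd

# toy version: three "data" columns plus the multiplier column "a"
df = pd.DataFrame({1: [0.1, 0.7], 2: [0.4, 0.5], 3: [0.8, 0.4], "a": [3, 4]})

block = df.iloc[:, :3].values          # shape (2, 3)
a = df["a"].values[:, np.newaxis]      # shape (2, 1): a column vector
print(np.exp(block * a * np.sqrt(4)))  # each row of block is scaled by its own "a" value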

Try this:
df.iloc[:, :5000] = np.exp(df.iloc[:, :5000].values * df["a"].to_numpy().reshape(-1, 1) * np.sqrt(4))
It took just a few seconds to run (for the 5 million cells).
If it works, I'll explain it :)
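To get a rough sense of scale, here is a self-contained sketch on synthetic data of the same shape (random values; only the a, b, c, d column names follow the question's table, everything else is made up):
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = pd.DataFrame(rng.random((1000, 5000)))                               # the 5000 numeric columns
extra = pd.DataFrame(rng.integers(1, 10, (1000, 4)), columns=list("abcd"))  # the extra columns
df = pd.concat([data, extra], axis=1)

df.iloc[:, :5000] = np.exp(df.iloc[:, :5000].values * df["a"].to_numpy().reshape(-1, 1) * np.sqrt(4))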

Related

Python Pandas groupby and join

I am fairly new to python pandas and cannot find the answer to my problem in any older posts.
I have a simple dataframe that looks something like that:
dfA = {'stop': [1, 2, 3, 4, 5, 1610, 1611, 1612, 1613, 1614, 2915, ...],
       'seq':  ['B', 'B', 'D', 'A', 'C', 'C', 'A', 'B', 'A', 'C', 'A', ...]}
Now I want to merge the 'seq' values within each group where the difference between consecutive 'stop' values is equal to 1. Where the difference is large, like between 5 and 1610, the next cluster begins, and so on.
What I need is to write all values from each cluster into separate rows:
0 BBDAC # join 'stop' cluster 1-5
1 CABAC # join 'stop' cluster 1610-1614
2 A.... # join 'stop' cluster 2915 - ...
etc...
What I am getting with my current code is like:
True BDACABAC...
False BCA...
for the entire huge dataframe.
I understand the logic behind the way it merges them: it groups on the condition I specified (not perfectly, since it loses the cluster edges), but I am running out of ideas for how to get the values joined and split properly into clusters, rather than across all rows of the dataframe.
Please see my code below:
dfB = dfA.groupby((dfA.stop - dfA.stop.shift(1) == 1))['seq'].apply(lambda x: ''.join(x)).reset_index()
Please help.
P.S. I have also tried various combinations with diff(), but that didn't help either. I am not sure whether groupby is even the right tool for this. Please advise!
dfC = dfA.groupby((dfA['stop'].diff(periods=1)))['seq'].apply(lambda x: ''.join(x)).reset_index()
This somehow split the dataframe into smaller, cluster-like chunks, but I do not understand the logic behind the way it did it, and I know the result makes no sense and is not what I intended to get.
I think you need to create a helper Series for grouping:
g = dfA['stop'].diff().ne(1).cumsum()
dfC = dfA.groupby(g)['seq'].apply(''.join).reset_index()
print (dfC)
stop seq
0 1 BBDAC
1 2 CABAC
2 3 A
Details:
First get differences by diff:
print (dfA['stop'].diff())
0 NaN
1 1.0
2 1.0
3 1.0
4 1.0
5 1605.0
6 1.0
7 1.0
8 1.0
9 1.0
10 1301.0
Name: stop, dtype: float64
Compare with ne (!=) to flag the first value of each group:
print (dfA['stop'].diff().ne(1))
0 True
1 False
2 False
3 False
4 False
5 True
6 False
7 False
8 False
9 False
10 True
Name: stop, dtype: bool
And last, create the group ids with cumsum:
print (dfA['stop'].diff().ne(1).cumsum())
0 1
1 1
2 1
3 1
4 1
5 2
6 2
7 2
8 2
9 2
10 3
Name: stop, dtype: int32
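Putting the pieces together on the eleven sample values from the question (a self-contained sketch):
import pandas as pd

dfA = pd.DataFrame({'stop': [1, 2, 3, 4, 5, 1610, 1611, 1612, 1613, 1614, 2915],
                    'seq':  ['B', 'B', 'D', 'A', 'C', 'C', 'A', 'B', 'A', 'C', 'A']})

# new group every time the gap to the previous 'stop' is not exactly 1
g = dfA['stop'].diff().ne(1).cumsum()
dfC = dfA.groupby(g)['seq'].apply(''.join).reset_index()
print(dfC)
#    stop    seq
# 0     1  BBDAC
# 1     2  CABAC
# 2     3      A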
I just figured it out.
I rounded the values of 'stop' down to the nearest 100 and assigned the result as a new column.
Then my previous code works.
Thank you so much for the quick answer though.
dfA['new_val'] = (dfA['stop'] / 100).astype(int) * 100
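For reference, on the same sample stops that bucketing produces the group keys below (a quick sketch; it only works as long as no cluster straddles a multiple of 100):
import pandas as pd

stops = pd.Series([1, 2, 3, 4, 5, 1610, 1611, 1612, 1613, 1614, 2915])
print(((stops / 100).astype(int) * 100).tolist())
# [0, 0, 0, 0, 0, 1600, 1600, 1600, 1600, 1600, 2900]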

Access SQL - Select same column twice for different criteria

I've been struggling with the following table for a while now. Hopefully someone can help me out.
Item Type Value
A X 2
B X 3
C X 4
D X 5
A Y 0.1
B Y 0.3
C Y 0.4
D Y 0.6
The result I would like to see is this:
Item X Y
A 2 0.1
B 3 0.3
C 4 0.4
D 5 0.6
Is it possible to fix this in one query?
I tried Union queries and IIF statements, but none of them gives me the desired result. Another option might be to split it up into multiple queries; however, I would rather have it done in one go.
Looking forward to any answer.
Many thanks!
Best,
Mathijs
That's a job for a Crosstab query.
TRANSFORM Max(Table1.[Value]) AS MaxOfValue
SELECT Table1.item
FROM Table1
GROUP BY Table1.item
PIVOT Table1.type;
P.S.: Value is a reserved word in Access and should not be used as a field name; if you keep it, bracket it as [Value]. I would avoid Type and Item as field names as well.

Performing a division of some elements in the column based on condition in R

I am getting started with R and am looking for some help. I have this column, extracted from an HTML page:
Brand_Value
200 M
400 M
2 B
5 B
150 M
As you can see, some items are in millions and some are in billions. I would like to convert all the million values to billions (i.e. divide by 1000) and remove the characters B and M. At the end, it should look like:
Brand_Value
0.2
0.4
2
5
0.15
Any pointers appreciated, thank you!
Regards
R
This isn't the most concise workaround, but it should work if the column only contains values in millions and billions ("M" and "B"):
# assumes `vector` holds the Brand_Value strings, e.g. c("200 M", "400 M", "2 B", "5 B", "150 M")
new_vector <- NA
for (i in 1:length(vector)) {
  if (grepl("M", vector[i])) {
    # millions -> billions: strip the "M" and divide by 1000
    new_vector[i] <- as.numeric(gsub("M", "", vector[i])) / 1000
  } else {
    # already in billions: just strip the "B"
    new_vector[i] <- as.numeric(gsub("B", "", vector[i]))
  }
}
new_vector

Calculate diff() between selected rows

I have a dataframe with ordered times (in seconds) and a column that is either 0 or 1:
time bit
index
0 0.24 0
1 0.245 0
2 0.47 1
3 0.471 1
4 0.479 0
5 0.58 1
... ... ...
I want to select the rows where the time difference is, let's say, < 0.01 s, but only the differences between a row with bit 1 and a row with bit 0. So in the above example I would only select rows 3 and 4 (or either one of them). I thought I would calculate the diff() of the time column, but I need to somehow also select on the 0/1 bit.
Coming from the future to answer this one. You can apply a function to the dataframe that finds the indices of the rows that adhere to the condition and returns the row pairs accordingly:
def filter_(x, threshold=0.01):
    # rows whose gap to the previous row is below the threshold AND whose bit flipped
    indices = df.index[(df.time.diff() < threshold) & (df.bit.diff().abs() == 1)]
    # keep both rows of each pair: the matching row and the one just before it
    mask = indices.union(indices - 1)
    return x[mask]

print(df.apply(filter_, args=(0.01,)))
Output:
time bit
3 0.471 1
4 0.479 0
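For what it's worth, the same selection can be done without apply, using a plain boolean mask; here is a sketch assuming the same column names (time, bit) and the 0.01 s threshold:
import pandas as pd

df = pd.DataFrame({"time": [0.24, 0.245, 0.47, 0.471, 0.479, 0.58],
                   "bit":  [0, 0, 1, 1, 0, 1]})

# rows whose gap to the previous row is < 0.01 s and whose bit flipped
hit = (df["time"].diff() < 0.01) & (df["bit"].diff().abs() == 1)
# keep each matching row and the row just before it
print(df[hit | hit.shift(-1, fill_value=False)])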

Using value_counts in pandas with conditions

I have a column with around 20k values. I've used the following function in pandas to display their counts:
weather_data["snowfall"].value_counts()
weather_data is the dataframe and snowfall is the column.
My results are:
0.0 12683
M 7224
T 311
0.2 32
0.1 31
0.5 20
0.3 18
1.0 14
0.4 13
etc.
Is there a way to:
Display the counts of only a single variable or number
Use an if condition to display the counts of only those values which satisfy the condition?
I'll be as clear as possible without having a full example, as piRSquared suggested you provide.
value_counts returns a Series, so the values of your original Series become the index of the result. Displaying the count of just one of them is then simply a matter of slicing that Series:
my_value_count = weather_data["snowfall"].value_counts()
my_value_count.loc['0.0']
output:
0.0 12683
If you want to display only for a list of variables:
my_value_count.loc[my_value_count.index.isin(['0.0','0.2','0.1'])]
output:
0.0 12683
0.2 32
0.1 31
As you have M and T among your values, I suspect the other values are stored as strings rather than floats. Otherwise you could use:
my_value_count.loc[my_value_count.index < 0.4]
output:
0.0 12683
0.2 32
0.1 31
0.3 18
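If the index really is made of strings, one workaround (a sketch that reconstructs the counts shown above) is to coerce the index to numbers first; pd.to_numeric with errors='coerce' turns 'M' and 'T' into NaN, so they simply fail the comparison:
import pandas as pd

# same counts as in the question, with a string index
my_value_count = pd.Series({'0.0': 12683, 'M': 7224, 'T': 311, '0.2': 32, '0.1': 31,
                            '0.5': 20, '0.3': 18, '1.0': 14, '0.4': 13})

num_index = pd.to_numeric(my_value_count.index, errors='coerce')
print(my_value_count[num_index < 0.4])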
Use an if condition to display the counts of only those values which satisfy the condition?
First create a new column based on the condition you want. Then you can use groupby and sum.
For example, suppose you want to count the frequency only when a column has a non-null value; in my case, only when there is a non-null ACTUAL_COMPLETION_DATE:
dataset['Has_actual_completion_date'] = np.where(dataset['ACTUAL_COMPLETION_DATE'].isnull(), 0, 1)
dataset['Mitigation_Plans_in_progress'] = dataset['Has_actual_completion_date'].groupby(dataset['HAZARD_ID']).transform('sum')
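Tying this back to the snowfall column from the question, here is a minimal sketch of both asks using plain boolean masks (the sample snowfall values are made up for illustration):
import pandas as pd

weather_data = pd.DataFrame({"snowfall": ["0.0", "M", "T", "0.2", "0.0", "0.1"]})

# count of a single value
print((weather_data["snowfall"] == "0.0").sum())

# counts restricted to rows that satisfy a condition
print(weather_data.loc[weather_data["snowfall"] != "M", "snowfall"].value_counts())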