Just a newbie using numpy. I have long data, but simplified it looks like this:
a 3
b 2
c 1
d 0
e 1
f 2
g 3
I want the output to be:
a -3
b -2
c -1
d 0
e 1
f 2
g 3
I tried to use numpy to negate the data above the row where column 2 is 0, but I always get an error.
Can anyone help me, please?
If the values really are ascending indices, then something like this:
import numpy as np
a = np.arange(-3, 4)   # the second column: -3 ... 3
print(a)
b = np.zeros((7, 2))   # room for both columns
print(b)
b[:, 1] = a            # fill the second column
To make this slightly more general: given an array arr whose sign you want to flip up to (and excluding) a value v, you could do
import numpy as np
arr = np.array([3, 2, 1, 0, 1, 2, 3])
v = 0
# find the index of the first occurrence of v:
idx = np.argmax(arr == v)
# change the sign up to index idx;
# since argmax returns 0 if v is not found, we have to check that:
if arr[idx] == v:
    arr[:idx] = arr[:idx] * -1
    # to not touch the original array, you could instead do e.g.
    # arr_new = np.concatenate([arr[:idx] * -1, arr[idx:]])
# (an else branch could raise a ValueError or something like that)
print(arr)
# [-3 -2 -1  0  1  2  3]
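A variant of the same idea, sketched with np.flatnonzero so the not-found case is handled explicitly (same arr and v as above):
import numpy as np
arr = np.array([3, 2, 1, 0, 1, 2, 3])
v = 0
hits = np.flatnonzero(arr == v)   # all indices where arr equals v
if hits.size:                     # only flip if v actually occurs
    arr[:hits[0]] *= -1           # negate everything before the first occurrence
print(arr)
# [-3 -2 -1  0  1  2  3]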
I have two DataFrames (df1 and df2):
Sample Date Value_df1
1992-11-04 1
1992-11-12 2
1992-11-18 3
... ...
1992-12-02 4
1992-12-09 5
1992-12-21 6
1992-12-28 7
1993-01-07 8
Sample Date Value_df2
1992-11-04 9
1992-11-12 10
1992-11-18 11
... ...
1992-12-02 12
1992-12-09 13
1992-12-21 14
1992-12-28 15
1993-01-07 16
by establishing:
y = df1['Value_df1']
x = df2['Value_df2']
I want to fit the values of df1 and df2 to the equation Log10(y) = Log10(a) + b*Log10(x) and then get the constants a and b.
This is what I've done:
import numpy as np
# getting log10 values from both DFs
log_y = np.log10(y)
log_x = np.log10(x)
# degree-1 fit in log-log space: slope is b, intercept is log10(a)
curve = np.polyfit(log_x, log_y, 1)
b = curve[0]
a = curve[1]
# applying the inverse of log10 to obtain the real a value
a = 10 ** a
This way one should get the a and b values, right?
Is this a good approach? Is there a better way, or am I doing something wrong?
Thanks for your feedback.
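A quick way to convince yourself that this recovers a and b is to run the same steps on synthetic data where the constants are known (illustrative values only, not the data from the question):
import numpy as np
a_true, b_true = 2.5, 0.8              # made-up constants to recover
x = np.linspace(1, 100, 200)           # made-up x values
y = a_true * x ** b_true               # exact power law, no noise
b_fit, log_a_fit = np.polyfit(np.log10(x), np.log10(y), 1)
a_fit = 10 ** log_a_fit
print(a_fit, b_fit)                    # prints roughly 2.5 and 0.8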
I have a dataframe and I want to group the rows based on a specific column. The number of rows in each group will be at least 4 and at most 50. I want to save one column from each group into two lines. If the group size is even, say 2n, then n rows go in one line and the remaining n in the second line. If it is odd, n+1 and n (or n and n+1) will do.
For example,
import pandas as pd
from io import StringIO
data = """
id,name
1,A
1,B
1,C
1,D
2,E
2,F
2,ds
2,G
2, dsds
"""
df = pd.read_csv(StringIO(data))
I want to groupby id
df.groupby('id',sort=False)
and then get a dataframe like
id name
0 1 A B
1 1 C D
2 2 E F ds
3 2 G dsds
Probably not the most efficient solution, but it works:
import numpy as np
df = df.sort_values('id')
# next 3 lines: for each group find the separation
df['range_idx'] = range(0, df.shape[0])
df['mean_rank_group'] = df.groupby(['id'])['range_idx'].transform(np.mean)
df['separate_column'] = df['range_idx'] < df['mean_rank_group']
# groupby itself with the help of additional column
df.groupby(['id', 'separate_column'], as_index=False)['name'].agg(','.join).drop(
    columns='separate_column')
This is a bit of a convoluted approach, but it does the work:
def func(s: pd.Series):
    mid = max(s.shape[0] // 2, 1)
    l1 = ' '.join(list(s[:mid]))
    l2 = ' '.join(list(s[mid:]))
    return [l1, l2]
df_new = df.groupby('id').agg(func)
df_new["name1"] = df_new["name"].apply(lambda x: x[0])
df_new["name2"] = df_new["name"].apply(lambda x: x[1])
df = (df_new.drop(labels="name", axis=1)
            .stack()
            .reset_index()
            .drop(labels=["level_1"], axis=1)
            .rename(columns={0: "name"})
            .set_index("id"))
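Another compact way to express the even/odd split is np.array_split, which already divides a group of length 2n+1 into halves of n+1 and n (a rough sketch, assuming the same df as above):
import numpy as np
import pandas as pd

def two_lines(s: pd.Series) -> pd.Series:
    # split the group's names into two (nearly) equal halves
    halves = np.array_split(s.to_numpy(), 2)
    return pd.Series([' '.join(h) for h in halves])

out = (df.groupby('id', sort=False)['name']
         .apply(two_lines)
         .reset_index(level=1, drop=True)
         .rename('name')
         .reset_index())
print(out)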
I want to make a NumPy array as described below:
Random number: 0~9 (0 <= value <= 9), random 1D size: 5~9 (5 <= size <= 9).
And I want to find the missing numbers between the min and max, so I wrote code like this:
import numpy as np
min_val = 0
max_val = 10
min_val_len = 5
max_val_len = 10
arr1 = [4,3,2,7,8,2,3]
a = list(arr1)
print(a)
diff = np.setdiff1d(range(min_val, max_val), arr1)
arr = np.arange(min_val_len, max_val_len)
if diff in arr:
print(diff)
else:
print("no missing")
For my purposes, the output should be [5, 6].
And if the input is [1, 2, 3, 4, 5], the result should be 'no missing'.
But the code doesn't work as I expect.
I think you expect in to work in a way it does not: you want to check every single element. Try:
b = [d in arr for d in diff]
Now b contains a boolean value for each value d of diff. If you want to find the actual numbers that are missing, you can do it with a single expression:
b = np.intersect1d(np.setdiff1d(range(min_val, max_val), arr1), arr)
Now b contains all values of diff that are also in arr. But you can do it in an even simpler way, since you're already working with the notion of sets:
print(np.setdiff1d(range(min_val, max_val), arr1))
Also note that Python has built-in set types, so you do not actually need NumPy for this.
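For example, the pure-Python set version would be something along these lines:
missing = sorted(set(range(min_val, max_val)) - set(arr1))
print(missing if missing else "no missing")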
I'm reading from a database that has many array-type columns; pd.read_sql gives me a dataframe with columns that are dtype=object, containing lists.
I'd like an efficient way to find which rows have arrays containing some element:
s = pd.Series(
[[1,2,3], [1,2], [99], None, [88,2]]
)
print(s)
0 [1, 2, 3]
1 [1, 2]
2 [99]
3 None
4 [88, 2]
I'm building one-hot-encoded feature tables for an ML application, and I'd like to end up with tables like:
   contains_1  contains_2  contains_3  contains_88
0           1         ...
1           1
2           0
3         nan
4           0
...
I can unroll a series of arrays like so:
s2 = s.apply(pd.Series).stack()
0  0     1.0
   1     2.0
   2     3.0
1  0     1.0
   1     2.0
2  0    99.0
4  0    88.0
   1     2.0
which gets me to the point of being able to find the elements meeting some test:
>>> print(s2[(s2 == 2)].index.get_level_values(0))
Int64Index([0, 1, 4], dtype='int64')
Woot! This step:
s.apply(pd.Series).stack()
produces a great intermediate data structure (s2) that's fast to iterate over for each category. However, the apply step is jaw-droppingly slow (many tens of seconds for a single column of 500k rows with lists of tens of items), and I have many columns.
Update: It seems likely that having the data in a series of lists to begin with is quite slow. Performing the unroll on the SQL side seems tricky (I have many columns that I want to unroll). Is there a way to pull array data into a better structure?
import numpy as np
import pandas as pd
import cytoolz
s0 = s.dropna()
v = s0.values.tolist()                 # list of lists
i = s0.index.values                    # surviving original index values
l = [len(x) for x in v]                # length of each list
c = list(cytoolz.concat(v))            # all values, flattened
n = np.append(0, np.array(l[:-1])).cumsum().repeat(l)   # start offset of each list, repeated
k = np.arange(len(c)) - n              # position of each value within its list
s1 = pd.Series(c, [i.repeat(l), k])    # MultiIndex: (original row, position in list)
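If the goal is the contains_X table from the question, the unrolled series can then be one-hot encoded and collapsed back to one row per original index, for example (a sketch building on s1 from above; rows that were None are simply absent):
dummies = pd.get_dummies(s1).groupby(level=0).max()   # one column per distinct element
dummies.columns = [f'contains_{c}' for c in dummies.columns]
print(dummies)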
UPDATE: What worked for me...
def unroll(s):
    s = s.dropna()
    v = s.values.tolist()
    c = pd.Series(x for x in cytoolz.concat(v))  # 16 seconds!
    i = s.index
    lens = np.array([len(x) for x in v])  # s.apply(len) is slower
    n = np.append(0, lens[:-1]).cumsum().repeat(lens)
    k = np.arange(sum(lens)) - n
    s = pd.Series(c)
    s.index = [i.repeat(lens), k]
    s = s.dropna()
    return s
It should be possible to replace:
s = pd.Series(c)
s.index = [i.repeat(lens), k]
with:
s = pd.Series(c, index=[i.repeat(lens), k])
But this doesn't work (it is said to be OK here).
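Usage then mirrors the original query, e.g.:
s2 = unroll(s)
print(s2[s2 == 2].index.get_level_values(0))
# Int64Index([0, 1, 4], dtype='int64')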
How do I compare the value of the first row and the last row in col b for each group defined by col a, without using the groupby function? The groupby function is very slow for a large dataset.
a = [1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3]
b = [1,0,0,0,0,0,7,8,0,0,0,0,0,4,1,0,0,0,0,0,1]
I want two lists back: one with the group names from col a where the last value is larger than or equal to the first value, and one where it is smaller:
larger_or_equal = [1,3]
smaller = [2]
All numpy
a = np.array([1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3])
b = np.array([1,0,0,0,0,0,7,8,0,0,0,0,0,4,1,0,0,0,0,0,1])
w = np.where(a[1:] != a[:-1])[0] # find the edges
e = np.append(w, len(a) - 1) # define the end pos
s = np.append(0, w + 1) # define start pos
# slice end positions with a boolean array, then slice the groups with the end positions.
# I could also have used start positions.
a[e[b[e] >= b[s]]]   # groups where the last value is >= the first
a[e[b[e] < b[s]]]    # groups where the last value is < the first
[1 3]
[2]
Here is a solution without groupby. The idea is to shift column a to detect group changes:
df[df['a'].shift() != df['a']]
a b
0 1 1
7 2 8
14 3 1
df[df['a'].shift(-1) != df['a']]
a b
6 1 7
13 2 4
20 3 1
We will compare the column b in those two dataframes. We simply need to reset the index for the pandas comparison to work:
first = df[df['a'].shift() != df['a']].reset_index(drop=True)
last = df[df['a'].shift(-1) != df['a']].reset_index(drop=True)
first.loc[last['b'] >= first['b'], 'a'].values
array([1, 3])
Then do the same with < to get the other groups. Or do a set difference.
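For completeness, the complementary selection follows the same pattern:
smaller = first.loc[last['b'] < first['b'], 'a'].values
# array([2])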
As I wrote in the comments, groupby(sort=False) might well be faster depending on your dataset.