Find Pattern in consecutive numbers - sql

What's the most effective method to detect a pattern in consecutive numbers?
Maybe in an SQL column, or an R vector.
Some pseudocode (R-like) to illustrate the "problem":
find a pattern in consecutive integers, where
2nd integer < 1st integer,
3rd integer > 2nd integer &
4th integer > 3rd integer.
a <- x
b <- x +1 < a
c <- x +2 > b
d <- x +3 > c
pattern <- c(a, b, c, d)
example: pattern <- c(10, 8, 9, 11) or pattern <- c(2.11, 2.08, 2.09, 2.11)
count(pattern)
find(pattern)

If you take the differences of the vector, the first should be negative and the other two positive, so,
a <- c(10, 8, 9, 11)
all((diff(a) < 0) == c(TRUE, FALSE, FALSE))
#[1] TRUE
To apply that to a bigger vector, you can use rollapply from the zoo package, i.e.
library(zoo)
a <- sample(1:100,100,replace=T)
unique(rollapply(a, 4, by = 1, function(i) i[all((diff(i) < 0) == c(TRUE, FALSE, FALSE))]))
which gives (your exact rows will differ, since a is random),
[,1] [,2] [,3] [,4]
[1,] 85 18 85 92
[2,] 44 27 67 76
[3,] 58 2 39 54
[4,] 85 69 82 84
[5,] 61 4 40 44
[6,] 65 58 73 97
[7,] 19 9 92 96
[8,] 33 24 57 73
[9,] 79 11 37 100
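For readers who would rather do this in Python than R, the same down-up-up window check can be sketched in plain Python (a minimal sketch, not part of the original R answer):

```python
def find_down_up_up(a):
    """Return the start index of every length-4 window whose
    consecutive steps go: down, up, up (2nd < 1st, 3rd > 2nd, 4th > 3rd)."""
    hits = []
    for i in range(len(a) - 3):
        w = a[i:i + 4]
        if w[1] < w[0] and w[2] > w[1] and w[3] > w[2]:
            hits.append(i)
    return hits

print(find_down_up_up([10, 8, 9, 11, 5, 7]))  # [0]
```

The window starting at index 0 (10, 8, 9, 11) matches the pattern; the remaining windows do not.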

Slicing pandas dataframe by closest value

I have a pandas data frame that looks like this:
age score
5 72 99.424
6 70 99.441
7 69 99.442
8 67 99.443
9 71 99.448
mean score: 99.4396
The mean is taken over the whole score column. How can I slice/get the age values whose score is within, say, +/- 0.001 of the mean score?
So in this case: 67 and 69
mean = df['score'].mean()
df[df['score'].between(mean - .001, mean + .001)]['age']
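To make the between() approach reproducible on the sample rows from the question (a sketch; note that a tolerance of +/- 0.001 matches nothing in this small sample, so it is widened here to 0.003):

```python
import pandas as pd

df = pd.DataFrame({"age": [72, 70, 69, 67, 71],
                   "score": [99.424, 99.441, 99.442, 99.443, 99.448]})
mean = df["score"].mean()  # about 99.4396 for this sample
tol = 0.003                # +/- 0.001 would match no row in this small sample
ages = df[df["score"].between(mean - tol, mean + tol)]["age"].tolist()
print(ages)  # [70, 69]
```

The tolerance is the only knob: tighten it and fewer rows survive, widen it and more do.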
import pandas as pd
df = pd.DataFrame({"age": [72, 70, 69, 67, 71], "score": [99.424, 99.441, 99.442, 99.443, 99.448]})
df["diff"] = (df["score"] - df["score"].mean()).abs()
You get :
age score diff
0 72 99.424 0.0156
1 70 99.441 0.0014
2 69 99.442 0.0024
3 67 99.443 0.0034
4 71 99.448 0.0084
Then :
x = 0.002
ages = list(df.loc[df["diff"] < x]["age"])
[Out]: [70]
x will be your parameter for the difference with the mean.
EDIT: note that we cannot reproduce your expected result (67 and 69), since we do not have your full score column.
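An alternative that avoids choosing a tolerance at all is to rank rows by absolute distance from the mean and keep the k closest. This is a different technique from the answers above, sketched here for comparison:

```python
import pandas as pd

df = pd.DataFrame({"age": [72, 70, 69, 67, 71],
                   "score": [99.424, 99.441, 99.442, 99.443, 99.448]})
mean = df["score"].mean()
# rank rows by |score - mean| and keep the two closest ones
closest = df.loc[(df["score"] - mean).abs().nsmallest(2).index, "age"].tolist()
print(closest)  # [70, 69]
```

This always returns exactly k rows, which may or may not be what you want when several scores tie near the boundary.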

Create column with values only for some multiindex in pandas

I have a dataframe like this:
df = pd.DataFrame(np.random.randint(50, size=(4, 4)),
                  index=[['a', 'a', 'b', 'b'], [800, 900, 800, 900]],
                  columns=['X', 'Y', 'r_value', 'z_value'])
df.index.names = ["dat", "recor"]
X Y r_value z_value
dat recor
a 800 14 28 12 18
900 47 34 59 49
b 800 33 18 24 33
900 18 25 44 19
...
I want to apply a function to create a new column based on r_value that gives values only for the case of recor==900, so, in the end I would like something like:
X Y r_value z_value BB
dat recor
a 800 14 28 12 18 NaN
900 47 34 59 49 0
b 800 33 18 24 33 NaN
900 18 25 44 19 2
...
I have created the function like:
x = df.loc[pd.IndexSlice[:,900], "r_value"]
conditions = [x >= 70, np.logical_and(x >= 40, x < 70),
              np.logical_and(x >= 10, x < 40), x < 10]
choices = [0, 1, 2, 3]
BB = np.select(conditions, choices)
So now I need to append BB as a column, filling with NaNs the rows corresponding to recor==800. How can I do it? I have tried a couple of ideas (not commented here) without result. Thx.
Try
df.loc[df.index.get_level_values('recor') == 900, 'BB'] = BB
The part df.index.get_level_values('recor') == 900 creates a boolean array that is True where the index level "recor" equals 900.
Assigning with .loc to a column that does not already exist, i.e. "BB", creates that new column.
The rest of the column is automatically filled with NaN.
I can't test it, since you didn't include a minimal reproducible example.
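Putting the pieces together, a minimal reproducible sketch of the whole approach (the seeded random generator is an illustrative choice, not from the question, so the run is deterministic):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 50, size=(4, 4)),
                  index=[['a', 'a', 'b', 'b'], [800, 900, 800, 900]],
                  columns=['X', 'Y', 'r_value', 'z_value'])
df.index.names = ["dat", "recor"]

# bin r_value into categories, but only for the recor == 900 rows
x = df.loc[pd.IndexSlice[:, 900], "r_value"]
conditions = [x >= 70, (x >= 40) & (x < 70), (x >= 10) & (x < 40), x < 10]
BB = np.select(conditions, [0, 1, 2, 3])

mask = df.index.get_level_values("recor") == 900
df.loc[mask, "BB"] = BB  # rows with recor == 800 are left as NaN
print(df)
```

The new BB column holds the np.select result on the 900 rows and NaN everywhere else.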

Compare two Excel files that have a different number of rows using Python Pandas

I'm using Python 3.7, and I want to compare two Excel files that have the same columns (140 columns) but a different number of rows. I looked online, but I didn't find a solution for my case!
Here is an example :
df1 (old report) :
id qte d1 d2
A 10 23 35
B 43 63 63
C 15 61 62
df2 (new report) :
id qte d1 d2
A 20 23 35
C 15 61 62
E 38 62 16
F 63 20 51
and the results should be :
the modified rows must be in yellow, with the modified values in red
the new rows in green
the deleted rows in red
id qte d1 d2
A 20 23 35
C 15 61 62
B 43 63 63
E 38 62 16
F 63 20 51
the code :
import pandas as pd
import numpy as np
df1= pd.read_excel(r'C .....\data novembre.xlsx','Sheet1',na_values=['NA'])
df2= pd.read_excel(r'C.....\data decembre.xlsx','Sheet1',na_values=['NA'])
merged_data=df1.merge(df2, left_on = 'id', right_on = 'id', how = 'outer')
Simply joining the data, though, is not what I want!
I'm just starting to learn Python, so I really need help!
An Excel diff can quickly become a funky beast, but we should be able to do this with some concats and boolean statements.
assuming your dataframes are called df1, df2
df1 = df1.set_index('id')
df2 = df2.set_index('id')
df3 = pd.concat([df1,df2],sort=False)
df3a = df3.stack().groupby(level=[0,1]).unique().unstack(1).copy()
df3a.loc[~df3a.index.isin(df2.index),'status'] = 'deleted' # if not in df2 index then deleted
df3a.loc[~df3a.index.isin(df1.index),'status'] = 'new' # if not in df1 index then new
idx = df3.stack().groupby(level=[0,1]).nunique() # get modified cells.
df3a.loc[idx.mask(idx <= 1).dropna().index.get_level_values(0),'status'] = 'modified'
df3a['status'] = df3a['status'].fillna('same') # assume that anything not fufilled by above rules is the same.
print(df3a)
d1 d2 qte status
id
A [23] [35] [10, 20] modified
B [63] [63] [43] deleted
C [61] [62] [15] same
E [62] [16] [38] new
F [20] [51] [63] new
If you don't mind the performance hit of turning all your datatypes to strings, then this could work. I don't recommend it though; use a fact table or a slowly changing dimension schema to hold such data, and you'll thank yourself in the future.
df3a.stack().explode().astype(str).groupby(level=[0,1]).agg('-->'.join).unstack(1)
d1 d2 qte status
id
A 23 35 10-->20 modified
B 63 63 43 deleted
C 61 62 15 same
E 62 16 38 new
F 20 51 63 new
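If you only need the new/deleted/modified classification and not the cell-level coloring, an outer merge with indicator=True is a lighter-weight alternative. This is a different technique from the answer above, sketched under the assumption that a full-row mismatch on a shared id counts as "modified":

```python
import pandas as pd

df1 = pd.DataFrame({"id": ["A", "B", "C"],
                    "qte": [10, 43, 15], "d1": [23, 63, 61], "d2": [35, 63, 62]})
df2 = pd.DataFrame({"id": ["A", "C", "E", "F"],
                    "qte": [20, 15, 38, 63], "d1": [23, 61, 62, 20], "d2": [35, 62, 16, 51]})

# Outer-join on every column: a row present in only one frame shows up as
# left_only/right_only; an id that appears on both sides with differing
# rows yields both flags and is therefore "modified".
merged = df1.merge(df2, how="outer", indicator=True)
status = merged.groupby("id")["_merge"].agg(set)

def classify(flags):
    if flags == {"both"}:
        return "same"
    if flags == {"left_only"}:
        return "deleted"
    if flags == {"right_only"}:
        return "new"
    return "modified"

print(status.map(classify))
```

For the example frames above this labels A as modified, B as deleted, C as same, and E and F as new.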

Why given 2 arguments, TypeError: takes exactly 2 arguments (1 given)?

I do not know why my function that takes two arguments, when given two arguments, returns an error saying I have only given one argument. I am trying to get the output of a Fibonacci algorithm and then multiply that by the second user input.
Thanks
29 def population(n,k):
30 n = int(sys.argv[1])
31 k = int(sys.argv[2])
32
33 if n in range (1,10000):
34 pass
35 else:
36 print("day out of range")
37 if k in range(1,10000):
38 pass
39 else:
40 print("growth rate out of range")
41 FibArray = [0,1]
42
43 #reproduction rate is Fn=F(n-1)+F)n-2)
44 #start fibonacci sequence on one and end on the nth day, multiply by the rate or reproduction
45 while len(FibArray) < n + 1:
46 FibArray.append(0)
47
48 if n <=1:
49 return n
50 else:
51 if FibArray [int(n)-1]==0:
52 FibArray[n-1]=population(n-1)
53
54 if FibArray[n-2]==0:
55 FibArray[n-2]=population(n-2)
56
57 FibArray[n]= FibArray[n-2] + FibArray[n-1]
58 X = FibArray[n]
59 return k * X
60
61 if __name__=="__main__":
62 n = int(sys.argv[1])
63 k = int(sys.argv[2])
64 pop = population(n, k)
65 print("Your population size on day {} is {}".format(n,pop))
Traceback (most recent call last):
File "./fibonacci.py", line 64, in <module>
pop = population(n, k)
File "./fibonacci.py", line 55, in population
FibArray[n-2]=population(n-2)
TypeError: population() takes exactly 2 arguments (1 given)
To answer your question: read the error again. It says the error occurs on line 55, where you call population() with only one argument, (n-2), even though the function is defined to take two. The same problem exists on line 52 with population(n-1). You need to pass both arguments in the recursive calls, or restructure the code so k is applied only once at the end.
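For completeness, here is a corrected sketch of the program. The simplest fix is to drop the recursion entirely and build the Fibonacci list iteratively, applying k once at the end (an iterative rewrite of the asker's design, not a minimal patch):

```python
import sys

def population(n, k):
    """Day-n Fibonacci population multiplied by the growth rate k."""
    if not 1 <= n < 10000:
        raise ValueError("day out of range")
    if not 1 <= k < 10000:
        raise ValueError("growth rate out of range")
    fib = [0, 1]
    while len(fib) < n + 1:
        fib.append(fib[-1] + fib[-2])  # Fn = F(n-1) + F(n-2)
    return k * fib[n]

if __name__ == "__main__" and len(sys.argv) >= 3:
    n = int(sys.argv[1])
    k = int(sys.argv[2])
    print("Your population size on day {} is {}".format(n, population(n, k)))
```

Because k is applied only in the final return, there is no longer any need to thread it through recursive calls, which was the source of the TypeError.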

How to create a partially filled column in pandas

I have a df_trg with, say, 10 rows numbered 0-9.
From various sources I get values for an additional column foo, each source covering only a subset of rows, e.g. S1 has 0-3, 7, 9 and S2 has 4, 6.
I would like to end up with a data frame that has a single new column foo in which some rows may remain NaN.
Is there a "nicer" way than:
df_trg['foo'] = np.nan
for src in sources:
    df_trg['foo'][df_trg.index.isin(src.index)] = src
for example, using join or merge?
Let's create the source DataFrame (df), s1 and s2 (Series objects with updating data) and a list of them (sources):
df = pd.DataFrame(np.arange(1, 51).reshape((5, -1)).T)
s1 = pd.Series([11, 12, 13, 14, 15, 16], index=[0, 1, 2, 3, 7, 9])
s2 = pd.Series([27, 28], index=[4, 6])
sources = [s1, s2]
Start the computation from adding foo column, initially filled with
an empty string:
df = df.assign(foo='')
Then run the following "updating" loop:
for src in sources:
    df.foo.update(other=src)
The result is:
0 1 2 3 4 foo
0 1 11 21 31 41 11
1 2 12 22 32 42 12
2 3 13 23 33 43 13
3 4 14 24 34 44 14
4 5 15 25 35 45 27
5 6 16 26 36 46
6 7 17 27 37 47 28
7 8 18 28 38 48 15
8 9 19 29 39 49
9 10 20 30 40 50 16
In my opinion, this solution is (at least a little) nicer than yours, and shorter.
Alternative: Fill foo column initially with NaN, but this time
updating values will be converted to float (side effect of using NaN).