I want to enter a value that falls inside a range between the 'from' column and the 'to' column, e.g. 1371, and get back the single matching entry from the value column: 27.89.
The value is entered via a search box in my Android app.
The table looks like this:
id  from  to       value
1   0     1339.99  0
2   1340  1349.99  6.89
3   1350  1359.99  13.89
4   1360  1369.99  20.89
5   1370  1379.99  27.89
6   1380  1389.99  34.89
7   1390  1399.99  41.89
8   1400  1409.99  48.89
9   1410  1419.99  55.89
I tried it with BETWEEN and also with >= and <=, but I can't get any of these WHERE conditions to work.
I am grateful for any pointers.
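A minimal sketch of the range lookup, written with Python's sqlite3 module as a stand-in for the Android SQLite API. It assumes the database is SQLite, that the columns are literally named from and to, and that the table is called ranges (a hypothetical name, not given in the question). FROM is a reserved SQL keyword, so a column with that name has to be quoted; quoting "to" as well is harmless. This may or may not be the cause of the problem described above.

```python
import sqlite3

# In-memory database; the table name "ranges" is an assumption for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE ranges (id INTEGER PRIMARY KEY, "from" REAL, "to" REAL, value REAL)'
)
rows = [
    (1, 0, 1339.99, 0),
    (2, 1340, 1349.99, 6.89),
    (3, 1350, 1359.99, 13.89),
    (4, 1360, 1369.99, 20.89),
    (5, 1370, 1379.99, 27.89),
    (6, 1380, 1389.99, 34.89),
    (7, 1390, 1399.99, 41.89),
    (8, 1400, 1409.99, 48.89),
    (9, 1410, 1419.99, 55.89),
]
conn.executemany("INSERT INTO ranges VALUES (?, ?, ?, ?)", rows)


def lookup(conn, x):
    """Return the value whose [from, to] range contains x, or None."""
    row = conn.execute(
        'SELECT value FROM ranges WHERE ? BETWEEN "from" AND "to"', (x,)
    ).fetchone()
    return None if row is None else row[0]


print(lookup(conn, 1371))
```

The search term is passed as a bound parameter rather than concatenated into the SQL string. If the search box delivers text, converting it to a number before binding avoids a text-versus-number comparison.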
Hello thanks in advance for all answers, I really appreciate community help
Here is my dataframe - from a csv containing scraped data from cars classified ads
Unnamed: 0 NameYear \
0 0 BMW 7 серия, 2007
1 1 BMW X3, 2021
2 2 BMW 2 серия Gran Coupe, 2021
3 3 BMW X5, 2021
4 4 BMW X1, 2021
Price \
0 520 000 ₽
1 от 4 810 000 ₽\n4 960 000 ₽ без скидки
2 2 560 000 ₽
3 от 9 259 800 ₽\n9 974 800 ₽ без скидки
4 от 3 130 000 ₽\n3 220 000 ₽ без скидки
CarParams \
0 187 000 км, AT (445 л.с.), седан, задний, бензин
1 2.0 AT (190 л.с.), внедорожник, полный, дизель
2 1.5 AMT (140 л.с.), седан, передний, бензин
3 3.0 AT (400 л.с.), внедорожник, полный, дизель
4 2.0 AT (192 л.с.), внедорожник, полный, бензин
url
0 https://www.avito.ru/moskva/avtomobili/bmw_7_s...
1 https://www.avito.ru/moskva/avtomobili/bmw_x3_...
2 https://www.avito.ru/moskva/avtomobili/bmw_2_s...
3 https://www.avito.ru/moskva/avtomobili/bmw_x5_...
4 https://www.avito.ru/moskva/avtomobili/bmw_x1_...
THE TASK: I want to know whether there are duplicate rows, i.e. whether the SAME car advertisement appears twice. The url is probably the most reliable key because it should be unique, whereas CarParams or NameYear can legitimately repeat. So I will check nunique and duplicated on the url column.
Screenshot to visually inspect the result of duplicated:
THE ISSUE: Visual inspection (sorry for the unprofessional jargon) shows that the flagged urls are not the SAME, but I wanted to find exactly identical urls in order to check for repeated data. I also tried setting keep=False.
Try:
df.duplicated(subset=["url"], keep=False)
df.duplicated() gives you a pd.Series of bool values.
Here is an example that you could probably use:
from random import randint
import pandas as pd
urls = ['http://www.google.com',
        'http://www.stackoverfow.com',
        'http://bla.xy', 'http://bla.com']
d = []
for i, url in enumerate(urls):
    for j in range(0, randint(1, i + 1)):
        d.append(dict(customer=str(randint(1, 100)), url=url))
df = pd.DataFrame(d)
df['dups'] = df['url'].duplicated(keep=False)
print(df)
resulting in the following df:
  customer                          url   dups
0       89        http://www.google.com  False
1       43  http://www.stackoverfow.com  False
2       36                http://bla.xy   True
3       86                http://bla.xy   True
4       32               http://bla.com  False
The column dups shows you which urls exist more than once. In my example data that is only the url http://bla.xy.
The important thing is to check what the parameter keep does:
keep : {'first', 'last', False}, default 'first'
Determines which duplicates (if any) to mark.
first : Mark duplicates as True except for the first occurrence.
last : Mark duplicates as True except for the last occurrence.
False : Mark all duplicates as True.
In my case I used False to get all duplicated values.
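To make the effect of keep concrete, here is a small deterministic sketch. The Series contents are made-up placeholder strings, not urls from the question.

```python
import pandas as pd

# Placeholder values; "a" occurs three times, "b" and "c" once each.
urls = pd.Series(["a", "b", "a", "c", "a"])

# keep="first": every repeat after the first occurrence is flagged.
first = urls.duplicated(keep="first").tolist()

# keep="last": every repeat before the last occurrence is flagged.
last = urls.duplicated(keep="last").tolist()

# keep=False: all members of a duplicated group are flagged.
all_marked = urls.duplicated(keep=False).tolist()

print(first)       # [False, False, True, False, True]
print(last)        # [True, False, True, False, False]
print(all_marked)  # [True, False, True, False, True]
```

Filtering with df[df["url"].duplicated(keep=False)] and then sorting by url puts the identical urls next to each other, which makes visual inspection easier.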
I have this regular dataframe indexed by 'Date', called ES:
Price Day Hour num_obs med abs_med Ret
Date
2006-01-03 08:30:00 1260.583333 1 8 199 1260.416667 0.166667 0.000364
2006-01-03 08:35:00 1261.291667 1 8 199 1260.697917 0.593750 0.000562
2006-01-03 08:40:00 1261.125000 1 8 199 1260.843750 0.281250 -0.000132
2006-01-03 08:45:00 1260.958333 1 8 199 1260.895833 0.062500 -0.000132
2006-01-03 08:50:00 1261.214286 1 8 199 1260.937500 0.276786 0.000203
I have this other dataframe indexed by the following MultiIndex. The first level (the hour) goes from 0 to 23 and the second level (the minute) goes from 0 to 55 in steps of 5. In other words, it holds one value per 5-minute slot of the day.
5min_Ret
0 0 2.235875e-06
5 9.814064e-07
10 -1.453213e-06
15 4.295757e-06
20 5.884896e-07
25 -1.340122e-06
30 9.470660e-06
35 1.178204e-06
40 -1.111621e-05
45 1.159005e-05
50 6.148861e-06
55 1.070586e-05
1 0 1.485287e-05
5 3.018576e-06
10 -1.513273e-05
15 -1.105312e-05
20 3.600874e-06
...
I want to create a column in the original dataframe, ES, that has the appropriate '5min_Ret' at each appropriate hour/5minute combo.
I've tried multiple things: looping over rows, finding some apply function. But nothing has worked so far. I feel like I'm overlooking a simple and Pythonic solution here.
The expected output creates a new column called '5min_ret' to the original dataframe in which each row corresponds to the correct hour/5minute pair from the smaller dataframe containing the 5min_ret
Price Day Hour num_obs med abs_med Ret 5min_ret
Date
2006-01-03 08:30:00 1260.583333 1 8 199 1260.416667 0.166667 0.000364 xxxx
2006-01-03 08:35:00 1261.291667 1 8 199 1260.697917 0.593750 0.000562 xxxx
2006-01-03 08:40:00 1261.125000 1 8 199 1260.843750 0.281250 -0.000132 xxxx
2006-01-03 08:45:00 1260.958333 1 8 199 1260.895833 0.062500 -0.000132 xxxx
2006-01-03 08:50:00 1261.214286 1 8 199 1260.937500 0.276786 0.000203 xxxx
I think one way is to merge on hour and minute. First create a column 'min' in ES from the DatetimeIndex:
ES['min'] = ES.index.minute
Now you can merge with your MultiIndex dataframe containing the column '5min_Ret', which I call df_multi here. Merging on columns discards the index of ES, so move 'Date' into a column first and restore it afterwards. Also note that the column in ES is named 'Hour', with a capital H:
ES = (ES.reset_index()
        .merge(df_multi.reset_index(), left_on=['Hour', 'min'],
               right_on=['level_0', 'level_1'], how='left')
        .set_index('Date'))
Here 'Hour' and 'min' from ES are matched against 'level_0' and 'level_1', the columns that reset_index creates from the unnamed MultiIndex of df_multi. how='left' keeps every row of the left dataframe (ES).
You should get a new column in ES named '5min_Ret' with the values you are looking for. If you no longer need the helper columns, you can drop them with ES = ES.drop(['min', 'level_0', 'level_1'], axis=1)
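The merge described above can be tried end to end on a tiny example. This is only a sketch: the three rows of ES and the 5min_Ret values are made up for illustration, and df_multi is assumed to have an unnamed (hour, minute) MultiIndex as in the question.

```python
import pandas as pd

# Tiny stand-in for ES, indexed by a DatetimeIndex named "Date".
idx = pd.to_datetime(
    ["2006-01-03 08:30:00", "2006-01-03 08:35:00", "2006-01-03 08:40:00"]
)
ES = pd.DataFrame({"Price": [1260.58, 1261.29, 1261.13], "Hour": [8, 8, 8]}, index=idx)
ES.index.name = "Date"

# Tiny stand-in for the (hour, minute) lookup table; values are invented.
df_multi = pd.DataFrame(
    {"5min_Ret": [0.1, 0.2, 0.3]},
    index=pd.MultiIndex.from_tuples([(8, 30), (8, 35), (8, 40)]),
)

# Helper column holding the minute of each timestamp.
ES["min"] = ES.index.minute

merged = (
    ES.reset_index()  # keep "Date" as a column so the merge does not lose it
    .merge(
        df_multi.reset_index(),  # unnamed levels become level_0 / level_1
        left_on=["Hour", "min"],
        right_on=["level_0", "level_1"],
        how="left",
    )
    .drop(columns=["min", "level_0", "level_1"])
    .set_index("Date")
)
print(merged)
```

With how="left", a timestamp whose hour/minute pair is missing from df_multi keeps its row and gets NaN in 5min_Ret.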
I have a pandas dataframe:
id no_of_rows
1 2689
2 1515
3 3826
4 814
5 1650
6 2292
7 1867
8 2096
9 1618
10 923
11 766
12 191
I want to divide the ids into 5 different bins based on their number of rows,
such that every bin has an approximately equal total number of rows,
and assign the result as a new column bin.
One approach I thought of was:
df.no_of_rows.sum() = 20247
div_factor = 20247//5 = 4049
Adding the 1st and 2nd rows gives 2689 + 1515 = 4204 > div_factor,
so only id = 1 is assigned to bin 1.
Then continue in the same way with the following rows:
id no_of_rows bin
1 2689 1
2 1515 2
3 3826 3
4 814 4
5 1650 4
6 2292 5
7 1867
8 2096
9 1618
10 923
11 766
12 191
But this method turned out not to work.
Is there a way to get 5 bins such that every bin holds an approximately equal number of rows?
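You can use an approach based on cumulative percentiles.
n_bins = 5
dfa = df.sort_values(by='no_of_rows').no_of_rows.cumsum()
df['bin'] = dfa.apply(lambda x: min(n_bins - 1, int(n_bins * x / dfa.max())))
And then you can check with
df.groupby('bin').no_of_rows.sum()
The min(...) cap is needed because the last cumulative value equals the maximum and would otherwise land in an extra bin numbered n_bins. The more records you have, the more evenly the rows will be spread across the bins.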
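With only 12 ids, the cumulative approach can leave the bins fairly uneven. As an alternative sketch (a different technique from the one above, not part of the original answer), a greedy heuristic assigns the ids in descending order of size, each to the bin that currently has the smallest total. The data are the values from the question; bins are numbered from 0.

```python
# id -> no_of_rows, copied from the question.
rows = {
    1: 2689, 2: 1515, 3: 3826, 4: 814, 5: 1650, 6: 2292,
    7: 1867, 8: 2096, 9: 1618, 10: 923, 11: 766, 12: 191,
}
n_bins = 5

totals = [0] * n_bins  # running row total per bin
assignment = {}        # id -> bin number

# Largest ids first; each one goes to the currently lightest bin.
for id_, n in sorted(rows.items(), key=lambda kv: kv[1], reverse=True):
    b = totals.index(min(totals))
    assignment[id_] = b
    totals[b] += n

print(totals)
print(assignment)
```

To attach the result to the dataframe, df['bin'] = df['id'].map(assignment) should work, assuming the id column is named id as in the question.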
Essentially, I have a log which contains a Unique identifier for a subject which is tracked through multiple cases. I then used the following code, suggested previously through the great community here, to create an Index. Unfortunately, I've run into a new challenge that I can't seem to figure out. Here is a sample of the current data set to provide perspective.
Indexing function
sort cases by Unique_Modifier.
if $casenum=1 or Unique_Modifier<>lag(Unique_Modifier) Index=1.
if Unique_Modifier=lag(Unique_Modifier) Index=lag(Index)+1.
format Index(f2).
execute.
Unique Identifier Index Variable of interest
A 1 101
A 2 101
A 3 607
A 4 607
A 5 101
A 6 101
B 1 108
B 2 210
C 1 610
C 2 987
C 3 1100
C 4 610
What I'd like to do is create a new variable which contains the number of discrete, different entries in the variable of interest column. The expected output would be as the following:
Unique Identifier Index Variable of interest Intended Output
A 1 101 1
A 2 101 1
A 3 607 2
A 4 607 2
A 5 101 2
A 6 101 2
B 1 108 1
B 2 210 2
C 1 610 1
C 2 987 2
C 3 1100 3
C 4 610 3
I've tried a few different ways to do it. One was to use a similar index function: it works when the variable of interest changes between consecutive lines, but it fails when a value recurs later, for example 5 lines further down. My next idea was to use the AGGREGATE function, but I looked through the IBM manual and it doesn't seem to offer a function that would produce the intended output here. Does anyone have any ideas? I think a loop is the best bet, but loops within SPSS are a bit funky and hard to get working.
Try this:
data list list/Unique_Identifier Index VOI (3f) .
begin data.
1 1 101
1 2 101
1 3 607
1 4 607
1 5 101
1 6 101
2 1 108
2 2 210
3 1 610
3 2 987
3 3 1100
3 4 610
end data.
string voiT (a1000).
compute voiT=concat(ltrim(string(VOI,f10)),",").
compute Intended_Output=1.
do if index>1.
do if index(lag(voiT), rtrim(voiT))>0.
compute Intended_Output=lag(Intended_Output).
compute voiT=lag(voiT).
else.
compute Intended_Output=lag(Intended_Output)+1.
compute voiT=concat(rtrim(lag(voiT)), rtrim(voiT)).
end if.
end if .
exe.
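For readers who want to check the logic outside SPSS, the same running count of distinct values per identifier can be sketched in Python. This is an illustration of the intended output, not a translation of the syntax above; the function name running_distinct is made up, and it tracks values in a set rather than in a concatenated string.

```python
def running_distinct(records):
    """For each (identifier, value) pair, in order, return how many distinct
    values have been seen so far for that identifier."""
    seen = {}  # identifier -> set of values seen so far
    out = []
    for key, value in records:
        values = seen.setdefault(key, set())
        values.add(value)
        out.append(len(values))
    return out


# Sample data from the question.
records = [
    ("A", 101), ("A", 101), ("A", 607), ("A", 607), ("A", 101), ("A", 101),
    ("B", 108), ("B", 210),
    ("C", 610), ("C", 987), ("C", 1100), ("C", 610),
]
print(running_distinct(records))
# [1, 1, 2, 2, 2, 2, 1, 2, 1, 2, 3, 3]
```

Because membership is tested on whole values, a value such as 101 is never confused with one that merely contains it, such as 1101.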
I have a table in PowerPivot which contains the logged data of a traffic control camera mounted on a road. This table is filled with the velocity and the number of vehicles that pass this camera during a specific time span (e.g. 14:10 - 15:25). How can I get the average velocity of cars for a specific hour and list the results in a separate table with 24 rows (hours 0 - 23), where the second column of each row is the weighted average velocity for that hour? A sample of my stat_table data is given below:
count vel hour
----- --- ----
133 96.00237 15
117 91.45705 21
81 81.90521 6
2 84.29946 21
4 77.7841 18
1 140.8766 17
2 56.14951 14
6 71.72839 13
4 64.14309 9
1 60.949 17
1 77.00728 21
133 100.3956 6
109 100.8567 15
54 86.6369 9
1 83.96901 17
10 114.6556 21
6 85.39127 18
1 76.77993 15
3 113.3561 2
3 94.48055 2
In a separate PowerPivot table I have 24 rows and 2 columns, but when I enter my formula, every row gets filled with the same number. My formula is:
=sumX(FILTER(stat_table, stat_table[hour]=[hour]), stat_table[count] * stat_table[vel])/sumX(FILTER(stat_table, stat_table[hour]=[hour]), stat_table[count])
Create a new calculated column named "WeightedVelocity" in stat_table as follows:
WeightedVelocity = [count]*[vel]
Create a measure "WeightedAverage" as follows:
WeightedAverage = sum(stat_table[WeightedVelocity]) / sum(stat_table[count])
Use the measure "WeightedAverage" in the VALUES area of a pivot table and put the "hour" column in ROWS to get the desired result.
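The arithmetic behind the measure can be checked outside PowerPivot. The following Python sketch computes sum(count * vel) / sum(count) per hour on a few rows taken from the sample data; the function name is made up for illustration.

```python
from collections import defaultdict

# (count, vel, hour) rows taken from the sample data in the question.
rows = [
    (133, 96.00237, 15),
    (109, 100.8567, 15),
    (1, 76.77993, 15),
    (3, 113.3561, 2),
    (3, 94.48055, 2),
    (2, 56.14951, 14),
]


def weighted_average_by_hour(rows):
    """Return {hour: sum(count * vel) / sum(count)}."""
    weighted_sum = defaultdict(float)
    total_count = defaultdict(int)
    for count, vel, hour in rows:
        weighted_sum[hour] += count * vel
        total_count[hour] += count
    return {h: weighted_sum[h] / total_count[h] for h in sorted(total_count)}


result = weighted_average_by_hour(rows)
for hour, avg in result.items():
    print(hour, round(avg, 4))
```

Hours with equal counts, such as hour 2 here, reduce to the plain mean of their velocities, which makes them a convenient sanity check.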