Read a specific value from a table - SQL

I want to look up an input value that falls between the 'from' and 'to' columns, e.g. 1371, and get back the single matching entry from the value column: 27.89. The value is entered via a search box in my Android app.
The table looks like this:
id  from  to       value
1   0     1339.99  0
2   1340  1349.99  6.89
3   1350  1359.99  13.89
4   1360  1369.99  20.89
5   1370  1379.99  27.89
6   1380  1389.99  34.89
7   1390  1399.99  41.89
8   1400  1409.99  48.89
9   1410  1419.99  55.89
I tried it with BETWEEN and also with >= and <=, but I couldn't get any of the WHERE conditions to work.
I am grateful for any pointers.
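For what it's worth, a minimal sketch of the lookup, assuming SQLite (the usual database on Android) and a hypothetical table name rate_table; from and to are reserved words in SQL, so they are quoted here:
-- 1371 stands in for the search-box input; in app code you would
-- bind it as a ? parameter instead of hard-coding it.
SELECT value
FROM rate_table
WHERE 1371 BETWEEN "from" AND "to";
For the sample data this returns the single row with id 5, i.e. 27.89.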

pandas duplicate values: visual inspection shows results are not duplicates

Hello, thanks in advance for all answers; I really appreciate the community's help.
Here is my dataframe, from a CSV of data scraped from car classified ads. (The Cyrillic fields are Russian: 'серия' = series, 'от X ₽ ... без скидки' = 'from X ₽ ... without discount', 'л.с.' = hp.)
Unnamed: 0 NameYear \
0 0 BMW 7 серия, 2007
1 1 BMW X3, 2021
2 2 BMW 2 серия Gran Coupe, 2021
3 3 BMW X5, 2021
4 4 BMW X1, 2021
Price \
0 520 000 ₽
1 от 4 810 000 ₽\n4 960 000 ₽ без скидки
2 2 560 000 ₽
3 от 9 259 800 ₽\n9 974 800 ₽ без скидки
4 от 3 130 000 ₽\n3 220 000 ₽ без скидки
CarParams \
0 187 000 км, AT (445 л.с.), седан, задний, бензин
1 2.0 AT (190 л.с.), внедорожник, полный, дизель
2 1.5 AMT (140 л.с.), седан, передний, бензин
3 3.0 AT (400 л.с.), внедорожник, полный, дизель
4 2.0 AT (192 л.с.), внедорожник, полный, бензин
url
0 https://www.avito.ru/moskva/avtomobili/bmw_7_s...
1 https://www.avito.ru/moskva/avtomobili/bmw_x3_...
2 https://www.avito.ru/moskva/avtomobili/bmw_2_s...
3 https://www.avito.ru/moskva/avtomobili/bmw_x5_...
4 https://www.avito.ru/moskva/avtomobili/bmw_x1_...
THE TASK: I want to know whether there are duplicate rows, i.e. whether the SAME car advertisement appears twice. The most reliable key is probably url, because it should be unique; CarParams or NameYear can repeat, so I will check nunique and duplicated on the url column.
Screenshot to visually inspect the result of duplicated (image not reproduced here).
THE ISSUE: visual inspection (sorry for the unprofessional jargon) shows these urls are not the SAME, but I wanted to catch exactly identical urls so I can check for repeated data. I tried setting keep=False as well.
Try:
df.duplicated(subset=["url"], keep=False)
df.duplicated() gives you a pd.Series of boolean values.
Here is an example that you could probably use:
from random import randint
import pandas as pd

urls = ['http://www.google.com',
        'http://www.stackoverfow.com',
        'http://bla.xy', 'http://bla.com']

# Build rows so that some urls appear more than once
d = []
for i, url in enumerate(urls):
    for j in range(0, randint(1, i + 1)):
        d.append(dict(customer=str(randint(1, 100)), url=url))

df = pd.DataFrame(d)
# keep=False marks every occurrence of a duplicated url
df['dups'] = df['url'].duplicated(keep=False)
print(df)
resulting in the following df:
customer url dups
0 89 http://www.google.com False
1 43 http://www.stackoverfow.com False
2 36 http://bla.xy True
3 86 http://bla.xy True
4 32 http://bla.com False
The column dups shows you which urls exist more than once; in my example data it is only the url http://bla.xy.
The important thing is to check what the parameter keep does:
keep{‘first’, ‘last’, False}, default ‘first’
Determines which duplicates (if any) to mark.
first : Mark duplicates as True except for the first occurrence.
last : Mark duplicates as True except for the last occurrence.
False : Mark all duplicates as True.
In my case I used False to get all duplicated values.
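Applied back to the question's dataframe, a short sketch (assuming it is already loaded as df with a url column, as in the question):
# Mark every row whose url occurs more than once
mask = df.duplicated(subset=["url"], keep=False)
# Inspect only the potentially repeated advertisements, grouped together
print(df[mask].sort_values("url"))
# Quick sanity check: distinct urls versus total rows
print(df["url"].nunique(), "unique urls out of", len(df), "rows")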

Insert items from MultiIndexed dataframe into regular dataframe based on time

I have this regular dataframe indexed by 'Date', called ES:
Price Day Hour num_obs med abs_med Ret
Date
2006-01-03 08:30:00 1260.583333 1 8 199 1260.416667 0.166667 0.000364
2006-01-03 08:35:00 1261.291667 1 8 199 1260.697917 0.593750 0.000562
2006-01-03 08:40:00 1261.125000 1 8 199 1260.843750 0.281250 -0.000132
2006-01-03 08:45:00 1260.958333 1 8 199 1260.895833 0.062500 -0.000132
2006-01-03 08:50:00 1261.214286 1 8 199 1260.937500 0.276786 0.000203
I have this other dataframe indexed by the following MultiIndex: the first level goes from 0 to 23 (hours) and the second level from 0 to 55 in steps of 5 (minutes). In other words, we have data at 5-minute increments through the day.
5min_Ret
0 0 2.235875e-06
5 9.814064e-07
10 -1.453213e-06
15 4.295757e-06
20 5.884896e-07
25 -1.340122e-06
30 9.470660e-06
35 1.178204e-06
40 -1.111621e-05
45 1.159005e-05
50 6.148861e-06
55 1.070586e-05
1 0 1.485287e-05
5 3.018576e-06
10 -1.513273e-05
15 -1.105312e-05
20 3.600874e-06
...
I want to create a column in the original dataframe, ES, that has the appropriate '5min_Ret' at each appropriate hour/5minute combo.
I've tried multiple things: looping over rows, finding some apply function. But nothing has worked so far. I feel like I'm overlooking a simple and Pythonic solution here.
The expected output adds a new column called '5min_ret' to the original dataframe, in which each row gets the value of the matching hour/5-minute pair from the smaller dataframe:
Price Day Hour num_obs med abs_med Ret 5min_ret
Date
2006-01-03 08:30:00 1260.583333 1 8 199 1260.416667 0.166667 0.000364 xxxx
2006-01-03 08:35:00 1261.291667 1 8 199 1260.697917 0.593750 0.000562 xxxx
2006-01-03 08:40:00 1261.125000 1 8 199 1260.843750 0.281250 -0.000132 xxxx
2006-01-03 08:45:00 1260.958333 1 8 199 1260.895833 0.062500 -0.000132 xxxx
2006-01-03 08:50:00 1261.214286 1 8 199 1260.937500 0.276786 0.000203 xxxx
I think one way is to use merge on hour and minute. First create a column 'min' in ES from the DatetimeIndex, such as:
ES['min'] = ES.index.minute
Now you can merge with your MultiIndex DF containing the column '5min_Ret', which I named df_multi, such as:
ES = ES.merge(df_multi.reset_index(), left_on=['Hour', 'min'],
              right_on=['level_0', 'level_1'], how='left')
Here you merge 'Hour' and 'min' from ES against 'level_0' and 'level_1', the columns created from the MultiIndex of df_multi when you call reset_index, using a left join so every row of the left df (ES) is kept.
You should get a new column in ES named '5min_Ret' with the values you are looking for. You can drop the column 'min' if you don't need it anymore with ES = ES.drop('min', axis=1). Note that merging on columns returns a frame with a fresh integer index, so keep a copy of the Date index if you still need it.
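A minimal runnable sketch of this merge, with made-up numbers standing in for both frames (the toy ES, df_multi, and all values here are invented for illustration):
import numpy as np
import pandas as pd

# Toy stand-in for ES: a DatetimeIndex at 5-minute steps
idx = pd.date_range('2006-01-03 08:30', periods=5, freq='5min')
ES = pd.DataFrame({'Price': np.linspace(1260.5, 1261.2, 5)}, index=idx)
ES['Hour'] = ES.index.hour
ES['min'] = ES.index.minute

# Toy stand-in for df_multi: MultiIndex of (hour 0-23, minute 0-55 step 5)
mi = pd.MultiIndex.from_product([range(24), range(0, 60, 5)])
df_multi = pd.DataFrame({'5min_Ret': np.random.randn(len(mi)) * 1e-5},
                        index=mi)

# reset_index exposes the two unnamed index levels as level_0 / level_1
ES = ES.merge(df_multi.reset_index(),
              left_on=['Hour', 'min'],
              right_on=['level_0', 'level_1'],
              how='left')
print(ES.head())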

Divide dataframe into different bins based on condition

I have a pandas dataframe:
id no_of_rows
1 2689
2 1515
3 3826
4 814
5 1650
6 2292
7 1867
8 2096
9 1618
10 923
11 766
12 191
I want to divide the ids into 5 different bins based on their number of rows,
such that every bin gets an approximately equal total number of rows,
and assign the result as a new column, bin.
One approach I thought of:
df.no_of_rows.sum() == 20247
div_factor = 20247 // 5 == 4049
If we add the 1st and 2nd rows, the sum = 2689 + 1515 = 4204 > div_factor,
so assign bin = 1 where id = 1. Now look for the next ones:
id no_of_rows bin
1 2689 1
2 1515 2
3 3826 3
4 814 4
5 1650 4
6 2292 5
7 1867
8 2096
9 1618
10 923
11 766
12 191
But this method proved wrong. Is there a way to make 5 bins such that every bin gets a roughly equal share of the rows?
You can use an approach based on percentiles.
n_bins = 5
# Cumulative total of no_of_rows in ascending order; index alignment puts
# the result back onto the original rows. The min() clamp keeps the last
# (largest) cumulative value in bin n_bins - 1 instead of spilling into
# a sixth bin.
dfa = df.sort_values(by='no_of_rows').cumsum()
df['bin'] = dfa.no_of_rows.apply(
    lambda x: min(int(n_bins * x / dfa.no_of_rows.max()), n_bins - 1))
And then you can check with
df.groupby('bin').sum()
The more records you have, the more evenly the bin totals will come out.
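A self-contained run on the question's numbers, for anyone who wants to verify the split (the clamp on the last element is my addition, as noted above):
import pandas as pd

df = pd.DataFrame({
    'id': range(1, 13),
    'no_of_rows': [2689, 1515, 3826, 814, 1650, 2292,
                   1867, 2096, 1618, 923, 766, 191],
})

n_bins = 5
dfa = df.sort_values(by='no_of_rows').cumsum()
total = dfa['no_of_rows'].max()  # 20247, the grand total
df['bin'] = dfa['no_of_rows'].apply(
    lambda x: min(int(n_bins * x / total), n_bins - 1))

# Total rows per bin; the sums get closer to equal as the
# number of records grows
print(df.groupby('bin')['no_of_rows'].sum())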

Using SPSS Reference Variables and an Index to Create a New Variable

Essentially, I have a log which contains a Unique identifier for a subject which is tracked through multiple cases. I then used the following code, suggested previously through the great community here, to create an Index. Unfortunately, I've run into a new challenge that I can't seem to figure out. Here is a sample of the current data set to provide perspective.
Indexing function
sort cases by Unique_Modifier.
if $casenum=1 or Unique_Modifier<>lag(Unique_Modifier) Index=1.
if Unique_Modifier=lag(Unique_Modifier) Index=lag(Index)+1.
format Index(f2).
execute.
Unique Identifier Index Variable of interest
A 1 101
A 2 101
A 3 607
A 4 607
A 5 101
A 6 101
B 1 108
B 2 210
C 1 610
C 2 987
C 3 1100
C 4 610
What I'd like to do is create a new variable which contains the running number of discrete, different entries in the variable-of-interest column. The expected output would be the following:
Unique Identifier Index Variable of interest Intended Output
A 1 101 1
A 2 101 1
A 3 607 2
A 4 607 2
A 5 101 2
A 6 101 2
B 1 108 1
B 2 210 2
C 1 610 1
C 2 987 2
C 3 1100 3
C 4 610 3
I've tried a few different ways to do it. One was to use a similar index function, but it fails: it works when the variable of interest changes only between consecutive lines, but sometimes a value recurs, say, 5 lines later. My next idea was to use the AGGREGATE function, but I looked through the IBM manual and it doesn't seem like any function within AGGREGATE would produce the intended output here. Anyone have any ideas? I think a loop is the best bet, but loops within SPSS are a bit funky and hard to get working.
Try this:
data list list/Unique_Identifier Index VOI (3f) .
begin data.
1 1 101
1 2 101
1 3 607
1 4 607
1 5 101
1 6 101
2 1 108
2 2 210
3 1 610
3 2 987
3 3 1100
3 4 610
end data.
* Accumulate each distinct value of VOI into a running string per subject.
string voiT (a1000).
compute voiT=concat(ltrim(string(VOI,f10)),",").
compute Intended_Output=1.
do if index>1.
* Value seen before in this subject: carry count and string forward.
do if index(lag(voiT), rtrim(voiT))>0.
compute Intended_Output=lag(Intended_Output).
compute voiT=lag(voiT).
else.
* New value: append it to the string and increment the count.
compute Intended_Output=lag(Intended_Output)+1.
compute voiT=concat(rtrim(lag(voiT)), rtrim(voiT)).
end if.
end if.
exe.
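Not part of the original answer, but for cross-checking the logic outside SPSS, the same running distinct count can be sketched in pandas (my own sketch; the column names uid, voi, and intended_output are invented):
import pandas as pd

df = pd.DataFrame({
    'uid': ['A']*6 + ['B']*2 + ['C']*4,
    'voi': [101, 101, 607, 607, 101, 101, 108, 210, 610, 987, 1100, 610],
})

# True the first time a (uid, voi) pair appears, False on recurrences
first_seen = ~df.duplicated(subset=['uid', 'voi'])
# Cumulative count of first appearances within each uid
df['intended_output'] = first_seen.astype(int).groupby(df['uid']).cumsum()
print(df)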

PowerPivot formula for row wise weighted average

I have a table in PowerPivot which contains the logged data of a traffic control camera mounted on a road. The table holds the velocity and the number of vehicles that passed the camera during a specific time window (e.g. 14:10 - 15:25). How can I get the average velocity of cars for a specific hour and list the results in a separate table with 24 rows (hours 0-23), where the second column of each row is the weighted average velocity of that hour? A sample of my stat_table data is given below:
count vel hour
----- --- ----
133 96.00237 15
117 91.45705 21
81 81.90521 6
2 84.29946 21
4 77.7841 18
1 140.8766 17
2 56.14951 14
6 71.72839 13
4 64.14309 9
1 60.949 17
1 77.00728 21
133 100.3956 6
109 100.8567 15
54 86.6369 9
1 83.96901 17
10 114.6556 21
6 85.39127 18
1 76.77993 15
3 113.3561 2
3 94.48055 2
In a separate PowerPivot table I have 24 rows and 2 columns, but when I enter my formula, every row gets updated with the same number. My formula is:
=sumX(FILTER(stat_table, stat_table[hour]=[hour]), stat_table[count] * stat_table[vel])/sumX(FILTER(stat_table, stat_table[hour]=[hour]), stat_table[count])
Create a new calculated column named "WeightedVelocity" as follows:
WeightedVelocity = [count] * [vel]
Create a measure "WeightedAverage" as follows:
WeightedAverage = SUM(stat_table[WeightedVelocity]) / SUM(stat_table[count])
Use the measure "WeightedAverage" in the VALUES area of the pivot table and the "hour" column in ROWS to get the desired result.
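If you'd rather skip the helper column, the same weighted average can be written as a single measure with SUMX (a sketch, assuming the same stat_table names):
WeightedAverage :=
    SUMX(stat_table, stat_table[count] * stat_table[vel])
        / SUM(stat_table[count])
SUMX evaluates count * vel row by row within the current filter context (here, each hour on ROWS), so no intermediate column is needed.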