Group rows of a dataframe based on column mean and geo-locations - pandas

route_number source_id latitude_value longitude_value no_of_stores
r1 676 28.15085 32.66055 23
r2 715 28.2160253 32.5214831 23
r3 345 28.2123115 32.537211 22
r4 150 28.23009 32.50323 23
r5 534 28.0949248 32.8075467 21
r6 1789 28.2204214 32.5035782 22
r7 647 28.21548 32.50238 23
r8 667 28.21132 32.51481 22
r9 2242 28.2389 32.5 19
r10 797 28.161657 32.8416816 20
r11 1097 28.1792849 32.8255522 19
r12 591 28.2513623 32.7638247 22
r13 1091 28.251208 32.7808329 21
r14 1267 28.2102213 32.8129836 21
r15 1016 28.1654648 32.8350845 19
r16 785 28.0786012 32.9513468 4
r17 1072 28.1701673 32.8382309 1
Above is the dataframe I am dealing with.
As you can see, the number of stores differs between route numbers.
mean(no_of_stores) ≈ 19 in this case (325 stores over 17 routes)
What I am looking for is this:
depending on the geo-locations (latitude and longitude values) of my source_id, I want to combine multiple routes that lie close to each other into one, such that no_of_stores is divided evenly across the new groups.
The condition that routes lie close to each other can be dropped; simply merging the routes with fewer stores into one would also be fine.
I.e., take the routes that lie close to each other (and whose no_of_stores is less than mean(no_of_stores)) and combine them into one big route, such that no_of_stores in each new route is close to the mean of the no_of_stores column, which in this case is around 19.
The expected final output is something like this (not actual values):
route_number new_route_no
r1 A1 # since it already has more stores than the mean
r2 A2
r3 A3
r4 A4
....................
r9 A9 #(19 stores)
r17 A9 #(1 store) total 20
....................
r11 A11
r16 A11
r15 A15 # 19 stores; since it cannot be combined further, keep it as it is
I have tried using the pandas groupby and aggregate methods, but couldn't find a way to transform this dataframe.
Any leads would be helpful.
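One possible starting point, ignoring the geo-location condition (which the question says may be dropped), is a greedy first-fit packing of the routes that fall below the mean. This is only a sketch: the helper name merge_small_routes, the cap on a merged route's size (here the largest existing route), and the "A" + route number labelling are all assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "route_number": [f"r{i}" for i in range(1, 18)],
    "no_of_stores": [23, 23, 22, 23, 21, 22, 23, 22, 19,
                     20, 19, 22, 21, 21, 19, 4, 1],
})

def merge_small_routes(frame, cap):
    """Hypothetical helper: routes at or above the mean keep their own group;
    smaller routes are packed first-fit into groups whose total stays <= cap."""
    mean_stores = frame["no_of_stores"].mean()
    # Largest routes first; mergesort is stable, so ties keep their original order.
    ordered = frame.sort_values("no_of_stores", ascending=False, kind="mergesort")
    open_groups = []  # each entry is [label, running_total] for merged small routes
    labels = {}
    for route, stores in zip(ordered["route_number"], ordered["no_of_stores"]):
        own_label = "A" + route[1:]
        if stores >= mean_stores:
            labels[route] = own_label
            continue
        for group in open_groups:
            if group[1] + stores <= cap:
                group[1] += stores
                labels[route] = group[0]
                break
        else:
            # No open group has room, so this route starts a new group.
            open_groups.append([own_label, stores])
            labels[route] = own_label
    return frame["route_number"].map(labels)

# Assumption: a merged route may not exceed the largest existing route.
df["new_route_no"] = merge_small_routes(df, cap=df["no_of_stores"].max())
new_route = dict(zip(df["route_number"], df["new_route_no"]))
print(df)
```

To bring the geo-locations back in, the first-fit loop could prefer the nearest open group instead of the first one with room; that part is left out here.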

Related

SQL Server not displaying currency symbol correctly

I have the following encoding problem: two databases should hold the same data (the second database is a newer version of the first). In some tables, characters are not displayed correctly, for example in a currency table that holds the names and symbols of different currencies.
I use SSMS to query both databases.
Id Name R7 R8 Name Different
-------------------------------------
148 DZD DZD 0
37 EGP £ £ EGP 1
149 ERN ERN 0
150 ETB ETB 0
1 EUR € € EUR 1
40 FJD $ $ FJD 0
39 FKP £ £ FKP 1
2 GBP £ £ GBP 1
151 GEL GEL 0
46 GGP £ £ GGP 1
42 GHC ¢ ¢ GHC 1
152 GHS GHS 0
Both tables (Currency) have the same structure and the same collation for the symbol column (R7 & R8): SQL_Latin1_General_CP1_CI_AS. I have tried to look up encoding solutions, but have run out of ideas for what to search for.
Does anyone know what might cause R7 to display incorrectly while R8 displays correctly?
Column definition for R7 (ShortName):
Column definition for R8 (ShortName):

Find maximum value of each group within a Pandas Frame

I have a question and hope you can give me a little support. I looked through the archive here and found a solution, but it takes a lot of time and is not "beautiful", since it works with loops.
Suppose you have the following frame:
System Country_Key Name Bank_number_length Check rule for bank acct no.
PEM AD Andorra 8 2
PL1 AD Andorra 15 5
PPE AD Andorra 14 5
P11 AD Andorra 9 5
P16 AD Andorra 12 4
PEM AE Emirates 3 5
PL1 AE Emirates 15 4
PPE AE Emirates 15 5
P11 AE Emirates 15 6
P16 AE Emirates 13 5
I found the following approach for two columns: "Get the max value from each group with pandas.DataFrame.groupby".
However, in my case I really do have many columns, and I need to set the index on the first three columns, "System", "Country_Key" and "Name".
My desired output would be the following:
System  Country_Key  Name      Bank_number_length  Check rule for bank acct no.
PEM     AD           Andorra
PL1                            15                  5
PPE                                                5
P11                                                5
P16
PEM     AE           Emirates
PL1                            15
PPE                            15
P11                            15                  6
P16
So the idea is to drop every value except the maximum in each group. Any kind of hint would be really beneficial.
You can mask the values that are not the group maximum with an empty string, and then mask the duplicated key values with an empty string as well:
keys = ['Country_Key', 'Name']
cols = ['Bank_number_length', 'Check rule for bank acct no.']
df[cols] = df[cols].mask(df[cols].ne(df.groupby(keys)[cols].transform('max')), '')
df.loc[df.duplicated(keys), keys] = ''
print(df)
   System  Country_Key  Name      Bank_number_length  Check rule for bank acct no.
0  PEM     AD           Andorra
1  PL1                            15                  5
2  PPE                                                5
3  P11                                                5
4  P16
5  PEM     AE           Emirates
6  PL1                            15
7  PPE                            15
8  P11                            15                  6
9  P16
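If the masked columns should stay numeric (for further calculations) rather than turn into strings, a variant of the same idea is to keep the group maxima and replace everything else with NaN. The sketch below rebuilds the sample frame from the question so it runs on its own; the names group_max and result are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "System": ["PEM", "PL1", "PPE", "P11", "P16"] * 2,
    "Country_Key": ["AD"] * 5 + ["AE"] * 5,
    "Name": ["Andorra"] * 5 + ["Emirates"] * 5,
    "Bank_number_length": [8, 15, 14, 9, 12, 3, 15, 15, 15, 13],
    "Check rule for bank acct no.": [2, 5, 5, 5, 4, 5, 4, 5, 6, 5],
})

keys = ["Country_Key", "Name"]
cols = ["Bank_number_length", "Check rule for bank acct no."]

# Per-row copy of each group's maximum, aligned with the original index.
group_max = df.groupby(keys)[cols].transform("max")

# Keep a value only where it equals its group maximum; everything else becomes NaN.
result = df.copy()
result[cols] = df[cols].where(df[cols].eq(group_max))
print(result)
```

The value columns become float because of the NaN entries; the empty-string version above is nicer for display, this one for further processing.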

SQL/Microsoft Access Query Problems

CNum DNum RNum Quant Price
C100 D1 R10 2 8.99
C100 D1 R40 7 9.99
C200 D3 R10 4 16.99
C200 D3 R20 2 15.99
C200 D3 R30 2 17.99
C200 D3 R40 5 19.99
C200 D3 R50 6 18.99
C200 D3 R60 4 19.99
C200 D3 R70 8 15.99
C200 D5 R20 1 8.99
C300 D3 R10 2 16.99
C300 D4 R20 5 22.99
C400 D6 R30 3 4.99
C400 D6 R70 3 2.99
C500 D1 R40 1 9.99
C500 D2 R20 2 23.99
C500 D2 R40 1 24.99
C500 D3 R40 2 19.99
C500 D4 R40 8 23.99
C500 D5 R40 4 8.99
C500 D5 R50 5 8.99
C500 D5 R70 1 9.99
C500 D6 R20 2 1.99
C500 D6 R40 5 3.99
The table above is named Orders. The query I'm trying to solve is stated as "For each dish ordered from a restaurant, get the dish number (DNum), the restaurant number (RNum), and the total quantity (for that dish ordered from that restaurant)." I can get the two numbers to populate, but am totally unsure of how to add up the quantities; anything I've tried just adds up the quantities in total. Any ideas?
Here is one of the queries I tried. This actually returned an error: "Your query does not include the specified expression 'DNum' as part of an aggregate function."
SELECT Ord1.DNum, Ord2.DNum, SUM(Ord1.Quant + Ord2.Quant) AS TotQuant
FROM Orders AS Ord1, Orders AS Ord2
WHERE (Ord1.RNum = Ord2.RNum)
Another that's not working:
SELECT Order1.DNum, Order2.DNum, TotQuant
FROM (SELECT SUM(Order1.Quant + Order2.Quant) AS TotQuant
FROM Orders AS Order1, Orders AS Order2
WHERE (Order1.RNum = Order2.RNum)
AND (Order1.DNum = Order2.DNum))
And one more:
SELECT DISTINCT Ord1.DNum, SUM(Ord1.Quant + Ord2.Quant) AS TotQuant
FROM Orders AS Ord1, Orders AS Ord2
WHERE (Ord1.RNum = Ord2.RNum)
AND (Ord1.DNum = Ord2.DNum)
If my guess as to what you are trying to do is correct, something like this should get you close:
SELECT DNum, RNum, SUM(Quant) AS TotalQuantity
FROM Orders
GROUP BY DNum, RNum
Ok so some quick comments on what you have tried:
Query 1
SELECT Ord1.DNum, Ord2.DNum, SUM(Ord1.Quant + Ord2.Quant) AS TotQuant
FROM Orders AS Ord1, Orders AS Ord2
WHERE (Ord1.RNum = Ord2.RNum);
This might seem like it should work, but it does not ask for what you want. Joining Orders to itself on RNum pairs every order with every order placed at the same restaurant, so quantities get added up many times over. And because DNum is selected next to SUM() without a GROUP BY clause, the database has no way to know which rows each sum should cover, hence the "not part of an aggregate function" error.
Query 2 and Query 3 will not work either, for much the same reasons as the initial query that returns an error. They are slightly different, but essentially you are asking for the wrong thing.
Now here's what you can try:
Introducing the GROUP BY method! Woohoo!
GROUP BY is perfect for this and many other queries! As stated on the w3schools page for it:
The GROUP BY statement is often used with aggregate functions (COUNT,
MAX, MIN, SUM, AVG) to group the result-set by one or more columns.
So try a query like this:
SELECT Orders.DNum, Orders.RNum, SUM(Orders.Quant) AS OrderQuantity
FROM Orders
GROUP BY Orders.DNum, Orders.RNum;
To deconstruct this a little bit:
This selects the columns you want to display, plus the aggregate (the sum) of the Orders.Quant column.
Then you group by DNum and, within each DNum, by RNum, so each sum covers exactly one dish at one restaurant.
Hope it helps!
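If you want to check the GROUP BY behaviour outside Access, here is a small sketch using Python's built-in sqlite3 module. SQLite standing in for Access is an assumption, and only a handful of rows from the Orders table are loaded to keep it short.

```python
import sqlite3

# A few rows copied from the Orders table above (CNum, DNum, RNum, Quant, Price).
rows = [
    ("C100", "D1", "R10", 2, 8.99),
    ("C100", "D1", "R40", 7, 9.99),
    ("C500", "D1", "R40", 1, 9.99),
    ("C200", "D3", "R10", 4, 16.99),
    ("C300", "D3", "R10", 2, 16.99),
    ("C200", "D3", "R20", 2, 15.99),
]

conn = sqlite3.connect(":memory:")  # in-memory SQLite database, not Access
conn.execute(
    "CREATE TABLE Orders (CNum TEXT, DNum TEXT, RNum TEXT, Quant INTEGER, Price REAL)"
)
conn.executemany("INSERT INTO Orders VALUES (?, ?, ?, ?, ?)", rows)

query = """
    SELECT DNum, RNum, SUM(Quant) AS TotalQuantity
    FROM Orders
    GROUP BY DNum, RNum
    ORDER BY DNum, RNum
"""
# One entry per (dish, restaurant) pair, with the summed quantity as the value.
totals = {(dnum, rnum): total for dnum, rnum, total in conn.execute(query)}
conn.close()
print(totals)
```

Dish D1 at restaurant R40 was ordered by two customers (7 and 1), so its group sums to 8, while pairs that appear only once keep their original quantity.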

SQL query between and equals

There are three tables. The first table, baseline, contains all the beneficiaries' information, including a column named PPI_SCORE. The second table, PPI_SCORE_TOOKUP, contains six columns, as shown below. The third table, endline, contains the beneficiaries' endline assessment data and also has a column named PPI_SCORE.
I want to join these tables somehow. However, PPI_SCORE_TOOKUP has no foreign key to the baseline or endline table; the only link is the PPI score, which exists in the baseline and endline tables and corresponds to the ranges in PPI_SCORE_TOOKUP.
I want a query that shows some baseline data along with the PPI result when the PPI value in the baseline table is between, or equal to, PPI_SCORE_START and PPI_SCORE_END. In the same row it should also show the endline data of the same member, along with the six lookup columns for the endline PPI score, when the PPI score in the endline table is between, or equal to, PPI_SCORE_START and PPI_SCORE_END.
Note: I have not tried any query yet, since I had no idea how to do this; the result I expect is shown at the bottom of this question.
Tables are as follows
baseline table
ID NAME LAST_NAME DISTRICT PPI_SCORE
1 A A A 10
2 B B B 23
3 C C C 90
4 D D D 47
endline table
baseline_ID Enterprise Market PPI_SCORE
3 Bee Keeping Yes
2 Poultry No 74
1 Agriculture Yes 80
PPI_SCORE_TOOKUP table
ppi_start ppi_end national national_150 national_200 usaid
0 4 100 100 100 100
10 14 66.1 89.5 96.5 39.2
5 9 68.8 90.2 96.7 44.4
15 19 59.5 89.1 97.2 35.2
20 24 51.3 85.5 96.4 28.8
25 29 43.5 81.1 93.2 20
30 34 31.9 74.5 90.4 13.6
35 39 24.6 66.9 87.3 7.9
40 44 15.2 58 82.8 4.5
45 49 11.4 47.9 73.4 4.2
50 54 6 37.2 68.4 2.6
55 59 2.7 26.1 61.3 0.5
60 64 0.9 21 50.4 0.5
65 69 0 14.3 37.1 0
70 74 3 14.3 29.2 0
75 79 0 1.4 5.1 0
80 84 0 0 9.5 0
85 89 0 0 15.2 0
90 94 0 0 0 0
95 100 0 0 0 0
Expected Result
Your query can be made in the following way:
SELECT *
FROM baseline b
LEFT JOIN endline e ON b.id = e.baseline_ID
LEFT JOIN PPI_SCORE_TOOKUP ppi ON b.PPI_SCORE BETWEEN ppi.ppi_start AND ppi.ppi_end
LEFT JOIN PPI_SCORE_TOOKUP ppi2 ON e.PPI_SCORE BETWEEN ppi2.ppi_start AND ppi2.ppi_end
This matches the IDs from the baseline table with the baseline_IDs from the endline table, keeping baseline rows that have no endline match (their endline columns are null). It then matches the PPI_SCORE from baseline against ppi_start and ppi_end in PPI_SCORE_TOOKUP. Finally, it matches the PPI_SCORE from endline against ppi_start and ppi_end in a second copy of the lookup table (ppi2).
Replace * with whichever fields you want to have.
See fiddle for a working example
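For a quick local check of the range join, here is a sketch with Python's built-in sqlite3 module. Using SQLite is an assumption (the question does not name the database), the tables are cut down to a few columns and lookup rows, and the blank endline score for member 3 is assumed to be NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE baseline (ID INTEGER, NAME TEXT, PPI_SCORE INTEGER);
    CREATE TABLE endline (baseline_ID INTEGER, Enterprise TEXT, PPI_SCORE INTEGER);
    CREATE TABLE PPI_SCORE_TOOKUP (ppi_start INTEGER, ppi_end INTEGER, national REAL);
""")
conn.executemany(
    "INSERT INTO baseline VALUES (?, ?, ?)",
    [(1, "A", 10), (2, "B", 23), (3, "C", 90), (4, "D", 47)],
)
conn.executemany(
    "INSERT INTO endline VALUES (?, ?, ?)",
    [(3, "Bee Keeping", None), (2, "Poultry", 74), (1, "Agriculture", 80)],
)
# Only the lookup ranges that the sample scores fall into.
conn.executemany(
    "INSERT INTO PPI_SCORE_TOOKUP VALUES (?, ?, ?)",
    [(10, 14, 66.1), (20, 24, 51.3), (45, 49, 11.4),
     (70, 74, 3.0), (80, 84, 0.0), (90, 94, 0.0)],
)

query = """
    SELECT b.ID, b.PPI_SCORE, ppi.national, e.PPI_SCORE, ppi2.national
    FROM baseline b
    LEFT JOIN endline e ON b.ID = e.baseline_ID
    LEFT JOIN PPI_SCORE_TOOKUP ppi ON b.PPI_SCORE BETWEEN ppi.ppi_start AND ppi.ppi_end
    LEFT JOIN PPI_SCORE_TOOKUP ppi2 ON e.PPI_SCORE BETWEEN ppi2.ppi_start AND ppi2.ppi_end
    ORDER BY b.ID
"""
result = conn.execute(query).fetchall()
conn.close()
for row in result:
    print(row)
```

BETWEEN is inclusive on both ends, which covers the "between or equals" requirement. Members without an endline row, or with a NULL endline score, still appear once, with NULL in the endline columns.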

joining two narrow format tables

I have a scenario where I have tables (in a proprietary datastore) with thousands of columns. Before being exported for querying, the tables are transformed to narrow format (http://en.wikipedia.org/wiki/Wide_and_Narrow_Data).
I am developing a query executor. The input to this query executor is the narrow tables, not the original tables. I want to perform joins on two similar narrow tables, but cannot figure out the exact general logic behind it.
For example, let's say we have two tables, R and S, in the original (wide) format:
Table R
C1 C2 C3 R1 R2 R3
5 6 7 1234 4552 12532
5 6 8 4512 21523 434
15 16 17 1254 1212 3576
Table S
C1 C2 C3 S1 S2 S3
5 6 7 5412 35112 3512
5 6 8 125393 1523 6749
15 16 17 74397 4311 1153
C1, C2, C3 are the common columns between the tables.
The narrow table for table R is
C1  C2  C3  Key  Value
5   6   7   R1   1234
            R2   4552
            R3   12532
5   6   8   R1   4512
            R2   21523
            R3   434
15  16  17  R1   1254
            R2   1212
            R3   3576
The narrow table for table S is
C1  C2  C3  Key  Value
5   6   7   S1   5412
            S2   35112
            S3   3512
5   6   8   S1   125393
            S2   1523
            S3   6749
15  16  17  S1   74397
            S2   4311
            S3   1153
Now when I join the original tables R and S (on C1, C2 and C3), I get the result
C1 C2 C3 R1 R2 R3 S1 S2 S3
5 6 7 1234 4552 12532 5412 35112 3512
5 6 8 4512 21523 434 125393 1523 6749
15 16 17 1254 1212 3576 74397 4311 1153
Whose narrow format is
C1  C2  C3  Key  Value
5   6   7   R1   1234
            R2   4552
            R3   12532
            S1   5412
            S2   35112
            S3   3512
5   6   8   R1   4512
            R2   21523
            R3   434
            S1   125393
            S2   1523
            S3   6749
15  16  17  R1   1254
            R2   1212
            R3   3576
            S1   74397
            S2   4311
            S3   1153
How can I get the above table by just joining the narrow tables (on the common columns) that I got as input?
If you use a normal tabular join (natural join, outer join, etc.) between the two narrow tables, you get an exploded table, because each key in table R gets multiplied with all the keys in table S.
I am not using SQL, Postgres, or any other database system. I am looking for an answer in terms of algorithms or relational algebra expressions.
You're looking for the set union operator: A∪B is defined as the set of all tuples that appear in A, B or both, supposing the two relations have the same schema. The narrow tables all have the same schema (id, key, value), so they're perfectly union compatible.
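As a quick sanity check of the union claim on the example tables R and S, here is a sketch in pandas. pandas is used only for illustration, since the question is not tied to any particular system, and the helper to_narrow is a made-up name.

```python
import pandas as pd

keys = ["C1", "C2", "C3"]
R = pd.DataFrame({
    "C1": [5, 5, 15], "C2": [6, 6, 16], "C3": [7, 8, 17],
    "R1": [1234, 4512, 1254], "R2": [4552, 21523, 1212], "R3": [12532, 434, 3576],
})
S = pd.DataFrame({
    "C1": [5, 5, 15], "C2": [6, 6, 16], "C3": [7, 8, 17],
    "S1": [5412, 125393, 74397], "S2": [35112, 1523, 4311], "S3": [3512, 6749, 1153],
})

def to_narrow(wide):
    """Hypothetical helper: wide table -> (C1, C2, C3, Key, Value) rows."""
    return wide.melt(id_vars=keys, var_name="Key", value_name="Value")

# Union of the two narrow tables (they share one schema, so plain concatenation).
narrow_union = pd.concat([to_narrow(R), to_narrow(S)], ignore_index=True)

# Narrow form of the wide join, for comparison.
narrow_of_join = to_narrow(R.merge(S, on=keys))

order = keys + ["Key"]
left = narrow_union.sort_values(order).reset_index(drop=True)
right = narrow_of_join.sort_values(order).reset_index(drop=True)
same = left.equals(right)
print(same)
```

Both sides contain the same 18 rows here. Note that every (C1, C2, C3) combination occurs in both R and S in this example; if some combination were missing from one table, an inner join would drop it while the union would keep it.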
And I have proof:
Suppose we have relations A(id, val1, val2 ... val_n) and B(id, val_n+1 ... val_n+m). We will also need a relation holding our variable names V(variable) = {('val1'), ('val2') ... ('val_n+m')}. The narrow-format equivalent of A is A'(id, variable, value), which we can construct like this:
That is, for each value we project A to (id, val_i), rename val_i to "value", put the variable name in the table (by taking the cross product with a single tuple in V); then we take the union of all these relations. Let us also construct B'(id, variable, value) in a similar fashion.
The natural join can be defined using only primitives:
Therefore we can construct (A ⋈ B)' like this (having combined the projections):
Let's apply the projection earlier:
But a val_i can only appear in A or B, not both, making one term of the cross product zero half of the time so this can be reduced and re-ordered into
which is exactly A' ∪ B'.
So, we have shown that (A ⋈ B)' = A' ∪ B', that is, the narrow format of the joined tables is the union of the narrow format tables.
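The formulas that the proof refers to are not shown above. One way to write them that is consistent with the surrounding prose is sketched below; this is a reconstruction, not necessarily the original notation. It uses \pi for projection, \rho for renaming and \sigma for selection, and the last step assumes that every id occurs in both A and B, so that the join drops no rows.

```latex
\begin{align*}
A' &= \bigcup_{i=1}^{n}
      \rho_{value/val_i}\bigl(\pi_{id,\,val_i}(A)\bigr)
      \times \sigma_{variable='val_i'}(V) \\
B' &= \bigcup_{i=n+1}^{n+m}
      \rho_{value/val_i}\bigl(\pi_{id,\,val_i}(B)\bigr)
      \times \sigma_{variable='val_i'}(V) \\
A \bowtie B &= \pi_{id,\,val_1,\dots,val_{n+m}}
      \bigl(\sigma_{A.id = B.id}(A \times B)\bigr) \\
(A \bowtie B)' &= \bigcup_{i=1}^{n+m}
      \rho_{value/val_i}\Bigl(\pi_{id,\,val_i}
      \bigl(\sigma_{A.id = B.id}(A \times B)\bigr)\Bigr)
      \times \sigma_{variable='val_i'}(V) \\
  &= \bigcup_{i=1}^{n}
      \rho_{value/val_i}\bigl(\pi_{id,\,val_i}(A)\bigr)
      \times \sigma_{variable='val_i'}(V)
     \;\cup\;
     \bigcup_{i=n+1}^{n+m}
      \rho_{value/val_i}\bigl(\pi_{id,\,val_i}(B)\bigr)
      \times \sigma_{variable='val_i'}(V) \\
  &= A' \cup B'
\end{align*}
```

The step from the fourth line to the fifth is where each val_i is traced back to the one relation it comes from: projecting the join onto (id, val_i) gives the same tuples as projecting A (for i ≤ n) or B (for i > n), provided no id is lost in the join.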