Pandas: new column containing value if key equals another column

Apologies if the title doesn't make sense; I found it difficult to explain in one line.
I have a data frame; two of its columns contain data like this:
Column 1
| Selection_id |
| -------- |
| 46660181 |
| 40115397 |
| 267698 |
| 34774 |
| 449342 |
Column 2
| Bsps |
| -------- |
| 46660181: 2.1, 40115397: 1.75, 267698: 3.15, 34774: 2.64, 449342: 3.9 |
What I need to do is create a new column containing the value from the bsps as long as the selection id matches the bsp key. So 46660181 would have 2.1 in the column next to it.
I hope that makes sense!
I don't have a great knowledge on Python or Pandas as I've not done it for long but I'll do my best to follow along!
Any help with this would be appreciated, thank you.

Assuming some things here... there are faster ways than df.apply if you're willing to restructure your initial data, but apply works fine as long as your data is on the smaller end and doesn't need to scale.
import pandas as pd

df = pd.DataFrame(dict(
    cola=[2, 1, 3],
    colb=[
        [{1: 4}, {2: 5}, {3: 6}],
        [{1: 4}, {2: 5}, {3: 6}],
        [{1: 4}, {2: 5}, {3: 6}],
    ],
))
| | cola | colb |
|---:|-------:|:-------------------------|
| 0 | 2 | [{1: 4}, {2: 5}, {3: 6}] |
| 1 | 1 | [{1: 4}, {2: 5}, {3: 6}] |
| 2 | 3 | [{1: 4}, {2: 5}, {3: 6}] |
Solution:
# take the value from the first dict in colb whose key matches cola
df['val'] = df.apply(lambda row: [x.get(row.cola) for x in row.colb if x.get(row.cola) is not None][0], axis=1)
| | cola | colb | val |
|---:|-------:|:-------------------------|------:|
| 0 | 2 | [{1: 4}, {2: 5}, {3: 6}] | 5 |
| 1 | 1 | [{1: 4}, {2: 5}, {3: 6}] | 4 |
| 2 | 3 | [{1: 4}, {2: 5}, {3: 6}] | 6 |
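Looking back at the question's actual shape, Bsps appears to hold a single id-to-price mapping rather than a list of one-key dicts, in which case no apply is needed at all. A minimal sketch of the "restructure your initial data" route (assuming Bsps is already a Python dict; the bsp_map and Bsp_value names are mine):

import pandas as pd

bsp_map = {46660181: 2.1, 40115397: 1.75, 267698: 3.15, 34774: 2.64, 449342: 3.9}
df = pd.DataFrame({'Selection_id': [46660181, 40115397, 267698, 34774, 449342]})

# Series.map does a vectorized dict lookup per row
df['Bsp_value'] = df['Selection_id'].map(bsp_map)

One shared mapping plus Series.map scales far better than a row-wise apply.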

Related

How to get sum of transposed array in PostgreSQL

I've designed a table with the structure below.
 Column |   Type    | Collation | Nullable | Default
--------+-----------+-----------+----------+---------
 counts | integer[] |           |          |
Every value in the counts field has exactly 9 elements, and no element is null.
I would like to get the sum of the transposed array, like the Python code below, using only a SQL query.
import numpy as np

counts = np.array([
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
])
counts = counts.transpose()
sums = [int(sum(row)) for row in counts]  # renamed from `sum` to avoid shadowing the built-in
print(sums)  # [12, 15, 18, 21]
So given example data:
counts |
-----------------------|
{0,15,8,6,10,12,4,0,5} |
{0,4,6,14,7,9,8,0,9} |
{0,6,7,4,11,6,10,0,10} |
Because the table has over a thousand records, computing the transposed sums on the frontend takes around 500 ms.
I would like a result:
counts |
---------------------------|
{0,25,21,24,28,27,22,0,24} |
or
value0 | value1 | value2 | value3 | value4 | value5 | value6 | value7 | value8 |
--------|--------|--------|--------|--------|--------|--------|--------|--------|
0 | 25 | 21 | 24 | 28 | 27 | 22 | 0 | 24 |
I think this could be solved with SQL functions, but I don't know how to write it.
Just SUM() the array elements:
SELECT SUM(i[1]) -- first
, SUM(i[2]) -- second
, SUM(i[3]) -- third
-- etc.
, SUM(i[9])
FROM (VALUES
('{0,15,8,6,10,12,4,0,5}'::int[]),
('{0,4,6,14,7,9,8,0,9}'),
('{0,6,7,4,11,6,10,0,10}')) s(i);
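If you'd rather not spell out all nine SUM(i[n]) terms, a hedged alternative in plain PostgreSQL (the question never names the table, so my_table below is a placeholder):

SELECT array_agg(elem_sum ORDER BY pos) AS counts
FROM (
    SELECT pos, SUM(elem) AS elem_sum
    FROM my_table, unnest(counts) WITH ORDINALITY AS u(elem, pos)
    GROUP BY pos
) sums;

unnest ... WITH ORDINALITY pairs every array element with its 1-based position, so grouping by pos sums the arrays element-wise whatever their length, and array_agg ... ORDER BY pos folds the sums back into a single array.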

PostgreSQL ORDER BY out of order

I have a database where I need to retrieve the rows in the same order they were inserted. The table name is bible. When I type table bible; in psql, it prints the data in the order it was populated, but when I retrieve it with a query, some rows always come back out of order, as in the example below:
table bible
-[ RECORD 1 ]-----------------------------------------------------------------------------------------------------------------------------------------
id | 1
day | 1
book | Genesis
chapter | 1
verse | 1
text | In the beginning God created the heavens and the earth.
link | https://api.biblia.com/v1/bible/content/asv.txt.txt?passage=Genesis1.1&key=dc5e2d416f46150bf6ceb21d884b644f
-[ RECORD 2 ]-----------------------------------------------------------------------------------------------------------------------------------------
id | 2
day | 1
book | John
chapter | 1
verse | 1
text | In the beginning was the Word, and the Word was with God, and the Word was God.
link | https://api.biblia.com/v1/bible/content/asv.txt.txt?passage=John1.1&key=dc5e2d416f46150bf6ceb21d884b644f
-[ RECORD 3 ]-----------------------------------------------------------------------------------------------------------------------------------------
id | 3
day | 1
book | John
chapter | 1
verse | 2
text | The same was in the beginning with God.
link | https://api.biblia.com/v1/bible/content/asv.txt.txt?passage=John1.2&key=dc5e2d416f46150bf6ceb21d884b644f
Everything is in order there, but when I query the same thing with, for example, select * from bible where day='1', or select * from bible where day='1' order by day, or select * from bible where day='1' order by day, id;, I always get some rows out of order, either in the day selected (here 1) or in any other day.
I have been using Django to interface with the Postgres database, but since I found this problem I tried querying with plain SQL too, and I still get rows out of order, although the rows all have unique ids, which I verified with select count(distinct id), count(id) from bible;
-[ RECORD 1 ]-----------------------------------------------------------------------------------------------------------------------------------------
id | 1
day | 1
book | Genesis
chapter | 1
verse | 1
text | In the beginning God created the heavens and the earth.
link | https://api.biblia.com/v1/bible/content/asv.txt.txt?passage=Genesis1.1&key=dc5e2d416f46150bf6ceb21d884b644f
-[ RECORD 2 ]-----------------------------------------------------------------------------------------------------------------------------------------
id | 10
day | 1
book | Colossians
chapter | 1
verse | 18
text | And he is the head of the body, the church: who is the beginning, the firstborn from the dead; that in all things he might have the preeminence.
link | https://api.biblia.com/v1/bible/content/asv.txt.txt?passage=Colossians1.18&key=dc5e2d416f46150bf6ceb21d884b644f
-[ RECORD 3 ]-----------------------------------------------------------------------------------------------------------------------------------------
id | 11
day | 1
book | Genesis
chapter | 1
verse | 2
text | And the earth was waste and void; and darkness was upon the face of the deep: and the Spirit of God moved upon the face of the waters.
link | https://api.biblia.com/v1/bible/content/asv.txt.txt?passage=Genesis1.2&key=dc5e2d416f46150bf6ceb21d884b644f
As you can see, the ids come back out of order: 1, 10, 11.
My table:
Table "public.bible"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
---------+------+-----------+----------+---------+----------+--------------+-------------
id | text | | | | extended | |
day | text | | | | extended | |
book | text | | | | extended | |
chapter | text | | | | extended | |
verse | text | | | | extended | |
text | text | | | | extended | |
link | text | | | | extended | |
Access method: heap
The id field is of type text because I used pandas' to_sql() method to populate the bible table. I tried dropping the id column and adding it back as a PK with ALTER TABLE bible ADD COLUMN id SERIAL PRIMARY KEY;, but I still get the data back out of order.
Is there any way I can retrieve the data ordered by id, without having some of the rows totally out of order? Thank you in advance!
Thou shalt cast thy id to integer to order it as a number.
SELECT * FROM bible ORDER BY cast(id AS integer);
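If the column has to stay text for a while and you sort this way often, an expression index on the cast may help (a sketch; the index name here is made up):

CREATE INDEX bible_id_int_idx ON bible ((id::integer));

SELECT * FROM bible ORDER BY id::integer;

PostgreSQL can then satisfy the ORDER BY from the index instead of sorting the whole table on every query.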
While @jordanvrtanoski is correct, the way to do this in Django is:
>>> Bible.objects.extra(select={'id': 'CAST(id AS INTEGER)'}).order_by('id').values('id')
<QuerySet [{'id': 1}, {'id': 2}, {'id': 3}, {'id': 10}, {'id': 20}]>
Side note: If you want to filter on day as an example, you can do this:
>>> Bible.objects.extra(select={
'id': 'CAST(id AS INTEGER)',
'day': 'CAST(day AS INTEGER)'}
).order_by('id').values('id', 'day').filter(day=2)
<QuerySet [{'id': 2, 'day': 2}, {'id': 10, 'day': 2}, {'id': 11, 'day': 2}, {'id': 20, 'day': 2}]>
Otherwise you get this issue: (notice 1 is followed by 10 and not 2)
>>> Bible.objects.order_by('id').values('id')
<QuerySet [{'id': '1'}, {'id': '10'}, {'id': '2'}, {'id': '20'}, {'id': '3'}]>
I HIGHLY suggest you DO NOT do any of this, and instead set your tables up correctly (use the correct column types rather than making everything text), or your query performance is going to suck... BIG TIME
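For completeness, a minimal sketch of what "setting the tables correctly" could look like as a Django model (field names taken from the question; the exact field types are assumptions):

from django.db import models

class Bible(models.Model):
    # Django adds an integer `id` primary key automatically
    day = models.IntegerField()
    book = models.TextField()
    chapter = models.IntegerField()
    verse = models.IntegerField()
    text = models.TextField()
    link = models.URLField()

With real integer columns, Bible.objects.order_by('id') sorts numerically with no casting tricks.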
Building on both answers from @jordanvrtanoski and @Javier Buzzi, plus some searching online: the issue is that the ids are of type TEXT (it would be the same with VARCHAR), so they sort as strings. You need to cast the id column itself to INTEGER, as follows:
ALTER TABLE bible ALTER COLUMN id TYPE integer USING (id::integer);
Now here is my table
Table "public.bible"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
---------+---------+-----------+----------+-----------------------------------------+----------+--------------+-------------
id | integer | | | nextval('bible_id_seq'::regclass) | plain | |
day | text | | | | extended | |
book | text | | | | extended | |
chapter | text | | | | extended | |
verse | text | | | | extended | |
text | text | | | | extended | |
link | text | | | | extended | |
Indexes:
"lesson_unique_id" UNIQUE CONSTRAINT, btree (id)
Referenced by:
TABLE "notes_note" CONSTRAINT "notes_note_verse_id_5586a4bf_fk" FOREIGN KEY (verse_id) REFERENCES days_lesson(id) DEFERRABLE INITIALLY DEFERRED
Access method: heap
Hope this helps other people, and thank you everyone!

Check the elements in two columns of different dataframes

I have two dataframes.
Df1
Id | Name | Remarks
---------------------
1 | A | Not bad
1 | B | Good
2 | C | Very bad
Df2
Id | Name | Place | Job
-----------------------
1  | A    | Can   | IT
2  | C    | Cbe   | CS
4  | L    | anc   | ME
5  | A    | cne   | IE
Output
Id | Name | Remarks  | Results
------------------------------
1  | A    | Not bad  | True
1  | B    | Good     | False
2  | C    | Very bad | True
That is, Results should be True when the same Id and Name pair is present in both dataframes. I tried
df1['Results']=np.where(Df1['id','Name'].isin(Df2['Id','Name']),'true','false')
But it was not successful.
Use DataFrame.merge with the indicator parameter and test whether each row matched:
df = Df1[['Id','Name']].merge(Df2[['Id','Name']], indicator='Results', how='left')
Df1['Results'] = df['Results'].eq('both')
Your solution is possible by comparing index values, using DataFrame.set_index with Index.isin:
Df1['Results'] = Df1.set_index(['Id','Name']).index.isin(Df2.set_index(['Id','Name']).index)
Or compare tuples built from both columns:
Df1['Results'] = Df1[['Id','Name']].agg(tuple, 1).isin(Df2[['Id','Name']].agg(tuple, 1))
You can easily achieve this by merge as in @jezrael's answer.
You can also achieve it with np.where, a list comprehension, and zip, checking each Df1 (Id, Name) pair against the set of pairs in Df2 (zipping the two frames row by row would only compare rows at the same position, which is not what the question asks):
pairs = set(zip(Df2['Id'], Df2['Name']))
Df1['Results'] = np.where([(i, j) in pairs for i, j in zip(Df1['Id'], Df1['Name'])], True, False)
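For reference, a self-contained run of the tuple-membership check on the question's data (any of the approaches above would slot in the same way):

import pandas as pd

Df1 = pd.DataFrame({'Id': [1, 1, 2],
                    'Name': ['A', 'B', 'C'],
                    'Remarks': ['Not bad', 'Good', 'Very bad']})
Df2 = pd.DataFrame({'Id': [1, 2, 4, 5],
                    'Name': ['A', 'C', 'L', 'A'],
                    'Place': ['Can', 'Cbe', 'anc', 'cne'],
                    'Job': ['IT', 'CS', 'ME', 'IE']})

# True where the (Id, Name) pair also occurs in Df2
Df1['Results'] = Df1[['Id', 'Name']].agg(tuple, 1).isin(Df2[['Id', 'Name']].agg(tuple, 1))
print(Df1)
#    Id Name   Remarks  Results
# 0   1    A   Not bad     True
# 1   1    B      Good    False
# 2   2    C  Very bad     True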

Distinct Sum and Group by

I have a dataset [attached example] and I want to create 2 tables out of it:
+------+------------+-------+-------+-------+--------+
| corp | product | data | Group | sales | market |
+------+------------+-------+-------+-------+--------+
| A | Eli | 43831 | A | 100 | I |
| A | Eli | 43831 | B | 100 | I |
| B | Sut | 43831 | A | 80 | I |
| A | Api | 43831 | C | 50 | C or D |
| A | Api | 43831 | D | 50 | C or D |
| B | Konkurent2 | 43831 | C | 40 | C or D |
+------+------------+-------+-------+-------+--------+
1st - sum(sales) by market, excluding duplicated rows. I want to end up with sales for each market in a specific date range (the data column), excluding duplicates - I have them because one product can be in more than one group.
So the first table, for example for MRCC I, would look like:
+--------+-------+-------+
| market | sales | data |
+--------+-------+-------+
| I | 180 | 43831 |
+--------+-------+-------+
For the second table, I would like it to look like the one above, but with an additional 'dictionary' column holding the unique product names within each market and date, so for MRCC I it would look like:
+--------+-------+-------+----------------+
| market | sales | data | unique product |
+--------+-------+-------+----------------+
| I | 180 | 43831 | eli |
| I | 180 | 43831 | Sut |
+--------+-------+-------+----------------+
The thing is, I'm not that experienced in SQL and I'm fairly new to data processing. The system I am working in allows me to do some of the processing either through "visual" recipes or through SQL code, which I'm not that familiar with. Even more confusing, I can choose between 3 SQL engines - Impala, Hive, and Spark SQL. For example, to create the market column I used Impala with the script below, and I'm not sure if this is "pure" Impala syntax:
SELECT * FROM
(
    -- mrc I --
    SELECT *,
        CASE
            WHEN (`product` = "Eli") OR (`product` = "Sut")
            THEN "MRCC I"
        END AS market
    FROM x.`y`
) a
WHERE market IS NOT NULL
Could you give me some tips on a structure of a code and if this is even possible?
Thanks,
eM
import spark.implicits._
import org.apache.spark.sql.functions._

case class Sale(
  corp: String,
  product: String,
  data: Long,
  group: String,
  sales: Long,
  market: String
)

val df = Seq(
  Sale("A", "Eli", 43831, "A", 100, "I"),
  Sale("A", "Eli", 43831, "B", 100, "I"),
  Sale("A", "Sut", 43831, "A", 80, "I"),
  Sale("A", "Api", 43831, "C", 50, "C or D"),
  Sale("A", "Api", 43831, "D", 50, "C or D"),
  Sale("B", "Konkurent2", 43831, "C", 40, "C or D")
).toDF()

// drop the group-level duplicates, then sum sales per product within each market
val t2 = df.dropDuplicates(Seq("corp", "product", "data", "market"))
  .groupBy("market", "product", "data").sum("sales")
  .select(
    'market,
    col("sum(sales)").alias("sales"),
    'data,
    'product.alias("unique product")
  )
t2.show(false)
// +------+-----+-----+--------------+
// |market|sales|data |unique product|
// +------+-----+-----+--------------+
// |I |80 |43831|Sut |
// |I |100 |43831|Eli |
// |C or D|40 |43831|Konkurent2 |
// |C or D|50 |43831|Api |
// +------+-----+-----+--------------+
// aggregate the deduplicated product rows up to one row per market
val t1 = t2.drop("unique product")
  .groupBy("market", "data").sum("sales")
  .select(
    'market,
    col("sum(sales)").alias("sales"),
    'data
  )
t1.show(false)
// +------+-----+-----+
// |market|sales|data |
// +------+-----+-----+
// |I |180 |43831|
// |C or D|90 |43831|
// +------+-----+-----+
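Since the asker is limited to SQL recipes, here is a hedged Spark SQL sketch of the same logic (the table name x.`y` comes from the question; it assumes duplicated group rows always carry the same sales figure, as in the sample):

-- first table: market totals, counting each product's sales once
SELECT market, SUM(sales) AS sales, data
FROM (
    SELECT DISTINCT corp, product, data, sales, market
    FROM x.`y`
) deduped
GROUP BY market, data;

Adding product to the outer SELECT and GROUP BY gives the per-product variant that t2 produces above.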

PostgreSQL join on array and transform to JSON

I would like to make a join on an array column containing ids and transform the result of this subselect into JSON (a JSON array).
I have the following model:

The lnam_refs column contains identifiers that relate to the lnam column.
I would like to transform the lnam_refs column into something like [row_to_json(), row_to_json()] or [] or [row_to_json()] or …
I tried several methods but I cannot achieve a clean result…
To try to be clearer:
Table in input:
id | label | lnam | lnam_refs
--------+----------------------+----------+-----------------------
1 | 'master1' | 11111111 | {33333333}
2 | 'master2' | 22222222 | {44444444,55555555}
3 | 'slave1' | 33333333 | {}
4 | 'slave2' | 44444444 | {}
5 | 'slave3' | 55555555 | {}
6 | 'master3' | 66666666 | {}
Results Expected:
id | label | lnam | lnam_refs | slaves
--------+----------------------+----------+-----------------------+---------------------------------
1 | 'master1' | 11111111 | {33333333} | [ {id: 3, label: 'slave1', lnam: 33333333, lnam_refs: []} ]
2 | 'master2' | 22222222 | {44444444,55555555} | [ {id: 4, label: 'slave2', lnam: 44444444, lnam_refs: []}, {id: 5, label: 'slave3', lnam: 55555555, lnam_refs: []} ]
6 | 'master3' | 66666666 | {} | []
Thanks for your help !
Here's one way to do it. (I created a table called t with the data you supplied.)
SELECT *, (SELECT JSON_AGG(ROW_TO_JSON(t2)) FROM t t2 WHERE label LIKE 'slave%' AND lnam = ANY(t1.lnam_refs)) AS slaves
FROM t t1
WHERE label LIKE 'master%'
I use the label field in the WHERE clauses, as I don't know how else you're determining which records should be masters and which slaves.
Result:
1;master1;11111111;{33333333};[{"id":3,"label":"slave1","lnam":33333333,"lnam_refs":[]}]
2;master2;22222222;{44444444,55555555};[{"id":4,"label":"slave2","lnam":44444444,"lnam_refs":[]}, {"id":5,"label":"slave3","lnam":55555555,"lnam_refs":[]}]
6;master3;66666666;{};
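One difference from the requested output: master3 comes back with NULL rather than []. Wrapping the subselect in COALESCE with an empty JSON array as the fallback closes that gap:

SELECT *,
       COALESCE((SELECT JSON_AGG(ROW_TO_JSON(t2))
                 FROM t t2
                 WHERE label LIKE 'slave%'
                   AND lnam = ANY(t1.lnam_refs)), '[]'::json) AS slaves
FROM t t1
WHERE label LIKE 'master%'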