Transform DataFrame to list of dictionaries where column name is a value of key:value pair - pandas

I have a pandas DataFrame as follows:
|-----|----|-----|
| A   | B  | C   |
|-----|----|-----|
| abc | 34 | 8   |
| abc |    | 12  |
| abc | 6  | 321 |
|-----|----|-----|
I would like to convert it to a list of dictionaries like this:
[
    {
        name: "A",
        value: "abc"
    },
    {
        name: "B",
        value: 34
    },
    {
        name: "C",
        value: 8
    }
]
There are several ways to do it with a lot of data manipulation, but I am looking for one that is straightforward, if it exists.
Thank you for your help

[[{'name':k, 'value':v} for k,v in x.items()] for x in df.to_dict(orient='records')]
This would probably work, not sure it is straightforward though.
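For reference, here is what that comprehension produces on the sample frame (a quick sketch; note that it yields one list of name/value dicts per row, so the overall result is a list of lists):

import pandas as pd

df = pd.DataFrame({"A": ["abc", "abc", "abc"],
                   "B": [34, None, 6],
                   "C": [8, 12, 321]})

# one {'name': ..., 'value': ...} dict per column, one list per row
records = [[{"name": k, "value": v} for k, v in row.items()]
           for row in df.to_dict(orient="records")]

print(records[0])
# [{'name': 'A', 'value': 'abc'}, {'name': 'B', 'value': 34.0}, {'name': 'C', 'value': 8}]

(B comes back as 34.0 because the missing value in that column forces it to float.)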

Iterate through pandas data frame and replace some strings with numbers

I have a dataframe sample_df that looks like:
bar foo
0 rejected unidentified
1 clear caution
2 caution NaN
Note this is just a random made-up df; there are lots of other columns, say with data types other than just text. bar and foo might also have lots of empty cells/values, which are NaNs.
The actual df looks like this (the above is just a sample):
| | Unnamed: 0 | user_id | result | face_comparison_result | created_at | facial_image_integrity_result | visual_authenticity_result | properties | attempt_id |
|-----:|-------------:|:---------------------------------|:---------|:-------------------------|:--------------------|:--------------------------------|:-----------------------------|:----------------|:---------------------------------|
| 0 | 58 | ecee468d4a124a8eafeec61271cd0da1 | clear | clear | 2017-06-20 17:50:43 | clear | clear | {} | 9e4277fc1ddf4a059da3dd2db35f6c76 |
| 1 | 76 | 1895d2b1782740bb8503b9bf3edf1ead | clear | clear | 2017-06-20 13:28:00 | clear | clear | {} | ab259d3cb33b4711b0a5174e4de1d72c |
| 2 | 217 | e71b27ea145249878b10f5b3f1fb4317 | clear | clear | 2017-06-18 21:18:31 | clear | clear | {} | 2b7f1c6f3fc5416286d9f1c97b15e8f9 |
| 3 | 221 | f512dc74bd1b4c109d9bd2981518a9f8 | clear | clear | 2017-06-18 22:17:29 | clear | clear | {} | ab5989375b514968b2ff2b21095ed1ef |
| 4 | 251 | 0685c7945d1349b7a954e1a0869bae4b | clear | clear | 2017-06-18 19:54:21 | caution | clear | {} | dd1b0b2dbe234f4cb747cc054de2fdd3 |
| 5 | 253 | 1a1a994f540147ab913fcd61b7a859d9 | clear | clear | 2017-06-18 20:05:05 | clear | clear | {} | 1475037353a848318a32324539a6947e |
| 6 | 334 | 26e89e4a60f1451285e70ca8dc5bc90e | clear | clear | 2017-06-17 20:21:54 | suspected | clear | {} | 244fa3e7cfdb48afb44844f064134fec |
| 7 | 340 | 41afdea02a9c42098a15d94a05e8452b | NaN | clear | 2017-06-17 20:42:53 | clear | clear | {} | b066a4043122437bafae3ddcf6c2ab07 |
| 8 | 424 | 6cf6eb05a3cc4aabb69c19956a055eb9 | rejected | NaN | 2017-06-16 20:00:26 |
I want to replace any strings I find with numbers, per the below mapping.
def no_strings(df):
    columns = list(df)
    for column in columns:
        df[column] = df[column].map(result_map)

# We will need a mapping of strings to numbers to be able to analyse later.
result_map = {'unidentified': 0, "clear": 1, 'suspected': 2, "caution": 3, 'rejected': 4}
So the output might look like:
bar foo
0 4 0
1 1 3
2 3 NaN
For some reason, when I run no_strings(sample_df) I get errors.
What am I doing wrong?
df['bar'] = df['bar'].map(result_map)
df['foo'] = df['foo'].map(result_map)
df
  bar  foo
0   4    0
1   1    3
2   3  NaN
However, if you wish to be on the safe side (in case a value is not in your result_map and you don't want to see a NaN), do this:
df['foo'] = df['foo'].map(lambda x: result_map.get(x, 'not found'))
df['bar'] = df['bar'].map(lambda x: result_map.get(x, 'not found'))
So the output for this df
bar foo
0 rejected unidentified
1 clear caution
2 caution suspected
3 sdgdg 0000
will result in:
bar foo
0 4 0
1 1 3
2 3 2
3 not found not found
To be extra efficient:
cols = ['foo', 'bar', 'other_columns']
for c in cols:
    df[c] = df[c].map(lambda x: result_map.get(x, 'not found'))
Let's try stack, map the dict and then unstack:
df.stack().to_frame()[0].map(result_map).unstack()
  bar  foo
0   4    0
1   1    3
2   3  NaN
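If you still want a reusable no_strings-style helper that works across many columns without breaking on the numeric ones, a minimal sketch could look like this (assuming result_map is defined before the call; the dtype filter is just one way to skip non-text columns, and unmapped values are left untouched rather than replaced with 'not found'):

import numpy as np
import pandas as pd

result_map = {'unidentified': 0, 'clear': 1, 'suspected': 2, 'caution': 3, 'rejected': 4}

def no_strings(df):
    # only touch the object (string) columns; numbers, dates, etc. are left alone
    for column in df.select_dtypes(include='object').columns:
        df[column] = df[column].map(lambda x: result_map.get(x, x))
    return df

sample_df = pd.DataFrame({'bar': ['rejected', 'clear', 'caution'],
                          'foo': ['unidentified', 'caution', np.nan]})
print(no_strings(sample_df))
#   bar  foo
# 0   4    0
# 1   1    3
# 2   3  NaN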

Distinct Sum and Group by

I have a dataset [example below] and I want to create 2 tables out of it:
+------+------------+-------+-------+-------+--------+
| corp | product    | data  | Group | sales | market |
+------+------------+-------+-------+-------+--------+
| A    | Eli        | 43831 | A     | 100   | I      |
| A    | Eli        | 43831 | B     | 100   | I      |
| B    | Sut        | 43831 | A     | 80    | I      |
| A    | Api        | 43831 | C     | 50    | C or D |
| A    | Api        | 43831 | D     | 50    | C or D |
| B    | Konkurent2 | 43831 | C     | 40    | C or D |
+------+------------+-------+-------+-------+--------+
1st: sum(sales) by market, excluding duplicated rows. I want to end up with sales for each market in a specific date range (data column) but without the duplicates; I have them because 1 product can be in more than 1 group.
So the first table, for example for MRCC I, would look like:
+--------+-------+-------+
| market | sales | data  |
+--------+-------+-------+
| I      | 180   | 43831 |
+--------+-------+-------+
Then I would like the second table to look like the one above, but with an additional 'dictionary' column holding the unique product names within market and date, so for MRCC I it would look like:
+--------+-------+-------+----------------+
| market | sales | data  | unique product |
+--------+-------+-------+----------------+
| I      | 180   | 43831 | eli            |
| I      | 180   | 43831 | Sut            |
+--------+-------+-------+----------------+
The thing is, I'm not that experienced in SQL and I'm fairly new to data processing. The system I am working in allows me to do some of the data processing either through "visual" recipes or through SQL code, which I'm not that familiar with. Even more confusing, I can choose between 3 SQL engines: Impala, Hive and Spark SQL. For example, to create the market column I used Impala, and the script looks like this (I'm not sure if this is "pure" Impala syntax):
SELECT * FROM
(
    -- mrc I --
    SELECT *,
        CASE WHEN (`product` = "Eli") OR (`product` = "Sut")
             THEN "MRCC I"
        END AS market
    FROM x.`y`
) a
WHERE market IS NOT NULL
Could you give me some tips on the structure of the code, and whether this is even possible?
Thanks,
eM
import spark.implicits._
import org.apache.spark.sql.functions._

case class Sale(
    corp: String,
    product: String,
    data: Long,
    group: String,
    sales: Long,
    market: String
)

val df = Seq(
    Sale("A", "Eli", 43831, "A", 100, "I"),
    Sale("A", "Eli", 43831, "B", 100, "I"),
    Sale("B", "Sut", 43831, "A", 80, "I"),
    Sale("A", "Api", 43831, "C", 50, "C or D"),
    Sale("A", "Api", 43831, "D", 50, "C or D"),
    Sale("B", "Konkurent2", 43831, "C", 40, "C or D")
).toDF()

val t2 = df.dropDuplicates(Seq("corp", "product", "data", "market"))
    .groupBy("market", "product", "data").sum("sales")
    .select(
        'market,
        col("sum(sales)").alias("sales"),
        'data,
        'product.alias("unique product")
    )
t2.show(false)
// +------+-----+-----+--------------+
// |market|sales|data |unique product|
// +------+-----+-----+--------------+
// |I |80 |43831|Sut |
// |I |100 |43831|Eli |
// |C or D|40 |43831|Konkurent2 |
// |C or D|50 |43831|Api |
// +------+-----+-----+--------------+
val t1 = t2.drop("unique product")
    .groupBy("market", "data").sum("sales")
    .select(
        'market,
        col("sum(sales)").alias("sales"),
        'data)
t1.show(false)
// +------+-----+-----+
// |market|sales|data |
// +------+-----+-----+
// |I |180 |43831|
// |C or D|90 |43831|
// +------+-----+-----+
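If you would rather stay in Python (for example in a PySpark recipe), a rough equivalent of the same dedup-then-aggregate idea, on the same made-up sample data, could be sketched like this:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

rows = [("A", "Eli", 43831, "A", 100, "I"),
        ("A", "Eli", 43831, "B", 100, "I"),
        ("B", "Sut", 43831, "A", 80, "I"),
        ("A", "Api", 43831, "C", 50, "C or D"),
        ("A", "Api", 43831, "D", 50, "C or D"),
        ("B", "Konkurent2", 43831, "C", 40, "C or D")]
df = spark.createDataFrame(rows, ["corp", "product", "data", "group", "sales", "market"])

# table 2: one row per product within market/date, duplicates caused by groups removed first
t2 = (df.dropDuplicates(["corp", "product", "data", "market"])
        .groupBy("market", "product", "data")
        .agg(F.sum("sales").alias("sales"))
        .select("market", "sales", "data", F.col("product").alias("unique product")))

# table 1: total sales per market/date
t1 = t2.groupBy("market", "data").agg(F.sum("sales").alias("sales"))

t2.show(truncate=False)
t1.show(truncate=False)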

Postgresql join on array and transform to json

I would like to make a join on an array containing ids and transform the result of this subselect into JSON (a JSON array).
I have the following model:

The lnam_refs column contains identifiers that relate to the lnam column.
I would like to transform the lnam_refs column into something like [row_to_json(), row_to_json()] or [] or [row_to_json()] or …
I tried several methods but I cannot achieve a clean result…
To try to be clearer :
Table in input:
   id   |        label         |   lnam   |       lnam_refs
--------+----------------------+----------+-----------------------
      1 | 'master1'            | 11111111 | {33333333}
      2 | 'master2'            | 22222222 | {44444444,55555555}
      3 | 'slave1'             | 33333333 | {}
      4 | 'slave2'             | 44444444 | {}
      5 | 'slave3'             | 55555555 | {}
      6 | 'master3'            | 66666666 | {}
Results Expected:
id | label | lnam | lnam_refs | slaves
--------+----------------------+----------+-----------------------+---------------------------------
1 | 'master1' | 11111111 | {33333333} | [ {id: 3, label: 'slave1', lnam: 33333333, lnam_refs: []} ]
2 | 'master2' | 22222222 | {44444444,55555555} | [ {id: 4, label: 'slave2', lnam: 44444444, lnam_refs: []}, {id: 5, label: 'slave3', lnam: 55555555, lnam_refs: []} ]
6 | 'master3' | 66666666 | {} | []
Thanks for your help !
Here's one way to do it. (I created a table called t with that data you supplied.)
SELECT *, (SELECT JSON_AGG(ROW_TO_JSON(t2)) FROM t t2 WHERE label LIKE 'slave%' AND lnam = ANY(t1.lnam_refs)) AS slaves
FROM t t1
WHERE label LIKE 'master%'
I use the label field in the WHERE clause as I don't know how else you're determining which records should be master etc.
Result:
1;master1;11111111;{33333333};[{"id":3,"label":"slave1","lnam":33333333,"lnam_refs":[]}]
2;master2;22222222;{44444444,55555555};[{"id":4,"label":"slave2","lnam":44444444,"lnam_refs":[]}, {"id":5,"label":"slave3","lnam":55555555,"lnam_refs":[]}]
6;master3;66666666;{};
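Note that JSON_AGG returns NULL rather than an empty array when the subquery matches no rows, which is why the master3 line above ends with an empty slaves value; if you need the literal [] from the expected output, you can wrap the subselect in COALESCE(..., '[]'::json).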

How to create hierarchal json object using ltree query results? (postgresql)

I'm trying to create a storage system for custom categories using postgres.
After looking around for potential solutions, I settled on trying to use ltree.
Here is an example of the raw data below:
+----+---------+---------------------------------+-----------+
| id | user_id | path                            | name      |
+----+---------+---------------------------------+-----------+
| 1  | 1       | root.test                       | test      |
| 2  | 1       | root.test.inbox                 | inbox     |
| 3  | 1       | root.personal                   | personal  |
| 4  | 1       | root.project                    | project   |
| 5  | 1       | root.project.idea               | idea      |
| 6  | 1       | root.personal.events            | events    |
| 7  | 1       | root.personal.events.january    | january   |
| 8  | 1       | root.project.objective          | objective |
| 9  | 1       | root.personal.events.february   | february  |
| 10 | 1       | root.project.objective.january  | january   |
| 11 | 1       | root.project.objective.february | february  |
+----+---------+---------------------------------+-----------+
I thought that it might be easier to first order the results and remove the top level from the returned path, using:
select id, name, subpath(path, 1) as path, nlevel(subpath(path, 1)) as level from testLtree order by level, path
I get:
+----+-----------+----------------------------+-------+
| id | name      | path                       | level |
+----+-----------+----------------------------+-------+
| 3  | personal  | personal                   | 1     |
| 4  | project   | project                    | 1     |
| 1  | test      | test                       | 1     |
| 6  | events    | personal.events            | 2     |
| 5  | idea      | project.idea               | 2     |
| 8  | objective | project.objective          | 2     |
| 2  | inbox     | test.inbox                 | 2     |
| 9  | february  | personal.events.february   | 3     |
| 7  | january   | personal.events.january    | 3     |
| 11 | february  | project.objective.february | 3     |
| 10 | january   | project.objective.january  | 3     |
+----+-----------+----------------------------+-------+
I'm hoping to be able to transform this result into a set of JSON data somehow. I would like an output similar to this:
personal: {
    id: 3,
    name: 'personal',
    children: {
        events: {
            id: 6,
            name: 'events',
            children: {
                january: {
                    id: 7,
                    name: 'january',
                    children: null
                },
                february: {
                    id: 9,
                    name: 'february',
                    children: null
                }
            }
        }
    }
},
project: {
    id: 4,
    name: 'project',
    children: {
        idea: {
            id: 5,
            name: 'idea',
            children: null
        },
        objective: {
            id: 8,
            name: 'objective',
            children: {
                january: {
                    id: 10,
                    name: 'january',
                    children: null
                },
                february: {
                    id: 11,
                    name: 'february',
                    children: null
                }
            }
        }
    }
},
test: {
    id: 1,
    name: 'test',
    children: {
        inbox: {
            id: 2,
            name: 'inbox',
            children: null
        }
    }
}
I've been looking around for the best way to do this but haven't come across any solutions that make sense to me. However, as I am new to Postgres and SQL in general, this is expected.
I think I may have to use a recursive query? I'm a bit confused about what the best method/execution of this would be. Any help/advice is much appreciated, and if you have any further questions please ask.
I've put everything into a sqlfiddle below;
http://sqlfiddle.com/#!17/1713e/5
I ran into the same problem as you. I struggled with this a lot in PostgreSQL and it became overly complex to solve. Since I'm using Django (a Python framework), I decided to solve it using Python. In case it can help anyone in the same situation, I would like to share the code:
https://gist.github.com/eherrerosj/4685e3dc843e94f3ef8645d31dbe490c
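If you would rather build the nesting in application code without going through the gist, here is a minimal Python sketch of the idea (the rows are hard-coded to mirror the ordered query result above; with a real database you would fetch id, name and subpath(path, 1) through whatever driver you use). Because the rows are ordered by level, every parent is inserted before its children:

import json

# (id, name, subpath(path, 1)) in the order returned by the query above
rows = [
    (3, 'personal', 'personal'),
    (4, 'project', 'project'),
    (1, 'test', 'test'),
    (6, 'events', 'personal.events'),
    (5, 'idea', 'project.idea'),
    (8, 'objective', 'project.objective'),
    (2, 'inbox', 'test.inbox'),
    (9, 'february', 'personal.events.february'),
    (7, 'january', 'personal.events.january'),
    (11, 'february', 'project.objective.february'),
    (10, 'january', 'project.objective.january'),
]

tree = {}
for id_, name, path in rows:
    *parents, leaf = path.split('.')
    current = tree
    for label in parents:            # walk down to the parent's children dict
        parent = current[label]
        if parent['children'] is None:
            parent['children'] = {}
        current = parent['children']
    current[leaf] = {'id': id_, 'name': name, 'children': None}

print(json.dumps(tree, indent=2))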

Add Column in a Spark DataFrame, based on a parametric SQL query dependent on values of some fields of the DataFrame

I have several Spark DataFrames (we can call them table a, table b, etc.).
I want to add a column just to table a, based on the result of a query to one of the other tables; but which table gets queried changes every time, based on the value of one of the fields of table a. So this query should be parametric.
Below I show an example to make the problem clear:
Every table has the column OID and a column TableName with the name of the current table, plus other columns.
This is the fixed query to be performed on Tab A to add the new column:
Select $ColumnName from $TableName where OID=$oids
Tab A
| oids|TableName |ColumnName | other fields|New Column: ValueOidDb
================================================================
| 2 | Book | Title | x |result query:harry potter
| 8 | Book | Isbn | y |result query: 556
| 1 | Author | Name | z |result query:Tolkien
| 4 | Category |Description| b |result query: Commedy
Tab Book
| OID |TableName |Title |Isbn |other fields|
================================================================
| 2 | Book |harry potter| 123 | x |
| 8 | Book | hobbit | 556 | y |
| 21 | Book | etc | 8942 | z |
| 5 | Book | etc2 | 984 | b |
Tab Author
| OID |TableName |Name |nationality |other fields|
================================================================
| 5 | Author |J.Rowling | eng | x |
| 2 | Author |Geor. Martin| us | y |
| 1 | Author | Tolkien | eng | z |
| 13 | Author | Dan Brown | us | b |
Tab Category
| OID | TableName | Description |
=====================================
| 12 | Category | Fantasy |
| 4 | Category | Commedy |
| 9 | Category | Thriller |
| 7 | Category | Action |
I tried with this UDF:
def setValueOid = (oid: Int, TableName: String, TableColumn: String) => {
    try {
        sqlContext.sql(s"Select $TableColumn from $TableName where OID = $oid ").first().toString()
    }
    catch {
        case x: java.lang.NullPointerException => "error"
    }
}

sqlContext.udf.register("setValueOid", setValueOid)

val FinalRtxf = sqlContext.sql("SELECT all the column of TAB A ,"
    + " setValueOid(oid, Table, AttributeDatabaseColumn) as ValueOidDb"
    + " FROM TAB A")
I put the code in a try/catch because otherwise it gives me a NullPointerException, but it doesn't work, because it always returns the "error" value from the catch block.
If I call the function directly, without going through a SQL query, by just passing some manual parameters, it works perfectly:
val try=setValueOid(8,"BOOK","ISBN")
try: String = [0977326403 ]
I read here that it is not possible to make a query inside a UDF:
Trying to execute a spark sql query from a UDF
So how can I solve my problem? I don't know how to make a parametric join. I tried this:
%sql
Select all attributes TAB A,
FROM TAB A as a
join (Select $AttributeDatabaseColumn ,TableName from $Table where OID=$oid) as b
on a.Table=b.TableName
but it gave me this exception:
org.apache.spark.sql.AnalysisException: cannot recognize input near '$' 'AttributeDatabaseColumn' ',' in select clause; line 3 pos 1 at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:318)
at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
One option:
transform each Book, Author, Category to a form:
root
|-- oid: integer (nullable = false)
|-- tableName: string (nullable = true)
|-- properties: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
For example, the first record in Book:
val book = Seq((2L, "Book",
    Map("title" -> "harry potter", "Isbn" -> "123", "other field" -> "x")
)).toDF("oid", "tableName", "properties")
+---+---------+---------------------------------------------------------+
|oid|tableName|properties |
+---+---------+---------------------------------------------------------+
|2 |Book |Map(title -> harry potter, Isbn -> 123, other field -> x)|
+---+---------+---------------------------------------------------------+
union Book, Author, Category as properties.
val properties = book.union(author).union(category)
join with base table:
val comb = properties.join(table, Seq("oid", "tableName"))
Use case when ... based on tableName to add the new column from the properties field.
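To make that last part concrete, here is a small PySpark sketch of the same three steps (the table and column names are taken from the example above, only the Book table is converted to a properties map for brevity, and the tiny sample frames are made up, so treat this as an outline rather than a finished implementation; also note that once the properties are in a map, the value can be pulled out directly with the ColumnName column, which takes the place of the case when):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# hypothetical mini versions of Tab A and Tab Book, just to show the shape
tab_a = spark.createDataFrame(
    [(2, "Book", "Title"), (8, "Book", "Isbn")],
    ["oids", "TableName", "ColumnName"])
book = spark.createDataFrame(
    [(2, "Book", "harry potter", "123"), (8, "Book", "hobbit", "556")],
    ["OID", "TableName", "Title", "Isbn"])

# step 1: collapse each lookup table's columns into a single map column
book_props = book.select(
    F.col("OID").alias("oids"),
    "TableName",
    F.create_map(F.lit("Title"), F.col("Title"),
                 F.lit("Isbn"), F.col("Isbn")).alias("properties"))

# step 2: union the per-table property frames (Author and Category would be added here)
properties = book_props  # .union(author_props).union(category_props)

# step 3: join back to Tab A and pick the wanted entry out of the map by column name
result = (tab_a.join(properties, ["oids", "TableName"], "left")
               .withColumn("ValueOidDb", F.expr("properties[ColumnName]")))
result.show(truncate=False)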