How to process a pandas DataFrame in parallel for each unique value in a column?


I'm looking for ideas to optimize my function. I have limited knowledge of multiprocessing, so I'm just looking for someone to point me in the right direction!
I have a pandas DataFrame with the following format:
| Date       | Portfolio | Classification | P&L  |
|------------|-----------|----------------|------|
| 2016-01-01 | A         | Class_1        | 100  |
| 2016-01-02 | A         | Class_2        | 200  |
| ...        | ...       | ...            | ...  |
| 2019-10-31 | A         | Class_700      | -200 |
All I need from this is a P&L attribution that explains how each unique Classification generated P&L since its inception. I have another function that does that.
Currently my logic is:
overall_results1, overall_results2 = [], []
for unique_class in df['Classification'].unique():
    # Slice out the rows belonging to this classification
    class_df = df[df['Classification'] == unique_class].copy()
    result1, result2 = call_attribution_function(class_df)
    overall_results1.append(result1)
    overall_results2.append(result2)
This gets the job done but is very slow. In the real scenario, I have over 700 unique classifications.
However, none of the classifications depend on each other. All 700 could be processed in parallel, and that would significantly improve performance.
Any ideas on how this can be achieved?
I've seen a few examples with joblib, but I didn't find much guidance on slicing DataFrames and parallelizing functions that return multiple values.
Any pointers highly appreciated! Thanks in advance!
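One possible direction, as a minimal sketch rather than a definitive answer: split the frame with groupby and hand each group to a pool from the standard library's concurrent.futures. The body of call_attribution_function here is a hypothetical stand-in (the real one is not shown in the question), and the helper name run_per_class and the sample data are also assumptions made for illustration.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

import pandas as pd


def call_attribution_function(class_df):
    """Hypothetical stand-in for the real attribution function; returns two results."""
    return class_df["P&L"].sum(), len(class_df)


def run_per_class(df, use_processes=True, max_workers=None):
    """Apply call_attribution_function to every Classification group concurrently."""
    # groupby splits the frame once instead of scanning it with a mask per class
    groups = [group for _, group in df.groupby("Classification", sort=False)]
    # Processes sidestep the GIL for CPU-bound work; threads are easier to debug
    executor_cls = ProcessPoolExecutor if use_processes else ThreadPoolExecutor
    with executor_cls(max_workers=max_workers) as executor:
        # map keeps the input order, so results line up with the groups
        pairs = list(executor.map(call_attribution_function, groups))
    # Unzip the (result1, result2) pairs into the two lists from the question
    overall_results1 = [first for first, _ in pairs]
    overall_results2 = [second for _, second in pairs]
    return overall_results1, overall_results2


# Small made-up frame in the shape described in the question
sample_df = pd.DataFrame(
    {
        "Date": ["2016-01-01", "2016-01-02", "2016-01-03"],
        "Portfolio": ["A", "A", "A"],
        "Classification": ["Class_1", "Class_2", "Class_1"],
        "P&L": [100, 200, 50],
    }
)
```

When using processes, the call to run_per_class(df) should sit under an if __name__ == "__main__": guard, and the attribution function must be defined at module level so it can be pickled. Each group is copied to a worker, so the speed-up depends on the attribution work being heavy relative to that transfer cost.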

Related

Best data structure for finding tags of nested locations

Somebody pointed out that my data structure architecture sucks.
The task
I have a locations table which stores the name of a location. Then I have a tags table which stores information about those locations. The locations have a hierarchy which I want to use to get all tags.
Example
Locations:
USA <- California <- San Francisco <- Mission St
Tags:
USA: English
California: Sunny
California: West coast
San Francisco: Sea side
Mission St: Cable car station
If somebody requests information about Mission St, I want to deliver all tags of it and its ancestors (["English", "Sunny", "West coast", "Sea side", "Cable car station"]). If I request all tags of California, the answer would be ["English", "Sunny", "West coast"].
I'm looking for the best read performance! I don't care about write performance. This data is not changed very often. And I don't care about table sizes either. If I need more or larger tables to solve this quicker so be it.
The tables
So currently I'm thinking about setting up these tables:
locations
id | name
---|--------------
1 | USA
2 | California
3 | San Francisco
4 | Mission St
tags
id | location_id | name
---|-------------|------------------
1 | 1 | English
2 | 2 | Sunny
3 | 2 | West coast
4 | 3 | Sea side
5 | 4 | Cable car station
ancestors
I added a position field to store the hierarchy.
| id | location_id | ancestor_id | position |
|----|-------------|-------------|----------|
| 1 | 2 | 1 | 1 |
| 2 | 3 | 2 | 1 |
| 3 | 3 | 1 | 2 |
| 4 | 4 | 3 | 1 |
| 5 | 4 | 2 | 2 |
| 6 | 4 | 1 | 3 |
Question
Is this a good solution to the problem or is there a better one? I want to select, as fast as possible, all tags of any given location including all the tags of its ancestors. I'm using a PostgreSQL database but I think this is a pure SQL architecture problem.

Your problem seems to consist of two challenges. The most interesting is "how do I store hierarchies in a relational database". There are lots of answers to that; the one you've proposed is the most common.
There's an alternative called "nested set" which is faster for reading (in your example, finding all locations within a particular hierarchy would be "between x and y").
Postgres has dedicated support for hierarchies; I'd assume this would also provide great performance.
The second part of your question is "given a path in my hierarchy, retrieve all matching tags". The easiest option is to join to the tags table as you suggest.
The final aspect is "should you denormalize/precalculate". I usually recommend building and optimizing the "normalized" solution and only denormalize when you need to.
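As a rough illustration of the normalized approach, the join from the ancestors table to the tags table can be sketched with Python's built-in sqlite3 module. SQLite stands in for PostgreSQL here purely as an assumption to keep the example self-contained; the table layout and rows are the ones from the question, and the helper name tags_for is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE locations (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE tags (id INTEGER PRIMARY KEY, location_id INTEGER, name TEXT);
    CREATE TABLE ancestors (
        id INTEGER PRIMARY KEY, location_id INTEGER,
        ancestor_id INTEGER, position INTEGER
    );
    INSERT INTO locations VALUES
        (1, 'USA'), (2, 'California'), (3, 'San Francisco'), (4, 'Mission St');
    INSERT INTO tags VALUES
        (1, 1, 'English'), (2, 2, 'Sunny'), (3, 2, 'West coast'),
        (4, 3, 'Sea side'), (5, 4, 'Cable car station');
    INSERT INTO ancestors VALUES
        (1, 2, 1, 1), (2, 3, 2, 1), (3, 3, 1, 2),
        (4, 4, 3, 1), (5, 4, 2, 2), (6, 4, 1, 3);
    """
)


def tags_for(location_id):
    """Return the tags of a location and all its ancestors, root first."""
    rows = conn.execute(
        """
        SELECT t.name
        FROM (
            -- every ancestor of the location, with its distance
            SELECT ancestor_id AS loc, position FROM ancestors WHERE location_id = ?
            UNION ALL
            -- plus the location itself at distance 0
            SELECT ?, 0
        ) AS chain
        JOIN tags t ON t.location_id = chain.loc
        ORDER BY chain.position DESC, t.id
        """,
        (location_id, location_id),
    ).fetchall()
    return [name for (name,) in rows]
```

With these rows, tags_for(4) walks Mission St up to USA in a single query regardless of depth, which is the property the ancestors table buys. An index on ancestors (location_id) and on tags (location_id) would be the natural first optimization.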

If you want to deliver all tags for a particular location, then I would recommend replicating the data and storing the tags in a tags array on a row for each location.
You say that the locations don't change very much. So, I would simply batch create the entire table, when any underlying data changes.
Modifying the data in situ is rather problematic. A single update could end up affecting a zillion different rows -- consider a tag change on USA. Recalculating the entire table is going to be more efficient.
If you need to search on the tags as well as return them, then I would go for a more traditional structure of a table with two important columns, location and tag. Then you can have indexes on both (location) and (tag) to facilitate searching in either direction.
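The batch rebuild of a per-location tags array could look roughly like the following sketch. It works on plain Python tuples shaped like the rows in the question; the function name rebuild_tag_arrays and the idea of returning a dict (rather than writing to a table) are assumptions made to keep the example short.

```python
from collections import defaultdict

# Rows copied from the question: (location_id, tag) and
# (location_id, ancestor_id, position)
location_ids = [1, 2, 3, 4]
tags = [
    (1, "English"), (2, "Sunny"), (2, "West coast"),
    (3, "Sea side"), (4, "Cable car station"),
]
ancestors = [(2, 1, 1), (3, 2, 1), (3, 1, 2), (4, 3, 1), (4, 2, 2), (4, 1, 3)]


def rebuild_tag_arrays(location_ids, tags, ancestors):
    """Recompute the full tag list for every location from scratch."""
    own_tags = defaultdict(list)
    for location_id, name in tags:
        own_tags[location_id].append(name)

    chains = defaultdict(list)
    for location_id, ancestor_id, position in ancestors:
        chains[location_id].append((position, ancestor_id))

    table = {}
    for location_id in location_ids:
        # Farthest ancestor first, then the location itself
        ordered = [anc for _, anc in sorted(chains[location_id], reverse=True)]
        ordered.append(location_id)
        table[location_id] = [name for loc in ordered for name in own_tags[loc]]
    return table


tag_arrays = rebuild_tag_arrays(location_ids, tags, ancestors)
```

A read then becomes a single lookup, for example tag_arrays[4]. In a database the same result would be written to one row per location and the whole table replaced whenever the underlying data changes.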

If write performance is not crucial, I would go for denormalization of the database. That means you use the above structure for your write operations and fill a table for your read operations with a trigger, or with some async job if you are afraid of triggers. Then the read performance is optimal, but you have to invest a bit more in the write logic.
Using the above structure for read operations is not a smart solution, because you don't know how deep the tree can get.

Select rows in a table (postgis) from selected features QGIS

How do I select rows in a table based on a key (PK) from another table? I have selected multiple polygons that lie within a geographical region from one layer.
The attributes table from the selected layer looks like this:
| Bloknr | Column 1 | Column 2 | Column 3 |
| 111-08 | xqyz | xyzq | qxyz |
| 208-09 | abc | cba | bca |
Where the row in question (row 1) is selected.
I now want to select this row from a nongeographic layer (from a postgresql database) with a table that looks like this:
| BLOKNR | Column 1 | Column 2 | Column 3 |
| 111-08 | cab | bac | cab |
| 208-09 | abc | cba | bca |
| 111-08 | cba | bca | cab |
Where the first and third rows are to be selected.
There are about 20,000,000 rows in the Postgres table, with multiple matches on each bloknr.
I work in qgis ver. 3.2 and postgresql with PGadmin4
Any help most appreciated.
UPDATE to answer the comments
It would be simple if it were a matter of doing it within Postgres (it's kind of made for that), but I cannot figure out how to query within QGIS. I would rather not export each table to PostgreSQL (I have a few, and for each I need multiple selection queries based on geography), partly because I would like to keep the workflow in QGIS, and partly because the export feature in the QGIS DB Manager gives me the error below, which I think means that I would have to create all the tables manually.
" ERROR: function addgeometrycolumn(unknown, unknown, unknown,
integer, unknown, integer) does not exist LINE 1: SELECT
AddGeometryColumn('public','Test',NULL,0,'MULTIPOLYGO...
HINT: No function matches the given name and argument types. You might need to add explicit type casts."
So again any help appreciated.

I have come up with an answer that will work in theory:
First make the desired geographical selection and create a new layer from the selection.
Then export the layer to the PostGIS database you are connected to.
Now it is possible to run queries in PostgreSQL and pgAdmin.
Note that this does not keep the workflow in QGIS. For further processing (statistics etc.) one will have to work on the integration between the new PostGIS layer and selections within it. It also doesn't quite solve the geographical/map-based selection approach, although it will work.
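A different idea that might keep the workflow in QGIS, offered only as an untested sketch: collect the Bloknr values of the selected features and turn them into a filter expression for the non-geographic table. The helper below is plain Python and its name, build_key_filter, is hypothetical. In the QGIS Python console the values would presumably come from something like the selected features of the polygon layer, and the resulting string could be passed to the table layer's selectByExpression or setSubsetString; that part is an assumption and has not been checked against QGIS 3.2.

```python
def build_key_filter(column, values):
    """Build an expression such as "BLOKNR" IN ('111-08', '208-09')."""
    # De-duplicate so repeated keys do not bloat the expression
    unique_values = sorted(set(values))
    if not unique_values:
        # Nothing selected: return an expression that matches no rows
        return "FALSE"
    # Double any single quotes so the values stay valid string literals
    quoted = ", ".join("'" + value.replace("'", "''") + "'" for value in unique_values)
    return f'"{column}" IN ({quoted})'


# Keys as they might be read from the selected polygons (made-up sample)
selected_keys = ["111-08", "111-08"]
expression = build_key_filter("BLOKNR", selected_keys)
```

With 20,000,000 rows, a provider-side filter on an indexed BLOKNR column is likely to matter far more than how the expression string is built.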

Dynamic creation of Table type

I have a column table with a single column.
I would like to create a table type whose column names are the values in that column, each with a fixed data type and size, and use it in a function.
Something similar to this question:
Dynamic creation of table in tsql
Any suggestions would be appreciated.
EDIT:
To finish a product, a machine has to perform different jobs on the material with different tools.
I have a list of jobs a machine can perform and a list of tools.
Each job needs a specific tool and a number of hours (so the tool can be changed once it reaches its change time). A job can be performed many times on a product (if a job is performed for 1 hour, the tool has been used for 1 hour).
For each product, a set of tools will be at work in sequence, so I need a report showing, for each product, the number of hours each tool has worked.
EDIT 2:
Product table
---------+-----+
ProductID|Jobs |
---------+-----+
1 | job1 |
1 | job2 |
1 | job3 |
1 | . |
1 | . |
1 |100th |
2 | job1 |
2 | . |
2 | . |
2 |200th |
Jobs table
-------+-------+-------
Jobs | tool | time
-------+-------+-------
job1 |tool 10| 2
job1 |tool 09| 1
job2 |tool 11| 4
job3 |tool 17| 0.5
required report (this table does not physically exist)
----------+------+------+------+------+------+-----
productID | job1 | job2 | job3 | job4 | job5 | . . .
----------+------+------+------+------+------+------
1 | 20 | 10 | 5 | . | . | .
----------+------+------+------+------+------+------
2 | 10 | 13 | 5 | . | . | .
----------+------+------+------+------+------+------

Based on the added information, there are two main requirements here:
1. You want to sum up the time spent producing each product, grouped by the jobs involved.
2. You want a cross-table report showing the times from step 1 against products and jobs.
For the first bit, you probably could do this with a query like this:
SELECT
    p.product_id,
    j.jobs,
    SUM(j.time) as SUM_TIME
FROM
    products p
    INNER JOIN jobs j
        ON p.jobs = j.jobs
GROUP BY
    p.product_id,
    j.jobs;
For the second part: this is usually called a PIVOT report.
SAP HANA does not provide a dynamic SQL command for generating output in this form (other DBMS have that).
However, this dynamic transformation is usually relevant for the data presentation and not so much for the processing.
So, as you probably want to use some form of front end for this report (e.g. MS Excel, Crystal Reports, Business Objects X, Tableau, ...) I would recommend doing the transformation and formatting in the frontend report. Look for "PIVOT" or "CROSSTAB" options to do that.
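If the front end happens to be a script rather than a reporting tool, the same join, sum and pivot can be sketched with pandas. This is only an illustration: the column names follow the tables in the question, and the rows are a small made-up subset.

```python
import pandas as pd

# Made-up subset of the Product and Jobs tables from the question
products = pd.DataFrame(
    {"ProductID": [1, 1, 2], "Jobs": ["job1", "job2", "job1"]}
)
jobs = pd.DataFrame(
    {
        "Jobs": ["job1", "job1", "job2"],
        "tool": ["tool 10", "tool 09", "tool 11"],
        "time": [2.0, 1.0, 4.0],
    }
)

# Step 1: join products to jobs, mirroring the INNER JOIN in the query
joined = products.merge(jobs, on="Jobs", how="inner")

# Step 2: pivot so each job becomes a column holding the summed time
report = joined.pivot_table(
    index="ProductID", columns="Jobs", values="time", aggfunc="sum", fill_value=0
)
```

The columns of report are created from whatever job names appear in the data, which is the dynamic part that a fixed table type cannot express.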

Creating an SSIS job to split a column and insert into database

I have a column called Description:
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Description/Title |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Liszt, Hungarian Rhapsody #6 {'Pesther Carneval'}; 2 Episodes from Lenau's 'Faust'; 'Hunnenschlacht' Symphonic Poem. (NW German Phil./ Kulka) |
| Beethoven, Piano Sonatas 8, 23 & 26. (Justus Frantz) |
| Puccini, Verdi, Gounod, Bizet: Arias & Duets from Butterfly, Tosca, Boheme, Turandot, I Vespri, Faust, Carmen. (Fiamma Izzo d'Amico & Peter Dvorsky w.Berlin Radio Symph./Paternostro) |
| Puccini, Ponchielli, Bizet, Tchaikovsky, Donizetti, Verdi: Arias from Boheme, Manon Lescaut, Tosca, Gioconda, Carmen, Eugen Onegin, Favorita, Rigoletto, Luisa Miller, Ballo, Aida. (Peter Dvorsky, ten. w.Hungarian State Opera Orch./ Mihaly) |
| Thomas, Leslie: 'The Virgin Soldiers' (Hywel Bennett reads abridged version. Listening time app. 2 hrs. 45 mins. DOLBY) |
| Katalsky, A. {1856-1926}: Liturgy for A Cappella Chorus. Rachmaninov, 6 Choral Songs w.Piano. (Bolshoi Theater Children's Choir/ Zabornok. DOLBY) |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Please note that above I'm only showing 1 field.
Also, the output that I would like is:
+-------+-------+
| Word | Count |
+-------+-------+
| Arias | 3 |
| Duets | 2 |
| Liszt | 10 |
| Tosca | 1 |
+-------+-------+
I want this output to encompass EVERY record. I do not want a separate one of these for each record, just one global one.
I am choosing to use SSIS to do this job. I'd like your input on which controls to use to help with this task:
I'm not looking for a solution, but simply some direction on how to get started with this. I understand this can be done many different ways, but I cannot seem to think of a way to do this most efficiently. Thank you for any guidance.
FYI:
This script does an excellent job of concatenating everything:
select description + ', ' as 'data()'
from [BroincInventory]
for xml path('')
But I need guidance on how to work with this result to create the required output. How can this be done with c# or with one of the SSIS components?
edit: As siyual points out below I need a script task. The script above obviously will not work since there's a limit to the size of a data point.

I think term extraction might be the component you are looking for. Check this out: http://www.mssqltips.com/sqlservertip/3194/simple-text-mining-with-the-ssis-term-extraction-component/
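For comparison, the split-and-count logic that a script task would need is small. The sketch below is in Python rather than C# or an SSIS component, so it only illustrates the logic; the function name count_words and the choice to treat runs of letters as words are assumptions.

```python
import re
from collections import Counter


def count_words(descriptions):
    """Count how often each word occurs across all descriptions."""
    counts = Counter()
    for text in descriptions:
        # Treat every run of ASCII letters as one word (an assumption;
        # accented names or apostrophes would need a wider pattern)
        counts.update(re.findall(r"[A-Za-z]+", text))
    return counts


# Two shortened, made-up descriptions in the style of the question
sample_descriptions = [
    "Arias & Duets from Tosca",
    "Arias from Tosca, Carmen",
]
word_counts = count_words(sample_descriptions)
```

Because the counter is updated row by row, no single concatenated string is ever built, which avoids the size limit mentioned in the edit above.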

Google Refine / Open Refine: Columns to Rows

I'm afraid this might be a somewhat simple question, but I can't seem to figure it out.
I have a spreadsheet with many objects, each of which has many attributes (one per column), like this (sorry, I can't post images, so this is the best I can do):
OBJECT ID | PERIOD | COLOR | REPRESENTATION
1 | Early Intermediate | Bichrome | Abstract
2 | Middle Horizon | Multicolored | Representational
… and I'd like each column to become a separate row — which would mean that each object would be listed a number of times. Like this:
OBJECT | ATTRIBUTE
Object 1 | Early Intermediate
Object 1 | Bichrome
Object 1 | Abstract
Object 2 | Middle Horizon
Object 2 | Multicolored
Object 2 | Representational
I'm not seeing an obvious way to do this, and I can't find an answer here, though perhaps I'm not using the right search terms.
Thanks for any help you can offer!
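Outside OpenRefine, the same columns-to-rows reshaping can be sketched with pandas melt. This swaps in a different tool, so treat it only as an illustration of the target shape; the frame below copies the two example rows from the question.

```python
import pandas as pd

objects = pd.DataFrame(
    {
        "OBJECT ID": [1, 2],
        "PERIOD": ["Early Intermediate", "Middle Horizon"],
        "COLOR": ["Bichrome", "Multicolored"],
        "REPRESENTATION": ["Abstract", "Representational"],
    }
)

long_form = (
    # One row per (object, attribute column) pair
    objects.melt(id_vars="OBJECT ID", value_name="ATTRIBUTE")
    # Stable sort keeps PERIOD, COLOR, REPRESENTATION order within each object
    .sort_values("OBJECT ID", kind="stable")
    # The original column name is not wanted in the output
    .drop(columns="variable")
    .reset_index(drop=True)
)
```

The result has one OBJECT ID / ATTRIBUTE pair per row, with each object repeated once per attribute column.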
