pandas: dealing with a column that holds multiple delimiter-separated values for data analysis

I'm teaching myself Python and am trying to find a way to work with columns that hold multiple values. The dataset is the TMDb Movie Dataset, and its multi-valued columns include genres, cast, etc.
I managed to split the values and count them, and that works. But what if I want to see the relationship between genres and, for example, popularity? How can I group by genre after splitting the values properly?
The dataset looks like this:

I would use something like the stack function to create a row for every item when splitting. For example, when you want to group by genre, create one row per genre when splitting (keeping the other columns the same). With that you can do simple groupby operations.
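The one-row-per-item idea above can be sketched as follows. This is a minimal illustration, not the answerer's exact code: it uses `Series.str.split` followed by `DataFrame.explode` in place of `stack`, and the three sample rows and their column names are made up to resemble the TMDb data.

```python
import pandas as pd

# Hypothetical stand-in for a few TMDb rows; values and column names are assumptions.
movies = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C"],
    "genres": ["Action|Adventure", "Action|Drama", "Drama"],
    "popularity": [10.0, 4.0, 2.0],
})

# Split each genres string into a list, then emit one row per list element.
# The other columns are repeated for every genre of the same movie.
long_movies = (
    movies.assign(genre=movies["genres"].str.split("|"))
    .explode("genre")
    .drop(columns="genres")
)

# With one genre per row, an ordinary groupby answers the question.
mean_popularity = long_movies.groupby("genre")["popularity"].mean()
print(mean_popularity)
```

Note that a movie with several genres contributes its popularity to each of them, so the per-genre means are not independent of one another.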

Related

How can I create multiple rows based on the value of one column in SQL?

I have a column of type string in my table, where multiple values are separated by the pipe character. For example:
Value1|Value2|Value3
Now, what I want is to have a query, which will show three rows for this row. Basically something similar to the concept of explode in Dataframes.
Note that I am using Spark SQL. And I want to achieve this using SQL, not dataframes.
I got it working by using the following query.
select t.*, explode(split(values, "\\|")) as value
from table t
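Here \\| can also be replaced by [|]. Specifying just | doesn't work, because split treats its second argument as a regular expression, in which an unescaped | means alternation.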
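The escaping rule can be tried outside Spark. The sketch below uses Python's `re` module rather than Spark's Java regex engine, so it only illustrates the shared rule that a bare `|` is an operator and not a literal character; what the bare pattern returns differs between engines.

```python
import re

row = "Value1|Value2|Value3"

# An escaped pipe and a one-character class both match a literal "|".
escaped = re.split(r"\|", row)
char_class = re.split(r"[|]", row)

# A bare "|" is an alternation between two empty patterns, so it never
# matches the pipe character itself.
bare = re.split("|", row)

print(escaped)
print(char_class)
print(bare == escaped)
```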

SQL Server match partial text in the comma separated varchar column

I have a VARCHAR column category_text in a table that contains the tags of a stored notification. There are three tags (Query, Complaint and Suggestion), and the column can hold one or more of them separated by commas. I am applying a filter, and the filter can also hold one or more comma-separated values.
What I want is to retrieve all the rows that contain at least one of the tags in the filter the user applies. For instance, the user can select 'query,suggestion' as a filter, and the result should be all the rows that contain either tag, i.e. query or suggestion.
select
t.category_text
from
real_time_notifications t
where
charindex('query, suggestion, complaints', t.category_text) > 0
order by
t.id desc
Create a new table, such as user_category (user.id linking to the user table, plus category), and create an index on both columns. That will speed up searching considerably and make future maintenance much easier.
If you still insist on keeping the comma-separated column, create an inline function that splits the string into records, then merge against those records to test for a match.
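A rough sketch of the separate-table suggestion, using Python's built-in `sqlite3` instead of SQL Server so that it runs anywhere. The table name `notification_category`, its columns, and the sample rows are hypothetical and adapted to the notification table in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE notification_category (
        notification_id INTEGER NOT NULL,
        category        TEXT    NOT NULL
    );
    CREATE INDEX ix_notification_category
        ON notification_category (category, notification_id);
""")

# One row per (notification, tag) instead of one comma-separated string.
conn.executemany(
    "INSERT INTO notification_category VALUES (?, ?)",
    [(1, "query"), (1, "complaint"), (2, "suggestion"), (3, "complaint")],
)

# The user's filter still arrives as a comma-separated string.
selected = [tag.strip() for tag in "query,suggestion".split(",")]
placeholders = ", ".join("?" for _ in selected)

# A notification matches if at least one of its tags is in the filter.
matching_ids = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT notification_id FROM notification_category "
        f"WHERE category IN ({placeholders}) ORDER BY notification_id DESC",
        selected,
    )
]
print(matching_ids)
```

With the tags stored one per row, "at least one tag matches" becomes a plain IN predicate that the index can serve, and no string searching is needed.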

How to flatten tables correctly in BigQuery?

I have the following tables:
In table 2 (the yellow fields), the first field is part of the following schema:
name1 RECORD NULLABLE
name1.name2 RECORD REPEATED
name1.name2.date_inserted TIMESTAMP NULLABLE
As you can see, the last (sub-row?) of row 25 is greyed out because it is part of the repeated record name1.name2.
I am trying to join table 2 with table 1 (the orange fields) on another field. I have no experience with records or repeated records, but using FLATTEN() I managed to join them.
The problem is that some dates from the 2nd table return NULL after the join, although there aren't any NULLs before it. Since I can't figure out what the greyed cells are, I guess I am doing something wrong.
All this sums up to: How can I totally flatten all tables that I want to use so that there won't be any records at all and so I can go through the data with simple SQL statements? Please provide an example as well. Looking for something generic.
How can I totally flatten all tables that I want to use so that there won't be any records at all and so I can go through the data with simple SQL statements?
It really depends on the schemas you are working with. You can preprocess them, flatten the arrays and rename the struct fields, then use the result as your base table to work with simple SQL statements.
For your scenario, you can start by flattening the name2 column of table 2 like this:
SELECT
name2.date_inserted -- Add additional fields you want on the result
FROM table2, table2.name1.name2
You can do CROSS JOIN and LEFT JOIN to further adjust your results.
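What the two join styles do to a repeated record can be modelled in plain Python. This is only an analogy for the semantics, not BigQuery code; the `id` field and the sample dates are invented, while `name1`, `name2` and `date_inserted` follow the schema in the question.

```python
# Hypothetical rows shaped like table 2: name1 is a record, name1.name2 is repeated.
table2 = [
    {"id": 25, "name1": {"name2": [
        {"date_inserted": "2016-01-01"},
        {"date_inserted": "2016-01-02"},
    ]}},
    {"id": 26, "name1": {"name2": []}},
]


def flatten_cross(rows):
    """One output row per array element; parents with empty arrays vanish."""
    return [
        {"id": row["id"], "date_inserted": item["date_inserted"]}
        for row in rows
        for item in row["name1"]["name2"]
    ]


def flatten_left(rows):
    """Like flatten_cross, but keep parents with empty arrays, using None."""
    flat = []
    for row in rows:
        items = row["name1"]["name2"] or [{"date_inserted": None}]
        for item in items:
            flat.append({"id": row["id"], "date_inserted": item["date_inserted"]})
    return flat


cross_rows = flatten_cross(table2)
left_rows = flatten_left(table2)
print(len(cross_rows), len(left_rows))
```

The cross-join style silently drops row 26, while the left-join style keeps it with a missing date, which is one way NULL dates can appear after flattening.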
Please provide an example as well. Looking for something generic.
I'm not sure about a generic approach, since each schema would probably have distinct requirements. The key concept is knowing how to flatten arrays and how to query structs with arrays and arrays of structs.
You can find plenty of examples in that documentation.

SSRS LOOKUP with row grouping

I am trying to add a column to a tablix that uses a different dataset. dataset1 holds the new data and dataset2 holds the old comparison data.
The tablix uses dataset1, and the row in question is grouped by D_ID. I added a column that needs to bind D_ID (dataset1) to D_ID (dataset2):
=-1*sum(Lookup(Fields!D_ID.Value, Fields!D_ID.Value, Fields!BUD_OLD.Value, "OLD")+Lookup(Fields!D_ID.Value, Fields!D_ID.Value, Fields!ACK_BUD_OLD.Value, "OLD"))
However, this does not take into account that I need all the rows from BUD_OLD with D_ID = something to be summed together. The lookup returns only one value, not the sum of all values with that D_ID.
Example
D_ID   SUM(BUD_NEW+ACK_BUD_NEW)   SUM(BUD_OLD+ACK_BUD_OLD)
100    75 (40+35)                 15 (should be 15+20=35)
How can I get the sum?
LOOKUP only gets a single value.
You would need to use LOOKUPSET and a special function to SUM the results.
Luckily, this has been done before.
SSRS Groups, Aggregated Group after detailed ones
From BIDS:
LOOKUP: Use Lookup to retrieve the value from the specified dataset for a name-value pair where there is a 1-to-1 relationship. For example, for an ID field in a table, you can use Lookup to retrieve the corresponding Name field from a dataset that is not bound to the data region.
LOOKUPSET: Use LookupSet to retrieve a set of values from the specified dataset for a name-value pair where there is a 1-to-many relationship. For example, for a customer identifier in a table, you can use LookupSet to retrieve all the associated phone numbers for that customer from a dataset that is not bound to the data region.
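The difference between the two functions can be modelled in a few lines of Python. This is not SSRS code: `lookup` and `lookup_set` are hypothetical helpers that mimic the first-match and all-matches behaviour described above, and the per-row split between BUD_OLD and ACK_BUD_OLD is invented so that the rows add up to the 15 and 20 from the question's example.

```python
# Hypothetical rows of the "OLD" dataset; the individual amounts are assumptions.
old_rows = [
    {"D_ID": 100, "BUD_OLD": 10, "ACK_BUD_OLD": 5},
    {"D_ID": 100, "BUD_OLD": 12, "ACK_BUD_OLD": 8},
    {"D_ID": 200, "BUD_OLD": 7, "ACK_BUD_OLD": 1},
]


def lookup(key, rows, field):
    """Return the first matching value only, like a 1-to-1 Lookup."""
    for row in rows:
        if row["D_ID"] == key:
            return row[field]
    return None


def lookup_set(key, rows, field):
    """Return every matching value, like a 1-to-many LookupSet."""
    return [row[field] for row in rows if row["D_ID"] == key]


# Lookup sees only the first row for D_ID 100.
single = lookup(100, old_rows, "BUD_OLD") + lookup(100, old_rows, "ACK_BUD_OLD")

# LookupSet returns all rows, which then still have to be summed explicitly.
total = sum(lookup_set(100, old_rows, "BUD_OLD")) + sum(
    lookup_set(100, old_rows, "ACK_BUD_OLD")
)
print(single, total)
```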
Your expression requires a second "sum".
Try the following:
=-1*(Sum(Lookup(Fields!D_ID.Value, Fields!D_ID.Value, Fields!BUD_OLD.Value, "OLD"))+Sum(Lookup(Fields!D_ID.Value, Fields!D_ID.Value, Fields!ACK_BUD_OLD.Value, "OLD")))

Is it possible to concat a string field after group by in Hive

I am evaluating Hive and need to do some string field concatenation after group by. I found a function named "concat_ws" but it looks like I have to explicitly list all the values to be concatenated. I am wondering if I can do something like this with concat_ws in Hive. Here is an example. So I have a table named "my_table" and it has two fields named country and city. I want to have only one record per country and each record will have two fields - country and cities:
select country, concat_ws(city, "|") as cities
from my_table
group by country
Is this possible in Hive? I am using Hive 0.11 from CDH5 right now
In database management an aggregate function is a function where the values of multiple rows are grouped together as input on certain criteria to form a single value of more significant meaning or measurement such as a set, a bag or a list.
Source: Aggregate function - Wikipedia
Hive's out-of-the-box aggregate functions are listed on the following web page:
Built-in Aggregate Functions (UDAF: user-defined aggregate function)
So the only built-in option for Hive 0.11 (Hive 0.13 and above also have collect_list) is:
array collect_set(col)
This answers your request as long as there are no duplicate city records per country, because it returns a set of objects with duplicate elements eliminated. The resulting array can be passed to concat_ws, e.g. concat_ws('|', collect_set(city)). Otherwise, create your own UDAF or aggregate outside of Hive.
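For the "aggregate outside of Hive" route, a pandas sketch of the two behaviours might look like this. The sample rows are made up, and the set-style version sorts its values only to make the output deterministic; Hive's collect_set does not guarantee any order.

```python
import pandas as pd

# Hypothetical contents of my_table, with one duplicated city.
my_table = pd.DataFrame({
    "country": ["US", "US", "US", "FR"],
    "city": ["Boston", "Austin", "Boston", "Paris"],
})

grouped = my_table.groupby("country")["city"]

# Set-style aggregation: duplicates removed (sorted here for stable output).
cities_set = grouped.agg(lambda s: "|".join(sorted(set(s))))

# List-style aggregation: duplicates kept, in row order.
cities_list = grouped.agg(lambda s: "|".join(s))

print(cities_set)
print(cities_list)
```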
References for writing UDAF:
Writing GenericUDAFs: A Tutorial
HivePlugins
Create/Drop Function