I used the Python recordlinkage library to find duplicates and got the index pairs, but now I have to make a single CSV that puts the duplicates into clusters - MultiIndex

I have a MultiIndex of matched pairs like this:
MultiIndex([( 4061, 1461),
            ( 5011, 4608),
            ( 7189,  446),
            ( 9505, 3120),
            ( 9926, 1534),
            ...],
           length=4063)
Now I have to make a dataset of all the duplicates found at these indexes by putting them in the same cluster, but I'm not finding anything on how to do this.

Related

Can I query a JSON field from Crashlytics as a "where" directive in BigQuery?

I've searched SO for this and found a number of similar questions, but I can't quite figure out how to apply them to my scenario.
We have Google Crashlytics linked to BigQuery.
Given the following table data (many columns deleted for clarity):
TABLE firebase_crashlytics.com_foo_ios
is_fatal | application
======================
true | {"v":{"f":[{"v":"53"},{"v":"0.9.1"}]}}
false | {"v":{"f":[{"v":"71"},{"v":"1.0.0"}]}}
true | {"v":{"f":[{"v":"72"},{"v":"1.0.1"}]}}
I've tried a lot of the suggestions, but can't seem to make this work.
I want to query the com_foo_ios table for all records for which is_fatal equals true, and the application version (the second array element of application) is higher than 1.0.0. Alternatively, I could use the build number as that is unique to versions.
So my question is:
Can this be done via a SQL query, without having to write custom functions as suggested in Berlyant's reply here? That approach didn't work for me.
Interestingly, the errors I see indicate that the two elements of the application array are identified as build_version and display_name. I've tried using those in queries as well, to no avail.
Using the above sample data, can anyone suggest a straightforward way of querying this info?
select t.*,
json_extract_scalar(_[offset(0)], '$.v') as col1,
json_extract_scalar(_[offset(1)], '$.v') as col2
from your_table t,
unnest([struct(json_extract_array(application, '$.v.f') as _)])
if applied to sample data in your question
with your_table as (
select true is_fatal, '{"v":{"f":[{"v":"53"},{"v":"0.9.1"}]}}' application union all
select false, '{"v":{"f":[{"v":"71"},{"v":"1.0.0"}]}}' union all
select true, '{"v":{"f":[{"v":"72"},{"v":"1.0.1"}]}}'
)
output is
Hmm, dumping the data model shows 'application: STRUCT<build_version STRING, display_version STRING>',
so it looks like your table schema is actually like in the below sample:
with your_table as (
select true is_fatal, struct('53' as build_version, '0.9.1' as display_version) application union all
select false, struct('71' as build_version, '1.0.0' as display_version) union all
select true, struct('72' as build_version, '1.0.1' as display_version)
)
If so - use below to get to particular field(s)
select is_fatal, application.*
from your_table t
with output
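To get back to the original filter (fatal crashes on versions newer than 1.0.0), a WHERE clause can be added on top of that struct. This is only a sketch against the your_table sample above; it assumes build numbers increase with versions, and the threshold 71 (the sample build for 1.0.0) is just taken from the sample data:
-- sketch only: relies on the struct schema shown above;
-- safe_cast guards against non-numeric build values
select is_fatal, application.*
from your_table t
where is_fatal = true
and safe_cast(application.build_version as int64) > 71
Comparing the build number sidesteps the pitfalls of comparing version strings lexically (which breaks once a version component reaches two digits).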

Unnesting a json in Redshift causing nested loop in the query plan

I have a column in my tables called 'data' with JSONs in it like below:
{"tt":"452.95","records":[{"r":"IN184366","t":"812812819910","s":"129.37","d":"982.7","c":"83"},{"r":"IN183714","t":"8028028029093","s":"33.9","d":"892","c":"38"}]}
I have written code to unnest it into separate columns like tr, r, s.
Below is the code
with raw as (
SELECT json_extract_path_text(B.Data, 'records', true) as items
FROM tableB as B where B.date::timestamp between
to_timestamp('2019-01-01 00:00:00','YYYY-MM-DD HH24:MI:SS') AND
to_timestamp('2022-12-31 23:59:59','YYYY-MM-DD HH24:MI:SS')
UNION ALL
SELECT json_extract_path_text(C.Data, 'records', true) as items
FROM tableC as C where C.date-5 between
to_timestamp('2019-01-01 00:00:00','YYYY-MM-DD HH24:MI:SS') AND
to_timestamp('2022-12-31 23:59:59','YYYY-MM-DD HH24:MI:SS')
),
numbers as (
SELECT ROW_NUMBER() OVER (ORDER BY TRUE)::integer- 1 as ordinal
FROM <any_random_table> limit 1000
),
joined as (
select raw.*,
json_array_length(raw.items, true) as number_of_items,
json_extract_array_element_text(
raw.items,
numbers.ordinal::int,
true
) as item
from raw
cross join numbers
where numbers.ordinal <
json_array_length(raw.items, true)
),
parsed as (
SELECT J.*,
json_extract_path_text(J.item, 'tr',true) as tr,
json_extract_path_text(J.item, 'r',true) as r,
json_extract_path_text(J.item, 's',true)::float8 as s
from joined J
)
select * from parsed
The above code works when there is a small number of records, but it takes more than a day to run, CPU utilization (in Redshift) reaches 100%, and the disk space used also reaches 100% when I set the date range to the last two years or the number of records is large.
Can anyone please suggest an alternative way to unnest JSON objects like the above in Redshift?
My query plan is saying:
Nested Loop Join in the query plan - review the join predicates to avoid Cartesian products
Goal: to unnest without using any cross joins
Input: a data column containing JSON such as
{"tt":"452.95","records":[{"r":"IN184366","t":"812812819910","s":"129.37","d":"982.7","c":"83"},{"r":"IN183714","t":"8028028029093","s":"33.9","d":"892","c":"38"}]}
Output should be, for example, the tr, r, s columns from the above JSON.
You want to unnest up to 1000 JSON records stored in a JSON array, but the nested loop join is taking too long.
The root issue is likely your data model. You have stored structured records (called "records") inside a semi-structured text element (JSON), within a column of a structured columnar database. You want to perform some operation on these buried records that you haven't described, but here's the problem: columnar databases are optimized for read-centric analytic queries, whereas you need to expand these internal JSON records into Redshift rows (records), which is fundamentally a write operation. This works against the optimizations of the database.
The size of this expanding data is also large compared to the disk storage on your cluster, which is why the disks are filling up. Your CPUs are likely spinning unpacking the JSONs and managing overloaded disk and memory capacity. At the edge of filling up its disks, Redshift shifts to a mode that optimizes disk space utilization at the expense of execution speed. A larger cluster may give you significantly faster execution if you can avoid this effect, but that will cost money you may not have budgeted. Not an ideal solution.
One area that would improve the speed of your query is not carrying all the data along. You keep raw.* and J.* all through the query, but it is not clear you need these. Since part of the issue is data size during execution, and this execution includes loop joining, you are making the execution much harder than it needs to be by carrying all this data (including the original JSONs) - see the trimmed sketch below.
The best way out of this situation is to change your data model and expand these internal JSON records into Redshift records on ingestion. JSON data is fine for seldom-used information, or for information that is only needed at the end of a query where the data is small. Needing the expanded JSON at the input end of the query, for such a large amount of data, is not a good use case for JSON in Redshift. Each of these "records" inside the JSON is a record, and they need to be stored as such if you need to work across them as query input.
Now you want to know if there is some slick way to get around this issue in your case and the answer is "unlikely but maybe". Can you describe how you are using the final values in your query (t, r, and s)? If you are just using some aspect of this data (max value or sum or ...) then there may be a way to get to the answer without the large nested loop join. But if you need all the values then there is no other way to get these AFAIK. A description of what comes next in the data process could open up such an opportunity.
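To illustrate the earlier point about raw.* and J.*, here is a trimmed sketch of the same pipeline (only tableB shown for brevity); the later CTEs keep only the expanded item and the extracted fields, so the original JSON text is not dragged through the loop join:
with raw as (
-- unchanged from the question: pull the json array out of each row
SELECT json_extract_path_text(B.Data, 'records', true) as items
FROM tableB as B where B.date::timestamp between
to_timestamp('2019-01-01 00:00:00','YYYY-MM-DD HH24:MI:SS') AND
to_timestamp('2022-12-31 23:59:59','YYYY-MM-DD HH24:MI:SS')
),
numbers as (
SELECT ROW_NUMBER() OVER (ORDER BY TRUE)::integer - 1 as ordinal
FROM <any_random_table> limit 1000
),
joined as (
-- keep only the expanded array element, not raw.* (which includes the full json text)
select json_extract_array_element_text(raw.items, numbers.ordinal::int, true) as item
from raw
cross join numbers
where numbers.ordinal < json_array_length(raw.items, true)
),
parsed as (
-- keep only the extracted fields, not J.*
SELECT
json_extract_path_text(J.item, 'tr', true) as tr,
json_extract_path_text(J.item, 'r', true) as r,
json_extract_path_text(J.item, 's', true)::float8 as s
from joined J
)
select tr, r, s from parsed
This does not remove the nested loop join itself, but it shrinks the rows being looped over, which is the cheaper part of the fix compared to remodelling the data.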

Query Snowflake Jobs [duplicate]

Is there any way, within a Snowflake SQL query, to view which tables are being queried the most, as well as which columns? I want to know what data is of most value to my users and am not sure how to do this programmatically. Any thoughts are appreciated - thank you!
2021 update
The new ACCESS_HISTORY view has this information (in preview right now, enterprise edition).
For example, if you want to find the most used columns:
select obj.value:objectName::string objName
, col.value:columnName::string colName
, count(*) uses
, min(query_start_time) since
, max(query_start_time) until
from snowflake.account_usage.access_history
, table(flatten(direct_objects_accessed)) obj
, table(flatten(obj.value:columns)) col
group by 1, 2
order by uses desc
Ref: https://docs.snowflake.com/en/sql-reference/account-usage/access_history.html
2020 answer
The best I found (for now):
For any given query, you can find what tables are scanned through looking at the plan generated for it:
SELECT *, "objects"
FROM TABLE(EXPLAIN_JSON(SYSTEM$EXPLAIN_PLAN_JSON('SELECT * FROM a.b.any_table_or_view')))
WHERE "operation"='TableScan'
You can find all of your previously run queries too:
select QUERY_TEXT
from table(information_schema.query_history())
So the natural next step would be to combine both - but that's not straightforward, as you'll get an error like:
SQL compilation error: argument 1 to function EXPLAIN_JSON needs to be constant, found 'SYSTEM$EXPLAIN_PLAN_JSON('SELECT * FROM a.b.c')'
The solution would be to combine the queries from the query_history() with the SYSTEM$EXPLAIN_PLAN_JSON outside (to make the strings constant), and then you will be able to find out the most queried tables.
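One way to do that combination (a sketch, not a single-query solution): use query_history() to generate one EXPLAIN statement per past query, with the query text embedded as a constant literal, and then run the generated statements in a second pass:
-- sketch: each row becomes a runnable statement; internal single quotes are doubled
select 'select "objects" from table(explain_json(system$explain_plan_json('''
|| replace(QUERY_TEXT, '''', '''''')
|| '''))) where "operation" = ''TableScan'';' as explain_stmt
from table(information_schema.query_history());
Running those generated statements (manually, or from a small script or stored procedure) and aggregating the TableScan objects then gives the most queried tables.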

Quick one on Big Query SQL-Ecommerce Data

I am trying to replicate the Google Analytics data in BigQuery but couldn't do that.
Basically, I am using Custom Dimension 40 (user subscription status),
but I am getting wrong numbers in BQ.
Can someone help me with this?
I am using the query below but couldn't work out the exact right one.
SELECT
(SELECT value FROM hits.customDimensions where index=40) AS UserStatus,
COUNT(hits.transaction.transactionId) AS Unique_Purchases
FROM
`xxxxxxxxxxxxx.ga_sessions_2020*` AS GA, --new rollup
UNNEST(GA.hits) AS hits
WHERE
(SELECT value FROM hits.customDimensions where index=40) IN ("xx001","xxx002")
GROUP BY 1
I am getting this from BigQuery, which is wrong.
I have checked the dates as well but don't know why it's wrong.
Your question is rather unclear. But because you want something to be unique and numbers are mysteriously not what you want, I would suggest using COUNT(DISTINCT):
COUNT(DISTINCT hits.transaction.transactionId) AS Unique_Purchases
As far as I understand, you imported Google Analytics data into BigQuery and you are trying to group by the custom dimension with index 40 and values ("xx001", "xxx002") in order to know how many hit transactions were performed for each of these dimension values.
Replicating your scenario and trying to execute the query you posted, I got the following error.
However, I created a query that could help with your use case. First, it selects the transactionId and dimension value for rows where the transactionId is not null and the index is equal to 40; then the grouping is done by the dimension value, filtered to the values "xx001" and "xxx002".
WITH tx AS (
SELECT
HIT.transaction.transactionId,
CD.value
FROM
`xxxxxxxxxxxxx.ga_sessions_2020*` AS GA,
UNNEST(GA.hits) AS HIT,
UNNEST(HIT.customDimensions) AS CD
WHERE
HIT.transaction.transactionId IS NOT NULL
AND
CD.index = 40
)
SELECT tx.value AS UserStatus, count(tx.transactionId) AS Unique_Purchases
FROM tx
WHERE tx.value IN ("xx001","xx002")
GROUP BY tx.value
For further details about the format and schema of the data that is imported into BigQuery, I found this document.

GCP: select query with unnest from array has very big process data to run compared to hardcoded values

In BigQuery (GCP), I am trying to grab some data from a table where the date is the same as a date in a list of values I have. If I hardcode the list of values in the query, it is vastly cheaper in processing to run than if I use a temp structure like an array...
Is there a way to use the temp structure but avoid the enormous processing cost?
Why is it so expensive for something as small and simple as this?
please see below examples:
**-----1/ array structure example: this query processes 144.8 GB----------**
WITH
get_a as (
SELECT
GENERATE_DATE_ARRAY('2000-01-01','2000-01-02') as array_of_dates
)
SELECT
a.heading as title,
a.ingest_time as proc_date
FROM
`veiw_a.events` as a,
get_a as b,
UNNEST(b.array_of_dates) as c
WHERE
c in (CAST(a.ingest_time AS DATE))
**------2/ hardcoded example: this query processes 936.5 MB - over 154x less?--------**
SELECT
a.heading as title,
a.ingest_time as proc_date
FROM
`veiw_a.events` as a
WHERE
(CAST(a.ingest_time as DATE)) IN ('2000-01-01','2000-01-02')
Presumably, your view_a.events table is partitioned by the ingest_time.
The issue is that partition pruning is very conservative (buggy?). With the direct comparisons, BigQuery is smart enough to recognize exactly which partitions are used for the query. But with the generated version, BigQuery is not able to figure this out, so the entire table needs to be read.
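If the date list really has to be generated, one workaround worth trying (a sketch; whether pruning actually kicks in depends on the table setup, so check the dry-run estimate) is to materialize the dates into a scripting variable first, so the filter is a runtime constant instead of a join against a generated array:
-- sketch: compute the dates once, then filter on the variable; verify with a dry run
DECLARE date_list ARRAY<DATE> DEFAULT GENERATE_DATE_ARRAY('2000-01-01','2000-01-02');
SELECT
a.heading as title,
a.ingest_time as proc_date
FROM
`veiw_a.events` as a
WHERE
CAST(a.ingest_time as DATE) IN UNNEST(date_list);
If that still scans the whole table, falling back to literal dates (as in example 2) remains the reliable way to keep the bytes processed down.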