I would like to unnest the column json_blob:
SELECT '{"a": [1, 2, 3], "b": [4, 5, 6]}' AS json_blob
to look like this at the end:
key | val
----+----------
"a" | [1, 2, 3]
"b" | [4, 5, 6]
Note that different rows can have different keys, and there are a lot of them; I don't want to write them all out by hand.
If the schema of the JSON stayed the same, you could do this:
WITH t AS (SELECT '{"a": [1, 2, 3], "b": [4, 5, 6]}' AS json_blob)
SELECT key, val
FROM t CROSS JOIN UNNEST([
  STRUCT('a' AS key, JSON_EXTRACT(json_blob, '$.a') AS val),
  STRUCT('b' AS key, JSON_EXTRACT(json_blob, '$.b') AS val)
])
The example below can be a good starting point, but it really depends on the pattern of your JSON:
#standardSQL
WITH `project.dataset.table` AS (
  SELECT '{"a": [1, 2, 3], "b": [4, 5, 6]}' AS json_blob UNION ALL
  SELECT '{"a": [11, 12, 13], "c": [14, 15, 16]}' UNION ALL
  SELECT '{"d": 21, "b": [24, 25, 26]}'
)
SELECT
  SPLIT(kv, ': ')[OFFSET(0)] AS key,
  SPLIT(kv, ': ')[SAFE_OFFSET(1)] AS value
FROM `project.dataset.table`,
UNNEST(REGEXP_EXTRACT_ALL(json_blob, r'("\w+":[^"]*)(?:,|})')) kv
with this result:
Row | key | value
----+-----+--------------
1   | "a" | [1, 2, 3]
2   | "b" | [4, 5, 6]
3   | "a" | [11, 12, 13]
4   | "c" | [14, 15, 16]
5   | "d" | 21
6   | "b" | [24, 25, 26]
I have a dataframe book_matrix with users as rows, books as columns, and ratings as values. When I use corrwith() to compute the correlation between 'The Lord of the Rings' and 'The Silmarillion' the result is 1.0, but the values are clearly different.
The non-null values [10, 3] and [10, 9] have correlation 1.0. I would expect them to be exactly the same when the correlation is equal to one. How can this happen?
Correlation means the values have a certain relationship to one another, for example a linear one. Here's an illustration:
import pandas as pd
df1 = pd.DataFrame({"A":[1, 2, 3, 4],
"B":[5, 8, 4, 3],
"C":[10, 4, 9, 3]})
df2 = pd.DataFrame({"A":[2, 4, 6, 8],
"B":[-5, -8, -4, -3],
"C":[4, 3, 8, 5]})
df1.corrwith(df2, axis=0)
A 1.000000
B -1.000000
C 0.395437
dtype: float64
So you can see that [1, 2, 3, 4] and [2, 4, 6, 8] have correlation 1.0: the second is exactly twice the first.
The next column, [5, 8, 4, 3] and [-5, -8, -4, -3], has perfect negative correlation, -1.0: the second is the negation of the first.
In the last column, [10, 4, 9, 3] and [4, 3, 8, 5] are somewhat correlated at 0.395437, because both exhibit a high-low-high-low pattern, but with varying vertical scaling factors.
In your case, the books 'The Lord of the Rings' and 'The Silmarillion' only have 2 ratings each, and both pairs follow a high-low sequence. With only two data points, any two non-constant series are perfectly correlated, because two points always lie on a straight line. Even if I illustrate with more data points that keep the same pattern, the correlation stays at 1.0:
df1 = pd.DataFrame({"A": [10, 3, 10, 3, 10, 3],
"B": [10, 3, 10, 3, 10, 3]})
df2 = pd.DataFrame({"A": [10, 9, 10, 9, 10, 9],
"B": [10, 10, 10, 9, 9, 9]})
df1.corrwith(df2, axis=0)
A 1.000000
B 0.333333
dtype: float64
So you can see that [10, 3, 10, 3, 10, 3] and [10, 9, 10, 9, 10, 9] are also perfectly correlated at 1.0.
But if I rearrange the sequence a little, [10, 3, 10, 3, 10, 3] and [10, 10, 10, 9, 9, 9] are no longer perfectly correlated; the correlation drops to 0.333333.
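To see why any two ratings always correlate perfectly, here is a minimal numpy sketch; np.corrcoef computes the same Pearson correlation that corrwith uses by default:

import numpy as np

# any two distinct points lie on a straight line, so the Pearson
# correlation of two 2-element series is always +1 or -1
print(np.corrcoef([10, 3], [10, 9])[0, 1])  # 1.0, both pairs decrease
print(np.corrcoef([10, 3], [9, 10])[0, 1])  # -1.0, opposite directions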
So going forward, you need more data, and more variations in the data! Hope that helps 😎
My data:
df = pd.DataFrame({"id":['1,2,3,4','1,2,3,6'], "sum": [6,7]})
My code:
df['id'] = df['id'].str.split(',')
df['nf'] = df.apply(lambda x: set(range(1, x['sum'] + 1)) - set(x['id']), axis=1)
print(df)
I want this output:
id sum nf
0 [1, 2, 3, 4] 6 {5, 6}
1 [1, 2, 3, 6] 7 {4, 5, 7}
but it outputs:
id sum nf
0 [1, 2, 3, 4] 6 {1, 2, 3, 4, 5, 6}
1 [1, 2, 3, 6] 7 {1, 2, 3, 4, 5, 6, 7}
I think the numbers in the list are actually str, but I don't know how to easily convert them with pandas.
Use map to convert the values to integers:
df['nf'] = df.apply(lambda x: set(range(1, x['sum'] + 1)) - set(map(int, x['id'])), axis=1)
print(df)
id sum nf
0 [1, 2, 3, 4] 6 {5, 6}
1 [1, 2, 3, 6] 7 {4, 5, 7}
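Alternatively, here is a sketch on the same data that converts the ids to integers once, right after the split, so every later step can work with numbers directly:

import pandas as pd

df = pd.DataFrame({"id": ['1,2,3,4', '1,2,3,6'], "sum": [6, 7]})
# split and convert to int in one pass
df['id'] = df['id'].str.split(',').apply(lambda ids: [int(i) for i in ids])
df['nf'] = df.apply(lambda x: set(range(1, x['sum'] + 1)) - set(x['id']), axis=1)
print(df)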
I have this series called hours_by_analysis_date, where the index is datetimes and the values are lists of ints. For example:
Index      | Value
-----------+----------------
01-01-2000 | [1, 2, 3, 4, 5]
01-02-2000 | [2, 3, 4, 5, 6]
01-03-2000 | [1, 2, 3, 4, 5]
I want to return all the indices where the value is [1, 2, 3, 4, 5], so it should return 01-01-2000 and 01-03-2000
I tried hours_by_analysis_date.where(hours_by_analysis_date == [1, 2, 3, 4, 5]), but it gives me the error:
{ValueError} lengths must match to compare
pandas is confused between comparing two array-like objects elementwise and testing each element for equality against the list.
You can use apply:
hours_by_analysis_date.apply(lambda elem: elem == [1,2,3,4,5])
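This returns a boolean mask; to get the matching dates themselves, index the series with it. A minimal sketch, reusing the series name from the question:

mask = hours_by_analysis_date.apply(lambda elem: elem == [1, 2, 3, 4, 5])
print(hours_by_analysis_date[mask].index)  # 01-01-2000 and 01-03-2000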
I have a Postgres 9.6 table with a JSONB column:
> SELECT id, data FROM my_table ORDER BY id LIMIT 4;
id | data
----+---------------------------------------
1 | {"a": [1, 7], "b": null, "c": [8]}
2 | {"a": [2, 9], "b": [1], "c": null}
3 | {"a": [8, 9], "b": null, "c": [3, 4]}
4 | {}
As you can see, some JSON keys have null values.
I'd like to exclude these - is there an easy way to SELECT only the non-null key-value pairs to produce:
id | data
----+---------------------------------------
1 | {"a": [1, 7], "c": [8]}
2 | {"a": [2, 9], "b": [1]}
3 | {"a": [8, 9], "c": [3, 4]}
4 | {}
Thanks!
You can use jsonb_strip_nulls()
select id, jsonb_strip_nulls(data) as data
from my_table;
Online example: http://rextester.com/GGJRW83576
Note that this function would not remove null values inside the arrays.
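For intuition, this is roughly what jsonb_strip_nulls does at the top level of each object, sketched in Python (purely illustrative, not part of the Postgres solution):

import json

row = json.loads('{"a": [1, 7], "b": null, "c": [8]}')
stripped = {k: v for k, v in row.items() if v is not None}
print(json.dumps(stripped))  # {"a": [1, 7], "c": [8]}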
Consider a numpy array A of shape (7, 6):
A = array([[ 0,   1,  2,  3,  5,  8],
           [ 4, 100,  6,  7,  8,  7],
           [ 8,   9, 10, 11,  5,  4],
           [12,  13, 14, 15,  1,  2],
           [ 1,   3,  5,  6,  4,  8],
           [12,  23, 12, 24,  4,  3],
           [ 1,   3,  5,  7, 89,  0]])
together with a second numpy array r of the same shape, which contains the radius from the central point at index (3, 2), where r = 0:
r = array([[3, 3, 3, 3, 3, 4],
           [2, 2, 2, 2, 2, 3],
           [2, 1, 1, 1, 2, 3],
           [2, 1, 0, 1, 2, 3],
           [2, 1, 1, 1, 2, 3],
           [2, 2, 2, 2, 2, 3],
           [3, 3, 3, 3, 3, 4]])
I would like to pick up all the elements of A which are located at radius 1 of r, i.e. [9,10,11,15,4,6,5,13], all the elements of A located at radius 2 of r, and so on. Is there some numpy function to do that?
Thank you
You can select a section of A by doing something like A[r == 1]; to get all the sections as a list you could do [A[r == i] for i in range(r.max() + 1)]. This will work, but may be inefficient depending on how large the values in r get, because you need to compute r == i for every i.
You could also use this trick, first sort A based on r, then simply split the sorted A array at the right places. That looks something like this:
import numpy as np

r_flat = r.ravel()
order = r_flat.argsort()  # permutation that sorts positions by radius
A_sorted = A.ravel()[order]  # values of A grouped by radius
r_sorted = r_flat[order]
# index just past the last element of each radius 0..max
edges = r_sorted.searchsorted(np.arange(r_sorted[-1] + 1), 'right')
sections = []
start = 0
for end in edges:
    sections.append(A_sorted[start:end])
    start = end
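A quick sanity check, assuming A and r are the arrays from the question; the order within a section may differ from the boolean-mask result because argsort is not stable by default:

for i, sec in enumerate(sections):
    assert set(sec) == set(A[r == i])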
I get a different answer from the one you were expecting (3, not 4, in the row [1, 3, 5, 6, 4, 8]) and the order is slightly different (strictly row then column), but:
>>> A
array([[ 0, 1, 2, 3, 5, 8],
[ 4, 100, 6, 7, 8, 7],
[ 8, 9, 10, 11, 5, 4],
[ 12, 13, 14, 15, 1, 2],
[ 1, 3, 5, 6, 4, 8],
[ 12, 23, 12, 24, 4, 3],
[ 1, 3, 5, 7, 89, 0]])
>>> r
array([[3, 3, 3, 3, 3, 4],
[2, 2, 2, 2, 2, 3],
[2, 1, 1, 1, 2, 3],
[2, 1, 0, 1, 2, 3],
[2, 1, 1, 1, 2, 3],
[2, 2, 2, 2, 2, 3],
[3, 3, 3, 3, 3, 4]])
>>> A[r==1]
array([ 9, 10, 11, 13, 15, 3, 5, 6])
Alternatively, you can get column then row ordering by transposing both arrays:
>>> A.T[r.T==1]
array([ 9, 13, 3, 10, 5, 11, 15, 6])