Storing a file with changed field names in Hadoop using Pig STORE

I load a relation using
data = load 'path' using JsonLoader('class: chararray, marks: int');
datagrouped = group data by class;
total_marks = foreach datagrouped generate group as class, SUM(data.marks) as Total_Score;
Now I get the relation
total_marks =
A, 2130
B, 1890
C, 1640
Now I store the relation using:
Store total_marks into 'path' using JsonStorage();
My data gets stored as
{"class": "A", "Total_Score":2130}
{"class": "B", "Total_Score":1890}
{"class": "C", "Total_Score":1640}
This, in my case, is not the output I require. I want the output to be:
{"group": "A", "Total_Score":2130}
{"group": "B", "Total_Score":1890}
{"group": "C", "Total_Score":1640}
How can I achieve this?
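One way to get that output (a sketch, assuming the same load as above) is simply not to alias the group key, so the field keeps its default name group, which JsonStorage then writes out:
datagrouped = group data by class;
-- keep the default field name "group" instead of renaming it to "class"
total_marks = foreach datagrouped generate group, SUM(data.marks) as Total_Score;
store total_marks into 'path' using JsonStorage();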

Related

How to flatten a JSON in Snowflake?

I have a table "table_1" with one column called "Value", and it only has one entry. The entry in the column is a JSON that looks like:
{
"c1": "A",
"c10": "B",
"c100": "C",
"c101": "D",
"c102": "E",
"c103": "F",
"c104": "G",
.......
}
I would like to just separate this JSON into two columns, where one column contains the keys (c1, c10, etc.) and the second column contains the associated values for that key (A, B, etc.). Is there a way I can do this? There are about 125 keys in my JSON.
It is possible to achieve this using the FLATTEN function:
CREATE OR REPLACE TABLE tab
AS
SELECT PARSE_JSON('{
"c1": "A",
"c10": "B",
"c100": "C",
"c101": "D",
"c102": "E",
"c103": "F",
"c104": "G",
}') AS col;
SELECT KEY, VALUE::TEXT AS value
FROM tab
,TABLE(FLATTEN (INPUT => tab.COL));
Output:
KEY    VALUE
c1     A
c10    B
c100   C
c101   D
c102   E
c103   F
c104   G

Issues using mergeDynamicFrame on AWS Glue

I need to do a merge between two dynamic frames on Glue.
I tried to use the mergeDynamicFrame function, but I keep getting the same error:
AnalysisException: "cannot resolve 'id' given input columns: [];;\n'Project ['id]\n+- LogicalRDD false\n"
Right now, I have 2 DFs:
df_1(id, col1, salary_src) and df_2(id, name, salary)
I want to merge df_2 into df_1 by the "id" column.
df_1 = glueContext.create_dynamic_frame.from_catalog(......)
df_2 = glueContext.create_dynamic_frame.from_catalog(....)
merged_frame = df_1.mergeDynamicFrame(df_2, ["id"])
applymapping1 = ApplyMapping.apply(frame = merged_frame, mappings = [("id", "long", "id", "long"), ("col1", "string", "name", "string"), ("salary_src", "long", "salary", "long")], transformation_ctx = "applymapping1")
datasink2 = glueContext.write_dynamic_frame.from_options(....)
As a test I tried to pass a column from both DFs (salary and salary_src), and the error was:
AnalysisException: "cannot resolve 'salary_src' given input columns: [id, name, salary];;\n'Project [salary#2, 'salary_src]\n+- LogicalRDD [id#0, name#1, salary#2], false\n"
In this case, it seems to recognize the columns from df_2 (id, name, salary), but if I pass just one of the columns, or even all three, it keeps failing.
It doesn't appear to be a mergeDynamicFrame issue.
Based on the information you provided, it looks like df_1, df_2, or both are not reading data correctly and are returning an empty DynamicFrame, which is why you have an empty list of input columns ("input columns: []").
If you are reading from S3, you must crawl your data before you can use glueContext.create_dynamic_frame.from_catalog.
You can also include df_1.show() or df_1.printSchema() after you create your DynamicFrame as a troubleshooting step to make sure you are reading your data correctly before merging, as in the sketch below.
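A minimal sketch (the catalog database and table names here are placeholders, and glueContext is assumed to be set up as in the question):
# Hypothetical catalog names -- replace with your own database and tables
df_1 = glueContext.create_dynamic_frame.from_catalog(database="my_db", table_name="table_1")
df_2 = glueContext.create_dynamic_frame.from_catalog(database="my_db", table_name="table_2")
# Confirm both frames actually contain rows and the expected columns before merging
print(df_1.count(), df_2.count())
df_1.printSchema()
df_2.printSchema()
merged_frame = df_1.mergeDynamicFrame(df_2, ["id"])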

How can I write a query to insert array values from a python dictionary in BigQuery?

I have a python dictionary that looks like this:
{
'id': 123,
'categories': [
{'category': 'fruit', 'values': ['apple', 'banana']},
{'category': 'animal', 'values': ['cat']},
{'category': 'plant', 'values': []}
]
}
I am trying to insert those values into a table in BigQuery via the API using Python; I just need to format the above into an "INSERT table VALUES" query. The table needs to have the fields id, categories.category, and categories.values.
I need categories to basically be an array with the category and each category's corresponding values. The table is supposed to look sort of like this in the end - except I need it to be just one row per id, with the corresponding category fields nested and having the proper field name:
SELECT 123 as id, (["fruit"], ["apple", "banana"]) as category
UNION ALL (SELECT 123 as id, (["animal"], ["cat"]) as category)
UNION ALL (SELECT 123 as id, (["plant"], ["tree", "bush", "rose"]) as category)
I'm not really sure how to format the "INSERT" query to get the desired result, can anyone help?
If you want to load a dictionary to BigQuery using Python, you first have to prepare your data. I chose to convert the Python dictionary to a .json file and then load it into BigQuery using the Python API. However, according to the documentation, BigQuery has some limitations regarding loading nested .json data, among them:
Your .json must be newline delimited, which means that each object must be on its own line in the file.
BigQuery does not support maps or dictionaries in JSON. Thus, you have to wrap your whole data in [], as you can see here.
For this reason, some modifications should be made to the file so you can load the created .json file into BigQuery. I have created two scripts: the first converts the Python dict to a JSON file, and the second formats the JSON file as newline-delimited JSON and loads it into BigQuery.
First, convert the Python dict to a .json file. Notice that you have to wrap the whole data in []:
import json

py_dict = [{
    'id': 123,
    'categories': [
        {'category': 'fruit', 'values': ['apple', 'banana']},
        {'category': 'animal', 'values': ['cat']},
        {'category': 'plant', 'values': []}
    ]
}]

# Write the wrapped dict out as a regular .json file
with open('json_data.json', 'w+') as out_file:
    json.dump(py_dict, out_file)
Second, convert the JSON to newline-delimited JSON and load it into BigQuery:
import json
from google.cloud import bigquery

# Read the .json file back in and re-serialize each record on its own line
with open("json_data.json", "r") as read_file:
    data = json.load(read_file)

result = [json.dumps(record) for record in data]

with open('nd-processed.json', 'w') as obj:
    for i in result:
        obj.write(i + '\n')

client = bigquery.Client()
dataset_id = 'sample'
table_id = 'json_mytable'
dataset_ref = client.dataset(dataset_id)
table_ref = dataset_ref.table(table_id)
job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
job_config.autodetect = True

with open("nd-processed.json", "rb") as source_file:
    job = client.load_table_from_file(source_file, table_ref, job_config=job_config)

job.result()  # Waits for the table load to complete.
print("Loaded {} rows into {}:{}.".format(job.output_rows, dataset_id, table_id))
Then, in the BigQuery UI, you can query your table as follows:
SELECT id, categories
FROM `test-proj-261014.sample.json_mytable4` , unnest(categories) as categories
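If you prefer to run that query from Python rather than the UI, a minimal sketch reusing the client object created above (the table name is the one from this example and may differ in your project):
query = """
SELECT id, categories
FROM `test-proj-261014.sample.json_mytable4`, UNNEST(categories) AS categories
"""
for row in client.query(query).result():  # client.query() returns a QueryJob
    print(row.id, row.categories)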
And the output is a single row with id 123 and the nested categories.
You can use the below query, with your dictionary text embedded into it:
#standardSQL
WITH data AS (
SELECT '''
{
'id': 123,
'categories': [
{'category': 'fruit', 'values': ['apple', 'banana']},
{'category': 'animal', 'values': ['cat']},
{'category': 'plant', 'values': ['tree', 'bush', 'rose']}
]
}
''' dict
)
SELECT
  JSON_EXTRACT_SCALAR(dict, '$.id') AS id,
  ARRAY(
    SELECT AS STRUCT
      JSON_EXTRACT_SCALAR(cat, '$.category') AS category,
      ARRAY(
        SELECT TRIM(val, '"')
        FROM UNNEST(JSON_EXTRACT_ARRAY(cat, '$.values')) val
      ) AS `values`
    FROM UNNEST(JSON_EXTRACT_ARRAY(dict, '$.categories')) cat
  ) AS categories
FROM data
which produces the below result:
Row  id   categories.category  categories.values
1    123  fruit                apple
                               banana
          animal               cat
          plant                tree
                               bush
                               rose
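If you need an actual INSERT statement rather than a SELECT (as the original question asked), the same query can be wrapped in an INSERT INTO ... SELECT. The dataset and table names below are placeholders, and the target table is assumed to have id as INT64 and categories as a repeated STRUCT<category STRING, values ARRAY<STRING>>:
INSERT INTO `your_dataset.your_table` (id, categories)
WITH data AS (
  SELECT '''{ ...your dictionary text... }''' dict
)
SELECT
  CAST(JSON_EXTRACT_SCALAR(dict, '$.id') AS INT64) AS id,
  ARRAY(
    SELECT AS STRUCT
      JSON_EXTRACT_SCALAR(cat, '$.category') AS category,
      ARRAY(
        SELECT TRIM(val, '"')
        FROM UNNEST(JSON_EXTRACT_ARRAY(cat, '$.values')) val
      ) AS `values`
    FROM UNNEST(JSON_EXTRACT_ARRAY(dict, '$.categories')) cat
  ) AS categories
FROM data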

Renaming columns in dataframe according to reference key

Let's assume I have a simple data frame:
df <- data.frame("one"= c(1:5), "two" = c(6:10), "three" =c(7:11))
I would like to rename my column names so that they match the reference.
Let my reference be the following:
df2 <- data.frame("Name" = c("A", "B", "C"), "Oldname" = c("one", "two", "three"))
How could I replace the column names in df with those from df2, if they match what's there (so that the column names in df are A, B, C)?
In my original data df2 is way bigger and I have multiple data sets such as df, so for a solution to work, the code should be as generic as possible. Thanks in advance!
We can use the match function here to map the new names onto the old ones:
names(df) <- df2$Name[match(names(df), df2$Oldname)]
names(df)
[1] "A" "B" "C"
Demo
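Note that match returns NA for any column of df that does not appear in df2$Oldname, which would set that column's name to NA. A small guarded variant (a sketch, assuming you want unmatched columns to keep their original names):
idx <- match(names(df), df2$Oldname)
# keep the original name wherever there is no match in df2$Oldname
names(df) <- ifelse(is.na(idx), names(df), as.character(df2$Name[idx]))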

Select column of dataframe with name matching a string in Julia?

I have a large DataFrame that I import from a spreadsheet. I have the names of several columns that I care about in an array of strings. How do I select a column of the DataFrame whose name matches the contents of a string? I would have thought that something like this would work:
using DataFrames
df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"], C = 2:5)
colsICareAbout = [":B" ":C"]
df[:A] #This works
df[colsICareAbout[1]] #This doesn't work
Is there a way to do this?
Strings are different from symbols, but they're easy to convert.
colsICareAbout = ["B","C"]
df[symbol(colsICareAbout[1])]
Mind you, it might be better to make the entries in colsICareAbout symbols to begin with, but I don't know where your data is coming from.
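For reference, in current Julia versions symbol() has been renamed to Symbol(), and recent DataFrames releases also accept strings directly as column indices, so an updated sketch (assuming DataFrames 0.21 or later) would be:
colsICareAbout = ["B", "C"]
df[!, Symbol(colsICareAbout[1])]   # convert the string to a Symbol
df[!, colsICareAbout[1]]           # or index with the string directly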