Wikimedia pageview compression not working - bzip2

I am trying to analyze monthly wikimedia pageview statistics. Their daily dumps are OK but monthly reports like the one from June 2021 (https://dumps.wikimedia.org/other/pageview_complete/monthly/2021/2021-06/pageviews-202106-user.bz2) seem broken:
[radim@sandbox2 pageviews]$ bzip2 -t pageviews-202106-user.bz2
bzip2: pageviews-202106-user.bz2: bad magic number (file not created by bzip2)
You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.
[radim@sandbox2 pageviews]$ file pageviews-202106-user.bz2
pageviews-202106-user.bz2: Par archive data
Any idea how to extract the data? What encoding is used here? Could it be a Parquet file from their Hive analytics cluster?

These files are not bzip2 archives. They are Parquet files. Parquet-tools can be used to inspect them.
$ java -cp 'target/*:target/dependency/*' org.apache.parquet.cli.Main schema /tmp/pageviews-202106-user.bz2 2>/dev/null
{
  "type" : "record",
  "name" : "hive_schema",
  "fields" : [ {
    "name" : "line",
    "type" : [ "null", "string" ],
    "default" : null
  } ]
}
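To actually extract the data, you can read the file with any Parquet reader and write the single "line" column back out as plain text. A minimal sketch, assuming pyarrow is installed and that the whole column fits in memory (the monthly files are large, so you may want to process them in batches):

# Read the misnamed .bz2 file as Parquet and dump the "line" column as plain text.
import pyarrow.parquet as pq

table = pq.read_table("pageviews-202106-user.bz2")  # it is really a Parquet file
with open("pageviews-202106-user.txt", "w") as out:
    for value in table.column("line"):
        line = value.as_py()
        if line is not None:  # the schema allows nulls
            out.write(line + "\n")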


Text processing to fetch the attributes

Find below the input data:
[{"acc_id": 166211981, "archived": true, "access_key": "ALLLJNXXXXXXXPU4C7GA", "secret_key": "X12J6SixMaFHoXXXXZW707XXX24OXXX", "created": "2018-10-03T05:56:01.208069Z", "description": "Data Testing", "id": 11722990697, "key_field": "Ae_Appl_Number", "last_modified": "2018-10-03T08:44:20.324237Z", "list_type": "js_variables", "name": "TEST_AE_LI_KEYS_003", "project_id": 1045199007354, "s3_path": "opti-port/dcp/ue.1045199007354/11722990697"}, {"acc_id": 166211981, "archived": false, "access_key": "ALLLJNXXXXXXXPU4C7GA", "secret_key": "X12J6SixMaFHoXXXXZW707XXX24OXXX", "created": "2018-10-03T08:46:32.535653Z", "description": "Data Testing", "id": 11724290732, "key_field": "Ae_Appl_Number", "last_modified": "2018-10-03T10:11:13.167798Z", "list_type": "js_variables", "name": "TEST_AE_LI_KEYS_001", "project_id": 1045199007354, "s3_path": "opti-port/dcp/ue.1045199007354/11724290732"}]
I want the output file to contain the data below:
11722990697,TEST_AE_LI_KEYS_003,opti-port/dcp/ue.1045199007354/11722990697
11724290732,TEST_AE_LI_KEYS_001,opti-port/dcp/ue.1045199007354/11724290732
I am able to achieve this by taking one record at a time and processing it with awk, but I am also getting the field names.
Find my attempt below:
R=$(cat in.txt | awk -F '},' '{print $1}')
echo $R | awk -F , '{print $7 " " $11 " " $13}'
I want this done for the entire file, without the field names.
awk/sed is not the right tool for parsing JSON files. Use jq:
[root@localhost]# jq -r '.[] | "\(.acc_id),\(.name),\(.s3_path)"' abc.json
166211981,TEST_AE_LI_KEYS_003,opti-port/dcp/ue.1045199007354/11722990697
166211981,TEST_AE_LI_KEYS_001,opti-port/dcp/ue.1045199007354/11724290732
If you don't want to install any other software, you can use Python as well, which is found on most Linux machines.
[root@localhost]# cat parse_json.py
#!/usr/bin/env python
# Import the json module
import json

# Open the json file in read-only mode and load the json data. json.load() returns the data as a Python list of dictionaries
with open('abc.json') as fh:
    data = json.load(fh)

# To print the whole structure
# print(data)

# To print the name key from the first and second record
# print(data[0]["name"])
# print(data[1]["name"])

# Now to get both records, use a for loop
for i in range(0, 2):
    print("%s,%s,%s" % (data[i]["access_key"], data[i]["name"], data[i]["s3_path"]))
[root@localhost]# ./parse_json.py
ALLLJNXXXXXXXPU4C7GA,TEST_AE_LI_KEYS_003,opti-port/dcp/ue.1045199007354/11722990697
ALLLJNXXXXXXXPU4C7GA,TEST_AE_LI_KEYS_001,opti-port/dcp/ue.1045199007354/11724290732
Assuming the input data is in a file called input.json, you can use a Python script to fetch the attributes. Put the following content in a file called fetch_attributes.py:
import json

with open("input.json") as fh:
    data = json.load(fh)

with open("output.json", "w") as of:
    for record in data:
        of.write("%s,%s,%s\n" % (record["id"], record["name"], record["s3_path"]))
Then, run the script as:
python fetch_attributes.py
Code Explanation
import json - Importing Python's json library to parse the JSON.
with open("input.json") as fh: - Opening the input file and getting the file handler in if.
data = json.load(fh) - Loading the JSON input file using load() method from the json library which will populate the data variable with a Python dictionary.
with open("output.json", "w") as of: - Opening the output file in write mode and getting the file handler in of.
for record in data: - Loop over the list of records in the JSON.
of.write("%s,%s,%s\n" % (record["id"],record["name"],record["s3_path"])) - Fetching the required attributes from each record and writing them in the file.

How to generate CREATE TABLE script from an existing table

I created a table with the BigQuery interface. A large table. And I would like to export the schema of this table in Standard SQL (or Legacy SQL) syntax.
Is it possible?
Thanks!
You can get the DDL for a table with this query:
SELECT t.ddl
FROM `your_project.dataset.INFORMATION_SCHEMA.TABLES` t
WHERE t.table_name = 'your_table_name'
;
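If you prefer to run that query from a script rather than the console, here is a minimal sketch using the google-cloud-bigquery Python client (an assumption on my part; the project, dataset and table names are placeholders):

# Run the INFORMATION_SCHEMA query and print the CREATE TABLE statement.
from google.cloud import bigquery

client = bigquery.Client(project="your_project")
query = """
    SELECT t.ddl
    FROM `your_project.dataset.INFORMATION_SCHEMA.TABLES` t
    WHERE t.table_name = 'your_table_name'
"""
for row in client.query(query).result():
    print(row.ddl)  # the DDL comes back as Standard SQL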
As can be read in this question, it is not possible to do so, and there is a feature request to obtain the output schema of a standard SQL query, but it seems it was never implemented. Depending on your use case, apart from using bq, another workaround is to run the query with LIMIT 0. Results are returned immediately (tested with a 100B-row table) with the schema field names and types.
Knowing this, you can also automate the procedure in your favorite scripting language. As an example I used Cloud Shell with CLI and API calls. It makes three successive calls: the first executes the query and obtains a jobId (unnecessary fields are not included in the request URL), then the dataset and table IDs corresponding to that particular job are obtained and, finally, the schema is retrieved.
I used the jq tool to parse the responses (manual), which comes preinstalled in the Shell, and wrapped everything in a shell function:
result_schema()
{
  QUERY=$1
  authToken="$(gcloud auth print-access-token)"
  projectId=$(gcloud config get-value project 2>/dev/null)

  # get the jobId
  jobId=$(curl -H"Authorization: Bearer $authToken" \
    -H"Content-Type: application/json" \
    https://www.googleapis.com/bigquery/v2/projects/$projectId/queries?fields=jobReference%2FjobId \
    -d"$( echo "{
      \"query\": "\""$QUERY" limit 0\"",
      \"useLegacySql\": false
    }")" 2>/dev/null | jq -j .jobReference.jobId)

  # get the destination table
  read -r datasetId tableId <<< $(curl -H"Authorization: Bearer $authToken" \
    "https://www.googleapis.com/bigquery/v2/projects/$projectId/jobs/$jobId?fields=configuration(query(destinationTable(datasetId%2CtableId)))" 2>/dev/null | jq -j '.configuration.query.destinationTable.datasetId, " ", .configuration.query.destinationTable.tableId')

  # get the resulting schema
  curl -H"Authorization: Bearer $authToken" https://www.googleapis.com/bigquery/v2/projects/$projectId/datasets/$datasetId/tables/$tableId?fields=schema 2>/dev/null | jq .schema.fields
}
Then we can invoke the function by querying a 100B-row public dataset (don't specify LIMIT 0, as the function adds it automatically):
result_schema 'SELECT year, month, CAST(wikimedia_project as bytes) AS project_bytes, language AS lang FROM `bigquery-samples.wikipedia_benchmark.Wiki100B` GROUP BY year, month, wikimedia_project, language'
with the following output as the schema (note how the casts and aliases in the SELECT list modify the returned schema):
[
  {
    "name": "year",
    "type": "INTEGER",
    "mode": "NULLABLE"
  },
  {
    "name": "month",
    "type": "INTEGER",
    "mode": "NULLABLE"
  },
  {
    "name": "project_bytes",
    "type": "BYTES",
    "mode": "NULLABLE"
  },
  {
    "name": "lang",
    "type": "STRING",
    "mode": "NULLABLE"
  }
]
This field array can then be copy/pasted (or further automated) in the fields editor when creating a new table using the UI.
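As a sketch of that "further automated" route, the saved field array could also be fed to the google-cloud-bigquery Python client instead of the UI (this is an assumption of mine, not part of the workflow above; names are placeholders):

# Create a new table from the field array saved by result_schema (e.g. to schema.json).
import json
from google.cloud import bigquery

with open("schema.json") as fh:
    fields = json.load(fh)

schema = [bigquery.SchemaField(f["name"], f["type"], mode=f.get("mode", "NULLABLE"))
          for f in fields]

client = bigquery.Client(project="your_project")
client.create_table(bigquery.Table("your_project.your_dataset.new_table", schema=schema))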
I am not sure whether it is possible using Standard SQL or Legacy SQL syntax, but you can get the schema in JSON format using the command line.
From this link, the command to do it would be:
bq show --schema --format=prettyjson [PROJECT_ID]:[DATASET].[TABLE] > [PATH_TO_FILE]

Convert Empty string ("") to Double data type while importing data from JSON file using command line BQ command

What steps will reproduce the problem?
1. I am running the command:
./bq load --source_format=NEWLINE_DELIMITED_JSON --schema=lifeSchema.json dataset_test1.table_test_3 lifeData.json
2. I have attached the data source file and schema file.
3. It throws an error - JSON parsing error in row starting at position 0 at file:
file-00000000. Could not convert value to double. Field:
computed_results_A; Value:
What is the expected output? What do you see instead?
I want the empty string converted to NULL or 0.
What version of the product are you using? On what operating system?
I am using Mac OS X Yosemite.
Source JSON lifeData.json
{"schema":{"vendor":"com.bd.snowplow","name":"in_life","format":"jsonschema","version":"1-0-2"},"data":{"step":0,"info_userId":"53493764","info_campaignCity":"","info_self_currentAge":45,"info_self_gender":"male","info_self_retirementAge":60,"info_self_married":false,"info_self_lifeExpectancy":0,"info_dependantChildren":0,"info_dependantAdults":0,"info_spouse_working":true,"info_spouse_currentAge":33,"info_spouse_retirementAge":60,"info_spouse_monthlyIncome":0,"info_spouse_incomeInflation":5,"info_spouse_lifeExpectancy":0,"info_finances_sumInsured":0,"info_finances_expectedReturns":6,"info_finances_loanAmount":0,"info_finances_liquidateSavings":true,"info_finances_savingsAmount":0,"info_finances_monthlyExpense":0,"info_finances_expenseInflation":6,"info_finances_expenseReduction":10,"info_finances_monthlyIncome":0,"info_finances_incomeInflation":5,"computed_results_A":"","computed_results_B":null,"computed_results_C":null,"computed_results_D":null,"uid_epoch":"53493764_1466504541604","state":"init","campaign_id":"","campaign_link":"","tool_version":"20150701-lfi-v1"},"hierarchy":{"rootId":"94583157-af34-4ecb-8024-b9af7c9e54fa","rootTstamp":"2016-06-21 10:22:24.000","refRoot":"events","refTree":["events","in_life"],"refParent":"events"}}
Schema JSON lifeSchema.json
{
  "name": "computed_results_A",
  "type": "float",
  "mode": "nullable"
}
Try loading the JSON file as a one-column CSV file:
bq load --field_delimiter='|' proj:set.table file.json json:string
Once the file is loaded into BigQuery, you can use JSON_EXTRACT_SCALAR or a JavaScript UDF to parse the JSON with total freedom.
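If preprocessing the file is acceptable, another option (not part of the answer above, just a sketch of mine) is to rewrite the newline-delimited JSON so the empty string becomes null before running bq load; the file and field names are taken from the question:

# Rewrite lifeData.json so that an empty computed_results_A becomes null.
import json

with open("lifeData.json") as src, open("lifeData_clean.json", "w") as dst:
    for line in src:
        record = json.loads(line)
        if record["data"].get("computed_results_A") == "":
            record["data"]["computed_results_A"] = None  # serialized as null
        dst.write(json.dumps(record) + "\n")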

JSON file not loading into redshift

I am having issues using the COPY command in Redshift to load JSON objects. I am receiving a file in a JSON format that fails when attempting to use the COPY command; however, when I adjust the JSON file it works. This is not an ideal solution, as I am not permitted to modify the JSON file.
This works fine:
{
"id": 1,
"name": "Major League Baseball"
}
{
"id": 2,
"name": "National Hockey League"
}
This does not work (notice the extra square brackets):
[
{"id":1,"name":"Major League Baseball"},
{"id":2,"name":"National Hockey League"}
]
This is my jsonpaths file:
{
"jsonpaths": [
"$['id']",
"$['name']"
]
}
The problem with the COPY command is that it does not really accept a valid JSON file. Instead, it expects one JSON object per line, which is shown in the documentation but not obviously mentioned.
Hence, every line is supposed to be valid JSON, but the full file is not. That's why it works when you modify your file.
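If you cannot change what the upstream system sends, a small preprocessing step can do the reshaping before COPY. A minimal sketch of mine (file names are placeholders), converting the array-style file into the JSON-per-line layout that COPY expects:

# Convert a JSON array file into newline-delimited JSON for Redshift COPY.
import json

with open("leagues.json") as src, open("leagues_jsonl.json", "w") as dst:
    for record in json.load(src):              # the input file is one JSON array
        dst.write(json.dumps(record) + "\n")   # one JSON object per line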

Windows scripting to parse a HL7 file

I have a HUGE file with a lot of HL7 segments. It must be split into 1000 (or so) smaller files.
Since it is HL7 data, there is a pattern (logic) to go by: each data chunk starts with "MSH|" and ends when the next segment starts with "MSH|".
The script must be Windows (cmd) based or VBS, as I cannot install any software on that machine.
File structure:
MSH|abc|123|....
s2|sdsd|2323|
...
..
MSH|ns|43|...
...
..
..
MSH|sdfns|4343|...
...
..
asds|sds
MSH|sfns|3|...
...
..
as|ss
The file in the above example must be split into 2 or 3 files. Also, the file comes from UNIX, so newlines must remain as they are in the source file.
Any help?
This is a sample script that I used to parse large HL7 files into separate files, with the new file names based on the data file name. It uses REBOL, which does not require installation, i.e. the core version does not make any registry entries.
I have a more generalised version that scans an incoming directory and splits them into single files and then waits for the next file to arrive.
Rebol [
    file: %split-hl7.r
    author: "Graham Chiu"
    date: 17-Feb-2010
    purpose: {split HL7 messages into single messages}
]

fn: %05112010_0730.dat
outdir: %05112010_0730/

if not exists? outdir [
    make-dir outdir
]

data: read fn
cnt: 0
filename: join copy/part form fn -4 + length? form fn "-"
separator: rejoin [newline "MSH"]

parse/all data [
    some [
        [copy result to separator | copy result to end]
        (
            write to-file rejoin [outdir filename cnt ".txt"] result
            print "Got result"
            ?? result
            cnt: cnt + 1
        )
        1 skip
    ]
]
HL7 has a lot of segments - I assume you know that every message in your file starts with an MSH segment. So, have you tried parsing the file for the string "(newline)MSH|"? Just keep a running buffer and dump it into an output file when it gets too big.
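A rough sketch of that running-buffer idea in Python (an assumption on my part, since the question asks for cmd/VBS; the same logic could be ported to VBScript, and the file names and messages-per-file count are placeholders):

# Split an HL7 file into chunks of messages; a new message starts at each "MSH|" line.
MESSAGES_PER_FILE = 1000          # placeholder; tune to get the desired number of files

messages = []
current = []
with open("input.hl7", newline="") as fh:      # newline="" keeps the UNIX line endings
    for line in fh:
        if line.startswith("MSH|") and current:
            messages.append("".join(current))  # previous message is complete
            current = []
        current.append(line)
    if current:
        messages.append("".join(current))

for i in range(0, len(messages), MESSAGES_PER_FILE):
    with open("output_%04d.hl7" % (i // MESSAGES_PER_FILE), "w", newline="") as out:
        out.writelines(messages[i:i + MESSAGES_PER_FILE])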