Pentaho Json Input missing data - pentaho

Loading JSON from a URL does not provide the same results as reading the same JSON from a file.
When reading the JSON from a URL the result is only 1-10 rows, but when reading the file it outputs all the rows.
This exact setup works perfectly when I load exactly the same JSON from a .js file.

Related

Fix Unicode Decode Error Without Specifying Encoding='UTF-8'

I am getting the following error:
'ascii' codec can't decode byte 0xf4 in position 560: ordinal not in range(128)
I find this very weird given that my .csv file doesn't have special characters. Perhaps it has special characters that specify header rows and what not, idk.
But the main problem is that I don't actually have access to the source code that reads in the file, so I cannot simply add the keyword argument encoding='UTF-8'. I need to figure out which encoding is compatible with codecs.ascii_decode(...). I DO have access to the .csv file that I'm trying to read, and I can adjust the encoding to that, but not the source file that reads it.
I have already tried exporting my .csv file into Western (ASCII) and Unicode (UTF-8) formats, but neither of those worked.
Fixed. Had nothing to do with unicode shenanigans, my script was writing a parquet file when my Cloud Formation Template was expecting a csv file. Thanks for the help.
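Given that resolution, a quick look at the raw bytes would have pointed at the real cause sooner. The following is a minimal sketch, not the code from the original script: the function names are hypothetical, and the only file-format fact it relies on is that Parquet files begin with the four magic bytes PAR1.

```python
PARQUET_MAGIC = b"PAR1"  # Parquet files start (and end) with these four bytes


def describe_file_bytes(raw: bytes) -> dict:
    """Report basic facts about raw bytes that help explain a decode error."""
    # Offset of the first byte that plain ASCII cannot represent, or None.
    first_non_ascii = next((i for i, b in enumerate(raw) if b > 0x7F), None)
    return {
        "looks_like_parquet": raw[:4] == PARQUET_MAGIC,
        "first_non_ascii_offset": first_non_ascii,
    }


def describe_file(path: str) -> dict:
    """Hypothetical convenience wrapper: read a file in binary mode and inspect it."""
    with open(path, "rb") as handle:
        return describe_file_bytes(handle.read())
```

If looks_like_parquet is true for a file that is supposed to be a .csv, re-exporting it in a different text encoding will not help, because the file is not text at all.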

Big Query do not accept EMOJI

I have emojis in this format: \U0001f924. Why does BigQuery (Google Data Studio) not display them, even though I have seen examples of this format working for other people?
SAMPLE: - Second Emoji in this format \u2614
Ref: Emoji crashed when uploading to Big Query
Based on this article it should work: Google \Uhhhhhhhh Format
UPDATE 1.0:
If I use "" then emojis in the format \U2714 display as emoji, but \U0001f680 is still shown as the text U0001f680.
If I use '' then both \U2714 and \U0001f680 display only as the values U2714 and U0001f680.
The emoji on the question works for me with SELECT "\U0001f680":
I stored the results in a table so you can find it:
https://bigquery.cloud.google.com/table/fh-bigquery:public_dump.one_emoji?tab=preview
If you ask BigQuery to export this table to a GCS file, and bring this file into your computer, it will continue to work:
You can download this json file and load it back into BigQuery:
https://drive.google.com/file/d/1hXASPN0J4bP0PVk20x7x6HfkAFDAD4vq/view?usp=sharing
Let's load it into BigQuery:
Everything works fine:
So the problem is in the files you are loading into BigQuery, which are not encoding emojis appropriately.
What I don't know is how you are generating these files, nor how to fix that process. But here I have proven that for files that correctly encode emojis - you can load them into BigQuery and emojis will be preserved.
🙃
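One possible way to repair such files before loading them, assuming they contain the literal text \U0001f680 or \u2614 rather than the characters themselves, is to convert the escapes to real characters and write UTF-8 JSON lines. This is only a sketch: the helper names are made up for illustration, and how the original files were generated is unknown.

```python
import json
import re

# Matches a literal backslash-U with 8 hex digits or backslash-u with 4 hex digits.
_ESCAPE = re.compile(r"\\U([0-9a-fA-F]{8})|\\u([0-9a-fA-F]{4})")


def unescape_literals(text: str) -> str:
    """Replace literal \\Uhhhhhhhh and \\uhhhh sequences with the characters they name."""
    def to_char(match):
        hex_digits = match.group(1) or match.group(2)
        return chr(int(hex_digits, 16))
    # Note: surrogate pairs written as two \\u escapes are not handled here.
    return _ESCAPE.sub(to_char, text)


def to_json_line(record: dict) -> bytes:
    """Serialize one record as a UTF-8 JSON line, keeping emojis as real characters."""
    return (json.dumps(record, ensure_ascii=False) + "\n").encode("utf-8")
```

In this sketch a short form such as \U2714 is left untouched, because the uppercase pattern requires exactly eight hex digits.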

Converting a massive JSON file to CSV

I have a JSON file which is 48MB (collection of tweets I data mined). I need to convert the JSON file to CSV so I can import it into a SQL database and cleanse it.
I've tried every JSON to CSV converter but they all come back with the same result of "file exceeds limits" / the file is too large. Is there a good method of converting such a massive JSON file to CSV within a short period of time?
Thank you!
A 48mb json file is pretty small. You should be able to load the data into memory using something like this
import json
with open('data.json') as data_file:
    data = json.load(data_file)
Depending on how you wrote the JSON file, data may be a list containing many dictionaries. Try running:
type(data)
If the type is a list, then iterate over each element and inspect it. For instance:
for row in data:
    print(type(row))
    # print(row.keys())
If row is a dict instance, then inspect the keys and within the loop, start building up what each row of the CSV should contain, then you can either use pandas, the csv module or just open a file and write line by line with commas yourself.
So maybe something like:
import json
with open('data.json') as data_file:
    data = json.load(data_file)
with open('some_file.txt', 'w') as f:
    for row in data:
        user = row['username']
        text = row['tweet_text']
        created = row['timestamp']
        joined = ",".join([user, text, created])
        f.write(joined + "\n")  # one line per tweet
You may still run into issues with unicode characters, commas within your data, etc...but this is a general guide.
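For the commas and unicode issues mentioned above, the standard csv module handles quoting for you. Here is a minimal sketch, assuming the file holds a list of dicts and reusing the same illustrative keys (username, tweet_text, timestamp); the function name and the default field list are assumptions, so adjust them to whatever your tweets actually contain.

```python
import csv
import json


def json_to_csv(json_path, csv_path, fields=("username", "tweet_text", "timestamp")):
    """Convert a JSON list of dicts to CSV and return the number of rows written."""
    with open(json_path, encoding="utf-8") as src:
        rows = json.load(src)
    # newline="" lets the csv module control line endings, as its docs recommend.
    with open(csv_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fields)
        writer.writeheader()
        for row in rows:
            # Missing keys become empty cells instead of raising KeyError.
            writer.writerow({name: row.get(name, "") for name in fields})
    return len(rows)
```

A tweet such as "hi, there" is written as a quoted field, so the comma inside it does not split the column.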

Cannot upload CSV that starts with an integer

I'm stuck with what seems like a weird BigQuery bug : I cannot upload a CSV file that starts (first line, first column) by an integer.
Here's my schema : COL1:INTEGER,COL2:INTEGER,COL3:STRING
Here's my csv file content :
100,4,XXX
100,4,XXX
If I put the STRING column as first column, the upload is OK.
If I add a header and tell BigQuery to skip it during the import, the upload is ok too.
But with the CSV and schema above, BigQuery always complains : Line:1 / Field:1, Value cannot be converted to expected type.
Anyone knows what the problem is ?
Thank you in advance,
David
I could not reproduce this problem--I copied and pasted the content into a file and uploaded it with no problems.
Perhaps the uploaded file is corrupted somehow? If there are extra bytes at the beginning of the file, they would be ignored in a header row but might cause this error if the first value of the first field is expected to be an integer. I'd recommend examining the actual binary data in the file to make sure there's nothing funny going on.
Also, are you doing this import via web UI, command-line tool, or API? Have you tried one of the other methods?
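One common source of extra leading bytes is a byte order mark (BOM) added by an editor or spreadsheet export. Whether that is what happened here is only a guess, but it is cheap to check. The sketch below uses the BOM constants from Python's codecs module; the function name is hypothetical.

```python
import codecs


def leading_bom(raw: bytes):
    """Return the name of a byte order mark at the start of raw, or None."""
    # Check UTF-32 before UTF-16: the UTF-32 LE mark begins with the UTF-16 LE mark.
    candidates = [
        ("utf-32-le", codecs.BOM_UTF32_LE),
        ("utf-32-be", codecs.BOM_UTF32_BE),
        ("utf-8", codecs.BOM_UTF8),
        ("utf-16-le", codecs.BOM_UTF16_LE),
        ("utf-16-be", codecs.BOM_UTF16_BE),
    ]
    for name, mark in candidates:
        if raw.startswith(mark):
            return name
    return None
```

To use it, read the first few bytes of the CSV in binary mode, for example open(path, "rb").read(4), and pass them in. A result other than None means the first field is not the plain digits 100 at the byte level.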

Problem saving uploaded files in Python3

I have a problem with data that is uploaded to my web application via the POST method.
If the file is plain text there is no problem, but the trouble comes when it is a binary file, such as a picture. The problem appears when the system writes the data into the file.
The data is not encoded correctly. Here is all the code, from the part that takes the data from environ['wsgi.input'] to the part that saves the file:
# Here the data from the environ['wsgi.input'],
# first i convert the byte into a string delete the first
# field that represent the b and after i strip the single quotes
tmpData = str(rawData)[1:].strip("' '")
dat = tmpData.split('\\r')#Then i split all the data in the '\\r'
s = open('/home/hidura/test.png', 'w')#I open the test.png file.
for cont in range(5,150):#Now beging in the 5th position to the 150th position
    s.write(dat[cont])#Insert the piece of the data in the file.
s.close()#Then closed.
Where is the mistake?
Thank you in advance.
Why do you convert the binary data to a string? A png file is binary data. Just write the binary data to the file. You need to open the file in binary mode as well.
with open('/home/hidura/test.png', 'wb') as s:
    s.write(data)
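Put together, a minimal sketch of the bytes-only path might look like the following. It assumes the request body is the raw file content; a multipart/form-data body also contains boundaries and part headers that would need to be parsed out first, which this sketch does not attempt. Both function names are hypothetical.

```python
def read_request_body(environ) -> bytes:
    """Read the raw POST body from a WSGI environ as bytes."""
    # CONTENT_LENGTH may be missing or empty, so fall back to 0.
    length = int(environ.get("CONTENT_LENGTH") or 0)
    return environ["wsgi.input"].read(length)


def save_upload(path, data: bytes) -> int:
    """Write bytes unchanged in binary mode and return the number of bytes written."""
    with open(path, "wb") as out:
        return out.write(data)
```

The key point is that the data is never passed through str(): converting bytes to their printed representation turns each non-printable byte into escape text, which is why the saved picture was unreadable.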