I have a pipeline that loads a CSV file from GCS into BigQuery. The details are here: Import CSV file from GCS to BigQuery.
In a ParDo I'm splitting each line of the CSV into a TableRow, and some of the fields are empty.
String inputLine = c.element();
String[] split = inputLine.split(",");
TableRow output = new TableRow();
output.set("Event_Time", split[0]);
output.set("Name", split[1]);
...
c.output(output);
My question is, how can I have the empty fields show up as a null in BigQuery? Currently they are coming through as empty fields.
It's turning up in BigQuery as an empty String because split() puts an empty String, not null, into the array when a field is empty (e.g. ,,).
Two options:
Check for empty String in your result array and don't set the field in output.
Check for empty String in your result array and explicitly set null for the field in output.
Either way will result in null in BigQuery; a sketch of both options is below.
Note: be careful when splitting Strings in Java like this. split() will discard trailing empty strings. Use split(",", -1) instead. See here.
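A minimal sketch of both options inside the same ParDo (the field names come from the question; which fields can be empty is an assumption about your schema):

String[] split = c.element().split(",", -1);  // -1 keeps trailing empty fields

TableRow output = new TableRow();

// Option 1: skip the field entirely when the value is empty
if (!split[0].isEmpty()) {
    output.set("Event_Time", split[0]);
}

// Option 2: explicitly set null when the value is empty
output.set("Name", split[1].isEmpty() ? null : split[1]);

c.output(output);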
BTW: unless you're doing some complex/advanced transformations in Dataflow, you don't have to use a pipeline to load your CSV files. You could just load them, or read them directly from GCS.
I'm trying to use Python to clean up data in a CSV file.
data = ['Code', 'Name',' Size ',' Sector',' Industry ']
I tried the following:
for x in data:
    print(x.strip())
It works in that I can see the data printed in the format I want, but the problem is it doesn't change the data in the CSV file.
If you want to strip whitespace from the strings stored in a list, you can do it with a list comprehension like this:
data = [item.strip() for item in data]
If you'd like to do this over a pandas DataFrame column:
df['col'] = df['col'].str.strip()
Reassign the cleaned entries back to the data variable before saving it back to the CSV:
data = [x.strip() for x in data]
then save data back to the CSV file.
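A minimal sketch of the full round trip with the csv module (the file names here are placeholders), stripping every cell and writing the cleaned rows back out:

import csv

# hypothetical input/output paths
with open("input.csv", newline="") as src:
    rows = [[cell.strip() for cell in row] for row in csv.reader(src)]

with open("cleaned.csv", "w", newline="") as dst:
    csv.writer(dst).writerows(rows)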
I have a pyspark.sql.dataframe.DataFrame. I converted it into a pandas DataFrame and then saved it as a CSV file. When I open the CSV, I see that the columns which have empty values in a field become \"\".
When I go back to the Spark dataframe.toPandas() and check one of these column values, I see an empty string with a blank space:
dfpandas.colX[2] gives this result: ' '.
I used this kind of CSV saving:
df_sparksql.repartition(1).write.format('com.databricks.spark.csv').save("/data/rep//CLT_20200729csv",
    header='true')
I also used this kind of saving method, but it led to an out-of-memory error:
df = df_per_mix.toPandas()
df.to_csv("/data/rep//CLT_20200729.csv",sep=";", index=False)
What is the issue, and how can I remove the blank space that gets converted to \"\"?
I was expecting null_marker to replace the blank STRING fields with null, but it did not work. Any suggestions, please?
I tried using --null_marker="null":
$gcloud_dir/bq load $svc_ac --max_bad_records=10 --replace --source_format=CSV --null_marker="null" --field_delimiter=',' table source
The empty strings did not get replaced with NULL.
Google Cloud Support here!
After reading through the documentation, the description for the --null_marker flag states:
Specifies a string that represents a null value in a CSV file. For example, if you specify "\N", BigQuery interprets "\N" as a null value when loading a CSV file. The default value is the empty string.
Therefore, setting --null_marker="null" will not replace empty strings with NULL; it will only treat the string 'null' as a null value. At this point you should either:
Replace the empty strings before uploading the CSV file (see the sketch after this list).
Once you have uploaded the CSV file, run a query that uses the REPLACE function.
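A minimal sketch of the first option using pandas (the file names are placeholders); it rewrites empty or whitespace-only fields as the literal marker "null", which the --null_marker="null" flag above will then load as NULL:

import pandas as pd

# hypothetical input/output paths
df = pd.read_csv("source.csv", keep_default_na=False)

# turn empty or whitespace-only strings into the literal marker "null"
df = df.replace(r"^\s*$", "null", regex=True)

df.to_csv("source_with_nulls.csv", index=False)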
I have a JSON file which is 48MB (collection of tweets I data mined). I need to convert the JSON file to CSV so I can import it into a SQL database and cleanse it.
I've tried every JSON to CSV converter but they all come back with the same result of "file exceeds limits" / the file is too large. Is there a good method of converting such a massive JSON file to CSV within a short period of time?
Thank you!
A 48 MB JSON file is pretty small. You should be able to load the data into memory using something like this:
import json

with open('data.json') as data_file:
    data = json.load(data_file)
Depending on how you wrote the JSON file, data may be a list containing many dictionaries. Try running:
type(data)
If the type is a list, then iterate over each element and inspect it. For instance:
for row in data:
    print(type(row))
    # print(row.keys())
If row is a dict instance, then inspect the keys and, within the loop, start building up what each row of the CSV should contain; you can then use pandas, the csv module, or just open a file and write line by line with commas yourself.
So maybe something like:
import json

with open('data.json') as data_file:
    data = json.load(data_file)

with open('some_file.txt', 'w') as f:
    for row in data:
        user = row['username']
        text = row['tweet_text']
        created = row['timestamp']
        joined = ",".join([user, text, created])
        f.write(joined + "\n")  # write one line per tweet
You may still run into issues with unicode characters, commas within your data, etc., but this is a general guide.
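If the commas inside your data do become a problem, here is a minimal sketch using the csv module instead (same hypothetical tweet keys as above); the writer quotes any field that contains a comma:

import csv
import json

with open('data.json') as data_file:
    data = json.load(data_file)

with open('tweets.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['username', 'tweet_text', 'timestamp'])  # header row
    for row in data:
        # csv.writer handles quoting of embedded commas and quotes
        writer.writerow([row['username'], row['tweet_text'], row['timestamp']])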
I have to get the filename with each row, so I used:
data = LOAD 'data.csv' using PigStorage(',','-tagFile') AS (filename:chararray);
But in data.csv some columns have a comma (,) in the content as well, so to handle the comma issue I used:
data = LOAD 'data.csv' using org.apache.pig.piggybank.storage.CSVExcelStorage() AS (filename:chararray);
But I didn't find any option to use the -tagFile option with CSVExcelStorage.
Please let me know how I can use CSVExcelStorage and the -tagFile option at once.
Thanks
I found a way to perform both operations (get the file name in each row and handle the delimiter when it appears in column content):
data = LOAD 'data.csv' using PigStorage(',','-tagFile') AS (filename:chararray, record:chararray);
/* replace a comma (,) when it appears inside a quoted column value */
replaceComma = FOREACH data GENERATE filename, REPLACE(record, ',(?!(([^\\"]*\\"){2})*[^\\"]*$)', '') as record;
/* remove the quotes ("") placed around a column that contains a comma, as per the CSV format */
replaceQuotes = FOREACH replaceComma GENERATE filename, REPLACE(record, '"', '') as record;
Once the data is loaded properly without the embedded commas, I am free to perform any operation.
A detailed use case is available on my blog.
You can't use -tagFile with CSVExcelStorage, since CSVExcelStorage does not have a -tagFile option. The workaround is to change the delimiter of the file and use PigStorage with the new delimiter and -tagFile, or to replace the commas in your data.
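For the first workaround, a minimal preprocessing sketch in Python (the file names and the pipe delimiter are assumptions; pick any character that never appears in your data) that rewrites the CSV with a new delimiter so PigStorage('|', '-tagFile') can load it safely:

import csv

# hypothetical paths; the pipe delimiter is an arbitrary choice
with open("data.csv", newline="") as src, open("data_pipe.csv", "w", newline="") as dst:
    writer = csv.writer(dst, delimiter="|")
    for row in csv.reader(src):
        # csv.reader respects the quoting, so embedded commas stay inside their field
        writer.writerow(row)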