As a result, I have a dataframe that I convert to a dict and then write to BQ using Apache Beam. One of the columns is a string that can contain emoji. When I print the result I see the emoji, but in BQ I see ��. How can I write a string with emoji to BQ?
BQ supports UTF-8; emoji are converted to UTF-8 and stored. See the BigQuery documentation.
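If the string reaches the sink as properly decoded text, BigQuery stores and displays the emoji, so the mojibake usually means the value was encoded or decoded with the wrong charset on the way in. A minimal sketch with the Beam Python SDK, assuming a hypothetical project, dataset, table, and schema:

import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | beam.Create([{"id": 1, "comment": u"Great product \U0001F600"}])  # keep the value as a Unicode str (decode raw bytes as UTF-8 first)
     | beam.io.WriteToBigQuery(
           "my-project:my_dataset.my_table",          # hypothetical table spec
           schema="id:INTEGER,comment:STRING",        # hypothetical schema
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))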
I have a "Bal_123.csv" file, and when I search its data on Splunk Web with the query sourcetype="Bal_123.csv" I get the latest indexed raw data in comma-separated format. For further operations I need that data in JSON format.
Is there any way to get the data in JSON format directly? I know I can export the data as JSON, but I am using a REST call to get the data from Splunk and I need the JSON on the Splunk side itself.
Can anyone help me with this?
Splunk will parse JSON, but will not display data in JSON format except, as you've already noted, in an export.
You may be able to play with the format command to get something close to JSON.
A better option might be to wrap your REST call in some Python that converts the results into JSON.
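For the Python route, note that the Splunk REST export endpoint can return JSON directly when you pass output_mode=json, so the "conversion" is mostly just requesting that format. A minimal sketch, where the host, credentials, and search string are placeholders:

import requests

resp = requests.post(
    "https://splunk.example.com:8089/services/search/jobs/export",  # placeholder host
    auth=("admin", "changeme"),                                      # placeholder credentials
    data={
        "search": 'search sourcetype="Bal_123.csv"',
        "output_mode": "json",       # ask Splunk to return each result as JSON
    },
    verify=False,                    # the management port often uses a self-signed cert
    stream=True,
)
for line in resp.iter_lines():
    if line:
        print(line.decode("utf-8"))  # each non-empty line is one JSON result object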
I have a pyspark script which reads MySQL data, including a column with Arabic values, into a data frame and stores the data in Parquet format in AWS S3, but when querying it with AWS Athena it shows some random text and not Arabic. I am doing something wrong; please help me get this sorted.
The text I am getting is Ãâ¦Ã±ÃÆò..., how can this be converted to Arabic?
While reading the data from MySQL using pyspark I am getting the data in the form: 'الشرقية'.
Thanks in advance.
When reading from MySQL we need to pass "?useUnicode=true&characterEncoding=UTF-8" in the URL string, e.g.:
user_df = sqlContext.read.format("jdbc").options(
    url="jdbc:mysql://HOST/DB_NAME?useUnicode=true&characterEncoding=UTF-8",
    driver="com.mysql.jdbc.Driver",
    dbtable="users",
    user="root",
    password="root"
).load()
This resolved my issue.
Is there any information about formatting a SQL table in Redshift into a JSON file? The SQL syntax I tried doesn't work here; it gives a syntax error.
The Amazon Redshift UNLOAD command can only output in fixed-width or delimited (e.g. CSV) format.
You could either convert the output from the UNLOAD command, or use/write a program that calls Redshift via SQL and saves the output in the desired format.
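A minimal sketch of the second option, assuming psycopg2 and placeholder connection details and table name: query Redshift over its PostgreSQL interface and write each row out as one JSON object per line.

import json
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder endpoint
    port=5439, dbname="dev", user="awsuser", password="secret")

with conn, conn.cursor() as cur, open("my_table.json", "w") as out:
    cur.execute("SELECT * FROM my_table")                 # placeholder table
    columns = [desc[0] for desc in cur.description]       # column names from the cursor
    for row in cur:
        out.write(json.dumps(dict(zip(columns, row)), default=str) + "\n")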
My requirement is to pull data from different sources (Facebook, YouTube, DoubleClick Search, etc.) and load it into BigQuery. When I pull the data, some of the sources return "NULL" when a column is empty.
When I load the same data into BigQuery, BigQuery treats it as the string "NULL" instead of a null (empty) value.
Right now I am replacing the "NULL" values with "" (empty string) before loading into BigQuery. Instead of doing this, is there any way to load the file directly without any manipulation (replacing)?
Thanks,
What is the file format of the source file, e.g. CSV, newline-delimited JSON, Avro, etc.?
The reason I ask is that in CSV an empty string is treated as null, while "NULL" is loaded as a string value. So if you don't want to manipulate the data before loading, you should save the files in newline-delimited (NLD) JSON format.
Since you mentioned that you are pulling data from social media platforms, I assume you are using their REST APIs, and as a result it should be possible for you to save that data as NLD JSON instead of CSV.
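A minimal sketch of that approach, with illustrative records and file name: writing one JSON object per line means missing values are serialized as JSON null, which BigQuery loads as NULL with no string replacement.

import json

records = [
    {"campaign": "spring_sale", "clicks": 120, "region": "US"},
    {"campaign": "summer_promo", "clicks": None, "region": None},  # empty columns
]

with open("data.ndjson", "w") as out:
    for rec in records:
        out.write(json.dumps(rec) + "\n")   # None is written as null

The resulting file can then be loaded with bq load --source_format=NEWLINE_DELIMITED_JSON.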
Answer to your question "Is there a way we can load this from the web console?":
Yes. Go to your BigQuery project console https://bigquery.cloud.google.com/ and create a table in a dataset, where you can specify the source file and the table schema details.
From the comment section (for the convenience of other viewers):
Is there any option in the bq command for this?
Try this:
bq load --source_format=CSV --skip_leading_rows=1 --null_marker="NULL" yourProject:yourDataset.yourTable ~/path/to/file/x.csv Col1:string,Col2:string,Col3:integer,Col4:string
You may consider running a command similar to:
bq load --field_delimiter="\t" --null_marker="\N" --quote="" \
    PROJECT:DATASET.tableName gs://bucket/data.csv.gz table_schema.json
More details can be gathered from the replies to the "Best Practice to migrate data from MySQL to BigQuery" question.
Are there any SerDes available to support a Hive table with Unicode characters? We might have files in UTF-8, UTF-16, or UTF-32. In other words, we are looking to support different languages such as Japanese and Chinese in Hive tables; we should be able to load data in different languages into a Hive table.
Hive can only read and write UTF-8 text files; data in other character sets should be converted to UTF-8. The table encoding can be declared with the following syntax:
hive> CREATE TABLE mytable (column_name data_type)
      ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
      WITH SERDEPROPERTIES ("serialization.encoding" = 'FORMAT');
where FORMAT is the character-set name of the files, e.g. 'UTF-8'.
The conversion can be done using iconv, but it only supports files smaller than 16 GB.
Syntax:
iconv -f from-encoding -t to-encoding inputfile > outputfile