I am not sure whether it is possible to write the query results of Apache Jena's tdbquery directly to other file formats, such as columnar ones (e.g., Parquet or ORC).
This is how I use it with CSV; I would like the output in one of those other formats instead:
./tdbquery --loc /location/.. --query $filename --results CSV > file.csv
The problem is that the file I am writing contains a lot of nulls, and as CSV it takes up more space than I even have on the machine.
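One workaround, if direct columnar output is not available, is to post-process the CSV into Parquet. Below is a minimal sketch assuming pandas and pyarrow are installed; the file names and chunk size are placeholders, and all columns are kept as strings so the inferred schema stays stable across chunks.

# Sketch: convert the tdbquery CSV output to Parquet in chunks so the
# whole result never has to sit in memory. File names are placeholders.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

writer = None
schema = None
# dtype=str keeps every column textual, matching SPARQL CSV results and
# keeping the inferred schema identical from chunk to chunk.
for chunk in pd.read_csv("file.csv", dtype=str, chunksize=500_000):
    table = pa.Table.from_pandas(chunk, preserve_index=False)
    if writer is None:
        schema = table.schema
        writer = pq.ParquetWriter("file.parquet", schema, compression="snappy")
    writer.write_table(table.cast(schema))
if writer is not None:
    writer.close()

Parquet's per-column encoding and compression should handle the many nulls far better than CSV does.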
I am used to Parquet files with a single schema. I came across a file that seemingly has more than one schema. I used pandas to convert it to a CSV file, and the result looks something like this:
table-1,table-2,table-3
0, {data for table-1} {data for table-2} {data for table-3}
I have read about the Parquet file format, and it looks like a single Parquet file has a single schema.
Does Parquet support more than one schema in a single file?
No, the Parquet format only supports a single schema per file. This schema is written into the footer of the file and covers all sections of the file. You could probably re-read the CSV file into pandas and save that as a Parquet file, but ultimately you will be better off saving each table as a separate file. The latter should also be much more performant and space-efficient.
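If you do take the CSV round trip, a rough sketch of splitting it back out is below, assuming pandas with a Parquet engine (pyarrow) is installed; the file name and the per-table column names mirror the example above and are only placeholders.

# Sketch: re-read the combined CSV and write one Parquet file per table,
# so every output file carries exactly one schema. Names are placeholders.
import pandas as pd

df = pd.read_csv("combined.csv")
for column in ["table-1", "table-2", "table-3"]:
    df[[column]].to_parquet(f"{column}.parquet", index=False)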
I have been tasked with uploading some data into Solr, where it will be used for analysis.
I understand that Solr can index data in the xlsx file format.
In Exercise 2 for Solr, the following files were indexed, in the order JSON, XML, and CSV:
bin/post -c films example/films/films.json
bin/post -c films example/films/films.xml
bin/post -c films example/films/films.csv -params "f.genre.split=true&f.directed_by.split=true&f.genre.separator=|&f.directed_by.separator=|"
The issue I have is that although I indexed my xlsx file, only one record shows up in the query, which suggests the file may have been indexed wrongly, i.e. it may require parameters like those needed by a CSV file. Can anyone tell me how this indexing can be done without having to convert the xlsx file into a CSV file?
You can use Apache Tika to index these formats in Solr. It will parse the data and build the index.
Reference link:
https://lucidworks.com/2009/09/02/content-extraction-with-tika/
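As an illustration of that suggestion, here is a rough sketch that posts the spreadsheet to Solr's Tika-backed extracting request handler using Python's requests library; the host, collection name, file name, and document id are placeholders.

# Sketch: send an xlsx file to Solr's /update/extract endpoint, which runs
# Apache Tika on the upload. Host, collection, and id are placeholders.
import requests

with open("films.xlsx", "rb") as f:
    resp = requests.post(
        "http://localhost:8983/solr/films/update/extract",
        params={"literal.id": "films-xlsx-1", "commit": "true"},
        files={"file": ("films.xlsx", f)},
    )
print(resp.status_code, resp.text)

Note that Tika extraction generally yields one Solr document per uploaded file, so if you need one document per spreadsheet row, the extracted content still has to be split row by row.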
I am exporting a table larger than 1 GB from BigQuery into GCS, but the export splits it into very small files of 2-3 MB. Is there a way to get bigger files, say 40-60 MB per file, rather than 2-3 MB?
I do the export via the API:
https://cloud.google.com/bigquery/docs/exporting-data#exporting_data_into_one_or_more_files
https://cloud.google.com/bigquery/docs/reference/v2/jobs
The source table is 60 GB on BigQuery. I extract the data in NEWLINE_DELIMITED_JSON format with GZIP compression:
destination_cloud_storage_uris=[
'gs://bucket_name/main_folder/partition_date=xxxxxxx/part-*.gz'
]
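For context, a minimal sketch of such an API-driven export with the google-cloud-bigquery Python client could look like the following; the project, dataset, table, and bucket names are placeholders.

# Sketch: extract a BigQuery table to GCS as gzipped newline-delimited JSON.
# Project, dataset, table, and bucket names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.ExtractJobConfig()
job_config.destination_format = bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON
job_config.compression = bigquery.Compression.GZIP

extract_job = client.extract_table(
    "project.dataset.table",
    "gs://bucket_name/main_folder/partition_date=xxxxxxx/part-*.gz",
    job_config=job_config,
)
extract_job.result()  # block until the export job finishes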
Are you trying to export a partitioned table? If so, each partition is exported as a different table, and that might cause the small files.
I ran the export in the CLI with each of the following commands, and in both cases received files of about 49 MB:
bq extract --compression=GZIP --destination_format=NEWLINE_DELIMITED_JSON project:dataset.table gs://bucket_name/path5-component/file-name-*.gz
bq extract --compression=GZIP project:dataset.table gs://bucket_name/path5-component/file-name-*.gz
Please add more details to the question so we can provide specific advice: how exactly are you requesting this export?
Nevertheless, if you have many files in GCS and you want to merge them all into one, you can do:
gsutil compose gs://bucket/obj1 [gs://bucket/obj2 ...] gs://bucket/composite
https://cloud.google.com/storage/docs/gsutil/commands/compose
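The same compose operation is available from the google-cloud-storage Python client; a sketch is below, with bucket and object names as placeholders. Keep in mind that a single compose call accepts at most 32 source objects, so very many small files have to be composed in stages.

# Sketch: merge several GCS objects into one composite object.
# Bucket and object names are placeholders; at most 32 sources per call.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("bucket_name")

sources = [bucket.blob("obj1"), bucket.blob("obj2")]
composite = bucket.blob("composite")
composite.compose(sources)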
I'm trying to load a Control+A separated CSV file into BigQuery. What option should I pass for the -F parameter of the bq load command? Everything I have tried results in an error while loading.
I would guess that Control+A is used in some legacy format that the OP wants to load into BigQuery. On the other hand, Control+A may be chosen precisely because it is hard to pick any of the usual delimiters.
My recommendation would be to load your CSV file without splitting on any delimiter, so the whole row is loaded as a single field.
Assume the rows loaded into TempTable look like the one below, with just one column called FullRow:
'value1^Avalue2^Avalue3'
where ^A is the "invisible" Control+A character (0x01).
So, after you have loaded your file into BigQuery, you can parse it into separate columns and write it to the final table with something like this:
SELECT
REGEXP_EXTRACT(FullRow, r'(?:\w*\x01){0}(\w*)') AS col1,
REGEXP_EXTRACT(FullRow, r'(?:\w*\x01){1}(\w*)') AS col2,
REGEXP_EXTRACT(FullRow, r'(?:\w*\x01){2}(\w*)') AS col3
FROM TempTable
The above is confirmed to work, as I have used this approach multiple times. It works for both Legacy and Standard SQL.
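To get the raw rows into a single-column TempTable in the first place, one possible sketch with the google-cloud-bigquery Python client is shown below; the bucket, dataset, and table names are placeholders, and it assumes your data contains no tab characters, so tab can serve as a delimiter that never splits a row.

# Sketch: load the Control+A separated file into a one-column staging table.
# Tab is used as the delimiter on the assumption that no tabs occur in the
# data, so each whole line lands in the single FullRow column.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.CSV
job_config.field_delimiter = "\t"
job_config.schema = [bigquery.SchemaField("FullRow", "STRING")]

load_job = client.load_table_from_uri(
    "gs://bucket_name/data/file.csv",
    "project.dataset.TempTable",
    job_config=job_config,
)
load_job.result()  # block until the load job finishes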
I have found that my LZO files can't be split, so my searches are very slow. How can I speed up the search?
Which file formats are splittable?
My data source is Flume, and the data is stored in HDFS.
I suggest you write your data in ORC or Parquet. Either format will be much faster than LZO.
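A rough PySpark sketch of that conversion is below; the HDFS paths are placeholders, the input is treated as plain text, and it assumes Spark can read your Flume output (for LZO-compressed files, the LZO codec must be available to Spark).

# Sketch: read the raw Flume output from HDFS and rewrite it as Parquet.
# Paths are placeholders; add real parsing for your record layout before
# writing if you want typed columns instead of a single text column.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("flume-to-parquet").getOrCreate()

raw = spark.read.text("hdfs:///flume/events/*")  # one string column: value
raw.write.mode("overwrite").parquet("hdfs:///warehouse/events_parquet")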