Is it possible to load a CSV file directly into Redis (5.x), or does the CSV first need to be converted to JSON and then loaded programmatically?
Depending on how you want to store the data in your CSV file, you may or may not need to process it programmatically. For example, running this:
redis-cli SET foo "$(cat myfile.csv)"
will store the contents of the file in the key 'foo' as a Redis String. If you want to store each line in its own data structure under its own key (perhaps a Hash with all the columns), you'll need to process the file with code and populate the database accordingly.
Note: there is no need, however, to convert it to JSON.
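If each line should become its own Hash, a short script is enough. The following is a minimal Python sketch, not a definitive loader. It assumes the CSV has a header row and that every row has as many columns as the header. The key prefix `row` and both helper names are made up for illustration, and `client` is assumed to be a redis-py connection (any object with a compatible `hset(name, mapping=...)` method will do).

```python
import csv
import io


def csv_to_hashes(csv_text, key_prefix="row"):
    """Yield (key, field_mapping) pairs, one per CSV data row.

    Keys look like "row:1", "row:2", ... (a hypothetical naming scheme).
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    for i, row in enumerate(reader, start=1):
        yield f"{key_prefix}:{i}", dict(row)


def load_into_redis(client, csv_text, key_prefix="row"):
    """Store each CSV row as a Redis Hash and return the number of rows written."""
    count = 0
    for key, mapping in csv_to_hashes(csv_text, key_prefix):
        # One HSET per row; the column names become the hash fields.
        client.hset(key, mapping=mapping)
        count += 1
    return count
```

With redis-py installed (a version whose `hset` accepts `mapping`) and a server running, something like `load_into_redis(redis.Redis(decode_responses=True), open("myfile.csv").read())` would populate `row:1`, `row:2`, and so on. No JSON conversion is involved at any point.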
Related
I'm using Azure Data Factory and am looking for the complement to the "Lookup" activity. Basically I want to be able to write a single line to a file.
Here's the setup:
Read from a CSV file in blob store using a Lookup activity
Connect the output of that to a For Each
Within the For Each, take each record (a line from the file read by the Lookup activity) and write it to a distinct file, named dynamically.
Any clues on how to accomplish that?
Use a Data Flow: use the Derived Column transformation to create a filename column, then use that column in the sink. Details on how to implement dynamic filenames in ADF are described here: https://kromerbigdata.com/2019/04/05/dynamic-file-names-in-adf-with-mapping-data-flows/
Data Flow would probably be better for this, but as a quick hack, you can do the following to read the text file line by line in a pipeline:
Define your source dataset to output a line as a single column. Normally I would use "NoDelimiter" for this, but that isn't supported by Lookup. As a workaround, define it with an incorrect Column Delimiter (like | or \t for a CSV file). You should also go to the Schema tab, and CLEAR the schema. This will generate a column in the output named "Prop_0".
In the foreach activity, set the Items to the Lookup's "output.value" and check "Sequential".
Inside the foreach, you can use item().Prop_0 to grab the text of the line.
To the best of my understanding, creating a blob isn't directly supported by pipelines [hence my suggestion above to look into Data Flow]. It is, however, very simple to do in Logic Apps. If I was tackling this problem, I would create a logic app with an HTTP Request Received trigger, then call it from ADF with a Web activity and send the text line and dynamic file name in the payload.
I have legacy data stored as CSV in an Azure DataLake Gen2 storage account. I'm able to connect to this and interrogate it using DataBricks. I have a requirement to remove certain records once their retention period expires, or if a GDPR "right to be forgotten" needs applying to the data.
Using Delta I can load a CSV into a Delta table and use SQL to locate and delete the required rows, but what is the best way to save these changes? Ideally back to the original file, so that the data is removed from the original. I've used the LOCATION option when creating the Delta table to persist the generated Parquet format files to the DataLake but it would be nice to keep it in the original CSV format.
Any advice appreciated.
I'd be careful here. Right to be forgotten means you need to delete the data. Delta doesn't actually delete it from the original file (initially at least) - this will only happen once the data is vacuumed.
The safest way to delete data is to read all the data into a dataframe, filter out the records you do not want, and then write it back using overwrite. This ensures the data is removed and the same structure is rewritten.
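The read, filter, overwrite pattern can be sketched as follows. This is an illustration in plain Python against a local CSV file, not the Databricks code itself; on Databricks the same three steps would be done with a Spark dataframe. The function name is made up, and the `customer_id` column in the usage note is a hypothetical example.

```python
import csv
import os
import tempfile


def overwrite_without(path, should_delete):
    """Rewrite the CSV at `path`, dropping rows for which should_delete(row) is true.

    Returns the number of rows kept.
    """
    directory = os.path.dirname(os.path.abspath(path))

    # Step 1 and 2: read everything, keep only the rows that must survive.
    with open(path, newline="") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames
        kept = [row for row in reader if not should_delete(row)]

    # Step 3: write the survivors to a temporary file in the same directory,
    # then replace the original so no copy of the deleted rows is left in it.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".csv")
    with os.fdopen(fd, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(kept)
    os.replace(tmp_path, path)
    return len(kept)
```

For example, `overwrite_without("people.csv", lambda row: row["customer_id"] == "2")` would leave the file in its original CSV layout with that customer's rows gone.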
Convert Parquet to CSV in ADF
The versioned parquet files created in the ADLS Gen2 location can be converted to CSV using the Copy Data task in an Azure Data Factory pipeline.
So, you could read the CSV data into a Delta table (with its location pointing to a Data Lake folder), perform the required changes using SQL, and then convert the Parquet files to CSV format using ADF.
I have tried this and it works. The only hurdle might be detecting the column headers while reading the CSV file to Delta. You could read it to a dataframe and create a Delta table from it.
If you are running the delete operations periodically, it is costly to keep the file in CSV: every time, you read the file, transform the dataframe to Delta, query it, filter out the records, save it back to CSV, and delete the Delta table.
So my suggestion would be to transform the CSV to Delta once, perform the deletes periodically, and generate a CSV only when it's needed.
The advantage is that Delta stores data internally in Parquet, a binary format that allows better compression and more efficient encoding and decoding of data.
My requirement is to pull data from different sources (Facebook, YouTube, DoubleClick Search, etc.) and load it into BigQuery. When I pull the data, some of the sources give me "NULL" when a column is empty.
I tried to load the same data into BigQuery, and BigQuery treats it as a string instead of NULL (empty).
Right now I am replacing "NULL" with "" (empty string) before loading into BigQuery. Is there any way to load the file directly without this manipulation (replacing)?
Thanks,
What is the file format of source file e.g. CSV, New Line Delimited JSON, Avro etc?
The reason I ask is that in CSV an empty string is treated as a null, whereas NULL is a string value. So, if you don't want to manipulate the data before loading, you should save the files in newline-delimited (NLD) JSON format.
Since you mentioned that you are pulling data from social media platforms, I assume you are using their REST APIs, so it should be possible for you to save that data as NLD JSON instead of CSV.
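To illustrate the difference, here is a minimal Python sketch that turns CSV text containing the literal marker `NULL` into newline-delimited JSON with real JSON nulls. The function name and the sample column names are made up for illustration; values are left as strings and no type conversion is attempted.

```python
import csv
import io
import json


def csv_to_ndjson(csv_text, null_marker="NULL"):
    """Convert CSV text (with a header row) to newline-delimited JSON.

    Fields equal to `null_marker` become JSON null; everything else stays a string.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = []
    for row in reader:
        record = {k: (None if v == null_marker else v) for k, v in row.items()}
        lines.append(json.dumps(record))
    return "\n".join(lines)
```

Given the input `"id,views\n1,NULL\n2,15\n"`, the first output line carries `"views": null` rather than the string `"NULL"`, which is the distinction the CSV format cannot express on its own.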
Answer to your question "is there a way we can load this from the web console?":
Yes. Go to your BigQuery project console https://bigquery.cloud.google.com/ and create a table in a dataset, where you can specify the source file and table schema details.
From Comment section (for the convenience of other viewers):
Is there any option in bq commands for this?
Try this:
bq load --source_format=CSV --skip_leading_rows=1 --null_marker="NULL" yourProject:yourDataset.yourTable ~/path/to/file/x.csv Col1:string,Col2:string,Col3:integer,Col4:string
You may consider running a command similar to:
bq load --field_delimiter="\t" --null_marker="\N" --quote="" \
PROJECT:DATASET.tableName gs://bucket/data.csv.gz table_schema.json
More details can be gathered from the replies to the "Best Practice to migrate data from MySQL to BigQuery" question.
I tried to search for this but could not find any tips or recommendations.
Here is my situation. I have all the data lined up correctly and the output works fine using a Pig script. The files are stored in an output directory. There are more than 100 output files, so I have accumulated them into a single results file using another Pig script.
I was wondering if there is anything in Pig Latin that will help me add a header to the accumulated results file, so that business users can use it right away.
Please advise
If you are using DUMP in your Pig script and redirecting the result to a single file, you can use DESCRIBE before DUMP. Doing so writes the schema information ahead of the data, so it appears as a header at the top of your output file.
A = LOAD 'test' USING PigStorage() AS (col1:int, col2:chararray);
DESCRIBE A;
DUMP A;
output will be something like:
A: {col1: int,col2: chararray}
1,test
2,test
...
Pig can store the schema into a different file ".pig_schema" using PigStorage:
store A into 'outputFile' using PigStorage('\t', '-schema');
will save your data in the outputFile using tabs as delimiters and also creates the schema file.
You can store the header in a separate file, LOAD it and UNION it with your data. Then you need to do an ORDER BY (that might be tricky depending on your data).
Another way would be to use hadoop fs -getmerge.
In general, this is not something Pig is very good at; you might as well write a script in another language.
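As an example of the "script in another language" route, here is a minimal Python sketch that concatenates part files and puts a header line on top. It assumes the part files have already been copied to a local directory and follow the usual `part-*` naming; the function name and the header text are made up for illustration.

```python
import glob
import os


def merge_with_header(parts_dir, out_path, header, pattern="part-*"):
    """Concatenate the part files in `parts_dir` into `out_path` under one header line.

    Returns the number of part files merged.
    """
    # Sort so that part-...-00000 comes before part-...-00001, and so on.
    part_paths = sorted(glob.glob(os.path.join(parts_dir, pattern)))
    with open(out_path, "w") as out:
        out.write(header.rstrip("\n") + "\n")
        for part in part_paths:
            with open(part) as src:
                for line in src:
                    # Guard against a part file whose last line lacks a newline.
                    out.write(line if line.endswith("\n") else line + "\n")
    return len(part_paths)
```

A call such as `merge_with_header("output", "results.csv", "col1,col2")` would produce a single file that business users can open with headers already in place.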
I'm attempting to pull data from several spreadsheets that reside in a single folder, then put all the data into a single csv file along with column headings.
I have a Foreach Loop container set up to iterate through each of the filenames in the folder, which then appends the data to a RAW file. However, as many others seem to have found, there does not appear to be a built-in option that simply truncates the RAW file before entering the loop container.
Jamie Thompson described a similar situation in his blog here, but the links to the examples do not seem to work. Does anyone have an easy way to truncate the RAW file in a standalone step before entering the foreach loop?
The approach I always use is to create a data flow with the appropriate metadata format but no actual rows and route that to a RAW file set to Create new.
In my existing data flow, I look at the metadata that populates the RAW file and then craft a select statement that mimics it.
e.g.
SELECT
CAST(NULL AS varchar(70)) AS AddressLine1
, CAST(NULL AS bigint) AS SomeBigInt
, CAST(NULL AS nvarchar(max)) AS PerformanceLOL
Here's what I would do:
Make your initial raw file
Make a copy of that raw file
Use a file task to replace the staging file at the beginning of your package/job every time.
In my use case I have 20 foreach threads writing to their own files all at the same time. No thread can create and then append, so you just "recreate" the file by copying over an 'empty' raw file that already has the metadata assigned, before calling the threads.