I was trying to insert 1M entries into a RedisTimeSeries DB (on my local machine).
For this, I was using the add(sourceKey, timestamp, value) method of RedisTimeSeries for every entry.
I wanted to know if there's a better way to do this and whether bulk loading is possible in RedisTimeSeries.
I couldn't find a method for bulk loading data in this doc:
https://oss.redis.com/redistimeseries/commands/#tsadd
Thanks
Currently, the fastest way is to combine pipelining and the TS.MADD command
https://redis.io/topics/pipelining
https://oss.redis.com/redistimeseries/commands/#tsmadd
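For example, here is a rough sketch in Python with redis-py (the client library, key name, and chunk sizes are my own assumptions, not part of the original answer):

    # Sketch: bulk load ~1M samples by combining a pipeline with TS.MADD.
    # Assumes redis-py is installed and the "sensor:1" series already exists
    # (e.g. created with TS.CREATE); the chunk/batch sizes are arbitrary.
    import redis

    r = redis.Redis(host="localhost", port=6379)
    pipe = r.pipeline(transaction=False)

    samples = [(1_600_000_000_000 + i, float(i)) for i in range(1_000_000)]

    CHUNK = 1000  # samples packed into each TS.MADD call
    BATCH = 100   # TS.MADD calls buffered per pipeline round trip

    for i in range(0, len(samples), CHUNK):
        args = []
        for ts, value in samples[i:i + CHUNK]:
            args += ["sensor:1", ts, value]
        pipe.execute_command("TS.MADD", *args)
        if (i // CHUNK) % BATCH == BATCH - 1:
            pipe.execute()
    pipe.execute()  # flush whatever is left in the pipeline

Tune the chunk and batch sizes for your machine; the point is simply to avoid one network round trip per sample.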
We have several on-premise databases and, so far, our data warehouse has also been on premise. We are now moving to the cloud and the data warehouse will be in Snowflake. But we still have more on-premise source systems than cloud ones, so we would like to stick with our on-premise ETL solution. We are using Pentaho Data Integration (PDI) as our ETL tool.
The issue we have is that the PDI Table output step, which uses the Snowflake JDBC driver, is horribly slow for bulk loads into Snowflake. A year ago it was even worse, as it simply did an INSERT INTO and a COMMIT after every row. It has improved a lot by now; looking at the Snowflake history/logs, it seems to do some kind of PUT to a temporary Snowflake stage, but from there it still does some kind of INSERT into the target table, and this is slow (in our test case it took an hour to load 1,000,000 records).
As a workaround for the bulk load, we use SnowSQL (Snowflake's command line tool) scripts, orchestrated by PDI, to do the bulk load into Snowflake. In our example case it then takes less than a minute to get the same 1,000,000 records into Snowflake.
Everything that happens inside the Snowflake database is done via PDI SQL steps sent to Snowflake over JDBC, and all our source system queries run fine with PDI. So the issue is only with the bulk load into Snowflake, where we need this somewhat awkward workaround:
Instead of:
PDI.Table input(get source data) >> PDI.Table output(write to Snowflake table)
we have:
PDI.Table input(get source data) >> PDI.Write to local file >> SnowSQL.PUT local file to Snowflake stage >> SnowSQL.COPY data from stage to Snowflake table >> PDI clear local file and clear the Snowflake stage.
It works, but it is much more complex than it needs to be (compared to our previous on-premise database loads, for example).
I don't even know whether this issue is on the Snowflake side (the JDBC driver not working optimally) or on the PDI side (not utilizing the JDBC driver correctly), but I would like to get it working better.
To bulk load into Snowflake, you need to do the PUT and COPY.
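As a rough illustration of what the PUT/COPY pair looks like from code, here is a sketch with the Snowflake Python connector (connection parameters, file, stage and table names are placeholders, not from the question):

    # Sketch: bulk load a local CSV into Snowflake via PUT + COPY INTO.
    # All names below are placeholders.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="my_account", user="my_user", password="...",
        warehouse="my_wh", database="my_db", schema="public",
    )
    cur = conn.cursor()

    # Upload the local file to the target table's internal stage.
    cur.execute("PUT file:///tmp/source_data.csv @%target_table AUTO_COMPRESS=TRUE")

    # Copy from the stage into the table; this is the fast bulk path.
    cur.execute("""
        COPY INTO target_table
        FROM @%target_table
        FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"' SKIP_HEADER = 1)
        PURGE = TRUE
    """)
    conn.close()

This is essentially what the SnowSQL workaround in the question does, just driven from a script instead of the command line tool.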
I am currently setting up a simple NiFi flow that reads from an RDBMS source and writes to a Hive sink. The flow works as expected until the PutHiveQL processor, which is running extremely slowly. It inserts approximately one record every minute.
It is currently set up as a standalone instance running on one node.
The logs show an insert roughly every minute:
(INSERT INTO customer (id, name, address) VALUES (x, x, x))
Any ideas about why this may be? Improvements to try?
Thanks in advance
Inserting one record at a time into Hive will result in extreme slowness.
Since you are doing regular inserts into the Hive table, change your flow to:
QueryDatabaseTable
PutHDFS
Then create a Hive Avro table on top of the HDFS directory where you stored the data.
(or)
QueryDatabaseTable
ConvertAvroToORC // in case you need to store the data in ORC format
PutHDFS
Then create a Hive ORC table on top of the HDFS directory where you stored the data.
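If it helps, the "create a table on top of the directory" step is just external-table DDL; a hedged sketch below runs it through PyHive (the host, HDFS path and the columns, which I took from the question's INSERT statement, are all assumptions, and the same DDL can be run from beeline or Hue instead):

    # Sketch: create an external Hive ORC table over the directory PutHDFS writes to.
    # Host, port, HDFS location and columns are placeholders.
    from pyhive import hive

    conn = hive.connect(host="hive-server", port=10000, username="nifi")
    cur = conn.cursor()
    cur.execute("""
        CREATE EXTERNAL TABLE IF NOT EXISTS customer (
            id INT,
            name STRING,
            address STRING
        )
        STORED AS ORC
        LOCATION '/data/customer_orc'
    """)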
Are you pushing one record at a time? If so, you can use the MergeRecord processor to create batches before pushing into PutHiveQL.
It is recommended to batch around 100 records:
See here: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-hive-nar/1.5.0/org.apache.nifi.processors.hive.PutHiveQL/
Batch Size (default: 100): The preferred number of FlowFiles to put to the database in a single transaction.
Use the MergeRecord processor and set the number of records and/or a timeout; it should speed things up considerably.
I need to transform a fairly big database table to CSV with AWS Glue. However, I only need the newest table rows from the past 24 hours. There is a column which specifies the creation date of each row. Is it possible to transform just these rows, without copying the whole table into the CSV file? I am using a Python script with Spark.
Thank you very much in advance!
There are some built-in transforms in AWS Glue which can be used to process your data. These transforms can be called from ETL scripts.
Please refer to the link below:
https://docs.aws.amazon.com/glue/latest/dg/built-in-transforms.html
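For the 24-hour requirement specifically, the Filter transform is the relevant one. A hedged sketch (the database, table, column and S3 path are placeholders I made up; it assumes created_at is a timestamp column):

    # Sketch: keep only rows created in the last 24 hours using Glue's Filter
    # transform, then write them out as CSV. All names are placeholders.
    from datetime import datetime, timedelta

    from awsglue.context import GlueContext
    from awsglue.transforms import Filter
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())
    cutoff = datetime.utcnow() - timedelta(hours=24)

    source = glue_context.create_dynamic_frame.from_catalog(
        database="my_database", table_name="my_table"
    )
    recent = Filter.apply(frame=source, f=lambda row: row["created_at"] >= cutoff)

    glue_context.write_dynamic_frame.from_options(
        frame=recent,
        connection_type="s3",
        connection_options={"path": "s3://my-bucket/exports/"},
        format="csv",
    )

Note that Filter runs after Glue has read the data, so it limits what ends up in the CSV rather than what is read from the source.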
You haven't mentioned the type of database you are trying to connect to. In any case, for JDBC connections Spark supports a query option, which lets you issue a regular SQL query so that only the rows you need are fetched.
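A minimal PySpark sketch of that, assuming a PostgreSQL source and a created_at timestamp column (the URL, credentials, table and column names are placeholders; adjust the SQL dialect to your database):

    # Sketch: push the 24-hour filter down to the source database
    # via Spark's JDBC "query" option. All names are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("recent-rows-to-csv").getOrCreate()

    recent = (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://db-host:5432/mydb")
        .option("user", "etl_user")
        .option("password", "...")
        .option("query",
                "SELECT * FROM my_table "
                "WHERE created_at >= NOW() - INTERVAL '24 hours'")
        .load()
    )

    recent.write.mode("overwrite").csv("s3://my-bucket/exports/", header=True)

This way only the matching rows ever leave the database, instead of filtering after a full table read.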
I would like to save the result of a query on an external DB to BigQuery.
I am using pyodbc to manage the ODBC connection.
What is the most efficient way to perform such an operation?
Should I fetchone() each cursor row and then insert it into BigQuery?
Does the result have a large amount of data?
If the result is small, you can just read all the rows and insert them into BigQuery. The benefit is that the result is immediately available to BigQuery queries. However, for large results, the streaming insert might be expensive (see https://cloud.google.com/bigquery/pricing).
For large results I would just save the result to a file (commonly CSV), upload it to GCP (Cloud Storage) and run a load job.
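A rough sketch of both paths with the google-cloud-bigquery client (the table ID, rows and GCS URI are placeholders):

    # Sketch: two ways to get query results into BigQuery.
    from google.cloud import bigquery

    client = bigquery.Client()
    table_id = "my-project.my_dataset.my_table"

    # Small result: stream the rows directly (immediately queryable, billed per insert).
    rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
    errors = client.insert_rows_json(table_id, rows)
    assert not errors, errors

    # Large result: write a CSV, upload it to Cloud Storage, then run a load job
    # (load jobs themselves are free).
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    )
    load_job = client.load_table_from_uri(
        "gs://my-bucket/export.csv", table_id, job_config=job_config
    )
    load_job.result()  # wait for the load to finish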
I have a Postgres 9.3 DB and I want to use Redis to cache calls to the DB (basically like memcached). I followed these docs, which means I have basically configured Redis to work as an LRU cache. But I am unsure what to do next. How do I tell Redis to track calls to the DB and cache their output? How can I tell it's working?
In pseudo code:
see if redis has the record by 'record_type:record_id'
if so return the result
if not then query postgres for the record_id in the record_type table
store the result in redis by 'record_type:record_id'
return the result
This might have to be a custom adapter for the query engine that you are using.
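A minimal Python sketch of that pseudocode, assuming redis-py for the cache and psycopg2 for Postgres (the table layout, the 'data' column and the TTL are made-up placeholders):

    # Sketch: cache-aside lookup keyed by 'record_type:record_id'.
    # Connection details and table layout are placeholders; record_type is
    # assumed to be a trusted table name, not user input.
    import json

    import psycopg2
    import redis

    cache = redis.Redis(host="localhost", port=6379)
    db = psycopg2.connect("dbname=mydb user=app password=...")

    def get_record(record_type, record_id, ttl=300):
        key = f"{record_type}:{record_id}"

        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)  # cache hit

        # Cache miss: query Postgres for the record.
        with db.cursor() as cur:
            cur.execute(f"SELECT id, data FROM {record_type} WHERE id = %s",
                        (record_id,))
            row = cur.fetchone()

        if row is None:
            return None

        result = {"id": row[0], "data": row[1]}
        cache.set(key, json.dumps(result), ex=ttl)  # TTL plus the LRU policy keeps the cache bounded
        return result

Redis itself never "tracks" the database; the application (or an ORM caching plugin) has to do the lookup-then-fill step shown here.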