I have an existing DSE 6 node cluster on AWS that performs very well. I would like to move the data to the "Cassandra compatible" Amazon Keyspaces, but after moving some data I have found there is no "IN" clause.
I use the field mentioned in the "IN" clause as the sharding separator. The field is unique per day, so if I want to search over a number of days I use "where data_bucket in (1,2,3,4,5)".
Does anyone know how I could approach this (or adapt the query) using Keyspaces that would be performant?
Yes, there are workarounds for the IN operator. The traditional one is to break the IN statement into multiple queries, which is what the Cassandra coordinator does anyway. Actually performing the queries in parallel will result in better response times (latency), whereas the coordinator executes the IN statement synchronously.
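For example (a minimal sketch, assuming data_bucket is your partition key and using made-up keyspace/table names), the single IN query can be rewritten as one query per bucket:
-- original: SELECT * FROM my_ks.events WHERE data_bucket IN (1,2,3,4,5);
SELECT * FROM my_ks.events WHERE data_bucket = 1;
SELECT * FROM my_ks.events WHERE data_bucket = 2;
SELECT * FROM my_ks.events WHERE data_bucket = 3;
SELECT * FROM my_ks.events WHERE data_bucket = 4;
SELECT * FROM my_ks.events WHERE data_bucket = 5;
Each statement hits a single partition, so the client can issue them concurrently through the driver's async execution and merge the results on its side.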
If you don't want to make a code change, open a ticket with the Amazon Keyspaces service; they may be able to help you with this feature.
Keyspaces probably hasn't fully caught up with C*. Why move if it performs well? Alternatively, you can move to ScyllaDB, which does have an IN clause.
I'm working in SQL Workbench in Redshift. We have daily event tables for customer accounts, the same format each day, just with updated info. There are currently 300+ tables. For a simple example, I would like to extract the top 10 rows from each table and place them in one table.
Table name format is Events_001, Events_002, etc. Typical values are Customer_ID and Balance.
Redshift does not appear to support declaring variables, so I'm a bit stuck.
You've effectively invented a kind of pseudo-partitioning, where you manually partition the data by day.
To manually recombine the tables, create a view that unions everything together:
CREATE VIEW events_combined AS
SELECT 1 AS partition_id, * FROM events_001
UNION ALL
SELECT 2 AS partition_id, * FROM events_002
UNION ALL
SELECT 3 AS partition_id, * FROM events_003
-- ...and so on: UNION ALL one SELECT per remaining events_NNN table
;
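To get your "top 10 rows from each table" out of that view, you can then window over partition_id (a minimal sketch; ordering by balance is only an assumption about what "top" means here):
SELECT *
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY partition_id ORDER BY balance DESC) AS rn
    FROM events_combined
) ranked
WHERE rn <= 10;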
That's a hassle: you need to recreate the view every time you add a new table.
That's why most modern databases have partitioning schemes built into them, so all the boilerplate is taken care of for you.
But Redshift doesn't do that. So, why not?
In general, because Redshift has many alternative mechanisms for dividing and conquering data. It's columnar, so you can avoid reading columns you don't use. It's horizontally partitioned across multiple nodes (sharded), to share the load with large volumes of data. It's sorted and compressed in pages to avoid loading rows you don't want or need. It has dirty pages for newly arriving data, which can then be cleaned up with a VACUUM.
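The more idiomatic Redshift layout is therefore a single table that carries the day as a column and lets the sort key do the block-skipping the per-day tables were doing by hand (a hedged sketch; the column names are assumptions):
CREATE TABLE events (
    event_date  date          NOT NULL,
    customer_id bigint        NOT NULL,
    balance     numeric(18,2)
)
DISTKEY (customer_id)
SORTKEY (event_date);
-- queries filtering on event_date only scan the sorted blocks for those days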
So, I would agree with others that it's not normal practice. Yet, Amazon themselves do have a help page (briefly) describing your use case.
https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-time-series-tables.html
So, I'd disagree with "never do this". Still, it is a strong indication that you've accidentally walked into an anti-pattern and should seriously reconsider your design.
As others have pointed out, having many small tables in Redshift is really inefficient (terrible if taken to the extreme). But that is not your question.
You want to know how to perform the same query on multiple tables from SQL Workbench. I'm assuming you are referring to SQLWorkbench/J. If so, you can define variables in the bench and use these variables in queries; then you just need to update the variable and rerun the query. SQLWorkbench/J doesn't offer any looping or scripting capabilities, though, so if you want to loop you will need to wrap the bench in a script (like a BAT file or a bash script).
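A minimal sketch of that variable approach, assuming SQLWorkbench/J's WbVarDef command and its $[var] substitution syntax (the column names and "top 10 by balance" ordering are guesses at your use case):
-- define the suffix once, then change it and rerun the query for each table
WbVarDef table_suffix=001;

SELECT customer_id, balance
FROM events_$[table_suffix]
ORDER BY balance DESC
LIMIT 10;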
My preference is to write a Jinja template with the SQL in it, along with any looping and variable substitution. Then apply a JSON file with the table names and, presto, you have all the SQL for all the tables in one file. I just need to run this - usually with the psql CLI, but at times I import it into my bench.
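For what it's worth, a rough sketch of such a template (the table list, column names and "top 10" ordering are all assumptions, not anything from your schema):
{# top_10_per_table.sql.j2 - rendered with a list of table names #}
{% for t in tables %}
(
SELECT '{{ t }}' AS source_table, customer_id, balance
FROM {{ t }}
ORDER BY balance DESC
LIMIT 10
)
{% if not loop.last %}UNION ALL{% endif %}
{% endfor %}
Rendering it with {"tables": ["events_001", "events_002", "events_003"]} gives one UNION ALL statement covering every table, ready to run with psql or to paste into the bench.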
My advice is to treat Redshift as a query execution engine and use an external environment (Lambda, EC2, etc) for the orchestration of what queries to run and when. Many other databases (try to) provide a full operating environment inside the database functionality. Applying this pattern to Redshift often leads to problems. Use Redshift for what it is great at and perform the other actions elsewhere. In the end you will find that the large AWS ecosystem provides extended capabilities as compared to other databases, it's just that these aren't all done inside of Redshift.
So I have a custom Postgresql query that retrieves all rows within a specified longitude latitude radius, like so:
SELECT *,
earth_distance(ll_to_earth($1,$2), ll_to_earth(lat, lng)) as distance_metres
FROM RoomsWithUsers
WHERE earth_box(ll_to_earth($1,$2), $3) @> ll_to_earth(lat, lng)
ORDER BY distance_metres;
And in my node server, I want to be notified every time the number of rows returned by this query changes. I have looked into using a Node library such as pg-live-query, but I would much rather use pg-pubsub, which works with the existing Postgres LISTEN/NOTIFY mechanism, to avoid unnecessary overhead. However, as far as I understand, PostgreSQL TRIGGERs only fire on UPDATE/INSERT/DELETE operations and not on any specific queries themselves. Is there any way to accomplish what I'm doing?
You need to set up the right triggers that will call NOTIFY for all the clients that use LISTEN on the same channel.
It is difficult to advise exactly how to implement your NOTIFY logic inside the triggers, because it depends on the following:
How many clients is the message intended for?
How heavy/large is the query that's being evaluated?
Can the triggers know the logic of the query to evaluate it?
Based on the answers, you might consider different approaches, which include, but are not limited to, the following options and their combinations:
execute the query/view when the outcome cannot be evaluated, and cache the result
provide smart notification, if the query's outcome can be evaluated
use the payload to pass in the update details to the listeners
schedule query/view re-runs for late execution, if it is heavy
do entire notification as a separate job
Certain scenarios can grow quite complex. For example, you may have a master client that can make the change, and multiple slaves that need to be notified. In this case the master executes the query, checks whether the result has changed, and then calls a function in the PostgreSQL server to trigger notifications across all slaves.
So again, lots and lots of variations are possible, depending on the specific requirements of the task at hand. In your case you do not provide enough details to offer any specific path, but the general guidelines above should help you.
Async & LISTEN/NOTIFY is the right way!
You can add trigger(s) on UPDATE/INSERT, execute your query in the body of the trigger, save the number of rows in a simple table, and call NOTIFY if the value has changed. If you need multiple parameter combinations in the query, you can create/destroy triggers from inside your program.
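A hedged sketch of that approach (the channel name, the helper table, and the hard-coded coordinates/radius are all assumptions; it also assumes the cube/earthdistance extensions used by the query above):
-- keep the last known row count and NOTIFY only when it changes
CREATE TABLE IF NOT EXISTS room_query_counts (
    query_id  text PRIMARY KEY,
    row_count bigint NOT NULL
);

CREATE OR REPLACE FUNCTION notify_room_count_change() RETURNS trigger AS $$
DECLARE
    new_count bigint;
    old_count bigint;
BEGIN
    -- re-run the radius query (here with hard-coded example parameters) and count the rows
    SELECT count(*) INTO new_count
    FROM RoomsWithUsers
    WHERE earth_box(ll_to_earth(51.5, -0.12), 5000) @> ll_to_earth(lat, lng);

    SELECT row_count INTO old_count
    FROM room_query_counts
    WHERE query_id = 'london_5km';

    IF new_count IS DISTINCT FROM old_count THEN
        INSERT INTO room_query_counts (query_id, row_count)
        VALUES ('london_5km', new_count)
        ON CONFLICT (query_id) DO UPDATE SET row_count = EXCLUDED.row_count;
        -- clients subscribed with LISTEN room_count_changed receive the new count as payload
        PERFORM pg_notify('room_count_changed', new_count::text);
    END IF;

    RETURN NULL;  -- AFTER trigger, return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER rooms_with_users_count_notify
AFTER INSERT OR UPDATE OR DELETE ON RoomsWithUsers
FOR EACH STATEMENT
EXECUTE PROCEDURE notify_room_count_change();
The Node side then issues LISTEN room_count_changed (e.g. via pg-pubsub) and re-runs the full query whenever a notification arrives. Note that this re-counts on every write to the table, so for a heavy query you would want one of the deferral options listed above.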
Hi, I recently joined a new job that uses Hive and PostgreSQL. The existing ETL scripts gather data from Hive, partitioned by dates, and create tables for that data in PostgreSQL; the PostgreSQL scripts/queries then perform left joins and create the final table for reporting purposes. I have heard in the past that Hive joins are not a good idea. However, I noticed that Hive does allow joins, so I'm not sure why it's a bad idea.
I wanted to use something like Talend or Mulesoft to create joins and do aggregations within Hive, create a temporary table, and transfer that temporary table as the final table to PostgreSQL for reporting.
Any suggestions, especially if this is not good practice with Hive? I'm new to Hive.
Thanks.
The major issue with joining in Hive has to do with data locality.
Hive queries are executed as MapReduce jobs, and several mappers will launch, as much as possible, on the nodes where the data lies.
However, when joining tables, the matching rows from the LHS and RHS tables will not, in general, be on the same node, which may cause a significant amount of network traffic between nodes.
Joining in Hive is not bad per se, but if the two tables being joined are large it may result in slow jobs.
If one of the tables is significantly smaller than the other, you may want to store it in the HDFS cache, making its data available on every node, which allows the join algorithm to retrieve all data locally.
So, there's nothing wrong with running large joins in Hive; you just need to be aware that they need their time to finish.
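A related technique for the small-table case is a map-side join, where Hive broadcasts the small table to every mapper. A hedged sketch (the table and column names are made up):
-- let Hive convert eligible joins to map joins automatically
SET hive.auto.convert.join=true;
-- or hint it explicitly for a specific query
SELECT /*+ MAPJOIN(d) */ f.order_id, f.amount, d.customer_name
FROM fact_orders f
JOIN dim_customers d
  ON f.customer_id = d.customer_id;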
Hive is growing in maturity
It is possible that arguments against using joins no longer apply for recent versions of Hive.
The clearest example I found is in the manual section on join optimization:
The MAPJOIN implementation prior to Hive 0.11 has these limitations:
The mapjoin operator can only handle one key at a time
Therefore I would recommend asking what the foundation of their reluctance is, and then checking carefully whether it still applies. Their arguments may well still be valid, or might have been resolved.
Sidenote:
Personally I find Pig code much easier to reuse and maintain than Hive; consider using Pig rather than Hive to do map-reduce operations on your (Hive table) data.
It's perfectly fine to do joins in Hive. I am an ETL tester and have performed left joins on big tables in Hive; most of the time the queries run smoothly, but sometimes the jobs get stuck or are slow due to network traffic.
It also depends on the number of nodes in the cluster.
Thanks
I have a main SQL Server A that data is inserted into. The table of interest on A looks like this:
Name|Entry Time|Exit Time|Comments
From this main table, I want to construct a table on another server B that contains the same data from A but with some additional filters using a WHERE clause i.e. WHERE Name IN ('John', 'Adam', 'Jack').
I am not sure what this is called or whether SQL Server even supports it natively (or whether I should set up a script to achieve this). Replication, as I understand it, means replicating the entire data set, but can someone tell me what it is that I am looking for and how to achieve it?
Transactional replication does support filters on articles, but I'll be honest - I've never set it up with articles with filters. This article may help as well as this topic in Books Online.
If it's only one table and/or you are uncomfortable diving into replication, you may want to populate the remote table with a trigger (this will obviously be easier if the data is only written to the table on insert and never updated). But you'll need to have logic set up to deal with situations where the remote server is down.
A third solution might be viable if you do not need server B to be continuously up to date - you can manually move data over every n minutes using a job - either using an outer join / merge or completely swapping out the set of data that matches the filter (I've used shadow schemas for this scenario to minimize the impact this has on readers of server B - see this dba.stackexchange answer for more details).
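A hedged sketch of the trigger option from the second paragraph (the linked server, database, table and column names are all assumptions, and it only covers inserts):
CREATE TRIGGER trg_CopyFilteredNames
ON dbo.MainTable
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    -- push only the filtered rows to the table on server B via a linked server
    INSERT INTO [ServerB].[ReportDb].[dbo].[FilteredTable] ([Name], [Entry Time], [Exit Time], [Comments])
    SELECT i.[Name], i.[Entry Time], i.[Exit Time], i.[Comments]
    FROM inserted AS i
    WHERE i.[Name] IN ('John', 'Adam', 'Jack');
END;
As noted above, this fails if server B is unreachable, so you would need retry or queueing logic around it for anything production-grade.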
Transactional replication with SQL Server supports the ability to filter data. In fact, when you set up your replication, there is an Add Filter dialog box (assuming you're using SSMS) that allows you to create your filter (Where clause).
You can learn more about this here.
I want to iterate through a table/view and then kick off some process (e.g. run a job, send an email) based on some criteria.
My arbitrary constraint here is that I want to do this inside the database itself, using T-SQL on a stored proc, trigger, etc.
Does this scenario require cursors, or is there some other native T-SQL row-based operation I can take advantage of?
Your best bet is a cursor. SQL being declarative and set-based, any 'workaround' you may find that tries to force SQL into imperative, row-oriented operations is unreliable and may break. E.g. the optimizer may cut your 'operation' out of the execution, or do it in a strange order or an unexpected number of times.
The bad name cursors generally get comes from their being deployed instead of set-based operations (like doing a computation and an update, or returning a report) because the developer did not find a set-oriented way of achieving the same functionality. But for non-SQL operations (i.e. launching a process) they are appropriate.
You may also use some variations on the cursor theme, like iterating through a result set on the client side. That is very similar in spirit to a cursor, although it doesn't use explicit cursors.
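A minimal sketch of such a cursor (the view, column and procedure names are assumptions):
DECLARE @Id int, @Email nvarchar(256);

DECLARE row_cur CURSOR LOCAL FAST_FORWARD FOR
    SELECT Id, Email
    FROM dbo.SomeView
    WHERE NeedsNotification = 1;

OPEN row_cur;
FETCH NEXT FROM row_cur INTO @Id, @Email;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- hypothetical per-row action: a proc that sends the email / starts the job
    EXEC dbo.usp_ProcessRow @Id = @Id, @Email = @Email;
    FETCH NEXT FROM row_cur INTO @Id, @Email;
END

CLOSE row_cur;
DEALLOCATE row_cur;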
The standard way to do this would be SSIS. Just use an Execute SQL task to get the rows, and a Foreach Loop container to iterate once per row. Inside the container, run whatever tasks you like, and they'll have access to the columns of each row.
If you are planning on sending an email to each record with an email address (or a similar row-based operation) then you would indeed plan on using a cursor.
There is no other "row-based" operation that you'd do within SQL itself (although I like John's suggestion to investigate SSIS - as long as you have SQL Server Standard or Enterprise). However, if you are summing, searching or doing any other kind of operation and then kicking off an event once done with the entire selection set, then you would certainly not use a cursor. Just so you know - cursors are generally considered a "last resort" approach to problems in SQL Server.
The first thought that comes to mind when I need to iterate over the result set of a query is to use cursors. Yes, it is a quick and dirty way of programming. But cursors have their setbacks as well: they incur overhead and can be performance bottlenecks.
There are alternatives to using cursors. You can try using a temp table with an identity column: copy your table into the temp table and use a WHILE loop to iterate over the rows, then call your stored procedure based on a condition (see the sketch below).
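A rough sketch of that alternative (again with made-up table, column and procedure names):
CREATE TABLE #work (
    RowNum int IDENTITY(1,1) PRIMARY KEY,
    Id     int,
    Email  nvarchar(256)
);

INSERT INTO #work (Id, Email)
SELECT Id, Email
FROM dbo.SomeView
WHERE NeedsNotification = 1;

DECLARE @i int = 1, @max int, @Id int, @Email nvarchar(256);
SELECT @max = MAX(RowNum) FROM #work;

WHILE @i <= @max
BEGIN
    SELECT @Id = Id, @Email = Email FROM #work WHERE RowNum = @i;
    EXEC dbo.usp_ProcessRow @Id = @Id, @Email = @Email;  -- hypothetical per-row action
    SET @i += 1;
END

DROP TABLE #work;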
Here, check this link for alternatives to cursors - http://searchsqlserver.techtarget.com/tip/0,289483,sid87_gci1339242,00.html
cheers