How to group by multiple columns in CrateDB

Can someone help me with grouping by multiple columns in CrateDB?
For example:
SELECT COUNT(1), x, y FROM my_table GROUP BY x, y
Thanks in advance.

Your grouping statement looks fine (though you should use count(*) instead; CrateDB optimizes specifically for that, and count(1) will be slower), but the CircuitBreakerException and the low amount of memory it reports indicate a memory misconfiguration.
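For example, the same grouping rewritten with count(*) (my_table stands in for your table name):
SELECT count(*), x, y
FROM my_table
GROUP BY x, y;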
Can you set
CRATE_MIN_MEM=16g
CRATE_MAX_MEM=16g
CRATE_HEAP_SIZE=16g
What operating system are you working on? CRATE_HEAP_SIZE should automatically set CRATE_MIN_MEM and CRATE_MAX_MEM, but there might be a problem with precedence...
Cheers, Claus

Related

How to cast nvarchar to integer?

I have a database I'm running queries on where I cannot change the schema. I am in my second year of database management, and we have not touched much on writing actual SQL as opposed to just using the GUI to create our queries and manage our DBs.
I have a population attribute that I need to run SUM on, but the population is of datatype nvarchar. I need to cast it to int.
I don't know how to do that in SQL, though! Can someone please show me? I've been fiddling with it for a while and I'm out of ideas. I'm very unfamiliar with SQL (as simple as it looks) and this would be helpful.
SELECT dbo_City.CityName, Sum(dbo_City.Population) AS SumOfPopulation
FROM dbo_City
GROUP BY dbo_City.CityName
ORDER BY dbo_City.CityName, Sum(dbo_City.Population);
I need to find which cities have populations between 189,999 and 200,000, which is a very simple query. I'm grouping by city and using the sum of the population. I'm not sure where to insert the 189,999-200,000 figure in the query, but I can figure that out later. Right now I'm stuck on casting the nvarchar Population field to an int so I can run this query!
I found the answer here:
Using SUM on nvarchar field
SELECT SUM(CAST(NVarcharCol as int))
But I'm not sure how to execute this solution. Specifically, I'm not sure where to insert this code in the SQL provided above, and I don't understand why the nvarchar column is called NVarcharCol.
From MSDN:
Syntax for CAST:
CAST ( expression AS data_type [ ( length ) ] )
Your solution should look something like this:
SELECT c.CityName, CAST(c.Population AS INT) AS SumOfPopulation
FROM dbo_City AS c
WHERE ISNUMERIC(c.Population) = 1 AND CAST(c.Population AS INT) BETWEEN 189999 AND 200000
ORDER BY c.CityName, CAST(c.Population AS INT)
You shouldn't need the SUM function unless you want to know the total population of the table, which would be more useful for a table of countries, cities, and city populations, unless this particular city table is broken down further (such as by individual zip code). In that case, the query below would be preferable:
SELECT c.CityName, SUM(CAST(c.Population AS INT)) AS SumOfPopulation
FROM dbo_City AS c
WHERE ISNUMERIC(c.Population) = 1
GROUP BY c.CityName
HAVING SUM(CAST(c.Population AS INT)) BETWEEN 189999 AND 200000
ORDER BY c.CityName, SUM(CAST(c.Population AS INT))
I hope this helps point you in the right direction.
-C§
Edit: Integrated the "fail safe" from your linked syntax, which should stop that error coming up. It adds a filter on the column so that only values that can be cast to a numeric type without extra processing (such as removing the comma, as in vkp's response) are considered.
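As an aside, if you're on SQL Server 2012 or later, TRY_CAST returns NULL instead of raising an error when a value can't be converted, which makes for a tidier fail-safe than ISNUMERIC. A sketch of the grouped query using it:
SELECT c.CityName, SUM(TRY_CAST(c.Population AS INT)) AS SumOfPopulation
FROM dbo_City AS c
WHERE TRY_CAST(c.Population AS INT) IS NOT NULL -- skip values that won't convert cleanly
GROUP BY c.CityName
HAVING SUM(TRY_CAST(c.Population AS INT)) BETWEEN 189999 AND 200000
ORDER BY c.CityName;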
I ran into a similar problem where I had temperatures (temp) stored as nvarchar. I wanted to filter out all temperatures under 50F. Unfortunately,
WHERE (temp > '50')
would include temperatures that started with a - sign (-5, -6, ...); even worse, I discovered that temperatures over 100F were also getting discarded.
SOLUTION:
WHERE (CAST(temp AS SMALLINT) > 50)
I know this doesn't directly answer your question, but for the life of me I couldn't find a specific answer to my problem anywhere on the web, and I thought it would be lame to answer my own question, so I wanted to add my discovery to your answer.

Optimize query to avoid "Resources exceeded during query execution"

I've saved this query as a view:
SELECT nth(1,CodAlm) as FirstCode,
nth(1,DesAlm) as FirstDescription,
last(CodAlm) as LastCode,
Last(DesAlm) as LastDescription,
max(DATE(DataTic)) as LastVisit,
min(DATE(DataTic)) as FirstVisit,
DATEDIFF(CURRENT_TIMESTAMP(),TIMESTAMP(max(DATE(DataTic)))) as Diffdays,
count(distinct DATE(DataTic)) as countVisits,
count(distinct CodAlm) as NumberCodes,
sum(subtot) as Totalimport,
TarCli,
Last(nomcli) as Name,
Last(cogcli) as LastName,
Last(emailcli) as email,
Last(sexcli) as gender
FROM (SELECT CodAlm, DesAlm, DataTic,SubTot, TarCli, NomCli,CogCli,EmailCli,SexCli FROM [bime.Sales] where Year(DataTic)>2012 AND IsFirstLine="1" ORDER by TarCli, DataTic)
group each by tarcli
But when I run any query over this view, BigQuery returns Resources exceeded during query execution. I think the ORDER BY is the cause of my problem, but I need it to show my results correctly. How can I rewrite this query correctly? The bime.Sales table has 18 million rows.
See this question for more info:
What causes "resources exceeded" in BigQuery?
The error is likely caused by the GROUP EACH BY clause, and the most likely reason is that you have a skewed distribution of keys (i.e., one key with a disproportionate number of records). Can you look at your data distribution, and perhaps filter out any skewed keys?
Also note that the ORDER BY is not guaranteed to be preserved by GROUP EACH BY, so you need to apply the ordering after the GROUP EACH BY. You may find it useful to use analytic functions like FIRST_VALUE, NTH_VALUE, and LAST_VALUE with OVER(PARTITION BY tarcli ORDER BY DataTic) instead of GROUP EACH BY if you want to get reliable ordering.
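For instance, here is a minimal sketch of that approach in legacy BigQuery SQL, using the column names from your view (note that LAST_VALUE needs an explicit frame to see the whole partition; otherwise it only looks up to the current row):
SELECT TarCli,
  FIRST_VALUE(CodAlm) OVER (PARTITION BY TarCli ORDER BY DataTic) AS FirstCode,
  LAST_VALUE(CodAlm) OVER (PARTITION BY TarCli ORDER BY DataTic
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS LastCode
FROM [bime.Sales]
WHERE YEAR(DataTic) > 2012 AND IsFirstLine = "1"
You would still need a GROUP BY tarcli (or a SELECT DISTINCT over the result) to collapse the output to one row per customer.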
Some things you should consider and try (if you haven't done that so far):
1) Do you really need the GROUP EACH BY? Have you tried plain GROUP BY?
2) Have you tried using a table instead of a view? You could "materialize" the view to check whether the resource consumption decreases.
3) Can you shard the data? Perhaps put each year or month in a different table (split on DataTic). That would decrease the size of each table and, therefore, the resource usage.
Cheers!

Group By seems to add inordinate amount of computation in a simple Postgres query

I have a Postgres query as such:
select id
from ads_1 as a
join ads_2 as b
on a.id_key = b.id_key
where b.date between '2014-01-01' and '2014-01-02'
group by id
order by id;
It's nothing fancy but works fine -- only takes about 3 minutes when querying a large database to return the result.
My question is: why does this slight modification to the above code cause the query time to more than quadruple?
select id, b.ad_description
from ads_1 as a
join ads_2 as b
on a.id_key = b.id_key
where b.date between '2014-01-01' and '2014-01-02'
group by id, b.ad_description
order by id;
What is going on? The mere inclusion of one simple (albeit unique) column is bogging my query down. I am somehow asking Postgres to do a tremendously larger amount of work; for the life of me, I don't see how.
I'd like to preemptively apologize for not including any raw data. I'm hoping this simplified example of what I'm really facing is clear enough for some kind soul to make an enlightening comment. I can say that I'm going over a million rows in each table.
Thanks in advance.
I think the size of the return set is the issue. Since your result set now has a million-ish rows, that would explain the extra time (although it does seem excessive). A couple of things: first, the first column in the select is not bound to a range variable. It probably doesn't matter, but you might want to make it select b.id, b.ad_description instead of id, b.ad_description. Second, I normally use GROUP BY when one of the columns is an aggregate that isn't in the GROUP BY statement, like a count() or something. Maybe I'm missing something, but you might get the same result with:
select distinct b.id, b.ad_description
from ads_1 as a
join ads_2 as b
on a.id_key = b.id_key
where b.date between '2014-01-01' and '2014-01-02'
order by b.id, b.ad_description;
You might want to play with some of the numbers in the postgresql.conf file to beef up workspace/query space.
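For example, work_mem controls how much memory a sort or hash aggregate can use before spilling to disk; the 256MB below is just an illustrative starting point, not a recommendation, so tune it to your hardware and concurrency:
-- per session, or set it globally in postgresql.conf
SET work_mem = '256MB';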
Finally, I have a funny hunch that making the ORDER BY match the GROUP BY might also help performance.
A LIMIT 5 wouldn't help much with performance, because the entire query would need to finish before the first 5 rows could come out.
-g

SQL grouping with "Invalid use of group" error

I'll be upfront: this is a homework question, but I've been stuck on it for hours and I was just looking for a push in the right direction. First I'll give you the relation and the homework problem for background, then I'll explain my question:
Branch (BookCode, BranchNum, OnHand)
HW problem: List the BranchNum for all branches that have at least one book that has at least 10 copies on hand.
My question: I understand that I must take the SUM(OnHand) grouped by BookCode, but how do I then take that and group it by BranchNum? This is logically what I came up with, in various versions:
select distinct BranchNum
from Inventory
where sum(OnHand) >= 10
group by BookCode;
but I keep getting an error that says "Invalid use of group function."
Could someone please explain what is wrong here?
UPDATE:
I understand now: I had to use the HAVING clause. The basic form is this:
select distinct (what you want to display)
from (table)
group by (grouping columns)
having (condition on an aggregate)
Try this one.
SELECT BranchNum
FROM Inventory
GROUP BY BranchNum
HAVING SUM(OnHand) >= 10
Although all the comments on the question seem valid and add information, they all seem to miss why your query is not working. The reason is simple and is strictly related to the stage at which the sum is calculated.
The WHERE clause is the first thing that gets executed, which means it filters individual rows at the beginning. Then the GROUP BY comes into effect: it collapses rows into groups on the columns specified in the clause and applies the aggregate functions (if any).
So if you add an aggregate function to the WHERE clause, you're trying to aggregate before the data has been grouped, or even filtered. The HAVING clause gets executed after the GROUP BY and lets you filter on the aggregate functions, as they have already been calculated.
That's why you can write HAVING SUM(OnHand) >= 10 but you can't write WHERE SUM(OnHand) >= 10.
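To make that concrete, here are the two filters side by side against the Inventory table from the question; note that they answer different questions:
-- WHERE filters rows before grouping: branches with at least one
-- book that has 10+ copies on hand (the original homework wording)
SELECT DISTINCT BranchNum
FROM Inventory
WHERE OnHand >= 10;
-- HAVING filters groups after aggregation: branches whose total
-- copies across all books is 10 or more
SELECT BranchNum
FROM Inventory
GROUP BY BranchNum
HAVING SUM(OnHand) >= 10;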
Hope this helps!

Problems with distinct in SQL query

Okay, I've been trying this for a while and haven't succeeded yet; it's kind of mystical, so please help.
Here is my table. I need to select all distinct models and group/order them by vehicle_type. Everything is OK until I start using DISTINCT.
I'm using Postgres.
A little help with the query, please?
Assuming model could be shared between several vehicle types:
SELECT vehicle_type,model
FROM vehicle
GROUP BY vehicle_type,model
ORDER BY vehicle_type,model
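Since a GROUP BY with no aggregates just deduplicates, SELECT DISTINCT returns the same rows here. If your earlier attempts failed, a likely culprit is that Postgres requires every ORDER BY expression in a SELECT DISTINCT query to appear in the select list:
SELECT DISTINCT vehicle_type, model
FROM vehicle
ORDER BY vehicle_type, model;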
The data model does not adequately capture your reporting requirements, as the column data needs to be inspected to categorise it, but something like the following (extrapolating a possible relationship from your description):
SELECT CASE (vt.description ~ 'car$')
WHEN TRUE THEN 'car'
ELSE 'van'
END AS vehicle_group,
vt.description AS vehicle_sub_group,
COUNT (*) -- or whatever aggregates you might need
FROM vehicle v
INNER JOIN vehicle_type vt ON vt.vehicle_type = v.vehicle_type
GROUP BY 1,2;
might get you towards what you need in the stated case; however, it is a fragile way of dealing with data and will not cope well with additional complexity, e.g. if you need to further split car into saloon car, sports car, or 4WD, or van into flatbed, 7.5 ton, 15 ton, etc.