I want to use a calculated column to calculate another column in Hive. There are many answers to this type of question, but they are for SQL, not for Hive (HQL). Basically, I want an alternative to the command below.
select (colA + 1) as calCol1, (calCol1 + 2) as calCol2, (calCol2 + 1) as calCol3 from table;
(The actual logic is a lot more complex, and recomputing each expression is not preferred.)
I cannot use nested queries here, since I don't know in advance how many columns will be calculated from how many other columns. As I understand it, I would need one level of subquery for each calculated column I want to reuse, and so on.
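For example, as I understand it, reusing each calculated column would force nesting like the sketch below (just an illustration with the toy columns from the query above; my real logic is much more involved):

-- sketch of the nested-subquery approach I am trying to avoid
SELECT calCol1,
       calCol2,
       (calCol2 + 1) AS calCol3
FROM (
    SELECT calCol1,
           (calCol1 + 2) AS calCol2
    FROM (
        SELECT (colA + 1) AS calCol1
        FROM table
    ) t1
) t2;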
Correct me if I am wrong. Any help would be appreciated.
In Microsoft Access, I need to bind a cross-tab query to a report. To bind the query I need to know the column names in advance, which is the problem: I do not know which dates will be in the data, and those dates end up as the column names of the cross-tab query. Since the dates vary, the column names vary, and I can't bind the report. The best solution I can think of is to calculate a column that replaces each date with its order, so that the column names are always 1, 2, 3, 4, 5, 6, 7, 8. However, I haven't been able to create this calculated order column; no matter what I do, Access bugs out. Part of the cause is that this table feeds a cross-tab query with parameters, so Access bugs out at any additional complication.
How do I turn StatusDate into CalculatedOrder?
Edit:
I cannot use a subquery to make CalculatedOrder as is suggested in the comments, because the original query has parameters, and Access bugs out when a subquery draws on a query based on parameters.
The real problem was two bugs in Access. My guess is that Access will bug out at any additional complexity. If the original table has parameters, work with a copy of the table that doesn't have parameters. In all cases, before calculating the crosstab, copy just your values into another temporary table: in Access, a crosstab doesn't work if a crosstab column variable is a correlated subquery.
You can make the CalculatedOrder column using a correlated subquery, similar to the ranking technique at allenbrowne.com/ranking.html.
If the table has parameters, first copy the values into a table without parameters:
PARAMETERS myfirstparameter
SELECT StatusDate, CrossTabRow, CrossTabValue INTO NoParametersTable
FROM MyTableWithParameters
WHERE ...
Then make a query that returns only the grouped StatusDate values; this is the [MyGroupedStatusDateQuery] used below.
SELECT NoParametersTable.StatusDate
FROM NoParametersTable
GROUP BY NoParametersTable.StatusDate;
Then turn StatusDate into Order using a correlated subquery:
SELECT CrossTabRow, CrossTabValue,
    (SELECT Count(dupe.StatusDate) + 1
     FROM [MyGroupedStatusDateQuery] AS dupe
     WHERE dupe.StatusDate < [MyGroupedStatusDateQuery].StatusDate) AS [Order]
FROM NoParametersTable
INNER JOIN [MyGroupedStatusDateQuery]
    ON NoParametersTable.StatusDate = [MyGroupedStatusDateQuery].StatusDate;
Finally, make sure to turn this final query into a table without a correlated subquery by copying the values into another temporary table. Run the crosstab on that final temporary table; the crosstab will work because the final table contains plain values instead of parameters and a correlated subquery.
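For example, the final copy step and the crosstab could look something like this (OrderQuery and FinalCrossTabSource are placeholder names for the query above and the temporary table):

SELECT CrossTabRow, CrossTabValue, [Order] INTO FinalCrossTabSource
FROM OrderQuery;

TRANSFORM First(CrossTabValue)
SELECT CrossTabRow
FROM FinalCrossTabSource
GROUP BY CrossTabRow
PIVOT [Order];

First() is just a placeholder aggregate; use whatever aggregate your report actually needs.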
In QlikView, I am trying to make a conditional that only displays the table if it has fewer than 50000 rows. How would I go about doing this?
The table I am working with is used by users to create their own reports. They can choose which fields they want to see, and those fields appear next to a calculated value column. I have tried using the RowNo() and NoOfRows() functions but was not able to get anywhere with them. If you have any other ideas, I would appreciate it.
Thanks
Consider that the number of rows will be determined by the number of distinct values of your table's dimension, so you could use:
Count(Distinct myDimension) < 50000
where myDimension is the dimension of your table (or a concatenation of several dimensions, if your table has more than one).
Chris J's answer should be faster than the Count(Distinct ...) above, since it does not require runtime elimination of duplicates, but depending on your data you may need to create an extra table with a resident load to hold the counter correctly.
In my experience, however, users prefer a logical limit on their data (such as being forced to select a week) rather than a fixed limit on the number of records.
You can enforce this kind of limit with a condition like
GetSelectedCount(myWeekField) <= 1
As part of your load script, you should add an additional field to the table that you are loading:
,1 as RecordSum;
Then set a variable in the script:
set vRecordSum = sum(RecordSum);
Then, on the straight table, set the conditional show setting to the formula $(vRecordSum)<50000.
One easy way would be to use, as a condition:
Sum(1) < 50000
Sum(1) should represent the number of rows.
I'm experiencing some heavy performance issues with a query in SQLite. Currently there are around 20000 entries in the table activity_tbl and about 40 in the table activity_data_tbl. I have an index on both of the columns used in the query below, but it doesn't seem to have any effect on performance at all.
SELECT a._id, a.start_time + b.length AS time
FROM activity_tbl a INNER JOIN activity_data_tbl b
ON a.activity_data_id = b._data_id
WHERE time > ?
ORDER BY 2
LIMIT 1
As you can see, I select one column and a value created from adding two columns together. I guess this is what's causing the low performance, since the query is very fast if I just select a.start_time or b.length.
Do you guys have any suggestion for how I could optimize this?
Try putting an index on the time column. This should speed up the query.
This query is not optimizable using indexes for the filter part, since you are filtering and ordering on a calculated value. To optimize the query you will either need to filter on one of the actual table columns (start_time or length) or pre-compute the time values before querying.
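A rough sketch of the pre-computation route, assuming the schema can be changed (the end_time column and the index name are made up):

ALTER TABLE activity_tbl ADD COLUMN end_time INTEGER;
UPDATE activity_tbl
   SET end_time = start_time + (SELECT length FROM activity_data_tbl b
                                WHERE b._data_id = activity_tbl.activity_data_id);
CREATE INDEX idx_activity_end_time ON activity_tbl(end_time);
SELECT _id, end_time AS time FROM activity_tbl WHERE end_time > ? ORDER BY end_time LIMIT 1;

New and updated rows would of course have to keep end_time in sync (e.g. in application code or a trigger).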
The only place an index will help, and I assume you have one, is on b._data_id.
A compound index may help. According to its docs, SQLite tries to avoid accessing the table if the index contains enough information. So if the engine does its homework, it will recognize that the index is enough to compute the WHERE clause value and save some time. If that does not work, only pre-computation will do.
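For example, covering indexes like the ones below might be worth trying (index names are made up, _id is assumed to be the table's INTEGER PRIMARY KEY/rowid, which every index carries implicitly, and whether SQLite actually uses them should be checked with EXPLAIN QUERY PLAN):

CREATE INDEX idx_activity_join_start ON activity_tbl(activity_data_id, start_time);
CREATE INDEX idx_data_id_length ON activity_data_tbl(_data_id, length);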
If you are confronted with similar tasks more often, please read this: http://www.sqlite.org/rtree.html
I have a series of T-SQL queries that are running very slowly. One part of the queries that I suspect is causing problems is a series of casts that I have to do.
This is the problem: I have to combine four columns together as an nvarchar/varchar, because their combination forms a (semi-)unique key for an entry in another table (a horrible idea, I know, but I'm stuck with it).
The four columns are:
t_orno, t_reno, t_pono, t_srnb: all INT columns without indexes.
The way I have been doing this is like so:
Cast(t_orno AS nvarchar(10)) + '-' + Cast(t_reno as nvarchar(10)) +
'-' + Cast(t_pono as nvarchar(5)) + '-' + Cast(t_srnb as nvarchar(5))
Unfortunately I'm stuck with having to merge these columns together. Is there a better way of doing this? The queries need to be more efficient, and there has got to be a better way than casting all four individually.
Assume that the database is completely unchangeable -- which sadly it is... (I don't want to get into that.)
Thanks for your help.
EDIT: As per a request for more info on the tables:
Both tables being queried from contain only one index, and it is on the PK column. Again, note that nothing can be added or changed on these tables.
The table being joined contains the combination of those four columns:
BaanID > nvarchar, no index.
Have you tried the reverse, i.e. splitting the "entry in another table" on the "-" character and casting each piece to int? That may yield better performance.
I would try to use an indexed (persisted) view and create an index on it. Here is an article that may help you: http://technet.microsoft.com/en-us/library/cc917715.aspx.
Or you could add a computed column to the table containing the t_* columns and index this column.
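A rough sketch of that computed-column idea, assuming the table could be changed (which the question says it cannot; MyOrders and the index name are made up):

ALTER TABLE MyOrders ADD BaanKey AS
    (CAST(t_orno AS nvarchar(10)) + '-' + CAST(t_reno AS nvarchar(10)) +
     '-' + CAST(t_pono AS nvarchar(5)) + '-' + CAST(t_srnb AS nvarchar(5))) PERSISTED;
CREATE INDEX IX_MyOrders_BaanKey ON MyOrders (BaanKey);

The join could then compare BaanKey to BaanID directly, and the string would be built once per row instead of on every query.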
I believe this is the crucial point:
The table being joined contains the combination of those four columns: BaanID > nvarchar, no index
Unless you are dealing with relatively small tables, joining two tables together on columns that are not indexed is likely to be costly.
I have a table with millions of rows, and I need to do LOTS of queries which look something like:
select max(date_field)
from my_table
where varchar_field1 = 'something'
group by varchar_field2;
My questions are:
Is there a way to create an index to help with this query?
What (other) options do I have to enhance performance of this query?
An index on (varchar_field1, varchar_field2, date_field) would be of most use. The database can use the first index field for the where clause, the second for the group by, and the third to calculate the maximum date. It can complete the entire query just using that index, without looking up rows in the table.
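A sketch of that index (my_table and the index name are placeholders matching the query above):

create index ix_field1_field2_date
    on my_table (varchar_field1, varchar_field2, date_field);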
Obviously, an index on varchar_field1 will help a lot.
You can create an extra table for yourself with the columns
varchar_field1 (unique index)
max_date_field
You can set up triggers on inserts, updates, and deletes on the table you're searching that will maintain this little table: whenever a row is added or changed, update the corresponding row in this table.
We've had good success with performance improvement using this refactoring technique. In our case it was made simpler because we never delete rows from the table until they're so old that nobody ever looks up the max field. This is an especially helpful technique if you can add max_date_field to some other table rather than create a new one.
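A minimal sketch of that extra table and an insert trigger, following the two-column design described above and assuming MySQL syntax with placeholder names (update and delete triggers would need similar handling, or a periodic recompute):

create table max_date_summary (
    varchar_field1 varchar(100) not null primary key,
    max_date_field datetime
);

create trigger trg_my_table_ai after insert on my_table
for each row
    insert into max_date_summary (varchar_field1, max_date_field)
    values (new.varchar_field1, new.date_field)
    on duplicate key update
        max_date_field = greatest(coalesce(max_date_field, new.date_field), new.date_field);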