I am trying to resolve an issue where a certain field returns a null value. I would like to auto-fill that field with the value from another record whose other relevant fields are the same.
Example (for both results):
*GradYear: 2018 ----
StudentName: Jake ----
*SchoolNumber: 54 ----
*StateCode: NA11 ----
CountyCode: MA02 ----
*SchoolName: Hillsburn ----
*GradYear: 2018 ----
StudentName: Sarah ----
*SchoolNumber: 54 ----
*StateCode: NA11 ----
CountyCode: NULL ----
*SchoolName: Hillsburn ----
As seen above, the CountyCode for Sarah returns a null value. I am attempting to fill in the value for CountyCode automatically when the other key values are the same between students. (The necessary key values are marked with a '*'.)
Also, I am attempting to solve this without using the "Previous" feature or hard-coded information so that it may be accomplished with any data.
My original attempt was to use a simple if/IsNull statement along with the Peek function, but the field still returned a null value.
if((isnull(CountyCode)), Peek(CountyCode), CountyCode) as CountyCode
Any help with this would be greatly appreciated! Thank you in advance.
I would use applymap for this.
Let's say that SchoolNumber is unique to CountyCode.
So first, let's load our mapping table:
CountyCode_Map:
mapping load distinct SchoolNumber, CountyCode
from Data.qvd (qvd) where len(CountyCode)>0;
Now, when loading your data, use this for CountyCode:
applymap('CountyCode_Map',SchoolNumber) as CountyCode
In case SchoolNumber is not unique to CountyCode, you can use any other field or a concatenation of fields.
For more info on applymap: link
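The idea behind the mapping table can be sketched outside Qlik as a two-pass lookup. This is only an illustration in Python, not Qlik script: the record layout, the `fill_county_codes` helper and the sample rows are assumptions made up for the example, and unlike the applymap expression above it only touches rows whose CountyCode is missing.

```python
# Hypothetical records mirroring the example in the question; None stands in for NULL.
rows = [
    {"StudentName": "Jake", "SchoolNumber": 54, "CountyCode": "MA02"},
    {"StudentName": "Sarah", "SchoolNumber": 54, "CountyCode": None},
]


def fill_county_codes(records, key="SchoolNumber", target="CountyCode"):
    """Fill missing target values from other records that share the same key."""
    # Pass 1: build the mapping from rows that do have a value
    # (the counterpart of the "where len(CountyCode)>0" mapping load).
    lookup = {}
    for record in records:
        if record[target]:
            lookup.setdefault(record[key], record[target])
    # Pass 2: keep existing values, look up the missing ones.
    return [
        {**record, target: record[target] or lookup.get(record[key])}
        for record in records
    ]


filled = fill_county_codes(rows)
```

If the key is a combination of fields (for example GradYear, SchoolNumber and StateCode), the same sketch works with a tuple of those values as the dictionary key.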
I have a situation where I need to extract some information from one column on the basis of another column. The table is quite big: it has almost 50 columns and 70M records. The sample below explains the situation.
id idkey ValuesNeededInAnotherColumn
----- --------------- ---------------------------
123 10012300152 152
12340 100001234001400 1400
12 20123152 3152
253 5000253
The table has a column idkey, which is made up of several values, like companycode(100)+id(123)+custcode(00152) = 10012300152. The lengths of the company code and the id are not fixed, but the order of the parts is, so custcode is always whatever follows the id value in idkey. If nothing follows the id value in idkey, custcode is null. The solution I am trying to implement is to find the position of the id value within idkey, take the substring from there to the end, and cast it to int. This takes too much time because I have to convert the DataFrame into RDDs, as I could not find a way to do it directly on the DataFrame.
If anyone has an optimized solution that can be applied to such a big table, please help.
I would have concerns about ambiguity. For example:
id idkey ValuesNeededInAnotherColumn
----- --------------- ---------------------------
12 120123012 123012 or 3012 or null?
Setting that aside, it's a relatively simple use of the regexp_extract function.
SELECT id, idkey,
cast(regexp_extract(idkey, concat(id, '(.*)'), 1) as int) as ValuesNeededInAnotherColumn
FROM df;
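To make the behavior of that pattern concrete, including the ambiguous case, here is a small Python sketch of the same "id followed by everything else" match. The `extract_custcode` helper is hypothetical and uses Python's `re` module rather than Spark; it is only meant to show which part of idkey the pattern captures.

```python
import re


def extract_custcode(id_value, idkey):
    """Return the digits that follow the first occurrence of id in idkey, or None."""
    # Same shape as concat(id, '(.*)'): the id, then capture the rest of the string.
    match = re.search(re.escape(str(id_value)) + r"(.*)", str(idkey))
    if match is None or match.group(1) == "":
        return None  # nothing after the id, so there is no custcode
    return int(match.group(1))


samples = [(123, "10012300152"), (12340, "100001234001400"), (12, "20123152"), (253, "5000253")]
results = [extract_custcode(i, k) for i, k in samples]

# The ambiguous row: the first "12" is at the very start of the key.
ambiguous = extract_custcode(12, "120123012")
```

For the four sample rows this gives 152, 1400, 3152 and None. For the ambiguous row it returns 123012, because the search stops at the first place the id appears, which may or may not be the intended split.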
I have five columns.
E.g.
Column1: Name
Column2: surname
Column3: mapping
Column4: Mapped data
Columns contain data like
Name Surname Mapping Name1 Surname1
1 ABC 1 AAAA 3 ABC QQQQ
2 XYZ 2 XXXX 1 XYZ AAAA
3 OPQ 3 QQQQ 4 OPQ RRRR
4 RST 4 RRRR 2 RST XXXX
My aim is to map the Name column to the Surname column by using the Mapping column, and to store the result in the Name1 and Surname1 columns. I have more data in the Name and Surname columns. When the user enters a number in the Mapping column, the surname in that row number should automatically be mapped to the name, and the result should be copied into Name1 and Surname1.
I have no idea how to achieve this using VBA. Please help.
Amar, there are certainly plenty of ways to go about this using Excel's built in functions, however, since you asked about a VBA solution, here you go:
Function Map(n)
Map = Cells(n + 1, 2)
End Function
Placing the above code into the VBA editor of your project will allow you to use this custom function in the same way you would any of Excel's built-in functions. That is, entering =Map(C3) into any cell should give you the result you're after (where C3 is the cell containing your mapping number). The function works by returning the data in row n + 1 (n is the number in your mapping column; the + 1 accounts for the header row) and column 2 (the column containing your surname). The data in column "Name1" appears to always be the same as that in column "Name", so the formula in your "Name1" column would simply be =A2
If this does not solve your problem, or you need further guidance, please let me know.
Supplement
@Amar, the comment by @freakfeuer is spot on. VBA is really overkill for something as simple as this and, as he points out, portability and security are both significant drawbacks. Offset is a fine alternative.
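For readers who want to see the lookup logic on its own, it amounts to indexing the surname column by the mapping number. Below is a minimal Python sketch of that idea, not VBA; the lists are a hypothetical copy of the sample data from the question with the header row left out, and `map_surname` is a made-up name.

```python
# Hypothetical copy of the Surname and Mapping columns, header row excluded.
surnames = ["AAAA", "XXXX", "QQQQ", "RRRR"]
mapping = [3, 1, 4, 2]


def map_surname(n, column=surnames):
    """Return the surname stored in data row n (1-based, as typed by the user)."""
    return column[n - 1]


surname1 = [map_surname(n) for n in mapping]
```

With the sample data this yields QQQQ, AAAA, RRRR, XXXX, which matches the Surname1 column shown in the question.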
It's a bit of a long and convoluted story why I need to do this, but I will be getting a query string which I will then be executing with this code
EXECUTE sp_ExecuteSQL
I need to set the aliases of all the columns to "value". There could be a variable number of columns in the queries that are being passed in, and they could be all sorts of data types, for example
SELECT
Company, AddressNo, Address1, Town, County, Postcode
FROM Customers
SELECT
OrderNo, OrderType, CustomerNo, DeliveryNo, OrderDate
FROM Orders
Is this possible and relatively simple to do, or will I need to get the aliases included in the SQL queries (it would be easier not to do this, if it can be avoided and done when we process the query)
---Edit---
As an example, the output from the first query would be
Company AddressNo Address1 Town County Postcode
--------- --------- ------------ ------ -------- --------
Dave Inc 12345 1 Main Road Harlow Essex HA1 1AA
AA Tyres 12234 5 Main Road Epping Essex EP1 1PP
I want it to be
value value value value value value
--------- --------- ------------ ------ -------- --------
Dave Inc 12345 1 Main Road Harlow Essex HA1 1AA
AA Tyres 12234 5 Main Road Epping Essex EP1 1PP
So each of the columns has an alias of "value".
I could do this with
SELECT
Company AS 'value', AddressNo AS 'value', Address1 AS 'value', Town AS 'value', County AS 'value', Postcode AS 'value'
FROM Customers
but it would be better (it would save additional complexity in other steps in the process chain) if we didn't have to manually alias each column in the SQL we're feeding in to this section of the process.
Regarding the XY problem, this is a tiny section in a very large process chain, it would take pages to explain the whole process in detail - in essence, we're taking code out of our database triggers and putting it into a dynamic procedure; then we will have frontends that users will access to "edit" the SQL statements that are called by the triggers and these will then dynamically feed the results out into other systems. It works if we manually alias the SQL going in, but it would be neater if there was a way we could feed clean SQL into the process and then apply the aliases when the SQL is processed - it would keep us DRY, to start with.
I do not fully understand what you are trying to accomplish, but I believe the answer is no: there is no built-in way to globally predefine or override column aliases for ad hoc queries. You will need to code it yourself.
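One way to "code it yourself" is to leave the SQL untouched and relabel the columns in the layer that executes the query. The sketch below is an assumption-heavy illustration: it uses Python with an in-memory SQLite database as a stand-in for SQL Server, and `run_with_uniform_alias` plus the cut-down Customers table are hypothetical names invented for the example.

```python
import sqlite3


def run_with_uniform_alias(conn, sql, alias="value"):
    """Run an arbitrary SELECT and return (headers, rows) with every header replaced."""
    cursor = conn.execute(sql)
    # cursor.description has one entry per result column, whatever the query selects.
    headers = [alias for _ in cursor.description]
    return headers, cursor.fetchall()


# Stand-in data; the real process would receive the connection and the SQL text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customers (Company TEXT, AddressNo INTEGER, Town TEXT)")
conn.execute("INSERT INTO Customers VALUES ('Dave Inc', 12345, 'Harlow')")

headers, data = run_with_uniform_alias(conn, "SELECT Company, AddressNo, Town FROM Customers")
```

The incoming SQL stays clean, and the number of "value" headers follows the number of columns in whatever query is passed in.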
I want to group by a given field and get the output with grouped fields. Below is an example of what I am trying to achieve:-
Imagine a table named 'sample_table' with two columns as below:-
F1 F2
001 111
001 222
001 123
002 222
002 333
003 555
I want to write Hive Query that will give the below output:-
001 [111, 222, 123]
002 [222, 333]
003 [555]
In Pig, this can be very easily achieved by something like this:-
grouped_relation = GROUP sample_table BY F1;
Can somebody please suggest if there is a simple way to do so in Hive? What I can think of is to write a User Defined Function (UDF) for this but this may be a very time consuming option.
The built-in aggregate function collect_set (documented here) gets you almost what you want. It would actually work on your example input:
SELECT F1, collect_set(F2)
FROM sample_table
GROUP BY F1
Unfortunately, it also removes duplicate elements and I imagine this isn't your desired behavior. I find it odd that collect_set exists, but no version to keep duplicates. Someone else apparently thought the same thing. It looks like the top and second answer there will give you the UDAF you need.
collect_set actually works as expected since a set as per definition is a collection of well defined and distinct objects i.e. objects occur exactly once or not at all within a set.
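The difference between a set-style and a list-style aggregate is easy to see on a small example. The Python sketch below is only an illustration of the two semantics, not Hive code; the rows are the sample table from the question plus one duplicate row added as an assumption to make the difference visible. (Later Hive releases also ship a built-in collect_list that keeps duplicates, if your version has it.)

```python
from collections import defaultdict

# Sample table from the question, plus one hypothetical duplicate row ("003", "555").
rows = [
    ("001", "111"), ("001", "222"), ("001", "123"),
    ("002", "222"), ("002", "333"),
    ("003", "555"), ("003", "555"),
]

as_list = defaultdict(list)  # keeps every value, like a list-style aggregate
as_set = defaultdict(set)    # drops duplicates, like collect_set
for f1, f2 in rows:
    as_list[f1].append(f2)
    as_set[f1].add(f2)
```

Here `as_list["003"]` holds two entries while `as_set["003"]` holds one; for the groups without duplicates the two aggregates contain the same values.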
I have a large table (TokenFrequency) which has millions of rows in it. The TokenFrequency table that is structured like this:
Table - TokenFrequency
id - int, primary key
source - int, foreign key
token - char
count - int
My goal is to select all of the rows in which two sources have the same token in it. For example if my table looked like this:
id --- source --- token --- count
1 ------ 1 --------- dog ------- 1
2 ------ 2 --------- cat -------- 2
3 ------ 3 --------- cat -------- 2
4 ------ 4 --------- pig -------- 5
5 ------ 5 --------- zoo ------- 1
6 ------ 5 --------- cat -------- 1
7 ------ 5 --------- pig -------- 1
I would want a SQL query to give me source 1, source 2, and the sum of the counts. For example:
source1 --- source2 --- token --- count
---- 2 ----------- 3 --------- cat -------- 4
---- 2 ----------- 5 --------- cat -------- 3
---- 3 ----------- 5 --------- cat -------- 3
---- 4 ----------- 5 --------- pig -------- 6
I have a query that looks like this:
SELECT F.source AS source1, S.source AS source2, F.token,
(F.count + S.count) AS sum
FROM TokenFrequency F
INNER JOIN TokenFrequency S ON F.token = S.token
WHERE F.source <> S.source
This query works fine but the problems that I have with it are that:
I have a TokenFrequency table that has millions of rows and therefore need a faster alternative to obtain this result.
The current query that I have is giving duplicates. For example, it's selecting:
source1=2, source2=3, token=cat, count=4
source1=3, source2=2, token=cat, count=4
This isn't too much of a problem, but if there is a way to eliminate those and in turn obtain a speed increase, it would be very useful.
The main issue is the speed of the query: with my current query it takes hours to complete. I believe the INNER JOIN of the table to itself is the problem. I'm sure there has to be a way to eliminate the inner join and get similar results using just one instance of the TokenFrequency table. Fixing the second problem I mentioned might also speed up the query.
I need a way to restructure this query to provide the same results in a faster, more efficient manner.
Thanks.
I'd need a little more info to diagnose the speed issue, but to remove the dups, add this to the WHERE:
AND F.source<S.source
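A quick way to check the effect of that condition is to run the query against the sample rows from the question. The sketch below uses Python with an in-memory SQLite database as a stand-in for the real server; the only changes to the query are `<` in place of `<>`, the alias `total` instead of `sum`, and an ORDER BY added to make the output stable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TokenFrequency ("
    "id INTEGER PRIMARY KEY, source INTEGER, token TEXT, count INTEGER)"
)
conn.executemany(
    "INSERT INTO TokenFrequency (id, source, token, count) VALUES (?, ?, ?, ?)",
    [(1, 1, "dog", 1), (2, 2, "cat", 2), (3, 3, "cat", 2), (4, 4, "pig", 5),
     (5, 5, "zoo", 1), (6, 5, "cat", 1), (7, 5, "pig", 1)],
)

# F.source < S.source keeps each unordered pair of sources exactly once.
pairs = conn.execute(
    """
    SELECT F.source AS source1, S.source AS source2, F.token,
           (F.count + S.count) AS total
    FROM TokenFrequency F
    INNER JOIN TokenFrequency S ON F.token = S.token
    WHERE F.source < S.source
    ORDER BY F.token, source1, source2
    """
).fetchall()
```

On the sample data this returns the four rows listed in the question: (2, 3, cat, 4), (2, 5, cat, 3), (3, 5, cat, 3) and (4, 5, pig, 6).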
Try this:
SELECT token, GROUP_CONCAT(source), SUM(count)
FROM TokenFrequency
GROUP BY token;
This should run a lot faster and also eliminate the duplicates. But the sources will be returned in a comma-separated list, so you'll have to explode that in your application.
You might also try creating a compound index over the columns token, source, count (in that order) and analyze with EXPLAIN to see if MySQL is smart enough to use it as a covering index for this query.
update: I seem to have misunderstood your question. You don't want the sum of counts per token, you want the sum of counts for every pair of sources for a given token.
I believe the inner join is the best solution for this. An important guideline for SQL is that if you need to calculate an expression with respect to two different rows, then you need to do a join.
However, one optimization technique that I mentioned above is to use a covering index so that all the columns you need are included in an index data structure. The benefit is that all your lookups are O(log n), and the query doesn't need to do a second I/O to read the physical row to get other columns.
In this case, you should create the covering index over columns token, source, count as I mentioned above. Also try to allocate enough cache space so that the index can be cached in memory.
If token isn't indexed, it certainly should be.
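Whether the optimizer actually treats the compound index as a covering index is something to verify rather than assume. The sketch below shows the shape of that check using Python and SQLite's EXPLAIN QUERY PLAN as a stand-in; MySQL's EXPLAIN output looks different (there, "Using index" in the Extra column is the thing to look for), and the index name `idx_token_source_count` is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TokenFrequency ("
    "id INTEGER PRIMARY KEY, source INTEGER, token TEXT, count INTEGER)"
)
# Compound index over token, source, count (in that order), as suggested above.
conn.execute(
    "CREATE INDEX idx_token_source_count ON TokenFrequency (token, source, count)"
)

plan = conn.execute(
    """
    EXPLAIN QUERY PLAN
    SELECT F.source, S.source, F.token, (F.count + S.count)
    FROM TokenFrequency F
    INNER JOIN TokenFrequency S ON F.token = S.token
    WHERE F.source < S.source
    """
).fetchall()

# The last field of each plan row is the human-readable description of that step.
plan_details = [row[3] for row in plan]
uses_covering_index = any("COVERING INDEX" in detail for detail in plan_details)
```

If the plan mentions the index as covering, the join can be answered from the index alone without reading the table rows, which is the benefit described above.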