Column manipulation in FastLoad - SQL

Hello, I am new to Teradata. I am loading a flat file into my TD DB using FastLoad.
My data set (a CSV file) has some issues: some rows contain proper data in the city column, but others contain NULL there. In those rows the city's value has been shifted into the next column (zip code), and so on, so some rows end up with an extra column because of the extra NULL. An example is given below. How can I resolve this kind of issue in FastLoad? Can someone answer this with a SQL example?
City | Zipcode | country
xyz | 12 | Esp
abc | 11 | Ger
NULL | def (city's data) | 12 (zipcode's data) | Por (country's data)

What about a different approach? Instead of solving this in FastLoad, load your data into a temporary table such as DATABASENAME.CITIES_TMP with a structure like the one below:
City | zip_code | country | column4
xyz | 12 | Esp |
NULL | abc | 12 | Por
In the next step, create the target table DATABASENAME.CITY with the structure
City | zip_code | country
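A minimal DDL sketch of both tables (the data types and lengths here are assumptions; adjust them to your actual data):
CREATE MULTISET TABLE DATABASENAME.CITIES_TMP (
    City     VARCHAR(100),  -- may hold the text 'NULL' (or be empty) for the shifted rows
    zip_code VARCHAR(20),
    country  VARCHAR(50),
    column4  VARCHAR(50)    -- catches the value that spills over in the broken rows
) PRIMARY INDEX (City);
CREATE MULTISET TABLE DATABASENAME.CITY (
    City     VARCHAR(100),
    zip_code VARCHAR(20),
    country  VARCHAR(50)
) PRIMARY INDEX (City);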
As a final step you need to run 2 INSERT queries:
INSERT INTO DATABASENAME.CITY (City, zip_code, country)
SELECT City, zip_code, country FROM DATABASENAME.CITIES_TMP
WHERE CITY not like 'NULL'/* WHERE CITY is not null - depends if null is a text value or just empty cell*/;
INSERT INTO DATABASENAME.CITY (City, zip_code, country)
SELECT Zip_code, country, column4 FROM DATABASENAME.CITIES_TMP
WHERE CITY like 'NULL' /* WHERE CITY is null - depends if null is a text value or just empty cell*/
Of course this will only work if all your data looks exactly like the sample you provided.
It will also only work if you need to do this once in a while. If you need to load data a few times a day it will be a little cumbersome, and then you should build some kind of ETL process with, for example, a tool like Talend.

Related

How do I select a SQL dataset where values in the first row are the column names?

I have data that looks like this:
ID RowType Col_1 Col_2 Col_3 ... Col_n
1 HDR FirstName LastName Birthdate
2 DTL Steve Bramblet 1989-01-01
3 DTL Bob Marley 1967-03-12
4 DTL Mickey Mouse 1921-04-25
And I want to return a table or dataset that looks like this:
ID FirstName LastName Birthdate
2 Steve Bramblet 1989-01-01
3 Bob Marley 1967-03-12
4 Mickey Mouse 1921-04-25
where n = 255 (so there's a limit of 255 Col_ fields)
***EDIT: The data in the HDR row is arbitrary, so I'm just using FirstName, LastName, Birthdate as examples. This is why I thought it would need to be dynamic SQL, since the column names I want to end up with will change based on the values in the HDR row. THX!***
If there's a purely SQL solution that is what I'm after. It's going into an ETL process (SSIS) so I could use a Script task if all else fails.
Even if I could return a single row that would be a solution. I was thinking there might be a dynamic sql solution for something like this:
select Col_1 as FirstName, Col_2 as LastName, Col_3 as Birthdate
Not sure if your first data snippet is already in an Oracle table or not, but if it is in a CSV file then you have the option to skip headers during loading.
If the data is already in a table then you can use UNION to get the desired result:
SELECT * FROM table_name WHERE RowType = 'HDR'
UNION
SELECT * FROM table_name WHERE RowType = 'DTL'
If you need FirstName etc. as column headers then you do not need to do anything; design the destination table's columns as per your requirement.
Sorry, posted an answer but I completely misread that you had your desired column headers as data in the source table.
One trivial solution (though it requires more IO) would be to dump the table data to a flat file without headers, then read it back in, but this time tell SSIS that the first row has headers, and ignore the RowType column. Make sure you sort the data correctly before writing it out to the intermediate file!
To dump to a file without headers, you have to set ColumnNamesInFirstDataRow to false. Set this in the properties window, not by editing the connection. More info in this thread
If you have a lot of data, this is obviously very inefficient.
Try the following using row_number. Here is the demo.
with cte as
(
    select
        *,
        row_number() over (order by id) as rn
    from myTable
)
select
    ID,
    Col_1 as FirstName,
    Col_2 as LastName,
    Col_3 as Birthdate
from cte
where rn > 1
output:
| id | firstname | lastname | birthdate |
| --- | --------- | -------- | ---------- |
| 2 | Steve | Bramblet | 1989-01-01 |
| 3 | Bob | Marley | 1967-03-12 |
| 4 | Mickey | Mouse | 1921-04-25 |
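Since the edit says the HDR values are arbitrary, the aliases themselves have to come from the data. A hedged dynamic SQL sketch of that idea (the table name myTable is an assumption, and only three columns are shown; the same pattern would have to be extended, or generated, up to Col_255):
DECLARE @sql nvarchar(max);
-- Build the alias list from the single HDR row, bracket-quoting each header value.
SELECT @sql = N'SELECT ID, '
            + N'Col_1 AS ' + QUOTENAME(Col_1) + N', '
            + N'Col_2 AS ' + QUOTENAME(Col_2) + N', '
            + N'Col_3 AS ' + QUOTENAME(Col_3)
            + N' FROM myTable WHERE RowType = ''DTL'''
FROM myTable
WHERE RowType = 'HDR';
-- Run the generated SELECT; the result set carries the HDR values as column names.
EXEC sp_executesql @sql;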
Oh, well. There is a pure SSIS approach, assuming the source is a SQL table. Here it is, rather sketchy.
Create a Variable oColSet with type Object, and 255 variables of type String and names sColName_1, sColName_2 ... sColName_255.
Create a SQL Task with query like select top(1) Col_1, Col_2, ... Col_255 from Src where RowType = 'HDR', set task properties ResultSet = Full Result Set, on result set tab - set Result Name to 0 and Variable Name to oColSet.
Add ForEach Loop enumerator, set it as ForEach ADO Enumerator, ADO object source variable - set to oColSet, Enumeration mode = Rows in the first table. Then, on the Variable Mappings tab - define as such example (Variable - Index) - sColName_1 - 0, sColName_2 - 1, ... sColName_255 - 254.
Create a variable sSQLQuery with type String and Variable Expression like
"SELECT Col_1 AS ["+#[User::sColName_1]+"],
Col_2 AS ["+#[User::sColName_2]+"],
...
Col_255 AS ["+#[User::sColName_255]+"]
FROM Src WHERE RowType='DTL'"
In the ForEach Loop - add your dataflow, in the OLEDB Source - set Data access mode to SQL command from variable and provide variable name User::sSQLQuery. On the Data Flow itself - set DelayValidation=true.
The main idea of this design: retrieve all column names and store them in a temporary variable (step 2). Step 3 then parses the result set and places the values into the corresponding variables, the first column (index 0) into sColName_1 and so on. Step 4 defines a SQL command as an expression, which is evaluated every time the variable is read. Finally, in the ForEach Loop (where the parsing is done), you run your dataflow.
Limitations of SSIS: data types and column names have to be the same at runtime as at design time. If you need to further store your dataset into SQL, let me know so I can adjust the proposed solution.

How to match phone number prefix to country from phonenumber in SQL

I am trying to extract the country code prefix from a list of numbers, and match them to the region that they belong to. The data might look something like this:
| id | phone_number |
|----|----------------|
| 1 | +27000000000 |
| 2 | +16840000000 |
| 3 | +10000000000 |
| 4 | +27000000000 |
The country codes here are:
American Samoa: +1684
United States and Caribbean: +1
South Africa: +27
And the desired result would be something like this:
| country | count |
|-----------------------------|-------|
| South Africa | 2 |
| American Samoa | 1 |
| United States and Caribbean | 1 |
There are some difficulties because country prefix codes vary from 1 to 4 digits and, even without the country prefix, phone number length varies from place to place.
I do not have write access to this DB, so adding another column, while probably the best solution, will not work in this use case.
This is my current solution:
SELECT
CASE
WHEN SUBSTRING(phone_number,1,5) = '+1684' THEN 'American Samoa'
WHEN SUBSTRING(phone_number,1,5) = '+1264' THEN 'Anguilla'
...
WHEN SUBSTRING(phone_number,1,5) = '+1599' THEN 'Saint Martin'
WHEN SUBSTRING(phone_number,1,4) = '+355' THEN 'Albania'
WHEN SUBSTRING(phone_number,1,4) = '+213' THEN 'Algeria'
...
WHEN SUBSTRING(phone_number,1,4) = '+263' THEN 'Zimbabwe'
WHEN SUBSTRING(phone_number,1,3) = '+93' THEN 'Afghanistan'
WHEN SUBSTRING(phone_number,1,3) = '+54' THEN 'Argentina'
...
WHEN SUBSTRING(phone_number,1,3) = '+58' THEN 'Venezuela'
WHEN SUBSTRING(phone_number,1,3) = '+84' THEN 'Vietnam'
WHEN SUBSTRING(phone_number,1,2) = '+1' THEN 'United States and Caribbean'
WHEN SUBSTRING(phone_number,1,2) = '+7' THEN 'Kazakhstan, Russia'
ELSE 'unknown'
END as country_name,
count(*)
FROM users
GROUP BY country_name
order by count desc
There are ~205 WHEN ... THEN cases. It seems to be very inefficient and times out. I assume this is because it runs the pattern matching on every row. This would need to scale to roughly tens of millions of rows.
Is there a more efficient way to do this?
I am using PostgreSQL 9.6.16.
Even though the whole table has to be read, an index could help here. In order to aggregate the data per country code, the DBMS must sort all rows by country code. Sorting is an expensive operation, especially on large data sets. If you had an index on the country codes, the DBMS would find the codes already pre-sorted in the index and could avoid the work of sorting the data.
You don't have the separate country code in a column, but each phone number starts with the code, so you could index the complete phone number:
create index idx on users (phone_number);
Then you must make it obvious to the DBMS that you are interested in the beginnings of the string, so it will consider using the index. Invoking a function like SUBSTRING on the phone number is likely to make the DBMS blind to this. Use LIKE instead. According to the docs (https://www.postgresql.org/docs/9.3/indexes-types.html), indexes on strings can be used with LIKE 'something%':
WHEN phone_number LIKE '+1684%' THEN 'American Samoa'
There is no guarantee this will help, but it's worth a try I think. It depends on whether the optimizer sees the advantage of using the pre-sorted phone numbers from the index.
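Putting it together, a minimal sketch of the rewritten query (using the users table from the question and only a few of the prefixes; as in the original CASE, the longer prefixes have to be checked before the shorter ones so that '+1684' wins over '+1'):
SELECT CASE
           WHEN phone_number LIKE '+1684%' THEN 'American Samoa'
           WHEN phone_number LIKE '+27%'   THEN 'South Africa'
           WHEN phone_number LIKE '+1%'    THEN 'United States and Caribbean'
           ELSE 'unknown'
       END AS country_name,
       count(*)
FROM users
GROUP BY country_name
ORDER BY count(*) DESC;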

How to do an exact match followed by ORDER BY in PostgreSQL

I'm trying to write a query that puts some results (in my case a single result) at the top, and then sorts the rest. I have yet to find a PostgreSQL solution.
Say I have a table called airports like so.
id | code | display_name
----+------+----------------------------
1 | SDF | International
2 | INT | International Airport
3 | TES | Test
4 | APP | Airport Place International
In short, I have a query in a controller method that gets called asynchronously when a user searches for an airport by text, either by code or display_name. However, when a user types an input that matches a code exactly (airport codes are unique), I want that result to appear first, with all airports that also have int in their display_name displayed afterwards in ascending order. If there is no exact match, it should return any wildcard matches sorted by display_name ascending. So if a user types in INT, the row (2, INT, International Airport) should be returned first, followed by the others:
Results:
1. INT | International Airport
2. APP | Airport Place International
3. SDF | International
Here's the kind of query I was tinkering with; it's slightly simplified to make sense outside the context of my application, but it's the same concept nonetheless.
SELECT * FROM airports a
WHERE a.display_name LIKE 'somesearchtext%'
ORDER BY (CASE WHEN a.code = 'somesearchtext' THEN a.code ELSE a.display_name END)
Right now the results if I type INT I'm getting
Results:
1. APP | Airport Place International
2. INT | International Airport
3. SDF | International
My ORDER BY must be incorrect but I can't seem to get it
Any help would be greatly appreciated :)
If you want an exact match on code to return first, then I think this does the trick:
SELECT a.*
FROM airports a
WHERE a.display_name LIKE 'somesearchtext%'
ORDER BY (CASE WHEN a.code = 'somesearchtext' THEN 1 ELSE 2 END),
a.display_name
You could also write this as:
ORDER BY (a.code = 'somesearchtext') DESC, a.display_name
This isn't standard SQL, but it is quite readable.
I think you can achieve your goal by using a UNION.
First get the exact match and then add that result to the rest of the data, as you wish.
e.g. (you will need to work on this a bit; ORDER BY cannot be applied to the individual branches of a UNION, so a helper sort column is used here to keep the exact match on top)
SELECT id, code, display_name FROM (
    SELECT a.*, 0 AS match_rank FROM airports a
    WHERE a.code = 'somesearchtext'
    UNION
    SELECT a.*, 1 AS match_rank FROM airports a
    WHERE a.code != 'somesearchtext'
      AND a.display_name LIKE 'somesearchtext%'
) matches
ORDER BY match_rank, display_name

Joining QlikView tables results in unwanted repeated entries

I have two ';'-separated .csv files that I loaded into QlikView.
The first file contains:
ID | date/time | Price | Postalcode
The second contains:
ID | Postalcode | City | Region
I first did an extract to a .qvd file, and in the .qvw I added the following code:
Customerpostalcodes:
LOAD %key_CustomerId,
TimeDate,
Price,
%key_postalcode
FROM
[$(vExtract)Customerpostalcodes.qvd]
(qvd);
Postcodes:
LEFT JOIN (Customerpostalcodes)
LOAD %key_postalcodeID,
%key_postalcode,
City,
Region
FROM
[$(vExtract)Postalcodes.qvd]
(qvd);
Now, here in Belgium you can have multiple cities for one postal code.
For example, if the postal code is "9700" then I have 15 cities, but if the price for postal code 9700 is €50 then I get 15 times €50. How can I tell QlikView to only count and sum this price once per postal code?
Thx
The problem here is that you have a many-to-many relationship in your Postcodes table so JOINing this into the first table will explode the table with the repeated entries.
Perhaps you need to try and figure out a way of expanding your source data for Customerpostalcode so that the table also contains more address information for the customer (for example by including City). This way you could then join the tables on both %key_postalcode and City which would then result in hopefully single matches which would then solve your issue.

PostgreSQL: Self-referencing, flattening join to table which contains tree of objects

I have a relatively large (as in >10^6 entries) table called "things" which represent locateable objects, e.g. countries, areas, cities, streets, etc. They are used as a tree of objects with a fixed depth, so the table structure looks like this:
id
name
type
continent_id
country_id
city_id
area_id
street_id
etc.
The association inside "things" is 1:n, i.e. a street or area always belongs to a defined city and country (not two or none); the column city_id for example contains the id of the "city" thing for all the objects which are inside that city. The "type" column contains the type of thing (Street, City, etc) as a string.
This table is referenced in another table "actions" as "thing_id". I am trying to generate a table of action location statistics showing the number of active and inactive actions a given location has. A simple JOIN like
SELECT count(nullif(actions.active, 1)) AS icount,
count(nullif(actions.active, 0)) AS acount,
things.name AS name, things.id AS thing_id, things.city_id AS city_id
FROM "actions"
LEFT JOIN things ON actions.thing_id = things.id
WHERE UPPER(substring(things.name, 1, 1)) = UPPER('A')
AND actions.datetime_at BETWEEN '2012-09-26 19:52:14' AND '2012-10-26 22:00:00'
GROUP BY things.name, things.id, things.city_id ORDER BY things.name
will give me a list of "things" (starting with 'A') which have actions associated with them and their active and inactive count like this:
icount | acount | name                     | thing_id | city_id
-------+--------+--------------------------+----------+--------
     0 |      5 | Brooklyn, New York City  |       25 |      23
     1 |      0 | Manhattan, New York City |       24 |      23
     3 |      2 | New York City            |       23 |      23
Now I would like to
only consider "city" things (that's easy: filter by type in "things"), and
in the active/inactive counts, use the sum of all actions happening in this city - regardless of whether the action is associated with the city itself or something inside the city (= having the same city_id). With the same dataset as above, the new query should result in
icount | acount | name          | thing_id | city_id
-------+--------+---------------+----------+--------
     4 |      7 | New York City |       23 |      23
I do not need the thing_id in this table (since it would not be unique anyway), but since I do need the city's name (for display), it is probably just as easy to also output the ID, then I don't have to change as much in my code.
How would I have to modify the above query to achieve this? I'd like to avoid additional trips to the database, and advanced SQL features such as procedures, triggers, views and temporary tables, if possible.
I'm using Postgres 8.3 with Ruby 1.9.3 on Rails 3.0.14 (on Mac OS X 10.7.4).
Thank you! :)
You need to count actions for all things in the city in an independent subquery and then join to a limited set of things:
SELECT c.icount
      ,c.acount
      ,t.name
      ,t.id AS thing_id
      ,t.city_id
FROM (
   SELECT t.city_id
         ,count(nullif(a.active, 1)) AS icount
         ,sum(a.active) AS acount
   FROM things t
   LEFT JOIN actions a ON a.thing_id = t.id
                      -- the time filter applies to actions, so it belongs in the join condition
                      AND a.datetime_at BETWEEN '2012-09-26 19:52:14'
                                            AND '2012-10-26 22:00:00'
   WHERE t.city_id = 23 -- to restrict results to one city
   GROUP BY t.city_id
   ) c -- counts per city
JOIN things t USING (city_id)
WHERE t.name ILIKE 'A%'
ORDER BY t.name, t.id;
I also simplified a number of other things in your query and used table aliases to make it easier to read.