Import (non-CSV) text data to PostgreSQL, which is separated via spaces and one capital letter - sql

This is the first time I am working with SQL. I am using PostgreSQL on Windows 7 64bit.
I have the following (large) .txt file of tweets built like this:
T 2009-06-07 02:07:41
U http://twitter.com/cyberplumber
W SPC Severe Thunderstorm Watch 339: WW 339 SEVERE TSTM KS NE 070200Z - 070800Z URGENT - IMMEDIATE BROADCAST REQUE.. http://tinyurl.com/5th9sw
As you can see, each record spans three lines, and each line starts with a capital letter (T, U or W) followed by a tab (\t), instead of the fields being separated by the traditional comma (,).
I would like to import the whole file into a SQL table with columns named date, user and text_msg.
I am guessing I will probably have to parse it in some way. Any ideas how to get the data into a table in the simplest and most efficient manner? Please also consider that the .txt files in question are rather huge (>4GB) and thus there is no easy way for me to edit them manually.

Quick&dirty hack:
DROP SCHEMA IF EXISTS tmp CASCADE; -- IF EXISTS avoids an error on the first run
CREATE SCHEMA tmp ;
SET search_path=tmp;
CREATE TABLE lutser
( id SERIAL NOT NULL PRIMARY KEY
, ztxt text
);
CREATE TABLE tweetdeck
( id SERIAL NOT NULL PRIMARY KEY
, stamp timestamp NOT NULL
, zurl text
, ztxt text
);
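-- NB: COPY's default text format splits on tabs and interprets backslashes;
-- if the file has a real tab after the prefix letter, set a DELIMITER that never occurs in the data
COPY lutser(ztxt)
FROM '/tmp/tweet.dat'
;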
INSERT INTO tweetdeck (stamp, zurl, ztxt)
SELECT regexp_replace( t.ztxt, E'^[A-Z][ \t]*', '')::timestamp
, regexp_replace( u.ztxt, E'^[A-Z][ \t]*', '')
, regexp_replace( w.ztxt, E'^[A-Z][ \t]*', '')
FROM lutser t
JOIN lutser u ON u.id = t.id+1
JOIN lutser w ON w.id = t.id+2
WHERE t.id %3 = 1
AND LEFT(t.ztxt,1) = 'T' -- Should be redundant, Won't harm
AND LEFT(u.ztxt,1) = 'U'
AND LEFT(w.ztxt,1) = 'W'
;
SELECT * FROM lutser;
SELECT * FROM tweetdeck;
Results:
COPY 9
INSERT 0 3
id | ztxt
----+--------------------------------------------------------------------------------------------------------------------------------------------------
1 | T 2009-06-07 02:07:31
2 | U http://twitter.com/cyberplumber
3 | W SPC Severe Thunderstorm Watch 339: WW 339 SEVERE TSTM KS NE 070200Z - 070800Z URGENT - IMMEDIATE BROADCAST REQUE.. http://tinyurl.com/5th9sw
4 | T 2009-06-07 02:07:41
5 | U http://twitter.com/cyberplumber
6 | W SPC Severe Thunderstorm Watch 339: WW 339 SEVERE TSTM KS NE 070200Z - 070800Z URGENT - IMMEDIATE BROADCAST REQUE.. http://tinyurl.com/5th9sw
7 | T 2009-06-07 02:07:51
8 | U http://twitter.com/cyberplumber
9 | W SPC Severe Thunderstorm Watch 339: WW 339 SEVERE TSTM KS NE 070200Z - 070800Z URGENT - IMMEDIATE BROADCAST REQUE.. http://tinyurl.com/5th9sw
(9 rows)
id | stamp | zurl | ztxt
----+---------------------+---------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------
1 | 2009-06-07 02:07:31 | http://twitter.com/cyberplumber | SPC Severe Thunderstorm Watch 339: WW 339 SEVERE TSTM KS NE 070200Z - 070800Z URGENT - IMMEDIATE BROADCAST REQUE.. http://tinyurl.com/5th9sw
2 | 2009-06-07 02:07:41 | http://twitter.com/cyberplumber | SPC Severe Thunderstorm Watch 339: WW 339 SEVERE TSTM KS NE 070200Z - 070800Z URGENT - IMMEDIATE BROADCAST REQUE.. http://tinyurl.com/5th9sw
3 | 2009-06-07 02:07:51 | http://twitter.com/cyberplumber | SPC Severe Thunderstorm Watch 339: WW 339 SEVERE TSTM KS NE 070200Z - 070800Z URGENT - IMMEDIATE BROADCAST REQUE.. http://tinyurl.com/5th9sw
(3 rows)

Try doing the following:
Firstly, create an appropriate table in SQL like so:
CREATE TABLE tweet(
ts timestamp, -- if inserting the values as timestamps gives errors, change to 'TEXT'
url TEXT, -- There are smarter UDTs for URLs available too
message TEXT
);
Then go on and try to run a standard COPY statement, something like the following:
COPY tweet
FROM E'c:\\my dir\\filename' -- path of the file, using the E prefix for escaped text, with a double backslash between directory names for Windows
WITH (FORMAT text); -- The default delimiter for format text is a tab
Finally, pray that you'll have enough memory and log space for > 4GB of files. For more info regarding the COPY command, see http://www.postgresql.org/docs/9.2/static/sql-copy.html
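Since COPY in text format expects one complete, tab-separated record per line, another option is to reshape the file before loading it. The following is only a rough sketch of that idea in Python, not something from the answers above: the function names (parse_records, copy_escape, convert) are made up for illustration, and it assumes that every record is a T line, a U line and a W line in that order, and that any other line (blank lines, headers) can be skipped. It streams the input line by line, so the 4GB size should not matter for memory.

```python
import re
from typing import Iterable, Iterator, Tuple

# A prefix letter (T, U or W), then spaces/tabs, then the value.
PREFIX = re.compile(r"^([TUW])[ \t]+(.*)$")


def parse_records(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (date, user, text_msg) for each complete T/U/W group."""
    current = {}
    for raw in lines:
        match = PREFIX.match(raw.rstrip("\r\n"))
        if match is None:
            continue  # assumption: blank or unrecognised lines are skipped
        key, value = match.groups()
        if key == "T":
            current = {}  # a T line always starts a new record
        current[key] = value
        if key == "W" and len(current) == 3:
            yield current["T"], current["U"], current["W"]
            current = {}


def copy_escape(field: str) -> str:
    """Escape a value for COPY's text format (backslash must go first)."""
    return (field.replace("\\", "\\\\")
                 .replace("\t", "\\t")
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def convert(src_path: str, dst_path: str) -> int:
    """Stream src_path into a tab-separated file; return the record count."""
    count = 0
    with open(src_path, encoding="utf-8", errors="replace") as src, \
         open(dst_path, "w", encoding="utf-8", newline="\n") as dst:
        for record in parse_records(src):
            dst.write("\t".join(copy_escape(f) for f in record) + "\n")
            count += 1
    return count
```

After a call such as convert("tweets.txt", "tweets.tsv") (both file names are placeholders), the output has one tweet per line with three tab-separated fields, which is the shape a plain COPY into a three-column table expects.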

Related

Handling multiple children for the same element generated in XML from SQL using "for XML clause"

I want to generate an XML file using a specific query. The main issue is that when I generate the XML, the output looks like this:
<nsSAFT:Account xmlns:nsSAFT="uri">
<nsSAFT:Produs>
<nsSAFT:CodProdus>0200943</nsSAFT:CodProdus>
<nsSAFT:Denumire>SPRAY SPECIAL EFECT 151 SILVER METAL</nsSAFT:Denumire>
<nsSAFT:Miscari>
<nsSAFT:Cantitate> 1.00</nsSAFT:Cantitate>
</nsSAFT:Miscari>
</nsSAFT:Produs>
</nsSAFT:Account>
<nsSAFT:Account xmlns:nsSAFT="uri">
<nsSAFT:Produs>
<nsSAFT:CodProdus>0200943</nsSAFT:CodProdus>
<nsSAFT:Denumire>SPRAY SPECIAL EFECT 151 SILVER METAL</nsSAFT:Denumire>
<nsSAFT:Miscari>
<nsSAFT:Cantitate> 2.00</nsSAFT:Cantitate>
</nsSAFT:Miscari>
</nsSAFT:Produs>
</nsSAFT:Account>
The main problem is that I want to have multiple children on the same product. My expected output would look like this:
<nsSAFT:Account xmlns:nsSAFT="uri">
<nsSAFT:Produs>
<nsSAFT:CodProdus>0200943</nsSAFT:CodProdus>
<nsSAFT:Denumire>SPRAY SPECIAL EFECT 151 SILVER METAL</nsSAFT:Denumire>
<nsSAFT:Miscari>
<nsSAFT:Cantitate> 1.00</nsSAFT:Cantitate>
</nsSAFT:Miscari>
<nsSAFT:Miscari>
<nsSAFT:Cantitate> 2.00</nsSAFT:Cantitate>
</nsSAFT:Miscari>
</nsSAFT:Produs>
</nsSAFT:Account>
The SQL query I used to generate the first output looks like this:
WITH XMLNAMESPACES ('uri' as nsSAFT)
SELECT
RTRIM(P.codProdus) AS 'nsSAFT:Produs/nsSAFT:CodProdus',
RTRIM(P.Denumire) AS 'nsSAFT:Produs/nsSAFT:Denumire',
STR(M.Cantitate, 18, 2) AS 'nsSAFT:Produs/nsSAFT:Miscari/nsSAFT:Cantitate'
FROM
Miscari M
INNER JOIN
ProdusGestiune PG ON M.idProdusGestiune = PG.idProdusGestiune
INNER JOIN
Produs P ON PG.idProdus = P.idProdus
FOR XML PATH ('nsSAFT:Account'), ELEMENTS ;
The data sample would look like this:
| CodProdus | Denumire | Cantitate |
|:---- |:------:| ----:|
| 0200943 | SPRAY SPECIAL EFECT 151 SILVER METAL | 1.00 |
| 0200943 | SPRAY SPECIAL EFECT 151 SILVER METAL | 2.00 |
| 0200943 | SPRAY SPECIAL EFECT 151 SILVER METAL | 5.00 |
| 0200947 | SPRAY SPECIAL USE 230 PENETRATING OIL | 6.00 |
I use the following tables:
"Produs":
| CodProdus | Denumire |
|:---- |:------:|
| 0200943 | SPRAY SPECIAL EFECT 151 SILVER METAL |
| 0200954 | SPRAY ACRILIC MAT 9005 400ML |
| 0200955 | SPRAY ACRILIC MAT 9016 400ML |
| 0200960 | SPRAY ACRILIC RAL 3000 400ML |
"Miscari":
| Cantitate |
|:---- |
| 14.000000 |
| 12.000000 |
| 5.000000 |
I tried "select distinct", but SSMS returns an error. I also tried several queries using "union all" and got errors too.
You probably want a correlated subquery to generate the Miscari/Cantitate subelements, such as the following:
WITH XMLNAMESPACES ('uri' as nsSAFT)
SELECT
RTRIM(P.codProdus) AS [nsSAFT:CodProdus],
RTRIM(P.Denumire) AS [nsSAFT:Denumire],
(
SELECT
STR(M.Cantitate, 18, 2) AS [nsSAFT:Cantitate]
FROM
ProdusGestiune PG
INNER JOIN
Miscari M ON M.idProdusGestiune = PG.idProdusGestiune
WHERE
PG.idProdus = P.idProdus
FOR XML PATH('nsSAFT:Miscari'), TYPE
)
FROM
Produs P
--WHERE codProdus='0200943'
FOR XML PATH('nsSAFT:Produs'), ROOT('nsSAFT:Account'), ELEMENTS;

How to add values to single empty column from local text file

I have a table called customers with the columns CustomerID, CompanyName, Address and Phone.
We have now added a new column called Remarks, which is empty (NULL).
I have a text file to bulk insert into that column using the view Remarksinsert, with the following code:
bulk insert HRRegion.dbo.Remarksinsert
From 'C:\Users\SMSTECHLNG50\Documents\remarks..txt'
with
(
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)
GO
But I am getting this error:
Msg 515, Level 16, State 2, Line 9 Cannot insert the value NULL into
column 'CustomerID', table 'HRRegion.dbo.Customers'; column does not
allow nulls. INSERT fails. The statement has been terminated.
I think that here, too, you can only insert a whole row or nothing.
If your customer table looks like this:
customer
custid|co_name |addr |phone |remarks
42|Laverda |Breganze, Italy |+39 6 233 84 81 |(NULL)
43|Benelli |Pesaro, Italy |+39 8 284 55 32 |(NULL)
44|Ural |Irbit, Russia |+7 14 526 22 2342|(NULL)
45|Dnepr |Kiew, Ukraine |+380 526 22 2342 |(NULL)
46|Harley Davidson|Milwaukee, US |+1 802 223 4444 |(NULL)
47|Honda |Tokyo, Japan |+81 82 555 4123 |(NULL)
48|Moto Guzzi |Mandello del Lario, Italy|+39 6 423 04 72 |(NULL)
49|Ducati |Bologna, Italy |+39 7 722 04 43 |(NULL)
50|Norton |Birmingham, UK |+44 7234 723 4423|(NULL)
51|Matchless |Plumstead, London, UK |+44 8021 612 0843|(NULL)
52|Brough |Nottingham, UK |+44 5812 512 4883|(NULL)
(well, you add the remarks column using an ALTER TABLE ...), then, I would expect the file you mentioned with the remarks to look somewhat like this:
remarks
custid|remarks
42|built also tractors, closed now
43|first series 6-cylinder motorbike
44|old style sidecar rigs with modern engine
45|old style sidecar rigs, permanent two-wheel drive
46|the american classic
47|builders of the CB 750 four and the gold wing
48|famous for horizontal singles and 90° V twins
49|90° V twin bikes with lateral crankshaft
50|english classic, still alive
51|english classic, closed now
52|probably the finest motorcycles ever built
So, you would build a remarks_stg table:
CREATE TABLE remarks_stg (
custid SMALLINT NOT NULL
, remarks VARCHAR(50) NOT NULL
);
Then, you load just that staging table with the data file as described above - and, at least if you have SQL Server 2008 or above, you use a MERGE statement to update the customer table:
MERGE customer t
USING remarks_stg s
ON t.custid = s.custid
WHEN MATCHED THEN UPDATE SET
t.remarks = s.remarks
;
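To try out the staging-table idea without a SQL Server instance, here is a small self-contained sketch in Python with the sqlite3 module. Several things in it are assumptions for the sake of the demo: SQLite stands in for SQL Server, the staging rows are inserted directly instead of being bulk loaded from the file, and since SQLite has no MERGE, a correlated UPDATE plays its role. Customer 99 is a made-up row that shows unmatched customers keep a NULL remark.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (custid INTEGER PRIMARY KEY, co_name TEXT, remarks TEXT);
CREATE TABLE remarks_stg (custid INTEGER NOT NULL, remarks TEXT NOT NULL);

INSERT INTO customer (custid, co_name)
VALUES (42, 'Laverda'), (43, 'Benelli'), (99, 'No remark yet');

-- stands in for the bulk load of the remarks file into the staging table
INSERT INTO remarks_stg
VALUES (42, 'built also tractors, closed now'),
       (43, 'first series 6-cylinder motorbike');
""")

# Copy each staged remark onto the matching customer row;
# the EXISTS filter leaves customers without a staged remark untouched.
conn.execute("""
UPDATE customer
SET remarks = (SELECT s.remarks FROM remarks_stg s
               WHERE s.custid = customer.custid)
WHERE EXISTS (SELECT 1 FROM remarks_stg s
              WHERE s.custid = customer.custid)
""")

result = conn.execute(
    "SELECT custid, remarks FROM customer ORDER BY custid"
).fetchall()
for row in result:
    print(row)
```

The point of the design is that the load only ever writes complete rows into the staging table, and the existing customer rows are changed by an update keyed on custid.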

SQL Server selecting with self join

I have a table that has records like this:
FieldId collationid Type Message
---------------------------------------------
1 1234 WC hello
2 1234 WR next message
3 1234 WZ again
4 1234 WX another message
5 ab12 WC this message
6 ab12 WR again
7 ab12 WZ misc message
8 5678 WC hello
9 5678 WR next message
10 5678 WZ again
11 5678 WX another message
A recordset is complete when it has all four records: a WC, WR, WZ and WX. I need a query that shows me when a record is missing. In the table above, the query would return ab12 because it only has WC, WR and WZ records.
Appreciate any help you can give me..
Use COUNT() and HAVING:
SELECT collationid
FROM tbl
WHERE Type IN('WC', 'WR', 'WZ', 'WX')
GROUP BY collationid
HAVING COUNT(DISTINCT Type) < 4
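A quick way to check this against the sample data is an in-memory SQLite database driven from Python. This is only a sketch: SQLite stands in for SQL Server here, and the table name tbl is the placeholder used in the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl (FieldId INTEGER, collationid TEXT, Type TEXT, Message TEXT)"
)
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?, ?)",
    [
        (1, "1234", "WC", "hello"),
        (2, "1234", "WR", "next message"),
        (3, "1234", "WZ", "again"),
        (4, "1234", "WX", "another message"),
        (5, "ab12", "WC", "this message"),
        (6, "ab12", "WR", "again"),
        (7, "ab12", "WZ", "misc message"),
        (8, "5678", "WC", "hello"),
        (9, "5678", "WR", "next message"),
        (10, "5678", "WZ", "again"),
        (11, "5678", "WX", "another message"),
    ],
)

# Groups with fewer than four distinct types are incomplete.
incomplete = [
    row[0]
    for row in conn.execute(
        """
        SELECT collationid
        FROM tbl
        WHERE Type IN ('WC', 'WR', 'WZ', 'WX')
        GROUP BY collationid
        HAVING COUNT(DISTINCT Type) < 4
        """
    )
]
print(incomplete)  # ['ab12']
```

One caveat: a collationid that has none of the four types at all is filtered out by the WHERE clause before grouping, so it would not show up in the result.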

SQL select statement that would result in an associative table

The company I work for uses an AS400 (iSeries). There is some data in a system dictionary that I am trying to pluck out and turn into an associative table.
Here is what the data looks like
xtype | xdata
60 | 011111211 212
60 | 345
60 | 212312 169
xtype is the "key" that will allow me to return the relevant data.
212, 345 and 169 are employee numbers and are in the right 3 characters of the 24 character xdata column.
011111211 is 3 "territories" (011, 111 and 211), likewise 212312 is 2 "territories" (212, 312)
What I would like to end up with is
empNum | territory
------------------
212 | 011
212 | 111
212 | 211
169 | 212
169 | 312
Here is what I have worked on so far:
SELECT
*
From
(
select
right(xdata,3) as empNum,
trim(coalesce(left(xdata,3),'')) as ter
from Table
where xtype=60 and xarg < 960
) as outerTable
where ter <> ''
and
trim(coalesce(substr(xdata,4,3),'')) as ter
where ter <> ''
would work for the second territory
and
trim(coalesce(substr(xdata,7,3),'')) as ter
where ter <> ''
would work for the third territory
What I don't know is how to take those 3 and join them into a result that looks like an associative table. Any thoughts?
So you've got one query that returns 212 | 011 / 169 | 212, another that returns 212 | 111 / 169 | 312, and a third that returns 212 | 211, is that correct?
The obvious answer to transform this to the results you're asking for is to use UNION ALL to combine the three queries. You were looking at ways to join the queries, but (simply put) joining would add columns, when what you want to add is rows.
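As a rough illustration of the UNION ALL approach, here is a self-contained sketch using Python's sqlite3 module. SQLite stands in for DB2 on the iSeries, the table name sysdict is made up, the xarg filter from the question is left out, and the sample rows are rebuilt as 24-character strings with the territories on the left and the employee number in the last 3 characters.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sysdict (xtype INTEGER, xdata TEXT)")

# Territories are left-aligned in 21 characters, the employee number follows.
samples = [("011111211", "212"), ("", "345"), ("212312", "169")]
conn.executemany(
    "INSERT INTO sysdict VALUES (60, ?)",
    [(f"{territories:<21}{emp}",) for territories, emp in samples],
)

# One SELECT per territory slot (positions 1, 4 and 7), stacked with UNION ALL
# so that each non-blank slot becomes its own row.
branches = [
    f"SELECT substr(xdata, 22, 3) AS empNum, substr(xdata, {start}, 3) AS territory "
    f"FROM sysdict WHERE xtype = 60 AND trim(substr(xdata, {start}, 3)) <> ''"
    for start in (1, 4, 7)
]
sql = " UNION ALL ".join(branches)

pairs = conn.execute(sql).fetchall()
for emp_num, territory in sorted(pairs):
    print(emp_num, "|", territory)
```

Employee 345 has no territories, so all three branches filter it out, which matches the expected result in the question.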

PostgreSQL: strange collision of ORDER BY and LIMIT/OFFSET

I'm trying to do this in PostgreSQL 9.1:
SELECT m.id, vm.id, vm.value
FROM m
LEFT JOIN vm ON vm.m_id = m.id and vm.variation_id = 1
ORDER BY lower(trim(vm.value)) COLLATE "C" ASC LIMIT 10 OFFSET 120
The result is:
id | id | value
----+-----+---------------
504 | 511 | "andr-223322"
506 | 513 | "andr-322223"
824 | 831 | "angHybrid"
866 | 873 | "Another thing"
493 | 500 | "App update required!"
837 | 844 | "App update required!"
471 | 478 | "April"
905 | 912 | "Are you sure you want to delete this thing?"
25 | 29 | "Assignment"
196 | 201 | "AT ADDRESS"
Ok, let's execute the same query with OFFSET 130:
id | id | value
----+-----+---------------
196 | 201 | "AT ADDRESS"
256 | 261 | "Att Angle"
190 | 195 | "Att Angle"
273 | 278 | "Att Angle:"
830 | 837 | "attAngle"
475 | 482 | "August"
710 | 717 | "Averages"
411 | 416 | "AVG"
692 | 699 | "AVG SHAPE"
410 | 415 | "AVGs"
and we see our AT ADDRESS item again, but at the beginning!!!
The fact is that the vm table contains the following two items:
id | m_id | value
----+------+---------------
201 | 196 | "AT ADDRESS"
599 | 592 | "At Address"
I cure this situation with a workaround:
(lower(trim(vm.value)) || vm.id)
but What The Hell ???!!!
Why do I have to use a workaround?
Swearing won't change the SQL standard that defines this behavior.
The order of rows is undefined unless specified in ORDER BY. The manual:
If sorting is not chosen, the rows will be returned in an unspecified
order. The actual order in that case will depend on the scan and join
plan types and the order on disk, but it must not be relied on. A
particular output ordering can only be guaranteed if the sort step is explicitly chosen.
Since you didn't define an order for these two peers (in your sort order):
id | m_id | value
----+------+---------------
201 | 196 | "AT ADDRESS"
599 | 592 | "At Address"
.. you get arbitrary ordering - whatever is convenient for Postgres. A query with LIMIT often uses a different query plan, which can explain different results.
Fix
ORDER BY lower(trim(vm.value)) COLLATE "C", vm.id;
Or (maybe more meaningful - possibly also tune to existing indexes):
ORDER BY lower(trim(vm.value)) COLLATE "C", vm.value, vm.id;
(This is unrelated to the use of COLLATE "C" here, btw.)
Don't concatenate for this purpose, that's much more expensive and potentially makes it impossible to use an index (unless you have an index on that precise expression). Add another expression that kicks in when prior expressions in the ORDER BY list leave ambiguity.
Also, since you have a LEFT JOIN there, rows in m without match in vm have null values for all current ORDER BY expressions. They come last and are sorted arbitrarily otherwise. If you want a stable sort order overall, you need to deal with that, too. Like:
ORDER BY lower(trim(vm.value)) COLLATE "C", vm.id, m.id;
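To see why the tie-breaker matters for paging, here is a small sketch with Python's sqlite3 module. SQLite is only a stand-in for Postgres (so there is no COLLATE "C" here), and the table holds a handful of the values from the question. With a unique column at the end of ORDER BY, the sort order is total, so consecutive pages neither overlap nor skip rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vm (id INTEGER PRIMARY KEY, value TEXT)")
conn.executemany(
    "INSERT INTO vm VALUES (?, ?)",
    [
        (201, "AT ADDRESS"),
        (599, "At Address"),  # same sort key as 201 after lower(trim(...))
        (261, "Att Angle"),
        (195, "Att Angle"),   # same sort key as 261
        (482, "August"),
    ],
)


def page(offset, limit=2):
    """Return the ids of one page, ordered by the sort key and then by id."""
    sql = "SELECT id FROM vm ORDER BY lower(trim(value)), id LIMIT ? OFFSET ?"
    return [row[0] for row in conn.execute(sql, (limit, offset))]


pages = page(0) + page(2) + page(4)
print(pages)  # [201, 599, 195, 261, 482]
```

Without the trailing id, the two "at address" rows (and the two "att angle" rows) are peers, and the database is free to return them in either order on each call.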
Asides
Why store the double quotes? Seems to be costly noise. You might be better off without them. You can always add the quotes on output if need be.
Many clients cannot deal with the same column name multiple times in one result. You need a column alias for at least one of your id columns: SELECT m.id AS m_id, vm.id AS vm_id .... Goes to show why "id" for a column is an anti-pattern to begin with.