I am new to Postgres and am playing around with data loading.
Here is the table definition from the PostgreSQL 9.2 manual:
CREATE TABLE weather (
city varchar(80),
temp_lo int, -- low temperature
temp_hi int, -- high temperature
prcp real, -- precipitation
date date
);
I prepared the following data file (weather.txt):
San Francisco 43 57 0.0 '1994-11-29'
Hayward 54 37 0.0 '1994-11-29'
and ran the COPY command:
COPY weather FROM '~aviad/postsgres/playground/weather.txt';
Now, when I run select * from weather; I see that single quotes appear around the date values.
This does not happen when I run a simple INSERT, e.g.:
INSERT INTO weather VALUES ('San Francisco', 46, 50, 0.25, '1994-11-27');
I wonder:
What is the reason for wrapping text values in single quotes?
What is the correct way to put the text data in the file used by COPY to avoid the single-quote wrapping?
What you describe in your question is obviously not what's really happening. COPY would fail trying to import string literals with redundant single quotes into a date column.
To get rid of redundant quotes, import to a temporary table with text column, then INSERT INTO the target table trimming the quotes:
CREATE TEMP TABLE wtmp (
city text
, temp_lo int
, temp_hi int
, prcp real
, date text -- note how I use text here.
);
COPY wtmp FROM '~aviad/postsgres/playground/weather.txt';
INSERT INTO weather (city, temp_lo, temp_hi, prcp, date)
SELECT city, temp_lo, temp_hi, prcp, trim(date, '''')::date
FROM wtmp
-- ORDER BY ?
;
The temp table is dropped automatically at the end of your session.
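Alternatively, if you control how weather.txt is produced, you can strip the quotes before COPY ever sees the file. A minimal, purely illustrative Python sketch (it assumes the single quotes never carry meaning inside a field):

```python
# Strip single quotes from one line of the COPY data file.
def strip_quotes(line: str) -> str:
    return line.replace("'", "")

line = "Hayward 54 37 0.0 '1994-11-29'"
cleaned = strip_quotes(line)  # "Hayward 54 37 0.0 1994-11-29"
```

Applied to every line of the file before loading, this removes the need for the temp-table detour entirely.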
Reserved words as identifiers
I see you copied the example from the manual. Here is the deep link to the current manual.
While correct, that example in the manual is unfortunate. I'd advise not to use reserved words like date as column names. As you can see here, date is a reserved word in every SQL standard. It's allowed in Postgres, and I can see how it's tempting for a simple example. But that doesn't make it a good idea. Generally, you should be in the habit of avoiding reserved words as identifiers. They lead to confusing error messages and needlessly incompatible SQL code.
1 - The practice of wrapping text values in quotes is used in case the delimiter within the file (whitespace in your case) also appears inside the text string itself.
2 - I do not know much about Postgres, but if you specify the quote character in your COPY command, it should be removed during the import. Note that in PostgreSQL the QUOTE option is only valid together with CSV format:
COPY weather FROM '~aviad/postsgres/playground/weather.txt' WITH (FORMAT csv, QUOTE '?');
Something along those lines. Simply replace the ? with the quote character used in the file - in your case, I would try this first:
COPY weather FROM '~aviad/postsgres/playground/weather.txt' WITH (FORMAT csv, QUOTE '''');
You also might want to check out http://www.postgresql.org/docs/9.2/static/sql-copy.html, as there are many different options you can use with the COPY command.
Related
I have automation code that runs a bunch of queries against a PostgreSQL DB.
One of my queries is:
CREATE TABLE 行 (CustomerName int, City varchar(255),Country varchar(255))
when running it into the DB, I got this response:
Query response from db:
CREATE TABLE ? (CustomerName int, City varchar(255),Country varchar(255));
ERROR: syntax error at or near "?"
LINE 1: CREATE TABLE ? (CustomerName int, City varchar(255),Country ...
^
postgres=#
It seems that the unique character is converted to '?'.
Any suggestions as to why this could happen?
I'm sure that before the query is executed the letters are encoded correctly.
(When running this query manually everything works fine.)
I would avoid using reserved words or international characters in identifiers.
If you really need to do that, you can try quoting the name with double quotes:
Quoted identifiers can contain any character, except the character with code zero. (To include a double quote, write two double-quotes.) This allows constructing table or column names that would otherwise not be possible, such as ones containing spaces or ampersands. The length limitation still applies.
SQL-SYNTAX-IDENTIFIERS
CREATE TABLE "行" (CustomerName int, City varchar(255),Country varchar(255))
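One plausible mechanism for the 行 → '?' conversion (an assumption, not a confirmed diagnosis; your automation stack may differ): somewhere between your code and the server the query string is re-encoded into a single-byte character set such as Windows-1252, which cannot represent 行 and substitutes '?'. A quick Python illustration:

```python
name = '行'

# In UTF-8 the character survives as a multi-byte sequence.
utf8_bytes = name.encode('utf-8')                  # b'\xe8\xa1\x8c'

# Re-encoding through a single-byte charset with replacement
# turns it into a literal question mark - matching the error.
mangled = name.encode('cp1252', errors='replace')  # b'?'
```

If this is what is happening, check the client_encoding of the automated connection, since the manual session evidently negotiates it correctly.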
When I execute SELECT * FROM tablename it generates this error:
java.sql.SQLSyntaxErrorException: Table/View 'tablename' does not exist.
But if I run the SQL command
SELECT * FROM "tablename" it runs without problems. Why?
This is an aspect of the SQL standard known as "delimited identifiers".
Table names, column names, and other objects are things that you can give names to in your database.
The SQL standard says that, if you aren't particular about the upper/lower case of your object names, you can just specify the names without quotation marks, and your database will process them in a case-insensitive manner (typically, by converting an unquoted object name into the all-upper-case version of that name).
CREATE TABLE mytable(c1 INT, c2 CHAR(10));
INSERT INTO MyTable (C1, C2) VALUES (42, 'Bryan');
SELECT c2 FROM MYTABLE;
Since you didn't specify any object names in quotation marks, all of these examples work fine, because mytable, MyTable, and MYTABLE are all the same, when they aren't in quotation marks.
But if you specify your object names in quotation marks, then you have to get things exactly right:
CREATE TABLE "MyCaseSensitiveTable" (c1 int, c2 char(10));
INSERT INTO MyCaseSensitiveTable (c1, c2) VALUES (64, 'a nice age');
In this case, your INSERT statement will be rejected, because "MyCaseSensitiveTable" is different than MyCaseSensitiveTable.
Delimited identifiers bring other advantages:
You can use otherwise-reserved keywords from the SQL language as table names, so you can create a table named "TABLE" if you want.
You can use various special characters in your database object names.
Personally, I try never to use delimited identifiers, because I think they make my programs hard to read. But they are a completely legitimate part of the SQL standard and they are widely used.
But, the bottom line is: if you're going to put your database object names in quotation marks, you have to put them in quotation marks all the time, and you have to give the name exactly the same each time, but if you don't use quotation marks for your database object names, they will be treated in a case-insensitive manner.
We use Oracle Text in an Oracle Database (11.2.0.4.0) to perform full-text search over stored documents, as well as over multiple columns in our database.
For these multi-column indexes we noticed that some double-sided wildcard queries return the wrong number of results: The whole table!
Our application translates the query of a user into a double-sided wildcard query (e.g. "york" -> "%york%") and passes them to the contains operator.
We re-ran this on the database and could reproduce it.
Consider, for example, a table containing cities where the full-text index spans all columns: Zip-Code, Cityname, State and Country:
select * from city where contains(cityname, '%york%')>0
The following query arguments seem to return a wrong number of results (all rows):
%s%
%i%
%d%
%c%
What I checked already:
Interestingly, the non-working queries are all format-arguments in C. But I have not been able to find these as keywords or special operators in the Oracle Text documentation.
I checked that the stop word list does not contain these queries.
I set a custom lexer and turned on the "mixed case" option for it, which seems to fix the issue for lowercase queries, but the issue persists for upper case queries (%S%).
The score operator returns a value of 6 for the rows that should not match:
select cityname, state, zip, score(1) from city where contains(cityname, '%s%', 1)>0
---------------------------------
|Cityname |State|Zip | Score(1)|
|-------------------------------|
|La Cibourg|NE |2332| 6 | - WRONG
|Morlon |FR |1638| 6 | - WRONG
|Leuk Stadt|VS |3953| 12 | - Correct row
---------------------------------
Do you know any (mis-)configuration that can cause this?
Update
The exact version is 11.2.0.4.0, with Patch 18842982 applied.
The script to create the table and index is below:
drop table city_copy;
create table city_copy (
city_nr number not null,
zip_code varchar2(60),
city_name varchar2(60),
state varchar2(60)
);
insert into city_copy
select 1, 2332, 'La Ciboug', 'NE' from dual
union all
select 2, 1638, 'Morlon', 'FR' from dual
union all
select 3, 3953, 'Leuk Stadt', 'VS' from dual;
commit;
exec ctxsys.ctx_ddl.drop_preference('CITY_MULTI');
exec ctxsys.ctx_ddl.create_preference('CITY_MULTI', 'MULTI_COLUMN_DATASTORE');
exec ctxsys.ctx_ddl.set_attribute('CITY_MULTI', 'COLUMNS', 'ZIP_CODE, CITY_NAME, STATE');
create index city_idx_ft on city_copy(zip_code)
indextype is ctxsys.context parameters ('datastore CITY_MULTI sync (on commit)');
The current settings for the default lexer are:
DEFAULT_LEXER COMPOSITE GERMAN
DEFAULT_LEXER MIXED_CASE YES
DEFAULT_LEXER ALTERNATE_SPELLING GERMAN
Our stoplist is unchanged from the default stoplist for German
So, after quite a bit of research...
I am still not sure whether it is a bug, but although my intuition said it was the lexer causing this behavior - it is not.
Please add to the preference an attribute named DELIMITER with a value of NEWLINE
exec ctx_ddl.set_attribute('CITY_MULTI', 'DELIMITER', 'NEWLINE');
That would solve your issue.
The default delimiter is COLUMN_NAME_TAG, which probably conflicts with very short search terms (it is supposed to treat your data as if it were XML, and somewhere in how Oracle concatenates the text the single characters you were looking for appear).
It looks to me like, for the multi-column datastore, Oracle Text constructs for each row an XML document that contains the column names in it, something like:
<XML>
<zip_code>2332</zip_code>
<city_name>La Ciboug</city_name>
<state>NE</state>
</XML>
and that XML is being indexed (or a structure that is similar to it).
and when searching for just 's', the 's' in the tag name "state" matches in every row.
The NEWLINE delimiter changes the indexed text to
2332
La Ciboug
NE
which is better in your case and the way you search.
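To see why the delimiter matters, here is a rough Python model of the two document shapes (purely illustrative; this is not Oracle's actual internal representation):

```python
row = {'zip_code': '2332', 'city_name': 'La Ciboug', 'state': 'NE'}

# COLUMN_NAME_TAG-style document: column names become part of the text.
tagged = ''.join(f'<{col}>{val}</{col}>' for col, val in row.items())

# NEWLINE-style document: only the values are present.
newline = '\n'.join(row.values())

# 's' occurs in the tag names ("state", etc.) even though
# no actual column value in this row contains an 's'.
in_tagged = 's' in tagged    # True
in_newline = 's' in newline  # False
```

Under this model, a query like '%s%' matches every row of the tagged form, which mirrors the all-rows result described in the question.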
More info can be found here:
http://docs.oracle.com/cd/B19306_01/text.102/b14218/cdatadic.htm#i1006391
Good luck!
I need to test whether my application reads special characters from the database and displays them in exactly the same way. For this, I need to populate the database table with all available special characters. However, I am not sure how to specify special characters in the SQL insert query. Can anyone please point me to an example of inserting a special character in a query? For simplicity's sake, suppose the table is a City table with Area and Avg_Temperature as its two columns. If I need to insert the degree (Celsius/Fahrenheit) symbol in the Avg_Temperature column, how should I write the query?
*[Edit on 1/9/2012 at 2:50PM EST]* As per Justin Cave's suggestion below, I did the following analysis:
Table: create table city(area number, avg_temperature nvarchar2(10));
Data: insert into city values (1100, '10◦C');
Query:
select dump(avg_temperature, 1010) from city where area = 1100;
O/P
DUMP(AVG_TEMPERATURE,1010)
----------------------------------------------------------
Typ=1 Len=8 CharacterSet=AL16UTF16: 0,49,0,48,0,191,0,67
Query
select value$ from sys.props$ where name='NLS_CHARACTERSET';
O/P
VALUE$
----------------
WE8MSWIN1252
Query:
select value$ from sys.props$ where name='NLS_NCHAR_CHARACTERSET';
O/P
----------------
AL16UTF16
It seems that the insert does mess up the special characters as Justin Cave suggested. But I am not able to understand why this is happening? Can anyone please provide related suggestion?
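Decoding the DUMP bytes from your own output shows what actually got stored (Python here is used only to decode the bytes; the substitution itself would have happened during Oracle's character-set conversion, consistent with Justin Cave's suggestion):

```python
# The DUMP output said: CharacterSet=AL16UTF16: 0,49,0,48,0,191,0,67
# Decode those UTF-16 (big-endian) code units to see the stored string.
stored = bytes([0, 49, 0, 48, 0, 191, 0, 67]).decode('utf-16-be')
# stored == '10¿C': code point 191 is '¿', the replacement character
# Oracle substitutes for anything the conversion cannot represent.
```

So the '◦' you typed never survived the round trip; it was replaced before it reached the NVARCHAR2 column.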
First, you should not store the symbol as part of your column. That requires you to declare the column as VARCHAR, which will give you lots of problems in the long run (e.g. you cannot sum() or avg() over the values, and so on).
You should store the unit in which the temperature was taken in a second column (e.g. 1 = Celsius and 2 = Fahrenheit) and translate this when displaying the data in the frontend. If you really want to store the symbol, declare a short character column for the units:
CREATE TABLE readings
(
area number(22),
avg_temperature number(10,3),
units varchar(2)
)
Then you can insert it as follows:
INSERT INTO readings
(area, avg_temperature, units)
VALUES
(1000, 12.3, '°C');
But again: I would not recommend storing the actual symbol. Store only the code!
First you need to know what the database character set is. Then you need to know what character set your "client" connection is using. Life is always easier if these are the same.
If your database is UTF-8 and your client is UTF-8, then you don't need to do any character escaping; you can just use the UTF-8 encoding of the desired character.
In your example, the degree character is Unicode code point U+00B0.
In UTF-8 this is the two-byte sequence x'C2', x'B0'.
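As a sanity check on those bytes (Python used only to show the encodings):

```python
degree = '\u00b0'   # the degree sign '°'

# Its UTF-8 encoding is the two-byte sequence C2 B0.
encoded = degree.encode('utf-8')    # b'\xc2\xb0'

# Windows-1252 (Oracle's WE8MSWIN1252) can also represent it,
# as the single byte B0 - so '°' itself survives that charset.
win1252 = degree.encode('cp1252')   # b'\xb0'
```

Note the asker's INSERT used '◦' (U+25E6, a white bullet), not '°' (U+00B0); only the latter is representable in WE8MSWIN1252.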
In my MySQL InnoDB database, I have dirty zip code data that I want to clean up.
The clean zip code data is when I have all 5 digits for a zip code (e.g. "90210").
But for some reason, I noticed in my database that for zipcodes that start with a "0", the 0 has been dropped.
So "Holtsville, New York" with zipcode "00544" is stored in my database as "544"
and
"Dedham, MA" with zipcode "02026" is stored in my database as "2026".
What SQL can I run to front pad "0" to any zipcode that is not 5 digits in length? Meaning, if the zipcode is 3 digits in length, front pad "00". If the zipcode is 4 digits in length, front pad just "0".
UPDATE:
I just changed the zipcode to be datatype VARCHAR(5)
Store your zip codes as CHAR(5) instead of a numeric type, or have your application pad them with zeroes when you load them from the DB. A way to do it in PHP using sprintf():
echo sprintf("%05d", 205); // prints 00205
echo sprintf("%05d", 1492); // prints 01492
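The same padding in Python, in case your loader is not PHP (purely illustrative):

```python
# Left-pad numeric zip codes to 5 digits with zeros.
padded_a = f"{205:05d}"        # '00205'
padded_b = str(1492).zfill(5)  # '01492'
```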
Or you could have MySQL pad it for you with LPAD():
SELECT LPAD(zip, 5, '0') as zipcode FROM table;
Here's a way to update and pad all rows:
ALTER TABLE `table` CHANGE `zip` `zip` CHAR(5); #changes type
UPDATE table SET `zip`=LPAD(`zip`, 5, '0'); #pads everything
You need to decide the length of the zip code (which I believe should be 5 characters long). Then you need to tell MySQL to zero-fill the numbers.
Let's suppose your table is called mytable and the field in question is zipcode, type smallint. You need to issue the following query:
ALTER TABLE mytable CHANGE `zipcode` `zipcode`
MEDIUMINT( 5 ) UNSIGNED ZEROFILL NOT NULL;
The advantage of this method is that it leaves your data intact, there's no need to use triggers during data insertion / updates, there's no need to use functions when you SELECT the data and that you can always remove the extra zeros or increase the field length should you change your mind.
Ok, so you've switched the column from Number to VARCHAR(5). Now you need to update the zipcode field to be left-padded. The SQL to do that would be:
UPDATE MyTable
SET ZipCode = LPAD( ZipCode, 5, '0' );
This will pad all values in the ZipCode column to 5 characters, adding '0's on the left.
Of course, now that you've got all of your old data fixed, you need to make sure that your any new data is also zero-padded. There are several schools of thought on the correct way to do that:
Handle it in the application's business logic. Advantages: database-independent solution, doesn't involve learning more about the database. Disadvantages: needs to be handled everywhere that writes to the database, in all applications.
Handle it with a stored procedure. Advantages: Stored procedures enforce business rules for all clients. Disadvantages: Stored procedures are more complicated than simple INSERT/UPDATE statements, and not as portable across databases. A bare INSERT/UPDATE can still insert non-zero-padded data.
Handle it with a trigger. Advantages: Will work for Stored Procedures and bare INSERT/UPDATE statements. Disadvantages: Least portable solution. Slowest solution. Triggers can be hard to get right.
In this case, I would handle it at the application level (if at all), and not the database level. After all, not all countries use a 5-digit zip code (not even the US -- our zip codes are actually Zip+4+2: nnnnn-nnnn-nn) and some allow letters as well as digits. Better NOT to try to force a data format, and to accept the occasional data error, than to prevent someone from entering a correct value just because its format isn't quite what you expected.
I know this is well after the OP. One way to go that keeps the zip code stored as an unsigned INT but displays it with leading zeros is as follows.
select LPAD(cast(zipcode_int as char), 5, '0') as zipcode from table;
While this preserves the original data as INT and can save some storage space, the server performs the INT-to-CHAR conversion for you. This can be put into a view, and whoever needs this data can be directed there instead of to the table itself.
It would still make sense to create your zip code field as a zerofilled unsigned integer field.
CREATE TABLE xxx (
zipcode INT(5) ZEROFILL UNSIGNED,
...
)
That way mysql takes care of the padding for you.
CHAR(5)
or
MEDIUMINT (5) UNSIGNED ZEROFILL
The first takes 5 bytes per zip code.
The second takes only 3 bytes per zip code. The ZEROFILL option is necessary for zip codes with leading zeros.
You should use UNSIGNED ZEROFILL in your table structure.
Alternatively, LPAD works well with a VARCHAR column, since it fills the leftover positions on the left with '0' characters instead of spaces.
So in that case the datatype should be VARCHAR.