This is sort of a general question that has come up in several contexts; the example below is representative but not exhaustive. I am interested in any ways of learning to work with Postgres on imperfect (but close enough) data sources.
The specific case: I am using Postgres with PostGIS to work with government data published as shapefiles and XML. Using the shp2pgsql module distributed with PostGIS (for example, on this dataset), I often get a schema like this:
Column | Type |
------------+-----------------------+-
gid | integer |
st_fips | character varying(7) |
sfips | character varying(5) |
county_fip | character varying(12) |
cfips | character varying(6) |
pl_fips | character varying(7) |
id | character varying(7) |
elevation | character varying(11) |
pop_1990 | integer |
population | character varying(12) |
name | character varying(32) |
st | character varying(12) |
state | character varying(16) |
warngenlev | character varying(13) |
warngentyp | character varying(13) |
watch_warn | character varying(14) |
zwatch_war | bigint |
prog_disc | bigint |
zprog_disc | bigint |
comboflag | bigint |
land_water | character varying(13) |
recnum | integer |
lon | numeric |
lat | numeric |
the_geom | geometry |
I know that at least 10 of those varchars (the FIPS codes, elevation, population, etc.) should be ints, but when I try to cast them as such I get errors. In general, I think I could solve most of my problems by allowing Postgres to accept an empty string as a default value for a column -- say 0 or -1 for an int type -- when altering a column and changing the type. Is this possible?
If I create the table before importing, with the type declarations generated from the original data source, I get better types than with shp2pgsql, and I can iterate over the source entries feeding them to the database, discarding any failed inserts. The fundamental problem is that if 1% of fields are bad, evenly distributed over 25 columns, I will lose roughly 22% of my rows (1 - 0.99^25), since a given insert fails if any one field is bad. I would love to be able to make a best-effort insert and fix any problems later, rather than lose that many rows.
Any input from people having dealt with similar problems is welcome -- I am not a MySQL guy trying to batter PostgreSQL into making all the same mistakes I am used to -- just dealing with data I don't have full control over.
Could you produce a SQL file from shp2pgsql and do some massaging of the data before executing it? If the data is in COPY format, it should be easy to parse and change empty strings to "\N" (which inserts NULL) for those columns.
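One minimal sketch of that massaging, assuming shp2pgsql emitted its data in tab-separated COPY format (where an empty string shows up as an empty field):

```shell
# Turn empty tab-separated fields into \N, which COPY reads as NULL.
# Reads the COPY data section on stdin, writes massaged rows to stdout.
awk 'BEGIN { FS = OFS = "\t" }
     { for (i = 1; i <= NF; i++) if ($i == "") $i = "\\N"; print }'
```

Only the lines between COPY and \. in the shp2pgsql output should be run through this filter before feeding the file to psql.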
Another possibility would be to use shp2pgsql to load the data into a staging table where all the fields are defined as just 'text' type, and then use an INSERT...SELECT statement to copy the data to your final location, with the possibility of massaging the data in the SELECT to convert blank strings to null etc.
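A minimal sketch of that staging-table route (the table and column names here are hypothetical, and only a few columns are shown):

```sql
-- Stage everything as text so the load itself cannot fail on bad values.
CREATE TABLE places_staging (
    pl_fips    text,
    elevation  text,
    population text
);

-- shp2pgsql loads into places_staging; then copy into the real table,
-- converting blank strings to NULL on the way.
INSERT INTO places (pl_fips, elevation, population)
SELECT NULLIF(pl_fips, '')::int,
       NULLIF(elevation, '')::int,
       NULLIF(population, '')::int
FROM places_staging;
```

NULLIF(x, '') returns NULL when x is the empty string, so the cast to int never sees a blank value.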
I don't think there's a way to override the behaviour of how strings are converted to ints and so on: possibly you could create your own type or domain, and define an implicit cast that was more lenient... but this sounds pretty nasty, since the types are really just artifacts of how your data arrives in the system and not something you want to keep around after that.
You asked about fixing it up when changing the column type: you can do that too, for example:
steve#steve#[local] =# create table test_table(id serial primary key, testvalue text not null);
NOTICE: CREATE TABLE will create implicit sequence "test_table_id_seq" for serial column "test_table.id"
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "test_table_pkey" for table "test_table"
CREATE TABLE
steve#steve#[local] =# insert into test_table(testvalue) values('1'),('0'),('');
INSERT 0 3
steve#steve#[local] =# alter table test_table alter column testvalue type int using case testvalue when '' then 0 else testvalue::int end;
ALTER TABLE
steve#steve#[local] =# select * from test_table;
id | testvalue
----+-----------
1 | 1
2 | 0
3 | 0
(3 rows)
Which is almost equivalent to the "staging table" idea I suggested above, except that now the staging table is your final table. Altering a column type like this requires rewriting the entire table anyway: so actually, using a staging table and reformatting multiple columns at once is likely to be more efficient.
I have a table with CITEXT columns for case-insensitivity. When I try to update a CITEXT value to the same word in different casing, it does not work. Postgres returns 1 row updated, since it targeted 1 row, but the value is not changed.
Eg
Table Schema - users

Column    | Type
----------+--------------------
user_id   | SERIAL PRIMARY KEY
user_name | CITEXT
age       | INT
example row:
user_id | user_name | age
--------+-----------+----
1       | ayanaMi   | 99
SQL command:
UPDATE users SET user_name = 'Ayanami' WHERE user_id = 1
The above command returns UPDATE 1, but the casing does not change. I assume this is because Postgres sees the two values as equal.
The docs state:
If you'd like to match case-sensitively, you can cast the operator's arguments to text.
https://www.postgresql.org/docs/9.1/citext.html
I can force a case sensitive search by using CAST as such:
SELECT * FROM users WHERE CAST(user_name AS TEXT) = 'Ayanami'
(returns no rows)
Is there a way to force case sensitive updating?
I've got two tables, "modulo1_cella" and "modulo2_campionamento".
The first, "modulo1_cella", contains polygons, while the latter, "modulo2_campionamento", contains points (samples). Now, I need to assign to each polygon the nearest sample, along with the identifier of that sample.
Table "public.modulo1_cella"
Column | Type | Modifiers
-------------------+-------------------+------------------------------------------------------------------
cella_id | integer | not null default nextval('modulo1_cella_cella_id_seq'::regclass)
nome_cella | character varying |
geometria | geometry |
campione_id | integer |
dist_camp | double precision |
Table "public.modulo2_campionamento"
Column | Type | Modifiers
--------------------------+-----------------------------+----------------------------------------------------------------------------------
campione_id | integer | not null default nextval('modulo2_campionamento_aria_campione_id_seq'::regclass)
x_campionamento | double precision |
y_campionamento | double precision |
codice_campione | character varying(10) |
cella_id | integer |
geometria | geometry(Point,4326) |
I'm looking for an INSERT/UPDATE trigger that, for each row of the "modulo1_cella" table, i.e. for each polygon, returns:
the nearest sample, "campione_id";
the corresponding distance, "dist_camp".
I created a query that works, but I'm not able to convert it to a trigger.
CREATE TEMP TABLE TemporaryTable
(
cella_id int,
campione_id int,
distanza double precision
);
INSERT INTO TemporaryTable(cella_id, campione_id, distanza)
SELECT
DISTINCT ON (m1c.cella_id) m1c.cella_id, m2cmp.campione_id, ST_Distance(m2cmp.geometria::geography, m1c.geometria::geography) as dist
FROM modulo1_cella As m1c, modulo2_campionamento As m2cmp
WHERE ST_DWithin(m2cmp.geometria::geography, m1c.geometria::geography, 50000)
ORDER BY m1c.cella_id, ST_Distance(m2cmp.geometria::geography, m1c.geometria::geography);
UPDATE modulo1_cella as mc
SET campione_id=tt.campione_id, dist_camp=tt.distanza
from TemporaryTable as tt
where tt.cella_id=mc.cella_id;
DROP TABLE TemporaryTable;
Any help? Thank you in advance.
First, if "geometria" is geometry rather than geography, you should convert it to a geography type on the table.
ALTER TABLE modulo2_campionamento
  ALTER COLUMN geometria
  SET DATA TYPE geography(Point,4326)
  USING (geometria::geography);

-- the cells are polygons; use MultiPolygon here if that is what the data holds
ALTER TABLE modulo1_cella
  ALTER COLUMN geometria
  SET DATA TYPE geography(Polygon,4326)
  USING (geometria::geography);
Now, I need to assign to each polygon the nearest sample, and the identificative of the sampler itself.
You would not normally store this on the table, because a KNN search finds the nearest sample very quickly on the fly anyway.
CREATE INDEX ON modulo1_cella USING gist (geometria);
CREATE INDEX ON modulo2_campionamento USING gist (geometria);
VACUUM FULL ANALYZE modulo1_cella;
VACUUM FULL ANALYZE modulo2_campionamento;
SELECT *
FROM modulo1_cella AS m1c
CROSS JOIN LATERAL (
    SELECT *
    FROM modulo2_campionamento AS m2cmp
    WHERE ST_DWithin(m2cmp.geometria, m1c.geometria, 50000)
    ORDER BY m2cmp.geometria <-> m1c.geometria,
             m2cmp.campione_id
    FETCH FIRST ROW ONLY
) AS closest_match;
That's much faster than the DISTINCT ON query you wrote.
If that is fast enough, I suggest using a VIEW. If that's not fast enough, I suggest a MATERIALIZED VIEW. If it's still not fast enough, you have a very niche load and it may be worth investigating a solution with triggers. But only then.
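The materialized-view variant might look like this (the view name is hypothetical; the body is the lateral query above):

```sql
CREATE MATERIALIZED VIEW cella_nearest_campione AS
SELECT m1c.cella_id,
       closest_match.campione_id,
       ST_Distance(closest_match.geometria, m1c.geometria) AS dist_camp
FROM modulo1_cella AS m1c
CROSS JOIN LATERAL (
    SELECT m2cmp.campione_id, m2cmp.geometria
    FROM modulo2_campionamento AS m2cmp
    WHERE ST_DWithin(m2cmp.geometria, m1c.geometria, 50000)
    ORDER BY m2cmp.geometria <-> m1c.geometria
    FETCH FIRST ROW ONLY
) AS closest_match;

-- Re-run whenever the underlying tables change:
REFRESH MATERIALIZED VIEW cella_nearest_campione;
```

Unlike a trigger, the refresh recomputes everything in one pass, which is usually cheap at this scale and much simpler to maintain.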
If I have the following table:
| name | value |
------------------
| A | 1 |
| B | NULL |
Where at the moment name is of type varchar(10) and value is of type bit.
I want to change this table so that value is instead an nvarchar(3), without losing any of the information during the change. So in the end I want a table that looks like this:
| name | value |
------------------
| A | Yes |
| B | No |
What is the best way to convert this column from one type to another, and also convert all of the data in it according to a pre-determined translation?
NOTE: I am aware that if I were converting, say, a varchar(50) to a varchar(200), or an int to a bigint, I could just alter the table. But I need a similar procedure for converting a bit to an nvarchar, which does not work that way.
The best option is to ALTER the bit column to varchar and then run an UPDATE to change 1 to 'Yes' and 0 or NULL to 'No'.
This way you don't have to create a new column and then rename it later.
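That approach might look like this (the question's table is unnamed, so MyTable is hypothetical; a T-SQL sketch, untested):

```sql
-- Widen the column in place; existing bit values become the strings '1'/'0',
-- and NULLs stay NULL.
ALTER TABLE MyTable ALTER COLUMN value nvarchar(3);

-- Translate: '1' -> 'Yes'; '0' and NULL both fall through to 'No'.
UPDATE MyTable
SET value = CASE WHEN value = '1' THEN 'Yes' ELSE 'No' END;
```

The CASE handles NULL for free, since NULL = '1' is not true and therefore takes the ELSE branch.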
Alex K's comment to my question was the best.
Simplest and safest; Add a new column, update with transform, drop existing column, rename new column
Transforming each item with a simple:
UPDATE Table
SET temp_col = CASE
WHEN value=1
THEN 'yes'
ELSE 'no'
END
You should be able to change the data type from bit to nvarchar(3) without issue. The values will just turn from a bit 1 into the string "1". After that you can run some SQL to update "1" to "Yes" and "0" to "No".
I don't have SQL Server 2008 locally, but I did try this on 2012. Create a small table and test first, and back up your data to be safe.
Using PostgreSQL, what's the command to migrate an integer column type to a string column type?
Obviously I'd like to preserve the data, by converting the old integer data to strings.
You can convert from INTEGER to CHARACTER VARYING out of the box; all you need is an ALTER TABLE query changing the column type:
SQL Fiddle
PostgreSQL 9.3 Schema Setup:
CREATE TABLE tbl (col INT);
INSERT INTO tbl VALUES (1), (10), (100);
ALTER TABLE tbl ALTER COLUMN col TYPE CHARACTER VARYING(10);
Query 1:
SELECT col, pg_typeof(col) FROM tbl
Results:
| col | pg_typeof |
|-----|-------------------|
| 1 | character varying |
| 10 | character varying |
| 100 | character varying |
I suggest a five-step process:
1. Create a new string column; name it temp for now. See http://www.postgresql.org/docs/9.3/static/ddl-alter.html for details.
2. Populate the string column: something like update myTable set temp=cast(intColumn as text). See http://www.postgresql.org/docs/9.3/static/functions-formatting.html for more interesting number->string conversions.
3. Make sure everything in temp looks the way you want it.
4. Remove your old integer column. Once again, see http://www.postgresql.org/docs/9.3/static/ddl-alter.html for details.
5. Rename temp to the old column name. Again: http://www.postgresql.org/docs/9.3/static/ddl-alter.html
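The steps above can be sketched as (myTable, intColumn, and temp are the placeholder names from the answer):

```sql
ALTER TABLE myTable ADD COLUMN temp text;             -- step 1
UPDATE myTable SET temp = cast(intColumn AS text);    -- step 2
-- step 3: inspect, e.g. SELECT intColumn, temp FROM myTable LIMIT 20;
ALTER TABLE myTable DROP COLUMN intColumn;            -- step 4
ALTER TABLE myTable RENAME COLUMN temp TO intColumn;  -- step 5
```

Wrapping the whole sequence in a transaction lets you ROLLBACK if step 3 turns up surprises.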
This assumes you can perform the operation while no clients are connected; offline. If you need to make this (drastic) change in an online table, take a look at setting up a new table with triggers for live updates, then swap to the new table in an atomic operation. see ALTER TABLE without locking the table?
I wanted to change the EN_NO length from 21 to 16 in the SQL table TB_TRANSACTION. Below are my current column details.
sql command -
describe table tb_transaction
Column | Type schema | Type name | Length | Scale | Nulls
-------+-------------+-----------+--------+-------+------
EN_NO  | SYSIBM      | VARCHAR   | 21     | 0     | Yes
I tried this command, but it failed:
alter table tb_transaction alter column EN_NO set data type varchar(16)
Error message:
SQL0190N ALTER TABLE "EASC.TB_TRANSACTION" specified attributes for column
"EN_NO" that are not compatible with the existing column. SQLSTATE=42837
Any help would be appreciated.
We can increase the size of a column, but we cannot decrease it, because data loss could occur; that is why the system will not allow it.
If you still want to decrease the size, you need to drop that column and add it again.
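A sketch of that drop-and-re-add route (this discards the column's data, so copy it elsewhere first if you need it; on DB2 LUW the table typically lands in reorg-pending state after DROP COLUMN):

```sql
ALTER TABLE tb_transaction DROP COLUMN EN_NO;
ALTER TABLE tb_transaction ADD COLUMN EN_NO VARCHAR(16);

-- DB2 usually requires a reorg before the table is fully usable again:
-- CALL SYSPROC.ADMIN_CMD('REORG TABLE tb_transaction');
```

If the existing values must be preserved, stage them in a scratch column or table, truncate them to 16 characters explicitly, and copy them back after the re-add.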