PostgreSQL trouble when populating a table with a comma separator

I am trying to populate a table, but some values in my "name" array contain a ",". So this is not working, and if I use something other than a comma as the separator, it says:
BadCopyFileFormat: missing data for column
s = "CREATE TABLE IF NOT EXISTS tokens (address varchar(100) NOT NULL,symbol varchar(100) NOT NULL,name varchar(100) NOT NULL)"
db_cursor.execute(s)
with open('data/tokens.csv', 'r', encoding="utf-8") as f:
next(f) # Skip the header row.
db_cursor.copy_from(f, 'tokens', sep=',')
db_conn.commit()
My data looks like this:
address symbol name
x23fva3 ABC ABC
2vajd83 DAP
29vb4h2 Wink Jamal, ab
2jsbg93 x3 xon3
Is there a way to populate the table when some values are missing?

What I got to work:
cat data/tokens.csv
address |symbol|name
x23fva3 | ABC | ABC
2vajd83 | DAP |
29vb4h2 | Wink | Jamal, ab
2jsbg93 | x3 | xon3
with open('data/tokens.csv', 'r', encoding="utf-8") as f:
    next(f)  # Skip the header row.
    db_cursor.copy_from(f, 'tokens', sep='|')
db_conn.commit()
select * from tokens ;
address | symbol | name
----------+--------+------------
x23fva3 | ABC | ABC
2vajd83 | DAP |
29vb4h2 | Wink | Jamal, ab
2jsbg93 | x3 | xon3
I use the pipe (|) regularly for this sort of thing, as it very rarely shows up in data on its own.
UPDATE
For a file with empty values, there still needs to be a separator for each field, like:
address |symbol|name
x23fva3 | ABC | ABC
2vajd83 | DAP |
32vb4h3 | |
1jsbg94 | | xon3
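If the commas have to stay in the file, another option is psycopg2's copy_expert, which lets Postgres's COPY parse real CSV, quoted fields and all. A minimal sketch, assuming the same cursor and connection as above and that fields containing commas are quoted in the file (e.g. 29vb4h2,Wink,"Jamal, ab"):
with open('data/tokens.csv', 'r', encoding='utf-8') as f:
    # FORMAT csv handles quoted values and empty fields; HEADER true skips the header row.
    db_cursor.copy_expert("COPY tokens FROM STDIN WITH (FORMAT csv, HEADER true)", f)
db_conn.commit()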

Related

Create external table from CSV on HDFS, all values come with quotes

I have a CSV file on HDFS and I am trying to create an Impala table. The table gets created, but every value comes in wrapped in " quotes:
CREATE external TABLE abc.def
(
name STRING,
title STRING,
last STRING,
pno STRING
)
row format delimited fields terminated by ','
location 'hdfs:pathlocation'
tblproperties ("skip.header.line.count"="1") ;
The output is
name title last pno
"abc" "mr" "xyz" "1234"
"rew" "ms" "pre" "654"
I just want to create the table from the CSV file without the quotes. Please guide me on where I am going wrong.
Regards,
R
One way to do this is to create a stage table that loads the file with the quotes, and then use CTAS (CREATE TABLE AS SELECT) to build the final table, cleaning the fields with the replace function.
As an example:
CREATE TABLE quote_stage(
id STRING,
name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
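The stage table still needs the file loaded into it; assuming the same HDFS location as in the question, something like:
LOAD DATA INPATH 'hdfs:pathlocation' INTO TABLE quote_stage;
After the load, the staged values keep their quotes: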
+-----+----------+
| id | name |
+-----+----------+
| "1" | "pepe" |
| "2" | "ana" |
| "3" | "maria" |
| "4" | "ramon" |
| "5" | "lucia" |
| "6" | "carmen" |
| "7" | "alicia" |
| "8" | "pedro" |
+-----+----------+
CREATE TABLE t_quote
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
AS SELECT replace(id,'"','') AS id, replace(name,'"','') AS name FROM quote_stage;
+----+--------+
| id | name |
+----+--------+
| 1 | pepe |
| 2 | ana |
| 3 | maria |
| 4 | ramon |
| 5 | lucia |
| 6 | carmen |
| 7 | alicia |
| 8 | pedro |
+----+--------+
Hope this helps.

PySpark join on a pipe-separated column

I have two data frames that I want to join. The catch is that one of the tables has a pipe-separated string, and one of the values inside it is what I want to join on. How do I do this in PySpark? Below is an example.
TABLE A has
+-------+--------------------+
|id | name |
+-------+--------------------+
| 613760|123|test|test2 |
| 613740|456|ABC |
| 598946|OMG|567 |
TABLE B has
+-------+--------------------+
|join_id| prod_type|
+-------+--------------------+
| 123 |Direct De |
| 456 |Direct |
| 567 |In |
Expected result: join Table A and Table B wherever a value in Table A's pipe-separated name matches Table B's join_id. For instance, for TableA.id 613760 the name contains 123, so it should join with Table B's join_id 123; likewise 456 and 567.
Resultant Table
+----------------+-------+
| name           |join_Id|
+----------------+-------+
|123|test|test2  |123    |
|456|ABC         |456    |
|OMG|567         |567    |
+----------------+-------+
Can someone help me solve this? I am relatively new to PySpark and still learning.
To solve your problem you need to:
split those pipe-separated strings,
then explode the resulting values into separate rows; posexplode will do that for you: http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.posexplode
from there an inner join,
and finally a select does the rest of the trick.
See the code below:
import pyspark.sql.functions as f
#First create the dataframes to test solution
table_A = spark.createDataFrame([(613760, '123|test|test2' ), (613740, '456|ABC'), (598946, 'OMG|567' )], ["id", "name"])
# +-------+--------------------+
# |id | name |
# +-------+--------------------+
# | 613760|123|test|test2 |
# | 613740|456|ABC |
# | 598946|OMG|567 |
table_B = spark.createDataFrame([('123', 'Direct De' ), ('456', 'Direct'), ('567', 'In' )], ["join_id", "prod_type"])
# +-------+--------------------+
# |join_id| prod_type|
# +-------+--------------------+
# | 123 |Direct De |
# | 456 |Direct |
# | 567 |In |
result = table_A \
    .select(
        'name',
        f.posexplode(f.split(f.col('name'), r'\|')).alias('pos', 'join_id')) \
    .join(table_B, on='join_id', how='inner') \
    .select('name', 'join_id')
result.show(10, False)
# +--------------+-------+
# |name |join_id|
# +--------------+-------+
# |123|test|test2|123 |
# |456|ABC |456 |
# |OMG|567 |567 |
# +--------------+-------+
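Since the pos column from posexplode is never used afterwards, a plain explode gives the same result here; a minimal equivalent sketch:
result_alt = table_A \
    .select('name', f.explode(f.split(f.col('name'), r'\|')).alias('join_id')) \
    .join(table_B, on='join_id', how='inner') \
    .select('name', 'join_id')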
Hope that works. As you continue getting better at PySpark, I would recommend going through the functions in pyspark.sql.functions; that will take your skills to the next level.

Can SQL REPLACE function as a "find and replace" on both strings and substrings?

I have a database of boxes within boxes. Max nesting depth is 10, so each box could have up to 9 parent or child locations. One field contains the hierarchy of each box - i.e. for box DEF which is inside ABC:
SELECT hierarchy from INVENTORY WHERE boxname = 'DEF' returns "ABC -> DEF".
I now need to allow users to rename boxes. I'm trying to use SQL's REPLACE function to accomplish this, but it can't work on substrings as far as I can tell. I've tried:
update inventory
set hierarchy = replace(hierarchy, 'DEF', 'XYZ')
but this doesn't update the hierarchy to "ABC -> XYZ" like I'd expect.
My hope is to use it as a "Ctrl+F find and replace" function, but it seems like it can't do the following:
Find all fields that contain the string, including as a substring.
Replace all occurrences across all fields for a given record.
Does anyone know if either of these is indeed possible?
I'm using T-SQL.
sample data as requested:
input:
| name | parent1 | parent2 | ... | hierarchy |
| --- | --- | --- | --- | --- |
| DEF | ABC | | | ABC -> DEF |
| JKL | DEF | ABC | | ABC -> DEF -> JKL |
output:
| name | parent1 | parent2 | ... | hierarchy |
| --- | --- | --- | --- | --- |
| XYZ | ABC | | | ABC -> XYZ |
| JKL | XYZ | ABC | | ABC -> XYZ -> JKL |
REPLACE does operate on substrings within a single value, so the UPDATE above should turn "ABC -> DEF" into "ABC -> XYZ" (if it appears not to, check that the change was committed and that the row actually matches). What SQL does not have is a single built-in "find and replace" across every column of a record: rows are found per column with LIKE '%DEF%', and each column to rewrite has to be named in the SET clause, as in the sketch below.
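A minimal T-SQL sketch of the per-column approach (column names taken from the sample data; the real table has more parent columns to list):
update inventory
set name      = replace(name, 'DEF', 'XYZ'),
    parent1   = replace(parent1, 'DEF', 'XYZ'),
    parent2   = replace(parent2, 'DEF', 'XYZ'),
    hierarchy = replace(hierarchy, 'DEF', 'XYZ')
where name like '%DEF%'
   or parent1 like '%DEF%'
   or parent2 like '%DEF%'
   or hierarchy like '%DEF%'
One caveat: a plain string replace also rewrites boxes whose names merely contain DEF (for example DEFG), so delimiter-aware matching (e.g. on ' -> DEF') may be needed in practice.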

Change value from one column into another

I have got a table:
ID | Description
--------------------
1.13.1-3 | .1 Hello
1.13.1-3 | .2 World
1.13.1-3 | .3 Text
4.54.1-4 | sthg (.1) Ble
4.54.1-4 | sthg (.2) Bla
4.54.1-4 | aaaa (.3) Qwer
4.54.1-4 | bbbb (.4) Tyuio
and I would like to change the ending of the ID by taking the value from the second column, to get a result like:
ID | Description
--------------------
1.13.1 | Hello
1.13.2 | World
1.13.3 | Text
4.54.1 | Ble
4.54.2 | Bla
4.54.3 | Qwer
4.54.4 | Tyuio
Is there any quick way to do it in PostgreSQL?
Use regex to manipulate the strings into what you want:
update mytable set
    ID = regexp_replace(ID, '\.[^.]*$', '') || substring(Description from '\.[0-9]+'),
    Description = regexp_replace(Description, '.*\.[0-9]+\S* ', '')
See SQLFiddle showing this query working with your data.
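To preview the transformation before running the UPDATE, the same expressions can be run as a plain SELECT first (a sketch against the same placeholder table):
select ID,
       regexp_replace(ID, '\.[^.]*$', '') || substring(Description from '\.[0-9]+') as new_id,
       regexp_replace(Description, '.*\.[0-9]+\S* ', '') as new_description
from mytable;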

Can you use XML datatype based statements on XML stored as an nvarchar datatype in MSSQL?

My data table looks like this, with the Artist column as NVARCHAR(MAX), but it holds text that is basically an XML document.
Id | Name | Surname | Title | Location | Artist |
-------------------------------------------------------------------
1 | xxx | abc | def | London | XML string in Nvarchar |
2 | xxx | abc | def | Oslo | XML string in Nvarchar |
3 | xxx | abc | def | New York | XML string in Nvarchar |
My XML looks like this:
<song category="gaming">
  <title>Valentine's Day</title>
  <artist-main>Fatfinger</artist-main>
  <artist-featured>Slimthumb</artist-featured>
  <year>2013</year>
  <price>29.99</price>
  <album>Gamestain</album>
  <albumimg>http://download.gamezone.com/uploads/image/data/875338/halo-4.jpg</albumimg>
  <songurl>http://www.youtube.com/watch?v=-J0ABq9TnCw</songurl>
</song>
Can I use an XML-datatype-based SQL statement like the one shown below on the Artist column?
SELECT Id, Name, Surname, Title
FROM #table
WHERE Artist.value('(/artist-main)[1]','varchar(max)') = '%FatFinger%'
If it is valid XML, you can just cast it like this (note that artist-main sits under the song root element, so the path needs /song):
WHERE (CAST(Artist AS XML)).value('(/song/artist-main)[1]','varchar(max)') = 'FatFinger'
(I removed the % signs around your search string. If you need them, maybe you intended to use LIKE instead of =?)
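If a wildcard match is actually wanted, a minimal sketch of the LIKE variant (same temp table and columns as in the question; note the sample XML spells the value 'Fatfinger'):
SELECT Id, Name, Surname, Title
FROM #table
WHERE (CAST(Artist AS XML)).value('(/song/artist-main)[1]','varchar(max)') LIKE '%Fatfinger%'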