Reading CSV file using pyspark - dataframe

While reading a CSV file with a customized schema definition, the column count differs from the count I get with inferSchema. Can anyone help me understand why this happens?
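For reference, a minimal PySpark sketch of the two read modes (the file path and column names below are assumptions); comparing len(df.columns) for both reads usually shows where the counts diverge, for example a wrong delimiter or a schema with a different number of fields than the file:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.getOrCreate()

# Explicit schema: Spark uses exactly these columns and does not inspect the data.
schema = StructType([
    StructField("id", StringType(), True),
    StructField("name", StringType(), True),
    StructField("amount", DoubleType(), True),
])
df_schema = spark.read.csv("/tmp/data.csv", header=True, schema=schema)

# inferSchema: Spark scans the file and derives both the columns and their types.
df_infer = spark.read.csv("/tmp/data.csv", header=True, inferSchema=True)

print(len(df_schema.columns), len(df_infer.columns))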

Related

Need SQL expression for REPLACE

During a CSV file import from an external URL I need to execute REPLACE.
(I can't edit the CSV file manually/locally because it's located on the supplier's FTP and will be used in the future for automated, recurring add/delete/update of the products in the file.)
I have this expression for replacing the value of a column in the CSV file:
REPLACE([CSV_COL(6)],"TEXSON","EXTERNAL")
It works for column 6 in the CSV file because all row values in that column are the same (TEXSON).
What I need help with:
In column 5 of the CSV file I have various values, and there is no connection between them.
How can I run an expression that replaces all values in column 5 in the CSV with "EXTERNAL"?
See the image of how it looks in the CSV file.
Maybe some "wildcard" to just replace everything in that column, no matter what value it is...
Additional information: I'm working with the PrestaShop Store Manager to import products to the shop from our supplier...
Thanks!
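If a pre-processing step in the automated pipeline is acceptable, a small Python sketch like the one below overwrites every value in column 5 regardless of its content (file names are assumptions, and column 5 counted from 1 is index 4 here); this is only an illustration outside of Store Manager's own expression syntax:

import csv

with open("supplier.csv", newline="") as src, open("supplier_fixed.csv", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        if len(row) > 4:
            row[4] = "EXTERNAL"  # overwrite whatever value is in column 5
        writer.writerow(row)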

How to write dataframe to csv file with sheetname using spark scala

I am trying to write the dataframe to a CSV file with the option sheetName, but it is not working for me.
df13.coalesce(1).write
  .option("delimiter", ",")
  .mode(SaveMode.Overwrite)
  .option("sheetName", "Info")
  .option("header", "true")
  .option("escape", "")
  .option("quote", "")
  .csv("path")
Can anyone help me with that?
I don't think you actually have a sheet name in a CSV file; effectively the filename is the sheet name of a CSV file. Can you try changing to Excel and see if that works?
Spark can't directly do this while writing a CSV; there is no option called sheetName. The output path is the path you pass to .csv("path").
Spark uses Hadoop's file format, which is partitioned into multiple part files under the output path, one part file in your case. Also, do not repartition to 1 unless you really need it.
One thing you can do is write the dataframe without repartitioning and use the Hadoop API to merge the many small part files into a single file.
Here is more detail: Write single CSV file using spark-csv
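As a rough sketch of that merge step (the output directory and local target file are assumptions), the hadoop fs -getmerge command can be driven from Python like this:

import subprocess

# Merge all part files under the Spark output directory into one local CSV.
subprocess.run(
    ["hadoop", "fs", "-getmerge", "/data/output_dir", "/tmp/merged.csv"],
    check=True,
)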
We only get one default sheet in a CSV file; if we want multiple sheets, then we should write the dataframe to Excel format instead of CSV format.
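If a named sheet is really required, one hedged option is to read the part file Spark produced and re-write it as an Excel workbook with pandas, which does support sheet_name (the path is an assumption and the openpyxl package must be installed):

import glob
import pandas as pd

# Pick up the single part file Spark wrote, then save it as an Excel sheet named "Info".
part_file = glob.glob("path/part-*.csv")[0]
pd.read_csv(part_file).to_excel("Info.xlsx", sheet_name="Info", index=False)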

TYPE command. Inserting csv file

I have a CSV file I'm looking to load into a T-SQL table using the "type" command.
Code: type yourfilename
When looking in the command prompt, it's breaking the file lines into two different rows and inserting them separately into my destination table.
Example:
"Manheim Chicago","Manheim","IL","199004520601","On
Block","2D4FV47V86H126473","2006","DODGE","MAGNUM 4X2 V6"
I want the solution to look like this (Solution Pic: https://i.stack.imgur.com/Bkgf6.png), where this would be one record in the table.
Question: does anyone know how to format a type command so it displays a full record without line breaks?
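The break happens because the "On Block" value contains a newline inside its quotes, so the record really spans two physical lines; type just echoes raw lines and cannot reassemble them. A quote-aware reader can, as in this sketch (file names are assumptions):

import csv

# csv.reader honours quoting, so the embedded newline inside "On Block"
# stays within one field and the two physical lines come back as one record.
with open("auctions.csv", newline="") as src, open("auctions_clean.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        writer.writerow([field.replace("\n", " ") for field in row])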

CSV file metadata validation (comparing with existing SQL table)

I have a requirement to validate a CSV file before loading it into a staging folder, and later load it into a SQL table.
I need to validate the metadata (the structure of the file must be the same as the target SQL table):
The number of columns should be equal to that of the target SQL table
The order of columns should be the same as in the target SQL table
Data types of columns (no text values should exist in numeric fields of the CSV file)
I'm looking for an easy and efficient way to achieve this.
Thanks for the help.
A Python program and module that does most of what you're looking for is chkcsv.py: https://pypi.org/project/chkcsv/. It can be used to verify that a CSV file contains a specified set of columns and that the data type of each column conforms to the specification. It does not, however, verify that the order of columns in the CSV file is the same as the order in the database table. Instead of loading the CSV file directly into the target table, you can load it into a staging table and then move it from there into the target table--this two-step process eliminates column order dependence.
Disclaimer: I wrote chkcsv.py
Edit 2020-01-26: I just added an option that allows you to specify that the column order should be checked also.
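If a hand-rolled check is preferred instead, a rough sketch along these lines compares the CSV header with the target table's columns via INFORMATION_SCHEMA and spot-checks numeric fields (the connection string, table name, file name, and set of numeric columns are all assumptions):

import csv
import pyodbc  # assumes the SQL table is reachable via ODBC

NUMERIC_COLUMNS = {"price", "quantity"}  # assumed numeric columns

conn = pyodbc.connect("DSN=target_db")
cur = conn.cursor()
cur.execute(
    "SELECT COLUMN_NAME FROM INFORMATION_SCHEMA.COLUMNS "
    "WHERE TABLE_NAME = 'target_table' ORDER BY ORDINAL_POSITION"
)
table_cols = [r[0] for r in cur.fetchall()]

with open("incoming.csv", newline="") as f:
    reader = csv.reader(f)
    csv_cols = next(reader)

    # Column count and order must match the target table exactly.
    if csv_cols != table_cols:
        raise ValueError(f"Header mismatch: {csv_cols} vs {table_cols}")

    # No text values allowed in numeric fields.
    for line_no, row in enumerate(reader, start=2):
        for col, value in zip(csv_cols, row):
            if col in NUMERIC_COLUMNS and value not in ("", "NULL"):
                try:
                    float(value)
                except ValueError:
                    raise ValueError(f"Non-numeric value {value!r} in {col} on line {line_no}")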

Issues loading CSV into BigQuery table

I'm trying to create a BigQuery table using a pretty simple CSV file I have stored in GCS.
I keep getting the same error over and over again:
Could not parse '1/1/2008' as datetime for field XXX
I've checked that the csv file isn't corrupted, and I've managed to upload everything into one column so the file is readable by BigQuery.
I've added the word NULL to any empty fields thinking consecutive delimiters may be causing the issues but I am still facing the same issue.
I know data, I understand data and CSV files.
BigQuery cannot cast '1/1/2008' as DATETIME and would rather expect something like '2008-1-1'.
So, you can either modify your CSV file or just use STRING for that XXX field and then translate it into DATETIME in your queries, like below:
#standardSQL
SELECT PARSE_DATETIME('%d/%m/%Y', '1/1/2008')
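For the load-as-STRING route, a rough sketch with the google-cloud-bigquery client could look like this (project, dataset, table, bucket path, and field names are assumptions):

from google.cloud import bigquery

client = bigquery.Client()

# Declare the date column as STRING so the CSV load never fails on '1/1/2008'.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    schema=[
        bigquery.SchemaField("xxx", "STRING"),
        bigquery.SchemaField("value", "STRING"),
    ],
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/data.csv",
    "my-project.my_dataset.my_table",
    job_config=job_config,
)
load_job.result()

# Convert the string to DATETIME at query time.
query = """
    SELECT PARSE_DATETIME('%d/%m/%Y', xxx) AS xxx_dt
    FROM `my-project.my_dataset.my_table`
"""
for row in client.query(query).result():
    print(row.xxx_dt)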