Write CSV file in append mode on Azure Databricks - pandas

I wanted to write a CSV file in append mode on Azure Databricks. The code below works fine on my local machine (Jupyter notebook):
import pandas as pd

df = pd.read_csv("/dbfs/mnt/dev/tmp/ml_p/csv_append.csv")
df + 6
(Screenshot of the result: https://i.stack.imgur.com/sXsgH.png)
When I opened the same CSV file and tried to save it after performing the operation, I got OSError: [Errno 95] Operation not supported:
with open('/dbfs/mnt/dev/tmp/ml_p/csv_append.csv', 'a') as f:
    (df + 6).to_csv(f, header=False)
Is there another way to write the CSV file in append mode? Or can I achieve the same thing using PySpark?

There are some limitations on what operations can be performed on files on DBFS (especially via the /dbfs mount point), and you have hit one of them. The workaround is to copy the file from DBFS to the local file system, modify it there the same way you do now, and then upload it back. Copying the file can be done with dbutils.fs commands, for example:
# copy the file from DBFS to the driver's local disk
dbutils.fs.cp("dbfs:/mnt/dev/tmp/ml_p/csv_append.csv", "file:/tmp/csv_append.csv")
df = pd.read_csv("/tmp/csv_append.csv")
# append locally, where random writes are supported
with open('/tmp/csv_append.csv', 'a') as f:
    (df + 6).to_csv(f, header=False)
# move the modified file back to DBFS
dbutils.fs.mv("file:/tmp/csv_append.csv", "dbfs:/mnt/dev/tmp/ml_p/csv_append.csv")
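Since the question also asks about PySpark: a hedged sketch (untested; the output path is a placeholder, and spark is the session Databricks provides in notebooks). Note that Spark appends by writing a directory of part files rather than a single CSV file.
# Convert the pandas result to a Spark DataFrame and append it to a directory of part files.
sdf = spark.createDataFrame(df + 6)
sdf.write.mode("append").option("header", "false").csv("dbfs:/mnt/dev/tmp/ml_p/csv_append_parts")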

Related

Bigquery - Export multiple csv files from Storage bucket to local folder (C drive)

What I am trying to achieve:
Copy data from a query/view into CSV files, then concatenate them and convert the result to a .hyper file for Tableau.
Step 1:
I am using the EXPORT DATA command to copy data from a view into CSV files. As the view is huge (4 million+ records), it dumps 250+ files into the storage bucket.
Export Command Used:
EXPORT DATA OPTIONS(
  uri='gs://test/*.csv',
  format='CSV',
  overwrite=true,
  header=true,
  field_delimiter=';') AS
SELECT * FROM `xa.dev.project.DATASET.VIEW`
Can I dump the CSV files to a local folder instead of a bucket, so that I can read and concatenate them locally (Python/pandas) and then convert them to a .hyper file?
Step 2:
I have tried the below to copy the files from the bucket to a local folder, but it does not respond:
gsutil -m cp -r gs://test_bucket/folder1/*.csv "C:\Users\xyz\Desktop\Test\"
Result: No response or error
Step 3: Concatenate the CSV files using pandas and create a .hyper file using Pantab.
Is there a faster or better way to achieve this?
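For step 3, a minimal sketch under stated assumptions: the part files have already been copied to the local folder used above, the ';' delimiter matches the export, and Pantab's frame_to_hyper builds the extract (the output file and table name are hypothetical).
import glob
import pandas as pd
import pantab

# Read every exported part file and concatenate them into one DataFrame.
parts = sorted(glob.glob(r"C:\Users\xyz\Desktop\Test\*.csv"))
df = pd.concat((pd.read_csv(p, sep=';') for p in parts), ignore_index=True)

# Write the combined DataFrame to a Tableau .hyper extract.
pantab.frame_to_hyper(df, "export.hyper", table="extract")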

Kubeflow Error in handling large input file: The node was low on resource: ephemeral-storage

In Kubeflow, when the input file size is really large (60 GB), I am getting 'The node was low on resource: ephemeral-storage.' It looks like Kubeflow is using the /tmp folder to store the files. I have the following questions:
What is the best way to exchange really large files? How can I avoid the ephemeral-storage issue?
Are all the InputPath and OutputPath files stored in the MinIO instance of Kubeflow? If yes, how can we purge the data from MinIO?
When data is passed from one stage of the workflow to the next, does Kubeflow download the file from MinIO, copy it to the /tmp folder, and pass the InputPath to the function?
Is there a better way to pass a pandas DataFrame between different stages of the workflow? Currently I export the DataFrame as CSV to the OutputPath of one operation and reload it from the InputPath in the next stage (see the sketch after the code below).
Is there a way to use a different volume for file exchange instead of ephemeral storage? If yes, how can I configure it?
import pandas as pd

print("text_path:", text_path)

pd_df = pd.read_csv(text_path)
print(pd_df)
with open(text_path, 'r') as reader:
    for line in reader:
        print(line, end='')
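For the DataFrame hand-off question, a hedged sketch using kfp v1 lightweight components; the function names, base image, and the switch from CSV to Parquet are assumptions for illustration, not something Kubeflow requires.
from kfp.components import InputPath, OutputPath, create_component_from_func

def produce_table(table_path: OutputPath()):
    # Serialize the DataFrame as Parquet: smaller and faster to reload than CSV.
    import pandas as pd
    df = pd.DataFrame({"a": range(10)})  # placeholder data
    df.to_parquet(table_path)

def consume_table(table_path: InputPath()):
    import pandas as pd
    df = pd.read_parquet(table_path)
    print(df.head())

produce_op = create_component_from_func(
    produce_table, base_image="python:3.9", packages_to_install=["pandas", "pyarrow"])
consume_op = create_component_from_func(
    consume_table, base_image="python:3.9", packages_to_install=["pandas", "pyarrow"])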

How to actually save a csv file to google drive from colab?

This problem seems very simple, but apparently it is not.
I need to export a pandas DataFrame to a CSV file and save it in Google Drive.
My Drive is mounted, and I have been able to save a zip file and other kinds of files to it.
However, when I do:
df.to_csv("file_path\data.csv")
it seems to save it where I want: it shows up in the left panel in Colab, where you can see all your files from all your directories. I can also read this CSV file back as a DataFrame with pandas in the same notebook.
HOWEVER, when I actually go to my Google Drive, I can never find it! I need this to happen in code because I want the user to be able to just run all cells and find the CSV file in their Drive.
I have tried everything I could find online and I am running out of ideas!
Can anyone help please?
I have also tried the following, which creates a visible file named data.csv, but it only contains the file path:
import csv
with open('file_path/data.csv', 'w', newline='') as csvfile:
    csvfile.write('file_path/data.csv')
HELP :'(
Edit:
import csv
with open('/content/drive/MyDrive/Datatourisme/tests_automatisation/data_tmp.csv') as f:
    s = f.read()
with open('/content/drive/MyDrive/Datatourisme/tests_automatisation/data.csv', 'w', newline='') as csvfile:
    csvfile.write(s)
seems to do the trick:
first export as CSV with pandas (I named this one data_tmp.csv),
then read it and put the contents in a variable,
then write the result of this read into another file, which I named data.csv;
this data.csv file can be found in my Drive :)
HOWEVER, when the CSV file I try to open is too big (mine has 100,000 rows), it does nothing.
Has anyone got any idea?
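For the large-file case, a hedged alternative to the read-then-write loop above is a streamed copy that never loads the whole file into memory (paths reused from the edit):
import shutil

# shutil.copyfile streams the file in chunks instead of reading it all at once.
shutil.copyfile(
    '/content/drive/MyDrive/Datatourisme/tests_automatisation/data_tmp.csv',
    '/content/drive/MyDrive/Datatourisme/tests_automatisation/data.csv',
)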
First of all, mount your Google Drive in Colab:
from google.colab import drive
drive.mount('/content/drive')
Allow the Google Drive permission prompt.
Then save your DataFrame as CSV:
import pandas as pd
filename = 'filename.csv'
df.to_csv('/content/drive/' + filename)
In some cases the directory '/content/drive/' may not work, so try '/content/drive/MyDrive/' instead.
Hope it helps!
Here:
df.to_csv("/Drive Path/df.csv", index=False, encoding='utf-8-sig')
I recommend using pandas to work with data in Python; it works very well.
Here is a simple tutorial: https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html
Then, to save your DataFrame to Drive (if your Drive is already mounted), use to_csv:
dataframe.to_csv("/content/drive/MyDrive/filename.csv", index=False) will do the trick.
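Putting these answers together, a minimal end-to-end sketch (the file name is a placeholder; drive.flush_and_unmount() forces pending writes to sync, which can help when the file does not show up in the Drive web UI right away):
from google.colab import drive
import pandas as pd

drive.mount('/content/drive')

df = pd.DataFrame({'a': [1, 2, 3]})  # placeholder DataFrame
df.to_csv('/content/drive/MyDrive/data.csv', index=False)

# Flush cached writes so the file appears in Google Drive.
drive.flush_and_unmount()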

How to use spark toLocalIterator to write a single file in local file system from cluster

I have a PySpark job which writes my resulting DataFrame to the local filesystem. Currently it runs in local mode, so I do coalesce(1) to get a single file, as below:
file_format = 'avro' # will be dynamic and so it will be like avro, json, csv, etc
df.coalesce(1).write.format(file_format).save('file:///pyspark_data/output')
But I see a lot of memory issues (OOM) and it takes longer as well. So I want to run this job with master as yarn and deploy mode as client. To write the resulting DataFrame into a single file on the local system, I need to use toLocalIterator, which yields Rows. How can I stream these Rows into a file of the required format (JSON/Avro/CSV/Parquet and so on)?
file_format = 'avro'
for row in df.toLocalIterator():
    # write the data into a single file
    pass
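For illustration, a hedged sketch of what the CSV case could look like (the output path is a placeholder; every other format would need its own writer, which is part of why the answer below advises against this approach):
import csv

with open('/pyspark_data/output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(df.columns)        # header row
    for row in df.toLocalIterator():   # rows stream through the driver one by one
        writer.writerow(row)           # a Row is iterable over its values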
You get the OOM error because you try to pull all the data into a single partition with coalesce(1).
I don't recommend using toLocalIterator, because you would have to re-write a custom writer for every format and you would lose parallel writing.
Your first solution is a good one:
df.write.format(file_format).save('file:///pyspark_data/output')
If you use Hadoop, you can then merge all the data into one file on the local filesystem this way (it works for CSV; you can try it for other formats):
hadoop fs -getmerge <HDFS src> <FS destination>

Cannot open a csv file

I have a CSV file that I need to work with in my Jupyter notebook. Even though I am able to view the contents of the file using the code in the picture,
when I try to convert the data into a DataFrame I get a 'No columns to parse from file' error.
I have no headers. My CSV file looks like this, and I have also saved it in UTF-8 format.
Try using pandas to read the CSV file:
df = pd.read_csv("BON3_NC_CUISINES.csv)
print(df)
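Since the question mentions the file has no header row, a hedged variant (the column names are hypothetical):
import pandas as pd

# header=None stops pandas from treating the first data row as column names.
df = pd.read_csv("BON3_NC_CUISINES.csv", header=None, names=["col1", "col2"])
print(df.head())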