Write CSV to HDFS from stream with pyarrow upload - pandas

I am trying to save a Pandas DataFrame to HDFS in CSV format using pyarrow's upload method, but the saved CSV file is empty. The code example is below.
import io
import pandas as pd
import pyarrow as pa
df = pd.DataFrame({"x": [1, 2, 3]})
buf = io.StringIO()
df.to_csv(buf)
hdfs = pa.hdfs.connect()
hdfs.upload("path/to/hdfs/test.csv", buf)
When I check the contents of test.csv on HDFS it is empty. What did I do wrong? Thanks.

You need to call buf.seek(0) before uploading.
Basically you need to rewind to the beginning of the buffer, otherwise HDFS thinks there's nothing to upload:
>>> buf.read()
''
>>> buf.seek(0)
0
>>> buf.read()
',x\n0,1\n1,2\n2,3\n'
>>> buf.read()
''
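For reference, here is the question's flow with the rewind added (a minimal sketch; it assumes the legacy pa.hdfs.connect() API from the question is still available in your pyarrow version):
import io
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"x": [1, 2, 3]})

buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)  # rewind so upload() reads the CSV from the start, not from the end

hdfs = pa.hdfs.connect()
hdfs.upload("path/to/hdfs/test.csv", buf)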

Related

How to convert csv to parquet using pandas?

I want to convert my CSV file to a parquet file. My code below causes my kernel to be KILLED regardless of the chunksize parameter. I do not know the number of rows x columns in my file, but I suspect that I have many columns.
What is the ideal solution?
With Pandas:
import pandas as pd
import dask.dataframe as dd
import pyarrow as pa
import pyarrow.parquet as pq
csv_file = "kipan_exon.csv.gz"
parquet_file = "kipan_exon.csv.gz"
chunksize = 1000000
df = pd.read_csv(csv_file, sep="\t", chunksize=chunksize, low_memory=False, compression="gzip")
for i, chunk in enumerate(df):
print("Chunk", i)
if i == 0:
parquet_schema = pa.Table.from_pandas(df=chunk).schema
parquet_writer = pd.ParquetWriter(parquet_file, parquet_schema, compression="gzip")
table = pa.Table.from_pandas(chunk, schema=parquet_schema)
parquet_writer.write_table(table)
parquet_writer.close()
With dask:
df = dd.read_csv(csv_file, sep="\t", compression="gzip", blocksize=None)
df = df.repartition(partition_size="100MB")
df.to_parquet(parquet_file, write_index=False)
Another (more recent) solution is to use a LazyFrame approach in polars:
csv_file = "kipan_exon.csv" # this doesn't work with compressed files right now
parquet_file = "kipan_exon.parquet" # #MichaelDelgado's comment re: same value as `csv_file`
from polars import scan_csv
ldf = scan_csv(csv_file)
ldf.sink_parquet(parquet_file)
This should work well in memory-constrained situations since the data is not loaded fully, but streamed to the parquet file.
When using dask for csv to parquet conversion, I'd recommend avoiding .repartition. It introduces additional data shuffling that can strain workers and the scheduler. The simpler approach would look like this:
csv_file = "kipan_exon.csv.gz"
parquet_file = "kipan_exon.parquet" # #MichaelDelgado's comment re: same value as `csv_file`
from dask.dataframe import read_csv
df = read_csv(csv_file, sep="\t", compression="gzip")
df.to_parquet(parquet_file, write_index=False)

Google cloud blob: XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1

I want to import several XML files from a bucket on GCS and then parse them into a pandas DataFrame. I found the pandas.read_xml function to do this, which is great. Unfortunately, I keep getting the error:
XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1
I checked the xml files and they look fine.
This is the code:
from google.cloud import storage
import pandas as pd
#importing the data
client = storage.Client()
bucket = client.get_bucket('bucketname')
df = pd.DataFrame()
#parsing the data into pandas df
for blob in bucket.list_blobs():
    print(blob)
    split = str(blob.name).split("/")
    country = split[0]
    data = pd.read_xml(blob.open(mode='rt', encoding='iso-8859-1', errors='ignore'), compression='gzip')
    df["country"] = country
    print(country)
    df.append(data)
When I print out the blob it gives me :
<Blob: textkernel, DE/daily/2020/2020-12-19/jobs.0.xml.gz, 1612169959288959>
Maybe it has something to do with the pandas function trying to read the filename rather than the content? Does anyone have an idea why this could be happening?
thank you!
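For reference, since the blob names end in .xml.gz, it is worth checking whether the bytes reaching read_xml are still gzip-compressed; below is a hedged sketch of parsing one blob with explicit decompression (gzip.open over the blob's binary stream is an assumption about the data, not a confirmed fix):
import gzip
import pandas as pd
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('bucketname')

frames = []
for blob in bucket.list_blobs():
    country = str(blob.name).split("/")[0]
    # open the blob as raw bytes and decompress before handing it to read_xml
    with blob.open(mode='rb') as raw, gzip.open(raw) as xml_stream:
        data = pd.read_xml(xml_stream)
    data["country"] = country
    frames.append(data)

df = pd.concat(frames, ignore_index=True)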

reading csv file into python pandas

I want to read a csv file into a pandas dataframe but I get an error when executing the code below:
filepath = "https://drive.google.com/file/d/1bUTjF-iM4WW7g_Iii62Zx56XNTkF2-I1/view"
df = pd.read_csv(filepath)
df.head(5)
To retrieve data from Google Drive, you first need to identify the file id and build a direct-download URL from it:
import pandas as pd
url='https://drive.google.com/file/d/0B6GhBwm5vaB2ekdlZW5WZnppb28/view?usp=sharing'
file_id=url.split('/')[-2]
dwn_url='https://drive.google.com/uc?id=' + file_id
df = pd.read_csv(dwn_url)
print(df.head())
Try the following code snippet to read the CSV from Google Drive into the pandas DataFrame:
import pandas as pd
url = "https://drive.google.com/uc?id=1bUTjF-iM4WW7g_Iii62Zx56XNTkF2-I1"
df = pd.read_csv(url)
df.head(5)

How to write a pandas dataframe to .arrow file

How can I write a pandas dataframe to disk in .arrow format? I'd like to be able to read the arrow file into Arquero as demonstrated here.
Since Feather is the Arrow IPC format, you can probably just use write_feather. See http://arrow.apache.org/docs/python/feather.html
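A minimal sketch of that suggestion (the DataFrame here is a placeholder; compression="uncompressed" is used so that readers without LZ4/ZSTD support can open the file):
import pandas as pd
import pyarrow.feather as feather

df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})
# Feather V2 is the Arrow IPC file format, so this writes a regular .arrow file
feather.write_feather(df, "my_data.arrow", compression="uncompressed")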
You can do this as follows:
import pyarrow
import pandas
df = pandas.read_parquet('your_file.parquet')
schema = pyarrow.Schema.from_pandas(df, preserve_index=False)
table = pyarrow.Table.from_pandas(df, preserve_index=False)
sink = "myfile.arrow"
# Note new_file creates a RecordBatchFileWriter
writer = pyarrow.ipc.new_file(sink, schema)
writer.write(table)
writer.close()
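As a quick sanity check, a file written this way can be read back with the IPC reader (a small sketch using the same filename as above):
import pyarrow as pa

# open the Arrow IPC file and materialize it as a Table
table_back = pa.ipc.open_file("myfile.arrow").read_all()
print(table_back.num_rows)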
Pandas can write a DataFrame directly to the binary Feather format (it uses pyarrow under the hood):
import pandas as pd
df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
df.to_feather('my_data.arrow')
Additional keywords are passed to pyarrow.feather.write_feather(). This includes the compression, compression_level, chunksize and version keywords.
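For example, a compression setting can be passed straight through (a sketch; zstd availability depends on the pyarrow build):
# keywords are forwarded to pyarrow.feather.write_feather()
df.to_feather('my_data.arrow', compression='zstd', compression_level=5)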

How to download csv file from S3 bucket into numpy array

I have a csv file in an AWS S3 bucket. How do I download the CSV and assign it to a numpy array?
[Using python 3.6/boto3]
I've tried various forms including:
s3 = boto3.resource('s3', region_name=region)
obj = s3.Object(bucket, key)
with io.BytesIO(obj.get()["Body"].read()) as f:
    # rewind the file
    f.seek(0)
    arr_data = numpy.load(f)
arr_data = numpy.genfromtxt('https://BUCKETNAME.s3-eu-west-1.amazonaws.com/folder/infile.csv',dtype='str',delimiter=',')
This also doesn't work
Essentially I'm trying to replicate in S3:
arr_data = np.genfromtxt('path...input.csv',dtype='str',delimiter=',')
I was able to convert a CSV to a NumPy array by going through pandas in between... not sure if that's what you're looking for, but here's how I did it:
import pandas as pd
import numpy as np
data_location = 's3://<path>'
data = pd.read_csv(data_location)  # pandas can read s3:// paths directly when s3fs is installed
data_numpy = data.value.values.reshape(-1, 1)  # 'value' is the column of interest in this CSV
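If the goal is the whole table as an array rather than a single column, the same read can feed DataFrame.to_numpy() (a sketch mirroring the genfromtxt(dtype='str') call from the question; reading s3:// paths assumes s3fs is installed):
import pandas as pd

data = pd.read_csv('s3://<path>')
arr_data = data.to_numpy(dtype='str')  # whole frame as a string array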