I generate a dataframe, write it to S3 as a CSV file, and run a select query on the CSV in the S3 bucket. Based on the query and data I expect to see '4' and '10' printed, but I only see '4'. For some reason S3 Select is not returning the '10'.
Filtering between dates works fine.
import pandas as pd
import s3fs
import boto3
# dataframe
d = {'date':['1990-1-1','1990-1-2','1990-1-3','1999-1-4'], 'speed':[0,10,3,4]}
df = pd.DataFrame(d)
# write csv to s3
bytes_to_write = df.to_csv(index=False).encode()
fs = s3fs.S3FileSystem()
with fs.open('app-storage/test.csv', 'wb') as f:
    f.write(bytes_to_write)
# query csv in s3 bucket
s3 = boto3.client('s3',region_name='us-east-1')
resp = s3.select_object_content(
    Bucket='app-storage',
    Key='test.csv',
    ExpressionType='SQL',
    Expression="SELECT s.\"speed\" FROM s3Object s WHERE s.\"speed\" > '3'",
    InputSerialization={'CSV': {"FileHeaderInfo": "Use"}},
    OutputSerialization={'CSV': {}},
)
for event in resp['Payload']:
    if 'Records' in event:
        records = event['Records']['Payload'].decode('utf-8')
        print(records)
Just needed to cast the string to float in the SQL statement; without the cast the comparison is lexicographic, so '10' > '3' is false.
"SELECT s.\"speed\" FROM s3Object s WHERE cast(s.\"speed\" as float) > 3"
Now it works without a problem.
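For completeness, a minimal sketch of the corrected call, reusing the client, bucket, and key from the question:
resp = s3.select_object_content(
    Bucket='app-storage',
    Key='test.csv',
    ExpressionType='SQL',
    # numeric comparison instead of string comparison
    Expression="SELECT s.\"speed\" FROM s3Object s WHERE CAST(s.\"speed\" AS FLOAT) > 3",
    InputSerialization={'CSV': {"FileHeaderInfo": "Use"}},
    OutputSerialization={'CSV': {}},
)
Iterating resp['Payload'] as in the question then prints both 10 and 4.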
I have a script in an AWS Glue ETL Job that reads an S3 bucket with a lot of parquet files, sorts by key1, key2 and a timestamp field, then deletes the duplicates and saves a single parquet file to another S3 bucket.
Here is the data I have before the job runs:
key1        key2  uploadTimestamp
0005541779  10    2021-12-29 14:54:08.753
0005541779  10    2021-12-29 15:06:05.968
The code that does the sort and eliminates the duplicates:
#############################################################
tempDF = S3bucket_node1.toDF() #from Dynamic Frame to Data Frame
sortedDF = tempDF.orderBy(f.col("uploadTimestamp").desc(),"key1","key2").dropDuplicates(["key1","key2"]) #sort and remove duplicates
dynamicFrame = DynamicFrame.fromDF(sortedDF, glueContext, 'salesOrder') #back to Dynamic Frame
#############################################################
Here is the data after an order by (screenshot omitted).
My problem:
In the output file, some rows got the last timestamp and some got the first. I can't understand why it doesn't work for all the data.
Thanks.
It worked with the following code (orderBy followed by dropDuplicates does not guarantee which duplicate Spark keeps, while a window function does):
from pyspark.sql.window import Window

tempDF = S3bucket_node1.toDF()
w = Window.partitionBy("key1", "key2").orderBy(f.desc("uploadTimestamp"))
df = tempDF.withColumn("rn", f.row_number().over(w)).filter("rn = 1").drop("rn")
dynamicFrame = DynamicFrame.fromDF(df, glueContext, 'dynamicFrame')
The tip that solved it was found here:
pyspark dataframe drop duplicate values with older time stamp
Is it possible to use Pandas' DataFrame.to_parquet functionality to split writing into multiple files of some approximate desired size?
I have a very large DataFrame (100M x 100), and am using df.to_parquet('data.snappy', engine='pyarrow', compression='snappy') to write to a file, but this results in a file that's about 4GB. I'd instead like this split into many ~100MB files.
I ended up using Dask:
import dask.dataframe as da
ddf = da.from_pandas(df, chunksize=5000000)
save_dir = '/path/to/save/'
ddf.to_parquet(save_dir)
This saves to multiple parquet files inside save_dir, where the number of rows of each sub-DataFrame is the chunksize. Depending on your dtypes and number of columns, you can adjust this to get files to the desired size.
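If you want to target an approximate file size rather than a fixed row count, a rough sketch (my own estimate, assuming df is the in-memory DataFrame; parquet compression usually makes the files noticeably smaller than the in-memory footprint, so treat this as an upper bound):
import dask.dataframe as da

target_bytes = 100 * 1024 ** 2                         # aim for ~100MB per output file
bytes_per_row = df.memory_usage(deep=True).sum() / len(df)
chunksize = max(1, int(target_bytes / bytes_per_row))  # rows per partition

ddf = da.from_pandas(df, chunksize=chunksize)
ddf.to_parquet('/path/to/save/')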
One other option is to use the partition_cols option in pyarrow.parquet.write_to_dataset():
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
# df is your dataframe
n_partition = 100
df["partition_idx"] = np.random.choice(range(n_partition), size=df.shape[0])
table = pa.Table.from_pandas(df, preserve_index=False)
pq.write_to_dataset(table, root_path="{path to dir}/", partition_cols=["partition_idx"])
Slice the dataframe and save each chunk to a folder, using just the pandas API (without dask).
You can pass extra params to the parquet engine if you wish.
import os

def df_to_parquet(df, target_dir, chunk_size=1000000, **parquet_wargs):
    """Writes pandas DataFrame to parquet format with pyarrow.

    Args:
        df: DataFrame
        target_dir: local directory where parquet files are written to
        chunk_size: number of rows stored in one chunk of parquet file. Defaults to 1000000.
    """
    for i in range(0, len(df), chunk_size):
        slc = df.iloc[i : i + chunk_size]
        chunk = int(i / chunk_size)
        fname = os.path.join(target_dir, f"part_{chunk:04d}.parquet")
        slc.to_parquet(fname, engine="pyarrow", **parquet_wargs)
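A hypothetical call (the output directory is an example name and must already exist); the extra keyword is forwarded to DataFrame.to_parquet:
df_to_parquet(df, target_dir="parquet_out", chunk_size=2_000_000, compression="snappy")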
Keep each parquet file small, around 128MB. To do this:
import dask.dataframe as dd

# Number of partitions required for a nominal 128MB partition size
# ("+ 1" accounts for the final, non-full partition)
npartitions = int(df.memory_usage().sum() / 1e6 / 128) + 1

ddf = dd.from_pandas(df, npartitions=npartitions)
save_dir = '/path/to/save/'
ddf.to_parquet(save_dir)
# Manually slice the dataframe into fixed-size chunks and write each as its own file
chunk_size = 200000
i = 0
n = 0
while i < len(all_df):
    j = i + chunk_size
    print((i, j))
    tmpdf = all_df[i:j]
    tmpdf.to_parquet(path=f"./append_data/part.{n}.parquet", engine='pyarrow', compression='snappy')
    i = j
    n = n + 1
I'm trying to upload local files into BigQuery using Python. Whenever I run this I get an error:
ValueError: Could not determine schema for table
Table(TableReference(DatasetReference('database-150318', 'healthanalytics'), 'pres_kmd'))'. Call client.get_table() or pass in a list of schema fields to the selected_fields argument.
import glob
import gzip
import pdb
from google.cloud import bigquery

client = bigquery.Client(project="database-150318")
job_config = bigquery.LoadJobConfig(autodetect=True)
table_ref = client.dataset('healthanalytics').table('pres_kmd')
table = client.get_table(table_ref)
#table = dataset.table("test_table")
deidrows = []
for filename in glob.glob('/Users/janedoe/kmd/health/*dat.gz'):
    with gzip.open(filename) as f:
        for line in f:
            #line = line.decode().strip().split('|')
            deidrows.append(line)
client.insert_rows(table, deidrows)
pdb.set_trace()
Can someone help with why? I thought that if I put autodetect in there, it would infer the schema.
Thanks in advance!
You can try this example:
import csv
from google.cloud import bigquery

client = bigquery.Client()
table_ref = client.dataset('bq_poc').table('new_emp')
table = client.get_table(table_ref)

filename = "data.csv"
with open(filename) as f:
    next(f)  # skip the header row
    reader = csv.reader(f, skipinitialspace=True)
    rows = [[int(row[0]), str(row[1]), int(row[2])] for row in reader]
    client.insert_rows(table, rows)
Note:
job_config is not utilized, so it can be removed
The data needs to be converted into the expected row format (the rows list above)
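If the goal is to let autodetect infer the schema (as the question attempts), one alternative sketch is a load job instead of streaming inserts; the file name here is hypothetical, and the delimiter/compression options may need adjusting for the real '.dat.gz' files:
from google.cloud import bigquery

client = bigquery.Client(project="database-150318")
table_ref = client.dataset('healthanalytics').table('pres_kmd')
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,  # let BigQuery infer the schema from the file
)
with open("data.csv", "rb") as f:  # hypothetical local file
    load_job = client.load_table_from_file(f, table_ref, job_config=job_config)
load_job.result()  # wait for the load to finish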
What is the recommended way to prepend data (a pandas dataframe) to an existing dask dataframe in parquet storage?
This test, for example, fails intermittently:
import dask.dataframe as dd
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal

def test_dask_intermittent_error(tmp_path):
    df = pd.DataFrame(np.random.randn(100, 1), columns=['A'],
                      index=pd.date_range('20130101', periods=100, freq='T'))
    dfs = np.array_split(df, 2)
    dd1 = dd.from_pandas(dfs[0], npartitions=1)
    dd2 = dd.from_pandas(dfs[1], npartitions=1)
    dd2.to_parquet(tmp_path)
    _ = (dd1
         .append(dd.read_parquet(tmp_path))
         .to_parquet(tmp_path))
    assert_frame_equal(df,
                       dd.read_parquet(tmp_path).compute())
gives
.venv/lib/python3.7/site-packages/dask/dataframe/core.py:3812: in to_parquet
return to_parquet(self, path, *args, **kwargs)
...
fastparquet.util.ParquetException: Metadata parse failed: /private/var/folders/_1/m2pd_c9d3ggckp1c1p0z3v8r0000gn/T/pytest-of-jfaleiro/pytest-138/test_dask_intermittent_error0/part.0.parquet
We considered relying on a simple append and figuring out order after retrieval, but this seems to be hitting a different bug, i.e.:
def test_dask_prepend_as_append(tmp_path):
    df = pd.DataFrame(np.random.randn(100, 1), columns=['A'],
                      index=pd.date_range('20130101', periods=100, freq='T'))
    dfs = np.array_split(df, 2)
    dd1 = dd.from_pandas(dfs[0], npartitions=1)
    dd2 = dd.from_pandas(dfs[1], npartitions=1)
    dd2.to_parquet(tmp_path)
    dd1.to_parquet(tmp_path, append=True)
    assert_frame_equal(df,
                       dd.read_parquet(tmp_path).compute())
gives
ValueError: Appended divisions overlapping with previous ones.
If you avoid using a "_metadata" file when writing (which you will with the default settings and pyarrow), then you can simply rename your files so that the prepended partition sorts before the rest when listed by glob. Normally, Dask begins naming the parts with a serial number 0.
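A rough sketch of that renaming idea (my own illustration; it assumes Dask's default part.N.parquet naming, a modest number of parts, and no _metadata file in the directory):
import glob
import os

def prepend_to_parquet(new_ddf, path):
    """Rename existing parts upward, then write new_ddf so it becomes part.0, part.1, ..."""
    part_index = lambda p: int(os.path.basename(p).split(".")[1])
    existing = sorted(glob.glob(os.path.join(path, "part.*.parquet")),
                      key=part_index, reverse=True)  # highest index first to avoid collisions
    shift = new_ddf.npartitions
    for old in existing:
        os.rename(old, os.path.join(path, f"part.{part_index(old) + shift}.parquet"))
    # Dask names the new files part.0.parquet, part.1.parquet, ... by default,
    # so the prepended partitions now sort before the shifted ones.
    new_ddf.to_parquet(path)
Note that with ten or more parts the indices would need zero-padding for the glob listing to stay in order.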
I have the following column in a table.
daily;1;21/03/2015;times;10
daily;1;01/02/2016;times;8
monthly;1;01/01/2016;times;2
weekly;1;21/01/2016;times;4
How can I parse this by the ; delimiter into different columns?
One way to do it would be to pull it into pandas, split on the semicolon, and put it back into SQL Server. See below for an example which I tested.
TEST DATA SETUP
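A rough sketch for loading the sample rows from the question into the source table (the table and column names are the hypothetical placeholders used in the CODE section below, and read_engine is the engine created there):
import pandas as pd

# Sample rows from the question, in a single semicolon-delimited column
test_df = pd.DataFrame({
    'string_column': [
        'daily;1;21/03/2015;times;10',
        'daily;1;01/02/2016;times;8',
        'monthly;1;01/01/2016;times;2',
        'weekly;1;21/01/2016;times;4',
    ]
})
# read_engine is the SQLAlchemy engine created in the CODE section below
test_df.to_sql('table_to_read_from', read_engine, if_exists='replace', index=False)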
CODE
import sqlalchemy as sa
import urllib.parse
import pandas as pd

server = 'yourserver'
read_database = 'db_to_read_data_from'
write_database = 'db_to_write_data_to'
read_tablename = 'table_to_read_from'
write_tablename = 'table_to_write_to'

read_params = urllib.parse.quote_plus("DRIVER={SQL Server};SERVER="+server+";DATABASE="+read_database+";TRUSTED_CONNECTION=Yes")
read_engine = sa.create_engine("mssql+pyodbc:///?odbc_connect=%s" % read_params)
write_params = urllib.parse.quote_plus("DRIVER={SQL Server};SERVER="+server+";DATABASE="+write_database+";TRUSTED_CONNECTION=Yes")
write_engine = sa.create_engine("mssql+pyodbc:///?odbc_connect=%s" % write_params)
#Read from SQL into DF
Table_DF = pd.read_sql(read_tablename, con=read_engine)
#Delimit by semicolon
parsed_DF = Table_DF['string_column'].apply(lambda x: pd.Series(x.split(';')))
#write DF back to SQL
parsed_DF.to_sql(write_tablename,write_engine,if_exists='append')
RESULT
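Roughly, the parsed_DF written back to the target table has one column per semicolon-separated field (reconstructed from the sample rows above, not captured output):
   0        1  2           3      4
0  daily    1  21/03/2015  times  10
1  daily    1  01/02/2016  times  8
2  monthly  1  01/01/2016  times  2
3  weekly   1  21/01/2016  times  4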