I'm trying to import an Excel sheet into BigQuery from my Downloads folder or Google Drive, but I am unable to import it. Please tell me how to import the sheet from Google Drive, if any method is available.
XLSX is not included in the supported formats for batch loading into BigQuery. A workaround is to convert the XLSX file to CSV and then load it into BigQuery from your local data source. I achieved both steps using the BigQuery Python client API.
See working code below:
import pandas as pd
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
table_id = "<your-project>.<your-dataset>.<your-table>"

file_path = "<your-path>/20220621.xlsx"
converted_path = "<your-path>/20220621.csv"

# Convert XLSX to CSV
data_xls = pd.read_excel(file_path, index_col=None)
data_xls.to_csv(converted_path, encoding='utf-8', index=False)

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV, skip_leading_rows=1, autodetect=True,
)

with open(converted_path, "rb") as source_file:
    job = client.load_table_from_file(source_file, table_id, job_config=job_config)

job.result()  # Waits for the job to complete.

table = client.get_table(table_id)  # Make an API request.
print(
    "Loaded {} rows and {} columns to {}".format(
        table.num_rows, len(table.schema), table_id
    )
)
(Screenshots in the original post show the source .XLSX file, the converted CSV, and the data loaded into BigQuery.)
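Side note: since pd.read_excel already gives you a pandas DataFrame, you could also skip the intermediate CSV and load the DataFrame directly with the client's load_table_from_dataframe method (this needs pyarrow installed). A minimal sketch using the same placeholder project/table:
import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()
table_id = "<your-project>.<your-dataset>.<your-table>"

# Read the XLSX straight into a DataFrame (openpyxl is needed for .xlsx files)
data_xls = pd.read_excel("<your-path>/20220621.xlsx", index_col=None)

# pyarrow must be installed for load_table_from_dataframe to work
job = client.load_table_from_dataframe(data_xls, table_id)
job.result()  # Waits for the job to complete.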
Trying to understand if I can use pickle for storing the model in a file system.
from neuralprophet import NeuralProphet
import pandas as pd
import pickle
df = pd.read_csv('data.csv')
pipe = NeuralProphet()
pipe.fit(df, freq="D")
pickle.dump(pipe, open('model/pipe_model.pkl', 'wb'))
Question: Loading multiple CSV files. I have multiple CSV files. How can I dump multiple CSV files into the same pickle file and load them later for prediction?
I think the right answer here is SQLite: it acts like a database but is stored as a single self-contained file on disk.
The benefit for your use case is that you can append new data as received into a table on the file, then read it as required. The code to do this is as simple as:
import pandas as pd
import sqlite3

# Create a SQL connection to our SQLite database
# This will create the file if not already existing
con = sqlite3.connect("my_table.sqlite")

# Replace this with read_csv
df = pd.DataFrame(index=[1, 2, 3], data=[1, 2, 3], columns=['some_data'])

# Simply continue appending onto 'My Table' each time you read a file
df.to_sql(
    name='My Table',
    con=con,
    if_exists='append'
)
Please be aware that SQLite performance drops once the table reaches a very large number of rows; in that case, caching the data as Parquet files (or another fast, compressed format) and reading them all in at training time may be more appropriate.
When you need the data, just read everything from the table:
pd.read_sql('SELECT * from [My Table]', con=con)
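To tie this back to the multiple-CSV question, a minimal sketch of the whole loop (assuming a hypothetical file pattern like data_*.csv) could look like this:
import glob
import pandas as pd
import sqlite3

con = sqlite3.connect("my_table.sqlite")

# Append every CSV into the same table as it arrives
for csv_path in glob.glob("data_*.csv"):  # hypothetical file pattern
    df = pd.read_csv(csv_path)
    df.to_sql(name='My Table', con=con, if_exists='append', index=False)

# Later, read everything back in one go for fitting the model
full_df = pd.read_sql('SELECT * from [My Table]', con=con)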
I have a Lambda function that is triggered by an S3 object-creation event. When a file is uploaded to the bucket, I parse that file and upload the results to a DynamoDB table. Here is what my function looks like.
import json
import boto3
import pandas as pd
import urllib
import io
import uuid
import logging

# S3 client used to fetch the uploaded object
s3 = boto3.client('s3')

path = "sensor-data.csv"
obj = s3.get_object(Bucket='sensor-bucket', Key=path)
csv_string = io.BytesIO(obj['Body'].read())

# Read the csv file and turn it into a DataFrame
df = pd.read_csv(csv_string, delimiter=';', engine='c', encoding='unicode_escape')

# Rename columns as seen in the Lambda function
df.rename(columns={'< 5,6m': 'SmallSize', '>= 5,6m': 'LargeSize'}, inplace=True)
df.Felt.replace(['1', '2', '3', '4'], ['lane_1', 'lane_2', 'lane_3', 'lane_4'], inplace=True)

# Filter out data column
data = df['data'] = df[['Navn', 'Vegreferanse', 'Fra', 'Til', 'Volum',
                        'Felt', 'SmallSize', 'LargeSize']].to_json(orient='records')

# Calculate traffic
traffic = df.groupby(['Felt'])['Volum'].sum().to_dict()

# Create a dictionary for a new DataFrame
data = {'sensor-id': df.Trafikkregistreringspunkt.iloc[0], 'data': data,
        'date': df.Dato.iloc[0], 'Id': str(uuid.uuid4()), 'traffic': [traffic]}

# Create the DataFrame
df2 = pd.DataFrame(data, index=[0])

print("Parsing complete. Writing to table...")

# Connect to DynamoDB and push items to the table
dynamoDb = boto3.resource('dynamodb')
table = dynamoDb.Table("sensor-data-table")

for line in df2.T.to_dict().values():
    table.put_item(Item=line)

print("Data processing completed successfully!")
On my local machine I can run the code and put the data into the table. However, when I try to do that in a Lambda function I get the following error:
Unsupported type "<class 'numpy.int64'>" for value "1996": TypeError
1996 is one of the values that I try to upload to the table. Here is what df2.T.to_dict().values() looks like:
dict_values([{'sensor-id': '11219V22151',
'data': '[{"Some Data Here"}]',
'date': '2020-01-01',
'Id': '107d8ce5-c7d2-4b86-af83-d5ce7d11ce74',
'traffic': {'Totalt': 1996, 'Totalt i retning Fianex Rv 415': 944,
'Totalt i retning Stølen X Rv 420': 1052, 'lane_1': 1052, 'lane_2': 944}}])
I'd appreciate some help and clarification on the issue.
It turned out that I was using a pre-made lambda deployment package that was a little bit outdated. I created a venv then used a build script to create the zip file. It worked perfectly fine.
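For anyone who hits the same TypeError even with an up-to-date deployment package: the boto3 DynamoDB resource cannot serialize NumPy scalar types (and rejects plain Python floats as well), so a commonly used workaround, separate from the fix above, is to cast values to native int/Decimal before calling put_item. A rough sketch with a hypothetical helper, reusing the df2 and table from the question:
from decimal import Decimal
import numpy as np

def to_dynamo(value):
    # Recursively cast NumPy scalars (and floats) to types boto3 can serialize
    if isinstance(value, dict):
        return {k: to_dynamo(v) for k, v in value.items()}
    if isinstance(value, np.integer):
        return int(value)
    if isinstance(value, (np.floating, float)):
        return Decimal(str(value))
    return value

for line in df2.T.to_dict().values():
    table.put_item(Item={k: to_dynamo(v) for k, v in line.items()})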
I am trying to read a bunch of CSV files from Google Cloud Storage into pandas dataframes as explained in Read csv from Google Cloud storage to pandas dataframe
storage_client = storage.Client()
bucket = storage_client.bucket(bucket_name)
blobs = bucket.list_blobs(prefix=prefix)
list_temp_raw = []
for file in blobs:
    filename = file.name
    temp = pd.read_csv('gs://'+bucket_name+'/'+filename+'.csv', encoding='utf-8')
    list_temp_raw.append(temp)
df = pd.concat(list_temp_raw)
It shows the following error message while importing gcsfs. The packages 'dask' and 'gcsfs' have already been installed on my machine; however, I cannot get rid of the following error.
File "C:\Program Files\Anaconda3\lib\site-packages\gcsfs\dask_link.py", line
121, in register
dask.bytes.core._filesystems['gcs'] = DaskGCSFileSystem
AttributeError: module 'dask.bytes.core' has no attribute '_filesystems'
It seems there is some error or conflict between the gcsfs and dask packages. In fact, the dask library is not needed for your code to work. The minimal configuration for your code to run is to install these libraries (I am posting their latest versions):
google-cloud-storage==1.14.0
gcsfs==0.2.1
pandas==0.24.1
Also, the filename already contains the .csv extension, so change the read_csv line to this:
temp = pd.read_csv('gs://' + bucket_name + '/' + filename, encoding='utf-8')
With these changes I ran your code and it works. I suggest you create a virtual environment, install the libraries there, and run the code inside it.
This has been tested and seen to work elsewhere, whether reading directly from GCS or via Dask. You may wish to try importing gcsfs and dask yourself, and check whether _filesystems exists and what it contains:
In [1]: import dask.bytes.core
In [2]: dask.bytes.core._filesystems
Out[2]: {'file': dask.bytes.local.LocalFileSystem}
In [3]: import gcsfs
In [4]: dask.bytes.core._filesystems
Out[4]:
{'file': dask.bytes.local.LocalFileSystem,
'gcs': gcsfs.dask_link.DaskGCSFileSystem,
'gs': gcsfs.dask_link.DaskGCSFileSystem}
As of https://github.com/dask/gcsfs/pull/129, gcsfs behaves better if it is unable to register itself with Dask, so updating may solve your problem.
A few things to point out about the code above:
bucket_name and prefix need to be defined.
The iteration over the filenames should append each DataFrame inside the loop; otherwise only the last one gets concatenated.
from google.cloud import storage
import pandas as pd
storage_client = storage.Client()
buckets_list = list(storage_client.list_buckets())
bucket_name='my_bucket'
bucket = storage_client.bucket(bucket_name)
blobs = bucket.list_blobs()
list_temp_raw = []
for file in blobs:
    filename = file.name
    temp = pd.read_csv('gs://'+bucket_name+'/'+filename, encoding='utf-8')
    print(filename, temp.head())
    list_temp_raw.append(temp)
df = pd.concat(list_temp_raw)
I am new to Python and I have a scenario where there are multiple parquet files with file names in order, e.g. par_file1, par_file2, par_file3 and so on, up to 100 files in a folder.
I need to read these parquet files starting from file1 in order and write them to a single CSV file. After writing the contents of file1, the contents of file2 should be appended to the same CSV without the header. Note that all files have the same column names and only the data is split into multiple files.
I learned to convert a single parquet file to a CSV file using pyarrow with the following code:
import pandas as pd
df = pd.read_parquet('par_file.parquet')
df.to_csv('csv_file.csv')
But I couldn't extend this to loop over multiple parquet files and append them to a single CSV.
Is there a method in pandas to do this? Or any other way to do this would be of great help. Thank you.
I ran into this question looking to see if pandas can natively read partitioned parquet datasets. I have to say that the current answer is unnecessarily verbose (making it difficult to parse). I also imagine that it's not particularly efficient to be constantly opening/closing file handles then scanning to the end of them depending on the size.
A better alternative would be to read all the parquet files into a single DataFrame, and write it once:
from pathlib import Path
import pandas as pd
data_dir = Path('dir/to/parquet/files')
full_df = pd.concat(
    pd.read_parquet(parquet_file)
    for parquet_file in data_dir.glob('*.parquet')
)
full_df.to_csv('csv_file.csv')
Alternatively, if you really want to just append to the file:
data_dir = Path('dir/to/parquet/files')
for i, parquet_path in enumerate(data_dir.glob('*.parquet')):
    df = pd.read_parquet(parquet_path)
    write_header = i == 0  # write header only on the 0th file
    write_mode = 'w' if i == 0 else 'a'  # 'write' mode for 0th file, 'append' otherwise
    df.to_csv('csv_file.csv', mode=write_mode, header=write_header)
A final alternative for appending each file is to open the target CSV file in "a+" mode at the outset, keeping the file handle positioned at the end of the file for each write/append (I believe this works, but haven't actually tested it):
data_dir = Path('dir/to/parquet/files')
with open('csv_file.csv', "a+") as csv_handle:
    for i, parquet_path in enumerate(data_dir.glob('*.parquet')):
        df = pd.read_parquet(parquet_path)
        write_header = i == 0  # write header only on the 0th file
        df.to_csv(csv_handle, header=write_header)
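One caveat that applies to all of the glob-based variants above: glob does not guarantee numeric order, so par_file10 can sort before par_file2. If the original order matters, sort the paths with a key that extracts the trailing number; a small sketch (the regex-based helper is just one way to do it):
import re
from pathlib import Path

data_dir = Path('dir/to/parquet/files')

def file_number(path):
    # Pull the trailing number out of names like 'par_file12.parquet'
    match = re.search(r'(\d+)', path.stem)
    return int(match.group(1)) if match else 0

parquet_paths = sorted(data_dir.glob('*.parquet'), key=file_number)
Iterating over parquet_paths instead of data_dir.glob('*.parquet') keeps the files in their original numeric order.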
I have a similar need, and I read that the current pandas version supports a directory path as the argument for the read_parquet function. So you can read multiple parquet files like this:
import pandas as pd
df = pd.read_parquet('path/to/the/parquet/files/directory')
It concats everything into a single dataframe so you can convert it to a csv right after:
df.to_csv('csv_file.csv')
Make sure you have the following dependencies according to the doc:
pyarrow
fastparquet
This helped me load all the parquet files into one DataFrame:
import glob
import pandas as pd

files = glob.glob("*.snappy.parquet")
data = [pd.read_parquet(f, engine='fastparquet') for f in files]
merged_data = pd.concat(data, ignore_index=True)
If you are going to copy the files over to your local machine and run your code there, you could do something like this. The code below assumes that you are running it in the same directory as the parquet files. It also assumes the file naming you provided above: par_file1, par_file2, par_file3 and so on, up to 100 files in a folder. If you need to search for your files, then you will need to get the file names using glob and explicitly provide the path where you want to save the CSV: open(r'this\is\your\path\to\csv_file.csv', 'a'). Hope this helps.
import pandas as pd

# Create the csv file and write the first parquet file to it with headers
with open('csv_file.csv', 'w') as csv_file:
    print('Reading par_file1.parquet')
    df = pd.read_parquet('par_file1.parquet')
    df.to_csv(csv_file, index=False)
    print('par_file1.parquet appended to csv_file.csv\n')

# create the remaining file names and append them to a list to look for in the current directory
files = []
for i in range(2, 101):
    files.append(f'par_file{i}.parquet')

# open files and append to csv_file.csv
for f in files:
    print(f'Reading {f}')
    df = pd.read_parquet(f)
    with open('csv_file.csv', 'a') as file:
        df.to_csv(file, header=False, index=False)
    print(f'{f} appended to csv_file.csv\n')
You can remove the print statements if you want.
Tested in python 3.6 using pandas 0.23.3
A small change for those trying to read remote files, which helps to read them faster (a direct read_parquet on the remote files was much slower for me):
import io
import pandas as pd

merged = []
# remote_reader = ... <- init some remote reader, for example AzureDLFileSystem()
# files = ...         <- list of remote parquet file paths
for f in files:
    with remote_reader.open(f, 'rb') as f_reader:
        merged.append(f_reader.read())  # read the raw bytes of each remote file
merged = pd.concat((pd.read_parquet(io.BytesIO(file_bytes)) for file_bytes in merged))
Adds a little temporary memory overhead though.
You can use Dask to read in the multiple Parquet files and write them to a single CSV.
Dask accepts an asterisk (*) as wildcard / glob character to match related filenames.
Make sure to set single_file to True and index to False when writing the CSV file.
import pandas as pd
import numpy as np
# create some dummy dataframes using np.random and write to separate parquet files
rng = np.random.default_rng()
for i in range(3):
    df = pd.DataFrame(rng.integers(0, 100, size=(10, 4)), columns=list('ABCD'))
    df.to_parquet(f"dummy_df_{i}.parquet")
# load multiple parquet files with Dask
import dask.dataframe as dd
ddf = dd.read_parquet('dummy_df_*.parquet', index=False)
# write to single csv
ddf.to_csv("dummy_df_all.csv",
single_file=True,
index=False
)
# test to verify
df_test = pd.read_csv("dummy_df_all.csv")
Using Dask for this means you won't have to worry about the resulting file size (Dask processes the data in partitions, so it can handle datasets larger than memory, while pandas might throw a MemoryError if the resulting DataFrame is too large), and you can easily read from and write to cloud data storage like Amazon S3.
I have been trying to write a function that loads multiple files from a Google Cloud Storage bucket into a single pandas DataFrame; however, I cannot seem to make it work.
import pandas as pd
from google.datalab import storage
from io import BytesIO
def gcs_loader(bucket_name, prefix):
    bucket = storage.Bucket(bucket_name)
    df = pd.DataFrame()
    for shard in bucket.objects(prefix=prefix):
        fp = shard.uri
        %gcs read -o $fp -v tmp
        df.append(pd.read_csv(BytesIO(tmp)))
    return df
When I try to run it, it says:
undefined variable referenced in command line: $fp
Sure, here's an example:
https://colab.research.google.com/notebook#fileId=0B7I8C_4vGdF6Ynl1X25iTHE4MGc
This notebook shows how to:
Create two random CSVs
Upload both CSV files to a GCS bucket
Use the GCS Python API to iterate over the files in the bucket
Merge the files into a single Pandas DataFrame
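If you cannot open the notebook, a minimal sketch of the last two steps using the google-cloud-storage client (the bucket name and prefix below are placeholders) might look like this:
import io
import pandas as pd
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("your-bucket-name")  # placeholder: replace with your bucket

frames = []
for blob in bucket.list_blobs(prefix="your/prefix"):  # placeholder prefix
    if blob.name.endswith(".csv"):
        # download_as_bytes is available in recent google-cloud-storage releases;
        # older versions use download_as_string instead
        frames.append(pd.read_csv(io.BytesIO(blob.download_as_bytes())))

df = pd.concat(frames, ignore_index=True)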