Exporting multiple log files' data to a single Excel file using pandas

How do I export multiple dataframes to a single Excel file? I'm not talking about merging or combining; I just want a specific range of lines from multiple log files compiled into a single Excel sheet. I already wrote some code, but I am stuck:
import pandas as pd
import glob
import os
from openpyxl.workbook import Workbook

file_path = "C:/Users/HP/Desktop/Pandas/MISC/Log Source"
read_files = glob.glob(os.path.join(file_path, "*.log"))

for file in read_files:
    logs = pd.read_csv(file, header=None).loc[540:1060, :]
    print(logs)
    logs.to_excel("LBS.xlsx")
When I do this, I only get data from the first log.
Appreciate your recommendations. Thanks !

You are saving `logs`, which is the variable in your for loop that changes on each iteration, so each file overwrites the previous output. What you want is to build a list of dataframes, combine them all, and then save that to Excel:
file_path = "C:/Users/HP/Desktop/Pandas/MISC/Log Source"
read_files = glob.glob(os.path.join(file_path, "*.log"))

dfs = []
for file in read_files:
    log = pd.read_csv(file, header=None).loc[540:1060, :]
    dfs.append(log)

logs = pd.concat(dfs)
logs.to_excel("LBS.xlsx")
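As a side note, `pd.concat` also accepts a dict, which records each row's source file in the resulting index. A minimal sketch with made-up file names and data:

```python
import pandas as pd

# Hypothetical per-file dataframes standing in for the parsed log slices.
frames = {
    "a.log": pd.DataFrame({0: [1, 2]}),
    "b.log": pd.DataFrame({0: [3, 4]}),
}

# Passing a dict makes the file names the first level of a MultiIndex,
# so rows from different logs stay distinguishable in the single sheet.
combined = pd.concat(frames, names=["source_file", "row"])
```

Calling `combined.to_excel("LBS.xlsx")` then writes the source file name into the first column of the sheet.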

Related

Generate table structure from multiple files and create tables

I'm trying to generate table structure from multiple files to Snowflake via Python.
I have a list of files in a directory, and I want to read the data from the files and create the tables dynamically in Snowflake using the file names.
Below is what I have tried so far:
# Generate Multiple files Table Structure to Snowflake via Python
import os
from os import path
import pandas as pd
import snowflake.connector
from snowflake.sqlalchemy import URL
from sqlalchemy import create_engine
from snowflake.connector.pandas_tools import write_pandas, pd_writer

dir = r"D:\Datasets\users_dataset"

engine = create_engine(URL(
    account='<account>',
    user='<user>',
    password='<password>',
    role='ACCOUNTADMIN',
    warehouse='COMPUTE_WH',
    database='DEM0_DB',
    schema='PUBLIC'
))
connection = engine.connect()
connection.execute("USE DATABASE DEMO_DB")
connection.execute("USE SCHEMA PUBLIC")
results = connection.execute('USE DATABASE DEMO_DB').fetchone()
print(results)

# read the files from the directory and split the filename and extension
for file in os.listdir(dir):
    name, extr = path.splitext(file)
    print(name)
    file_path = os.path.join(dir, file)
    print(file_path)
    df = pd.read_csv(file_path, delimiter=',')
    df.to_sql(name, con=engine, index=False)
I'm getting the error below:
sqlalchemy.exc.ProgrammingError: (snowflake.connector.errors.ProgrammingError) 090105 (22000): Cannot perform CREATE TABLE. This session does not have a current database. Call 'USE DATABASE', or use a qualified name.
[SQL:
CREATE TABLE desktop (
"[.ShellClassInfo]" FLOAT
)
]
(Background on this error at: https://sqlalche.me/e/14/f405)
I checked for permission issues on Snowflake and haven't found any.
Can someone please help with this error?
In the create_engine call the database is named DEM0_DB (with a zero) instead of DEMO_DB:
engine = create_engine(URL(
    ...
    #database='DEM0_DB',
    database='DEMO_DB',
    schema='PUBLIC'
))
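The error message also mentions a second route: using a qualified name, so the table resolves even without a current session database. A hedged sketch, with all identifiers illustrative:

```python
# Build a fully qualified Snowflake table name (database.schema.table).
# These names are examples, not values from the original post.
database, schema, table = "DEMO_DB", "PUBLIC", "users"
qualified_name = f"{database}.{schema}.{table}"

# pandas' to_sql has no database qualifier, but the schema can be passed
# explicitly, e.g.:
#   df.to_sql(table, con=engine, schema=schema, index=False)
```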

How to iterate over a list of csv files and compile files with common filenames into a single csv as multiple columns

I am currently iterating through a list of CSV files and want to combine the files with common filename strings into a single CSV, merging the data from each new file as a set of two new columns. I am having trouble with the final part: the append command adds the data as rows at the bottom of the CSV. I have tried pd.concat, but must be going wrong somewhere. Any help would be much appreciated.
**Note: the code is using Python 2, just for compatibility with the software I am using; a Python 3 solution is welcome if it translates.
Here is the code I'm currently working with:
rb_headers = ["OID_RB", "Id_RB", "ORIG_FID_RB", "POINT_X_RB", "POINT_Y_RB"]

for i in coords:
    if fnmatch.fnmatch(i, '*RB_bank_xycoords.csv'):
        df = pd.read_csv(i, header=0, names=rb_headers)
        df2 = df[::-1]
        # Export the inverted RB csv file as a new csv to the original folder, overwriting the original
        df2.to_csv(bankcoords + i, index=False)

# Iterate through csvs to combine those with similar key strings in their filenames and merge them into a single csv
files_of_interest = {}
forconc = []
for filename in coords:
    if filename[-4:] == '.csv':
        key = filename[:39]
        files_of_interest.setdefault(key, [])
        files_of_interest[key].append(filename)

for key in files_of_interest:
    buff_df = pd.DataFrame()
    for filename in files_of_interest[key]:
        buff_df = buff_df.append(pd.read_csv(filename))
    files_of_interest[key] = buff_df

redundant_headers = ["OID", "Id", "ORIG_FID", "OID_RB", "Id_RB", "ORIG_FID_RB"]
outdf = buff_df.drop(redundant_headers, axis=1)
If you only want to merge everything into one file:
paths_list = ['path1', 'path2', ...]
dfs = [pd.read_csv(f, header=None, sep=";") for f in paths_list]
dfs = pd.concat(dfs, ignore_index=True)
dfs.to_csv(...)
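If the goal is the two-new-columns layout from the question rather than stacked rows, `axis=1` is the missing piece. A small sketch with illustrative column names and data:

```python
import pandas as pd

# Two dataframes standing in for two CSVs with a common filename key.
left = pd.DataFrame({"POINT_X": [1.0, 2.0], "POINT_Y": [3.0, 4.0]})
right = pd.DataFrame({"POINT_X_RB": [5.0, 6.0], "POINT_Y_RB": [7.0, 8.0]})

# axis=1 aligns on the row index and places the second file's data
# alongside the first as new columns, instead of appending rows below.
side_by_side = pd.concat([left, right], axis=1)
```

With `axis=0` (the default, and what `DataFrame.append` does) the same call would have produced four rows and NaN-filled columns, which matches the symptom described above.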

Using pandas to open Excel files stored in GCS from command line

The following code snippet is from a Google tutorial, it simply prints the names of files on GCP in a given bucket:
from google.cloud import storage

def list_blobs(bucket_name):
    """Lists all the blobs in the bucket."""
    # bucket_name = "your-bucket-name"
    storage_client = storage.Client()
    # Note: Client.list_blobs requires at least package version 1.17.0.
    blobs = storage_client.list_blobs(bucket_name)
    for blob in blobs:
        print(blob.name)

list_blobs('sn_project_data')
Now from the command line I can run:
$ python path/file.py
And in my terminal the files in said bucket are printed out. Great, it works!
However, this isn't quite my goal. I'm looking to open a file and act upon it. For example:
df = pd.read_excel(filename)
print(df.iloc[0])
However, when I pass the path to the above, the error returned reads "invalid file path." So I'm sure there is some sort of GCP specific function call to actually access these files...
What command(s) should I run?
Edit: This video https://www.youtube.com/watch?v=ED5vHa3fE1Q shows a trick to open files and needs to use StringIO in the process. But it doesn't support excel files, so it's not an effective solution.
read_excel() does not support a Google Cloud Storage file path as of now, but it can read data as bytes.
pandas.read_excel(io, sheet_name=0, header=0, names=None,
index_col=None, usecols=None, squeeze=False, dtype=None, engine=None,
converters=None, true_values=None, false_values=None, skiprows=None,
nrows=None, na_values=None, keep_default_na=True, na_filter=True,
verbose=False, parse_dates=False, date_parser=None, thousands=None,
comment=None, skipfooter=0, convert_float=True, mangle_dupe_cols=True,
storage_options=None)
Parameters: io : str, bytes, ExcelFile, xlrd.Book, path object, or
file-like object
What you can do is use the blob object and use download_as_bytes() to convert the object into bytes.
Download the contents of this blob as a bytes object.
For this example I just used a random sample xlsx file and read the 1st sheet:
from google.cloud import storage
import pandas as pd
bucket_name = "your-bucket-name"
blob_name = "SampleData.xlsx"
storage_client = storage.Client()
bucket = storage_client.bucket(bucket_name)
blob = bucket.blob(blob_name)
data_bytes = blob.download_as_bytes()
df = pd.read_excel(data_bytes)
print(df)
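A local stand-in for `blob.download_as_bytes()` (no GCS access needed) shows that `read_excel` only needs the raw bytes, which is all the snippet above relies on. The dataframe contents here are made up:

```python
import io

import pandas as pd

# Round-trip a small workbook entirely in memory.
buf = io.BytesIO()
pd.DataFrame({"a": [1, 2]}).to_excel(buf, index=False)

# data_bytes plays the role of blob.download_as_bytes().
data_bytes = buf.getvalue()

# Wrapping the bytes in BytesIO gives read_excel a file-like object.
df = pd.read_excel(io.BytesIO(data_bytes))
```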

How to add a header row to the df.to_csv output file only when a certain condition is met

I am reading a bulk-download CSV file of stock prices and splitting it into many individual CSVs based on ticker, where the ticker name is the name of the output file. The header row, which contains "ticker, date, open, high, low, close, volume", is written ONLY the first time I run the script, because if I run it again with header set to True, it writes a new header row mixed in with the stock data. I have mode set to "a" (append) because I want each new row of data added to the file. However, I now have a situation where a new ticker has appeared in the source file, and because header is set to False there is no header in the newly created output file, which causes processing to fail. How can I include a condition so that a header row is written ONLY for new files which never existed before? Here is my code. Thanks
import pandas as pd
import os
import csv
import itertools
import datetime

datetime = datetime.datetime.today().strftime('%Y-%m-%d')
filename = "Bats_" + (datetime) + ".csv"
csv_file = ("H:\\EOD_DATA_RECENT\\DOWNLOADS\\" + filename)
path = 'H:\\EOD_DATA_RECENT\\VIA-API-CALL\\BATS\\'

df = pd.read_csv(csv_file)
for i, g in df.groupby('Ticker'):
    # SET HEADER TO TRUE THE FIRST RUN, THEN SET TO FALSE THEREAFTER
    g.to_csv(path + '{}.csv'.format(i), mode='a', header=False, index=False, index_label=None)
print(df.tail(5))
FINAL CODE SNIPPET BELOW THAT WORKS. Thanks
for i, g in df.groupby('Ticker'):
    if os.path.exists(path + i + ".csv"):
        g.to_csv(path + '{}.csv'.format(i), mode='a', header=False, index=False, index_label=None)
    else:
        g.to_csv(path + '{}.csv'.format(i), mode='w', header=True, index=False, index_label=None)
print(df.tail(5))
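The same fix can be condensed into a single call: the header is written exactly when the file does not exist yet. The helper name below is illustrative, not from the original post:

```python
import os

import pandas as pd

def append_group(g, out_path):
    # os.path.exists is evaluated before to_csv opens the file, so the
    # header is emitted only on the very first write to out_path.
    g.to_csv(out_path, mode="a", header=not os.path.exists(out_path), index=False)
```

Calling `append_group(g, path + '{}.csv'.format(i))` inside the groupby loop behaves like the if/else version above.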

Reading a partitioned dataset in AWS S3 with pyarrow doesn't add partition columns

I'm trying to read a partitioned dataset in AWS S3; it looks like:
MyDirectory
├── code=1
│   └── file.parquet
├── code=2
│   └── another.parquet
└── code=3
    └── another.parquet
I created a file_list containing the paths to all the files in the directory, then executed
df = pq.ParquetDataset(file_list, filesystem=fs).read().to_pandas()
Everything works, except that the partition column code doesn't exist in the dataframe df.
I also tried using a single path to MyDirectory instead of file_list, but got the error
"Found files in an intermediate directory: s3://bucket/Mydirectoty". I can't find any answer online.
Thank you!
AWS has a project (AWS Data Wrangler) that helps with the integration between Pandas/PyArrow and their services.
This snippet should work:
import awswrangler as wr

# Write
wr.s3.to_parquet(
    df=df,
    path="s3://...",
    mode="overwrite",
    dataset=True,
    database="my_database",  # Optional, only if you want it available on Athena/Glue Catalog
    table="my_table",
    partition_cols=["PARTITION_COL_NAME"]
)

# Read
df = wr.s3.read_parquet(path="s3://...", dataset=True)
If you're happy with other tools, you can give dask a try. Assuming all the data you want to read is in s3://folder, you can just use:
import dask.dataframe as dd

storage_options = {
    'key': your_key,
    'secret': your_secret,
}
df = dd.read_parquet("s3://folder", storage_options=storage_options)