I want to import XML data into Splunk using the .py script below.
My concerns are:
Can I configure the .py script's output to be indexed in Splunk directly via inputs.conf, or do I need to save the output to a .csv file first? If the latter, can anyone suggest an approach so that the data does not get altered when it is written to the new .csv file?
How can I configure that .py file to fetch data every 5 minutes?
import requests
import xmltodict
import json

# Fetch the XML feed and parse it into a Python dict
url = "https://www.w3schools.com/xml/plant_catalog.xml"
response = requests.get(url)
content = xmltodict.parse(response.text)
print(content)
If you put your Python script into a [script://] stanza in inputs.conf then not only can you have Splunk launch the script automatically every 5 minutes, but anything the script writes to stdout will be indexed in Splunk.
[script:///path/to/the/script.py]
interval = */5 * * * *
index = main
sourcetype = foo
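If you want cleaner events, the script can print JSON instead of the repr of a Python dict. Here is a minimal sketch of that idea, assuming the feed keeps the CATALOG/PLANT structure of the sample file:
import json
import requests
import xmltodict

# Fetch the XML feed and parse it into a dict.
url = "https://www.w3schools.com/xml/plant_catalog.xml"
response = requests.get(url)
response.raise_for_status()
content = xmltodict.parse(response.text)

# Emit one JSON event per <PLANT> element; Splunk indexes whatever a
# scripted input writes to stdout, and JSON events are easy to search.
for plant in content["CATALOG"]["PLANT"]:
    print(json.dumps(plant))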
I have created an app that generates automatic reports for my team and me based on data located in multiple files (> 200). On my localhost Streamlit app, I could input a few parameters (year, deployment number, etc.) and the app would automatically use the correct files (3 out of 200 for each set of parameters) and generate the desired report.
However, now that I have deployed my app, I want it to select the desired files from a shared OneDrive to which my whole team has access. This means the data would all be stored online in one location and the app would automatically pick only the files needed, depending on the input parameters entered by the user.
I have two problems:
1. I would like to open a csv file from a OneDrive URL. The method below gives me an error "urllib.error.HTTPError: HTTP Error 400: Bad Request":
'''
import base64
import csv

import pandas as pd
import requests

def create_onedrive_directdownload(onedrive_link):
    # Encode the sharing link as base64url and build the OneDrive
    # "shares" API URL that returns the file content directly.
    data_bytes64 = base64.b64encode(bytes(onedrive_link, 'utf-8'))
    data_bytes64_String = data_bytes64.decode('utf-8').replace('/', '_').replace('+', '-').rstrip("=")
    resultUrl = f"https://api.onedrive.com/v1.0/shares/u!{data_bytes64_String}/root/content"
    return resultUrl

onedrive_link = "https://my.sharepoint.com/:x:/s/myteam/..."
onedrive_direct_link = create_onedrive_directdownload(onedrive_link)

# Attempt 1: let pandas fetch the direct-download URL
df = pd.read_csv(onedrive_direct_link)

# Attempt 2: fetch with requests and parse the lines manually
r = requests.get(onedrive_direct_link)
text = (line.decode('utf-8') for line in r.iter_lines())
reader = csv.reader(text, delimiter=',')
'''
2. I would like the app to select the right files depending on the first part of the URL only, since the end of the URL is a random string of numbers and letters but the beginning is predictable (all the files have a formatted name that includes the input parameters, i.e. year, deployment number, instrument). So what I am trying to do is something like this:
'''
folder_path = "<url to OneDrive folder>"
file_prefix_number = "062"
year = 2013

expected_name = f"{file_prefix_number}_ADP_{year}-{deployment}.csv"
if f"{folder_path}/{expected_name}" in url:
    df = pd.read_csv(urlADP)
# else: ignore this url
'''
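For what it is worth, one way to express that selection logic in plain Python is sketched below; the parameter values and the idea of keeping a lookup of known share links are illustrative assumptions, not a OneDrive API feature:
import pandas as pd

# Hypothetical parameters collected from the app's inputs.
file_prefix_number = "062"
year = 2013
deployment = 4

# Hypothetical lookup of file names to OneDrive share links, built once
# for the team folder (a share link cannot be derived from the folder
# URL plus a file name, so some kind of lookup is needed).
share_links = {
    "062_ADP_2013-4.csv": "https://my.sharepoint.com/:x:/s/myteam/...",
    # ... one entry per file
}

expected_name = f"{file_prefix_number}_ADP_{year}-{deployment}.csv"
if expected_name in share_links:
    direct_link = create_onedrive_directdownload(share_links[expected_name])
    df = pd.read_csv(direct_link)
else:
    df = None  # no matching file for these parameters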
Any advice would be very welcome; I have tried many methods without success, but I am afraid my Python knowledge is not that good.
Thank you in advance!
Hopefully someone can help me. I have a set of static data files for some data analysis; however, every time I run my script it takes a really long time to see what is happening, because the data is loaded every time. Is there a way to load the data once and then just work with it?
I have been using Jupyter notebooks and that works really well, but I would like a way to fix this problem in plain Python code.
The sequence of my code is:
File 1: contains all the functions;
File 2: contains all the variables and calls File 1 in order to know what to do with the data.
File 1 = functions.py
import numpy as np

def dict_files(filepath_lst):
    dictoffiles = {}
    for namefile in filepath_lst:
        content_file = np.loadtxt(namefile)
        dictoffiles[namefile] = content_file
    # Sort files from smallest timestamp to largest
    sorted_dictoffiles = {keys: values for keys, values in sorted(dictoffiles.items(), key=lambda item: item[1][0, 0])}
    return sorted_dictoffiles
File 2
import glob
from os.path import join as filejoin  # assumed imports; the original snippet omits them

import functions as f

### ----------File Path -----------###
directory = 'some_file_path'
file_path = glob.glob(filejoin(directory, '*.dat'))
dictionary_of_files = f.dict_files(file_path)
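One common way to avoid reloading the files on every run is to cache the parsed dictionary on disk and rebuild it only when the cache is missing; a minimal sketch (the cache file name is an arbitrary choice):
import glob
import pickle
from os.path import join as filejoin
from pathlib import Path

import functions as f

directory = 'some_file_path'
cache_file = Path('parsed_files.pkl')  # arbitrary cache location

if cache_file.exists():
    # Reuse the previously parsed data instead of reloading every .dat file
    with cache_file.open('rb') as fh:
        dictionary_of_files = pickle.load(fh)
else:
    file_path = glob.glob(filejoin(directory, '*.dat'))
    dictionary_of_files = f.dict_files(file_path)
    with cache_file.open('wb') as fh:
        pickle.dump(dictionary_of_files, fh)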
I exported a dataset from Google BigQuery to Google Cloud Storage; given the size of the table, BigQuery split the export into 99 csv files.
However, now I want to connect to my GCP bucket and perform some analysis with Spark, and I need to join all 99 files into a single large csv file to run my analysis.
How can this be achieved?
BigQuery splits exported data into several files if it is larger than 1 GB. But you can merge these files with the gsutil tool; check this official documentation to learn how to perform object composition with gsutil.
As BigQuery exports the files with the same prefix, you can use a wildcard * to merge them into one composite object:
gsutil compose gs://example-bucket/component-obj-* gs://example-bucket/composite-object
Note that there is a limit (currently 32) to the number of components that can be composed in a single operation.
The downside of this option is that the header row of each .csv file will be added to the composite object. But you can avoid this by modifying the job config to set the print_header parameter to False.
Here is a sample in Python, but you can use any other BigQuery client library:
from google.cloud import bigquery

client = bigquery.Client()
bucket_name = 'yourBucket'
project = 'bigquery-public-data'
dataset_id = 'libraries_io'
table_id = 'dependencies'

destination_uri = 'gs://{}/{}'.format(bucket_name, 'file-*.csv')
dataset_ref = client.dataset(dataset_id, project=project)
table_ref = dataset_ref.table(table_id)
job_config = bigquery.job.ExtractJobConfig(print_header=False)

extract_job = client.extract_table(
    table_ref,
    destination_uri,
    # Location must match that of the source table.
    location='US',
    job_config=job_config)  # API request
extract_job.result()  # Waits for job to complete.

print('Exported {}:{}.{} to {}'.format(
    project, dataset_id, table_id, destination_uri))
Finally, remember to compose an empty .csv with just the header row.
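If you prefer to do the composition from Python rather than gsutil, the Cloud Storage client offers Blob.compose. Here is a sketch (bucket and object names are illustrative), assuming the header-only file has already been uploaded and that there are at most 32 pieces per compose call:
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.bucket('example-bucket')  # illustrative bucket name

# A one-line CSV containing only the header row, uploaded beforehand.
header_blob = bucket.blob('header.csv')

# The header-less exports written by the extract job above.
data_blobs = sorted(bucket.list_blobs(prefix='file-'), key=lambda b: b.name)

# Compose header + data into a single object (max 32 sources per call).
composite = bucket.blob('composite-object.csv')
composite.compose([header_blob] + data_blobs)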
I got kind of tired of doing multiple recursive compose operations, stripping headers, etc., especially when dealing with 3500 split gzipped csv files.
So I wrote a CSV merge tool (sorry, Windows only though) to solve exactly this problem.
https://github.com/tcwicks/DataUtilities
Download the latest release, unzip, and use it.
I also wrote an article with a use case and usage example for it:
https://medium.com/#TCWicks/merge-multiple-csv-flat-files-exported-from-bigquery-redshift-etc-d10aa0a36826
Hope it is of use to someone.
P.S. I recommend tab-delimited over CSV as it tends to have fewer data issues.
I have a BigQuery table that I would like to analyze with a pandas DataFrame. The table is big, and the pd.read_gbq() function gets stuck and does not manage to retrieve the data.
I implemented a chunking mechanism using pandas that works, but it takes a long time to fetch (an hour for 9M rows), so I'm looking for a new solution.
I would like to download the table as a csv file and then read it. I saw this code in the Google Cloud docs:
from google.cloud import bigquery

client = bigquery.Client()
bucket_name = 'my-bucket'
project = 'bigquery-public-data'
dataset_id = 'samples'
table_id = 'shakespeare'

destination_uri = 'gs://{}/{}'.format(bucket_name, 'shakespeare.csv')
dataset_ref = client.dataset(dataset_id, project=project)
table_ref = dataset_ref.table(table_id)

extract_job = client.extract_table(
    table_ref,
    destination_uri,
    # Location must match that of the source table.
    location='US')  # API request
extract_job.result()  # Waits for job to complete.

print('Exported {}:{}.{} to {}'.format(
    project, dataset_id, table_id, destination_uri))
but all the URIs shown in the examples are Google Cloud Storage bucket URIs, not local paths, and I didn't manage to download the file (trying a local URI gave me an error).
Is there a way to download the table's data as a csv file without using a bucket?
As mentioned here
The limitation with BigQuery export is: you cannot export data to a local file or to Google Drive, but you can save query results to a local file. The only supported export location is Cloud Storage.
Is there a way to download the table's data as csv file without using a bucket?
So, since query results can be saved to a local file, you can use something like this:
from google.cloud import bigquery

client = bigquery.Client()

# Perform a query.
QUERY = (
    'SELECT * FROM `project_name.dataset_name.table_name`')
query_job = client.query(QUERY)  # API request
rows = query_job.result()  # Waits for query to finish

for row in rows:
    print(row.name)
This rows variable will contain all the table rows, and you can either use it directly or write it to a local file.
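As a rough sketch of the "write it to a local file" part (the query is the placeholder from the snippet above, and the output file name is arbitrary):
import csv

from google.cloud import bigquery

client = bigquery.Client()
query_job = client.query('SELECT * FROM `project_name.dataset_name.table_name`')
rows = query_job.result()  # RowIterator

# Stream the rows into a local CSV file, writing the header first.
with open('table_export.csv', 'w', newline='') as fh:
    writer = csv.writer(fh)
    writer.writerow([field.name for field in rows.schema])
    for row in rows:
        writer.writerow(row.values())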
I have a daily GCP billing export file in csv format containing GCP billing details. This export contains a header row. I've set up a load job as follows (summarized):
from google.cloud import bigquery
job = client.load_table_from_storage(job_name, dest_table, source_gs_file)
job.source_format = 'CSV'
job.skipLeadingRows=1
job.begin()
This job produces the error:
Could not parse 'Start Time' as a timestamp. Required format is YYYY-MM-DD HH:MM[:SS[.SSSSSS]]
This error means that it is still trying to parse the header row even though I specified skipLeadingRows=1. What am I doing wrong here?
You should use skip_leading_rows instead of skipLeadingRows when using the Python SDK.
skip_leading_rows: Number of rows to skip when reading data (CSV only).
Reference: https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.job.LoadJobConfig.html
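For reference, with the current client the option lives on a LoadJobConfig; here is a minimal sketch (the table ID and source URI are placeholders):
from google.cloud import bigquery

client = bigquery.Client()

table_id = 'my-project.my_dataset.gcp_billing'    # placeholder
source_uri = 'gs://my-bucket/billing_export.csv'  # placeholder

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # snake_case, unlike the REST field skipLeadingRows
    autodetect=True,      # or supply an explicit schema instead
)

load_job = client.load_table_from_uri(source_uri, table_id, job_config=job_config)
load_job.result()  # Waits for the load to complete.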
I cannot reproduce this. I took the example you gave ("2017-02-04T00:00:00-08:00"), added 3 rows/timestamps to a csv file, uploaded it to GCS, and finally created an empty table in BigQuery with one column of type TIMESTAMP.
File contents:
2017-02-04T00:00:00-08:00
2017-02-03T00:00:00-08:00
2017-02-02T00:00:00-08:00
I then ran the example Python script found here, and it successfully loaded the file into the table:
Loaded 3 rows into timestamp_test:gcs_load_test.
import uuid

from google.cloud import bigquery

def load_data_from_gcs(dataset_name, table_name, source):
    bigquery_client = bigquery.Client()
    dataset = bigquery_client.dataset(dataset_name)
    table = dataset.table(table_name)
    job_name = str(uuid.uuid4())

    job = bigquery_client.load_table_from_storage(job_name, table, source)
    job.begin()

    wait_for_job(job)  # helper defined in the referenced sample script

    print('Loaded {} rows into {}:{}.'.format(
        job.output_rows, dataset_name, table_name))