How to export pandas data to elasticsearch? - pandas

It is possible to export data from a pandas DataFrame to Elasticsearch using elasticsearch-py. For example, here is some code:
https://www.analyticsvidhya.com/blog/2017/05/beginners-guide-to-data-exploration-using-elastic-search-and-kibana/
There are a lot of similar methods in pandas, like to_excel, to_csv and to_sql.
Is there a to_elastic method? If not, where should I request it?

The following script works for localhost:
import numpy as np
import pandas as pd
from elasticsearch import Elasticsearch

df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))

INDEX = "dataframe"
TYPE = "record"

def rec_to_actions(df):
    import json
    for record in df.to_dict(orient="records"):
        yield ('{ "index" : { "_index" : "%s", "_type" : "%s" }}' % (INDEX, TYPE))
        yield (json.dumps(record, default=int))

e = Elasticsearch()  # no args, connect to localhost:9200
if not e.indices.exists(INDEX):
    raise RuntimeError('index does not exist, use `curl -X PUT "localhost:9200/%s"` and try again' % INDEX)

r = e.bulk(rec_to_actions(df))  # returns a dict
print(not r["errors"])
Verify using curl -g 'http://localhost:9200/dataframe/_search?q=A:[29%20TO%2039]'
There are many little things that can be added to suit different needs, but the main part is there.
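If you would rather not build the bulk action lines by hand, elasticsearch-py also ships a helpers.bulk utility. A minimal sketch under the same assumptions as above (localhost cluster, the df, INDEX and TYPE defined earlier; on Elasticsearch 7+ the _type field is deprecated and can be dropped):
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()  # connect to localhost:9200

# One action dict per row; values are cast to plain int because numpy
# integers are not JSON serializable by the default serializer.
actions = (
    {"_index": INDEX, "_type": TYPE,
     "_source": {k: int(v) for k, v in record.items()}}
    for record in df.to_dict(orient="records")
)
ok, errors = helpers.bulk(es, actions)
print(ok, errors)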

I'm not aware of any to_elastic method integrated into pandas. You can always raise an issue on the pandas GitHub repo or create a pull request.
However, there is espandas, which allows you to import a pandas DataFrame into Elasticsearch. The following example from the README has been tested with Elasticsearch 6.2.1.
import pandas as pd
import numpy as np
from espandas import Espandas
df = (100 * pd.DataFrame(np.round(np.random.rand(100, 5), 2))).astype(int)
df.columns = ['A', 'B', 'C', 'D', 'E']
df['indexId'] = (df.index + 100).astype(str)
INDEX = 'foo_index'
TYPE = 'bar_type'
esp = Espandas()
esp.es_write(df, INDEX, TYPE)
Retrieving the mappings with GET foo_index/_mappings:
{
  "foo_index": {
    "mappings": {
      "bar_type": {
        "properties": {
          "A": { "type": "long" },
          "B": { "type": "long" },
          "C": { "type": "long" },
          "D": { "type": "long" },
          "E": { "type": "long" },
          "indexId": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          }
        }
      }
    }
  }
}
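espandas can also read documents back into a DataFrame by their indexId values; a short sketch based on its README (treat the exact es_read signature as an assumption for your installed version):
# Read back a few of the documents just written, keyed by indexId
keys = ['100', '101', '102']
res = esp.es_read(keys, INDEX, TYPE)
print(res.head())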

Maybe you can use es_pandas:
pip install es_pandas
pip install progressbar2
This package works on Python 3 (>= 3.4), and Elasticsearch should be version 5.x, 6.x or 7.x.
import time
import pandas as pd
from es_pandas import es_pandas

# Information of the es cluster
es_host = 'localhost:9200'
index = 'demo'

# create an es_pandas instance
ep = es_pandas(es_host)

# Example data frame
df = pd.DataFrame({'Alpha': [chr(i) for i in range(97, 128)],
                   'Num': [x for x in range(31)],
                   'Date': pd.date_range(start='2019/01/01', end='2019/01/31')})

# init a template if you want
doc_type = 'demo'
ep.init_es_tmpl(df, doc_type)

# Example of writing data to es, using the template you created
ep.to_es(df, index, doc_type=doc_type)

# set use_index=True if you want to use the DataFrame index as the records' _id
ep.to_es(df, index, doc_type=doc_type, use_index=True)
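es_pandas can also pull an index back into a DataFrame; a minimal sketch based on its README (treat the exact to_pandas signature as an assumption for your installed version):
# Read the whole index back into a DataFrame
df_from_es = ep.to_pandas(index)
print(df_from_es.head())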
Here is the documentation: https://pypi.org/project/es-pandas/
If es_pandas can't solve your problem, you could look at another solution: https://towardsdatascience.com/exporting-pandas-data-to-elasticsearch-724aa4dd8f62

You could use elasticsearch-py, or if you don't want to use elasticsearch-py, you may find an answer to your question here: index-a-pandas-dataframe-into-elasticsearch-without-elasticsearch-py

Related

Error in loading nested and repeated data to BigQuery?

I am getting a JSON response from an API. I want to get 5 columns from it, of which 4 are normal, but 1 column is of RECORD REPEATED type. I want to load that data into a BigQuery table.
Below is my code, in which the schema is specified.
import requests
from requests.auth import HTTPBasicAuth
import json
from google.cloud import bigquery
import pandas
import pandas_gbq
URL='<API>'
auth = HTTPBasicAuth('username', 'password')
# sending get request and saving the response as response object
r = requests.get(url=URL ,auth=auth)
data = r.json()
---------------------- JSON response ----------------
{
    "data": {
        "id": "jfp695q8",
        "origin": "taste",
        "title": "Christmas pudding martini recipe",
        "subtitle": null,
        "customTitles": [{
            "name": "editorial",
            "value": "Christmas pudding martini"
        }]
    }
}
id=data['data']['id']
origin=data['data']['origin']
title=data['data']['title']
subtitle=data['data']['subtitle']
customTitles=json.dumps(data['data']['customTitles'])
# print(customTitles)
df = pandas.DataFrame(
    {
        'id': id,
        'origin': origin,
        'title': title,
        'subtitle': 'subtitle',
        'customTitles': customTitles
    }, index=[0]
)
# df.head()
client = bigquery.Client(project='ncau-data-newsquery-sit')
table_id = 'sdm_adpoint.testfapi'
job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("id", "STRING"),
        bigquery.SchemaField("origin", "STRING"),
        bigquery.SchemaField("title", "STRING"),
        bigquery.SchemaField("subtitle", "STRING"),
        bigquery.SchemaField(
            "customTitles",
            "RECORD",
            mode="REPEATED",
            fields=[
                bigquery.SchemaField("name", "STRING", mode="NULLABLE"),
                bigquery.SchemaField("value", "STRING", mode="NULLABLE"),
            ])
    ],
    autodetect=False
)
df.head()
job = client.load_table_from_dataframe(
    df, table_id, job_config=job_config
)
job.result()
customTitles is a RECORD REPEATED field with two keys, name and value, so I have made the schema like that, and my table schema matches it.
Below is the output of df.head():
jfp695q8 taste Christmas pudding martini recipe subtitle [{"name": "editorial", "value": "Christmas pudding martini"}]
Up to here it's good. But when I try to load the data into the table, it throws the error below:
ArrowTypeError: Could not convert '[' with type str: was expecting tuple of (key, value) pair
Can anyone tell me what's wrong here?
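(For context, the error hints that the REPEATED RECORD column is arriving as a JSON string rather than as Python objects. A hedged sketch of that assumption, keeping customTitles as the original list of dicts instead of json.dumps-ing it; whether this is the whole fix depends on the rest of the pipeline:)
# Sketch only: load_table_from_dataframe converts list-of-dict cells to
# REPEATED RECORD values via pyarrow, so keep the structure, not a string.
df = pandas.DataFrame(
    {
        'id': id,
        'origin': origin,
        'title': title,
        'subtitle': subtitle,
        'customTitles': [data['data']['customTitles']],  # one list-of-dicts cell
    }, index=[0]
)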

Import only csv files from GCS to Dataflow and BigQuery using Cloud Composer - Apache Airflow

I have a use case: there are several file types in GCS, like json, csv, txt, ..., but I only want to pick the csv files, use Dataflow in Python to transform them (such as renaming fields, ...), then write them to BigQuery. The main requirement is to use Airflow sensors, without Cloud Functions, to trigger them whenever a new csv file lands in GCS.
Here is my code:
from datetime import timedelta, datetime
from airflow.models import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.providers.google.cloud.sensors.gcs import GCSObjectExistenceSensor
from airflow.contrib.operators.dataflow_operator import DataflowTemplateOperator

PROJECT = 'abc'
ZONE = 'us-central1-c'
BUCKET_NAME = 'bucket_testing'
BQ_DATASET = "abc.dataset_name"
LOCATION = "US"

DEFAULT_DAG_ARGS = {
    'owner': 'gcs to bigquery using dataflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'schedule_interval': '@daily',
    'dataflow_default_options': {
        'project': PROJECT,
        'zone': ZONE,
        'stagingLocation': BUCKET_NAME
    }
}

ENVIRONMENT = {
    "bypassTempDirValidation": "false",
    "maxWorkers": "20",
    "numWorkers": "1",
    "serviceAccountEmail": "abc8932097-compute@developer.gserviceaccount.com",
    "tempLocation": "gs://composer_bucket",
    "ipConfiguration": "WORKER_IP_UNSPECIFIED",
    "additionalExperiments": [
        "sideinput_io_metrics"
    ]
}

PARAMETERS = {
    "outputTable": "abc:dataset_name.how_to_define_here",  # how to get multiple tables from multiple csv files?
    "bigQueryLoadingTemporaryDirectory": "gs://composer_bucket",
}

with DAG('dag_sensor', default_args=DEFAULT_DAG_ARGS, dagrun_timeout=timedelta(hours=3), schedule_interval='00 * * * *') as dag:
    gcs_file_exists = GCSObjectExistenceSensor(
        task_id="gcs_object_sensor",
        bucket=BUCKET_NAME,
        object='*.csv',
        mode='poke',
    )
    my_dataflow_job = DataflowTemplateOperator(
        task_id='transfer_from_gcs_to_bigquery',
        template='???',  # what do I need to write here?
        parameters=PARAMETERS,
        environment=ENVIRONMENT,
        dag=dag
    )
    my_bq_result = BigQueryOperator(
        task_id='write_to_bq',
        use_legacy_sql=False,
        write_disposition='WRITE_TRUNCATE',
        create_disposition='CREATE_IF_NEEDED',
        dag=dag
    )
    gcs_file_exists >> my_dataflow_job >> my_bq_result
I am a newbie here, so please point me to a detailed example.
Many thanks!
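One detail worth flagging in the snippet above: GCSObjectExistenceSensor expects an exact object name, so a wildcard like '*.csv' will not match. A hedged sketch of the prefix-based sensor from the Google provider that is usually used instead (the prefix value is an assumption about how your bucket is laid out):
from airflow.providers.google.cloud.sensors.gcs import GCSObjectsWithPrefixExistenceSensor

# Waits for any object whose name starts with the given prefix.
# GCS sensors match prefixes, not suffixes, so filtering on ".csv"
# has to happen downstream or via a naming convention, e.g. landing
# csv files under a dedicated prefix.
gcs_csv_exists = GCSObjectsWithPrefixExistenceSensor(
    task_id="gcs_csv_sensor",
    bucket=BUCKET_NAME,
    prefix="input/csv/",  # assumption: csv files land under this prefix
    mode="poke",
)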

Pandas DataFrame CSS: use a TTF file

I am trying to create HTML from a DataFrame. I want to use a custom font from a TTF file; the code below is not working.
import pandas as pd
import dataframe_image as dfi

styles = [
    dict(selector="th", props=[("font-family", "Gotham"),
                               ("src", "url('gotham-bold.ttf')")]),
    dict(selector="td", props=[("font-family", "Gotham"),
                               ("src", "url('gotham-bold.ttf')")]),
    dict(selector="", props=[("font-family", "Gotham"),
                             ("src", "url('gotham-bold.ttf')")])
]

data = [
    {
        "name": "John",
        "gender": "Male"
    },
    {
        "name": "Martin",
        "gender": "Female"
    }
]

df = pd.json_normalize(data)
df = df.style.set_table_styles(styles).hide(axis='index')
df.to_html("test.html")
Can someone please suggest how to use the font src in pandas?
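For what it's worth, font-family/src pairs normally live inside a CSS @font-face rule rather than in th/td selector props, so one hedged approach is to prepend such a rule to the generated HTML yourself and keep only font-family in the table styles (a sketch using the file names from the question; paths are assumptions):
import pandas as pd

styles = [
    dict(selector="th", props=[("font-family", "Gotham")]),
    dict(selector="td", props=[("font-family", "Gotham")]),
]
data = [{"name": "John", "gender": "Male"},
        {"name": "Martin", "gender": "Female"}]
styler = pd.json_normalize(data).style.set_table_styles(styles).hide(axis='index')

# Declare the custom font once, then write the table markup after it.
font_face = """<style>
@font-face {
    font-family: 'Gotham';
    src: url('gotham-bold.ttf') format('truetype');
}
</style>
"""
with open("test.html", "w") as f:
    f.write(font_face + styler.to_html())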

How to use Schema.from_dict() for nested dictionaries?

I am trying to create a Schema class using nested dictionaries that have some lists as elements. However, when I do a dumps(), only the top-level elements are dumped.
I have a REST API that returns a list of certain things, e.g. a list of users, but the schema is such that certain aggregate details are sent at the top level. The data looks something like this; this is what I am expecting as output:
{
    "field1": 5,
    "field2": false,
    "field3": {
        "field4": 40,
        "field5": [
            {
                "field6": "goo goo gah gah",
                "field7": 99.341879,
                "field8": {
                    "field9": "goo goo gah gah",
                    "field10": "goo goo gah gah"
                }
            }
        ]
    }
}
Here's my code:
MySchema = Schema.from_dict(
    {
        "field1": fields.Int(),
        "field2": fields.Bool(),
        "field3": {
            "field4": fields.Int(),
            "field5": [
                {
                    "field6": fields.Str(),
                    "field7": fields.Float(),
                    "field8": {
                        "field9": fields.Str(),
                        "field10": fields.Str()
                    }
                }
            ]
        }
    }
)
#Then use it like:
response = MySchema().dumps(data)
Actual result:
"{\"field1\": 5, \"field2\": false}"
Option 1
You're looking for several nested schemas, interconnected through fields.Nested:
from marshmallow import Schema, fields

Field8Schema = Schema.from_dict({
    "field9": fields.Str(),
    "field10": fields.Str()
})

Field5Schema = Schema.from_dict({
    "field6": fields.Str(),
    "field7": fields.Float(),
    "field8": fields.Nested(Field8Schema),
})

Field3Schema = Schema.from_dict({
    "field4": fields.Int(),
    "field5": fields.List(fields.Nested(Field5Schema))
})

MySchema = Schema.from_dict({
    "field1": fields.Int(),
    "field2": fields.Bool(),
    "field3": fields.Nested(Field3Schema),
})
MySchema().dump(data)
# {'field2': False,
# 'field1': 5,
# 'field3': {'field4': 40,
# 'field5': [{'field6': 'goo goo gah gah',
# 'field8': {'field9': 'goo goo gah gah', 'field10': 'goo goo gah gah'},
# 'field7': 99.341879}]}}
Option 2
If the nesting won't be that deep, it might be simpler to use decorators, i.e. nest and unnest data as suggested in the docs:
from marshmallow import Schema, pre_load, post_dump

class UserSchema(Schema):
    @pre_load(pass_many=True)
    def remove_envelope(self, data, many, **kwargs):
        namespace = 'results' if many else 'result'
        return data[namespace]

    @post_dump(pass_many=True)
    def add_envelope(self, data, many, **kwargs):
        namespace = 'results' if many else 'result'
        return {namespace: data}
It feels like it fits your case nicely.
Comments
I'd suggest not using from_dict, as it is less readable for such complex data, and instead switching to a class-based schema, as sketched below.
There are plenty of good examples of nesting in the docs.
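Following up on that suggestion, the class-based equivalent of Option 1 might look like this (a sketch, reusing the field names from above):
from marshmallow import Schema, fields

class Field8Schema(Schema):
    field9 = fields.Str()
    field10 = fields.Str()

class Field5Schema(Schema):
    field6 = fields.Str()
    field7 = fields.Float()
    field8 = fields.Nested(Field8Schema)

class Field3Schema(Schema):
    field4 = fields.Int()
    field5 = fields.List(fields.Nested(Field5Schema))

class MySchema(Schema):
    field1 = fields.Int()
    field2 = fields.Bool()
    field3 = fields.Nested(Field3Schema)

MySchema().dump(data)  # same output as the from_dict version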

Nested pymongo queries (mlab)

I have some documents in mlab mongodb; the format is:
{
    "_id": {
        "$oid": "58aeb1d074fece33edf2b356"
    },
    "sensordata": {
        "operation": "chgstatus",
        "user": {
            "status": "0",
            "uniqueid": "191b117fcf5c"
        }
    },
    "created_date": {
        "$date": "2017-02-23T15:26:29.840Z"
    }
}
database name : mparking_sensor
collection name : sensor
I want to query it in Python to extract only the status key-value pair and the created_date key-value pair.
My Python code is:
import sys
import pymongo

uri = 'mongodb://thorburn:tekush1!@ds157529.mlab.com:57529/mparking_sensor'
client = pymongo.MongoClient(uri)
db = client.get_default_database().sensor
print(db)
results = db.find()
for record in results:
    print(record["sensordata"], record['created_date'])
    print()
client.close()
which gives me everything under sensordata as expected; dot notation gives me an error. Can somebody help?
PyMongo represents BSON documents as Python dictionaries, and subdocuments as dictionaries within dictionaries. To access a value in a nested dictionary:
record["sensordata"]["user"]["status"]
So a complete print statement might be:
print("%s %s" % (record["sensordata"]["user"]["status"], record['created_date']))
That prints:
0 {'$date': '2017-02-23T15:26:29.840Z'}
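Since only the status and created_date values are needed, the filtering can also be pushed into the query itself with a standard PyMongo projection (the second argument to find()); a short sketch reusing db from above:
# Project only the nested status field and created_date;
# _id is excluded explicitly because it is returned by default.
results = db.find(
    {},
    {"sensordata.user.status": 1, "created_date": 1, "_id": 0}
)
for record in results:
    print(record["sensordata"]["user"]["status"], record["created_date"])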