I am running the exact same query both through pandas' read_sql and through an external app (DbVisualizer).
DbVisualizer returns 206 rows, while pandas returns 178.
I have tried reading the data with pandas in chunks, based on the information in How to create a large pandas dataframe from an sql query without running out of memory?, but it made no difference (a sketch of that chunked attempt appears after the code below).
What could be the cause for this and ways to remedy it?
The query:
select *
from rainy_days
where year='2010' and day='weekend'
The columns contain: date, year, weekday, amount of rain at that day, temperature, geo_location (row per location), wind measurements, amount of rain the day before, etc..
The exact python code (minus connection details) is:
import pandas
from sqlalchemy import create_engine
engine = create_engine(
'postgresql://user:pass@server.com/weatherhist?port=5439',
)
query = """
select *
from rainy_days
where year='2010' and day='weekend'
"""
df = pandas.read_sql(query, con=engine)
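For reference, the chunked attempt mentioned above looked roughly like this (a sketch only; the chunksize value is arbitrary):
# Chunked read; pandas.read_sql returns an iterator of DataFrames when chunksize is set.
chunks = pandas.read_sql(query, con=engine, chunksize=10000)
df = pandas.concat(chunks, ignore_index=True)
print(len(df))  # still fewer rows than DbVisualizer reports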
See https://github.com/xzkostyan/clickhouse-sqlalchemy/issues/14: if you use plain engine.execute, you have to take care of the result format manually.
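To rule pandas out, a quick hedged check is to count the rows through SQLAlchemy directly (a sketch, reusing engine and query from the question):
from sqlalchemy import text

# Count the rows the driver itself returns, bypassing pandas entirely.
with engine.connect() as conn:
    rows = conn.execute(text(query)).fetchall()
print(len(rows), 'rows from the raw driver')
print(len(pandas.read_sql(query, con=engine)), 'rows via pandas.read_sql')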
The problem is that pandas returns a packed dataframe (DF). For some reason this is always on by default, and the results vary widely as to what is shown. The solution is to use the unpacking operator (*) when printing the df, like this:
print(*df)
(This is also known as the splat operator, for Ruby enthusiasts.)
To read more about this, please check out these references & tutorials:
https://treyhunner.com/2018/10/asterisks-in-python-what-they-are-and-how-to-use-them/
https://www.geeksforgeeks.org/python-star-or-asterisk-operator/
https://medium.com/understand-the-python/understanding-the-asterisk-of-python-8b9daaa4a558
https://towardsdatascience.com/unpacking-operators-in-python-306ae44cd480
It's not a fix, but what worked for me was to rebuild the indices:
drop the indices
export the whole table to a csv
delete all the rows: DELETE FROM table
import the csv back in
rebuild the indices
The export/import step in pandas looks like this (a fuller sketch follows below):
df = read_csv(..)
df.to_sql(..)
If that works, then at least you know the problem is the indices not keeping up to date.
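A slightly fuller sketch of that pandas round trip (the csv path and the if_exists/index arguments are my assumptions; the table name and connection string come from the question above):
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://user:pass@server.com/weatherhist?port=5439')

# Export the whole table to a csv.
df = pd.read_sql('select * from rainy_days', con=engine)
df.to_csv('rainy_days.csv', index=False)

# ... drop the indices and run DELETE FROM rainy_days on the database side ...

# Re-import the csv into the now-empty table, then rebuild the indices.
df = pd.read_csv('rainy_days.csv')
df.to_sql('rainy_days', con=engine, if_exists='append', index=False)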
Related
I want to find the mode in a large dataset. When there are multiple equally common values I still want to know what they are, but instead I get output such as:
'no unique mode; found %d equally common values' % len(table)
StatisticsError: no unique mode; found 24 equally common values
I have tried:
import pandas as pd
import statistics
df=pd.read_csv('Area(1).txt',delimiter='\t')
df4=df.iloc[0:528,5]
print(df4)
statistics.mode(df4)
But because the dataset is quite big, with many equally common values, I get the StatisticsError mentioned above. Since statistics.mode does not give me the required output, I have also tried statistics.multimode(), but for some reason that function is not recognized at all.
You may just use mode from pandas
df4.mode()
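A minimal sketch of the difference (the Series values here are made up for illustration): pandas' mode returns every equally common value instead of raising, and the stdlib equivalent, statistics.multimode, only exists on Python 3.8+, which would explain why it was not recognized.
import pandas as pd

# Illustrative data with two equally common values.
df4 = pd.Series([1, 1, 2, 2, 3])

print(df4.mode().tolist())   # [1, 2] -- all modes, no exception raised

# On Python 3.8+ the stdlib alternative is:
# import statistics
# print(statistics.multimode(df4))  # [1, 2]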
I want to concat two dataframes into one and save the result as a single csv, given that the first dataframe lives in a huge csv file that I don't want to load into memory. I tried df.to_csv with append mode, but it doesn't behave like pd.concat with regard to differing columns (comparing and combining columns). Does anyone know how to concat a csv and a df? Basically, the csv and the df can have different columns, so the output csv should have a single header with all columns and the rows aligned accordingly.
You can use Dask DataFrame to do this operation lazily. It will still load your data into memory, but only in small chunks at a time. Make sure to keep the partition size (blocksize) reasonable, based on your overall memory capacity.
import dask.dataframe as dd
ddf1 = dd.read_csv("data1.csv", blocksize=25e6)
ddf2 = dd.read_csv("data2.csv", blocksize=25e6)
new_ddf = dd.concat([ddf1, ddf2])
new_ddf.to_csv("combined_data.csv", single_file=True)  # write one csv instead of one file per partition
API docs: read_csv, concat, to_csv
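If you would rather avoid Dask, a hedged pandas-only alternative is to stream the big csv in chunks and reindex each chunk (and the in-memory frame) to the union of columns before appending; the file names and chunk size below are placeholders:
import pandas as pd

big_csv = "data1.csv"             # the huge file already on disk
df2 = pd.read_csv("data2.csv")    # the smaller frame that fits in memory

# Union of the columns from both sources, preserving order.
all_cols = list(dict.fromkeys(list(pd.read_csv(big_csv, nrows=0).columns) + list(df2.columns)))

with open("combined_data.csv", "w", newline="") as out:
    header_written = False
    # Stream the big csv through in chunks, aligning every chunk to the full column set.
    for chunk in pd.read_csv(big_csv, chunksize=100_000):
        chunk.reindex(columns=all_cols).to_csv(out, header=not header_written, index=False)
        header_written = True
    # Append the in-memory frame, aligned to the same columns, without repeating the header.
    df2.reindex(columns=all_cols).to_csv(out, header=False, index=False)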
I cannot share my actual code or data, unfortunately, as it is proprietary, but I can produce an MWE if the problem isn't clear to readers from the text.
I am working with a dataframe containing ~50 million rows, each of which contains a large XML document. From each XML document, I extract a list of statistics relating to the number of occurrences and hierarchical relationships between tags (nothing like undocumented XML formats to brighten one's day). I can express these statistics in dataframes, and I can combine these dataframes over multiple documents using standard operations like GROUP BY/SUM and DISTINCT. The goal is to extract the statistics for all 50 million documents and express them in a single dataframe.
The problem is that I don't know how to efficiently generate 50 million dataframes from each row of one dataframe in Spark, or how to tell Spark to reduce a list of 50 million dataframes to one dataframe using binary operators. Are there standard functions that do these things?
So far, the only workaround I have found is massively inefficient (storing the data as a string, parsing it, doing the computations, and then converting it back into a string). It would take weeks to finish using this method, so it isn't practical.
The extractions and statistics from each row's XML response can be stored in additional columns of that same row. That way Spark can distribute the processing across its executors, improving performance.
Here is a pseudocode.
from pyspark.sql import Row
from pyspark.sql.types import (StructType, StructField, IntegerType,
                               StringType, DateType, FloatType, ArrayType)

def extract_metrics_from_xml(row):
    j = row['xmlResponse']  # assuming your xml column name is xmlResponse
    # perform your xml extractions and computations for the xmlResponse in python
    ...
    load_date = ...
    stats_data1 = ...
    return Row(load_date, stats_data1, stats_data2, stats_group)

schema = StructType([StructField('load_date', DateType()),
                     StructField('stats_data1', FloatType()),
                     StructField('stats_data2', ArrayType(IntegerType())),
                     StructField('stats_group', StringType())
                     ])
df_with_xml_stats = original_df.rdd \
    .map(extract_metrics_from_xml) \
    .toDF(schema=schema, sampleRatio=1) \
    .cache()
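Once the statistics live in columns, the cross-document combination described in the question (GROUP BY/SUM, DISTINCT) can stay in Spark as ordinary DataFrame operations; a hedged sketch using the column names from the pseudocode above:
from pyspark.sql import functions as F

# Aggregate the per-document statistics across all rows, e.g. per stats_group.
aggregated = (df_with_xml_stats
              .groupBy('stats_group')
              .agg(F.sum('stats_data1').alias('stats_data1_total'),
                   F.countDistinct('load_date').alias('distinct_load_dates')))
aggregated.show()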
I am using the as_pandas utility from impala.util to read the data fetched from Hive into a dataframe. However, with pandas I don't think I will be able to handle a large amount of data, and it will also be slower. I have been reading about dask, which provides excellent functionality for reading large data files. How can I use it to efficiently fetch data from Hive?
def as_dask(cursor):
    """Return a DataFrame out of an impyla cursor.

    This will pull the entire result set into memory. For richer pandas-
    like functionality on distributed data sets, see the Ibis project.

    Parameters
    ----------
    cursor : `HiveServer2Cursor`
        The cursor object that has a result set waiting to be fetched.

    Returns
    -------
    DataFrame
    """
    import pandas as pd
    import dask
    import dask.dataframe as dd
    names = [metadata[0] for metadata in cursor.description]
    dfs = dask.delayed(pd.DataFrame.from_records)(cursor.fetchall(),
                                                  columns=names)
    return dd.from_delayed(dfs).compute()
There is currently no straightforward way to do this. You would do well to look at the implementation of dask.dataframe.read_sql_table and similar code in intake-sql - you will probably want a way to partition your data, and have each of your workers fetch one partition via a call to delayed(). dd.from_delayed and dd.concat can then be used to stitch the pieces together.
-edit-
Your function has the delayed idea back to front. You are delaying and then immediately materialising the data within a function that operates on a single cursor; it can't be parallelised and will blow your memory if the data is big (which is the reason you are trying this in the first place).
Let's suppose you can form a set of 10 queries, where each query gets a different part of the data; do not use OFFSET, use a condition on some column that is indexed by Hive.
You want to do something like:
queries = [SQL_STATEMENT.format(i) for i in range(10)]
def query_to_df(query):
    cursor = impyla.execute(query)
    return pd.DataFrame.from_records(cursor.fetchall())
Now you have a function that returns a partition and has no dependence on global objects - it only takes as input a string.
parts = [dask.delayed(query_to_df)(q) for q in queries]
df = dd.from_delayed(parts)
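One follow-up worth noting (my assumption, not part of the original answer): without a meta argument, dd.from_delayed has to infer column types by evaluating a partition, so passing an explicit meta can help; the resulting df also stays lazy until you call .compute() or persist it. The column names below are purely hypothetical:
import pandas as pd
import dask.dataframe as dd

# Hypothetical empty frame describing the columns/dtypes each partition returns.
meta = pd.DataFrame({'station_id': pd.Series(dtype='int64'),
                     'rainfall_mm': pd.Series(dtype='float64')})

df = dd.from_delayed(parts, meta=meta)
result = df.groupby('station_id').rainfall_mm.mean().compute()  # the queries only run here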
Is there any way to get pandas to read a table with array-typed columns directly into native structures? By default, an int[] column ends up as an object column containing Python lists of Python ints. There are ways to convert this into a column of Series, or better, a column with a multi-index, but these are very slow (~10 seconds) for 500M rows. It would be much faster if the data were loaded into that shape from the start. I don't want to unroll the array in SQL because I have very many array columns.
url = "postgresql://u:p#host:5432/dname"
engine = sqlalchemy.create_engine(url)
df = pd.read_sql_query("select 1.0 as a, 2.2 as b, array[1,2,3] as c;", engine)
print(df)
print(type(df.loc[0,'c']))     # list
print(type(df.loc[0,'c'][0]))  # int
Does it help if you use read_sql_table instead of read_sql_query? Also, type detection can fail due to missing values; maybe that is the cause.
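For what it's worth, a hedged post-load workaround (not a SQL-level or read_sql-level fix) is to turn the object column of lists into a single numpy array in one step, assuming every array has the same length:
import numpy as np
import pandas as pd

# Convert the object column of equal-length lists into one 2-D int array in a single step.
arr = np.array(df['c'].tolist())                              # shape (n_rows, array_length)
expanded = pd.DataFrame(arr, index=df.index).add_prefix('c_') # columns c_0, c_1, ...
df = pd.concat([df.drop(columns='c'), expanded], axis=1)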