How to do sampling in sql query to get dataframe with pandas - sql

Note my question is a bit different here:
I am working with pandas on a dataset that has a lot of data (10M+):
q = "SELECT COUNT(*) as total FROM `<public table>`"
df = pd.read_gbq(q, project_id=project, dialect='standard')
I know I can sample with the pandas function that takes a frac option, like
df_sample = df.sample(frac=0.01)
However, I do not want to generate the original df at that size. I wonder what the best practice is for generating a dataframe from data that has already been sampled.
I've read some SQL posts where the sample was taken as a slice of the table; that is not acceptable in my case. The sample needs to be as evenly distributed as possible.
Can anyone shed more light on this?
Thank you very much.
UPDATE:
Below is a table showing what the data looks like:
Reputation is the field I am working on. You can see that the majority of records have a very small reputation.
I don't want to work with a dataframe containing all the records. I want the sampled data to look like the unsampled data, for example with a similar histogram; that's what I meant by "evenly".
I hope this clarifies a bit.

A simple random sample can be performed using the following syntax:
select * from mydata where rand()>0.9
This gives each row in the table a 10% chance of being selected. It doesn't guarantee a certain sample size or guarantee that every bin is represented (that would require a stratified sample). Here's a fiddle of this approach
http://sqlfiddle.com/#!9/21d1ee/2
On average, a random sample has the same distribution as the underlying data, so it meets your requirement. However, if you want to 'force' the sample to be more representative, or force it to be a certain size, we need to look at something a little more advanced.
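A minimal sketch of pushing the sampling into the query itself, so that pandas only ever receives the sampled rows. The helper name build_sample_query and the table name in the usage comment are hypothetical; RAND() is the BigQuery standard SQL spelling of the function used above, and the read_gbq call mirrors the one in the question.

```python
def build_sample_query(table: str, fraction: float) -> str:
    """Return a query that keeps each row with probability `fraction`.

    `table` is a fully qualified table name (hypothetical here).
    The sample size is only approximately fraction * row_count.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in the interval (0, 1]")
    # RAND() returns a value in [0, 1), so `RAND() < fraction`
    # is true for roughly that share of the rows.
    return f"SELECT * FROM `{table}` WHERE RAND() < {fraction}"


query = build_sample_query("my-project.my_dataset.users", 0.01)

# Usage (needs credentials, so it is left commented out):
# import pandas as pd
# df_sample = pd.read_gbq(query, project_id=project, dialect='standard')
```

Because every row is kept independently, a histogram of the sample should resemble that of the full table, although rare bins may end up empty.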

Related

can i compress a pandas dataframe into one row?

I have a pandas dataframe that I've extracted from a json object using pd.json_normalize.
It has 4 rows and over 60 columns, and with the exception of the 'ts' column there are no columns where there is more than one value.
Is it possible to merge the four rows together to give one row which can then be written to a .csv file? I have searched the documentation and found no information on this.
To give context, the data is a one-time record from a weather station. I will have records at 5-minute intervals and need to put all the records into a database for further use.
I've managed to get the desired result. It's a little convoluted, and I would expect that there is a much more succinct way to do it, but I basically manipulated the dataframe, replaced all NaNs with zero, replaced some strings with ints and added the columns together, as shown in the code below:
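import json
import pandas as pd

# fname is the path to the input file
with open(fname,'r') as d:
    ws=json.loads(next(d))  # parse the first line of the file
df=pd.json_normalize(ws['sensors'], record_path='data')
# turn each of the four rows into its own column
df3=pd.concat([df.iloc[0],df.iloc[1], df.iloc[2],
df.iloc[3]],axis=1)
df3.rename(columns={0 :'a', 1:'b', 2 :'c' ,3 :'d'}, inplace=True)
df3=df3.fillna(0)
# zero out the repeated values so that the sum keeps only column 'a'
df3.loc['ts',['b','c','d']]=0
df3.loc[['ip_v4_gateway','ip_v4_netmask','ip_v4_address'],'c']=int(0)
# add the four columns together, then transpose back to a single row
df3['comb']=df3['a']+df3['b']+df3['c']+df3['d']
df3.drop(columns=['a','b','c','d'], inplace=True)
df3=df3.T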
As has been said by quite a few people, the documentation on this is very patchy, so I hope this may help someone else who is struggling with this problem! (And yes, I know that one line isn't indented properly, get over it!)
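A possibly shorter alternative, offered as a sketch rather than a drop-in replacement: back-fill each column and keep the first row, which picks the first non-null value per column. The column names and values below are made up for illustration, and unlike the summing approach above this keeps the first 'ts' value without any special handling.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the normalized weather-station frame:
# four rows, each column populated in only one of them (except 'ts').
df = pd.DataFrame({
    "ts": [100, 101, 102, 103],
    "temp": [21.5, np.nan, np.nan, np.nan],
    "hum": [np.nan, 40.0, np.nan, np.nan],
    "wind": [np.nan, np.nan, 3.2, np.nan],
})

# bfill() pulls the next valid value upwards, so row 0 ends up holding
# the first non-null value of every column.
one_row = df.bfill().iloc[[0]]

# one_row.to_csv("record.csv", index=False)  # write the single row out
```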

Slice dataframe according to unique values into many smaller dataframes

I have a large dataframe (14,000 rows). The columns include 'title', 'x' and 'y' as well as other random data.
For a particular title, I've written code which performs an analysis using the x and y values for a subset of this data (the specifics are unimportant here).
For this title (which is something like "Part number Y1-17") there are about 80 rows.
At the moment I have only worked out how to get my code to work on 1 subset of titles (i.e. one set of rows with the same title) at a time. For this I've been making a smaller dataframe out of my big one using:
df = pd.read_excel(r"mydata.xlsx")
a = df.loc[df['title'].str.contains('Y1-17')]
But given there are about 180 of these smaller datasets I need to do this analysis on, I don't want to have to do it manually.
My question is: is there a way to make all of the smaller dataframes automatically, by slicing the data by the unique 'title' value? In all the help I've found, it seems you need to specify the 'title' to make a subset. I want to subset all of it, and I don't want to have to list all the title names to do it.
I've searched quite a lot and haven't found anything. However, I am a beginner, so it's very possible I've missed some really basic way of doing this.
I'm not sure if it's important information, but the modules I'm working with are pandas and numpy.
Thanks for any help!
You can use Pandas groupby
For example:
df_dict = {title: group for title, group in df.copy().groupby('title', sort=False)}
This creates a dictionary of DataFrames keyed by title, each containing all the columns and only the rows belonging to that unique value of title.
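A small sketch of how the grouping can drive the per-title analysis without listing any titles by hand. The sample data and the analyse function are placeholders (the question does not say what the real analysis does); only the groupby loop is the point.

```python
import pandas as pd

# Made-up stand-in for the data read from mydata.xlsx.
df = pd.DataFrame({
    "title": ["Part number Y1-17", "Part number Y1-17",
              "Part number Y2-03", "Part number Y2-03"],
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 9.0, 12.0],
})


def analyse(group: pd.DataFrame) -> float:
    """Placeholder analysis: mean ratio of y to x within one title."""
    return (group["y"] / group["x"]).mean()


# Iterating over a groupby yields (key, sub-DataFrame) pairs,
# one pair per unique title.
results = {title: analyse(group)
           for title, group in df.groupby("title", sort=False)}
```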

Organising csv. file data in Python

I am quite a beginner with Python but I have a programming-related project to work on, so I would really like to ask for some help. I didn't find many simple solutions for organizing the data in such a way that I could do some analysis with it.
First, I have multiple csv files, which I read in as DataFrame objects. In the end, I need to analyze them all together (right now the files are kept as a list of DataFrames, but later on I will probably need them as one DataFrame object).
However, I have a problem with organizing and separating the data. There are thousands of rows in one column; part of it is shown here:
CIP;Date;Hour;Cons;REAL/ESTIMATED
EN025140855608477018TC2L;11/03/2020;1;0 057;R
EN025140855608477018TC2L;11/03/2020;2;0 078;R
EN025140855608477018TC2L;11/03/2020;3;0 033;R
EN025140855608477018TC2L;11/03/2020;4;0 085;R
EN025140855608477018TC2L;11/03/2020;5;0 019;R
...
EN025140855608477018TC2L;11/04/2020;20;0 786;R
EN025140855608477018TC2L;11/04/2020;21;0 288;R
EN025140855608477018TC2L;11/04/2020;22;0 198;R
EN025140855608477018TC2L;11/04/2020;23;0 728;R
EN025140855608477018TC2L;11/04/2020;24;0 275;R
Where there is a large gap inside the number, the two parts should be merged, for example into 0.057. This value is the "Cons" field (actually the most important information).
I should be able to split the data into 5 columns in order to proceed with the analysis. However, it should be a universal tool for different csv files, without knowing in advance which symbols they include. The structure of the content and the heading is always the same.
I would be happy if anyone could recommend a way to work with this kind of data.
Sounds like what you are trying to do is convert the Cons column so that the spaces become a dot.
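import pandas as pd

df = pd.read_csv("file.txt", sep=";")
# regex=True is required in recent pandas versions, where str.replace is literal by default
df['Cons'] = df['Cons'].str.replace(r"\s+", ".", regex=True)
df['Cons'].head()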
Output:
0 0.057
1 0.078
2 0.033
3 0.085
4 0.019
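A self-contained sketch of the same idea that also converts the cleaned strings to numbers, which the analysis will probably need. The load_consumption helper is a hypothetical name, and io.StringIO stands in for a real file; the two sample rows are copied from the question.

```python
import io

import pandas as pd

raw = """CIP;Date;Hour;Cons;REAL/ESTIMATED
EN025140855608477018TC2L;11/03/2020;1;0 057;R
EN025140855608477018TC2L;11/03/2020;2;0 078;R
"""


def load_consumption(source) -> pd.DataFrame:
    """Read one semicolon-separated file and make 'Cons' numeric."""
    frame = pd.read_csv(source, sep=";")
    # Replace the run of whitespace inside the number with a decimal
    # point, then convert the resulting strings to floats.
    frame["Cons"] = (
        frame["Cons"].str.replace(r"\s+", ".", regex=True).astype(float)
    )
    return frame


df = load_consumption(io.StringIO(raw))

# Several files could be combined into one DataFrame like this:
# all_data = pd.concat([load_consumption(p) for p in paths], ignore_index=True)
```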

Pandas join is slow

edit at Oct 16 2017: I think I found the problem, it seems to be a bug in pandas core. It can't merge/join anything over 145k rows. 144k rows it can do without an issue. Pandas version 0.20.3, running on Fedora 26.
----Original post----
I have a medium size amount of data to process (about 200k rows with about 40 columns). I've optimised a lot of the code, but the only trouble I have now is joining the columns.
I receive the data in an unfortunate structure and need to extract the data in a certain way, then put it all into a dataframe.
Basically I extract 2 arrays at a time (each 200k rows long). One array is the timestamp, the other array is the values.
Here I create a dataframe, and use the timestamp as the index.
When I extract the second block of data, I do the same and create a new dataframe using the new values + timestamp.
I need to join the two dataframes on the index. The timestamps can be slightly different, so I do an 'outer' join to keep the new timestamps. Basically I follow the documentation below.
result = left.join(right, how='outer')
https://pandas.pydata.org/pandas-docs/stable/merging.html#joining-on-index
This, however, is way too slow. I left it for about 15 minutes and it still hadn't finished processing, so I killed the process.
Can anyone help? Any hints/tips?
edit:
It's a work thing, so I can't give out the data sorry. But it's just two long dataframes, each with a timestamp as the index, and a single column for the values.
The code is just as described above.
data_df.join(variable_df, how='outer')
I forgot to answer this. It's not really a bug in pandas.
The timestamp was a nanosecond timestamp, and joining them on the index like this was causing a massive slow down. Basically it was better to join on a column - made it all much faster.
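A minimal sketch of what joining on a column instead of the index could look like. The frame contents and the column names timestamp, a and b are invented for illustration; the answer does not show the actual code that was used.

```python
import pandas as pd

# Two small frames indexed by timestamp, as in the question.
left = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0]},
    index=pd.DatetimeIndex(
        ["2017-10-16 00:00:00", "2017-10-16 00:00:01", "2017-10-16 00:00:02"],
        name="timestamp",
    ),
)
right = pd.DataFrame(
    {"b": [10.0, 20.0, 30.0]},
    index=pd.DatetimeIndex(
        ["2017-10-16 00:00:01", "2017-10-16 00:00:02", "2017-10-16 00:00:03"],
        name="timestamp",
    ),
)

# reset_index() moves the timestamp index into an ordinary column,
# so the outer merge is keyed on that column rather than on the index.
result = (
    left.reset_index()
    .merge(right.reset_index(), on="timestamp", how="outer")
    .sort_values("timestamp")
    .reset_index(drop=True)
)
```

If an indexed result is needed afterwards, result.set_index("timestamp") restores it.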

Why is pyspark so much slower in finding the max of a column?

Is there a general explanation for why Spark needs so much more time to calculate the maximum value of a column?
I imported the Kaggle Quora training set (over 400,000 rows) and I like what Spark is doing when it comes to row-wise feature extraction. But now I want to scale a column 'manually': find the maximum value of a column and divide by that value.
I tried the solutions from Best way to get the max value in a Spark dataframe column and https://databricks.com/blog/2015/06/02/statistical-and-mathematical-functions-with-dataframes-in-spark.html
I also tried df.toPandas() and then calculate the max in pandas (you guessed it, df.toPandas took a long time.)
The only thing I have not tried yet is the RDD way.
Before I provide some test code (I have to find out how to generate dummy data in Spark), I'd like to know:
Can you give me a pointer to an article discussing this difference?
Is Spark more sensitive to memory constraints on my computer than pandas?
As @MattR has already said in the comment, you should use Pandas unless there's a specific reason to use Spark.
Usually you don't need Apache Spark unless you encounter MemoryError with Pandas. But if one server's RAM is not enough, then Apache Spark is the right tool for you. Apache Spark has an overhead, because it needs to split your data set first, then process those distributed chunks, then process and join "processed" data, collect it on one node and return it back to you.
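The split, process and combine steps described above can be illustrated with a plain-Python analogy. This is not Spark code and uses no Spark API; the function names are hypothetical, and it only shows why a distributed maximum involves more steps than a single in-memory pass.

```python
from typing import List, Sequence


def split_into_chunks(values: Sequence[float], n_chunks: int) -> List[List[float]]:
    """Split the data into at most n_chunks contiguous pieces."""
    size = max(1, -(-len(values) // n_chunks))  # ceiling division
    return [list(values[i:i + size]) for i in range(0, len(values), size)]


def distributed_max(values: Sequence[float], n_chunks: int = 4) -> float:
    """Compute the maximum the 'distributed' way: per chunk, then combined."""
    if not values:
        raise ValueError("values must not be empty")
    chunks = split_into_chunks(values, n_chunks)   # 1. split the data set
    partial_maxima = [max(c) for c in chunks]      # 2. process each chunk
    return max(partial_maxima)                     # 3. combine on one node


overall = distributed_max([3, 9, 1, 7, 5], n_chunks=2)
```

For data that fits in memory, a single max() call does the same work without the splitting and combining steps, which is the overhead the answer refers to.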
@MaxU, @MattR, I found an intermediate solution that also makes me reassess Spark's laziness and understand the problem better.
sc.accumulator helps me define a global variable, and with a separate AccumulatorParam object I can calculate the maximum of the column on the fly.
In testing this I noticed that Spark is even lazier than expected, so this part of my original post, 'I like what Spark is doing when it comes to row-wise feature extraction', boils down to 'I like that Spark is doing nothing quite fast'.
On the other hand, a lot of the time spent on calculating the maximum of the column was presumably spent calculating the intermediate values.
Thanks for your input; this topic really got me much further in understanding Spark.