Can't find the mode for multiple common values - pandas

I want to find the mode in a large dataset. When there are multiple equally common values I still want to know what they are, but instead I get output such as:
'no unique mode; found %d equally common values' % len(table)
StatisticsError: no unique mode; found 24 equally common values
I have tried:
import pandas as pd
import statistics

# Load the tab-delimited file and select the column of interest
df = pd.read_csv('Area(1).txt', delimiter='\t')
df4 = df.iloc[0:528, 5]
print(df4)

# Raises StatisticsError when several values are equally common
statistics.mode(df4)
But given that the dataset is quite big, with many equally common values, I get the StatisticsError mentioned above. Since statistics.mode is not giving me the required output, I have also tried statistics.multimode(), but for some reason that function does not work / is not recognized at all.

You may just use mode from pandas
df4.mode()
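Series.mode() returns every tied value, and statistics.multimode does the same if you are on Python 3.8 or later. A small sketch with made-up data rather than your file:

import pandas as pd
import statistics

s = pd.Series([1, 1, 2, 2, 3])   # two values are equally common

print(s.mode())                  # a Series containing both 1 and 2
print(statistics.multimode(s))   # [1, 2] (requires Python 3.8+)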


Slice dataframe according to unique values into many smaller dataframes

I have a large dataframe (14,000 rows). The columns include 'title', 'x' and 'y' as well as other random data.
For a particular title, I've written code that performs an analysis using the x and y values for a subset of this data (the specifics are unimportant here).
For this title (which is something like "Part number Y1-17") there are about 80 rows.
At the moment I have only worked out how to get my code to work on 1 subset of titles (i.e. one set of rows with the same title) at a time. For this I've been making a smaller dataframe out of my big one using:
df = pd.read_excel(r"mydata.xlsx")
a = df.loc[df['title'].str.contains('Y1-17')]
But given there are about 180 of these smaller datasets I need to do this analysis on, I don't want to have to do it manually.
My question is: is there a way to make all of the smaller dataframes automatically, by slicing the data by the unique 'title' values? In all the help I've found, it seems you need to specify the 'title' to make a subset. I want to subset all of them, and I don't want to have to list every title name to do it.
I've searched quite a lot and haven't found anything, however I am a beginner so it's very possible I've missed some really basic way of doing this.
I'm not sure if it's important information, but the modules I'm working with are pandas and numpy.
Thanks for any help!
You can use Pandas groupby
For example:
df_dict = {title: group for title, group in df.groupby('title', sort=False)}
This creates a dictionary of DataFrames, keyed by the unique values of title, where each DataFrame contains all the columns but only the rows for that title.
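From there you can loop over the groups and run your analysis on each subset. A minimal sketch, where analyze() is a hypothetical stand-in for your existing x/y code:

import pandas as pd

df = pd.read_excel(r"mydata.xlsx")

def analyze(subset):
    # placeholder for the analysis you already run on one title's rows
    return subset[['x', 'y']].mean()

results = {title: analyze(group) for title, group in df.groupby('title', sort=False)}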

Dummy Variable Trap And removing one Column

Can anyone explain exactly what is meant by the dummy variable trap, and why we want to remove one column to avoid that trap? Please provide some links or explain this; I am not clear about the process.
In regression analysis there is often talk about the issue of multicollinearity, which you might be familiar with already. The dummy variable trap is simply perfect collinearity between two or more variables. This can arise if, for one binary variable, two dummies are included. Imagine that you have a variable x which is equal to 1 when something is True. If you include x in your regression model along with another variable z that is the opposite of x (i.e. 1 when that same thing is False), you have two perfectly negatively correlated variables.
Here's a simple demonstration. Let's say your x is one column with True/False values in a pandas dataframe. See what happens when you use pd.get_dummies(df.x) below. The two dummies that are created mirror each other, so one of them is redundant. In simpler terms, you only need one of them, since you can always infer the value of the other from the one you have.
import pandas as pd
df = pd.DataFrame({'x': [True, False]})
pd.get_dummies(df.x)
   False  True
0      0     1
1      1     0
The same applies if you have a categorical variable that can take on more than two values. Whether binary or not, there is always a "base scenario" that is fully determined by the variation in the other case(s). This "base scenario" is therefore redundant and will only introduce perfect collinearity into the model if included.
So what's the issue with multicollinearity/linear dependence? The short answer is that if there is imperfect multicollinearity among your explanatory variables, your estimated coefficients can be distorted/biased. If there is perfect multicollinearity (which is the case with the dummy variable trap), you can't estimate your model at all. Think of it like this: if one variable can be perfectly explained by another, your sample data only contains valuable information about one, not two, truly unique variables, so it would be impossible to obtain two separate coefficient estimates for the same variable.
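In practice you can let pandas drop the redundant column for you. A quick sketch with the same toy dataframe, using the drop_first option of pd.get_dummies:

import pandas as pd

df = pd.DataFrame({'x': [True, False]})

# keep only one dummy per variable, avoiding the trap
print(pd.get_dummies(df.x, drop_first=True))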
Further Reading
Multicollinearity
Dummy Variable Trap

How to do sampling in sql query to get dataframe with pandas

Note my question is a bit different here:
I am working with pandas on a dataset that has a lot of data (10M+):
q = "SELECT COUNT(*) as total FROM `<public table>`"
df = pd.read_gbq(q, project_id=project, dialect='standard')
I know I can use the pandas sample function with a frac option, like
df_sample = df.sample(frac=0.01)
however, I do not want to generate the original df at that size in the first place. I wonder what the best practice is for generating a dataframe whose data has already been sampled.
I've read some SQL posts where the sample data was taken from a slice, which is absolutely not acceptable in my case. The sample data needs to be as evenly distributed as possible.
Can anyone shed some more light on this?
Thank you very much.
UPDATE:
Below is a table showing what the data looks like:
Reputation is the field I am working on. You can see that the majority of records have a very small reputation.
I don't want to work with a dataframe containing all the records; I want the sampled data to look like the un-sampled data, for example with a similar histogram. That's what I meant by "evenly".
I hope this clarifies a bit.
A simple random sample can be performed using the following syntax:
select * from mydata where rand()>0.9
This gives each row in the table a 10% chance of being selected. It doesn't guarantee a certain sample size or guarantee that every bin is represented (that would require a stratified sample). Here's a fiddle of this approach
http://sqlfiddle.com/#!9/21d1ee/2
On average, random sampling will produce a distribution the same as that of the underlying data, so it meets your requirement. However, if you want to 'force' the sample to be more representative, or force it to be a certain size, we need to look at something a little more advanced.
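Since the data is already being loaded with read_gbq, the sampling can also be pushed into the query itself so the full table never reaches pandas. A rough sketch, assuming BigQuery standard SQL and the placeholder table name from the question:

import pandas as pd

# ~1% simple random sample, drawn inside the database
q = """
SELECT *
FROM `<public table>`
WHERE RAND() < 0.01
"""
df_sample = pd.read_gbq(q, project_id=project, dialect='standard')

# For a fixed sample size, something like ORDER BY RAND() LIMIT 100000
# can be used instead, at the cost of a full sort.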

Why is pyspark so much slower in finding the max of a column?

Is there a general explanation for why Spark needs so much more time to calculate the maximum value of a column?
I imported the Kaggle Quora training set (over 400,000 rows) and I like what Spark is doing when it comes to row-wise feature extraction. But now I want to scale a column 'manually': find the maximum value of a column and divide by that value.
I tried the solutions from Best way to get the max value in a Spark dataframe column and https://databricks.com/blog/2015/06/02/statistical-and-mathematical-functions-with-dataframes-in-spark.html
I also tried df.toPandas() and then calculated the max in pandas (you guessed it, df.toPandas() took a long time).
The only thing I have not tried yet is the RDD way.
Before I provide some test code (I have to find out how to generate dummy data in Spark), I'd like to know:
can you give me a pointer to an article discussing this difference?
is spark more sensitive to memory constraints on my computer than pandas?
As @MattR has already said in the comments - you should use Pandas unless there's a specific reason to use Spark.
Usually you don't need Apache Spark unless you run into a MemoryError with Pandas. But if one server's RAM is not enough, then Apache Spark is the right tool for you. Apache Spark has overhead, because it needs to split your data set first, process the distributed chunks, then join the "processed" data, collect it on one node and return it back to you.
@MaxU, @MattR, I found an intermediate solution that also made me reassess Spark's laziness and understand the problem better.
sc.accumulator lets me define a global variable, and with a custom AccumulatorParam object I can calculate the maximum of the column on the fly.
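A minimal sketch of that accumulator approach, with a made-up list of numbers standing in for the real column values:

from pyspark import SparkContext, AccumulatorParam

class MaxAccumulatorParam(AccumulatorParam):
    # accumulator that keeps a running maximum instead of a sum
    def zero(self, initial_value):
        return initial_value
    def addInPlace(self, acc, value):
        return max(acc, value)

sc = SparkContext.getOrCreate()
max_acc = sc.accumulator(float('-inf'), MaxAccumulatorParam())

values = sc.parallelize([1.0, 7.5, 3.2, 42.0])  # stand-in for the real column
values.foreach(lambda x: max_acc.add(x))        # foreach is an action, so this actually runs

print(max_acc.value)  # 42.0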
In testing this I noticed that Spark is even lazier than expected, so the part of my original post 'I like what Spark is doing when it comes to row-wise feature extraction' boils down to 'I like that Spark is doing nothing quite fast'.
On the other hand, a lot of the time spent on calculating the maximum of the column was most likely spent computing the intermediate values.
Thanks for your input; this topic really got me much further in understanding Spark.

pandas read_sql not reading all rows

I am running the exact same query both through pandas' read_sql and through an external app (DbVisualizer).
DbVisualizer returns 206 rows, while pandas returns 178.
I have tried reading the data with pandas in chunks, based on the information provided at How to create a large pandas dataframe from an sql query without running out of memory?, but it didn't make a difference.
What could be the cause of this, and how can it be remedied?
The query:
select *
from rainy_days
where year='2010' and day='weekend'
The columns contain: date, year, weekday, amount of rain on that day, temperature, geo_location (one row per location), wind measurements, amount of rain the day before, etc.
The exact python code (minus connection details) is:
import pandas
from sqlalchemy import create_engine

engine = create_engine(
    'postgresql://user:pass@server.com/weatherhist?port=5439',
)
query = """
select *
from rainy_days
where year='2010' and day='weekend'
"""
df = pandas.read_sql(query, con=engine)
https://github.com/xzkostyan/clickhouse-sqlalchemy/issues/14
If you use a raw engine.execute you have to take care of the query formatting manually.
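One way to avoid manual quoting/formatting entirely is to let SQLAlchemy bind the parameters. A rough sketch, reusing the connection string from the question:

import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine('postgresql://user:pass@server.com/weatherhist?port=5439')

# bound parameters are escaped by the driver, so no manual quoting is needed
query = text("select * from rainy_days where year = :year and day = :day")
df = pd.read_sql(query, con=engine, params={"year": "2010", "day": "weekend"})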
The problem is that pandas returns a packed dataframe (DF). For some reason this is always on by default and the results varies widely as to what is shown. The solution is to use the unpacking operator (*) before/when trying to print the df, like this:
print(*df)
(This is also known as the splat operator for Ruby enthusiasts.)
To read more about this, please check out these references & tutorials:
https://treyhunner.com/2018/10/asterisks-in-python-what-they-are-and-how-to-use-them/
https://www.geeksforgeeks.org/python-star-or-asterisk-operator/
https://medium.com/understand-the-python/understanding-the-asterisk-of-python-8b9daaa4a558
https://towardsdatascience.com/unpacking-operators-in-python-306ae44cd480
It's not a fix, but what worked for me was to rebuild the indices:
1. drop the indices
2. export the whole table to a csv
3. delete all the rows:
DELETE FROM table
4. import the csv back in
5. rebuild the indices
With pandas, the export/import step looks like:
df = pd.read_csv(..)
df.to_sql(..)
If that works, then at least you know you have a problem somewhere with the indices keeping up to date.
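A rough sketch of that export/import round trip with pandas and SQLAlchemy (the index drop/rebuild steps are database-specific and omitted here; the table and connection details are the ones from the question):

import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine('postgresql://user:pass@server.com/weatherhist?port=5439')

# dump the table to disk, clear it, and load it back in
df = pd.read_sql('select * from rainy_days', con=engine)
df.to_csv('rainy_days.csv', index=False)

with engine.begin() as conn:
    conn.execute(text('DELETE FROM rainy_days'))

pd.read_csv('rainy_days.csv').to_sql('rainy_days', con=engine,
                                     if_exists='append', index=False)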