Dot product between two Vector columns of a DataFrame

I'm stuck on the following situation and am looking for guidance, please. (I'm aware of many of the limitations of performing linear algebra operations on Spark; one is that distributed scientific computing with scipy and numpy doesn't hold up at scale, because of serialization and deserialization.) I thought of joining the two columns and computing every combination, and took a look at that approach, but the index of each vector in the vector column is very important to me. I also looked at a UDF for the dot product of DataFrame columns, but it works element-wise per row rather than over all combinations of col1 with col2:
"Looking to compute the dot product between two SparseVector columns, one
from df1 and the other from df2, while preserving the index of each vector."
As you already know, this is strictly a big-data problem (millions to billions of vectors), so collecting everything and using plain numpy and scipy is not a solution for me at this point; that only becomes an option after filtering down to a small dataset.
Here is a sample of my data. Each vector has the same length, but the number of vectors in each DataFrame differs:
> df1:
col1
|(11128,[0,1,2,3,5...|
|(11128,[11,22,98,...|
|(11128,[51,90,218...|
> df2:
col1
|(11128,[21,23,24,...|
|(11128,[0,1,2,3,5...|
|(11128,[0,1,2,3,4...|
|(11128,[28,59,62,...|
...
Adding more info on the vectors part: maybe .withColumn() could be swapped for a .map() call to process all vectors in parallel at once, since each one has an index? I know this is not the best approach, but it is all I can think of right now (this is not about solving the .dot() product as such, but more about using a UDF/pandas_udf to extend math operations to the Vector level). I bring everything into an RDD with an index; is there a way for me to modify the approach so that the index becomes a column name?
[[0, SparseVector(11128, {0: 0.0485, 1: 0.0515, 2: 0.0535, 3: 0.0536, 5: 0.0558, 6: 0.0545, 7: 0.0558, 59: 0.1108, 62: 0.1114, 65: 0.1123, 68: 0.1126, 70: 0.113, 82: 0.121, 120: 0.1414, 149: 0.149, 189: 0.1685, 271: 0.1876, 275: 0.1891, 303: 0.1919, 478: 0.2193, 634: 0.2359, 646: 0.2383, 1017: 0.2626, 1667: 0.2943, 1821: 0.3006, 2069: 0.3095, 2313: 0.3191, 3104: 0.347})],
[1, SparseVector(11128, {11: 0.0621, 22: 0.0776, 98: 0.1167, 210: 0.155, 357: 0.1811, 360: 0.1818, 466: 0.1965, 475: 0.1962, 510: 0.2005, 532: 0.2033, 597: 0.2092, 732: 0.2178, 764: 0.2198, 1274: 0.2489, 1351: 0.2519, 1353: 0.2522, 1451: 0.2562, 1577: 0.2608, 2231: 0.2841, 2643: 0.2969, 3107: 0.3114})]]
So I did try a UDF approach, but so far I can only get it working against a static vector (I convert to an RDD and take each vector individually, which is not the best approach for me; I want to do it all at once and in parallel, mapping while keeping each vector's index in place):
from pyspark.mllib.linalg import *
from pyspark.sql.functions import array, col, lit, udf
from pyspark.sql.types import FloatType

# write our UDF for the .dot product; cast to a plain float so Spark can serialize it
def dot_prod(a, b):
    return float(a.dot(b))

# apply the UDF to the column, dotting col2 against a fixed vector
df = df.withColumn("dotProd", udf(dot_prod, FloatType())(col("col2"), array([lit(v) for v in static_array])))
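One way to get all col1 x col2 combinations, sketched under assumptions (the helper names df1_idx, df2_idx, and dot_udf are mine, not from the question): attach a row index to each DataFrame, cross-join them, and apply a dot-product UDF. The cross join materializes |df1| * |df2| rows, so at billions of vectors this is only viable after heavy filtering or with a smarter blocking scheme.
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# Give every row a stable index so each dot product can be traced back to its vectors.
# Assumes each input DataFrame has a single vector column (hence r[0][0]).
df1_idx = df1.rdd.zipWithIndex().map(lambda r: (r[1], r[0][0])).toDF(["idx1", "v1"])
df2_idx = df2.rdd.zipWithIndex().map(lambda r: (r[1], r[0][0])).toDF(["idx2", "v2"])

# SparseVector.dot returns a numpy scalar; cast to a plain float for the DoubleType UDF.
dot_udf = F.udf(lambda a, b: float(a.dot(b)), DoubleType())

# The cross join yields every (v1, v2) pair, tagged with both row indices.
pairs = df1_idx.crossJoin(df2_idx)
result = pairs.withColumn("dotProd", dot_udf(F.col("v1"), F.col("v2")))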


Is there an easier way to grab a single value from within a Pandas DataFrame with multiindexed columns?

I have a Pandas DataFrame of ML experiment results (from MLFlow). I am trying to access the run_id of a single element in the 0th row and under the "tags" -> "run_id" multi-index in the columns.
The DataFrame is called experiment_results_df. I can access the element with the following command:
experiment_results_df.loc[0,(slice(None),'run_id')].values[0]
I thought I should be able to grab the value itself with a statement like the following:
experiment_results_df.at[0,('tags','run_id')]
# or...
experiment_results_df.loc[0,('tags','run_id')]
But both of those just result in the following rather confusing error (confusing because I'm not setting anything):
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.
It's working now, but I'd prefer to use a simpler syntax. And more than that, I want to understand why the other approach isn't working, and if I can modify it. I find multiindexes very frustrating to work with in Pandas compared to regular indexes, but the additional formatting is nice when I print the DF to the console, or display it in a CSV viewer as I currently have 41 columns (and growing).
I don't understand what the problem is:
df = pd.DataFrame({('T', 'A'): {0: 1, 1: 4},
                   ('T', 'B'): {0: 2, 1: 5},
                   ('T', 'C'): {0: 3, 1: 6}})
print(df)
# Output
   T
   A  B  C
0  1  2  3
1  4  5  6
How to extract 1:
>>> df.loc[0, ('T', 'A')]
1
>>> df.at[0, ('T', 'A')]
1
>>> df.loc[0, (slice(None), 'A')][0]
1

Best way to insert dataframe into stackoverflow question [duplicate]

Having spent a decent amount of time watching both the r and pandas tags on SO, the impression that I get is that pandas questions are less likely to contain reproducible data. This is something that the R community has been pretty good about encouraging, and thanks to guides like this, newcomers are able to get some help on putting together these examples. People who are able to read these guides and come back with reproducible data will often have much better luck getting answers to their questions.
How can we create good reproducible examples for pandas questions? Simple dataframes can be put together, e.g.:
import pandas as pd
df = pd.DataFrame({'user': ['Bob', 'Jane', 'Alice'],
                   'income': [40000, 50000, 42000]})
But many example datasets need more complicated structure, e.g.:
datetime indices or data
Multiple categorical variables (is there an equivalent to R's expand.grid() function, which produces all possible combinations of some given variables?)
MultiIndex or Panel data
For datasets that are hard to mock up using a few lines of code, is there an equivalent to R's dput() that allows you to generate copy-pasteable code to regenerate your datastructure?
Note: Most of the ideas here are pretty generic for Stack Overflow, indeed questions in general. See Minimal, Reproducible Example; see also Short, Self Contained, Correct Example.
Disclaimer: Writing a good question is hard.
The Good:
Do include a small example DataFrame, either as runnable code:
In [1]: df = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'B'])
or make it "copy and pasteable" using pd.read_clipboard(sep='\s\s+'). You can format the text for Stack Overflow by highlighting and using Ctrl+K (or prepend four spaces to each line), or place three backticks (```) above and below your code with your code unindented:
In [2]: df
Out[2]:
   A  B
0  1  2
1  1  3
2  4  6
Test pd.read_clipboard(sep='\s\s+') yourself.
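If you want to test that parse without touching the clipboard, here is a small equivalent sketch (an illustration, not from the original answer): read_clipboard is essentially read_csv pointed at the clipboard contents, so the same separator can be exercised on a pasted string.
import io
import pandas as pd

# The text below is exactly what you would copy from the printed frame above.
pasted = """A  B
0  1  2
1  1  3
2  4  6
"""
# Same regex separator as pd.read_clipboard(sep='\s\s+'); because the header row
# has one field fewer than the data rows, pandas uses the first column as the index.
df = pd.read_csv(io.StringIO(pasted), sep=r'\s\s+', engine='python')
print(df)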
I really do mean small. The vast majority of example DataFrames could be fewer than 6 rows,[citation needed] and I bet I can do it in 5 rows. Can you reproduce the error with df = df.head()? If not, fiddle around to see if you can make up a small DataFrame which exhibits the issue you are facing.
But every rule has an exception, the obvious one being for performance issues (in which case definitely use %timeit and possibly %prun), where you should generate:
df = pd.DataFrame(np.random.randn(100000000, 10))
Consider using np.random.seed so we have the exact same frame. Having said that, "make this code fast for me" is not strictly on topic for the site.
Write out the outcome you desire (similarly to above)
In [3]: iwantthis
Out[3]:
   A  B
0  1  5
1  4  6
Explain where the numbers come from:
The 5 is the sum of the B column for the rows where A is 1.
Do show the code you've tried:
In [4]: df.groupby('A').sum()
Out[4]:
   B
A
1  5
4  6
But say what's incorrect:
The A column is in the index rather than a column.
Do show you've done some research (search the documentation, search Stack Overflow), and give a summary:
The docstring for sum simply states "Compute sum of group values"
The groupby documentation doesn't give any examples for this.
Aside: the answer here is to use df.groupby('A', as_index=False).sum().
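A runnable version of that aside, for reference:
import pandas as pd

df = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'B'])
# as_index=False keeps A as an ordinary column instead of the index
print(df.groupby('A', as_index=False).sum())
#    A  B
# 0  1  5
# 1  4  6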
If it's relevant that you have Timestamp columns, e.g. you're resampling or something, then be explicit and apply pd.to_datetime to them for good measure.
df['date'] = pd.to_datetime(df['date']) # this column ought to be date.
Sometimes this is the issue itself: they were strings.
The Bad:
Don't include a MultiIndex, which we can't copy and paste (see above). This is kind of a grievance with Pandas' default display, but nonetheless annoying:
In [11]: df
Out[11]:
     C
A B
1 2  3
  2  6
The correct way is to include an ordinary DataFrame with a set_index call:
In [12]: df = pd.DataFrame([[1, 2, 3], [1, 2, 6]], columns=['A', 'B', 'C'])
In [13]: df = df.set_index(['A', 'B'])
In [14]: df
Out[14]:
     C
A B
1 2  3
  2  6
Do provide insight into what it is when giving the outcome you want:
   B
A
1  1
5  0
Be specific about how you got the numbers (what are they)... double check they're correct.
If your code throws an error, do include the entire stack trace. This can be edited out later if it's too noisy. Show the line number and the corresponding line of your code which it's raising against.
The Ugly:
Don't link to a CSV file we don't have access to (and ideally don't link to an external source at all).
df = pd.read_csv('my_secret_file.csv') # ideally with lots of parsing options
Most data is proprietary, we get that. Make up similar data and see if you can reproduce the problem (something small).
Don't explain the situation vaguely in words, like you have a DataFrame which is "large", mention some of the column names in passing (be sure not to mention their dtypes). Try and go into lots of detail about something which is completely meaningless without seeing the actual context. Presumably no one is even going to read to the end of this paragraph.
Essays are bad; it's easier with small examples.
Don't include 10+ (100+??) lines of data munging before getting to your actual question.
Please, we see enough of this in our day jobs. We want to help, but not like this....
Cut the intro, and just show the relevant DataFrames (or small versions of them) in the step which is causing you trouble.
How to create sample datasets
This is mainly to expand on AndyHayden's answer by providing examples of how you can create sample dataframes. Pandas and (especially) NumPy give you a variety of tools for this, such that you can generally create a reasonable facsimile of any real dataset with just a few lines of code.
After importing NumPy and Pandas, be sure to provide a random seed if you want folks to be able to exactly reproduce your data and results.
import numpy as np
import pandas as pd
np.random.seed(123)
A kitchen sink example
Here's an example showing a variety of things you can do. All kinds of useful sample dataframes could be created from a subset of this:
df = pd.DataFrame({
    # some ways to create random data
    'a': np.random.randn(6),
    'b': np.random.choice([5, 7, np.nan], 6),
    'c': np.random.choice(['panda', 'python', 'shark'], 6),
    # some ways to create systematic groups for indexing or groupby
    # this is similar to R's expand.grid(), see note 2 below
    'd': np.repeat(range(3), 2),
    'e': np.tile(range(2), 3),
    # a date range and set of random dates
    'f': pd.date_range('1/1/2011', periods=6, freq='D'),
    'g': np.random.choice(pd.date_range('1/1/2011', periods=365,
                                        freq='D'), 6, replace=False)
})
This produces:
          a    b       c  d  e          f          g
0 -1.085631  NaN   panda  0  0 2011-01-01 2011-08-12
1  0.997345    7   shark  0  1 2011-01-02 2011-11-10
2  0.282978    5   panda  1  0 2011-01-03 2011-10-30
3 -1.506295    7  python  1  1 2011-01-04 2011-09-07
4 -0.578600  NaN   shark  2  0 2011-01-05 2011-02-27
5  1.651437    7  python  2  1 2011-01-06 2011-02-03
Some notes:
1. np.repeat and np.tile (columns d and e) are very useful for creating groups and indices in a very regular way. For 2 columns, this can be used to easily duplicate R's expand.grid(), but it is also more flexible in its ability to provide a subset of all permutations. However, for 3 or more columns the syntax quickly becomes unwieldy.
2. For a more direct replacement for R's expand.grid(), see the itertools solution in the pandas cookbook or the np.meshgrid solution shown here; those allow any number of dimensions (a runnable version is sketched after these notes).
3. You can do quite a bit with np.random.choice. For example, in column g, we have a random selection of six dates from 2011. Additionally, by setting replace=False we can assure these dates are unique, which is very handy if we want to use this as an index with unique values.
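For reference, here is the itertools-based expand.grid() stand-in mentioned in note 2 (essentially the pandas cookbook recipe; the column names are just examples):
import itertools
import pandas as pd

def expand_grid(data_dict):
    # Cartesian product of the given columns, like R's expand.grid()
    rows = itertools.product(*data_dict.values())
    return pd.DataFrame.from_records(rows, columns=data_dict.keys())

grid = expand_grid({'height': [60, 70],
                    'weight': [100, 140, 180],
                    'sex': ['Male', 'Female']})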
Fake stock market data
In addition to taking subsets of the above code, you can further combine the techniques to do just about anything. For example, here's a short example that combines np.tile and date_range to create sample ticker data for 4 stocks covering the same dates:
stocks = pd.DataFrame({
    'ticker': np.repeat(['aapl', 'goog', 'yhoo', 'msft'], 25),
    'date': np.tile(pd.date_range('1/1/2011', periods=25, freq='D'), 4),
    'price': (np.random.randn(100).cumsum() + 10)})
Now we have a sample dataset with 100 lines (25 dates per ticker), but we have only used 4 lines to do it, making it easy for everyone else to reproduce without copying and pasting 100 lines of code. You can then display subsets of the data if it helps to explain your question:
>>> stocks.head(5)
        date      price ticker
0 2011-01-01   9.497412   aapl
1 2011-01-02  10.261908   aapl
2 2011-01-03   9.438538   aapl
3 2011-01-04   9.515958   aapl
4 2011-01-05   7.554070   aapl
>>> stocks.groupby('ticker').head(2)
         date      price ticker
0  2011-01-01   9.497412   aapl
1  2011-01-02  10.261908   aapl
25 2011-01-01   8.277772   goog
26 2011-01-02   7.714916   goog
50 2011-01-01   5.613023   yhoo
51 2011-01-02   6.397686   yhoo
75 2011-01-01  11.736584   msft
76 2011-01-02  11.944519   msft
Diary of an Answerer
My best advice for asking questions would be to play on the psychology of the people who answer questions. Being one of those people, I can give insight into why I answer certain questions and why I don't answer others.
Motivations
I'm motivated to answer questions for several reasons:
1. Stackoverflow.com has been a tremendously valuable resource to me. I wanted to give back.
2. In my efforts to give back, I've found this site to be an even more powerful resource than before. Answering questions is a learning experience for me and I like to learn. Read this answer and comment from another vet. This kind of interaction makes me happy.
3. I like points!
4. See #3.
5. I like interesting problems.
All my purest intentions are great and all, but I get that satisfaction if I answer 1 question or 30. What drives my choices for which questions to answer has a huge component of point maximization.
I'll also spend time on interesting problems but that is few and far between and doesn't help an asker who needs a solution to a non-interesting question. Your best bet to get me to answer a question is to serve that question up on a platter ripe for me to answer it with as little effort as possible. If I'm looking at two questions and one has code I can copy paste to create all the variables I need... I'm taking that one! I'll come back to the other one if I have time, maybe.
Main Advice
Make it easy for the people answering questions.
Provide code that creates variables that are needed.
Minimize that code. If my eyes glaze over as I look at the post, I'm on to the next question or getting back to whatever else I'm doing.
Think about what you're asking and be specific. We want to see what you've done because natural languages (English) are inexact and confusing. Code samples of what you've tried help resolve inconsistencies in a natural language description.
PLEASE show what you expect!!! I have to sit down and try things. I almost never know the answer to a question without trying some things out. If I don't see an example of what you're looking for, I might pass on the question because I don't feel like guessing.
Your reputation is more than just your reputation.
I like points (I mentioned that above). But those points aren't really really my reputation. My real reputation is an amalgamation of what others on the site think of me. I strive to be fair and honest and I hope others can see that. What that means for an asker is, we remember the behaviors of askers. If you don't select answers and upvote good answers, I remember. If you behave in ways I don't like or in ways I do like, I remember. This also plays into which questions I'll answer.
Anyway, I can probably go on, but I'll spare all of you who actually read this.
The Challenge

One of the most challenging aspects of responding to SO questions is the time it takes to recreate the problem (including the data). Questions which don't have a clear way to reproduce the data are less likely to be answered. Given that you are taking the time to write a question and you have an issue that you'd like help with, you can easily help yourself by providing data that others can then use to help solve your problem.
The instructions provided by @Andy for writing good Pandas questions are an excellent place to start. For more information, refer to how to ask and how to create Minimal, Complete, and Verifiable examples.
Please clearly state your question upfront. After taking the time to write your question and any sample code, try to read it and provide an 'Executive Summary' for your reader which summarizes the problem and clearly states the question.
Original question:
I have this data...
I want to do this...
I want my result to look like this...
However, when I try to do [this], I get the following problem...
I've tried to find solutions by doing [this] and [that].
How do I fix it?
Depending on the amount of data, sample code and error stacks provided, the reader needs to go a long way before understanding what the problem is. Try restating your question so that the question itself is on top, and then provide the necessary details.
Revised Question:
Question: How can I do [this]?
I've tried to find solutions by doing [this] and [that].
When I've tried to do [this], I get the following problem...
I'd like my final results to look like this...
Here is some minimal code that can reproduce my problem...
And here is how to recreate my sample data:
df = pd.DataFrame({'A': [...], 'B': [...], ...})
PROVIDE SAMPLE DATA IF NEEDED!!!
Sometimes just the head or tail of the DataFrame is all that is needed. You can also use the methods proposed by @JohnE to create larger datasets that can be reproduced by others. Using his example to generate a 100 row DataFrame of stock prices:
stocks = pd.DataFrame({
    'ticker': np.repeat(['aapl', 'goog', 'yhoo', 'msft'], 25),
    'date': np.tile(pd.date_range('1/1/2011', periods=25, freq='D'), 4),
    'price': (np.random.randn(100).cumsum() + 10)})
If this was your actual data, you may just want to include the head and/or tail of the dataframe as follows (be sure to anonymize any sensitive data):
>>> stocks.head(5).to_dict()
{'date': {0: Timestamp('2011-01-01 00:00:00'),
1: Timestamp('2011-01-01 00:00:00'),
2: Timestamp('2011-01-01 00:00:00'),
3: Timestamp('2011-01-01 00:00:00'),
4: Timestamp('2011-01-02 00:00:00')},
'price': {0: 10.284260107718254,
1: 11.930300761831457,
2: 10.93741046217319,
3: 10.884574289565609,
4: 11.78005850418319},
'ticker': {0: 'aapl', 1: 'aapl', 2: 'aapl', 3: 'aapl', 4: 'aapl'}}
>>> pd.concat([stocks.head(), stocks.tail()], ignore_index=True).to_dict()
{'date': {0: Timestamp('2011-01-01 00:00:00'),
1: Timestamp('2011-01-01 00:00:00'),
2: Timestamp('2011-01-01 00:00:00'),
3: Timestamp('2011-01-01 00:00:00'),
4: Timestamp('2011-01-02 00:00:00'),
5: Timestamp('2011-01-24 00:00:00'),
6: Timestamp('2011-01-25 00:00:00'),
7: Timestamp('2011-01-25 00:00:00'),
8: Timestamp('2011-01-25 00:00:00'),
9: Timestamp('2011-01-25 00:00:00')},
'price': {0: 10.284260107718254,
1: 11.930300761831457,
2: 10.93741046217319,
3: 10.884574289565609,
4: 11.78005850418319,
5: 10.017209045035006,
6: 10.57090128181566,
7: 11.442792747870204,
8: 11.592953372130493,
9: 12.864146419530938},
'ticker': {0: 'aapl',
1: 'aapl',
2: 'aapl',
3: 'aapl',
4: 'aapl',
5: 'msft',
6: 'msft',
7: 'msft',
8: 'msft',
9: 'msft'}}
You may also want to provide a description of the DataFrame (using only the relevant columns). This makes it easier for others to check the data types of each column and identify other common errors (e.g. dates as string vs. datetime64 vs. object):
stocks.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 0 to 99
Data columns (total 3 columns):
date 100 non-null datetime64[ns]
price 100 non-null float64
ticker 100 non-null object
dtypes: datetime64[ns](1), float64(1), object(1)
NOTE: If your DataFrame has a MultiIndex, you must first reset the index before calling to_dict, and then recreate the MultiIndex using set_index:
# MultiIndex example. First create a MultiIndex DataFrame.
df = stocks.set_index(['date', 'ticker'])
>>> df
price
date ticker
2011-01-01 aapl 10.284260
aapl 11.930301
aapl 10.937410
aapl 10.884574
2011-01-02 aapl 11.780059
...
# After resetting the index and passing the DataFrame to `to_dict`, make sure to use
# `set_index` to restore the original MultiIndex. This DataFrame can then be restored.
d = df.reset_index().to_dict()
df_new = pd.DataFrame(d).set_index(['date', 'ticker'])
>>> df_new.head()
price
date ticker
2011-01-01 aapl 10.284260
aapl 11.930301
aapl 10.937410
aapl 10.884574
2011-01-02 aapl 11.780059
Here is my version of dput - the standard R tool to produce reproducible reports - for Pandas DataFrames.
It will probably fail for more complex frames, but it seems to do the job in simple cases:
import pandas as pd

def dput(x):
    if isinstance(x, pd.Series):
        return "pd.Series(%s,dtype='%s',index=pd.%s)" % (list(x), x.dtype, x.index)
    if isinstance(x, pd.DataFrame):
        return "pd.DataFrame({" + ", ".join([
            "'%s': %s" % (c, dput(x[c])) for c in x.columns]) + (
            "}, index=pd.%s)" % (x.index))
    raise NotImplementedError("dput", type(x), x)
Now,
df = pd.DataFrame({'a':[1,2,3,4,2,1,3,1]})
assert df.equals(eval(dput(df)))
du = pd.get_dummies(df.a,"foo")
assert du.equals(eval(dput(du)))
di = df
di.index = list('abcdefgh')
assert di.equals(eval(dput(di)))
Note that this produces a much more verbose output than DataFrame.to_dict, e.g.,
pd.DataFrame({
    'foo_1': pd.Series([1, 0, 0, 0, 0, 1, 0, 1], dtype='uint8', index=pd.RangeIndex(start=0, stop=8, step=1)),
    'foo_2': pd.Series([0, 1, 0, 0, 1, 0, 0, 0], dtype='uint8', index=pd.RangeIndex(start=0, stop=8, step=1)),
    'foo_3': pd.Series([0, 0, 1, 0, 0, 0, 1, 0], dtype='uint8', index=pd.RangeIndex(start=0, stop=8, step=1)),
    'foo_4': pd.Series([0, 0, 0, 1, 0, 0, 0, 0], dtype='uint8', index=pd.RangeIndex(start=0, stop=8, step=1))},
    index=pd.RangeIndex(start=0, stop=8, step=1))
vs
{'foo_1': {0: 1, 1: 0, 2: 0, 3: 0, 4: 0, 5: 1, 6: 0, 7: 1},
'foo_2': {0: 0, 1: 1, 2: 0, 3: 0, 4: 1, 5: 0, 6: 0, 7: 0},
'foo_3': {0: 0, 1: 0, 2: 1, 3: 0, 4: 0, 5: 0, 6: 1, 7: 0},
'foo_4': {0: 0, 1: 0, 2: 0, 3: 1, 4: 0, 5: 0, 6: 0, 7: 0}}
for du above, but it preserves column types.
E.g., in the above test case,
du.equals(pd.DataFrame(du.to_dict()))
==> False
because du.dtypes is uint8 and pd.DataFrame(du.to_dict()).dtypes is int64.
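If the compactness of to_dict is still preferred, one way to keep the dtypes is to re-apply them after reconstruction; a small sketch, not part of the original comparison:
# Rebuild from to_dict, then restore the original dtypes column by column.
restored = pd.DataFrame(du.to_dict()).astype(du.dtypes.to_dict())
assert restored.equals(du)  # True again, since the uint8 dtypes are back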

Create pandas MultiIndex DataFrame from multi dimensional np arrays

I am trying to insert 72 matrices with dimensions (24, 12) from a np array into a preexisting MultiIndex DataFrame, indexed according to a np array of dimension (72, 2). I don't care about indexing the contents of the (24, 12) matrices; I just need to index the 72 matrices as objects, for rearrangement purposes. It is like a map for reordering according to some conditions, before unstacking the columns.
What I have tried so far is:
cosphi.shape
(72, 2)
MFPAD_RCR.shape
(72, 24, 12)
df = pd.MultiIndex.from_arrays(cosphi.T, names=("costheta","phi"))
I successfully create a two-level index with 72 rows from the 2 columns. Then I try to add the 72 matrices:
df1 = pd.DataFrame({'MFPAD':MFPAD_RCR},index=df)
or possibly
df1 = pd.DataFrame({'MFPAD':MFPAD_RCR.astype(object)},index=df)
I get the error
Exception: Data must be 1-dimensional.
Any idea?
After a bit of careful research, I found that my question has already been answered here (the right answer) and here (a solution using a deprecated function).
For my specific question, the answer is something like:
data = MFPAD_RCR.reshape(72, 288).T
df = pd.DataFrame(
    data=data,
    index=pd.MultiIndex.from_product([phiM, cosM], names=["phi", "cos(theta)"]),
    columns=['item {}'.format(i) for i in range(72)]
)
Note that the 3D np array has to be reshaped with its second dimension equal to the product of the major and the minor indexes.
df1 = df.T
I want to be able to sort my items (a.k.a. matrices) according to extra indexes coming from cosphi:
cosn = np.array([col[0] for col in cosphi])  # list
phin = np.array([col[1] for col in cosphi])  # list
Note: the length of the new indexes has to be the same as the number of items (matrices), i.e. 72.
# wrap the arrays in pd.Index so the new index levels get names
df1.set_index(pd.Index(cosn, name="cos_ph"), append=True, inplace=True)
df1.set_index(pd.Index(phin, name="phi_ph"), append=True, inplace=True)
And after this one can sort
df1.sort_index(level=1, inplace=True, kind="mergesort")
and reshape
outarray=(df1.T).values.reshape(24,12,72).transpose(2, 0, 1)
Any suggestion to make the code faster / prettier is more than welcome!
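For readers who want to run the steps above end to end, here is a self-contained toy version; the random data and the phiM / cosM values are made-up stand-ins, since the question does not define them:
import numpy as np
import pandas as pd

MFPAD_RCR = np.random.rand(72, 24, 12)              # 72 matrices of shape (24, 12)
phiM = np.linspace(0.0, 360.0, 24, endpoint=False)  # major index: 24 values (assumed)
cosM = np.linspace(-1.0, 1.0, 12)                   # minor index: 12 values (assumed)

df = pd.DataFrame(
    data=MFPAD_RCR.reshape(72, 24 * 12).T,          # 288 rows, one column per matrix
    index=pd.MultiIndex.from_product([phiM, cosM], names=["phi", "cos(theta)"]),
    columns=['item {}'.format(i) for i in range(72)]
)
df1 = df.T                                          # one row per matrix, ready for sorting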

The fastest way to get filtered data checking substring value within ndarray

I have a big array of data:
>>> len(b)
6636849
>>> print(b)
[['60D19E9E-4E2C-11E2-AA9A-52540027E502' '100015361']
['60D19EB6-4E2C-11E2-AA9A-52540027E502' '100015385']
['60D19ECE-4E2C-11E2-AA9A-52540027E502' '100015409']
...,
['8CC90633-447E-11E6-B010-005056A76B49' '106636785']
['F8C74244-447E-11E6-B010-005056A76B49' '106636809']
['F8C7425C-447E-11E6-B010-005056A76B49' '106636833']]
I need to get the filtered dataset, i.e., everything containing (or starting with) '106' in the string. Something like the following code, with a substring operation instead of a math operation:
>>> len(b[b[:,1] > '10660600'])
30850
I don't think numpy is well suited for this type of operation. You can do it simply using basic python operations. Here it is with some sample data a:
import random  # for the test data

a = []
for i in range(10000):
    a.append(["".join(random.sample('abcdefg', 3)), "".join(random.sample('01234567890', 8))])

answer = [i for i in a if i[1].find('106') != -1]
Keep in mind that startswith is going to be a lot faster than find, because find has to look for matching substrings in all positions.
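For completeness, the same filter written with startswith (it matches only at the beginning of the string, which the question allows):
# prefix-only variant of the list comprehension above
answer = [i for i in a if i[1].startswith('106')]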
It's not too clear why you need do this with such a large list/array in the first place, and there might be a better solution when it comes to not including these values in the list in the first place.
Here's a simple pandas solution
import pandas as pd
df = pd.DataFrame(b, columns=['1st String', '2nd String'])
df_filtered = df[df['2nd String'].str.contains('106')]
This gives you
In [29]: df_filtered
Out[29]:
1st String 2nd String
3 8CC90633-447E-11E6-B010-005056A76B49 106636785
4 F8C74244-447E-11E6-B010-005056A76B49 106636809
5 F8C7425C-447E-11E6-B010-005056A76B49 106636833
Update: Timing Results
Using Benjamin's list a as the test sample:
In [20]: %timeit [i for i in a if i[1].find('106') != -1]
100 loops, best of 3: 2.2 ms per loop
In [21]: %timeit df[df['2nd String'].str.contains('106')]
100 loops, best of 3: 5.94 ms per loop
So it looks like Benjamin's answer is actually about 3x faster. This surprises me since I was under the impression that the operation in pandas is vectorized. Moreover, the speed ratio does not change when a is 100 times longer.
Look at the functions in the np.char submodule:
data = [['60D19E9E-4E2C-11E2-AA9A-52540027E502', '100015361'],
        ['60D19EB6-4E2C-11E2-AA9A-52540027E502', '100015385'],
        ['60D19ECE-4E2C-11E2-AA9A-52540027E502', '100015409'],
        ['8CC90633-447E-11E6-B010-005056A76B49', '106636785'],
        ['F8C74244-447E-11E6-B010-005056A76B49', '106636809'],
        ['F8C7425C-447E-11E6-B010-005056A76B49', '106636833']]
data = np.array([r[1] for r in data], str)  # np.str was removed in modern NumPy; plain str works
idx = np.char.startswith(data, '106')
print(idx)
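idx is a boolean mask, so selecting the matching rows is plain boolean indexing (a small addition, not part of the answer above):
filtered = data[idx]
print(filtered)  # ['106636785' '106636809' '106636833']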

Looking for built-in, invertible, list-of-list-accepting constructor/deconstructor pair for pandas dataframes

Are there built-in ways to construct/deconstruct a dataframe from/to a Python list-of-Python-lists?
As far as the constructor (let's call it make_df for now) that I'm looking for goes, I want to be able to write the initialization of a dataframe from literal values, including columns of arbitrary types, in an easily-readable form, like this:
df = make_df([[9.75, 1],
              [6.375, 2],
              [9., 3],
              [0.25, 1],
              [1.875, 2],
              [3.75, 3],
              [8.625, 1]],
             ['d', 'i'])
For the deconstructor, I want to essentially recover from a dataframe df the arguments one would need to pass to such make_df to re-create df.
AFAIK,
officially at least, the pandas.DataFrame constructor accepts only a numpy ndarray, a dict, or another DataFrame (and not a simple Python list-of-lists) as its first argument;
the pandas.DataFrame.values property does not preserve the original data types.
I can roll my own functions to do this (e.g., see below), but I would prefer to stick to built-in methods, if available. (The Pandas API is pretty big, and some of its names not what I would expect, so it is quite possible that I have missed one or both of these functions.)
FWIW, below is a hand-rolled version of what I described above, minimally tested. (I doubt that it would be able to handle every possible corner-case.)
import pandas as pd
import collections as co
import pandas.util.testing as pdt

def make_df(values, columns):
    return pd.DataFrame(co.OrderedDict([(columns[i],
                                         [row[i] for row in values])
                                        for i in range(len(columns))]))

def unmake_df(dataframe):
    columns = list(dataframe.columns)
    return ([[dataframe[c][i] for c in columns] for i in dataframe.index],
            columns)
values = [[9.75, 1],
          [6.375, 2],
          [9., 3],
          [0.25, 1],
          [1.875, 2],
          [3.75, 3],
          [8.625, 1]]
columns = ['d', 'i']
df = make_df(values, columns)
Here's what the output of the call to make_df above produced:
>>> df
       d  i
0  9.750  1
1  6.375  2
2  9.000  3
3  0.250  1
4  1.875  2
5  3.750  3
6  8.625  1
A simple check of the round-trip1:
>>> df == make_df(*unmake_df(df))
True
>>> (values, columns) == unmake_df(make_df(*(values, columns)))
True
BTW, this is an example of the loss of the original values' types:
>>> df.values
array([[ 9.75 ,  1.   ],
       [ 6.375,  2.   ],
       [ 9.   ,  3.   ],
       [ 0.25 ,  1.   ],
       [ 1.875,  2.   ],
       [ 3.75 ,  3.   ],
       [ 8.625,  1.   ]])
Notice how the values in the second column are no longer integers, as they were originally.
Hence,
>>> df == make_df(df.values, columns)
False
1 In order to be able to use == to test for equality between dataframes above, I resorted to a little monkey-patching:
def pd_DataFrame___eq__(self, other):
    try:
        pdt.assert_frame_equal(self, other,
                               check_index_type=True,
                               check_column_type=True,
                               check_frame_type=True)
    except:
        return False
    else:
        return True

pd.DataFrame.__eq__ = pd_DataFrame___eq__
Without this hack, expressions of the form dataframe_0 == dataframe_1 would have evaluated to dataframe objects, not simple boolean values.
I'm not sure what documentation you are reading, because the link you give explicitly says that the default constructor accepts other list-like objects (one of which is a list of lists).
In [6]: pandas.DataFrame([['a', 1], ['b', 2]])
Out[6]:
   0  1
0  a  1
1  b  2

[2 rows x 2 columns]
In [7]: t = pandas.DataFrame([['a', 1], ['b', 2]])
In [8]: t.to_dict()
Out[8]: {0: {0: 'a', 1: 'b'}, 1: {0: 1, 1: 2}}
Notice that I use to_dict at the end, rather than trying to get back the original list of lists. This is because it is an ill-posed problem to get the list arguments back (unless you make an overkill decorator or something to actually store the ordered arguments that the constructor was called with).
The reason is that a pandas DataFrame, by default, is not an ordered data structure, at least in the column dimension. You could have permuted the order of the column data at construction time, and you would get the "same" DataFrame.
Since there can be many differing notions of equality between two DataFrame (e.g. same columns even including type, or just same named columns, or some columns and in same order, or just same columns in mixed order, etc.) -- pandas defaults to trying to be the least specific about it (Python's principle of least astonishment).
So it would not be good design for the default or built-in constructors to choose an overly specific idea of equality for the purposes of returning the DataFrame back down to its arguments.
For that reason, using to_dict is better since the resulting keys will encode the column information, and you can choose to check for column types or ordering however you want to for your own application. You can even discard the keys by iterating the dict and simply pumping the contents into a list of lists if you really want to.
In other words, because order might not matter among the columns, the "inverse" of the list-of-list constructor maps backwards into a bigger set, namely all the permutations of the same column data. So the inverse you're looking for is not well-defined without assuming more structure -- and casual users of a DataFrame might not want or need to make those extra assumptions to get the invertibility.
As mentioned elsewhere, you should use DataFrame.equals to do equality checking among DataFrames. The function has many options that allow you specify the specific kind of equality testing that makes sense for your application, while leaving the default version as a reasonably generic set of options.
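To make that concrete, here is a short round-trip sketch in the spirit of the answer above (the variable names are just examples):
import pandas as pd

t = pd.DataFrame([['a', 1], ['b', 2]])
t2 = pd.DataFrame(t.to_dict())

# DataFrame.equals checks shape, values, and dtypes.
print(t.equals(t2))  # True for this frame, though dtypes can differ after to_dict in general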