numpy recarray indexing based on intersection with external array - indexing

I'm trying to subset the records in a numpy.recarray based on the common values between one of the recarray's fields and an external array. For example,
import numpy as np

a = np.array([(10, 'Bob', 145.7), (20, 'Sue', 112.3), (10, 'Jim', 130.5)],
             dtype=[('id', 'i4'), ('name', 'S10'), ('weight', 'f8')])
a = a.view(np.recarray)
b = np.array([10, 30])
I want to take the intersection of a.id and b to determine what records to pull from the recarray, so that I get back:
(10, 'Bob', 145.7)
(10, 'Jim', 130.5)
Naively, I tried:
common = np.intersect1d(a.id, b)
subset = a[common]
but of course that doesn't work, because there is no a[10]. I also tried to do this by creating a reverse dict between the id field and the index and subsetting from there, e.g.
id_x_index = {}
ids = a.id
indexes = np.arange(a.size)
for (id, index) in zip(ids, indexes):
    id_x_index[id] = index
subset_indexes = np.sort([id_x_index[x] for x in ids if x in b])
print a[subset_indexes]
but then I'm overwriting dict values in id_x_index whenever a.id has duplicates, so in this case I get
(10, 'Jim', 130.5)
(10, 'Jim', 130.5)
I know I'm overlooking some simple way to get the appropriate indices into the recarray. Thanks for the help.

The most concise way to do this in Numpy is
subset = a[np.in1d(a.id, b)]
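For reference, a minimal runnable version of this approach, using the arrays from the question (on newer NumPy, np.isin is the recommended spelling of np.in1d):

import numpy as np

a = np.array([(10, 'Bob', 145.7), (20, 'Sue', 112.3), (10, 'Jim', 130.5)],
             dtype=[('id', 'i4'), ('name', 'S10'), ('weight', 'f8')]).view(np.recarray)
b = np.array([10, 30])

mask = np.in1d(a.id, b)   # boolean mask: True where the record's id also appears in b
subset = a[mask]
print(subset)             # the two records with id 10: Bob and Jim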

And for those who have an older version of numpy, you can also do it this way:
subset = a[np.array([i in b for i in a.id])]

Related

SQLAlchemy update iteration (performance)

a = [1, 2]
ids = [id1, id2]
all_rows = Table.query.filter(Table.id.in_(ids)).update({Table.price: ...})
I am trying to speed up updating my records in the db. First I pick the records I want to update by the list of ids, and then I would like to update each record's price from list a, in the sense that both lists are aligned: 1 should go to id1 and 2 to id2.
Is it possible to do that, and if so, how?
I know there is an option to do it one by one in a for loop, but... that is not really fast when running over a larger number of records.
Thank you for the help!
You can do this kind of update by using a case expression for the update values.
For example, if you wanted to set the values of col2 depending on the values of col1, for a particular group of ids, you could do this:
ids = [1, 2, 3]
targets = ['a', 'b', 'c'] # col1
values = [10, 20, 30] # col2
case = db.case(*((Table.col1 == t, v) for t, v in zip(targets, values)))
Table.query.filter(Table.id.in_(ids)).update({'col2': case})
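Applied to the question's setup (updating price per id), a minimal sketch might look like the following; it assumes SQLAlchemy 1.4+ style case(*whens), and Table.id / Table.price are the model attributes implied by the question:

ids = [1, 2]        # which rows to touch
prices = [10, 20]   # new price for each id, in the same order

price_case = db.case(*((Table.id == i, p) for i, p in zip(ids, prices)))
Table.query.filter(Table.id.in_(ids)).update(
    {Table.price: price_case}, synchronize_session=False)
db.session.commit()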

Multiple calculations over spark dataframe in one pass

I need to make several computations over a data frame: min and max over column A, and the distinct values of column B. What is the most efficient way to do that? Is it possible to do it in one pass?
val df = sc.parallelize(Seq(
  (1, "John"),
  (2, "John"),
  (3, "Dave")
)).toDF("A", "B")
If by one pass you mean inside a single statement, you can do it as below (PySpark):
from pyspark.sql import functions as F

sparkDF = sc.parallelize([(1, "John"), (2, "John"), (3, "Dave")]).toDF(["A", "B"])

sparkDF.select(F.max(F.col('A')), F.min(F.col('A')), F.countDistinct(F.col('B'))).show()
Additionally, you can provide sample data and the expected output you are looking for, to better gauge the question.
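If "distinct values over column B" means the values themselves rather than their count, a minimal PySpark sketch using agg and collect_set (still a single pass over the data) could be:

from pyspark.sql import functions as F

result = sparkDF.agg(
    F.min('A').alias('min_A'),
    F.max('A').alias('max_A'),
    F.collect_set('B').alias('distinct_B')   # the distinct values of B
).collect()[0]

print(result['min_A'], result['max_A'], result['distinct_B'])
# 1 3 ['John', 'Dave']   (order within the set is not guaranteed)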

Sorting the values in a dataframe where the column contains a list of tuples

The data frame column whose values need to be sorted contains lists of tuples, for example: {1100: [(a, 4), (b, 6)], ...}, and there are n such customers. The output needs to be 6,b 4,a; that is, the query is to sort based on the recommended games' rank.
I have used the below code
sort_rank = sorted(recommend_dict.items(), key=lambda x: x[1], reverse=True)
recommend_games_df= pd.DataFrame.from_dict(sort_rank)
but this is failing. I have also tried:
recommend_games_df= pd.DataFrame.from_dict(recommend_dict )
recommend_games_df=recommend_games_df.sort_values(recommend_games_df.columns[1])
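A minimal sketch of one way to get this kind of ordering, assuming recommend_dict maps a customer id to a list of (game, rank) tuples (that structure is an assumption based on the question's example):

import pandas as pd

# hypothetical data shaped like the question's example
recommend_dict = {1100: [('a', 4), ('b', 6)],
                  1101: [('c', 2), ('d', 9)]}

# sort each customer's list of (game, rank) tuples by rank, descending
sorted_dict = {cust: sorted(games, key=lambda t: t[1], reverse=True)
               for cust, games in recommend_dict.items()}

# flatten into a data frame, one row per (customer, game, rank)
recommend_games_df = pd.DataFrame(
    [(cust, game, rank)
     for cust, games in sorted_dict.items()
     for game, rank in games],
    columns=['customer', 'game', 'rank'])
print(recommend_games_df)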

Get row and column index of value in Pandas df

Currently I'm trying to automate scheduling.
I'll get requirement as a .csv file.
However, the number of day changes by month, and personnel also changes occasionally, which means the number of columns and rows is not fixed.
So, I want to put the value '*' as a marker meaning the end of a table. Unfortunately, I can't find a function or method that takes a value as a parameter and returns a (list of) index (name of column and row, or index numbers).
Is there any way that I can find a (or a list of) index of a certain value (like a coordinate)?
for example, when the data frame is like below,
  | column_1 | column_2
--+----------+----------
1 | 'a'      | 'b'
--+----------+----------
2 | 'c'      | 'd'
how can I get 'column_2' and '2' by the value, 'd'? It's something similar to the opposite of .loc or .iloc.
Interesting question. I also used a list comprehension, but with np.where. Still I'd be surprised if there isn't a less clunky way.
df = pd.DataFrame({'column_1':['a','c'], 'column_2':['b','d']}, index=[1,2])
[(i, np.where(df[i] == 'd')[0].tolist()) for i in list(df) if len(np.where(df[i] == 'd')[0]) > 0]
> [('column_2', [1])]
Note that it returns the numeric (0-based) index, not the custom (1-based) index you have. If you have a fixed offset you could just add a +1 or whatever to the output.
If I understand what you are looking for: find the (index value, column location) for a value in a dataframe. You can use a list comprehension in a loop. It probably won't be the fastest if your dataframe is large.
# assume this dataframe
df = pd.DataFrame({'col':['abc', 'def','wert','abc'], 'col2':['asdf', 'abc', 'sdfg', 'def']})
# list comprehension
[(df[col][df[col].eq('abc')].index[i], df.columns.get_loc(col)) for col in df.columns for i in range(len(df[col][df[col].eq('abc')].index))]
# [(0, 0), (3, 0), (1, 1)]
Change df.columns.get_loc(col) to col if you want the column name rather than its location:
[(df[col][df[col].eq('abc')].index[i], col) for col in df.columns for i in range(len(df[col][df[col].eq('abc')].index))]
# [(0, 'col'), (3, 'col'), (1, 'col2')]
I might be misunderstanding something, but np.where should get the job done.
df_tmp = pd.DataFrame({'column_1':['a','c'], 'column_2':['b','d']}, index=[1,2])
solution = np.where(df_tmp == 'd')
solution should contain row and column index.
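If you want labels rather than positions, the np.where output can be mapped back through the frame's index and columns; a small sketch, continuing from the df_tmp above:

rows, cols = np.where(df_tmp == 'd')            # positional row/column indices
labels = [(df_tmp.index[r], df_tmp.columns[c])  # translate positions into labels
          for r, c in zip(rows, cols)]
print(labels)
# [(2, 'column_2')]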
Hope this helps!
To search single value:
df = pd.DataFrame({'column_1':['a','c'], 'column_2':['b','d']}, index=[1,2])
df[df == 'd'].stack().index.tolist()
[Out]:
[(2, 'column_2')]
To search a list of values:
df = pd.DataFrame({'column_1':['a','c'], 'column_2':['b','d']}, index=[1,2])
df[df.isin(['a', 'd'])].stack().index.tolist()
[Out]:
[(1, 'column_1'), (2, 'column_2')]
Also works when value occurs at multiple places:
df = pd.DataFrame({'column_1':['test','test'], 'column_2':['test','test']}, index=[1,2])
df[df == 'test'].stack().index.tolist()
[Out]:
[(1, 'column_1'), (1, 'column_2'), (2, 'column_1'), (2, 'column_2')]
Explanation
Select cells where the condition matches:
df[df.isin(['a', 'b', 'd'])]
[Out]:
  column_1 column_2
1        a        b
2      NaN        d
stack() reshapes the columns to index:
df[df.isin(['a', 'b', 'd'])].stack()
[Out]:
1  column_1    a
   column_2    b
2  column_2    d
The stacked result now has a MultiIndex:
df[df.isin(['a', 'b', 'd'])].stack().index
[Out]:
MultiIndex([(1, 'column_1'),
            (1, 'column_2'),
            (2, 'column_2')],
           )
Convert this multi-index to list:
df[df.isin(['a', 'b', 'd'])].stack().index.tolist()
[Out]:
[(1, 'column_1'), (1, 'column_2'), (2, 'column_2')]
Note
If a list of values are searched, the returned result does not preserve the order of input values:
df[df.isin(['d', 'b', 'a'])].stack().index.tolist()
[Out]:
[(1, 'column_1'), (1, 'column_2'), (2, 'column_2')]
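If the order of the searched values matters, one option (a sketch, not part of the original answer) is to re-sort the matches by the value found in each cell:

import pandas as pd

df = pd.DataFrame({'column_1': ['a', 'c'], 'column_2': ['b', 'd']}, index=[1, 2])

search = ['d', 'b', 'a']
order = {v: i for i, v in enumerate(search)}   # rank of each searched value
hits = df[df.isin(search)].stack()             # Series: the value at each matching cell
sorted(hits.index.tolist(), key=lambda idx: order[hits[idx]])
# [(2, 'column_2'), (1, 'column_2'), (1, 'column_1')]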
Had a similar need and this worked perfectly
# deals with case sensitivity concern
df = raw_df.applymap(lambda s: s.upper() if isinstance(s, str) else s)
# get the row index
value_row_location = df.isin(['VALUE']).any(axis=1).tolist().index(True)
# get the column index
value_column_location = df.isin(['VALUE']).any(axis=0).tolist().index(True)
# Do whatever you want to do e.g Replace the value above that cell
df.iloc[value_row_location - 1, value_column_location] = 'VALUE COLUMN'

Applying a user defined function to a PySpark dataframe and return a dictionary

Suppose I have a pandas dataframe called df
id  value1  value2
 1       2       1
 2       2       1
 3       4       5
In plain Python, I wrote a function to process this dataframe and return a dictionary:
d = dict()
for row in df.itertuples():
    x = do_something(row)
    d[x[0]] = x[1:]
I am trying to reimplement this function using Spark.
d = dict()  # define a global var

def do_something(id, value1, value2):
    # business logic
    d[x0] = [x1, x2, x3]
    return 0

udf_do = udf(do_something)
then:
df_spark.select(udf_do('id', 'value1', 'value2'))
My idea is, by calling df_spark.select, the function do_something will be called over the dataframe, and it will update the global variable d. I don't really care about the return value of udf_do so I return 0.
However, my solution does not work.
Could you suggest me some ways to iterate through (I know it is not a Spark-way) or somehow to process a Spark dataframe and update an external dictionary?
Note that the dataframe is quite large. I tried to convert it to pandas by calling toPandas(), but I ran into an OOM problem.
A UDF cannot update any global state. But you can do some business logic inside the UDF and then use toLocalIterator to get all the data to the driver in a memory-efficient way (partition by partition). For example:
df = spark.createDataFrame([(10, 'b'), (20, 'b'), (30, 'c'),
                            (40, 'c'), (50, 'c'), (60, 'a')], ['col1', 'col2'])
df = df.withColumn('udf_result', ......)  # assign the result; DataFrames are immutable
df.cache()
df.count()  # force cache fill
for row in df.toLocalIterator():
    print(row)
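To tie this back to the question's goal of filling an external dictionary, a minimal sketch could build d on the driver while iterating; it assumes do_something can take a Row the same way it took the namedtuples from itertuples() in the pandas version:

d = {}
for row in df_spark.toLocalIterator():   # streams partition by partition, avoiding the toPandas() OOM
    x = do_something(row)                # same per-row business logic as in the pandas version
    d[x[0]] = x[1:]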