a = [1, 2]
ids = [id1, id2]
all_rows = Table.query.filter(Table.id.in_(ids)).update({Table.price: ...})
I am trying to speed up updating my records in the DB. First I select the records I want to update by a list of ids, and then I would like to update each record's price from list a, in the sense that both lists are aligned: 1 should go with id1 and 2 with id2.
Is it possible to do that, and if so, how?
I know there is the option to do it one by one in a for loop, but that is not really fast when running over a larger number of records.
Thank you for your help!
You can do this kind of update by using a case expression for the update values.
For example, if you wanted to set the values of col2 depending on the values of col1, for a particular group of ids, you could do this:
ids = [1, 2, 3]
targets = ['a', 'b', 'c'] # col1
values = [10, 20, 30] # col2
case = db.case(*((Table.col1 == t, v) for t, v in zip(targets, values)))
Table.query.filter(Table.id.in_(ids)).update({'col2': case}, synchronize_session=False)
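To see the kind of SQL such a single-statement CASE update produces, here is a minimal runnable sketch using only the stdlib sqlite3 module; the table name t and the id/price columns are made up to match the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, price INTEGER)")
conn.executemany("INSERT INTO t (id, price) VALUES (?, ?)",
                 [(1, 0), (2, 0), (3, 0)])

ids = [1, 2]
prices = [10, 20]

# Build one CASE expression so all rows are updated in a single statement,
# mirroring what the SQLAlchemy case()/update() combination emits.
whens = " ".join("WHEN id = ? THEN ?" for _ in ids)
params = [p for pair in zip(ids, prices) for p in pair] + ids
conn.execute(
    f"UPDATE t SET price = CASE {whens} END "
    f"WHERE id IN ({','.join('?' * len(ids))})",
    params,
)

print(conn.execute("SELECT id, price FROM t ORDER BY id").fetchall())
# -> [(1, 10), (2, 20), (3, 0)]
```

One round trip to the database, regardless of how many ids are in the list.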
I need to make several computations over a data frame: min and max over column A, and the number of distinct values in column B. What is the most efficient way to do that? Is it possible to do it in one pass?
val df = sc.parallelize(Seq(
  (1, "John"),
  (2, "John"),
  (3, "Dave")
)).toDF("A", "B")
If by one pass you mean inside a single statement, you can do it as below:
from pyspark.sql import functions as F

sparkDF = spark.createDataFrame(
    [(1, "John"), (2, "John"), (3, "Dave")],
    ["A", "B"],
)
sparkDF.select(F.max(F.col('A')), F.min(F.col('A')), F.countDistinct(F.col('B'))).show()
Additionally, you can provide sample data and the expected output you are looking for, to better convey the question.
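For intuition, "one pass" literally means computing all three aggregates during a single traversal of the data, which is what Spark does behind that statement. A plain-Python sketch, with the data copied from the question:

```python
# Compute min(A), max(A) and countDistinct(B) in one traversal.
rows = [(1, "John"), (2, "John"), (3, "Dave")]

min_a, max_a = float("inf"), float("-inf")
distinct_b = set()
for a, b in rows:  # a single pass over the data
    min_a = min(min_a, a)
    max_a = max(max_a, a)
    distinct_b.add(b)

print(min_a, max_a, len(distinct_b))  # -> 1 3 2
```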
I have a data frame whose values need to be sorted. One column contains a list of tuples, for example {1100: [('a', 4), ('b', 6)]}, and there are n such customers. The expected output is (6, 'b'), (4, 'a'). The task is to sort based on the recommended games' rank.
I have used the code below:
sort_rank = sorted(recommend_dict.items(), key=lambda x: x[1], reverse=True)
recommend_games_df = pd.DataFrame.from_dict(sort_rank)
but this is failing.
recommend_games_df = pd.DataFrame.from_dict(recommend_dict)
recommend_games_df = recommend_games_df.sort_values(recommend_games_df.columns[1], ascending=False)
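A minimal, self-contained sketch of the descending sort; the contents of recommend_dict here are made up for illustration:

```python
# Hypothetical data shaped like the question's recommend_dict.
recommend_dict = {'a': 4, 'b': 6, 'c': 1}

# Sort the (game, rank) pairs by rank, highest first.
sort_rank = sorted(recommend_dict.items(), key=lambda x: x[1], reverse=True)
print(sort_rank)  # -> [('b', 6), ('a', 4), ('c', 1)]

# A list of tuples can be fed straight to the DataFrame constructor
# (rather than from_dict), e.g. pd.DataFrame(sort_rank, columns=['game', 'rank']).
```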
Currently I'm trying to automate scheduling.
I'll get requirement as a .csv file.
However, the number of days changes by month, and personnel also change occasionally, which means the number of columns and rows is not fixed.
So I want to put the value '*' as a marker meaning the end of the table. Unfortunately, I can't find a function or method that takes a value as a parameter and returns a (list of) index(es), i.e. the names of the column and row, or the index numbers.
Is there any way to find the index (or a list of indices) of a certain value, like a coordinate?
for example, when the data frame is like below,
  | column_1 | column_2
--+----------+----------
1 | 'a'      | 'b'
2 | 'c'      | 'd'
how can I get 'column_2' and 2 from the value 'd'? It's something like the opposite of .loc or .iloc.
Interesting question. I also used a list comprehension, but with np.where. Still I'd be surprised if there isn't a less clunky way.
df = pd.DataFrame({'column_1':['a','c'], 'column_2':['b','d']}, index=[1,2])
[(i, np.where(df[i] == 'd')[0].tolist()) for i in list(df) if len(np.where(df[i] == 'd')[0]) > 0]
> [('column_2', [1])]
Note that it returns the numeric (0-based) position, not the custom (1-based) index you have (df.index[1] recovers the label). If you have a fixed offset you could just add a +1 or whatever to the output.
If I understand what you are looking for: find the (index value, column location) for a value in a dataframe. You can use a list comprehension in a loop. It probably won't be the fastest if your dataframe is large.
# assume this dataframe
df = pd.DataFrame({'col':['abc', 'def','wert','abc'], 'col2':['asdf', 'abc', 'sdfg', 'def']})
# list comprehension
[(df[col][df[col].eq('abc')].index[i], df.columns.get_loc(col)) for col in df.columns for i in range(len(df[col][df[col].eq('abc')].index))]
# [(0, 0), (3, 0), (1, 1)]
Replace df.columns.get_loc(col) with col if you want the column name rather than its location:
[(df[col][df[col].eq('abc')].index[i], col) for col in df.columns for i in range(len(df[col][df[col].eq('abc')].index))]
# [(0, 'col'), (3, 'col'), (1, 'col2')]
I might be misunderstanding something, but np.where should get the job done.
df_tmp = pd.DataFrame({'column_1':['a','c'], 'column_2':['b','d']}, index=[1,2])
solution = np.where(df_tmp == 'd')
solution contains the (positional) row and column indices; df_tmp.index[solution[0]] and df_tmp.columns[solution[1]] convert them back to labels.
Hope this helps!
To search single value:
df = pd.DataFrame({'column_1':['a','c'], 'column_2':['b','d']}, index=[1,2])
df[df == 'd'].stack().index.tolist()
[Out]:
[(2, 'column_2')]
To search a list of values:
df = pd.DataFrame({'column_1':['a','c'], 'column_2':['b','d']}, index=[1,2])
df[df.isin(['a', 'd'])].stack().index.tolist()
[Out]:
[(1, 'column_1'), (2, 'column_2')]
Also works when value occurs at multiple places:
df = pd.DataFrame({'column_1':['test','test'], 'column_2':['test','test']}, index=[1,2])
df[df == 'test'].stack().index.tolist()
[Out]:
[(1, 'column_1'), (1, 'column_2'), (2, 'column_1'), (2, 'column_2')]
Explanation
Select cells where the condition matches:
df[df.isin(['a', 'b', 'd'])]
[Out]:
   column_1 column_2
1         a        b
2       NaN        d
stack() reshapes the columns to index:
df[df.isin(['a', 'b', 'd'])].stack()
[Out]:
1  column_1    a
   column_2    b
2  column_2    d
The result's index is now a MultiIndex:
df[df.isin(['a', 'b', 'd'])].stack().index
[Out]:
MultiIndex([(1, 'column_1'),
            (1, 'column_2'),
            (2, 'column_2')],
           )
Convert this multi-index to list:
df[df.isin(['a', 'b', 'd'])].stack().index.tolist()
[Out]:
[(1, 'column_1'), (1, 'column_2'), (2, 'column_2')]
Note
If a list of values is searched, the returned result does not preserve the order of the input values:
df[df.isin(['d', 'b', 'a'])].stack().index.tolist()
[Out]:
[(1, 'column_1'), (1, 'column_2'), (2, 'column_2')]
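The same approach can be wrapped in a small helper; a sketch (the function name find_value is my own):

```python
import pandas as pd

def find_value(df, value):
    """Return [(row_label, column_label), ...] for every cell equal to value."""
    return df[df == value].stack().index.tolist()

df = pd.DataFrame({'column_1': ['a', 'c'], 'column_2': ['b', 'd']}, index=[1, 2])
print(find_value(df, 'd'))  # -> [(2, 'column_2')]
```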
I had a similar need and this worked perfectly:
# deal with case sensitivity
df = raw_df.applymap(lambda s: s.upper() if isinstance(s, str) else s)
# positional index of the first row that contains the value
value_row_location = df.isin(['VALUE']).any(axis=1).tolist().index(True)
# positional index of the first column that contains the value
value_column_location = df.isin(['VALUE']).any(axis=0).tolist().index(True)
# do whatever you want, e.g. replace the value above that cell
df.iloc[value_row_location - 1, value_column_location] = 'VALUE COLUMN'
Note that .index(True) only returns the first match.
Suppose I have a pandas dataframe called df
id value1 value2
1 2 1
2 2 1
3 4 5
In plain Python, I wrote a function to process this dataframe and return a dictionary:
d = dict()
for row in df.itertuples():
    x = do_something(row)
    d[x[0]] = x[1:]
I am trying to reimplement this function using Spark.
d = dict()  # define a global var

def do_something(id, value1, value2):
    # business logic
    d[x0] = [x1, x2, x3]
    return 0

udf_do = udf(do_something)
then:
df_spark.select (udf_do ('id','value1','value2'))
My idea is, by calling df_spark.select, the function do_something will be called over the dataframe, and it will update the global variable d. I don't really care about the return value of udf_do so I return 0.
However, my solution does not work.
Could you suggest me some ways to iterate through (I know it is not a Spark-way) or somehow to process a Spark dataframe and update an external dictionary?
Note that the dataframe is quite large. I tried to convert it to pandas by calling toPandas() but I have OOM problem.
A UDF cannot update any global state. But you can do some business logic inside the UDF and then use toLocalIterator to get all the data to the driver in a memory-efficient way (partition by partition). For example:
df = spark.createDataFrame([(10, 'b'), (20, 'b'), (30, 'c'),
                            (40, 'c'), (50, 'c'), (60, 'a')], ['col1', 'col2'])
df = df.withColumn('udf_result', ......)  # withColumn returns a new DataFrame
df.cache()
df.count()  # force cache fill
for row in df.toLocalIterator():
    print(row)
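The driver-side pattern, building a dict from an iterator of rows, can be sketched in plain Python; fake_local_iterator here is a stand-in for what df.toLocalIterator() would yield:

```python
# Stand-in for df.toLocalIterator(): any iterator of row tuples works.
def fake_local_iterator():
    yield (10, 'b')
    yield (20, 'b')
    yield (30, 'c')

d = {}
for col1, col2 in fake_local_iterator():
    # The per-row logic runs on the driver, one row at a time; with Spark,
    # only one partition's worth of rows is held in driver memory at once.
    d[col1] = col2

print(d)  # -> {10: 'b', 20: 'b', 30: 'c'}
```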