a=[1,2]
ids = [id1, id2]
all_rows = Table.query.filter(Table.id.in_(ids)).update({Table.price: ...})
I am trying to speed up updating my records in the database. First I select the records I want to update by a list of ids, and then I would like to update the price of each record using list a, in the sense that both lists are aligned: 1 should go with id1 and 2 with id2.
Is it possible to do that, and if so, how?
I know there is the option to do it one by one in a for loop, but that is not really fast when running over a larger number of records.
Thank you for your help!
You can do this kind of update by using a case expression for the update values.
For example, if you wanted to set the values of col2 depending on the values of col1, for a particular group of ids, you could do this:
ids = [1, 2, 3]
targets = ['a', 'b', 'c'] # col1
values = [10, 20, 30] # col2
case = db.case(*((Table.col1 == t, v) for t, v in zip(targets, values)))
Table.query.filter(Table.id.in_(ids)).update({'col2': case})
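Applied to the original ids/prices lists, a minimal sketch could look like the following (assuming the same Flask-SQLAlchemy setup with db and Table as above, and SQLAlchemy 1.4+, where case() accepts the (condition, value) pairs positionally; older versions expect a list instead):
# ids and new prices from the question, assumed to be aligned element by element
ids = [id1, id2]
a = [1, 2]

# CASE keyed on the primary key: each id gets its own new price.
price_case = db.case(*((Table.id == i, p) for i, p in zip(ids, a)))

# One UPDATE statement for all rows instead of one query per record.
Table.query.filter(Table.id.in_(ids)).update(
    {Table.price: price_case},
    synchronize_session=False,
)
db.session.commit()
synchronize_session=False is passed because the default session-synchronization strategy may refuse an IN() filter, depending on the SQLAlchemy version.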
I am working on going through each element of each column and comparing it with every element of every other column in the dataframe.
To do this I wrote 4 nested loops and tried to test them, but it turned out to be very, very slow. Is there another way I can do this?
Here is my code:
num = 0
for i in df:
    for k in df:
        for val in df[i]:
            for val2 in df[k]:
                if val == val2:
                    num += 1
                else:
                    break
It is just counting the common elements, but that is not my main purpose; I just want to know whether there is an efficient way to do this.
For example:
I want to find the edit distance between each value in a column and the value at the same row index in every other column, but all I could find was computing distances between all the values of one column and all the values of the other columns, which is quite slow.
The picture illustrates this better; I want the variant marked with the 'tick sign'. I also want the average distance for each of the newly made columns.
Output:
Average Distance between column1 and column2 is (some num) ,
Average Distance between column1 and column3 is (some num)
Thanks a ton!
This might be what you are looking for. I had to create some of my own data, but I believe this is what you are trying to accomplish:
import pandas as pd

df = pd.DataFrame({
'Column1' : ['Yes'],
'Column2' : ['No'],
'Column3' : ['Yes'],
'Column4' : ['Yes']
})
df['Count'] = df.apply(lambda x : ' '.join(x), axis = 1).apply(lambda x : x.split()).apply(lambda x : x.count('Yes'))
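Since the question is really about per-row (same index) edit distances and their averages per column pair, here is a minimal sketch of that idea with made-up strings (the data and column names are hypothetical, and the Levenshtein function is a plain textbook implementation):
import pandas as pd

def levenshtein(a, b):
    # Classic dynamic-programming edit distance between two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

df = pd.DataFrame({
    'Column1': ['kitten', 'flaw', 'sunday'],
    'Column2': ['sitting', 'lawn', 'saturday'],
    'Column3': ['mitten', 'claw', 'monday'],
})

base = 'Column1'
for col in df.columns.drop(base):
    distances = [levenshtein(x, y) for x, y in zip(df[base], df[col])]
    print(f"Average Distance between {base} and {col} is {sum(distances) / len(distances)}")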
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randint(0,9,size=(2, 2)))
df2 = pd.DataFrame(np.random.randint(0,9,size=(2, 2)))
Let's say that after concatenating df1 and df2 (in the real case I have many dfs of size 700*200) I get something like the table below (I don't need to see this table, it is just for explanation):
        col a    col b
row a   [1,4]    [7,8]
row b   [9,2]    [2,0]
Then I want to pass each cell's values to the compute function below and write the result back into the cell.
def compute(row, column, cell_values):
    baseline_df = [2, 4, 6, 7, 8]
    result = baseline_df
    for values in cell_values:
        if (column - row) != dict[values]:  # dict contains specific values
            result = baseline_df
        else:
            result = result.apply(func, value=values)
    return result.loc[column - row]

def func(df, value):
    # operation
    result_df = df * value
    return result_df
What I want is to take df1 and df2, concatenate them, apply the above function, and get the results, in a really fast way.
In the actual use case the dfs are quite big, and running this for every cell would take a significant amount of time, so I need a faster way to do it.
Note:
This is my idea of how to do it. I hope you understand what my requirements are; please let me know if anything is unclear.
Currently I am using something like the line below, which just takes the max value of each cell so that I can do the calculation (func) later. This only gives the max value of all cells combined:
dfs = pd.concat(grid).max(level=0)
The final result should be something like this after the calculation (the same 2-D layout with the new cell values):
        col a    col b
row a   0.1      0.7
row b   0.9      0.6
Different approaches are also welcome.
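One way to structure the cell-wise pass, sketched below with placeholder data and a placeholder compute body (how much faster it can get ultimately depends on whether the real compute can be vectorised): flatten the grid into one (row, column) -> cell Series, apply compute once per cell, and unstack back into the original 2-D shape.
import numpy as np
import pandas as pd

# Hypothetical grid of list-valued cells, mirroring the table above.
grid = pd.DataFrame(
    {'col a': [[1, 4], [9, 2]], 'col b': [[7, 8], [2, 0]]},
    index=['row a', 'row b'],
)

def compute(row, column, cell_values):
    # Placeholder for the real compute(); here it just scales the cell mean.
    return round(float(np.mean(cell_values)) / 10, 1)

flat = grid.stack()
result = pd.Series(
    [compute(r, c, vals) for (r, c), vals in flat.items()],
    index=flat.index,
).unstack()
print(result)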
I have a relatively large table with thousands of rows and a few tens of columns. Some columns are metadata and others are numerical values. The problem I have is that some metadata columns are incomplete or partial, that is, the string after the ':' is missing. I want to get a count of how many entries have this part after the colon missing.
If you look at the miniature example below, what I should get is a small table telling me that in group A, MetaData is complete for 2 entries and incomplete (missing after ':') for the other 2 entries. Ideally I would also like some statistics on SomeValue (count, max, min, etc.).
How do I do this with an SQL query or in Python pandas?
It might turn out to be simple with some built-in function; however, I am not getting it right.
Data:
Group  MetaData  SomeValue
A      AB:xxx    20
A      AB:       5
A      PQ:yyy    30
A      PQ:       2
Expected output:
Group  MetaDataComplete  Count
A      Yes               2
A      No                2
There is no reason to use split functions (unless the value can contain a colon character). I'm just going to assume that the "null" values (not technically the right word) are the ones ending with ':'.
select
    "Group",
    case when MetaData like '%:' then 'Yes' else 'No' end as MetaDataComplete,
    count(*) as "Count"
from T
group by "Group", case when MetaData like '%:' then 'Yes' else 'No' end
You could also use right(MetaData, 1) = ':'.
Or supposing that values can contain their own colons, try charindex(':', MetaData) = len(MetaData) if you just want to ask whether the first colon is in the last position.
Here is an example:
# 1- Create DataFrame
In [1]:
import pandas as pd
import numpy as np
cols = ['Group', 'MetaData', 'SomeValue']
data = [['A', 'AB:xxx', 20],
['A', 'AB:', 5],
['A', 'PQ:yyy', 30],
['A', 'PQ:', 2]
]
df = pd.DataFrame(columns=cols, data=data)
# 2- New data frame with split value columns
new = df["MetaData"].str.split(":", n = 1, expand = True)
df["MetaData_1"]= new[0]
df["MetaData_2"]= new[1]
# 3- Dropping old MetaData columns
df.drop(columns =["MetaData"], inplace = True)
# 4- Replace empty strings with NaN and count them
df.replace('', np.nan, inplace=True)
df.isnull().sum()
Out [1]:
Group 0
SomeValue 0
MetaData_1 0
MetaData_2 2
dtype: int64
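To go one step further and produce the expected output table (Group, MetaDataComplete, Count) together with the statistics on SomeValue mentioned in the question, a groupby sketch along these lines should work (reusing the toy data built above; MetaDataComplete is a helper column added here):
import numpy as np
import pandas as pd

cols = ['Group', 'MetaData', 'SomeValue']
data = [['A', 'AB:xxx', 20],
        ['A', 'AB:', 5],
        ['A', 'PQ:yyy', 30],
        ['A', 'PQ:', 2]]
df = pd.DataFrame(columns=cols, data=data)

# Flag rows whose MetaData ends with ':' (the part after the colon is missing).
df['MetaDataComplete'] = np.where(df['MetaData'].str.endswith(':'), 'No', 'Yes')

# Count per group/flag and add basic statistics on SomeValue.
summary = (df.groupby(['Group', 'MetaDataComplete'])['SomeValue']
             .agg(['count', 'min', 'max', 'mean'])
             .reset_index())
print(summary)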
From a SQL perspective, performing a split is painful, not to mention that using the split results means performing the inner query first and then querying its results:
SELECT
    Results.[Group],
    Results.MetaData,
    Results.MetaValue,
    COUNT(Results.MetaValue)
FROM (SELECT
          [Group],
          MetaData,
          SUBSTRING(MetaData, CHARINDEX(':', MetaData) + 1, LEN(MetaData)) AS MetaValue
      FROM VeryLargeTable) AS Results
GROUP BY Results.[Group],
         Results.MetaData,
         Results.MetaValue
If you're just after a count, you could also take an algorithmic approach: loop over the data and use a regular expression with a negative lookahead.
import pandas as pd
import re

pattern = '.*:(?!.)'  # detects the strings of the missing-data form
missing = 0
not_missing = 0
for i in data['MetaData'].tolist():
    match = re.findall(pattern, i)
    if match:
        missing += 1
    else:
        not_missing += 1
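The same count can also be done without the explicit loop, since pandas string methods evaluate the pattern for the whole column in one call (a small self-contained sketch using the toy data from the question):
import pandas as pd

data = pd.DataFrame({'MetaData': ['AB:xxx', 'AB:', 'PQ:yyy', 'PQ:']})

pattern = '.*:(?!.)'  # same pattern: a colon with nothing after it
mask = data['MetaData'].str.match(pattern)
missing = int(mask.sum())
not_missing = int((~mask).sum())
print(missing, not_missing)  # 2 2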
I have a dataframe that looks like this:
import pandas as pd
import numpy as np

df = pd.DataFrame({'AA': [1, 1, 2, 2], 'BB': ['C', 'D', 'C', 'D'], 'CC': [10, 20, 30, 40],
                   'DD': [np.nan] * 4, 'EE': [np.nan] * 4})
Now, I want to multiply the value in column 'CC' by 2 if 'AA' == 1 and 'BB' == 'C'. For example, the first row meets the conditions, so the value in column 'CC', which is 10, will be multiplied by 2 and the result will go into the 'DD' column of the same row.
I will have other requirements for other pairs of 'AA' and 'BB', but it will be a good start if I understand how to apply a multiplication to rows that meet certain conditions.
Thank you so much.
m0 = df.AA == 1
m1 = df.BB == "C"
df.loc[m0 & m1, "DD"] = df.loc[m0 & m1, "CC"] * 2
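Since the question mentions there will be other (AA, BB) pairs with their own rules, np.select is a convenient way to keep all the conditions in one place; a sketch with one extra, made-up rule (the factor 3 for AA == 2 and BB == 'D' is purely illustrative):
import numpy as np
import pandas as pd

df = pd.DataFrame({'AA': [1, 1, 2, 2],
                   'BB': ['C', 'D', 'C', 'D'],
                   'CC': [10, 20, 30, 40]})

conditions = [
    (df.AA == 1) & (df.BB == 'C'),   # rule from the question
    (df.AA == 2) & (df.BB == 'D'),   # hypothetical extra rule
]
choices = [
    df.CC * 2,
    df.CC * 3,
]

# np.select picks, per row, the choice of the first condition that is True.
df['DD'] = np.select(conditions, choices, default=np.nan)
print(df)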
I am using the Amazon reviews dataset for my research, where I want to select the 100 most rated items. So first I counted the occurrences of the item IDs (asin):
data = amazon_data_parse('data/reviews_Movies_and_TV_5.json.gz')
unique, counts = np.unique(data['asin'], return_counts=True)
test = np.asarray((unique, counts)).T
test.sort(axis=1)
which gives:
array([[5, '0005019281'],
[5, '0005119367'],
[5, '0307141985'],
...,
[1974, 'B00LG7VVPO'],
[2110, 'B00LH9ROKM'],
[2213, 'B00LT1JHLW']], dtype=object)
It is clear that at least 6,000 rows should be selected. But if I run:
a= test[49952:50054,1]
a = a.tolist()
test2 = data[data.asin.isin(a)]
It only selects 2000 rows from the dataset. I have already tried multiple things, like filtering on only one asin, but it just doesn't seem to work. Can someone please help? If there is a better option to get a dataframe with the rows of the 100 most frequent values in the asin column, I would be glad to hear it too.
I found the solution: I had to change the sorting line to:
test = test[test[:,1].argsort()]
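As an aside, if the goal is just the rows belonging to the 100 most frequent asin values, pandas can do the counting, sorting, and filtering directly (a sketch that assumes data is the DataFrame returned by amazon_data_parse above):
# value_counts() is already sorted in descending order, so the first 100
# index entries are the 100 most rated items.
top_asins = data['asin'].value_counts().head(100).index
top_100_rows = data[data['asin'].isin(top_asins)]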