I need to concatenate multiple matrices (containing numbers and strings) in a loop. So far I have written the solution below, but I don't like using a dummy variable (h), and I'm sure the code could be improved.
Here it is:
h = 0
for name in list_of_matrices:
    h += 1
    Matrix = pd.read_csv(name)
    if h == 1:
        Matrix_final = Matrix
        continue
    Matrix_final = pd.concat([Matrix_final, Matrix])
For some reason, if I use the following code instead, I end up with the two matrices one after the other rather than a single joined one (so this code doesn't fit):
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)
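The usual pattern, sketched under the assumption that list_of_matrices holds the CSV paths: collect the frames in a list and call pd.concat once at the end. The list by itself never joins anything, which is why the second snippet leaves the matrices side by side.

import pandas as pd

# read every file, then join them all in a single concat call
frames = [pd.read_csv(name, index_col=None, header=0) for name in list_of_matrices]
Matrix_final = pd.concat(frames, ignore_index=True)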
I tried to multiply the two sparse matrices, but I had trouble deleting the extra rows that were all zeros. I used numpy.delete(my_matrix, [n], axis=0) and got this error:
index 4 is out of bounds for axis 0 with size 3
def mult_mat(mat1, mat2):
    # apparent header-row convention: mat[0] = [rows, cols, number of nonzero entries]
    col = mat1[0][1]
    row = mat2[0][0]
    row_mat1, row_mat2 = np.shape(mat1)[0], np.shape(mat2)[0]
    if col != row:
        return "Multiplication is not possible because the number" \
               " of columns in the first matrix does not equal the" \
               " number of rows in the second matrix"
    my_matrix = np.array([[0] * 3] * (mat1[0][2] * mat2[0][2]))
    n = 0
    for r in range(1, row_mat1):
        for h in range(1, row_mat2):
            if mat1[r][1] == mat2[h][0]:
                my_matrix[n][0], my_matrix[n][1], my_matrix[n][2] = mat1[r][0], mat2[h][1], mat1[r][2] * mat2[h][2]
                n += 1
    row_my_matrix = np.shape(my_matrix)[0]
    for n in range(row_my_matrix):
        if my_matrix[n][0] == 0 & my_matrix[n][1] == 0 & my_matrix[n][2] == 0:
            my_matrix = np.delete(my_matrix, [n], axis=0)
    return my_matrix
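For the zero-row cleanup, the index-out-of-bounds error comes from shrinking the array with np.delete while the loop still runs over the original row count. A minimal sketch of a one-pass alternative using a boolean mask, which also sidesteps the & precedence trap (a == 0 & b == 0 is not the same as (a == 0) and (b == 0) in Python):

import numpy as np

def drop_zero_rows(mat):
    # keep only the rows that have at least one non-zero entry
    return mat[~np.all(mat == 0, axis=1)]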
I'm trying to get a subset of my data whenever there is a consecutive occurrence of two events in that order. The events are time-stamped. So every time there are consecutive 2's followed by consecutive 3's, I want to subset those rows into a dataframe and append it to a dictionary. The following code does that, but I have to apply it to a very large dataframe of more than 20 million observations, and it is extremely slow using iterrows. How can I make this fast?
df = pd.DataFrame({'Date': [101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122],
                   'Event': [1,1,2,2,2,3,3,1,3,2,2,3,1,2,3,2,3,2,2,3,3,3]})
dfb = pd.DataFrame(columns=df.columns)
C = {}
f1 = 0
for index, row in df.iterrows():
    if ((row['Event'] == 2) & (3 not in dfb['Event'].values)):
        dfb = dfb.append(row)
        f1 = 1
    elif ((row['Event'] == 3) & (f1 == 1)):
        dfb = dfb.append(row)
    elif 3 in dfb['Event'].values:
        f1 = 0
        C[str(dfb.iloc[0,0])] = dfb
        del dfb
        dfb = pd.DataFrame(columns=df.columns)
        if row['Event'] == 2:
            dfb = dfb.append(row)
            f1 = 1
    else:
        f1 = 0
        del dfb
        dfb = pd.DataFrame(columns=df.columns)
Edit: The desired output is basically a dictionary of the subsets shown in the image: https://i.stack.imgur.com/ClWZs.png
If you want to accelerate it, you should vectorize your code. You could try something like this (df is the same as in your code):
vec = df.copy()
vec['Event_y'] = vec['Event'].shift(1).fillna(0).astype(int)
vec['Same_Flag'] = float('nan')
vec.loc[(vec['Event_y'] == vec['Event']) & (vec['Event'] != 1), 'Same_Flag'] = 1
vec.dropna(inplace=True)
vec.loc[:, ('Date', 'Event')]
Output is:
Date Event
3 104 2
4 105 2
6 107 3
10 111 2
18 119 2
20 121 3
21 122 3
I think that's close to what you need; you could build on it from there.
I don't understand why dates 104, 105, and 107 are not counted.
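A hedged alternative sketch that keeps the dictionary-of-blocks output the question asks for: label maximal runs of equal Event values with a cumulative sum, pair each run of 2's with the run of 3's that immediately follows it, and slice each block out of df. Run detection is fully vectorized; the final loop visits only the handful of blocks, not the 20 million rows.

import pandas as pd

# label maximal runs of identical Event values
run_id = (df['Event'] != df['Event'].shift()).cumsum()
pos = pd.Series(df.index, index=df.index)
runs = pd.DataFrame({
    'value': df['Event'].groupby(run_id).first(),
    'lo': pos.groupby(run_id).first(),   # first row label of each run
    'hi': pos.groupby(run_id).last(),    # last row label of each run
})

# a qualifying block is a run of 2's whose very next run is 3's
starts = (runs['value'] == 2) & (runs['value'].shift(-1) == 3)
C = {}
for lo, hi in zip(runs.loc[starts, 'lo'],
                  runs.loc[starts.shift(1, fill_value=False), 'hi']):
    block = df.loc[lo:hi]
    C[str(block.iloc[0, 0])] = block   # same key as the original: the first Date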
I've been trying to vectorize the following with no such luck:
Consider two data frames. One is a list of dates:
cols = ['col1', 'col2']
index = pd.date_range('1/1/15','8/31/18')
df = pd.DataFrame(columns = cols )
What I'm doing currently is looping through the index and counting all rows in my main (large) dataframe df_main whose n_date is less than or equal to the date in question:
for x in range(len(index)):
    active = len(df_main[df_main.n_date <= index[x]])
    temp_arr = [index[x], active]
    df = df.append(pd.Series(temp_arr, index=cols), ignore_index=True)
Is there a way to vectorize the above?
What about something like the following?
# initializing
mycols = ['col1', 'col2']
myindex = pd.date_range('1/1/15', '8/31/18')
mydf = pd.DataFrame(columns=mycols)

# create df_main (which has each of myindex's dates minus 10 days)
df_main = pd.DataFrame(data=myindex - pd.Timedelta(days=10), columns=['n_date'])

# wrap a dataframe around a list comprehension
mydf = pd.DataFrame([[x, len(df_main[df_main['n_date'] <= x])] for x in myindex],
                    columns=mycols)
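If even the comprehension is too slow, a fully vectorized variant under the same assumptions: sort the dates once and let numpy.searchsorted count, for every date in myindex, how many n_date values fall at or below it.

import numpy as np

# side='right' counts values <= each date, ties included
sorted_dates = np.sort(df_main['n_date'].values)
counts = np.searchsorted(sorted_dates, myindex.values, side='right')
mydf = pd.DataFrame({'col1': myindex, 'col2': counts})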
For data that is indexed from a dataframe like this:
import json

mycolumns = ['name']
df = pd.DataFrame(columns=mycolumns)
rows = [["John Abraham"], ["Lincoln Smith"]]
for row in rows:
    df.loc[len(df)] = row
print(df)
jsons = json.loads(df.to_json(orient='records'))
n = 0
for j in jsons:
    j['injection_timestamp'] = pd.to_datetime('now')
    es.index(index="prox", doc_type='record', body=j)
    if n % 1000 == 0:
        print(n / 1000)
    n += 1
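An aside, assuming the same es client: indexing one document per call makes a network round trip per row, while the bulk helper that ships with the elasticsearch Python package sends them in batches instead.

from elasticsearch import helpers

# stream the records through the bulk API in batches
actions = ({"_index": "prox", "_type": "record", "_source": j} for j in jsons)
helpers.bulk(es, actions)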
I am trying to run a match_phrase search for a phrase that is spread over two rows, as described here:
https://www.elastic.co/guide/en/elasticsearch/guide/current/_multivalue_fields_2.html#_multivalue_fields_2
es.search(index="prox", body={"query": {"match_phrase":{"name": "Abraham Lincoln"}}})
I expected to get one hit because of the way arrays are indexed, but I don't get any hits.
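One likely explanation, offered as an assumption rather than a confirmed diagnosis: the loop above indexes each row as a separate document, and match_phrase only matches positions within a single document's field, so "Abraham Lincoln" can never span two documents. The multivalue behavior in the linked guide applies when the values are an array inside one document, roughly:

# index both names as a multivalue field in a single (hypothetical) document
es.index(index="prox", doc_type='record',
         body={"name": ["John Abraham", "Lincoln Smith"]})
# Elasticsearch still inserts a position_increment_gap (100 by default) between
# array entries, so the phrase only matches with a large enough slop
es.search(index="prox", body={"query": {"match_phrase":
          {"name": {"query": "Abraham Lincoln", "slop": 100}}}})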