arr = [12345, 12345, 12345, 12345]
paths = [67890, 67890, 67890, 67890]
How can I combine these two numpy arrays to get this output below?
output = [1234567890, 1234567890, 1234567890, 1234567890]
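IIUC, here's one way. Note that ''.join only accepts strings, so the values are cast to str first:
import numpy as np
arr_list = np.array([arr, paths]).astype(str)
result = np.apply_along_axis(''.join, 0, arr_list)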
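As a possible alternative to the approach above, here is a minimal sketch that concatenates the values element-wise with np.char.add. It assumes both arrays hold integers (as shown in the question) and that an integer result is wanted; the names combined_str and output are chosen here for illustration only.

```python
import numpy as np

arr = np.array([12345, 12345, 12345, 12345])
paths = np.array([67890, 67890, 67890, 67890])

# Cast both arrays to strings, then concatenate them element-wise.
combined_str = np.char.add(arr.astype(str), paths.astype(str))

# Convert back to integers; int64 leaves room for longer digit strings.
output = combined_str.astype(np.int64)
print(output)
```

If the string form is all that is needed, combined_str can be used directly and the final cast skipped.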
I have an array of arrays of size (4,1) like:
ar1 = np.array([[1,1],
                [2,2],
                [3,3],
                [4,4]])
and a second array of size (1,2) like:
ar2 = np.array([2,3])
I'm trying to multiply every first item of the first array by the first item of the second array, and every second item of the first array by the second item of the second array, such that the result is:
ar_result = np.array([[2,3],
                      [4,6],
                      [6,9],
                      [8,12]])
Is there a way to do this in an easy and vectorized way?
When I try ar1*ar2, I get this error:
ValueError: operands could not be broadcast together with shapes (4,) (2,)
Thanks
EDIT: To clarify, in my case ar1 is a DataFrame df1 column or row, and ar2 is the content of a cell in another DataFrame df2 (df2.loc[x,y] = [2,3])
Have you tried ar1 * ar2?
ar_result = ar1 * ar2
#array([[ 2, 3],
# [ 4, 6],
# [ 6, 9],
# [ 8, 12]])
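The shapes (4,) and (2,) in the error message suggest that, in the DataFrame case from the EDIT, the column is a 1-D object array of lists rather than a 2-D array. A minimal sketch of both situations follows; the DataFrame df1 and its column name "pairs" are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd

# Plain 2-D case: shapes (4, 2) and (2,) broadcast row by row.
ar1 = np.array([[1, 1], [2, 2], [3, 3], [4, 4]])
ar2 = np.array([2, 3])
ar_result = ar1 * ar2

# Hypothetical DataFrame case: each cell of the column holds a list.
df1 = pd.DataFrame({"pairs": [[1, 1], [2, 2], [3, 3], [4, 4]]})

# Convert the column of lists into a real (4, 2) array before multiplying.
stacked = np.array(df1["pairs"].tolist())
from_df = stacked * np.array([2, 3])
print(from_df)
```

Converting the column to a proper 2-D array first lets the same broadcasting rule apply in both cases.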
I have a series of web addresses that I want to split at the first '.'. For example, return 'google' if the web address is 'google.co.uk'.
d1 = {'id':['1', '2', '3'], 'website':['google.co.uk', 'google.com.au', 'google.com']}
df1 = pd.DataFrame(data=d1)
d2 = {'id':['4', '5', '6'], 'website':['google.co.jp', 'google.com.tw', 'google.kr']}
df2 = pd.DataFrame(data=d2)
df_list = [df1, df2]
I use enumerate to iterate over the list of DataFrames:
for i, df in enumerate(df_list):
    df_list[i]['website_segments'] = df['website'].str.split('.', n=1, expand=True)
Received error: ValueError: Wrong number of items passed 2, placement implies 1
You are splitting the website, which with expand=True gives you a two-column result rather than a single column. Think [google, co.uk]. You just want the first element, so:
for i, df in enumerate(df_list):
    df_list[i]['website_segments'] = df['website'].str.split('.', n=1, expand=True)[0]
Another alternative is to use extract. It is also ~40% faster for your data:
for i, df in enumerate(df_list):
    # raw string avoids an invalid-escape warning; expand=False returns a Series
    df_list[i]['website_segments'] = df['website'].str.extract(r'(.*?)\.', expand=False)
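For completeness, here is a self-contained sketch of a third variant that keeps the split as a Series of lists and picks the first element with the .str[0] accessor. It reuses the sample data from the question; iterating without enumerate is a choice made here, since each df in the list is the same object that gets modified.

```python
import pandas as pd

d1 = {'id': ['1', '2', '3'], 'website': ['google.co.uk', 'google.com.au', 'google.com']}
d2 = {'id': ['4', '5', '6'], 'website': ['google.co.jp', 'google.com.tw', 'google.kr']}
df1 = pd.DataFrame(data=d1)
df2 = pd.DataFrame(data=d2)
df_list = [df1, df2]

for df in df_list:
    # Split once at the first '.', then take the first piece of each list.
    df['website_segments'] = df['website'].str.split('.', n=1).str[0]

print(df1)
print(df2)
```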
I have a DataFrame containing 2 columns of ordered categorical data (of the same category). I want to construct another column that contains the categorical maximum of the first 2 columns. I set up the following.
import pandas as pd
from pandas.api.types import CategoricalDtype
import numpy as np
cats = CategoricalDtype(categories=['small', 'normal', 'large'], ordered=True)
data = {
    'A': ['normal', 'small', 'normal', 'large', np.nan],
    'B': ['small', 'normal', 'large', np.nan, 'small'],
    'desired max(A,B)': ['normal', 'normal', 'large', 'large', 'small']
}
df = pd.DataFrame(data).astype(cats)
The columns can be compared, although the np.nan items are problematic, as running the following code shows.
df['A'] > df['B']
The manual suggests that max() works on categorical data, so I try to define my new column as follows.
df[['A', 'B']].max(axis=1)
This yields a column of NaN. Why?
The following code constructs the desired column using the comparability of the categorical columns. I still don't know why max() fails here.
dfA = df['A']
dfB = df['B']
conditions = [dfA.isna(), (dfB.isna() | (dfA >= dfB)), True]
cases = [dfB, dfA, dfB]
df['maxAB'] = np.select(conditions, cases)
Columns A and B hold string labels, so one workaround is to assign an integer value to each category first, take the row-wise maximum of the integers, and map the result back.
# size string -> integer value mapping
size2int_map = {
    'small': 0,
    'normal': 1,
    'large': 2
}
# integer value -> size string mapping
int2size_map = {
    0: 'small',
    1: 'normal',
    2: 'large'
}
# create columns containing the integer value for each size string
for c in df:
    df['%s_int' % c] = df[c].map(size2int_map)
# apply the int2size map back to get the string sizes back
print(df[['A_int', 'B_int']].max(axis=1).map(int2size_map))
and you should get
0 normal
1 normal
2 large
3 large
4 small
dtype: object
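The integer mapping above can also be taken from the categorical dtype itself instead of hand-written dictionaries. The following is a minimal sketch under that assumption: it relies on .cat.codes, where missing values get the code -1 and therefore never win the maximum, and on pd.Categorical.from_codes to rebuild the labels. The column name maxAB is borrowed from the question.

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

cats = CategoricalDtype(categories=['small', 'normal', 'large'], ordered=True)
data = {
    'A': ['normal', 'small', 'normal', 'large', np.nan],
    'B': ['small', 'normal', 'large', np.nan, 'small'],
}
df = pd.DataFrame(data).astype(cats)

# Codes follow the category order (small=0, normal=1, large=2); NaN is -1.
max_codes = np.maximum(df['A'].cat.codes, df['B'].cat.codes)

# Turn the winning codes back into an ordered categorical column.
df['maxAB'] = pd.Categorical.from_codes(max_codes.to_numpy(), dtype=cats)
print(df)
```

A row where both A and B are missing keeps the code -1 and so stays NaN in the result, and the new column keeps the ordered categorical dtype.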
I want to use PandasUDFType.SCALAR to operate on the Row arrays as below:
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import ArrayType, IntegerType
df = spark.createDataFrame([([1, 2, 3, 2],), ([4, 5, 5, 4],)], ['data'])
@pandas_udf(ArrayType(IntegerType()), PandasUDFType.SCALAR)
def s(x):
    z = x.apply(lambda xx: xx*2)
    return z
df.select(s(df.data)).show()
but it went wrong:
pyarrow.lib.ArrowInvalid: trying to convert NumPy type int32 but got int64
I have a data frame consisting of a mixture of NaNs and strings, e.g.
data = {'String1':['NaN', 'tree', 'car', 'tree'],
        'String2':['cat','dog','car','tree'],
        'String3':['fish','tree','NaN','tree']}
ddf = pd.DataFrame(data)
I want to
1: Count the total number of each item and put the counts in a new data frame, e.g.
NaN=2
tree=5
car=2
fish=1
cat=1
dog=1
2: Count the total number of each item when compared to a separate, longer list (a column of another data frame), e.g.
df['compare'] =
NaN
tree
car
fish
cat
dog
rabbit
Pear
Orange
snow
rain
Thanks
Jason
For the first question:
import pandas as pd
from collections import Counter
data = {
    "String1": ["NaN", "tree", "car", "tree"],
    "String2": ["cat", "dog", "car", "tree"],
    "String3": ["fish", "tree", "NaN", "tree"],
}
ddf = pd.DataFrame(data)
# flatten all cells into one list and count each distinct value
a = Counter(ddf.stack().tolist())
df_result = pd.DataFrame(dict(a), index=['Count']).T
# second question: look up each item of the longer list (items not found become NaN)
df = pd.DataFrame({'vals':['NaN', 'tree', 'car', 'fish', 'cat', 'dog', 'rabbit', 'Pear', 'Orange', 'snow', 'rain']})
df_counts = df.vals.map(df_result.to_dict()['Count'])
This should do :)
You can use the following code to count the items across the whole data frame.
import pandas as pd
data = {'String1':['NaN', 'tree', 'car', 'tree'],
'String2':['cat','dog','car','tree'],
'String3':['fish','tree','NaN','tree']}
df = pd.DataFrame(data)
def get_counts(df: pd.DataFrame) -> dict:
    res = {}
    for col in df.columns:
        vc = df[col].value_counts().to_dict()
        for k,v in vc.items():
            if k in res:
                res[k] += v
            else:
                res[k] = v
    return res
counts = get_counts(df)
Output
>>> print(counts)
{'tree': 5, 'car': 2, 'NaN': 2, 'cat': 1, 'dog': 1, 'fish': 1}
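A shorter route to the same counts, sketched here as one possibility, is to stack the frame into a single Series and call value_counts once. For the second part of the question, reindex against the longer list fills items that never occur with 0. The names compare and compare_counts are chosen here for illustration, and the sketch assumes the 'NaN' entries are literal strings, as in the sample data.

```python
import pandas as pd

data = {'String1': ['NaN', 'tree', 'car', 'tree'],
        'String2': ['cat', 'dog', 'car', 'tree'],
        'String3': ['fish', 'tree', 'NaN', 'tree']}
ddf = pd.DataFrame(data)

# Part 1: flatten every cell into one Series and count each distinct value.
counts = ddf.stack().value_counts()

# Part 2: align the counts with a longer list; absent items get a count of 0.
compare = ['NaN', 'tree', 'car', 'fish', 'cat', 'dog',
           'rabbit', 'Pear', 'Orange', 'snow', 'rain']
compare_counts = counts.reindex(compare, fill_value=0)
print(compare_counts)
```

If the frame held real missing values instead of the string 'NaN', stack would drop them, so they would need to be filled or counted separately.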