I have already trained a decision tree model on a list of first names labelled with the corresponding gender, and I would like to test the model on a new dataset that only includes first names.
The following function is used to predict the gender from a first name:
```
def genderpredictor(a):
    # vectorise the name with the same feature extractor (features) and
    # vectoriser (dv) used during training, then classify with the trained model dclf
    test_name1 = [a]
    transform_dv = dv.transform(features(test_name1))
    vector = transform_dv.toarray()
    if dclf.predict(vector) == 0:
        print("Female")
    else:
        print("Male")
```
The following is the new dataset:
```customers.head()```
Output:
|   | Cust_First_Name |
|---|:---------------:|
| 0 | EBtissam        |
| 1 | Nawal           |
| 2 | Amer            |
| 3 | Joanna          |
| 4 | Stephany        |
# Transforming the column to a list to be able to use it for prediction
```customers_list = customers.Cust_First_Name.tolist()```
# Prediction
```
for n in customers_list:
    genderpredictor(n)  # the function prints the prediction itself
```
The above code prints the predictions one per line, as below:
Male
Female
Female
Female
Male
What I need is to turn this generated output into a column that is appended to the customers dataframe.
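A minimal sketch of one way to do this, assuming genderpredictor is changed to return the label instead of printing it (the names dv, features, and dclf come from the training code; the column name Predicted_Gender is just an illustrative choice):
```
def genderpredictor(a):
    # same logic as above, but returning the label so it can be stored
    vector = dv.transform(features([a])).toarray()
    return "Female" if dclf.predict(vector) == 0 else "Male"

# apply the predictor to the name column and keep the result as a new column
customers['Predicted_Gender'] = customers['Cust_First_Name'].apply(genderpredictor)
print(customers.head())
```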
I am using a dummy dataset that contains a number of columns. Here is how the data looks:
A B C D
5 3 pen Copybook
2 4 pencil Rubber
I want to predict column B as my target given other features in the dataset
df = pd.read_csv('Data.csv')
X = df.drop('B', axis = 1)
Y = df['B']
How can I read columns A, C, and D? How can I feed this type of data to my model, since columns are discarded when they are not float, binary, or boolean?
I appreciate your help
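A minimal sketch of one common approach, assuming the goal is simply to turn the categorical columns C and D into numeric features before training (pd.get_dummies is used here; sklearn's OneHotEncoder would be an alternative):
```
import pandas as pd

df = pd.read_csv('Data.csv')
X = df.drop('B', axis=1)
Y = df['B']

# one-hot encode the non-numeric columns so the model receives only numeric input
X_encoded = pd.get_dummies(X, columns=['C', 'D'])
print(X_encoded.head())
```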
I have a big CSV file (290 GB), which I read using dask. It contains information on the birth years of many individuals (and their parents). I need to create a new column 'EVENT_DATE' that will contain the birth year for individuals (type I) and the birth year of their children for parents (type P).
I know it is a big file and will take some time to process, but I have the impression I am not using dask in the correct way.
The original data looks like this (with many more columns):
| id | sourcename | type | event | birth_year | father_id | mother_id |
|----|------------|------|-------|------------|-----------|-----------|
| 1  | source_A   | I    | B     | 1789       | 2         | 3         |
| 2  | source_A   | P    | B     |            |           |           |
| 3  | source_A   | P    | B     |            |           |           |
| .. | ...        | ...  | ...   | ...        | ...       | ...       |
| n  | source_B   | I    | B     | 1800       | x         | y         |
And what I'd like to obtain is something like this:
| id | sourcename | type | event | birth_year | father_id | mother_id | EVENT_DATE |
|----|------------|------|-------|------------|-----------|-----------|------------|
| 1  | source_A   | I    | B     | 1789       | 2         | 3         | 1789       |
| 2  | source_A   | P    | B     |            |           |           | 1789       |
| 3  | source_A   | P    | B     |            |           |           | 1789       |
| .. | ...        | ...  | ...   | ...        | ...       | ...       | ...        |
| n  | source_B   | I    | B     | 1800       | x         | y         | 1800       |
I filter the ddf using a list of unique values ("sourcenames") for the column 'sourcename' and iterate over these values so I can work on smaller dataframes. I repartition each slice, perform my operations on it, and then want to save each one as a separate parquet file.
My code so far looks like this:
import dask.dataframe as dd
import pandas as pd

ddf = dd.read_csv('raw_data.csv')
# I create the EVENT_DATE column:
ddf['EVENT_DATE'] = pd.NA
# I then create a smaller df based on sourcename
for source in sourcenames:
    df = ddf[ddf.sourcename == source]
    df = df.repartition(partition_size="100MB")
    # I then add the info on the birth year in a new EVENT_DATE column
    df['EVENT_DATE'] = df.apply(lambda x: x['birth_year'] if x['type'] == 'I' and x['event'] == 'B' else pd.NA,
                                axis=1, meta=(None, 'object'))
    # I then try to match parents' EVENT_DATE to the birth year of their children. I thought that restricting
    # to 10 rows above and below might speed up calculations, since I know it is very unlikely that
    # observations are going to be far apart in the dataset:
    df['EVENT_DATE'] = df.apply(lambda x: df[(df.index > x.name - 10) & (df.index < x.name + 10) &
                                             ((df['father_id'] == x['id']) | (df['mother_id'] == x['id']))]
                                ['EVENT_DATE'].values[0] if x['type'] == 'P' else x['EVENT_DATE'],
                                axis=1, meta=(None, 'object'))
This gives me the following error
ValueError: Arrays chunk sizes are unknown: (nan,)
A possible solution: https://docs.dask.org/en/latest/array-chunks.html#unknown-chunks
Summary: to compute chunk sizes, use
x.compute_chunk_sizes()          # for Dask Array x
ddf.to_dask_array(lengths=True)  # for Dask DataFrame ddf
I think I may understand what the issue is, but I certainly don't understand how to solve it.
Any help would be immensely appreciated.
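For what it is worth, here is a hedged sketch of a merge-based alternative to the row-wise apply (an assumption about the intent, not the poster's code; the column names follow the question). It avoids the positional index lookups that trigger the unknown-chunk-sizes error, at the cost of a shuffle for the merge:
```
import dask.dataframe as dd

ddf = dd.read_csv('raw_data.csv')

# rows of type 'I' already carry their own birth year
children = ddf[(ddf['type'] == 'I') & (ddf['event'] == 'B')][['father_id', 'mother_id', 'birth_year']]

# re-key the children's birth years by each parent's id
by_father = children[['father_id', 'birth_year']].rename(columns={'father_id': 'id', 'birth_year': 'child_birth_year'})
by_mother = children[['mother_id', 'birth_year']].rename(columns={'mother_id': 'id', 'birth_year': 'child_birth_year'})
parent_years = dd.concat([by_father, by_mother]).drop_duplicates(subset='id')

# attach the child's birth year to every row, then pick the right value per row type
ddf = ddf.merge(parent_years, on='id', how='left')
ddf['EVENT_DATE'] = ddf['birth_year'].where(ddf['type'] == 'I', ddf['child_birth_year'])
```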
I have the following dataframe:
import pandas as pd
import numpy as np

df = pd.DataFrame()
data = {'Description': ['CERVEZA DORADA BOTELLA NR 24 UNIDADES 350ML',
                        'BEBIDA DE ALMENDRA COCO SILK 1/6/946 ML WHITE WAVE (1788 UNIDADES)',
                        'ADES SOYA ORAN TETRA 200MLX10',
                        'ADES SOYA NATURAL TETRA', 'ADES COCO TETRA']}
df = pd.DataFrame(data)
print(df)
I am using the following code (a list of brand names) to create a new column based on a specific brand name (in this case Ades); the solution was found in Pandas str.contains - Search for multiple values in a string and print the values in a new column:
brands = ['Ades']

def matcher(x):
    for i in brands:
        if i.lower() in x.lower():
            return i
        else:
            return np.nan

df['Brands'] = df['Description'].apply(matcher)
It creates the column and applies the matcher, but the result is not right.
Results: at the moment, if it finds ADES inside any word (like UNIDADES), it labels the row as Ades. What I am trying to accomplish is matching only the word Ades, not any word that contains it. This is a simple example, but I have more than 10 million records and different brands. How do I set up the matching so that it only finds the whole word and not a substring?
Thanks.
Create a regex pattern to extract:
# (?i) insensitive, \b word boundary
pat = fr"(?i)\b({'|'.join(brands)})\b"
df['Brand'] = df['Description'].str.extract(pat, expand=False)
print(df)
# Output
Description Brand
0 CERVEZA DORADA BOTELLA NR 24 UNIDADES 350ML NaN
1 BEBIDA DE ALMENDRA COCO SILK 1/6/946 ML WHITE ... NaN
2 ADES SOYA ORAN TETRA 200MLX10 ADES
3 ADES SOYA NATURAL TETRA ADES
4 ADES COCO TETRA ADES
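One possible refinement (an assumption about the data, not part of the original answer): if any brand names contain regex metacharacters, they may need to be escaped before being joined into the pattern, for example:
```
import re

brands = ['Ades', 'Dr. Pepper']  # hypothetical list where '.' would otherwise act as a regex wildcard
pat = fr"(?i)\b({'|'.join(map(re.escape, brands))})\b"
df['Brand'] = df['Description'].str.extract(pat, expand=False)
```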
I have a dataframe column that contains domain names, e.g. newyorktimes.com. I split on '.' and apply CountVectorizer to "newyorktimes".
The dataframe
domain split country
newyorktimes.com newyorktimes usa
newyorkreport.com newyorkreport usa
"newyorktimes" is also added as a new dataframe column called 'split'
I'm able to get the term frequencies
from sklearn.feature_extraction.text import CountVectorizer

vectoriser = CountVectorizer(analyzer='word', ngram_range=(2, 2), stop_words='english')
X = vectoriser.fit_transform(df['split'])
features = vectoriser.get_feature_names()
count = X.toarray().sum(axis=0)
dic = dict(zip(features, count))
dic = sorted(dic.items(), key=lambda x: x[1], reverse=True)
But I also need the 'country' information from the original dataframe and I don't know how to map the terms back to the original dataframe.
Expected output
term country domain count
new york usa 2
york times usa 1
york report usa 1
I cannot reproduce the example you provided; I am not sure you supplied the correct input to the CountVectorizer. If it is just a matter of adding the count matrix back to the data frame, you can do it like this:
df = pd.DataFrame({'corpus': ['This is the first document.',
                              'This document is the second document.',
                              'And this is the third one.',
                              'Is this the first document?']})

vectoriser = CountVectorizer(analyzer='word', ngram_range=(2, 2), stop_words='english')
X = vectoriser.fit_transform(df['corpus'])
features = vectoriser.get_feature_names()
pd.concat([df, pd.DataFrame(X.toarray(), columns=features, index=df.index)], axis=1)
                                   corpus  document second  second document
0             This is the first document.                0                0
1   This document is the second document.                1                1
2              And this is the third one.                0                0
3             Is this the first document?                0                0
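To get closer to the expected output in the question (one row per term with its count and country), a hedged follow-up to the snippet above is to melt the count columns into long format; counts below is just the per-row count matrix from the concat step:
```
# keep the original columns, then reshape the count columns into (term, count) pairs
# so each term stays associated with its row's metadata (e.g. country, domain)
counts = pd.DataFrame(X.toarray(), columns=features, index=df.index)
long_df = pd.concat([df, counts], axis=1).melt(id_vars=list(df.columns),
                                               var_name='term', value_name='count')
long_df = long_df[long_df['count'] > 0]
print(long_df)
```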
So, assume I have the following table:
Name | Color
------------------------------
John | Blue
Greg | Red
John | Yellow
Greg | Red
Greg | Blue
I would like to get a table of the distinct colors for each name - how many and their values. Meaning, something like this:
Name | Distinct | Values
--------------------------------------
John | 2 | Blue, Yellow
Greg | 2 | Red, Blue
Any ideas how to do so?
collect_list will give you a list without removing duplicates.
collect_set will automatically remove duplicates
so just
select
    Name,
    count(distinct Color) as Distinct, -- not a very good name
    collect_set(Color) as Values
from TblName
group by Name
This feature has been implemented since Spark 1.6.0; check it out:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
/**
 * Aggregate function: returns a set of objects with duplicate elements eliminated.
 *
 * For now this is an alias for the collect_set Hive UDAF.
 *
 * @group agg_funcs
 * @since 1.6.0
 */
def collect_set(columnName: String): Column = collect_set(Column(columnName))
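For reference, a hedged sketch of the same aggregation expressed through the PySpark DataFrame API (assuming a DataFrame df with Name and Color columns):
```
from pyspark.sql import functions as F

# same result as the SQL above: distinct count plus the set of distinct values per Name
result = df.groupBy("Name").agg(
    F.countDistinct("Color").alias("Distinct"),
    F.collect_set("Color").alias("Values"),
)
result.show(truncate=False)
```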
For PySpark: I come from an R/pandas background, so I actually find Spark DataFrames a little easier to work with.
To do this:

1. Set up a Spark SQL context
2. Read your file into a dataframe
3. Register your dataframe as a temp table
4. Query it directly using SQL syntax
5. Save the results as objects, output to files... do your thing
Here's a class I created to do this:
import os

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext


class SQLspark():
    def __init__(self, local_dir='./', hdfs_dir='/users/', master='local', appname='spark_app', spark_mem=2):
        self.local_dir = local_dir
        self.hdfs_dir = hdfs_dir
        self.master = master
        self.appname = appname
        self.spark_mem = int(spark_mem)
        self.conf = (SparkConf()
                     .setMaster(self.master)
                     .setAppName(self.appname)
                     .set("spark.executor.memory", self.spark_mem))
        self.sc = SparkContext(conf=self.conf)
        self.sqlContext = SQLContext(self.sc)

    def file_to_df(self, input_file):
        # import file as dataframe, all cols will be imported as strings
        df = self.sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("delimiter", "\t").option("inferSchema", "true").load(input_file)
        # cache df object to avoid rebuilding each time
        df.cache()
        # register as temp table for querying, use 'spark_df' as table name
        df.registerTempTable("spark_df")
        return df

    # you can also cast a spark dataframe as a pandas df
    def sparkDf_to_pandasDf(self, input_df):
        pandas_df = input_df.toPandas()
        return pandas_df

    def find_distinct(self, col_name):
        my_query = self.sqlContext.sql("""SELECT distinct {} FROM spark_df""".format(col_name))
        # now do your thing with the results etc
        my_query.show()
        my_query.count()
        my_query.collect()
###############
if __name__ == '__main__':
    # instantiate the class - see __init__ for the variables to supply
    spark = SQLspark(os.getcwd(), 'hdfs_loc', "local", "etl_test", 10)
    # specify input file to process
    tsv_infile = 'path/to/file'