mecab python extract company name

I'm trying to run over the data in a column, extract only the company names using the MeCab library, and list them in a new column.
The target column is a comment column that includes employee names, company names, invoice numbers, etc., either all together or on their own depending on the transaction. Below is my code attempting to extract only the company names. Please note the code below is still a work in progress, but I wanted to post something to start with.
Sorry in advance for my messy coding...
Thank you,
import MeCab  # the mecab-python3 package is imported as MeCab
import ipadic
import pandas as pd

df = pd.read_csv("")  # CSV path omitted

m = MeCab.Tagger(ipadic.MECAB_ARGS)

def kaiseki(column):
    values = df[column].values.tolist()
    parsed = []        # raw MeCab output for each row
    parsed_lines = []  # the same output split into per-token lines
    for text in values:
        result = m.parse(text)
        parsed.append(result)
        lines = result.split('\n')
        parsed_lines.append(lines)
        for line in lines:
            for field in line.split('\t'):
                first = field.split(',')[0]
                # 組織名 means company (organization) name in Japanese
                if first == '組織名':
                    print(line.split()[0])
    df[column] = parsed
    df["column2"] = parsed_lines
    return df["column2"]

columns = ['column']
for column in columns:
    kaiseki(column)
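For what it's worth, a cleaner way to walk MeCab's output is parseToNode, which yields one token at a time with its surface form and its feature string. Note that with the stock ipadic dictionary, organizations are typically tagged as 名詞,固有名詞,組織 inside the feature string rather than with a literal 組織名 label, so here is a minimal sketch under that assumption (the dataframe and column names are placeholders):

import MeCab
import ipadic

tagger = MeCab.Tagger(ipadic.MECAB_ARGS)

def extract_orgs(text):
    # Return the tokens whose IPADic features mark them as organization names
    orgs = []
    node = tagger.parseToNode(text)
    while node:
        features = node.feature.split(',')
        # IPADic feature layout: POS, POS subdivision 1, POS subdivision 2, ...
        if features[:3] == ['名詞', '固有名詞', '組織']:
            orgs.append(node.surface)
        node = node.next
    return orgs

df["company"] = df["column"].apply(extract_orgs)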

Arcpy Script to loop through field and run Union Analysis

I have a polygon file in the form of a fishnet, and another feature class with polygons named Trawl_Buffers. There is a unique field within Trawl_Buffers based on YEAR. I'd like to create a script that runs a selection on YEAR and then performs a union analysis with the fishnet polygon for each YEAR, so the desired outputs would be "Trawl_Buffers_union2003", "Trawl_Buffers_union2004", etc. I have a function that gets me the unique list of years and puts them in a list which I called vals.
It seems I then need to run a for loop over this list of unique years, create a temporary selection, and use that as input for the union, but I am having trouble implementing the query process.
Here is where I started, but I'm seriously tripping:
import arcpy

# Set the data environment
arcpy.env.overwriteOutput = True
arcpy.env.workspace = r'C:\Data\working\AK_Fishing_VMS\2021_Delivery\ArcPro_proj\ArcPro_proj.gdb'
trawlBuffs = r'C:\Data\working\AK_Fishing_VMS\2021_Delivery\ArcPro_proj\ArcPro_proj.gdb\buffers\buffers_testing'
fishnet = r'C:\Data\working\AK_Fishing_VMS\2021_Delivery\ArcPro_proj\ArcPro_proj.gdb\fishnets\vms_net1k'
unionOut = r'C:\Data\working\AK_Fishing_VMS\2021_Delivery\ArcPro_proj\ArcPro_proj.gdb\unions\union'

# Function to get unique values for the YEAR field found within the trawlBuffs fc
def unique_values(table, field):
    with arcpy.da.SearchCursor(table, [field]) as cursor:
        return sorted({row[0] for row in cursor})

# Get the unique values for the field 'YEAR' found within the 'trawl_buffs' feature class table
vals = unique_values(trawlBuffs, "YEAR")

# Loop through the years, create a selection, union, make permanent
for year in vals:
    year_layer = str(year) + "_union"
    # Create a query string for the current year
    yearSelectionClause = '"YEAR" = %d' % year
    arcpy.MakeFeatureLayer_management(trawlBuffs, year_layer)
    arcpy.SelectLayerByAttribute_management(year_layer, "NEW_SELECTION", yearSelectionClause)
    # Union takes a list of input features and an output feature class
    arcpy.Union_analysis([fishnet, year_layer], unionOut + str(year))
    arcpy.CopyFeatures_management(year_layer, "union_" + str(year))
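One caveat (an assumption on my part, since the field type isn't shown): the %d formatting above only works if YEAR is a numeric field. If YEAR is stored as text, the value needs single quotes in the where clause instead, e.g.:

# Build the where clause to match the YEAR field's type
if isinstance(year, str):
    yearSelectionClause = "\"YEAR\" = '%s'" % year   # text field: quote the value
else:
    yearSelectionClause = '"YEAR" = %d' % year       # numeric field: no quotes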

Combining CSV files from Covid data

I want to combine the CSV files from the Johns Hopkins Covid Data (e.g. https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/05-10-2020.csv & https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/01-23-2020.csv).
I already managed to load the files into a DataFrame as well as sanitizing the header (_ vs. / in some names). Now I want to pick one column (e.g. Confirmed), rename it to the day of the file and then combine those CSV files to get a progress over time.
This merge needs to be done by state_province. A key that appears in one frame may be missing from the other. How can I do this? I experimented with rightjoin and outerjoin, but didn't have any success. Can someone point me in the right direction, please?
I initially didn't want to share the code that I have so far because I didn't want to steer toward a specific solution - but here it is. It is copied together from several Jupyter cells.
using Dates

start = Dates.Date(2020, 1, 22)               # begin of recording
now = Dates.Date(Dates.now()) - Dates.Day(1)  # yesterday
date_range = collect(start:Dates.Day(1):now)  # a date range with one element per day

prefix = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/"
suffix = ".csv"

function create_url(date)
    return prefix * Dates.format(date, "mm-dd-YYYY") * suffix
end

function cleanup_column_names(name)
    if name == "Country/Region" || name == "Country_Region"
        return "country"
    elseif name == "Province/State" || name == "Province_State"
        return "state"
    else
        return name
    end
end

using CSV
using HTTP
using DataFrames

selected_data = "Confirmed"
date = date_range[1]
data = DataFrame(CSV.File(HTTP.get(create_url(date)).body))
DataFrames.rename!(cleanup_column_names, data)
DataFrames.select!(data, ["state", "country", selected_data])
DataFrames.rename!(data, 3 => Dates.format(date, "YYYY-mm-dd"))
Regards
Tobias
I am relatively new to Julia, so take my answer with a bit of scepticism:
First, we wrap the DataFrame creation into a function:
function prepare_date_df(date)
    data = DataFrame(CSV.File(HTTP.get(create_url(date)).body))
    DataFrames.rename!(cleanup_column_names, data)
    DataFrames.select!(data, ["state", "country", selected_data])
    DataFrames.rename!(data, 3 => Dates.format(date, "YYYY-mm-dd"))
    return data
end
Let's create our first DataFrame:
df = prepare_date_df(date_range[1])
Now, let's iterate over all the other dates, create a dataframe for each date and merge this with our first dataframe:
for date in date_range[2:end]
    df_new = prepare_date_df(date)
    df = outerjoin(df, df_new, on = [:state, :country])
end
This works fine for the first two months of data, but as the DataFrame grows it suddenly gets very slow (and even hangs?). So I would be very interested in a more performant answer!

spaCy nlp - positions of entities in string, extracting nearby words

Let's say I have a string and want to mark some entities such as organizations.
string = "I was working as a marketing executive for Bank of India, a 4 months.."
string_tagged = "I was working as a marketing executive for [Bank of India], a 4 months.."
I want to identify the words beside the tagged entity.
How can I locate the position of the tagged entity and extract the words beside it?
My code:
import spacy

nlp = spacy.load('en')
doc = nlp(string)
company = doc.text
for ent in doc.ents:
    if ent.label_ == 'ORG':
        company = company[:ent.start_char] + company[:ent.start_char - 1] + company[:ent.end_char + 1]
        print(company)
As I understood from your question, you want the words beside the ORG-tagged token:

import spacy

nlp = spacy.load('en')
#string = "blah blah"
doc = nlp(string)
company = ""
for i in range(1, len(doc) - 1):
    # a token's entity type lives on token.ent_type_
    if doc[i].ent_type_ == 'ORG':
        # previous word, tagged word and the next one
        company = doc[i-1].text + " " + doc[i].text + " " + doc[i+1].text
        print(company)

Be aware of the boundary checks for the first and last tokens.
The following code works for me:

doc = nlp(str_to_be_tokenized)
company = []
for ent in doc.ents:
    if ent.label_ == 'ORG' and ent.text not in company:
        company.append(ent.text)
print(company)

The second condition in the if is there to extract only unique company names from my block of text. If you remove it, you'll get every instance of 'ORG' added to your company list. Hope this'll work for you as well.
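If you also need the neighboring words, spaCy exposes each entity's position directly: ent.start and ent.end are token offsets into the doc (and ent.start_char / ent.end_char are the character offsets). A minimal sketch using that API, reusing the example sentence from the question:

import spacy

nlp = spacy.load('en')  # or 'en_core_web_sm' on newer spaCy versions
doc = nlp("I was working as a marketing executive for Bank of India, a 4 months..")

for ent in doc.ents:
    if ent.label_ == 'ORG':
        # ent.start/ent.end are token indices, so the neighbors sit one step outside the span
        before = doc[ent.start - 1] if ent.start > 0 else None
        after = doc[ent.end] if ent.end < len(doc) else None
        print(ent.text, '| word before:', before, '| word after:', after)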

Creating a function to count the number of pos in a pandas instance

I've used NLTK to pos_tag sentences in a pandas dataframe from an old Yelp competition. This returns a list of tuples (word, POS). I'd like to count the number of parts of speech for each instance. How would I, say, create a function to count the number of being verbs in each review? I know how to apply functions to features - no problem there. I just can't wrap my head around how to count things inside tuples inside lists inside a pd feature.
The head is here, as a tsv: https://pastebin.com/FnnBq9rf
Thank you @zhangyulin for your help. After two days, I learned some incredibly important things (as a novice programmer!). Here's the solution!
def NounCounter(x):
    nouns = []
    for (word, pos) in x:
        if pos.startswith("NN"):
            nouns.append(word)
    return nouns

df["nouns"] = df["pos_tag"].apply(NounCounter)
df["noun_count"] = df["nouns"].str.len()
As an example, for a dataframe df, the noun count of the column "reviews" can be saved to a new column "noun_count" using this code:

from nltk import pos_tag, word_tokenize

def NounCount(x):
    nounCount = sum(1 for word, pos in pos_tag(word_tokenize(x)) if pos.startswith('NN'))
    return nounCount

df["noun_count"] = df["reviews"].apply(NounCount)
df.to_csv('./dataset.csv')
There are a number of ways you can do that, and one very straightforward way is to map the list (or pandas Series) of tuples to an indicator of whether the word is a verb, then count the number of 1's you have.
Assume you have something like this (please correct me if it's not, as you didn't provide an example):
a = pd.Series([("run", "verb"), ("apple", "noun"), ("play", "verb")])
You can do something like this to map the Series and sum the count:
a.map(lambda x: 1 if x[1]== "verb" else 0).sum()
This will return you 2.
I grabbed a sentence from the link you shared:

import nltk
import pandas as pd

text = nltk.word_tokenize("My wife took me here on my birthday for breakfast and it was excellent.")
tag = nltk.pos_tag(text)
a = pd.Series(tag)
a.map(lambda x: 1 if x[1] == "VBD" else 0).sum()
# this returns 2
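Since the original question asked about being verbs specifically, here is a minimal sketch under the same setup as the accepted answer (it assumes df["pos_tag"] holds the lists of (word, POS) tuples). POS tags alone only mark verb form ("VBD", "VBZ", ...), so the sketch combines the VB* prefix with a lookup of the word itself:

BE_FORMS = {"be", "am", "is", "are", "was", "were", "been", "being"}

def BeingVerbCounter(tags):
    # tags is a list of (word, POS) tuples as produced by nltk.pos_tag
    return sum(1 for word, pos in tags
               if pos.startswith("VB") and word.lower() in BE_FORMS)

df["being_verb_count"] = df["pos_tag"].apply(BeingVerbCounter)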

How to make a new variable based on 30 other variables

I have 30 variables on family history of cancer, i.e. breast cancer father, breast cancer mother, breast cancer sister, etc. I would like to make a new variable and give it the value 1 if one of my columns contains a 1.
Thus:
I have 30 variables with answers 1 to 3, where 1 is yes, 2 is no, and 3 is unknown. If one of the 30 variables is given a 1, I would like my new variable to take on the value 1.
Does someone know how I can do this?
You can create a list instead of separate 30 variables and then filter it out to create a new variable. This will make it more dynamic.
// This will be the cancer history for a single family
var cancerHistory = [];

// Add dummy data
cancerHistory.push('yes');
cancerHistory.push('no');
cancerHistory.push('unknown');
cancerHistory.push('no');

// Check if at least one of them is "yes"
var hasHistoryOfCancer = cancerHistory.indexOf('yes') > -1;
alert(hasHistoryOfCancer); // true
You can use a for loop. You did not mention the language, so I am writing the code in Python, which is easy to understand. If you want it in another language, you can take the same approach and apply it:
import pandas as pd

new_var = []
df = pd.read_csv("DataFile.csv")  # convert your data file to CSV and put its name here

for i in range(len(df)):
    # collect the values of all 30 columns for row i
    x = [df['column1'][i], df['column2'][i], ..., df['column30'][i]]
    if 1 in x:
        new_var.append(1)
    else:
        new_var.append(0)

df['new_var'] = new_var
df.to_csv('NewDataFile.csv', sep=',', encoding='utf-8')
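For what it's worth, a more idiomatic pandas alternative avoids the explicit row loop entirely; a sketch assuming the hypothetical column names 'column1' through 'column30' from above:

import pandas as pd

df = pd.read_csv("DataFile.csv")

# True for any row where at least one of the 30 columns equals 1, then cast to 0/1
cols = ["column%d" % i for i in range(1, 31)]
df["new_var"] = df[cols].eq(1).any(axis=1).astype(int)

df.to_csv("NewDataFile.csv", index=False)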