From dictionary to organized Excel - pandas

Good Morning,
I have a dictionary organized the way shown below, and what I want to do is use the dictionary values as the column number for the key, as shown below:
My first idea was to loop through the dictionary and create a text file where dico_values = tabs, and then transform this new file into an Excel file, but that seems like one step too many. Cheers to all

You could perhaps try this:
new_dico = {value: [] for value in dico.values()}  # {1: [], 2: [], 3: [], ...}
for key, value in dico.items():
    new_dico[value].append(key)
    for otherkey in new_dico.keys():
        if otherkey == value:
            continue
        else:
            new_dico[otherkey].append("")
# new_dico == {1: ["President of ...", "Members of ..", "", ...], 2: ["", "", "President's Office", ...], ...}
# Then you can make a dataframe of 'new_dico' with pandas, for instance
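For that last step, a minimal sketch (the output file name is arbitrary, and writing .xlsx assumes an Excel engine such as openpyxl is installed):

import pandas as pd

# Each key of new_dico becomes a column; the lists were padded with "" above,
# so all columns have equal length.
df = pd.DataFrame(new_dico)
df = df.reindex(sorted(df.columns), axis=1)  # order the columns 1, 2, 3, ...
df.to_excel('organized.xlsx', index=False)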

Related

Removing \xa0 from dataframe in pandas

I have a list of NBA team names that have doubled up. How do I remove the entries with \xa0?
This is the output I get.
['Atlanta Hawks', 'Atlanta Hawks\xa0', 'Boston Celtics', 'Boston Celtics\xa0', etc.
I am removing the seed from the web-scraping process with:
teams["Team"] = teams["Team"].str.replace("(1)", "", regex=False)
teams["Team"] = teams["Team"].str.replace("(2)", "", regex=False)
teams["Team"] = teams["Team"].str.replace("(15)", "", regex=False)
How do I get my list to not include the \xa0 entries?
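One way to handle this, as a sketch (assuming the \xa0 characters are trailing non-breaking spaces and the doubled rows should collapse into one):

# Strip non-breaking spaces and other edge whitespace, then drop the rows
# that have become exact duplicates.
teams["Team"] = teams["Team"].str.replace("\xa0", "", regex=False).str.strip()
teams = teams.drop_duplicates(subset="Team").reset_index(drop=True)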

sort dataframe by string and set a new id

Is there a possibility to adjust the strings according to their order, for example 1.wav, 2.wav, 3.wav etc., and the ID accordingly with ID: 1, 2, 3 etc.? I have already tried several sorting options. Do any of you have any ideas?
Thank you in advance
dataframe output
import os
from pathlib import Path

import pandas as pd

def createSampleDF(audioPath):
    data = []
    for file in Path(audioPath).glob('*.wav'):
        print(file)
        data.append([os.path.basename(file), file])
    df_dataSet = pd.DataFrame(data, columns=['audio_name', 'filePath'])
    df_dataSet['ID'] = df_dataSet.index + 1
    df_dataSet = df_dataSet[['ID', 'audio_name', 'filePath']]
    df_dataSet.sort_values(by=['audio_name'], inplace=True)
    return df_dataSet

def createSamples(myAudioPath, savePath, sampleLength, overlap=0):
    # cutSamples is defined elsewhere in the asker's code
    cutSamples(myAudioPath=myAudioPath, savePath=savePath, sampleLength=sampleLength)
    df_dataSet = createSampleDF(audioPath=savePath)
    return df_dataSet
You can split the string, convert it to an integer, and then sort on multiple columns. See pandas.DataFrame.sort_values for more info. If your filenames are more complicated you may need to design a regex to pull out the integers you want to sort on, using pandas.Series.str.extract.
df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'audio_name': ['1.wav', '10.wav', '96.wav', '3.wav', '55.wav']})

(df
 .assign(audio_name=lambda df_: df_.audio_name.str.split('.', expand=True).iloc[:, 0].astype('int'))
 .sort_values(by=['audio_name', 'ID']))
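And a sketch of the str.extract variant, with hypothetical filenames; the pattern just pulls out the first run of digits:

df = pd.DataFrame({'audio_name': ['clip_1.wav', 'clip_10.wav', 'clip_3.wav']})

# Extract the first run of digits, sort on it numerically, then renumber the
# IDs in the new order.
df['num'] = df['audio_name'].str.extract(r'(\d+)', expand=False).astype(int)
df = df.sort_values('num').drop(columns='num').reset_index(drop=True)
df['ID'] = df.index + 1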

pandas: how to remove duplicates from a deeply nested list of lists

I have a pandas dataframe like the following:
import pandas as pd

df = pd.DataFrame({'text': ['the weather is nice though', 'How are you today', 'the beautiful girl and the nice boy']})
df['sentence_number'] = df.index + 1
df['token'] = df['text'].str.split().tolist()
df = df.explode('token').reset_index(drop=True)
I have to have a column for tokens as I need it for another project. I have applied the following to my dataframe.
import spacy

nlp = spacy.load("en_core_web_sm")
dep_children_sm = []

def dep_children_tagger(txt):
    children = [[[child for child in n.children] for n in doc] for doc in nlp.pipe(txt)]
    dep_children_sm.append(children)

dep_children_tagger(df.text)
Since one has to apply the n.children method at the sentence level, I have to use the text column and not the token column, so the output contains repeated lists. I would now like to remove these repetitions from my list 'dep_children_sm', and I have done the following:
import itertools

children_flattened = [item for sublist in dep_children_sm for item in sublist]
list(k for k, _ in itertools.groupby(children_flattened))
but nothing happens, and I still have the repeated lists. I have also tried adding drop_duplicates() to the text column when calling the function, but the problem is that I have duplicate sentences in my original dataframe and unfortunately cannot do that.
desired output = [[[], [the], [weather, nice, though], [], []], [[], [How, you, today], [], []], [[], [], [the, beautiful, and, boy], [], [], [], [the, nice]]]
It seems that you want to apply your function to the unique texts. Thus you can first use the pandas.Series.unique method on df.text:
>>> df['text'].unique()
array(['the weather is nice though', 'How are you today',
       'the beautiful girl and the nice boy'], dtype=object)
Then I would simplify your function to directly output the result. There is no need for a global list. Also, your function was adding an extra level of list, which seems unwanted.
def dep_children_tagger(txt):
    return [[[child for child in n.children] for n in doc] for doc in nlp.pipe(txt)]
Finally, apply your function on the unique texts:
dep_children_sm = dep_children_tagger(df['text'].unique())
This gives:
>>> dep_children_sm
[[[], [the], [weather, nice, though], [], []],
 [[], [How, you, today], [], []],
 [[], [], [the, beautiful, and, boy], [], [], [], [the, nice]]]
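If you later need each sentence's result back on every row of the original dataframe (which still contains the duplicates), one possible follow-up, purely as a sketch: the texts are strings, so they can key a dict.

unique_texts = df['text'].unique()
children_by_text = dict(zip(unique_texts, dep_children_tagger(unique_texts)))
df['children'] = df['text'].map(children_by_text)  # same result for duplicate texts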
OK, I figured out how to solve this issue. The problem was that the pipeline outputs a list of lists of lists of spaCy tokens, and since at no point is there any string in this nested list, the itertools approach does not work.
Since I cannot remove the duplicates from the text column in my analysis, I have done the following instead:
d = [' '.join([str(c) for c in lst]) for lst in children_flattened]
list(set(d))
This outputs a list of strings, excluding duplicates:
# ['[] [How, you, today] [] []',
# '[] [the] [weather, nice, though] [] []',
# '[] [] [the, beautiful, and, boy] [] [] [] [the, nice]']
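A small variation in case first-seen order matters: set() does not guarantee any order, whereas regular dicts preserve insertion order in Python 3.7+, so dict.fromkeys can deduplicate without reshuffling.

# Deduplicate while keeping the order in which the sentences first appear
deduped = list(dict.fromkeys(d))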

UnboundLocalError in Pandas

My task is to clean the data and present it in the following format: DataFrame([["Michigan", "Ann Arbor"], ["Michigan", "Yipsilanti"]], columns=["State", "RegionName"]).
I'm getting an error. How can I fix it?
"UnboundLocalError: local variable 'state' referenced before assignment"
The code is the following:
with open('university_towns.txt') as file:
    data = []
    for line in file:
        data.append(line[:-1])

state_town = []
for line in data:
    if line[:-6] == '[edit]':
        state = line[:-6]
    elif '(' in line:
        town = line[:line.index('(') - 1]
        state_town.append([state, town])
    else:
        town = line
        state_town.append([state, town])

state_college_df = pd.DataFrame(state_town, columns=['State', 'RegionName'])
return state_college_df
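The likely culprit is the slice in the first test: line[:-6] drops the last six characters rather than selecting them, so the comparison with '[edit]' never succeeds, state is never assigned, and the first append raises the UnboundLocalError. A minimal corrected sketch (the wrapper function name is hypothetical; the bare return suggests the snippet lives inside one):

import pandas as pd

def get_state_college_df():  # hypothetical name for the enclosing function
    state_town = []
    with open('university_towns.txt') as file:
        for raw in file:
            line = raw.rstrip('\n')
            if line[-6:] == '[edit]':   # select the LAST six characters
                state = line[:-6]       # then strip the '[edit]' suffix
            elif '(' in line:
                state_town.append([state, line[:line.index('(') - 1]])
            else:
                state_town.append([state, line])
    return pd.DataFrame(state_town, columns=['State', 'RegionName'])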

How can I get the contents of the second page as well with Scrapy for the scenario below?

I have a spider that needs to fetch an array of objects where each object has 5 items. 4 items are on the same page and the 5th item is a URL which I need to extract data from, returning all 5 items as text. In the code snippet below, explanation is the key that lies on the other page. I need to parse that page and add its data alongside the other attributes while yielding the item.
My current solution, when exported to a JSON file, shows up as follows. As you can see, my "e" is not resolved. How do I get the data?
[
    {
        "q": "How many pairs of integers (x, y) exist such that the product of x, y and HCF (x, y) = 1080?",
        "c": [
            "8",
            "7",
            "9",
            "12"
        ],
        "a": "Choice (C).9",
        "e": "<Request GET http://iim-cat-questions-answers.2iim.com/quant/number-system/hcf-lcm/hcf-lcm_1.shtml>",
        "d": "Hard"
    }
]
import scrapy

class CatSpider(scrapy.Spider):
    name = "catspider"
    start_urls = [
        'http://iim-cat-questions-answers.2iim.com/quant/number-system/hcf-lcm/'
    ]

    def parse_solution(self, response):
        yield response.xpath('//p[@class="soln"]').extract_first()

    def parse(self, response):
        for lis in response.xpath('//ol[@class="z1"]/li'):
            questions = lis.xpath('.//p[@lang="title"]/text()').extract_first()
            choices = lis.xpath(
                './/ol[contains(@class, "xyz")]/li/text()').extract()
            answer = lis.xpath(
                './/ul[@class="exp"]/li/span/span/text()').extract_first()
            explanation = lis.xpath(
                './/ul[@class="exp"]/li[2]/a/@href').extract_first()
            difficulty = lis.xpath(
                './/ul[@class="exp"]/li[last()]/text()').extract_first()
            if questions and choices and answer and explanation and difficulty:
                yield {
                    'q': questions,
                    'c': choices,
                    'a': answer,
                    'e': scrapy.Request(response.urljoin(explanation), callback=self.parse_solution),
                    'd': difficulty
                }
Scrapy is an asynchronous framework, which means none of its elements are blocking. So a Request object does nothing by itself; it only stores info for the Scrapy downloader, which means you cannot just put it in an item and expect it to download something, as you're doing right now.
The common solution for this is to design a crawl chain, carrying your data through callbacks:
def parse(self, response):
    item = dict()
    item['foo'] = 'foo is great'
    next_page = 'http://nextpage.com'
    return Request(next_page,
                   callback=self.parse2,
                   meta={'item': item})  # put our item in meta

def parse2(self, response):
    item = response.meta['item']  # take our item from the meta
    item['bar'] = 'bar is great too!'
    return item
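Applied to the spider above, that chain might look like this (a sketch reusing the original XPaths; parse_solution now fills in 'e' and yields the finished item):

import scrapy

class CatSpider(scrapy.Spider):
    name = "catspider"
    start_urls = [
        'http://iim-cat-questions-answers.2iim.com/quant/number-system/hcf-lcm/'
    ]

    def parse(self, response):
        for lis in response.xpath('//ol[@class="z1"]/li'):
            item = {
                'q': lis.xpath('.//p[@lang="title"]/text()').extract_first(),
                'c': lis.xpath('.//ol[contains(@class, "xyz")]/li/text()').extract(),
                'a': lis.xpath('.//ul[@class="exp"]/li/span/span/text()').extract_first(),
                'd': lis.xpath('.//ul[@class="exp"]/li[last()]/text()').extract_first(),
            }
            explanation = lis.xpath('.//ul[@class="exp"]/li[2]/a/@href').extract_first()
            if all(item.values()) and explanation:
                # Carry the partial item to the solution page instead of
                # storing the Request object itself in the item.
                yield scrapy.Request(response.urljoin(explanation),
                                     callback=self.parse_solution,
                                     meta={'item': item})

    def parse_solution(self, response):
        item = response.meta['item']  # retrieve the partial item
        item['e'] = response.xpath('//p[@class="soln"]').extract_first()
        yield item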