How to partition by in pandas and output to a Word doc? - pandas

I have a table I have filtered from API data: my highlights from across the web. Ultimately, I want to output these to a doc file, grouped by the page they came from.
I have the API data filtered down to two columns:
url|quote
How do I, for each url, output its quotes to a doc file? Or, just for starters, how do I iterate through the set of quotes for each url?
In SQL it would be something like this
Select quote over(partition by url) as sub_header
From table
url | quote
https://jotengine.com/transcriptions/WIUL8HBabqxffIDOkUA9Dg | I actually think that the bigger problem is not necessarily having the ideas. I think everyone has lots of interesting ideas. I think the bigger problem is not killing the bad ideas fast enough. I have the most respect for the Codecademy founders in this respect. I think they tried 12 ideas in seven weeks or something like that, in the summer of YC.
https://jotengine.com/transcriptions/WIUL8HBabqxffIDOkUA9Dg | We were like what the heck is going on here so we went and visited five of our largest customers in New York, this was about three years ago and we said okay, you're using the S3 integration but what the heck are you using it for? For five out of five customers in a row, they said well we have a data engineering team that's taking data from the S3 bucket, converting it into CS view files and managing all the schema-translations and now they're uploading it into a data warehouse like Redshift. The first time I heard that from a customer, I was like okay, that's interesting
I want to output a url header followed by all the quotes I've highlighted. Ideally my final product will be in docx

It would be great if you could provide some source code to help explain your problem. From looking at your question, I would say all you need to do is put your columns into a DataFrame, then export this to Excel.
df = pd.DataFrame({"url": url, "quote": quote})
df.to_excel("filename.xlsx", index=False)  # export both columns so the url is kept
Hope this helps.
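If you do want the grouped docx output rather than Excel, here is a minimal sketch of that approach, assuming the python-docx package is installed and that url and quote are your two filtered columns (the output file name is just a placeholder):
import pandas as pd
from docx import Document  # pip install python-docx

df = pd.DataFrame({"url": url, "quote": quote})

doc = Document()
for page_url, group in df.groupby("url", sort=False):
    doc.add_heading(page_url, level=2)  # one header per source page
    for q in group["quote"]:
        doc.add_paragraph(q)            # every highlight from that page
doc.save("highlights.docx")
This gives you one heading per url followed by all of its quotes, which matches the partition-by idea from the question.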

Related

Organising CSV file data in Python

I am quite a beginner with Python, but I have a programming-related project to work on, so I would really like to ask for some help. I didn't find many simple solutions for organizing the data in such a way that I could do some analysis with it.
First, I have multiple csv files, which I read in as DataFrame objects. In the end, I need to analyze them all together (right now the files are kept separate in a list of DataFrames, but later on I will probably need them as one DataFrame object).
However, I have a problem with organizing and separating the data. There are thousands of rows in one column; a part of it is shown below:
CIP;Date;Hour;Cons;REAL/ESTIMATED
EN025140855608477018TC2L;11/03/2020;1;0 057;R
EN025140855608477018TC2L;11/03/2020;2;0 078;R
EN025140855608477018TC2L;11/03/2020;3;0 033;R
EN025140855608477018TC2L;11/03/2020;4;0 085;R
EN025140855608477018TC2L;11/03/2020;5;0 019;R
...
EN025140855608477018TC2L;11/04/2020;20;0 786;R
EN025140855608477018TC2L;11/04/2020;21;0 288;R
EN025140855608477018TC2L;11/04/2020;22;0 198;R
EN025140855608477018TC2L;11/04/2020;23;0 728;R
EN025140855608477018TC2L;11/04/2020;24;0 275;R
The number with the large gap in the middle should be merged together, for example 0.057; that value represents "Cons" (actually it is the most important information).
I should be able to split the data into 5 columns in order to proceed with the analysis. However, it should be a universal tool for different csv files, without knowing in advance exactly which symbols they contain. The structure of the content and the header is always the same.
I would be happy if anyone could recommend a way to work with this kind of data.
Sounds like what you are trying to do is convert the Cons column so that the space becomes a decimal point.
import pandas as pd

df = pd.read_csv("file.txt", sep=";")
# collapse the whitespace inside e.g. "0 057" into a dot -> "0.057"
df['Cons'] = df['Cons'].str.replace(r"\s+", ".", regex=True)
df['Cons'].head()
Output:
0 0.057
1 0.078
2 0.033
3 0.085
4 0.019
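If you also need to pull the multiple csv files together into one DataFrame for the analysis, a rough sketch could look like the following; the data/*.csv pattern and the float conversion are assumptions on my part:
import glob
import pandas as pd

frames = []
for path in glob.glob("data/*.csv"):  # wherever your csv files live
    df = pd.read_csv(path, sep=";")
    # "0 057" -> "0.057", then to a real number so it can be analysed
    df["Cons"] = df["Cons"].str.replace(r"\s+", ".", regex=True).astype(float)
    frames.append(df)

all_data = pd.concat(frames, ignore_index=True)  # one DataFrame for all files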

UniData - record count of all files / tables

Looking for a shortcut here. I am pretty adept with SQL database engines and ERPs. I should clarify... I mean databases like MS SQL, MySQL, PostgreSQL, etc.
One of the things that I like to do when I am working on a new project is to get a feel for what is being utilized and what isn't. In T-SQL this is pretty easy. I just query the information schema and get a row count of all the tables and filter out the ones having rowcount = 0. I know this isn't truly a precise row count, but it does give me an idea of what is in use.
So I recently started at a new company and one of their systems is running on UniData. This is a pretty radical shift from mainstream databases and there isn't a lot of help out there. I was wondering if anybody knew of a command to do the same thing listed above in UniBasic/UniQuery/whatever else.
Which tables (files) are heavily populated, and which ones are not?
You can start with a special "table" (or file in Unidata terminology) named VOC - it will have a list of all the other files that are in your current "database" (aka account), as well as a bunch of other things.
To get a list of files in (or pointed to) the current account:
:SORT VOC WITH F1 = "F]" "L]" "DIR" F1 F2
Try HELP CREATE.FILE if you're curious about the difference between F and LF and DIR.
Once you have a list of files, weed out the ones named *TEMP* or *WORK* and start digging into the ones that seem important. There are other ways to get at what's important (e.g using triggers or timestamps), but browsing isn't a bad idea to see what conventions are used.
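For example, if you capture that listing to a plain text file, a quick Python filter can weed out the scratch files; this is only a sketch, and it assumes one file name per line in a file called voc_files.txt:
# Drop the scratch files from a saved listing of VOC file names
skip_words = ("TEMP", "WORK")

with open("voc_files.txt") as f:
    names = [line.strip() for line in f if line.strip()]

interesting = [n for n in names if not any(w in n.upper() for w in skip_words)]
print("\n".join(interesting))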
Once you have a file that looks interesting (let's say CUSTOMERS), you can look at the dictionary of that file to see what fields it contains:
:SORT DICT CUSTOMERS F1 F2 BY F1 BY F2 USING DICT VOC
It can help to create something like F2.LONG in DICT VOC to increase the display size up from 15 characters.
Now that you have a list of "columns" (aka fields or attributes), you're looking for the D-type attributes, which tell you what columns are actually stored in the file. V- or I-types are calculations.
https://github.com/ianmcgowan/SCI.BP/blob/master/PIVOT is helpful with profiling when you see an attribute that looks interesting and you want to see what the data looks like.
http://docs.rocketsoftware.com/nxt/gateway.dll/RKBnew20/unidata/previous%20versions/v8.1.0/unidata_userguide_v810.pdf has some generally good information on the concepts and there are many other online manuals available there. It can take a lot of reading to get to the right thing if you don't know the terminology.

Pig: how to loop through all fields/columns?

I'm new to Pig. I need to do some calculation for all fields/columns in a table. However, I can't find a way to do it by searching online. It would be great if someone here can give some help!
For example: I have a table with 100 fields/columns, most of them numeric. I need to find the average of each field/column; is there an elegant way to do it without repeating AVG(column_xxx) 100 times?
If there's just one or two columns, then I can do
B = GROUP A ALL;
C = FOREACH B GENERATE AVG(A.column_1), AVG(A.column_2);
However, if there are 100 fields, it's really tedious to write AVG 100 times, and it's easy to make errors.
One way I can think of is to embed Pig in Python and use Python to generate a string like that and pass it into compile. However, that still sounds odd even if it works.
Thank you in advance for help!
I don't think there is a nice way to do this with Pig. However, this should work well enough and can be done in 5 minutes:
Describe the table (or alias) in question
Copy the output and reorganize it manually into the script part you need (for example, with Excel)
Finish and store the script
If you need to cope with columns that can suddenly change, etc., there is probably no good way to do it in Pig. Perhaps you could read in all the columns (in R, for example) and do your operation there.
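As the question already hints, another option is to generate the script text from Python. Here is a minimal sketch along those lines; the alias, input path and column names are placeholders rather than anything from your actual data:
# Build the repetitive AVG(...) list instead of typing it 100 times
columns = ["column_%d" % i for i in range(1, 101)]  # placeholder column names

schema = ", ".join("%s:double" % c for c in columns)
avgs = ",\n    ".join("AVG(A.%s) AS avg_%s" % (c, c) for c in columns)

script = (
    "A = LOAD 'input_data' USING PigStorage(',') AS (%s);\n" % schema
    + "B = GROUP A ALL;\n"
    + "C = FOREACH B GENERATE\n    %s;\n" % avgs
    + "STORE C INTO 'averages';\n"
)

with open("averages.pig", "w") as f:
    f.write(script)
You can then run the generated averages.pig with the normal pig command, or paste the GENERATE part back into your existing script.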

Display ALL countries in an autocomplete form

First time posting on here because Google is yielding no results!
So, I have a website that is based around travelling and locations. Every time someone enters content into the site, they select a location, and that content then has lat and long, country, etc.
The issue I face is that I have a DB of all the "cities and areas" of the world and there are a good 3.5 million records in the database I believe.
My question to you is how would you guys recommend doing a 1 field autocomplete form for all the cities? I don't need advice on the autocomplete form itself, I need advice on HOW and WHERE I should be storing the data... text files? SQL? Up until now, I have been using SQL but I don't know how it should be done. Would an AJAX autoloader be able to handle it if I only returned 100 records or so? Should all the results be preloaded?
Thanks for your help guys!
EDIT: I have actually found another way to do it. I found this awesome little plugin to integrate Google Maps with it
http://xilinus.com/jquery-addresspicker/demos/index.html
Fantastic.
Benny
I have a few thoughts here:
Since you don't know whether a user will enter the English or local (native) name, each city record in your database should have both. Make sure to index these fields.
Do not do auto-complete until you have a minimum number of characters. Otherwise, you will match way too many rows in your table. For example, assuming an even distribution of English characters (26), at 3.5 million records you would statistically get the following number of matches per prefix length:
1 char = 135k
2 char = 5.2k
3 char = 200
4 char = 8
If you are using MySQL, you will want to use the LIKE operator with a trailing wildcard (e.g. name LIKE 'par%').
There are much more advanced methods for predictive matching, but this should be a good start.
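For what it's worth, those estimates are just 3.5 million divided by 26^n. A quick sketch of the arithmetic, plus the kind of prefix pattern you would feed to LIKE (the 'par' prefix is only an example):
# Expected matches per prefix length, assuming evenly distributed letters
# (real place names are not evenly distributed, so treat this as a rough guide)
total_rows = 3_500_000
for n in range(1, 5):
    print(n, "chars ->", round(total_rows / 26 ** n), "expected matches")

prefix = "par"          # whatever the user has typed so far
pattern = prefix + "%"  # e.g. ... WHERE city_name LIKE 'par%'
print(pattern)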

Adding custom data to GapMinder

Does anyone have any experience adding their own data to GapMinder, the really cool software that Hans Rosling uses in his TED talks? I have an array of objects in JSON that would be easy to show as moving bubbles. This would be really cool.
I can see that my Ubuntu box has what looks like data in /opt/Gapminder Desktop/share/assets/graphs/world, but I would need to figure out:
How to add a measure to a graph
How to add a data series
How to set the time range of the data
Identify the measures to follow at each time step
and so on.
Just for the record: if you want to use Gapminder with your own dataset, you have to convert your data into a format suitable for Gapminder. More specifically, looking in assets/graphs/world, you will have to:
Edit the file overview.xml, which contains the tree structure of all the indicators (just copy/paste an entry and specify your own data);
Convert your data by copying the structure of the XML files in that directory (this is the tricky part): you can specify some metadata in the preamble and then specify your own data series, with something like:
<t1 m="i20,50.0,99.0,1992" d="90.0, ... ,50.0, ..."/> where i20 is the country id, which is followed by the minima and maxima of the series, and the year it refers to.
In my humble opinion, Gapminder is a great app, but it definitely needs more work on integration with other datasets. It is way better to use Google Motion Chart, as you did, or MooGraph (site and doc), which is unfortunately not as great as Gapminder.
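For illustration only, here is a rough sketch of how such data lines could be generated, purely inferred from the format described above; the country ids and values are made-up placeholders:
# Writes one <t1 .../> line per series, following the pattern
# m="<country id>,<min>,<max>,<first year>" d="<comma-separated values>"
series = {
    "i20": (1992, [90.0, 75.5, 60.2, 50.0]),  # hypothetical country id and values
    "i35": (1992, [30.0, 32.1, 35.8, 40.0]),
}

lines = []
for country_id, (first_year, values) in series.items():
    meta = "%s,%s,%s,%d" % (country_id, min(values), max(values), first_year)
    data = ", ".join(str(v) for v in values)
    lines.append('<t1 m="%s" d="%s"/>' % (meta, data))

print("\n".join(lines))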
@Stefano
the information you provided is very valuable. Is there a detailed specification of the XML data files available anywhere?
Anyway, just to enrich your response, I also found that:
overview.xml file
The link between Nations and their IDs is in this file
The structure of the menus for the selection of the indicators is also in the same file (at the bottom) under the section <indicatorCategorization>
The structure of the datafile XML
For each line, the year represents the first year of the series, and then the values follow, one per year, comma separated.
Grazie,
Max
I ended up using the Google Motion Chart API, and this is what I came up with.