Repetitive task needs to be automated, don't know where to s

From Visual Basic to GNU C, this is the place to talk programming.

Moderators: SecretSquirrel, just brew it!

Repetitive task needs to be automated, don't know where to s

Postposted on Fri Jan 20, 2012 3:03 pm

To explain the problem:

I am writing code for Stata. The code reads in data from a text file, then assigns variable names like "Price" to whats read in as ER34556 so I actually know what the variables are. I have multiple text files of data, corresponding to different years in a panel survey. Variables translate across years, but have different names. ie ER34556 is "Price" in 2007 and ER45566 is "Price" in 2009. There exists online documentation for this data, where i can search "ER34556" and it will return a list of variables across years that correspond to this, ie I will get "[05]ER25588 [07]ER34556 [09]ER45566" as part of the returned information. I have written the code for 2009 and cut and pasted to previous years. What I want to do is this: Take the nonsensical variable name "ER..." from my stata code (written as a text file) and search the online documentation. Then, use the returned list of variables corresponding to other years to update my code written for other years. As I see it (please correct me if im wrong at any point), this will require several steps

  1. Extract the variables names from the 2009 code (commands are written: generate varname = ER...), so i would need everything to the right of the equals sign.
  2. Take each "ER" variable, and search the documentation.
  3. Extract the list that is returned
  4. Search the stata code for additional instances of the "ER" variable
  5. Assign the new variable name from the documentation on a year by year basis.
  6. Repeat above as needed


This should be possible, but I have no idea where to start! (languages, methods, etc) Any help at all would be greatly appreciated! If more information is needed to answer the question, please let me know. Thanks in advance.
TML
Gerbil
 
Posts: 19
Joined: Fri Jan 20, 2012 2:58 pm

Re: Repetitive task needs to be automated, don't know where

Postposted on Sat Jan 21, 2012 4:02 pm

wouldn't this entire parsing of your text file be easier done with using a SQL (mysql/postgresql/MSSQL/oracle/whatever) over using all those useless variables?

Even if you had to use variables, put them ALL inside ONE variable and make it an array/class/object of some sort/type...
thegleek
Darth Gerbil
Gold subscriber
 
 
Posts: 7359
Joined: Tue Jun 10, 2003 11:06 am
Location: Detroit, MI

Re: Repetitive task needs to be automated, don't know where

Postposted on Sun Jan 22, 2012 6:50 am

The tricky bit is querying the online documentation to get the variable names. Is there no way to download the whole thing as a file? Obviously it's perfectly possible to write some kind of robot or scraper than could download the information for you but (and absolutely no offence intended) I'd tend to think it's probably a little above your current level as a programmer.

The rest of the stuff is pretty standard string manipulation.

Just in case you feel like giving it a shot here's a page that covers the basic first step in building your scraper (IE filling out a form on a web page and receiving the results) in python, php and java.
http://stackoverflow.com/questions/1116921/html-form-submission-in-python-and-php-is-simple-can-a-novice-do-it-in-java

If you manage to get that code working it returns an object that represents the page with the search results on it so your next job is to get the actual data out of all the html junk that does the page formatting. I've seen people do this using regular expressions but without seeing the page you're getting the data from it's hard to say if this is the best way to go about it.

As to what language to use... that's really up to you. This kind of thing can be achieved in most programming languages and you'd probably be best off picking one you already know. Personally I'm a python fan but there's always quite a steep learning curve right at the start of learning a new language (even one as nice as python) and while this project isn't super hard it's a couple of levels beyond a complete beginner in any language.
Fernando!
Your mother ate my dog!
cheesyking
Minister of Gerbil Affairs
 
Posts: 2244
Joined: Sun Jan 25, 2004 7:52 am
Location: That London (or so I'm told)

Re: Repetitive task needs to be automated, don't know where

Postposted on Mon Jan 23, 2012 11:40 am

Thanks for the help cheesy king. No offense taken, I realize where I fall in the hierarchy of programming experience, and it isn't high up the ladder. As far as the HTML stuff returned, it can be found at http://simba.isr.umich.edu/VS/s.aspx. I need to search for something like er43564. What I need can be found clicking on the little exclamation point next to the name in the returned record at the bottom of the page. Suggestions on how to proceed from here? Again, if I had some idea what to do next, I wouldn't be asking you guys. Thanks again.
TML
Gerbil
 
Posts: 19
Joined: Fri Jan 20, 2012 2:58 pm

Re: Repetitive task needs to be automated, don't know where

Postposted on Mon Jan 23, 2012 2:50 pm

TML wrote:I realize where I fall in the hierarchy of programming experience, and it isn't high up the ladder.

I'm not that high up it myself really :wink:

Since the web scraping bit is the hardest thing to do I'd have a look and see if there's a way round it. Looking at that site there is also this page:
http://simba.isr.umich.edu/VS/f.aspx

EDIT: BTW you have to expand out the bits of the list you're interesting in before you can see the data in the page source.

Now without knowing anything about the study of socioeconomics it looks like you can get complete lists of the variable names on there. If you just view the source of the page (ctrl + u on chrome and firefox) you should see them all listed. If you copy and paste all that into a decent text editor (like notepad++) you will be able to delete most of the rubbish for it by hand leaving you with a file just holding stuff like:
Code: Select all
<option value="ER42017">ER42017 - AGE OF HEAD</option><option value="ER42018">ER42018 - SEX OF HEAD</option><option value="ER42019">ER42019 - AGE OF WIFE</option><option value="ER42020">ER42020 - # CHILDREN IN FU</option><option value="ER42021">ER42021 - AGE YOUNGEST CHILD</option>

which is the actual data

each year begins with a section like this:
Code: Select all
<IMG border=0 align=absmiddle src="/PowerupTree/Graphics/D.gif"><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder3$Var_File1$MyTree','t:3')"><IMG border=0 align=absmiddle src="/PowerupTree/Graphics/K.gif"></a></TD><TD>&nbsp;</TD><td nowrap><font class='MyNode'>2007:      8,289 observations      5,069 variables</font></TD></TR></TABLE><TABLE border=0 cellpadding=0 cellspacing=0><TR><TD><IMG border=0 align=absmiddle src="/PowerupTree/Graphics/D.gif"><IMG border=0 align=absmiddle src="/PowerupTree/Graphics/D.gif"><IMG border=0 align=absmiddle src="/PowerupTree/Graphics/D.gif"><IMG border=0 align=absmiddle src="/PowerupTree/Graphics/E.gif"></TD><TD>&nbsp;</TD><td nowrap><font class='MyNode'>


It should be a lot easier to write script that can work with a simplified file rather than trying to talk directly to their website (partly because their website is so slow).

If you want to give it a go here's a little example script I wrote for tidying up a long list of email address. It read the original list from a CSV file, removed "<" and ">" which had gotten round some of the addresses and tried to work out what the recipients' names were before spewing it into a new csv file. It show most of the basic things you'll need to do (reading and writing files, working with strings).
Code: Select all
import csv
import string

def cleanAddr(addr):
    "remove < and > from around the address"
    addr = addr.replace("<", "")
    return addr.replace(">", "")

def getNames(addr):
    "extracts the names for addresses in the form first.last@..."
    names = addr.split("@")[0]
    if "." in names:
        n = string.capwords(names, ".").split(".")
    elif "_" in names:
        n = string.capwords(names, "_").split("_")
    else:
        #fall back to just using the bit before the @
        n = [names, " "]
    return n


infile = open("hirelist1.csv", "rb")
#expected format: col1=firstname col2=lastname col3=email
dialect = csv.Sniffer().sniff(infile.read(1024))
infile.seek(0)
csvReader = csv.reader(infile, dialect)

outfile = open("hirelist2.csv", "wb")
#output format: col1=firstname col2=lastname col3=email
csvWriter = csv.writer(outfile, dialect="excel")

addresses = []
adups = 0       #number of times an address is duplicated
ddups = {}      #dictionary of domain names and numbers of addresses on each

for row in csvReader:
    fname = row[0]
    lname = row[1]
    email = cleanAddr(row[2])
   
    if len(fname) == 0:
        names = getNames(email)
        fname = names[0]
        lname = names[1]
   
    if email not in addresses:
        addresses.append(email)
        csvWriter.writerow([fname, lname, email])
    else:
        adups += 1
        print "duplicate: "+email
   
    domain = email.split("@")[1]
    if domain not in ddups:
        ddups[domain] = 1
    else:
        ddups[domain] += 1

print "There were "+str(adups)+" duplicate addresses"
for ddup in iter(ddups):
    if ddups[ddup] > 1:
        #print a list of domains receiving more than one email
        print ddup, ddups[ddup]
outfile.close()


It doesn't do anything a clever excel user couldn't have done but I'm not a clever excel user! (and I know I'm not a python guru either but it works)

If you're already familiar with another language like VB then you might want to stick with that. If you're not already familiar with another language and you think learning one might be handy then you might want to have a look at python (probably best to stick with version 2 for as that's what most of the beginners' guides cover).

This isn't impossibly hard to do but it is a bit ambitious for your first ever bit of programming in a new language!

Good luck!
Fernando!
Your mother ate my dog!
cheesyking
Minister of Gerbil Affairs
 
Posts: 2244
Joined: Sun Jan 25, 2004 7:52 am
Location: That London (or so I'm told)

Re: Repetitive task needs to be automated, don't know where

Postposted on Mon Jan 23, 2012 3:48 pm

Thanks cheesyking. I know about that part of the psid website, but the advantage, as I see it, to searching for the variable is that it compiles which years the variable is available (ie defined in exactly the same way, because it does change) and the variable names associated with those years. On the part of the website you posted, I don't know, without going through the documentation and comparing variable definitions, if the variable is defined differently across years, but has the same name.

The website is mind numbingly slow, but it seems to take the same amount of time to search for multiple variables, so I could create an object with the remaining variable names and search for that, with each variable separated by a space, no? Then just go through the entire list that is returned.
TML
Gerbil
 
Posts: 19
Joined: Fri Jan 20, 2012 2:58 pm

Re: Repetitive task needs to be automated, don't know where

Postposted on Tue Jan 24, 2012 7:35 am

TML wrote:Thanks cheesyking. I know about that part of the psid website, but the advantage, as I see it, to searching for the variable is that it compiles which years the variable is available (ie defined in exactly the same way, because it does change) and the variable names associated with those years. On the part of the website you posted, I don't know, without going through the documentation and comparing variable definitions, if the variable is defined differently across years, but has the same name.


Actually looking at it again it might be even easier to do it this way. It appears you don't actually have to use the search box to get the result. Just requesting a url like:
Code: Select all
http://simba.isr.umich.edu/cb.aspx?vList=ER43564

brings up the information for ER43564 in a relatively small page that should be "easy" to process. No form submission required which not only makes the code a little simpler but should also reduce the load on their site a bit. On that subject you should make sure any script you write pauses between making requests to the site so you don't accidentally bring it to its knees! You might also want to check that doing this kind of thing isn't against their terms and conditions.

So

1) write a script to get a list of all the variable names you've already used and put them in a nice simple list (possibly in a CSV file so you can view it in excel, perhaps: col1=name in your code | col2=2009 name | col3=2008 name etc)

2) write another script that downloads the url: http://simba.isr.umich.edu/cb.aspx?vList=2009 name (it's the 2009 ones that you know IIRC) and saves it as a file (technically you don't actually need to save the whole page I'm just thinking that if you muck the next step up you can try again without having to hit their website).

3) finally write a script that reads the files you created above that extracts the contents of the table cell:
Code: Select all
<td style="font-size:10pt;text-align:left;width"> ... </td>

which looks like this:
Code: Select all
[03]ER22574  [05]ER26555  [07]ER37573  [09]ER43564

which it then puts back into the csv file you created in the first step.

(NB I'm not suggesting doing it this was because it's a clever or efficient method, I'm just trying to break it down in small achievable steps for a novice)

Here's a page about a python script that could be (heavily) modified to do what you want:
http://www.tcu-inc.com/Articles/13/LT_Python.html
what it does is download a page and extract all the links on it before saving them in a text file (so it's actually doing steps 2 and 3 in one hit but only for a single address. You'd have to call this from another loop that steps through the url for each variable name.

If that script looks terrifying then you might want to go to somewhere like odesk or rentacoder and pay someone to do it for you. It's probably not more than 3-4 hours work for an experienced coder while I'd probably spend a day on it and you as a complete novice... who can say, it depends how well you take to it (though you've obviously not stupid :wink: )
Fernando!
Your mother ate my dog!
cheesyking
Minister of Gerbil Affairs
 
Posts: 2244
Joined: Sun Jan 25, 2004 7:52 am
Location: That London (or so I'm told)

Re: Repetitive task needs to be automated, don't know where

Postposted on Fri Jan 27, 2012 1:14 pm

Cheesyking, you are officially, the man! Your help was invaluable. While others, both on this forum and another, were saying things like "Why don't you just use MySQL to do this?" you actually paid attention to the post and realized I needed a bit of hand-holding. Thanks!

I figured I would post my code, in case other rookies might need something similar. I realize it's not pretty, elegant, efficient or probably the best way to go about it, but it works. And for me, that's all that matters.

In the first bit, I get the variable name I want to assign to each of these, and the 2009 variable name the data folks assigned to it.
Code: Select all
def extract_vars(infile, outfile, begin_line, end_line):
    """Extracts the variables from lines of code of the form 'gen varname = var'.  i.e. varname and var are output to a text file delimited by a comma (,).  Takes the input file name, output file name, the line from the input file where the code begins and the line from the input file where the code ends as arguments."""
   
    #open files,  create list with lines of variable names and numbers
    inf = open(infile)
    outf = open(outfile, 'w')
    lines = inf.readlines()

    #loop through all lines, extract variable name, 2009 number, and write to .csv file.
    for i in range(begin_line - 1, end_line):
        words = lines[i].split()
        var_name = words.pop(1)
        var = words.pop(-1)
        outf.write(var_name + ", " + var + "\n")

    #close files.
    outf.close()
    inf.close()

extract_vars("trial_in.txt", "trial_out.txt", 56, 103)


In the second piece, I go to the interwebs and get the previous years names for each variable, saving each page to a text file, and also creating a .csv file where each line is my_var_name, [09]ER99878, [07]ER87654, etc. The link cheesyking posted pretty much wrote the first part of the code, with some minor tweaks.

Code: Select all
# One obvious thing to do is apply error checking for url download,
# download must contain at least one entry, and we are able to create the
# new file. This will be done later.

  ### import the web module, string module, regular expression,  module
  ### and the os module
 
import urllib, string, re, os

  ### define the new webpage we create and where to get the info
 
var_list = open("trial_out.txt")
vlist = var_list.readlines()
var_list.close()
outfile = open("trial_out_final.txt", 'w')

Download_Location = "VariableDL"

for line in vlist:
   
    words = line.split()
    var = words.pop(-1)
    Url = "http://simba.isr.umich.edu/cb.aspx?vList=" + var

    #-----------------------------------------------------------
      ### Create a web object with the Url

    var_table = urllib.urlopen( Url )

      ### Grab all the info into an array (if big, change to do one line at a time)

    Text_Array =  var_table.readlines()
    var_file  = open(Download_Location + "\Var_" + var + ".txt", 'w');
    for l in Text_Array:
        # Extract relevant data from PSID HTML file.
        if '<td style="font-size:10pt;text-align:left;width">' in l:
            years = l
    # Save data to files for (potential) future use.
    var_file.writelines(years[73:-7])
    var_file.close()

    # Prepare and write data to csv table.
    v = years[73:-7].split()
    v.reverse()
    v.pop(0)
    line = line.rstrip('\n')
   
    for year in v:
        line += ", " + year

    outfile.write(line + "\n")

outfile.close()


And finally, going back to my stata code file and editing lines. This one is highly dependent on the structure of my stata code in particular, but it still might be helpful.

Code: Select all
var_file = open("trial_out_final.txt")
var_lines = var_file.readlines()
code_file = open("trial_in.txt")
code_lines = code_file.readlines()
outfile = open("trial_code_edit.txt", 'w')
years = range(1978,2000)
years.extend(range(2001,2010,2))
years.reverse()
year = "0000"


# loop over all years for which we have variables
for j, y in enumerate(years):
    if y != 2009:

        # loop over every line of code in our main file
        for i, codeline in enumerate(code_lines):
            if codeline != "\n":
               
                if codeline[1:-2] == str(y):
                    year = str(y)
                elif codeline[1:-2] == str(years[j - 1]):
                    year = "0000"
                else:
                    year = year
               
                # split line from main file into strings
                code_vars = codeline.split()
       
                # loop over every variable in our vars file, so we can compare to the code from our main file
                for varline in var_lines:
       
                    #split variables into separate strings
                    var = varline.split()
                   
                    if len(code_vars) > 1:
                        var[0] = var[0].rstrip(',')
                       
                        # if the variable name from the main code matches the variable name from our variable file AND the year from our main file matches the list of years
                        if var[0] == code_vars[1] and year == str(y):
                           
                            # loop over variables from var file, comparing if year from main file matches the year included witht the variable
                            for v in var:
                                if v[1:3] == year[2:4]:
   
                                    # replace the variable definition from main file with that from variable file.
                                    code_vars[-1] = v[4:].rstrip(',')
       
                # replace line in code with new line "code_vars"
                code_lines[i] = ""
                for cvar in code_vars:
                    code_lines[i] += cvar + " "

                code_lines[i] = code_lines[i].rstrip()
                code_lines[i] += "\n"

# Write code_lines back to file
outfile.writelines(code_lines)

# close files
outfile.close()
var_file.close()
code_file.close()


I hope someone else finds this helpful, and thanks again cheesyking.
TML
Gerbil
 
Posts: 19
Joined: Fri Jan 20, 2012 2:58 pm

Re: Repetitive task needs to be automated, don't know where

Postposted on Fri Jan 27, 2012 3:04 pm

Cool, really glad to be of help :D

Python is a really nice language isn't it.

EDIT:
How long did it take you to get it all working?
Fernando!
Your mother ate my dog!
cheesyking
Minister of Gerbil Affairs
 
Posts: 2244
Joined: Sun Jan 25, 2004 7:52 am
Location: That London (or so I'm told)

Re: Repetitive task needs to be automated, don't know where

Postposted on Mon Jan 30, 2012 1:16 pm

Python is awesome! My coding experience with "real" languages was limited to java, which is much more of a pain than Python. It took about 2 days of actually working on the problem to get everything operational, but I spent a couple of days working through a Python tutorial just to get up to speed. Definitely a worthy time investment. Would have taken as long or longer to do it by hand, and I wouldn't have learned anything new.
TML
Gerbil
 
Posts: 19
Joined: Fri Jan 20, 2012 2:58 pm


Return to Developer's Den

Who is online

Users browsing this forum: No registered users and 2 guests