Cheesyking, you are officially the man! Your help was invaluable. While others, both on this forum and another, were saying things like "Why don't you just use MySQL to do this?", you actually paid attention to the post and realized I needed a bit of hand-holding. Thanks!
I figured I would post my code, in case other rookies might need something similar. I realize it's not pretty, elegant, efficient or probably the best way to go about it, but it works. And for me, that's all that matters.
In the first bit, I get the variable name I want to assign to each of these, and the 2009 variable name the data folks assigned to it.
def extract_vars(infile, outfile, begin_line, end_line):
    """Extracts the variables from lines of code of the form
    'gen varname = var'. i.e. varname and var are output to a text file
    delimited by a comma (,). Takes the input file name, output file name,
    the line from the input file where the code begins and the line from
    the input file where the code ends as arguments."""
    #open files, create list with lines of variable names and numbers
    inf = open(infile)
    outf = open(outfile, 'w')
    lines = inf.readlines()
    #loop through all lines, extract variable name, 2009 number, and write to .csv file.
    for i in range(begin_line - 1, end_line):
        words = lines[i].split()
        var_name = words.pop(1)
        var = words.pop(-1)
        outf.write(var_name + ", " + var + "\n")
    #close files.
    outf.close()
    inf.close()

extract_vars("trial_in.txt", "trial_out.txt", 56, 103)
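If anyone wants to adapt that first step, here is a rough sketch of the same idea for Python 3 that works on a list of lines instead of files, so it is easy to try out. The names `extract_pairs` and `format_rows` and the sample lines (including the ER numbers) are made up for illustration. It assumes, like my code above, that each line looks like 'gen varname = var', and it skips lines that are too short instead of crashing on them.

```python
def extract_pairs(lines, begin_line, end_line):
    """Return (var_name, var) pairs from lines begin_line..end_line.

    Line numbers are 1-based and inclusive, as in extract_vars above.
    """
    pairs = []
    for line in lines[begin_line - 1:end_line]:
        words = line.split()
        # A usable line needs at least: gen, varname, =, var.
        if len(words) < 4:
            continue
        pairs.append((words[1], words[-1]))
    return pairs


def format_rows(pairs):
    """Format pairs as 'var_name, var' rows, the layout the next step reads."""
    return ["%s, %s\n" % (name, var) for name, var in pairs]


# Hypothetical sample standing in for a few lines of the Stata file.
sample = ["* header\n", "gen famid = ER42002\n", "\n", "gen age = ER42017\n"]
pairs = extract_pairs(sample, 2, 4)
rows = format_rows(pairs)
print("".join(rows))
```

To write the rows to disk, something like `with open("trial_out.txt", "w") as outf: outf.writelines(rows)` would close the file automatically.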
In the second piece, I go to the interwebs and get the previous years' names for each variable, saving each page to a text file and also creating a .csv file where each line is my_var_name, [09]ER99878, [07]ER87654, etc. The link cheesyking posted pretty much wrote the first part of the code, with some minor tweaks.
# One obvious thing to do is apply error checking for url download,
# download must contain at least one entry, and we are able to create the
# new file. This will be done later.
### import the web module, string module, regular expression module
### and the os module
import urllib, string, re, os
### define the new webpage we create and where to get the info
var_list = open("trial_out.txt")
vlist = var_list.readlines()
var_list.close()
outfile = open("trial_out_final.txt", 'w')
Download_Location = "VariableDL"
for line in vlist:
    words = line.split()
    var = words.pop(-1)
    Url = "http://simba.isr.umich.edu/cb.aspx?vList=" + var
    #-----------------------------------------------------------
    ### Create a web object with the Url
    var_table = urllib.urlopen( Url )
    ### Grab all the info into an array (if big, change to do one line at a time)
    Text_Array = var_table.readlines()
    # Build the path with os.path.join rather than a hard-coded backslash.
    var_file = open(os.path.join(Download_Location, "Var_" + var + ".txt"), 'w')
    for l in Text_Array:
        # Extract relevant data from PSID HTML file.
        if '<td style="font-size:10pt;text-align:left;width">' in l:
            years = l
    # Save data to files for (potential) future use.
    var_file.write(years[73:-7])
    var_file.close()
    # Prepare and write data to csv table.
    v = years[73:-7].split()
    v.reverse()
    v.pop(0)
    line = line.rstrip('\n')
    for year in v:
        line += ", " + year
    outfile.write(line + "\n")
outfile.close()
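The fixed slice [73:-7] only works as long as the page layout never changes, so here is a sketch (Python 3) of a less fragile way to pull the entries out, plus the "must contain at least one entry" check I mentioned putting off. It is only a sketch: the helper names `parse_year_tags` and `build_row` are made up, the sample row is not the real page markup, and I am assuming every entry looks like [09]ER99878 and that the page lists the oldest year first.

```python
import re

# Matches entries such as "[09]ER99878": a two-digit year in square
# brackets followed by the variable name.
YEAR_TAG = re.compile(r"\[(\d{2})\](\w+)")


def parse_year_tags(html_line):
    """Return a list of (two_digit_year, variable_name) from one HTML line."""
    tags = YEAR_TAG.findall(html_line)
    if not tags:
        # The download must contain at least one entry.
        raise ValueError("no [yy]NAME entries found in line")
    return tags


def build_row(var_name, tags):
    """Build one csv row, newest year first (assumes tags are oldest first)."""
    newest_first = list(reversed(tags))
    fields = [var_name] + ["[%s]%s" % (yy, name) for yy, name in newest_first]
    return ", ".join(fields)


# Hypothetical row; the real PSID markup may differ.
row = '<td style="font-size:10pt">[07]ER87654 [09]ER99878</td>'
tags = parse_year_tags(row)
csv_row = build_row("my_var_name", tags)
print(csv_row)
```

Because the pattern looks for the entries themselves, it does not care how long the surrounding tag is.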
And finally, I go back to my Stata code file and edit the lines. This one is highly dependent on the structure of my Stata code in particular, but it still might be helpful.
var_file = open("trial_out_final.txt")
var_lines = var_file.readlines()
code_file = open("trial_in.txt")
code_lines = code_file.readlines()
outfile = open("trial_code_edit.txt", 'w')
# list() keeps this working where range() does not return a list
years = list(range(1978, 2000))
years.extend(range(2001, 2010, 2))
years.reverse()
year = "0000"
# loop over all years for which we have variables
for j, y in enumerate(years):
    if y != 2009:
        # loop over every line of code in our main file
        for i, codeline in enumerate(code_lines):
            if codeline != "\n":
                if codeline[1:-2] == str(y):
                    year = str(y)
                elif codeline[1:-2] == str(years[j - 1]):
                    year = "0000"
                else:
                    year = year
                # split line from main file into strings
                code_vars = codeline.split()
                # loop over every variable in our vars file, so we can compare to the code from our main file
                for varline in var_lines:
                    #split variables into separate strings
                    var = varline.split()
                    if len(code_vars) > 1:
                        var[0] = var[0].rstrip(',')
                        # if the variable name from the main code matches the variable name from our variable file AND the year from our main file matches the list of years
                        if var[0] == code_vars[1] and year == str(y):
                            # loop over variables from var file, comparing if year from main file matches the year included with the variable
                            for v in var:
                                if v[1:3] == year[2:4]:
                                    # replace the variable definition from main file with that from variable file.
                                    code_vars[-1] = v[4:].rstrip(',')
                            # replace line in code with new line "code_vars"
                            code_lines[i] = ""
                            for cvar in code_vars:
                                code_lines[i] += cvar + " "
                            code_lines[i] = code_lines[i].rstrip()
                            code_lines[i] += "\n"
# Write code_lines back to file
outfile.writelines(code_lines)
# close files
outfile.close()
var_file.close()
code_file.close()
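The heart of that last script is two small operations: find the entry for a given year in one row of the variable file, and swap it in as the last word of a line of code. Here is a sketch of just those two pieces in Python 3, using a dictionary instead of the inner loop. The names `year_lookup` and `replace_last_word` are made up, and the sample row and code line are invented to match the format described above.

```python
def year_lookup(varline):
    """Split one row of the variable file into (name, {yy: variable})."""
    fields = [field.strip() for field in varline.split(",")]
    name, tagged = fields[0], fields[1:]
    lookup = {}
    for tag in tagged:
        # An entry looks like "[07]ER87654": tag[1:3] is the two-digit
        # year and everything after the closing bracket is the name.
        if tag.startswith("[") and len(tag) > 4 and tag[3] == "]":
            lookup[tag[1:3]] = tag[4:]
    return name, lookup


def replace_last_word(codeline, new_var):
    """Return codeline with its last word replaced by new_var."""
    words = codeline.split()
    words[-1] = new_var
    return " ".join(words) + "\n"


# Hypothetical row and code line in the format used above.
name, lookup = year_lookup("my_var_name, [09]ER99878, [07]ER87654\n")
new_line = replace_last_word("gen my_var_name = ER99878\n", lookup["07"])
print(new_line)
```

Entries without a [yy] prefix are simply ignored, so a row that starts with the bare 2009 name would still parse.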
I hope someone else finds this helpful, and thanks again, cheesyking.