Creating Aeon request batches from finding aids: Difference between revisions

(Created page with "==Extract metadata from the finding aid== # Navigate to the [https://findingaids.folger.edu/ finding aids website]. # Click on the finding aid in question, and then view sourc...")
 
No edit summary
Line 1: Line 1:
==Extract metadata from the finding aid==
==Extract metadata from the finding aid using Python==
# The following steps require Python 3; if you do not already have it installed, it is available to download [https://www.python.org/downloads/ here].
# Navigate to the [https://findingaids.folger.edu/ finding aids website].
# Navigate to the [https://findingaids.folger.edu/ finding aids website].
# Click on the finding aid in question, and then view source. Save this xml as a txt file in your Python folder.
# Click on the finding aid in question, and then view source. Save this xml as a txt file in your Python folder.
Line 12: Line 13:
callnos = []
callnos = []
dates = []
dates = []
descs = []
#extents = []
#watermarks = []
start=0
start=0
end=0
end=0
Line 68: Line 66:
else:
else:
dates.append("null")
dates.append("null")
start=0
end=0
#get extents
#start=item.find("<extent>")
#if start != -1:
#start+=8
#end=item.find("</extent>")
#extent=item[start:end]
#extents.append(extent)
#else:
#extents.append("null")
#start=0
#end=0
#get watermarks
#start=item.find("<physfacet>")
#if start != -1:
# start+=11
# end=item.find("</physfacet>")
# watermark=item[start:end]
# watermarks.append(watermark)
#else:
# watermarks.append("null")
#start=0
#end=0
#minimize to just scope contents note; get descs
start=item.find("<scopecontent")
if start != -1:
item=item[start:]
start=0
start=item.find("<p>")
if start != -1:
start+=3
end=item.find("</scopecontent>")
item=item[start:end]
item=item.strip()
item=item.replace("</p>","")
item=item.replace("<p>"," ")
desc=' '.join(item.split())
desc=desc.replace("<persname>","")
desc=desc.replace("</persname>","")
desc=desc.replace("<famname>","")
desc=desc.replace("</famname>","")
desc=desc.replace("<corpname>","")
desc=desc.replace("</corpname>","")
''.join(desc.splitlines())
descs.append(desc)
else:
descs.append("null")
else:
descs.append("null")
start=0
start=0
end=0
end=0
Line 125: Line 73:
callnofile = open('FAcallnos.txt','w', encoding='utf-8')
callnofile = open('FAcallnos.txt','w', encoding='utf-8')
datefile = open('FAdates.txt','w', encoding='utf-8')
datefile = open('FAdates.txt','w', encoding='utf-8')
descfile = open('FAdescs.txt','w', encoding='utf-8')
#extentfile = open('FAextents.txt','w', encoding='utf-8')
#watermarkfile = open('FAwatermarks.txt','w', encoding='utf-8')


for title in titles:
for title in titles:
Line 135: Line 80:
for date in dates:
for date in dates:
     datefile.write("%s\n" % date)
     datefile.write("%s\n" % date)
for desc in descs:
    descfile.write("%s\n" % desc)
#for extent in extents:
    #extentfile.write("%s\n" % extent)
</pre>
</pre>
<ol start="4">
<ol start="4">
<li>After running this script, you will find 4 text files in your Python folder: FAcallnos.txt, FAdates.txt, FAdescs.txt, and FAtitles.txt; these text files contain the extracted finding aid metadata, and will be used in the following step.</li>
<li>After running this script, you will find 3 text files in your Python folder: FAcallnos.txt, FAdates.txt, and FAtitles.txt; these text files contain the extracted finding aid metadata, and will be used in the following steps.</li>
</ol>
</ol>

Revision as of 10:47, 15 February 2018

Extract metadata from the finding aid using Python

  1. The following steps require Python 3; if you do not already have it installed, you can download it from https://www.python.org/downloads/.
  2. Navigate to the finding aids website (https://findingaids.folger.edu/).
  3. Click on the finding aid in question, and then view source. Save this XML as a .txt file in your Python folder (the folder from which you will run the script).
  4. Run the following Python script on the txt file you just saved, making sure to first edit the file name in the first line of the script to match the file name of your text file:
# Edit the file name below to match the text file you saved.
with open('findingaid.txt','r', encoding='utf-8') as f:
	lines = f.readlines()
item=""
newItem=False
titles = []
callnos = []
dates = []
start=0
end=0
i=0
for line in lines:
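	# Start collecting when an item-level component opens; both substrings must be tested with "in".
	if "<c" in line and "level=\"item\"" in line: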
		newItem=True
	elif "</c>" in line:
		newItem=False
	if newItem==True:
		item=item+line
	elif newItem==False and len(item)>0:
		#get title
		start=item.find("<unittitle>")
		if start != -1:
			start+=11
			end=item.find("</unittitle>")
			title=item[start:end]
			title=title.replace("<persname>","")
			title=title.replace("</persname>","")
			title=title.replace("<famname>","")
			title=title.replace("</famname>","")
			title=title.replace("<corpname>","")
			title=title.replace("</corpname>","")
			title=title.replace("<p>","")
			title=title.replace("</p>","")
			title=title.replace("<title render=\"italic\">","")
			title=title.replace("</title>","")
			title=' '.join(title.split())
			titles.append(title)
		else:
			titles.append("null")
		start=0
		end=0
		#get callno
		start=item.find("<unitid>")
		if start != -1:
			start+=8
			end=item.find("</unitid>")
			callno=item[start:end]
			callnos.append(callno)
		else:
			callnos.append("null")
		start=0
		end=0
		#get date
		start=item.find("<unitdate>")
		if start != -1:
			start+=10
			end=item.find("</unitdate>")
			date=item[start:end]
			dates.append(date)
		else:
			dates.append("null")
		start=0
		end=0
		item=""

# Write each list to its own file, one entry per line; "with" closes the files so all output is flushed.
with open('FAtitles.txt','w', encoding='utf-8') as titlefile:
    for title in titles:
        titlefile.write("%s\n" % title)
with open('FAcallnos.txt','w', encoding='utf-8') as callnofile:
    for callno in callnos:
        callnofile.write("%s\n" % callno)
with open('FAdates.txt','w', encoding='utf-8') as datefile:
    for date in dates:
        datefile.write("%s\n" % date)
  5. After running this script, you will find three text files in your Python folder: FAcallnos.txt, FAdates.txt, and FAtitles.txt. These files contain the extracted finding aid metadata and will be used in the following steps.
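
Because the script writes one line per item to each file (with "null" where a field is missing), line N of each file should describe the same item. The sketch below is one possible way to check that assumption before continuing, and to view the three files side by side. It is not part of the original procedure: the function names and the output file name FAcombined.tsv are hypothetical, and it assumes the three files are in the folder you pass in.

```python
import csv
import os


def read_lines(path):
    """Return the lines of a UTF-8 text file without their trailing newline."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def combine_metadata(folder=".", out_name="FAcombined.tsv"):
    """Check that the three output files line up, then write them as one table.

    out_name is a hypothetical file name chosen for this example.
    Returns a list of (callno, title, date) tuples, one per item.
    """
    titles = read_lines(os.path.join(folder, "FAtitles.txt"))
    callnos = read_lines(os.path.join(folder, "FAcallnos.txt"))
    dates = read_lines(os.path.join(folder, "FAdates.txt"))

    # A mismatch usually means an extracted value contained a line break,
    # so the files no longer correspond line by line.
    if not (len(titles) == len(callnos) == len(dates)):
        raise ValueError(
            "Line counts differ: %d titles, %d call numbers, %d dates"
            % (len(titles), len(callnos), len(dates))
        )

    rows = list(zip(callnos, titles, dates))

    # Write a tab-separated file with a header row for easy review.
    with open(os.path.join(folder, out_name), "w", encoding="utf-8", newline="") as out:
        writer = csv.writer(out, delimiter="\t")
        writer.writerow(["callno", "title", "date"])
        writer.writerows(rows)
    return rows
```

To try it, save the code in your Python folder and call `combine_metadata()`. If it raises a ValueError, inspect the call number and date files first, since the script does not collapse line breaks in those two fields the way it does for titles.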