Building a website of keywords

This is a short project exploring how to extract keywords and generate webpages programmatically (full project on GitHub). It's an older project that I cleaned up a bit recently.

I first scan through the abstracts from the ACM Journals (stored in a .txt file), pulling out all words, ignoring stopwords, and correcting common misspellings.

```python
for i in abstracts:
    # replace noise characters
    abs_words = i.replace(';', ' ').replace(',', ' ').replace(':', ' ').replace('-', ' ').replace('.', ' ') \
        .replace('(', ' ').replace(')', ' ').replace('{', ' ').replace('}', ' ').replace('0', ' ').replace('1', ' ') \
        .replace('2', ' ').replace('3', ' ').replace('4', ' ').replace('5', ' ').replace('6', ' ').replace('7', ' ') \
        .replace('8', ' ').replace('9', ' ').replace("'", ' ').replace('"', ' ').replace(']', ' ').replace('[', ' ') \
        .replace('“', ' ').replace('”', ' ').replace('?', ' ').replace('=', ' ').replace('&', ' ').split()
    for word in abs_words:
        word = word.lower()
        # if word is in misspellings list, replace with correct spelling
        for key in corrections:
            word = word.replace(key, corrections[key])
        # check word is not in stopwords list
        if word not in stopwords_list:
            keywords.append(word)
```
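The long `replace()` chain can also be collapsed into a single regular expression; a minimal sketch (`clean_words` is a hypothetical helper name, and the pattern is slightly broader, stripping every non-letter character rather than only the listed ones):

```python
import re

def clean_words(abstract):
    # turn every run of non-letter characters into a single space,
    # covering the punctuation, digits, and smart quotes handled
    # one-by-one in the replace() chain above
    return re.sub(r"[^A-Za-z]+", " ", abstract).lower().split()
```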

This list of words is then counted and sorted, keeping only the top 35 keywords.
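`collections.Counter` handles the counting and sorting in one step; a sketch, assuming the `keywords` list built above (`top_n_keywords` is a hypothetical helper):

```python
from collections import Counter

def top_n_keywords(keywords, n=35):
    # most_common returns (word, count) pairs, highest count first
    return [word for word, count in Counter(keywords).most_common(n)]
```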

The abstracts are in one big text file, so I parse them into a list, with each abstract in its own list item.

```python
# until we reach the end of the file
while aline != '':
    # create a (new) blank list for the article
    article = []
    # until we reach the blank line between articles (or the end of file,
    # in case the file has no trailing blank line)
    while aline != '\n' and aline != '':
        aline = aline.rstrip('\n')
        # append to the list called article
        article.append(aline)
        # read the next line
        aline = abstracts.readline()
    # get the whole abstract together
    new_entry = ' '.join(article)
    # append the individual entry to the overall list
    abstract_list.append(new_entry)
    # read in the next line
    aline = abstracts.readline()
```
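An alternative that avoids the nested `while` loops is to read the whole file at once and split on the blank lines between entries; a sketch under the same one-blank-line-per-separator assumption (`parse_abstracts` is a hypothetical helper):

```python
def parse_abstracts(text):
    # each abstract is a block of lines; blocks are separated by a blank line
    blocks = text.split('\n\n')
    # rejoin the lines of each block into a single string, skipping empties
    return [' '.join(block.splitlines()).strip() for block in blocks if block.strip()]
```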

Next, I build a page for each keyword, listing the abstracts that contain that keyword.

```python
# Building pages for each keyword
# take each keyword
for word in top_keywords:
    indices = []
    count = 0
    # compare each keyword to the abstracts
    for listing in abstract_list:
        articleholder = []
        # if the keyword is in the abstract, append it to a list
        # of articles for that keyword
        if word in str(listing):
            # then split on the " that surround the titles
            articleholder = listing.split('"')
            # take the second part of the split (title) and add a link
            # use rstrip to remove ending commas
            link = '<a href="article' + str(count) + '.html">' + \
                str(articleholder[1]).rstrip(',') + '</a>'
            # join the first part of the split (author) and the second part (title)
            # to create a listing
            key_listing = str(articleholder[0]).rstrip(', ') + '<br>' + link
            # append this to a list for the keyword page
            indices.append(key_listing)
            # create a page for the abstract, with line breaks
            abs_page = '<u> Abstract</u>' + '<br> <br>' + \
                '<i>' + str(articleholder[0]).rstrip(', ') + '</i>' + '<br>' + \
                '<b>' + str(articleholder[1]).rstrip(',') + '</b>' + '<br> <br>' + \
                str(articleholder[2])
            # write out HTML for that article
            filename = 'article' + str(count) + '.html'
            keyword_file = open(filename, "w")
            keyword_file.write(abs_page)
            keyword_file.close()
        # increment per abstract (not per match) so article numbering
        # stays consistent across keyword pages
        count += 1
    # build the keyword's abstract listing page,
    # including line breaks between abstracts
    page = '\n \n <br><br> \n'.join(indices)
    ### Print out the files
    filename = word + '.html'
    keyword_file = open(filename, "w")
    keyword_file.write(page)
    keyword_file.close()
```
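The `split('"')` step depends on every listing quoting its title; for a string shaped like the ones above, the three pieces come out as author, title, and abstract body (the input below is made up purely for illustration):

```python
# a made-up listing in the assumed author, "title", abstract shape
listing = 'A. Author, "A Sample Title", This is the abstract body.'
author, title, body = listing.split('"')
# author -> 'A. Author, '
# title  -> 'A Sample Title'
# body   -> ', This is the abstract body.'
```

Note that this unpacking only works when the listing contains exactly one quoted title; a title that itself contains a `"` would break the split.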

Finally, I bring it all together with a simple index page.

```python
# build HTML document
html_begin = """<!DOCTYPE html>
<html>
<body>

<h1>Welcome to the ACM Library</h1>
<h3>A research, discovery and networking platform</h3>
<h2>Browse our library by keyword.</h2>
<h3 id="keyword">Keywords</h3>

<ul>
"""

html_end = """</ul>

</body>
</html>"""

# build HTML doc
html_str = html_begin + li_list + html_end

# write out HTML
html_file = open("index.html", "w")
html_file.write(html_str)
html_file.close()

abstracts.close()
```
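The concatenation above assumes `li_list`, a string of `<li>` items pointing at the keyword pages, which isn't shown; one plausible way to build it from `top_keywords` (a hypothetical sketch, `build_li_list` is not the code from the project):

```python
def build_li_list(top_keywords):
    # one <li> per keyword, linking to the matching keyword page
    items = ['<li><a href="' + word + '.html">' + word + '</a></li>'
             for word in top_keywords]
    return '\n'.join(items)
```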

Next Steps

If I continue this project, I'd like to explore creating directories for the keyword pages and the abstract pages (conditionally, if the directories don't already exist). I could also come back to improve the HTML or add some CSS to make the pages look better.
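The conditional directory creation is built into the standard library; a sketch with hypothetical directory names:

```python
import os

for directory in ('keyword_pages', 'article_pages'):
    # exist_ok=True makes this a no-op when the directory already exists
    os.makedirs(directory, exist_ok=True)
```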

I would also go back and break up main() into a number of smaller functions.