
Google AppEngine Tale using python

For months I have been putting off finishing an application that runs on the App Engine stack. In my free time I have been working on a Java project for App Engine that uses the Google Web Toolkit (GWT) APIs (GWT, gwt-map), Guice, and the Google App Engine Java SDK. At the rate I kept postponing, I don't know when I would ever have finished that project. But last Thursday things changed: prompted by a course assignment, I chose to play around with App Engine.

Here I go through the woes and successes of getting an application running on the Google App Engine stack in a total of 20 hours.
My goal was simple: write a Google App Engine application that scrapes Uganda's voter lists from the electoral website and stores them in the Google datastore for easier searching. I am not an experienced Python developer, but I know enough to get done whatever I need in Python. The whole final code is attached below.

To start with, I needed to scrape the districts, constituencies, sub-counties, parishes and polling stations, because that is the information required to download the PDFs listing the voters at a specific polling station. So I wrote code that scrapes the locations above and pickles them to a file, for later transformation into CSV format and upload with the bulkloader. This was straightforward: a threaded program that downloads the information and stores it in a pickled file. Extract below:-

import threading

class ScrapThread(threading.Thread):
    # Shared across all worker threads: a lock guarding the index of the
    # next district to scrape. (threading.Lock, not multiprocessing.Lock,
    # since these are threads in one process.)
    __scrapThreadLockObject = threading.Lock()
    __currentDistrictIndex = 0

    def __init__(self, districts):
        self.__districts = districts
        super(ScrapThread, self).__init__()

    def run(self):
        while not self.isScrappingComplete():
            currentIndex = None
            ScrapThread.__scrapThreadLockObject.acquire()
            try:
                if not self.isScrappingComplete():
                    currentIndex = ScrapThread.__currentDistrictIndex
                    ScrapThread.__currentDistrictIndex += 1
            finally:
                ScrapThread.__scrapThreadLockObject.release()

            if currentIndex is None:
                return

            # Scrape everything under this district (constituencies,
            # sub-counties, parishes, polling stations).
            district = self.__districts[currentIndex]
            districtScrap = VotersScrapper(district)

    def isScrappingComplete(self):
        return ScrapThread.__currentDistrictIndex >= len(self.__districts)

    @staticmethod
    def getCurrentDistrictIndex():
        return ScrapThread.__currentDistrictIndex
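The scraped locations later have to become CSV files for the bulkloader. As a minimal sketch of that transformation step (the function name and the two-column layout are illustrative, not my exact code), using only the standard csv module:

```python
import csv
import io

def districts_to_csv(districts):
    """Serialize (key, name) district pairs into bulkloader-ready CSV.

    Sketch only: the real models have more columns, but the shape is the
    same -- the key goes in the first column, and the header names must
    match the external_name values declared in bulkloader.yaml.
    """
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(['key', 'name'])  # headers matching external_name
    for key, name in districts:
        writer.writerow([key, name])
    return out.getvalue()
```

The same shape applies for constituencies, parishes and polling stations, each written to its own CSV file.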

After pickling the information from district level down to polling stations, it was time to get dirty with App Engine. I used the Google launcher to create a skeleton application, added it to Wing IDE, added the models, and created a few pages:-

  • Search: allows searching for voters by surname, other names, and village, using SearchableModel
  • Entity Browser: allows browsing the lists of districts, constituencies, parishes, etc.
  • Scrapping: this page is configured in the cron yaml file to be called every minute; it fetches a voter PDF from the electoral website, decrypts it, and extracts the voters using pyPdf, which I had to customize slightly to achieve this
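Once pyPdf has yielded the text, each voter record still has to be turned into typed values. Here is a hedged sketch of that step; it assumes (hypothetically) the record is already split into the five fields that persistVoters later consumes, in the order surname, other names, birth date (dd-mm-yyyy), sex, village:

```python
import datetime

def parse_voter_record(fields):
    """Convert one extracted voter record into typed values.

    `fields` is assumed to be [surname, other_names, 'dd-mm-yyyy',
    'M' or 'F', village] -- the layout persistVoters expects.
    """
    surname, other_names, birth, sex, village = fields
    try:
        day, month, year = birth.split('-')
        birth_date = datetime.date(int(year), int(month), int(day))
    except (ValueError, IndexError):
        birth_date = None  # malformed dates are stored as unknown
    return {
        'surname': surname,
        'other_names': other_names,
        'birth_date': birth_date,
        'is_male': sex == 'M',
        'village': village,
    }
```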

I made good use of the Django templating engine that ships with the webapp framework to create HTML templates.

Not everything went well for me; there were issues where I spent a couple of hours figuring out why things didn't work as expected:-

  1. Uploading entities was a nightmare. Without the kind of detailed documentation and samples I am used to on MSDN, I had to read the code that transforms CSVs and uploads the data, going back and forth through the errors. To get it to finally work:-

    • I generated CSVs with the same number of columns as the models, with a key in the first column, and included headers with the same names as specified by the external_name property in the bulkloader.yaml file
    • I edited my bulkloader.yaml file to transform each kind's columns as defined in the model
    • I specified key_is_id=True for fields that are reference properties.
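For reference, a bulkloader.yaml transformer following those three points ends up looking roughly like this (the kind and property names here are illustrative, not my exact models):

```yaml
transformers:
- kind: District
  connector: csv
  property_map:
    - property: __key__
      external_name: key
      export_transform: transform.key_id_or_name_as_string
    - property: name
      external_name: name

- kind: Constituency
  connector: csv
  property_map:
    - property: __key__
      external_name: key
      export_transform: transform.key_id_or_name_as_string
    - property: district
      external_name: district
      import_transform: transform.create_foreign_key('District', key_is_id=True)
    - property: name
      external_name: name
```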
  2. Fetching URLs kept failing with a DownloadError 5 exception. It took time to figure out that I had to increase the default urlfetch timeout from 5 seconds to 10 seconds.
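The fix itself is just passing deadline=10 to urlfetch.fetch instead of relying on the 5-second default. As an App-Engine-free sketch of the same retry-with-a-larger-deadline idea, where `fetch` is a stand-in for google.appengine.api.urlfetch.fetch (everything here is illustrative):

```python
def fetch_with_retries(fetch, url, deadlines=(5, 10)):
    """Try each deadline in turn; return the first successful response.

    `fetch` stands in for urlfetch.fetch, which accepts a `deadline`
    keyword; timeouts surface as exceptions (DownloadError on GAE).
    """
    last_error = None
    for deadline in deadlines:
        try:
            return fetch(url, deadline=deadline)
        except Exception as e:
            last_error = e  # too slow at this deadline; try a longer one
    raise last_error
```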
  3. When I fetch a polling station's PDF voter list, extract the information, and save it to the datastore, I need the whole operation to be atomic, so I run it in a transaction. Each PDF holds roughly 800+ voters, but the maximum number of entities you can save in one put is 500, so I save them in batches. Extract below:-

    class ScrapPage(BasePage):
        def OnLoad(self):
            self.templateName = None
            try:
                # Pick the highest-priority polling station not yet scraped.
                pollingStation = PollingStation.gql(
                    "where scrapped = :1 order by scrapped, priority DESC",
                    False).get()
                if pollingStation:
                    voterPages = VotersListScrapper().getPollingStationVoters(pollingStation)
                    if len(voterPages) > 0 and len(voterPages[0]) > 0:
                        db.run_in_transaction(self.persistVoters, voterPages, pollingStation)
            except Exception, e:
                logging.error(str(e))

        def persistVoters(self, voters, pollingStation):
            currentCount = 0
            newVoters = []
            for voterPage in voters:
                for voter in voterPage:
                    newVoter = Voter(parent=pollingStation)
                    newVoter.surname = voter[0]
                    newVoter.other_names = voter[1]
                    birthDate = voter[2].split('-')
                    try:
                        # Dates arrive as dd-mm-yyyy.
                        newVoter.birth_date = datetime.date(
                            int(birthDate[2]), int(birthDate[1]), int(birthDate[0]))
                    except (ValueError, IndexError):
                        newVoter.birth_date = None
                    newVoter.is_male = voter[3] == 'M'
                    newVoter.village = voter[4]
                    newVoter.polling_station = pollingStation.key()
                    newVoters.append(newVoter)

                    currentCount = currentCount + 1
                    if currentCount == 400:
                        # Flush well below the 500-entity limit per put.
                        db.put(newVoters)
                        newVoters = []
                        currentCount = 0

            pollingStation.scrapped = True
            newVoters.append(pollingStation)
            db.put(newVoters)
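The batching logic in persistVoters can also be factored into a small generic helper; a sketch (the 400 default keeps each batch comfortably under the datastore's 500-entity limit per put):

```python
def chunked(items, size=400):
    """Yield successive lists of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # the final, possibly short, batch
```

persistVoters could then simply call db.put on each chunk of new Voter entities.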
    

I will continue editing this page to add more hurdles I faced.
I have attached the whole project.

  • the scripts folder has batch files for uploading data and deploying the application
  • datasetup.py has the code to download districts down to polling stations and save them into CSV files

My short experience with Google App Engine has taught me to always build proof-of-concept apps before recommending a technology. You get a chance to get acquainted with the technology and its shortfalls. If you're new to Google App Engine, as I am, take some time to learn what you can and can't achieve with GAE before mounting a big project on it.

The final project can be seen in action at ugvoters.appspot.com

Download

Categories: Programming
  1. December 1, 2010 at 12:45 am

    Hi Joseph,

    Thanks for this write-up! This feedback is incredibly useful for us in improving our documentation.

    – Ikai

