
Project0

Python Warm-Up: Parsing HTML Files

Due Wed Jan 30 at 11:59pm in your git repository.  Your repo should look like this:

<username>/cs345/project0/find_links.py

I will be automating the testing of your code. If you don't follow this convention, my tests will fail and you won't get credit.

This project is intended to get you familiar with Python and some of its built-in functions and modules.

Your goal is to write a Python program called find_links.py that takes one or more HTML files as arguments and finds all of the anchor links, e.g.,

<a href="http://www.cs.usfca.edu">USF Computer Science</a>

Your program should also do the following:
  • Find repeated links and report the link count
  • Sort the link output alphabetically by link name by default.
  • Accept a command line argument called -bycount that will sort the link list by link count, from most to least.  If links have the same count, then they should be sorted alphabetically.
  • You need only look for anchor links with absolute href addresses.
  • You should find both http:// and https://
  • Look for both single and double quotes around href strings, e.g., href="http://yahoo.com" and href='http://yahoo.com'.
  • Allow for spaces around the = sign in the href attribute, e.g., href = "http://yahoo.com".
  • You should strip leading and trailing whitespace inside a link.  E.g., "http://yahoo.com  " should become "http://yahoo.com".
  • You can assume the '-bycount' argument is the first argument if it exists.
  • You should handle both upper and lower case tags, e.g., <A HREF="..."></A> and <a href="..."></a>.
  • Links must appear in quotes, e.g., href="http://www.yahoo.com"
For example:

$ ./find_links.py foo.html bar.html
"http://www.cs.usfca.edu", 7
"http://www.google.com", 2

$ ./find_links.py -bycount foo.html bar.html
"http://www.cs.usfca.edu", 7
"http://www.google.com", 2
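
The matching rules in the list above could be captured with a single regular expression from the standard re module, which is not an HTML parsing module. The following is only a minimal sketch of one possible reading of those rules, written for Python 3. The names HREF_RE and extract_links are hypothetical and are not part of the assignment.

```python
import re

# Hypothetical pattern (an assumption, not the required solution):
#   <a\s+             an anchor tag, upper or lower case via IGNORECASE
#   (?:[^>]*?\s)?     any other attributes that come before href
#   href\s*=\s*       href with optional spaces around the = sign
#   (["'])            an opening single or double quote
#   \s*(https?://...) an absolute http or https address, leading spaces skipped
#   \s*\1             trailing spaces, then the same kind of quote
HREF_RE = re.compile(
    r"""<a\s+(?:[^>]*?\s)?href\s*=\s*(["'])\s*(https?://[^"'<>]*?)\s*\1""",
    re.IGNORECASE,
)


def extract_links(html):
    """Return every quoted absolute link found in anchor tags, in document order."""
    return [match.group(2) for match in HREF_RE.finditer(html)]


sample = "<A HREF='http://yahoo.com  '>one</A> <a href = \"http://yahoo.com\">two</a>"
print(extract_links(sample))  # ['http://yahoo.com', 'http://yahoo.com']
```

Relative addresses and unquoted href values are skipped by this pattern, which matches the rules that only absolute links in quotes need to be found.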

Finally, you need to provide a function called

def get_links(files, bycount=False):
    # Implementation

that returns a list of tuples, where each tuple is in the form of (link, count), e.g.,

[('http://www.cs.usfca.edu', 7), ('http://www.google.com', 2)]

This will be used for automated testing of your solution.
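
The two orderings that get_links has to support could be sketched as below, assuming the links have already been counted. The helper name sort_link_counts is hypothetical, and the counts are made-up sample data chosen so that the two orderings differ.

```python
from collections import Counter


def sort_link_counts(counts, bycount=False):
    """Turn a {link: count} mapping into a sorted list of (link, count) tuples."""
    if bycount:
        # Most frequent first; links with equal counts fall back to alphabetical order.
        return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    # Default: alphabetical by link (links are unique, so the count never decides).
    return sorted(counts.items())


# Made-up sample data, not taken from the provided test files.
counts = Counter({
    "http://www.google.com": 7,
    "http://www.cs.usfca.edu": 2,
    "http://yahoo.com": 2,
})

print(sort_link_counts(counts))
print(sort_link_counts(counts, bycount=True))
```

Negating the count in the sort key lets a single ascending sort handle both the descending count and the alphabetical tie-break.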

Notes:
  • Your program should follow the if __name__ == '__main__': convention.
  • You cannot use any HTML parsing modules, standard or otherwise.
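
One way the command-line handling and the if __name__ == '__main__': convention might fit together is sketched below for Python 3. The helpers parse_args, format_line, and main are hypothetical names, and the get_links body here is only a placeholder so the sketch runs on its own.

```python
import sys


def get_links(files, bycount=False):
    # Placeholder only: the real function reads the files and counts links.
    return []


def parse_args(argv):
    """Split the arguments (without the program name) into (bycount, files)."""
    # The assignment says -bycount, when present, is always the first argument.
    bycount = bool(argv) and argv[0] == "-bycount"
    files = argv[1:] if bycount else list(argv)
    return bycount, files


def format_line(link, count):
    """Format one output line in the style of the examples, e.g. "http://...", 2."""
    return '"{}", {}'.format(link, count)


def main(argv):
    bycount, files = parse_args(argv)
    for link, count in get_links(files, bycount):
        print(format_line(link, count))


if __name__ == '__main__':
    # Runs only when the file is executed directly, not when it is imported,
    # so automated tests can import get_links without triggering any output.
    main(sys.argv[1:])
```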



Attachments (uploaded by Greg Benson, Jan 29, 2013):
  • both.alpha (27k)
  • both.bycount (27k)
  • gen_output.sh (0k)
  • test1.html (1k)
  • test1.html.alpha (0k)
  • test1.html.bycount (0k)
  • www.nytimes.com.html (171k)
  • www.nytimes.com.html.alpha (27k)
  • www.nytimes.com.html.bycount (27k)