Interview Question #2

I found the following question today that piqued my interest:

Given an array of items where each item has a name and a weight (integer value), select a random item from the array based on the weight.

What this means is that items in the array should be randomly chosen such that the larger the weight, the more likely the item will be picked.

To approach this problem, I will create an array where each element gives us the probability the ith item would be chosen.  To build this array, I will first find the total amount of weight amongst all items and divided each item’s weight by this total value.  Once I have this array, I will be able to pick a random number between 0.0 (Inclusive) and 1.0 (Exclusive) to determine which item to pick.

My language of choice for this example will be Python.  Let’s say I am given the data as an ordered list of Tuples where the first element is the name of the item and the second element is the weight of that item.

items = [('Item1', 4), ('Item2', 3), ('Item3', 9), ('Item4', 7)]

With this array of items, I will first grab the weights and store it in an array such that the ith element in the this array corresponds to the ith element in the items array.

weights = map(lambda item: item[1], items) # [4, 3, 9, 2]

Once I have this array of weights, it’s trivial to find the total weight for all items.

total = float(sum(weights)) # 23

Note that I’m casting the sum to a float. I will need this total as a float to keep precision for when I divide each weight by this value.  If we left the total as an integer, the divide operation for each weight will return 0 since an integer divided by an integer is an integer.

probabilities = map(lambda weight: weight / total, weights) # [0.1739, 0.1304, 0.3913, 0.304] - Values shortened for brevity

Since these are probabilities, these values added together should equal 1.0.  Visually, think each of each entry as a slice of a pie.  The higher the probability value, the larger the piece of pie is.  We now need to determine which slice of pie we get, if we randomly select a piece.

To do this, we can use the Python function random.random(), which will return a value between 0.0 and 1.0.  To pick the correct value, I will iterate through this list of normalized weights and add the current item to a total.  If the random value is less than the total, then we have our random item.

random_val = random.random()
total_weight = 0
  for i, weight in enumerate(probabilities):
      total_weight += weight
      if random_val < total_weight:
          return items[i][0]

Let’s step through our example above to get a better understanding of how this all works. Let’s say our random value is 0.6195. During the first iteration, total_weight is set to to 0.1739. Our random value is greater than this value, so we continue onto the next iteration. During the second iteration total_weight is set to 0.3043. Again, the random value is greater than this value, so we continue onto the 3rd iteration where we set total_weight to 0.6956. Finally, the random value is less than total_weight, so the third item is our random value.

Overall this algorithm runs in O(n) time. If we know that the values are already sorted, we can instead use a binary search to get this down to O(lgn) time. However, sorting will take O(nlgn) operations, so an O(n) solution will suffice for now.

Interview Question #1

I was going through some old bookmarks on my computer and found a site of programming teaser questions that were asked of candidates applying at Google.  Being a sucker for these sort of things I checked out the first question:

There is an array A[N] of N numbers. You have to compose an array Output[N] such that Output[i] will be equal to multiplication of all the elements of A[N] except A[i]. For example Output[0] will be multiplication of A[1] to A[N-1] and Output[1] will be multiplication of A[0] and from A[2] to A[N-1].

Solve it without division operator and in O(n).

I thought about this question for a while and I have to admit that it absolutely stumped me!  In attempt to learn something for this question, we’re going to look at the answer and then examine why this works.

For this example, let’s say we’re given the array [1, 2, 3, 4, 5]; The solution to this problem would then be [120, 60, 40, 30, 24].

The solution is as follows:

  1. Initialize a variable called product, to 1.
  2. Starting at the second element of the array, iterate through the array going forwards.  Multiply the current value of the given array with the product and store that value in the previous element of the results array.
  3. Re-initialize the product variable to 1.
  4. Starting at the last element, Iterate through the elements going backwards. Multiply the current value of the given array with the product and store it in the product variable.
  5. Multiply the current value of the product with the previous element of the results array.

Before we look at any code, let’s work this out.

The first step is trivial, so using the sample given above, let’s examine how we would get to the end result.  Let’s check out what the results array looks like during the second step.

  1. results = [1, 1, 1, 1, 1]   // results[1] = data[0] × product (1)
  2. results = [1, 1, 2, 1, 1]   // results[2] = data[1] × product (1)
  3. results = [1, 1, 2, 6, 1]   // results[3] = data[2] × product (2)
  4. results = [1, 1, 2, 6, 24] // results[4] = data[3] × product (6)

Now, let’s take a look at the value of the product and the results array during step 4.

  1. results = [1, 1, 2, 6, 24] product = 5 (product × 5)
  2. results = [1, 1, 2, 30, 24] product = 20 (product × 4)
  3. results = [1, 1, 40, 30, 24] product = 60 (product × 3)
  4. results = [1, 60, 40, 30, 24] product = 120 (product × 2)
  5. results = [120, 60, 40, 30, 24] product = 120 (product × 1)

Bam! We end up our answer!  Why does this work?  Well, let’s examine the factors of the answer:

results[0] = 5 × 4 × 3 × 2
results[1] = 5 × 4 × 3 × 1
results[2] = 5 × 4 × 2 × 1
results[3] = 5 × 3 × 2 × 1
results[4] = 4 × 3 × 2 × 1

Let’s see how we accumulate those factors in step 2

  1. results = [1, 1, 1, 1, 1]
  2. results = [1, 1, 2 × 1, 1, 1]
  3. results = [1, 1, 2 × 1, 3 × 2 × 1 , 1]
  4. results = [1, 1, 2 × 1, 3 × 2 × 1, 4 x 3 x 2 x 1]

Step 3

  1. results = [1, 1, 2 × 1, 3 × 2 × 1, 4 x 3 x 2 x 1] product = 5
  2. results = [1, 1, 2 × 1, 5 × 3 × 2 × 1, 4 x 3 x 2 x 1] product = 5 × 4
  3. results = [1, 1, 5 × 4 × 2 × 1, 5 × 3 × 2 × 1, 4 x 3 x 2 x 1] product = 5 × 4 × 3
  4. results = [1, 5 × 4 × 3 × 1, 5 × 4 × 2 × 1, 5 × 3 × 2 × 1, 4 x 3 x 2 x 1] product = 5 × 4 × 3 × 2
  5. results = [5 × 4 × 3 × 2 × 1, 5 × 4 × 3 × 1, 5 × 4 × 2 × 1, 5 × 3 × 2 × 1, 4 x 3 x 2 x 1] product = 5 × 4 × 3 × 2 × 1

In step 2, we’re first distributing the factors that appear before the given element in the array. In step 4, the product is distributing the factors that appear after each element as we go back.  Therefore, each element in the results array contains all factors that appear and before the same element in the given array.

Anyway, here’s the code for it:

int[] data = { 1, 2, 3, 4, 5 };
int n = data.length;
int[] results = new int[n];
for (int i = 0; < n; i++) results[i] = 1
// Step 2
for (int i = 0, product = 1; i < n - 1; i++) {
 product *= data[i];
 result[i + 1] = product;
}
// Step 4
for (int i = n - 1, product = 1; i > 0; i--) {
 product *= data[i];
 result[i - 1] *= product;
}

Java String Pool

What will the following lines of code print?

1  String a = "abc";
2  String b = "abc";
3  System.out.println(a == b);
4  String c = new String("abc");
5  String d = new String("abc");
6  System.out.println(c == d);

In order to test for equality between two objects, you must use the equals method.  Otherwise, if you test the equality between two variables that point to objects, you’re comparing the value it stores; in this case the address that reference the objects.  As a result, both println statements should both return false, right?  Why does the first println statement (line 3) print true?

It turns out that variables a and b both point to the same object.  This is because both variables reference string literal objects.  A string literal is instead stored in a special area of the Java PermGen heap space called the String Pool.  The String Pool acts as a cache for string literals; If you create a string literal that already exists in the String Pool, Java is smart enough to use the existing reference.

On line 1, when I set the variable a to the literal “abc”, it will store that value in the String Pool.  When I do the same again for variable b, on line 2, it will find the same value in the String Pool and use the reference to the existing object.  Not only does this save memory, but the values of the variables a and b are both equal since they reference the same object.

On the other hand, if we create a String using the new operator (line 4), we will first find the “abc” reference that exists in the String Pool.  This reference will be passed into the String constructor to instantiate a completely new object that now lives in the Java heap space.  After we create two instances of the String object using the new operator (line 5), we have two completely separate instances of the String object.  Therefore the variables c and d reference two completely separate variables.

As you can see, it almost always never makes sense to use the new String(String) constructor.  If for whatever reason you must use this constructor, you may use the String.intern() method to place the object in the String Pool. This method will then return the  reference to the object that lives in the String Pool.

Given what you have just learned, what will the following lines of code print?

System.out.println(c.intern() == d.intern());
System.out.println("a" + "bc" == "ab" + "c");
System.out.println("abcd" == "abc" + "d");

Gawker Password Database Cracking

If you haven’t heard already, Gawker Media has been hacked! Long story short: The Gnosis crew has taken credit for breaking into Gawker Media’s servers and downloaded a database of roughly 1.3 million user records. One of the big stories here was that Gawker stored user passwords using the block cipher DES (Data Encryption Standard). This encryption scheme only encrypts the first 8 characters of a password before storing it in the database. This means ‘password1sSecureBec@useIt’sLong_4nd_uses_5pecial_characters’ will only be stored as DES(‘password’).

My senior project in college involved using the distributed computing framework Hadoop to crack passwords, so naturally this piqued my interest. Unfortunately I don’t have access to a cluster of computers, so instead I wanted to take the opportunity to learn something new (Cuda comes to mind). Before I jump into playing with Cuda, I first wanted to get a feel for what I was doing by writing some Python code to get me started.

First thing I had to do was filter out the records in the database that were not valid or didn’t have an encrypted password. Of the 1,248,120 records there were only 748,508 that I considered valid. Each password was encrypted using a salt, so I couldn’t simply create a rainbow table of encrypted passwords and do a lookup. Instead I had to crack each password individually. I decided to crack the passwords using a dictionary-based attack. I used a list of the 500 most commonly use passwords as my initial dictionary (The most popular passwords being first). I figured this would be my best bet for quickly shortening the list passwords that I would need to crack. The jist of how this works is the following:
1. Read in a word from the list.
2. Iterate through each account and compute the encrypted value using the user’s given salt.
3. If the computed value is the same as the encrypted password, then we have the original password.
4. If the password was cracked, remove it from the list so we don’t have to worry about it in future computations.

As of this writing I’m at the 39th word and I’ve managed to shorten the list by 16,512 users, or by 0.02%. This doesn’t seem to be going fast enough, so perhaps I’ll look more into doing this with something faster such as C or Cuda. The point of this exercise was really to get myself familiar with the process of cracking these passwords. I was able to quickly write some code in Python to test my ideas, and I quickly found out what ideas worked and what didn’t.

Starting out with Nmap

Nmap is one of those programs that I can’t live without.  Whether I’m using it to learn more about my own network, or somebody else’s network, it really gives me a better picture of what other devices are on the network without actually having to actually physically see the device.

Scanning your network

One of the basic ways to discover devices is to use a ping sweep with Nmap.  This sends out a ping to a single device or a range of devices. It used to be that hosts that dropped ICMP echo requests (ping) would up show as being down in Nmap. It seems that newer versions of Nmap are able to detect if a host is up if it doesn’t respond to a ping. (Try ping microsoft.com and see what it responds with.)

Pinging a single device

nmap -sP 192.168.0.1

Results

Nmap scan report for 192.168.0.1
Host is up (0.055s latency).

Pinging a range of IP addresses

nmap -sP 192.168.0.1-5

Results

Nmap scan report for 192.168.0.1
Host is up (0.0046s latency).
Nmap scan report for 192.168.0.3
Host is up (0.016s latency).
Nmap scan report for 192.168.0.5
Host is up (0.000098s latency).

Pinging the entire network

nmap -sP 192.168.0.0/24

Results

Nmap scan report for 192.168.0.1
Host is up (0.055s latency).
Nmap scan report for 192.168.0.3
Host is up (0.014s latency).
Nmap scan report for 192.168.0.5
Host is up (0.00029s latency).
Nmap scan report for 192.168.0.100
Host is up (0.0090s latency).

The last command will send a ping out to any device that has their IP address starting with 192.168.0.x.  (The /24 is short-hand for the subnet mask of 255.255.255.0.)

Device Enumeration

Once we have a list of devices on our network we may want to know some information about it.  Simply typing the command nmap along with the IP address or name of the device, will return what ports are open on the device and guess what service is running on the device.  For example, when I run nmap on my own machine I get the following:

nmap 192.168.0.5

Results

PORT     STATE SERVICE
22/tcp   open  ssh
88/tcp   open  kerberos-sec
139/tcp  open  netbios-ssn
445/tcp  open  microsoft-ds
515/tcp  open  printer
631/tcp  open  ipp
3689/tcp open  rendezvous

What if I want some more information like what version of ssh am I running?  Use the -sV flag.

nmap -sV 192.16.0.5

Results

PORT     STATE SERVICE      VERSION
22/tcp   open  ssh          OpenSSH 5.2 (protocol 2.0)
88/tcp   open  kerberos-sec Mac OS X kerberos-sec
139/tcp  open  netbios-ssn  Samba smbd 3.X (workgroup: WORKGROUP)
445/tcp  open  netbios-ssn  Samba smbd 3.X (workgroup: WORKGROUP)
515/tcp  open  printer
631/tcp  open  ipp          CUPS 1.4
3689/tcp open  daap         Apple iTunes DAAP 9.0.3
Service Info: OS: Mac OS X

Also notice we get some extra information such as the OS I’m running and the workgroup my samba shares are on.

Now let’s take what we learned and put them together.  How would I get information on all devices on my network?

nmap -sV 192.168.0.0/24

The output of this is too long for here, but I think you get the idea; it’ll display all the service information (along with service versions) for all hosts on the network.

That’s the basics of using nmap.  With the ideas presented here, you should be able to find out most information about about devices on the network you are currently connected to.

Most listened to music of the year

According to last.fm, this is the final count of the bands I listened to the most this year.

  1. Nine Inch Nails (528 Plays) – Nine Inch Nails is and has been one of my favorite bands for about 13 years now. NIN has always been the goto band for me for when I don’t feel like listening to anything else. This year I finally got around to actually listening to Year Zero. I was not enthusiastic about this CD when it first came out, but after seeing them live at the Santa Barbara Bowl this year, I finally got around to listening to it and love it! I actually found that it made for good running music! On top of that, I rediscovered what has to be one of my favorite albums: The Fragile. This CD is up there with Dark Side of the Moon in terms of replayability.
  2. NOFX (431 Plays) – Another one of goto bands when I don’t feel like listening to anything else. NOFX has been another one of my favorite bands for about 11 years now and I’m still not tired of them. This year they release coaster which has been in regular rotation in my library. This summer in particular I have also been re-listening to a lot of their albums when I went on my runs.
  3. The Appleseed Cast (329 Plays) – The Appleseed Cast have been one of my favorite bands of recent times. This year they released Sagarmatha which I absolutely feel in love with. The first track “As The Little Things Go” has to be my favorite track of the year. I will hopefully get to see them live for the first time in 2010.
  4. Placebo (291 Plays) – Again, another one of my favorite bands that I started to around 9 years ago. This year they released “Battle for the Sun” and did it without their original drummer Steven Hewitt. Despite his absense, Placebo released this slightly popier sounding album, but still staying true to their original sound. I have yet to see this band live; this needs to change!
  5. Muse (203 Plays) – This year muse released “The Resistance”, but I’ve also been rediscovering their CDs such as “Origin of Symmetry” and “Absolution”. I also recently watched the movie “Southland Tales” (which I won’t go into here), but it featured the song “Blackout” off of Absolution that I really never listened to. Holy crap what a great track!
  6. Asobi Seksu (181 Plays) – This band is a Japanese/American Indie/Pop band that I admit is a guilty pleasure. This year I’ve been listening to their self-titled album and “Citrus” mostly. This year they came out with the CD “Hush,” but it hasn’t gotten as much playtime as the other two albums from me.
  7. Pixies (163 Plays) – I had my pixies phase around six years ago. I think now was the time to come back and check out my old favorites. Not much to say here really.
  8. Steve Aoki (148 Plays) – Introduced to me randomly by a friend in SLO in August, Steve Aoki has been a running favorite this year.  Steve Aoki kind of reminds me of Girl Talk which I probably listened to A LOT last year.
  9. The Protomen (142 Plays) – The concept of a rock opera based on the Mega Man video game series just cannot fail. I was a big fan of the first act, but when act II came out, I listened to both acts back to back. Both CDs have a unique feel to them; the first act is more inspired by the videogame series (with a hint of Ennio Morricone) while the second act follows up with a sound inspired by the 80s.  Someone needs to turn these albums into an anime!
  10. Pink Floyd (140 Plays) – Pink Floyd is not one of those bands where you pick and choose their greatest hits. Instead when you listen to Pink Floyd, you listen to them an entire album at a time. In particular I remember listening to The Wall, Animals, and Dark Side of the Moon this year.

Find the Largest Prime Factor

I love programming competitions.  In the past two years I’ve participated in both my school’s programming competition and the ACM Southern California regional competition.  Recently I’ve been turned onto http://projecteuler.net which is a competition-like site that gives you a series of math questions which you need to solve by writing a program.  I saw this as a good opportunity to brush up on python, so today I’ve been trying my hand at a few of these problems.

Question #3 asks you to find the largest prime factor of a very large number.  My first instinct was to start at Sqrt(n) and work my way back until I find the first factor.  The problem here is that the first factor you find may not be the first PRIME factor.  Instead, my approach to this problem was to find all factors of the number between 2 and Sqrt(n) and put them in in a list.

Of all the numbers in the list, one of them is the largest prime factor, but which one?  To find out, I made an assumption that turned out to be right:  The composite factors have at least one prime factor that is already in the list.  If I can show that a given number has no factors in the list, then it is a prime factor.  Further, if I have a sorted list and start at the end, I can show that the number is the largest prime factor.

Anyway, here’s the code I used:

n = 600851475143 # Find the largest prime factor of this
div = [x for x in range(2,int(math.sqrt(n))) if n % x == 0] # Get all factors
i, j = 0, len(div) – 1
while j > i: # Determine which one of the factors is the largest prime
# This number is composite, go to the next largest factor
if div[j] % div[i] == 0: i,j = 0,j-1
# See if the next number at the beginning of the list divides the current number
else: i += 1
print div[j] # Print out the largest prime factor
n = 600851475143 # Find the largest prime factor of this
div = [x for x in range(2,int(math.sqrt(n))) if n % x == 0] # Get all factors
i, j = 0, len(div) - 1
while j > i: # Determine which one of the factors is the largest prime
# This number is composite, go to the next largest factor
if div[j] % div[i] == 0: i,j = 0,j-1
# See if the next number at the beginning of the list divides the current number
else: i += 1
print div[j] # Print out the largest prime factor

Fix for CodeIgniter White Screen on Bluehost

This past weekend I spent writing an application using the CodeIgniter framework for my programming languages class.  Sunday night I finally got it working and it was time to move it to production on bluehost.  After I made created the database and made some configuration changes I checked it out and I got nothing.  It was a white page of nothing.  I checked my logs and found nothing.  Normally this wouldn’t be such an issue, but when you have nothing in your logs, then you don’t have much to work with.  I looked at the source of the blank page and found this: “<!– SHTML Wrapper – 500 Server Error –>.”  Still not a lot to work with.

Today I spent some time investing the issue and found out that a lot of people were having the issue, but I didn’t find any fixes.  Long story short the fix was to enable FastCGI PHP5 on Bluehost.  When I did that an error finally appeared on the screen! Apparently I misspelled my database name >_<.

Anyway, if you’re having the same issue, let me know if this fix worked for you.

PHP Frameworks

I’ve been coding in PHP off and on for years now.  Each project I’ve done thus far has always been by hand.  Doing the architecture of your site by hand is the easiest way to introduce bugs and end up with sloppy code.   For the first time I’ve started a project where I used a framework in PHP.  My framework? Codeigniter.  I like it’s simplicity, MFC design and how it’s not like Rails.  I thought Rails was a bit too complicated and involved hacking the internal code way too much to get anything done, but that’s a whole other rant.

Anyway, the project I’m working on?  Just a twitter clone.  Why?  For the sake of learning, that’s all.  It’s actually for a project in school.  The idea is to implement this idea in something that I’m already familiar with (PHP) and then implement it again in another language I’m not familiar with at all.  After I’m doing with the first part I want to implement it in Python using the Django framework.  Python has always been one of those languages I’ve been meaning to learn.  In fact it seems I try to learn it every summer, but I just don’t seem to have the dedication to sit in doors learning it.

Regardless, I hope to use this knowledge in my professional life.  The last PHP project I worked on was stressful just because of some of the design decisisons I made (Doing a half-assed MVC implementation…didn’t do the best job of dividing up the view and controller).  However, I’ve learned from those mistakes and I might be taking up a new project at work where I’ll probably be using this framework.

Using SSL with Gmail and Gmail Notifier

With the recent announcement at Defcon, Gmail users will soon be the target of session hijacking. The reason for this is that Gmail by default does not encrypt any traffic (except logins). This allows anyone on the local network to sniff for session ids passed between gmail and the user when you check your email.  With this session id, a hijacker can act authenticate themselves as you without the need for your username and password.

This has always been an issue for non-encrypted traffic, but it was announced at defcon that a tool has been released that automates this hack.  This was enough reason for Google to release an option to turn on SSL.  The problem here is that you still have to manually turn it on.

To turn on SSL go to “Settings” and at the bottom you’ll see an option called “Browser Connection.”  Choose the option “Always use https.”  Yay now we’re protected from session hijacking!

ssl.png
The problem I next noticed was that gmail notifier stopped working!  After doing some investigating, I found that you had to do some hex editing with gnotify.exe to get it to use SSL.

Before we do anything, close and make a backup copy of gnotify.exe just in case anything happens.  By default you can find this executable in C:\Program Files\Google\Gmail Notifier.  For hex editing I used an old favorite hex editor called “Hex Workshop” for Windows.  After you download/install it, open up gnotify.exe in Hex Workshop.  On the left you’ll see a bunch of hexidecimal characters and on the right you’ll see the ASCII equivalent.

To make the replaecment, we need to first find the area we want to modify.  To find the area, hit CTRL-F and let’s do a search for the string we want to modify: “http://”.  Under the “Type” drop-down choose “Text String.”  When you find “http://mail.google.com/mail/” go ahead and add a “s” after “http.”  You’ll see that whenever you type, it will overwrite whatever was in that field before.  Go ahead and type out the replaced characters until you end up with “https://mail.google.com/mail/”.

hex.png

Go ahead and save the modified executable and open it back up.  If it fails, you can always use the backup you made!  Otherwise, you should know have access to gmail over an encrypted connection!

I’m sure I’ll be writing a part II to this when I get home and my gmail notifier isn’t working there either.