Fighting spam and recapturing books with reCAPTCHA

A CAPTCHA is an anti-spam test used to work out whether a request has been made by a human or a spambot. CAPTCHAs no longer seem to be as popular as they once were, as other spam identification techniques have emerged; however, a considerable number of websites still use them.

Some common examples of CAPTCHAs.

CAPTCHAs can be really annoying, hence their decline in recent years. Take a look at the different CAPTCHAs in the image above. If you had spent 30 seconds filling in a feedback form, would you be willing to try and decipher one of them, or would you just abandon the feedback?

The top left image could be ZYPEB; however, it could just as easily be 2tPF8. If you get it wrong, you will usually be forced to do another, which could be just as difficult.

The BBC recently reported how the National Federation of the Blind has criticised CAPTCHAs, due to their restrictive nature for the visually impaired. Many CAPTCHAs do offer an auditory version; however, if you check out the BBC article (which has an example of an auditory CAPTCHA), you will see that they are near impossible to understand.

reCAPTCHA

Luis von Ahn is a computer scientist who was instrumental in developing the CAPTCHA back in the late 1990s and early 2000s. According to an article in the Canadian magazine The Walrus, when CAPTCHAs started to become popular, Luis von Ahn “realized that he had unwittingly created a system that was frittering away, in ten-second increments, millions of hours of a most precious resource: human brain cycles.”

An example of a reCAPTCHA CAPTCHA.

To ensure that this time was not wasted, von Ahn set about developing a way to put it to better use; it was at this point that reCAPTCHA was born.

reCAPTCHA is different to most CAPTCHAs because it uses two words. One word is generated by a computer, whilst the other is taken from an old book, journal, or newspaper article.

Recapturing Literature

As I mentioned, reCAPTCHA shows you two words. One of the images is to prevent spam, and confirm the accuracy of your reading; you must get this one right, or you will be presented with another. The other image is designed to help piece together text from old literature, so that books, newspapers and journals can be digitised.

reCAPTCHA presents the same word to a variety of users and then uses the most common response to work out what the word actually says, which helps to stop abuse. In a 2007 quality test, a standard computer text reader (optical character recognition, or OCR) identified 83.5% of words correctly, a reasonably high proportion; however, the accuracy of human interpretation via reCAPTCHA was an astonishing 99.1%!
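Combining many people's answers for one word in practice means picking the reading that most of them agree on. The following is a minimal Python sketch of that idea, not reCAPTCHA's actual implementation: the function names, the minimum of three votes and the 60% agreement threshold are all assumptions made for illustration.

```python
from collections import Counter


def check_control_word(answer, expected):
    """The known word gates the submission: it must match, ignoring case and spaces."""
    return answer.strip().lower() == expected.strip().lower()


def consensus(readings, min_votes=3, min_share=0.6):
    """Return the agreed reading of the unknown word, or None if there is no clear winner.

    min_votes and min_share are hypothetical thresholds, not reCAPTCHA's real ones.
    """
    normalised = [r.strip().lower() for r in readings if r.strip()]
    if len(normalised) < min_votes:
        return None  # not enough people have seen the word yet
    word, votes = Counter(normalised).most_common(1)[0]
    return word if votes / len(normalised) >= min_share else None


# Only answers from users who passed the control word would be counted.
readings = ["morning", "morning", "mourning", "Morning"]
print(consensus(readings))
```

Requiring agreement between several users is what makes the scheme resistant to abuse: one person typing nonsense for the unknown word is simply outvoted.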

According to an entry in the journal Science, in 2007 reCAPTCHA was present on over 40,000 websites, and users had interpreted over 440 million words! Google claim that around 200 million CAPTCHAs are now solved each day.

If each of those words took 10 seconds to solve, that would have been around 139 years (or 4.4 billion seconds) of brain time wasted; I am starting to see what Mr von Ahn meant! To put the 440 million words into perspective, the complete works of Shakespeare run to around 900,000 words, or 0.9 million.
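The arithmetic behind those figures can be checked with a few lines of Python. This is only a sketch: the variable names are mine, and the length of a year (365.25 days) is an assumption.

```python
WORDS_INTERPRETED = 440_000_000         # words interpreted via reCAPTCHA
SECONDS_PER_WORD = 10                   # assumed time to solve one
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60
SHAKESPEARE_WORDS = 900_000             # approximate size of the complete works

total_seconds = WORDS_INTERPRETED * SECONDS_PER_WORD
total_years = total_seconds / SECONDS_PER_YEAR
shakespeare_equivalents = WORDS_INTERPRETED / SHAKESPEARE_WORDS

print(f"{total_seconds:,} seconds is about {total_years:.0f} years")
print(f"That is roughly {shakespeare_equivalents:.0f} times the complete works of Shakespeare")
```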

Whilst the progress of reCAPTCHA seems pretty impressive, it is a tiny step on the path to total digitisation. In this BBC article from the time, von Ahn is quoted as saying:

“There’s still about 100 million books to be digitised, which at the current rate will take us about 400 years to complete”

Google

In 2009 Google acquired reCAPTCHA. The search giant claimed that it wanted to “teach computers to read”, hence the acquisition.

Many speculate that Google’s ultimate aim is to index the world, and reCAPTCHA will help it to accelerate this process. That said, if that is its goal, it is still a very long way off.

We won’t be implementing a CAPTCHA on Technology Bloggers any time soon; however, next time you have to fill one in, do spare a thought for the [free] work you might be doing for literature, for history and for Google.

The size of the Internet – and the human brain

How many human brains would it take to store the Internet?

Last September I asked: if the human brain were a hard drive, how much data could it hold?

I concluded that approximately 300 exabytes (or 300 million terabytes) of data can be stored in the memory of the average person. Interesting stuff, right?

Now that I know how much computer data the human brain can potentially hold, I want to know how many people’s brains would be needed to store the Internet.

To do this I need to know how big the Internet is. That can’t be too hard to find out, right?

It sounds like a simple question, but it’s almost like asking how big is the Universe!

Eric Schmidt

In 2005, Eric Schmidt, now Executive Chairman of Google, famously wrote regarding the size of the Internet:

“A study that was done last year indicated roughly five million terabytes. How much is indexable, searchable today? Current estimate: about 170 terabytes.”

So in 2004, the Internet was estimated to be 5 exabytes (or 5,000,000,000,000,000,000 bytes).

The Journal Science

In early 2011, the journal Science calculated that the amount of data in the world in 2007 was equivalent to around 300 exabytes. That’s a lot of data, and most would have been stored in such a way that it was accessible via the Internet – whether publicly accessible or not.

So in 2007, the average memory capacity of just one person could have stored all the virtual data in the world. Technology has some catching up to do. Mother Nature is walking all over it!

The Impossible Question

In 2013, the size of the Internet is unknown. Without mass global collaboration, I don’t think we will ever know how big it is. The problem is defining what is the Internet and what isn’t. Is a business’s intranet which is accessible from external locations (so an extranet) part of the Internet? Arguably yes, it is.

A map of the known and indexed Internet, developed by Ruslan Enikeev using Alexa rank

I could try and work out how many sites there are, and then multiply this by the average site size. However, what’s the average size of a website? YouTube is petabytes in size, whilst my personal website is just kilobytes. How do you average that out?
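To see why averaging breaks down here, consider a small Python sketch. The site sizes below are made up purely for illustration (they are not measurements of any real website): a thousand sites, nearly all tiny, with one petabyte-sized giant among them.

```python
from statistics import mean, median

KB, MB, GB, PB = 10**3, 10**6, 10**9, 10**15

# Made-up sizes in bytes, for illustration only: many small sites and one giant.
site_sizes = [50 * KB] * 997 + [20 * MB, 5 * GB, 1 * PB]

average = mean(site_sizes)      # dragged up by the single giant
typical = median(site_sizes)    # what most sites actually look like
largest_share = max(site_sizes) / sum(site_sizes)

print(f"mean:   {average:,.0f} bytes")
print(f"median: {typical:,.0f} bytes")
print(f"largest site holds {largest_share:.4%} of all the data")
```

With a spread like this, the mean says almost nothing about a typical site, and the total is decided almost entirely by the few largest sites, which are exactly the ones whose size nobody outside can measure.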

See the red circle? That is pointing at Technology Bloggers! Yes we are on the Internet map.

The Internet is now too big to try and quantify, so I can’t determine its size. My best chance is a rough estimate.

How Big Is The Internet?

What is the size of the Internet in 2013? Or to put it another way, how many bytes is the Internet? Well, if in 2004 Google had indexed around 170 terabytes of an estimated 5 million terabyte net, then it had indexed around 0.0034% of the web at that time.

On Google’s how search works feature, the company boasts that its index is well over 100,000,000 gigabytes. That’s 100,000 terabytes or 100 petabytes. Assuming that Google is getting slightly better at finding and indexing things, and therefore has now indexed around 0.01% of the web (meaning it has indexed roughly three times more of the web as a percentage than it had in 2004), then 0.01% of the web would be 100 petabytes.

100 petabytes multiplied by 100 is equal to 10 exabytes, meaning 1% of the net is equal to around 10 exabytes. Multiply 10 exabytes by 100 and you get 1,000 exabytes, or 1 zettabyte, which is (by my calculations) equivalent to the size of the web.

So the Internet is 1 zettabyte! Or 1,000,000,000 (one billion) terabytes.

How Many People Would It Take To Memorise The Internet?

If the web is equivalent to 1 zettabyte (or 1,000,000,000,000,000,000,000 bytes) and the memory capacity of a person is 0.3 zettabytes (or 0.0003 yottabytes), then currently, in 2013, it would take between three and four people to store the Internet in their heads.

A Human Internet

The population of Earth is currently 7.09 billion. So if there was a human Internet, whereby all people on Earth were connected, how much data could we all hold?

The calculation: 0.0003 yottabytes x 7,090,000,000 = 2,127,000 yottabytes.

A yottabyte is currently the biggest officially recognised unit of data; however, the next step (which isn’t currently recognised) is a brontobyte. So if mankind was to max out its memory, we could store 2,127 brontobytes of data.

By my estimate, the Internet would take up a tiny 0.000000047% of humanity’s memory capacity.
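The whole estimate hinges on one assumed number: the share of the web that Google’s index covers. A minimal Python sketch shows how strongly the result depends on it. The function names are mine, decimal units are assumed, and the three indexed shares in the loop are hypothetical values chosen only to show the sensitivity.

```python
# Decimal (SI) byte units.
PB, EB, ZB = 10**15, 10**18, 10**21

INDEX_BYTES = 100 * PB        # Google's stated index size ("well over" this)
BRAIN_BYTES = 300 * EB        # per-person capacity from the earlier post
POPULATION = 7_090_000_000    # world population figure used above


def estimate_total_bytes(index_bytes, indexed_percent):
    """Scale the index up to 100%, given the share of the web it is assumed to cover."""
    return index_bytes * 100 / indexed_percent


def people_needed(total_bytes):
    """How many average brains it would take to hold total_bytes."""
    return total_bytes / BRAIN_BYTES


def share_of_humanity_percent(total_bytes):
    """Percentage of everyone's combined memory that total_bytes would occupy."""
    return total_bytes / (BRAIN_BYTES * POPULATION) * 100


# Hypothetical indexed shares, in percent, chosen only to show the sensitivity.
for indexed_percent in (0.000001, 0.0001, 0.01):
    total = estimate_total_bytes(INDEX_BYTES, indexed_percent)
    print(
        f"indexed {indexed_percent}% -> {total / ZB:,.0f} ZB, "
        f"{people_needed(total):,.1f} brains, "
        f"{share_of_humanity_percent(total):.2e}% of humanity's capacity"
    )
```

Each hundredfold change in the assumed share moves the answer by a factor of a hundred, so the final figure is only ever as good as that one guess.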

The conclusion of my post on how much data the human brain can hold was that we won’t ever be able to technically match the amazing feats that nature has achieved. Have I changed my mind? Not really, no.

The Samsung Galaxy S4

This coming Saturday, Samsung’s latest smartphone, the Galaxy S4, goes on sale.

Smartphone Battles

Global smartphone market share by provider, 2009-2012.

As with most mass-market technology, there is a war going on in the smartphone industry. In 2012, according to market analyst firm IDC, Samsung controlled 30.3% of the global smartphone market, 59.5% up on the 19% of the market it controlled the year before.
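Note that the 59.5% figure is relative growth, not a change in percentage points. A quick Python check (variable names are mine) makes the difference explicit:

```python
share_2011 = 19.0   # percent of the global smartphone market
share_2012 = 30.3

point_change = share_2012 - share_2011               # change in percentage points
relative_growth = point_change / share_2011 * 100    # growth relative to the 2011 share

print(f"{point_change:.1f} percentage points, or {relative_growth:.1f}% relative growth")
```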

There is no doubt that Samsung is currently the dominant force in the smartphone market. The firm seems to slowly be winning its battle with Apple, and looks set to take on Google next, with rumours that it soon plans to ditch Google’s Android operating system altogether.

Nokia is predicted to make a comeback (how successful, I am unsure) thanks to Windows Phone, and RIM, the maker of BlackBerry, is also looking stronger in 2013 after the release of BlackBerry 10 earlier this year.

Galaxy S4

Samsung is trying to steal even more of the market from its competitors with the Galaxy S4, so it has pulled out a few stops, maybe not all the stops, but quite a few, to make sure that the phone is a success.

So, the phone has loads of new features, to make it slightly better than its predecessor – the S3.

The S4 has a slightly bigger (5mm to be exact) screen, boasting a whole 5 inches of full HD display, which no doubt gives it amazing clarity. The new phone is also slightly thinner than the S3.

You can buy a Galaxy S4 in black and white, or as Samsung like to call them: black mist and white frost. I have never looked at a phone before (smart or not) and thought “that looks like frost” or mist, but maybe the S4 really does; or maybe it’s just marketing.

Samsung claims the latest edition of its Galaxy is usable even with gloves on, hopefully reducing the cases of zombie fingers. Jonny, you might be able to use it! 😉

The phone has various other new features, such as Samsung WatchON, which connects your phone to your TV, turning your phone into a remote control.

Another new feature is the multi-speaker capability: if you have more than one S4 handy, you can sync them together to create a better quality of sound.

The S4 will also come with built-in 4G compatibility, which the original S3 didn’t. If a fast internet connection is important to you when you are on the go, then the S4 is probably a better choice than the S3.

Eye-Tracking

Probably the most exciting new feature of the Galaxy S4 is the new eye-tracking technology. The phone uses its front camera to monitor the user’s eye movements, and users can use this function for a host of different activities.

One of the features which uses the eye-tracking technology is video playback. If you are watching something, and then look away, the device automatically pauses the media for you. Furthermore, eye-tracking technology can be used to scroll up and down a page, without the need to even touch the screen.

Photos

There are two interesting developments in the photographic area of the phone. The first is that you can now add audio snippets to pictures, to enable you to catch even more of the moment. You can also merge video with picture, creating partially animated pictures, sort of like the photographs in the Harry Potter films.

The S4 can also use (and display) the front and rear camera simultaneously, which shows that its quad-core ARM processor is pretty quick!

Your Thoughts

So what are your thoughts on the S4? If you are getting one, do let us know!

Do you think that Samsung has done enough to fend off the competition from its closest rivals?

Personally, I think the S4 looks set to become the best smartphone on the market when it goes on sale at the end of the week.