
MapD Offers a Columnar Database System that Runs on GPUs


via thenewstack.io

San Francisco start-up MapD has released a database system, ParallelDB, built to run on GPUs (graphics processing units), which can be used to explore multi-billion-row datasets in milliseconds, according to the company.

The idea of using GPUs for database work may initially seem unusual, but after you think about it for a bit, you start to wonder why no one has commercialized the idea before.

“Imagine an SQL query, or any kind of relational operator, doing the same thing over every row of data. That lends itself really well to the vector model of GPUs,” said Todd Mostak, founder and CEO of MapD.

GPUs offer massive parallelism, or the ability to carry out a computation task across a wide number of vectors simultaneously, a vital operation for rendering graphics across a computer screen. There is no reason why this parallelism couldn’t also be used for data analysis; a database row is, after all, nothing more than a single vector. And visualizing the data directly from the GPUs would, of course, dramatically reduce the amount of data shuffling that typically takes place to create such graphics.

Today, the largest bottlenecks for database systems are CPU and memory. As it turns out, GPUs have both in spades. Mostak designed a GPU-based database architecture that could offer 100x speedups over traditional CPU-based database systems (read: pretty much all database systems), offering the capability of executing a query in milliseconds rather than minutes.


MapD’s Ed O’Donnell (left) and Todd Mostak at IBM Interconnect in February.

MapD could, for instance, be set up with eight GPU cards in a single server, a setup that could offer a throughput of 3TB per second across 40,000 GPU cores.

Initially, MapD would be most attractive to big data projects with log analytics, geographical information systems, business intelligence, and social media analytics.

The technology has already been tested by a number of large companies in telecommunications, retail, finance, and advertising. Digital advertising company Simulmedia has been testing MapD to match inventory availability against ad units. Facebook, Nike, and Verizon are kicking the tires, as is MIT Lincoln Laboratory.

The company has raised $10 million in Series A funding from a consortium of investors,  including Google Ventures, Verizon Ventures, and, naturally, GPU maker Nvidia.

The Inevitable Twitter Challenge

Mostak developed the idea for a GPU-powered database system while a student doing research at the MIT Computer Science and Artificial Intelligence Laboratory, working under database luminaries Sam Madden and Mike Stonebraker.

Mostak wasn’t even majoring in comp sci. He was pursuing Middle Eastern studies at Harvard University, having lived in Egypt and Syria. His final thesis project involved analyzing a large number of tweets, and initially Mostak was using PostgreSQL along with Python and C code.

“Everything was just taking too long,” he said, noting that he had to run analysis jobs overnight. Mostak was taking computer science as an elective, and at the time he was in a GPU programming class, where the idea for a GPU database system germinated.

The first prototypes didn’t yet challenge the sizes of in-memory database systems, MapD’s chief competitors. Harvard deployed an instance that ran 16GB across four GPUs. However, the major strides that GPU builders are making — spurred on by 4K gaming and deep learning — ensure successive generations of ever-more-powerful cards.


Now a MapD database on a single server installed with eight Nvidia Tesla K80s can be as large as 192GB. Nvidia’s next-generation Pascal architecture-based cards, high-performance SKUs of which will hold 32GB of VRAM, will set the stage for 500GB databases rivaling the performance of in-memory databases.

Let’s stop for a second and reflect on this: MapD is promising a half-terabyte database running at transactional speeds on a single server.

MapD is not the first party to investigate the use of GPUs for database systems. The idea has been kicking around academia for a while. GPUdb, out of Arlington, Virginia, offers what it claims is the first GPU-accelerated database system.

Most of the approaches to date use the GPU as an accelerator. The problem with this approach is that any gains achieved from greater computational efficiency are squandered by the time it takes to pass data over the PCI bus, Mostak argued. MapD’s approach is to make the GPUs the computational elements themselves (you can run ParallelDB on regular CPUs, though this approach offers no particular speed advantage).


ParallelDB is a column-store database. The system takes incoming vanilla SQL queries and, using the Apple-championed open source LLVM (Low Level Virtual Machine) compiler, reduces them to IR (intermediate representation) and then compiles that to GPU code, with an emphasis on vectorizing the execution of the various SQL operators. The company has some patent-pending technology for caching hot data in each GPU’s RAM to add extra pep.
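
To make the combination of a column store and vectorized operators concrete, here is a toy sketch in Python with NumPy. It is not MapD code and it runs on the CPU; it simply shows how a SQL-style filter-and-aggregate over columnar data reduces to a handful of whole-column vector operations, which is exactly the shape of work that maps well onto thousands of GPU cores.

import numpy as np

# Toy column store: each column of the table is a contiguous array.
# Illustrative only -- NumPy on the CPU, not MapD or GPU code.
n_rows = 1_000_000
table = {
    "price":    np.random.rand(n_rows) * 100.0,
    "quantity": np.random.randint(1, 10, size=n_rows),
    "region":   np.random.randint(0, 5, size=n_rows),
}

# SELECT SUM(price * quantity) FROM table WHERE region = 3
# A row-at-a-time engine would loop over every row; the vectorized form
# below applies each relational operator to an entire column at once.
mask = table["region"] == 3
revenue = (table["price"][mask] * table["quantity"][mask]).sum()
print(revenue)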

The beauty of GPUs is that they have hella cores. A server may have about 10 to 30 CPU cores, but around 40,000 GPU cores. Granted, GPU cores are pretty dumb compared to CPU cores, “but you can process a lot with them,” Mostak said.


But the maximum core count is not the only advantage GPUs bring.

“People think GPUs are great because they have so much computational power, but we think that you really win because GPUs have so much memory bandwidth,” Mostak said. The Pascal cards will be able to scan data at a rate of 8TB per second, a huge jump over CPU capabilities.

The accompanying visualization software can pull the results of the computations directly from the GPUs into the card’s rendering pipeline for visualization. “We can place output of the SQL queries into the rendering pipeline,” Mostak said. This could be useful for, say, displaying a million points on a map, or creating unusually dense scatterplots or network graphs. In addition to working with its own visualization software, ParallelDB can also work with other ODBC (Open Database Connectivity)-fluent business intelligence suites such as Tableau.

What is the advantage of all this power? Reduced costs and performance improvements.

“MapD has come up with a unique solution for being able to analyze and query massive amounts of data in real-time using GPU technology,” said James E. Curtis, senior analyst of data platforms and analytics for 451 Research, in a  statement. “They are dramatically reducing querying times in a very cost effective way which makes MapD a very disruptive force in the big data market.”

In one test, Verizon benchmarked MapD against a set of 20 Apache Impala servers churning through 3 billion rows. It took the Impala kit 15-20 seconds, whereas it took a single MapD server around 160 milliseconds.

As a result, MapD could pose a lower-cost alternative to columnar stores such as Vertica and Amazon Redshift. MapD’s ParallelDB and Immerse can be procured as software for on-site deployment, or as a service from either IBM SoftLayer or Amazon AWS.

IBM is a sponsor of The New Stack.

Feature image: Nvidia’s newly released GTX 1080, the first card based on the company’s Pascal architecture.

The post MapD Offers a Columnar Database System that Runs on GPUs appeared first on The New Stack.


The Kubernetes Way: Part One


via thenewstack.io

With containers gaining the attention of enterprises, the focus is slowly shifting to container orchestration. Complex workloads running in production need mature scheduling, orchestration, scaling and management tools. Docker made it extremely easy to manage the lifecycle of a container running within a host operating system (OS). Since containerized workloads run across multiple hosts, we need tools that go beyond managing a single container and single host.

That’s where Docker Datacenter, Mesosphere DC/OS, and Kubernetes have a significant role to play. They let developers and operators treat multiple machines as a single, large entity that can run multiple clusters. Each cluster runs multiple containers that belong to one or more applications. DevOps teams submit the job through the application program interface (API), command line interface (CLI) or specialized tools to the container orchestration engine (COE) which becomes responsible for managing the lifecycle of an application.

High-level architecture of a COE

The hosted version of COE is delivered as CaaS, Containers as a Service. Examples of CaaS include Google Container Engine, Carina by Rackspace, Amazon EC2 Container Service, Azure Container Service, and Joyent Triton.

Containers as a Service

Kubernetes, the open source cluster manager and container orchestration engine, is a simplified version of Google’s internal data center management tool called Borg. At KubeCon 2015, the inaugural Kubernetes conference, the community celebrated the launch of version 1.1, which came with new features.

I wrote an article that compares the COE market landscape with Hadoop’s commercial implementation. There are quite a few startups and established platform vendors trying to capture the enterprise market share for COE. Kubernetes stands out, due to its maturity that comes from Google’s experience of running web-scale workloads. Based on my personal experience, I am attempting to call out the features that make Kubernetes the standard for container orchestration.

Pods: The New Virtual Machine

Containers and microservices have a unique attribute – they run one, and only one, process at a time. While it’s common to see a virtual machine (VM) running the full LAMP stack, the same application has to be split into at least two containers – one running Apache with PHP and the other running MySQL. If you throw Memcached or Redis into the stack for caching, it needs to run in a separate container as well.

This pattern makes deployment challenging. For example, the cache container should be kept close to the web container. When the web tier is scaled out by running additional containers, the cache container also needs to be scaled out. When the request comes to a web container, it checks for the data set within the corresponding cache container; if it is not found, a database query is made to MySQL. This design calls for pairing the web and cache container together and co-locating them within the same host.

If Kubernetes is the new operating system, then a pod is the new process.

The concept of a pod in Kubernetes makes it easy to tag multiple containers that are treated as a single unit of deployment. They are co-located on the same host and share the same resources, such as network, memory and storage of the node. Each pod gets a dedicated IP address that’s shared by all the containers belonging to it. That’s not all – each container running within the same pod gets the same hostname, so that they can be addressed as a unit.

When a pod is scaled out, all the containers within it are scaled as a group. This design makes up for the differences between virtualized apps and containerized apps. While still retaining the concept of running one process per container, we can easily group containers together that are treated as one unit. So, a pod is the new VM in the context of microservices and Kubernetes. Even if there is only one container that needs to be deployed, it has to be packaged as a pod.

Pods manage the separation of concern between development and deployment. While developers focus on their code, operators will decide what goes into a pod. They assemble relevant containers and stitch them through the definition of a pod. This gives ultimate portability, as no special packaging is required for containers. Simply put, a pod is just a manifest of multiple container images managed together.
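
As a rough illustration of such a manifest (a hedged sketch of my own, not an example from the article), the web-plus-cache pairing described earlier could be declared as a single pod. The names, labels and images below are hypothetical; the structure is the standard Kubernetes v1 Pod format, written here as a Python dictionary.

# Hypothetical pod manifest: a web container and its cache, deployed and
# scaled together as one unit. Names, labels and images are made up.
pod_manifest = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {
        "name": "web-frontend",
        "labels": {"app": "frontend"},   # a Service selector can match on this label
    },
    "spec": {
        "containers": [
            {
                "name": "web",
                "image": "php:apache",            # Apache + PHP tier
                "ports": [{"containerPort": 80}],
            },
            {
                "name": "cache",
                "image": "memcached:alpine",      # co-located cache, shares the pod's IP
            },
        ],
    },
}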

If Kubernetes is the new operating system, then a pod is the new process. As they become more popular, we will see DevOps teams exchanging pod manifests instead of multiple container images. Helm, from the makers of Deis, is an example of a service acting as a marketplace for Kubernetes pods.

Service: Easily Discoverable Endpoints

One of the key differences between monolithic services and microservices is the way the dependencies are discovered. While monoliths may always refer to a dedicated IP address or a DNS entry, microservices will have to discover the dependency before making a call to it. That’s because the containers and pods may get relocated to any node at runtime. Each time a container or a pod gets resurrected, it gets a new IP address. This makes it extremely hard to keep track of the endpoints. Developers have to advertise explicitly and query for services in discovery backends, such as etcd, Consul, ZooKeeper or SkyDNS. This requires code-level changes for applications to work correctly.

Kubernetes shines bright with its in-built service discovery feature. Services in Kubernetes consistently maintain a well-defined endpoint for pods. These endpoints remain the same, even when the pods are relocated to other nodes or when they get resurrected.

Multiple pods running across multiple nodes of the cluster can be exposed as a service. This is an essential building block of microservices. The service manifest has the right labels and selectors to identify and group multiple pods that act as a microservice.

For example, all the Apache web server pods running on any node of the cluster that matches the label “frontend” will become a part of the service. It’s an abstraction layer that brings multiple pods running across the cluster under one endpoint. The service has an IP address and port combination along with a name. The consumers can refer to a service either by the IP address or the name of the service. This capability makes it extremely flexible in porting legacy applications to containers.
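
Continuing the hypothetical manifest above (again a sketch of my own, not taken from the article), a service that groups every pod carrying the "frontend" label behind one stable endpoint could look like this:

# Hypothetical Service manifest: every pod whose labels match the selector
# ("app": "frontend") becomes a backend behind this single, stable endpoint.
service_manifest = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "frontend-svc"},
    "spec": {
        "selector": {"app": "frontend"},           # matches the pod label above
        "ports": [{"port": 80, "targetPort": 80}],
        "type": "ClusterIP",                       # internal-only; see the three forms below
    },
}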

If multiple pods share the same endpoint, how do they evenly receive the traffic? That’s where the load balancing capability of the service comes in. This feature is a key differentiator of Kubernetes when compared to other COEs. Kubernetes has a lightweight internal load balancer that can route traffic to all the participating pods in a service.

Services can be exposed in one of the three forms: internal, external and load balanced.

  • Internal: Certain services, such as databases and cache endpoints, don’t need to be exposed. They are only consumed by other pods internal to the application. These services are exposed through an IP address that’s accessible only within the cluster, not to the outside world. Kubernetes obscures sensitive services by exposing an endpoint that’s available only to internal dependencies. This adds an additional layer of security by hiding the private pods from the public.
  • External: Services running web servers or publicly accessible pods are exposed through an external endpoint. These endpoints are available on each node through a specific port.
  • Load balanced: In scenarios where the cloud provider offers an external load balancer, a service can be wired with that. For example, the pods might receive traffic via an elastic load balancer (ELB) or the HTTP load balancer of Google Container Engine (GCE). This feature enables integrating a third-party load balancer with the Kubernetes service.

Kubernetes does the heavy lifting by taking over the responsibility of discovery and load balancing of microservices. It relieves DevOps from dealing with complex plumbing required at the infrastructure level. Developers can focus on their code with a standard convention of using hostnames or environment variables without worrying about additional code required for registering and discovering services.

Docker, Joyent and Mesosphere are sponsors of The New Stack.

Feature Image via Pixabay.

The post The Kubernetes Way: Part One appeared first on The New Stack.

An intro to Regression Analysis with Decision Trees


via glowingpython.blogspot.com

It’s been a while since there were any posts on this blog, but the Glowing Python is still active and strong! I just decided to publish some of my posts on the Cambridge Coding Academy blog. Here are the links to a series of two posts about Regression Analysis with Decision Trees. In this introduction to Regression Analysis we will see how to use scikit-learn to train Decision Trees to solve a specific problem: “How to predict the number of bikes hired in a bike sharing system on a given day?”

In the first post, we will see how to train a simple Decision Tree to exploit the relation between temperature and bikes hired; this tree will be analysed to explain the result of the training process and gain insights about the data. In the second, we will see how to learn more complex decision trees and how to assess the accuracy of the prediction using cross validation.
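
To give a flavour of the first post, here is a minimal sketch of my own (not code from the posts) that fits a shallow decision tree regressor with scikit-learn; the CSV file name and column names are hypothetical stand-ins for the bike sharing data, and it uses the current scikit-learn API.

import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Hypothetical bike sharing data: one row per day with the temperature
# and the number of bikes hired (file and column names are assumed).
data = pd.read_csv("bike_sharing_daily.csv")
X = data[["temperature"]]
y = data["bikes_hired"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Keep the tree shallow so it stays small enough to inspect and explain.
tree = DecisionTreeRegressor(max_depth=3)
tree.fit(X_train, y_train)

print("MAE:", mean_absolute_error(y_test, tree.predict(X_test)))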

Here’s a sneak peek at the figures that we will generate.

A. Jesse Jiryu Davis: Captioning Myself And 6 Other Ways I'll Prepare In The 24 Hours Before I Speak At Pycon


via emptysqua.re

At 1:10pm Pacific Time this Wednesday, I’ll be in Portland, sitting in a lecture room at PyCon, judging the right moment to walk on stage and start my talk. I have a regimen for the 24 hours before that primes me to give you the best talk I can.

This whole series of articles about conference speaking is inspired by my friend Sasha Laundy’s 24-hour checklist, which is the definitive guide for public speakers. I’ll add a few more tips for you to choose among, particularly my method to ensure my talk is accessible to Deaf people.


Marketing

Most of my audience will be there because they saw my talk on the schedule and like the topic. Still, it won’t hurt to compose a few tweets advertising my talk, before nervousness robs me of wit, and schedule them for the night and morning before my talk. (I need to remember to set Buffer to PDT so it tweets at the right time in Portland.)

Landing Page

I’ll make a page on my site with links to further study. I’ve collected links about my topic during the months when I wrote my proposal and outlined my talk, so I’ll assemble those into a landing page on my site.

(Next week when PyCon publishes the video of my talk, I can link to it from this page for future visitors.)

To show you an example: I made a landing page for the talk I gave last year titled “Eventually Correct”, about testing asynchronous applications. I had used Super Mario Bros. as a gimmicky analogy in the talk, so I continued the theme on the landing page with more Super Mario stuff, as well as links to my code and related libraries.

Screen capture of Super Mario Bros. 2, with Toady character leaping over a miniboss named Birdo

Unlike a blog post, a “page” on my site is not automatically linked from my home page, so publishing it won’t spoil my talk. Once it’s online, I make a memorable shortlink to it. (Last year’s was “bit.ly/eventually-correct”.) I’ll add this link to my last slide, and I’ll also schedule a tweet with the link at 1:40pm, towards the ending of my speaking slot.

Rehearse

I’ll rehearse a couple times in the final 24 hours, of course.

It’s natural to practice my talk sitting down and staring at my slides on a laptop, but my speaking coach exhorts me not to. On stage, I’ll stand in front of the big screen holding a remote clicker, looking at the audience, and only occasionally glance at the Keynote presenter view on my laptop. Ideally I’ll have my deck largely memorized by Tuesday, so I’ll rehearse standing up, looking out the window of my Airbnb apartment, using the clicker.

Record

My slides will be done by Tuesday night and my performance well-rehearsed. It’ll be a good time to record the talk. On Mac, iShowU HD is an easy way to capture my slides and voice. Since I won’t see my speaker notes in Keynote while I record, I’ll use my iPad to read them.

If I severely screw up in the middle of the talk I can edit it in post-production, but I won’t worry about small flubs. I’m not making Serial here. PyCon’s A/V experts will make a great recording of the actual talk on Wednesday, anyway, so the main point of this recording isn’t the audio track: it’s for captioning.

Caption

Once I’ve made a decent recording of the talk with a screen capture of my slide deck, I’ll upload it privately to YouTube and caption it. When I first captioned a video last year I was pleased how well YouTube’s online interface works—its killer feature is that it pauses the video while I type and resumes it automatically, so I rarely need to take my fingers from the home row. Captioning my half-hour talk will take about an hour.

Last year, for the first time, all PyCon talks had professional real-time captioning, and that’s great if you’re a Deaf person in the room. (It also helps if English is a foreign language to you.) But those captions weren’t published, so a Deaf person watching the video later sees comical computer-generated captions. The first sentence of my “Eventually Correct” talk is a very unfortunate misunderstanding:

Still image from a YouTube video. It is a slide showing my name and the title of my talk. It displays an auto-generated caption,

I believe the caption-generating software heard the opening “pop” of the audio track, followed by my session chair introducing me with “he is a staff software engineer at MongoDB.” In the face of ambiguity, the computer couldn’t refuse the temptation to guess. If you’re Deaf, better to watch the version I captioned myself.

Captioning isn’t just for the Deaf and hard of hearing; it’s an investment that has other dividends. First, it’s the start of an article: after I recorded last year’s talk I dumped the caption file from YouTube and massaged the text into a blog post on the topic. Reduce, reuse, recycle! And second, captioning my video is the ultimate rehearsal technique. Once I’ve recorded this year’s talk, listened to the recording, and typed it, I’ll know it by heart.

Sing

After I spoke at MongoDB World a couple years ago, Meghan Gill gave me the best speaker gift in modern history:

Green portable speaker with a New York City skyline stenciled into its shiny metal front panel. The panel has

It’s a portable Bluetooth speaker! I’ll take it with me to Portland next week. The night before my talk, I’ll open the SingFriend app on my iPhone, connect it to the speaker, and sing scales for half an hour. I’m not going to be an American Idol (particularly since Idol was canceled) but it doesn’t matter how well I sing: doing a long vocal warmup the night before makes a big difference in how strong my voice will be when I speak on Wednesday.

Pack

I’ll lay out my supplies, so I don’t forget anything when I leave my Airbnb for the conference center Wednesday morning:

  • Water bottle. (The session runner will offer me a disposable bottle, but why waste one?)
  • My clicker and a spare pair of batteries.
  • Cough drops.
  • Business cards (so I can forget to give them to people).
  • A cute outfit.

I’m finally ready.


Read my other articles about conference speaking.

Autonomous Racing Car Series


via blog.ouseful.info

Sometime last year, the VC-sponsored RoboRace autonomous car race series was announced as a supporting series for Formula E in the 2016-17 season.


According to the NVidia blog, the first RoboRace cars will be powered by an NVidia Drive PX2 computer (more: PCWorld – The specs and story behind the autonomous Robocar and its Nvidia Drive PX 2 brains). (Specs for the PX2 don’t appear to be on the Nvidia Automotive Solutions webpages yet?)

I haven’t seen any announcements regarding the teams yet, so if you haven’t had an invite already (nor me!;-), you’re probably not on the list. I did see this ad a few weeks ago though…

Race Engineers - Roborace (job listing via CV-Library, www.jobsthamesvalley.co.uk)

Here’s a list of folk currently claiming to be associated with Roborace on LinkedIn.

However, that doesn’t necessarily mean you can’t get into the autonomous racing thing…

For example, you could always sign up for the International Autonomous Robot Racing Challenge (IARRC), or have a go at building your own version of Georgia Tech’s AutoRally autonomous robot rally car (the Georgia Tech folk have made their code available…).

Or how about giving your wheels a day out with a self-driving car trackday?


One thing that piqued my attention in this Medium post on The First Autonomous Track Day: An interview with creator and racer Joshua Schachter was the name Joshua Schachter. Hmm… is that the self-same Joshua Schachter who created the Delicious social-bookmarking website?


Intent


via www.xaprb.com

One of the core teachings of Drucker’s classic “Managing Oneself” is to form hypotheses about what you’ll do well or poorly, and then observe the outcomes. By repeatedly practicing this, you learn what you’re good and bad at.


At some point I noticed that I sometimes wasn’t certain why I failed or succeeded. As I’ve explored this more deeply in the last few years, I’ve come to see that the biggest factor in my success or failure is often the clarity of my purpose itself.

One Of Our Quarterly Goals

A large company once asked for a call to discuss a potential partnership. This was the type of thing one does not simply brush off, so despite the fact that it wasn’t a current focus for our team, several of us put aside our work to discuss the opportunity. On the call, the person who invited us mentioned that expanding their partnership program was one of their quarterly goals. Hearing this made me feel even more excited. This wasn’t a random call! This company was committing resources and energy to this! I left the call energized about the possibility of an alliance with a brand that could vault us into the spotlight.

Later, I put myself mentally in the other person’s shoes. Wasn’t it convenient for them that when they had a quarterly goal of expanding their partnerships, they were able to find companies like ours? Companies that would align their activities and resources around efforts they hadn’t planned? This isn’t meant to express any cynicism or resentment, just to state a fact.

Put another way: under what circumstances would they have found us not willing to engage, even if we’d agreed the alliance would be beneficial? What kind of culture or thought process would that reveal?

I Was Recruited

I often ask people their career story, including the reasons for job changes. Something about telling this part of their story brings out thought patterns.

I spoke to someone who told his career story like this. “I worked for Company A for a while, then I was recruited to Company B by the CEO. After a couple of years, I was recruited into Company C.” This went on; I listened. At the end, I asked, “How did the CEO of Company B recruit you?”

There was a pause. “I received a call from him and was offered the job.”

This person was a senior executive, generally very direct, clear-spoken, even forceful and inspiring. In contrast, the story of his career decisions was entirely in the passive voice from the first word to the last. Whose decisions were they, really? He didn’t seem to hear himself saying that they weren’t his own.

I Reached Out To Follow Up

For years, I’ve worked in software development teams that use daily stand-up meetings. These meetings are meant to be short; you do them standing up to encourage that. Specifics vary, but generally you go around the circle and say what you’re working on and whether anything is blocking you. As simple as it sounds, these meetings are hard to do well.

I once encouraged a different non-software team to adopt something similar: a daily huddle every morning. I asked them to use a very specific format: what did you achieve yesterday, what results do you commit to achieving today, and what’s blocking your progress?

I found that some people were unable or unwilling to adhere strictly to the format. Instead, they’d say things like “yesterday I reached out to John about their account, touched base with Sue on the status of legal, and followed up with Mary about their contract. Today I’m going to call Phil.” All the reaching out, following up, and so forth made it impossible to tell what this person actually did yesterday; was it emails? Voicemails? Conversations? Dials that didn’t connect or got a busy signal? Likewise, calling Phil is not an achievement, it’s an activity. What are you going to achieve today?

Another type of useless meeting is a status update, which feels monotonous. “I’m working on John’s account, Sue’s legal revisions, and Mary’s contract. Today I’ll call Phil.” A lot of stuff is in-progress, but is it progressing? What does “working on” mean? What concrete outcomes need to happen next, what activities will create those outcomes, and do you commit to completing them today?

It takes a lot of calibration to get these meetings right. The leader usually needs to coach each person in one-to-one meetings to make the format work well and get the true point across. If meetings drift into vagueness or status updates, people hear each other saying the same thing every day, the meeting feels the same each day, it’s pointless, and true engagement never gets a chance.

Done well, though, the meetings create clarity, accountability, and peer pressure in a very short time. What you’re really doing is committing to achieving specific outcomes today and revealing whether you kept yesterday’s commitments. If you’re not getting that from the meeting, you’re doing it wrong; if you’re doing it wrong, it’s because of people’s mental approach to the meeting. Insisting on achieving commitment, transparency, and accountability in the meeting forces team members to adopt the right mindset, or makes it blindingly obvious that they can’t or won’t. It’s simple, but hard.

Come What May

I wrote a while ago about spending the end of the day prioritizing what you’ll do the next morning. As I was explaining this productivity technique to someone, I couched it in terms of the daily huddle. “Decide what you’ll achieve tomorrow,” I said. He pointed out that many disruptions arise in an 8-hour day, and if he chooses 8 hours of goals, he can achieve them only by disconnecting from the Internet and isolating himself from others.

This helped me realize that I wasn’t stating explicitly what I meant. I don’t mean you should pick enough things to schedule your day fully. I mean that it’s highly productive to prioritize two, three, maybe five things if they’re small, which together constitute maybe 30-60 minutes of work; 90 minutes if you’re lucky. These are concrete, specific outcomes you want to achieve, come what may. You will do many, many more things during the day, but you make a fierce commitment to yourself that at the end of the day, there is no way in hell you won’t achieve these specific results.

The Second Habit

Stephen Covey’s book The Seven Habits Of Highly Effective People was probably the first such book I ever read, and remains highly influential in my life. The second habit is “Begin With The End In Mind.” What specific result do you want to produce?

Motivation

I’ve asked a lot of people what motivates them about their careers and their work. The answers vary a lot, but there are common themes. Over time I’ve distilled these down. I have a longer document, but in brief,

  • People want to be challenged
  • People want to grow and improve
  • People want their work to have impact
  • People want to have responsibility
  • People want to do great work
  • People want to be free of obstacles that slow them down, such as bureaucracy
  • People want stellar teammates

It’s interesting that I don’t recall anyone telling me that feeling purposeful motivates them. But I’ve tested this many times, and I’ve found that pretty much anything I do is ten times more motivating when I have a definite intent for it. Seriously: if I go to the gym with the intent to finish some workout, the workout comes and goes and I move on to the next part of my day. If I go with the intent of doing the right amount of the right kinds of exercises to make me healthier and stronger for achieving my life purpose, no more and no less, then it’s completely different. I get in, get it done, and leave when I’ve met my goals for that activity.

And while I’m doing that, I’m on fire. There’s no other way to say it. When I approach my day this way, I am absolutely charged up all day every day. I recommend trying it if you’ve never had that feeling.

Maybe it’s easier to understand as the opposite. You know that feeling, when the day is over and you were “busy” the whole day, but you can’t really say what you did, specifically? You probably took things as they came to you. You feel tired, pointless, unclear. You feel that your efforts had no real impact or purpose. It’s not only demotivating, it’s demoralizing.

The antidote to that is having a Post-It note prepared, with three small things you chose last night. At the end of the day you look at the note and you can say, “yes, I definitely did achieve three specific things of my choosing today.” You still handled 98% of the same reactive stuff, answering emails and putting out fires and whatever else the world threw at you. But it’s totally different, because you Got Shit Done, too. You feel amazing about the three little things, and pretty damn good about the rest.

What I’m saying is that being purposeful is highly motivating and makes everything fun and rewarding.

What do you suppose happens to teams when they engage in a few moments of purposefulness at the beginning of every day? Could a short “huddle” meeting help everyone feel turbocharged all day? Why do football teams do huddles, anyway?

With No Agenda

Indulge me in a thought experiment. Let’s assume that all people live on a spectrum of purpose. They are more or less intentional and results-oriented. I’ll make up some fictional characters to illustrate this and use later in this post. Note: this is intentionally extremist for pedagogical purposes.

On the one extreme is Susan, the person who can instantly answer why she’s doing something. She knows not only the driving force behind her activities, but the specific outcome she wants to achieve, and how that outcome will produce the benefits she desires. Susan is so focused on this that she never does anything without first determining its purpose, whether that aligns with her goals, whether there’d be a better way to achieve those goals with more efficient use of time, money, people, and other resources, and so on. She is extremely purposeful.

To make this more concrete, imagine Susan taking a walk in the park. She won’t do that without first determining if her goal is recreation, fitness, socialization, or so forth. And she likes to maximize her results, so she’s probably going to power-walk with her friend and her dog while practicing deep conscious breathing. She takes the route that ends near a good spot for stretching.

The other extreme is Jacob, the completely passive person. He doesn’t have his own agenda. He does things because why not? Or maybe he does nothing unless he has to. Whatever, dude.

Now ask yourself, has Susan looked around the world and noticed people like Jacob? Of course she has. She’s concluded that these people have 24 hours in a day just like her, but they’re wasting them on nothing particular. Susan therefore decides that if the Jacobs of the world are going to waste the resources and abilities they’ve been given, she might as well put them to good use herself.

The Jacobs of the world are going to find themselves doing what the Susans want to be done. The Jacobs, lacking their own priorities, deciding not to decide, will find that the Susans set their priorities for them. Lacking motivation, Jacobs will find themselves in situations others have created to provide motivation, often in the form of negative consequences. Show up to flip burgers, Jacob, or you’re going to miss your rent payment!

If you’re familiar with the movie The Big Lebowski, it has a few good character studies at various points on the spectrum of intentionality.

At some point I came to a realization about Jacobs and Susans. I asked myself, do I want someone else to choose what I achieve with my life? Because if I don’t take ownership of my priorities, that’s exactly what’s going to happen. There’s no escaping it: if I’m not working on achieving my agenda, I’ll achieve someone else’s. So I’d better have an agenda of my own.

Schwarzenegger

Arnold Schwarzenegger is a great example of a hyper-intentional person. I would not choose to emulate what he’s done with his superpowers, but I think if you study him sincerely, you’ll be forced to admire his purposefulness.

His life story is the most dramatic example I know of consciously selecting what he wants, making a well-thought-out plan to achieve it, and aligning his full efforts behind that.

For example, he’s probably best known as a bodybuilder, but that was only a means to an end for him. One of his goals was to create his own niche as an actor. At the time, it would be an understatement to say there was no demand for musclebound actors with thick accents. The contemporary fashion was the exact opposite. He saw that as an opportunity: he’d create the category and utterly own it because no one else wanted it.

There was a problem; there was zero money in bodybuilding, and he knew an acting career would be financially unstable. So he analyzed the economy, realized there was a huge opportunity in real estate, and made a series of extremely lucrative investments.

Oh, and he recruited a bunch of other bodybuilders to work in his businesses, too. From the descriptions of them, many of them were named Jacob. Arnold didn’t settle for putting his own efforts into his life plan; he pulled everyone else into it too.

And so he went from an extremely driven kid in Austria to governor of California, seemingly unstoppable. At every step in his career, if you analyze him, he was juggling a dozen very carefully selected balls in the air, each with a specific purpose and outcome in mind. Some he found by “luck” and seized the opportunity; others he created from sheer force of will.

It’s Not Mystical

I’m not the first person to observe the correlation between purpose and results. People have called this a hundred different things, and practically every self-help or success book discusses it. Many of them wrap it in hogwash mysticism. It’s popular recently to dress it up in nonsense about laws of the universe and quantum physics and so on. In my view it’s quite simple and straightforward. If you decide which direction to travel, you’ll make progress towards it because you’ll correct your course as life tosses you off track. If you just go the way you’re headed, you won’t make progress towards a discernable goal. Nothing metaphysical about it.

No one should disparage those who live the journey for its own sake. It’d be a great pity to waste your life on a goal, arrive at the end, and discover that you don’t like the goal you chose, and you wish you had another chance.

Systems

I’ve been studying some books that discuss systems for goal-setting, such as OKRs. I welcome your thoughts. I don’t have a perfect system for helping everyone set goals in a way that doesn’t require a lot of effort and is part of the company culture. I don’t feel the need to invent something.

From Macro To Micro

Since noticing how much more efficient my efforts are when there’s a purpose to them, I’ve gotten kind of maniacal about it. In the ideal world, I’m an exemplary Susan. When I achieve that, I’ll have a purpose for everything:

  • I’ll have a purpose for my life. (I’ve had that since first reading Stephen Covey.)
  • I’ll have a purpose for my company. (Had that since before founding it.)
  • I’ll have a purpose for defined periods of time in the company, such as quarters and biweekly sprints. (Working on it; small steps; see above.)
  • I’ll have a purpose for each team and each person in the company. (My coach has taught me to use a document called a Job Defined Agreement for everyone. It essentially states the purpose of their work.)
  • I’ll have a purpose for every day of my life. (I do this; I am mostly successful; say 90%).
  • I’ll have a purpose for every meeting, email, phone call, project, everything.

I try. I try to move in macro cycles of purpose that contain micro cycles: quarterly, weekly, daily, minute to minute. What am I doing? How does it contribute to the intent I have chosen?

Before I pick up the phone: what specific outcomes do I want to get from this call? How, specifically, can I do it?

Before I ask so-and-so for a meeting: what is my goal for this meeting? If I ask for the meeting without being clear in my intent, I’m much less likely to get the meeting, or to get the outcome I want. If I have the intent clear in my mind first, I am already on the right track before I even compose the invitation.

From the large scale to the small scale, there are big goals and there are activities to get to those, and each bit gets broken down into smaller bits, each of which requires focus and intent from moment to moment.

Risks

There are a few risks. One is ever-present for me: whenever something is working for me, I’m impatient to double down on it. I repeatedly get feedback from family, colleagues, and friends that I’m devoting my entire energy and focus to something new to them, something they haven’t had time to adjust to. Maybe I picked up a new book and read it over the weekend. Suddenly being around me is like a firehose of whatever was in the book. Next weekend, I’ll read another book, so watch out. I can give people whiplash if I’m not careful.

Another is the risk of getting too focused on outcomes and forgetting to smell the roses. In some matters, it’s good to be running through walls getting things done. But is that the right way to approach the relaxing afternoon on the patio with good friends, a glass of wine, and a plate full of cheese and fruit? If it’s been raining for weeks straight and the weather clears, maybe there should be no higher purpose than enjoying how glorious the fresh air and the sun feel.

There’s also the risk of being impatient with people. If I’m too selfishly focused on achieving outcomes, the end might justify the means. I continually see cautionary tales of people who let this get out of hand and end up hurting people. I think the danger zone is when you forget that the world doesn’t exist for you alone. I don’t like people who have no use for people who are no use to them. I don’t want to be that person.

Closing Thoughts

This blog post might have seemed rambling, perhaps even purposeless. That’s not the case, however.

It’s a letter to myself in the future, because I’ve repeatedly learned from my old journals and other writing. I’m sharing it publicly with the goal of learning from you.

Photo Credit.

Spot-Check Regression Machine Learning Algorithms in Python with scikit-learn


via machinelearningmastery.com

Spot-checking is a way of discovering which algorithms perform well on your machine learning problem.

You cannot know which algorithms are best suited to your problem beforehand. You must trial a number of methods and focus attention on those that prove themselves the most promising.

In this post you will discover 7 machine learning algorithms that you can use when spot-checking your regression problem in Python with scikit-learn.

Let’s get started.

Spot-Check Regression Machine Learning Algorithms in Python with scikit-learn
Photo by frankieleon, some rights reserved.

Algorithms Overview

We are going to take a look at 7 regression algorithms that you can spot-check on your dataset.

4 Linear Machine Learning Algorithms:

  1. Linear Regression
  2. Ridge Regression
  3. LASSO Linear Regression
  4. Elastic Net Regression

3 Nonlinear Machine Learning Algorithms:

  1. K-Nearest Neighbors
  2. Classification and Regression Trees
  3. Support Vector Machines

Each recipe is demonstrated on the Boston House Price dataset. This is a regression problem where all attributes are numeric.

Each recipe is complete and standalone. This means that you can copy and paste it into your own project and start using it immediately.

A test harness with 10-fold cross validation is used to demonstrate how to spot-check each machine learning algorithm, and mean squared error measures are used to indicate algorithm performance. Note that the mean squared error values are inverted (negative). This is a quirk of the cross_val_score() function, which requires all algorithm metrics to be sorted in ascending order (larger value is better).

The recipes assume that you know about each machine learning algorithm and how to use them. We will not go into the API or parameterization of each algorithm.
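
Because the recipes below all share the same structure, they can also be collapsed into a single loop. The following sketch is an aside of my own, not part of the original recipes: it uses the current scikit-learn API (the model_selection module and the neg_mean_squared_error scorer) rather than the older cross_validation module shown below, but it loads the same dataset and evaluates the same seven algorithms.

# Spot-check seven regression algorithms with one 10-fold CV harness.
# Note: written against the current scikit-learn API; the recipes below
# use the older sklearn.cross_validation module available at the time.
import pandas
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data"
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
dataframe = pandas.read_csv(url, delim_whitespace=True, names=names)
array = dataframe.values
X, Y = array[:, 0:13], array[:, 13]

models = {
    "LR": LinearRegression(), "Ridge": Ridge(), "LASSO": Lasso(),
    "EN": ElasticNet(), "KNN": KNeighborsRegressor(),
    "CART": DecisionTreeRegressor(), "SVR": SVR(),
}

kfold = KFold(n_splits=10, shuffle=True, random_state=7)
for name, model in models.items():
    results = cross_val_score(model, X, Y, cv=kfold, scoring="neg_mean_squared_error")
    print("%s: %.3f" % (name, results.mean()))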

Linear Machine Learning Algorithms

This section provides examples of how to use 4 different linear machine learning algorithms for regression in Python with scikit-learn.

1. Linear Regression

Linear regression assumes that the input variables have a Gaussian distribution. It is also assumed that input variables are relevant to the output variable and that they are not highly correlated with each other (a problem called collinearity).

You can construct a linear regression model using the LinearRegression class.

# Linear Regression
import pandas
from sklearn import cross_validation
from sklearn.linear_model import LinearRegression
# Load the Boston House Price dataset directly from the UCI repository
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data"
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
dataframe = pandas.read_csv(url, delim_whitespace=True, names=names)
array = dataframe.values
# Split into the 13 input attributes and the output variable (MEDV)
X = array[:,0:13]
Y = array[:,13]
# 10-fold cross validation test harness
num_folds = 10
num_instances = len(X)
seed = 7
kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
model = LinearRegression()
# Mean squared error, reported as an inverted (negative) value
scoring = 'mean_squared_error'
results = cross_validation.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print(results.mean())

Running the example provides an estimate of the mean squared error.

-34.7052559445

2. Ridge Regression

Ridge regression is an extension of linear regression where the loss function is modified to minimize the complexity of the model measured as the sum squared value of the coefficient values (also called the l2-norm).

You can construct a ridge regression model by using the Ridge class.

# Ridge Regression
import pandas
from sklearn import cross_validation
from sklearn.linear_model import Ridge
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data"
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
dataframe = pandas.read_csv(url, delim_whitespace=True, names=names)
array = dataframe.values
X = array[:,0:13]
Y = array[:,13]
num_folds = 10
num_instances = len(X)
seed = 7
kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
model = Ridge()
scoring = 'mean_squared_error'
results = cross_validation.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print(results.mean())

Running the example provides an estimate of the mean squared error.

-34.0782462093

3. LASSO Regression

The Least Absolute Shrinkage and Selection Operator (or LASSO for short) is a modification of linear regression, like ridge regression, where the loss function is modified to minimize the complexity of the model measured as the sum absolute value of the coefficient values (also called the l1-norm).

You can construct a LASSO model by using the Lasso class.

# Lasso Regression
import pandas
from sklearn import cross_validation
from sklearn.linear_model import Lasso
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data"
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
dataframe = pandas.read_csv(url, delim_whitespace=True, names=names)
array = dataframe.values
X = array[:,0:13]
Y = array[:,13]
num_folds = 10
num_instances = len(X)
seed = 7
kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
model = Lasso()
scoring = 'mean_squared_error'
results = cross_validation.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print(results.mean())

Running the example provides an estimate of the mean squared error.

-34.4640845883

4. ElasticNet Regression

ElasticNet is a form of regularized regression that combines the properties of both Ridge Regression and LASSO regression. It seeks to minimize the complexity of the regression model (magnitude and number of regression coefficients) by penalizing the model using both the l2-norm (sum squared coefficient values) and the l1-norm (sum absolute coefficient values).

You can construct an ElasticNet model using the ElasticNet class.

# ElasticNet Regression
import pandas
from sklearn import cross_validation
from sklearn.linear_model import ElasticNet
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data"
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
dataframe = pandas.read_csv(url, delim_whitespace=True, names=names)
array = dataframe.values
X = array[:,0:13]
Y = array[:,13]
num_folds = 10
num_instances = len(X)
seed = 7
kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
model = ElasticNet()
scoring = 'mean_squared_error'
results = cross_validation.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print(results.mean())

Running the example provides an estimate of the mean squared error.

-31.1645737142

Nonlinear Machine Learning Algorithms

This section provides examples of how to use 3 different nonlinear machine learning algorithms for regression in Python with scikit-learn.

1. K-Nearest Neighbors

K-Nearest Neighbors (or KNN) locates the K most similar instances in the training dataset for a new data instance. From the K neighbors, a mean or median output variable is taken as the prediction. Of note is the distance metric used (the metric argument). The Minkowski distance is used by default, which is a generalization of both the Euclidean distance (used when all inputs have the same scale) and Manhattan distance (for when the scales of the input variables differ).

You can construct a KNN model for regression using the KNeighborsRegressor class.

# KNN Regression
import pandas
from sklearn import cross_validation
from sklearn.neighbors import KNeighborsRegressor
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data"
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
dataframe = pandas.read_csv(url, delim_whitespace=True, names=names)
array = dataframe.values
X = array[:,0:13]
Y = array[:,13]
num_folds = 10
num_instances = len(X)
seed = 7
kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
model = KNeighborsRegressor()
scoring = 'mean_squared_error'
results = cross_validation.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print(results.mean())

Running the example provides an estimate of the mean squared error.

-107.28683898
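
As a small aside that is not part of the original recipe: the distance metric mentioned above can be changed through the metric argument when the model is constructed. A hypothetical variation on the model line of the recipe (reusing the imports shown above):

# Hypothetical variation on the recipe: Manhattan distance instead of
# the default Minkowski metric, useful when input scales differ.
model = KNeighborsRegressor(n_neighbors=5, metric='manhattan')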

2. Classification and Regression Trees

Decision trees, or Classification and Regression Trees (CART, as they are known), use the training data to select the best points at which to split the data in order to minimize a cost metric. The default cost metric for regression decision trees is the mean squared error, specified via the criterion parameter.

You can create a CART model for regression using the DecisionTreeRegressor class.

# Decision Tree Regression
import pandas
from sklearn import cross_validation
from sklearn.tree import DecisionTreeRegressor
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data"
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
dataframe = pandas.read_csv(url, delim_whitespace=True, names=names)
array = dataframe.values
X = array[:,0:13]
Y = array[:,13]
num_folds = 10
num_instances = len(X)
seed = 7
kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
model = DecisionTreeRegressor()
scoring = 'mean_squared_error'
results = cross_validation.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print(results.mean())

Running the example provides an estimate of the mean squared error.

-35.4906027451

3. Support Vector Machines

Support Vector Machines (SVM) were developed for binary classification. The technique has been extended to the prediction of real-valued quantities, called Support Vector Regression (SVR). Like the classification example, SVR is built upon the LIBSVM library.

You can create an SVM model for regression using the SVR class.

# SVM Regression
import pandas
from sklearn import cross_validation
from sklearn.svm import SVR
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data"
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
dataframe = pandas.read_csv(url, delim_whitespace=True, names=names)
array = dataframe.values
X = array[:,0:13]
Y = array[:,13]
num_folds = 10
num_instances = len(X)
seed = 7
kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
model = SVR()
scoring = 'mean_squared_error'
results = cross_validation.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print(results.mean())

Running the example provides an estimate of the mean squared error.

-91.0478243332

Summary

In this post you discovered machine learning recipes for regression in Python using scikit-learn.

Specifically, you learned about:

4 Linear Machine Learning Algorithms:

  • Linear Regression
  • Ridge Regression
  • LASSO Linear Regression
  • Elastic Net Regression

3 Nonlinear Machine Learning Algorithms:

  • K-Nearest Neighbors
  • Classification and Regression Trees
  • Support Vector Machines

Do you have any questions about regression machine learning algorithms or this post? Ask your questions in the comments and I will do my best to answer them.

The post Spot-Check Regression Machine Learning Algorithms in Python with scikit-learn appeared first on Machine Learning Mastery.

Use Keras Deep Learning Models with Scikit-Learn in Python


via machinelearningmastery.com

Keras is one of the most popular deep learning libraries in Python for research and development because of its simplicity and ease of use.

The scikit-learn library is the most popular library for general machine learning in Python.

In this post you will discover how you can use deep learning models from Keras with the scikit-learn library in Python.

This will allow you to leverage the power of the scikit-learn library for tasks like model evaluation and model hyper-parameter optimization.

Let’s get started.

Use Keras Deep Learning Models with Scikit-Learn in Python
Photo by Alan Levine, some rights reserved.

Overview

Keras is a popular library for deep learning in Python, but the focus of the library is deep learning. In fact it strives for minimalism, focusing on only what you need to quickly and simply define and build deep learning models.

The scikit-learn library in Python is built upon the SciPy stack for efficient numerical computation. It is a fully featured library for general machine learning and provides many utilities that are useful in the development of deep learning models. Not least:

  • Evaluation of models using resampling methods like k-fold cross validation.
  • Efficient search and evaluation of model hyper-parameters.

The Keras library provides a convenient wrapper for deep learning models to be used as classification or regression estimators in scikit-learn.

In the next sections we will work through examples of using the KerasClassifier wrapper for a classification neural network created in Keras and used in the scikit-learn library.

The test problem is the Pima Indians onset of diabetes classification dataset. This is a small dataset with all numerical attributes that is easy to work with. Download the dataset and place it in your current working directory with the name pima-indians-diabetes.csv.

The following examples assume you have successfully installed Keras and scikit-learn.
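If you prefer to fetch the file programmatically, a minimal sketch along these lines should work; it uses the same UCI URL as the other recipes, so adjust it if the file has moved:

# Download the Pima Indians diabetes dataset and save it in the current
# working directory under the name the examples below expect.
try:
    from urllib.request import urlretrieve  # Python 3
except ImportError:
    from urllib import urlretrieve           # Python 2

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
urlretrieve(url, "pima-indians-diabetes.csv")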


Evaluate Deep Learning Models with Cross Validation

The KerasClassifier and KerasRegressor classes in Keras take an argument build_fn which is the name of the function to call to get your model.

You must define a function called whatever you like that defines your model, compiles it and returns it.

In the example below we define a function create_model() that creates a simple multi-layer neural network for the problem.

We pass this function name to the KerasClassifier class by the build_fn argument. We also pass in additional arguments of nb_epoch=150 and batch_size=10. These are automatically bundled up and passed on to the fit() function which is called internally by the KerasClassifier class.

In this example we use the scikit-learn StratifiedKFold to perform 10-fold stratified cross validation. This is a resampling technique that can provide a robust estimate of the performance of a machine learning model on unseen data.

We use the scikit-learn function cross_val_score() to evaluate our model using the cross validation scheme and print the results.

# MLP for Pima Indians Dataset with 10-fold cross validation via sklearn
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.cross_validation import StratifiedKFold
from sklearn.cross_validation import cross_val_score
import numpy
import pandas

# Function to create model, required for KerasClassifier
def create_model():
	# create model
	model = Sequential()
	model.add(Dense(12, input_dim=8, init='uniform', activation='relu'))
	model.add(Dense(8, init='uniform', activation='relu'))
	model.add(Dense(1, init='uniform', activation='sigmoid'))
	# Compile model
	model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
	return model

# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# load pima indians dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# create model
model = KerasClassifier(build_fn=create_model, nb_epoch=150, batch_size=10)
# evaluate using 10-fold cross validation
kfold = StratifiedKFold(y=Y, n_folds=10, shuffle=True, random_state=seed)
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

Running the example displays the skill of the model for each epoch. A total of 10 models are created and evaluated and the final average accuracy is displayed.

...
Epoch 145/150
692/692 [==============================] - 0s - loss: 0.4671 - acc: 0.7803
Epoch 146/150
692/692 [==============================] - 0s - loss: 0.4661 - acc: 0.7847
Epoch 147/150
692/692 [==============================] - 0s - loss: 0.4581 - acc: 0.7803
Epoch 148/150
692/692 [==============================] - 0s - loss: 0.4657 - acc: 0.7688
Epoch 149/150
692/692 [==============================] - 0s - loss: 0.4660 - acc: 0.7659
Epoch 150/150
692/692 [==============================] - 0s - loss: 0.4574 - acc: 0.7702
76/76 [==============================] - 0s
0.756442244065

Grid Search Deep Learning Model Parameters

The previous example showed how easy it is to wrap your deep learning model from Keras and use it in functions from the scikit-learn library.

In this example we go a step further. The function that we specify to the build_fn argument when creating the KerasClassifier wrapper can take arguments. We can use these arguments to further customize the construction of the model. In addition, we know we can provide arguments to the fit() function.

In this example we use a grid search to evaluate different configurations for our neural network model and report on the combination that provides the best estimated performance.

The create_model() function is defined to take two arguments optimizer and init, both of which must have default values. This will allow us to evaluate the effect of using different optimization algorithms and weight initialization schemes for our network.

After creating our model, we define arrays of values for the parameters we wish to search, specifically:

  • Optimizers for searching different weight values.
  • Initializers for preparing the network weights using different schemes.
  • Epochs for training the model for different number of exposures to the training dataset.
  • Batches for varying the number of samples before a weight update.

The options are specified in a dictionary and passed to the GridSearchCV scikit-learn class. This class will evaluate a version of our neural network model for each combination of parameters (2 x 3 x 3 x 3 for the combinations of optimizers, initializations, epochs and batches). Each combination is then evaluated using the default of 3-fold stratified cross validation.

That is a lot of models and a lot of computation. This is not a scheme that you want to use lightly because of the time it will take. It may be useful for you to design small experiments with a smaller subset of your data that will complete in a reasonable time. This is reasonable in this case because of the small network and the small dataset (less than 1000 instances and 9 attributes).

Finally, the performance and combination of configurations for the best model are displayed, followed by the performance of all combinations of parameters.

# MLP for Pima Indians Dataset with grid search via sklearn
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.grid_search import GridSearchCV
import numpy
import pandas

# Function to create model, required for KerasClassifier
def create_model(optimizer='rmsprop', init='glorot_uniform'):
	# create model
	model = Sequential()
	model.add(Dense(12, input_dim=8, init=init, activation='relu'))
	model.add(Dense(8, init=init, activation='relu'))
	model.add(Dense(1, init=init, activation='sigmoid'))
	# Compile model
	model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
	return model

# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# load pima indians dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# create model
model = KerasClassifier(build_fn=create_model)
# grid search epochs, batch size and optimizer
optimizers = ['rmsprop', 'adam']
init = ['glorot_uniform', 'normal', 'uniform']
epochs = numpy.array([50, 100, 150])
batches = numpy.array([5, 10, 20])
param_grid = dict(optimizer=optimizers, nb_epoch=epochs, batch_size=batches, init=init)
grid = GridSearchCV(estimator=model, param_grid=param_grid)
grid_result = grid.fit(X, Y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
for params, mean_score, scores in grid_result.grid_scores_:
    print("%f (%f) with: %r" % (scores.mean(), scores.std(), params))

This might take about 5 minutes to complete on your workstation, executed on the CPU (rather than the GPU). Running the example shows the results below.

We can see that the grid search discovered that using a uniform initialization scheme, rmsprop optimizer, 150 epochs and a batch size of 5 achieved the best cross validation score of approximately 75% on this problem.

Best: 0.751302 using {'init': 'uniform', 'optimizer': 'rmsprop', 'nb_epoch': 150, 'batch_size': 5}
0.653646 (0.031948) with: {'init': 'glorot_uniform', 'optimizer': 'rmsprop', 'nb_epoch': 50, 'batch_size': 5}
0.665365 (0.004872) with: {'init': 'glorot_uniform', 'optimizer': 'adam', 'nb_epoch': 50, 'batch_size': 5}
0.683594 (0.037603) with: {'init': 'glorot_uniform', 'optimizer': 'rmsprop', 'nb_epoch': 100, 'batch_size': 5}
0.709635 (0.034987) with: {'init': 'glorot_uniform', 'optimizer': 'adam', 'nb_epoch': 100, 'batch_size': 5}
0.699219 (0.009568) with: {'init': 'glorot_uniform', 'optimizer': 'rmsprop', 'nb_epoch': 150, 'batch_size': 5}
0.725260 (0.008027) with: {'init': 'glorot_uniform', 'optimizer': 'adam', 'nb_epoch': 150, 'batch_size': 5}
0.686198 (0.024774) with: {'init': 'normal', 'optimizer': 'rmsprop', 'nb_epoch': 50, 'batch_size': 5}
0.718750 (0.014616) with: {'init': 'normal', 'optimizer': 'adam', 'nb_epoch': 50, 'batch_size': 5}
0.725260 (0.028940) with: {'init': 'normal', 'optimizer': 'rmsprop', 'nb_epoch': 100, 'batch_size': 5}
0.727865 (0.028764) with: {'init': 'normal', 'optimizer': 'adam', 'nb_epoch': 100, 'batch_size': 5}
0.748698 (0.035849) with: {'init': 'normal', 'optimizer': 'rmsprop', 'nb_epoch': 150, 'batch_size': 5}
0.712240 (0.039623) with: {'init': 'normal', 'optimizer': 'adam', 'nb_epoch': 150, 'batch_size': 5}
0.699219 (0.024910) with: {'init': 'uniform', 'optimizer': 'rmsprop', 'nb_epoch': 50, 'batch_size': 5}
0.703125 (0.011500) with: {'init': 'uniform', 'optimizer': 'adam', 'nb_epoch': 50, 'batch_size': 5}
0.720052 (0.015073) with: {'init': 'uniform', 'optimizer': 'rmsprop', 'nb_epoch': 100, 'batch_size': 5}
0.712240 (0.034987) with: {'init': 'uniform', 'optimizer': 'adam', 'nb_epoch': 100, 'batch_size': 5}
0.751302 (0.031466) with: {'init': 'uniform', 'optimizer': 'rmsprop', 'nb_epoch': 150, 'batch_size': 5}
0.734375 (0.038273) with: {'init': 'uniform', 'optimizer': 'adam', 'nb_epoch': 150, 'batch_size': 5}
...

Summary

In this post you discovered how you can wrap your Keras deep learning models and use them in the scikit-learn general machine learning library.

You can see that using scikit-learn for standard machine learning operations such as model evaluation and model hyper parameter optimization can save a lot of time over implementing these schemes yourself.

Wrapping your model allowed you to leverage powerful tools from scikit-learn to fit your deep learning models into your general machine learning process.

Do you have any questions about using Keras models in scikit-learn or about this post? Ask your question in the comments and I will do my best to answer.


The post Use Keras Deep Learning Models with Scikit-Learn in Python appeared first on Machine Learning Mastery.


Visualize Machine Learning Data in Python With Pandas

Visualize Machine Learning Data in Python With Pandas:

via machinelearningmastery.com

You must understand your data in order to get the best results from machine learning algorithms.

The fastest way to learn more about your data is to use data visualization.

In this post you will discover exactly how you can visualize your machine learning data in Python using Pandas.

Let’s get started.

Visualize Machine Learning Data in Python With Pandas
Photo by Alex Cheek, some rights reserved.

About The Recipes

Each recipe in this post is complete and standalone so that you can copy-and-paste it into your own project and use it immediately.

The Pima Indians dataset is used to demonstrate each plot. This dataset describes the medical records for Pima Indians and whether or not each patient will have an onset of diabetes within five years. As such it is a classification problem.

It is a good dataset for demonstration because all of the input attributes are numeric and the output variable to be predicted is binary (0 or 1).

The data is freely available from the UCI Machine Learning Repository and is downloaded directly as part of each recipe.

Univariate Plots

In this section we will look at techniques that you can use to understand each attribute independently.

Histograms

A fast way to get an idea of the distribution of each attribute is to look at histograms.

Histograms group data into bins and provide you with a count of the number of observations in each bin. From the shape of the bins you can quickly get a feeling for whether an attribute is Gaussian, skewed or even has an exponential distribution. It can also help you see possible outliers.

# Univariate Histograms
import matplotlib.pyplot as plt
import pandas
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)
data.hist()
plt.show()

We can see that perhaps the attributes age, pedi and test may have an exponential distribution. We can also see that perhaps the mass and pres and plas attributes may have a Gaussian or nearly Gaussian distribution. This is interesting because many machine learning techniques assume a Gaussian univariate distribution on the input variables.

Univariate Histograms

Density Plots

Density plots are another way of getting a quick idea of the distribution of each attribute. The plots look like an abstracted histogram with a smooth curve drawn through the top of each bin, much like your eye tries to do with the histograms.

# Univariate Density Plots
import matplotlib.pyplot as plt
import pandas
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)
data.plot(kind='density', subplots=True, layout=(3,3), sharex=False)
plt.show()

We can see the distribution for each attribute is clearer than the histograms.

Univariate Density Plots

Box and Whisker Plots

Another useful way to review the distribution of each attribute is to use Box and Whisker Plots or boxplots for short.

Boxplots summarize the distribution of each attribute, drawing a line for the median (middle value) and a box around the 25th and 75th percentiles (the middle 50% of the data). The whiskers give an idea of the spread of the data, and dots outside of the whiskers show candidate outlier values (values that lie more than 1.5 times the spread of the middle 50% of the data beyond the box).

# Box and Whisker Plots
import matplotlib.pyplot as plt
import pandas
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)
data.plot(kind='box', subplots=True, layout=(3,3), sharex=False, sharey=False)
plt.show()

We can see that the spread of attributes is quite different. Some like age, test and skin appear quite skewed towards smaller values.

Univariate Box and Whisker Plots

Multivariate Plots

This section shows examples of plots with interactions between multiple variables.

Correlation Matrix Plot

Correlation gives an indication of how related the changes in two variables are. If two variables change in the same direction they are positively correlated. If they change in opposite directions (one goes up, one goes down), then they are negatively correlated.

You can calculate the correlation between each pair of attributes. This is called a correlation matrix. You can then plot the correlation matrix and get an idea of which variables have a high correlation with each other.

This is useful to know, because some machine learning algorithms like linear and logistic regression can have poor performance if there are highly correlated input variables in your data.

# Correlation Matrix Plot
import matplotlib.pyplot as plt
import pandas
import numpy
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)
correlations = data.corr()
# plot correlation matrix
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = numpy.arange(0,9,1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(names)
ax.set_yticklabels(names)
plt.show()

We can see that the matrix is symmetrical, i.e. the bottom left of the matrix is the same as the top right. This is useful as we can see two different views of the same data in one plot. We can also see that each variable is perfectly correlated with itself (as you would expect) along the diagonal from top left to bottom right.

Correlation Matrix Plot

Scatterplot Matrix

A scatterplot shows the relationship between two variables as dots in two dimensions, one axis for each attribute. You can create a scatterplot for each pair of attributes in your data. Drawing all these scatterplots together is called a scatterplot matrix.

Scatter plots are useful for spotting structured relationships between variables, like whether you could summarize the relationship between two variables with a line. Attributes with structured relationships may also be correlated and good candidates for removal from your dataset.

# Scatterplot Matrix
import matplotlib.pyplot as plt
import pandas
from pandas.tools.plotting import scatter_matrix
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)
scatter_matrix(data)
plt.show()

Like the Correlation Matrix Plot, the scatterplot matrix is symmetrical. This is useful to look at the pair-wise relationships from different perspectives. Because there is little point in drawing a scatterplot of each variable with itself, the diagonal shows histograms of each attribute.

Scatterplot Matrix


Summary

In this post you discovered a number of ways that you can better understand your machine learning data in Python using Pandas.

Specifically, you learned how to plot your data using:

  • Histograms
  • Density Plots
  • Box and Whisker Plots
  • Correlation Matrix Plot
  • Scatterplot Matrix

Open your Python interactive environment and try out each recipe.

Do you have any questions about Pandas or the recipes in this post? Ask in the comments and I will do my best to answer.


The post Visualize Machine Learning Data in Python With Pandas appeared first on Machine Learning Mastery.

Dynomite-manager: Managing Dynomite Clusters

Dynomite-manager: Managing Dynomite Clusters:

via techblog.netflix.com


Dynomite has been adopted widely inside Netflix due to its high performance and low latency attributes. In our recent blog, we showcased the performance of Dynomite with Redis as the underlying data storage engine. At this point (Q2 2016), there are almost 50 clusters with more than 1000 nodes, centrally managed by the Cloud Database Engineering (CDE) team. The CDE team has wide experience with other data stores, such as Cassandra, ElasticSearch and Amazon RDS.
Dynomite is used at Netflix both as a:
  1. Cache, with global replication, in front of Netflix’s data store systems e.g. Cassandra, ElasticSearch etc.
  2. Data store layer by itself with persistence and backups

The latter is achieved by keeping multiple copies of the data across AWS regions and Availability zones (high availability), client failover, cold bootstrapping (warm up), S3 backups, and other features. Most of these features are enabled through the use of Dynomite-manager (internally named Florida).
A Dynomite node consists of three processes:
  • Dynomite (the proxy layer)
  • Storage Engine (Redis, Memcached, RocksDB, LMDB, ForestDB etc)
  • Dynomite-Manager

Fig.1 Two Dynomite instances, each with Dynomite, Dynomite-manager and Redis as the data store
Depending on the requirements, Dynomite can support multiple storage engines from in-memory data stores like Redis and Memcached to SSD optimized storage engines like RocksDB, LMDB, ForestDB, etc.
Dynomite-manager is a sidecar specifically developed to manage Netflix’s Dynomite clusters and integrate it with the AWS (and Netflix) Ecosystem. It follows similar design principles from more than 6 years of experience of managing Cassandra with Priam, and ElasticSearch clusters with Raigad. Dynomite-manager was designed based on Quartz in order to be extensible to other data stores, and platforms. In the following, we briefly capture some of the key features of Dynomite-manager.

Service Discovery and Healthcheck

Dynomite-manager schedules a Quartz (lightweight thread) every 15 seconds that checks the health of both Dynomite and the underlying storage engine. Since most of our current production deployments leverage Redis, the healthcheck involves a two step approach. In the first step, we check if Dynomite and Redis are running as Linux processes, and in the second step, Dynomite-manager uses the Redis API to perform a PING to both Dynomite and Redis. A Redis PING, and the corresponding response, Redis PONG, ensures that both processes are alive and are able to serve client traffic. If any of these healthcheck steps fail, Dynomite-manager informs Eureka (Netflix Service registry for resilient mid-tier load balancing and failover) and the node is removed from Discovery. This ensures that the Dyno client can gracefully failover the traffic to another Dynomite node with the same token.
Fig.2 Dynomite healthcheck failure on Spinnaker
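To make the two-step check concrete, here is a rough Python sketch of the same idea (Dynomite-manager itself is a Java sidecar; the port numbers, process names and the discovery hook below are illustrative assumptions, not its actual implementation):

# Illustrative two-step healthcheck in the spirit of Dynomite-manager.
# Ports, process names and the discovery hook are assumptions.
import subprocess
import redis

DYNOMITE_PORT = 8102   # assumed Dynomite client port (speaks the Redis protocol)
REDIS_PORT = 6379      # assumed local Redis port

def process_is_running(name):
    # Step 1: check that the Linux process exists.
    return subprocess.call(["pgrep", "-x", name]) == 0

def ping(port):
    # Step 2: send a Redis PING and expect a PONG back.
    try:
        return redis.StrictRedis(host="127.0.0.1", port=port, socket_timeout=2).ping()
    except redis.RedisError:
        return False

def node_is_healthy():
    return (process_is_running("dynomite") and process_is_running("redis-server")
            and ping(DYNOMITE_PORT) and ping(REDIS_PORT))

if not node_is_healthy():
    # This is where Dynomite-manager would report the node unhealthy to Eureka
    # so that Dyno clients fail over to another node with the same token.
    print("node unhealthy - remove from discovery")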

Token Management and Node Configuration

Dynomite occupies the whole token range on a per rack basis. Hence, it uses a unique token within each rack (unlike Cassandra, which uses a unique token throughout the cluster). Therefore, tokens can repeat across racks within the same datacenter.
Dynomite-manager calculates the token of every node by looking at the number of slots (nodes), by which the token range is divided in the rack, and the position of the node. The tokens are then stored in an external data store along with application id, availability zone, datacenter, instance id, hostname, and elastic IP.  Since nodes are by nature volatile in the cloud, if a node gets replaced, Dynomite-manager in the new node queries the data store to find if a token was pre-generated. At Netflix, we leverage a Cassandra cluster to store this information.
Dynomite-manager receives other instance metadata from AWS, and dynamic configuration through Archaius Configuration Management API or through the use of external data sources like SimpleDB. For the instance metadata, Dynomite-manager also includes an implementation for local deployments.
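The token arithmetic is simple enough to illustrate with a few lines of Python; note this is only a sketch, and the 32-bit token space and exact formula below are assumptions based on the description above rather than Dynomite-manager's source:

# Illustrative token assignment: the token range is divided by the number of
# slots (nodes) in a rack, and each node takes the token for its position.
TOKEN_RANGE = 4294967295  # assumed maximum token value

def rack_tokens(num_slots):
    step = TOKEN_RANGE // num_slots
    return [position * step for position in range(num_slots)]

# A rack with 4 nodes; the same tokens would repeat in every other rack.
print(rack_tokens(4))  # [0, 1073741823, 2147483646, 3221225469]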

Monitoring and Insights Integration

Dynomite-manager exports the statistics of Dynomite and Redis to Atlas for plotting and time-series analysis. We use a tiered architecture for our monitoring system.
  1. Dynomite-manager receives information about Dynomite through a REST call;
  2. Dynomite-manager receives information about Redis through the INFO command.

Currently, Dynomite-manager leverages the Servo client to publish the metrics for time series processing. Nonetheless, other Insight clients can be added in order to deliver metrics to a different Insight system.
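As a rough illustration of the Redis side of this pipeline, the sketch below polls the INFO command and picks out a few fields; the field selection and polling interval are assumptions, and the real Dynomite-manager does this in Java and hands the values to the Servo client instead of printing them:

# Poll Redis INFO and extract a few metrics a monitoring sidecar might publish.
import time
import redis

r = redis.StrictRedis(host="127.0.0.1", port=6379)

while True:
    info = r.info()  # issues the Redis INFO command
    sample = {
        "used_memory": info.get("used_memory"),
        "connected_clients": info.get("connected_clients"),
        "instantaneous_ops_per_sec": info.get("instantaneous_ops_per_sec"),
    }
    # In Dynomite-manager these values would be published to Atlas via Servo.
    print(sample)
    time.sleep(30)  # assumed polling interval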

Cold Bootstrapping

Cold bootstrapping, also known as warm up, is the process of populating a node with the most recent data before joining the ring. Dynomite has a single copy of the data on each rack, essentially having multiple copies per datacenter (depending on the number of racks per datacenter). At Netflix, when Dynomite is used as a data store, we use three racks per datacenter for high availability. We manage these copies as separate AWS Auto scaling groups. Due to the volatile nature of the cloud or Chaos Monkey exercises, nodes may get terminated. During this time the Dyno client fails over to another node that holds the same token in a different availability zone.
When the new node comes up, Dynomite-manager on the new node is responsible for cold bootstrapping Redis in the same node. The above process enables our infrastructure to sustain multiple failures in production with minimal effect on the client side. In the following, we explain the operation in more detail:
  1. Dynomite-manager boots up due to auto-scaling activities, and a new token is generated for that node. In this case the new token is: 1383429731

Fig.3 Autoscaling brings a new node without data

  2. Dynomite-manager queries the external data store to identify which nodes within the local region have the same token. Dynomite-manager does not warm up from remote regions to avoid cross-region communication latencies. Once a target node is identified, Dynomite-manager tries to connect to it.

Fig.4 A peer node with the same token is identified

  3. Dynomite-manager issues a Redis SLAVEOF command to that peer. Effectively the target Redis instance sets itself as a slave of the Redis instance on the peer node. For this, it leverages Redis diskless master-slave replication. In addition, Dynomite-manager sets Dynomite in buffering (standby) mode. This effectively allows the Dyno client to continuously fail over to another AWS availability zone during the warm up process.

Fig.5 Dynomite-manager enables Redis replication

  4. Dynomite-manager continuously checks the offset between the Redis master node (the source of truth) and the Redis slave (the node that needs warm up). The offset is determined based on what the Redis master reports via the INFO command. A Dynomite node is considered fully warmed up if it has received all the data from the remote node, or if the difference is less than a pre-set value. We use the latter to limit the warm-up duration in high throughput deployments (a minimal sketch of this offset check follows the list).

Fig.6 Data streaming across nodes through Redis diskless replication

  5. Once master and slave are in sync, Dynomite-manager sets Dynomite to allow writes only. This mode allows writes to get buffered and flushed to Redis once everything is complete.
  6. Dynomite-manager stops Redis from peer syncing by using the Redis SLAVEOF NO ONE command.
  7. Dynomite-manager sets Dynomite back to normal state, performs a final check that Dynomite is operational and notifies Service Discovery through the healthcheck.

Fig.8 Dynomite node is warmed up

  8. Done!
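A minimal Python sketch of the offset check in step 4 might look like the following; the field names come from the Redis INFO replication section, while the lag threshold and connection details are assumptions for illustration:

# Sketch of the warm-up flow: enslave the local Redis to the peer that holds
# the same token, wait for the replication offsets to converge, then detach.
import time
import redis

MAX_LAG = 1000  # assumed acceptable replication lag (bytes) to call the node warm

def warm_up(local_host, peer_host, port=6379):
    local = redis.StrictRedis(host=local_host, port=port)
    peer = redis.StrictRedis(host=peer_host, port=port)

    local.slaveof(peer_host, port)  # step 3: replicate from the peer

    while True:
        master_offset = peer.info("replication").get("master_repl_offset", 0)
        slave_offset = local.info("replication").get("slave_repl_offset", 0)
        if master_offset - slave_offset <= MAX_LAG:
            break  # step 4: offsets are close enough, the node is warm
        time.sleep(1)

    local.slaveof()  # step 6: redis-py sends SLAVEOF NO ONE when called without arguments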

S3 Backups/Restores

At Netflix, Dynomite is used as a single point of truth (data store) as well as a cache. A dependable backup and recovery process is therefore critical for Disaster Recovery (DR) and Data Corruption (CR) when choosing a data store in the cloud. With Dynomite-manager, a daily snapshot for all clusters that leverage Dynomite as a data store is used to back them up to Amazon S3. S3 was an obvious choice due to its simple interface and ability to access any amount of data from anywhere.

Backup

Dynomite-manager initiates the S3 backups. The backups feature leverages the persistence feature of Redis to dump data to the drive. Dynomite-manager supports both the RDB and the AOF persistence of Redis, offering users the choice of a human-readable format of their data for debugging, or a direct memory snapshot. The backups leverage IAM credentials in order to encrypt the communication. Backups can be (a) scheduled using a date in the configuration (or by leveraging Archaius, the Netflix configuration management API), and (b) run on demand using the REST API.
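As a hedged illustration of the mechanism (Dynomite-manager's actual implementation is Java), an RDB-based backup to S3 can be sketched in a few lines of Python; the bucket name, key layout and local dump path are assumptions:

# Trigger a Redis RDB snapshot and upload it to S3.
# Bucket, key prefix and the dump.rdb location are illustrative assumptions.
import datetime
import time
import boto3
import redis

r = redis.StrictRedis(host="127.0.0.1", port=6379)
r.bgsave()  # ask Redis to write an RDB snapshot in the background
while r.info("persistence")["rdb_bgsave_in_progress"]:
    time.sleep(1)  # wait for the background save to finish

key = "dynomite-backups/%s/dump.rdb" % datetime.date.today().isoformat()
boto3.client("s3").upload_file("/var/lib/redis/dump.rdb", "my-backup-bucket", key)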

Restore

Dynomite-manager supports restoring a single node through a REST API, or the complete ring. When performing a restore, Dynomite-manager (on each node) shuts down Redis and Dynomite, locates the snapshot files in S3, and orchestrates the download of the files. Once the snapshot is transferred to the node, Dynomite-manager starts Redis and waits until the data are in memory, and then follows up with starting the Dynomite process. Dynomite-manager can also restore data to clusters with different names. This allows us to spin up multiple test clusters with the same data, enabling refreshes. Refreshes are very important at Netflix, because cluster users can leverage production data in a test environment, hence performing realistic benchmarks and offline analysis on production data. Finally, Dynomite-manager allows for targeted refreshes on a specific date, allowing cluster users to restore data to a point prior to a data corruption, test production data for a specific time frame, and explore many other use cases that we have not yet covered.

Credential Management

In regards to credentials, Dynomite-manager supports Amazon’s Identity and Access Management (IAM) key profile management. Using IAM credentials allows the cluster administrator to provide access to the AWS API without storing an AccessKeyId or SecretAccessKey on the node itself. Alternatively, one can implement the IAMCredential interface.

Cluster Management

Upgrades/Rolling Restarts

With Dynomite-manager we can perform upgrades and rolling restarts of Dynomite clusters in production without any down time. For example, when we want to upgrade or restart Dynomite-manager itself, we increase the polling interval of the Discovery service, allowing the reads/writes to Dynomite and Redis to flow. On the other hand, when performing upgrades of Dynomite and Redis, we take the node out of the Discovery service by shutting down Dynomite-manager itself, and therefore allowing Dyno to gracefully fail over to another availability zone.

REST API

Dynomite-manager provides a REST API for multiple management activities. For example, the following administration operations can be performed through Dynomite-manager:
  • /start: start Dynomite
  • /stop: stop Dynomite
  • /startstorageprocess: start storage process
  • /stopstorageprocess: stops storage process
  • /get_seeds: responds with the hostnames and tokens
  • /cluster_describe: responds with a JSON file of the cluster level information
  • /s3backup: forces an S3 backup
  • /s3restore: forces an S3 restore
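For example, an operator script could call a couple of these endpoints with Python's requests library; the host, port and path prefix below are assumptions, so check your Dynomite-manager configuration for the actual values:

# Call a couple of Dynomite-manager admin endpoints.
import requests

BASE = "http://localhost:8080/REST/v1/admin"  # assumed base URL

print(requests.get(BASE + "/cluster_describe").json())  # cluster level information
requests.get(BASE + "/s3backup")  # force an S3 backup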

Future Ideas: Dynomite-manager 2.0

  • Backups: we will be investigating the use of a bandwidth throttler during backup operations to reduce disk and network I/O. This is important for nodes that are receiving thousands of OPS. For better DR, we will also investigate diversifying our backups across multiple object storage vendors.
  • Warm up: we will explore further resiliency in our warm up process. For example, we will be considering the use of incrementals for clusters that need to vertically scale to better instance types, as well as perform parallel warm up from multiple nodes if all nodes in the same region are healthy.
  • In-line updates and restarts: currently, we manage Dynomite and the storage engine through Python and shell scripts that are invoked through REST calls by our continuous integration system. Our plan is to integrate most of these management operations inside Dynomite-manager (binary upgrades, rolling restarts etc.).
  • Healthcheck: Dynomite-manager has the perfect view of every Dynomite node, hence as Dynomite gets more mature, we plan to integrate auto-remediation inside Dynomite-manager. This can potentially minimize the amount of involvement of our engineers once the cluster is operational.

Today, we are open sourcing Dynomite-manager: https://github.com/netflix/dynomite-manager
We look forward to feedback, issues and bugs so that we can improve the Dynomite Ecosystem.

Presenting Torus: A modern distributed storage system by CoreOS

Presenting Torus: A modern distributed storage system by CoreOS:

via coreos.com

Persistent storage in container cluster infrastructure is one of the most interesting current problems in computing. Where do we store the voluminous stream of data that microservices produce and consume, especially when immutable, discrete application deployments are such a powerful pattern? As containers gain critical mass in enterprise deployments, how do we store all of this information in a way developers can depend on in any environment? How is the consistency and durability of that data assured in a world of dynamic, rapidly iterated application containers?

Today CoreOS introduces Torus, a new open source distributed storage system designed to provide reliable, scalable storage to container clusters orchestrated by Kubernetes, the open source container management system. Because we believe open source software must be released early and often to elicit the expertise of a community of developers, testers, and contributors, a prototype version of Torus is now available on GitHub, and we encourage everyone to test it with their data sets and cluster deployments, and help develop the next generation of distributed storage.

Distributed systems: Past, present, and future

At CoreOS we believe distributed systems provide the foundation for a more secure and reliable Internet. Building modular foundations that expand to handle growing workloads, yet remain easy to use and to assemble with other components, is essential for tackling the challenges of computing at web scale. We know this from three years of experience building etcd to solve the problem of distributed consensus — how small but critical pieces of information are democratically agreed upon and kept consistent as a group of machines rapidly and asynchronously updates and accesses them. Today etcd is the fastest and most stable open source distributed key-value store available. It is used by hundreds of leading distributed systems software projects, including Kubernetes, to coordinate configuration among massive groups of nodes and the applications they execute.

The problem of reliable distributed storage is arguably even more historically challenging than distributed consensus. In the algorithms required to implement distributed storage correctly, mistakes can have serious consequences. Data sets in distributed storage systems are often extremely large, and storage errors may propagate alarmingly while remaining difficult to detect. The burgeoning size of this data is also changing the way we create backups, archives, and other fail-safe measures to protect against application errors higher up the stack.

Why we built Torus

Torus provides storage primitives that are extremely reliable, distributed, and simple. It’s designed to solve some major problems common for teams running distributed applications today. While it is possible to connect legacy storage to container infrastructure, the mismatch between these two models convinced us that the new problems of providing storage to container clusters warranted a new solution. Consensus algorithms are notoriously hard. Torus uses etcd, proven in thousands of production deployments, to shepherd metadata and maintain consensus. This frees Torus itself to focus on novel solutions to the storage part of the equation.

Existing storage solutions weren’t designed to be cloud-native

Deploying, managing, and operating existing storage solutions while trying to shoehorn them into a modern container cluster infrastructure is difficult and expensive. These distributed storage systems were mostly designed for a regime of small clusters of large machines, rather than the GIFEE (Google Infrastructure For Everyone Else) approach that focuses on large clusters of inexpensive, “small” machines. Worse, commercial distributed storage often involves pricey and even custom hardware and software that is not only expensive to acquire, but difficult to integrate with emerging tools and patterns, and costly to upgrade, license, and maintain over time.

Containers need persistent storage

Container cluster infrastructure is more dynamic than ever before, changing quickly in the face of automatic scaling, continuous delivery, and as components fail and are replaced. Ensuring persistent storage for these container microservices as they are started, stopped, upgraded, and migrated between nodes in the cluster is not as simple as providing a backing store for a single server running a group of monolithic applications, or even a number of virtual machines.

Storage for modern clusters must be uniformly available network-wide, and must govern access and consistency as data processing shifts from container to container, even within one application as it increments through versions. Torus exists to address these cases by applying these principles to its architecture:

  • Extensibility: Like etcd, Torus is a building block, and it enables various types of storage including distributed block devices, or large object storage. Torus is written in Go, and speaks the gRPC protocol to make it easy to create Torus clients in any language.
  • Ease of use: Designed for containers and cluster orchestration platforms such as Kubernetes, Torus is simple to deploy and operate, and ready to scale.
  • Correctness: Torus uses the etcd distributed key-value database to store and retrieve file or object metadata. etcd provides a solid, battle-tested base for core distributed systems operations that must execute rapidly and reliably.
  • Scalability: Torus can currently scale to hundreds of nodes while treating disks collectively as a single storage pool.

“We have seen a clear need from the market for a storage solution that addresses the dynamic nature of containerized applications and can take advantage of the rapidly evolving storage hardware landscape,” said Zachary Smith, CEO of Packet, a New York-based bare metal cloud provider. “We’re excited to see CoreOS lead the community in releasing Torus as the first truly distributed storage solution for cloud-native applications.”

How Torus works

At its core, Torus is a library with an interface that appears as a traditional file, allowing for storage manipulation through well-understood basic file operations. Coordinated and checkpointed through etcd’s consensus process, this distributed file can be exposed to user applications in multiple ways. Today, Torus supports exposing this file as block-oriented storage via a Network Block Device (NBD). We also expect that in the future other storage systems, such as object storage, will be built on top of Torus as collections of these distributed files, coordinated by etcd.

Torus and Kubernetes

Torus provides simple persistent storage to Kubernetes pods

Torus includes support for consistent hashing, replication, garbage collection, and pool rebalancing through the internal peer-to-peer API. The design includes the ability to support both encryption and efficient Reed-Solomon error correction in the near future, providing greater assurance of data validity and confidentiality throughout the system.

Deploying Torus

Torus can be easily deployed and managed with Kubernetes. This initial release includes Kubernetes manifests to configure and run Torus as an application on any Kubernetes cluster. This makes installing, managing, and upgrading Torus a simple and cloud-native affair. Once spun up as a cluster application, Torus combines with the flex volume plugin in Kubernetes to dynamically attach volumes to pods as they are deployed. To an app running in a pod, Torus appears as a traditional filesystem. Today’s Torus release includes manifests using this feature to demonstrate running the PostgreSQL database server atop Kubernetes flex volumes, backed by Torus storage. Today’s release also documents a simple standalone deployment of Torus with etcd, outside of a Kubernetes cluster, for other testing and development.

What’s next for Torus? Community feedback

Releasing today’s initial version of Torus is just the beginning of our effort to build a world-class cloud-native distributed storage system, and we need your help. Guide and contribute to the project at the Torus repo on GitHub by testing the software, filing issues, and joining our discussions. If you’re in the San Francisco area, join us for the next CoreOS meetup on June 16 at 6 p.m. PT for a deep dive into the implementation and operational details of Torus.

“Distributed storage has historically been an elusive problem for cloud-native applications,” said Peter Bourgon, distributed systems engineer and creator of Go kit. “I’m really happy with what I’ve seen so far from Torus, and quite excited to see where CoreOS and the community take it from here!”

Torus is simple, reliable, distributed storage for modern application containers, and a keystone for wider enterprise Kubernetes adoption.

CoreOS is hiring

If you’re interested in helping develop Torus, or solving other difficult and rewarding problems in distributed systems at CoreOS, join us! We’re hiring distributed storage engineers.

Feature Selection For Machine Learning in Python

Feature Selection For Machine Learning in Python:

via machinelearningmastery.com

The data features that you use to train your machine learning models have a huge influence on the performance you can achieve.

Irrelevant or partially relevant features can negatively impact model performance.

In this post you will discover automatic feature selection techniques that you can use to prepare your machine learning data in python with scikit-learn.

Let’s get started.

Feature Selection For Machine Learning in Python
Photo by Baptiste Lafontaine, some rights reserved.

Feature Selection

Feature selection is a process where you automatically select those features in your data that contribute most to the prediction variable or output in which you are interested.

Having irrelevant features in your data can decrease the accuracy of many models, especially linear algorithms like linear and logistic regression.

Three benefits of performing feature selection before modeling your data are:

  • Reduces Overfitting: Less redundant data means less opportunity to make decisions based on noise.
  • Improves Accuracy: Less misleading data means modeling accuracy improves.
  • Reduces Training Time: Less data means that algorithms train faster.

You can learn more about feature selection with scikit-learn in the article Feature selection.

Feature Selection for Machine Learning

This section lists 4 feature selection recipes for machine learning in Python.

Each recipe was designed to be complete and standalone so that you can copy-and-paste it directly into your project and use it immediately.

The recipes use the Pima Indians onset of diabetes dataset to demonstrate the feature selection methods. This is a binary classification problem where all of the attributes are numeric.

1. Univariate Selection

Statistical tests can be used to select those features that have the strongest relationship with the output variable.

The scikit-learn library provides the SelectKBest class that can be used with a suite of different statistical tests to select a specific number of features.

The example below uses the chi squared (chi^2) statistical test for non-negative features to select 4 of the best features from the Pima Indians onset of diabetes dataset.

# Feature Extraction with Univariate Statistical Tests (Chi-squared for classification)
import pandas
import numpy
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# load data
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# feature extraction
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X, Y)
# summarize scores
numpy.set_printoptions(precision=3)
print(fit.scores_)
features = fit.transform(X)
# summarize selected features
print(features[0:5,:])

You can see the scores for each attribute and the 4 attributes chosen (those with the highest scores): plas, test, mass and age.

[  111.52   1411.887    17.605    53.108  2175.565   127.669     5.393
   181.304]
[[ 148.     0.    33.6   50. ]
 [  85.     0.    26.6   31. ]
 [ 183.     0.    23.3   32. ]
 [  89.    94.    28.1   21. ]
 [ 137.   168.    43.1   33. ]]

2. Recursive Feature Elimination

The Recursive Feature Elimination (or RFE) works by recursively removing attributes and building a model on those attributes that remain.

It uses the model accuracy to identify which attributes (and combination of attributes) contribute the most to predicting the target attribute.

You can learn more about the RFE class in the scikit-learn documentation.

The example below uses RFE with the logistic regression algorithm to select the top 3 features. The choice of algorithm does not matter too much as long as it is skillful and consistent.

# Feature Extraction with RFE
from pandas import read_csv
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
# load data
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# feature extraction
model = LogisticRegression()
rfe = RFE(model, 3)
fit = rfe.fit(X, Y)
print("Num Features: %d") % fit.n_features_
print("Selected Features: %s") % fit.support_
print("Feature Ranking: %s") % fit.ranking_

You can see that RFE chose the top 3 features as preg, pedi and age. These are marked True in the support_ array and marked with a choice “1” in the ranking_ array.

Num Features: 3
Selected Features: [ True False False False False  True  True False]
Feature Ranking: [1 2 3 5 6 1 1 4]

3. Principal Component Analysis

Principal Component Analysis (or PCA) uses linear algebra to transform the dataset into a compressed form.

Generally this is called a data reduction technique. A property of PCA is that you can choose the number of dimensions or principal components in the transformed result.

In the example below, we use PCA and select 3 principal components.

Learn more about the PCA class in scikit-learn by reviewing the PCA API. Dive deeper into the math behind PCA on the Principal Component Analysis Wikipedia article.

# Feature Extraction with PCA
import numpy
from pandas import read_csv
from sklearn.decomposition import PCA
# load data
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# feature extraction
pca = PCA(n_components=3)
fit = pca.fit(X)
# summarize components
print("Explained Variance: %s") % fit.explained_variance_ratio_
print(fit.components_)

You can see that the transformed dataset (3 principal components) bears little resemblance to the source data.

Explained Variance: [ 0.88854663  0.06159078  0.02579012]
[[ -2.02176587e-03   9.78115765e-02   1.60930503e-02   6.07566861e-02
    9.93110844e-01   1.40108085e-02   5.37167919e-04  -3.56474430e-03]
 [  2.26488861e-02   9.72210040e-01   1.41909330e-01  -5.78614699e-02
   -9.46266913e-02   4.69729766e-02   8.16804621e-04   1.40168181e-01]
 [ -2.24649003e-02   1.43428710e-01  -9.22467192e-01  -3.07013055e-01
    2.09773019e-02  -1.32444542e-01  -6.39983017e-04  -1.25454310e-01]]

4. Feature Importance

Bagged decision trees like Random Forest and Extra Trees can be used to estimate the importance of features.

In the example below we construct a ExtraTreesClassifier classifier for the Pima Indians onset of diabetes dataset. You can learn more about the ExtraTreesClassifier class in the scikit-learn API.

# Feature Importance with Extra Trees Classifier
from pandas import read_csv
from sklearn.ensemble import ExtraTreesClassifier
# load data
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# feature extraction
model = ExtraTreesClassifier()
model.fit(X, Y)
print(model.feature_importances_)

You can see that we are given an importance score for each attribute where the larger the score, the more important the attribute. The scores suggest the importance of plas, age and mass.

[ 0.11070069  0.2213717   0.08824115  0.08068703  0.07281761  0.14548537 0.12654214  0.15415431]


Summary

In this post you discovered feature selection for preparing machine learning data in Python with scikit-learn.

You learned about 4 different automatic feature selection techniques:

  • Univariate Selection.
  • Recursive Feature Elimination.
  • Principal Component Analysis.
  • Feature Importance.

If you are looking for more information on feature selection, see these related posts:

Do you have any questions about feature selection or this post? Ask your questions in the comments and I will do my best to answer them.


The post Feature Selection For Machine Learning in Python appeared first on Machine Learning Mastery.

How to Create Beautifully Detailed Maps Using Twitter Data

How to Create Beautifully Detailed Maps Using Twitter Data:

via thecreatorsproject.vice.com

One of Eric Fischer’s tweet maps.

Using geotagging data from Twitter’s public API, data artist and “map geek” Eric Fischer created the most detailed tweet map ever. With the help of his Mapbox article that outlines both his creative process and the tools he built for the project, anyone can replicate his beautiful maps.

Fischer first connected to Twitter’s “statuses/filter” API and received the Tweets in JSON, a format used to transmit data between a server and a web application. Because the JSON format came with more metadata than he needed for his map, he created his own program to parse the streams for only the essential information: username, date, time, location, client and text.
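A minimal version of that filtering step is easy to reproduce in Python; the field names follow Twitter’s public tweet JSON, and the sketch assumes one JSON object per line on standard input:

# Reduce raw tweet JSON (one object per line) to the handful of fields
# needed for mapping: user, timestamp, location, client and text.
import json
import sys

for line in sys.stdin:
    tweet = json.loads(line)
    if not tweet.get("coordinates"):  # keep only geotagged tweets
        continue
    lon, lat = tweet["coordinates"]["coordinates"]
    print("\t".join([
        tweet["user"]["screen_name"],
        tweet["created_at"],
        str(lat), str(lon),
        tweet["source"],  # the client used to post the tweet
        tweet["text"].replace("\t", " ").replace("\n", " "),
    ]))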

Twitter Map, “Los Angeles,” 2014

“Even though there are six billion Tweets to map, only nine percent of them are ultimately visible as unique dots. The others are filtered out as duplicate or near-duplicate locations,” Fischer explains in his how-to. To clarify the map, he filtered out duplicate or near-duplicate locations, and eliminated the “banding” effect of Tweets sent from iPhones.

When the viewer zooms in and out of areas on the map’s website, the density and massive scale of the glowing green dots becomes clear. Fischer explains that the challenge of achieving this effect is finding a way to “include all the detail when zoomed in deeply while unobtrusively dropping dots as you zoom out so that the low zoom levels are not overwhelmingly dense.” For example, he continues, at zoom level 0, when the viewer sees the whole world, there are 1586 dots. At zoom level 14, there are 590 million.

In an interview with CityLab, Fischer states that he believes a successful map is one that can “confirm something that the viewer already knows about their neighborhood or their city, and then broaden that knowledge a little by showing how some other places that the viewer doesn’t know so well are similar or different.” Fischer is making milestones as he brings cartography into the digital age through stunning visuals and data-filled social maps.

Twitter Map, “O’Hare,” 2014

Check out Fischer’s maps below, and to see more of his projects, visit his Flickr page, or keep up with him on Twitter.

This article was originally published on December 12, 2014. 

Related:

Digital Maps Inspired By Joy Division’s “Unknown Pleasures” Cover

Sprawling ‘Snow Drawings’ Transform a Mountain into Art

Epic Data Maps Let You Vicariously Run Through NYC



How To Prepare Your Data For Machine Learning in Python with Scikit-Learn

How To Prepare Your Data For Machine Learning in Python with Scikit-Learn:

via machinelearningmastery.com

Many machine learning algorithms make assumptions about your data.

It is often a very good idea to prepare your data in such a way as to best expose the structure of the problem to the machine learning algorithms that you intend to use.

In this post you will discover how to prepare your data for machine learning in Python using scikit-learn.

Let’s get started.

How To Prepare Your Data For Machine Learning in Python with Scikit-Learn

How To Prepare Your Data For Machine Learning in Python with Scikit-Learn
Photo by Vinoth Chandar, some rights reserved.

Need For Data Preprocessing

You almost always need to preprocess your data. It is a required step.

A difficulty is that different algorithms make different assumptions about your data and may require different transforms. Further, even when you follow all of the rules and prepare your data carefully, some algorithms can deliver better results without the preprocessing.

Generally, I would recommend creating many different views and transforms of your data, then exercising a handful of algorithms on each view of your dataset. This will help you flush out which data transforms might be better at exposing the structure of your problem.
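For example, a spot-check along those lines might look like the snippet below, using the same Pima Indians dataset as the recipes that follow; the two algorithms are just illustrative choices.

# Illustrative spot-check: several views of the data (via pipelines, so the
# transforms are fit inside cross validation) evaluated with two algorithms.
import pandas
from sklearn.preprocessing import MinMaxScaler, StandardScaler, Normalizer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.cross_validation import cross_val_score

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
array = pandas.read_csv(url, names=names).values
X, Y = array[:, 0:8], array[:, 8]

views = {'raw': None, 'rescaled': MinMaxScaler(), 'standardized': StandardScaler(), 'normalized': Normalizer()}
models = {'LR': LogisticRegression(), 'KNN': KNeighborsClassifier()}

for view_name, scaler in views.items():
    for model_name, model in models.items():
        steps = ([('scale', scaler)] if scaler is not None else []) + [('model', model)]
        scores = cross_val_score(Pipeline(steps), X, Y, cv=10)
        print("%-12s + %s: %.3f" % (view_name, model_name, scores.mean()))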

Preprocessing Machine Learning Recipes

This section lists 4 different data preprocessing recipes for machine learning.

All of the recipes were designed to be complete and standalone.

You can copy and paste them directly into your project and start working.

The Pima Indians diabetes dataset is used in each recipe. This is a binary classification problem where all of the attributes are numeric and have different scales. It is a great example of a dataset that can benefit from pre-processing.

You can learn more about this data set on the UCI Machine Learning Repository webpage.

Each recipe follows the same structure:

  1. Load the dataset from a URL.
  2. Split the dataset into the input and output variables for machine learning.
  3. Apply a preprocessing transform to the input variables.
  4. Summarize the data to show the change.

The transforms are calculated in such a way that they can be applied to your training data and any samples of data you may have in the future.
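A minimal sketch of that fit-once, transform-later pattern, using MinMaxScaler and some made-up numbers, looks like this:

# Made-up example: learn the scaling from the training data only, then reuse it
# on data that arrives later.
import numpy
from sklearn.preprocessing import MinMaxScaler

X_train = numpy.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
X_future = numpy.array([[2.5, 500.0]])  # e.g. a new sample to score later

scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(X_train)                 # parameters (min and max) come from the training data only
print(scaler.transform(X_train))    # training data mapped into [0, 1]
print(scaler.transform(X_future))   # the same mapping applied to the future sample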

The scikit-learn documentation has some information on how to use the various preprocessing methods. You can review the preprocessing API in scikit-learn here.

1. Rescale Data

When your data is comprised of attributes with varying scales, many machine learning algorithms can benefit from rescaling the attributes to all have the same scale.

Often this is referred to as normalization, and attributes are often rescaled into the range between 0 and 1. This is useful for the optimization algorithms used at the core of machine learning algorithms, like gradient descent. It is also useful for algorithms that weight inputs, like regression and neural networks, and algorithms that use distance measures, like K-Nearest Neighbors.

You can rescale your data with scikit-learn using the MinMaxScaler class.

# Rescale data (between 0 and 1)
import pandas
import scipy
import numpy
from sklearn.preprocessing import MinMaxScaler
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(rescaledX[0:5,:])

After rescaling you can see that all of the values are in the range between 0 and 1.

[[ 0.353  0.744  0.59   0.354  0.     0.501  0.234  0.483]
 [ 0.059  0.427  0.541  0.293  0.     0.396  0.117  0.167]
 [ 0.471  0.92   0.525  0.     0.     0.347  0.254  0.183]
 [ 0.059  0.447  0.541  0.232  0.111  0.419  0.038  0.   ]
 [ 0.     0.688  0.328  0.354  0.199  0.642  0.944  0.2  ]]

2. Standardize Data

Standardization is a useful technique to transform attributes with a Gaussian distribution and differing means and standard deviations to a standard Gaussian distribution with a mean of 0 and a standard deviation of 1.

It is most suitable for techniques that assume a Gaussian distribution in the input variables and work better with rescaled data, such as linear regression, logistic regression and linear discriminant analysis.

You can standardize data using scikit-learn with the StandardScaler class.

# Standardize data (0 mean, 1 stdev)
from sklearn.preprocessing import StandardScaler
import pandas
import numpy
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(rescaledX[0:5,:])

The values for each attribute now have a mean value of 0 and a standard deviation of 1.

[[ 0.64   0.848  0.15   0.907 -0.693  0.204  0.468  1.426]
 [-0.845 -1.123 -0.161  0.531 -0.693 -0.684 -0.365 -0.191]
 [ 1.234  1.944 -0.264 -1.288 -0.693 -1.103  0.604 -0.106]
 [-0.845 -0.998 -0.161  0.155  0.123 -0.494 -0.921 -1.042]
 [-1.142  0.504 -1.505  0.907  0.766  1.41   5.485 -0.02 ]]

3. Normalize Data

Normalizing in scikit-learn refers to rescaling each observation (row) to have a length of 1 (called a unit norm in linear algebra).

This preprocessing can be useful for sparse datasets (lots of zeros) with attributes of varying scales when using algorithms that weight input values such as neural networks and algorithms that use distance measures such as K-Nearest Neighbors.

You can normalize data in Python with scikit-learn using the Normalizer class.

# Normalize data (length of 1)
from sklearn.preprocessing import Normalizer
import pandas
import numpy
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(normalizedX[0:5,:])

The rows are normalized to length 1.

[[ 0.034  0.828  0.403  0.196  0.     0.188  0.004  0.28 ]
 [ 0.008  0.716  0.556  0.244  0.     0.224  0.003  0.261]
 [ 0.04   0.924  0.323  0.     0.     0.118  0.003  0.162]
 [ 0.007  0.588  0.436  0.152  0.622  0.186  0.001  0.139]
 [ 0.     0.596  0.174  0.152  0.731  0.188  0.01   0.144]]

4. Binarize Data (Make Binary)

You can transform your data using a binary threshold. All values above the threshold are marked 1 and all equal to or below are marked as 0.

This is called binarizing or thresholding your data. It can be useful when you have probabilities that you want to turn into crisp values. It is also useful in feature engineering when you want to add new features that indicate something meaningful.

You can create new binary attributes in Python using scikit-learn with the Binarizer class.

# binarization
from sklearn.preprocessing import Binarizer
import pandas
import numpy
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
binarizer = Binarizer(threshold=0.0).fit(X)
binaryX = binarizer.transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(binaryX[0:5,:])

You can see that all values equal to or less than 0 are marked 0 and all of those above 0 are marked 1.

[[ 1.  1.  1.  1.  0.  1.  1.  1.]
 [ 1.  1.  1.  1.  0.  1.  1.  1.]
 [ 1.  1.  1.  0.  0.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.  1.  1.  1.]
 [ 0.  1.  1.  1.  1.  1.  1.  1.]]
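As a small aside (not part of the recipe above), the same class can turn probability-like scores into crisp class labels by choosing a mid-range threshold; the scores below are made up for illustration.

# Illustrative only: threshold made-up probability scores at 0.5.
import numpy
from sklearn.preprocessing import Binarizer

probabilities = numpy.array([[0.1], [0.4], [0.5], [0.77], [0.93]])
binarizer = Binarizer(threshold=0.5)
print(binarizer.fit_transform(probabilities))  # values above 0.5 become 1.0, the rest 0.0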


Summary

In this post you discovered how you can prepare your data for machine learning in Python using scikit-learn.

You now have recipes to:

  • Rescale data.
  • Standardize data.
  • Normalize data.
  • Binarize data.

Your action step for this post is to type or copy-and-paste each recipe and get familiar with data preprocessing in scikit-learn.

Do you have any questions about data preprocessing in Python or this post? Ask in the comments and I will do my best to answer.


The post How To Prepare Your Data For Machine Learning in Python with Scikit-Learn appeared first on Machine Learning Mastery.

On The Fly PrintResearch from Huaishu Peng and Cornell...

On The Fly PrintResearch from Huaishu Peng and Cornell...:

via prostheticknowledge.tumblr.com

On The Fly Print

Research from Huaishu Peng and colleagues at Cornell University describes a 3D printing setup that can make wireframe models as they are designed on a computer:

Currently, 3D digital modeling is primarily an on-screen activity. One has to finish the entire digital design, send it to the printer, and wait for several hours to get the physical output printed. Because the user cannot check the design early on, in many cases the early printed object is not ideal and needs to go through several iterations to end up as a polished result.

Is it possible to create a 3D printing system that can print fast enough to keep up with the CAD modelling speed, so that the CAD users can have a timely, low-fidelity physical preview during the early design stage?                

We propose On-the-Fly Print: a 3D modeling approach that allows the user to design 3D models digitally while having a low-fidelity physical wireframe model printed in parallel. Our system starts printing features as soon as they are created and updates the physical model as needed. Users can quickly check the design in a real usage context by removing the partial physical print from the printer and replacing it afterwards to continue printing.                 

More Here


Binary Classification Tutorial with the Keras Deep Learning Library

Binary Classification Tutorial with the Keras Deep Learning Library:

via machinelearningmastery.com

Keras is a Python library for deep learning that wraps the efficient numerical libraries TensorFlow and Theano.

Keras allows you to quickly and simply design and train neural network and deep learning models.

In this post you will discover how to effectively use the Keras library in your machine learning project by working through a binary classification project step-by-step.

After completing this tutorial, you will know:

  • How to load training data and make it available to Keras.
  • How to design and train a neural network for tabular data.
  • How to evaluate the performance of a neural network model in Keras on unseen data.
  • How to perform data preparation to improve skill when using neural networks.
  • How to tune the topology and configuration of neural networks in Keras.

Let’s get started.

Binary Classification Worked Example with the Keras Deep Learning Library

Binary Classification Worked Example with the Keras Deep Learning Library
Photo by Mattia Merlo, some rights reserved.

1. Description of the Dataset

The dataset we will use in this tutorial is the Sonar dataset.

This is a dataset that describes sonar chirp returns bouncing off different surfaces. The 60 input variables are the strength of the returns at different angles. It is a binary classification problem that requires a model to differentiate rocks from metal cylinders.

You can learn more about this dataset on the UCI Machine Learning repository. You can download the dataset for free and place it in your working directory with the filename sonar.csv.

It is a well understood dataset. All of the variables are continuous and generally in the range of 0 to 1. The output variable is a string “M” for mine and “R” for rock, which will need to be converted to integers 1 and 0.

A benefit of using this dataset is that it is a standard benchmark problem. This means that we have some idea of the expected skill of a good model. Using cross validation, a neural network should be able to achieve performance around 84% with an upper bound on accuracy for custom models at around 88%.


2. Baseline Neural Network Model Performance

Let’s create a baseline model and result for this problem.

We will start off by importing all of the classes and functions we will need.

import numpy
import pandas
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.cross_validation import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.cross_validation import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

Next, we can initialize the random number generator to ensure that we always get the same results when executing this code. This will help if we are debugging.

# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)

Now we can load the dataset using pandas and split the columns into 60 input variables (X) and 1 output variable (Y). We use pandas to load the data because it easily handles strings (the output variable), whereas attempting to load the data directly using NumPy would be more difficult.

# load dataset
dataframe = pandas.read_csv("sonar.csv", header=None)
dataset = dataframe.values
# split into input (X) and output (Y) variables
X = dataset[:,0:60].astype(float)
Y = dataset[:,60]

The output variable is string values. We must convert them into integer values 0 and 1.

We can do this using the LabelEncoder class from scikit-learn. This class will model the encoding required using the entire dataset via the fit() function, then apply the encoding to create a new output variable using the transform() function.

# encode class values as integers
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)

We are now ready to create our neural network model using Keras.

We are going to use scikit-learn to evaluate the model using stratified k-fold cross validation. This is a resampling technique that provides an estimate of the performance of the model. It does this by splitting the data into k parts and training the model on all parts except one, which is held out as a test set to evaluate the performance of the model. This process is repeated k times and the average score across all constructed models is used as a robust estimate of performance. It is stratified, meaning that it will look at the output values and attempt to balance the number of instances that belong to each class across the k splits of the data.

To use Keras models with scikit-learn, we must use the KerasClassifier wrapper. This class takes a function that creates and returns our neural network model. It also takes arguments that it will pass along to the call to fit() such as the number of epochs and the batch size.

Let’s start off by defining the function that creates our baseline model. Our model will have a single fully connected hidden layer with the same number of neurons as input variables. This is a good default starting point when creating neural networks.

The weights are initialized using small Gaussian random numbers. The rectifier activation function is used in the hidden layer. The output layer contains a single neuron in order to make predictions. It uses the sigmoid activation function in order to produce a probability output in the range of 0 to 1 that can easily and automatically be converted to crisp class values.

Finally, we are using the logarithmic loss function (binary_crossentropy) during training, the preferred loss function for binary classification problems. The model also uses the efficient Adam optimization algorithm for gradient descent and accuracy metrics will be collected when the model is trained.

# baseline model
def create_baseline():
	# create model
	model = Sequential()
	model.add(Dense(60, input_dim=60, init='normal', activation='relu'))
	model.add(Dense(1, init='normal', activation='sigmoid'))
	# Compile model
	model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
	return model

Now it is time to evaluate this model using stratified cross validation in the scikit-learn framework.

We pass the number of training epochs to the KerasClassifier, again using reasonable default values. Verbose output is also turned off given that the model will be created 10 times for the 10-fold cross validation being performed.

# evaluate model with standardized dataset
estimator = KerasClassifier(build_fn=create_baseline, nb_epoch=100, batch_size=5, verbose=0)
kfold = StratifiedKFold(y=encoded_Y, n_folds=10, shuffle=True, random_state=seed)
results = cross_val_score(estimator, X, encoded_Y, cv=kfold)
print("Results: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

Running this code produces the following output showing the mean and standard deviation of the estimated accuracy of the model on unseen data.

Baseline: 81.68% (5.67%)

This is an excellent score without doing any hard work.

3. Re-Run The Baseline Model With Data Preparation

It is a good practice to prepare your data before modeling.

Neural network models especially benefit from having consistent input values, both in scale and distribution.

An effective data preparation scheme for tabular data when building neural network models is standardization. This is where the data is rescaled such that the mean value for each attribute is 0 and the standard deviation is 1. This preserves Gaussian and Gaussian-like distributions whilst normalizing the central tendencies for each attribute.

We can use scikit-learn to perform the standardization of our Sonar dataset using the StandardScaler class.

Rather than performing the standardization on the entire dataset, it is good practice to train the standardization procedure on the training data within each pass of a cross validation run and to use the trained standardization to prepare the “unseen” test fold. This makes standardization a step in model preparation within the cross validation process, and it prevents the algorithm from having knowledge of the “unseen” data during evaluation, knowledge that might otherwise leak in through the data preparation scheme (for example, a sharper estimate of the distribution).

We can achieve this in scikit-learn using a Pipeline. The pipeline is a wrapper that executes one or more models within a pass of the cross validation procedure. Here, we can define a pipeline with the StandardScaler followed by our neural network model.

# evaluate baseline model with standardized dataset
numpy.random.seed(seed)
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasClassifier(build_fn=create_baseline, nb_epoch=100, batch_size=5, verbose=0)))
pipeline = Pipeline(estimators)
kfold = StratifiedKFold(y=encoded_Y, n_folds=10, shuffle=True, random_state=seed)
results = cross_val_score(pipeline, X, encoded_Y, cv=kfold)
print("Standardized: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

Running this example provides the results below. We do see a small but very nice lift in the mean accuracy.

Standardized: 84.07% (6.23%)

4. Tuning Layers and Number of Neurons in The Model

There are many things to tune on a neural network, such as the weight initialization, activation functions, optimization procedure and so on.

One aspect that may have an outsized effect is the structure of the network itself, called the network topology. In this section, we take a look at two experiments on the structure of the network: making it smaller and making it larger.

These are good experiments to perform when tuning a neural network on your problem.

4.1. Evaluate a Smaller Network

I suspect that there is a lot of redundancy in the input variables for this problem.

The data describes the same signal from different angles. Perhaps some of those angles are more relevant than others. We can force a type of feature extraction by the network by restricting the representational space in the first hidden layer.

In this experiment we take our baseline model with 60 neurons in the hidden layer and reduce it by half to 30. This will put pressure on the network during training to pick out the most important structure in the input data to model.

We will also standardize the data as in the previous experiment with data preparation and try to take advantage of the small lift in performance.

# smaller model
def create_smaller():
	# create model
	model = Sequential()
	model.add(Dense(30, input_dim=60, init='normal', activation='relu'))
	model.add(Dense(1, init='normal', activation='sigmoid'))
	# Compile model
	model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
	return model

numpy.random.seed(seed)
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasClassifier(build_fn=create_smaller, nb_epoch=100, batch_size=5, verbose=0)))
pipeline = Pipeline(estimators)
kfold = StratifiedKFold(y=encoded_Y, n_folds=10, shuffle=True, random_state=seed)
results = cross_val_score(pipeline, X, encoded_Y, cv=kfold)
print("Smaller: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

Running this example provides the following result. We can see that we have a very slight boost in the mean estimated accuracy and an important reduction in the standard deviation (average spread) of the accuracy scores for the model.

This is a great result because we are doing slightly better with a network half the size, which in turn takes half the time to train.

Smaller: 84.61% (4.65%)

4.2. Evaluate a Larger Network

A neural network topology with more layers offers more opportunity for the network to extract key features and recombine them in useful non-linear ways.

We can easily evaluate whether adding more layers to the network improves performance by making another small tweak to the function used to create our model. Here, we add one new layer (one line) to the network that introduces another hidden layer with 30 neurons after the first hidden layer.

Our network now has the topology:

60 inputs -> [60 -> 30] -> 1 output

The idea here is that the network is given the opportunity to model all input variables before being bottlenecked and forced to halve the representational capacity, much like we did in the experiment above with the smaller network.

Instead of squeezing the representation of the inputs themselves, we have an additional hidden layer to aid in the process.

# larger model
def create_larger():
	# create model
	model = Sequential()
	model.add(Dense(60, input_dim=60, init='normal', activation='relu'))
	model.add(Dense(30, init='normal', activation='relu'))
	model.add(Dense(1, init='normal', activation='sigmoid'))
	# Compile model
	model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
	return model

numpy.random.seed(seed)
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasClassifier(build_fn=create_larger, nb_epoch=100, batch_size=5, verbose=0)))
pipeline = Pipeline(estimators)
kfold = StratifiedKFold(y=encoded_Y, n_folds=10, shuffle=True, random_state=seed)
results = cross_val_score(pipeline, X, encoded_Y, cv=kfold)
print("Larger: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

Running this example produces the results below. We can see that we do get a nice lift in the model performance, achieving near state-of-the-art results with very little effort indeed.

Larger: 86.47% (3.82%)

With further tuning of aspects like the optimization algorithm and the number of training epochs, it is expected that further improvements are possible. What is the best score that you can achieve on this dataset?
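One possible direction (not covered in this tutorial) is to wrap the standardized pipeline in a grid search over the number of epochs, the batch size and the optimizer. The sketch below uses the same-era scikit-learn and Keras APIs as the code above; the grid values are only illustrative, and the search trains many models, so expect it to take a while.

# Illustrative grid search sketch over epochs, batch size and optimizer.
import numpy
import pandas
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.grid_search import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

def create_tunable(optimizer='adam'):
	# same topology as the larger model above, with a tunable optimizer
	model = Sequential()
	model.add(Dense(60, input_dim=60, init='normal', activation='relu'))
	model.add(Dense(30, init='normal', activation='relu'))
	model.add(Dense(1, init='normal', activation='sigmoid'))
	model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
	return model

# load and encode the data as before
dataframe = pandas.read_csv("sonar.csv", header=None)
dataset = dataframe.values
X = dataset[:,0:60].astype(float)
encoded_Y = LabelEncoder().fit_transform(dataset[:,60])

numpy.random.seed(7)
pipeline = Pipeline([
	('standardize', StandardScaler()),
	('mlp', KerasClassifier(build_fn=create_tunable, verbose=0))
])
param_grid = {
	'mlp__nb_epoch': [100, 200],
	'mlp__batch_size': [5, 10],
	'mlp__optimizer': ['adam', 'rmsprop']
}
grid = GridSearchCV(estimator=pipeline, param_grid=param_grid, cv=10)
grid_result = grid.fit(X, encoded_Y)
print("Best: %.2f%% using %s" % (grid_result.best_score_ * 100, grid_result.best_params_))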

Summary

In this post you discovered the Keras Deep Learning library in Python.

You learned how you can work through a binary classification problem step-by-step with Keras, specifically:

  • How to load and prepare data for use in Keras.
  • How to create a baseline neural network model.
  • How to evaluate a Keras model using scikit-learn and stratified k-fold cross validation.
  • How data preparation schemes can lift the performance of your models.
  • How experiments adjusting the network topology can lift model performance.

Do you have any questions about Deep Learning with Keras or about this post? Ask your questions in the comments and I will do my best to answer.


The post Binary Classification Tutorial with the Keras Deep Learning Library appeared first on Machine Learning Mastery.

Rebel Rebel Records

Rebel Rebel Records:

via vanishingnewyork.blogspot.com

VANISHING

After 28 years in business, Rebel Rebel Records at 319 Bleecker Street is being forced to close by rising rent. It will shutter at the end of June.



Reader John Vairo, Jr., writes in:

Owner David Shebiro “told us that the owner of the building has raised the rent (what else is new) and they plan to put another basic ‘high-end’ clothing store like Intermix in its place–because that’s what the Village needs, another Intermix, or worse another bank or another pharmacy.”


photo: John Vairo, Jr.

John adds, “To say that Rebel Rebel is an institution would be an understatement and to see a unique and sustainable business for nearly 28 years bite the dust like so many others that give this city soul is a fucking tragedy.”


photo: John Vairo, Jr.

The news about Rebel Rebel has been percolating these past weeks. Other readers have written in to tell me that “the clothing store next door” is going to be expanding into the record shop’s space. If that’s the case, that store is either Scotch & Soda to the east or St. James to the west.

I remember when St. James moved in. With its Hamptons chic, the “nautical brand” made me nervous for Rebel Rebel. This kind of gentrification is contagious. Scotch & Soda came next, replacing the local favorite Cafe Angelique when the landlord hiked the rent from $16,000 to $42,000 a month. Sandwiched between those two, it was clear that Rebel Rebel was next.



In 2014, the beloved record shop made the Vanishing New York “What to Worry About” list–a long list that is growing shorter by the day.

Only weeks ago, AMNY listed Rebel Rebel as one of Bleecker’s few remaining icons, a rapidly vanishing breed on a street that is turning into a center for high-end luxury shopping mall brands and candy treats–and not much else.

What record stores remain in the Village? Bleecker Bob’s shut down. Bleecker Street Records was pushed off Bleecker when the landlord raised the rent to $27,000 per month, but it’s hanging in there on West 4th. There’s House of Oldies over on Carmine, miraculously surviving.



Now


And now the door is closing on another one–not because “it’s natural,” not because “that’s the trend,” or people are shopping online, or any of those other reasons too often given for the apocalyptic die-off of New York culture. It’s because of the rent. Period.

And rebels are no longer welcome in this city.


Once again: #SaveNYC. You can help stop the bleeding.

Unrealistic project estimation

Too Human (Not) to Fail

Too Human (Not) to Fail:

via source.opennews.org

By Lena Groeger

Too Human (Not) to Fail

(potential past via Flickr)

[Cross-posted with ProPublica]

A coffee grinder that only works when the lid is on. An electrical plug that only fits into an outlet one way. Fire doors that stay unlocked in an emergency.

Lots of everyday objects are designed to prevent errors—saving clumsy and forgetful humans from our own mistakes or protecting us from worst-case scenarios. Sometimes designers make it impossible for us to mess up, other times they build in a backup plan for when we inevitably do. But regardless, the solution is baked right into the design.

This concept has a lot of names: defensive design, foolproof, mistake proof, fail-safe. None is as delightful as the Japanese poka-yoke.

The idea of the poka-yoke (which means literally, “avoiding mistakes”) is to design something in such a way that you couldn’t mess it up even if you tried. For example, most standard USB cables can only be plugged into a computer the correct way. Not to say you would never attempt to plug it in upside down, but if you do, it simply won’t fit. On the other hand, it’s easy to reverse the + and - ends of a battery when you replace them in your TV remote. The remote’s design provides other clues about the correct way to insert the batteries (like icons), but it’s still physically possible to mess it up. Not so with the USB cable. It only fits one way, by design.

Many consumer coffee grinders are another example of a design that physically prevents you from messing up. Even if you wanted to, you could not chop your fingers on the blade, because the “on” switch for the grinder is triggered by closing the lid (as opposed to a blender, which leaves its blades easily accessible to stray fingers).

coffee grinder

The humble coffee grinder that only works when it’s closed. Source: Flickr

Foolproof design can also save your life. The mechanical diver’s watch is designed with a bezel that spins in only one direction. It functions as a simple timer that a diver can use to know how much oxygen is left in the tank.

In a blog post about resilient design, designer Steven Hoober describes how this smart design can prevent disaster:

If the ring were to get bumped, changing its setting, having it show less time might be inconvenient, but its going the other way and showing that you have more time than you do might kill you. You don’t even need to know how it works. It just works.

photo of dive watch

The diver watch will never show you more time than you actually have left underwater. Source: Flickr

Foolproof measures can be found throughout web design (although perhaps without the life-saving part). Ever filled out an online form incorrectly and only found out because you could not progress to the next step? That’s a conscious decision by a designer to prevent an error. In this case, from Yahoo, it’s even a chance to insert a little humor:

screencap of Yahoo error message asking user if they're really from the future

Yahoo’s humorous design prevents you from being born in the future. Source: UXmas

Sometimes, design cannot prevent you from messing up (we humans somehow always figure out a way to do things wrong). But it can still make it harder for you to do the wrong thing. This type of design is not exactly foolproof—more like fool-resistant.

Child-resistant safety caps on medicine bottles, for example, keep kids from accidentally overdosing. A water dispenser that makes you push an extra button or pull up a lever to dispense hot water makes it harder for you to accidentally scald yourself. Neither of these designs is as foolproof as the coffee grinder. But they do put an additional step between you (or your child) and disaster.

screencap of are-you-sure dialogs

These pop-up messages put a small step between you and the loss of precious files.

We see these features quite often on our computers. Most of us are familiar with the “Are you sure?” messages before you empty the Trash or the “Do you want to…” before you replace a file with another one by the same name. These alerts certainly don’t prevent us from making a mistake (in fact, we probably ignore them most of the time), but their purpose is to slow us down.

Designers have also come up with more elaborate confirmation steps. For instance, Gmail will detect whether you’ve used the word “attached” in an email you’ve written and, if you try to send it without an attachment, will ask you if you meant to include one. Github, a popular website used by software developers to collaborate on code, forces you to type the full name of the project in order to delete it.

screencap of github screen forcing extra work to delete repo

GitHub makes it harder for you to accidentally delete your projects, by design.
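A toy sketch of that confirm-by-typing pattern (not GitHub’s actual implementation) could be as simple as:

# Toy sketch: a destructive action only proceeds if the user retypes the
# exact resource name.
def confirm_delete(resource_name, typed_confirmation):
    """Return True only if the confirmation matches the resource name exactly."""
    return typed_confirmation.strip() == resource_name

print(confirm_delete("my-important-project", "my-important-project"))  # True: delete proceeds
print(confirm_delete("my-important-project", "my-importnat-project"))  # False: the typo blocks the delete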

Most of these examples work by forcing your attention to the task at hand, breaking your autopilot behavior and making you really consider what you are about to do. These design details don’t make it impossible to screw up, but they certainly make it a little bit harder.

Still other designs revolve around keeping your information secure. Your computer may prompt you for a login if you’ve left it idle for a few minutes, preventing someone else from seeing or stealing sensitive information. Smartphones often do the same thing, requiring a passcode to re-enter. Some web browsers will prevent you from downloading certain files, and your computer’s operating system may ask you if you are SURE you want to open a program you got from the internet. Connect a smartphone to a new computer and it may ask you to confirm that this computer can be trusted. These security measures don’t prevent you from doing dangerous things, but they try to keep careless mistakes from turning into a horrible outcome.

Let’s say it’s too late to prevent the error: the mistake has occurred, the failure has happened. What now? This is where fail-safe design comes in. Fail-safe design prevents failure from becoming absolute catastrophe.

In some cases, it’s the system (or environment) that has failed. In the event of a fire, fire doors are required by law to fail unlocked, so that people can escape a burning building. On the other hand, if you need to protect state secrets or cash in a bank vault, you’d probably want a fail-secure system for those doors, which would fail locked.

Circuit breakers cut the power if an electrical current gets too high. Elevators have brakes and other fail-safe systems that engage if the cable breaks or power goes out, keeping the elevator from plummeting to its passengers’ death.

In other instances, it’s our own human error that the fail-safe system is designed for. SawStop is a table-saw safety technology that automatically shuts off a spinning saw blade if it comes in contact with flesh. The blade has a sensor that can detect whether it’s a piece of wood or your finger, using the same property (electrical conductivity) that makes a touch screen sensitive to your bare fingers but not to your gloves. In less than one thousandth of a second, the saw blade will shut off, giving you in the worst case only a small nick (rather than removing your thumb). Don’t believe this could work so fast? Watch this video:

SawStop, a table saw brake technology that can help you keep your fingers. Source: SawStop

Some industrial paper cutters are designed to shut off if they detect motion nearby (presumably a hand getting too close to the blade). Similarly, many automatic garage doors will stop closing if they sense something, or someone, in the way.

Another well-known fail-safe measure is the dead man’s switch. The dead man’s switch kicks in when a human in charge lets go of the controls—or dies, as the name implies. In the event of an accident (say, a train operator has a heart attack), the dead man’s switch can prevent harm to all the passengers by stopping the train.

This actually happened a few years ago on the New York City subway, when an MTA employee had a heart attack on the G train. His hands lost grip of the controls, the brakes were activated, and the train slowed to a stop.

The dead man’s switch is also a common device in lawn mowers and other equipment that require you to continually hold down a lever or handle to operate. As soon as you let go, the motor stops. U.S. law actually specifies that all walk-behind lawn mowers come equipped with such a switch that stops the blade within 3 seconds of a user releasing her grip.
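In software terms, a dead man’s switch is essentially a watchdog timer. Here is a minimal, purely illustrative sketch:

# Purely illustrative: the operator must call heartbeat() regularly; if no
# heartbeat arrives within the timeout, the machine is stopped.
import time

class DeadMansSwitch(object):
    def __init__(self, timeout_seconds=3.0):
        self.timeout = timeout_seconds
        self.last_heartbeat = time.time()
        self.running = True

    def heartbeat(self):
        """The operator is still holding the control."""
        self.last_heartbeat = time.time()

    def check(self):
        """Call periodically; stop the machine if the operator has let go."""
        if self.running and time.time() - self.last_heartbeat > self.timeout:
            self.running = False
            print("No heartbeat for %.1f seconds: stopping the machine." % self.timeout)

switch = DeadMansSwitch(timeout_seconds=0.2)
switch.heartbeat()   # the operator grips the handle
time.sleep(0.3)      # ...and then lets go
switch.check()       # the safety engages and the motor stops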

In software, absolute catastrophe often means losing your work, your files, that long heartfelt email you worked so hard on. So, many fail-safe designs revolve around letting you undo actions or automatically saving work in the background as you go along. Auto-saving Google Docs are a vast improvement over word-processing programs that can lose hours of work with a single crash or loss of power. Web browsers like Chrome can restore all your tabs if you accidentally close a window (even if you’d rather declare tab bankruptcy).

Finally, we have the last-ditch, eleventh-hour design solution to keep you safe from the worst of the worst: The backup.

A backup parachute is perhaps the most dramatic of all backup devices, but many things in the real world are designed to have similar built-in redundancies. Cars have two sets of brake circuits (not to mention a spare tire). Airplanes have multiple redundant control systems. Emergency stairwells have lights that work on battery power if the building’s electricity goes out. On computers, backing up your photos or making a copy of a file before editing it is just common sense.

parachute!

Backup parachutes: don’t leave home without ‘em. Source: Flickr

In the end, nothing humans build or even touch will ever be free from error. Luckily, designers work tirelessly to save us from our mistakes. And in many cases, we don’t have to know how the poka-yoke works. It just works.

HPE Promises a Docker Engine in Every New Server

HPE Promises a Docker Engine in Every New Server:

via thenewstack.io

In the clearest signal to date that Docker-containerized infrastructure is moving into data centers where Windows abounds — where applications typically run in classic VMware virtual machines deployed on Amazon-style cloud platforms — Hewlett Packard Enterprise announced that all new HPE servers will ship with Docker technology built in, HPE CEO Meg Whitman told a capacity crowd at HPE Discover USA 2016 on Tuesday afternoon.

“All HP servers will be bundled with the Docker Engine,” said Whitman, “enabling customers to create distributed applications that are portable across any infrastructure.”

Where the Customers Are

The cross-platform portability of Docker dominated discussions of the technology Tuesday at the conference. HPE was careful to associate Docker with the ability to bring the old stack into the cloud, rather than to resort to the typical enterprise-class characterization of Docker as an experimental technology.

“HPE is committed to meeting customers where they are,” said Jay Jamison, HPE’s vice president for product marketing, in a press and analysts’ briefing Tuesday. “As customers look at Mesosphere and Docker as infrastructures that are important to them, HPE is committed to being a strong partner and supporter of those technologies, both in our software, as well as with our hardware infrastructure capabilities. [Ours is] a very broad strategy that we think enables customers to have a strong partner that’s focused on meeting them where they are, and providing them solutions to the problems they most commonly face.”

HPE Vice President for Product Management Omri Gazitt revealed to us that Kubernetes will serve as the orchestration platform for some services within Helion Cloud Suite, and by extension in services bundled with the manufacturer’s Helion CloudSystem 10 series hardware. Those services will manifest themselves, said Gazitt, in how Cloud Suite serves up Cloud Foundry to HPE customers, although he said its main interest is in hiding the complexity of workload orchestration from customers who may have been turned off to containerization thus far.

Virtualization was for Servers

During the first round of keynotes Tuesday, HPE CEO Whitman was joined on stage by Docker Inc. CEO Ben Golub, who was warmly received by the IT professionals present, many of them already familiar with Docker’s reputation. Still, Golub found himself introducing Docker to many attendees who may not yet be familiar with what the platform is.

At one point, Whitman asked Golub what the arrival of Docker as a standard component of HPE servers meant in the long run for virtualization.

“Virtualization is a fantastic technology,” Golub responded, “but it was designed really for building, modifying, and moving around servers. And if people want to build, modify, and assemble applications, Docker’s a much better solution.”

Responding to Whitman’s concern about security, Golub responded that more organizations are adopting Docker in response to their security concerns, not despite them. He assured her that workloads deployed in a Docker environment are digitally signed, “so what you’re deploying into your environment has a very low chance of being something bad. And if something bad happens — and of course, bad things always happen — it’s trivial to swap it out. You can swap out one container without screwing up the rest of your applications, without having to shut down machines.”

The HPE CEO appeared delighted by Golub’s quote that Docker implementation can drive up utilization in the enterprise by as much as 20x — a figure that we’ve heard before, at least in The New Stack, but which is only now being disseminated to many enterprises.

The Reality of Windows

Helion has become HPE’s umbrella brand for cloud platforms, and OpenStack is already a major component there. Yet in unveiling new and revised components in its Helion Cloud Portfolio Tuesday, HPE executives portrayed Docker’s deeper integration into the system as enabling choice in workload deployment and orchestration, making certain to include OpenStack and Windows Server-oriented systems on an even playing field.

This latter category explicitly includes Nano Server, Microsoft’s far leaner implementation of Windows Server, designed for remote deployment and administration, including within containers. Azure Stack is also being mentioned in this equation.

What’s emerging is a vision of Docker being a de facto infrastructure provider not for its ability to replace existing workloads, but rather to uphold them. Docker Inc.’s existing partnership with Microsoft might bear significantly more fruit along these lines in the near future.

Evidence of a Rethink

Last December in London, HPE announced the first stages of its partnership with Docker. That announcement was accompanied at the time by the news that HPE would be building its own container-oriented operating system called ContainerOS, in an effort to endow containers with the security tools and other resources they would need to be exploited by HPE’s Synergy line of hardware, also announced last December. Although Synergy continues to be a major product point here at the June Vegas show, we have not seen any mention of ContainerOS thus far. It may yet be a currently offered HPE product, but it’s no longer being billed as the next stage in the Darwinian evolutionary chain.


Instead, we heard talk about producing technologies that could extend the reach of HPE platforms into the cloud, rather than constrain it to a few product lines. Gazitt told The New Stack that HPE has been working in conjunction with Mesosphere to produce containerized infrastructure for OpenStack and Cloud Foundry, running on top of DC/OS, on HPE — and presumably other — platforms.

HPE is looking to produce servers with turnkey cloud provisioning, with options of OpenStack and Windows-oriented systems.

It is a clear evolution of HPE’s messaging towards containerization, as well as a novel approach to its presentation: not as the encroachment of new methodologies on traditional workloads, but rather a new way to stage those workloads to extend their active life and improve their efficiency.


Cloud Foundry, Docker, HPE, and Mesosphere are sponsors of The New Stack.

Images: Scott Fulton.

The post HPE Promises a Docker Engine in Every New Server appeared first on The New Stack.
