Data Science. Programming
1.
Data Science
Programming
an advocate of
concrete computing –
and HMC's mascot
2.
About myself
Who: Faisal Ahmed
Where: TalTech
What: Research Communication and Software Engineering
Now: Narva College
Contact: [email protected]
3.
Data?!
• Neighbor's name
• A place they consider home
• Are they working at a company now?
Where?
• How many U.S. states have they visited?
• Their favorite unhealthy food… ?
• Do they have any "Data Science" background?
(statistics, machine learning, CS)
4.
Data!
• Neighbor's name
Zachary Dodds
• A place they consider home
Pittsburgh, PA
• Are they working at a company now? Where?
Harvey Mudd
• How many U.S. states have they visited?
44
• Their favorite unhealthy food… ?
M&Ms
• Do they have any "Data Science" background? (statistics, machine learning, CS)
mostly CS for me…
be sure to set up your login + profile for the submission site…
6.
Data Science concerns
Is "Data Science" important or just trendy?
7.
Data Science concerns
Hmmm…
8.
the companies are expanding as fast as the data!
9.
There's certainly a lot of it!
Data, data everywhere…
Data produced each year (logarithmic scale, marked at 1 Petabyte, 1 Exabyte, and 1 Zettabyte):
2002: 5 EB
2006: 161 EB
2009: 800 EB
2011: 1.8 ZB
2015: 8.0 ZB
For comparison:
100 years of HD video + audio: 60 PB
Human brain's capacity: 14 PB
120 PB
Units: 1 Petabyte = 1000 TB; 1 TB = 1000 GB
References
(2002) 5 EB: http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/execsum.htm
(2006) 161 EB: http://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf
(2009) 800 EB: http://www.emc.com/collateral/analyst-reports/idc-digital-universe-are-you-ready.pdf
(2011) 1.8 ZB: http://www.emc.com/leadership/programs/digital-universe.htm
(2015) 8 ZB: http://www.emc.com/collateral/analyst-reports/idc-extracting-value-from-chaos-ar.pdf
(life in video) 60 PB: in 4320p resolution, extrapolated from 16MB for 1:21 of 640x480 video (w/sound) – almost certainly a gross overestimate, as sleep can be compressed significantly!
(brain) 14 PB: http://www.quora.com/Neuroscience-1/How-much-data-can-the-human-brain-store
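To get a feel for how fast the yearly figures on this slide grow, here is a minimal R sketch. It only re-uses the five estimates listed above, converted to exabytes (1 ZB = 1000 EB); the variable names are chosen for illustration and are not part of the course materials.

```r
# Yearly data volumes from the slide, converted to exabytes (1 ZB = 1000 EB)
year     <- c(2002, 2006, 2009, 2011, 2015)
exabytes <- c(5, 161, 800, 1800, 8000)

# Overall growth factor between the first and the last estimate
growth <- exabytes[length(exabytes)] / exabytes[1]

# Average growth factor per year over the same span (a geometric mean)
span   <- year[length(year)] - year[1]
yearly <- growth^(1 / span)

print(growth)            # 1600
print(round(yearly, 2))  # roughly 1.76, i.e. about 76% more data each year
```

Working on a logarithmic scale, as the chart does, is what makes a steady yearly factor like this show up as a roughly straight line.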
10.
I'd call it data, not information
wisdom
knowledge
information
data
11.
Big Data?
I agree with this…
12.
Make data easier to use ~ by using it!
It may be true that Data Science isn't a science – but that doesn't mean it's not useful!
13.
IST 380 ~ the big picture
What?
Data Science
Programming
Why?
Data Rules
All of our insights – large and small, permanent and
ephemeral, natural and artificial – come about
through the integration of lots of data.
Data Science simply recognizes that the rules and
skills behind those insights are widely applicable…
14.
A few examples…
Make3d
Andrew Ng ~ Computers and Thought award, 2009
How is this being done?
and how do we succeed?
… Data Science is at the heart of computer science
15.
A few examples…
Learning to Powerslide
Stanford's Autonomous Vehicles project (Thrun et al.)
… Data Science is at the heart of computer science
16.
A few examples…
Learning ground from obstacles
"my summer was finding that red line"
… Data Science is at the heart of computer science
17.
A few examples…
classification
segmentation
Learning ground from obstacles
18.
Insights beyond science
19.
Marketing
20.
Visualization
Motivation
21.
22.
Recommender Systems
predicting movie ratings
23.
Netflix Prize
(I don't know this guy)
Bob Bell, winner of the "Netflix prize"
Napoleon Dynamite = 1.22
Batman Begins = .75
Some films are difficult to predict…
Finding Nemo = ??
Lord of the Rings = ??
24.
Netflix Prize
(I don't know this guy)
Bob Bell, winner of the "Netflix prize"
Napoleon Dynamite = 1.22
Batman Begins = .75
Finding Nemo = .67
Lord of the Rings = .42
Some films are difficult to predict… and others are easier!
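The slide does not say what the per-film numbers measure. If they are a prediction error such as root-mean-square error (an assumption on my part, not something stated here), a smaller number means the film's ratings are easier to predict. A minimal sketch of that measure in R, using invented ratings rather than real Netflix data:

```r
# Hypothetical ratings on a 1-5 scale, invented purely for illustration
actual    <- c(5, 1, 4, 2, 5)
predicted <- c(3.5, 2.5, 3.5, 3, 4)

# Root-mean-square error: the square root of the mean squared difference
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

print(rmse(actual, predicted))
```

A film whose viewers disagree sharply would tend to score higher on this measure than one that nearly everybody rates the same way.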
25.
Why IST 380 ?
Specific skills:
R statistical environment (and the S programming language)
Experience with several statistical analyses (descriptive statistics)
Experience with predictive statistics (modeling) and
machine learning algorithms
Broad background:
Final project ~ open-ended with datasets of your choice
You'll be confident and capable with whatever datasets you
encounter in the future – on your own or as part of a team.
27.
About IST 380 …
28.
Details
Web Page: http://www.cs.hmc.edu/~dodds/IST380
Assignments, online text, necessary files, lecture slides are linked
First week's assignment: Getting started with R
Textbook: An introduction to Data Science
freely available online
jsresearch.net/groups/teachdatascience/
and many online resources…
Programming: R
www.r-project.org/
Grab both of these now…
29.
Homepage
Go to the course page
Grab R and the text from these two links…
http://www.cs.hmc.edu/~dodds/IST380/
30.
Homework
Assignments
~ 2-5 problems/week
~ 100 points
extra credit, often
Due Tuesday of the following week by 11:59 pm.
Assignment 1 due Tuesday, February 5.
1 week + 1 day…
Working on programs: On your own or in groups of 2. Divide the work at the keyboard evenly!
Submitting programs: at the submission website
Today's Lab: install software, ensure accounts are working, and try out R - the first HW is officially due on 2/5
32.
Outline
approximate!
Weeks 1-5: "Data Science"
using R
descriptive statistics
predictive statistics
probability distributions
statistical modeling
Weeks 6-10: "Machine Learning"
support vector machines (SVMs)
nearest neighbors (NN)
random forests
k-means algorithm
Weeks 11-15: Final Project
No breaks?!
33.
Grading
Grades
Based on points percentage
~ 800 points for assignments
~ 400 points for the final project
if score >= 0.95: grade = "A"
if score >= 0.90: grade = "A-"
if score >= 0.86: grade = "B+"
see the course syllabus for the full list...
Final project
• the last ~4 weeks will work towards a larger, final project
• there will be a short design phase and a short final presentation
• choose your own problem to study (I'll have some suggestions, too.)
• I'd encourage you to connect R and our Data Science techniques
to other datasets or projects that you use/need/like, etc.
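The grade cutoffs on this slide are written as pseudocode. A small R sketch of the same idea follows; it encodes only the three cutoffs shown here (the rest are in the syllabus), and the function name and the sample point totals are made up for illustration.

```r
# Points available, as listed on the slide
assignment_points <- 800
project_points    <- 400

# Only the three cutoffs shown on the slide are encoded;
# lower grades are defined in the syllabus, so they come back as NA here.
letter_grade <- function(score) {
  if (score >= 0.95) {
    "A"
  } else if (score >= 0.90) {
    "A-"
  } else if (score >= 0.86) {
    "B+"
  } else {
    NA_character_
  }
}

# Hypothetical student: 760 assignment points and 350 project points
score <- (760 + 350) / (assignment_points + project_points)
print(score)                # 0.925
print(letter_grade(score))  # "A-"
```

The checks run from the highest cutoff down, so each score lands in the first band it qualifies for.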
34.
Academic Honesty
This course operates under CGU's (and all of Claremont Schools') Academic Honesty policies…
• Your work must be your own. This must be true for the whole team, if you're working in a pair.
• Consulting with others (except team members or myself) is encouraged, but has to be limited to discussion and debugging of problems. Sharing of written, electronic, or verbal solutions/files/code is a violation of CGU’s academic honesty policy.
• A reasonable guideline: Work is your own if you could delete all of it and recreate it yourself.
35.
Thoughts?
36.
Getting to know…R
37.
Getting to know…
http://lang-index.sourceforge.net/#categ
R
R is the programmer's toolkit for statistics; SAS, Stata, and SPSS are preferred by those in business intelligence
38.
Getting to know…R
Free… and very well supported online…
39.
Getting to know…R
R is responsive, up-to-date, and flexible: Data Science vs. Statistics
40.
Getting to know…R
1) Find the IST 380 course webpage
www.cs.hmc.edu/~dodds/IST380/
2) Download and install R
3) Run R and try some basic commands at the prompt:
6 * 7
rnorm(10)
x <- 380
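For reference, here are the same three commands with a comment on what each one does. This is just an annotated copy of the lines above, plus one extra line that prints x.

```r
# Arithmetic at the prompt: R prints the result immediately
6 * 7

# Draw 10 random values from a standard normal distribution
# (the values differ on every run)
rnorm(10)

# Assign the value 380 to the name x; nothing is printed by an assignment
x <- 380

# Typing a name on its own prints its value
x
```

The arrow `<-` is R's usual assignment operator.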
41.
Getting started!
1) Open Matloff's Why R? notes
2) Skip ahead to page 7, the "5 minute example session"
3) Try out the commands in section 2.2 to get started…
4) When you finish, save your session and submit it!
This is problem 1 this week
42.
Saving your session
1) Create a folder named hw1, perhaps on your desktop
2) Use Save to file… (Windows) or Save as… (Mac) to save your current console session into hw1
3) Name that file pr1.txt
4) From your operating system, open that file to confirm it contains your whole session!
This is problem 1 this week
43.
Submitting your work
1) Zip up hw1 into hw1.zip
2) From the course webpage, click on the submission
site link.
3) Choose a submission site login name & let me know!
4) Once your account is made, log in, change your password
to something you know, and submit hw1.zip
5) You can submit again – all copies are saved…
You've completed Problem 1!
troubles? email me!
This webserver can be
spacey -- I should know!
44.
Reflection
Assignment?
Creating a vector?
Printing?
Average and standard deviation?
Comments?
Comments?
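As a quick recap of the reflection questions above, here is one short R example per item. The vector of scores is invented for illustration and is not taken from the tutorial.

```r
# Assignment: bind a value to a name with <-
x <- 380

# Creating a vector: c() combines values into one vector
scores <- c(90, 85, 100, 75, 95)

# Printing: type the name at the prompt, or call print()
print(scores)

# Average and standard deviation
print(mean(scores))  # 89
print(sd(scores))

# Comments: everything after a # on a line is ignored by R
```

Note that sd() computes the sample standard deviation, dividing by n - 1 rather than n.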
45.
R types
You can use mode() to view the type of a variable.
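A few calls to mode() at the prompt show the most common answers. The values here are arbitrary examples.

```r
x <- 380
mode(x)           # "numeric"
mode("IST 380")   # "character"
mode(TRUE)        # "logical"
mode(c(1, 2, 3))  # "numeric": a vector has the mode of its elements
```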
46.
Where's the big data?
c ~ concatenate
Vectors are R's ordered collections of a single type of element
47.
Where's the big data?
c ~ concatenate
the colon : also creates vectors
Vectors are R's ordered collections of a single type of element
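Both ways of building a vector can be tried directly at the prompt. This is a small sketch with arbitrary values; the last example shows what "a single type of element" means in practice.

```r
# c() concatenates its arguments into a single vector
v <- c(3, 8, 0)

# c() also joins existing vectors end to end
w <- c(v, 42, v)

# The colon operator builds a vector of consecutive integers
r <- 1:10

# Mixing types in c() coerces everything to one common type
m <- c(1, "two", 3)

print(w)
print(r)
print(mode(m))  # "character": the numbers were turned into strings
```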
48.
Analyzing vectors – try these…
Square brackets [] can "subset" (or "slice") vectors
49.
Analyzing vectors
you can use a boolean vector to subset another vector
Square brackets [] can "subset" (or "slice") vectors
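The console examples from these two slides are not reproduced in this text, so here is a small stand-in with an arbitrary vector. It covers subsetting by position and by a boolean vector.

```r
v <- c(10, 20, 30, 40, 50)

v[2]        # one element by position (R counts from 1)
v[2:4]      # a slice of consecutive positions
v[c(1, 5)]  # several chosen positions
v[-1]       # everything except the first element

# A boolean (logical) vector keeps the positions that are TRUE
keep <- v > 25
keep
v[keep]
```

Writing the test inline, as in v[v > 25], gives the same result without the intermediate name.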
50.
NA
R uses NA to represent data that is "not available"
The function is.na( ) tests for NA
What is going on here?
51.
NA
R uses NA to represent data that is "not available"
The function is.na( ) tests for NA
What is going on here?
This uses subsetting to remove NA values!
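The session shown on the slide is not included in this text. The sketch below is a guess at the kind of commands it illustrates, using made-up values: is.na() produces a boolean vector, and its negation is used to subset.

```r
v <- c(4, NA, 8, 15, NA, 16)

is.na(v)  # TRUE wherever a value is missing
mean(v)   # NA: missing values propagate through arithmetic

# Subset with the negated test to keep only the available values
clean <- v[!is.na(v)]
clean
mean(clean)  # 10.75
```

Many summary functions also accept na.rm = TRUE, as in mean(v, na.rm = TRUE), which skips the NA values without a separate subsetting step.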
52.
Data frames
R's fundamental data structures are data frames
The next tutorial will introduce them…
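As a preview, a data frame is a table with one row per observation and one column per variable. The sketch below builds one from the neighbor survey at the start of these slides. The first row uses the answers shown there; the second row is a made-up neighbor, and the column names are my own choice.

```r
# Row 1 comes from the survey answers earlier in these slides;
# row 2 is invented so that the table has more than one row.
survey <- data.frame(
  name   = c("Zachary Dodds", "Example Neighbor"),
  home   = c("Pittsburgh, PA", "Somewhere, USA"),
  states = c(44, 12),
  snack  = c("M&Ms", "chips"),
  stringsAsFactors = FALSE
)

survey               # one row per person, one column per question
nrow(survey)         # number of people
survey$states        # a single column is an ordinary vector
mean(survey$states)  # 28
```

Each column is a vector of a single type, but different columns can have different types.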
53.
Irises…
virginica
setosa
data() lists many built-in data sets. This one is iris
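A first look at iris might go like this. These are standard base-R commands, and iris ships with R, so nothing needs to be downloaded.

```r
data(iris)  # load the built-in iris data set into the workspace

head(iris)           # the first six rows
str(iris)            # column names and types
nrow(iris)           # 150 flowers
table(iris$Species)  # how many flowers of each species
```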
54.
Subsetting iris data
df[rows,cols]
As with vectors, you can "subset" data frames.
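Here is a short sketch of the df[rows,cols] pattern on iris. Leaving either side of the comma empty means "all rows" or "all columns". The names setosa and big are chosen for illustration.

```r
data(iris)

# Rows 1 to 3, all columns
iris[1:3, ]

# All rows, two chosen columns (head() keeps the printout short)
head(iris[, c("Sepal.Length", "Species")])

# A boolean test on the rows keeps only the matching flowers
setosa <- iris[iris$Species == "setosa", ]
nrow(setosa)

# Rows and columns can be restricted at the same time
big <- iris[iris$Petal.Length > 6, c("Petal.Length", "Species")]
big
```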
55.
Lab…
The 2nd part of each class meeting is dedicated to lab work.
I welcome you to stay for the lab, but it is not required.
Today's lab:
Work through Santorico and Shin's Tutorial for the R
Statistical Package and submit the console sessions as
pr2_1.txt, pr2_2.txt, pr2_3.txt, pr2_4.txt, and pr2_5.txt.
This is a nice reinforcement of vectors, introduction to
data frames, and a look at the graphics that R supports.
56.
Homework
Problem 3: Challenge exercises in R
These will reinforce the "subsetting" and data-analysis introduction from pr2's tutorial.
Problem 4: Introduction to Data Science, early chapters
This is a fuller background on R and the field
of data science
(submit your console session for both of these…)
57.
Lab !
58.
CS vs. IS and IT ?
greater integration
system-wide issues
smaller details
machine specifics
www.acm.org/education/curric_vols/CC2005_Final_Report2.pdf
59.
CS vs. IS and IT ?
Where will IS go?
60.
CS vs. IS and IT ?
61.
IT ?
Where will IT go?
62.
IT ?
63.
64.
The bigger picture
Weeks 10-12: Objects
Week 10: classes vs. objects
Week 11: methods and data
Week 12: inheritance
Weeks 13-15: Final Projects
Week 13: final projects
Week 14: final projects
Week 15: final exam
65.
Data?!
• Neighbor's name
• A place they consider home
• Are they working at a company now? Where?
• How many U.S. states have they visited?
• Their favorite unhealthy food… ?
• Do they have any "Data Science" background? (statistics, machine learning, CS)
66.
state reminders…
67.
Data!
• Neighbor's name
Zachary Dodds
• A place they consider home
Pittsburgh, PA
• Are they working at a company now? Where?
Harvey Mudd
• How many U.S. states have they visited?
44
• Their favorite unhealthy food… ?
M&Ms
• Do they have any "Data Science" background? (statistics, machine learning, CS)
mostly CS for me…
This class is truly seminar-style: we're developing expertise in this field together.
be sure to set up your login + profile for the submission site…