Data Representation and Modeling

Slide 2

Thinking More Deeply about Data and Computation

We’ve seen:
semi-structured HTML and unstructured text, represented using tables to be used for visualization and learning
manipulating tabular data
projection (subsetting fields), selection (choosing rows meeting predicates), loc (extract or update cell), apply (compute function over each row/col/cell)
linking tabular data
merge/join, outerjoin, and using string similarity to join
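
As a quick refresher, the operations listed above might look like this in pandas. This is only a sketch: the `people` and `teams` tables, their columns, and their values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy tables, invented for this sketch
people = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy"],
    "age": [31, 25, 47],
    "team_id": [1, 2, 1],
})
teams = pd.DataFrame({"team_id": [1, 2, 3], "team": ["data", "web", "ops"]})

# Projection: keep a subset of the fields
names_and_ages = people[["name", "age"]]

# Selection: keep the rows that satisfy a predicate
over_30 = people[people["age"] > 30]

# loc: extract (or update) a single cell by row label and column name
bobs_age = people.loc[1, "age"]

# apply: compute a function over each value in a column
age_next_year = people["age"].apply(lambda a: a + 1)

# merge/join: link rows of two tables on a shared key
joined = people.merge(teams, on="team_id", how="inner")

# Outer join: also keep rows that have no partner on the other side
outer = people.merge(teams, on="team_id", how="outer")
```

The outer join keeps the "ops" team even though nobody belongs to it, which is why it has one more row than the inner join.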
Now let’s dive into more detail on design:
How do we encode data? What are the implications?

Slide 3

A First Question: What Are We Trying to Capture?

“Structured data should capture the semantics of the data”
What do we mean by “data semantics”?
This is a topic that has preoccupied philosophers since at least Aristotle and Plato
… and computer scientists for most of the lifetime of the field!

Slide 4

Part of the Goal: Modeling Concepts and Instances

The famous example from logic and philosophy, attributed to Aristotle:
All men are mortal.
Socrates is a man.
Therefore, Socrates is mortal.
The premise: we have concepts which are classes of things, and instances of those concepts
Properties of the concepts appear in the instances
Instances relate to other instances
Data design is about trying to codify the above!

"Aristotle" by maha-online is licensed under CC BY-SA 2.0 

Slide 5

Some Starting Points

We model knowledge using notions dating back to ancient Greece:
Classes, concepts, or sets of entities – e.g., people
Instances of those classes – e.g., Socrates, Aristotle, Plato
Named relationships between classes – e.g., people have teachers who are other people (thus Aristotle has a teacher, namely Plato)
Classes may also have properties, e.g., people have names or are mortal
There are different, equivalent ways of looking at these!
Using logic – “knowledge representation,” a key idea in AI
Using knowledge graphs – named relationships between classes, subclasses, instances, properties
Using entity-relationship modeling – a special case of knowledge graphs
These can all be used to inform our design of dataframes, hierarchical data, etc.

Slide 6

Modeling Classes, Instances, Properties Using Logical Predicates

We can use logical assertions to describe everything.
Classes: named, categorized collections of items
“All people are mortal” : Mortal(person).
Classes have specializations or subclasses:
“Men are people” : Subclass(man, person).
Classes have instances:
“Aristotle is a man” : Instance(Aristotle, man)
And we infer predicates from class to subclass, or class to instance, using rules:
Mortal(x) ∧ Subclass(y, x) → Mortal(y)
Mortal(x) ∧ Instance(y, x) → Mortal(y)
Mortal(person) ∧ Subclass(man, person) → Mortal(man)
Mortal(man) ∧ Instance(Aristotle, man) → Mortal(Aristotle)
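
The inference rules above can be mimicked with a small loop that keeps applying them until nothing new is learned. This is a minimal sketch, not a standard knowledge-representation API: the tuple encoding of the facts and the function name `infer_mortal` are assumptions made for illustration.

```python
# Facts from the slide, encoded as tuples (an encoding chosen for this sketch)
subclass = {("man", "person")}       # Subclass(man, person)
instance = {("Aristotle", "man")}    # Instance(Aristotle, man)
mortal = {"person"}                  # Mortal(person)


def infer_mortal(mortal, subclass, instance):
    """Apply both rules repeatedly until no new Mortal(...) fact appears."""
    known = set(mortal)
    changed = True
    while changed:
        changed = False
        # Both rules have the same shape: Mortal(x) and Link(y, x) imply Mortal(y)
        for y, x in subclass | instance:
            if x in known and y not in known:
                known.add(y)
                changed = True
    return known


print(sorted(infer_mortal(mortal, subclass, instance)))
# prints ['Aristotle', 'man', 'person']
```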

"Aristotle" by maha-online is licensed under CC BY-SA 2.0 

Slide 7

We Can Instead Think of this As Links between Classes + Instances

[Diagram: a graph whose nodes are Person, Adult, Man, Life Stage, Mortal, Aristotle, Plato, and Socrates, connected by instanceOf, subclassOf, and hasTeacher edges]

Slide 8

We Can Instead Think of this As Links between Classes + Instances

[Diagram: a graph whose nodes are Person, Adult, Man, Life Stage, Mortal, Aristotle, Plato, and Socrates, connected by instanceOf, subclassOf, and hasTeacher edges]

Here, to determine if Aristotle is Mortal, we follow links in the graph (instanceOf, subclassOf) to see if we can find Mortal.
Google & many other services use knowledge graphs, such as Freebase and DBpedia
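
Following links to see whether we can reach Mortal is an ordinary graph search. The sketch below uses breadth-first search over labeled edges. The edge list is an assumption: it is a plausible reading of the diagram (it leaves out Adult and Life Stage, whose exact links are not recoverable here), and the function name `is_a` is made up.

```python
from collections import deque

# Labeled edges: (source, label, target). This edge list is an assumption.
edges = [
    ("Aristotle", "instanceOf", "Man"),
    ("Plato", "instanceOf", "Man"),
    ("Socrates", "instanceOf", "Man"),
    ("Man", "subclassOf", "Person"),
    ("Person", "subclassOf", "Mortal"),
    ("Aristotle", "hasTeacher", "Plato"),
    ("Plato", "hasTeacher", "Socrates"),
]


def is_a(start, goal, edges, follow=("instanceOf", "subclassOf")):
    """Breadth-first search from start to goal using only the labels in follow."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for src, label, dst in edges:
            if src == node and label in follow and dst not in seen:
                seen.add(dst)
                queue.append(dst)
    return False


print(is_a("Aristotle", "Mortal", edges))   # True
print(is_a("Mortal", "Aristotle", edges))   # False: links are directed
```

Restricting the search to instanceOf and subclassOf matters: following hasTeacher would relate Aristotle to Plato, but being taught by someone says nothing about class membership.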

Slide 9

Entity-Relationship Graphs Model Classes as Named Sets of Linked Instances

[Diagram: entity sets Person, Adult, Man, and Life Stage connected by subclassOf edges; Person has the properties ID, Name, Birth, and Death, and a hasTeachers (list) link]

Man is an entity set with many men, who are also people

Slide 10

Entity-Relationship Graphs: A Syntax for Entities, Properties, Relationships

[Diagram: entity-relationship diagram with entity sets Person, Adult, Man, and Life Stage joined by “Is a” links, the properties ID, Name, Birth, and Death, and the relationship Has Teacher]

“Is a”:
subclass inherits all properties of superclass
superclass includes all members of subclasses

Slide 11

Entities and Relationships Correspond to Relations or Dataframes!

Entity set: represents all of the entities of a type, and their properties
Person: ID, name, birth, death
Man: inherits the same fields, possibly adds new ones (not shown)
Relationship set: represents a link between people
HasTeacher(teacher: ID of Person, student: ID of Person)

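[Diagram: the entity sets Person and Man and the relationship set Has Teacher, shown next to the corresponding tables Person (also: Man) and HasTeacher]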
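
A minimal sketch of how the Person entity set and the HasTeacher relationship set could look as dataframes. The ID values are hypothetical, and the birth and death years are approximate (BCE written as negative numbers); only the column layout comes from the slide.

```python
import pandas as pd

# Entity set Person: one row per instance, one column per property.
# IDs are hypothetical; years are approximate, with BCE as negative numbers.
person = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Socrates", "Plato", "Aristotle"],
    "birth": [-470, -428, -384],
    "death": [-399, -348, -322],
})

# Relationship set HasTeacher: each row links two Person IDs
has_teacher = pd.DataFrame({
    "teacher": [1, 2],   # Socrates taught Plato; Plato taught Aristotle
    "student": [2, 3],
})

# Replace the IDs by names to make the links readable
names = person[["id", "name"]]
readable = (
    has_teacher
    .merge(names, left_on="teacher", right_on="id")
    .rename(columns={"name": "teacher_name"})
    .drop(columns="id")
    .merge(names, left_on="student", right_on="id")
    .rename(columns={"name": "student_name"})
    .drop(columns="id")
)
print(readable[["teacher_name", "student_name"]])
```

Storing IDs rather than names in HasTeacher keeps the relationship valid even if two people share a name or a name is later corrected.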

Slide 12

The Tables Let Us Encode a Graph within the Data!

[Diagram: the Person and HasTeacher tables, with teacher and student links connecting Aristotle, Plato, and Socrates]

Slide 13

The Tables Let Us Encode a Graph within the Data!

[Diagram: the Person and HasTeacher tables, with teacher and student links connecting Aristotle, Plato, and Socrates]

In-Class Exercise:
Express using dataframe operations:
“Who is the teacher of Aristotle’s teacher?”
“Show the entire tree of people taught by Socrates”
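
One possible way to approach the exercise, offered as a sketch rather than the intended solution. To keep it short, the HasTeacher table here holds names instead of Person IDs, which is an assumption; the function name `taught_by` is also made up. The first question is a self-join; the second repeats the join until no new students appear.

```python
import pandas as pd

# HasTeacher with names instead of IDs, to keep the sketch short (an assumption)
has_teacher = pd.DataFrame({
    "teacher": ["Socrates", "Plato"],
    "student": ["Plato", "Aristotle"],
})

# Question 1: join HasTeacher with a renamed copy of itself, so that each
# row's teacher is matched with the row where that person is the student
upper = has_teacher.rename(columns={"teacher": "grand_teacher", "student": "teacher"})
two_hops = has_teacher.merge(upper, on="teacher")
answer = two_hops.loc[two_hops["student"] == "Aristotle", "grand_teacher"].tolist()
print(answer)   # ['Socrates']


def taught_by(root, has_teacher):
    """Return (teacher, student, depth) rows for everyone taught, directly or not, by root."""
    frontier = pd.DataFrame({"teacher": [root]})
    seen = {root}
    levels = []
    depth = 1
    while True:
        # Students whose teacher is in the current frontier
        step = has_teacher.merge(frontier, on="teacher")
        step = step[~step["student"].isin(seen)]
        if step.empty:
            break
        levels.append(step.assign(depth=depth))
        seen.update(step["student"])
        frontier = step[["student"]].drop_duplicates().rename(columns={"student": "teacher"})
        depth += 1
    if not levels:
        return pd.DataFrame(columns=["teacher", "student", "depth"])
    return pd.concat(levels, ignore_index=True)


tree = taught_by("Socrates", has_teacher)
print(tree)
```

A fixed number of joins answers a fixed number of hops; an unbounded tree needs a loop (or a recursive query) because the depth is not known in advance.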

Slide 14

ER is a General Model: A Graph of Entities & Relationships

[Figure: entity-relationship diagram with attributes such as ID and sequence. Source: Vyas et al., BMC Genomics 2009, “A proposed syntax for Minimotif Semantics, version 1”]

Slide 15

From the Basics of Entity-Relationship Diagrams to General Data(base) Design

Deciding on the entities, relationships, and constraints is part of database design
There are ways to do this to minimize the errors in the database, and make it easiest to keep consistent
For this class: we’ll assume we do simple E-R diagrams with properties
… and that each node becomes a Dataframe

Slide 16

Considering Non-“Flat” Data

Slide 17

A Common Point of Confusion

“Relational data can only capture flat relationships”
Not true: it represents graphs, which can be traversed by queries!
… Though certain data structures might be more convenient to represent in other ways!

Slide 18

Hierarchy vs Relations (“NoSQL” vs “SQL”)

Sometimes it’s convenient to take data we could codify as a graph:

[Diagram: Person, linked by “owns” to Cellphone]

And instead save it as a tree or forest:

[{'person': {'name': 'jai', 'phones': [{'mfr': 'Apple', 'model': …}, {'mfr': 'Samsung', 'model': …}]}},
 {'person': {'name': 'kai', 'phones': [{'mfr': 'Apple', 'model': …}]}}]

This is what NoSQL databases do!

Slide 19

NoSQL “Not-only SQL”

Typically store nested objects, or possibly binary objects, by IDs or keys
Note that a nested object can be captured in relations, via multiple tables!
Some well-known NoSQL systems:
MongoDB: stores JSON, i.e., lists and dictionaries
Google Bigtable: stores tuples with irregular properties
Amazon S3: stores binary files by key
Major differences from SQL databases:
Querying is often much simpler, e.g. they often don’t do joins!
They support limited notions of consistency when you update

Slide 20

Recap: Basic Concepts

Knowledge is typically represented as concepts or classes, which can be generally thought of as corresponding to tables
But there is also a notion of subclassing (inheriting fields)
And of instances (rows in the tables)
Knowledge representation often describes these relationships as constraints
We can capture knowledge using graphs with nodes (entity sets, concepts) and edges (relationship sets)
Entity-relationship diagrams show this
Entity sets and relationship sets can both become tables!
Graphs + queries can be used to capture any kind of data and relationships (not always conveniently)
NoSQL systems support hierarchy, which “pivots” the graph into a tree with a root

Slide 21

Let’s Work on Data Modeling, Given a Real Dataset!

1. Extracted data from LinkedIn
~3M people, stored as a ~9GB list of lines made up of JSON
JSON is nested dictionaries and lists – i.e., NoSQL-style!
We’ll focus on how to parse and store the “slightly hierarchical” data
2. Then we’ll work out an example with very hierarchical data – HTML

Slide 23

Parsing Even Not-So-Big Data Is Painfully Slow!

Slide 24

Can We Do Better?

Maybe save the data in a way that doesn’t require parsing of strings?

https://cloud.mongodb.com

Slide 25

MongoDB NoSQL DBMS Lets Us Store + Fetch Hierarchical Data

import json
from pymongo import MongoClient

# Connect to the MongoDB cluster and pick the 'linkedin' database
client = MongoClient('mongodb+srv://cis545:[email protected]/test?retryWrites=true&w=majority')
linkedin_db = client['linkedin']

# Each line of the file is one JSON document describing a person
with open('linkedin.json') as linked_in:
    for line in linked_in:
        person = json.loads(line)
        linkedin_db.posts.insert_one(person)

Slide 26

Data in MongoDB

Slide 27

Finding Things, in a Dataframe vs in MongoDB

# Scan an in-memory list of posts for the first one that lists the skill
def find_skills_in_list(skill):
    for post in list_for_comparison:
        if 'skills' in post:
            skills = post['skills']
            for this_skill in skills:
                if this_skill == skill:
                    return post
    return None

# Let MongoDB do the search: matches documents whose 'skills' list contains the skill
def find_skills_in_mongodb(skill):
    return linkedin_db.posts.find_one({'skills': skill})

Slide 28

How Do We Convert Hierarchical Data to Dataframes?

Hierarchical data doesn’t work well for visualization or machine learning

Slide 29

The Basic Idea: Nesting Becomes Links (“Key/Foreign Key”)

[Tables: people and experience, where each experience row points back to its person]
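
A minimal sketch of turning nesting into key/foreign-key links with pandas. The records are invented and only loosely shaped like the LinkedIn data; the column names `_id`, `person`, and `org` follow the queries used elsewhere in these slides, while the record layout itself is an assumption.

```python
import pandas as pd

# Hypothetical nested records, invented for this sketch
records = [
    {"_id": "p1", "name": "jai", "experience": [{"org": "Acme"}, {"org": "Initech"}]},
    {"_id": "p2", "name": "kai", "experience": []},
]

# Parent table: one row per person, keeping only the flat fields
people = pd.DataFrame([{"_id": r["_id"], "name": r["name"]} for r in records])

# Child table: one row per nested item, with a foreign key back to its person
experience = pd.DataFrame(
    [
        {"person": r["_id"], "org": e["org"]}
        for r in records
        for e in r.get("experience", [])
    ],
    columns=["person", "org"],
)

print(people)
print(experience)
```

The person with an empty experience list contributes no rows to the child table, which is why reassembling the data later calls for an outer join rather than an inner one.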

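Slide 30

Reassembling through (Outer) Joins

import pandas as pd

# One row per person, with that person's organizations gathered into one bracketed string.
# Assumes conn is a SQLite connection, where strings are concatenated with || rather than +.
pd.read_sql_query("select _id, '[' || group_concat(org) || ']'" +\
                  " from people left join experience on _id=person " +\
                  " group by _id", conn)

# One row per (person, organization) pair; people with no experience still appear, with a null org.
pd.read_sql_query("select _id, org" +\
                  " from people left join experience on _id=person ", conn)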

Slide 31

Views

Sometimes we use a query enough that we want to give its results a name, and make it essentially a table (which we then use in other queries!)

conn.execute('begin transaction')
conn.execute('drop view if exists people_experience')
conn.execute("create view people_experience as " +\
" select _id, group_concat(org) as experience " +\
" from people left join experience on _id=person group by _id")
conn.execute('commit')
pd.read_sql_query('select * from people_experience', conn)
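
The left join and the view can be tried end to end on a throwaway database. This sketch assumes SQLite (through Python's built-in sqlite3 module) and uses made-up rows; the table and column names follow the queries above.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database with made-up rows, for illustration only
conn = sqlite3.connect(":memory:")
conn.execute("create table people (_id text primary key, name text)")
conn.execute("create table experience (person text, org text)")
conn.executemany("insert into people values (?, ?)", [("p1", "jai"), ("p2", "kai")])
conn.executemany(
    "insert into experience values (?, ?)", [("p1", "Acme"), ("p1", "Initech")]
)
conn.commit()

# Left join: one row per (person, org); kai has no experience, so org is NULL
pairs = pd.read_sql_query(
    "select _id, org from people left join experience on _id = person", conn
)

# The view gives the grouped query a name, so it can be queried like a table
conn.execute(
    "create view people_experience as"
    " select _id, group_concat(org) as experience"
    " from people left join experience on _id = person group by _id"
)
nested = pd.read_sql_query("select * from people_experience order by _id", conn)
print(nested)
```

A view stores the query, not its results, so rows inserted into people or experience later show up the next time the view is read.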

Slide 32

Occasional Considerations: Access and Consistency

Sometimes we may need to allow for failures and “undo”…
We saw “BEGIN TRANSACTION … COMMIT”
There is also “ROLLBACK”
Relational DBMS typically provide atomic transactions for this; most NoSQL DBMSs don’t
A second consideration when the data is shared: what happens when multiple users are editing and querying at the same time?
Concurrency control (how do we handle concurrent updates) and consistency (when do I see changes)
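
A small sketch of “undo” with ROLLBACK, assuming SQLite through Python's built-in sqlite3 module; the table and rows are made up. Passing `isolation_level=None` switches off the module's implicit transaction handling so that the explicit BEGIN, COMMIT, and ROLLBACK statements are the only ones in play.

```python
import sqlite3

# isolation_level=None turns off the module's implicit transactions,
# so the explicit BEGIN / COMMIT / ROLLBACK statements below are in control
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("create table people (_id text primary key, name text)")

# A committed transaction: the change becomes permanent
conn.execute("begin transaction")
conn.execute("insert into people values ('p1', 'jai')")
conn.execute("commit")

# A failed transaction: ROLLBACK undoes everything since BEGIN
conn.execute("begin transaction")
try:
    conn.execute("insert into people values ('p2', 'kai')")
    # The next insert reuses the key 'p1' and violates the primary key
    conn.execute("insert into people values ('p1', 'duplicate key')")
    conn.execute("commit")
except sqlite3.IntegrityError:
    conn.execute("rollback")

names = [row[0] for row in conn.execute("select name from people order by _id")]
print(names)   # ['jai']
```

The insert of 'kai' succeeded on its own, yet it disappears too: the transaction is atomic, so either all of its changes are kept or none are.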
File name: Data-Representation-and-Modeling.pptx