程序代写代做代考 Excel database algorithm SQL python data science javascript Java Slide 1

Slide 1

1
Data Analysis & Programming for Operations Management (DAPOM)

Wout van Wezel (Coordinator)
w.m.c.van.wezel@rug.nl
050-3637181
DUI-621

|

faculty of economics
and business
operations

|
faculty of economics
and business
operations
Databases
No Data Science without Big Data!

Topic today:
Databases
Big Data
Json
Elasticsearch

2

|
faculty of economics
and business
operations
But first, practicals

3

VS

|
faculty of economics
and business
operations
Programming
Programming is more a mindset than a skill.
Essentially, programming is really easy. We have:
Variables
a=3
b=10
Branching based on conditions
if a > 4: a=a+2
else: a=a+3
Iteration (for, while)
while a93 so look at 93.

|
faculty of economics
and business
operations
‘Traditional’ databases
Let’s look again: select name from Persons where age<25 will give me a new table with the corresponding rows. If I store the age unsorted, I have to check all rows individually. If the age is stored sorted, I must search for the first element first element >25, and then all record below that are in my set

13

|
faculty of economics
and business
operations
‘Traditional’ databases
Impressive performance, but in this course, we are more interested in Data Science, which implies:

Making selections on really big data
Making aggregations and calculations

This should be fast. A single user can wait for a few seconds, but:
If you have many users in a client/server setting, requests are queued and waiting time becomes very long
If you want to train a neural network or run an optimization model, you may need to do thousands or millions of queries in a short time
14

|
faculty of economics
and business
operations
‘column oriented’ databases
Relational databases store data as rows
in memory:

Column oriented databases store the columns.

This has several advantages, which is why relational databases are not often used for Data Science.
15

|
faculty of economics
and business
operations
‘column oriented’ databases

Many libraries exists for calculations, statistics, and machine learning (e.g., numpy, scikit). These are highly optimized, and you can simply feed them an array of numbers.

In a column oriented database, the data is already stored as an array of numbers.

So, first advantage of column oriented database: to calculate the average of all ages, I only need to process (in this case) 3 bytes, instead of 21 bytes.

16

|
faculty of economics
and business
operations
‘column oriented’ databases
Second advantage:

Your computer has a processor (e.g., Intel I7) which does all the calculations.
Your computer has memory (e.g., 8GB) where it stores all the data.
To do a calculation, the processor needs to get data from the main memory, do the calculation, and store the data back in main memory.
A processor itself has some memory as well. Give it a small block of data (for example: 32768 bytes) and do all calculations on that on one go, before writing it back.
This processor memory is easily 100 times faster than the normal computer memory.

17

|
faculty of economics
and business
operations
‘column oriented’ databases
This is where the second advantage comes from.
By storing numbers contiguously in main memory, larger blocks are copied in the processor memory, and less transferring from processor memory to main memory is needed.

18

versus

|
faculty of economics
and business
operations
‘column oriented’ databases
For a long time, the speed of the CPU (e.g., 2.8Ghz) increased every year.
A computationally intensive program doubled in speed each few years without having to change the program!
However, both putting more components on a chip and increasing the speed increases heat. Now, number of instructions per second is not really increasing anymore.
19

|
faculty of economics
and business
operations
‘column oriented’ databases
Instead, to improve performance, manufacturers put more processors in parallel (dual core, quad core, etc.)

A third advantage: column oriented databases can split a column in multiple smaller columns, and give data to each core individually

These are three reasons why
(in this query), Access
(14 seconds) is 2800 times slower
than Elasticsearch (5ms)

20

|
faculty of economics
and business
operations
Big Data
What is big data?

Various definitions, but a common ground is:
High volume (too big to fit on one computer)
High velocity
High variety

Traditional databases have trouble distributing data over multiple computers. Therefore, sometimes big data is defined as data that does not fit on one computer.

Various column oriented databases are developed from the start with this ability. If you can spread the load over multiple cores in one processor, you can also spread it over multiple computers.
21

|
faculty of economics
and business
operations
Big Data
So, column oriented databases are more suitable for big data, because they:

Are very fast
Are equipped to do numerical analysis, statistics, and machine learning
Can scale very well over powerful processors with many cores, and over multiple computers in a network (sometimes even hundreds or thousands of computers)

22

|
faculty of economics
and business
operations
Big Data
Philosophical question: what is big data?

Too big for Excel? (1 million rows)
Too big for one computer? What does too big mean?
It does not fit in memory (typically 8 to 256 GB)
It does not fit on harddrive (typically 1 to 32 TB)
Calculate the approximate data size you need
Example: each second measure machine temperature to predict breakdown
31.536.000 data points per year per machine
100 machines, is 3 billion points, is approx. 3GB
23

|
faculty of economics
and business
operations
Big Data
Typically, in big data we measure many dimensions (temperature, vibrations, power usage, humidity, machine efficiency, etc.)

We then get many data points for many features, and we can:
Aggregate (average power usage per day; per machine, or per department, or per factory, etc.)
Correlate (relation between temperature and machine efficiency)
Time series analysis (power usage at time t versus breakdown at time t-1)

This is where column oriented databases are indispensable
24

|
faculty of economics
and business
operations
Elasticsearch
Lucene is a well known full text search engine (started in 1999)
Created as an alternative for commercial databases that were lousy at full text search (Microsoft Sql server, Oracle)

Elasticsearch was created as a program that uses the Lucene core, but adds, for example:
Communication with databases
Big Data (parallelize the database over multiple computers)
Aggregations (calculations, for example the number of taxi trips per weekday and the average price)
Machine learning (relation between weather (rain, wind, snow, etc.), day, time, and number of taxi trips) (Paid version only)
25

|
faculty of economics
and business
operations
Elasticsearch
Elasticsearch is a server. It starts and waits until it gets commands from another program.

There are Python libraries that you can use, which makes it easy to:
Create a database (=index)
Insert data into the index
Query data from the index
Delete data

Elastic (like many other systems) talks Json.
26

|
faculty of economics
and business
operations
What is a server?
A server often refers to a computer that runs server programs.
Standardization of communication with server programs allows that developers can focus:
One developer creates a database server
Another developer creates a desktop program with which the database can be filled.
Yet another developer creates a website that can show the data.

There are two kinds of standards:
Communication
Representation of the data
27

|
faculty of economics
and business
operations
Server communication
Elasticsearch uses Http for communication. It is the same communication protocol used by web browsers to communicate with the web server. Hence, you can talk to Elasticsearch with your web browser.
http://localhost:9200/wikititles/_search?q=Title:star
Elasticsearch uses Json to represent data.

28

|
faculty of economics
and business
operations

Json (Javascript Object Notation)

Advantages
You can have a hierarchy
Your records don’t all have to be the same

29

|
faculty of economics
and business
operations

Json Format, in itself really simple:
“key”:”value”
[] to denote an array of values of the same field
{} to denote an object. Can be nested:

Note that 49 is a number. If you state “49” with quotes, it is regarded as text, similar to your Python code.
30

|
faculty of economics
and business
operations
Elasticsearch
Elastic ‘talks’ Json. You specify commands in Json, documents are stored as Json, and it responds with Json.
Note that all numeric data is also stored in columns for fast querying and calculations.
Most important commands are:
Create an index
Insert a record in the index
Delete a record from the index
Delete the whole index
Query the index
Search for records
Make calculations
The Python layers for Elastic also use Json
31

|
faculty of economics
and business
operations
32
from datetime import datetime
from elasticsearch import Elasticsearch

es = Elasticsearch([{‘host’:’127.0.0.1′, ‘port’: 9200}])

es.indices.create(index=’persons’, ignore=400)

es.index(index=”persons”, id=1, body={“name”: “wout”, “hobby”: “programming”, “age”: 49, “timestamp”: datetime.now()})

es.index(index=”persons”, id=2, body={“name”: “anna”, “hobby”: “netflix”, “age”: 16, “timestamp”: datetime.now()})

|
faculty of economics
and business
operations
33
import json
from elasticsearch import Elasticsearch

es = Elasticsearch([{‘host’:’127.0.0.1′, ‘port’: 9200}])

search_body = {
“query”: {
“bool”: {
“must”: {
“term”:{
“hobby”: “netflix”
}
}
}
}
}

result = es.search(index=”persons”, body=search_body)
print (json.dumps(result, indent=2))

|
faculty of economics
and business
operations

{
“took”: 2,
“timed_out”: false,
“_shards”: {
“total”: 1,
“successful”: 1,
“skipped”: 0,
“failed”: 0
},
“hits”: {
“total”: {
“value”: 1,
“relation”: “eq”
},
“max_score”: 0.6931472,
“hits”: [
{
“_index”: “persons”,
“_type”: “_doc”,
“_id”: “2”,
“_score”: 0.6931472,
“_source”: {
“name”: “anna”,
“hobby”: “netflix”,
“age”: 16,
“timestamp”: “2019-09-22T22:47:46.869209”
}
}
]
}
}
anna
34
Info on the shards that responded
General info about the query
Array with records that conform to
the query
Meta information for the record
Source record

|
faculty of economics
and business
operations
Dictionary
How do I get to this data in Python?
Dictionary: similar to a list (or array), but you don’t give numbers as index. It is more generic, you can use any key as index:

thisdict = {
  “brand”: “Ford”,
  “model”: “Mustang”,
  “year”: 1964
}
print(thisdict)

x = thisdict[“model”]
35

|
faculty of economics
and business
operations
Dictionary
In Elasticsearch, the result of a query can be retrieved as a dictionary:
Result[“data”] gives back a dictionary
Result[“data”][“cars”] gives back an array
Result[“data”][“cars”][0] gives
back the first element of the
array.

print(result[“hits”][“hits”][0][“_source”][“name”])

36

|
faculty of economics
and business
operations
Assignment
Next week (week 3): we work with gps-data. You will create your own (really simple version of) Strava.

In week 4/5: Elasticsearch will be used in the weekly assignments.

In the end assignment: everything comes together:
Meal delivery service
You get much data which you will import in Elasticsearch
Query and parse the data using Json
Work with Gps data
Optimization problem based on aggregate patterns in the data (e.g., determine number of couriers needed using the average number of orders per hour)
37

|
faculty of economics
and business
operations
Any questions?
38

|
faculty of economics
and business
operations

Persons
p_numbernameage
1John16
2William28
3Mary24
………

Sorted
11
14
17
20
44
50
67
90
93
98
99

Unsorted
50
11
14
90
67
44
98
20
93
99
17

nsearches
100 7
1,000 10
10,000 13
100,000 17
1,000,000 20
1,000,000,000 30
10,000,000,000 33

1John162William283Mary24

123JohnWilliamMary162824