Rexster + Bulbs: Unicode node property - node created but not found
I am using Bulbs and Rexster, and I am trying to store nodes with Unicode properties (see the example below). Apparently, creating nodes in the graph works, since I can see the nodes in the web interface that comes with Rexster (Rexster Dog House), but retrieving the same node does not work: all I get is None.
Everything works as expected when I create and look up nodes with non-Unicode-specific letters in their properties. E.g. in the following example, a node with name = u'university of cambridge' would be retrievable as expected.
Rexster version:
[info] application - rexster version [2.4.0]
Example code:
# -*- coding: utf-8 -*-
from bulbs.rexster import Graph
from bulbs.model import Node
from bulbs.property import String
from bulbs.config import DEBUG
import bulbs

class University(Node):
    element_type = 'university'
    name = String(nullable=False, indexed=True)

g = Graph()
g.add_proxy('university', University)
g.config.set_logger(DEBUG)  # log every request sent to Rexster

name = u'université de montréal'
g.university.create(name=name)
print g.university.index.lookup(name=name)  # expected: the vertex just created
print bulbs.__version__
This gives the following output on the command line:
POST url: http://localhost:8182/graphs/emptygraph/tp/gremlin
POST body: {"params": {"keys": null, "index_name": "university", "data": {"element_type": "university", "name": "universit\u00e9 de montr\u00e9al"}}, "script": "def createIndexedVertex = {\n vertex = g.addVertex()\n index = g.idx(index_name)\n for (entry in data.entrySet()) {\n if (entry.value == null) continue;\n vertex.setProperty(entry.key,entry.value)\n if (keys == null || keys.contains(entry.key))\n\tindex.put(entry.key,String.valueOf(entry.value),vertex)\n }\n return vertex\n }\n def transaction = { final Closure closure ->\n try {\n results = closure();\n g.commit();\n return results; \n } catch (e) {\n g.rollback();\n throw e;\n }\n }\n return transaction(createIndexedVertex);"}
GET url: http://localhost:8182/graphs/emptygraph/indices/university?value=universit%c3%a9+de+montr%c3%a9al&key=name
GET body: None
None
0.3
OK, I finally got to the bottom of this.
Since TinkerGraph uses a HashMap for its index, you can see what's being stored in the index by using Gremlin to return the contents of the map.
Here's what's being stored in the TinkerGraph index after using the Bulbs g.university.create(name=name) method above...
$ curl http://localhost:8182/graphs/emptygraph/tp/gremlin?script="g.idx(\"university\").index"
{"results":[{"name":{"université de montréal":[{"name":"université de montréal","element_type":"university","_id":"0","_type":"vertex"}]},"element_type":{"university":[{"name":"université de montréal","element_type":"university","_id":"0","_type":"vertex"}]}}],"success":true,"version":"2.5.0-snapshot","querytime":3.732632}
That all looks good: the encodings are right.
To create and index a vertex like the one above, Bulbs uses a custom Gremlin script via an HTTP POST request with a JSON content type.
Here's the problem...
Rexster's index lookup REST endpoint uses URL query params, and Bulbs encodes URL params as UTF-8 byte strings.
To see how Rexster handles URL query params encoded as UTF-8 byte strings, I executed a Gremlin script via a URL query param that simply returns the encoded string...
$ curl http://localhost:8182/graphs/emptygraph/tp/gremlin?script="'universit%c3%a9%20de%20montr%c3%a9al'"
{"results":["université de montréal"],"success":true,"version":"2.5.0-snapshot","querytime":16.59432}
Egad! That's not right. As you can see, the text is mangled.
In a twist of irony, we have Gremlin returning gremlins, and that's what Rexster is using as the key's value in the index lookup, which, as we can see, is not what's stored in TinkerGraph's HashMap index.
Here's what's going on...
This is what the unquoted byte string looks like in Bulbs:
>>> name
u'universit\xe9 de montr\xe9al'
>>> bulbs.utils.to_bytes(name)
'universit\xc3\xa9 de montr\xc3\xa9al'
'\xc3\xa9' is the UTF-8 encoding of the Unicode character u'\xe9' (which can also be specified as u'\u00e9').
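To make the byte-level picture concrete, here is a minimal sketch in plain Python 3 (no Bulbs or Rexster involved). It assumes the server decodes the UTF-8 bytes as ISO-8859-1, and it uses an ordinary dict as a stand-in for the HashMap-backed index; both are illustrations, not Rexster's actual code.

```python
# -*- coding: utf-8 -*-
name = u"universit\u00e9 de montr\u00e9al"

# UTF-8 represents U+00E9 as the two bytes 0xC3 0xA9.
utf8_bytes = name.encode("utf-8")

# Decoding those bytes with the wrong charset (Latin-1 here) turns each
# 2-byte sequence into two separate characters.
mangled = utf8_bytes.decode("latin-1")

# A plain dict standing in for the index (hypothetical, for illustration).
index = {name: "vertex 0"}

print(repr(utf8_bytes))
print(len(name), len(mangled))   # the mangled string is two characters longer
print(index.get(mangled))        # the mangled key is not in the index
print(index.get(utf8_bytes.decode("utf-8")))  # the correctly decoded key is
```

The lookup only succeeds when the bytes are decoded with the same charset that was used to encode them.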
UTF-8 uses 2 bytes to encode this character, and Jersey/Grizzly 1.x (Rexster's app server) has a bug where it doesn't properly handle 2-byte character encodings like UTF-8.
See http://markmail.org/message/w6ipdpkpmyghdx2p
It looks like this is fixed in Jersey/Grizzly 2.0, but switching Rexster from Jersey/Grizzly 1.x to Jersey/Grizzly 2.x is a big ordeal.
Last year TinkerPop decided to switch to Netty instead, and so for the TinkerPop 3 release this summer, Rexster is in the process of morphing into Gremlin Server, which is based on Netty rather than Grizzly.
Until then, here are a few workarounds...
Since Grizzly can't handle 2-byte encodings like UTF-8, client libraries need to encode URL params as 1-byte latin1 encodings (aka ISO-8859-1), which is Grizzly's default encoding.
Here's the same value encoded as a latin1 byte string...
$ curl http://localhost:8182/graphs/emptygraph/tp/gremlin?script="'universit%e9%20de%20montr%e9al'"
{"results":["université de montréal"],"success":true,"version":"2.5.0-snapshot","querytime":17.765313}
As you can see, using the latin1 encoding works in this case.
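For comparison, the two percent-encodings can be produced with the Python 3 standard library. This is only a sketch of the encoding step (Bulbs itself targets Python 2 and has its own helpers); the helper name fits_latin1 is made up for this example.

```python
from urllib.parse import quote

name = u"universit\u00e9 de montr\u00e9al"

# UTF-8: U+00E9 becomes two bytes, hence two percent-escapes.
utf8_param = quote(name, encoding="utf-8")

# Latin-1 (ISO-8859-1): U+00E9 is the single byte 0xE9.
latin1_param = quote(name, encoding="latin-1")

print(utf8_param)
print(latin1_param)


def fits_latin1(text):
    """Return True if every character can be encoded as ISO-8859-1."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


# Characters outside ISO-8859-1 (here U+0142) cannot use this workaround.
print(fits_latin1(name), fits_latin1(u"\u0142\u00f3d\u017a"))
```

Note the limitation the second check exposes: latin1 only covers characters in the ISO-8859-1 range, so this workaround cannot help with arbitrary Unicode values.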
However, for general purposes, it's probably best for client libraries to use a custom Gremlin script via an HTTP POST request with a JSON content type and thus avoid the URL param encoding issue. This is what Bulbs is going to do, and I'll push the Bulbs update to GitHub later today.
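A rough sketch of why the JSON route sidesteps the problem: Python's json.dumps escapes non-ASCII characters by default, so the request body is pure ASCII and no charset guessing is needed on the server side. The lookup script shown here, g.idx(index_name).get(key, value), is an assumption for illustration and not necessarily the script Bulbs sends.

```python
import json

name = u"universit\u00e9 de montr\u00e9al"

# Hypothetical Gremlin lookup script with its bound parameters (assumed,
# not taken from Bulbs).
payload = {
    "script": "g.idx(index_name).get(key, value)",
    "params": {"index_name": "university", "key": "name", "value": name},
}

# ensure_ascii=True is the default, so U+00E9 is written as \u00e9.
body = json.dumps(payload).encode("utf-8")

print(body)
print(all(byte < 128 for byte in body))  # every byte is plain ASCII
```

The body would then be sent to the Gremlin endpoint with an HTTP POST and a JSON content type, as in the vertex creation request in the question.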
UPDATE: It turns out that even though we cannot change Grizzly's default encoding type, we can specify UTF-8 as the charset in the HTTP request Content-Type header, and Grizzly will use it. Bulbs 0.3.29 has been updated to include the UTF-8 charset in its request header, and all tests pass. The update has been pushed to both GitHub and PyPI.
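For anyone hitting this from another client, the idea looks roughly like this with the Python 3 standard library. The exact header value "application/json; charset=utf-8" is an assumption on my part; check the Bulbs source for what it actually sends. The request is only built here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

name = u"universit\u00e9 de montr\u00e9al"

# Index lookup URL from the debug output in the question.
base = "http://localhost:8182/graphs/emptygraph/indices/university"

# urlencode percent-encodes as UTF-8 by default and writes spaces as '+'.
query = urlencode({"key": "name", "value": name})

# Declare the charset explicitly so the server does not fall back to its
# default (ISO-8859-1 for Grizzly, per the explanation above).
request = Request(
    base + "?" + query,
    headers={"Content-Type": "application/json; charset=utf-8"},
)

print(request.get_method(), request.full_url)
print(request.get_header("Content-type"))
```

Passing the request to urllib.request.urlopen would perform the lookup against a running Rexster instance.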